McSc: Motion-Corrective Preference Alignment for Video Generation with Self-Critic Hierarchical Reasoning

Tongyi Lab, Alibaba
Qualitative comparison
Visual comparison of videos generated by the baselines and the proposed method. Our McSc generates videos with larger motion dynamics and stronger semantic alignment.

Abstract

Text-to-video (T2V) generation has achieved remarkable progress in producing high-quality videos aligned with textual prompts. However, aligning synthesized videos with nuanced human preferences remains challenging due to the subjective and multifaceted nature of human judgment. Existing video preference alignment methods rely on costly human annotations or utilize proxy metrics to predict preference, which lack an understanding of the logic behind human judgments. Moreover, they usually align T2V models directly with the overall preference distribution, ignoring potentially conflicting dimensions such as motion dynamics and visual quality, which may bias models toward low-motion content. To address these issues, we present Motion-corrective alignment with Self-critic hierarchical Reasoning (McSc), a three-stage reinforcement learning framework for robust preference modeling and alignment. First, Self-critic Dimensional Reasoning (ScDR) trains a generative reward model (RM) to decompose preferences into per-dimension assessments, using self-critic reasoning chains for reliable learning. Second, to achieve holistic video comparison, we introduce Hierarchical Comparative Reasoning (HCR) for structured multi-dimensional reasoning with hierarchical reward supervision. Finally, using RM-preferred videos, we propose Motion-corrective Direct Preference Optimization (McDPO) to optimize T2V models while dynamically re-weighting the alignment objective to mitigate bias toward low-motion content. Experiments show that McSc achieves superior performance in human preference alignment and generates videos with rich motion dynamics.

Motivation

Motivation overview

This work aims to address the challenge of aligning text-to-video generation models with human preferences. Traditional preference alignment methods rely heavily on large-scale human-annotated preference data, which is both time-consuming and labor-intensive. To mitigate this, recent studies have proposed a two-stage framework: preference prediction followed by preference alignment. In this framework, a preference model is first used to automatically estimate human preferences for generated videos, and the resulting pseudo-preferences are then leveraged to optimize the generative model. Building upon this paradigm, our work introduces a novel preference prediction approach, ScHR, comprising two sequential training stages, ScDR and HCR, followed by a new strategy for preference alignment.

Specifically, to improve both preference prediction and alignment, we draw inspiration from the way humans naturally evaluate video quality. Humans typically decompose their assessment into multiple dimensions (e.g., motion dynamics, visual quality, temporal consistency), evaluate each dimension independently, and then integrate these multi-dimensional judgments into an overall preference score. We design our training process to mimic this cognitive mechanism: the model first learns the reasoning behind human judgments along individual dimensions (ScDR), and then acquires the ability to synthesize cross-dimensional evaluations into a holistic preference (HCR).
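The decompose-then-aggregate evaluation described above can be sketched as a toy two-level comparison. The dimension names, weights, and sign-based verdict below are illustrative assumptions, not the actual scoring rules learned by ScDR and HCR:

```python
def compare_videos(scores_a, scores_b, weights):
    """Toy two-level preference: per-dimension margins first (mirroring
    ScDR's single-dimension assessments), then a weighted aggregate that
    decides the overall winner (mirroring HCR's holistic comparison).
    Dimension names and weights are hypothetical."""
    per_dim = {d: scores_a[d] - scores_b[d] for d in scores_a}
    overall = sum(weights.get(d, 1.0) * m for d, m in per_dim.items())
    return ("A" if overall > 0 else "B"), per_dim

# Example: video A moves more but looks slightly worse than video B.
winner, margins = compare_videos(
    {"motion_dynamics": 0.9, "visual_quality": 0.6, "temporal_consistency": 0.7},
    {"motion_dynamics": 0.3, "visual_quality": 0.8, "temporal_consistency": 0.7},
    weights={"motion_dynamics": 1.0, "visual_quality": 1.0, "temporal_consistency": 1.0},
)
```

With equal weights, video A's motion advantage outweighs its small visual-quality deficit, so A wins overall; this is exactly the kind of trade-off the analysis below examines.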

In the course of our analysis, we uncover an unexpected yet critical phenomenon: some of the dimensions involved in human preference annotations exhibit significant contradictions and negative correlations. Notably, we observe a strong negative correlation between motion dynamics and static visual quality dimensions: videos with lower motion dynamics tend to score higher on other static quality metrics, leading human annotators to rate them more favorably overall. However, using such preferences for model optimization biases the generator toward producing videos with reduced motion dynamics, ultimately compromising motion richness and diversity. To address this issue, we propose McDPO, a novel alignment strategy that explicitly mitigates this trade-off and encourages the generation of videos with enhanced motion dynamics.

Method

Method overview

We propose McSc for video generation, integrating human preference reasoning and preference alignment to synthesize videos that match estimated preferences. McSc contains three key steps: (1) ScDR trains a generative reward model with a self-critic strategy for single-dimension preference reasoning, (2) HCR performs holistic video assessment with structured reward mechanisms, and (3) McDPO optimizes the video generation model to synthesize diverse videos aligned with true human preferences by reducing evaluation-dimension bias.
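As an illustration of step (3), the sketch below shows one way a DPO-style objective could be re-weighted by a motion margin so that preference pairs won through reduced motion contribute less to the update. The sigmoid weighting, the `gamma` temperature, and the tensor shapes are assumptions for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def mcdpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
               motion_w, motion_l, beta=0.1, gamma=1.0):
    """Sketch of a motion-corrective DPO loss.

    logp_*     : policy log-likelihoods of the preferred (w) / rejected (l) video
    ref_logp_* : frozen reference-model log-likelihoods
    motion_*   : scalar motion-dynamics scores for each video (hypothetical input)
    """
    # Standard DPO implicit-reward margin between preferred and rejected videos.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Hypothetical motion-corrective weight: pairs where the preferred video
    # also has higher motion are up-weighted; pairs where the preference may
    # stem from reduced motion are down-weighted.
    weight = torch.sigmoid(gamma * (motion_w - motion_l))
    return -(weight * F.logsigmoid(margin)).mean()
```

Under this weighting, a pair whose winner is also the higher-motion video pulls the policy harder than a pair whose winner is the lower-motion one, which is the qualitative behavior the McDPO re-weighting aims for.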

Generated Videos

Baseline

"In a still frame, a stop sign"

SFT

"In a still frame, a stop sign"

McSc

"In a still frame, a stop sign"

Baseline

"Macro slo-mo. Slow motion cropped closeup of roasted coffee beans falling into an empty bowl"

SFT

"Macro slo-mo. Slow motion cropped closeup of roasted coffee beans falling into an empty bowl"

McSc

"Macro slo-mo. Slow motion cropped closeup of roasted coffee beans falling into an empty bowl"

Baseline

"A drone view of celebration with Christmas tree and fireworks, starry sky - background."

SFT

"A drone view of celebration with Christmas tree and fireworks, starry sky - background."

McSc

"A drone view of celebration with Christmas tree and fireworks, starry sky - background."

Baseline

"a shark is swimming in the ocean, racking focus"

SFT

"a shark is swimming in the ocean, racking focus"

McSc

"a shark is swimming in the ocean, racking focus"

Baseline

"A panda drinking coffee in a cafe in Paris, animated style"

SFT

"A panda drinking coffee in a cafe in Paris, animated style"

McSc

"A panda drinking coffee in a cafe in Paris, animated style"

Baseline

"A steam train moving on a mountainside"

SFT

"A steam train moving on a mountainside"

McSc

"A steam train moving on a mountainside"

Baseline

"a truck slowing down to stop"

SFT

"a truck slowing down to stop"

McSc

"a truck slowing down to stop"

BibTeX

@article{arxiv_coming_soon,
  author  = {Qiushi Yang and Yingjie Chen and Yuan Yao and Yifang Men and Huaizhuo Liu and Miaomiao Cui},
  title   = {McSc: Motion-Corrective Preference Alignment for Video Generation with Self-Critic Hierarchical Reasoning},
  journal = {arXiv},
  year    = {2025},
}