Benchmarking Single-Factor Physical Video-to-Audio Generation
Tingle Li, Siddharth Gururani, Kevin J. Shih, Gantavya Bhatt, Sang-gil Lee, Zhifeng Kong, Arushi Goel, Gopala Anumanchipalli, Ming-Yu Liu
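AI summary: Introduces the FlatSounds benchmark, which evaluates the physical reasoning of video-to-audio models through controlled counterfactual pairs and single-video pattern tests. It finds that models rely on text captions rather than the visual stream, and that physical accuracy trades off against temporal alignment.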
Details
- Comments: CVPR 2026
Generative video-to-audio (V2A) models produce highly plausible soundtracks, but it remains unclear whether they capture the underlying physical processes. Existing evaluations emphasize perceptual realism and overlook physical correctness under controlled interventions. In this paper, we introduce FlatSounds, a benchmark that audits the physical reasoning of V2A models through: 1) controlled counterfactual pairs in which a single physical factor is varied, and 2) single-video pattern tests that probe internal consistency and directional trends. These settings test whether the generated audio correctly reflects specific physical properties and timings. Our evaluation of state-of-the-art models reveals a consistent trade-off: models rely more on text captions than on the visual stream to infer physics and semantics. Captions generally improve physical and semantic accuracy, but paradoxically degrade temporal alignment. Our results highlight the need to move beyond audio quality toward learning physical processes directly from pixels. Finally, we find that our physics-based metrics correlate strongly with human preference tests on our own data. Project webpage: https://research.nvidia.com/labs/cosmos-lab/flatsounds/