2606.00640 2026-06-02 cs.CV

An Attribute-Based Measure of Video Complexity

Aditya Sarkar, Yi Li, Zihao Wang, Jiacheng Cheng, Sai Vidyaranya Nuthalapati, Aashu Singh, Shlok Kumar Mishra, David Jacobs, Nuno Vasconcelos

Affiliations: UMIACS-University of Maryland College Park; University of California San Diego; Yale University; Meta AI

AI summary: Proposes the VideoABC framework, which quantizes an attribute space to estimate the probability that a video-LLM fails on a video-question pair, yielding a non-parametric complexity measure.

Abstract

A new framework for the estimation of the complexity posed by video-question pairs to video-LLMs, Video Attribute-Based Complexity (VideoABC), is proposed. Video complexity is defined as the probability of failure of a video-LLM for a given video-question pair. VideoABC is a non-parametric complexity measure, using a reference video dataset and a pre-defined vocabulary of video attributes informative of complexity, e.g., the scene complexity or the speed of the video event informative of the question. In a training phase, reference videos are projected into the space of these attributes, which is then quantized. The expected ABC of each quantization cell is then computed. Given a new video and its projection into the attribute space, complexity is estimated by the expected ABC of the associated quantization cell. To enable the use of VideoABC with small reference video datasets, two quantizers are combined: a k-means quantizer that enables accurate complexity estimates for samples in the distribution of the reference dataset and a universal lattice quantizer that guarantees generalization to out-of-distribution samples. A synthetic video generation procedure, inspired by target-distractor manipulations of psychophysics studies, is proposed to populate the cells of the lattice quantizer during training, enabling the computation of their expected ABCs. Experimental results show that VideoABC is effective even with very low-dimensional attribute representations, substantially outperforming approaches like "video-LLM as judge" with much less complexity. Finally, the explainable nature of the VideoABC score, in terms of well-defined attributes, is shown to provide insights on how the attribute composition of benchmarks affects their complexity.
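The cell-wise estimate described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the uniform lattice with a fixed `step`, the function names, the toy attribute vectors, and the fallback to the global failure rate for empty cells are all assumptions made for illustration (the abstract instead populates lattice cells with synthetic videos).

```python
from collections import defaultdict


def cell_of(attributes, step=0.25):
    """Map an attribute vector to a lattice cell by uniform flooring (assumed lattice)."""
    return tuple(int(a // step) for a in attributes)


def fit_expected_failure(reference, step=0.25):
    """Compute per-cell failure rates from (attribute_vector, failed) pairs.

    `failed` is 1 if the video-LLM answered wrongly, else 0.
    Returns the per-cell rates and the global failure rate.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [failures, total]
    for attributes, failed in reference:
        tally = counts[cell_of(attributes, step)]
        tally[0] += failed
        tally[1] += 1
    rates = {cell: f / n for cell, (f, n) in counts.items()}
    total = sum(n for _, n in counts.values())
    global_rate = sum(f for f, _ in counts.values()) / total
    return rates, global_rate


def estimate_complexity(attributes, rates, global_rate, step=0.25):
    """Look up the expected failure rate of the sample's cell.

    Falling back to the global rate for unseen cells is an assumption of this sketch.
    """
    return rates.get(cell_of(attributes, step), global_rate)


# Hypothetical reference set: two attributes in [0, 1] and a failure flag.
reference = [
    ((0.10, 0.20), 0),
    ((0.15, 0.22), 0),
    ((0.80, 0.90), 1),
    ((0.85, 0.95), 1),
    ((0.82, 0.88), 0),
]
rates, global_rate = fit_expected_failure(reference)
print(estimate_complexity((0.90, 0.90), rates, global_rate))
```

A k-means quantizer would replace `cell_of` with a nearest-centroid lookup; the per-cell averaging stays the same.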

2606.00637 2026-06-02 cs.RO

Global-Local Attention Decomposition for Terrain Encoding in Humanoid Perceptive Locomotion

Shengcheng Fu, Yang Zhang, Zhanxiang Cao, Liyun Yan, Yizhi Chen, Yunpeng Yin, Yue Gao

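Affiliations: Tongji University; Shanghai Innovation Institute; Shanghai Jiao Tong University; Humanoid Robot (Shanghai) Co., Ltd.

AI summary: Proposes Global-Local Attention Decomposition (GLAD), a coarse-to-fine encoder that separates global terrain awareness from local foothold selection, enabling robust humanoid locomotion on sparse-foothold terrain and in constrained environments.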

Abstract

Although reinforcement learning has significantly advanced humanoid locomotion, perceptive policies still struggle on sparse-foothold terrain and constrained environments. Success in these scenarios requires both broad terrain awareness and precise foothold selection, two perceptual roles that conventional encoders often entangle. To address this challenge, we propose Global-Local Attention Decomposition (GLAD) for terrain encoding in humanoid locomotion. Realized by a coarse-to-fine encoder over a robot-centric elevation map, GLAD explicitly separates these objectives: a global attention branch utilizes attention pooling to summarize the surrounding terrain context, while a state-conditioned local attention branch sparsifies and encodes precise foothold-relevant geometry. This explicit attention decomposition prevents the dilution of fine-grained spatial cues while reducing training overhead. Experiments demonstrate that GLAD enables reliable locomotion over challenging gaps, stepping stones, and stairs. Furthermore, the learned policy exhibits emergent terrain-responsive behaviors, autonomously following narrow paths and avoiding obstacles under simple velocity commands without explicit navigation planners. In real-world deployment on a Unitree G1 humanoid robot using onboard LiDAR, the proposed method achieves robust zero-shot sim-to-real transfer across diverse sparse-foothold and obstacle-rich domains.

2606.00635 2026-06-02 cs.LG

How Neural Losses Shape VAE Latents

Giorgio Strano, Luca Cerovaz, Michele Mancusi, Tommaso Mencattini, Emanuele Rodolà

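Affiliations: Sapienza University of Rome; Paradigma, Inc.; Moises Systems, Inc.; EPFL

AI summary: Studies how neural reconstruction losses, such as perceptual and adversarial losses, change the rate-distortion problem of VAEs, showing that they reduce the information stored in the latent representations and alter the geometry of the latent space, making representations more isotropic and spreading uncertainty more evenly.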

Abstract

Modern VAEs are rarely trained with the pointwise likelihood implied by the standard $\beta$-VAE objective. In practice, pointwise reconstruction is often combined with perceptual and adversarial losses, despite a lack of understanding of how this changes the latent dynamics of the model. We show that the choice of reconstruction loss reshapes the rate-distortion problem itself, altering both the information content and the geometry of the learned latent space in ways that may be invisible from reconstructions alone. First, we prove and verify empirically that augmenting pointwise reconstruction with neural terms, such as perceptual and adversarial objectives, reduces the amount of information stored in the latent representations. Second, we show that neural reconstruction losses systematically change the geometry of the latent space: they make representations more isotropic and distribute uncertainty more evenly across latent dimensions, producing different posterior variance profiles. These findings highlight how the rate-distortion tradeoff is not a comprehensive lens to understand the behavior of VAEs, and we propose a more mechanistic approach to investigate how the choice of a distortion metric reshapes the optimization problem.
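For reference, the standard $\beta$-VAE objective mentioned in the abstract, and one common way of augmenting it with neural terms, can be written as follows. The encoder $q_\phi$, decoder $p_\theta$, reconstruction $\hat{x}$, distortion terms $d_{\text{pix}}$ and $d_{\text{perc}}$, adversarial term $\mathcal{L}_{\text{adv}}$, and weights $\lambda$ are notation assumed here for illustration; the abstract does not give the paper's exact formulation.

```latex
\begin{align}
\mathcal{L}_{\beta\text{-VAE}}(x)
  &= \mathbb{E}_{q_\phi(z \mid x)}\!\left[-\log p_\theta(x \mid z)\right]
   + \beta\, D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right), \\
\mathcal{L}_{\text{neural}}(x)
  &= \mathbb{E}_{q_\phi(z \mid x)}\!\left[
       \lambda_{\text{pix}}\, d_{\text{pix}}(x, \hat{x})
     + \lambda_{\text{perc}}\, d_{\text{perc}}(x, \hat{x})
     + \lambda_{\text{adv}}\, \mathcal{L}_{\text{adv}}(\hat{x})
     \right]
   + \beta\, D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right).
\end{align}
```

In the rate-distortion reading, the KL term plays the role of the rate and the reconstruction term plays the role of the distortion, so swapping the distortion changes the optimization problem and not only the appearance of the reconstructions.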

2606.00634 2026-06-02 cs.CL cs.LG

French parsing enhanced with a word clustering method based on a syntactic lexicon

Anthony Sigogne, Matthieu Constant, Eric Laporte

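Affiliations: Université Paris-Est; LIGM

AI summary: Integrates data from a French syntactic lexicon, the Lexicon-Grammar, into a probabilistic parser and applies a clustering method to the verbs of the French treebank, improving the performance of parsing based on probabilistic context-free grammars.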

Journal ref Second Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL), 2011, Dublin, Ireland, pp.22-27

2606.00630 2026-06-02 cs.CV stat.ML

Looped Transformers with Layer Normalization Provably Learn the Power Method

Lyumin Wu, Chenyang Zhang, Yuan Cao

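Affiliations: School of Computing & Data Science, The University of Hong Kong

AI summary: Using principal component prediction as a testbed, proves that a looped linear transformer with layer normalization trained by gradient descent converges to a solution that implements the power method, revealing an algorithmic implicit bias induced by layer normalization.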

Comments 70 pages, 8 figures

Abstract

Transformers have achieved remarkable success across a wide range of applications, and a growing body of work suggests that part of their strength comes from their ability to learn and execute algorithmic procedures. However, our understanding of how transformers learn such algorithms remains limited, especially in the presence of layer normalization (LN). In this work, we study principal component prediction as a concrete testbed for understanding the training dynamics of transformers with LN. We prove that a looped linear transformer with LN, trained by gradient descent, converges to a solution that implements the power method, with each self-attention layer performing one power iteration. Notably, the model is trained only for principal component prediction, rather than being explicitly supervised to implement the power method. Our finding thus reveals an "algorithmic implicit bias" of looped transformers with LN: principal-component prediction can in principle be achieved by many mechanisms, yet gradient descent selects one that realizes the power method. We further provide a concrete comparison between transformers with and without LN: even with layerwise guidance from power iterations, a transformer without LN cannot exactly learn the power method, whereas the corresponding transformer with LN can, leading to a provable performance gap in principal component prediction. Our results provide, to our knowledge, the first theoretical analysis of the training dynamics of looped and single-layer transformers with LN, and shed light on the role of LN in transformer models.
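For readers unfamiliar with the power method named in the abstract, here is a minimal sketch of the classical algorithm. It is not the transformer construction from the paper: the Euclidean normalization after each multiplication is simply the textbook step, and the abstract does not say how it maps onto LN. The matrix `cov`, the start vector, and the iteration count are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def power_method(m, v0, num_iters=50):
    """Approximate the top eigenvector of a symmetric PSD matrix.

    Each loop pass is one power iteration: multiply, then renormalize.
    The start vector must not be orthogonal to the top eigenvector.
    """
    v = normalize(v0)
    for _ in range(num_iters):
        v = normalize(matvec(m, v))
    return v


# Hypothetical 2x2 covariance with eigenvalues 3 and 1.
cov = [[2.0, 1.0], [1.0, 2.0]]
top = power_method(cov, [1.0, 0.0])
print(top)
```

The error shrinks by the eigenvalue ratio at every iteration (here 1/3), which is why a fixed number of looped passes can suffice when the spectral gap is large.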

2606.00602 2026-06-02 cs.CV

ASAP: Advancing Medical Volumetric Representation Learning with Anatomy-aware Semantically-adaptive Pre-training

Rongsheng Wang, Fenghe Tang, Zihang Jiang, Yingtai Li, Xu Zhang, Haoran Lai, Wenxin Ma, Wei Wei, Zhiyang He, Xiaodong Tao, Rui Yan, Qingsong Yao, Shaohua Kevin Zhou

Affiliations: School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China; Medical Imaging, Robotics, Analytic Computing & Learning (MIRACLE) Lab, YRD-RIGHT, USTC Suzhou Institute for Advanced Research; Jiangsu Provincial Key Laboratory of Multimodal Digital Twin Technology; Biomedical Basic Research Center (BBRC) of Jiangsu Province; Department of Radiology, The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, USTC; Anhui IFLYTEK CO., Ltd; School of Medicine, Stanford University; State Key Laboratory of Precision and Intelligent Chemistry, Hefei, Anhui, China

AI summary: Proposes the ASAP framework, which learns transferable and interpretable volumetric representations from chest CT scans and radiology reports through anatomy-aware knowledge injection and semantically-adaptive alignment and fusion, achieving state-of-the-art performance across 15 datasets and 22 downstream tasks.

Comments MICCAI2025 extension

Abstract

Learning transferable and interpretable representations from medical volumetric scans remains challenging due to complex anatomical structures and weak, heterogeneous supervision provided by radiology reports. In this paper, we propose Anatomy-aware Semantically-Adaptive Pre-training (ASAP), a principled vision-language pre-training framework for fine-grained medical volumetric representation learning from large-scale chest CT scans and their corresponding radiology reports. ASAP integrates three key components: (1) an anatomy-aware knowledge injection module that incorporates organ-level structural priors via an off-the-shelf segmentation tool to encourage anatomically coherent representations; (2) a semantically-adaptive selective alignment mechanism that dynamically associates sentence-level findings with localized volumetric regions; and (3) a semantically-adaptive fusion module for effective interaction between anatomically informed visual features and grounded textual cues under a dual-modal masked modeling paradigm. Beyond methodological contributions, we establish a comprehensive benchmark for medical volumetric vision-language pre-training on chest CT, covering 15 datasets and 22 downstream tasks spanning abnormality classification, segmentation, disease prognosis prediction, report generation, vocabulary classification, cross-modal retrieval and visual question answering. This benchmark provides standardized evaluation protocols to systematically assess representation quality under diverse clinical settings and data regimes. Extensive experiments demonstrate that ASAP consistently achieves state-of-the-art performance across tasks and datasets, with particularly pronounced gains under limited supervision and distribution shift, validating its effectiveness in learning transferable and clinically meaningful volumetric representations.

2606.00596 2026-06-02 cs.CL

Toward Responsible and Epistemically Grounded Multilingual LLMs for Computational Social Science and Humanities

Wajdi Zaghouani

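Affiliations: Northwestern University in Qatar

AI summary: Reconceptualizes multilingual reasoning LLMs as hermeneutic instruments and proposes a theoretically grounded framework for evaluating their cultural alignment, cross-lingual stability, and reasoning faithfulness in Social Sciences and Humanities research.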

Journal ref Proceedings of LLMs4SSH Workshop at LREC 2026, Palma de Mallorca, Spain, 2026

Abstract

Large language models have rapidly evolved in multilingual competence and reasoning capacity, enabling their integration into Social Sciences and Humanities research workflows. Yet existing evaluation paradigms remain anchored in task-based NLP benchmarks and fail to address interpretive validity, cultural situatedness, and epistemic mediation. This paper reconceptualizes multilingual reasoning LLMs as hermeneutic instruments that actively structure meaning production across linguistic and cultural contexts. Drawing on hermeneutics, philosophy of technology, science and technology studies, multilingual NLP research, and computational social science methodology, we develop a theoretically grounded framework for evaluating multilingual reasoning in Social Sciences and Humanities (SSH) research. We articulate a rigorous experimental protocol with operationalized metrics for cultural alignment, cross-lingual stability, and reasoning faithfulness, along with transparency requirements tailored to interpretive research tasks. We illustrate the framework through a concrete application scenario involving multilingual political discourse analysis. The paper contributes a conceptual and methodological foundation for responsible integration of multilingual reasoning LLMs into computational social science infrastructures.

2606.00593 2026-06-02 cs.CL cs.AI

SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering

Qiming Shi, Zhaolu Kang, Yunfan Zhou, Di Weng, Yingcai Wu

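Affiliations: State Key Lab of CAD&CG, Zhejiang University; School of Software and Microelectronics, Peking University; School of Software Technology, Zhejiang University

AI summary: Proposes SPADER, a reinforcement learning framework that uses a Step-wise Peer Advantage (SPA) mechanism and a diversity-aware exploration reward to address fine-grained credit assignment and sustained exploration in long-horizon tool use for Multi-Answer QA; experiments show improved recall and F1 on several datasets.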

Abstract

Large language models are increasingly deployed as tool-augmented agents to acquire information beyond parametric knowledge. While recent work has improved long-horizon tool-use reasoning, most approaches focus on tasks with a single correct answer. In contrast, many real-world queries require discovering a comprehensive set of valid answers, a setting known as Multi-Answer QA. This setting raises two challenges: fine-grained credit assignment over long search trajectories and reward alignment for sustained exploration beyond easy high-frequency entities. We propose SPADER, a reinforcement learning framework for long-horizon tool use in Multi-Answer QA. SPADER includes Step-wise Peer Advantage (SPA), a critic-free step-level credit assignment mechanism that aligns parallel trajectories by decision step and estimates advantages from peer returns. It also includes a diversity-aware exploration reward that promotes long-tail entity discovery by upweighting rare findings and downweighting redundant ones. Experiments on QAMPARI, Mintaka, WebQSP, and QUEST show that SPADER generally improves recall and overall F1 over prompting-based agents, outcome-supervised RL methods, and recent step-level supervision approaches. Our code and model weights are available at https://github.com/KhanCold/spader.
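The two ideas in the abstract, peer-based step-level advantages and a reward that favors rare findings, can be sketched roughly as below. This is an illustration under stated assumptions, not the released SPADER code: the leave-one-out mean baseline, the inverse-frequency weighting, and all function names and toy data are hypothetical choices.

```python
from collections import Counter


def stepwise_peer_advantages(returns):
    """Critic-free advantages from peers aligned by decision step.

    returns[i][t] is the return of trajectory i at step t. The advantage is
    the trajectory's own return minus the mean return of the other
    trajectories that reached step t (0.0 when no peer reached it).
    """
    advantages = []
    for i, traj in enumerate(returns):
        row = []
        for t, value in enumerate(traj):
            peers = [other[t] for j, other in enumerate(returns)
                     if j != i and t < len(other)]
            baseline = sum(peers) / len(peers) if peers else value
            row.append(value - baseline)
        advantages.append(row)
    return advantages


def diversity_reward(found, discovery_counts):
    """Weight each distinct answer by 1 / (number of rollouts that found it).

    Inverse-frequency weighting is an assumption of this sketch.
    """
    return sum(1.0 / discovery_counts[answer] for answer in set(found))


# Hypothetical parallel rollouts of different lengths.
returns = [[1.0, 2.0], [3.0, 4.0], [2.0]]
advantages = stepwise_peer_advantages(returns)

# Hypothetical answer sets found by three rollouts for one question.
rollouts = [["Paris", "Lyon"], ["Paris"], ["Paris", "Nice"]]
counts = Counter(answer for found in rollouts for answer in set(found))
rewards = [diversity_reward(found, counts) for found in rollouts]
print(advantages, rewards)
```

In this toy case the answer found by every rollout contributes only 1/3 to each reward, while an answer found by a single rollout contributes a full 1.0, so long-tail discoveries dominate the signal.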

2606.00592 2026-06-02 cs.CV

Through the PRISM: Principle-Aware, Interpretable, and Multi-Scale Evaluation of Visual Designs

Mona Gandhi, KJ Joseph, Srinivasan Parthasarathy, Sayan Nag

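Affiliations: Ohio State University; Adobe Research

AI summary: Proposes the PRISM benchmark and a multi-scale evaluation framework that use principle perturbations and hierarchical analysis to deliver interpretable assessments of design quality.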

Journal ref CVPR 2026 Findings

Abstract

Effective visual communication stems from the harmony of multiple design principles, such as readability, contrast, alignment, overlap, and coherence, which collectively govern clarity and intent of the communicator. While human designers reason holistically over these principles, machine agents typically condense them into a single heuristic score, offering limited interpretability and diagnostic precision. To address this gap, we introduce PRISM (PRinciple-aware, Interpretable, and Structure-guided Design Modifications), a benchmark that systematically perturbs professional layouts from the Crello dataset along measurable design principles. The benchmark comprises 100K perturbed training samples and 10K perturbed validation designs, each isolating a specific principle violation for controlled analysis of multimodal reasoning about design quality. We show that models like Qwen-2.5-VL and GPT-4o-mini are largely insensitive to targeted principle degradations, whereas GPT-4o exhibits global awareness without fine-grained disentanglement. Building on these insights, we propose a multi-scale evaluation framework that integrates lightweight scorers for quantitative assessment, instruction-tuned vision-language models for localised feedback, and prompt-based methods for global reasoning. Our framework provides interpretable explanations of design failures. Using these localised insights, we show targeted refinements that improve layout quality. Together, PRISM and our framework lay the foundation for interpretable design-literate multimodal reasoning systems.

2606.00583 2026-06-02 cs.CV cs.AI cs.LG cs.MM

Improving Visual Representation Alignment Generation with GRPO

Shentong Mo, Sukmin Yun

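Affiliations: Carnegie Mellon University; Hanyang University

AI summary: Proposes VRPO, which uses reinforcement learning to replace the static alignment loss with a generative representation policy optimization objective, dynamically balancing representation consistency and generation quality and achieving faster convergence and higher image fidelity in diffusion transformers.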

Abstract

Recent diffusion transformers have demonstrated strong image synthesis capabilities but remain inefficient to train due to weak alignment between generative and discriminative representations. While representation alignment frameworks such as REPA improve convergence by aligning noisy denoising features with pretrained visual encoders, their externally supervised alignment loss is static and lacks adaptivity during training and inference. Existing methods rely on fixed cosine alignment or contrastive objectives, which cannot dynamically balance representation consistency and generation quality, resulting in limited discriminative benefit and failing to optimize alignment in a task-adaptive manner. To address this, we propose VRPO, a reinforcement-based optimization strategy that replaces REPA's static alignment loss with a generative representation policy optimization objective. Instead of enforcing a fixed similarity constraint, VRPO treats representation alignment as a reward-guided process: the model receives adaptive rewards based on generation fidelity, perceptual quality, and semantic coherence between the diffusion features and pretrained visual embeddings. This formulation enables the generator to continuously refine its internal representations toward semantically meaningful directions while improving image quality. Our VRPO-driven training seamlessly integrates into diffusion transformers, introducing negligible computation cost and preserving full compatibility with SiT and DiT architectures. Extensive experiments on ImageNet-256x256 demonstrate that our VRPO-Alignment substantially enhances both convergence and fidelity, achieving up to +1.8 FID improvement and 2.3x faster training compared to REPA under identical compute budgets.
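The abstract does not spell out how VRPO turns its rewards into a training signal. Since the title refers to GRPO, the sketch below shows the group-relative normalization that GRPO is known for, applied to a combined reward. The equal weights, the use of the population standard deviation, the function names, and the toy scores are assumptions for illustration only.

```python
import statistics


def combined_reward(fidelity, perceptual, semantic, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three reward components named in the abstract.

    The weights are hypothetical; the abstract does not specify them.
    """
    return (weights[0] * fidelity
            + weights[1] * perceptual
            + weights[2] * semantic)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples (GRPO-style).

    Each sample is scored relative to the group mean and scaled by the
    group standard deviation, so no learned value function is needed.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical scores for three samples generated from the same condition.
group = [
    combined_reward(0.2, 0.3, 0.5),
    combined_reward(0.6, 0.7, 0.7),
    combined_reward(1.0, 1.0, 1.0),
]
advantages = group_relative_advantages(group)
print(advantages)
```

Because the advantages are centered within the group, samples are only rewarded for being better than their siblings, which keeps the signal adaptive as overall quality improves during training.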

2606.00582 2026-06-02 cs.AI

LASER: Loss-Aware Singular-value Decomposition and Rank Allocation for Efficient Low-Precision Vision-Language Models

Haiyu Wang, Yutong Wang, Leshu Li, Yihui Ren, Sai Qian Zhang

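Affiliations: Tandon School of Engineering, New York University; Courant Institute of Mathematical Sciences, New York University; Brookhaven National Laboratory

AI summary: Proposes the LASER framework, which combines loss-aware singular value decomposition, cross-layer rank allocation, and a hybrid quantization scheme to compress and accelerate vision-language models under low-precision inference.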

Abstract

Vision-language models (VLMs) deliver strong multimodal reasoning capabilities, but their large computational cost and high parameter counts make deployment challenging on resource-constrained devices. Low-rank decomposition has emerged as a promising compression technique, yet existing methods often optimize local matrix reconstruction error, rely on uniform or heuristic rank allocation, and focus mainly on attention projections while leaving feed-forward networks underexplored. In this paper, we propose LASER (Loss-Aware Singular-value dEcomposition and Rank allocation), a low-rank compression framework for efficient low-precision VLM inference. LASER derives a curvature-weighted SVD objective from a second-order approximation of the model loss and uses Kronecker-factored Fisher information to guide decomposition toward downstream performance rather than reconstruction alone. We further introduce a loss-aware cross-layer rank allocation strategy based on calibration gradients, enabling more effective parameter budgeting across layers. Finally, we extend low-rank compression to FFN layers through a hybrid scheme that combines SVD with quantization. The evaluation results show that LASER achieves more than $2.3\times$ decoding speedup over previous work while preserving strong accuracy under low-precision inference.
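As a generic sketch of how a curvature-weighted SVD objective can follow from a second-order loss approximation with Kronecker-factored curvature, consider the derivation below. The symbols are assumed notation: $W$ is an (out $\times$ in) weight matrix, $\widehat{W}$ its rank-$r$ replacement, $\Delta W = \widehat{W} - W$, $A$ (in $\times$ in) and $B$ (out $\times$ out) are symmetric positive definite Kronecker factors, $\operatorname{vec}$ stacks columns, and $[\cdot]_r$ is the rank-$r$ truncated SVD. The abstract does not confirm that LASER uses exactly this form.

```latex
\begin{align}
\Delta\mathcal{L} &\approx g^\top \delta + \tfrac{1}{2}\,\delta^\top F\,\delta,
  \qquad \delta = \operatorname{vec}(\Delta W), \\
F \approx A \otimes B \;&\Longrightarrow\;
  \delta^\top F\,\delta
  = \operatorname{tr}\!\big(\Delta W^\top B\,\Delta W\,A\big)
  = \big\| B^{1/2}\,\Delta W\,A^{1/2} \big\|_F^2, \\
\widehat{W} &= B^{-1/2}\,\big[\,B^{1/2}\, W\, A^{1/2}\,\big]_r\,A^{-1/2}.
\end{align}
```

The last line applies the Eckart-Young theorem to the whitened matrix $B^{1/2} W A^{1/2}$ and neglects the first-order term; with $A = B = I$ it reduces to the plain truncated SVD that minimizes reconstruction error alone.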

2606.00572 2026-06-02 cs.LG

Spatiotemporal Multi-Task Graph Transformer for Trip-Level Transit Prediction

Oluwaleke Yusuf, Adil Rasheed, Frank Lindseth

Affiliations: Department of Engineering Cybernetics, Norwegian University of Science and Technology (NTNU); Department of Computer Science, Norwegian University of Science and Technology (NTNU)

AI summary: Proposes SMT-GraphFormer, a spatiotemporal multi-task graph transformer that frames trip-level transit prediction as a sequence-to-sequence problem; using graph embeddings, a context encoder, and a multi-gate mixture-of-experts module, it outperforms stop-level baselines on bus transit data from Trondheim, Norway.

Comments 25 pages, 7 figures, 11 tables, including appendix. Code available at https://github.com/Outsiders17711/SMTGraphFormer

Abstract

Passenger count data from public transit systems reveals urban mobility patterns and is essential for planning, operation, and optimisation. However, non-linear spatiotemporal interdependencies across stops and lines make modelling and prediction challenging. Existing approaches often rely on fixed temporal, spatial, or stop-level formulations, limiting their ability to capture within-trip evolution and network context. This study proposes SMT-GraphFormer, a spatiotemporal multi-task graph transformer that frames trip-level transit prediction as sequence-to-sequence modelling. Given a line's stop sequence and trip-level context, the model predicts successive boarding and alighting counts, with delay and dwell time treated as encoder-side surrogate tasks. Key components include graph embeddings for multi-relational stop similarity, a context encoder for weather and temporal information, and a multi-gate mixture-of-experts module that produces task-specific decoder representations for boarding and alighting predictions. Evaluation on public bus transit data from Trondheim, Norway, shows that SMT-GraphFormer outperforms stop-level tabular benchmarks, with ablation studies examining each component's contribution. The sequential formulation yields substantial gains on alighting prediction ($+$0.24 in $R^2$) and consistent improvements on boarding, delay, and dwell, confirming the value of explicit trip-level sequential bias and inter-target dependencies. These findings demonstrate the potential of transformer-based sequence modelling for capturing complex spatiotemporal dynamics in public transit and underscore the value of architectures tailored to transit data rather than off-the-shelf tabular models. The proposed framework provides a horizon-agnostic basis for scenario analysis in digital twin environments, supporting informed decision-making by planners and transit operators.

Demystifying the Optimal Fair Classifier in Multi-Class Classification

MESA: Improving MoE Safety Alignment via Decentralized Expertise

LinguIUTics at PsyDefDetect: Iterative Imbalance-Aware Fine-tuning of Qwen3-8B for Psychological Defense Mechanism Classification

Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs

An Attribute-Based Measure of Video Complexity

Global-Local Attention Decomposition for Terrain Encoding in Humanoid Perceptive Locomotion

How Neural Losses Shape VAE Latents

French parsing enhanced with a word clustering method based on a syntactic lexicon

A Systematic Benchmark of Intraoperative Ultrasound-to-MR Synthesis for Brain Tumour Surgery

Quality Audio Prototyping: a prototype system for unified sound retrieval and procedural generation

Robust Reasoning via Dynamic Token Selection for Distribution-Aligned Self-Distillation

MM-Snowball: Evaluating and Mitigating Hallucination Snowballing in Multimodal Multi-Turn Dialogue

FlowNar: Scalable Streaming Narration for Long-Form Videos

MemPro: Agentic Memory Systems as Evolvable Programs

Efficient Test-time Inference for Generative Planning Models

Linguistics-Aware Non-Distortionary LLM Watermarking

TRACE: Trajectory Risk-Aware Compression for Long-Horizon Agent Safety

CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts

FiSeR: Fine-Grained Source Representations for Cross-Domain AI Image Detection

Looped Transformers with Layer Normalization Provably Learn the Power Method

ASAP: Advancing Medical Volumetric Representation Learning with Anatomy-aware Semantically-adaptive Pre-training

Toward Responsible and Epistemically Grounded Multilingual LLMs for Computational Social Science and Humanities

SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering

Through the PRISM: Principle-Aware, Interpretable, and Multi-Scale Evaluation of Visual Designs

Improving Visual Representation Alignment Generation with GRPO

PropLLM: Propagation-Aware Scene Reconstruction for Network Fault Diagnosis

Sandboxed Coding Agents are Competitive Omni-modal Task Solvers

Dynamic Resilient Spatio-Semantic Memory with Hybrid Localization for Mobile Manipulation

LASER: Loss-Aware Singular-value Decomposition and Rank Allocation for Efficient Low-Precision Vision-Language Models

Spatiotemporal Multi-Task Graph Transformer for Trip-Level Transit Prediction