arXivDaily: daily arXiv academic digest, updated Monday through Friday
2605.28269 2026-05-28 cs.LG stat.ME

Dynamic Topic Modeling with a Higher-Order Hypergraphical Representation

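Hanjia Gao, Hanwen Ye, Qing Nie, Annie Qu

AI summary: Traditional topic models ignore higher-order interactions among words and semantic overlap in dynamic corpora. This paper represents text as a hypergraph and builds a dynamic topic modeling framework on it via structured low-rank factorization and temporal regularization, with theoretical convergence guarantees and error bounds; experiments show improvements over existing models.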
Comments
34 pages, 4 figures
Abstract

Dynamic topic modeling is widely used to analyze evolving trends in scientific literature, medical records, and social media. Traditional topic models represent each topic through a single probability vector on the multinomial simplex and implicitly couple word occurrence and repetition within one probabilistic mechanism. However, this formulation restricts the dependence structure among words and overlooks informative higher-order interactions, particularly in dynamic corpora with overlapping semantics. To address these limitations, we introduce a hypergraph representation of text where each document is modeled as a hyperedge connecting all co-occurring words, with repetition intensities encoded as node weights. This representation naturally separates word occurrence from repetition and induces a novel hypergraph-based multinomial distribution with a nonlinear normalization depending on the observed word set of each document. Building on this likelihood, we develop a dynamic topic modeling framework via structured low-rank factorizations with explicit temporal regularization on topic-word profiles. Moreover, we establish local convergence guarantees and derive non-asymptotic error bounds despite the intrinsic nonconvexity induced by bilinear factorization and document-specific nonlinear normalization. Numerical experiments on synthetic data and an application to the International Conference on Learning Representations (ICLR) corpus demonstrate consistent improvements over existing multinomial-based topic models.
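As a rough illustration of the representation described in this abstract, the sketch below turns one tokenized document into a hyperedge (the set of distinct words) and a table of node weights (how often each word repeats). The function name `document_to_hyperedge` and the toy document are assumptions made for illustration; the paper's hypergraph-based multinomial likelihood and its normalization are not reproduced here.

```python
from collections import Counter


def document_to_hyperedge(tokens):
    """Split a tokenized document into occurrence and repetition parts.

    Returns (hyperedge, node_weights): the hyperedge is the set of distinct
    words, and node_weights maps each word to its repetition count.
    """
    counts = Counter(tokens)
    hyperedge = frozenset(counts)   # which words occur at all
    node_weights = dict(counts)     # how strongly each word repeats
    return hyperedge, node_weights


# Hypothetical toy document, tokenized on whitespace.
edge, weights = document_to_hyperedge("topic model topic graph".split())
```

Keeping the word set and the counts in separate objects mirrors the separation of occurrence from repetition that the abstract emphasizes.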

2605.28267 2026-05-28 cs.LG stat.ML

Parameter-Efficient Generative Modeling with Controlled Vector Fields

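Peyman Morteza

AI summary: Proposes a continuous-time generative modeling framework, motivated by the Chow-Rashevskii theorem, that builds expressive flows from a small number of fixed vector fields and learned scalar controls, giving parameter-efficient distribution transport.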
Abstract

We introduce a continuous-time generative modeling framework, motivated by the Chow-Rashevskii theorem, that builds expressive flows from a small set of fixed vector fields and learned scalar controls. Instead of learning an unconstrained high-dimensional vector field, our framework constructs the velocity by modulating fixed vector fields with learned scalar control functions. When the fixed fields are bracket-generating, their Lie algebra spans the ambient space, providing a mechanism for expressive transport with only a small number of learned control channels and offering a parameter-efficient geometric alternative to standard vector-field parameterizations. This decoupled formulation yields a structured and interpretable generative model in which the number of learned scalar output channels can be chosen independently of the ambient dimension. We formulate an expressivity principle showing that, under suitable controllability and well-posedness assumptions, such controlled flows can transport a source distribution to a target distribution. We train the resulting model using a continuous-normalizing-flow likelihood objective and present proof-of-concept experiments on synthetic distributions.
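To make the bracket-generating condition concrete, here is a small numerical sketch using the standard Heisenberg fields on R^3, a textbook example that is not taken from the paper. Two fixed fields `X1` and `X2` do not span R^3 by themselves, but together with their Lie bracket they do, so two scalar controls are enough to steer in three dimensions. The constant controls passed to `velocity` stand in for the learned control functions and are purely hypothetical.

```python
import numpy as np


def X1(p):
    """First fixed field: d/dx - (y/2) d/dz."""
    x, y, z = p
    return np.array([1.0, 0.0, -y / 2.0])


def X2(p):
    """Second fixed field: d/dy + (x/2) d/dz."""
    x, y, z = p
    return np.array([0.0, 1.0, x / 2.0])


def jacobian(field, p, eps=1e-6):
    """Central-difference Jacobian of a vector field at point p."""
    p = np.asarray(p, dtype=float)
    cols = []
    for j in range(p.size):
        step = np.zeros_like(p)
        step[j] = eps
        cols.append((field(p + step) - field(p - step)) / (2 * eps))
    return np.stack(cols, axis=1)


def lie_bracket(f, g, p):
    """Lie bracket [f, g](p) = Dg(p) f(p) - Df(p) g(p)."""
    return jacobian(g, p) @ f(p) - jacobian(f, p) @ g(p)


def velocity(p, controls):
    """Controlled velocity u1 * X1(p) + u2 * X2(p) with two scalar controls."""
    u1, u2 = controls
    return u1 * X1(p) + u2 * X2(p)


point = np.array([0.3, -0.7, 1.2])
bracket = lie_bracket(X1, X2, point)  # analytically equal to d/dz
span_rank = np.linalg.matrix_rank(
    np.stack([X1(point), X2(point), bracket]), tol=1e-6
)
```

In the framework described above, the controls would presumably be outputs of a trained network rather than constants, while the fields stay fixed.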

2605.28264 2026-05-28 cs.AI

Entropy Distribution as a Fingerprint for Hallucinations in Generative Models

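Mattia J. Villani, Pranav Deshpande, Akshay Seshadri, Romina Yalovetzky, Niraj Kumar

AI summary: Proposes the Calibrated Entropy Score (CES), which uses the distribution of token-level entropies rather than the mean alone. CES detects hallucinations with a single forward pass and black-box access to logits, and comes with theoretical guarantees and empirical validation.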
Abstract

Large Language Models (LLMs) often generate factually incorrect outputs, commonly termed hallucinations, that undermine trust and limit deployment in high-stakes settings. Existing hallucination detection methods typically require multiple forward passes, or access to model internals. In this work, we provide theoretical background and empirical evidence that the distribution of token-level entropies, beyond the mean captured by perplexity or length-normalised entropy, serves as a fingerprint of hallucination, with distributional shape and tail behaviour carrying independent signal. We formalize hallucination detection as a statistical hypothesis test and propose the Calibrated Entropy Score (CES), a lightweight algorithm requiring only a single forward pass and black-box access to token logits. CES combines the mean signal with the maximum signal of the generated entropy through a calibrated reference CDF, producing scores that are directly comparable across models and tasks. We establish finite-sample calibration guarantees via a novel random-length Dvoretzky--Kiefer--Wolfowitz inequality, and also prove that CES detects hallucinations with probability converging to one exponentially fast in the generation length. Across eight QA benchmarks and ten generator models spanning open-source and API access models, CES achieves the highest detection performance among all single-pass black-box methods while providing formal error guarantees that existing heuristics lack. Remarkably, CES is statistically indistinguishable from multi-sample methods that require far greater computational cost, closing the gap between lightweight and expensive detection and making it suitable for real-time, large-scale deployment.
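The ingredients named in this abstract (token-level entropies from logits, a mean signal, a maximum signal, and a calibrated reference CDF) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the actual CES formula, so the combination rule in `entropy_score` (taking the larger of the two calibrated values) and the reference arrays are hypothetical placeholders.

```python
import numpy as np


def token_entropies(logits):
    """Shannon entropy (in nats) of each row of a (tokens, vocab) logit array."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(np.exp(log_probs) * log_probs).sum(axis=1)


def empirical_cdf(reference, value):
    """Fraction of reference values that are <= value."""
    reference = np.sort(np.asarray(reference, dtype=float))
    return np.searchsorted(reference, value, side="right") / reference.size


def entropy_score(logits, ref_means, ref_maxes):
    """Hypothetical score: the larger of the calibrated mean and max signals.

    ref_means and ref_maxes are assumed to hold the mean and maximum token
    entropy of each generation in a held-out calibration set.
    """
    h = token_entropies(logits)
    return max(empirical_cdf(ref_means, h.mean()),
               empirical_cdf(ref_maxes, h.max()))
```

Because both signals pass through an empirical CDF, the result lies in [0, 1] whatever the vocabulary size, which is one plausible reading of why the abstract calls the scores comparable across models and tasks.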

2605.28261 2026-05-28 cs.CV

MORI-Seg: Learning Morphological Geometry for Instance Segmentation without Instance Annotations

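Leiyue Zhao, Tianyu Shi, Daniel Reisenbuchler, Xinzi He, Junchao Zhu, Tianyuan Yao, Yuechen Yang, Yanfan Zhu, Junlin Guo, Gelei Xu, Haichun Yang, Yuankai Huo, Mert R. Sabuncu, Yihe Yang, Ruining Deng

AI summary: Proposes MORI-Seg, a framework that learns morphology-aware geometric representations (object-centric distance fields and boundary-band representations) from semantic masks, together with a class-conditioned feature disentanglement module. It performs end-to-end instance segmentation under semantic-only supervision and improves instance separation in crowded and adherent regions.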
Abstract

Instance-level quantification of kidney functional units is essential for morphometric analysis, yet most publicly available pathology datasets provide only semantic segmentation annotations, where adjacent structures of the same class are merged into single regions. This prevents reliable instance-level analysis and limits downstream quantitative studies. Existing heuristic post-processing methods often yield suboptimal instance separation, particularly in crowded and adherent regions, while deep learning-based instance segmentation approaches typically require intensive instance-level annotations that are costly and labor-intensive to obtain. We propose MORI-Seg, a deep learning framework that enables instance segmentation without requiring instance-level annotations. Instead of heuristic splitting or instance supervision, MORI-Seg learns morphology-aware geometric representations directly from semantic masks by jointly modeling object-centric distance fields and boundary-band representations to encode interior structure and contact interfaces. A class-conditioned feature disentanglement module further promotes intra-instance coherence and inter-instance separation. Under semantic-only supervision, MORI-Seg decomposes connected semantic regions into distinct instance masks in an end-to-end manner. Experiments demonstrate improved instance separation accuracy and more reliable morphometric quantification compared with classical post-processing pipelines and representative semantic-to-instance learning approaches. The official implementation is publicly available at https://github.com/ddrrnn123/MORI-Seg.

2605.28258 2026-05-28 cs.SE cs.AI cs.CV cs.HC

GUI Agents for Continual Game Generation

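Yixu Huang, Bo Li, Na Li, Zhe Wang, Kaijie Chen, Haonan Ge, Qingyi Si, Yuanzhe Shen, Ruihan Yang, Guangjing Wang, Hongcheng Guo

AI summary: Uses GUI agents as objective evaluators and subjective playtesters, through the PlaytestArena environment and the Play2Code framework, to enable continual game generation and substantially improve playability.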
Abstract

Generating a game is not the same as making one that can be played. Despite advances in code generation, existing approaches treat game generation as one-shot translation from prompt to artifact, leaving interaction-level failures undetected. We argue that evaluating and improving game generation requires a player, and study two roles for graphical user interface (GUI) agents in this process: (1) as an objective evaluator, for which we introduce PlaytestArena, a new evaluation environment that pairs 200 browser-based game generation tasks across eight genres with rubrics of expected in-play behaviors, adjudicated by a GUI agent that loads each build in a browser and plays it; and (2) as a subjective playtester, for which we propose Play2Code, where a game agent and a GUI agent operate in a sustained loop with shared memory, turning game generation into a dialogue between coding and playing. Our experiments show that even frontier models struggle to generate playable games directly, while Play2Code achieves a 66.8% rubric pass rate, improving over single-pass and agentic-coding baselines by 37.1 and 14.6 points, respectively. Further analysis shows that GUI playtester feedback is more traceable than a human report, yet idiosyncratic in ways reminiscent of human testers, establishing game playtesting as a critical testbed for interactive code generation. Our project website is available at https://continual-game-generation.vercel.app/.

2605.28257 2026-05-28 cs.CV

Category-Level 3D Correspondence in Camera Space via Morphable Object Priors

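Leonhard Sommer, Artur Jesslen, Basavaraj Sunagad, Adam Kortylewski

AI summary: Learns a shared morphable object prior to predict, from a single image, 3D locations that are consistent across instances within a category, without explicit correspondence supervision, and sets a new state of the art on the new HouseCorr3D benchmark.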
Comments
14 pages, 4 figures. Data and code are publicly available at https://github.com/GenIntel/HouseCorr3D
Abstract

Understanding 3D objects from images is fundamental to robotics and AR/VR applications. While recent work has made progress in category-level pose estimation, current representations fail to capture the fine-grained semantics needed for reasoning about object parts, functions, and interactions. In this work, we study category-level 3D correspondence in camera space -- predicting, from a single image, 3D locations that remain consistent across instances within a category -- and show that it can emerge without explicit correspondence supervision by learning a shared morphable object prior. To enable research in this direction, we introduce HouseCorr3D, the first large-scale benchmark for monocular category-level 3D correspondence with 178k images across 50 household object categories, 280 unique instances, and 3D keypoint annotations directly on CAD models. Crucially, HouseCorr3D provides amodal correspondence labels for occluded regions and explicit symmetry annotations, addressing key limitations of existing datasets. We further propose Morpheus, a method that learns morphable category-level shape priors by disentangling canonical shape, deformation, and object pose. Through this shared canonical grounding, semantically meaningful 3D correspondences in camera space emerge implicitly. These emerging 3D correspondences set a new state of the art on HouseCorr3D, demonstrating that semantic 3D object understanding can arise without direct correspondence supervision. Data and code are publicly available at https://github.com/GenIntel/HouseCorr3D.

2605.28255 2026-05-28 cs.AI cs.CL cs.HC

AI, Take the Wheel: What Drives Delegation and Trust in Human-Computer Cooperative Question Answering?

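Maharshi Gor, Yoo Yeon Sung, Yu Hou, Eve Fleisig, Irene Ying, Tianyi Zhou, Jordan Boyd-Graber

AI summary: Through a question-answering game experiment, studies when and why humans delegate to AI or adopt its suggestions. Humans both under-rely on correct AI suggestions (3.9%) and over-rely on incorrect ones (1.7%), and are affected by confirmation bias; the authors recommend calibrated confidence, evidence-grounded explanations, and trust-refinement mechanisms to improve human-AI collaboration.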
Comments
Findings of the Association for Computational Linguistics, 2026
Abstract

AI systems are fallible, and humans can make mistakes in deciding whether to trust AI over their own judgment. Thus, improving human-AI collaboration requires understanding when, why, and how humans decide to rely on AI. We study two distinct reliance decisions: the delegation choice -- deciding when to let AI act autonomously without knowing its output, and the adoption choice -- evaluating AI suggestions and deciding how to use them. Both of these decoupled reliance patterns shape collaboration, but prior work rarely studies them together in realistic settings with the same users. We address this gap by studying collaborative human--AI teams competing in a question-answering game in which humans can choose when and how to work with AI agents to win. Our 24 matches pair 23 expert humans with 16 AI agents, capturing 387 delegation and 1440 adoption decisions. While human--AI collaboration performs better than either AI or humans alone, humans make suboptimal collaboration decisions, both under-relying on correct AI suggestions (3.9% of opportunities missed) and over-relying when AI misleads them (1.7%). Both parties contribute wrong answers: reported model confidence is near chance when humans and AI disagree, while confirmation bias drives higher under-reliance (64.5%) when an AI suggestion agrees with humans' initial incorrect answer. To close this gap, we recommend calibrated confidence, evidence-grounded explanations, and mechanisms that help users refine trust.

2605.28254 2026-05-28 cs.RO cs.SY eess.SY math.DS

Natural Locomotion: Principle and Method

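Mirado Mortel, Luc Jaulin, Lionel Lapierre, Simon Rohou

AI summary: Formulates natural locomotion as an exchange principle for systems whose motion is mediated by environmental constraints or interactions, constructs the Natural Locomotion Manifold (NLM) with a closed/open construction, and demonstrates the principle on ideal nonholonomic no-slip systems.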
Comments
Preprint. 20 pages, 7 figures
Abstract

Robotic locomotion can become efficient when mechanisms exploit passive dynamics, compliance, and resonance rather than track prescribed trajectories. This paper formulates natural locomotion as an exchange principle for systems whose motion is mediated by environmental constraints or interactions. A motion is natural when an internal oscillator returns periodically, the body pose drifts, and the mean Propulsion--Oscillator Exchange power (POE power) vanishes over one cycle. The selected family is a Natural Locomotion Manifold (NLM). We develop the conservative realization of this principle for continuous ideal environmental constraints: the constraints do no external work, total mechanical energy is conserved, and zero mean POE power is an internal exchange with the environment-mediated propulsive channel, not external energy input. The method is a closed/open construction. The propulsive channel is first closed to reveal an effective internal oscillator, organized by scalar action-angle structure in one effective degree of freedom or by nonlinear modal sectors in several degrees of freedom. The channel is then reopened, pose is reconstructed, and accepted cycles must preserve internal recurrence and zero mean POE power. We demonstrate the principle on two ideal nonholonomic no-slip systems: a Chaplygin-sleigh / pendulum-driven car and a three-body extension. In the scalar case, POE closure is equivalent to the missing internal return condition, giving a theorem-backed computation of the NLM family. In the multi-degree case, POE closure remains necessary but must be completed by modal identity, internal return, dynamics consistency, same fixed passive architecture, and nonzero displacement. Natural locomotion becomes a design question: which passive architectures support no, one, or several certified NLM families?

2605.28253 2026-05-28 cs.CL cs.DB cs.HC

Building Community-Centred NLP Resources for Puno Quechua

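Elwin Huaman, Adrian Gamarra Lafuente, Johanna Cordova, Anna Korhonen

AI summary: Collects 66 hours of speech data through participatory design, fine-tunes models including Whisper-base, establishes the first ASR benchmark for Puno Quechua, and releases all resources openly.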
Comments
Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP 2026), co-located with ACL 2026
Abstract

The preservation of under-resourced languages requires digital tools and resources shaped by and for their speakers. We present the first dedicated ASR resources for Puno Quechua (ISO 639-3: qxp): (1) the largest speech corpus for any single Quechua variety, consisting of 66 hours of recordings of scripted and spontaneous speech (including 36 hours of manually transcribed and validated data), collected via a participatory design campaign; (2) the first systematic ASR benchmark for Puno Quechua, evaluating state-of-the-art models and fine-tuning Whisper-base, wav2vec2-base, and XLS-R-300M, with and without continued pre-training (CPT); (3) an open release of all datasets and fine-tuned models.

2605.28251 2026-05-28 stat.ML cs.CY cs.LG

Counterfactually Fair Regression via Optimal Transport

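M. Generali Lince, S. Gaucher, J-J. Vie, P. Loiseau

AI summary: Adopts a causal uncertainty view in which counterfactual fairness is defined with resampled noise, proposes an optimal-transport-based post-processing estimator, and proves finite-sample fairness guarantees and risk bounds.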
Abstract

We consider the problem of learning a counterfactually fair regressor. We adopt a causal uncertainty view in which counterfactual fairness is defined with resampled noise. We focus on obtaining theoretical fairness guarantees for a new post-processing estimator. We begin by showing that counterfactual fairness is equivalent to satisfying demographic parity conditional on the latent variable. This allows us to provide a closed-form expression of the optimal fair regressor via a barycentric quantile map. In order to handle continuous latent variables, we propose a discretized post-processing method. Then, under mild regularity assumptions, we prove high-probability finite-sample fairness guarantees for our estimator, providing an unfairness decay at rate $\tilde O(n^{-1/3})$, and establishing a matching risk bound of order $\tilde O(n^{-1/3})$. We provide a matching lower bound on the excess risk of almost fair predictions. Finally, we extend our results to the setting of relaxed counterfactual fairness. We validate our approach on real-world and synthetic data.
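For readers unfamiliar with barycentric quantile maps, the sketch below shows the classical unconditional version used for demographic parity: each score is replaced by the weighted average of all groups' quantiles at that score's within-group rank. It is a simplification and not the paper's estimator, which conditions on a latent variable and discretizes it; the function name and the toy data are assumptions.

```python
import numpy as np


def barycentric_quantile_map(scores, groups):
    """Post-process scores so every group shares one output distribution.

    Each score is mapped to its within-group rank level, and then to the
    group-frequency-weighted average of all groups' quantiles at that level.
    """
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    labels, counts = np.unique(groups, return_counts=True)
    weights = counts / counts.sum()
    fair = np.empty_like(scores)
    for g in labels:
        mask = groups == g
        own = np.sort(scores[mask])
        # Within-group rank of each score, expressed as a level in (0, 1].
        levels = np.searchsorted(own, scores[mask], side="right") / own.size
        # Weighted average of every group's quantile at that level.
        fair[mask] = sum(
            w * np.quantile(scores[groups == h], levels)
            for h, w in zip(labels, weights)
        )
    return fair


# Hypothetical toy data: two groups whose raw scores differ by a shift of 10.
scores = np.array([0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0])
groups = np.array(["a"] * 4 + ["b"] * 4)
fair = barycentric_quantile_map(scores, groups)
```

On this toy input both groups end up with the same set of post-processed scores, which is the parity property; a version matching the abstract would presumably apply such a map within each bin of the discretized latent variable.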

2605.28247 2026-05-28 cs.LG cs.AI

IRDS: Interpretable RLVR Data Selection via Verifier-Coupled Sparse Autoencoder Coverage

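Yuhan Li, Mingxu Zhang, Dazhong Shen, Ying Sun

AI summary: Proposes IRDS, which uses sparse autoencoder clusters and a verifier-coupled coverage objective to select RLVR training instances that the model fails on but can still learn from, improving math reasoning accuracy while reducing compute cost.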
Comments
24 pages, 3 figures, 18 tables
Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing LLM reasoning, yet its data inefficiency remains a major bottleneck. Existing methods address this problem only partially, each missing at least one of subset-level coverage, verifier signal use, or interpretability. To address this gap, we present IRDS (Interpretable RLVR Data Selection), which selects RLVR training instances on a sparse autoencoder (SAE) cluster basis so the selection itself is auditable on recognizable problem motifs. To select instances the model both fails on and can still learn from, we introduce a verifier-coupled coverage objective on the SAE basis and solve it by greedy log-determinant maximization. Experiments on three instruction-tuned models and six math reasoning benchmarks show that IRDS achieves the highest overall accuracy, exceeding the strongest baseline by +3.9/+4.0 pp on the two Qwen models and by +0.5 pp on Llama-3.1-8B, while running an order of magnitude cheaper than the trajectory-based baseline.
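Greedy log-determinant maximization, the solver named in this abstract, can be sketched generically as below. The rows of `features` are assumed to be per-instance vectors on some basis (for example SAE-cluster activations, possibly pre-scaled by a verifier-derived weight); that coupling, the `ridge` term, and the function name are guesses for illustration and are not taken from the paper.

```python
import numpy as np


def greedy_logdet_selection(features, budget, ridge=1e-3):
    """Greedily pick rows maximizing log det(ridge * I + sum of v v^T).

    At each step the row whose outer product most increases the
    log-determinant is added, which favors directions not yet covered.
    """
    features = np.asarray(features, dtype=float)
    n, d = features.shape
    selected = []
    gram = ridge * np.eye(d)  # small ridge keeps the determinant positive
    for _ in range(min(budget, n)):
        best_idx, best_val = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            candidate = gram + np.outer(features[i], features[i])
            _, logdet = np.linalg.slogdet(candidate)
            if logdet > best_val:
                best_idx, best_val = i, logdet
        selected.append(best_idx)
        gram += np.outer(features[best_idx], features[best_idx])
    return selected


# Hypothetical toy input: two duplicate rows and one orthogonal row.
picked = greedy_logdet_selection([[1, 0], [1, 0], [0, 1]], budget=2)
```

With a budget of two, the selection skips the second duplicate in favor of the orthogonal row, which is the coverage behavior a log-determinant objective rewards.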

2605.28241 2026-05-28 cs.CV

PointQ-Bench: Benchmarking Diagnostic and Interpretable Point Cloud Quality Assessment

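Duanchu Wang, Cheng Li, Junjie Yang, Jing Huang, Zihang Cheng, Zhi Gao, ZhuBohong, Di Wang

AI summary: Proposes the PointQ-Bench benchmark, which extends point cloud quality assessment from scalar scoring to comprehensive quality understanding through anomaly sensing, defect diagnosis, usability grading, and open-ended quality reporting tasks, and reveals a gap between perception and diagnosis in current models.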
Abstract

Point cloud quality plays a critical role in 3D acquisition, reconstruction, rendering, and perception, yet existing point cloud quality assessment (PCQA) research remains largely centered on scalar score prediction. In practical inspection scenarios, quality assessment often involves identifying defects, characterizing dominant issue types, assessing downstream usability, and providing evidence-supported descriptions, which are not explicitly evaluated by current benchmarks. We introduce PointQ-Bench, a benchmark designed to extend PCQA from scalar scoring toward comprehensive quality understanding. PointQ-Bench consists of 3,083 point clouds spanning authentic scans, simulated distortions, and AI-generated content, covering eight major issue types. Each sample is annotated with mean opinion scores (MOS), quality levels, issue tags, expert-grounded descriptions, and 12,332 question-answer pairs. The benchmark supports three perception-oriented tasks: anomaly sensing, defect diagnosis, and usability grading, as well as a cognition-oriented task of open-ended quality reporting. To evaluate free-form quality descriptions, we further propose SSFRQ-5D, a five-dimensional evaluation protocol validated through human-AI agreement analysis. Extensive experiments on 14 vision-language models and traditional PCQA baselines reveal a consistent perception-diagnosis gap: while current models exhibit emerging abilities in coarse defect perception, they struggle with grounded diagnosis and quality calibration. Strong 2D MLLMs generally outperform existing 3D VLMs, and the benefit of additional views or point-level inputs is non-uniform, varying across tasks, data sources, and models, particularly under boundary-ambiguous conditions. Overall, PointQ-Bench provides a diagnostic testbed for advancing reliable and interpretable point cloud quality understanding.

2605.28239 2026-05-28 cs.CV

Learning to Label: A Reinforced Self-Evolving Framework for Semi-supervised Referring Expression Segmentation

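Runlong Cao, Ying Zang, Chuanwei Zhou, Tianrun Chen, Tong Zhang, Zhen Cui, Chunyan Xu

AI summary: Proposes the L2L framework, which uses reinforcement learning to turn pseudo-label construction into a learnable decision process and extracts priors with a multimodal large language model, jointly optimizing the segmentation model and pseudo-labels for semi-supervised referring expression segmentation.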
Comments
24 pages, 13 figures
Abstract

Semi-supervised referring expression segmentation (SS-RES) aims to achieve precise pixel-level language grounding under limited annotation, yet suffers from limited supervision and unreliable pseudo-labels when exploiting unlabeled image-text pairs. In this work, we propose Learning to Label, a reinforced self-evolving framework (L2L) that casts pseudo-label construction as a learnable decision-making process. To build foundational understanding, we leverage a multimodal large language model to extract semantic-spatial priors, which are instantiated as initial soft segmentation proposals and elevated, together with textual cues, into learnable guidance signals that condition a hierarchical segmentation network. To ensure stable learning, reinforced pseudo-label selection is formulated as an exploratory decision process that adaptively rewards high-utility pixel-level supervision based on multimodal priors and model predictions. This reinforced self-evolving loop enables joint optimization of the segmentation model and pseudo-labels, progressively enhancing label reliability under sparse supervision. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg demonstrate improvements over existing methods, validating its effectiveness and generalization.

2605.28237 2026-05-28 cs.RO cs.CV

POINav: Benchmarking and Enhancing Final-Meters Arrival in Real-World Vision-Language Navigation

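Ruiyan Gong, Meisheng Zhang, Yuxiang Zhao, Mingchao Sun, Yanfen Shen, Zedong Chu, Zhining Gu, Wei Guo, Xiaolong Cheng, Qiming Li, Kangning Niu, Yanqing Zhu, Xiaolong Wu, Tianlun Li, Mu Xu

AI summary: Targets the "final-meters" challenge of real-world POI navigation. Proposes POINav-Bench, the first closed-loop evaluation benchmark for this setting, and a Brain-Action framework trained with 70K real-world signage-entrance pairs, supporting high-fidelity evaluation of navigation agents.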
Comments
25 pages, 9 figures
Abstract

Real-world navigation is fundamentally driven by Points of Interest (POIs), yet reaching a precise POI remains a critical "final-meters" challenge. Existing Vision-Language Navigation (VLN) benchmarks for POI-goal navigation often suffer from coarse granularity or significant sim-to-real gaps due to generated scenes. To bridge this gap, we present POINav-Bench, the first benchmark designed for closed-loop evaluation of real-world POI-goal navigation. It comprises 11 commercial areas reconstructed from real-world captures using 3D Gaussian Splatting (3DGS), covering 126,398 $m^{2}$ in total and spanning 163 distinct POIs. With traversability-aware annotations and reference trajectories, POINav-Bench enables high-fidelity evaluation of navigation agents in realistic, POI-rich environments. Building on this, we propose the POINav Brain-Action Framework, in which a Brain module performs POI-grounded reasoning to guide an Action module in predicting continuous waypoints for real-world execution. We further curate the POINav-Dataset, containing 70K real-world signage-entrance pairs. Experiments show that our framework provides a viable path toward refining real-world POI-goal navigation.

2605.28234 2026-05-28 cs.CV

Bridging the Sampling Distribution Shift in Radio Map Estimation: A Trajectory-Aware Paradigm

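Feng Qiu, Zheng Fang, Shuhang Zhang, Kangjun Liu, Longkun Zou, Jing Liu, Ke Chen

AI summary: Addresses the performance drop caused by the mismatch between UAV trajectory sampling and random sampling, and proposes a trajectory-aware training paradigm based on Stochastic-Triggered Trajectory-Based Sampling that effectively reduces estimation error.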
Abstract

Learning-based radio map estimation (RME) plays a critical role in UAV-assisted wireless sensing, enabling tasks such as coverage prediction and network optimization. Most current methods assume an independently and identically distributed (i.i.d.) training and testing setting based on random sampling. However, practical UAV measurements are collected sequentially along feasible trajectories, resulting in highly structured and spatially correlated patterns. This mismatch introduces a sampling distribution shift that increases the intrinsic difficulty of spatial field recovery and compromises the generalization of models trained under i.i.d. assumptions. To mitigate this issue, we propose a trajectory-aware training paradigm based on Stochastic-Triggered Trajectory-Based Sampling (ST-TBS), which preserves trajectory continuity while introducing sampling variability. Moreover, from a statistical perspective, we show that trajectory-based sampling reduces spatial diversity and increases information redundancy compared to random sampling. Extensive experiments on the RadioMapSeer and SpectrumNet datasets demonstrate that models trained with random sampling suffer significant performance degradation under trajectory-based observations, with RMSE increasing from 0.0391 to 0.2632 on SpectrumNet. Conversely, our proposed ST-TBS method effectively reduces the RMSE to 0.0571. These results highlight the necessity of aligning training and deployment sampling distributions for reliable RME.

2605.28233 2026-05-28 stat.ML cs.CY cs.LG

Geometry of Relaxed Fair Regression: A Unified Framework for Aware and Unaware Settings

松弛公平回归的几何:统一感知与无感知设置框架

M. Generali Lince, V. Divol, R. Flamary, S. Gaucher, P. Loiseau

AI总结 本文通过最优传输理论统一了感知与无感知设置下的公平回归问题,提出了基于Wasserstein-2和全变差惩罚的算法,在松弛公平约束下实现准确预测。

详情
AI中文摘要

公平-准确权衡是部署公平感知机器学习方法的核心问题。当敏感属性在推理时不可用——即所谓的无感知设置时,在松弛公平约束下获得准确预测的原则性方法基本缺失。在这项工作中,我们通过将人口统计平价惩罚下的回归问题表述为最优传输问题来填补这一空白。我们的框架统一了感知和无感知设置,并通过最优传输映射刻画了在平方Wasserstein-2和全变差惩罚下的最优预测函数。这些结果表明,惩罚的选择反映了根本不同的公平哲学:Wasserstein惩罚诱导出平滑的、群体范围内的妥协,而全变差惩罚则对个体子集强制执行精确的平价。基于这些理论刻画,我们提出了一种易于实现、计算高效且在实际基准测试中始终匹配或超越最先进基线的算法。

英文摘要

Fairness-accuracy trade-offs are a central concern in the deployment of fairness-aware machine learning methods. When sensitive attributes are unavailable at inference time (the so-called unawareness setting), principled methods for obtaining accurate predictions under relaxed fairness constraints are largely missing. In this work, we address this gap by formulating regression under a demographic parity penalty as an optimal transport problem. Our framework unifies both the aware and unaware settings and characterizes optimal prediction functions via optimal transport maps, under both squared Wasserstein-2 and Total Variation penalties. These results reveal that the choice of penalty reflects fundamentally different fairness philosophies: the Wasserstein penalty induces a smooth, population-wide compromise, while Total Variation enforces exact parity for a subset of individuals. Building on these theoretical characterizations, we propose an algorithm that is simple to implement, computationally efficient, and consistently matches or outperforms state-of-the-art baselines on real-world benchmarks.
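
As a rough illustration of the squared Wasserstein-2 penalty named in the abstract, the sketch below computes it between two equal-sized one-dimensional samples of predictions, where it reduces to the mean squared difference of the sorted values. The function name and the toy group predictions are assumptions made for illustration; this is not the authors' algorithm.

```python
def squared_w2_1d(a, b):
    """Squared Wasserstein-2 distance between two equal-sized 1D empirical samples.

    In one dimension the optimal transport plan matches sorted values,
    so the distance is the mean squared gap between order statistics.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    sorted_a, sorted_b = sorted(a), sorted(b)
    return sum((x - y) ** 2 for x, y in zip(sorted_a, sorted_b)) / len(sorted_a)


# Hypothetical predictions for two sensitive groups.
preds_group_0 = [0.2, 0.4, 0.6]
preds_group_1 = [0.3, 0.5, 0.7]
penalty = squared_w2_1d(preds_group_0, preds_group_1)
```

A penalty of zero would mean the two groups receive identically distributed predictions, which is the demographic parity case; a relaxed constraint tolerates a small positive value.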

2605.28232 2026-05-28 cs.AI

PIRS: Physics-Informed Reward Shaping for SAC-Based Building Energy Management

PIRS:基于物理信息奖励塑形的SAC建筑能源管理

Shadmehr Zaregarizi, Khashayar Yavari

AI总结 针对深度强化学习中奖励函数设计缺乏物理基础的问题,提出PIRS方法,将ISO 7730 PMV公式嵌入SAC的多目标奖励中,提升可解释性和性能。

详情
Comments
N pages, 4 figures, 3 tables. Accepted at the 2nd Workshop on AI-Driven Energy Efficiency in Dynamic Systems (AI-DEEDS '26), co-located with ACM e-Energy / ACM Sustainability Week, Banff, AB, Canada, June 22-25, 2026
AI中文摘要

居住者舒适度和电网感知的能效是相互竞争的目标,其联合优化关键取决于深度强化学习(DRL)控制器中奖励函数的指定方式。然而,奖励设计在很大程度上仍然是临时的:舒适度项要么是手动调整的启发式规则,要么是简单的温度偏差代理,缺乏热舒适物理的明确基础。我们提出PIRS(物理信息奖励塑形),它在用于Soft Actor-Critic(SAC)的加权多目标奖励中,用ISO 7730预测平均投票(PMV)公式替代这些临时的舒适度代理。通过将舒适度信号锚定在ISO 7730 PMV公式中,PIRS提高了奖励的可解释性,并在不改变学习流程任何其他组件的情况下,提供了一个基于标准的舒适度代理。我们在CityLearn v2.1.2(2022年挑战赛第一阶段)中评估PIRS,使用一个中央SAC智能体在五个随机种子上训练50k步,并与基于规则的控制器(RBC)、手动设计的奖励(E2)、仅能量奖励(E3)和朴素温度偏差舒适度奖励(E4)进行比较。区域级关键绩效指标(KPI)以与RBC的比率报告显示,PIRS在成本、碳和电力指标上与手动基线相当,同时显著优于非物理基础的设计——特别是在负载爬坡(1.78倍 vs. ~2.4倍RBC)和日峰值需求方面。所有DRL策略在此训练预算下仍高于RBC;我们诚实地解释这一差距,并将PIRS定位为可解释、符合标准的奖励设计基础,而非在有限计算下优于经典控制的声明。

英文摘要

Occupant comfort and grid-aware energy efficiency are competing objectives whose joint optimization depends critically on how reward functions are specified in deep reinforcement learning (DRL) controllers for buildings. Yet reward design remains largely ad hoc: comfort terms are either hand-tuned heuristics or simple temperature-deviation proxies without explicit grounding in thermal-comfort physics. We present PIRS (Physics-Informed Reward Shaping), which replaces these ad-hoc comfort proxies with the ISO 7730 Predicted Mean Vote (PMV) formulation inside a weighted multi-objective reward for Soft Actor-Critic (SAC). By anchoring the comfort signal in the ISO 7730 PMV formulation, PIRS improves reward interpretability and provides a standards-grounded comfort proxy without changing any other component of the learning pipeline. We evaluate PIRS in CityLearn v2.1.2 (challenge 2022 phase 1) with a central SAC agent trained for 50k steps over five random seeds, and compare against a rule-based controller (RBC), a manually engineered reward (E2), an energy-only reward (E3), and a naive temperature-deviation comfort reward (E4). District-level key performance indicators (KPIs), reported as ratios versus RBC, show that PIRS attains cost, carbon, and electricity metrics on par with the manual baseline while substantially outperforming non-physics-grounded designs -- particularly on load ramping (1.78x vs. ~2.4x RBC) and daily peak demand. All DRL policies remain above RBC at this training budget; we interpret this gap honestly and position PIRS as an interpretable, standards-aligned foundation for reward design rather than a claim of dominance over classical control at limited compute.
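
To make the idea of a weighted multi-objective reward with a PMV comfort term more concrete, here is a minimal sketch. It assumes the PMV value has already been computed elsewhere (the ISO 7730 calculation itself is not reproduced), and the function name, the choice of energy and carbon terms, and the default weights are all hypothetical rather than taken from the paper.

```python
def shaped_reward(pmv, energy_cost, carbon,
                  w_comfort=1.0, w_cost=1.0, w_carbon=1.0):
    """Hypothetical weighted reward: penalize discomfort, cost, and carbon.

    pmv is assumed to be a precomputed Predicted Mean Vote, where 0 is
    thermally neutral and larger magnitudes mean more discomfort.
    """
    comfort_penalty = abs(pmv)
    return -(w_comfort * comfort_penalty
             + w_cost * energy_cost
             + w_carbon * carbon)


# A neutral room scores higher than a slightly cool one at equal cost.
neutral = shaped_reward(pmv=0.0, energy_cost=1.0, carbon=2.0)
cool = shaped_reward(pmv=-0.5, energy_cost=1.0, carbon=2.0)
```

Only the comfort term changes between a temperature-deviation proxy and a PMV-based one in a design like this, which is consistent with the abstract's claim that the rest of the learning pipeline is left untouched.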

2605.28231 2026-05-28 cs.RO cs.LG

ProgVLA: Progress-Aware Robot Manipulation Skill Learning

ProgVLA:进度感知的机器人操作技能学习

Seungsu Kim, Jinyoung Choi, Seungmin Baek, Jean-Michel Renders

AI总结 提出ProgVLA,一种紧凑的视觉-语言-动作模型,通过显式表示任务进度和两阶段Perceiver重采样机制,在有限计算和内存下实现长序列多模态处理,并在多任务操作基准上达到或超越大模型性能。

详情
AI中文摘要

我们提出了ProgVLA,一种紧凑的视觉-语言-动作(VLA)模型,专为在严格的计算和内存预算下进行可靠的机器人操作而设计。该模型特别关注通过维护任务进度的显式表示来高效处理长多模态序列。为此,ProgVLA集成了两个关键组件。首先,一个带有两阶段Perceiver重采样方案的多模态编码器将可变长度的视觉、语言和本体感受流压缩为一组固定的控制就绪上下文令牌,在保持跨模态基础的同时大幅减少序列长度。其次,一组辅助的进度头通过离线强化学习(RL)目标进行训练,以联合学习针对归一化剩余水平目标的批评者。这为策略提供了任务进度的内部估计,并实现了优势加权和成功加权的流匹配模仿学习。在两个成熟的多任务机器人操作基准上,一个0.1B参数的ProgVLA模型达到了与显著更大的预训练基线相当的成功率,并且在长时域和更困难的任务层级上超过了它们。消融实验表明,学习到的上下文重采样器和任务自适应视觉微调是最大的单一贡献者,而进度感知训练提供了集中在长时域和多对象任务上的一致额外增益。我们还在真实世界的玩具厨房环境中进一步验证了该方法。

英文摘要

We present ProgVLA, a compact vision-language-action (VLA) model designed for reliable robot manipulation under tight compute and memory budgets. The model specifically focuses on efficiently processing long multi-modal sequences by maintaining an explicit representation of task progress over extended horizons. To this end, ProgVLA integrates two key components. First, a multi-modal encoder with a two-stage Perceiver resampling scheme compresses variable-length visual, language, and proprioceptive streams into a fixed set of control-ready context tokens, substantially reducing sequence length while preserving cross-modal grounding. Second, an auxiliary set of progress heads is trained with offline reinforcement learning (RL) objectives to jointly learn critics over normalized remaining-horizon targets. This provides the policy with an internal estimate of task progress and enables advantage- and success-weighted flow-matching imitation learning. On two well-established multi-task robot manipulation benchmarks, a 0.1B-parameter ProgVLA model reaches success rates that are competitive with, and on long-horizon and harder task tiers exceed, substantially larger pretrained baselines. Ablations indicate that the learned context resampler and task-adaptive visual fine-tuning are the largest single contributors, while progress-aware training provides a consistent additional gain that is concentrated on long-horizon and multi-object tasks. We further validate the approach in real-world toy-kitchen environments.

2605.28230 2026-05-28 cs.CV

Proprio: Latent Self-Scoring and Inference-Time Refinement for Physically Plausible Video Generation

Proprio: 用于物理合理视频生成的潜在自评分与推理时精炼

Mariam Hassan, Kaouther Messaoud, Wuyang Li, Alexandre Alahi

AI总结 提出Proprio,一种无需训练框架,通过分析模型在潜在扰动下的流残差作为自评分信号,结合最佳N搜索和梯度自精炼,提升冻结视频生成器输出的物理合理性。

详情
AI中文摘要

现代视频生成模型在视觉上效果显著,但经常违反基本物理原理。我们提出Proprio,一种无需训练的框架,使冻结的视频生成器能够评估和改进自身输出的物理合理性。受本体感觉(生物对自身运动的感知)启发,Proprio将模型在受控潜在扰动下的流残差视为自评分信号。能被生成器学习到的动力学更好解释的样本会产生更小且更稳定的残差。我们跨时间步和扰动聚合该信号,通过动态时空掩码聚焦于运动相关区域,并将其用于最佳N搜索、基于梯度的自精炼或两者结合。在文本到视频和图像到视频基准测试中,Proprio持续提升物理合理性,在多种设置下优于基于VLM的评分和外部世界模型基线。使用TurboWan2.2,Proprio将Physics-IQ从32.2提升至37.5(+16.5%),VideoPhy2-hard物理常识从45.6提升至55.0(+20.6%)。人类评估进一步显示,在大约三分之二的比较中,评估者更偏好Proprio选择或精炼的视频的物理合理性。这些结果表明,冻结的视频生成器包含可操作的内部信号,用于评估和改进自身输出的物理合理性。

英文摘要

Modern video generative models produce visually impressive results, yet frequently violate basic physical principles. We propose Proprio, a training-free framework that enables a frozen video generator to assess and improve the physical plausibility of its own outputs. Inspired by proprioception, the biological sense of one's own movement, Proprio treats the model's flow residual under controlled latent perturbations as a self-scoring signal. Samples that are better explained by the generator's learned dynamics induce smaller and more stable residuals. We aggregate this signal across timesteps and perturbations, focus it on motion-relevant regions with a dynamic spatiotemporal mask, and use it for best-of-N search, gradient-based self-refinement, or both. Across text-to-video and image-to-video benchmarks, Proprio consistently improves physical plausibility, outperforming VLM-based scoring and external world-model baselines in several settings. With TurboWan2.2, Proprio improves Physics-IQ from 32.2 to 37.5 (+16.5%) and VideoPhy2-hard physical commonsense from 45.6 to 55.0 (+20.6%). Human evaluation further shows that raters prefer Proprio-selected or refined videos for physical plausibility in roughly two-thirds of comparisons. These results suggest that frozen video generators contain actionable internal signals for evaluating and improving the physical plausibility of their own outputs.
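
The best-of-N use of a residual-based self-score can be sketched as follows. This is an illustrative simplification: the residual values are made-up numbers, the aggregation is a plain mean over timesteps and perturbations, and the spatiotemporal mask described in the abstract is omitted. The function names are hypothetical.

```python
def aggregate_residual(residuals):
    """Mean residual over a nested list indexed as [timestep][perturbation]."""
    flat = [r for per_timestep in residuals for r in per_timestep]
    return sum(flat) / len(flat)


def best_of_n(candidates):
    """Return the candidate name with the smallest aggregated residual."""
    return min(candidates, key=lambda name: aggregate_residual(candidates[name]))


# Hypothetical residual magnitudes for two sampled videos.
candidates = {
    "video_a": [[0.4, 0.6], [0.5, 0.5]],
    "video_b": [[0.1, 0.2], [0.3, 0.2]],
}
selected = best_of_n(candidates)
```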

2605.28229 2026-05-28 cs.CV cs.AI

VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer

VidPrism: 用于图像到视频迁移的异构混合专家模型

Rui Lin, Chuanming Wang, Huadong Ma

AI总结 提出VidPrism,一种异构时间混合专家框架,通过功能专业化专家、内容感知多速率采样和动态双向融合机制,解决传统MoE中专家同质化问题,在视频识别基准上达到最先进性能。

详情
Comments
CVPR2026 camera ready
AI中文摘要

随着预训练技术的快速发展,适应大规模视觉-语言模型(VLM)进行视频理解(即图像到视频迁移学习)已成为主导范式。为了获得卓越性能,近期进展中采用混合专家(MoE)来增强VLM的时间建模能力是一种有效策略。然而,传统的MoE设计存在专家同质化问题,即所有专家充当相同的通才,从无差异的视频流中低效地学习时空特征。为解决此问题,我们提出VidPrism,一种新颖的异构时间混合专家框架。VidPrism通过部署功能专业化的专家开创了分工机制,每个专家承担从空间理解到时间建模的不同角色。为了适当地为这些专家提供输入,我们引入了一个内容感知的多速率采样模块,动态生成从语义丰富到运动聚焦的表示流,为专家提供专业化输入。此外,一种动态双向融合机制实现了这些路径之间的协同信息交换,从而产生全面的视频表示。在各种视频识别基准上的大量实验表明,VidPrism达到了最先进的性能,并有效促进了专家专业化。我们的源代码可在https://github.com/Lrrrr549/VidPrism.git获取。

英文摘要

With the rapid development of pre-training technologies, adapting large-scale Vision-Language Models (VLMs) for video understanding, i.e., image-to-video transfer learning, has become a dominant paradigm. Among recent advances, employing Mixture-of-Experts (MoE) to enhance VLMs' temporal modeling capabilities has emerged as an effective strategy for achieving superior performance. However, conventional MoE designs suffer from expert homogenization, where all experts act as identical generalists, inefficiently learning spatio-temporal features from undifferentiated video streams. To overcome this problem, we propose VidPrism, a novel heterogeneous temporal Mixture-of-Experts framework. VidPrism pioneers a division of labor by deploying functionally specialized experts, each assuming a role ranging from spatial understanding to temporal modeling. To feed these specialists appropriately, we introduce a content-aware, multi-rate sampling module that dynamically generates streams ranging from semantically rich to motion-focused representations, providing specialized inputs for experts. Furthermore, a dynamic, bidirectional fusion mechanism enables synergistic information exchange between these pathways, leading to a comprehensive video representation. Extensive experiments on various video recognition benchmarks demonstrate that VidPrism achieves state-of-the-art performance and effectively fosters expert specialization. Our source code is available at https://github.com/Lrrrr549/VidPrism.git.

2605.28228 2026-05-28 cs.CL

When Seekers Are Hard to Help: Evaluating Emotional Support Dialogue Systems in Worst-Case Interactions

当求助者难以帮助:评估情感支持对话系统在最坏情况交互中的表现

Jiajie Yang, Yangchun Li, Guanyi Chen, Rui Fan, Xin Bai, Tingting He

AI总结 本研究通过专家模拟和提出最坏情况评估框架,发现现有情感支持对话系统在面对低参与度、抗拒等困难求助者时性能显著下降,并验证了最坏情况模拟数据可提升模型鲁棒性。

详情
AI中文摘要

情感支持对话系统(ESDS)越来越多地使用大语言模型模拟的求助者进行评估和训练。然而,这类模拟求助者通常表现为合作、平均水平的用户,他们清晰披露、建设性回应并在几轮内接受支持。这可能导致过于乐观的评估,并掩盖ESDS是否能够处理困难的求助互动。在这项工作中,我们研究了在最坏情况交互下的ESDS评估,其中求助者由于低参与度、抗拒、有限的自我披露、情绪波动或僵化的负面解释而难以帮助。我们首先进行了一项专家模拟研究,邀请八位经验丰富的咨询专业人员模拟困难求助者,与现有的中文ESDS互动,提供量表评分,并参与半结构化访谈。基于这项研究,我们推导出最坏情况下的求助者行为,并识别出当前系统的关键局限性。然后,我们提出了一个最坏情况评估框架,包括一个基于LLM的最坏情况求助者模拟器和四个面向最坏情况的指标:深度情感理解、引导性探索、平衡的情感支持以及真实和接地气的支持。评估17个系统后,我们发现几乎所有模型在最坏情况交互下性能都大幅下降。大型通用LLM通常比专门的ESDS更稳健,但即使是最强的模型也难以维持参与度并改善求助者的情绪状态。最后,我们表明最坏情况模拟也可以生成有用的训练数据,提高较小模型的鲁棒性。

英文摘要

Emotional Support Dialogue Systems (ESDSes) are increasingly evaluated and trained with LLM-simulated seekers. However, such simulated seekers often behave as cooperative, average-case users who disclose clearly, respond constructively, and accept support within a few turns. This can lead to overly optimistic evaluation and obscure whether ESDSes can handle difficult help-seeking interactions. In this work, we study ESDS evaluation under worst-case interactions, where seekers are hard to help due to low engagement, resistance, limited self-disclosure, emotional volatility, or rigid negative interpretations. We first conduct an expert simulation study with eight experienced counselling professionals, who simulate difficult seekers, interact with existing Chinese ESDSes, provide scale ratings, and participate in semi-structured interviews. Based on this study, we derive worst-case seeker behaviours and identify key limitations of current systems. We then propose a worst-case evaluation framework consisting of an LLM-based worst-case seeker simulator and four worst-case-oriented metrics: Deep Emotional Understanding, Guided Exploration, Balanced Emotional Support, and Authentic and Grounded Support. Evaluating 17 systems, we find that nearly all models suffer substantial performance drops under worst-case interactions. Large general-purpose LLMs are generally more robust than specialised ESDSes, but even the strongest models struggle to sustain engagement and improve seekers' emotional states. Finally, we show that worst-case simulation can also generate useful training data, improving the robustness of smaller models.

2605.28227 2026-05-28 cs.CL

Why We Need Speech to Evaluate Speech Translation

为什么我们需要语音来评估语音翻译

Maike Züfle, Danni Liu, Vilém Zouhar, Jan Niehues

AI总结 本文通过元评估发现现有文本和语音质量估计指标在评估语音翻译中的语音特有信息(如性别一致性和韵律)时均存在不足,并提出SpeechCOMET模型,分析其失败原因,强调需要专用训练数据和真正基于语音的模型。

详情
AI中文摘要

语音翻译模型越来越能够保留语音特定信息(例如,说话者性别、韵律和强调),但评估指标仍然对这些现象视而不见。我们在两个针对性别一致性和韵律的对比数据集上对基于文本和基于语音的质量估计指标进行了元评估,发现两者均存在不足,即使直接访问语音信号也是如此。然后,我们训练了SpeechCOMET,一个带有语音编码器的质量估计模型家族,并评估了一个最先进的SpeechLLM作为评判者。两者在标准质量估计上匹配或超过基于文本的COMET,但都没有一致地评估语音特定现象。我们确定了三个原因:(1)当前编码器未能可靠地保留语音特定特征,(2)模型倾向于忽略语音源信号,以及(3)质量估计训练数据包含的相关示例太少。我们发布了所有模型和代码,并认为进展需要专用的语音特定训练数据和真正基于语音的模型。

英文摘要

Speech translation models are increasingly capable of preserving speech-specific information (e.g., speaker gender, prosody, and emphasis), yet evaluation metrics remain blind to such phenomena. We meta-evaluate both text- and speech-based quality estimation metrics on two contrastive datasets targeting gender agreement and prosody, and find that both fall short, even when given direct access to the speech signal. We then train SpeechCOMET, a family of quality estimation models with speech encoders, and evaluate a state-of-the-art SpeechLLM as a judge. Both match or exceed text-based COMET on standard quality estimation, but neither consistently assesses speech-specific phenomena. We identify three causes: (1) speech-specific features are not reliably preserved in current encoders, (2) models tend to ignore the speech source signal, and (3) quality estimation training data contains too few relevant examples. We release all models and code, and argue that progress requires dedicated speech-specific training data and models that genuinely condition on speech.

2605.28226 2026-05-28 cs.LG

PhAME: Phenotype-Aware Molecular Editing via Latent Diffusion

PhAME: 基于表型感知的潜在扩散分子编辑

Łukasz Janisiów, Sebastian Musiał, Bartosz Zieliński, Dawid Rymarczyk, Tomasz Danel

AI总结 提出PhAME框架,利用潜在扩散模型在预训练图VAE的潜在空间中进行分子编辑,通过组合无分类器引导机制同时优化表型条件和结构相似性,实现高化学有效性和新颖性的多目标分子优化。

详情
AI中文摘要

小分子药物发现需要同时优化候选分子的众多属性。这些属性可以通过分析高维生物特征(如细胞形态和转录组扰动)来研究,这些特征提供了对潜在生物机制的丰富视角。然而,现有的使用这些特征进行优化的生成方法未能满足两个关键要求:在保持与已知先导物结构接近的同时,提供朝向期望表型特征的精确引导。我们引入了PhAME(表型感知分子编辑),这是一种潜在扩散框架,通过将分子优化重新定义为预训练图基VAE潜在空间中的编辑来克服这一挑战。我们的核心贡献是一种具有两个独立尺度的组合无分类器引导方案,一个用于表型条件,另一个用于与种子结构的相似性,允许从业者控制这两个目标之间的权衡。在包括对接分数优化和多模态表型生成在内的多个基准测试中的实证评估表明,PhAME在保持高化学有效性和新颖性的同时实现了最先进的结果。

英文摘要

Small-molecule drug discovery requires simultaneous optimization of numerous properties of candidate molecules. These properties can be investigated through the analysis of high-dimensional biological signatures, such as cell morphology and transcriptomic perturbations, which provide a rich perspective on the underlying biological mechanisms. However, existing generative methods, which use those signatures for optimization, fail to meet two key requirements: providing precise guidance toward desired phenotypic signatures while maintaining structural proximity to a known hit. We introduce PhAME (Phenotype-Aware Molecular Editing), a latent diffusion framework that overcomes this challenge by recasting molecular optimization as editing in the latent space of a pretrained graph-based VAE. Our central contribution is a compositional classifier-free guidance scheme with two independent scales, one for the phenotype-conditioning and one for similarity to the seed structure, allowing practitioners to control the tradeoff between these two objectives. Empirical evaluations across diverse benchmarks, including docking score optimization and multimodal phenotypic generation, demonstrate that PhAME achieves state-of-the-art results while maintaining high chemical validity and novelty.
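
A compositional classifier-free guidance step with two independent scales can be sketched as below. The combination rule shown is the commonly used additive form, assumed here for illustration; the paper's exact formulation may differ. The function and argument names are hypothetical, and plain Python lists stand in for the latent noise predictions.

```python
def compose_guidance(eps_uncond, eps_pheno, eps_sim, scale_pheno, scale_sim):
    """Combine noise predictions using two independent guidance scales.

    eps_uncond: unconditional prediction
    eps_pheno:  prediction conditioned on the target phenotype
    eps_sim:    prediction conditioned on similarity to the seed structure
    """
    return [
        u + scale_pheno * (p - u) + scale_sim * (s - u)
        for u, p, s in zip(eps_uncond, eps_pheno, eps_sim)
    ]


# Toy two-dimensional predictions.
eps_uncond = [0.0, 1.0]
eps_pheno = [1.0, 3.0]
eps_sim = [2.0, 0.0]
guided = compose_guidance(eps_uncond, eps_pheno, eps_sim,
                          scale_pheno=2.0, scale_sim=0.5)
```

Setting both scales to zero recovers the unconditional prediction, and raising one scale while holding the other fixed is how a practitioner would trade phenotype guidance against closeness to the seed in a scheme of this kind.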

2605.28225 2026-05-28 cs.CL

Supervised Semantic Differential for Cross-Cultural Concept Analysis: A Case Study of Human Affect

监督语义差异法用于跨文化概念分析:以人类情感为例

Jan Sikora, Paweł Lenartowicz, Hubert Plisiecki

AI总结 本文提出跨语言监督语义差异法(SSD),通过对齐的多语言词嵌入比较语义维度,并以波兰语、英语和法语情感规范词汇为例,验证了情感维度的跨语言可恢复性及文化差异。

详情
Comments
9 pages, 2 figures, excluding the appendices. Code to reproduce our results is available at https://github.com/przebor/Cross-Cultural-SSD
AI中文摘要

跨文化比较心理意义需要超越词汇层面的翻译,并考察语义维度在不同语言中的组织方式。我们提出了监督语义差异法(SSD)的跨语言扩展,该方法在嵌入空间中估计监督语义梯度,并在对齐的多语言词嵌入之间进行比较。该方法通过置换检验和自助法区间检验梯度对齐性和差异,并通过围绕差异梯度的聚类解释残差差异。我们在波兰语、英语和法语情感规范词汇上展示了该方法,对效价、唤醒度和优势度(如可用)进行建模。情感维度在语言和模型设置中显著可恢复。跨语言比较显示出广泛的对齐性以及结构化的残差差异:效价似乎是共享的,而唤醒度和优势度产生了更多可解释的对比,涉及身体威胁、审美刺激、内部情感性、宏观权威和日常控制。几个聚类也反映了语料库特定的伪影,强调了谨慎解释的必要性。跨语言SSD提供了一个可解释的框架,用于测试语义对齐性、识别差异,并生成关于心理意义跨文化差异的假设。

英文摘要

Cross-cultural comparison of psychological meaning requires methods that go beyond word-level translation and examine how semantic dimensions are organized across languages. We introduce a cross-lingual extension of the Supervised Semantic Differential (SSD), which estimates supervised semantic gradients in embedding space and compares them across aligned multilingual word embeddings. The method tests gradient alignment and difference using permutation procedures and bootstrap intervals, and interprets residual differences through clustering around the difference gradient. We demonstrate the approach on Polish, English, and French affective norm lexicons, modeling Valence, Arousal, and Dominance where available. Affective dimensions were significantly recoverable across languages and model settings. Cross-lingual comparisons showed broad alignment together with structured residual differences: Valence appeared mostly shared, whereas Arousal and Dominance produced more interpretable contrasts involving bodily threat, aesthetic stimulation, internal emotionality, macro-level authority, and everyday control. Several clusters also reflected corpus-specific artifacts, underscoring the need for cautious interpretation. Cross-lingual SSD offers an explainable framework for testing semantic alignment, identifying divergence, and generating hypotheses about cross-cultural differences in psychological meaning.

2605.28224 2026-05-28 cs.AI

When Does Memory Help Multi-Trajectory Inference for Tool-Use LLM Agents?

何时记忆有助于工具使用LLM代理的多轨迹推理?

Xinzhe Li, Yaguang Tao

AI总结 本文提出一个统一框架,将记忆沿传输范围和内容抽象两个维度分解,在无验证器设置下评估四种记忆方法与三种推理策略在四个工具使用基准上的表现,发现推理策略是混淆变量,不同策略下相同记忆方法产生显著不同结果。

详情
Comments
More evaluation and analysis are on the way
AI中文摘要

工具使用LLM代理的多轨迹推理——生成多个推理尝试并从中选择——受益于跨尝试的知识转移,以便后续尝试避免早期尝试的陷阱。现有的跨轨迹记忆方法(轨迹级反思、原子事实提取、原始观察注入)均在单个任务的单一推理策略下进行评估,使得报告的性能提升是否反映记忆抽象或推理方法的属性变得不明确。我们提出一个统一框架,将记忆沿两个维度分解——传输范围(扩展内 vs. 跨轨迹)和传输内容的抽象程度——并在四种工具使用基准(涵盖SQL、知识图谱和CLI环境)上,在匹配实际代理部署设置的无验证器设置下,评估四种方法在三种推理策略(best-of-N、束搜索、MCTS)下的表现。实验矩阵将推理方法识别为混淆变量:相同的记忆方法在相同示例的不同推理策略下产生统计上不同的结果。反思仅在MCTS下达到显著性(不在best-of-N下);扩展内注入(使每个候选条件依赖于先前兄弟候选的结果)仅有助于缺乏多样性的束搜索;而原子事实提取对准确性无影响,但在具有可重用环境结构的任务上使轨迹缩短19-26%。

英文摘要

Multi-trajectory inference for tool-use LLM agents - generating multiple reasoning attempts and selecting among them - benefits from transferring knowledge across attempts so that later ones avoid the pitfalls of earlier ones. Existing cross-trajectory memory methods (trajectory-level reflection, atomic fact extraction, raw observation injection) are each evaluated under a single inference strategy on a single task, making it unclear whether reported gains reflect properties of the memory abstraction or of the inference method. We propose a unified framework that decomposes memory along two axes -- the scope of transfer (within an expansion vs. across trajectories) and the abstraction of the transferred content -- and evaluate four methods under three inference strategies (best-of-N, beam search, MCTS) on four tool-use benchmarks spanning SQL, knowledge-graph, and CLI environments, in a verifier-free setting that matches the deployment regime of practical agents. The experiment matrix identifies the inference method as a confound: the same memory method produces statistically distinct results under different inference strategies on the same examples. Reflection reaches significance only under MCTS (not under best-of-N); within-expansion injection (conditioning each candidate on prior siblings' outcomes) helps only diversity-starved beam search; and atomic fact extraction is accuracy-neutral but shortens trajectories by 19-26% on tasks with reusable environmental structure.

2605.28222 2026-05-28 cs.CL cs.IR cs.LG

Analyzing Quality-Latency-Resource Trade-offs in a Technical Documentation RAG Assistant Using LoRA Adaptation

使用LoRA适配分析技术文档RAG助手中的质量-延迟-资源权衡

Evgenii Palnikov, Elizaveta Gavrilova

AI总结 本研究通过LoRA适配器在RAG系统中分析质量、延迟和资源之间的权衡,发现仅对q和v注意力投影进行适配的配置在帕累托前沿占优。

详情
Comments
13-page main body plus extended appendix; 6 figures; benchmark, LoRA adapters, and code at https://github.com/EugPal/rag-lora-tradeoffs
AI中文摘要

我们研究了在基于文档的检索增强生成(RAG)系统中使用生成器的低秩适配(LoRA)时的质量-延迟-资源权衡。我们构建了一个包含5,144个问答对的手动验证基准测试,这些问答对基于官方Kubernetes文档,并将其与固定的混合检索流水线(BGE-M3密集、BGE-M3原生稀疏、互惠排名融合、交叉编码器重排序)结合。在此基准测试上,我们在Llama-3.2-3B-Instruct和Llama-3.1-8B-Instruct上对20种LoRA配置进行了消融实验,涉及秩和目标模块的选择,并评估了每个配置的token级F1、LLM判断的接地性和正确性(pass@4)、推理延迟、推理内存和训练成本,所有结果均附有bootstrap 95%置信区间。帕累托分析表明,仅作用于q和v注意力投影的LoRA适配器始终主导前沿,而3B/8B的选择主要定义了操作区间。参数匹配的控制比较进一步表明,q/v优势是结构性的而非纯粹参数性的。基准测试、选定的适配器和代码可在https://github.com/EugPal/rag-lora-tradeoffs获取。

英文摘要

We study quality-latency-resource trade-offs in a documentation-grounded retrieval-augmented generation (RAG) system that uses Low-Rank Adaptation (LoRA) of the generator. We build a manually verified benchmark of 5,144 question-answer pairs over the official Kubernetes documentation and combine it with a fixed hybrid-retrieval pipeline (BGE-M3 dense, BGE-M3 native sparse, Reciprocal Rank Fusion, cross-encoder reranking). Over this benchmark we ablate 20 LoRA configurations on Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct across rank and target-module choices, and evaluate each on token-level F1, LLM-judged groundedness and correctness (pass@4), inference latency, inference memory, and training cost, all reported with bootstrap 95% confidence intervals. Pareto analysis shows that LoRA adapters acting only on the q and v attention projections consistently dominate the front, while the 3B/8B choice mainly defines the operating regime. A parameter-matched control comparison further indicates that the q/v advantage is structural rather than purely parametric. The benchmark, selected adapters, and code are available at https://github.com/EugPal/rag-lora-tradeoffs.
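
The Reciprocal Rank Fusion step of the hybrid-retrieval pipeline can be illustrated with a short sketch. It uses the standard RRF formula, in which each document's score is the sum of 1 / (k + rank) over the input rankings. The constant k = 60 is the conventional default and is an assumption here, as are the document identifiers; the paper's actual settings are not stated in the abstract.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    rankings: list of lists, each ordered from most to least relevant.
    k: smoothing constant (60 is a conventional default, assumed here).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical outputs of the dense and sparse retrievers.
dense_ranking = ["a", "b", "c"]
sparse_ranking = ["b", "c", "a"]
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

Because RRF depends only on ranks, it can merge dense and sparse results without calibrating their raw scores against each other. In a pipeline like the one described, the fused list would then go to the cross-encoder reranker.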

2605.28219 2026-05-28 cs.HC cs.AI cs.LG

SmartIterator: Visual Analytics Workflows for Supervising Unsupervised Data Grouping

SmartIterator: 监督无监督数据分组的可视化分析工作流

Gennady Andrienko, Natalia Andrienko

AI总结 提出SmartIterator可视化分析方法,通过六阶段工作流和IteraScope协调视图,系统探索参数扫描下的分组结果,支持用户理解数据结构和做出知情决策。

详情
AI中文摘要

无监督学习方法——主题建模、基于划分和基于密度的聚类——在没有人类指导的情况下产生数据分组,但选择和评估这些分组本身不应是无监督的。我们提出了SmartIterator(SI),一种可视化分析方法,将参数扫描中分组结果的完整序列视为一等分析对象。对于每个方法族,SI提供了一个结构化的六阶段工作流,引导分析师系统地探索分组结果——从质量指标概览,经过过渡稳定性评估、成员置信度评估、内容和上下文检查、循环原型验证,到知情决策——在此过程中逐步建立对数据结构的累积理解。这些工作流通过IteraScope(IS)实现,这是一个协调的可视化显示,结合了质量指标图表与语义颜色编码、带有桑基式过渡流和成员置信度小提琴图的一维组嵌入、带有HDBSCAN检测的循环原型的二维组嵌入(突出显示捕获所有持久模式的迭代),以及用于上下文解释的特定领域链接视图。我们在以下三个场景中演示了这些工作流:(1)来自VAST Challenge 2011的模拟社交媒体消息(基于密度的聚类,根据真实情况进行验证),(2)约1500个NUTS-3区域的欧盟人口统计数据(基于划分的聚类),以及(3)30年的IEEE VIS论文(NMF主题建模)。这些工作流构成了主要贡献:它们提供了可操作的、针对特定方法的指导,用于导航参数空间、研究数据结构如何随配置变化,以及将分析理解扎根于领域背景——从而产生关于数据的知识,这是任何单个“最佳”结果都无法提供的。

英文摘要

Unsupervised learning methods -- topic modeling, partition-based and density-based clustering -- produce data groupings without human guidance, yet choosing and evaluating those groupings should not itself be unsupervised. We present SmartIterator (SI), a visual analytics approach that treats the full sequence of grouping results across a parameter sweep as a first-class analytical object. For each method family, SI provides a structured six-phase workflow that guides the analyst through systematic exploration of grouping results -- from quality-metric overview through transition-stability assessment, membership-confidence evaluation, content and context inspection, and recurrent-archetype verification to an informed decision -- building cumulative understanding of data structure along the way. The workflows are operationalized through IteraScope (IS), a coordinated visual display combining quality-metric charts with semantic color encoding, a 1D group embedding with Sankey-style transition flows and violin plots of membership confidence, a 2D group embedding with HDBSCAN-detected recurrent archetypes that highlights iterations capturing all persistent patterns, and domain-specific linked views for contextualized interpretation. We demonstrate the three workflows on: (1) simulated social-media messages from the VAST Challenge 2011 (density-based clustering, validated against ground truth), (2) EU population statistics across ~1,500 NUTS-3 regions (partition-based clustering), and (3) 30 years of IEEE VIS papers (NMF topic modeling). The workflows constitute the main contribution: they provide actionable, method-specific guidance for navigating parameter spaces, studying how data structure evolves across configurations, and grounding analytical understanding in domain context -- yielding knowledge about the data that no single "best" result can provide.

2605.28218 2026-05-28 cs.CL

IFMTBench: A Comprehensive Benchmark for Multilingual Translation Instruction Following

IFMTBench:多语言翻译指令遵循的综合基准

Mingrui Sun, Mao Zheng, Zheng Li, Mingyang Song

AI总结 提出IFMTBench基准,涵盖7种语言、4506个单约束和2838个多约束项,通过确定性检查器和基于LLM的评分器评估翻译指令遵循能力,揭示指令遵循随模型规模增长快于翻译质量,且术语表和结构化格式约束难度最高。

详情
Comments
11 pages, 6 figures, conference
AI中文摘要

现代翻译工作流程要求的不只是语义等价。用户通常要求模型保留JSON或HTML模式、遵循精心策划的术语表、利用提供的上下文消除歧义,并匹配指定的语域,往往同时满足多个条件。传统的BLEU和xCOMET等指标能捕捉语义保真度,但对约束遵循的指示甚少,而一般的指令遵循基准则忽略了翻译的跨语言性质。我们引入了IFMTBench,一个涵盖七种语言的多语言翻译指令遵循基准,包含4506个单约束和2838个多约束项,跨越六个约束维度和五种组合模式,指令以所有七种语言发出。约束分为由确定性检查器验证的门控子集和由基于评分规则的LLM法官评分的连续子集,通过乘法规则组合以抵抗奖励黑客攻击。评估15个模型揭示了先前协议遗漏的系统性差距:指令遵循随模型规模增长比翻译质量更显著,术语表和结构化格式约束主导了难度梯度,而一般指令遵循排名与翻译行为的相关性很弱。我们的基准可在https://github.com/Tencent-Hunyuan/Hy-MT2/tree/main/IFMTBench获取。

英文摘要

Modern translation workflows demand more than semantic equivalence. Users routinely require models to preserve JSON or HTML schemas, honor curated glossaries, disambiguate with provided context, and match prescribed registers, often several at once. Conventional metrics such as BLEU and xCOMET capture semantic fidelity but provide little signal on constraint adherence, while general instruction-following benchmarks ignore the cross-lingual nature of translation. We introduce IFMTBench, a benchmark for multilingual translation instruction following covering seven languages, with 4,506 single-constraint and 2,838 multi-constraint items spanning six constraint dimensions and five compositional patterns, with instructions issued in all seven languages. Constraints are split into a gating subset verified by deterministic checkers and a continuous subset scored by a rubric-based LLM judge, combined under a multiplicative rule that resists reward hacking. Evaluating 15 models reveals systematic gaps that prior protocols miss: instruction following scales with size more sharply than translation quality, glossary and structured-format constraints dominate the difficulty gradient, and general instruction-following rankings correlate only weakly with translation behavior. Our benchmark is available at https://github.com/Tencent-Hunyuan/Hy-MT2/tree/main/IFMTBench.
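
One plausible reading of the multiplicative rule that combines gating and continuous constraints is sketched below. It assumes each deterministic checker returns a boolean and the rubric-based judge returns a score in [0, 1]; the function name and this exact combination are illustrative assumptions, not the benchmark's published scoring code.

```python
def item_score(gate_results, rubric_score):
    """Hypothetical multiplicative scoring for one benchmark item.

    gate_results: booleans from deterministic checkers (e.g. valid JSON kept).
    rubric_score: continuous judge score, assumed to lie in [0, 1].
    Any failed gate zeroes the item, regardless of the rubric score.
    """
    gate = 1.0 if all(gate_results) else 0.0
    return gate * rubric_score


passed = item_score([True, True], 0.8)
failed = item_score([True, False], 0.9)
```

Under a rule of this shape, a fluent translation that breaks a hard constraint cannot earn partial credit from the judge, which is one way a multiplicative combination could resist reward hacking.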

2605.28217 2026-05-28 cs.CV

A Patient-Specific Pulmonary Arterial Tree Digital Twin to Extract Pulmonary Embolism Biomarkers

患者特异性肺动脉树数字孪生以提取肺栓塞生物标志物

Morgane des Ligneris, Nathan Painchaud, Allan Serva, Laurent Bertoletti, Pierre Croisille, Carole Frindel, Odyssée Merveille

AI总结 提出一种自动化流程,通过构建肺动脉树的有向图表示并提取基于图像的生物标志物(包括局部动脉特征和全局严重程度评分),生成患者数字孪生,用于肺栓塞的风险评估。

详情
Comments
11 pages + 2 pages of supplementary materials. Submitted to special issue of JBHI
AI中文摘要

肺栓塞是由血凝块阻塞肺动脉引起的,是急性心血管综合征的主要原因之一。在临床实践中,通过计算机断层扫描肺血管造影诊断后的治疗决策依赖于风险分层,该分层将30天死亡风险分为三类。这种分层取决于右心室与左心室直径比以及两种心脏酶的血液水平。然而,在急诊情况下,血液生物标志物并不总是可用,而手动计算既定的严重程度评分(如Qanadli和Mastora评分)耗时且很少在临床常规中进行。本研究引入了一种自动化流程,该流程对肺动脉树的有向图表示进行建模,标记其层次结构并表征肺栓塞。该流程推导出基于图像的生物标志物,包括局部动脉级特征(形态学信息、层次位置、血凝块体积和由此产生的阻塞)以及全局患者级生物标志物,如自动计算的严重程度评分(Qanadli和Mastora评分)以及按肺叶和层次划分的总栓塞体积分布。利用人工智能生成的动脉、栓塞、肺和肺叶的二元掩码,它创建了动脉结构的患者数字孪生。通过与现有流程、解剖学期望和手动严重程度评分计算的比较验证,证明了该流程能够自动生成解剖学上准确的数字孪生和具有高度一致性的严重程度评分。这支持了这些基于图像的生物标志物自动提供关于血栓负荷和空间血凝块分布的快速、精确信息的潜力。

英文摘要

Pulmonary embolism, the obstruction of a pulmonary artery by a blood clot, is one of the leading causes of acute cardiovascular syndrome. In clinical practice, therapeutic decisions after diagnosis via computed tomography pulmonary angiography rely on risk stratification, which categorizes 30-day mortality risk into three categories. This stratification depends on the right-to-left ventricular diameter ratio and blood levels of two cardiac enzymes. However, blood biomarkers are not always available in emergency settings, and manual calculation of established severity scores - such as Qanadli and Mastora - is time-consuming and rarely performed in clinical routine practice. This study introduces an automated pipeline that models a directed graph representation of the pulmonary arterial tree, labeling its hierarchical structure and characterizing pulmonary embolism. The pipeline derives image-based biomarkers, including local artery-level features (morphological information, hierarchical position, clot volume, and resulting obstruction) and global patient-level biomarkers such as automatically calculated severity scores (Qanadli and Mastora) and the total embolic volume distribution by lobes and hierarchical levels. Using artificial-intelligence-generated binary masks of arteries, emboli, lungs, and lobes, it creates a patient digital twin of the arterial structure. Validation of the pipeline through comparison to an existing pipeline, anatomical expectations, and manual severity score calculations demonstrates the pipeline's ability to automatically generate anatomically accurate digital twins and severity scores with strong agreement. This supports the potential of these image-derived biomarkers to automatically provide rapid, precise information on thrombotic burden and spatial clot distribution.

2605.28215 2026-05-28 cs.AI cs.CL cs.LG cs.LO cs.MA

Explaining is Harder Than Predicting Alone: Evaluating Concept-based Explanations of MLLMs as ICL Visual Classifiers

Explaining Is Harder Than Predicting Alone: Evaluating Concept-Based MLLM Explanations as ICL Visual Classifiers

Carmen Quiles-Ramírez, Leticia L. Rodríguez, Nicolás Martorell, Natalia Díaz-Rodríguez

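AI summary Using five conditions of increasing formality, this paper systematically evaluates the concept-based explainability of multimodal large language models under few-shot in-context learning. It finds that explaining is harder than predicting and that forcing models to generate formal explanations lowers predictive accuracy.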

Details
Comments
Accepted to the CompLearn Workshop at ICML 2026
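AI abstract (translated from Chinese)

In-context learning (ICL) lets multimodal large language models (MLLMs) classify images from a small number of labelled examples. How these models use the provided context, however, remains opaque. Although chain-of-thought prompting is widely used, recent work argues that it may not reflect the true internal computation. In this paper, we systematically evaluate the concept-based explainability of frozen MLLMs under few-shot ICL using five conditions of increasing formality, from baseline classification to Description Logics (DL) axiom generation. Evaluating four state-of-the-art MLLMs with an independent LLM-as-a-judge pipeline, we show that explaining is indeed harder than predicting alone. Surprisingly, forcing models to generate formally structured, concept-based explanations lowers predictive accuracy monotonically (from 93.8% to 90.1%), which contradicts the assumption that explicit reasoning generally helps performance. When models do manage to express class-discriminative visual features, however, explanation quality correlates strongly with correct predictions. Our findings indicate that although MLLMs perform well at visual classification, they lack the specific instruction tuning needed for formal, machine-verifiable explainability.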

English abstract

In-context learning (ICL) enables multimodal large language models (MLLMs) to classify images from a few labelled examples. Yet, how these models use the provided context remains opaque. While Chain-of-Thought prompting is widely used, recent work argues that it may not reflect true internal computation. In this paper, we systematically evaluate the concept-based explainability of frozen MLLMs under few-shot ICL using five conditions of increasing formal rigour, ranging from baseline classification to Description Logics (DL) axiom generation. Evaluating four state-of-the-art MLLMs via an independent LLM-as-a-judge pipeline, we demonstrate that explaining is genuinely harder than predicting alone. Surprisingly, forcing models to generate formally structured, concept-based explanations degrades predictive accuracy monotonically (from 93.8% to 90.1%), contradicting the assumption that explicit reasoning universally aids performance. However, when models successfully articulate class-discriminative visual features, explanation quality strongly correlates with correct predictions. Our findings suggest that while MLLMs excel at visual classification, they lack the specific instruction-tuning required for formal, machine-verifiable explainability.
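
The monotonic accuracy drop reported in this abstract can be checked with a few lines of Python once per-condition results are available. The sketch below is an illustration only: the function names and the toy records are hypothetical, and the abstract reports only the endpoints (93.8% and 90.1%), not the intermediate values. It assumes each record is a pair of a condition index (ordered by increasing formality) and a flag for whether the prediction was correct.

```python
from statistics import mean


def accuracy_by_condition(records):
    """Map each condition index to its mean accuracy.

    records: iterable of (condition_index, is_correct) pairs.
    """
    buckets = {}
    for condition, is_correct in records:
        buckets.setdefault(condition, []).append(1.0 if is_correct else 0.0)
    return {c: mean(v) for c, v in sorted(buckets.items())}


def degrades_monotonically(accuracy):
    """True if accuracy never increases as the condition index grows."""
    values = [accuracy[c] for c in sorted(accuracy)]
    return all(a >= b for a, b in zip(values, values[1:]))


# Made-up toy records with three conditions, NOT results from the paper.
TOY_RECORDS = (
    [(0, True)] * 4
    + [(1, True)] * 3 + [(1, False)]
    + [(2, True)] * 2 + [(2, False)] * 2
)
```

On the toy records, `accuracy_by_condition(TOY_RECORDS)` gives 1.0, 0.75, and 0.5 for conditions 0, 1, and 2, so `degrades_monotonically` returns `True`.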