arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.01822 2026-06-05 cs.CV

Hierarchically Decoupled Mixture-of-Experts for Robust Traffic Sign Recognition in Complex Driving Scenarios

Mingxiao Wang, Xiaozhen Qu, Bolin Gao, Tong Wang, Lei He

Affiliations: School of Automotive and Traffic Engineering, Liaoning University of Technology; State Key Laboratory of Intelligent Green Vehicles and Mobility, School of Vehicle and Mobility, Tsinghua University

AI summary: Proposes CBDES MoE TSR, a hierarchically decoupled heterogeneous mixture-of-experts framework that selects the best-suited expert model through image-level dynamic routing. On a composite traffic sign dataset it reaches 76.8% mAP50-95, 2.3 percentage points above the baseline, with about 39.4% less computational overhead.

Comments
9 figures, 3 tables

Abstract

Traffic sign detection is a fundamental component of environmental perception in autonomous driving and intelligent transportation systems. However, most existing detectors rely on static inference with globally shared parameters, limiting their ability to adapt to diverse and unstructured traffic scenarios. As a result, a single static model often struggles to handle both clear near-range samples and challenging conditions such as distant small targets or adverse weather. To address this limitation, we propose CBDES MoE TSR, a hierarchically decoupled heterogeneous mixture-of-experts (MoE) framework for traffic sign recognition. The proposed framework departs from the conventional globally shared parameter paradigm by introducing a heterogeneous You Only Look Once (YOLO) expert pool together with a lightweight gating network, enabling an image-level dynamic routing mechanism. Based on the semantic characteristics of the input image, the gating module selectively activates the most suitable expert model from the expert pool, enabling a shift from fixed parameter fitting to on-demand dynamic representation. This design enhances feature extraction capability for specific scenarios while maintaining controlled inference overhead. Experimental results demonstrate that the proposed method balances detection accuracy and efficiency on the composite traffic sign dataset. Specifically, our method attains an mAP50-95 of 76.8%, a 2.3 percentage point improvement over the baseline method (74.5%), while reducing computational overhead by approximately 39.4%. These findings validate the effectiveness of the proposed approach.
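
The image-level routing described in this abstract can be illustrated with a minimal sketch. It assumes top-1 selection over softmaxed gate scores; the expert names, the string-returning detectors, and the gate scores are all hypothetical stand-ins, and the paper's actual gating network and YOLO experts are not reproduced here.

```python
import math
from typing import Callable, List, Sequence, Tuple

# (hypothetical expert name, detector callable); real experts would be YOLO models.
Expert = Tuple[str, Callable[[str], str]]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over raw gate scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_image(gate_scores: Sequence[float], experts: Sequence[Expert], image: str) -> Tuple[str, str]:
    """Run only the expert with the highest gate probability (top-1 routing)."""
    probs = softmax(gate_scores)
    best = max(range(len(probs)), key=lambda i: probs[i])
    name, detector = experts[best]
    return name, detector(image)


# Hypothetical expert pool and gate output for one image.
experts = [
    ("near_range", lambda img: f"near_range detections for {img}"),
    ("small_object", lambda img: f"small_object detections for {img}"),
    ("adverse_weather", lambda img: f"adverse_weather detections for {img}"),
]
chosen, output = route_image([0.1, 2.0, -1.0], experts, "frame_001.jpg")
```

Because only one expert runs per image, inference cost tracks the selected expert rather than the whole pool, which is the kind of trade-off the abstract reports.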

2606.01113 2026-06-05 cs.CV

R^3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking

Zixu Li, Yupeng Hu, Zhiheng Fu, Zhiwei Chen, Weili Guan, Liqiang Nie

Affiliations: Shandong University; Harbin Institute of Technology (Shenzhen)

AI summary: Proposes R^3, a zero-shot composed video retrieval pipeline that strengthens the query representation with a generated reasoning trace and verifies candidate videos with a reranker, addressing retrieval from a source video combined with an edit instruction.

Abstract

The CoVR-R challenge evaluates composed video retrieval, where a system must retrieve a target video from a large gallery given a reference video and a textual edit instruction. This setting is not a standard video-text retrieval problem: the query is defined by both the visual evidence in the source video and the transformation implied by the edit. A strong embedding model can provide scalable candidate recall, but it may under-express target-side consequences such as state changes, action replacement, object preservation, or temporal consistency. A pairwise multimodal reranker can verify such details more directly, but exhaustive reranking over the full gallery is computationally infeasible. We present R^3, a zero-shot composed video retrieval pipeline built around Reasoning-guided Recalling and Reranking. The core idea is to turn the source-edit query into a reasoning-grounded retrieval program rather than treating the edit text as a short caption. First, the model generates a reasoning trace that describes the expected target video after applying the edit. Then the trace is encoded together with the source video as a reasoning-augmented query, and its retrieval score is fused with that of the base composed query through an agreement-gated residual rule. Finally, a reranker verifies the recalled candidates with direct source-candidate comparison. Experiments demonstrate the effectiveness of our method on this challenge. Code is available at https://github.com/Lee-zixu/R-3.

2606.00644 2026-06-05 cs.AI

ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment

Qiuyu Tian, Haojie Yin, Yingce Xia, Youyong Kong, Zequn Liu

Affiliations: Southeast University; Beijing Zhongguancun Academy; Duke Kunshan University

AI summary: Proposes ForeSci, a benchmark of 500 temporally controlled tasks that evaluates whether LLM agents can make forward-looking research judgements from historical evidence, and identifies a disconnect between cited evidence and final decisions.

Abstract

AI research often requires decisions before future evidence exists: which bottleneck to attack, which direction to pursue, or where a project should be positioned. We introduce ForeSci, a temporally controlled benchmark for evaluating whether LLM agents can make such forward-looking research judgements from historical evidence. ForeSci contains 500 tasks across four fast-moving AI domains and four decision families. Each task is paired with a cutoff-aligned offline knowledge base; post-cutoff papers are hidden during generation and used only for validation. To avoid random future-event prediction, tasks are derived from pre-cutoff taxonomy branches and evidence signals, and answer-generation backbones are selected to precede the task cutoffs. We evaluate native LLMs, Hybrid RAG, and three research-agent adaptations across four backbones. Results show that explicit evidence organization improves traceability and factual support, but gains depend strongly on the decision family. Diagnostics reveal a recurring evidence-decision decoupling: agents may cite relevant evidence while forecasting the wrong research object. ForeSci turns forward-looking AI research judgement into a controlled benchmark for evaluating research agents as decision-making systems.
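
The cutoff-aligned knowledge base in this abstract amounts to a date split: pre-cutoff papers are visible during generation, post-cutoff papers are held back for validation. A minimal sketch follows, assuming each paper carries a publication date; the `Paper` record, its field names, and the sample IDs are hypothetical and not taken from the benchmark.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class Paper:
    """Hypothetical record; the benchmark's real schema is not given in the abstract."""
    paper_id: str
    published: date


def split_by_cutoff(papers: List[Paper], cutoff: date) -> Tuple[List[Paper], List[Paper]]:
    """Return (visible, held_out): papers up to the cutoff, and papers after it."""
    visible = [p for p in papers if p.published <= cutoff]
    held_out = [p for p in papers if p.published > cutoff]
    return visible, held_out


papers = [
    Paper("a", date(2023, 5, 1)),
    Paper("b", date(2024, 1, 15)),
    Paper("c", date(2024, 9, 30)),
]
visible, held_out = split_by_cutoff(papers, cutoff=date(2024, 1, 15))
```

Whether a paper published exactly on the cutoff date counts as visible is a design choice made here for illustration only.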

2606.00522 2026-06-05 cs.CV

A Trajectory-Driven Spatio-Temporal Refinement Solution for CVPR 2026 8th UG2+ Challenge Track 3: DOST

Hongzhen Li, Miao Yu, Leilei Cao, Youwei Pan, Yingfang Zhu, Fengjie Zhu

Affiliations: TEX AI, Transsion Holdings

AI summary: Builds on the SegAnyMo framework with data-centric domain adaptation and a spatio-temporal post-processing module to improve dynamic object segmentation under severe atmospheric distortion, taking second place in the challenge.

Abstract

In this work, we present our solution for the 8th UG2+ Challenge (CVPR 2026) Track 3: Dynamic Object Segmentation in Turbulence (DOST). Our method is built upon the strong baseline framework Segment Any Motion (SegAnyMo), which provides powerful mask generation and motion tracking capabilities. To further boost the segmentation performance under severe atmospheric distortions, we propose two key improvements. First, we employ a data-centric domain adaptation strategy. We significantly expand our training data by incorporating selected sequences from the DAVIS dataset alongside a subset of the DOST dataset, and apply simulated atmospheric fluctuation degradations to enhance the model's robustness against complex geometric distortions. Second, we introduce a spatio-temporal post-processing module. This refinement step effectively removes persistent boundary-connected false foregrounds and short-lived fragmented noise, while strictly preserving genuine small targets and maintaining original individual labels across frames. With these combined strategies, our proposed method ranks 2nd in the challenge.
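
One part of the post-processing described above, removing short-lived fragments, can be sketched as a filter on track duration. This is an illustrative assumption, not the authors' module: the track representation (object ID mapped to the frame indices where it appears) and the `min_frames` threshold are hypothetical.

```python
from typing import Dict, List


def drop_short_lived(tracks: Dict[int, List[int]], min_frames: int = 3) -> Dict[int, List[int]]:
    """Keep only object IDs that appear in at least `min_frames` distinct frames.

    Surviving IDs are left unchanged, so individual labels persist across frames.
    """
    return {
        obj_id: frames
        for obj_id, frames in tracks.items()
        if len(set(frames)) >= min_frames
    }


# Hypothetical tracks: ID 2 flickers in a single frame and is treated as noise.
tracks = {1: [0, 1, 2, 3, 4], 2: [2], 3: [1, 2, 3]}
kept = drop_short_lived(tracks)
```

Filtering on duration rather than mask area is one way a small but persistent target could survive while transient fragments are discarded.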

2605.31278 2026-06-05 cs.AI cs.LG stat.ME

Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation

Grégoire Martinon, Ibrahim Merad, Mohammed Raki

Affiliations: University of California, Berkeley; Google Research

AI summary: Proposes GLIDE, an open-source library that unifies multiple prediction-powered inference methods, provides unbiased estimates with valid confidence intervals, and substantially reduces human annotation cost.

Comments
8 pages, Accepted to the ICML 2026 Workshop on Statistical Frameworks for Uncertainty in Agentic Systems, Seoul, South Korea, 2026

Abstract

Reliable evaluation of agentic systems requires unbiased estimates with valid uncertainty, but standard practice navigates between costly human annotation and biased LLM-as-judge proxies. Prediction-powered inference (PPI) combines both into debiased estimates with valid confidence intervals, yet its various methods remain scattered across papers under partial implementations. We introduce GLIDE, an open-source Python library that unifies state-of-the-art PPI estimators (PPI++, Stratified PPI, Predict-Then-Debias and its stratified variants, Active Statistical Inference) and samplers (uniform, stratified, active, cost-optimal) under a scipy-style API specialized to mean estimation. GLIDE ships with a reproducible Monte Carlo validation suite, an empirically grounded decision tree for method selection, and an agentic evaluation case study showing substantial annotation savings at equivalent precision. The GLIDE package is available at this URL: https://github.com/EmertonData/glide
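
For readers unfamiliar with PPI, the mean estimator at its core can be sketched in a few lines. This uses the standard form: the mean of the human labels plus a correction equal to the gap between the judge's average prediction on the unlabeled pool and on the labeled set, scaled by a tuning weight `lam` (with `lam = 1` giving classic PPI). The function name and data are hypothetical, this is not GLIDE's API, and confidence intervals are omitted.

```python
from statistics import fmean
from typing import Sequence


def ppi_mean(
    labels: Sequence[float],
    preds_labeled: Sequence[float],
    preds_unlabeled: Sequence[float],
    lam: float = 1.0,
) -> float:
    """Prediction-powered point estimate of a mean.

    labels          : human annotations on the small labeled set
    preds_labeled   : judge predictions on that same labeled set
    preds_unlabeled : judge predictions on the large unlabeled pool
    lam             : weight on the prediction-based correction (0 ignores the judge)
    """
    if len(labels) != len(preds_labeled):
        raise ValueError("labels and preds_labeled must be paired")
    return fmean(labels) + lam * (fmean(preds_unlabeled) - fmean(preds_labeled))


# Hypothetical toy data.
labels = [1, 0, 1, 1]
preds_labeled = [0.9, 0.2, 0.8, 0.6]
preds_unlabeled = [0.5, 0.7, 0.9, 0.3]
estimate = ppi_mean(labels, preds_labeled, preds_unlabeled)
```

With `lam = 0` the estimate falls back to the plain mean of the human labels, which is why the weight can be tuned without introducing bias.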

2605.30819 2026-06-05 cs.CV cs.GR

Function2Scene: 3D Indoor Scene Layout from Functional Specifications

Ruiqi Wang, Qimin Chen, Daniel Ritchie, Angel X. Chang, Manolis Savva, Kai Wang, Hao Zhang

Affiliations: Simon Fraser University; Brown University

AI summary: Proposes Function2Scene, a framework that parses occupant personas and activities from a natural-language design brief, derives layout constraints from a taxonomy of 17 functional criteria, and refines the layout through an iterative check-and-repair loop using an LLM and a VLM. Its results are preferred over baselines in 94.3% of pairwise comparisons across 30 professional cases.

Comments
project page: https://function2scene.github.io/

Abstract

Most text-driven 3D indoor scene synthesis methods generate rooms from object-centric prompts, asking what furniture should be placed rather than how the space is used. Yet in real interior design, a layout is judged by how well it supports its occupants, e.g., their activities and physical needs. We introduce Function2Scene, a framework for generating 3D indoor layouts from functional specifications, i.e., natural-language design briefs describing who will use a room and what they need to do there. Given such a specification, our system parses occupant personas and activities, derives a customized set of functional design constraints from a taxonomy of 17 criteria spanning spatial, ergonomic, activity, and environmental considerations, and uses these constraints to guide layout generation. Rather than relying on an LLM to directly produce a final scene, Function2Scene performs iterative evaluation and refinement through a tool-augmented check-and-repair loop, combining geometric measurements, LLM-based contextual reasoning, and VLM-based visual assessment. Experiments on 30 professionally written interior-design cases show that Function2Scene produces layouts that better satisfy functional requirements than recent LLM-based scene synthesis baselines, with our results preferred in 94.3% of pairwise comparisons. Our work reframes text-driven indoor scene synthesis from placing plausible objects to designing spaces that support human use.
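
The check-and-repair loop named in this abstract follows a generic pattern: evaluate constraints, repair on violation, and repeat up to a budget. The sketch below illustrates that pattern only; the layout dictionary, the 90 cm clearance rule, and the repair step are hypothetical, and the paper's geometric, LLM, and VLM checks are not modeled.

```python
from typing import Callable, Dict, List, Optional, Tuple

Layout = Dict[str, int]
Check = Callable[[Layout], Optional[str]]  # returns a violation message or None
Repair = Callable[[Layout, List[str]], Layout]


def find_violations(layout: Layout, checks: List[Check]) -> List[str]:
    """Run every check and collect the messages of those that fail."""
    return [msg for msg in (check(layout) for check in checks) if msg is not None]


def check_and_repair(
    layout: Layout, checks: List[Check], repair: Repair, max_rounds: int = 5
) -> Tuple[Layout, List[str]]:
    """Alternate checking and repairing until clean or the round budget runs out."""
    for _ in range(max_rounds):
        violations = find_violations(layout, checks)
        if not violations:
            break
        layout = repair(layout, violations)
    return layout, find_violations(layout, checks)


def desk_clearance_check(layout: Layout) -> Optional[str]:
    """Hypothetical ergonomic rule: at least 90 cm of clearance at the desk."""
    if layout["desk_clearance_cm"] < 90:
        return "desk clearance below 90 cm"
    return None


def widen_clearance(layout: Layout, violations: List[str]) -> Layout:
    """Hypothetical repair: move the desk to gain 30 cm of clearance."""
    return {**layout, "desk_clearance_cm": layout["desk_clearance_cm"] + 30}


final_layout, remaining = check_and_repair(
    {"desk_clearance_cm": 40}, [desk_clearance_check], widen_clearance
)
```

Returning the remaining violations alongside the layout lets a caller tell a converged result from one that merely exhausted its budget.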

2605.30747 2026-06-05 cs.AI

Generating Graph-Like Logical Rules for Knowledge Graph Reasoning via Diffusion Models

Haoxiang Cheng, Yunfei Wang, Chao Chen, Kewei Cheng, Zhipeng Lin, Haoxuan Li, Changjun Fan, Shixuan Liu

Affiliations: Laboratory for Big Data and Decision; National University of Defense Technology; National Key Laboratory of Information Systems Engineering; Microsoft Corporation; College of Computer Science and Technology

AI summary: Proposes GRiD, a framework that uses a diffusion model to recast graph-like rule discovery as a discrete generative process conditioned on the target relation, trained with supervised pre-training followed by reinforcement learning, to mine graph-like rules efficiently for knowledge graph completion.

Comments
Accepted by KDD 26

Abstract

Logical rules constitute a cornerstone of knowledge graph (KG) reasoning, valued for their interpretability and ability to model relational patterns. However, existing rule mining methods predominantly focus on simple chain-like rules and therefore neglect the richer relational information encoded in graph-like structures, such as cycles and branches. This limitation is further exacerbated by computational bottlenecks caused by the combinatorial explosion of the search space, which is especially challenging for graph-like rules. Meanwhile, generative approaches such as diffusion models, despite their success in other domains, cannot be directly applied to rule mining because their training objectives are not aligned with the goal of learning high-quality rules, and non-differentiable KG rule quality metrics cannot directly guide model optimization. To address these limitations, we propose GRiD, a framework that reformulates graph-like rule discovery as a discrete generative process conditioned on the target relation. GRiD employs a two-phase training strategy. First, supervised pre-training enables GRiD to capture structural priors from subgraphs sampled from the KG meta-graph. Subsequently, reinforcement learning is applied to fine-tune GRiD through policy gradient optimization guided directly by non-differentiable rule-quality metrics. Experiments on six benchmark datasets show that GRiD achieves competitive performance on KG completion tasks. Ablation studies confirm the efficiency and robustness of GRiD and further show that graph-like rules complement chain-like rules in KG completion. Our code and datasets are available at https://github.com/Haoxiang-Cheng/GRiD.

2605.28579 2026-06-05 cs.AI

MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation

Xiaoyu Dong, Zhi Li, Xiao-Ming Wu

Affiliations: The Hong Kong Polytechnic University; Curvature Flow Co., Limited

AI summary: Proposes the MUSE benchmark, which uses a three-stage evaluation protocol (code check, geometric check, design-intent alignment) and a rubric-based VLM judge to measure the practical design quality of Text-to-CAD models in terms of functionality, manufacturability, and assemblability.

Comments
26 pages

Abstract

Large language models (LLMs) have recently advanced text-driven 3D generation, yet Text-to-CAD remains far from supporting industrial product design. Existing benchmarks focus primarily on generating single-part CAD models and evaluate them using geometric similarity metrics that fail to capture functionality, manufacturability, and assemblability. To address this gap, we introduce MUSE, a Text-to-CAD benchmark focused on complex, editable boundary representation (B-Rep) assemblies. MUSE pairs practical design instances with structured Design Specifications and evaluates generated models through a three-stage protocol: code check, geometric check, and design-intent alignment. The final stage uses design-specific rubrics to assess functionality, manufacturability, and assemblability, moving beyond shape matching toward practical design quality. To enable scalable evaluation, we use a rubric-based visual language model (VLM) judge and validate its reliability through human annotation. Experiments on closed-source and open-source LLMs reveal a clear failure cascade from executable code to valid geometry and finally to engineering-ready design, with even the strongest models achieving limited success on fine-grained engineering criteria. Together, MUSE provides a realistic benchmark and evaluation framework for advancing Text-to-CAD from geometric generation toward true engineering design. Our project website, including the leaderboard, dataset, and code, is available at https://dong7313.github.io/muse-benchmark/.
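
The three-stage protocol and the failure cascade it exposes can be pictured as a gated pipeline in which each sample is recorded at the first stage it fails. The sketch below assumes boolean stage outcomes; the sample fields and the data are hypothetical, and the benchmark's real checks (code execution, geometry validation, rubric scoring by a VLM judge) are not implemented.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, bool]
Stage = Tuple[str, Callable[[Sample], bool]]

# Stage names follow the abstract; the predicates are hypothetical placeholders.
STAGES: List[Stage] = [
    ("code_check", lambda s: s["executes"]),
    ("geometric_check", lambda s: s["valid_geometry"]),
    ("design_intent_alignment", lambda s: s["meets_rubric"]),
]


def first_failure(sample: Sample, stages: List[Stage] = STAGES) -> str:
    """Return the name of the first stage the sample fails, or 'passed'."""
    for name, passes in stages:
        if not passes(sample):
            return name
    return "passed"


# Hypothetical generated models with precomputed stage outcomes.
samples = [
    {"executes": False, "valid_geometry": False, "meets_rubric": False},
    {"executes": True, "valid_geometry": False, "meets_rubric": False},
    {"executes": True, "valid_geometry": True, "meets_rubric": False},
    {"executes": True, "valid_geometry": True, "meets_rubric": True},
]
funnel = Counter(first_failure(s) for s in samples)
```

Counting first failures per stage gives a funnel view of where generated designs drop out.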

2605.27887 2026-06-05 cs.AI q-fin.PM

PortBench: A Correlation-Aware, Full-Pipeline Benchmark for LLM-Driven Portfolio Management

Yuxuan Zhao, Sijia Chen, Ningxin Su

Affiliations: Yantai Research Institute of Harbin Engineering University; The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes PortBench, which evaluates LLMs on portfolio management through a static QA dataset and a dynamic five-stage allocation pipeline. Most models fail to beat an equal-weight allocation, reasoning errors compound across stages, and drawdowns under stress are severe.

Comments
Project page: https://portbench.github.io/

Abstract

Large language models (LLMs) have shown strong performance across diverse financial tasks, yet portfolio management (PM), a critical financial decision-making task, remains poorly benchmarked. Existing benchmarks exhibit two main gaps: they ignore cross-asset correlation structures, thereby failing to distinguish genuinely diversified portfolios from concentrated ones, and fail to evaluate the complete PM decision pipeline in real-world scenarios. We introduce PortBench, a benchmark spanning six heterogeneous asset classes over ten years. PortBench consists of two complementary layers: a static QA dataset of 6,269 correlation-based questions across seven task templates, and a dynamic five-stage allocation pipeline that mirrors the full PM decision cycle. To evaluate these layers, we introduce two dedicated metrics: a dual-layer correlation score that measures whether proposed portfolios exploit inter-class hedging and avoid intra-class concentration, and CEPS, a metric that quantifies how reasoning errors compound across pipeline stages. We further assess strategy robustness and investor alignment under three historical stress regimes and risk profiles. Evaluating ten frontier LLMs, we find that despite strong performance on static financial QA, 90% of model-profile combinations fail to outperform a basic equal-weight allocation, and models that satisfy every procedural constraint still suffer catastrophic drawdowns under stress. Our source code is available at https://github.com/AgenticFinLab/portbench.
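
The equal-weight baseline mentioned in this abstract is simple to state: every asset receives weight 1/n, and a proposed allocation is compared against it on the same returns. The sketch below covers a single period with made-up returns and weights; it does not implement PortBench's metrics or its correlation score.

```python
import math
from typing import Sequence


def portfolio_return(weights: Sequence[float], asset_returns: Sequence[float]) -> float:
    """Single-period portfolio return as the weighted sum of asset returns."""
    if len(weights) != len(asset_returns):
        raise ValueError("one weight per asset is required")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(w * r for w, r in zip(weights, asset_returns))


def equal_weight_return(asset_returns: Sequence[float]) -> float:
    """Return of the 1/n baseline allocation."""
    n = len(asset_returns)
    return portfolio_return([1.0 / n] * n, asset_returns)


# Hypothetical one-period returns for four assets and a concentrated model allocation.
asset_returns = [0.10, -0.02, 0.04, 0.00]
model_weights = [0.0, 0.7, 0.0, 0.3]

baseline = equal_weight_return(asset_returns)
model = portfolio_return(model_weights, asset_returns)
beats_baseline = model > baseline
```

In this toy case the concentrated allocation loses to the baseline, the pattern the abstract reports for most model-profile combinations.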

2605.27866 2026-06-05 cs.CL

GRADE: Generalizable Reasoning-Aware Dialogue Evaluation for AI Tutors

Parth Bhalerao, Jeromy Chang, David Chou, Oana Ignat

Affiliations: Santa Clara University

AI summary: Proposes GRADE, a systematic comparison of configurations spanning five open-source models, zero-shot inference, LoRA fine-tuning, synthetic augmentation, and chain-of-thought reasoning, showing that a carefully chosen LoRA pipeline can match proprietary systems on key pedagogical dimensions.

Comments
16 pages, 7 figures

Abstract

Evaluating AI tutor responses requires more than factual correctness: tutors must identify mistakes, locate errors, provide guidance, and offer actionable next steps. We present GRADE, a systematic study of open-source models for pedagogical ability assessment in student-tutor dialogues. Building on the BEA 2025 TutorMind setting, we evaluate 120 configurations across five language models, zero-shot inference, LoRA fine-tuning, synthetic augmentation, CoT+Reasoning, and single-task versus multitask formulations. Gemma3-12B performs best for single-task evaluation, while Gemma3-27B in 8-bit precision is more reliable for multitask prediction. We find that augmentation helps models that struggle with the original data, verification adds limited gains despite higher cost, and CoT+Reasoning is more useful for synthetic data generation than direct classification. We further show that LoRA fine-tuning on structured classification objectives interferes with instruction-following behavior under thinking mode, redirecting generation away from the required evaluation format. Carbon analysis shows that model choice and reasoning mode substantially affect emissions. Overall, GRADE shows that carefully selected open-source LoRA pipelines can match or surpass proprietary and ensemble-based systems on key pedagogical dimensions, with code and data available at https://github.com/pvbgeek/GRADE.

2605.24481 2026-06-05 cs.CV

OmniEgo-R$^2$: A Routed Reasoning Framework for the 1st Cross-Domain EgoCross Challenge at CVPR 2026

Zixu Li, Zhiwei Chen, Zhiheng Fu, Wenbo Wang, Yupeng Hu, Weili Guan, Liqiang Nie

Affiliations: Shandong University; Harbin Institute of Technology (Shenzhen)

AI summary: Targets ambiguous temporal boundaries, mismatched semantic granularity, and unstable decisions in cross-domain egocentric video reasoning with OmniEgo-R$^2$, a routed reasoning framework that placed second in both the Source-Limited and Open-Source tracks.

Comments
Technical Report for the 1st Cross-Domain EgoCross Challenge at CVPR 2026

Abstract

The 1st Cross-Domain EgoCross Challenge at EgoVis, CVPR 2026 evaluates whether multimodal large language models can reason over egocentric videos across surgery, industry, extreme sports, and animal perspective. We achieved second place in both the Source-Limited and Open-Source tracks. In this report, we formulate EgoCross as a robust cross-domain embodied video reasoning problem rather than a simple multiple-choice visual question answering task. We identify three key challenges: (C1) temporal boundary ambiguity, where critical state transitions are sparsely sampled and often occur between frames; (C2) cross-domain semantic granularity mismatch, where the same capability requires different domain-specific visual grammar; and (C3) decision instability under close options, where long multimodal reasoning can select unsupported distractors or produce malformed outputs. To address them, we propose OmniEgo-R$^2$ (Omnidomain Egocentric Routed Reasoning), a unified routed reasoning pipeline consisting of temporal-evidence normalization, domain-agnostic capability routing, structured perception-dynamics-decision reasoning, boundary-aware option verification, and defensive answer calibration. OmniEgo-R$^2$ uses the Qwen3-VL-4B-SFT checkpoints on each EgoCross domain as the visual-language backbone, and wraps them with lightweight test-time reasoning and parsing programs. Our final submissions obtain 66.35% overall accuracy in the Source-Limited track and 66.77% in the Open-Source track, ranking second on both leaderboards. The code is available at https://github.com/Lee-zixu/OmniEgo-R2.
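
Challenge C3 mentions malformed outputs, and the pipeline includes parsing programs around the model. A minimal sketch of defensive answer parsing for a four-option question follows. The regular expression, the fallback choice, and the A to D option set are assumptions made for illustration and are not the authors' parser.

```python
import re

# Matches phrases such as "answer is (c)" or "Answer: B" (case-insensitive).
_ANSWER = re.compile(r"answer\s*(?:is|:)?\s*\(?([A-D])\b\)?", re.IGNORECASE)
_VALID = {"A", "B", "C", "D"}


def parse_choice(text: str, fallback: str = "A") -> str:
    """Extract an option letter from free-form model output.

    Prefers the last explicit "answer ..." statement, then a bare letter,
    and finally a fixed fallback so the result is always a valid option.
    """
    explicit = _ANSWER.findall(text)
    if explicit:
        return explicit[-1].upper()
    bare = text.strip().strip("().").upper()
    if bare in _VALID:
        return bare
    return fallback


first = parse_choice("The frames show a cut. The answer is (c).")
second = parse_choice("(B)")
third = parse_choice("I cannot determine this from the video.")
```

Taking the last explicit statement is a guess at reasonable behavior when a long reasoning trace revises its own answer; a real system might instead re-query the model.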

2605.27292 2026-06-05 cs.LG stat.ML

Detectability in Diversity: Improved Canary Crafting for Privacy Auditing in One Run

Mathieu Dagréou, Aurélien Bellet

Affiliations: PreMeDICaL team, Inria Idesp, Inserm, Univ. de Montpellier

AI summary: Addresses the weak leakage estimates caused by interference between canary points in one-run privacy auditing by combining an influence-function-based greedy initialization with bilevel optimization, which maximizes canary detectability and promotes diversity in embedding space, yielding stronger leakage estimates at lower computational cost.

Abstract

Privacy auditing aims to empirically assess privacy leakage in machine learning models using membership inference attacks (MIAs), and to derive lower bounds on differential privacy (DP) parameters. Recent one-run auditing methods address the high cost of standard approaches by relying on a single training run with multiple "canary" points whose inclusion or exclusion must be detected by the auditor. In this work, we study the problem of efficiently crafting canaries for one-run privacy auditing. Motivated by recent theoretical insights suggesting that interference between canaries contributes to weaker leakage estimates compared to multi-run methods, we propose to optimize canaries to be both highly detectable and minimally interfering. Our approach combines a greedy initialization based on influence functions with a bilevel optimization procedure that maximizes distinguishability while promoting diversity in embedding space, enabling the use of computationally efficient bilevel algorithms. Experiments show that our method achieves stronger privacy leakage estimates at a lower computational cost than existing canary crafting approaches.

2605.26761 2026-06-05 cs.CV

Once-For-All: A Train-Once and Select-Anytime Framework for Multimodal Instruction Tuning

Mingkang Dong, Hongyi Cai, Xiwen Lei, Jie Li, Tao Zhang, Muxin Pu

Affiliations: Nanyang Technological University

AI summary: Proposes OFA, a framework that trains a transferable selector once and then picks the most informative multimodal instruction data from any dataset or for any model without recomputation, enabling efficient fine-tuning.

Comments
15 pages, 6 figures. Mingkang Dong and Hongyi Cai contributed equally to this work. Muxin Pu is the corresponding author

Abstract

Multimodal instruction tuning is the de facto recipe for adapting vision language models (VLMs), yet instruction data are highly redundant, making data selection critical for training efficiency. Existing methods derive selection signals from a specific model or dataset, so whenever the target model or candidate pool changes, the criteria must be recomputed from scratch at substantial cost. To address this, we propose OFA, a data selection framework that trains a reusable selector once and applies it to any dataset or model without recomputation. OFA clusters multimodal instructions in a frozen CLIP space, derives pseudo labels from the cluster structure, and trains a lightweight selector for only a few epochs; samples on which this selector is least confident are selected as the most informative. Once trained, the frozen selector transfers directly across datasets and model scales. The selector is trained once on LLaVA-665K and applied both to LLaVA-665K itself and, without any retraining, to the unseen Vision-Flan-186K. Selecting only 15% of the data, OFA achieves 98.3% of full data performance across 10 downstream benchmarks; on the smaller Vision-Flan-186K, the transferred selector surpasses full data training by 10.6%, confirming that the learned signal generalizes to datasets never seen during selector training. The same selected subsets benefit VLMs at both Qwen2.5-VL-3B and LLaVA-v1.5-7B without per model recomputation, decoupling selection from the target model. These results demonstrate that a single, transferable selector provides an effective and reusable solution for efficient multimodal instruction tuning.
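
The selection rule in this abstract, keeping the samples on which the selector is least confident, can be sketched as follows. Confidence is taken here to be the maximum predicted probability over pseudo-label classes, which is an assumption; the probabilities are made up, and the CLIP clustering and selector training are not shown.

```python
import math
from typing import List, Sequence


def select_least_confident(probs: Sequence[Sequence[float]], fraction: float) -> List[int]:
    """Return indices of the `fraction` of samples with the lowest max probability."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, math.ceil(len(probs) * fraction))
    confidence = [max(p) for p in probs]
    ranked = sorted(range(len(probs)), key=lambda i: confidence[i])
    return sorted(ranked[:k])


# Hypothetical selector outputs over two pseudo-label clusters.
probs = [
    [0.90, 0.10],
    [0.55, 0.45],
    [0.30, 0.70],
    [0.49, 0.51],
]
selected = select_least_confident(probs, fraction=0.5)
```

Because the rule needs only the frozen selector's outputs, it can be applied to a new candidate pool without retraining, which is the reuse property the abstract emphasizes.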

2605.26236 2026-06-05 cs.CV cs.SD

DuoGesture: Neuro-Inspired and Biomechanically Informed Dual-Stream Co-Speech Gesture Generation

Ferdinand Paar, Lanmiao Liu, Aslı Özyürek, Serge Thill, Esam Ghaleb

Affiliations: Max Planck Institute for Psycholinguistics; Radboud University; Utrecht University

AI summary: Proposes DuoGesture, a neuro-inspired and biomechanically informed dual-stream approach that coordinates a semantic stream and a beat stream through a Semantic Variational Information Bottleneck, producing semantically expressive and biomechanically plausible rhythmic motion.

Abstract

Co-speech gesture generation requires both semantic expressivity and biomechanically plausible rhythmic motion. Existing holistic gesture models mix lexically grounded semantic gestures with frequent prosody-aligned beat gestures. This limits semantic grounding, speech-motion alignment, and kinematic smoothness. We propose DuoGesture, a neuro-inspired and biomechanically informed dual-stream approach that decomposes co-speech gesture synthesis into coupled semantic and beat streams. The two streams are coordinated by a Semantic Variational Information Bottleneck, a stochastic frame-level gate that learns when semantic gestures should override rhythmic beat motion. The semantic stream is controlled by Motion-Grounded Semantic Conditioning, which replaces purely linguistic word embeddings with motion-language representations to provide motion-aligned semantic priors for long-tailed lexical triggers of gestures. The beat stream is further regularised by an Inertial Beat Prior, an anthropometry-weighted arm-chain module that reduces jitter and improves rhythmic consistency without constraining semantic frames. Objective evaluations and subjective experiments show that DuoGesture outperforms strong holistic baselines, while component ablations confirm the complementary roles of semantic grounding, stochastic stream selection, and biomechanical regularisation.

2605.26046 2026-06-05 cs.CL cs.AI cs.LG cs.MA cs.SE

When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges

当梯度冲突时:多目标提示优化用于LLM评判器的失败模式

Parth Darshan, Abhishek Divekar

发表机构 * IIT Jodhpur(印度理工学院焦特布尔分校) Amazon(亚马逊)

AI总结 研究多目标文本梯度优化中梯度稀释和指令干扰两种失败模式,通过分解优化器信息共享方式揭示性能下降原因。

详情
Comments
Accepted at ACL 2026 - CustomNLP4U Workshop. Code, prompts and data available at https://github.com/adivekar-utexas/when-gradients-collide
AI中文摘要

将LLM评判器定制到特定任务或领域通常需要同时跨多个评估标准优化其提示。文本梯度方法针对单一评判标准实现了自动化,但它们产生自然语言批评,而非数值向量。因此,多任务学习的冲突解决工具包(PCGrad、MGDA)不适用于多目标文本梯度设置。我们通过改变损失、梯度和优化器LLM共享跨任务信息的程度,测试了文本梯度优化器的五种分解模式。在10种配置中的6种中,我们观察到优化从未优于初始提示。当梯度LLM联合处理多个标准时,梯度特异性下降了59%(从9.0降至3.7)。另外,我们观察到将每个任务的指令简单组合成单个提示会使斯皮尔曼相关系数降低5.3%。这些结果识别出两种可分离的失败模式:优化时的梯度稀释和推理时的指令干扰,它们共同限制了使用文本反馈进行多目标评判器定制的设计空间。

英文摘要

Customizing an LLM judge to a specific problem or domain often involves optimizing its prompt across multiple evaluation criteria simultaneously. Textual gradient methods automate this for a single judge criterion; however, they produce natural-language critiques, not numerical vectors. Thus, the conflict-resolution toolkit of multi-task learning (PCGrad, MGDA) does not apply to this multi-objective textual gradient setting. We extend TextGrad to the multi-objective setting and test four decomposition modes of textual gradient optimizers by varying how much cross-objective information the loss, gradient and optimizer LLMs share. We find the gradient's task-focus drops by 59% (9.0 to 3.7 out of 10) when the gradient LLM must provide feedback on multiple criteria jointly. Separately, we observe that naively combining single-objective optimized instructions into a single prompt degrades Spearman rho from 0.305 to 0.220 (-0.085). These results identify two separable failure modes: optimization-time gradient dilution and inference-time instruction interference, which together constrain the design space for multi-objective judge optimization using textual feedback.

2605.28367 2026-06-05 cs.RO cs.SY eess.SY

Safety-Critical Adaptive Impedance Control via Nonsmooth Control Barrier Functions under State and Input Constraints

基于非光滑控制障碍函数的状态与输入约束下安全关键自适应阻抗控制

Faisal Lawan, Xiaoran Han, Joaquin Carrasco, Barry Lennox, Xiaoxiao Cheng

发表机构 * Department of Electrical and Electronic Engineering, The University of Manchester(电气与电子工程系,曼彻斯特大学)

AI总结 提出一种在线自适应阻抗控制框架,结合二次规划安全滤波器与新型组合位置-速度非光滑控制障碍函数,在不确定动力学下实现关节状态安全约束与柔顺交互,并通过区间二型模糊逻辑补偿未知动力学、软约束处理执行器力矩限制,利用复合Lyapunov分析证明安全集前向不变性与阻抗跟踪误差一致最终有界。

详情
Comments
12 pages, 3 figures
AI中文摘要

安全物理交互对于在人类-机器人交互和接触密集型任务中部署机器人操作臂至关重要,其中不确定性、外力和执行器限制可能危及性能和安全性。我们提出一种在线自适应阻抗控制框架,在不确定动力学下强制执行关节状态安全,同时实现柔顺交互。该方法结合了基于二次规划的安全滤波器与一种新颖的组合位置-速度非光滑控制障碍函数(NCBF),使得关节位置和速度约束能够通过统一的相对度一障碍来实施。未知动力学通过区间二型模糊逻辑系统在线补偿,而执行器力矩限制则通过软约束处理,并利用精确罚函数恢复可行解。一种增强的扰动观测器安全机制提高了对建模误差和外部交互力的鲁棒性。使用复合Lyapunov分析,我们证明了安全集的前向不变性和阻抗跟踪误差的一致最终有界性。在具有严重参数不确定性和外部交互力的7自由度操作臂上的仿真展示了安全约束满足和鲁棒的阻抗跟踪。

英文摘要

Safe physical interaction is critical for deploying robotic manipulators in human-robot interaction and contact-rich tasks, where uncertainty, external forces, and actuator limitations can compromise both performance and safety. We propose an online adaptive impedance control framework that enforces joint-state safety while achieving compliant interaction under uncertain dynamics. The approach combines a quadratic-program-based safety filter with a novel composed position-velocity non-smooth control barrier function (NCBF), enabling joint position and velocity constraints to be enforced through a unified relative-degree-one barrier. Unknown dynamics are compensated online using an interval type-2 fuzzy logic system, while actuator torque limits are handled through soft constraints with exact penalty recovery of feasible solutions. A disturbance-observer-enhanced safety mechanism improves robustness against modelling errors and external interaction forces. Using composite Lyapunov analysis, we prove forward invariance of the safe set and uniform ultimate boundedness of the impedance-tracking error. Simulations on a 7-DOF manipulator with severe parametric uncertainty and external interaction wrenches demonstrate safe constraint satisfaction and robust impedance tracking.
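
To illustrate what a quadratic-program-based safety filter does, here is a minimal sketch for the textbook case of one smooth, affine barrier constraint, where the QP has a closed-form solution. It is not the paper's non-smooth composed barrier; the function name `cbf_qp_filter` and the reading of `a` and `b` as `L_g h` and `L_f h + alpha * h` are assumptions made for illustration.

```python
def cbf_qp_filter(u_nom, a, b):
    """Solve  min ||u - u_nom||^2  subject to  a . u + b >= 0  in closed form.

    u_nom: nominal control (list of floats), e.g. from an impedance controller.
    a, b:  coefficients of a single affine safety constraint. For a
           relative-degree-one barrier h they would typically be
           a = L_g h and b = L_f h + alpha * h (assumed here).
    """
    margin = sum(ai * ui for ai, ui in zip(a, u_nom)) + b
    if margin >= 0.0:
        # The nominal control already satisfies the constraint.
        return list(u_nom)
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("constraint cannot be satisfied by any control input")
    # Project u_nom onto the boundary of the half-space a . u + b >= 0.
    scale = margin / norm_sq
    return [ui - scale * ai for ui, ai in zip(u_nom, a)]
```

The filter leaves safe commands untouched and otherwise applies the smallest correction that restores the constraint, which is the usual motivation for placing such a filter after a performance-oriented controller.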

2602.03890 2026-06-05 cs.CV

4DPC$^2$hat: Towards Dynamic Point Cloud Understanding with Failure-Aware Bootstrapping

4DPC$^2$hat: 面向动态点云理解的失败感知自举学习

Xindan Zhang, Weilong Yan, Yufei Shi, Xuerui Qiu, Tao He, Ying Li, Ming Li, Hehe Fan

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出首个针对动态点云理解的多模态大语言模型4DPC$^2$hat,通过构建大规模跨模态数据集4DPC$^2$hat-200K和引入Mamba增强的时间推理模块及失败感知自举学习策略,显著提升了动作理解与时间推理能力。

详情
Comments
Accepted by ICML 2026
AI中文摘要

点云提供了3D对象的紧凑且富有表现力的表示,最近已被集成到多模态大语言模型(MLLMs)中。然而,现有方法主要关注静态对象,而理解动态点云序列仍基本未被探索。这一限制主要是由于缺乏大规模跨模态数据集以及在时空上下文中建模运动的难度。为弥补这一差距,我们提出了4DPC$^2$hat,这是首个专为动态点云理解设计的MLLM。为此,我们通过一个精心设计的两阶段流程构建了大规模跨模态数据集4DPC$^2$hat-200K,该流程包括拓扑一致的4D点构建和两级标注。该数据集包含超过44K个动态对象序列、700K个点云帧和200K个精心策划的问答对,支持关于计数、时间关系、动作、空间关系和外观的查询。在该框架的核心,我们引入了一个Mamba增强的时间推理MLLM,以捕捉点云序列中的长程依赖和动态模式。此外,我们提出了一种失败感知的自举学习策略,该策略迭代地识别模型缺陷并生成有针对性的问答监督,以持续增强相应的推理能力。大量实验表明,与现有模型相比,我们的4DPC$^2$hat显著提高了动作理解和时间推理能力,为4D动态点云理解奠定了坚实基础。

英文摘要

Point clouds provide a compact and expressive representation of 3D objects, and have recently been integrated into multimodal large language models (MLLMs). However, existing methods primarily focus on static objects, while understanding dynamic point cloud sequences remains largely unexplored. This limitation is mainly caused by the lack of large-scale cross-modal datasets and the difficulty of modeling motions in spatio-temporal contexts. To bridge this gap, we present 4DPC$^2$hat, the first MLLM tailored for dynamic point cloud understanding. To this end, we construct a large-scale cross-modal dataset 4DPC$^2$hat-200K via a meticulous two-stage pipeline consisting of topology-consistent 4D point construction and two-level captioning. The dataset contains over 44K dynamic object sequences, 700K point cloud frames, and 200K curated question-answer (QA) pairs, supporting inquiries about counting, temporal relationship, action, spatial relationship, and appearance. At the core of the framework, we introduce a Mamba-enhanced temporal reasoning MLLM to capture long-range dependencies and dynamic patterns among a point cloud sequence. Furthermore, we propose a failure-aware bootstrapping learning strategy that iteratively identifies model deficiencies and generates targeted QA supervision to continuously strengthen corresponding reasoning capabilities. Extensive experiments demonstrate that our 4DPC$^2$hat significantly improves action understanding and temporal reasoning compared with existing models, establishing a strong foundation for 4D dynamic point cloud understanding.

2605.29219 2026-06-05 cs.CV

SalsaAgent: A multimodal embodied language model for interactive dance generation

SalsaAgent: 一种用于交互式舞蹈生成的多模态具身语言模型

Payam Jome Yazdian, Zoe Stanley, Angelica Lim

发表机构 * Simon Fraser University(西蒙弗雷泽大学)

AI总结 提出SalsaAgent语言模型,通过非语言运动令牌传递和两阶段令牌到扩散管道,生成与人类领舞者及音乐背景交互的全身萨尔萨舞蹈动作。

详情
Comments
Project page: https://pjyazdian.github.io/Salsa-Agent
AI中文摘要

人形机器人之间的交互涉及双向和非语言反应性、协调与同步。为了构建具有社会意识的机器人和交互式虚拟代理,我们提出了SalsaAgent,一种语言模型,能够生成表达性的全身萨尔萨舞蹈动作,以响应人类领舞者并配合背景音乐。我们将交互形式化为非语言运动令牌传递,扩展了大语言模型(LLM)的词汇表,以处理离散运动令牌、成对关系令牌和音频。我们的贡献包括:用于全身和运动关系的新令牌、使用自动推导的骨架动力学文本描述进行令牌对齐的LLM微调,以及两阶段令牌到扩散管道。主观和客观评估表明,我们的方法在运动质量、音乐与伙伴协调以及一致的双人空间行为方面具有有效性,显著优于基线方法。

英文摘要

Interaction between humanoids involves bidirectional and nonverbal reactivity, coordination and synchrony. Toward socially aware robots and interactive virtual agents, we present SalsaAgent, a language model that generates expressive, full-body salsa dance motions in reaction to a human leader and against a contextual music backdrop. We formulate interaction as nonverbal motion token passing, extending the vocabulary of a large language model (LLM) to process discrete motion tokens, pairwise relation tokens, and audio. Our contributions include new tokens for full-body and motion relations, LLM fine-tuning using automatically derived text descriptions of skeleton dynamics for token grounding, and a two-stage token-to-diffusion pipeline. Subjective and objective evaluations demonstrate the effectiveness of our approach in terms of motion quality, music and partner coordination, and consistent two-person spatial behavior, with significant improvements over baselines.

2603.19294 2026-06-05 cs.LG cs.AI cs.CL

Maximizing Mutual Information Between Prompt and Response Improves LLM Performance With No Additional Data

最大化提示与响应之间的互信息无需额外数据即可提升LLM性能

Hyunji Nam, Haoran Li, Natasha Jaques

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出互信息偏好优化(MIPO)方法,通过对比数据增强构建偏好对,利用直接偏好优化最大化提示与响应间的点互信息,无需额外数据或外部监督即可提升LLM在个性化和可验证任务上的性能。

详情
Comments
International Conference on Machine Learning 2026
AI中文摘要

虽然后训练已在多个领域成功改进了大型语言模型(LLM),但这些提升严重依赖人工标注数据或外部验证器。现有数据已被充分利用,而新数据收集成本高昂。此外,真正的智能远不止可验证任务。因此,我们需要较少依赖外部信号且更广泛适用于可验证和不可验证领域的自我改进框架。我们提出**互信息偏好优化(MIPO)**,一种对比数据增强方法,通过基于正确提示生成正响应,以及基于随机无关提示生成负响应来构建偏好对。我们证明,使用直接偏好优化从这些配对数据中学习,可以最大化*基础LLM*下提示与响应之间的逐点互信息。使用1-7B参数的Llama和Qwen指令模型的实验表明,与提示基线相比,MIPO在个性化任务上实现了3-16%的提升(Qwen2.5-1.5B-Instruct提升51%)。令人惊讶的是,MIPO在可验证领域(如数学和多项选择题问答)也有用,*无需任何额外数据或外部监督*即可获得1-20%的提升。这些结果表明,利用对比数据对中的内在信号进行自我改进是一个有前景的方向。

英文摘要

While post-training has successfully improved large language models (LLMs) across a variety of domains, these gains heavily rely on human-labeled data or external verifiers. Existing data has already been exploited, and new data is expensive to collect. Moreover, true intelligence goes far beyond verifiable tasks. Therefore, we need self-improvement frameworks that are less dependent on external signals and more broadly applicable to both verifiable and non-verifiable domains. We propose **Mutual Information Preference Optimization (MIPO)**, a contrastive data augmentation method that constructs preference pairs by generating a positive response conditioning on the correct prompt, and a negative response by conditioning on a random, unrelated prompt. We show that using Direct Preference Optimization to learn from this paired data maximizes pointwise mutual information *under the base LLM* between prompts and model responses. Experiments with 1-7B parameter Llama and Qwen instruct models show that MIPO achieves 3-16% gains (and 51% increase for Qwen2.5-1.5B-Instruct) on personalization compared to prompting baselines. Surprisingly, MIPO can also be useful in verifiable domains, such as math and multiple-choice question answering, yielding 1-20% gains *without any additional data or external supervision*. These results suggest a promising direction for self-improvement using intrinsic signals derived from contrastive data pairs.
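
The pair-construction step described in this abstract can be sketched as follows. This is a hedged illustration only: `build_mipo_pairs` is a hypothetical helper, `generate` stands in for sampling from the base LLM, and drawing the unrelated prompt from the same prompt pool is an assumption.

```python
import random


def build_mipo_pairs(prompts, generate, seed=0):
    """Build (prompt, chosen, rejected) triples for preference optimization.

    chosen:   response generated from the correct prompt.
    rejected: response generated from a different, randomly drawn prompt.
    generate: callable mapping a prompt string to a response string
              (placeholder for the base model's sampler).
    """
    if len(prompts) < 2:
        raise ValueError("need at least two prompts to draw an unrelated one")
    rng = random.Random(seed)
    pairs = []
    for i, prompt in enumerate(prompts):
        # Any prompt other than the current one counts as "unrelated" here.
        others = [p for j, p in enumerate(prompts) if j != i]
        unrelated = rng.choice(others)
        pairs.append({
            "prompt": prompt,
            "chosen": generate(prompt),
            "rejected": generate(unrelated),
        })
    return pairs
```

The resulting triples have the shape that preference-optimization trainers commonly expect, so no human labels or external verifier are involved in producing them.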

2605.25970 2026-06-05 cs.CV

PathWISE: Multi-Agent Cancer Pathway Triaging Ontology Learning from Clinical Flowcharts

PathWISE: 基于临床流程图的多智能体癌症路径分诊本体学习

Sofiat Abioye, Ufaq Khan, Shazad Ashraf, Mohammed Adil Butt, Andrew D. Beggs, Adam Byfield, Anusha Jose, Junaid Qadir, Muhammad Bilal

发表机构 * Birmingham City University(伯明翰城市大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) University Hospitals Birmingham NHS Foundation Trust(伯明翰大学医院国家健康服务信托基金) NHS England (National Health Service)(英格兰国家健康服务局) Qatar University(卡塔尔大学)

AI总结 提出PathWISE五阶段流水线,结合四个基于LLM的智能体、确定性深度优先搜索审计器和Java编译器批评者,将临床流程图转化为可执行的HL7 CQL库,覆盖100%患者路径,并在五种NHS癌症路径上验证。

详情
Comments
13 pages, 4 figures
AI中文摘要

临床路径以视觉流程图形式传播,其中空间拓扑、箭头方向、颜色编码和字体粗细编码了关键的转诊逻辑,但这些逻辑对计算系统仍然不可访问。我们提出PathWISE,一个五阶段流水线,结合四个基于LLM的智能体与确定性深度优先搜索审计器和Java编译器批评者,将这些不可计算的人工制品转化为经过验证、可执行的HL7临床质量语言(CQL)库,可部署为FHIR CDS Hooks服务。专门构建的智能体将流程图结构提取为类型化有向图,执行确定性路径枚举,对每个节点的可计算性进行结构化语义审计,生成经官方Java CQL-to-ELM编译器验证的术语约束CQL定义,并产生覆盖100%枚举患者路径的路由逻辑。在五种英国NHS癌症路径(结直肠、肺、皮肤、上消化道和乳腺)上展示,PathWISE审计多达183个节点(混合配置下182个),识别四个问题类别中的544个结构化治理发现,实现100%语法编译成功,其中UNCOMPUTABLE节点接收虚假占位符以保持可编译性,同时暴露治理差距供临床审查,并为字典覆盖的概念产生零幻觉术语代码。关键的是,PathWISE将非确定性LLM推理限制在知识提取上,而确定性图数学和标准编译器支撑每个验证步骤。

英文摘要

Clinical pathways are disseminated as visual flowcharts where spatial topology, arrow direction, colour coding, and font weight encode critical triage logic that remains inaccessible to computational systems. We present PathWISE, a five-phase pipeline combining four LLM-based agents with a deterministic depth-first search auditor and a Java compiler critic, transforming these non-computable artefacts into validated, executable HL7 Clinical Quality Language (CQL) libraries deployable as FHIR CDS Hooks services. Purpose-built agents extract flowchart structure into a typed directed graph, perform deterministic path enumeration, conduct a structured semantic audit of every node's computability, generate terminology-constrained CQL definitions verified by the official Java CQL-to-ELM compiler, and produce routing logic covering 100% of enumerated patient journeys. Demonstrated across five UK NHS cancer pathways (colorectal, lung, skin, upper GI, and breast), PathWISE audits up to 183 nodes (182 under the Hybrid configuration), identifies 544 structured governance findings across four issue categories, achieves 100% syntactic compilation success, with UNCOMPUTABLE nodes receiving false placeholders that preserve compilability while surfacing governance gaps for clinical review, and produces zero hallucinated terminology codes for dictionary-covered concepts. Critically, PathWISE confines non-deterministic LLM inference to knowledge extraction while deterministic graph mathematics and a standard compiler underpin every verification step.
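
The deterministic path enumeration mentioned in this abstract can be illustrated with a small depth-first search over a directed graph. This is a sketch under assumptions: the adjacency-list format, the node names, and the function `enumerate_paths` are hypothetical and not taken from PathWISE.

```python
def enumerate_paths(graph, start):
    """Enumerate every simple path from `start` to a terminal node.

    graph: dict mapping a node to an ordered list of successor nodes.
    A node is treated as terminal when it has no unvisited successors,
    so cycles end a path instead of looping forever.
    """
    paths = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        successors = [n for n in graph.get(node, []) if n not in path]
        if not successors:
            paths.append(path)
            continue
        # Push in reverse so successors are explored in their listed order.
        for nxt in reversed(successors):
            stack.append((nxt, path + [nxt]))
    return paths
```

Because the traversal order depends only on the graph, repeated runs give the same list of paths, which is what makes a coverage figure over enumerated patient journeys checkable.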

2605.25956 2026-06-05 cs.CV

RAPTOR+: A Visually Grounded Vision-Language Framework to Improve Clinical Trust and Auditability in Automated Cancer Referral Processing

RAPTOR+: 一种基于视觉的视觉-语言框架,用于提高自动化癌症转诊处理中的临床信任度和可审计性

Sofiat Abioye, Ufaq Khan, Shazad Ashraf, Anusha Jose, Adam Byfield, Lukman Akanbi, Muhammad Bilal

发表机构 * Birmingham City University(伯明翰城市大学) Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)(穆罕默德·本·扎耶德人工智能大学) University Hospitals Birmingham NHS Foundation Trust(伯明翰大学医院 NHS 基础信托) NHS England(英格兰国家卫生服务体系)

AI总结 提出RAPTOR+多模态框架,通过微调视觉-语言模型实现端到端转诊理解,在结直肠癌转诊表单上显著提升提取准确性和证据定位能力。

详情
Comments
12 pages, 4 figures
AI中文摘要

紧急疑似结直肠癌(CRC)转诊会因半结构化临床文档通常需要人工审查和转录而造成操作瓶颈。原始的RAPTOR系统使用大型语言模型进行结构化提取,但依赖单独的OCR阶段,使其易受手写、布局变化和视觉证据链接丢失的影响。我们提出RAPTOR+,一种多模态扩展,使用视觉-语言模型(VLM)进行端到端转诊理解。我们在223份临床整理的CRC紧急转诊表单上评估了微调VLM、商业和开源零样本VLM以及基于OCR的原始流水线。我们还引入了一种基于定位的评估框架,同时衡量提取准确性和证据定位。结果显示零样本模型存在明显的定位差距。Gemini 2.5 Flash实现了92.6%的读取准确率,但严格安全性仅为1.2%。相比之下,微调的Qwen3-VL-8B实现了96.1%的读取准确率和60.6%的严格安全性,显著改善了可验证的证据定位。这些发现表明,任务特定的微调对于可靠、可审计的临床文档理解至关重要。RAPTOR+使得提取的转诊决策能够与视觉证据关联,支持更安全、更高效的癌症转诊分诊。

英文摘要

Urgent suspected colorectal cancer (CRC) referrals create operational bottlenecks because semi-structured clinical documents often require manual review and transcription. The original RAPTOR system used Large Language Models for structured extraction but relied on a separate OCR stage, making it vulnerable to handwriting, layout variation, and loss of visual evidence linkage. We present RAPTOR+, a multimodal extension that uses Vision-Language Models (VLMs) for end-to-end referral understanding. We evaluate fine-tuned VLMs, commercial and open-source zero-shot VLMs, and the original OCR-based pipeline on 223 clinically curated CRC urgent referral forms. We also introduce a grounding-aware evaluation framework that measures both extraction accuracy and evidence localisation. Results show a clear grounding gap in zero-shot models. Gemini 2.5 Flash achieved 92.6% Reading Accuracy but only 1.2% Strict Safety. In contrast, fine-tuned Qwen3-VL-8B achieved 96.1% Reading Accuracy and 60.6% Strict Safety, substantially improving verifiable evidence grounding. These findings show that task-specific fine-tuning is essential for reliable, auditable clinical document understanding. RAPTOR+ enables extracted referral decisions to be linked to visual evidence, supporting safer and more efficient cancer referral triage.

2605.25582 2026-06-05 cs.LG cs.AI

Extreme Region Policy Distillation

极端区域策略蒸馏

Changyu Chen, Xiting Wang, Rui Yan

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院) Wuhan University(武汉大学)

AI总结 提出极端区域策略蒸馏(ERPD)两阶段框架,通过解耦样本效率与KL效率,在固定数据上先进行弱约束离策略优化以最大化提取训练信号,再在信任区域约束下蒸馏到基础策略,从而在数学推理任务中实现更好的性能与更小的KL散度。

详情
AI中文摘要

大语言模型的强化学习面临样本效率与渐近性能之间的基本权衡:严格在策略方法在单次更新后丢弃轨迹,而离策略重用引入分布不匹配,现有信任区域技术主要通过强制保守优化来缓解,往往未充分利用丰富的训练信号。为研究此问题,我们在固定数据上执行大量离策略更新。实验揭示,激进的多步优化带来快速初始增益,但过度更新导致轨迹概率偏离和熵崩溃,性能早期停滞。收紧KL约束仅降低上限而不解决退化。这促使我们提出极端区域策略蒸馏(ERPD),一个两阶段框架,将样本效率与KL效率解耦。第一阶段在固定数据上执行弱约束离策略优化,以最大化提取训练信号。所得策略提供令牌级监督。第二阶段,我们在信任区域约束下将这些信号蒸馏到基础策略中,过滤有害漂移同时保留有用信号。蒸馏后的策略以显著更小的KL散度达到相当或更好的性能,表明第一阶段的大部分散度用于不必要的漂移而非真正改进。关键的是,ERPD同时适应强和弱教师:当激进优化未产生更强策略时,即使是退化教师也能通过替代信号构建策略提供有效监督。我们在数学推理上验证ERPD,显示出对强基础模型(在策略训练停滞时)的增益,以及使用弱教师的可靠改进。

英文摘要

Reinforcement learning for large language models faces a fundamental trade-off between sample efficiency and asymptotic performance: strictly on-policy methods discard trajectories after a single update, while off-policy reuse introduces distribution mismatch that existing trust-region techniques mitigate primarily by enforcing conservative optimization, often leaving rich training signals underutilized. To investigate this, we perform extensive off-policy updates on fixed data. Our experiments reveal that aggressive multi-step optimization brings rapid initial gains, but excessive updates cause trajectory probabilities to deviate and entropy to collapse, with performance plateauing early. Tightening KL constraints merely lowers the ceiling without resolving the degradation. This motivates Extreme Region Policy Distillation (ERPD), a two-stage framework that decouples sample efficiency from KL efficiency. The first stage performs weakly constrained off-policy optimization on fixed data to maximally extract training signals. The resulting policy provides token-level supervision. In the second stage, we distill these signals into the base policy under trust-region constraints, filtering harmful drift while preserving useful signals. The distilled policy achieves comparable or better performance with substantially smaller KL divergence, indicating that much of the first-stage divergence was spent on unnecessary drift rather than genuine improvement. Crucially, ERPD accommodates both strong and weak teachers: when aggressive optimization yields no stronger policy, even degenerate teachers provide effective supervision via alternative signal construction strategies. We validate ERPD on mathematical reasoning, showing gains for strong base models where on-policy training plateaus, and reliable improvements with weak teachers.

2605.25256 2026-06-05 cs.AI

Whose Alignment? Comparing LLM Process Alignment Across Diverse Organizational Decision Contexts

谁的对齐?比较不同组织决策情境下的大语言模型过程对齐

Niklas Weller, Emilio Barkett

发表机构 * University of Cambridge(剑桥大学)

AI总结 本文提出一种决策策略捕获方法测量过程对齐,发现LLM在ECHR第6条决策中过程对齐与输出准确性高度相关,但在德国消费信贷决策中关系消失,揭示了多元对齐挑战。

详情
Comments
Accepted to Pluralistic Alignment Workshop @ ICML 2026, Seoul, South Korea
AI中文摘要

将AI系统与组织决策对齐通常被框架化为单一目标问题:使模型表现得像组织一样。我们认为这种框架掩盖了更深的多元主义挑战。我们依赖一种决策策略捕获方法来测量过程对齐:LLM是否像组织一样加权信息,而不仅仅是是否得出相同结论。将此方法应用于ECHR第6条决策,过程对齐强烈预测输出准确性(r = 0.85, p < .001),且外部化显著改善了低对齐模型的对齐。将其应用于德国消费信贷决策,这种关系消失(r = 0.15, p = .60):干预产生不一致的效果,且基准编码了潜在歧视性的历史模式。这种对比本身就是一个多元对齐发现:在有争议的领域,高过程对齐既不能通过外部化实现,也不是无条件可取的。仅凭输出一致性无法区分一个模型是内化了组织政策还是仅仅近似其结果;过程级测量是任何多元对齐评估的必要组成部分。

英文摘要

Steerable pluralism requires a model to faithfully represent one specified perspective. Organizations are a natural setting for this demand, since they deploy LLMs to make decisions that must reflect their own policy. Yet, most existing work fixes that perspective at the level of individuals or demographic groups. We rely on a decision-policy capturing method to measure process alignment in organizational settings, assessing whether an LLM faithfully reproduces the organization's decision policy rather than merely reaching the same conclusions. We find heterogeneity along two axes. Across models, baseline alignment varies strongly and tracks neither pricing nor general benchmark performance. Across organizations, the structure of alignment changes. In ECHR Article 6 decisions, process alignment predicts output accuracy ($r = 0.85$, $p < .001$), and making the organization's past decision policy explicit improves poorly aligned models. In consumer credit decisions, process alignment is low overall but varies more than output accuracy, and the models resist adopting the organization's weighting of protected attributes. Because historical credit decisions encode potentially discriminatory patterns, higher alignment there is not always desirable. Process-level measurement is therefore necessary, and depending on whether the target policy is normatively desirable, the same procedure can calibrate or audit a model. Deciding which policy to align to, and whether higher alignment is feasible or desirable, makes organizational alignment a pluralistic problem in its own right.

2605.25240 2026-06-05 cs.CL cs.AI cs.CY

JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment

JudgmentBench: 比较评分量规与偏好评估在质量评价中的应用

Russell Yang, Ruishi Chen, Pierce Kelaita, Riya Ranjan, Sibo Ma, Charles Dickens, Matthew Guillod, Megan Ma, Julian Nyarko

发表机构 * Stanford University(斯坦福大学) Snorkel AI

AI总结 本研究通过构建包含30个真实法律任务、1539个评分量规和1530对偏好判断的数据集JudgmentBench,比较了评分量规与成对比较两种评估方法,发现成对比较在恢复预期质量排序上显著优于评分量规(平均斯皮尔曼等级相关系数0.908 vs 0.150),且注释时间减少一半以上。

详情
Comments
37 pages, 9 figures
AI中文摘要

当前基准测试实践中主导着两种方法论:基于评分量规的评分根据预定义标准评估项目,而比较判断则引发输出之间的成对偏好。尽管两种方法论被广泛使用,但两者之间的选择很少被论证。我们发布了JudgmentBench,一个包含30个真实法律任务的基准测试,配对了来自执业律师(包括美国主要律师事务所)的1539个评分量规和1530个成对偏好判断,这些律师具有丰富的经验。这些注释构成了高专业领域内首个公开可用的数据集,其中两种监督信号由同一专家对同一项目进行收集。使用LLM生成的三个质量级别的输出,我们提供了初步的经验比较:比较判断在恢复预期质量排序方面显著优于评分量规(平均斯皮尔曼等级相关系数为0.908 vs 0.150,估计差异=0.758 [0.494, 1.021]),同时所需的注释时间不到一半。这一模式对人类注释者和LLM自动评分器均成立。除了这一初步比较,数据集的配对结构支持更广泛的研究议程,探讨在没有可验证真实情况的领域中,如何引导、聚合专家判断并将其用作监督信号。

英文摘要

Two methodologies dominate current practices of benchmarking: rubric-based scoring evaluates items against predefined criteria, whereas comparative judgment elicits pairwise preferences between outputs. Although both methodologies are widely used, the choice between them is rarely justified. We release JudgmentBench, a benchmark of 30 real-world legal tasks, paired with 1,539 rubric scores and 1,530 pairwise preference judgments collected from practicing attorneys--including at major U.S. law firms--with substantial experience. The annotations constitute the first publicly available dataset in a high-expertise domain in which both supervision signals are elicited from the same experts on the same items. Using LLM-generated outputs at three constructed quality levels, we provide an initial empirical comparison: comparative judgments recover the intended quality ordering substantially better than rubrics under both a per-task rank-correlation metric (mean Spearman's rank correlation of 0.908 vs. 0.150, estimated difference = 0.758 [0.494, 1.021]) and a per-judgment pairwise win-rate metric (0.669 vs. 0.542, estimated difference = 0.127 [0.067, 0.186]), while requiring less than half the annotation time. The patterns hold for human annotators and LLM autograders. Beyond this initial comparison, the paired structure of the dataset supports a broader research agenda on how expert judgment should be elicited, aggregated, and used as supervision in domains without verifiable ground truth.
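
The per-task rank-correlation metric mentioned in this abstract is Spearman's rho between the intended quality ordering and the ordering recovered from annotations. A minimal standard-library sketch follows, using average ranks for ties; the function names are illustrative, and how the benchmark itself aggregates judgments into a ranking is not specified here.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one ranking is constant
    return cov / (vx * vy) ** 0.5
```

With only three quality levels per task, a single swapped pair already moves rho from 1.0 to 0.5, which is why per-task values are averaged across tasks.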

2605.24500 2026-06-05 cs.CV

EgoAdapt: A Multi-Scene Egocentric Adaptation Method for CVPR 2026 HD-EPIC VQA Challenge

EgoAdapt: CVPR 2026 HD-EPIC VQA挑战赛的多场景自我中心适应方法

Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Guozhi Qiu, Weili Guan, Liqiang Nie

发表机构 * Shandong University(山东大学) Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳))

AI总结 提出EgoAdapt方法,通过类别条件路由、校准选项评分和测试时一致性适应,解决自我中心视频问答中通用推理与异构时空语义结构不匹配的问题。

详情
Comments
Technical Report for CVPR 2026 HD-EPIC VQA Challenge
AI中文摘要

本技术报告介绍了我们针对CVPR 2026 HD-EPIC VQA挑战赛的解决方案EgoAdapt(通过类别、校准和一致性进行自我中心适应)。HD-EPIC评估视觉语言模型是否能够对真实的第一人称厨房视频进行推理,其中答案的证据可能是短暂的手-物体交互、长食谱轨迹、与固定装置的空间关系或微妙的注视线索。该基准包含26K个多项选择题,涵盖七个宏观类别:食谱、食材、营养、细粒度动作、3D感知、物体运动和注视。我们观察到主要困难不仅在于模型容量,还在于单一通用推理配方与基准的异构时间、空间和语义结构之间的不匹配。我们的方法EgoAdapt引入了三个推理时组件:(1)类别条件路由,包含每类提示、帧预算和采样率;(2)校准选项评分,使用字母标记似然和生成一致性评估所有候选答案,而非仅依赖直接生成;(3)测试时一致性适应,针对模糊情况聚合选项排列和验证式提示的预测。该设计显著优于现有的HD-EPIC基线。

英文摘要

This technical report presents our solution, EgoAdapt (Egocentric Adaptation via Category, Calibration, and Consistency), to the CVPR 2026 HD-EPIC VQA challenge. HD-EPIC evaluates whether a vision-language model can reason over realistic first-person kitchen videos, where the evidence for an answer may be a short hand-object interaction, a long recipe trajectory, a spatial relation to a fixture, or a subtle gaze cue. The benchmark contains 26K multiple-choice questions across seven macro-categories: recipe, ingredient, nutrition, fine-grained action, 3D perception, object motion, and gaze. We observe that the main difficulty is not only model capacity, but also the mismatch between a single generic inference recipe and the heterogeneous temporal, spatial, and semantic structure of the benchmark. Our method, EgoAdapt, introduces three inference-time components: (1) category-conditioned routing with per-category prompts, frame budgets, and sampling rates; (2) calibrated option scoring that evaluates all candidate answers with letter-token likelihoods and generation agreement instead of relying only on direct generation; and (3) test-time consistency adaptation that aggregates predictions across option permutations and verification-style prompts for ambiguous cases. This design substantially improves over the available HD-EPIC baselines.
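
Component (3), aggregation across option permutations, can be sketched as a simple vote in which each prediction is mapped back to the original option index. This is a simplified illustration: `vote_over_permutations` and the `predict` callable are hypothetical, and the report's actual aggregation may also weight scores or use verification prompts.

```python
from collections import Counter
from itertools import permutations


def vote_over_permutations(options, predict, max_perms=6):
    """Query a model under several option orderings and majority-vote.

    options: list of answer strings in their original order.
    predict: callable taking a reordered option list and returning the
             index of the chosen option within that list (model stand-in).
    Returns (winning original index, vote counts per original index).
    """
    votes = Counter()
    for n, perm in enumerate(permutations(range(len(options)))):
        if n >= max_perms:
            break
        shuffled = [options[i] for i in perm]
        picked = predict(shuffled)
        # Map the position chosen in the shuffled list back to the original index.
        votes[perm[picked]] += 1
    best = max(votes.values())
    winner = min(i for i, v in votes.items() if v == best)  # ties: lowest index
    return winner, dict(votes)
```

A model that tracks content gets all votes on one option, while a model that always answers with the first position spreads its votes evenly, so the vote distribution doubles as a check for position bias.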

2605.24496 2026-06-05 cs.CV

EgoAction: Egocentric Action Composition with Reliability-Aware Temporal Fusion for the EPIC-KITCHENS Action Detection Challenge at CVPR 2026

EgoAction: 面向 EPIC-KITCHENS 动作检测挑战的可靠性感知时间融合自我中心动作组合 (CVPR 2026)

Zhiheng Fu, Zixu Li, Zhiwei Chen, Fangxu Liu, Yupeng Hu, Weili Guan, Liqiang Nie

发表机构 * Shandong University(山东大学) Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳))

AI总结 提出 EgoAction 统一解耦检测与融合流水线,通过动态加权融合(DWF)自适应组合动词和名词检测流,解决自我中心视频中动作边界定位的可靠性问题。

详情
Comments
Technical Report for CVPR 2026 EPIC-KITCHENS-100 Action Detection Challenge
AI中文摘要

EPIC-KITCHENS-100 动作检测挑战评估模型能否在长段未裁剪的自我中心视频中定位每个动作的起始和结束,并分配相应的动词-名词动作标签。在本报告中,我们将提交的方法表述为 EgoAction(基于可靠性感知时间融合的自我中心动作组合),这是一个统一的解耦检测和融合流水线。该流水线使用 EPIC 微调的 VideoMAE-L 特征,训练带有因果时间建模的独立名词和动词时间检测器,从 top 名词-动词对中组合动作假设,并在后处理时引入置信度自适应边界融合规则。关键观察是动词和名词流通常以不同方式失败:动词分数对运动过渡敏感,而名词分数对手-物体可见性和物体杂乱敏感。因此,当其中一个流退化时,其预测边界的固定算术平均值会放大定位误差。我们用动态加权融合(DWF)替换这种硬编码的平均值,DWF 将最大名词和动词分类置信度归一化为提议级别的边界权重,并线性组合两个区间。这种轻量级张量运算将边界权威转移到更可靠的流,同时保留解耦的动作评分机制。结合滑动窗口推理、top-K 名词-动词动作组合和类别级 Soft-NMS,EgoAction 为自我中心时间动作检测提供了一个紧凑且可复现的系统。

英文摘要

The EPIC-KITCHENS-100 Action Detection challenge evaluates whether a model can localize the start and end of each action in long untrimmed egocentric videos and assign the corresponding verb--noun action label. In this report, we formulate our submission as EgoAction (Egocentric Action Composition with Reliability-Aware Temporal Fusion), a unified decoupled detection and fusion pipeline. The pipeline uses EPIC-finetuned VideoMAE-L features, trains separate noun and verb temporal detectors with causal temporal modeling, composes action hypotheses from top noun--verb pairs, and introduces a confidence-adaptive boundary fusion rule at post-processing time. The key observation is that verb and noun streams often fail differently: verb scores are sensitive to motion transitions, whereas noun scores are sensitive to hand-object visibility and object clutter. A fixed arithmetic mean of their predicted boundaries can therefore amplify localization errors when one stream degenerates. We replace this hard-coded mean with Dynamic Weighted Fusion (DWF), which normalizes the maximum noun and verb classification confidences into proposal-wise boundary weights and linearly combines the two intervals. This lightweight tensor-only operator shifts boundary authority toward the more reliable stream while preserving the decoupled action scoring mechanism. Together with sliding-window inference, top-K noun--verb action composition, and class-wise Soft-NMS, EgoAction provides a compact and reproducible system for egocentric temporal action detection.
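
The Dynamic Weighted Fusion rule described in this abstract can be sketched for a single proposal as below. The abstract states only that the maximum noun and verb confidences are normalized into boundary weights; normalizing by their sum and falling back to equal weights when both are zero are assumptions, as is the function name.

```python
def dynamic_weighted_fusion(verb_interval, noun_interval, verb_scores, noun_scores):
    """Fuse two (start, end) intervals using confidence-derived weights.

    verb_interval, noun_interval: (start, end) predicted by each stream.
    verb_scores, noun_scores: class confidences for this proposal.
    """
    conf_verb, conf_noun = max(verb_scores), max(noun_scores)
    total = conf_verb + conf_noun
    # Assumed normalization: weights proportional to the max confidences.
    w_verb = conf_verb / total if total > 0 else 0.5
    w_noun = 1.0 - w_verb
    start = w_verb * verb_interval[0] + w_noun * noun_interval[0]
    end = w_verb * verb_interval[1] + w_noun * noun_interval[1]
    return start, end
```

When both streams are equally confident the rule reduces to the fixed arithmetic mean it replaces; when one stream's confidence collapses, the fused boundaries move toward the other stream's interval.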

2605.24470 2026-06-05 cs.CV

TempRet: Temporal Enhancement and Two-Stage Reranking for CVPR 2026 EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge

TempRet: 面向CVPR 2026 EPIC-KITCHENS-100多实例检索挑战的时间增强与两阶段重排序

Zixu Li, Yupeng Hu, Zhiwei Chen, Zhiheng Fu, Xiaowei Zhu, Weili Guan, Liqiang Nie

发表机构 * Shandong University(山东大学) Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳))

AI总结 针对第一人称视频检索中时间动态被忽视的问题,提出基于CLIP双编码器、视频端时间Transformer和两阶段重排序的TempRet方法,在EK-100 MIR基准上达到67.97%平均mAP和82.92%平均nDCG。

详情
Comments
Technical Report for CVPR 2026 EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge
AI中文摘要

视频-文本检索在大规模视觉-语言预训练的推动下取得了显著进展,但大多数现有方法继承了图像-文本检索的一个隐含假设:视觉语义可以逐帧捕获。这一假设忽视了第一人称视频的时间动态性。EPIC-KITCHENS-100多实例检索(MIR)挑战进一步提高了要求,提供软标签相关性矩阵而非二元标签,要求模型能够解决跨模态的分级语义对应。在本报告中,我们提出了面向CVPR 2026 EPIC-KITCHENS-100 MIR挑战的解决方案,称为TempRet。我们的方法基于CLIP双编码器骨干,并引入两个关键组件来应对时间和跨模态挑战。首先,一个时间Transformer仅在视频端操作,通过可学习的位置编码和帧级CLIP特征上的多头自注意力来建模帧间依赖关系。其次,一个两阶段重排序流程首先通过双编码器检索Top-K候选,然后使用配备图像-文本匹配(ITM)头的交叉编码器细化其分数。整个系统使用对称多相似性损失进行训练,以利用挑战提供的软标签相关性矩阵。我们的方法在EK-100 MIR基准上实现了67.97%的平均mAP和82.92%的平均nDCG,证明了时间建模和跨模态细化对第一人称视频检索的有效性。

英文摘要

Video-text retrieval has witnessed remarkable progress driven by large-scale vision-language pretraining, yet most existing approaches inherit an implicit assumption from image-text retrieval: that visual semantics can be captured frame-by-frame. This assumption overlooks the temporal dynamics of egocentric videos. The EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge further raises the bar by providing soft-label relevance matrices rather than binary labels, demanding models that can resolve graded semantic correspondences across modalities. In this report, we present our solution, termed TempRet, to the CVPR 2026 EPIC-KITCHENS-100 MIR challenge. Our approach builds upon a CLIP-based dual-encoder backbone and introduces two key components to address the temporal and cross-modal challenges. First, a temporal transformer operates exclusively on the video side, modeling inter-frame dependencies through learnable positional encodings and multi-head self-attention over frame-level CLIP features. Second, a two-stage reranking pipeline first retrieves Top-K candidates via the dual-encoder, then refines their scores using a cross-encoder equipped with an Image-Text Matching (ITM) head. The entire system is trained with Symmetric Multi-Similarity Loss to exploit the soft-label relevance matrices provided by the challenge. Our method achieves 67.97% average mAP and 82.92% average nDCG on the EK-100 MIR benchmark, demonstrating the effectiveness of temporal modeling and cross-modal refinement for egocentric video retrieval.
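
The two-stage reranking pipeline can be sketched generically: a cheap scorer orders all candidates, and a more expensive scorer reorders only the top K. In this illustration `dual_score` and `cross_score` are placeholder callables for the dual-encoder and the cross-encoder, and replacing the first-stage score outright (rather than combining the two) is an assumption.

```python
def two_stage_rerank(query, candidates, dual_score, cross_score, top_k=3):
    """Retrieve with a cheap scorer, then rerank the shortlist.

    dual_score, cross_score: callables (query, candidate) -> float.
    Candidates outside the top_k keep their first-stage order.
    """
    first_stage = sorted(candidates, key=lambda c: dual_score(query, c), reverse=True)
    shortlist, rest = first_stage[:top_k], first_stage[top_k:]
    # Only the shortlist pays the cost of the expensive scorer.
    reranked = sorted(shortlist, key=lambda c: cross_score(query, c), reverse=True)
    return reranked + rest
```

The cross-encoder is called `top_k` times per query instead of once per candidate, which is the usual reason for splitting retrieval into two stages.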

2605.24059 2026-06-05 cs.LG cs.AI

Spectral Probe-Circuits: A Three-Step Recipe for Identifying Attention-Head Circuits in Pretrained Transformers

频谱探测电路:识别预训练Transformer中注意力头电路的三步法

Yongzhong Xu

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种三步法,通过频谱信号排序、任务模式筛选和组消融因果验证,无需标签即可识别预训练Transformer中执行持续内容依赖计算的注意力头电路,并在多个模型上验证了其通用性和因果必要性。

详情
Comments
35 pages, 4 figures
AI中文摘要

我们提出了一种三步法,用于识别预训练Transformer中的注意力头电路。每个头的频谱信号——即每个头注意力输出的时间积分参与比——可以在没有标签或归因梯度的情况下,对执行持续内容依赖计算的头进行排序。任务模式屏幕将此通用指标过滤为特定任务的候选电路,而针对匹配随机对照的组消融则完成了因果声明。我们在8倍参数范围(5100万至10亿活跃/70亿总参数)、两种架构族(密集型和混合专家)以及四种预训练流程上进行了验证。该方法是可移植的:一个2-6头的归纳电路在每个测试模型中都是因果必需的,消融后合成归纳top-1下降94-100%。频谱信号在无监督下具有预测性:在5100万参数探测模型的六个独立种子上,相同的计算识别出每个种子上的种子特定电路。在Pythia族(1.24亿至4.1亿)中,执行可识别专门计算的头比例保持在17-19%,而特定归纳电路保持3-11个头——与总头数呈次线性关系。本文是一个三篇论文计划的方法论锚点;配套论文将该方法扩展到预训练期间的发展轨迹以及组合任务电路,其中模式选择性与任务因果结构解耦。

英文摘要

We present a three-step recipe for identifying attention-head circuits in pretrained transformers. A per-head spectral signal -- the time-integrated participation ratio of each head's attention output -- ranks heads doing sustained content-dependent computation without labels or attribution gradients. A task-pattern screen filters this general indicator into a task-specific candidate circuit, and group ablation against a matched-random control completes the causal claim. We validate across an 8x parameter range (51M to 1B-active / 7B-total), two architecture families (dense, mixture-of-experts), and four pretraining pipelines. The recipe ports: a 2-6 head induction circuit is causally necessary in every model tested, with a 94-100% drop in synthetic-induction top-1 after ablation. The spectral signal is predictive without supervision: on six independent seeds of a 51M-parameter probe model, the same computation identifies the seed-specific circuit on each seed. The fraction of heads doing identifiable specialized computation is conserved at 17-19% across the Pythia family (124M to 410M), while specific induction circuits stay 3-11 heads -- sublinear in total head count. This paper is the methodology anchor of a three-paper program; companion papers extend the recipe to developmental trajectories during pretraining and to composed-task circuits where pattern selectivity decouples from task-causal structure.
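
For readers unfamiliar with the participation ratio, the sketch below computes the standard quantity (sum of eigenvalues squared, divided by the sum of squared eigenvalues) for the covariance of a set of vectors. The time integration used in the paper is not described in the abstract and is not reproduced here; the function name and the input layout are assumptions.

```python
def participation_ratio(rows):
    """Participation ratio of the covariance of `rows`.

    rows: list of equal-length vectors, e.g. one head's output per token.
    Uses PR = trace(C)^2 / ||C||_F^2, which equals
    (sum of eigenvalues)^2 / (sum of squared eigenvalues) for symmetric C,
    so no eigendecomposition is needed.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)] for i in range(d)]
    trace = sum(cov[i][i] for i in range(d))
    frob_sq = sum(v * v for row in cov for v in row)
    if frob_sq == 0:
        return 0.0  # constant input: no variance in any direction
    return trace * trace / frob_sq
```

The value ranges from 1 (all variance along one direction) to the vector dimension (variance spread evenly), so it acts as an effective count of the dimensions a head's output occupies.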

2605.23453 2026-06-05 cs.LG

Class-Dependent Hybrid Data Augmentation for Multiclass Migraine Classification under Severe Class Imbalance

严重类别不平衡下多类别偏头痛分类的类别依赖混合数据增强

Elvin Somón, Miguel A. Gutiérrez-Naranjo

发表机构 * University of Santiago de Compostela(圣地亚哥-德孔波斯特拉大学)

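AI summary: To address severe class imbalance in migraine classification, proposes a hybrid data augmentation strategy that depends on per-class sample size and introduces the concept of fidelity asymmetry; after correcting for data leakage and metric bias, it improves average robustness across multiple classifiers.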

详情
AI中文摘要

我们以可重复性为导向,对先前的偏头痛分类研究进行了重新评估,纠正了数据泄露和指标偏差。然后我们引入了(i)一种临床驱动的两个偏瘫亚型聚合(遵循ICHD-3 §1.2.3),(ii)一种类别依赖的混合增强策略,根据每类样本量分配生成方法,以及(iii)保真度不对称的概念,激励按比例约束的增长作为完全类别平衡的替代方案。实验在包含400名患者、七种偏头痛亚型的数据集上进行,采用两阶段协议,包括上述六类配置。模型使用分层5折交叉验证进行评估,以宏平均F1作为主要指标。纠正方法缺陷降低了先前膨胀的性能估计,修正后的宏F1基线为0.71。所提出的框架在八个评估分类器的宏平均F1上持续优于单个增强器(0.862对比高斯Copula的0.836、CTGAN的0.815和无增强基线的0.801),并在比例增强下使用FT-Transformer达到峰值结果0.914。无增强的FT-Transformer基线(0.896)表明,在每分类器上限处,临床驱动的类别聚合贡献了大部分绝对改进;该框架的主要可测量贡献是跨分类器的平均鲁棒性提升,凸显了问题表述的主导作用。

英文摘要

We conducted a reproducibility-oriented re-evaluation of prior migraine classification studies, correcting for data leakage and metric bias. We then introduced (i) a clinically motivated aggregation of two hemiplegic subtypes following ICHD-3 §1.2.3, (ii) a class-dependent hybrid augmentation strategy that assigns generation methods based on per-class sample size, and (iii) the concept of fidelity asymmetry, motivating proportionally constrained growth as an alternative to full class balance. Experiments were performed on a dataset of 400 patients across seven migraine subtypes under a two-stage protocol, including the six-class configuration described above. Models were evaluated using stratified 5-fold cross-validation with macro-averaged F1 as the primary metric. Correcting methodological flaws reduces previously inflated performance estimates, with the corrected macro-F1 baseline standing at 0.71. The proposed framework consistently outperformed individual augmenters in macro-F1 averaged across the eight evaluated classifiers (0.862 vs. 0.836 for Gaussian Copula, 0.815 for CTGAN, and 0.801 for the no-augmentation baseline), and achieved its peak result of 0.914 with FT-Transformer under proportional augmentation. The no-augmentation FT-Transformer baseline (0.896) shows that, at the per-classifier ceiling, clinically motivated class aggregation accounts for most of the absolute improvement; the framework's principal measurable contribution is the gain in average robustness across classifiers, highlighting the dominant role of problem formulation.
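To show why macro-averaged F1 is a sensible primary metric under severe class imbalance, here is a minimal sketch of the standard macro-F1 computation. The function name `macro_f1` and the toy labels are illustrative assumptions. They are not taken from the paper or its dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced case: a classifier that always predicts the majority class.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred))
```

In this toy case accuracy is 0.8 while macro-F1 is only 4/9, because the minority class contributes an F1 of 0 with the same weight as the majority class.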

2605.15913 2026-06-05 cs.CL cs.AI

Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation

通过自动分割和块蒸馏实现块注意力的泛化

Shuaiyi Li, Zhisong Zhang, Yan Wang, Lei Zhu, Dongyang Ma, Chenlong Deng, Yang Deng, Wai Lam

发表机构 * The Chinese University of Hong Kong(香港中文大学) City University of Hong Kong(香港城市大学) Tencent(腾讯) Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学人工智能学院) Singapore Management University(新加坡管理大学)

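AI summary: Proposes a lightweight segmenter trained on a semantic segmentation dataset, together with a block distillation framework, to address text segmentation and fine-tuning efficiency for block attention in long contexts, achieving near-full-attention performance.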

详情
Comments
16 pages, 2 figures
AI中文摘要

块注意力将输入作为独立的块处理,块之间不能相互关注,在检索增强生成(RAG)等长上下文场景中具有显著提升KV缓存重用的潜力。然而,其广泛应用受到两个关键挑战的阻碍:将输入文本分割成有意义且自包含的块的困难,以及现有块微调方法效率低下且可能降低性能的风险。为解决这些问题,我们首先构建了SemanticSeg,一个大规模且多样化的语义分割数据集,包含超过30k个实例,涵盖16个类别——包括书籍、代码、网页文本和对话,文本长度从2k到32k。利用该数据集,我们训练了一个轻量级分割器,能够自动将文本分割成符合人类直觉的块,且粒度可控。其次,我们提出了块蒸馏,一种比块微调更高效的训练框架,它使用冻结的全注意力教师模型来指导块注意力学生模型。该框架集成了三个新颖的组件:块汇合令牌以减轻块边界处的信息丢失,块丢弃以利用来自所有块的训练信号,以及令牌级损失加权以聚焦于对块注意力敏感的令牌的学习。跨多个模型和基准的实验表明,我们的分割器优于启发式和统计基线,且块蒸馏在块注意力下实现了接近全注意力的性能,为部署块注意力建立了一条实用且可扩展的路径。

英文摘要

Block attention, which processes the input as separate blocks that cannot attend to one another, offers significant potential to improve KV cache reuse in long-context scenarios such as Retrieval-Augmented Generation (RAG). However, its broader application is hindered by two key challenges: the difficulty of segmenting input text into meaningful, self-contained blocks, and the fact that existing block fine-tuning methods are inefficient and risk degrading performance. To address these, we first construct SemanticSeg, a large and diverse semantic segmentation dataset containing over 30k instances across 16 categories, including books, code, web text, and conversations, with text lengths ranging from 2k to 32k. Using this dataset, we train a lightweight segmenter to automatically partition text into blocks that match human intuition, with controllable granularity. Second, we propose block distillation, a training framework that is more efficient than block fine-tuning and uses a frozen full-attention teacher model to guide the block-attention student. This framework integrates three novel components: block sink tokens to mitigate information loss at block boundaries, block dropout to leverage training signals from all blocks, and token-level loss weighting to focus learning on block-attention-sensitive tokens. Experiments across multiple models and benchmarks demonstrate that our segmenter outperforms heuristic and statistical baselines, and that block distillation achieves near-full-attention performance under block attention, establishing a practical and scalable pathway for deploying block attention.
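The core constraint of block attention, that tokens in different blocks cannot attend to one another, can be sketched as a boolean mask. This is an illustrative assumption-laden sketch and not the paper's implementation: the function name `block_attention_mask` is hypothetical, causal ordering inside each block is assumed, and the sketch does not model any final query block that might be allowed to attend across blocks, since the abstract does not describe one.

```python
def block_attention_mask(block_lengths):
    """Return mask[i][j] == True when query token i may attend to key token j.

    Assumptions for this sketch: tokens are laid out block after block,
    attention is causal (j <= i), and attention never crosses a block boundary.
    """
    # Block index of every token position, e.g. [2, 3] -> [0, 0, 1, 1, 1].
    block_ids = [b for b, length in enumerate(block_lengths) for _ in range(length)]
    n = len(block_ids)
    return [
        [block_ids[i] == block_ids[j] and j <= i for j in range(n)]
        for i in range(n)
    ]


# Two blocks of lengths 2 and 3; print the mask as rows of 0/1.
mask = block_attention_mask([2, 3])
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Because each block's keys and values depend only on tokens inside that block, they can in principle be computed once and reused wherever the same block reappears, which is the KV cache reuse benefit the abstract refers to.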