arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.25652 2026-05-26 cs.CL cs.CY

A Two-Phase Stability Study of LLM Judges and Bar Council Examiners on Thai Bar-Exam Free-Form Essays

Pawitsapak Akarajaradwong, Wuttikrai Lertprasertphakorn, Chompakorn Chaksangchaichot, Sarana Nutanong

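AI summary: Using free-form essays from the Thai bar exam, the study examines an asymmetry in scoring agreement between LLM judges and human examiners, finding that LLM judges lean toward the majority human reading and fail to reproduce the minority human reading.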

Abstract

Free-form legal essay evaluation in NLP treats expert inter-rater stability as a single ceiling number, and treats LLM-judge agreement with that ceiling as evidence of judge stability. We test both assumptions on the Thai bar examination through an identical-inputs protocol: three Bar Council-trained examiners (A, B, C) and a 26-LLM judge panel score the same 15 cross-graded answers from the same four inputs (question, official Bar Council grading regulation, gold answer, candidate answer). The headline finding is asymmetric. On 10 of 15 cells where the rubric prescribes both axes, all 29 raters converge in a tight band: panel agreement is universal. On the remaining 5 cells where the rubric does not prescribe how to grade a correct final answer that omits a decisive statutory citation, the human panel splits between two coherent readings (B/C majority at the upper rubric band, score $6$--$8$; A minority at the lower band, score $1$--$2$). The LLM judge population does not split symmetrically: 22 of 26 LLMs score in or near B/C's contested band, 3 sit in the regulation-silent middle gap, and only 1 (GPT-5.4 Nano) approaches A's band without consistently scoring within it. Zero LLMs in our 26-judge panel reproduce the minority human reading on the contested cells. The B/C-direction cluster spans every model size, vendor, and price tier we tested. An instrumented three-LLM anchor sub-panel (Claude 4.6 Opus, Gemini 3.1 Pro, GPT-5.4 Pro) carries determinism probes, input ablations, and bootstrap CIs, and reaches anchor panel $\alpha = 0.77$ on the 15 cells against human-panel $\alpha = 0.36$. The high LLM-panel $\alpha$ reflects systematic convergence on the majority reading rather than balanced reproduction of both readings; a benchmark that selects its LLM judge by maximising agreement with a human reference panel will inherit this asymmetry by construction.
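To make the agreement statistic concrete, here is a minimal sketch of how a panel $\alpha$ with a bootstrap interval could be computed. It assumes that $\alpha$ is Krippendorff's alpha with the interval (squared-difference) metric and that every rater scores every cell; the abstract does not state either point. The score table is invented toy data, not the paper's.

```python
import numpy as np

def krippendorff_alpha_interval(scores):
    """Krippendorff's alpha (interval metric) for a complete units x raters table."""
    x = np.asarray(scores, dtype=float)
    n_units, n_raters = x.shape
    # Observed disagreement: mean squared difference between raters within a unit.
    within = x[:, :, None] - x[:, None, :]
    d_obs = (within ** 2).sum() / (n_units * n_raters * (n_raters - 1))
    # Expected disagreement: mean squared difference over all pairs of pooled values.
    flat = x.ravel()
    pooled = flat[:, None] - flat[None, :]
    d_exp = (pooled ** 2).sum() / (flat.size * (flat.size - 1))
    return 1.0 - d_obs / d_exp

def bootstrap_ci(scores, n_boot=500, level=0.95, seed=0):
    """Percentile bootstrap over units (cells), resampled with replacement."""
    x = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    draws = [krippendorff_alpha_interval(x[rng.integers(0, len(x), len(x))])
             for _ in range(n_boot)]
    tail = 100 * (1 - level) / 2
    # nanpercentile skips the (very unlikely) degenerate resample with zero variance.
    return np.nanpercentile(draws, [tail, 100 - tail])

# Invented toy scores for three raters (columns); rows are graded cells.
agreed = np.array([[7, 7, 8], [3, 3, 4], [9, 9, 9], [5, 6, 5], [2, 2, 2]])
contested = np.array([[1, 7, 8], [2, 6, 7], [1, 8, 6]])  # first rater reads low

alpha_agreed = krippendorff_alpha_interval(agreed)
alpha_all = krippendorff_alpha_interval(np.vstack([agreed, contested]))
ci_low, ci_high = bootstrap_ci(np.vstack([agreed, contested]))
print(alpha_agreed, alpha_all, (ci_low, ci_high))
```

In this toy table, adding a few cells where one rater consistently reads low pulls the panel statistic down sharply, which is the kind of split the abstract describes for the human panel.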

2605.25651 2026-05-26 cs.CV

Hierarchical Consistency Learning for Test-time Adaptation in Camouflage Perception

Mingfeng Zha, Tianyu Li, Guoqing Wang, Yunqiang Pei, Chaofan Qiao, Jiening Zhang, Yang Yang, Heng Tao Shen

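AI summary: Proposes the hierarchical consistency learning (HCL) framework, which uses test-time adaptation to recalibrate representations dynamically and combines hierarchical representation reconstruction, task affinity guidance, and prototype consistency calibration to address domain rigidity and annotation dependency in camouflaged object detection.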

Abstract

Camouflaged object detection (COD) aims to localize targets that exhibit minimal perceptual differences from backgrounds through physical attributes. Existing methods, constrained by the static train-then-freeze paradigm, suffer from domain rigidity and annotation dependency, limiting their adaptability to scene variations and unseen camouflage patterns. To overcome these limitations, we propose the hierarchical consistency learning (HCL) framework, which integrates test-time adaptation for dynamic representation recalibration. Specifically, we design the hierarchical representation reconstruction (HRR) to alleviate feature entanglement by synergizing spatial reconstruction with dual-stream frequency-domain decomposition, enhancing robustness against appearance homogenization. The pixel and spectrum inference provide structural and contextual priors. We further introduce task affinity guidance (TAG) to propagate knowledge across branches via channel-wise affinity, aligning local discriminative cues and mitigating semantic drift. To ensure semantic invariance, we formulate the prototype consistency calibration (PCC), which aggregates region features into compact prototypes and establishes prototype-feature similarity. This imposes implicit and hierarchical constraints that bridge task and representation gaps. Extensive experiments across four camouflaged and four underwater object benchmarks, under three degradation settings, demonstrate that our method consistently outperforms state-of-the-art approaches, highlighting its robustness and generalization under distribution shifts.

2605.25648 2026-05-26 stat.ML cs.LG

StrTransformer: Source-Wise Structured Transformers for Unsupervised Blind Source Recovery

Yuan-Hao Wei

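AI summary: Proposes the StrTransformer framework, which directly optimizes the latent source matrix together with source-wise structured Transformer branches and an observation-space mixer to achieve blind source recovery and branch-wise latent modeling.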

Abstract

This paper proposes StrTransformer, a source-wise structured Transformer framework for blind source recovery and branch-wise latent modeling. Instead of using an encoder to infer latent variables, StrTransformer directly optimizes the latent source matrix together with an observation-space mixer and source-wise structural Transformer branches. The mixer enforces reconstruction consistency, while each Transformer branch imposes a differentiable structural constraint on one latent source trajectory. Specifically, each source is converted into multi-scale patch tokens, randomly masked, processed by a locality-biased Transformer, and evaluated through a masked patch reconstruction energy. This energy acts as an implicit source-wise structural prior. To encourage different latent branches to specialize into different temporal regimes, StrTransformer further introduces an ordered multi-scale controller that learns branch-specific patch-scale weights, ordered scale centers, and locality attention slopes. The resulting objective combines observation reconstruction, source-wise structural regularization, and modular auxiliary penalties for separation and scale specialization. We analyze the decoupling and coupling structure of the objective, the regularized exact-reconstruction fiber, and the reduction of permutation symmetry induced by ordered branch descriptors. A controlled case study shows that the learned branches converge to distinct temporal-scale structures and recover source-aligned latent trajectories under post-hoc evaluation.

2605.25646 2026-05-26 cs.RO

G-DRAGON: Geospatial Reasoning and Dynamic Planning for Retrieval-Augmented Outdoor Navigation

Dongzhihan Wang, Yi Du, Jianan Sun, Yuan Xue, Yingchen Zhang, Bing Xiao, Chen Wang, Liang Xu

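AI summary: Proposes the G-DRAGON framework, which maps natural-language commands to local OSM entities via generative retrieval with a lightweight LLM, couples global route planning with a SLAM system, and uses frontier-based exploration and open-set semantic voxel mapping for last-mile target localization; it outperforms state-of-the-art baselines in simulation and is validated in real-world urban environments.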

Comments
Accepted by IEEE Robotics and Automation Letters (RA-L)
Abstract

Autonomous ground robots operating in large-scale outdoor environments require both robust long-range navigation and fine-grained "last-mile" exploration. Current advances in visual-language navigation (VLN) work well on short-range tasks but lack geospatial grounding for long-distance missions. Some OpenStreetMap (OSM)-based methods that rely on cloud-based Large Language Models (LLMs) are prone to factual hallucination and cannot conduct "last-mile" exploration based on human instructions. To address these challenges, we present G-DRAGON, a retrieval-augmented framework for outdoor, open-world navigation. This framework maps natural-language commands to versioned, local OSM entities via generative retrieval based on a lightweight LLM, yielding accurate coordinates for global route planning. A high-level planning module bridges global topological routes with the SLAM system, projecting geospatial waypoints into the robot's navigable frame. For the "last mile," the framework transitions to frontier-based exploration and open-set semantic voxel mapping to localize open-vocabulary targets. Experimental results in simulation demonstrate that our framework outperforms state-of-the-art baselines. Furthermore, we validate the system in unseen real-world urban environments on an Unmanned Ground Vehicle (UGV), successfully completing person-search missions with trajectories of up to 500 m.

2605.25641 2026-05-26 cs.CL

Iterate Until Retrieved: Factual Nugget Optimization for Discoverable Continual Corrections in Agentic RAG

Moshe Hazoom, Gal Patel, Alon Talmor, Tom Hope

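AI summary: Proposes Iterative Nugget Optimization (INO), which converts feedback into factual nuggets and refines them iteratively against the production agentic RAG system, improving the discoverability and usage of factual corrections.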

Abstract

Agentic retrieval-augmented generation (RAG) systems in complex B2B (business-to-business) settings often receive free-form response feedback. Rather than generic feedback signals such as style, preference, or overall response quality, we focus on actionable factual corrections. We identify these instances and convert them into compact knowledge-base entries, which we call factual nuggets. We introduce Iterative Nugget Optimization (INO), an index-time optimization method that uses the production agentic RAG as a test harness: it creates an initial nugget, probes it with the triggering query and paraphrases, reflects over failed retrieval and answer traces, and revises the nugget until it is discoverable. We evaluate INO with two production B2B knowledge-assistance agents across multiple companies that use our system: a product support agent that answers questions over company-specific knowledge bases, and a support ticket agent that assists support engineers. INO consistently improves results over baselines in terms of discoverability and usage of factual corrections, in automated and human evaluations.
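The create, probe, reflect, revise loop described above can be sketched as a small control loop. This is only an illustration under stated assumptions: `optimize_nugget`, `keyword_retrieved`, and `append_missing_terms` are hypothetical names, and naive keyword matching stands in for the production agentic RAG system and the LLM-based reflection step that the abstract refers to.

```python
def optimize_nugget(nugget, probe_queries, is_retrieved, revise, max_rounds=5):
    """Revise a nugget until every probe query retrieves it, or rounds run out."""
    for _ in range(max_rounds):
        failed = [q for q in probe_queries if not is_retrieved(nugget, q)]
        if not failed:
            return nugget, True
        # "Reflect and revise": hand the failed probes to the revision step.
        nugget = revise(nugget, failed)
    return nugget, all(is_retrieved(nugget, q) for q in probe_queries)

def keyword_retrieved(nugget, query):
    """Toy retriever (assumption): a hit means every query term occurs in the nugget."""
    return set(query.lower().split()) <= set(nugget.lower().split())

def append_missing_terms(nugget, failed_queries):
    """Toy reviser (assumption): append terms from failed probes that the nugget lacks."""
    have = set(nugget.lower().split())
    missing = []
    for query in failed_queries:
        for term in query.lower().split():
            if term not in have:
                have.add(term)
                missing.append(term)
    return nugget + " " + " ".join(missing)

# The triggering query plus one paraphrase, both invented for illustration.
probes = ["refund window", "return period days"]
final_nugget, discoverable = optimize_nugget(
    "refund window is 30 days", probes, keyword_retrieved, append_missing_terms
)
print(final_nugget, discoverable)
```

Passing the retriever and reviser in as callables keeps the loop itself independent of how retrieval or revision is implemented.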

2605.25640 2026-05-26 physics.ins-det cs.LG hep-ex nucl-ex

3D Magnetic Field Reconstruction and Mapping with Physics-Informed Neural Networks

Haohan Yu, Zhanxu Hao, Bingzhi Li, Zejia Lu, Xiang Chen, Liang Li

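AI summary: Proposes a physics-informed neural network (PINN) framework that embeds Maxwell's equations directly in the loss function and adds physics-residual losses at measurement points to achieve high-precision 3D magnetic field reconstruction, reaching $10^{-4}$ accuracy in simulation and the $10^{-3}$ level in experiment.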

Abstract

Accurate reconstruction of magnetic fields in inaccessible regions is vital for many high-precision experiments in physics. Traditional methods, such as spherical harmonic expansion, often suffer from truncation errors that limit their precision. This study proposes an advanced Physics-Informed Neural Network (PINN) framework for high-precision 3D magnetic field mapping. Unlike conventional data-driven models, the proposed PINN integrates Maxwell's equations directly into the loss function, enforcing divergence-free and curl-free conditions across the entire domain. A key innovation is the inclusion of explicit physics-residual losses at measurement locations, ensuring rigorous physical consistency beyond random collocation sampling. Validation using simulated data achieves a reconstruction accuracy of $10^{-4}$, a tenfold improvement over existing PINN benchmarks. Furthermore, experimental validation using a custom coil assembly demonstrates robust reconstruction with sub-percent relative accuracy, reaching the $10^{-3}$ level under ambient conditions. This AI-driven methodology provides a robust, high-precision solution for field monitoring and measurement in complex experimental environments where direct sensor placement is restricted.
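As a rough illustration of the loss structure described above, the sketch below combines a data-misfit term with divergence and curl residuals evaluated at both collocation points and measurement locations. It is not the authors' implementation: a plain Python callable stands in for the neural network, central finite differences replace automatic differentiation, and the function names, weight `w_phys`, and test fields are assumptions made for illustration.

```python
import numpy as np

def divergence_and_curl(field, points, h=1e-4):
    """Central-difference estimates of div B and curl B at each point (N, 3)."""
    points = np.asarray(points, dtype=float)
    jac = np.empty((points.shape[0], 3, 3))  # jac[n, i, j] = dB_i/dx_j at point n
    for j in range(3):
        step = np.zeros(3)
        step[j] = h
        jac[:, :, j] = (field(points + step) - field(points - step)) / (2 * h)
    div = jac[:, 0, 0] + jac[:, 1, 1] + jac[:, 2, 2]
    curl = np.stack([jac[:, 2, 1] - jac[:, 1, 2],
                     jac[:, 0, 2] - jac[:, 2, 0],
                     jac[:, 1, 0] - jac[:, 0, 1]], axis=1)
    return div, curl

def physics_informed_loss(field, meas_points, meas_values, colloc_points, w_phys=1.0):
    """Data misfit plus div/curl residuals at collocation AND measurement points."""
    data_term = np.mean((field(meas_points) - meas_values) ** 2)
    phys_term = 0.0
    for pts in (colloc_points, meas_points):
        div, curl = divergence_and_curl(field, pts)
        phys_term += np.mean(div ** 2) + np.mean(np.sum(curl ** 2, axis=1))
    return data_term + w_phys * phys_term

def source_free_field(p):
    """B = (x, y, -2z): divergence-free and curl-free."""
    return np.stack([p[:, 0], p[:, 1], -2.0 * p[:, 2]], axis=1)

def inconsistent_field(p):
    """B = (x, y, z): curl-free but with divergence 3, so it violates div B = 0."""
    return p.copy()

rng = np.random.default_rng(0)
meas_points = rng.uniform(-1, 1, size=(20, 3))
colloc_points = rng.uniform(-1, 1, size=(200, 3))
meas_values = source_free_field(meas_points)  # synthetic "measurements"

loss_good = physics_informed_loss(source_free_field, meas_points, meas_values, colloc_points)
loss_bad = physics_informed_loss(inconsistent_field, meas_points, meas_values, colloc_points)
print(loss_good, loss_bad)
```

A candidate field that matches the data but violates the source-free conditions is penalized through the residual terms, which is the role the physics losses play in the abstract's description.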

2605.25632 2026-05-26 cs.AI cs.LG q-fin.RM

Insuring Every Action: An Authority Frontier Framework for Runtime Actuarial Control of Autonomous AI Agents

Hao-Hsuan Chen

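AI summary: Proposes the Actuarial Action Interface (AAI) and the Authority Frontier framework, which price, gate, and evaluate the side-effect-bearing actions of autonomous AI agents through a deterministic runtime contract, enabling cross-domain actuarial control and benchmarking.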

Comments
35 pages, 4 figures, 11 tables. Companion paper on the mathematical foundations: SSRN 6761960
Abstract

Autonomous AI agents increasingly issue side-effect-bearing actions: database mutations, refunds, payments, external commitments. We propose the Actuarial Action Interface (AAI), a deterministic runtime contract that prices each such action against a contractually fixed safe default under a time-consistent risk mapping, and gates execution against a per-boundary reserve capital budget. We then develop the Authority Frontier, an evaluation primitive measuring how much autonomous authority the runtime releases at each level of reserve capital. The framework provides (i) a deterministic quote-bind-commit protocol with toll-bounded capability tokens; (ii) a universal seven-class action taxonomy mapping heterogeneous tool calls to comparable authority units; (iii) replay determinism and pathwise reserve coverage under alpha-spending; (iv) cross-domain normalization via full reserve demand C_full and capital metrics Capital@k. We instantiate AAI across four agentic environments (database mutation, customer-service refund, and the public tau-bench retail and airline tool-use traces) and report a live Postgres panel in which three Azure-hosted models propose actions through the same contract. The frontier exhibits a common low-reserve refusal and intermediate-release pattern across domains, with saturation only where the budget grid reaches full reserve demand; required reserve capital varies by 22x (Capital@50 from 289 to 6457). The framework does not force domains into the same shape; it surfaces each domain's actuarial geometry. In the live panel the contract prevents realized loss across all three models at low budget while differing in underwriting persistence under denial: model identity is an actuarial underwriting variable. The contribution is a benchmark-ready evaluation framework for runtime actuarial control of autonomous-agent side effects.

2605.25626 2026-05-26 cs.CL

Beyond Literal Translation: Evaluating Cultural Effectiveness in Social Media UGC

Linjuan Wu, Ruiqi Zhang, Xinze Lyu, Ye Guo, Daoxin Zhang, Zhe Xu, Yao Hu, Yixin Cao, Yongliang Shen, Weiming Lu

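AI summary: To address inadequate cultural transmission and emotional resonance in translating social media user-generated content, proposes the CULTURE-MT benchmark of 1,002 UGC notes spanning 14 domains and four culture-loaded types, together with a cultural effectiveness criterion; experiments show that traditional metrics fail to capture cultural effectiveness and that cultural effectiveness on base LLMs correlates with model size.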

Comments
Accepted by ICML2026
Abstract

Social media platforms enable large-scale cross-lingual communication, but translating user-generated content (UGC) remains challenging due to its informal style, cultural references, and interaction-based expressions. While recent LLMs have improved translation quality, existing benchmarks and metrics often fail to capture whether translations convey intended meaning and cultural resonance in real-world settings. In this work, we introduce CULTURE-MT, a benchmark for social media translation that focuses on both CULtural Transmission and UGC-specific emotion REsonance. CULTURE-MT consists of 1,002 UGC notes across 14 domains, categorized into four types based on culture-loaded symbols and linguistic style features. We also construct UGC-oriented training data to fine-tune Qwen3-8B and Qwen3-32B as baselines. We propose cultural effectiveness as a new evaluation criterion, focusing on expression accuracy and cultural adaptability. Testing 15 models, including the baselines, we find that traditional metrics fail to capture cultural effectiveness. We also observe that cultural effectiveness on base LLMs correlates with model size. Our work provides a comprehensive evaluation system for UGC translation models and will offer an open evaluation platform to advance research in this area. We release the CULTURE-MT benchmark and provide an online leaderboard where submitted translation results can be evaluated by our trained JUDGER.

2605.25621 2026-05-26 cs.CV

StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering

Ming Xie, Zizheng Huang, Xudong Tan, Chao Wang, Xiangyu Zeng, Wenxiao Wu, Tao Chen, Limin Wang, Yanwei Fu

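AI summary: Proposes the StreamOV framework, which uses a multimodal evidence-guided long-short term memory and a hidden-state-driven trigger to support online reasoning and proactive responses in streaming omni-video understanding, and introduces the SOVBench benchmark, with state-of-the-art results reported across streaming and omni-video benchmarks.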

Abstract

While streaming omni-video understanding demands continuous perception and proactive, real-time interaction, this crucial area remains largely under-explored. Current omni-modal methods are inherently designed for offline settings, limiting their applicability in streaming scenarios due to two fundamental flaws. First, they lack robust mechanisms to manage continuously growing audio-visual context over long horizons and cannot autonomously initiate responses at opportune moments. Second, existing benchmarks are predominantly confined to offline, single-turn question answering, failing to capture continuous, multi-turn streaming interactions. To bridge these gaps, we propose StreamOV, a novel Streaming Omni-Video understanding framework for efficient online audio-visual reasoning with bounded memory and proactive response triggering. Specifically, StreamOV introduces a multimodal evidence-guided long-short term memory that condenses historical audio-visual context into compact informative evidence under a fixed budget. It further employs a hidden-state-driven trigger to decide when to respond, avoiding explicit silence-token generation and external routers. We also curate SOVBench, the first comprehensive benchmark for online, multi-turn omni-modal evaluation. Extensive experiments show that StreamOV achieves state-of-the-art performance across diverse streaming and omni-video benchmarks, demonstrating its effectiveness for both online and offline video understanding.

2605.25620 2026-05-26 cs.AI

Back to Parsimonious Latents: Learning Task-Centric World Models from Visual Foundations

Minghao Fu, Fan Feng, Nicklas Hansen, Biwei Huang

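AI summary: Proposes the TC-WM framework, which linearly projects pretrained visual embeddings into a compact latent state, aligns a subspace via contrastive learning, and reconstructs the embeddings, turning foundation-model features into task-sufficient world representations with better world-modeling quality and control precision.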

Abstract

World models enable agents to predict future dynamics conditioned on actions, making the choice of latent representation central to planning and control. Such representations are often either learned directly from pixels with limited semantic structure or inherited from frozen visual foundation models with excessive task-irrelevant detail, yielding state spaces that are poorly matched to downstream planning and control. This is especially challenging in reward-free offline settings, where the model must learn from fixed trajectories without reward supervision or online interaction. To address this, we propose TC-WM, a framework for turning foundation-model embeddings into compact, task-sufficient world representations. The key design is to treat the pretrained embedding space as a semantic scaffold rather than as the final state space: TC-WM linearly projects high-dimensional visual embeddings into a compact latent as the dynamic space, aligns a subspace with the agent's physical state via contrastive learning, and reconstructs embeddings to preserve useful visual structure. This combines the generality of foundation features with the controllability of task-centric dynamics. Theoretically, we show that TC-WM suffices to identify the underlying task-centric latent factors up to a simple transformation. Empirically, TC-WM enables test-time planning across diverse environments (e.g., Robomimic and D4RL), achieving better world-modeling quality and more precise control than state-of-the-art approaches.

2605.25619 2026-05-26 cs.LG

Analogies between Transformer Layers and Power Method

Chenglong Li, Claudio Altafini

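AI summary: Shows an analogy between the operations in a Transformer layer (projections and layer normalization, disregarding the feedforward network) and a step of the power method, shows that tokens passing through a layer tend to align with the principal eigenvector of the product of that layer's output and value weight matrices, and proposes a way to steer the Transformer output toward an arbitrary desired direction.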

Abstract

In this paper we show that there is an analogy between the operations occurring in a layer of a transformer (projections and layer normalizations, disregarding the feedforward neural network) and a step in the power method. Consistent with this analogy, we show that, when passing through a layer, the tokens tend to be tilted towards the principal eigenvector of a matrix which is the product of the output and value weight matrices of that layer. In the special case of a transformer with shared weights (i.e., one in which all layers have identical weights), the alignment with this principal eigenvector is particularly evident empirically, and can also be shown analytically. The analogy also suggests a method to steer the output of the transformer towards an arbitrary desired direction in token space.
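For readers unfamiliar with the power method, the sketch below shows the classical iteration: multiply by a matrix, renormalize, repeat. It is a toy under assumptions, not the paper's construction: `w_value` and `w_output` are hypothetical weight matrices deliberately tied so that their product is symmetric with a known spectrum, plain L2 normalization stands in for layer normalization, and reusing the same matrix at every step loosely mirrors the shared-weight case.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 6

# Hypothetical weights, tied (w_output = w_value.T) so that the product
# m = w_output @ w_value equals q @ diag(eigenvalues) @ q.T.
q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
eigenvalues = np.array([4.0, 2.0, 1.0, 0.5, 0.25, 0.125])
w_value = np.diag(np.sqrt(eigenvalues)) @ q.T
w_output = w_value.T
m = w_output @ w_value
principal = q[:, 0]  # eigenvector belonging to the largest eigenvalue (4.0)

def power_step(x, matrix):
    """One power-method step: apply the matrix, then rescale to unit length."""
    y = matrix @ x
    return y / np.linalg.norm(y)

token = np.ones(dim) / np.sqrt(dim)  # arbitrary starting "token"
alignment = []
for _ in range(40):
    alignment.append(abs(principal @ token))  # |cosine| with the principal eigenvector
    token = power_step(token, m)
alignment.append(abs(principal @ token))

print(alignment[0], alignment[-1])
```

Each step shrinks the non-principal components by at least the ratio of the second to the first eigenvalue (0.5 here), so the alignment approaches 1 geometrically.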

2605.25616 2026-05-26 cs.LG stat.ML

Courtroom Analogy: New Perspective on Uncertainty-Aware Classification

Taeseong Yoon, Heeyoung Kim

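AI summary: Proposes the courtroom analogy framework, which models uncertainty aggregation in classification with a structured mixture of Dirichlet distributions, and designs the single-pass neural network MoDEX for efficient, interpretable uncertainty quantification.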

Comments
ICML 2026
Abstract

Single-pass uncertainty quantification (UQ) methods for classification represent uncertainty by predicting a tractable distribution over the class probability vector. While existing approaches primarily focus on enhancing the expressiveness of this distribution, they often provide limited insight into how predictive uncertainty is structured and aggregated, resulting in weak interpretability. We introduce the courtroom analogy, which conceptualizes uncertainty-aware classification as a structured debate among class-specific advocates. Each advocate forms a probabilistic opinion, and a final verdict is reached by aggregating these opinions using input-dependent plausibility weights. In this framework, each advocate's opinion is modeled as a Dirichlet distribution whose concentration parameter is decomposed into shared evidence and class-specific advocacy. This yields a structured mixture of Dirichlet distributions with semantically interpretable parameters. To instantiate this formulation, we propose Mixture of Dirichlet EXperts (MoDEX), a single-pass neural architecture that predicts the courtroom parameters, enabling efficient and expressive UQ while explicitly modeling uncertainty aggregation. We demonstrate that MoDEX enjoys strong theoretical properties and achieves state-of-the-art UQ performance across diverse benchmarks, yielding interpretable uncertainty estimates with meaningful semantics.
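The aggregation step can be illustrated with a small numerical sketch of a mixture of Dirichlet distributions. The abstract does not give the exact parameterization, so the form used here is an assumption: advocate $k$ has concentration equal to a shared evidence vector plus an extra `advocacy[k]` on its own class, and the verdict is the plausibility-weighted mean of the advocates' Dirichlet means. The function name and numbers are hypothetical.

```python
import numpy as np

def mixture_of_dirichlet_mean(shared_evidence, advocacy, weights):
    """Predictive class probabilities of a mixture of Dirichlet distributions.

    Assumed form (not taken from the paper): advocate k has concentration
    alpha_k = shared_evidence + advocacy[k] * e_k, with e_k the k-th unit vector.
    """
    shared = np.asarray(shared_evidence, dtype=float)  # shape (C,)
    advocacy = np.asarray(advocacy, dtype=float)       # shape (C,)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                    # plausibility weights sum to 1
    alphas = shared[None, :] + np.diag(advocacy)       # row k is alpha_k, shape (C, C)
    # The mean of Dirichlet(alpha) is alpha / sum(alpha).
    means = alphas / alphas.sum(axis=1, keepdims=True)
    # The mean of a mixture is the weighted sum of the component means.
    return w @ means, alphas

verdict, alphas = mixture_of_dirichlet_mean(
    shared_evidence=[1.0, 1.0, 1.0],
    advocacy=[4.0, 1.0, 1.0],
    weights=[0.8, 0.1, 0.1],
)
print(verdict)
```

In this toy setting the total concentration of each row of `alphas` indicates how confident that advocate is, while the weights indicate how much each advocate counts toward the final verdict.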

2605.25615 2026-05-26 cs.CV

UAV-OVO: Out-of-Viewpoint Generalization in UAV Action Recognition

Yu Xia, Zhengbo Zhang, Shuaihu Zhang, Zhigang Tu

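AI summary: To address the performance drop caused by mismatched training and test viewpoints in UAV action recognition, proposes the UAV-OVO benchmark and the LATER method, which achieve viewpoint-robust generalization through view isolation and LoRA-anchored feature re-centering.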

Abstract

UAV action recognition faces a deployment shift that standard benchmarks often obscure: a model trained on UAV footage captured from low-depression viewpoints may be required to recognize the same action classes from high-depression viewpoints. While the action labels remain unchanged, this shift alters body visibility, motion projection, and scene context, encouraging models to rely on viewpoint-specific shortcuts. We introduce UAV-OVO, an Out-of-Viewpoint generalization benchmark for UAV action recognition. UAV-OVO derives view scores from uncalibrated videos, uses a view-isolation band to assign low-depression videos to the training and in-distribution test splits while reserving high-depression videos for out-of-distribution testing, and constructs ID/OOD test sets matched by class distribution so that performance differences reflect viewpoint shift rather than label imbalance. Across representative video recognizers, UAV-OVO reveals a substantial ID/OOD gap: models that fit the low-depression training distribution well often fail to transfer to held-out high-depression views, exposing viewpoint shortcuts hidden by aggregate accuracy. We further propose LATER, LoRA-Anchored Test-time Re-centering, which first adapts the recognizer with Low-Rank Adaptation (LoRA) and then uses the learned LoRA subspace as a semantic anchor for online feature re-centering. Specifically, LATER projects target-domain displacement onto the orthogonal complement of the LoRA subspace before re-centering features, reducing viewpoint-induced drift while preserving task-relevant semantics. Together, UAV-OVO and LATER provide a controlled testbed and a practical adaptation method for viewpoint-robust UAV video understanding.
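The projection step in LATER can be sketched with basic linear algebra. The version below is a batch simplification under assumptions: the LoRA subspace is represented by the column span of a hypothetical matrix `lora_basis`, the displacement is the difference between the target batch mean and a stored source mean, and the online updating described in the abstract is omitted.

```python
import numpy as np

def recenter_outside_subspace(features, source_mean, subspace_basis):
    """Remove only the part of the mean shift that lies outside a given subspace."""
    # Orthonormal basis of the subspace (reduced QR: q has shape (dim, rank)).
    q, _ = np.linalg.qr(subspace_basis)
    displacement = features.mean(axis=0) - source_mean
    # Project the displacement onto the orthogonal complement of span(q).
    off_subspace = displacement - q @ (q.T @ displacement)
    return features - off_subspace, off_subspace, q

rng = np.random.default_rng(0)
dim, rank = 16, 4
lora_basis = rng.standard_normal((dim, rank))            # hypothetical LoRA directions
source_mean = np.zeros(dim)                              # assumed source feature mean
target_features = rng.standard_normal((32, dim)) + 3.0   # toy shifted target batch

recentered, off_subspace, q = recenter_outside_subspace(
    target_features, source_mean, lora_basis
)
print(np.linalg.norm(off_subspace))
```

Because the subtracted vector has no component inside the subspace, the coordinates of every feature along the assumed LoRA directions are left unchanged, which matches the stated goal of reducing drift while preserving task-relevant semantics.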

2605.25612 2026-05-26 cs.LG cs.AI

Towards the Connection between Activation Sparsity and Flat Minima

Ze Peng, Jian Zhang, Lei Qi, Yang Gao, Yinghuan Shi

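AI summary: Finds that the flatness of the loss landscape is closely related to MLP activation sparsity in Transformers, and strengthens sparsity through theoretical analysis and three practical methods, pointing to potential cost reductions in both inference and training.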

Abstract

The observation that activation sparsity emerges in the MLP blocks of standardly trained Transformers offers an opportunity to drastically reduce computation costs without sacrificing performance. To theoretically explain this phenomenon, existing works have shown that activation sparsity does not result from the data properties or data fitting but from the implicit bias of the training process. However, these connections are obtained under strong assumptions that cannot be applied to deep models standardly trained with a large number of steps. Unlike these works, we find that the flatness of loss landscapes is also closely related to the MLP activation sparsity and can serve as a weaker and naturally emerging assumption for standard deep networks. Specifically, 1) we find that the MLP activation sparsity equals a ratio between "augmented flatness" (a weighted sum of flatness measures) and the product of the input norm and activation gradient of the MLP. We empirically find that this ratio decreases during training, leading to sparse activations. 2) We also propose the notion of derivative sparsity, which reduces to activation sparsity under ReLU, but further enables pruning in the backward propagation and is more stable than activation sparsity. Building on these theoretical findings, we further encourage activation sparsity by decreasing the numerator and increasing the denominator of the ratio using three methods. These plug-and-play modifications can effectively reduce the ratio and produce sparser activations. Experiments on ImageNet-1K and C4 demonstrate relative improvements of at least 36% on inference sparsity and at least 50% on training sparsity over vanilla Transformers, indicating further potential cost reduction in both inference and training.
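The statement that derivative sparsity reduces to activation sparsity under ReLU can be checked numerically. The sketch below assumes that both quantities are measured as the fraction of exactly-zero entries and that the ReLU derivative at zero is taken to be zero; the paper's precise definitions may differ, and the function name is hypothetical.

```python
import numpy as np

def relu_sparsities(pre_activations):
    """Fraction of zero activations and of zero activation derivatives under ReLU."""
    z = np.asarray(pre_activations, dtype=float)
    activations = np.maximum(z, 0.0)      # ReLU(z)
    derivatives = (z > 0).astype(float)   # ReLU'(z), with ReLU'(0) taken as 0
    activation_sparsity = np.mean(activations == 0.0)
    derivative_sparsity = np.mean(derivatives == 0.0)
    return activation_sparsity, derivative_sparsity

rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 64))  # toy MLP pre-activations (tokens x units)
act_sparsity, deriv_sparsity = relu_sparsities(hidden)
print(act_sparsity, deriv_sparsity)
```

For a smooth activation such as GELU, the derivative is almost never exactly zero, so a practical definition of derivative sparsity would presumably need a threshold; that detail is not covered by the abstract.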

2605.25608 2026-05-26 stat.ML cs.LG

Learning Sparse Compositional Functions with Norm-Constrained Neural Networks

学习具有范数约束神经网络的稀疏组合函数

Shuo Huang, Lorenzo Fiorito, Lorenzo Rosasco, Tomaso Poggio

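AI summary: Using norm-constrained deep neural networks, this paper establishes approximation rates and excess risk bounds for learning sparse compositional functions, showing that deep networks can exploit hierarchical representations to avoid the curse of dimensionality.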

详情
AI中文摘要

深度神经网络学习层次特征的能力被广泛认为是其在高维学习中成功的关键机制。现有理论通过基于参数计数的逼近率和组合模型的无维数灾难样本复杂度保证,部分支持了这一观点。为了研究参数数量超过样本量的过参数化场景,我们开发了一个通过参数范数衡量复杂度的框架。在该方法中,我们使用Frobenius范数约束的深度神经网络,为学习稀疏组合函数建立了逼近率和过风险界,其中组合函数的组合结构由有向无环图表示。我们的结果具有广泛的适用性,因为每个可有效图灵计算的函数都具有稀疏组合表示。特别地,我们涵盖了一系列代表性模型,包括多指标模型、二叉树结构和一般组合架构。我们推导的速率表明,深度网络可以利用目标函数的组合结构,通过层次表示有效避免维数灾难。

英文摘要

The ability of deep neural networks to learn hierarchical features is widely regarded as a key mechanism underlying their success in high-dimensional learning. Existing theory partially supports this view by establishing approximation rates based on parameter counts and sample complexity guarantees for compositional models without incurring the curse of dimensionality (CoD). To study overparameterized regimes, where the number of parameters exceeds the sample size, we develop a framework that measures complexity via the parameter norm. Within this approach, we establish approximation rates and excess risk bounds for learning sparse compositional functions whose compositional structure is represented by directed acyclic graphs (DAGs), using Frobenius norm-constrained deep neural networks. Our results have broad applicability since every function that is efficiently Turing computable admits sparse compositional representations. In particular, we cover a range of representative models, including multi-index models, binary tree structures, and general compositional architectures. The rates we derive show that deep networks can exploit the compositional structure of the target functions, effectively avoiding the CoD through hierarchical representations.

2605.25605 2026-05-26 eess.AS cs.LG

Decoding Stimulus Reconstruction-Based Auditory Attention Robustly in Unbalanced EEG Datasets

在不平衡EEG数据集中基于刺激重建的听觉注意力鲁棒解码

Yuanming Zhang, Yayun Liang, Zhibin Lin, Jing Lu

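AI summary: Studies how dataset imbalance affects the performance of stimulus reconstruction-based auditory attention decoding and proposes a leave-one-paired-envelope-out cross-validation protocol to prevent inflated decoding accuracy.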

详情
AI中文摘要

在过去十年中,许多研究通过刺激重建从脑电图信号中应用深度神经网络解码听觉注意力。然而,数据集平衡对基于刺激重建的AAD解码性能的影响尚未被探索。在本研究中,使用三个公开的EEG-AAD数据集——KUL、DTU和NJU cEEGrid——构建平衡和不平衡的实验条件。我们假设并证明基于刺激重建的DNN解码器倾向于在不平衡数据集上产生高估的解码性能。为了解决这个问题,我们提出了一种留一对包交叉验证协议。实验结果证实,LOPEO有效防止了在不平衡数据集上的解码准确率膨胀。虽然平衡数据集在实验设计中通常更受青睐,但LOPEO为已经发表的不平衡数据集提供了一个原则性的评估框架,填补了该领域的一个重要空白。

英文摘要

In the past decade, numerous studies have applied deep neural networks (DNNs) to auditory attention decoding (AAD) from electroencephalogram (EEG) signals via stimulus reconstruction. However, the influence of dataset balance on the decoding performance of stimulus reconstruction-based AAD remains unexplored. In this study, three publicly available EEG-AAD datasets (KUL, DTU, and NJU cEEGrid) are used to construct both balanced and unbalanced experimental conditions. We hypothesize and demonstrate that stimulus reconstruction-based DNN decoders tend to produce overestimated decoding performance on unbalanced datasets. To address this issue, we propose a leave-one-paired-envelope-out (LOPEO) cross-validation protocol. Experimental results confirm that LOPEO effectively prevents inflated decoding accuracy on unbalanced datasets. While balanced datasets are generally preferred in experimental design, LOPEO provides a principled evaluation framework for unbalanced datasets that have already been published, filling an important gap in the field.

2605.25604 2026-05-26 cs.CL cs.LG

DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning

DVAO: 面向多奖励强化学习的动态方差自适应优势优化

Guochao Jiang, Jingyi Song, Guofeng Quan, Chuzhan Hao, Guohua Liu, Yuewei Zhang

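AI summary: To address training instability caused by reward combination and the reliance of advantage combination on static hyperparameters in multi-reward reinforcement learning, proposes Dynamic Variance-adaptive Advantage Optimization, which adjusts combination weights dynamically based on empirical reward variance to achieve stable training and an improved multi-objective Pareto frontier.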

详情
AI中文摘要

强化学习已成为将大型语言模型与人类意图和任务要求对齐的标准范式。尽管组相对策略优化为近端策略优化提供了一种高效、无价值模型的替代方案,但将其适应于现实世界的多奖励设置仍然具有挑战性。标准的标量化实践,如奖励组合和优势组合,存在显著缺陷:奖励组合经常产生平方幅度过大的优势,导致训练不稳定;而优势组合依赖静态超参数,忽略了跨目标相关性。为了解决这些限制,我们提出了动态方差自适应优势优化(DVAO),它根据 rollout 组内每个目标的经验奖励方差动态调整组合权重,有效提高具有更强学习信号的目标的权重,同时抑制噪声目标。我们从数学上证明 DVAO 保持有界的优势幅度以实现稳定训练,并引入了一种自适应的跨目标正则化机制。使用 Qwen3 和 Qwen2.5 模型在数学推理和工具使用基准上的大量实验表明,DVAO 显著优于基线方法,实现了卓越的多目标帕累托前沿和稳健的训练稳定性。

英文摘要

Reinforcement Learning has become a standard paradigm for aligning Large Language Models with human intent and task requirements. While Group Relative Policy Optimization offers an efficient, value-model-free alternative to Proximal Policy Optimization, adapting it to real-world multi-reward settings remains challenging. Standard scalarization practices, such as Reward Combination and Advantage Combination, suffer from significant drawbacks: Reward Combination frequently generates advantages with excessively large squared magnitudes that lead to training instability, while Advantage Combination relies on static hyperparameters and ignores cross-objective correlations. To address these limitations, we propose Dynamic Variance-adaptive Advantage Optimization (DVAO), which dynamically adjusts combination weights based on the empirical reward variance of each objective within a rollout group, effectively up-weighting objectives with a stronger learning signal while suppressing noisy ones. We mathematically prove that DVAO maintains bounded advantage magnitudes for stable training and introduces a self-adaptive cross-objective regularization mechanism. Extensive experiments on mathematical reasoning and tool-use benchmarks using Qwen3 and Qwen2.5 models demonstrate that DVAO significantly outperforms baseline methods, achieving a superior multi-objective Pareto frontier and robust training stability.
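The abstract above describes weighting each objective by its empirical reward variance within a rollout group, but it does not give the formula. The sketch below is one possible reading, not the authors' method: it assumes weights proportional to the per-objective population variance, applied to group-normalized advantages. The function names, the proportional rule, and the uniform fallback are all hypothetical.

```python
from statistics import mean, pstdev, pvariance

def group_normalize(rewards, eps=1e-8):
    """Group-relative advantage: center and scale rewards within one rollout group."""
    center = mean(rewards)
    spread = pstdev(rewards)
    return [(r - center) / (spread + eps) for r in rewards]

def variance_adaptive_advantages(rewards_per_objective, eps=1e-8):
    """Combine per-objective advantages with variance-proportional weights (assumed rule)."""
    variances = [pvariance(r) for r in rewards_per_objective]
    total = sum(variances)
    count = len(rewards_per_objective)
    if total > eps:
        weights = [v / total for v in variances]
    else:
        # No objective separates the rollouts: fall back to uniform weights.
        weights = [1.0 / count] * count
    normalized = [group_normalize(r, eps) for r in rewards_per_objective]
    group_size = len(rewards_per_objective[0])
    advantages = [
        sum(w * n[i] for w, n in zip(weights, normalized))
        for i in range(group_size)
    ]
    return weights, advantages

# Hypothetical group of four rollouts scored on two objectives.
correctness = [1.0, 0.0, 1.0, 0.0]   # separates the rollouts
format_ok = [0.5, 0.5, 0.5, 0.5]     # constant within the group, no signal
weights, advantages = variance_adaptive_advantages([correctness, format_ok])
```

In this toy group the constant objective receives zero weight. Because the weights are non-negative and sum to one, the combined advantage cannot exceed the largest per-objective normalized advantage in magnitude.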

2605.25603 2026-05-26 cs.AI

Detecting Unfaithful Chain-of-Thought via Circuit-Guided Internal-External Discrepancy

通过电路引导的内外不一致性检测不忠实的思维链

Xu Shen, Zhen Tan, Song Wang, Pingjun Hong, Rui Miao, Xin Wang, Tianlong Chen

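AI summary: Proposes the CIE-Scorer framework, which traces sentence-level circuits and measures the discrepancy between internal and external reasoning graphs with the Fused Gromov-Wasserstein distance to detect unfaithful chain-of-thought at the instance level; it achieves the best performance on FaithCoT-Bench while reducing circuit construction cost.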

详情
AI中文摘要

思维链(CoT)推理提高了大型语言模型(LLMs)的问题解决能力,但生成的推理轨迹可能并不忠实地反映模型的实际决策过程。现有的CoT不忠实检测器主要依赖于生成理由的外部信号,如文本合理性或答案一致性,而忽略了来自模型内部计算的证据。尽管最近的电路追踪方法通过追踪推理过程中信息如何在模型组件间流动提供了获取模型内部证据的途径,但为长CoT构建完整推理电路成本高昂且难以扩展。为应对这些挑战,我们提出了电路引导的内外不一致性评分器(CIE-Scorer),一个用于实例级CoT不忠实检测的框架。关键思想是,忠实的推理轨迹应与模型的计算过程一致,而不忠实的轨迹可能偏离它。CIE-Scorer从信息丰富的推理令牌中高效追踪紧凑的句子级电路,构建内部和外部推理图,并使用Fused Gromov-Wasserstein距离度量它们的不一致性。在FaithCoT-Bench的四个数据集上的实验表明,CIE-Scorer在降低电路构建成本的同时实现了最先进的性能,证明了将机械可解释性信号与外部推理轨迹相结合用于CoT不忠实检测的有效性。

英文摘要

Chain-of-thought (CoT) reasoning improves the problem-solving ability of large language models (LLMs), but generated reasoning traces may not faithfully reflect the model's actual decision process. Existing CoT unfaithfulness detectors mainly rely on external signals from generated rationales, such as textual plausibility or answer consistency, while overlooking evidence from the model's internal computation. Although recent circuit tracing methods provide a way to obtain model-internal evidence by tracing how information flows through model components during reasoning, constructing full reasoning circuits for long CoTs is costly and difficult to scale. To address these challenges, we propose Circuit-guided Internal-External Discrepancy Scorer (CIE-Scorer), a framework for instance-level CoT unfaithfulness detection. The key idea is that faithful reasoning traces should align with the model's computational process, whereas unfaithful traces may diverge from it. CIE-Scorer efficiently traces compact sentence-level circuits from informative reasoning tokens, constructs internal and external reasoning graphs, and measures their discrepancy using Fused Gromov--Wasserstein distance. Experiments on four datasets from FaithCoT-Bench show that CIE-Scorer achieves state-of-the-art performance while reducing the cost of circuit construction, demonstrating the effectiveness of combining mechanistic interpretability signals with external reasoning traces for CoT unfaithfulness detection.

2605.25601 2026-05-26 cs.CL cs.AI

Toward a Benchmark for Controllable Simulation of Imperfect Students with Large Language Models

面向大语言模型可控模拟不完美学生的基准

Alexander Apartsin, Omri Sason, Yehudit Aperstein

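AI summary: Proposes a benchmark framework that uses prompting to steer language models into simulating students with specified skill profiles and evaluates their controllability, in support of deliberate practice in teacher education.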

详情
Comments
22 pages, 7 figures
AI中文摘要

教师教育需要与表现出可识别优势、弱点和部分掌握的学习者进行刻意练习。大型语言模型可以通过模拟具有已知技能组成部分的学生来支持这种练习,使教师能够演练解释、诊断和教学回应。然而,为此目的,核心要求既不是最大化基准准确率,也不是抑制孤立的事实,而是控制模型行为,使其反映指定的技能轮廓。本文研究了是否可以通过提示引导语言模型保留某些技能同时抑制其他技能。我们引入了一个面向基准的框架,其中显式技能向量表示模拟学生,基于提示的控制指定保留和缺失的能力,并使用轮廓对齐指标、保留与遗忘比较以及跨技能校准分析来评估行为。结果表明,在结构化数学环境中可以诱导和测量选择性的部分掌握,尽管可控程度仍依赖于模型。这些发现将可控学习者模拟定位为教师教育、教育模拟和语言模型控制交叉领域的一个独特研究问题。

英文摘要

Teacher education requires deliberate practice with learners who exhibit identifiable strengths, weaknesses, and partial mastery. Large language models could support such practice by simulating students with known skill components, enabling teachers to rehearse explanations, diagnoses, and instructional responses. For this purpose, however, the central requirement is neither to maximize benchmark accuracy nor to suppress isolated facts, but to control model behavior so that it reflects a specified skill profile. This paper investigates whether prompted language models can be steered to retain some skills while suppressing others. We introduce a benchmark-oriented framework in which an explicit skill vector represents a simulated student, prompt-based control specifies retained and missing competencies, and behavior is evaluated using profile-alignment metrics, retained-versus-forgotten comparisons, and cross-skill calibration analyses. The results show that selective partial mastery can be induced and measured in a structured mathematics setting, although the degree of controllability remains model-dependent. These findings position controllable learner simulation as a distinct research problem at the intersection of teacher education, educational simulation, and language-model control.

2605.25599 2026-05-26 cs.LG cs.CV

Generalized Evidential Deep Learning: From a Bayesian Perspective

广义证据深度学习:从贝叶斯视角

Yuanye Liu, Yibo Gao, Yuanyang Chen, Xiahai Zhuang

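AI summary: Starting from a generalized Bayesian framework, this paper establishes a theoretical foundation for evidential deep learning and proposes a unified, extensible Generalized Evidential Deep Learning framework that achieves comparable results on classification, uncertainty estimation, and OOD detection.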

详情
Comments
Submitted to ICML2026
AI中文摘要

证据深度学习(EDL)已成为一种高效、无需采样的不确定性估计策略。一系列EDL变体被提出以解决原始框架的特定局限性,并取得了显著成功。然而,EDL的基本理论结构以及这些变体之间的关系尚未得到系统研究。在这项工作中,我们通过在广义贝叶斯框架内解释EDL,包括先验规范、后验更新和训练目标,为其建立了原则性的理论基础。我们进一步从贝叶斯分布不确定性角度刻画了证据不确定性,并通过渐近分析建立。基于这一视角,我们进一步提出了广义证据深度学习(GEDL),这是一个统一且可扩展的框架,明确解耦了各个组件的作用,并将GEDL与现有变体系统地联系起来。大量实验表明,GEDL在分类、不确定性估计和OOD检测上取得了可比的结果,并具有理论依据。

英文摘要

Evidential Deep Learning (EDL) has emerged as an efficient, sampling-free strategy for uncertainty estimation. A series of EDL variants has been proposed to address specific limitations of the original framework, achieving notable success. However, the underlying theoretical structure of EDL and the relationships among these variants have received limited systematic investigation. In this work, we establish a principled theoretical foundation for EDL by interpreting it within a generalized Bayesian framework that includes prior specification, posterior update, and training objective. We further characterize evidential uncertainty from a Bayesian distributional uncertainty viewpoint, established via asymptotic analysis. Building on this perspective, we propose Generalized Evidential Deep Learning (GEDL), a unified and extensible framework that explicitly disentangles the roles of individual components and systematically relates GEDL to existing variants. Extensive experiments demonstrate that GEDL yields comparable results on classification, uncertainty estimation, and OOD detection, with theoretical grounding.
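For readers unfamiliar with the sampling-free uncertainty estimate mentioned above, the following sketch shows the Dirichlet parameterisation commonly used in the original EDL formulation: non-negative per-class evidence is mapped to Dirichlet concentrations, and uncertainty shrinks as total evidence grows. It illustrates baseline EDL only, not GEDL, and the function name is an assumption.

```python
def dirichlet_from_evidence(evidence):
    """Map non-negative per-class evidence to Dirichlet parameters and summary statistics."""
    alpha = [e + 1.0 for e in evidence]          # concentration parameters
    strength = sum(alpha)                        # Dirichlet strength S
    probs = [a / strength for a in alpha]        # expected class probabilities
    uncertainty = len(alpha) / strength          # K / S, equals 1 with no evidence
    return alpha, probs, uncertainty

# No evidence at all: uniform prediction, maximal uncertainty.
_, probs_none, u_none = dirichlet_from_evidence([0.0, 0.0, 0.0])

# Strong evidence for class 0: confident prediction, low uncertainty.
_, probs_conf, u_conf = dirichlet_from_evidence([9.0, 0.0, 0.0])
```

A single forward pass that outputs the evidence vector is enough to obtain both the prediction and its uncertainty, which is why no sampling is needed.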

2605.25598 2026-05-26 cs.CV

SurfSurg6D: Geometry Consistent Dense Correspondence for Textureless Surgical Instrument Pose Estimation

SurfSurg6D:面向无纹理手术器械位姿估计的几何一致密集对应

Daiyun Shen, Shuojue Yang, Chang Han Low, Qian Li, Mengya Xu, Qi Dou, Yueming Jin

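AI summary: To address data scarcity and geometric consistency challenges in textureless surgical instrument pose estimation, this paper constructs the SynSurg6D dataset and proposes the SurfSurg6D dense-correspondence framework, achieving RGB-only pose estimation that outperforms existing methods on multiple datasets.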

详情
AI中文摘要

手术器械位姿估计为自主机器人手术、技能评估和手术工作流程标准化等有前景的应用提供了关键信息。然而,由于高精度要求、频繁遮挡、无纹理器械、深度信息稀缺以及标注数据非常有限,该任务仍然极具挑战性。这些限制导致在将通用物体位姿估计方法应用于手术场景时性能往往不理想。为解决这些问题,我们首先构建了一个新数据集SynSurg6D,以缓解该任务中的数据短缺问题。我们进一步提出了SurfSurg6D,一个专为手术器械位姿估计设计的密集对应框架。在SurgRIPE、EndoVis2018和SurgPose数据集上的实验结果表明,我们生成的SynSurg6D数据集能够多样化位姿分布,从而提升现有方法的性能。此外,SurfSurg6D优于现有方法,为精确高效的RGB-only位姿估计提供了鲁棒解决方案。

英文摘要

Surgical instrument pose estimation provides crucial information for promising applications, including autonomous robotic surgery, skill assessment, and standardization of surgical workflow. However, this task remains highly challenging due to high precision requirements, frequent occlusions, textureless instruments, scarcity of depth information, and very limited annotated data. These constraints often lead to unsatisfactory performance when general object pose estimation approaches are applied to surgical scenarios. To address these issues, we first construct a new dataset, SynSurg6D, to alleviate the data shortage in this task. We further propose SurfSurg6D, a dense-correspondence framework tailored for surgical instrument pose estimation. Experimental results on the SurgRIPE, EndoVis2018, and SurgPose datasets demonstrate that introducing our generated dataset SynSurg6D diversifies the pose distributions, thus enhancing the performance of existing approaches. Furthermore, SurfSurg6D outperforms existing methods, providing a robust solution for precise and efficient RGB-only pose estimation.

2605.25596 2026-05-26 cs.CL

Multilingual Phonological Feature Recognition with Self-Supervised Speech Models

基于自监督语音模型的多语言音韵特征识别

Abner Hernandez, Tomás Arias-Vergara, Daiqi Liu, Andreas Maier, Paula Andrea Pérez-Toro

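AI summary: Proposes PhonoQ-2.0, a multilingual frame-level phonological feature recognizer built on self-supervised speech models that directly predicts structured feature vectors through a manner-conditioned gating mechanism and outperforms a CTC baseline both in-domain and out-of-domain.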

详情
Comments
Submitted to Interspeech 2026
AI中文摘要

音韵特征提供了语言通用且基于语言学的语音表示。我们提出PhonoQ-2.0,一种基于自监督语音模型构建的多语言帧级音韵特征识别器。该系统直接预测每帧的结构化22维特征向量,编码方式、元音质量、发音部位和清浊,而不是从音素输出中推导特征。为确保音韵上一致的预测,我们引入了一种方式条件门控机制,激活有效的特征组。在多种语言和语料库上评估,PhonoQ-2.0在域内平均宏F1为91.3%,域外为88.9%。与强CTC音素基线相比,它在域内平均获得+8.8 F1的持续提升,域外平均+8.6。在未见语言评估中,PhonoQ-2.0将宏F1从66.9%提高到73.6%(平均+6.7),最高提升达+10.8个百分点。

英文摘要

Phonological features provide a language-general and linguistically grounded representation of speech. We present PhonoQ-2.0, a multilingual frame-level phonological feature recognizer built on self-supervised speech models. The system directly predicts a structured 22-dimensional feature vector per frame encoding manner, vowel quality, place, and voicing, instead of deriving features from phoneme outputs. To ensure phonologically coherent predictions, we introduce a manner-conditioned gating mechanism that activates valid feature groups. Evaluated across multiple languages and corpora, PhonoQ-2.0 achieves an average macro-F1 of 91.3% in-domain and 88.9% out-of-domain. Compared to a strong CTC phoneme baseline, it delivers consistent gains of +8.8 F1 in-domain and +8.6 out-of-domain on average. In unseen-language evaluation, PhonoQ-2.0 improves macro-F1 from 66.9% to 73.6% (+6.7 on average), with gains of up to +10.8 points.

2605.25595 2026-05-26 cs.CV

How Far Has AI Come in Liver Fibrosis Staging? A Large-Scale Real-World Dataset and Benchmark

AI在肝纤维化分期中取得了多大进展?大规模真实世界数据集与基准

Yuanye Liu, Nannan Shi, Zhejia Zhang, Hanxiao Zhang, Boya Wang, Derong Yu, Nao Wang, Yuxin Jin, Yang Zhou, Kunhao Yuan, Siqi Wang, Lida Yang, Xu Qiao, Wentao Liu, Xuelei He, Xin Hong, Guoyan Zheng, Xin Chen, Guang-Zhong Yang, Le Zhang, Lei Li, Yuxin Shi, Xiahai Zhuang

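AI summary: Using LiFS, a large-scale real-world dataset of multi-center, multi-sequence MRI, systematically evaluates 9 AI methods for liver fibrosis staging and finds that the best AI is comparable to a senior radiologist, while cross-center heterogeneity and label imbalance remain major challenges.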

详情
Comments
Submitted to Medical Image Analysis
AI中文摘要

尽管方法学上取得了多年进展,但AI在肝纤维化分期中的进展从未在定义临床实践的异质性、多中心条件下进行系统评估。为填补这一空白,我们引入了LiFS,这是一个来自MICCAI 2025 CARE-Liver挑战的大规模数据集和基准,包含来自多个中心和扫描仪的610名患者的多序列MRI。据我们所知,LiFS是第一个提供完整钆塞酸增强序列并具有来自不同真实世界扫描仪的病理学确认注释的基准。通过对从96个注册团队中选出的9种独立开发方法进行系统评估,并与队列内放射科医生参考结果进行比较,我们的发现从三个互补角度回答了当前AI在临床级肝纤维化分期方面的进展。首先,与放射科医生相比,最佳AI方法总体上与资深放射科医生相当,并在特定设置下显著超过初级放射科医生,而中位AI性能通常接近初级放射科医生水平。其次,从数据角度来看,跨中心异质性、标签不平衡和对比增强序列变异性成为AI方法的主要挑战。第三,从技术角度来看,方法设计选择,包括空间配准、输入维度、多模态融合策略和骨干架构,似乎调节了跨中心鲁棒性,尽管没有单一选择能完全缩小差距。总体而言,LiFS为定位AI在肝纤维化分期中的当前状态以及促进对限制临床可靠部署的关键挑战的未来研究提供了严格的真实世界基准。

英文摘要

Despite years of methodological progress, how far AI has come in liver fibrosis staging has never been systematically evaluated under the heterogeneous, multi-center conditions that define clinical practice. To address this gap, we introduce LiFS, a large-scale dataset and benchmark derived from the MICCAI 2025 CARE-Liver challenge, comprising 610 patients across multiple centers and scanners with multi-sequence MRI. To the best of our knowledge, LiFS is the first benchmark providing complete gadoxetic acid-enhanced sequences with histopathology-confirmed annotations from diverse real-world scanners. Through systematic evaluation of 9 independently developed methods selected from 96 registered teams against in-cohort radiologist reference results, our findings address how far current AI has progressed toward clinical-level liver fibrosis staging from three complementary perspectives. First, against radiologists, the best AI methods were broadly comparable to the senior radiologist and significantly exceeded the junior radiologist in selected settings, while median AI performance generally approached junior-radiologist levels. Second, from a data perspective, cross-center heterogeneity, label imbalance, and contrast-enhanced sequence variability emerge as the dominant challenges for AI methods. Third, from a technical perspective, methodological design choices, including spatial registration, input dimensionality, multi-modal fusion strategy, and backbone architecture, appear to modulate cross-center robustness, although no single choice alone closes the gap. Overall, LiFS provides a rigorous real-world benchmark for positioning the current state of AI in liver fibrosis staging and for enabling future research on the key challenges that limit clinically reliable deployment.

2605.25592 2026-05-26 stat.ML cs.LG

Optimal Design for Multinomial Logit Model with Applications to Best Assortment Identification

多项Logit模型的最优设计及其在最佳组合识别中的应用

Joongkyu Lee, Min-hwan Oh

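AI summary: For the multinomial logit (MNL) model, proposes a computationally efficient optimal experimental design framework that achieves statistical efficiency and scalability through mixed-integer linear programming and a polynomial-time relaxation, and applies it to best assortment identification with linear utilities and non-uniform revenues.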

详情
Comments
Accepted at ICML 2026
AI中文摘要

我们研究了多项Logit(MNL)赌博机的最优实验设计,其中智能体从大小为$N$的基集中重复选择$K$个物品的子集,并观察单选择反馈。与线性或广义线性赌博机不同,MNL赌博机具有组合动作空间,这使得经典的最优设计方法和对所有子集的朴素优化在计算上难以处理。我们为MNL模型提出了一种计算高效的最优设计框架,通过两种互补方法实现了统计效率和可扩展性:(i) 将设计预言精确或认证近似地重构为带有求解器认证早停的$0$-$1$混合整数线性规划(MILP),以及(ii) 一种完全多项式时间的提升设计,用可处理的替代目标替换非线性目标。利用Kiefer-Wolfowitz等价定理,我们建立了接近G-最优性的保证,并刻画了由此产生的统计-计算权衡。作为应用,我们为具有线性效用和非均匀收益的MNL赌博机开发了一种最佳组合识别算法,并证明了实例相关的样本复杂度为$\tilde{O}\big(\frac{d \log N}{\Delta^2}\big)$,其中$d$是特征维度,$N$是臂的数量,$\Delta$是最小收益差距。

英文摘要

We study optimal experimental design for multinomial logit (MNL) bandits, where an agent repeatedly selects a subset of $K$ items from a ground set of size $N$ and observes single-choice feedback. Unlike linear or generalized linear bandits, MNL bandits have a combinatorial action space, which makes classical optimal design approaches and naive optimization over all subsets computationally intractable. We propose a computationally efficient optimal design framework for MNL models that achieves both statistical efficiency and scalability through two complementary approaches: (i) an exact or certified-approximate reformulation of the design oracle as a $0$-$1$ mixed-integer linear program (MILP) with solver-certified early stopping, and (ii) a fully polynomial-time lifted design that replaces the nonlinear objective with a tractable surrogate. Using the Kiefer-Wolfowitz equivalence theorem, we establish near G-optimality guarantees and characterize the induced statistical-computational trade-offs. As an application, we develop a best assortment identification algorithm for MNL bandits with linear utilities and non-uniform revenues, and prove an instance-dependent sample complexity of $\tilde{O}\big(\frac{d \log N}{\Delta^2}\big)$, where $d$ is the feature dimension, $N$ is the number of arms, and $\Delta$ is the minimum revenue gap.
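To make the setting above concrete, the sketch below computes MNL choice probabilities for an offered assortment with linear utilities and non-uniform revenues, then finds the best size-$K$ assortment by enumerating all subsets. The outside (no-purchase) option with utility zero is a common modelling assumption that the abstract does not state, and every function name here is hypothetical. The brute-force search is included only to show why the action space is combinatorial; it is not the paper's MILP or lifted design.

```python
from itertools import combinations
from math import exp

def linear_utilities(theta, features):
    """Utility of each item as the inner product of its features with theta."""
    return [sum(t * f for t, f in zip(theta, row)) for row in features]

def mnl_choice_probabilities(utilities, assortment):
    """Choice probabilities for offered items; key None is the assumed outside option."""
    weights = {i: exp(utilities[i]) for i in assortment}
    denom = 1.0 + sum(weights.values())   # the 1.0 is exp(0) for the outside option
    probs = {i: w / denom for i, w in weights.items()}
    probs[None] = 1.0 / denom
    return probs

def expected_revenue(utilities, revenues, assortment):
    probs = mnl_choice_probabilities(utilities, assortment)
    return sum(revenues[i] * probs[i] for i in assortment)

def best_assortment_brute_force(utilities, revenues, k):
    """Enumerate all C(N, k) subsets; feasible only for small N."""
    items = range(len(utilities))
    return max(combinations(items, k),
               key=lambda s: expected_revenue(utilities, revenues, s))

# Toy instance: three items with equal utility and increasing revenue.
utilities = [0.0, 0.0, 0.0]
revenues = [1.0, 2.0, 3.0]
probs = mnl_choice_probabilities(utilities, (1, 2))
best = best_assortment_brute_force(utilities, revenues, 2)
```

The number of candidate subsets grows as $\binom{N}{K}$, which is the intractability that an efficient design oracle has to avoid.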

2605.25590 2026-05-26 stat.ML cs.LG

Nonstationary Generalized Linear Bandits with Discounted Online Mirror Descent

基于折扣在线镜像梯度的非平稳广义线性老虎机

Joongkyu Lee, Min-hwan Oh

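AI summary: Proposes the DOMD-GLB algorithm, which uses discounted online mirror descent to handle nonstationary generalized linear bandits and achieves dynamic regret bounds while keeping per-round computation and memory costs at O(1).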

详情
AI中文摘要

我们研究非平稳广义线性老虎机(GLBs),其中期望奖励通过非线性链接函数与未知时变参数建模。该框架涵盖广泛的奖励模型,包括线性、伯努利和二项式奖励。现有方法主要基于最大似然估计(MLE),使用滑动窗口、重启或折扣机制处理非平稳性。尽管这些方法在统计上实现了高效的遗憾保证,但它们通常需要在每轮重新访问过去观测,导致计算和内存成本随时间增长;此外,其中一些方法依赖于非凸投影步骤。本文提出DOMD-GLB,一种用于非平稳GLBs的新算法,利用折扣在线镜像梯度(DOMD)进行参数估计,从而每轮仅产生O(1)的计算和内存成本。我们证明了在漂移环境下的动态遗憾界为$\tilde{O} \big(c_\mu^{-1/2} d^{3/4} P_T^{1/4} T^{3/4}\big)$,在分段平稳环境下为$\tilde{O}\big(c_\mu^{-1/3} d^{2/3} \Gamma_T^{1/3} T^{2/3}\big)$,其中$d$表示特征维度,$T$表示时间范围,$P_T$表示路径长度,$\Gamma_T$表示变化点数量,$c_\mu$是与链接函数相关的曲率参数,同时显著提高了计算效率。据我们所知,这是首个每轮计算和内存成本与时间无关的非平稳GLBs算法。

英文摘要

We study nonstationary generalized linear bandits (GLBs), where the expected reward is modeled through a nonlinear link function with an unknown time-varying parameter. This framework encompasses a broad class of reward models, including linear, Bernoulli, and binomial rewards. Existing approaches are predominantly based on maximum-likelihood estimation (MLE), using sliding-window, restart, or discounting mechanisms to handle nonstationarity. Although these methods achieve statistically efficient regret guarantees, they generally require revisiting past observations at every round, which leads to computation and memory costs that grow with time; moreover, several of them rely on a non-convex projection step. In this paper, we propose DOMD-GLB, a new algorithm for nonstationary GLBs that utilizes discounted online mirror descent (DOMD) for parameter estimation, thereby incurring only $O(1)$ computation and memory costs per round. We prove dynamic regret bounds of order $\tilde{O}\big(c_\mu^{-1/2} d^{3/4} P_T^{1/4} T^{3/4}\big)$ in drifting environments and $\tilde{O}\big(c_\mu^{-1/3} d^{2/3} \Gamma_T^{1/3} T^{2/3}\big)$ in piecewise-stationary environments, where $d$ denotes the feature dimension, $T$ the time horizon, $P_T$ the path length, $\Gamma_T$ the number of change points, and $c_\mu$ a curvature parameter associated with the link function, while substantially improving computational efficiency over prior work. To the best of our knowledge, this is the first algorithm for nonstationary GLBs with per-round computation and memory costs independent of time.

2605.25589 2026-05-26 cs.CV

Artifact Correction for Echo-Planar Imaging at Low-Field and Ultra-Low-Field MRI

低场和超低场MRI中回波平面成像的伪影校正

Sisi Qiao, Yilin Yu, Tiecheng Lin, Yuhao Liu, Jiajia Sun, Xiaoling Li

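AI summary: To address Nyquist ghosting in echo-planar imaging at low-field and ultra-low-field MRI, proposes a reference-scan-free correction pipeline that combines peak alignment with interpolation and resampling to suppress ghosts and improve image quality.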

详情
Comments
19 pages, 10 figures, 2 tables
AI中文摘要

目的:低场和超低场MRI中的回波平面成像因奇偶k空间错位而遭受严重的奈奎斯特鬼影伪影。本研究开发了一种无参考扫描的伪影校正流程,减少对传统参考扫描的依赖,同时实现更好的鬼影抑制。方法:从传统的基于参考扫描的鬼影校正方法出发,我们首先引入一种基于峰值对齐的鬼影校正方法,无需参考数据即可校正奇偶行位移。为进一步减少残余伪影,采用了插值与重采样策略。该组合方法在低场和超低场下的EPI和扩散加权EPI数据上进行了评估。结果:所提出的流程有效减轻了奈奎斯特鬼影,改善了结构连续性,并增强了信号均匀性。仅基于峰值对齐的鬼影校正方法提供了与基于参考扫描的鬼影校正方法相当的伪影抑制效果,而插值与重采样进一步抑制了残余伪影,使得在超低场条件下能够可靠地可视化脑结构。结论:为低场和超低场EPI提出了一种实用的无参考校正流程,结合了基于峰值对齐的鬼影校正方法和插值重采样,实现了高效的鬼影抑制,扩展了低场MRI系统的临床适用性,为基于超低场EPI的DWI成像提供了理论指导和实践经验。

英文摘要

Purpose: Echo-planar imaging (EPI) in low-field (LF) and ultra-low-field (ULF) MRI suffers from severe Nyquist ghost artifacts due to odd-even k-space misalignment. This study develops a reference-free artifact correction pipeline that reduces reliance on conventional reference scans while achieving improved ghost suppression. Methods: Starting from the traditional reference-scan-based ghost artifact correction method, we first introduce a peak-alignment-based ghost artifact correction method to correct odd-even line displacement without reference data. To further reduce residual artifacts, an interpolation-and-resampling strategy is applied. The combined method was evaluated using EPI and diffusion-weighted EPI data at LF and ULF. Results: The proposed pipeline effectively mitigated Nyquist ghosts, improved structural continuity, and enhanced signal uniformity. The peak-alignment-based ghost artifact correction method alone provided artifact suppression comparable to that of the reference-scan-based method, while interpolation and resampling further suppressed residual artifacts, enabling reliable visualization of brain structures under ULF conditions. Conclusion: A practical, reference-free correction pipeline is presented for LF and ULF EPI, combining the peak-alignment-based ghost artifact correction method and interpolation-resampling to achieve efficient ghost suppression and expand the clinical applicability of low-field MRI systems, providing both theoretical guidance and practical experience for ULF EPI-based diffusion-weighted imaging (DWI).
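The peak-alignment idea in the abstract above can be illustrated with a deliberately simplified sketch: locate the largest-magnitude sample in each k-space readout line and circularly shift the odd lines so that their peaks coincide with those of the even lines. This is not the authors' pipeline. It assumes odd lines have already been time-reversed, handles whole-sample shifts only (sub-sample displacement would need the interpolation-and-resampling step), and uses hypothetical function names.

```python
from statistics import median_low

def peak_index(line):
    """Index of the largest-magnitude sample in one k-space readout line."""
    return max(range(len(line)), key=lambda i: abs(line[i]))

def circular_shift(line, shift):
    """Shift samples right by `shift` positions (left if negative), wrapping around."""
    shift %= len(line)
    return line[-shift:] + line[:-shift] if shift else list(line)

def align_odd_lines(kspace):
    """Shift each odd line so its peak matches the median peak position of the even lines."""
    target = median_low([peak_index(line) for line in kspace[0::2]])
    corrected = []
    for row, line in enumerate(kspace):
        if row % 2 == 1:
            corrected.append(circular_shift(line, target - peak_index(line)))
        else:
            corrected.append(list(line))
    return corrected

def toy_line(length, peak):
    """Synthetic readout line: small background with one dominant echo peak."""
    return [complex(10.0, 0.0) if i == peak else complex(0.1, 0.0) for i in range(length)]

# Even lines peak at sample 4; odd lines are displaced by two samples.
kspace = [toy_line(8, 4), toy_line(8, 6), toy_line(8, 4), toy_line(8, 6)]
aligned = align_odd_lines(kspace)
aligned_peaks = [peak_index(line) for line in aligned]
```

Only the data themselves are used to estimate the displacement, which is what makes this kind of correction reference-free.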

2605.25584 2026-05-26 cs.RO cs.AI

Acting on the Unseen: Communication-Free Collaborative Filtering for Decentralized Multi-Robot Task Allocation

作用于未知:面向分散式多机器人任务分配的无通信协同过滤

Alexander Apartsin, Yigal Meshulam, Yehudit Aperstein

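AI summary: For zero-knowledge multi-robot task allocation, proposes SwarmCF, a method based on online low-rank collaborative filtering that needs no communication, prior knowledge, or coordinator, lets each robot act effectively on unseen tasks, and comes with a proven sample-complexity advantage.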

详情
Comments
27 pages, 12 figures
AI中文摘要

多机器人任务分配通常假设某种通信、已知任务模型或协调者的组合。我们研究相反的极端情况,这在实践中常见但在理论上被忽视,我们称之为零知识MRTA(ZK-MRTA):一个没有先验知识(没有任务模型,甚至没有潜在秩)、没有通信(没有消息、没有参数共享、没有协调者)、并且只能部分且私下带噪地观察队友结果的公共流的机器人团队。一个隐藏的低秩结构决定了哪个机器人适合哪个任务,并且任务数量远多于轮次,因此大多数(机器人,任务)对从未被尝试过。然而,每个机器人可以通过在广播流上运行在线低秩协同过滤(SwarmCF)来很好地处理从未尝试过的任务以及新任务。与任何无结构学习器相比,优势是类别性的,而不是常数因子:无结构学习器在未见对上的误差被证明处于先验均值水平。我们证明了每个机器人的匹配样本复杂度(在秩d和任务数n下,Θ(d) vs Θ(n)),任务稀缺下的任意时间(累积奖励)分离,以及一个确定性条件,在该条件下从掩码广播中分散恢复是精确的(经验验证)。实验量化了广播的价值、一个正比例缩放律(每个机器人的未见对技能随团队规模增加)、以及低秩方法中最强的掩码鲁棒性和任意时间曲线,恢复了集中式全通信上限的大部分(约80%的技能收益),并在容量1竞争和基于机器人的感知实例中保持有效。

英文摘要

Multi-robot task allocation usually assumes some combination of communication, known task models, or a coordinator. We study the opposite extreme, a regime common in practice but overlooked in theory, which we name Zero-Knowledge MRTA (ZK-MRTA): a robot team with no prior knowledge (no task models, not even the latent rank), no communication (no messages, no parameter sharing, no coordinator), and only a partial and privately-noisy view of a public stream of teammates' outcomes. A hidden low-rank structure governs which robot suits which task, and there are far more tasks than rounds, so most (robot, task) pairs are never attempted. Yet each robot can act well on tasks it never attempted, and onboard new tasks, by running online low-rank collaborative filtering over the broadcast (SwarmCF). The advantage over any structure-free learner is categorical, not a constant factor: a structure-free learner is provably at the prior-mean error floor on unseen pairs. We prove a matching per-robot sample complexity (Θ(d) versus Θ(n), in the rank d and the task count n), an anytime (cumulative-reward) separation under task scarcity, and a deterministic condition under which decentralized recovery from the masked broadcast is exact (validated empirically). Experiments quantify the value of the broadcast, a positive scaling law (per-robot unseen-pair skill rises with team size), and the strongest masking-robustness and anytime profile among low-rank methods, recovering most (about 80% on earned skill) of a centralized full-communication ceiling, and holding under capacity-1 contention and in a robotics-grounded sensing instance.
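The contrast drawn above between a low-rank learner and a structure-free one can be seen in a toy example. The sketch below fits a rank-1 model to the observed (robot, task) outcomes by alternating least squares and predicts pairs that were never attempted, next to a prior-mean baseline. It is a batch, noiseless, rank-1 simplification with a known rank, so it is not SwarmCF itself, which is online and does not assume the rank; all names and numbers are hypothetical.

```python
def fit_rank_one(observed, n_robots, n_tasks, iterations=200):
    """Alternating least squares for outcome[r, t] ~ a[r] * b[t] on observed pairs only."""
    a = [1.0] * n_robots
    b = [1.0] * n_tasks
    for _step in range(iterations):
        for r in range(n_robots):
            pairs = [(t, y) for (rr, t), y in observed.items() if rr == r]
            den = sum(b[t] ** 2 for t, _y in pairs)
            if den > 0:
                a[r] = sum(y * b[t] for t, y in pairs) / den
        for t in range(n_tasks):
            pairs = [(r, y) for (r, tt), y in observed.items() if tt == t]
            den = sum(a[r] ** 2 for r, _y in pairs)
            if den > 0:
                b[t] = sum(y * a[r] for r, y in pairs) / den
    return a, b

# Hidden rank-1 skill structure: outcome = robot_skill * task_gain.
robot_skill = [1.0, 2.0, 3.0, 1.5]
task_gain = [1.0, 0.5, 2.0, 1.5, 1.0]
unseen = [(0, 4), (1, 2), (2, 0), (3, 3)]          # pairs never attempted

observed = {
    (r, t): robot_skill[r] * task_gain[t]
    for r in range(4) for t in range(5) if (r, t) not in unseen
}

a, b = fit_rank_one(observed, n_robots=4, n_tasks=5)
prior_mean = sum(observed.values()) / len(observed)

low_rank_error = max(abs(a[r] * b[t] - robot_skill[r] * task_gain[t]) for r, t in unseen)
baseline_error = max(abs(prior_mean - robot_skill[r] * task_gain[t]) for r, t in unseen)
```

The low-rank fit transfers information across robots and tasks, so it can score pairs with no direct observations, whereas the baseline can only fall back on the average outcome.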

2605.25581 2026-05-26 cs.LG

Learning Latent Dynamical Causal Processes for Single-Cell Perturbation Prediction

学习单细胞扰动预测的潜在动态因果过程

Wenkang Jiang, Yuhang Liu, Erdun Gao, Ehsan Abbasnejad, Lina Yao, Javen Qinfeng Shi

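AI summary: Proposes a latent dynamical causal generative model (CITE-VAE) that jointly captures latent cellular programs, perturbation-conditioned mechanisms, and temporal evolution, enabling out-of-distribution generalization in single-cell perturbation prediction.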

详情
Comments
Accepted to SIGKDD 2026 AI4Science Track
AI中文摘要

单细胞扰动预测旨在推断细胞如何响应未见过的干预,并实现分布外(OOD)泛化,为理解扰动如何随时间重塑细胞程序提供计算途径。现有的机器学习方法取得了重要进展,但通常仅捕捉响应的一方面。潜在因果方法寻求支持泛化和解释的机制,但往往将扰动效应视为静态结果。时间模型描述基因表达随时间的变化,但通常不显式恢复驱动这些变化的潜在因果生成机制。在实践中,扰动效应既是潜在的也是动态的:干预通过未观察到的细胞程序起作用,这些程序的状态随时间演变并产生观察到的表达谱。受此观点启发,我们提出一个用于单细胞扰动数据的潜在动态因果生成模型,联合捕获潜在细胞程序、扰动条件机制和时间演化。我们进一步提供可识别性分析,表明在适当条件下,潜在因果变量可恢复至标准等价类。在此分析指导下,我们开发了CITE-VAE,一个从单细胞测序数据中恢复潜在细胞程序及其扰动驱动动态的学习框架。在Causal-3DIdent上的实验验证了理论结果和所提方法在受控环境中的有效性。在真实世界的基于CRISPR的单细胞扰动数据上的额外实验表明,与最先进的基线相比,对未见扰动的泛化能力有所提升,突显了我们方法的实际鲁棒性。

英文摘要

Single-cell perturbation prediction aims to infer how cells respond to unseen interventions and to achieve out-of-distribution (OOD) generalization, providing a computational route to understanding how perturbations reshape cellular programs over time. Existing machine learning methods have made important progress, but typically capture only one side of the response. Latent causal approaches seek mechanisms that support generalization and interpretation, yet often treat perturbation effects as static outcomes. Temporal models describe how gene expression changes across time, but usually do not explicitly recover the latent causal generative mechanisms driving these changes. In practice, perturbation effects are both latent and dynamical: interventions act through unobserved cellular programs, whose states evolve over time and give rise to observed expression profiles. Motivated by this view, we propose a latent dynamical causal generative model for single-cell perturbation data that jointly captures latent cellular programs, perturbation-conditioned mechanisms, and temporal evolution. We further provide an identifiability analysis showing that, under suitable conditions, the latent causal variables are recoverable up to standard equivalence classes. Guided by this analysis, we develop CITE-VAE, a learning framework for recovering latent cellular programs and their perturbation-driven dynamics from single-cell sequencing data. Experiments on Causal-3DIdent validate the theoretical results and the effectiveness of the proposed method in controlled settings. Additional experiments on real-world CRISPR-based single-cell perturbation data show improved generalization to unseen perturbations compared with state-of-the-art baselines, highlighting the practical robustness of our approach.

2605.25577 2026-05-26 cs.LG cs.AI

Geometric Flow Matching for Molecular Conformation Generation via Manifold Decomposition

基于流形分解的几何流匹配分子构象生成

Yunqing Liu, Yi Zhou, Wenqi Fan

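AI summary: Proposes GO-Flow, which decomposes the generation process into three physical subspaces (translation, rotation, and conformation) and uses optimal transport and geodesic flows on manifolds to address existing methods' neglect of the hierarchical structure of molecular geometry, achieving high-quality and efficient molecular conformation generation.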

详情
AI中文摘要

生成准确的3D分子构象是计算化学和药物发现中的关键挑战。最近,扩散和流匹配模型取得了显著成功。然而,它们的数学公式与分子的物理现实之间存在严重的不匹配。现有方法主要将分子视为笛卡尔空间中的无结构点云,忽略了键长和键角相对刚性而扭转角构成主要柔性自由度的内在层次力学。这种对流形的不感知迫使模型从头重新学习基本几何约束,常常导致物理上不可信的中间结构。为了解决这个问题,我们提出了GO-Flow,通过流形分解将生成建模与分子几何对齐。GO-Flow不是强制在欧几里得空间中运动,而是将生成过程分解为三个物理驱动的子空间:具有线性最优输运的平移空间、$SO(3)$上具有测地流的旋转空间以及具有熵最优输运的构象空间。这种分解注入了几何归纳偏置,使生成路径更好地与分子自由度对齐。当与等变神经架构结合时,它鼓励旋转一致的生成并提高几何有效性。在GEOM-Drugs和GEOM-QM9上的大量实验表明,GO-Flow实现了最先进的生成质量。值得注意的是,通过在正确的流形上自然地学习更直的概率路径,我们的方法能够在仅50步的情况下实现高保真采样,有效弥合了结构精度与计算效率之间的差距。

英文摘要

The generation of accurate 3D molecular conformations is a pivotal challenge in computational chemistry and drug discovery. Recently, diffusion and flow matching models have achieved remarkable success. However, there is a critical misalignment between their mathematical formulation and the physical reality of molecules. Existing approaches predominantly treat molecules as unstructured point clouds in Cartesian space, overlooking the intrinsic hierarchical mechanics where bond lengths and bond angles are relatively stiff, whereas torsion angles constitute the dominant flexible degrees of freedom. This lack of manifold awareness forces models to relearn fundamental geometric constraints from scratch, often leading to physically implausible intermediate structures. To address this, we propose GO-Flow, which aligns generative modeling with molecular geometry via manifold decomposition. Instead of forcing motion through Euclidean space, GO-Flow decomposes the generation process into three physically motivated subspaces: translation space with linear optimal transport, rotation space with geodesic flows on $SO(3)$, and conformation space with entropic optimal transport. This decomposition injects geometric inductive biases and makes the generative paths better aligned with molecular degrees of freedom. When combined with equivariant neural architectures, it encourages rotation-consistent generation and improves geometric validity. Extensive experiments on GEOM-Drugs and GEOM-QM9 demonstrate that GO-Flow achieves state-of-the-art generation quality. Notably, by naturally learning straighter probability paths on the correct manifolds, our method enables high-fidelity sampling with as few as 50 steps, effectively bridging the gap between structural precision and computational efficiency.
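The phrase "geodesic flows on $SO(3)$" above refers to moving between two orientations along the shortest rotation path rather than interpolating matrix entries linearly. The sketch below computes a point on that geodesic, $R_t = R_0 \exp(t \log(R_0^\top R_1))$, using Rodrigues' formula. It illustrates only this standard construction, not GO-Flow's training procedure; the function names are assumptions, and the relative rotation angle is assumed to be strictly less than $\pi$.

```python
from math import acos, cos, pi, sin

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def transpose(a):
    return [list(col) for col in zip(*a)]

def axis_angle_to_matrix(axis, angle):
    """Rodrigues' formula: R = I + sin(angle) K + (1 - cos(angle)) K^2 for a unit axis."""
    x, y, z = axis
    k = [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]
    k2 = matmul(k, k)
    s, c = sin(angle), 1.0 - cos(angle)
    return [[(1.0 if i == j else 0.0) + s * k[i][j] + c * k2[i][j]
             for j in range(3)] for i in range(3)]

def geodesic_point(r0, r1, t):
    """Point at fraction t along the SO(3) geodesic from r0 to r1 (angle < pi assumed)."""
    rel = matmul(transpose(r0), r1)
    cos_angle = max(-1.0, min(1.0, (rel[0][0] + rel[1][1] + rel[2][2] - 1.0) / 2.0))
    angle = acos(cos_angle)
    if angle < 1e-12:
        return [row[:] for row in r0]           # identical orientations
    scale = 2.0 * sin(angle)
    axis = ((rel[2][1] - rel[1][2]) / scale,
            (rel[0][2] - rel[2][0]) / scale,
            (rel[1][0] - rel[0][1]) / scale)
    return matmul(r0, axis_angle_to_matrix(axis, t * angle))

# Halfway between the identity and a 90-degree rotation about z is a 45-degree rotation.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
quarter_turn = axis_angle_to_matrix((0.0, 0.0, 1.0), pi / 2)
halfway = geodesic_point(identity, quarter_turn, 0.5)
```

Every intermediate point produced this way is itself a valid rotation matrix, which is not guaranteed when the nine matrix entries are interpolated in Euclidean space.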

2605.25574 2026-05-26 cs.CV cs.AI

Mosaic: Compositional Multi-Concept Erasure via Vector Field Blending

Mosaic: 通过向量场混合的组合式多概念擦除

Junseok Ko, Jungwoo Kim, Jong-Seok Lee

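AI summary: For the task of simultaneously erasing multiple target concepts in flow-based text-to-image models, proposes the Mosaic framework, which dynamically constructs concept-specific masks and selectively blends vector fields to remove multiple concepts from complex scenes without additional optimization.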

详情
AI中文摘要

概念擦除已成为确保文本到图像(T2I)模型安全与伦理图像合成的关键研究方向。现有研究虽探索了多概念擦除,但通常假设每张图像仅有一个目标概念,这一限制被现代基于流的T2I模型日益暴露,此类模型可同时生成包含多个概念的复杂场景。为弥补这一空白,我们引入组合式多概念擦除这一新任务,旨在同时移除单个场景中的多个目标概念。我们提出CoME-Bench,一个用于评估组合式多概念擦除的基准,涵盖类别内和跨类别场景。我们进一步提出Mosaic,一个用于基于流的T2I模型中多概念擦除的新框架,该框架通过动态构建概念特定掩码并选择性混合它们,利用向量场中目标概念的空间局部性,无需额外优化。大量实验表明,Mosaic能有效移除复杂组合场景中的多个目标概念,同时保留非目标上下文。

英文摘要

Concept erasure has emerged as a key research direction for ensuring safe and ethical image synthesis in Text-to-Image (T2I) models. While existing studies have explored concept erasure across multiple concepts, they typically assume only a single target concept per image, a limitation increasingly exposed by modern flow-based T2I models, which can generate complex scenes with multiple concepts simultaneously. To address this gap, we introduce compositional multi-concept erasure, a new task that aims to simultaneously remove multiple target concepts within a single scene. We propose CoME-Bench, a benchmark for evaluating compositional multi-concept erasure, which covers both intra- and cross-category scenarios. We further propose Mosaic, a novel framework for multi-concept erasure in flow-based T2I models, which exploits the spatial locality of target concepts in the vector field by dynamically constructing concept-specific masks and selectively blending them without additional optimization. Extensive experiments demonstrate that Mosaic effectively removes multiple target concepts in complex compositional scenes while preserving non-target contexts.