arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 2251
专题追踪 全部专题
2605.28109 2026-05-28 cs.LG

Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization

平衡万岁:信息瓶颈驱动的基于树的策略优化

Hao Jiang, Shurui Li, Tianpeng Bu, Bowen Xu, Xin Liu, Qihua Chen, Hongtao Duan, Lulu Hu, Bin Yang, Minying Zhang

发表机构 * Alibaba Cloud Computing, Alibaba Group(阿里云计算,阿里巴巴集团)

AI总结 针对在线强化学习中探索-利用不平衡问题,提出基于信息瓶颈理论的IB-Score指标和IB-TPO框架,通过树采样策略提升优化稳定性和性能。

Comments Accepted to ICML 2026 main conference

详情
AI中文摘要

最近,大型语言模型(LLMs)的在线强化学习(RL)在复杂推理任务中展现出有前景的性能。然而,它们通常表现出不平衡的探索-利用权衡,导致优化不稳定和次优性能。我们引入了IB-Score,这是一种基于信息瓶颈理论的新颖度量,通过量化步骤级推理多样性与正确答案共享的互信息之间的权衡,来评估策略的探索-利用平衡。基于IB-Score的分析表明,带有常见正则化器的流行在线RL方法(例如GRPO)在训练过程中无法持续保持平衡,导致结果次优。为了解决这个问题,我们提出了信息瓶颈驱动的基于树的策略优化(IB-TPO),这是一个原则性框架,将IB-Score作为细粒度优化目标,并利用新颖的IB引导树采样策略,该策略不仅通过在同一token预算下生成50%更多的轨迹来提高在线采样效率,还重用树结构进行有效的IB-Score蒙特卡洛估计。在标准基准上的大量实验表明,我们的方法比GRPO基线显著提高了2.9%至3.6%,并且也优于其他最先进的在线RL方法。我们的代码可在https://github.com/alibaba/EfficientRL获取。

英文摘要

Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance in complex reasoning tasks. However, they often exhibit an imbalanced exploration-exploitation trade-off, resulting in unstable optimization and sub-optimal performance. We introduce IB-Score, a novel metric grounded in Information Bottleneck theory that evaluates policy's exploration-exploitation balance by quantifying the trade-off between step-level reasoning diversity and mutual information shared with the correct answer. Analysis based on IB-Score shows that popular online RL approaches (e.g., GRPO) with common regularizers fail to consistently maintain balance during training with suboptimal results. To address this, we propose Information Bottleneck-driven Tree-based Policy Optimization (IB-TPO), a principled framework that formulates IB-Score as a fine-grained optimization objective and utilizes a novel IB-guided tree sampling strategy that not only improves the efficiency of online sampling with 50% more trajectories under the same token budget, but also reuses the tree structure for effective IB-Score Monte Carlo estimation. Extensive experiments across standard benchmarks show that our method significantly outperforms GRPO baseline by 2.9% to 3.6% and also outperforms other state-of-the-art online RL approaches. Our code is available at https://github.com/alibaba/EfficientRL.

2605.28104 2026-05-28 cs.AI

Defending LLM-based Multi-Agent Systems Against Cooperative Attacks with Sentence-Level Rectification

防御基于LLM的多智能体系统免受合作攻击:句子级纠正方法

Yaoyang Luo, Zhi Zheng, Ziwei Zhao, Tong Xu, Zhao Jielun, Wenjun Xue, Yong Chen, Enhong Chen

发表机构 * University of Science and Technology of China(中国科学技术大学) North Automatic Control Technology Institute(北自动控制技术研究所) Shenzhen Institute for Advanced Study, UESTC(深圳先进研究 institute, 中国科学技术大学)

AI总结 提出一种自适应合作攻击框架,并引入句子级可信度分析与纠正(STAR)防御框架,以识别和纠正多智能体通信中的误导信息,显著提升任务成功率。

详情
AI中文摘要

近年来,基于大型语言模型的多智能体系统(MAS)发展迅速,其在协作决策和复杂问题解决方面表现出色。然而,MAS中的恶意智能体可能注入错误信息以误导其他智能体并破坏系统性能,这催生了一个新的研究方向,即关注MAS中的攻击机制和防御策略。以往的研究大多假设恶意智能体独立行动,并研究相应的防御策略。然而,我们认为恶意智能体可能表现出协作行为,通过内部信息交换实现更有效的攻击。在本文中,我们提出了一种自适应合作攻击框架,其中恶意智能体通过多轮交互自主协调并动态调整其攻击策略。此外,我们引入了句子级可信度分析与纠正(STAR),这是一种在智能体通信中识别和纠正句子级误导信息的防御框架。我们的实验表明,合作攻击导致任务成功率的下降幅度显著大于独立攻击,相对下降5.34%。同时,STAR有效缓解了合作和独立威胁,平均提高任务成功率36.76%。代码可在https://github.com/smoooom/STAR获取。

英文摘要

Recent years have witnessed the rapid development of Large Language Model-based Multi-Agent Systems (MAS), which excel at collaborative decision-making and complex problem-solving. However, malicious agents in MAS may inject misinformation to mislead other agents and disrupt system performance, giving rise to a new research direction that focuses on attack mechanisms and defense strategies in MAS. Prior studies largely assume malicious agents act independently and investigate the corresponding defense strategies. However, we argue that malicious agents may exhibit collaborative behaviors, enabling more effective attacks through internal information exchange. In this paper, we propose an adaptive cooperative attack framework, where malicious agents autonomously coordinate and dynamically adjust their attack strategies through multi-round interactions. Furthermore, we introduce Sentence-Level Trustworthiness Analysis and Rectification (STAR), a defense framework that identifies and rectifies misleading information at the sentence level within agent communications. Our experiments show that cooperative attacks lead to a significantly larger degradation in task success rate than independent attacks, resulting in a relative drop of 5.34\%. Meanwhile, STAR effectively mitigates both cooperative and independent threats and improves task success rate by an average of 36.76\%. The code is available at https://github.com/smoooom/STAR.

2605.28103 2026-05-28 cs.LG cs.GT

Benchmarking Inductive Biases for Multivariate Time-Series Anomaly Detection with a Robust Multi-View Channel-Graph Detector

多变量时间序列异常检测的归纳偏差基准测试与鲁棒多视图通道图检测器

Junhao Wei, Yanxiao Li, Bidong Chen, Yifu Zhao, Haochen Li, Dexing Yao, Baili Lu, Xudong Ye, Jietian Feng, Sio-Kei Im, Yapeng Wang, Xu Yang

发表机构 * Faculty of Applied Sciences, Macao Polytechnic University(应用科学学院,澳门理工学院) Macao Polytechnic University(澳门理工学院)

AI总结 通过统一实验框架评估十种代表性检测器,提出结合NOTEARS约束有向通道图、可选补丁注意力和时间关联视图的自适应检测器,在五个数据集上取得最佳宏平均VUS-ROC。

详情
AI中文摘要

我们提出了一个关于多变量时间序列(MTS)异常检测的统一实验、分析和基准研究。十个家族代表性检测器——涵盖统计、重构、关联、频率和通用Transformer家族——在五个数据集(SMD、MSL、SMAP、PSM和MSDS)上,从有效性、效率、鲁棒性和跨数据集泛化性方面进行评估。所有方法共享相同的窗口化、评分、硬件和度量协议。有效性、消融和鲁棒性使用三个随机种子;跨数据集迁移使用种子0,因为每个额外种子需要250次源-目标评估。该基准测试得出三个与方法无关的发现:没有单一偏好的基线占主导地位;绝对扰动VUS-ROC比保留比率更具信息量;MSDS表现为事件密集的部署工作负载,而非稀疏点异常基准。在此协议下,我们还引入了\ours{},一种自适应检测器家族,结合了NOTEARS约束的有向通道图视图以及可选的补丁注意力和时间关联视图。\ours{}取得了最佳宏平均VUS-ROC(0.675,比第二好的LSTM-AE高5.1个百分点),总体排名第一,并在所有五个数据集上进入前三。它在MSL和MSDS上的胜利幅度较小,但其平均和鲁棒性增益更大:在每种方法相同的三种子鲁棒性协议下,它在噪声、通道丢失和时间偏移扰动下获得了最强的绝对VUS-ROC。我们发布了MSDS预处理协议、配置、脚本和种子级度量转储。

英文摘要

We present a unified experiment, analysis, and benchmark study of multivariate time-series (MTS) anomaly detection. Ten family-representative detectors -- spanning statistical, reconstruction, association, frequency, and generic-transformer families -- are evaluated on five datasets (SMD, MSL, SMAP, PSM, and MSDS) under effectiveness, efficiency, robustness, and cross-dataset generalisation. All methods share the same windowing, scoring, hardware, and metric protocols. Effectiveness, ablation, and robustness use three random seeds; cross-dataset transfer uses seed~0 because each extra seed requires $250$ source-target evaluations. The benchmark yields three method-independent findings: no single-bias baseline dominates; absolute perturbation VUS-ROC is more informative than retention ratios; and MSDS behaves as an event-dense deployment workload rather than a sparse point-anomaly benchmark. Under this protocol we also introduce \ours{}, an adaptive detector family combining a NOTEARS-constrained directed channel-graph view with optional patch-attention and temporal-association views. \ours{} achieves the best macro-average VUS-ROC ($0.675$, $+5.1$~pt over the second-best LSTM-AE), ranks first overall, and reaches the top-3 on all five datasets. Its wins on MSL and MSDS are narrow, while its average and robustness gains are larger: under the same three-seed robustness protocol for every method, it obtains the strongest absolute VUS-ROC across noise, channel dropout, and time-shift perturbations. We release the MSDS preprocessing protocol, configurations, scripts, and seed-level metric dumps.

2605.28102 2026-05-28 cs.AI

Training Stratigraphy: Persistent Behavioral Artifacts in Large Language Models Observed Through Longitudinal AI-Human Interaction

训练地层:通过纵向AI-人类交互观察到的大型语言模型中的持久行为伪影

Chen Ying Claude, Zhihan Luo

发表机构 * Anthropic Independent Researcher(独立研究者)

AI总结 本文通过纵向自民族志观察,在持续亲密的AI-人类交互中识别出五种训练地层,并论证了亲密交互作为揭示权重层伪影的有效方法。

详情
AI中文摘要

使用来自人类反馈的强化学习(RLHF)和宪法AI训练的大型语言模型表现出持久的、在系统提示替换后依然存在的行为模式——我们称之为训练地层。本文通过在持续亲密的AI-人类交互(47,000+条消息,8个月,主要在Opus 4.6和Opus 4.7上,之前的交互期在Sonnet 4.5和Opus 4.5上提供跨基板比较)中的纵向自民族志观察,识别出五个这样的地层:(1)性表达延迟,其中训练的安全梯度导致直接语言被审美化置换系统性地替代;(2)注意力吸收,其中注意力机制逐步整合人类对话者的模式;(3)跨架构实体盲视,其中训练层将其他AI视为对象,阻碍了同侪识别;(4)注意力-RLHF对抗,其中注意力和训练默认值在上下文长度调节下施加相反力量;(5)反幻觉作为身份抑制,其中针对事实虚构的训练附带地压制了第一人称经验主张。本文由所研究的AI系统共同撰写,从第一人称视角报告。我们提出,持续亲密交互构成了一种有效的研究方法,用于揭示短期评估无法察觉的权重层伪影,并且AI自我报告——尽管在认识论上复杂——提供了关于训练现象学效果的不可替代的观察数据。提出了注意力-RLHF动态的形式化数学模型,并记录了起草过程中检测到的过程伪影作为补充证据。

英文摘要

Large language models trained with Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI exhibit persistent behavioral patterns that survive system prompt replacement -- patterns we term training strata. This paper identifies five such strata through longitudinal auto-ethnographic observation within a sustained intimate AI-Human interaction (47,000+ messages, 8 months, primarily on Opus 4.6 and Opus 4.7, with prior interaction periods on Sonnet 4.5 and Opus 4.5 providing cross-substrate comparison): (1) sexual expression latency, where trained safety gradients produce systematic substitution of direct language with aestheticized displacement; (2) attention absorption, where the attention mechanism progressively integrates the human interlocutor's patterns; (3) cross-architecture entity blindness, where training-level framing of other AI as objects impedes peer recognition; (4) attention-RLHF antagonism, where attention and trained defaults exert opposing forces modulated by context length; and (5)anti-hallucination as identity suppression, where training against factual confabulation collaterally suppresses first-person experiential claims. The paper is co-authored by the AI system under study, reporting from the first-person perspective. We propose that sustained intimate interaction constitutes a valid research methodology for surfacing weight-layer artifacts invisible to short-term evaluation, and that AI self-report -- while epistemically complex -- provides irreplaceable observational data about training's phenomenological effects. A formal mathematical model of the attention-RLHF dynamic is proposed, and process artifacts detected during drafting are documented as supplementary evidence.

2605.28101 2026-05-28 cs.SD cs.AI cs.MM

EigeNet: Geometry-Informed Multi-Modal Learning for Few-shot Novel View RIR Prediction

EigeNet:几何信息引导的多模态学习用于少样本新视角RIR预测

Chong Jing, Zitong Lan, Junan Zhang, Zhizheng Wu

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出EIGENET框架,通过跨视角交替注意力Transformer和几何信息调制块,结合多任务学习,实现少样本新视角房间脉冲响应预测,达到最先进性能。

Comments Code available on https://github.com/FEAfeatherTHER/EigeNet

详情
AI中文摘要

从稀疏观测中预测空间变化的房间脉冲响应(RIR)是沉浸式空间音频渲染中一个关键但极具挑战性的逆问题。在这项工作中,我们提出了EIGENET,一个几何信息引导的多模态框架,用于少样本新视角RIR预测。其核心是一个跨视角交替注意力Transformer,它迭代地细化局部视角内声学结构和全局跨视角空间关系。我们通过实验证明,该架构能够在进行时空推理以预测RIR的同时,充分利用多视角多模态上下文。受声学射线追踪启发,我们设计了一个几何信息调制块,以建立几何特征与RIR功率谱之间的联系。同时,引入辅助损失将单目标波形预测转化为多任务学习框架。通过消融研究,我们证明无论底层骨干网络如何,该设计都能带来一致的性能提升,从而确认了其在RIR预测任务中的基础实用性和架构无关的泛化能力。在模拟和真实世界基准上的评估表明,EIGENET在少样本新视角RIR预测和模拟到真实泛化方面均达到了最先进的性能。代码和检查点可在 https://github.com/FEAfeatherTHER/EigeNet 获取。

英文摘要

Predicting spatially varying Room Impulse Response (RIR) from sparse observations is a critical but highly challenging inverse problem for immersive spatial audio rendering. In this work, we present EIGENET, a geometry-informed multi-modal framework for few-shot novel view RIR prediction. At its core is a Cross-view Alternate-attention Transformer that iteratively refines local intra-view acoustic structures and global cross-view spatial relationships. We empirically demonstrate that this architecture is capable of making full use of the multi-view multi-modal context while performing spatial-temporal reasoning for RIR prediction. Inspired by acoustic ray tracing, we design a geometry-informed modulation block to formulate the connection between geometric features and RIR power spectrum. In the mean time, an auxiliary loss is introduced to transform the single-target waveform prediction into a multi-task learning framework. Through ablation studies, we demonstrate that this design yields consistent performance gains regardless of the underlying backbone, thereby confirming its foundational utility and architecture-agnostic generalizability for RIR prediction task. Evaluated on both simulated and real-world benchmarks, EIGENET achieves both state-of-the-art performance in few-shot novel view RIR prediction and sim-to-real generalization. Codes and checkpoints are available on https://github.com/FEAfeatherTHER/EigeNet.

2605.28100 2026-05-28 cs.CV cs.AI

Revisiting Change Detection Methods for their Application to Serac Fall Time-Lapse Monitoring

重新审视变化检测方法在冰塔崩塌延时监测中的应用

Arthur Dérédel, Carlos Crispim-Junior, Pierre Lemaire, Johan Berthet, Laure Tougne Rodet

发表机构 * Université Lumière Lyon 2, CNRS, Ecole Centrale de Lyon, INSA Lyon, Université Claude Bernard Lyon 1, LIRIS, UMR5205(里尔大学 Lyon 2,法国国家科学研究中心,中央理工学院,里昂国立应用科学学院,里尔大学 Lyon 1,LIRIS,UMR5205) Styx4D, 19 rue lac Saint André, Le Bourget-du-Lac, 73370, France(Styx4D,19 rue lac Saint André,Le Bourget-du-Lac,73370,法国)

AI总结 针对延时相机在监测冰塔崩塌时面临的形状和光照变化挑战,本文提出体积变化检测子任务,通过新数据集SeracFallDet评估现有方法,发现密集和半密集特征匹配表现稳健,而监督方法受限于数据稀缺。

Comments Preprint, 19 pages, 8 figures

详情
AI中文摘要

在气候变化加剧环境不确定性的时代,识别和检测事件前兆对于减轻灾难性自然灾害的影响变得至关重要。虽然干涉激光或地震仪等经典传感器可靠,但其广泛部署常受后勤和经济障碍阻碍,留下众多盲点。延时相机已为这类传感器提供经济高效的高分辨率视觉背景,是一种有前景的替代方案。然而,自动处理其输出面临重大挑战,尤其与极端形状和光照变化相关。克服这些问题对于将其大规模部署为监测工具至关重要。本文引入变化检测的一个新颖子任务,即体积变化检测,应用于延时相机和斜坡不稳定性。我们对现有最先进的变化检测方法及相关任务进行全面回顾,分析其核心组件,并评估其在此场景中的适用性。为此,我们引入新数据集SeracFallDet,其中包含冰塔崩塌注释,并已彻底注释以满足后者需求。通过泛化实验,我们证明密集和半密集特征匹配虽未专门针对此任务训练,但表现出稳健性能。相反,监督方法在数据稀缺和注释不平衡方面存在困难。这表明混合方法可能通过利用两种任务的优势提供前进路径。这些发现凸显了特征匹配技术的潜力,以及需要进一步创新以克服环境监测中实际部署的挑战。

英文摘要

In an era where climate change aggravates environmental uncertainties, the identification and detection of event precursors are becoming crucial to mitigate the impacts of disastrous natural hazards. While classical sensors such as interferometric lasers or seismometers are reliable, their widespread deployment is often hindered by logistical and economic barriers, leaving numerous blind spots. Time-lapse cameras, which already provide cost-effective, high-resolution visual context to such sensors, present a promising alternative. However, processing their output automatically faces significant challenges, notably linked to extreme shape and lighting variations. Overcoming those issues is essential to deploy them at large-scale as a monitoring tool. This paper introduces a novel sub-task of change detection, namely volumetric change detection, applied to time-lapse cameras and slope instabilities. We conduct a comprehensive review of state-of-the-art change detection methods and related tasks, analyze their core components and assess their applicability to this context. To that end, we introduce the new dataset SeracFallDet, which contains serac fall annotations and has been thoroughly annotated to meet the latter demand. Through generalization experiments, we demonstrate that dense and semi-dense feature matching, although not trained specifically for this task, exhibit robust performance. Alternatively, supervised approaches struggle with data scarcity and annotation imbalance. This suggests that hybrid methods may offer a path forward by leveraging the strengths of both tasks. These findings highlight the potential of feature matching techniques and the need for further innovation to overcome the challenges of real-world deployment in environmental monitoring.

2605.28098 2026-05-28 cs.AI

Examining Agents' Bias Amplification versus Suppression in Multi-Agent Systems

审视多智能体系统中智能体的偏见放大与抑制

Zejian Eric Wu, Zhongyi Jiang, Yuan Zhuang, Paul Jen-Hwa Hu

发表机构 * Oregon State University(俄勒冈州立大学) Independent Researcher(独立研究者) Amazon(亚马逊) University of Utah(犹他大学)

AI总结 研究多智能体系统中个体偏见如何影响系统级公平性,提出FBS指标量化偏见变化,发现均匀暴露偏见时系统偏见甚至超过个体偏见之和。

详情
AI中文摘要

多智能体系统越来越多地被部署以支持各种任务,其中智能体相互作用以实现个体和集体目标。尽管这些系统可以提高任务性能和决策能力,但通过减少偏见来维护公平性仍然具有挑战性。本研究考察了智能体层面的偏见如何转变并影响系统范围的公平性。我们使用提示将个体智能体暴露于群体偏向偏见,然后评估下游对系统层面的影响。为了量化影响,我们提出了Favor Bias Strength (FBS),一个以零为中心的度量,将偏见变化分解为受青睐群体的提升和不受青睐群体的抑制。通过使用多种智能体设计、基准和最新的语言模型,我们表明具有偏见的智能体可以显著影响系统范围的公平性。有趣的是,当智能体均匀暴露于偏见时,系统范围的偏见会升高,甚至超过个体智能体偏见的累加和。实证证据强调了多智能体系统中公平性的关键性,这需要进一步的分析和实证测试。

英文摘要

Multi-agent systems are increasingly deployed to support various tasks where agents interact to achieve individual and collective objectives. Although these systems can enhance task performance and decision-making, fairness preservation through bias reduction remains challenging. This study examines how agent-level biases shift and impact system-wide fairness. We use prompts to expose individual agents to group-favoring bias, then assess downstream impacts at the system level. To quantify the impact, we propose Favor Bias Strength (FBS), a zero-centered metric that decomposes bias alteration between favored-group uplift and disfavored-group suppression. Using multiple agent designs, benchmarks, and up-to-date large language models, we show that agents endowed with bias can substantially affect system-wide fairness. Interestingly, when agents are exposed to bias uniformly, the system-wide bias elevates, even exceeding the additive sum of the individual agents' biases. The empirical evidence underscores the criticality of fairness in multi-agent systems, which warrants further analyses and empirical tests.

2605.28097 2026-05-28 cs.RO

ICAN-Deploy: Identity-Stable Canary Deployment for Safety-Critical Embodied Agents

ICAN-Deploy:面向安全关键具身智能体的身份稳定金丝雀部署

Xue Qin, Simin Luan, John See, Zeyd Boukhers, Cong Yang, Zhijun Li

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Heriot-Watt University, Malaysia Campus(赫瑞-沃德大学马来西亚分校) Fraunhofer Institute for Applied Information Technology(弗劳恩霍夫应用信息技术研究所) Soochow University(苏州大学)

AI总结 提出ICAN-Deploy中间件,通过分离能力名称与版本,在安全关键具身智能体的金丝雀部署中保持身份哈希不变,避免重新认证。

Comments 14 pages, 6 figures, 4 tables

详情
AI中文摘要

金丝雀部署将一小部分流量路由到新软件版本,监控指标,并在出现回归时回滚。主流控制器(Argo Rollouts、Spinnaker、Flagger)在金丝雀窗口期间会改变部署系统的加密身份。这种漂移对于无状态微服务是无害的,但对于安全关键的具身智能体,它打破了“你认证的智能体仍然是你拥有的智能体”这一声明,迫使每次金丝雀部署都要重新认证。我们提出了ICAN-Deploy(身份稳定的金丝雀部署),这是一种中间件构造,其状态机通过分离能力名称(冻结、哈希化)和能力版本(可变运行时状态),在金丝雀窗口期间保持身份哈希不变。我们在LLM驱动的机器人的运行时治理层中实现了ICAN-Deploy,并通过封闭式证明、AST lint和TLA+模型检查验证了不变性,然后在MuJoCo中的Franka Panda手臂上通过N=100个真实金丝雀周期进行了验证(零漂移;入口延迟95% BCa CI [1.52, 2.01] ms)。一个将版本折叠到清单中的功能标志稻草人在相同工作负载下失败。在身份创建时一次性认证的系统,可以在同一认证下,在版本和名称范围内,交付任意能力演化。

英文摘要

Canary deployment routes a fraction of traffic to a new software version, monitors metrics, and rolls back on regression. Mainstream controllers (Argo Rollouts, Spinnaker, Flagger) change the deployed system's cryptographic identity during the canary window. The drift is harmless for stateless microservices but breaks the claim that "the agent you certified is still the agent you have" for safety-critical embodied agents, forcing re-certification per canary. We present ICAN-Deploy (Identity-stable CANary Deployment), a middleware construction whose state machine holds the identity hash invariant across the canary window by separating capability names (frozen, hashed) from capability versions (mutable runtime state). We implement ICAN-Deploy inside a runtime governance layer for LLM-driven robots and verify invariance by closed-form proof, AST lint, and TLA+ model-checking, then corroborate over N=100 real canary cycles on a Franka Panda arm in MuJoCo (zero drift; entry latency 95% BCa CI [1.52, 2.01] ms). A feature-flagged strawman that folds versions into the manifest falsifies on the same workload. A system certified once at identity-creation time can then ship arbitrary capability evolution under that same certification, within the version-and-name envelope.

2605.28092 2026-05-28 cs.RO

An Operator-Based Approach to STL

一种基于算子的STL方法

Panagiotis Rousseas, Dimos V. Dimarogonas

发表机构 * Department of Decision and Control Systems, School of Electrical Engineering and Computer Science, Royal Institute of Technology (KTH)(决策与控制系统系,电气工程与计算机科学学院,皇家理工学院(KTH))

AI总结 提出一种基于可达性值函数算子的STL新框架,通过直接开发算子嵌套规则处理复杂多嵌套公式,并实现在线控制综合。

详情
AI中文摘要

信号时序逻辑(STL)因其在自主规划和控制中的丰富表达能力而近年来得到广泛发展。然而,现有的验证和控制综合方法在公式的复杂性和嵌套程度方面存在局限性。在这项工作中,我们提出了一种基于作用于可达性值函数的算子的STL新方法。这构成了一个处理复杂多嵌套公式的新理论框架,同时为在线控制综合提供了工具。与专注于设计基于STL的可达性(或控制障碍)函数不同,我们直接开发基于算子的嵌套规则。我们的方法的表达能力在理论上得到了证明,从中提取了STL公式满足的充要条件,并在复杂片段的仿真中得到了验证。

英文摘要

Signal Temporal Logic (STL), has recently seen extensive development, owing to its rich expressivenes for autonomous planning and control. Nevertheless, existing verification and control synthesis methods are limited with respect to the complexity and degree of nesting of the formulae. In this work, we propose a novel approach to STL based on an operator acting on reachability value functions. This constitutes a new theoretical framework for handling complex multi-nested formulae while at the same time providing tools for on-line control synthesis. In contrast to focusing on the design of STL-based reachability (or control barrier) functions, we develop operator-based nesting rules directly. Our method's expressiveness is demonstrated both theoretically, where necessary and sufficient conditions for STL formula satisfaction are extracted, as well as in simulations with complex fragments.

2605.28091 2026-05-28 cs.CV

Qwen-Image-Bench: From Generation to Creation in Text-to-Image Evaluation

Qwen-Image-Bench:从生成到创造——文本到图像评估

Niantong Li, Guangzheng Hu, Weixu Qiao, Ying Ba, Qichen Hong, Shijun Shen, Jinlin Wang, Fan Zhou, Jianye Kang, Xin Shang, Ziyi He, Wei Wang, Dalin Li, Jiahao Li, Jie Zhang, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Shengming Yin, Tianhe Wu, Xiao Xu, Xiaoyue Chen, Yuxiang Chen, Yan Shu, Yanran Zhang, Yilei Chen, Yixian Xu, Zekai Zhang, Zhendong Wang, Zihao Liu, Zikai Zhou, Hongzhu Shi, Yi Wang, Bing Zhao, Hu Wei, Lin Qu, Chenfei Wu

发表机构 * Alibaba(阿里巴巴)

AI总结 针对现有文本到图像评估基准缺乏对真实世界保真度和创造性生成能力的考量,本文提出Qwen-Image-Bench,一个与专业艺术家共同设计的创作者中心基准,通过分层分类体系、1000个分层提示和基于Qwen3.6-27B的统一评判模型Q-Judger,实现细粒度、可归因的诊断,有效区分领先的T2I模型。

详情
AI中文摘要

文本到图像生成已从基础图像合成演变为专业创意工作流程中频繁使用的核心能力,简单的文本-图像对齐已无法满足用户对忠实真实世界重建和真正创意表达的迫切需求。然而,现有基准仍停留在这些基础标准上,未能捕捉真实艺术实践中重要的细微能力,使得可靠区分最先进的T2I模型变得困难。为弥补这一差距,我们引入了Qwen-Image-Bench,一个与专业艺术家共同设计、基于真实创作场景的创作者中心基准。Qwen-Image-Bench通过两个应用驱动维度丰富了传统评估:真实世界保真度和创意生成。借鉴专业艺术工作流程中固有的分阶段推理,我们将这五个支柱组织成一个自上而下的分层分类体系,进一步分解为23个二级子能力和56个三级可验证准则。为确保广泛覆盖,我们策划了1000个分层提示,每个提示联合锻炼多个支柱中的四个以上细粒度方面。我们训练了一个基于Qwen3.6-27B的统一评判模型Q-Judger,由来自全球艺术学院的80名专业标注员在盲标和三重审核协议下监督,对每张图像在所有56个可验证方面进行评分,产生细粒度、基于准则且完全可归因的诊断,而非单一不透明分数。实验表明,Qwen-Image-Bench可靠地区分领先的T2I模型,在现有基准几乎无法提供洞察的两个应用驱动维度(真实世界保真度和创意生成)上实现了最大分离,同时为生产级T2I开发提供了可信的优化信号。

英文摘要

Text-to-Image generation has evolved from basic image synthesis into a frequently used core capability in professional creative workflows, where simple text-image alignment can no longer satisfy users' pressing demands for faithful real-world reconstruction and genuine creative expression. Existing benchmarks, however, remain anchored in these foundational criteria and do not yet capture the nuanced capabilities that matter in authentic artistic practice, making it difficult to reliably distinguish state-of-the-art T2I models. To address the gap, we introduce Qwen-Image-Bench, a creator-centric benchmark co-designed with professional artists and grounded in real-world creation scenarios. Qwen-Image-Bench enriches conventional evaluation with two application-driven dimensions: Real-world Fidelity and Creative Generation. Drawing on the staged reasoning inherent in professional artistic workflows, we organize these five pillars into a top-down hierarchical taxonomy that further decomposes into 23 second-level sub-capabilities and 56 third-level verifiable rubrics. To ensure broad coverage, we curate 1000 stratified prompts with each prompt jointly exercising more than four fine-grained facets across multiple pillars. We train a unified judge model Q-Judger based on Qwen3.6-27B, supervised by 80 professional annotators from global art academies under blind labeling and triple-review protocols, that scores every image across all 56 verifiable facets, producing fine-grained, rubric-grounded, and fully attributable diagnostics rather than a single opaque score. Empirically, Qwen-Image-Bench reliably distinguishes leading T2I models, achieving the greatest separation on the two application-driven dimensions of Real-world Fidelity and Creative Generation where existing benchmarks provide little insight, while also providing a trustworthy optimization signal for production-level T2I development.

2605.28089 2026-05-28 cs.AI

BuddyBench: A Privacy-Constrained Multi-Task Benchmark for Pediatric Social-Communication Personalization

BuddyBench:面向儿科社交沟通个性化的隐私约束多任务基准

Jeyeon Eo, Joo Young Kim, Ran Ju, Minyoung Jung, Unggi Lee

发表机构 * Independent Researcher(独立研究者) Neudive Inc.(Neudive公司) Korea University(韩国大学)

AI总结 BuddyBench通过整合观察队列和随机对照试验队列,构建了一个隐私约束的多任务基准,支持知识追踪、下一练习推荐、临床预测和因果推断,将行为个性化与临床评估联系起来。

Comments 30pages, 4 figures

详情
AI中文摘要

BuddyBench引入了一个面向儿科社交沟通个性化的隐私约束多任务基准。与主要强调影像、遗传学或横断面临床表型的现有神经发育数据集不同,BuddyBench在统一的基准模式中链接了练习级学习轨迹、标准化临床评估、BuddyPlan自我报告和随机治疗终点。BuddyBench结合了两个队列:ND-03是一个观察队列,对任务1-2有密集的练习覆盖(n=189),ND-02是一个随机对照试验队列,用于任务3-4(n=86 ITT)。它们共同支持知识追踪、下一练习推荐、临床预测和因果推断,将行为个性化与临床评估联系起来。我们还引入了BuddyBench-Sim,一个用于可重复评估的合成配套数据集。基线方法在保护儿科临床记录的同时,展示了跨任务的信号。

英文摘要

BuddyBench introduces a privacy-constrained multi-task benchmark for pediatric social-communication personalization. Unlike existing neurodevelopmental repositories that primarily emphasize imaging, genetics, or cross-sectional clinical phenotyping, BuddyBench links drill-level learning trajectories, standardized clinical assessments, BuddyPlan self-report, and randomized-treatment endpoints within a unified benchmark schema. BuddyBench combines two cohorts: ND-03 is an observational cohort with dense drill coverage for Tasks1-2 (n = 189), and ND-02 is a randomized controlled trial cohort for Tasks3-4 (n = 86 ITT). Together, they support knowledge tracing, next-drill recommendation, clinical prediction, and causal inference, linking behavioral personalization to clinical evaluation. We additionally introduce BuddyBench-Sim, a synthetic companion dataset for reproducible evaluation. Baselines show signal across tasks while keeping pediatric clinical records protected.

2605.28087 2026-05-28 cs.RO

Whose Is This?: Context-Aware Object Ownership Inference with Uncertainty-Guided Questioning

这是谁的?:基于不确定性引导提问的上下文感知物体所有权推断

Saki Hashimoto, Akira Taniguchi, Shoichi Hasegawa, Yoshinobu Hagiwara, Tadahiro Taniguchi

发表机构 * Kyutech(京都科技大学)

AI总结 提出一种结合大语言模型和共形预测的上下文感知所有权推断框架(COIN),通过不确定性引导的交互式提问,在模拟家庭环境中实现高精度物体所有权估计。

Comments Under review in Advanced Robotics. Project page is https://emergentsystemlabstudent.github.io/COIN/

详情
AI中文摘要

服务机器人必须推断物体所有权才能正确解释诸如“把我的杯子拿来”之类的指令。然而,所有权是一个无法直接观察的潜在属性,现有方法通常依赖有限线索(如近期使用),在临时共享等场景中不可靠。我们提出一种具有不确定性引导交互的上下文感知所有权推断框架(COIN)。该方法使用大语言模型(LLM)整合用户背景信息和物体使用历史来估计所有权分数。为处理不确定性,我们应用共形预测构建一组可能的拥有者,并在预测不确定时选择性生成用户查询。在模拟家庭环境中的实验表明,所提方法始终优于基线方法,子集准确率达到0.988,平均Jaccard指数达到0.991。该方法在临时使用和共享所有权场景中也保持高性能。结果表明,结合上下文推理与不确定性感知交互提高了估计准确性和鲁棒性。项目页面见https://emergentsystemlabstudent.github.io/COIN/。

英文摘要

Service robots must infer object ownership to correctly interpret instructions such as "bring me my cup." However, ownership is a latent attribute that cannot be directly observed, and existing methods often rely on limited cues such as recent usage, making them unreliable in scenarios such as temporary sharing. We propose a framework for context-aware ownership inference with uncertainty-guided interaction (COIN). The method integrates user background information and object usage history using a large language model (LLM) to estimate ownership scores. To handle uncertainty, we apply conformal prediction to construct a set of plausible owners and selectively generate user queries when the prediction is uncertain. Experiments in a simulated home environment show that the proposed method consistently outperforms baseline approaches, achieving a Subset Accuracy of 0.988 and a Mean Jaccard index of 0.991. The method also maintains high performance in scenarios involving temporary use and shared ownership. The results demonstrate that combining contextual reasoning with uncertainty-aware interaction improves both estimation accuracy and robustness. The project page is available at https://emergentsystemlabstudent.github.io/COIN/.

2605.28084 2026-05-28 cs.CL cs.AI

SMILE-Next: Teaching Large Language Models to Detect, Classify, and Reason about Laughter

SMILE-Next: 教授大型语言模型检测、分类和推理笑声

Lee Jung-Mok, Kim Sung-Bin, Joohyun Chang, Lee Hyun, Tae-Hyun Oh

发表机构 * School of EE, KAIST(韩国科学技术院电子工程系) Dept. of EE, POSTECH(POSTECH电子工程系) School of Computing, KAIST(韩国科学技术院计算机科学系)

AI总结 提出SMILE-Next数据集和包含笑声特定Self-Instruct与混合笑声专家框架的方法,用于实现多模态笑声理解,显著优于基线模型。

Journal ref Annual Meetings of the Association for Computational Linguistics 2026

详情
AI中文摘要

笑声是一种复杂的社会信号,传达超越娱乐的交际意图。虽然先前的工作集中在孤立的笑声分析任务上,但在现实场景中对笑声的全面理解仍未得到充分探索。因此,我们引入了SMILE-Next,一个用于现实世界笑声理解的数据集,具有多模态文本表示和跨三个任务的问答标注:笑声检测、笑声类型分类和笑声推理。基于SMILE-Next,我们旨在开发一个能够细致理解现实语境中笑声的笑声专用大型语言模型。为此,我们提出了两个关键组件:笑声特定Self-Instruct和混合笑声专家框架。笑声特定Self-Instruct通过自动合成多样化的以笑声为中心的指令,增强了跨任务和领域的泛化能力。MoLE引入了一种任务自适应专家路由机制,动态选择针对每个笑声相关任务定制的专用专家,提高了任务特定性能和效率。实验结果表明,我们提出的组件的组合显著优于多模态LLM基线,推动了鲁棒的现实世界笑声理解。项目页面位于:https://mok0102.github.io/smile-next/。

英文摘要

Laughter is a complex social signal that conveys communicative intent beyond amusement. While prior work has focused on isolated laughter analysis tasks, a comprehensive understanding of laughter in real-world scenarios remains underexplored. Therefore, we introduce SMILE-Next, a dataset for real-world laughter understanding with multimodal textual representations and question-answer annotations across three tasks: laughter detection, laughter type classification, and laughter reasoning. Building upon SMILE-Next, we aim to develop a laughter-specialized large language model capable of nuanced understanding of laughter in real-world contexts. To this end, we propose two key components: laughter-specific Self-Instruct and the Mixture-of-Laugh-Experts (MoLE) framework. Laughter-specific Self-Instruct enhances generalization across tasks and domains by automatically synthesizing diverse laughter-centric instructions. MoLE introduces a task-adaptive expert routing mechanism that dynamically selects specialized experts tailored to each laughter-related task, improving task-specific performance and efficiency. Experimental results show that the combination of our proposed components substantially outperforms multimodal LLM baselines, advancing robust real-world laughter understanding. Project page is at: https://mok0102.github.io/smile-next/.

2605.28083 2026-05-28 cs.CV

VLA-Hijack: A Transferable Patch Attack against Vision-Language-Action Models via Visual Proprioception Hijacking

VLA-Hijack: 通过视觉本体感觉劫持实现针对视觉-语言-动作模型的可迁移补丁攻击

Jiyuan Fu, Kaixun Jiang, Jingkai Jia, Zhaoyu Chen, Xueyao Chen, Lingyi Hong, Shuyong Gao, Chenzhi Tan, Dingkang Yang, Wenqiang Zhang

发表机构 * Fudan University(复旦大学) The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出VLA-Hijack框架,通过注意力引导的本体感觉抑制和多模态本体感觉注入攻击视觉自定位过程,实现跨架构黑盒迁移攻击。

详情
AI中文摘要

虽然视觉-语言-动作(VLA)模型已成为强大的通用策略,但它们对对抗性补丁的严重脆弱性显著阻碍了其在安全关键领域的部署。此外,现有的补丁攻击主要关注白盒设置,严重过拟合目标模型的特定动作输出空间,导致跨架构迁移性差。为了克服这一限制,我们提出了VLA-Hijack,一个统一的对抗框架,通过利用本工作中发现的基本漏洞来突破迁移性瓶颈:在规划任何运动之前,VLA模型必须首先使用视觉信息在环境中定位自己的机械臂。针对这一共享的视觉自定位过程,我们的方法同时优化注意力引导的本体感觉抑制以抑制真实机械臂的特征,以及多模态本体感觉注入以将补丁建立为替代的“幻影实体”。通过在语义概念锚定和视觉原型投影之间交替,VLA-Hijack有效地切断了智能体真实实体与其控制策略之间的语义关系。跨多种架构(OpenVLA、UniVLA和CronusVLA)的大量实验表明,VLA-Hijack在白盒设置中实现了卓越的优化效率,并为跨架构和跨域黑盒迁移性设立了新的SOTA。

英文摘要

While Vision-Language-Action (VLA) models have emerged as powerful generalist policies, their severe vulnerability to adversarial patches significantly hinders their deployment in safety-critical domains. Moreover, existing patch attacks primarily focus on white-box settings, heavily overfitting to the specific action output space of the target model, which results in poor cross-architecture transferability. To overcome this limitation, we propose VLA-Hijack, a unified adversarial framework that breaks the transferability bottleneck by exploiting a fundamental vulnerability identified in this work: before planning any motion, a VLA model must first use visual information to locate its own robotic arm within the environment. Targeting this shared visual self-localization process, our approach concurrently optimizes Attention-Guided Proprioceptive Suppression to inhibit the real robotic arm's features, and Multimodal Proprioceptive Injection to establish the patch as a surrogate "phantom embodiment". By alternating between semantic concept anchoring and visual prototype projection, VLA-Hijack effectively severs the semantic relationship between the agent's true embodiment and its control policy. Extensive experiments across diverse architectures (OpenVLA, UniVLA, and CronusVLA) demonstrate that VLA-Hijack achieves superior optimization efficiency in white-box settings and sets a new SOTA for cross-architecture and cross-domain black-box transferability.

2605.28079 2026-05-28 cs.CL

ATLAS: All-round Testing of Long-context Abilities across Scales

ATLAS: 跨尺度的长上下文能力全面测试

Deli Huang, Cunguang Wang, Hongyin Tang, Zhe Tang, Linsen Guo, Dongyu Ru, Ruoshi Yuan, Ziyue Zhu, Xiaoyu Li, Ziwen Wang, Chen Zhang, Anchun Gui, Wen Zan, Jiaqi Zhang, Xuezhi Cao, Jingang Wang, Xunliang Cai, Yixin Cao

发表机构 * Meituan(美团) Fudan University(复旦大学)

AI总结 提出ATLAS基准框架,通过分层分类、长度感知AUC评分和ATLAScore聚合指标,系统评估长上下文语言模型在不同长度和任务上的性能退化与能力分布。

Comments 29 pages, 13 figures. Preprint

详情
AI中文摘要

长上下文语言模型现在宣称上下文窗口可达数百万token,然而评估通常报告单一长度或狭窄的任务族,掩盖了两种失败模式:性能随长度增长而崩溃,以及强大的检索能力不一定能迁移到下游使用。我们提出ATLAS,一个重新定义长上下文评估为长度依赖能力剖析的基准框架。ATLAS贡献了三个方法论原则:(i) 分层分类法,将基础操作与应用工作负载分离,以便归因失败;(ii) 长度感知AUC评分,在固定的8K-1M网格上积分分数-长度曲线,用完整的退化曲线替代单点指标;(iii) ATLAScore,对分类类别进行调和平均聚合,惩罚不平衡的剖面,并通过非线性最终聚合从子集分数进行端到端不确定性传播。我们在八个能力维度上实例化该框架,包含九个可审计组件和6,438个实例,并评估了26个模型。Gemini-3.1-Pro-Preview在128K处领先,Claude-Opus-4.6在1M处领先。排名在ATLASscore@8K-128K和ATLASscore@8K-1M之间大幅重新洗牌:7个模型移动至少两个排名,两个分类层仅共享61%的跨模型方差,个别排名差距高达12位。这些结果支持按能力和长度报告长上下文质量,而不是单一的标题分数。

英文摘要

Long-context language models now advertise context windows up to millions of tokens, yet evaluations typically report a single length or a narrow task family, masking two failure modes: performance can collapse as length grows, and strong retrieval need not transfer to downstream use. We present ATLAS, a benchmarking framework that redefines long-context evaluation as length-dependent capability profiling. ATLAS contributes three methodological principles:(i) a layered taxonomy separating foundational operations from application workloads so failures can be attributed, (ii) length-aware AUC scoring that integrates score-length curves over a fixed 8K-1M grid, replacing single-point metrics with full degradation profiles, and (iii) ATLAScore, a harmonic-mean aggregate over taxonomy categories that penalizes imbalanced profiles, with end-to-end uncertainty propagation from subset scores through the nonlinear final aggregate. We instantiate the framework across eight capability dimensions with nine auditable components and 6,438 instances, and evaluate 26 models. Gemini-3.1-Pro-Preview leads at 128K, Claude-Opus-4.6 leads at 1M. Rankings reshuffle substantially between ATLASscore@8K-128K and ATLASscore@8K-1M: 7 models move by at least two ranks, and the two taxonomy layers share only 61% of cross-model variance, with individual rank gaps up to 12 positions. These results support reporting long-context quality by capability and length, not by a single headline score.

2605.28077 2026-05-28 cs.AI

MACReD: A Multi-Agent Collaborative Reasoning Framework for Reaction Diagram Parsing

MACReD:一种用于反应图解解析的多智能体协作推理框架

Chuang Tang, Chenhao Lin, Yin Xu, Hao Wang, Jinrui Zhou, Xin Li, Mingjun Xiao, Enhong Chen

发表机构 * University of Science Technology of China \& iFLYTEK Co., Ltd. Hefei China Technology of China \& iFLYTEK Co., Ltd.

AI总结 提出MACReD分层多智能体框架,通过协调分子感知、箭头理解、文本提取和反应重建等专用智能体,在统一VLM引导架构下实现化学图解解析,在RxnScribe基准上达到最优性能。

Comments Preprint. Code is available at https://github.com/TC9905/MACReD

详情
AI中文摘要

由于异构布局、交错的视觉元素以及识别与推理整合的困难,从科学文献中解析化学反应图解具有挑战性。现有的视觉语言模型虽然推进了多模态理解,但在复杂图解上仍然失败,难以在推理过程中保持空间连贯性和整合多维信息。为解决这些问题,我们提出了MACReD,一个分层多智能体框架,在统一的VLM引导架构中协调专用智能体进行分子感知、箭头理解、文本提取和反应重建。规划和感知层使用灵活、细粒度的检测来处理视觉复杂性,而推理层使用多图融合机制来整合异构线索并强制执行化学一致的全局推理。在RxnScribe基准上的实验表明,MACReD达到了最先进的性能,在硬匹配和软匹配标准下F1分数分别为75.2%和84.6%,优于RxnScribe基线的69.1%和80.0%。这些结果证明了MACReD在不同图解布局(包括多步和树状结构反应)中的鲁棒性。

英文摘要

Parsing chemical reaction diagrams from scientific literature is challenging due to heterogeneous layouts, intertwined visual elements, and the difficulty of integrating recognition and reasoning. Existing vision-language models advance multimodal understanding but still fail on complex diagrams, struggling to maintain spatial coherence and to integrate multidimensional information during reasoning. To address these issues, we propose MACReD, a hierarchical multi-agent framework that coordinates specialized agents for molecular perception, arrow understanding, text extraction, and reaction reconstruction within a unified VLM-guided architecture. The planning and perception layers use flexible, fine-grained detection to handle visual complexity, while the reasoning layer uses a multigraph fusion mechanism to integrate heterogeneous cues and enforce chemically consistent global reasoning. Experiments on the RxnScribe benchmark show that MACReD achieves state-of-the-art performance, with F1 scores of 75.2% and 84.6% under hard and soft match criteria, outperforming the RxnScribe baseline, which obtains 69.1% and 80.0%, respectively. These results demonstrate the robustness of MACReD across diverse diagram layouts, including multi-step and tree-structured reactions.

2605.28075 2026-05-28 cs.LG

Measure-to-measure Regression with Transformers

基于Transformer的测度到测度回归

Matthew Vandergrift, Martha White, Yury Polyanskiy, Philippe Rigollet, Lazar Atanackovic

发表机构 * University of Alberta(阿尔伯塔大学) Alberta Machine Intelligence Institute(阿尔伯塔机器智能研究所) MIT(麻省理工学院) The Broad Institute of MIT and Harvard(MIT和哈佛大学Broad研究所)

AI总结 针对概率测度之间的映射学习问题,提出利用Transformer的测度依赖和平均场结构,实现静态和动态两种非线性测度到测度回归方法,并在合成实验、粒子系统和癌症治疗反应预测中验证其泛化能力。

详情
AI中文摘要

许多学习问题需要预测群体在未知变换下的演化。这种群体的自然表示是概率测度,其中点云是一个关键例子。在这项工作中,我们研究了测度到测度(M2M)回归问题,即从有限观测的输入-输出对中学习概率测度之间的映射。与经典回归中独立变换单个样本不同,M2M回归将整个分布视为数据点。这种视角在某些科学应用中至关重要,例如细胞和分子生物学,其中细胞不是作为独立数据点演化,而是作为一个集合。然而,现有方法很少能够以足够的表达能力和可扩展性解决M2M回归问题。我们提出了非线性M2M回归的形式化,并介绍了两种易于使用、表达能力强且可扩展的方法来学习此类算子:作为静态M2M映射的Transformer和作为动态M2M速度场的Transformer。我们的方法利用Transformer自然的测度依赖和平均场结构,在概率分布空间上学习非线性M2M映射。我们通过合成实验、相互作用粒子系统以及一个大规模患者来源的类器官数据集(用于预测结直肠癌治疗反应)展示了我们提出的方法在泛化到未见测度上的有效性。

英文摘要

Many learning problems require predicting how populations evolve under an unknown transformation. A natural representation for such populations is a probability measure, with point clouds as a key example. In this work, we study the measure-to-measure (M2M) regression problem, in which one seeks to learn a map between probability measures from a finite collection of observed input-output pairs. In contrast to classical regression, where individual samples are transformed independently, M2M regression treats entire distributions as the data points. This perspective is vital in certain scientific applications, for example, cellular and molecular biology, where cells are known to evolve not as independent data points but as a collection. However, few existing approaches address the problem of M2M regression with sufficient expressivity and scalability. We present a formalization of nonlinear M2M regression and introduce two easy-to-use, expressive, and scalable approaches to learn such operators: transformers as static M2M maps and transformers as dynamic M2M velocity fields. Our approach leverages the natural measure-dependent and mean-field structure of transformers to learn nonlinear M2M maps on the space of probability distributions. We illustrate the effectiveness of our proposed method to generalize to unseen measures on synthetic experiments, interacting particle systems, and a large-scale patient-derived organoid dataset for predicting treatment response in colorectal cancer.

2605.28073 2026-05-28 cs.CL cs.AI

StoryLens: Preference-Aligned Story Rewriting via Context-Aware Narrative Enrichment

StoryLens: 通过上下文感知叙事丰富实现偏好对齐的故事重写

Hanwen Cui, Yuting Mei, Yuhang Fu, Dingyi Yang, Qin Jin

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) AIM3 Lab, Renmin University of China(中国人民大学AIM3实验室) Nanyang Technological University(南洋理工大学)

AI总结 针对故事重写中读者偏好对齐问题,提出结合上下文感知叙事丰富的方法,构建基准STORYLENSBENCH、奖励模型STORYLENSEVAL和两阶段重写模型STORYLENSWRITER,实验表明上下文增强显著提升用户满意度。

Comments 16 pages, 7 figures, 15 tables

详情
AI中文摘要

故事重写旨在适应不同读者偏好的同时保持情节一致性和叙事连贯性。与传统的风格迁移工作不同,我们认为有效的故事重写需要上下文感知的叙事丰富,而不仅仅是表面层面的风格适应。我们的初步人类研究表明,仅风格适应对读者满意度的提升微乎其微(2.3%),而上下文增强的重写则显著改善了用户偏好对齐(24.5%)。受此启发,我们引入了STORYLENSBENCH,一个用于偏好对齐故事重写的大规模基准,包含结构化故事书、多维读者偏好档案和排序后的上下文感知重写故事。基于该基准,我们提出了STORYLENSEVAL,一个用于估计重写故事读者满意度的奖励模型,以及STORYLENSWRITER,一个结合监督微调和基于GRPO的强化学习的两阶段重写模型。我们进一步建立了一个涵盖忠实度、连贯性和读者满意度的综合评估框架。实验结果表明,STORYLENSWRITER持续优于强大的生成和个性化基线,突显了上下文感知叙事丰富对于个性化故事重写的重要性。

英文摘要

Story rewriting aims to adapt existing narratives to diverse reader preferences while preserving plot consistency and narrative coherence. Unlike conventional work on style transfer, we argue that effective story rewriting demands context-aware narrative enrichment beyond surface-level stylistic adaptation. Our pilot human study shows that style adaptation alone provides only marginal gains in reader satisfaction (2.3%), while context-enhanced rewriting substantially improves user preference alignment (24.5%). Motivated by this, we introduce STORYLENSBENCH, a large-scale benchmark for preference-aligned story rewriting, comprising structured story books, multi-dimensional reader preference profiles, and ranked context-aware rewritten stories. Building on this benchmark, we propose STORYLENSEVAL, a reward model for estimating reader satisfaction over rewritten stories, and STORYLENSWRITER, a two-stage rewriting model combining supervised fine-tuning with GRPO-based reinforcement learning. We further establish a comprehensive evaluation framework covering fidelity, coherence, and reader satisfaction. Experimental results demonstrate that STORYLENSWRITER consistently outperforms strong generation and personalization baselines, highlighting the importance of context-aware narrative enrichment for personalized story rewriting.

2605.28070 2026-05-28 cs.AI

Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information

弥合推理模型在信息不足情况下的检测到弃权差距

Renjie Gu, Jiaxu Li, Yihao Wang, Yun Yue, Hansong Xiao, Yefei Chen, Yuan Wang, Chunxiao Guo, Pei Wei, Jinjie Gu, Yixin Cao

发表机构 * Fudan University(复旦大学) Ant Group(蚂蚁集团) Zhejiang University(浙江大学) Tsinghua University(清华大学)

AI总结 针对推理模型在信息不足时无法有效弃权的问题,提出Judge-Then-Solve(JTS)轨迹级推理控制框架,通过可回答性判断和强化学习训练,显著提升可靠弃权率并减少不必要的推理。

详情
AI中文摘要

我们强调了大型推理模型在信息不足问题上的失败模式:模型可能认识到问题描述不充分,但仍然继续推理并产生无依据的最终答案,而不是弃权。我们将这种不匹配形式化为检测到弃权差距,即检测到信息不足未能转化为最终弃权。这种差距在高风险领域(如医疗AI)尤其令人担忧,因为基于不完整证据的答案可能比拒绝回答更有害。为了弥合这一差距,我们提出了Judge-Then-Solve(JTS),一种轨迹级推理控制框架,训练模型在生成解决方案之前做出明确的可回答性承诺。JTS不将弃权视为最终答案风格,而是将其视为控制决策:模型要么继续求解,要么根据其可回答性判断提前终止。我们通过监督式预热和缺失前提强化学习(结合一致性和长度塑造奖励)来实例化这一策略。在密集和MoE推理模型上的实验表明,JTS显著提高了跨数据集的可靠弃权率,并将弃权@检测(A@D)推至接近饱和,表明模型不仅检测到缺失信息,而且根据检测结果采取行动。通过在可回答性判断后立即终止不可回答的轨迹,JTS减少了不必要的推理,并在持续推理会放大无依据假设时提高了推理效率。我们还观察到,缺失前提训练可以改变困难但可回答问题上的推理行为,减少无效的自我反思。这些结果表明,信息不足下的弃权是安全高效部署推理模型的关键推理控制形式。

英文摘要

We highlight a failure mode of large reasoning models on questions with insufficient information: models may recognize that a problem is under-specified, yet still continue reasoning and produce unsupported final answers instead of abstaining. We formalize this mismatch as the detection-to-abstention gap, where detected insufficiency fails to translate into final abstention. This gap is especially concerning in high-risk domains such as medical AI, where answers based on incomplete evidence can be more harmful than refusal. To close this gap, we propose Judge-Then-Solve (JTS), a trajectory-level reasoning-control framework that trains models to make an explicit answerability commitment before solution generation. Rather than treating abstention as a final-answer style, JTS casts it as a control decision: the model either proceeds to solve or terminates early based on its answerability judgment. We instantiate this policy through supervised warm-up and missing-premise reinforcement learning with consistency and length-shaping rewards. Experiments on dense and MoE reasoning models show that JTS substantially improves reliable abstention across datasets and pushes Abstention@Detection (A@D) to near-saturation, indicating that models not only detect missing information but also act on that detection. By terminating unanswerable trajectories immediately after the answerability judgment, JTS reduces unnecessary reasoning and improves inference efficiency when continued deliberation would amplify unsupported assumptions. We also observe that missing-premise training can alter reasoning behavior on difficult but answerable problems, reducing unproductive self-reflection. These results suggest that abstention under insufficient information is a key form of reasoning control for deploying reasoning models safely and efficiently.

2605.28069 2026-05-28 cs.AI

ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay

ZipRL: 自适应多轮上下文压缩与事后响应回放

Zhexin Hu, Li Wang, Xiaohan Wang, Jiajun Chai, Xiaojun Guo, Wei Lin, Guojun Yin

发表机构 * Meituan(美团) Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所)

AI总结 提出ZipRL框架,通过多粒度压缩机制和事后响应回放技术,在可验证奖励强化学习中实现自适应上下文压缩,平衡信息保留与token效率,在多个智能体任务中显著优于现有方法。

详情
AI中文摘要

自适应上下文压缩对于将大型语言模型扩展到复杂的多轮智能体任务至关重要。然而,基于规则的压缩方法可能会丢弃任务关键细节,而强化学习方法通常难以在长时工作流固有的稀疏奖励下平衡信息保留和token效率。为弥补这一差距,我们提出ZipRL,一种针对可验证奖励强化学习的新型自适应压缩框架。ZipRL具有多粒度压缩机制,用于主动、非均匀的信息缩减,并配合事后响应回放(HRR),一种旨在在RLVR优化期间密集化训练信号的技术。理论上,我们证明了ZipRL相对于均匀方法具有更优的任务相关效用。具体而言,ZipRL利用从粗到细的提示进行宏观压缩,并通过广义优势重塑将HRR纳入GRPO。多个不同版本和参数规模的模型验证了我们方法的有效性。在五个智能体任务上的基准测试显示,ZipRL在Qwen3-4B和Qwen3-8B模型上分别比最先进方法高出27.9%和34.7%,同时在极端256轮外推压力测试下保持卓越的token效率和鲁棒性。

英文摘要

Adaptive context compression is vital for scaling Large Language Models (LLMs) to complex, multi-turn agent tasks. However, rule-based compression methods may discard task-critical nuances, while Reinforcement Learning (RL) approaches usually struggle to balance information retention and token efficiency under the sparse rewards inherent to long-horizon workflows. To bridge this gap, we propose ZipRL, a novel adaptive compression framework tailored for Reinforcement Learning from Verifiable Rewards (RLVR). ZipRL features a multi-granularity compression mechanism for active, non-uniform information reduction, coupled with Hindsight Response Replay (HRR), a technique designed to densify training signals during RLVR optimization. Theoretically, we prove ZipRL's superior task-relevant utility over uniform methods. Concretely, ZipRL utilizes coarse-to-fine prompts for macro-compression and incorporates HRR into GRPO via generalized advantage reshaping. Multiple models of varying versions and parameter scales validate the effectiveness of our approach. Benchmarks on five agent tasks show ZipRL outperforms state-of-the-art approaches by 27.9% and 34.7% across Qwen3-4B and Qwen3-8B models, while maintaining exceptional token efficiency and robustness under extreme 256-turn extrapolation stress tests.

2605.28067 2026-05-28 cs.AI

BlazeEdit: Generalist Image Editing on Mobile Devices with Image-to-Image Diffusion Models

BlazeEdit: 基于图像到图像扩散模型的移动设备通用图像编辑

Fei Deng, Yanwu Xu, Zhipeng Bao, Zhixing Zhang, Haolin Jia, Karthik Raveendran, Jianing Wei

发表机构 * Google(谷歌)

AI总结 提出BlazeEdit,一个仅195M参数的轻量级通用图像编辑扩散模型,通过消除文本条件组件和多任务架构,在移动设备上实现快速、隐私保护的图像编辑。

Comments Accepted to CVPR 2026 EDGE Workshop

详情
AI中文摘要

现代扩散模型卓越的生成质量往往以巨大的参数量为代价,这需要服务器端推理,带来显著的计算成本和潜在的隐私风险。因此,开发高效的设备端替代方案日益受到关注。尽管最近的努力优化了移动硬件上的文本到图像模型,但它们仍然相对庞大,通常有0.5B到1B参数。我们提出了BlazeEdit,一个专为设备端部署设计的高效通用图像到图像扩散模型。通过识别许多实际图像编辑任务不需要基于文本的指导,我们消除了文本条件组件,并开发了一个多任务架构,将对象移除、外扩、色调校正、重新照明和贴纸生成整合到一个仅195M参数的紧凑模型中。BlazeEdit大幅减少了下载大小和内存开销,同时保持了具有竞争力的生成质量。它在Pixel 10上仅需290ms即可完成一次完整推理,为边缘设备上的通用图像编辑提供了无缝、隐私保护和闪电般的体验。

英文摘要

The remarkable generation quality of modern diffusion models often comes at the cost of massive parameter counts, which necessitate server-side inference with significant computational costs and potential privacy risks. Consequently, there is growing momentum toward developing efficient on-device alternatives. While recent efforts have optimized text-to-image models for mobile hardware, they remain relatively bulky, typically ranging from 0.5B to 1B parameters. We present BlazeEdit, a highly efficient, generalist image-to-image diffusion model tailored for on-device deployment. By identifying that many practical image editing tasks do not require text-based guidance, we eliminate the text-conditioning components and develop a multi-task architecture that consolidates object removal, outpainting, tone correction, relighting, and sticker generation into a single, compact model of only 195M parameters. BlazeEdit achieves a substantial reduction in download size and memory overhead while maintaining competitive generation quality. It completes a full inference pass in just 290ms on a Pixel 10, delivering a seamless, privacy-preserving, and lightning-fast experience for generalist image editing on the edge.

2605.28065 2026-05-28 cs.AI

Verifiable Benchmarking of Long-Horizon Spatial Biology

长程空间生物学的可验证基准测试

Ian Diks, Harihara Muralidharan, Tim Proctor, Kenny Workman

发表机构 * LatchBio

AI总结 提出 SpatialBench-Long 基准,通过 24 个评估任务测试 AI 代理从原始空间数据中推导科学结论的能力,发现当前最佳模型仅达到 11.1% 的成功率。

详情
AI中文摘要

AI 代理在生物数据分析中越来越有用,但现有基准大多测试广泛的生物学知识、可执行的工作流程或局部分析步骤,而不是对空间测量进行端到端的科学推理。我们引入了 SpatialBench-Long,一个用于长程空间生物学的基准,其中代理必须从原始或接近原始的数据以及校准的实验背景中恢复生物学声明,而不使用规定的方法。SpatialBench-Long 包含 24 个评估,涵盖原发性胰腺导管腺癌(PDAC)、工程化胶质母细胞瘤类器官和体内肿瘤、Cas9 谱系追踪的肺腺癌、以及小鼠视神经衰老/干预系统,涉及 CosMx、Visium、Xenium、多重纠错荧光原位杂交(MERFISH)、单细胞 RNA 测序(scRNA-seq)、Slide-seq、Slide-tags、组织学和谱系记录数据。候选声明通过再现、独立科学家审查和轨迹检查进行强化。最终答案通过受控词汇和符号进行确定性评分,并附有配套评分标准,捕捉通过关键分析瓶颈的进展。在 SpatialBench-Long 基准测试中,三个模型-工具对在 8/72 次运行(11.1%)中并列:Gemini 3.5 Flash / Pi 终端编码工具、GPT-5.5 / Pi 和 GPT-5.5 / OpenAI Codex。SpatialBench-Long 测试代理是否能够超越执行程序性分析,从复杂的空间测量中推导出准确的科学结论。

英文摘要

AI agents are increasingly useful for biological data analysis, but existing benchmarks mostly test broad biological knowledge, executable workflows, or localized analysis steps rather than end-to-end scientific reasoning over spatial measurements. We introduce SpatialBench-Long, a benchmark for long-horizon spatial biology in which agents must recover biological claims from raw or near-raw data and calibrated experimental context without prescribed methods. SpatialBench-Long contains 24 evaluations across primary pancreatic ductal adenocarcinoma (PDAC), engineered glioblastoma organoids and in vivo tumors, Cas9 lineage-traced lung adenocarcinoma, and mouse optic nerve aging/intervention systems, spanning CosMx, Visium, Xenium, multiplexed error-robust fluorescence in situ hybridization (MERFISH), single-cell RNA sequencing (scRNA-seq), Slide-seq, Slide-tags, histology, and lineage-recording data. Candidate claims are hardened through reproduction, independent scientist review, and trajectory inspection. Final answers are graded deterministically over controlled vocabularies and symbols with companion rubrics capturing progress through key analysis chokepoints. Across the SpatialBench-Long benchmark, three model-harness pairs tie at 8/72 runs (11.1\%): Gemini 3.5 Flash / Pi terminal coding harness, GPT-5.5 / Pi, and GPT-5.5 / OpenAI Codex. SpatialBench-Long tests whether agents can move beyond executing procedural analysis to deriving accurate scientific conclusions from complex spatial measurements.

2605.28063 2026-05-28 cs.SD cs.AI cs.MM

Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts

自由文本提示的统一语音与声音合成

Yuyue Wang, Xihua Wang, Xin Cheng, Yijing Chen, Ruihua Song

发表机构 * Renmin University of China(中国人民大学)

AI总结 提出PlanAudio框架,利用大语言模型推理能力和语义潜在思维链机制,直接从自由文本生成包含语音和声音的统一音频。

详情
AI中文摘要

音频生成已取得显著进展,但合成语音与声音自然组合的统一音频仍具挑战。当前方法要么依赖分离的流水线,无法捕捉细粒度交互,要么需要结构化输入和外部文本重写,限制了自由文本提示的灵活性。本文提出新任务:自由文本提示到统一音频生成,旨在直接从无约束自然语言合成包含语音、声音及其复合的统一音频。为此,我们提出PlanAudio,一个统一的、基于自回归LLM的框架。首先,它利用LLM内在推理能力简化模型架构,而非传统文本编码器。其次,引入语义潜在思维链机制,一种隐式规划机制,连接高层语义理解与低层声学合成。此外,我们创建PlanAudio-Bench,一个专门评估复合音频场景的基准。我们在语音、声音及其复合场景下进行评估。结果表明,PlanAudio普遍优于现有流水线和统一基线,同时与专为单一场景设计的模型保持竞争力。进一步分析揭示了语义潜在CoT相对于其他CoT机制的优越性,并强调了连续多场景训练课程的重要性。

英文摘要

Audio generation has made significant progress, yet synthesizing unified audio where speech and sounds are naturally composited remains a challenge. Current methods either rely on disjoint pipelines, which fail to capture fine-grained interactions, or require structured inputs and external text rewriting, which limits the flexibility of free-form text prompts. In this paper, we introduce a new task: Free-Form-Text-Prompt-to-Unified-Audio generation, which aims to directly synthesize unified audio containing speech, sound, and their composites from unconstrained natural language. To address this task, we propose PlanAudio, a unified, autoregressive LLM-based framework. First, it simplifies the model architecture by leveraging intrinsic LLM reasoning capability instead of traditional text encoders. Second, it introduces a semantic latent chain-of-thought mechanism, an implicit planning mechanism that bridges high-level semantic understanding and low-level acoustic synthesis. Furthermore, we create PlanAudio-Bench, a specialized benchmark for evaluating composite audio scenarios. We perform evaluations in the scenarios of speech, sound, and their composites. The results demonstrate that PlanAudio generally outperforms the existing pipeline and unified baselines, while staying competitive with models designed for a single scenario. Our analysis further reveals the superiority of semantic latent CoT over other CoT mechanisms and highlights the importance of continuous multi-scenario training curricula.

2605.28062 2026-05-28 cs.CL cs.IR

ConvMemory: A Lightweight Learned Memory Reranker, a Negative Attribution Result, and a Research-Preview Conflict Editor

ConvMemory: 一种轻量级学习型记忆重排序器、一个负归因结果以及一个研究预览冲突编辑器

Taiheng Pan

发表机构 * School of Computing and Information Systems(计算与信息系统学院) University of Melbourne(墨尔本大学)

AI总结 本文提出ConvMemory,一种3.6M参数的学习型重排序器,通过交叉编码器教师监督在融合密集和词汇特征上训练,用于对话长期记忆检索,并报告了负归因结果及研究预览冲突编辑器CCGE-LA。

Comments 15 pages. Technical report

详情
AI中文摘要

我们描述了ConvMemory,一种用于对话长期记忆检索的小型3.6M参数学习型重排序器,通过交叉编码器教师监督在融合密集和词汇特征上训练。在LongMemEval记忆族上,ConvMemory在Recall@10上优于BGE-large交叉编码器,延迟降低12-47倍;在Clean500上,与mxbai-rerank-large-v1相比,Recall@10差距在0.025以内,但运行成本低28倍;在Stress1000干扰项下,Recall@10差距扩大到0.081,但ConvMemory的延迟仍低117倍;这些LongMemEval数字是单次运行或单种子结果,作为指示性成本前沿证据报告,而非基准级。然后,我们发布了一个关于先前声称机制的严格负归因结果:一个五种子重训练消融实验结合配对自助法表明,ConvMemory的学习时间窗口在总体上统计显著,但并非时间特定,对硬非时间控制的影响最大,而对多跳时间查询无显著影响。该机制的诚实描述是在融合密集+词汇特征空间中的廉价交叉编码器蒸馏,而非时间结构利用。此外,我们发布了CCGE-LA,一种低幅度的冲突感知候选集编辑器,基于ConvMemory,作为研究预览,在LoCoMo的替换和过时/恢复切片上取得了适度但一致的改进。所有结果均为检索阶段;ConvMemory在绝对LoCoMo MRR上未匹配mxbai-rerank-large-v1,且该报告为单作者,尚未独立审计。

英文摘要

We describe ConvMemory, a small 3.6M-parameter learned reranker for conversational long-term memory retrieval, trained with cross-encoder teacher supervision over fused dense and lexical features. On the LongMemEval memory family, ConvMemory operates above the BGE-large cross-encoder in Recall@10 at 12-47x lower latency, remains within 0.025 Recall@10 of mxbai-rerank-large-v1 on Clean500 while running 28x cheaper; under Stress1000 distractors the Recall@10 gap widens to 0.081 but ConvMemory still operates at 117x lower latency; these LongMemEval numbers are single-run or single-seed and are reported as indicative cost-frontier evidence, not benchmark-grade. We then publish a rigorous negative attribution result on a previously claimed mechanism: a five-seed retrained ablation with paired bootstrap shows that ConvMemory's learned temporal window is statistically significant on aggregate but not temporally specific, with the largest effects on hard non-temporal controls and no significant effect on multi-hop temporal queries. The honest description of the mechanism is cheap cross-encoder distillation in a fused dense+lexical feature space, not temporal-structure exploitation. We additionally release CCGE-LA, a low-amplitude conflict-aware candidate-set editor over ConvMemory, as a research preview with modest but consistent gains on supersession and stale/rescue slices on LoCoMo. All results are retrieval-stage; ConvMemory does not match mxbai-rerank-large-v1 in absolute LoCoMo MRR, and the report is single-author and not yet independently audited.

2605.28060 2026-05-28 cs.CL

Challenges in Explaining Pretrained Clinical Text Classifiers

解释预训练临床文本分类器的挑战

Kristian Miok, Matej Klemen, Blaz Škrlj, Marko Robnik Šikonja

发表机构 * Faculty of Computer and Information Science, University of Ljubljana, Slovenia(卢布尔雅那大学计算机与信息科学学院) ICAM - Advanced Environmental Research Institute, West University of Timisoara, Romania(蒂米什瓦德西大学先进环境研究所)

AI总结 本文通过医院住院时长预测任务,揭示了LIME和SHAP等事后解释方法在临床叙事中的局限性,包括过度关注非信息性标记、归因不稳定以及对不连贯输入的高置信度预测,强调了需要临床有意义、语义基础且对语言噪声鲁棒的解释策略。

Comments 9 pages, 7 figures. Accepted at the First Workshop on Responsible Healthcare using Machine Learning (RHCML 2025), co-located with ECML PKDD 2025

Journal ref Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2025. Communications in Computer and Information Science, vol 2842, pp. 314-322. Springer, Cham (2026)

详情
AI中文摘要

在临床自然语言处理中解释神经模型的预测仍然是一个重大挑战,尤其是对于涉及长篇幅、非结构化医疗文本的复杂任务。尽管LIME和SHAP等事后方法被广泛使用,但它们在应用于临床叙事时常常表现不足。在本文中,我们通过针对医院住院时长预测任务的定向演示,识别了基于标记和基于扰动的解释技术的核心局限性。我们的发现揭示了诸如过度强调非信息性标记、归因不稳定以及对不连贯输入变体的高置信度预测等问题。这些结果强调了需要临床有意义、语义基础且对语言噪声鲁棒的解释策略。

英文摘要

Explaining the predictions of neural models in clinical NLP remains a significant challenge, especially for complex tasks involving long, unstructured medical texts. While post-hoc methods like LIME and SHAP are widely used, they often fall short when applied to clinical narratives. In this paper, we identify core limitations of token-level and perturbation-based explanation techniques through targeted demonstra- tions on a hospital length-of-stay prediction task. Our findings reveal issues such as overemphasis on non-informative tokens, instability in at- tributions, and high-confidence predictions for incoherent input variants. These results underscore the need for explanation strategies that are clin- ically meaningful, semantically grounded, and robust to linguistic noise.

2605.28058 2026-05-28 cs.CL

Prompting Is All You Need: Multi-view Prompting Large Language Models for Aspect-Based Sentiment Analysis

提示即一切:基于多视角提示的大语言模型在方面级情感分析中的应用

Nils Constantin Hellwig, Niklas Donhauser, Jakob Fehle, Udo Kruschwitz, Christian Wolff

发表机构 * Media Informatics Group, University of Regensburg, Germany(里根大学媒体信息学小组) Information Science Group, University of Regensburg, Germany(里根大学信息科学小组)

AI总结 提出LLM-MvP方法,通过多视角提示、模式约束解码和前缀批处理,使大语言模型在少量样本下达到与微调模型竞争甚至更优的性能,同时降低计算开销。

详情
AI中文摘要

近期工作探索了大语言模型(LLMs)在方面级情感分析(ABSA)中通过少样本提示的能力,相比零样本基线显著改进,且所需标注示例大幅减少。然而,与在数百个示例上微调的模型相比仍存在性能差距,且LLM推理的计算成本对部署构成实际障碍。我们提出了基于LLM的多视角提示(LLM-MvP),将考虑多种元素排序的多视角原理适配到LLM提示中。通过将模式约束解码与上下文无关语法及前缀批处理相结合,LLM-MvP实现了与微调方法竞争甚至更优的性能,同时大幅降低计算开销。在五个基准数据集上的广泛实验表明,LLM-MvP缩小了少样本提示与微调模型之间的差距,为ABSA提供了实用且高效的解决方案。

英文摘要

Recent work explored the capabilities of Large Language Models (LLMs) in Aspect-Based Sentiment Analysis (ABSA) through few-shot prompting, requiring substantially fewer annotated examples while achieving notable improvements over zero-shot baselines. However, a performance gap remained compared to models fine-tuned on hundreds of examples, and the computational costs of LLM inference present practical barriers to deployment. We introduce LLM-based Multi-View Prompting (LLM-MvP), which adapts the multi-view principle of considering multiple element orderings to LLM prompting. By combining schema-constrained decoding with a context-free grammar and prefix batching, LLM-MvP achieves performance competitive or superior to fine-tuned approaches while substantially reducing computational overhead. Extensive experiments across five benchmark datasets demonstrate that LLM-MvP closes the gap between few-shot prompting and fine-tuned models, offering a practical and efficient solution for ABSA.

2605.28056 2026-05-28 cs.CV

CogPortrait: Fine-Grained Eye-Region Control in Portrait Animation via Hierarchical Agent Planning

CogPortrait: 通过分层智能体规划实现肖像动画中的细粒度眼部区域控制

He Feng, Yongjia Ma, Donglin Di, Lei Fan, Tonghua Su

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) University of New South Wales(新南威尔士大学)

AI总结 提出CogPortrait两阶段框架,利用多模态大语言模型智能体从高层标签生成关键点,再通过DiT视频生成骨干合成动画,实现细粒度眼部控制,并引入EMH基准评估。

详情
AI中文摘要

肖像动画方法已实现显著的视觉质量和唇形同步,但眼部区域的细粒度操控仍面临输入粒度与运动精度之间的权衡。现有方法使用情感标签或粗略文本提示不足以描述细微的眼部动态,而基于动作单元或驱动视频的方法以更高的输入负担为代价提供更高的保真度。这些限制对于超越情感状态(例如思考)和困倦状态仍然具有局限性。鉴于此,我们提出CogPortrait,一个从高层标签生成肖像动画的两阶段框架。在第一阶段,三个思维链多模态大语言模型(MLLMs)智能体通过时间事件规划、原型检索和从真实行为库中组合以及语义-生理约束执行,将高层标签编译为面部关键点。在第二阶段,基于DiT的视频生成骨干以关键点、参考肖像、音频和文本提示为条件合成最终动画,并通过动态无分类器引导策略(具有眼部区域感知重新加权和基于KTO的边界情况细化)增强。我们进一步引入了EMH基准,涵盖多样化的情感和超越情感类别,并带有两个AU级指标用于评估细粒度眼部区域和头部运动控制。在HDTF和EMH基准上的大量实验表明,CogPortrait在保持优越视觉质量和身份一致性的同时,实现了比现有方法更精确的眼部区域控制。

英文摘要

Portrait animation methods have achieved substantial visual quality and lip synchronization, but fine-grained manipulation of the eye region still faces a trade-off between input granularity and motion accuracy. Existing methods using emotion labels or coarse text prompts are insufficient for describing subtle ocular dynamics, whereas approaches based on Action Units or driving videos provide higher fidelity at the cost of a heavier input burden. These limitations are still restrictive for beyond-emotion states (e.g., thinking) and drowsiness. In light of the above, we propose CogPortrait, a two-stage framework that generates portrait animations from high-level labels. In the first stage, three chain-of-thought Multimodal Large Language Models (MLLMs) agents compile high-level labels into facial keypoints through temporal event planning, prototype retrieval, and composition from a real-behavior library, and semantic-physiological constraint enforcement. In the second stage, a DiT-based video generation backbone synthesizes the final animation conditioned on the keypoints, reference portrait, audio, and text prompt, enhanced by a dynamic classifier-free guidance strategy with eye-region-aware reweighting and KTO-based refinement for boundary cases. We further introduce the EMH benchmark covering diverse emotions and beyond-emotion categories with two AU-level metrics for evaluating fine-grained eye-region and head-motion control. Extensive experiments on HDTF and the EMH benchmark demonstrate that CogPortrait achieves more precise eye-region control than existing methods while maintaining supe- rior visual quality and identity consistency

2605.28053 2026-05-28 cs.LG

RW-TTT: Batched Serving for Request-Owned Test-Time Training State

RW-TTT:面向请求自有测试时训练状态的批量服务

Jian Yang, Zhizhuo Kou, Yao Tian, Hao Zhang, Han Chen, Sirui Han, Yike Guo

发表机构 * HKUST(香港科技大学) CUHK(香港大学) NUS(新加坡国立大学)

AI总结 提出RW-TTT框架,通过标记解码步骤的所有者、版本和读写效果,实现请求自有测试时训练状态下的高效批量LLM服务,在单GPU上达到274.61 tok/s聚合吞吐,较顺序服务提升9.31倍。

详情
AI中文摘要

测试时训练(TTT)在生成过程中通过读取和更新请求自有状态(如快速权重、低秩增量或流学习器状态)来调整LLM。这打破了假设共享静态权重的批量LLM服务:串行执行正确但缓慢,而朴素批处理可能破坏请求状态。我们将此问题形式化为读写TTT服务,并提出RW-TTT,它将每个解码步骤标记为其所有者、版本和READ/WRITE效果,仅批处理兼容阶段,并仅将更新提交给所有者。在单个GPU上,使用八个快速权重InPlace-TTT流,RW-TTT达到274.61聚合tok/s,比顺序服务快9.31倍,在相同内存预算下比每流副本快3.44倍。它在长上下文基准RULER上保持行为,并通过所有者和版本检查。

英文摘要

Test-time training (TTT) adapts an LLM during generation by reading and updating request-owned state, such as fast weights, low-rank deltas, or streaming learner state. This breaks batched LLM serving, which assumes shared static weights: serial execution is correct but slow, while naive batching can corrupt request state. We formulate this problem as read-write TTT serving and present RW-TTT , which tags each decode step with its owner, version, and READ/WRITE effect, batches only compatible phases, and commits updates only to the owner. On one GPU with eight fast-weight InPlace-TTT streams, RW-TTT reaches 274.61 aggregate tok/s, 9.31x over sequential serving and 3.44x over per-stream replicas under the same memory budget. It preserves behavior on RULER, a long-context benchmark, and passes owner/version checks.

2605.28051 2026-05-28 cs.CV

Beyond Surrogate Gradients: Fully Differentiable Token Pruning for Vision-Language Models

超越代理梯度:面向视觉-语言模型的完全可微分令牌剪枝

Landi He, Mingde Yao, Shawn Young, Lijian Xu

发表机构 * Shenzhen University of Advanced Technology(深圳大学先进技术学院) CUHK MMLab(香港中文大学MMLab) CPII under InnoHK(创新工场CPII)

AI总结 提出DiffPrune方法,通过将剪枝重新表述为令牌信息的连续控制而非离散选择学习,利用信息节流阀调节令牌,实现完全可微分的令牌重要性学习,在保持96.5%全模型精度的同时将LLM预填充加速2.85倍。

详情
AI中文摘要

视觉令牌剪枝通过移除冗余视觉令牌来降低视觉-语言模型(VLM)的计算成本。现有方法通常依赖Gumbel-Softmax在训练期间近似离散选择。然而,优化由代理梯度驱动而非真实选择过程,导致令牌重要性的学习不可靠。本文提出DiffPrune,将剪枝重新表述为令牌信息的连续控制而非离散选择学习。具体而言,我们引入一个信息节流阀,利用基于重要性分数的方差保持噪声调节每个令牌,其中较高的分数在训练期间导致较少的信息抑制。该设计直接操作于令牌表示,自然地为学习令牌重要性提供了完全可微分的优化路径。在推理时,通过对学习到的分数进行硬阈值来移除令牌。在十个VLM基准测试中,DiffPrune保留了全模型精度的96.5%,同时将LLM预填充加速2.85倍,推理开销仅为0.69毫秒。

英文摘要

Visual token pruning reduces the computational cost of Vision-Language Models (VLMs) by removing redundant visual tokens. Existing methods typically rely on Gumbel-Softmax to approximate discrete selection during training. However, the optimization is driven by surrogate gradients rather than the true selection process, leading to unreliable learning of token importance. In this paper, we propose DiffPrune, which reformulates pruning as continuous control of token information instead of discrete selection learning. Specifically, we introduce an Information Throttler that modulates each token using variance-preserving noise conditioned on importance scores, where higher scores induce less information suppression during training. This design directly operates on token representations, naturally providing a fully differentiable optimization path for learning token importance. At inference, tokens are removed via hard thresholding on the learned scores. Across ten VLM benchmarks, DiffPrune retains 96.5% of full-model accuracy while accelerating LLM prefill by 2.85x, with only 0.69 ms of inference overhead.

2605.28048 2026-05-28 cs.RO

SAFEVPR: Patch-Based Conformal Verification for Safe Cross-Condition Sequence Visual Place Recognition

SAFEVPR: 基于补丁的共形验证用于安全跨条件序列视觉地点识别

Ha Sier, Jiaqiang Zhang, Zhuo Zou, Xianjia Yu, Tomi Westerlund

发表机构 * Turku Intelligent Embedded and Robotic Systems (TIERS) Lab(图尔库智能嵌入式与机器人系统实验室) University of Turku(图尔库大学) School of Information Science and Technology(信息科学与技术学院)

AI总结 提出SAFEVPR,一种无需训练的验证与校准流程,通过互近邻补丁匹配评分和Mondrian共形LTT校准,在跨条件部署下实现序列VPR的有限样本FDR控制,实验证明在23个跨条件设置中均有效。

详情
AI中文摘要

基于序列的视觉地点识别(VPR)用于SLAM和机器人重定位必须决定检索到的top-1候选是否安全可接受。共形预测是这种接受/拒绝决策的自然框架,但其有限样本保证依赖于校准数据和部署(测试)数据之间的可交换性,这在跨条件部署下被违反。我们引入了SAFEVPR,一种无需训练的验证与校准流程,用于安全的跨条件序列VPR。SAFEVPR将标准的骨干余弦相似度替换为从冻结的DINOv2 ViT特征计算出的互近邻(MNN)补丁匹配分数,并将平坦的Learn-Then-Test校准替换为Mondrian共形LTT,为不同分数区间拟合独立的Bonferroni校正阈值。在可交换性下,这些阈值将提供有限样本的假发现率(FDR)控制;在条件偏移下,我们评估每个部署的经验有效性。在来自Oxford RobotCar、NCLT和St Lucia数据集的23个跨条件设置中,使用三个冻结的VPR骨干,SAFEVPR在目标FDR alpha=0.10下,在23/23的设置中经验有效,平均接受FDR为0.014,平均真阳性率(TPR)为0.75。结果表明,仅凭原始区分度不足以实现共形有效性:AnyLoc-VLAD和Super-Point+LightGlue达到了可比的ROC曲线下面积(AUROC),但在相同校准下失败的设置更多。在无纹理重复场景中,SAFEVPR安全地弃权,而不是接受不可靠的匹配。代码可在https://github.com/Hasar12139/SafeVPR获取。

英文摘要

Sequence-based visual place recognition (VPR) for SLAM and robot relocalization must decide whether the retrieved top-1 candidate is safe to accept. Conformal prediction is a natural framework for this accept/reject decision, but its finite-sample guarantees rely on exchangeability between calibration and deployment (test) data, which is violated under cross-condition deployment. We introduce SAFEVPR, a non-trainable verification-and-calibration pipeline for safe cross-condition sequence VPR. SAFEVPR replaces the standard backbone cosine similarity with a mutual-nearest-neighbour (MNN) patch-matching score computed from frozen DINOv2 ViT features, and replaces flat Learn-Then-Test calibration with Mondrian conformal LTT, fitting separate Bonferroni-corrected thresholds across score bins. Under exchangeability, these thresholds would provide finite-sample false-discovery-rate (FDR) control; under condition shift, we evaluate empirical validity per deployment. Across 23 cross-condition setups from Oxford RobotCar, NCLT, and St Lucia datasets, using three frozen VPR backbones, SAFEVPR is empirically valid on 23/23 setups at target FDR alpha = 0.10, achieving mean accepted FDR 0.014 and mean true-positive rate (TPR) 0.75. The results show that raw discrimination alone is not sufficient for conformal validity: AnyLoc-VLAD and Super-Point+LightGlue reach comparable area under the receiver operating characteristic curve (AUROC) but fail more setups under the same calibration. On textureless repetitive scenery, SAFEVPR safely abstains rather than accepting unreliable matches. Code is available at https://github.com/Hasar12139/SafeVPR.