arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.29498 2026-05-29 cs.CL cs.CV

Mask the Target: A Plug-and-Play Regularizer Against LoRA Forgetting

Runze Xu, Arpit Garg, Hemanth Saratchandran, Simon Lucey

AI summary: To address catastrophic forgetting in LoRA fine-tuning when the target distribution differs substantially from the original training distribution, the paper proposes a replay-free output-space regularizer that masks the target token and applies KL regularization only over the non-target vocabulary, improving the balance between new learning and forgetting without adding inference overhead.

Comments: In Submission

Abstract

Low-Rank Adaptation (LoRA) has become one of the most widely used fine-tuning mechanisms for adapting large language models to new domains, tasks, and users. Yet adaptation performance alone can obscure an important failure mode: LoRA updates may improve performance on the target distribution while degrading prior capabilities learned during pretraining and alignment. We show that this forgetting becomes especially severe when the adaptation distribution differs substantially from the model's original training or alignment distributions. The challenge is amplified in practical settings, where the original training and alignment data are typically unavailable. Motivated by this constraint, we study how LoRA-based adaptation balances new learning against forgetting in a replay-free setting, and introduce a simple output-space regularizer that can be added directly to existing training pipelines. Our method removes the ground-truth token from both the base and adapted model distributions, renormalizes the remaining probabilities, and applies KL regularization only over the non-target vocabulary. This preserves the base model's relative preferences among alternative tokens without directly opposing the cross-entropy signal required for adaptation. As the regularizer acts only at the loss level, it requires no replay data, architectural changes, adapter redesign, or inference-time overhead, and can be applied directly to existing LoRA variants. Across all LoRA variants tested and across various backbones, our method improves the frontier between new learning and forgetting when the adaptation distribution differs substantially from the base model's original training or alignment distributions, suggesting a broadly applicable route toward more reliable LLM updating.
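
The masking step described in the abstract can be illustrated for a single token position. The following is a minimal pure-Python sketch, not the authors' implementation: the function names, the KL direction (base relative to adapted), and the `eps` guard are assumptions made here for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def masked_target_kl(base_logits, adapted_logits, target, eps=1e-12):
    """KL(base || adapted) over the non-target vocabulary (direction assumed)."""
    p = softmax(base_logits)
    q = softmax(adapted_logits)
    # Remove the ground-truth token from both distributions.
    p_rest = [v for i, v in enumerate(p) if i != target]
    q_rest = [v for i, v in enumerate(q) if i != target]
    # Renormalize the remaining probabilities.
    p_sum, q_sum = max(sum(p_rest), eps), max(sum(q_rest), eps)
    kl = 0.0
    for pv, qv in zip(p_rest, q_rest):
        pn, qn = pv / p_sum, qv / q_sum
        if pn > 0.0:
            kl += pn * math.log(pn / max(qn, eps))
    return kl
```

In this sketch, raising only the target logit in the adapted model leaves the penalty at (numerically) zero, because the relative preferences among the non-target tokens are unchanged; this matches the abstract's point that the regularizer does not directly oppose the cross-entropy signal.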

2605.29497 2026-05-29 cs.LG

Convex Basins in Single-Index Model Loss Landscapes: Applications to Robust Recovery under Strong Adversarial Corruption

Santanu Das, Sagnik Chatterjee, Jatin Batra

AI summary: For single-index models with heavy-tailed noise and adversarial corruption, the paper proposes the first robust recovery algorithm based on convex-basin structure, achieving near-linear sample and time complexity.

Comments: Accepted at ICML 2026

Abstract

We study the problem of robustly learning Gaussian Single Index Models (SIMs) in the presence of heavy-tailed noise and a constant fraction of adversarially corrupted covariates and responses. Prior work on robust recovery has considered settings such as linear regression (Pensia et al., JASA 2024), strictly monotonic link functions (Awasthi et al., NeurIPS 2022), and phase retrieval (Buna and Rebeschini, AISTATS 2025). However, these techniques do not extend to generic asymmetric non-monotonic link functions such as GeLU and Swish, which arise naturally as scalar primitives in modern gated neural architectures. We close this gap by giving the first robust recovery algorithm with near-linear sample and time complexity for generic non-monotonic link functions, thereby establishing the first robust recovery guarantees for a broad family of nonlinear SIMs for which no guarantees were previously known. Our central contribution is a new structural understanding of the Gaussian squared-loss landscape under adversarial contamination. Crucially, we prove that for a broad class of nonlinear non-monotonic SIMs, a dimension-independent, constant-radius convex basin exists around the ground truth and is efficiently reachable via robust spectral initialization even under adversarial contamination. Prior works fail to establish both guarantees simultaneously, thereby either breaking down under adversarial contamination or failing to handle generic non-monotonic link functions. Together, these structural insights yield a principled warm start for robust gradient descent that provably converges to a final estimation error of $O(\sigma\sqrt{\varepsilon})$ in $\tilde{O}(nd)$ time with $\tilde{O}(d)$ samples, where $\varepsilon$ is the contamination fraction.

2605.29496 2026-05-29 cs.CL cs.CV

On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training

Xueqing Wu, Yu-Chi Lin, Kai-Wei Chang, Nanyun Peng

AI summary: Diagnosis on synthetic tasks shows that post-training improves reasoning far more than perception. Under SFT the gap stems from perception occupying few tokens and thus receiving a weak training signal; under RL it stems from reward coupling. Dynamic loss reweighting and a perception-aware reward mitigate the imbalance and improve end-to-end performance.

Comments: Project: https://asymmetric-vlm-post-training.github.io/

Abstract

Post-training has greatly improved reasoning in frontier vision-language models, yet its gains for perception remain comparatively limited, creating a bottleneck for end-to-end visual reasoning. To investigate this gap, we introduce a controlled diagnostic framework with two synthetic tasks that disentangle perception from reasoning. Our analysis reveals a consistent perception-reasoning asymmetry: post-training improves reasoning more substantially than perception, though the underlying mechanism differs by training paradigm. For supervised fine-tuning (SFT), this asymmetry stems from token imbalance in chain-of-thought supervision, where perception occupies fewer tokens and thus receives a weaker training signal. Dynamically reweighting the loss mitigates this imbalance and boosts end-to-end performance by up to 18.2. For reinforcement learning (RL), the asymmetry instead arises from reward coupling: outcome rewards correlate more strongly with reasoning than with perception, weakening the signal for perception learning. Adding a perception-aware reward alleviates the imbalance and improves end-to-end accuracy by up to 6.0; even without ground-truth perception rewards, a reliable surrogate reward provides a useful signal, yielding gains of 3.2 points. Together, our results comprehensively diagnose asymmetric optimization and suggest concrete interventions to balance perception and reasoning.
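
The token-imbalance argument for SFT can be made concrete with a small example. The abstract does not specify the reweighting rule, so the sketch below substitutes a simple stand-in: per-example inverse-frequency weights that give each segment type (perception, reasoning) the same total weight. The function names and segment labels are hypothetical.

```python
def balanced_token_weights(segments):
    """Per-token weights that equalize the total weight of each segment type.

    `segments` holds one label per token, e.g. "perception" or "reasoning".
    The weights sum to the number of tokens, so the loss scale is preserved.
    """
    counts = {}
    for label in segments:
        counts[label] = counts.get(label, 0) + 1
    n_types = len(counts)
    total = len(segments)
    return [total / (n_types * counts[label]) for label in segments]

def weighted_loss(token_losses, segments):
    # Weighted mean of per-token losses using the balanced weights above.
    weights = balanced_token_weights(segments)
    return sum(w * l for w, l in zip(weights, token_losses)) / len(token_losses)
```

With 2 perception tokens and 8 reasoning tokens, a plain mean gives perception 20% of the loss mass, whereas these weights give it 50%; a rule that adapts the weights during training would replace `balanced_token_weights`.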

2605.29495 2026-05-29 cs.LG

On-Policy Replay for Continual Supervised Fine-Tuning

Yan Chen, Taojie Zhu, Meng Zhang, Xin Chen, Jiaqi Huang, Dongyang Xu, Yizhi Wang

AI summary: Proposes On-Policy Replay (OPR), which mitigates catastrophic forgetting in continual supervised fine-tuning by replaying high-quality responses generated by the model itself, and substantially reduces forgetting on several large language models.

Abstract

Continual supervised fine-tuning (SFT) is the de facto recipe for adapting large language models (LLMs) to a stream of downstream tasks, but it suffers from catastrophic forgetting of earlier capabilities. Recent work shows that on-policy signals (training on the model's own outputs) reduce forgetting more reliably than off-policy supervision. Existing on-policy methods route this signal through a new training objective (e.g., self-distillation losses with a teacher copy), inheriting an extra forward pass, schedule sensitivity, and stylistic drift from the teacher. We instead route the on-policy signal through the training data source. Our method, On-Policy Replay (OPR), rolls out the most recent checkpoint on a small budget of historical prompts, filters the generations by a task reward, and replays the surviving (prompt, model response) pairs as ordinary SFT examples. There is no teacher, no auxiliary loss, and no on-the-fly distillation. Across three 7-8B instruction-tuned backbones (Qwen2.5-7B-Instruct, Qwen3-8B, Llama3.1-8B-Instruct) on the TRACE continual-learning benchmark, OPR consistently reduces forgetting; on the sharpest stress test (Qwen2.5-7B-Instruct, Sequential SFT BWT -13.93), OPR lifts BWT to -0.65 at a 10% replay budget and to -2.29 at a 1% budget, a 46% reduction in |BWT| over a tuned Vanilla Replay baseline, with 42-46% reductions observed across all three backbones. We give a KL-shrinkage interpretation that places OPR and prior on-policy distillation methods on a single axis, and we present a counterintuitive finding that explains why Vanilla Replay is already a strong baseline: low-score replay is uniformly worse than Vanilla Replay, demonstrating that the active ingredient in OPR is the on-policy distribution, not the response quality alone. Our code is available at https://github.com/Yancey2024/OnPolicyReplay.
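
The three OPR steps named in the abstract (roll out, filter by reward, replay as SFT data) can be sketched as a data-construction loop. This is an illustrative outline only and is unrelated to the code in the linked repository: `generate`, `reward`, `budget`, and `threshold` are hypothetical placeholders for the latest checkpoint's sampler, the task reward, the number of historical prompts to roll out, and the filtering cutoff.

```python
import random

def build_on_policy_replay(prompts, generate, reward, budget, threshold, seed=0):
    """Return (prompt, response) pairs produced by the current model.

    generate(prompt) -> response   (hypothetical: rollout of the latest checkpoint)
    reward(prompt, response) -> float   (hypothetical: task reward)
    """
    rng = random.Random(seed)
    # Spend the replay budget on a random subset of historical prompts.
    sampled = rng.sample(prompts, min(budget, len(prompts)))
    replay = []
    for prompt in sampled:
        response = generate(prompt)
        # Keep only generations that pass the reward filter.
        if reward(prompt, response) >= threshold:
            replay.append((prompt, response))
    return replay

def mix_training_set(new_task_examples, replay_examples):
    # Surviving pairs are replayed as ordinary SFT examples next to new-task data.
    return list(new_task_examples) + list(replay_examples)
```

Because the replayed pairs are plain SFT examples, the training loop itself needs no teacher model or auxiliary loss; only the dataset changes.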

2605.29494 2026-05-29 cs.LG

Gradient Perturbation: Learning to Perturb Gradients for Adaptive Training

Hua Li

AI summary: Proposes Learning to Perturb Gradients (LPG), which adaptively perturbs class-level gradients for category-aware training, and establishes a unified framework showing that methods such as SAM and gradient clipping are forms of gradient perturbation. Experiments show that LPG outperforms existing methods on balanced and long-tail classification and on noisy-label learning.

Abstract

Deep neural network training involves both forward propagation (from features through logits to loss) and backward propagation (from loss through gradients to parameter updates). While perturbations along the forward chain, including feature perturbation, logit perturbation, and label perturbation, have been extensively studied, the backward chain's gradient perturbation has received little systematic investigation. In this paper, we establish a unified framework for gradient perturbation, revealing that existing methods such as Sharpness-Aware Minimization (SAM), gradient clipping, and gradient noise injection can all be interpreted as imposing specific forms of gradient perturbation. Analogous to the recently proposed Logit Perturbation Learning (LPL), we conjecture that amplifying the gradient norm for a class acts as positive augmentation (enhancing learning), while dampening it acts as negative augmentation (suppressing overfitting). Based on these observations, we propose Learning to Perturb Gradients (LPG), which adaptively perturbs logit-level gradients at the class level to achieve category-aware training. We also establish theoretical connections between gradient perturbation bounds and generalization guarantees via PAC-Bayesian analysis. Experiments on balanced classification, long-tail classification, and noisy label learning demonstrate that LPG consistently outperforms existing methods and can be combined with them as a plug-in module.
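
To make "perturbing logit-level gradients at the class level" concrete: for softmax cross-entropy, the gradient with respect to the logits is the predicted distribution minus the one-hot label. The sketch below rescales that gradient by a per-class factor. It is an assumption-laden illustration, not LPG itself: the fixed `class_scale` table is hypothetical and stands in for whatever perturbation the method learns.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def logit_gradient(logits, label):
    # d(cross-entropy)/d(logits) = softmax(logits) - one_hot(label)
    probs = softmax(logits)
    return [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]

def perturbed_logit_gradient(logits, label, class_scale):
    """Scale the logit gradient by a factor chosen per ground-truth class.

    A factor above 1 amplifies the gradient norm for that class;
    a factor below 1 dampens it.
    """
    scale = class_scale[label]
    return [scale * g for g in logit_gradient(logits, label)]
```

For example, `class_scale = [2.0, 0.5, 1.0]` doubles the logit gradient of samples from class 0 and halves it for class 1, which corresponds to the amplifying and dampening cases described in the abstract.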

2605.29491 2026-05-29 cs.AI

The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF

Zeli Su, Zhankai Xu, Tianlei Chen, Longfei Zheng, Xiaolu Zhang, Jun Zhou, Wentao Zhang

AI summary: Introduces the DistractionIF benchmark, finds inverse scaling in LLM robustness to distractor instructions embedded in reference text, and improves robustness with GRPO reinforcement learning.

Abstract

Large Language Models (LLMs) are increasingly deployed in agentic and retrieval-augmented generation (RAG) systems, where they must execute user-specified tasks over externally provided reference text. In practice, such context is often unstructured and contaminated with benign but instruction-like semantic noise, such as editorial comments and system traces, which should be treated strictly as data. We introduce DistractionIF, a benchmark designed to evaluate robustness against such distractor instructions in reference text. Across a broad range of models, we observe a consistent inverse scaling phenomenon: larger models are often less robust, with performance dropping by up to 30 points as scale increases. Mechanistically, our perplexity analysis reveals that scaling erodes the probabilistic boundary between robust and distracted behaviors, making models increasingly prone to over-interpreting noise as instructions. To address this, we demonstrate that reinforcement learning, specifically Group Relative Policy Optimization (GRPO), can restore this boundary, improving robustness by up to 15.5% without compromising general instruction-following capability. Our findings highlight a critical instruction-following robustness gap in reference-grounded tasks and establish reinforcement learning as a promising path for enforcing strict data-instruction separation at scale.
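
GRPO scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group-relative normalization in isolation; the use of the population standard deviation and the `eps` term are implementation assumptions, and the reward values (for instance 1.0 when a response ignores the distractor) are hypothetical rather than taken from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]
```

When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal, so the signal comes from prompts on which the model is sometimes distracted and sometimes not.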

2605.29489 2026-05-29 cs.LG cs.SY eess.SY

Access Sets Matter: Budgeting Expert Reads for Scalable Weight-Space Model Merging

Yuanyi Wang, Yanggan Gu, Su Lu, Yifan Yang, Zhaoyi Yan, Congkai Xie, Jianmin Wu, Hongxia Yang

AI summary: To address the I/O bottleneck of reading expert weights when merging large language models, the paper proposes MergePipe, a budget-aware execution layer that recasts merging as an expert access-set problem and selects which expert delta blocks to access under an explicit I/O budget, achieving up to 11x speedups with minimal parameter deviation.

Comments: ICML 2026 Workshop on Weight-Space Symmetries: from Foundations to Practical Applications

Abstract

Weight-space model merging is usually formulated as an algebraic operation on checkpoints, yet at LLM scale the limiting resource is often the set of expert weights that must be read. We introduce MergePipe, a budget-aware execution layer that casts LLM merging as an expert access-set problem: given a merge operator and a checkpoint family in a shared weight coordinate system, choose which expert delta blocks to access under an explicit I/O budget. MergePipe indexes parameter blocks, builds deterministic access plans, and executes the induced budgeted merge with replayable manifests. The plan is budget-sound by construction and recovers the full-read merge at full budget; for fixed-coefficient additive operators, the omitted-update error is bounded by the norm of omitted deltas. Across Qwen and Llama merging workloads, MergePipe reduces expert-read I/O by up to an order of magnitude and achieves up to $11\times$ speedups. Representative budget sweeps show $O(10^{-3})$ parameter deviation from full-read merges and no monotonic degradation on downstream benchmarks.
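
The access-set idea can be illustrated for a fixed-coefficient additive merge, where the merged block is the base block plus a weighted sum of expert deltas. The sketch below is a toy under stated assumptions and not MergePipe's actual planner: the index layout (a read cost and a precomputed delta norm per expert block), the greedy largest-weighted-norm-first rule, and all names are hypothetical.

```python
import math

def l2(vec):
    return math.sqrt(sum(x * x for x in vec))

def plan_access(index, coefs, budget):
    """Pick (expert, block) keys to read without exceeding the I/O budget.

    index: {(expert, block): (read_cost, delta_norm)}  (assumed precomputed)
    Ranking is deterministic: largest weighted norm first, ties broken by key.
    """
    ranked = sorted(index, key=lambda k: (-abs(coefs[k[0]]) * index[k][1], k))
    plan, spent = [], 0
    for key in ranked:
        cost = index[key][0]
        if spent + cost <= budget:
            plan.append(key)
            spent += cost
    return plan

def budgeted_merge(base, deltas, coefs, plan):
    # merged[block] = base[block] + sum of coef * delta over planned reads only.
    merged = {block: list(values) for block, values in base.items()}
    for expert, block in plan:
        c = coefs[expert]
        merged[block] = [m + c * d for m, d in zip(merged[block], deltas[expert][block])]
    return merged
```

In this toy setting the plan never exceeds the budget, reads every block once the budget covers all of them, and leaves an error no larger than the summed weighted norms of the skipped deltas (by the triangle inequality), which mirrors the three properties stated in the abstract.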

2605.29486 2026-05-29 cs.CL cs.AI cs.LG

PhoneWorld: Scaling Phone-Use Agent Environments

Zhengyang Tang, Yuxuan Liu, Xin Lai, Junyi Li, Pengyuan Lyu, Jason, Yiduo Guo, Zhengyao Fang, Yang Ding, Yi Zhang, Weinong Wang, Huawen Shen, Xingran Zhou, Liang Wu, Fei Tang, Sunqi Fan, Shangpin Peng, Zheng Ruan, Anran Zhang, Benyou Wang, Rui Yan, Ji-Rong Wen, Chengquan Zhang, Han Hu

AI summary: Introduces PhoneWorld, a reusable pipeline that converts real GUI trajectories and screenshots into controllable phone-use environments, executable tasks, automatic verifiers, and training rollouts, so that phone-agent environments can be built at scale.

Comments: work in progress

Abstract

A central bottleneck for phone-use agents is that controllable, reproducible environments covering real mobile behavior are hard to build at scale. Existing mobile-agent benchmarks have made important progress on evaluation, but they do not by themselves provide a scalable way to construct many new phone-use environments. We present PhoneWorld, a reusable pipeline that converts real GUI trajectories and screenshots into controllable phone-use environments, executable tasks, automatic verifiers, and training rollouts. Rather than hand-building one mobile benchmark at a time, PhoneWorld uses real trajectories to recover which screens matter, how screens connect, which interactions must change environment state, and which user goals admit automatic verification. From these signals, it builds runnable mock Android apps backed by read-only app content and mutable state, then derives executable tasks, rule-based verifiers, and training rollouts from the same environments. In its current instantiation, PhoneWorld covers 34 apps across 16 domains, spanning common consumer mobile behaviors such as search, browsing, shopping, booking, media, and social interaction. Under a fixed training budget, replacing 10K steps from an auxiliary AndroidWorld corpus in an AndroidWorld-based baseline with broad PhoneWorld supervision improves all four evaluation benchmarks at once, raising HYMobileBench by 17.7 points, AndroidControl by 6.0 points, AndroidWorld by 14.7 points, and PhoneWorld by 52.5 points. We then study two additional scaling questions: increasing the amount of PhoneWorld supervision strongly improves PhoneWorld performance, and under a fixed PhoneWorld budget, expanding app coverage yields even larger gains. Overall, PhoneWorld shifts the focus from building one mobile benchmark at a time to scaling the supply of phone-use environments themselves.

2605.29483 2026-05-29 cs.AI

VitalAgent: A Tool-Augmented Agent for Reactive and Proactive Physiological Monitoring over Wearable Health Data

Di Zhu, Yu Yvonne Wu, Hong Jia, Aaqib Saeed, Vassilis Kostakos, Ting Dang

AI summary: Proposes the VitalAgent framework, which uses tool-augmented reasoning and longitudinal physiological memory to support reactive question answering and proactive monitoring over ECG/PPG signals, improving on baselines by more than 30% on the VitalBench benchmark.

Abstract

Wearable devices enable continuous monitoring of physiological signals such as ECG and PPG, but existing mHealth systems are largely limited to task-specific prediction pipelines or reactive question answering over static summaries. They lack the ability to support temporal reasoning, persistent physiological context, and proactive monitoring over long-term signal streams. We propose VitalAgent, a tool-augmented agentic framework for ECG/PPG-based mHealth that supports both reactive question answering and proactive monitoring. VitalAgent is built on a longitudinal physiological memory and a tool-augmented reasoning interface that enables dynamic computation over raw signals. We further introduce VitalBench, a longitudinal physiological monitoring benchmark dataset comprising 1,862 QA pairs for reactive question answering and 90.2 hours of continuous ECG/PPG recordings for proactive monitoring, covering cardiac, physical activity, and stress-related tasks. Experiments demonstrate that VitalAgent achieves over 30% improvement over prompt-based and ReAct baselines in reactive evaluation and supports proactive alert monitoring over long-term physiological signals, highlighting the importance of dynamic tool use and long-term physiological monitoring.

2605.29476 2026-05-29 cs.CL

Comparative Evaluation of Machine Translation Systems on Images with Text

Blai Puchol, Sergio Gómez González, Miguel Domingo, Francisco Casacuberta

AI summary: Compares three machine translation paradigms (modular pipelines, multimodal large language models, and the end-to-end model Translatotron-V) on translating images that contain text, and finds that multimodal LLMs perform best.

Abstract

This work presents a comparative evaluation of machine translation systems applied to images containing textual information, a task that lies at the intersection of computer vision and natural language processing. The study compares three main paradigms: modular pipelines that separate text detection, recognition, and translation; multi-modal large language models (MLLMs) capable of processing images and text jointly; and an end-to-end model, Translatotron-V, which directly generates translated images. The modular systems employ state-of-the-art OCR (docTR) combined with multilingual LLMs such as Llama and EuroLLM, while the evaluated MLLMs include different configurations of Gemini 2.5. Experiments were conducted on parallel multilingual datasets covering multiple language pairs, with evaluation based on BLEU, chrF, and TER metrics. The results show that modular pipelines outperform the end-to-end approach, while MLLMs achieve the best overall performance, demonstrating superior flexibility and contextual understanding. These findings underscore the effectiveness of multi-modal reasoning for image-to-text translation and provide a solid foundation for future research on integrating visual understanding and language generation in multilingual settings.

2605.29471 2026-05-29 cs.CV

V2XCrafter: Learning to Generate Driving Scene Across Agents

Yihang Tao, Yu Guo, Senkang Hu, Yanan Ma, Zihan Fang, Sam Kwong, Yuguang Fang

AI summary: Proposes the V2XCrafter framework, which uses a progressive multi-agent diffusion model and a cross-agent attention module to generate consistent, controllable collaborative driving scenes across agents' camera views, augmenting data and improving downstream collaborative 3D object detection.

Abstract

Collaborative driving systems leverage vehicle-to-everything (V2X) communication for multi-agent collaborative perception to enhance driving safety, yet they remain constrained by scarce annotated real-world V2X driving datasets and limited generalization across diverse driving conditions. While image generation technology offers a feasible solution for data augmentation, existing methods tailored for single-vehicle multi-view scenarios face two fundamental challenges in multi-agent driving settings: (1) the expansion of the learning objective degrades generation quality, and (2) the highly dynamic variations across agents hinder the modeling of consistency for physical attributes (e.g., color, category) in jointly observed objects. To bridge this gap, we propose V2XCrafter, the first framework for generating controllable and realistic collaborative driving scenes across agents' camera views. For effective learning, we develop a progressive multi-agent diffusion model based on a single-agent backbone, using neighboring agents' latent states as reference signals to progressively guide the single-to-multi diffusion. To address cross-vehicle inconsistency, we propose a cross-agent attention module that leverages a collaboration view graph and a learnable jointly observed object representation to model the dynamic cross-agent camera view relationships. Experiments show that V2XCrafter can generate high-fidelity and controllable street views with consistency across agents, thereby effectively improving downstream collaborative 3D object detection.

2605.29467 2026-05-29 cs.LG cs.AI

Composing Non-Conjugate Factor Graphs with Closed-Form Variational Inference

Mykola Lukashchuk, Kyrylo Yemets, Wouter M. Kouw, Dmitry Bagaev, İsmail Şenöz, Jeff Beck, Bert de Vries

AI summary: Introduces five factor-graph primitives, proves that any composition of them admits closed-form variational message passing, shows that stacked routing layers give universal function approximation, and applies the framework to time-series forecasting.

Abstract

Stacking probabilistic building blocks into deeper architectures typically breaks closed-form inference. We show that closed-form inference can be preserved. We identify five factor-graph primitives: a bilinear factor, an exponential link, a Gamma prior, a Gaussian likelihood, and an equality node, and prove that any model composed from them admits closed-form variational message passing. The construction works because each primitive preserves a small set of message families: under mean-field factorization, messages on Gaussian variables remain Gaussian and messages on precision variables remain Gamma, while the only non-conjugate interface, the exponential link, remains tractable through the Gaussian moment-generating function and the sufficient statistics of the Gamma family. We demonstrate composition at increasing depth, from static ensembles through input-dependent gating to split-branch routing, and show that stacking routing layers encodes arbitrary decision trees, establishing universal function approximation with closed-form inference. Applied to ensemble time-series forecasting, the framework yields a Bayesian mixture of experts in which gating functions are inferred rather than learned, providing calibrated uncertainty over expert selection across five benchmark datasets.
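
The two standard identities that the abstract alludes to for the exponential link can be written out explicitly. They are textbook results; how exactly the paper combines them in its message updates is not stated in the abstract, and the shape-rate parameterization of the Gamma family used below is an assumption.

```latex
% Gaussian moment-generating function: for x ~ N(\mu, \sigma^2),
\mathbb{E}\!\left[e^{t x}\right] = \exp\!\left(\mu t + \tfrac{1}{2}\sigma^{2} t^{2}\right),
\qquad\text{so}\qquad
\mathbb{E}\!\left[e^{x}\right] = \exp\!\left(\mu + \tfrac{1}{2}\sigma^{2}\right).

% Sufficient statistics of the Gamma family: for \tau ~ Gamma(a, b) with shape a and rate b,
\mathbb{E}[\tau] = \frac{a}{b},
\qquad
\mathbb{E}[\log \tau] = \psi(a) - \log b,
```

Here $\psi$ denotes the digamma function. Both expectations are available in closed form, which is consistent with the claim that the exponential link remains tractable under a mean-field factorization.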

2605.29462 2026-05-29 cs.CV cs.AI

Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Evaluation Dataset

Qian Chen, Xianyin Zhang, Yanzhi Liu, Lifan Guo, Feng Chen, Chi Zhang

AI summary: Introduces CFMME, a Chinese financial multimodal evaluation benchmark with 6,052 instances covering eight primary financial image modalities and four core multimodal tasks, for evaluating the perception, understanding, reasoning, and cognition capabilities of LVLMs across the full financial business workflow.

Abstract

The emergence of Large Vision-Language Models (LVLMs) has substantially expanded model capabilities beyond text-only understanding, enabling unified inference across both visual and textual modalities and supporting a broader range of real-world applications. To comprehensively evaluate the perception, understanding, reasoning, and cognition capabilities of LVLMs throughout the entire financial business workflow in Chinese contexts, we introduce CFMME, a novel Chinese financial multimodal evaluation benchmark. CFMME comprises 6,052 instances spanning from fundamental academic knowledge to complex real-world applications, covering eight primary financial image modalities and four core multimodal tasks. On CFMME, we conduct a thorough evaluation of representative LVLMs. The results show that the state-of-the-art model attains an overall accuracy of 66.11% on the question answering task and an average score of 77.18 on the detection, recognition, and information extraction tasks, indicating substantial room for improvement in current LVLMs. In addition, we conduct detailed analyses of error causes, cross-modal capabilities, and multi-orientation settings, yielding valuable insights for future research. We hope that CFMME will spur further progress in LVLMs, especially by improving their performance on multiple multimodal tasks in the financial domain.

2605.29461 2026-05-29 cs.CV

FlowSeg: Dynamic Semantic Guidance for LLM-Conditioned Segmentation

Zekang Zhang, Guangyu Gao, Youyun Tang, ChengJing Wu, Xiaochao Qu, Chi Harold Liu, Jianbo Jiao, Yunchao Wei, Luoqi Liu, Ting Liu

AI summary: To address semantic misalignment in LLM-conditioned segmentation, the paper proposes FlowSeg, which dynamically guides mask generation through a bidirectional semantic flow, achieving semantic alignment and state-of-the-art performance.

Comments: 18 pages, accepted by ICML 2026

Abstract

LLM-conditioned segmentation has recently advanced rapidly by coupling large language models with iterative mask generation frameworks. However, we identify a persistent failure mode in current propose-then-select pipelines. Although high-quality mask candidates are often generated, the final prediction may fail to match the given linguistic condition. This failure arises because language semantics are typically used as static prompts or post-hoc matching signals, rather than participating in the iterative mask generation process. Through systematic analysis, we show that many errors stem from semantic misalignment rather than poor mask quality. To address this issue, we propose FlowSeg, which introduces dynamic semantic guidance via a bidirectional semantic flow between intermediate decoding states and LLM-derived condition embeddings throughout the generation process. Language conditions actively guide mask refinement at each stage, while condition embeddings are progressively updated by emerging visual evidence. This design yields semantically grounded mask representations and visually aligned language conditions, enabling more reliable matching. We further incorporate a lightweight boundary-aware refinement to selectively enhance uncertain regions without perturbing confident interiors. Extensive experiments on referring expression segmentation and reasoning segmentation tasks demonstrate that FlowSeg consistently improves language-mask alignment and achieves state-of-the-art performance. Project page: https://zkzhang98.github.io/FlowSeg_page

2605.29460 2026-05-29 cs.CV

FedSmoothLoRA: Toward Smoother and Faster Convergence in Federated Low-Rank Adaptation

FedSmoothLoRA:面向更平滑和更快速收敛的联邦低秩适配

Zehao Wang, Guanglei Yang, Yihan Zeng, Hang Xu, Hongzhi Zhang, Wangmeng Zuo, Chun-Mei Feng

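AI summary: To address the limited update space, inter-round state mismatch, and client-agnostic starting state in federated low-rank adaptation, the paper proposes the FedSmoothLoRA framework, which uses a Round-Matching matrix and a Gradient-Aligned matrix to achieve smoother and faster convergence.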

Comments 26 pages, 4 figures

详情
AI中文摘要

使用低秩适配(LoRA)对基础模型进行联邦微调提供了一种高效的解决方案,可在降低通信和计算成本的同时保持数据本地性。然而,FedAvg与LoRA的直接组合存在三个关键问题:有限的更新空间限制了模型的有效学习能力;轮间状态不匹配破坏了跨轮局部优化的连续性;以及客户端无关的起始状态减慢了客户端上的局部收敛。尽管最近的方法通过跨通信轮将LoRA更新合并到主干中缓解了有限更新空间问题,但轮间状态不匹配和客户端无关的起始状态仍未得到充分解决。为了解决这些问题,我们提出了FedSmoothLoRA,一个联邦LoRA微调框架,它保留了扩大的更新空间,改善了跨轮局部优化的连续性,并为局部训练提供了客户端感知的起始状态。在每个通信轮,FedSmoothLoRA使用两个矩阵构建局部LoRA初始化:一个轮匹配矩阵,用于保持跨轮局部状态连续性;以及一个梯度对齐矩阵,用于从局部数据估计的梯度信号提供客户端特定的优化指导。这些设计共同实现了更平滑和更快速的收敛。在图像分类和自然语言生成任务上的大量实验表明,FedSmoothLoRA始终优于现有的联邦LoRA微调方法。代码:https://github.com/wangzehao0704/FedSmoothLoRA

英文摘要

Federated fine-tuning of foundation models with Low-Rank Adaptation (LoRA) provides an efficient solution for reducing communication and computation costs while preserving data locality. However, the direct combination of FedAvg and LoRA suffers from three key issues: limited update space, which restricts the model's effective learning capacity; inter-round state mismatch, which disrupts cross-round local optimization continuity; and a client-agnostic starting state, which slows local convergence on clients. Although recent methods mitigate the limited update space issue by merging LoRA updates into the backbone across communication rounds, inter-round state mismatch and the client-agnostic starting state remain insufficiently addressed. To address these issues, we propose FedSmoothLoRA, a federated LoRA tuning framework that preserves the enlarged update space, improves cross-round local optimization continuity, and provides a client-aware starting state for local training. At each communication round, FedSmoothLoRA constructs the local LoRA initialization using two matrices: a Round-Matching matrix that preserves cross-round local state continuity, and a Gradient-Aligned matrix that provides client-specific optimization guidance from gradient signals estimated on local data. Together, these designs enable smoother and faster convergence. Extensive experiments on image classification and natural language generation tasks demonstrate that FedSmoothLoRA consistently outperforms existing federated LoRA tuning methods. Code: https://github.com/wangzehao0704/FedSmoothLoRA

2605.29459 2026-05-29 cs.CL cs.LG

Kronecker Embeddings: Byte-Level Structured Token Representations for Parameter-Efficient Language Models

Kronecker嵌入:用于参数高效语言模型的字节级结构化词元表示

Rohan Shravan

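AI summary: Proposes Kronecker Embeddings, which replace the standard embedding table with a deterministic byte-level character-position factorization, eliminating 91-94% of input-side trainable parameters and achieving lower validation loss, stronger spelling robustness, and runtime efficiency across several experiments.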

Comments 28 pages, 16 tables. Reference implementation: https://github.com/theschoolofai/kronecker-embeddings

详情
AI中文摘要

大型语言模型通过一个形状为|V| x d_model的可学习嵌入表路由每个输入,在前沿规模下消耗数亿到数十亿的可训练参数。我们引入Kronecker嵌入,一种确定性的字节级字符-位置分解,用固定编码器和单个可学习投影替换该表,与标准BPE分词器兼容,在前沿规模下消除91-94%的输入侧可训练参数。我们提供五项贡献。第一,跨六个LM(135M-671B参数)的模型探针显示,训练后的输入嵌入将探针词的印刷变体聚类程度远高于形态学相关词;Kronecker在嵌入层避免了这种聚类。第二,在FineWeb-Edu上对nanoGPT GPT-2 124M进行2.5B词元的三种子受控比较显示,Kronecker达到比BPE绑定基线低2.5±0.2%的验证损失(差距0.083±0.007 nats,约9%更低的困惑度),达到BPE收敛损失所需的步数减少约1.43倍。第三,在110个干净/拼写错误对上的拼写鲁棒性探针显示,Kronecker在55.5%的对上保持top-1预测,而BPE为47.3%(+8.2个百分点),并将KL降低7.6%,在11个类别中赢得或平局10个;生成探针显示Kronecker在生成中回显字节新颖字符串和拼写错误,而BPE则遗忘它们。第四,BPE嵌入范数在训练期间漂移,而Kronecker投影范数保持在1.0附近,与稳定的表示目标一致。第五,一种即时运行时变体从4.5 MB的字节缓冲区重建嵌入,而不是从词汇量为131,072的2.15 GB表中重建,步长时间开销为0.01-0.24%。字节级局部性存在权衡:字节相似但语义距离远的对(compute/commute, nation/notion)聚类在一起,将消歧转移到早期注意力层。

英文摘要

Large language models route every input through a learned embedding table of shape |V| x d_model, consuming hundreds of millions to billions of trainable parameters at frontier scale. We introduce Kronecker Embeddings, a deterministic byte-level character-position factorization that replaces this table with a fixed encoder and a single learned projection, compatible with standard BPE tokenizers, eliminating 91-94% of input-side trainable parameters at frontier scale. We provide five contributions. First, a cross-model probe across six LMs (135M-671B parameters) shows that trained input embeddings cluster typographic variants of the probe word far more than morphological relatives; Kronecker escapes this clustering at the embedding layer. Second, a controlled three-seed comparison on nanoGPT GPT-2 124M over 2.5B tokens of FineWeb-Edu shows Kronecker reaching 2.5 ± 0.2% lower validation loss than the BPE-tied baseline (gap 0.083 ± 0.007 nats, ~9% lower perplexity), needing ~1.43x fewer steps to reach BPE's converged loss. Third, a spelling-robustness probe over 110 clean/typo pairs shows Kronecker preserves the top-1 prediction on 55.5% of pairs vs. 47.3% for BPE (+8.2 pp) and lowers KL by 7.6%, winning or tying in 10 of 11 categories; a generation probe shows Kronecker echoes byte-novel strings and typos through generation where BPE forgets them. Fourth, BPE embedding norm drifts during training while Kronecker projection norm stays near 1.0, consistent with a stable representational target. Fifth, an on-the-fly runtime variant reconstructs embeddings from a 4.5 MB byte buffer rather than a 2.15 GB table at vocabulary 131,072, with 0.01-0.24% step-time overhead. Byte-level locality has a tradeoff: byte-similar but semantically distant pairs (compute/commute, nation/notion) cluster together, shifting disambiguation to early attention layers.
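As a quick sanity check on the memory figures quoted above, the sketch below recomputes the size of a dense |V| x d_model embedding table. The width d_model = 4096 and the 4-byte (fp32) storage are assumptions, not values stated in the abstract; they are chosen only because they reproduce the quoted 2.15 GB at a vocabulary of 131,072.

```python
# Back-of-the-envelope size of a dense embedding table.
# ASSUMPTIONS (not stated in the abstract): d_model = 4096, fp32 storage.
VOCAB_SIZE = 131_072      # vocabulary size quoted in the abstract
D_MODEL = 4096            # assumed embedding width
BYTES_PER_PARAM = 4       # assumed fp32 entries


def table_bytes(vocab_size: int, d_model: int, bytes_per_param: int) -> int:
    """Bytes needed to store a |V| x d_model embedding table."""
    return vocab_size * d_model * bytes_per_param


dense = table_bytes(VOCAB_SIZE, D_MODEL, BYTES_PER_PARAM)
byte_buffer = 4.5e6       # the 4.5 MB byte buffer quoted in the abstract

print(f"dense table: {dense / 1e9:.2f} GB")
print(f"byte buffer is roughly {dense / byte_buffer:.0f}x smaller")
```

Under these assumptions the table occupies exactly 2^31 bytes, which matches the quoted figure when expressed in decimal gigabytes.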

2605.29458 2026-05-29 cs.CL cs.AI

Adaptive Interviewing for Persona Simulation in LLMs: Evidence-Grounded Reasoning Improves Decision Alignment

面向LLM人格模拟的自适应访谈:基于证据的推理提升决策对齐

Ruoxi Su, Yuhan Liu, Jingyu Hu

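AI summary: Proposes an adaptive interview framework that gathers persona-relevant information through a structured three-stage dialogue and uses the interview transcripts to evaluate how well LLMs simulate individual decisions in moral dilemma scenarios; predictions grounded in follow-up evidence are more accurate than those grounded only in the core questions (45.5% vs. 39.3%).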

Comments 20 pages, 2 figures, 12 tables

详情
AI中文摘要

准确模拟特定个体的决策对大型语言模型(LLM)仍然具有挑战性,部分原因在于人格信息通常以静态描述形式提供,缺乏个体层面决策模拟所需的价值观、经历和情境线索。我们提出一种自适应访谈框架,通过结构化的三阶段对话收集人格相关信息:核心问题、动态追问和综合人格总结。利用生成的访谈记录,我们评估LLM能否模拟参与者在道德困境场景中的决策。我们比较了三种对话情境——核心10个问题回答、完整访谈对话以及总结性人格表征。结果发现,自适应访谈并非作为统一的准确性增强器,而更像是一种选择性接地机制:约40%的完整访谈轨迹中融入了基于追问的证据,且这些基于追问的预测比仅基于核心问题的预测更准确(45.5% vs. 39.3%)。这些发现强调,仅靠更丰富的人格背景是不够的:只有当模型真正将其决策基于用户特定证据时,改进才会出现。

英文摘要

Accurately simulating the decisions of a specific individual remains challenging for large language models (LLMs), partly because persona information is often provided as static descriptions that miss the values, experiences, and contextual cues needed for individual-level decision simulation. We propose an adaptive interview framework that gathers persona-relevant information through a structured three-stage dialogue: core questions, dynamic follow-ups, and a synthesized personality summary. Using the resulting interview transcripts, we evaluate whether LLMs can simulate participants' decisions in moral dilemma scenarios. We compare three conversational contexts -- Core-10 responses, the full interview dialogue, and a summarized persona representation. We find that adaptive interviewing functions less as a uniform accuracy booster and more as a selective grounding mechanism: follow-up-derived evidence is incorporated in around 40% of full-interview traces, and these follow-up-grounded predictions are more accurate than core-only grounded ones (45.5% vs. 39.3%). These findings highlight that richer persona context alone is insufficient: improvements arise only when models actually ground their decisions in user-specific evidence.

2605.29455 2026-05-29 cs.CV eess.SP

Uni-RCM: Unified Reference-guided Cross-modal Mapping for Multi-Class Anomaly Detection

Uni-RCM:面向多类异常检测的统一参考引导跨模态映射

Yangchen Wu, Huiqiang Xie

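AI summary: Proposes the Uni-RCM framework, which uses a reference guide block and an offline residual quantizer for unified multi-class industrial anomaly detection, achieving state-of-the-art performance on the MVTec-3D AD dataset.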

Comments This work has been submitted to the IEEE for potential publication

详情
AI中文摘要

多模态工业异常检测通常依赖于每个产品类别的单独模型,从根本上限制了实际可扩展性。当转向同时处理多种类别的统一范式时,由于类间干扰和特征流形混淆,检测精度往往会下降。为了克服这些挑战,我们提出了一个统一的参考引导跨模态映射框架,命名为Uni-RCM。其核心是一个参考引导块,通过引入可学习的参考特征来动态过滤特定类别的噪声,该参考特征捕捉了不同模态之间的共性。此外,我们提出了一个离线残差量化器,通过多个级联码本来刻画正常样本的分布。在MVTec-3D AD数据集上的大量评估表明,在具有挑战性的多类设置下,该方法在图像级检测和像素级定位方面均达到了最先进的性能。

英文摘要

Multi-modal industrial anomaly detection typically relies on separate models for each product category, fundamentally limiting practical scalability. When shifting to a unified paradigm that handles diverse classes simultaneously, detection accuracy often degrades due to inter-class interference and feature manifold confusion. To overcome these challenges, we propose a Unified Reference-guided Cross-modal Mapping framework, named Uni-RCM. At its core is a reference guide block that dynamically filters out category-specific noise by introducing a learnable reference feature, which captures the commonalities across different modalities. In addition, we propose an offline residual quantizer that characterizes the distribution of normal samples with multiple cascaded codebooks. Extensive evaluations on the MVTec-3D AD dataset demonstrate state-of-the-art performance in the challenging multi-class setting, for both image-level detection and pixel-level localization.

2605.29454 2026-05-29 cs.LG

A Full-Pipeline Framework for Evaluating Membership Inference Attacks in Machine Learning

用于评估机器学习中成员推断攻击的全流程框架

Ding Chen, Xinwen Cheng, Xuyang Zhong, Xinping Chen, Xiaolin Huang, Chen Liu

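AI summary: Proposes a full-pipeline evaluation framework spanning data, architectures, algorithms, and post-training modules that systematically analyzes how different contexts affect membership inference attacks, and offers practical guidelines based on standardized threat models and complementary metrics.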

详情
AI中文摘要

虽然成员推断攻击(MIAs)是识别训练数据的主流方法,但其应用已扩展到隐私审计和机器遗忘。然而,该领域缺乏一个系统性的框架来评估不同上下文如何影响MIA的效果。没有这样的特征描述,实践者可能会部署在基准测试中表现良好但在面对特定真实世界数据集的细微差别时变得统计上无关的算法。为了弥合这一差距并提供可操作的见解,我们引入了一个全面的评估框架,该框架系统地描述了整个机器学习流程(包括数据、架构、算法和后训练模块)中的隐私风险。我们的框架旨在固有地捕捉多样化的操作上下文,严格评估了在广泛训练配置下的最先进MIA。为了考虑真实世界部署中不同的误分类成本,我们采用了三个互补指标:对称成本下的平衡准确率,以及低FPR下的TPR(或低FNR下的TNR)用于严格惩罚误报或漏检的非对称场景。此外,认识到现有MIA假设不同的对手能力,我们形式化了两种标准化的威胁模型,并将这些攻击调整为相应的变体,以确保公平的基准测试。大量的实证评估表明,特定MIA方法的效果高度依赖于假设的威胁模型和选择的评估指标。最终,我们将这些发现提炼为可操作的指南,并提供一个即用的审计工具包,使实践者能够进行更好的隐私评估。

英文摘要

While Membership Inference Attacks (MIAs) are the prevailing method for identifying training data, their application has expanded into privacy auditing and machine unlearning. Nevertheless, the field lacks a systematic framework for evaluating how different contexts affect MIA efficacy. Without such a characterization, practitioners risk deploying algorithms that perform well on benchmarks but become statistically irrelevant when faced with the nuances of specific, real-world datasets. To bridge this gap and provide actionable insights, we introduce a comprehensive evaluation framework that systematically characterizes privacy risks across the entire machine learning pipeline, spanning data, architectures, algorithms, and post-training modules. Designed to inherently capture diverse operational contexts, our framework rigorously evaluates state-of-the-art MIAs across a broad spectrum of training configurations. To account for varying misclassification costs in real-world deployments, we employ three complementary metrics: Balanced Accuracy for symmetric costs, alongside TPR at low FPR (or TNR at low FNR) for asymmetric scenarios where false alarms or missed detections are strictly penalized. Furthermore, recognizing that existing MIAs assume divergent adversary capabilities, we formalize two standardized threat models and adapt these attacks into corresponding variants to ensure an equitable benchmark. Extensive empirical evaluations demonstrate that the efficacy of specific MIA methodologies is highly sensitive to the assumed threat models and chosen evaluation metrics. Ultimately, we distill these findings into actionable guidelines and provide a ready-to-use auditing toolkit, empowering practitioners to conduct better privacy assessments.
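The metrics named above can be illustrated with a small sketch. It assumes each attack outputs a membership score per example (higher means "more likely a training member") and that ground-truth labels are 1 for members and 0 for non-members; the function names and toy data are hypothetical and are not taken from the paper's toolkit.

```python
def balanced_accuracy(scores, labels, threshold):
    """Mean of TPR and TNR when predicting 'member' for score >= threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    return 0.5 * (tp / pos + tn / neg)


def tpr_at_fpr(scores, labels, max_fpr):
    """Best TPR over all thresholds whose FPR does not exceed max_fpr."""
    pos = sum(labels)
    neg = len(labels) - pos
    best = 0.0
    for t in sorted(set(scores)):
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        if fp / neg <= max_fpr:
            best = max(best, tp / pos)
    return best


# Toy attack output: four members (label 1) and four non-members (label 0).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

print(balanced_accuracy(scores, labels, threshold=0.5))
print(tpr_at_fpr(scores, labels, max_fpr=0.0))
print(tpr_at_fpr(scores, labels, max_fpr=0.25))
```

On this toy data the same attack looks reasonable under balanced accuracy but identifies only half of the members once no false alarms are tolerated, which is the kind of metric sensitivity the abstract describes. TNR at low FNR follows by swapping the roles of the two classes.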

2605.29453 2026-05-29 cs.LG cs.AI

Forget Less, Generalize More: Unifying Temporal and Structural Adaptation for Dynamic Graphs

遗忘更少,泛化更强:统一动态图的时间与结构适应

Qian Chang, Ciprian Doru Giurcaneanu, Runsong Jia, Xia Li, Guoping Hu, Xiufeng Cheng, Jinqing Yang, Mengjia Wu, Yi Zhang

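AI summary: Proposes the Dual-Scale Retentive Dynamics (DSRD) framework, which uses a unified temporal-structural adaptation mechanism and learnable decay kernels to improve generalization in dynamic graph representation learning.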

详情
AI中文摘要

动态图上的表示学习需要捕获随时间与结构共同演化的复杂依赖关系。现有方法通常采用固定的时间衰减方案或预定义的结构传播深度,限制了其在具有不同交互频率和拓扑特征的图上的泛化能力。我们提出双尺度保持动态(DSRD),一个统一框架,维护一个同时编码时间记忆和结构上下文的保持性表示状态。DSRD引入两个关键组件:(i) 具有双尺度自适应的保持状态,在单一循环公式中联合建模时间动态和结构传播;(ii) 具有可学习时间敏感性参数的自适应衰减核,基于底层交互模式自动平衡短期响应和长期保持。我们提供理论分析,建立了事件级并行聚合与高效循环状态更新之间的等价性,以及所学动态的稳定性和有界性保证。在14个真实世界基准上的广泛实验表明,DSRD在链接预测和节点分类任务上均持续达到最先进性能,并在直推和归纳设置中展现出强泛化能力。

英文摘要

Representation learning on dynamic graphs requires capturing complex dependencies that evolve across both time and structure. Existing approaches typically adopt fixed temporal decay schemes or predetermined structural propagation depths, limiting their ability to generalize across graphs with diverse interaction frequencies and topological characteristics. We propose Dual-Scale Retentive Dynamics (DSRD), a unified framework that maintains a retentive representation state encoding both temporal memory and structural context. DSRD introduces two key components: (i) a retentive state with dual-scale adaptation that jointly models temporal dynamics and structural propagation within a single recurrent formulation, and (ii) adaptive decay kernels with learnable time-sensitivity parameters that automatically balance short-term responsiveness and long-term retention based on the underlying interaction patterns. We provide theoretical analysis establishing the equivalence between event-wise parallel aggregation and efficient recurrent state updates, as well as stability and boundedness guarantees for the learned dynamics. Extensive experiments on 14 real-world benchmarks demonstrate that DSRD consistently achieves state-of-the-art performance on both link prediction and node classification tasks, with strong generalization across transductive and inductive settings.
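The abstract states an equivalence between event-wise parallel aggregation and recurrent state updates. The sketch below illustrates that kind of equivalence in the simplest possible case: scalar events and a single fixed exponential decay rate `lam`. Both simplifications are assumptions made for illustration; DSRD's kernels are described as learnable, and its state also encodes structural context, neither of which is modeled here.

```python
import math


def parallel_state(times, values, lam):
    """State at the last event, written as an explicit decayed sum over all events."""
    t_now = times[-1]
    return sum(math.exp(-lam * (t_now - t)) * x for t, x in zip(times, values))


def recurrent_state(times, values, lam):
    """The same state, computed with one constant-time update per event."""
    state, t_prev = 0.0, times[0]
    for t, x in zip(times, values):
        # Decay the running state over the elapsed gap, then add the new event.
        state = math.exp(-lam * (t - t_prev)) * state + x
        t_prev = t
    return state


times = [0.0, 1.0, 2.5, 4.0]     # hypothetical event timestamps
values = [1.0, 2.0, -1.0, 0.5]   # hypothetical event features
a = parallel_state(times, values, lam=0.3)
b = recurrent_state(times, values, lam=0.3)
print(math.isclose(a, b, rel_tol=1e-9))
```

The two forms agree because exp(-lam * (t3 - t1)) factors into exp(-lam * (t3 - t2)) * exp(-lam * (t2 - t1)), so each recurrent step applies exactly the extra decay that the explicit sum would assign to older events.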

2605.29452 2026-05-29 cs.CV

Comparative evaluation of photogrammetric reconstruction methods and 3D Gaussian Splatting for road surface roughness analysis

摄影测量重建方法与3D高斯泼溅用于路面粗糙度分析的比较评估

Marouane Elmegdar, Teng Xiao

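AI summary: Compares four reconstruction methods (COLMAP, Meshroom, Metashape, and 3D Gaussian Splatting) on their ability to estimate road surface roughness from smartphone imagery; COLMAP is the most sensitive to micro-texture, and the open-source pipelines are viable for low-cost pavement monitoring.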

Comments accepted by RSMIP 2026

详情
AI中文摘要

基于图像的三维重建为传统的基于传感器的路面评估技术提供了一种低成本替代方案。本研究比较了四种重建流程——COLMAP、Meshroom、Metashape和3D高斯泼溅(3DGS),以评估它们从智能手机图像估计路面粗糙度的能力。所有点云均在CloudCompare中使用一致的工作流程进行处理,包括方向对齐、分割、法线估计以及在0.2、0.4和0.6模型单位的邻域半径下进行粗糙度计算。结果表明,COLMAP对微纹理的灵敏度最高,而Meshroom产生具有中等粗糙度变化的平衡重建。Metashape由于其内部滤波而生成最平滑的几何形状,3DGS捕捉到可见的不规则性但表现出更高的噪声和较低的密度。比较表明,开源管道可用于相对粗糙度评估,为低成本路面监测提供了一种实用方法。

英文摘要

Image-based 3D reconstruction offers a low-cost alternative to traditional sensor-based techniques for road surface assessment. This study compares four reconstruction pipelines, COLMAP, Meshroom, Metashape, and 3D Gaussian Splatting (3DGS), to evaluate their ability to estimate road surface roughness from smartphone imagery. All point clouds were processed in CloudCompare using a consistent workflow involving orientation alignment, segmentation, normal estimation, and roughness computation at neighborhood radii of 0.2, 0.4, and 0.6 model units. The results show that COLMAP provides the highest sensitivity to micro-texture, while Meshroom yields balanced reconstructions with moderate roughness variation. Metashape produces the smoothest geometry due to its internal filtering, and 3DGS captures visible irregularities but exhibits higher noise and lower density. The comparison demonstrates that open-source pipelines are viable for relative roughness evaluation, offering a practical approach for low-cost pavement monitoring.

2605.29448 2026-05-29 cs.LG cs.AI cs.CV cs.IT math.IT

How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions

数据集值多少钱?缩放定律、Vendi分数与矩阵谱函数

Jeff A. Bilmes, Gantavya Bhatt, Arnav M. Das

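AI summary: Unifies neural scaling laws and the Vendi Score through submodularity, proposes matrix spectral functions as a general data appraisal framework, and develops fast secular-equation-based updates that yield an average speedup of about 35,000x and make direct Vendi Score optimization feasible at ImageNet-1K scale; in the experiments, facility location best predicts subset value.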

Comments 75 pages

详情
AI中文摘要

神经缩放定律通过数据集大小评估数据,而Vendi分数使用量子熵衡量数据集价值。我们证明常见的神经缩放定律目标和Vendi分数都是子模的。进一步,我们表明Vendi分数是一类更广泛的子模目标(称为矩阵谱函数)的特例,这还包括行列式点过程(DPP)目标以及许多其他目标。我们还引入了弱矩阵单调函数,并展示了它们如何导致弱子模矩阵谱函数,从而产生一系列实用的数据评估目标。我们开发了基于割线方程的更新方法,避免了贪心优化过程中的重复特征分解,将$m$维嵌入的边际增益评估相对于预言机查询减少了$O(m)$因子。这实现了平均约35,000倍的实证加速,使得在ImageNet-1K规模的数据集上直接优化Vendi分数成为可能。由此,我们比较了多个目标在固定大小、类别平衡和固定训练预算条件下预测训练子集对保留测试性能价值的能力,包括Vendi分数、DPP、设施选址以及三种新的矩阵谱变体。在多个数据集上,设施选址表现最佳。直接优化还揭示,虽然Vendi分数在中等分数范围内具有预测性,但将目标推向更高值可能使其成为下游性能的糟糕代理。我们还发现,均匀随机选择的固定大小子集(无论是否类别平衡)在评估分数和保留性能上都表现出显著的集中性。最后,我们表明大小、类别平衡和训练预算单独并不决定数据价值:即使控制这些因素,性能范围也从好到差平滑变化。

英文摘要

Neural scaling laws appraise data through dataset size, while the Vendi Score uses quantum entropy to measure dataset value. We show that both common neural-scaling-law objectives and the Vendi Score are submodular. We further show that the Vendi Score is a special case of a broader class of submodular objectives that we call matrix spectral functions. This class also includes determinantal point process (DPP) objectives, as well as many others. We also introduce weakly matrix monotone functions and show how they lead to weakly submodular matrix spectral functions, yielding a broad family of practical objectives for data appraisal. We develop secular-equation-based updates that avoid repeated eigendecompositions during greedy optimization, reducing marginal-gain evaluation for $m$-dimensional embeddings by an $O(m)$ factor relative to oracle queries. This yields an average empirical speedup of about 35,000x, making direct optimization of the Vendi Score feasible on ImageNet-1K-scale datasets. With this speedup, we compare how well several objectives predict the value of training subsets for held-out test performance under fixed-size, class-balanced, and fixed training-budget regimes, including the Vendi Score, DPPs, facility location, and three new matrix spectral variants. Across multiple datasets, facility location performs the best. Direct optimization also reveals that, while the Vendi Score is predictive over moderate score ranges, pushing the objective to higher values can make it a poor downstream performance proxy. We also find that fixed-size subsets drawn uniformly at random, both unconstrained and class-balanced, are remarkably concentrated in both appraisal scores and held-out performance. Finally, we show that size, class balance, and training budget do not alone determine data value: even when controlling for these factors, performance ranges smoothly from good to bad.
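Facility location, which the abstract reports as the best-performing objective, has a compact standard form: f(S) is the sum, over every item i, of the largest similarity between i and any selected item j in S. The sketch below pairs it with plain greedy selection on a small hand-made similarity matrix. The matrix values and function names are illustrative assumptions, and this naive greedy loop is not the paper's accelerated optimizer.

```python
def facility_location(sim, subset):
    """f(S) = sum over all items i of max over j in S of sim[i][j]; f(empty) = 0."""
    if not subset:
        return 0
    return sum(max(row[j] for j in subset) for row in sim)


def greedy_select(sim, k):
    """Pick k items, each time adding the one with the largest marginal gain."""
    chosen = []
    for _ in range(k):
        base = facility_location(sim, chosen)
        candidates = [j for j in range(len(sim)) if j not in chosen]
        # On ties, max() keeps the first candidate in index order.
        best = max(candidates,
                   key=lambda j: facility_location(sim, chosen + [j]) - base)
        chosen.append(best)
    return chosen


# Hypothetical integer similarities: items 0-1 form one cluster, items 2-4 another.
sim = [
    [10, 8, 1, 1, 0],
    [8, 10, 2, 1, 1],
    [1, 2, 10, 7, 5],
    [1, 1, 7, 10, 7],
    [0, 1, 5, 7, 10],
]

picked = greedy_select(sim, 2)
print(picked, facility_location(sim, picked))
```

Greedy first takes the most central item of the larger cluster and then moves to the other cluster, because a second pick from the same cluster adds little coverage. That diminishing return is the submodularity property discussed in the abstract.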

2605.29447 2026-05-29 cs.CV cs.CL

Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents

恢复策略诱导错误:鲁棒GUI智能体的基准测试与轨迹合成

Tianpeng Bu, Xin Liu, Qihua Chen, Hao Jiang, Shurui Li, Hongtao Duan, Lu Jiang, Lulu Hu, Bin Yang, Minying Zhang

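AI summary: Proposes the GUI-RobustEval benchmark and the Robustness-driven Trajectory Synthesis (RoTS) framework, which uses a tree-based pipeline to proactively discover error modes and synthesize recovery steps; the resulting models reach state-of-the-art performance on OSWorld.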

Comments ICML 2026 Spotlight. 36 pages, 19 figures, includes appendix

详情
AI中文摘要

尽管GUI智能体发展迅速,但它们通常缺乏从自身错误中恢复的鲁棒性,阻碍了实际部署。为了在评估和数据层面弥补这一差距,我们引入了GUI-RobustEval并提出了鲁棒驱动轨迹合成。GUI-RobustEval包含1,216个可执行测试用例,系统性地衡量在广泛且真实的错误模式下的错误恢复能力。在数据层面,RoTS是一个可扩展的合成框架,通过树状管道主动发现多样化的错误模式并合成相应的恢复步骤,创建了80万高质量数据。我们的两个模型RoTS-7B和RoTS-32B,在数据集上微调后,在GUI-RobustEval和传统GUI基准测试上均表现出显著提升。值得注意的是,RoTS-32B在OSWorld上达到了最先进性能,成功率为47.4%,All-Pass@4得分为33.8%,表明改进的长时域错误恢复能力有助于鲁棒性和整体性能。我们的代码可在https://github.com/AlibabaResearch/RoTS获取。

英文摘要

While GUI agents have advanced rapidly, they often lack the robustness to recover from their own errors, hindering real-world deployment. To bridge this gap at both the evaluation and data levels, we introduce GUI-RobustEval and propose Robustness-driven Trajectory Synthesis (RoTS). GUI-RobustEval contains 1,216 executable test cases that systematically measure error recovery capabilities across a broad and realistic spectrum of error modes. At the data level, RoTS is a scalable synthesis framework that creates 800k high-quality data samples via a tree-based pipeline that proactively discovers diverse error modes and synthesizes corresponding recovery steps. Our two models, RoTS-7B and RoTS-32B, fine-tuned on our dataset, both demonstrate significant gains on GUI-RobustEval and traditional GUI benchmarks. Notably, RoTS-32B achieves state-of-the-art performance on OSWorld, with a 47.4% success rate and a 33.8% All-Pass@4 score, suggesting that improved long-horizon error recovery ability contributes to both robustness and overall performance. Our code is available at https://github.com/AlibabaResearch/RoTS.

2605.29446 2026-05-29 cs.AI

CrystalXRD-Bench: Benchmarking Vision-Language Models for XRD Peak Indexing Across Diverse Crystalline Materials

CrystalXRD-Bench:面向多种晶体材料的XRD峰索引的视觉-语言模型基准测试

Chengliang Xu, Xiaogang Li, Peiyao Xiao, Beng Wang, Hu Wei, Bing Zhao

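AI summary: Proposes the CrystalXRD-Bench benchmark, whose 250 samples evaluate how well vision-language models identify Miller indices (HKL) from powder XRD patterns; the best model reaches a Jaccard score of only 0.5888, so the task is far from solved.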

Comments 18 pages, 10 figures

详情
AI中文摘要

从粉末XRD图谱中识别米勒指数需要现有多模态基准未测试的能力:模型必须从渲染的科学曲线中读取窄峰位置,然后将该观察与多步晶体学推理联系起来。我们引入CrystalXRD-Bench,一个基于10个公共晶体学数据库构建的250样本基准,用于单一任务:恢复对XRD图谱中最高强度峰有贡献的完整HKL集合。每个样本将渲染的XRD图像与源CIF文本和化学式配对,因此视觉提取错误和推理错误可以并排检查。我们评估了七个视觉-语言模型。最佳Jaccard得分为0.5888(GPT-5.4),精确匹配率为37.6%,但七个模型中有六个仍低于Jaccard 0.50;该任务远未解决。错误模式系统性地变化:双峰情况尤其脆弱,注重召回率的模型通过过度预测HKL来增加覆盖率,而访问CIF文本并不能缩小晶体学计算方面的差距。除了模型排名外,该基准还确定了当前VLM在定量科学图形上失败的条件。所有数据和评估代码将公开提供。

英文摘要

Miller-index identification from powder XRD patterns requires capabilities untested by existing multimodal benchmarks: the model must read a narrow peak location from a rendered scientific curve and then connect that observation to multi-step crystallographic reasoning. We introduce CrystalXRD-Bench, a 250-sample benchmark built from 10 public crystallographic databases for a single task: recover the full set of HKLs contributing to the highest-intensity peak in an XRD pattern. Each sample pairs the rendered XRD image with the source CIF text and chemical formula, so visual extraction errors and reasoning errors can be examined side by side. We evaluate seven vision-language models. The best Jaccard score is 0.5888 (GPT-5.4) with an exact-match rate of 37.6%, yet six of seven models remain below Jaccard 0.50; the task is far from solved. Error patterns vary systematically: double-peak cases are especially brittle, recall-heavy models gain coverage by over-predicting HKLs, and access to CIF text does not close the gap in crystallographic calculation. Alongside model rankings, the benchmark identifies the conditions under which current VLMs fail on quantitative scientific figures. All data and evaluation code will be publicly available.
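The two headline metrics can be sketched for a single sample as follows. The sketch treats each HKL as an (h, k, l) tuple and compares predictions and references as plain sets; whether the benchmark merges symmetry-equivalent indices before scoring is not stated in the abstract, so no such merging is done here. The function names and example indices are hypothetical.

```python
def jaccard(predicted, reference):
    """Size of the intersection over size of the union of two HKL sets."""
    p, r = set(predicted), set(reference)
    if not p and not r:
        return 1.0  # two empty sets are treated as a perfect match
    return len(p & r) / len(p | r)


def exact_match(predicted, reference):
    """True only when the predicted set equals the reference set."""
    return set(predicted) == set(reference)


reference = [(1, 1, 1)]
over_predicted = [(1, 1, 1), (2, 0, 0)]   # one correct HKL plus one spurious HKL

print(jaccard(over_predicted, reference), exact_match(over_predicted, reference))
print(jaccard(reference, reference), exact_match(reference, reference))
```

The first case shows why over-predicting HKLs raises recall while lowering the Jaccard score: every spurious index enlarges the union without adding to the intersection.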

2605.29440 2026-05-29 cs.CL cs.AI cs.IR

SkillBrew: Multi-Objective Curation of Skill Banks for LLM Agents

SkillBrew: LLM智能体技能库的多目标策展

Wentao Hu, Zhendong Chu, Yiming Zhang, Junda Wu, Ming Jin, Xiangyu Zhao, Yilei Shao, Yanfeng Wang, Qingsong Wen

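AI summary: Proposes the SkillBrew framework, which models skill bank curation as Pareto-aware optimization under a utility constraint and solves it with a bi-level propose-then-verify loop to keep skill banks compact and diverse.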

Comments 16 pages. Preprint. Under review

详情
AI中文摘要

检索增强的LLM智能体越来越依赖于精心策划的技能库:指导复杂任务决策的可重用文本原则集合。现有方法通常以仅追加的方式扩展这些库,不断添加新技能而不移除冗余、过时或有害的技能,导致存储库效率低下且策展不良。在本文中,我们将技能库策展形式化为一个受约束的多目标问题:一个理想的库必须对智能体有用、内容多样,并且对查询分布有良好的覆盖。为此,我们引入了SkillBrew,一个多目标策展框架,将技能库策展形式化为在效用约束下的帕累托感知优化,并通过双层提议-验证循环求解。我们在两个公共基准上评估了我们的方法。我们的发现表明,将技能库视为原则性策展的对象,而不是不断增长的仅追加日志,是构建自我改进的LLM智能体的重要一步。

英文摘要

Retrieval-augmented LLM agents increasingly rely on curated skill banks: collections of reusable textual principles that guide decision making on complex tasks. Existing approaches typically expand these banks in an append-only fashion, continuously adding new skills without removing redundant, outdated, or harmful ones, resulting in inefficient and poorly curated repositories. In this paper, we formulate skill bank curation as a constrained multi-objective problem: a desirable bank must be useful for the agent, diverse in its content, and provide good coverage of the query distribution. To this end, we introduce SkillBrew, a multi-objective curation framework that formalizes skill bank curation as Pareto-aware optimization under a utility constraint and solves it via a bi-level propose-then-verify loop. We evaluate our approach on two public benchmarks. Our findings suggest that treating skill banks as objects of principled curation, rather than ever-growing append-only logs, is an important step toward building self-improving LLM agents.

2605.29438 2026-05-29 cs.RO

ElegantVLA: Learning When to Think for Efficient Vision-Language-Action Models

ElegantVLA:学习何时思考以实现高效的视觉-语言-动作模型

Ye Li, Huanan Liu, Kangye Ji, Yuan Meng, Jiajun Fan, Yuansong Wang, Shiyu Qin, Chenglei Wu, Shu-Tao Xia, Zhi Wang

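AI summary: Proposes ElegantVLA, a plug-in phase-adaptive inference framework that accelerates VLA models by dynamically scheduling computation across the vision encoder, LLM, and action head, with speedups of up to 2.55x on GR00T and 3.77x on CogACT.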

详情
AI中文摘要

视觉-语言-动作(VLA)模型是通用机器人控制的一种强大范式。然而,其高计算成本和有限的控制频率阻碍了实时机器人操作,尤其是在每个控制步骤都运行大型视觉-语言骨干网络和迭代动作头时。现有的VLA加速方法通常优化单个组件或依赖固定的加速规则,对不同控制步骤采用大致固定的计算量,忽略了序列化具身控制的非均匀推理需求。受人类运动控制的启发,其中认知和反馈资源集中在目标敏感阶段,我们认为VLA模型应该学习何时投入完整计算以及何时重用先前的计算。我们提出ElegantVLA,一种即插即用的相位自适应推理框架,通过模型内动态计算调度加速VLA模型。ElegantVLA引入一个轻量级调度器,观察时间表示相似性、机器人运动线索和任务进度,联合分配视觉编码器、大语言模型和动作头的计算。对于感知-语言推理,调度器根据视觉-语言表示稳定性选择五级视觉-大语言模型计算模式,从完全重计算到多步时间重用。对于动作生成,它选择三级去噪模式,在稳定运动期间重用中间去噪状态,同时在目标敏感阶段保留完整细化。通过协调这些决策,ElegantVLA为具有显式动作生成模块的现代VLA流水线提供了一个通用加速框架,无需修改或重新训练基础模型。在GR00T和CogACT上的实验分别实现了最高2.55倍和3.77倍的加速,在六个真实世界的GR00T任务中,ElegantVLA将计算量减少了2.18倍,同时将控制频率从13.8 Hz提高到26.3 Hz。

英文摘要

Vision-Language-Action (VLA) models are a powerful paradigm for generalist robotic control. However, their high computational cost and limited control frequency hinder real-time robotic manipulation, especially when large vision-language backbones and iterative action heads run at every control step. Existing VLA acceleration methods often optimize individual components or rely on fixed acceleration rules, treating different control steps with largely fixed computation and overlooking the non-uniform reasoning demands of sequential embodied control. Inspired by human motor control, where cognitive and feedback resources concentrate on goal-sensitive stages, we argue that VLA models should learn when to invest full computation and when to reuse prior computation. We propose ElegantVLA, a plug-in phase-adaptive inference framework that accelerates VLA models through intra-model dynamic compute scheduling. ElegantVLA introduces a lightweight scheduler that observes temporal representation similarity, robot-motion cues, and episode progress to jointly allocate computation across the vision encoder, LLM, and action head. For perception-language reasoning, the scheduler selects a five-level Vision-LLM compute mode, from full recomputation to multi-step temporal reuse, based on visual-language representation stability. For action generation, it selects a three-level denoising mode, reusing intermediate denoising states during stable motion while preserving full refinement for goal-sensitive stages. By coordinating these decisions, ElegantVLA offers a general acceleration framework for modern VLA pipelines with explicit action-generation modules, without modifying or retraining the base model. Experiments on GR00T and CogACT achieve up to 2.55x and 3.77x speedup, and on six real-world GR00T tasks ElegantVLA cuts computation by 2.18x while raising control frequency from 13.8 Hz to 26.3 Hz.

2605.29430 2026-05-29 cs.AI cs.CL

Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

迈向具有智能体纠正和语义评估的类人交互式语音识别

Zixuan Jiang, Yanqiao Zhu, Peng Wang, Qinyuan Chen, Xinjian Zhao, Xipeng Qiu, Wupeng Wang, Zhifu Gao, Xiangang Li, Kai Yu, Xie Chen

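AI summary: Proposes Agentic ASR, a closed-loop framework that reduces semantic errors through multi-turn interaction and semantic correction, and introduces the Sentence-level Semantic Error Rate (S^2ER) as an evaluation metric.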

详情
AI中文摘要

自动语音识别(ASR)是人机交互的核心组成部分,也是基于LLM的助手和智能体日益重要的前端。然而,当前大多数ASR系统仍遵循单遍范式,这与人类通信方式不一致——在人类通信中,误解通过迭代澄清和修正来解决。这种不匹配使得一旦发生意义关键的错误,很难纠正。同时,词错误率(WER)或字符错误率(CER)等词级指标无法充分反映此类问题。为解决这些局限,我们将交互式ASR形式化为多轮修正任务,并提出Agentic ASR,一种结合单遍ASR前端与语义纠正、意图路由和基于推理编辑的闭环框架。我们进一步引入句子级语义错误率(S^2ER),一种基于LLM的语义评估指标,以及交互式仿真系统,用于可扩展和可复现的基准测试。在多语言、命名实体密集和代码切换基准上的实验表明,迭代交互持续减少语义错误,在S^2ER上的提升远大于传统词级指标。人机对齐和消融研究进一步验证了语义判断器的可靠性和所提框架的鲁棒性。代码见:https://interactiveasr.github.io/,在线演示见:https://i-asr.sjtuxlance.com/

英文摘要

Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as WER or CER cannot adequately reflect such a problem. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the Sentence-level Semantic Error Rate ($S^2ER$), an LLM-based semantic evaluation metric, together with an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in $S^2ER$ than in conventional token-level metrics. Human-AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at https://interactiveasr.github.io/ and the live demo is available at https://i-asr.sjtuxlance.com/

2605.29429 2026-05-29 cs.CV

One Click per Cell Type Suffices: Training-free Group Interaction for Cell Instance Segmentation

每细胞类型一次点击足矣:无需训练的组交互用于细胞实例分割

Sanghyun Jo, Seo Jin Lee, Seohyung Hong, Yoorim Gang, Hyeongsub Kim, Hyungseok Seo, Kyungsu Kim

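AI summary: Proposes the Group Prompting paradigm, in which one click per cell type segments all instances of that type; building on the feature clustering of SAM's frozen encoder, the training-free Chain-of-Prompts framework recursively expands each click and retains high performance across several benchmarks.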

Comments Accepted to MICCAI 2026 (Early Accept)

详情
AI中文摘要

在特定细胞数据集上训练的细胞实例分割模型在分布外的细胞类型上性能严重下降,而交互式基础模型通过每个实例提示克服了这一点,但对于包含数百到数千个密集实例的组织病理学图像,其成本过高。我们引入了组提示,这是一种新范式,将交互式分割从每个实例 $O(N)$ 转变为每个类型 $O(T)$,其中每细胞类型一次点击即可分割该类型的所有实例。我们的关键观察是,Segment Anything Model (SAM) 的冻结图像编码器在给出任何提示之前,已经在其特征空间中对相同类型的细胞进行了聚类。利用这一特性,我们提出了Chain-of-Prompts (CoP),这是一个无需训练的框架,通过以下方式递归扩展单个用户点击:(1) 通过非参数门控多尺度编码器特征识别可靠的相同类型位置,以及 (2) 选择空间上最远的可靠点作为下一个提示以最大化覆盖范围。在三个细胞类型标注的基准上,每类型一次点击的CoP保留了超过90%的每个实例性能,并且无需任何额外训练就超越了全监督方法。在四个形态均匀的基准上,一次点击保留了超过99%。项目页面:https://shjo-april.github.io/Chain-of-Prompts/

英文摘要

Cell instance segmentation models trained on cell-specific datasets suffer severe performance drops on out-of-distribution cell types, while interactive foundation models overcome this through per-instance prompting at a cost that is prohibitively expensive for histopathology images containing hundreds to thousands of densely packed instances. We introduce Group Prompting, a new paradigm that shifts interactive segmentation from per-instance $O(N)$ to per-type $O(T)$, where a single click per cell type suffices to segment all instances of that type. Our key observation is that the frozen image encoder of the Segment Anything Model (SAM) already clusters same-type cells in its feature space before any prompt is given. Exploiting this property, we propose Chain-of-Prompts (CoP), a training-free framework that recursively expands a single user click by (1) identifying reliable same-type locations through non-parametric gating of multi-scale encoder features, and (2) selecting the most spatially distant reliable point as the next prompt to maximize coverage. On three cell-type-annotated benchmarks, CoP with one click per type retains over 90% of per-instance performance and surpasses fully-supervised methods without any additional training. On four morphologically homogeneous benchmarks, a single click retains over 99%. Project Page: https://shjo-april.github.io/Chain-of-Prompts/
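Step (2) above, choosing the most spatially distant reliable point as the next prompt, resembles farthest-point selection. The following is a minimal sketch, assuming the reliable same-type locations from step (1) are already available as 2D pixel coordinates. The function name and coordinates are hypothetical, and the feature-gating step is not modeled.

```python
import math


def farthest_reliable_point(prompts, reliable_points):
    """Return the reliable point whose nearest existing prompt is farthest away."""
    def distance_to_nearest_prompt(p):
        return min(math.dist(p, q) for q in prompts)
    return max(reliable_points, key=distance_to_nearest_prompt)


# One user click, then two rounds of expansion over hypothetical candidates.
prompts = [(10, 10)]
first = farthest_reliable_point(prompts, [(12, 11), (40, 35), (90, 80)])
prompts.append(first)
second = farthest_reliable_point(prompts, [(12, 11), (40, 35), (88, 79)])
print(first, second)
```

Scoring each candidate by its distance to the nearest existing prompt steers every new prompt toward a region that no earlier prompt covers, which is the coverage argument the abstract makes.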

2605.29427 2026-05-29 cs.CL

FinGuard: Detecting Financial Regulatory Non-Compliance in LLM Interactions

FinGuard:检测LLM交互中的金融监管违规

Huaixia Dou, Jie Zhu, Minghao Wu, Shuo Jiang, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang

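AI summary To detect regulatory non-compliance in LLM interactions in the financial domain, the authors propose an automated pipeline driven by regulatory documents, build FinGuard-Bench, the first benchmark for financial compliance detection, and train the FinGuard model, which substantially outperforms existing methods on the benchmark.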

Details
AI-generated Chinese abstract

随着大型语言模型(LLM)在金融服务中的部署日益增多,一次不合规的交互就可能使机构面临监管处罚并直接损害消费者利益。现有的防护模型围绕通用危害分类构建,忽略了基于特定金融法规的违规行为。我们通过一个直接操作监管文档的监管驱动管道来弥补这一空白,该管道归纳出金融合规风险分类,并在没有任何预定义违规类别的情况下合成基于监管的训练数据。将该管道应用于中国金融法规,我们发布了FinGuard-Bench,据我们所知,这是首个金融监管合规检测基准,在查询和回复层面均带有专家标注的标签。我们进一步训练了FinGuard,这是一个基于Qwen3-8B构建的金融合规检测模型,通过监督微调和自我对弈强化学习在基于监管的数据上进行训练。在FinGuard-Bench上,FinGuard显著优于所有基线,包括专用防护模型和更大的通用LLM,如Qwen3.5-397B-A17B和GPT-5.1。此外,FinGuard还保留了通用安全能力,并能仅使用政策文档适应未见过的机构特定政策。我们将在GitHub上公开发布本工作中使用的代码、提示和资源。

English abstract

As large language models (LLMs) are increasingly deployed in financial services, a single non-compliant interaction can expose institutions to regulatory penalties and direct consumer harm. Existing guard models are built around general harm taxonomies and overlook violations grounded in specific financial regulations. We address this gap with a regulation-driven pipeline that operates directly on regulatory documents, inducing a financial compliance risk taxonomy and synthesizing grounded training data without any predefined violation categories. Instantiating the pipeline on Chinese financial regulations, we release FinGuard-Bench, to our knowledge the first benchmark for financial regulatory compliance detection, with expert-annotated labels at both the query and response levels. We further train FinGuard, a financial compliance detection model built on Qwen3-8B and trained on the regulation-grounded data via supervised fine-tuning and self-play reinforcement learning. On FinGuard-Bench, FinGuard substantially outperforms all baselines, including dedicated guard models and much larger general-purpose LLMs such as Qwen3.5-397B-A17B and GPT-5.1. Furthermore, FinGuard also preserves general safety capabilities and adapts to unseen institution-specific policies using policy documents alone. We will publicly release the code, prompts, and resources used in this work on GitHub.

2605.29425 2026-05-29 cs.AI

ReasonLight: A Multimodal Foundation Model-Enhanced Reinforcement Learning Framework for Zero-Shot Traffic Signal Control

ReasonLight: 一种多模态基础模型增强的强化学习框架用于零样本交通信号控制

Aoyu Pang, Maonan Wang, Yuejiao Xie, Chung Shue Chen, Zhiwei Yang, Man-On Pun

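AI summary Proposes the ReasonLight framework, which augments reinforcement learning with a multimodal foundation model and uses roadside sensor and camera data to adapt zero-shot to rare traffic events, substantially reducing emergency vehicle waiting time.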

Details
AI-generated Chinese abstract

强化学习在交通信号控制中展现出潜力,但其对预定义状态的依赖限制了其对训练数据中未出现的可观测开放世界事件的响应能力。物联网赋能的路口通过路侧传感器和摄像头提供异构观测,为提升强化学习对此类事件的适应性创造了机会。为此,我们提出ReasonLight,一种多模态基础模型增强的强化学习框架,用于零样本交通信号控制。ReasonLight整合三类信息:结构化交通测量、多视角摄像头观测以及预训练强化学习控制器生成的候选相位决策。给定强化学习提议的相位,ReasonLight从多视角图像中提取视觉语义,并将其与紧凑的传感器导出的场景描述对齐。这种对齐使得语义引导的细化模块能够根据交通规则和事件语义保留或调整提议的动作。为确保操作可靠性,细化后的动作受可用相位集合约束。任何无效决策被拒绝,系统回退至原始强化学习动作。我们在强化学习训练期间未见的两类罕见事件上评估ReasonLight:紧急车辆优先和临时交通管制。实验结果表明,ReasonLight无需重新训练即可实现零样本适应。与仅使用强化学习的主干相比,它将紧急车辆等待时间最多降低88.7%,同时保持相当的常规交通性能。

English abstract

Reinforcement learning (RL) has shown promise in traffic signal control (TSC). However, its reliance on predefined states limits responsiveness to observable open-world events that are absent from training data. IoT-enabled intersections provide heterogeneous observations from roadside sensors and cameras, creating opportunities to improve RL adaptability to such events. To this end, we propose ReasonLight, a multimodal foundation model-enhanced RL framework for zero-shot TSC. ReasonLight integrates three sources of information: structured traffic measurements, multi-view camera observations, and candidate phase decisions from a pre-trained RL controller. Given an RL-proposed phase, ReasonLight extracts visual semantics from multi-view images and aligns them with compact sensor-derived scene descriptions. This alignment enables a semantic-guided refinement module to either preserve or adjust the proposed action according to traffic rules and event semantics. To ensure operational reliability, refined actions are constrained by the set of available phases. Any invalid decision is rejected, and the system falls back to the original RL action. We evaluate ReasonLight on two types of rare events not seen during RL training: emergency vehicle priority and temporary traffic regulation. Experimental results show that ReasonLight achieves zero-shot adaptation without retraining. It reduces emergency vehicle waiting time by up to 88.7% compared with the RL-only backbone while preserving comparable routine traffic performance.
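The reliability rule described in the abstract (refined actions are constrained to the available phases, and invalid decisions fall back to the original RL action) can be sketched as a small guard function. This is an illustrative sketch only. The name `select_phase`, the integer phase encoding, and the use of `None` for a missing refinement are assumptions, not details from the paper.

```python
def select_phase(rl_phase, refined_phase, available_phases):
    """Return the phase to apply at the intersection.

    rl_phase:         phase proposed by the pre-trained RL controller
    refined_phase:    phase suggested by the refinement module, or None
                      (hypothetical encoding for "no usable suggestion")
    available_phases: collection of phases that are valid right now
    """
    # Accept the refined action only if it is one of the valid phases.
    if refined_phase is not None and refined_phase in available_phases:
        return refined_phase
    # Otherwise reject it and fall back to the original RL action.
    return rl_phase


if __name__ == "__main__":
    phases = {0, 1, 2, 3}
    print(select_phase(2, 1, phases))     # valid refinement is kept
    print(select_phase(2, 7, phases))     # invalid refinement is rejected
    print(select_phase(2, None, phases))  # no refinement, RL action stands
```

Under this reading, an invalid refinement is simply discarded in favor of the RL controller's own proposal, so the applied phase always comes from the controller or from the set of currently available phases.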