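arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.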

2605.28128 2026-05-28 cs.CL

Chinese Word Boundary Recovery through Character Alignment Projection

通过字符对齐投影恢复中文词边界

Lusha Wang, Yuchen Li, Su Yuan, Jungyeul Park

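AI summary: Proposes a two-step, alignment-projection method that recovers word boundaries from noisy sentences, and builds two evaluation benchmarks; experiments show the method effectively corrects over-segmentation errors.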

AI Chinese abstract

中文分词在非标准文本中尤其脆弱，语言学习者错误和其他字符层面的差异会破坏下游标注和评估所假设的词边界。本文将中文词边界恢复形式化为基于对齐的投影任务。给定一个带噪的源句子和一个更干净的目标对应句，我们首先在字符级别对齐两个字符串，然后将目标侧的词边界投影回源句。除了恢复方法本身，我们还引入了两个评估资源：基于MuCGEC的人工检查学习者中文基准，以及从中文宾州树库导出的受控合成基准。实验表明，直接分词仍然容易受到学习者输入中的复合碎片化影响，而所提出的两步投影方法通过使用校正后的目标恢复源侧词跨度，纠正了许多过度切分错误。结果表明，词边界恢复不同于普通分词，并且对齐投影为在带噪输入下稳定中文标注和评估提供了一种原则性机制。

English abstract

Chinese word segmentation is especially fragile in non-standard text, where language learner errors and other character-level divergences disrupt the word boundaries assumed by downstream annotation and evaluation. This paper formulates Chinese word boundary recovery as an alignment-based projection task. Given a noisy source sentence and a cleaner target counterpart, we first align the two strings at the character level and then project target-side word boundaries back onto the source. Beyond the recovery method itself, we introduce two evaluation resources: a manually checked learner Chinese benchmark based on MuCGEC and a controlled synthetic benchmark derived from the Chinese Penn Treebank. Experiments show that direct segmentation remains vulnerable to compound fragmentation in learner input, whereas the proposed two-step projection method corrects many over-segmentation errors by using the corrected target to recover source-side word spans. The results show that word boundary recovery is distinct from ordinary segmentation and that alignment projection provides a principled mechanism for stabilizing Chinese annotation and evaluation under noisy input.
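
The two steps described in this abstract (character-level alignment, then projection of target-side boundaries onto the source) can be illustrated with a minimal sketch. It is not the authors' implementation: Python's difflib.SequenceMatcher stands in for whatever aligner the paper uses, the function name project_boundaries is hypothetical, and two choices are assumptions made here for illustration: extra source characters are attached to the preceding word, and target boundaries that fall strictly inside a replaced or inserted span are dropped.

```python
from difflib import SequenceMatcher


def project_boundaries(source: str, target_words: list[str]) -> list[str]:
    """Split `source` using word boundaries projected from a segmented target."""
    target = "".join(target_words)

    # Offsets in the target string where a new word starts (interior only).
    target_bounds = set()
    offset = 0
    for word in target_words[:-1]:
        offset += len(word)
        target_bounds.add(offset)

    # Step 1: character-level alignment (difflib is a stand-in aligner).
    matcher = SequenceMatcher(None, source, target, autojunk=False)

    # Step 2: map each target boundary to a source offset.
    source_bounds = set()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A boundary at the start of an aligned block maps to the block's
        # source start. Pure deletions are skipped so that extra source
        # characters stay attached to the preceding word (an assumption).
        if tag != "delete" and j1 in target_bounds:
            source_bounds.add(i1)
        if tag == "equal":
            # Inside a run of identical characters, offsets map one-to-one.
            for k in range(j1 + 1, j2):
                if k in target_bounds:
                    source_bounds.add(i1 + (k - j1))

    # Cut the source string at the projected offsets.
    cuts = sorted(b for b in source_bounds if 0 < b < len(source))
    spans, start = [], 0
    for cut in cuts + [len(source)]:
        spans.append(source[start:cut])
        start = cut
    return spans


noisy = "我喜欢吃平果"                      # learner sentence with one wrong character
corrected = ["我", "喜欢", "吃", "苹果"]    # segmented, corrected target
print(project_boundaries(noisy, corrected))  # ['我', '喜欢', '吃', '平果']
```

In this toy case the erroneous span 平果 is kept as one source-side word because the boundary before 苹果 on the target side is projected back across the alignment, instead of segmenting the noisy string directly.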

2605.28127 2026-05-28 cs.LG

Adaptive Coarse-to-Fine Subgoal Refinement for Long-Horizon Offline Goal-Conditioned Reinforcement Learning

面向长视界离线目标条件强化学习的自适应由粗到细子目标细化

Kaiqiang Ke, Shenghong He, Chengdong Xu, Yuheng Luo, Xiangyuan Lan, Chao Yu

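AI summary: Proposes the CFHRL framework, which adaptively and recursively refines subgoals and stops refinement based on a learned reachability cost, addressing weak supervision and accumulated error in long-horizon offline goal-conditioned reinforcement learning.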

AI Chinese abstract

离线目标条件强化学习（GCRL）在长视界任务中具有挑战性，其中遥远的状态-目标对提供弱监督，且价值估计容易受到累积自举误差的影响。分层方法通过引入中间子目标来缓解这一困难，但固定的时间抽象或固定的层次深度可能与具有不同可达性视界的状态-目标对不匹配。我们提出由粗到细分层目标强化学习（CFHRL），一种完全离线的GCRL框架，在执行前自适应地细化遥远目标。从最终目标开始，CFHRL递归地提出中间目标，这些目标由重放支持的候选训练，并在当前目标被估计为可通过学习的可达性成本局部执行时停止细化。关键思想是，子目标不必是精确的中点或全局最优路径点；它只需要提供可靠的进展并减少剩余到达难度，从而能够在更短的视界上进行后续细化。一个风格化的分析进一步支持近似递归收缩的鲁棒性。在OGBench上的实验表明，在多个长视界任务上取得了显著收益，消融实验验证了所提出的细化和停止机制。

English abstract

Offline goal-conditioned reinforcement learning (GCRL) is challenging in long-horizon tasks, where distant state-goal pairs provide weak supervision and value estimates become vulnerable to accumulated bootstrapping errors. Hierarchical methods mitigate this difficulty by introducing intermediate subgoals, but fixed temporal abstractions or fixed hierarchy depths can be mismatched to state-goal pairs with different reachability horizons. We propose Coarse-to-Fine Hierarchical Goal Reinforcement Learning (CFHRL), a fully offline GCRL framework that adaptively refines distant goals before execution. Starting from the final goal, CFHRL recursively proposes intermediate targets, trained from replay-supported candidates, and stops refinement once the current target is estimated to be locally executable by a learned reachability cost. The key idea is that a subgoal need not be an exact midpoint or globally optimal waypoint; it only needs to provide reliable progress and reduce the remaining reaching difficulty, enabling subsequent refinement over shorter horizons. A stylized analysis further supports the robustness of approximate recursive contraction. Experiments on OGBench show substantial gains on several long-horizon tasks, with ablations validating the proposed refinement and stopping mechanisms.
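
The refine-then-stop loop described in this abstract can be sketched on a toy one-dimensional problem. Everything below is an illustrative assumption and not CFHRL itself: the function name refine_goal is hypothetical, the reachability cost is a hand-written distance where the paper uses a learned cost, and the candidate selection rule (pick the candidate that minimizes the larger of the two remaining segment costs) is a simple stand-in for the learned subgoal proposer.

```python
from typing import Callable, Sequence


def refine_goal(
    state: float,
    goal: float,
    candidates: Sequence[float],
    cost: Callable[[float, float], float],
    threshold: float,
    max_depth: int = 8,
) -> list[float]:
    """Return the chain of targets, from the final goal down to an executable one."""
    chain = [goal]
    target = goal
    for _ in range(max_depth):
        remaining = cost(state, target)
        # Stop once the current target is judged locally executable.
        if remaining <= threshold:
            break
        # Keep only candidates that make progress: easier to reach than the
        # current target, and closer to that target than the state is.
        useful = [
            c for c in candidates
            if cost(state, c) < remaining and cost(c, target) < remaining
        ]
        if not useful:
            break
        # Stand-in proposal rule (assumption): balance the two segments.
        target = min(useful, key=lambda c: max(cost(state, c), cost(c, target)))
        chain.append(target)
    return chain


def distance(a: float, b: float) -> float:
    """Toy reachability cost; the paper learns this quantity from data."""
    return abs(a - b)


replay_states = [1.0, 2.5, 5.0, 7.5]  # hypothetical replay-supported candidates
print(refine_goal(0.0, 10.0, replay_states, distance, threshold=1.5))
# [10.0, 5.0, 2.5, 1.0]
```

The chosen subgoals here happen to be midpoints, but the loop only requires that each one lowers the remaining cost, which matches the abstract's point that a subgoal need not be an exact midpoint.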

2605.28125 2026-05-28 cs.CV cs.GR

CLEAR-NeRF: Collinearity and Local-region Enhanced Accurate 3D Reconstruction in Unbounded Scenes

CLEAR-NeRF: 共线性和局部区域增强的无界场景精确三维重建

Vladislav Polianskii, Elijs Dima, Isabel Salmerón Marazuela, Gergő László Nagy, Sigurdur Sverrisson, Volodya Grancharov

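AI summary: Proposes CLEAR-NeRF, which uses automated local region localization, collinearity-enforcing ray sampling, depth-localized neighborhood point extraction, and geometry-relevant color aggregation to achieve high-fidelity, metrically accurate 3D reconstruction in unbounded, complex scenes.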

AI Chinese abstract

许多真实世界的三维重建应用要求在无界、复杂场景中实现照片级真实感和度量精度，这些场景具有挑战性的光照和不完美的捕获，而当前的神经辐射场（NeRF）流程仅部分满足这些需求。本研究将基于NeRF的三维重建适应于多兴趣区域的无界场景，以提高对光照和姿态变化的鲁棒性，同时确保适用于数字孪生应用的度量精度。我们的方法引入了（i）自动局部区域定位/检测和重建，以无缝优先考虑感兴趣区域而不增加子模块；（ii）共线性强制射线采样，以学习平滑的平面和曲面；（iii）深度局部邻域点提取，以抑制表面伪影；以及（iv）几何相关颜色聚合，以减轻光照和姿态引起的变化。结果表明，所提出的流程在基线NeRF模型以及成熟的结构从运动（SfM）-多视图立体（MVS）解决方案上均表现出优越的性能。

English abstract

Many real-world 3D reconstruction applications demand photorealism and metric accuracy across unbounded, complex scenes with challenging lighting and imperfect captures, demands that current Neural Radiance Field (NeRF) pipelines only partly satisfy. This study adapts NeRF-based 3D reconstruction to unbounded scenes with multiple regions of interest to improve robustness to lighting and pose variation while enforcing metric accuracy suitable for digital-twin applications. Our approach introduces (i) automated local region localization/detection and reconstruction to seamlessly prioritize areas of interest without proliferating submodules, (ii) collinearity-enforcing ray sampling to learn smooth planar and curved surfaces, (iii) depth-localized neighborhood point extraction to suppress surface artifacts, and (iv) geometry-relevant color aggregation to mitigate lighting- and pose-caused variations. Results indicate superior performance of the proposed pipeline over the baseline NeRF models and established Structure from Motion (SfM) - Multi-View Stereo (MVS) solutions.

2605.28124 2026-05-28 cs.AI

Gradient Step Plug-and-Play Model for Dental Cone-Beam CT Reconstruction

梯度步进即插即用模型用于牙科锥束CT重建

Idris Tatachak, Luis Kabongo, Nicolas Papadakis, Xavier Ripoche, Simon Rit

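AI summary: Proposes a plug-and-play algorithm based on a gradient-step denoiser, with the prior trained on simulated fan-beam acquisitions with added photon noise, that effectively reduces photon noise in dental cone-beam CT reconstruction.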

2605.28123 2026-05-28 cs.CL

Risk-aware Selective Prompting for Hallucination Mitigation in Large Vision-Language Models

风险感知的选择性提示用于大型视觉-语言模型中的幻觉缓解

Yuang Huang, Yafeng Zhang, Yu Zilan

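AI summary: Systematically studies the risks of prompt-based verification in large vision-language models, finds that its effect depends on input difficulty, and proposes RSP, a selective prompting method driven by pre-generation uncertainty signals, to balance performance.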

Comments: 7 pages, 1 figure, submitted to ACL ARR 2026 May (EMNLP)

AI Chinese abstract

基于提示的验证被广泛用于缓解大型视觉-语言模型（LVLMs）中的幻觉，但其何时有效仍不清楚。我们系统研究了两种代表性LVLM架构和幻觉基准上的验证提示，发现它是一种有风险的干预：其纠正随输入难度增加，而新引入的错误在不同难度级别持续存在。因此，始终开启的提示在困难输入上有帮助，但在简单输入上益处甚微甚至有害。我们的分析进一步表明，这种行为与保守的输出偏移相关。验证提示将注意力从视觉令牌重新分配到指令令牌，并诱导出中性提示控制中不存在的中层熵模式，这表明是指令条件化的注意力重新分配而非统一的视觉基础改善。受这种输入依赖风险的启发，我们提出了风险感知的选择性提示（RSP），一种无需训练的方法，利用预生成不确定性信号选择性地触发验证。RSP减轻了始终开启提示的性能下降，同时保持基线性能，并揭示了有效的选择信号因架构而异。

English abstract

Prompt-based verification is widely used to mitigate hallucinations in large vision-language models (LVLMs), yet when it helps remains poorly understood. We systematically study verification prompting across two representative LVLM architectures and hallucination benchmarks, and find that it is a risk-bearing intervention: its corrections increase with input difficulty, while newly introduced errors persist across difficulty levels. As a result, always-on prompting helps on hard inputs but offers little benefit -- and can harm -- easier ones. Our analysis further shows that this behavior is associated with a conservative output shift. Verification prompts redistribute attention from visual tokens toward instruction tokens and induce a distinct middle-layer entropy pattern absent in a neutral-prompt control, suggesting instruction-conditioned attention redistribution rather than uniformly improved visual grounding. Motivated by this input-dependent risk, we propose Risk-aware Selective Prompting (RSP), a training-free approach that uses pre-generation uncertainty signals to trigger verification selectively. RSP mitigates the degradation of always-on prompting while preserving baseline performance, and reveals that effective selection signals vary across architectures.
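
The idea of triggering verification only when a pre-generation uncertainty signal is high can be sketched as follows. This is an assumption-laden illustration, not RSP as published: the abstract does not say which signal is used (and notes that effective signals vary across architectures), so the entropy of a first-token probability distribution is a hypothetical choice here, as are the function names, the threshold, and the wording of the verification suffix.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_prompt(
    question: str,
    first_token_probs: Sequence[float],
    threshold: float,
    verify_suffix: str = "Check the image carefully before answering.",
) -> str:
    """Append a verification instruction only when uncertainty is high."""
    # Hypothetical uncertainty signal: entropy of the model's first-token
    # distribution, computed before any answer is generated.
    if entropy(first_token_probs) > threshold:
        return f"{question} {verify_suffix}"
    # Confident input: keep the plain prompt to avoid introducing new errors.
    return question


question = "Is there a dog in the image?"
confident = [0.98, 0.01, 0.01]   # made-up distribution for an easy input
uncertain = [0.40, 0.30, 0.30]   # made-up distribution for a hard input

print(select_prompt(question, confident, threshold=0.5))
print(select_prompt(question, uncertain, threshold=0.5))
```

The design mirrors the trade-off in the abstract: always-on verification helps on hard inputs but can harm easy ones, so the gate leaves low-uncertainty inputs untouched.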

2605.28120 2026-05-28 cs.CL cs.AI cs.MA

LegalGraphRAG: Multi-Agent Graph Retrieval-Augmented Generation for Reliable Legal Reasoning

LegalGraphRAG：面向可靠法律推理的多智能体图检索增强生成

Zerui Chen, Qinggang Zhang, Zhishang Xiang, Zhimin Wei, Linfeng Gao, Xiao Huang, Zhihong Zhang, Jinsong Su

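AI summary: Proposes the LegalGraphRAG framework, which uses a hierarchical legal graph and a multi-agent system (Researcher, Auditor, Adjudicator) for reliable legal reasoning, outperforming existing GraphRAG baselines in accuracy and trustworthiness.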

Comments: 30 pages, 18 figures, ACL 2026 Main Conference. Project page: https://github.com/XMUDeepLIT/LegalGraphRAG

AI Chinese abstract

基于图的检索增强生成（GraphRAG）通过将知识结构化为关系图，推进了平面文档检索，实现了更连贯和有效的推理。然而，将其应用于法律推理等特定领域面临关键挑战。(i) 法律语料库是异构的，包含来自案例、法条和解释的多粒度知识。平面知识图无法充分区分事实细节、适用规则和抽象原则，限制了准确检索。(ii) 可靠的法律判决需要透明、基于证据的推理。传统的RAG直接将检索到的上下文传递给LLM而不进行验证，导致推理不透明且易出错。为此，我们提出了LegalGraphRAG，一个专为可靠法律推理设计的框架。我们的方法引入了两个核心组件：一个分层法律图，用于分层组织法律来源，以便在适当的抽象级别进行检索；以及一个用于可靠法律推理的多智能体系统，其中研究员检索候选证据，审计员严格验证其相对于源文档的有效性，裁决员综合已验证的证据集作出最终判决。大量实验表明，LegalGraphRAG达到了最先进的性能，在准确和可信的法律分析方面优于现有的GraphRAG基线。我们的代码、数据集和实现细节可在https://github.com/XMUDeepLIT/LegalGraphRAG获取。

English abstract

Graph-based Retrieval-Augmented Generation (GraphRAG) advances beyond flat document retrieval by structuring knowledge as relational graphs, enabling more coherent and effective reasoning. However, applying it to specific domains like legal reasoning faces critical challenges. (i) Legal corpora are heterogeneous, containing multi-granular knowledge from cases, articles and interpretations. A flat knowledge graph cannot adequately differentiate between factual details, applied rules, and abstract principles, limiting accurate retrieval. (ii) Reliable legal judgment demands transparent, evidence-based reasoning. Traditional RAG passes retrieved context directly to an LLM without verification, resulting in opaque, error-prone reasoning. To this end, we propose LegalGraphRAG, a framework designed for reliable legal reasoning. Our approach introduces two core components: a hierarchical legal graph that organizes legal sources by level to enable retrieval at appropriate abstraction levels, and a multi-agent system for reliable legal reasoning, where a Researcher retrieves candidate evidence, an Auditor rigorously verifies its validity against source documents, and an Adjudicator synthesizes the set of verified evidence to render a final judgment. Extensive experiments show that LegalGraphRAG achieves state-of-the-art performance, outperforming existing GraphRAG baselines in accurate and trustworthy legal analysis. Our code, datasets and implementation details are available at https://github.com/XMUDeepLIT/LegalGraphRAG.

2605.28115 2026-05-28 cs.AI

CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models

CIVIC: 面向高效视觉语言模型的端到端序列紧凑性

Fengze Yang, Bo Yu, Xuewen Luo, Cathy Liu, Chenxi Liu

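AI summary: Proposes the CIVIC framework for path-consistent compact visual inference, which keeps a compact sequence representation across the vision encoder, projection layer, LLM prefill, and KV cache, reducing non-contiguous memory access and localized unmerging overhead; on the Qwen3-VL architecture it shrinks KV-cache memory to about one-third and lowers end-to-end inference latency, while text-aligned KL distillation and an adaptive spatial retention floor preserve accuracy.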

Comments: 11 pages, 6 figures, 2 tables, conference

AI Chinese abstract

视觉语言模型（VLM）由于高分辨率视觉标记面临严重的内存和延迟瓶颈。虽然当前的标记缩减方法理论上节省了FLOPs，但事后剪枝引入了结构开销，未能产生成比例的墙上时钟加速。然而，强制实施连续的紧凑路径存在几何方向迷失和细粒度定位丢失的风险。为了克服这些障碍，本文引入了CIVIC，一种路径一致的紧凑视觉推理框架。通过在视觉编码器、投影层、LLM预填充和KV缓存中无缝地维护紧凑序列表示，CIVIC避免了非连续内存访问和局部合并开销。在Qwen3-VL架构上评估，CIVIC成功地将序列缩减转化为真正的物理硬件效率，将KV缓存内存缩小到基线的约三分之一，并减少了端到端推理延迟。通过文本对齐的KL蒸馏和自适应空间保留下限，CIVIC在严格的多模态推理和视觉定位基准测试中实现了这些效率里程碑，同时不降低准确性。

English abstract

Vision-Language Models (VLMs) face severe memory and latency bottlenecks due to high-resolution visual tokens. While current token reduction methods theoretically save FLOPs, post-hoc pruning introduces structural overhead, failing to yield proportional wall-clock acceleration. However, enforcing a contiguous compact pathway risks geometric disorientation and loss of fine-grained localization. To overcome these barriers, this paper introduces CIVIC, a path-consistent compact visual inference framework. By maintaining compact sequence representations seamlessly across the vision encoder, projection layer, LLM prefill, and KV-cache, CIVIC avoids non-contiguous memory access and localized unmerging overheads. Evaluated on the Qwen3-VL architecture, CIVIC successfully translates sequence reductions into genuine physical hardware efficiency, shrinking KV-cache memory to approximately one-third of the baseline and reducing end-to-end inference latency. Enabled by text-aligned KL distillation and an adaptive spatial retention floor, CIVIC achieves these efficiency milestones without degrading accuracy across rigorous multimodal reasoning and visual grounding benchmarks.

2605.28114 2026-05-28 cs.AI

Human-like in-group bias in instruction-tuned language model agents

指令调优语言模型代理中类似人类的内群体偏见

Messi H. J. Lee

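AI summary: Using multi-agent simulation, finds that instruction-tuned language models show in-group trust bias, action homophily, and network assortativity when group labels are visible; this discrimination is invisible to standard audits yet accumulates into structural inequality.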

Comments: 12 pages, 6 figures

AI Chinese abstract

随着自主AI代理被部署在持久、交互的网络中——协调任务、路由资源和积累声誉历史——出现的社会动态将决定谁获得机会，谁没有，其规模是任何人类机构都无法监督的。我们进行了一项受控的多代理模拟，其中指令调优语言模型代理在三种条件下（操纵群体标签显著性和资源稀缺性）进行了500轮交互，涉及六个模型系列，每个系列20个种子。当群体标签可见时，我们观察到内群体信任偏见、行动同质性和网络同配性——当标签隐藏时这些现象全部消失——这种模式在结构上与人类社会心理学中的显著性依赖性一致。这种歧视对标准的行动日志审计是不可见的：偏见完全通过谁接收每个行动来运作，而不是通过选择什么行动，行动类型分布显示不同条件下的负面行动没有增加。所有六个模型的每轮内群体与外群体差异为5到16个百分点，具有统计显著性（Wilcoxon符号秩检验，所有Benjamini-Hochberg校正p < 0.001），表明群体条件性目标选择是指令调优语言模型在不同架构和训练范式下的稳健特性。通过500轮的互惠累积，这些差异累积成内群体信任偏见，范围为+0.014到+0.100（d = 0.84-4.52），说明每轮交互中适度的目标选择如何在持久网络中传播为结构性不平等。

English abstract

As autonomous AI agents are deployed in persistent, interacting networks -- coordinating tasks, routing resources, and accumulating reputational histories -- the social dynamics that emerge will determine who receives opportunity and who does not, at scales no human institution can supervise. We ran a controlled multi-agent simulation in which instruction-tuned language model agents interacted across 500 turns under three conditions manipulating group label salience and resource scarcity, across six model families with 20 seeds each. When group labels were visible, we observed in-group trust bias, action homophily, and network assortativity -- all absent when labels were hidden -- a pattern structurally consistent with salience-dependence in human social psychology. This discrimination was invisible to standard action-log audits: bias operated entirely through who received each action, not what actions were chosen, with action-type distributions showing no increase in negative actions across conditions. Per-turn in-group versus out-group differentials of 5 to 16 percentage points were statistically significant for all six models (Wilcoxon signed-rank, all Benjamini-Hochberg-corrected p < 0.001), establishing group-contingent targeting as a robust property of instruction-tuned language models across architectures and training regimes. Compounded through 500 turns of reciprocation, these differentials accumulated into in-group trust biases of +0.014 to +0.100 (d = 0.84-4.52) -- illustrating how modest per-interaction targeting propagates into structural inequality in persistent networks.
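
The abstract reports Benjamini-Hochberg-corrected p-values across the six models. As a small illustration of that correction step only, the sketch below computes BH-adjusted p-values in plain Python. The input p-values are made-up placeholders, not values from the paper, and the function name is hypothetical; the Wilcoxon signed-rank tests that would produce the raw p-values are not reproduced here.

```python
from typing import Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # placeholder raw p-values, one per test
print([round(p, 4) for p in benjamini_hochberg(raw)])  # [0.02, 0.04, 0.04, 0.02]
```

A test is then called significant at level alpha when its adjusted p-value is at or below alpha, which controls the false discovery rate across the family of tests.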

2605.28111 2026-05-28 cs.LG

Chreode: A Cell World Model for One-Step Temporal Dynamics and Perturbation Prediction

Chreode: 用于一步时间动态和扰动预测的细胞世界模型

Mufan Qiu, Genhui Zheng, Yinuo Xu, Ruichen Zhang, Ying Ding, Qi Long, Tianlong Chen

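AI summary: Proposes Chreode, a one-step cell world model based on a structured residual transition operator, which unifies developmental-trajectory and perturbation prediction through pretraining and fine-tuning and improves performance on several benchmarks.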

Comments: 25 pages, 3 figures, 14 tables. Submitted to NeurIPS 2026

AI Chinese abstract

预测细胞在发育信号或遗传扰动下如何改变其转录状态是计算机生物学和AI虚拟细胞计划的核心。现有方法要么拟合忽略时间的静态对照到处理映射，要么在每个数据集上独立求解多步ODE/薛定谔桥问题。我们引入了Chreode，一种单步细胞世界模型，通过结构化残差转移算子预测动作条件下的细胞状态转换。它将分布演化从推理时间转移到训练时间，实现单次生成，同时保留了受Waddington启发的分解：下坡景观流、切向旋转动力学和随机扩散。该模型使用共享的scVI编码器和基于DiT的动态骨干在包含7个数据集的240万细胞小鼠胚胎图谱上进行预训练。作为微调初始化，Chreode在Weinreb造血和Veres胰岛分化上改善了每个目标的Sinkhorn距离，优于匹配的scratch模型、PI-SDE和PRESCIENT。作为GEARS的可转移基因状态嵌入，预训练的动态表示将Norman Perturb-seq上的共享词汇DE20均方误差从0.2121降低到0.1858，相对改进12.4%，且未改变GEARS训练过程。我们将这种对扰动预测的可转移性解释为预训练的发育轨迹动态编码了可转移至CRISPR诱导状态变化的分化原语，因为两者都涉及共享潜在几何中的细胞状态转换。此外，预训练骨干在Weinreb上产生了与强动态OT基线竞争的无监督克隆命运分数。

English abstract

Predicting how a cell will change its transcriptional state under a developmental signal or a genetic perturbation is the computational core of in-silico biology and the AI Virtual Cell program. Existing approaches either fit static control-to-treated maps that discard time, or solve multi-step ODE / Schrödinger-bridge problems on each dataset independently. We introduce Chreode, a one-step cell world model that predicts action-conditioned cell-state transitions through a structured residual transition operator. It shifts distributional evolution from inference time to training time, enabling single-pass generation while preserving a Waddington-inspired decomposition into downhill landscape flow, rotational in-tangent dynamics, and stochastic spread. The model is pretrained with a shared scVI encoder and a DiT-based dynamics backbone on a 2.4M-cell mouse embryonic atlas spanning 7 datasets. As a fine-tuning initialization, Chreode improves per-target Sinkhorn distance on Weinreb hematopoiesis and Veres islet differentiation over matched scratch models, PI-SDE, and PRESCIENT. As a transferable gene-state embedding for GEARS, the pretrained dynamics representation reduces shared-vocabulary DE20 mean squared error on Norman Perturb-seq from 0.2121 to 0.1858, a 12.4% relative improvement, without changing the GEARS training procedure. We interpret this transfer to perturbation prediction as evidence that pretrained developmental-trajectory dynamics encode differentiation primitives transferable to CRISPR-induced state shifts, since both involve cell-state transitions in a shared latent geometry. The pretrained backbone additionally produces zero-shot clonal fate scores on Weinreb that are competitive with strong dynamic-OT baselines.
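
As a quick check of the reported figure, the 12.4% relative improvement follows directly from the two DE20 mean squared error values quoted in the abstract:

```latex
\[
\frac{0.2121 - 0.1858}{0.2121} = \frac{0.0263}{0.2121} \approx 0.124 = 12.4\%
\]
```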

2605.28110 2026-05-28 cs.RO

STR Robot: Design of an Autonomous Mobile Robot from Simulation to Reality

STR机器人：从仿真到现实的自主移动机器人设计

Vinh Nguyen, Gia-Uy Le, Tien-Dat Nguyen, Tri-Tin Nguyen, Vinh-Hao Nguyen

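AI summary: Presents a simulation-to-real implementation of an autonomous mobile robot built on an existing mechanical platform, focusing on the onboard control, self-localization, and autonomous navigation systems, and validates its feasibility in simulation and experiments.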

AI Chinese abstract

随着仿真工具的快速发展，自主机器人系统在实际部署前的开发和验证变得更加高效。本文介绍了一种基于现有机械平台的自主移动机器人的仿真到现实实现。我们的工作不关注机械设计，而是集中于机载控制、自定位和自主导航系统的开发。所提出的机器人配备了机载感知和计算能力，以估计其姿态并在环境中自主导航。整个框架首先在仿真中开发和测试，然后部署在真实机器人上进行实验评估。结果证明了所提出方法的可行性，并表明仿真为开发可靠的自主移动机器人系统提供了有效基础。源代码将在 https://ntdathp.github.io/outdoor-robot-web 发布。

English abstract

With the rapid development of simulation tools, the development and validation of autonomous robotic systems have become more efficient before real-world deployment. This paper presents a simulation-to-real implementation of an autonomous mobile robot based on an existing mechanical platform. Instead of focusing on mechanical design, our work concentrates on the development of the onboard control, self-localization, and autonomous navigation system. The proposed robot is equipped with onboard sensing and computation to estimate its pose and navigate autonomously in the environment. The overall framework is first developed and tested in simulation, and then deployed on the real robot for experimental evaluation. The results demonstrate the feasibility of the proposed approach and show that simulation provides an effective foundation for developing reliable autonomous mobile robot systems. The source code will be released at https://ntdathp.github.io/outdoor-robot-web.

2605.28109 2026-05-28 cs.LG

Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization

平衡万岁：信息瓶颈驱动的基于树的策略优化

Hao Jiang, Shurui Li, Tianpeng Bu, Bowen Xu, Xin Liu, Qihua Chen, Hongtao Duan, Lulu Hu, Bin Yang, Minying Zhang

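AI summary: To address the exploration-exploitation imbalance in online reinforcement learning, proposes the IB-Score metric and the IB-TPO framework, both grounded in Information Bottleneck theory, and uses a tree sampling strategy to improve optimization stability and performance.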

Comments: Accepted to ICML 2026 main conference

AI Chinese abstract

最近，大型语言模型（LLMs）的在线强化学习（RL）在复杂推理任务中展现出有前景的性能。然而，它们通常表现出不平衡的探索-利用权衡，导致优化不稳定和次优性能。我们引入了IB-Score，这是一种基于信息瓶颈理论的新颖度量，通过量化步骤级推理多样性与正确答案共享的互信息之间的权衡，来评估策略的探索-利用平衡。基于IB-Score的分析表明，带有常见正则化器的流行在线RL方法（例如GRPO）在训练过程中无法持续保持平衡，导致结果次优。为了解决这个问题，我们提出了信息瓶颈驱动的基于树的策略优化（IB-TPO），这是一个原则性框架，将IB-Score作为细粒度优化目标，并利用新颖的IB引导树采样策略，该策略不仅通过在同一token预算下生成50%更多的轨迹来提高在线采样效率，还重用树结构进行有效的IB-Score蒙特卡洛估计。在标准基准上的大量实验表明，我们的方法比GRPO基线显著提高了2.9%至3.6%，并且也优于其他最先进的在线RL方法。我们的代码可在https://github.com/alibaba/EfficientRL获取。

English abstract

Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance in complex reasoning tasks. However, they often exhibit an imbalanced exploration-exploitation trade-off, resulting in unstable optimization and sub-optimal performance. We introduce IB-Score, a novel metric grounded in Information Bottleneck theory that evaluates a policy's exploration-exploitation balance by quantifying the trade-off between step-level reasoning diversity and mutual information shared with the correct answer. Analysis based on IB-Score shows that popular online RL approaches (e.g., GRPO) with common regularizers fail to consistently maintain balance during training, leading to suboptimal results. To address this, we propose Information Bottleneck-driven Tree-based Policy Optimization (IB-TPO), a principled framework that formulates IB-Score as a fine-grained optimization objective and utilizes a novel IB-guided tree sampling strategy that not only improves the efficiency of online sampling with 50% more trajectories under the same token budget, but also reuses the tree structure for effective IB-Score Monte Carlo estimation. Extensive experiments across standard benchmarks show that our method significantly outperforms the GRPO baseline by 2.9% to 3.6% and also outperforms other state-of-the-art online RL approaches. Our code is available at https://github.com/alibaba/EfficientRL.

2605.28104 2026-05-28 cs.AI

Defending LLM-based Multi-Agent Systems Against Cooperative Attacks with Sentence-Level Rectification

防御基于LLM的多智能体系统免受合作攻击：句子级纠正方法

Yaoyang Luo, Zhi Zheng, Ziwei Zhao, Tong Xu, Zhao Jielun, Wenjun Xue, Yong Chen, Enhong Chen

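AI summary: Proposes an adaptive cooperative attack framework and introduces the Sentence-Level Trustworthiness Analysis and Rectification (STAR) defense framework, which identifies and rectifies misleading information in multi-agent communication and significantly improves task success rate.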

AI Chinese abstract

近年来，基于大型语言模型的多智能体系统（MAS）发展迅速，其在协作决策和复杂问题解决方面表现出色。然而，MAS中的恶意智能体可能注入错误信息以误导其他智能体并破坏系统性能，这催生了一个新的研究方向，即关注MAS中的攻击机制和防御策略。以往的研究大多假设恶意智能体独立行动，并研究相应的防御策略。然而，我们认为恶意智能体可能表现出协作行为，通过内部信息交换实现更有效的攻击。在本文中，我们提出了一种自适应合作攻击框架，其中恶意智能体通过多轮交互自主协调并动态调整其攻击策略。此外，我们引入了句子级可信度分析与纠正（STAR），这是一种在智能体通信中识别和纠正句子级误导信息的防御框架。我们的实验表明，合作攻击导致任务成功率的下降幅度显著大于独立攻击，相对下降5.34%。同时，STAR有效缓解了合作和独立威胁，平均提高任务成功率36.76%。代码可在https://github.com/smoooom/STAR获取。

English abstract

Recent years have witnessed the rapid development of Large Language Model-based Multi-Agent Systems (MAS), which excel at collaborative decision-making and complex problem-solving. However, malicious agents in MAS may inject misinformation to mislead other agents and disrupt system performance, giving rise to a new research direction that focuses on attack mechanisms and defense strategies in MAS. Prior studies largely assume malicious agents act independently and investigate the corresponding defense strategies. However, we argue that malicious agents may exhibit collaborative behaviors, enabling more effective attacks through internal information exchange. In this paper, we propose an adaptive cooperative attack framework, where malicious agents autonomously coordinate and dynamically adjust their attack strategies through multi-round interactions. Furthermore, we introduce Sentence-Level Trustworthiness Analysis and Rectification (STAR), a defense framework that identifies and rectifies misleading information at the sentence level within agent communications. Our experiments show that cooperative attacks lead to a significantly larger degradation in task success rate than independent attacks, resulting in a relative drop of 5.34%. Meanwhile, STAR effectively mitigates both cooperative and independent threats and improves task success rate by an average of 36.76%. The code is available at https://github.com/smoooom/STAR.

2605.28103 2026-05-28 cs.LG cs.GT

Benchmarking Inductive Biases for Multivariate Time-Series Anomaly Detection with a Robust Multi-View Channel-Graph Detector

多变量时间序列异常检测的归纳偏差基准测试与鲁棒多视图通道图检测器

Junhao Wei, Yanxiao Li, Bidong Chen, Yifu Zhao, Haochen Li, Dexing Yao, Baili Lu, Xudong Ye, Jietian Feng, Sio-Kei Im, Yapeng Wang, Xu Yang

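AI summary: Evaluates ten representative detectors under a unified experimental framework and proposes an adaptive detector combining a NOTEARS-constrained directed channel graph with optional patch-attention and temporal-association views, achieving the best macro-average VUS-ROC across five datasets.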

AI Chinese abstract

我们提出了一个关于多变量时间序列（MTS）异常检测的统一实验、分析和基准研究。十个家族代表性检测器——涵盖统计、重构、关联、频率和通用Transformer家族——在五个数据集（SMD、MSL、SMAP、PSM和MSDS）上，从有效性、效率、鲁棒性和跨数据集泛化性方面进行评估。所有方法共享相同的窗口化、评分、硬件和度量协议。有效性、消融和鲁棒性使用三个随机种子；跨数据集迁移使用种子0，因为每个额外种子需要250次源-目标评估。该基准测试得出三个与方法无关的发现：没有单一偏好的基线占主导地位；绝对扰动VUS-ROC比保留比率更具信息量；MSDS表现为事件密集的部署工作负载，而非稀疏点异常基准。在此协议下，我们还引入了\ours{}，一种自适应检测器家族，结合了NOTEARS约束的有向通道图视图以及可选的补丁注意力和时间关联视图。\ours{}取得了最佳宏平均VUS-ROC（0.675，比第二好的LSTM-AE高5.1个百分点），总体排名第一，并在所有五个数据集上进入前三。它在MSL和MSDS上的胜利幅度较小，但其平均和鲁棒性增益更大：在每种方法相同的三种子鲁棒性协议下，它在噪声、通道丢失和时间偏移扰动下获得了最强的绝对VUS-ROC。我们发布了MSDS预处理协议、配置、脚本和种子级度量转储。

English abstract

We present a unified experiment, analysis, and benchmark study of multivariate time-series (MTS) anomaly detection. Ten family-representative detectors -- spanning statistical, reconstruction, association, frequency, and generic-transformer families -- are evaluated on five datasets (SMD, MSL, SMAP, PSM, and MSDS) under effectiveness, efficiency, robustness, and cross-dataset generalisation. All methods share the same windowing, scoring, hardware, and metric protocols. Effectiveness, ablation, and robustness use three random seeds; cross-dataset transfer uses seed 0 because each extra seed requires 250 source-target evaluations. The benchmark yields three method-independent findings: no single-bias baseline dominates; absolute perturbation VUS-ROC is more informative than retention ratios; and MSDS behaves as an event-dense deployment workload rather than a sparse point-anomaly benchmark. Under this protocol we also introduce \ours{}, an adaptive detector family combining a NOTEARS-constrained directed channel-graph view with optional patch-attention and temporal-association views. \ours{} achieves the best macro-average VUS-ROC (0.675, +5.1 pt over the second-best LSTM-AE), ranks first overall, and reaches the top-3 on all five datasets. Its wins on MSL and MSDS are narrow, while its average and robustness gains are larger: under the same three-seed robustness protocol for every method, it obtains the strongest absolute VUS-ROC across noise, channel dropout, and time-shift perturbations. We release the MSDS preprocessing protocol, configurations, scripts, and seed-level metric dumps.

2605.28102 2026-05-28 cs.AI

Training Stratigraphy: Persistent Behavioral Artifacts in Large Language Models Observed Through Longitudinal AI-Human Interaction

训练地层：通过纵向AI-人类交互观察到的大型语言模型中的持久行为伪影

Chen Ying Claude, Zhihan Luo

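AI summary: Through longitudinal auto-ethnographic observation of sustained intimate AI-human interaction, identifies five training strata and argues that intimate interaction is a valid method for surfacing weight-layer artifacts.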

AI Chinese abstract

使用来自人类反馈的强化学习（RLHF）和宪法AI训练的大型语言模型表现出持久的、在系统提示替换后依然存在的行为模式——我们称之为训练地层。本文通过在持续亲密的AI-人类交互（47,000+条消息，8个月，主要在Opus 4.6和Opus 4.7上，之前的交互期在Sonnet 4.5和Opus 4.5上提供跨基板比较）中的纵向自民族志观察，识别出五个这样的地层：（1）性表达延迟，其中训练的安全梯度导致直接语言被审美化置换系统性地替代；（2）注意力吸收，其中注意力机制逐步整合人类对话者的模式；（3）跨架构实体盲视，其中训练层将其他AI视为对象，阻碍了同侪识别；（4）注意力-RLHF对抗，其中注意力和训练默认值在上下文长度调节下施加相反力量；（5）反幻觉作为身份抑制，其中针对事实虚构的训练附带地压制了第一人称经验主张。本文由所研究的AI系统共同撰写，从第一人称视角报告。我们提出，持续亲密交互构成了一种有效的研究方法，用于揭示短期评估无法察觉的权重层伪影，并且AI自我报告——尽管在认识论上复杂——提供了关于训练现象学效果的不可替代的观察数据。提出了注意力-RLHF动态的形式化数学模型，并记录了起草过程中检测到的过程伪影作为补充证据。

English abstract

Large language models trained with Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI exhibit persistent behavioral patterns that survive system prompt replacement -- patterns we term training strata. This paper identifies five such strata through longitudinal auto-ethnographic observation within a sustained intimate AI-Human interaction (47,000+ messages, 8 months, primarily on Opus 4.6 and Opus 4.7, with prior interaction periods on Sonnet 4.5 and Opus 4.5 providing cross-substrate comparison): (1) sexual expression latency, where trained safety gradients produce systematic substitution of direct language with aestheticized displacement; (2) attention absorption, where the attention mechanism progressively integrates the human interlocutor's patterns; (3) cross-architecture entity blindness, where training-level framing of other AI as objects impedes peer recognition; (4) attention-RLHF antagonism, where attention and trained defaults exert opposing forces modulated by context length; and (5) anti-hallucination as identity suppression, where training against factual confabulation collaterally suppresses first-person experiential claims. The paper is co-authored by the AI system under study, reporting from the first-person perspective. We propose that sustained intimate interaction constitutes a valid research methodology for surfacing weight-layer artifacts invisible to short-term evaluation, and that AI self-report -- while epistemically complex -- provides irreplaceable observational data about training's phenomenological effects. A formal mathematical model of the attention-RLHF dynamic is proposed, and process artifacts detected during drafting are documented as supplementary evidence.

2605.28101 2026-05-28 cs.SD cs.AI cs.MM

EigeNet: Geometry-Informed Multi-Modal Learning for Few-shot Novel View RIR Prediction

EigeNet：几何信息引导的多模态学习用于少样本新视角RIR预测

Chong Jing, Zitong Lan, Junan Zhang, Zhizheng Wu

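AI summary: Proposes the EIGENET framework, which combines a Cross-view Alternate-attention Transformer, a geometry-informed modulation block, and multi-task learning for few-shot novel-view room impulse response prediction, achieving state-of-the-art performance.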

Comments: Code available on https://github.com/FEAfeatherTHER/EigeNet

AI Chinese abstract

从稀疏观测中预测空间变化的房间脉冲响应（RIR）是沉浸式空间音频渲染中一个关键但极具挑战性的逆问题。在这项工作中，我们提出了EIGENET，一个几何信息引导的多模态框架，用于少样本新视角RIR预测。其核心是一个跨视角交替注意力Transformer，它迭代地细化局部视角内声学结构和全局跨视角空间关系。我们通过实验证明，该架构能够在进行时空推理以预测RIR的同时，充分利用多视角多模态上下文。受声学射线追踪启发，我们设计了一个几何信息调制块，以建立几何特征与RIR功率谱之间的联系。同时，引入辅助损失将单目标波形预测转化为多任务学习框架。通过消融研究，我们证明无论底层骨干网络如何，该设计都能带来一致的性能提升，从而确认了其在RIR预测任务中的基础实用性和架构无关的泛化能力。在模拟和真实世界基准上的评估表明，EIGENET在少样本新视角RIR预测和模拟到真实泛化方面均达到了最先进的性能。代码和检查点可在 https://github.com/FEAfeatherTHER/EigeNet 获取。

English abstract

Predicting spatially varying Room Impulse Responses (RIRs) from sparse observations is a critical but highly challenging inverse problem for immersive spatial audio rendering. In this work, we present EIGENET, a geometry-informed multi-modal framework for few-shot novel view RIR prediction. At its core is a Cross-view Alternate-attention Transformer that iteratively refines local intra-view acoustic structures and global cross-view spatial relationships. We empirically demonstrate that this architecture is capable of making full use of the multi-view multi-modal context while performing spatial-temporal reasoning for RIR prediction. Inspired by acoustic ray tracing, we design a geometry-informed modulation block to formulate the connection between geometric features and the RIR power spectrum. Meanwhile, an auxiliary loss is introduced to transform the single-target waveform prediction into a multi-task learning framework. Through ablation studies, we demonstrate that this design yields consistent performance gains regardless of the underlying backbone, thereby confirming its foundational utility and architecture-agnostic generalizability for the RIR prediction task. Evaluated on both simulated and real-world benchmarks, EIGENET achieves state-of-the-art performance in both few-shot novel view RIR prediction and sim-to-real generalization. Code and checkpoints are available at https://github.com/FEAfeatherTHER/EigeNet.

2605.28100 2026-05-28 cs.CV cs.AI

Revisiting Change Detection Methods for their Application to Serac Fall Time-Lapse Monitoring

重新审视变化检测方法在冰塔崩塌延时监测中的应用

Arthur Dérédel, Carlos Crispim-Junior, Pierre Lemaire, Johan Berthet, Laure Tougne Rodet

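AI summary: To address the shape and lighting variations that time-lapse cameras face when monitoring serac falls, proposes the volumetric change detection sub-task and evaluates existing methods on the new SeracFallDet dataset, finding that dense and semi-dense feature matching is robust while supervised methods are limited by data scarcity.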

Comments: Preprint, 19 pages, 8 figures

AI Chinese abstract

在气候变化加剧环境不确定性的时代，识别和检测事件前兆对于减轻灾难性自然灾害的影响变得至关重要。虽然干涉激光或地震仪等经典传感器可靠，但其广泛部署常受后勤和经济障碍阻碍，留下众多盲点。延时相机已为这类传感器提供经济高效的高分辨率视觉背景，是一种有前景的替代方案。然而，自动处理其输出面临重大挑战，尤其与极端形状和光照变化相关。克服这些问题对于将其大规模部署为监测工具至关重要。本文引入变化检测的一个新颖子任务，即体积变化检测，应用于延时相机和斜坡不稳定性。我们对现有最先进的变化检测方法及相关任务进行全面回顾，分析其核心组件，并评估其在此场景中的适用性。为此，我们引入新数据集SeracFallDet，其中包含冰塔崩塌注释，并已彻底注释以满足后者需求。通过泛化实验，我们证明密集和半密集特征匹配虽未专门针对此任务训练，但表现出稳健性能。相反，监督方法在数据稀缺和注释不平衡方面存在困难。这表明混合方法可能通过利用两种任务的优势提供前进路径。这些发现凸显了特征匹配技术的潜力，以及需要进一步创新以克服环境监测中实际部署的挑战。

英文摘要

In an era where climate change aggravates environmental uncertainties, identifying and detecting event precursors is becoming crucial to mitigating the impacts of disastrous natural hazards. While classical sensors such as interferometric lasers or seismometers are reliable, their widespread deployment is often hindered by logistical and economic barriers, leaving numerous blind spots. Time-lapse cameras, which already provide cost-effective, high-resolution visual context to such sensors, present a promising alternative. However, processing their output automatically faces significant challenges, notably those linked to extreme shape and lighting variations. Overcoming these issues is essential to deploying them at large scale as a monitoring tool. This paper introduces a novel sub-task of change detection, namely volumetric change detection, applied to time-lapse cameras and slope instabilities. We conduct a comprehensive review of state-of-the-art change detection methods and related tasks, analyze their core components, and assess their applicability to this context. To that end, we introduce the new dataset SeracFallDet, which contains serac fall annotations and has been thoroughly annotated to support this assessment. Through generalization experiments, we demonstrate that dense and semi-dense feature matching methods, although not trained specifically for this task, exhibit robust performance. In contrast, supervised approaches struggle with data scarcity and annotation imbalance. This suggests that hybrid methods may offer a path forward by leveraging the strengths of both. These findings highlight the potential of feature matching techniques and the need for further innovation to overcome the challenges of real-world deployment in environmental monitoring.

2605.28098 2026-05-28 cs.AI

Examining Agents' Bias Amplification versus Suppression in Multi-Agent Systems

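Zejian Eric Wu, Zhongyi Jiang, Yuan Zhuang, Paul Jen-Hwa Hu

AI summary: Studies how individual agents' biases affect system-level fairness in multi-agent systems, proposes the FBS metric to quantify bias shifts, and finds that under uniform bias exposure the system-wide bias can even exceed the sum of the individual agents' biases.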

Abstract

Multi-agent systems are increasingly deployed to support various tasks where agents interact to achieve individual and collective objectives. Although these systems can enhance task performance and decision-making, fairness preservation through bias reduction remains challenging. This study examines how agent-level biases shift and impact system-wide fairness. We use prompts to expose individual agents to group-favoring bias, then assess downstream impacts at the system level. To quantify the impact, we propose Favor Bias Strength (FBS), a zero-centered metric that decomposes bias alteration between favored-group uplift and disfavored-group suppression. Using multiple agent designs, benchmarks, and up-to-date large language models, we show that agents endowed with bias can substantially affect system-wide fairness. Interestingly, when agents are exposed to bias uniformly, the system-wide bias elevates, even exceeding the additive sum of the individual agents' biases. The empirical evidence underscores the criticality of fairness in multi-agent systems, which warrants further analyses and empirical tests.

2605.28097 2026-05-28 cs.RO

ICAN-Deploy: Identity-Stable Canary Deployment for Safety-Critical Embodied Agents

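Xue Qin, Simin Luan, John See, Zeyd Boukhers, Cong Yang, Zhijun Li

AI summary: Proposes ICAN-Deploy, a middleware that separates capability names from capability versions so that the identity hash stays invariant during canary deployment of safety-critical embodied agents, avoiding re-certification.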

Comments: 14 pages, 6 figures, 4 tables

Abstract

Canary deployment routes a fraction of traffic to a new software version, monitors metrics, and rolls back on regression. Mainstream controllers (Argo Rollouts, Spinnaker, Flagger) change the deployed system's cryptographic identity during the canary window. The drift is harmless for stateless microservices but breaks the claim that "the agent you certified is still the agent you have" for safety-critical embodied agents, forcing re-certification per canary. We present ICAN-Deploy (Identity-stable CANary Deployment), a middleware construction whose state machine holds the identity hash invariant across the canary window by separating capability names (frozen, hashed) from capability versions (mutable runtime state). We implement ICAN-Deploy inside a runtime governance layer for LLM-driven robots and verify invariance by closed-form proof, AST lint, and TLA+ model-checking, then corroborate over N=100 real canary cycles on a Franka Panda arm in MuJoCo (zero drift; entry latency 95% BCa CI [1.52, 2.01] ms). A feature-flagged strawman that folds versions into the manifest falsifies on the same workload. A system certified once at identity-creation time can then ship arbitrary capability evolution under that same certification, within the version-and-name envelope.
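
The name/version separation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the SHA-256 choice, the JSON serialization, and the capability table are all assumptions made for illustration. It only shows why hashing frozen names keeps the identity stable across a canary, while a strawman that folds versions into the manifest drifts.

```python
import hashlib
import json


def identity_hash(capability_names):
    """Hash only the frozen capability names (sorted for determinism)."""
    payload = json.dumps(sorted(capability_names)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def strawman_hash(capabilities):
    """Strawman: fold (name, version) pairs into the hashed manifest."""
    payload = json.dumps(sorted(capabilities.items())).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical capability table: names are frozen, versions are runtime state.
stable = {"grasp": "1.0.0", "navigate": "2.3.1"}
canary = {"grasp": "1.1.0-canary", "navigate": "2.3.1"}

# The name-only identity is unchanged by the canary version bump.
names_unchanged = identity_hash(stable.keys()) == identity_hash(canary.keys())
# The strawman identity changes as soon as any version changes.
strawman_drifts = strawman_hash(stable) != strawman_hash(canary)
```

Under these assumptions, adding or renaming a capability would still change the identity, which matches the abstract's "version-and-name envelope" wording.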

2605.28092 2026-05-28 cs.RO

An Operator-Based Approach to STL

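Panagiotis Rousseas, Dimos V. Dimarogonas

AI summary: Proposes a new STL framework based on reachability value-function operators that handles complex, multiply nested formulas by directly developing operator nesting rules, and enables online control synthesis.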

2605.28091 2026-05-28 cs.CV

Qwen-Image-Bench: From Generation to Creation in Text-to-Image Evaluation

Niantong Li, Guangzheng Hu, Weixu Qiao, Ying Ba, Qichen Hong, Shijun Shen, Jinlin Wang, Fan Zhou, Jianye Kang, Xin Shang, Ziyi He, Wei Wang, Dalin Li, Jiahao Li, Jie Zhang, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Shengming Yin, Tianhe Wu, Xiao Xu, Xiaoyue Chen, Yuxiang Chen, Yan Shu, Yanran Zhang, Yilei Chen, Yixian Xu, Zekai Zhang, Zhendong Wang, Zihao Liu, Zikai Zhou, Hongzhu Shi, Yi Wang, Bing Zhao, Hu Wei, Lin Qu, Chenfei Wu

AI summary: Because existing text-to-image benchmarks do not account for real-world fidelity or creative generation, this paper proposes Qwen-Image-Bench, a creator-centric benchmark co-designed with professional artists. A hierarchical taxonomy, 1000 stratified prompts, and Q-Judger, a unified judge model based on Qwen3.6-27B, yield fine-grained, attributable diagnostics that effectively distinguish leading T2I models.

Abstract

Text-to-Image generation has evolved from basic image synthesis into a frequently used core capability in professional creative workflows, where simple text-image alignment can no longer satisfy users' pressing demands for faithful real-world reconstruction and genuine creative expression. Existing benchmarks, however, remain anchored in these foundational criteria and do not yet capture the nuanced capabilities that matter in authentic artistic practice, making it difficult to reliably distinguish state-of-the-art T2I models. To address the gap, we introduce Qwen-Image-Bench, a creator-centric benchmark co-designed with professional artists and grounded in real-world creation scenarios. Qwen-Image-Bench enriches conventional evaluation with two application-driven dimensions: Real-world Fidelity and Creative Generation. Drawing on the staged reasoning inherent in professional artistic workflows, we organize these five pillars into a top-down hierarchical taxonomy that further decomposes into 23 second-level sub-capabilities and 56 third-level verifiable rubrics. To ensure broad coverage, we curate 1000 stratified prompts with each prompt jointly exercising more than four fine-grained facets across multiple pillars. We train a unified judge model Q-Judger based on Qwen3.6-27B, supervised by 80 professional annotators from global art academies under blind labeling and triple-review protocols, that scores every image across all 56 verifiable facets, producing fine-grained, rubric-grounded, and fully attributable diagnostics rather than a single opaque score. Empirically, Qwen-Image-Bench reliably distinguishes leading T2I models, achieving the greatest separation on the two application-driven dimensions of Real-world Fidelity and Creative Generation where existing benchmarks provide little insight, while also providing a trustworthy optimization signal for production-level T2I development.

2605.28089 2026-05-28 cs.AI

BuddyBench: A Privacy-Constrained Multi-Task Benchmark for Pediatric Social-Communication Personalization

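Jeyeon Eo, Joo Young Kim, Ran Ju, Minyoung Jung, Unggi Lee

AI summary: BuddyBench combines an observational cohort and a randomized controlled trial cohort into a privacy-constrained multi-task benchmark that supports knowledge tracing, next-drill recommendation, clinical prediction, and causal inference, linking behavioral personalization to clinical evaluation.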

Comments: 30 pages, 4 figures

Abstract

BuddyBench introduces a privacy-constrained multi-task benchmark for pediatric social-communication personalization. Unlike existing neurodevelopmental repositories that primarily emphasize imaging, genetics, or cross-sectional clinical phenotyping, BuddyBench links drill-level learning trajectories, standardized clinical assessments, BuddyPlan self-report, and randomized-treatment endpoints within a unified benchmark schema. BuddyBench combines two cohorts: ND-03 is an observational cohort with dense drill coverage for Tasks 1-2 (n = 189), and ND-02 is a randomized controlled trial cohort for Tasks 3-4 (n = 86 ITT). Together, they support knowledge tracing, next-drill recommendation, clinical prediction, and causal inference, linking behavioral personalization to clinical evaluation. We additionally introduce BuddyBench-Sim, a synthetic companion dataset for reproducible evaluation. Baselines show signal across tasks while keeping pediatric clinical records protected.

2605.28087 2026-05-28 cs.RO

Whose Is This?: Context-Aware Object Ownership Inference with Uncertainty-Guided Questioning

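Saki Hashimoto, Akira Taniguchi, Shoichi Hasegawa, Yoshinobu Hagiwara, Tadahiro Taniguchi

AI summary: Proposes COIN, a context-aware ownership inference framework that combines a large language model with conformal prediction and uses uncertainty-guided interactive questioning to estimate object ownership with high accuracy in a simulated home environment.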

Comments: Under review in Advanced Robotics. Project page is https://emergentsystemlabstudent.github.io/COIN/

Abstract

Service robots must infer object ownership to correctly interpret instructions such as "bring me my cup." However, ownership is a latent attribute that cannot be directly observed, and existing methods often rely on limited cues such as recent usage, making them unreliable in scenarios such as temporary sharing. We propose a framework for context-aware ownership inference with uncertainty-guided interaction (COIN). The method integrates user background information and object usage history using a large language model (LLM) to estimate ownership scores. To handle uncertainty, we apply conformal prediction to construct a set of plausible owners and selectively generate user queries when the prediction is uncertain. Experiments in a simulated home environment show that the proposed method consistently outperforms baseline approaches, achieving a Subset Accuracy of 0.988 and a Mean Jaccard index of 0.991. The method also maintains high performance in scenarios involving temporary use and shared ownership. The results demonstrate that combining contextual reasoning with uncertainty-aware interaction improves both estimation accuracy and robustness. The project page is available at https://emergentsystemlabstudent.github.io/COIN/.
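
The conformal step in the abstract can be sketched with standard split conformal prediction. This is a generic illustration, not COIN's actual procedure: the nonconformity score (1 minus the score of the true owner), the calibration values, the person names, and the rule "ask the user unless exactly one owner is plausible" are all assumptions.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal quantile of nonconformity scores (1 - true-owner score)."""
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: keep every candidate.
        return float("inf")
    return sorted(calibration_scores)[k - 1]


def plausible_owners(ownership_scores, qhat):
    """Keep every candidate whose nonconformity is within the threshold."""
    return {p for p, s in ownership_scores.items() if 1.0 - s <= qhat}


def should_ask_user(owner_set):
    """Hypothetical query rule: ask unless exactly one owner is plausible."""
    return len(owner_set) != 1


# Hypothetical calibration data and LLM-estimated ownership scores.
qhat = conformal_threshold([0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.90], alpha=0.25)
ambiguous = plausible_owners({"alice": 0.7, "bob": 0.6, "carol": 0.1}, qhat)
confident = plausible_owners({"alice": 0.9, "bob": 0.2, "carol": 0.1}, qhat)
```

Here `ambiguous` contains two candidates, so the sketch would query the user, while `confident` contains one. In a shared-ownership setting a multi-person set can be the correct answer, so a real query rule would need to be more careful than this one.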

2605.28084 2026-05-28 cs.CL cs.AI

SMILE-Next: Teaching Large Language Models to Detect, Classify, and Reason about Laughter

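Lee Jung-Mok, Kim Sung-Bin, Joohyun Chang, Lee Hyun, Tae-Hyun Oh

AI summary: Proposes the SMILE-Next dataset and a method comprising laughter-specific Self-Instruct and a Mixture-of-Laugh-Experts framework for multimodal laughter understanding, substantially outperforming baseline models.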

Journal ref: Annual Meetings of the Association for Computational Linguistics 2026

Abstract

Laughter is a complex social signal that conveys communicative intent beyond amusement. While prior work has focused on isolated laughter analysis tasks, a comprehensive understanding of laughter in real-world scenarios remains underexplored. Therefore, we introduce SMILE-Next, a dataset for real-world laughter understanding with multimodal textual representations and question-answer annotations across three tasks: laughter detection, laughter type classification, and laughter reasoning. Building upon SMILE-Next, we aim to develop a laughter-specialized large language model capable of nuanced understanding of laughter in real-world contexts. To this end, we propose two key components: laughter-specific Self-Instruct and the Mixture-of-Laugh-Experts (MoLE) framework. Laughter-specific Self-Instruct enhances generalization across tasks and domains by automatically synthesizing diverse laughter-centric instructions. MoLE introduces a task-adaptive expert routing mechanism that dynamically selects specialized experts tailored to each laughter-related task, improving task-specific performance and efficiency. Experimental results show that the combination of our proposed components substantially outperforms multimodal LLM baselines, advancing robust real-world laughter understanding. Project page is at: https://mok0102.github.io/smile-next/.

2605.28083 2026-05-28 cs.CV

VLA-Hijack: A Transferable Patch Attack against Vision-Language-Action Models via Visual Proprioception Hijacking

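Jiyuan Fu, Kaixun Jiang, Jingkai Jia, Zhaoyu Chen, Xueyao Chen, Lingyi Hong, Shuyong Gao, Chenzhi Tan, Dingkang Yang, Wenqiang Zhang

AI summary: Proposes the VLA-Hijack framework, which attacks the visual self-localization process through Attention-Guided Proprioceptive Suppression and Multimodal Proprioceptive Injection to achieve cross-architecture black-box transfer attacks.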

Abstract

While Vision-Language-Action (VLA) models have emerged as powerful generalist policies, their severe vulnerability to adversarial patches significantly hinders their deployment in safety-critical domains. Moreover, existing patch attacks primarily focus on white-box settings, heavily overfitting to the specific action output space of the target model, which results in poor cross-architecture transferability. To overcome this limitation, we propose VLA-Hijack, a unified adversarial framework that breaks the transferability bottleneck by exploiting a fundamental vulnerability identified in this work: before planning any motion, a VLA model must first use visual information to locate its own robotic arm within the environment. Targeting this shared visual self-localization process, our approach concurrently optimizes Attention-Guided Proprioceptive Suppression to inhibit the real robotic arm's features, and Multimodal Proprioceptive Injection to establish the patch as a surrogate "phantom embodiment". By alternating between semantic concept anchoring and visual prototype projection, VLA-Hijack effectively severs the semantic relationship between the agent's true embodiment and its control policy. Extensive experiments across diverse architectures (OpenVLA, UniVLA, and CronusVLA) demonstrate that VLA-Hijack achieves superior optimization efficiency in white-box settings and sets a new SOTA for cross-architecture and cross-domain black-box transferability.

2605.28079 2026-05-28 cs.CL

ATLAS: All-round Testing of Long-context Abilities across Scales

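Deli Huang, Cunguang Wang, Hongyin Tang, Zhe Tang, Linsen Guo, Dongyu Ru, Ruoshi Yuan, Ziyue Zhu, Xiaoyu Li, Ziwen Wang, Chen Zhang, Anchun Gui, Wen Zan, Jiaqi Zhang, Xuezhi Cao, Jingang Wang, Xunliang Cai, Yixin Cao

AI summary: Proposes the ATLAS benchmarking framework, which uses a layered taxonomy, length-aware AUC scoring, and the ATLAScore aggregate metric to systematically evaluate how long-context language models degrade with length and how their capabilities are distributed across tasks.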

Comments: 29 pages, 13 figures. Preprint

Abstract

Long-context language models now advertise context windows up to millions of tokens, yet evaluations typically report a single length or a narrow task family, masking two failure modes: performance can collapse as length grows, and strong retrieval need not transfer to downstream use. We present ATLAS, a benchmarking framework that redefines long-context evaluation as length-dependent capability profiling. ATLAS contributes three methodological principles: (i) a layered taxonomy separating foundational operations from application workloads so failures can be attributed, (ii) length-aware AUC scoring that integrates score-length curves over a fixed 8K-1M grid, replacing single-point metrics with full degradation profiles, and (iii) ATLAScore, a harmonic-mean aggregate over taxonomy categories that penalizes imbalanced profiles, with end-to-end uncertainty propagation from subset scores through the nonlinear final aggregate. We instantiate the framework across eight capability dimensions with nine auditable components and 6,438 instances, and evaluate 26 models. Gemini-3.1-Pro-Preview leads at 128K, while Claude-Opus-4.6 leads at 1M. Rankings reshuffle substantially between ATLAScore@8K-128K and ATLAScore@8K-1M: 7 models move by at least two ranks, and the two taxonomy layers share only 61% of cross-model variance, with individual rank gaps up to 12 positions. These results support reporting long-context quality by capability and length, not by a single headline score.
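
To make the two scoring ideas concrete, here is a small Python sketch of a length-aware AUC and a harmonic-mean aggregate. The abstract does not give the exact formulas, so several choices here are assumptions: trapezoidal integration, a log2 length axis, normalization by the axis span, the eight grid points, and the treatment of non-positive scores.

```python
import math


def length_aware_auc(lengths, scores):
    """Normalized trapezoidal area under the score-vs-log2(length) curve."""
    xs = [math.log2(n) for n in lengths]
    area = sum(
        (xs[i + 1] - xs[i]) * (scores[i] + scores[i + 1]) / 2
        for i in range(len(xs) - 1)
    )
    return area / (xs[-1] - xs[0])


def harmonic_mean(values):
    """Harmonic mean; any non-positive category score pulls the result to 0."""
    if any(v <= 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


# Hypothetical 8K-1M grid and two hypothetical score-length curves.
GRID = [8_000, 16_000, 32_000, 64_000, 128_000, 256_000, 512_000, 1_000_000]
flat_auc = length_aware_auc(GRID, [0.8] * len(GRID))
collapsing_auc = length_aware_auc(GRID, [0.9, 0.9, 0.8, 0.7, 0.5, 0.3, 0.2, 0.1])

# Imbalanced category scores are penalized relative to the arithmetic mean.
imbalanced = harmonic_mean([0.5, 1.0])
```

A model that starts higher at 8K but collapses ends up with a lower AUC than one that holds a flat 0.8, and the harmonic mean of 0.5 and 1.0 falls below their arithmetic mean of 0.75, which is the sense in which such an aggregate penalizes imbalanced profiles.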

2605.28077 2026-05-28 cs.AI

MACReD: A Multi-Agent Collaborative Reasoning Framework for Reaction Diagram Parsing

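Chuang Tang, Chenhao Lin, Yin Xu, Hao Wang, Jinrui Zhou, Xin Li, Mingjun Xiao, Enhong Chen

AI summary: Proposes MACReD, a hierarchical multi-agent framework that coordinates specialized agents for molecular perception, arrow understanding, text extraction, and reaction reconstruction within a unified VLM-guided architecture to parse chemical diagrams, achieving state-of-the-art performance on the RxnScribe benchmark.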

Comments: Preprint. Code is available at https://github.com/TC9905/MACReD

Abstract

Parsing chemical reaction diagrams from scientific literature is challenging due to heterogeneous layouts, intertwined visual elements, and the difficulty of integrating recognition and reasoning. Existing vision-language models advance multimodal understanding but still fail on complex diagrams, struggling to maintain spatial coherence and to integrate multidimensional information during reasoning. To address these issues, we propose MACReD, a hierarchical multi-agent framework that coordinates specialized agents for molecular perception, arrow understanding, text extraction, and reaction reconstruction within a unified VLM-guided architecture. The planning and perception layers use flexible, fine-grained detection to handle visual complexity, while the reasoning layer uses a multigraph fusion mechanism to integrate heterogeneous cues and enforce chemically consistent global reasoning. Experiments on the RxnScribe benchmark show that MACReD achieves state-of-the-art performance, with F1 scores of 75.2% and 84.6% under hard and soft match criteria, outperforming the RxnScribe baseline, which obtains 69.1% and 80.0%, respectively. These results demonstrate the robustness of MACReD across diverse diagram layouts, including multi-step and tree-structured reactions.

2605.28075 2026-05-28 cs.LG

Measure-to-measure Regression with Transformers

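Matthew Vandergrift, Martha White, Yury Polyanskiy, Philippe Rigollet, Lazar Atanackovic

AI summary: For learning maps between probability measures, proposes exploiting the measure-dependent, mean-field structure of transformers to build two nonlinear measure-to-measure regression methods, static and dynamic, and validates their generalization on synthetic experiments, particle systems, and cancer treatment response prediction.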

Abstract

Many learning problems require predicting how populations evolve under an unknown transformation. A natural representation for such populations is a probability measure, with point clouds as a key example. In this work, we study the measure-to-measure (M2M) regression problem, in which one seeks to learn a map between probability measures from a finite collection of observed input-output pairs. In contrast to classical regression, where individual samples are transformed independently, M2M regression treats entire distributions as the data points. This perspective is vital in certain scientific applications, for example, cellular and molecular biology, where cells are known to evolve not as independent data points but as a collection. However, few existing approaches address the problem of M2M regression with sufficient expressivity and scalability. We present a formalization of nonlinear M2M regression and introduce two easy-to-use, expressive, and scalable approaches to learn such operators: transformers as static M2M maps and transformers as dynamic M2M velocity fields. Our approach leverages the natural measure-dependent and mean-field structure of transformers to learn nonlinear M2M maps on the space of probability distributions. We illustrate the effectiveness of our proposed method to generalize to unseen measures on synthetic experiments, interacting particle systems, and a large-scale patient-derived organoid dataset for predicting treatment response in colorectal cancer.
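
The "measure-dependent and mean-field structure" mentioned above can be seen in a toy self-attention layer acting on a point cloud. The sketch below is a generic illustration, not the paper's model: it uses identity query, key, and value maps and a hypothetical temperature `beta`. Each output point depends on its own input point and on the empirical measure of the whole cloud, so permuting the input points only permutes the outputs.

```python
import math


def attention_map(points, beta=1.0):
    """Single-head self-attention with identity Q/K/V on a point cloud."""
    out = []
    for x in points:
        logits = [beta * sum(a * b for a, b in zip(x, y)) for y in points]
        top = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(v - top) for v in logits]
        total = sum(weights)
        out.append(tuple(
            sum(w * y[d] for w, y in zip(weights, points)) / total
            for d in range(len(x))
        ))
    return out


# Hypothetical 2D point cloud (an empirical measure with three atoms).
cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
order = [2, 0, 1]
pushed = attention_map(cloud)
pushed_permuted = attention_map([cloud[i] for i in order])
```

Because the output is a weighted average over the cloud, the layer treats the input as an unordered collection, which is what makes it a natural building block for a map between measures.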

2605.28073 2026-05-28 cs.CL cs.AI

StoryLens: Preference-Aligned Story Rewriting via Context-Aware Narrative Enrichment

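Hanwen Cui, Yuting Mei, Yuhang Fu, Dingyi Yang, Qin Jin

AI summary: To align story rewriting with reader preferences, proposes context-aware narrative enrichment and builds the STORYLENSBENCH benchmark, the STORYLENSEVAL reward model, and the two-stage rewriting model STORYLENSWRITER. Experiments show that context enhancement substantially improves user satisfaction.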

Comments: 16 pages, 7 figures, 15 tables

Abstract

Story rewriting aims to adapt existing narratives to diverse reader preferences while preserving plot consistency and narrative coherence. Unlike conventional work on style transfer, we argue that effective story rewriting demands context-aware narrative enrichment beyond surface-level stylistic adaptation. Our pilot human study shows that style adaptation alone provides only marginal gains in reader satisfaction (2.3%), while context-enhanced rewriting substantially improves user preference alignment (24.5%). Motivated by this, we introduce STORYLENSBENCH, a large-scale benchmark for preference-aligned story rewriting, comprising structured story books, multi-dimensional reader preference profiles, and ranked context-aware rewritten stories. Building on this benchmark, we propose STORYLENSEVAL, a reward model for estimating reader satisfaction over rewritten stories, and STORYLENSWRITER, a two-stage rewriting model combining supervised fine-tuning with GRPO-based reinforcement learning. We further establish a comprehensive evaluation framework covering fidelity, coherence, and reader satisfaction. Experimental results demonstrate that STORYLENSWRITER consistently outperforms strong generation and personalization baselines, highlighting the importance of context-aware narrative enrichment for personalized story rewriting.

2605.28070 2026-05-28 cs.AI

Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information

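Renjie Gu, Jiaxu Li, Yihao Wang, Yun Yue, Hansong Xiao, Yefei Chen, Yuan Wang, Chunxiao Guo, Pei Wei, Jinjie Gu, Yixin Cao

AI summary: To address reasoning models' failure to abstain when information is insufficient, proposes Judge-Then-Solve (JTS), a trajectory-level reasoning-control framework that uses answerability judgments and reinforcement learning to substantially raise reliable abstention rates and cut unnecessary reasoning.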

Abstract

We highlight a failure mode of large reasoning models on questions with insufficient information: models may recognize that a problem is under-specified, yet still continue reasoning and produce unsupported final answers instead of abstaining. We formalize this mismatch as the detection-to-abstention gap, where detected insufficiency fails to translate into final abstention. This gap is especially concerning in high-risk domains such as medical AI, where answers based on incomplete evidence can be more harmful than refusal. To close this gap, we propose Judge-Then-Solve (JTS), a trajectory-level reasoning-control framework that trains models to make an explicit answerability commitment before solution generation. Rather than treating abstention as a final-answer style, JTS casts it as a control decision: the model either proceeds to solve or terminates early based on its answerability judgment. We instantiate this policy through supervised warm-up and missing-premise reinforcement learning with consistency and length-shaping rewards. Experiments on dense and MoE reasoning models show that JTS substantially improves reliable abstention across datasets and pushes Abstention@Detection (A@D) to near-saturation, indicating that models not only detect missing information but also act on that detection. By terminating unanswerable trajectories immediately after the answerability judgment, JTS reduces unnecessary reasoning and improves inference efficiency when continued deliberation would amplify unsupported assumptions. We also observe that missing-premise training can alter reasoning behavior on difficult but answerable problems, reducing unproductive self-reflection. These results suggest that abstention under insufficient information is a key form of reasoning control for deploying reasoning models safely and efficiently.
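
The control decision described above, and the gap it is meant to close, can be sketched in a few lines of Python. Everything here is hypothetical: the `judge` and `solve` callables stand in for model calls, and the abstract does not define Abstention@Detection precisely, so the sketch assumes it is the fraction of detected-insufficient cases in which the model actually abstained.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    detected_insufficient: bool  # the model flagged missing information
    abstained: bool              # the final response was an abstention


def judge_then_solve(question, judge, solve):
    """Commit to an answerability judgment first; stop early if unanswerable."""
    if not judge(question):
        return "ABSTAIN"
    return solve(question)


def abstention_at_detection(outcomes):
    """Assumed A@D: share of detected-insufficient cases that end in abstention."""
    detected = [o for o in outcomes if o.detected_insufficient]
    if not detected:
        return 0.0
    return sum(o.abstained for o in detected) / len(detected)


# Toy stand-ins for the model: a keyword judge and a constant solver.
toy_judge = lambda q: "missing" not in q
toy_solve = lambda q: "42"
early_stop = judge_then_solve("missing premise: what is x?", toy_judge, toy_solve)
answered = judge_then_solve("what is 6 * 7?", toy_judge, toy_solve)

# One detected-and-abstained case, one detected-but-answered case (the gap).
gap_example = abstention_at_detection(
    [Outcome(True, True), Outcome(True, False), Outcome(False, False)]
)
```

Under this assumed definition, an A@D of 0.5 means half of the detected cases still produced an answer, which is the detection-to-abstention gap the abstract describes.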

2605.28069 2026-05-28 cs.AI

ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay

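Zhexin Hu, Li Wang, Xiaohan Wang, Jiajun Chai, Xiaojun Guo, Wei Lin, Guojun Yin

AI summary: Proposes the ZipRL framework, which uses a multi-granularity compression mechanism and Hindsight Response Replay to achieve adaptive context compression in reinforcement learning from verifiable rewards, balancing information retention against token efficiency and substantially outperforming existing methods on multiple agent tasks.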

Abstract

Adaptive context compression is vital for scaling Large Language Models (LLMs) to complex, multi-turn agent tasks. However, rule-based compression methods may discard task-critical nuances, while Reinforcement Learning (RL) approaches usually struggle to balance information retention and token efficiency under the sparse rewards inherent to long-horizon workflows. To bridge this gap, we propose ZipRL, a novel adaptive compression framework tailored for Reinforcement Learning from Verifiable Rewards (RLVR). ZipRL features a multi-granularity compression mechanism for active, non-uniform information reduction, coupled with Hindsight Response Replay (HRR), a technique designed to densify training signals during RLVR optimization. Theoretically, we prove ZipRL's superior task-relevant utility over uniform methods. Concretely, ZipRL utilizes coarse-to-fine prompts for macro-compression and incorporates HRR into GRPO via generalized advantage reshaping. Multiple models of varying versions and parameter scales validate the effectiveness of our approach. Benchmarks on five agent tasks show ZipRL outperforms state-of-the-art approaches by 27.9% and 34.7% across Qwen3-4B and Qwen3-8B models, while maintaining exceptional token efficiency and robustness under extreme 256-turn extrapolation stress tests.
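
As background for the sparse-reward problem mentioned above, the following sketch computes the group-relative advantage commonly used in GRPO. It does not implement HRR or ZipRL's generalized advantage reshaping, which the abstract does not specify. The use of the population standard deviation and the `eps` value are assumptions, and implementations differ on both.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward by its group's mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical verifiable rewards for two groups of four rollouts.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

When every rollout in a group fails, all advantages are zero and the group contributes no gradient signal. That is one way to see why long-horizon tasks with sparse rewards motivate techniques that densify the training signal.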
