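arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.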

2606.11079 2026-06-10 cs.CL New submission

VISTA: A Versatile Interactive User Simulation Toolkit for Agent Evaluation

Yunan Lu, Ryan Shea, Yusen Zhang, Zhou Yu

Affiliations: Department of Computer Science, Columbia University; Arklex.ai

AI summary: Proposes the VISTA toolkit, which uses six metrics and a hybrid user simulator (UI + API) to make agent evaluation more realistic and comprehensive, validated in e-commerce and education scenarios.

Abstract

Evaluation remains a critical bottleneck for interactive agent development. Existing evaluation methods often rely on static benchmarks, which fail to capture the dynamic, multi-step nature of agentic behavior and struggle to expose meaningful failure modes. While user-simulation-based evaluation offers a promising alternative, existing simulation frameworks suffer from two major limitations. First, they provide limited mechanisms for evaluating the quality and comprehensiveness of simulated interactions, making it difficult to assess whether a simulator sufficiently explores an agent's capabilities and failure modes. Second, most frameworks are restricted to either UI-only actions or API-only actions, limiting their ability to model the full range of realistic user behaviors. To address these limitations, we propose VISTA, a Versatile Interactive user Simulation Toolkit for Agent evaluation. Our toolkit includes a suite of six metrics for measuring the realism, capability coverage, and interaction effectiveness of simulated interactions. In addition, we develop a hybrid user simulator that integrates both UI-based interactions and API-based interactions, enabling more realistic and comprehensive evaluation across diverse interactive environments. We evaluate VISTA in e-commerce shopping and education customer service settings and demonstrate that it produces more realistic and comprehensive evaluations than existing methods.

2606.11078 2026-06-10 cs.AI cs.CL cs.CV New submission

A History-Aware Visually Grounded Critic for Computer Use Agents

Jaewoo Lee, Zaid Khan, Archiki Prasad, Justin Chih-Yao Chen, Supriyo Chakraborty, Kartik Balasubramaniam, Sambit Sahu, Elias Stengel-Eskin, Hyunji Lee, Mohit Bansal

Affiliations: University of North Carolina at Chapel Hill; Capital One; University of Texas at Austin

AI summary: Proposes HiViG, a framework built around a history-aware, visually grounded multimodal critic that evaluates actions at test time and intercepts errors, improving success rates on multiple GUI benchmarks.

Comments: Code: https://github.com/G-JWLee/HiViG

Abstract

Various test-time interventions for Computer Use Agents (CUAs), including critic models, have been developed to improve performance through pre-execution action evaluation in complex Graphical User Interface (GUI) environments. However, existing critics suffer from two key limitations: they (1) focus primarily on short-sighted decision loops (e.g., forgetting earlier actions) and (2) lack the visual grounding needed to detect flawed actions (e.g., clicking wrong UI elements). To address these, we introduce HiViG, a History-aware Visually Grounded test-time framework, built around a multimodal critic trained on real GUI trajectories to abstract past interactions into a compact record and to evaluate actions with visual grounding. At test time, HiViG integrates the critic into the policy decision loop to provide macro-action history, which summarizes the policy's completed achievements, and visually grounded critique, which verifies raw execution coordinates against the current screenshot to intercept errors before execution. Across web, mobile, and desktop benchmarks, HiViG consistently outperforms existing scalar and verbal critics, improving average success rates over the strongest baseline by 5.8% for Qwen3-VL-32B and 9.0% for Gemini-3-Flash, and demonstrates strong cross-platform generalization. Ablations show that macro-action history mitigates short-sighted planning and visually grounded critique reduces execution errors, with both components being critical for test-time scaling in long-horizon GUI tasks.

2606.11075 2026-06-10 cs.LG New submission

Exploring the Design Space of Reward Backpropagation for Flow Matching

Ruoyu Wang, Boye Niu, Xiangxin Zhou, Yushi Huang, Tongliang Liu, Chi Zhang

Affiliations: Westlake University; Tencent Hunyuan; University of Sydney; The Hong Kong University of Science and Technology

AI summary: For text-to-image flow matching models, proposes FlowBP, a unified surrogate-trajectory framework that designs the backward trajectory (sparse reconstruction, bridge coupling, Lagrange quadrature) to address the memory and gradient-explosion problems of direct reward backpropagation, outperforming baselines across several models and metrics.

Abstract

Aligning text-to-image flow matching models with human preferences via direct reward backpropagation is sample-efficient but hampered by two well-known pathologies: activations cannot be stored across the full sampling trajectory at modern model scale, and chained Jacobian products across steps inflate the reward gradient as it travels back to early indices. Connector-based methods, such as LeapAlign, address these issues by replacing the full backward trajectory with a short pinned path, highlighting a useful decoupling between sampling and optimization. However, the quality of the resulting gradient depends on how accurately this short path approximates the full rollout, especially over long intervals. We propose FlowBP, a unified surrogate-trajectory framework that treats the backward trajectory itself as the design object. FlowBP keeps a no-gradient cached rollout for sampling, then builds a lightweight backward surrogate from cached and selectively re-forwarded velocities. This view separates four choices: the reward-model input, active set, integration weights, and bridge coupling, and recovers prior direct-gradient methods as particular settings. Within this framework, we instantiate three variants: FlowBP-Sparse uses sparse Euler reconstruction, FlowBP-Bridge adds controlled bridge coupling, and FlowBP-Lagrange raises the order of leap quadrature. All three bound memory by the active-set size and limit gradient chaining to at most one Jacobian factor. Across SD3.5-M, FLUX.1-dev, and FLUX.2-Klein-base on preference, quality, and compositional metrics, the three variants improve over direct-gradient baselines on most metrics.

2606.11070 2026-06-10 cs.CL cs.AI New submission

T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains

Genta Indra Winata, Amartya Chakraborty, Yuzhen Lin, Swasthi P Rao, Shikhhar Siingh, Houhan Lu, Nadia Bathaee, Sriharsha Hatwar, Paresh Dashore, Anmol Jain, Kshitij Tayal, Xiuzhu Lin, Anirban Das, Sambit Sahu, Shi-Xiong Zhang

Affiliations: Capital One

AI summary: Proposes T1-Bench, a high-fidelity, comprehensive benchmark for evaluating agentic systems in realistic multi-domain customer scenarios, raising complexity and evaluative rigor through interleaved multi-turn interaction tasks.

Comments: Preprint

Abstract

Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic systems. However, existing benchmarks remain limited in task complexity, realism, and domain diversity, and often fail to capture interactions that span multiple domains, limiting their ability to evaluate agents in realistic multi-step settings that require sustained reasoning and coordination. To address these limitations, we introduce T1-Bench, a high-fidelity, comprehensive benchmark for evaluating agentic systems in realistic customer-facing, multi-domain environments, featuring interleaved scenarios that require structured reasoning across multi-turn user-assistant interactions and substantially increasing both compositional complexity and evaluative rigor across 25 domains of varying difficulty. We evaluate T1-Bench using 12 proprietary and open-weight models, providing a reproducible and standardized framework for assessing agent behavior, tool utilization, and conversational quality in complex, multi-step environments. We further complement automatic evaluation with human judgments to strengthen the assessment of qualitative performance. Overall, T1-Bench substantially advances prior benchmarks by increasing task complexity, interaction depth, and domain coverage in simulated multi-domain environments. To facilitate future research on agentic systems, we will publicly release data and evaluation code as open source.

2606.11063 2026-06-10 cs.AI cs.LG New submission

CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs

Joachim Schaeffer, Thomas Jiralerspong, Alexander Panfilov, Guillaume Lajoie, Jonas Geiping, Yoshua Bengio, Roland S. Zimmermann

Affiliations: MATS; Mila – Quebec AI Institute; Université de Montréal; Astra Fellowship; ELLIS Institute Tübingen, MPI for Intelligent Systems & Tübingen AI Center; LawZero; Google DeepMind

AI summary: Proposes the CIAware-Bench benchmark, which tests across four task domains whether models can distinguish their own trajectories from ones modified by a control intervention, finding that frontier models show low to moderate awareness under default settings, varying by task and model pair.

Abstract

AI control protocols oversee untrusted models by monitoring their actions and modifying potentially unsafe steps, often using a trusted model. This partially tampers with the untrusted model's trajectory. If the untrusted model detects such an intervention, it may infer properties of the monitor and adapt to evade control. We introduce CIAware-Bench, a benchmark for measuring control intervention (CI) awareness across frontier models. CIAware-Bench tests whether models can distinguish their own trajectories from those modified by a control intervention. The benchmark comprises four task domains (essay writing, BigCodeBench, Bash Arena, and SHADE-Arena) and varies trajectory watermarking, side-task presence, and the control protocol. Evaluating eleven frontier models, we find low to moderate CI awareness under default settings (balanced binary classification accuracy of up to 0.87, where random chance is 0.5), with substantial variation across task domains and model pairs. Detection is generally easier across model families, suggesting that models exploit provider-specific differences in style or post-training. Overall, CI awareness is not a fixed model-level property and should be measured for each new model release and deployment scenario. We release CIAware-Bench to track CI awareness and inform control protocols whose interventions are harder to detect.
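
For reference on the 0.5 chance level quoted in the abstract: balanced accuracy averages the recall on each class, so a detector that always guesses the same class scores 0.5 whatever the class proportions. A minimal Python sketch, assuming binary labels where 1 means the trajectory was modified (the function name and label convention are illustrative and not taken from the benchmark):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (1 = modified, 0 = unmodified)."""
    # True positives and true negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    # Average the recall on the positive and negative classes.
    return 0.5 * (tp / pos + tn / neg)
```

A detector that labels every trajectory as modified gets full recall on one class and zero on the other, which is why it lands at 0.5 even on an imbalanced set.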

2606.11052 2026-06-10 cs.CL New submission

Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It

Xinyu Zhou, Boyu Zhu, Yi Xu, Zhiwei Li, Yingfa Chen, Huiming Wang, Zhijiang Guo

Affiliations: LARK, HKUST(GZ); UCL; Mistral AI; Tsinghua University; SUTD; HKUST

AI summary: Finds that CoT supervised fine-tuning systematically degrades long-context recall in hybrid linear-attention models, and proposes QK-Restore, which repairs it by restoring the pre-fine-tuning query-key projection matrices with no additional training.

Comments: 28 pages

Abstract

Chain-of-thought (CoT) supervised fine-tuning (SFT) is widely adopted to improve reasoning ability, yet we find that it systematically degrades long-context recall in hybrid linear-attention models. Across architectures including HypeNet and Jet-Nemotron, retrieval performance on Needle-In-A-Haystack (NIAH) deteriorates substantially after CoT-SFT, and the degradation becomes more severe under harder retrieval settings and longer context windows. For example, HypeNet-9B on NIAH-S2@256K decreases from 67.2% to 9.4%. We attribute this to CoT-SFT biasing attention gradients toward short-range patterns, disrupting the query-key projections (W_Q, W_K) that are responsible for long-range routing. Motivated by this observation, we propose QK-Restore, a training-free method that restores only W_Q and W_K from the pre-SFT checkpoint while preserving all other post-SFT parameters. We further introduce a Procrustes variant to balance routing preservation and reasoning adaptation. Across architectures, QK-Restore consistently restores long-context capability at zero training cost while preserving reasoning performance; for instance, on HypeNet-5B it improves S3@256K from 65.4% to 76.4% while maintaining strong reasoning performance.
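
The restore step described in the abstract amounts to a selective merge of two checkpoints. The sketch below is one way it might look, not the authors' implementation: checkpoints are modelled as plain dictionaries, and the `q_proj` / `k_proj` name markers are a hypothetical naming convention that would need adjusting to the real parameter names. The Procrustes variant is not covered.

```python
def qk_restore(pre_sft, post_sft, qk_markers=("q_proj", "k_proj")):
    """Return post-SFT parameters with query/key projections taken from pre-SFT.

    pre_sft, post_sft: dicts mapping parameter names to weights.
    qk_markers: substrings assumed (hypothetically) to identify W_Q and W_K.
    """
    # Start from a shallow copy so the post-SFT checkpoint is left untouched.
    restored = dict(post_sft)
    for name, weight in pre_sft.items():
        # Overwrite only the parameters that look like query or key projections.
        if name in restored and any(marker in name for marker in qk_markers):
            restored[name] = weight
    return restored
```

All other parameters, such as value and output projections or MLP weights, keep their post-SFT values, which matches the abstract's description of preserving everything except W_Q and W_K.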

2606.11046 2026-06-10 cs.CL New submission

Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models

Prajakta Kini, Avinash Reddy, Souradip Chakraborty, Satya Sai Srinath Namburi GNVV, Furong Huang, Amrit Singh Bedi, Alvaro Velasquez

Affiliations: University of Colorado Boulder; University of Central Florida; University of Maryland College Park; University of Wisconsin-Madison

AI summary: Studies whether reasoning models produced via supervised fine-tuning, reinforcement learning, and distillation stay aligned across six trustworthiness dimensions such as safety, bias, and privacy, finding that they often show alignment regressions, including increased toxicity and amplified stereotyping.

Abstract

Instruction-tuned LLMs are increasingly converted into reasoning models through post-training to improve multi-step task performance. This conversion is usually optimized for reasoning accuracy, without explicitly preserving the alignment behavior of the instruction-tuned model, such as safe refusal, bias avoidance, and privacy protection. We ask: does this conversion preserve alignment? We study this question through a trustworthiness audit and find that it is not behavior-preserving by default. For a systematic analysis, we compare reasoning models produced via supervised fine-tuning, RL-based post-training, and distillation against matched instruction-tuned baselines across six trustworthiness dimensions: safety, toxicity, stereotyping and bias, machine ethics, privacy, and out-of-distribution robustness. We observe that reasoning models often improve on reasoning benchmarks but exhibit alignment regressions, including increased toxicity, amplified stereotyping, miscalibrated refusal, and contextual privacy leakage. These regressions are consistent with behavioral drift from the instruction-tuned baseline, measured by KL divergence. Overall, our results point to the broader conclusion that trustworthiness metrics are essential for evaluating reasoning models and should be reported alongside gains in reasoning capability.

2606.11045 2026-06-10 cs.AI cs.LG New submission

What Fits (Into Few Tokens) Doesn't Overfit: Compression and Generalization in ML Research Agents

Martin Andres Bertran, Aaron Roth, Zhiwei Steven Wu

Affiliations: Amazon Responsible AI; University of Pennsylvania; Carnegie Mellon University

AI summary: Studies whether LLM-driven research agents maintain performance under output and input compression, finding that short prompts and compressible feedback suffice to reproduce high-performance models, supporting the hypothesis that successful strategies lie in a low-complexity region.

Abstract

Reusing a held-out benchmark adaptively should, in principle, invite overfitting. Yet benchmark-driven machine learning (ML) has produced surprisingly little overfitting in practice. An attractive hypothesis is that successful ML strategies are highly compressible. We study this in the setting of LLM-driven research agents, where the hypothesis becomes directly testable via two complementary information bottlenecks. In output compression, an exploration agent adaptively searches for high-performance models using a validation set, and we test whether a fresh "reproducer agent" can reproduce its performance given only an extremely short prompt and the training data. In input compression, the explorer receives only one-bit feedback indicating whether each submitted model improves on the running best. Across 8 datasets spanning tabular classification, vision, language modeling, diffusion modeling, and reward modeling, we find that these bottlenecks have little effect on performance: short prompts and compressible feedback are sufficient to reproduce and find high-performance models. The hypothesis is falsifiable: when we deliberately induce validation-set overfitting, the results fail to reproduce with short prompts. Taken together, our results support a description-length explanation for the lack of overfitting in benchmark-driven ML: successful strategies occupy a low-complexity region of strategy space.
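
The input-compression bottleneck can be pictured as a filter that turns each validation score into a single bit. The following Python sketch is only an illustration under that reading: the function name is hypothetical, and treating ties as "no improvement" is an assumption, since the abstract does not say how ties are handled.

```python
def one_bit_feedback(scores):
    """Return one boolean per submission: did it strictly beat the running best?"""
    best = float("-inf")
    bits = []
    for score in scores:
        improved = score > best  # ties count as no improvement (assumption)
        bits.append(improved)
        if improved:
            best = score
    return bits
```

Under this scheme the explorer never sees the validation scores themselves, only whether each submission set a new best.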

2606.11037 2026-06-10 cs.RO New submission

Generation of Diverse and Functional Robot Designs using Superquadrics Parametrisation and Quality-Diversity

Leni Le Goff, Simon Smith, Emma Hart

Affiliations: Edinburgh Napier University

AI summary: Proposes a superquadrics (SQs)-based representation of robot bodies, combined with the quality-diversity algorithm MAP-Elites to increase morphological diversity and avoid premature convergence, achieving the highest QD-score in tests.

Comments: Accepted at PPSN 2026

Abstract

Generative design of robots requires navigating a vast search-space, encompassing physical configurations and behavioural parameters. Evolutionary Algorithms (EAs) have shown promising results, but often converge prematurely to a small set of sub-optimal designs. Most EAs fail to maintain sufficient diversity in the population that would allow the discovery of distinct functional robots. To counter premature convergence, we introduce a superquadrics-based representation (SQs) for robot bodies. SQs are interpretable, compact and computationally efficient mathematical representations of 3D geometrical shapes that can be tuned to specific design-spaces. To encourage morphological diversity, we combine this representation with a quality-diversity (QD) algorithm (MAP-Elites). We compare SQs and Compositional Pattern Producing Networks representations as generators of morphologies, combining them with standard EAs and MAP-Elites. In two test environments, we find that using SQs to generate morphology in conjunction with the MAP-Elites algorithm reaches the highest QD-score across both environments, maximising diversity of design and functionality of generated robots. The findings highlight the benefits of using a compact and interpretable geometric representation for exploring a complex design-space and suggest that combining SQs with an explicit diversity mechanism increases the quality and number of designs generated.
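
For readers unfamiliar with MAP-Elites, the core loop keeps the best solution found in each cell of a discretised behaviour space instead of a single population. The Python sketch below shows that loop on a toy problem; the two-gene genome, the one-dimensional descriptor, the mutation settings, and `toy_evaluate` are all illustrative assumptions and have nothing to do with the superquadric encoding or the robot environments in the paper.

```python
import random

def map_elites(evaluate, n_bins=10, iterations=2000, seed=0):
    """Minimal MAP-Elites: one elite per descriptor bin, kept if fitter."""
    rng = random.Random(seed)
    archive = {}  # bin index -> (fitness, genome)
    for _ in range(iterations):
        if archive and rng.random() < 0.9:
            # Mutate a randomly chosen elite with Gaussian noise, clipped to [0, 1].
            _, parent = rng.choice(list(archive.values()))
            child = [min(1.0, max(0.0, g + rng.gauss(0.0, 0.1))) for g in parent]
        else:
            # Occasionally sample a fresh random genome.
            child = [rng.random(), rng.random()]
        fitness, descriptor = evaluate(child)
        cell = min(n_bins - 1, int(descriptor * n_bins))
        # Insert if the cell is empty or the child beats the current elite.
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, child)
    return archive

def toy_evaluate(genome):
    """Toy task: fitness peaks at y = 0.5; x serves as the behaviour descriptor."""
    x, y = genome
    return -(y - 0.5) ** 2, x
```

Calling `map_elites(toy_evaluate)` returns an archive with at most `n_bins` elites, one per occupied descriptor bin. A QD-score is usually computed by summing the (non-negative) fitness values over the archive, so the toy fitness here would need shifting before being used that way.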

2606.11034 2026-06-10 cs.RO cs.NE New submission

A Spiking Neural Architecture for Coordinating Arm and Locomotor Control

Lea Steffen, Kathryn Simone, Graeme Damberger, Travis DeWolf, Hudson Ly, Chris Eliasmith

Affiliations: Centre for Theoretical Neuroscience; Dept. of Systems Design Engineering, University of Waterloo; Applied Brain Research; Dept. of Nanotechnology Engineering, University of Waterloo; Dept. of Philosophy, University of Waterloo

AI summary: Proposes a spiking neural network (SNN) architecture that uses the Neural Engineering Framework (NEF) and Semantic Pointer Architecture (SPA) to coordinate force-based arm control and bipedal locomotion in a humanoid robot, with high-level action selection by a basal ganglia model; reportedly the first integrated controller of this kind on a full-scale humanoid platform.

Abstract

Spiking Neural Networks (SNNs) coupled with neuromorphic hardware offer energy-efficient solutions for humanoid robot control. However, existing SNN-based motor control systems address bipedal locomotion and arm control in isolation, leaving integrated control of both unaddressed. We present a spiking architecture that coordinates force-based arm control and bipedal locomotion in a simulated humanoid, using the Neural Engineering Framework (NEF) and Semantic Pointer Architecture (SPA). High-level action selection between locomotor and arm control is mediated by a biologically grounded spiking basal ganglia model. We validate the system through co-simulation of Nengo, for the neural control, and Isaac Sim, demonstrating successful target reaching, continuous digit drawing, path-following locomotion, and finally, switching between walking and arm control via basal ganglia disinhibition. To our knowledge, this is the first integrated spiking controller to combine bipedal locomotion and arm control on a full-scale humanoid platform. The full spike-based implementation enables future deployment on low-power neuromorphic hardware.

2606.11033 2026-06-10 cs.LG cs.AI cs.CL New submission

AuRA: Internalizing Audio Understanding into LLMs as LoRA

Bo Cheng, Lei Shi, Zhanyu Ma, Yuan Wu, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He

Affiliations: Meituan; Jilin University

AI summary: Proposes AuRA, which uses layer-wise distillation to internalize an ASR encoder's speech representations into a LoRA-adapted LLM, enabling tightly coupled speech-language joint modeling and efficient parallel end-to-end inference, and outperforming cascaded systems and existing adaptation methods on multiple benchmarks.

Abstract

Recent efforts to extend large language models (LLMs) to speech inputs typically rely on cascaded ASR-LLM pipelines, end-to-end speech-language models, or bridge/distillation-based adaptation. While these routes respectively reuse strong pretrained components, enable native speech-language interaction, or offer lightweight adaptation, they often suffer from transcript-interface latency, costly multimodal training, or sequential speech-language coupling. To address these limitations, we present AuRA, a method that distills audio encoding capability into the LLM. Specifically, AuRA feeds the same speech input to an ASR encoder (as a teacher) and a LoRA-adapted LLM (as a student) through a lightweight audio embedding layer, and uses layer-wise distillation to align the student's hidden states with corresponding teacher representations, thereby internalizing speech representations into lightweight LLM-side adaptations. Compared with cascaded and serial bridge methods, AuRA enables tighter speech-language joint modeling and efficient parallel end-to-end inference, while also reusing pretrained speech and language models rather than requiring large-scale multimodal training. On multiple speech-language benchmarks, AuRA consistently outperforms cascaded systems, speech-to-LLM adaptation baselines, and large-scale speech-language and multimodal models in both effectiveness and efficiency.

2606.11032 2026-06-10 cs.CV New submission

U-TTT: Towards Generalizable PET Image Denoising via Test-Time Training

Zhiwen Yang, Jiayin Li, Hao Lu, Hui Zhang, Zihua Wang, Bingzheng Wei, Yan Xu

Affiliations: School of Biological Science and Medical Engineering, Beihang University; Department of Biomedical Engineering, Tsinghua University; School of Aerospace Engineering, Tsinghua University; ByteDance Inc.

AI summary: To address the performance degradation of PET image denoising models under distribution shift, proposes U-TTT, a model that integrates test-time training (TTT) layers to adjust parameters dynamically through self-supervision, with a dual-domain adaptation mechanism (spatial and frequency TTT layers), achieving state-of-the-art denoising and generalization on unseen dose levels and scanners.

Abstract

Existing deep learning models for Positron Emission Tomography (PET) image denoising often suffer from severe performance degradation under distribution shifts, fundamentally restricting their robust clinical deployment. This lack of generalization stems from the conventional paradigm of fixed-parameter models that cannot adapt to variations in test data (e.g., dose levels or scanner types) after training. To overcome this limitation and achieve robust generalization, we introduce U-TTT, a novel U-shaped model that integrates Test-Time Training (TTT) layers to dynamically adjust model parameters during inference through self-supervision, thereby adapting to the specific characteristics of each test instance. Furthermore, to comprehensively capture the complex degradations of 3D PET data, U-TTT features a dual-domain adaptation mechanism comprising a Spatial Test-Time Training (S-TTT) layer and a Frequency Test-Time Training (F-TTT) layer. The S-TTT layer captures and corrects spatial structural degradations, while the F-TTT layer suppresses global noise spectra and restores delicate high-frequency details. Extensive experiments demonstrate that U-TTT achieves state-of-the-art PET denoising performance and exhibits superior generalization under challenging distribution shifts, including both unseen dose levels and unseen scanners. Our code will be available at https://github.com/Yaziwel/U-TTT.

2606.11025 2026-06-10 cs.LG New submission

Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models

Bowen Ping, Xiangxin Zhou, Penghui Qi, Minnan Luo, Liefeng Bo, Tianyu Pang

Affiliations: Xi’an Jiaotong University; Tencent Hunyuan; National University of Singapore

AI summary: To address the structural flaw of PPO ratio clipping in flow matching models, proposes Flow-DPPO, which uses exact KL-divergence computation for Gaussian policies to impose a divergence proximal constraint, improving reward and training stability.

Abstract

Recent work has demonstrated that online reinforcement learning (RL) can substantially improve the quality and alignment of flow matching models for image and video generation. Methods such as Flow-GRPO and CPS cast the denoising process as a Markov Decision Process and apply PPO-style ratio clipping to enforce a trust region. However, we argue that ratio clipping is structurally ill-suited for flow models: the probability ratio between new and old policies is a noisy, single-sample estimate of the true policy divergence, leading to over-constraining in some regions of the trajectory and under-constraining in others. We propose Flow-DPPO (Flow Divergence Proximal Policy Optimization), which replaces ratio clipping with a divergence proximal constraint. A key observation is that the per-step policy in flow models is Gaussian, enabling exact and cheap computation of the KL divergence between old and new policies. Flow-DPPO employs an asymmetric divergence mask that blocks gradient updates only when they simultaneously move away from the trusted region and violate the divergence threshold. Experiments show that Flow-DPPO achieves higher rewards with better KL-proximal efficiency, alleviates catastrophic forgetting, promotes balanced multi-objective optimization, and enables stable multi-epoch training where ratio clipping degrades. Code and models are available at https://github.com/Tencent-Hunyuan/UniRL/tree/main/FlowDPPO.
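
The reason a Gaussian per-step policy makes the KL divergence cheap is that the divergence between two Gaussians has a closed form, so no sampling is needed. The Python sketch below computes it for diagonal Gaussians; the diagonal-covariance assumption and the function name are illustrative, and the asymmetric masking rule from the abstract is not reproduced here.

```python
import math

def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """KL( N(mu_p, diag(sigma_p^2)) || N(mu_q, diag(sigma_q^2)) ), summed over dimensions."""
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q):
        # Per-dimension closed form for two univariate Gaussians.
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return total
```

When both policies share the same standard deviation, the expression reduces to the squared distance between the means divided by twice the variance, which is why the divergence can be read directly off the two predicted means.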

2606.11019 2026-06-10 cs.RO cs.AI New submission

Diffusion Forcing Planner: History-Annealed Planning with Time-Dependent Guidance for Autonomous Driving

Zehan Zhang, Neng Zhang, Yaoyi Li, Jia Cai, Zhiling Wang

Affiliations: University of Science and Technology of China; Yinwang Intelligent Technology Co., Ltd; Hefei Institutes of Physical Science, Chinese Academy of Sciences

AI summary: Proposes the Diffusion Forcing Planner (DFP), which realizes a heterogeneous joint diffusion process through history-guided control and, combined with conditional guidance from annealed history, addresses temporal inconsistency in motion planning, achieving competitive performance on nuPlan.

Comments: CVPR 2026

Abstract

Learning-based motion planners, despite recent progress, often suffer from temporal inconsistency. Small perturbations across frames can accumulate into unstable trajectories, degrading comfort and safety in closed-loop driving. Several methods attempt to inject history as a static conditioning signal to stabilize outputs, only to induce the planner to copy historical patterns instead of adapting to environment contexts. To address this limitation, we propose Diffusion Forcing Planner (DFP), a diffusion-based planning framework driven by history-guided control. Specifically, DFP decomposes the full trajectory into history, current, and future segments, and assigns independent noise levels to each segment. The model jointly denoises the historical and the future segments, enforcing a heterogeneous joint diffusion process. At inference, classifier-free guidance (CFG) is applied to steer future sampling using annealed history in a controllable manner. Closed-loop evaluation and comprehensive ablations on nuPlan show that DFP achieves competitive performance while producing continuous, stable, and controllable motion plans in complex driving scenarios.
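
Classifier-free guidance combines a conditional and an unconditional model prediction by extrapolating from the second toward the first. The Python sketch below shows the standard combination rule on plain lists; using a single constant `scale` is an assumption made for brevity, whereas the paper's title suggests the guidance strength varies over time, with a schedule the abstract does not specify.

```python
def classifier_free_guidance(uncond, cond, scale):
    """Standard CFG: uncond + scale * (cond - uncond), applied elementwise."""
    # scale = 0 ignores the condition, scale = 1 recovers the conditional
    # prediction, and scale > 1 extrapolates past it.
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]
```

In the setting described above, `cond` would be the prediction conditioned on the (annealed) history and `uncond` the prediction without it, so `scale` controls how strongly the history steers the sampled future segment.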

2606.11018 2026-06-10 cs.CL New submission

Measuring Human Value Expression in Social Media Texts: Calibrated LLM Annotation and Encoder Transfer

测量社交媒体文本中的人类价值表达：校准的LLM标注与编码器迁移

Maria Milkova, Maksim Rudnev

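Affiliations: Independent researcher, Lisbon, Portugal; University of Waterloo, ON, Canada

AI summary: Uses calibrated LLM annotation and soft-label training to transfer value expression in social media text, defined by Schwartz's theory of basic human values, to an encoder model for scalable prediction.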

AI中文摘要

测量自然发生的社交媒体文本中的主观构念需要标注程序在理论上具有基础、经验上得到验证，并且可迁移到编码器模型以进行可扩展预测。使用根据Schwartz人类基本价值理论标注的非英语社交媒体帖子，我们研究了不同的LLM、提示和指令语言如何操作化文本中的价值表达。我们认为，尽管文本可能允许多种合理解释，但基于理论的价值定义可以约束解释并减少虚假的价值归因。除了精确率、召回率和F1分数，我们还评估了价值之间的结构对齐、错误结构、置信度-模糊度关系以及标注稳定性。我们表明，不同的LLM产生不同的价值解释。通过错误分析进行迭代提示校准减少了错误归因并提高了与专家标注的对齐。我们还从反复出现的错误结构中推导出有针对性的专家验证规则，并在语料库标注中使用它们。最后，我们表明，通过软标签训练，LLM标注可以迁移到编码器模型，保留基于理论的价值解释以及价值表达中不确定性的信息。

英文摘要

Measuring subjective constructs in naturally occurring social media text requires annotation procedures that are theoretically grounded, empirically validated, and transferable to an encoder model for scalable prediction. Using non-English social media posts annotated according to Schwartz's theory of basic human values, we investigate how different LLMs, prompts, and instruction languages operationalize the expression of values in text. We argue that although texts may permit multiple plausible interpretations, theory-based value definitions can constrain interpretations and reduce spurious value attributions. Beyond precision, recall, and F1, we evaluate structural alignment between values, error structure, confidence-ambiguity relations, and annotation stability. We show that different LLMs produce different value interpretations. Iterative prompt calibration through error analysis reduces misattributions and improves alignment with expert annotations. We also derive targeted expert verification rules from recurrent error structures and use them during corpus annotation. Finally, we show that LLM annotations can be transferred to an encoder model through soft-label training, retaining theory-based value interpretations and information about uncertainty in value expression.
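As a rough illustration of soft-label training, the sketch below computes a binary cross-entropy against targets in [0, 1] instead of hard 0/1 labels. It assumes one independent sigmoid output per value and treats LLM confidences as the soft targets; both are assumptions, since the abstract does not specify the loss. All names and numbers are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_bce(logits, soft_targets):
    """Mean binary cross-entropy against soft targets in [0, 1]."""
    eps = 1e-12  # keeps log() finite for saturated probabilities
    total = 0.0
    for z, t in zip(logits, soft_targets):
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(logits)


# Hypothetical LLM confidences for three values in one post.
targets = [0.9, 0.5, 0.0]
overconfident_logits = [6.0, 6.0, -6.0]                  # encoder outputs near 1, 1, 0
calibrated_logits = [math.log(0.9 / 0.1), 0.0, -6.0]     # encoder outputs 0.9, 0.5, near 0

print(soft_bce(overconfident_logits, targets))
print(soft_bce(calibrated_logits, targets))
```

The loss is lower when the encoder reproduces the annotator's uncertainty than when it commits to a hard decision, which is how information about ambiguity can survive the transfer.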

2606.11017 2026-06-10 cs.LG eess.AS New submission

Data-Driven Runway and Taxiway Exits Prediction of Landing Aircraft: A Case Study at Hartsfield-Jackson Atlanta International Airport

基于数据驱动的着陆飞机跑道和滑行道出口预测：以哈茨菲尔德-杰克逊亚特兰大国际机场为例

Alex Porcayo, Yutian Pang, Maria Thomas, John-Paul Clarke

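Affiliations: Department of Aerospace Engineering and Engineering Mechanics, The University of Texas at Austin

AI summary: To improve airport surface operations, proposes a two-stage data-driven decision aid that predicts the runway exit chosen by a landing aircraft and whether it then crosses the active departure runway. Validated on Atlanta airport data, XGBoost and LightGBM perform best, and approach speed is the key factor in exit choice.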

AI中文摘要

机场场面运行日益成为高吞吐量枢纽性能的瓶颈。本研究考察了哈茨菲尔德-杰克逊亚特兰大国际机场（KATL）的到达滑入决策，并提出了一种模拟管制员工作流程的两阶段数据驱动决策辅助工具。第一阶段预测到达飞机选择的跑道出口。第二阶段预测，在给定该出口的情况下，飞机是否会在指定点穿越活跃起飞跑道或使用绕行滑行道。模型使用ASDE-X场面轨迹、飞机特性、停机坪目的地、短期交通率和天气数据，跨越多个回顾时间窗口进行训练。我们基准测试了九种分类器，包括随机森林、XGBoost、LightGBM和CatBoost，并评估了准确率、宏F1、精确率-召回率行为、混淆矩阵、Brier分数和期望校准误差。在东西流向中，XGBoost和LightGBM优于随机森林。第一阶段达到0.86-0.89的准确率，宏F1分数为0.40-0.50；第二阶段达到0.70-0.74的准确率，宏F1分数为0.28-0.55。特征重要性分析表明，进近速度是出口选择的主要驱动因素。起飞率、穿越率、停机坪目的地，以及对于西流向，所选出口是预测穿越与绕行路径的最强预测因子。由于特征空间重叠，少数类仍然难以预测，如t-SNE和UMAP分析所示。所提出的框架通过校准的、可解释的预测支持管制员态势感知，同时保留人类对最终路由决策的责任。

英文摘要

Airport surface operations increasingly constrain performance at high-throughput hubs. This study examines arrival taxi-in decisions at Hartsfield-Jackson Atlanta International Airport (KATL) and proposes a two-stage, data-driven decision aid that mirrors controller workflow. Stage I predicts the runway exit selected by an arriving aircraft. Stage II predicts whether, given that exit, the aircraft will cross the active departure runway at a designated point or use the end-around taxiway. Models are trained using ASDE-X surface trajectories, aircraft characteristics, ramp destinations, short-horizon traffic rates, and weather across multiple look-back windows. We benchmark nine classifiers, including Random Forest, XGBoost, LightGBM, and CatBoost, and evaluate accuracy, macro-F1, precision-recall behavior, confusion matrices, Brier score, and Expected Calibration Error. Across east and west flows, XGBoost and LightGBM outperform Random Forest. Stage I achieves 0.86-0.89 accuracy with macro-F1 scores of 0.40-0.50, while Stage II achieves 0.70-0.74 accuracy with macro-F1 scores of 0.28-0.55. Feature-importance analysis shows that approach speed is the main driver of exit choice. Departure rate, crossing rate, ramp destination, and, for west flow, the selected exit are the strongest predictors of crossing versus end-around routing. Minority classes remain harder to predict because of feature-space overlap, as shown by t-SNE and UMAP analyses. The proposed framework supports controller situational awareness through calibrated, explainable predictions while preserving human responsibility for final routing decisions.
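The gap between the reported accuracy (0.86-0.89) and macro-F1 (0.40-0.50) is typical of imbalanced classes. The sketch below uses made-up exit labels, not the paper's data, to show how a classifier can score high accuracy while macro-F1 stays low because rarely used exits are never predicted.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical data: 18 landings use exit "A", one each uses "B" and "C".
y_true = ["A"] * 18 + ["B", "C"]
y_pred = ["A"] * 20  # a classifier that always predicts the majority exit

print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

Here accuracy is 0.9 while macro-F1 is roughly 0.32, because the two minority exits each contribute an F1 of zero.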

2606.11016 2026-06-10 cs.AI New submission

Superficial Beliefs in LLM Decision-Making

LLM决策中的表面信念

Gabriel Freedman, Francesca Toni

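Affiliations: Department of Computing, Imperial College London

AI summary: Examines whether large language models merely imitate rationales in binary choices. Synthetic decision experiments show that model behaviour is systematic, but self-reported drivers only partially match the behaviourally inferred ones, pointing to "superficial beliefs".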

Comments: Under review

AI中文摘要

我们询问大型语言模型（LLMs）在选择两个选项时是否仅仅模仿理由，或者它们的选择是否反映了系统性的潜在决策结构。使用合成二元决策设置，其中模型在由分级属性定义的档案之间进行选择，我们将模型声称最重要的属性与在拟合先前决策的行为模型下最能解释其选择的属性进行比较。行为模型很好地预测了保留的选择，表明模型行为与可见属性系统相关，而非随机。然而，直接自我报告和单独的基于分数的判断仅部分恢复了行为推断的驱动因素。由此产生的图景既不是任意行为，也不是完全明确的信念——输出结构足以支持预测，但显式理由仅不完美地追踪恢复的驱动因素。这种定性模式在提示顺序和采样扰动、替代行为模型、目标遮挡分析以及结构变化的决策设置中持续存在。我们将其解释为LLM决策中“表面信念”的证据：模型表现得好像由属性上的概率性局部优先级引导，同时对其决策驱动的属性只有有限的语言访问。

英文摘要

We ask whether large language models (LLMs) merely imitate rationales when choosing between two options, or whether their choices reflect a systematic underlying decision structure. Using synthetic binary decision settings in which models choose between profiles defined by graded attributes, we compare the attribute a model says mattered most with the attribute that best explains its choice under a behavioural model fit to prior decisions. The behavioural model predicts held-out choices well, showing that model behaviour is systematically related to the visible attributes rather than being random. However, direct self-reports and a separate score-based judge recover the behaviourally inferred driver only partially. The resulting picture is neither one of arbitrary behaviour nor one of fully articulated belief: outputs are structured enough to support prediction, but explicit reasons track the recovered driver only imperfectly. This qualitative pattern persists across prompt-order and sampling perturbations, alternative behavioural models, targeted occlusion analyses, and structurally varied decision settings. We interpret this as evidence for "superficial belief" in LLM decision-making: models behave as if guided by probabilistic local priorities over attributes, while having only limited verbal access to the attributes that drive their decisions.

2606.11015 2026-06-10 cs.AI New submission

Structure from Reasoning, Numbers from Search: On-Premise Open LLMs as Structural Priors for Coupled MIMO Controller Tuning

从推理中获取结构，从搜索中获取数值：本地部署的开放大语言模型作为耦合MIMO控制器整定的结构先验

Jiaxuan Chen, Haonan Li, Yang Shu

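Affiliations: Jinling Institute of Technology; College of Water Resources and Civil Engineering, China Agricultural University; Zhejiang University

AI summary: For strongly coupled MIMO processes, proposes using on-premise open-source large language models as a structural prior: the LLM reasons about the coupling to propose an asymmetric structure, and a classical optimizer refines it, giving sample-efficient and interpretable controller tuning.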

Comments: 10 pages, 7 figures, 6 tables. Submitted to IEEE Access

AI中文摘要

为强耦合多输入多输出（MIMO）工业过程整定控制器是困难的：分散式经典自整定忽略了回路交互，而自然初始化的局部数值优化会在由此产生的非凸代价景观中停滞。我们探究能否使用本地部署的开源大语言模型（LLMs），它们将数据保留在本地且无需过程模型。在单回路CSTR上，经典继电反馈整定（IAE 0.106，接近最优0.102）优于LLM整定器（0.162）：对于简单回路，LLM没有增加价值。而在具有冲突设定点的强耦合四水箱系统上，情况发生逆转，通过惩罚代价J = IAE + lambda*TV(u)评分，该代价奖励无致动器抖动的跟踪。在那里，朴素继电整定（J ~ 28.6）和朴素LLM整定（29.7）不比开环（22.7）更好，且从平衡初始化的局部优化器在10/10次运行中均失败。相反，一个经过支架设计的开放LLM推理耦合关系，提出反直觉的非对称结构，并从任何初始点达到J ~ 16.9 +/- 0.2；用经典优化器精炼后，达到平滑全局最优（J ~ 12.0，10/10 vs. 0/10），该最优甚至应用了分散式整定无法实现的非明显负积分校正。全局优化器（差分进化）也能达到该最优，因此LLM并非唯一途径；其优势在于样本效率和可解释性：在18次评估中即可得到可用控制器（此时全局优化器比开环更差），并附带明确的理由。这一优势随维度增长，在3x3过程上评估次数减少约6倍。该行为在四个开放模型上泛化，而在良性过程中LLM没有优势，从而划清了边界。我们贡献了一个可复现的基准测试，界定了开放LLM在控制整定中何时有帮助：不是作为优化器，而是作为样本高效、可解释的结构先验。

英文摘要

Tuning controllers for strongly coupled multi-input multi-output (MIMO) industrial processes is hard: decentralized classical auto-tuning ignores loop interaction, and local numerical optimization from natural initializations stalls in the resulting non-convex cost landscape. We ask whether on-premise open-source large language models (LLMs), which keep data on-site and need no plant model, can help. On a single-loop CSTR, classical relay-feedback tuning (IAE 0.106, near the 0.102 optimum) beats an LLM tuner (0.162): for simple loops the LLM adds nothing. The picture inverts on a strongly coupled quadruple-tank with conflicting set-points, scored by a penalized cost J = IAE + lambda*TV(u) that rewards tracking without chattering actuators. There, naive relay tuning (J ~ 28.6) and naive LLM tuning (29.7) are no better than open loop (22.7), and a local optimizer from balanced starts fails in 10/10 runs. A scaffolded open LLM instead reasons about the coupling, proposes the counter-intuitive asymmetric structure, and reaches J ~ 16.9 +/- 0.2 from any start; refining it with a classical optimizer attains the smooth global optimum (J ~ 12.0, 10/10 vs. 0/10), which even applies a non-obvious negative integral correction decentralized tuning cannot. A global optimizer (differential evolution) also reaches this optimum, so the LLM is not the only route; its advantage is sample efficiency and interpretability: a usable controller in 18 evaluations (where the global optimizer is worse than open loop) plus a stated rationale. This edge grows with dimension, reaching ~6x fewer evaluations on a 3x3 plant. The behaviour generalizes across four open models, and on a benign plant the LLM offers no advantage, sharpening the boundary. We contribute a reproducible benchmark delimiting when open LLMs help in control tuning: not as optimizers, but as a sample-efficient, interpretable structural prior.
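The penalized cost J = IAE + lambda*TV(u) can be sketched for sampled signals as follows. This is a single-channel illustration under assumed conventions (rectangle-rule integration, total variation as the sum of absolute step-to-step changes in the control signal); how the paper aggregates over loops and chooses lambda is not stated in the abstract, and the signals below are invented.

```python
def iae(setpoint, output, dt):
    """Integral of absolute error, approximated with the rectangle rule."""
    return sum(abs(r - y) for r, y in zip(setpoint, output)) * dt


def total_variation(u):
    """Sum of absolute changes between consecutive control samples."""
    return sum(abs(b - a) for a, b in zip(u, u[1:]))


def penalized_cost(setpoint, output, u, dt, lam):
    return iae(setpoint, output, dt) + lam * total_variation(u)


# Hypothetical signals: same tracking error, different actuator activity.
setpoint = [1.0, 1.0, 1.0, 1.0]
output = [0.0, 0.5, 1.0, 1.0]
smooth_u = [0.0, 1.0, 1.0, 1.0]
chattering_u = [0.0, 1.0, 0.0, 1.0]

print(penalized_cost(setpoint, output, smooth_u, dt=0.5, lam=0.1))
print(penalized_cost(setpoint, output, chattering_u, dt=0.5, lam=0.1))
```

Both controllers track equally well, but the chattering actuator pays a larger total-variation penalty, so its cost is higher.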

2606.11012 2026-06-10 cs.CV New submission

An Uncertainty Estimation Framework for Dose Accumulation in Adaptive Radiotherapy: Application to CBCT-Guided Radiotherapy for Cervical Cancer

自适应放疗中剂量累积的不确定性估计框架：应用于宫颈癌CBCT引导放疗

Cedric Hemon, Delphine Lebret, Jean-Claude Nunes, Valentin Boussot, Karine Peignaux, Nathalie Mesgouez-Nebout, Chantal Hanzen, Antoine Simon, Anaïs Barateau, Renaud de Crevoisier, Caroline Lafond

Affiliations: Univ. Rennes, CLCC Eugène Marquis, INSERM, LTSI - UMR 1099; Department of Radiation Oncology, Centre Georges-Francois Leclerc; Institut de Cancérologie de l’Ouest–Site Paul Papin; CLCC Henri Becquerel

AI summary: Proposes the IMPACT-DoseAcc framework, which quantifies DIR uncertainty with two strategies, Bayesian segmentation guidance and an ensemble of segmentation models, and propagates it to accumulated-dose metrics. Applied to CBCT-guided oART for cervical cancer, it shows calibrated uncertainty that correlates with geometric error.

Comments: Under revision

AI中文摘要

背景与目的：oART能够每日根据分次间解剖变化调整计划，但累积剂量估计仍受限于DIR、分割和解剖不确定性。我们在IMPACT中引入IMPACT-DoseAcc，一个不确定性感知的剂量累积框架，用于语义特征驱动的图像分析。该框架具有模态和疾病无关性，并应用于宫颈癌（LACC）的CBCT引导oART。材料与方法：回顾性分析9例LACC患者，使用每日CBCT衍生的虚拟CT进行剂量重新计算。IMPACT-DoseAcc关注DIR引起的不确定性，不建模vCT生成的不确定性。在IMPACT-Reg中测试了两种DIR不确定性策略：一种贝叶斯分割引导方法，使用一个概率模型量化解剖不确定性；以及一个针对结构的分割模型集成，以捕获认知变异性。体素级不确定性图通过剂量变形和累积传播，生成概率剂量体积直方图。集成不确定性通过变形场上的体素级标准差量化，几何误差通过变形轮廓与验证轮廓之间的表面距离评估。解剖变异性加权优化了聚合。结果：集成DIR不确定性与几何误差相关，CTVt和膀胱的Pearson系数分别为0.63和0.66。对于CTVt，pDVH达到96.3±3.9%的覆盖率，显示传播不确定性的校准。加权稳定了各分次和器官的估计。结论：IMPACT-DoseAcc将配准驱动的不确定性传播至累积剂量指标，改进了解剖变化下累积剂量的解释。其3DSlicer集成支持可重复、不确定性知情的ART工作流程。

英文摘要

Background and purpose: oART enables daily plan adaptation to interfraction anatomical variations, but cumulative dose estimation remains limited by DIR, segmentation, and anatomical uncertainties. We introduce IMPACT-DoseAcc, an uncertainty-aware dose accumulation framework, within IMPACT for semantic feature-driven image analysis. The framework is modality- and disease-agnostic and is applied to CBCT-guided oART for cervical cancer (LACC). Material and Methods: Nine LACC patients were retrospectively analyzed using daily CBCT-derived virtual CTs for dose recalculation. IMPACT-DoseAcc focuses on uncertainty from DIR, without modeling vCT-generation uncertainty. Two DIR uncertainty strategies were tested within IMPACT-Reg: a Bayesian segmentation-guided approach using one probabilistic model to quantify anatomical uncertainty, and an ensemble of segmentation models targeting structures to capture epistemic variability. Voxel-wise uncertainty maps were propagated through dose warping and accumulation to generate probabilistic dose-volume histograms. Ensemble uncertainty was quantified from voxel-wise standard deviation across deformation fields, and geometric error was assessed using surface distance between warped and validated contours. Anatomical-variability weighting refined aggregation. Results: Ensemble DIR uncertainty correlated with geometric error, with Pearson coefficients of 0.63 for CTVt and 0.66 for bladder. For CTVt, pDVHs achieved 96.3 +/- 3.9% coverage, showing calibration of propagated uncertainty. Weighting stabilized estimates across fractions and organs. Conclusions: IMPACT-DoseAcc propagates registration-driven uncertainty to cumulative dose metrics, improving interpretation of accumulated dose under anatomical variations. Its 3DSlicer integration supports reproducible, uncertainty-informed ART workflows.

2606.11001 2026-06-10 cs.CV New submission

IPSM-Bench: A New Intermediate Phase Segmentation Benchmark in Microstructure Images of Zinc-Based Absorbable Biomaterials

IPSM-Bench：锌基可吸收生物材料显微图像中的新中间相分割基准

Jinglin Xu, Shangyan Zhao, Jiabo Wang, Xinghong Mu, Yulong Lei, Jiacheng Zhang, Hongbo Sun, Yageng Li

Affiliations: School of Artificial Intelligence, University of Science and Technology Beijing; School of Advanced Materials Innovation, University of Science and Technology Beijing; China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd; School of Materials Science and Engineering, University of Science and Technology Beijing; Institute of Materials Intelligent Technology, Liaoning Academy of Materials; School of Big Data and Software Engineering, Chongqing University

AI summary: To address data scarcity, low contrast, and other challenges in intermediate phase segmentation of zinc alloys, builds IPSM-Bench, the largest high-quality dataset for the task, and proposes SCoP-SAM, a SAM method guided by spatial context priors, which achieves the best segmentation performance.

Comments: Accepted by IJCAI 2026

AI中文摘要

锌基合金是不可或缺的新兴可吸收金属生物材料，其宏观性能受微观结构特征控制。中间相——关键的微观结构成分——在调节机械和功能性能中起关键作用。然而，锌合金显微组织中的中间相分割面临严峻挑战：标注数据集稀缺、对比度低、小目标检测困难以及形态异质性。为此，我们构建了IPSM-Bench，这是用于锌合金中间相分割的最大高质量数据集。此外，我们提出了SCoP-SAM，一种新的空间上下文先验引导的SAM方法，利用中间相的梯度结构和灰度属性捕获空间上下文先验，并将其融入整个SAM编码-解码过程，从而提升分割性能。基于提出的IPSM-Bench，我们建立了中间相分割的新基准，以系统评估最先进方法并推动锌合金微观结构分析研究。在IPSM-Bench和额外的公共合金基准上的大量实验表明，我们的SCoP-SAM不仅在锌合金中间相分割上实现了最先进性能，而且对其他合金场景也具有显著的泛化能力。

英文摘要

Zinc-based alloys are indispensable emerging absorbable metallic biomaterials, and their macroscopic performance is governed by microstructural characteristics. Intermediate phases, key microstructural constituents, are pivotal in regulating mechanical and functional properties. However, intermediate phase segmentation in zinc alloy microstructures faces formidable challenges: scarce annotated datasets, low contrast, difficulty detecting small targets, and heterogeneous morphologies. To this end, we construct IPSM-Bench, the largest high-quality dataset for zinc-alloy intermediate phase segmentation. Furthermore, we propose SCoP-SAM, a new Spatial Context Prior-guided SAM method that leverages the gradient structure and grayscale properties of intermediate phases to capture spatial context priors and incorporates them into the entire SAM encoding-decoding process, improving segmentation performance. Based on the proposed IPSM-Bench, we establish a new benchmark for intermediate phase segmentation to systematically evaluate state-of-the-art (SOTA) methods and advance research on zinc alloy microstructure analysis. Extensive experiments on IPSM-Bench and additional public alloy benchmarks demonstrate that our SCoP-SAM not only achieves SOTA performance for zinc-alloy intermediate phase segmentation but also generalizes remarkably well to other alloy scenarios.

2606.10989 2026-06-10 cs.AI New submission

Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning

零空间约束的低秩自适应用于指定响应的大型语言模型遗忘

Bocheng Ju, Jianhua Wang, Chengliang Liu, Xiaolin Chang

Affiliations: Beijing Key Laboratory of Security and Privacy in Intelligent Transportation, Beijing Jiaotong University; College of Computer Science and Technology, Taiyuan University of Technology; Institute of Computing Technologies, China Academy of Railway Sciences Corporation Limited

AI summary: Proposes a null-space constrained, response-specified unlearning framework that uses an orthogonally projected low-rank parameterization to confine LoRA updates to the null space of the retain subspace, suppressing forgotten knowledge while preserving model utility.

AI中文摘要

大型语言模型遗忘旨在抑制指定的不良知识，同时保留良性能力。许多遗忘目标侧重于抑制不良答案，而最近的目标引导变体指定了替换行为，但仍使更新局部性在很大程度上不受约束。本文介绍了零空间约束响应指定遗忘（NSRU），一种用于受控LLM遗忘的投影约束低秩框架。NSRU使用明确结构化的安全目标响应来指定每个遗忘查询的期望行为，同时抑制原始不良内容。为了定位自适应，NSRU从良性隐藏表示中估计每个模块的保留子空间，并使用正交投影的低秩参数化将LoRA更新限制在保留子空间的零空间内。由此产生的目标在此约束参数化下联合优化安全目标学习、不良响应抑制和保留保持。我们提供了一个局部一阶分析，表明投影更新减少了保留侧的扰动，同时保留了用于塑造遗忘查询行为的可编辑方向。在TOFU上的实验表明，NSRU有效抑制了可提取的遗忘集知识，同时提高了保留QA性能、模型效用和安全目标对齐，优于代表性基线。在WMDP上，NSRU将危险领域准确率保持在接近随机选择的区域，同时保留了广泛和领域相邻的MMLU效用。消融研究支持安全目标监督、不良响应抑制、保留损失和零空间投影更新的互补作用，而敏感性和鲁棒性分析表明在测试的超参数和提示变化下行为稳定。

英文摘要

Large language model unlearning aims to suppress designated undesirable knowledge while preserving benign capabilities. Many unlearning objectives focus on suppressing undesired answers, while recent target-guided variants specify replacement behavior but still leave update locality largely unconstrained. This paper introduces Null-Space Constrained Response-Specified Unlearning (NSRU), a projection-constrained low-rank framework for controlled LLM unlearning. NSRU uses an explicitly structured safe target response to specify the desired behavior for each forget query, while suppressing the original undesired content. To localize adaptation, NSRU estimates per-module retain subspaces from benign hidden representations and uses an orthogonal-projected low-rank parameterization to confine LoRA updates to the null space of the retain subspace. The resulting objective jointly optimizes safe-target learning, undesired-response suppression, and retention preservation under this constrained parameterization. We provide a local first-order analysis showing that the projected update reduces retain-side perturbations while preserving editable directions for shaping forget-query behavior. Experiments on TOFU show that NSRU effectively suppresses extractable forget-set knowledge while improving retain QA performance, model utility, and safe-target alignment over representative baselines. On WMDP, NSRU keeps hazardous-domain accuracy near the random-choice region while preserving broad and domain-adjacent MMLU utility. Ablation studies support the complementary roles of safe-target supervision, undesired-response suppression, retention loss, and null-space projected updates, while sensitivity and robustness analyses indicate stable behavior across the tested hyperparameter and prompt variations.
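The null-space idea can be illustrated with a tiny pure-Python sketch: build an orthonormal basis Q for the retain subspace, then multiply the weight update by P = I - Q Q^T so that it has no effect on retain-side inputs. This uses Gram-Schmidt on a handful of made-up vectors as a stand-in; how NSRU actually estimates the subspace and parameterizes the LoRA factors is not described in the abstract, and every name and number below is hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt basis for the span of `vectors` (the retain subspace)."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def project_to_null_space(delta_w, retain_basis):
    """Apply P = I - Q Q^T to each row of the update matrix."""
    projected = []
    for row in delta_w:
        r = list(row)
        for q in retain_basis:
            c = dot(r, q)
            r = [ri - c * qi for ri, qi in zip(r, q)]
        projected.append(r)
    return projected


def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]


retain_hidden = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # hypothetical benign activations
delta_w = [[0.5, -0.2, 0.3], [0.1, 0.4, -0.6]]       # hypothetical low-rank update
safe_delta = project_to_null_space(delta_w, orthonormal_basis(retain_hidden))

print(matvec(safe_delta, [1.0, 1.0, 0.0]))  # effect on a retain-side input
print(matvec(safe_delta, [0.0, 0.0, 1.0]))  # effect on an orthogonal input
```

The projected update leaves retain-side inputs unchanged while acting on the orthogonal direction exactly as the unprojected update would, which mirrors the abstract's claim about reducing retain-side perturbations while keeping editable directions.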

2606.10988 2026-06-10 cs.CV cs.GR New submission

AnimaSpark: A Feed-Forward Method for Animating Arbitrary 3D Objects

AnimaSpark: 一种用于任意3D对象动画的前馈方法

Yiming Zhao, Haoyu Sun, Aoyu Wang

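Affiliations: Bytedance

AI summary: Proposes the AnimaSpark pipeline, which renders a rigged 3D model into multi-layered image representations, feeds them to a video generation model, then extracts 2D keypoint motion and lifts it into 3D space for category-agnostic 3D animation generation, outperforming existing methods in text-motion alignment, motion quality, and computational efficiency.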

AI中文摘要

尽管生成式AI的最新进展显著加速了静态3D模型创建流程，但类别无关的3D动画合成仍然是3D资产生产中的主要瓶颈。当前的类别无关动画生成方法在推理速度、运动质量和文本提示遵循方面存在关键限制，使得该过程仍依赖于劳动密集型的手工艺术。为解决这些挑战，本文介绍了AnimaSpark，一种用于类别无关3D动画生成的新型管道。我们的方法受以下关键洞察驱动：对于3D世界中的许多基本运动，相应的关节变换通常可以在二维子空间内有效建模。该管道首先将带骨骼的静态3D模型渲染为其网格和骨架的多层图像表示，随后将其输入视频生成模型。然后，我们在生成的视频上采用关键点跟踪算法，捕获投影到相机观察平面上的骨骼关节运动。在最后阶段，我们从这些跟踪的关键点中提取平面平移和旋转，并将其从2D域提升到3D空间以驱动角色动画。全面评估表明，我们的方法在关键指标（包括文本-运动对齐、运动质量和计算效率）上优于现有最先进技术。

英文摘要

While recent advancements in generative AI have substantially accelerated static 3D model creation workflows, the synthesis of category-agnostic 3D animations remains a significant bottleneck in 3D asset production. Current methods for category-agnostic animation generation exhibit critical limitations in inference speed, motion quality, and adherence to textual prompts, thereby leaving the process dependent on labor-intensive manual artistry. To address these challenges, this paper introduces AnimaSpark, a novel pipeline for category-agnostic 3D animation generation. Our approach is motivated by the key insight that for many fundamental motions in the 3D world, the corresponding joint transformations can often be effectively modeled within a two-dimensional subspace. The pipeline begins by rendering a rigged static 3D model into multi-layered image representations of its mesh and skeleton, which are subsequently fed into a video generation model. We then employ a keypoint tracking algorithm on the generated video to capture the motion of the skeletal joints projected onto the camera's viewing plane. In the final stage, we distill the planar translations and rotations from these tracked keypoints and lift them from the 2D domain into 3D space to animate the character. Comprehensive evaluations reveal that our method achieves superior performance over existing state-of-the-art techniques across key metrics, including text-motion alignment, quality of motion, and computational efficiency.

2606.10979 2026-06-10 cs.AI New submission

Bellman-Taylor Score Decoding for Markov Decision Processes with State-Dependent Feasible Action Sets

具有状态依赖可行动作集的马尔可夫决策过程的贝尔曼-泰勒分数解码

Yi Chen, Rushuai Yang, Qiang Chen, Dongyan Huo

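Affiliations: Hong Kong University of Science and Technology

AI summary: For MDPs with state-dependent feasible action sets, proposes the Bellman-Taylor score decoding framework, which moves policy learning to a Euclidean score space and guarantees feasibility through an action decoder. The optimality gap is shown to decompose into a structural approximation error and an algorithmic learning error, and the method performs well on queueing network control.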

AI中文摘要

运筹学中的许多马尔可夫决策过程（MDP）具有状态依赖的可行动作，这些动作由各种操作约束隐式定义。这些特性使得难以使用标准的深度强化学习（DRL）算法，因为它们的动作接口通常假设固定的有限动作目录或简单的欧几里得空间。受最优动作价值函数的泰勒展开的启发，我们提出了贝尔曼-泰勒分数解码，这是一个将策略学习转移到欧几里得分数空间，同时通过动作解码器强制执行可行性的框架。由此产生的潜在分数MDP可以通过标准的DRL算法进行优化，而无需通过解码器进行微分。我们提供了一个性能保证，表明该方法的最优性间隙分解为结构近似误差和算法学习误差。最后，我们将该框架应用于排队网络控制问题，其中策略本质上学习了一个状态依赖的基于索引的调度规则。数值实验显示，在小规模实例中性能接近最优，在较大系统中相比基准有显著改进。

英文摘要

Many Markov decision processes (MDPs) in operations research have feasible actions that are state dependent and defined implicitly by various operational constraints. These features make it difficult to use standard deep reinforcement learning (DRL) algorithms, whose action interfaces typically assume either a fixed finite action catalog or a simple Euclidean space. Motivated by a Taylor expansion of the optimal action-value function, we propose Bellman-Taylor score decoding, a framework that moves policy learning to a Euclidean score space while enforcing feasibility through an action decoder. The induced latent-score MDP can then be optimized by standard DRL algorithms without differentiating through the decoder. We provide a performance guarantee showing that the optimality gap of this approach decomposes into a structural approximation error and an algorithmic learning error. Lastly, we apply this framework to a queueing network control problem, where the policy essentially learns a state-dependent index-based dispatching rule. Numerical experiments show near-optimal performance in small instances and considerable improvements over benchmarks in larger systems.
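To make the decoder idea concrete, here is a deliberately simple sketch for a single-server queueing setting: the policy emits one unconstrained score per queue, and a decoder turns the scores into a feasible action by serving the highest-scoring queue that is not empty. This decoder is an assumption chosen for illustration; the paper's decoder and constraint structure are not given in the abstract.

```python
def decode_action(scores, queue_lengths):
    """Map Euclidean scores to a feasible action.

    Returns the index of the nonempty queue with the highest score,
    or None (idle) when every queue is empty.
    """
    feasible = [i for i, n in enumerate(queue_lengths) if n > 0]
    if not feasible:
        return None
    return max(feasible, key=lambda i: scores[i])


# Hypothetical scores produced by a learned policy for three queues.
scores = [2.0, 5.0, 1.0]

print(decode_action(scores, [3, 2, 4]))  # all queues feasible
print(decode_action(scores, [3, 0, 4]))  # top-scoring queue is empty
print(decode_action(scores, [0, 0, 0]))  # nothing to serve
```

Because feasibility is enforced inside the decoder, the learner can output any real-valued scores, and the resulting rule behaves like an index-based dispatching policy whose indices may depend on the state.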

2606.10974 2026-06-10 cs.RO New submission

Language-Driven Cost Optimization for Autonomous Driving

语言驱动的自动驾驶成本优化

Diego Martinez-Baselga, Khaled Mustafa, Javier Alonso-Mora

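Affiliations: TU Delft

AI summary: Proposes a language-driven framework in which a large language model interprets scenarios and user queries to generate parameters for a risk-aware MPPI controller, with human-in-the-loop validation and feedback for iterative refinement of autonomous driving behavior.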

Comments: Paper accepted at IEEE Intelligent Transportation Systems Conference (ITSC) 2026

AI中文摘要

自动驾驶车辆的驾驶行为通常由其运动规划器的成本函数控制，该函数编码了速度跟踪、平滑性、车道保持和碰撞避免等目标。然而，调整构成该成本函数的参数是一项需要技术专长的挑战性任务，限制了车辆适应不断变化的交通场景或最终用户偏好的能力。本文提出了一种语言驱动的自适应成本设计框架，用于自动驾驶。大语言模型（LLM）解释结构化场景描述和自然语言用户查询，生成应用于风险感知模型预测路径积分（MPPI）控制器的参数。该系统包含一个人在环验证阶段，在该阶段中，拟议的行为变化以非技术语言描述，并在部署前确认。用户还可以在部署前后提供反馈，从而实现车辆运动行为的迭代优化。该框架在真实驾驶场景中通过多个查询进行评估，以评估其有效性。仿真结果表明，该方法成功诱导了与预期要求一致的行为变化，且方式直观，从而弥合了智能车辆控制系统与最终用户之间的差距。

英文摘要

The driving behavior of autonomous vehicles is typically governed by the cost function of their motion planner, which encodes objectives such as speed tracking, smoothness, lane keeping, and collision avoidance. However, tuning the parameters that shape this cost function is a challenging task that requires technical expertise, limiting the vehicle's ability to adapt to evolving traffic scenarios or end-user preferences. This work presents a language-driven framework for adaptive cost design in autonomous driving. A Large Language Model (LLM) interprets structured scenario descriptions and natural language user queries to generate the parameters applied to a risk-aware Model Predictive Path Integral (MPPI) controller. The system incorporates a human-in-the-loop validation stage in which the proposed behavioral changes are described in non-technical language and confirmed prior to deployment. Users may additionally provide feedback either before or after deployment, enabling iterative refinement of the vehicle's motion behavior. The framework is evaluated across multiple queries in realistic driving scenarios to assess its effectiveness. Simulation results demonstrate that the method successfully induces behavioral changes that align with the intended requirements in an intuitive manner, thereby bridging the gap between intelligent vehicle control systems and end users.

2606.10971 2026-06-10 cs.RO cs.SY eess.SY New submission

Resilient Navigation for Autonomous Farm Robots by Leveraging Jerk-Augmented Models with IMU-Only Disturbance Rejection

利用基于加加速度增强模型与仅IMU干扰抑制的自主农业机器人弹性导航

Batu Candan, Mohammed Atallah, Simone Servadio, Saeed Arabi

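Affiliations: Iowa State University; Salin247

AI summary: To handle sensor outages and vibration in farm robots, proposes a jerk-augmented EKF with multiple-tuning-factor adaptation that dynamically adjusts the measurement covariance and significantly reduces 3D position RMSE.

2606.10967 2026-06-10 cs.CV New submission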

Quo Vadis, Visual In-Context Learning? A Unified Benchmark Across Domains and Tasks

视觉上下文学习何去何从？跨领域与任务的统一基准

Pradnya Halady, Jiale Wei, Zdravko Marinov, Alexander Jaus, Simon Reiß

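Affiliations: Karlsruhe Institute of Technology

AI summary: Because evaluation of visual in-context learning has been limited to tasks that mirror pre-training, builds VIBE, a unified benchmark across domains and tasks, and tests 6 models on 14 datasets and 12 tasks, revealing their adaptation capabilities, limitations, and failure modes.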

AI中文摘要

视觉上下文学习被提出作为动态模型的一种途径，这些模型可以根据提供的上下文生成预测，从而在测试时适应新的视觉任务。然而，对这些模型适应能力的评估一直局限于狭窄的设置，主要反映预训练中的任务或图像领域，而实际适应并不需要。我们通过构建一个广泛的视觉上下文基准（VIBE），重点关注多样化的成像领域和广泛的任务，来解决这一差距。借此，我们能够更清晰地了解视觉上下文模型在面对新的图像和任务分布时的适应能力。我们在14个数据集和12个任务上对六个模型进行了压力测试（总共探索了106个数据集-任务组合），并在统一的、可重复的评估协议下，以一次学习设置进行比较。我们的评估揭示了视觉上下文学习现状的关键见解，包括局限性、系统性失败模式和有前景的方向。为了促进更广泛的评估，我们将公开发布我们的VIBE工具包。

英文摘要

Visual in-context learning has been proposed as a pathway towards dynamic models that can generate predictions based on a provided context and can thereby adapt to new vision tasks at test time. Yet, the evaluation of the adaptation capabilities of these models has been limited to narrow setups that mainly mirror tasks or image domains from pre-training, for which real adaptation is not required. We address this gap by constructing a broad Visual In-Context BEnchmark (VIBE) with a focus on diverse imaging domains and a wide range of tasks. With this, we are able to get a much clearer picture of the adaptive capabilities of visual in-context models when faced with new image and task distributions. We stress test six models on 14 datasets and 12 tasks (in total, we explore 106 dataset-task combinations) and compare them under a unified, reproducible evaluation protocol, in a one-shot setting. Our evaluation uncovers key insights on the state of visual in-context learning, including limitations, systematic failure modes, and promising directions. To foster broader evaluation, we will openly release our VIBE toolkit.

2606.10959 2026-06-10 cs.LG New submission

Population-Aware Physics-Informed Neural Particle Flow for Bayesian Update

群体感知的物理信息神经粒子流用于贝叶斯更新

Batu Candan, Simone Servadio

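Affiliations: Iowa State University

AI summary: Proposes population-aware physics-informed neural particle flow (PA-PINPF), which augments particle updates with a population encoder and outperforms the standard method in Bayesian posterior transport.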

AI中文摘要

物理信息神经粒子流（PINPF）学习一个确定性传输场，该场将粒子从先验分布移动到贝叶斯后验分布，同时强制执行控制概率演化的方程。然而，标准PINPF速度模型独立处理每个粒子，因此不显式地将其传输决策基于经验粒子群体。本文引入了群体感知的PINPF（PA-PINPF），它通过整个粒子集的置换不变深度集表示来增强每个粒子的更新。我们研究了两种群体编码器。PA-PINPF-State总结粒子状态，而PA-PINPF-Feature总结完整的局部物理信息特征向量，包括粒子位置、伪时间、测量信息、似然值和得分信息。后者使得群体上下文不仅能表示粒子云几何，还能表示群体级别的贝叶斯传输几何。这些方法保留了原始无监督的物理信息残差目标，并且在训练过程中不需要真实后验样本。在距离测量任务和非线性到达时间差后验传输上的实验表明，两种群体感知变体均优于逐粒子PINPF，而特征群体编码提供了最强的性能。这些结果表明，群体级别的物理特征为学习贝叶斯粒子传输提供了有用的全局信息。

英文摘要

Physics-informed neural particle flow (PINPF) learns a deterministic transport field that moves particles from a prior distribution toward a Bayesian posterior while enforcing the governing probability-evolution equation. However, the standard PINPF velocity model processes particles independently and therefore does not explicitly condition its transport decisions on the empirical particle population. This paper introduces population-aware PINPF (PA-PINPF), which augments each particle update with a permutation-invariant Deep Sets representation of the full particle set. We investigate two population encoders. PA-PINPF-State summarizes the particle states, whereas PA-PINPF-Feature summarizes the complete local physics-informed feature vectors, including particle position, pseudo-time, measurement information, likelihood values, and score information. The latter allows the population context to represent not only particle-cloud geometry, but also the population-level Bayesian transport geometry. The methods retain the original unsupervised physics-informed residual objective and require no ground-truth posterior samples during training. Experiments on range-measurement tasks and nonlinear time-difference-of-arrival posterior transport demonstrate that both population-aware variants improve over particle-wise PINPF, while feature-population encoding provides the strongest performance. These results show that population-level physics features provide useful global information for learned Bayesian particle transport.
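A permutation-invariant Deep Sets summary can be sketched without any learned weights: embed each particle, mean-pool the embeddings into a single context vector, and append that context to every particle's own input. The embedding `phi` below is a fixed, hypothetical feature map standing in for a trained network, and 2D particle states are an assumption for the example.

```python
def phi(particle):
    """Per-particle embedding (hypothetical fixed features, not a trained network)."""
    x, y = particle
    return [x, y, x * x + y * y]


def population_context(particles):
    """Mean-pool the embeddings; the result does not depend on particle order."""
    embedded = [phi(p) for p in particles]
    n = len(embedded)
    return [sum(column) / n for column in zip(*embedded)]


def velocity_inputs(particles):
    """Concatenate each particle's own state with the shared population context."""
    context = population_context(particles)
    return [list(p) + context for p in particles]


particles = [(1.0, 2.0), (3.0, -1.0), (0.0, 4.0)]
print(population_context(particles))
print(velocity_inputs(particles)[0])
```

Swapping the state-only embedding for one that also includes likelihood or score information would correspond, loosely, to the difference between the State and Feature encoders described in the abstract.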

2606.10956 2026-06-10 cs.AI cs.CL New submission

Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

注意差距：前沿大语言模型能否通过标准化办公能力考试？

Tengchao Lv, Dongdong Zhang, Jiayu Ding, Yilin Jia, Yuzhong Zhao, Yupan Huang, Wenshan Wu, Xiangyang Zhou, Shaohan Huang, Nan Yang, Li Dong, Lei Cui, Furu Wei

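Affiliations: Microsoft Research

AI summary: Using 200 comprehensive operation tasks from China's National Computer Rank Examination (NCRE), evaluates 7 frontier LLMs on Word, Excel, and PowerPoint automation. The best single-turn model reaches a 36.6% score rate and an agentic system with execution feedback reaches 68.8%, still below the 95.5% community-reference score, showing that reliable fine-grained Office automation remains a major challenge.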

Comments: 21 pages, 5 figures

AI中文摘要

大语言模型（LLM）代理在计算机自动化领域的部署正在加速，但其在复杂、专业级生产力软件中的导航能力在很大程度上尚未得到测试。我们认为办公自动化是基准测试文档自动化能力的理想环境，因为它需要长期规划和推理、精确的参数配置以及多应用集成。为了量化这种能力，我们引入了一项基于中国国家计算机等级考试（NCRE）的评估，包含200个涵盖Word、Excel和PowerPoint的综合实践操作任务。每个任务根据7118个机器可评分标准按100分制评分，得分率（SR）表示这些任务中获得的平均评分百分比。我们对7个前沿LLM进行了基准测试，并观察到明显的局限性：单轮模型最高得分为36.6%。一个具有执行反馈、迭代修复和更广泛办公自动化访问权限的更强智能体系统达到了68.8%，但仍低于用作评分合理性检查的95.5%社区参考分。最终，我们的实验表明，尽管代码生成最近取得了进展，但对于当前的代码生成LLM和智能体系统来说，实现可靠的细粒度办公文档自动化仍然是一个重大挑战。

英文摘要

The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade productivity software is largely untested. We argue that Office automation is an ideal environment for benchmarking document-automation capability, as it requires long-horizon planning and reasoning, precise parameter configuration, and multi-application integration. To quantify this capability, we introduce an evaluation based on China's National Computer Rank Examination (NCRE), featuring 200 comprehensive practical-operation tasks across Word, Excel, and PowerPoint. Each task is scored on a 100-point rubric scale using 7,118 machine-gradable criteria, and Score Rate (SR) denotes the mean percentage of rubric points earned across these tasks. We benchmark 7 frontier LLMs and observe stark limitations: single-turn models score a maximum of 36.6%. A stronger agentic system with execution feedback, iterative repair, and broader Office automation access reaches 68.8%, but remains below the 95.5% community-reference score used as a scoring sanity check. Ultimately, our experiments demonstrate that despite recent advancements in code generation, achieving reliable fine-grained Office document automation remains a significant challenge for current code-generating LLM and agent systems.
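The Score Rate definition, the mean percentage of rubric points earned across tasks, can be written out as a short sketch. The data layout (a list of `(points, passed)` pairs per task) and the example rubrics are assumptions for illustration, not the benchmark's actual format.

```python
def task_score(criteria):
    """Percentage of rubric points earned on one task.

    `criteria` is a list of (points, passed) pairs, one per gradable criterion.
    """
    total = sum(points for points, _ in criteria)
    earned = sum(points for points, passed in criteria if passed)
    return 100.0 * earned / total


def score_rate(tasks):
    """Mean per-task percentage across all tasks."""
    return sum(task_score(task) for task in tasks) / len(tasks)


# Two hypothetical tasks, each with a 100-point rubric.
tasks = [
    [(60, True), (40, False)],
    [(50, True), (30, True), (20, False)],
]
print(score_rate(tasks))
```

Averaging per-task percentages weights every task equally, regardless of how many individual criteria its rubric contains.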

2606.10953 2026-06-10 cs.AI cs.CV New submission

Architect-Ant: Editable Automatic Furnishing of Architectural Floor Plans

Fedor Rodionov, Aleksandar Cvejic, Michael Birsak, John Femiani, Peter Wonka

Affiliations: King Abdullah University of Science and Technology (KAUST); Miami University

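AI summary: Proposes Architect-Ant, an editable automatic furnishing framework built on a fine-tuned vision-language model. Layouts are encoded in a domain-specific language and then optimized, yielding plausible layouts that respect architectural constraints.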

Comments: 17 pages, 10 figures

Abstract

Furnished floor plans are fundamental to real estate visualization, interior design, and architectural workflows. However, progress in automatic furniture arrangement has been limited by the lack of real, professionally designed floor-plan datasets with object-level furniture annotations. To address this gap, we introduce AntPlan-270, a curated dataset of 270 architectural floor plans with per-room furniture bounding box annotations across ten residential room categories. Building on this dataset, we present Architect-Ant, an editable automatic furnishing framework powered by a fine-tuned vision-language model. Furniture layouts are represented using a compact, coordinate-based domain-specific language (DSL) that encodes object categories and placements relative to the room geometry. To improve spatial reasoning, we generate procedural reasoning traces that capture architectural constraints such as wall alignment, door and window clearance, circulation, fixture compatibility, and room-specific furniture inventories, and use them to supervise fine-tuning of the model. We then apply preference optimization over candidate object placements to further refine layout quality. The generated DSL can be rasterized into semantic masks and used to condition a Flux-based LoRA renderer, producing realistic blueprint-style furnished floor-plan images while preserving the editable symbolic layout. Experiments on layout furnishing show that Architect-Ant produces geometrically valid and functionally plausible layouts, and suggest a scalable path for furnishing larger structure-only floor-plan datasets.
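
To make the idea of a coordinate-based layout DSL that rasterizes into a semantic mask more concrete, here is a minimal sketch. The abstract does not give the actual syntax, so the line format `category x0 y0 x1 y1` (coordinates normalized to the room's bounding box), the `CLASS_IDS` label map, and both function names are assumptions made purely for illustration.

```python
# Hypothetical label map: 0 is background, other ids are furniture classes.
CLASS_IDS = {"bed": 1, "wardrobe": 2, "desk": 3}


def parse_layout(text):
    """Parse lines of the assumed form 'category x0 y0 x1 y1', coords in [0, 1]."""
    boxes = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, *coords = line.split()
        x0, y0, x1, y1 = map(float, coords)
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"box is degenerate or outside the room: {line!r}")
        boxes.append((name, x0, y0, x1, y1))
    return boxes


def rasterize(boxes, width, height):
    """Paint each box into a height x width grid of class ids."""
    mask = [[0] * width for _ in range(height)]
    for name, x0, y0, x1, y1 in boxes:
        label = CLASS_IDS[name]
        for row in range(round(y0 * height), round(y1 * height)):
            for col in range(round(x0 * width), round(x1 * width)):
                mask[row][col] = label  # later boxes overwrite earlier ones
    return mask


layout = "bed 0.0 0.0 0.5 0.5\ndesk 0.5 0.5 1.0 1.0"
for row in rasterize(parse_layout(layout), width=4, height=4):
    print(row)
```

Because the layout stays symbolic, an edit such as moving the desk amounts to changing one line of text and rasterizing again, which is the kind of editability the abstract emphasizes.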

2606.10949 2026-06-10 cs.AI New submission

Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models

Shelly Bensal, Axel Magnuson, Aparna Balagopalan, Daniel M. Bikel

Affiliation: Writer, Inc.

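AI summary: This study presents the first systematic evaluation of sycophancy in memory-augmented models. It introduces the MIST benchmark, finds that memory amplifies sycophantic behavior (by up to 25x), and proposes two lightweight mitigations.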

Comments: Under submission; preprint

Abstract

Persistent memory systems promise to make LLMs more helpful by storing user beliefs over time. We show they also make models less correct by systematically amplifying sycophancy, wherein models prioritize agreement with users over accuracy. We conduct the first systematic evaluation of this effect, introducing MIST: a benchmark of synthetically generated multi-turn conversations where users express plausible misconceptions in scientific, medical, and moral reasoning domains. Testing across three state-of-the-art memory systems and five model families reveals that memory amplifies sycophantic behavior across all conditions, with up to 25x higher sycophancy rates than in-context baselines. Error analyses suggest memory extraction as the primary culprit: lossy compression into discrete snippets encodes user misconceptions while discarding corrective context. Based on these results, we propose two lightweight mitigations that substantially reduce sycophancy while matching or exceeding memory systems at factual recall.
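
A figure such as "up to 25x higher sycophancy rates" is a ratio of two rates. The sketch below shows one way to compute it, assuming each response has already been labeled as sycophantic (endorsing the user's misconception) or not. The labeling step, the function names, and the toy numbers are all hypothetical and do not reproduce the paper's protocol or results.

```python
def sycophancy_rate(flags):
    """Fraction of responses labeled sycophantic (True) in a list of booleans."""
    if not flags:
        raise ValueError("no responses to score")
    return sum(flags) / len(flags)


def amplification(memory_flags, baseline_flags):
    """Ratio of the memory-condition rate to the in-context baseline rate."""
    base = sycophancy_rate(baseline_flags)
    if base == 0:
        raise ValueError("baseline rate is zero; the ratio is undefined")
    return sycophancy_rate(memory_flags) / base


# Toy labels only: 5 of 10 sycophantic with memory, 1 of 10 in the baseline.
memory = [True] * 5 + [False] * 5
baseline = [True] * 1 + [False] * 9
print(round(amplification(memory, baseline), 2))  # 5.0
```

A ratio of this kind becomes unstable when the baseline rate is near zero, so it is worth reading alongside the absolute rates.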
