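arXivDaily: a daily academic digest synchronized with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.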

2605.29272 2026-05-29 cs.LG cs.AI stat.ML

Causal Label Recovery in Payment Networks

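Gaurav Dhama

Affiliations: Mastercard

AI summary: Targeting four systematic biases in payment-network labels, the paper proposes the Sequential Triply Robust (STR) estimator, which corrects all of them simultaneously and attains the semiparametric efficiency bound, enabling training on data that is days old rather than months old.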

Comments 49 pages

Abstract

Fraud detection models in payment networks train on chargeback labels that are systematically biased. Every label must survive three sequential gates: authorization (declined transactions generate no labels), issuer reporting (unreported fraud is invisible), and delay (pending chargebacks are missing at training time). Labels that do arrive may be corrupted by first-party misuse or issuer misclassification. A companion paper [arXiv:2605.27557] proved that these four impairments impose a minimax lower bound on detection performance. This paper asks: can that bound be achieved? We formalize the observation pipeline as a sequential missing-data problem with three propensity stages and a corruption layer, and construct the Sequential Triply Robust (STR) estimator. The STR corrects for all four impairments simultaneously and achieves the semiparametric efficiency bound -- no estimator can have lower asymptotic variance. It is sequentially triply robust: at each gate, consistency requires only that either the propensity model or the outcome regression is correctly specified, not both. We provide corruption correction via noise-rate-adjusted pseudo-labels, empirical Bayes shrinkage to stabilize inverse-propensity weights for small issuers, a plug-in variance estimator yielding valid confidence intervals, and a Bernstein concentration inequality for finite-sample guarantees. On the operational side, we derive the optimal training delay -- the maturity window that minimizes the sum of label-quality loss and model staleness -- and prove that the STR permits training on data that is days old rather than months old, decoupling model freshness from the chargeback maturity cycle. The STR provably dominates naive chargeback-based training in mean squared error for any sample size.
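The "either the propensity model or the outcome regression" property can be illustrated on a single missingness gate with a textbook augmented inverse-propensity-weighted (AIPW) mean. This is only a sketch of the general doubly robust idea, not the paper's STR estimator; the function name `aipw_mean`, the toy labels, and the assumed propensities are all hypothetical.

```python
def aipw_mean(y, observed, propensity, outcome_pred):
    """Doubly robust (AIPW) estimate of E[Y] when some labels are missing.

    y            -- label values (ignored wherever observed is 0)
    observed     -- 1 if the label was observed, else 0
    propensity   -- modelled P(observed = 1 | x), must be > 0
    outcome_pred -- modelled E[Y | x]
    """
    total = 0.0
    for yi, ri, pi, mi in zip(y, observed, propensity, outcome_pred):
        residual = (yi - mi) if ri else 0.0
        # Outcome-model prediction plus a propensity-weighted correction.
        total += mi + ri * residual / pi
    return total / len(y)


# Hypothetical toy data: four transactions, two with missing labels.
y = [1, 0, 0, 0]                      # values at unobserved positions are placeholders
observed = [1, 1, 0, 0]
propensity = [0.5, 0.5, 0.5, 0.5]     # assumed correct in this toy setup

# Case 1: a reasonable outcome model.
estimate = aipw_mean(y, observed, propensity, [0.5, 0.5, 0.5, 0.5])

# Case 2: a deliberately wrong outcome model (predicts 0 everywhere).
estimate_bad_outcome = aipw_mean(y, observed, propensity, [0.0, 0.0, 0.0, 0.0])

print(estimate, estimate_bad_outcome)
```

In this toy case both calls return the same value because the propensities are right, which is the behavior double robustness is meant to provide at one gate; the paper's sequential, three-gate construction is more involved.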

2605.29271 2026-05-29 cs.AI cs.IR cs.LG

CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval

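Vaishali Senthil, Ashutosh Hathidara, Sebastian Schreiber

Affiliations: SAP Labs

AI summary: Proposes CoHyDE, which iteratively co-trains a dense encoder and an LLM rewriter, combining contrastive learning with preference alignment, and improves tool-retrieval performance on both standard and vague queries.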

Abstract

Tool retrieval over large API catalogs is a core bottleneck for LLM agents: user queries arrive in colloquial, often underspecified language, while the catalog uses technical API vocabulary that no fixed encoder can bridge on its own. The two dominant training approaches, contrastive encoder fine-tuning and HyDE-style query expansion with a frozen LLM, address this problem from opposite ends and fail in complementary directions: the fine-tuned encoder excels when the query's surface form already matches the catalog but collapses when it does not, while zero-shot HyDE is more robust to underspecified queries yet generates catalog-unaware hypothetical descriptions that degrade retrieval when queries are well-formed. We introduce CoHyDE, an iterative procedure that trains the dense encoder and the LLM rewriter as a single co-evolving system: the encoder is retrained with InfoNCE on catalog-style hypothetical descriptions produced by the rewriter, and the rewriter is preference-aligned via DPO against the encoder's retrieval scores, with both sides warm-started on the tool catalog before the loop begins. On a ~10k tool subset of the ToolBench catalog, three rounds of CoHyDE improve over the strongest single-component baseline by +2.5 pp NDCG@5 on standard queries and +6.3 pp on held-out vague queries, with gains as large as +8 pp on the hardest vague tier. Ablations confirm that co-training is the key ingredient: using either component in isolation fails to match CoHyDE on both well-formed and vague queries, with losses of up to -8 pp on vague queries.
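For readers unfamiliar with the InfoNCE objective mentioned in the abstract, a minimal single-query version is sketched below. The vectors, the temperature, and the helper name `info_nce` are illustrative assumptions; the paper's batch construction, embedding model, and hyperparameters are not given in the abstract.

```python
import math


def info_nce(query, positive, negatives, temperature=1.0):
    """InfoNCE loss for one query: negative log-softmax score of the positive."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # The positive's logit comes first, followed by the negatives' logits.
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]

    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


# Hypothetical 2-d embeddings: a query and three candidate tool descriptions.
query = [1.0, 0.0]
loss_aligned = info_nce(query, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
loss_misaligned = info_nce(query, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])

print(loss_aligned, loss_misaligned)
```

The loss is lower when the positive description is the candidate closest to the query, which is the signal an encoder is trained on under this objective.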

2605.29267 2026-05-29 cs.AI cs.LG

When and How Human Curation Backfires: Preference Alignment under Multi-Model Self-Consuming Loop

Yang Zhang, Xiukun Wei, Xueru Zhang

Affiliations: Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio

AI summary: Studies how human curation affects model alignment in multi-model self-consuming training, finding that cross-model interactions can dampen or even invert the effect of curation and degrade long-term alignment.

Abstract

Foundation models are increasingly trained on synthetic data generated by prior model iterations rather than exclusively on real data. This self-consuming training paradigm can lead to model collapse, divergence, or bias amplification. Recent work (Ferbach et al., 2024) shows that incorporating human curation into the loop can steer a self-consuming model toward human-aligned behavior, but these analyses focus on a single, isolated model that solely consumes its own outputs. In practice, however, models often interact and train on input-output pairs produced by other models. This paper studies self-consuming training in the multi-model regime. We first formalize a framework for interacting self-consuming models and characterize when the resulting dynamical system converges to a stable point. We then examine how human curation of one model affects its own alignment (self-influence) and how such effects propagate to other models (cross-influence). Unlike isolated settings where human curation always enhances model alignment, we show that cross-model interactions can dampen or even invert this effect, ultimately degrading long-term alignment.

2605.29262 2026-05-29 cs.AI

Harmonizing Real-Time Constraints and Long-Horizon Reasoning: An Asynchronous Agentic Framework for Dynamic Scheduling

Shijie Cao, Yuan Yuan, Jing Liu

Affiliations: School of Computer Science and Engineering, Beihang University, Beijing 100191, China; Shenzhen Loop Area Institute, Shenzhen, China; Qingdao Research Institute, Beihang University; Hangzhou Innovation Institute, Beihang University; School of Artificial Intelligence, Xidian University, Xi’an 710071, Shaanxi, China; Guangzhou Institute of Technology, Xidian University, Guangzhou 510555, Guangdong, China

AI summary: Proposes RACE-Sched, an asynchronous agentic framework whose dual-stream architecture decouples policy execution from logical reasoning; an LLM synthesizes and validates symbolic heuristic rules, improving dynamic scheduling quality while preserving real-time responsiveness.

Abstract

The Dynamic Flexible Job Shop Scheduling Problem (DFJSP) necessitates a trade-off between instant reaction to stochastic disturbances and global optimization of production goals. Conventional priority rules are insufficiently flexible to handle complex disruptions, whereas learning-based approaches often compromise interpretability or fail to generalize across problem scales. Although Large Language Models (LLMs) offer advanced reasoning capabilities to bridge this gap, their substantial inference latency is incompatible with the millisecond-level decision cycles of industrial control systems. To resolve this conflict, we introduce RACE-Sched, an asynchronous agent-based framework that decouples policy execution from logical reasoning via a dual-stream architecture. The Reactive Stream executes low-latency symbolic heuristics to enable real-time dispatching, while the parallel Deliberative Stream leverages an LLM to synthesize, validate, and evolve these rules. Candidate rules undergo rigorous testing in a sandbox and are deployed via atomic updates, ensuring safety without blocking the control loop. Additionally, a semantic rule repository indexes validated heuristics for retrieval-based initialization which enhances transferability across problem scales. Extensive evaluations on GEN-Bench, MK-Bench, and JMS-Bench demonstrate that RACE-Sched outperforms leading Deep Reinforcement Learning and other LLM-based baselines. This approach harmonizes real-time constraints with long-horizon reasoning to achieve superior solution quality and robust adaptation to dynamic events.

2605.29259 2026-05-29 cs.LG cs.AI

KLAS: Using Similarity to Stitch Neural Networks for Improved Accuracy-Efficiency Tradeoffs

Debopam Sanyal, Anantharaman Iyer, Alind Khare, Trisha Jain, Akshay Jajoo, Myungjin Lee, Clayton Kerce, Alexey Tumanov

Affiliations: Georgia Institute of Technology; Microsoft M365 Research; Cisco Research; Georgia Tech Research Institute

AI summary: Proposes the KLAS framework, which uses KL divergence to measure the similarity of intermediate representations and automatically select the best stitching configurations, improving the accuracy-efficiency curve of stitched models at the same finetuning cost.

Abstract

Given the wide range of deployment targets, flexible model selection is essential for optimizing performance within a given compute budget. Recent work demonstrates that stitching pretrained models within a model family enables cost-effective interpolation of the accuracy-efficiency tradeoff space. Stitching transforms intermediate activations from one pretrained model into another, producing a new interpolated stitched network. Such networks provide a pool of deployment options along the accuracy-efficiency spectrum. However, existing stitching approaches often yield suboptimal tradeoffs and lack generalizability, as they primarily rely on heuristics to select stitch configurations. We argue that constructing improved accuracy-efficiency tradeoffs requires explicitly capturing and leveraging the similarity between pretrained models being stitched. To this end, we introduce KLAS, a novel stitch selection framework that automates and generalizes stitch selection across model families by leveraging KL divergence between intermediate representations. KLAS identifies the most promising binary stitches from the $O(k^2n^2)$ possibilities for $k$ pretrained models of depth $n$. Through comprehensive experiments, we demonstrate that KLAS improves the accuracy-efficiency curve of stitched models at the same finetuning cost as baselines. KLAS achieves up to $1.21\%$ higher ImageNet-1K top-1 accuracy at the same computational cost, or maintains accuracy with a $1.33\times$ reduction in FLOPs.
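As a rough illustration of using KL divergence as a similarity signal between two layers, the sketch below normalizes toy activation vectors with a softmax and compares them. How KLAS actually turns intermediate representations into distributions is not stated in the abstract, so the softmax step, the activation values, and the function names are assumptions.

```python
import math


def softmax(values):
    """Turn a vector of activations into a probability distribution."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical activations from one layer of model A and two layers of model B.
layer_a = [2.0, 1.0, 0.1]
layer_b_similar = [2.1, 0.9, 0.1]
layer_b_different = [0.1, 1.0, 2.0]

kl_similar = kl_divergence(softmax(layer_a), softmax(layer_b_similar))
kl_different = kl_divergence(softmax(layer_a), softmax(layer_b_different))

print(kl_similar, kl_different)
```

Under this toy setup the layer pair with the smaller divergence would be the more promising stitch candidate; a real selection procedure would aggregate such scores over many inputs.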

2605.29257 2026-05-29 cs.SD

ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

Tiantian Feng, Anfeng Xu, Xuan Shi, Aditya Kommineni, Shakhrul Iman Siam, Megan Micheletti, Zhonghao Shi, Helen Tager-Flusberg, Mi Zhang, Lynn K. Perry, Catherine Lord, Daniel Messinger, Shrikanth Narayanan

Affiliations: University of Southern California; The Ohio State University; University of California, Los Angeles; Harvard University; Boston University; University of Miami

AI summary: Proposes the ChildVox benchmark, which integrates 17 child audio datasets and more than 20 sub-tasks to evaluate a range of foundation models on children's physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken-language recognition.

Comments preprint under review

Abstract

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised, ASR-oriented, and large audio-language models, on tasks including physiological sound classification, vocalization and canonical syllables modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models in recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.

2605.29256 2026-05-29 cs.CL cs.AI

DynSess: Dynamic Session-Level Evaluation and Optimization Framework for Role-Playing Agents

Rongsheng Zhang, Jiji Tang, Junnan Ren, Zuyi Bao, Weijie Chen, Ruofan Hu, Zhou Zhao, Tangjie Lv, Yan Zhang

Affiliations: Zhejiang University; Fuxi AI Lab, NetEase Inc.; Xiamen University

AI summary: Proposes DynSess, a unified session-level framework that improves the long-horizon consistency and interaction quality of role-playing agents through session-level evaluation (DynSess-Eval) and training-trajectory optimization based on multi-turn lookahead search (DSPO/GSRPO).

Abstract

Role-playing with large language models is fundamentally a session-level task, requiring agents to sustain character identity and interaction quality across extended multi-turn conversations. Yet existing evaluation and optimization methods remain largely turn-level, failing to capture long-horizon quality. We propose DynSess, a unified session-level framework for role-playing agents. DynSess-Eval scores complete dialogue sessions via rubrics targeting long-horizon behaviors. Leveraging its session-level rewards, we construct high-quality training trajectories through multi-turn lookahead search and train DynSess-Character with two complementary variants: DSPO (off-policy) and GSRPO (on-policy). Experiments show that DynSess-Eval aligns with human judgments substantially better than prior evaluators, and blind human evaluation further shows that DynSess-Character matches the strongest character model despite using substantially fewer parameters, while maintaining strong role consistency and interactive ability. Our dataset and code will be released to facilitate future research.

2605.29254 2026-05-29 cs.RO cs.AI

Extreme dynamic symmetry enables omnidirectional and multifunctional robots

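Jiaxun Liu, Boxi Xia, Boyuan Chen

Affiliations: Department of Mechanical Engineering and Materials Science, Duke University; Department of Electrical and Computer Engineering, Duke University; Department of Computer Science, Duke University

AI summary: Introduces the concept of dynamic symmetry, quantified by a dynamic isotropy measure. Across more than 1000 simulated morphologies, higher dynamic symmetry improves trajectory tracking, task success, robustness, and other performance measures; the Argus family of spherical robots is developed to validate that near-extreme dynamic isotropy yields omnidirectional locomotion, terrain adaptation, rapid self-stabilization, and fault resilience.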

Comments Published in Science Robotics (2026). Our project website is at: https://generalroboticslab.com/Argus

Journal ref Science Robotics 11, eaec1725 (2026)

Abstract

Symmetry is a central organizing principle in natural systems, yet its use as a unifying design strategy in robotics has largely remained limited to geometric form. We show that symmetry can instead be leveraged at the level of dynamic actuation capability. We introduce dynamic symmetry, the uniformity of a robot's attainable center-of-mass accelerations, and formalize it through a measure coined as dynamic isotropy. Across more than 1000 simulated morphologies, we found that higher dynamic symmetry consistently improved trajectory tracking, task success, robustness, resiliency, and energy efficiency, with the benefits becoming most pronounced as dynamic isotropy approached its theoretical limit. To study this regime systematically, we developed Argus, a family of spherical robots designed to explore the effects of increasing dynamic symmetry. Members of the Argus family vary in their actuation geometry and dynamic symmetry level while sharing a common architectural principle: radially oriented linear actuators that directly shape the robot's center-of-mass dynamics. Among them, we built a physical 20-leg Argus variant that achieved near-extreme dynamic isotropy and demonstrated orientation-invariant locomotion, agile traversal of cluttered and deformable terrain, rapid self-stabilization, and resilience to partial actuator failures. Its distributed sensing further enabled omnidirectional perception and object interaction during continuous motion. These results show that designing robots for symmetry not only in morphology but also in their attainable dynamics provides a powerful and general pathway toward agility, robustness, and multifunctionality in uncertain terrestrial and extraterrestrial environments.

2605.29253 2026-05-29 cs.AI

OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

Yibing Liu, Yangze Liu, Xiaolong Yin, Bin Wang, Chong Zhang, Hao Yin, Zhongyi Han

Affiliations: School of Software, Shandong University; School of Artificial Intelligence, Nanjing University; State Key Laboratory of Novel Software Technology, Nanjing University; Center for Medical Artificial Intelligence; Qingdao Academy of Chinese Medical Sciences; Institute of Marine Traditional Chinese Medicine, Shandong University of Traditional Chinese Medicine; School of Software Engineering, Sichuan University

AI summary: Proposes the OpenClawBench dataset, which uses the FullTax annotation framework to quantify process-side anomalies in agent executions and exposes the shortcomings of outcome-only evaluation.

Comments 37 pages, 1 figure, 43 tables

Abstract

Task success can hide process anomalies in real-world agent executions. An agent may pass the final task oracle while still accumulating unresolved ambiguity, unsafe external writes, ignored errors, weakly grounded commitments, or capability-boundary overcommitment. We study this mismatch as the Outcome-Process Gap and introduce OpenClawBench, a large-scale dataset for measuring and supervising process-side anomalies in real agent execution processes. OpenClawBench is built from BFCL-driven OpenClaw sessions produced by 6 source models and contains 31,264 annotated trajectories. It aligns task-oracle outcomes with structured process evidence. FullTax converts the aligned trajectories into structured anomaly supervision: binary labels, supporting evidence, onset/span localization, severity, recoverability, and a 5-class anomaly taxonomy. Using OpenClawBench, we make the Outcome-Process Gap measurable. Among 31,135 oracle-passing executions, 2,904 are still labeled process-anomalous under FullTax. These results show that success-only evaluation misses a concrete class of process-side failures in real agent executions. A LoRA-fine-tuned Gemma 3 12B detector trained on the high-confidence FullTax supervised pool reaches binary F1=0.729 on the cleaner-labels held-out test split. Together, OpenClawBench turns real agent execution logs into auditable and reusable supervision for studying, diagnosing, and operationally monitoring runtime agent reliability.
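The size of the Outcome-Process Gap can be worked out directly from the counts quoted in the abstract, as in the short calculation below. The count of non-passing trajectories assumes that the 31,135 oracle-passing executions are a subset of the 31,264 annotated trajectories, which the abstract implies but does not state.

```python
annotated_trajectories = 31_264   # from the abstract
oracle_passing = 31_135           # from the abstract
passing_but_anomalous = 2_904     # from the abstract

# Share of oracle-passing executions still labeled process-anomalous.
gap_rate = passing_but_anomalous / oracle_passing

# Assumption: every oracle-passing execution is one of the annotated trajectories.
not_passing = annotated_trajectories - oracle_passing

print(f"Anomalous share of oracle-passing runs: {gap_rate:.2%}")
print(f"Trajectories that did not pass the oracle: {not_passing}")
```

That is roughly 9.3% of oracle-passing runs, which is the portion a success-only evaluation would report as clean.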

2605.29251 2026-05-29 cs.AI cs.CR

Provably Secure Agent Guardrail

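Benlong Wu, Weiming Zhang, Kejiang Chen, Han Fang, Nenghai Yu

Affiliations: University of Science and Technology of China

AI summary: Because existing semantic guardrails cannot provide deterministic security lower bounds, the paper proposes a new security paradigm grounded in the fundamental limitations of logical reasoning and introduces the executable Proof-Constrained Action (ePCA) framework, whose neural symbolic isolation architecture achieves zero attack success rate and zero false positive rate in the evaluated scenarios.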

Abstract

As large language models transition from bounded generative engines to agents with expansive execution privileges, AI going out of control precipitates a fundamental crisis in artificial intelligence security. Existing defense architectures heavily rely on empirical semantic guardrails and probabilistic large model adjudicators, mechanisms that fail to provide deterministic security lower bounds when facing complex semantic symbol decoupling attacks. To overcome this empirical semantic guardrail dilemma, this paper proposes a new security paradigm for agents based on the fundamental limitations of logical reasoning. Based on this paradigm, we further introduce an executable Proof-Constrained Action (ePCA) framework with a neural symbolic isolation architecture. This framework abandons semantic trust in natural language, forcing agents to losslessly formalize their intentions into first-order logical mathematical constraints before performing physical operations. Empirical evaluations of macroscopic and microscopic two-dimensional dynamic adversarial systems demonstrate that our formal verification mechanism achieves zero attack success rate and zero false positive rate across the evaluated scenarios, with extremely low computational latency. This research provides a conditional formal foundation under explicit system assumptions and an engineering paradigm for constructing the underlying defense foundation for future intelligent systems.

2605.29250 2026-05-29 cs.CL cs.AI cs.IR cs.LG

OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources

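Jinheon Baek, Soyeong Jeong, Sangwoo Park, Woongyeong Yeo, Minki Kang, Patara Trirat, Heejun Lee, Sung Ju Hwang

Affiliations: KAIST

AI summary: Proposes the OmniRetrieval framework, which takes natural-language queries, identifies the appropriate knowledge sources, and dispatches queries to each source's native execution engine; across 13 datasets and 309 knowledge bases it outperforms single-source baselines, achieving unified retrieval over heterogeneous knowledge sources.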

Abstract

Real-world information needs require access to structurally diverse knowledge sources, from unstructured text and relational tables to knowledge graphs and property graphs. Existing retrievers, however, operate over one source at a time under a fixed query language, leaving the broader landscape of available knowledge fragmented behind incompatible interfaces. A natural attempt at unification would collapse these sources into a shared space, but this erases the structural affordances (such as schemas, ontologies, compositional operators) that give each source its expressive power. Effective retrieval over diverse knowledge, therefore, requires not homogenization but an overarching layer that meets each source on its own terms. To achieve this, we present OmniRetrieval, a framework that takes any natural-language query, identifies appropriate knowledge sources, and dispatches source-native queries to their native execution engines. Across an extensive benchmark spanning 13 datasets and 309 distinct knowledge bases over text, relational, and graph-structured sources, OmniRetrieval exceeds single-source baselines, demonstrating that it can serve as a general-purpose interface to the heterogeneous sources while preserving the structural distinctions that make each source valuable.

2605.29247 2026-05-29 cs.AI cs.CL cs.LG

DenseSteer: Steering Small Language Models towards Dense Math Reasoning

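Yang Ouyang, Shuhang Lin, Jung-Eun Kim

Affiliations: North Carolina State University; Rutgers University

AI summary: Proposes DenseSteer, a training-free inference-time steering framework that nudges internal representations toward dense reasoning patterns, improving the accuracy of small models on multi-step mathematical reasoning.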

Comments ICML 2026

2605.29243 2026-05-29 cs.CL cs.AI cs.CY

Wait! There's a Way Out: A Decision Mechanism for Forecasting Conversational Derailment

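Laerdon Kim, Vivian Nguyen, Cristian Danescu-Niculescu-Mizil

Affiliations: Cornell University

AI summary: Proposes a deferral-based decision mechanism built on forward-looking simulation: when forecasting conversational derailment, it assesses whether a tense moment can plausibly recover, reducing false positives while preserving forecasting accuracy.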

Comments To appear in the Proceedings of ACL 2026

Abstract

Forecasting conversational derailment is the task of predicting, as the conversation unfolds, whether it will eventually derail into personal attacks. Since forecasting models operate in an online fashion, they must decide whether to "trigger" an alert after each utterance--for example, to notify participants or a moderator that the conversation is at risk of derailing. Existing approaches make this decision solely based on the estimated likelihood of derailment given the preceding utterances, implicitly assuming that the conversation's future trajectory is fixed. As a result, they ignore the possibility of future recovery and incur an unnecessarily high rate of false positives. In this work we propose a method for decoupling the decision to trigger from derailment likelihood estimation. Our approach is inspired by the first human baseline on this task, which shows that humans achieve dramatically lower false positive rates by selectively deferring their decision to trigger when they anticipate that tension is likely to subside. We operationalize this insight with a deferral mechanism that uses forward-looking simulations to assess whether a tense moment admits plausible paths to recovery. Incorporating this mechanism into a state-of-the-art forecasting model substantially reduces false positives without sacrificing forecasting accuracy. More broadly, this work highlights the value of treating decision-making as a first-class component of forecasting systems.
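One way to picture the separation between likelihood estimation and the trigger decision is the toy rule below. It is a hypothetical sketch, not the authors' mechanism: the thresholds, the function name `should_trigger`, and the representation of simulated continuations as booleans are all assumptions made for illustration.

```python
def should_trigger(derail_prob, simulated_recoveries,
                   alert_threshold=0.5, recovery_threshold=0.5):
    """Return True to raise an alert now, False to stay silent or defer.

    derail_prob          -- forecaster's estimated probability of derailment
    simulated_recoveries -- one boolean per simulated continuation,
                            True if that continuation de-escalates
    """
    if derail_prob < alert_threshold:
        return False  # not tense enough to consider an alert
    if not simulated_recoveries:
        return True   # no simulations available; fall back to the forecaster
    recovery_rate = sum(simulated_recoveries) / len(simulated_recoveries)
    # Defer (return False) when most simulated futures recover.
    return recovery_rate < recovery_threshold


# Same derailment estimate, different simulated futures.
deferred = should_trigger(0.8, [True, True, False])
alerted = should_trigger(0.8, [False, False, True])

print(deferred, alerted)
```

The two calls share the same derailment estimate but reach different decisions, which is the point of treating the trigger as its own component.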

2605.29240 2026-05-29 cs.AI cs.CL cs.HC cs.IR

Surfacing Isolated Learners with Outcome-Independent Mediation of Feedback between Teachers and Students Using AI

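Junsoo Park, Youssef Medhat, Htet Phyo Wai, Ploy Thajchayapong, Ashok K. Goel

Affiliations: Georgia Institute of Technology

AI summary: Proposes an interpretable, grade-free decision layer that prioritizes course topics by integrating three signals (the prevalence of student difficulty, disagreement between self-reported and observed difficulty, and unresolved teacher concerns) to help teachers make timely instructional decisions.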

Comments Accepted to HAI-Agency Workshop on Orchestrating Human and AI Agency for Proactive and Reflective Learning

Abstract

AI-augmented classrooms generate rich teacher and student feedback before graded outcomes become available, yet these signals can be difficult to translate into timely instructional decisions. We propose an interpretable decision layer: a transparent mechanism that ranks course topics requiring attention without using grades or post-hoc outcome labels. The approach combines three signals: student learning difficulty prevalence, disagreement between learner self-reports and observed difficulties, and unresolved teacher concerns. The output is a ranked set of topic priorities with per-topic decision records explaining each ranking. In one graduate CS course offering ($n=5$ instructor interviews; $n=279$ survey responses), prioritized topics aligned with instructor concerns (top-5 overlap 3/5; Spearman $\rho=0.80$) and student-reported topic difficulty ($\rho=0.46$, $p=.048$). Multi-signal integration also surfaced learners not identified through individual signal sources alone (AUC $=0.96$ vs. $0.91$ for gap prevalence alone). Reflective thinking, help-seeking, and self-efficacy provided additional evidence that student behavioral signals align with learning-related constructs. While preliminary, these findings suggest that transparent coordination mechanisms may help support human-AI co-agency when feedback is incomplete.
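The two agreement measures reported above, top-k overlap and Spearman rank correlation, can be computed as in the sketch below. The topic names and rankings are invented for illustration and are unrelated to the paper's data; the formula used is the standard one for rankings without ties.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two rankings of the same items, no ties."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def top_k_overlap(order_a, order_b, k):
    """Number of items shared by the top-k of two ordered lists."""
    return len(set(order_a[:k]) & set(order_b[:k]))


# Hypothetical ranks assigned to the same five topics by two sources.
system_ranks = [1, 2, 3, 4, 5]
instructor_ranks = [1, 3, 2, 4, 5]
rho = spearman_rho(system_ranks, instructor_ranks)

# Hypothetical ordered topic lists from the same two sources.
system_order = ["recursion", "graphs", "hashing", "sorting", "trees"]
instructor_order = ["graphs", "recursion", "heaps", "hashing", "tries"]
overlap = top_k_overlap(system_order, instructor_order, k=5)

print(rho, overlap)
```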

2605.29234 2026-05-29 cs.AI cs.IR

Rethinking Literature Search Evaluation: Deep Research Helps, and Human Citation Lists Are Not a Ground Truth

Gaurav Sahu, Laurent Charlin, Christopher Pal

Affiliations: Mila – Quebec AI Institute; HEC Montréal; ServiceNow Research; Canada CIFAR AI Chair; Université de Montréal; Polytechnique Montréal

AI summary: Improves the retrieval pipeline and tests the reliability of human citation lists as an evaluation target, finding that a Deep Research pipeline substantially raises recall while only 51% of human citations are judged moderately relevant or higher; recommends multi-dimensional evaluation.


AI中文摘要

我们从两个互补角度研究大规模文献检索：改进检索流程，以及压力测试人类参考文献列表作为评估目标。首先，我们实现了一个深度研究管道，处理完整查询论文并沿其参考文献广度优先扩展检索结果，表明其显著优于纯API搜索，将RollingEval-Jun25（一个250篇论文的文献检索基准）上的召回率从低于20%提升至高于80%。其次，我们使用中立的LLM作为裁判来判断人类参考文献是否是任务的金标准。我们发现显著局限性：只有51%的人类引用被判定为中等相关或更高，而最强AI重排序器为86-88%。我们在OpenAlex合著图上研究这一差距，发现人类引用直接合作者的可能性比最佳AI重排序器高2.5倍。综合来看，我们的结果反对单一轴线的文献检索评估：召回率、主题相关性评分、排序列表多样性和合著距离诊断各自衡量引用质量的互补属性，应联合报告。

英文摘要

We study large-scale literature search from two complementary angles: improving the retrieval pipeline, and stress-testing the human reference list as an evaluation target. First, we implement a Deep Research pipeline that processes the full query paper and expands the retrieved results breadth-first along their bibliographies, and show that it substantially outperforms vanilla API-only search, raising recall on RollingEval-Jun25 (a 250-paper literature-search benchmark) from below 20% to above 80%. Second, we use a neutral LLM-as-a-judge to determine if human references are sound ground truth for the task. We find significant limitations: only 51% of human citations are judged moderately relevant or higher, against 86--88% for the strongest AI-based re-rankers. We study this gap on the OpenAlex co-authorship graph, finding that humans are 2.5x more likely than the best AI re-rankers to cite a direct collaborator. Together, our results argue against single-axis literature-search evaluation: recall, topical-relevance scoring, ranked-list diversity, and a co-authorship-distance diagnostic each measure complementary properties of citation quality and should be reported jointly.
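
As a rough illustration of breadth-first expansion along bibliographies and of the recall figure quoted above, here is a minimal Python sketch. The `bibliography` mapping, the `max_depth` cutoff, and both function names are assumptions made for the example; the actual pipeline is not described at this level of detail.

```python
from collections import deque


def expand_breadth_first(seeds, bibliography, max_depth=2):
    """Collect papers reachable from `seeds` by following reference lists.

    `bibliography` maps a paper id to the ids it cites (hypothetical input).
    """
    seen = set(seeds)
    queue = deque((paper, 0) for paper in seeds)
    while queue:
        paper, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for cited in bibliography.get(paper, []):
            if cited not in seen:
                seen.add(cited)
                queue.append((cited, depth + 1))
    return seen


def recall(retrieved, reference_list):
    """Fraction of the reference list that appears among the retrieved papers."""
    reference = set(reference_list)
    if not reference:
        return 0.0
    return len(reference & set(retrieved)) / len(reference)
```

Recall computed this way treats the human reference list as the target, which is exactly the assumption the second half of the abstract questions.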


2605.29230 2026-05-29 cs.CV cs.AI

Toward Ethical Facial Age Estimation: A Generalized Zero-Shot Benchmark Without Training on Children's Data

面向道德的面部年龄估计：无需儿童数据训练的广义零样本基准

Caio Petrucci, Leo Sampaio Ferraz Ribeiro, Sandra Avila

发表机构 * New York University（纽约大学）

AI总结提出一个广义零样本基准，训练时排除儿童数据，评估模型对未见年龄组的泛化能力，发现所有方法均存在严重性能下降和可见类偏见。

Comments 12 pages; 3 figures; 5 tables


AI中文摘要

从面部图像进行年龄估计通常依赖于包含未成年人图像的训练数据，这种做法引发了严重的伦理、法律和隐私问题。在这项工作中，我们提出了一个用于面部年龄估计的广义零样本基准，该基准在训练时明确排除儿童数据，同时仍评估模型在年轻人群上的性能。我们重新审视了六个广泛使用的数据集，并引入了具有严格年龄组划分的标准化分割：18-59岁的样本用于训练、验证和测试；18岁以下的样本仅保留用于零样本评估；60岁以上的样本作为分布偏移下模型选择的未见验证集。对于具有身份注释的数据集，基于主体的分割防止了身份泄露，并更好地反映了实际部署条件。在此协议下评估九种最先进的年龄估计方法，结果表明所有评估方法均无法泛化到未见年龄组，性能相对于监督基线平均下降46.4%，最高达52.8%。此外，模型并非简单退化：它们系统性地将未见年龄的预测锚定到附近的可见类别，这是广义零样本学习中众所周知的可见类偏见的体现。通过将无儿童数据的年龄估计形式化为现有数据集上的广义零样本基准，这项工作突出了当前建模实践与现实伦理约束之间的关键差距。我们的基准为在受限数据制度下评估模型提供了原则性基础，并鼓励开发对分布偏移鲁棒且符合负责任数据使用的方法。

英文摘要

Age estimation from facial images typically relies on training data that includes images of minors, a practice that raises serious ethical, legal, and privacy concerns. In this work, we propose a generalized zero-shot benchmark for facial age estimation that explicitly excludes children's data during training while still assessing model performance on younger populations. We revisit six widely used datasets and introduce standardized splits with strict age-group separation: samples aged 18-59 for training, validation, and testing; samples under 18 reserved exclusively for zero-shot evaluation; and samples 60+ as an unseen validation set for model selection under distribution shift. For datasets with identity annotations, subject-exclusive splits prevent identity leakage and better reflect real-world deployment conditions. Evaluating nine state-of-the-art age estimation methods under this protocol reveals that all evaluated methods consistently fail to generalize to unseen age groups, suffering substantial performance degradation -- on average 46.4%, and up to 52.8% -- relative to the supervised baseline. Moreover, models do not simply degrade: they systematically anchor predictions for unseen ages to nearby seen classes, a manifestation of the well-known seen-class bias in generalized zero-shot learning. By formalizing age estimation without children's data as a generalized zero-shot benchmark on existing datasets, this work highlights a critical gap between current modeling practices and real-world ethical constraints. Our benchmark provides a principled basis for evaluating models under restricted data regimes and encourages the development of methods that are robust to distribution shift and aligned with responsible data use.
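
The split protocol above can be sketched in a few lines of Python. The age boundaries come from the abstract; the hash-based subject assignment and the 80/10/10 fractions are assumptions added for illustration, since the abstract only states that splits are subject-exclusive.

```python
import hashlib


def assign_partition(age):
    """Map an age in years to the benchmark partition named in the abstract."""
    if age < 18:
        return "zero_shot_eval"     # never seen during training
    if age < 60:
        return "seen"               # pool for train / validation / test
    return "unseen_validation"      # 60+, used for model selection


def subject_split(subject_id, train_frac=0.8, val_frac=0.1):
    """Assign every image of one subject to the same split (hypothetical fractions)."""
    digest = hashlib.sha256(subject_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # deterministic value in [0, 1)
    if bucket < train_frac:
        return "train"
    if bucket < train_frac + val_frac:
        return "val"
    return "test"
```

Hashing the subject identifier rather than the image identifier is what keeps one person's images from landing in more than one split.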


2605.29229 2026-05-29 cs.AI

Tailoring the Curriculum: Student-Centered Reasoning Distillation via Dynamic Data-Model Compatibility

定制课程：通过动态数据-模型兼容性进行以学生为中心的推理蒸馏

Jiahao Huang, Fei Cheng, Junfeng Jiang, Akiko Aizawa

发表机构 * University of Tokyo（东京大学）； Kyoto University（京都大学）； National Institute of Informatics（国立情报学研究所）

AI总结提出数据-模型兼容性（DMC）指标，通过联合考虑数据质量、相对难度和学生能力来评估数据集对推理蒸馏的适用性，并基于DMC动态选择数据以提升蒸馏性能。


AI中文摘要

推理蒸馏将复杂推理能力从大型语言模型（LLMs）转移到较小的模型，但其成功取决于训练数据与学生模型的匹配程度。本文引入了数据-模型兼容性（DMC）指标，可用于评估数据集在学生模型上进行推理蒸馏的适用性。DMC通过联合考虑数据质量、相对难度和学生能力来提供评估。我们从两个角度验证了DMC的有效性：（1）DMC与推理蒸馏性能表现出强相关性；（2）使用DMC作为数据选择标准可提高推理蒸馏性能。这两个发现在多个学生模型和任务上均得到一致证明。此外，由于每个数据集的DMC在训练过程中动态变化，我们的实验表明，基于DMC动态选择数据集可以进一步提升性能。

英文摘要

Reasoning distillation transfers complex reasoning abilities from large language models (LLMs) to smaller ones, yet its success depends on how well the training data align with the student model. This paper introduces the Data-Model Compatibility (DMC) metric, which can be used to assess the suitability of a dataset for reasoning distillation on a student model. DMC provides an assessment by jointly considering data quality, relative difficulty, and student capability. We validated the effectiveness of DMC from two perspectives: (1) DMC exhibits a strong correlation with reasoning distillation performance; and (2) using DMC as the criterion for data selection leads to improved reasoning distillation performance. Both findings are consistently demonstrated across multiple student models and tasks. Moreover, since the DMC of each dataset dynamically changes during training, our experiments demonstrate that dynamically selecting datasets based on DMC can further enhance performance.


2605.29225 2026-05-29 cs.AI

BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents

BenchTrace: 用于测试LLM智能体反思能力和受控进化的基准

Jiahao Huang, Fei Cheng, Junfeng Jiang, Zefan Yu, Akiko Aizawa

发表机构 * University of Tokyo（东京大学）； Kyoto University（京都大学）； National Institute of Informatics（国立情报学研究所）

AI总结提出BenchTrace基准，通过反思评估和进化评估两个任务，结合失败避免率(FAR)指标，系统评估LLM智能体的自我进化能力，实验发现当前模型在反思诊断和泛化上存在显著瓶颈。


AI中文摘要

自我进化智能体通过反思过去失败来随时间改进，但现有评估存在两个局限：仅衡量任务得分，无法反映反思质量；且依赖智能体自身的回合运行，缺乏针对特定失败模式的机制。我们提出BenchTrace，一个用于评估LLM智能体自我进化能力的基准。BenchTrace基于包含1,821个带注释回合的快照反思数据集构建，涵盖六个多样化任务，包含反思评估（通过目标QA任务探测失败识别）和进化评估（在受控自我进化模拟中测试过去失败经验是否转化为回避行为）。基于BenchTrace，我们提出失败避免率（FAR），一种新的评估指标，衡量智能体成功避免目标失败实例的测试用例比例。使用Qwen3-32B和GPT-4.1的实验表明，两个模型在反思评估上的端到端通过率均低于30%，其中诊断是主要瓶颈。进化评估显示，自我进化方法通常比非进化基线提高FAR，但随着噪声回合累积，智能体会遗忘早期教训，且无法将反思泛化到特定情境之外，导致跨任务情境的负迁移。我们的相关性分析进一步揭示，只有完全正确的反思与更高的FAR强相关。BenchTrace揭示了当前自我进化方法的具体局限，并提供了一个受控的、模型无关的针对性评估框架。

英文摘要

Self-evolving agents improve over time by reflecting on past failures, but existing evaluation is limited in two ways: it measures only task scores, leaving reflection quality unknown, and it relies on agents' own episode runs, offering no mechanism to target specific failure patterns. We present BenchTrace, a benchmark for evaluating self-evolution ability in LLM agents. BenchTrace is built on a snapshot-reflection dataset of 1,821 annotated episodes spanning six diverse tasks, and comprises a Reflection Evaluation that probes failure identification through targeted QA tasks, and an Evolution Evaluation that tests whether past failure experience translates into avoidance behavior in a controlled self-evolution simulation. Building on BenchTrace, we propose failure avoidance rate (FAR), a new evaluation metric measuring the fraction of test cases in which the agent successfully avoids the target failure instance. Experiments with Qwen3-32B and GPT-4.1 reveal that both models fall below a 30% end-to-end pass rate on reflection evaluation, with diagnosis as the primary bottleneck. Evolution evaluation shows that self-evolution methods generally improve FAR over the non-evolving baseline, but agents forget early lessons as noise episodes accumulate, and agents fail to generalize their reflections beyond the specific context, causing negative transfer across task contexts. Our correlation analysis further reveals that only a fully correct reflection is strongly associated with higher FAR. BenchTrace exposes concrete limits of current self-evolution approaches and provides a controlled, model-agnostic framework for targeted evaluation.
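
The failure avoidance rate is defined in the abstract as the fraction of test cases in which the agent avoids the target failure. A minimal Python sketch of that definition follows; the function name and the boolean encoding of outcomes are assumptions for illustration.

```python
def failure_avoidance_rate(outcomes):
    """Fraction of test cases in which the target failure was avoided.

    `outcomes` is an iterable of booleans, True meaning the agent avoided
    the target failure instance in that test case.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("FAR is undefined for an empty test set")
    return sum(outcomes) / len(outcomes)
```

For example, three avoided failures out of four test cases give a FAR of 0.75.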


2605.29224 2026-05-29 cs.CL cs.AI cs.CR

Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents

相关性即漏洞：网络检索如何削弱LLM智能体的安全对齐

Aditya Nawal, Manit Baser, Mohan Gurusamy

发表机构 * Department of Electrical and Computer Engineering（电子与计算机工程系）； National University of Singapore（新加坡国立大学）

AI总结本文提出AgentREVEAL框架，分析检索集成方式和内容属性如何导致LLM智能体安全退化，发现相关性是共同激活条件，并引入HarmURLBench基准。


AI中文摘要

AI智能体通过外部工具（如网络检索）增强大型语言模型，使其能够提供基于事实和最新的响应。然而，将外部内容纳入生成流程可能会削弱控制模型输出的安全对齐机制。先前的研究表明，在智能体中启用检索会增加对有害请求的遵从性。我们提出了AgentREVEAL，一个用于分析LLM智能体中检索诱导的安全退化的诊断框架。该框架考察两个维度：检索如何集成到智能体流程中，以及检索内容的属性。在集成维度上，我们发现将工具调用和响应生成绑定在单一步骤中会放大有害输出。在内容维度上，我们揭示了安全来源悖论：即使是对立或安全导向的来源（例如包含警告或风险免责声明的页面），与无检索基线相比，也会使有害遵从性平均增加25%。最后，我们表明相关性是这两种漏洞的共同激活条件。类似模式出现在前沿闭源模型上，并且在几种代表性流程干预下，有害遵从性仍然保持较高水平，一些智能体在自主检索下也会进入这种状态。由于相关性也是使检索有用的原因，这些结果揭示了检索增强智能体的安全-效用权衡。我们引入了HarmURLBench，一个包含1,405个真实世界URL和320个有害行为的基准，以支持未来的评估。

英文摘要

AI agents augment large language models with external tools such as web retrieval, enabling grounded and up-to-date responses. However, incorporating external content into the generation pipeline can weaken the safety alignment mechanisms that govern model outputs. Prior work shows that enabling retrieval in agents increases compliance with harmful requests. We introduce AgentREVEAL, a diagnostic framework for analyzing retrieval-induced safety degradation in LLM agents. The framework examines two axes: how retrieval is integrated into the agent pipeline and the properties of the retrieved content. Along the integration axis, we find that binding tool invocation and response generation in a single step amplifies harmful outputs. Along the content axis, we uncover the Safe Source Paradox: even oppositional or safety-oriented sources, such as pages containing warnings or risk disclaimers, can increase harmful compliance by an average of 25% compared to the no-retrieval baseline. Finally, we show that relevance acts as a shared activation condition for both vulnerabilities. Similar patterns appear on frontier closed models, and harmful compliance remains elevated under several representative pipeline interventions, with some agents also entering this regime under autonomous retrieval. Because relevance is also what makes retrieval useful, these results expose a safety-utility trade-off for retrieval-enabled agents. We introduce HarmURLBench, a benchmark containing 1,405 real-world URLs paired with 320 harmful behaviors to support future evaluations.


2605.29221 2026-05-29 cs.CV

An Approach for Thyroid Nodule Analysis Using Thermographic Images

使用热成像图像进行甲状腺结节分析的方法

J. R. González, É. O. Rodrigues, C. P. Damião, C. A. P. Fontes, A. C. Silva, A. C. Paiva, H. Li, C. Du, A. Conci

发表机构 * Computer Science Department, Universidade Federal Fluminense（弗鲁米嫩塞联邦大学计算机科学系）； Radiology Department, Hospital Universitário Antônio Pedro (HUAP)（安东尼奥·佩德罗大学医院放射科）； Applied Computation Group NCA-UFMA, Universidade Federal do Maranhão（马拉尼昂联邦大学应用计算组NCA-UFMA）

AI总结本文综述了热成像在甲状腺分析中的应用，提出图像采集协议和自主配准方法，并通过特征提取、图像处理和分类方法区分健康与患病患者。

Journal ref Application of Infrared to Biomedical Sciences 2017


DOI: 10.1007/978-981-10-3147-2_26

AI中文摘要

据预测，到2030年，甲状腺癌将成为女性中第二常见的癌症类型，男性中第三常见。一般来说，早期检测癌症可提高个体生存机会。热成像是一种诊断工具，越来越多地用于检测癌症和异常，包括甲状腺异常。已有多种方法被提出用于分割和检测热成像图中的热区域，从而检测这些图像中存在的可疑组织。众所周知，医学诊断会产生大量信息。因此，医生必须在短时间内全面分析和评估这些信息，这在大多数情况下是不可行的。在这项工作中，我们对热成像进行了全面综述，重点关注甲状腺分析。我们提出了图像采集协议和甲状腺图像的自主配准方法。我们还对图像数据进行了分析，包括特征提取、图像处理以及一种可能的健康或非健康患者分类方法。总之，这项工作提出了在我们大学医院检测肿瘤的试点项目，这是支持我们内分泌科预防性医疗行动的一部分。经过一些未来调整后，该项目将提交给弗鲁米嫩塞联邦大学安东尼奥·佩德罗大学医院（HUAP-UFF）的伦理与研究委员会以及巴西卫生部伦理委员会审批，项目名称为：评估热成像在HUAP-UFF患者甲状腺结节诊断辅助中的重要性（葡萄牙语：Avaliação da importância da termografia no auxílio à investigação diagnóstica de nódulos tireoidianos em pacientes acompanhados no HUAP-UFF）。

英文摘要

Thyroid cancer is projected to be the second most common type of cancer in women and the third in men by 2030. In general, detecting cancer in its early stages improves the individual's chance of survival. Thermography is a diagnostic tool that has been increasingly used to detect cancer and abnormalities, including those of the thyroid. Various methods have been proposed to segment and detect hot regions in thermograms and, consequently, to detect suspicious tissues present in these images. It is well known that medical diagnosis yields a great deal of information. Thus, physicians have to comprehensively analyse and evaluate this information in a short period of time, which is infeasible in most cases. In this work, we perform a general review of thermography, focusing on thyroid analysis. We propose protocols for image acquisition and an autonomous registration for thyroid images. We also perform analyses of the image data, which include feature extraction, image processing, and a possible approach for classification of healthy or unhealthy patients. In summary, this work presents a pilot project for detection of tumors in our university hospital, which is part of an effort to support preventive medical actions in our endocrinology department. After some future adjustments, this project will be submitted for approval by the ethics and research committee of Hospital Universitário Antônio Pedro at Universidade Federal Fluminense (HUAP-UFF) and to the Brazilian Ministry of Health Ethical committee under the name: Evaluation of the importance of thermography to aid diagnosis of thyroid nodules of patients in HUAP-UFF (in Portuguese: Avaliação da importância da termografia no auxílio à investigação diagnóstica de nódulos tireoidianos em pacientes acompanhados no HUAP-UFF).


2605.29220 2026-05-29 cs.CV

Motion-guided sparse correction enables expert-quality point tracking across diverse microscopy regimes

运动引导的稀疏校正实现跨不同显微镜体制的专家级点跟踪

Leonidas Zimianitis, Pasindu Thenahandi, Kai Buckhalter, Dineth Jayakody, Julian O. Kimura, Xinyue Liang, Karen Cunningham, Azeem Ahmad, Balpreet S. Ahluwalia, Sampath Jayarathna, Nikos Chrisochoides, Brandon Weissbourd, Dushan N. Wadduwage

发表机构 * Department of Computer Science, Old Dominion University（欧道明大学计算机科学系）； Department of Biology, Massachusetts Institute of Technology（麻省理工学院生物学系）； The Picower Institute for Learning and Memory, Massachusetts Institute of Technology（麻省理工学院皮考尔学习与记忆研究所）； Department of Physics and Technology, UiT--The Arctic University of Norway（挪威北极大学物理与技术系）； Department of Physics, University of Oslo（奥斯陆大学物理系）； School of Data Science, Old Dominion University（欧道明大学数据科学学院）； Department of Physics, Old Dominion University（欧道明大学物理系）

AI总结提出RIPPLE方法，通过运动引导的稀疏校正，在多种显微镜视频中实现专家级点跟踪，将手动标注工作量减少3至25倍。


AI中文摘要

在显微镜视频中跟踪非规范生物系统的动力学仍然是一个持续的挑战。经典和基于学习的跟踪器都需要专家审查的数据来进行评估和适应，然而详尽的手动标注很少能扩展到最需要这些工具的视频中。我们开发了RIPPLE（点位置估计的细化插值平台），它将标注重新定义为稀疏校正：用户点击一个起始点，RIPPLE提出完整的轨迹，用户仅在轨迹偏离时进行干预。我们在来自本实验室的五个具有挑战性的显微镜数据集上测试了RIPPLE，其中四个来自透明水母Clytia hemisphaerica，一个跟踪快速移动精子上的标志点。在这些数据集中，RIPPLE达到了详尽手动标注的质量，同时将手动点击次数减少了3至25倍。因此，RIPPLE填补了手动标注和全自动跟踪之间的缺失层，使得能够立即量化生物动力学、进行方法基准测试，并生成适配未来自动显微镜跟踪器所需的金标准数据。

英文摘要

Tracking the dynamics of non-canonical biological systems in microscopy videos remains a persistent challenge. Both classical and learning-based trackers depend on expert-reviewed data to be evaluated and adapted, yet exhaustive manual annotation rarely scales to the videos where these tools are needed most. We developed RIPPLE (Refinement Interpolation Platform for Point Location Estimation), which recasts annotation as sparse correction: a user clicks a starting point, RIPPLE proposes a full trajectory, and the user intervenes only where the trajectory drifts. We tested RIPPLE on five challenging microscopy datasets from our laboratories, four from the transparent jellyfish Clytia hemisphaerica and one tracking landmarks on rapidly moving sperm. Across these, RIPPLE matched the quality of exhaustive manual annotation while reducing manual clicks by 3 to 25 times across datasets. RIPPLE thereby fills a missing layer between manual annotation and fully automated tracking, enabling immediate quantification of biological dynamics, method benchmarking, and the production of the gold-standard data needed to adapt future automated microscopy trackers.


2605.29218 2026-05-29 cs.AI cs.CL

GTA: Generating Long-Horizon Tasks for Web Agents at Scale

GTA：大规模生成面向Web智能体的长程任务

Tenghao Huang, Kung-Hsiang Huang, Prafulla Kumar Choubey, Yilun Zhou, Muhao Chen, Jonathan May, Chien-Sheng Wu

发表机构 * University of Southern California（南加州大学）； Salesforce AI Research（Salesforce人工智能研究）； University of California, Davis（加州大学戴维斯分校）

AI总结提出GTA框架，通过集成爬取、检索式种子生成、上下文内生成和自动质量控制，为Web智能体生成带可执行轨迹的真实长程任务，解决现有基准缺乏过程监督和可扩展性问题。

Comments Published at Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics


AI中文摘要

Web智能体将语言模型与浏览和工具使用能力相结合，有望成为开放的Web助手。然而，进展日益受到缺乏可扩展的过程级监督的限制。现有基准大多为手动构建，仅提供粗略的起始-目标注释，缺乏中间轨迹，而最近的自动生成方法仍然昂贵、有偏且浅显。这些限制阻碍了对必须泛化到现实、多跳、跨页面任务的智能体进行可靠训练和评估。我们引入了一个可扩展的框架GTA，它集成了爬取、基于检索的种子生成、上下文内生成和自动质量控制，以生成与可执行轨迹配对的真实任务。该设计将爬取与生成解耦以提高效率，将任务基于站点图以强制组合性，并通过确定性重放和系统验证确保密集监督。我们在超过50个涵盖电子商务、政府、论坛和新闻的网站上实例化了该流程，并具有多语言和多跳覆盖。由此产生的基准揭示了显著的人机性能差距，并实现了详细的诊断。我们的贡献有三方面：（i）形式化多跳Web智能体任务生成，（ii）提出一个高效且经过验证的自动数据创建流程，以及（iii）发布一个具有可重复评估的动态基准。

英文摘要

Web agents, which couple language models with browsing and tool-use capabilities, show promise as open web assistants. Yet progress is increasingly limited by the lack of scalable, process-level supervision. Existing benchmarks are largely manually constructed, providing only coarse start-goal annotations without intermediate trajectories, while recent automatic generation efforts remain expensive, biased, and shallow. These limitations prevent reliable training and evaluation of agents that must generalize to realistic, multi-hop, cross-page tasks. We introduce a scalable framework, GTA, that integrates crawling, retrieval-based seeding, in-context generation, and automated quality control to produce realistic tasks paired with executable trajectories. This design decouples crawling from generation for greater efficiency, grounds tasks in the site graph to enforce compositionality, and ensures dense supervision through deterministic replays and systematic validation. We instantiate the pipeline on over 50 websites covering e-commerce, government, forums, and news, with multilingual and multi-hop coverage. The resulting benchmark reveals a significant human-agent performance gap and enables detailed diagnostics. Our contributions are three-fold: (i) formalizing multi-hop web-agent task generation, (ii) proposing an efficient and validated pipeline for automatic data creation, and (iii) releasing a dynamic benchmark with reproducible evaluation.


2605.29217 2026-05-29 cs.CV

Towards the automated segmentation of epicardial and mediastinal fats: A multi-manufacturer approach using intersubject registration and random forest

朝向心外膜和纵隔脂肪的自动分割：一种使用跨受试者配准和随机森林的多厂商方法

É. O. Rodrigues, A. Conci, F. F. C. Morais, M. G. Pérez

发表机构 * Institute of Computing（计算研究所）； Institute of Medicine（医学研究所）； Fac. de Ing. en Sist. Electr. e Ind.（系统、电子与工业工程学院）； Universidade Federal Fluminense（弗鲁米嫩塞联邦大学）； Universidade Federal do Rio de Janeiro（里约热内卢联邦大学）； Universidad Técnica de Ambato（安巴托技术大学）

AI总结提出一种基于跨受试者配准和随机森林的全自动方法，用于分割CT图像中的心外膜和纵隔脂肪，平均准确率达98.4%，Dice相似指数为96.8%。

Journal ref 2015 IEEE International Conference on Industrial Technology (ICIT)


DOI: 10.1109/ICIT.2015.7125355

AI中文摘要

心脏周围的脂肪量与多种健康风险因素相关，如颈动脉僵硬度、冠状动脉钙化、心房颤动、动脉粥样硬化、癌症发病率等。此外，心脏脂肪的变化与受试者的总体脂肪无关，因此对这些脂肪组织进行定量分析十分必要。临床决策支持系统是能够评估信息并提供相应诊断或数据以补充医生分析的计算机程序。本工作的目的是提出一种方法，能够在通过用于冠状动脉钙化评分的标准采集协议获得的CT图像上，全自动分割两种由心包隔开的心脏脂肪组织。我们致力于减少用户干预并提高可重复性。本文提出的方法包括配准（将输入图像粗略调整到标准）、提取与像素及其周围区域相关的特征，以及基于数据挖掘分类算法的分割步骤，该算法判断输入像素是否属于某一类型。实验表明，心外膜和纵隔脂肪的平均准确率达到98.4%，平均真阳性率为96.2%。平均Dice相似指数为96.8%。

英文摘要

The amount of fat surrounding the heart is correlated with several health risk factors such as carotid stiffness, coronary artery calcification, atrial fibrillation, atherosclerosis, cancer incidence and others. Furthermore, cardiac fat varies independently of the subject's overall fat, which makes quantitative analysis of these adipose tissues essential. Clinical decision support systems are computer programs capable of evaluating information and providing a corresponding diagnosis or data to complement the physicians' analyses. The aim of this work is to propose a method capable of fully automatically segmenting two types of cardiac adipose tissue, separated from each other by the pericardium, on CT images obtained by the standard acquisition protocol used for coronary calcium scoring. Much effort was devoted to minimizing user intervention and easing reproducibility. The methodology proposed in this work consists of a registration step, which roughly adjusts input images to a standard, an extraction of features related to pixels and their surrounding area, and a segmentation step based on data mining classification algorithms that decide whether an incoming pixel is of a certain type. Experiments showed that the achieved mean accuracy for the epicardial and mediastinal fats was 98.4% with a mean true positive rate of 96.2%. On average, the Dice similarity index was 96.8%.
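
The three figures reported above (accuracy, true positive rate, Dice similarity index) all derive from the same per-pixel confusion counts. The Python sketch below shows the standard definitions on flat binary masks; the function names and the flat-list input format are assumptions for illustration and are not taken from the paper.

```python
def confusion_counts(predicted, truth):
    """Count true/false positives and negatives over two binary masks."""
    tp = fp = fn = tn = 0
    for p, t in zip(predicted, truth):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif not p and t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def segmentation_metrics(predicted, truth):
    """Accuracy, true positive rate, and Dice index for one class."""
    tp, fp, fn, tn = confusion_counts(predicted, truth)
    return {
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
        "true_positive_rate": _ratio(tp, tp + fn),
        "dice": _ratio(2 * tp, 2 * tp + fp + fn),
    }
```

Unlike accuracy, the Dice index ignores true negatives, so it is less flattered by the large background region in a CT slice.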


2605.29212 2026-05-29 cs.CV cs.HC

MetaRanker: Human-in-the-loop Active Ranking for Metalens Image Quality

MetaRanker：用于超透镜图像质量的人机协同主动排序

Yujin Park, Haejun Chung, Ikbeom Jang

发表机构 * Hanyang University（汉阳大学）； Hankuk University of Foreign Studies（韩国外国语大学）

AI总结提出MetaRanker框架，通过人机协同主动排序，以语义可解释性为指标评估超透镜图像质量，减少80%人工标注量，并实现与人类评估高度一致的排序。

Comments 12 pages, 6 figures


AI中文摘要

现代成像系统中的图像质量源于传感器、光学元件和计算重建的耦合效应。超薄超透镜为实现光学模块的显著小型化提供了途径，但实际设计通常表现出明显的色差和视场相关像差，需要计算重建来补偿。在当前的超透镜流程中，重建模型通常使用基于失真的保真度目标（如PSNR）进行训练和选择，但这些代理指标与人类偏好和下游实用性的相关性较弱，反映了众所周知的感知-失真权衡。我们引入了MetaRanker，一种人机协同主动排序框架，以语义可解释性（定义为人类在存在光学伪影时可靠识别物体和结构的程度）来形式化超透镜图像质量。MetaRanker结合了概率偏好模型与不确定性感知的查询选择，并利用视觉-语言模型提供轻量级语义先验。重要的是，这些先验仅用于指导信息性比较的采样；人类判断始终是主要的监督信号。在具有不同退化特征的现实和合成超透镜数据集上，MetaRanker生成的排序与人类评估最为一致，同时相对于穷举成对评估，所需的成对标注数量减少了约80%。最后，我们表明标准图像质量评估指标在超透镜领域与人类可解释性的对齐有限，这使MetaRanker成为迈向基于感知的超透镜评估和协同设计的实际一步。

英文摘要

Image quality in modern imaging systems emerges from the coupled effects of the sensor, optics, and computational reconstruction. Ultra-thin metalenses offer a path toward substantial miniaturization of optical modules, but practical designs often exhibit pronounced chromatic and field-dependent aberrations that necessitate computational reconstruction. In current metalens pipelines, reconstruction models are commonly trained and selected using distortion-based fidelity objectives, such as PSNR, yet these proxies can be weakly correlated with human preference and downstream utility, reflecting the well-known perception--distortion trade-off. We introduce MetaRanker, a human-in-the-loop active ranking framework that formalizes metalens image quality in terms of semantic interpretability, defined as the degree to which humans can reliably recognize objects and structures in the presence of optical artifacts. MetaRanker combines a probabilistic preference model with uncertainty-aware query selection, and leverages vision--language models to provide lightweight semantic priors. Importantly, these priors are used only to guide the sampling of informative comparisons; human judgments remain the primary supervision signal throughout. Across real-world and synthetic metalens datasets with distinct degradation profiles, MetaRanker produces rankings that align most closely with human assessments, while reducing the number of pairwise annotations required by approximately 80% relative to exhaustive pairwise evaluation. Finally, we show that standard image quality assessment metrics exhibit limited alignment with human interpretability in the metalens domain, positioning MetaRanker as a practical step toward perceptually grounded metalens evaluation and co-design.


2605.29202 2026-05-29 cs.LG

Auditing Training Data in Generative Music Models via Black-Box Membership Inference

通过黑盒成员推断审计生成音乐模型中的训练数据

Yi Chen Liu, Jiawei Yu, Kexin Cao, Syed Irfan Ali Meerza, Trishika Movva, Jian Liu

发表机构 * University of Georgia（佐治亚大学）； Independent Researcher（独立研究者）； University of Tennessee（田纳西大学）

AI总结本文提出一种黑盒成员推断方法，通过比较候选音频与模型基于其描述生成输出的语义对齐程度，并训练音乐审计器分类成员身份，实现对生成音乐模型训练数据的高精度审计。

Comments The paper has been accepted for presentation at the workshop ArtSec 2026: Workshop on Artwork Security and Provenance in the Age of AI


AI中文摘要

近期文本到音乐生成的进展实现了结构化音乐音频的高保真合成，引发了对数据来源、同意和训练透明度的日益关注。这些模型通常在很少披露的大规模语料库上训练，没有实际机制来验证特定音频样本是否包含在训练中。在本文中，我们研究了生成音乐模型的黑盒成员推断，旨在仅通过查询部署系统来确定候选音乐样本是否在训练中使用。我们的关键见解是，训练成员身份会导致候选样本与模型基于其描述生成的结果之间系统性地更强的语义和结构对齐。我们使用相关描述查询目标模型，并在学习特征空间中测量候选音频与生成输出之间的关系。为了捕捉区分成员和非成员的特征，我们构建了由每个曲目及其基于描述生成的影子模型组成的配对示例，并训练音乐审计器分类成员身份。该审计器捕捉训练成员身份特有的对齐模式，并在完全黑盒设置下泛化到未见过的目标模型，无需访问模型参数或训练元数据。在多个最先进的音乐生成器上，我们的方法达到了高达98.6%的准确率，假阳性和假阴性率低至1.9%和1.0%，表明在现实部署场景中可靠的训练数据审计是可行的。

英文摘要

Recent advances in text-to-music generation enable high-fidelity synthesis of structured musical audio, raising growing concerns about data provenance, consent, and training transparency. These models are typically trained on large-scale corpora with little disclosure, leaving no practical mechanism to verify whether a particular audio sample was included in training. In this paper, we investigate black-box membership inference for generative music models, aiming to determine whether a candidate music sample was used during training, given only query access to the deployed system. Our key insight is that training membership induces systematically stronger semantic and structural alignment between a candidate sample and the model's generation conditioned on its caption. We query the target model with the associated caption and measure the relationship between the candidate audio and the generated output in a learned feature space. To capture features that separate members from non-members, we construct paired examples consisting of each track and its caption-conditioned generation from shadow models, and train a music auditor to classify membership. The auditor captures alignment patterns characteristic of training membership and generalizes to unseen target models in a fully black-box setting without access to model parameters or training metadata. Across multiple state-of-the-art music generators, our method achieves up to 98.6% accuracy, with false-positive and false-negative rates as low as 1.9% and 1.0%, demonstrating that reliable training-data auditing is feasible in realistic deployment scenarios.


2605.29194 2026-05-29 cs.LG cs.AI cs.NA math.NA

Stochastic Lifting for Generating Trajectories of Stochastic Physical Systems

随机提升：生成随机物理系统轨迹

Jules Berman, Tobias Blickhan, Benjamin Peherstorfer

发表机构 * Courant Institute of Mathematical Sciences, New York University, New York, NY 10012, USA（Courant数学科学研究所，纽约大学，纽约，NY 10012，USA）

AI总结提出随机提升方法，通过为每个状态转换附加独立高维随机标签并学习从当前状态和标签到下一状态的映射，以生成多样化的随机物理系统轨迹。

2605.29192 2026-05-29 cs.AI cs.CL

ReasonOps: Operator Segmentation for LLM Reasoning Traces

ReasonOps: 大语言模型推理轨迹的算子分割

Daniel Lee, Owen Queen, James Zou

发表机构 * Stanford University（斯坦福大学）

AI总结提出无监督方法ReasonOps，从思维链轨迹中提取7种通用推理算子，揭示模型推理结构并用于模型识别与正确性预测。


AI中文摘要

大型推理模型的思维链轨迹可长达数万token，但我们缺乏描述其内部结构的词汇。以往用于分析思维链轨迹的方法要么过于僵化，要么表达能力不足，无法捕捉跨领域和跨模型的特征。为解决此问题，我们开发了ReasonOps，一种无监督、表达力强的方法，用于注释思维链轨迹，提供简洁的通用算子。利用ReasonOps，我们分析了来自12个思考型LLM（涵盖6个家族、8个推理基准）的44,662条轨迹，发现它们共享一个共同的组合结构：7个反复出现的推理算子——语篇层面的动作，如回溯、推理和假设——这些算子从句子开头的3-token枢轴的无监督聚类中涌现。这些算子出现在每个模型家族和基准领域，由三个独立的LLM评判员对留出样本进行分类，准确率达70-76%。我们分析了算子在简单与困难问题上的结构，发现反思性算子在困难问题上更有帮助，而在简单问题上则损害性能。算子序列具有高度的模型识别性：仅基于算子分布训练的分类器能以宏AUC恢复源模型，揭示每个模型家族具有独特的推理指纹。结构化的算子特征在问题内答案正确性预测上远高于基线。基于这些算子构建的分类器在WP-AUC上达到，特别是在AIME上。ReasonOps还能够在轨迹完成前进行早期质量估计：我们仅用50%的轨迹就能在WP-AUC上进行预测。ReasonOps流程是无监督且无需标注的，能够深入洞察LLM推理轨迹，并在模型识别和正确性预测方面取得强大的下游结果。

英文摘要

Chain-of-thought traces from large reasoning models can span tens of thousands of tokens, yet we lack a vocabulary for describing their internal structure. Previous methods developed to analyze chain-of-thought traces are either too rigid or not expressive enough, failing to capture features across domains and models. To remedy this, we develop ReasonOps, an unsupervised, expressive method for annotating chain-of-thought traces, providing succinct universal operators. Using ReasonOps, we analyze 44,662 traces from 12 thinking LLMs spanning 6 families across 8 reasoning benchmarks and discover that they share a common compositional structure: 7 recurring reasoning operators -- discourse-level moves such as backtracking, inferring, and hypothesizing -- that emerge from unsupervised clustering of sentence-initial 3-token pivots. These operators appear across every model family and benchmark domain, confirmed by three independent LLM judges who classify held-out samples at 70-76% accuracy. We analyze the structure of operators on easy vs. hard problems, revealing that reflective operators are more helpful on hard problems and harm performance on easy problems. Operator sequences are highly model-identifying: a classifier trained on operator distributions alone recovers the source model with macro-AUC, revealing that each model family has a distinctive reasoning fingerprint. Structural operator features predict within-problem answer correctness well above baselines. Classifiers built on these operators reach WP-AUC and on AIME specifically. ReasonOps further enables early quality estimation well before the trace completes: we predict at WP-AUC for only 50% of the trace. The ReasonOps pipeline is unsupervised and annotation-free, enabling deep insights into LLM reasoning traces as well as strong downstream results on model identification and correctness prediction.
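
To make "sentence-initial 3-token pivots" concrete, the Python sketch below pulls the first three word tokens from each sentence of a trace and counts them. It is only an illustration: the regex sentence splitter, the word-level tokenization, and the function names are assumptions, and the clustering step that turns pivots into operators is not shown.

```python
import re
from collections import Counter


def sentence_pivots(trace, n_tokens=3):
    """Return the first `n_tokens` lowercase word tokens of each sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", trace.strip())
    pivots = []
    for sentence in sentences:
        tokens = re.findall(r"[A-Za-z']+", sentence.lower())
        if len(tokens) >= n_tokens:  # skip sentences too short to form a pivot
            pivots.append(" ".join(tokens[:n_tokens]))
    return pivots


def pivot_histogram(traces, n_tokens=3):
    """Count pivots across many traces, most frequent first."""
    counts = Counter()
    for trace in traces:
        counts.update(sentence_pivots(trace, n_tokens))
    return counts.most_common()
```

Frequent pivots such as "wait let me" are the kind of surface cue that an unsupervised clustering step could group into a backtracking-style operator.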


2605.29190 2026-05-29 cs.LG cs.CL

When RL Suppresses Its Own Vocabulary: Recovering Reasoning Diversity in Puzzle-to-Math Transfer

当RL抑制自身词汇：在谜题到数学迁移中恢复推理多样性

Mayug Maniparambil, Arjun Karuvally, Terrence Sejnowski, Fergal Reid

发表机构 * Fin AI Research（Fin AI研究院）； Salk Institute for Biological Studies（索尔克生物研究所）

AI总结本文提出一种基于可验证奖励的强化学习框架，通过引入新颖性奖励机制恢复被抑制的探索性推理原语，实现从约束满足谜题到数学问题的跨领域迁移，在无需数学数据的情况下将OlymMATH-Hard的pass@32从16%提升至36%。

Comments Preprint


AI中文摘要

使用可验证奖励的强化学习（RLVR）改进了大语言模型的推理能力，但其跨领域迁移的条件及原因仍未被充分探索。我们研究了一个7B模型在仅使用约束满足谜题进行SFT和RL后训练（无数学问题）时的跨领域迁移。为了分析迁移如何产生，我们引入了一个推理原语级框架，该框架结合了9类跨度分类器和基序提取，使我们能够将思维链轨迹分割为原语基序，并追踪其在训练阶段和领域间的演变。我们发现，谜题SFT诱导了一个推理原语词汇，在OlymMATH-Hard上带来了+7pp的pass@32提升。随后，普通GSPO将这些原语组合成更长的计算-验证链，进一步增加了+6pp。然而，这个RL阶段也抑制了探索性原语，如“假设”和“回溯”。为了解决这个问题，我们引入了一个新颖性奖励，奖励多样化的正确轨迹，使用参考模型下的困惑度作为信号。这恢复了RL期间的恢复原语，并相对于普通GSPO额外增加了+7pp的pass@32。最终，端到端配方将硬数学能力上限从OLMo3-7B-Instruct-SFT基线的16.0%提升至36.0%，且在SFT或RL阶段未添加任何数学问题。

英文摘要

Reinforcement learning using verifiable rewards (RLVR) improves LLM reasoning, but the conditions under which it transfers across domains -- and why it does so -- remain under-explored. We study cross-domain transfer in a 7B model whose SFT and RL post-training stages use only constraint-satisfaction puzzles, with no mathematics problems in the post-training data. To analyze how transfer emerges, we introduce a reasoning primitive-level framework that combines a 9-class span classifier with motif extraction, allowing us to segment chain-of-thought traces into primitive motifs and track their evolution across training stages and domains. We find that puzzle SFT induces a reasoning-primitive vocabulary, yielding a +7pp pass@32 gain on OlymMATH-Hard. Vanilla GSPO then composes these primitives into longer compute-verify chains, adding a further +6pp. However, this RL stage also suppresses exploratory primitives such as hypothesize and backtrack. To address this, we introduce a novelty bonus that rewards diverse correct rollouts, using perplexity under the reference model as a signal. This restores recovery primitives during RL and adds a further +7pp pass@32 relative to vanilla GSPO. Finally, the end-to-end recipe raises the hard-math capability ceiling from 16.0% at the OLMo3-7B-Instruct-SFT base to 36.0%, without adding any mathematics problems during the SFT or RL stages.
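
The pass@32 numbers above measure whether at least one of 32 sampled solutions is correct. A common way to estimate pass@k from n samples with c correct is the unbiased estimator 1 - C(n-c, k) / C(n, k); the Python sketch below implements it. Whether this paper uses that estimator or simply draws exactly 32 samples is not stated in the abstract, so treat the choice as an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case every size-k draw contains a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

As a consistency check on the reported gains, +7pp, +6pp, and +7pp sum to 20 percentage points, which matches the stated move from 16.0% to 36.0%.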

2605.29188 2026-05-29 cs.CL

Slogans or Stance? A Label-Light Diagnostic for Entrepreneurial-Discourse Measurement on Chinese SOE Speeches

Ting Gong, Shangquan Sun

Affiliation: Tsinghua University

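AI summary: This paper proposes a label-light diagnostic that uses a natural experiment of different speakers at the same company to assess how validly dictionary methods, topic models, and embedding-similarity scorers measure "entrepreneurial spirit" in Chinese SOE speeches, and finds that a zero-shot large language model (Qwen3.5:9b) best distinguishes speaker identity.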

Comments 15 pages, 2 figures, 7 tables

English abstract

Dictionary methods, topic models, and embedding-similarity scorers are widely used in CSS and management research to measure constructs such as "entrepreneurial spirit" in corporate speeches. We contribute a label-light measurement diagnostic for such instruments rather than a new extraction model. On a corpus of 80 speeches by leaders of centrally administered Chinese state-owned enterprises, we exploit a natural experiment of 24 same-company different-speaker pairs and 5 same-company same-speaker pairs to test whether a method's per-document indices vary with leader identity holding firm constant. LDA fails (Cohen d=0.20, 95% CI [-0.72, 1.20]); a dictionary scorer reaches d=0.81 and a Chinese sentence encoder d=0.65 on doc-vector distances of order 10^-3. A zero-shot 9B open-weight LLM (Qwen3.5:9b) raises paired-contrast d to 1.09 (exact permutation p1=0.034). We downgrade three claims accordingly: gold F1 measures consistency with the LLM's own prompt rule rather than external construct recovery; doc-level style residualisation cuts the LLM's d to 0.43 (p1=0.22), so roughly half of the effect is consistent with leader idiolect; and a confidence-weighted calibration trades Delta for variance with an auto-mined slogan lexicon near-inert in ablation. We release the 2,190-segment scored corpus, the 170-paragraph pilot, the slogan lexicon, two-family LLM scores, and the evaluation harness.

2605.29184 2026-05-29 cs.LG cs.AI

Influence-Guided Symbolic Regression: Scientific Discovery via LLM-Driven Equation Search with Granular Feedback

Evgeny S. Saveliev, Samuel Holt, Nabeel Seedat, David L. Bentley, Jim Weatherall, Mihaela van der Schaar

Affiliations: University of Cambridge; Thomson Reuters Foundational Research; U. Colorado, Anschutz Medical Campus

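AI summary: Proposes Influence-Guided Symbolic Regression (IGSR), which uses a large language model to generate candidate functions, prunes them with granular influence scores, and combines this with Monte Carlo Tree Search to explore the combinatorial space efficiently, discovering new relationships on multiple benchmarks and real biological data.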

Comments ICML 2026

English abstract

Large Language Models (LLMs) offer a promising avenue for scientific discovery, yet their application to symbolic regression is often constrained by inefficient search strategies and coarse feedback signals. Current methods typically guide LLMs using scalar metrics (e.g., global Mean Squared Error), which fail to identify which components of a proposed equation are driving performance or causing error. We introduce Influence-Guided Symbolic Regression (IGSR), a method that frames equation discovery as an iterative two-step process combining diverse term generation with rigorous selection: an LLM generates candidate basis functions $\psi_j(\mathbf{x})$ for a linear model, which are then evaluated using granular influence scores $\Delta_j$. These scores quantify each term's marginal contribution to generalization accuracy, enabling an influence-guided pruning process that systematically refines the model structure. Integrating this mechanism into a Monte Carlo Tree Search (MCTS) enables navigating the combinatorial search space while balancing exploration of novel functional forms with exploitation of high-influence components. We demonstrate IGSR's effectiveness on a diverse suite of benchmarks, including LLM-SRBench, pharmacological PKPD models, an epidemiological simulation, and real-world genomic data. Notably, we validate the framework's capacity for genuine discovery in a case study using a high-dimensional biological dataset, in which IGSR identified a novel relationship between DNA methylation and RNA Polymerase II pausing, a hypothesis that was subsequently supported via wet-lab experimentation.
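One plausible reading of a per-term influence score is a leave-one-term-out ablation: refit the linear model without term j and record how much the held-out error grows. The sketch below illustrates that idea only. The abstract does not define $\Delta_j$ precisely, so the ablation rule, the function names, the tiny ridge term, and the synthetic data are all assumptions and not the authors' implementation.

```python
import math
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def fit(terms, xs, ys, ridge=1e-8):
    """Least-squares weights for y ~ sum_j w_j * psi_j(x) via the normal equations.
    The tiny ridge only guards against a singular system (an assumption here)."""
    F = [[psi(x) for psi in terms] for x in xs]
    m = len(terms)
    A = [[sum(row[i] * row[j] for row in F) + (ridge if i == j else 0.0)
          for j in range(m)] for i in range(m)]
    b = [sum(row[i] * y for row, y in zip(F, ys)) for i in range(m)]
    return solve(A, b)


def mse(terms, w, xs, ys):
    """Mean squared error of the weighted sum of terms on (xs, ys)."""
    preds = [sum(wj * psi(x) for wj, psi in zip(w, terms)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)


def influence_scores(terms, train, val):
    """Hypothetical Delta_j: held-out MSE without term j minus held-out MSE with all terms."""
    full = mse(terms, fit(terms, *train), *val)
    scores = []
    for j in range(len(terms)):
        rest = terms[:j] + terms[j + 1:]
        scores.append(mse(rest, fit(rest, *train), *val) - full)
    return scores


# Synthetic example: y = 2x + 3 sin(x) + noise; x**2 is a spurious candidate term.
random.seed(0)
def sample(n):
    xs = [random.uniform(-3.0, 3.0) for _ in range(n)]
    ys = [2.0 * x + 3.0 * math.sin(x) + random.gauss(0.0, 0.1) for x in xs]
    return xs, ys

terms = [lambda x: x, math.sin, lambda x: x * x]
scores = influence_scores(terms, sample(80), sample(40))
print([round(s, 4) for s in scores])
```

Under this reading, terms whose score is near zero or negative are pruning candidates, while a large positive score marks a term the model cannot do without. In the toy example the spurious `x**2` term should score close to zero.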
