arXivDaily: arXiv daily academic digest, updated Monday through Friday
2510.13774 2026-06-02 cs.LG cs.CV

UrbanFusion: Stochastic Multimodal Fusion for Contrastive Learning of Robust Spatial Representations

Dominik J. Mühlematter, Lin Che, Ye Hong, Martin Raubal, Nina Wiedemann

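Affiliations: ETH Zurich

AI summary: Proposes UrbanFusion, a model that integrates street view, remote sensing, map, and POI data through Stochastic Multimodal Fusion (SMF) and a Transformer module, outperforming existing GeoAI models on 41 tasks across 56 cities.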

Journal ref
International Conference on Machine Learning (ICML), 2026

Abstract

Forecasting urban phenomena such as housing prices and public health indicators requires the effective integration of various geospatial data. Current methods primarily utilize task-specific models, while recent generic models for spatial representations often support only limited modalities and lack multimodal fusion capabilities. To overcome these challenges, we present UrbanFusion, a spatial representation model that features Stochastic Multimodal Fusion (SMF). The framework employs modality-specific encoders to process different types of inputs, including street view imagery, remote sensing data, cartographic maps, and points of interest (POIs) data. These multimodal inputs are integrated via a Transformer-based fusion module that learns unified representations. An extensive evaluation across 41 tasks in 56 cities worldwide demonstrates UrbanFusion's strong generalization and predictive performance compared to state-of-the-art GeoAI models. Specifically, it 1) outperforms prior models on location-encoding, 2) allows multimodal input during inference, and 3) generalizes well to regions unseen during training. UrbanFusion can flexibly utilize any subset of available modalities for a given location during both pretraining and inference, enabling broad applicability across diverse data availability scenarios.

2510.12624 2026-06-02 cs.LG cs.AI

Learning-To-Measure: In-Context Active Feature Acquisition

Yuta Kobayashi, Zilin Jing, Jiayu Yao, Hongseok Namkoong, Shalmali Joshi

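Affiliations: University of Tokyo

AI summary: Proposes Learning-to-Measure (L2M), which addresses the meta active feature acquisition problem in-context through uncertainty quantification and greedy feature acquisition guided by conditional mutual information, with no per-task retraining.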

Abstract

Active feature acquisition (AFA) is a sequential decision-making problem where the goal is to improve model performance for test instances by adaptively selecting which features to acquire. In practice, AFA methods often learn from retrospective data with systematic missingness in the features and limited task-specific labels. Most prior work addresses acquisition for a single predetermined task, limiting scalability. To address this limitation, we formalize the meta-AFA problem, where the goal is to learn acquisition policies across various tasks. We introduce Learning-to-Measure (L2M), which consists of i) reliable uncertainty quantification over unseen tasks, and ii) an uncertainty-guided greedy feature acquisition agent that maximizes conditional mutual information. We demonstrate a sequence-modeling or autoregressive pre-training approach that underpins reliable uncertainty quantification for tasks with arbitrary missingness. L2M operates directly on datasets with retrospective missingness and performs the meta-AFA task in-context, eliminating per-task retraining. Across synthetic and real-world tabular benchmarks, L2M matches or surpasses task-specific baselines, particularly under scarce labels and high missingness.
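
The greedy acquisition rule mentioned in the abstract (pick the unobserved feature with the largest conditional mutual information with the label) can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes the joint distribution over features and label is available as an explicit table, whereas L2M obtains its uncertainty estimates from a pretrained sequence model. The names `cmi` and `next_feature` are hypothetical.

```python
import math
from collections import defaultdict


def entropy(dist):
    """Shannon entropy in bits of a {value: probability} mapping."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def matches(x, observed):
    """True if feature tuple x agrees with the observed {index: value} dict."""
    return all(x[i] == v for i, v in observed.items())


def conditional_label_dist(joint, observed):
    """P(y | observed) from a joint table {(feature_tuple, y): probability}."""
    totals = defaultdict(float)
    for (x, y), p in joint.items():
        if matches(x, observed):
            totals[y] += p
    z = sum(totals.values())
    return {y: p / z for y, p in totals.items()}


def cmi(joint, observed, j):
    """I(Y; X_j | observed) in bits: expected drop in label entropy."""
    value_mass = defaultdict(float)
    for (x, _), p in joint.items():
        if matches(x, observed):
            value_mass[x[j]] += p
    z = sum(value_mass.values())
    h_before = entropy(conditional_label_dist(joint, observed))
    h_after = sum(
        (mass / z) * entropy(conditional_label_dist(joint, {**observed, j: v}))
        for v, mass in value_mass.items()
    )
    return h_before - h_after


def next_feature(joint, observed, num_features):
    """Greedy step: the unobserved feature index with the largest CMI."""
    candidates = [j for j in range(num_features) if j not in observed]
    return max(candidates, key=lambda j: cmi(joint, observed, j))
```

In a table where the label copies feature 0 and feature 1 is independent noise, the greedy step selects feature 0 first, since observing it removes all label uncertainty.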

2510.12249 2026-06-02 cs.LG

Optimal Regularization for Performative Learning

Edwige Cyffers, Alireza Mirrokni, Marco Mondelli

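Affiliations: EPFL, Switzerland

AI summary: Studies how regularization in high-dimensional ridge regression copes with performative effects, where the data distribution shifts in response to the model; finds that performative effects can be beneficial under over-parameterization, and relates the optimal regularization parameter to the strength of the performative effect.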

Comments Accepted at ICML 2026

Abstract

In performative learning, the data distribution reacts to the deployed model - for example, because strategic users adapt their features to game it - which creates a more complex dynamic than in classical supervised learning. One should thus not only optimize the model for the current data but also take into account that the model might steer the distribution in a new direction, without knowing the exact nature of the potential shift. We explore how regularization can help cope with performative effects by studying its impact in high-dimensional ridge regression. We show that, while performative effects worsen the test risk in the population setting, they can be beneficial in the over-parameterized regime where the number of features exceeds the number of samples. We show that the optimal regularization scales with the overall strength of the performative effect, making it possible to set the regularization in anticipation of this effect. We illustrate this finding through empirical evaluations of the optimal regularization parameter on both synthetic and real-world datasets.

2510.10541 2026-06-02 cs.LG cs.AI

Rethinking RL Evaluation: Can Benchmarks Truly Reveal Failures of RL Methods?

Zihan Chen, Yiming Zhang, Hengguang Zhou, Zenghui Ding, Yining Sun, Cho-Jui Hsieh

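Affiliations: HFIPS, Chinese Academy of Sciences; University of Science and Technology of China; University of California, Los Angeles; Arena Project: RL-GAP.github.io

AI summary: Introduces a diagnostic suite and the Oracle Performance Gap (OPG) metric and finds that, on current benchmarks, training on the train split performs nearly as well as training on the test split, so the benchmarks cannot reliably separate RL methods; also shows that existing methods generalize poorly under distribution shift, varying difficulty, and counterfactual scenarios, and proposes three core principles for more faithful benchmark design.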

Abstract

Current benchmarks are inadequate for evaluating progress in reinforcement learning (RL) for large language models (LLMs). Despite recent benchmark gains reported for RL, we find that training on these benchmarks' training sets achieves nearly the same performance as training directly on the test sets, suggesting that the benchmarks cannot reliably separate further progress. To study this phenomenon, we introduce a diagnostic suite and the Oracle Performance Gap (OPG) metric that quantifies the performance difference between training on the train split versus the test split of a benchmark. We further analyze this phenomenon with stress tests and find that, despite strong benchmark scores, existing RL methods struggle to generalize across distribution shifts, varying levels of difficulty, and counterfactual scenarios: shortcomings that current benchmarks fail to reveal. We conclude that current benchmarks are insufficient for evaluating generalization and propose three core principles for designing more faithful benchmarks: sufficient difficulty, balanced evaluation, and distributional robustness.

2510.09608 2026-06-02 cs.CV cs.AI cs.CL

StreamingVLM: Real-Time Understanding for Infinite Video Streams

Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Yao Lu, Song Han

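Affiliations: MIT; NVIDIA

AI summary: Proposes StreamingVLM, a unified framework that aligns training with streaming inference and uses attention-sink state reuse with sliding windows for stable real-time understanding of infinite video streams; it reaches a 66.18% win rate on Inf-Streams-Eval at up to 8 FPS and also improves general VQA ability.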

Comments Published as a conference paper at ICLR 2026. The first two authors contributed equally to this work

Abstract

Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4o mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.
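
The cache policy described in the abstract (keep the attention sinks, a short window of recent vision tokens, and a long window of recent text tokens) can be pictured as a token-retention rule. The following is a minimal sketch under stated assumptions, not the code from the linked repository: the function name `select_kept_positions`, the `"vision"`/`"text"` labels, and the default window sizes are all hypothetical.

```python
from typing import List


def _tail(items: List[int], n: int) -> List[int]:
    """Last n items, or an empty list when n is zero or negative."""
    return items[-n:] if n > 0 else []


def select_kept_positions(
    token_kinds: List[str],
    num_sinks: int = 2,
    vision_window: int = 4,
    text_window: int = 8,
) -> List[int]:
    """Return the positions whose KV entries stay in the cache.

    token_kinds[i] is "vision" or "text" (hypothetical labels).
    The first num_sinks positions are treated as attention sinks and are
    always kept; beyond them, only the most recent vision and text tokens
    survive, each within its own window.
    """
    sinks = list(range(min(num_sinks, len(token_kinds))))
    rest = range(len(sinks), len(token_kinds))
    vision = [i for i in rest if token_kinds[i] == "vision"]
    text = [i for i in rest if token_kinds[i] == "text"]
    kept = sinks + _tail(vision, vision_window) + _tail(text, text_window)
    return sorted(kept)
```

Because the retained set is bounded by `num_sinks + vision_window + text_window`, the cache size stays constant however long the stream runs, which is the property the abstract relies on for stable latency.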

2510.08825 2026-06-02 cs.CL

Search-on-Graph: Iterative Informed Navigation for Large Language Model Reasoning on Knowledge Graphs

Jia Ao Sun, Hao Yu, Fabrizio Gotti, Fengran Mo, Yihong Wu, Yuchen Hui, Zhan Su, Lingfeng Xiao, Jian-Yun Nie

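Affiliations: Université de Montréal; McGill University; Digital Media, CBC; Halmstad University College; University of Waterloo

AI summary: Proposes Search-on-Graph, in which the large language model itself selects relation paths based on the knowledge graph structure and the complete reasoning history, following an observe-think-navigate paradigm; it outperforms existing methods on six KGQA benchmarks without task-specific fine-tuning.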

Comments Accepted to KDD '26 (32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining)

Abstract

Large language models (LLMs) augmented with knowledge graphs (KGs) offer a promising approach for knowledge-intensive reasoning. Central to this approach is the selection of appropriate reasoning paths in the KG. Yet, existing methods face a common limitation: reasoning path selection is often performed by separate modules using criteria that are only weakly connected to the reasoning requirements. This often results in selecting incorrect relations or premature pruning of relevant paths. We propose Search-on-Graph (SoG), a method that strengthens the connection between path selection and reasoning by having the LLM itself select which relations to follow, informed by both the available KG structure and the complete reasoning history. SoG follows an observe-think-navigate paradigm: at each step, the LLM observes the relational connections available at the current entity, reasons about which path best advances toward answering the question, and navigates accordingly. This context-aware navigation fully exploits the LLM's reasoning capabilities rather than relying on independent selection modules with surrogate criteria. Experiments on six knowledge graph question answering (KGQA) benchmarks demonstrate that SoG outperforms state-of-the-art methods while requiring no task-specific fine-tuning and generalizing across different KG schemas.

2508.17320 2026-06-02 cs.LG

AdaptiveK: Complexity-Driven Sparse Autoencoders for Interpretable Language Model Representations

Yifei Yao, Hanrong Zhang, Mengnan Du

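Affiliations: Zhejiang University; University of Illinois Chicago; The Chinese University of Hong Kong, Shenzhen

AI summary: Proposes AdaptiveK SAE, which dynamically adjusts sparsity according to the semantic complexity of each input and uses linear probes to guide feature allocation, outperforming fixed-sparsity approaches on reconstruction fidelity, explained variance, cosine similarity, and interpretability metrics.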

Comments Accepted by ACL 2026

Abstract

Understanding the internal representations of large language models (LLMs) remains a central challenge for interpretability research. Sparse autoencoders (SAEs) offer a promising solution by decomposing activations into interpretable features, but existing approaches rely on fixed sparsity constraints that fail to account for input complexity. We propose AdaptiveK SAE (Adaptive Top K Sparse Autoencoders), a novel framework that dynamically adjusts sparsity levels based on the semantic complexity of each input. Leveraging linear probes, we demonstrate that context complexity is linearly encoded in LLM representations, and we use this signal to guide feature allocation during training. Experiments across ten language models demonstrate that this complexity-driven adaptation outperforms fixed-sparsity approaches on reconstruction fidelity, explained variance, cosine similarity and interpretability metrics while eliminating the burden of extensive hyperparameter tuning. Our code is available at: https://github.com/hiyukie/adaptiveK.
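
The idea of a per-input sparsity level can be sketched as a Top-K activation whose K depends on a complexity score. This is an illustration only and not the code from the linked repository: the linear mapping from a score in [0, 1] to a value between `k_min` and `k_max` is an assumption, as are the function name `adaptive_topk` and its parameters. In the paper the complexity signal comes from a linear probe on LLM representations; here it is simply passed in.

```python
import numpy as np


def adaptive_topk(pre_acts, complexity, k_min=2, k_max=8):
    """Keep the k largest pre-activations per row, with k set per input.

    pre_acts:   array of shape (batch, num_features), encoder outputs.
    complexity: array of shape (batch,), scores assumed to lie in [0, 1].
    Returns the sparse activations and the k chosen for each row.
    """
    pre_acts = np.asarray(pre_acts, dtype=float)
    complexity = np.clip(np.asarray(complexity, dtype=float), 0.0, 1.0)
    # Assumed rule: interpolate linearly between k_min and k_max.
    ks = np.rint(k_min + complexity * (k_max - k_min)).astype(int)
    out = np.zeros_like(pre_acts)
    for row, k in enumerate(ks):
        if k <= 0:
            continue
        top = np.argsort(pre_acts[row])[-k:]  # indices of the k largest values
        out[row, top] = np.maximum(pre_acts[row, top], 0.0)  # ReLU on survivors
    return out, ks
```

A fixed-sparsity Top-K SAE corresponds to the special case `k_min == k_max`, which makes the contrast with the adaptive variant easy to test in isolation.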

2507.02983 2026-06-02 cs.CL cs.AI

Truth, Trust, and Trouble: Medical AI on the Edge

Mohammad Anas Azeez, Rafiq Ali, Ebad Shabbir, Zohaib Hasan Siddiqui, Gautam Siddharth Kashyap, Jiechao Gao, Usman Naseem

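Affiliations: Jamia Hamdard; DSEU-Okhla; Macquarie University; Center for SDGC, Stanford University

AI summary: Using a benchmarking framework of more than 1,000 health questions, evaluates Mistral-7B, BioMistral-7B-DARE, and AlpaCare-13B on honesty, helpfulness, and harmlessness. AlpaCare-13B has the highest accuracy (91.7%) and the best harmlessness (0.92); domain-specific tuning improves safety, few-shot prompting raises accuracy, and helpfulness drops on complex queries.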

Comments Accepted at EMNLP 2025 (Industry Track)

Abstract

Large Language Models (LLMs) hold significant promise for transforming digital health by enabling automated medical question answering. However, ensuring these models meet critical industry standards for factual accuracy, usefulness, and safety remains a challenge, especially for open-source solutions. We present a rigorous benchmarking framework using a dataset of over 1,000 health questions. We assess model performance across honesty, helpfulness, and harmlessness. Our results highlight trade-offs between factual reliability and safety among evaluated models -- Mistral-7B, BioMistral-7B-DARE, and AlpaCare-13B. AlpaCare-13B achieves the highest accuracy (91.7%) and harmlessness (0.92), while domain-specific tuning in BioMistral-7B-DARE boosts safety (0.90) despite its smaller scale. Few-shot prompting improves accuracy from 78% to 85%, and all models show reduced helpfulness on complex queries, highlighting ongoing challenges in clinical QA.

2510.05342 2026-06-02 cs.LG cs.AI

Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization

Hyung Gyu Rho

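Affiliations: Independent Researcher

AI summary: Proposes Margin-Adaptive Direct Preference Optimization (MADPO), which uses a reward model to estimate preference margins and adaptively weights the DPO loss for instance-level granular control, outperforming existing methods on a summarization task.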

Abstract

Direct Preference Optimization (DPO) has emerged as a simple and effective method for aligning large language models. However, its reliance on a fixed temperature parameter leads to suboptimal training on diverse preference data, causing overfitting on easy examples and under-learning from informative ones. Recent methods have emerged to counter this. While IPO addresses general overfitting, its uniform regularization can be overly conservative. The more targeted approach of $β$-DPO suffers from its own limitations: its batch-level adaptation applies a single, compromised temperature to mixed-margin pairs, its linear update rule can produce unstable negative $β$ values, and its filtering mechanism discards potentially useful training signals. In this work, we introduce Margin-Adaptive Direct Preference Optimization (MADPO), a method that provides a stable, data-preserving, and instance-level solution. MADPO employs a practical two-step approach: it first trains a reward model to estimate preference margins and then uses these margins to apply a continuous, adaptive weight to the DPO loss for each individual training sample. This re-weighting scheme creates an effective target margin that is amplified for hard pairs and dampened for easy pairs, allowing for granular control over the learning signal. We provide a comprehensive theoretical analysis, proving that MADPO has a well-behaved optimization landscape and is robust to reward model estimation errors. We validate our theory with experiments on a summarization task using human preference data. MADPO consistently outperforms strong baselines across a comprehensive sweep of decoding temperatures.

2510.04074 2026-06-02 cs.RO

Feedback Matters: Augmenting Autonomous Dissection with Visual and Topological Feedback

Chung-Pang Wang, Changwei Chen, Xiao Liang, Soofiyan Atar, Florian Richter, Michael Yip

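Affiliations: Department of Electrical and Computer Engineering, University of California San Diego

AI summary: Proposes a feedback-driven framework for autonomous tissue dissection that reasons about topological changes from endoscopic images, quantifies visibility, and actively manipulates tissue; combined with planning-based and learning-based methods, it significantly improves surgical autonomy and robustness.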

Abstract

Autonomous surgical systems must adapt to highly dynamic environments where tissue properties and visual cues evolve rapidly. Central to such adaptability is feedback: the ability to sense, interpret, and respond to changes during execution. While feedback mechanisms have been explored in surgical robotics, ranging from tool and tissue tracking to error detection, existing methods remain limited in handling the topological and perceptual challenges of tissue dissection. In this work, we propose a feedback-enabled framework for autonomous tissue dissection that explicitly reasons about topological changes from endoscopic images after each dissection action. This structured feedback guides subsequent actions, enabling the system to localize dissection progress and adapt policies online. To improve the reliability of such feedback, we introduce visibility metrics that quantify tissue exposure and formulate optimal controller designs that actively manipulate tissue to maximize visibility. Finally, we integrate these feedback mechanisms with both planning-based and learning-based dissection methods, and demonstrate experimentally that they significantly enhance autonomy, reduce errors, and improve robustness in complex surgical scenarios.

2510.03745 2026-06-02 cs.LG cs.NA math.NA

Neural Low-Discrepancy Sequences

Michael Etienne Van Huffel, Nathan Kirk, Makram Chahine, Daniela Rus, T. Konstantin Rusch

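Affiliations: MIT; CMU; Harvard University

AI summary: Proposes NeuroLDS, the first machine learning-based framework for generating finite low-discrepancy sequences, trained in two stages (supervised pretraining, then unsupervised fine-tuning), and significantly outperforming classical methods across several applications.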

Comments ICML 2026

Abstract

Low-discrepancy points are designed to efficiently fill the space in a uniform manner. This uniformity is highly advantageous in many problems in science and engineering, including in numerical integration, computer vision, machine perception, computer graphics, machine learning, and simulation. Whereas most previous low-discrepancy constructions rely on abstract algebra and number theory, Message-Passing Monte Carlo (MPMC) was recently introduced to exploit machine learning methods for generating point sets with lower discrepancy than previously possible. However, MPMC is limited to generating point sets and cannot be extended to low-discrepancy sequences (LDS), i.e., sequences of points in which every prefix has low discrepancy, a property essential for many applications. To address this limitation, we introduce Neural Low-Discrepancy Sequences (NeuroLDS), the first machine learning-based framework for generating finite LDS. Drawing inspiration from classical LDS, we train a neural network to map indices to points such that the resulting sequences exhibit minimal discrepancy across all prefixes. To this end, we deploy a two-stage learning process: supervised approximation of classical constructions followed by unsupervised fine-tuning to minimize prefix discrepancies. We demonstrate that NeuroLDS outperforms all previous LDS constructions by a significant margin with respect to discrepancy measures. Moreover, we demonstrate the effectiveness of NeuroLDS across diverse applications, including numerical integration, robot motion planning, and scientific machine learning. These results highlight the promise and broad significance of Neural Low-Discrepancy Sequences. Our code can be found at https://github.com/camail-official/neuro-lds.
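
The requirement that every prefix of a sequence has low discrepancy can be made concrete with a small computation. The sketch below evaluates the L2 star discrepancy of each prefix using Warnock's closed-form formula and applies it to the classical van der Corput sequence. Which discrepancy measure the paper optimizes is not stated in the abstract, so the choice of L2 star discrepancy is an assumption; the helper names are hypothetical and this is not the code from the linked repository.

```python
import numpy as np


def l2_star_discrepancy(points):
    """L2 star discrepancy of an (n, d) point set in [0, 1)^d (Warnock's formula)."""
    x = np.atleast_2d(np.asarray(points, dtype=float))
    n, d = x.shape
    term1 = 3.0 ** (-d)
    term2 = (2.0 / n) * np.prod((1.0 - x**2) / 2.0, axis=1).sum()
    pair_max = np.maximum(x[:, None, :], x[None, :, :])  # shape (n, n, d)
    term3 = np.prod(1.0 - pair_max, axis=2).sum() / n**2
    # Clamp tiny negative values caused by floating-point cancellation.
    return float(np.sqrt(max(term1 - term2 + term3, 0.0)))


def van_der_corput(n, base=2):
    """First n points of the van der Corput sequence, as an (n, 1) array."""
    seq = []
    for i in range(n):
        value, denom, k = 0.0, 1.0, i
        while k > 0:
            denom *= base
            k, digit = divmod(k, base)
            value += digit / denom  # mirror the digits about the radix point
        seq.append(value)
    return np.array(seq).reshape(-1, 1)


def prefix_discrepancies(points):
    """Discrepancy of every prefix points[:m] for m = 1..n."""
    return [l2_star_discrepancy(points[:m]) for m in range(1, len(points) + 1)]
```

A learned sequence could be scored with `prefix_discrepancies` in the same way, which gives a like-for-like comparison against a classical construction at every prefix length rather than only at the final point count.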

2505.18102 2026-06-02 cs.LG cs.AI cs.CL stat.ME

CapBencher: Give Your LLM Benchmark a Built-in Alarm for Test-Set Overfitting

Takashi Ishida, Thanawat Lodkaew, Ikko Yamane

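Affiliations: National Institute of Advanced Industrial Science and Technology, Japan

AI summary: Proposes CapBencher, which injects randomness into benchmark answers (several logically correct answers are prepared but only one is recorded as the solution) to lower the Bayes accuracy, so that a benchmark can be published while guarding against test-set overfitting and detecting leakage or gaming.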

Comments ICML 2026 camera ready version

Abstract

Publishing a large language model (LLM) benchmark (especially its ground-truth answers) on the Internet risks contaminating future LLMs and enabling evaluation gaming: it may be unintentionally (or intentionally) used to train or select a model, or exploited to overfit and hack leaderboards when labels are accessible. A common mitigation is to keep the benchmark private and let participants submit their models or predictions to the organizers, but this still permits test-set overfitting through feedback loops. To overcome this issue, we propose CapBencher, a way to publish benchmarks without fully disclosing the ground-truth answers, while preserving open evaluation of LLMs. The main idea is to reduce the best possible accuracy, i.e., the Bayes accuracy, by injecting randomness into the answers: several logically correct answers are prepared, and only one of them is included as the solution in the benchmark. Not only does this obscure the ground-truth answers, but it also offers a test for leakage or gaming: since even fully capable models should not surpass the Bayes accuracy, any model that does is a strong signal. We show theoretically and empirically that CapBencher accurately detects test-set overfitting across diverse benchmarks, models, training methodologies, and scenarios.
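
The capping argument can be worked through numerically. Suppose each question has m logically correct answers and the recorded solution is drawn uniformly at random from them: a model that cannot see the draw matches the recorded solution with probability at most 1/m. The sketch below computes that cap and flags a score that exceeds it by more than chance would allow. The uniform draw, the exact one-sided binomial test, the threshold `alpha`, and all function names are assumptions made for illustration; the paper's actual procedure may differ.

```python
from math import comb


def bayes_accuracy(num_correct_variants: int) -> float:
    """Best achievable accuracy when one of m valid answers is recorded uniformly at random."""
    return 1.0 / num_correct_variants


def binomial_upper_tail(n: int, k: int, p: float) -> float:
    """P[X >= k] for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def flags_overfitting(
    num_questions: int,
    num_matches: int,
    num_correct_variants: int,
    alpha: float = 0.01,
) -> bool:
    """True if the match count is implausibly high under the Bayes-accuracy cap."""
    cap = bayes_accuracy(num_correct_variants)
    return binomial_upper_tail(num_questions, num_matches, cap) < alpha
```

With two valid answers per question the cap is 0.5, so 50 matches out of 100 is unremarkable, while 90 out of 100 would be flagged: a model can only do that well if it has seen which variant was recorded.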

2510.03494 2026-06-02 cs.LG stat.ML

Trajectory Data Suffices for Statistically Efficient Policy Evaluation in Fixed-Horizon Offline RL with Linear $q^π$-Realizability and Concentrability

Volodymyr Tkachuk, Csaba Szepesvári, Xiaoqi Tan

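Affiliations: University of Alberta

AI summary: Under the trajectory-data assumption, and with linear $q^π$-realizability and concentrability, gives a statistically efficient learner for policy evaluation in fixed-horizon offline reinforcement learning, and tightens the sample-complexity analysis for policy optimization.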

Abstract

We study finite-horizon offline reinforcement learning (RL) with function approximation for both policy evaluation and policy optimization. Prior work established that statistically efficient learning is impossible for either of these problems when the only assumptions are that the data has good coverage (concentrability) and the state-action value function of every policy is linearly realizable ($q^π$-realizability) (Foster et al., 2021). Recently, Tkachuk et al. (2024) gave a statistically efficient learner for policy optimization, if in addition the data is assumed to be given as trajectories. In this work we present a statistically efficient learner for policy evaluation under the same assumptions. Further, we show that the sample complexity of the learner used by Tkachuk et al. (2024) for policy optimization can be improved by a tighter analysis.

2510.03259 2026-06-02 cs.LG cs.AI

Verifying Meta-Awareness via Predictive Rewards in Reasoning Models

Yoonjeon Kim, Doohyuk Jang, Eunho Yang

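AI summary: Proposes MAPR, which uses a self-generated task of predicting reasoning statistics (length, pass rate, concepts) to strengthen the model's meta-awareness, significantly improving accuracy and training efficiency on several mathematical reasoning benchmarks.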

Comments accepted to ICML 2026

Abstract

Recent research on reasoning models explores the meta-awareness of language models, including their ability to determine optimal thinking duration, recognize knowledge boundaries, and structure concept-level thinking. While current large reasoning models depend solely on answer-based verification, we show that adding meta-awareness objectives leads to significant performance gains over models without such meta-knowledge. MAPR (Meta-Awareness via Predictive Reward) utilizes a self-generated task of predicting rollout statistics - specifically length, pass-rate, and concepts used - allowing for verification against the actual statistics. Furthermore, by leveraging this self-predictive capability, the model can regulate its reasoning behavior by i) filtering out trivial or unsolvable prompts, ii) reducing lengthy generations that tend to be incorrect, and iii) generating hints relevant to the problem. The results are inspiring: MAPR yields significant improvements in both accuracy and training efficiency on various reasoning benchmarks. More specifically, our method can speed up GRPO training by over 1.28x to reach the same performance, and achieves an 83.18% gain in accuracy on AIME25 and a 13.04% average gain over six mathematics benchmarks. The code is publicly available at https://github.com/akatigre/MAPR-RL.

2510.03086 2026-06-02 cs.LG

Chaining 2-FWL GNNs for Combinatorial Graph Alignment

Marc Lelarge

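Affiliations: INRIA - Ecole Normale Supérieure PSL Research University

AI summary: For the combinatorial graph alignment problem, proposes chaining 2-FWL GNNs with a non-differentiable ranking step that injects discrete feedback, significantly outperforming FAQ and existing GNN methods on sparse random graphs and regular graphs.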

Comments code available at https://github.com/mlelarge/chaining-gnn-graph-alignment

Abstract

For the combinatorial graph alignment problem (GAP) -- finding the node correspondence that maximizes the number of common edges (nce) between two unlabeled graphs -- properly initialized FAQ remains a strong classical baseline, while existing GNN approaches struggle in the purely structural setting. We introduce a chaining procedure: a sequence of Folklore-type (2-FWL) GNNs in which each network is trained with cross-entropy after decoding the previous network's similarity matrix and ranking nodes by their current alignment quality. This non-differentiable ranking step injects discrete combinatorial feedback at every link; at inference, we iterate the final network and keep the candidate with highest observed nce. On sparse Erdos-Renyi graphs at noise level 0.25, chained FGNNs with FAQ post-processing reach 85% accuracy versus 13% for FAQ initialized from the convex relaxation, and essentially 0% for prior GNN methods. On correlated regular graphs, where MPNNs with constant features produce identical node embeddings (1-WL fails to refine) and FAQ's convex initialization is degenerate, chaining is the only method we know that recovers a non-trivial alignment. On three real-world benchmarks (yeast PPI, coauthorship, and road networks), we show that recent comparisons underestimate FAQ by initializing it from a uniform doubly stochastic matrix; once FAQ is initialized from the convex relaxation it already surpasses prior reported numbers, and dataset-specific chained FGNNs further improve on this strengthened baseline.
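
The objective named in the abstract, the number of common edges (nce) under a node correspondence, is simple to compute for small graphs. The sketch below assumes undirected simple graphs given as symmetric 0/1 adjacency matrices with a zero diagonal, and a correspondence given as a permutation array. The function names are hypothetical and this is not the code from the linked repository; `best_candidate` only mirrors the inference-time step of keeping the candidate with the highest observed nce.

```python
import numpy as np


def num_common_edges(adj_a, adj_b, perm):
    """Count edges of graph A that map onto edges of graph B.

    perm[i] is the node of B matched to node i of A.
    """
    a = np.asarray(adj_a)
    b = np.asarray(adj_b)
    perm = np.asarray(perm)
    b_aligned = b[np.ix_(perm, perm)]  # B's adjacency re-indexed by A's nodes
    # Each undirected edge appears twice in a symmetric matrix.
    return int((a * b_aligned).sum() // 2)


def best_candidate(adj_a, adj_b, candidates):
    """Among candidate permutations, return the one with the highest nce."""
    return max(candidates, key=lambda perm: num_common_edges(adj_a, adj_b, perm))
```

For two three-node paths, 0-1-2 in A and 0-2-1 in B, the permutation `[0, 2, 1]` recovers both edges while the identity recovers only one.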

2510.02528 2026-06-02 cs.AI cs.LG

Multimodal Function Vectors for Visual Relations

视觉关系的多模态函数向量

Shuhao Fu, Esther Goldberg, Ying Nian Wu, Hongjing Lu

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 通过因果中介分析提取多模态函数向量,操纵注意力头以改善视觉关系推理,并实现零样本和微调性能提升。

详情
AI中文摘要

大型多模态模型(LMMs)从少量多模态演示中展现出令人印象深刻的上下文学习能力,然而支持这种任务学习的内部机制仍不透明。基于大型语言模型的先前工作,我们表明大型多模态模型中一小部分注意力头负责传递视觉关系的表示。这些注意力头的激活,称为函数向量,可以被提取和操纵以改变LMM在关系任务上的性能。首先,使用合成和真实图像数据集,我们应用因果中介分析来识别强烈影响关系预测的注意力头,并提取多模态函数向量,以提高推理时的零样本准确率。我们进一步证明,这些多模态函数向量可以在保持LMM参数冻结的情况下,用适量的训练数据进行微调,从而显著优于上下文学习基线。最后,我们展示了特定关系的函数向量可以线性组合,以解决涉及新颖和未经训练的视觉关系的类比问题,突显了该方法的强大泛化能力。通过在两个LMM(包括OpenFlamingo和Qwen3-VL)上的实验,我们的结果表明这些模型在局部内部结构中编码了视觉关系知识,这些知识可以被系统地提取和优化,从而增进了我们对模型模块化的理解,并增强了对LMM中关系推理的控制。

英文摘要

Large Multimodal Models (LMMs) demonstrate impressive in-context learning abilities from few multimodal demonstrations, yet the internal mechanisms supporting such task learning remain opaque. Building on prior work of Large Language Models, we show that a small subset of attention heads in Large Multimodal Models is responsible for transmitting representations of visual relations. The activations of these attention heads, termed function vectors, can be extracted and manipulated to alter an LMM's performance on relational tasks. First, using synthetic and real image datasets, we apply causal mediation analysis to identify attention heads that strongly influence relational predictions, and extract multimodal function vectors that improve zero-shot accuracy at inference time. We further demonstrate that these multimodal function vectors can be fine-tuned with a modest amount of training data, while keeping LMM parameters frozen, to significantly outperform in-context learning baselines. Finally, we show that relation-specific function vectors can be linearly combined to solve analogy problems involving novel and untrained visual relations, highlighting the strong generalization ability of this approach. Through experiments on two LMMs, including OpenFlamingo and Qwen3-VL, our results show that these models encode visual relational knowledge within localized internal structures, which can be systematically extracted and optimized, thereby advancing our understanding of model modularity and enhancing control over relational reasoning in LMMs.

2510.01891 2026-06-02 cs.SD cs.AI eess.AS

HRTFformer: A Spatially-Aware Transformer for Individual HRTF Upsampling in Immersive Audio Rendering

HRTFformer: 用于沉浸式音频渲染中个体HRTF上采样的空间感知Transformer

Xuyi Hu, Jian Li, Shaojie Zhang, Stefan Goetz, Lorenzo Picinali, Ozgur B. Akan, Aidan O. T. Hogg

发表机构 * SONICOM

AI总结 针对个体HRTF测量困难的问题,提出基于Transformer的HRTF上采样架构,利用注意力机制和球谐域处理,结合邻域差异损失,实现高保真HRTF重建。

Comments Accepted to IEEE Transactions on Multimedia 2026

详情
AI中文摘要

个体头相关传输函数(HRTF)正开始被引入许多商业沉浸式音频应用中,对于实现逼真的空间音频渲染至关重要。然而,引入它们的主要顾虑之一是,由于HRTF测量过程的复杂性,大规模创建个体HRTF并不实用。为缓解这一缺点,提出了HRTF空间上采样,旨在减少所需的测量量。尽管先前的工作已通过不同的机器学习方法取得成功,但这些模型通常难以在相邻源方向之间保持局部空间变化模式的长期一致性,以及在高上采样因子下的泛化能力。本文提出了一种新颖的基于Transformer的HRTF上采样架构,利用注意力机制更好地捕捉HRTF球面上的空间相关性。在球谐域中工作,我们的模型从稀疏输入测量中学习重建高分辨率HRTF,精度显著提高。为增强空间一致性,我们引入了邻域差异损失,促进幅度平滑性,从而产生更逼真的上采样。我们使用感知定位模型和客观频谱失真指标评估了我们的方法。实验表明,我们的模型在生成逼真、高保真HRTF方面,在多个评估指标上优于现有方法。

英文摘要

Individual Head-Related Transfer Functions (HRTFs) are starting to be introduced in many commercial immersive audio applications and are crucial for realistic spatial audio rendering. However, one of the main hesitations regarding their introduction is that creating individual HRTFs is impractical at scale due to the complexities of the HRTF measurement process. To mitigate this drawback, HRTF spatial upsampling has been proposed with the aim of reducing the measurements required. While prior work has seen success with different machine learning (ML) approaches, these models often struggle with long-range preservation of local spatial variation patterns across neighbouring source directions and generalization at high upsampling factors. In this paper, we propose a novel transformer-based architecture for HRTF upsampling, leveraging the attention mechanism to better capture spatial correlations across the HRTF sphere. Working in the spherical harmonic (SH) domain, our model learns to reconstruct high-resolution HRTFs from sparse input measurements with significantly improved accuracy. To enhance spatial coherence, we introduce a neighbour dissimilarity loss that promotes magnitude smoothness, yielding more realistic upsampling. We evaluate our method using both perceptual localization models and objective spectral distortion metrics. Experiments show that our model outperforms existing methods across several evaluation metrics in generating realistic, high-fidelity HRTFs.

2606.02507 2026-06-02 cond-mat.mtrl-sci cs.ET cs.LG physics.app-ph physics.comp-ph

Towards Automated Discovery: A Review of Generative Models, Multimodal Learning and Closed-Loop Workflows in Inverse Materials Design

迈向自动发现:逆向材料设计中生成模型、多模态学习与闭环工作流综述

Anand Babu, Rogério Almeida Gouvêa, Gian-Marco Rignanese

发表机构 * Institute of Condensed Matter and Nanosciences, Université Catholique de Louvain(鲁汶天主教大学凝聚态与纳米科学研究所) WEL Research Institute(WEL研究机构)

AI总结 本文综述了逆向材料设计中生成晶体结构建模、多模态学习和闭环设计管道的最新进展,重点讨论了可行性约束与物理先验的施加方式、多模态融合策略以及多种逆向设计策略(如条件生成与潜在优化、贝叶斯优化、强化学习和主动学习),并指出了常见失败模式及基于分阶段报告的评估实践。

详情
AI中文摘要

逆向材料设计将材料发现从正向预测转变为在物理约束下满足目标的有针对性的候选材料提出。在此,我们回顾了晶体固体中生成晶体结构建模、多模态学习和闭环设计管道的最新进展。我们调查了现代生成器如何从大型数据库中学习化学-结构先验,以实现周期性结构的可控采样,并比较了包括变分自编码器、归一化流、自回归公式和扩散模型在内的主要模型类别。特别关注如何通过表示选择、训练目标、采样时指导以及生成后筛选和弛豫,在整个工作流中施加可行性约束和物理先验。我们还讨论了多模态学习如何融合多种材料模态,包括晶体结构、热力学、电子信息、显微镜、光谱学、加工背景和科学文本,以构建更通用、可迁移的化学空间表示。此外,考察了多种逆向设计策略,特别是那些将条件生成与潜在优化、贝叶斯优化、强化学习和主动学习相结合的策略。最后,我们强调了反复出现的失败模式,如代理利用、多样性崩溃、分布偏移和稳定性-可合成性差距,并基于有效性、新颖性、独特性、稳定性和成本的分阶段报告,概述了发现级评估实践。

英文摘要

Inverse materials design is shifting materials discovery from forward prediction to targeted proposal of candidates that satisfy objectives under physical constraints. Here, we review recent advances in generative crystal structure modeling, multimodal learning, and closed-loop design pipelines for crystalline solids. We survey how modern generators learn chemical-structural priors from large databases to enable controllable sampling of periodic structures, and compare leading model classes including variational autoencoders, normalizing flows, autoregressive formulations, and diffusion models. Particular attention is given to how feasibility constraints and physical priors are enforced across the workflow, through representation choices, training objectives, sampling-time guidance, and post-generation screening and relaxation. We also discuss how multimodal learning fuses diverse materials modalities, including crystal structures, thermodynamic, electronic information, microscopy, spectroscopy, processing context, and scientific text, to construct a more universal, transferable representation of chemical space. In addition, diverse inverse-design strategies are examined, particularly those that integrate conditional generation with latent optimization, Bayesian optimization, reinforcement learning, and active learning. Finally, we highlight recurring failure modes, such as surrogate exploitation, diversity collapse, distribution shift, and the stability-synthesizability gap, and outline discovery-grade evaluation practices based on staged reporting of validity, novelty, uniqueness, stability, and cost.

2606.02494 2026-06-02 cs.SE cs.AI

Monitoring Agentic Systems Before They're Reliable

在代理系统可靠之前对其进行监控

Marisa Ferrara Boston, Glen Hanson, Effi Georgala, JD Hudgens, Heather Frase

发表机构 * Reins AI USA(Reins AI美国公司) Veraitech USA(Veraitech美国公司)

AI总结 针对生产环境中代理系统因结构缺陷主导故障的问题,提出一种基于方差信号的三维度三范围监控与分类方法,并通过合成测试验证其有效性。

Comments 9 pages, 2 figures, 3 tables. Accepted to the Workshop on Agentic Software Engineering (AgenticSE), co-located with ACM CAIS 2026 (non-archival)

详情
AI中文摘要

进入生产环境的代理系统通常以部分集成的组件形式运行,其中结构缺陷(而非任务级错误)主导故障场景。在此成熟度下,任务级错误检测可能不可行:结构故障模式掩盖了任务级监控器旨在检测的信号。我们提出一种监控与分类方法,将代理系统评估分解为三个维度(质量、适用性、效率)和三个监控范围(运行内、跨运行、结构),使用方差作为表征信号。发现结果通过基于FMEA的严重性分类进行路由,将人类注意力集中在需要调查的子集上。我们在一个包含220次运行、120个文档包且受控错误注入的合成测试平台上进行评估。三个结果显现:监控范围决定故障类型——运行内监控器发现确定性阶段缺陷(CV=0.02),跨运行监控器发现随机集成后果(CV=1.25,24%为L2级),结构监控器以完全一致性识别集成缺口(CV=0.00)。注入的任务级错误与干净基线无法区分,证实结构缺陷掩盖了任务级信号。确定性分类将97%的发现路由至自动跟踪,仅留下2%反映可变行为的发现供人工调查。基于第一阶段证据,我们提出一个成熟度阶段模型,其中监控随着集成缺陷的解决从结构表征过渡到错误检测再到可靠性跟踪。该分类法、基于CV的范围表征和严重性模型在架构上可迁移至受监管行业中基于文档的多阶段代理工作流;具体校准是领域特定的。尽早部署监控:它发现的第一个问题就是最需要修复的问题。

英文摘要

Agentic systems entering production typically operate as partially integrated assemblies where structural defects, not task-level errors, dominate the failure landscape. At this maturity level, task-level error detection may be infeasible: structural failure modes mask the signal that task-level monitors are designed to detect. We present a monitoring and triage methodology that decomposes agentic system evaluation into three dimensions (quality, suitability, efficiency) at three monitoring scopes (within-run, cross-run, structural), using variance as a characterization signal. Findings are routed through severity classification adapted from FMEA, concentrating human attention on the subset that warrants investigation. We evaluate on a synthetic testbed of 220 runs across 120 document bundles with controlled error injection. Three results emerge. Monitor scope determines failure type: within-run monitors surface deterministic stage defects (CV = 0.02), cross-run monitors surface stochastic integration consequences (CV = 1.25, 24% at L2), and a structural monitor identifies an integration gap with perfect consistency (CV = 0.00). Injected task-level errors are indistinguishable from clean baselines, confirming structural defects mask task-level signal. Deterministic triage routes 97% of findings to automated tracking, leaving the 2% reflecting variable behavior for human investigation. We propose, on Stage 1 evidence, a maturity-staging model in which monitoring transitions from structural characterization to error detection to reliability tracking as integration defects resolve. The taxonomy, CV-based scope characterization, and severity model transfer architecturally to document-driven, multi-stage agentic workflows in regulated industries; specific calibrations are domain-specific. Deploy monitoring early: the first thing it finds is the most important thing to fix.
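The abstract above uses the coefficient of variation (CV) as its characterization signal. A minimal sketch of that statistic follows, assuming CV means the population standard deviation divided by the absolute mean; the abstract does not say which variant the authors use, and the per-run counts below are invented for illustration.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """CV = population standard deviation / |mean| (assumed definition)."""
    m = mean(values)
    if m == 0:
        # All-zero series has no variation; otherwise CV is undefined (treated as inf).
        return 0.0 if all(v == 0 for v in values) else float("inf")
    return pstdev(values) / abs(m)

# Hypothetical finding counts per run for three monitor scopes.
structural = [1, 1, 1, 1]      # same gap reported on every run
within_run = [50, 51, 49, 50]  # nearly deterministic
cross_run = [0, 0, 4, 0]       # sporadic, run-dependent

for name, counts in [("structural", structural),
                     ("within-run", within_run),
                     ("cross-run", cross_run)]:
    print(name, round(coefficient_of_variation(counts), 3))
```

A CV near zero points to a defect that reproduces on every run, while a CV above one points to behavior that varies from run to run, which matches the qualitative split the abstract describes.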

2606.02483 2026-06-02 cs.CR cs.AI cs.CL

Ghost Tool Calls: Issue-Time Privacy for Speculative Agent Tools

幽灵工具调用:投机性智能体工具的发布时隐私保护

Bardia Mohammadi, Lars Klein, Akhil Arora, Laurent Bindschaedler

发表机构 * Max Planck Institute for Software Systems(马克斯·普朗克软件系统研究所) EPFL(洛桑联邦理工学院) Aarhus University(奥胡斯大学)

AI总结 针对工具增强型语言智能体投机性预发调用泄露用户意图的问题,提出投机性工具隐私契约,在发布时而非提交后保护隐私。

详情
AI中文摘要

工具增强型语言智能体投机性地发出可能的未来工具调用以隐藏延迟,但这些调用在智能体提交分支之前将推断出的用户意图泄露给外部服务。每个收到调用的外部观察者在智能体放弃分支后仍保留该披露。问题在于时机,而非授权:提交后的清理、只读限制或访问控制白名单都无法撤回观察者已持有的信息。我们将这些调用称为幽灵工具调用,并提出投机性工具隐私契约,这是一种运行时抽象,将提交前的观察视为与状态突变不同的第一类效应。我们在原型运行时中实现了该契约,并在三个语料库上评估了十二种策略。投机性调度增加了观察者能够推断用户意图的程度;事后过滤器、只读限制和访问控制白名单无法消除这种推断;只有那些在调度前改变或抑制投机性调用的参数或目标投影的发布时策略才能减少这种推断。

英文摘要

Tool-augmented language agents speculatively issue likely future tool calls to hide latency, but those calls leak inferred user intent to external services before the agent commits to the branch. Every external observer that received the call retains the disclosure after the agent abandons the branch. Timing is the issue, not authorization: no commit-time cleanup, read-only restriction, or access-control allow-list unsends what an observer already holds. We call these invocations ghost tool calls and propose Speculative Tool Privacy Contracts, a runtime abstraction that treats observation before commitment as a first-class effect, distinct from state mutation. We implement the contracts in a prototype runtime and evaluate twelve policies across three corpora. Speculative dispatch increases what an observer can infer about user intent; post-hoc filters, read-only restrictions, and access-control allow-lists leave that inference intact; only issue-time policies that change or suppress the speculative call's argument or destination projection before dispatch reduce it.

2606.02448 2026-06-02 eess.SP cs.SD

Diffusion-Based Heart Sound Generation: Evaluation with Physiological Signal Metrics, Classifiers, and Expert Listening

基于扩散的心音生成:使用生理信号指标、分类器和专家听诊评估

Xinqi Bao, Jia Bi, Xin Chen, Ernest Nlandu Kamavuako, Saikat Chatterjee

发表机构 * Department of Information Science & Engineering, KTH Royal Institute of Technology(瑞典皇家理工学院信息科学与工程系) Rutherford Appleton Laboratory(卢瑟福·阿普尔顿实验室) Peng Cheng Laboratory(鹏城实验室) Department of Engineering, King’s College London(伦敦国王学院工程系)

AI总结 提出一种在log-mel域上的类别条件扩散模型用于生成心音图,通过生理指标、下游分类准确率和专家听诊评估合成保真度,并分析了异常声学线索保留和重建伪影等挑战。

详情
AI中文摘要

公开可用的心音图(PCG)数据集在规模和病理多样性方面仍然有限,限制了听诊训练和自动心音分类器的泛化能力。本文在log-mel域上开发了一种用于PCG生成的类别条件扩散模型,并使用互补的(i)生理启发的合理性指标、(ii)下游标签一致性评估和(iii)专家听诊来评估合成保真度。实验使用PhysioNet/Computing in Cardiology Challenge 2016数据集(3240条记录)进行记录级划分。经过预处理和质量控制后,将16,749个不重叠的4秒片段映射到归一化的1×128×128 log-mel表示,以训练带有无分类器引导的条件2D U-Net去噪器。使用三个轻量级指标在重建波形上量化信号级合理性:包络自相关节律评分、基于幅度的爆炸评分和主周期滞后。合成片段保留了相似的主周期持续时间,但与真实片段相比,包络周期性降低,瞬态突发性增加。在下游评估中,ResNet-50分类器在保留的真实测试集上达到92.24%的准确率,在类别平衡的合成批次上达到82.8%的准确率,表明生成信号保留了与正常/异常分类相关的判别结构。在一项初步的专家听诊研究(60个片段,两名临床医生)中,大多数合成片段被判断为类似心音,而真实和合成的4秒片段对异常敏感性均较低。总体而言,结果为基于扩散的PCG生成提供了实用基线,同时突出了在保留异常声学线索和减少重建伪影方面的剩余挑战。

英文摘要

Publicly available phonocardiogram (PCG) datasets remain limited in size and pathological diversity, constraining both auscultation training and the generalisation of automated heart-sound classifiers. A class-conditional diffusion model for PCG generation is developed in the log-mel domain and synthetic fidelity is assessed using complementary (i) physiology-inspired plausibility metrics, (ii) downstream label-consistency evaluation, and (iii) expert listening. Experiments use the PhysioNet/Computing in Cardiology Challenge 2016 dataset (3240 recordings) with recording-level splits. After preprocessing and quality control, 16,749 non-overlapping 4 s clips are mapped to a normalised 1 x 128 x 128 log-mel representation to train a conditional 2D U-Net denoiser with classifier-free guidance. Signal-level plausibility is quantified on reconstructed waveforms using three lightweight metrics: an envelope-autocorrelation rhythm score, an amplitude-based explosion score, and the dominant cycle lag. Synthetic clips preserve similar dominant cycle durations but exhibit reduced envelope periodicity and increased transient burstiness relative to real clips. For downstream evaluation, a ResNet-50 classifier achieves 92.24% accuracy on the held-out real test set and 82.8% accuracy on class-balanced synthetic batches, indicating that generated signals retain discriminative structure relevant to normal/abnormal classification. In a pilot expert listening study (60 clips, two clinicians), most synthetic clips are judged as heart-sound-like, while abnormality sensitivity is low for both real and synthetic 4 s excerpts. Overall, the results provide a practical baseline for diffusion-based PCG generation while highlighting remaining challenges in retaining abnormal acoustic cues and reducing reconstruction-induced artefacts.

2606.02433 2026-06-02 cs.IR cs.AI cs.CL cs.LG cs.MA

ODTQA-FoRe: An Open-Domain Tabular Question Answering Dataset for Future Data Forecasting and Reasoning

ODTQA-FoRe:面向未来数据预测与推理的开放域表格问答数据集

Zhensheng Wang, Xiaole Liu, Wenmian Yang, Kun Zhou, Yiquan Zhang, Weijia Jia

发表机构 * School of Artificial Intelligence, Beijing Normal University(北京师范大学人工智能学院) Institute of Artificial Intelligence and Future Networks, Beijing Normal University(北京师范大学人工智能与未来网络研究院) Faculty of Arts and Sciences, Beijing Normal University(北京师范大学文理学院) Beijing Normal-Hong Kong Baptist University(北京师范大学-香港浸会大学)

AI总结 提出开放域表格问答的未来预测与推理任务,并构建首个覆盖时间序列预测和基于预测推理的数据集,通过基于LLM代理的TimeFore框架(检索器、预测器、分析器)解决历史数据检索、预测限制和响应标准化挑战。

Comments This paper has been accepted by Findings of ACL 2026

详情
AI中文摘要

大语言模型的快速发展显著推进了表格问答,但大多数系统无法进行面向未来的数值预测。为弥补这一空白,我们引入了一个新任务——面向未来数据预测与推理的开放域表格问答,并提出了首个覆盖时间序列预测和基于预测推理场景的数据集,使用房地产数据。该任务在检索精确历史数据、克服LLM的预测限制以及标准化多样化查询的响应方面提出了挑战。为解决上述挑战,我们提出了TimeFore,一个基于LLM代理的框架,将问题分解为三个协作角色:检索器自主生成SQL以获取数据,预测器调用外部时间序列模型以获得更高精度,分析器综合结果以构建精确且一致的最终答案。大量实验证明了我们TimeFore的有效性。

英文摘要

The rapid development of LLMs has significantly advanced tabular question answering, but most systems cannot perform future-oriented numerical prediction. To address this gap, we introduce a novel task, Open-Domain Tabular Question Answering for Future Data Forecasting and Reasoning, and propose the first dataset to cover time-series forecasting and forecast-based reasoning scenarios using real estate data. This task poses challenges in retrieving precise historical data, overcoming the forecasting limitations of LLMs, and standardizing responses for diverse queries. To solve the above challenges, we propose TimeFore, an LLM agent-based framework that decomposes the problem into three collaborative roles: a Retriever autonomously generates SQL to fetch data, a Forecaster invokes external time-series models for higher accuracy, and an Analyzer synthesizes the results to construct a precise and consistent final answer. Extensive experiments demonstrate the effectiveness of our TimeFore.

2606.02430 2026-06-02 cs.DC cs.AI

Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference

并非所有错误都平等:大型语言模型推理中错误传播的系统研究

Yafan Huang, Sheng Di, Guanpeng Li

发表机构 * University of Iowa(爱荷华大学) Argonne National Laboratory(阿贡国家实验室) University of Florida(佛罗里达大学)

AI总结 本研究通过提出的LLMFI故障注入框架,系统研究了软错误在大型语言模型推理中的传播机制,揭示了关键脆弱性模式,并提出了四种低开销的软件级可靠性改进方向。

Comments Accepted at ICS'26

详情
AI中文摘要

大型语言模型(LLM)日益集成到高性能计算(HPC)工作流中,通过代码生成和领域特定决策等多种视角加速科学发现。然而,软错误如何传播并影响LLM推理在很大程度上仍未被探索。为弥补这一空白,我们提出了LLMFI——一个可配置且确定性的故障注入框架,并基于该框架对LLM推理中的错误传播进行了全面研究。我们系统地跨三个开放权重的LLM和十三个代表性任务(涵盖推理、多语言、数学和编码领域)注入故障。此外,我们进行了细粒度的案例研究,揭示了关键脆弱性模式。总体而言,我们的研究得出了17个要点,推进了对LLM推理中错误传播的理解,并提出了四种低开销的纯软件修改方向以提高可靠性,为未来的错误检测和缓解提供了实用指导。

英文摘要

Large language models (LLMs) are increasingly integrated into high-performance computing (HPC) workflows, accelerating scientific discovery through diverse perspectives such as code generation and domain-specific decision-making. Yet, how soft errors propagate and affect LLM inference remains largely unexplored. To bridge this gap, we present a comprehensive study on error propagation in LLM inference, enabled by our proposed LLMFI, a configurable and deterministic fault-injection framework. Using LLMFI, we systematically inject faults across three open-weighted LLMs and thirteen representative tasks, covering reasoning, multilingual, mathematical, and coding domains. In addition, we conduct fine-grained case studies that reveal critical vulnerability patterns. Overall, our study yields 17 takeaways that advance the understanding of error propagation in LLM inference and introduces four low-overhead directions to improve reliability through software-only modification, offering practical guidance for future error detection and mitigation.
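Soft errors of the kind studied above are commonly modeled as single bit flips in a stored value. The sketch below shows that generic fault model on one IEEE-754 float32 number; it is not the LLMFI interface, which the abstract does not describe, and the function name `flip_bit_fp32` is hypothetical.

```python
import struct

def flip_bit_fp32(value, bit):
    """Flip one bit of the float32 encoding of `value`.

    Bit 0 is the least significant mantissa bit, bits 23-30 are the
    exponent, and bit 31 is the sign.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 31]")
    (word,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", word ^ (1 << bit)))
    return flipped

# The same weight, corrupted at different bit positions.
for bit in (0, 23, 30, 31):
    print(bit, flip_bit_fp32(1.0, bit))
```

The position of the flipped bit matters a great deal: a mantissa flip perturbs the value slightly, whereas flipping the top exponent bit of 1.0 yields infinity, which is one intuition behind the claim that not all errors are equal.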

2606.02427 2026-06-02 math.NA cs.LG cs.NA

Spectral Audit of In-Context Operator Networks

上下文算子网络的频谱审计

Zhiwei Gao, Liu Yang, George Em Karniadakis

发表机构 * Division of Applied Mathematics, Brown University(布朗大学应用数学系) Department of Mathematics, National University of Singapore(新加坡国立大学数学系)

AI总结 提出基于雅可比矩阵的频谱审计方法,通过分析上下文算子学习中的局部频谱特性(频率增益、相位结构、交叉模式耦合)来评估模型是否真正学习了PDE算子的局部动力学机制,而不仅仅是输出预测。

详情
AI中文摘要

现有的神经算子和上下文算子学习评估主要依赖于预测误差,但准确的输出预测并不能保证正确的局部动力学结构。一个模型可能匹配解,同时表现出不正确的敏感性、失真的频率响应、虚假的模式耦合或不稳定的切向行为。我们引入了一种基于雅可比矩阵的频谱审计方法,用于上下文算子学习。对于固定的提示,我们将网络输出对查询函数求导,并将得到的雅可比矩阵视为学习的切向算子。将其投影到傅里叶模式上,我们获得了推断算子的局部频谱特征,包括频率相关的增益、相位结构和交叉模式耦合。该审计通过测试模型是否再现底层PDE算子的局部机制(而不仅仅是输出)来补充标准预测指标。在多个基准测试中,审计揭示了不同的算子级现象,包括相位传输、粘度依赖的阻尼、非线性模式耦合和反应-扩散稳定性结构。它还检测了部分被预测误差指标隐藏的失败,包括高频退化、不正确的相位恢复和提示-算子不一致。即使逐点预测部分准确,损坏或内部不一致的提示也会导致切向算子结构退化。我们的结果表明,预测精度和局部算子保真度是学习到的神经算子的不同属性。我们的框架还为稳定性、灵敏度和算子一致性提供了诊断。

英文摘要

Existing evaluations of neural operators and in-context operator learning rely primarily on prediction error, but accurate output prediction does not guarantee the correct local dynamical structure. A model may match solutions while exhibiting incorrect sensitivities, distorted frequency response, spurious mode coupling, or unstable tangent behavior. We introduce a Jacobian-based spectral audit for in-context operator learning. For a fixed prompt, we differentiate the network output with respect to the query function and view the resulting Jacobian as a learned tangent operator. Projecting it onto Fourier modes, we obtain a local spectral characterization of the inferred operator, including frequency-dependent gains, phase structure, and cross-mode coupling. The audit complements standard prediction metrics by testing whether the model reproduces local mechanisms of the underlying PDE operator rather than only outputs. Across benchmarks, the audit reveals distinct operator-level phenomena, including phase transport, viscosity-dependent damping, nonlinear mode coupling, and reaction--diffusion stability structure. It also detects failures partially hidden by prediction-error metrics, including high-frequency degradation, incorrect phase recovery, and prompt--operator inconsistencies. Corrupted or internally inconsistent prompts lead to degraded tangent-operator structure even when pointwise predictions remain partially accurate. Our results suggest that prediction accuracy and local operator fidelity are distinct properties of learned neural operators. Our framework also provides a diagnostic for stability, sensitivity, and operator consistency.
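The audit described above can be sketched on a toy stand-in: differentiate a map with respect to its input function, then project the Jacobian onto Fourier modes. The sketch assumes a periodic 1D grid, replaces the in-context network with one explicit heat-equation step, and uses central finite differences where a real audit would more likely use automatic differentiation; all names are hypothetical.

```python
import cmath
import math

def heat_step(u, r=0.2):
    """Stand-in 'model': one explicit periodic heat-equation step."""
    n = len(u)
    return [u[i] + r * (u[i - 1] - 2 * u[i] + u[(i + 1) % n]) for i in range(n)]

def jacobian(model, u0, eps=1e-6):
    """Central finite-difference Jacobian J[i][j] = d model(u)[i] / d u[j] at u0."""
    n = len(u0)
    cols = []
    for j in range(n):
        up, dn = list(u0), list(u0)
        up[j] += eps
        dn[j] -= eps
        fp, fm = model(up), model(dn)
        cols.append([(a - b) / (2 * eps) for a, b in zip(fp, fm)])
    return [[cols[j][i] for j in range(n)] for i in range(n)]

def fourier_symbol(J):
    """S[k][l] = <e_k, J e_l> / n with e_k[x] = exp(2*pi*i*k*x/n)."""
    n = len(J)
    w = [[cmath.exp(2j * math.pi * k * x / n) for x in range(n)] for k in range(n)]
    return [[sum(w[k][i].conjugate() * J[i][j] * w[l][j]
                 for i in range(n) for j in range(n)) / n
             for l in range(n)] for k in range(n)]

n = 16
u0 = [math.sin(2 * math.pi * x / n) for x in range(n)]
S = fourier_symbol(jacobian(heat_step, u0))

gains = [abs(S[k][k]) for k in range(n)]       # frequency-dependent gain
phases = [cmath.phase(S[k][k]) for k in range(n)]  # phase structure
coupling = max(abs(S[k][l]) for k in range(n) for l in range(n) if k != l)
print([round(g, 3) for g in gains[: n // 2 + 1]], coupling)
```

For this linear stand-in the diagonal entries reproduce the known damping factor 1 - 4 r sin^2(pi k / n) and the off-diagonal coupling is zero up to rounding; a learned operator whose symbol departs from the reference in gain, phase, or coupling would be flagged even if its pointwise predictions looked accurate.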

2606.02418 2026-06-02 quant-ph cs.AI

Evolutionary Discovery of Bivariate Bicycle Codes with LLM-Guided Search

基于LLM引导搜索的双变量自行车码的进化发现

Juan Cruz-Benito, Andrew W. Cross, David Kremer, Ismael Faro

发表机构 * IBM Research(IBM研究院) IBM T. J. Watson Research Center(IBM T. J. 沃森研究中心)

AI总结 提出一种LLM引导的进化工作流,通过变异生成双变量自行车码和扰动变体的Python程序,在约1650次迭代中筛选约2×10^5个候选码,发现了465个不同候选码,包括非CSS扰动码和CSS码,展示了LLM引导的程序进化在结构化量子码发现中的实用性。

详情
AI中文摘要

量子LDPC码的发现需要在大型代数设计空间中进行搜索,同时可靠地认证任何候选码的参数和等价类。我们引入了一种LLM引导的进化工作流,其中语言模型变异生成双变量自行车码和扰动双变量自行车码ansätze的Python程序。在五次活动中,系统执行了约1,650次进化迭代,筛选了约$2 \times 10^5$个候选码,需要约140小时的计算时间和约400美元的LLM推理成本。候选码通过一个分阶段验证流水线进行评估,该流水线结合了$\mathrm{GF}(2)$秩计算、距离估计和认证、混合整数线性规划、BLISS Tanner图去重、可分解性分析和局部Clifford等价检查。在块长度$n \leq 360$时,工作流识别出465个不同的候选码:97个CSS双变量自行车码和368个非CSS扰动变体。CSS搜索恢复了已知的高性能码,并找到了新的有限长度代表,包括一个不可分解的[[288,16,12]]码和更高权重的码,在距离$d = 8$时最多有$k = 50$。非CSS搜索产生了在[[144,12,12]]处匹配总码品质因子的扰动码,以及根据MILP状态报告为认证值或上界的额外高距离候选码。总体而言,这些结果表明,当与独立评估配对时,LLM引导的程序进化可以作为一种实用的结构化量子码发现工具。

英文摘要

Quantum LDPC code discovery requires searching large algebraic design spaces while reliably certifying the parameters and equivalence classes of any candidates found. We introduce an LLM-guided evolutionary workflow in which language models mutate Python programs that generate bivariate-bicycle and perturbed bivariate-bicycle code ansätze. Across five campaigns, the system performed approximately 1,650 evolutionary iterations, screened about $2 \times 10^5$ candidate codes, and required about 140 hours of computation and about US\$400 in LLM inference cost. Candidate codes are evaluated through a staged validation pipeline combining $\mathrm{GF}(2)$ rank computation, distance estimation and certification, mixed-integer linear programming, BLISS Tanner-graph deduplication, decomposability analysis, and local-Clifford equivalence checks. At block length $n \leq 360$, the workflow identifies 465 distinct candidate codes: 97 CSS bivariate-bicycle codes and 368 non-CSS perturbed variants. The CSS search recovers known high-performing codes and finds new finite-length representatives, including an indecomposable [[288,16,12]] code and higher-weight codes with up to $k = 50$ at distance $d = 8$. The non-CSS search produces perturbed codes matching the gross-code figure of merit at [[144,12,12]], along with additional high-distance candidates reported as certified values or upper bounds according to MILP status. Overall, these results show that LLM-guided program evolution can serve as a practical tool for structured quantum-code discovery when paired with independent evaluation.
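The first stage of the validation pipeline above, the GF(2) rank computation, can be sketched for a CSS bivariate-bicycle code. The sketch assumes the standard construction H_X = [A | B], H_Z = [B^T | A^T] with A and B polynomials in two commuting cyclic shifts, and uses the commonly cited gross-code polynomials A = x^3 + y + y^2, B = y^3 + x + x^2 with l = 12, m = 6; those polynomials are not given in the abstract and are an assumption here. The distance is not computed.

```python
def gf2_rank(rows):
    """Rank over GF(2) of a matrix whose rows are stored as integer bitmasks."""
    rows = list(rows)
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot:
            rank += 1
            low = pivot & -pivot  # lowest set bit acts as the pivot column
            rows = [r ^ pivot if r & low else r for r in rows]
    return rank

def poly_rows(monomials, l, m, transpose=False):
    """Rows of sum_k x^a_k y^b_k acting on Z_l x Z_m (index i*m + j), as bitmasks."""
    rows = []
    for i in range(l):
        for j in range(m):
            mask = 0
            for a, b in monomials:
                if transpose:
                    a, b = -a, -b  # the transpose of a shift is the inverse shift
                mask ^= 1 << (((i + a) % l) * m + (j + b) % m)
            rows.append(mask)
    return rows

def bb_checks(l, m, a_mono, b_mono):
    """Return (H_X, H_Z) with H_X = [A | B] and H_Z = [B^T | A^T]."""
    half = l * m
    a, b = poly_rows(a_mono, l, m), poly_rows(b_mono, l, m)
    at, bt = poly_rows(a_mono, l, m, True), poly_rows(b_mono, l, m, True)
    hx = [ra | (rb << half) for ra, rb in zip(a, b)]
    hz = [rb | (ra << half) for rb, ra in zip(bt, at)]
    return hx, hz

def css_commutes(hx, hz):
    """CSS condition: every X check overlaps every Z check on an even number of qubits."""
    return all(bin(rx & rz).count("1") % 2 == 0 for rx in hx for rz in hz)

# Assumed gross-code parameters; monomials are (power of x, power of y).
l, m = 12, 6
A = [(3, 0), (0, 1), (0, 2)]
B = [(0, 3), (1, 0), (2, 0)]
hx, hz = bb_checks(l, m, A, B)
n = 2 * l * m
k = n - gf2_rank(hx) - gf2_rank(hz)
print(n, k, css_commutes(hx, hz))
```

Storing each row as a Python integer keeps the elimination to a few XORs per row, which is why rank checks are cheap enough to run on every candidate before the more expensive distance and equivalence stages.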

2606.02345 2026-06-02 stat.ML cs.LG

Doing well with less! On Sampling Techniques for Empirical Pairwise Loss Estimation/Minimization

少即是多!关于经验成对损失估计/最小化的采样技术

Louise Davy, Stephan Clémençon, Charlotte Laclau

发表机构 * IDS, LTCI Télécom Paris Palaiseau, France(巴黎高等电信学院(Télécom Paris)LTCI实验室IDS系,帕莱索,法国)

AI总结 本文利用调查采样技术,通过直接对成对样本进行采样而非单个观测,在保留少量信息的情况下实现与全量成对评估相当的估计或优化性能,为精度与计算成本之间提供了理论上有依据的权衡。

详情
AI中文摘要

许多机器学习问题,包括相似性学习、排序和聚类,都依赖于经验成对损失函数,其二次计算成本在大规模下迅速变得难以承受。我们展示了一种节俭的方法,通过利用调查采样技术,仅保留成对信息的一小部分,即可实现与使用所有成对数据相当的估计或优化性能。一个核心发现(理论和实验均支持)是,这种采样方案必须直接针对成对样本而非单个观测。特别地,对于高维向量(如视觉或图学习中的嵌入)之间的成对损失,使用合适的辅助信息为信息量大的成对样本分配更高的包含概率,可以获得接近全量成对评估的性能,从而在精度和计算成本之间提供了一种有原则且理论上有依据的权衡。

英文摘要

Many machine learning problems, including similarity learning, ranking, and clustering, rely on empirical pairwise loss functions whose quadratic computational cost quickly becomes prohibitive at scale. We demonstrate how a frugal approach that retains only a fraction of the available information on pairs can achieve estimation or optimization performance comparable to that obtained by using all pairs, by leveraging survey sampling techniques. A central finding, supported by both theory and experiments, is that such sampling plans must target pairs directly rather than individual observations. In particular, for pairwise losses between high-dimensional vectors such as embeddings in vision or graph learning, assigning higher inclusion probabilities to informative pairs using suitable auxiliary information yields performance close to full pairwise evaluation, providing a principled and theoretically grounded trade-off between accuracy and computational cost.
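The idea of sampling pairs directly, each with its own inclusion probability, can be sketched with a Horvitz-Thompson style estimator under Poisson sampling. The abstract does not name the estimator or the sampling plan, so both are assumptions here, as are the function names and the toy data.

```python
import random
from itertools import combinations

def full_pairwise_mean(xs, loss):
    """Exact mean loss over all n(n-1)/2 unordered pairs."""
    pairs = list(combinations(range(len(xs)), 2))
    return sum(loss(xs[i], xs[j]) for i, j in pairs) / len(pairs)

def sampled_pairwise_mean(xs, loss, incl_prob, rng):
    """Poisson-sample pairs and reweight each kept pair by 1 / inclusion probability."""
    n = len(xs)
    n_pairs = n * (n - 1) // 2
    total = 0.0
    for i, j in combinations(range(n), 2):
        p = incl_prob(i, j)
        if rng.random() < p:  # the loss is evaluated only on sampled pairs
            total += loss(xs[i], xs[j]) / p
    return total / n_pairs

rng = random.Random(0)
xs = [rng.random() for _ in range(30)]
loss = lambda a, b: abs(a - b)

exact = full_pairwise_mean(xs, loss)
estimate = sampled_pairwise_mean(xs, loss, lambda i, j: 0.5, rng)
print(exact, estimate)
```

Dividing by the inclusion probability keeps the estimator unbiased for any strictly positive probabilities, so auxiliary information can raise the probability of informative pairs without distorting the target. The loop above still visits every pair to draw the sampling decision; a scalable implementation would draw the sampled pairs without enumerating all of them.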

2606.02302 2026-06-02 cs.CR cs.AI

SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents

SeClaw: 面向自主代理评估的规范驱动安全任务合成

Hao Cheng, Changtao Miao, Tianle Song, Yin Wu, He Liu, Erjia Xiao, Junchi Chen, Xiaoyu Shi, Yichi Wang, Jing Yang, Taowen Wang, Jinhao Duan, Mengshu Sun, Peiyan Dong, Xuan Shen, Yang Cao, Renjing Xu, Kaidi Xu, Jindong Gu, Bo Zhang, Jize Zhang, Chenhao Lin, Philip Torr, Chao Shen

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Ant Digital Technologies, Ant Group(蚂蚁集团数字技术部) Xi’an Jiaotong University(西安交通大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) University of Oxford(牛津大学) City University of Hong Kong(香港城市大学) Institute of Science Tokyo(东京科学大学) Zhejiang University(浙江大学) Massachusetts Institute of Technology(麻省理工学院) University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) Beijing University of Technology(北京工业大学)

AI总结 提出SeClaw框架,通过规范驱动的安全任务合成与基于执行的安全评估,实现对自主LLM代理在状态化环境中的安全风险的可扩展、可复现评估。

详情
AI中文摘要

自主LLM代理越来越多地在有状态环境中运行,访问工具、文件、内存和外部服务。虽然这些能力支持复杂的现实工作流,但它们也引入了难以通过现有评估捕获的安全风险。当前的代理安全基准通常依赖手动策划的任务,对新兴威胁的覆盖有限,并且主要关注最终结果而非导致不安全行为的执行过程。我们引入了SeClaw,一个结合规范驱动的安全任务合成与基于执行的安全评估的框架,用于自主代理。规范驱动的安全任务合成能够从结构化风险规范中可扩展且可控地构建安全任务,而SeClaw docker提供了一个标准化测试平台,用于评估代理在各种安全风险场景下的行为。该基准涵盖了由资源、用户任务、环境和内在代理行为引起的风险,并支持对不安全行为的轨迹感知评估,超越最终响应。通过桥接系统化的任务合成和可复现的安全评估,SeClaw为测量、诊断和比较自主LLM代理中的安全故障提供了实用基础。代码可在 https://github.com/seclaw-eval/seclaw-eval 获取。

英文摘要

Autonomous LLM agents increasingly operate in stateful environments where they access tools, files, memory, and external services. While such capabilities enable complex real-world workflows, they also introduce security risks that are difficult to capture with existing evaluations. Current agent security benchmarks often rely on manually curated tasks, provide limited coverage of emerging threats, and focus primarily on final outcomes rather than the execution processes that lead to unsafe behavior. We introduce SeClaw, a framework that combines specification-driven security task synthesis with execution-based security evaluation for autonomous agents. Spec-driven security task synthesis enables scalable and controllable construction of security tasks from structured risk specifications, while SeClaw docker provides a standardized testbed for evaluating agent behavior under diverse safety-risk scenarios. The benchmark covers risks arising from resources, user tasks, environments, and intrinsic agent behaviors, and supports trajectory-aware assessment of unsafe actions beyond final responses. By bridging systematic task synthesis and reproducible security evaluation, SeClaw provides a practical foundation for measuring, diagnosing, and comparing security failures in autonomous LLM agents. The code is available at https://github.com/seclaw-eval/seclaw-eval.

2606.02301 2026-06-02 cs.HC cs.AI cs.CV

Quantitative Movement Testing: Measuring Patient Movements from a Single Smartphone Video

定量运动测试:从单部智能手机视频测量患者运动

Pranav Mahajan, Amanda Wall, Eleonora Maria Camerone, Julie Stebbins, Eoin Kelleher, Shuangyi Tong, Annina Schmid, Katja Wiech, Anushka Irani, Ben Seymour

发表机构 * Nuffield Department of Clinical Neurosciences, University of Oxford(牛津大学纳菲尔德临床神经科学系) Max Planck Institute of Biological Cybernetics(马克斯·普朗克生物控制论研究所) Oxford Gait Laboratory, University of Oxford(牛津大学步态实验室) Harvard Medical School(哈佛医学院) Massachusetts General Hospital(麻省总医院) Institute of Biomedical Engineering, University of Oxford(牛津大学生物医学工程研究所) Mayo Clinic(梅奥诊所)

AI总结 提出基于计算机视觉的定量运动测试(QMT)方法,利用深度学习3D姿态估计从单目智能手机视频提取运动生物标志物,在实验室验证中与光学运动捕捉高度一致(r>0.85),并在纤维肌痛和慢性坐骨神经痛患者中展示了可靠性和纵向监测能力。

详情
AI中文摘要

慢性疼痛通过降低功能能力而损害生活质量,但在现实环境中客观测量这种功能影响仍然具有挑战性。虽然光学运动捕捉为评估运动质量改变提供了高精度,但成本高昂且局限于实验室环境。我们旨在开发并验证定量运动测试(QMT),这是一个从标准单目智能手机视频中提取3D运动生物标志物的计算机视觉流程,平衡临床可及性与生物力学精度。我们利用基于深度学习的3D姿态估计,在健康对照组(N=13)中针对金标准光学运动捕捉验证了QMT流程。经过留一法受试者校准以纠正系统偏差后,我们在两个前瞻性临床队列中部署QMT以评估现实世界效用:一项纤维肌痛患者的干预前后试验,以及一项慢性坐骨神经痛患者和健康对照的30天纵向家庭监测研究。在实验室验证中,QMT提取的临床运动指标与光学运动捕捉高度一致,显示出强相关性(r>0.85)和低平均绝对误差。QMT在纤维肌痛患者中显示出高重测信度(r>0.86),并成功追踪了慢性坐骨神经痛患者的日常运动波动。虽然现实家庭环境引入了比实验室环境更高的测量方差,但QMT完全基于远程记录发现了健康对照组和坐骨神经痛患者之间的组级差异。单目3D姿态估计为传统评估提供了一种可扩展的替代方案。QMT为临床试验中跟踪疾病进展和治疗反应提供了客观、可及的生物标志物,但需要进一步研究以优化家庭环境中的可靠性。

英文摘要

Chronic pain diminishes quality of life by decreasing functional ability, yet objectively measuring this functional impact remains challenging in real-world settings. While optical motion capture provides high precision for assessing altered movement quality, it is costly and restricted to laboratory environments. We aimed to develop and validate Quantitative Movement Testing (QMT), a computer vision pipeline extracting 3D kinematic biomarkers from standard monocular smartphone video, balancing clinical accessibility with biomechanical accuracy. We validated the QMT pipeline, utilising deep learning-based 3D pose-estimation, against gold-standard optical motion capture in healthy controls (N=13). Following leave-one-subject-out calibration to correct systematic bias, we deployed QMT in two prospective clinical cohorts to assess real-world utility: a pre- and post-intervention trial for fibromyalgia patients, and a 30-day longitudinal at-home monitoring study of chronic sciatica patients and healthy controls. In laboratory validation, QMT extracted clinical kinematic metrics with high agreement to optical motion capture, yielding strong correlations (r > 0.85) and low mean absolute errors. QMT demonstrated high test-retest reliability (r > 0.86) in fibromyalgia patients and successfully tracked day-to-day movement fluctuations in chronic sciatica. While real-world home settings introduced higher measurement variance than lab settings, QMT found group-level differences between healthy controls and sciatica patients based entirely on remote recordings. Monocular 3D pose estimation offers a scalable alternative to traditional assessments. QMT provides an objective, accessible biomarker for tracking disease progression and treatment response in clinical trials, though further research is needed to optimise reliability in home environments.

2606.02278 2026-06-02 eess.SY cs.LG cs.SY

Physics-Guided Recurrent State-Space Neural Networks for Multi-Step Prediction

物理引导的循环状态空间神经网络用于多步预测

Ruiyuan Li, Ajay Seth, Manon Kok

Affiliations * Delft Center for Systems and Control, TU Delft, the Netherlands; Department of Biomechanical Engineering, TU Delft, the Netherlands

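AI summary: Proposes PG-RSSNN, a physics-guided recurrent state-space neural network that combines physical knowledge with recurrent structures. By mitigating vanishing gradients and removing the risk of numerical divergence during training, it improves multi-step prediction with limited data and only partially known physical models.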

Comments 6 pages, 3 figures. Accepted at IFAC World Congress 2026

AI Chinese abstract

状态空间模型传统上基于物理知识,但由于模型不准确,这些物理模型的多步预测可能较差。黑盒深度学习作为替代方案显示出潜力,但这些方法依赖于大量数据集的可用性,且潜在可用的物理知识被忽略。我们提出PG-RSSNN,一种物理引导的循环状态空间神经网络,它结合循环结构以在多步预测中使用非饱和激活函数。它缓解了梯度消失,并消除了现有结构中因反馈状态估计而导致的训练数值发散风险。在多个具有不同物理模型不完善性的系统上(从带高斯噪声的线性状态空间模型到机械臂和级联水箱系统)的实验结果表明,与黑盒神经网络和纯物理模型相比,所提出的PG-RSSNN即使在训练数据有限且物理模型仅部分已知的情况下,也能保持稳定的训练行为,并改善多步预测。

English abstract

State-space models are traditionally based on physical knowledge, but multi-step predictions from these physical models can be poor due to model inaccuracy. Black-box deep learning has shown promise as an alternative. However, these methods rely on the availability of large datasets and potentially available physical knowledge is neglected. We propose the PG-RSSNN, a physics-guided recurrent state-space neural network that incorporates recurrent structures to enable the use of non-saturating activation functions in multi-step prediction. It mitigates the vanishing gradients and eliminates the risk of numerical divergence in training seen in existing structures that feed back state estimates. Results across multiple systems with various physical model imperfections, from linear state-space models with Gaussian noise to a robotic arm and a cascaded water tank system, show that the proposed PG-RSSNN maintains stable training behavior, and improves multi-step predictions, as compared with black-box neural networks and physics-only models, even with limited training data and when physical models are only partially known.

2606.02228 2026-06-02 stat.ML cs.CV cs.LG

Bayesian meta-learning for modeling Alzheimer's disease progression

贝叶斯元学习用于阿尔茨海默病进展建模

Clara Hoffmann, Nadja Klein

Affiliations * Scientific Computing Center, Karlsruhe Institute of Technology, Germany; Alzheimer’s Disease Neuroimaging Initiative

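AI summary: Proposes a Bayesian meta-learner that predicts the distribution of a disease score from an individual's current MRI volume and historical disease trajectory. It predicts on unseen individuals without retraining and is less overconfident in long-term predictions than its deterministic counterpart.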

AI Chinese abstract

预测阿尔茨海默病患者将经历轻度还是重度疾病进展对于个性化治疗至关重要。通常,临床医生试图预测离散疾病评分的分布,条件是个体当前的MRI体积及其历史疾病轨迹。经典的统计回归模型和单任务神经网络不适合此目的,因为拟合单独模型不可行(每个个体通常只有少量观测),而忽略个体间相关性会导致泛化能力差。相比之下,元学习提供了一种自然的方法来动态预测分布,无需重新训练,并能建模结果与协变量之间的非线性关系。受此启发,我们提出了一种贝叶斯元学习器,它在多个个体上训练,但根据每个个体的历史数据定制预测的疾病评分分布。我们的模型无需重新训练即可预测未见过的个体,与历史观测数量呈线性扩展,并且在预测长期疾病评分时,与确定性对应模型相比,保证更少的过度自信。在阿尔茨海默病神经影像学倡议(ADNI)数据库的真实世界数据上,我们的模型在性能上与单任务模型和确定性元学习器相当,同时在预测长期疾病进展时显著提高了性能。

English abstract

Predicting whether an individual with Alzheimer's disease will experience mild or severe disease progression is essential for personalized treatment. Typically, practitioners seek to predict the distribution of a discrete disease score, conditional on an individual's current MRI volume and their historical disease trajectory. Classical statistical regression models and single-task neural networks are not well-suited for this purpose because fitting separate models is infeasible (since each individual typically has few observations), while ignoring individual-level correlation leads to poor generalization. Meta-learning, in contrast, provides a natural avenue to dynamically predict distributions without retraining and model nonlinear relationships between the outcome and covariates. Motivated by this, we propose a Bayesian meta-learner that is trained on multiple individuals but tailors the predictive disease score distribution to each individual's historical data. Our model predicts on unseen individuals without retraining, scales linearly with the number of historical observations, and is guaranteed to be less overconfident when predicting long-term disease scores compared to its deterministic counterpart. On real-world data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database, our model achieves performance competitive with both single-task models and deterministic meta-learners, while substantially improving performance when predicting long-term disease progression.