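arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and related fields.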

2506.10630 2026-06-04 cs.LG cs.AI

Time Series Forecasting as Reasoning: A Slow-Thinking Approach with Reinforced LLMs

Yitong Zhou, Yucong Luo, Mingyue Cheng, Qi Liu, Jiahao Wang, Daoyu Wang, Enhong Chen

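Affiliations: State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China

AI summary: Proposes Time-R1, a framework that trains LLMs for multi-step reasoning through two-stage reinforcement fine-tuning (supervised fine-tuning followed by reinforcement learning) to improve time series forecasting accuracy.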

Abstract

To advance time series forecasting (TSF), various methods have been proposed to improve prediction accuracy, evolving from statistical techniques to data-driven deep learning architectures. Despite their effectiveness, most existing methods still adhere to a fast-thinking paradigm: they rely on extracting historical patterns and mapping them to future values as their core modeling philosophy, and lack an explicit thinking process that incorporates intermediate time series reasoning. Meanwhile, emerging slow-thinking LLMs (e.g., OpenAI-o1) have shown remarkable multi-step reasoning capabilities, offering an alternative way to overcome these issues. However, prompt engineering alone presents several limitations, including high computational cost, privacy risks, and limited capacity for in-depth domain-specific time series reasoning. To address these limitations, a more promising approach is to train LLMs to develop slow-thinking capabilities and acquire strong time series reasoning skills. For this purpose, we propose Time-R1, a two-stage reinforcement fine-tuning framework designed to enhance the multi-step reasoning ability of LLMs for time series forecasting. Specifically, the first stage conducts supervised fine-tuning for warmup adaptation, while the second stage employs reinforcement learning to improve the model's generalization ability. In particular, we design a fine-grained multi-objective reward specifically for time series forecasting, and then introduce GRIP (group-based relative importance for policy optimization), which leverages non-uniform sampling to further encourage and optimize the model's exploration of effective reasoning paths. Experiments demonstrate that Time-R1 significantly improves forecast performance across diverse datasets.

2502.00944 2026-06-04 cs.LG

Training speedups via batching for geometric learning: an analysis of static and dynamic algorithms

Daniel T. Speckhard, Tim Bechtel, Sebastian Kehl, Jonathan Godwin, Claudia Draxl

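Affiliations: Humboldt-Universität zu Berlin; Max Planck Institute for Solid State Research; Max Planck Computing and Data Facility; Orbital Materials

AI summary: Analyzes how static and dynamic batching algorithms affect training speed and model performance in graph neural networks. Experiments show that the choice of algorithm can yield up to a 2.7x speedup, but the best algorithm depends on the data, model, batch size, hardware, and number of training steps.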

Journal ref Transactions on Machine Learning Research (3/2026)

Abstract

Graph neural networks (GNN) have shown promising results for several domains such as materials science, chemistry, and the social sciences. GNN models often contain millions of parameters, and like other neural network (NN) models, are often fed only a fraction of the graphs that make up the training dataset in batches to update model parameters. The effect of batching algorithms on training time and model performance has been thoroughly explored for NNs but not yet for GNNs. We analyze two different batching algorithms for graph-based models, namely static and dynamic batching for two datasets, the QM9 dataset of small molecules and the AFLOW materials database. Our experiments show that changing the batching algorithm can provide up to a 2.7x speedup, but the fastest algorithm depends on the data, model, batch size, hardware, and number of training steps run. Experiments show that for a select number of combinations of batch size, dataset, and model, significant differences in model learning metrics are observed between static and dynamic batching algorithms.
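
The abstract above does not define the two batching schemes, so the following is only a minimal sketch under a common reading: static batching puts a fixed number of graphs in every batch, while dynamic batching keeps adding graphs until a size budget (here, a node count) would be exceeded. The function names, the node-only budget, and the toy graph sizes are assumptions made for illustration, not details taken from the paper.

```python
from typing import List


def static_batches(node_counts: List[int], graphs_per_batch: int) -> List[List[int]]:
    """Group graph indices into batches that each hold a fixed number of graphs."""
    indices = list(range(len(node_counts)))
    return [indices[i:i + graphs_per_batch]
            for i in range(0, len(indices), graphs_per_batch)]


def dynamic_batches(node_counts: List[int], node_budget: int) -> List[List[int]]:
    """Greedily fill a batch until the next graph would exceed the node budget."""
    batches: List[List[int]] = []
    current: List[int] = []
    used = 0
    for i, n in enumerate(node_counts):
        if n > node_budget:
            raise ValueError(f"graph {i} has {n} nodes, over the budget of {node_budget}")
        if current and used + n > node_budget:
            batches.append(current)  # close the batch before it overflows
            current, used = [], 0
        current.append(i)
        used += n
    if current:
        batches.append(current)
    return batches


# Hypothetical per-graph node counts.
node_counts = [5, 9, 3, 12, 4, 7]
static_result = static_batches(node_counts, graphs_per_batch=2)
dynamic_result = dynamic_batches(node_counts, node_budget=20)
print(static_result)   # [[0, 1], [2, 3], [4, 5]]
print(dynamic_result)  # [[0, 1, 2], [3, 4], [5]]
```

With these toy sizes the static scheme yields batches of 14, 15, and 11 nodes, while the dynamic scheme yields 17, 16, and 7. That kind of difference in batch composition is one reason the two schemes could behave differently in speed and in learning metrics.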

2604.14575 2026-06-04 cs.LG cs.AI stat.ME stat.ML

Generative Augmented Inference

Cheng Lu, Mengxin Wang, Dennis J. Zhang, Heng Zhang

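Affiliations: University of California, Berkeley; Stanford University; University of Toronto

AI summary: Proposes Generative Augmented Inference (GAI), a framework that treats AI outputs as high-dimensional informative features for learning the true labels rather than as proxies for them. It models the relationship with nonparametric methods to obtain consistent estimation and valid inference from combined human and AI data and, under random labeling, is asymptotically strictly more efficient than using human data alone.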

Abstract

Large language models enable inexpensive AI-generated annotations, but using them reliably for causal inference remains challenging. Naively pooling AI and human data induces bias, while existing methods such as Prediction-Powered Inference (PPI; Angelopoulos et al., 2023a) treat AI outputs as proxies of true labels -- an assumption often violated for generative model outputs in practice. We propose Generative Augmented Inference (GAI), a framework that treats AI outputs as general, potentially high-dimensional informative features for learning human labels rather than as surrogates. GAI flexibly models this relationship using nonparametric methods, enabling consistent estimation and valid inference from combined human and AI data. We establish asymptotic normality and show that, under random labeling, GAI strictly improves asymptotic efficiency over human-data-only estimation whenever AI outputs are informative for true labels. Empirical studies on real-world datasets demonstrate that GAI significantly reduces estimation error and improves confidence interval quality across diverse generative data sources relative to human-only and PPI-based estimation.
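
For context, the PPI baseline that the abstract contrasts with can be sketched for the simplest target, a population mean. This is not GAI; it is only the standard prediction-powered mean estimate, in which the average AI prediction on unlabeled data is corrected by the average human-minus-AI gap on the labeled data. The function names and the toy numbers are hypothetical.

```python
from statistics import fmean
from typing import Sequence


def human_only_mean(y_labeled: Sequence[float]) -> float:
    """Baseline: average of the human labels alone."""
    return fmean(y_labeled)


def ppi_mean(y_labeled: Sequence[float],
             pred_labeled: Sequence[float],
             pred_unlabeled: Sequence[float]) -> float:
    """Prediction-powered mean: AI average on unlabeled data plus a bias correction."""
    rectifier = fmean([y - f for y, f in zip(y_labeled, pred_labeled)])
    return fmean(pred_unlabeled) + rectifier


# Hypothetical data: four human-labeled items and four AI-only items.
y_labeled = [1.0, 0.0, 1.0, 1.0]
pred_labeled = [0.9, 0.2, 0.7, 0.8]
pred_unlabeled = [0.6, 0.4, 0.8, 0.2]

human_estimate = human_only_mean(y_labeled)
ppi_estimate = ppi_mean(y_labeled, pred_labeled, pred_unlabeled)
print(human_estimate, ppi_estimate)
```

The correction term is what makes this form treat the AI output as a stand-in for the label itself, which is the assumption the abstract says GAI relaxes.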

2407.00809 2026-06-04 cs.LG cs.NA math.NA

Kernel Neural Operators (KNOs) for Scalable, Memory-efficient, Geometrically-flexible Operator Learning

Matthew Lowery, John Turnage, Zachary Morrow, John D. Jakeman, Akil Narayan, Shandian Zhe, Varun Shankar

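Affiliations: Kahlert School of Computing; University of Utah; Department of Mathematics; Sandia National Laboratories; Scientific Machine Learning; Scientific Computing and Imaging (SCI) Institute

AI summary: Proposes the Kernel Neural Operator (KNO), which learns operators by composing deep kernel-based integral operators. It is provably convergent, memory-efficient, and geometrically flexible, and on benchmarks it reaches comparable or higher accuracy with fewer parameters.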

Comments 14 pages + 15 page appendix, 7 figures

Journal ref Transactions on Machine Learning Research, ISSN 2835-8856, 2026

Abstract

This paper introduces the Kernel Neural Operator (KNO), a provably convergent operator-learning architecture that utilizes compositions of deep kernel-based integral operators for function-space approximation of operators (maps from functions to functions). The KNO decouples the choice of kernel from the numerical integration scheme (quadrature), thereby naturally allowing for operator learning with explicitly-chosen trainable kernels on irregular geometries. On irregular domains, this allows the KNO to utilize domain-specific quadrature rules. To help ameliorate the curse of dimensionality, we also leverage an efficient dimension-wise factorization algorithm on regular domains. More importantly, the ability to explicitly specify kernels also allows the use of highly expressive, non-stationary, neural anisotropic kernels whose parameters are computed by training neural networks. We present universal approximation theorems showing that both the continuous and fully discretized KNO are universal approximators on operator learning problems. Numerical results demonstrate that on existing benchmarks the training and test accuracy of KNOs is closely comparable to or higher than that of popular neural operators while typically using an order of magnitude fewer trainable parameters, with the more expressive kernels proving important to attaining high accuracy. KNOs thus facilitate low-memory, geometrically-flexible, deep operator learning, while retaining the implementation simplicity and transparency of traditional kernel methods from both scientific computing and machine learning.
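
The decoupling of kernel and quadrature described above can be illustrated with a single discretized integral operator, (K u)(x) ≈ Σ_j w_j k(x, y_j) u(y_j). The sketch below is only an illustration of that formula on the interval [0, 1]: the trapezoid rule, the fixed Gaussian kernel, and all function names are assumptions, and nothing here is trainable or taken from the authors' implementation.

```python
import math
from typing import Callable, List, Sequence, Tuple


def trapezoid_rule(a: float, b: float, n: int) -> Tuple[List[float], List[float]]:
    """Nodes and weights of the composite trapezoid rule with n points on [a, b]."""
    h = (b - a) / (n - 1)
    nodes = [a + i * h for i in range(n)]
    weights = [h] * n
    weights[0] = weights[-1] = h / 2
    return nodes, weights


def kernel_integral_operator(kernel: Callable[[float, float], float],
                             u: Callable[[float], float],
                             eval_points: Sequence[float],
                             nodes: Sequence[float],
                             weights: Sequence[float]) -> List[float]:
    """Apply (K u)(x) = sum_j w_j * k(x, y_j) * u(y_j) at each evaluation point."""
    return [sum(w * kernel(x, y) * u(y) for y, w in zip(nodes, weights))
            for x in eval_points]


def gaussian_kernel(x: float, y: float, length_scale: float = 0.2) -> float:
    """A fixed stationary kernel, standing in for a learned one."""
    return math.exp(-((x - y) ** 2) / (2 * length_scale ** 2))


nodes, weights = trapezoid_rule(0.0, 1.0, 101)

# Sanity check: with k = 1 and u = 1 the operator integrates 1 over [0, 1].
constant_check = kernel_integral_operator(lambda x, y: 1.0, lambda y: 1.0,
                                          [0.0, 0.5], nodes, weights)
smoothed = kernel_integral_operator(gaussian_kernel, math.sin,
                                    [0.0, 0.5, 1.0], nodes, weights)
print(constant_check, smoothed)
```

Because the kernel and the (nodes, weights) pair are separate arguments, either can be swapped without touching the other, for example by replacing the trapezoid rule with a rule suited to an irregular domain.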

2604.12645 2026-06-04 cs.RO cs.AI

Contextual Multi-Task Reinforcement Learning for Autonomous Reef Monitoring

Melvin Laux, Yi-Ling Liu, Rina Alo, Sören Töpper, Mariela De Lucas Alvarez, Frank Kirchner, Rebecca Adam

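Affiliations: University of Bremen

AI summary: To cope with uncertain underwater dynamics and task variation, proposes a contextual multi-task reinforcement learning framework that learns reusable control policies, with efficient training, zero-shot generalization, and robustness demonstrated in simulation.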

Comments To be published in IEEE OCEANS 2026 (Sanya) conference proceedings

Abstract

Although autonomous underwater vehicles promise the capability of marine ecosystem monitoring, their deployment is fundamentally limited by the difficulty of controlling vehicles under highly uncertain and non-stationary underwater dynamics. To address these challenges, we employ a data-driven reinforcement learning approach to compensate for unknown dynamics and task variations. Traditional single-task reinforcement learning tends to overfit the training environment, which limits the long-term usefulness of the learnt policy. Hence, we propose to use a contextual multi-task reinforcement learning paradigm instead, allowing us to learn controllers that can be reused for various tasks, e.g., detecting oysters in one reef and detecting corals in another. We evaluate whether contextual multi-task reinforcement learning can efficiently learn robust and generalisable control policies for autonomous underwater reef monitoring. We train a single context-dependent policy that is able to solve multiple related monitoring tasks in a simulated reef environment in HoloOcean. In our experiments, we empirically evaluate the contextual policies regarding sample efficiency, zero-shot generalisation to unseen tasks, and robustness to varying water currents. By utilising multi-task reinforcement learning, we aim to improve the training effectiveness as well as the reusability of learnt policies, taking a step towards more sustainable procedures in autonomous reef monitoring.

2604.11510 2026-06-04 cs.CL cs.AI cs.LG

Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization

Jiashu Yao, Heyan Huang, Daiqing Wu, Zeming Liu, Yuhang Guo

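Affiliations: Beijing Institute of Technology; Tsinghua University; Beihang University

AI summary: Proposes Policy Split, which splits the policy into a normal mode and a high-entropy mode and uses collaborative dual-mode entropy regularization to encourage diverse exploration while preserving accuracy. Experiments show it outperforms existing baselines on general and creative tasks.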

Comments preprint

2604.09686 2026-06-04 cs.AI cs.CV

Belief-Aware VLM Model for Human-like Reasoning

Anshul Nayak, Shahil Shaik, Yue Wang

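Affiliations: Mechanical Engineering Department, Clemson University

AI summary: Proposes a belief-aware vision-language model framework that approximates belief with retrieval-based memory and reinforcement learning to improve long-horizon intent reasoning, outperforming zero-shot baselines on datasets such as HD-EPIC.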

Comments Accepted for publication at the IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026). 6 pages, 3 figures, 1 table

Abstract

Traditional neural network models for intent inference rely heavily on observable states and struggle to generalize across diverse tasks and dynamic environments. Recent advances in Vision Language Models (VLMs) and Vision Language Action (VLA) models introduce common-sense reasoning through large-scale multimodal pretraining, enabling zero-shot performance across tasks. However, these models still lack explicit mechanisms to represent and update belief, limiting their ability to reason like humans or capture evolving human intent over long horizons. To address this, we propose a belief-aware VLM framework that integrates retrieval-based memory and reinforcement learning. Instead of learning an explicit belief model, we approximate belief using a vector-based memory that retrieves relevant multimodal context, which is incorporated into the VLM for reasoning. We further refine decision-making using a reinforcement learning policy over the VLM latent space. We evaluate our approach on publicly available VQA datasets such as HD-EPIC and demonstrate consistent improvements over zero-shot baselines, highlighting the importance of belief-aware reasoning.

2604.08564 2026-06-04 cs.CL cs.LG

Attention-Based Sampler for Diffusion Language Models

Yuyan Zhou, Kai Syun Hou, Weiyu Chen, James Kwok

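Affiliations: Department of Computer Science and Engineering, The Hong Kong University of Science and Technology

AI summary: Addresses the neglect of global sequence structure in diffusion language model sampling by choosing the sampling order from attention-matrix column sums, giving a training-free method for high-quality parallel sampling.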

Abstract

Auto-regressive models (ARMs) have established a dominant paradigm in language modeling. However, their strictly sequential sampling paradigm imposes fundamental constraints on both inference efficiency and modeling flexibility. To address these limitations, diffusion-based large language models (dLLMs) have been proposed, offering the potential for parallel sampling and flexible language modeling. Despite these advantages, current dLLM sampling strategies rely primarily on token-level information, which fails to account for global sequence structure and often yields suboptimal results. In this paper, we study the sampling order selection problem from the perspective of log-likelihood maximization. We show that this problem is NP-hard and propose an optimal sampling-rank-based approximation that makes the objective computationally tractable. We further prove that the tractable objective is optimized by sampling tokens in descending order of their attention-matrix column sums. This finding provides a principled justification for attention-guided sampling and offers a theoretically grounded alternative to greedy search. We instantiate this theoretical insight in a new training-free sampling algorithm, termed Attn-Sampler, and further propose dynamic attention thresholding for practical acceleration. Extensive experiments across multiple benchmarks validate the effectiveness of our proposed method, demonstrating that it achieves superior generation quality while enhancing sampling parallelism.
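
The ordering rule stated in the abstract (descending attention-matrix column sums) can be sketched on a toy matrix. This is a hedged illustration only: the abstract does not say how attention is aggregated across heads and layers or how the dynamic threshold is applied, so the single 4x4 matrix, the function names, and the set of masked positions are all hypothetical.

```python
from typing import List, Sequence


def column_sums(attention: Sequence[Sequence[float]]) -> List[float]:
    """Total attention each position receives, summed over all query rows."""
    n_cols = len(attention[0])
    return [sum(row[j] for row in attention) for j in range(n_cols)]


def sampling_order(attention: Sequence[Sequence[float]],
                   masked_positions: Sequence[int]) -> List[int]:
    """Order still-masked positions by descending column sum (ties keep input order)."""
    sums = column_sums(attention)
    return sorted(masked_positions, key=lambda j: -sums[j])


# Hypothetical row-stochastic attention matrix; row i is the query at position i.
attention = [
    [0.1, 0.6, 0.2, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.4, 0.1, 0.4],
    [0.3, 0.3, 0.1, 0.3],
]
masked = [0, 2, 3]  # position 1 is assumed to be already decoded
order = sampling_order(attention, masked)
print(order)  # [3, 0, 2]
```

Position 3 receives the most total attention among the masked positions (0.9, against 0.7 and 0.6), so under this rule it would be sampled first.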

2603.21180 2026-06-04 cs.LG stat.CO stat.ME stat.ML

ALMAB-DC: Active Learning, Multi-Armed Bandits, and Distributed Computing for Sequential Experimental Design and Black-Box Optimization

Foo Hui-Mean, Yuan-chin I Chang

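Affiliations: Institute of Statistical Science, Academia Sinica

AI summary: Proposes ALMAB-DC, a framework that combines Gaussian process surrogate models, multi-armed bandit control, and asynchronous distributed scheduling for expensive black-box optimization, and significantly outperforms existing methods on several benchmarks.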

Comments 33 pages, and 13 figures

Abstract

Sequential experimental design under expensive, gradient-free objectives is a central challenge in computational statistics: evaluation budgets are tightly constrained and information must be extracted efficiently from each observation. We propose ALMAB-DC, a GP-based sequential design framework combining active learning, multi-armed bandits (MAB), and distributed asynchronous computing for expensive black-box experimentation. A Gaussian process surrogate with uncertainty-aware acquisition identifies informative query points; a UCB or Thompson-sampling bandit controller allocates evaluations across parallel workers; and an asynchronous scheduler handles heterogeneous runtimes. We present cumulative regret bounds for the bandit components and characterize parallel scalability via Amdahl's Law. We validate ALMAB-DC on five benchmarks. On the two statistical experimental-design tasks, ALMAB-DC achieves lower simple regret than Equal Spacing, Random, and D-optimal designs in dose-response optimization, and in adaptive spatial field estimation matches the Greedy Max-Variance benchmark while outperforming Latin Hypercube Sampling; at $K=4$ the distributed setting reaches target performance in one-quarter of sequential wall-clock rounds. On three ML/engineering tasks (CIFAR-10 HPO, CFD drag minimization, MuJoCo RL), ALMAB-DC achieves 93.4% CIFAR-10 accuracy (outperforming BOHB by 1.7 pp and Optuna by 1.1 pp), reduces airfoil drag to $C_D = 0.059$ (36.9% below Grid Search), and improves RL return by 50% over Grid Search. All advantages over non-ALMAB baselines are statistically significant under Bonferroni-corrected Mann-Whitney $U$ tests. Distributed execution achieves $7.5\times$ speedup at $K = 16$ agents, consistent with Amdahl's Law.
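
As a back-of-the-envelope reading of the last sentence, the textbook form of Amdahl's Law, S(K) = 1 / ((1 - p) + p / K), can be inverted to see what parallel fraction p a 7.5x speedup at K = 16 would imply. The value of p below is derived here under that simple model; it is not a figure reported in the abstract, and the function names are illustrative.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's Law: speedup with a given parallelizable fraction and worker count."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


def implied_parallel_fraction(speedup: float, workers: int) -> float:
    """Invert Amdahl's Law for the parallel fraction p, given S and K (K > 1)."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / workers)


p = implied_parallel_fraction(7.5, 16)
print(round(p, 4))                         # 0.9244
print(round(amdahl_speedup(p, 16), 4))     # 7.5
```

Under this model roughly 92% of the work would be parallelizable, and the remaining serial part would cap the speedup at 1 / (1 - p), about 13x, however many workers are added.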

2604.08438 2026-06-04 cs.LG

Adalina: Adaptive Linear Approximation for the Shapley Value and Beyond

Weida Li, Yaoliang Yu, Bryan Kian Hsiang Low

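Affiliations: School of Computer Science, University of Waterloo, Canada; Department of Computer Science, National University of Singapore, Republic of Singapore; Vector Institute, Canada

AI summary: For efficient approximation of the Shapley value and the broader family of semi-values, proposes a theoretical framework based on a vector concentration inequality and develops a linear-space algorithm, Adalina, that needs O((n/ε²) log(1/δ)) queries under a Θ(n) space constraint and significantly reduces mean squared error.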

Abstract

The Shapley value, and its broader family of semi-values, has received much attention in various attribution problems. A fundamental and long-standing challenge is their efficient approximation, since exact computation generally requires an exponential number of utility queries in the number of players $n$. To meet the challenges of large-scale applications, we explore the limits of efficiently approximating semi-values under a $Θ(n)$ space constraint. Building upon a vector concentration inequality, we establish a theoretical framework that enables sharper query complexities for existing unbiased randomized algorithms. Within this framework, we systematically develop a linear-space algorithm that requires $O(\frac{n}{ε^{2}}\log\frac{1}δ)$ utility queries to ensure $P(\|\hat{\boldsymbolϕ}-\boldsymbolϕ\|_{2}\geqε)\leq δ$ for all commonly used semi-values. In particular, our framework naturally bridges OFA, unbiased kernelSHAP, SHAP-IQ and the regression-adjusted approach, and definitively characterizes when paired sampling is beneficial. Moreover, our algorithm allows explicit minimization of the mean squared error $\mathbb{E}[\|\hat{\boldsymbolϕ}-\boldsymbolϕ\|_{2}^{2}]$ for each specific utility function. Accordingly, we introduce the first adaptive, linear-time, linear-space randomized algorithm, Adalina, that theoretically achieves improved mean squared error. All of our theoretical findings are experimentally validated. Our code is available at https://github.com/watml/adalina.
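
To make the approximation problem concrete, here is the classic permutation-sampling estimator of the Shapley value, which averages each player's marginal contribution over random orderings while storing only one running total per player. It is not Adalina or any other algorithm from the abstract; it is the plain Monte Carlo baseline, and the additive toy utility and all names are hypothetical.

```python
import random
from typing import Callable, FrozenSet, List


def permutation_shapley(n_players: int,
                        utility: Callable[[FrozenSet[int]], float],
                        n_permutations: int,
                        seed: int = 0) -> List[float]:
    """Estimate Shapley values by averaging marginal contributions over random orders."""
    rng = random.Random(seed)
    totals = [0.0] * n_players          # O(n) memory: one accumulator per player
    players = list(range(n_players))
    for _ in range(n_permutations):
        rng.shuffle(players)
        coalition = set()
        previous = utility(frozenset(coalition))
        for p in players:
            coalition.add(p)
            current = utility(frozenset(coalition))
            totals[p] += current - previous   # marginal contribution of p
            previous = current
    return [t / n_permutations for t in totals]


# Toy additive utility: each player's Shapley value equals its own weight.
weights = [1.0, 2.0, 3.0]


def additive_utility(coalition: FrozenSet[int]) -> float:
    return sum(weights[i] for i in coalition)


estimate = permutation_shapley(3, additive_utility, n_permutations=50)
print(estimate)  # [1.0, 2.0, 3.0]
```

Each sampled permutation costs n + 1 utility queries here, so the query budget, not memory, is what a sharper analysis or an adaptive sampler would aim to reduce.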

2604.07778 2026-06-04 cs.AI

The Accountability Horizon: An Impossibility Theorem for Governing Human-Agent Collectives

Haileleol Tibebu, Hewan Shemtaga

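Affiliations: University of Illinois at Urbana-Champaign; Responsible Intelligence Institute

AI summary: Introduces a formal model of human-agent collectives and an Accountability Incompleteness Theorem, proving that existing AI accountability frameworks necessarily fail once autonomy exceeds a computable threshold, and validates this phase-transition boundary with synthetic experiments.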

Abstract

Existing accountability frameworks for AI systems, legal, ethical, and regulatory, rest on a shared assumption: for any consequential outcome, at least one identifiable person had enough involvement and foresight to bear meaningful responsibility. This paper proves that agentic AI systems violate this assumption not as an engineering limitation but as a mathematical necessity once autonomy exceeds a computable threshold. We introduce Human-Agent Collectives, a formalisation of joint human-AI systems where agents are modelled as state-policy tuples within a shared structural causal model. Autonomy is characterised through a four-dimensional information-theoretic profile (epistemic, executive, evaluative, social); collective behaviour through interaction graphs and joint action spaces. We axiomatise legitimate accountability through four minimal properties: Attributability (responsibility requires causal contribution), Foreseeability Bound (responsibility cannot exceed predictive capacity), Non-Vacuity (at least one agent bears non-trivial responsibility), and Completeness (all responsibility must be fully allocated). Our central result, the Accountability Incompleteness Theorem, proves that for any collective whose compound autonomy exceeds the Accountability Horizon and whose interaction graph contains a human-AI feedback cycle, no framework can satisfy all four properties simultaneously. The impossibility is structural: transparency, audits, and oversight cannot resolve it without reducing autonomy. Below the threshold, legitimate frameworks exist, establishing a sharp phase transition. Experiments on 3,000 synthetic collectives confirm all predictions with zero violations. This is the first impossibility result in AI governance, establishing a formal boundary below which current paradigms remain valid and above which distributed accountability mechanisms become necessary.

2604.04974 2026-06-04 cs.RO

From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

Linfang Zheng, Zikai Ouyang, Chen Wang, Jia Pan, Wei Zhang

Affiliations: School of Automation and Intelligent Manufacturing (AiM); Guangdong Provincial Key Laboratory of Fully Actuated System Control Theory and Technology; Southern University of Science and Technology; The University of Hong Kong; Peng Cheng Laboratory; LimX Dynamics

AI summary: Surveys methods that learn control interfaces for robotic manipulation from temporal video without action annotations, proposes an interface-centric taxonomy with three families (direct video-action policies, latent-action methods, and explicit visual interfaces), and analyzes control-integration properties and future research directions.

Abstract

Video is a scalable observation of physical dynamics: it captures how objects move, how contact unfolds, and how scenes evolve under interaction -- all without requiring robot action labels. Yet translating this temporal structure into reliable robotic control remains an open challenge, because video lacks action supervision and differs from robot experience in embodiment, viewpoint, and physical constraints. This survey reviews methods that exploit non-action-annotated temporal video to learn control interfaces for robotic manipulation. We introduce an interface-centric taxonomy organized by where the video-to-control interface is constructed and what control properties it enables, identifying three families: direct video-action policies, which keep the interface implicit; latent-action methods, which route temporal structure through a compact learned intermediate; and explicit visual interfaces, which predict interpretable targets for downstream control. For each family, we analyze control-integration properties -- how the loop is closed, what can be verified before execution, and where failures enter. A cross-family synthesis reveals that the most pressing open challenges center on the robotics integration layer -- the mechanisms that connect video-derived predictions to dependable robot behavior -- and we outline research directions toward closing this gap.

2604.04944 2026-06-04 cs.CL cs.AI

Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space

Mohammad Reza Ghasemi Madani, Soyeon Caren Han, Shuo Yang, Jey Han Lau

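Affiliations: School of Computing and Information Systems, The University of Melbourne

AI summary: Proposes Inclusion-of-Thoughts (IoT), a strategy that reconstructs multiple-choice questions by progressively self-filtering distractor options, stabilizing model preferences and improving reasoning performance.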

Abstract

Multiple-choice questions (MCQs) are widely used to evaluate large language models (LLMs). However, LLMs remain vulnerable to the presence of plausible distractors. This often diverts attention toward irrelevant choices, resulting in unstable oscillation between correct and incorrect answers. In this paper, we propose Inclusion-of-Thoughts (IoT), a progressive self-filtering strategy that is designed to mitigate this cognitive load (i.e., instability of model preferences under the presence of distractors) and enable the model to focus more effectively on plausible answers. Our method operates to reconstruct the MCQ using only plausible option choices, providing a controlled setting for examining comparative judgements and therefore the stability of the model's internal reasoning under perturbation. By explicitly documenting this filtering process, IoT also enhances the transparency and interpretability of the model's decision-making. Extensive empirical evaluation demonstrates that IoT substantially boosts chain-of-thought performance across a range of arithmetic, commonsense reasoning, and educational benchmarks with minimal computational overhead.

2602.00104 2026-06-04 cs.CV cs.AI

R3G: A Reasoning-Retrieval-Reranking Framework for Vision-Centric Answer Generation

Zhuohong Chen, Zhengxian Wu, Zirui Liao, Shenao Jiang, Hangrui Xu, Yang Chen, Chaokui Su, Xiaoyu Liu, Haoqian Wang

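Affiliations: The Shenzhen International Graduate School, Tsinghua University, China; State Key Laboratory of Nuclear Power Safety Technology and Equipment, China; School of Computer Science and Information Engineering, Hefei University of Technology, China

AI summary: Proposes the R3G framework, which first produces a reasoning plan that specifies the visual cues needed, then selects evidence images with a two-stage strategy of coarse retrieval followed by fine-grained reranking. On MRAG-Bench it improves the accuracy of six multimodal large language models across nine sub-scenarios and achieves the best overall performance.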

2604.01161 2026-06-04 cs.LG

Reasoning Shift: How Context Silently Shortens LLM Reasoning

Gleb Rodionov, Roman Garipov, George Yakushev

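Affiliations: Yandex; HSE University

AI summary: A systematic evaluation finds that reasoning models produce markedly shorter reasoning traces (by up to 65%) for the same problem under different context conditions, such as long irrelevant context, multi-turn conversations, and subtasks. The shortening comes with less self-verification and uncertainty management; performance on simple tasks is unaffected, but performance on complex tasks may suffer.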

Comments Preprint

Abstract

Large language models (LLMs) exhibiting test-time scaling behavior, such as extended reasoning traces and self-verification, have demonstrated remarkable performance on complex, long-term reasoning tasks. However, the robustness of these reasoning behaviors remains underexplored. To investigate this, we conduct a systematic evaluation of multiple reasoning models across three scenarios: (1) problems augmented with lengthy, irrelevant context; (2) multi-turn conversational settings with independent tasks; and (3) problems presented as a subtask within a complex task. We observe an interesting phenomenon: reasoning models tend to produce much shorter reasoning traces (up to 65%) for the same problem under different context conditions compared to the traces produced when the problem is presented in isolation. A finer-grained analysis reveals that this compression is associated with a decrease in self-verification and uncertainty management behaviors, such as double-checking. While this behavioral shift does not compromise performance on straightforward problems, it might affect performance on more challenging tasks. Additionally, we show that targeted supervised fine-tuning partially mitigates the adverse effects of irrelevant context. We hope our findings draw additional attention to both the robustness of reasoning models and the problem of context management for LLMs and LLM-based agents.


2604.00915 2026-06-04 cs.LG stat.ML

Orthogonal Learner for Estimating Heterogeneous Long-Term Treatment Effects

正交学习器用于估计异质性长期处理效应

Haorui Ma, Dennis Frauen, Valentyn Melnychuk, Stefan Feuerriegel

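Affiliations: AI in Management, LMU Munich; Munich Center for Machine Learning; Munich, Germany

AI summary: Proposes LT-O-learners, which retarget the loss via custom overlap weights to stabilize heterogeneous long-term treatment effect estimation in low-overlap regimes, and proves Neyman-orthogonality and robustness to nuisance estimation error.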

AI Chinese abstract

异质性长期处理效应（HLTEs）的估计对于营销、经济学和医学中的个性化决策具有重要意义，在这些领域中，短期观测数据集通常与长期观测数据集相结合。然而，由于某些子群体在处理分配或长期结果上的重叠有限，HLTE估计面临挑战，可能导致具有大有限样本方差不稳定的HLTE估计。为了解决这一挑战，我们引入了LT-O-learners（长期正交学习器），这是一组新颖的正交学习器，用于在具有替代性的典型HLTE设置中进行HLTE估计。我们的LT-O-learners的关键思想是通过定制的重叠权重重新定位损失函数，这些权重降低了低重叠样本的权重。我们证明了重新定位的损失函数逐点恢复真实的HLTE，并满足Neyman正交性。我们进一步证明了两个关键的理论结果：（i）干扰误差仅通过高阶项进入误差界，这意味着我们的学习器对干扰估计误差具有鲁棒性。（ii）在线性函数类下，重新定位通过低重叠区域中的重叠权重有效控制了HLTE估计器的渐近方差。我们在合成和真实世界数据集上进行实验，以确认我们的LT-O-learners的理论性质，特别是在低重叠区域中的鲁棒性。据我们所知，我们是第一个在长期设置中对低重叠鲁棒的HLTE估计正交学习器。

English abstract

Estimation of heterogeneous long-term treatment effects (HLTEs) is relevant for personalized decision-making in marketing, economics, and medicine, where short-term observational datasets are often combined with long-term observational datasets. However, HLTE estimation is challenging due to limited overlap in treatment assignments or in long-term outcomes for certain subpopulations, which can lead to unstable HLTE estimates with large finite-sample variance. To address this challenge, we introduce the LT-O-learners (Long-Term Orthogonal Learners), a set of novel orthogonal learners for HLTE estimation in the canonical HLTE setting with surrogacy. The key idea of our LT-O-learners is to retarget the loss via custom overlap weights that downweight low-overlap samples. We show that the retargeted loss recovers the true HLTE pointwise and satisfies Neyman-orthogonality. We further prove two key theoretical results: (i) The nuisance error enters the error bound only through higher-order terms, which means our learners are robust to nuisance estimation error. (ii) Under a linear function class, the retargeting effectively controls the asymptotic variance of the HLTE estimator via the overlap weights in low-overlap regimes. We conduct experiments on synthetic and real-world datasets to confirm the theoretical properties of our LT-O-learners, particularly robustness in low-overlap regimes. To our knowledge, ours are the first orthogonal learners for HLTE estimation robust to low overlap in long-term settings.
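The idea of downweighting low-overlap samples can be illustrated with a minimal sketch. The abstract does not give the form of the paper's custom overlap weights, so this example assumes the classic overlap weight e(x)(1 - e(x)) on a propensity score e(x). The function names `overlap_weight` and `weighted_squared_loss` are hypothetical and are not taken from the paper.

```python
def overlap_weight(propensity: float) -> float:
    """Classic overlap weight e * (1 - e): largest at e = 0.5, zero at e = 0 or 1.

    Assumption: stands in for the paper's custom overlap weights, whose exact
    form is not given in the abstract.
    """
    if not 0.0 <= propensity <= 1.0:
        raise ValueError("propensity must lie in [0, 1]")
    return propensity * (1.0 - propensity)


def weighted_squared_loss(pseudo_outcomes, predictions, propensities):
    """Weighted mean squared error in which low-overlap samples count less."""
    weights = [overlap_weight(e) for e in propensities]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all samples have zero overlap weight")
    weighted_errors = (
        w * (y - f) ** 2
        for w, y, f in zip(weights, pseudo_outcomes, predictions)
    )
    return sum(weighted_errors) / total
```

Under this weighting, a sample whose propensity is close to 0 or 1 contributes almost nothing to the loss, however large its residual. This is one way to keep extreme inverse-propensity terms from dominating the finite-sample variance.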


2604.00819 2026-06-04 cs.CL cs.AI

Emotion Entanglement and Bayesian Inference for Multi-Dimensional Emotion Understanding

情感纠缠与贝叶斯推理用于多维情感理解

Hemanth Kotaprolu, Kishan Maharaj, Raey Zhao, Abhijit Mishra, Pushpak Bhattacharyya

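Affiliations: Indian Institute of Technology Bombay; University of Texas at Austin; IBM Research

AI summary: Proposes EmoScene, an emotional-scenario benchmark grounded in Plutchik's basic emotions, together with a Bayesian inference framework that uses emotion co-occurrence statistics for joint posterior inference, improving the structural consistency of multi-dimensional emotion understanding.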

Comments 19 pages in total, 10 Figures, 7 Tables

AI Chinese abstract

理解自然语言中的情感本质上是一个多维推理问题，其中多个情感信号通过上下文、人际关系和情境线索相互作用。然而，大多数现有的情感理解基准依赖于短文本和预定义的情感标签，将这一过程简化为独立的标签预测，忽略了情感之间的结构化依赖关系。为了解决这一局限性，我们引入了情感场景（EmoScene），一个基于理论的基准，包含4,731个上下文丰富的场景，并用源自Plutchik基本情绪的8维情感向量进行标注。基于情感很少独立出现的观察，我们进一步提出了一个纠缠感知的贝叶斯推理框架，该框架结合情感共现统计，对情感向量进行联合后验推理。这种轻量级的后处理不需要任何参数更新，提高了预测的结构一致性，并在不增加额外成本的情况下，整体词汇准确率提升了2.24%。因此，EmoScene为研究多维情感理解和当前语言模型的局限性提供了一个具有挑战性的基准。

English abstract

Understanding emotions in natural language is inherently a multi-dimensional reasoning problem, where multiple affective signals interact through context, interpersonal relations, and situational cues. However, most existing emotion understanding benchmarks rely on short texts and predefined emotion labels, reducing this process to independent label prediction and ignoring the structured dependencies among emotions. To address this limitation, we introduce Emotional Scenarios (EmoScene), a theory-grounded benchmark of 4,731 context-rich scenarios annotated with an 8-dimensional emotion vector derived from Plutchik's basic emotions. Motivated by the observation that emotions rarely occur independently, we further propose an entanglement-aware Bayesian inference framework that incorporates emotion co-occurrence statistics to perform joint posterior inference over the emotion vector. This lightweight post-processing does not require any parameter updates and improves the structural consistency of predictions, and yields overall gains of 2.24% Lexical Accuracy without any additional cost. EmoScene therefore provides a challenging benchmark for studying multi-dimensional emotion understanding and the limitations of current language models.
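Joint posterior inference over an emotion vector can be sketched as a post-processing step that needs no parameter updates. The sketch below rests on assumptions not stated in the abstract: emotions are treated as binary, the prior is a smoothed empirical distribution over full label vectors, and the model's per-emotion probabilities act as an independent likelihood. The names `joint_prior` and `map_emotion_vector` are hypothetical.

```python
from collections import Counter
from itertools import product


def joint_prior(label_vectors, smoothing=1.0):
    """Smoothed empirical prior over full binary emotion vectors.

    Counting whole vectors (rather than single labels) is what lets the
    prior capture which emotions tend to co-occur.
    """
    dim = len(label_vectors[0])
    counts = Counter(tuple(v) for v in label_vectors)
    denom = len(label_vectors) + smoothing * 2 ** dim
    return {
        cfg: (counts[cfg] + smoothing) / denom
        for cfg in product((0, 1), repeat=dim)
    }


def map_emotion_vector(model_probs, prior):
    """Return argmax over y of prior(y) * prod_i p_i^y_i * (1 - p_i)^(1 - y_i)."""
    def score(cfg):
        s = prior[cfg]
        for p, y in zip(model_probs, cfg):
            s *= p if y else (1.0 - p)
        return s

    return max(prior, key=score)
```

With 8 emotions the enumeration covers only 2^8 = 256 configurations, so exhaustive search stays cheap. When two emotions always co-occur in the reference data, the prior can switch on an emotion that independent thresholding would have missed.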


2603.28762 2026-06-04 cs.CV cs.AI cs.GR cs.LG

On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers

上下文空间中的即时排斥以实现扩散变换器的丰富多样性

Omer Dahary, Benaya Koren, Daniel Garibi, Daniel Cohen-Or

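Affiliations: Tel Aviv University; Snap Research Israel

AI summary: To address the limited diversity of text-to-image diffusion models, proposes applying on-the-fly repulsion in the Contextual Space of Diffusion Transformers through the multimodal attention channels. The method substantially increases generative diversity without sacrificing visual fidelity or semantic adherence, adds little computational overhead, and remains effective for modern "Turbo" and distilled models.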

Comments SIGGRAPH 2026. Project page: https://contextual-repulsion.github.io/

AI Chinese abstract

现代文本到图像（T2I）扩散模型在语义对齐方面取得了显著进展，但通常缺乏多样性，倾向于为任何给定提示收敛到狭窄的视觉解决方案集。这种典型性偏差对需要广泛生成结果的创意应用构成了挑战。我们识别出当前多样性方法中的一个基本权衡：修改模型输入需要昂贵的优化来整合生成路径的反馈。相反，对空间上已承诺的中间潜变量进行操作往往会破坏正在形成的视觉结构，导致伪影。在这项工作中，我们提出在上下文空间中应用排斥作为一种新颖的框架，以实现扩散变换器的丰富多样性。通过干预多模态注意力通道，我们在变换器的前向传播过程中施加即时排斥，在文本条件被新兴图像结构丰富后的块之间注入干预。这允许在结构信息形成后但构图固定之前重定向引导轨迹。我们的结果表明，上下文空间中的排斥在不牺牲视觉保真度或语义一致性的情况下产生了显著更丰富的多样性。此外，我们的方法非常高效，计算开销小，即使在现代“Turbo”和蒸馏模型中也有效，而传统的基于轨迹的干预在这些模型中通常会失败。

English abstract

Modern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often suffer from a significant lack of variety, converging on a narrow set of visual solutions for any given prompt. This typicality bias presents a challenge for creative applications that require a wide range of generative outcomes. We identify a fundamental trade-off in current approaches to diversity: modifying model inputs requires costly optimization to incorporate feedback from the generative path. In contrast, acting on spatially-committed intermediate latents tends to disrupt the forming visual structure, leading to artifacts. In this work, we propose to apply repulsion in the Contextual Space as a novel framework for achieving rich diversity in Diffusion Transformers. By intervening in the multimodal attention channels, we apply on-the-fly repulsion during the transformer's forward pass, injecting the intervention between blocks where text conditioning is enriched with emergent image structure. This allows for redirecting the guidance trajectory after it is structurally informed but before the composition is fixed. Our results demonstrate that repulsion in the Contextual Space produces significantly richer diversity without sacrificing visual fidelity or semantic adherence. Furthermore, our method is uniquely efficient, imposing a small computational overhead while remaining effective even in modern "Turbo" and distilled models where traditional trajectory-based interventions typically fail.


2512.14177 2026-06-04 cs.CV

Improving Semantic Uncertainty Quantification in LVLMs with Semantic Gaussian Processes

利用语义高斯过程改进LVLM中的语义不确定性量化

Joseph Hoche, Andrei Bursuc, David Brellmann, Gilles Louppe, Pavel Izmailov, Angela Yao, Gianni Franchi

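Affiliations: AMIAD, Pôle Recherche, Palaiseau; valeo.ai; Safran Tech; University of Liège; New York University; National University of Singapore; ENSTA Paris

AI summary: Proposes Semantic Gaussian Process Uncertainty (SGPU), a framework that quantifies semantic uncertainty by analyzing the geometric structure of answer embeddings, avoiding brittle clustering, and achieves state-of-the-art calibration and discriminative performance across multiple models and datasets.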

AI Chinese abstract

大型视觉语言模型（LVLM）经常产生看似合理但不可靠的输出，因此鲁棒的不确定性估计至关重要。最近的语义不确定性估计工作依赖于外部模型对多个采样响应进行聚类并测量其语义一致性。然而，这些聚类方法通常脆弱，对微小的措辞变化高度敏感，并且可能错误地分组或分离语义相似的答案，导致不可靠的不确定性估计。我们提出了语义高斯过程不确定性（SGPU），这是一个贝叶斯框架，通过分析答案嵌入的几何结构来量化语义不确定性，避免了脆弱的聚类。SGPU将生成的答案映射到密集的语义空间，计算其嵌入的Gram矩阵，并通过特征谱总结其语义配置。然后将这种谱表示输入到高斯过程分类器中，该分类器学习将语义一致性模式映射到预测不确定性，并且可以在黑盒和白盒设置中应用。在跨越VQA、图像分类和文本QA的八个数据集上的六个LLM和LVLM中，SGPU始终实现了最先进的校准（ECE）和判别（AUROC、AUARC）性能。我们进一步表明，SGPU可以跨模型和模态迁移，表明其谱表示捕捉了语义不确定性的通用模式。

English abstract

Large Vision-Language Models (LVLMs) often produce plausible but unreliable outputs, making robust uncertainty estimation essential. Recent work on semantic uncertainty estimates relies on external models to cluster multiple sampled responses and measure their semantic consistency. However, these clustering methods are often fragile, highly sensitive to minor phrasing variations, and can incorrectly group or separate semantically similar answers, leading to unreliable uncertainty estimates. We propose Semantic Gaussian Process Uncertainty (SGPU), a Bayesian framework that quantifies semantic uncertainty by analyzing the geometric structure of answer embeddings, avoiding brittle clustering. SGPU maps generated answers into a dense semantic space, computes the Gram matrix of their embeddings, and summarizes their semantic configuration via the eigenspectrum. This spectral representation is then fed into a Gaussian Process Classifier that learns to map patterns of semantic consistency to predictive uncertainty, and that can be applied in both black-box and white-box settings. Across six LLMs and LVLMs on eight datasets spanning VQA, image classification, and textual QA, SGPU consistently achieves state-of-the-art calibration (ECE) and discriminative (AUROC, AUARC) performance. We further show that SGPU transfers across models and modalities, indicating that its spectral representation captures general patterns of semantic uncertainty.
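The Gram-matrix step can be illustrated without any external library. This is a rough sketch, not SGPU itself: it assumes L2-normalised embeddings and replaces the full eigenspectrum with a single scalar, the participation ratio (an effective rank), which follows from the trace and the Frobenius norm alone. The Gaussian Process Classifier stage is omitted, and the function names are hypothetical.

```python
import math


def gram_matrix(embeddings):
    """Gram matrix G[i][j] = <e_i, e_j> of L2-normalised answer embeddings."""
    unit = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0.0:
            raise ValueError("zero embedding cannot be normalised")
        unit.append([x / norm for x in vec])
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def effective_rank(gram):
    """Participation ratio (sum of eigenvalues)^2 / (sum of squared eigenvalues).

    For a symmetric matrix the trace equals the sum of eigenvalues and the
    squared Frobenius norm equals the sum of squared eigenvalues, so no
    eigendecomposition is needed for this particular summary.
    """
    trace = sum(gram[i][i] for i in range(len(gram)))
    frob_sq = sum(x * x for row in gram for x in row)
    return trace * trace / frob_sq
```

Semantically identical answers give an effective rank near 1, meaning the spectrum concentrates in one direction. Mutually unrelated answers push it toward the number of samples. This is the kind of consistency pattern a downstream classifier could learn to map to uncertainty.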


2601.04051 2026-06-04 cs.LG

Symbolic Regression for Shared Expressions: Introducing Partial Parameter Sharing

共享表达式的符号回归：引入部分参数共享

Viktor Martinek, Roland Herzog

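Affiliations: Interdisciplinary Center for Scientific Computing, Heidelberg University

AI summary: Proposes a symbolic regression method that handles multiple categorical variables through partial parameter sharing, separating universal effects, category-specific trends, and category interactions, and tests its ability to reduce data requirements and support transfer learning on synthetic data and an astrophysics dataset.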

AI Chinese abstract

符号回归旨在寻找描述数据集的符号表达式。由于其固有的可解释性，符号回归（SR）是科学发现的有力范式。最近的进展已将SR扩展到使用具有可变参数集的单一表达式来描述相关现象，从而引入单一分类变量。例如，这允许搜索描述多种流体温度依赖粘度的单一表达式，同时识别一组不同的流体特定参数。我们在先前工作的基础上，考虑多个分类变量并引入中间级别的参数共享。参数并非完全通用或完全唯一，一些参数可以在特定类别之间共享，而对其他类别保持不同。这允许分离通用效应（共享参数）、类别特定趋势（部分共享参数）和类别交互（非共享参数）。我们通过一个合成的仅拟合示例测试了这种设置在减少数据需求和迁移学习方面的极限。此外，我们将该方法应用于先前单类别研究中也使用过的天体物理学数据集。相比之下，我们以显著更少的参数实现了类似的拟合质量，同时提取了关于问题的额外信息。

English abstract

Symbolic regression aims to find symbolic expressions that describe datasets. Due to its inherent interpretability, symbolic regression (SR) is a powerful paradigm for scientific discovery. Recent advances have expanded SR to describe related phenomena using a single expression with varying sets of parameters, thereby introducing a single categorical variable. To illustrate, this enables the search for a single expression describing temperature-dependent viscosity across multiple fluids, while simultaneously identifying a distinct set of fluid-specific parameters. We expand upon prior efforts by considering multiple categorical variables and introducing intermediate levels of parameter sharing. Rather than parameters being either entirely universal or entirely unique, some parameters can also be shared across specific categories while remaining distinct for others. This allows for separating universal effects (shared parameters), category-specific trends (partially-shared parameters), and category interactions (non-shared parameters). We test the limits of this setup in terms of reducing data requirements and transfer learning using a synthetic, fitting-only example. Furthermore, we apply the method to an astrophysics dataset also used in a previous single-category study. In comparison, we achieve similar fit quality with significantly fewer parameters while extracting additional information about the problem.
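The three sharing levels can be made concrete with a toy expression. This is only an illustration of the bookkeeping, not the paper's search procedure. The linear form, the two categorical variables `cat_a` and `cat_b`, and the parameter layout are all hypothetical choices made for the example.

```python
def predict(x, cat_a, cat_b, params):
    """Evaluate one fixed expression whose parameters live at three sharing levels."""
    slope = params["universal"]["slope"]            # shared by every sample
    offset = params["per_a"][cat_a]                 # shared within one value of category A
    correction = params["per_a_b"][(cat_a, cat_b)]  # unique to one (A, B) combination
    return slope * x + offset + correction


def count_parameters(params):
    """Total number of fitted scalars across all sharing levels."""
    return (
        len(params["universal"])
        + len(params["per_a"])
        + len(params["per_a_b"])
    )
```

Making every parameter unique per (A, B) combination would need three scalars for each combination. Moving the slope to the universal level and the offset to the per-A level is what reduces the parameter count in this toy setting.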


2601.11214 2026-06-04 cs.CL

T$^\star$: Progressive Block Scaling for Masked Diffusion Language Models Through Trajectory Aware Reinforcement Learning

T$^\star$：通过轨迹感知强化学习实现掩码扩散语言模型的渐进块缩放

Hanchen Xia, Baoyou Chen, Yutang Ge, Guojiang Zhao, Siyu Zhu

Affiliations: Shanghai Academy of AI for Science; Shanghai Innovation Institute; Fudan University; School of Mathematical Sciences; Shanghai Jiao Tong University; Carnegie Mellon University

AI summary: Proposes T$^\star$, which uses a TraceRL-based training curriculum to progressively increase block size in masked diffusion language models, enabling highly parallel decoding with minimal performance loss.

2603.23841 2026-06-04 cs.CL cs.AI

PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay

PoliticsBench: 通过多轮角色扮演基准测试大型语言模型中的政治价值观

Rohan Khetan, Ashna Khetan

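Affiliations: Northville High School, Northville, USA; Department of Computer Science, Stanford University, Stanford, USA

AI summary: Proposes PoliticsBench, a multi-stage roleplay benchmark that evaluates fine-grained value expression in LLMs across 20 evolving scenarios, and finds that scenario-based prompting elicits broader and more strongly expressed values than direct questions.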

Comments 7 pages, 5 tables, 5 figures, 4 appendix pages. Accepted to the ICML 2026 Trustworthy AI for Good Workshop

AI Chinese abstract

虽然大型语言模型（LLMs）越来越多地被用作主要信息来源，但其潜在的政治偏见可能影响其客观性。现有的LLM社会偏见基准主要评估人口统计刻板印象，而当衡量政治偏见时，是在粗略的层面上进行的，忽视了塑造社会政治推理的价值观。我们引入了PoliticsBench，一个用于评估LLM中细粒度价值表达的多阶段角色扮演基准。在20个演化场景中，模型在竞争压力下阐述权衡、表明立场并做出决策。在八个主流LLM上，我们表明，与直接的政治问题相比，基于场景的提示引发了更广泛和更强烈的价值表达，峰值交互阶段使强烈激活的价值维度数量增加了约0.75（共10个维度），相对于基线提示具有统计显著性（p < 0.05）。此外，在交互过程中，立场的承诺度增加，从初始阶段到决策阶段，在[0,5]量表上上升了约1.4分。虽然在后期交互阶段，响应对于场景释义的鲁棒性降低，但评判者间的一致性保持相对稳定。我们的结果表明，评估LLM的政治行为需要超越静态提示，转向更长的交互设置，以捕捉价值观如何在上下文中应用。

English abstract

While Large Language Models (LLMs) are increasingly used as primary sources of information, their potential for political bias may impact their objectivity. Existing benchmarks of LLM social bias primarily evaluate demographic stereotypes, and when political bias is measured, it is done so at a coarse level, overlooking the values that shape sociopolitical reasoning. We introduce PoliticsBench, a multi-stage roleplay benchmark for evaluating fine-grained value expression in LLMs. Across twenty evolving scenarios, models articulate tradeoffs, take positions, and make decisions under competing pressures. Across eight prominent LLMs, we show that scenario-based prompting elicits broader and more strongly expressed value profiles than direct political questions, with peak interaction stages increasing the number of strongly activated value dimensions by approximately $0.75$ (out of 10 total dimensions), a statistically significant increase relative to baseline prompting ($p < 0.05$). In addition, commitment to a stance increases over the course of interaction, rising by approximately $1.4$ points on a $[0,5]$ scale from initial to decision stages. While responses become less robust to scenario paraphrasing in later interaction stages, inter-judge agreement remains relatively stable. Our results suggest that evaluating LLM political behavior requires moving beyond static prompts toward longer interactive settings that capture how values are applied in context.


2603.23420 2026-06-04 cs.AI

Bilevel Autoresearch: Meta-Autoresearching Itself

双层自动研究：元自动研究自身

Yaonan Qu, Meng Lu

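Affiliations: Independent Researcher

AI summary: Proposes a bilevel autoresearch framework in which an outer loop improves an inner loop by reading its code and traces, identifying bottlenecks, and injecting executable Python search mechanisms, achieving a 5x improvement on a GPT pretraining benchmark.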

Comments 16 pages, 5 figures, 3 tables. v2 expands the framing as mechanism-level agentic self-improvement and updates related work and limitations; core method and experiments unchanged. This paper was primarily drafted by AI agents with human oversight and direction

AI Chinese abstract

如果自动研究本身是一种研究形式，那么自动研究可以应用于研究本身。我们提出了双层自动研究（Bilevel Autoresearch），一种双层框架，其中外层自动研究循环通过读取内层自动研究循环的代码和轨迹，识别瓶颈，并在运行时生成可注入的Python搜索机制来改进内层循环。内层循环优化任务性能；外层循环优化内层循环的搜索方式。两个循环使用相同的LLM，因此改进来自双层架构而非更强的元级模型，尽管外层循环消耗额外的推理和挂钟时间预算。在Karpathy的GPT预训练基准上，元自动研究外层循环相比单独的内层循环实现了5倍的改进（val_bpb：-0.045对比-0.009），而不改变机制的参数级调整则没有可靠的增益。外层循环从相邻搜索领域实例化机制，包括组合优化、多臂老虎机和实验设计，无需人工指定最终机制设计。轨迹分析表明，这些机制打破了确定性搜索模式，并迫使探索LLM先验所避免的方向。实验表明，在该基准上迈出了双层的第一步：外层循环改进了内层循环的搜索行为。在此实现中，代码是机制载体，但技能、提示、工作流、评估器、领域原则、世界模型假设和记忆模式也可以编码塑造未来智能体行为的机制。这指向了一条递归自举的路径，其中为内层循环发现的机制可以反馈回来改进元级循环本身。

English abstract

If autoresearch is itself a form of research, then autoresearch can be applied to research itself. We present Bilevel Autoresearch, a bilevel framework in which an outer autoresearch loop improves an inner autoresearch loop by reading its code and traces, identifying bottlenecks, and generating injectable Python search mechanisms at runtime. The inner loop optimizes task performance; the outer loop optimizes how the inner loop searches. Both loops use the same LLM, so improvements come from the bilevel architecture rather than a stronger meta-level model, although the outer loop consumes additional inference and wall-clock budget. On Karpathy's GPT pretraining benchmark, the meta-autoresearch outer loop achieves a 5x improvement over the standard inner loop alone (-0.045 vs. -0.009 val_bpb), while parameter-level adjustment without mechanism change yields no reliable gain. The outer loop instantiates mechanisms from adjacent search domains, including combinatorial optimization, multi-armed bandits, and design of experiments, without human specification of the final mechanism design. Trace analysis suggests that these mechanisms break deterministic search patterns and force exploration of directions the LLM's priors avoid. The experiments demonstrate, on this benchmark, a first bilevel step: an outer loop improves the search behavior of an inner loop. Code is the mechanism carrier in this implementation, but skills, prompts, workflows, evaluators, domain principles, world-model assumptions, and memory schemas can also encode mechanisms that shape future agent behavior. This suggests a path toward recursive bootstrapping, where mechanisms discovered for the inner loop can be fed back to improve the meta-level loop itself.
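The reported 5x factor can be checked directly from the two val_bpb figures quoted in the abstract. The worked ratio below assumes something the abstract does not state explicitly: both numbers are changes in val_bpb measured against a common starting point, with more negative being better.

```latex
\[
\frac{\Delta_{\text{bilevel}}}{\Delta_{\text{inner}}}
  = \frac{-0.045}{-0.009}
  = 5
\]
```

Read this way, the 5x refers to the size of the improvement, not to the absolute val_bpb value.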


2603.22121 2026-06-04 cs.CV cs.AI

GenSpan: Generation-Calibrated Motion Span Priors for Multi-Verb Video Corpus Moment Retrieval

GenSpan: 用于多动词视频语料库时刻检索的生成校准运动跨度先验

Yunzhuo Sun, Xinyue Liu, Yanyang Li, Nanding Wu, Linlin Zong, Xianchao Zhang, Wenxin Liang

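Affiliations: Dalian University of Technology

AI summary: Proposes GenSpan, a framework that uses auxiliary videos built from LLM-selected cues as temporal priors, combined with a token selector and a bidirectional state-space model, to improve video corpus moment retrieval and localization for multi-verb queries.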

Comments Major revision with title change, updated method, and additional experiments

AI Chinese abstract

视频语料库时刻检索（VCMR）旨在检索与自然语言查询对应的正确视频及其时间片段，对于时间动作顺序至关重要的多动词查询尤其具有挑战性。现有方法通常仅依赖文本或静态图像，难以捕捉隐式运动动态，导致检索错误和时间错位。我们提出GenSpan，一个生成校准的VCMR框架，从LLM选择的字幕线索和分解的子事件中构建短辅助视频，将这些作为时间先验而非直接检索目标。令牌选择器过滤与生成运动对齐的候选视频特征，双向状态空间模型高效预测视频-时刻元组。在TVR和ActivityNet-Captions上的实验表明，GenSpan提高了语料库级检索和时刻定位，特别是对于复杂的多动作查询，同时与最先进的多模态基线相比降低了计算成本。

English abstract

Video Corpus Moment Retrieval (VCMR) aims to retrieve both the correct video and its temporal segment corresponding to a natural-language query, a task that is especially challenging for multi-verb queries where temporal action ordering is critical. Existing approaches often rely solely on text or static images and struggle to capture implicit motion dynamics, leading to retrieval errors and temporal misalignment. We propose GenSpan, a generation-calibrated VCMR framework that constructs short auxiliary videos from LLM-selected subtitle cues and decomposed sub-events, using these as temporal priors rather than direct retrieval targets. A token selector filters candidate-video features aligned with generated motion, and a bidirectional state-space model efficiently predicts video-moment tuples. Experiments on TVR and ActivityNet-Captions demonstrate that GenSpan improves corpus-level retrieval and moment localization, particularly for complex multi-action queries, while reducing computational cost compared to state-of-the-art multimodal baselines.


2603.13432 2026-06-04 cs.CV cs.AI

Spatial Transcriptomics as Images for Large-Scale Pretraining

空间转录组学作为图像进行大规模预训练

Yishun Zhu, Jiaxin Qi, Jian Wang, Yuhua Zheng, Jianqiang Huang

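Affiliations: Computer Network Information Center, Chinese Academy of Sciences; Hangzhou Institute for Advanced Study, University of the Chinese Academy of Sciences

AI summary: Proposes treating spatial transcriptomics data as croppable multi-channel images, using spatial patching and gene subset selection to increase the number of training samples while preserving spatial context, which enables large-scale pretraining and consistently improves downstream performance.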

AI Chinese abstract

空间转录组学（ST）在组织切片上具有精确坐标的离散点处分析数千个基因表达值，保留了临床和病理研究所需的空间背景。随着测序通量的提高和平台的进步，不断增长的数据量促使大规模ST预训练成为可能。然而，预训练的基本单元（即单个训练样本的构成）仍然不明确。现有选择分为两类：（1）将每个点视为独立样本，这丢弃了空间依赖性，将ST简化为单细胞转录组学；（2）将整个切片视为单个样本，这导致输入过大且训练样本急剧减少，削弱了有效预训练。为解决这一问题，我们提出将空间转录组学视为可裁剪的图像。具体而言，我们通过从原始切片中裁剪补丁，定义了一个具有固定空间大小的多通道图像表示，从而在保留空间上下文的同时大幅增加训练样本数量。在通道维度上，我们定义了基因子集选择规则以控制输入维度并提高预训练稳定性。大量实验表明，所提出的基于图像的数据集构建方法用于ST预训练能够持续提升下游性能，优于传统预训练方案。消融研究验证了空间分块和通道设计都是必要的，从而建立了一种统一、实用的ST数据组织范式，支持大规模预训练。

English abstract

Spatial Transcriptomics (ST) profiles thousands of gene expression values at discrete spots with precise coordinates on tissue sections, preserving spatial context essential for clinical and pathological studies. With rising sequencing throughput and advancing platforms, the expanding data volumes motivate large-scale ST pretraining. However, the fundamental unit for pretraining, i.e., what constitutes a single training sample, remains ill-posed. Existing choices fall into two camps: (1) treating each spot as an independent sample, which discards spatial dependencies and collapses ST into single-cell transcriptomics; and (2) treating an entire slide as a single sample, which produces prohibitively large inputs and drastically fewer training examples, undermining effective pretraining. To address this gap, we propose treating spatial transcriptomics as croppable images. Specifically, we define a multi-channel image representation with fixed spatial size by cropping patches from raw slides, thereby preserving spatial context while substantially increasing the number of training samples. Along the channel dimension, we define gene subset selection rules to control input dimensionality and improve pretraining stability. Extensive experiments show that the proposed image-like dataset construction for ST pretraining consistently improves downstream performance, outperforming conventional pretraining schemes. Ablation studies verify that both spatial patching and channel design are necessary, establishing a unified, practical paradigm for organizing ST data and enabling large-scale pretraining.
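The "croppable image" view can be sketched as plain patch extraction. The sketch assumes the spots have already been rasterised onto a regular height x width grid, with a list of gene expression values at each position. Real ST layouts may be irregular, and the paper's gene subset selection rules are not given in the abstract. The `crop_patches` helper and its arguments are hypothetical.

```python
def crop_patches(slide, patch_size, stride, gene_indices):
    """Cut a (height x width x genes) grid into fixed-size multi-channel patches.

    Only the gene channels listed in gene_indices are kept, which bounds the
    channel dimension of every training sample.
    """
    height, width = len(slide), len(slide[0])
    patches = []
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            patch = [
                [
                    [slide[r][c][g] for g in gene_indices]
                    for c in range(left, left + patch_size)
                ]
                for r in range(top, top + patch_size)
            ]
            patches.append(patch)
    return patches
```

One slide then yields many samples of identical shape, each retaining the local neighbourhood of its spots. This sits between the two extremes described in the abstract, per-spot samples and whole-slide samples.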


2603.20884 2026-06-04 cs.CL

MemoNoveltyAgent: A Historical Research Memory-Aware Agent Workflow for Paper Novelty Assessment

MemoNoveltyAgent：一种用于论文新颖性评估的历史研究记忆感知智能体工作流

Jiajun Hou, Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Xiaopeng Ke, Derek F. Wong, Min Zhang

Affiliations: Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China; Xiaohongshu Inc.; Zhongguancun Academy, Beijing, China; NLP2CT Lab, Department of Computer and Information Science, University of Macau, China

AI summary: Proposes MemoNoveltyAgent, a multi-agent system that generates faithful novelty reports through a hierarchical abstract memory, fine-grained novelty-point decomposition, and self-validation, outperforming GPT-5 DeepResearch by 13.69% in evaluation.

AI Chinese abstract

为减轻论文筛选的沉重负担，研究人员越来越依赖现有的AI智能体（如AI审稿人或DeepResearch）进行论文评估和新颖性评估。然而，由于缺乏处理学术文献的专门机制，它们的分析往往产生表面结果，质量明显不足。为弥补这一差距，我们引入了MemoNoveltyAgent，一个旨在生成全面且忠实的新颖性报告的多智能体系统。除了通过RAG检索具体的先前论文证据外，我们的系统还包含一个从大规模学术语料库构建的高层抽象记忆。该记忆将研究组织成层次树，以提炼领域特定的演化轨迹，从而提供更广泛的历史背景。此外，我们将论文分解为离散的新颖点，以便进行细粒度分析和检索，同时采用自验证机制提高报告的忠实度。最后，为解决此类开放生成任务的评估挑战，我们提出了一种RAG增强的检查表评估方法，能够实现可靠且基于证据的评估。大量实验表明，MemoNoveltyAgent比GPT-5 DeepResearch提升了13.69%。代码和演示可在https://github.com/SStan1/MemoNoveltyAgent获取。

English abstract

To alleviate the heavy burden of paper screening, researchers increasingly rely on existing AI agents, such as AI reviewers or DeepResearch, for paper evaluation and novelty assessment. However, lacking specialized mechanisms for processing scholarly literature, their analyses often produce superficial results with noticeable deficiencies in quality. To bridge this gap, we introduce MemoNoveltyAgent, a multi-agent system designed to generate comprehensive and faithful novelty reports. Beyond retrieving concrete prior-paper evidence via RAG, our system incorporates a high-level abstract memory constructed from large-scale scholarly corpora. This memory organizes research into hierarchical trees to distill field-specific evolutionary trajectories, thereby providing a broader historical context. Furthermore, we decompose papers into discrete novelty points for fine-grained analysis and retrieval, while employing a self-validation mechanism to improve report faithfulness. Finally, to address the evaluation challenges of such open-ended generation tasks, we propose a RAG-augmented checklist evaluation method that enables reliable and evidence-grounded assessments. Extensive experiments demonstrate that MemoNoveltyAgent outperforms GPT-5 DeepResearch by 13.69%. Code and demo are available at https://github.com/SStan1/MemoNoveltyAgent


2603.20304 2026-06-04 cs.CV

Transferable Multi-Bit Watermarking Across Frozen Diffusion Models via Latent Consistency Bridges

跨冻结扩散模型的可迁移多位水印：基于潜在一致性桥

Hong-Hanh Nguyen-Le, Van-Tuan Tran, Thuc D. Nguyen, Nhien-An Le-Khac

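Affiliations: National Institute of Advanced Intelligence Science and Technology, Japan

AI summary: Proposes DiffMark, a plug-and-play multi-bit watermarking framework that trains efficiently through Latent Consistency Models, extracts a 64-bit watermark in a single forward pass with a 45x speed-up over inversion baselines, and transfers across architectures.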

Comments Accepted in Second Workshop on Technical AI Governance Research (TAIGR) @ ICML 2026

AI Chinese abstract

随着生成式人工智能的发展，全球治理框架越来越要求可验证的内容溯源。然而，现有水印技术面临关键的政策-技术脱节：基于采样的方法需要计算上不可行的逆过程，而微调方法则受限于特定模型检查点，阻碍了标准化、跨模型的监管。为弥合这一差距，我们引入了DiffMark，一种即插即用的多位水印框架。DiffMark将一种持久的、学习到的扰动嵌入到冻结扩散模型的每个去噪步骤中，在最终潜在空间中累积可恢复的信号。为了通过冻结网络实现高效训练，我们利用潜在一致性模型（LCM）作为可微的训练桥梁。DiffMark在单次16.4毫秒的前向传播中实现64位提取，比逆过程基线快45倍。通过实现每图像密钥灵活性和无需重新训练的跨架构可迁移性，DiffMark提供了实用、可扩展的技术工具，以实现用户问责并执行新兴的AI治理要求。

English abstract

As generative AI advances, global governance frameworks increasingly mandate verifiable content provenance. However, existing watermarking techniques face a critical policy-to-technology disconnect: sampling-based methods require computationally prohibitive inversion, while fine-tuning approaches are tethered to specific model checkpoints, hindering standardized, cross-model oversight. To bridge this gap, we introduce DiffMark, a plug-and-play multi-bit watermarking framework. DiffMark embeds a persistent, learned perturbation into every denoising step of a frozen diffusion model, accumulating a recoverable signal in the final latent space. To enable efficient training through the frozen network, we utilize Latent Consistency Models (LCMs) as a differentiable training bridge. DiffMark achieves 64-bit extraction in a single 16.4 ms forward pass, which is a $45\times$ speed-up over inversion baselines. By enabling per-image key flexibility and cross-architecture transferability without retraining, DiffMark provides the practical, scalable technical tooling necessary to operationalize user accountability and enforce emerging AI governance mandates.


2603.19005 2026-06-04 cs.LG cs.AI stat.ME

AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science

AgentDS技术报告：领域特定数据科学中人机协作的未来基准测试

An Luo, Jin Du, Xun Xian, Robert Specht, Fangqiao Tian, Ganghua Wang, Xuan Bi, Charles Fleming, Ashish Kundu, Jayanth Srinivasa, Mingyi Hong, Rui Zhang, Tianxi Li, Galin Jones, Jie Ding

Affiliations: School of Statistics, University of Minnesota; AIScientists, Inc.; Data Science Institute, University of Chicago; Carlson School of Management, University of Minnesota; Cisco Research; Department of Electrical and Computer Engineering, University of Minnesota; Division of Computational Health Sciences, University of Minnesota

AI summary: Introduces the AgentDS benchmark and competition, which evaluates AI agents and human-AI collaboration in domain-specific data science through 17 challenges across industries, finding that AI agents fall short on domain-specific reasoning and that human-AI collaboration outperforms AI-only approaches.

AI Chinese abstract

数据科学在将复杂数据转化为跨领域的可操作洞察方面发挥着关键作用。大型语言模型（LLM）和人工智能（AI）代理的最新发展显著自动化了数据科学工作流程。然而，目前尚不清楚AI代理在多大程度上能够匹配人类专家在领域特定数据科学任务上的表现，以及人类专业知识在哪些方面仍具有优势。我们引入了AgentDS，一个旨在评估AI代理和人机协作在领域特定数据科学中表现的基准测试和竞赛。AgentDS包含来自六个行业（商业、食品生产、医疗保健、保险、制造业和零售银行）的17个挑战。我们组织了一场公开竞赛，涉及29支队伍和80名参与者，从而能够系统比较人机协作方法与纯AI基线。我们的结果表明，当前的AI代理在领域特定推理方面存在困难。纯AI基线的表现低于竞赛参与者的前四分位数，而最强的解决方案来自人机协作。这些发现挑战了AI完全自动化的说法，并强调了人类专业知识在数据科学中的持久重要性，同时为下一代AI指明了方向。访问AgentDS网站：https://agentds.org/，开源数据集：https://huggingface.co/datasets/lainmn/AgentDS。

English abstract

Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have significantly automated data science workflows. However, it remains unclear to what extent AI agents can match the performance of human experts on domain-specific data science tasks, and in which aspects human expertise continues to provide advantages. We introduce AgentDS, a benchmark and competition designed to evaluate both AI agents and human-AI collaboration performance in domain-specific data science. AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking. We conducted an open competition involving 29 teams and 80 participants, enabling systematic comparison between human-AI collaborative approaches and AI-only baselines. Our results show that current AI agents struggle with domain-specific reasoning. AI-only baselines perform below the top quartile of competition participants, while the strongest solutions arise from human-AI collaboration. These findings challenge the narrative of complete automation by AI and underscore the enduring importance of human expertise in data science, while illuminating directions for the next generation of AI. Visit the AgentDS website here: https://agentds.org/ and open source datasets here: https://huggingface.co/datasets/lainmn/AgentDS


2603.18577 2026-06-04 cs.AI

MedForge: Interpretable Medical Deepfake Detection via Forgery-aware Reasoning

MedForge：基于伪造感知推理的可解释医学深度伪造检测

Zhihui Chen, Kai He, Qingyuan Lei, Bin Pu, Jian Zhang, Yuling Xu, Mengling Feng

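Affiliations: National University of Singapore; The Chinese University of Hong Kong; Hunan University; Xi'an Jiaotong University; Guangdong Provincial People's Hospital

AI summary: Proposes the MedForge framework, which builds the large-scale benchmark MedForge-90K and a localize-then-analyze reasoning method for interpretable medical image forgery detection, achieving state-of-the-art detection accuracy and expert-aligned explanations.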

AI Chinese abstract

文本引导的图像编辑器现在能够以高保真度操纵真实的医学扫描，实现病灶植入/移除，威胁临床信任和安全。现有防御措施不足以应对医疗领域。医学检测器大多是黑箱，而基于MLLM的解释器通常是事后解释，缺乏医学专业知识，并可能在模糊案例上产生幻觉证据。我们提出MedForge，一种用于事前、基于证据的医学伪造检测的数据和方法解决方案。我们引入了MedForge-90K，这是一个大规模基准数据集，包含19种病理的真实病灶编辑，并通过医生检查指南和黄金编辑位置提供专家指导的推理监督。在此基础上，MedForge-Reasoner执行局部-分析推理，在产生判决前预测可疑区域，并通过伪造感知GSPO进一步对齐，以加强基础并减少幻觉。实验表明，该方法在检测准确性和可信、专家对齐的解释方面达到了最先进水平。

English abstract

Text-guided image editors can now manipulate authentic medical scans with high fidelity, enabling lesion implantation/removal that threatens clinical trust and safety. Existing defenses are inadequate for healthcare. Medical detectors are largely black-box, while MLLM-based explainers are typically post-hoc, lack medical expertise, and may hallucinate evidence on ambiguous cases. We present MedForge, a data-and-method solution for pre-hoc, evidence-grounded medical forgery detection. We introduce MedForge-90K, a large-scale benchmark of realistic lesion edits across 19 pathologies with expert-guided reasoning supervision via doctor inspection guidelines and gold edit locations. Building on it, MedForge-Reasoner performs localize-then-analyze reasoning, predicting suspicious regions before producing a verdict, and is further aligned with Forgery-aware GSPO to strengthen grounding and reduce hallucinations. Experiments demonstrate state-of-the-art detection accuracy and trustworthy, expert-aligned explanations.


2603.16867 2026-06-04 cs.LG cs.CL

Efficient Reasoning on the Edge

边缘设备上的高效推理

Yelysei Bondarenko, Thomas Hehn, Rob Hesselink, Romain Lepert, Fabio Valerio Massoli, Evgeny Mironov, Leyla Mirvakhabova, Tribhuvanesh Orekondy, Spyridon Stasis, Andrey Kuzmin, Anna Kuzina, Markus Nagel, Ankita Nayak, Corrado Rainone, Ork de Rooij, Paul N Whatmough, Arash Behboodi, Babak Ehteshami Bejnordi

Affiliation: Qualcomm AI Research

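AI summary: Proposes a method that combines LoRA adapters, supervised fine-tuning, budget forcing via reinforcement learning, parallel test-time scaling, dynamic adapter switching, and KV-cache sharing to achieve efficient, accurate reasoning on resource-constrained edge devices.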

Comments: Project page: https://qualcomm-ai-research.github.io/llm-reasoning-on-edge/

English abstract

Large language models (LLMs) with chain-of-thought reasoning achieve state-of-the-art performance across complex problem-solving tasks, but their verbose reasoning traces and large context requirements make them impractical for edge deployment. These challenges include high token generation costs, large KV-cache footprints, and inefficiencies when distilling reasoning capabilities into smaller models for mobile devices. Existing approaches often rely on distilling reasoning traces from larger models into smaller models, which are verbose and stylistically redundant, undesirable for on-device inference. In this work, we propose a lightweight approach to enable reasoning in small LLMs using LoRA adapters combined with supervised fine-tuning. We further introduce budget forcing via reinforcement learning on these adapters, significantly reducing response length with minimal accuracy loss. To address memory-bound decoding, we exploit parallel test-time scaling, improving accuracy at minor latency increase. Finally, we present a dynamic adapter-switching mechanism that activates reasoning only when needed and a KV-cache sharing strategy during prompt encoding, reducing time-to-first-token for on-device inference. Experiments on Qwen2.5-7B demonstrate that our method achieves efficient, accurate reasoning under strict resource constraints, making LLM reasoning practical for mobile scenarios. Videos demonstrating our solution running on mobile devices are available on our project page.
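Parallel test-time scaling, as mentioned in the abstract, samples several reasoning chains for one prompt and combines their final answers. The sketch below assumes majority voting as the combination rule, which the abstract does not specify; `sample_answer` and `noisy_solver` are hypothetical stand-ins for a model call, and the loop runs sequentially here whereas an on-device system would presumably decode the samples as one batch.

```python
import random
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def parallel_scaling(
    sample_answer: Callable[[str, int], str], prompt: str, n: int
) -> str:
    """Draw n independent answers for the same prompt and aggregate them.

    `sample_answer(prompt, seed)` is a hypothetical interface for one
    sampled reasoning chain that returns only its final answer.
    """
    answers = [sample_answer(prompt, seed) for seed in range(n)]
    return majority_vote(answers)


def noisy_solver(prompt: str, seed: int) -> str:
    """Toy stand-in for a small model: answers "42" about 60% of the time,
    otherwise a random single digit."""
    rng = random.Random(seed)
    return "42" if rng.random() < 0.6 else str(rng.randint(0, 9))


if __name__ == "__main__":
    for n in (1, 4, 16):
        print(n, parallel_scaling(noisy_solver, "What is 6 * 7?", n))
```

Because wrong answers in this toy setup are spread over many values while the correct one is concentrated, voting over more samples tends to recover the right answer more often. That is the intuition behind trading a small latency increase for accuracy.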

Time Series Forecasting as Reasoning: A Slow-Thinking Approach with Reinforced LLMs

Training speedups via batching for geometric learning: an analysis of static and dynamic algorithms

Generative Augmented Inference

Kernel Neural Operators (KNOs) for Scalable, Memory-efficient, Geometrically-flexible Operator Learning

Contextual Multi-Task Reinforcement Learning for Autonomous Reef Monitoring

Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization

Belief-Aware VLM Model for Human-like Reasoning

Attention-Based Sampler for Diffusion Language Models

ALMAB-DC: Active Learning, Multi-Armed Bandits, and Distributed Computing for Sequential Experimental Design and Black-Box Optimization

Adalina: Adaptive Linear Approximation for the Shapley Value and Beyond

The Accountability Horizon: An Impossibility Theorem for Governing Human-Agent Collectives

From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space

R3G: A Reasoning-Retrieval-Reranking Framework for Vision-Centric Answer Generation

Reasoning Shift: How Context Silently Shortens LLM Reasoning

Orthogonal Learner for Estimating Heterogeneous Long-Term Treatment Effects

Emotion Entanglement and Bayesian Inference for Multi-Dimensional Emotion Understanding

On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers

Improving Semantic Uncertainty Quantification in LVLMs with Semantic Gaussian Processes

Symbolic Regression for Shared Expressions: Introducing Partial Parameter Sharing

T$^\star$: Progressive Block Scaling for Masked Diffusion Language Models Through Trajectory Aware Reinforcement Learning

PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay

Bilevel Autoresearch: Meta-Autoresearching Itself

GenSpan: Generation-Calibrated Motion Span Priors for Multi-Verb Video Corpus Moment Retrieval

Spatial Transcriptomics as Images for Large-Scale Pretraining

MemoNoveltyAgent: A Historical Research Memory-Aware Agent Workflow for Paper Novelty Assessment

Transferable Multi-Bit Watermarking Across Frozen Diffusion Models via Latent Consistency Bridges

AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science

MedForge: Interpretable Medical Deepfake Detection via Forgery-aware Reasoning

Efficient Reasoning on the Edge