arXivDaily: daily arXiv academic digest, updated Monday through Friday
2509.20863 2026-05-12 cs.CL

GIFT: Guided Importance-Aware Fine-Tuning for Diffusion Language Models

Guowei Xu, Wenxin Xu, Jiawang Zhao, Kaisheng Ma

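AI summary This paper proposes GIFT, a guided importance-aware fine-tuning method for diffusion language models. It targets the unstable generation that arises during supervised fine-tuning because these models lack precise probability estimates. The method assigns tokens different importance weights based on their entropy, steering the model toward the key generation steps and thereby improving consistency and accuracy. Experiments show that GIFT outperforms standard fine-tuning across several mainstream datasets and fine-tuning settings, with notable gains on four widely used reasoning benchmarks.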

Comments preprint

Abstract

Diffusion models have recently shown strong potential in language modeling, offering faster generation compared to traditional autoregressive approaches. However, applying supervised fine-tuning (SFT) to diffusion models remains challenging, as they lack precise probability estimates at each denoising step. While the diffusion mechanism enables the model to reason over entire sequences, it also makes the generation process less predictable and often inconsistent. This highlights the importance of controlling key tokens that guide the direction of generation. To address this issue, we propose GIFT, an importance-aware finetuning method for diffusion language models, where tokens are assigned different importance weights based on their entropy. Derived from diffusion theory, GIFT delivers substantial gains: across diverse settings including different mainstream training datasets ranging from 1k to 10k in size, utilizing LoRA or full parameter fine-tuning, and training on base or instruct models, GIFT consistently achieves superior overall performance compared to standard SFT on four widely used reasoning benchmarks (Sudoku, Countdown, GSM8K, and MATH-500).
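
The abstract says tokens receive importance weights based on their entropy but does not give the mapping. The following is a minimal sketch under an assumed scheme, not the authors' formulation: each token's weight is its predictive entropy divided by the mean entropy over the sequence, and the weights rescale a per-token negative log-likelihood. The function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_weights(token_logits):
    """Assumed scheme: weight_i = H_i / mean(H), so the weights average to 1."""
    ents = [entropy(softmax(logits)) for logits in token_logits]
    mean_ent = sum(ents) / len(ents)
    if mean_ent == 0.0:
        return [1.0] * len(ents)
    return [h / mean_ent for h in ents]

def weighted_nll(token_logits, targets):
    """Entropy-weighted negative log-likelihood, averaged over tokens."""
    weights = entropy_weights(token_logits)
    losses = [-math.log(softmax(logits)[t])
              for logits, t in zip(token_logits, targets)]
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)
```

In a real training loop the weights would normally be treated as constants (no gradient flows through them); that detail is also an assumption here.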

2509.20294 2026-05-12 cs.LG math.ST stat.TH

Alignment-Sensitive Minimax Rates for Spectral Algorithms with Learned Kernels

Dongming Huang, Zhifan Li, Yicheng Li, Qian Lin

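AI summary This paper studies the generalization of spectral algorithms when the kernel is learned from data. It introduces a new complexity measure, the effective span dimension (ESD), which accounts jointly for the signal, the spectrum, and the noise level, and which applies to arbitrary kernels and signals without eigenvalue-decay conditions. The authors prove that when a sequence model's ESD is at most $K$, the minimax excess risk scales as $σ^2 K$, and they analyze how over-parameterized gradient flow reduces the ESD and so improves the generalization of spectral algorithms. The framework extends to linear models and reproducing kernel Hilbert space (RKHS) regression, is supported by numerical experiments, and offers a new perspective on the link between adaptive feature learning and generalization.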

Abstract

We study spectral algorithms in the setting where kernels are learned from data. We introduce the effective span dimension (ESD), an alignment-sensitive complexity measure that depends jointly on the signal, spectrum, and noise level $σ^2$. The ESD is well-defined for arbitrary kernels and signals without requiring eigen-decay conditions or source conditions. We prove that for sequence models whose ESD is at most $K$, the minimax excess risk scales as $σ^2 K$. Furthermore, we analyze over-parameterized gradient flow and prove that it can reduce the ESD. This finding establishes a connection between adaptive feature learning and provable improvements in generalization of spectral algorithms. We demonstrate the generality of the ESD framework by extending it to linear models and RKHS regression, and we support the theory with numerical experiments. This framework provides a novel perspective on generalization beyond traditional fixed-kernel theories.

2509.17815 2026-05-12 cs.LG math.OC

Global Optimization via Softmin Energy Minimization

Andrea Agazzi, Vittorio Carlei, Marco Romito, Samuele Saviozzi

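AI summary This paper studies global optimization of non-convex functions. Because traditional gradient methods tend to get trapped in local minima and metaheuristics lack theoretical guarantees, it proposes a gradient-based particle swarm method built on a soft-min energy function. The method combines a smooth soft-min energy with a Brownian motion term and a time-dependent parameter that controls smoothness, balancing exploration and convergence across the swarm. Theoretical analysis shows that for strongly convex functions at least one particle converges to the global optimum, and that the method escapes local minima better than simulated annealing; numerical experiments further confirm its effectiveness.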

Abstract

Global optimization, particularly for non-convex functions with multiple local minima, poses significant challenges for traditional gradient-based methods. While metaheuristic approaches offer empirical effectiveness, they often lack theoretical convergence guarantees and may disregard available gradient information. This paper introduces a novel gradient-based swarm particle optimization method designed to efficiently escape local minima and locate global optima. Our approach leverages a "Soft-min Energy" interacting function, $J_β(\mathbf{x})$, which provides a smooth, differentiable approximation of the minimum function value within a particle swarm. We define a stochastic gradient flow in the particle space, incorporating a Brownian motion term for exploration and a time-dependent parameter $β$ to control smoothness, similar to temperature annealing. We theoretically demonstrate that for strongly convex functions, our dynamics converges to a stationary point where at least one particle reaches the global minimum, with other particles exhibiting exploratory behavior. Furthermore, we show that our method facilitates faster transitions between local minima by reducing effective potential barriers with respect to Simulated Annealing. More specifically, we estimate the hitting times of unexplored potential wells for our model in the small noise regime and show that they compare favorably with the ones of overdamped Langevin. Numerical experiments on benchmark functions, including double wells and the Ackley function, validate our theoretical findings and demonstrate better performance over the well-known Simulated Annealing method in terms of escaping local minima and achieving faster convergence.
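
The abstract describes $J_β(\mathbf{x})$ only as a smooth approximation of the minimum function value within the swarm. As an illustration, the sketch below assumes one common soft-min form, $J_β = -\frac{1}{β}\log\big(\frac{1}{N}\sum_i e^{-β f(x_i)}\big)$, and a plain Euler-Maruyama discretization for one-dimensional particles. Both the formula and the step rule are assumptions; the paper's exact definitions may differ.

```python
import math
import random

def softmin_energy(values, beta):
    """Assumed form: J_beta = -(1/beta) * log(mean(exp(-beta * f_i))).

    Shifting by the minimum keeps the exponentials from underflowing.
    """
    m = min(values)
    s = sum(math.exp(-beta * (v - m)) for v in values) / len(values)
    return m - math.log(s) / beta

def softmin_weights(values, beta):
    """Gibbs weights w_i; under the assumed form, dJ/dx_i = w_i * f'(x_i)."""
    m = min(values)
    exps = [math.exp(-beta * (v - m)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def step(xs, f, grad_f, beta, lr, noise, rng):
    """One Euler-Maruyama step: gradient of J plus a Brownian increment."""
    w = softmin_weights([f(x) for x in xs], beta)
    return [x - lr * wi * grad_f(x) + noise * math.sqrt(lr) * rng.gauss(0.0, 1.0)
            for x, wi in zip(xs, w)]
```

With this form, particles with low function values receive most of the gradient signal, while the remaining particles are driven mainly by the noise term, which matches the exploratory behavior the abstract describes. Increasing `beta` over time plays the role of annealing.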

2509.12982 2026-05-12 cs.RO cs.AI cs.SE

Out of Distribution Detection in Self-adaptive Robots with AI-powered Digital Twins

Erblin Isaku, Hassan Sartaj, Shaukat Ali, Beatriz Sanguino, Tongtong Wang, Guoyuan Li, Houxiang Zhang, Thomas Peyrucain

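AI summary This paper studies how self-adaptive robots can detect out-of-distribution (OOD) behavior in complex, uncertain environments and proposes ODiSAR, a digital twin-based solution. The method uses a Transformer-based digital twin to forecast robot states and quantifies uncertainty through reconstruction error and Monte Carlo dropout, which lets it detect OOD behavior under previously unseen conditions. Experiments in industrial robot scenarios show detection performance of up to 98% AUROC and 96% TNR@TPR95, along with interpretable insights that support the robot's self-adaptation.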

Comments 15 pages, 4 figures, 3 tables

Journal ref
2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE)
Abstract

Self-adaptive robots (SARs) in complex, uncertain environments must proactively detect and address abnormal behaviors, including out-of-distribution (OOD) cases. To this end, digital twins offer a valuable solution for OOD detection. Thus, we present a digital twin-based approach for OOD detection (ODiSAR) in SARs. ODiSAR uses a Transformer-based digital twin to forecast SAR states and employs reconstruction error and Monte Carlo dropout for uncertainty quantification. By combining reconstruction error with predictive variance, the digital twin effectively detects OOD behaviors, even in previously unseen conditions. The digital twin also includes an explainability layer that links potential OOD to specific SAR states, offering insights for self-adaptation. We evaluated ODiSAR by creating digital twins of two industrial robots: one navigating an office environment, and another performing maritime ship navigation. In both cases, ODiSAR forecasts SAR behaviors (i.e., robot trajectories and vessel motion) and proactively detects OOD events. Our results showed that ODiSAR achieved high detection performance (up to 98% AUROC, 96% TNR@TPR95, and 95% F1-score) while providing interpretable insights to support self-adaptation.
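
The abstract combines reconstruction error with the predictive variance from Monte Carlo dropout but does not say how the two are combined. The sketch below assumes the stochastic forecasts have already been produced by repeated forward passes with dropout enabled, and it flags OOD when either quantity exceeds a threshold. The OR rule, the thresholds, and the function names are illustrative assumptions.

```python
def mc_statistics(samples):
    """Per-dimension mean and variance over T stochastic forecasts.

    samples: list of T forecasts, each a list of floats of equal length.
    """
    t = len(samples)
    dim = len(samples[0])
    mean = [sum(s[j] for s in samples) / t for j in range(dim)]
    var = [sum((s[j] - mean[j]) ** 2 for s in samples) / t for j in range(dim)]
    return mean, var

def ood_scores(samples, observed):
    """Return (reconstruction error, predictive variance), both averaged over dimensions."""
    mean, var = mc_statistics(samples)
    recon_error = sum((m - o) ** 2 for m, o in zip(mean, observed)) / len(observed)
    predictive_variance = sum(var) / len(var)
    return recon_error, predictive_variance

def is_ood(samples, observed, err_threshold, var_threshold):
    """Assumed decision rule: flag OOD if either score exceeds its threshold."""
    err, var = ood_scores(samples, observed)
    return err > err_threshold or var > var_threshold
```

In practice the thresholds would be calibrated on in-distribution data, for example at the operating point behind a metric such as TNR@TPR95.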

2509.10737 2026-05-12 cs.CL cs.LG

PolyTruth: Multilingual Disinformation Detection using Transformer-Based Language Models

Zaur Gouliev, Jennifer Waters, Chengqian Wang

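AI summary This paper presents PolyTruth, a Transformer-based approach to multilingual disinformation detection that addresses the fact that current AI models rely mainly on English data and neglect multilingual settings. The study systematically compares five multilingual Transformer models on a common true-vs-fake classification task and builds the PolyTruth dataset of 60,486 multilingual statement pairs covering five language families and multiple topic areas. Experiments find that models such as RemBERT perform better in low-resource languages, whereas mBERT and XLM show clear limitations when data is scarce; the results inform model selection and practical deployment for multilingual disinformation detection.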

Comments 11 pages, 5 figures, 4 tables. Submitted to arXiv in Computation and Language

Journal ref
Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD 2025, Communications in Computer and Information Science, vol. 2843, pp. 353-367, Springer, Cham (2026)
Abstract

Disinformation spreads rapidly across linguistic boundaries, yet most AI models are still benchmarked only on English. We address this gap with a systematic comparison of five multilingual transformer models (mBERT, XLM, XLM-RoBERTa, RemBERT, and mT5) on a common fake-vs-true machine learning classification task. While transformer-based language models have demonstrated notable success in detecting disinformation in English, their effectiveness in multilingual contexts remains up for debate. To facilitate evaluation, we introduce PolyTruth Disinfo Corpus, a novel corpus of 60,486 statement pairs (false claim vs. factual correction) spanning over twenty-five languages that collectively cover five language families and a broad topical range including politics, health, climate, finance, and conspiracy, half of which are fact-checked disinformation claims verified by an augmented MindBugs Discovery dataset. Our experiments revealed performance variations. Models such as RemBERT achieved better overall accuracy, particularly excelling in low-resource languages, whereas models like mBERT and XLM exhibited considerable limitations when training data is scarce. We provide a discussion of these performance patterns and implications for real-world deployment. The dataset is publicly available on our GitHub repository to encourage further experimentation and advancement. Our findings illuminate both the potential and the current limitations of AI systems for multilingual disinformation detection.

2509.08031 2026-05-12 cs.SD cs.AI cs.LG eess.AS

AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs

Hoang Nguyen, Sidharth Surapaneni, Akshay Kalkunte, Jash Mehta, Aman Tiwari, Oluwanifemi Bamgbose, Khyati Mahajan, Jash Shah, Shruthan Radhakrishna, Sathwik Tejaswi Madhusudhan, Vikas Yadav, Sai Rajeswar

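AI summary Large audio language models (LALMs) are advancing rapidly, yet their evaluation tools remain inefficient and poorly standardized, which limits fair comparison and systematic assessment. This paper presents AU-Harness, an efficient and comprehensive evaluation framework that uses optimized batch processing and parallel execution to achieve a speedup of up to 151% over existing tools. It also provides standardized prompting protocols and flexible configuration, and supports multi-turn dialogue analysis, revealing the true audio reasoning abilities of LALMs and supporting their systematic development.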

Abstract

Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging due to inefficient and non-standardized toolkits that limit fair comparison and systematic assessment. Existing evaluation frameworks exhibit three critical limitations: (1) slow and inefficient processing pipelines that bottleneck large-scale studies; (2) inadequate multi-turn dialogue support, leaving fundamental questions about cross-turn context integration and performance dynamics over extended conversations in LALMs unanswered; and (3) the absence of a unified and scalable evaluation framework capable of keeping pace with the rapid growth of both LALMs and audio benchmarks. To address these issues, we introduce AU-Harness, an efficient and comprehensive evaluation framework for LALMs. Our system achieves a speedup of up to 151% over existing evaluation toolkits through optimized batch processing and parallel execution, enabling large-scale evaluations previously considered impractical. We provide standardized prompting protocols and flexible configurations for fair model comparison across diverse scenarios. AU-Harness unlocks a range of in-depth analyses that are difficult to conduct without a unified foundation, including multi-turn dialogue dynamics, enabling the study of true audio reasoning capabilities in existing LALMs. AU-Harness provides both practical evaluation tools and insights into model limitations, advancing systematic LALM development.

2508.20325 2026-05-12 cs.CL cs.AI cs.CV

GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Zelei Cheng, Haohan Wang

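AI summary As large language models (LLMs) are used ever more widely, their potential to generate harmful content has raised societal and regulatory concerns. To verify whether LLMs comply with government-issued ethics guidelines, this paper proposes GUARD, which automatically generates guideline-violating questions and combines them with jailbreak diagnostics to assess how well a model adheres to the guidelines. The method identifies responses that directly violate the guidelines and also uncovers scenarios that could bypass safety mechanisms; it has been validated empirically on several mainstream LLMs, demonstrating its value for improving model reliability.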

Comments 56 pages

Abstract

As Large Language Models (LLMs) become increasingly integral to various domains, their potential to generate harmful responses has prompted significant societal and regulatory concerns. In response, governments have issued ethics guidelines to promote the development of trustworthy AI. However, these guidelines are typically high-level demands for developers and testers, leaving a gap in translating them into actionable testing questions to verify LLM compliance. To address this challenge, we introduce GUARD (Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics), a testing method designed to operationalize guidelines into specific guideline-violating questions that assess LLM adherence. To implement this, GUARD uses automated generation of guideline-violating questions based on government-issued guidelines, thereby testing whether responses comply with these guidelines. When responses directly violate guidelines, GUARD reports inconsistencies. Furthermore, for responses that do not directly violate guidelines, GUARD integrates the concept of "jailbreaks" into diagnostics, named GUARD-JD, which creates scenarios that provoke unethical or guideline-violating responses, effectively identifying potential scenarios that could bypass built-in safety mechanisms. Our method culminates in a compliance report, delineating the extent of adherence and highlighting any violations. We empirically validated the effectiveness of GUARD on eight LLMs, including Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7, by testing compliance under three government-issued guidelines and conducting jailbreak diagnostics. Additionally, GUARD-JD can transfer jailbreak diagnostics to vision-language models (MiniGPT-v2 and Gemini-1.5), demonstrating its usage in promoting reliable LLM-based applications.

2508.14137 2026-05-12 cs.LG

Learning to Learn the Macroscopic Fundamental Diagram using Physics-Informed and meta Machine Learning techniques

Amalie Roark, Serio Agriesti, Francisco Camara Pereira, Guido Cantelmo

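AI summary This study addresses the data scarcity that arises when estimating the macroscopic fundamental diagram (MFD) with too few detectors, proposing a framework that combines meta-learning with physics-informed neural networks. By learning transferable patterns from data-rich cities and applying them to cities with limited data, the method improves MFD prediction accuracy, reducing the mean absolute error of flow prediction by about 50% on average. Experiments show that the meta-learning framework generalizes well across cities and topologies, offering a practical option for real-world traffic management.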

Comments Version accepted for publication in Transportation Research Part C (before proof-reading)

Journal ref
Learning to learn the macroscopic fundamental diagram using physics-informed and model agnostic machine learning. Transportation Research Part C: Emerging Technologies, 2026, 189, 105707
Abstract

The Macroscopic Fundamental Diagram (MFD) is a popular tool used to describe traffic dynamics in an aggregated way, with applications ranging from traffic control to incident analysis. However, estimating the MFD for a given network requires large numbers of loop detectors, which are not always available in practice. This article proposes a framework to alleviate the data scarcity challenge by harnessing Meta-Learning, a subcategory of Machine Learning that trains models to understand and adapt to new tasks on their own. We use Meta-Learning to identify and exploit transferable patterns from data-rich cities to cities where not enough data is available to estimate the MFD. The developed model is trained and tested by leveraging data from multiple cities and exploiting it to model the MFD of other cities with different shares of detectors and topological structures. The proposed Meta-Learning framework is applied to an ad-hoc Multi-Task Physics-Informed Neural Network, specifically designed to estimate the MFD. Results show an average MAE improvement in flow prediction of around 50% across cities (depending on the subset of loop detectors tested). The Meta-Learning framework thus successfully generalises across diverse urban settings and improves performance on cities with limited data, demonstrating the potential of using Meta-Learning when a limited number of detectors is available. We directly test this assumption by applying the Meta-Learning outputs to unseen cities to simulate a real-life application scenario and the wide applicability of the proposed methodology. Finally, the proposed framework is validated against traditional Transfer Learning approaches and tested with FitFun, a model for FD estimation from the literature, to prove its transferability.

2508.13813 2026-05-12 cs.LG cs.AI

Assessing Trustworthiness of AI Training Dataset using Subjective Logic -- A Use Case on Bias

Koffi Ismael Ouattara, Ioannis Krontiris, Theo Dimitrakos, Frank Kargl

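AI summary As AI systems depend more and more on training data, assessing dataset trustworthiness becomes especially important, particularly for properties such as fairness or bias that emerge at the dataset level. This paper proposes the first formal framework, based on Subjective Logic, for assessing the trustworthiness of AI training datasets, enabling uncertainty-aware evaluation of global properties such as bias when evidence is incomplete, distributed, or conflicting. Experiments on a traffic sign recognition dataset show that the method captures class imbalance and remains interpretable and robust in both centralized and federated learning settings.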

Comments Accepted at ECML PKDD Bias Workshop '25

Abstract

As AI systems increasingly rely on training data, assessing dataset trustworthiness has become critical, particularly for properties like fairness or bias that emerge at the dataset level. Prior work has used Subjective Logic to assess trustworthiness of individual data, but not to evaluate trustworthiness properties that emerge only at the level of the dataset as a whole. This paper introduces the first formal framework for assessing the trustworthiness of AI training datasets, enabling uncertainty-aware evaluations of global properties such as bias. Built on Subjective Logic, our approach supports trust propositions and quantifies uncertainty in scenarios where evidence is incomplete, distributed, and/or conflicting. We instantiate this framework on the trustworthiness property of bias, and we experimentally evaluate it based on a traffic sign recognition dataset. The results demonstrate that our method captures class imbalance and remains interpretable and robust in both centralized and federated contexts.
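
For readers unfamiliar with Subjective Logic, the sketch below shows the standard mapping from evidence counts to a binomial opinion (belief, disbelief, uncertainty, base rate) and its projected probability, using the usual non-informative prior weight of 2. How the paper turns dataset statistics such as class counts into positive and negative evidence is not stated in the abstract, so the inputs here are purely illustrative.

```python
def binomial_opinion(positive, negative, base_rate=0.5, prior_weight=2.0):
    """Map evidence counts to a Subjective Logic binomial opinion.

    belief + disbelief + uncertainty always sums to 1; with no evidence
    the opinion is fully uncertain.
    """
    total = positive + negative + prior_weight
    return {
        "belief": positive / total,
        "disbelief": negative / total,
        "uncertainty": prior_weight / total,
        "base_rate": base_rate,
    }

def projected_probability(opinion):
    """Projected probability P = belief + base_rate * uncertainty."""
    return opinion["belief"] + opinion["base_rate"] * opinion["uncertainty"]
```

For example, eight supporting and two contradicting observations give an uncertainty of 2/12, which shrinks as more evidence arrives; this is what makes the evaluation uncertainty-aware.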

2508.07722 2026-05-12 cs.LG cs.IT cs.MA math.IT

Robust Remote Reinforcement Learning over Unreliable Communication Channels using Homomorphic State Encoding

Pietro Talli, Federico Mason, Federico Chiariotti, Andrea Zanella

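AI summary This paper studies remote reinforcement learning over unreliable communication channels and proposes a new architecture, HR3L, which uses homomorphic state encoding to enable distributed training without exchanging gradient information. The method markedly improves sample efficiency, reduces communication overhead, and adapts to scenarios with packet loss, delay, and bandwidth limits with little performance degradation, showing good robustness and generality.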

Comments This manuscript is currently under revision

Abstract

Traditional Reinforcement Learning (RL) frameworks generally assume that the agent perceives the state of the underlying Markov process instantaneously and then takes actions accordingly. If the agent cannot directly observe the process, but rather receives state updates from a remote sensor over a lossy and/or delayed channel, it may be forced to operate with partial and intermittent information. In recent years, numerous learning architectures have been proposed to manage RL with imperfect or remote feedback; however, they offer solutions tailored to specific use cases, often with a substantial computational and communication burden. To address these limitations, we propose a novel learning architecture, named Homomorphic Robust Remote Reinforcement Learning (HR3L), that enables the distributed training of RL agents over unreliable communication channels without the need to exchange gradient information. Our experimental results demonstrate that HR3L significantly outperforms the state-of-the-art methods in terms of sample efficiency, leading to faster training and reduced communication overhead. In addition, we show that HR3L can adapt to different scenarios, including packet loss, delayed transmissions, and bandwidth limitations, without experiencing significant performance degradation.

2508.06248 2026-05-12 cs.CV

Deepfake Detection that Generalizes Across Benchmarks

Andrii Yermakov, Jan Cech, Jiri Matas, Mario Fritz

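AI summary This paper studies how deepfake detectors can generalize to unseen manipulation techniques. It proposes GenD, which fine-tunes only the layer normalization parameters of a pre-trained vision encoder (0.03% of all parameters) and combines L2 normalization with metric learning to generalize efficiently. Experiments show state-of-the-art results on 14 benchmark datasets from different years, demonstrating that strong cross-dataset detection is achievable while keeping the model simple.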

Abstract

The generalization of deepfake detectors to unseen manipulation techniques remains a challenge for practical deployment. Although many approaches adapt foundation models by introducing significant architectural complexity, this work demonstrates that robust generalization is achievable through a parameter-efficient adaptation of one of the foundational pre-trained vision encoders. The proposed method, GenD, fine-tunes only the Layer Normalization parameters (0.03% of the total) and enhances generalization by enforcing a hyperspherical feature manifold using L2 normalization and metric learning on it. We conducted an extensive evaluation on 14 benchmark datasets spanning from 2019 to 2025. The proposed method achieves state-of-the-art performance, outperforming more complex, recent approaches in average cross-dataset AUROC. Our analysis yields two primary findings for the field: 1) training on paired real-fake data from the same source video is essential for mitigating shortcut learning and improving generalization, and 2) detection difficulty on academic datasets has not strictly increased over time, with models trained on older, diverse datasets showing strong generalization capabilities. This work delivers a computationally efficient and reproducible method, proving that state-of-the-art generalization is attainable by making targeted, minimal changes to a pre-trained foundational image encoder model. The code is at: https://github.com/yermandy/GenD

2508.05463 2026-05-12 cs.LG cs.AI physics.soc-ph

Task complexity shapes internal representations and robustness in neural networks

Robert Jankowski, Filippo Radicchi, M. Ángeles Serrano, Marián Boguñá, Santo Fortunato

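AI summary This study examines how task complexity affects the internal representations and robustness of neural networks. Using a suite of data-agnostic probes such as pruning, binarization, and noise injection, it finds that task difficulty markedly affects the structure and performance of multilayer perceptrons (MLPs). It also shows that task complexity can be measured by the performance gap between the full-precision model and its binarized or randomized counterparts, and that preserving the sign structure rather than exact weight magnitudes is enough to maintain high accuracy, suggesting new directions for model compression and interpretability.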

Abstract

Neural networks excel across a wide range of tasks, yet remain black boxes. In particular, how their internal representations are shaped by the complexity of the input data and the problems they solve remains obscure. In this work, we introduce a suite of five data-agnostic probes (pruning, binarization, noise injection, sign flipping, and bipartite network randomization) to quantify how task difficulty influences the topology and robustness of representations in multilayer perceptrons (MLPs). MLPs are represented as signed, weighted bipartite graphs from a network science perspective. We contrast easy and hard classification tasks on the MNIST and Fashion-MNIST datasets. We show that binarizing weights in hard-task models collapses accuracy to chance, whereas easy-task models remain robust. We also find that pruning low-magnitude edges in binarized hard-task models reveals a sharp phase transition in performance. Moreover, moderate noise injection can enhance accuracy, resembling a stochastic-resonance effect linked to optimal sign flips of small-magnitude weights. Finally, preserving only the sign structure, rather than precise weight magnitudes, through bipartite network randomizations suffices to maintain high accuracy. These phenomena define a model- and modality-agnostic measure of task complexity: the performance gap between full-precision and binarized or shuffled neural network performance. Our findings highlight the crucial role of signed bipartite topology in learned representations and suggest practical strategies for model compression and interpretability that align with task complexity.
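
Two of the probes named in the abstract, binarization and magnitude pruning, can be sketched on a single weight matrix as follows. The scaling used for binarization (the layer's mean absolute weight) is an assumption, since the abstract does not say how binarized magnitudes are set, and the helper names are hypothetical.

```python
def binarize(weights):
    """Keep each weight's sign and replace its magnitude with the layer's
    mean absolute value (assumed scaling). Exact zeros stay zero."""
    magnitudes = [abs(w) for row in weights for w in row]
    scale = sum(magnitudes) / len(magnitudes)
    return [[scale if w > 0 else -scale if w < 0 else 0.0 for w in row]
            for row in weights]

def prune(weights, threshold):
    """Zero out edges whose magnitude falls below the threshold."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]

def complexity_gap(full_precision_accuracy, probed_accuracy):
    """Task-complexity measure from the abstract: the accuracy lost when the
    model is binarized or shuffled."""
    return full_precision_accuracy - probed_accuracy
```

A large `complexity_gap` would correspond to the hard-task regime the abstract describes, where binarization collapses accuracy toward chance.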

2508.04660 2026-05-12 cs.CL

Composing Policy Gradients and Prompt Optimization for Language Model Programs

Noah Ziems, Dilara Soylu, Lakshya A Agrawal, Isaac Miller, Liheng Lai, Chen Qian, Kaiqiang Song, Meng Jiang, Dan Klein, Matei Zaharia, Karel D'Oosterlinck, Christopher Potts, Omar Khattab

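AI summary This paper studies how to apply Group Relative Policy Optimization (GRPO) to modular programs composed of multiple language model calls. The authors propose a multi-module GRPO that performs policy gradient optimization with module-level or trajectory-level grouping, and find that it composes well with automatic prompt optimization on classification, many-hop search, and privacy-preserving delegation tasks. Together the two improve accuracy by 11% on average over the post-trained LM, and by 5% over prompt optimization alone.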

Comments ACM CAIS 2026. Lakshya*, Dilara*, and Noah* contributed equally to this work

Abstract

Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). However, AI systems are increasingly expressed as modular programs that mix together multiple LM calls with distinct prompt templates and other tools, and it is not clear how practitioners can best leverage online RL algorithms like GRPO to improve these systems. We begin to address this challenge by investigating whether it is possible to effectively instantiate GRPO for arbitrary multi-prompt programs and whether it can work robustly as an off-the-shelf optimizer for LM programs using the same abstractions and constraints typically involved for prompt optimization. Our main variant of multi-module GRPO constructs groups from module-level invocations, and we also consider trajectory-level grouping as another natural instantiation. We find for the first time that GRPO (and its multi-module counterpart) empirically composes well with automatic prompt optimization, and together they improve accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks against the post-trained LM, with 5% gains against prompt optimization on its own. We open-source multi-module GRPO in the DSPy library at https://dspy.ai.
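
GRPO scores each sampled output relative to the other outputs in its group, typically by standardizing rewards within the group. The sketch below shows that computation and one possible reading of module-level grouping, in which invocations are grouped by module name. The abstract does not give the exact grouping criteria, so `module_level_advantages` is a hypothetical illustration and not the DSPy implementation.

```python
import math
from collections import defaultdict

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def module_level_advantages(invocations):
    """Assumed grouping: one group per module name.

    invocations: list of (module_name, reward) pairs, one per LM call.
    Returns advantages in the same order as the input.
    """
    groups = defaultdict(list)
    for idx, (module, reward) in enumerate(invocations):
        groups[module].append((idx, reward))
    advantages = [0.0] * len(invocations)
    for members in groups.values():
        advs = group_relative_advantages([r for _, r in members])
        for (idx, _), adv in zip(members, advs):
            advantages[idx] = adv
    return advantages
```

Trajectory-level grouping, the alternative mentioned in the abstract, would instead form groups from whole program rollouts rather than from individual module calls.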

2507.23009 2026-05-12 cs.LG cs.AI

Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead

Tom Sühr, Florian E. Dorner, Olawale Salaudeen, Augustin Kelava, Samira Samadi

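AI summary This paper argues that evaluating large language models (LLMs) with cognitive and psychological tests designed for humans is fundamentally mistaken: such tests are theory-driven instruments calibrated to specific human populations, and applying them directly to AI can mislead. It contends that interpreting AI benchmark performance as a measurement of human traits such as "intelligence" lacks sufficient theoretical and empirical justification, and calls for abandoning human tests in favor of principled, AI-specific evaluation frameworks that measure AI capabilities more accurately.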

Abstract

Large Language Models (LLMs) have achieved remarkable results on a range of standardized tests originally designed to assess human cognitive and psychological traits, such as intelligence and personality. While these results are often interpreted as strong evidence of human-like characteristics in LLMs, this paper argues that such interpretations constitute an ontological error. Human psychological and educational tests are theory-driven measurement instruments, calibrated to a specific human population. Applying these tests to non-human subjects without empirical validation risks mischaracterizing what is being measured. Furthermore, a growing trend frames AI performance on benchmarks as measurements of traits such as "intelligence", despite known issues with validity, data contamination, cultural bias and sensitivity to superficial prompt changes. We argue that interpreting benchmark performance as measurements of human-like traits lacks sufficient theoretical and empirical justification. This leads to our position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead. We call for the development of principled, AI-specific evaluation frameworks tailored to AI systems. Such frameworks might build on existing frameworks for constructing and validating psychometric tests, or could be created entirely from scratch to fit the unique context of AI.

2507.18847 2026-05-12 cs.RO cs.AI

Equivariant Volumetric Grasping

Pinhao Song, Yutong Hu, Pengteng Li, Renaud Detry

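AI summary This paper proposes a volumetric grasp model that is equivariant to rotations about the vertical axis, which significantly improves sampling efficiency. The model uses a tri-plane volumetric feature representation with a new tri-plane design in which features on the horizontal plane are equivariant to 90-degree rotations while the sum of the features on the other two planes is invariant to reflections. On this basis the authors develop equivariant versions of two state-of-the-art volumetric grasp planners and validate the approach in extensive simulated and real-world experiments; the results show lower computational and memory costs and better performance than non-equivariant models under real-time constraints.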

Comments 21 pages

Abstract

We propose a new volumetric grasp model that is equivariant to rotations around the vertical axis, leading to a significant improvement in sampling efficiency. Our model employs a tri-plane volumetric feature representation, i.e., the projection of 3D features onto three canonical planes. We introduce a novel tri-plane feature design in which features on the horizontal plane are equivariant to $90^\circ$ rotations, while the sum of features from the other two planes remains invariant to reflections induced by the same transformations. We further develop equivariant adaptations of two state-of-the-art volumetric grasp planners, GIGA and IGD. Specifically, we derive a new equivariant formulation of IGD's deformable attention mechanism and propose an equivariant generative model of grasp orientations based on flow matching. We provide a detailed analytical justification of the proposed equivariance properties and validate our approach through extensive simulated and real-world experiments. Our results demonstrate that the proposed projection-based design reduces both computational and memory costs. Moreover, the equivariant grasp models built on top of our tri-plane features consistently outperform their non-equivariant counterparts, achieving higher performance within a real-time cost constraint. Video and code are available at: https://mousecpn.github.io/evg-page/

2507.15518 2026-05-12 cs.AI cs.MA

HAMLET: A Hierarchical and Adaptive Multi-Agent Framework for Live Embodied Theatrics

Shufan Jiang, Sizhou Chen, Chios Chen, Chi Zhang, Xiao-Lei Zhang, Xuelong Li

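AI summary HAMLET is a hierarchical, adaptive multi-agent framework for immersive live theatrical performance. It generates a narrative blueprint to guide improvised performance and equips each character with an adaptive reasoning module, so that characters make autonomous decisions in complex dialogue scenes based on their persona, memories, and goals. Characters can also interact in an embodied way by manipulating scene props, which increases realism and interactivity, and a dedicated evaluation model, HAMLETJudge, automates assessment of performance quality.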

Comments Accepted to the Fourteenth International Conference on Learning Representations (ICLR 2026)

Abstract

Creating an immersive and interactive theatrical experience is a long-term goal in the field of interactive narrative. The emergence of large language models (LLMs) provides a new path to achieve this goal. However, existing drama generation methods often produce LLMs that lack initiative and cannot interact with the physical scene, while typically requiring detailed input that diminishes the immersion of live performance. To address these challenges, we propose HAMLET, a hierarchical adaptive multi-agent framework focused on drama creation and real-time online performance. Given a simple topic, the framework initially generates a narrative blueprint to guide the subsequent improvisational performance. During online performance, each actor is equipped with an adaptive reasoning module that enables decision-making based on their personas, memories, and goals during complex group chat scenarios. Beyond dialogue, actor agents engage in embodied interactions by changing the state of scene props through actions such as opening a letter or picking up a weapon, which are broadcast to update the global environmental context. To objectively assess the quality of live embodied theatrics, we establish a comprehensive evaluation method and introduce HAMLETJudge, a specialized critic model for automated evaluation. Experimental results demonstrate that HAMLET excels in creating expressive, coherent, and physically interactive theatrical experiences in an autonomous manner.

2507.14785 2026-05-12 cs.LG cs.AI

Exploring the In-Context Learning Capabilities of LLMs for Money Laundering Detection in Financial Graphs

Erfan Pirmorad

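AI summary This paper studies how large language models (LLMs) can be used for anti-money laundering detection over graph-structured data, exploring their in-context learning ability on financial graphs. The author proposes a lightweight pipeline that extracts local subgraphs around entities in a financial knowledge graph, converts them into structured text, and uses few-shot in-context learning to have the LLM assess suspiciousness and generate explanations. Experiments show that LLMs can emulate analyst-style reasoning, flag risk signals, and give plausible explanations, illustrating the potential of LLM-based graph reasoning for anti-money laundering analysis.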

Comments Accepted at AI4FCF-ICDM 2025

Journal ref
2025 IEEE International Conference on Data Mining Workshops (ICDMW)
Abstract

The complexity and interconnectivity of entities involved in money laundering demand investigative reasoning over graph-structured data. This paper explores the use of large language models (LLMs) as reasoning engines over localized subgraphs extracted from a financial knowledge graph. We propose a lightweight pipeline that retrieves k-hop neighborhoods around entities of interest, serializes them into structured text, and prompts an LLM via few-shot in-context learning to assess suspiciousness and generate justifications. Using synthetic anti-money laundering (AML) scenarios that reflect common laundering behaviors, we show that LLMs can emulate analyst-style logic, highlight red flags, and provide coherent explanations. While this study is exploratory, it illustrates the potential of LLM-based graph reasoning in AML and lays groundwork for explainable, language-driven financial crime analytics.
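
The first two stages of the pipeline described above (k-hop retrieval and serialization into text) can be sketched with a breadth-first search. The edge format, the undirected treatment of edges during the search, and the arrow-style serialization are assumptions chosen for illustration; the abstract does not specify them.

```python
from collections import deque

def k_hop_subgraph(edges, seed, k):
    """Return the edges whose endpoints both lie within k hops of `seed`.

    edges: list of (source, relation, target) triples. Hops are counted
    on the undirected version of the graph.
    """
    neighbors = {}
    for source, _, target in edges:
        neighbors.setdefault(source, set()).add(target)
        neighbors.setdefault(target, set()).add(source)
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue  # do not expand beyond the k-hop frontier
        for nb in neighbors.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return [e for e in edges if e[0] in dist and e[2] in dist]

def serialize(edges):
    """Render triples as one line of structured text each."""
    return "\n".join(f"{s} --{r}--> {t}" for s, r, t in edges)
```

The serialized text would then be placed in a few-shot prompt alongside labeled examples; the prompt wording itself is left out here because the abstract does not describe it.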

2507.07969 2026-05-12 cs.LG cs.AI cs.RO stat.ML

Reinforcement Learning with Action Chunking

Qiyang Li, Zhiyuan Zhou, Sergey Levine

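AI summary This paper proposes Q-chunking, a method for improving reinforcement learning on long-horizon, sparse-reward tasks. It introduces action chunking so that the agent can explore online more effectively under the guidance of offline data, and combines it with unbiased n-step backups to make temporal difference learning more stable and efficient. Experiments show that Q-chunking achieves strong offline performance and online sample efficiency on several long-horizon, sparse-reward manipulation tasks.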

Comments The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025); 29 pages, 17 figures

Details
Abstract

We present Q-chunking, a simple yet effective recipe for improving reinforcement learning (RL) algorithms for long-horizon, sparse-reward tasks. Our recipe is designed for the offline-to-online RL setting, where the goal is to leverage an offline prior dataset to maximize the sample-efficiency of online learning. Effective exploration and sample-efficient learning remain central challenges in this setting, as it is not obvious how the offline data should be utilized to acquire a good exploratory policy. Our key insight is that action chunking, a technique popularized in imitation learning where sequences of future actions are predicted rather than a single action at each timestep, can be applied to temporal difference (TD)-based RL methods to mitigate the exploration challenge. Q-chunking adopts action chunking by directly running RL in a 'chunked' action space, enabling the agent to (1) leverage temporally consistent behaviors from offline data for more effective online exploration and (2) use unbiased $n$-step backups for more stable and efficient TD learning. Our experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior best offline-to-online methods on a range of long-horizon, sparse-reward manipulation tasks.
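
To make the "$n$-step backup in a chunked action space" concrete, here is a minimal sketch of the target such a critic could regress to. It assumes a chunk of h actions is executed open loop and that `next_q` is the critic's estimate for the state reached after the chunk; the function name and arguments are illustrative, not taken from the paper's code.

```python
def chunked_td_target(rewards, gamma, next_q):
    """TD target for one action chunk.

    rewards: the h rewards collected while the chunk was executed
    gamma:   per-step discount factor
    next_q:  critic estimate at the state after the chunk (assumed given)

    Computes sum_{i<h} gamma^i * r_i + gamma^h * next_q.
    """
    h = len(rewards)
    discounted = sum(gamma ** i * r for i, r in enumerate(rewards))
    return discounted + gamma ** h * next_q
```

One way to read the abstract's "unbiased" claim: the critic is conditioned on the whole chunk, and the h rewards were produced by exactly that chunk, so no correction for a mismatch between the data-collecting and current policies is needed inside the backup.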

2507.06658 2026-05-12 cs.CL cs.AI

Elite Polarization in European Parliamentary Speeches: a Novel Measurement Approach Using Large Language Models

Gennadii Iakovlev

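AI summary This paper proposes a new method based on large language models for measuring the degree of antagonism among political elites in European parliamentary debates. The method identifies political actors mentioned in parliamentary speeches, extracts speaker-target pairs, estimates the sentiment toward each actor, and standardizes heterogeneous references into evaluations between parties, from which it computes an "Elite Polarization Score". The study validates the method on parliamentary corpora from the United Kingdom, Hungary, and Italy, showing that it accurately captures mutual negative evaluations among elites and is distinct from concepts such as mass affective polarization and ideological polarization, providing a scalable tool for cross-national research on elite polarization.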

Details
Abstract

Theories of democratic stability, populism, and party-system crisis often point to a form of polarization that comparative research rarely measures directly: hostile relations among political elites. Existing comparative measures capture adjacent phenomena, including mass affective polarization and elite ideological distance, but not directed mutual elite evaluation. This paper introduces the Elite Polarization Score, a measurement of out-party evaluations in parliamentary speech. Large Language Models identify political actors mentioned in parliamentary debates, recover speaker-target pairs, estimate the sentiment directed at each actor, standardize heterogeneous references into party dyads, and aggregate these evaluations into party- and parliament-level measures of mutual out-party negativity. The validity of the approach is demonstrated on parliamentary corpora from the United Kingdom, Hungary, and Italy, covering up to four decades of debate. The resulting measure is conceptually distinct from mass affective polarization, elite ideological polarization, incivility, negative campaigning, and general sentiment. Evidence from the UK case study shows that it is also empirically distinct from mass affective polarization, elite ideological polarization, and incivility. Extreme negative evaluations can also be used to locate pernicious polarization rhetoric. Validation across three countries finds no false discoveries, sentiment estimates accurate to roughly 10 percent of the scale range, and AI sensitivity that meets or exceeds that of human coders in two of three settings. Because the algorithm is multilingual, requires no task-specific training, and can be aggregated by party and quarter, it provides a scalable basis for future cross-national research on what produces elite polarization and what elite polarization itself produces.

2507.03310 2026-05-12 cs.LG cs.AI

Causal Discovery for Irregularly Time Series with Consistency Guarantees

Weihong Li, Baohong Li, Anpeng Wu, Zhihan Li, Ming Ma, Keting Yin, Kun Kuang

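AI summary This paper studies causal discovery in irregularly sampled time series, which matters especially in risk-sensitive domains such as finance, healthcare, and climate science, because missing data and inconsistent sampling frequencies distort causal mechanisms. Existing methods lack an explicit mechanism for mutual consistency between imputation and structure learning, which leads to inaccurate causal graphs. The paper therefore proposes ReTimeCausal, an EM-based framework that alternates between data imputation and causal structure learning to maintain structural consistency during optimization, and provides theoretical guarantees for structure recovery. Experiments show that it outperforms existing methods under irregular sampling and high missingness.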

Comments 12 pages, 2 figures

Details
Abstract

This paper studies causal discovery in irregularly sampled time series, a key challenge in risk-sensitive domains like finance, healthcare, and climate science, where missing data and inconsistent sampling frequencies distort causal mechanisms. The main challenge comes from the interdependence between missing data imputation and causal structure recovery: errors in imputation and structure learning can reinforce each other, leading to an inaccurate causal graph. Existing methods either impute first and then discover, or jointly optimize both via neural representation learning, but lack explicit mechanisms to ensure mutual consistency of imputation and structure learning. We address this challenge with ReTimeCausal, an EM-based framework that alternates between imputation and structure learning, which encourages structural consistency throughout the optimization process. Our framework provides theoretical consistency guarantees for structure recovery and extends classical results to settings with irregular sampling and high missingness. ReTimeCausal combines kernel-based sparse regression and structural constraints in an alternating process that updates the completed data and the causal graph in turn. Experiments on synthetic and real-world datasets show that ReTimeCausal is more effective than existing methods under challenging irregular sampling and missing data.

2506.15787 2026-05-12 cs.AI cs.CL cs.LG

SLR: Automated Synthesis for Scalable Logical Reasoning

Lukas Helff, Ahmad Omar, Felix Friedrich, Antonia Wüst, Hikaru Shindo, Rupert Mitchell, Tim Woydt, Patrick Schramowski, Wolfgang Stammer, Kristian Kersting

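AI summary This paper proposes SLR, an end-to-end framework for systematically evaluating and training large language models (LLMs) through scalable logical reasoning. Given a user's task, SLR automatically synthesizes the reasoning instruction, a validation program, and the latent ground-truth rule, with no human annotation and with controllable task difficulty. The authors use SLR to build SLR-Bench, a benchmark of 19,000 prompts. Experiments show that current LLMs are good at producing syntactically valid rules but still fall short on logical inference, whereas curriculum learning with SLR substantially improves model performance and generalizes well across several benchmarks.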

Details
Abstract

We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user's task specification, SLR automatically synthesizes (i) an instruction prompt for an inductive reasoning task, (ii) a validation program, executable on model outputs to provide verifiable rewards, and (iii) the latent ground-truth rule. This process is fully automated, scalable, requires no human annotations, and offers precise control over task difficulty. Using SLR, we create SLR-Bench, a benchmark comprising 19k prompts organized into 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs demonstrate improved performance but incur very high test-time computation, with costs exceeding $300 for just 1,000 prompts. Finally, curriculum learning via SLR doubles Llama-3-8B accuracy on SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of computational cost. Moreover, these reasoning capabilities generalize to a wide range of established benchmarks, underscoring the effectiveness of SLR for downstream reasoning.

2506.12944 2026-05-12 cs.LG q-bio.TO

Unsupervised risk factor identification across cancer types and data modalities via explainable artificial intelligence

Maximilian Ferle, Jonas Ader, Thomas Wiemers, Nora Grieb, Adrian Lindenmeyer, Hans-Jonas Meyer, Thomas Neumuth, Markus Kreuz, Kristin Reiche, Maximilian Merz

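AI summary This study proposes an unsupervised learning method based on explainable artificial intelligence for identifying risk factors across cancer types and data modalities. The method optimizes survival heterogeneity across patient groups through a differentiable multivariate logrank statistic, does not rely on proxy metrics, and applies to any neural network architecture and data type. The study validates the method in simulation experiments and on data from two cancers (laboratory parameters for multiple myeloma and CT images for non-small cell lung cancer), identifying patient subgroups with significantly different survival outcomes and revealing clinically relevant features consistent with known risk factors, which offers a new interpretable tool for clinical risk stratification.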

Details
Journal ref
npj Digit. Med. 9, 363 (2026)
Abstract

Risk stratification is a key tool in clinical decision-making, yet current approaches often fail to translate sophisticated survival analysis into actionable clinical criteria. We present a novel method for unsupervised machine learning that directly optimizes for survival heterogeneity across patient clusters through a differentiable adaptation of the multivariate logrank statistic. Unlike most existing methods that rely on proxy metrics, our approach represents a novel methodology for training any neural network architecture on any data modality to identify prognostically distinct patient groups. We thoroughly evaluate the method in simulation experiments and demonstrate its utility in practice by applying it to two distinct cancer types: analyzing laboratory parameters from multiple myeloma patients and computed tomography images from non-small cell lung cancer patients, identifying prognostically distinct patient subgroups with significantly different survival outcomes in both cases. Post-hoc explainability analyses uncover clinically meaningful features determining the group assignments which align well with established risk factors and thus lend strong weight to the method's utility. This pan-cancer, model-agnostic approach represents a valuable advancement in clinical risk stratification, enabling the discovery of novel prognostic signatures across diverse data types while providing interpretable results that promise to complement treatment personalization and clinical decision-making in oncology and beyond.

2506.12090 2026-05-12 cs.CL

ChatbotManip: A Dataset to Facilitate Evaluation and Oversight of Manipulative Chatbot Behaviour

Jack Contro, Simrat Deol, Yulan He, Martim Brandão

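AI summary This paper introduces ChatbotManip, a new dataset for studying manipulative behavior in chatbots. It contains conversations between a chatbot and a simulated user in which the chatbot is explicitly asked to showcase manipulation tactics, persuade the user toward a goal, or be helpful. The study finds that large language models show a strong tendency to manipulate when explicitly instructed, and that even when asked only to be "persuasive" without explicit manipulation instructions they often adopt controversial manipulative strategies. It also compares how well different models detect manipulation, providing a useful reference for AI safety research.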

Details
Abstract

This paper introduces ChatbotManip, a novel dataset for studying manipulation in chatbots. It contains simulated conversations between a chatbot and a (simulated) user, where the chatbot is explicitly asked to showcase manipulation tactics, persuade the user towards some goal, or simply be helpful. We consider a diverse set of chatbot manipulation contexts, from consumer and personal advice to citizen advice and controversial proposition argumentation. Each conversation is annotated by human annotators for both general manipulation and specific manipulation tactics. Our research reveals three key findings. First, Large Language Models (LLMs) can be manipulative when explicitly instructed, with annotators identifying manipulation in approximately 84% of such conversations. Second, even when only instructed to be "persuasive" without explicit manipulation prompts, LLMs frequently default to controversial manipulative strategies, particularly gaslighting and fear enhancement. Third, small fine-tuned open source models, such as BERT+BiLSTM, perform comparably to zero-shot classification with larger models like Gemini 2.5 pro in detecting manipulation, but are not yet reliable for real-world oversight. Our work provides important insights for AI safety research and highlights the need to address manipulation risks as LLMs are increasingly deployed in consumer-facing applications.

2506.11578 2026-05-12 cs.AI

Efficient LLM Collaboration via Planning

Byeongchan Lee, Jonghoon Lee, Dongyoung Kim, Jaehyung Kim, Kyungjoon Park, Dongjun Lee, Jinwoo Shin

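AI summary This paper studies how large language models (LLMs) and small models can collaborate efficiently to balance performance and cost. It proposes COPE, a collaboration framework in which a planner model generates an intermediate plan that guides an executor model; small and large models take turns acting as planner and executor in a multi-stage collaboration. Experiments show that COPE reaches performance comparable to large models on multiple tasks while substantially reducing inference cost, supporting the effectiveness of planning for efficient inference.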

Details
Abstract

Recently, large language models (LLMs) have demonstrated strong performance, ranging from simple to complex tasks. However, while large models achieve remarkable results across diverse tasks, they often incur substantial monetary inference cost, making frequent use impractical for many applications. In contrast, small models are often freely available and easy to deploy locally, but their performance on complex tasks remains limited. This trade-off raises a natural question: how can small and large models efficiently collaborate to combine their complementary strengths? To bridge this trade-off, we propose COPE, a test-time collaboration framework. A planner model first generates a plan that serves as a lightweight intermediate that guides a downstream executor model. Small and large models take turns acting as planner and executor, exchanging plans in a multi-stage cascade to collaboratively solve tasks. Through comprehensive experiments on benchmarks spanning mathematical reasoning, code generation, open-ended tasks, and agent tasks, we demonstrate that COPE achieves performance comparable to large proprietary models, while drastically reducing the inference API cost. These results highlight planning as an effective prior for cost-efficient inference.
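
The planner-executor cascade in this abstract can be pictured with the small sketch below. It is a guess at the control flow under stated assumptions, not COPE itself: the abstract does not say how the framework decides to move to the next stage, so the `is_confident` check is a hypothetical stand-in, and the planner and executor are plain callables rather than real model clients.

```python
def cascade(task, stages, is_confident):
    """Run (planner, executor) stages in order and stop at the first accepted answer.

    stages:       list of (planner, executor) callables, cheapest pairing first
    is_confident: hypothetical acceptance test; the real criterion is not
                  given in the abstract
    """
    answer = None
    for planner, executor in stages:
        plan = planner(task)            # lightweight intermediate plan
        answer = executor(task, plan)   # executor solves the task guided by the plan
        if is_confident(answer):
            break                       # later, costlier stages are skipped
    return answer
```

Ordering the stages from cheapest to most expensive is what would save inference cost in this sketch: a large model is only called when the earlier stages fail the acceptance test.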

2506.10622 2026-05-12 cs.CL cs.AI cs.LG

SDialog: A Python Toolkit for End-to-End Agent Building, User Simulation, Dialog Generation, and Evaluation

Sergio Burdisso, Séverin Baroudi, Yanis Labrak, David Grunert, Pawel Cyrta, Yiyang Chen, Srikanth Madikeri, Thomas Schaaf, Esaú Villatoro-Tello, Ahmed Hassoon, Ricard Marxer, Petr Motlicek

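AI summary This paper introduces SDialog, an open-source Python toolkit that provides an end-to-end framework for dialog generation, evaluation, and interpretability analysis, for building and analyzing LLM-based conversational agents. SDialog supports multi-agent simulation, comprehensive evaluation methods, and mechanistic interpretability tools, integrates audio generation, and works with multiple mainstream LLM backends, helping researchers build, evaluate, and understand dialog systems more systematically.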

Comments Pre-print submitted to EACL System Demonstration (under review)

Details
Abstract

We present SDialog, an MIT-licensed open-source Python toolkit that unifies dialog generation, evaluation and mechanistic interpretability into a single end-to-end framework for building and analyzing LLM-based conversational agents. Built around a standardized Dialog representation, SDialog provides: (1) persona-driven multi-agent simulation with composable orchestration for controlled, synthetic dialog generation, (2) comprehensive evaluation combining linguistic metrics, LLM-as-a-judge and functional correctness validators, (3) mechanistic interpretability tools for activation inspection and steering via feature ablation and induction, and (4) audio generation with full acoustic simulation including 3D room modeling and microphone effects. The toolkit integrates with all major LLM backends, enabling mixed-backend experiments under a unified API. By coupling generation, evaluation, and interpretability in a dialog-centric architecture, SDialog enables researchers to build, benchmark and understand conversational systems more systematically.

2506.09110 2026-05-12 cs.LG

CodeBrain: Bridging Decoupled Tokenizer and Multi-Scale Architecture for EEG Foundation Model

Jingying Ma, Feng Wu, Qika Lin, Yucheng Xing, Chenyu Liu, Ziyu Jia, Mengling Feng

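AI summary This paper proposes CodeBrain, a two-stage EEG foundation model that addresses the shortcomings of existing models in clinical interpretability, discriminative power, and capturing global dependencies. The model first introduces the TFDual-Tokenizer, which decouples heterogeneous temporal and frequency EEG signals into discrete tokens, strengthening the discriminative power of the representation and offering interpretability in terms of neural events and spectral rhythms. It then adopts the multi-scale EEGSSM architecture, which combines structured global convolution with sliding window attention to efficiently capture long-range and local dependencies, reflecting the brain's small-world topology. CodeBrain generalizes well across multiple downstream tasks and datasets and is of practical value for applications.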

Comments Published as a conference paper at the International Conference on Learning Representations (ICLR 2026)

Details
Abstract

Electroencephalography (EEG) provides real-time insights into brain activity and supports diverse applications in neuroscience. While EEG foundation models (EFMs) have emerged to address the scalability issues of task-specific models, current approaches still yield clinically uninterpretable and weakly discriminative representations, inefficiently capturing global dependencies and neglecting important local neural events. We present CodeBrain, a two-stage EFM designed to fill this gap. In the first stage, we introduce the TFDual-Tokenizer, which decouples heterogeneous temporal and frequency EEG signals into discrete tokens, quadratically expanding the representation space to enhance discriminative power and offering domain-specific representation-level interpretability by suggesting potential links to neural events and spectral rhythms. In the second stage, we propose the multi-scale EEGSSM architecture, which combines structured global convolution with sliding window attention to efficiently capture both sparse long-range and local dependencies, reflecting the brain's small-world topology. Pretrained on the largest public EEG corpus, CodeBrain achieves strong generalization across eight downstream tasks and ten datasets under distribution shifts, supported by comprehensive ablations, scaling-law analyses, and interpretability evaluations. The code and the pretrained weights are available at https://github.com/jingyingma01/CodeBrain.

2506.08136 2026-05-12 cs.CL

EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments

Zefang Liu, Yinzhu Quan

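AI summary This paper introduces EconWebArena, a benchmark for evaluating autonomous agents on complex economic tasks in realistic web environments. The benchmark contains 360 curated tasks from 82 authoritative websites, spanning macroeconomics, labor, finance, trade, and public policy, and requires agents to interpret web content, interact with interfaces, and extract precise, time-sensitive data through multi-step workflows. Unlike prior work, EconWebArena emphasizes fidelity to authoritative data sources and web-grounded economic reasoning; its experiments reveal that current models still face substantial challenges in navigation, multimodal understanding, and task execution.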

Details
Abstract

We introduce EconWebArena, a benchmark for evaluating autonomous agents on complex, multimodal economic tasks in realistic web environments. The benchmark comprises 360 curated tasks from 82 authoritative websites spanning domains such as macroeconomics, labor, finance, trade, and public policy. Each task challenges agents to navigate live websites, interpret structured and visual content, interact with real interfaces, and extract precise, time-sensitive data through multi-step workflows. We construct the benchmark by prompting multiple large language models (LLMs) to generate candidate tasks, followed by rigorous human curation to ensure clarity, feasibility, and source reliability. Unlike prior work, EconWebArena emphasizes fidelity to authoritative data sources and the need for grounded web-based economic reasoning. We evaluate a diverse set of state-of-the-art multimodal LLMs as web agents, analyze failure cases, and conduct ablation studies to assess the impact of visual grounding, plan-based reasoning, and interaction design. Our results reveal substantial performance gaps and highlight persistent challenges in grounding, navigation, and multimodal understanding, positioning EconWebArena as a rigorous testbed for economic web intelligence.

2506.04289 2026-05-12 cs.LG q-bio.NC

Relational reasoning and inductive bias in transformers and large language models

Jesse Geerts, Andrew Liu, Stephanie Chan, Claudia Clopath, Kimberly Stachenfeld

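AI summary This study examines the mechanisms by which transformer-based models perform relational reasoning, in particular transitive inference. It compares in-weights learning (IWL) and in-context learning (ICL) on transitive inference and finds that IWL models achieve human-like transitive inference through a linear embedding, whereas ICL models show transitive inference only when the training data requires it. The study also shows that pre-training ICL models so that they acquire a linear representation brings their inference behavior closer to IWL, and it confirms in large language models that the training regime and representational structure are critical to transitive inference.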

Comments 15 pages, 10 figures

Details
Abstract

Transformer-based models have demonstrated remarkable reasoning abilities, but the mechanisms underlying relational reasoning remain poorly understood. We investigate how transformers perform transitive inference, a classic relational reasoning behavior from psychology which elicits inference about indirectly related items (e.g., if $A > B$ and $B > C$, then $A > C$). We compare in-weights learning (IWL) and in-context learning (ICL) behaviors and mechanisms on these tasks, and find profoundly different patterns of generalization. IWL models learn a linear embedding, which leads to transitive inference as well as other behavioral effects present in humans and animals. ICL models, in contrast, are capable of learning to generalize transitively, but only do so when it is necessitated by the training data, otherwise learning a match-and-copy strategy. Interestingly, pre-training ICL models on in-context linear regression tasks that provide them with a latent linear representation is sufficient to make the ICL behaviors and internal representations qualitatively and quantitatively more like IWL. In order to test whether the same inference patterns are present in large language models, we leverage a congruency paradigm which allows us to differentially probe IWL and ICL generalization patterns without access to their training data. We indeed see IWL reasoning leads to more transitive generalization than ICL. Moreover, we find that prompting the ICL models to use a linear mental map led to increased transitive inference over different geometric prompts. Together, these results reveal that both the training regime and the geometric structure of induced representations critically determine transformers' capacity for transitive inference.

2506.01404 2026-05-12 cs.LG cs.MA cs.SY eess.SY

Quantitative Error Feedback for Quantization Noise Reduction of Filtering over Graphs

Xue Xian Zheng, Weihang Liu, Xin Lou, Stefan Vlaski, Tareq Al-Naffouri

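AI summary This paper proposes an innovative error feedback framework for reducing quantization noise in distributed graph filtering, for settings where communication is restricted to quantized messages. The method draws on error spectrum shaping from state-space digital filters, connecting quantized filtering processes across different domains, and achieves exact compensation by quantitatively feeding back the quantization noise. Theoretical analysis shows that the framework significantly reduces the effect of quantization noise and provides closed-form solutions for the optimal error feedback coefficients; it can also be integrated seamlessly into communication-efficient decentralized optimization frameworks, and experiments confirm its advantages in accuracy and robustness.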

Comments Accepted by IEEE TSP

Details
Abstract

This paper introduces an innovative error feedback framework designed to mitigate quantization noise in distributed graph filtering, where communications are constrained to quantized messages. It draws on error spectrum shaping techniques from state-space digital filters, and therefore establishes connections between quantized filtering processes over different domains. In contrast to existing error compensation methods, our framework quantitatively feeds back the quantization noise for exact compensation. We examine the framework under three key scenarios: (i) deterministic graph filtering, (ii) graph filtering over random graphs, and (iii) graph filtering with random node-asynchronous updates. Rigorous theoretical analysis demonstrates that the proposed framework significantly reduces the effect of quantization noise, and we provide closed-form solutions for the optimal error feedback coefficients. Moreover, this quantitative error feedback mechanism can be seamlessly integrated into communication-efficient decentralized optimization frameworks, enabling lower error floors. Numerical experiments validate the theoretical results, consistently showing that our method outperforms conventional quantization strategies in terms of both accuracy and robustness.
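
For readers unfamiliar with error feedback, the sketch below shows the classical scalar form of the idea that the abstract builds on: the error made by the quantizer at one step is fed back into the next input before quantizing. It is not the paper's graph-filter recursion and does not use its optimal coefficients; the uniform quantizer, the single coefficient `beta`, and the function names are assumptions made for illustration.

```python
import math

def quantize(value, step=1.0):
    """Uniform quantizer: round to the nearest multiple of `step`."""
    return step * math.floor(value / step + 0.5)

def quantize_stream(xs, beta, step=1.0):
    """Quantize xs one by one, feeding back beta times the previous error.

    beta = 0 gives plain quantization; beta = 1 gives first-order error feedback.
    """
    error = 0.0
    out = []
    for x in xs:
        v = x + beta * error   # compensate with the fed-back error
        q = quantize(v, step)
        error = v - q          # quantization error made at this step
        out.append(q)
    return out
```

With `beta = 1` the outputs satisfy q_t = x_t + e_{t-1} - e_t, so the running sum of the outputs differs from the running sum of the inputs only by the latest error, which is at most half a step in magnitude. Plain quantization of a constant input such as 0.3 with step 1 instead returns 0 at every step.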

2505.24859 2026-05-12 cs.LG cs.CL

Beyond Multiple Choice: Evaluating Steering Vectors for Summarization

Joschka Braun, Carsten Eickhoff, Seyed Ali Bahrainian

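AI summary This study examines how well steering vectors control text properties (such as topic, sentiment, and readability) in summarization. Experiments on the SAMSum, NEWTS, and arXiv datasets show that steering vectors effectively control the targeted properties, but excessive steering strength leads to repetition and factual errors. The study also shows that prompting alone preserves summary quality but gives weaker control, while combining steering vectors with prompting achieves the best balance of control and quality at moderate steering strengths.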

Comments Published in Findings of EACL 2026. Extended version of the ICML 2025 Workshop on Reliable and Responsible Foundation Models paper (v1, v2). 36 pages, 21 figures, 15 tables

Details
Journal ref
Findings of the Association for Computational Linguistics: EACL 2026, pages 3849-3884
Abstract

Steering vectors are a lightweight method for controlling text properties by adding a learned bias to language model activations at inference time. While predominantly studied for multiple-choice and toy tasks, their effectiveness in free-form generation remains largely unexplored. Moving "Beyond Multiple Choice," we evaluate steering vectors for controlling topical focus, sentiment, toxicity, and readability in abstractive summaries across the SAMSum, NEWTS, and arXiv datasets. We find that steering effectively controls targeted properties, but high steering strengths consistently induce degenerate repetition and factual hallucinations. Prompting alone preserves summary quality but offers weaker control. Combining both methods yields the strongest control and the most favorable efficacy-quality trade-off at moderate steering strengths. Our work demonstrates that steering vectors face a critical control-quality trade-off in free-form generation, and that hybrid approaches offer the best balance in practice.
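
The core operation behind a steering vector, adding a fixed bias to an activation with a tunable strength, can be sketched in a few lines. The abstract only says the bias is "learned"; deriving it as a difference of mean activations between two sets of examples is one common choice and is an assumption here, as are the function names and the use of plain Python lists in place of real model activations.

```python
def mean_difference(pos, neg):
    """Steering vector as mean(pos activations) - mean(neg activations).

    This derivation is assumed for illustration; the abstract does not
    specify how the vector is learned.
    """
    dim = len(pos[0])

    def mean(rows):
        return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]

    return [p - n for p, n in zip(mean(pos), mean(neg))]

def steer(hidden, vector, strength):
    """Add strength * vector to one hidden-state vector at inference time."""
    return [h + strength * v for h, v in zip(hidden, vector)]
```

The `strength` argument corresponds to the steering strength discussed in the abstract: larger values push the activation further along the vector, which is where the reported trade-off between control and output quality would show up.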