arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.05085 2026-06-04 cs.CL cs.AI

Automatic Generation of Titles for Research Papers Using Language Models

Tohida Rehman, Debarshi Kumar Sanyal, Samiran Chattopadhyay

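AI summary: Proposes generating paper titles from abstracts with pre-trained language models and large language models; fine-tuned PEGASUS-large outperforms the other models on most metrics.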

Comments
24 pages, 24 tables, 01 figure
Abstract

The title of a research paper conveys its primary idea and, occasionally, its conclusions in a clear and concise manner. Choosing an appropriate title is often challenging, and automated title generation can assist authors in this task. In this work, we propose a technique to generate paper titles from abstracts using open-weight pre-trained and large language models. We use the CSPubSum and LREC-COLING-2024 datasets and introduce a new dataset, SpringerSSAT, curated from four Springer journals in the social sciences. Additionally, we use GPT-3.5-turbo in a zero-shot setting to generate titles. Model performance is evaluated with ROUGE, METEOR, MoverScore, BERTScore, and SciBERTScore metrics. Our experiments show that fine-tuned PEGASUS-large outperforms other models, including fine-tuned LLaMA-3-8B and zero-shot GPT-3.5-turbo, across most metrics. We further demonstrate that ChatGPT can generate creative paper titles. Overall, AI-generated titles are generally appropriate and reliable.
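To make the overlap-based evaluation concrete, the following is a minimal sketch of a ROUGE-1-style F1 score between a generated title and a reference title. It assumes lowercased whitespace tokenization with no stemming, so it is a simplification and not the official ROUGE implementation or the paper's evaluation code; the function name `rouge1_f1` and the example titles are made up for illustration.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate title and a reference title.

    Simplified: lowercased whitespace tokens, no stemming or stopword handling.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical titles, for illustration only.
    print(rouge1_f1("automatic title generation", "title generation for papers"))
```

Metrics such as BERTScore or MoverScore compare embeddings rather than surface tokens, so a sketch like this only reflects the lexical-overlap side of the evaluation.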

2606.05080 2026-06-04 cs.AI cs.LG

AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

Zhangchen Xu, Junda Chen, Yue Huang, Dongfu Jiang, Jiefeng Chen, Hang Hua, Zijian Wu, Zheyuan Liu, Zexue He, Lichi Li, Shizhe Diao, Jiaxin Pei, Jinsung Yoon, Hao Zhang, Mengdi Wang, Radha Poovendran, Misha Sra, Alex Pentland, Zichen Chen

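AI summary: Introduces the AutoLab benchmark, which evaluates frontier models on 36 expert-curated, long-horizon closed-loop optimization tasks, and finds that persistent iteration and use of empirical feedback matter more than the quality of the initial attempt.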

Comments
Code: https://github.com/autolabhq/autolab ; Website: https://autolab.moe/
Abstract

Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, failing to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra long-horizon closed-loop optimization. AutoLab consists of 36 realistic, expert-curated tasks spanning four diverse domains: system optimization, puzzle & challenge, model development, and CUDA kernel optimization. Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall-clock budget. Evaluating 17 state-of-the-art models reveals the dominant predictor of success is not the quality of an agent's initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude-opus-4.6 exhibits strong long-horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress. These results underscore the importance of time awareness and persistent iteration in autonomous agents. We open-source the full benchmark, evaluation harness, and task artifacts, to accelerate research toward truly capable long-horizon agents.

2606.05079 2026-06-04 cs.CL cs.LG

Fast & Faithful Function Vectors

Minh An Pham, Anton Segeler, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin, Patrick Kahardipraja, Reduan Achtibat

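AI summary: Studies two design choices for function vectors (FVs), attention head selection and steering; gradient-based attributions with Layer-wise Relevance Propagation (LRP) for head selection and distributed steering improve the efficiency and accuracy of FVs, enabling fast and faithful steering of large language models (LLMs).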

Abstract

Function vectors (FVs) are task representations elicited during in-context learning that can be used to steer Large Language Models (LLMs). However, design choices in their formulation remain underexplored. In this work, we study the impact of varying FV definitions for instructions along two degrees of freedom: attention head selection and steering. For head selection, using gradient-based attributions with Layer-wise Relevance Propagation (LRP) substantially improves efficiency as well as accuracy. For FV steering, applying it in a distributed manner yields a higher accuracy compared to simple aggregation. Our code is publicly available.

2606.05073 2026-06-04 cs.LG

Learning What Not to Impute: An Uncertainty-Aware Diffusion Framework for Meaningful Missingness

Lixing Zhang, Yidong Ouyang, Weifu Li, Shixiang Zhu, Guang Cheng, Liyan Xie

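AI summary: Proposes Diff-Joint, a diffusion framework that jointly models tabular data and a latent missingness mask, alternating between conditional sampling and uncertainty-aware aggregation to separate meaningfully missing entries from those that should be imputed, which enables selective imputation.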

Abstract

Missing value imputation is a fundamental task in machine learning, with most existing methods assuming that all missing entries correspond to unobserved regular values. In many real-world datasets, however, missingness may arise from two distinct sources: some entries are meaningfully missing (intrinsically absent and semantically valid), while others are missing due to the observation process and should be imputed. We formalize this distinction as a selective imputation problem, where the goal is to jointly infer which missing entries should be preserved and which should be recovered. To address this challenge, we propose Diff-Joint, a diffusion-based framework that jointly models tabular data together with a latent missingness mask. The method alternates between conditional sampling and uncertainty-aware aggregation to iteratively refine both imputed values and missingness labels. Empirical results on synthetic and real-world datasets demonstrate that Diff-Joint effectively identifies meaningfully missing entries while achieving competitive imputation accuracy and improved downstream task performance.

2606.05071 2026-06-04 cs.CV

InstantRetouch: Efficient and High-Fidelity Instruction-Guided Image Retouching with Bilateral Space

Jiarui Wu, Yujin Wang, Ruikang Li, Fan Zhang, Mingde Yao, Tianfan Xue

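AI summary: Proposes an image retouching method based on bilateral space manipulation: it predicts a low-resolution bilateral grid, slices it with a learned guidance map, and combines diffusion model distillation with a prompt alignment loss to achieve efficient, high-fidelity, instruction-following retouching.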

Comments
Computer Vision and Pattern Recognition (CVPR), 2026
Abstract

Language-guided photo retouching aims to adjust color and tone while preserving geometry and texture. Recently, diffusion-based retouching has shown superior visual quality, but it often struggles with fidelity, owing to its generative nature, and with efficiency, owing to its iterative sampling process. In this work, we propose an efficient and fidelity-preserving retouching method using bilateral space manipulation, which is both compact and content-decoupled. Specifically, instead of directly editing pixels or image latents, our model predicts a low-resolution bilateral grid of affine transforms, which are sliced using a learned guidance map and then applied to the full-resolution image. This approach yields both high fidelity and improved efficiency. To retain strong priors of a pretrained generative model, we distill a multi-step diffusion model into our bilateral grid framework using Variational Score Distillation, complemented by a prompt alignment loss to guide instruction-following behavior. Additionally, we introduce a new benchmark and evaluate our method across multiple dimensions: fidelity, instruction following, and efficiency. Compared to the latest retouching methods, such as Gemini-2.5-Flash (Nano-Banana), our method can avoid content drift, significantly improve latency, and generate visually pleasing edits, while maintaining a high level of fidelity. Project page: https://openimaginglab.github.io/InstantRetouch/.

2606.05070 2026-06-04 cs.LG

RIDE: An Open Dataset and Benchmark for Train Delay Prediction

Clément Elliker, Mathis Le Bail, Clément Mantoux, Jesse Read, Sonia Vanier

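AI summary: To address the lack of standardized datasets and evaluation protocols for train delay prediction, builds RIDE, an open dataset covering the nationwide Belgian railway network, and carries out the first comprehensive comparative evaluation of non-learning, statistical learning, and deep learning models.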

Comments
58 pages, 41 figures
Abstract

Train delay prediction is an important problem for both passengers and railway operators, yet progress in the field remains difficult to assess due to the lack of standardized datasets, prediction targets, and evaluation protocols. To address this gap, we introduce RIDE, an open dataset and benchmark for train delay prediction built at nationwide scale over the Belgian railway network. RIDE covers 94.5M train events, 3.6M journeys, and 35.7M weather records from 2023 to 2025. It is organized as a layered data pipeline from raw railway and weather sources to two public releases: a reusable intermediate relational dataset and model-ready benchmark datasets. The benchmark standardizes the prediction task and the training and testing data. It also provides a unified evaluation protocol that supports direct comparison across models. Using this framework, we provide the first comprehensive comparative evaluation of non-learning, statistical learning, and deep learning models. We show that learning-based methods clearly outperform non-learning models, with graph neural networks achieving the best mean performance, while the strongest learning-based models remain relatively close to one another. Beyond aggregate mean absolute error (MAE) and root mean squared error (RMSE), the framework also provides breakdowns by prediction horizon and delay change, enabling more detailed analysis of model behavior across forecasting regimes.
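The two aggregate metrics named in the abstract can be sketched as follows. This is a minimal illustration of the standard definitions of MAE and RMSE, not the RIDE evaluation protocol itself; the delay values (in minutes) are invented for the example.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error: average magnitude of the prediction errors."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error: penalizes large errors more heavily than MAE."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


if __name__ == "__main__":
    # Hypothetical observed and predicted delays in minutes.
    observed = [0, 2, 5, 10]
    predicted = [1, 2, 3, 6]
    print(mae(observed, predicted), rmse(observed, predicted))
```

Because RMSE squares each error before averaging, a single badly missed long delay moves it more than MAE, which is one reason a breakdown by horizon and delay change can be more informative than either aggregate alone.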

2606.05068 2026-06-04 cs.CV

MaCo-GAN: Manifold-Contrastive Adversarial Learning for Single Image Super-Resolution

Daeyoung Han, Seongmin Hwang, Moongu Jeon

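AI summary: Proposes MaCo-GAN, which replaces the conventional adversarial loss with manifold-contrastive adversarial learning and uses a dynamic fake sample synthesizer to generate fake images that preserve low-resolution correspondence, yielding consistent improvements in the perception-distortion trade-off.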

Abstract

Conventional Generative Adversarial Networks (GANs) for Single Image Super-Resolution (SISR) often struggle with hallucinated artifacts, largely because standard discriminators evaluate overall image naturalness rather than strict conditional realism. To address this, we propose MaCo-GAN, a novel manifold-contrastive GAN framework that replaces the conventional adversarial loss with a supervised contrastive objective. A core component of our method is a dynamic fake sample synthesizer that transforms ground truth (GT) data into a spectrum of challenging, perceptually plausible fake images that strictly maintain low-resolution (LR) correspondence. Utilizing these synthesized samples, we establish a robust contrastive minimax game: the generator is trained to attract its predictions toward on-manifold fakes (low distortion) and repel them from off-manifold fakes (high distortion), while the discriminator optimizes the exact opposite. By simply replacing the adversarial loss of a baseline SR model with our proposed objective, we demonstrate consistent improvements in the perception-distortion trade-off across various benchmarks. Extensive ablation studies validate the effectiveness of our framework and provide deep insights into the dynamics of this conditional contrastive game.

2606.05067 2026-06-04 cs.LG

FLAGG: Flexible Autoregressive Graph Generation

Samuel Cognolato, Alessandro Sperduti, Luciano Serafini

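AI summary: Proposes the FLAGG framework, which combines one-shot models with autoregressive sequential generation to flexibly handle graph generation across different sizes and topologies, outperforming purely one-shot and purely autoregressive baselines on several datasets.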

Comments
Accepted for publication at JMLR, currently in press
Abstract

The landscape of deep graph generation spans two extremes: one-shot and sequential models. The former generates nodes and edges jointly, while the latter samples them autoregressively. Each method performs better in different graph domains depending on size and topology, but neither is applicable to all graph categories. For instance, one-shot methods struggle with generating large graphs, while sequential methods underperform on smaller graphs. A possible way to overcome these limitations is to flexibly combine the two methods in a single system. In this work, we propose the FLAGG (Flexible Autoregressive Graph Generation) framework, which sequentially generates portions of graphs with one-shot models. FLAGG can apply any one-shot model to make it autoregressive, allowing flexibility in choosing the sequential policy. This policy is specified through a stochastic node removal process, which an Insertion Model learns to reverse. We evaluate FLAGG with the DiGress one-shot model on several data sets of different graph sizes and domains. We show that the approach outperforms both one-shot and autoregressive baselines in terms of sampling quality.

2606.05058 2026-06-04 cs.CV cs.AI

UniCAD: A Unified Benchmark and Universal Model for Multi-Modal Multi-Task CAD

Jingyuan Chen, Sheng Jin, Haopeng Sun, Wentao Liu, Chen Qian

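AI summary: To address the lack of a unified multi-modal benchmark for CAD, proposes the UniCAD benchmark and UniCAD-MLLM, a universal multi-modal large language model that handles point-cloud-to-CAD reconstruction, text/image-to-CAD generation, and CAD question answering end to end in a single framework, achieving state-of-the-art performance on multiple benchmarks.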

Abstract

Computer-Aided Design (CAD) underpins modern engineering and manufacturing by enabling the creation of precise, editable 3D models. However, CAD research typically studies tasks in isolation, and multi-modal, multi-task learning for CAD is hindered by the absence of a unified benchmark. To address this gap, we introduce UniCAD, a comprehensive benchmark for multi-modal CAD learning that covers point-to-CAD reconstruction, text/image-to-CAD generation, and CAD question answering across diverse input modalities. Alongside the benchmark, we present UniCAD-MLLM, a universal multi-modal large language model that ingests text, images, sketches, and point clouds and performs these heterogeneous tasks in an end-to-end fashion within a single framework. Extensive experiments on the UniCAD and Fusion360 benchmarks demonstrate that UniCAD-MLLM achieves state-of-the-art performance across all tasks, outperforming existing task-specific and multi-task baselines. We will release the dataset, code, and pretrained models to accelerate future research.

2606.05054 2026-06-04 cs.CL

Boosting Self-Consistency with Ranking

Maria Marina, Daniil Moskovskiy, Sergey Pletenev, Mikhail Salnikov, Alexander Panchenko, Viktor Moskvoretskii

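AI summary: Proposes RISC, which recasts answer selection in self-consistency as a ranking problem and uses a lightweight LambdaRank model with five features, achieving a better accuracy-efficiency trade-off than standard self-consistency on multiple datasets.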

Comments
16 pages, 13 figures, accepted at ACL Student Research Workshop 2026
Abstract

Self-consistency improves large language models by sampling multiple reasoning paths and selecting the most frequent answer, but majority voting often fails to recover correct answers that are already present among the samples. We address this limitation with Ranking-Improved Self-Consistency (RISC), which reformulates answer selection in self-consistency as a ranking problem. Instead of relying on a single uncertainty or confidence signal, RISC uses a lightweight LambdaRank model to score candidate answers with five carefully designed features that capture answer frequency, semantic centrality, and reasoning-trace consistency. We evaluate RISC on three datasets under a range of test-time budgets. Across datasets, RISC consistently achieves a better accuracy-efficiency trade-off than standard self-consistency and strong baselines, with particularly large gains on question answering benchmarks. Further analysis shows that the proposed features are individually useful and, more importantly, complementary, highlighting the value of learning to combine multiple informative signals for test-time answer selection.
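The baseline that RISC is compared against, standard self-consistency, can be sketched as plain majority voting over sampled final answers. This is only the baseline, not the RISC ranking model; the function name `majority_vote` and the sampled answers are illustrative assumptions.

```python
from collections import Counter
from typing import Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Standard self-consistency: return the most frequent sampled answer.

    Ties are broken in favor of the answer that was sampled first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical final answers extracted from five sampled reasoning paths.
    sampled = ["7", "7", "9", "7", "9"]
    print(majority_vote(sampled))
```

If the correct answer in this hypothetical run were "9", it is present among the samples yet loses the vote, which is the failure mode the abstract describes and the motivation for scoring candidates with additional features rather than frequency alone.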

2606.05046 2026-06-04 cs.LG stat.ML

Graph Cascades: Contagion-Based Mesoscopic Rewiring for Structure-Aware Graph Machine Learning

Meher Chaitanya, My Le, Luana Ruiz

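AI summary: Proposes Graph Cascades, a mesoscopic rewiring strategy based on contagion diffusion that builds an auxiliary graph to help graph neural networks and graph transformers capture intermediate-scale structure; it improves several backbones on node classification and comes with a theoretical characterization of when rewiring helps.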

Abstract

We introduce Graph Cascades, a mesoscopic rewiring strategy for Graph Neural Networks (GNNs) and Graph Transformers (GTs) that captures intermediate-scale graph structure beyond purely local edges or fully global attention. Using contagion-based diffusion processes, Graph Cascades constructs, in O(|V|+|E|) time, an auxiliary graph where node pairs supported by repeated multi-hop reinforcement are promoted to direct neighbors. We theoretically characterize when reinforcement-based rewiring helps: sufficient conditions under which reinforcement-based edge selection is more label-aligned than direct adjacency, an SBM witness in which two-hop reinforcement is perfectly homophilic, and a formalization of mesoscopic connectivity via graph effective resistance. Empirically, across node-classification benchmarks, Graph Cascades improves multiple GNN and sparse-GT backbones, with the most reliable gains observed on heterophilic and moderate- to high-degree homophilic graphs. The theoretical conditions also identify regimes where mesoscopic rewiring is unlikely to be beneficial -- low-degree regular graphs and graphs with structural bottlenecks -- and these predictions match the observed failures. We additionally observe tight correlations between performance and structural properties in the rewired graphs.

2606.05045 2026-06-04 math.DS cs.LG

Learning Control-Affine Reduced-Order Models via Autoencoders

Ali Mjalled, Martin Mönnigmann

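AI summary: Proposes a framework that uses autoencoders to learn a reduced latent space and control-affine state-space dynamics simultaneously, extends it to a sequence-based model to improve prediction accuracy, and motivates it by applying feedback linearization to the derived models.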

Abstract

We present in this paper a framework for the identification of control-affine reduced-order models (ROMs). The proposed method utilizes autoencoders (AEs) to transform the high-dimensional states, and potentially the high-dimensional inputs, into reduced latent ones suitable for control-affine state-space dynamics. This is achieved by simultaneous training of the AE and the state-space model. In addition, we extend the discrete ROM formulation to a sequence-based model, which processes state and input histories to improve prediction accuracy while preserving the control-affine structure. We motivate our framework by applying feedback linearization to the derived models, and we present guidelines for its efficient use. The proposed framework is assessed on two numerical examples and its performance is compared to a baseline model, where the AE identifies a latent space with linear state-space dynamics. The assessment involves evaluating the prediction accuracy of the ROM on test data and its effectiveness in controlling the system to a desired state or trajectory.

2606.05043 2026-06-04 cs.AI

Strabo: Declarative Specification and Implementation of Agentic Interaction Protocols

Samuel H. Christie, Amit K. Chopra, Munindar P. Singh

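AI summary: Presents Strabo, which models the checkout part of UCP as a declarative interaction protocol and implements agents with the Peach programming model, showing the advantages of declarative specifications; the agents interoperate with Google's UCP agents, offering a path for introducing EMAS ideas into practice incrementally.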

Comments
Presented in the Engineering Multiagent Systems Workshop co-located with the 2026 International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)
Abstract

The last few years have witnessed major advances in the modeling and implementation of multiagent systems based on declarative interaction protocols. Our contribution, Strabo, establishes the relevance of these advances to ongoing industry efforts in Agentic AI. Specifically, we consider UCP, the Universal Commerce Protocol, a recent Google-led effort to standardize e-commerce interactions for AI agents. Our exercise is in two parts. One, we model the part of UCP dealing with checkouts as a declarative Langshaw protocol and implement agents using Peach, a programming model for Langshaw. This part of the exercise brings out the advantages of formal, declarative specifications. Two, we show that Peach agents can interoperate with UCP agents implemented by Google, thereby establishing the fidelity of our approach with respect to UCP. Such interoperation enables the incremental introduction of declarative protocols and agents into a conventional setting, indicating a pathway by which EMAS ideas could influence practice without demanding a wholesale update.

2606.05042 2026-06-04 cs.LG cs.CL cs.SC

In-Context Graphical Inference

Zehua Cheng, Wei Dai, Jiahao Sun

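AI summary: Proposes ICG-I, an autoregressive Graph Transformer that mimics Variable Elimination and uses Tensor-Train compression and Weighted Conformal Prediction to provide scalable, calibrated marginal inference in discrete graphical models, reaching state-of-the-art performance on standard instances and frustrated spin glasses.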

Comments
19 Pages
Abstract

Marginal inference in discrete graphical models forces a choice between exactness and scalability: exact algorithms are intractable for high-treewidth graphs, while iterative approximations (Belief Propagation, variational methods) sacrifice convergence guarantees on frustrated topologies. We argue that this dichotomy stems from a mismatched inductive bias: iterative methods abandon the sequential elimination structure that makes exact inference correct. We introduce In-Context Graphical Inference (ICG-I), an autoregressive Graph Transformer that restores this structure by mimicking Variable Elimination with learned, Tensor-Train-compressed intermediate factors, paired with a Dirichlet output layer and Weighted Conformal Prediction for calibrated, distribution-free coverage guarantees under topological shift. We prove that TT compression errors propagate at most linearly through the autoregressive chain, that the Dirichlet-Multinomial loss is a proper scoring rule, and that WCP maintains coverage with a quantifiable degradation under estimated density ratios. We conducted intensive experiments to evaluate ICG-I and achieved state-of-the-art performance across all benchmarks. ICG-I reduces MAE from 0.041 (best baseline) to 0.020 on standard instances and achieves 0.048 on N=500 frustrated spin glasses where BP diverges entirely.
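For readers unfamiliar with the sequential elimination structure the abstract refers to, here is a minimal sketch of exact Variable Elimination on a chain-structured pairwise model, checked against brute-force enumeration. It is the classical algorithm on a toy problem, not ICG-I; the function names and the factor values are assumptions made for illustration.

```python
from itertools import product
from typing import List

Factor = List[List[float]]  # psi[a][b]: weight for (x_i = a, x_{i+1} = b)


def chain_marginal(pairwise: List[Factor], k: int) -> List[float]:
    """Marginal of the last variable of a chain, eliminating variables left to right."""
    message = [1.0] * k  # intermediate factor over the current frontier variable
    for psi in pairwise:
        # Sum out the current variable, leaving a factor over the next one.
        message = [sum(message[a] * psi[a][b] for a in range(k)) for b in range(k)]
    z = sum(message)
    return [m / z for m in message]


def brute_force_marginal(pairwise: List[Factor], k: int) -> List[float]:
    """Same marginal by enumerating all joint states (exponential in chain length)."""
    n = len(pairwise) + 1
    totals = [0.0] * k
    for states in product(range(k), repeat=n):
        weight = 1.0
        for i, psi in enumerate(pairwise):
            weight *= psi[states[i]][states[i + 1]]
        totals[states[-1]] += weight
    z = sum(totals)
    return [t / z for t in totals]


if __name__ == "__main__":
    # Hypothetical factors for a three-variable binary chain.
    factors = [[[2.0, 1.0], [1.0, 3.0]], [[1.0, 4.0], [2.0, 1.0]]]
    print(chain_marginal(factors, 2), brute_force_marginal(factors, 2))
```

On a chain the intermediate factor never grows beyond one variable; on high-treewidth graphs the intermediate factors become exponentially large, which is the cost the abstract's compressed intermediate factors are meant to address.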

2606.05037 2026-06-04 cs.SE cs.AI

Self-Reflective APIs: Structure Beats Verbosity for AI Agent Recovery

Arquimedes Canedo, Grama Chethan

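AI summary: Proposes self-reflective APIs, which return machine-readable structured suggestions on validation failure so that an AI agent can repair the request and retry without external reasoning; on Anthropic models this raises task-completion rate by 36.7 to 40.0 percentage points, with 1.8 to 2.2× better per-success token efficiency.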

Abstract

When an AI agent calls an API and hits a validation error, it needs more than what went wrong -- it needs what to do next. A self-reflective API returns, on validation failure, a machine-readable recovery_feedback.suggestions[] payload sufficient for the agent to repair the request and retry without external reasoning. On a leak-audited pilot (N=30 per cell, 3 LLMs, 10 adversarial tasks), structured suggestions lift task-completion rate by +36.7 to 40.0 pp over plain-English diagnoses on Anthropic models (Fisher's exact p ≤ 0.0022), at 1.8 to 2.2× better per-success token efficiency. The lift is not significant on gpt-4o-mini (p=0.435); a second-domain replication on a billing API confirms the pattern. The comparison only holds after auditing two undocumented classes of answer leakage in LLM benchmarks. We ship audit_prompt_leakage.py as reusable CI infrastructure. Code and data: https://github.com/arquicanedo/self-reflective-apis.
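The repair-and-retry loop described in the abstract might look roughly like the sketch below. Only the `recovery_feedback.suggestions[]` path comes from the abstract; the fields inside each suggestion (`field`, `action`, `value`), the validation rules, the status codes, and the function names are hypothetical and are not taken from the paper or its repository.

```python
from typing import Any, Dict, Tuple

ALLOWED_CURRENCIES = {"USD", "EUR"}  # hypothetical validation rule


def validate(request: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical endpoint: accept the request or return structured suggestions."""
    errors, suggestions = [], []
    currency = request.get("currency")
    if currency not in ALLOWED_CURRENCIES:
        errors.append("currency must be one of EUR, USD")
        if isinstance(currency, str) and currency.upper() in ALLOWED_CURRENCIES:
            suggestions.append({"field": "currency", "action": "replace", "value": currency.upper()})
    quantity = request.get("quantity")
    if not isinstance(quantity, int):
        errors.append("quantity must be an integer")
        if isinstance(quantity, str) and quantity.isdigit():
            suggestions.append({"field": "quantity", "action": "replace", "value": int(quantity)})
    if not errors:
        return {"status": 200}
    return {"status": 422, "errors": errors, "recovery_feedback": {"suggestions": suggestions}}


def call_with_recovery(request: Dict[str, Any], max_retries: int = 2) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Apply the suggested field values mechanically and retry, with no extra reasoning step."""
    response = validate(request)
    for _ in range(max_retries):
        if response["status"] == 200:
            break
        suggestions = response["recovery_feedback"]["suggestions"]
        if not suggestions:
            break  # nothing machine-actionable; give up
        request = dict(request)
        for suggestion in suggestions:
            request[suggestion["field"]] = suggestion["value"]
        response = validate(request)
    return request, response


if __name__ == "__main__":
    print(call_with_recovery({"currency": "usd", "quantity": "3"}))
```

The point of the sketch is that the caller never parses the human-readable `errors` strings; it only applies the structured suggestions, which is the contrast with plain-English diagnoses that the abstract measures.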

2606.05035 2026-06-04 cs.CV

Anchor3R: Streaming 3D Reconstruction with Transient Anchors for Long-Horizon Visual Mapping

Peilin Tao, Chong Cheng, Yuansen Du, Caiwei Song, Zhengqing Chen, Xiaoyang Guo, Wei Yin, Weiqiang Ren, Qian Zhang, Hainan Cui, Shuhan Shen

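AI summary: Proposes Anchor3R, a framework that treats feed-forward reconstruction as local measurement prediction in the current-frame coordinate system rather than global regression, combining window-relative pose prediction, loop-closure reinsertion, and motion averaging for online 3D reconstruction and pose estimation on long sequences.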

Abstract

Long-horizon online visual mapping is a core capability for robot perception, requiring continuous camera-motion and scene-geometry estimation from visual streams under bounded memory and computation. Recent feed-forward 3D reconstruction models provide strong geometric priors, but their streaming variants often predict poses in a fixed coordinate system tied to the first frame or a persistent scene memory. This fixed-gauge design leads to train-test mismatch, attention bias toward early anchors, and accumulated drift on sequences much longer than those seen during training. We propose Anchor3R, a streaming 3D reconstruction framework that treats feed-forward reconstruction as current-centric local measurement prediction rather than persistent global-gauge regression. At each time step, Anchor3R predicts window-relative poses and a local pointmap in the current-frame coordinate system, turning streaming reconstruction into relative-pose measurement generation. These measurements support online pose updates, while loop-closure reinsertion and motion averaging align the trajectory and transform local pointmaps into a coherent global reconstruction. Experiments on indoor, outdoor, driving, and RGB-D benchmarks show that Anchor3R improves long-horizon pose accuracy and dense reconstruction quality over existing streaming baselines, while supporting bounded-memory online inference.

2606.05031 2026-06-04 cs.CV

MetaPoint: Unlocking Precise Spatial Control in Agentic Visual Generation

Dewei Zhou, Xinyu Huang, Xun Wang, Ji Xie, Yabo Zhang, Liang Li, Kunchang Li, Zongxin Yang, Yi Yang

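AI summary: Proposes MetaPoint, which represents a continuous 2D coordinate as a single special token and relies on the model's inherent positional encoding to achieve pixel-level spatial control without architectural changes.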

Abstract

Generative visual models fundamentally struggle with precise spatial control. This arises from a core disconnect: models can process textual descriptions of space but cannot directly map numerical coordinates onto the 2D image canvas. We introduce MetaPoint, a method that bridges this gap by representing a continuous 2D coordinate as a single, special token. Crucially, MetaPoint requires no new architectural components; it directly leverages the model's inherent positional encoding schemes to interpret these coordinates, treating our token as a virtual point on the canvas. This lightweight approach enables pixel-level control of an object's position with one token or its bounding box with two, all without requiring architectural changes or bespoke attention masking. The MetaPoint tokens are designed to be compositional, serving as spatial primitives. This allows a planner agent to decompose a high-level user request into a structured sequence of primitives for the generator. By providing a simple, precise, and scalable building block for spatial control, MetaPoint unlocks more powerful compositional generative agents and enables intuitive, interactive editing systems.

2606.05030 2026-06-04 cs.CL cs.SC

Imbuing Large Language Models with Bidirectional Logic for Robust Chain Repair

赋予大语言模型双向逻辑以进行稳健的链修复

Zehua Cheng, Wei Dai, Jiahao Sun, Thomas Lukasiewicz

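AI summary: To counter error snowballing in autoregressive chain-of-thought reasoning, proposes the Teleological Reasoning Infilling (TRI) framework, which recasts erroneous reasoning segments as fill-in-the-middle tasks with a Prefix-Suffix-Middle sequence rearrangement and combines supervised fine-tuning with direct preference optimisation driven by a symbolic verifier, so that only the damaged segment is repaired.

Details
Journal ref
In Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases 2026
Comments
25 pages
AI Chinese abstract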

大型语言模型(LLMs)中的自回归链式推理(CoT)本质上是前向的:每一步仅依赖于先前的令牌。这种单向归纳偏差使得即使是能力强的模型也容易受到错误雪崩的影响,即早期步骤中的单个逻辑或算术错误会不可逆地破坏整个推理链。我们提出了Teleological Reasoning Infilling (TRI),一个训练框架,赋予仅解码器变换器原生的目标条件桥接能力。关键见解是将错误的推理段重构为填充中间(FIM)任务:给定一个验证过的前缀前提P、一个验证过的下游里程碑S和原始查询Q,模型必须综合出连接P到S的逻辑桥M,要求严格且完整。为了实现这一目标,我们引入了一种前缀-后缀-中间(PSM)序列重排,使用三个非重叠的哨兵令牌,使得M能够同时关注P和S,而无需对自注意力机制进行任何结构修改。训练分两个阶段进行:(i)在从形式数学语料库中提取的符号验证的(P, S, M)三元组上进行监督微调(SFT),以及(ii)以确定性符号验证器(Lean 4 / Python)作为唯一奖励神谕的直接偏好优化(DPO),消除了LLM评判的谄媚。在推理时,TRI作为双系统循环中的外科修复模块运行:因果草稿模型生成初始轨迹,验证器定位失败点,TRI仅填充受损段,保留已验证部分不变。在三个基准上的综合实验表明,TRI在所有任务上达到了最先进的性能,同时每个问题的令牌消耗减少了31.2%。

English abstract

Autoregressive chain-of-thought (CoT) reasoning in large language models (LLMs) is fundamentally forward-directed: each step conditions only on prior tokens. This unidirectional inductive bias renders even capable models susceptible to error snowballing, wherein a single logical or arithmetic mistake in an early step irreversibly corrupts the entire reasoning chain. We introduce Teleological Reasoning Infilling (TRI), a training framework that endows decoder-only transformers with a native goal-conditioned bridging capability. The key insight is to reframe erroneous reasoning segments as fill-in-the-middle (FIM) tasks: given a verified prefix premise $P$, a verified downstream milestone $S$, and the original query $Q$, the model must synthesise the logical bridge $M$ that connects $P$ to $S$ rigorously and completely. To achieve this with standard causal architectures, we introduce a Prefix-Suffix-Middle (PSM) sequence rearrangement with three non-overlapping sentinel tokens, enabling $M$ to attend to both $P$ and $S$ without any structural modification to the self-attention mechanism. Training proceeds in two stages: (i) Supervised Fine-Tuning (SFT) on symbolically verified $(P, S, M)$ triples extracted from formal mathematics corpora, and (ii) Direct Preference Optimisation (DPO) with a deterministic symbolic verifier (Lean 4 / Python) as the sole reward oracle, eliminating LLM-judge sycophancy. At inference, TRI operates as a surgical repair module within a dual-system loop: a causal draft model generates an initial trace, the verifier pinpoints failures, and TRI infills only the damaged segment, leaving verified sections intact. Comprehensive experiments on three benchmarks demonstrate that TRI achieves state-of-the-art performance across all tasks, while reducing per-problem token expenditure by 31.2%.
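The Prefix-Suffix-Middle rearrangement described in this abstract can be illustrated with a minimal sketch. The sentinel strings, the placement of the query before the prefix, and the helper names below are all assumptions made for illustration; the abstract only states that three non-overlapping sentinel tokens are used. The idea is that a causal model sees P and S before it starts generating M, so M can attend to both.

```python
from typing import List, Tuple

# Hypothetical sentinel tokens (assumption; the abstract does not name them).
PRE, SUF, MID = "<|prefix|>", "<|suffix|>", "<|middle|>"


def build_psm_prompt(query: str, prefix: str, suffix: str) -> str:
    """Order the context as Q, P, S so that the bridge M is generated last."""
    return f"{query}\n{PRE}{prefix}{SUF}{suffix}{MID}"


def build_psm_example(query: str, prefix: str, suffix: str, middle: str) -> Tuple[str, str]:
    """Return (prompt, target); the target is the bridge M that follows MID."""
    return build_psm_prompt(query, prefix, suffix), middle


def splice_repair(steps: List[str], start: int, end: int, bridge: List[str]) -> List[str]:
    """Replace the damaged steps[start:end] with the infilled bridge, keeping the rest."""
    return steps[:start] + bridge + steps[end:]


prompt, target = build_psm_example(
    query="Solve 2x + 3 = 11.",
    prefix="Subtract 3 from both sides: 2x = 8.",
    suffix="Therefore x = 4.",
    middle="Divide both sides by 2.",
)
```

At inference time, under the same assumptions, a verifier would supply the indices of the damaged segment and `splice_repair` would leave the verified steps untouched.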

2606.05029 2026-06-04 cs.LG cs.CL

Validity Threats for Foundation Model Research

基础模型研究的有效性威胁

Gunnar König, Martin Pawelczyk, Ulrike von Luxburg, Sebastian Bordt

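AI summary: Proposes an evaluation framework based on causal inference that maps the approximate experimental strategies used in foundation model research (proxy experiments, observational studies, single-run designs) onto trade-offs among four types of validity (statistical, internal, external, construct), exposing and analyzing the hidden validity threats that come with compute savings.

Details
AI Chinese abstract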

受控实验是机器学习研究的基石,但在现代基础模型的规模下,它们变得过于昂贵。相反,研究界越来越依赖于以较低成本近似理想实验的研究策略:代理实验和缩放定律、使用公开模型的观察性研究,以及利用单个训练运行内部变化的单次运行设计。在这项工作中,我们认为在计算预算内近似大规模实验没有免费午餐。具体来说,计算节省是以有效性威胁为代价的——隐藏且有时无法检验的假设,当这些假设被违反时,会使研究主张无效。为了帮助应对这些威胁,我们提出了一个评估框架,将基础模型研究视为因果推断问题。在这个框架内,我们通过从经验社会科学中改编的四种有效性——统计、内部、外部和构念有效性——来评估不同的研究策略。我们发现每种策略都有其特有的有效性特征:代理实验以外部和构念有效性换取统计和内部有效性;观察性研究面临混杂和效应异质性;单次运行设计则因处理单元之间的干扰而紧张。这一分析揭示了文献中未得到充分关注的若干有效性威胁。总体而言,我们的评估框架为研究人员提供了一个实用的工具包,用于审视基础模型研究设计中的有效性威胁。

English abstract

Controlled experiments are the backbone of machine learning research, but at the scale of modern foundation models, they have become prohibitively expensive. Instead, the community increasingly relies on research strategies that approximate the ideal experiment at a fraction of the cost: proxy experiments and scaling laws, observational studies with publicly available models, and single-run designs that leverage variation within individual training runs. In this work, we argue that there is no free lunch when approximating large-scale experiments on a compute budget. Specifically, savings in compute come at the cost of validity threats: hidden and sometimes untestable assumptions that, when violated, can invalidate research claims. To help navigate such threats, we propose an evaluation framework that casts foundation model research as a causal inference problem. Within this framework, we evaluate different research strategies through four types of validity adapted from the empirical social sciences: statistical, internal, external, and construct validity. We find that each strategy comes with a characteristic validity profile: proxy experiments trade external and construct validity for statistical and internal validity; observational studies face confounding and effect heterogeneity; and single-run designs are strained by interference between treated units. This analysis reveals several validity threats that have received insufficient attention in the literature. Overall, our evaluation framework provides researchers with a practical toolkit for scrutinizing validity threats in foundation model research designs.

2606.05025 2026-06-04 cs.LG cs.AI

Invariant Gradient Alignment for Robust Reasoning Distillation

不变梯度对齐用于鲁棒推理蒸馏

Zehua Cheng, Wei Dai, Jiahao Sun

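AI summary: Proposes the Invariant Gradient Alignment (IGA) framework, which uses Logical Isomer Sets, a Continuous Gradient Conflict Mask, and a truncated SVD projection to align gradient updates across examples from different semantic domains that share the same logical structure, improving the robustness of large language models on out-of-distribution inputs.

Details
Journal ref
In Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases 2026
Comments
30 pages
AI Chinese abstract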

大型语言模型(LLMs)存在捷径学习问题:它们在分布外(OOD)输入上系统性失败,这些输入的语义表面与训练数据不同,即使逻辑结构相同。这破坏了将思维链推理迁移到较小学生模型的知识蒸馏流程。我们引入不变梯度对齐(IGA),一种训练框架,通过三项创新对齐跨语义多样但逻辑同构示例的梯度更新:(i)逻辑同构集,即跨不同语义领域(数学、医学、法律、科学)共享相同逻辑结构的问题组;(ii)可微的连续梯度冲突掩码,抑制具有高跨域梯度方差的参数维度,同时保留不变方向;(iii)将掩码梯度通过截断SVD投影回LoRA低秩流形,保持参数效率。理论上,IGA比ERM产生更紧的OOD泛化界,随同构域数量缩放,并在温和正则条件下以标准SGD速率收敛。实验上,IGA在四个基准测试中优于八种基线,准确率提升高达14.3个百分点(相对于ERM-SFT),逻辑一致性得分为0.031对比0.142——表示不变性提升四倍。

English abstract

Large language models (LLMs) suffer from shortcut learning: they systematically fail on out-of-distribution (OOD) inputs whose semantic surface differs from training data, even when the logical structure is identical. This undermines knowledge distillation pipelines that transfer chain-of-thought reasoning to smaller students. We introduce Invariant Gradient Alignment (IGA), a training framework that aligns gradient updates across semantically diverse but logically isomorphic examples via three innovations: (i) Logical Isomer Sets, groups of problems sharing identical logical structure across distinct semantic domains (mathematics, medicine, law, science); (ii) a differentiable Continuous Gradient Conflict Mask that suppresses parameter dimensions with high cross-domain gradient variance while preserving invariant directions; and (iii) a truncated SVD projection of the masked gradient back onto the LoRA low-rank manifold, maintaining parameter efficiency throughout. Theoretically, IGA yields tighter OOD generalization bounds than ERM, scaling with the number of isomer domains, and converges at the standard SGD rate under mild regularity. Empirically, IGA outperforms eight baselines across four benchmarks with accuracy gains up to 14.3 pp over ERM-SFT and a Logical Consistency Score of 0.031 versus 0.142, a fourfold improvement in representational invariance.
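Steps (ii) and (iii) of this abstract can be sketched in a few lines of NumPy. This is not the authors' formulation: the exponential form of the mask, the temperature parameter, and the function names are assumptions chosen only to show the general shape of "down-weight high-variance gradient entries, then project back to low rank".

```python
import numpy as np


def conflict_mask(domain_grads: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Smooth mask in (0, 1]: near 1 where domains agree, near 0 where they conflict.

    domain_grads has shape (num_domains, rows, cols). The exponential form is an
    assumption; the abstract only says the mask is continuous and differentiable.
    """
    var = domain_grads.var(axis=0)       # cross-domain variance per entry
    scale = var.mean() + 1e-12           # normalise so temperature is unitless
    return np.exp(-var / (temperature * scale))


def project_to_rank(matrix: np.ndarray, rank: int) -> np.ndarray:
    """Best rank-`rank` approximation via truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


def iga_update(domain_grads: np.ndarray, rank: int, temperature: float = 1.0) -> np.ndarray:
    """Mask the mean gradient, then project it to the target rank."""
    mean_grad = domain_grads.mean(axis=0)
    masked = conflict_mask(domain_grads, temperature) * mean_grad
    return project_to_rank(masked, rank)
```

When every domain produces the same gradient, the variance is zero, the mask is all ones, and the update reduces to a plain low-rank projection of that gradient.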

2606.05021 2026-06-04 cs.LG

Enhancing the MADDPG Algorithm for Multi-Agent Learning via Action Inference and Importance Sampling

通过动作推理和重要性采样增强多智能体学习的MADDPG算法

Marc Walden, Jason Liu, Shaashwath Sivakumar, Ryan Liu, Hamza Khan

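AI summary: For multi-agent deep reinforcement learning, proposes an Action Inference mechanism and a geometric-distribution importance sampling strategy to improve MADDPG, yielding better learning stability, inter-agent cooperation, and exploration efficiency on a discrete-action Predator-Prey task.

Details
AI Chinese abstract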

我们研究了多智能体深度强化学习,并提出了对多智能体深度确定性策略梯度(MADDPG)算法的两项增强。首先,我们引入了一种新颖的动作推理机制,使每个智能体能够预测其他智能体的预期动作,从而提高其自身策略的准确性和稳定性。其次,我们在回放缓冲区中应用了基于几何分布的重要性采样策略,以优先考虑更近期和更具信息性的经验,这有助于缓解多智能体环境中固有的非平稳性。我们在PettingZoo库提供的离散动作捕食者-猎物任务上评估了这两项修改,PettingZoo是一个用于通用多智能体强化学习基准测试的灵活Python接口。我们的结果表明,动作推理在提高学习稳定性和智能体间协作方面是有效的,并且使用几何分布的重要性采样可以在探索效率上比标准MADDPG带来显著改进。代码可在https://github.com/shaashwathsivakumar/MARL_Proj获取。

English abstract

We investigate multi-agent deep reinforcement learning and propose two enhancements to the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm. First, we introduce a novel Action Inference mechanism that enables each agent to predict other agents' intended actions, thereby improving the accuracy and stability of its own policy. Second, we apply an importance sampling strategy, based on a geometric distribution, in the replay buffer to prioritize more recent and informative experiences, which helps mitigate the non-stationarity inherent in multi-agent environments. We evaluate both modifications on the discrete-action Predator-Prey task provided by the PettingZoo library, a flexible Python interface for general multi-agent reinforcement learning benchmarks. Our results indicate that Action Inference is effective in improving learning stability and inter-agent cooperation and that importance sampling using a geometric distribution can lead to significant improvements in exploration efficiency over standard MADDPG. Code is available at https://github.com/shaashwathsivakumar/MARL_Proj
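One way to read the second enhancement is as recency-weighted sampling from the replay buffer. The sketch below is a minimal illustration, not the authors' code: it assumes the buffer is ordered oldest to newest, uses a truncated geometric distribution over the age of each transition, and returns importance weights relative to uniform sampling. The function names and the default `p` are hypothetical.

```python
import numpy as np


def geometric_replay_probs(buffer_size: int, p: float) -> np.ndarray:
    """Truncated geometric distribution over buffer slots, favouring recent ones.

    Slot buffer_size - 1 is assumed to hold the newest transition (age 0).
    """
    ages = np.arange(buffer_size)[::-1]      # slot 0 is oldest, last slot has age 0
    weights = (1.0 - p) ** ages
    return weights / weights.sum()


def sample_batch_indices(buffer_size: int, batch_size: int, p: float = 0.01, rng=None):
    """Draw indices by recency and return weights that correct toward uniform sampling."""
    rng = np.random.default_rng() if rng is None else rng
    probs = geometric_replay_probs(buffer_size, p)
    idx = rng.choice(buffer_size, size=batch_size, p=probs)
    is_weights = 1.0 / (buffer_size * probs[idx])   # uniform prob / sampling prob
    return idx, is_weights
```

A larger `p` concentrates sampling on the newest transitions, and as `p` approaches 0 the distribution approaches uniform replay.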

2606.05018 2026-06-04 cs.CV

Handwriting Extraction and Analysis of Signature Lists in Swiss Popular Initiatives

瑞士民众倡议中签名列表的手写提取与分析

Marco Peer, Thomas Gorges, Mathias Seuret, Vincent Christlein, Andreas Fischer

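AI summary: To ease the labor-intensive manual validation of signature lists in Swiss popular initiatives, proposes an automated pipeline combining template-based line segmentation, OCR, and AI-based handwriting analysis (in particular writer retrieval). Experiments show that OCR performs poorly on short handwritten text (CER of 29.6% for first names), whereas writer retrieval reaches an mAP of 50.6% and can support the detection of duplicate submissions.

Details
Comments
Accepted for presentation at ICCST 2026
AI Chinese abstract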

民众倡议和公投是瑞士民主的核心，然而手写签名列表的验证仍然是一个劳动密集型的手工过程。本文研究了自动化文档分析方法的潜力，包括OCR和基于AI的手写分析，以支持这一任务。我们提出了一种结合基于模板的行分割与文本识别和书写者检索技术的流水线，并在包含418位书写者的443条手写条目的数据集上进行了评估。结果表明，OCR在处理词汇表外的手写文本时表现不佳，名字的字符错误率（CER）为29.6%。相比之下，书写者检索表现更为稳健，平均精度均值（mAP）达到50.6%。此外，我们的实验表明，现成的OCR系统对于手写签名数据的转录不够可靠，尤其是对于姓名或地址等短且词汇表外的条目。然而，书写者检索方法可以有效地识别签名列表中视觉相似的条目，使其成为基于手写相似性支持检测潜在重复提交的合适工具。

English abstract

Popular initiatives and referendums are central to Swiss democracy, yet the validation of handwritten signature lists remains a labor-intensive manual process. This paper investigates the potential of automated document analysis methods, including OCR and AI-based handwriting analysis, to support this task. We propose a pipeline combining template-based line segmentation with text recognition and writer retrieval techniques, evaluated on a dataset of 443 handwritten entries from 418 writers. Results show that OCR struggles with out-of-vocabulary handwriting, with a CER of 29.6% for first names. In contrast, writer retrieval performs more robustly, reaching an mAP of 50.6%. Furthermore, our experiments indicate that off-the-shelf OCR systems are not sufficiently reliable for transcription of handwritten signature data, particularly for short, out-of-vocabulary entries such as names or addresses. However, writer retrieval methods can effectively identify visually similar entries across signature lists, making them a suitable tool for supporting the detection of potential duplicate submissions based on handwriting similarity.
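The CER figure quoted in this abstract is a character error rate. As commonly defined, it is the Levenshtein edit distance between the recognized text and the reference, divided by the number of reference characters. The sketch below uses that standard definition; the example names are made up and the paper's exact evaluation protocol (normalization, casing) is not known from the abstract.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level character error rate over paired reference/hypothesis strings."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return errors / chars
```

Because first names are only a few characters long, a single misread character already produces a large CER, which helps explain why short out-of-vocabulary entries are hard for OCR.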

2606.05016 2026-06-04 cs.CL

TaDA: Calibrated Probe Gating for Task-Domain LoRA Merging

TaDA: 任务-领域LoRA合并的校准探针门控

Huy Quoc To, Fuyi Li, Guangyan Huang, Ming Liu

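AI summary: To address the depth-dependent asymmetry that arises when merging task and domain LoRA adapters, proposes TaDA, a training-free algorithm that uses calibrated probe-guided per-layer gating and per-component subspace-aware merging, achieving the best average accuracy on six scientific QA and six image classification benchmarks.

Details
AI Chinese abstract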

将任务LoRA适配器与领域LoRA适配器组合成一个统一模型是一个实际但很大程度上未被探索的挑战。现有方法将两个适配器视为对称对等体，对所有层应用统一权重。我们认为，任务和领域适配器在Transformer架构中表现出一致的深度依赖不对称性。领域主导性随层深度增加而增强，而较浅层保留更强的任务相关信号。受此观察启发，我们提出TaDA（Task-Domain LoRA Merging），一种无训练算法，通过校准探针引导的逐层门控和逐分量子空间感知合并来利用这种结构。门控使用被证明对适配器权重幅度不变的探针信号，为每层和投影类型分配独立权重。合并则在组合剩余分量之前丢弃冲突的奇异方向。TaDA产生一个标准秩$r$的LoRA适配器，推理开销为零。在Llama-2-7B的六个科学QA基准上，TaDA平均准确率达到0.452，比DARE-TIES高出3.6个百分点，并在所有六个基准上取得最佳结果。在ViT-L/16的六个图像分类基准上，TaDA平均准确率达到85.9%，在六个基准中的三个上领先，同时优于最强的合并基线。

English abstract

Combining a task LoRA adapter with a domain LoRA adapter into a single unified model is a practical yet largely unexplored challenge. Existing methods treat both adapters as symmetric peers, applying uniform weights across all layers. We argue that task and domain adapters exhibit a consistent depth-dependent asymmetry across transformer architectures. Domain dominance increases with layer depth, while shallower layers retain stronger task-relevant signals. Motivated by this observation, we propose TaDA (Task-Domain LoRA Merging), a training-free algorithm that exploits this structure through calibrated probe-guided per-layer gating and per-component subspace-aware merging. The gating assigns individual weights per layer and projection type using a probe signal proved invariant to adapter weight magnitude. The merging discards conflicting singular directions before combining the remaining components. TaDA produces a standard rank-$r$ LoRA adapter with zero inference overhead. On six scientific QA benchmarks with Llama-2-7B, TaDA achieves an average accuracy of 0.452, outperforming DARE-TIES by +3.6 percentage points and obtaining the best result on all six benchmarks. On six image classification benchmarks with ViT-L/16, TaDA reaches 85.9% average accuracy, improving over the strongest merging baseline while leading in three of the six individual benchmarks.

2606.05015 2026-06-04 cs.RO

Generalization of World Models under Environmental Variability for Vision-based Quadrotor Navigation

环境变异性下基于视觉的四旋翼导航的世界模型泛化

Luca Zanatta, Grzegorz Malczyk, Kostas Alexis

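AI summary: Uses vision-based quadrotor navigation to study the robustness of world models under varying levels of environmental randomness, finds that generalization during self-supervised pretraining is a strong predictor of sim-to-real transfer, and identifies discrete latent size and training-sequence length as the key factors.

Details
AI Chinese abstract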

世界模型,即学习预测环境演化的生成模型,已成为样本高效机器人学习的有前景工具。然而,它们对环境变异性的鲁棒性仍知之甚少。为解决这一问题,我们以基于视觉的四旋翼导航为测试平台进行系统研究,在不同环境随机性水平下训练基于DreamerV3的世界模型,并通过跨环境验证(涵盖自监督学习预训练和强化学习微调)在所有水平上评估它们。然后,我们将所有世界模型及相关导航策略部署到真实四旋翼上,在未见环境中进行测试,包括一次开环运行,其中模型仅接收2.5秒的真实感官输入,之后所有传感器被切断,系统完全依靠想象导航穿越12米距离。结果表明,自监督预训练阶段的世界模型鲁棒性是模拟到现实迁移的强预测因子:在跨环境自监督验证中泛化良好的每个模型都成功部署到真实世界,通过窄至0.67米的间隙,而在模拟策略评估中占主导地位的模型却在真实平台上失败。我们进一步识别出(a)离散潜在大小和(b)训练序列长度是控制世界模型质量的主要因素。

English abstract

World models, learned generative models that predict how an environment evolves, have become a promising tool for sample-efficient robot learning. Yet how robust they are to environmental variability remains poorly understood. To address this, we conduct a systematic study using vision-based quadrotor navigation as a testbed problem, training DreamerV3-based world models under varying levels of environmental randomness and evaluating them across all levels through cross-environment validation, spanning both Self-Supervised Learning (SSL) pretraining and Reinforcement Learning (RL) fine-tuning. We then deploy all world models and associated navigation policies on a real quadrotor in unseen environments, including an open-loop run where the model receives just 2.5s of real sensory input before all sensors are cut off, leaving the system to navigate entirely in imagination over a 12m traverse. Our results show that world model robustness during SSL pretraining is a strong predictor of sim-to-real transfer: every model that generalized well in cross-environment SSL validation deployed successfully in the real world, passing through gaps as narrow as 0.67m, whereas the model that dominated simulation policy evaluation failed on the real platform. We further identify (a) the discrete latent size and (b) the training-sequence length as the dominant factors governing world model quality.

2606.05014 2026-06-04 cs.CL

Depth-Attention: Cross-Layer Value Mixing for Language Models

深度注意力:语言模型的跨层值混合

Boyi Zeng, Yiqin Hao, Zitong Wang, Shixiang Song, He Li, Feichen Song, Yifan Liu, Ziwei He, Xinbing Wang, Zhouhan Lin

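AI summary: Proposes Depth-Attention, which performs cross-layer value mixing inside the attention module and improves language model performance without adding parameters or additional persistent inference state.

Details
Comments
21 pages, 4 figures, 9 tables
AI Chinese abstract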

自注意力机制可以在序列中自由选择信息,但在深度方向上,Transformer仅将每一层的输出加到残差流中,因此后续层无法选择性重用早期层的表示。最近的跨层方法改善了这种流动,但在注意力之外的隐藏状态上操作,在推理时增加了键值缓存之外的状态——随着现代LLM使用分组查询和多头潜在注意力压缩缓存,这一成本日益显著。我们引入深度注意力,它在注意力模块内部执行这种选择:在一层对序列进行注意力之前,其查询在同一token位置上对早期层的键进行注意力,并将它们的值混合到自注意力随后读取的值中。由于深度注意力重用标准的注意力查询、键和值缓存槽,将深度混合后的值替换原始值,因此它不增加参数,也不引入超出标准键值缓存的持久推理状态——缓存大小与普通解码器相同,且小于基于隐藏状态的跨层方法。在1.5B和3B参数的Qwen3风格解码器上,深度注意力取得了最低的困惑度和最高的平均下游准确率,相比普通Transformer提升高达2.3个准确率点,在困惑度和平均准确率上超越了强跨层基线,同时仅增加不到0.01%的额外算术FLOPs,且无额外持久推理状态。这些增益在360M到3B参数范围内保持一致,并扩展到循环Transformer。

English abstract

Self-attention selects information freely across the sequence, but across depth, Transformers merely add each layer's output to the residual stream, so later layers cannot selectively reuse earlier-layer representations. Recent cross-layer methods improve this flow but operate on hidden states outside attention, adding state beyond the key-value cache at inference, a cost that becomes increasingly salient as modern LLMs compress the cache with grouped-query and multi-head latent attention. We introduce Depth-Attention, which performs this selection inside the attention module itself: before a layer attends over the sequence, its query attends over the keys of earlier layers at the same token position and mixes their values into the value that self-attention then reads. Because Depth-Attention reuses the standard attention queries, keys, and value-cache slots, storing depth-mixed values in place of the original values, it adds no parameters and introduces no persistent inference state beyond the standard key-value cache: its cache is the same size as a vanilla decoder's and smaller than that of hidden-state-based cross-layer methods. On Qwen3-style decoders at 1.5B and 3B parameters, Depth-Attention attains the lowest perplexity and the highest average downstream accuracy, improving over the vanilla Transformer by up to 2.3 accuracy points and surpassing strong cross-layer baselines in perplexity and average accuracy, while adding under 0.01% extra arithmetic FLOPs and no additional persistent inference state. The gains hold from 360M to 3B parameters and extend to looped Transformers.
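The depth-wise mixing step described in this abstract can be sketched for a single head at a single token position. This is an illustration under stated assumptions, not the paper's implementation: it assumes scaled dot-product scores with a softmax over layers, and it assumes the stacked keys and values passed in are whichever layers the method attends over (the abstract does not say whether the current layer's own key and value are included).

```python
import numpy as np


def depth_mix_value(query: np.ndarray, keys_by_layer: np.ndarray,
                    values_by_layer: np.ndarray) -> np.ndarray:
    """Mix per-layer values at one token position using the current layer's query.

    query:            shape (d,), the current layer's query at this position
    keys_by_layer:    shape (num_layers, d), keys at the same position
    values_by_layer:  shape (num_layers, d_v), values at the same position
    Returns the depth-mixed value of shape (d_v,) that sequence attention would read.
    """
    d = query.shape[-1]
    scores = keys_by_layer @ query / np.sqrt(d)   # one score per layer
    scores = scores - scores.max()                # numerical stability for softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ values_by_layer
```

Under this reading, the mixed value would be written into the usual value-cache slot for the layer, which is consistent with the abstract's claim that the cache size stays the same as in a vanilla decoder.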

2606.05011 2026-06-04 cs.CV cs.RO

CIPER: A Unified Framework for Cross-view Image-retrieval and Pose-estimation

CIPER: 跨视图图像检索与姿态估计的统一框架

Yurim Jeon, Dongseong Seo, Seung-Woo Seo

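AI summary: Proposes CIPER, a framework that uses a shared transformer encoder with task-specific tokens to jointly perform city-scale cross-view retrieval and precise 3-DoF pose estimation through mutually beneficial feature learning.

Details
Comments
16 pages, 5 figures
AI Chinese abstract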

跨视图地理定位通过将地面图像与航拍图像数据库匹配来估计其地理位置。现有方法要么通过大规模检索,要么通过精确姿态估计来处理,但无法兼顾:基于检索的方法能够进行广域搜索,但牺牲了定位精度;而姿态估计方法仅在狭窄的搜索空间内实现高精度。简单级联这些流程会导致误差传播和特征表示不一致。我们将跨视图地理定位形式化为一个统一问题,要求同时进行城市级检索和精确的3自由度姿态估计。我们提出CIPER(跨视图图像检索与姿态估计变换器),这是一种单一架构,通过互惠特征学习联合执行两项任务。CIPER使用共享的Transformer编码器和任务特定令牌,将全局检索特征与空间定位线索分离。为了弥合地面和航拍视图之间的大领域差距,我们引入了一个双向Transformer姿态解码器,该解码器使用地面特征作为空间查询进行双向交叉注意力。一种集合预测策略进一步在统一的多任务目标下实现稳定的3自由度回归。在VIGOR、KITTI和Ford Multi-AV上的实验表明,特别是在有限的视野和任意方向条件下,性能具有竞争力。代码可在https://github.com/yurimjeon1892/CIPER获取。

English abstract

Cross-view geo-localization estimates the geographic location of a ground image by matching it against an aerial image database. Existing methods tackle this through either large-scale retrieval or precise pose estimation, but not both: retrieval-based methods enable wide-area search at the cost of localization accuracy, while pose estimation methods achieve high precision within only a narrow search space. Naively cascading these pipelines introduces error propagation and inconsistent feature representations. We formulate cross-view geo-localization as a unified problem requiring simultaneous city-scale retrieval and precise 3-DoF pose estimation. We propose CIPER (Cross-view Image-retrieval and Pose-estimation transformER), a single architecture that jointly performs both tasks through mutually beneficial feature learning. CIPER uses a shared transformer encoder with task-specific tokens to disentangle global retrieval features from spatial localization cues. To bridge the large domain gap between ground and aerial views, we introduce a two-way transformer pose decoder that uses ground features as spatial queries for bidirectional cross-attention. A set prediction strategy further enables stable 3-DoF regression under a unified multi-task objective. Experiments on VIGOR, KITTI, and Ford Multi-AV demonstrate competitive performance, especially under limited field-of-view and arbitrary orientation conditions. Code is available at https://github.com/yurimjeon1892/CIPER.

2606.05009 2026-06-04 cs.CL cs.AI

DAR: Deontic Reasoning with Agentic Harnesses

DAR: 基于智能体框架的道义推理

Guangyao Dou, William Jurayj, Nils Holzenberger, Benjamin Van Durme

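AI summary: Proposes DAR, a setup that improves LLM-based deontic reasoning by letting the model interact with statutes on demand. Experiments show that agentic harnesses can improve performance, but the gains are not uniform and weaker models often degrade on numerical tasks.

Details
AI Chinese abstract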

道义推理是通过将明确的规则和政策应用于具体案例事实来回答问题,例如根据法规计算纳税义务或确定移民上诉结果。基于LLM的道义推理的一个关键技术挑战是相关规则集可能很长且相互引用,因此模型可能仍无法找到特定推理步骤所需的规则。我们引入了道义智能体推理(DAR),这是一种智能体推理设置,其中模型按需与法规交互。我们在DeonticBench的困难子集上使用多种框架评估DAR。在这些设置中,我们发现智能体框架可以推动道义推理任务的前沿,但改进并不均匀:较弱的模型在数值任务上往往性能下降,同时消耗更多的令牌。

English abstract

Deontic reasoning is the task of answering questions by applying explicit rules and policies to case-specific facts, for example computing tax liability under a statute or determining the outcome of an immigration appeal. A key technical challenge for LLM-based deontic reasoning is that the relevant ruleset can be long and cross-referenced, so models may still fail to locate the rules needed for a particular reasoning step. We introduce Deontic Agentic Reasoning (DAR), an agentic reasoning setup in which the model interacts with the statutes on demand. We evaluate DAR under multiple harnesses on hard subsets of DeonticBench. Across these settings, we find that agentic harnesses can push the frontier on deontic reasoning tasks, but improvements are not uniform: weaker models often degrade on numerical tasks while consuming far more tokens.

2606.05008 2026-06-04 cs.CV cs.AI cs.CL

M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

M$^3$Eval: 通过认知基础视频任务的多模态记忆评估

Jie Huang, Ruixun Liu, Sirui Sun, Xinyi Yang, Yin Li, Yixin Zhu, Yiwu Zhong

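AI summary: Proposes M$^3$Eval, the first evaluation framework for memory in multi-modal models, which uses video tasks grounded in cognitive psychology to systematically assess memory retention, faithfulness, and robustness, and finds marked weaknesses in parallel video stream processing, interference patterns, spatio-temporal memory grounding, and symbolic memory.

Details
Comments
We present an evaluation designed for multi-modal memory in multi-modal models
AI Chinese abstract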

随着多模态模型向长视频理解发展,记忆成为关键能力。尽管在视频数据集和基准测试方面做出了大量努力,现有工作主要关注感知和推理,而没有系统评估记忆:模型保留了什么、信息如何忠实保存、以及记忆在干扰下的鲁棒性。为填补这一空白,我们引入了M$^3$Eval,这是第一个用于探测多模态模型中不同记忆维度的综合评估框架和基准。基于认知心理学,我们的设计通过精心构建的任务来隔离记忆的关键方面。利用M$^3$Eval,我们在代表性多模态模型上进行了大量实验,揭示了一致的弱点和独特行为。我们发现,模型在处理并行视频流时难以保持解耦表示,表现出与人类记忆显著不同的干扰模式,在空间域比时间域更可靠地定位记忆源,并且符号记忆有限。总的来说,我们的基准为未来研究提供了宝贵资源,而我们的发现强调了记忆作为基本但未充分探索的能力,并为设计更有效的多模态模型记忆机制提供了见解。我们的代码和数据集可在https://pku-value-lab.github.io/m3eval-homepage获取。

English abstract

As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M$^3$Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M$^3$Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.

2606.05004 2026-06-04 cs.CR cs.AI

SharedRequest: Privacy-Preserving Model-Agnostic Inference for Large Language Models

SharedRequest: 面向大型语言模型的隐私保护模型无关推理

Peihua Mai, Xuanrong Gao, Youlong Ding, Xianglong Du, Wei Liu, Yan Pang

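AI summary: Proposes SharedRequest, a model-agnostic privacy-preserving inference framework that achieves efficient privacy protection through batch-level obfuscation and semantic grouping, with over 20% higher utility than differential privacy baselines and query cost reduced by up to 5×.

Details
Comments
accepted by ACL 2026 (main)
AI Chinese abstract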

随着ChatGPT等公共大型语言模型（LLMs）的广泛部署，保护用户提示隐私已成为一个日益关键的问题。现有的隐私保护推理方法要么牺牲效用，要么牺牲效率，并且通常需要特定于模型的修改，限制了其兼容性。在本文中，我们提出了SharedRequest，一个模型无关的隐私保护LLM推理框架，它将隐私保护重新定义为批量级别而非单个提示级别。关键思想是通过将原始提示与噪声变体混合来混淆敏感信息，同时将语义等效的指令分组，以在大量查询批次中分摊推理成本，对LLM响应质量影响最小。该设计独立于LLM架构，无需访问模型参数或进行架构修改。实验结果表明，与先前的差分隐私基线相比，SharedRequest实现了超过20%的效用提升，并且其共享提示机制相比非批量推理最多可将查询成本降低5倍。

English abstract

With the widespread deployment of public large language models (LLMs) such as ChatGPT, protecting user prompt privacy has become an increasingly critical issue. Existing privacy-preserving inference methods sacrifice either utility or efficiency, and often require model-specific modifications that limit their compatibility. In this paper, we propose SharedRequest, a model-agnostic framework for privacy-preserving LLM inference that reformulates privacy protection at the batch level rather than the individual-prompt level. The key idea is to obscure sensitive information by mixing original prompts with noisy variants, while grouping semantically equivalent instructions to amortize the inference cost over a large batch of queries with minimal impact on LLM response quality. This design is independent of the LLM architecture, requiring no access to model parameters or architectural modification. Empirical results demonstrate that SharedRequest achieves over 20% higher utility compared to prior differential privacy baselines, and its shared-prompt mechanism reduces query cost by up to 5× compared to non-batched inference.

2606.05002 2026-06-04 cs.CL

GARL: Game-Theoretic Reinforcement Learning for Multi-Agent Strategic Prioritisation

GARL:面向多智能体战略优先级排序的博弈论强化学习

Yuxiao Ye, Yiwen Zhang, Huiyuan Xie, Yuqin Huang, Zhiyuan Liu

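AI summary: Proposes GARL, a framework that formalises multi-agent strategic prioritisation as a two-stage game and converts game-theoretic utilities into role-specific reinforcement signals to optimise interaction policies, improving performance on issues-in-dispute ranking and making small open-source LLMs competitive with a strong closed-source LLM.

Details
AI Chinese abstract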

基于LLM的多智能体系统越来越多地用于战略决策任务。在此类设置中,性能不仅取决于单个模型的能力,还取决于智能体交互和适应的策略。多智能体强化学习可以优化这些交互策略,但其奖励设计通常特定于任务且与交互结构的关联较弱。为弥补这一差距,我们提出GARL,一种面向多智能体战略优先级排序的博弈论强化学习框架。GARL将战略优先级排序形式化为两阶段博弈:竞争智能体首先在共享候选集上分配战略资源,然后更高级别的仲裁者产生最终排名。由此产生的博弈论效用被转化为角色特定的强化信号,使策略优化能够由结构化交互引导。我们在争议问题排序任务上实例化GARL,其目标是在法律程序中优先处理核心问题。实验表明,GARL提高了排序性能,使小型开源LLM在相同候选排名设置下与强大的闭源LLM竞争,并在法律领域能力和更广泛的战略决策方面取得收益。总体而言,GARL展示了如何将博弈论交互结构转化为强化学习目标,为多智能体战略优先级排序中的策略优化提供了原则性方法。

English abstract

LLM-based multi-agent systems are increasingly used for strategic decision-making tasks. In such settings, performance depends not only on individual model capabilities, but also on the policies by which agents interact and adapt. Multi-agent reinforcement learning can optimise these interaction policies, but its reward design often remains task-specific and weakly grounded in interaction structure. To address this gap, we propose GARL, a GAme-theoretic Reinforcement Learning framework for multi-agent strategic prioritisation. GARL formalises strategic prioritisation as a two-stage game: competing agents first allocate strategic resources over a shared candidate set, and a higher-level arbiter then produces the final ranking. The resulting game-theoretic utilities are converted into role-specific reinforcement signals, allowing policy optimisation to be guided by structured interaction. We instantiate GARL on issues-in-dispute ranking, where the goal is to prioritise core issues in legal proceedings. Experiments show that GARL improves ranking performance, enables small open-source LLMs to become competitive with a strong closed-source LLM under the same candidate-ranking setting, and yields gains in legal-domain competence and broader strategic decision-making. Overall, GARL demonstrates how game-theoretic interaction structure can be turned into reinforcement-learning objectives, providing a principled approach to policy optimisation in multi-agent strategic prioritisation.