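arXivDaily: a daily academic digest synchronized with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.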

2606.02027 2026-06-02 cs.RO cs.LG cs.MA

World-Task Factorization for Robot Learning

Eduardo Sebastián, Adrian Pfisterer, Vito Mengers, Oliver Brock, Amanda Prorok

AI summary: Proposes factoring policies into world factors and task factors, pairing the differentiable graph model AICON with a compact learned policy to achieve zero-shot generalization to new configurations and transfer to real hardware.

Abstract

Robot learning must produce policies that generalize to new combinations of constraints, teammates, and environments. To achieve this, we must structurally factor the policy, which is a choice that dictates what generalizes, what requires retraining, and what remains entangled. Existing methods span a wide spectrum, from expecting structure to emerge from data scaling, to hand-designing it via hierarchies, skill libraries or learned specializations. In this paper, we study what we argue is the most fundamental factorization in robotics: separating the world from the task. We investigate the conditions under which this factorization is principled. World factors are properties of the embodied system and the environment; they exist independently of intent. Task factors are defined by the task's logic over what the world admits. We formalize this asymmetry through Bayesian model evidence: it aligns with the data-generating process, maintains high likelihood through an analytical world model, and reduces the Occam's razor penalty on task parameters. We instantiate this factorization by pairing AICON, a differentiable graph of recursive estimators and interconnections that is compositional, operates without task-specific data, and propagates cost gradients to actuators, with a compact, learned policy that modulates gradient paths. Gradients serve as the interface between the two factors: they carry world structure through the graph and task structure through costs, enabling low-dimensional learning while preserving structural generalization. We test the world/task factorization across three problems that encompass heterogeneous robots, environments, task logic and sensorimotor modalities. Our framework outperforms end-to-end baselines and analytical heuristics in all settings, generalizes zero-shot to out-of-distribution configurations, and transfers to real hardware without retraining.

2606.02022 2026-06-02 cs.CV cs.AI cs.LG

Ranking vs. Assignment: The Metric Mismatch in Multi-View Object Association

Matvei Shelukhan, Timur Mamedov, Aleksandr Chukhrov, Karina Kvanchiani

AI summary: This paper reveals a fundamental mismatch between the ranking metrics commonly used in multi-view object association (such as AP and FPR-95) and the assignment objective, and uses Sinkhorn-based normalization as a controlled post-processing stress test to demonstrate it.

Abstract

Multi-view object association is an important computer vision problem that underlies many multi-camera perception tasks. While this task is naturally formulated as a constrained one-to-one matching problem, recent works heavily rely on pairwise ranking metrics like AP and FPR-95 for model evaluation. We highlight a fundamental mismatch between these metrics and the actual assignment objective. Theoretically, we show that AP and FPR-95 can be imperfect even when the assignment is already correct, and that Sinkhorn-based normalization can make them perfect. Conversely, optimal pairwise ranking can still lead to incorrect assignments. We validate this mismatch in practice by using our Sinkhorn-based normalization as a controlled post-processing stress test. We show that optimizing just a few post-processing parameters significantly boosts AP and FPR-95 without corresponding improvements in assignment-level metrics such as ACC and IPAA.
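As a toy illustration of the mismatch described in this abstract, the sketch below applies plain Sinkhorn normalization (alternating row and column normalization) to a hand-picked 2x2 score matrix. The matrix values, the function names, and the iteration count are assumptions chosen for illustration; this is not the authors' post-processing code. On the raw scores, row-wise argmax already recovers the correct one-to-one assignment, yet pairwise AP is below 1. After normalization the assignment is unchanged while the pairwise ranking becomes perfect.

```python
def sinkhorn(scores, n_iters=50):
    """Alternately normalize rows and columns of a positive square matrix."""
    m = [row[:] for row in scores]
    n = len(m)
    for _ in range(n_iters):
        # Row step: make every row sum to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Column step: make every column sum to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


def average_precision(scores, labels):
    """AP over a flat list of pair scores with binary labels (needs >= 1 positive)."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits


def pairwise_ap(matrix):
    """Treat diagonal pairs as true matches and every other pair as a non-match."""
    n = len(matrix)
    flat = [matrix[i][j] for i in range(n) for j in range(n)]
    labels = [i == j for i in range(n) for j in range(n)]
    return average_precision(flat, labels)


# Hypothetical similarity scores: view-A object i truly matches view-B object i.
raw = [[0.9, 0.8], [0.1, 0.2]]
normalized = sinkhorn(raw)

ap_raw = pairwise_ap(raw)          # the 0.8 non-match outranks the 0.2 match
ap_norm = pairwise_ap(normalized)  # both matches now outrank both non-matches
print(ap_raw, ap_norm)
```

The example is deliberately tiny: it only shows that a ranking metric can change without any change in the assignment, which is the kind of effect the stress test is designed to expose.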

2606.02021 2026-06-02 cs.CV

PerBite: A Curated Diagnostic Workflow for Bite-Aware Food Volume Estimation

Ahmad AlMughrabi, Farid Al-Areqi, David Fernández Gómez, Umair Haroon, Marc Bolaños, Ricardo Marques, Petia Radeva

AI summary: Proposes the PerBite workflow, which estimates food volume from before- and after-consumption states through segmentation, 3D reconstruction, scale calibration, and mesh post-processing, ranking first in the MetaFood challenge.

Abstract

Can a visually plausible food mesh be trusted to estimate the volume of consumed food? PerBite investigates this question using selected paired before- and after-consumption states from the MetaFood CVPR 2026 Continuous 3D Reconstruction While Eating Challenge. The submitted workflow follows a curated reconstruction protocol: SAM 3 segments the food and plate regions; Hunyuan3D/SAM 3D generates a dimensionless food mesh; the plate diameter provides the metric scale; the plate geometry is removed in Blender; and the remaining mesh is hole-filled, made watertight, and integrated to estimate volume. MoGe-2 is used only as an auxiliary cue for initial dish-diameter estimation when direct plate measurement is uncertain; it is not the primary scale source for the reported challenge result. PerBite ranks first, with an average Chamfer distance of 8.31 across 34 meshes using rigid ICP without scale correction. On 17 before- and after-pairs, it achieves 33.87% state-level volume MAPE and zero monotonicity violations, while consumed-volume MAPE remains 53.74%. The results show that surface reconstruction, metric scale, controlled mesh cleanup, watertight volume integration, and physical depletion consistency should be evaluated separately for dietary assessment. Source code and evaluation scripts will be available at https://github.com/GCVCG/PerBite-CVPR-MetaFood-2026.
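The last steps of the workflow (metric scaling from the plate diameter, watertight volume integration, and MAPE) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the submitted pipeline: the function names, the tetrahedron test mesh, and the diameter values are hypothetical, and the volume is computed with the standard signed-tetrahedron sum for a closed, consistently oriented triangle mesh.

```python
def mesh_volume(vertices, faces):
    """Volume enclosed by a closed, consistently oriented triangle mesh."""
    total = 0.0
    for i, j, k in faces:
        ax, ay, az = vertices[i]
        bx, by, bz = vertices[j]
        cx, cy, cz = vertices[k]
        # Signed volume of the tetrahedron (origin, a, b, c): a . (b x c) / 6
        total += (
            ax * (by * cz - bz * cy)
            + ay * (bz * cx - bx * cz)
            + az * (bx * cy - by * cx)
        ) / 6.0
    return abs(total)


def metric_volume(vertices, faces, mesh_plate_diameter, real_plate_diameter_cm):
    """Rescale a dimensionless mesh volume to cm^3 using the plate as a ruler."""
    scale = real_plate_diameter_cm / mesh_plate_diameter
    # Lengths scale linearly, so volumes scale with the cube of the factor.
    return mesh_volume(vertices, faces) * scale ** 3


def mape(predicted, actual):
    """Mean absolute percentage error, in percent."""
    errors = [abs(p - a) / abs(a) for p, a in zip(predicted, actual)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical watertight mesh: a unit right tetrahedron (true volume 1/6).
verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
tris = [(0, 2, 1), (0, 1, 3), (0, 3, 2), (1, 2, 3)]

unitless = mesh_volume(verts, tris)
in_cm3 = metric_volume(verts, tris, mesh_plate_diameter=2.0, real_plate_diameter_cm=20.0)
print(unitless, in_cm3)
```

Because the scale factor enters cubed, a small error in the plate diameter is amplified in the volume, which is one reason to evaluate metric scale separately from surface reconstruction. A consumed volume would then be the difference between the before and after state volumes.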

2606.02020 2026-06-02 cs.CL cs.LG

Unveiling the Entropy Dynamics of Chain-of-Thought Reasoning

Ting Xu, Xu He, Yupu Lu, Jiankai Sun, Dong Li, Wai Lam, Jianye Hao

AI summary: Using entropy dynamics, this paper reveals a two-phase structure in chain-of-thought reasoning (an Uncertainty Region and a Confidence Region) and proposes a training-free framework based on CUSUM change-point detection for early exit and test-time scaling, improving inference efficiency and reliability.

Comments: 21 pages, 10 figures, accepted at ICML 2026

Abstract

This paper investigates the entropy dynamics of Chain-of-Thought (CoT) and uncovers a consistent two-phase structure: an Uncertainty Region of exploration transitioning sharply to a Confidence Region of convergence. We demonstrate that the Confidence Region possesses two critical properties: 1) High Reliability -- answers in the confidence region become highly accurate and stable, and 2) High Redundancy -- models generate unnecessary tokens long after reaching the correct answer. These properties unlock more efficient and reliable inference strategies: 1) Early Exit leverages reliability and redundancy to terminate computation safely when returns diminish, and 2) Test-Time Scaling uses the Confidence Region signal to prioritize converged trajectories. To operationalize these insights, we formulate Confidence Region detection as a sequential change-point detection problem, being the first to apply classical change-point methods to monitor CoT reasoning. Using the Cumulative Sum (CUSUM) algorithm, a statistically optimal change-point detector, we develop a training-free framework for real-time inference control. Experiments show our approach establishes a superior Pareto frontier for early exit. CUSUM achieves 63.06% accuracy with 11.1% token reduction, outperforming DEER and Dynasor by 3.28% and 4.36% in accuracy respectively. For test-time scaling, CUSUM-weighted voting consistently outperforms self-consistency.
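To make the change-point idea concrete, here is a minimal one-sided CUSUM detector for a downward shift in a per-step entropy signal. It is a textbook form of CUSUM under assumed settings, not the paper's implementation: the `baseline`, `slack`, and `threshold` values and the synthetic entropy sequence are all hypothetical.

```python
def cusum_drop(values, baseline, slack, threshold):
    """Return the first index at which a one-sided CUSUM statistic for a
    downward mean shift exceeds `threshold`, or None if it never does."""
    stat = 0.0
    for t, x in enumerate(values):
        # Accumulate evidence that x sits below the baseline by more than the
        # slack; the statistic resets to zero whenever the evidence turns negative.
        stat = max(0.0, stat + (baseline - x - slack))
        if stat > threshold:
            return t
    return None


# Synthetic entropies: high during exploration, low once the answer has converged.
entropies = [2.0, 2.1, 1.9, 2.0, 2.2, 0.3, 0.2, 0.3, 0.2, 0.2]
alarm = cusum_drop(entropies, baseline=2.0, slack=0.5, threshold=2.0)
print(alarm)
```

In an early-exit setting, generation could be stopped at or shortly after the alarm index. The threshold controls the trade-off between stopping early and raising false alarms on noisy dips.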

2606.02016 2026-06-02 cs.LG

Evaluating Real-World Generalizability of Algorithm Selection Models

Gjorgjina Cenikj, Jakub Kudela, Eva Tuba, Tome Eftimov

AI summary: Systematically evaluates, through cross-benchmark testing, how well algorithm selection models generalize across synthetic and real-world optimization problems, analyzing their transfer performance and highlighting the challenges of domain-specific applications.

DOI: 10.1145/3795101.3805348
Comments: 10 pages, 12 figures

Abstract

Algorithm Selection (AS) aims to automatically identify the most suitable optimization algorithm for a given problem instance by leveraging measurable problem characteristics and historical performance data. In this study, we investigate the generalization ability of AS models across both synthetic and real-world optimization landscapes. We consider two widely used academic benchmark suites (BBOB and CEC) and two real-world problem sets (robotics trajectory optimization tasks and unmanned aerial vehicle path-planning problems). Through a systematic cross-benchmark evaluation, we analyze how AS models transfer between domains, identify where generalization succeeds or breaks down, and highlight the challenges that arise when applying AS in realistic, domain-specific contexts. Our findings provide insights into the robustness of current AS approaches and inform the development of more reliable, broadly applicable AS systems for real-world optimization.

2606.02011 2026-06-02 cs.AI cs.LG

Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery

Ekaterina Alimaskina, Darya Rudas, Denis Shveykin, Gleb Molodtsov, Pavel Vasiliev, Aleksandr Beznosikov

AI summary: Addresses the problem that 2-bit quantized inference in large reasoning models fails to deliver end-to-end speedup because unstable generation inflates the total token count, proposing two lightweight controls, FP16 planning and loop rescue, that substantially recover accuracy while preserving real-world speed.

Abstract

Large Reasoning Models (LRMs) rely on long reasoning traces, making inference expensive. While low-bit quantization reduces per-token decoding cost, we show that aggressive 2-bit inference can fail to deliver end-to-end speedup because instability in the generation process inflates total token count. Instead of merely lowering answer accuracy, 2-bit quantization often produces much longer traces with repetitive loops, budget exhaustion, delayed commitment, and unclosed reasoning segments. We analyze full reasoning traces of Qwen3 reasoning models across mathematical and commonsense benchmarks and show that accuracy degradation is tightly linked to these process-level failures. To address them, we introduce two lightweight controls: FP16 planning, which gives the 2-bit model a short high-precision outline, and loop rescue, which detects repetitive traces and either commits to an earlier answer or falls back to FP16. On MATH-500, loop rescue improves Qwen3-8B accuracy from 17.2% to 74.2%, while planning plus loop rescue improves Qwen3-32B from 65.0% to 87.2%. Overall, our results show that extreme low-bit reasoning becomes practical when its failures are treated as controllable generation pathologies: with lightweight detection and selective FP16 support, 2-bit inference can recover accuracy while preserving real end-to-end speed. Our code is available at: https://github.com/brain-lab-research/quantized-reasoning.
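The abstract does not specify how loop rescue detects repetitive traces, so the following is only one plausible check, written as a hedged sketch: it flags a trace whose tail consists of the same token span repeated back to back. The function name and the `span` and `repeats` parameters are hypothetical, not taken from the paper's code.

```python
def has_repetitive_loop(tokens, span, repeats):
    """True if the final `span` tokens occur `repeats` times in a row at the end."""
    needed = span * repeats
    if span <= 0 or repeats < 2 or len(tokens) < needed:
        return False
    tail = tokens[-needed:]
    unit = tail[:span]
    # Every consecutive chunk of length `span` in the tail must equal the first one.
    return all(tail[i:i + span] == unit for i in range(0, needed, span))


looping = "so x = 2 wait so x = 2 wait so x = 2 wait".split()
healthy = "the answer is 4".split()
print(has_repetitive_loop(looping, span=5, repeats=3))
print(has_repetitive_loop(healthy, span=5, repeats=3))
```

Once a loop is flagged, the controller described in the abstract would either commit to an earlier answer or fall back to FP16; that decision logic is not sketched here.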

2606.02010 2026-06-02 cs.CL cs.AI

PlanarBench: Evaluating LLM Spatial Reasoning via Planar Graph Drawing

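Oleksandr Nikitin

AI summary: Proposes the PlanarBench benchmark, which evaluates LLM spatial reasoning by having models draw planar graphs as ASCII art from edge lists, and finds that edge count is the main predictor of difficulty.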

2606.02009 2026-06-02 cs.CL

Automated Essay Scoring and Language Certification: Assessing Generalizability, Agreement and Validity for French

Rodrigo Wilkens, Rémi Cardon, Vincent Folny, Thomas François

AI summary: This paper proposes an enhanced argument-based validation framework and uses it for a multidimensional evaluation of 8 model architectures on French essay scoring, covering fairness analysis, correlations with linguistic features, prediction error evaluation, and agreement with human raters, advancing the state of the art in French automated essay scoring.

Abstract

In Automated Essay Scoring (AES), benchmarking practices have fostered minimalist evaluation practices, in contrast with the broader-view recommendations of evaluation frameworks, such as the argument-based validation framework (ABV), which argued in favor of a multidimensional assessment of systems, especially in the context of high-stakes language tests. In this paper, we introduce an enhanced and more practical version of the ABV framework, incorporating fairness analysis, correlations with linguistic features, prediction error evaluation, and model agreement compared with human raters. Applying this framework to French AES, we compare 8 model architectures on a corpus of 27k exam essays (2 raters each) and a generalization corpus of 961 essays (at least nine raters each). Our analyses illustrate the benefits of applying the ABV framework to better understand the capabilities and pitfalls of AES models, while also advancing the state-of-the-art for French AES.

2606.02008 2026-06-02 stat.ML cs.LG

Provable Data Scaling Law for Meta Learning via Complexity Minimization

Kazuto Fukuchi, Ryuichiro Hataya, Kota Matsui

AI summary: Proposes a complexity minimization framework that minimizes the worst-case downstream model complexity across source domains, proving theoretically that larger pre-training data in meta-learning improves few-shot adaptation performance.

Abstract

Pre-training has become a fundamental paradigm in modern machine learning, with one of its key empirical benefits being reduced downstream sample complexity as the scale of pre-training data increases. However, existing theoretical frameworks for pre-training do not fully explain this phenomenon. In this paper, we introduce complexity minimization, a novel meta-representation learning framework designed to enable theoretical analysis of this scaling behavior, which learns representations by evaluating the downstream model complexity best suited to each domain and minimizing the worst-case such complexity across source domains. Our end-to-end theoretical analysis, spanning pre-training through downstream regression, shows that this framework provably captures this scaling behavior; in particular, we show that the error rate of few-shot adaptation improves as the amount of meta-training data grows. Empirically, we demonstrate that incorporating complexity regularization into existing meta-learning methods consistently improves downstream sample efficiency.

2606.02002 2026-06-02 cs.CV

Distortion-Aware Fusion of Statistical and Vision-Language Features for Blind Image Quality Assessment

Bishr Omer Abdelrahman Adam, Xu Li

AI summary: Proposes a distortion-aware fusion framework that dynamically weights NSS statistical features and VLM embeddings through a multiplicative gating mechanism, achieving state-of-the-art or competitive performance on three benchmarks and revealing how the NSS contribution varies across distortion types.

Abstract

Blind image quality assessment (BIQA) aims to predict perceived image quality without access to a reference image. Classical natural scene statistics (NSS) descriptors and modern vision-language model (VLM) embeddings address this problem from fundamentally different perspectives, yet whether combining them yields complementary benefits and how to weight their contributions per input image remains unexplored. We propose a distortion-aware fusion framework that integrates a 138-dimensional NSS descriptor with two complementary VLM embeddings, SigLIP and CLIP-H, through a multiplicative gating mechanism that learns per-input stream weights conditioned on image content. Unlike static concatenation fusion, the proposed gating network suppresses or amplifies each stream's contribution based on the input, producing weights that correlate positively (Spearman rank correlation rho=0.33) with the per-distortion NSS contribution measured by independent ablation on KADID-10k. The framework requires no end-to-end fine-tuning of the VLM backbones and is trained with a hybrid loss combining mean squared error, Pearson linear correlation, and pairwise ranking objectives. We evaluate on three standard benchmarks: KonIQ-10k (SROCC=0.9142, PLCC=0.9279), KADID-10k (SROCC=0.9715, PLCC=0.9733, surpassing recent state-of-the-art methods), and LIVE Challenge in-the-Wild (SROCC=0.8527, PLCC=0.8802 with cross-dataset pretraining and fine-tuning). A per-distortion analysis on KADID-10k reveals that NSS features contribute most on noise and color-shift distortions where pixel statistics are directly affected, and least on perceptual distortions such as color saturation changes. The learned gate values validate these findings, confirming that the model autonomously discovers distortion-stream affinity patterns consistent with the manual per-distortion study.
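A multiplicative gate of the kind described can be sketched as follows: each feature stream is scaled by a scalar weight computed from the joint input before the streams are concatenated. This is an assumed minimal form, not the paper's architecture: the sigmoid gate, the single linear layer, the hand-set weights, and the tiny feature vectors are all hypothetical (in practice the gate parameters would be learned, and a sigmoid gate can only attenuate a stream, not amplify it).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(streams, gate_weights, gate_biases):
    """Scale each stream by a scalar gate computed from all streams, then concatenate."""
    joint = [v for stream in streams for v in stream]
    fused, gates = [], []
    for stream, weights, bias in zip(streams, gate_weights, gate_biases):
        # One gate per stream, conditioned on the concatenated input features.
        gate = sigmoid(sum(w * x for w, x in zip(weights, joint)) + bias)
        gates.append(gate)
        fused.extend(gate * v for v in stream)
    return fused, gates


# Two toy streams standing in for an NSS descriptor and a VLM embedding.
streams = [[1.0, 2.0], [0.5]]
gate_weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
gate_biases = [0.0, -20.0]  # the second stream is almost fully suppressed

fused, gates = gated_fusion(streams, gate_weights, gate_biases)
print(gates, fused)
```

Unlike static concatenation, the gate values here depend on the input, so they can be inspected per image and compared against an independent per-distortion ablation, as the abstract describes.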

2606.02001 2026-06-02 cs.CL

Scaling Agentic Capabilities via Grounded Interaction Synthesis

Wenhang Shi, Jinhao Dong, Yiren Chen, Zhe Zhao, Shuqing Bian, Wei Lu, Xiaoyong Du

AI summary: Proposes the GAIS framework, which automatically generates diverse environments and complex tasks through a two-phase grounding mechanism (protocol-anchored environments and structure-guided planning), significantly improving agent performance on BFCL, τ²-Bench, and ACEBench.

Abstract

General agentic intelligence hinges on the ability to interact with diverse real-world tools to complete complex tasks, a capability fundamentally tied to the quality of interaction data. To bypass the prohibitive costs of human annotation, prevailing paradigms depend entirely on Large Language Models (LLMs) to scale the synthesis of agentic environments and tasks. However, such unconstrained generation often degenerates into biased random sampling of LLMs' internal priors, failing to capture the diversity and difficulty of real-world domains or construct high-fidelity, long-horizon tasks. In this work, we introduce Grounded Agentic Interaction Synthesis (GAIS), a framework that automates the scalable construction of diverse environments and complex tasks via a two-phase grounding mechanism. Specifically, we construct protocol-anchored environments derived from real-world Model Context Protocol (MCP) servers to ensure functional diversity and difficulty. Subsequently, we employ structure-guided planning to navigate these environments, actively enforcing logical dependencies and adversarial policies to generate complex tasks. Experiments on BFCL, $τ^2$-Bench, and ACEBench demonstrate that GAIS-synthesized data significantly outperforms state-of-the-art baselines, enabling base models to match or even surpass their official instruction-tuned counterparts. Furthermore, GAIS exhibits superior data efficiency and scalability, achieving exceptional capabilities with significantly less data while maintaining continuous growth where baselines stagnate. Our code and dataset are publicly available at https://github.com/Eric8932/GAIS.

2606.02000 2026-06-02 cs.CV cs.AI eess.IV

Towards 3D-Aware Video Diffusion Models: Render-Free Human Motion Control with Mesh Tokenization

Jingyun Liang, Min Wei, Shikai Li, Yizeng Han, Hangjie Yuan, Lei Sun, Weihua Chen, Fan Wang

AI summary: Proposes a render-free framework that conditions video generation directly on compressed 3D human mesh tokens, enabling precise human motion control, reducing artifacts from 2D guidance, and improving 3D structure modeling.

Comments: Project page: https://jingyunliang.github.io/MeshToken/

Abstract

Diffusion models have shown remarkable success in video generation. However, whether such models are truly aware of the 3D structure underlying visual observations, rather than simply reproducing plausible 2D projections, remains an open question. In this work, we investigate this question through human motion control, a task that requires precise modelling of 3D human geometry, motion, camera viewpoint, and scene context. Unlike prior methods that rely on rendered 2D motion guidance videos, we propose a render-free framework that conditions video generation directly on compressed 3D human mesh tokens. This representation preserves full 3D geometric information while enabling a unified token-based generation pipeline that processes video tokens jointly with motion tokens in a DiT-based architecture. This design requires the model to reason jointly about appearance, 3D structure, and camera viewpoint during video generation. Experimental results demonstrate strong performance on human motion control benchmarks, while reducing artifacts induced by view-dependent 2D guidance and trajectory-pose mismatches during editing. These findings suggest that video diffusion models, when equipped with mesh tokenization, can better capture complex 3D human structures and their interactions with the surrounding environment.

2606.01999 2026-06-02 cs.LG cs.AI

Why Do Time Series Models Need Long Context Windows?

Luca Butera, Giovanni De Felice, Andrea Cini, Cesare Alippi

AI summary: Starting from two objectives, generative process identification and conditional forecasting, this paper shows that long context windows improve forecasting by reducing uncertainty about the generating process, and proves that even for processes with memory length P, the input window must be strictly larger than P to reach the minimum error.

Abstract

Modern deep learning models for forecasting groups of time series rely on increasingly longer observation windows. However, the benefit of increasing the window size is often simply attributed to capturing long-range dependencies, and broader discussion on how global forecasting models leverage input observations has been limited. In this paper, we show that forecasting groups of time series involves two objectives: (i) generative process identification (GPI), i.e., inferring the specific process generating the input sequence, and (ii) conditional forecasting (CF), i.e., predicting future values given input observations. From this perspective, optimal predictions can be interpreted as an average over plausible data-generating processes, weighted by their likelihood given the input window. This suggests another explanation for the benefits of long context windows: they reduce the uncertainty about which specific process is generating the input time series during operation. We prove that even for processes with memory length $P$, an input window size strictly larger than $P$ is necessary to achieve the minimum attainable error. Finally, we show how decoupling GPI and CF can improve computational scalability without compromising accuracy. Experiments on synthetic and real-world data validate our insights and their relevance for designing forecasting architectures.
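The "average over plausible processes" view can be illustrated with a toy case. The sketch below assumes two candidate AR(1) processes with a shared Gaussian noise scale and a uniform prior; the coefficients, noise scale, and noise-free series are hypothetical choices, not the paper's setup. An AR(1) process has memory length 1, yet a window of length 1 contains no transition and so leaves the posterior at the prior, while longer windows sharpen it.

```python
import math


def posterior_weights(window, phis, sigma):
    """Posterior over candidate AR(1) coefficients under a uniform prior."""
    log_liks = []
    for phi in phis:
        ll = 0.0
        for prev, cur in zip(window, window[1:]):
            resid = cur - phi * prev
            # Gaussian log-likelihood; the constant term is identical for all
            # candidates (shared sigma) and cancels after normalization.
            ll += -0.5 * (resid / sigma) ** 2
        log_liks.append(ll)
    top = max(log_liks)
    unnorm = [math.exp(ll - top) for ll in log_liks]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def forecast(window, phis, sigma):
    """One-step forecast: each candidate's prediction weighted by its posterior."""
    weights = posterior_weights(window, phis, sigma)
    return sum(w * phi * window[-1] for w, phi in zip(weights, phis))


phis = [0.5, 0.9]                        # hypothetical candidate processes
series = [0.9 ** t for t in range(5)]    # generated by phi = 0.9, noise-free

w_len1 = posterior_weights(series[-1:], phis, sigma=0.5)
w_len2 = posterior_weights(series[-2:], phis, sigma=0.5)
w_len5 = posterior_weights(series, phis, sigma=0.5)
print(w_len1, w_len2, w_len5, forecast(series, phis, sigma=0.5))
```

Here the identification step (computing the weights) is separate from the conditional forecast (each candidate's one-step prediction), which mirrors the GPI/CF split described in the abstract.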

2606.01995 2026-06-02 cs.CL

CARTE: A Benchmark for Mapping Language Model Knowledge Across France

Sarah Almeida Carneiro, Christos Xypolopoulos, Xiao Fei, Yang Zhang, Michalis Vazirgiannis

AI summary: Proposes the CARTE benchmark, which uses 2,431 multiple-choice questions to evaluate large language models' fine-grained regional knowledge across France's 13 regions and 14 thematic domains, and introduces the CARTE-LV subset focused on linguistic variation; experiments find performance disparities across regions and model scales.

Abstract

We introduce CARTE (Culturally Anchored Regional-Territorial Evaluation), a multiple-choice benchmark for evaluating the ability of large language models (LLMs) to perform fine-grained reasoning over geographically grounded and regionally differentiated knowledge within France. While prior benchmarks focus on national-level cultural understanding, they largely overlook intra-country variation and the need to distinguish between closely related regional contexts. CARTE addresses this gap by introducing 2,431 questions spanning the 13 metropolitan regions of France and covering 14 thematic domains, including culture, language, demographics, economy, environment, and mobility. We further introduce CARTE-LV, a subset targeting Linguistic Variation across French regions, enabling focused evaluation of language-related differences. We evaluate 27 LLMs ranging from 1B to 12B parameters under few-shot settings. Our experiments reveal performance disparities across regions and model scales, suggesting systematic gaps in pretraining coverage and limited robustness to intra-national variation.

2606.01993 2026-06-02 cs.CL cs.AI cs.LG

MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?

Xinyu Che, Junqi Xiong, Yunfei Ge, Xinping Lei, Shihao Li, Hang Yan, Han Li, Yuanxing Zhang, Zhiqi Bai, Jinhua Hao, Ming Sun, Han Li, Jiaheng Liu

AI summary: Proposes the MMG2Skill framework, which compiles multimodal, heterogeneous in-the-wild guides into editable skills and continuously improves them through trajectory-level root-cause feedback, significantly improving VLM agent performance on GUI control, open-ended gameplay, and strategic card play tasks.

Comments: 35 pages, 12 figures, 13 tables. Code: https://github.com/NJU-LINK/MMG2Skill

Abstract

Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is often multimodal, heterogeneous, noisy, and implicitly assumes human executors, making it difficult to use directly as the skills required by agents. To bridge the gap between human-oriented guides and agent-executable skills, we formalize this problem as guide-to-skill learning: converting in-the-wild guides into executable skills and continuously improving them from trajectories observable to the agent. To evaluate the capability of existing agents on this task, we introduce MMG2Skill-Bench, the first benchmark designed for this problem. We further propose MMG2Skill, a closed-loop framework that compiles guides into editable skills, conditions a fixed vision-language model (VLM) agent on these skills during execution, and revises the skills from trajectory-level root-cause feedback without using benchmark scores. Across GUI control, open-ended gameplay, and strategic card play with six VLM backbones, MMG2Skill consistently outperforms vanilla baseline agents in every model-domain setting, achieving macro-average gains of +12.8 to +25.3 percentage points across backbones. Ablation studies show that directly prompting agents with raw guides can degrade performance, while both structured skill construction and trajectory-driven revision are necessary for the observed improvements. On success-inferable tasks, analyzer-based early stopping further prevents late-stage performance regressions and saves 25%-53% of attempts when the success signal is properly calibrated.

2606.01992 2026-06-02 cs.CV cs.AI cs.LG

A Structured Benchmark for Text-Guided Anomaly Detection: When Language Stops Conditioning the Decision

Stefano Samele, Eugenio Lomurno, Teodora Jovanovic, Sanjay Shivakumar Manohar, Alberto Crivellaro, Matteo Matteucci

AI summary: Proposes TGAD, a structured benchmark that progressively increases the functional role of language across three scenarios to evaluate the text-guided capabilities of multimodal anomaly detection systems, finding that current systems are only superficially conditioned on language and that standard benchmarks overstate their capabilities.

Abstract

Industrial anomaly detection has historically been a unimodal task. Recent multimodal vision-language models have produced systems that admit textual input alongside the image and are presented as enabling text-guided zero- and few-shot inspection. Yet these methods are evaluated with protocols inherited from unimodal benchmarks that hold the textual condition constant and therefore cannot measure whether language conditions the decision; whether reported gains reflect text guidance or strong pretrained visual features remains open. We introduce Text-Guided Anomaly Detection (TGAD), a structured benchmark that progressively increases the functional role of language across three scenarios: a controlled prompt-sensitivity setting on MVTec AD; a component-tagged extension of MVTec AD that requires the model to restrict its assessment to an instructed part; and the new Assembled Panel Dataset (APD), a realistic industrial setting that requires both defect-type and component-location knowledge. We evaluate one representative model per paradigm: generative large vision-language, training-free discriminative, and embedding-adaptive discriminative. In all three, the textual interface conditions the decision only superficially: prompt content is absorbed unless the object noun is removed (the generative model's I-AUROC drops from 97.4 to 82.6); component-level instructions do not constrain the decision once defects outside the instructed part are admitted as normal (from 90.3 to 66.3); and when both combine on APD, image-level discrimination collapses below the MVTec level, in one case below chance (71.2, 50.5, 31.5). These results suggest that standard benchmarks overstate the text-guided capabilities of current multimodal anomaly detection systems, and that a protocol of this kind is a prerequisite for models that can be reliably controlled through language for industrial deployment.


2606.01991 2026-06-02 cs.AI cs.CL cs.CY

SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning

SafeMCP：基于环境接地前瞻推理的LLM智能体防御主动权力调节

Lichao Wang, Zhaoxing Ren, Tianzhuo Yang, Jiaming Ji, Chi Harold Liu, Yaodong Yang, Juntao Dai

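AI summary: To address the power-seeking risk that LLM agents face as their action spaces expand, proposes SafeMCP, a server-side defense plugin that uses an internal world model for look-ahead reasoning to implement a two-tier defense of proactive tool filtering and immediate intervention, effectively reducing risk while preserving agent utility.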

Comments: Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), Main Conference

AI中文摘要

随着大语言模型（LLM）智能体越来越多地利用模型上下文协议（MCP）在复杂环境中运行，其动作空间的扩展赋予了智能体不安全的能力，并凸显了权力寻求的风险。虽然广阔的动作空间和更大的环境影响对于任务完成至关重要，但它们也创造了一个脆弱的风险表面，其中微小的错误或幻觉会被放大为灾难性故障。为此，我们提出了SafeMCP，一种服务器端防御插件，通过关于未来安全风险的预测推理来约束工具获取。SafeMCP利用内部世界模型进行前瞻推理，实现两级防御：主动工具过滤以限制危险的权力扩张，以及即时干预作为故障安全机制。为了训练SafeMCP，我们引入了一个三阶段流程，包括环境动态接地、安全策略初始化和具有双重可验证奖励的强化学习（RL）。在PowerSeeking Bench、ToolEmu和AgentHarm上的实验表明，SafeMCP实现了安全平衡，在有效缓解风险的同时保持了智能体的效用。

英文摘要

As Large Language Model (LLM) agents increasingly leverage the Model Context Protocol (MCP) to operate in complex environments, the expansion of their action spaces offers agents unsafe capabilities and underscores the risk of power-seeking. While a broad action space and greater environment influence are essential for task fulfillment, they create a fragile risk surface where minor errors or hallucinations are magnified into catastrophic failures. In response, we propose SafeMCP, a server-side defense plugin that constrains tool acquisition via predictive reasoning regarding future safety risks. SafeMCP utilizes an internal world model for look-ahead reasoning to implement a two-tier defense: proactive tool filtering to constrain hazardous power expansion and immediate intervention as a fail-safe. To train SafeMCP, we introduce a three-stage pipeline comprising environmental dynamic grounding, safe policy initialization, and reinforcement learning (RL) with dual verifiable rewards. Experiments on PowerSeeking Bench, ToolEmu, and AgentHarm show that SafeMCP achieves a safe equilibrium, effectively mitigating risks while preserving agent utility.


2606.01987 2026-06-02 cs.DM cs.LG

Graph Edit Distance Formulation for the Vehicle Routing Problem: Theory and Analysis

车辆路径问题的图编辑距离公式：理论与分析

Adel Dabah

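AI summary: Reformulates the vehicle routing problem as a graph edit distance maximization problem, in which an edge-deletion cost model makes minimizing total route cost equivalent to maximizing deleted edge weight, and uses the formulation for structural analysis and benchmarking.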

AI中文摘要

我们证明车辆路径问题（VRP）可以重新表述为图编辑距离（GED）最大化问题。在简单的边删除成本模型下，最小化总路线成本等价于从完整实例图中删除的边的总权重最大化。该公式在边级别对VRP进行建模，其中解由选定的边而非路线序列定义，从而能够进行经典公式中难以实现的结构分析：解质量的每条边归因、最优性差距的分解、解稀疏性的刻画以及贪婪构造难以到达的边的识别。理论上，我们建立了一个合并-分解定理，表明Clarke-Wright节省等于每次合并的GED增量，以及一个近似转移定理，将GED近似比转化为VRP成本界限。利用这一重新表述，我们分析了90个已知最优解的CVRP基准实例。我们发现最优路由图仅使用5.5%的可用边，约3.0%的最优边在重复重启下始终未被Clarke-Wright启发式找到，并且成本差距分解为遗漏的最优边和替代的非最优边，两者总权重相当。边加性目标为未来的图神经网络边预测方法提供了自然的每条边监督信号，暗示了与图神经网络方法的潜在联系，这留待后续工作。

英文摘要

We show that the Vehicle Routing Problem (VRP) can be reformulated as a Graph Edit Distance (GED) maximization problem. Under a simple edge-deletion cost model, minimizing total route cost is equivalent to maximizing the total weight of edges deleted from the complete instance graph. This formulation models VRP at the edge level, where solutions are defined by selected edges rather than route sequences, enabling structural analyses that are difficult in classical formulations: per-edge attribution of solution quality, decomposition of the optimality gap, characterization of solution sparsity, and identification of edges that are hard to reach by greedy construction. Theoretically, we establish a merge-decomposition theorem showing that Clarke-Wright savings equal per-merge GED increments, and an approximation-transfer theorem that turns GED approximation ratios into VRP cost bounds. Using this reformulation, we analyze 90 CVRP benchmark instances with known optimal solutions. We find that optimal routing graphs use only 5.5% of available edges, that approximately 3.0% of optimal edges are consistently not found by Clarke-Wright heuristics under repeated restarts, and that the cost gap decomposes into missed optimal edges and substituted non-optimal edges of comparable total weight. The edge-additive objective provides a natural per-edge supervision signal for future graph neural network approaches to edge prediction, suggesting a potential connection to graph neural network approaches that we leave for follow-up work.
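The equivalence stated in the abstract can be checked on a toy instance. The sketch below is only an illustration under stated assumptions, not the paper's formulation: the coordinates, the helper names (`route_edges`, `deleted_weight`, `clarke_wright_saving`), and the restriction to routes with at least two customers (so that every solution edge is used once) are all hypothetical choices made here. It shows that route cost plus deleted weight is constant, and that merging two routes raises the deleted weight by exactly the Clarke-Wright saving.

```python
import itertools
import math

# Hypothetical toy instance: node 0 is the depot, nodes 1-4 are customers.
coords = {0: (0.0, 0.0), 1: (1.0, 2.0), 2: (2.0, 2.0), 3: (2.0, -1.0), 4: (1.0, -2.0)}

def weight(i, j):
    """Euclidean edge weight between two nodes."""
    return math.dist(coords[i], coords[j])

def route_edges(routes, depot=0):
    """Undirected edges used by a set of routes that start and end at the depot."""
    edges = set()
    for route in routes:
        path = [depot, *route, depot]
        edges.update(frozenset(pair) for pair in zip(path, path[1:]))
    return edges

def route_cost(routes):
    """Total weight of the edges kept in the solution."""
    return sum(weight(*edge) for edge in route_edges(routes))

def deleted_weight(routes):
    """Total weight of the complete-graph edges that the solution deletes."""
    all_edges = {frozenset(pair) for pair in itertools.combinations(coords, 2)}
    return sum(weight(*edge) for edge in all_edges - route_edges(routes))

def clarke_wright_saving(i, j, depot=0):
    """Classical saving from joining the route ending at i to the route starting at j."""
    return weight(depot, i) + weight(depot, j) - weight(i, j)

before = [[1, 2], [3, 4]]   # two routes
after = [[1, 2, 3, 4]]      # the same routes merged at customers 2 and 3
gain = deleted_weight(after) - deleted_weight(before)
```

Because the total weight of the complete graph is fixed, any solution that lowers `route_cost` raises `deleted_weight` by the same amount, which is why the two objectives rank solutions identically in this simplified setting.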


2606.01985 2026-06-02 cs.CV

MT-EditFlow: Reinforcement Learning for Multi-Turn Image Editing with Flow Matching

MT-EditFlow：基于流匹配的多轮图像编辑强化学习

Jiahui Huang, Yasi Zhang, Tianyu Chen, Shu Wang, Jianwen Xie, Oscar Leong, Mingyuan Zhou, Nanzhu Wang, Ying Nian Wu

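AI summary: Proposes MT-EditFlow, a flow-matching reinforcement learning framework that optimizes reward signals for multi-turn image editing, addressing the failures and error propagation of single-turn editing models in multi-turn interaction and significantly improving multi-turn editing performance.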

AI中文摘要

近年来，基于指令的图像编辑取得了重大突破，模型现在能够处理现实世界中的编辑需求，满足日常用户的实用性要求。然而，主要为单轮编辑训练的编辑模型在多轮编辑中常常失败——在这种自然的交互设置中，用户基于模型自身之前的输出迭代地细化图像。这种失败源于“全有或全无”的要求，即单次失败会破坏整个序列，以及误差传播，即暴露偏差导致编辑误差累积。为了解决这些挑战，我们引入了MT-EditFlow，一个流匹配强化学习框架，旨在优化序列图像编辑的奖励信号。MT-EditFlow整合了多轮视角和多奖励公式，为基于GRPO和NFT的强化学习方法提供了统一的结构。我们通过研究有效的轮次级聚合评分策略、权衡奖励偏差与方差的VLM推理模式以及防止奖励破解的优势融合级别，系统地分析和优化了奖励信号。我们的发现表明，将聚合优势广播到整个编辑轨迹中，有效地弥合了局部规划与全局多轮任务成功之间的差距。大量实验表明，MT-EditFlow在多种基础模型上显著提升了性能。值得注意的是，它在FLUX.1-Kontext-dev上将第3轮整体性能提升了6.85分，超越了Qwen-Image-Edit等最先进的开源模型。通过保持高边际成功率和减少暴露偏差，MT-EditFlow为视觉内容创作中更可靠、更自然的人机协作奠定了基础。

英文摘要

Recent breakthroughs in instruction-based image editing have captured significant attention, as models are now capable of handling real-world editing demands with the practicality required by everyday users. However, editing models trained primarily for single-turn edits often break down in multi-turn editing, the natural interactive setting where a user iteratively refines an image based on the model's own previous outputs. This failure stems from the all-or-nothing requirement, where a single failed turn compromises the entire sequence, and error propagation, where exposure bias leads to compounding editing errors. To address these challenges, we introduce MT-EditFlow, a flow-matching reinforcement learning framework designed to optimize reward signals for sequential image editing. MT-EditFlow integrates a multi-turn perspective with a multi-reward formulation to provide a unified structure applicable to both GRPO and NFT-based reinforcement learning methods. We systematically analyze and optimize the reward signal by investigating effective scoring strategies for turn-level aggregation, VLM reasoning modes to trade off reward bias and variance, and advantage fusion levels to prevent reward hacking. Our findings reveal that broadcasting the aggregated advantage across the entire editing trajectory effectively bridges the gap between local planning and global multi-turn task success. Extensive experiments demonstrate that MT-EditFlow significantly improves performance across diverse base models. Notably, it boosts FLUX.1-Kontext-dev by 6.85 points in turn-3 overall performance, surpassing state-of-the-art open-source models such as Qwen-Image-Edit. By maintaining high marginal success rates and reducing exposure bias, MT-EditFlow provides a foundation for more reliable and natural human-AI collaboration in visual content creation.
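The idea of broadcasting an aggregated advantage across an editing trajectory can be sketched roughly as follows. This is a minimal illustration under assumptions, not the MT-EditFlow implementation: the abstract does not specify the aggregation rule, so the mean over turns is used here as a placeholder, the group normalization follows the usual GRPO-style recipe, and the function name `broadcast_advantages` is hypothetical.

```python
import statistics

def broadcast_advantages(turn_rewards, eps=1e-8):
    """Compute one group-normalized advantage per trajectory and copy it to every turn.

    turn_rewards holds one list of per-turn rewards for each trajectory sampled
    in the same group (same prompt sequence).
    """
    # Assumed aggregation: mean of the per-turn rewards of each trajectory.
    totals = [statistics.fmean(rewards) for rewards in turn_rewards]
    mean = statistics.fmean(totals)
    std = statistics.pstdev(totals)
    # GRPO-style normalization within the group; eps avoids division by zero.
    advantages = [(total - mean) / (std + eps) for total in totals]
    # Broadcast: every turn of a trajectory receives the trajectory-level advantage.
    return [[adv] * len(rewards) for adv, rewards in zip(advantages, turn_rewards)]

group = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
per_turn = broadcast_advantages(group)
```

With this choice, a turn that looked fine locally still receives a negative signal when the trajectory as a whole scored below the group average, which is one way to read the abstract's point about linking local steps to global multi-turn success.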


2606.01982 2026-06-02 cs.AI

An NLP-Driven Framework for Curriculum-Labor Market Alignment: Schema-Constrained LLM Extraction, ESCO-Anchored Semantic Matching, and Multi-Dimensional Gap Quantification

一种基于NLP的课程-劳动力市场对齐框架：模式约束的LLM抽取、ESCO锚定的语义匹配和多维差距量化

Sherzod Turaev, Mary John, Mamoun Awad, Nazar Zaki, Khaled Shuaib

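AI summary: Proposes a four-stage NLP framework that aligns curricula with the labor market and quantifies multi-dimensional supply-demand gaps through schema-constrained LLM extraction, ESCO semantic matching, an adjudication protocol, and a verification mechanism.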

Comments: 53 pages, 9 figures, 4 tables

AI中文摘要

从多样化的教育和劳动力市场语料库中进行模式约束的信息抽取仍然是自然语言处理中的一个开放挑战，因为现有流程主要依赖于无法恢复隐含能力的词汇表面方法，缺乏共享分类法的基础，并且没有提供抽取可靠性或文档级完整性的正式度量。为了解决这些限制，本文提出了一个四阶段NLP框架，结合了(i) 对两个前沿LLM集成模型进行模式约束提示，针对JSON Schema强制实施的七槽能力形式；(ii) 使用Sentence-BERT (SBERT)将抽取的记录与十一个领域的ESCO v1.2.1受控词汇表对齐；(iii) 一个解决模型间分歧的两级裁决协议；(iv) 一个结合每槽Cohen's kappa、模式符合性和文档级完整性审计的验证机制。该框架在高等教育质量保证的关键应用中实例化，即阿联酋大学ABET认证的计算机科学学士学位课程的课程-劳动力市场对齐。该流程从2025-2026学年的85门课程学习计划中抽取400条能力记录，并在从计算核心到概率加权学生轨迹的五范围分析下，与30个职位发布（483个要求条款）以0.50的SBERT余弦阈值对齐。抽取器在技能槽上达到0.79的Cohen's kappa，模式符合性100%，文档级完整性100%。对齐揭示了可解释的供需差距：通用和横向技能差距25.0%，算法与计算理论差距13.8%，软件工程与项目管理差距12.2%，而人工智能与数据科学差距接近零的1.8%，尽管供应覆盖率为38.6%。

英文摘要

Schema-constrained information extraction from diverse educational and labor-market corpora remains an open challenge in natural language processing because existing pipelines rely primarily on lexical-surface methods that cannot recover implicit competencies, lack grounding in shared taxonomies, and provide no formal measures of extraction reliability or document-level completeness. To address these limitations, this paper proposes a four-stage NLP framework that combines (i) schema-constrained prompting of a two-model frontier-LLM ensemble against a JSON Schema-enforced seven-slot competency formalism, (ii) Sentence-BERT (SBERT) alignment of the extracted records against an eleven-domain ESCO v1.2.1 controlled vocabulary, (iii) a two-tier adjudication protocol that resolves inter-model disagreements, and (iv) a verification mechanism that combines per-slot Cohen's kappa, schema conformance, and document-level completeness audits. The framework is instantiated for a critical application in higher-education quality assurance, namely curriculum-labor market alignment for the ABET-accredited BSc Computer Science program at the United Arab Emirates University. The pipeline extracts 400 competency records from the 85-course 2025-2026 study plan and aligns them, under a five-scope analysis ranging from the computing core to a probability-weighted student trajectory, with 30 job postings (483 requirement clauses) at an SBERT cosine threshold of 0.50. The extractor achieves Cohen's kappa of 0.79 on the skill slot, with 100% schema conformance and 100% document-level completeness. The alignment surfaces interpretable supply-demand gaps of 25.0% in general and transversal skills, 13.8% in algorithms and computational theory, and 12.2% in software engineering and project management, with a near-zero 1.8% gap in artificial intelligence and data science despite 38.6% supply coverage.
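Two of the quantities named in the abstract, Cohen's kappa and a cosine-similarity threshold of 0.50, have standard definitions that a short sketch can make concrete. The code below is illustrative only: the toy labels and two-dimensional vectors are made up, real SBERT embeddings are much higher-dimensional, and the function names are hypothetical rather than taken from the paper's pipeline.

```python
import math
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance agreement.

    Undefined (division by zero) when chance agreement is exactly 1.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def is_match(u, v, threshold=0.50):
    """Treat two embeddings as aligned when their cosine similarity reaches the threshold."""
    return cosine(u, v) >= threshold

kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

In the toy example the raters agree on three of four items (0.75) while chance agreement is 0.5, so kappa is 0.5; the abstract's reported 0.79 on the skill slot would indicate considerably stronger agreement on the same scale.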


2606.01981 2026-06-02 cs.CV

Generalization Limits in Vehicle Re-Identification

车辆再识别中的泛化极限

Anis Yassine Ben Mabrouk, Antoine Tadros, Rafael Grompone von Gioi, Gabriele Facciolo, Axel Davy, Rodrigo Verschae

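AI summary: To address the poor generalization of vehicle re-identification models to unseen vehicle types, proposes a new evaluation approach and, through a view-split analysis, reveals the limits of existing methods in viewpoint robustness and attention to detail.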

AI中文摘要

车辆再识别关注于根据查询图像从图库中检索同一车辆的图像。通过仔细检查常用数据集，我们观察到视觉差异很小的车辆——例如相同的品牌、型号和颜色——同时出现在训练集和测试集中。因此，有效记忆训练数据的方法在这些测试集上表现良好，但难以泛化到其他数据集。在本文中，我们通过提出一种新的评估方法来解决这个问题，该方法能更有效地衡量对未见车辆类型的泛化能力。为了进一步研究泛化性能，我们还提出基于视角进行分割评估，从而区分视角鲁棒性与同视角再识别的影响。我们的发现表明，大多数最先进的方法在处理未见车辆类型时存在困难，并且它们对视角变化的鲁棒性和对细节的关注仅限于训练中见过的车辆类型。

英文摘要

Vehicle re-identification focuses on retrieving images of the same vehicle from a gallery given a query image. Upon closer inspection of commonly used datasets, we observe that vehicles with few visual differences (e.g., the same make, model, and color) appear in both the training and test sets. As a result, methods that effectively memorize the training data tend to perform well on these test sets but struggle to generalize to other datasets. In this paper, we address this issue by proposing a novel evaluation approach that more effectively measures generalization capability to unseen vehicle types. To further study generalization performance, we also propose splitting the evaluation based on view, allowing us to differentiate the effect of viewpoint robustness from that of same-view re-identification. Our findings reveal that most state-of-the-art methods struggle with unseen vehicle types, and that their robustness to viewpoint changes and attention to detail are limited to vehicle types seen during training.


2606.01975 2026-06-02 cs.AI cs.SE

Algorithmic algorithm development with LLMs: A Case Study on LLM-Usage for Contraction Order Optimization in Tensor Networks

基于LLM的算法开发：以张量网络收缩顺序优化中LLM使用为例

Fabian Hoppe, Melven Röhrig-Zöllner, Philipp Knechtges

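AI summary: Through a case study applying OpenEvolve to contraction order optimization in tensor networks, examines LLM-based algorithm development, focusing on design factors such as LLM choice, evaluation metrics, and test instances, and highlights the potential of verification-guided evolutionary coding agents as well as the importance and challenges of human scientists in evaluation, verification, and interpretation.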

2606.01973 2026-06-02 cs.LG cs.CV

A Closer Look at In-Distribution vs. Out-of-Distribution Accuracy for Open-Set Test-time Adaptation

开放集测试时自适应中分布内与分布外准确率的深入分析

Zefeng Li, Evan Shelhamer

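AI summary: Through benchmarking and a proposed new baseline, shows that current open-set test-time adaptation methods fall short in balancing in-distribution accuracy and out-of-distribution detection.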

Comments: TMLR 2026

AI中文摘要

开放集测试时自适应（TTA）在存在输入偏移和未知输出类别的情况下更新模型。尽管近期方法在提高已知类别的分布内（InD）准确率方面取得了进展，但它们准确检测分布外（OOD）未知类别的能力仍未得到充分探索。我们在小规模CIFAR-10-C和大规模ImageNet-C的标准损坏基准上，对鲁棒和开放集TTA方法（SAR、OSTTA、UniEnt和SoTTA）进行了基准测试。对于CIFAR-10-C，我们使用来自SVHN和CIFAR-100的OOD数据，分别对应其损坏形式SVHN-C和CIFAR-100-C。对于ImageNet-C，我们使用来自ImageNet-O和Textures的OOD数据，分别对应其损坏形式ImageNet-O-C和Textures-C。ImageNet-O更接近ImageNet，包含未知但相关的物体类别（如食物类的“蒜香面包”与“热狗”，基础设施类的“高速公路”与“水坝”），而Textures则远离ImageNet，包含非物体图案（如“裂纹”泥土、“多孔”海绵、“纹理”树叶）。我们评估了TTA方法在CIFAR-10-C和ImageNet-C上对InD与OOD识别的准确率和置信度。我们在CIFAR-10-C上验证了每种方法自身OOD检测技术的准确率。我们还在ImageNet-C上进行了评估，并报告了准确率和标准OOD检测指标。我们进一步考察了更现实的设置，其中OOD数据的比例和速率可以变化。为了探索InD识别与OOD拒绝之间的权衡，我们提出了一种新的基线，将softmax/多类输出替换为sigmoid/多标签输出。我们的分析首次表明，当前的开放集TTA方法难以平衡InD和OOD准确率，并且它们仅能不完全地过滤OOD数据以进行自身的自适应更新。

英文摘要

Open-set test-time adaptation (TTA) updates models on new data in the presence of input shifts and unknown output classes. While recent methods have made progress on improving in-distribution (InD) accuracy for known classes, their ability to accurately detect out-of-distribution (OOD) unknown classes remains underexplored. We benchmark robust and open-set TTA methods (SAR, OSTTA, UniEnt, and SoTTA) on the standard corruption benchmarks of CIFAR-10-C at the small scale and ImageNet-C at the large scale. For CIFAR-10-C, we use OOD data from SVHN and CIFAR-100 in their respective corrupted forms of SVHN-C and CIFAR-100-C. For ImageNet-C, we use OOD data from ImageNet-O and Textures in their respective corrupted forms of ImageNet-O-C and Textures-C. ImageNet-O is nearer to ImageNet, as unknown but related object classes (like "garlic bread" vs. "hot dog" for food, or "highway" vs. "dam" for infrastructure), while Textures is farther from ImageNet, as non-object patterns (like "cracked" mud, "porous" sponge, "veined" leaves). We evaluate the accuracy and confidence of TTA methods for InD vs. OOD recognition on CIFAR-10-C and ImageNet-C. We verify the accuracy of each method's own OOD detection technique on CIFAR-10-C. We also evaluate on ImageNet-C and report both accuracy and standard OOD detection metrics. We further examine more realistic settings, in which the proportions and rates of OOD data can vary. To explore the trade-off between InD recognition and OOD rejection, we propose a new baseline that replaces softmax/multi-class output with sigmoid/multi-label output. Our analysis shows for the first time that current open-set TTA methods struggle to balance InD and OOD accuracy and that they only imperfectly filter OOD data for their own adaptation updates.
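One intuition for replacing softmax/multi-class output with sigmoid/multi-label output is that softmax scores depend only on relative logits and must sum to one, whereas sigmoid scores can all be low at once. The sketch below illustrates that difference on made-up logits. The max-score rejection rule and its 0.5 threshold are assumptions chosen for illustration; the abstract does not describe how the proposed baseline actually rejects OOD inputs.

```python
import math

def softmax(logits):
    """Multi-class scores: depend only on logit differences and sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Multi-label scores: each class is scored independently in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def reject_as_ood(scores, threshold=0.5):
    """Hypothetical rule: reject the input when no class score reaches the threshold."""
    return max(scores) < threshold

# All logits are strongly negative, i.e. no known class fits the input well.
logits = [-4.0, -6.0, -6.0]
softmax_scores = softmax(logits)
sigmoid_scores = sigmoid(logits)
```

Here softmax still assigns about 0.79 to the first class because it is merely less unlikely than the others, while every sigmoid score stays below 0.02, so only the sigmoid output triggers the rejection rule.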


2606.01970 2026-06-02 cs.RO cs.MA cs.SY eess.SY

Market-Based Replanning for Safety-Critical UAV Swarms in Search and Rescue Missions

面向搜救任务中安全关键无人机群的基于市场的重规划

Luiz Giacomossi, Andrea Haglund, Claire Namatovu, Emily Zainali, Esaias Målqvist, Yonatan M. Beyene, Ivan Tomasic, Baran Çürüklü, Håkan Forsberg

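AI summary: Proposes IRDS, a distributed coordination architecture that autonomously reallocates tasks after UAV failures through a reverse-auction market mechanism and a geometric consensus protocol, maintaining a 93% mission success rate under 25% degradation.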

Comments: 6 pages, 4 figures, accepted at MIPRO 2026

AI中文摘要

搜救任务中可靠自主无人机群需要能够容忍代理退化并维持操作的容错协调。本文介绍了智能重规划无人机群（IRDS），一种为资源受限环境设计的分布式协调架构。所提出的框架采用反向拍卖市场机制，其中代理基于距离加权成本函数竞标服务搜索区域，并结合几何共识协议进行目标验证。我们通过物理仿真（N=8个代理，8x8网格）评估该方法，并施加随机故障注入。结果表明，无人机群能够以相对于总任务持续时间较低的延迟自主重新分配来自故障代理的任务，在25%劳动力退化下保持93%的任务成功率。所提出的框架展示了一种稳健的、经过实证测试的空中机器人自愈协调方法。

英文摘要

Reliable autonomous UAV swarms in Search and Rescue (SAR) missions require fault-tolerant coordination capable of sustaining operations despite agent degradation. This paper introduces the Intelligent Replanning Drone Swarm (IRDS), a distributed coordination architecture designed for resource-constrained environments. The proposed framework employs a Reverse-Auction market mechanism where agents bid to service search sectors based on a distance-weighted cost function, coupled with a geometric consensus protocol for target verification. We evaluate the approach through physics-based simulations (N=8 agents, 8x8 grid) subjected to stochastic fault injection. Results indicate that the swarm autonomously reallocates tasks from failed agents with low latency relative to the total mission duration, maintaining a mission success rate of 93% under 25% workforce degradation. The proposed framework demonstrates a robust, empirically tested method for self-healing aerial robotic coordination.
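A reverse auction over search sectors, with reallocation after an agent failure, can be sketched in a few lines. Everything specific in the code below is an assumption made for illustration: the abstract only says that bids use a distance-weighted cost function, so the bid formula (distance plus a load penalty), the sequential award order, and the names `bid`, `auction`, and `replan` are hypothetical and not the IRDS design.

```python
import math

def bid(agent_pos, sector_pos, load, load_penalty=1.0):
    """Hypothetical bid: travel distance plus a penalty per sector already held."""
    return math.dist(agent_pos, sector_pos) + load_penalty * load

def auction(agents, sectors, loads=None):
    """Reverse auction: each sector is awarded to the agent with the lowest bid."""
    loads = dict(loads or {})
    assignment = {}
    for sector_id in sorted(sectors):
        winner = min(
            sorted(agents),  # sorted names give a deterministic tie-break
            key=lambda name: bid(agents[name], sectors[sector_id], loads.get(name, 0)),
        )
        assignment[sector_id] = winner
        loads[winner] = loads.get(winner, 0) + 1
    return assignment

def replan(assignment, agents, sectors, failed):
    """Re-auction only the sectors held by the failed agent among the survivors."""
    survivors = {name: pos for name, pos in agents.items() if name != failed}
    kept = {s: a for s, a in assignment.items() if a != failed}
    orphaned = {s: sectors[s] for s, a in assignment.items() if a == failed}
    loads = {name: list(kept.values()).count(name) for name in survivors}
    return {**kept, **auction(survivors, orphaned, loads)}

agents = {"A": (0.0, 0.0), "B": (10.0, 0.0)}
sectors = {"s1": (1.0, 0.0), "s2": (9.0, 0.0)}
plan = auction(agents, sectors)
recovered = replan(plan, agents, sectors, failed="B")
```

Re-auctioning only the orphaned sectors keeps the surviving agents' existing work untouched, which is one plausible way to keep reallocation latency low relative to mission duration.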


2606.01967 2026-06-02 cs.CL

Training Prompt Matters: State-Adaptive Optimization for Robust Fine-Tuning

训练提示至关重要：面向鲁棒微调的状态自适应优化

Wenhang Shi, Yiren Chen, Shuqing Bian, Zhe Zhao, Jinhao Dong, Pengfei Hu, Wei Lu, Xiaoyong Du

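AI summary: Proposes State-Adaptive Prompt Optimization (SAPO), a strategy that turns task formulation from a static input into a dynamic, state-adaptive variable, effectively mitigating catastrophic forgetting and improving generalization, with significant gains on multiple benchmarks.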

AI中文摘要

虽然提示工程在推理过程中对最大化大型语言模型（LLM）的能力至关重要，但提示在训练过程中的作用仍未得到充分探索。现有的微调范式通常将训练提示视为表面形式，假设语义等价的指令会产生相同的学习结果。然而，我们揭示这种等价性具有欺骗性：虽然释义后的提示通常会导致类似的任务内性能，但它们在灾难性遗忘和泛化方面会引发截然不同的跨任务影响。关键的是，这些影响在任务间呈正相关，表明存在始终产生更好性能的优越提示。此外，我们发现这些优越提示可以在学习之前通过任务损失稳健地识别。利用这些见解，我们引入了状态自适应提示优化（SAPO），这是一种轻量级但有效的训练策略，它将任务公式从静态输入转变为动态的、状态自适应的变量。在多种基准上的全面实验证实了其有效性，它显著减轻了遗忘，同时提高了泛化能力，相比于最先进的方法取得了显著的性能提升。这些结果提供了关于训练提示如何塑造学习动态的见解，并为鲁棒微调提供了实用的方法。我们的代码可在 https://github.com/Eric8932/SAPO 获取。

英文摘要

While prompt engineering is instrumental in maximizing the capabilities of Large Language Models (LLMs) during inference, the role of prompts during training remains critically underexplored. Prevailing fine-tuning paradigms typically treat training prompts as mere surface forms, assuming that semantically equivalent instructions yield identical learning outcomes. However, we reveal that this equivalence is deceptive: while paraphrased prompts often lead to comparable in-task performance, they induce drastically different cross-task impacts regarding catastrophic forgetting and generalization. Crucially, these impacts are positively correlated across tasks, indicating the existence of superior prompts that consistently yield better performance. Furthermore, we discover that these superior prompts can be robustly identified by task loss prior to learning. Leveraging these insights, we introduce State-Adaptive Prompt Optimization (SAPO), a lightweight yet effective training strategy that shifts task formulation from a static input to a dynamic, state-adaptive variable. Comprehensive experiments on diverse benchmarks confirm its effectiveness, which significantly mitigates forgetting while improving generalization, achieving substantial performance gains over state-of-the-art methods. These results provide insights into how training prompts shape learning dynamics and offer a practical recipe for robust fine-tuning. Our code is available at https://github.com/Eric8932/SAPO.


2605.04948 2026-06-02 cs.CL

Adapting Large Language Models to a Low-Resource Agglutinative Language: A Comparative Study of LoRA and QLoRA for Bashkir

将大型语言模型适配到低资源黏着语：LoRA与QLoRA在巴什基尔语上的比较研究

Mullosharaf K. Arabov, Svetlana S. Khaybullina

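AI summary: Compares two parameter-efficient fine-tuning methods, LoRA and QLoRA, for adapting models to Bashkir, a low-resource agglutinative language, and finds that QLoRA on 7B-scale models strikes an effective balance between quality and computational cost.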

Comments: Accepted to CLIB 2026

AI中文摘要

本文对参数高效微调方法（包括LoRA和QLoRA）在将大型语言模型适配到巴什基尔语（突厥语族的一种低资源黏着语）任务上进行了比较研究。实验在包含71k文档（4690万个token）的巴什基尔语文本语料库上进行，使用了多种架构的模型：DistilGPT2、GPT-2（base、medium）、Phi-2、Qwen2.5-7B、DeepSeek-7B和Mistral-7B。为提高结果可靠性，每种配置使用三种不同的随机种子进行训练。测试集上最低困惑度由全微调的GPT-2 medium获得（3.34）。同时，应用于Mistral-7B（3.79）和Phi-2（3.81）的QLoRA在可训练参数减少40倍以上的情况下达到了相当的质量。然而，我们也观察到某些架构使用PEFT时质量显著下降的情况（例如，DeepSeek-7B，秩为8，困惑度=129.55），这表明结果关键取决于基础模型及其分词器的选择。此外，基于巴什基尔语提示的生成文本定性分析显示，具有最佳困惑度的模型不一定产生最连贯的输出：QLoRA微调的模型生成了单语巴什基尔语续写，而具有最低困惑度的全微调模型则频繁切换到英语。结果表明，对于巴什基尔语，7B规模模型上的QLoRA在质量和计算成本之间提供了有效的折中。为确保可重复性，开放数据、代码和训练好的适配器将在论文被接收后发布。

英文摘要

This paper presents a comparative study of parameter-efficient fine-tuning (PEFT) methods, including LoRA and QLoRA, applied to the task of adapting large language models to the Bashkir language, a low-resource agglutinative language of the Turkic family. Experimental evaluation is conducted on a Bashkir text corpus of 71k documents (46.9M tokens) using models of various architectures: DistilGPT2, GPT-2 (base, medium), Phi-2, Qwen2.5-7B, DeepSeek-7B, and Mistral-7B. To improve the reliability of results, each configuration was trained with three different random seeds. The lowest perplexity on the test set was obtained for GPT-2 medium with full fine-tuning (3.34). Meanwhile, QLoRA applied to Mistral-7B (3.79) and Phi-2 (3.81) achieved comparable quality with over 40 times fewer trainable parameters. However, we also observed cases of significant quality degradation when using PEFT for certain architectures (e.g., DeepSeek-7B with rank 8, perplexity = 129.55), indicating that the outcome depends critically on the choice of the base model and its tokenizer. Additionally, a qualitative analysis of generated texts based on Bashkir prompts revealed that models with the best perplexity do not necessarily produce the most coherent outputs: QLoRA-tuned models generated monolingual Bashkir continuations, whereas the fully fine-tuned model with the lowest perplexity frequently switched to English. The results suggest that QLoRA on 7B-scale models offers an effective compromise between quality and computational cost for Bashkir. To ensure reproducibility, open data, code, and trained adapters will be released upon acceptance.
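The perplexity figures quoted above follow the standard definition: the exponential of the average negative log-likelihood per token. A minimal sketch, using made-up token probabilities rather than any model output from the paper, is:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability of the observed tokens)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every observed token is, on average,
# as uncertain as a uniform choice among four options.
uniform_four = perplexity([0.25, 0.25, 0.25, 0.25])
```

Read this way, a test perplexity of 3.34 means the model is about as uncertain per token as a uniform choice among roughly three to four candidates, while 129.55 indicates a far poorer fit. Because perplexity is computed per token, values are directly comparable only between models that share a tokenizer, which is consistent with the abstract's remark that outcomes depend on the base model and its tokenizer.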


2605.03384 2026-06-02 cs.CR cs.SD

DECKER: Domain-invariant Embedding for Cross-Keyboard Extraction and Recognition

DECKER：跨键盘提取与识别的域不变嵌入

Bikrant Bikram Pratap Maurya, Nitin Choudhury, Daksh Agarwal, Arun Balaji Buduru

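AI summary: To address generalization of keyboard acoustic side-channel attacks across keyboards, users, and noisy environments, proposes DECKER, a four-stage domain-invariant keystroke inference framework, and builds the multi-dimensional HEAR dataset; experiments show that the method significantly improves keystroke recognition in cross-keyboard and cross-user settings.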

Comments: Accepted to AsiaCCS'26

AI中文摘要

键盘上的声学侧信道攻击（ASCA）构成了重大的安全风险，因为击键可以从打字声音中推断出来，从而泄露敏感信息。先前的ASCA研究受限于小规模数据集，在用户、键盘和环境方面的多样性不足，限制了跨设备、麦克风和噪声条件的分析。我们引入了HEAR数据集，旨在沿着三个轴研究ASCA：键盘泛化、噪声适应和用户偏差。HEAR包含来自53名参与者使用37种笔记本电脑键盘的录音，在三种现实场景中收集：（1）外部麦克风捕获，（2）无网络噪声的设备麦克风捕获，以及（3）基于VoIP的流式捕获。这使得能够在用户、键盘和环境之间进行受控评估。在HEAR上，我们建立了一个ASCA基准，涵盖了单模态和多模态设置中来自原始音频和频谱图的传统特征和预训练表示。我们提出了DECKER，一个域不变的击键推理框架，包含四个阶段：（1）键盘签名归一化以减少设备着色，（2）域对抗解耦以抑制键盘身份，（3）有监督的跨键盘对比对齐以强制键一致性，以及（4）声学风格随机化以合成未见过的键盘响应。我们进一步探索了使用基于LLM的后处理层进行句子级推理，通过语言上下文优化击键序列。在HEAR上的结果表明，DECKER在跨键盘和跨用户设置中显著提高了击键识别性能，并通过语言模型校正进一步获得提升。这些发现强调，ASCA在多样化的用户、设备和噪声环境中仍然有效，凸显了其实际安全风险。

英文摘要

Acoustic side-channel attacks (ASCA) on keyboards pose a significant security risk, as keystrokes can be inferred from typing acoustics, revealing sensitive information. Prior ASCA studies are limited by small-scale datasets with restricted diversity in users, keyboards, and environments, constraining analysis across devices, microphones, and noise conditions. We introduce HEAR, a dataset designed to study ASCA along three axes: keyboard generalization, noise adaptation, and user bias. HEAR contains recordings from 53 participants using 37 laptop keyboards, collected in three realistic settings: (1) external microphone capture, (2) device microphone capture without network noise, and (3) VoIP-based streaming capture. This enables controlled evaluation across users, keyboards, and environments. On HEAR, we establish an ASCA benchmark spanning conventional features and pre-trained representations from raw audio and spectrograms in unimodal and multimodal settings. We propose DECKER, a domain-invariant keystroke inference framework with four stages: (1) Keyboard Signature Normalization to reduce device coloration, (2) domain-adversarial disentanglement to suppress keyboard identity, (3) supervised cross-keyboard contrastive alignment to enforce key consistency, and (4) Acoustic Style Randomization to synthesize unseen keyboard responses. We further explore sentence-level inference using an LLM-based post-processing layer to refine keystroke sequences via linguistic context. Results on HEAR show DECKER improves keystroke identification over strong baselines, particularly in cross-keyboard and cross-user settings, with further gains from language-model rectification. These findings highlight that ASCA remains effective across diverse users, devices, and noisy environments, underscoring its practical security risk.


2605.02640 2026-06-02 cs.AI

Trustworthy AI Suffers from Invariance Conflicts and Causality is The Solution

可信人工智能面临不变性冲突，因果性是解决方案

Ruta Binkyte, Ivaxi Sheth, Zhijing Jin, Mohammad Havaei, Bernhard Schölkopf, Mario Fritz

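AI summary: By reinterpreting trustworthy AI objectives as incompatible invariance requirements under changes to the data-generating process, argues that causality is a necessary framework for understanding and balancing trade-offs between performance and multiple trustworthiness objectives.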

Journal ref: Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026

AI中文摘要

随着人工智能（包括机器学习模型和基础模型）在高风险领域的部署日益增多，确保其可信度已成为一个核心挑战。然而，可信人工智能的核心目标，如公平性、鲁棒性、隐私性和可解释性，很难同时实现，尤其是在保持效用的同时。这篇立场论文认为，因果性对于理解和平衡性能与可信人工智能多个目标之间的权衡是必要的。我们将可信人工智能的权衡重新解释为数据生成过程不同变化下的不相容不变性要求，从而为我们的论点奠定基础。然后，我们通过文献中的案例研究和风格化的合成数据模拟来说明这一论点，表明因果性提供了一个统一的框架，用于理解可信人工智能中的权衡如何产生，以及如何通过选择性不变性来缓解或解决这些权衡。这一视角既适用于经典机器学习模型，也适用于大规模基础模型。最后，我们概述了利用因果性构建既可信又高性能的人工智能所面临的开放挑战和机遇。

英文摘要

As artificial intelligence (AI), including machine learning (ML) models and foundation models (FMs), is increasingly deployed in high-stakes domains, ensuring its trustworthiness has become a central challenge. However, the core trustworthy AI objectives, such as fairness, robustness, privacy, and explainability, are hard to achieve simultaneously, especially while preserving utility. This position paper argues that causality is necessary to understand and balance trade-offs between performance and multiple objectives of trustworthy AI. We ground our arguments in re-interpreting trustworthy AI trade-offs as incompatible invariance requirements under different changes to the data-generating process. We then illustrate this argument through case-study analyses from the literature and a stylized synthetic-data simulation, showing that causality provides a unifying framework for understanding how trade-offs in trustworthy AI arise and how they can be softened or resolved through selective invariance. This perspective applies to both classical ML models and large-scale FMs. Finally, we outline open challenges and opportunities for using causality to build both trustworthy and high-performing AI.


2605.02270 2026-06-02 cs.CL

A Systematic Benchmark of Machine Transliteration Models for the Tajik-Farsi Language Pair: A Comparative Study from Rule-Based to Transformer Architectures

塔吉克-波斯语对机器音译模型的系统基准测试：从基于规则到Transformer架构的比较研究

Mullosharaf K. Arabov

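AI summary: Builds a multi-source parallel corpus and systematically compares six classes of models from rule-based to Transformer, finding that byte-level ByT5 performs best on Tajik-Farsi transliteration (chrF++ 87.4/80.1), while the subword-tokenization model fails completely.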

Comments: Accepted to CLIB 2026

AI中文摘要

本文首次对塔吉克语（西里尔字母）和波斯语（阿拉伯字母）之间音译的现代机器学习架构进行了全面比较分析。一个关键贡献是创建并验证了一个独特的平行语料库，该语料库汇集了多个异构来源，包括众包项目、词典对、《列王纪》平行文本、外交文章、《玛斯纳维》文本、官方术语列表和音译对应关系。初始数据集包含328,253个句子对；通过分层随机抽样形成了40,000个句对的代表性子集。实验比较了六类模型：基于规则的基线、带注意力的LSTM、字符级Transformer、G2P Transformer（从头训练）、预训练多语言模型（mBART、带LoRA的mT5）以及字节级ByT5。结果表明ByT5具有压倒性优势（塔吉克语到波斯语chrF++ 87.4，反向80.1）。尽管数据有限，G2P Transformer显著优于mBART（72.3 vs. 62.2 chrF++）。使用子词分词（mT5）的模型完全失败（chrF++低于18.5）。研究结果表明，对于塔吉克-波斯语对的准确音译，在字节或字符级别操作的架构明确优于依赖子词分词的传统多语言Seq2Seq模型。

英文摘要

This paper presents the first comprehensive comparative analysis of modern machine learning architectures for transliteration between Tajik (Cyrillic script) and Persian (Arabic script). A key contribution is the creation and validation of a unique parallel corpus aggregated from multiple heterogeneous sources, including crowdsourced projects, lexicographic pairs, parallel texts of "Shahnameh", diplomatic articles, texts of "Masnavi-i Ma'navi", official terminology lists, and transliterated correspondences. The initial dataset comprised 328,253 sentence pairs; a representative subset of 40,000 pairs was formed using stratified random sampling. The experiment compared six classes of models: rule-based baseline, LSTM with attention, character-level Transformer, G2P Transformer (trained from scratch), pre-trained multilingual models (mBART, mT5 with LoRA), and byte-level ByT5. Results demonstrate the overwhelming superiority of ByT5 (chrF++ 87.4 for Tajik to Farsi, 80.1 for reverse). The G2P Transformer significantly outperformed mBART (72.3 vs. 62.2 chrF++) despite limited data. Models using subword tokenization (mT5) failed completely (chrF++ less than 18.5). The findings demonstrate that for accurate transliteration of the Tajik-Farsi pair, architectures operating at the byte or character level are unequivocally more effective than traditional multilingual Seq2Seq models relying on subword tokenization.


2605.02122 2026-06-02 cs.LG cs.AI

STABLEVAL: Disagreement-Aware and Stable Evaluation of AI Systems

STABLEVAL：面向AI系统的分歧感知与稳定评估

Akash Bonagiri, Gerard Janno Anderias, Saee Patil, Angelina Lai, Devang Borkar, Gezheng Kang, Ishant Gandhi, Setareh Rafatirad, Houman Homayoun

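AI summary: To address the ranking instability that majority voting causes under annotator disagreement, proposes the STABLEVAL framework, which models latent correctness and annotator confusion patterns to achieve stable, uncertainty-aware system evaluation.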

AI中文摘要

人类评估仍然是评估现代AI系统的主要标准，然而标注者的分歧、偏见和变异性使得在标准多数投票聚合下系统排名变得脆弱。多数投票忽略了标注者可靠性和项目级别的模糊性，往往在标注者子集之间产生不稳定的比较。我们引入了STABLEVAL，一个分歧感知的评估框架，该框架对潜在项目正确性和标注者特定的混淆模式进行建模，以产生后验期望项目得分和校准的智能体级别分数。与Dawid-Skene等标签去噪方法不同，STABLEVAL明确设计用于稳定和不确定性感知的系统评估，而不是硬标签恢复。我们将排名稳定性形式化为首要评估目标，并分析聚合方法如何保留或扭曲底层标注者行为。在受控的合成实验和多个真实世界人工标注基准上，多数投票在标注者异质性和对抗性噪声下表现出增加的得分误差和排名不稳定性，而STABLEVAL产生了更稳定和统计上更合理的系统排名。这些结果表明，对分歧进行建模对于稳健和可复现的AI评估至关重要。

英文摘要

Human evaluation remains the primary standard for assessing modern AI systems, yet annotator disagreement, bias, and variability make system rankings fragile under standard majority vote aggregation. Majority vote discards annotator reliability and item-level ambiguity, often yielding unstable comparisons across annotator subsets. We introduce STABLEVAL, a disagreement-aware evaluation framework that models latent item correctness and annotator-specific confusion patterns to produce posterior expected item credit and calibrated agent-level scores. Unlike label-denoising approaches such as Dawid-Skene, STABLEVAL is explicitly designed for stable and uncertainty-aware system evaluation rather than hard label recovery. We formalize ranking stability as a first-class evaluation objective and analyze how aggregation methods preserve or distort underlying annotator behavior. Across controlled synthetic experiments and multiple real-world human-annotated benchmarks, majority vote exhibits increasing score error and ranking instability under annotator heterogeneity and adversarial noise, while STABLEVAL yields more stable and statistically grounded system rankings. These results demonstrate that modeling disagreement is essential for robust and reproducible AI evaluation.
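The contrast between majority vote and a posterior expected item credit can be illustrated with a deliberately simplified model. The sketch below is not STABLEVAL: it assumes each annotator has a single known accuracy (a one-coin simplification, whereas the abstract describes annotator-specific confusion patterns inferred alongside latent correctness), and the function names and numbers are hypothetical.

```python
def majority_vote(votes):
    """Hard label: 1 if strictly more than half of the annotators voted 'correct'."""
    return int(sum(votes) * 2 > len(votes))

def posterior_credit(votes, accuracies, prior=0.5):
    """Posterior probability that the item is correct, given votes and annotator accuracies.

    Assumes annotators vote independently and each reports the true label
    with probability equal to their accuracy.
    """
    like_correct, like_wrong = prior, 1.0 - prior
    for vote, acc in zip(votes, accuracies):
        like_correct *= acc if vote == 1 else (1.0 - acc)
        like_wrong *= (1.0 - acc) if vote == 1 else acc
    return like_correct / (like_correct + like_wrong)

votes = [1, 0, 0]               # one reliable annotator says correct, two weaker ones disagree
accuracies = [0.9, 0.6, 0.6]    # hypothetical annotator reliabilities
hard = majority_vote(votes)
soft = posterior_credit(votes, accuracies)
```

Majority vote gives the item zero credit, while the reliability-aware posterior gives it 0.8, since one highly reliable vote outweighs two weak ones. A soft score of this kind changes smoothly when the annotator subset changes, which is the kind of stability under heterogeneity that the abstract emphasizes.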
