arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.28321 2026-05-28 cs.SE cs.AI

Multi-Agent LLM-based Metamorphic Testing for REST APIs

Shehroz Khan, Abdullah Mughees, Gaadha Sudheerbabu, Tanwir Ahmad, Dragos Truscan

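AI summary: Proposes ARMeta, which uses an LLM-based multi-agent workflow to automatically identify metamorphic test scenarios and generate executable tests, addressing the oracle problem in REST API testing.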

Comments
Author-submitted version accepted for publication at the IEEE Conference on Computers, Software, and Applications (COMPSAC 2026), July 7-11, 2026, Madrid, Spain
English abstract

As REST APIs become an increasingly significant part of software systems, their validation is becoming more critical. Hence, testing and uncovering underlying issues are of utmost importance for improving software quality. However, testing REST APIs is challenging mainly due to the difficulty of assessing whether the output of an API call is correct, i.e., the test oracle problem. Metamorphic testing is a specification-based testing approach for situations where correct outputs are unknown or not specified explicitly. To check the correctness of a system, relations between the different outputs are specified. We present ARMeta, a tool-supported approach that uses an LLM-based multi-agent workflow to support metamorphic testing of REST APIs documented with OpenAPI. The agentic workflow is used to identify metamorphic test scenarios and specify them in the Given-When-Then format. These scenarios are automatically implemented as executable tests and executed against the system under test. We evaluate ARMeta on two publicly available web applications that expose REST interfaces and compare its performance with a scenario-based testing baseline. The results show that ARMeta explores behaviors that serve as a complement to existing scenario-based testing approaches.
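The idea of a metamorphic relation checked in Given-When-Then form can be illustrated with a small sketch. This is not ARMeta's generated output: the endpoint, field names, and data are hypothetical, and an in-memory function stands in for the REST call so the example runs without a server.

```python
# Hypothetical in-memory stand-in for a REST endpoint such as
# GET /products?max_price=... (endpoint, fields, and data are assumptions).
PRODUCTS = [
    {"id": 1, "name": "pen", "price": 2},
    {"id": 2, "name": "book", "price": 15},
    {"id": 3, "name": "lamp", "price": 40},
]


def list_products(max_price=None):
    """Simulate the API call and return the decoded JSON body."""
    if max_price is None:
        return list(PRODUCTS)
    return [p for p in PRODUCTS if p["price"] <= max_price]


def check_subset_relation(source_max, followup_max):
    """Check one metamorphic relation in Given-When-Then form.

    Given a source request with max_price=source_max,
    When a follow-up request uses a stricter bound (followup_max <= source_max),
    Then the follow-up result must be a subset of the source result.
    """
    assert followup_max <= source_max
    source_ids = {p["id"] for p in list_products(max_price=source_max)}
    followup_ids = {p["id"] for p in list_products(max_price=followup_max)}
    # No expected output is needed: only the relation between outputs is checked.
    return followup_ids <= source_ids


relation_holds = check_subset_relation(source_max=40, followup_max=15)
```

The check never needs to know the correct product list, which is how metamorphic testing sidesteps the oracle problem described in the abstract.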

2605.28320 2026-05-28 cs.RO cs.AI

Identifying Explicit Parsimonious Piece-wise Polynomial Relationships in Industrial time-series: Application to manipulator robots

Mazen Alamir, Sacha Clavel

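AI summary: Proposes an algorithm that builds explicit piece-wise representations from the set of polynomials in implicit relationships, in order to identify parsimonious explicit piece-wise polynomial relationships in industrial time series; applied to identifying the inverse model of manipulator robots, where experiments indicate that the model generalizes better than deep neural networks.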

English abstract

This paper addresses the problem of identifying parsimonious explicit piece-wise polynomial relationships that might involve a relatively large number of raw features. The algorithm leverages a recently proposed identification algorithm that yields parsimonious implicit relationships, which make it possible to derive a normality characterization in the context of anomaly detection and localization. The algorithm proposed in this paper goes a step further by deriving explicit piece-wise representations that are built using the set of polynomials involved in the implicit representations. The framework is illustrated on the problem of identifying parsimonious explicit representations of the inverse model of a 6-axis manipulator robot. Further experiments on a 4-axis robot investigate the generalization capability of parsimonious models compared to state-of-the-art DNN structures when models face unseen contexts of use.

2605.28317 2026-05-28 cs.LG cs.AI cs.NA math.NA physics.comp-ph

Hybrid Neural World Models

Pranav Lakshmanan, Paras Chopra

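AI summary: Proposes hybrid neural world models, in which a single network with continuous horizon conditioning is trained to predict future states directly and an error map implicitly captures discontinuities, enabling efficient and reliable simulation of physical dynamics.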

Comments
Preprint. Under review
English abstract

Neural surrogates promise large speedups over classical solvers for physical dynamics but fail silently at sharp dynamical events such as shocks, fronts, and contact. We present hybrid neural world models for physical dynamics: a recipe for training and deploying multi-horizon surrogates in physical state space, where a single network with continuous horizon conditioning is trained with direct supervision against textbook reference solvers to predict any future state at horizon T in one forward pass. Although no part of the training data, loss function, or architecture supervises discontinuity location, the trained surrogate encodes it implicitly, recoverable from its forward passes alone as a per-trajectory error map that concentrates on shocks, fronts, and contacts, and stays small elsewhere. The map is competitive with or better than standard label-free baselines including deep ensembles, learned error heads, gradient-magnitude indicators, and locally-adaptive conformal prediction, while using only a single trained network and requiring no calibration set or governing-equation knowledge. The recipe supports two operating points. Mode 1 runs the surrogate alone for maximum throughput, with same-hardware CPU speedups of 26x to 72x against textbook solvers on the PDE environments. Mode 2 uses the error map to gate a reference-solver fallback, deferring uncertain trajectories and roughly halving the surrogate's residual error at the default operating point. The recipe applies without modification across reaction-diffusion, compressible Euler, and rigid-body collision dynamics.

2605.28315 2026-05-28 cs.CL

HardMTBench: Stress-Testing Chinese-English Translation on Knowledge-Intensive Domains

Zheng Li, Mao Zheng, Mingyang Song, Tianxiang Fei

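AI summary: To address the saturation of existing Chinese-English translation benchmarks, proposes HardMTBench, a difficulty-aware diagnostic benchmark built through a multi-stage construction pipeline and a hardness fusion rule; across 12 knowledge-intensive domains it markedly widens performance differences between systems and exposes terminology and knowledge weaknesses.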

English abstract

General-purpose machine translation benchmarks such as FLORES-200 have reached a saturation regime on Chinese-English pairs, where modern large language models cluster within a narrow band of high scores. Across 22 systems, FLORES-200 zh-en GEMBA scores fall in a 7.87-point range with a standard deviation of 2.29, which compresses the separation between systems on knowledge-intensive domains such as finance, healthcare, law, and science and technology. We introduce HardMTBench, a difficulty-aware diagnostic benchmark for bidirectional Chinese-English domain translation. HardMTBench covers 12 domains and contains 10,000 hand-curated source sentences with reference translations, packaged as 20,000 directional test items. A three-stage construction pipeline builds a domain-balanced candidate pool of 84,566 pairs, applies an LLM-based multi-signal judge over knowledge density, translation difficulty, terminology load and reference correctness, and assembles the final test set under a hardness fusion rule with per-domain quotas. Across 22 systems spanning general LLMs, commercial engines and specialised MT models, HardMTBench widens the cross-system GEMBA range by roughly a factor of two over FLORES-200, induces visible rank reorderings, and exposes domain-specific terminology and knowledge weaknesses that quality-only metrics tend to flatten. All data and code are open-sourced at https://github.com/jasonNLP/HardMTBench.

2605.28313 2026-05-28 cs.CL

Argument Quality Assessment with Large Language Models: A Pairwise Bradley-Terry Approach

Nicolás Benjamín Ocampo, Agnes Paullate Nyiranziza, Davide Ceolin

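AI summary: Uses 12 open-weight LLMs with pairwise comparisons and a Bradley-Terry model to assess argument quality; Llama-70B shows moderate agreement with human expert judgments (Cohen's κ = 0.493), while other models vary in performance but are complementary.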

English abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in tasks related to reasoning and judgment. However, assessing the quality of arguments requires a rigorous evaluation. We investigate the extent to which LLMs can effectively perform this task. We tested 12 open-weight LLMs of different sizes and families under zero-shot, few-shot, and chain-of-thought settings to approximate expert pairwise comparisons of argument quality across three dimensions (logical, rhetorical, and dialectic) and used these comparisons in a Bradley-Terry model to infer latent strength scores and derive a ranking of arguments. Our insights show that LLMs have promising but moderate correlation with human expert judgments, with Llama-70B obtaining the strongest alignment, reaching moderate Cohen's κ = 0.493 and moderate correlations with Bradley-Terry scores derived from these annotations (Kendall, Pearson, and Spearman: 0.327-0.477). Other LLMs exhibit weak, moderate, or high alignment with Llama-70B while achieving comparable results against human experts, suggesting partial but complementary understanding of underlying quality dimensions despite differences in model size and family. Moreover, LLM predictions are stable across trial runs, with fewer than 7.75% of cases yielding different labels. Remaining variability is handled via majority voting and few-shot prompting for large-size models.
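To make the step from pairwise comparisons to latent strength scores concrete, here is a minimal sketch of fitting a Bradley-Terry model with the standard minorization-maximization update. The function name, the win counts, and the iteration count are illustrative assumptions; the abstract does not say which fitting procedure the authors used.

```python
def fit_bradley_terry(wins, n_items, iterations=200):
    """Estimate Bradley-Terry strengths with the standard MM update.

    wins[(i, j)] is the number of times item i was preferred over item j.
    Returns strengths normalized to sum to 1.
    """
    strengths = [1.0] * n_items
    for _ in range(iterations):
        updated = []
        for i in range(n_items):
            total_wins = sum(wins.get((i, j), 0) for j in range(n_items) if j != i)
            denom = 0.0
            for j in range(n_items):
                if j == i:
                    continue
                # Number of comparisons between i and j, in either direction.
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (strengths[i] + strengths[j])
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        norm = sum(updated)
        strengths = [s / norm for s in updated]
    return strengths


# Hypothetical pairwise judgments over three arguments (4 comparisons per pair).
wins = {(0, 1): 3, (1, 0): 1, (0, 2): 4, (1, 2): 3, (2, 1): 1}
strengths = fit_bradley_terry(wins, n_items=3)
ranking = sorted(range(3), key=lambda i: strengths[i], reverse=True)
```

Under the model, the probability that argument i is preferred over argument j is strengths[i] / (strengths[i] + strengths[j]), so sorting by strength yields the ranking.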

2605.28312 2026-05-28 cs.RO cs.CV

EventShiftFlow: Towards Hardware-efficient FPGA-based Flow Estimation

Arianna Alonso Bizzi, Fernando Cladera, C. J. Taylor

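AI summary: Proposes an event-camera-based flow estimation method that discretizes events, builds a 1-bit spatial occupancy grid, and evaluates velocity hypotheses in parallel using only fixed-width integer logic, with no frame reconstruction, floating-point arithmetic, or iterative optimization; suited to low-latency robotic perception.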

Comments
10 pages, 5 figures. Accepted to the IEEE ICRA 2026 Workshop on Challenges and Opportunities of Neuromorphic Field Robotics and Automation
English abstract

Event-based vision sensors offer asynchronous, high-temporal-resolution measurements that are attractive for low-latency robotic perception, but many event-based motion estimation methods are computationally intensive and difficult to map to FPGA hardware. We present a streaming velocity estimator that discretizes asynchronous events into fixed-duration time bins, constructs a 1-bit spatial occupancy grid, and evaluates multiple velocity hypotheses in parallel using only fixed-width integer logic - shift registers, counters, comparators, and small LUT-mapped multiplies - with no dividers and no DSP blocks. It requires no frame reconstruction, no floating-point arithmetic, and no iterative optimization. The method deliberately trades dense sub-pixel optical flow for a sparse, quantized velocity estimate at each active pixel, suited to low-latency tasks such as reactive obstacle avoidance on size-, weight-, and power-constrained platforms. On noisy synthetic data with known ground-truth velocities, the method recovers both magnitude and direction, with magnitude estimates being most challenged when objects of different velocities intersect. On a real event-camera sequence, directional accuracy reaches 99.5% across all four evaluated motion segments, with performance remaining robust across occupancy densities in the 10-40% range. We characterize the algorithm's density-dependent behavior, present a parameter sensitivity analysis, show that the proposed datapath requires less than 2 kB of storage, and implement a single-axis prototype on a low-cost Xilinx Artix-7.
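The binning, 1-bit occupancy, and hypothesis-scoring steps can be sketched in software with integer operations only. This is a simplified single-axis illustration that returns one global estimate per bin pair, not the authors' per-pixel FPGA datapath; the array width, bin duration, and hypothesis set are assumptions.

```python
WIDTH = 16                       # pixels along the single axis (assumed)
BIN_SHIFT = 10                   # bin duration = 2**10 time units, so binning is a right shift (assumed)
HYPOTHESES = (-2, -1, 0, 1, 2)   # candidate velocities in pixels per bin (assumed)
MASK = (1 << WIDTH) - 1


def occupancy_bins(events, n_bins):
    """Build one 1-bit occupancy row per time bin from (timestamp, x) events."""
    rows = [0] * n_bins
    for t, x in events:
        b = t >> BIN_SHIFT
        if 0 <= b < n_bins and 0 <= x < WIDTH:
            rows[b] |= 1 << x
    return rows


def shift_row(row, v):
    """Shift an occupancy row by v pixels, dropping bits that leave the array."""
    return (row << v) & MASK if v >= 0 else row >> -v


def best_velocity(prev_row, curr_row):
    """Pick the hypothesis whose shifted previous row overlaps the current row most."""
    best_v, best_score = 0, -1
    for v in HYPOTHESES:
        # Counter: number of pixels set in both the shifted and the current row.
        score = bin(shift_row(prev_row, v) & curr_row).count("1")
        if score > best_score:   # comparator; ties keep the earlier hypothesis
            best_v, best_score = v, score
    return best_v


# A two-pixel edge moving one pixel per bin in the positive direction.
events = [(0, 3), (100, 4), (1024, 4), (1500, 5)]
rows = occupancy_bins(events, n_bins=2)
estimated = best_velocity(rows[0], rows[1])   # 1
```

The loop over hypotheses is sequential here; in hardware each hypothesis could be scored by its own shift-and-count unit, which is what makes the parallel evaluation described in the abstract natural.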

2605.28309 2026-05-28 cs.LG cs.NE

Learning to Assess the Reliability of Number-of-Runs Estimation in Stochastic Optimization

Sara Gjorgjieva, Eva Tuba, Tome Eftimov

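AI summary: Addresses the reliability of number-of-runs estimation in stochastic optimization with a learning-based extension that trains classifiers on statistical and shape features to predict whether an estimate is reliable, achieving high minority-class recall in the within-configuration scenario.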

Comments
Preprint version of a poster accepted at the Genetic and Evolutionary Computation Conference 2026 (GECCO 2026)
English abstract

In large-scale benchmarking of stochastic optimization algorithms, the key challenge is no longer whether repeated runs are needed for reliability, but how to determine when sufficient evidence has been collected without incurring unnecessary computational cost. We study a learning-based extension of a recent empirical online heuristic that adaptively estimates the required number of runs using outlier handling and skewness-based symmetry checks. Using annotated outcomes from 132,000 Nevergrad runs on COCO (24 problems in 20 dimensions, 10 instances each, 11 optimizers), we train classifiers on 23 statistical, energy-free, and shape and stability features to predict whether a run-number estimate is reliable, prioritizing detection of incorrect estimates via minority-class recall. We evaluate reliability prediction using a within-configuration learning setup, where models are trained and tested on data sharing the same optimizer. The results show that run-number reliability can be learned in a within-configuration scenario, enabling detection of unreliable estimates with high minority-class recall, although performance remains limited by the restricted data diversity within fixed configurations.

2605.28308 2026-05-28 cs.CL

HELEA: Hard-Negative Benchmark and LLM-based Reranking for Robust Entity Alignment

Yoonjin Jang, Junwoo Kim, Youngjoong Ko

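AI summary: To address models in existing entity alignment benchmarks relying on name overlap rather than relational structure, proposes a same-name hard-negative augmentation strategy that yields quality-controlled evaluation benchmarks and training corpora, and designs HELEA, a two-stage framework (entity encoder retrieval plus LLM reranking) that achieves robust alignment on the hard-negative benchmarks.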

Comments
10 pages, 3 figures, 9 tables. Code and benchmarks available at https://github.com/Wnsdnl/HELEA
English abstract

Entity Alignment (EA) is essential for knowledge graph (KG) fusion, but existing benchmarks often allow models to exploit name overlap rather than relational structure. This makes it difficult to evaluate whether models can reject same-name entities that refer to different real-world objects. Our primary contribution is a same-name hard-negative augmentation strategy that simultaneously yields quality-controlled evaluation benchmarks (DW-HN29K, DY-HN27K) and augmented training corpora (DW-Train, DY-Train), by mining same-name but distinct entity pairs from KG name-collision groups. We further introduce HELEA, a two-stage framework integrating (i) entity encoder retrieval trained on hard-negative-augmented training corpora with 1-hop KG context, and (ii) LLM-based reranking without additional training. Experiments show that name-dependent baselines collapse to near-random performance on our hard-negative benchmarks, whereas HELEA achieves an F1 of 0.967 on DW-HN29K while maintaining a Hit@1 of 0.993 on the standard DW-15K.
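Mining same-name but distinct pairs from name-collision groups can be sketched as a group-by on normalized names followed by filtering out known alignments. The entity identifiers, the normalization rule, and the function name below are hypothetical; the paper's quality-control steps are not reproduced, and this sketch does not restrict pairs to entities from different KGs.

```python
from collections import defaultdict
from itertools import combinations


def mine_same_name_negatives(entities, aligned_pairs):
    """Return same-name entity pairs that are not known alignments.

    entities: iterable of (entity_id, name) tuples drawn from both KGs.
    aligned_pairs: set of frozenset({id_a, id_b}) gold alignments.
    """
    groups = defaultdict(list)
    for entity_id, name in entities:
        # Simple normalization (an assumption): trim whitespace, ignore case.
        groups[name.strip().lower()].append(entity_id)

    negatives = []
    for ids in groups.values():
        if len(ids) < 2:
            continue  # no name collision for this name
        for a, b in combinations(sorted(ids), 2):
            if frozenset((a, b)) not in aligned_pairs:
                negatives.append((a, b))
    return negatives


# Hypothetical entities: two KGs both contain "Paris", one KG has a second Paris.
entities = [
    ("kg1:Paris_France", "Paris"),
    ("kg2:Paris_France", "Paris"),
    ("kg2:Paris_Texas", "paris "),
    ("kg1:Lyon", "Lyon"),
]
aligned = {frozenset(("kg1:Paris_France", "kg2:Paris_France"))}
hard_negatives = mine_same_name_negatives(entities, aligned)
```

A name-only matcher would accept every pair in `hard_negatives`, which is why such pairs are useful for testing whether a model relies on relational context.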

2605.28306 2026-05-28 cs.CL cs.AI

Routing-Aligned Fine-Tuning for Multilingual Downstream Tasks in Mixture-of-Experts Models

Guanzhi Deng, Kuan Wu, Haibo Wang, Shing Yin Wong, Sichun Luo, Linqi Song

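AI summary: To address heterogeneous routing structure in Mixture-of-Experts models on multilingual downstream tasks, proposes RA-MoE, a three-stage framework that identifies task-relevant experts via a language-universal alignment zone in the middle layers and adds a routing alignment loss to strengthen target-language routing; experiments show that it outperforms standard fine-tuning and strong baselines.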

English abstract

Mixture-of-Experts (MoE) models have emerged as a dominant paradigm for efficient LLM scaling, yet adapting them to non-English downstream tasks remains challenging. Existing fine-tuning approaches treat MoE models as monolithic learners, ignoring the heterogeneous routing structure that develops during pretraining. We validate across multiple MoE models and downstream tasks that middle layers form a language-universal alignment zone where routing divergence strongly predicts per-language task performance gaps. Building on this observation, we propose RA-MoE (Routing-Aligned MoE Fine-Tuning), a three-stage framework that categorizes parallel task examples into a four-way taxonomy (cc/ci/ic/ii) based on correctness in English and the target language, identifies task-relevant experts in the middle layers, and augments standard SFT with a routing alignment loss that encourages target-language routing on ci-type examples to follow the English task-expert activation pattern. Experiments across three MoE models, three tasks, and six target languages demonstrate that RA-MoE consistently outperforms standard SFT and strong baselines including Routing Steering and RISE, with the ci proportion of a task-language pair serving as a reliable predictor of alignment benefit.

2605.28305 2026-05-28 cs.CL cs.AI

Revisiting Anthropomorphic Reflection Markers in Large Language Model Reasoning

Yahan Yu, Noa Nakanishi, Fei Cheng

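AI summary: Suppresses anthropomorphic reflection markers through prompt-level and token-level interventions and finds that these markers are not necessary for reasoning performance; models can still perform marker-free verification after suppression, indicating that the markers are surface cues rather than reliable proxies for reflection.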

Comments
15 pages, 12 figures
English abstract

Large Language Models (LLMs) often produce explicit reflective traces during complex reasoning, accompanied by anthropomorphic markers such as "wait", "hmm", and "alternatively". Although these markers are commonly used as visible indicators of reflection, their mechanisms remain unclear, and redundant, repetitive reflection markers carry a risk of overthinking. In this work, we revisit anthropomorphic reflection markers, examining their necessity for reasoning and their role in reflection. We suppress these markers through prompt-level and token-level interventions, and analyze their effects on task performance across four benchmarks and two model scales. Our results show that anthropomorphic markers are not uniformly necessary for reasoning performance: suppressing them can preserve or improve performance in several settings, especially under larger sampling budgets. Meanwhile, marker suppression does not necessarily remove reflection behavior, as models can still perform marker-free verification. These results suggest that anthropomorphic markers tend to be surface cues rather than reliable proxies for reflection itself, and motivate future research on reasoning mechanisms beyond explicit marker patterns.

2605.28304 2026-05-28 cs.LG

Compositional Generalization in Autoregressive Models via Logit Composition

Aakash Kumar, Maria Sofia Bucarelli, Emanuele Natale

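AI summary: Inspired by composition methods for diffusion models, proposes a new composition strategy for autoregressive systems that is projective under a factorized-conditionals assumption, and shows that this property is preserved under smooth reparameterizations of the output space and that composition preserves length-generalizing behavior.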

English abstract

Composing autoregressive models remains a core challenge in understanding how large language models can combine behaviors or skills learned across tasks. We introduce a new and principled composition strategy for autoregressive systems, inspired by composition methods developed for diffusion models. Under a factorized-conditionals assumption, we show that the resulting composition is projective: each component model preserves control over its own designated subspace of the output distribution, avoiding interference between models. This property is further preserved under smooth reparameterizations of the output space, yielding a feature-space theorem. Finally, we show that composition preserves length-generalizing behavior when the factorization assumptions and component guarantees hold uniformly at the target length. These results provide a principled understanding of when model composition and merging succeed in autoregressive systems and identify conditions under which their interactions remain stable.

2605.28303 2026-05-28 cs.AI

From Fact Overwriting to Knowledge Evolution: Causal Editing via On-Policy Self-Distillation

Shuaike Li, Kai Zhang, Xianquan Wang, Jiachen Liu, Shengpeng Mo

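AI summary: To address the epistemic dissonance caused by the static fact-overwriting paradigm in knowledge editing, proposes CODE, a causally guided on-policy distillation method that turns fact injection into coherent knowledge evolution, substantially lowering the self-refutation rate while securing robust multi-hop accuracy.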

English abstract

While Knowledge Editing (KE) enables efficient updates, its dominant Static Fact Overwriting paradigm treats LLMs as discrete databases, forcibly injecting isolated facts. Fracturing pre-trained logical topologies, this triggers Epistemic Dissonance -- a pathology where un-evolved legacy priors force the model to explicitly negate the injected update. Idealized interventions reveal that this is an inherent structural flaw rather than mere algorithmic noise, with a zero-distortion proxy yielding a catastrophic 95.6% self-refutation rate. Given the causally driven nature of real-world knowledge, grounding updates in explicit causal narratives effectively collapses this conflict rate to just 6.6%, underscoring the imperative for a paradigm shift toward Causal Editing. To internalize this evolution, we propose CODE (Causal On-policy Distillation for Editing). By coupling causal bootstrapping with asymmetric on-policy distillation, CODE engraves causal transition logic directly into parametric memory. Experiments on LLaMA-3.1 and Qwen-2.5 show CODE drastically suppresses self-refutation to 1.8% while securing robust multi-hop accuracy (up to 83.5%), seamlessly transforming discrete fact injection into coherent knowledge evolution. Code is available at https://github.com/CrashBugger/CODE.

2605.28302 2026-05-28 cs.LG cs.AI cs.DC

How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving

Hanjiang Wu, Abhimanyu Rajeshkumar Bambhaniya, Sarbartha Banerjee, Tuhin Khare, Sudarshan Srinivasan, Suvinay Subramanian, Souvik Kundu, Madhu Kumar, Midhilesh Elavazhagan, William Won, Amir Yazdanbakhsh, Tushar Krishna

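AI summary: Systematically explores the benefits and limits of successive disaggregation levels for MoE inference, from chunked prefill and prefill-decode disaggregation to operator-level Attention-FFN Disaggregation (AFD), using a framework that fuses on-device kernel measurements with high-fidelity network simulation; under strict TTFT/TPOT SLOs, AFD sustains around 4k tokens/s of system throughput on DeepSeek-V3.2, and the paper gives concrete design principles for jointly optimizing throughput and interactivity.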

English abstract

Modern large language model (LLM) inference has progressively disaggregated to keep pace with growing model sizes and tight TTFT and TPOT service-level objectives: from chunked-prefill aggregation, to prefill-decode (P/D) disaggregation, and most recently to operator-level Attention-FFN Disaggregation (AFD). This trend is especially important for mixture-of-experts (MoE) models, where memory-bound attention, compute-intensive expert FFNs, and MoE dispatch/combine communication create distinct resource demands. AFD further exposes this heterogeneity by placing attention and MoE-FFN execution on separate GPU groups. Each level of disaggregation deepens the scheduling design space across workload characteristics, resource allocation, and interconnect topology, raising the central question: when does each level actually pay off? We systematically characterize this trade-off for MoE inference across realistic workloads spanning input/output sequence lengths, prefix-KV reuse, and per-user latency constraints. Using chunked-prefill and P/D disaggregation as baselines, we study the benefits and limits of AFD at scale through a framework that fuses on-device kernel measurements with high-fidelity network simulation. Under strict TTFT/TPOT SLOs, AFD sustains around 4k tokens/s of system throughput on DeepSeek-V3.2 across chat, coding, and agentic-coding workloads, where non-AFD deployments are infeasible. We distill concrete takeaways for jointly optimizing throughput and interactivity, including how to partition attention and FFN across GPUs as a function of workload and model architecture, providing design principles for current rack- and cluster-scale deployments as well as future disaggregated AI infrastructure.

2605.28301 2026-05-28 cs.AI

Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation

Zhaoyang Jiang, Xuanqi Peng, Fei Teng, Zhizhong Fu, Yunsoo Kim, Jiacong Mi, Zicheng Li, Honghan Wu

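AI summary: Training a small model by distilling a large model's chain of thought improves answer accuracy in medical QA but raises the factual error rate of the reasoning steps, showing that answer quality and trace factuality can move in opposite directions.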

English abstract

Chain-of-thought (CoT) distillation trains a smaller model to imitate a teacher's reasoning trace, but it is typically evaluated by final-answer metrics including accuracy. We ask whether gains in answer quality are accompanied by improvements in the trace. In medical QA, where short answer options can leave a richer clinical justification under-specified, a Qwen3-8B student distilled from a DeepSeek-V3-family teacher improves on MedQA-USMLE answer metrics (SC@64 74.7% to 84.4%; expected calibration error (ECE) 0.096 to 0.034). Yet under a Kimi-K2.6 style-blind LLM-judge audit, its error rate over non-abstained steps rises from 30.6% to 50.3%. In this primary medical setting, answer quality and trace factuality move in opposite directions. This before-after pattern persists across evaluators, teacher strengths, student scales and families, medical benchmarks, and style, segmentation, and answer-correctness controls. A 150-step blinded audit by a clinical expert reproduces the same ordering. Boundary checks narrow the scope of the claim: the risk appears when a compact answer under-constrains the rationale and a capable student can imitate expert-like form without reliably grounding each local claim. Standard answer metrics and aggregate hedging rates do not reveal the shift. When such traces are released or reused, answer-level metrics alone are insufficient.
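For readers unfamiliar with the ECE figure quoted above, the following sketch computes expected calibration error with equal-width confidence bins. The bin count and the toy data are assumptions; the abstract does not state the exact binning used in the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Toy data: two answers at 0.9 confidence (both right), two at 0.6 (one right).
confidences = [0.9, 0.9, 0.6, 0.6]
correct = [True, True, True, False]
ece = expected_calibration_error(confidences, correct)  # approximately 0.1
```

ECE is computed from final answers only, so it says nothing about whether the intermediate reasoning steps are factual, which is the gap the step-level audit targets.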

2605.28300 2026-05-28 cs.LG

T-GINEE: A Tensor-Based Multilayer Graph Representation Learning

Maolin Wang, Ziting Mai, Xuhui Chen, Zhiqi Li, Tianshuo Wei, Yutian Xiao, Wenlin Zhang, Wanyu Wang, Ruocheng Guo, Haoxuan Li, Zenglin Xu, Xiangyu Zhao

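AI summary: To address existing methods' failure to capture complex inter-layer dependencies in multilayer networks, proposes T-GINEE, a framework combining tensor decomposition with generalized estimating equations to model cross-network correlations explicitly; theory establishes consistency and asymptotic normality, and experiments validate its effectiveness.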

Comments
Accepted by ICML 2026
English abstract

While traditional network analysis focuses on single-layer networks, real-world systems often form multilayer networks with multiple relationship types. However, existing methods typically fail to capture complex inter-layer dependencies because they treat layers independently or aggregate them. To address this, we propose T-GINEE (Tensor-Based Generalized Multilayer-graph Estimating Equation), a statistical regularization framework combining tensor-based generalized estimating equations with task-specific loss to model cross-network correlations explicitly. Key innovations include: (1) CP tensor decomposition capturing structural dependencies via shared latent factors; (2) a generalized estimating equation framework modeling inter-layer correlations through working covariance matrices; and (3) a flexible link function accommodating characteristics like sparsity. Our theoretical analysis establishes consistency and asymptotic normality under mild conditions. Extensive experiments on synthetic and real-world datasets validate T-GINEE's effectiveness for multilayer network analysis.

2605.28298 2026-05-28 cs.AI

REED: Post-Training Representation Editing for Cross-Domain Linguistic Steganalysis

Ruohan Lei, Jianxin Gao, Wanli Peng, Huimin Pei

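AI summary: Proposes a post-training representation editing method that guides edits with a constructed domain-offset vector and a source-domain cover-to-stego direction, enabling efficient cross-domain linguistic steganalysis with no architecture modification or parameter updates.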

English abstract

In real-world linguistic steganalysis, test texts usually come from unseen domains with different vocabularies, topics, writing styles, and steganographic generation patterns, which can significantly degrade detection performance. Although existing cross-domain steganalysis methods alleviate this problem through techniques such as distribution alignment and domain-invariant feature learning, their detection performance is still not satisfactory. In this paper, we propose a post-training representation editing method for cross-domain linguistic steganalysis. Specifically, the detector is first trained on source-domain data; the feature extractor and classifier are then kept frozen, and the intermediate representations are deterministically edited before classification. For domain adaptation, we construct a domain-offset vector from marginal source and target representations. For domain generalization, we derive a source-domain cover-to-stego direction to guide sample-specific editing. Experimental results show that, compared with advanced methods, the proposed method achieves high cross-domain detection performance, especially in terms of F1-score, while requiring no architecture modification or parameter updates after source-domain training.
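One plausible reading of the domain-adaptation step is sketched below: the offset is taken as the difference between the mean target and mean source representations, and each target representation is shifted back by that offset before the frozen classifier sees it. The abstract does not give the actual construction, so the mean-difference formula, the `strength` parameter, and all function names are assumptions.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def domain_offset(source_reps, target_reps):
    """Assumed offset: mean target representation minus mean source representation."""
    mu_source = mean_vector(source_reps)
    mu_target = mean_vector(target_reps)
    return [t - s for s, t in zip(mu_source, mu_target)]


def edit_representation(rep, offset, strength=1.0):
    """Deterministically shift a target representation toward the source domain."""
    return [h - strength * o for h, o in zip(rep, offset)]


# Toy 2-dimensional intermediate representations (unlabeled on both sides).
source_reps = [[0.0, 0.0], [2.0, 2.0]]
target_reps = [[3.0, 1.0], [5.0, 3.0]]
offset = domain_offset(source_reps, target_reps)
edited = edit_representation([4.0, 2.0], offset)
```

Because only the intermediate representation changes, the feature extractor and classifier weights stay untouched, which matches the abstract's claim of no parameter updates after source-domain training.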

2605.28296 2026-05-28 cs.LG nucl-ex physics.ins-det

Machine Learning methods for event classification and vertex reconstruction of the 12C + 12C reaction with the MATE-TPC

基于MATE-TPC的12C + 12C反应事件分类和顶点重建的机器学习方法

Minghui Zhang, Xiaobin Li, Jie Chen, Ningtao Zhang, Fenhua Lu, Junrui Ma, Jiazhen Yan, Wanqin Tu, Xiaodong Tang, Bingshui Gao, Chengui Lu, Zhichao Zhang, Jinlong Zhang, Weiping Liu

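AI summary: Applies deep learning models such as ResNet and VGG to classify 12C + 12C reaction events, reaching about 97% accuracy on simulated data and 90% on experimental data, and uses a CNN to reconstruct the reaction vertex.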

详情
AI中文摘要

在现代核物理实验中,使用活性靶时间投影室(TPC)进行核反应研究时,识别感兴趣的事件具有挑战性。本工作采用机器学习技术分析来自名为MATE(用于核实验的多用途活性靶时间投影室)的TPC的12C + 12C聚变反应的复杂数据。具体来说,我们成功应用了残差神经网络(ResNet-50、ResNet-34和ResNet-18)和视觉几何组(VGG-19)对12C + 12C反应中的弹性散射和聚变反应事件进行分类。四个模型的分类结果几乎相同,模拟数据的准确率约为97%,实验数据的准确率约为90%。此外,这些方法成功识别了一些被传统方法误分类的事件。这些模型还应用于对不同聚变反应通道的事件进行分类,模拟数据的分类准确率约为95%。此外,开发了一个卷积神经网络(CNN)模型来重建反应顶点,为顶点重建提供了另一种策略。这些结果表明,机器学习技术可以有效分类不同通道的反应事件并重建反应顶点,从而为未来复杂核反应数据的分析铺平道路。

英文摘要

In modern nuclear physics experiments, identifying events of interest is challenging for nuclear reaction studies with the active target Time Projection Chamber (TPC). In this work, machine learning techniques are employed to analyze the complex data of the 12C + 12C fusion reaction from a TPC named MATE (multi-purpose active-target time projection chamber for nuclear experiments). Specifically, we successfully applied Residual Neural Network (ResNet-50, ResNet-34 and ResNet-18) and Visual Geometry Group (VGG-19) to classify elastic scattering and fusion reaction events from the 12C + 12C reaction. The classification results of the four models are nearly identical, with accuracies of approximately 97% for the simulated data and 90% for the experimental data. Moreover, these approaches successfully identify some events that are misclassified by traditional methods. These models are also applied to classify events from different fusion reaction channels, with classification accuracies of approximately 95% on simulated data. In addition, a Convolutional Neural Network (CNN) model is developed to reconstruct the reaction vertex, providing an alternative strategy for vertex reconstruction. These results indicate that machine learning techniques can effectively classify reaction events from different channels and reconstruct the reaction vertex, thereby paving the way for future analyses of complex nuclear reaction data.

2605.28295 2026-05-28 cs.AI cs.CL cs.LG

Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR

Rollouts 的起点:面向 RLVR 的低负载、高杠杆的首 token 多样化

Soeun Kim, Albert No

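AI summary: Proposes REFT, which diversifies rollouts by uniformly sampling the first token after the reasoning marker, substantially increasing rollout diversity in RLVR at low overhead and improving the Pass@k performance of reasoning models.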

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)无需标注轨迹即可训练推理模型,它依赖分组 rollout 将策略暴露于替代推理路径,并由验证器进行评分。Rollout 多样性因此成为 RLVR 的核心瓶颈,现有方法大多通过温度、前缀或 rollout 选择调整来拓宽探索。我们发现了一个结构上独特但被忽视的拓宽多样性的位置:推理标记后的第一个 token。策略的首 token 分布表现出尖锐峰值但正确性解耦的现象,且该首 token 位置可以拓宽 rollout 组覆盖的区域而不改变正确性信号。我们引入 REFT(基于首 token 多样化的 Rollout 探索),这是对 RLVR 流程的一个轻量级补充,它从策略自身的 top-$N$ 候选集中均匀采样首 token,并均匀分配 rollout,其他组件保持不变。在由此产生的多样化 rollout 上训练后,REFT 在四个基础模型(0.5B-7B)和三个难度级别上,相较于 DAPO 和 GRPO 基线,提升了聚合的 Pass@1、Pass@8 和 Pass@64。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) trains reasoning models without labeled trajectories, relying on grouped rollouts to expose the policy to alternative reasoning paths and a verifier to score them. Rollout diversity has accordingly emerged as a central bottleneck in RLVR, with most existing methods broadening exploration through temperature, prefix, or rollout-selection adjustments. We identify a structurally distinguished but overlooked position for broadening this diversity: the first token after the reasoning marker. The policy's first-token distribution is sharply peaked yet decoupled from correctness, and diversifying at this position can broaden the regions a rollout group covers without altering the correctness signal. We introduce REFT (Rollout Exploration with First-Token Diversification), a light addition to the RLVR pipeline that samples first tokens uniformly from the policy's own top-$N$ candidates and allocates rollouts evenly, leaving every other component unchanged. Trained on the resulting diversified rollouts, REFT improves aggregate Pass@1, Pass@8, and Pass@64 over DAPO and GRPO baselines across four base models (0.5B-7B) and three difficulty regimes.
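As a rough illustration of the first-token diversification described above, the sketch below picks the policy's top-$N$ first tokens and spreads a rollout group evenly over them. The shuffle-then-round-robin allocation, the function names, and the toy distribution are assumptions made for illustration; the abstract does not specify how REFT implements the allocation.

```python
import random
from collections import Counter


def top_n_tokens(first_token_probs, n):
    """Return the n most probable first tokens (ties broken alphabetically)."""
    ranked = sorted(first_token_probs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ranked[:n]]


def allocate_first_tokens(first_token_probs, n, group_size, rng):
    """Assign a forced first token to each rollout in a group.

    Candidates are shuffled and handed out round-robin, so each candidate
    receives floor or ceil of group_size / n rollouts, no matter how peaked
    the policy's own first-token distribution is.
    """
    candidates = top_n_tokens(first_token_probs, n)
    rng.shuffle(candidates)
    return [candidates[i % len(candidates)] for i in range(group_size)]


# Hypothetical, sharply peaked first-token distribution.
probs = {"First": 0.90, "Let": 0.05, "We": 0.03, "To": 0.02}
assigned = allocate_first_tokens(probs, n=3, group_size=6, rng=random.Random(0))
counts = Counter(assigned)
```

Under plain sampling most of the six rollouts would start with "First"; here each of the three candidates starts exactly two rollouts, and the rest of each rollout would be generated by the unchanged policy.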

2605.28292 2026-05-28 cs.CL

CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models

CIRF:将思维链分词化为可重用的功能单元,用于大型语言模型的高效潜在推理

Yukyung Lee, Yumeng Shen, Jinhyeong Park, Hyein Yang, Jun-Hyung Park

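AI summary: Proposes the CIRF framework, which maps semantically coherent reasoning units in explicit chains of thought to discrete functional tokens and reasons over a dynamic sequence of them, achieving a better accuracy-latency trade-off than existing implicit CoT methods on mathematical, symbolic, and commonsense reasoning benchmarks.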

详情
Comments
17 pages, 7 figures
AI中文摘要

隐式思维链通过内化显式理由来降低大型语言模型的推理成本。然而,现有方法通常缺乏与显式理由的对齐以及对示例复杂性的适应性。在这项工作中,我们提出了CIRF(思维链转化为可重用功能单元),一个隐式CoT框架,将推理作为离散功能令牌的动态序列进行。CIRF为显式CoT轨迹中的每个语义连贯推理单元分配一个功能令牌。然后对模型进行微调,以自回归方式生成功能令牌及其可选结果,随后生成最终答案。这种设计将潜在推理与功能单元序列对齐,促进了并行训练、显式理由对齐和自适应推理。在数学、符号和常识推理基准上的大量实验表明,与最先进的隐式CoT方法相比,CIRF提供了有利的准确率-延迟权衡。进一步的分析表明,CIRF构建了独特、可解释的功能令牌,从而带来一致的性能提升。

英文摘要

Implicit Chain-of-Thought (CoT) reduces the inference cost of large language models by internalizing the explicit rationales. However, existing approaches typically lack alignment with explicit rationales and adaptivity to example complexity. In this work, we propose CIRF (Chain-of-thoughts Into Reusable Functional units), an implicit CoT framework that performs reasoning as a dynamic sequence of discrete functional tokens. CIRF assigns a functional token to each semantically coherent reasoning unit in explicit CoT traces. The model is then fine-tuned to autoregressively generate functional tokens and their optional results, followed by the final answer. This design aligns latent reasoning with a sequence of functional units, facilitating parallel training, explicit rationale alignment, and adaptive reasoning. Extensive experiments on mathematical, symbolic, and commonsense reasoning benchmarks show that CIRF provides a favorable accuracy-latency trade-off compared with state-of-the-art implicit CoT methods. Further analyses demonstrate that CIRF constructs distinct, interpretable functional tokens, leading to consistent performance improvements.

2605.28290 2026-05-28 cs.LG cs.GT stat.ML

Adaptive Bandit Algorithms for Contextual Matching Markets

上下文匹配市场的自适应Bandit算法

Shiyun Lin, Simon Mauras, Vianney Perchet, Nadav Merlis

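AI summary: Proposes adaptive algorithms for bandit learning in contextual matching markets, achieving an instance-dependent poly-logarithmic regret upper bound under stochastic contexts and a sublinear regret bound under adversarial contexts.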

详情
Comments
Accepted to ICML 2026
AI中文摘要

我们研究匹配市场中的bandit学习,其中玩家和臂构成市场的两侧,玩家的效用与臂上下文呈线性关系。每一轮,新臂带着可观测的上下文到达。然后,算法将它们与玩家匹配,旨在最小化每个玩家相对于稳定匹配基准的遗憾。这种上下文结构带来了显著的复杂性:微妙的上下文偏移可能轻微改变一个玩家的效用,同时完全重构底层基准,导致其他玩家出现大的遗憾峰值。我们在两种设置下解决这个问题:随机上下文(从潜在分布中抽取)和对抗性上下文(可能是任意的)。对于随机情况,我们引入了一个新颖的最小偏好差距来捕捉学习难度,并提供了一种完全自适应的算法,具有实例相关的多对数遗憾上界。我们还在温和的分布假设下建立了匹配的实例无关遗憾上界和下界。对于对抗性设置,我们提出了一种在任意上下文下仍然有效的可处理遗憾概念,并通过自适应算法实现了实例无关的次线性遗憾界。

英文摘要

We study bandit learning in matching markets, where players and arms constitute the two market sides, and the players' utilities are linear in the arm contexts. In each round, new arms arrive with observable contexts. Then, the algorithm matches them to players, aiming to minimize each player's regret against a stable matching benchmark. This contextual structure creates significant complexity: subtle context shifts can slightly alter one player's utility while completely reconfiguring the underlying benchmark, causing large regret spikes for others. We address this in two settings: stochastic contexts, drawn from a latent distribution, and adversarial contexts, which may be arbitrary. For the stochastic case, we introduce a novel minimum preference gap to capture learning difficulty and provide a fully adaptive algorithm with an instance-dependent poly-logarithmic regret upper bound. We also establish matching instance-independent regret upper and lower bounds under a mild distributional assumption. For the adversarial setting, we propose a tractable regret notion that remains valid under arbitrary contexts and achieve an instance-independent sublinear regret bound via an adaptive algorithm.

2605.28287 2026-05-28 cs.LG cond-mat.mtrl-sci

AtomComposer: Discovering Chemical Space from First Principles with Reinforcement Learning

AtomComposer: 基于强化学习从第一性原理发现化学空间

Bjarke Hastrup, Francois Cornet, Tejs Vegge, Arghya Bhowmik

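AI summary: Proposes AtomComposer, an agent that autonomously constructs valid 3D isomers through online reinforcement learning without pretraining data, discovering up to an order of magnitude more valid isomers on unseen chemical formulas than existing methods.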

详情
AI中文摘要

在没有训练数据的情况下发现新型稳定分子仍然是一个重大的科学挑战。当前的分子生成模型是在大型预筛选数据集上训练的,这引入了偏差并限制了对新型化学的探索。相比之下,我们提出了一种新范式:能够无需任何预训练而映射广阔未知化学空间的自主、通用智能体。我们首次提出了AtomComposer,一个在化学计量约束下自主构建有效3D异构体,并仅通过在线强化学习进行训练的自我引导智能体。与通常过拟合特定化学式的现有方法不同,我们建立了一种多组分训练方案,使得在能量和有效性奖励的引导下,能够跨不同化学领域进行广泛泛化。我们的智能体在未见过的测试化学式上,能够发现比现有单组分强化学习基线(使用每步能量奖励训练)多一个数量级的有效异构体。这些结果实现了在线强化学习作为一种可扩展、从头探索化学构型空间的强大范式的承诺。

英文摘要

Discovering novel stable molecules without training data remains a grand scientific challenge. Current molecular generative models are trained on large, pre-curated datasets, which introduce biases and limit exploration of novel chemistry. In contrast, we propose a new paradigm: autonomous, generalized agents capable of mapping vast, unknown chemical spaces without any pretraining. For the first time, we present AtomComposer, a self-guided agent that autonomously constructs valid 3D isomers under stoichiometric constraints and is trained exclusively online using reinforcement learning. Unlike existing approaches that generally overfit to a specific chemical formula, we establish a multi-composition training scheme that enables a broad generalization across diverse chemistry, guided by energy- and validity-based rewards. Our agent can discover up to an order of magnitude more valid isomers on unseen test formulas than existing single-composition reinforcement-learning baselines trained with per-step energy rewards. These results fulfill the promise of online reinforcement learning as a powerful paradigm for scalable, from-scratch exploration of chemical configuration space.

2605.28283 2026-05-28 cs.CL cs.AI

PrunePath: Towards Highly Structured Sparse Language Models

PrunePath:迈向高度结构化稀疏语言模型

Zhexuan Gu, Zixun Fu, Yancheng Yuan

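AI summary: Proposes the PrunePath framework, which achieves budget-adaptive structured sparsification through softmax-normalized routing and a cumulative-mass threshold, attains a favorable sparsity-performance trade-off on natural language understanding, generation, and instruction-tuning evaluations, and uses Triton kernels to turn structured sparsity into practical memory savings and faster decoding.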

详情
AI中文摘要

前馈网络(FFN)主导了现代语言模型的参数数量和计算量,然而现有的剪枝方法往往难以将稀疏性转化为硬件友好的推理效率提升。我们引入了PrunePath,一个针对FFN层的预算自适应结构化稀疏化框架。基于MoEfication,PrunePath用软最大归一化路由分布替代独立的专家级阈值,并在累积质量阈值下激活重要专家。这种公式化施加了令牌级概率预算,实现了自适应专家数量以及从单个检查点直接推理时的稀疏性调节旋钮。在自然语言理解、自然语言生成和指令调优评估中,与现有的静态剪枝和基于MoEfication的方法相比,PrunePath实现了有利的稀疏-性能权衡。我们进一步实现了用于KV缓存解码的Triton内核,以将所得的结构化稀疏性转化为实际的内存节省和可测量的解码速度提升。这些结果证明了PrunePath在构建高度稀疏、易于部署的大型语言模型方面的优越性能。

英文摘要

Feed-forward networks (FFNs) dominate the parameter count and computation of modern language models, yet existing pruning methods often struggle to convert sparsity into hardware-friendly inference efficiency gains. We introduce PrunePath, a budget-adaptive structured sparsification framework for FFN layers. Built on MoEfication, PrunePath replaces independent expert-wise thresholding with a softmax-normalized routing distribution and activates important experts under a cumulative-mass threshold. This formulation imposes a token-level probability budget, enabling adaptive expert counts and a direct inference-time sparsity knob from a single checkpoint. Across NLU, NLG, and instruction-tuning evaluations, PrunePath achieves a favorable sparsity-performance trade-off compared with existing static pruning and MoEfication-based methods. We further implement Triton kernels for KV-cache decoding to translate the resulting structured sparsity into practical memory savings and measurable decoding-speed improvements. These results demonstrate the superior performance of PrunePath for building highly sparse, deployment-friendly large language models.
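The cumulative-mass routing idea can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: it assumes that the router produces one raw score per expert for each token, and that experts are activated in descending probability order until their summed softmax mass reaches the threshold. The function names and example scores are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw router scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_experts(router_scores, mass_threshold):
    """Activate the fewest top experts whose cumulative probability reaches the threshold."""
    probs = softmax(router_scores)
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= mass_threshold:
            break
    return sorted(chosen)


# A token with one dominant expert versus a token with a flat routing distribution.
peaked = select_experts([10.0, 0.0, 0.0, 0.0], mass_threshold=0.9)
flat = select_experts([0.0, 0.0, 0.0, 0.0], mass_threshold=0.9)
```

The same threshold activates one expert for the peaked token and all four for the flat one, which is the sense in which a per-token probability budget yields adaptive expert counts; lowering `mass_threshold` at inference time increases sparsity without retraining.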

2605.28282 2026-05-28 cs.AI

ResearchLoop: An Evidence-Gated Control Plane for AI-Assisted Research

ResearchLoop: 一种用于AI辅助研究的证据门控控制平面

Yihan Xia, Taotao Wang

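AI summary: Proposes ResearchLoop, a protocol that uses an evidence-gated control plane to keep claims in AI-assisted research auditable, including a state model, transition rules, and experimental validation.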

详情
Comments
32 pages, 4 figures, 6 tables; technical report
AI中文摘要

AI辅助研究将构思、实现、评估和手稿撰写压缩成一个单一的交互循环。这种压缩是有用的,但也带来了出版风险:论文声明可能比审计更容易陈述。我们提出了ResearchLoop,一种用于AI辅助计算研究的证据门控控制平面。ResearchLoop将研究问题、任务合同、证据对象、声明账本、结项和论文绑定视为持久的项目状态,在此实现为基于仓库的运行时。本技术报告提供了完整的协议规范、状态模型、转换规则、声明准入算法和洞察复合机制。它还报告了跨越九个版本(V0--V9)的完整实验记录,包括自托管案例研究、带有组件消融的受控任务套件研究、数学奥林匹克评估以及使用官方生成代码工具评估的补充SciCode边界实验。所有工件、清单和验证报告都保存在项目仓库中。

英文摘要

AI-assisted research compresses ideation, implementation, evaluation, and manuscript writing into a single interactive loop. This compression is useful, but it also creates a publication risk: paper claims can become easier to state than to audit. We present ResearchLoop, an evidence-gated control plane for AI-assisted computational research. ResearchLoop treats research questions, task contracts, evidence objects, claim ledgers, closeouts, and paper bindings as durable project state, realized here as a repository-backed runtime. This technical report provides the complete protocol specification, state model, transition rules, claim-admission algorithm, and insight-compounding mechanism. It also reports the full experimental record spanning nine versions (V0-V9), including a self-hosting case study, a controlled task-suite study with component ablations, a mathematical olympiad evaluation, and a supplementary SciCode boundary experiment evaluated with the official generated-code harness. All artifacts, manifests, and verification reports are preserved in the project repository.

2605.28279 2026-05-28 cs.RO

IMU Propagation as Preintegration

IMU传播作为预积分

Jianzhu Huai

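AI summary: Shows that IMU preintegration and propagation are computationally equivalent and presents a convention-agnostic view in which the preintegrated measurement, bias Jacobians, and covariance are obtained by wrapping an existing propagation routine, and vice versa, simplifying code reuse and supporting consistency checks.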

详情
Comments
6 pages, 2 figures, to present in ISPRS2026 Thematic Session 10 on Radar Perception
AI中文摘要

IMU预积分广泛用于基于因子图的视觉-惯性、激光-惯性和雷达-惯性状态估计,但通常被视为与常规IMU传播分离的专门实现。本文表明IMU预积分和传播是同一底层计算的不同实现。我们提出一种与约定无关的视角,其中预积分测量、偏差雅可比和协方差可以通过包装现有的IMU传播例程获得,而预积分模块反过来可以恢复状态转移矩阵和传播协方差。这种视角简化了现有传播代码的复用,支持跨不同误差状态定义的转换,并为预积分实现提供实用的一致性检查。随机IMU序列的实验表明,基于RK4的传播实现与GTSAM的切空间和流形预积分模块在恢复的雅可比、协方差和转移矩阵上高度一致。

英文摘要

IMU preintegration is widely used in factor-graph-based visual-inertial, lidar-inertial, and radar-inertial state estimation, yet it is often treated as a specialized implementation separate from conventional IMU propagation. This note shows that IMU preintegration and propagation are equivalent realizations of the same underlying computation. We present a convention-agnostic view in which the preintegrated measurement, bias Jacobians, and covariance can be obtained by wrapping an existing IMU propagation routine, while a preintegration module can conversely recover state-transition matrices and propagated covariances. This perspective simplifies the reuse of existing propagation code, supports translation across different error-state definitions, and provides practical consistency checks for preintegration implementations. Experiments with random IMU sequences demonstrate close agreement between an RK4-based propagation implementation and GTSAM's tangent and manifold preintegration modules in the recovered Jacobians, covariances, and transition matrices.

2605.28277 2026-05-28 cs.AI

Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning

LLMs 是否从文本构建世界模型?多语言空间推理诊断

Zhikai Pan, Chih-Ting Liao, Chunrui Liu, Xi Xiao, Yitong Qiao, Chunlei Meng, Zhangquan Chen, Xin Cao

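AI summary: Evaluates the spatial reasoning of large language models with the multilingual diagnostic benchmark MentalMap and finds a universal performance bottleneck in viewpoint reasoning across all models (the L3 reasoning cliff), suggesting that the limitation stems from text-only working memory constraints rather than a specific architecture.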

详情
AI中文摘要

大语言模型(LLMs)是否从纯文本描述中构建内部空间世界模型仍存在争议,且这种能力是否跨语言迁移尚未得到系统研究。我们引入 MentalMap,一个多语言诊断基准,具有六级能力层次(L0-L5),涵盖从原子空间事实到生成性世界图构建,以及四个诊断轴:参考系、阅读方向偏差、推理努力分配和幻觉。MentalMap 基于 100 个 ProcTHOR 家庭场景构建,涵盖八种类型多样的语言加上一个结构化文本控制,包含 39 个任务族,共 1950 个评估单元。评估了跨规模和模型家族的十三个 LLMs,我们识别出一个普遍的 L3 推理悬崖:一旦基线原子准确率超过 40%,没有模型能在视角推理上保留其 L0 性能的一半。该悬崖在语言、规模和提示策略中持续存在,而结构化输出失败和推理模式在不同模型间差异显著。在相同纯文本协议下的人类评估重现了相同的失败模式,表明瓶颈源于纯文本工作记忆约束,而非特定于当前 LLM 架构。我们的发现将纯文本空间推理重新定义为多轴世界建模问题,并推动多模态和草稿板增强推理作为未来方向。

英文摘要

Whether large language models (LLMs) construct internal spatial world models from pure-text descriptions remains contested, and whether such capabilities transfer across languages has not been systematically studied. We introduce MentalMap, a multilingual diagnostic benchmark with a six-level capability hierarchy (L0-L5) spanning atomic spatial facts to generative world-graph construction, together with four diagnostic axes probing frame of reference, reading-direction bias, reasoning-effort allocation, and hallucination. MentalMap is built from 100 ProcTHOR household scenes, covers eight typologically diverse languages plus a structured-text control, and contains 39 task families across 1,950 evaluation cells. Evaluating thirteen LLMs across scales and model families, we identify a universal L3 reasoning cliff: no model retains even half of its L0 performance on viewpoint reasoning once baseline atomic accuracy exceeds 40%. The cliff persists across languages, scales, and prompting strategies, while structured-output failures and reasoning patterns vary substantially across models. Human evaluation under the identical pure-text protocol reproduces the same failure pattern, suggesting that the bottleneck arises from text-only working memory constraints rather than being specific to current LLM architectures. Our findings reframe pure-text spatial reasoning as a multi-axis world-modeling problem and motivate multimodal and scratchpad-augmented reasoning as future directions.

2605.28276 2026-05-28 cs.LG

Commit to the Bit: Reactive Reinforcement Learning Done Right

Commit to the Bit: Reactive Reinforcement Learning Done Right

Onno Eberhard, Claire Vernade, Michael Muehlebach

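AI summary: For finite environments with deterministic observations, proposes the Committed Q-learning algorithm and proves its almost-sure convergence to the optimal reactive policy under the rewire-robustness assumption, which is weaker than $q_\star$-realizability.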

详情
AI中文摘要

强化学习算法通常在马可夫假设下进行分析(和设计)。这是不现实的,因为实践中遇到的大多数环境要么是部分可观测的,要么需要函数近似,从而限制了智能体访问非马可夫状态特征。我们考虑在具有确定性观测(或等价地,硬状态聚合)的有限环境中学习最优反应策略的问题。我们引入了一种新算法,Committed Q-learning,并在一个称为rewire-鲁棒性的直观假设下证明了其几乎必然收敛到最优反应策略。该假设严格弱于先前工作中使用的$q_\star$-可实现性条件。我们的算法是经典Q-learning的一个变体,其中行为策略在进入一个特征时承诺执行单一动作,并且仅在观测到的特征变化时重新采样动作。我们分析的一个关键部分是引入了准马可夫环境。

英文摘要

Reinforcement learning algorithms are commonly analyzed (and designed) under the Markov assumption. This is unrealistic, as most environments encountered in practice are either partially observable or require function approximation that restricts the agent to non-Markovian state features. We consider the problem of learning an optimal reactive policy in a finite environment with deterministic observations (or equivalently, hard state aggregation). We introduce a new algorithm, Committed Q-learning, and prove almost-sure convergence to the optimal reactive policy under an intuitive assumption we call rewire-robustness. This assumption is strictly weaker than the $q_\star$-realizability condition used in prior work. Our algorithm is a variant of classical Q-learning in which the behavior policy commits to a single action upon entering a feature, and only resamples actions when the observed feature changes. A crucial part of our analysis is the introduction of quasi-Markov environments.
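The commitment rule of the behavior policy can be sketched in a few lines. This covers only the action-selection side and assumes uniform random resampling on feature changes; the abstract does not say how actions are drawn, and the Q-learning update itself is omitted. The class name and the toy feature sequence are hypothetical.

```python
import random


class CommittedBehaviorPolicy:
    """Keeps repeating one action for as long as the observed feature stays the same."""

    def __init__(self, actions, rng):
        self.actions = list(actions)
        self.rng = rng
        self.current_feature = None
        self.current_action = None

    def act(self, feature):
        # Resample only when the agent enters a new feature (or on the first step).
        if self.current_action is None or feature != self.current_feature:
            self.current_feature = feature
            self.current_action = self.rng.choice(self.actions)
        return self.current_action


policy = CommittedBehaviorPolicy(actions=["left", "right", "stay"], rng=random.Random(0))
features = ["a", "a", "a", "b", "b", "a"]
chosen = [policy.act(f) for f in features]
```

In the toy run the first three steps share one action and the next two share another, because the feature only changes between steps three and four and again before the last step. Re-entering feature "a" at the end triggers a fresh draw rather than recalling the earlier commitment.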

2605.28273 2026-05-28 cs.AI

Global Policy-Space Response Oracles for Two-Player Zero-Sum Games

全局策略空间响应预言机用于两人零和博弈

Junyu Zhang, Feihong Yang, Jian Wang, Chao Wang, Xudong Zhang

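AI summary: Proposes the Global PSRO framework, which guides strategy-population expansion by directly minimizing Population Exploitability (PE) and approximates Nash equilibria with fewer policy iterations.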

详情
Comments
Accepted by ICML 2026
AI中文摘要

策略空间响应预言机(PSRO)框架通过使用深度强化学习(DRL)迭代扩展受限策略集,将均衡计算扩展到大型零和博弈。一个核心挑战是在有限计算预算下构建一个小的策略种群,其诱导博弈能很好地近似完整博弈。现有的PSRO变体通常使用从受限博弈收益计算出的元策略的最佳响应来扩展种群,这可能导致效率低下的扩展,仅提供有限的全局改进。我们提出通过直接评估扩展后的种群质量来引导种群扩展。具体来说,我们采用种群可利用性(PE)来衡量受限策略集代表完整博弈的程度,并引入一个两阶段探索-选择框架,在扩展过程中显式最小化PE。我们将该框架实例化为Global PSRO,一种实用的基于DRL的算法,该算法通过参数共享的条件神经网络高效生成候选响应并估计PE。在多个两人零和博弈上的实验表明,与先前的PSRO方法相比,Global PSRO实现了更低的可利用性,并以显著更少的策略迭代逼近纳什均衡。

英文摘要

The Policy-Space Response Oracles (PSRO) framework scales equilibrium computation to large zero-sum games by iteratively expanding a restricted strategy set using deep reinforcement learning (DRL). A central challenge is to construct, under limited computational budgets, a small strategy population whose induced game approximates the full game well. Existing PSRO variants typically expand the population using best responses to meta-strategies computed from restricted-game payoffs, which can lead to inefficient expansions that provide limited global improvement. We propose to guide population expansion by directly evaluating the post-expansion population quality. Specifically, we adopt Population Exploitability (PE) to measure how well a restricted strategy set represents the full game, and introduce a two-phase exploration-selection framework that explicitly minimizes PE during expansion. We instantiate this framework as Global PSRO, a practical DRL-based algorithm that efficiently generates candidate responses and estimates PE via parameter-sharing conditional neural networks. Experiments across multiple two-player zero-sum games show that Global PSRO achieves lower exploitability and approximates Nash equilibria with significantly fewer policy iterations than prior PSRO methods.
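For readers unfamiliar with the metric, the sketch below computes the standard exploitability of a fixed mixed-strategy profile in a small zero-sum matrix game. This is not the paper's PE estimator, which the abstract does not define and which is estimated with neural networks; it only illustrates the underlying quantity. The sum-of-gains convention is an assumption (some papers halve it), and the helper names are hypothetical.

```python
def row_values(payoff, col_strategy):
    """Expected payoff of each pure row action against the column mixture."""
    return [sum(p * q for p, q in zip(row, col_strategy)) for row in payoff]


def col_values(payoff, row_strategy):
    """Expected row-player payoff for each pure column action against the row mixture."""
    n_cols = len(payoff[0])
    return [
        sum(row_strategy[i] * payoff[i][j] for i in range(len(payoff)))
        for j in range(n_cols)
    ]


def exploitability(payoff, row_strategy, col_strategy):
    """Sum of both players' best-response gains in a zero-sum matrix game.

    The row player maximizes the payoff and the column player minimizes it,
    so the game value cancels and the sum reduces to max(A y) - min(x^T A).
    """
    return max(row_values(payoff, col_strategy)) - min(col_values(payoff, row_strategy))


# Rock-paper-scissors payoffs for the row player (rows and columns: R, P, S).
RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
uniform = [1 / 3, 1 / 3, 1 / 3]
rock = [1.0, 0.0, 0.0]

at_equilibrium = exploitability(RPS, uniform, uniform)
both_rock = exploitability(RPS, rock, rock)
```

The uniform profile is the Nash equilibrium of rock-paper-scissors and has zero exploitability, while a population that can only play rock leaves each player a gain of 1 by switching to paper, for a total of 2.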

2605.28272 2026-05-28 cs.CV

EchoAvatar: Real-time Generative Avatar Animation from Audio Streams

EchoAvatar: 从音频流实时生成动画虚拟化身

Bohong Chen, Yumeng Li, Yinglin Xu, Youyi Zheng, Yanlin Weng, Kun Zhou

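AI summary: Proposes a unified streaming architecture that generates continuous full-body motion from streaming speech and music with low latency, refines online generation quality with reinforcement learning, and enables intent-driven control through a tool-call interface.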

详情
Comments
SIGGRAPH 2026; Project Page: https://robinwitch.github.io/EchoAvatar-Page
AI中文摘要

从音频实时合成高保真3D角色运动是下一代交互式虚拟化身和虚拟助手的关键组成部分。然而,大多数现有方法仅限于离线处理完整音频序列,或局限于特定领域,很少能有效处理语音和音乐。本文提出了一种新颖框架,旨在从流式语音和音乐中低延迟生成连续、连贯的全身运动。我们方法的核心是一种统一的流式架构,能够从增量音频输入中合成连续运动。我们采用鲁棒的训练策略,强制音频依赖性,使模型能够无缝泛化到对话式语音和节奏性音乐,无需显式领域标签或模式切换。此外,我们探索了强化学习以优化在线生成质量。进一步地,我们通过工具调用接口将反应式动画与意图驱动行为连接起来,允许上游大型语言模型注入显式语义控制。通过将这种可控性与流式音频驱动合成相结合,我们的框架可作为即插即用解决方案,将语音代理转化为交互式人形虚拟化身。大量实验表明,我们的方法在运动质量和同步性上优于最先进的实时基线,同时保持了实时部署所需的灵活性。我们的代码、预训练模型和视频可在 https://robinwitch.github.io/EchoAvatar-Page 获取。

英文摘要

Real-time synthesis of high-fidelity 3D character motion from audio is a pivotal component for next-generation interactive avatars and virtual assistants. However, most existing approaches are limited to offline processing of complete audio sequences or are constrained to specific domains, rarely handling both speech and music effectively. In this paper, we introduce a novel framework designed to generate continuous, coherent full-body motion from streaming speech and music with low latency. Central to our approach is a unified streaming architecture capable of synthesizing continuous motion from incremental audio inputs. We employ a robust training strategy that enforces strong audio dependency, allowing the model to seamlessly generalize across conversational speech and rhythmic music without requiring explicit domain labels or mode switching. Additionally, we explore reinforcement learning to refine the quality of online generation. Furthermore, we bridge reactive animation with intent-driven behavior via a tool-call interface that allows upstream Large Language Models to inject explicit semantic control. By combining this controllability with streaming audio-driven synthesis, our framework serves as a plug-and-play solution for transforming voice agents into interactive humanoid avatars. Extensive experiments demonstrate that our method outperforms state-of-the-art real-time baselines in motion quality and synchronization while maintaining the flexibility required for live deployment. Our code, pre-trained models, and videos are available at https://robinwitch.github.io/EchoAvatar-Page.

2605.28271 2026-05-28 cs.CV

LV-OSD: Language-Vision-Complementary Open-Set Object Detection

LV-OSD: 语言-视觉互补的开放集目标检测

Yupeng Zhang, Ruize Han, Wei Feng, Song Wang, Liang Wan

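AI summary: Formulates the language-vision-complementary open-set object detection problem and designs the dual-branch detection framework LVDor, which uses a Target-guided Prompt Dynamic Weighting module and a Prompt Random Masking mechanism to flexibly combine and semantically align text and image prompts.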

详情
AI中文摘要

目标检测是计算机视觉中的重要任务,旨在通过给定的类别列表或查询图像检测感兴趣的目标。在这项工作中,我们提出了一个语言-视觉互补开放集目标检测(LV-OSD)的新问题,即使用灵活的基于文本和/或基于图像的提示来指定所需的目标类别。这种设置在现实应用中更为常见和实用。为此,我们设计了一个双分支检测框架LVDor,它可以同时接受文本和图像提示。具体来说,我们首先为每个类别构建包含多种文本描述和图像样本的多模态提示(MPr)。随后,为了弥合输入图像、文本提示和图像提示之间的语义差距,我们设计了一个目标引导提示动态加权(TPDW)模块。在该模块中,通过目标图像的先验信息,动态生成与目标语义最对齐的文本和图像提示,实现精确对齐并有效减少两种模态之间的差异,从而适应LV-OSD设置。我们还提出了一种简单的训练时提示随机掩码(PRM)机制,以模拟测试时文本和/或图像提示的任意组合。大量的实验结果验证了我们问题表述的合理性和方法的有效性。提示和代码将公开发布。

英文摘要

Object detection is an important task in computer vision, which aims to detect the objects of interest specified through a given category list or query images. In this work, we propose a new problem of language-vision-complementary open-set object detection (LV-OSD), i.e., using flexible text-based and/or image-based prompts to specify the desired object categories. This setting is more common and practical in real-world applications. For this purpose, we design a dual-branch detection framework, LVDor, which can simultaneously accept both text and image prompts. Specifically, we first build the Multi-modal Prompts (MPr) containing various text descriptions and image samples for each category. Subsequently, to bridge the semantic gap among the input image, text prompts, and image prompts, we design a Target-guided Prompt Dynamic Weighting (TPDW) module. Guided by the prior information of the target image, this module dynamically produces the text and image prompts that best align with the target semantics, achieving precise alignment and effectively reducing the discrepancy between the two modalities, thereby accommodating the LV-OSD setting. We also propose a simple Prompt Random Masking (PRM) mechanism during training to simulate the arbitrary combination of text and/or image prompts in testing. Extensive experimental results verify the soundness of our problem formulation and the effectiveness of our method. Prompts and code will be released publicly.
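One plausible reading of the Prompt Random Masking idea is sketched below: during training, each sample randomly keeps the text prompt, the image prompt, or both, so the detector sees every combination it may meet at test time. The drop probabilities, the rule that at least one modality is always kept, and the function name are assumptions for illustration and do not come from the abstract.

```python
import random


def mask_prompts(text_prompt, image_prompt, rng, p_drop_text=1 / 3, p_drop_image=1 / 3):
    """Randomly drop at most one prompt modality for a training sample.

    Returns a (text, image) pair in which a dropped modality is None.
    """
    r = rng.random()
    if r < p_drop_text:
        return None, image_prompt            # image-only training sample
    if r < p_drop_text + p_drop_image:
        return text_prompt, None             # text-only training sample
    return text_prompt, image_prompt         # both modalities kept


rng = random.Random(0)
samples = [mask_prompts("a red mug", "mug_crop.png", rng) for _ in range(300)]
modes = {(text is not None, image is not None) for text, image in samples}
```

Over a few hundred draws all three combinations appear and no sample loses both prompts, which mirrors the "text and/or image" setting the problem formulation targets.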

2605.28270 2026-05-28 cs.CV

Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects

Every9D-21M:日常物体的大规模真实世界9D规范化

Leonhard Sommer, Emil Akopyan, Adam Kortylewski

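AI summary: To address the scarcity of real-world 9D pose data, introduces the Every9D-21M dataset with 21.8M images across 700 object categories, scales annotation by reconstructing point clouds via multi-view geometry and aligning them across instances, and shows performance gains on multiple benchmarks.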

详情
AI中文摘要

从单张真实世界图像估计日常物体的9D姿态仍然具有挑战性。这很大程度上是由于缺乏大规模监督。大多数现有数据集要么严重依赖合成渲染,要么对真实世界物体的覆盖有限:迄今为止最大的真实世界9D姿态数据集仅包含9个类别的17K个标注物体。我们通过Every9D-21M数据集填补了这一空白,该数据集包含来自109K个以物体为中心的视频的2180万张真实世界图像的9D姿态标注,涵盖700个日常物体类别——在图像和类别数量上比之前的真实世界9D姿态基准大两个数量级。为了实现这一规模,我们利用以物体为中心的视频,通过多视图几何重建物体级点云,并将相似实例对齐到共享的规范坐标系中。仅对一小部分参考物体(少于所有图像的0.01%)手动标注规范姿态,并通过跨实例对齐传播到其余实例。然后从多个视角验证所有传播的规范姿态。我们进一步引入了跨类别方向规则,以诱导类别级对称性,从而实现对称感知评估。除了建立专用的训练和评估划分作为9D姿态基础模型的基准外,我们还表明,在Every9D-21M上训练可提高在ImageNet3D和PASCAL3D+上的性能,并且比在ImageNet3D上训练更好地泛化到HANDAL。数据和代码可在https://github.com/GenIntel/Every9D获取。

英文摘要

Estimating the 9D pose of everyday objects from a single real-world image remains challenging. This is largely due to the lack of large-scale supervision. Most existing datasets either rely heavily on synthetic renderings or provide limited coverage of real-world objects: the largest real-world 9D pose dataset to date contains only 17K annotated objects across 9 categories. We address this gap with Every9D-21M, a dataset of 9D pose annotations for 21.8M real-world images from 109K object-centric videos spanning 700 everyday object categories, two orders of magnitude larger than prior real-world 9D pose benchmarks in both image and category count. To achieve this scale, we leverage object-centric videos by reconstructing object-level point clouds via multi-view geometry and aligning similar instances into a shared canonical coordinate frame. Canonical poses are manually annotated for only a small set of reference objects (fewer than 0.01% of all images) and propagated to the remaining instances via cross-instance alignment. All propagated canonical poses are then verified from multiple viewpoints. We further introduce cross-category orientation rules that induce category-level symmetries, enabling symmetry-aware evaluation. Beyond establishing dedicated training and evaluation splits as a benchmark for 9D pose foundation models, we show that training on Every9D-21M improves performance on ImageNet3D and PASCAL3D+, and generalizes to HANDAL substantially better than training on ImageNet3D. Data and code are available at https://github.com/GenIntel/Every9D.