arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.17309 2026-05-19 cs.CV cs.AI

StyleText: A Large-Scale Dataset and Benchmark for Stylized Scene Text Inpainting

Aleksandr Simonyan, Nipun Jindal

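AI summary: This paper presents StyleText, a large-scale dataset and benchmark for style-preserving scene-text inpainting that enables controlled evaluation of text legibility and visual consistency under shared scene context.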

Comments Accepted at the SynData4CV Workshop, CVPR 2026. 8 pages + 1 page of references, 5 figures, 4 tables

Abstract

We present StyleText, a large-scale dataset and benchmark for localized scene-text inpainting with style preservation. StyleText contains 28,518 image-mask-prompt triplets grouped into 9,932 scene families, enabling controlled evaluation of text legibility and visual consistency under shared scene context. We construct the dataset with an automated pipeline that combines LLM prompt templating, Flux-based source generation with key-value (KV) cache injection, OCR-based semantic filtering, polygon mask extraction, and mask-conditioned FluxFill augmentation. We define a reproducible evaluation protocol using normalized OCR metrics (word accuracy and character error rate) and CLIP image-image similarity with explicit preprocessing. A FluxFill+LoRA baseline trained on StyleText improves OCR accuracy substantially over initialization while maintaining scene style consistency, establishing a strong reference point for future comparisons.
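
The normalized OCR metrics named in this abstract can be illustrated with a minimal sketch. The normalization shown here (lowercasing, stripping punctuation, collapsing whitespace) and all function names are assumptions for illustration; the abstract does not spell out the benchmark's actual preprocessing.

```python
import re


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def char_error_rate(predicted: str, target: str) -> float:
    """Edit distance divided by the target length, after normalization."""
    pred, ref = normalize(predicted), normalize(target)
    if not ref:
        return 0.0 if not pred else 1.0
    return edit_distance(pred, ref) / len(ref)


def word_accuracy(predictions: list[str], targets: list[str]) -> float:
    """Fraction of samples whose normalized OCR output equals the target."""
    if not targets:
        return 0.0
    matches = sum(normalize(p) == normalize(t) for p, t in zip(predictions, targets))
    return matches / len(targets)
```

In this sketch `predictions` would hold OCR readings of the inpainted regions and `targets` the text requested in the prompts.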

2605.17308 2026-05-19 cs.AI

Reasoning Before Diagnosis: Physician-Inspired Structured Thinking for ECG Classification

Yang Wu, Xiaoyan Yuan, Hau-San Wong, Xiping Hu

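AI summary: This paper proposes CardioThink, a framework that uses a structured reasoning process to make ECG classification more clinically aligned, and introduces SSPO to improve the accuracy and interpretability of the diagnostic results.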

Abstract

Electrocardiogram (ECG) diagnosis in clinical practice relies on structured reasoning over multiple hierarchical aspects, including cardiac rhythm, conduction properties, waveform morphology, and overall diagnostic impression. However, most existing approaches predict labels directly from ECG signals without explicit clinical reasoning, resulting in opaque decisions that lack clinical alignment. To bridge this gap, we propose CardioThink, a physician-inspired multimodal large language model (MLLM) framework that explicitly models the diagnostic reasoning process through human-interpretable intermediate stages (rhythm, conduction, morphology, and impression) to derive final classification results. Furthermore, we introduce Structured Set Policy Optimization (SSPO) to jointly optimize adherence to this structured reasoning format and the accuracy of variable-size diagnostic sets, without requiring manually annotated reasoning traces. Extensive experiments on diverse ECG benchmarks demonstrate the significant superiority of our approach in diagnostic accuracy, while simultaneously providing interpretable clinical reasoning. Notably, reasoning quality evaluations confirm that SSPO substantially enhances the clinical validity of the generated rationales. These findings reveal that moving beyond direct label prediction toward structured reasoning offers a more clinically aligned direction for future ECG modeling.

2605.17305 2026-05-19 cs.AI cs.CL

CyberCorrect: A Cybernetic Framework for Closed-Loop Self-Correction in Large Language Models

Yuning Wu, Yingmin Liu, Yang Shu

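AI summary: This paper proposes CyberCorrect, a framework that models LLM self-correction as a closed-loop control system. A tri-modal error detector, a type-directed correction controller, and a convergence judge improve the model's self-correction ability and accuracy.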

Comments 6 pages, 1 figure, submitted to IEEE SMC 2026

Abstract

Large language model (LLM) self-correction -- the ability to detect and fix errors in generated outputs -- remains largely ad hoc, relying on generic prompts such as "please reconsider your answer" without systematic error analysis or convergence guarantees. We propose CyberCorrect, a framework that formalizes LLM self-correction as a closed-loop control system grounded in cybernetic theory. The framework models the LLM generator as the plant and introduces a tri-modal Error Detector (combining self-consistency, verbalized confidence, and logic-chain verification) as the sensor. A type-directed Correction Controller generates targeted repair instructions based on diagnosed error categories, while a Convergence Judge determines iteration termination using stability criteria adapted from control theory. We further introduce three control-theoretic evaluation metrics -- convergence rate, overshoot rate, and oscillation rate -- that capture correction dynamics beyond final accuracy. Experiments on our constructed CyberCorrect-Bench (440 reasoning tasks with annotated error types and correction paths) show that CyberCorrect achieves 79.8% final accuracy, improving upon the best existing self-correction method by 6.2 percentage points, while reducing overshoot (erroneous over-correction) by 41% through its convergence control mechanism.
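
To make the three control-theoretic metrics concrete, here is a rough sketch computed over correction traces. The abstract only glosses overshoot as erroneous over-correction; the exact definitions below (convergence as a stable final answer, oscillation as returning to an abandoned answer) and all names are assumptions, not the paper's formulas.

```python
def converged(trace: list[str]) -> bool:
    """Assumed definition: the last two answers in the loop are identical."""
    return len(trace) >= 2 and trace[-1] == trace[-2]


def overshoots(trace: list[str], gold: str) -> bool:
    """Assumed definition: some step replaces a correct answer with a wrong one."""
    return any(a == gold and b != gold for a, b in zip(trace, trace[1:]))


def oscillates(trace: list[str]) -> bool:
    """Assumed definition: an answer reappears after being abandoned (A -> B -> A)."""
    collapsed = [a for i, a in enumerate(trace) if i == 0 or a != trace[i - 1]]
    return len(set(collapsed)) < len(collapsed)


def correction_metrics(traces: list[list[str]], golds: list[str]) -> dict[str, float]:
    """Rates over a set of traces; each trace lists the answer after every iteration."""
    n = len(traces)
    if n == 0:
        return {"convergence": 0.0, "overshoot": 0.0, "oscillation": 0.0}
    return {
        "convergence": sum(converged(t) for t in traces) / n,
        "overshoot": sum(overshoots(t, g) for t, g in zip(traces, golds)) / n,
        "oscillation": sum(oscillates(t) for t in traces) / n,
    }
```

Here each trace starts with the initial answer, so `["5", "4", "5"]` with gold answer `"5"` counts as both an overshoot and an oscillation.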

2605.17304 2026-05-19 cs.LG cs.CL

Compress the Context, Keep the Commitments: A Formal Framework for Verifiable LLM Context Compression

Natalia Trukhina, Vadim Vashkelis

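AI summary: This paper proposes Context Codec, a framework that compresses at the semantic level so that key commitments survive when dialogue history is compressed, addressing the lack of an explicit specification for commitment preservation in existing methods.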

Abstract

LLM context is not just tokens; it is a set of commitments. Long-running conversations accumulate goals, constraints, decisions, preferences, tool results, retrieved evidence, artifacts, and safety boundaries that future responses must preserve. Existing context-management methods reduce length through truncation, retrieval, summarization, memory systems, or token-level prompt compression, but they rarely specify which semantic commitments must survive compression or how their preservation should be measured. We propose Context Codec, a commitment-level framework for compressing prompts and chat histories. Context Codec represents dialogue state as typed, source-grounded semantic atoms with canonical identity, equivalence, conflict, confidence, risk, and evidence spans. It separates five concerns - extraction, normalization, representation, rendering, and verification - and introduces metrics for Critical Atom Recall, Weighted Atom Recall, Commitment Density, and round-trip recoverability. It also defines a taxonomy of semantic compression errors, a concrete normalization procedure, conservative fallback rules for low-confidence and safety-critical atoms, and Context Compression Language (CCL), an ASCII-first compact rendering of canonical JSON atoms. In a small diagnostic study, CCL-Core occupies a useful middle ground between structured prose and JSON: more explicit and auditable than prose, usually more compact than JSON, and less risky than heavily minified notation. The result is not a claim that shorthand solves compression, but a framework for making context compression verifiable: compress the conversation, keep the commitments.
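
A small sketch may help picture the recall-style metrics. The `Atom` fields and both formulas are assumptions chosen for illustration (recall over critical atoms, and recall weighted by a per-atom weight); the abstract names the metrics but does not define them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    """Hypothetical semantic atom: an identifier, a weight, and a criticality flag."""
    atom_id: str
    weight: float
    critical: bool


def critical_atom_recall(source: list[Atom], recovered_ids: set[str]) -> float:
    """Share of critical source atoms that are still recoverable after compression."""
    critical = [a for a in source if a.critical]
    if not critical:
        return 1.0
    return sum(a.atom_id in recovered_ids for a in critical) / len(critical)


def weighted_atom_recall(source: list[Atom], recovered_ids: set[str]) -> float:
    """Share of total atom weight that is still recoverable after compression."""
    total = sum(a.weight for a in source)
    if total == 0:
        return 1.0
    kept = sum(a.weight for a in source if a.atom_id in recovered_ids)
    return kept / total
```

Under these assumptions, a compressor that drops a safety-critical atom is penalized in the first metric even when most of the low-weight atoms survive.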

2605.17303 2026-05-19 cs.CV

LongDPM: Overlap-Aware 4D Reconstruction from Long Monocular Videos

Chenyi Xu, Yihao Wu, Liqi Yan, Chao Yang, Jianhui Zhang, Fangli Guan, Pan Li

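AI summary: This paper proposes LongDPM, an overlap-aware framework for monocular dynamic reconstruction from long videos. Through chunked processing, chunk registration, and dynamic identity association, it achieves long-range 3D reconstruction and tracking, improving dense tracking accuracy on PointOdyssey, Kubric-F, and Kubric-G as well as camera pose estimation.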

Abstract

Recovering a dynamic 3D scene from a long monocular video requires dense geometry, camera motion, and temporal correspondence to remain consistent in a shared coordinate system. Existing methods face two key challenges: (1) feed-forward reconstruction models provide accurate local predictions but are limited to short clips, and (2) long-range trackers preserve correspondences without producing dense sequence-level reconstruction. This paper presents LongDPM, a novel overlap-aware framework for scalable long-range monocular dynamic reconstruction. First, LongDPM processes long videos in overlapping chunks, keeping inference memory bounded by the chunk length. Second, it connects chunk-local coordinate systems through confidence-weighted registration with static-aware overlap abstraction. Third, it associates dynamic identities across chunk boundaries and fuses matched trajectories to recover coherent long-range 3D motion. Experimental results demonstrate that LongDPM achieves superior long-range reconstruction and tracking performance, reducing dense tracking EPE over V-DPM on PointOdyssey, Kubric-F, and Kubric-G, while obtaining the best TUM-dynamics ATE for camera pose estimation.
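
The first step, processing a long video in overlapping chunks, can be sketched as simple index bookkeeping. The function names and the fixed-stride scheme are assumptions; the abstract does not say how LongDPM chooses chunk boundaries.

```python
def overlapping_chunks(num_frames: int, chunk_len: int, overlap: int) -> list[tuple[int, int]]:
    """Split frames [0, num_frames) into half-open windows sharing `overlap` frames."""
    if chunk_len <= 0 or not 0 <= overlap < chunk_len:
        raise ValueError("need chunk_len > 0 and 0 <= overlap < chunk_len")
    if num_frames <= 0:
        return []
    stride = chunk_len - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_len, num_frames)
        chunks.append((start, end))
        if end >= num_frames:
            break
        start += stride
    return chunks


def shared_frames(a: tuple[int, int], b: tuple[int, int]) -> list[int]:
    """Frame indices present in both chunks; these would anchor the registration."""
    return list(range(max(a[0], b[0]), min(a[1], b[1])))
```

Because each window holds at most `chunk_len` frames, per-chunk inference memory stays bounded regardless of video length, which is the property the abstract highlights.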

2605.17302 2026-05-19 cs.RO

Beyond Geometry: Efficient Topologically-Grounded Navigation in Complex 3D Environments

Yifan Du, Chengwei Zhang, Siyu Liao, Zhongfeng Wang

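AI summary: This paper proposes a surface extraction framework that enforces ground support, overhead clearance, and seed-based connectivity constraints to build a reduced state space of physically reachable standing positions, enabling efficient topologically grounded navigation in complex 3D environments.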

Abstract

Ground robot navigation in complex 3D environments is often hindered by geometric ambiguity, where non-traversable structures such as furniture share local geometric properties with navigable ground. Furthermore, the computational cost of searching massive voxel spaces remains a significant challenge. To address these issues, we present a surface extraction framework that constructs a reduced state space of physically reachable standing positions by enforcing ground support, overhead clearance, and seed-based connectivity constraints. Evaluation across five Matterport3D indoor scenes and three PCT benchmark scenes demonstrates over 80% state space reduction and sub-millisecond A* search on the Matterport3D scenes, with 100% planning success across all 300 tested queries.
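
The three constraints can be illustrated on a toy voxel grid. This is a sketch under assumptions: voxels are integer `(x, y, z)` cells, a standing cell sits directly above an occupied voxel, and connectivity uses 4-neighbour moves with a bounded step height. None of these details come from the abstract.

```python
from collections import deque

Cell = tuple[int, int, int]


def standing_cells(occupied: set[Cell], clearance: int) -> set[Cell]:
    """Free cells with a solid voxel directly below and `clearance` free cells of headroom."""
    cells = set()
    for (x, y, z) in occupied:
        if all((x, y, z + k) not in occupied for k in range(1, clearance + 1)):
            cells.add((x, y, z + 1))
    return cells


def reachable_from_seed(cells: set[Cell], seed: Cell, max_step: int = 1) -> set[Cell]:
    """Breadth-first search over standing cells, limiting height change per move."""
    if seed not in cells:
        return set()
    seen = {seed}
    queue = deque([seed])
    while queue:
        x, y, z = queue.popleft()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            for dz in range(-max_step, max_step + 1):
                nxt = (x + dx, y + dy, z + dz)
                if nxt in cells and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return seen
```

In a scene with a floor and a table, the tabletop passes the support and clearance tests but is dropped by the seed-based search, which is the kind of geometric ambiguity the abstract describes.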

2605.17300 2026-05-19 cs.RO

HCLM: A Hierarchical Framework for Cooperative Loco-Manipulation with Dual Quadrupeds

Qixuan Li, Chen Le, Jincheng Yu, Xinlei Chen

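AI summary: This paper proposes HCLM, a hierarchical framework for cooperative loco-manipulation with dual quadrupeds in complex environments. Its core method pairs a centralized joint diffusion policy with a hybrid whole-body controller, and its main contribution is highly robust multi-robot cooperative control.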

Abstract

We introduce HCLM, a hierarchical framework for general-purpose cooperative loco-manipulation with dual quadrupedal systems. Coordinating multi-robot collaborative manipulation across floating bases is highly challenging due to the conflicting demands of spatial coordination, robust locomotion, and closed-chain physical interactions. To resolve this, our architecture systematically decouples high-level collaborative reasoning from low-level robust motion execution. At the high level, a centralized Joint Diffusion Policy leverages an SE(3)-invariant task-space representation to learn coordinate-agnostic spatial coordination patterns. To translate these frame-agnostic references into physical motion, a task-centric hybrid Whole-Body Controller synergizes a proactive kinematic Model Predictive Control for collision-free velocity distribution with a reactive execution layer. Crucially, this reactive layer guarantees rapid responsiveness for precise end-effector tracking, while concurrently integrating active force regulation via a cooperative admittance scheme to safely resolve kinematic conflicts and strictly regulate internal stresses during closed-chain interactions. We validate the framework across progressively challenging simulated scenarios, including cooperative carrying, packing and handovers, and successfully deploy the latter in the real world. The results demonstrate reliable task execution, strict configuration agnosticism, and exceptional resilience against severe physical perturbations, offering a highly robust pathway for multi-robot embodied coordination.

2605.17295 2026-05-19 cs.LG cs.CL

DISA: Offline Importance Sampling for Distribution-Matching LLM-RL

Shaobo Wang, Yujie Chen, Yafeng Sun, Wenjie Qiu, Zhihui Xie, Sihang Li, Yucheng Li, Huiqiang Jiang, Xingzhang Ren, Xuming Hu, Dayiheng Liu, Linfeng Zhang

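AI summary: This work proposes DISA, which uses offline importance sampling to address the calibration problem in distribution-matching RL. It separates partition-function estimation from policy learning, improves strategy diversity, and performs well across multiple benchmarks.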

Comments 21 pages, 7 figures, 7 tables. Abstract shortened to respect the arXiv limit of 1920 characters. Please see the PDF for the full abstract

Abstract

Modern reasoning agents are increasingly evaluated on their ability to generate multiple valid solution paths, plans, or tool-use traces for a given input. Standard reward-maximizing RL tends to collapse onto the most easily reinforced high-reward mode, whereas distribution-matching RL aims to allocate probability mass across the entire reward-shaped solution set. Achieving this objective requires computing a prompt-dependent partition function over the trajectory space. Because existing distribution-matching methods learn this partition function online alongside the policy, calibration errors in the partition function directly distort policy updates and remain impossible to diagnose independently. We introduce DISA, short for Decoupled Importance-Sampled Anchoring, which moves this calibration problem outside the RL loop. DISA draws proposal trajectories offline, estimates the partition function via importance sampling, and freezes the resulting partition-function estimate before policy optimization begins. This decoupling preserves the distribution-matching objective while strictly separating partition-function estimation from policy learning in data, gradients, loss, and diagnostics. Empirically, on two open-weight backbones across six math and three code benchmarks, DISA matches or exceeds the online-coupled distribution-matching baseline FlowRL, outperforms reward-maximization baselines GRPO and GSPO on math averages, and exceeds LoRASFT distillation by up to 13.8 Mean@8 points on the same offline trajectories. An LLM-as-judge evaluation further shows that DISA retains substantially more strategy-level diversity than reward-maximization baselines, and sensitivity studies on the proposal strength and inverse temperature follow the bias-variance pattern predicted by the analysis.
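
As a rough illustration of estimating a partition function by importance sampling, the sketch below assumes the target distribution is proportional to `pi_ref(y|x) * exp(beta * r(x, y))`, so that `Z(x)` is the expectation of `(pi_ref / q) * exp(beta * r)` under a proposal `q`. That target form, the function name, and the argument names are assumptions; the abstract only states that the estimate is computed offline and then frozen.

```python
import math


def log_partition_estimate(
    rewards: list[float],
    logp_ref: list[float],
    logp_proposal: list[float],
    beta: float,
) -> float:
    """Importance-sampling estimate of log Z from trajectories drawn from the proposal.

    log Z ~= logsumexp_i(logp_ref_i - logp_proposal_i + beta * r_i) - log N
    """
    n = len(rewards)
    if n == 0 or len(logp_ref) != n or len(logp_proposal) != n:
        raise ValueError("inputs must be non-empty and of equal length")
    log_terms = [
        lr - lq + beta * r for r, lr, lq in zip(rewards, logp_ref, logp_proposal)
    ]
    peak = max(log_terms)  # subtract the max for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms)) - math.log(n)
```

Computed once per prompt before training, such a value could serve as a fixed constant in the policy loss, which is the decoupling the abstract emphasizes.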

2605.17294 2026-05-19 cs.CV

HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing

Yuyao Zhang, Alexander Huang-Menders, Yu-Wing Tai

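AI summary: This paper proposes HierEdit, a region-aware hierarchical diffusion framework for efficient and scalable high-resolution image editing. A low-resolution proxy generates a reference and localizes the modified regions; combined with a hierarchical local-window diffusion model and an inference acceleration mechanism, this enables fast, high-fidelity editing without high-resolution training data.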

Abstract

High-resolution image editing is essential for professional and creative applications, yet existing multimodal diffusion-based editors remain computationally inefficient and constrained to relatively low resolutions. Current approaches redundantly process the entire image canvas or rely on large-scale high-resolution datasets, resulting in substantial training and inference costs. We introduce HierEdit, a region-aware hierarchical diffusion framework designed for efficient and scalable high-resolution image editing. Our method first performs edits on a low-resolution proxy using an off-the-shelf editing model to generate a reference and to localize the modified regions. A hierarchical local-window diffusion model (Local-Window MMDiT) then refines only the edited regions within the original high-resolution image, while reusing the unaltered regions as conditioning inputs. The low-resolution proxy further provides structural guidance and intermediate denoising supervision (Inference Acceleration), ensuring consistent global semantics and stable generation without the need for full-resolution attention computation. This targeted and hierarchical design enables fast, high-fidelity editing of images up to 4K resolution without any specialized high-resolution training data. Extensive experiments demonstrate that HierEdit achieves competitive visual quality on commodity-resolution datasets while significantly accelerating inference and extending seamlessly to ultra-high-resolution 4K editing. Project page: https://peteryyzhang.github.io/HierEdit-page/

2605.17293 2026-05-19 cs.RO cs.MA

Task Capability Improvement Algorithm for Collaborative Manipulators

Keshab Patra, Arpita Sinha, Anirban Guha

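AI summary: This paper proposes using additional moments to improve the task capability of collaborative manipulators. Applying forces away from the object's center of gravity generates extra moments that increase the capability of each manipulator and of the whole group; the reported improvement in task capability is 5.86%.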

Abstract

This work introduces a method for improving cooperative task capability by utilizing additional moments. The manipulators apply forces at the object's grasp point. Applying forces at a point other than the object's center of gravity produces moments that are normally undesired. Here, the undesired moment acts as an additional moment: it improves the capability of an individual manipulator and, hence, of the entire collaborative group. Any improvement in task capability directly adds to the object and transportation capability. The group's enhanced capability also helps achieve optimal capability, optimal resource allocation, and maximum fault tolerance in object manipulation. Our simulation results show a capability improvement of 5.86% compared to when no moment is used to enhance the capability of the manipulators.

2605.17292 2026-05-19 cs.AI cs.MA

MetaCogAgent: A Metacognitive Multi-Agent LLM Framework with Self-Aware Task Delegation

Chenyu Wang, Yang Shu

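AI summary: This paper proposes MetaCogAgent, a framework that adds a metacognitive self-assessment unit so that each agent evaluates its own competence boundaries before executing a task, improving task accuracy and reducing the number of API calls.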

Comments 6 pages, submitted to IEEE SMC 2026

Abstract

Multi-agent large language model (LLM) systems have shown promise for solving complex tasks through agent collaboration. However, existing frameworks assign tasks based on predefined roles without considering whether an agent can accurately assess its own competence boundaries, leading to overconfident execution on tasks beyond its expertise. Inspired by metacognition theory from cognitive science, we propose MetaCogAgent, a multi-agent LLM framework where each agent is equipped with a Metacognitive Self-Assessment Unit that evaluates task-capability alignment before execution. The framework introduces three contributions: (1) a self-assessment mechanism that estimates per-task confidence by combining verbalized uncertainty with historical capability profiles; (2) an adaptive delegation protocol that routes low-confidence tasks to better-suited agents through cross-agent evaluation; and (3) a capability boundary learning module that iteratively refines each agent's competence model via cybernetic feedback. Experiments on our constructed MetaCog-Eval benchmark (700 tasks across 5 cognitive dimensions) demonstrate that MetaCogAgent achieves 82.4% task accuracy -- 8.7% above the best routing baseline -- while using 5% fewer API calls than AutoGen and 34% fewer than ensemble voting. Ablation studies confirm that each metacognitive component contributes to overall system performance.

2605.17291 2026-05-19 cs.LG

Step-wise Rubric Rewards for LLM Reasoning

Weichu Xie, Haozhe Zhao, Wenpu Liu, Yongfu Zhu, Liang Chen, Minghao Ye, Zirong Chen, Yuqi Xu, Shuai Dong, Ziyue Wang, Xinbo Xu, Kean Shi, Ruoyu Wu, Xiaoying Zhang, Wenqi Shao, Baobao Chang, Nan Duan, Jiaqi Wang

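AI summary: This paper proposes a step-wise rubric reward method that uses an LLM judge to attribute each rubric item to a reasoning step, normalizes per-step scores, and combines them with the outcome reward through an advantage estimator to improve reasoning accuracy and reduce self-correction loops.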

Comments Code available at https://github.com/akarinmoe/SRaR

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning in large language models, but rewards only final-answer correctness with no supervision over intermediate steps. Rubric-based methods such as Rubrics as Rewards (RaR) introduce finer-grained supervision by scoring rollouts against structured criteria, yet the rubric scores are still aggregated into a single scalar applied to the entire response, causing three weaknesses: loss of multi-criterion structure, uniform supervision of correct and incorrect steps, and reward hacking through unbounded self-correction. On 1,000 problems, we find 18.2% of steps in correct-answer responses are wrong yet positively rewarded, while 49.9% of steps in incorrect-answer responses are correct yet penalized. We introduce Step-wise Rubrics as Rewards (SRaR), an RLVR framework that (i) uses an LLM judge to attribute each rubric item to a specific reasoning step, (ii) normalizes per-step rubric scores across rollouts so only steps whose quality varies produce a learning signal, and (iii) combines the per-step reward with the outcome reward through a decoupled advantage estimator that keeps the outcome baseline stable. We further build a 16K-problem rubric dataset by contrastively distilling rubric items from correct and flawed reasoning paths sampled from a strong model. Across six mathematical reasoning benchmarks, SRaR improves average accuracy over RaR by 3.57 points on Qwen3-8B and 2.75 points on Qwen3-32B, raises the Faithful Reasoning Rate on AIME 2025 from 34.5% to 46.7%, and reduces self-correction looping from 48.1% to 26.5%.
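
Point (ii), normalizing per-step rubric scores across rollouts, can be pictured with a small sketch. It assumes every rollout has a score for the same fixed set of step slots and uses a plain z-score; the released SRaR code may align steps and normalize differently.

```python
from statistics import mean, pstdev


def normalize_step_scores(scores: list[list[float]], eps: float = 1e-8) -> list[list[float]]:
    """Z-score each step slot across rollouts; slots with no variation give zero signal.

    `scores[i][k]` is the rubric score of step slot k in rollout i (assumed layout).
    """
    if not scores:
        return []
    n_steps = len(scores[0])
    if any(len(row) != n_steps for row in scores):
        raise ValueError("every rollout needs the same number of step slots")
    normalized = [[0.0] * n_steps for _ in scores]
    for k in range(n_steps):
        column = [row[k] for row in scores]
        mu, sigma = mean(column), pstdev(column)
        if sigma < eps:
            continue  # all rollouts scored the same here: nothing to learn from
        for i, value in enumerate(column):
            normalized[i][k] = (value - mu) / sigma
    return normalized
```

A step that every rollout gets right (or wrong) therefore contributes nothing, while a step whose quality differs between rollouts receives a signed signal.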

2605.17285 2026-05-19 cs.LG cs.AI

UNR-Explainer: Counterfactual Explanations for Unsupervised Node Representation Learning Models

Hyunju Kang, Geonhee Han, Hogun Park

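AI summary: This paper proposes UNR-Explainer, a Monte Carlo Tree Search-based method that generates counterfactual explanations for unsupervised node representation learning models by identifying key subgraphs, improving understanding of downstream tasks such as link prediction and clustering.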

Comments Accepted at ICLR 2024

Abstract

Node representation learning, such as Graph Neural Networks (GNNs), has emerged as a pivotal method in machine learning. The demand for reliable explanation generation surges, yet unsupervised models remain underexplored. To bridge this gap, we introduce a method for generating counterfactual (CF) explanations in unsupervised node representation learning. We identify the most important subgraphs that cause a significant change in the k-nearest neighbors of a node of interest in the learned embedding space upon perturbation. The k-nearest neighbor-based CF explanation method provides simple, yet pivotal, information for understanding unsupervised downstream tasks, such as top-k link prediction and clustering. Consequently, we introduce UNR-Explainer for generating expressive CF explanations for Unsupervised Node Representation learning methods based on a Monte Carlo Tree Search (MCTS). The proposed method demonstrates superior performance on diverse datasets for unsupervised GraphSAGE and DGI.
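
The quantity at the heart of the explanation, how much a node's k-nearest neighbors change after a perturbation, can be sketched as follows. The embeddings are plain dictionaries and the change score (one minus neighbor overlap) is an assumption for illustration; the subgraph search with MCTS is not shown.

```python
import math


def knn(embeddings: dict[str, list[float]], node: str, k: int) -> set[str]:
    """The k nodes closest to `node` in embedding space (Euclidean distance)."""
    if k <= 0:
        return set()
    ranked = sorted(
        (math.dist(embeddings[node], vec), other)
        for other, vec in embeddings.items()
        if other != node
    )
    return {other for _, other in ranked[:k]}


def knn_change(
    before: dict[str, list[float]],
    after: dict[str, list[float]],
    node: str,
    k: int,
) -> float:
    """Fraction of the node's k nearest neighbors replaced after the perturbation."""
    old, new = knn(before, node, k), knn(after, node, k)
    return 1.0 - len(old & new) / k
```

Here `before` would hold embeddings from the original graph and `after` those recomputed with a candidate subgraph perturbed; a larger change marks a more important subgraph.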

2605.17284 2026-05-19 cs.CV cs.AI cs.LG cs.RO

CLAP: Contrastive Latent-space Prompt Optimization for End-to-end Autonomous Driving

Ruiyang Zhu, Yuehan He, Boyuan Zheng, Zesen Zhao, Ahmad Chalhoub, Qingzhao Zhang, Z. Morley Mao

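AI summary: This paper proposes CLAP, which uses contrastive latent-space prompt optimization to address rare but safety-critical long-tail scenarios in autonomous driving. Prompts are optimized from crowdsourced data and retrieved over V2X communication, improving planning performance.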

Comments 9 pages + appendix

Abstract

End-to-end autonomous driving systems powered by Vision-Language-Action (VLA) models achieve strong performance on common driving scenarios, yet remain brittle in rare but safety-critical long-tail situations such as active construction zones and complex yielding geometries. In this paper, we present a method that addresses the long-tail challenging scenes beyond data scaling and model training. We introduce CLAP (Contrastive Latent-space Prompt optimization), a location-aware adaptation framework that augments a frozen VLA driving model with per-roadblock soft prompts, optimized from crowdsourced data and retrieved on demand via Vehicle-to-Everything (V2X) communication. Our approach rests on two observations from VLAs' latent space: (i) at the VLA's hidden-state layer, scenarios from the same roadblock cluster tightly and occupy compact regions of the latent space; and (ii) within a single roadblock, long-tail and normal frames are heavily intermixed in the latent representation, making it difficult to improve one without disturbing the other. CLAP addresses this via a two-stage pipeline: supervised contrastive learning to discover a roadblock-specific hard-scene direction, followed by directionally regularized prompt optimization that selectively improves challenging frames while preserving normal frame performance. On the NAVSIM benchmark with various state-of-the-art VLA backbones, CLAP reduces challenging scenario planning error by 24% with no regression on normal frames, significantly improving planning performance.

2605.17283 2026-05-19 cs.CL cs.AI

OProver: A Unified Framework for Agentic Formal Theorem Proving

David Ma, Kaijing Ma, Shawn Guo, Yunfeng Shi, Enduo Zhao, Jiajun Shi, Zhaoxiang Zhang, Gavin Cheung, Jiaheng Liu, Zili Wang

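AI summary: This paper presents OProver, a unified framework for Lean 4 that improves agentic proving by iteratively revising failed attempts with retrieved compiler-verified proofs and Lean compiler feedback. Trained with continued pretraining and iterative post-training, OProver-32B achieves the best results on several benchmarks.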

Abstract

Recent progress in formal theorem proving has benefited from large-scale proof generation and verifier-aware training, but agentic proving is rarely integrated into prover training, appearing only at inference time. We present OProver, a unified framework for agentic formal theorem proving in Lean 4, in which failed proof attempts are iteratively revised using retrieved compiler-verified proofs and Lean compiler feedback. OProver is trained through continued pretraining followed by iterative post-training: each iteration runs agentic proving, indexes newly verified proofs into OProofs and the retrieval memory, uses repair trajectories as SFT data, and uses unresolved hard cases for RL. OProofs is built from public Lean resources, large-scale proof synthesis, and agentic proving traces, containing 1.77M Lean statements, 6.86M compiler-verified proofs, and serialized trajectories with retrieved context, failed attempts, feedback, and repairs. Across five benchmarks, OProver-32B attains the best Pass@32 on MiniF2F (93.3%), ProverBench (58.2%), and PutnamBench (11.3%), and ranks second on MathOlympiad (22.8%) and ProofNet (33.2%), more top placements than any prior open-weight whole-proof prover.

2605.17278 2026-05-19 cs.AI cs.LG

A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation

Qingchuan Ma, Yuexiao Ma, Yongkang Xie, Tianyu Xie, Xiawu Zheng, Rongrong Ji

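AI summary: This paper proposes A2RBench, an automatic paradigm whose generation, expansion, evaluation, and analysis pipeline makes abstract reasoning benchmark generation more efficient. It finds fundamental deficiencies in the abstract reasoning of current LLMs, and that inputs with higher information complexity can simplify reasoning.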

英文摘要

Abstract reasoning ability reflects the intelligence and generalization capacity of LLMs to extract and apply abstract rules. However, accurately measuring this ability remains challenging: existing benchmarks either rely on expensive manual annotation, limiting their scale, or risk measuring memorization rather than genuine reasoning. To address this, we introduce an automated pipeline named A2RBench, encompassing generation, expansion, evaluation, and analysis. Specifically, in the generation stage, LLMs create diverse tasks demanding genuine reasoning; in the expansion stage, LLMs reuse validated rules and expand new input spaces to generate task variations, achieving scaling. However, such a process may cause hallucinations. To eliminate it, we further establish a theoretical framework and prove that programmatic verification--testing whether the inverse operation perfectly reverses the forward operation (cycle consistency)--guarantees a unique solution. Through extensive evaluations on mainstream LLMs, we find: (1) Current LLMs exhibit fundamental deficiencies in abstract reasoning, with top models significantly underperforming humans on a representative subset (39.8% vs. 68.5%). (2) Current LLMs fall far short of 2D and 1D in the complexity of generated 3D tasks, revealing their lack of understanding of high-dimensional tasks. (3) Counterintuitively, inputs with higher information complexity can simplify the reasoning process.
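
The cycle-consistency check mentioned in this abstract can be illustrated with a minimal Python sketch. The grid rules `rotate_cw` and `rotate_ccw` and the helper `is_cycle_consistent` are hypothetical names chosen for illustration, not part of A2RBench, and the check below only tests sampled inputs; it does not reproduce the paper's uniqueness proof.

```python
from typing import Callable, Iterable, List

Grid = List[List[int]]


def rotate_cw(grid: Grid) -> Grid:
    """Hypothetical forward rule: rotate a 2D grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def rotate_ccw(grid: Grid) -> Grid:
    """Hypothetical inverse rule: rotate a 2D grid 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*grid)][::-1]


def is_cycle_consistent(
    forward: Callable[[Grid], Grid],
    inverse: Callable[[Grid], Grid],
    inputs: Iterable[Grid],
) -> bool:
    """Return True if inverse(forward(x)) == x for every sampled input x."""
    return all(inverse(forward(x)) == x for x in inputs)


samples = [[[1, 2], [3, 4]], [[0, 5, 6]], [[7], [8], [9]]]

# A well-formed rule pair passes the check on all sampled inputs.
rotation_ok = is_cycle_consistent(rotate_cw, rotate_ccw, samples)


def sort_rows(grid: Grid) -> Grid:
    """A lossy forward rule: sorting each row discards the original order."""
    return [sorted(row) for row in grid]


# Using the identity as a candidate inverse fails on any unsorted row.
lossy_ok = is_cycle_consistent(sort_rows, lambda g: g, [[[2, 1]]])
```

A task whose forward rule loses information, such as `sort_rows`, cannot pass this check with any inverse, which is the kind of ill-posed task such a filter is meant to reject.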

2605.17276 2026-05-19 cs.LG cs.AI

How Do Electrocardiogram Models Scale?

ECG模型如何扩展?

Jiawei Li, Fabio Bonassi, Ming Jin, Stefan Gustafsson, Johan Sundström, Thomas B. Schön, Antônio H. Ribeiro

AI总结 本文研究了ECG模型在不同规模下的扩展规律,发现监督学习模型在数据受限时表现不佳,而自监督学习模型在模型和数据规模上都具有鲁棒性,同时自监督Transformer在非常大的模型规模上超越了ResNet。

详情
AI中文摘要

尽管扩展定律已为自然语言处理中的基础模型建立了基本框架,但其在心电图(ECG)模型中的适用性仍缺乏充分的刻画。事实上,最近的研究表明,增大ECG模型规模或预训练数据集规模并不总能带来一致的下游性能提升,这使得架构归纳偏置、预训练范式的确切作用以及随规模增长可预期的改进幅度在很大程度上仍无定论。在本工作中,我们系统地研究了ECG领域内的神经扩展定律和损失到损失(loss-to-loss)扩展定律。通过在大规模CODE数据集(230万条记录)上预训练超过120个模型(参数量从2万到2亿不等),我们解耦了模型架构(ResNet vs. Transformer)和预训练范式(监督学习SL vs. 自监督学习SSL)的影响。我们发现:(i)SL模型在分布内受数据瓶颈限制,而SSL模型在模型规模和数据规模上都能稳健扩展;(ii)对于分布外(OOD)泛化,ResNet的参数效率是Transformer的1.3到2.5倍,而SSL的数据效率最高可达SL的16倍,并在未见过的临床任务上实现了最高为SL的7.6倍的迁移效率;(iii)在观察到的规模范围内,基于ResNet的模型通常取得最低的OOD损失,SSL在未见过的临床任务上占据主导地位,而自监督Transformer在非常大的模型规模上实现反超。我们的结果表明,通往有效ECG基础模型的路径在于架构与范式的策略性匹配,而非单纯的暴力扩展。

英文摘要

While scaling laws have established a fundamental framework for foundation models in natural language processing, their applicability to electrocardiogram (ECG) models remains poorly characterized. Indeed, recent studies do not always yield consistent downstream gains as one increases the model size or pre-training dataset size of ECG models, leaving the exact roles of architectural inductive biases, pre-training paradigms, and expected improvements with size largely unanswered. In this work, we systematically investigate neural and loss-to-loss scaling laws within the ECG domain. By pre-training over $120$ models (ranging from $20$K to $200$M parameters) on the large-scale CODE dataset ($2.3$M records), we decouple the effects of model architecture (ResNet vs. Transformer) and pre-training paradigm, namely supervised learning (SL) versus self-supervised learning (SSL). We found that (i) SL models are data-bottlenecked in-distribution, whereas SSL models scale robustly across both model and data sizes; (ii) for out-of-distribution (OOD) generalization, ResNets are $1.3$ to $2.5$ times more parameter-efficient than Transformers, while SSL is up to $16$ times more data-efficient and achieves up to $7.6$ times higher transfer efficiency than SL on unseen clinical tasks; (iii) across the observed scales, ResNet-based models generally achieve the lowest OOD loss, with SSL dominating on unseen clinical tasks and self-supervised Transformers overtaking at very large model sizes. Our results suggest that the path to effective ECG foundation models lies in the strategic alignment of architecture and paradigm rather than brute-force scaling.
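
As a rough illustration of how a scaling law is typically fitted, the sketch below estimates a power law of the form loss ≈ a · size^(−b) by ordinary least squares in log-log space. The functional form, the function name `fit_power_law`, and the synthetic data points are assumptions made for this example; the abstract does not state which parametric form the authors fit, and none of the numbers below come from the paper.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(sizes: Sequence[float], losses: Sequence[float]) -> Tuple[float, float]:
    """Fit loss ~= scale * size ** (-exponent) via least squares on log-log data."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # In log space: log(loss) = log(scale) - exponent * log(size).
    return math.exp(intercept), -slope


# Synthetic, noise-free data (NOT results from the paper), spanning 20K to 200M parameters.
sizes = [2e4, 2e5, 2e6, 2e7, 2e8]
losses = [3.0 * s ** -0.1 for s in sizes]

scale, exponent = fit_power_law(sizes, losses)
```

On noise-free synthetic data the fit recovers the generating parameters; with real measurements, the residuals of this fit are what would reveal whether a single power law is adequate.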

2605.17270 2026-05-19 cs.CV

Beyond Detection: A Structure-Aware Framework for Scene Text Tracking

超越检测:一种结构感知的场景文本跟踪框架

Chenmin Yu, Liu Yu, Daiqing Wu, Gengluo Li, Zeyu Chen, Yu Zhou

AI总结 本文提出了一种结构感知的场景文本跟踪框架SymTrack,针对场景文本跟踪中的几何失真、视觉模糊和结构细节敏感性等挑战,通过双分支设计和自适应推理引擎实现了高效的文本跟踪,提升了视频文本处理的能力。

Comments Accepted at ICML 2026. Code is available at: [https://github.com/EdisonYCM/SymTrack]

详情
AI中文摘要

现代视觉目标跟踪器在一般目标上表现优异,但在处理场景文本时性能显著下降。尽管目前研究较少,但视频中的文本跟踪对于分割、移除和编辑等动态文本操作至关重要。为填补这一空白,本文将此特定任务正式定义为场景文本跟踪,并提出了首个系统性的工作。我们识别了该任务的三个主要挑战:1) 透视变化带来的严重几何失真,2) 不同实例之间的高视觉模糊性,3) 对细粒度结构细节的高敏感性。为解决这些问题,我们提出了SymTrack,一种无检测的统一框架,具有协同的双分支设计。它集成了Cross-Expert Calibration机制以减少语义偏差,以及Predictive Token Rectification机制以纠正结构不平衡,并辅以Adaptive Inference Engine以在运动约束下稳定预测。考虑到该任务缺乏专用基准,我们利用三个视频文本定位数据集构建了一个具有高质量注释的基准。大量实验表明,SymTrack在所有三个基准上均达到了新的最先进水平(state-of-the-art),在BOVText_SOT上的AUC比先前最佳跟踪器最多高出11.97%。总体而言,我们的工作促进了高效且全面的文本跟踪,为更通用的视频文本处理铺平了道路。

英文摘要

Modern visual object trackers show impressive results on general targets, yet their performance drops substantially when dealing with scene text. Although currently underexplored, tracking text in videos is essential for dynamic text manipulations such as segmentation, removal, and editing. To fill this gap, this paper formalizes this specific task as Scene Text Tracking and presents the first systematic work on it. We identify three primary challenges in this task: 1) severe geometric distortions from perspective shifts, 2) high visual ambiguity across different instances, and 3) high sensitivity to fine-grained structural details. To address these issues, we propose SymTrack, a unified detection-free framework with a synergistic dual-branch design. It integrates a Cross-Expert Calibration mechanism to reduce semantic bias, along with a Predictive Token Rectification mechanism to correct structural imbalances, complemented by an Adaptive Inference Engine that stabilizes predictions under motion constraints. Considering the lack of dedicated benchmarks for this task, we utilize three datasets from video text spotting to construct a benchmark with high-quality annotations. Extensive experiments demonstrate that SymTrack sets a new state of the art on all three benchmarks, outperforming previous best trackers by up to 11.97% AUC on $\text{BOVText}_{\text{SOT}}$. Overall, our work promotes efficient and thorough text tracking, paving the way toward more generalized video text manipulation.

2605.17269 2026-05-19 cs.LG stat.ML

Calibeating for general proper losses: A Bregman divergence approach

一般恰当损失的Calibeating:一种基于Bregman散度的方法

Maximilian Fichtl, Cristóbal Guzmán, Nishant A. Mehta

AI总结 本文提出了一种基于遗憾最小化的通用calibeating框架,考虑了包括α-Tsallis损失(α∈[1,2])和Lipschitz损失在内的广泛恰当损失族,同时给出了关于Be The Regularized Leader的新遗憾等式。

Comments 31 pages

详情
AI中文摘要

本文介绍了一种基于遗憾最小化的通用calibeating框架。Foster和Hart的开创性calibeating工作对Brier分数(平方损失)和log损失分别做了专门处理;与之相比,我们考虑了一大类恰当损失,其中包括α-Tsallis损失(α∈[1,2])和Lipschitz损失。我们关于Tsallis损失的结果同样适用于能够恢复log损失的未缩放版本Tsallis损失。我们的分析围绕恰当损失的Bregman散度视角展开。技术上,我们针对所考虑的Tsallis损失族的结果是U-calibration结果:对该族中的所有损失同时获得对数遗憾,并且与先前结果相比对维度的依赖更弱。另外可能具有独立价值的是,我们给出了关于Be The Regularized Leader遗憾的一个新的遗憾等式。该等式适用于一般恰当损失,其本身基于两个与广义方差在线更新公式相关的结果,而广义方差是此前提出的一种基于Bregman divergence的方差推广。

英文摘要

This work introduces a general framework for calibeating based on regret minimization. As compared to Foster and Hart's seminal calibeating work which had specialized treatments of Brier score (squared loss) and log loss, we consider a large family of proper losses that includes $α$-Tsallis losses (for $α\in [1, 2]$) and Lipschitz losses. Our results for Tsallis losses also hold for an unscaled version of Tsallis loss that recovers log loss. Our analysis is oriented around the Bregman divergence view of a proper loss. Technically, our results for the family of Tsallis losses that we consider are U-calibration results, simultaneously obtaining logarithmic regret for all losses in this family while having a weaker dependence on the dimension compared to previous results. Of potential independent interest, we also show a new regret equality for the regret of Be The Regularized Leader. This regret equality holds for general proper losses and itself is based on two results related to online updating formulas for the generalized variance, the latter being a previously introduced generalization of variance based on Bregman divergences.
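
For readers unfamiliar with the Bregman divergence view referenced in this abstract, the standard definition and its two best-known special cases are sketched below. These are textbook identities rather than results from the paper; the symbol $\phi$ for the convex generator is a notational assumption, and the Tsallis case is deliberately omitted because its scaling convention is not given in the abstract.

```latex
% Bregman divergence generated by a differentiable convex function \phi
D_{\phi}(p, q) = \phi(p) - \phi(q) - \langle \nabla \phi(q),\, p - q \rangle

% Squared-norm generator: the divergence associated with the Brier score (squared loss)
\phi(p) = \|p\|_2^2
\quad\Longrightarrow\quad
D_{\phi}(p, q) = \|p - q\|_2^2

% Negative Shannon entropy on the probability simplex: the divergence associated with log loss
\phi(p) = \sum_i p_i \log p_i
\quad\Longrightarrow\quad
D_{\phi}(p, q) = \sum_i p_i \log \frac{p_i}{q_i}
```

In both cases the expected excess loss of forecasting $q$ when the true distribution is $p$ equals $D_{\phi}(p, q)$, which is why a single Bregman-based analysis can cover both losses.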

2605.17265 2026-05-19 cs.LG

When Molecular Similarity Works: Property Cliffs Reveal Hidden Errors

当分子相似性起作用:属性悬崖揭示隐藏的错误

Di Hu, Kun Li, Haojie Rao, Longtao Hu, Jiameng Chen, Wenbin Hu, Yizhen Zheng, Jiajun Yu, Duanhua Cao

AI总结 研究通过属性悬崖揭示分子相似性失效的问题,提出CliffSplit和CliffLoss方法来评估和缓解模型在局部区域的错误。

Comments Preprint, 22 pages, 10 figures, 11 tables. Di Hu and Kun Li contributed equally

详情
AI中文摘要

准确预测分子性质是药物发现和材料设计的基础,然而即使最先进的模型仍容易出现聚合指标无法检测的局部失效模式。分子相似性本应最有帮助的地方,也恰恰是标准评估最容易产生误导的地方。属性悬崖暴露了这一差距:结构相似的分子在目标性质上仍可能差异显著,因此整体性能有竞争力的模型可能在高风险的局部邻域中失效。为揭示并缓解这种失效模式,本文引入了CliffSplit,一种构建具有局部支持、暴露悬崖的测试用例的悬崖感知评估协议,以及CliffLoss,一种与模型无关、仅在训练阶段使用的悬崖敏感错误缓解机制。在三个QM9目标和三个MoleculeNet任务上、基于五种骨干网络的实验表明,CliffSplit在悬崖密集的QM9区域揭示了至少高出15%的误差,而CliffLoss在Lipophilicity(亲脂性)上将悬崖与平滑区域之间的误差差距最多缩小30%,并将整体MAE改善9.7%。这些结果将分子相似性失效从描述性异常转变为分子机器学习中可基准化的评估问题。代码可在https://anonymous.4open.science/r/Cliff_Loss获取。

英文摘要

Accurate prediction of molecular properties underpins drug discovery and material design, yet even state-of-the-art models remain vulnerable to localized failure modes that aggregate metrics cannot detect. The places where molecular similarity should be most helpful are also places where standard evaluation can be most misleading. Property cliffs expose this gap: structurally similar molecules can still differ sharply in target property, so models with competitive overall performance may fail in high-risk local neighborhoods. To expose and mitigate this failure mode, CliffSplit, a cliff-aware evaluation protocol that constructs locally supported, cliff-exposed test cases, and CliffLoss, a model-agnostic train-only mitigation mechanism for cliff-sensitive errors, are introduced. Experiments on three QM9 targets and three MoleculeNet tasks across five backbones show that CliffSplit reveals at least 15% higher error in cliff-heavy QM9 regions, while CliffLoss reduces the cliff-to-smooth error gap by up to 30% on Lipophilicity and improves overall MAE by 9.7%. Together, these results turn molecular similarity failure from a descriptive anomaly into a benchmarked evaluation problem for molecular machine learning. The code is available at https://anonymous.4open.science/r/Cliff_Loss.

2605.17264 2026-05-19 cs.RO

Stretch-ICP: A Continuous-Trajectory Registration and Deskewing Algorithm in Scenarios of Aggressive Motions

Stretch-ICP: 一种在剧烈运动场景下的连续轨迹配准与校正算法

Simon-Pierre Deschênes, Veronica Vannini, Philippe Giguère, François Pomerleau

AI总结 本文提出Stretch-ICP算法,通过改进SLAM的鲁棒性,以提高在剧烈运动下的激光雷达-惯性导航状态估计的鲁棒性和一致性,同时减少了线速度和角速度的估计误差。

Comments 29 pages, 16 figures, published in Sensors 2026, 26(8), 2567, special issue "New Challenges and Sensor Techniques in Robot Positioning"

详情
Journal ref
Sensors 2026, 26(8), 2567
AI中文摘要

在复杂环境中实现稳健的机器人自主性仍然具有挑战性:在不平或湿滑地形上失去稳定性可能引发极端的加速度和角速度。这类运动会破坏传感器测量并降低状态估计的精度,因此需要更鲁棒的算法。为研究此问题,我们引入了Tumbling-Induced Gyroscope Saturation (TIGS)数据集,其中包含机械式激光雷达和惯性测量单元(IMU)沿山坡翻滚而下时的记录。该数据集中的角速度最高可达同类数据集的四倍,并已公开可用。随后,我们提出了两种互补的方法来提高同步定位与建图(SLAM)的鲁棒性,并在TIGS上对其进行了评估。首先,Saturation-Aware Angular Velocity Estimation (SAAVE)在剧烈运动导致陀螺仪测量饱和时估计角速度,将角速度估计误差降低83.4%。其次,Stretch-ICP是一种新的配准与去畸变(deskewing)算法,与经典迭代最近点(ICP)算法相比,能够在剧烈运动下重建更平滑的六自由度(DOF)轨迹。Stretch-ICP在扫描边界处将线速度和角速度误差分别降低95.2%和94.8%。这些贡献共同提高了剧烈运动下激光雷达-惯性状态估计的鲁棒性和一致性。

英文摘要

Robust robotic autonomy remains challenging in complex environments, where loss of stability on uneven or slippery terrain can induce extreme accelerations and angular velocities. Such motions corrupt sensor measurements and degrade state estimation, motivating the need for improved algorithmic robustness. To investigate this issue, we introduce the Tumbling-Induced Gyroscope Saturation (TIGS) dataset, which consists of recordings from a mechanical lidar and an Inertial Measurement Unit (IMU) tumbling down a hill. The dataset contains angular speeds up to four times higher than those in similar datasets and is publicly available. We then propose two complementary methods to improve Simultaneous Localization And Mapping (SLAM) robustness and evaluate them on TIGS. First, Saturation-Aware Angular Velocity Estimation (SAAVE) estimates angular velocities when gyroscope measurements become saturated during aggressive motions, reducing angular speed estimation error by 83.4%. Second, Stretch-ICP, a novel registration and deskewing algorithm, enables reconstruction of smoother 6-Degrees Of Freedom (DOF) trajectories under aggressive motions compared to classical Iterative Closest Point (ICP). Stretch-ICP reduces linear and angular velocity errors by 95.2% and 94.8%, respectively, at scan boundaries. Together, these contributions improve the robustness and consistency of lidar-inertial state estimation under aggressive motions.

2605.17262 2026-05-19 cs.CV

EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning

EgoIntrospect: 一个用于以用户为中心的内部状态推理的第一人称视角数据集和基准

Zeyu Wang, Chang Liu, Eduardus Tjitrahardja, Yuntao Wang, Borislav Pavlov, Fangfei Gou, Jose Manuel Davila, Dai Shi, Ran Xu, Yue Pan, Jiayi Tan, Shuting Chang, Qi Wang, Jinzhao Li, Jiacheng Hua, Yifei Huang, Jingwei Sun, Yu Zhang, Liuxin Zhang, Guocai Yao, Jia Jia, Yin Li, Qianying Wang, Yuanchun Shi, Miao Liu

AI总结 本文提出EgoIntrospect数据集,用于研究以用户为中心的内部状态推理,通过自我标注揭示用户与AI助手的交互意图,评估多模态大语言模型从第一人称视角观察中推理用户内部状态的能力。

详情
AI中文摘要

尽管在第一人称视角(egocentric)视频数据集和基准方面已有大量工作,但对用户内部状态的理解在很大程度上仍被忽视,而这对于实现无缝的AI助手体验至关重要。在本文中,我们介绍了EgoIntrospect,这是第一个在用户驱动场景中采集、并带有自我标注的第一人称视角数据集,其标注明确揭示了用户与AI助手的交互意图。EgoIntrospect使用跨设备设置采集,提供同步的视频、音频、注视、运动和生理信号。它包含来自60名受试者的180小时记录,平均每人记录3小时。利用EgoIntrospect,我们形式化了一组围绕用户内部状态的任务,包括情感体验、交互意图和认知记忆。我们进一步处理标注以构建基准,用于评估现代多模态大语言模型从第一人称视角观察中推理用户内部状态的能力。在我们基准上的实验表明,现有的多模态大语言模型难以有效利用多模态信号来推断用户的主观内部状态。该数据集和标注将向公众开放,以促进第一人称视觉和可穿戴AI助手的研究。项目页面:https://ego-introspect.github.io/

英文摘要

Despite extensive efforts on egocentric video datasets and benchmarks, understanding users' internal states, which is crucial for enabling seamless AI assistant experiences, remains largely overlooked. In this work, we introduce EgoIntrospect, the first egocentric dataset captured in user-driven scenarios with self-annotations that explicitly reveal users' interactive intentions with AI assistants. EgoIntrospect was collected using a cross-device setup, providing synchronized video, audio, gaze, motion, and physiological signals. It consists of 180 hours of recordings from 60 subjects, with an average recording duration of 3 hours per subject. Leveraging EgoIntrospect, we formalize a suite of tasks centered on user internal states, including affective experience, interactive intent, and cognitive memory. We further process the annotations to construct benchmarks that evaluate the ability of modern multimodal large language models to reason about users' internal states from egocentric observations. Experiments on our benchmark suggest that existing multimodal large language models struggle to effectively leverage multimodal signals to infer users' subjective internal states. The dataset and annotations will be made publicly available to advance research in egocentric vision and wearable AI assistants. Project page: https://ego-introspect.github.io/

2605.17255 2026-05-19 cs.AI math.OC

CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean

CAM-Bench: 一个用于Lean中的计算与应用数学的基准测试

Wentao Long, Yunfei Zhang, Chenyi Li, Li Zhou, Chumin Sun, Zaiwen Wen

AI总结 本文提出CAM-Bench,一个包含1000个Lean证明目标的基准测试,涵盖优化、数值线性代数和数值分析等领域,旨在补充现有形式化数学基准测试,通过针对依赖教科书概念和基本定理的应用数学问题进行评估。

Comments Preprint. 44 pages, 7 figures

详情
AI中文摘要

形式化定理证明基准测试能够机械地验证大语言模型中的数学推理能力。然而,现有基准测试主要集中在竞赛式问题和代数领域,导致计算与应用数学代表性不足。我们引入CAM-Bench,一个包含1000个Lean 4形式化证明目标的基准测试,涵盖优化、数值线性代数和数值分析等领域。这些问题改编自教科书练习,通常依赖于局部引入的定义、符号、算法和基本结果。为了构建CAM-Bench,我们开发了一个依赖恢复流水线,用于重建每个问题所需的本地教科书上下文。然后,它将每个问题标准化为一个独立的非正式定理,并将其翻译成Lean目标。我们通过Lean编译和语义审查验证最终的形式化问题,检查形式正确性和与原始练习的语义一致性。对于每个问题,我们发布了原始练习、恢复的上下文、标准化的非正式定理和最终的Lean目标。CAM-Bench通过针对依赖教科书概念和基本定理的应用数学问题补充现有形式化数学基准测试,其中许多问题无法直接作为标准Mathlib4引理使用。我们评估了广泛使用的大型语言模型和形式化代理在CAM-Bench上的表现,并分析了在跟踪局部假设、应用基本结果、分解证明和维护长距离控制时的常见失败模式。

英文摘要

Formal theorem-proving benchmarks enable mechanically verifiable evaluation of mathematical reasoning in large language models. However, existing benchmarks mainly focus on Olympiad-style problems and algebraic domains, leaving computational and applied mathematics underrepresented. We introduce CAM-Bench, a Lean 4 theorem-proving benchmark of 1,000 Lean proof targets in computational and applied mathematics, with coverage spanning optimization, numerical linear algebra, and numerical analysis. These problems are adapted from textbook exercises and often depend on locally introduced definitions, notation, algorithms, and elementary results. To construct CAM-Bench, we develop a dependency-recovery pipeline that reconstructs the local textbook context needed to state each problem faithfully. It then normalizes each problem into a standalone informal theorem and translates it into a Lean target. We validate the resulting formal problems through Lean compilation and semantic review, checking both formal correctness and semantic alignment with the original exercises. For each problem, we release the raw exercise, recovered context, normalized informal theorem, and final Lean target. CAM-Bench complements existing formal mathematics benchmarks by targeting applied mathematics problems that rely on textbook concepts and elementary theorems, many of which are not directly available as standard Mathlib4 lemmas. We evaluate widely used large language models and formalization agents on CAM-Bench, and analyze common failure modes in tracking local assumptions, applying elementary results, decomposing proofs, and maintaining long-horizon control in Lean.

2605.17252 2026-05-19 cs.CV cs.GR

Monocular Depth Perception Enhancement Based on Joint Shading/Contrast Model and Motion Parallax (JSM)

基于联合阴影/对比模型和运动视差(JSM)的单目深度感知增强

Seungchul Ryu, Hyunjin Yoo, Tara Akhavan

AI总结 本文提出了一种新的JSM框架,用于增强单目深度感知,显著提高了深度体积感知和深度范围感知,该框架不仅适用于传统2D显示设备,也可用于3D显示设备,因为它能补充双目深度线索。

详情
AI中文摘要

立体3D显示器采用双目深度线索来提供深度感知。然而,用户需要配备昂贵的专用设备才能欣赏基于双目深度线索的深度感知。此外,立体显示器引起的视觉疲劳仍然是一个具有挑战性的问题。为了克服这一限制,本文提出了一种新的框架JSM,用于增强单目深度感知,显著提高了深度体积感知和深度范围感知。所提出的框架不仅可以在任何传统2D显示设备上提供增强的深度感知,还可以应用于3D显示设备,因为它能够补充双目深度线索。定性评估、消融研究和主观用户评估证明了所提出框架的优势和实用性。

英文摘要

Stereoscopic 3D displays rely on binocular depth cues to provide depth perception. However, users must be equipped with expensive special devices to experience depth perception based on binocular depth cues. In addition, visual fatigue induced by stereoscopic displays remains a challenging open problem. To overcome these limitations, this paper proposes a novel framework, JSM, to enhance monocular depth perception, significantly improving both depth volume perception and depth range perception. The proposed framework not only provides enhanced depth perception on any conventional 2D display device but is also applicable to 3D display devices, since it is complementary to binocular depth cues. Qualitative evaluation, an ablation study, and a subjective user evaluation demonstrate the advantages and practicality of the proposed framework.

2605.17250 2026-05-19 cs.LG

Towards Principled Test-Time Adaptation for Time Series Forecasting

面向时间序列预测的原理性测试时间适应方法

Haochun Wang, Ruichen Xu, Georgios Kementzidis, Karen Cho, Sebastian Ramirez Villarreal, Yuefan Deng

AI总结 本文提出了一种基于频率域的轻量级校准方法FAC,用于改进时间序列预测在分布偏移下的适应性,通过频率域分析现有适配器的预测修正,并在多种数据集和预测时间上实现了更高效和一致的性能。

详情
AI中文摘要

测试时适应(TTA)近来成为在分布偏移下改进时间序列预测(TSF)的一种有前景的方法。现有的TSF-TTA方法在如何利用已揭示的目标值上各不相同,由此产生的适应协议异质且缺乏清晰统一的形式化表述。为了解决这个问题,我们从协议洁净性(protocol cleanliness)的角度重新审视TSF-TTA,并提出一种仅基于已成熟真实值(matured ground truth)的适应协议,从而得到更具原则性的适应设置。在该协议下,我们进一步在频域中诊断现有适配器,发现其预测修正往往只表现出有限且结构性较弱的频谱修改。受此诊断启发,我们提出了频率感知校准(FAC),一种直接在频域中参数化预测修正的轻量级校准方法。在多种数据集、预测时间跨度和源预测器上,FAC取得了有竞争力且稳定一致的性能,同时所需的可训练参数显著少于所比较的TSF-TTA适配器。

英文摘要

Test-time adaptation (TTA) has recently emerged as a promising approach for improving time series forecasting (TSF) under distribution shift. Existing TSF-TTA methods differ in how they utilize revealed targets, yet the resulting adaptation protocols remain heterogeneous and lack a clearly unified formulation. To address this issue, we revisit TSF-TTA from the perspective of protocol cleanliness and propose an adaptation protocol based solely on matured ground truth, yielding a more principled setting for adaptation. Under this protocol, we further diagnose existing adapters in the frequency domain and find that their prediction corrections often exhibit limited and weakly structured spectral modifications. Motivated by this diagnosis, we propose Frequency-Aware Calibration (FAC), a lightweight calibration method that directly parameterizes prediction corrections in the frequency domain. Across diverse datasets, forecasting horizons, and source forecasters, FAC achieves competitive and consistent performance while requiring substantially fewer trainable parameters than the compared TSF-TTA adapters.
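
To make the idea of parameterizing a forecast correction in the frequency domain concrete, here is a minimal Python sketch. It is not the authors' FAC method: the low-frequency Fourier basis, the projection-based fit, and the names `fit_low_frequency_correction` and `apply_correction` are all assumptions chosen for illustration. The correction is described by one offset plus a few cosine/sine coefficients instead of one free value per forecast step.

```python
import math
from typing import List, Sequence, Tuple

Harmonics = List[Tuple[float, float]]


def fit_low_frequency_correction(
    mean_residual: Sequence[float], num_harmonics: int
) -> Tuple[float, Harmonics]:
    """Project the mean residual (truth - forecast) onto a few low frequencies.

    Assumes num_harmonics < len(mean_residual) / 2 so the basis is orthogonal.
    """
    horizon = len(mean_residual)
    offset = sum(mean_residual) / horizon
    harmonics: Harmonics = []
    for k in range(1, num_harmonics + 1):
        angle = 2.0 * math.pi * k / horizon
        cos_coef = 2.0 / horizon * sum(r * math.cos(angle * t) for t, r in enumerate(mean_residual))
        sin_coef = 2.0 / horizon * sum(r * math.sin(angle * t) for t, r in enumerate(mean_residual))
        harmonics.append((cos_coef, sin_coef))
    return offset, harmonics


def apply_correction(forecast: Sequence[float], offset: float, harmonics: Harmonics) -> List[float]:
    """Add the frequency-parameterized correction to a forecast of the same horizon."""
    horizon = len(forecast)
    corrected = []
    for t, y in enumerate(forecast):
        delta = offset
        for k, (cos_coef, sin_coef) in enumerate(harmonics, start=1):
            angle = 2.0 * math.pi * k * t / horizon
            delta += cos_coef * math.cos(angle) + sin_coef * math.sin(angle)
        corrected.append(y + delta)
    return corrected


# Synthetic residual (NOT data from the paper): a constant bias plus one slow oscillation.
horizon = 8
residual = [0.5 + 0.2 * math.cos(2.0 * math.pi * t / horizon) for t in range(horizon)]
offset, harmonics = fit_low_frequency_correction(residual, num_harmonics=1)
corrected = apply_correction([0.0] * horizon, offset, harmonics)
```

With one harmonic the correction uses three parameters regardless of the horizon length, which illustrates how a frequency-domain parameterization can stay lightweight.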

2605.17248 2026-05-19 cs.CV

Image-to-Video Diffusion: From Foundations to Open Frontiers

图像到视频扩散:从基础到开放前沿

Xianlong Wang, Wenbo Pan, Shijia Zhou, Ke Li, Yuqi Wang, Zeyu Ye, Hangtao Zhang, Leo Yu Zhang, Xiaohua Jia

AI总结 本文研究了图像到视频扩散生成的核心问题,通过梳理任务定义、模型架构、数据集和评估指标,提出了一种基于架构和训练范式的分类方法,并总结了四个核心设计:条件编码、时间建模、噪声先验设计和空间时间上采样,探讨了代表性应用场景和主要开放挑战。

详情
AI中文摘要

基于扩散的图像到视频(I2V)生成已成为生成模型中的核心方向,通过将参考图像(可选条件)转换为时间一致的视频。与更广泛的视频生成设置相比,该任务对内容一致性、身份保持和运动一致性提出了更严格的要求。尽管文献增长迅速,现有工作大多在更广泛的主题中讨论I2V生成,仍缺乏专门的分类法和以该领域为中心的系统分析。本文通过将扩散I2V生成视为独立主题来填补这一空白。首先回顾了任务定义、模型架构、数据集和评估指标,然后通过基于架构和训练范式的分类法组织现有方法。进一步总结了四个核心设计,即条件编码、时间建模、噪声先验设计和空间时间上采样,并讨论了代表性应用场景和主要开放挑战。

英文摘要

Diffusion-based image-to-video (I2V) generation has become a central direction in generative models by turning a reference image, with optional conditions, into a temporally coherent video. Compared with broader video generation settings, this task places stricter demands on content consistency, identity preservation, and motion coherence. Although the literature grows rapidly, existing works mostly discuss I2V generation within broader topics and still lack a dedicated taxonomy together with a systematic analysis centered on this field. This work addresses that gap by treating diffusion I2V generation as a standalone subject. It first reviews the task formulation, model architectures, datasets, and evaluation metrics, and then organizes existing methods through a taxonomy based on architecture and training paradigm. It further distills four core designs, namely condition encoding, temporal modeling, noise prior design, and spatial-temporal upsampling, and discusses representative application scenarios together with major open challenges.

2605.17247 2026-05-19 cs.AI

Towards Robust Argumentative Essay Understanding via TIDE: An Interactive Framework with Trial and Debate

通过TIDE实现稳健的论辩论文理解:一个具有试错和辩论机制的交互框架

Zheqin Yin, Yupei Ren, Yadong Zhang, Yujiang Lu, Man Lan

AI总结 本文提出TIDE框架,通过整合试错和辩论机制,改进基于标准的提示优化,以提高论辩任务的理解和评估能力,实验表明其在自动作文评分、论点组件检测和论点关系识别任务中均提升了性能。

详情
AI中文摘要

论辩论文是评估批判性思维和推理能力的重要媒介,但目前关于通过提示准确理解和评估此类文本的研究有限。在本工作中,我们提出了TIDE,一种新的框架,旨在通过整合TrIal和DEbate机制,改进基于标准的提示优化,以提高与论辩相关的任务。我们的方法通过减轻噪声训练数据的影响并增强优化稳定性,解决了基于标准的提示优化的关键限制。我们评估了TIDE在三个核心任务上的表现:自动作文评分、论点组件检测和论点关系识别。结果表明,我们的框架在各项任务中均提升了性能。这些发现凸显了结合基于提示的方法在高级论辩理解中的潜力。

英文摘要

Argumentative essays serve as a vital medium for assessing critical thinking and reasoning skills, yet there is limited work on accurately understanding and evaluating such texts via prompting. In this work, we propose TIDE, a novel framework designed to improve criteria-based prompt optimization for argument-related tasks by integrating TrIal and DEbate mechanisms. Our method addresses key limitations of criteria-based prompt optimization by mitigating the influence of noisy training data and enhancing optimization stability. We evaluate TIDE on three core tasks: Automated Essay Scoring, Argument Component Detection, and Argument Relation Identification. Results demonstrate that our framework improves performance across tasks. These findings underscore the potential of combining prompt-based methods for advanced argument understanding.

2605.17246 2026-05-19 cs.LG cs.AI

Fidelity Probes for Specification--Code Alignment

规范-代码对齐的保真度探针

Ferhat Erata, Hao Zhou, Luke Huan

AI总结 本文提出保真度探针,通过从参考artifact生成的自然语言问题和代码派生的地面真实答案,从候选规范中回答问题。保真度是同意探针的比例,分解为矛盾率和覆盖缺口率,驱动针对性的规范编辑以达到收敛。

Comments 29 pages, 14 figures, 11 tables

详情
AI中文摘要

我们引入了保真度探针:从参考artifact生成的自然语言问题,带有由代码导出的真值(ground-truth)答案,并依据候选规范作答。答案一致的探针所占比例称为保真度,它可分解为矛盾率和覆盖缺口率,二者驱动针对性的规范编辑直至收敛。在一个包含15个程序、约1.2万行的COBOL基准(AWS CardDemo)上,我们通过八次迭代将冻结测试集上的规范保真度从0.63提升到0.94,其平台位置可由一个两状态马尔可夫不动点$F^\dagger$仅凭四次迭代的速率数据预测。探针来自读取代码的LLM,或来自在其控制流图、数据流图和系统依赖图上运行的静态分析流水线,二者的混合比例可调。带有冻结留出集的探针重采样协议给出了一个Hoeffding有界的过拟合判别量;我们测得的训练/测试差距比该包络低一个数量级以上。三种基于图的混合将保真度提升了16到30个点;跨分布评估表明LLM通道和符号通道在经验上是互补的。在五个独立LLM谱系(Anthropic、DeepSeek、Google、Alibaba、OpenAI)上进行的跨家族生成器扫描证实,收敛行为并不依赖于任何单一模型家族:五个非Claude生成器中有三个产生了与马尔可夫不动点预测一致的轨迹,而冻结测试协议主动证伪了另外两个探针分布随迭代漂移的生成器。该方法适用于任何本应描述相同行为的一对artifact。

英文摘要

We introduce fidelity probes: natural-language questions generated from a reference artifact with code-derived ground-truth answers, answered from a candidate specification. The fraction of agreeing probes, which we call the fidelity, decomposes into contradiction and coverage-gap rates that drive targeted spec edits to convergence. On a 15-program, roughly 12k-line COBOL benchmark (AWS CardDemo), we raise frozen-test specification fidelity from 0.63 to 0.94 over eight iterations, with the plateau location predicted by a two-state Markov fixed point $F^\dagger$ from just four iterations of rate data. Probes come from an LLM reading the code or from a static-analysis pipeline over its control-flow, data-flow, and system-dependence graphs, with a tunable mixture. A probe-resampling protocol with a frozen held-out set gives a Hoeffding-bounded overfitting discriminant; our measured train/test gap stays more than an order of magnitude below this envelope. Three graph-grounded mixtures lift fidelity by +16 to +30 points; cross-distribution evaluation shows the LLM and symbolic channels are empirically complementary. A cross-family generator sweep on five independent LLM lineages (Anthropic, DeepSeek, Google, Alibaba, OpenAI) confirms the convergence behaviour is not tied to any single model family: three of five non-Claude generators produce trajectories consistent with the Markov fixed-point prediction, and the frozen-test protocol actively falsifies the two generators whose probe distributions drift across iterations. The method applies to any pair of artifacts that are supposed to describe the same behaviour.
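
The quantities named in this abstract can be sketched in a few lines of Python. Everything below is an illustrative assumption rather than the authors' implementation: the verdict labels (`"agree"`, `"contradict"`, `"gap"`), the function names, and in particular the reading of the two-state Markov fixed point as the stationary fraction p / (p + q) of a chain where a failing probe is fixed with probability p per iteration and a passing probe breaks with probability q. The Hoeffding half-width is the standard two-sided bound for the mean of n independent values in [0, 1].

```python
import math
from collections import Counter
from typing import Dict, Sequence


def fidelity_breakdown(verdicts: Sequence[str]) -> Dict[str, float]:
    """Split probe verdicts into fidelity, contradiction rate, and coverage-gap rate."""
    counts = Counter(verdicts)
    n = len(verdicts)
    return {
        "fidelity": counts["agree"] / n,
        "contradiction_rate": counts["contradict"] / n,
        "coverage_gap_rate": counts["gap"] / n,
    }


def hoeffding_halfwidth(num_probes: int, delta: float = 0.05) -> float:
    """Half-width eps such that P(|empirical - true| >= eps) <= delta for [0, 1] outcomes."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * num_probes))


def markov_fixed_point(fix_rate: float, break_rate: float) -> float:
    """Stationary agreeing fraction of a two-state chain (assumed model, see lead-in)."""
    return fix_rate / (fix_rate + break_rate)


# Hypothetical verdicts for five probes (NOT data from the paper).
verdicts = ["agree", "agree", "agree", "contradict", "gap"]
rates = fidelity_breakdown(verdicts)
envelope = hoeffding_halfwidth(num_probes=200, delta=0.05)
plateau = markov_fixed_point(fix_rate=0.3, break_rate=0.1)
```

Under these assumptions, a measured train/test fidelity gap well below `envelope` would be consistent with no detectable overfitting to the probe set, and `plateau` is the fidelity level at which fixes and regressions balance.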

2605.17244 2026-05-19 cs.LG cs.AI

Drift Flow Matching

漂移流匹配

Chenrui Ma, Xi Xiao, Lin Zhao, Tianyang Wang, Ferdinando Fioretto, Yanning Shen

AI总结 本文提出Drift Flow Matching框架,结合漂移生成模型与基于流的迭代生成方法,实现高效生成与多步细化,提升生成质量与效率适应性。

详情
AI中文摘要

迭代生成模型如流匹配和扩散模型在测试时表现出强大的扩展性,额外的推理计算可以提高生成质量。相比之下,漂移模型提供高效的单步生成,但其直接生成范式限制了灵活性。在本文中,我们提出Drift Flow Matching (DFM),一个连接漂移生成建模与基于流的迭代生成的框架。DFM保留了直接传输映射的效率,同时在需要时通过多个推理步骤细化生成。这填补了单步漂移模型与多步流匹配方法之间的空白,并提供了一种新的生成范式,可以适应不同的质量-效率需求。在不同任务和数据集上的广泛实验验证了所提框架的有效性和通用性。

英文摘要

Iterative generative models such as Flow Matching and Diffusion models have demonstrated strong test-time scaling behavior, where additional inference computation can improve generation quality. In contrast, Drift Models offer efficient one-step generation, but their direct generation paradigm limits such flexibility. In this work, we propose Drift Flow Matching (DFM), a framework that connects drifting generative modeling with flow-based iterative generation. DFM preserves the efficiency of direct transport maps while enabling generation to be refined through multiple inference steps when desired. This bridges the gap between one-step Drift Models and multi-step Flow Matching methods, and provides a novel generative paradigm that can adapt sampling computation to different quality--efficiency requirements. Extensive experiments across different tasks and datasets demonstrate the effectiveness and generality of the proposed framework.

2605.17238 2026-05-19 cs.LG stat.ML

Learning in Position-Aware Multinomial Logit Bandits: From Multiplicative to General Position Effects

位置感知多项Logit老虎机(MNL Bandits)中的学习:从乘法位置效应到一般位置效应

Xi Chen, Shibo Dai, Jiameng Lyu, Yuan Zhou

AI总结 本文研究了动态联合品类选择与排列问题,其中每个产品的吸引力取决于其内在吸引力和显示位置,在多项逻辑(MNL)选择框架下。研究从乘法位置效应模型扩展到一般位置效应模型,为两种模型设计了基于轮次的学习算法,并建立了首个最优后悔分析。

详情
AI中文摘要

我们研究了动态联合品类选择与排列问题,其中每个产品的吸引力取决于其内在吸引力和显示位置,在多项逻辑(MNL)选择框架下。我们的研究从乘法位置效应模型开始,其中每个产品的吸引力由位置特定因子缩放,扩展到一般位置效应模型,该模型为每个产品-位置对分配独立吸引力参数以捕捉异质协同效应。对于两种模型,我们设计了基于轮次的学习算法,在每次反馈后更新决策,并建立了首个最优后悔分析。此外,我们的基于轮次算法为现代平台提供了必要的实时操作。对于乘法模型,我们开发了具有截断机制的交叉位置成对最大似然估计器,并证明我们的算法P2MLE-UCB达到$\tilde{O}(\sqrt{NT})$的后悔,匹配下限并弥补了先前基于周期的分析留下的$\sqrt{K}$差距。对于一般模型,我们建立了最小最大下界并提出了GP2-UCB算法,具有匹配的上界。此外,我们设计了基于Dinkelbach方法和最大权二分图匹配的高效子程序,用于每轮联合品类和排列优化。在合成数据和Expedia数据集上的数值实验表明,我们的算法在性能上始终优于最先进的基准。

英文摘要

We study the dynamic joint assortment selection and positioning problem, where the attraction of each product depends on both its intrinsic appeal and its display position under a Multinomial Logit (MNL) choice framework. Our study ranges from the multiplicative position effects model, in which each product's attraction is scaled by a position-specific factor, to a general position effects model assigning independent attraction parameters to every product--position pair to capture heterogeneous synergies. For both models, we design round-based learning algorithms that update decisions after every single feedback, and establish the first regret-optimal characterization. Besides, our round-based algorithms provide the prompt operations needed by modern platforms. For the multiplicative model, we develop a cross-position pairwise maximum likelihood estimator with a clipping mechanism, and prove that our algorithm P2MLE-UCB attains a regret of $\tilde{O}(\sqrt{NT})$, matching the lower bound and closing the $\sqrt{K}$ gap left by prior epoch-based analyses. For the general model, we establish a minimax lower bound and propose GP2-UCB with a matching upper bound. Moreover, we design an efficient subroutine for the per-round joint assortment and positioning optimization based on Dinkelbach's method and maximum-weight bipartite matching. Numerical experiments on synthetic data and the Expedia dataset show that our algorithms consistently outperform state-of-the-art benchmarks.
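
The multiplicative position-effects model described in this abstract can be illustrated with a short Python sketch of the resulting choice probabilities. The sketch assumes the common MNL convention in which the no-purchase option has attraction 1; the function name `mnl_choice_probabilities` and all parameter values are hypothetical and not taken from the paper.

```python
from typing import List, Sequence, Tuple


def mnl_choice_probabilities(
    ranking: Sequence[int],
    intrinsic_attraction: Sequence[float],
    position_factor: Sequence[float],
) -> Tuple[List[float], float]:
    """Choice probabilities when product ranking[slot] is displayed in position `slot`.

    Each displayed product's attraction is its intrinsic attraction scaled by the
    factor of the position it occupies; the outside option has attraction 1 (assumed).
    Returns (per-slot purchase probabilities, no-purchase probability).
    """
    weights = [
        position_factor[slot] * intrinsic_attraction[product]
        for slot, product in enumerate(ranking)
    ]
    denominator = 1.0 + sum(weights)
    return [w / denominator for w in weights], 1.0 / denominator


# Hypothetical parameters: three products, two display positions.
intrinsic = [1.0, 0.5, 2.0]
position = [1.0, 0.5]

# Show product 2 in the top slot and product 0 in the second slot.
probs, no_purchase = mnl_choice_probabilities([2, 0], intrinsic, position)
```

The general position-effects model in the abstract would replace the product `position_factor[slot] * intrinsic_attraction[product]` with an independent parameter per product-position pair, leaving the normalization unchanged.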