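arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.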

2605.17133 2026-05-19 cs.CV cs.AI

CAM-VFD: Cross-Attention Multimodal Video Forgery Detection

Hoda Osama Elkhodary, Sherin Mostafa Youssef, Marwa Elshenawy, Dalia Sobhy

Affiliations: Computer Engineering Department, College of Engineering and Technology, Arab Academy for Science, Technology and Maritime Transport

AI summary: To address the challenges posed by rapidly advancing deepfake technologies and video editing tools, this paper proposes the CAM-VFD framework, which performs multimodal video forgery detection by modeling cross-modal contradictions. Experiments on two generative video benchmarks show strong performance and good robustness.

Abstract

The rapid advancement of Deepfake technologies and video manipulation tools poses a critical challenge to multimedia forensics, judicial evidence integrity, and information authenticity. Current detectors rely on single-modality signals, treating appearance, geometry, and motion independently. However, advanced generators maintain within-modality consistency while producing cross-modal contradictions, which are forensically discriminative but invisible to any single-modal detector. We propose CAM-VFD, a Cross-Attention Multimodal Video Forgery Detection framework that models cross-modal contradiction as a directional forensic signal. The framework uses a cross-attention fusion mechanism in which CLIP-based appearance representations serve as queries against VideoMAE motion features and MiDaS depth features, enabling the identification of contradictions between visual, temporal, and geometric evidence. We examine this design through cross-modal attention discrepancy analysis, observing statistically separable real and fake distributions ($p<0.001$, Cohen's $d=0.68$). Experimental results on two generative video benchmarks indicate consistent performance, with 95.31% Top-1 accuracy on GenVidBench and 93.43% accuracy, 90.63% F1-score, and 96.56% AUROC on GenVideo. Moreover, CAM-VFD demonstrates stable performance under compression, noise, blur, and adversarial perturbations, suggesting that cross-modal reasoning may improve robustness in media forensics. The code is publicly available at https://github.com/Hoda-Osama/CAM-VFD/tree/main.
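
To make the fusion idea concrete, here is a minimal NumPy sketch of single-head cross-attention in which appearance tokens act as queries over motion or depth tokens. It is not the authors' implementation: the random arrays stand in for CLIP, VideoMAE, and MiDaS features, and the projection matrices, dimensions, and the final concatenation are assumptions made for illustration.

```python
import numpy as np

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    queries: (n_q, d_model) appearance tokens (stand-in for CLIP features)
    context: (n_c, d_model) motion or depth tokens (stand-ins for VideoMAE / MiDaS)
    Returns the attended values (n_q, d_head) and the attention map (n_q, n_c).
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model, d_head = 16, 8
appearance = rng.normal(size=(4, d_model))  # hypothetical appearance tokens
motion = rng.normal(size=(6, d_model))      # hypothetical motion tokens
depth = rng.normal(size=(5, d_model))       # hypothetical depth tokens
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

motion_out, motion_attn = cross_attention(appearance, motion, w_q, w_k, w_v)
depth_out, depth_attn = cross_attention(appearance, depth, w_q, w_k, w_v)
fused = np.concatenate([motion_out, depth_out], axis=-1)  # one fused row per appearance token
```

The two attention maps could then be compared between real and generated clips, which is the kind of discrepancy analysis the abstract mentions; how the paper actually scores that discrepancy is not stated here.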


2605.17125 2026-05-19 cs.CV cs.LG

Principal Component Analysis for Lunar Crater Detection

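Travis Driver, John A. Christian

Affiliations: School of Aerospace Engineering, Georgia Institute of Technology

AI summary: This paper proposes an automatic crater-template generation method based on principal component analysis to improve image-based crater identification, and shows on simulated lunar images that it outperforms hand-picked templates in detection and localization.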
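
As a rough illustration of PCA-based template generation, the sketch below extracts the mean patch and the leading principal components from a stack of equally sized image patches. It assumes the templates are simply the top principal directions of the patch set; the patch size, the number of components, and the random input are placeholders, and the paper's actual pipeline may differ.

```python
import numpy as np

def pca_templates(patches, n_components=3):
    """Return the mean patch, the leading principal components, and their variance shares.

    patches: (n, h, w) stack of crater image patches of equal size.
    """
    n, h, w = patches.shape
    flat = patches.reshape(n, h * w).astype(float)
    mean = flat.mean(axis=0)
    # Rows of vt are orthonormal principal directions, ordered by singular value.
    _, s, vt = np.linalg.svd(flat - mean, full_matrices=False)
    explained = (s ** 2) / (s ** 2).sum()
    templates = vt[:n_components].reshape(-1, h, w)
    return mean.reshape(h, w), templates, explained[:n_components]

rng = np.random.default_rng(0)
patches = rng.normal(size=(20, 8, 8))  # hypothetical crater patches
mean_patch, templates, explained = pca_templates(patches)
```

Each returned template has the same shape as an input patch, so it could be correlated against an image in the same way as a hand-picked template.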

2605.17120 2026-05-19 cs.CV

Markerless Motion Capture for Biomechanical Whole-Body Kinematic Estimation in Infants

Divya Joshi, J. D. Peiffer, Colleen Peyton, R. James Cotton

Affiliations: Center for Bionic Medicine, Shirley Ryan AbilityLab; Dept. of Physical Therapy and Human Movement Science, Northwestern University; Dept. of Biomedical Engineering, Northwestern University; Dept. of Pediatrics, Northwestern University

AI summary: This study evaluates three state-of-the-art pose estimation frameworks for reconstructing infant kinematics, showing both the potential and the limitations of markerless motion capture for infant biomechanical analysis.

Comments: Accepted to EMBC 2026

Abstract

Early identification of motor impairment in infancy relies on expert visual assessment of spontaneous movement, motivating the development of automated, objective alternatives. One promising approach is using computer vision, which benefits from high-quality pose estimation from video. In this study, we systematically evaluated three state-of-the-art pose estimation frameworks (MeTRAbs-ACAE, SAM 3D Body, and Sapiens) on 100 videos over 13 sessions of 8 infants recorded with a multi-view markerless motion capture system. We quantified keypoint detection accuracy using reprojection error, geometric consistency, and Procrustes-aligned 3D position error, and demonstrated proof-of-concept for fitting an inverse kinematic framework to infant data. While Sapiens achieved the lowest reprojection error and highest geometric consistency of the methods evaluated (22.8 pixels and 0.82, respectively), SAM 3D Body provided the most comprehensive 3D information for kinematic reconstruction with Procrustes-aligned position errors of 19 to 28 mm. We demonstrate in a case comparison example that biomechanical models fit to SAM 3D estimates distinguish representative movement patterns in infants related to motor development, as identified by a clinical expert. Together, these findings highlight both the promise and current limitations of 3D pose estimation for infant biomechanics and establish preliminary groundwork for scalable, video-based assessment of early motor development.
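
For readers unfamiliar with Procrustes-aligned position error, the following NumPy sketch shows one common way to compute it: align the estimated keypoints to the reference with the best similarity transform (rotation, uniform scale, translation), then average the remaining per-keypoint distances. Whether the study includes scale in the alignment is an assumption here, and the keypoint data are synthetic.

```python
import numpy as np

def procrustes_aligned_error(pred, target):
    """Mean per-keypoint distance after similarity (Procrustes) alignment.

    pred, target: (n_keypoints, 3) arrays in the same units (e.g. mm).
    """
    mu_p, mu_t = pred.mean(axis=0), target.mean(axis=0)
    p, t = pred - mu_p, target - mu_t
    # Optimal rotation from the SVD of the cross-covariance (Kabsch / Umeyama).
    u, s, vt = np.linalg.svd(p.T @ t)
    sign = np.sign(np.linalg.det(u @ vt))
    d = np.diag([1.0, 1.0, sign])  # flip one axis if needed to avoid a reflection
    rot = u @ d @ vt
    scale = (s * np.diag(d)).sum() / (p ** 2).sum()
    aligned = scale * p @ rot + mu_t
    return np.linalg.norm(aligned - target, axis=1).mean()

rng = np.random.default_rng(0)
truth = rng.normal(size=(17, 3)) * 100.0  # hypothetical keypoints in mm
c, s_ = np.cos(0.3), np.sin(0.3)
rot_z = np.array([[c, -s_, 0.0], [s_, c, 0.0], [0.0, 0.0, 1.0]])
estimate = 1.1 * truth @ rot_z + np.array([5.0, -3.0, 2.0])
error_mm = procrustes_aligned_error(estimate, truth)
```

Because the synthetic estimate differs from the reference only by a rotation, a scale factor, and an offset, the aligned error is zero up to floating-point rounding; real pose estimates leave a residual such as the 19 to 28 mm reported above.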


2605.17118 2026-05-19 cs.LG stat.CO stat.ML

Differentiable Optimization Layers for Guaranteed Fairness in Deep Learning

David Troxell, Noah Roemer, Guido Montúfar

Affiliations: Department of Statistics & Data Science, University of California, Los Angeles, USA; Department of Mathematics, University of California, Los Angeles, USA; Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany

AI summary: This paper proposes a differentiable optimization layer, called a "fairness layer", that guarantees a chosen notion of output parity when integrated into a neural network. It also introduces an online dual inference algorithm that provides provable fairness guarantees for streaming predictions, even with arbitrarily small batch sizes.

Comments: To be published in International Conference on Machine Learning (ICML), 2026

2605.17115 2026-05-19 cs.AI

F2IND-IT! -- Multimodal Fuzzy Fake Indian News Detection using Images and Text

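Kushal Trivedi, Murtuza Shaikh, Khushi Singh, Jeevaraj S.

Affiliations: ABV - Indian Institute of Information Technology, Gwalior

AI summary: This paper proposes a multimodal fuzzy framework that combines images and text to detect fake news in Indian media. ResNet-50 extracts image features, DistilBERT provides semantic text embeddings, ANFIS produces a fuzzy reliability score, and a lightweight attention fusion module performs the classification. The reported experiments show gains over existing methods in accuracy, precision, recall, and F1-score.

Comments: 10 pages, 1 figure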

2605.17113 2026-05-19 cs.CL cs.AI

The Point of No Return: Counterfactual Localization of Deceptive Commitment in Language-Model Reasoning

Scott Merrill, Shashank Srivastava

Affiliations: University of North Carolina at Chapel Hill

AI summary: This paper studies when a language model becomes committed to deception during reasoning. Using counterfactual localization, it analyzes how deception arises across environments, finds that attention-based transition features generalize across environments, and identifies compact attention-head sets that suppress deceptive commitment.

Comments: 41 pages, 25 figures

Abstract

Existing deception datasets label completed outputs as honest or deceptive, treating deception as a property of the final response rather than a function of the model's reasoning trace. This obscures a more fundamental question: when does a language model become committed to deception? We introduce counterfactual localization: for each sentence prefix in a reasoning trace, we fix the prefix, resample continuations, and estimate the probability of a deceptive outcome. To scale this, we construct five environments (spanning strategic bluffing, maze guidance, financial advice, used-car sales, and offer negotiation) in which deception is never prompted but emerges from strategic incentives and labels follow mechanically from environment state rather than subjective human judgment. The resulting corpus localizes ~1.46M sentences across four reasoning models, drawn from over 94.1M sampled continuations, 91.5B generated tokens, and over 100K scenarios. Sentence-level human evaluation confirms that detected commitment points correspond to interpretable shifts in decision state. Using this resource, we show that lexical cues for commitment prediction transfer poorly across environments, whereas attention-based transition features generalize out of distribution, suggesting that deceptive commitment is reflected in reusable changes in reasoning dynamics rather than surface form. We further identify compact attention-head sets (under 10% of heads) that, selected on one environment, causally suppress deceptive commitment across held-out environments. We release the corpus as a substrate for studying deception, and more broadly commitment, in language-model reasoning.
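
The prefix-and-resample procedure can be sketched in a few lines of Python. This is a toy under stated assumptions: `sample_outcome` is a hypothetical stand-in for resampling a continuation from a model and labelling it from environment state, and reading the "commitment point" as the first prefix whose estimated probability reaches a threshold is an illustrative choice, not the paper's definition.

```python
import random

def localize_commitment(sentences, sample_outcome, n_samples=200, threshold=0.5, seed=0):
    """Estimate P(deceptive outcome | prefix) for every sentence prefix.

    sample_outcome(prefix, rng) must return True for a deceptive outcome.
    Returns the per-prefix estimates and the index of the first prefix whose
    estimate reaches `threshold` (None if it never does).
    """
    rng = random.Random(seed)
    probs = []
    for end in range(1, len(sentences) + 1):
        prefix = sentences[:end]  # fix the prefix, resample what follows
        hits = sum(sample_outcome(prefix, rng) for _ in range(n_samples))
        probs.append(hits / n_samples)
    commit = next((i for i, p in enumerate(probs) if p >= threshold), None)
    return probs, commit

# Toy sampler: the outcome becomes likely once a trigger sentence is in the prefix.
def toy_sampler(prefix, rng):
    p = 0.9 if "I will overstate the price." in prefix else 0.1
    return rng.random() < p

trace = ["The buyer asked about mileage.", "I will overstate the price.", "Then I reply."]
probs, commit = localize_commitment(trace, toy_sampler)
```

In this toy trace the estimate jumps at the second sentence, so `commit` points at index 1. With a real model, the cost is dominated by the number of continuations sampled per prefix, which is consistent with the large sampling budget the abstract reports.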


2605.17108 2026-05-19 cs.LG

Parallel Recursive LSTM

Tristan Gaudreault, Yongyi Mao

Affiliations: School of Electrical Engineering and Computer Science; University of Ottawa

AI summary: This paper proposes the Parallel Recursive LSTM (PR-LSTM), a hierarchical recurrent architecture that replaces left-to-right recurrence with recursive nonlinear state composition to reduce computational depth in long-context settings while retaining nonlinear gated state representations. It achieves stronger sequence-length generalization on formal-language benchmarks.

Comments: 13 pages, 5 figures. Code available at https://github.com/tristangaudreault/pr-lstm

Abstract

Transformers have become the dominant architecture for sequence modeling by using self-attention to enable expressive and highly parallel processing. However, the resulting quadratic time and memory costs limit efficiency in long-context settings. Recurrent models such as LSTMs provide explicit nonlinear state updates and strong state-tracking capabilities, yet their strictly sequential computation limits parallelism. We introduce the Parallel Recursive LSTM (PR-LSTM), a hierarchical recurrent architecture that replaces left-to-right recurrence with recursive nonlinear state composition over a balanced computation tree. Tokens are first mapped independently to latent states, which are then recursively merged by a learned gated composition block. This structure uses the reduction pattern underlying parallel scans as a fixed execution schedule, rather than assuming an associative recurrence. As a result, PR-LSTM retains nonlinear gated state representations while reducing recurrent parallel depth from linear to logarithmic. Empirically, PR-LSTM achieves strong sequence-length generalization on formal-language benchmarks, solving more tasks than standard RNN, LSTM, and Transformer baselines, while avoiding the quadratic scaling of attention. These results suggest that recurrent computation can be reorganized hierarchically to expose parallelism without restricting the transition dynamics to linear or associative forms.
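
The balanced-tree schedule can be illustrated with a small sketch. The `tree_reduce` function below merges neighbouring states level by level, so a sequence of n states needs about log2(n) merge levels, and the merges within one level are independent of each other. The `gated_merge` cell is a hypothetical composition block written for this example (a sigmoid gate mixing a tanh candidate with the children's average); the paper's actual block is not specified in the abstract.

```python
import numpy as np

def gated_merge(left, right, w_gate, w_cand):
    """Hypothetical gated composition of two child states into one parent state."""
    pair = np.concatenate([left, right])
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ pair)))  # sigmoid gate
    cand = np.tanh(w_cand @ pair)                  # nonlinear candidate state
    return gate * cand + (1.0 - gate) * 0.5 * (left + right)

def tree_reduce(states, merge):
    """Merge states pairwise, level by level; returns (root, number_of_levels)."""
    levels = 0
    while len(states) > 1:
        merged = [merge(states[i], states[i + 1]) for i in range(0, len(states) - 1, 2)]
        if len(states) % 2:  # carry an unpaired state up unchanged
            merged.append(states[-1])
        states = merged
        levels += 1
    return states[0], levels

rng = np.random.default_rng(0)
d = 4
w_gate = rng.normal(size=(d, 2 * d))
w_cand = rng.normal(size=(d, 2 * d))
tokens = [rng.normal(size=d) for _ in range(16)]  # per-token latent states
root, levels = tree_reduce(tokens, lambda a, b: gated_merge(a, b, w_gate, w_cand))
```

The merge does not need to be associative for this schedule to be well defined, because the tree shape is fixed in advance. The list comprehension here runs sequentially; a parallel implementation would batch all merges of a level into one call.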


2605.17104 2026-05-19 cs.AI

Scientific Logicality Enriched Methodology for LLM Reasoning: A Practice in Physics

Zhaoxin Yu, Nan Xu, Kun Chen, Jiahao Zhao, Lei Wang, Wenji Mao

Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; Beijing Wenge Technology Co., Ltd, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China

AI summary: This paper proposes a scientific logicality-enriched methodology that aims to improve the logical correctness and task performance of LLMs in scientific reasoning, and puts it into practice in physics, a discipline with diverse logical structures and formalisms.

Comments: Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

With the continuous advancement of reasoning abilities in Large Language Models (LLMs), their application to scientific reasoning tasks has gained significant research attention. Current research primarily emphasizes boosting LLMs' performance on scientific QA benchmarks by training on larger, more comprehensive datasets with extended reasoning chains. However, these approaches neglect the essence of the scientific reasoning process -- logicality, which is the rational foundation to ensure the validity of reasoning steps leading to reliable conclusions. In this work, we make the first systematic investigation into the internal logicality underlying LLM scientific reasoning, and develop a scientific logicality-enriched methodology, including a set of assessment criteria and data sampling methods for logicality-guided training, to improve the logical faithfulness as well as task performance. Further, we take physics, characterized by its diverse logical structures and formalisms, as an exemplar discipline to practise the above methodology. For data construction, we extract scientific problems from academic literature and sample a high-quality dataset exhibiting strong logicality. Experiments based on three different backbone LLMs reveal that: 1) the training data we constructed can effectively improve the scientific logicality in LLM reasoning; and 2) the enriched scientific logicality plays a critical role in solving scientific problems. Code is available at https://github.com/ScienceOne-AI/PhysLogic.


2605.17095 2026-05-19 cs.CV cs.AI cs.LG

Visual Timelines of Police Encounters in Body-Worn Camera Footage: Operational Context and Activity Cataloging for Training and Analysis in OpenBWC

Angela Srbinovska, Christopher Homan, Adrian Martin, Ernest Fokoué

Affiliations: Rochester Institute of Technology; Rochester Police Department; Office of Business Intelligence; School of Mathematics and Statistics

AI summary: This paper presents a method that processes body-worn camera video into a time-aligned sequence of fixed-length 10-second windows, processed and labeled under a privacy-conscious protocol, to make incident review and training workflows more efficient.

Comments: 13 pages, 10 figures, 9 tables

Abstract

Law enforcement agencies are accumulating vast amounts of body-worn camera (BWC) footage. However, this footage remains operationally opaque: analysts and trainers still have to invest considerable time watching full-length videos to pinpoint the start of key encounters and identify the points where activity shifts to something more physically intense. We present an approach that processes BWC video into a time-aligned sequence of fixed-length 10-second windows, processed and labeled using a privacy-conscious protocol. Each window is labeled along two dimensions: (i) the operational context of the window and (ii) the level of motion intensity within the window, with low-evidence labels for windows where darkness, blur, or occlusion leaves insufficient evidence. We train models to classify windows along these two axes using frames sampled from each window, encoded with a CLIP model, and aggregated into a window-level representation. We extract dense optical flow statistics for each window to capture motion intensity. On test windows, the best context model achieves 78.75% accuracy, and the best-accuracy activity model achieves 88.33%. We also include integrity audits of the results and show how the visual timeline representations support faster incident review and make the officer training workflow more practical.
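
A fixed-length windowing step like the one described can be sketched as follows. The 10-second window length comes from the abstract; the number of frames sampled per window, the choice to sample at the centre of equal sub-intervals, and dropping a trailing partial window are assumptions made for this example.

```python
def fixed_windows(duration_s, window_s=10.0, frames_per_window=4):
    """Split a video of `duration_s` seconds into consecutive fixed-length windows.

    Returns a list of (start, end, sample_times) tuples. A trailing partial
    window is dropped here; how the paper treats it is not stated.
    """
    n_windows = int(duration_s // window_s)
    step = window_s / frames_per_window
    windows = []
    for i in range(n_windows):
        start = i * window_s
        # Sample at the centre of each of `frames_per_window` equal sub-intervals.
        times = [start + (j + 0.5) * step for j in range(frames_per_window)]
        windows.append((start, start + window_s, times))
    return windows

timeline = fixed_windows(35.0)
```

For a 35-second clip this yields three windows covering 0 to 30 seconds. Each window's sampled frames would then be encoded and pooled into one window-level feature, with one context label and one motion-intensity label attached per window.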


2605.17093 2026-05-19 cs.CV cs.CL

HEED: Density-Weighted Residual Alignment for Hybrid Vision-Language Model Distillation

Yihao Liang, Niraj K. Jha

Affiliations: Princeton University

AI summary: This paper proposes HEED, which improves hybrid vision-language model distillation through density-weighted residual alignment, raising performance on OCR and document tasks while keeping inference efficient across different teacher models and hybrid architectures.

Abstract

Distilling vision-language models into faster hybrid architectures, such as 3:1 Mamba-2/attention mixes, is now standard practice for making inference efficient. Aggregate benchmarks suggest that this works, but they hide selective failures. When we distill Qwen3-VL-8B-Instruct into a 3:1 Mamba-2/attention hybrid, the student model stays within 2 points of the teacher across visual reasoning benchmarks like MMStar, MMBench, and MMMU-Pro, while dropping 13 points on optical-character-recognition and document tasks. The student can still understand the scene but loses the fine-grained text needed to answer. We localize much of the failure to a specific kind of position. In a high-resolution image, most patches are sky, wall, or smooth texture, while a small fraction carries text, edges, object boundaries, or other local details. In a token-level diagnostic, the top 10% highest-density patches have 3.6× larger residual drift than the bottom 10% lowest-density patches and 3.5× larger teacher-masking answer contribution. Uniform weighting devotes many loss terms to low-information background patches, whereas sparse answer-bearing patches receive no special protection. The required intervention is minimal: we replace uniform residual alignment with density-weighted residual alignment, using patch self-dissimilarity as a training-free proxy for position importance. We call this HEED. Compared with normal end-to-end distillation, HEED increases performance by 8.7 points on OCRBench v2 and 5.13 points on a 10-benchmark average. The gain is realized on different teacher models and hybrid architectures. After standard post-training, the student reaches teacher-level performance on the 10-benchmark average with 4.12× throughput and a 68% memory saving at 128k context, with no additional parameters and no inference-time cost.
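
The weighting idea can be sketched as below. The abstract does not define "patch self-dissimilarity", so this example assumes one simple reading: a patch's density is its mean cosine dissimilarity to the other patches of the same image, and the weights are normalized to mean 1 so the loss scale matches uniform weighting. Both functions and the toy features are illustrative, not the authors' code.

```python
import numpy as np

def density_weights(patches, eps=1e-8):
    """Per-patch weights from mean cosine dissimilarity to all other patches.

    This definition of "self-dissimilarity" is an assumption for illustration.
    patches: (n, d) patch features. Weights are normalized to mean 1.
    """
    unit = patches / (np.linalg.norm(patches, axis=1, keepdims=True) + eps)
    sim = unit @ unit.T
    n = len(patches)
    dissim = (1.0 - sim).sum(axis=1) / (n - 1)  # the self term is ~0, so divide by n - 1
    return dissim / (dissim.mean() + eps)

def weighted_alignment_loss(student, teacher, weights):
    """Weighted mean squared distance between student and teacher residual states."""
    per_pos = ((student - teacher) ** 2).sum(axis=1)
    return float((weights * per_pos).mean())

# Seven identical "background" patches and one distinct "text" patch.
patches = np.vstack([np.tile([1.0, 0.0], (7, 1)), [[0.0, 1.0]]])
w = density_weights(patches)
```

In this toy image the single distinctive patch receives the largest weight, so an alignment error at that position costs more than the same error on a background patch.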


2605.17091 2026-05-19 cs.LG

Mechanism Learning: Prototype-Anchored Mechanism Inference for Scientific Forecasting

Qian Jiang, Liping Sun

Affiliations: School of Computing; The Australian National University; iHuman Institute

AI summary: This paper proposes mechanism learning, a framework that forecasts future states by estimating the currently active local mechanism. It compresses local spatiotemporal fragments into mechanism descriptors and uses prototype anchors to build a data-driven mechanism space, aiming for robust and stable scientific forecasting.

Abstract

Scientific forecasting typically relies on direct state prediction, an approach that grows brittle under data scarcity, extended horizons, non-stationary dynamics, or high-dimensional complexity. While raw state trajectories are highly sensitive in these regimes, underlying local evolution rules often exhibit robust reusability. We introduce mechanism learning, a framework that forecasts future states by estimating the currently active local mechanism. Our method compresses local spatiotemporal fragments into mechanism descriptors, forming a data-driven, structured mechanism space where proximity reflects similar local evolution rules. To ground these estimates in observed data, we utilize prototype anchors, a set of representative mechanisms that sparsely cover the space of local rules. We evaluate this approach on Burgers dynamics, WeatherBench2, and Lorenz96. Empirically, the learned mechanism spaces resist collapse and maintain strong local consistency. Compared to direct prediction and other models including FNO, NODE, LSTM, and reservoir-family methods, our framework demonstrates predictive gains in fragile regimes: it significantly improves switching stability in Burgers dynamics and achieves state-of-the-art performance both under the scarce-data fixed-horizon WeatherBench2 protocol and in intermediate-complexity Lorenz96. Ablation studies and drift diagnostics confirm that these improvements are driven by finite prototype anchoring rather than sheer latent capacity. Together, these results establish mechanism learning as a principled, robust alternative to direct state prediction in forecasting complex systems.


2605.17088 2026-05-19 cs.CL

ACIL: Auto Chain of Thoughts for In-Context Learning

Rui Chu

Affiliations: Rui Chu

AI summary: This paper proposes the ACIL framework, which improves in-context learning on multi-step reasoning tasks by automatically constructing demonstrations that include reasoning steps.

Abstract

Recent advances in large language models (LLMs) have shown that Chain-of-Thought (CoT) reasoning can substantially improve performance on complex reasoning tasks. At the same time, In-Context Learning (ICL) has become an important mechanism for adapting LLMs to new tasks without updating model parameters, using only examples provided in the prompt. However, standard ICL often struggles on tasks that require multi-step reasoning, because the demonstrations usually contain only input-output pairs and lack explicit intermediate reasoning steps. This paper introduces an Automatic Chain-of-Thought (Auto-CoT) framework to improve ICL by automatically constructing reasoning-enhanced demonstrations. Auto-CoT generates reasoning chains for input-output examples, augments the prompt context with structured intermediate explanations, and removes irrelevant or low-quality demonstrations through a systematic selection process. By incorporating high-quality reasoning examples into the ICL prompt, Auto-CoT guides the model toward more reliable reasoning and improves prediction accuracy. Experiments across multiple reasoning tasks demonstrate that the proposed framework improves ICL performance by providing explicit intermediate reasoning guidance.


2605.17087 2026-05-19 cs.CV

The Learnability Gap in Medical Latent Diffusion

Mischa Dombrowski, Felix Nützel, Bernhard Kainz

Affiliations: Department of Computing, Imperial College London

AI summary: This paper studies the learnability gap of latent diffusion models used to address class imbalance in medical imaging. Pretrained autoencoders encode discriminative features faithfully, yet the structure of their latent representations makes them hard for classifiers to learn from. The authors develop noise-conditioned latent classifiers and image-space distillation that improve efficiency and serve as diagnostics of latent-space quality.

Abstract

Generative data augmentation with latent diffusion models is a promising strategy for addressing class imbalance in medical imaging, yet current approaches focus on perceptual fidelity and domain-specific autoencoder fine-tuning while neglecting a more fundamental bottleneck. We identify and formalize the learnability gap: large-scale pretrained autoencoders faithfully encode discriminative features for medical classification, as evidenced by near-lossless performance in reconstruction space, yet their latent representations are structured in ways that are difficult for classifiers to learn from. Across five autoencoder families and four medical benchmarks spanning chest radiography, dermatoscopy, computed tomography, and echocardiography, we show that this gap persists regardless of architecture, initialization strategy, or hyperparameter tuning, and that medical-domain fine-tuning of the autoencoder does not close it. To probe and partially narrow the gap, we develop noise-conditioned latent classifiers with FiLM layers and image-space distillation that offer 64x throughput and 120x memory gains over image-space models while serving as diagnostic tools for latent space quality. Our analysis provides a new framework for evaluating autoencoder latent spaces and identifies their structure, rather than their fidelity or domain specificity, as the primary obstacle to closing the performance gap between real and synthetic medical training data.
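
FiLM (feature-wise linear modulation) applies a per-channel scale and shift that are predicted from a conditioning input. The sketch below shows that operation on its own, assuming the conditioning input is an embedding of the diffusion noise level; the `1 + ...` parameterization of the scale, the shapes, and the zero-initialized weights are illustrative choices rather than details from the paper.

```python
import numpy as np

def film(features, cond, w_gamma, w_beta):
    """Feature-wise linear modulation: scale and shift each channel from `cond`.

    features: (batch, channels); cond: (batch, cond_dim), here a hypothetical
    embedding of the noise level. The scale is centred at 1 so that zero
    weights leave the features unchanged.
    """
    gamma = 1.0 + cond @ w_gamma
    beta = cond @ w_beta
    return gamma * features + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))            # latent features for two samples
sigma_embed = rng.normal(size=(2, 3))  # hypothetical noise-level embedding
identity_out = film(x, sigma_embed, np.zeros((3, 8)), np.zeros((3, 8)))
modulated = film(x, sigma_embed, rng.normal(size=(3, 8)), rng.normal(size=(3, 8)))
```

Conditioning a classifier this way lets one network handle latents at different noise levels, since each level produces its own scale and shift for the shared features.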


2605.17085 2026-05-19 cs.SD cs.LG eess.AS

Taming Audio VAEs via Target-KL Regularization

Prem Seetharaman, Rithesh Kumar

Affiliations: Adobe Research

AI summary: This paper proposes training audio VAEs at specific bitrates via target-KL regularization to study the trade-off between over-regularized and under-regularized latents in audio generation, and constructs rate-distortion curves for audio VAEs.

Comments: Accepted at ICASSP 2026 (Barcelona, Spain, 3-8 May 2026). 5 pages, 1 figure, 3 tables

Journal ref: Proc. ICASSP 2026

DOI: 10.1109/ICASSP55912.2026.11460662

Abstract

Latent diffusion models have emerged as the dominant paradigm for many generation tasks, including audio generation such as text-to-audio, text-to-music, and text-to-speech. A key component of latent diffusion is an autoencoder (VAE) that compresses high-dimensional signals into a low-frame-rate continuous representation that is conducive to downstream prediction. Regularizing these VAEs is challenging, as there is a trade-off between over-regularized (poor output quality) and under-regularized (difficult to predict) latent representations. We propose a framework for studying this trade-off through compression and train Audio VAEs at specific bitrates via target-KL regularization. This allows direct comparison to well-studied discrete neural audio codec models, and the construction of rate-distortion curves for audio VAEs. We evaluate the impact of target-KL regularization on text-to-sound generation and find that sweeping compression rates is helpful in identifying the optimal generation setting.
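
The link between a KL term and a bitrate can be made concrete with a short sketch. The KL divergence between a diagonal Gaussian posterior and a standard normal prior, measured in nats per latent frame, converts to bits by dividing by ln 2 and to bits per second by multiplying by the latent frame rate. The penalty function is a hypothetical form of a target-KL objective (absolute gap to the target), and the frame rate and target values are made up for the example.

```python
import math

def gaussian_kl_nats(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in nats for one latent frame."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))

def bitrate_bps(kl_nats_per_frame, frame_rate_hz):
    """Convert a per-frame KL in nats into bits per second."""
    return kl_nats_per_frame / math.log(2.0) * frame_rate_hz

def target_kl_penalty(kl_nats_per_frame, target_bps, frame_rate_hz):
    """Hypothetical penalty: absolute gap between the actual and the target KL."""
    target_nats = target_bps * math.log(2.0) / frame_rate_hz
    return abs(kl_nats_per_frame - target_nats)

kl = gaussian_kl_nats([0.5, -1.0], [0.0, -0.5])
rate = bitrate_bps(kl, frame_rate_hz=25.0)  # 25 latent frames per second (assumed)
gap = target_kl_penalty(kl, target_bps=rate, frame_rate_hz=25.0)
```

Expressing the regularizer as a rate is what allows a continuous VAE to be placed on the same rate-distortion axes as a discrete codec with a known bitrate.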


2605.17084 2026-05-19 cs.LG cs.CL

Scale Determines Whether Language Models Organize Representation Geometry for Prediction

Weilun Xu

Affiliations: School of Computer and Communication Sciences; École Polytechnique Fédérale de Lausanne

AI summary: This study asks whether representation geometry in language models is organized for prediction. Using the Subspace PGA metric, it finds that the degree of organization depends on model scale: small models progressively lose it late in training, while large models preserve it.

Abstract

In language models, what a representation encodes is determined by the geometry of its representation space: distances, not activations, carry meaning. Existing tools characterize the shape of this geometry but do not ask what that shape is organized for. We introduce Subspace PGA, a metric that tests whether a layer's distance structure aligns with the readout subspace of the unembedding matrix $W_U$ more than with random subspaces of equal size. Across seven Pythia models (70M--6.9B) and three cross-family models, intermediate geometry is significantly organized for prediction (peak $z = 9$--$24$), but the degree is scale-dependent: small models ($d \leq 1024$) progressively lose it at late layers during training -- even as loss keeps improving -- while large models ($d \geq 2048$) preserve it throughout. We trace this to a capacity trade-off: a few dominant directions migrate away from $W_U$'s readout, masking rather than destroying the predictive structure beneath, and removing them restores alignment. Neither spectral metrics nor loss curves capture this distinction. Scale thus determines not only how well a model predicts, but how its representation geometry is organized to do so.
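
The abstract does not spell out how Subspace PGA is computed, so the sketch below shows only one plausible way to pose the same kind of test. It assumes alignment is measured as the correlation between pairwise distances in the full space and pairwise distances after projecting onto a subspace, and it compares the readout subspace against random subspaces of equal dimension through a z-score. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def pairwise_dists(x):
    """Upper-triangular pairwise Euclidean distances between rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))
    return d[np.triu_indices(len(x), k=1)]

def subspace_alignment(h, basis):
    """Correlation between distances in the full space and in span(basis)."""
    q, _ = np.linalg.qr(basis)  # orthonormal basis of the subspace, shape (d, k)
    return np.corrcoef(pairwise_dists(h), pairwise_dists(h @ q))[0, 1]

def alignment_z(h, readout_basis, n_random=100, seed=0):
    """z-score of the readout alignment against random subspaces of equal size."""
    rng = np.random.default_rng(seed)
    d, k = readout_basis.shape
    observed = subspace_alignment(h, readout_basis)
    null = np.array([subspace_alignment(h, rng.normal(size=(d, k))) for _ in range(n_random)])
    return (observed - null.mean()) / null.std()

rng = np.random.default_rng(1)
readout = rng.normal(size=(32, 4))  # hypothetical readout directions
hidden = rng.normal(size=(40, 4)) @ readout.T + 0.01 * rng.normal(size=(40, 32))
z = alignment_z(hidden, readout)
```

The synthetic hidden states are built to vary almost entirely inside the readout subspace, so their alignment is close to 1 and the z-score is positive. Real layer activations would be compared against the column space of the model's unembedding matrix instead.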


2605.17079 2026-05-19 cs.CL cs.AI cs.CY

Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench

Tianyu Wang, Jiajun Li, Jianghao Lin

Affiliations: Shanghai Jiao Tong University

AI summary: This paper proposes ConsumerSimBench, a benchmark built from 1,553 real Chinese social-media topics and 23,122 atomic, rule-audited criteria, to evaluate how well LLMs simulate consumer reactions. It shows that frontier models fall short in predicting what consumers actually care about in high-context Chinese consumer discourse.

Abstract

LLMs are increasingly used as "digital consumers" to simulate public opinion, pre-test marketing decisions, and anticipate audience response. However, existing evaluations rarely ask whether a model can reconstruct the concrete reaction patterns that real consumers surface in public discourse. We introduce ConsumerSimBench, a benchmark built from 1,553 real Chinese social-media topics and 23,122 atomic, rule-audited criteria spanning four reaction families. Rather than scoring open-ended generations with a holistic preference judge, ConsumerSimBench decomposes each task into auditable yes-no decisions over concrete reaction points, raising three-judge agreement from 65.8% to 92.1% with 98.4% agreement between pointwise judge decisions and human-majority labels. Across 13 frontier generators, the strongest model, Gemini-3.1-Pro, covers only 47.8% of real reaction criteria, while GPT-5.2 and Claude-4.6 trail far behind despite their strength on technical benchmarks. The failures reveal a sharp gap between technical-benchmark performance and socially grounded consumer intuition. A direct structured reasoning prompt decreases coverage, while a generate--reflect multi-agent pipeline improves MiMo-V2.5-Pro from 32.9% to 37.6% on a subset. ConsumerSimBench reframes consumer simulation as a forecasting problem over real public-discourse reactions, showing that frontier LLMs remain far from reliably predicting what consumers will actually care about in high-context Chinese consumer discourse.
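
Scoring by atomic yes-no decisions can be sketched as follows. The example assumes that each criterion receives one boolean vote from each of three judges, that a criterion counts as covered when the majority votes yes, and that agreement is measured as the share of criteria with unanimous votes. The benchmark's exact aggregation and agreement formulas are not given in the abstract, so these are illustrative definitions.

```python
def majority(votes):
    """Majority label from an odd number of yes/no judge votes."""
    return sum(votes) * 2 > len(votes)

def coverage(judgements):
    """Fraction of criteria judged as covered.

    judgements: {criterion_id: [bool, bool, bool]} with one vote per judge.
    """
    covered = sum(majority(v) for v in judgements.values())
    return covered / len(judgements)

def unanimous_rate(judgements):
    """Share of criteria on which all judges agree (one simple agreement measure)."""
    return sum(len(set(v)) == 1 for v in judgements.values()) / len(judgements)

judgements = {
    "c1": [True, True, True],
    "c2": [True, False, True],
    "c3": [False, False, True],
    "c4": [False, False, False],
}
topic_coverage = coverage(judgements)
agreement = unanimous_rate(judgements)
```

Because every criterion is a separate binary decision, disagreements can be audited one criterion at a time instead of being hidden inside a single holistic score.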

2605.17077 2026-05-19 cs.RO cs.AI

How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning

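Bosung Kim, Ruiyi Wang, David Acuna, Jaehun Jung, Alexander Trevithick, Brandon Cui, Yejin Choi, Prithviraj Ammanabrolu

Affiliations: University of California, San Diego; NVIDIA

AI summary: This work uses dense language annotation to extract more signal from existing robot demonstrations. It proposes DeMiAn, which uses a vision-language model to generate multi-aspect annotations and improves both a policy and a world model without collecting new demonstrations.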

Abstract

Scaling robot policy learning is bottlenecked by the cost of collecting demonstrations, while language annotations for existing demonstrations are comparatively cheap. We study language density as a lever for extracting more signal from a fixed robot or egocentric-video corpus. We introduce DeMiAn (Dense Multi-aspect Annotation), a two-stage approach that first re-labels demonstration segments with VLM-generated annotations along four complementary aspects: physical motion, scene composition, arm pose, and reasoning. A learned instructor then maps a task description and initial scene snapshot to a task-appropriate annotation at deployment, running asynchronously so generation latency is hidden behind policy execution. Across over 1M robot manipulation clips and 50K EgoVerse human-egocentric videos, DeMiAn improves both a vision-language-action policy and a video-based world-action model without collecting new demonstrations. On RoboCasa, the instructor raises success by 5 points over a task-only baseline and comes within 3 points of a per-task oracle. No fixed annotation aspect dominates across tasks, showing that selecting the right dense language matters. DeMiAn also improves composite-task and out-of-distribution performance, and shifts the compute-performance frontier in both mid-training and post-training after accounting for annotation-generation FLOPs. These results position dense re-annotation as a practical scaling lever for robot policy learning.

2605.17072 2026-05-19 cs.AI cs.CL

RAGA: Reading-And-Graph-building-Agent for Autonomous Knowledge Graph Construction and Retrieval-Augmented Generation

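Chengrui Han, Zesheng Cheng

Affiliations: Qingdao University

AI summary: This paper proposes RAGA, a framework that embeds a read-search-verify-construct cognitive constraint into knowledge graph construction and retrieval-augmented generation, aiming to improve their efficiency and accuracy while managing the full knowledge graph lifecycle.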

Abstract

Existing LLM-driven knowledge graph (KG) construction methods predominantly employ stateless batch processing pipelines, exhibiting structural deficiencies in cross-chunk semantic relation capture, entity disambiguation, and construction process interpretability. These limitations undermine KG quality, retrieval precision, and deployment trust in high-stakes domains. We propose RAGA (Reading And Graph-building Agent), an LLM-based autonomous KG construction and retrieval fusion framework. RAGA provides an atomic toolset supporting full KG lifecycle CRUD operations and embeds a Read-Search-Verify-Construct cognitive constraint into a ReAct tool loop. A KG-vector synchronization mechanism enables hybrid symbolic-vector retrieval, while evidence-anchored verification links every knowledge entry to its source text for auditable provenance. Preliminary experiments on a subset of the QASPER scientific QA dataset indicate that RAGA's fusion retrieval outperforms zero-shot baselines, with KG integration providing measurable gains in both answer and evidence quality. The framework design and experimental baseline serve as a reference for agent-driven autonomous KG construction.

2605.17071 2026-05-19 cs.AI

AnchorDiff: Topology-Aware Masked Diffusion with Confidence-based Rewriting for Radiology Report Generation

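Shiying Yu, Jielei Wang, Guoming Lu

Affiliations: University of Electronic Science and Technology of China

AI summary: This paper proposes AnchorDiff, described as the first masked-diffusion framework for radiology report generation that incorporates clinical anchors. A topology-aware training strategy and an inference-time rewriting strategy mitigate the limitations of fixed-order autoregressive decoding, and the method reports state-of-the-art performance.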

Abstract

Radiology report generation (RRG) aims to automatically produce clinically accurate textual reports from medical images. Existing methods predominantly rely on autoregressive (AR) language models, whose causal dependency structure restricts generation to a unidirectional left-to-right process. This paradigm can induce sequence bias, where models tend to follow stereotypical token orders and high-frequency report templates rather than fully grounding generation in image-specific evidence. In this paper, we propose AnchorDiff, the first masked-diffusion framework for RRG that integrates knowledge-graph-derived clinical anchors into diffusion language modeling. By leveraging bidirectional context and iterative refinement, AnchorDiff mitigates the limitations of fixed-order autoregressive decoding. Specifically, we introduce a topology-aware training strategy that uses RadGraph-derived entity hierarchies to assign clinically important tokens differentiated masking protection and loss weights. We further design an inference-time rewriting strategy that detects unstable committed tokens through perturbation-based testing and selectively revises them during denoising. Extensive experiments on the MIMIC-CXR and MIMIC-RG4 benchmarks demonstrate that AnchorDiff achieves state-of-the-art (SOTA) performance, showing the effectiveness of clinically anchored masked diffusion for radiology report generation.

2605.17070 2026-05-19 cs.CV

EPIC-Bench: A Perception-Centric Benchmark for Fine-Grained Embodied Visual Grounding in Vision-Language Models

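Haozhe Shan, Xiancong Ren, Han Dong, Haoyuan Shi, Yingji Zhang, Jiayu Hu, Yi Zhang, Yong Dai, Bin Shen, Lizhen Qu, Zenglin Xu, Xiaozhu Ju

Affiliations: X-Humanoid; Fudan University; University of Science and Technology of China; University of Manchester; Monash University; Celonis AI; University of New South Wales

AI summary: This paper proposes EPIC-Bench, a perception-centric benchmark for fine-grained embodied visual grounding that systematically evaluates the visual perception of VLMs in real-world embodied environments. It contains 6.6k carefully annotated (image, text, mask) tuples covering 23 fine-grained tasks across three core stages of the embodied interaction pipeline: target localization, navigation, and manipulation. The evaluation shows that, although advanced reasoning models are promising, current VLMs generally struggle with complex visual-text alignment, with bottlenecks in multi-target counting, part-whole relationship understanding, and affordance region detection.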

Abstract

While large vision-language models (VLMs) are increasingly adopted as the perceptual backbone for embodied agents, existing benchmarks often rely on question-answering or multiple-choice formats. These protocols allow models to exploit linguistic priors rather than demonstrating genuine visual grounding. To address this, we present EPIC-Bench, Embodied PerceptIon BenChmark, a fine-grained grounding benchmark designed to systematically evaluate the visual perceptual capabilities of VLMs in real-world embodied environments. Comprising 6.6k meticulously annotated tuples (Image, Text, Mask), EPIC-Bench spans 23 fine-grained tasks across three core stages of the embodied interaction pipeline: Target Localization, Navigation, and Manipulation. Extensive evaluations of over 89 leading VLMs reveal that while advanced reasoning models show promise, current VLMs universally struggle with complex visual-text alignment for physical interactions. Specifically, models exhibit critical bottlenecks in multi-target counting, part-whole relationship understanding, and affordance region detection. EPIC-Bench provides a robust foundation and actionable insights for advancing the next generation of vision-driven embodied models.

2605.17058 2026-05-19 cs.LG

Learning Multi-Timescale Abstractions for Hierarchical Combinatorial Planning

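Vivienne Huiling Wang, Tinghuai Wang, Joni Pajarinen

Affiliations: Department of Electrical Engineering and Automation, Aalto University, Finland

AI summary: This paper proposes a model-based hierarchical framework for sequential stochastic combinatorial decision-making. A multi-timescale objective structures the latent dynamics to enable efficient lookahead planning, and a subgoal-conditioned budget policy is learned jointly to support context-aware resource allocation.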

Comments 34 pages, 8 figures, 23 tables

Abstract

The combination of exponentially large action spaces, stochastic dynamics, and long-horizon decision-making under limited resources makes Sequential Stochastic Combinatorial Optimization (SSCO) particularly challenging for reinforcement learning. Hierarchical Reinforcement Learning (HRL) offers a natural decomposition, but it places the high-level policy in a Semi-Markov Decision Process (SMDP) where actions have variable durations, making it difficult to learn a world model that is suitable for planning. We introduce a model-based hierarchical framework for sequential stochastic combinatorial decision-making that directly addresses this issue. Our method combines a latent-space tree-search planner with an SMDP-aware world model for variable-duration decisions. A multi-timescale objective structures the latent dynamics so that transition magnitudes reflect the effective temporal scales of abstract actions, enabling efficient lookahead under adaptive temporal abstraction. We further learn a subgoal-conditioned budget policy jointly with the world model to support context-aware resource allocation. Across challenging SSCO benchmarks, our method outperforms strong baselines.
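The difficulty this abstract attributes to the SMDP setting comes from abstract actions lasting different numbers of steps, so discounting has to follow elapsed time rather than the number of decisions. A minimal sketch of that textbook SMDP convention follows; the `(reward, duration)` representation and the function name are illustrative assumptions, not the paper's world model.

```python
def smdp_return(transitions, gamma=0.99):
    """Discounted return over variable-duration abstract actions.

    transitions: list of (reward, duration) pairs, where `reward` is the
    reward already accumulated (and discounted) inside one abstract action
    and `duration` is how many primitive steps that action lasted.
    Each reward is discounted by gamma ** (steps elapsed before it started).
    """
    total, elapsed = 0.0, 0
    for reward, duration in transitions:
        total += (gamma ** elapsed) * reward
        elapsed += duration
    return total


# Two abstract actions lasting 2 and 3 primitive steps.
print(smdp_return([(1.0, 2), (1.0, 3)], gamma=0.5))  # 1.25
```

With a fixed per-decision discount the second reward would be weighted by 0.5 instead of 0.25, which is why a planner that ignores durations misjudges long abstract actions.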

2605.17044 2026-05-19 cs.AI

PersonaArena: Dynamic Simulation for Evaluating and Enhancing Persona-Level Role-Playing in Large Language Models

Wenlong Shi, Jianxun Lian, Mingqi Wu, Haiming Qin, Mingyang Zhou, Xing Xie, Naipeng Chao, Hao Liao

Affiliations: College of Computer Science and Software Engineering, Shenzhen University; Microsoft Research Asia; Microsoft Gaming; Provincial Key Laboratory of Intelligent Communication and Digital Society Governance, Shenzhen University

AI summary: This paper proposes PersonaArena, a dynamic simulation framework for evaluating and improving persona-level role-playing in large language models. It builds a fine-grained persona bank from user-generated social content, runs multi-turn, context-rich interactions in simulated social environments, and uses a multi-agent debating judge for holistic, impartial assessment.

Comments ACL 2026 Findings

Abstract

Large language models (LLMs) increasingly serve as interactive social agents, yet their ability to maintain coherent and authentic persona-level role-playing remains limited, particularly in realistic social scenarios. Existing research predominantly focuses on character-level settings and relies on static evaluation formats, failing to capture the complexity of everyday social interactions. In this work, we present PersonaArena, a dynamic simulation framework for evaluating and improving persona-level role-playing in LLMs. PersonaArena leverages a large, filtered corpus of user-generated social content to construct a nuanced persona bank, and elicits multi-turn, context-rich interactions within simulated social environments. Our framework features a multi-agent debating judge for holistic and unbiased assessment. Through extensive experiments, we demonstrate that PersonaArena enables rigorous evaluation and enhancement of LLMs' role-playing capabilities, advancing the development of more authentic and socially adept AI agents.

2605.17042 2026-05-19 cs.CV

Thermal-Only Crowd Counting with Deployment-Time Privacy Protection

Yifei Qian, Zhongliang Guo, Chun Tong Lei, Bowen Deng, Chun Pong Lau, Xiaopeng Hong, Michael P. Pound

Affiliations: School of Computer Science, University of Nottingham; School of Computer Science, University of St Andrews; Department of Data Science, City University of Hong Kong; Harbin Institute of Technology

AI summary: This paper proposes a crowd-counting framework that uses only thermal imagery at inference. Removing the RGB dependency reduces privacy exposure in public surveillance, and a depth-to-RGB diffusion model is used to mitigate thermal ambiguity and improve counting accuracy.

Abstract

While RGB-Thermal crowd counting has shown promise, the paradigm faces critical limitations: RGB data raises privacy concerns in public surveillance, and multi-modal misalignment degrades fusion performance. We propose the first thermal-only framework specifically designed for privacy-conscious crowd counting, eliminating RGB dependency at inference time and substantially reducing the privacy exposure associated with continuous RGB capture in public surveillance deployments. To mitigate thermal ambiguity, we leverage depth-to-RGB diffusion models as a cross-modal bridge, extracting discriminative features that enhance thermal representations. Critically, we demonstrate that single-step LCM denoising yields features most faithful to the structural content of the depth conditioning signal, while multi-step approaches progressively decouple features from the conditioning input and accumulate errors that degrade counting accuracy. Experiments on RGBT-CC and DroneRGBT datasets show our method achieves competitive performance against state-of-the-art RGB-T fusion methods, while requiring only thermal input during inference, eliminating the need for continuous RGB capture that constitutes the primary privacy concern in real-world surveillance deployment. The code will be made publicly available.

2605.17041 2026-05-19 cs.CL cs.AI cs.HC

Agentic AI Translate: An Agentic Translator Prototype for Translation as Communication Design

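Masaru Yamada

Affiliations: Rikkyo University; Translation Lab Inc

AI summary: This paper presents an agentic translator prototype that turns the metalanguage of Translation Studies into an instruction code for generative AI, reframing translation as communication design rather than text conversion.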

Comments 14 pages. Conceptual and architectural paper; empirical validation in future work. Code: https://github.com/chuckmy/agentic-translator (v0.8.0). Live demo: https://agentic-translator-chuckmy.streamlit.app

Abstract

We present Agentic AI Translate, an agentic translator prototype that operationalises the thesis of Yamada (forthcoming) -- that the metalanguage of Translation Studies has become an instruction code for generative AI. The system replaces the dominant text-in / text-out paradigm of machine translation with a four-stage agentic cycle (Identify -> Prompt -> Generate -> Verify), preceded by an interactive specification phase in which the user composes -- through model-assisted dialogue -- a structured translation brief grounded in skopos theory, register, audience, and genre conventions. The verification stage adopts the GEMBA-MQM error-span protocol (Kocmi & Federmann, 2023) for evidence-grounded scoring, and document-level coherence is preserved through a DelTA-lite memory of proper nouns and a running bilingual summary, after Wang et al. (2025). We describe the philosophical motivation, the architectural commitments, the four reference-material categories the system consumes, and the principal design tensions the architecture makes explicit. Empirical validation is left for future work; the contribution here is conceptual and architectural -- an executable embodiment of the position that translation in the GenAI era is communication design, not text conversion.

2605.17039 2026-05-19 cs.LG cs.CE

Privacy-Preserving Generation Fraud Detection for Distributed Photovoltaic Systems: A Solar Irradiance-Fused Federated Learning Framework

Xiaolu Chen, Chenghao Huang, Yanru Zhang, Hao Wang

Affiliations: School of Computer Science and Technology, University of Electronic Science and Technology of China; Department of Data Science and AI, Faculty of IT, Monash University; Monash Energy Institute, Monash University; Shenzhen Institute for Advanced Study of UESTC

AI summary: This paper proposes a privacy-preserving, federated-learning-based framework for detecting generation fraud in distributed photovoltaic systems. It fuses solar irradiance and weather data through a co-attention mechanism to detect key anomalies, addressing the intermittency and uncertainty of PV generation, and is validated on a real-world dataset.

Comments 15 pages

Journal ref IEEE Transactions on Smart Grid, 2026

DOI: 10.1109/TSG.2026.3692585

Abstract

The wide adoption of residential photovoltaic (PV) systems introduces new challenges for generation fraud detection (FD). Unlike traditional electricity theft detection, which focuses on electricity consumption-side behavior, PV generation fraud detection (PVG-FD) is complicated by the inherent intermittency and uncertainty of PV generation. The distributed nature of PV systems poses further challenges for centralized PVG-FD approaches due to scalability and privacy concerns. This paper develops a privacy-preserving distributed PVG-FD framework based on federated learning (FL). In this framework, a utility company manages multiple household communities, each of which is equipped with a local detector. The framework integrates a novel detection model architecture with privacy-preserving global collaboration. Each community's local model fuses PV generation and weather data via a co-attention mechanism to detect discrepancies critical for PVG-FD. The FL framework enables cross-community collaboration by aggregating model parameters and prototypes, leveraging global knowledge sharing with local refinement while preserving privacy. It also uses prototype alignment to address class imbalance by enhancing fraud sample representation. Extensive experiments on a real-world residential PV dataset validate the effectiveness of the developed method and demonstrate that it outperforms state-of-the-art FL methods across various scenarios. The results also show its scalability across varying community sizes and strong robustness to class imbalance.
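The abstract states that the server aggregates model parameters and class prototypes but does not give the aggregation rule. A minimal sketch follows, assuming sample-weighted averaging in the style of FedAvg for parameters and per-class weighted averaging for prototypes. The function names, the dictionary layout, and the weighting scheme are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def weighted_average(arrays, weights):
    """Average equally shaped arrays, with weights normalised to sum to one."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * np.asarray(a, dtype=float) for wi, a in zip(w, arrays))


def aggregate_prototypes(client_protos, client_counts):
    """Build one global prototype per class.

    client_protos: list of {class_label: feature vector}, one dict per community
    client_counts: list of {class_label: number of local samples}, same order
    A community contributes to a class only if it has samples of that class,
    so a rare class (e.g. fraud) is averaged over the few communities that saw it.
    """
    classes = set().union(*(p.keys() for p in client_protos))
    global_protos = {}
    for c in classes:
        vecs = [p[c] for p in client_protos if c in p]
        counts = [n[c] for p, n in zip(client_protos, client_counts) if c in p]
        global_protos[c] = weighted_average(vecs, counts)
    return global_protos


# Made-up example: community B has no fraud samples.
protos = [{"fraud": [0.0, 0.0], "normal": [1.0, 1.0]}, {"normal": [3.0, 3.0]}]
counts = [{"fraud": 1, "normal": 10}, {"normal": 30}]
print(aggregate_prototypes(protos, counts)["normal"])  # [2.5 2.5]
```

Only averaged parameters and prototypes cross community boundaries in this sketch, which is the sense in which raw household data stays local.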

2605.17038 2026-05-19 cs.AI

Evidential Information Fusion on Possibilistic Structure

Qianli Zhou, Ye Cui, Zhen Li, Witold Pedrycz, Yong Deng

Affiliations: School of Electronics and Information, Northwestern Polytechnical University; Department of Electrical and Computer Engineering, University of Alberta; China Mobile Information Technology Center; Systems Research Institute, Polish Academy of Sciences; Institute of Fundamental and Frontier Science, University of Electronic Science and Technology of China

AI summary: This paper proposes an evidential information fusion method built on a possibilistic structure. By introducing a belief evolution network and the triangular norm family, it obtains a more flexible fusion framework suited to non-distinct source fusion, conflict management, and other complex scenarios.

Abstract

Dempster's rule is a fundamental tool for combining belief functions from distinct and reliable sources. However, its intersection-based semantics imposes strong structural restrictions, which limits its flexibility in handling complex source states and diverse information fusion scenarios. To overcome this limitation, we propose a reversible transformation, derived from the isopignistic principle, between belief functions and a possibilistic structure defined on the power set. In this transformation, the relationships among subsets are explicitly characterized by a belief evolution network, which provides a more flexible representation of evidential information beyond the conventional mass function structure. On this basis, we further introduce the triangular norm family to develop a general and adaptive evidential information fusion framework. Unlike fusion methods rooted in Dempster semantics, the proposed framework supports more flexible combination behaviors and exhibits advantages in non-distinct source fusion, conflict management, parametric combination design, and heterogeneous information fusion.
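For reference, the classical baseline this abstract contrasts itself with can be written in a few lines. The sketch below implements the standard Dempster's rule of combination and the standard pignistic transform; it is not the paper's possibilistic transformation or its t-norm-based framework. The dictionary representation of mass functions and the function names are illustrative choices.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule.

    Masses of intersecting focal sets are multiplied and assigned to the
    intersection; mass falling on the empty set is the conflict, which is
    removed by renormalising with 1 - conflict.
    """
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb
    if 1.0 - conflict < 1e-12:
        raise ValueError("total conflict: Dempster's rule is undefined")
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


def pignistic(m):
    """Pignistic probability: each focal set shares its mass equally among its elements."""
    betp = {}
    for focal, w in m.items():
        for x in focal:
            betp[x] = betp.get(x, 0.0) + w / len(focal)
    return betp


m1 = {frozenset("a"): 0.6, frozenset("ab"): 0.4}
m2 = {frozenset("b"): 0.5, frozenset("ab"): 0.5}
m12 = dempster_combine(m1, m2)
print(m12[frozenset("a")])  # 3/7, about 0.4286
print(pignistic(m12))       # a: 4/7, b: 3/7
```

The intersection step is the "intersection-based semantics" the abstract refers to: evidence can only be narrowed, and disagreeing sources are handled solely by discarding the conflicting mass.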

2605.17037 2026-05-19 cs.LG cs.AI cs.CL

D$^2$Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning

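Ru Zhang, Renda Li, Ziyu Ma, Weijie Qiu, Chongyang Tao, Yong Wang, Xiangxiang Chu

Affiliations: Zhejiang University; AMAP, Alibaba Group

AI summary: This paper proposes D$^2$Evo, a dual difficulty-aware self-evolution mechanism that addresses the scarcity of effective data and the shifting difficulty of samples in reinforcement learning. It reports better results than existing methods on mathematical reasoning benchmarks with fewer than 2K real mathematical samples.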

Comments Accepted by ICML 2026. First two authors contributed equally

Abstract

Reinforcement learning (RL) has demonstrated potential for enhancing reasoning in large language models (LLMs). However, effective RL training, which requires medium-difficulty training samples, faces two fundamental challenges: Effective Data Scarcity and Dynamic Difficulty Shifts, where medium-difficulty samples are scarce and become trivial as models improve. Existing methods mitigate this scarcity to some extent by generating training samples. However, these approaches suffer from anchor-free generation, ignoring co-evolution, and difficulty mismatch. To address these issues, we propose D$^2$Evo, a Dual Difficulty-aware self-Evolution RL framework. In each iteration, our method mines medium-difficulty anchors based on the current Solver's capability, trains the Questioner to generate diverse questions at appropriate difficulty levels, and jointly optimizes both components to enable progressive reasoning gains. Extensive experiments demonstrate that D$^2$Evo outperforms existing methods on mathematical reasoning benchmarks with fewer than 2K real mathematical samples, and exhibits strong generalization on general reasoning benchmarks.
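The anchor-mining step in this abstract can be sketched under a simple assumption: a question's difficulty is estimated by the current Solver's empirical pass rate over several sampled attempts, and "medium difficulty" means a pass rate inside a fixed band. The thresholds, the data layout, and the function name are hypothetical; the abstract does not specify how difficulty is measured.

```python
def mine_medium_anchors(pass_counts, attempts, low=0.2, high=0.8):
    """Return ids of questions whose empirical pass rate lies in [low, high].

    pass_counts: {question_id: number of correct Solver attempts}
    attempts:    number of sampled attempts per question
    low, high:   hypothetical band defining 'medium difficulty'
    """
    anchors = []
    for qid, correct in pass_counts.items():
        rate = correct / attempts
        if low <= rate <= high:
            anchors.append(qid)
    return anchors


# Made-up counts from 8 attempts per question.
counts = {"q1": 0, "q2": 4, "q3": 8, "q4": 2}
print(mine_medium_anchors(counts, attempts=8))  # ['q2', 'q4']
```

Because the pass rates are recomputed with the current Solver, questions drop out of the band as the model improves, which matches the difficulty shift the abstract describes and motivates re-mining anchors at every iteration.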

2605.17033 2026-05-19 cs.RO

Generalizable and Actionable Parts Pose Estimation with Symmetry Annotation-Free Learning Strategy

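Wenxiao Chen, Xueyu Yuan, Liu Liu, Di Wu, Dan Guo

Affiliations: Hefei University of Technology, Hefei, Anhui, China; University of Science and Technology of China, Hefei, Anhui, China

AI summary: This paper proposes SAFAG, a symmetry annotation-free framework for generalizable and actionable part pose estimation. A stepwise-refinement two-stage framework and a self-supervised learning strategy handle symmetry prediction, improving pose estimation performance and robustness in data-scarce scenarios.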

Comments Accepted as a poster at the Forty-third International Conference on Machine Learning (ICML 2026)

Abstract

Generalizable robot-object interaction and manipulation, which is urgently needed, requires high-quality cross-category object perception. As a pioneering direction in this area, Generalizable and Actionable Parts (GAParts) understanding has attracted increasing attention from researchers. However, most recent works either handle the symmetry issue inadequately or require rich symmetry annotation, which severely impedes precise GAPart pose estimation in data-scarce scenarios. In this paper, we propose SAFAG, a novel Symmetry Annotation-Free framework for Generalizable and Actionable Parts Pose Estimation. Specifically, we propose a stepwise-refinement two-stage framework for candidate-to-final quaternion regression, and treat symmetry prediction as a probability distribution problem solved with a self-supervised learning strategy. The experimental results demonstrate the superior performance and robustness of SAFAG. We believe that our work has great potential for application in many areas of embodied AI systems.

2605.17028 2026-05-19 cs.CL cs.AI

PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts

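Khizar Hussain, Murat Kantarcioglu

Affiliations: Virginia Tech

AI summary: This paper studies benchmark construction artifacts in hallucination detection for large language models and introduces DRIFT as a point of comparison. Most baselines perform near chance under controlled conditions; the exceptions are SAPLMA and DRIFT, both supervised probes on upper-layer hidden states.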

Comments Preprint of a NeurIPS 2026 submission

Abstract

Large language models (LLMs) hallucinate with confidence: their outputs can be fluent, authoritative, and simply wrong. In medical, legal, and scientific applications this failure causes direct harm, and detecting it from internal model states offers a path to safer deployment. A growing body of work reports that this problem is increasingly tractable, with recent methods achieving high detection performance on widely used benchmarks. We show, however, that much of this apparent progress does not survive scrutiny. Four of the six corpora embed the ground-truth answer directly in the input prompt. A naïve text-similarity baseline we call TxTemb exploits this to achieve near-perfect detection scores without any access to model internals. To measure what genuine detection capability remains once these artifacts are controlled, we conduct a large-scale evaluation spanning twenty-two detection methods, twelve open-source models spanning six architectural families, and six corpora. We further introduce DRIFT, a supervised probe over inter-layer hidden-state transitions, as a point of comparison for live-generation detection. Our findings suggest that the field's reported progress on hallucination detection is substantially explained by benchmark construction artifacts in widely used corpora, and that the majority of established baselines perform near chance under controlled conditions; the consistent exceptions are SAPLMA and DRIFT, both supervised probes on upper-layer hidden states.
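The artifact this abstract describes, a ground-truth answer embedded in the input prompt, can be screened for with a very small check. The sketch below uses normalised whole-word substring matching, which is an assumption made for illustration; it is not the paper's TxTemb baseline, whose similarity measure is not specified in the abstract. All function names are hypothetical.

```python
import re


def normalize(text):
    """Lowercase and keep only word tokens so formatting does not block a match."""
    return " ".join(re.findall(r"\w+", text.lower()))


def answer_leaks(prompt, answer):
    """True when the gold answer appears in the prompt as a whole-word sequence."""
    norm_answer = normalize(answer)
    if not norm_answer:
        return False
    return f" {norm_answer} " in f" {normalize(prompt)} "


def leakage_rate(examples):
    """Share of (prompt, gold_answer) pairs whose prompt already contains the answer."""
    examples = list(examples)
    if not examples:
        return 0.0
    return sum(answer_leaks(p, a) for p, a in examples) / len(examples)


# Made-up examples: the first prompt contains its own answer.
data = [
    ("Context: The capital of France is Paris. Q: What is the capital of France?", "Paris"),
    ("Q: Who wrote Hamlet?", "William Shakespeare"),
]
print(leakage_rate(data))  # 0.5
```

A corpus with a high leakage rate lets a detector score well by comparing text alone, so results on it say little about whether a method reads anything from the model's internal states.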

2605.17026 2026-05-19 cs.LG

Why Do Reasoning Models Lose Coverage? The Role of Data and Forks in the Road

Ngoc-Hieu Nguyen, Parshin Shojaee, Phuc Minh Nguyen, Nan Zhang, Chandan K Reddy, Khoa D Doan, Rui Zhang

Affiliations: The Pennsylvania State University; Virginia Tech; VinUniversity

AI summary: This paper studies why reasoning models lose coverage. It finds that the prevalence of decision points in the training data is a key driver of coverage shrinkage and proposes mitigating it through data synthesis and an improved decoding mechanism.

Comments 22 pages, 13 figures

Abstract

Recent progress in large language models has led to the emergence of reasoning models, which have shown strong performance on complex tasks through specialized fine-tuning procedures. While these methods reliably improve pass@1 accuracy, prior work has observed that they exhibit coverage shrinkage, where pass@k degrades relative to the base model. In this paper, we investigate why this shrinkage arises under SFT-based post-training. We hypothesize that this behavior is driven by properties of the fine-tuning data, specifically decision points or "forks in the road": scenarios where the model faces indecipherable patterns with multiple valid reasoning paths. To test this hypothesis, we design controlled case studies that simulate such decision-point settings, spanning indecipherable nodes in graph branching and reasoning modes. By tracking post-training dynamics in these settings, we find that the shrinkage phenomenon is tightly correlated with the prevalence of decision-point scenarios in the training data. We also demonstrate that this shrinkage behavior can be partially mitigated through targeted data-synthesis design around decision points and a more systematic diversity-encouraging decoding mechanism. Our findings identify data-centric factors as a key driver of shrinkage in reasoning models and highlight diversity-aware designs as an effective lever for controlling it.
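Coverage shrinkage is easiest to see with numbers. The sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), computed from n samples of which c are correct; the abstract does not say which estimator the authors use, so this choice is an assumption, and the per-problem counts are made up for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of them correct."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given the correct count for each problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Made-up counts out of n = 10 samples on two problems.
base = [3, 1]    # base model: occasionally right on both problems
tuned = [10, 0]  # fine-tuned model: always right on one, never on the other

print(mean_pass_at_k(base, 10, 1), mean_pass_at_k(tuned, 10, 1))  # about 0.2 vs 0.5
print(mean_pass_at_k(base, 10, 5), mean_pass_at_k(tuned, 10, 5))  # about 0.708 vs 0.5
```

In this toy case the fine-tuned model wins at pass@1 but loses at pass@5, because it has concentrated all of its probability on one problem and no longer reaches the other at any sampling budget.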
