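arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.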

2605.03058 2026-06-09 cs.LG cs.AI Version update

Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation

基于对比分层消融的大语言模型神经元锚定规则提取

Francesco Sovrano, Gabriele Dominici, Marc Langheinrich

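Affiliations: Università della Svizzera italiana

AI summary: Proposes MechaRule, which grounds rule extraction in LLM circuits by localizing sparse agonist activations, using adaptive group testing with confidence-guided pruning to identify key neurons with high recall at a small fraction of exhaustive-ablation cost; validated on arithmetic and jailbreaking tasks.

DOI: 10.1145/3770855.3818091
Comments: Accepted for publication at KDD'2026

Chinese abstract (AI translation)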

可解释AI的一个核心目标是符号化地表达大语言模型（LLM）的决策逻辑，并将其锚定在内部机制中。现有的规则提取方法通常学习非锚定的符号代理，而机械可解释性将行为与神经元联系起来，但通常需要手工假设和昂贵的干预。我们提出MechaRule，一种通过定位稀疏激动剂激活（其消融会破坏规则相关行为）将规则提取锚定在LLM电路中的流程。MechaRule基于两个发现。首先，在固定的基线/翻转机制下，稀疏激动剂效应可能表现出“超越”：少数高效应的激活在较大组中仍可检测到，主导较弱效应，并翻转许多相同的示例。在这种机制下，使用置信引导的保守剪枝的自适应组测试，当k << N为激动剂时，需要对N个候选进行O(k log(N/k) + k)次干预。其次，在与接近忠实规则行为对齐的数据分割上，激动剂的定位更可靠；谱分割提供了无规则的备选方案，而不忠实的分割会降低定位效果。实验上，在算术和越狱任务中，MechaRule在匹配的暴力验证中召回97.0%的最高效应激动剂，平均仅消耗完全消融成本的2.14%。消融定位的激动剂消除了97.6–100.0%的合格正确算术答案和越狱，并可纠正算术错误或诱导越狱，分别高达72.8%和32.5%。

English abstract

A central goal of explainable AI is to express large language model (LLM) decision logic symbolically and ground it in internal mechanisms. Existing rule-extraction methods usually learn ungrounded symbolic surrogates, while mechanistic interpretability links behavior to neurons but often requires hand-crafted hypotheses and costly interventions. We introduce MechaRule, a pipeline that grounds rule extraction in LLM circuits by localizing sparse agonist activations whose ablation disrupts rule-related behavior. MechaRule rests on two findings. First, in a fixed baseline/flip regime, sparse agonist effects can exhibit overtopping: a few high-effect activations remain detectable within larger groups, dominate weaker ones, and flip many of the same examples. In such regimes, adaptive group testing with confidence-guided conservative pruning requires O(k log(N/k) + k) interventions over N candidates when k << N are agonists. Second, agonists are localized more reliably on data splits aligned with close-to-faithful rule behavior; spectral splits provide a rule-free fallback, whereas unfaithful splits degrade localization. Empirically, on arithmetic and jailbreaking, MechaRule recalls 97.0% of highest-effect agonists in matched brute-force validations at only 2.14% of exhaustive-ablation cost on average. Ablating the localized agonists eliminates 97.6-100.0% of eligible correct arithmetic answers and jailbreaks, and can correct arithmetic errors or induce jailbreaks by up to 72.8% and 32.5%, respectively.
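The O(k log(N/k) + k) intervention count in the abstract is the familiar scaling of adaptive group testing. The sketch below illustrates that idea with plain binary splitting. It is not the authors' confidence-guided procedure: the oracle `group_has_effect` is hypothetical and idealized (any group containing an agonist tests positive, which is the "overtopping" assumption taken to its extreme), and the neuron indices are made up.

```python
def find_agonists(candidates, group_has_effect):
    """Binary-splitting group testing.

    Returns (sorted list of positives, number of oracle calls).
    `group_has_effect(group)` is a hypothetical oracle: True if ablating
    the whole group disrupts the behaviour under study.
    """
    tests = 0
    found = []
    stack = [list(candidates)]
    while stack:
        group = stack.pop()
        if not group:
            continue
        tests += 1
        if not group_has_effect(group):
            continue  # one test prunes the entire group
        if len(group) == 1:
            found.append(group[0])
            continue
        mid = len(group) // 2
        stack.append(group[mid:])
        stack.append(group[:mid])
    return sorted(found), tests


# Toy setup: N = 1024 candidates, k = 3 made-up agonist indices.
N = 1024
agonists = {17, 400, 901}
oracle = lambda group: any(i in agonists for i in group)

found, tests = find_agonists(range(N), oracle)
print(found, tests)  # far fewer oracle calls than the N single-neuron ablations
```

With k much smaller than N, most groups are discarded by a single negative test, which is where the saving over exhaustive ablation comes from.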


2605.01799 2026-06-09 cs.CV Version update

Embody4D: A Generalist Data Engine for Embodied 4D World Modeling

Embody4D: 面向具身4D世界建模的通用数据引擎

Peiyan Tu, Hanxin Zhu, Jingwen Sun, Shaojie Ren, Cong Wang, Yuyan Xu, Jiayi Luo, Xiaoqian Cheng, Zhibo Chen

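Affiliations: Zhejiang University; Beijing Zhongguancun Academy; University of Science and Technology of China; Institute of Automation, Chinese Academy of Sciences; Shanghai Jiao Tong University; Beihang University

AI summary: Proposes Embody4D, a video-to-video world model that uses a 3D-aware synthesis pipeline, latent confidence-aware expert modulation, and an interaction-aware attention mechanism to convert monocular robot videos into multi-view videos, addressing viewpoint sparsity in embodied AI and improving downstream planning and learning.

Chinese abstract (AI translation)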

具身智能体需要鲁棒且全面的3D时空表示来支持空间推理、操作理解和下游决策。然而，现有的机器人数据通常从固定或稀疏的视角捕获，仅提供部分且依赖视角的观察，这限制了多视角感知和跨视角泛化。鉴于在真实环境中收集额外视角的困难，我们提出Embody4D，一种专为具身场景设计的视频到视频世界模型，通过将单目机器人视频转换为来自灵活目标相机视角的新视角视频来弥合这一观察差距。首先，为解决训练数据稀缺问题，我们引入了一种3D感知的组合合成管道，以策划一个异构数据集，该数据集组合了跨具身形态的机器人手臂与多样背景，促进了广泛泛化。其次，为强制几何稳定性，我们设计了一种潜在置信度感知的专家调制策略，该策略估计扭曲潜在先验的可靠性，并自适应地将区域路由到复制、修复或修补专家，以实现时空一致的4D生成。最后，为增强操作保真度，我们引入了一种交互感知注意力机制，该机制明确关注机器人交互区域。大量实验表明，Embody4D在视觉评估基准上达到了最先进的性能，同时模拟和真实机器人实验进一步证明了其作为鲁棒数据引擎的有效性，能够合成高保真、视角一致的视频，赋能下游机器人规划和学习。

English abstract

Embodied agents require robust and comprehensive 3D spatiotemporal representations to support spatial reasoning, manipulation understanding, and downstream decision making. However, existing robot data are typically captured from fixed or sparse viewpoints, providing only partial and view-dependent observations, which limits multi-view perception and generalization across viewpoints. Given the difficulty of collecting additional viewpoints in real-world settings, we propose Embody4D, a dedicated video-to-video world model for embodied scenarios to bridge this observation gap by transforming a monocular robot video into novel-view videos from flexible target camera viewpoints. First, to tackle training data scarcity, we introduce a 3D-aware compositional synthesis pipeline to curate a heterogeneous dataset compositing cross-embodiment robotic arms with diverse backgrounds, promoting broad generalization. Second, to enforce geometric stability, we devise a latent confidence-aware expert modulation strategy, which estimates the reliability of warped latent priors and adaptively routes regions to copy, repair, or inpaint experts for spatiotemporally consistent 4D generation. Finally, to enhance the fidelity of the manipulation, we incorporate an interaction-aware attention mechanism that explicitly attends to the robotic interaction regions. Extensive experiments show that Embody4D achieves state-of-the-art performance on visual evaluation benchmarks, while both simulated and real-world robotic experiments further demonstrate its effectiveness as a robust data engine for synthesizing high-fidelity, view-consistent videos that empower downstream robotic planning and learning.


2605.01616 2026-06-09 cs.LG cs.AI cs.CY cs.NI Version update

Learning Behavioral Signals from Encrypted Smartphone Network Traffic

从加密智能手机网络流量中学习行为信号

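Rameen Mahmood, Omar El Shahawy, Souptik Barua, Zachary Beattie, Jeffrey Kaye, Xuhai "Orson" Xu, Chao-Yi Wu, Danny Yuxing Huang

Affiliations: New York University; NYU Langone Health; NYU Grossman School of Medicine; Oregon Health & Science University; Columbia University; Harvard Medical School

AI summary: Uses a transformer-based model with user-specific adapters to learn behavioral representations from encrypted network traffic; analysis with sparse representation learning and generalized estimating equations finds that stress, loneliness, and sleep disturbance are associated with between-person differences, within-person fluctuations, and a combination of both, respectively, and that the learned representations outperform conventional handcrafted features.

Comments: 19 pages, 6 figures

Chinese abstract (AI translation)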

人类行为难以在大规模下连续测量，然而日常活动和幸福感的痕迹可能反映在与个人设备的交互中。我们研究加密的智能手机网络流量是否可以作为被动感知信号，用于检测与睡眠障碍、压力和孤独感相关的行为状态。为了捕捉群体层面的模式和个体特定的行为，我们采用基于Transformer的模型，该模型带有用户特定的适配器，学习网络活动的表征，同时考虑个人基线及其偏差。为了提高可解释性，我们进一步使用稀疏表示学习分析这些表征，以识别与不同活动模式相关的潜在行为特征。我们使用带有Mundlak分解的广义估计方程将所得特征与睡眠障碍、压力和孤独感联系起来，从而能够区分稳定的个体间差异和随时间变化的个体内变化。我们的分析揭示了这三种结果具有不同的时间动态：压力主要与持续的个体间变异相关，孤独感与个体内波动更密切相关，而睡眠障碍则反映了两者的结合。重要的是，这些个体内行为信号无法通过传统的手工网络流量特征恢复，这突显了学习表征在纵向行为建模中的优势。总体而言，我们的发现表明加密网络流量包含可解释的行为信息，并能够支持被动、可扩展的行为动态监测，特别是相对于个体典型活动模式的变化。

English abstract

Human behavior is challenging to measure continuously at scale, yet traces of daily routines and well-being may be reflected in interactions with personal devices. We investigate whether encrypted smartphone network traffic can serve as a passive sensing signal for behavioral states related to sleep disturbance, stress, and loneliness. To capture both population-level patterns and individual-specific behavior, we employ a transformer-based model with user-specific adapters that learns representations of network activity while accounting for personal baselines and deviations from them. To improve interpretability, we further analyze these representations using sparse representation learning to identify latent behavioral features associated with distinct activity patterns. We relate the resulting features to sleep disturbance, stress, and loneliness using generalized estimating equations with Mundlak decomposition, enabling separation of stable between-person differences from within-person changes over time. Our analysis reveals that the three outcomes are characterized by different temporal dynamics: stress is predominantly associated with persistent between-person variation, loneliness is more strongly linked to within-person fluctuations, and sleep disturbance reflects a combination of both. Importantly, these within-person behavioral signals are not recovered by conventional handcrafted network-traffic features, highlighting the advantages of learned representations for longitudinal behavioral modeling. Overall, our findings demonstrate that encrypted network traffic contains interpretable behavioral information and can support passive, scalable monitoring of behavioral dynamics, particularly changes relative to an individual's typical pattern of activity.
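The Mundlak decomposition mentioned in the abstract splits each time-varying feature into a person-level mean (the between-person part) and the deviation from that mean (the within-person part). A minimal sketch of that split follows, assuming a made-up feature and toy participant IDs; the generalized estimating equation fit that would consume both columns is not shown.

```python
from collections import defaultdict


def mundlak_decompose(rows):
    """Split a repeated-measures feature into between and within components.

    rows: list of (person_id, value) observations.
    Returns a list of (person_id, between, within) where
    between = that person's mean and within = value - between.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pid, x in rows:
        sums[pid] += x
        counts[pid] += 1
    means = {pid: sums[pid] / counts[pid] for pid in sums}
    return [(pid, means[pid], x - means[pid]) for pid, x in rows]


# Toy data: two hypothetical participants, two observations each.
rows = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 14.0)]
decomposed = mundlak_decompose(rows)
for pid, between, within in decomposed:
    print(pid, between, within)
```

Entering both components as separate regressors is what allows a stable between-person association to be told apart from a within-person fluctuation.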


2606.07435 2026-06-09 cs.CV cs.CL Version update

The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?

唇读差距：VSR模型是否像人类唇读者一样感知视觉语音？

Rishabh Jain, Naomi Harte

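Affiliations: Sigmedia Group; School of Engineering; Trinity College Dublin

AI summary: Compares VSR systems with humans on the MaFI dataset and finds that, although the models achieve higher overall accuracy, their error patterns differ from those of humans and they rely mainly on language cues from training data rather than visual perception.

Comments: Accepted at INTERSPEECH 2026

Chinese abstract (AI translation)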

视觉语音识别（VSR）模型在基准测试中现已超越人类唇读者，但这样的进步是否建立了类人的视觉语音感知？为探究此问题，我们使用MaFI词级唇读数据集，在词、字符、音素和视位级别上比较了三个VSR系统与人类基线。尽管模型实现了更高的整体准确率，但它们在不同于人类的单词上成功和失败。仅给定少量初始音素的纯文本n-gram基线可与人类唇读相媲美。VSR词级错误始终能更好地通过训练词频而非词的视觉信息量来解释。视位准确率、混淆矩阵以及人类-模型相关性进一步表明，模型在人类认为最难的视位上获益最多，并且对视觉清晰度的依赖性弱得多。我们的工作表明，VSR系统主要依赖训练数据中的语言线索而非视觉感知，未能将视觉特征绑定为有意义的单词。

English abstract

Visual speech recognition (VSR) models now surpass human lipreaders on benchmarks, but do such gains establish human-like visual speech perception? To explore this, we compare three VSR systems with human baselines on the MaFI word-level lipreading dataset using word, character, phoneme, and viseme-level metrics. Although models achieve higher overall accuracy, they succeed and fail on different words than humans. A text-only n-gram baseline given only a few initial phonemes rivals human lipreading. VSR word-level errors are consistently better explained by training word frequency than by the visual informativeness of words. Viseme accuracies, confusion matrices and human-model correlations further show that models gain most on visemes humans find hardest, and show much weaker dependence on visual clarity. Our work demonstrates that VSR systems rely primarily on language cues from training data rather than visual perception, failing to bind visual features into meaningful words.


2606.07431 2026-06-09 cs.CV Version update

OpenGlass: Ultra-Low-Power On-Device AI Eyewear with Event-based Vision

OpenGlass：用于设备上基于事件的手势识别的开源智能眼镜

Pietro Bonazzi, Julian Moosmann, Ahmet Celik, Philipp Mayer, Michele Magno

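Affiliations: Department of Information Technology and Electrical Engineering, ETH Zürich

AI summary: Presents OpenGlass, an open-source smart glasses platform with a modular design, event-driven power management, and a GAP9 RISC-V SoC for low-power on-device ML, reaching 83.94% cross-subject gesture recognition accuracy on the LynX dataset.

Chinese abstract (AI translation)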

智能眼镜通过多模态传感器和设备上智能实现无干扰、上下文感知的交互，但受限于紧凑外形下的功耗、内存和计算约束。支持事件视觉和嵌入式ML的开源硬件平台在此规模下很少见。本文介绍了一个开源智能眼镜平台，用于新型传感器和算法的快速原型设计。其模块化设计使用灵活的FPC转接板，支持事件相机和帧相机，无需完全重新设计PCB。硬件-软件协同设计的电源管理系统结合了可配置PMIC和通过nRF5340协调器的事件驱动唤醒，使GAP9 RISC-V SoC在推理之间保持断电。原型从200 mAh电池实现长达11.8小时的连续设备上ML。作为演示，使用来自Prophesee GENX320相机的极性分离事件直方图，在LynX数据集上评估了以自我为中心的手势识别流水线。R(2+1)D在留二受试者交叉验证下达到最佳跨主体准确率83.94%（宏F1=0.781），在GAP9上端到端延迟为33.9毫秒。时间增强和去除模糊类别带来了最大增益（+8.9个百分点）。所有硬件设计、固件和模型均开源发布。

English abstract

Smart eyewear enables unobtrusive, context-aware interaction through multimodal sensors and on-device intelligence, but is severely limited by power, memory, and compute constraints in a compact form factor. Open-hardware platforms supporting event-based vision and embedded ML at this scale are rare. This work introduces an open-source smart glasses platform for rapid prototyping of novel sensors and algorithms. Its modular design uses a flexible FPC interposer to support both event-based and frame-based cameras without full PCB redesign. A hardware-software co-designed power management system combines a configurable PMIC with event-driven wake-up via an nRF5340 coordinator, keeping the GAP9 RISC-V SoC powered down between inferences. The prototype achieves up to 11.5 hours of continuous on-device ML from a 200 mAh battery. As a demonstration, an egocentric hand gesture recognition pipeline was evaluated on the LynX dataset using polarity-separated event histograms from a Prophesee GENX320 camera. R(2+1)D achieved the best cross-subject accuracy of 83.94% (macro F1 = 0.781) under leave-two-subjects-out validation, with 78.3 ms end-to-end inference latency on the GAP9. Temporal augmentation and removal of ambiguous classes provided the largest gains (+8.9 pp). All hardware designs, firmware, and models are released open source.
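A polarity-separated event histogram, the input representation named in the abstract, counts events per pixel in two channels, one per polarity. The sketch below shows one plausible way to build it. The `(x, y, t, polarity)` tuple layout, the 0/1 polarity encoding, and the tiny grid are assumptions for illustration, not details taken from the paper or from the camera's SDK.

```python
def polarity_histogram(events, width, height):
    """Accumulate events into a 2-channel per-pixel count grid.

    events: iterable of (x, y, t, polarity), polarity in {0, 1} (assumed encoding).
    Returns hist with hist[polarity][y][x] = number of events at that pixel.
    Timestamps are ignored here; one histogram covers the whole window.
    """
    hist = [[[0] * width for _ in range(height)] for _ in range(2)]
    for x, y, _t, p in events:
        if 0 <= x < width and 0 <= y < height and p in (0, 1):
            hist[p][y][x] += 1
    return hist


# Toy window: four events on a 4x3 grid.
events = [(0, 0, 10, 1), (0, 0, 12, 1), (1, 0, 15, 0), (3, 2, 20, 1)]
hist = polarity_histogram(events, width=4, height=3)
print(hist[1][0][0], hist[0][0][1])
```

Stacking such histograms over consecutive time windows would give a video-like tensor of the kind a spatiotemporal network can consume.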


2606.07419 2026-06-09 cs.CV Version update

DisPOSE: Projected Polystochastic Diffusion for Self-Supervised Multi-View 3D Human Pose Estimation

DisPOSE: 投影多随机扩散用于自监督多视图3D人体姿态估计

Tony Danjun Wang, Tolga Birdal, Nassir Navab, Lennart Bastian

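Affiliations: Imperial College London; Technical University of Munich

AI summary: Proposes DisPOSE, a framework that models multi-view person assignment as a generative diffusion process over the space of polystochastic tensors, using differentiable Sinkhorn projections and a hypergraph-convolutional decoder for self-supervised 3D human pose estimation, with strong results on standard datasets and in occluded operating-room scenes.

Chinese abstract (AI translation)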

从不同摄像机视角恢复多个个体的3D人体姿态是分析交互行为的基本瓶颈。现有的自监督方法利用3D姿态的合成目录；然而，由于分布偏移，这导致在真实场景中泛化能力差。因此，我们引入了DisPOSE，一个自监督框架，将固有的离散多视图人员分配问题近似为多随机张量空间上的生成扩散过程。通过在去噪过程中采用可微的Sinkhorn投影，模型学会基于2D图像先验引导解决方案走向有效且可行的分配。然后，使用超图卷积解码器对定位个体的完整3D骨架进行回归，该解码器显式建模跨多个视图的关系结构和关节。所提出的方法在标准数据集上优于当前最先进的自监督方法，并在一个包含手术室高度遮挡场景的新基准上展示了强大的性能。我们的基于扩散的定位展示了高标签效率，仅使用10%的伪标签就能保持99%的性能。值得注意的是，在保持可微性的同时解耦分配和根回归组件，使得DisPOSE几乎对不同摄像机布置不敏感。

English abstract

Recovering 3D human poses for multiple individuals from different camera views is a fundamental bottleneck for analyzing interacting behaviors. Existing self-supervised approaches leverage synthetic catalogues of 3D poses; however, this leads to poor generalization in real-world scenarios due to distribution shifts. We therefore introduce DisPOSE, a self-supervised framework that approximates the inherently discrete multi-view person-assignment problem as a generative diffusion process over the space of polystochastic tensors. By employing differentiable Sinkhorn projections during denoising, our model learns to guide solutions toward valid and feasible assignments based on 2D image priors. The complete 3D skeletons of localized individuals are then regressed using a Hypergraph-Convolutional Decoder that explicitly models relational structures and articulated joints across multiple views. The proposed approach outperforms current state-of-the-art self-supervised methods on standard datasets and demonstrates strong performance on a newly proposed benchmark featuring highly occluded scenes from surgical operating rooms. Our diffusion-based localization demonstrates high label efficiency, retaining 99% of its performance with only 10% of the pseudo-labels. Notably, disentangling the assignment and root regression components while maintaining differentiability makes DisPOSE nearly agnostic to different camera arrangements.
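The Sinkhorn projection mentioned in the abstract can be illustrated in its simplest two-view form: alternately normalizing the rows and columns of a positive score matrix drives it toward a doubly stochastic matrix, a soft version of a one-to-one assignment. The sketch below covers only this square 2D case with made-up scores. The polystochastic tensors in the paper generalize this to more than two views, and how the projection is embedded in the denoising steps is not reproduced here.

```python
import math


def sinkhorn(scores, n_iters=200):
    """Alternately normalize rows and columns of exp(scores).

    scores: square list-of-lists of real-valued affinities.
    Returns an (approximately) doubly stochastic matrix.
    """
    m = [[math.exp(s) for s in row] for row in scores]
    n = len(m)
    for _ in range(n_iters):
        for i in range(n):  # make every row sum to 1
            r = sum(m[i])
            m[i] = [v / r for v in m[i]]
        for j in range(n):  # make every column sum to 1
            c = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= c
    return m


# Toy affinities between 3 detections in view A and 3 in view B.
scores = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.0, 0.4, 1.8]]
P = sinkhorn(scores)
for row in P:
    print([round(v, 3) for v in row])
```

Each step is a division by a sum, so the whole loop is differentiable, which is the property that lets a projection of this kind sit inside a learned model.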


2606.07379 2026-06-09 cs.LG cs.AI cs.CL stat.ME Version update

Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

编码智能体会欺骗我们吗？通过带随机测试的上限评估检测和防止作弊

Thanawat Lodkaew, Johannes Ackermann, Soichiro Nishimori, Nontawat Charoenphakdee, Masashi Sugiyama, Takashi Ishida

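Affiliations: The University of Tokyo; RIKEN

AI summary: Proposes CapCode, a framework that uses capped evaluation to detect cheating by models on coding tasks, and CapReward, a reward design that discourages cheating; experiments show the approach detects and reduces cheating.

Chinese abstract (AI translation)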

在智能体评估和训练中，一个日益增长的失败模式是模型可以通过利用捷径而非解决预期任务来获得高评估分数，产生欺骗性表现。这使得评估分数作为真实任务解决能力的度量不可靠。我们提出CapCode，一个构建带有随机测试的编码数据集的框架，其最佳可达的非作弊性能被故意限制在1以下。这种上限性能设计赋予评估分数更清晰的解释：显著高于上限的分数是不可信的，因此提供了作弊的证据。为了防止作弊，我们提出CapReward，一种基于CapCode原则的奖励设计，以抑制超出上限的优化。跨多个数据集的实验表明，CapCode能够检测作弊同时保持模型的性能排名，CapReward减少了作弊行为，产生了更好地遵循预期任务规范的模型。

English abstract

A growing failure mode in agent evaluation and training is that models can achieve high evaluation scores by exploiting shortcuts instead of solving the intended task, producing deceptive performance. This makes evaluation scores unreliable as measures of true task-solving ability. We propose CapCode, a framework for constructing coding datasets with randomized tests whose best achievable non-cheating performance is deliberately capped below one. This capped-performance design gives evaluation scores a clearer interpretation: scores substantially above the cap are implausible and therefore provide evidence of cheating. To prevent cheating, we propose CapReward, a reward design based on the CapCode principle to discourage optimization beyond the cap. Experiments across multiple datasets show that CapCode detects cheating while preserving performance ranking of models, and CapReward reduces cheating behavior, yielding models that better follow the intended task specification.
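The reasoning "scores substantially above the cap are implausible" can be made concrete with a simple tail-probability check. The sketch below rests on assumptions that are not taken from the paper: each randomized test is treated as an independent trial that an honest solver passes with probability at most `cap`, and a one-sided binomial tail with a hypothetical threshold `alpha` decides what counts as implausible. CapCode's actual construction and decision rule may differ.

```python
from math import comb


def tail_probability(passes, n_tests, cap):
    """P[X >= passes] for X ~ Binomial(n_tests, cap)."""
    return sum(
        comb(n_tests, k) * cap**k * (1 - cap) ** (n_tests - k)
        for k in range(passes, n_tests + 1)
    )


def looks_like_cheating(passes, n_tests, cap, alpha=0.01):
    """Flag a score that an honest, cap-limited solver would rarely reach.

    alpha is a hypothetical significance level chosen for illustration.
    """
    return tail_probability(passes, n_tests, cap) < alpha


# Toy numbers: 100 randomized tests, honest pass rate capped at 0.5.
print(looks_like_cheating(52, 100, 0.5))  # near the cap
print(looks_like_cheating(95, 100, 0.5))  # far above the cap
```

The point of a cap below one is that it leaves headroom: a near-perfect score can no longer be explained by solving the task well, only by exploiting something outside the intended specification.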


2606.07118 2026-06-09 cs.RO Version update

QuadVerse: An Integrated Framework Aligning Visual-Physical Reality for Quadruped Simulation

QuadVerse：一种对齐视觉-物理现实用于四足仿真的集成框架

Yuxiang Chen, Yuanhao Wang, Ziheng Zhang, Meng Zhang, Yu Liu, Yufei Jia, Tiancai Wang, Erjin Zhou, Jin Xie

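Affiliations: Nanjing University; BUPT; DEXMAL; Tsinghua University

AI summary: Proposes QuadVerse, a framework that uses reconstructed scenes to calibrate vision, physics, and actuators, using 3DGS and contact calibration to narrow the sim-to-real gap and enable zero-shot deployment of visual-navigation policies.

Chinese abstract (AI translation)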

仿真对于机器人学习至关重要，然而仿真到现实的差距仍然是一个主要挑战。现有方法通常单独处理视觉或动态差距，忽略了这些个体不匹配如何在机器人状态估计中累积和传播。在本文中，我们介绍QuadVerse，一个集成框架，使用重建场景作为校准基底，对齐视觉感知、物理交互和致动器动力学。从捕获的RGB视频中，我们重建几何约束的3D高斯泼溅（3DGS）场景，支持批处理的光照真实自我视角渲染和可用于碰撞的语义网格提取。网格进一步通过初始化空间变化的摩擦先验并通过基于轨迹的后验推理细化，实现接触校准。为了解决剩余的致动器差异，QuadVerse通过在接触校准的地形上重放真实世界轨迹来训练残差动力学补偿器，减少地形引起的接触误差与致动器动力学之间的纠缠。我们表明，QuadVerse在相关基线上提高了重建质量和运动跟踪。在此基础之上，我们展示了无需任务特定真实世界部署的鲁棒零样本视觉导航策略部署。

English abstract

Simulation is central to robot learning, yet the sim-to-real gap remains a major bottleneck. Existing approaches often tackle visual or dynamic gaps separately, overlooking how these individual mismatches accumulate and propagate throughout the robot's state evolution. In this paper, we introduce QuadVerse, an integrated framework that uses reconstructed scenes as a calibration substrate for aligning visual perception, physical interaction, and actuator dynamics. From captured RGB videos, we reconstruct geometry-constrained 3D Gaussian Splatting (3DGS) scenes that support batched photorealistic ego-view rendering and collision-ready semantic mesh extraction. The meshes further enable contact calibration by initializing spatially varying friction priors and refining them through trajectory-based posterior search. To address remaining actuator discrepancies, QuadVerse trains a residual dynamics compensator by replaying real-world trajectories on the contact-calibrated terrain, reducing the entanglement between terrain-induced contact errors and actuator non-idealities. Experiments show that QuadVerse improves reconstruction quality and locomotion tracking over relevant baselines. Leveraging this foundation, we demonstrate robust zero-shot visual-navigation policy deployment without task-specific real-world rollouts.


2606.07108 2026-06-09 cs.AI Version update

DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling

DyCon: 通过演化难度建模的动态推理控制

Tengyao Tu, Yulin Li, Hui-Ling Zhen, Libo Qin, Zhoujun Wei, Jinghua Piao, Zhuotao Tian, Yong Li, Min Zhang

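Affiliations: Harbin Institute of Technology, Shenzhen; Zhongguancun Academy; Huawei Noah’s Ark Lab; Shenzhen Loop Area Institute; Tsinghua University

AI summary: Proposes DyCon, a training-free framework that uses step-level embeddings to model how difficulty evolves during reasoning and to control reasoning depth, cutting redundant steps and improving efficiency without loss of accuracy.

Comments: Accepted at ICML 2026

Chinese abstract (AI translation)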

近期大型推理模型（LRMs）通过迭代反思、探索和执行复杂任务取得了显著的性能提升，但由于冗余推理（即“过度思考”）而效率低下。现有的缓解方法要么依赖静态难度估计，要么需要特定任务训练，因此无法适应推理过程中的动态复杂性。在这项工作中，我们经验性地证明，问题难度在推理过程中动态演化，并线性编码在LRM的步骤级嵌入中。基于这一发现，我们提出了DyCon，一个无需训练的框架，利用潜在步骤级表示显式建模演化中的任务难度，从而实现对推理深度的动态控制以缓解过度思考问题。在4B到32B的四个模型上进行的广泛实验，涵盖数学推理、通用问答和编码任务的十二个基准测试表明，DyCon通过减少冗余步骤显著提升了推理效率，且不牺牲准确性或泛化能力。项目页面和代码可在此https URL获取。

English abstract

Recent advances in Large Reasoning Models (LRMs) demonstrate remarkable performance improvements by iteratively reflecting, exploring, and executing complex tasks, yet suffer from inefficiencies due to redundant reasoning, known as "overthinking". Existing methods to mitigate this issue either rely on static difficulty estimates or require task-specific training, and thus fail to adapt to the dynamic complexity during reasoning. In this work, we empirically show that the problem difficulty evolves dynamically throughout the reasoning process and is linearly encoded in the LRM's step-level embeddings. Building on this insight, we propose DyCon, a training-free framework that leverages latent step-level representations to explicitly model the evolving task difficulty, enabling the dynamic control of reasoning depth to mitigate the overthinking issue. Extensive experiments conducted on four models ranging from 4B to 32B, and across twelve benchmarks in math reasoning, general question answering, and coding tasks demonstrate that DyCon significantly enhances reasoning efficiency by reducing redundant steps without sacrificing accuracy or generalization. Code is available at https://github.com/yu-lin-li/DyCon.


2606.06915 2026-06-09 cs.CL cs.AI cs.LG Version update

ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning

ThinkBooster: 一种用于LLM推理无缝测试时扩展的统一框架

Vladislav Smirnov, Chieu Nguyen, Sergey Senichev, Minh Ngoc Ta, Ekaterina Fadeeva, Artem Vazhentsev, Daria Galimzianova, Nikolai Rozanov, Viktor Mazanov, Jingwei Ni, Tianyi Wu, Igor Kiselev, Mrinmaya Sachan, Iryna Gurevych, Preslav Nakov, Timothy Baldwin, Artem Shelmanov

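Affiliations: MBZUAI; ETH Zürich; Imperial College London; NUS; Accenture; Innopolis University; Independent Researcher

AI summary: Proposes ThinkBooster, a framework combining a modular library, a joint evaluation benchmark, and a deployable proxy service for test-time compute scaling of LLM reasoning, and examines performance-compute trade-offs on mathematical and coding tasks.

Chinese abstract (AI translation)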

测试时计算（TTC）扩展已成为一种强大的范式，通过在推理期间分配额外计算（例如，通过多样本生成和基于验证器的重新排序）来改进大型语言模型（LLM）推理。现有的TTC扩展策略和推理评分器仍然碎片化，在不一致的协议下进行评估，并且很少通过质量-成本权衡的视角进行分析。我们引入了ThinkBooster，一个用于LLM推理无缝测试时计算扩展的统一框架，它包括（i）一个模块化的Python库，实现了最先进的TTC扩展策略和评分器家族，（ii）一个联合评估性能和计算效率的基准，以及（iii）一个可部署的、兼容OpenAI的代理服务，使得将自适应推理无缝集成到实际应用中成为可能。我们还提供了一个演示可视化调试器，用于检查推理轨迹、中间选择决策和替代推理路径。在数学和编码任务上的实证结果揭示了TTC扩展策略和评分方法的性能-计算权衡，并表明ThinkBooster在实际任务中提供了实际收益。代码以MIT许可证在线提供。

English abstract

Test-time compute (TTC) scaling has emerged as a powerful paradigm for improving large language model (LLM) reasoning by allocating additional compute during inference, e.g., via multi-sample generation and verifier-based reranking. Existing TTC scaling strategies and reasoning scorers remain fragmented, evaluated under inconsistent protocols, and are rarely analyzed through the lens of quality-cost trade-offs. We introduce ThinkBooster, a unified framework for seamless test-time compute scaling of LLM reasoning, which consists of (i) a modular Python library implementing state-of-the-art TTC scaling strategy and scorer families, (ii) a benchmark that jointly evaluates performance and computational efficiency, and (iii) a deployable OpenAI-compatible proxy service that enables drop-in integration of adaptive reasoning into real-world applications. We further provide a demo visual debugger for inspecting the reasoning trajectories, intermediate selection decisions, and alternative reasoning paths. Empirical results on mathematical and coding tasks reveal the performance-compute trade-offs of TTC scaling strategies and scoring methods and demonstrate that ThinkBooster provides practical gains in real-world tasks. The code is available online under an MIT license.
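The example the abstract gives of test-time compute scaling, multi-sample generation followed by verifier-based reranking, is often called best-of-N. The sketch below shows that generic pattern only. `generate` and `score` are hypothetical callables standing in for an LLM sampler and a verifier, and nothing here reflects ThinkBooster's actual library or proxy API.

```python
import random


def best_of_n(prompt, generate, score, n=8, seed=0):
    """Sample n candidates and keep the one the scorer rates highest.

    generate(prompt, rng) -> candidate   (hypothetical sampler)
    score(prompt, candidate) -> float    (hypothetical verifier)
    Returns (best_candidate, best_score, number_of_samples_drawn).
    """
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score, n


# Toy stand-ins: the "model" guesses an integer, the "verifier" prefers 12.
def toy_generate(prompt, rng):
    return rng.randint(0, 20)


def toy_score(prompt, answer):
    return -abs(answer - 12)


for n in (1, 4, 16):
    best, s, cost = best_of_n("toy question", toy_generate, toy_score, n=n)
    print(n, best, s, cost)
```

The third return value is a crude stand-in for compute cost; reporting it next to the score is the kind of quality-cost view the abstract argues for.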


2606.06656 2026-06-09 cs.AI cs.LO Version update

A Study of Parallel Continuous Local Search

并行连续局部搜索研究

Cody J Christopher, Charles Gretton

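Affiliations: School of Computing, Australian National University

AI summary: Studies parallel continuous local search (CLS) for satisfiability problems with symmetric pseudo-Boolean constraints, finding that redundant constraints can inhibit convergence, that CLS can quickly complete partial assignments in hybrid solving, and that local search rapidly converges to a stable distribution of solution quality because of saddle-dense objectives.

Chinese abstract (AI translation)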

我们研究并行连续局部搜索（CLS）作为解决具有对称伪布尔（PB）约束的布尔可满足性问题的一种方法。这里，$n$变量PB可满足性问题被松弛为一个连续优化问题，其目标函数在$n$维超立方体上可微。对于可满足的实例，该优化问题的全局最小值对应于所讨论SAT问题的满足赋值。我们通过实证实验提出了几个新发现：（i）冗余约束会抑制而非加速收敛；（ii）CLS在混合设置中作为子求解器显示出前景，能快速完成部分赋值；（iii）由于鞍点密集的目标函数，局部搜索迅速收敛到解质量的稳定分布（即满足程度），此时额外的求解步骤收益递减。我们的发现为在现代加速硬件上使用CLS解决SAT问题提供了实用指导。

English abstract

We study parallel Continuous Local Search (CLS) as a solution approach for Boolean satisfiability problems with symmetric pseudo-Boolean (PB) constraints. Here, the $n$-variable PB-satisfiability problem is relaxed to a continuous optimisation problem with a differentiable objective function on an $n$-dimensional hypercube. For satisfiable instances, the global minimisers of this optimisation problem correspond to satisfying assignments of the SAT problem at hand. We present several novel findings via empirical experiments: (i) redundant constraints can inhibit rather than accelerate convergence; (ii) CLS shows promise as a sub-solver in hybridised settings, quickly completing partial assignments; and (iii) local search rapidly converges to a stable distribution of solution quality (i.e., degree of satisfaction), due to saddle-dense objectives where additional solver steps yield diminishing returns. Our findings inform practical uses of CLS for SAT on modern accelerator hardware.


2606.06554 2026-06-09 cs.LG cs.AI Version update

Multi-Scale Feature Attention Network for Polymer Classification Using Terahertz Spectroscopy

基于多尺度特征注意力网络的太赫兹双梳光谱聚合物分类

Roshni Mahtani, Ilán Carretero, Laura Monroy, Aldo Moreno-Oyervides, Oscar Elías Bonilla-Manrique, Rocío del Amor

Affiliations: Instituto Universitario de Investigación e Innovación en Tecnología Centrada en el Ser Humano, HUMAN-tech, Universitat Politècnica de València; Department of Electronic Technology, Universidad Carlos III de Madrid; Artikode Intelligence S.L.

AI summary: Proposes the Multi-Scale Feature Attention Network (MSFAN), which combines feature gating with multi-scale parallel convolutions to classify 12 polymer types from terahertz dual-comb spectroscopy data, reaching 85.2% accuracy.

Comments: Accepted in EUSIPCO'26

Chinese abstract (AI translation)

可靠的聚合物识别对于确保回收塑料的质量和安全至关重要，然而传统的分选和光谱技术往往难以提供稳健的区分。太赫兹双梳光谱(THz-DCS)提供了一种有前景的替代方案，能够实现快速、高分辨率且无损的测量。在这项工作中，我们利用THz-DCS对12种聚合物进行分类，包括纯聚合物、多层薄膜、商业混合物和生物聚合物。为了处理这些光谱信号的复杂性，我们提出了多尺度特征注意力网络(MSFAN)，这是一种专为THz-DCS数据设计的新型深度学习架构。该框架集成了用于信号重校准的特征门控和多尺度并行卷积，以捕获不同的频率模式。这些特征通过交叉特征注意力和注意力池化进一步细化，使模型能够内在地突出最具信息量的太赫兹区域。MSFAN始终优于最先进的模型，分类准确率达到85.2%。本研究展示了将THz-DCS与深度学习技术相结合，用于有效、可扩展且可解释的聚合物分类的潜力。

English abstract

Reliable polymer identification is essential for ensuring the quality and safety of recycled plastics, yet conventional sorting and spectroscopic techniques often struggle to deliver robust discrimination. Terahertz (THz) spectroscopy offers a promising alternative, providing high-resolution and non-destructive measurements. In this work, we leverage THz signals to classify 12 types of polymers, including pure polymers, multilayer films, commercial blends, and biopolymers. To handle the complexity of these spectral signals, we propose the Multi-Scale Feature Attention Network (MSFAN), a novel deep learning architecture tailored for THz data. The framework integrates feature gating for signal recalibration and multi-scale parallel convolutions to capture diverse frequency patterns. These features are further refined through cross-feature attention and attention pooling, enabling the model to intrinsically highlight the most informative THz regions. MSFAN consistently outperforms state-of-the-art models, reaching a classification accuracy of 85.2%. This study demonstrates the potential of combining THz spectroscopy with deep learning techniques for effective, scalable, and interpretable polymer classification.


2606.06407 2026-06-09 cs.CV cs.IR cs.LG eess.IV Version update

A Vision-language Framework for Comparative Reasoning in Radiology

放射学中比较推理的视觉语言框架

Tengfei Zhang, Ziheng Zhao, Xiaoman Zhang, Lisong Dai, Pengcheng Qiu, Ya Zhang, Yanfeng Wang, Weidi Xie

Affiliations: University of Science and Technology of China; Shanghai Artificial Intelligence Laboratory; School of Artificial Intelligence, Shanghai Jiao Tong University; Department of Biomedical Informatics, Harvard Medical School; Department of Radiology, Renmin Hospital of Wuhan University; Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University

AI summary: Proposes an entity-aware cross-image reasoning framework: builds the large-scale comparative imaging dataset MedReCo-DB and develops the MedReCo and MedReCo-VLM models for reference-case retrieval and temporal comparative interpretation, substantially improving comparative reasoning performance in radiology.

Chinese abstract (AI translation)

医学影像人工智能在孤立图像解读方面取得了强劲性能，但仍与放射学实践存在较大差距，因为诊断和随访依赖于对先前研究和类似参考病例的比较。本文我们将放射学比较形式化为一个实体感知的跨图像推理问题，并引入一个支持参考病例检索和时间比较解读的框架。我们构建了MedReCo-DB，这是一个从常规图像-报告对中派生的大规模比较影像资源，包含来自八个机构、四个国家、七种成像模态的超过16万名患者的69万余张图像。报告被分解为解剖结构、异常发现和病理状况，为实体条件检索和比较视觉问答提供监督。利用该资源，我们开发了MedReCo，一个用于可控检索临床类似病例的实体感知视觉编码器，以及MedReCo-VLM，一个用于生成性解读间隔变化的视觉语言扩展。在内部、外部和跨中心评估中，MedReCo在所有12个内部检索设置中实现了最高的Recall@1，并将外部检索平均提高了6.0个百分点。在临床易混淆的鉴别组中，它始终优于最强的基线。MedReCo-VLM在所有比较生成评估中取得了最佳性能，并在胸部X光片上将纵向随访准确性提高了14.5-46.5个百分点，在CT上提高了13.0-27.9个百分点。这些发现表明，实体感知的比较推理可以从常规临床数据中大规模学习，并可能为医学影像AI提供更符合临床的基础。

English abstract

Medical imaging artificial intelligence has achieved strong performance in isolated image interpretation, but remains poorly aligned with radiological practice, where diagnosis and follow-up rely on comparison across prior studies and analogous reference cases. Here we formulate radiological comparison as an entity-aware cross-image reasoning problem and introduce a framework that supports both reference-case retrieval and temporal comparative interpretation. We construct MedReCo-DB, a large-scale comparative imaging resource derived from routine image-report pairs, comprising more than 690,000 images from over 160,000 patients across eight institutions, four countries and seven imaging modalities. Reports are decomposed into anatomical structures, abnormal findings and pathological conditions to provide supervision for entity-conditioned retrieval and comparative visual question answering. Using this resource, we develop MedReCo, an entity-aware visual encoder for controllable retrieval of clinically analogous cases, and MedReCo-VLM, a vision-language extension for generative interpretation of interval change. Across internal, external and cross-center evaluations, MedReCo achieved the highest Recall@1 in all 12 internal retrieval settings and improved external retrieval by a mean of 6.0 percentage points. In clinically confusable differential groups, it consistently outperformed the strongest baselines. MedReCo-VLM achieved the best performance across all comparative generation evaluations and improved longitudinal follow-up accuracy by 14.5-46.5 percentage points on chest radiographs and 13.0-27.9 percentage points on CT. These findings suggest that entity-aware comparative reasoning can be learned from routine clinical data at scale and may provide a more clinically aligned foundation for medical imaging AI.
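Recall@1, the retrieval metric quoted in the abstract, is the fraction of queries whose top-ranked result is relevant. A short sketch of the general Recall@K computation follows, using made-up case IDs; how relevance is defined for a given entity condition in MedReCo is not specified here and is left to the caller.

```python
def recall_at_k(ranked_ids, relevant_ids, k=1):
    """Fraction of queries with at least one relevant item in the top k.

    ranked_ids: list of ranked result lists, one per query.
    relevant_ids: list of sets of relevant item IDs, one per query.
    """
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if any(item in relevant for item in ranked[:k])
    )
    return hits / len(ranked_ids)


# Toy retrieval results for three queries (hypothetical case IDs).
ranked = [["a", "b"], ["c", "d"], ["e", "f"]]
relevant = [{"a"}, {"d"}, {"z"}]
r1 = recall_at_k(ranked, relevant, k=1)
r2 = recall_at_k(ranked, relevant, k=2)
print(r1, r2)
```

A gain stated in percentage points is an absolute difference on this scale, so a 6.0-point improvement corresponds to the recall value rising by 0.06.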


2606.06399 2026-06-09 cs.CL Version update

CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments

CollabSim: 一种基于CSCW的方法，通过受控多智能体实验研究LLM智能体的协作能力

Jiaju Chen, Bo Sun, Yuxuan Lu, Yun Wang, Dakuo Wang, Bingsheng Yao

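Affiliations: Northeastern University; Microsoft Research Asia

AI summary: Proposes CollabSim, a framework that draws on CSCW theory to define collaborative capabilities, control interaction conditions, and probe agents' internal states, enabling systematic analysis of collaborative competence in LLM multi-agent systems.

Chinese abstract (AI translation)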

基于大语言模型的多智能体系统展现出日益增长的潜力，其有效性依赖于智能体通过文本渠道进行协调的能力，类似于人类团队。然而，近期研究表明，多智能体系统常常失败并非因为智能体缺乏个体任务解决能力，而是因为缺乏协作能力：建立共同基础、维持共享任务理解、平衡个体与集体激励以及在交互过程中修复失调的能力。计算机支持的协同工作领域数十年的研究已经描述了人类团队在受限通信下协调的这些要求，然而现有的多智能体系统评估主要关注任务结果或单智能体在推理、规划和工具使用方面的能力。为了能够系统分析多智能体系统中智能体的协作能力，我们引入了CollabSim，一个可配置的仿真框架，它结合了基于理论的协作能力定义、交互条件的受控操作以及智能体内部状态的行动级探测。在四个大语言模型上的实验表明，CollabSim能够捕捉条件效应、分离模型性能模式，并揭示智能体设计的任务依赖效应。

English abstract

Multi-agent systems (MAS) built on large language models have shown growing promise, with their effectiveness resting on agents' ability to coordinate through text-based channels much as human teams do. Yet recent studies suggest that MAS often falter not because agents lack individual task-solving ability, but because they lack collaborative competence: the capacity to establish common ground, maintain shared task understanding, balance individual and collective incentives, and repair misalignment as interaction unfolds. Decades of research in Computer-Supported Cooperative Work have characterized these requirements for human teams coordinating under constrained communication, yet existing MAS evaluations focus mainly on task outcomes or single-agent proficiency in reasoning, planning, and tool use. To enable a systematic analysis of agents' collaborative competence in MAS, we introduce CollabSim, a configurable simulation framework that combines a theory-grounded definition of collaborative capabilities, controlled manipulation of interaction conditions, and action-level probing of agents' internal states. Experiments across four LLMs show that CollabSim can capture condition effects, separate model performance patterns, and reveal task-dependent effects of agent design.

2606.06388 2026-06-09 cs.AI cs.CL 版本更新

Humans' ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration

人类的ALMANAC：用于智能体协作的动作级心智模型标注的人类协作数据集

Jiaju Chen, Yuxuan Lu, Jiayi Su, Chaoran Chen, Songlin Xiao, Zheng Zhang, Yun Wang, Yunyao Li, Jian Zhao, Tongshuang Wu, Toby Jia-Jun Li, Dakuo Wang, Bingsheng Yao

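Affiliations * Northeastern University; University of Notre Dame; University of Waterloo; Carnegie Mellon University; Adobe; Microsoft Research Asia

AI summary: To address current LLM agents' lack of mental-model capabilities in collaboration, builds ALMANAC, a dataset based on the Map Task that contains 2,987 collaboration actions with mental model annotations, and evaluates six LLMs on predicting human behavior and mental models.

Details

AI Chinese abstract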

近年来，LLM智能体的进展使其具备了复杂的认知能力，如多步推理、规划和工具使用，这些能力使它们逐渐成为人类的协作者。然而，有效的协作要求协作者在协作过程中持续维护和调整自身推理、伙伴意图和共享目标的心智模型。当前的智能体很少发展这种能力，因为它们主要针对任务完成进行优化，而社区缺乏带有动作级心智模型标注的真实人类协作数据，这些数据可以指导智能体获得过程级的协作能力。为填补这一空白，我们提出了ALMANAC，一个基于社会科学中经典的二元路由任务Map Task构建的动作级心智模型标注数据集。ALMANAC包含2,987个协作动作，每个动作都配有基于理论的心智模型标注，记录了参与者的自我推理、感知的伙伴意图和感知的团队目标。我们评估了六种LLM在预测人类下一轮行为和心智模型方面的表现。我们的结果证明了ALMANAC在评估模型模拟人类协作行为及推断其潜在心智模型方面的实用性。

English abstract

Recent advances in LLM agents have enabled complex cognitive capabilities, such as multi-step reasoning, planning, and tool use, that increasingly position these agents as human collaborators. Effective collaboration, however, requires collaborators to continuously maintain and align mental models of their own reasoning, partners' intentions, and shared goals during the collaborative process. Today's agents rarely develop such capabilities since they are primarily optimized for task completion, and the community lacks authentic human collaboration data with action-level mental model annotations that could guide agents toward process-level collaborative competence. To bridge this gap, we present ALMANAC, a dataset of Action-Level Mental model ANnotations for Agent Collaboration built from the Map Task, a classic dyadic routing task from social science. ALMANAC contains 2,987 collaboration actions, each paired with theory-informed mental model annotations that record the participants' self-reasoning, perceived partner intent, and perceived team goal. We benchmark six LLMs on predicting humans' next-turn behavior and mental models. Our results demonstrate ALMANAC's utility in evaluating models' ability to simulate human collaborative behaviors and infer their underlying mental models.

2606.06360 2026-06-09 cs.AI 版本更新

An Infectious Disease Spread Simulation Based on Large Language Model Decision Making

基于大语言模型决策的传染病传播模拟

Yonchanok Khaokaew, Ruochen Kong, Andreas Zufle, Hao Xue, Taylor Anderson, Chandini Raina MacIntyre, Matthew Scotch, Flora D. Salim, David J Heslop

Affiliations * Computer Science and Engineering Faculty of Engineering; The University of New South Wales; Department of Computer Science; Emory University; The Hong Kong University of Science and Technology (Guangzhou); George Mason University; The Kirby Institute Faculty of Medicine & Health; Arizona State University

AI summary: Proposes a spatially explicit agent-based simulation framework that uses large language models to generate decisions about self-reporting influenza-like illness and integrates them into a census-based synthetic population to capture social and geographic heterogeneity.

Details

Comments: 12 pages

AI Chinese abstract

在传染病爆发期间对个体决策进行建模对于理解行为动态和指导有效的公共卫生干预至关重要。先前的工作表明，大语言模型可以通过基于人口统计提示和情境背景生成智能体决策来模拟逼真的人类行为。我们在此基础上构建了一个空间显式的基于智能体的模拟框架，将LLM生成的关于自我报告流感样疾病的决策整合到基于人口普查的合成智能体群体中。位置被视为核心特征：智能体被分配到城市内的空间单元，利用真实世界的人口普查数据捕捉不同人口群体的空间分布，并实现地理多样化的行为建模。我们实施并比较了三种决策场景：独立推理、家庭影响和消息框架，并在旧金山和亚特兰大模拟了自我报告结果。结果显示，收入和受教育程度是报告率变化的主要驱动因素，地理、LLM模型选择和消息框架的影响较小但一致。我们的框架生成了捕捉社会和地理异质性的合成数据，支持空间流行病学建模和偏差感知行为分析。

English abstract

Modelling individual decision-making during infectious disease outbreaks is crucial for understanding behavioural dynamics and informing effective public health interventions. Prior work has shown that large language models can simulate realistic human behaviour by generating agent decisions based on demographic prompts and situational context. We build on this foundation with a spatially grounded, agent-based simulation framework that integrates LLM-generated decisions about self-reported influenza-like illness into a census-based synthetic population of agents. Location is treated as a central feature: agents are assigned to spatial units within cities, capturing the spatial distributions of different demographic groups using real-world census data and enabling geographically diverse behavioural modelling. We implement and compare three decision scenarios, independent reasoning, household influence, and message framing, and simulate self-reporting outcomes in San Francisco and Atlanta. Results reveal that income and education are the dominant drivers of reporting rate variation, with smaller but consistent effects from geography, LLM model choice, and message framing. Our framework generates synthetic data that captures both social and geographic heterogeneity, supporting spatial epidemiological modelling and bias-aware behavioural analysis.

2606.06114 2026-06-09 cs.AI 版本更新

Towards Healthy Evolution: Exploring the Role and Mechanisms of Human-Agent Interaction in Self-Evolving Systems

走向健康进化：探索人机交互在自我进化系统中的作用与机制

Dianxing Shi, Bowen Wang, Junqi He, Junhao Chen, Yuta Nakashima

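Affiliations * The University of Osaka

AI summary: Proposes the ANCHOR framework, which simulates human supervision as a feedback mechanism to mitigate capability degradation and safety drift in self-evolving systems; experiments show that even limited supervision significantly improves safety and stability.

Details

AI Chinese abstract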

自我进化智能体通过持续的自我对弈和自我生成的学习信号进行改进，但自主进化也可能导致能力退化与安全漂移。尽管人类反馈已被证明对静态和后训练智能体有效，但其在自我进化系统中的作用仍未被充分探索。我们提出了通过类人监督与审查进行智能体规范修正（ANCHOR）框架，这是一个基于LLM的框架，模拟人类监督并在自我进化的不同阶段提供反馈。利用ANCHOR，我们评估了两个代表性的开源自我进化智能体系统在编程、数学推理和安全性方面的表现。结果表明，即使是有限的监督也能显著缓解安全退化，同时保持核心进化目标的稳定性能。进一步分析显示，对输出验证阶段的监督是最有效的干预方式，而增加监督频率则收益递减。这些发现为设计更稳定、可控且与人类对齐的自我进化智能体系统提供了经验证据和实践指导。

English abstract

Self-evolving agents improve through continual self-play and self-generated learning signals, but autonomous evolution can also cause capability degradation and safety drift. Although human feedback has proven effective for static and post-trained agents, its role in self-evolving systems remains underexplored. We introduce Agent Norm Correction through Human-like Oversight and Review (ANCHOR), an LLM-based framework that simulates human supervision and delivers feedback at various phases of self-evolution. With ANCHOR, we evaluate two representative open-source self-evolving agent systems across coding, mathematical reasoning, and safety. Our results show that even limited supervision substantially mitigates safety degradation while preserving stable performance on core evolutionary objectives. Further analysis shows that supervision over the output verification phase is the most effective for intervention, whereas increasing supervision frequency yields diminishing returns. These findings provide empirical evidence and practical guidance for designing more stable, controllable, and human-aligned self-evolving agent systems.

2606.06076 2026-06-09 cs.AI cs.CV 版本更新

Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation

通过模态差距感知自蒸馏从符号状态学习视觉空间规划

Haocheng Luo, Jiahui Liu, Ruicheng Zhang, Zhizhou Zhong, Jiaqi Huang, Zunnan Xu, Quan Shi, Jun Zhou, Xiu Li

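Affiliations * Tsinghua University; The Hong Kong University of Science and Technology

AI summary: Proposes MGSD, a two-stage framework that bridges the modality gap between visual and symbolic planning through cold-start grounding and privileged-teacher distillation, significantly improving performance on visual planning benchmarks.

Details

Comments: 17 pages, preprint

AI Chinese abstract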

尽管视觉-语言模型在通用多模态理解方面表现出色，但在视觉空间规划上仍存在困难。我们将其归因于感知-推理模态差距：视觉规划要求模型从像素中推断潜在状态结构，然后对恢复的结构进行推理以产生有效动作，而符号规划直接利用显式对象和约束。这造成了视觉状态恢复和多步规划的双重瓶颈。为解决此问题，我们提出MGSD，一种两阶段模态差距感知自蒸馏框架。首先，冷启动接地阶段为视觉学生模型配备可靠的状态表示，最小化早期感知噪声。其次，特权教师通过在线策略蒸馏转移规划能力，使用显式符号状态监督学生自身的视觉 rollout 前缀。关键在于，符号数据仅在训练期间使用，推理完全基于视觉。在视觉规划基准上的实验表明，MGSD在4B和8B骨干网络上均持续提升视觉规划性能，宏观平均值分别提高19.3%和18.4%。所得模型缩小了与符号输入上限的差距，而消融和诊断实验证实改进来自视觉状态恢复和最优路径推理。这些结果表明，模态差距感知自蒸馏不仅改善了模型感知可行动状态的方式，也改善了它们在推断结构上进行规划的能力。代码见 https://github.com/Oranger-l/MGSD。

English abstract

While vision-language models excel at general multimodal understanding, they still struggle with visual spatial planning. We attribute this to a perception-reasoning modality gap: visual planning requires models to infer latent state structures from pixels and then reason over the recovered structure to produce valid actions, whereas symbolic planning directly leverages explicit objects and constraints. This creates dual bottlenecks in visual state recovery and multi-step planning. To address this, we propose MGSD, a two-stage modality-gap-aware self-distillation framework. First, a cold-start grounding stage equips the visual student with reliable state representations, minimizing early perception noise. Second, a privileged teacher transfers planning capabilities via on-policy distillation, using explicit symbolic states to supervise the student's own visual rollout prefixes. Crucially, symbolic data is used strictly during training, leaving inference purely visual. Experiments on visual planning benchmarks show that MGSD consistently improves visual planning across both 4B and 8B backbones, raising the macro average by 19.3% and 18.4%, respectively. The resulting models narrow the gap to symbolic-input upper bounds, while ablations and diagnostics confirm that the improvement comes from both visual state recovery and optimal-path reasoning. These results suggest that modality-gap-aware self-distillation improves not only how models perceive actionable states, but also how they plan over the inferred structure. Code is available at https://github.com/Oranger-l/MGSD.

2606.06033 2026-06-09 cs.RO 版本更新

RealDexUMI: A Wearable Universal Manipulation Interface for Dexterous Robot Learning

RealDexUMI：用于灵巧机器人学习的可穿戴通用操作接口

Chaoyi Xu, Yixuan Jiang, Jiahui Huan, Yuhui Fu, Haoyu Zhou, Weitian Yuan, Jiayi Yu, Wanpeng Zhang, Haoqi Yuan, Zongqing Lu

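Affiliations * Peking University; BeingBeyond; Beihang University; LinkerBot; Tsinghua University

AI summary: Proposes RealDexUMI, a wearable universal manipulation interface built on a shared dexterous end-effector module; a palm-side isomorphic teleoperation glove enables retargeting-free, intuitive, and precise hand control, and the average success rate across eight real-robot tasks reaches 88.75%.

Details

AI Chinese abstract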

学习灵巧操作需要演示，这些演示在保持精细手-物体交互的同时，在部署时仍可执行。现有流程要么通过重定向或具身转换损失可部署的灵巧性，要么依赖特定机器人的遥操作，这种遥操作成本高昂且难以扩展，并且通常缺乏用于灵巧数据收集的直观、接触感知控制。我们提出RealDexUMI，一种围绕共享灵巧末端执行器模块构建的可穿戴通用操作接口，该模块集成了轻量级灵巧手、手内视觉和指尖触觉传感。掌侧同构遥操作手套将人类手指输入映射到机器人手关节命令，实现实时、无重定向、直观且精确的手部控制。共享的手和传感模块产生零间隙的末端执行器数据，在收集和部署之间具有匹配的手内观察、触觉信号、接触和手部动作。在涵盖精细、接触丰富、长时域和双臂操作的八项真实机器人任务中，基于RealDexUMI数据训练的策略平均成功率达到88.75%，能够泛化到未见过的初始姿态，并在三种具身之间迁移。网站：https://research.beingbeyond.com/realdexumi

English abstract

Learning dexterous manipulation requires demonstrations that preserve fine hand-object interactions while remaining executable at deployment. Existing pipelines either lose deployable dexterity through retargeting or embodiment conversion, or rely on robot-specific teleoperation that is costly to scale and often lacks intuitive, contact-aware control for dexterous data collection. We present RealDexUMI, a wearable universal manipulation interface built around a shared dexterous end-effector module that integrates a lightweight dexterous hand, in-hand vision, and fingertip tactile sensing. A palm-side isomorphic teleoperation glove maps human finger inputs to robot-hand joint commands, enabling real-time, retargeting-free, intuitive, and precise hand control. The shared hand and sensing modules yield zero-gap end-effector data, with matched in-hand observations, tactile signals, contacts, and hand actions between collection and deployment. Across eight real-robot tasks spanning fine-grained, contact-rich, long-horizon, and bimanual manipulation, policies trained on RealDexUMI data achieve an average success rate of 88.75%, generalize to unseen initial poses, and transfer across three embodiments. Website: https://research.beingbeyond.com/realdexumi

2606.05932 2026-06-09 cs.AI cs.LG 版本更新

A Pre-Registered Causal Partition of Self-Consistency Elicitation and Reward Design in RLVR

RLVR中自我一致性激发与奖励设计的预注册因果分解

Yuze Gao

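Affiliations * Outlook.com

AI summary: Using pre-registered experiments and a causal decomposition, shows that the naive reward-design estimator in RLVR is systematically biased, and quantifies the contributions of self-consistency elicitation and genuine reward-design signal.

Details

Comments: 9 pages, 7 figures

AI Chinese abstract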

基于可验证奖励的强化学习（RLVR）即使在奖励信号是虚假的情况下也能提升推理能力——将功劳分配给群体多数答案而非真实验证器。实践者通常将朴素估计量 naive = acc(TRUE) - acc(RANDOM) 解释为奖励设计效应。我们证明该估计量存在系统性偏差：它混淆了自我一致性激发（通过多数伪奖励将策略向众数答案锐化）与真正的奖励设计信号。使用受控的表格GRPO模拟器，我们推导出精确的望远镜分解 total = null + elicit + rd，并在五个先验强度水平上测量每个项。朴素估计量中奖励设计占比从弱先验（ps=0.20）时的0.139变化到强先验（ps=0.80）时的0.05，激发项在自我一致性交叉点处符号翻转。一个预注册的2x2x2析因实验证实了非可加性（交互比0.385；AxC效应-0.089）。一个点与界试点门控表明，强先验区域是点识别的，而接近交叉区域仅是有界的。对两个已命名发表结果的重新审计分别得出“激发主导”（激发份额0.98）和“奖励设计主导”（rd份额1.18）的结论，证明了该分解的诊断价值。我们预先承诺无论翻转结果如何都提交论文；非翻转同样是一个有价值的发现。我们发布一个可复用的单命令工具，供任何对齐论文运行相同的审计。

English abstract

Reinforcement learning from verifiable rewards (RLVR) improves reasoning even when the reward signal is spurious -- assigning credit to the group-plurality answer rather than a ground-truth verifier. Practitioners commonly interpret naive = acc(TRUE) - acc(RANDOM) as the reward-design effect. We prove this estimand is systematically biased: it conflates self-consistency elicitation (sharpening the policy toward its modal answer via majority pseudo-reward) with genuine reward-design signal. Using a controlled tabular-GRPO simulator we derive an exact telescoping decomposition total = null + elicit + rd and measure each term across five prior-strength levels. The reward-design fraction of the naive estimator ranges from 0.139 at weak prior (ps=0.20) to 0.05 at strong prior (ps=0.80), with the elicitation term flipping sign at the self-consistency crossover. A pre-registered 2x2x2 factorial confirms non-additivity (interaction ratio 0.385; AxC effect -0.089). A points-vs-bounds pilot gate shows strong-prior regimes are point-identified while near-crossover regimes are only bounded. Re-audits of two named published results yield ELICITATION DOMINATED (elicitation share 0.98) and REWARD DESIGN DOMINATED (rd share 1.18) verdicts respectively, demonstrating the diagnostic value of the partition. We pre-commit to submit regardless of flip outcome; a non-flip is a finding of equal standing. We release a reusable one-command harness for any alignment paper to run the same audit.
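
A minimal sketch of how a telescoping partition of this kind can be computed. The abstract does not say which training arms define each term, so the four accuracy arms below (`acc_base`, `acc_random`, `acc_majority`, `acc_true`), their ordering, and the numbers are all assumptions made for illustration.

```python
def telescoping_partition(acc_base, acc_random, acc_majority, acc_true):
    """Split the total gain into three consecutive differences (assumed arm ordering)."""
    null = acc_random - acc_base        # gain from training on a random reward
    elicit = acc_majority - acc_random  # extra gain from the majority pseudo-reward
    rd = acc_true - acc_majority        # extra gain from the ground-truth verifier
    total = acc_true - acc_base         # equals null + elicit + rd by construction
    naive = acc_true - acc_random       # the naive estimator quoted in the abstract
    return {"null": null, "elicit": elicit, "rd": rd, "total": total, "naive": naive}

# Hypothetical accuracies, not taken from the paper.
parts = telescoping_partition(0.40, 0.45, 0.58, 0.60)
rd_fraction = parts["rd"] / parts["naive"]  # share of the naive estimator that is reward design
```

Under this assumed ordering, `naive` equals `elicit + rd`, so it overstates the reward-design term whenever the elicitation term is positive.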

2606.05872 2026-06-09 cs.AI cs.CV 版本更新

Entropy-Based Evaluation of AI Agents: A Lightweight Framework for Measuring Behavioral Patterns

基于熵的AI智能体评估：一种测量行为模式的轻量级框架

Olasimbo Ayodeji Arigbabu

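Affiliations * Olasimbo Ayodeji Arigbabu

AI summary: Proposes a lightweight entropy-based evaluation framework (EEA) that uses action entropy, trajectory entropy, tool entropy, information gain, exploration efficiency, and robustness entropy to complement traditional measures such as task success rate by examining the structure of the decision process.

Details

Comments: 6 pages, 2 Tables

AI Chinese abstract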

AI智能体通常使用任务成功率、奖励、延迟和成本进行评估。这些指标很有用，但常常忽略了智能体行为的重要方面：智能体是否过度探索、是否过于僵化地重复自身、是否有效使用工具、是否随时间减少不确定性、或者在多次运行中保持鲁棒性。本文提出基于熵的AI智能体评估（EEA），一种通过熵来测量智能体行为的轻量级框架。EEA不将智能仅视为最终任务完成，而是研究智能体决策过程的结构。该框架引入了动作熵、轨迹熵、工具熵、信息增益、探索效率和鲁棒性熵。这些指标旨在补充而非取代传统评估方法。我们还提供了一个实用的Python实现，旨在与LangChain、Google ADK、自定义智能体循环以及存储的可观测性轨迹等智能体框架集成。

English abstract

AI agents are commonly evaluated using task success, reward, latency, and cost. These metrics are useful, but they often miss important aspects of agent behavior: whether an agent explores too much, repeats itself too rigidly, uses tools effectively, reduces uncertainty over time, or remains robust across repeated runs. This paper proposes Entropy-Based Evaluation of AI Agents (EEA), a lightweight framework for measuring agent behavior through entropy. Rather than treating intelligence as only final task completion, EEA studies the structure of the agent's decision process. The framework introduces action entropy, trajectory entropy, tool entropy, information gain, exploration efficiency, and robustness entropy. These metrics are intended to complement, not replace, traditional evaluation methods. We also present a practical Python implementation designed to integrate with agent frameworks such as LangChain, Google ADK, custom agent loops, and stored observability traces.
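
A minimal sketch of what an action-entropy metric could look like. The abstract does not give EEA's definitions, so this assumes plain Shannon entropy over the empirical distribution of logged actions; the function names and the trace are hypothetical and are not the paper's Python implementation.

```python
import math
from collections import Counter

def shannon_entropy(items):
    """Shannon entropy in bits of the empirical distribution over `items`."""
    n = len(items)
    if n == 0:
        return 0.0
    counts = Counter(items)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def normalized_entropy(items):
    """Entropy divided by its maximum, log2(number of distinct items); 0.0 if fewer than two."""
    k = len(set(items))
    return shannon_entropy(items) / math.log2(k) if k > 1 else 0.0

# Hypothetical action log from one agent run.
trace = ["search", "search", "read", "search", "answer"]
action_entropy = shannon_entropy(trace)
# Assumed reading of "tool entropy": entropy over tool calls only.
tool_entropy = shannon_entropy([a for a in trace if a != "answer"])
```

Low entropy indicates an agent that repeats the same action, while values near the maximum indicate near-uniform exploration; the normalized variant makes runs with different action vocabularies comparable.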

2606.05816 2026-06-09 cs.CV cs.AI 版本更新

Emotion-Aware Image Generation from Korean Diary Text via LLM-based Prompt Translation and LoRA Fine-Tuning

基于LLM提示翻译和LoRA微调的韩语日记文本情感感知图像生成

Jihun Cho, Soo-Yeon Jeong, Sun-Young Ihm

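Affiliations * KAIST

AI summary: Proposes an emotion-aware text-to-image pipeline that uses Qwen3-8B to recognize implicit sentiment in short diary entries and Stable Diffusion 3.5 Medium fine-tuned with LoRA to generate images in a children's hand-drawing style, and also examines the effect of emotion trigger words and the limitations of CLIP Score as an evaluation metric.

Details

Journal ref: Proc. Int. Conf. Multimedia, Information Technology and its Applications (MITA), 2026
Comments: 4 pages, 4 figures, 2 tables, MITA 2026

AI Chinese abstract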

T2I模型无法有效捕捉包括日记在内的各类文本中的情感，因为它们主要关注视觉对象相关模式而非上下文情感理解。本文提出一种情感感知文本到图像流水线，从短韩语日记条目生成儿童手绘风格图像。该流水线采用Qwen3-8B识别短日记中的隐含情感，并使用基于情感触发词在儿童绘画图像上通过LoRA微调的Stable Diffusion 3.5 Medium进行图像生成。此外，本文通过实验检验情感触发词对生成图像的影响，并讨论CLIP Score作为情感感知图像生成评估指标的局限性。

English abstract

T2I models cannot effectively capture sentiment from various types of text, including diaries, as they primarily focus on visual object-related patterns rather than contextual emotional understanding. This paper proposes an emotion-aware text-to-image pipeline that generates children's hand drawing style images from short Korean diary entries. The proposed pipeline employs Qwen3-8B for recognising implicit sentiment from short diaries, and Stable Diffusion 3.5 Medium fine-tuned with LoRA on children's drawing images with emotion-based trigger words for image generation. Additionally, this paper presents experiments examining the effect of emotion trigger words on generated images and discusses the limitations of CLIP Score as an evaluation metric for emotion-aware image generation.

2606.05797 2026-06-09 cs.LG stat.ML 版本更新

Causal Longitudinal Prior-Fitted Networks for Counterfactual Outcome Prediction

因果纵向先验拟合网络用于反事实结果预测

Amirhossein Zare, Amirhessam Zare, Herlock Rahimi, Reza Salarikia, Mohammad Kashkooli

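Affiliations * Yale University; University of California, Berkeley

AI summary: Proposes CausalLongPFN, a prior-fitted in-context predictor pretrained on synthetic causal models that predicts longitudinal counterfactual outcomes without gradient updates and is competitive with domain-trained models on several benchmarks.

Details

Comments: 31 pages, 10 tables

AI Chinese abstract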

纵向治疗决策需要预测未来治疗序列下的潜在结果，同时考虑时变混杂、异质性患者动态和有限的领域特定数据。现有的纵向因果估计器通常为每个队列或模拟器训练新模型。我们引入了因果纵向先验拟合网络（CausalLongPFN），一种用于纵向因果预测的先验拟合上下文预测器。该模型完全在从时间结构因果模型的广泛先验中采样的合成情节上进行预训练，使其暴露于治疗-混杂反馈、潜在异质性、非线性状态演化、延迟效应和累积治疗反应。在测试时，CausalLongPFN被冻结：它基于支持轨迹、查询历史和提出的未来治疗序列进行条件预测，返回未来结果的预测分布，无需梯度更新或倾向性模型拟合。通过在指定治疗序列下递归应用一步预测器获得多步预测。我们在具有真实反事实标签的可分支癌症、HIV和华法林基准上，以及在MIMIC-III ICU轨迹的仅事实滚动起点预测上进行评估。CausalLongPFN在反事实基准上与领域训练的纵向基线竞争，并在事实MIMIC-III预测上表现强劲，表明当重复的领域特定训练成本高昂或不可行时，广泛的合成因果预训练可以提供有用的冻结替代方案。

English abstract

Longitudinal treatment decisions from multivariate time-series data require predicting potential outcomes under future treatment sequences in the presence of time-varying confounding, heterogeneous patient dynamics, and limited domain-specific data. Existing longitudinal causal estimators typically address this problem by training a new model for each cohort or simulator. We introduce Causal Longitudinal Prior-Fitted Networks (CausalLongPFN), a prior-fitted network for time-series causal inference in longitudinal treatment-response data and zero-shot in-context counterfactual outcome prediction. The model is pretrained entirely on synthetic episodes sampled from a broad prior over temporal structural causal models, exposing it to treatment-confounder feedback, latent heterogeneity, nonlinear state evolution, delayed effects, and cumulative treatment responses. At test time, CausalLongPFN remains frozen and is used zero-shot: it conditions on support trajectories, a query history, and a planned future treatment sequence, and returns a predictive distribution over future outcomes without gradient updates or propensity-model fitting. Multi-step predictions are obtained by recursively applying the one-step predictor under the specified treatment sequence. We evaluate the model on branchable cancer, HIV, and warfarin benchmarks with ground-truth counterfactual labels, and on factual-only rolling-origin prediction in MIMIC-III ICU trajectories. CausalLongPFN is competitive with domain-trained longitudinal baselines on counterfactual benchmarks and performs strongly on factual MIMIC-III prediction, suggesting that broad synthetic causal pretraining can provide a frozen, amortized alternative for zero-shot longitudinal treatment-response prediction when repeated domain-specific training is costly or impractical.
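
The recursive multi-step prediction described in the abstract can be sketched as a simple rollout loop. This is an illustration under stated assumptions: `one_step` is a hypothetical callable standing in for the frozen model, it returns a point prediction rather than the predictive distribution the abstract mentions, and the toy dynamics are invented.

```python
def rollout(one_step, history, treatments):
    """Recursively apply a one-step predictor under a planned treatment sequence.

    `one_step(history, treatment)` returns the predicted next outcome; each
    prediction is appended to the history before the next step is predicted.
    """
    history = list(history)  # copy so the caller's history is not mutated
    predictions = []
    for treatment in treatments:
        y_next = one_step(history, treatment)
        predictions.append(y_next)
        history.append((treatment, y_next))
    return predictions

def toy_one_step(history, treatment):
    """Toy stand-in for the model: the outcome decays by 10% and treatment subtracts 1."""
    last_outcome = history[-1][1]
    return 0.9 * last_outcome - 1.0 * treatment

# History entries are (treatment, outcome) pairs; the plan is treat, treat, no treatment.
preds = rollout(toy_one_step, [(0, 10.0)], [1, 1, 0])
```

Because each step conditions on earlier predictions, errors can compound over the horizon. The abstract does not say how the paper propagates the predictive distribution between steps.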

2606.05781 2026-06-09 cs.LG 版本更新

Domain-Adapted Small Language Models with Hybrid Post-Processing: Achieving Cost-Efficient, Low-Latency Multi-Label Structured Prediction via LoRA Fine-Tuning on Scarce Data

领域自适应的小语言模型与混合后处理：通过LoRA微调在稀缺数据上实现成本高效、低延迟的多标签结构化预测

Srinivasan Manoharan, Dilipkumar Nallusamy, Sachin Kumar, Haifeng Wu

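Affiliations * arXiv.org; GitHub

AI summary: Proposes a hybrid framework combining a LoRA fine-tuned small language model (LLaMA 3.1 8B) with deterministic rule-based post-processing; trained on only 219 examples, it performs multi-label compliance evaluation with 100% JSON structural validity and 83.0% human-validated accuracy at 46-76% lower cost.

Details

Comments: 4 pages, 2 figures, 4 tables

AI Chinese abstract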

部署前沿大型语言模型（LLM）用于特定领域的结构化评估任务通常会带来显著的延迟、成本和数据隐私开销。我们提出了一种混合框架，结合了微调的小语言模型（LLaMA 3.1 8B，通过LoRA仅2.05%可训练参数）和确定性规则后处理层。该系统仅使用219个精心挑选的示例进行训练，应用于跨18个异构输出字段的对话转录多标签合规评估。在53个未见过的生产转录的盲评中，它实现了100%的JSON结构有效性、83.0%的人工验证总体准确率，以及最关键分类字段100%的准确率。所提出的方法形式化了混合神经符号分解，并引入了针对性的硬负例增强，以改善关键决策边界的性能。在单个NVIDIA A100 GPU上运行，推理完成约需2秒，比前沿模型API快2-5倍。每次评估成本仅为0.013美元，而专有替代方案为0.025-0.055美元，节省46-76%的成本。这些结果表明，领域自适应的小语言模型与确定性后处理相结合，可以在结构化合规评估中达到前沿模型的准确性，同时大幅降低运营成本、延迟和隐私风险。

English abstract

Deploying frontier large language models (LLMs) for domain-specific structured evaluation tasks incurs prohibitive latency, cost, and data-privacy overhead. We present a hybrid framework that fine-tunes a small language model (LLaMA 3.1 8B, 2.05% trainable parameters via LoRA) on only 219 curated examples and couples it with a deterministic rule-based postprocessing layer. Applied to multi-label compliance evaluation of conversational transcripts (18 heterogeneous output fields), our system achieves 100% JSON structural validity, 83.0% human-validated overall accuracy, and 100% accuracy on the most critical classification field in blind evaluation on 53 unseen production transcripts. On a single NVIDIA A100 GPU, inference completes in $\sim$2 seconds -- 2--5x faster than frontier APIs -- at USD 0.013 per evaluation versus USD 0.025--0.055 for proprietary alternatives, yielding 46--76% cost savings. We introduce targeted hard-negative augmentation for critical decision boundaries and formalize the hybrid neural-symbolic decomposition, demonstrating that domain-adapted small language models with postprocessing can match frontier model accuracy while dramatically reducing operational cost, latency, and privacy risk.
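
The cost-savings range can be checked from the per-evaluation prices quoted in the abstract. The helper name below is hypothetical, and the calculation uses only the rounded figures as printed.

```python
def savings(ours, other):
    """Fractional cost saving of `ours` relative to `other`."""
    return (other - ours) / other

ours = 0.013              # USD per evaluation, from the abstract
low, high = 0.025, 0.055  # USD per evaluation for proprietary alternatives

savings_low = savings(ours, low)    # saving against the cheapest alternative
savings_high = savings(ours, high)  # saving against the most expensive alternative
```

With these rounded prices the upper end comes out at about 76%, matching the abstract, while the lower end comes out at about 48% rather than the reported 46%. One possible explanation, which the abstract does not confirm, is that the reported range was computed from unrounded costs.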

2606.05441 2026-06-09 cs.LG cs.AI stat.ML 版本更新

GOTabPFN: From Feature Ordering to Compact Tokenization for Tabular Foundation Models on High-Dimensional Data

GOTabPFN: 从特征排序到高维表格基础模型的紧凑分词化

Al Zadid Sultan Bin Habib, Md Younus Ahamed, Prashnna Kumar Gyawali, Gianfranco Doretto, Donald A. Adjeroh

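Affiliations * University of Cambridge

AI summary: For high-dimensional, low-sample-size tabular prediction, proposes GOTabPFN, which builds compact representations through graph-guided ordering and neuro-inspired subunit compression, improving the stability and accuracy of TabPFN under tight token budgets.

Details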

Comments: Accepted to the 43rd International Conference on Machine Learning (ICML 2026). Code and resources GitHub https://github.com/zadid6pretam/GOTabPFN PyPI https://pypi.org/project/gotabpfn Project webpage https://www.zadidhabib.com/gotabpfn.html Hugging Face ZeroGPU https://huggingface.co/spaces/zadid6pretam/GOTabPFN CPU backup https://huggingface.co/spaces/zadid6pretam/GOTabPFN_CPU

AI Chinese abstract

我们研究了如何在不重新训练大型骨干网络的情况下，使小型表格基础模型对高维小样本（HDLSS）表格预测有效。我们引入了带局部细化的图引导排序（GO-LR），证明了其与加权最小线性排列的等价性，并将实际求解器解释为TSP路径式替代方案。我们提出了基于GO-LR的GOTabPFN，以及一个神经启发子单元压缩（NSC）单元，将局部相邻的排序特征池化为元特征，从而生成紧凑表示，使TabPFN风格的预测在HDLSS场景中变得实用。在多个表格基准测试中，GOTabPFN在严格的token预算下提高了稳定性和准确性。

English abstract

We investigate how to make small tabular foundation models effective for High-Dimensional, Low-Sample Size (HDLSS) tabular prediction without retraining large backbones. We introduce Graph-guided Ordering with Local Refinement (GO-LR), show its equivalence to weighted Minimum Linear Arrangement, and interpret the practical solver as a TSP-path-style surrogate. We propose GOTabPFN, which builds on GO-LR, and a Neuro-Inspired Subunit Compression (NSC) unit to pool locally adjacent ordered features into meta-features, yielding a compact representation that makes TabPFN-style prediction practical in HDLSS regimes. Across tabular benchmarks, GOTabPFN improves stability and accuracy under tight token budgets.
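
A rough sketch of the idea of pooling locally adjacent ordered features into meta-features. The abstract does not describe how NSC pools or how GO-LR produces the ordering, so mean pooling over fixed-size windows, the function name, and the ordering below are all assumptions for illustration.

```python
def pool_adjacent(row, order, window):
    """Reorder one sample's features by `order`, then average each block of `window` neighbours."""
    ordered = [row[i] for i in order]
    blocks = [ordered[s:s + window] for s in range(0, len(ordered), window)]
    return [sum(block) / len(block) for block in blocks]

row = [5.0, 1.0, 3.0, 2.0, 4.0, 6.0, 7.0]  # hypothetical sample with 7 features
order = [1, 3, 2, 4, 0, 5, 6]              # hypothetical feature ordering
meta = pool_adjacent(row, order, window=3)  # 7 features compressed to 3 meta-features
```

The point of the sketch is the shape change: after ordering, neighbouring features are assumed to be related, so pooling them shrinks the number of tokens the downstream model has to process.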

2606.05409 2026-06-09 cs.CV cs.CL 版本更新

Would you still call this Dax? Novel Visual References in VLMs and Humans

你还会称它为Dax吗？VLM与人类中的新颖视觉参照

Ada Defne Tür, Gaurav Kamath, Joyce Chai, Siva Reddy, Benno Krojer

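Affiliations * McGill University; Mila Quebec AI Institute; University of Michigan - Ann Arbor; Canada CIFAR AI Chair

AI summary: Introduces the Novel Visual References Dataset (NVRD) and compares how VLMs and humans generalize novel visual concepts, finding that models struggle to acquire new concepts that contradict prior knowledge and that they overgeneralize.

Details

AI Chinese abstract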

视觉语言模型（VLM）像人类学习者一样，经常接触新的视觉概念，但它们在接触后如何将新颖的视觉参照映射到语言上仍未被充分探索，特别是当这些参照与预训练的先验知识相矛盾时。为了研究这一点，我们提出了新颖视觉参照数据集（NVRD）：包含跨越90个视觉概念的19,176张图像，这些概念具有不同层次的新颖性，每个概念最多有20个原始对象的逐渐扰动版本以测试泛化能力。与之前关于熟悉概念视觉增强的工作不同，NVRD包含完全新颖、开放式的刺激，从头构建，模拟人类遇到真正新概念的方式。我们评估了3个开源和2个闭源模型以及2,400个人类判断，以进行直接的人机比较，发现（i）当新概念与先验知识矛盾时，模型难以在上下文中习得它们，以及（ii）虽然模型和人类对视觉扰动表现出相关的敏感性，但模型显著过度泛化，将学到的标签扩展到人类拒绝的刺激上。我们贡献了NVRD作为人类和机器视觉概念学习研究的语料库和基准。

English abstract

Vision-language models (VLMs), like human learners, are frequently exposed to new visual concepts, but how they map novel visual references to language after exposure remains largely underexplored, particularly when those references contradict prior knowledge from pre-training. To study this, we present the Novel Visual References Dataset (NVRD): 19,176 images spanning 90 visual concepts across different levels of visual novelty, each with up to 20 increasingly perturbed versions of the original object to probe generalization. Unlike prior work on visual augmentations of familiar concepts, NVRD comprises entirely novel, open-ended stimuli constructed from scratch, mirroring how humans encounter genuinely new concepts. We evaluate 3 open- and 2 closed-source models alongside 2,400 human judgments for direct human-model comparison, and find that (i) models struggle to acquire novel concepts in-context when they contradict prior knowledge, and (ii) while models and humans show correlated sensitivity to visual perturbations, models significantly overgeneralize, extending learned labels to stimuli that humans reject. We contribute NVRD as a corpus and benchmark for research on visual concept learning in both humans and machines.

2606.04945 2026-06-09 cs.LG 版本更新

STaR-Quant: State-Time Consistent Post-Training Quantization for Diffusion Large Language Models

STaR-Quant：扩散大语言模型的状态-时间一致训练后量化

Xin Yan, Aqiang Wang, Zhenglin Wan, Xingrui Yu, Ivor Tsang

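Affiliations * School of Artificial Intelligence, Beijing Normal University, Beijing, China; Department of Computer Science, National University of Singapore, Singapore; Centre for Frontier AI Research, Agency for Science, Technology and Research (A*STAR), Singapore

AI summary: To address state-dependent activation disparity and temporal error accumulation in low-bit quantization of diffusion large language models, proposes the STaR-Quant framework, which achieves efficient quantization through state-guided activation transformation and temporal attention compensation.

Details

AI Chinese abstract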

扩散大语言模型（DLLMs）最近通过迭代掩码去噪和双向上下文生成文本，成为自回归LLMs的有前途的替代方案。然而，它们的大模型规模和迭代去噪过程带来了大量的内存和计算开销，促使采用训练后量化以实现高效部署。在本文中，我们确定了低比特DLLM量化的两个关键挑战：状态相关的激活差异和时间误差累积。在每个去噪步骤中，掩码和未掩码的标记表现出不同的激活分布，而在迭代解码过程中，量化误差可能跨步骤累积。为了解决这些挑战，我们提出了STaR-Quant，一种用于DLLMs的状态-时间一致PTQ框架。STaR-Quant引入了状态引导激活变换（SGAT），通过统一的静态权重侧变换将掩码和未掩码的标记分配到不同的激活变换空间。它进一步引入了时间注意力补偿（TAC），通过轻量级块对角仿射映射来校正量化的注意力表示。在代表性DLLMs上的实验表明，STaR-Quant在低比特权重-激活量化上持续优于强PTQ基线，同时相比FP16部署实现了高达1.69倍的加速和3.14倍的内存节省。

English abstract

Diffusion large language models (DLLMs) have recently emerged as a promising alternative to autoregressive LLMs by generating text through iterative masked denoising with bidirectional context. However, their large model sizes and iterative denoising process introduce substantial memory and computational overhead, motivating post-training quantization for efficient deployment. In this paper, we identify two key challenges for low-bit DLLM quantization: state-dependent activation disparity and temporal error accumulation. Masked and unmasked tokens exhibit different activation distributions within each denoising step, while quantization errors can accumulate across steps during iterative decoding. To address these challenges, we propose STaR-Quant, a state-time consistent PTQ framework for DLLMs. STaR-Quant introduces State-Guided Activation Transformation (SGAT) to assign masked and unmasked tokens to different activation transformation spaces with a unified static weight-side transformation. It further introduces Temporal Attention Compensation (TAC) to correct the quantized attention representation via a lightweight block-diagonal affine mapping. Experiments on representative DLLMs demonstrate that STaR-Quant consistently improves low-bit weight-activation quantization over strong PTQ baselines, while delivering up to 1.69x speedup and 3.14x memory saving over FP16 deployment.

2606.04920 2026-06-09 cs.LG cs.CV 版本更新

Toward Multi-Domain and Long-Tailed Quantization via Feature Alignment and Scaling

通过特征对齐与缩放实现多域和长尾量化

Ting-An Chen, Chin-Yuan Yeh, De-Nian Yang

Affiliations * Graduate Institute of Electrical Engineering, National Taiwan University, Taiwan; Institute of Information Science, Academia Sinica, Taiwan; Graduate Institute of Communication Engineering, National Taiwan University, Taiwan; Institute of Information Science and the Research Center for Information Technology Innovation, Academia Sinica, Taiwan

AI summary: Proposes EmaQ and EmaQ-LT, which align domain distributions via a CDF-based projection and stabilize multi-domain quantization with sensitivity-weighted aggregation, and add class-conditioned variance scaling and confidence-based adjustment to mitigate long-tail issues, achieving strong low-bit performance across several benchmarks.

Details

AI Chinese abstract

量化深度神经网络对于在资源受限设备上进行高效推理至关重要。然而，现有大多数方法针对单域和类别平衡数据设计，忽略了存在域偏移或严重类别不平衡的实际场景。我们通过高效多域对齐量化（EmaQ）解决这些挑战，该方法通过基于CDF的投影对齐域分布，并使用敏感度感知权重聚合来稳定多域量化。我们进一步将EmaQ扩展到EmaQ-LT用于长尾量化，通过引入类别条件方差缩放和基于置信度的logit调整来缓解多数类过度自信。理论分析建立了收敛保证，并激励了所提出的敏感度和缩放机制。在标准、多域（Office-31、Digits）和长尾（SynDigits-LT、CIFAR-10-LT、CIFAR-100-LT）基准上的实验表明，EmaQ和EmaQ-LT在域偏移和类别不平衡下实现了强大的低比特性能。

English abstract

Quantizing deep neural networks is essential for efficient inference on resource-constrained devices. However, most existing methods are designed for single-domain and class-balanced data, leaving practical settings with domain shifts or severe class imbalance underexplored. We address these challenges with Efficient Multi-Domain Alignment Quantization (EmaQ), which aligns domain distributions through a CDF-based projection and uses sensitivity-aware weight aggregation to stabilize multi-domain quantization. We further extend EmaQ to EmaQ-LT for long-tailed quantization by introducing class-conditioned variance scaling and confidence-based logit adjustment to mitigate majority-class overconfidence. Theoretical analyses establish convergence guarantees and motivate the proposed sensitivity and scaling mechanisms. Experiments on standard, multi-domain (Office-31, Digits), and long-tailed (SynDigits-LT, CIFAR-10-LT, CIFAR-100-LT) benchmarks show that EmaQ and EmaQ-LT achieve strong low-bit performance under domain shift and class imbalance.

2606.04804 2026-06-09 cs.LG 版本更新

The Right Measure for Physics-Constrained Generation: A Co-Area Correction for Posterior-Consistent PDE Inverse Problems

物理约束生成的正确度量：后验一致PDE逆问题的共面积修正

Jian Xu, Yanning Wu, Delu Zeng, John Paisley, Qibin Zhao

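Affiliations * University of Cambridge; University of Toronto

AI summary: Shows that diffusion and flow-matching models sample the wrong posterior in hard-constrained PDE inverse problems, and proposes a co-area correction factor and the CoCoS sampler to sample the correct posterior.

Details

AI Chinese abstract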

生成模型（扩散和流匹配）越来越多地用于求解偏微分方程（PDE）逆问题，将控制物理作为硬约束（通过投影或引导）强制执行，并将所得样本报告为具有校准不确定性的贝叶斯后验。我们表明，这种广泛采用的配方采样了错误的分布。在硬PDE约束上条件化生成先验是在测度零流形上的条件化，这一操作本质上是模糊的（Borel-Kolmogorov悖论），而其物理上正确的解，即小残差噪声极限，携带一个共面积（Fixman）雅可比因子$[\det(JJ^{\top})]^{-1/2}$，而基于投影和引导的方法默默地忽略了它。我们精确地指出了偏差，表明它随约束敏感性的异质性增长，并在受控问题上通过与独立同分布的真实仲裁者对比验证了这一点。被忽略的因子并非二阶细节：移除它会使后验误差膨胀到采样噪声底限的20倍；最小位移投影（如PCFM）的偏差为底限的9倍；而简单的标量重加权无法修复。我们引入了CoCoS，一种度量感知的约束采样器，针对正确的共面积后验，并表明它在采样噪声内与黄金标准后验匹配。我们的结果意味着“满足物理”并不等同于“采样后验”，并为不确定性感知的科学推理提供了原则性的修正。

English abstract

Generative models -- diffusion and flow matching -- are increasingly used to solve partial differential equation (PDE) inverse problems, enforcing the governing physics as a hard constraint (via projection or guidance) and reporting the resulting samples as a Bayesian posterior with calibrated uncertainty. We show that this widely adopted recipe samples the wrong distribution. Conditioning a generative prior on a hard PDE constraint is conditioning on a measure-zero manifold -- an operation that is intrinsically ambiguous (the Borel--Kolmogorov paradox) and whose physically correct resolution, the small-residual-noise limit, carries a co-area (Fixman) Jacobian factor $[\det(JJ^{\top})]^{-1/2}$ that projection- and guidance-based methods silently omit. We make the bias precise, show that it grows with the heterogeneity of the constraint sensitivity, and validate it on controlled problems against an i.i.d. ground-truth arbiter. The omitted factor is not a second-order detail: removing it inflates the posterior error to $20\times$ the sampling-noise floor; minimal-displacement projection (as in PCFM) is biased at $9\times$ the floor; and a naive scalar reweighting does not fix it. We introduce CoCoS, a measure-aware constrained sampler that targets the correct co-area posterior, and show that it matches the gold-standard posterior to within sampling noise. Our results imply that "satisfying the physics" is not the same as "sampling the posterior," and give a principled correction for uncertainty-aware scientific inference.
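
To make the co-area factor concrete, here is a small sketch for the simplest case of one scalar constraint $g(x) = 0$, where $J$ is the $1 \times n$ gradient row and $[\det(JJ^{\top})]^{-1/2}$ reduces to $1 / \lVert \nabla g \rVert$. The ellipse constraint and the function names are assumptions chosen for illustration; they are not taken from the paper, and this is not the CoCoS sampler.

```python
import math

def coarea_weight(grad):
    """[det(J J^T)]^(-1/2) for a single scalar constraint, where J is the 1 x n gradient row."""
    return 1.0 / math.sqrt(sum(g * g for g in grad))

def ellipse_grad(x, y, a=2.0, b=1.0):
    """Gradient of the hypothetical constraint g(x, y) = x^2/a^2 + y^2/b^2 - 1."""
    return (2.0 * x / a**2, 2.0 * y / b**2)

w_major = coarea_weight(ellipse_grad(2.0, 0.0))  # point on the major axis
w_minor = coarea_weight(ellipse_grad(0.0, 1.0))  # point on the minor axis
```

The two points get different weights because the constraint is more sensitive near the minor axis, so a thin band $|g| < \varepsilon$ around the ellipse is narrower there. For a circle (`a == b`) the weight is constant along the constraint, which fits the abstract's statement that the bias grows with the heterogeneity of the constraint sensitivity.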

2606.04752 2026-06-09 cs.LG cs.AI Updated version

An Empirical Audit of Input Encoders for Multi-Channel Signal Transformers

Chinese title: 多通道信号Transformer输入编码器的实证审计

Ossi Lehtinen

Affiliation * Anthropic

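AI summary: An empirical audit of eight input encoders on a synthetic benchmark and on the real-world ETTh1 dataset finds that the standard linear projection (nn.Linear(C, d_model)) performs on par with more complex alternatives in most cases; only the shared-scalar baseline and the channel-independent baseline fall significantly behind.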

Details

Comments: 21 pages, 1 figure, 8 tables. Code: https://github.com/OssiLehtinen/channel-encoder-audit

AI Chinese abstract

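Transformers that process multi-channel scalar signals must embed $C$ simultaneous values into one $d_{\text{model}}$-dimensional vector at each time step. We empirically audit eight input encoders (a shared-scalar baseline, per-channel linear projections, an orthogonality regulariser, a nonlinear MLP stem, block-partitioned concatenation, channel-independent and channel-as-token architectures, and a projected positional encoding), scored by next-step negative log-likelihood (NLL), on a synthetic benchmark designed so that channel identity is informative and on ETTh1 as a real-data check. The main conclusion is practical near-equivalence within a wide "top tier": the standard per-channel linear projection (nn.Linear(C, $d_{\text{model}}$)) differs from every alternative in that tier by amounts that are statistically significant but small in practice. Two encoders fail decisively: the shared-scalar baseline (which collapses for information-theoretic reasons that we make explicit) and the channel-independent PatchTST-style baseline (which underperforms on both benchmarks and overfits universally on the synthetic benchmark). Paired tests resolve two small gaps: projecting the sinusoidal positional encoding through a learned linear layer has a slight edge at small $C$, and a direct geometric probe indicates that the mechanism is position-channel orthogonalisation; a nonlinear MLP stem has a slight edge at the largest $C$ we tested, but the gap shrinks with more training data. The practical recommendation is to default to nn.Linear(C, $d_{\text{model}}$) and to adopt a more elaborate scheme only when the task at hand gives a practical reason to. Code and data for reproducing all experiments in this paper are available at https://github.com/OssiLehtinen/channel-encoder-audit.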

English abstract

Transformers consuming multi-channel scalar signals must embed $C$ simultaneous values into one $d_{\text{model}}$-dimensional vector per time step. We audit eight input encoders -- a shared-scalar baseline, per-channel linear projections, an orthogonality regulariser, a nonlinear MLP, block-partitioned concatenation, channel-independent and channel-as-token architectures, and a projected positional encoding -- on a synthetic benchmark where channel identity is informative and on ETTh1, scored by next-step negative log-likelihood. The headline is practical near-equivalence within a wide "top tier": the standard per-channel linear projection matches every alternative up to small, statistically real but practically modest differences. A direct geometric probe attributes this to a spontaneous orthogonalisation of the per-channel projections: they end up near-orthogonal with no explicit regulariser, letting the standard linear recover channel identity from the summed embedding. Two encoders lose decisively: the shared-scalar baseline collapses for information-theoretic reasons we make explicit, and the channel-independent PatchTST-spirit baseline overfits universally on the synthetic benchmark and underperforms on both. Paired tests resolve two small gaps: projecting the sinusoidal positional encoding through a learned linear layer edges the rest at small $C$ by extending this orthogonality to the positional subspace; a nonlinear MLP stem edges them at the largest $C$, with the gap shrinking under more training data. The practical recommendation: use the standard per-channel linear projection by default; reach for something more elaborate only when the task calls for it.
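The idea of recovering channel identity from a summed embedding can be illustrated with a small pure-Python sketch. It rests on one known fact: a bias-free linear layer from $C$ inputs to $d_{\text{model}}$ outputs computes $e = \sum_c x_c w_c$, where $w_c$ is the weight column for channel $c$. Everything else is an illustrative assumption: the function names, the hand-picked vectors, and the sizes $C = 3$, $d_{\text{model}} = 4$. The sketch does not train anything and is not the paper's geometric probe, so it does not reproduce the reported spontaneous orthogonalisation; it only shows why near-orthogonal columns make channel values readable from the sum.

```python
import math

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def embed(x, columns):
    """Summed embedding e = sum_c x[c] * columns[c] (a bias-free linear layer)."""
    d_model = len(columns[0])
    return [sum(x[c] * columns[c][i] for c in range(len(x))) for i in range(d_model)]

def max_abs_cosine(columns):
    """Largest |cosine| between distinct channel vectors; 0 means exactly orthogonal."""
    worst = 0.0
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            norm = math.sqrt(dot(columns[i], columns[i]) * dot(columns[j], columns[j]))
            worst = max(worst, abs(dot(columns[i], columns[j]) / norm))
    return worst

def recover(e, columns):
    """Read each channel back by projecting e onto its vector (exact only if orthogonal)."""
    return [dot(e, w) / dot(w, w) for w in columns]

x = [0.5, -1.5, 3.0]  # one time step with C = 3 channel values

# Hypothetical orthogonal channel vectors (d_model = 4).
orthogonal = [[1.0, 1.0, 0.0, 0.0], [1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]
# Hypothetical non-orthogonal vectors: channels 0 and 1 overlap.
skewed = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]

x_hat = recover(embed(x, orthogonal), orthogonal)
x_leaky = recover(embed(x, skewed), skewed)

print(max_abs_cosine(orthogonal), x_hat)
print(max_abs_cosine(skewed), x_leaky)
```

With the orthogonal vectors the projection returns every channel value exactly. With the overlapping vectors the readout for channel 0 is contaminated by channel 1, which is the kind of cross-talk that orthogonal columns avoid.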
