arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.08084 2026-05-11 cs.RO cs.CV

123D: Unifying Multi-Modal Autonomous Driving Data at Scale

Daniel Dauner, Valentin Charraut, Bastian Berle, Tianyu Li, Long Nguyen, Jiabao Wang, Changhui Jing, Maximilian Igl, Holger Caesar, Boris Ivanovic, Yiyi Liao, Andreas Geiger, Kashyap Chitta

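AI summary: The autonomous driving field has accumulated a wealth of sensor data, but its potential remains largely untapped because the data is large in scale, diverse in modality, and inconsistent in format. This paper presents 123D, an open-source framework that integrates multi-modal driving data through a unified API, supports data with different capture rates and synchronization schemes, and provides tools for data visualization and analysis. The study consolidates several real-world datasets and a synthetic one, systematically evaluates each dataset's annotation consistency and calibration accuracy, and demonstrates the value of 123D for cross-dataset 3D object detection and reinforcement learning for planning.

Abstract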

The pursuit of autonomous driving has produced one of the richest sensor data collections in all of robotics. However, its scale and diversity remain largely untapped. Each dataset adopts different 2D and 3D modalities, such as cameras, lidar, ego states, annotations, traffic lights, and HD maps, with different rates and synchronization schemes. They come in fragmented formats requiring complex dependencies that cannot natively coexist in the same development environment. Further, major inconsistencies in annotation conventions prevent training or measuring generalization across multiple datasets. We present 123D, an open-source framework that unifies such multi-modal driving data through a single API. To handle synchronization, we store each modality as an independent timestamped event stream with no prescribed rate, enabling synchronous or asynchronous access across arbitrary datasets. Using 123D, we consolidate eight real-world driving datasets spanning 3,300 hours and 90,000 kilometers, together with a synthetic dataset with configurable collection scripts, and provide tools for data analysis and visualization. We conduct a systematic study comparing annotation statistics and assessing each dataset's pose and calibration accuracy. Further, we showcase two applications 123D enables: cross-dataset 3D object detection transfer and reinforcement learning for planning, and offer recommendations for future directions. Code and documentation are available at https://github.com/kesai-labs/py123d.
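The idea of storing each modality as an independent timestamped event stream can be sketched as follows. This is a minimal illustration, not the py123d API: the `EventStream` class, its methods, and `synchronized_view` are hypothetical names, and timestamps are assumed to be integer milliseconds appended in non-decreasing order.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class EventStream:
    """One modality stored as (timestamp, payload) events with no fixed rate."""
    timestamps: list = field(default_factory=list)  # integer milliseconds, sorted
    payloads: list = field(default_factory=list)

    def append(self, t, payload):
        # Assumes events arrive in non-decreasing time order.
        self.timestamps.append(t)
        self.payloads.append(payload)

    def latest_at(self, t):
        """Return the most recent payload at or before time t, or None."""
        i = bisect_right(self.timestamps, t)
        return self.payloads[i - 1] if i > 0 else None


def synchronized_view(streams, query_times):
    """Sample every stream at the given query times (synchronous access)."""
    return [{name: s.latest_at(t) for name, s in streams.items()}
            for t in query_times]


camera = EventStream()
lidar = EventStream()
for k in range(4):
    camera.append(k * 100, f"img{k}")    # camera at 10 Hz
for k in range(2):
    lidar.append(k * 200, f"sweep{k}")   # lidar at 5 Hz

# Use the lidar timestamps as the reference clock for a synchronous view.
frames = synchronized_view({"camera": camera, "lidar": lidar}, lidar.timestamps)
```

Asynchronous access in this sketch is simply iterating over one stream's own events; a synchronous view is derived on demand by choosing a reference clock, so no rate has to be fixed at storage time.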

2605.08077 2026-05-11 cs.CL

Conformal Path Reasoning: Trustworthy Knowledge Graph Question Answering via Path-Level Calibration

Shuhang Lin, Chuhao Zhou, Xiao Lin, Zihan Dong, Kuan Lu, Zhencan Peng, Jie Yin, Dimitris N. Metaxas

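AI summary: Knowledge graph question answering (KGQA) shows promise for interpretable reasoning, but existing methods often cannot guarantee the reliability of their answers. This paper proposes Conformal Path Reasoning (CPR), a trustworthy KGQA framework that uses path-level calibration to make predictions more reliable and efficient. CPR introduces path-level conformal calibration and the Residual Conformal Value Network (RCVNet), which improve coverage and substantially shrink prediction sets; experiments report strong results on multiple benchmarks.

Comments 13 pages, 3 figures, 2 tables

Abstract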

Knowledge Graph Question Answering (KGQA) has shown promise for grounded and interpretable reasoning, yet existing approaches often fail to provide reliable coverage guarantees over retrieved answers. While Conformal Prediction (CP) offers a principled framework for producing prediction sets with statistical guarantees, prior methods suffer from critical limitations in both calibration validity and score discriminability, resulting in violated coverage guarantees and excessively large prediction sets. To address these pitfalls, we propose Conformal Path Reasoning (CPR), a trustworthy KGQA framework with two key innovations. First, we perform query-level conformal calibration over path-level scores, preserving the exchangeability while generating path prediction sets. Second, we introduce the Residual Conformal Value Network (RCVNet), a lightweight module trained via PUCT-guided exploration to learn discriminative path-level nonconformity scores. Experiments on benchmarks show that CPR significantly improves the Empirical Coverage Rate by 34% while reducing average prediction set size by 40% compared to conformal baselines. These results validate the efficacy of CPR in satisfying coverage guarantees with substantially more compact answer sets.
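For readers unfamiliar with conformal prediction, the generic split-conformal recipe that frameworks of this kind build on can be sketched as below. This is not the CPR algorithm: the function names are hypothetical, the scores are made up, and the choice of one nonconformity score per calibration query (here taken to be the score of the correct answer's path) is an assumption made for illustration.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calibration_scores)[rank - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return [c for c, s in candidate_scores.items() if s <= threshold]


# One nonconformity score per calibration query (illustrative values).
calibration = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.41, 0.55, 0.70]
q = conformal_threshold(calibration, alpha=0.25)

# Hypothetical candidate paths for a new query, with their scores.
paths = {"path_a": 0.08, "path_b": 0.35, "path_c": 0.90}
kept = prediction_set(paths, q)
```

Calibrating with one score per query keeps the calibration points exchangeable with the test query, which is the condition the coverage guarantee relies on; a more discriminative score makes `kept` smaller at the same coverage level.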

2605.08075 2026-05-11 cs.LG eess.AS

Zero-Shot Imagined Speech Decoding via Imagined-to-Listened MEG Mapping

Maryam Maghsoudi, Shihab Shamma

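AI summary: This paper studies how to decode imagined speech from non-invasive brain recordings. Because imagined-speech data is scarce and hard to align across subjects, it proposes a new approach that leverages recordings made during listening. The researchers collected MEG data from trained musicians under listening and imagining conditions and built a three-stage decoding pipeline that maps imagined neural responses to listened responses and trains a word decoder on the listened data, ultimately decoding imagined speech significantly above chance. The results support the feasibility of imagined speech decoding and indicate its potential for brain-computer interface applications.

Abstract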

Decoding imagined speech from non-invasive brain recordings is challenging because imagined datasets are scarce and difficult to align temporally across subjects and sessions. In this work, we propose a new approach to the decoding of imagined speech that leverages the richer and more reliably labeled recordings made during listening to speech. We collected paired listened and imagined MEG recordings to rhythmic melodic and spoken stimuli from trained musicians. Using trained musicians helped improve temporal alignment across conditions. We then developed a three-stage decoding pipeline that revealed consistent and meaningful relationships between neural activity evoked by imagining and listening to the same stimuli. First, we trained six linear and neural models to map imagined MEG responses to listened responses. We evaluated these models against a null baseline from unseen subjects to validate that the predicted-listening responses preserve stimulus-specific information. In the second stage, we trained a contrastive word decoder exclusively on the listened MEG responses, and evaluated it using four embedding strategies including semantic, acoustic, and phonetic representations. In the third stage, we process the imagined MEG responses from held-out subjects through the mapping pipeline to compute the corresponding listening responses, which are then decoded by the listened decoder. Using rank-based analysis, we show that the imagined words are decodable significantly above chance. We report the results of a proof-of-concept implementation for decoding imagined speech, in which all evaluations are performed on held-out subjects. We also demonstrate that performance improves with training data size, suggesting that this approach is scalable and can be applied directly to realistic brain-computer interface scenarios.

2605.08074 2026-05-11 cs.LG

GRAPHLCP: Structure-Aware Localized Conformal Prediction on Graphs

Peyman Baghershahi, Fangxin Wang, Debmalya Mandal, Sourav Medya

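AI summary: This paper proposes GRAPHLCP, a structure-aware localized conformal prediction framework for uncertainty quantification in graph neural networks. By incorporating graph structure and inter-node dependencies, it improves on conventional localization based on embedding-space similarity, making predictions more reliable and efficient. Experiments show that GRAPHLCP guarantees marginal coverage with finite samples and attains favorable test-conditional coverage across a range of conditioning scenarios.

Comments 20 pages, 9 Figures, 8 Tables

Abstract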

Conformal prediction (CP) provides a distribution-free approach to uncertainty quantification with finite-sample guarantees. However, applying CP to graph neural networks (GNNs) remains challenging as the combinatorial nature of graphs often leads to insufficiently certain predictions and indiscriminative embeddings. Existing methods primarily rely on embedding-space proximity for localization, which can be unreliable for graphs and yield inefficient prediction sets. We propose GRAPHLCP, a proximity-based localized CP framework that explicitly incorporates graph topology and inter-node dependencies into localization and weighting. Our approach introduces a feature-aware densification step to mitigate locality bias in sparse graphs, followed by a Personalized PageRank-based kernel computation to model structural proximity. This enables topology-dependent anchor sampling and calibration weighting that captures both local and long-range dependencies. Extensive experiments on several regression and classification datasets demonstrate that GRAPHLCP guarantees marginal coverage with finite samples while efficiently attaining favorable test conditional coverage across various conditioning scenarios.
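As background for the Personalized PageRank-based kernel, the standard PPR vector for a single seed node can be computed by power iteration, as in the sketch below. This is generic PPR, not the GRAPHLCP kernel: the function name, the restart probability of 0.15, and the toy path graph are assumptions, and the graph is assumed undirected with no isolated nodes.

```python
def personalized_pagerank(adj, seed, alpha=0.15, iters=100):
    """Power iteration for PPR with restart probability alpha at `seed`.

    adj maps each node to a list of neighbours (undirected, no isolated nodes).
    """
    nodes = list(adj)
    p = {v: 1.0 if v == seed else 0.0 for v in nodes}
    for _ in range(iters):
        # Restart mass goes back to the seed on every step.
        nxt = {v: (alpha if v == seed else 0.0) for v in nodes}
        for u in nodes:
            # The remaining mass at u is split evenly among its neighbours.
            share = (1 - alpha) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        p = nxt
    return p


# Path graph 0 - 1 - 2 - 3 with the seed at node 0.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
weights = personalized_pagerank(graph, seed=0)
```

The resulting vector sums to one and puts most of its mass near the seed, while still assigning non-zero weight to distant nodes; weights of this kind are one way to express structural proximity when choosing which calibration nodes count as local.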

2605.08073 2026-05-11 cs.CV cs.AI

EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction

Wei Yu, Yunhang Qian

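AI summary: This paper proposes EmambaIR, an efficient visual state space model for event-guided image reconstruction. To address the limitations of conventional CNNs and Vision Transformers in global feature modeling and computational efficiency, EmambaIR introduces a cross-modal sparse attention module and a gated state-space module, enabling efficient global feature fusion and temporal representation learning. Experiments show that it significantly outperforms existing methods on several image reconstruction tasks while greatly reducing computation and memory consumption.

Abstract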

Recent event-based image reconstruction methods predominantly rely on Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to process complementary event information. However, these architectures face fundamental limitations: CNNs often fail to capture global feature correlations, whereas ViTs incur quadratic computational complexity ($O(n^2)$), hindering their application in high-resolution scenarios. To address these bottlenecks, we introduce EmambaIR, an Efficient visual State Space Model designed for image reconstruction using spatially sparse and temporally continuous event streams. Our framework introduces two key components: the cross-modal Top-k Sparse Attention Module (TSAM) and the Gated State-Space Module (GSSM). TSAM efficiently performs pixel-level top-k sparse attention to guide cross-modal interactions, yielding rich yet sparse fusion features. Subsequently, GSSM utilizes a nonlinear gated unit to enhance the temporal representation of vanilla linear-complexity ($O(n)$) SSMs, effectively capturing global contextual dependencies without the typical computational overhead. Extensive experiments on six datasets across three diverse image reconstruction tasks, namely motion deblurring, deraining, and High Dynamic Range (HDR) enhancement, demonstrate that EmambaIR significantly outperforms state-of-the-art methods while offering substantial reductions in memory consumption and computational cost. The source code and data are publicly available at: https://github.com/YunhangWickert/EmambaIR
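The general idea behind top-k sparse attention can be sketched for a single query as follows. This is a generic illustration and not the TSAM module: the function name, the scaled dot-product scoring, and the toy vectors are assumptions, and the cross-modal, pixel-level aspects described in the abstract are not modeled.

```python
import math


def topk_sparse_attention(query, keys, values, k):
    """Attend only to the k keys with the highest scaled dot-product score."""
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, key)) / scale for key in keys]
    # Indices of the k best-matching keys; all other keys are dropped.
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax restricted to the kept keys (shifted by the max for stability).
    m = max(scores[i] for i in keep)
    exps = {i: math.exp(scores[i] - m) for i in keep}
    z = sum(exps.values())
    dim = len(values[0])
    return [sum(exps[i] / z * values[i][j] for i in keep) for j in range(dim)]


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]

sparse_out = topk_sparse_attention(query, keys, values, k=1)
dense_out = topk_sparse_attention(query, keys, values, k=len(keys))
```

With `k` equal to the number of keys the function reduces to ordinary softmax attention; smaller `k` discards weakly matching keys, so the output mixes fewer values.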

2605.08070 2026-05-11 cs.AI

VecCISC: Improving Confidence-Informed Self-Consistency with Reasoning Trace Clustering and Candidate Answer Selection

James Petullo, Sonny George, Dylan Cashman, Nianwen Xue

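AI summary: VecCISC is an improved Confidence-Informed Self-Consistency method that aims to reduce the computational overhead of weighted majority voting at inference time. It uses a semantic similarity measure to filter out reasoning traces that are semantically redundant, degenerate, or hallucinated, reducing the number of candidate answers the critic model must evaluate. Experiments show that VecCISC matches or exceeds the accuracy of CISC on datasets from multiple domains while cutting total token usage by 47%.

Comments Accepted to Findings of ACL 2026

Abstract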

A standard technique for scaling inference-time reasoning is Self-Consistency, whereby multiple candidate answers are sampled from an LLM and the most common answer is selected. More recently, it has been shown that weighted majority voting (e.g. Confidence-Informed Self Consistency (CISC)), which assigns a confidence value to each candidate answer and chooses the answer with the largest accumulated score, tends to be more accurate on a wide range of popular benchmarks. In practice, weighted majority voting necessitates calling a critic LLM on each candidate's reasoning trace to produce the answer's confidence score. This secondary series of LLM calls greatly increases the overhead and cost of weighted majority voting, despite its potential performance benefits. To reduce this expense, we propose VecCISC, a lightweight, adaptive framework that uses a measure of semantic similarity to filter reasoning traces that are semantically equivalent to others, degenerate, or hallucinated, thus decreasing the number of candidate answers that must be evaluated by the critic. To ensure adequate experimental thoroughness, we evaluate VecCISC on five challenging, widely-adopted datasets spanning the domains of mathematics, chemistry, biology, commonsense reasoning, and the humanities. Our results demonstrate that VecCISC reduces the total token usage by 47%, while maintaining or exceeding the accuracy of CISC.
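The difference between plain Self-Consistency and confidence-weighted majority voting can be seen in a small sketch. The function names and the numbers are illustrative assumptions: in practice the confidences would come from a critic LLM, and the trace filtering that VecCISC adds is not shown here.

```python
from collections import Counter, defaultdict


def self_consistency(answers):
    """Plain majority vote over sampled answers."""
    return Counter(answers).most_common(1)[0][0]


def weighted_vote(answers, confidences):
    """Weighted majority vote: sum the confidence assigned to each answer."""
    totals = defaultdict(float)
    for answer, conf in zip(answers, confidences):
        totals[answer] += conf
    return max(totals, key=totals.get)


# Five sampled answers with made-up critic confidences.
answers = ["12", "12", "15", "15", "15"]
confidences = [0.9, 0.8, 0.3, 0.2, 0.4]

majority = self_consistency(answers)
weighted = weighted_vote(answers, confidences)
```

Here the majority answer is "15" but the weighted vote picks "12", because two high-confidence traces outweigh three low-confidence ones. Each confidence costs one critic call, which is the overhead that filtering redundant traces before scoring is meant to reduce.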

2605.08064 2026-05-11 cs.CV

Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment

Jerry Jiang, Haowen Sun, Denis Gudovskiy, Yohei Nakata, Tomoyuki Okuno, Kurt Keutzer, Wenzhao Zheng

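AI summary: This paper proposes Proxy3D, a method that gives vision-language models more efficient and comprehensive 3D representations. Using semantic clustering and alignment, it produces compact 3D proxy representations from video frames alone, overcoming the limited spatial consistency and sequence efficiency of conventional 2D pipelines. Experiments show that, with shorter sequences encoding the visual information, Proxy3D achieves competitive performance on 3D visual question answering, visual grounding, and general spatial intelligence tasks.

Comments Accepted by CVPR 2026. Project page: https://wzzheng.net/Proxy3D

Abstract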

Spatial intelligence in vision-language models (VLMs) attracts research interest with the practical demand to reason in the 3D world. Despite promising results, most existing methods follow the conventional 2D pipeline in VLMs and use pixel-aligned representations for the vision modality. However, correspondence-based models with implicit 3D scene understanding often fail to achieve spatial consistency, and representation-based models with 3D geometric priors lack efficiency in vision sequence serialization. To address this, we propose a Proxy3D method with compact yet comprehensive 3D proxy representations for the vision modality. Given only video frames as input, we employ semantic and geometric encoders to extract scene features and then perform their semantic-aware clustering to obtain a set of proxies in the 3D space. For representation alignment, we further curate the SpaceSpan dataset and apply multi-stage training to adopt the proposed 3D proxy representations with the VLM. When using shorter sequences for vision information, our method achieves competitive or state-of-the-art performance in 3D visual question answering, visual grounding and general spatial intelligence benchmarks.

2605.08061 2026-05-11 cs.AI

Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

Manish Bhattarai, Ismael Boureima, Nishath Rajiv Ranasinghe, Scott Pakin, Dan O'Malley

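AI summary: This paper proposes Rubric-Grounded RL, a reinforcement learning method that decomposes the reward into multiple verifiable rubric criteria and uses a frozen large language model (LLM) as the judge, giving policy optimization a finer-grained feedback signal. The authors build rubrics from a document corpus from the Office of Scientific and Technical Information (OSTI) and train Llama-3.1-8B-Instruct with Group Relative Policy Optimization (GRPO). The resulting model outperforms the base model on several reasoning benchmarks, indicating improved generalization and reasoning beyond the training data.

Abstract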

We argue that decomposing reward into weighted, verifiable criteria and using an LLM judge to score them provides a partial-credit optimization signal: instead of a binary outcome or a single holistic score, each response is graded along multiple task-specific criteria. We formalize rubric-grounded reinforcement learning (RL): a framework in which the policy is optimized against a structured, multi-criterion reward produced by a frozen LLM judge that conditions on auxiliary grounding the policy never sees. We instantiate the framework by deriving rubrics from a corpus of roughly 100,000 scientific and technical documents derived from the Office of Scientific and Technical Information (OSTI) and training Llama-3.1-8B-Instruct with Group Relative Policy Optimization (GRPO). With GRPO-based training, the model achieves 71.7% normalized reward on held-out rubric evaluation. The GRPO-tuned policy also improves over the base model on four reasoning benchmarks not derived from the training corpus: GSM8K, MATH, GPQA Main, and GPQA Diamond. These results provide evidence that structured, document-grounded rewards can improve held-out rubric performance and induce transferable reasoning behaviors beyond the corpus used to construct the training environment.
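A weighted, multi-criterion reward of the kind described can be sketched as a normalized weighted average of per-criterion judge scores. The criterion names, weights, and scores below are hypothetical, and the normalization by total weight is an assumption about what "normalized reward" could mean; the paper's exact formula is not given in the abstract.

```python
def rubric_reward(weights, scores):
    """Weighted partial-credit reward in [0, 1].

    weights: criterion name -> non-negative weight
    scores:  criterion name -> judge score in [0, 1]
    """
    total = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total


# Hypothetical rubric and judge scores for one response.
rubric = {"states_method": 2.0, "reports_units": 1.0, "cites_evidence": 1.0}
judge_scores = {"states_method": 1.0, "reports_units": 0.0, "cites_evidence": 0.5}

reward = rubric_reward(rubric, judge_scores)
```

A response that satisfies only some criteria still earns a graded reward, here 0.625, instead of the 0 it would receive under a binary pass/fail signal.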

2605.08060 2026-05-11 cs.CL cs.AI cs.GT cs.MA

The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents

Jiayuan Liu, Tianqin Li, Shiyi Du, Xin Luo, Haoxuan Zeng, Emanuel Tewolde, Tai Sing Lee, Tonghan Wang, Carl Kingsford, Vincent Conitzer

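AI summary: This paper studies a phenomenon in which the willingness of large language models (LLMs) to cooperate in multi-agent social dilemmas declines as the context window expands, which the authors call the "memory curse". Through extensive experiments and analysis, the study finds that the content of the expanded memory, rather than its length, is the main cause of the degradation, and shows that memory sanitization and changes to the reasoning mode can mitigate it. The work reveals the direct influence of memory content on multi-agent behavior and offers new ideas for designing more stable cooperative AI systems.

Abstract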

Context window expansion is often treated as a straightforward capability upgrade for LLMs, but we find it systematically fails in multi-agent social dilemmas. Across 7 LLMs and 4 games over 500 rounds, expanding accessible history degrades cooperation in 18 of 28 model-game settings, a pattern we term the memory curse. We isolate the underlying mechanism through three analyses. First, lexical analysis of 378,000 reasoning traces associates this breakdown with eroding forward-looking intent rather than rising paranoia. We validate this using targeted fine-tuning as a cognitive probe: a LoRA adapter trained exclusively on forward-looking traces mitigates the decay and transfers zero-shot to distinct games. Second, memory sanitization holds prompt length fixed while replacing visible history with synthetic cooperative records, which restores cooperation substantially, proving the trigger is memory content, not length alone. Finally, ablating explicit Chain-of-Thought reasoning often reduces the collapse, showing that deliberation paradoxically amplifies the memory curse. Together, these results recast memory as an active determinant of multi-agent behavior: longer recall can either destabilize or support cooperation depending on the reasoning patterns it elicits.

2605.08059 2026-05-11 cs.CV cs.RO

6D Pose Estimation via Keypoint Heatmap Regression with RGB-D Residual Neural Networks

Ismail Aljosevic, Amir Masoud Almasi, Ana Parovic, Ashkan Shafiei

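AI summary: This paper proposes a modular 6D pose estimation framework based on keypoint heatmap regression. It uses YOLOv10m for object detection and a ResNet18 network to predict 2D heatmaps from RGB images, then computes the 6D object pose with the PnP RANSAC algorithm. The study also introduces a cross-modal fusion architecture that incorporates depth information, and further improves the model by tuning activation functions and learning rate schedules. On the LINEMOD dataset, the RGB-only and RGB-D fusion models reach mean ADD accuracies of 84.50% and 92.41%, respectively.

Comments Source code available at: https://github.com/ameermasood/HeatNet

Abstract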

In this paper, we propose a modular framework for 6D pose estimation based on keypoint heatmap regression. Our approach combines YOLOv10m for object detection with a ResNet18-based network that predicts 2D heatmaps from RGB images. Keypoints extracted from these heatmaps are used to estimate the 6D object pose via the PnP RANSAC algorithm. We compare different keypoint selection strategies to assess their impact on pose accuracy. Additionally, we extend the baseline by incorporating depth data using a cross-fusion architecture, which enables interaction between RGB and depth features at multiple stages. We further explore general training improvements, such as experimenting with activation functions and learning rate scheduling strategies to improve model performance. Our best RGB-only model achieved a mean ADD-based accuracy of 84.50%, while the RGB-D fusion model reached 92.41% on the LINEMOD dataset. The code is available at https://github.com/ameermasood/HeatNet.

2605.08057 2026-05-11 cs.CL cs.AI

CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation

James Petullo, Nianwen Xue

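AI summary: This paper proposes CA-SQL, a Text-to-SQL method that aims to improve model reasoning on complex tasks. It dynamically adjusts the breadth of exploration and combines prompt seeding based on evolutionary search principles with a voting mechanism, improving the quality and accuracy of the generated candidate queries. Experiments show that CA-SQL outperforms existing methods on the "challenging" tier of the BIRD benchmark, demonstrating its advantage under limited resources.

Abstract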

While recent advancements in inference-time learning have improved LLM reasoning on Text-to-SQL tasks, current solutions still struggle to perform well on the most challenging tasks in the Bird-Bench (BIRD) benchmark. This is due to inadequate solution space exploration, which is necessary to uncover promising candidate queries that can be further refined to produce the correct output. To address this challenge, we introduce CA-SQL, a novel Text-to-SQL pipeline that utilizes the estimated difficulty of a task to dynamically scale the breadth of the exploration for generating solution candidates. In addition, we use a custom prompt seeding method, based on principles of evolutionary search, to further elicit exploratory behavior from the base LLM and a novel voting method to select the best candidate solution at the end of the search. Experiments demonstrate that our solution achieves a state-of-the-art score of 51.72% on the "challenging" tier of BIRD development set problems, using only GPT-4o-mini, out-performing other in-context learning approaches, even those that leverage larger models. Overall, our method attains a competitive 61.06% execution accuracy and 68.77% Soft F1 score on the BIRD development dataset.

2605.08054 2026-05-11 cs.CV

Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization

Hanchao Liu, Fang-Lue Zhang, Shining Zhang, Tai-Jiang Mu, Shi-Min Hu

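AI summary: This study aims to generate human motion that satisfies highly constrained conditions, such as complex spatial obstacles or a specified number of steps, in support of applications such as controllable character animation and behavior synthesis for virtual agents. To address the shortcomings of existing methods under extreme spatiotemporal restrictions, the authors propose a retrieval-guided diffusion noise optimization method: it retrieves reference motion from large motion datasets and combines relational task parsing with a reward-guided noise mask to optimize the diffusion model's initial noise, generating motion sequences that satisfy strict constraints more effectively. The method improves the intelligence and adaptability of motion generation without any training.

Comments Accepted to CVPR2026

Abstract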

Generating human motion that satisfies customized zero-shot goal functions, enabling applications such as controllable character animation and behavior synthesis for virtual agents, is a critical capability. While current approaches handle many unseen constraints, they fail on tasks with very challenging spatiotemporal restrictions, such as severe spatial obstacles or specified numbers of walking steps. To equip motion generators for these highly constrained tasks, we present a retrieval-guided method built on the training-free diffusion noise optimization framework. The key idea is to search within large motion datasets for guidance that can potentially satisfy difficult constraints. We introduce relational task parsing to group target constraints and identify the difficult ones to be handled by retrieved references. A better initialization for diffusion noise is then obtained via a reward-guided mask that combines random noise with retrieved noise. By optimizing diffusion noise from this improved initialization, we successfully solve highly constrained generation tasks. By leveraging an LLM for relational task parsing, the whole framework can further reason automatically about what to retrieve, improving the intelligence of moving agents under a training-free optimization scheme.

2605.08053 2026-05-11 cs.LG

Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs

Gugan Thoppe, L. A. Prashanth, Ankur Naskar, Sanjay Bhat

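AI summary: This paper studies reinforcement learning with exponential utility in discounted Markov decision processes (MDPs), proposing two Q-value-based algorithms and proving their convergence. Extending the Bellman equation for exponential utility, the authors derive two Q-value formulations, show that the associated operators are contractions in specific metrics, characterize their fixed points, and prove that the corresponding greedy policy is optimal among stationary policies. They also propose two model-free algorithms, one two-timescale and one single-timescale, along with convergence analysis and theoretical guarantees on finite-time convergence rates, providing a theoretical foundation for value-based reinforcement learning under exponential-utility objectives.

Abstract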

Reinforcement learning (RL) for exponential-utility optimization in discounted Markov decision processes (MDPs) lacks principled value-based algorithms. We address this gap in the fixed risk-aversion setting. Building on the Bellman-type equation for exponential utility studied by Porteus (1975), we derive two Q-value-style extensions and show that the associated operators are contractions in the $L_\infty$ and sup-log/Thompson metrics, respectively. We characterize their fixed points and prove that the induced greedy stationary policy is optimal for the exponential-utility objective among stationary policies. These structural results lead to two model-free algorithms: a two-timescale Q-learning-style algorithm, for which we establish almost-sure convergence and provide finite-time convergence rates via timescale separation, and a one-timescale algorithm governed by a sublinear power-law operator. Since the latter does not admit a global contraction in standard metrics, we prove its convergence using delicate arguments based on local Lipschitzness, monotonicity, homogeneity, and Dini derivatives, and provide a scalar finite-time analysis that highlights the challenges in obtaining convergence rates in the vector case. Our work provides a foundation for value-based RL under exponential-utility objectives.

2605.08050 2026-05-11 cs.CV

MoCoTalk: Multi-Conditional Diffusion with Adaptive Router for Controllable Talking Head Generation

Xinyan Ye, Jiankang Deng, Abbas Edalat

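AI summary: MoCoTalk is a multi-conditional diffusion framework for controllable talking-head generation that unifies control over four complementary factors: identity, facial expression, head pose, and mouth dynamics. It introduces an adaptive multi-condition router that dynamically fuses the condition signals across channels and timesteps, designs a Mouth-Augmented Shading Mesh to better model mouth dynamics, and adds a lip consistency loss to improve audio-visual alignment. Experiments show that MoCoTalk achieves state-of-the-art results on multiple structural, motion, and perceptual metrics while providing attribute-level controllability that single-condition methods lack.

Abstract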

Talking-head generation requires joint modeling of identity, head pose, facial expression, and mouth dynamics. Existing methods typically address only a subset of these factors, and rely on fixed-weight or heuristic fusion when multiple conditions are involved. We present MoCoTalk, a multi-conditional video diffusion framework that unifies four complementary control signals: a reference image, facial keypoints, 3DMM-rendered shading meshes, and the corresponding speech audio. To resolve destructive interference among heterogeneous conditions, we introduce an Adaptive Multi-Condition Router that computes channel-wise, timestep-aware gating over the four condition streams, allowing the fusion strategy to vary with both feature subspace and noise level. To better capture speech-related facial dynamics, we design a Mouth-Augmented Shading Mesh, a 3DMM-based representation that decouples head motion, mouth motion, expression, and lighting. This design provides a temporally consistent geometric prior and allows flexible recombination of these attributes at inference. We further introduce a lip consistency loss to tighten audio-visual alignment. Extensive experiments show that MoCoTalk achieves state-of-the-art performance on the majority of structural, motion, and perceptual metrics, while offering attribute-level controllability that single-condition methods do not provide.

2605.08048 2026-05-11 cs.CL

Accurate and Efficient Statistical Testing for Word Semantic Breadth

Yo Ehara

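AI summary: This study addresses statistical testing for word semantic breadth and proposes an accurate and efficient test. When comparing the semantic breadth of two words, conventional dispersion-based statistics can produce false positives because of differences in semantic direction, inflating the Type-I error rate. The author therefore introduces a permutation test with Householder reflection alignment, which separates directional differences from dispersion differences and assesses breadth differences more accurately. Experiments show that the method lowers the false-positive rate while improving computational efficiency.

Comments Accepted to ACL 2026 Main Conference

Abstract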

Measuring the breadth of a word's meaning, or its spread across contexts, has become feasible with contextualized token embeddings. A word type can be represented as a cloud of token vectors, with dispersion-based statistics serving as proxies for contextual diversity (Nagata and Tanaka-Ishii, ACL2025). These measurements are useful for deciding appropriate sense distinctions when constructing thesauri and domain-specific dictionaries. However, when comparing the breadth of two word types, naive hypothesis testing on dispersion can be misleading: differences in semantic direction can masquerade as dispersion differences, inflating Type-I error and yielding "statistically significant" outcomes even when there is no true breadth difference. This is problematic because significance testing should distinguish genuine effects from incidental fluctuations in small-difference regimes. We propose a Householder-aligned permutation test to isolate dispersion differences from directional differences. Our method applies a single Householder reflection to align the mean directions of the two word types and then performs a permutation test on the aligned token clouds, yielding calibrated, non-parametric p-values. For practicality, we introduce a GPU-oriented implementation that batches permutations and linear algebra operations. Empirically, our alignment reduced Type-I error by 32.5% while preserving sensitivity to genuine breadth differences, and achieved a 23x speedup over the CPU baseline.
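The two steps described, a single Householder reflection that aligns mean directions followed by a permutation test on the aligned clouds, can be sketched in plain Python. This is an illustrative reading of the abstract and not the author's implementation: the dispersion statistic (mean squared distance to the centroid), the two-sided test, the toy 2-D data, and all function names are assumptions.

```python
import math
import random


def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def mean(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def householder_align(cloud, src_dir, dst_dir):
    """Reflect `cloud` so that unit vector src_dir is mapped onto dst_dir."""
    w = [a - b for a, b in zip(src_dir, dst_dir)]
    ww = sum(x * x for x in w)
    if ww < 1e-12:  # directions already coincide
        return [list(v) for v in cloud]
    out = []
    for v in cloud:
        # H v = v - 2 w (w . v) / (w . w)
        coef = 2.0 * sum(x * y for x, y in zip(w, v)) / ww
        out.append([x - coef * y for x, y in zip(v, w)])
    return out


def dispersion(cloud):
    """Mean squared distance to the centroid (assumed breadth statistic)."""
    m = mean(cloud)
    return sum(sum((x - y) ** 2 for x, y in zip(v, m)) for v in cloud) / len(cloud)


def permutation_pvalue(a, b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the difference in dispersion."""
    rng = random.Random(seed)
    observed = abs(dispersion(a) - dispersion(b))
    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(dispersion(pa) - dispersion(pb)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy token clouds: same spread, different mean direction.
rng = random.Random(1)
word_a = [[1.0 + rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(30)]
word_b = [[rng.gauss(0, 0.1), 1.0 + rng.gauss(0, 0.1)] for _ in range(30)]

aligned_b = householder_align(word_b, unit(mean(word_b)), unit(mean(word_a)))
p_value = permutation_pvalue(word_a, aligned_b, n_perm=500)
```

A reflection is an isometry, so it leaves each cloud's dispersion unchanged while removing the difference in mean direction. Without alignment, shuffled groups mix tokens from two distant centroids, so the permutation distribution no longer reflects a dispersion-only null.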

2605.08045 2026-05-11 cs.CL

Uncertainty-Aware Structured Data Extraction from Full CMR Reports via Distilled LLMs

Yi Yu, Parker Martin, Zhenyu Bu, Yixuan Liu, Yi-Yu Zheng, Orlando Simonetti, Yuchi Han, Yuan Xue

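AI summary: Converting free-text cardiac magnetic resonance (CMR) reports into auditable structured data is a key bottleneck for cohort assembly, longitudinal data curation, and clinical decision support. This paper presents CMR-EXTR, a lightweight framework that converts free-text CMR reports into structured data and assigns a confidence score to each field for quality control. A teacher-student distillation pipeline enables fully offline inference and reduces reliance on manual annotation, while uncertainty estimation combines three complementary principles (distribution plausibility, sampling stability, and cross-field consistency) to flag content that needs human review. Experiments show that CMR-EXTR reaches 99.65% variable-level accuracy, demonstrating reliable extraction and informative confidence estimates.

Comments Accepted to ISBI 2026

Abstract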

Converting free-text cardiac magnetic resonance (CMR) reports into auditable structured data remains a bottleneck for cohort assembly, longitudinal curation, and clinical decision support. We present CMR-EXTR, a lightweight framework that converts free-text CMR reports into structured data and assigns per-field confidence for quality control. A teacher-student distillation pipeline enables fully offline inference while limiting manual annotation. Uncertainty integrates three complementary principles -- distribution plausibility, sampling stability, and cross-field consistency -- to triage human review. Experiments show that CMR-EXTR achieves 99.65% variable-level accuracy, demonstrating both reliable extraction and informative confidence scores. To our knowledge, this is the first CMR-specific extraction system with integrated confidence estimation. The code is available at https://github.com/yuyi1005/CMR-EXTR.

2605.08044 2026-05-11 cs.CL cs.AI cs.LG

Fast Byte Latent Transformer

Julie Kallini, Artidoro Pagnoni, Tomasz Limisiewicz, Gargi Ghosh, Luke Zettlemoyer, Christopher Potts, Xiaochuang Han, Srinivasan Iyer

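AI summary: This paper addresses the slow generation speed of byte-level language models, focusing on the Byte Latent Transformer (BLT). It introduces BLT Diffusion (BLT-D), trained with a diffusion objective alongside the standard byte prediction loss, which decodes multiple bytes in parallel per step and greatly reduces the number of forward passes needed to generate a sequence. It also proposes two extensions, BLT Self-speculation and BLT Diffusion+Verification, which improve generation quality while remaining efficient. The methods substantially lower the memory-bandwidth cost of generation, removing key barriers to the practical use of byte-level language models.

Abstract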

Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generation techniques. First, we introduce BLT Diffusion (BLT-D), a new model and our fastest BLT variant, trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction loss. This enables an inference procedure that generates multiple bytes in parallel per decoding step, substantially reducing the number of forward passes required to generate a sequence. Second, we propose two extensions inspired by speculative decoding that trade some of this speed for higher generation quality: BLT Self-speculation (BLT-S), in which BLT's local decoder continues generating past its normal patch boundaries to draft bytes, which are then verified with a single full-model forward pass; and BLT Diffusion+Verification (BLT-DV), which augments BLT-D with an autoregressive verification step after diffusion-based generation. All methods may achieve an estimated memory-bandwidth cost over 50% lower than BLT on generation tasks. Each approach offers its own unique advantages, together removing key barriers to the practical use of byte-level LMs.

2605.08043 2026-05-11 cs.CV cs.AI

SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

Tianfei Ren, Zhipeng Yan, Yiming Zhao, Zhen Fang, Yu Zeng, Guohui Zhang, Hang Xu, Xiaoxiao Ma, Shiting Huang, Ke Xu, Wenxuan Huang, Lionel Z. Wang, Lin Chen, Zehui Chen, Jie Huang, Feng Zhao

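AI summary: This paper proposes SCOPE, a framework that addresses the problem of continuously tracking semantic commitments in complex image generation. Through structured decomposition and conditional skill orchestration, it dynamically maintains the semantic requirements during generation and invokes retrieval, reasoning, and repair skills when needed so that the result matches the user's intent. The study also introduces the Gen-Arena benchmark and the EGIP metric; experiments show that SCOPE outperforms existing methods across multiple tasks, supporting its effectiveness for complex image generation.

Abstract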

While text-to-image models have made strong progress in visual fidelity, faithfully realizing complex visual intents remains challenging because many requirements must be tracked across grounding, generation, and verification. We refer to these requirements as semantic commitments and formalize their lifecycle discontinuity as the Conceptual Rift, where commitments may be locally resolved or checked but fail to remain identifiable as the same operational units throughout the generation lifecycle. To address this, we propose SCOPE, a specification-guided skill orchestration framework that maintains semantic commitments in an evolving structured specification and conditionally invokes retrieval, reasoning, and repair skills around unresolved or violated commitments. To evaluate commitment-level intent realization, we introduce Gen-Arena, a human-annotated benchmark with entity- and constraint-level specifications, together with Entity-Gated Intent Pass Rate (EGIP), a strict entity-first pass criterion. SCOPE substantially outperforms all evaluated baselines on Gen-Arena, achieving 0.60 EGIP, and further achieves strong results on WISE-V (0.907) and MindBench (0.61), demonstrating the effectiveness of persistent commitment tracking for complex image generation.

2605.08037 2026-05-11 cs.LG cs.AI

Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph

Ning Liu, Chuanneng Sun, Kristina Klinkner, Shervin Malmasi

AI总结 直接偏好优化(DPO)通过成对偏好比较对语言模型进行对齐,是人类反馈强化学习的一种简单有效替代方法。然而,在许多实际场景中,每个提示可能包含多个生成结果,形成复杂的偏好结构,而传统的成对DPO无法充分利用这种结构。为此,研究提出了一种基于有向无环偏好图的图DPO方法(GraphDPO),通过图结构聚合偏好信息,保留传递性并兼容标准DPO,同时引入等价类构造和真实解锚定策略,提升了优化稳定性与效率。实验表明,GraphDPO在推理和程序合成任务中表现出更优性能,展示了图结构偏好建模在对齐任务中的可扩展性和鲁棒性。

English abstract

Direct Preference Optimization (DPO) aligns language models using pairwise preference comparisons, offering a simple and effective alternative to Reinforcement Learning (RL) from human feedback. However, in many practical settings, training data consists of multiple rollouts per prompt, inducing rich preference structure that pairwise DPO fails to exploit. Collapsing such data into independent pairs discards transitivity, introduces redundant or conflicting supervision, and can lead to unstable optimization. We propose Graph Direct Preference Optimization (GraphDPO), a principled generalization of DPO that operates over directed acyclic preference graphs induced by rollout rankings. GraphDPO encodes dominance relations as edges and optimizes a graph-structured Plackett-Luce-inspired objective that aggregates supervision over graph neighborhoods, enforcing transitivity while recovering standard DPO as a special case. To handle discrete or sparse signals, we introduce an equivalence-class construction where responses with identical preferences form graph layers, and intra-layer edges contribute zero loss, preventing spurious gradients. Despite leveraging full graph structure, GraphDPO maintains linear per-prompt complexity via efficient log-sum-exp aggregation. We further incorporate optional ground-truth anchoring by inserting verified solutions as dominant nodes and applying an annealed schedule that stabilizes early training while gradually relaxing oracle supervision. Experiments on reasoning and program synthesis tasks demonstrate superior performance, suggesting that graph-structured preference modeling is a scalable and robust alternative to pairwise and listwise alignment objectives.
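
The statement that a Plackett-Luce-style objective recovers standard DPO as a special case can be checked numerically in the simplest setting. The sketch below is not the GraphDPO objective, which the abstract does not spell out. It assumes the usual DPO implicit reward (beta times the policy-to-reference log-probability ratio) and a plain listwise Plackett-Luce negative log-likelihood over a full ranking; all function names and the example log-probabilities are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def implicit_reward(beta, logp_policy, logp_ref):
    """DPO implicit reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def dpo_pair_loss(reward_chosen, reward_rejected):
    """Pairwise DPO loss: -log sigmoid(r_chosen - r_rejected)."""
    margin = reward_chosen - reward_rejected
    # -log sigmoid(m) = log(1 + exp(-m)), written via logsumexp for stability.
    return logsumexp([0.0, -margin])


def plackett_luce_nll(rewards_best_to_worst):
    """Plackett-Luce NLL of a full ranking, given rewards sorted best first."""
    total = 0.0
    for k in range(len(rewards_best_to_worst)):
        tail = rewards_best_to_worst[k:]
        # Each position contributes -log softmax of the item picked next.
        total += logsumexp(tail) - rewards_best_to_worst[k]
    return total


# Hypothetical sequence log-probabilities for one preferred and one rejected rollout.
beta = 0.1
r_win = implicit_reward(beta, logp_policy=-12.0, logp_ref=-15.0)
r_lose = implicit_reward(beta, logp_policy=-20.0, logp_ref=-18.0)
pair_loss = dpo_pair_loss(r_win, r_lose)
list_loss = plackett_luce_nll([r_win, r_lose])
```

With exactly two ranked responses the last Plackett-Luce term is zero and the first equals the pairwise logistic loss, so `pair_loss` and `list_loss` agree up to floating-point error. Note that the loop above re-evaluates a log-sum-exp per position and is therefore quadratic in the number of responses; it does not reproduce the linear per-prompt cost the abstract reports.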

2605.08036 2026-05-11 cs.LG

Don't Get Your Kroneckers in a Twist: Gaussian Processes on High-Dimensional Incomplete Grids

Mads Greisen Højlund, August Smart Lykke-Møller, Henry Moss, Ove Christiansen

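AI summary This paper proposes CUTS-GPR, a new method for numerically exact Gaussian process regression in high-dimensional settings. By combining an additive kernel with an incomplete grid, it achieves an extremely fast kernel matrix-vector product whose cost scales near-linearly or linearly with the amount of training data and as a low-order polynomial in the dimensionality. The method scales well in benchmarks with millions to billions of data points and thousands of dimensions, completes full Gaussian process calculations including hyperparameter optimization within hours, and provides an effective tool for Bayesian modeling of high-dimensional potential energy surfaces.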

Comments 51 pages, 8 figures

English abstract

We introduce CUTS-GPR, a new method for performing numerically exact Gaussian process regression (GPR) in high-dimensional settings. The key component of CUTS-GPR is an extremely fast kernel matrix-vector product, which exhibits near-linear or even linear scaling with the amount of training data, $N$, and low-order polynomial scaling with dimensionality, $D$. This is obtained by combining an additive kernel with an incomplete grid and exploiting the resulting structure of the kernel matrix. We demonstrate the scalability of the matrix-vector product by running benchmarks with billions of data points and thousands of dimensions. Full GPR calculations, including hyperparameter optimization, are completed in a matter of hours for $N = 447{,}265$ and $D = 24$. We demonstrate that CUTS-GPR enables Bayesian modeling of high-dimensional potential energy surfaces, a longstanding challenge in computational chemistry.
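
To illustrate why an additive kernel on a grid admits a fast matrix-vector product, the sketch below treats the much simpler case of a complete two-dimensional grid. It is not the CUTS-GPR algorithm, which works on incomplete grids in many dimensions; the squared-exponential kernel, the grid points, and all function names are assumptions made for the example. For a first-order additive kernel the entry for grid points (i, a) and (j, b) is K1[i][j] + K2[a][b], so the product splits into two small one-dimensional products applied to row sums and column sums.

```python
import math


def rbf_gram(points, lengthscale=1.0):
    """Dense 1-D squared-exponential Gram matrix as a list of lists."""
    return [[math.exp(-0.5 * ((a - b) / lengthscale) ** 2) for b in points]
            for a in points]


def additive_grid_matvec(K1, K2, V):
    """Compute K @ vec(V) for K[(i,a),(j,b)] = K1[i][j] + K2[a][b].

    V is an n1 x n2 table of values on the grid; the result has the same shape.
    Cost is O(n1^2 + n2^2 + n1*n2) instead of O(n1^2 * n2^2).
    """
    n1, n2 = len(K1), len(K2)
    row_sums = [sum(V[j]) for j in range(n1)]
    col_sums = [sum(V[j][b] for j in range(n1)) for b in range(n2)]
    term1 = [sum(K1[i][j] * row_sums[j] for j in range(n1)) for i in range(n1)]
    term2 = [sum(K2[a][b] * col_sums[b] for b in range(n2)) for a in range(n2)]
    return [[term1[i] + term2[a] for a in range(n2)] for i in range(n1)]


def dense_matvec(K1, K2, V):
    """Reference result that forms every kernel entry explicitly."""
    n1, n2 = len(K1), len(K2)
    return [[sum((K1[i][j] + K2[a][b]) * V[j][b]
                 for j in range(n1) for b in range(n2))
             for a in range(n2)] for i in range(n1)]


# Hypothetical 3 x 4 grid and an arbitrary vector laid out on it.
x1 = [0.0, 0.5, 1.0]
x2 = [0.0, 1.0, 2.0, 3.0]
K1, K2 = rbf_gram(x1), rbf_gram(x2, lengthscale=2.0)
V = [[float(i * len(x2) + a + 1) for a in range(len(x2))] for i in range(len(x1))]
fast = additive_grid_matvec(K1, K2, V)
slow = dense_matvec(K1, K2, V)
max_diff = max(abs(f - s) for fr, sr in zip(fast, slow) for f, s in zip(fr, sr))
```

The structured and dense results agree to rounding error, while the structured version never forms the full N x N kernel matrix. Handling missing grid points and higher-order additive terms, as the abstract describes, requires machinery that this sketch does not attempt.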

2605.08031 2026-05-11 cs.CV

Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models

Kaidi Jia, Yujie Lin, Chengyi Yang, Jiayao Ma, Jinsong Su

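AI summary Vision-language models (VLMs) raise concerns about privacy, copyright, and bias, prompting research into removing sensitive knowledge from them. Existing methods mainly fine-tune the language decoder, which has limited effect: it fails to fully erase visual representations and can introduce object hallucination. This paper proposes HFRU, a reinforcement-learning-based framework for deep semantic unlearning that acts directly on the vision encoder. Its two-stage strategy combines alignment disruption with GRPO-based optimization, improving forgetting while markedly reducing object hallucination. Experiments report over 98% forgetting and retention performance on object recognition and face identity tasks, outperforming existing methods.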

English abstract

Vision-language models (VLMs) raise growing concerns about privacy, copyright, and bias, motivating machine unlearning to remove sensitive knowledge. However, existing methods primarily fine-tune the language decoder, leading to superficial forgetting that fails to erase underlying visual representations and often introduces object hallucination. We propose HFRU, a reinforcement unlearning framework that operates on the vision encoder for deep semantic removal. Our two-stage approach combines alignment disruption with GRPO-based optimization using a composite reward, including an abstraction reward that encourages semantically valid substitutions and mitigates hallucinations. Experiments on object recognition and face identity tasks show that HFRU achieves over 98% forgetting and retention performance, while introducing negligible object hallucination, significantly outperforming prior methods. Our code and implementation details are available at https://github.com/XMUDeepLIT/HFRU.

2605.08030 2026-05-11 cs.CV cs.LG

PET-Adapter: Test-Time Domain Adaptation for Full and Limited-Angle PET Image Reconstruction

Rüveyda Yilmaz, Yuli Wu, Johannes Stegmaier, Volkmar Schulz

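AI summary This work proposes PET-Adapter, a test-time domain adaptation framework that improves how well deep learning models for positron emission tomography (PET) image reconstruction generalize to unseen clinical data. The generative model only needs to be pretrained on phantom data and can then adapt to clinical data with different anatomies, tracers, and scanners without paired ground truth. By introducing layer-wise low-rank anatomical conditioning and an initialization strategy based on the Ordered Subset Expectation Maximization algorithm, PET-Adapter substantially improves reconstruction efficiency and quality; experiments show superior 3D reconstruction in both full-angle and limited-angle acquisitions.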

English abstract

Positron Emission Tomography (PET) image reconstruction is inherently challenged by Poisson noise and physical degradation factors, which are further exacerbated in limited-angle acquisitions. While deep learning methods demonstrate promising performance, their generalization to unseen clinical data distributions remains limited without extensive retraining. We propose PET-Adapter, a test-time domain adaptation framework for generative PET reconstruction models pretrained solely on phantom data. Our method enables adaptation to clinical datasets with varying anatomies, tracers, and scanner configurations without requiring paired ground truth. PET-Adapter introduces layer-wise low-rank anatomical conditioning during adaptation and Ordered Subset Expectation Maximization-based warm-starting that initializes the generation from physics-informed reconstructions, reducing diffusion steps from 50 to 2 without compromising quality. Experiments across multiple clinical datasets demonstrate superior 3D reconstruction performance in both full-angle and limited-angle settings, highlighting the clinical feasibility and computational efficiency of the proposed approach.

2605.08029 2026-05-11 cs.CV cs.LG

STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

Ying Shen, Tianrong Chen, Yuan Gao, Yizhe Zhang, Yuyang Wang, Miguel Ángel Bautista, Shuangfei Zhai, Joshua M. Susskind, Jiatao Gu

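AI summary This paper proposes STARFlow2, a unified multimodal generation framework that combines language models with normalizing flows to address the structural mismatch between text generation and image generation in existing approaches. Built on the Pretzel architecture, it vertically fuses a pretrained vision-language model stream with an image generation stream through residual connections, with both operating under the same causal mask, enabling seamlessly interleaved text and image generation. STARFlow2 uses a deep-shallow flow design and a unified FAE latent space to support efficient, cache-friendly generation; experiments show strong results on image generation and multimodal understanding tasks, validating autoregressive flows for unified multimodal modeling.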

Comments 19 pages, 9 figures

English abstract

Deep generative models have advanced rapidly across text and vision, motivating unified multimodal systems that can understand, reason over, and generate interleaved text-image sequences. Most existing approaches combine autoregressive language modeling with diffusion-based image generators, inheriting a structural mismatch between causal text generation and iterative visual denoising. We observe that autoregressive normalizing flows are autoregressive Transformers that share the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs, making them the most natural paradigm for true unified multimodal generation. We present STARFlow2, built on the Pretzel architecture that vertically interleaves a pretrained VLM stream with a TarFlow stream via residual skip connections, both operating under the same causal mask. Combined with a deep-shallow flow design and a unified FAE latent space, STARFlow2 enables cache-friendly interleaved generation where both text and visual outputs directly enter the KV-cache without re-encoding. Experiments demonstrate strong performance across image generation and multimodal understanding benchmarks, validating autoregressive flows as a viable foundation for unified multimodal modeling.

2605.08028 2026-05-11 cs.LG cs.SY eess.SY

Adaptive Domain Decomposition Physics-Informed Neural Networks for Traffic State Estimation with Sparse Sensor Data

Eunhan Ka, Ludovic Leclercq, Satish V. Ukkusuri

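AI summary This paper proposes an adaptive domain decomposition physics-informed neural network (ADD-PINN) for traffic state estimation from sparse fixed sensors. Using a two-stage residual-guided framework, the method mitigates the over-smoothing of LWR-model shockwaves by conventional physics-informed neural networks in offline speed-field reconstruction. Experiments show that ADD-PINN achieves better estimation accuracy than existing methods across a range of sensor configurations while training faster, supporting its effectiveness and efficiency in sparse-sensing scenarios.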

Comments 56 pages, 5 figures, 12 tables. Submitted to Transportation Research Part C

English abstract

Traffic state estimation from sparse fixed sensors is challenging because physics-informed neural networks (PINNs) tend to over-smooth the shockwaves admitted by the Lighthill-Whitham-Richards (LWR) model. This study proposes Adaptive Domain Decomposition Physics-Informed Neural Networks (ADD-PINN), a two-stage residual-guided framework for LWR-based offline speed-field reconstruction. A coarse global PINN is first trained; its spatial residual profile is then used to place subdomain boundaries and initialize child subnetworks in a decomposition-enabled mode, while a data-driven shock indicator can retain a single-domain fallback when localized evidence of transition is weak. The primary offline I-24 MOTION evaluation spans five days, five sensor configurations, and ten seeds per configuration, yielding 1,500 runs in total. Against neural and physics-informed baselines, ADD-PINN attains the lowest relative L2 error in 18 of 25 configurations and in 14 of 15 sparse-sensing cases, while training 2.4 times faster than the extended PINN (XPINN) baseline. An ablation study supports spatial-only decomposition as an effective default for fixed-sensor traffic reconstruction in the evaluated settings. Supplementary Next Generation Simulation (NGSIM) experiments serve as a negative control: the shock indicator suppresses decomposition in all 50 runs, and the default single-domain fallback ranks first across all sensor configurations. These results support residual-guided spatial decomposition as an effective PINN-family design for offline reconstruction when sparse fixed sensing coincides with localized transition regions.

2605.08024 2026-05-11 cs.AI

MPD$^2$-Router: Mask-aware Multi-expert Prior-regularized Dual-head Deferral Router in Glaucoma Screening and Diagnosis

Wenxin Zhan

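AI summary This work proposes MPD$^2$-Router, a new routing framework for human-AI collaborative decision-making in glaucoma screening and diagnosis. By introducing a mask-aware multi-expert mechanism and a dual-head routing strategy, it accounts for expert availability, asymmetric diagnostic risk, and case difficulty, improving the system's safety and diagnostic accuracy in complex scenarios. Experiments show lower clinical cost and higher diagnostic performance on multiple cross-national datasets while keeping expert utilization balanced.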

English abstract

Learning-to-defer (L2D) can make glaucoma screening safer by routing difficult or uncertain cases to humans, yet standard formulations overlook expert availability, heterogeneous reader behavior, workload imbalance, asymmetric diagnostic harm, and case difficulty from morphology and deployment shift. We introduce MPD$^2$-Router, a mask-aware multi-expert deferral framework that recasts ophthalmic triage as constrained human-AI routing: whether to defer and to which available expert. It couples a dual-head deferral/allocation policy with mask-aware Gumbel-sigmoid gating that strictly enforces per-sample availability, and fuses uncertainty, morphology, image-quality, and OOD signals. Training uses an asymmetric cost-sensitive objective with an augmented-Lagrangian deferral budget, a group-specific distribution prior, and a rank-majorization JS regularizer that jointly prevent expert collapse without forcing uniform allocation. Across three cross-national glaucoma cohorts (REFUGE, CHAKSU, ORIGA) with a frozen REFUGE-trained backbone, MPD$^2$-Router substantially lowers clinical cost and improves MCC over AI-only at a moderate deferral rate. It is Pareto-optimal in F1-MCC-cost, robust under cross-domain shift, and yields balanced expert utilization.
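
One way to picture gating that strictly enforces per-sample availability is a Gumbel-sigmoid relaxation whose output is forced to zero for masked experts. The sketch below is an assumption-laden illustration, not the router described in the abstract: the abstract does not say where the mask is applied, how the temperature is set, or how gates feed the allocation head. The function names, logits, and temperature are hypothetical.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) sample via inverse transform."""
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def masked_gumbel_sigmoid(logits, available, tau, rng):
    """Relaxed per-expert gates in [0, 1]; unavailable experts get exactly 0.

    The difference of two Gumbel samples is logistic noise, which gives the
    binary (sigmoid) analogue of the Gumbel-softmax relaxation.
    """
    gates = []
    for logit, ok in zip(logits, available):
        if not ok:
            gates.append(0.0)  # hard mask: this expert cannot be selected
            continue
        noise = sample_gumbel(rng) - sample_gumbel(rng)
        gates.append(sigmoid((logit + noise) / tau))
    return gates


# Hypothetical case: three experts, the second one is not available.
rng = random.Random(0)
gates = masked_gumbel_sigmoid(
    logits=[1.2, 3.0, -0.4], available=[True, False, True], tau=0.5, rng=rng
)
```

Applying the mask as a hard zero, rather than as a large negative logit, guarantees that an unavailable expert receives no gate mass regardless of the sampled noise. Lower values of `tau` push the remaining gates closer to 0 or 1.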

2605.08020 2026-05-11 cs.RO

Active Embodiment Identification with Reinforcement Learning for Legged Robots

Nico Bohlinger, Jan Peters

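AI summary This paper presents an active embodiment identification method for legged robots that uses reinforcement learning to jointly learn information-seeking behavior and explicit prediction of embodiment parameters. Using a history-augmented URMA architecture, the method infers joint-level and global embodiment parameters by interacting with simulated environments across different morphologies, improving the robot's ability to adapt to and identify its own morphology.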

English abstract

We present an active embodiment identification method for legged robots that jointly learns information-seeking behavior and explicit embodiment prediction. Using a history-augmented URMA architecture, the method infers joint-level and global embodiment parameters through interaction with the environment in simulation across different morphologies.

2605.08019 2026-05-11 cs.AI q-bio.NC

Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners

Botos Csaba, Sreejan Kumar, Austin Tudor David Andrews, Laurence Hunt, Chris Summerfield, Joshua B. Tenenbaum, Rui Ponte Costa, Marcelo G. Mattar, Momchil Tomov

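AI summary This study asks whether modern AI systems can, like humans, rapidly learn abstract knowledge in novel environments and act efficiently. Using human behavioral and brain activity data from complex game tasks, it compares frontier Large Reasoning Models (LRMs) with deep reinforcement learning agents. LRMs clearly outperform conventional reinforcement learning models in matching game-learning behavior and predicting brain activity, with an order-of-magnitude advantage in prediction across cortical and subcortical regions. The results indicate that LRMs more effectively model human learning and decision making in complex, naturalistic environments.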

English abstract

Humans rapidly learn abstract knowledge when encountering novel environments and flexibly deploy this knowledge to guide efficient and intelligent action. Can modern AI systems learn and plan in a similar way? We study this question using a dataset of complex human gameplay with concurrent fMRI recordings, in which participants learn novel video games that require rule discovery, hypothesis revision, and multi-step planning. We jointly evaluate models by their ability to play the games, match human learning behavior, and predict brain activity during the same task, comparing a suite of frontier Large Reasoning Models (LRMs) against model-free and model-based deep reinforcement learning agents and a Bayesian theory-based agent. We find that frontier LRMs most closely match human behavioral patterns during game discovery and predict brain activity an order of magnitude better than both reinforcement learning alternatives across cortical and subcortical regions, with effects robust to permutation controls. Through targeted manipulations, we further show that brain alignment reflects the model's in-context representation of the game state rather than its downstream planning or reasoning. Our results establish LRMs as compelling computational accounts of human learning and decision making in complex, naturalistic environments. Project page with interactive replays: https://botcs.github.io/reason-to-play/

2605.08013 2026-05-11 cs.AI

Learning CLI Agents with Structured Action Credit under Selective Observation

Haoyang Su, Ying Wen

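AI summary This paper studies how to train command line interface (CLI) agents with reinforcement learning so that they interact effectively with computer systems. To address partial observation and sparse rewards in CLI tasks, the authors propose an approach based on structured action credit assignment, introducing the $σ$-Reveal mechanism for selective observation and Action Advantage Assignment (A³) for action credit assignment. They also construct the ShellOps dataset to validate the approach in code repository environments.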

English abstract

Command line interface (CLI) agents are emerging as a practical paradigm for agent-computer interaction over evolving filesystems, executable command line programs, and online execution feedback. Recent work has used reinforcement learning (RL) to learn these interaction abilities from verifiable task feedback, yet few methods exploit the native structured attributes of CLI actions as learning signals. Beyond this underused action structure, CLI learning also couples two bottlenecks for coding agents. First, the agent must identify task-relevant evidence in a large codebase from partial observations. Second, sparse terminal rewards must be assigned to the actions that shape a long multi-turn trajectory. We study these bottlenecks through shell-driven information extraction and file editing tasks. For selective observation, we introduce $σ$-Reveal, an inference-time mechanism that selects token-budgeted context for the same CLI. For credit assignment, we propose Action Advantage Assignment ($\mathrm{A}^3$), a native agentic RL method that preserves the algorithmic complexity of standard agentic RL. $\mathrm{A}^3$ constructs turn-level advantages from episode-level relative feedback, abstract syntax tree (AST) based action sub-chain residuals, and tree-level trajectory margins. To further evaluate this problem setting, we construct ShellOps, a verifiable dataset suite covering CLI tasks in repository environments.

2605.08012 2026-05-11 cs.LG cs.AI cs.CL

Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims

Zezheng Lin, Fengming Liu

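AI summary This paper examines the identification assumptions behind causal claims in mechanistic interpretability research, noting that current work uses causal terminology widely but often does not state the identification assumptions on which its causal inferences rest. A purposive audit of 10 papers finds no dedicated identification-assumptions section and finds that validation metrics are commonly reported as causal support. The paper argues for a disclosure norm that states the identification strategy for a causal claim, its assumptions, and the consequences if they fail, in order to improve the transparency and reliability of causal inference.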

Comments 10 pages, 2 figures. Submitted to NeurIPS 2026 (Position Track)

English abstract

Mechanistic interpretability papers increasingly use causal vocabulary: circuits, mediators, causal abstraction, monosemanticity. Such claims require explicit identification assumptions. A purposive audit of 10 papers across four methodological strands finds no dedicated identification-assumptions section and a recurring pattern: validation metrics such as faithfulness, completeness, monosemanticity, alignment, or ablation effects are reported as causal support without stating the assumptions that make them identifying. A two-human-coder audit on $n=30$ reproduces the direction of the main finding: dedicated identification sections are absent, and validation-metric substitution is common, though exact Dim B/D counts are coding-rule sensitive. The paper proposes a disclosure norm: state whether the claim is causal, name the identification strategy, enumerate assumptions, stress at least one, and explain how conclusions shift if assumptions fail. Validation is not identification.

2605.08011 2026-05-11 cs.AI stat.CO

Abductive Reasoning with Probabilistic Commonsense

Joseph Cotnareanu, Chiara Roverato, Han Zhou, Didier Chetelat, Yingxue Zhang, Mark Coates

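AI summary This work aims to improve the reasoning abilities of large language models, in particular their shortcomings on problems that require commonsense reasoning. It proposes a probabilistic framework that models how commonsense beliefs differ across individuals and introduces a new algorithm, PACS, which combines a large language model with a formal solver and aggregates conclusions over multiple samples to determine whether most people would judge a statement true or false. Experiments show that PACS outperforms existing reasoning methods on multiple benchmarks.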

Journal ref Proceedings of the International Conference on Machine Learning, 2026
English abstract

Recent efforts to improve the reasoning abilities of Large Language Models (LLMs) have focused on integrating formal logic solvers within neurosymbolic frameworks. A key challenge is that formal solvers lack commonsense world knowledge, preventing them from making reasoning steps that humans find obvious. Prior methods address this by using LLMs to supply missing commonsense assumptions, but these approaches implicitly assume universal agreement on such commonsense facts. In reality, commonsense beliefs vary across individuals. We propose a probabilistic framework for abductive commonsense reasoning that explicitly models this variation, aiming to determine whether most people would judge a statement as true or false. We introduce Probabilistic Abductive CommonSense (PACS), a novel algorithm that uses an LLM and a formal solver to sample proofs as observations of individuals' distinct commonsense beliefs, and aggregates conclusions across these samples. Empirically, PACS outperforms chain-of-thought reasoning, prior neurosymbolic methods, and search-based approaches across multiple benchmarks.
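
The aggregation step, deciding whether most people would judge a statement true from a set of sampled proofs, can be sketched as a simple vote over sampled verdicts. This is only an illustration under stated assumptions, not the PACS algorithm: the abstract does not say how samples are weighted, how undecided samples are treated, or how ties are broken. Here undecided samples are dropped and a tie counts as not true; the function name and the example verdicts are hypothetical.

```python
from collections import Counter


def aggregate_verdicts(sampled_verdicts):
    """Estimate P(statement judged true) from sampled proof outcomes.

    sampled_verdicts holds one entry per sampled belief set: True, False,
    or None when no conclusion was reached under that sample.
    Returns (estimated probability, majority decision), or (None, None)
    when every sample was undecided.
    """
    counts = Counter(v for v in sampled_verdicts if v is not None)
    decided = counts[True] + counts[False]
    if decided == 0:
        return None, None
    p_true = counts[True] / decided
    return p_true, p_true > 0.5


# Hypothetical outcomes from five sampled belief sets.
verdicts = [True, True, False, None, True]
p_true, majority_true = aggregate_verdicts(verdicts)
```

For the example above, three of the four decided samples conclude that the statement holds, so the estimate is 0.75 and the majority decision is true.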