arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.08084 2026-05-11 cs.RO cs.CV

123D: Unifying Multi-Modal Autonomous Driving Data at Scale

Daniel Dauner, Valentin Charraut, Bastian Berle, Tianyu Li, Long Nguyen, Jiabao Wang, Changhui Jing, Maximilian Igl, Holger Caesar, Boris Ivanovic, Yiyi Liao, Andreas Geiger, Kashyap Chitta

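AI Summary Autonomous driving has accumulated a large body of rich sensor data, but its potential remains largely untapped because the data are large in scale, diverse in modality, and inconsistent in format. This paper presents 123D, an open-source framework that integrates multi-modal driving data through a unified API, supports data with different capture rates and synchronization schemes, and provides tools for data visualization and analysis. The work consolidates multiple real-world and synthetic datasets, systematically evaluates each dataset's annotation consistency and calibration accuracy, and demonstrates 123D's value for cross-dataset 3D object detection and reinforcement learning for planning.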

Abstract

The pursuit of autonomous driving has produced one of the richest sensor data collections in all of robotics. However, its scale and diversity remain largely untapped. Each dataset adopts different 2D and 3D modalities, such as cameras, lidar, ego states, annotations, traffic lights, and HD maps, with different rates and synchronization schemes. They come in fragmented formats requiring complex dependencies that cannot natively coexist in the same development environment. Further, major inconsistencies in annotation conventions prevent training or measuring generalization across multiple datasets. We present 123D, an open-source framework that unifies such multi-modal driving data through a single API. To handle synchronization, we store each modality as an independent timestamped event stream with no prescribed rate, enabling synchronous or asynchronous access across arbitrary datasets. Using 123D, we consolidate eight real-world driving datasets spanning 3,300 hours and 90,000 kilometers, together with a synthetic dataset with configurable collection scripts, and provide tools for data analysis and visualization. We conduct a systematic study comparing annotation statistics and assessing each dataset's pose and calibration accuracy. Further, we showcase two applications 123D enables: cross-dataset 3D object detection transfer and reinforcement learning for planning, and offer recommendations for future directions. Code and documentation are available at https://github.com/kesai-labs/py123d.
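
A minimal sketch of the event-stream idea described in the abstract, assuming each modality is kept as a time-sorted list of (timestamp, payload) pairs. The names `EventStream`, `latest_at`, and `synchronized_view` are hypothetical and are not taken from the py123d API; the sensor rates are made up for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class EventStream:
    """One modality stored as time-sorted (timestamp, payload) events."""
    timestamps: List[float] = field(default_factory=list)
    payloads: List[Any] = field(default_factory=list)

    def append(self, t: float, payload: Any) -> None:
        # Assumption: events arrive in non-decreasing timestamp order.
        if self.timestamps and t < self.timestamps[-1]:
            raise ValueError("events must be appended in time order")
        self.timestamps.append(t)
        self.payloads.append(payload)

    def latest_at(self, t: float) -> Optional[Tuple[float, Any]]:
        """Return the most recent event with timestamp <= t, or None."""
        i = bisect_right(self.timestamps, t)
        if i == 0:
            return None
        return self.timestamps[i - 1], self.payloads[i - 1]


def synchronized_view(streams: Dict[str, EventStream], query_times: List[float]):
    """Sample every stream at shared query times (synchronous access)."""
    return [
        {name: stream.latest_at(t) for name, stream in streams.items()}
        for t in query_times
    ]


camera = EventStream()
lidar = EventStream()
for k in range(4):  # hypothetical 10 Hz camera
    camera.append(0.1 * k, f"img{k}")
for k in range(2):  # hypothetical 5 Hz lidar
    lidar.append(0.2 * k, f"sweep{k}")

frames = synchronized_view({"camera": camera, "lidar": lidar}, [0.0, 0.25])
```

In this sketch, asynchronous access is just iteration over one stream's own events, while synchronous access picks query times and takes the latest event of each modality at or before each one. No rate is prescribed anywhere in the storage layer.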

2605.08077 2026-05-11 cs.CL

Conformal Path Reasoning: Trustworthy Knowledge Graph Question Answering via Path-Level Calibration

Shuhang Lin, Chuhao Zhou, Xiao Lin, Zihan Dong, Kuan Lu, Zhencan Peng, Jie Yin, Dimitris N. Metaxas

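AI Summary Knowledge graph question answering (KGQA) shows promise for interpretable reasoning, but existing methods often struggle to guarantee the reliability of their answers. This paper proposes Conformal Path Reasoning (CPR), a trustworthy KGQA framework that improves the reliability and efficiency of predictions through path-level calibration. CPR introduces path-level confidence calibration and the Residual Conformal Value Network (RCVNet), which raise coverage and substantially shrink prediction sets; experiments show strong results on multiple benchmark datasets.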

Comments 13 pages, 3 figures, 2 tables

Abstract

Knowledge Graph Question Answering (KGQA) has shown promise for grounded and interpretable reasoning, yet existing approaches often fail to provide reliable coverage guarantees over retrieved answers. While Conformal Prediction (CP) offers a principled framework for producing prediction sets with statistical guarantees, prior methods suffer from critical limitations in both calibration validity and score discriminability, resulting in violated coverage guarantees and excessively large prediction sets. To address these pitfalls, we propose Conformal Path Reasoning (CPR), a trustworthy KGQA framework with two key innovations. First, we perform query-level conformal calibration over path-level scores, preserving the exchangeability while generating path prediction sets. Second, we introduce the Residual Conformal Value Network (RCVNet), a lightweight module trained via PUCT-guided exploration to learn discriminative path-level nonconformity scores. Experiments on benchmarks show that CPR significantly improves the Empirical Coverage Rate by 34% while reducing average prediction set size by 40% compared to conformal baselines. These results validate the efficacy of CPR in satisfying coverage guarantees with substantially more compact answer sets.
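
For readers unfamiliar with conformal prediction, the standard split conformal recipe can be sketched as follows. This is the generic textbook procedure, not CPR itself: it assumes one nonconformity score per calibration query (for instance, the score of that query's correct path) and does not model RCVNet or PUCT-guided exploration. Function names and all numbers are illustrative.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split conformal quantile: the ceil((n+1)(1-alpha))-th smallest score.

    cal_scores holds one nonconformity score per calibration query
    (lower means the correct answer looked more plausible).
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: keep every candidate.
        return math.inf
    return sorted(cal_scores)[rank - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate path whose nonconformity score is within threshold."""
    return [name for name, s in candidate_scores.items() if s <= threshold]


# Hypothetical calibration scores, one per calibration query.
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.90]
threshold = conformal_threshold(cal_scores, alpha=0.25)

# Hypothetical candidate paths for a new test query.
paths = {"p1": 0.2, "p2": 0.5, "p3": 0.7}
kept = prediction_set(paths, threshold)
```

Under exchangeability of calibration and test queries, a set built this way contains the correct answer with probability at least 1 - alpha. More discriminative scores leave the guarantee unchanged but shrink the set.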

2605.08075 2026-05-11 cs.LG eess.AS

Zero-Shot Imagined Speech Decoding via Imagined-to-Listened MEG Mapping

Maryam Maghsoudi, Shihab Shamma

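AI Summary This paper studies how to decode imagined speech from non-invasive brain recordings. To address the scarcity of imagined-speech data and the difficulty of aligning it across subjects, it proposes a new decoding approach that leverages recordings made during listening. The researchers collected MEG data from trained musicians under listening and imagining conditions and built a three-stage decoding pipeline that maps imagined neural responses to listened responses and trains a word decoder on the listened data, ultimately decoding imagined speech significantly above chance. The approach demonstrates the feasibility of imagined speech decoding and its potential for brain-computer interface applications.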

Abstract

Decoding imagined speech from non-invasive brain recordings is challenging because imagined datasets are scarce and difficult to align temporally across subjects and sessions. In this work, we propose a new approach to the decoding of imagined speech that leverages the richer and more reliably labeled recordings made during listening to speech. We collected paired listened and imagined MEG recordings to rhythmic melodic and spoken stimuli from trained musicians. Using trained musicians helped improve temporal alignment across conditions. We then developed a three-stage decoding pipeline that revealed consistent and meaningful relationships between neural activity evoked by imagining and listening to the same stimuli. First, we trained six linear and neural models to map imagined MEG responses to listened responses. We evaluated these models against a null baseline from unseen subjects to validate that the predicted-listening responses preserve stimulus-specific information. In the second stage, we trained a contrastive word decoder exclusively on the listened MEG responses, and evaluated it using four embedding strategies including semantic, acoustic, and phonetic representations. In the third stage, we process the imagined MEG responses from held-out subjects through the mapping pipeline to compute the corresponding listening responses, which are then decoded by the listened decoder. Using rank-based analysis, we show that the imagined words are decodable significantly above chance. We report the results of a proof-of-concept implementation for decoding imagined speech, with all evaluations performed on held-out subjects. We also demonstrate that performance improves with training data size, suggesting that this approach is scalable and can be applied directly to realistic brain-computer interface scenarios.

2605.08074 2026-05-11 cs.LG

GRAPHLCP: Structure-Aware Localized Conformal Prediction on Graphs

Peyman Baghershahi, Fangxin Wang, Debmalya Mandal, Sourav Medya

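AI Summary This paper proposes GRAPHLCP, a structure-aware localized conformal prediction framework for uncertainty quantification in graph neural networks. By incorporating graph structure and inter-node dependencies, the method improves on conventional localization based on embedding-space similarity, making predictions more reliable and efficient. Experiments show that GRAPHLCP guarantees marginal coverage with finite samples and attains favorable test conditional coverage across a range of conditioning scenarios.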

Comments 20 pages, 9 Figures, 8 Tables

Abstract

Conformal prediction (CP) provides a distribution-free approach to uncertainty quantification with finite-sample guarantees. However, applying CP to graph neural networks (GNNs) remains challenging as the combinatorial nature of graphs often leads to insufficiently certain predictions and indiscriminative embeddings. Existing methods primarily rely on embedding-space proximity for localization, which can be unreliable for graphs and yield inefficient prediction sets. We propose GRAPHLCP, a proximity-based localized CP framework that explicitly incorporates graph topology and inter-node dependencies into localization and weighting. Our approach introduces a feature-aware densification step to mitigate locality bias in sparse graphs, followed by a Personalized PageRank-based kernel computation to model structural proximity. This enables topology-dependent anchor sampling and calibration weighting that captures both local and long-range dependencies. Extensive experiments on several regression and classification datasets demonstrate that GRAPHLCP guarantees marginal coverage with finite samples while efficiently attaining favorable test conditional coverage across various conditioning scenarios.
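
The structural-proximity ingredient named in the abstract, Personalized PageRank, can be sketched with plain power iteration. This is a generic illustration rather than GRAPHLCP's kernel: the restart probability, the toy path graph, and the handling of dangling nodes are assumptions, and the feature-aware densification step is not modeled.

```python
def personalized_pagerank(adj, source, alpha=0.15, iters=100):
    """Power iteration for Personalized PageRank restarting at `source`.

    adj maps each node to a list of its neighbors; alpha is the restart
    probability. Returns a dict of scores that sums to 1.
    """
    nodes = list(adj)
    p = {v: 0.0 for v in nodes}
    p[source] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in nodes}
        for u in nodes:
            nbrs = adj[u]
            if not nbrs:
                # Assumption: a dangling node sends its mass back to the source.
                nxt[source] += (1 - alpha) * p[u]
                continue
            share = (1 - alpha) * p[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        # Restart step: a fraction alpha of the total mass returns to the source.
        nxt[source] += alpha
        p = nxt
    return p


# Toy undirected path graph a - b - c - d.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = personalized_pagerank(graph, source="a")
```

Scores of this kind decay with graph distance from the source while still assigning some mass to distant nodes, which is the sort of signal one could use to weight calibration points by structural rather than embedding-space proximity.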

2605.08073 2026-05-11 cs.CV cs.AI

EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction

Wei Yu, Yunhang Qian

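AI Summary This paper proposes EmambaIR, an efficient visual state space model for event-guided image reconstruction. To address the limitations of conventional convolutional neural networks and vision Transformers in global feature modeling and computational efficiency, EmambaIR introduces a cross-modal sparse attention module and a gated state-space module, enabling efficient global feature fusion and temporal representation learning. Experiments show that the method significantly outperforms existing methods on several image reconstruction tasks while substantially reducing computation and memory consumption.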

Abstract

Recent event-based image reconstruction methods predominantly rely on Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to process complementary event information. However, these architectures face fundamental limitations: CNNs often fail to capture global feature correlations, whereas ViTs incur quadratic computational complexity (e.g., $O(n^2)$), hindering their application in high-resolution scenarios. To address these bottlenecks, we introduce EmambaIR, an Efficient visual State Space Model designed for image reconstruction using spatially sparse and temporally continuous event streams. Our framework introduces two key components: the cross-modal Top-k Sparse Attention Module (TSAM) and the Gated State-Space Module (GSSM). TSAM efficiently performs pixel-level top-k sparse attention to guide cross-modal interactions, yielding rich yet sparse fusion features. Subsequently, GSSM utilizes a nonlinear gated unit to enhance the temporal representation of vanilla linear-complexity ($O(n)$) SSMs, effectively capturing global contextual dependencies without the typical computational overhead. Extensive experiments on six datasets across three diverse image reconstruction tasks - motion deblurring, deraining, and High Dynamic Range (HDR) enhancement - demonstrate that EmambaIR significantly outperforms state-of-the-art methods while offering substantial reductions in memory consumption and computational cost. The source code and data are publicly available at: https://github.com/YunhangWickert/EmambaIR

2605.08072 2026-05-11 stat.ML cs.DS cs.LG math.ST stat.TH

A Note on Non-Negative $L_1$-Approximating Polynomials

Jane H. Lee, Anay Mehrotra, Manolis Zampetakis

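AI Summary This paper studies the existence of non-negative $L_1$-approximating polynomials under Gaussian distributions: polynomials that approximate indicator functions within the required $L_1$-norm error while also guaranteeing non-negative output. The authors prove that for classes of sets with Gaussian surface area (GSA) at most $\Gamma$, there exist non-negative polynomials of degree $\tilde{O}(\Gamma^2/\varepsilon^2)$ that approximate their indicator functions to within error $\varepsilon$. The result retains $L_1$-approximation while providing a stronger pointwise guarantee, and its degree is within a constant factor of the best known degree for Gaussian $L_1$-approximating polynomials without the non-negativity constraint.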

Abstract

$L_1$-Approximating polynomials, i.e., polynomials that approximate indicator functions in $L_1$-norm under certain distributions, are widely used in computational learning theory. We study the existence of non-negative $L_1$-approximating polynomials with respect to Gaussian distributions. This is a stronger requirement than $L_1$-approximation but weaker than sandwiching polynomials (which themselves have many applications). These non-negative approximating polynomials have recently found uses in smoothed learning from positive-only examples. In this short note, we prove that every class of sets with Gaussian surface area (GSA) at most $\Gamma$ under the standard Gaussian admits degree-$k$ non-negative polynomials that $\varepsilon$-approximate its indicator functions in $L_1$-norm, for $k=\tilde{O}(\Gamma^2/\varepsilon^2)$. Equivalently, finite GSA implies $L_1$-approximation with the stronger pointwise guarantee that the approximating polynomial has range contained in $[0,\infty)$. Up to a constant factor, this matches the best currently known degree bound for Gaussian $L_1$-approximation without the non-negativity constraint.

2605.08071 2026-05-11 econ.EM cs.HC stat.ME

Vibe Econometrics and the Analysis Contract

Lydia Ashton

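AI Summary This paper examines the use of "vibe methodology" in economics. It argues that AI-assisted causal analysis ("vibe econometrics") improves efficiency but also gives rise to three failure modes: method-data mismatch, confidence laundering, and invisible forking. The paper proposes the "Analysis Contract" framework, which adapts pre-analysis plans and the Causal Roadmap to provide a governance mechanism for AI-assisted causal inference and to make results more credible and auditable.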

Comments 20 pages, 2 figures. Appendices A-C (fillable templates) provided as ancillary file. Companion materials: https://github.com/lydiaashton/vibe-econometrics-supp . Also posted on SSRN: https://doi.org/10.2139/ssrn.6699999

Abstract

"Vibe coding" and "vibe analytics" have been framed as a democratization of technical capability. This paper argues that AI-assisted methodology more broadly, or what I call "vibe methodology," also democratizes the failure modes specific to each domain. When AI assists with methods whose validity depends on assumptions that cannot be verified from the output alone (a class I call "vibe inference"), the failure surface is structurally different: the output does not reliably signal invalidity, and when it does, recognizing the signal requires the expertise the workflow bypasses. I focus on "vibe econometrics," the subset of AI-assisted causal analysis where identification can be named faster than it can be audited. The claim of this paper is not that AI invents inferential failures that did not previously exist, but that it changes their incidence, observability, and persuasive force enough to create a practically distinct governance problem. This results in three failure modes: method-data mismatch, where AI bypasses expertise at execution; confidence laundering, where AI amplifies the credibility of formatted output; and invisible forking, which spans both. What is new is not the failure modes but AI's industrialization of their packaging. The barrier between naming a method and executing it has collapsed, and weak foundations, dressed as rigorous analysis, now reach audiences at a scale, speed, and polish that previously required expertise. I propose the Analysis Contract, a pre-commitment framework that adapts the logic of pre-analysis plans and the Causal Roadmap to the AI-assisted setting. The contract imposes three conditions before a causal claim is made: a method-data contract, a data audit, and a pre-commitment statement defining what would count as a disconfirming result. The framework generalizes across domains of vibe inference through domain-specific instantiation.

2605.08070 2026-05-11 cs.AI

VecCISC: Improving Confidence-Informed Self-Consistency with Reasoning Trace Clustering and Candidate Answer Selection

James Petullo, Sonny George, Dylan Cashman, Nianwen Xue

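AI Summary VecCISC is an improved confidence-informed self-consistency method that aims to reduce the computational overhead of weighted majority voting during inference. It uses a semantic similarity measure to filter out reasoning traces that are semantically redundant, degenerate, or hallucinated, reducing the number of candidate answers the critic model must evaluate. Experiments show that VecCISC maintains or exceeds the accuracy of CISC on datasets from multiple domains while reducing total token usage by 47%.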

Comments Accepted to Findings of ACL 2026

Abstract

A standard technique for scaling inference-time reasoning is Self-Consistency, whereby multiple candidate answers are sampled from an LLM and the most common answer is selected. More recently, it has been shown that weighted majority voting (e.g. Confidence-Informed Self Consistency (CISC)), which assigns a confidence value to each candidate answer and chooses the answer with the largest accumulated score, tends to be more accurate on a wide range of popular benchmarks. In practice, weighted majority voting necessitates calling a critic LLM on each candidate's reasoning trace to produce the answer's confidence score. This secondary series of LLM calls greatly increases the overhead and cost of weighted majority voting, despite its potential performance benefits. To reduce this expense, we propose VecCISC, a lightweight, adaptive framework that uses a measure of semantic similarity to filter reasoning traces that are semantically equivalent to others, degenerate, or hallucinated, thus decreasing the number of candidate answers that must be evaluated by the critic. To ensure adequate experimental thoroughness, we evaluate VecCISC on five challenging, widely-adopted datasets spanning the domains of mathematics, chemistry, biology, commonsense reasoning, and the humanities. Our results demonstrate that VecCISC reduces the total token usage by 47%, while maintaining or exceeding the accuracy of CISC.
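
The difference between plain Self-Consistency and weighted majority voting, as described in the abstract, can be sketched in a few lines. The answers and confidence values below are invented, the confidences stand in for whatever a critic LLM would return, and VecCISC's similarity-based filtering is not modeled.

```python
from collections import Counter, defaultdict


def self_consistency(answers):
    """Plain majority vote: return the most frequent sampled answer."""
    return Counter(answers).most_common(1)[0][0]


def weighted_majority_vote(answers, confidences):
    """Return the answer with the largest accumulated confidence."""
    if len(answers) != len(confidences):
        raise ValueError("need exactly one confidence per answer")
    totals = defaultdict(float)
    for answer, conf in zip(answers, confidences):
        totals[answer] += conf
    return max(totals, key=totals.get)


# Hypothetical sampled answers and critic confidences.
answers = ["A", "A", "B", "B", "B"]
confidences = [0.9, 0.8, 0.3, 0.2, 0.4]

plain = self_consistency(answers)
weighted = weighted_majority_vote(answers, confidences)
```

Here the two rules disagree: "B" wins on count, while "A" wins on accumulated confidence (1.7 against 0.9). Each confidence costs one critic call, so removing redundant traces before scoring reduces the number of calls.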

2605.08066 2026-05-11 quant-ph cs.IT math.IT

Covert Signaling for Communication and Sensing over the Bosonic Channels

Tianrui Tan, Evan J. D. Anderson, Michael S. Bullock, Boulat A. Bash

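AI Summary This paper studies sparse signaling strategies for covert communication and sensing over bosonic channels with thermal noise. By analyzing signal detectability, the authors find an unintuitive optimal quantum state structure: a mixture of just two consecutive photon-number states. In the low-brightness regime, the optimal signal state is a mixture of vacuum and a single photon. The study characterizes the trade-off between covertness and communication and sensing performance, and identifies the power thresholds between the different optimization objectives.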

Comments 15 pages, 4 figures, draft, comments welcome

Abstract

Preventing signal detection in communication and active sensing requires careful control of transmission power. In fact, the square-root laws (SRL) for covert classical and quantum communication and sensing prescribe that the average output power per channel use scales as $1/\sqrt{n}$ for $n$ channel uses. Two strategies for achieving this are diffuse and sparse signaling. The former transmits signals with power decaying as $1/\sqrt{n}$ on all $n$ channel uses, which is convenient for mathematical analysis. The latter transmits constant-power signals rarely, on approximately $\sqrt{n}$ out of $n$ channel uses, while remaining silent on the others. This offers significant practical advantages in compatibility with modern digital transmitters. Here, we study sparse signaling over lossy thermal-noise bosonic channels, which provide a quantum-mechanical description of many practical channels (including optical, microwave, and radio-frequency). We characterize the input signal state that minimizes detectability. We find an unintuitive optimal quantum state structure: a mixture of just two consecutive photon-number states. In particular, in the low-brightness regime, the optimal signal state is a mixture of vacuum and a single photon. Since these states are generally suboptimal for both communication and active sensing, we explore the resulting trade-off and identify input-power thresholds for transitions between optimizing for covertness vs. performance in communication and sensing tasks.

2605.08064 2026-05-11 cs.CV

Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment

Jerry Jiang, Haowen Sun, Denis Gudovskiy, Yohei Nakata, Tomoyuki Okuno, Kurt Keutzer, Wenzhao Zheng

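AI Summary This paper proposes Proxy3D, a method that gives vision-language models more efficient yet comprehensive 3D representations. Using semantic clustering and alignment, it generates compact 3D proxy representations from video frames alone, overcoming the shortcomings of conventional 2D pipelines in spatial consistency and sequence efficiency. Experiments show that, while using shorter sequences to encode visual information, Proxy3D achieves competitive performance on 3D visual question answering, visual grounding, and general spatial intelligence tasks.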

Comments Accepted by CVPR 2026. Project page: https://wzzheng.net/Proxy3D

Abstract

Spatial intelligence in vision-language models (VLMs) attracts research interest with the practical demand to reason in the 3D world. Despite promising results, most existing methods follow the conventional 2D pipeline in VLMs and use pixel-aligned representations for the vision modality. However, correspondence-based models with implicit 3D scene understanding often fail to achieve spatial consistency, and representation-based models with 3D geometric priors lack efficiency in vision sequence serialization. To address this, we propose a Proxy3D method with compact yet comprehensive 3D proxy representations for the vision modality. Given only video frames as input, we employ semantic and geometric encoders to extract scene features and then perform their semantic-aware clustering to obtain a set of proxies in the 3D space. For representation alignment, we further curate the SpaceSpan dataset and apply multi-stage training to adopt the proposed 3D proxy representations with the VLM. When using shorter sequences for vision information, our method achieves competitive or state-of-the-art performance in 3D visual question answering, visual grounding and general spatial intelligence benchmarks.

2605.08061 2026-05-11 cs.AI

Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

Manish Bhattarai, Ismael Boureima, Nishath Rajiv Ranasinghe, Scott Pakin, Dan O'Malley

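AI Summary This paper proposes rubric-grounded reinforcement learning (Rubric-Grounded RL), which decomposes the reward into multiple verifiable rubric criteria and uses a frozen large language model (LLM) as the judge, giving policy optimization a finer-grained feedback signal. The authors build rubrics from an Office of Scientific and Technical Information (OSTI) document corpus and train Llama-3.1-8B-Instruct with Group Relative Policy Optimization (GRPO). The resulting model outperforms the base model on several reasoning benchmarks, indicating that the method improves generalization and reasoning beyond the training data.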

Abstract

We argue that decomposing reward into weighted, verifiable criteria and using an LLM judge to score them provides a partial-credit optimization signal: instead of a binary outcome or a single holistic score, each response is graded along multiple task-specific criteria. We formalize rubric-grounded reinforcement learning (RL): a framework in which the policy is optimized against a structured, multi-criterion reward produced by a frozen LLM judge that conditions on auxiliary grounding the policy never sees. We instantiate the framework by deriving rubrics from an Office of Scientific and Technical Information (OSTI)-derived corpus of roughly 100,000 scientific and technical documents and training Llama-3.1-8B-Instruct with Group Relative Policy Optimization (GRPO). With GRPO-based training, the model achieves 71.7% normalized reward on held-out rubric evaluation. The GRPO-tuned policy also improves over the base model on four reasoning benchmarks not derived from the training corpus -- GSM8K, MATH, GPQA Main, and GPQA Diamond. These results provide evidence that structured, document-grounded rewards can improve held-out rubric performance and induce transferable reasoning behaviors beyond the corpus used to construct the training environment.
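
One plausible reading of a weighted, partial-credit rubric reward is a weight-normalized sum of per-criterion judge scores, which can then be turned into group-relative advantages in the GRPO style. The sketch below is an assumption-laden illustration: the criterion names, the [0, 1] score range, the normalization, and the use of the population standard deviation are not taken from the paper.

```python
from statistics import mean, pstdev


def rubric_reward(weights, scores):
    """Weight-normalized partial credit in [0, 1].

    weights: criterion name -> non-negative weight.
    scores:  criterion name -> judge score in [0, 1] (missing counts as 0).
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(w * scores.get(name, 0.0) for name, w in weights.items())
    return earned / total


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; variants differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric and judge scores for one response.
weights = {"states_assumptions": 1.0, "correct_units": 1.0, "final_answer": 2.0}
judge_scores = {"states_assumptions": 1.0, "correct_units": 0.0, "final_answer": 1.0}
reward = rubric_reward(weights, judge_scores)

# Two sampled responses to the same prompt with different rubric rewards.
advantages = group_relative_advantages([0.25, 0.75])
```

A response that misses one criterion still earns most of the credit, so two imperfect responses can be ranked against each other, which a binary pass/fail reward would not allow.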

2605.08060 2026-05-11 cs.CL cs.AI cs.GT cs.MA

The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents

Jiayuan Liu, Tianqin Li, Shiyi Du, Xin Luo, Haoxuan Zeng, Emanuel Tewolde, Tai Sing Lee, Tonghan Wang, Carl Kingsford, Vincent Conitzer

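AI Summary This paper studies the decline in cooperation that large language models (LLMs) exhibit in multi-agent social dilemmas as their context window expands, a phenomenon it calls the "memory curse." Through extensive experiments and analysis, it finds that the content of expanded memory, rather than its length, is the main cause of the breakdown in cooperation, and shows that memory sanitization and changes to the reasoning pattern can mitigate the problem. The study reveals the direct influence of memory content on multi-agent behavior and offers new ideas for designing more stable cooperative AI systems.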

Abstract

Context window expansion is often treated as a straightforward capability upgrade for LLMs, but we find it systematically fails in multi-agent social dilemmas. Across 7 LLMs and 4 games over 500 rounds, expanding accessible history degrades cooperation in 18 of 28 model--game settings, a pattern we term the memory curse. We isolate the underlying mechanism through three analyses. First, lexical analysis of 378,000 reasoning traces associates this breakdown with eroding forward-looking intent rather than rising paranoia. We validate this using targeted fine-tuning as a cognitive probe: a LoRA adapter trained exclusively on forward-looking traces mitigates the decay and transfers zero-shot to distinct games. Second, memory sanitization holds prompt length fixed while replacing visible history with synthetic cooperative records, which restores cooperation substantially, proving the trigger is memory content, not length alone. Finally, ablating explicit Chain-of-Thought reasoning often reduces the collapse, showing that deliberation paradoxically amplifies the memory curse. Together, these results recast memory as an active determinant of multi-agent behavior: longer recall can either destabilize or support cooperation depending on the reasoning patterns it elicits.

2605.08059 2026-05-11 cs.CV cs.RO

6D Pose Estimation via Keypoint Heatmap Regression with RGB-D Residual Neural Networks

Ismail Aljosevic, Amir Masoud Almasi, Ana Parovic, Ashkan Shafiei

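AI Summary This paper proposes a modular 6D pose estimation framework based on keypoint heatmap regression. It uses YOLOv10m for object detection and a ResNet18 network to predict 2D heatmaps from RGB images, then computes the object's 6D pose with the PnP RANSAC algorithm. The study also introduces a cross-modal fusion architecture that incorporates depth information to improve performance, and further improves the model by tuning activation functions and learning rate schedules. On the LINEMOD dataset, the RGB-only and RGB-D fusion models reach mean ADD accuracies of 84.50% and 92.41%, respectively.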

Comments Source code available at: https://github.com/ameermasood/HeatNet

Abstract

In this paper, we propose a modular framework for 6D pose estimation based on keypoint heatmap regression. Our approach combines YOLOv10m for object detection with a ResNet18-based network that predicts 2D heatmaps from RGB images. Keypoints extracted from these heatmaps are used to estimate the 6D object pose via the PnP RANSAC algorithm. We compare different keypoint selection strategies to assess their impact on pose accuracy. Additionally, we extend the baseline by incorporating depth data using a cross-fusion architecture, which enables interaction between RGB and depth features at multiple stages. We further explore general training improvements, such as experimenting with activation functions and learning rate scheduling strategies to improve model performance. Our best RGB-only model achieved a mean ADD-based accuracy of 84.50%, while the RGB-D fusion model reached 92.41% on the LINEMOD dataset. The code is available at https://github.com/ameermasood/HeatNet.

2605.08057 2026-05-11 cs.CL cs.AI

CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation

James Petullo, Nianwen Xue

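AI Summary This paper proposes CA-SQL, a Text-to-SQL method that aims to improve model reasoning on complex tasks. It dynamically adjusts the breadth of exploration and combines prompt seeding based on evolutionary search principles with a voting mechanism, improving the quality and accuracy of generated candidate queries. Experiments show that CA-SQL outperforms other in-context learning approaches on the "challenging" tier of the BIRD benchmark, demonstrating its advantage under limited resources.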

Abstract

While recent advancements in inference-time learning have improved LLM reasoning on Text-to-SQL tasks, current solutions still struggle to perform well on the most challenging tasks in the Bird-Bench (BIRD) benchmark. This is due to inadequate solution space exploration, which is necessary to uncover promising candidate queries that can be further refined to produce the correct output. To address this challenge, we introduce CA-SQL, a novel Text-to-SQL pipeline that utilizes the estimated difficulty of a task to dynamically scale the breadth of the exploration for generating solution candidates. In addition, we use a custom prompt seeding method, based on principles of evolutionary search, to further elicit exploratory behavior from the base LLM and a novel voting method to select the best candidate solution at the end of the search. Experiments demonstrate that our solution achieves a state-of-the-art score of 51.72% on the "challenging" tier of BIRD development set problems, using only GPT-4o-mini, out-performing other in-context learning approaches, even those that leverage larger models. Overall, our method attains a competitive 61.06% execution accuracy and 68.77% Soft F1 score on the BIRD development dataset.

2605.08054 2026-05-11 cs.CV

Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization

Hanchao Liu, Fang-Lue Zhang, Shining Zhang, Tai-Jiang Mu, Shi-Min Hu

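AI Summary This work aims to generate human motion that satisfies highly constrained conditions, such as complex spatial obstacles or a specified number of steps, in support of applications like controllable character animation and behavior synthesis for virtual agents. To address the shortcomings of existing methods under extreme spatiotemporal restrictions, the authors propose a retrieval-guided diffusion noise optimization method: it retrieves references from large motion datasets and combines relational task parsing with a reward-guided noise mask to optimize the diffusion model's initial noise, generating motion sequences that meet strict constraints more effectively. The method improves the intelligence and adaptability of motion generation without any training.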

Comments Accepted to CVPR 2026

Abstract

Generating human motion that satisfies customized zero-shot goal functions, enabling applications such as controllable character animation and behavior synthesis for virtual agents, is a critical capability. While current approaches handle many unseen constraints, they fail on tasks with very challenging spatiotemporal restrictions, such as severe spatial obstacles or specified numbers of walking steps. To equip motion generators for these highly constrained tasks, we present a retrieval-guided method built on the training-free diffusion noise optimization framework. The key idea is to search within large motion datasets for guidance that can potentially satisfy difficult constraints. We introduce relational task parsing to group target constraints and identify the difficult ones to be handled by retrieved references. A better initialization for diffusion noise is then obtained via a reward-guided mask that combines random noise with retrieved noise. By optimizing diffusion noise from this improved initialization, we successfully solve highly constrained generation tasks. By leveraging an LLM for relational task parsing, the whole framework can further reason automatically about what to retrieve, improving the intelligence of moving agents under a training-free optimization scheme.

2605.08053 2026-05-11 cs.LG

Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs

Gugan Thoppe, L. A. Prashanth, Ankur Naskar, Sanjay Bhat

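AI Summary This paper studies reinforcement learning with exponential utility in discounted Markov decision processes (MDPs), proposing two Q-value-based algorithms and proving their convergence. Extending the Bellman equation for exponential utility, the authors derive two Q-value forms, prove that the associated operators are contractions under specific metrics, characterize their fixed points, and show that the corresponding greedy policy is optimal among stationary policies. The paper also proposes two model-free algorithms, one two-timescale and one one-timescale, with convergence analysis and finite-time convergence rate results, laying a theoretical foundation for value-based reinforcement learning under exponential-utility objectives.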

Abstract

Reinforcement learning (RL) for exponential-utility optimization in discounted Markov decision processes (MDPs) lacks principled value-based algorithms. We address this gap in the fixed risk-aversion setting. Building on the Bellman-type equation for exponential utility studied in \cite{porteus1975optimality}, we derive two Q-value-style extensions and show that the associated operators are contractions in the $L_\infty$ and sup-log/Thompson metrics, respectively. We characterize their fixed points and prove that the induced greedy stationary policy is optimal for the exponential-utility objective among stationary policies. These structural results lead to two model-free algorithms: a two-timescale Q-learning--style algorithm, for which we establish almost-sure convergence and provide finite-time convergence rates via timescale separation, and a one-timescale algorithm governed by a sublinear power-law operator. Since the latter does not admit a global contraction in standard metrics, we prove its convergence using delicate arguments based on local Lipschitzness, monotonicity, homogeneity, and Dini derivatives, and provide a scalar finite-time analysis that highlights the challenges in obtaining convergence rates in the vector case. Our work provides a foundation for value-based RL under exponential-utility objectives.

2605.08050 2026-05-11 cs.CV

MoCoTalk: Multi-Conditional Diffusion with Adaptive Router for Controllable Talking Head Generation

Xinyan Ye, Jiankang Deng, Abbas Edalat

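AI Summary MoCoTalk is a multi-conditional diffusion framework for controllable talking-head generation that unifies control over four complementary factors: identity, facial expression, head pose, and mouth dynamics. It introduces an adaptive multi-condition router that dynamically fuses the condition signals across channels and timesteps, designs a Mouth-Augmented Shading Mesh to better model mouth dynamics, and adds a lip consistency loss to improve audio-visual alignment. Experiments show that MoCoTalk reaches state-of-the-art performance on most structural, motion, and perceptual metrics and offers attribute-level controllability that single-condition methods lack.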

详情
英文摘要

Talking-head generation requires joint modeling of identity, head pose, facial expression, and mouth dynamics. Existing methods typically address only a subset of these factors, and rely on fixed-weight or heuristic fusion when multiple conditions are involved. We present MoCoTalk, a multi-conditional video diffusion framework that unifies four complementary control signals: a reference image, facial keypoints, 3DMM-rendered shading meshes, and the corresponding speech audio. To resolve destructive interference among heterogeneous conditions, we introduce an Adaptive Multi-Condition Router that computes channel-wise, timestep-aware gating over the four condition streams, allowing the fusion strategy to vary with both feature subspace and noise level. To better capture speech-related facial dynamics, we design a Mouth-Augmented Shading Mesh, a 3DMM-based representation that decouples head motion, mouth motion, expression, and lighting. This design provides a temporally consistent geometric prior and allows flexible recombination of these attributes at inference. We further introduce a lip consistency loss to tighten audio-visual alignment. Extensive experiments show that MoCoTalk achieves state-of-the-art performance on the majority of structural, motion, and perceptual metrics, while offering attribute-level controllability that single-condition methods do not provide.

2605.08048 2026-05-11 cs.CL

Accurate and Efficient Statistical Testing for Word Semantic Breadth

Yo Ehara

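AI Summary This work proposes an accurate and efficient statistical test for the semantic breadth of words. When comparing the semantic breadth of two words, conventional dispersion-based statistics can mistake differences in semantic direction for differences in breadth, inflating the Type-I error rate. The author therefore introduces a permutation test based on Householder reflection alignment that separates directional differences from dispersion differences, allowing differences in semantic breadth to be assessed more accurately. Experiments show that the method lowers the false-positive rate while also improving computational efficiency.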

Comments Accepted to ACL 2026 Main Conference

Abstract

Measuring the breadth of a word's meaning, or its spread across contexts, has become feasible with contextualized token embeddings. A word type can be represented as a cloud of token vectors, with dispersion-based statistics serving as proxies for contextual diversity (Nagata and Tanaka-Ishii, ACL2025). These measurements are useful for deciding appropriate sense distinctions when constructing thesauri and domain-specific dictionaries. However, when comparing the breadth of two word types, naive hypothesis testing on dispersion can be misleading: differences in semantic direction can masquerade as dispersion differences, inflating Type-I error and yielding "statistically significant" outcomes even when there is no true breadth difference. This is problematic because significance testing should distinguish genuine effects from incidental fluctuations in small-difference regimes. We propose a Householder-aligned permutation test to isolate dispersion differences from directional differences. Our method applies a single Householder reflection to align the mean directions of the two word types and then performs a permutation test on the aligned token clouds, yielding calibrated, non-parametric p-values. For practicality, we introduce a GPU-oriented implementation that batches permutations and linear algebra operations. Empirically, our alignment reduced Type-I error by 32.5% while preserving sensitivity to genuine breadth differences, and achieved a 23x speedup over the CPU baseline.
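
The two steps named in the abstract, a single Householder reflection that aligns mean directions followed by a permutation test on the aligned clouds, can be sketched in pure Python. This is a CPU toy, not the paper's GPU implementation: the dispersion statistic (mean squared distance to the centroid), the absolute-difference test statistic, and the 2D points are all assumptions.

```python
import math
import random


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def unit(x):
    norm = math.sqrt(dot(x, x))
    return [a / norm for a in x]


def mean_vec(cloud):
    return [sum(col) / len(cloud) for col in zip(*cloud)]


def householder_align(cloud, src_dir, dst_dir):
    """Reflect every point so that direction src_dir maps onto dst_dir."""
    a, b = unit(src_dir), unit(dst_dir)
    v = [ai - bi for ai, bi in zip(a, b)]
    vv = dot(v, v)
    if vv < 1e-12:  # directions already coincide
        return [list(x) for x in cloud]
    # H x = x - 2 v (v . x) / (v . v), an orthogonal reflection with H a = b.
    return [[xi - 2 * vi * dot(v, x) / vv for xi, vi in zip(x, v)] for x in cloud]


def dispersion(cloud):
    """Mean squared distance to the centroid (assumed breadth proxy)."""
    m = mean_vec(cloud)
    return sum(sum((xi - mi) ** 2 for xi, mi in zip(x, m)) for x in cloud) / len(cloud)


def permutation_pvalue(cloud_a, cloud_b, n_perm=999, seed=0):
    """Permutation p-value for a dispersion difference after alignment."""
    rng = random.Random(seed)
    aligned_a = householder_align(cloud_a, mean_vec(cloud_a), mean_vec(cloud_b))
    observed = abs(dispersion(aligned_a) - dispersion(cloud_b))
    pooled = aligned_a + cloud_b
    n_a = len(aligned_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = abs(dispersion(pooled[:n_a]) - dispersion(pooled[n_a:]))
        if stat >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy clouds: a tight one near (1, 0) and a broad one near (0, 1).
tight = [[1.0, 0.01], [1.0, -0.01], [0.99, 0.0], [1.01, 0.0], [1.0, 0.02], [1.0, -0.02]]
broad = [[0.5, 1.0], [-0.5, 1.0], [0.0, 0.5], [0.0, 1.5], [0.4, 1.2], [-0.4, 0.8]]
p_value = permutation_pvalue(tight, broad)
```

Because the reflection is orthogonal, it leaves each cloud's own dispersion unchanged. Its effect is on the pooled, permuted groups: once the mean directions coincide, a gap between the two means no longer contributes to the dispersion of the shuffled groups.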

2605.08045 2026-05-11 cs.CL

Uncertainty-Aware Structured Data Extraction from Full CMR Reports via Distilled LLMs

Yi Yu, Parker Martin, Zhenyu Bu, Yixuan Liu, Yi-Yu Zheng, Orlando Simonetti, Yuchi Han, Yuan Xue

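AI Summary Converting free-text cardiac magnetic resonance (CMR) reports into auditable structured data is a key bottleneck for cohort assembly, longitudinal data curation, and clinical decision support. This paper presents CMR-EXTR, a lightweight framework that converts free-text CMR reports into structured data and assigns a confidence score to each field for quality control. A teacher-student distillation pipeline enables fully offline inference and reduces reliance on manual annotation, while uncertainty estimation combines three complementary principles (distribution plausibility, sampling stability, and cross-field consistency) to select the content that needs human review. Experiments show that CMR-EXTR reaches 99.65% variable-level accuracy, indicating reliable extraction and informative confidence estimates.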

Comments Accepted to ISBI 2026

Details
English abstract

Converting free-text cardiac magnetic resonance (CMR) reports into auditable structured data remains a bottleneck for cohort assembly, longitudinal curation, and clinical decision support. We present CMR-EXTR, a lightweight framework that converts free-text CMR reports into structured data and assigns per-field confidence for quality control. A teacher-student distillation pipeline enables fully offline inference while limiting manual annotation. Uncertainty integrates three complementary principles -- distribution plausibility, sampling stability, and cross-field consistency -- to triage human review. Experiments show that CMR-EXTR achieves 99.65% variable-level accuracy, demonstrating both reliable extraction and informative confidence scores. To our knowledge, this is the first CMR-specific extraction system with integrated confidence estimation. The code is available at https://github.com/yuyi1005/CMR-EXTR.

2605.08044 2026-05-11 cs.CL cs.AI cs.LG

Fast Byte Latent Transformer

Julie Kallini, Artidoro Pagnoni, Tomasz Limisiewicz, Gargi Ghosh, Luke Zettlemoyer, Christopher Potts, Xiaochuang Han, Srinivasan Iyer

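AI Summary This paper speeds up generation in the Byte Latent Transformer (BLT), a byte-level language model whose byte-by-byte autoregressive decoding is slow. It introduces BLT Diffusion (BLT-D), trained with a diffusion objective alongside the standard next-byte prediction loss, which decodes multiple bytes in parallel per step and substantially reduces the number of forward passes needed to generate a sequence. It also proposes two extensions, BLT Self-speculation and BLT Diffusion+Verification, which trade some of that speed for higher generation quality. The methods are estimated to cut the memory-bandwidth cost of generation tasks by more than 50% relative to BLT, removing key barriers to the practical use of byte-level language models.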

Details
English abstract

Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generation techniques. First, we introduce BLT Diffusion (BLT-D), a new model and our fastest BLT variant, trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction loss. This enables an inference procedure that generates multiple bytes in parallel per decoding step, substantially reducing the number of forward passes required to generate a sequence. Second, we propose two extensions inspired by speculative decoding that trade some of this speed for higher generation quality: BLT Self-speculation (BLT-S), in which BLT's local decoder continues generating past its normal patch boundaries to draft bytes, which are then verified with a single full-model forward pass; and BLT Diffusion+Verification (BLT-DV), which augments BLT-D with an autoregressive verification step after diffusion-based generation. All methods may achieve an estimated memory-bandwidth cost over 50% lower than BLT on generation tasks. Each approach offers its own unique advantages, together removing key barriers to the practical use of byte-level LMs.
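
The draft-then-verify idea behind the speculative extensions can be illustrated with a generic greedy sketch. It is not the BLT-S or BLT-DV procedure: `draft_next` and `verify_next` are hypothetical stand-ins for a cheap drafter and the full model, both assumed to return the next byte greedily, and the per-position verifier calls stand in for what would be one batched forward pass in a real model.

```python
def speculative_generate(prefix, draft_next, verify_next, block=4, max_len=16):
    """Greedy draft-and-verify decoding over byte values (ints in 0..255)."""
    out = list(prefix)
    full_passes = 0
    while len(out) < max_len:
        # Draft: the cheap model proposes `block` bytes one after another.
        draft = []
        for _ in range(block):
            draft.append(draft_next(out + draft))
        # Verify: counted as ONE full-model pass; a real model would score
        # all draft positions in a single batched forward pass.
        full_passes += 1
        accepted = []
        for proposed in draft:
            target = verify_next(out + accepted)
            if proposed == target:
                accepted.append(proposed)
            else:
                accepted.append(target)  # keep the verifier's byte, drop the rest
                break
        out.extend(accepted)
    return bytes(out[:max_len]), full_passes
```

Under these assumptions the output matches what the verifier alone would produce greedily, while the number of full passes falls whenever the drafter's guesses are accepted in runs.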

2605.08043 2026-05-11 cs.CV cs.AI

SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

Tianfei Ren, Zhipeng Yan, Yiming Zhao, Zhen Fang, Yu Zeng, Guohui Zhang, Hang Xu, Xiaoxiao Ma, Shiting Huang, Ke Xu, Wenxuan Huang, Lionel Z. Wang, Lin Chen, Zehui Chen, Jie Huang, Feng Zhao

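AI Summary This paper proposes SCOPE, a framework for keeping track of semantic commitments throughout complex image generation. Through structured decomposition and conditional skill orchestration, it maintains the semantic requirements in an evolving specification during generation and invokes retrieval, reasoning, and repair skills when needed so that the result matches the user's intent. The work also introduces the Gen-Arena benchmark and the EGIP metric. SCOPE outperforms all evaluated baselines on Gen-Arena and achieves strong results on WISE-V and MindBench.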

Details
English abstract

While text-to-image models have made strong progress in visual fidelity, faithfully realizing complex visual intents remains challenging because many requirements must be tracked across grounding, generation, and verification. We refer to these requirements as semantic commitments and formalize their lifecycle discontinuity as the Conceptual Rift, where commitments may be locally resolved or checked but fail to remain identifiable as the same operational units throughout the generation lifecycle. To address this, we propose SCOPE, a specification-guided skill orchestration framework that maintains semantic commitments in an evolving structured specification and conditionally invokes retrieval, reasoning, and repair skills around unresolved or violated commitments. To evaluate commitment-level intent realization, we introduce Gen-Arena, a human-annotated benchmark with entity- and constraint-level specifications, together with Entity-Gated Intent Pass Rate (EGIP), a strict entity-first pass criterion. SCOPE substantially outperforms all evaluated baselines on Gen-Arena, achieving 0.60 EGIP, and further achieves strong results on WISE-V (0.907) and MindBench (0.61), demonstrating the effectiveness of persistent commitment tracking for complex image generation.

2605.08040 2026-05-11 cs.HC

ECNUClaw: A Learner-Profiled Intelligent Study Companion Framework for K-12 Personalized Education

Yizhou Zhou, Jiayin Li, Zhi Zhang

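AI Summary This paper introduces ECNUClaw, an open-source framework for building personalized intelligent study companions for K-12 education. By analyzing dialogues between students and the companion, it builds a learner profile covering five dimensions (cognitive, behavioral, emotional, metacognitive, and contextual) and adjusts guidance intensity, encouragement frequency, and knowledge-scaffolding strategy in real time. The design draws on three theories from the Chinese educational technology literature, is implemented in Python, and supports multiple Chinese LLM providers.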

Comments 14 pages, 6 figures

Details
English abstract

We introduce ECNUClaw, an open-source framework for building learner-profiled intelligent study companions in K-12 education. The system constructs and maintains a five-dimension learner profile -- covering cognitive, behavioral, emotional, metacognitive, and contextual dimensions -- by extracting signals from student-companion dialogues at each turn. Profile updates feed directly into an adaptive strategy engine that adjusts the companion's guidance intensity, encouragement frequency, and Bloom's taxonomy scaffolding in real time. The framework design draws on three theoretical strands from the Chinese educational technology literature: Zhang's Digital Portrait Three-Layer Framework for learner assessment, the Education Brain model for educational system architecture, and the Human-AI Collaborative IQ concept for companion design philosophy. ECNUClaw is implemented in Python and supports seven Chinese LLM providers through a unified OpenAI-compatible adapter layer. We describe the system architecture, the profiling and adaptation mechanisms, and discuss limitations and next steps. The source code is available at https://github.com/bushushu2333/ECNUClaw.

2605.08038 2026-05-11 math.NA cs.NA

Invariant domain preserving limiting of time explicit and time implicit discretizations for systems of conservation laws

Bartolomeo Fanizza, Florent Renac

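AI Summary This paper studies a limiting technique that preserves invariant domains for high-order numerical solutions of nonlinear hyperbolic systems of conservation laws. The method applies to any conservative space discretization and to explicit and implicit time integration schemes. It limits the high-order solution around a low-order solution known to preserve the invariant domains, which keeps the numerical solution physically admissible. The method generalizes the flux-corrected transport limiter and defines its limiting coefficients within the convex limiting framework. Examples with finite volume and discontinuous Galerkin discretizations under explicit and implicit time stepping are given, together with numerical experiments.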

Details
English abstract

This work concerns the design and analysis of a limiting technique that allows the preservation of invariant domains for high-order numerical approximations of nonlinear hyperbolic systems of conservation laws. The method can be applied to any conservative discretization method in space as well as to a wide range of explicit and implicit time integration schemes. The method limits the high-order solution around a low-order accurate solution that is known to preserve all the invariant domains. It generalizes the flux-corrected transport limiter [J. P. Boris and D. L. Book, J. Comput. Phys., 11, 1973; S. T. Zalesak, J. Comput. Phys., 31, 1979] to systems of conservation laws and relies on the limitation of antidiffusive fluxes, but defines the limiting coefficients so as to express the limited solution as a convex combination of invariant domain preserving quantities similarly to the convex limiting framework [Guermond et al., Comput. Methods Appl. Mech. Engrg., 347, 2019]. We give details on the derivation of this limiting technique and provide some illustration with finite volume or discontinuous Galerkin (DG) space discretizations associated to explicit or implicit Runge-Kutta methods as well as to time DG integrations. The limiter is applied iteratively to refine the limited solution around the high-order one, while preserving the invariant domains, and a heuristic is proposed to accelerate its convergence. Numerical experiments solving one- and two-dimensional problems involving scalar hyperbolic equations and the compressible Euler equations are presented to illustrate the properties of these schemes.

2605.08037 2026-05-11 cs.LG cs.AI

Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph

Ning Liu, Chuanneng Sun, Kristina Klinkner, Shervin Malmasi

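AI Summary Direct Preference Optimization (DPO) aligns language models through pairwise preference comparisons and is a simple, effective alternative to reinforcement learning from human feedback. In many practical settings, however, each prompt comes with multiple rollouts that induce a rich preference structure, which pairwise DPO cannot fully exploit. The paper proposes Graph Direct Preference Optimization (GraphDPO), which operates on directed acyclic preference graphs, aggregates preference information over the graph, enforces transitivity, and recovers standard DPO as a special case. An equivalence-class construction and ground-truth anchoring are added to improve optimization stability and efficiency. Experiments on reasoning and program synthesis tasks show improved performance, suggesting that graph-structured preference modeling is scalable and robust for alignment.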

Details
English abstract

Direct Preference Optimization (DPO) aligns language models using pairwise preference comparisons, offering a simple and effective alternative to Reinforcement Learning (RL) from human feedback. However, in many practical settings, training data consists of multiple rollouts per prompt, inducing rich preference structure that pairwise DPO fails to exploit. Collapsing such data into independent pairs discards transitivity, introduces redundant or conflicting supervision, and can lead to unstable optimization. We propose Graph Direct Preference Optimization (GraphDPO), a principled generalization of DPO that operates over directed acyclic preference graphs induced by rollout rankings. GraphDPO encodes dominance relations as edges and optimizes a graph-structured Plackett--Luce-inspired objective that aggregates supervision over graph neighborhoods, enforcing transitivity while recovering standard DPO as a special case. To handle discrete or sparse signals, we introduce an equivalence-class construction where responses with identical preferences form graph layers, and intra-layer edges contribute zero loss, preventing spurious gradients. Despite leveraging full graph structure, GraphDPO maintains linear per-prompt complexity via efficient log-sum-exp aggregation. We further incorporate optional ground-truth anchoring by inserting verified solutions as dominant nodes and applying an annealed schedule that stabilizes early training while gradually relaxing oracle supervision. Experiments on reasoning and program synthesis tasks demonstrate superior performance, suggesting that graph-structured preference modeling is a scalable and robust alternative to pairwise and listwise alignment objectives.
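
One way to read the layered, log-sum-exp formulation is sketched below. This is an assumed objective written for illustration, not the loss from the paper: each response is taken to carry a scalar score (for example the DPO implicit reward, beta times the policy-to-reference log-probability ratio), responses are grouped into layers from best to worst, and each response competes only against responses in strictly lower layers. The function names are hypothetical.

```python
import math

def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b)), allowing -inf inputs."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def layered_preference_loss(layers):
    """layers: list of lists of scores, best layer first.

    Each response i with at least one dominated response contributes
    -log( exp(s_i) / (exp(s_i) + sum over lower layers of exp(s_j)) ).
    """
    lower_lse = -math.inf  # running log-sum-exp over all lower layers
    total, count = 0.0, 0
    for layer in reversed(layers):  # walk from the worst layer upward
        if lower_lse != -math.inf:
            for s in layer:
                total += logaddexp(s, lower_lse) - s
                count += 1
        # Fold this layer in only after scoring it, so responses in the
        # same layer never compete and contribute no loss to each other.
        for s in layer:
            lower_lse = logaddexp(lower_lse, s)
    return total / max(count, 1)
```

With exactly two responses in two layers, the term reduces to `-log(sigmoid(s_w - s_l))`, the pairwise DPO loss, and the single running log-sum-exp keeps the cost linear in the number of responses per prompt.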

2605.08036 2026-05-11 cs.LG

Don't Get Your Kroneckers in a Twist: Gaussian Processes on High-Dimensional Incomplete Grids

Mads Greisen Højlund, August Smart Lykke-Møller, Henry Moss, Ove Christiansen

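AI Summary This paper presents CUTS-GPR, a new method for numerically exact Gaussian process regression in high dimensions. By combining an additive kernel with an incomplete grid, it obtains an extremely fast kernel matrix-vector product whose cost scales near-linearly or linearly with the amount of training data and as a low-order polynomial in the dimensionality. The matrix-vector product scales to benchmarks with billions of data points and thousands of dimensions, and full Gaussian process calculations, including hyperparameter optimization, finish within hours. The method enables Bayesian modeling of high-dimensional potential energy surfaces.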

Comments 51 pages, 8 figures

Details
English abstract

We introduce CUTS-GPR, a new method for performing numerically exact Gaussian process regression (GPR) in high-dimensional settings. The key component of CUTS-GPR is an extremely fast kernel matrix-vector product, which exhibits near-linear or even linear scaling with the amount of training data, $N$, and low-order polynomial scaling with dimensionality, $D$. This is obtained by combining an additive kernel with an incomplete grid and exploiting the resulting structure of the kernel matrix. We demonstrate the scalability of the matrix-vector product by running benchmarks with billions of data points and thousands of dimensions. Full GPR calculations, including hyperparameter optimization, are completed in a matter of hours for $N = 447\,265$ and $D = 24$. We demonstrate that CUTS-GPR enables Bayesian modeling of high-dimensional potential energy surfaces, a longstanding challenge in computational chemistry.
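
To see why an additive kernel on a grid admits a fast matrix-vector product, consider the simplest case, which is not the setting of the paper: a complete two-dimensional grid and a first-order additive kernel k((i, j), (p, q)) = K1[i][p] + K2[j][q]. The sketch below (function names hypothetical) compares the structured product with the dense one; the incomplete-grid and higher-dimensional machinery of CUTS-GPR is not reproduced here.

```python
def additive_grid_matvec(K1, K2, V):
    """Structured product for k = K1[i][p] + K2[j][q] on a full n1 x n2 grid.

    V[p][q] holds the vector entry at grid point (p, q). Cost is
    O(n1^2 + n2^2 + n1*n2) instead of O((n1*n2)^2) for the dense product.
    """
    n1, n2 = len(K1), len(K2)
    row_sums = [sum(V[p]) for p in range(n1)]                        # sum over q
    col_sums = [sum(V[p][q] for p in range(n1)) for q in range(n2)]  # sum over p
    a = [sum(K1[i][p] * row_sums[p] for p in range(n1)) for i in range(n1)]
    b = [sum(K2[j][q] * col_sums[q] for q in range(n2)) for j in range(n2)]
    return [[a[i] + b[j] for j in range(n2)] for i in range(n1)]

def dense_matvec(K1, K2, V):
    """Reference product that forms every kernel entry explicitly."""
    n1, n2 = len(K1), len(K2)
    return [
        [
            sum((K1[i][p] + K2[j][q]) * V[p][q]
                for p in range(n1) for q in range(n2))
            for j in range(n2)
        ]
        for i in range(n1)
    ]
```

The structured version never builds the full kernel matrix; it works only with the per-dimension factors and marginal sums of the vector, which is the kind of structure a grid-based method can exploit.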

2605.08035 2026-05-11 eess.SP cs.LG

PropSplat: Map-Free RF Field Reconstruction via 3D Gaussian Propagation Splatting

William Bjorndahl, Maninder Pal Singh, Farhad Nouri, Joseph Camp

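AI Summary PropSplat is a map-free radio propagation modeling method that reconstructs radio frequency (RF) fields with 3D anisotropic Gaussian primitives and learns the propagation environment from sparse RF measurements. Each Gaussian encodes a path loss offset relative to a baseline model with a learnable path loss exponent; the Gaussians are initialized along observed transmitter-receiver paths and optimized end to end without external information such as floor plans or terrain databases. In the reported experiments, PropSplat outperforms existing methods in outdoor signal strength prediction and in indoor localization, showing that accurate propagation modeling from sparse measurements is feasible.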

Comments Accepted for presentation at IEEE DySPAN 2026

Details
English abstract

Building a site-specific propagation model typically requires either ray-tracing over detailed 3D maps or dense measurement campaigns. Both approaches are expensive and often infeasible for rapid deployments where geographic data is unavailable or outdated. We present PropSplat, a map-free propagation modeling method that reconstructs radio frequency (RF) fields using 3D anisotropic Gaussian primitives. Each Gaussian encodes a scalar path loss offset relative to an explicit baseline path loss model with a learnable path loss exponent. Gaussians are initialized along observed transmitter--receiver paths and optimized end-to-end to learn the propagation environment without external information like floor plans, terrain databases, or clutter data. We evaluate PropSplat against wireless radiance field methods NeRF$^2$, GSRF, and WRF-GS+ on two real-world datasets. On large-scale outdoor drive-tests spanning multiple topographical regions at six sub-6 GHz frequencies, PropSplat achieves 5.38 dB RMSE when training measurements are spaced 300m apart and outperforms WRF-GS+ (5.87 dB), GSRF (7.46 dB), and NeRF$^2$ (14.76 dB). On indoor Bluetooth Low Energy measurements, PropSplat achieves 0.19m mean localization error, an order of magnitude better than NeRF$^2$ (1.84m), while achieving near-identical received signal strength prediction accuracy. These results show that accurate site-specific propagation reconstruction is achievable from sparse RF-native measurements. The need for geographic data as a prerequisite for scalable RF environment modeling is reduced.
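
The structure "baseline path loss plus Gaussian offsets" can be sketched as below. Everything beyond that structure is an assumption made for illustration: the log-distance baseline with a 40 dB reference loss at 1 m, axis-aligned Gaussians, and averaging the offsets over sample points on the straight transmitter-receiver segment are choices of this sketch, not details taken from the paper, and the names are hypothetical.

```python
import math

def baseline_path_loss(distance_m, exponent, ref_loss_db=40.0, ref_distance_m=1.0):
    """Log-distance model: PL(d) = PL0 + 10 * n * log10(d / d0), in dB."""
    return ref_loss_db + 10.0 * exponent * math.log10(distance_m / ref_distance_m)

def gaussian_weight(point, mean, scales):
    """Unnormalised axis-aligned anisotropic Gaussian evaluated at `point`."""
    return math.exp(
        -0.5 * sum(((p - m) / s) ** 2 for p, m, s in zip(point, mean, scales))
    )

def predict_path_loss(tx, rx, exponent, gaussians, n_samples=32):
    """Baseline loss plus the mean Gaussian offset along the Tx-Rx segment.

    gaussians: list of dicts with keys "offset_db", "mean", "scales"
    (an assumed parameterisation).
    """
    distance = math.dist(tx, rx)
    offset = 0.0
    for i in range(n_samples):
        t = (i + 0.5) / n_samples  # midpoint sampling along the segment
        point = [a + t * (b - a) for a, b in zip(tx, rx)]
        offset += sum(
            g["offset_db"] * gaussian_weight(point, g["mean"], g["scales"])
            for g in gaussians
        )
    return baseline_path_loss(distance, exponent) + offset / n_samples
```

In a learned version, the exponent and each Gaussian's offset, mean, and scales would be the trainable parameters, fitted to measured path loss along the observed links.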

2605.08034 2026-05-11 stat.ML cs.LG

Semiparametric Efficient Test for Interpretable Distributional Treatment Effects

Houssam Zenati, Arthur Gretton

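AI Summary This study proposes DR-ME, a semiparametrically efficient test for detecting interpretable distributional treatment effects. From observational data it identifies where in the outcome distribution a treatment has an effect, rather than reporting only a global difference: it learns key outcome locations and combines them with orthogonal doubly robust kernel features to detect changes in tails, modes, and other features of the distribution. Experiments show near-nominal Type-I error and competitive power, and the learned locations localize distributional effects in a semi-synthetic medical-imaging study.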

Details
English abstract

Distributional treatment effects can be invisible to means: a treatment may preserve average outcomes while changing tails, modes, dispersion, or rare-event probabilities. Kernel tests can detect discrepancies between interventional outcome laws, but global tests do not reveal where the laws differ. We propose DR-ME, to our knowledge the first semiparametrically efficient finite-location test for interpretable distributional treatment effects. DR-ME evaluates an interventional kernel witness at learned outcome locations, returning causal-discrepancy coordinates rather than only a global rejection. From observational data, we derive orthogonal doubly robust kernel features whose centered oracle form is the canonical gradient of this finite witness. For fixed locations, we characterize the local testing limit: DR-ME is chi-square calibrated under the null, has noncentral chi-square local power, and uses the covariance whitening that optimizes local signal-to-noise for discrepancies visible through the selected coordinates. This efficient local-power geometry yields a principled location-learning criterion, with sample splitting preserving post-selection validity. Experiments show near-nominal type-I error, competitive power against global doubly robust kernel tests, and interpretable learned locations that localize distributional effects in a semi-synthetic medical-imaging study.

2605.08031 2026-05-11 cs.CV

Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models

Kaidi Jia, Yujie Lin, Chengyi Yang, Jiayao Ma, Jinsong Su

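AI Summary Vision-language models (VLMs) raise concerns about privacy, copyright, and bias, which motivates research on removing sensitive knowledge from them. Existing methods mainly fine-tune the language decoder, but this has limited effect: it fails to erase the underlying visual representations and can introduce object hallucination. This paper proposes HFRU, a reinforcement-learning-based unlearning framework that operates directly on the vision encoder for deep semantic removal. Its two-stage strategy combines alignment disruption with GRPO-based optimization, improving forgetting while keeping object hallucination negligible. Experiments on object recognition and face identity tasks report over 98% forgetting and retention performance, outperforming prior methods.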

Details
English abstract

Vision-language models (VLMs) raise growing concerns about privacy, copyright, and bias, motivating machine unlearning to remove sensitive knowledge. However, existing methods primarily fine-tune the language decoder, leading to superficial forgetting that fails to erase underlying visual representations and often introduces object hallucination. We propose HFRU, a reinforcement unlearning framework that operates on the vision encoder for deep semantic removal. Our two-stage approach combines alignment disruption with GRPO-based optimization using a composite reward, including an abstraction reward that encourages semantically valid substitutions and mitigates hallucinations. Experiments on object recognition and face identity tasks show that HFRU achieves over 98% forgetting and retention performance, while introducing negligible object hallucination, significantly outperforming prior methods. Our code and implementation details are available at https://github.com/XMUDeepLIT/HFRU.

2605.08030 2026-05-11 cs.CV cs.LG

PET-Adapter: Test-Time Domain Adaptation for Full and Limited-Angle PET Image Reconstruction

Rüveyda Yilmaz, Yuli Wu, Johannes Stegmaier, Volkmar Schulz

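AI Summary This study proposes PET-Adapter, a test-time domain adaptation framework that improves how deep learning models for positron emission tomography (PET) image reconstruction generalize to unseen clinical data. The generative model is pretrained only on phantom data and is then adapted to clinical data with different anatomies, tracers, and scanners without paired ground truth. With layer-wise low-rank anatomical conditioning and a warm start based on Ordered Subset Expectation Maximization, PET-Adapter improves reconstruction efficiency and quality; experiments show superior 3D reconstruction performance in both full-angle and limited-angle acquisitions.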

Details
English abstract

Positron Emission Tomography (PET) image reconstruction is inherently challenged by Poisson noise and physical degradation factors, which are further exacerbated in limited-angle acquisitions. While deep learning methods demonstrate promising performance, their generalization to unseen clinical data distributions remains limited without extensive retraining. We propose PET-Adapter, a test-time domain adaptation framework for generative PET reconstruction models pretrained solely on phantom data. Our method enables adaptation to clinical datasets with varying anatomies, tracers, and scanner configurations without requiring paired ground truth. PET-Adapter introduces layer-wise low-rank anatomical conditioning during adaptation and Ordered Subset Expectation Maximization-based warm-starting that initializes the generation from physics-informed reconstructions, reducing diffusion steps from 50 to 2 without compromising quality. Experiments across multiple clinical datasets demonstrate superior 3D reconstruction performance in both full-angle and limited-angle settings, highlighting the clinical feasibility and computational efficiency of the proposed approach.

2605.08029 2026-05-11 cs.CV cs.LG

STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

Ying Shen, Tianrong Chen, Yuan Gao, Yizhe Zhang, Yuyang Wang, Miguel Ángel Bautista, Shuangfei Zhai, Joshua M. Susskind, Jiatao Gu

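AI Summary This paper presents STARFlow2, a unified multimodal generation framework that combines language models with normalizing flows to address the structural mismatch between text generation and image generation in existing approaches. Built on the Pretzel architecture, it vertically interleaves a pretrained vision-language model stream with an image-generation flow stream through residual skip connections, both operating under the same causal mask, which allows interleaved text and image generation. STARFlow2 uses a deep-shallow flow design and a unified FAE latent space to support cache-friendly generation. Experiments show strong performance on image generation and multimodal understanding benchmarks, supporting autoregressive flows as a foundation for unified multimodal modeling.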

Comments 19 pages, 9 figures

Details
English abstract

Deep generative models have advanced rapidly across text and vision, motivating unified multimodal systems that can understand, reason over, and generate interleaved text-image sequences. Most existing approaches combine autoregressive language modeling with diffusion-based image generators, inheriting a structural mismatch between causal text generation and iterative visual denoising. We observe that autoregressive normalizing flows are autoregressive Transformers--sharing the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs--making them the most natural paradigm for true unified multimodal generation. We present STARFlow2, built on the Pretzel architecture that vertically interleaves a pretrained VLM stream with a TarFlow stream via residual skip connections, both operating under the same causal mask. Combined with a deep-shallow flow design and a unified FAE latent space, STARFlow2 enables cache-friendly interleaved generation where both text and visual outputs directly enter the KV-cache without re-encoding. Experiments demonstrate strong performance across image generation and multimodal understanding benchmarks, validating autoregressive flows as a viable foundation for unified multimodal modeling.