arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.25572 2026-05-26 cs.CL cs.AI

PennySynth: RAG-Driven Data Synthesis for Automated Quantum Code Generation

Minghao Shao, Nouhaila Innan, Hariharan Janardhanan, Muhammad Kashif, Alberto Marchisio, Muhammad Shafique

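AI summary: Proposes PennySynth, a framework that uses retrieval-augmented generation and code-aware embeddings over a dataset of 13,389 PennyLane instruction-code pairs, reaching 52%-68% pass@5 on QHack competition challenges and markedly improving the structural validity and functional correctness of generated quantum code.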

Comments: 11 pages, 3 figures

English abstract

The growing complexity of quantum programming frameworks has exposed a critical limitation in existing large language model (LLM)-based code assistants: general-purpose models hallucinate PennyLane-specific gate names, misplace device configurations, and produce structurally invalid circuits when faced with specialized quantum coding challenges. We present PennySynth, a retrieval-augmented generation framework that addresses this gap by conditioning LLM inference on a curated knowledge base of 13,389 PennyLane instruction-code pairs, built via a three-stage extraction, verification, and deduplication pipeline over official PennyLane repositories, community GitHub sources, and QHack competition archives. PennySynth introduces a code-aware embedding strategy using st-codesearch-distilroberta-base, trained for natural-language-to-code retrieval, increasing average retrieval cosine similarity from 0.45 to 0.726 compared to a general-purpose baseline. Evaluated across 74 challenges spanning three years of the QHack competition (2022, 2023, 2024), PennySynth achieves 64%, 68%, and 52% pass@5 on QHack 2022, 2023, and 2024, respectively, improving over Claude Sonnet 4.6 without retrieval by +28, +25, and +28 percentage points. We further introduce a quantum-adapted CodeBLEU metric that upweights qml.* token patterns and show that structural code similarity and functional correctness capture distinct aspects of quantum code quality. Controlled ablations reveal that code-aware embeddings are the primary driver of retrieval performance, while dataset expansion and source composition provide additional gains when retrieval quality is sufficiently precise.
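
For readers unfamiliar with the pass@5 figures quoted above, the following is a minimal sketch of the widely used unbiased pass@k estimator. It is an assumption that PennySynth scores challenges this way; the abstract only reports the resulting percentages. The function names and the demo outcomes are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples of which c pass the tests, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k failing samples: every k-subset has a pass
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average pass@k over challenges; results holds (n_samples, n_correct) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical outcomes for four challenges with five samples each (not paper data).
demo_results = [(5, 0), (5, 1), (5, 5), (5, 0)]
demo_score = benchmark_pass_at_k(demo_results, k=5)
```

With n equal to k, as in the demo, a challenge counts as solved if any of its five samples passes, so the benchmark score is simply the fraction of solved challenges.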

2605.25571 2026-05-26 cs.CV

AnE: Pushing the Reasoning Frontier of Multimodal LLMs via Anchor Evolution

Zehao Wang, Yihan Zeng, Zidong Gong, Yuanfan Guo, Feng Zhu, Hongzhi Zhang, Wei Zhang, Wangmeng Zuo

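AI summary: Proposes the Anchor Evolution (AnE) paradigm, which uses truth-anchored data curation and a scaffold-stripping mechanism to address cognitive drift and hallucinated reasoning paths in multimodal LLM reasoning, substantially improving reasoning performance.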

Comments: 34 pages, 10 figures

English abstract

Post-training via Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) is crucial for enhancing reasoning in Multimodal Large Language Models (MLLMs), yet existing paradigms often reach a performance bottleneck due to the limitations of static data. While current methods leverage self-reflection or self-evolution to push these boundaries, they still suffer from cognitive drift and hallucinated reasoning paths caused by low-quality synthetic data. To address these challenges, we propose Anchor Evolution (AnE), a new paradigm that integrates truth-anchored data curation and model evolution, achieving faithful and steady performance gains at the reasoning frontier. Specifically, we propose Truth Anchor Expansion, which pinpoints the model's failure frontier via trajectory rollouts and leverages ground-truth databases to retrieve high-fidelity anchors for faithful data curation. Subsequently, we introduce the Scaffold-Stripping Mechanism to internalize reasoning capabilities. This mechanism first anchors reasoning paths via scaffold-augmented supervision to mitigate the learning complexity and distribution drift of direct SFT on raw data, then leverages RL to strip the scaffold template, thereby effectively transitioning the reasoning paths into intrinsic model capabilities. Experimental results on multimodal reasoning benchmarks show that our method substantially advances the model performance frontier, improving the base model by 10.3% across eight multimodal benchmarks and achieving state-of-the-art results. The code will be made publicly available.

2605.25568 2026-05-26 cs.CV

Rethinking Scribble-Guided Image Editing: Generalization, Instruction Adherence, and Multi-Tasking

Mingyi Xu, Jinpeng Lin, Min Zhou, Tiezheng Ge, Ming Zeng

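AI summary: To address the unstable performance of scribble-guided image editing in multi-task scenarios, an empirical study identifies instruction-level generalization as the bottleneck, and three strategies are proposed (a Coverage-then-Realism Curriculum, Multi-Task Mosaicking, and an Edit-Focused Loss), achieving state-of-the-art single-task and multi-task results on the VIBE benchmark.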

English abstract

Scribble-guided image editing allows users to combine simple scribble annotations with text prompts to specify both where and how an image should be edited, enabling flexible interaction with precise spatial control. However, existing models still exhibit unstable performance under this paradigm, especially in multi-task scenarios. To improve performance, we conduct empirical studies using an open-source editing model and reveal an asymmetry in generalization: instruction-level generalization, including across editing tasks and from single-task to multi-task settings, is more challenging than image-domain generalization, such as from synthetic to real-world images or from mosaicked to regular images. This suggests that the primary bottleneck lies in insufficient learning for diverse editing instructions rather than in the image domain gap. Motivated by this insight, we propose three strategies: (a) a Coverage-then-Realism Curriculum, a two-stage pipeline that first builds large-scale synthetic, instruction-rich data for broad task supervision, then curates a small set of real-world data to refine generation realism; (b) Multi-Task Mosaicking, which constructs multi-task training samples by concatenating single-task examples at nearly zero cost while enabling the learned capability to generalize to non-mosaicked images; and (c) an Edit-Focused Loss, which leverages the changed regions between input and output images in synthetic data to focus training on edited regions, improving both learning efficiency and editing accuracy. With these strategies, we substantially improve both single-task and multi-task scribble-guided editing on the VIBE benchmark, achieving state-of-the-art results. We will publicly release our dataset and model.
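
The Edit-Focused Loss in strategy (c) can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: grayscale images stored as nested lists, a fixed change threshold, a squared-error base loss, and the `edit_weight` factor. The abstract does not specify the actual loss form used in training.

```python
def change_mask(src, dst, threshold=0.05):
    """Return 1.0 where a pixel differs between input and output by more than
    threshold, else 0.0 (hypothetical way of locating the edited region)."""
    return [[1.0 if abs(a - b) > threshold else 0.0 for a, b in zip(row_s, row_d)]
            for row_s, row_d in zip(src, dst)]


def edit_focused_loss(pred, target, mask, edit_weight=5.0):
    """Weighted mean squared error in which edited pixels count edit_weight times."""
    weighted_error = 0.0
    total_weight = 0.0
    for row_p, row_t, row_m in zip(pred, target, mask):
        for p, t, m in zip(row_p, row_t, row_m):
            w = 1.0 + (edit_weight - 1.0) * m
            weighted_error += w * (p - t) ** 2
            total_weight += w
    return weighted_error / total_weight


# Toy 2x2 example: only the top-right pixel is edited in the synthetic pair.
source_image = [[0.0, 0.0], [0.0, 0.0]]
edited_image = [[0.0, 1.0], [0.0, 0.0]]
prediction = [[0.0, 0.5], [0.5, 0.0]]

mask = change_mask(source_image, edited_image)
focused = edit_focused_loss(prediction, edited_image, mask)
uniform = edit_focused_loss(prediction, edited_image, mask, edit_weight=1.0)
```

In the toy case the two wrong pixels have equal squared error, but the one inside the edited region dominates the focused loss, which is the intended effect of concentrating training on changed regions.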

2605.25566 2026-05-26 cs.AI

Uncertainty Reasoning with Large Language Models for Explainable Disease Diagnosis

Xiaoyang Fan, Yufan Cai, Zhe Hou, Jin Song Dong

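AI summary: Proposes a neuro-symbolic reasoning framework that combines large language models with fuzzy logic and declarative rules to enable explainable and formally verifiable medical diagnosis.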

English abstract

Clinical decision-making requires reasoning over incomplete, imprecise, and linguistically expressed patient narratives. While large language models (LLMs) excel at extracting latent information from natural language, they lack the verifiability and interpretability essential for trustworthy medical AI. We propose a neuro-symbolic reasoning framework that aligns LLMs with formal logic to enable explainable and formally verifiable medical diagnosis. Patient descriptions and clinical guidelines are embedded into a neural knowledge base, where LLMs extract structured medical entities, temporal relations, and fuzzy symptom patterns, which are decoded into a symbolic knowledge base expressed in fuzzy logic and declarative rules. We perform two-stage reasoning: (1) inductive symbolic generalization to capture diagnostic patterns from encoded narratives, and (2) inference verification via a logic programming engine to derive and validate diagnoses consistent with clinical standards. Each symptom is treated as a fuzzy predicate with probabilistic weights, and inference paths are auditable, adjustable, and compatible with physician feedback. Unlike purely statistical methods, our system supports iterative refinement: misalignment between LLM-generated diagnoses and ground truth can be traced, explained, and corrected through formal rules. By combining logic-based transparency, LLM adaptability, and probabilistic robustness, the framework enables human-aligned healthcare inference with strong generalization and verifiable, step-by-step reasoning chains. We validate our framework on public benchmarks, demonstrating effective reconciliation of symbolic reasoning and LLMs with real-world clinical narratives. Results show performance comparable to state-of-the-art LLMs, while additionally providing interpretable reasoning paths and formally verifiable diagnostic conclusions.

2605.25565 2026-05-26 cs.LG cs.CL

RotMoLE: Enhancing Mixture of Low-Rank Experts through Rotational Gating Mechanism

Mengyang Sun, Maochuan Dou, Tao Feng, Dan Zhang, Yihao Wang, Junpeng Liu, Yifan Zhu, Jie Tang

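AI summary: Conventional gating in MoE-LoRA applies only scalar weighting, which limits representational capacity. RotMoLE adds a rotation gate that applies a rotation to each selected expert, improving expert utilization and specialization; its effectiveness is validated in multi-task and multilingual training.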

English abstract

While Large Language Models (LLMs) are commonly fine-tuned to handle domain-specific tasks before being applied to vertical applications, adapting them to complex scenarios with diverse specialized knowledge remains challenging. Meanwhile, the Mixture-of-Experts (MoE) architecture has emerged as a crucial paradigm for training LLMs, and some recent works have incorporated MoE into Parameter-Efficient Fine-Tuning (PEFT), proposing the Mixture of Low-rank Experts (MoE-LoRA) to enhance the power of low-rank adapters for learning complicated knowledge. However, conventional gating mechanisms in MoE typically apply only a scalar reweighting to selected experts, thereby limiting their capacity for representation and generalization. Motivated and enabled by the low-rank structures in MoE-LoRA, we propose RotMoLE, a specialized MoE framework for low-rank experts featuring an additional rotation gate. Beyond simple scaling, RotMoLE implements a rotation mechanism for each selected expert, enabling superior expert exploitation and specialization for learning diverse data, especially when expert candidates are limited. Empirical results on complex multi-task and multilingual training scenarios validate the effectiveness of our approach.

2605.25563 2026-05-26 cs.CV

CodecSplat: Ultra-Compact Latent Coding for Feed-Forward 3D Gaussian Splatting

Pengpeng Yu, Runqing Jiang, Qi Zhang, Dingquan Li, Jing Wang, Yulan Guo

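AI summary: Proposes CodecSplat, which integrates compression into the feed-forward Gaussian generation pipeline and exploits the structured intermediate feature representation to achieve ultra-compact scene coding, substantially reducing storage and transmission overhead.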

English abstract

While feed-forward 3D Gaussian splatting reconstructs renderable Gaussian primitives from sparse context views without per-scene optimization, existing pipelines do not provide a compact scene representation for storage or transmission. A natural solution is to apply existing 3DGS compression methods to the generated Gaussian primitives. However, this approach operates on the final irregular 3D representation and is decoupled from the internal feature-to-Gaussian generation process, which limits compression efficiency. To address this, we introduce CodecSplat, an ultra-compact latent coding framework for feed-forward 3D Gaussian splatting. CodecSplat first encodes an intermediate 2D Gaussian-generation feature into an entropy-coded scene bitstream. At the decoder, the latent feature is reconstructed and used to predict depth and Gaussian parameters, which are then mapped to 3D Gaussian primitives. Note that, by integrating compression into the feed-forward Gaussian generation pipeline, CodecSplat avoids inefficient compression over irregular 3D Gaussian primitives and allows the codec to exploit the structured intermediate feature representation. We instantiate CodecSplat on a feed-forward Gaussian splatting backbone with depth-guided multi-view feature refinement and a hierarchical learned feature codec. On DL3DV and RealEstate10K datasets, CodecSplat achieves 23.56-26.36 dB and 24.76-27.05 dB PSNR with only 20.00-107.77 KiB and 3.37-12.51 KiB per scene, respectively. This is roughly one order of magnitude smaller than compressing feed-forward generated Gaussian primitives, while preserving controllable rate-distortion behavior.

2605.25561 2026-05-26 cs.CV

Are We Overconfident in Models and Results for Semi-Supervised 3D Medical Image Segmentation?

Jun Li, Ziwei Qin

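AI summary: To address the confirmation bias of pseudo-labeling frameworks and the inflated performance caused by improper use of benchmark test sets in semi-supervised medical image segmentation, proposes a tri-space calibrated segmentation framework (TCSeg) built on dual-axis reliability assessment, which decouples confidence from uncertainty and corrects bias collaboratively.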

Comments: Accepted by ICML 2026

English abstract

Semi-supervised learning has become a dominant paradigm for reducing annotation costs. However, we argue that the current progress is clouded by a twofold overconfidence problem. Algorithmically, mainstream pseudo-labeling frameworks often conflate prediction confidence with uncertainty, leading to severe confirmation bias. Strategically, since multiple benchmark datasets lack dedicated validation sets, some studies use the test set for validation as well, leading to inflated performance estimates. Subsequent methods, compelled to employ the same strategy to surpass reported SOTA, trigger an arms race of overfitting. This raises concerns that the impressive numerical gains in the community may reflect overfitting rather than genuine progress. Thus, we propose a tri-space calibrated segmentation framework founded on a principled dual-axis reliability assessment engine. It explicitly decouples confidence from uncertainty and uses this signal to detect and correct confirmation bias across feature, probability, and image spaces in a collaborative manner. Across three benchmark datasets, TCSeg consistently delivers strong performance under existing evaluation protocols. More importantly, we advocate that the community report final-checkpoint results under multiple-run protocols, thereby establishing more rigorous benchmarks with a more realistic perspective. Code will be available: github.com/DirkLiii/TCSeg.

2605.25558 2026-05-26 cs.AI

Beyond Query Memorization: Large Language Model Routing with Query Decomposition and Historical Matching

Bo Lv, Jingbo Sun

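AI summary: Proposes the DecoR routing framework, which avoids the memorization trap through query capability decomposition and matching against historical logs, lowering inference cost while maintaining high accuracy.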

English abstract

Optimizing the trade-off between predictive performance and computational cost is a central focus in the deployment of Large Language Models (LLMs). Current routing methods primarily rely on direct mapping from queries to models based on surface-level features, making them susceptible to the memorization trap and leading to poor generalizability on out-of-distribution (OOD) data. In this paper, we propose DecoR, a novel routing framework that recasts the routing task as a matching process of sifting similar queries from historical logs, effectively mitigating the memorization trap. To enhance matching accuracy, we introduce a query capability deconstruction method that decouples linguistic surface forms from task-intrinsic requirements, directing matching toward capability dimensions to ground decisions in essential task attributes. Furthermore, we develop CodaSet, a comprehensive benchmark for assessing routing generalization, where experimental results demonstrate that DecoR maintains superior accuracy while substantially lowering inference costs across both in-distribution and OOD settings. All code and data are available at https://github.com/lvbotenbest/DecoR.

2605.25554 2026-05-26 cs.AI

PHGNet: Prototype-Guided Hypergraph Construction for Heterogeneous Spatiotemporal Forecasting

Ruiwen Gu, Yahao Liu, Zhenyu Liu, Qitai Tan, Xiao-Ping Zhang

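AI summary: Proposes PHGNet, a spatiotemporal forecasting framework based on prototype-guided hypergraph construction. A prototype learning mechanism adaptively assigns pattern-similar nodes to hyperedges to capture high-order interactions, while a global-local node representation module, iterative residual refinement, and Temporal Query Attention improve forecasting accuracy.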

English abstract

As a core task in intelligent transportation systems, traffic forecasting plays a critical role in urban traffic management. Accurate traffic forecasting relies on modeling complex spatiotemporal dependencies, which is inherently challenging due to spatial heterogeneity in traffic systems. Despite significant progress, most existing methods are still limited to pairwise spatial dependency modeling, making it difficult to capture dynamic high-order interactions among nodes with similar traffic patterns. To address this issue, we propose PHGNet, a novel spatiotemporal forecasting framework based on prototype-guided hypergraph construction. At the core of PHGNet, a prototype learning mechanism is designed to adaptively assign pattern-similar nodes to hyperedges, thereby capturing high-order interactions with time-varying structures. To improve the reliability of dynamic hypergraph construction, we further develop a global-local node representation module to extract time-consistent features. For forecasting, iterative residual refinement and Temporal Query Attention are introduced to improve forecasting accuracy while supporting efficient parallel decoding. Extensive experiments on multiple real-world datasets demonstrate that PHGNet achieves superior predictive performance compared with state-of-the-art methods.

2605.25553 2026-05-26 cs.CV cs.RO

ComPose: A Unified Completion-Pose Framework for Robust Category-Level Object Pose Estimation

Huan Ren, Yihan Chen, Chuxin Wang, Nailong Liu, Wenfei Yang, Tianzhu Zhang

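AI summary: Proposes ComPose, which tightly integrates shape completion with pose estimation through a keypoint-based progressive completion module and a geometric relation consistency loss, improving pose estimation accuracy and efficiency on incomplete point clouds without relying on category-level shape priors.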

Comments: Accepted by CVPR 2026 (Oral, Best Paper Award Candidate). Project page is available at renhuan1999.github.io/ComPose

English abstract

Category-level object pose estimation aims to predict the pose and size of arbitrary objects in specific categories. Existing methods struggle with the inherent incompleteness of observed point clouds, which limits their ability to capture complete object shapes for robust pose reasoning. While point cloud completion offers a promising solution, naively treating it as a separate preprocessing step for partial observations introduces compounding errors and additional computational overhead, ultimately hindering both accuracy and efficiency. To address these challenges, we propose ComPose, a novel unified framework that tightly integrates shape completion to provide complete geometric cues for enhanced pose estimation. At the core of ComPose is a keypoint-based progressive completion module, which recovers full shape representations by progressively predicting a sparse set of keypoints and their surrounding dense point sets, empowering the keypoints to capture holistic object geometries. A geometric relation encoding module further enriches keypoint features with both local and global geometric context. In addition, we introduce a novel geometric relation consistency loss to enforce structural alignment between observed keypoints and their predicted NOCS coordinates, ensuring globally coherent coordinate transformations. Extensive experiments on standard benchmarks demonstrate that our method outperforms state-of-the-art approaches without relying on category-level shape priors.

2605.25551 2026-05-26 cs.LG

Learning Permutation from Structure Without Supervision

Ran Eisenberg, Ofir Lindenbaum

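AI summary: Proposes an entropy-adaptive Gumbel-Sinkhorn method that modulates temperature locally to improve the stability and quality of unsupervised permutation learning.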

English abstract

Many learning problems require uncovering a hidden ordering that reveals structure in unordered data, such as monotonicity in sorting or spatial continuity in jigsaw reconstruction. In these settings, permutations can be learned as latent operators by optimizing objectives defined directly on the reordered output, often without access to ground-truth orderings. Differentiable relaxations such as Gumbel-Sinkhorn make this approach practical by approximating permutation matrices with doubly stochastic matrices. However, learning from structure without supervision induces a non-uniform uncertainty: some assignments become confident early, while others remain ambiguous. Existing methods control this process using a single global temperature, forcing all assignments to sharpen or diffuse simultaneously and leading to instability at scale. We introduce an entropy-adaptive formulation of Gumbel-Sinkhorn that locally modulates temperature based on assignment uncertainty. This allows confident assignments to discretize early while preserving exploration where uncertainty remains. Across sorting and jigsaw reconstruction tasks and in routing-style settings, adaptive entropy control improves training stability and final permutation quality relative to fixed-temperature baselines, particularly as problem size and assignment ambiguity increase.

2605.25549 2026-05-26 cs.CL cs.AI cs.LG

BC Protocol: Structured Dual-Expert Dialogue for Eliciting High-Quality Chain-of-Thought Post-Training Data

Bo Zou, Chao Xu

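AI summary: To address the production bottleneck of high-quality expert chain-of-thought data for LLM post-training, proposes the BC Protocol, a structured dual-expert elicitation method that pairs a domain expert with a knowledge engineer to systematically externalize the expert's implicit judgments as natural-language reasoning chains; experiments show an overwhelming advantage in the naturalness of the reasoning process.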

English abstract

High-quality expert chain-of-thought (CoT) data is one of the core bottlenecks in large language model (LLM) post-training. Existing data production methods each have structural limitations: crowdsourced annotation lacks deep reasoning paths; expert solo writing is constrained by the "expert blind spot", in which experts structurally skip reasoning steps they consider obvious; and RLHF produces only preference signals rather than reasoning chains. This paper proposes the BC Protocol, a structured dual-expert elicitation method for LLM post-training data production. The method carefully pairs a domain expert (crystallized intelligence) with a knowledge engineer (fluid intelligence), systematically externalizing the expert's implicit judgments as natural language reasoning chains. We introduce the Participant Aptitude Model, which defines six participant characteristic dimensions that affect elicitation quality. "Calibrated Ignorance" is an original concept proposed in this paper. We further propose "Selection-over-Prescription" as a methodological principle: for implicit knowledge elicitation tasks, investing quality-control resources in personnel selection yields a higher return than investing the same resources in process design. In a controlled experiment in the narrative fiction domain, we directly compared CoT produced by BC Protocol dual dialogue (Group A, n=20) against CoT written independently by the same domain expert (Group B, n=20). Three cross-vendor judge models (GPT-4o, Claude Opus 4.5, and Gemini 2.5 Pro) conducted blind evaluation across five dimensions (600 ratings total). Results show that the BC Protocol achieves an overwhelming advantage in "naturalness of reasoning process" (Group A mean 4.80 vs. Group B mean 1.30, p = 2.4 × 10^{-8}, Cliff's δ = 1.0).
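
To make the reported effect size concrete, here is a short sketch of Cliff's δ, which compares every rating in one group against every rating in the other. The ratings below are invented for illustration and are not the paper's data. The 600 ratings quoted in the abstract are consistent with 40 CoT samples × 3 judges × 5 dimensions.

```python
def cliffs_delta(group_a, group_b):
    """Cliff's delta: (#pairs with a > b minus #pairs with a < b) / (|A| * |B|)."""
    greater = sum(1 for a in group_a for b in group_b if a > b)
    less = sum(1 for a in group_a for b in group_b if a < b)
    return (greater - less) / (len(group_a) * len(group_b))


# Hypothetical 1-5 ratings; every Group A rating exceeds every Group B rating.
ratings_a = [5, 5, 4, 5, 5]
ratings_b = [1, 2, 1, 1, 2]
delta_separated = cliffs_delta(ratings_a, ratings_b)

# Overlapping groups give a value strictly between -1 and 1.
delta_overlap = cliffs_delta([1, 2, 3], [2, 3, 4])

total_ratings = 40 * 3 * 5
```

A value of δ = 1.0 therefore means complete separation: no Group B rating reaches any Group A rating on that dimension.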

2605.25548 2026-05-26 cs.LG cs.AI

'Si'multaneous 'S'patial-'T'emporal Message Passing for Dynamic Graph Representation Learning

Shubhajit Roy, Anirban Dasgupta

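AI summary: Proposes SiST-GNN, which fuses spatial and temporal signals in a single message-passing operation to enable joint reasoning in dynamic graph representation learning, surpassing the strongest prior methods on link prediction by 109%-277% in the fixed-split setting.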

English abstract

Dynamic graph neural networks (DGNNs) that operate on snapshot sequences typically fall into one of two categories. Temporal-first approaches build per-node temporal embeddings and only afterwards perform spatial aggregation, whereas spatial-first approaches invert this order, feeding the output of a graph convolution into a downstream temporal module. In either case, the rigid sequencing forces the second stage to consume an already-compressed summary produced by the first, ruling out joint reasoning over topology and evolution; concretely, the message-passing operator never gets to weight a neighbor's contribution by that neighbor's past trajectory. This paper introduces SiST-GNN (Simultaneous Spatial-Temporal GNN), which fuses the two signals inside a single message-passing operation rather than chaining them. Concretely, at each snapshot we maintain a recurrent hidden state per node that summarises its history, pair it with the node's current feature vector, and treat the pair as two nodes joined by a cross-time edge; running a standard graph convolution on this temporally augmented graph yields the updated representation. Our empirical study spans nine public baselines and fourteen model-dataset combinations, covering both fixed-split and live-update evaluation regimes. Across every public benchmark, SiST-GNN sets a new state of the art in link prediction, improving over the strongest prior method by 109%-277% in the fixed-split setting and by 68%-194% in the live-update setting. We additionally construct three dynamic node-classification tasks by discretising the underlying continuous-time event streams; here SiST-GNN beats the leading discrete-time (DTDG) baseline by 7%-22% and matches continuous-time (CTDG) methods that consume the raw events directly.
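
The temporally augmented graph described above can be sketched as follows. The wiring is an assumption for illustration: spatial edges connect only the current-feature nodes, each node gets one cross-time edge to its own history node, and the convolution is a parameter-free mean aggregation. The actual model presumably uses learned weights and a recurrent state update, neither of which is shown; all function names are hypothetical.

```python
def build_augmented_edges(num_nodes, spatial_edges):
    """Nodes 0..n-1 hold current features; nodes n..2n-1 hold history states.
    Returns directed edges covering both directions of every link."""
    edges = set()
    for i, j in spatial_edges:
        edges.add((i, j))
        edges.add((j, i))
    for i in range(num_nodes):
        history = num_nodes + i
        edges.add((i, history))  # cross-time edge joining the pair
        edges.add((history, i))
    return sorted(edges)


def mean_graph_conv(features, edges):
    """One convolution step: average each node with its neighbors (self-loop included)."""
    dim = len(features[0])
    sums = [row[:] for row in features]
    counts = [1] * len(features)
    for src, dst in edges:
        for d in range(dim):
            sums[dst][d] += features[src][d]
        counts[dst] += 1
    return [[v / c for v in row] for row, c in zip(sums, counts)]


def sist_step(current, hidden, spatial_edges):
    """Run one message-passing step on the augmented graph of a single snapshot."""
    n = len(current)
    edges = build_augmented_edges(n, spatial_edges)
    out = mean_graph_conv(current + hidden, edges)
    return out[:n], out[n:]


# Two nodes joined by one spatial edge, with one-dimensional toy features.
demo_edges = build_augmented_edges(2, [(0, 1)])
updated, new_hidden = sist_step([[1.0], [3.0]], [[0.0], [2.0]], [(0, 1)])
```

Each node's update mixes its spatial neighbor and its own history in the same aggregation, which is the contrast with the temporal-first and spatial-first pipelines. With this particular wiring a neighbor's history arrives only after a second step; connecting history nodes to spatial neighbors directly would be another plausible reading.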

2605.25547 2026-05-26 cs.RO cs.CV

TapSampling: Inference-Time Sampling with a Task-Progress-Understanding Verifier for Robotic Manipulation

Sizhe Zhao, Shengping Zhang, Shuo Yang, Weiyu Zhao, Shuigen Wang, Xiangyang Ji

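AI summary: Proposes the TapSampling framework, which samples candidate actions in a low-dimensional latent space with an Action-VAE and selects the best action using a task-progress prediction verifier, improving multiple generalist policies without finetuning.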

Comments: ICML 2026. Project Page: https://aipixel.github.io/TapSampling/

English abstract

Existing embodied control research demonstrates remarkable performance improvements by scaling training data and model size. We instead explore inference-time strategies as an alternative axis. Non-deterministic generative models, such as diffusion and autoregressive models, have been widely adopted in the field of embodied control. However, the single-shot inference paradigm limits their performance. In this paper, we propose TapSampling, a plug-and-play framework for inference-time sampling. First, we introduce an Action-VAE that represents actions in a low-dimensional latent space by mapping policy-generated initial actions into a compressed posterior distribution, from which any number of latent samples can be drawn and decoded into candidate actions that approximate the true action distribution. Second, we formulate action verification as task-progress outcome prediction, using the intrinsic sequential structure of robotic datasets to train a semantically grounded verifier for interpretable action selection. Furthermore, TapSampling is a policy-agnostic framework. Extensive experiments in both simulated and real-world environments demonstrate that our method substantially improves multiple generalist policies without further policy finetuning. Code and models are available at the project page.
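
The sample-then-verify loop can be sketched generically. The decoder and verifier below are stand-in functions, not the Action-VAE or the task-progress verifier from the paper, and the diagonal Gaussian posterior is an assumption. The sketch only shows the control flow of drawing several latent samples, decoding them, and keeping the candidate the verifier scores highest.

```python
import random


def sample_candidates(mean, std, decode, n, rng):
    """Draw n latent vectors from a diagonal Gaussian posterior and decode each
    one into a candidate action."""
    return [decode([rng.gauss(m, s) for m, s in zip(mean, std)]) for _ in range(n)]


def select_action(candidates, predict_progress):
    """Keep the candidate with the highest predicted task progress."""
    return max(candidates, key=predict_progress)


def toy_decode(latent):
    # Stand-in decoder: a fixed linear map from latent space to action space.
    return [2.0 * v for v in latent]


def toy_progress(action, goal=(1.0, -1.0)):
    # Stand-in verifier: actions closer to a hypothetical goal score higher.
    return -sum((a - g) ** 2 for a, g in zip(action, goal))


rng = random.Random(0)
candidates = sample_candidates([0.0, 0.0], [0.5, 0.5], toy_decode, 8, rng)
best_action = select_action(candidates, toy_progress)
```

Because selection happens purely at inference time, the underlying policy needs no additional training in this scheme, which matches the policy-agnostic claim in the abstract.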

2605.25546 2026-05-26 cs.RO

Safety-Critical Whole-Body Control for Humanoid Robots via Input-to-State Safe Control Barrier Functions

基于输入到状态安全控制屏障函数的人形机器人安全关键全身控制

Kwanwoo Lee, Sanghyuk Park, Gyeongjae Park, Myeong-Ju Kim, Jaeheung Park

AI总结 提出一种基于输入到状态安全控制屏障函数(ISSf-CBF)的分层安全关键全身控制框架,通过运动级全身控制器、ISSf-CBF安全滤波器和动力学级全身控制器,在存在未知扰动时保证人形机器人的运动学安全约束。

详情
Comments
14 pages, 6 figures
AI中文摘要

安全关键控制对于在复杂人类中心环境中运行的人形机器人至关重要,这些环境中的物理安全约束(如关节限位、自碰撞避免、障碍物避免和工作空间边界)必须在实际机器人操作中得到满足。然而,现有方法仍然有限,因为在存在未知扰动(如模型不确定性、轨迹跟踪误差和外部扰动)时,运动学安全保证可能会降低。本文提出了一种基于输入到状态安全控制屏障函数(ISSf-CBF)的人形机器人分层安全关键全身控制框架。所提出的架构集成了运动级全身控制器(KinWBC)、ISSf-CBF安全滤波器和动力学级全身控制器(DynWBC)。KinWBC根据优先级任务生成标称关节运动参考;ISSf-CBF滤波器最小程度地修改这些参考,以在有界扰动下满足运动学安全约束;DynWBC跟踪滤波后的参考,同时确保全身动力学可行性和接触稳定性。安全约束施加于全身运动学模型,并保守地调整ISSf-CBF参数,使得所得的运动学安全保证能够在未知扰动下传递到全阶人形机器人动力学。仿真和实际机器人实验表明,所提出的框架在模型失配下提高了安全裕度,并在行走、遥操作和带手控制的单腿平衡过程中实时可靠地强制执行多个安全约束。项目网站:https://kwlee365.github.io/SafeWBC-Website/

英文摘要

Safety-critical control is essential for humanoid robots operating in complex human-centered environments, where physical safety constraints such as joint limits, self-collision avoidance, obstacle avoidance, and workspace boundaries must be satisfied during real-robot operation. However, existing approaches remain limited because kinematic safety guarantees can be degraded in the presence of unknown disturbances, such as model uncertainties, trajectory-tracking errors, and external perturbations. This paper presents a hierarchical safety-critical whole-body control framework for humanoid robots based on input-to-state safe control barrier functions (ISSf-CBFs). The proposed architecture integrates a kinematic-level whole-body controller (KinWBC), an ISSf-CBF safety filter, and a dynamic-level whole-body controller (DynWBC). KinWBC generates nominal joint-motion references from prioritized tasks; the ISSf-CBF filter minimally modifies these references to satisfy kinematic safety constraints under bounded disturbances; and DynWBC tracks the filtered references while enforcing full-body dynamic feasibility and contact stability. Safety constraints are imposed on a whole-body kinematic model, and the ISSf-CBF parameters are conservatively tuned so that the resulting kinematic safety guarantees can be transferred to full-order humanoid dynamics under unknown disturbances. Simulation and real-robot experiments demonstrate that the proposed framework improves safety margins under model mismatch and reliably enforces multiple safety constraints in real time during locomotion, teleoperation, and single-leg balancing with hand control. Project website: https://kwlee365.github.io/SafeWBC-Website/
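
As a rough illustration of what "minimally modifying" a reference means, the sketch below filters a single velocity command against one constraint. It makes several assumptions that the abstract does not confirm: velocity-level kinematics (q_dot = u), a single barrier function h(q) >= 0, and one common form of the ISSf-CBF condition, grad_h . u >= -alpha * h + ||grad_h||^2 / eps. With one constraint the quadratic program has a closed-form solution; a multi-constraint filter would need a QP solver. The function name and parameter values are hypothetical.

```python
import numpy as np


def issf_cbf_filter(u_nom, h, grad_h, alpha=5.0, eps=10.0):
    """Return the command closest to u_nom (in Euclidean norm) that satisfies
    grad_h . u >= -alpha * h + ||grad_h||^2 / eps.

    The ||grad_h||^2 / eps term is the extra robustness margin of the assumed
    ISSf-CBF form; a smaller eps gives a more conservative filter.
    """
    a = np.asarray(grad_h, dtype=float)
    u_nom = np.asarray(u_nom, dtype=float)
    a_sq = float(a @ a)
    if a_sq == 0.0:
        return u_nom.copy()  # the command cannot influence this constraint
    b = -alpha * h + a_sq / eps
    violation = b - float(a @ u_nom)
    if violation <= 0.0:
        return u_nom.copy()  # nominal command already satisfies the condition
    return u_nom + (violation / a_sq) * a  # projection onto the safe half-space


# Joint-limit example: h = q_max - q = 0.1 rad, so grad_h = -1.
# A nominal velocity of 2.0 rad/s toward the limit is reduced to 0.4 rad/s.
u_safe = issf_cbf_filter([2.0], h=0.1, grad_h=[-1.0])
# A velocity moving away from the limit passes through unchanged.
u_free = issf_cbf_filter([-1.0], h=0.1, grad_h=[-1.0])
```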

2605.25543 2026-05-26 cs.AI

ADMFormer: An Adaptive-Decomposition Transformer with Time-Varying Masked Spatial Attention for Traffic Forecasting

ADMFormer:一种用于交通预测的具有时变掩码空间注意力的自适应分解Transformer

Ruiwen Gu, Qitai Tan, Yahao Liu, Xiao-Ping Zhang

AI总结 提出ADMFormer,通过自适应分解机制解耦交通序列中的稳定周期规律与事件驱动波动,并使用时变掩码空间注意力稀疏化动态空间依赖,实现交通预测的SOTA性能。

详情
AI中文摘要

准确的交通预测对于智能交通系统至关重要,支持广泛的现实应用。然而,由于两个关键因素,它仍然具有挑战性:(1)交通序列包含异质的时间模式,其中稳定的周期性规律与事件驱动的波动共存。现有方法通常将它们统一表示,限制了捕捉细粒度时间动态的能力。(2)节点间的空间依赖本质上是动态且稀疏的,而密集的全对注意力常常引入冗余交互并放大噪声。为了解决这些问题,我们提出了ADMFormer,一种具有时变掩码空间注意力的自适应分解Transformer。具体来说,ADMFormer首先采用时间-节点自适应门控机制将交通信号解耦为随时间与节点变化的主导规律和残余波动。然后设计了一个双分支时间模块,分别从这两个分解成分中捕捉全局周期依赖和高频不规则变化。此外,ADMFormer引入了时变掩码空间注意力,基于实时交通状态稀疏化空间交互,从而有效保留动态且信息丰富的依赖。在四个真实世界数据集上的大量实验表明,ADMFormer实现了最先进的性能。

英文摘要

Accurate traffic forecasting is essential for intelligent transportation systems, supporting a wide range of real-world applications. However, it remains challenging due to two key factors: (1) Traffic series contain heterogeneous temporal patterns, where stable periodic regularities coexist with event-driven fluctuations. Existing methods often treat them within a unified representation, limiting their ability to capture fine-grained temporal dynamics. (2) Spatial dependencies among nodes are inherently dynamic and sparse, while dense all-pairs attention often introduces redundant interactions and amplifies noise. To address these issues, we propose ADMFormer, an Adaptive-Decomposition Transformer with Time-Varying Masked Spatial Attention. Specifically, ADMFormer first employs a time-node adaptive gating mechanism to decouple traffic signals into dominant regularities and residual fluctuations that vary across time and nodes. A dual-branch temporal module is then designed to separately capture global periodic dependencies and high-frequency irregular variations from these two decomposed components. Furthermore, ADMFormer introduces a time-varying masked spatial attention that sparsifies spatial interactions based on real-time traffic states, thereby effectively preserving dynamic and informative dependencies. Extensive experiments on four real-world datasets demonstrate that ADMFormer achieves state-of-the-art performance.

2605.25541 2026-05-26 cs.CG cs.AI cs.HC cs.LG

TopoAlign: Topology-Aware Visual Representation Alignment

TopoAlign:拓扑感知的视觉表示对齐

Xinyuan Yan, Rita Sevastjanova, Mennatallah El-Assady, Bei Wang

AI总结 提出TopoAlign框架,利用拓扑数据分析中的mapper图,通过联合力导向优化、自动结构匹配区域检测和基序查询,从拓扑角度比较不同模型或层的表示结构对齐。

详情
AI中文摘要

神经网络将输入编码为高维向量(称为表示),通过编码任务相关的结构和语义来捕捉模型如何处理数据。表示对齐指不同模型、层或训练条件对相同输入产生相似表示的程度,对模型解释、选择和鲁棒性分析有重要意义。现有的对齐度量方法主要依赖于几何属性(如邻域和聚类相似性),对表示的全局组织提供的洞察有限。在这项工作中,我们提出了TopoAlign,一个从结构角度视觉比较模型表示的拓扑感知框架。利用拓扑数据分析中的mapper图,TopoAlign联合分析来自不同模型或层的共享输入构建的图。该框架支持自上而下的比较工作流:首先通过联合力导向优化进行全局结构对齐,生成协调的图布局;然后通过自动检测结构匹配区域(用Bubble Sets可视化)识别局部对应关系;最后通过基于基序的查询和膜启发式可视化实现细粒度模式检查。我们通过语言和多模态模型的案例研究以及专家反馈展示了TopoAlign。结果表明,TopoAlign从拓扑角度为表示结构和对齐提供了有意义的洞察。

英文摘要

Neural networks encode inputs as high-dimensional vectors, known as representations, that capture how models process data by encoding task-relevant structure and semantics. Representation alignment refers to the degree to which different models, layers, or training conditions produce similar representations for the same inputs, with important implications for model interpretation, selection, and robustness analysis. Existing approaches to measure alignment primarily rely on geometric properties, such as neighborhood and cluster similarity, offering limited insight into the global organization of representations. In this work, we present TopoAlign, a topology-aware framework for visually comparing model representations from a structural perspective. Leveraging mapper graphs from topological data analysis, TopoAlign jointly analyzes graphs constructed from representations of shared inputs across different models or layers. The framework supports a top-down comparative workflow: it first performs global structure alignment via joint force-directed optimization to produce coordinated graph layouts; it then identifies local correspondences through automated detection of structurally matching regions, visualized with Bubble Sets; and finally it enables fine-grained pattern inspection through motif-based queries and membrane-inspired visualizations. We demonstrate TopoAlign through case studies on language and multimodal models, complemented by expert feedback. Our results show that TopoAlign provides meaningful insights into representation structure and alignment from a topological perspective.

2605.25540 2026-05-26 cs.SD cs.LG

A Multimodal Framework for Dementia Detection via Linguistic and Acoustic Representation Learning

基于语言和声学表征学习的多模态痴呆检测框架

Loukas Ilias, Dimitris Askounis

AI总结 提出一个端到端可训练的多模态深度学习框架,通过预训练模型提取声学和文本特征,结合注意力融合与互信息最大化,实现自动痴呆检测。

详情
AI中文摘要

阿尔茨海默病(AD)是一种进行性神经退行性疾病,是痴呆的主要原因,影响记忆、推理、沟通和日常功能。早期诊断尤为重要,因为及时干预可能有助于减缓认知衰退并改善患者护理。最近的研究表明,自发性言语包含与痴呆相关的有价值的语言和声学生物标志物。然而,现有方法通常依赖于独立训练的模态特定模型、特征拼接策略、集成方法或基于注意力的融合机制,这些方法并未明确最大化语音和转录表示之间的依赖性。在这项工作中,我们提出了一种用于自动痴呆检测的多模态深度学习框架,该框架以端到端可训练的方式联合利用语音和转录信息。具体来说,语音录音被分割成10秒的片段,并通过预训练的HuBERT模型提取上下文化的声学表示。为了更好地捕捉信息丰富的时域语音特征,采用注意力统计池化来聚合帧级声学嵌入。对于文本模态,使用预训练的BERT模型对转录进行编码,其中[CLS]标记表示用作语言嵌入。随后,使用基于注意力的音频-文本融合(AT-Fusion)机制组合声学和文本表示。此外,我们引入了一个MINE目标,以最大化模态之间的互信息并改善多模态表示对齐。最终融合的多模态表示用于痴呆分类。在公开的ADReSS挑战赛和PROCESS-2数据集上进行的实验证明了所提方法在基于语音的痴呆评估中的有效性和鲁棒性。

英文摘要

Alzheimer's disease (AD) is a progressive neurodegenerative disorder and the leading cause of dementia, affecting memory, reasoning, communication, and daily functioning. Early diagnosis is particularly important, as timely intervention may help slow cognitive decline and improve patient care. Recent studies have demonstrated that spontaneous speech contains valuable linguistic and acoustic biomarkers associated with dementia. However, existing approaches often rely on independently trained modality-specific models, feature concatenation strategies, ensemble methods, or attention-based fusion mechanisms that do not explicitly maximize the dependency between speech and transcript representations. In this work, we propose a multimodal deep learning framework for automatic dementia detection that jointly exploits speech and transcript information in an end-to-end trainable manner. Specifically, speech recordings are divided into 10-second segments and passed through a pre-trained HuBERT model to extract contextualized acoustic representations. To better capture informative temporal speech characteristics, attentive statistics pooling is employed to aggregate frame-level acoustic embeddings. For the textual modality, transcripts are encoded using a pre-trained BERT model, where the [CLS] token representation is used as the linguistic embedding. The acoustic and textual representations are subsequently combined using an attention-based Audio-Text Fusion (AT-Fusion) mechanism. In addition, we introduce a MINE objective to maximize the mutual information between modalities and improve multimodal representation alignment. The fused multimodal representation is finally used for dementia classification. Experiments conducted on the publicly available ADReSS Challenge and PROCESS-2 dataset demonstrate the effectiveness and robustness of the proposed approach for speech-based dementia assessment.
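
Attentive statistics pooling, the aggregation step named in this abstract, is a standard technique and can be sketched in a few lines. The version below is a generic NumPy illustration, not the authors' code: the parameter names `W`, `b`, `v` and the toy dimensions are assumptions, and in the paper's setting the frames would come from HuBERT.

```python
import numpy as np


def attentive_stats_pooling(frames, W, b, v):
    """Pool frame-level embeddings (T, D) into one (2*D,) vector holding the
    attention-weighted mean and standard deviation over time."""
    scores = np.tanh(frames @ W + b) @ v  # (T,) one scalar score per frame
    scores = scores - scores.max()  # stabilise the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    mean = weights @ frames  # (D,)
    var = weights @ (frames ** 2) - mean ** 2
    std = np.sqrt(np.clip(var, 1e-12, None))  # guard against tiny negative values
    return np.concatenate([mean, std])


rng = np.random.default_rng(0)
T, D, H = 50, 8, 4  # toy sizes, not HuBERT's
frames = rng.standard_normal((T, D))
pooled = attentive_stats_pooling(
    frames, rng.standard_normal((D, H)), np.zeros(H), rng.standard_normal(H)
)
```

With all attention parameters set to zero the weights become uniform, and the output reduces to the plain mean and standard deviation over frames.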

2605.25537 2026-05-26 cs.RO

Action-Prior Denoising for Smooth Real-Time Chunking

基于动作先验去噪的平滑实时分块

Dongyang Liu, Zhaowen Zheng, Yu Sun, Longxu Zhang, Yixuan Liu, Hao Wan

AI总结 提出Soft RTC方法,通过动作先验去噪训练时模拟执行延迟,在保持近朴素运行时间的同时,降低高延迟动作变化并提升平滑性。

详情
Comments
7 pages, 5 figures, 1 table
AI中文摘要

实时分块(RTC)通过将新生成的动作块条件于前一块已提交的动作,使得分块动作策略能够在推理延迟下运行。训练时RTC在学习过程中模拟这种延迟,避免了部署时昂贵的指导,但其二元前缀掩码将所有非前缀令牌视为完全无约束。这低估了异步执行:早期重叠动作是固定的,而后期重叠动作虽然可编辑,但仍应保持接近先前的计划。我们提出Soft RTC,一种基于动作先验去噪的训练时RTC泛化方法。Soft RTC从部分去噪状态而非纯噪声中构建损坏的重叠令牌,并通过轻量级的令牌级混合规则在推理时将对齐的前一块作为相同先验注入。在12个已发布的大型Kinetix关卡上,短软窗口在整体解决率上几乎与硬训练时RTC相当(0.809 vs. 0.815),而中等窗口相对于硬RTC将高延迟动作变化和急动度分别降低了9.1%和9.6%。与推理时RTC基线不同,两种变体都保持近朴素运行时间。一项小型初步真实机器人分拣研究提供了额外证据,表明训练时RTC可以提高完成率,并且Soft RTC在测试策略中给出了最低的命令动作有限差分指标。

英文摘要

Real-time chunking (RTC) lets chunked action policies operate under inference delay by conditioning a newly generated action chunk on actions already committed by the previous chunk. Training-time RTC simulates this delay during learning and avoids expensive guidance at deployment, but its binary prefix mask treats all non-prefix tokens as fully unconstrained. This under-models asynchronous execution: early overlap actions are fixed, while later overlap actions remain editable but should still stay close to the previous plan. We propose Soft RTC, a training-time RTC generalization based on action-prior denoising. Soft RTC constructs corrupted overlap tokens from partially denoised states instead of pure noise and injects the aligned previous chunk as the same prior during inference through a lightweight token-wise blending rule. On the 12 released large Kinetix levels, a short soft window nearly matches hard training-time RTC in overall solve rate (0.809 vs. 0.815), while a medium window reduces high-delay action delta and jerk by 9.1% and 9.6% relative to hard RTC. Both variants keep near-naive runtime, unlike inference-time RTC baselines. A small preliminary real-robot sorting study provides additional evidence that training-time RTC can improve completion and that Soft RTC gives the lowest commanded-action finite-difference metrics among the tested policies.

2605.25536 2026-05-26 cs.SE cs.AI

A Tertiary Review of Large Language Model-Based Code Generating Tasks: Trends, Challenges, and Future Directions

基于大语言模型的代码生成任务的三级综述:趋势、挑战与未来方向

Muslim Chochlov, Michael English, Jim Buckley

AI总结 本三级综述综合了30篇二级研究(2017-2025年),分析了基于大语言模型的代码生成任务在出版趋势、效果、场景、集成挑战和未来方向上的证据,发现基准测试准确率高但泛化性弱,鲁棒性脆弱,效率问题普遍,毒性和偏见报告不足,主要挑战涉及经济可行性、评估有效性和社会技术集成。

详情
AI中文摘要

上下文。大语言模型(LLMs)越来越多地被应用于软件工程中的代码生成任务(CGTs)。尽管报告的结果令人鼓舞,但这种应用的更广泛影响及其与真实世界开发的集成仍未被充分理解,现有的三级研究在这方面提供的很少。目标。本三级研究整合了关于基于LLM的CGTs的二级证据,综合了出版格局、效果、场景、集成挑战和未来研究方向。方法。遵循系统综述指南,我们在相关数字图书馆中进行了检索,并辅以前向和后向滚雪球及筛选步骤。评估了研究质量,并通过评估者间一致性统计对提取可靠性进行了审计。使用SWEBOK知识领域和HELM框架综合了证据。结果。我们识别出30篇发表于2017-2025年间的二级研究,自2023年以来快速增长。在基准测试上准确性似乎很强,但在真实世界泛化方面支持较弱;鲁棒性在不同任务和配置下脆弱;效率约束普遍存在;毒性和偏见报告不足。主要挑战涉及经济可行性、评估有效性和社会技术集成。未来方向建议领域感知的模型改进以及全面、标准化评估的需求。结论。基于LLM的CGTs代表了一个快速成熟但评估不均的研究领域,突出了对领域感知模型改进和全面、标准化评估的需求,以及解决效率和相关成本问题。

英文摘要

Context. Large language models (LLMs) are increasingly applied to code-generating tasks (CGTs) in software engineering. While reported results are promising, the broader effects of such applications and their integration into real-world development remain insufficiently understood, and existing tertiary studies provide little in this area. Objective. This tertiary study consolidates secondary evidence on LLM-based CGTs, synthesizing the publication landscape, effects, scenarios, integration challenges, and future research directions. Method. Following systematic review guidelines, we searched relevant digital libraries, complemented by backward and forward snowballing and a screening step. Study quality was assessed and extraction reliability was audited with inter-rater agreement statistics. Evidence was synthesized using SWEBOK knowledge areas and the HELM framework. Results. We identify 30 secondary studies published between 2017 and 2025, with rapid growth since 2023. Accuracy seems strong on benchmarks but weakly supported for real-world generalization; robustness is fragile across tasks and configurations; efficiency constraints are pervasive; toxicity and bias are under-reported. Dominant challenges concern economic feasibility, evaluation validity, and socio-technical integration. Future directions suggest domain-aware model improvement and the need for holistic, standardized evaluation. Conclusion. LLM-based CGTs represent a fast-maturing yet unevenly evaluated research area, highlighting the need for domain-aware model improvements and holistic, standardized evaluation, as well as the need to address efficiency and associated costs.

2605.25535 2026-05-26 cs.AI

Personalize-then-Store: Benchmarking and Learning Personalized Memory for Long-horizon Agents

个性化再存储:面向长时程智能体的个性化记忆基准测试与学习

Yeonjun In, Wonjoong Kim, Sangwu Park, Kanghoon Yoon, Chanyoung Park

AI总结 针对现有基于大语言模型的记忆系统采用通用静态策略忽略用户间存储上下文差异的问题,提出首个个性化记忆基准PerMemBench和会话级存储门控框架,验证个性化能显著提升记忆保留但精确门控仍是关键挑战。

详情
Comments
preprint
AI中文摘要

现有的基于大语言模型(LLM)的记忆系统采用通用、静态的策略,忽略了一个基本现实:不同用户值得存储在记忆中的上下文是不同的。这种错位将有限的记忆预算浪费在短暂交互上,同时未能为长时程任务保留关键上下文。为解决这一差距,我们研究了一个未被充分探索的问题:基于LLM的记忆系统能否学习个性化的记忆策略?我们引入了PerMemBench,这是首个用于评估个性化记忆系统的基准,具有跨多年、多领域、多样化用户角色的交互历史。我们进一步提出了记忆个性化的首个实证研究,提出了会话级存储门控,这是一个轻量级框架,可选择性地绕过短暂会话的记忆操作。我们的研究证实,在完美门控下,个性化能带来显著的保留增益,但同时也揭示出精确门控仍然是一个开放且关键的挑战。

英文摘要

Existing large language model (LLM)-based memory systems apply universal, static policies that overlook a fundamental reality: the contexts worth storing in memory differ across users. This misalignment wastes limited memory budget on transient interactions while failing to preserve critical context for long-horizon tasks. To address this gap, we investigate an underexplored question: can LLM-based memory systems learn personalized memory policies? We introduce PerMemBench, the first benchmark for evaluating personalized memory systems, featuring multi-year, multi-domain interaction histories across diverse user personas. We further present the first empirical study of memory personalization, proposing session-level storage gating, a lightweight framework that selectively bypasses memory operations for transient sessions. Our study confirms that personalization yields substantial retention gains under perfect gating, yet reveals that accurate gating remains an open and critical challenge.

2605.25534 2026-05-26 cs.AI

StructBreak: Structural Cognitive Overload-Induced Safety Failures in MLLMs

StructBreak: 多模态大语言模型中结构性认知过载引发的安全故障

Yang Luo, Xinran Liu, Tiantian Ji, Zhiyi Yin, Lingyun Peng, Shuyu Li

AI总结 提出StructBreak框架,通过量化结构性认知过载(SCO)揭示一种高阶认知过载攻击范式,在六种主流MLLM上实现92%平均攻击成功率,并证明该攻击通过结构性通道绕过安全过滤器。

详情
Comments
23 pages; accepted to Findings of ACL 2026. This paper contains examples of harmful content
AI中文摘要

多模态大语言模型(MLLM)在结构推理方面表现出色,但在结构一致性方面存在明显的逻辑脆弱性。我们将这种现象称为结构性认知过载(SCO),它是深度推理与安全对齐之间竞争产生的副产品。然而,先前的工作主要针对排版和像素级扰动,对SCO的研究尚不充分。为此,我们提出了StructBreak,一个自动化的端到端框架,旨在量化SCO。通过利用StructBreak,我们发现了一种新颖的高阶认知过载攻击范式;值得注意的是,这种攻击在实用的黑盒设置下运行,无需内部模型访问。因此,我们利用该框架建立了一个涵盖十种不同威胁场景的综合基准。对六种领先MLLM的实证评估表明,SCO容易触发有毒内容生成,平均攻击成功率(ASR)达到92%(在Gemini 2.5上高达97%)。为了阐明SCO的机制,我们进一步进行了模型级解释,涵盖注意力动态、潜在空间拓扑和几何分析。我们的发现表明,StructBreak作为一种新颖的结构性通道来绕过安全过滤器。此外,固有安全机制的有效性有限,凸显了当前的对齐范式不足以应对复杂多模态推理的时代。

英文摘要

Multimodal Large Language Models (MLLMs) excel at structural reasoning yet suffer from a sharp logical brittleness in structural consistency. We term this phenomenon Structural Cognitive Overload (SCO), a byproduct of the contention between deep reasoning and safety alignment. However, prior work has predominantly targeted typographic and pixel-level perturbations, leaving the study of SCO largely unexplored. To this end, we propose StructBreak, an automated end-to-end framework designed to quantify SCO. By leveraging StructBreak, we uncover a novel higher-order cognitive overload attack paradigm; notably, this attack operates under a practical black-box setting, requiring no internal model access. Consequently, we utilize this framework to establish a comprehensive benchmark spanning ten diverse threat scenarios. Empirical evaluations on six leading MLLMs reveal that SCO readily triggers toxic generation, yielding a 92% average ASR (up to 97% on Gemini 2.5). To elucidate the mechanism of SCO, we further conduct model-level interpretations spanning attention dynamics, latent space topology, and geometric analysis. Our findings reveal that StructBreak acts as a novel structural channel to circumvent safety filters. Furthermore, the limited efficacy of inherent safety mechanisms underscores that current alignment paradigms are insufficient for the era of complex multimodal reasoning.

2605.25530 2026-05-26 cs.CV

Location Prior Generation via Multi-Source Urban Data Fusion for Low-Altitude Air Mobility

基于多源城市数据融合的低空空中交通位置先验生成

Xiang Xie, Xiaonan Liu

AI总结 提出LPGF框架,融合多源数据(哨兵2号影像、无人机遥测、车辆GPS轨迹、OSM足迹)生成结构化城市位置先验,通过三级优先级分配建筑高度,并引入质量门控的阴影估计模块,在米兰数据集上验证了约5.5米的最坏误差。

详情
Comments
11 pages, 7 figures, submitted to IEEE Journal of Internet of Things
AI中文摘要

建筑高度作为城市空间数据的第三维度,在全球地理空间数据库中超过95%的结构中缺失。对于新兴的低空经济而言,这一数据缺口迫使每个空中平台依赖实时机载感知而非预计算的3D场景几何。我们提出了位置先验生成框架(LPGF),这是一个多源数据融合管道,将哨兵2号影像、无人机遥测、车辆GPS轨迹和OpenStreetMap足迹整合为结构化、可重用的城市位置先验。LPGF通过三级优先级层次分配建筑高度:(1)可用的显式OSM高度标签,(2)楼层数乘以每层3.2米(若记录),以及(3)否则使用建筑类型默认高度,产生约5.5米的最坏情况误差。一个可选的基于阴影的高度估计模块(SHEM)仅在满足四项质量标准时才被激活;当任何标准失败时,管道转向结构化后备方案。在MiTra A50米兰数据集上,质量门正确识别了两种成像故障模式:10米GSD下的亚像素阴影和0.93米GSD下的地面阴影合并,在两种情况下均产生一致的27栋建筑先验。第三级类型默认高度与手动楼层计数(n=15)进行验证,在5.0米不确定性范围内达到MAE=3.07米。该框架表明,结构化、质量门控的通用数据流融合可以为低空城市运营启动3D场景覆盖。

英文摘要

Building height, the third dimension (3D) of urban spatial data, is absent in over 95% of structures in global geospatial databases. For the emerging low-altitude economy, this data gap forces each aerial platform to rely on real-time onboard sensing rather than pre-computed 3D scene geometry. We present the Location Prior Generation Framework (LPGF), a multi-source data fusion pipeline that integrates Sentinel-2 imagery, UAV telemetry, vehicle GPS trajectories, and OpenStreetMap footprints into structured, reusable urban location priors. LPGF assigns building heights through a three-tier priority hierarchy: (1) explicit OSM height tags where available, (2) floor count multiplied by 3.2 m per story where recorded, and (3) building-type default heights otherwise, yielding a worst-case error of approximately 5.5 m. An optional shadow-based height estimation module (SHEM) is activated only when a four-criterion quality gate is satisfied; when any criterion fails, the pipeline routes to structured fallback. On the MiTra A50 Milan dataset, the quality gate correctly identified two imaging failure modes: sub-pixel shadows at 10 m GSD and ground shadow merging at 0.93 m GSD, producing a consistent 27-building prior in both cases. Tier 3 type-default heights were validated against manual floor counts (n=15), achieving MAE=3.07 m within the 5.0 m uncertainty bound. The framework demonstrates that structured, quality-gated fusion of universally available data streams can bootstrap 3D scene coverage for low-altitude urban operations.
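
The three-tier priority hierarchy in this abstract maps naturally onto a short lookup function. The sketch below is an illustration under stated assumptions: `height`, `building:levels`, and `building` are standard OpenStreetMap tag keys, the 3.2 m per story comes from the abstract, but the per-type default heights and the final fallback value are hypothetical placeholders, since the abstract does not list them.

```python
FLOOR_HEIGHT_M = 3.2  # per-story height stated in the abstract

# Hypothetical per-type defaults; the abstract does not give the actual values.
DEFAULT_HEIGHT_M = {"residential": 12.0, "commercial": 15.0, "industrial": 8.0}
FALLBACK_HEIGHT_M = 10.0  # also hypothetical, used for unknown building types


def _to_float(value):
    """Parse a tag value such as '12.5' or '12.5 m'; return None if unusable."""
    if value is None:
        return None
    try:
        return float(str(value).replace("m", "").strip())
    except ValueError:
        return None


def assign_height(tags):
    """Return (height_m, tier) for one building's OSM tag dictionary."""
    height = _to_float(tags.get("height"))
    if height is not None and height > 0:
        return height, 1  # Tier 1: explicit height tag
    levels = _to_float(tags.get("building:levels"))
    if levels is not None and levels > 0:
        return levels * FLOOR_HEIGHT_M, 2  # Tier 2: floor count times 3.2 m
    building_type = tags.get("building", "")
    return DEFAULT_HEIGHT_M.get(building_type, FALLBACK_HEIGHT_M), 3  # Tier 3


examples = [
    assign_height({"height": "12.5 m"}),
    assign_height({"building:levels": "4"}),
    assign_height({"building": "industrial"}),
]
```

Returning the tier alongside the height keeps the provenance of each value, which matters when downstream consumers need a per-building uncertainty bound.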

2605.25527 2026-05-26 cs.LG cs.CE

DeepSeekMath Meets Order Book: Group-Aware Policy Optimization for High-Frequency Directional Trading

DeepSeekMath 遇见订单簿:面向高频方向性交易的组感知策略优化

Sayak Charabarty, Souradip Pal

AI总结 本文通过将基于订单流的状态模型与策略梯度方法结合,研究限价订单簿上的高频交易强化学习,提出组感知策略优化方法,在回测中优于基于价值的 Q-learning 基线。

详情
Comments
9 pages, 3 figures
AI中文摘要

本文通过将基于订单流的状态模型与策略梯度方法配对,研究限价订单簿上的高频交易强化学习。与基于价值的 RL 技术(如表格 Q-learning)不同,我们的方法部署基于策略的方法,如普通 PPO 以及受 DeepSeekMath 启发的变体 GRPO 和 GSPO,这些方法使用组归一化更新和下行感知整形。在使用基于点差缩放奖励的简化回测设置下,对金融资产 AMZN、AAPL 和 GOOG 进行回测,这些新策略在净平均 PnL、盈利能力和回撤方面优于 Q-learning 基线。我们的结果表明:(1) 订单流信号是策略 RL 的合适状态;(2) 组感知 PPO 替代方法优于基于价值的基线。

英文摘要

This paper studies reinforcement learning for high-frequency trading on limit order books by pairing an order-flow-based state model with policy-gradient methods. Instead of value-based RL techniques such as tabular Q-learning, our approach deploys policy-based methods: vanilla PPO and the DeepSeekMath-inspired variants GRPO and GSPO, which use group-normalized updates and downside-aware shaping. In backtests on the financial assets AMZN, AAPL, and GOOG, under a simplified setup based on spread-scaled rewards, these new policies improve net average PnL, profitability, and drawdown over the Q-learning baseline. Our results show that (1) order-flow signals are an adequate state for policy RL and (2) group-aware PPO surrogates are preferable to value-based baselines.
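
The "group-normalized updates" mentioned here refer to the GRPO idea of scoring each rollout against the other rollouts in its own group instead of against a learned value function. A minimal sketch of that normalization follows; the reward values are made up for illustration, and the downside-aware shaping from the abstract is not reproduced because its form is not specified.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-8):
    """Advantage of each rollout relative to its group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of rewards for several rollouts sampled from the same state.
advantages = group_normalized_advantages([0.4, -0.1, 0.0, 0.3])
```

Rollouts that beat the group average receive positive advantages and the rest negative ones; when every rollout in a group earns the same reward, all advantages are zero and the group contributes no gradient.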

2605.25526 2026-05-26 stat.ML cs.LG

From DPPs to $k$-DPPs: identifiability analysis via spectral decomposition

从DPP到$k$-DPP:通过谱分解的可识别性分析

Hideitsu Hino, Keisuke Yano

AI总结 通过谱分解研究行列式点过程(DPP)及其条件版本$k$-DPP的几何结构,揭示了$k$-DPP中谱参数和特征空间旋转参数的可识别性变化,并刻画了可识别性差距。

详情
Comments
10 pages
AI中文摘要

我们通过谱分解$L=UΛU^{\top}$研究行列式点过程(DPP)的几何结构。谱$Λ$通过初等对称多项式控制基数分布,而特征空间方向$U$控制每个固定基数层内的条件分布。在基数$k$上取条件得到$k$-DPP,其可识别性结构发生根本变化:谱参数仅在一个公共尺度下可识别,特征空间旋转参数仅通过特征向量矩阵的平方子式可识别。我们通过三个显式不变性(尺度、符号相似性和特征空间旋转)以及一个维数计数定理精确刻画了可识别性差距,该定理表明当$\binom{N}{k}<N(N+1)/2$时存在额外的连续不可识别性。相比之下,对于完整DPP,不可识别性仅来自离散的符号相似性。

英文摘要

We study the geometry of determinantal point processes (DPPs) through the spectral decomposition $L=UΛU^{\top}$. The spectrum $Λ$ governs the cardinality distribution via elementary symmetric polynomials, while the eigenspace orientation $U$ governs the conditional law within each fixed-cardinality stratum. Conditioning on cardinality $k$ yields the $k$-DPP, for which the identifiability structure changes fundamentally: the spectral parameter becomes identifiable only up to a common scale, and the eigenspace rotation parameter is identifiable only through squared minors of the eigenvector matrix. We characterize the identifiability gap precisely, via three explicit invariances (scale, sign similarity, and eigenspace rotation) and a dimension-counting theorem showing the existence of additional continuous non-identifiability whenever $\binom{N}{k}<N(N+1)/2$. In contrast, for the full DPP the non-identifiability comes only from the discrete sign similarity.
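
Two statements in this abstract can be checked numerically on a small kernel. For an L-ensemble DPP, P(|Y| = k) = e_k(λ) / ∏_i (1 + λ_i), where e_k is the k-th elementary symmetric polynomial of the eigenvalues; and for a k-DPP, P(S) ∝ det(L_S), so rescaling L by c multiplies every det(L_S) by c^k and leaves the distribution unchanged. The sketch below uses brute-force enumeration on a random 4x4 kernel, which is an illustrative choice and not taken from the paper.

```python
from itertools import combinations

import numpy as np


def elementary_symmetric(eigvals):
    """Return [e_0, e_1, ..., e_N] of the eigenvalues via the standard recurrence."""
    e = np.zeros(len(eigvals) + 1)
    e[0] = 1.0
    for lam in eigvals:
        e[1:] = e[1:] + lam * e[:-1]  # right-hand side is built before assignment
    return e


def cardinality_distribution(L):
    """P(|Y| = k) for k = 0..N, governed only by the spectrum of L."""
    eigvals = np.linalg.eigvalsh(L)
    return elementary_symmetric(eigvals) / np.prod(1.0 + eigvals)


def k_dpp_probabilities(L, k):
    """Brute-force k-DPP: P(S) proportional to det(L_S) over all size-k subsets."""
    subsets = list(combinations(range(L.shape[0]), k))
    dets = np.array([np.linalg.det(L[np.ix_(s, s)]) for s in subsets])
    return dets / dets.sum()


rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
L = A @ A.T + 0.1 * np.eye(4)  # random symmetric positive-definite kernel
D = np.diag([1.0, -1.0, 1.0, -1.0])  # a sign-similarity transform

card = cardinality_distribution(L)
scale_invariant = np.allclose(k_dpp_probabilities(3.0 * L, 2), k_dpp_probabilities(L, 2))
sign_invariant = np.allclose(k_dpp_probabilities(D @ L @ D, 2), k_dpp_probabilities(L, 2))
full_dpp_changes = not np.allclose(cardinality_distribution(3.0 * L), card)
```

The last flag illustrates the contrast drawn in the abstract: rescaling is invisible to the k-DPP but does change the full DPP, whose cardinality distribution depends on the absolute scale of the spectrum.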

2605.25525 2026-05-26 cs.LG

SAE-FD: Sparse Autoencoder Feature Distillation for Continual Learning of Large Language Models

SAE-FD: 面向大语言模型持续学习的稀疏自编码器特征蒸馏

Mingxu Zhang, Yuhan Li, Lujundong Li, Dazhong Shen, Hui Xiong, Ying Sun

AI总结 针对持续学习中的灾难性遗忘问题,提出基于稀疏自编码器特征蒸馏的方法,通过将模型表示锚定在稀疏特征空间以减少表征纠缠,实现更精准的正则化,在多个基准上优于现有方法。

详情
AI中文摘要

持续学习使大语言模型能够适应不断变化的任务而无需从头重新训练,但灾难性遗忘仍然是一个核心障碍。在持续学习方法中,基于正则化的方法被广泛用于约束模型更新并减少遗忘,这些方法在权重空间、梯度空间或输出空间中操作。然而,这些密集表示空间存在特征叠加问题,即多个概念被编码在重叠的维度中,使得难以在不阻碍新任务学习的情况下有选择地保护先前学到的知识。为了解决这个问题,我们提出了SAE-FD(稀疏自编码器特征蒸馏),该方法将模型表示锚定在预训练稀疏自编码器的稀疏特征空间中,其中密集激活被分解为稀疏过完备基,从而减少表征纠缠,实现更有针对性的正则化,同时减少对新任务学习的干扰。在三个模型架构上的两个持续学习基准实验表明,SAE-FD始终优于现有的基于正则化的方法,平均准确率高达52.70%,仅产生-0.46的后向迁移。

英文摘要

Continual learning enables large language models to adapt to evolving tasks without retraining from scratch, yet catastrophic forgetting remains a central obstacle. Among continual learning methods, regularization-based approaches are widely used to constrain model updates and reduce forgetting, operating in weight space, gradient space, or output space. However, these dense representation spaces suffer from feature superposition, where multiple concepts are encoded in overlapping dimensions, making it difficult to selectively protect previously learned knowledge without impeding new-task learning. To address this issue, we propose SAE-FD (Sparse Autoencoder Feature Distillation), which anchors model representations in the sparse feature space of a pre-trained Sparse Autoencoder, where dense activations are decomposed into a sparse overcomplete basis that reduces representational entanglement, enabling more targeted regularization with less interference to new-task learning. Experiments on two continual learning benchmarks across three model architectures show that SAE-FD consistently outperforms existing regularization-based methods, achieving up to 52.70% average accuracy with only -0.46 backward transfer.
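
One way to picture "anchoring representations in a sparse feature space" is a penalty computed on sparse-autoencoder features instead of on raw activations. The sketch below is an assumption-heavy illustration: the abstract does not state the loss, so the ReLU encoder, the squared-error form, and all names and sizes here are hypothetical and should not be read as the paper's objective.

```python
import numpy as np


def sae_encode(h, W_enc, b_enc):
    """ReLU sparse-autoencoder encoder: dense activations (B, D) to features (B, F)."""
    return np.maximum(h @ W_enc + b_enc, 0.0)


def feature_distillation_loss(h_new, h_old, W_enc, b_enc):
    """Mean squared difference between SAE features of the updated model and
    those of the frozen reference model (assumed form of the penalty)."""
    f_new = sae_encode(h_new, W_enc, b_enc)
    f_old = sae_encode(h_old, W_enc, b_enc)
    return float(np.mean((f_new - f_old) ** 2))


rng = np.random.default_rng(0)
B, D, F = 4, 16, 64  # toy sizes; F > D gives an overcomplete basis
W_enc = rng.standard_normal((D, F)) / np.sqrt(D)
b_enc = np.zeros(F)
h_old = rng.standard_normal((B, D))  # activations of the frozen reference model
h_new = h_old + 0.05 * rng.standard_normal((B, D))  # activations after an update
penalty = feature_distillation_loss(h_new, h_old, W_enc, b_enc)
```

In training, a term like this would be added to the new-task loss with some weight, so that drift is penalized per feature and not per entangled dense dimension.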

2605.25524 2026-05-26 cs.CV

ProSR: Process-Shaped Spatial Reasoning for Reliable Chain-of-Thought in VLMs

ProSR: 面向可靠思维链的过程塑造空间推理方法

Jiangyang Li, Cong Wan, Changjie Wu, Songlin Dong, Lingjun Zhang, Linzhe Shi, Xu Wang, Zhiheng Ma, Hang Zhang, Mu Xu, Yihong Gong

AI总结 针对视觉语言模型在空间推理中存在的虚假基础与尾部不稳定性问题,提出ProSR框架,通过反事实不变性惩罚和尾部漂移惩罚优化推理过程,提升答案准确率及轨迹稳定性与视觉依赖性。

详情
Comments
19 pages, 6 figures
AI中文摘要

可靠的空间推理仍然是视觉语言模型(VLM)的核心瓶颈。现有的空间推理主流训练范式主要依赖于结果对齐或过程模仿,缺乏对推理过程的显式约束,因此难以确保真正的视觉依赖和稳定的推理轨迹。在本文中,我们构建了一个覆盖多种空间现象的高质量思维链数据集,并诊断了模型的推理过程,揭示了强化学习优化过程中两种典型的过程退化类型:虚假基础(绕过视觉证据)和尾部不稳定性(推理后期不确定性异常上升)。为了解决这些问题,我们提出了ProSR,一种用于空间推理的过程塑造优化框架。通过反事实不变性惩罚和尾部漂移惩罚,ProSR将优化目标从单一的答案正确性扩展到两个过程级维度:视觉依赖性和轨迹稳定性。在多个复杂和分布外的空间推理基准上的实验表明,ProSR在提高答案准确率的同时,生成的推理轨迹更加稳定且更依赖于视觉证据。

英文摘要

Reliable spatial reasoning remains a core bottleneck for vision-language models (VLMs). Existing mainstream training paradigms for spatial reasoning largely rely on outcome alignment or process imitation, lacking explicit constraints on the reasoning process, and therefore struggle to ensure genuine visual dependence and stable reasoning trajectories. In this paper, we construct a high-quality CoT dataset covering diverse spatial phenomena and diagnose the model's reasoning process, revealing two typical types of process degradation during reinforcement learning optimization: Spurious Grounding, which bypasses visual evidence, and Tail Instability, where uncertainty abnormally rises in the later stage of reasoning. To address these issues, we propose ProSR, a process-shaping optimization framework for spatial reasoning. Through a Counterfactual Invariance Penalty and a Tail Drift Penalty, ProSR extends the optimization objective from single answer correctness to two process-level dimensions: visual dependence and trajectory stability. Experiments on multiple complex and out-of-distribution spatial reasoning benchmarks show that ProSR improves answer accuracy while generating reasoning trajectories that are more stable and more dependent on visual evidence.

2605.25520 2026-05-26 cs.CL

Is Inference Mediated by Distinct Semantic Structures in LLMs? A Mechanistic Interpretation

LLMs中的推理是否由不同的语义结构介导?一种机制性解释

Nura Aljaafari, Marco Valentino, André Freitas

AI总结 通过SVD分解和激活引导实验,研究自然语言推理中Transformer模型是否编码语义操作,发现操作级子空间部分重叠且因果影响预测,表明模型不仅编码假设与前提的关系,还部分编码如何关联。

详情
Comments
26 pages, 16 figures, 13 tables
AI中文摘要

正确预测标签并不一定需要表示产生该标签的操作。已知Transformer表示携带标签级信息,但它们是否编码产生这些标签的语义操作尚不清楚。我们使用受控的前提-假设对(仅通过单一语义变换区分)在自然语言推理中对此进行研究。利用逐层激活,通过SVD估计操作级子空间,并通过在四个开源解码器模型中的激活引导测试其因果相关性。变换效果以84.8%-99%的准确率可解码,并占据部分不同但重叠的子空间,超过随机子空间基线。引导实验表明这些方向因果性地影响预测,尽管可引导性因模型而异;跨操作引导进一步揭示了结构化干扰以及子空间选择性与跨操作独立性之间的分离。这些发现表明,模型不仅编码假设与前提相关,还部分编码如何相关,这意味着机制分析和控制应在语义操作层面而非仅预测标签层面进行。

英文摘要

Predicting a label correctly does not necessarily require representing the operation that produces it. Transformer representations are known to carry label-level information, but whether they encode semantic operations producing those labels is unclear. We investigate this in Natural Language Inference using controlled premise-hypothesis pairs that differ by a single semantic transformation. Using layer-wise activations, we estimate operation-level subspaces via SVD and test their causal relevance through activation steering in four open-weight decoder models. Transformation effects are decodable with 84.8%-99% accuracy and occupy partially distinct but overlapping subspaces, exceeding random-subspace baselines. Steering experiments show that these directions causally influence predictions, though steerability varies across models; cross-operation steering further reveals structured interference and a dissociation between subspace selectivity and cross-operation independence. These findings indicate that the models encode not only that a hypothesis relates to a premise but also, in part, how it does so, implying that mechanistic analysis and control should operate at the level of semantic operations rather than predicted labels alone.
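
The two analysis steps named in this abstract, estimating a subspace via SVD and steering activations along it, can be sketched on synthetic data. This is a generic illustration and not the authors' pipeline: the activations are random vectors with a planted shift direction, and the function names, rank, and steering rule are assumptions.

```python
import numpy as np


def operation_subspace(acts_original, acts_transformed, rank=2):
    """Orthonormal basis (rank, D) of the dominant directions along which a
    transformation shifts paired activations, estimated via SVD."""
    diffs = acts_transformed - acts_original  # (N, D) paired differences
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    basis = vt[:rank]
    # Resolve the SVD sign ambiguity so each direction points along the mean shift.
    signs = np.sign(basis @ diffs.mean(axis=0))
    signs[signs == 0] = 1.0
    return basis * signs[:, None]


def steer(activation, basis, strength=1.0):
    """Shift an activation along the leading direction of the subspace."""
    return activation + strength * basis[0]


rng = np.random.default_rng(0)
N, D = 200, 32
planted = np.zeros(D)
planted[3] = 1.0  # synthetic "operation" direction
base = rng.standard_normal((N, D))
shifted = base + 2.0 * planted + 0.1 * rng.standard_normal((N, D))

basis = operation_subspace(base, shifted, rank=2)
alignment = float(basis[0] @ planted)  # near 1 when the planted direction is recovered
steered = steer(base[0], basis, strength=2.0)
```

In a real model the steered activation would be written back into the forward pass at the chosen layer, and the change in the predicted label would measure the causal effect.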

2605.25518 2026-05-26 cs.CV cs.AI

Cross-Stage Attention Multi-Expert Network for Radiologist-Inspired Breast Ultrasound Diagnosis

受放射科医生启发的乳腺超声诊断的跨阶段注意力多专家网络

Xinyang Zhai, Chong Yang, Ruizhi Zhang

AI总结 提出跨阶段注意力混合专家网络(CSA-MoE-Net),通过跨阶段注意力模块增强多级特征、三分支MoE块从全肿瘤图像、肿瘤核心和边界学习互补特征,并在平衡数据集上实现96.33%准确率,显著优于基线ResNet-18。

详情
AI中文摘要

乳腺超声成像是一种重要的早期乳腺癌诊断无创方法,但由于肿瘤异质性、边界模糊和数据不平衡,自动良恶性分类仍具挑战。为了提高特征表示和分类准确性,本文提出了跨阶段注意力混合专家网络(CSA-MoE-Net)。它采用跨阶段注意力增强的ResNet-18作为骨干网络,其中跨阶段注意力模块自适应地重新校准多级特征,从而增强关键肿瘤特征并抑制冗余。一个三分支混合专家(MoE)块从全肿瘤图像、肿瘤核心和边界学习互补特征,自适应门控网络融合这些特征以捕获形态、纹理和上下文信息。融合后的特征在架构中称为融合专家特征(FEF)。在包含2,129张乳腺超声图像的平衡数据集上的实验表明,在20次独立运行的平均值下,该模型实现了96.33%的准确率、94.09%的精确率、98.53%的召回率、96.25%的F1分数和99.50%的AUC。与基线ResNet-18相比,这些指标分别提高了3.01、0.70、5.37、2.98和5.42个百分点。所提出的机制无需侵入性修改,可无缝嵌入VGG-16、DenseNet-121等网络,带来稳定的性能提升,从而为计算机辅助诊断提供可靠支持。

英文摘要

Breast ultrasound imaging is an important noninvasive method for early breast cancer diagnosis, but automatic benign/malignant classification remains challenging due to tumor heterogeneity, blurred boundaries, and data imbalance. To improve feature representation and classification accuracy, this paper proposes the Cross-Stage Attention Mixture-of-Experts Network (CSA-MoE-Net). It adopts a Cross-Stage Attention-enhanced ResNet-18 as the backbone, in which the Cross-Stage Attention module adaptively recalibrates multi-level features, thereby enhancing key tumor features and suppressing redundancy. A three-branch Mixture of Experts (MoE) Block learns complementary features from the Whole Tumor Image, Tumor Core, and Boundary, and an Adaptive Gating Network fuses them to capture morphological, textural, and contextual information. The fused features are denoted as Fused Expert Feature (FEF) in the architecture. Experiments on a balanced dataset of 2,129 breast ultrasound images show that, averaged over 20 independent runs, the model achieves an accuracy of 96.33%, precision of 94.09%, recall of 98.53%, F1-score of 96.25%, and AUC of 99.50%. Compared to the baseline ResNet-18, these metrics improve by 3.01, 0.70, 5.37, 2.98, and 5.42 percentage points, respectively. The proposed mechanism requires no invasive modification and can be seamlessly embedded into VGG-16, DenseNet-121, etc., yielding stable performance gains, thus providing reliable support for computer-aided diagnosis.

2605.25517 2026-05-26 cs.AI

What Gets Cited: Competitive GEO in AI Answer Engines

什么被引用:AI 问答引擎中的竞争性生成式引擎优化

Rahul Vishwakarma, Shushant Kumar, Ratnesh Jamidar

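AI summary: Studies which factors determine which of two competing retrieved candidate sources an AI answer engine cites first; controlled experiments find that topical relevance and list position are the main drivers.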

AI中文摘要

AI 问答引擎从检索到的页面生成答案,但只引用少数来源。这使得可见性不仅取决于排名,还取决于被引用。我们研究竞争性生成式引擎优化(GEO):当两个检索到的候选源竞争时,什么因素使得其中一个更可能被首先引用?我们构建了一个受控的两文档检索增强生成(RAG)测试平台,将恰好两个候选源注入模型上下文,并测量输出中第一个引用标记引用了哪个源。在六个 LLM 上,我们执行了 252,000 次试验,在 18 个内容因素的一个析因程序下进行重复配对比较。在每次试验中,两个源恰好在一个因素上不同;我们使用品牌匿名化和平衡源顺序来将内容效应与位置偏差分离。混合效应模型显示,主题相关性和列表位置是被首先引用的最大驱动因素。包含明确的价格信息和最近的时间戳也持续有帮助。完整性和信任线索带来较小的增益,而仅格式编辑几乎没有影响。我们发布了一个可重复的评估协议和一个优先化的 GEO 检查清单供从业者使用,并在 Sprinklr 的早期内部试点中进行了实践,团队报告了对工作流可用性的积极定性反馈。

英文摘要

AI answer engines generate answers from retrieved pages but cite only a few sources. This makes visibility depend not just on ranking, but on being cited. We study competitive Generative Engine Optimization (GEO): when two retrieved candidates compete, what makes one more likely to be cited first? We build a controlled two-document retrieval-augmented generation (RAG) testbed that injects exactly two candidate sources into the model context and measures which source is referenced by the first citation marker in the output. Across six LLMs, we execute 252,000 trials as repeated paired comparisons under one factorial program over 18 content factors. In each trial, the two sources differ in exactly one factor; we use brand anonymization and counterbalanced source order to separate content effects from position bias. Mixed-effects models show that topical relevance and list position are the biggest drivers of being cited first. Including explicit price information and a recent timestamp also helps consistently. Completeness and trust cues add smaller gains, while formatting-only edits have little impact. We release a reproducible evaluation protocol and a prioritized GEO checklist for practitioners, and we exercised it in an early internal pilot at Sprinklr, where teams reported positive qualitative feedback on workflow usability.
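
The measurement described in the abstract (which source the first citation marker points to, under counterbalanced source order) can be sketched as follows. This is an illustration only: the bracketed `[1]`/`[2]` marker format, the function names, the `"treated"`/`"control"` labels, and the canned model output are all assumptions, not the authors' released protocol.

```python
import re


def first_cited_source(output_text, order):
    """Return the source label referenced by the first citation marker.

    order: labels in the positions shown to the model, e.g. ("treated", "control").
    Assumes citations look like [1] or [2]; returns None if none is found.
    """
    match = re.search(r"\[(\d+)\]", output_text)
    if match is None:
        return None
    index = int(match.group(1)) - 1
    return order[index] if 0 <= index < len(order) else None


def counterbalanced_orders(a, b):
    """Present each pair in both list positions."""
    return [(a, b), (b, a)]


def treated_win_rate(trials):
    """Fraction of trials in which the 'treated' source is cited first."""
    wins = sum(
        1 for output, order in trials
        if first_cited_source(output, order) == "treated"
    )
    return wins / len(trials)


# Hypothetical model that always cites position 1 first (pure position bias).
orders = counterbalanced_orders("treated", "control")
trials = [("Per [1], the plan costs $10. See also [2].", order) for order in orders]
rate = treated_win_rate(trials)
print(rate)
```

Because each pair appears in both orders, a model driven purely by position credits each source in half of the trials, so any departure from an even split can be attributed to the content factor rather than to list position.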