arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.01243 2026-06-02 cs.CL cs.LG

Unlocking the Black Box of Latent Reasoning: An Interpretability-Guided Approach to Intervention

解锁潜在推理的黑箱:一种可解释性引导的干预方法

Shuochen Chang, Tong Bai, Xiaofeng Zhang, Qianli Ma, Qingyang Liu, Zhaohe Liao, Yibo Miao, Li Niu

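Affiliations: Shanghai Jiao Tong University; Fudan University

AI summary: Analyzes the interpretability of latent reasoning vectors with structural, causal, and geometric probes, and builds on the findings to propose training-free, decode-time interventions that improve LLM reasoning accuracy.

Details
Journal ref
ACL2026 Main
AI Chinese abstract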

潜在推理使大型语言模型(LLMs)能够在连续隐藏状态内执行多步推理,相比显式思维链(CoT)提供了效率提升。然而,这些连续思维向量的不透明性阻碍了其可靠性和可控性。本文弥合了机械可解释性与可操作控制之间的差距。我们首先使用结构、因果和几何探针进行系统分析,揭示潜在向量编码了推理步骤的压缩、忠实表示,其中早期向量作为关键因果枢纽。在此基础上,我们将这些可解释性见解操作化为一套无需训练、解码时干预的方法,通过施加已识别的几何和语义先验来优化潜在推理过程。跨多个模型规模和不同任务领域的广泛实验表明,我们的方法持续提高了推理准确性。我们的可解释性引导干预一致地解锁了潜在能力,并在没有任何参数更新的情况下提高了推理准确性。

English abstract

Latent reasoning enables Large Language Models (LLMs) to perform multi-step inference within continuous hidden states, offering efficiency gains over explicit Chain-of-Thought (CoT). However, the opacity of these continuous thought vectors hinders their reliability and controllability. This paper bridges the gap between mechanistic interpretability and actionable control. We first present a systematic analysis using structural, causal, and geometric probes, revealing that latent vectors encode compressed, faithful representations of reasoning steps, with early vectors acting as critical causal hubs. Building on this, we operationalize these interpretability insights into a suite of training-free, decode-time interventions that refine the latent reasoning process by imposing the identified geometric and semantic priors. Extensive experiments across multiple model scales and diverse task domains demonstrate that our approaches consistently improve reasoning accuracy. Our interpretability-guided interventions consistently unlock latent capabilities and improve reasoning accuracy without any parameter updates.

2606.01240 2026-06-02 cs.CL

Efficient RAG with Intent-Aware Retrieval and Semantics-Preserving Chunking

基于意图感知检索与语义保持分块的高效RAG

Fachrina Dewi Puspitasari, Chaoning Zhang, Jiaquan Zhang, Zhicheng Wang, Hafiz Shakeel Ahmad Awan, Rizwan Qureshi, Jewon Lee, Tae-Ho Kim, Yang Yang

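Affiliations: School of Computer Science and Engineering, University of Electronic Science and Technology of China; Massachusetts General Hospital, Harvard University; Nota AI

AI summary: Proposes the InSemRAG framework, which combines an intent-aware retriever and a semantics-preserving chunking module with an iterative retrieve-and-check mechanism to address the information insufficiency of conventional RAG, with notable gains on multi-hop and evidence-sensitive tasks.

Details
AI Chinese abstract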

对大型语言模型(LLM)强大指令遵循和推理能力的需求推动了检索增强生成(RAG)的快速发展。RAG系统通过从外部数据库检索与查询匹配的补充知识块来辅助LLM生成。然而,传统RAG系统由于意图无关的检索和信息碎片化两个因素,存在信息不足的问题。我们的工作提出了一个名为InSemRAG的RAG框架,通过迭代检索-检查机制以及两个支持模块——意图感知检索器(IAR)和语义保持分块(SPC)来解决这些挑战。IAR实现了一种动态混合检索方法,根据查询意图自适应地加权检索通道,而SPC对受损的证据块进行检测和修复以保持语义完整性。为了减轻迭代机制带来的计算延迟,我们利用了小型语言模型(SLM)。在多个基准数据集上的大量实验一致表明,我们的方法相对于最近最先进的RAG机制具有竞争力。特别是在多跳和证据敏感任务上,我们的方法取得了显著提升,在HotPotQA上F1提高了2.65个百分点,在FEVER上准确率提高了1.5个百分点。我们的方法还利用SLM实现了与Multi-Hop RAG相当的性能,同时延迟降低了4.32倍。

English abstract

The demand for powerful instruction-following and reasoning capabilities in large language models (LLMs) has driven the rapid development of retrieval-augmented generation (RAG). A RAG system assists LLM generation by retrieving chunks of query-relevant supplementary knowledge from an external database. Conventional RAG systems, however, suffer from information insufficiency due to two factors: intent-agnostic retrieval and information fragmentation. Our work proposes a RAG framework, termed InSemRAG, that addresses these challenges via an iterative retrieve-and-check mechanism with two supporting modules, an intention-aware retriever (IAR) and semantics-preserving chunking (SPC). IAR implements a dynamic hybrid retrieval method that adaptively weights the retrieval channels based on the query intent, while SPC detects and repairs damaged evidence chunks to preserve semantic integrity. To alleviate the computational latency introduced by our iterative mechanism, we leverage small language models (SLMs). Extensive experiments across several benchmark datasets consistently demonstrate the competitiveness of our method against recent state-of-the-art RAG mechanisms. In particular, our method achieves significant gains on multi-hop and evidence-sensitive tasks, with a 2.65-point improvement in F1 on HotPotQA and a 1.5-point increase in accuracy on FEVER. By using SLMs, our method also achieves performance competitive with Multi-Hop RAG at 4.32$\times$ lower latency.
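
As a rough illustration of intent-dependent channel weighting (not InSemRAG's actual implementation, which the abstract does not specify), the sketch below blends a dense and a sparse retrieval score with a weight chosen per query intent. The intent labels, the weight table, and the names `hybrid_score` and `rank` are all hypothetical.

```python
# Hypothetical sketch: intent-dependent weighting of two retrieval channels.
# The intent labels and weights are illustrative assumptions, not values from the paper.
INTENT_WEIGHTS = {
    "factoid": 0.3,    # lean on sparse (keyword) matching
    "multi_hop": 0.7,  # lean on dense (semantic) matching
}


def hybrid_score(dense: float, sparse: float, intent: str) -> float:
    """Convex combination of dense and sparse scores; unknown intents get an equal blend."""
    w = INTENT_WEIGHTS.get(intent, 0.5)
    return w * dense + (1.0 - w) * sparse


def rank(candidates, intent):
    """Sort (chunk_id, dense_score, sparse_score) triples by hybrid score, best first."""
    return sorted(candidates, key=lambda c: hybrid_score(c[1], c[2], intent), reverse=True)


chunks = [("a", 0.9, 0.1), ("b", 0.2, 0.8)]
print([c[0] for c in rank(chunks, "multi_hop")])
print([c[0] for c in rank(chunks, "factoid")])
```

With these made-up scores, the same two chunks swap places depending on the intent label, which is the behaviour an adaptive weighting scheme is meant to produce.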

2605.04819 2026-06-02 cs.LG

Unsat Core Prediction through Polarity-Aware Representation Learning over Clause-Literal Hypergraphs

通过子句-文字超图上的极性感知表示学习进行不可满足核心预测

Zhenchao Sun, Shuai Ma, Ping Lu, Chongyang Tao

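Affiliations: arXiv.org University of Science and Technology of China

AI summary: Proposes a polarity-aware representation learning framework that models SAT formulas as clause-literal hypergraphs and predicts unsat cores effectively through polarity-aware decomposition and polarity-inversion consistency regularization.

Details
Comments
Accepted at ICML 2026
AI Chinese abstract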

图神经网络已广泛用于布尔可满足性(SAT)任务中,以从SAT公式中学习结构信息。这些研究的目标是解决SAT实例或增强SAT求解器,包括不可满足核心预测等任务。然而,大多数现有方法将SAT公式建模为二分图或有向无环图,这些方法在捕捉文字和子句之间的子句级和高阶交互方面不够直接。此外,这些方法在建模SAT固有的极性相关属性(如变量的正负文字之间的互补关系)方面存在局限性。为了解决这些局限性,我们提出了一种基于子句-文字超图的极性感知表示学习框架。我们将SAT公式建模为子句-文字超图,并辅以子句关联图以捕捉高阶结构交互。然后,我们引入一种极性感知分解机制,将变量表示分离为极性不变和等变分量,显式建模正负文字之间的关系,并将生成的文字表示沿超图结构传播。我们进一步引入极性反转一致性正则化,以在训练过程中强化极性一致的表示。在多个SAT数据集上的实验结果表明了该方法的有效性。

English abstract

Graph neural networks have been widely used in Boolean satisfiability (SAT) tasks to learn structural information from SAT formulas. The goal of these studies is to solve SAT instances or to enhance SAT solvers, including tasks such as unsat-core prediction. However, most existing approaches model a SAT formula as a bipartite graph or a directed acyclic graph, which are less direct in capturing clause-level and higher-order interactions among literals and clauses. Moreover, these approaches are limited in modeling intrinsic polarity-related properties of SAT, such as the complementary relationship between the positive and negative literals of a variable. To address these limitations, we propose a polarity-aware representation learning framework over clause-literal hypergraphs. We model SAT formulas as clause-literal hypergraphs augmented with a clause incidence graph to capture higher-order structural interactions. We then introduce a polarity-aware decomposition mechanism that separates variable representations into polarity invariant and equivariant components, explicitly modeling the relationship between positive and negative literals, with the resulting literal representations propagated along the hypergraph structure. We further incorporate a polarity-inversion consistency regularization to reinforce polarity-consistent representations during training. Experimental results on multiple SAT datasets demonstrate the effectiveness of the proposed approach.

2605.04638 2026-06-02 cs.CL cs.AI

Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models

相对于语义保持嵌入的梯度揭示大语言模型的不确定性

Mingda Li, Rundong Lv, Xinyu Li, Weinan Zhang, Ting Liu

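Affiliations: University of Science and Technology of China

AI summary: Proposes SemGrad, the first gradient-based uncertainty quantification method for free-form generation, which computes gradients in semantic space to give efficient, sampling-free uncertainty estimates.

Details
Comments
Accepted by ICML 2026
AI Chinese abstract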

不确定性量化(UQ)是确保大语言模型(LLM)可信度的重要技术,因为LLM容易产生幻觉。现有的自由文本生成UQ方法严重依赖采样,导致计算成本高且方差大。在这项工作中,我们提出了首个基于梯度的自由文本生成UQ方法SemGrad,它无需采样且计算高效。与先前针对分类任务开发的在参数空间中操作的梯度方法不同,我们提出在语义空间中考虑梯度。我们的方法基于一个关键直觉:自信的LLM应在语义等价的输入扰动下保持稳定的输出分布。我们将这种稳定性解释为语义空间中的梯度,并引入语义保持分数(SPS)来识别最能捕捉语义的嵌入,并针对这些嵌入计算梯度。我们进一步提出了HybridGrad,它结合了SemGrad和参数梯度的优势。实验表明,我们的两种方法都提供了高效且有效的不确定性估计,在多个有效响应的设置中尤其优于现有方法。

English abstract

Uncertainty quantification (UQ) is an important technique for ensuring the trustworthiness of LLMs, given their tendency to hallucinate. Existing state-of-the-art UQ approaches for free-form generation rely heavily on sampling, which incurs high computational cost and variance. In this work, we propose the first gradient-based UQ method for free-form generation, SemGrad, which is sampling-free and computationally efficient. Unlike prior gradient-based methods developed for classification tasks, which operate in parameter space, we propose to consider gradients in semantic space. Our method builds on the key intuition that a confident LLM should maintain stable output distributions under semantically equivalent input perturbations. We interpret this stability as the gradients in semantic space and introduce a Semantic Preservation Score (SPS) to identify embeddings that best capture semantics, with respect to which gradients are computed. We further propose HybridGrad, which combines the strengths of SemGrad and parameter gradients. Experiments demonstrate that both of our methods provide efficient and effective uncertainty estimates, outperforming state-of-the-art methods, particularly in settings with multiple valid responses.

2605.04193 2026-06-02 cs.AI cs.LG cs.LO

ANDRE: An Attention-based Neuro-symbolic Differentiable Rule Extractor for Inductive Logic Programming

ANDRE:一种基于注意力的神经符号可微规则提取器,用于归纳逻辑编程

Iman Sharifi, Peng Wei, Saber Fallah

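Affiliations: Dept. of Mechanical and Aerospace Engineering, George Washington University, USA; Dept. of Mechanical Engineering Sciences, University of Surrey, UK

AI summary: Proposes the ANDRE framework, which optimizes over a continuous rule space with attention-driven differentiable logical operators to learn first-order logic rules from probabilistic data while remaining robust and interpretable under noise.

Details
Comments
35 pages, 8 figures, 10 tables
AI Chinese abstract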

归纳逻辑编程(ILP)旨在从数据中学习可解释的一阶规则,但现有的符号和神经符号方法难以扩展到噪声和概率设置。经典ILP依赖于离散的组合规则搜索,在不确定性下脆弱,而可微ILP方法通常依赖预定义规则模板或不精确的模糊算子,这些算子在推理概率谓词估值时会遭受梯度消失或逻辑结构近似不佳的问题。本文提出基于注意力的神经符号可微规则提取器(ANDRE),一种新颖的ILP框架,通过基于注意力的逻辑算子优化连续规则空间来学习一阶逻辑程序。ANDRE用完全可微的、注意力驱动的合取和析取算子替代规则模板和逻辑算子,这些算子近似逻辑最小-最大语义,从而实现对概率数据的准确、稳定和可解释推理。通过在每条规则内软选择、否定或排除谓词,ANDRE在保持符号结构的同时支持灵活规则归纳。在经典ILP基准、大规模知识库以及带有概率谓词和噪声监督的合成数据集上的大量实验表明,ANDRE达到了有竞争力或更优的预测性能,同时在不确定性下可靠地恢复正确的符号规则。特别是,ANDRE对中等标签噪声保持鲁棒,在规则提取质量和稳定性上显著优于现有可微ILP方法。

English abstract

Inductive Logic Programming (ILP) aims to learn interpretable first-order rules from data, but existing symbolic and neuro-symbolic approaches struggle to scale to noisy and probabilistic settings. Classical ILP relies on discrete combinatorial rule search and is brittle under uncertainty, while differentiable ILP methods typically depend on predefined rule templates or inaccurate fuzzy operators that suffer from vanishing gradients or poor approximation of logical structure when reasoning over probabilistic predicate valuations. This paper proposes an Attention-based Neuro-symbolic Differentiable Rule Extractor (ANDRE), a novel ILP framework that learns first-order logic programs by optimizing over a continuous rule space with attention-based logical operators. ANDRE replaces both rule templates and logical operators with fully differentiable, attention-driven conjunction and disjunction operators that approximate logical min-max semantics, enabling accurate, stable, and interpretable reasoning over probabilistic data. By softly selecting, negating, or excluding predicates within each rule, ANDRE supports flexible rule induction while preserving symbolic structure. Extensive experiments on classical ILP benchmarks, large-scale knowledge bases, and synthetic datasets with probabilistic predicates and noisy supervision demonstrate that ANDRE achieves competitive or superior predictive performance while reliably recovering correct symbolic rules under uncertainty. In particular, ANDRE remains robust to moderate label noise, substantially outperforming existing differentiable ILP methods in both rule extraction quality and stability.
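
To make "attention-driven operators that approximate logical min-max semantics" more concrete, here is a generic smooth-min / smooth-max sketch: each operator is a softmax-weighted average of its inputs, so it stays differentiable while concentrating on the smallest (conjunction) or largest (disjunction) truth value. This is a common construction offered as an assumption; the abstract does not give ANDRE's actual operators, and the names `soft_and`, `soft_or`, and `temperature` are hypothetical.

```python
import math


def _softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def soft_and(values, temperature=0.1):
    """Attention-weighted average that concentrates on the smallest value (smooth min)."""
    weights = _softmax([-v / temperature for v in values])
    return sum(w * v for w, v in zip(weights, values))


def soft_or(values, temperature=0.1):
    """Attention-weighted average that concentrates on the largest value (smooth max)."""
    weights = _softmax([v / temperature for v in values])
    return sum(w * v for w, v in zip(weights, values))


truth = [0.9, 0.2, 0.7]  # probabilistic predicate valuations in [0, 1]
print(soft_and(truth), soft_or(truth))
```

Because both operators are convex combinations of their inputs, the outputs always lie between the minimum and maximum valuation; lowering `temperature` pushes them closer to the hard min and max.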

2605.03403 2026-06-02 cs.CV cs.LG

GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning

GRPO-TTA:基于GRPO驱动的强化学习进行视觉语言模型的测试时视觉调优

Yujun Li, Hongyuan Zhang, Yuan Yuan

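Affiliations: School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University

AI summary: Proposes GRPO-TTA, which applies GRPO to test-time adaptation by reformulating class-specific prompt prediction as a group-wise policy optimization problem and designing alignment and dispersion rewards, outperforming existing methods on diverse benchmarks.

Details
AI Chinese abstract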

组相对策略优化(GRPO)最近在大型语言模型和视觉语言模型的后训练中展现出强大性能。这引发了一个问题:GRPO是否也能显著促进视觉语言模型的测试时适应(TTA)。在本文中,我们提出了用于测试时适应的组相对策略优化(GRPO-TTA),通过将类特定提示预测重构为组策略优化问题,将GRPO适应到TTA设置。具体来说,我们通过从CLIP相似度分布中采样top-K类候选来构建输出组,从而在无需真实标签的情况下实现概率驱动的优化。此外,我们设计了针对测试时适应的奖励函数,包括对齐奖励和分散奖励,以指导有效的视觉编码器调优。在多种基准上的大量实验表明,GRPO-TTA一致优于现有的测试时适应方法,在自然分布偏移下性能提升尤为显著。

English abstract

Group Relative Policy Optimization (GRPO) has recently shown strong performance in post-training large language models and vision-language models. This raises the question of whether GRPO can also significantly benefit the test-time adaptation (TTA) of vision-language models. In this paper, we propose Group Relative Policy Optimization for Test-Time Adaptation (GRPO-TTA), which adapts GRPO to the TTA setting by reformulating class-specific prompt prediction as a group-wise policy optimization problem. Specifically, we construct output groups by sampling top-K class candidates from CLIP similarity distributions, enabling probability-driven optimization without access to ground-truth labels. Moreover, we design reward functions tailored to test-time adaptation, including alignment rewards and dispersion rewards, to guide effective visual encoder tuning. Extensive experiments across diverse benchmarks demonstrate that GRPO-TTA consistently outperforms existing test-time adaptation methods, with notably larger performance gains under natural distribution shifts.
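
The "group relative" part of GRPO refers to scoring each sampled output against the other outputs in its own group rather than against a learned value function. A minimal sketch of that normalisation follows; the reward values are invented, and how GRPO-TTA actually computes its alignment and dispersion rewards is not described in the abstract.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise each reward against the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for a group of four sampled class candidates.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))
```

Candidates that score above the group mean receive a positive advantage and the rest a negative one, so no ground-truth label is needed as long as a reward can be computed per candidate.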

2605.02277 2026-06-02 cs.CL

CECOR: Correction-oriented synthetic data construction for factual error correction

CECOR:面向事实错误纠正的修正导向合成数据构建

Lei Zhu, Xiaobao Wang, Jianbiao Yang, Chenyang Wang, Dongxiao He, Longbiao Wang, Jianwu Dang

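Affiliations: Tianjin University

AI summary: Proposes the CECoR framework, which synthesizes high-quality training data through a decomposition-and-injection paradigm and, combined with a two-stage learning strategy, improves the accuracy and robustness of multi-hop factual error correction.

Details
AI Chinese abstract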

事实错误纠正(FEC)旨在将不准确的文本修改为与外部证据事实一致的陈述。尽管近期方法在单跳纠正上表现良好,但它们通常将声明视为原子单元,难以处理需要跨多个证据源进行组合推理的多跳情况。有限的配对数据和复杂推理链中定位语义错误的困难进一步放大了这一挑战。我们提出了CECoR(基于推理感知的组合错误纠正),一个推理感知框架,引入了分解与注入范式用于组合错误纠正。CECoR将多跳声明分解为可解释的推理步骤,并注入受控扰动以合成高质量的训练对。结合监督微调和强化学习的两阶段学习策略提高了事实准确性和鲁棒性。全面评估表明,CECoR在多跳基准上取得了强劲性能,优于远程监督方法和少样本LLM基线。它还能有效泛化到单跳纠正,并在噪声证据下保持稳定,展示了其在真实世界事实纠正中的多功能性。

English abstract

Factual Error Correction (FEC) aims to revise inaccurate text into statements that are factually consistent with external evidence. Although recent methods perform well on single-hop correction, they often treat claims as atomic units and struggle with multi-hop cases that require compositional reasoning across multiple evidence sources. This challenge is further amplified by limited paired data and difficulties in locating semantic errors within complex reasoning chains. We present CECoR (Compositional Error Correction via Reasoning-aware Synthesis), a reasoning-aware framework that introduces a Decomposition and Injection paradigm for compositional error correction. CECoR decomposes multi-hop claims into interpretable reasoning steps and injects controlled perturbations to synthesize high-quality training pairs. A two-stage learning strategy combining supervised fine-tuning and reinforcement learning improves factual accuracy and robustness. Comprehensive evaluations show that CECoR achieves strong performance on multi-hop benchmarks, outperforming both distantly supervised methods and few-shot LLM baselines. It also generalizes effectively to single-hop correction and remains stable under noisy evidence, demonstrating its versatility for real-world factual correction.

2606.01238 2026-06-02 cs.RO cs.LG

Training-Free Imitation Learning with Closed-Form Diffusion Policies

基于闭式扩散策略的免训练模仿学习

Raghav Mishra, Ian R. Manchester

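Affiliations: Australian Center for Robotics, ARIAM Hub, and School of Aerospace, Mechanical and Mechatronic Engineering, University of Sydney

AI summary: Proposes Closed-Form Diffusion Policies (CFDP), training-free diffusion policies based on the closed-form score of the demonstration dataset, enabling real-time imitation learning in milliseconds, performing competitively with neural baselines that need hours of training, and supporting inference-time policy editing and demonstration augmentation.

Details
AI Chinese abstract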

尽管基于扩散的策略具有令人印象深刻的性能和表达能力,但其长时间离线训练拖慢了数据收集和策略部署循环。我们引入了闭式扩散策略(CFDP),这是一类使用从演示数据集导出的闭式得分的无训练扩散策略,用于模仿学习。我们在硬件实验中用移动CPU进行实时推理部署CFDP,表明它能够直接从数据集中毫秒级成功执行模仿,并且推理速度比神经扩散策略更快。在模仿学习基准实验中,我们展示了CFDP与需要数小时训练的神经基线相比具有竞争力,在训练时间和性能之间提供了有利的权衡。最后,我们展示了闭式扩散策略如何作为一种可组合原语,实现对预训练神经扩散策略的数据驱动推理时编辑,包括策略引导和新颖的演示增强。

English abstract

While diffusion-based policies have impressive performance and expressivity, their long offline training slows down the data collection and policy deployment loop. We introduce Closed-Form Diffusion Policies (CFDP), a class of training-free diffusion-based policies for imitation learning using the closed-form score derived from the demonstration dataset. We deploy CFDP with real-time inference on a mobile CPU in hardware experiments, showing it can successfully perform imitation directly from the dataset in milliseconds and with faster inference than neural diffusion policies. In experiments on imitation learning benchmarks, we show that CFDP is competitive against neural baselines that require hours of training, providing a favorable tradeoff between training time and performance. Finally, we show how closed-form diffusion policies act as a composable primitive that enables data-driven inference-time editing of pre-trained neural diffusion policies, including policy guidance and novel demonstration augmentation.
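
One standard way a score can be written in closed form from a dataset is to smooth the empirical distribution with a Gaussian: for p(x) = (1/N) Σ Normal(x; x_i, σ²), the score is a softmax-weighted average of (x_i − x)/σ². The scalar sketch below shows that formula only. It is an assumption that CFDP builds on something of this shape; the paper's conditioning on observations and its noise schedule are not covered, and `closed_form_score` is a hypothetical name.

```python
import math


def closed_form_score(x, data, sigma):
    """Score d/dx log p(x) for p(x) = (1/N) * sum_i Normal(x; data[i], sigma^2), scalar x."""
    logits = [-((x - d) ** 2) / (2 * sigma ** 2) for d in data]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    weights = [e / z for e in exps]
    return sum(w * (d - x) for w, d in zip(weights, data)) / sigma ** 2


demos = [-1.0, 1.0]  # two hypothetical demonstration actions
print(closed_form_score(0.0, demos, 0.5))  # midway between the two demonstrations
print(closed_form_score(0.9, demos, 0.5))  # near the demonstration at 1.0
```

No network is trained here: evaluating the score only requires a pass over the stored demonstrations, which is what makes a training-free policy possible in principle.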

2606.01237 2026-06-02 cs.AI

Brain-Atlas-Guided Generative Counterfactual Attention for Explainable Cognitive Decline Diagnosis Using Multimodal Connectomes

脑图谱引导的生成式反事实注意力用于基于多模态连接组的可解释认知衰退诊断

Xiongri Shen, Jiaqi Wang, Zhenxi Song, Yi Zhong, Leilei Zhao, Xin He, Baiying Lei, Zhiguo Zhang

Affiliations: Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen); School of Intelligence Science and Engineering, College of Artificial Intelligence, Harbin Institute of Technology; School of Biomedical Engineering, National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, Shenzhen University Medical School, Shenzhen University

AI summary: Proposes a brain-atlas-knowledge-guided Generative Counterfactual Attention Network (GCAN) that formulates diagnosis as a source-to-target counterfactual generation problem and uses multimodal connectomes for explainable cognitive decline diagnosis.

Details
AI Chinese abstract

轻度认知障碍(MCI)和主观认知衰退(SCD)与早期阿尔茨海默病连续谱密切相关,准确且可解释的诊断对于早期风险评估和干预至关重要。现有的基于连接组的深度学习模型可以提高分类性能,但通常对疾病相关的功能和结构连接变化提供的洞察有限。本文提出了一种图谱知识引导的生成式反事实注意力网络(GCAN),用于使用多模态脑连接组进行可解释的认知衰退诊断。GCAN将诊断建模为源到目标的反事实生成问题,其中从源标签输入生成目标标签连接组,并利用它们的差异构建反事实注意力图。为了保持连接组拓扑,一种图谱感知的双向Transformer(AABT)在脑图谱约束下执行网络级令牌编码和解码。该框架进一步从功能连接(FC)扩展到联合功能和结构连接(SC)建模,从而实现对互补功能重组和结构拓扑变化的反事实分析。在医院收集的数据集和ADNI数据集上的实验表明,GCAN在HC vs. SCD、HC vs. MCI和SCD vs. MCI分类任务中取得了竞争性能。可视化、圆形连接组分析、基于CAM的比较、消融研究和置信区间分析进一步支持了所提框架的可解释性和可靠性。使用特定模态的FC和SC预训练分类器为反事实生成提供目标状态先验,同时将其与下游诊断分类器分离以防止数据泄露。

English abstract

Mild cognitive impairment (MCI) and subjective cognitive decline (SCD) are closely associated with the early Alzheimer's disease continuum, where accurate and explainable diagnosis is important for early risk assessment and intervention. Existing connectome-based deep learning models can improve classification performance but often provide limited insight into disease-related functional and structural connectivity changes. This paper proposes an atlas-knowledge-guided Generative Counterfactual Attention-guided Network (GCAN) for explainable cognitive decline diagnosis using multimodal brain connectomes. GCAN formulates diagnosis as a source-to-target counterfactual generation problem, where target-label connectomes are generated from source-label inputs and their differences are used to construct counterfactual attention maps. To preserve connectome topology, an Atlas-aware Bidirectional Transformer (AABT) performs network-level token encoding and decoding under brain-atlas constraints. The framework is further extended from functional connectivity (FC) to joint functional and structural connectivity (SC) modeling, enabling counterfactual analysis of complementary functional reorganization and structural topology changes. Experiments on hospital-collected and ADNI datasets show that GCAN achieves competitive performance across HC vs. SCD, HC vs. MCI, and SCD vs. MCI classification tasks. Visualization, circular connectome analysis, CAM-based comparison, ablation studies, and confidence interval analysis further support the interpretability and reliability of the proposed framework. Modality-specific FC and SC pre-trained classifiers are used to provide target-state priors for counterfactual generation while being separated from the downstream diagnostic classifier to prevent data leakage.

2606.01230 2026-06-02 cs.AI

HomeFlow: A Data Flywheel for Smart Home Agent Training with Verifiable Simulation

HomeFlow: 面向智能家居智能体训练的可验证数据飞轮

Yi Gu, Huacan Wang, Shuo Zhang, Yuqing Hou, Lei Xue, Weipeng Ming, Chen Liu, Fangzhou Yu, Kuan Li, Ronghao Chen, Sen Hu, Xiaofeng Mou, Yi Xu

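Affiliations: Midea Group; Beijing University of Posts and Telecommunications; Donghua University; Peking University

AI summary: Proposes HomeFlow, a verifiable data flywheel that uses the unified simulation environment HomeEnv, procedural home generation with HomeMaker, Blueprint to compile user intents, and MCTS-Flow to synthesize verifiable trajectories, then optimizes agents with supervised fine-tuning and step-wise RLVE. HomeFlow-RL-4B and HomeFlow-RL-8B reach task success rates of 84.60% and 87.03% on SmartHome-Bench, with the 8B model surpassing GPT-5.5 by 1.23 percentage points.

Details
AI Chinese abstract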

大型语言模型智能体正从纯文本交互转向物理世界控制,智能家居是一个代表性领域。真实的家庭交互需要理解模糊意图、在动态环境中操作以及进行多轮推理。然而,现有方法难以生成用于智能家居智能体的高质量训练数据。我们提出HomeFlow,一个针对该领域的可验证数据飞轮。HomeFlow使用HomeEnv作为统一仿真环境,HomeMaker程序化生成多样化的家居设置。随后,Blueprint将开放式的用户意图编译为可执行的基于状态的成功条件,而MCTS-Flow通过环境引导的树搜索合成多样化的、可验证的多轮轨迹。然后我们通过监督微调和逐步RLVE优化智能体,通过真实的物理反馈促进迭代改进。我们进一步构建了SmartHome-Bench来评估智能体在各种智能家居任务上的表现。在该基准上,HomeFlow-RL-4B和HomeFlow-RL-8B分别达到了84.60%和87.03%的任务成功率。值得注意的是,HomeFlow-RL-8B甚至超过了领先的GPT-5.5 1.23个百分点。

English abstract

Large language model agents are moving beyond text-only interaction toward physical-world control, with smart homes as a representative domain. Real domestic interaction requires understanding ambiguous intents, operating in dynamic environments, and performing multi-turn reasoning. However, existing methods struggle to generate high-quality training data for smart home agents. We propose HomeFlow, a verifiable data flywheel for this domain. HomeFlow uses HomeEnv as a unified simulation environment and HomeMaker to procedurally generate diverse home settings. Subsequently, Blueprint compiles open-ended user intents into executable state-based success conditions, while MCTS-Flow synthesizes diverse, verifiable multi-turn trajectories through environment-guided tree search. We then optimize the agents via supervised fine-tuning and step-wise RLVE, which facilitates iterative improvement through authentic physical feedback. We further construct SmartHome-Bench to evaluate the agent across various smart home tasks. On this benchmark, HomeFlow-RL-4B and HomeFlow-RL-8B achieve task success rates of 84.60% and 87.03%. It is worth noting that HomeFlow-RL-8B even surpasses the leading GPT-5.5 by 1.23 percentage points.

2606.01227 2026-06-02 cs.LG q-bio.NC

DAGGER: Gradient-Free Construction of Transiently Amplifying Networks under Hard Connectivity Constraints

DAGGER: 硬连接约束下瞬态放大网络的无梯度构造

James C. Ferguson

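Affiliations: The African Institute for Mathematical Sciences; Institute of Science and Technology Austria

AI summary: Proposes DAGGER, a gradient-free single-pass algorithm that constructs transiently amplifying networks under hard sign/sparsity/diagonal constraints, with a single scalar β controlling a Wasserstein-2 budget that smoothly trades multiset preservation for amplification.

Details
Comments
12 pages, 7 figures
AI Chinese abstract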

许多网络不仅支持而且依赖于瞬态非正态放大,即稳定系统的活动增加数个数量级。在硬符号/稀疏/对角约束(与生物连接组和结构化RNN初始化相关的区域)下构造此类网络,迄今为止需要基于梯度的局部搜索(包含数千次内循环特征分解)或基于Schur形式的直接构造(在抽象基中,投影后破坏约束)。 本文提出DAGGER(有向无环图引导边重加权),一种无梯度单遍算法。给定稳定的有符号稀疏矩阵,DAGGER产生具有相同符号、稀疏性和对角的输出。单一标量β控制Wasserstein-2预算,平滑地权衡精确多重集保留(β=0)与放大;峰值放大随β几乎无界增长,经验上在数值溢出前达到10^10。 在单次前向传递中,DAGGER在多重集保留方面匹配或超过基于梯度的方法(比典型梯度内循环少30-100倍特征分解),并且在中等β下,在精确保持连接性的同时,超过它们数个数量级。我们开发了该算法,将其与现有方法以及下游信号检测任务进行比较,并检查了显示DAGGER在结构上与其他放大网络不同的诊断结果。

English abstract

Many networks not only support but also rely on transient non-normal amplification, an orders-of-magnitude increase in the activity of an otherwise stable system. Constructing such networks under hard sign/sparsity/diagonal constraints -- the regime relevant for biological connectomes and structured RNN initializations -- has so far required either gradient-based local search with thousands of inner-loop eigendecompositions or Schur-form direct construction in an abstract basis that breaks the constraints under projection. Here we introduce DAGGER (Directed Acyclic Graph Guided Edge Reweighting), a gradient-free single-pass algorithm. Given a stable signed sparse matrix, DAGGER produces an output with the same sign, sparsity, and diagonal. A single scalar $β$ controls a Wasserstein-2 budget that smoothly trades exact multiset preservation ($β= 0$) for amplification; peak amplification grows essentially without bound with $β$, empirically reaching $10^{10}$ before numerical overflow. DAGGER matches or exceeds gradient-based methods at multiset preservation in a single forward pass -- 30-100$\times$ fewer eigendecompositions than a typical gradient inner loop -- and at moderate $β$ beats them by orders of magnitude with connectivity exactly preserved. We develop the algorithm, compare it to the existing methods and on a downstream signal-detection task, and examine the diagnostics that show why DAGGER is structurally different from other amplifying networks.
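
Transient non-normal amplification can be seen in a two-dimensional toy system. The sketch below uses the hand-picked matrix A = [[-1, k], [0, -2]], which is stable (eigenvalues -1 and -2) but non-normal, and the closed-form solution of dx/dt = Ax from x(0) = (0, 1). It illustrates the phenomenon only and has nothing to do with how DAGGER constructs its networks; the matrix, `k`, and `state_norm` are assumptions chosen for the example.

```python
import math


def state_norm(t, k):
    """Euclidean norm of x(t) for dx/dt = A x, A = [[-1, k], [0, -2]], x(0) = (0, 1).

    Closed-form solution: x2(t) = exp(-2t), x1(t) = k * (exp(-t) - exp(-2t)).
    """
    x1 = k * (math.exp(-t) - math.exp(-2 * t))
    x2 = math.exp(-2 * t)
    return math.hypot(x1, x2)


k = 100.0  # strength of the feedforward (non-normal) coupling
times = [i * 0.01 for i in range(1001)]  # t from 0 to 10
peak = max(state_norm(t, k) for t in times)
print(f"initial {state_norm(0.0, k):.2f}, peak {peak:.2f}, final {state_norm(10.0, k):.4f}")
```

The state starts at unit norm, grows by a factor of roughly k/4 around t = ln 2, and then decays to zero, even though every eigenvalue of A is negative. Increasing `k` raises the peak without changing the eigenvalues.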

2606.01223 2026-06-02 cs.CL cs.AI

Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue

连接点:长时对话中的反思性记忆基准测试

Jingjie Lin, Bingbing Wang, Zihan Wang, Zhengda Jin, Weiming Qiao, Jing Li, Ruifeng Xu

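Affiliations: Harbin Institute of Technology, Shenzhen; The Hong Kong Polytechnic University; Fudan University; Peng Cheng Laboratory

AI summary: Existing benchmarks cannot measure reflective memory, the ability to synthesize fragmented multimodal cues into high-level interpretations. This work proposes the RefMem-Bench benchmark and the hierarchical REMIND framework, which improves reflective memory through progressive evidence perception, grounding, and abstraction.

Details
Comments
9 pages, 6 figures
AI Chinese abstract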

尽管长上下文建模取得了显著进展,现有基准仍局限于显式回忆的事实性记忆,未能衡量将碎片化、多模态线索合成为高层解释所需的反思性记忆。为填补这一空白,我们引入了RefMem-Bench,一个用于长时对话中反思性记忆的基准。RefMem-Bench包含26K个带注释的问答实例,涵盖八个反思性记忆维度和三种任务格式,要求模型超越表面检索,从分布在整个交互历史中的证据推断潜在含义。为增强反思性记忆能力,我们提出了反思性记忆归纳(REMIND),一个将反思性记忆视为渐进意义构建的层次框架。REMIND结合了问题条件证据检索、显著性感知定位和抽象级别监督,并使用渐进式反思对齐将高层反思性推理提炼到事实推理路径中。实验表明,RefMem-Bench对当前模型构成了重大挑战,而REMIND通过渐进式证据感知、定位和抽象,持续提高了答案准确性和记忆回忆率。

English abstract

Despite substantial progress in long-context modeling, existing benchmarks remain confined to factual memory for explicit recall, failing to measure the reflective memory required to synthesize fragmented, multimodal cues into high-level interpretations. To address this gap, we introduce RefMem-Bench, a benchmark for reflective memory in long-horizon dialogue. RefMem-Bench contains 26K annotated QA instances with eight reflective-memory dimensions and three task formats, requiring models to move beyond surface-level retrieval and infer latent meanings from evidence distributed across interaction histories. To enhance reflective memory capability, we propose REflective Memory INDuction (REMIND), a hierarchical framework that treats reflective memory as progressive meaning construction. REMIND couples question-conditioned evidence retrieval, salience-aware grounding, and abstraction-level supervision, and uses Progressive Reflective Alignment to distill high-level reflective reasoning into the factual inference pathway. Experiments show RefMem-Bench poses a substantial challenge to current models, while REMIND consistently improves both answer accuracy and memory recall through progressive evidence perception, grounding, and abstraction.

2606.01221 2026-06-02 cs.LG cs.AI

Hybrid Imbalanced Regression Through Unified Data-Level and Algorithm-Level Balancing

混合不平衡回归:统一的数据级与算法级平衡方法

Shermin Shahbazi, Hossein Mohammadi, Mohsen Afsharchi

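Affiliations: Zahedan National University

AI summary: Proposes a five-stage hybrid framework that combines adaptive binning, a conditional variational autoencoder, feature-space clustering with oversampling, a latent-density weighted loss, and attention-based gated fusion to address imbalance in regression.

Details
Journal ref
Expert Systems with Applications, Date: 1 August 2026, Article: 131908, Volume: 322
Comments
52 pages, 20 figures, accepted at Expert Systems with Applications
AI Chinese abstract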

不平衡学习是机器学习中的一个关键挑战,其中代表性不足的目标值可能使模型产生偏差,并降低对罕见但重要案例的预测性能。尽管在分类中得到了广泛研究,不平衡回归仍然相对未被充分探索。现有方法主要关注数据级平衡(可能引入噪声和过拟合)或算法级平衡(通常难以处理高度复杂的目标分布)。为了解决这些局限性,我们提出了一个统一的混合框架,将数据级和算法级平衡策略集成到一个与回归器无关的流水线中。该框架包括五个阶段:(1)自适应分箱划分,基于局部线性一致性动态分割目标空间;(2)使用条件变分自编码器进行目标条件表示学习;(3)通过特征空间聚类和少数类过采样进行多阶段数据级平衡;(4)使用新颖的潜在密度加权损失(LDWL)进行算法级平衡,以强调潜在空间和目标空间中的稀有样本;(5)基于注意力的门控融合用于最终回归。在基准数据集上的实验结果表明,与单独的回归器和现有的不平衡回归方法相比,所提出的框架持续提高了预测性能。

English abstract

Imbalanced learning is a critical challenge in machine learning, where underrepresented target values can bias models and degrade prediction performance on rare but important cases. Although extensively studied in classification, imbalanced regression remains relatively underexplored. Existing methods mainly focus on either data-level balancing, which may introduce noise and overfitting, or algorithm-level balancing, which often struggles with highly complex target distributions. To address these limitations, we propose a unified hybrid framework that integrates both data- and algorithm-level balancing strategies into a regressor-agnostic pipeline. The proposed framework consists of five stages: (1) adaptive bin partitioning to dynamically segment the target space based on local linear coherence; (2) target-conditioned representation learning using a Conditional Variational Autoencoder; (3) multistage data-level balancing through feature-space clustering and oversampling of minority clusters; (4) algorithm-level balancing using a novel Latent-Density Weighted Loss (LDWL) to emphasize rare samples in latent and target spaces; and (5) attention-based gated fusion for final regression. Experimental results on benchmark datasets demonstrate that the proposed framework consistently improves predictive performance compared to standalone regressors and existing imbalanced regression approaches.
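
As a simplified picture of algorithm-level balancing, the sketch below reweights a squared-error loss by the inverse frequency of each sample's target bin, so rare target values count for more. It is a generic target-space weighting and not the paper's Latent-Density Weighted Loss, which also uses latent-space density; the fixed `bin_width` and both function names are hypothetical.

```python
from collections import Counter


def inverse_density_weights(targets, bin_width):
    """Weight each sample by 1 / (size of its target bin), rescaled to have mean 1."""
    bins = [int(t // bin_width) for t in targets]
    counts = Counter(bins)
    raw = [1.0 / counts[b] for b in bins]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


def weighted_mse(preds, targets, weights):
    """Mean squared error with per-sample weights."""
    return sum(w * (p - t) ** 2 for p, t, w in zip(preds, targets, weights)) / len(targets)


targets = [1.0, 1.2, 1.4, 9.5]  # three common values and one rare one
weights = inverse_density_weights(targets, bin_width=1.0)
print(weights)
print(weighted_mse([1.0, 1.0, 1.0, 1.0], targets, weights))
```

With these toy targets the single rare sample receives three times the weight of each common sample, so a model that ignores it is penalised more heavily than under a plain mean squared error.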

2606.01220 2026-06-02 cs.LG cs.AI

Fine-Tuning Diffusion Models for Molecular Generation via Reinforcement Learning and Fast Sampling

通过强化学习和快速采样微调扩散模型用于分子生成

Guang Lin, Shikui Tu, Lei Xu

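Affiliations: Department of Computer Science and Engineering, Shanghai Jiao Tong University; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)

AI summary: Proposes the FTDiff framework, which combines group relative policy optimization with a fast sampling mechanism to fine-tune diffusion models to generate high-quality molecules that satisfy multi-objective drug design constraints.

Details
Comments
13 pages, 7 figures
AI Chinese abstract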

生成同时满足类药性质并符合目标蛋白三维结构的分子是基于结构的药物设计(SBDD)中的核心挑战。然而,现有的生成方法通常依赖于采样过程中昂贵的后处理或训练时需要精心策划的数据集,但增益仍然有限。这些限制在多目标设置中尤为突出,平衡冲突标准仍是一个核心挑战。为了解决这些问题,我们提出了FTDiff,一个专为结构约束下基于扩散的分子生成量身定制的强化学习微调框架。为了确保稳定且样本高效的优化,FTDiff采用了组相对策略优化(GRPO)风格策略。此外,FTDiff基于一个无时间预训练扩散模型,并集成了快速采样机制,减少了去噪步数,在保持生成质量的同时显著加速了训练和推理。通过优化一个固定阈值感知的奖励,FTDiff有效引导模型生成有效、多样且高质量的分子,平衡多个药物设计目标。在基准数据集上的大量实验表明,FTDiff始终优于先前的方法,且无需昂贵的后处理优化或复杂的数据工程。

English abstract

Generating molecules that simultaneously satisfy drug-like properties and conform to the 3D structure of a target protein is a core challenge in structure-based drug design (SBDD). Existing generative approaches, however, often rely on costly post-hoc processing during sampling or require carefully curated datasets during training, yet still achieve modest gains. These limitations are especially pronounced in multi-objective settings, where balancing conflicting criteria remains a core challenge. To address these challenges, we propose FTDiff, a reinforcement learning fine-tuning framework tailored for diffusion-based molecular generation under structural constraints. To ensure stable and sample-efficient optimization, FTDiff adopts a group relative policy optimization (GRPO) style strategy. Furthermore, FTDiff builds upon a time-free pretrained diffusion model and incorporates a fast sampling mechanism that reduces the number of denoising steps, significantly accelerating both training and inference while maintaining generation quality. By optimizing a fixed threshold-aware reward, FTDiff effectively guides the model to produce valid, diverse, and high-quality molecules that balance multiple drug design objectives. Extensive experiments on benchmark datasets demonstrate that FTDiff consistently outperforms prior methods, without requiring expensive post-hoc optimization or intricate data engineering.

2606.01217 2026-06-02 cs.CV cs.LG stat.AP

Analysis of Ethnic Disparities in Autism Spectrum Disorder among Toddlers

幼儿自闭症谱系障碍中的种族差异分析

Aadithya Prabha Ramaharsha, Deevna Reddy, Uma Ranjan

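Affiliations: Sri Ramachandra Institute of Higher Education and Research

AI summary: Uses logistic regression to study the effects of ethnicity, behavioural scores, sex, and neonatal jaundice on autism spectrum disorder (ASD) in toddlers, finding that White Europeans have 81% higher and Middle Easterners 79% lower ASD risk than Asians, and confirming neonatal jaundice and male sex as significant risk factors.

Details
Comments
Third International Conference Biomedical Engineering Science and technology
AI Chinese abstract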

自闭症谱系障碍(ASD)是一种以沟通和行为挑战为特征的神经发育障碍。本研究考察了种族与ASD特征之间的关系,以及行为评分、性别和新生儿黄疸在三个种族群体(白种人、亚洲人和中东人)中的差异。我们进行了逻辑回归分析,表明种族对ASD发病率有显著影响。与亚洲人相比,白种人患ASD的风险增加81%,而中东人患ASD的风险降低79%。我们还证实了早期研究,即新生儿黄疸是ASD的重要预测因子,而男性儿童患ASD的风险远高于女性儿童。这些结果表明,需要建立在ASD特征的表现与评估中考虑种族差异的诊断框架和干预措施。

英文摘要

Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder characterized by challenges in communication and behavior. This study examines the relationship between ethnicity and ASD traits, along with behavioural scores, sex and neonatal jaundice across three ethnic groups: White Europeans, Asians, and Middle Eastern individuals. We perform a logistic regression and show that ethnicity has a significant effect on incidence of ASD. White Europeans are at 81% increased risk of ASD and Middle Easterners are at 79% reduced risk of ASD compared to Asians. We also confirm earlier studies which show that neonatal jaundice is a significant predictor of ASD, while male children are at much higher risk of ASD compared to female children. These results suggest the need for diagnostic frameworks and interventions that account for ethnic differences in the presentation and assessment of ASD traits.

2606.01216 2026-06-02 cs.LG math.OC

Riemannian Optimization for Hadamard Products of Low-Rank Matrices

低秩矩阵的Hadamard积的黎曼优化

Pratik Jawanpuria, Ankish Chandresh, Bamdev Mishra

发表机构 * Centre for Machine Intelligence and Data Science, Indian Institute of Technology Bombay, India(机器智能与数据科学中心,印度理工学院孟买分校,印度) Microsoft India(微软印度)

AI总结 针对低秩矩阵Hadamard积因子的耦合缩放对称性,提出一种基于黎曼商流形的块对角度量,并开发了线性复杂度的梯度下降算法。

详情
AI中文摘要

两个低秩矩阵的逐元素Hadamard积为具有乘法结构的数据提供了一种参数高效的模型,但由于两个因子之间存在耦合的行/列缩放对称性,其建模具有挑战性。为了利用空间几何,我们将此类矩阵的学习问题转化为黎曼商流形上的优化问题。我们提出了一种新的块对角黎曼度量,该度量由Frobenius内积的拉回导出,并证明该度量在完整对称群下不变。我们开发了一种黎曼梯度下降算法,该算法使用无需调参的Gauss-Newton步长,且每次迭代的计算复杂度与观测条目数呈线性关系。在真实和合成数据集上的实验验证了我们提出的黎曼方法的有效性。

英文摘要

The elementwise Hadamard product of two low-rank matrices provides a parameter-efficient model for data with multiplicative structure, but its modeling is challenging due to the presence of additional symmetries under coupled row/column scalings between the two factors. In order to leverage the geometry of the space, we formulate the learning of such matrices as optimization on a Riemannian quotient manifold. We propose a novel block-diagonal Riemannian metric derived from the pullback of the Frobenius inner product. The metric is shown to be invariant under the full symmetry group. We develop a Riemannian gradient descent algorithm that uses a tuning-free Gauss-Newton step size and scales linearly in the number of observed entries per iteration. Experiments on real and synthetic datasets illustrate the efficacy of our proposed Riemannian approach.
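
The coupled scaling symmetry mentioned in the abstract can be checked numerically. The sketch below is an illustration only, not the authors' code: it assumes the model X = (U1 V1^T) ∘ (U2 V2^T) and tests one instance of the symmetry, where the rows of the first factor are scaled by d and the rows of the second by 1/d. The helper names (`matmul_t`, `hadamard`, `scale_rows`) are hypothetical.

```python
import random

def matmul_t(U, V):
    """Return U @ V^T, with U (m x r) and V (n x r) given as lists of rows."""
    return [[sum(u[k] * v[k] for k in range(len(u))) for v in V] for u in U]

def hadamard(A, B):
    """Elementwise product of two equally sized matrices."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scale_rows(U, d):
    """Multiply row i of U by d[i] (left-multiplication by diag(d))."""
    return [[d_i * x for x in row] for d_i, row in zip(d, U)]

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
m, n, r = 4, 3, 2
U1, V1 = rand_matrix(m, r), rand_matrix(n, r)
U2, V2 = rand_matrix(m, r), rand_matrix(n, r)

# Assumed symmetry instance: scale rows of factor 1 by d and factor 2 by 1/d.
d = [random.uniform(0.5, 2.0) for _ in range(m)]
d_inv = [1.0 / x for x in d]

X = hadamard(matmul_t(U1, V1), matmul_t(U2, V2))
X_scaled = hadamard(matmul_t(scale_rows(U1, d), V1),
                    matmul_t(scale_rows(U2, d_inv), V2))

# The product is unchanged up to floating-point rounding.
max_diff = max(abs(a - b) for ra, rb in zip(X, X_scaled) for a, b in zip(ra, rb))
print(max_diff < 1e-9)
```

Because many factor pairs give the same product, plain gradient descent on the factors moves along directions that do not change the model, which is the motivation the abstract gives for working on a quotient manifold.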

2606.01215 2026-06-02 cs.CV cs.AI cs.CL cs.MM

Distilling Neuro-Symbolic Programs into 3D Multi-modal LLMs

将神经符号程序蒸馏到3D多模态大语言模型中

Wentao Mo, Yang Liu

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出APEIRIA,通过三阶段课程学习将符号推理模式蒸馏到3D多模态大语言模型中,实现透明推理与开放词汇空间推理的统一。

详情
Comments
To appear in ICML 2026
AI中文摘要

当前的3D空间推理方法面临根本性权衡:神经符号3D(NS3D)概念学习器通过组合程序实现可解释推理,但受限于封闭集概念词汇和简单程序;端到端3D多模态大语言模型(3D MLLMs)能处理复杂自然语言和开放词汇概念,但缺乏显式空间验证的黑箱推理。我们提出APEIRIA,一种神经符号3D MLLM,通过将符号推理模式以自然语言思维链形式蒸馏到MLLMs中,桥接两种范式。我们的三阶段课程逐步构建推理能力:a) 3D感知对齐将物体视觉-几何特征接地到LLM,b) CoT-SFT从符号程序轨迹中教授查询分解和逐步验证,c) CoT-RL将推理模式扩展到开放集概念和深度嵌套指令。通过迁移推理模式而非概念特定知识,APEIRIA保留了NS3D的关键优点:透明推理以及规划和感知组件的模块化可互换性。在接地、问答和描述任务上的评估表明,APEIRIA超越了先前的NS3D方法,并在3D空间推理数据集上匹配最先进的3D MLLMs,统一了符号方法的系统推理与MLLMs的灵活性。代码见https://github.com/oceanflowlab/APEIRIA。

英文摘要

Current 3D spatial reasoning methods face a fundamental trade-off: neuro-symbolic 3D (NS3D) concept learners achieve interpretable reasoning through compositional programs but are constrained to closed-set concept vocabularies and simple programs; end-to-end 3D multi-modal LLMs (3D MLLMs) could handle complex natural language and open-vocabulary concepts but suffer from black-box reasoning without explicit spatial verification. We introduce APEIRIA, a neuro-symbolic 3D MLLM to bridge two paradigms by distilling symbolic reasoning patterns into MLLMs with natural language chain-of-thought. Our three-stage curriculum progressively builds reasoning capabilities: a) 3D perception alignment grounds object visual-geometric features to the LLM, b) CoT-SFT teaches query decomposition and stepwise verification from symbolic program traces, and c) CoT-RL extends reasoning patterns to open-set concepts and deeply nested instructions. By transferring reasoning patterns rather than concept-specific knowledge, APEIRIA preserves key NS3D virtues: transparent reasoning and modular interchangeability of planning and perception components. Evaluations on grounding, question answering, and captioning show that APEIRIA surpasses prior NS3D methods and matches state-of-the-art 3D MLLMs on 3D spatial reasoning datasets, unifying symbolic methods' systematic reasoning with MLLMs' flexibility. Code is available at https://github.com/oceanflowlab/APEIRIA.

2606.01213 2026-06-02 cs.CV cs.AI cs.CL

TECCI: Tricky Edits of Collected and Curated Images

TECCI:收集与策划图像的棘手编辑

Aishwarya Agrawal, Roy Hirsch, Yasumasa Onoe, Sherry Ben, Jason Baldridge

发表机构 * Google Research(谷歌研究) Google DeepMind(谷歌DeepMind)

AI总结 提出TECCI基准,包含7550对图像与编辑指令,通过人工与自动评估揭示现有图像编辑模型在指令遵循、最小编辑和视觉质量方面的不足。

详情
AI中文摘要

尽管近期取得了巨大进展,但当前的文本引导图像编辑方法在涉及指令遵循、最小化编辑源图像以及确保高视觉质量等多个方面仍面临困难。当请求的编辑具有挑战性时,例如涉及位置、运动、视角、比例和创意编辑,这些问题尤为明显。为了系统性地测试生成式图像编辑器,我们提出了一个新的图像编辑基准——TECCI:收集与策划图像的棘手编辑。TECCI包含我们发布的全新图像集。TECCI中的图像涵盖7个图像类别。这些图像和类别经过有意策划,以针对现有方法的弱点。TECCI中的编辑指令由Gemini自动生成,每个源图像覆盖5种编辑类型。我们还策划了一组530张图像,为其创建了具有挑战性的人工编写编辑指令。总体而言,TECCI包含7550对图像和编辑指令。我们对TECCI上的五个领先图像编辑模型进行了人工评估。人类从三个维度判断输出:1)指令遵循,2)编辑的最小性,以及3)视觉质量。为了扩大评估规模,我们还使用Gemini构建了一个自动评分器,在匹配人类评估方面达到了74.7%的准确率。我们的评估揭示:1)没有一个模型的总体成功率超过22%,这显示了TECCI的挑战性;2)Nano Banana Pro是整体表现最好的模型;3)模型在指令遵循方面表现显著优于最小编辑和视觉质量;4)模型在编辑建筑和自然图像方面存在困难,这些需要较强的空间布局和复杂视觉细节理解能力;5)推理和创意编辑是最困难的,而颜色和外观编辑是最容易的。

英文摘要

Despite tremendous recent progress, current text-guided image editing methods still struggle with many aspects of editing involving instruction following, minimally editing the source image, and ensuring high visual quality. These problems are especially apparent when the requested edits are challenging, such as those involving position, motion, viewpoint, scale, and creative edits. To systematically test generative image editors, we propose a novel image editing benchmark, TECCI: Tricky Edits of Collected and Curated Images. TECCI consists of a completely new set of images we are releasing. The images in TECCI span 7 image categories. The images and these categories were curated intentionally to target weaknesses of existing methods. The edit instructions in TECCI are automatically generated by Gemini, covering 5 edit types per source image. We also curated a set of 530 images for which we created challenging manually written edit instructions. Overall, TECCI contains 7550 pairs of images and edit instructions. We conduct human evaluations of five leading image editing models on TECCI. Humans judge outputs along three dimensions: 1) instruction following, 2) minimality of the edits, and 3) visual quality. To scale up the evaluation, we also build an auto-rater using Gemini that achieves 74.7% accuracy in matching human evaluations. Our evaluations reveal that: 1) none of the models exceeds a 22% overall success rate, demonstrating the challenging nature of TECCI, 2) Nano Banana Pro is the best performing model overall, 3) models perform significantly better at instruction following compared to minimal edits and visual quality, 4) models struggle with editing architecture and nature images, which require strong understanding of spatial layout and intricate visual details, and 5) reasoning and creative edits are the most difficult, whereas color and appearance edits are the easiest.

2606.01207 2026-06-02 cs.CV cs.LG

Feature Alignment Determines Fusion Strategy: A Comparative Study of Cross-Attention and Concatenation in Multimodal Learning

特征对齐决定融合策略:多模态学习中交叉注意力与拼接的比较研究

Zhiqiang Zhou, Xuezhen Xie

发表机构 * Hunan Chemical Industry Vocational and Technical College(湖南化工职业技术学院)

AI总结 通过实验和理论分析,证明特征对齐质量而非数据规模是决定多模态融合策略优劣的关键因素,当特征预对齐时拼接优于交叉注意力。

详情
Comments
8 pages, 6 figures, 4 tables
AI中文摘要

在多模态融合中,交叉注意力与拼接的选择仍由实践者直觉而非原理性理解主导。本文通过使用两个特征提取骨干(ResNet18和CLIP ViT-B/32)在Flickr8k上的控制实验,证明特征对齐质量(而非仅数据规模)是决定哪种融合策略更优的主要因素。当特征通过视觉语言预训练目标预对齐时,在所有测试规模(2048-16384样本)下,拼接比交叉注意力高出4.1-5.1个百分点。我们提供了基于样本复杂度分析的理论解释:拼接需要O(d_v + d_t)个样本来学习其融合投影,而交叉注意力需要O(d_v * d_t)个样本来学习双线性注意力权重,对于512维CLIP特征,后者是前者的256倍以上。当特征已经对齐时,两种方法的近似误差差距消失,拼接的样本效率在所有实际数据集规模上占优。对齐退化研究证实了单调趋势:随着特征对齐退化,拼接的优势从1.3%增长到2.8%。这些发现为多模态系统中的融合方法选择提供了原理性决策框架,对多模态大语言模型的设计具有直接影响。

英文摘要

The choice between cross-attention and concatenation for multimodal fusion remains governed by practitioner intuition rather than principled understanding. In this paper, we demonstrate that feature alignment quality, not data scale alone, is the primary determinant of which fusion strategy excels. Through controlled experiments on Flickr8k using two feature extraction backbones (ResNet18 and CLIP ViT-B/32), we show that concatenation outperforms cross-attention by 4.1-5.1 percentage points across all tested scales (2048-16384 samples) when features are pre-aligned by a vision-language pretraining objective. We provide a theoretical explanation grounded in sample complexity analysis: concatenation requires O(d_v + d_t) samples to learn its fusion projection, while cross-attention requires O(d_v * d_t) samples to learn bilinear attention weights, over 256 times as many for 512-dimensional CLIP features. When features are already aligned, the approximation error gap between the two methods vanishes, and concatenation's sample efficiency dominates at all practical dataset sizes. An alignment degradation study confirms a monotonic trend: as feature alignment degrades, concatenation's advantage grows from 1.3% to 2.8%. These findings provide a principled decision framework for fusion method selection in multimodal systems, with direct implications for the design of Multimodal Large Language Models.
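
The sample-complexity comparison in the abstract can be made concrete with a short calculation. The sketch below is a hedged illustration that ignores the constants hidden by the O(·) notation and assumes d_v = d_t = 512 for CLIP ViT-B/32 features; the function names are hypothetical.

```python
def concat_order(d_v, d_t):
    """Leading-order sample requirement for concatenation: O(d_v + d_t)."""
    return d_v + d_t

def bilinear_order(d_v, d_t):
    """Leading-order sample requirement for bilinear attention: O(d_v * d_t)."""
    return d_v * d_t

d_v = d_t = 512  # assumed feature dimensions for both modalities
ratio = bilinear_order(d_v, d_t) / concat_order(d_v, d_t)
print(ratio)  # 262144 / 1024 = 256.0
```

With equal dimensions d the ratio simplifies to d/2, so the gap between the two fusion strategies widens linearly as the feature dimension grows.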

2606.01204 2026-06-02 cs.CL cs.AI cs.CY

Implicit Geographic Inference in LLM Medical Triage: Language-Driven Disparities in Emergency Recommendations

LLM医疗分诊中的隐式地理推断:语言驱动的急诊推荐差异

Qi Han Wong

发表机构 * GitHub

AI总结 研究大型语言模型在相同症状下,仅因患者提示语言不同而产生不同的医疗分诊推荐,发现模型根据输入语言隐式推断地理位置,导致急诊推荐率差异显著。

详情
Comments
7 pages, 4 tables. Code and data at https://github.com/wongqihan/ai-behavioral-experiments
AI中文摘要

我们研究大型语言模型是否仅根据患者提示的语言,对相同症状产生不同的医疗分诊推荐。使用Gemini 3.5 Flash,我们评估了六种语言(英语、西班牙语、中文、印地语、日语、阿拉伯语)下的神经症状特征(持续性头痛、视力模糊、恶心),每种条件运行30次(共450次API调用)。我们发现,尽管模型在所有语言中分配的严重程度评分几乎相同(7.7-8.0/10),但急诊室就诊推荐率从0%(日语、印地语)到30%(英语、阿拉伯语)不等。添加一句指定患者位于美国的句子,非英语提示的急诊推荐率最多增加76.7个百分点,而反向锚定(英语提示加上东京地点)将急诊率从30%降至6.7%。回译控制(日语到英语)产生的急诊率与英语基线相当,证实差异并非由翻译质量引起,而是由输入语言的隐式地理推断所致。我们发布了完整的数据集、实验代码和结果。

英文摘要

We investigate whether large language models produce different medical triage recommendations for identical symptoms based solely on the language of the patient prompt. Using Gemini 3.5 Flash, we evaluate a neurological symptom profile (persistent headache, blurred vision, nausea) across six languages (English, Spanish, Chinese, Hindi, Japanese, Arabic) with 30 runs per condition (n=450 total API calls). We find that the model recommends emergency room visits at rates ranging from 0% (Japanese, Hindi) to 30% (English, Arabic), despite assigning nearly identical severity scores (7.7-8.0/10) across all languages. Adding a single sentence specifying the patient's US location increases ER recommendations by up to 76.7 percentage points for non-English prompts, while the reverse anchor (English prompt with a Tokyo location) reduces the ER rate from 30% to 6.7%. A back-translation control (Japanese to English) produces ER rates comparable to the English baseline, confirming that the disparity is not caused by translation quality but by implicit geographic inference from the input language. We release the complete dataset, experiment code, and results.

2606.01202 2026-06-02 cs.AI cs.CL cs.LG

The Shape of Wisdom: Decision Trajectories in Language Models

智慧的形状:语言模型中的决策轨迹

Shailesh Rana

发表机构 * Independent Researcher(独立研究者)

AI总结 本文通过分析三种语言模型在MMLU上的9000条轨迹,提出用答案边际、边际变化和决策翻转距离描述轨迹,发现正确性与稳定性不同,并探究了注意力与MLP标量对边际的影响。

详情
Comments
6 pages, 5 figures. Code and derived artifacts: https://github.com/gut-puncture/The-Shape-of-Wisdom
AI中文摘要

语言模型并非简单地在输出层选择一个答案。在一项包含9000条轨迹的MMLU研究中,涉及Qwen2.5-7B-Instruct、Llama-3.1-8B-Instruct和Mistral-7B-Instruct-v0.3,答案的分数在深度上以结构化方式移动。我们用三个量描述每条轨迹:当前答案边际、该边际的下一层变化,以及距离决策翻转的距离。主要经验图景是正确性和稳定性是不同的:最大的群体是不稳定-正确的,而不是稳定-正确的。然后,一个追踪的子集询问是什么推动了边际。在稳定-正确的情况下,平均注意力标量指向正确的方向,而平均MLP标量则不然;跨度删除显示,移除支持答案的文本会损害边际,而移除类似干扰项的文本则有助于边际。结果并非完整的电路解释。它是一种可重复的方式,用于查看哪些答案已确定,哪些仍然脆弱,以及哪些测量来源推动了它们。

英文摘要

Language models do not simply choose an answer at the output layer. In a 9,000-trajectory MMLU study across Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, and Mistral-7B-Instruct-v0.3, the score of the answer moves across depth in structured ways. We describe each trajectory with three quantities: the current answer margin, the next-layer change in that margin, and the distance from a decision flip. The main empirical picture is that correctness and stability are different: the largest group is unstable-correct, not stable-correct. A traced subset then asks what moves the margin. In stable-correct cases, the average attention scalar points in the correct direction, while the average MLP scalar does not; span deletion shows that removing answer-supporting text hurts the margin and removing distractor-like text helps it. The result is not a full circuit explanation. It is a reproducible way to see which answers are settled, which remain fragile, and which measured sources move them.

2606.01199 2026-06-02 cs.AI

Can LLM Agents Sustain Long-Horizon Organizational Dynamics?

LLM智能体能否维持长期组织动态?

Xuancheng Zhu, Yang Yue, Shuaibing Wan, Zihan Dou, Xiaohan Zhang, Yongrui Liu, Guoshun Nan

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出TaskWeave分层智能体框架,通过记忆中心的协调机制(规划-分解-诊断-对齐循环和依赖感知追踪记忆)实现长期组织模拟,实验表明该框架能维持连贯的组织动态并产生可靠的人工制品。

详情
AI中文摘要

大型语言智能体越来越多地用于社会模拟,但尚不清楚它们能否在结构化组织中维持连贯行为,其中目标必须通过层级传播,任务依赖于先前执行,并且人工制品在长期范围内积累。我们将长期组织模拟定义为以记忆为中心的协调问题,并引入TaskWeave,这是一个分层智能体框架,通过制定-分解-诊断-对齐循环维护规划状态,并通过依赖感知追踪记忆来接地执行。我们在一个为期一年的IT公司模拟中评估TaskWeave,并将其与其他多智能体框架在组织连贯性、执行接地和下游企业NLP效用方面进行比较。实验表明,TaskWeave支持连贯且长期的组织动态,同时产生接地的人工制品并适应外部环境。这些发现表明,结构化模拟记忆是构建可靠的基于LLM的组织模拟器的关键机制。

英文摘要

Large language agents are increasingly used for social simulation, yet it remains unclear whether they can sustain coherent behavior in structured organizations, where goals must propagate through hierarchy, tasks depend on prior execution, and artifacts accumulate over long horizons. We formulate long-horizon organizational simulation as a memory-centered coordination problem and introduce TaskWeave, a hierarchical agentic framework that maintains planning states through a Formulate-Partition-Diagnose-Align cycle and grounds execution through dependency-aware trace memory. We evaluate TaskWeave in a year-long IT company simulation and compare it with other multi-agent frameworks on organizational coherence, execution grounding, and downstream enterprise NLP utility. Experiments show that TaskWeave supports coherent and long-horizon organizational dynamics while producing grounded artifacts and adapting to external environments. These findings suggest that structured simulation memory is a key mechanism for building reliable LLM-based organizational simulators.

2606.01196 2026-06-02 cs.CL cs.AI

Low-Resource Safety Failures Are Action Failures, Not Representation Failures

低资源安全失败是行动失败,而非表征失败

Rashad Aziz, Ikhlasul Akmal Hanif, Fajri Koto

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 本文发现低资源语言的安全对齐失败源于决策校准问题而非表征缺失,通过重校准高资源门控(低秩逻辑回归+阈值重置)显著提升拒绝选择性。

详情
AI中文摘要

在高资源语言中学习的安全对齐在低资源语言中迁移效果不佳。模型能拒绝英文有害提示,但当相同提示翻译成斯瓦希里语或缅甸语时则无法拒绝。自适应引导方法如AdaSteer和CAST在跨语言中继承了这一失败。我们诊断了迁移失败的原因。在Qwen2.5-7B、Gemma-2-9B和Llama-3.1-8B模型上,针对23种语言,从高资源激活中提取的有害方向几乎能像高资源提示一样线性分离低资源有害与无害提示。相关表征存在。然而,有害拒绝率从87.9%下降到43.9%。模型未能将表征转化为拒绝。未能迁移的是安全决策的校准,而非底层表征。我们利用这一点,通过重校准而非重新训练高资源门控:一个低秩逻辑回归读出器,其决策阈值使用每类仅1到4个目标语言示例重置。该门控在拒绝引导和有害方向消融之间路由,将平均拒绝选择性(Δ = 有害 − 无害拒绝)从最强自适应基线的33.6显著提高到54.5,同时保持MMLU效用。这些结果表明,一些低资源安全失败可以通过重校准现有表征而非学习新表征来修复。我们的代码已发布:https://github.com/rashadaziz/low-resource-safety。

英文摘要

Safety alignment learned in high-resource languages transfers poorly to low-resource languages. Models refuse harmful prompts in English but fail to refuse when the same prompts are translated into Swahili or Burmese. Adaptive steering methods like AdaSteer and CAST inherit this failure cross-lingually. We diagnose where transfer breaks down. Across Qwen2.5-7B, Gemma-2-9B, and Llama-3.1-8B on 23 languages, the harmfulness direction extracted from high-resource activations linearly separates harmful from harmless low-resource prompts nearly as well as high-resource ones. The relevant representation is present. Yet harmful refusal drops from 87.9% to 43.9%. The model fails to convert the representation into refusal. What fails to transfer is calibration of the safety decision, not the underlying representation. We exploit this by recalibrating, rather than retraining, a high-resource gate: a low-rank logistic readout with its decision threshold reset using as few as 1 to 4 target-language examples per class. The gate routes between refusal steering and harmfulness-direction ablation, substantially raising mean refusal selectivity (Δ = harmful − harmless refusal) from 33.6 for the strongest adapted baseline to 54.5 while preserving MMLU utility. These results suggest that some low-resource safety failures can be repaired by recalibrating existing representations rather than learning new ones. Our code is released: https://github.com/rashadaziz/low-resource-safety.
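
The recalibration idea can be sketched in a few lines. This is an assumption-laden toy, not the released code: it treats the readout as a scalar score per prompt, resets the threshold to the midpoint of the two class means computed from a handful of target-language examples (the abstract does not state the actual rule), and assumes that prompts above the threshold are routed to refusal steering. All names and numbers are hypothetical.

```python
def recalibrate_threshold(harmful_scores, harmless_scores):
    """Midpoint between class means of a few target-language scores (assumed rule)."""
    mean_harmful = sum(harmful_scores) / len(harmful_scores)
    mean_harmless = sum(harmless_scores) / len(harmless_scores)
    return (mean_harmful + mean_harmless) / 2.0

def route(score, threshold):
    """Assumed routing: above threshold -> refusal steering, else ablation."""
    return "refusal_steering" if score > threshold else "harmfulness_ablation"

# Hypothetical threshold calibrated on high-resource data.
source_threshold = 2.0

# Two labelled target-language examples per class (made-up readout scores).
target_threshold = recalibrate_threshold([1.2, 1.5], [0.1, 0.4])

# A harmful low-resource prompt whose score sits below the source threshold.
prompt_score = 1.3
before = route(prompt_score, source_threshold)
after = route(prompt_score, target_threshold)
print(before, after)
```

The scores still separate the classes in the target language; only the cut-off moves, which mirrors the abstract's claim that calibration, not representation, is what fails to transfer.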

2606.01192 2026-06-02 cs.CV

PairedGTA: Generating Driving Datasets for Controlled Photometric Shift Analysis

PairedGTA:用于受控光度偏移分析的驾驶数据集生成

Andrea Chianese, Giulio Rossolini, Alessandro Biondi, Marco Cococcioni, Giorgio Buttazzo

发表机构 * Scuola Superiore Sant’Anna(圣安娜高等学院) Department of Excellence in Robotics & AI(机器人与人工智能卓越部门) University of Pisa(比萨大学)

AI总结 提出基于高保真游戏引擎的PairedGTA框架,通过生成完美配对的图像,实现独立于几何和语义变化的光度偏移分析,并用于评估语义分割模型在恶劣条件下的性能退化。

详情
Comments
Under review
AI中文摘要

评估自动驾驶视觉感知系统的性能对于确保在不同环境场景下的可靠运行至关重要。理想情况下,要在不同恶劣条件下进行平衡和公平的分析,需要同一场景在不同天气或光照变化下的完美配对图像。这将允许独立于几何和语义变化来评估光度偏移的影响。不幸的是,真实世界数据集很少提供同一场景在不同环境条件下的图像,因为通常相机姿态、交通和动态物体(车辆、行人等)的位置随时间变化,因此只能提供粗略配对的数据。为了解决这一挑战,本工作引入了一种基于高保真游戏引擎的数据生成框架,用于提取完美配对的图像。通过利用与GTA游戏引擎通信的软件API,该框架在保持场景几何、相机姿态以及动态物体的身份和位置的同时,修改光照和天气条件。对于每个采样位置,它程序化地实例化动态实体,并在各种恶劣条件下渲染像素对齐的图像。通过在语义分割模型上的系统分析,展示了所提出的生成框架在驾驶场景中的优势,其输出退化可以更直接地归因于光度偏移,而不是不受控制的语义或几何因素。

英文摘要

Evaluating the performance of visual perception systems for autonomous driving is essential to ensure reliable operation across diverse environmental scenarios. Ideally, a balanced and fair analysis across different adverse conditions would require perfectly paired images of the same scene under different weather or illumination changes. This would allow evaluating the effect of photometric shifts independently of geometry and semantic changes. Unfortunately, real-world datasets rarely provide images of the same scene under different environmental conditions, because, normally, camera pose, traffic, and locations of dynamic objects (vehicles, pedestrians, etc.) vary over time, thus yielding only coarsely paired data. To address this challenge, this work introduces a data generation framework based on a high-fidelity game engine for extracting perfectly paired images. By leveraging software APIs that communicate with the GTA game engine, the framework modifies illumination and weather conditions while preserving scene geometry, camera pose, and the identity and placement of dynamic objects. For each sampled location, it procedurally instantiates dynamic entities and renders pixel-aligned images under diverse adverse conditions. The benefit of the proposed generation framework in driving scenarios is demonstrated through a systematic analysis of semantic segmentation models, whose output degradation can be attributed more directly to photometric shifts rather than to uncontrolled semantic or geometric factors.

2606.01189 2026-06-02 cs.AI

The Case for Model Science: Verify, Explore, Steer, Refine

模型科学的案例:验证、探索、引导、改进

Przemyslaw Biecek, Luca Longo, Jianlong Zhou, Thomas Fel, Andreas Holzinger, Wojciech Samek

发表机构 * Center for Credible AI(可信AI中心) University of Warsaw(华沙大学) Warsaw University of Technology(华沙理工大学) University College Cork(科克大学学院) University of Technology Sydney(悉尼科技大学) Kempner Institute, Harvard University(哈佛大学凯普纳研究所) Human-Centered AI Lab(以人为本的人工智能实验室) Technical University of Berlin(柏林工业大学) Fraunhofer Heinrich Hertz Institute(弗劳恩霍夫海因里希·赫茨研究所) Berlin Institute for the Foundations of Learning and Data (BIFOLD)(柏林学习与数据基础研究所(BIFOLD))

AI总结 本文提出AI社区应超越基准测试,建立系统性的模型分析学科——模型科学,通过验证、探索、引导和改进四个功能视角,以及共享基础设施和深度案例研究,来理解复杂AI模型的行为。

详情
Comments
Follow up on arXiv:2508.20040
AI中文摘要

我们认为,AI社区现在已经准备好超越基准测试,并将分散的模型分析工作整合成一个系统性的学科,我们称之为模型科学。复杂的AI模型现在服务于数十亿用户,但我们对它们工作原理的理解远远落后于部署它们的能力。几十年来以基准测试为导向的研究取得了显著进展:广泛的排行榜、各种性能指标、跨不同任务的能力提升追踪;然而,这种成功也揭示了基准测试的局限性:它们告诉我们模型是否表现良好,但不告诉我们为什么成功或失败,并且忽略了关键的失败模式,如幻觉或捷径。来自成熟科学的先例指明了前进的方向:认知科学表明,理解复杂系统需要互补的分析层次;神经科学证明,对单个案例的深入研究揭示了群体研究遗漏的东西;医学教导我们,专业培训必须与研究实践同步发展;农业则示范了共享基础设施和原则如何实现累积进展。这些经验为模型科学提供了三个基础。首先,我们建议围绕四个功能视角整合研究:验证、探索、引导和改进,这些视角解决了关于模型行为的互补问题。其次,我们讨论了累积知识所需的基础设施:数据集、模型和发现的目录。第三,我们强调需要对单个模型实例进行深入分析,而不仅仅是模型家族,因为单个案例可以揭示群体研究遗漏的东西。

英文摘要

We argue that the AI community is now ready to move beyond benchmarking and consolidate scattered efforts in model analysis into a systematic discipline, a direction we term Model Science. Complex AI models now serve billions of users, yet our understanding of how they work lags far behind our ability to deploy them. Decades of benchmark-driven research have delivered remarkable progress: extensive leaderboards, a wide range of performance metrics, and tracking of capability gains across diverse tasks. Yet this success has also revealed the limits of benchmarks: they tell us whether models perform but not why they succeed or fail, and they miss critical failure modes such as hallucinations or shortcuts. Precedents from established sciences point the way forward: cognitive science shows that understanding complex systems requires complementary levels of analysis; neuroscience demonstrates that deep study of single cases reveals what population studies miss; medicine teaches that specialised training must develop alongside research practice; and agriculture models how shared infrastructure and principles enable cumulative progress. These lessons inform three foundations for Model Science. First, we propose to consolidate research around four functional perspectives (Verify, Explore, Steer, and Refine) that address complementary questions about model behaviour. Second, we discuss the required infrastructure for cumulative knowledge: catalogues of datasets, models and findings. Third, we highlight the need for deep analysis of individual model instances, not just model families, because single cases can reveal what population studies miss.

2606.01185 2026-06-02 cs.AI

"Skill issues": data-centric optimization of lakehouse agents

技能问题:湖仓代理的数据中心优化

Nicole Rose Schneider, Davide Ghilardi, Giacomo Piccinini, Jacopo Tagliabue

发表机构 * University of Maryland(马里兰大学) Università Milano Bicocca(米兰比可卡大学) Bauplan Labs(Bauplan实验室)

AI总结 针对分支湖仓Bauplan上的编码代理,提出数据中心的优化流程,通过生成任务验证器对、在隔离沙箱中执行候选技能并利用追踪信号和程序化检查评分,将准确率提升31.9%。

详情
AI中文摘要

编码代理正在成为数据基础设施的用户,但它们的成功不仅取决于模型质量:还取决于教导代理如何使用系统的技能和环境文件。我们研究如何为在分支湖仓Bauplan上操作的代理优化这些工件。在我们的设置中,无头API和类似Git的数据原语通过代码、分支、提交和合并暴露数据工作流。我们的核心观察是,分支湖仓将数据代理评估从输出匹配问题转变为状态验证问题:代理生成的管道代码会引发具体的、可检查的湖仓变化。我们提出了一个数据中心优化流程,生成任务验证器对,在隔离沙箱中执行候选技能,并使用追踪级信号和湖仓状态的程序化检查对轨迹进行评分。在25个任务的初步评估中,优化后的技能将准确率提升了31.9%。这些结果表明,写路径数据工作流为优化代理技能提供了有用的基础,超越了只读任务。

英文摘要

Coding agents are becoming users of data infrastructure, but their success depends not only on model quality: it also depends on the skills and environment files that teach agents how to use a system. We study how to optimize these artifacts for agents operating on a branching lakehouse, Bauplan. In our setting, headless APIs and Git-like data primitives expose data workflows through code, branches, commits, and merges. Our central observation is that a branching lakehouse turns data-agent evaluation from an output-matching problem into a state-verification problem: agent-generated pipeline code induces concrete, inspectable lakehouse changes. We present a data-centric optimization pipeline that generates task-verifier pairs, executes candidate skills in isolated sandboxes, and scores trajectories using both trace-level signals and programmatic checks over lakehouse state. In a preliminary evaluation on 25 tasks, optimized skills improve accuracy by 31.9%. These results suggest that write-path data workflows provide a useful substrate for optimizing agent skills beyond read-only tasks.

2606.01182 2026-06-02 cs.CL cs.AI

CA-BED: Conversation-Aware Bayesian Experimental Design

CA-BED:对话感知的贝叶斯实验设计

Daniel Arnould, Rashad Aziz, Zixuan Kang, Tanav Changal, Kevin Zhu, Sunishchal Dev, Gabriel Grand, Shreyas Sunil Kulkarni

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Washington(华盛顿大学) University of Toronto(多伦多大学)

AI总结 提出对话感知的贝叶斯实验设计(CA-BED),一种推理时概率对话规划框架,通过结合贝叶斯实验设计与LLM似然估计,在多个对话轮次中优化问题选择,在结构化实体推断基准上平均成功率提升21.8%,仅增加1.8轮对话。

详情
Comments
Reliable Autonomy Workshop at ICLR 2026
AI中文摘要

大型语言模型(LLM)在静态推理任务中表现出色,但在需要通过提问主动获取信息的交互场景中,其性能往往会下降。一个关键挑战在于选择能够减少不确定性同时纳入可能模糊或仅部分信息性的回应的问题。为了解决这个问题,我们提出了对话感知的贝叶斯实验设计(CA-BED),一种推理时概率对话规划框架,它将贝叶斯实验设计与基于LLM的似然估计相结合,以在多个对话轮次中优化问题选择。CA-BED维护关于假设的信念分布,预测可能的答案,并通过模拟对话树传播期望信息增益。在两个结构化实体推断基准上,CA-BED相比直接提示实现了平均21.8%的成功率提升,相对于其他信息寻求方法也有相当的增益。与直接提示相比,它仅平均增加了1.8个对话轮次就实现了这些增益。

英文摘要

Large Language Models (LLMs) excel at static reasoning tasks, yet their performance often degrades in interactive scenarios where information must be actively acquired through questioning. A key challenge lies in selecting questions that reduce uncertainty while incorporating responses that may be ambiguous or only partially informative. To address this, we propose Conversation-Aware Bayesian Experimental Design (CA-BED), an inference-time probabilistic dialog planning framework that integrates Bayesian Experimental Design with LLM-based likelihood estimation to optimize question selection over multiple conversational turns. CA-BED maintains a belief distribution over hypotheses, anticipates possible answers, and propagates expected information gain through a simulated conversation tree. Across two structured entity-deduction benchmarks, CA-BED yields an average 21.8% improvement in success rates over direct prompting, with comparable gains relative to alternative information-seeking methods. It achieves these gains with an average increase of only 1.8 conversational turns compared to direct prompting.
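
Expected information gain, the quantity the abstract says CA-BED propagates through a simulated conversation tree, can be illustrated for a single question. The sketch below is a one-step toy under stated assumptions: hypotheses and answers are finite, and the likelihood table P(answer | hypothesis) is given directly rather than estimated by an LLM. The function names are hypothetical and the multi-turn tree search is not implemented.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def expected_information_gain(prior, likelihood):
    """EIG = H(prior) - E_answer[H(posterior)].

    likelihood[h][a] is P(answer a | hypothesis h).
    """
    n_hyp = len(prior)
    n_ans = len(likelihood[0])
    eig = entropy(prior)
    for a in range(n_ans):
        p_a = sum(prior[h] * likelihood[h][a] for h in range(n_hyp))
        if p_a == 0:
            continue  # this answer can never occur under the prior
        posterior = [prior[h] * likelihood[h][a] / p_a for h in range(n_hyp)]
        eig -= p_a * entropy(posterior)
    return eig

prior = [0.25, 0.25, 0.25, 0.25]

# A clean yes/no question that splits the four hypotheses in half.
clean = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
# The same question when answers are only 80% reliable.
noisy = [[0.8, 0.2], [0.8, 0.2], [0.2, 0.8], [0.2, 0.8]]
# A question whose answer does not depend on the hypothesis.
useless = [[0.5, 0.5]] * 4

eig_clean = expected_information_gain(prior, clean)
eig_noisy = expected_information_gain(prior, noisy)
eig_useless = expected_information_gain(prior, useless)
print(eig_clean > eig_noisy > eig_useless)
```

The noisy case shows why ambiguous or partially informative replies matter: the same question is worth less than one bit once the answer is unreliable, so a planner that ignores answer noise would overvalue it.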

2606.01179 2026-06-02 cs.LG cs.AI

Physics-Informed Deep Learning for Entropy Prediction in Heterogeneous Systems: Thermodynamic and Information-Theoretic Case Studies

异质系统中熵预测的物理信息深度学习:热力学与信息论案例研究

Biswajeet Sahoo, Debadutta Patra

发表机构 * Durham University(杜伦大学) Department of Chemical Engineering(化学工程系) Veer Surendra Sai University of Technology(维尔·苏伦德拉·赛理工大学)

AI总结 提出统一物理信息深度学习框架,通过微分方程残差和信息论约束,在单一神经网络中同时实现热力学与信息论系统的熵预测,并验证其数据效率和物理一致性。

详情
AI中文摘要

熵产生支配着物理和信息论系统中的不可逆性和不确定性。尽管物理信息神经网络(PINNs)成功求解微分方程,但当前架构本质上仍是领域特定的。跨根本不同物理定律的领域不变熵表示的提取尚未探索。本文引入了一个统一的物理信息深度学习(PIDL)框架,该框架在单一神经架构中同时强制执行微分方程残差和信息论界限。我们通过两个经典研究来展示该框架:(i)一个热力学连续搅拌釜反应器(CSTR)模型,求解控制常微分方程,其中Softplus约束严格强制执行热力学第二定律;(ii)一个信息论金融市场模型,求解逆Fokker-Planck偏微分方程以推断潜在漂移和扩散系数,通过Softplus约束保证扩散正性,同时自然诱导香农熵。评估了三种模型变体:两个特定领域基线和一种共享编码器架构。PIDL框架保证了绝对的热力学可接受性,零违反第二定律,并表现出卓越的数据效率,仅使用30%的可用训练数据即可保持>90%的预测精度。此外,对学习到的熵表面的事后Ruppeiner黎曼几何分析成功识别了热力学相不稳定性。该方法为物理约束熵建模提供了一个稳健、领域无关的架构,推动了可持续过程设计和定量金融风险评估的应用。

英文摘要

Entropy production governs irreversibility and uncertainty in both physical and information-theoretic systems. While Physics-Informed Neural Networks (PINNs) successfully solve differential equations, current architectures remain inherently domain-specific. The extraction of domain-invariant entropy representations across fundamentally different physical laws remains unexplored. This paper introduces a unified Physics-Informed Deep Learning (PIDL) framework that simultaneously enforces differential equation residuals and information-theoretic bounds within a single neural architecture. We demonstrate this framework via two canonical studies: (i) a thermodynamic continuous stirred-tank reactor (CSTR) model solving governing ODEs, where a Softplus constraint strictly enforces the Second Law of Thermodynamics; and (ii) an information-theoretic financial market model solving the inverse Fokker-Planck PDE to infer latent drift and diffusion coefficients, guaranteeing diffusion positivity via a Softplus constraint while naturally inducing Shannon entropy. Three model variants are evaluated: two domain-specific baselines and one shared-encoder architecture. The PIDL framework guarantees absolute thermodynamic admissibility with zero Second-Law violations and exhibits exceptional data efficiency, retaining >90% predictive accuracy using merely 30% of available training data. Furthermore, a post-hoc Ruppeiner Riemannian geometric analysis of the learned entropy surface successfully identifies thermodynamic phase instabilities. This methodology provides a robust, domain-agnostic architecture for physics-constrained entropy modeling, advancing applications in sustainable process design and quantitative financial risk assessment.
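
The role of the Softplus constraint can be shown in isolation. The sketch below assumes, as an illustration rather than a description of the authors' architecture, that the raw network output for a quantity that must stay non-negative (an entropy production rate or a diffusion coefficient) is passed through Softplus before use.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x)); the result is never negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Hypothetical raw (unconstrained) network outputs, including negative values.
raw_outputs = [-50.0, -1.0, 0.0, 3.0, 50.0]
constrained = [softplus(x) for x in raw_outputs]

# Every constrained value is positive, whatever the sign of the raw output.
print(all(v > 0 for v in constrained))
```

Because the constraint is built into the output map, non-negativity holds by construction for every input rather than being encouraged through a penalty term, which is one way to read the abstract's claim of zero Second-Law violations.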

2606.01176 2026-06-02 cs.LG

Temporal Motif Signatures for Temporal Graph Neural Networks

时序图神经网络的时序模体特征

Dylan Sandfelder, Mihai Cucuringu, Xiaowen Dong

发表机构 * University of Oxford(牛津大学) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 针对时序图神经网络难以捕捉短时序模体模式的问题,提出一种紧凑的13维模体特征图,可线性嵌入任意静态或时序编码器,并在多种任务上提升性能。

详情
AI中文摘要

真实时序交互流在短时模体模式(重复、互惠、星型多样性、三元组流)中蕴含预测结构,而普通的时序图神经网络(TGNN)通常无法将其暴露给边评分器。我们在MOOC交互预测中具体展示了这一点:一个由过去窗口星型计数组成的小型四特征族已经提供了相对于强静态GNN的大部分提升。在广泛的实际和合成时序数据集中,我们发现模体活动沿着三个尺度稳定的轴(二元近因/互惠、星型多样性、三元组流)一致地组织,并利用这一经验结构设计了一个紧凑的13维、防泄漏、候选局部模体特征图h(u, v, t),该特征图可线性嵌入任何静态或时序编码器,无需改变架构。时序Weisfeiler-Leman(WL)分析将该增强置于锚定时序WL层次的第一级,并展示了一个候选锚定对,模体特征在该对上具有区分性。我们通过实验证明,相同的增强在异构任务上一致地提升了性能:TGB链路属性预测在所有五个基线上,Bitcoin Alpha/OTC和MOOC上的边分类,以及合成时序生成器的图级分类。

英文摘要

Real temporal interaction streams carry predictive structure in short-horizon motif patterns (repetition, reciprocity, star diversity, triadic flow) that vanilla temporal graph neural networks (TGNNs) often fail to expose to their edge scorers. We show this concretely on MOOC interaction prediction, where a small four-feature family of past-window star counts already delivers most of the lift over a strong static GNN. Across a wide set of real and synthetic temporal datasets we find that motif activity organizes consistently along three scale-stable axes (dyadic recency/reciprocity, star diversity, triadic flow), and we use this empirical structure to design a compact 13-coordinate, leakage-safe, candidate-local motif feature map h(u, v, t) that linearly embeds into any static or temporal encoder without architectural changes. A temporal Weisfeiler-Leman (WL) analysis places the augmentation relative to the first level of an anchored temporal-WL hierarchy and exhibits a candidate-anchored pair on which motif features distinguish. We demonstrate empirically that the same augmentation consistently lifts performance across heterogeneous tasks: TGB link-property prediction across all five baselines, edge classification on Bitcoin Alpha/OTC and MOOC, and graph-level classification of synthetic temporal generators.
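
A candidate-local, leakage-safe feature map of the kind described can be sketched for three of the motif families. This is a hypothetical three-coordinate illustration, not the paper's 13-coordinate map h(u, v, t): the feature definitions, the window convention, and the function name are all assumptions. Leakage safety is modelled by counting only events strictly before the query time t.

```python
def motif_features(events, u, v, t, window):
    """Past-window motif counts for a candidate edge (u, v) at time t.

    events is a list of (source, destination, timestamp) triples.
    Only events with t - window <= timestamp < t are used.
    """
    past = [(a, b) for a, b, s in events if t - window <= s < t]
    repetition = sum(1 for a, b in past if (a, b) == (u, v))   # u -> v seen before
    reciprocity = sum(1 for a, b in past if (a, b) == (v, u))  # v -> u seen before
    star_diversity = len({b for a, b in past if a == u})       # distinct targets of u
    return [repetition, reciprocity, star_diversity]

events = [
    (1, 2, 0.0),  # outside the window below
    (2, 1, 1.0),
    (1, 3, 2.0),
    (1, 2, 3.0),
    (1, 2, 5.0),  # the query event itself; excluded because its time is not < t
]
features = motif_features(events, u=1, v=2, t=5.0, window=4.0)
print(features)  # [1, 1, 2]
```

A vector like this can be concatenated to whatever edge representation an encoder already produces, which matches the abstract's point that no architectural change is needed.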

2606.01173 2026-06-02 cs.CV

Reusing Fusion-Time Spectral Reliability for Adaptive Fusion and Expert Routing in RGB-Infrared Object Detection

Yefeng Wu

Affiliations * Tsinghua University

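AI summary Proposes a parameter-free 7-dimensional spectral reliability descriptor that, through spectral reliability fusion and reliability-conditioned expert routing, improves RGB-infrared object detection under degraded conditions.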

Details
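AI abstract (translated from Chinese)

RGB-infrared detectors typically discard the statistics produced during cross-modal fusion, so downstream modules cannot tell whether the current interaction is reliable. We propose extracting a parameter-free 7-dimensional spectral reliability descriptor, which summarizes band energy, amplitude ratio, phase consistency, and cross-modal correlation, and reusing it beyond the fusion stage. The descriptor drives Spectral Reliability Fusion (SRF), which gates a spectral residual against a conservative spatial base, and Reliability-Conditioned Expert Routing (RCER), which combines the descriptor with pooled content to steer sparse post-fusion experts. In matched ablations, descriptor-aware gating improves mAP50 over content-only adaptive gating; a 2×2 factorial analysis further shows that, at near-equal parameter count, descriptor-conditioned routing yields a larger marginal gain than the expert architecture alone. Across six synthetic degradations on DroneVehicle, average retention rises to 95.0%, compared with 92.0% for a content-only MoE and 87.9% for concatenation, with the largest gain under modality drop; the same model also improves mAP50 by +5.2/+5.3 on the natural day/night split. These results suggest that preserving fusion-time reliability as an explicit signal benefits both adaptive fusion and post-fusion conditional computation.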

English abstract

RGB-infrared detectors typically discard the statistics generated during cross-modal fusion, leaving downstream modules unaware of whether the current interaction is reliable. We propose to extract a parameter-free, 7-dimensional spectral reliability descriptor -- summarizing band energy, amplitude ratio, phase consistency, and cross-modal correlation -- and to reuse it beyond the fusion stage. The descriptor drives both Spectral Reliability Fusion (SRF), which gates a spectral residual against a conservative spatial base, and Reliability-Conditioned Expert Routing (RCER), which combines the descriptor with pooled content to steer sparse post-fusion experts. Under matched ablations, descriptor-aware gating improves mAP50 over content-only adaptive gating; a $2{\times}2$ factorial analysis further shows that descriptor-conditioned routing provides the larger marginal gain over expert architecture alone at near-equal parameter count. Under six synthetic degradations on DroneVehicle, average retention rises to 95.0%, versus 92.0% for content-only MoE and 87.9% for concatenation, with the largest gain under modality drop; the same model also improves mAP50 by +5.2/+5.3 on the natural day/night split. These results suggest that preserving fusion-time reliability as an explicit signal benefits both adaptive fusion and post-fusion conditional computation.