arXivDaily: arXiv daily digest, updated Monday to Friday
2606.01258 2026-06-02 cs.LG cs.CL eess.SP

Beyond Sinusoids: A Morlet Wavelet Framework for Transformer Positional Encoding

Athanasios Zeris

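Affiliations: Independent Researcher, Athens, Greece

AI summary: Proposes Morlet Positional Encoding (MoPE), which unifies sinusoidal and rotary positional encodings through a learnable frequency and locality bandwidth per dimension; combined with Energy-Gated Attention, it reports a +0.119 improvement over standard attention on TinyShakespeare.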

Comments 16 pages, 4 figures, 4 tables

Abstract

Standard positional encodings for transformers - sinusoidal and rotary (RoPE) - treat every position as equally local: they encode where a token is, but not how far its positional influence should extend. We propose that the Morlet wavelet, which simultaneously minimises uncertainty in position and frequency, is the natural basis for positional encoding, and introduce Morlet Positional Encoding (MoPE): each embedding dimension learns its own frequency and locality bandwidth from data. The main theoretical result is a unification: sinusoidal PE and the RoPE correlation kernel both emerge as limiting cases of MoPE when locality is switched off (sigma_i -> infinity). The phase of MoPE recovers the RoPE rotation angle exactly; the amplitude adds a learned Gaussian locality kernel that standard encodings lack. Empirically, MoPE combined with Energy-Gated Attention achieves +0.119 improvement over standard attention on TinyShakespeare, outperforming either component alone. Analysis of the learned parameters reveals that all 128 frequency-bandwidth pairs converge to the wavelet admissibility boundary - an empirical observation consistent with a companion result on energy gating, suggesting a reproducible property of character-level language signals that warrants further investigation.
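As a rough illustration of the limiting case described in this abstract, the sketch below builds a Gaussian-windowed complex sinusoid (a Morlet-style atom) over a relative offset and checks that the envelope tends to 1 as sigma grows, leaving only the rotation phase. The function name `morlet_kernel`, the exact parameterisation, and the sample values are assumptions made for illustration; the paper's actual definition of MoPE may differ.

```python
import cmath
import math


def morlet_kernel(delta, omega, sigma):
    """Gaussian-windowed complex sinusoid over a relative offset `delta`.

    Assumed form (not taken from the paper):
        exp(-delta^2 / (2 * sigma^2)) * exp(i * omega * delta)
    """
    envelope = math.exp(-(delta ** 2) / (2.0 * sigma ** 2))
    return envelope * cmath.exp(1j * omega * delta)


delta, omega = 5.0, 0.3
local = morlet_kernel(delta, omega, sigma=2.0)   # strong locality
wide = morlet_kernel(delta, omega, sigma=1e6)    # locality effectively switched off
pure_rotation = cmath.exp(1j * omega * delta)    # plain sinusoidal / rotary phase term

# The phase stays omega * delta for any sigma; only the amplitude changes.
print("amplitudes:", abs(local), abs(wide))
print("phases:", cmath.phase(local), cmath.phase(wide), cmath.phase(pure_rotation))
```

With a small sigma the amplitude decays quickly with distance, while a very large sigma makes the kernel numerically indistinguishable from the pure rotation term.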

2606.01255 2026-06-02 cs.CL

Agentic Clustering: Controllable Text Taxonomies via Multi-Agent Refinement

Simon Löwe, Emily Silcock

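Affiliations: Burning Glass Institute; Harvard University

AI summary: Proposes an adaptive text-clustering method built on specialised agents (proposer, synthesizer, auditor, investigator, critic) dispatched by an orchestrator LLM that adjusts the clustering pipeline dynamically; reports state-of-the-art results on seven public benchmarks, with ARI gains of up to 32%.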

Abstract

Recent text-clustering methods use large language models to propose a cluster taxonomy from a corpus and then assign each text to it. These pipelines are fundamentally programmatic: the sequence of LLM calls and the rules for stopping, merging, and splitting clusters are fixed in code in advance, so they generalise poorly across corpora of different structure and cannot easily incorporate user-supplied constraints such as a target cluster count or a clustering intent. We propose an agentic alternative in which an orchestrator LLM inspects the state of the discovery process at each step and dispatches one of a small set of specialised agents - proposer, synthesizer, auditor, investigator, and critic - adapting the pipeline to the corpus rather than executing a fixed one. On seven public text-clustering benchmarks the method achieves state-of-the-art performance, beating the strongest prior LLM baseline by up to 32% in ARI.

2606.01252 2026-06-02 cs.CL cs.AI

Understanding LLM Behavior in Multi-Target Cross-Lingual Summarization

Sangwon Ryu, Yihong Liu, Mingyang Wang, Yunsu Kim, Jungseul Ok, Gary Geunbae Lee, Hinrich Schuetze

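Affiliations: GSAI, POSTECH; CSE, POSTECH; LMU Munich; MCML; LILT

AI summary: For multi-target cross-lingual summarization, introduces the MEA benchmark and analyses LLM internals, finding that translation and summarization behaviours emerge jointly in later layers, and proposes an inference-time activation steering method to improve generation quality.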

Abstract

Multi-target cross-lingual text summarization (MTXLS), which summarizes a source document into multiple target languages, is increasingly important as users consume content in diverse languages, but remains underexplored. To address this gap, we introduce multi-target cross-lingual element-aware (MEA), a new MTXLS benchmark covering 24 target languages. We benchmark end-to-end and pipeline approaches across various LLMs and show that MTXLS performance still substantially lags behind English monolingual summarization. To better understand MTXLS in LLMs, we propose a layer-wise analysis framework for investigating how LLMs internally perform MTXLS. Our analyses suggest that translation and summarization behaviors emerge jointly within later layers rather than as distinctly decomposed stages. Most task-relevant processing occurs within these layers, and errors also tend to arise at similar depths. Motivated by these findings, we introduce an inference-time activation steering method that leverages hidden representations from English summarization to guide MTXLS generation. Experiments show that our method consistently improves MTXLS quality across target languages.

2606.01247 2026-06-02 cs.CV

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

Liyang Li, Muzhi Zhu, Zhiyue Zhao, Hengyu Zhao, Ke Liu, Linhao Zhong, Hao Chen, Chunhua Shen

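Affiliations: Zhejiang University

AI summary: Introduces the Target Viewpoint Reproduction (TVR) task and the TVRBench benchmark, analyses the bottlenecks of existing models, and builds a unified post-training framework that raises the success rate of a 9B open-source model above 50%.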

Comments Project page: https://github.com/aim-uofa/TVRBench

Abstract

Humans can reproduce the viewpoint specified by a target image through active head and body motion, yet spatial intelligence in foundation models has largely been studied as passive understanding of pre-collected observations. We introduce Target Viewpoint Reproduction (TVR) -- an active task where an agent adjusts its viewpoint in a 3D environment until its observation matches a given target image -- and TVRBench, an indoor-simulation benchmark spanning scene scale and target-view visual richness. TVR is far from solved: on the evaluation split, the strongest open-source and closed-source models reach only 7.8% and 12.0% success. Fine-grained analysis identifies two consistent bottlenecks: off-the-shelf models struggle with multi-turn visual history, and performance drops sharply when viewpoint reproduction requires body translation rather than in-place rotation, exposing a gap in mapping spatial discrepancies to embodied movement. To study reducing this gap, we build a unified TVR post-training framework covering expert-trajectory SFT, rationale-supervised CoT-SFT, offline Single-turn GRPO, and on-policy Multi-turn GRPO from live simulator rollouts. Visual-action SFT supplies the main gain, raising a 9B open-source model to 50.8% success; Multi-turn GRPO provides targeted multi-room refinement and reaches 51.4% overall, while CoT supervision and Single-turn GRPO degrade closed-loop performance. These results establish TVRBench as a testbed for measuring and training foundation models that actively perceive and act in 3D environments. Our code, data, and models are available at https://github.com/aim-uofa/TVRBench.

2606.01246 2026-06-02 cs.AI

SIRIUS-SQL: Anchoring Multi-Candidate Text-to-SQL in Execution Feedback

Leo Luo, Haining Xie, Siqi Shen, Zhipeng Ma, Rui Ling, Hang Xu, Hefeng Jiang, Dingwei Chen, Yang Li, Peng Chen, Jie Jiang

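Affiliations: TEG, Tencent Inc., China; Peking University, China

AI summary: Presents SIRIUS-SQL, which tackles candidate redundancy, repair, and selection in multi-candidate SQL generation through difficulty-smoothing reinforcement learning, an execution-grounded lifecycle classification, and a confidence-gated hybrid selector, reaching 75.88% on BIRD dev and 91.20% on SPIDER test.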

Abstract

Text-to-SQL on complex schemas is unreliable on a single pass, so recent systems generate multiple SQL candidates and let voting filter out errors. Yet voting alone is not enough, because the multi-candidate recipe has three coupled weaknesses: 1) sampling more from a single generator produces increasingly redundant candidates, 2) existing pipelines apply one generic correction to every non-clean execution result, while runtime errors, timeouts, and empty results each indicate a different distance from correctness, and 3) existing selectors rely on a single angle such as result-majority voting or pairwise SQL comparison, missing what other angles would have caught. We present SIRIUS-SQL, which addresses all three weaknesses. A difficulty-smoothing RL recipe trains SIRIUS-32B to generate diverse executable SQL candidates, paired with a generalist LLM that fills in gaps left by the specialist. An execution-grounded lifecycle classifies each outcome and applies targeted repair before candidates re-enter the pool. A confidence-gated hybrid selector combines execution-result agreement with pairwise SQL-form judgment, escalating only near-tied cases to a deterministic structural check. SIRIUS-SQL reaches 75.88% on BIRD dev and 91.20% on SPIDER test. Two of three generalist pairings surpass Agentar-Scale-SQL, the strongest published multi-candidate system on BIRD dev.
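To make the idea of classifying execution outcomes before voting more concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It labels each candidate as a runtime error, an empty result, or a clean result, then takes a majority vote over the clean result sets. The labels, the helper names `classify` and `majority_result`, and the toy table are all hypothetical; timeouts, targeted repair, and the paper's hybrid selector are not modelled here.

```python
import sqlite3
from collections import Counter


def classify(conn, sql):
    """Run one candidate and label its outcome (labels are illustrative)."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return "runtime_error", str(exc)
    if not rows:
        return "empty_result", rows
    return "ok", rows


def majority_result(conn, candidates):
    """Vote over the result sets of candidates that executed cleanly."""
    votes = Counter()
    for sql in candidates:
        status, payload = classify(conn, sql)
        if status == "ok":
            votes[tuple(sorted(payload))] += 1
    return votes.most_common(1)[0][0] if votes else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?)",
    [("ann", "ops"), ("bob", "ops"), ("cy", "hr")],
)

candidates = [
    "SELECT COUNT(*) FROM emp WHERE dept = 'ops'",
    "SELECT COUNT(name) FROM emp WHERE dept = 'ops'",
    "SELECT COUNT(*) FROM employees",             # table does not exist
    "SELECT name FROM emp WHERE dept = 'legal'",  # executes but returns no rows
]
for sql in candidates:
    print(classify(conn, sql)[0], "|", sql)
print("majority result:", majority_result(conn, candidates))
```

Separating the outcome label from the vote is what lets a pipeline treat a missing-table error differently from a query that runs but returns nothing.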

2606.01243 2026-06-02 cs.CL cs.LG

Unlocking the Black Box of Latent Reasoning: An Interpretability-Guided Approach to Intervention

Shuochen Chang, Tong Bai, Xiaofeng Zhang, Qianli Ma, Qingyang Liu, Zhaohe Liao, Yibo Miao, Li Niu

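Affiliations: Shanghai Jiao Tong University; Fudan University

AI summary: Analyses the interpretability of latent reasoning vectors with structural, causal, and geometric probes, and builds on the findings to propose training-free, decode-time interventions that improve LLM reasoning accuracy.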

Journal ref ACL 2026 Main

Abstract

Latent reasoning enables Large Language Models (LLMs) to perform multi-step inference within continuous hidden states, offering efficiency gains over explicit Chain-of-Thought (CoT). However, the opacity of these continuous thought vectors hinders their reliability and controllability. This paper bridges the gap between mechanistic interpretability and actionable control. We first present a systematic analysis using structural, causal, and geometric probes, revealing that latent vectors encode compressed, faithful representations of reasoning steps, with early vectors acting as critical causal hubs. Building on this, we operationalize these interpretability insights into a suite of training-free, decode-time interventions that refine the latent reasoning process by imposing the identified geometric and semantic priors. Extensive experiments across multiple model scales and diverse task domains demonstrate that these interpretability-guided interventions consistently improve reasoning accuracy without any parameter updates.

2606.01240 2026-06-02 cs.CL

Efficient RAG with Intent-Aware Retrieval and Semantics-Preserving Chunking

Fachrina Dewi Puspitasari, Chaoning Zhang, Jiaquan Zhang, Zhicheng Wang, Hafiz Shakeel Ahmad Awan, Rizwan Qureshi, Jewon Lee, Tae-Ho Kim, Yang Yang

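Affiliations: School of Computer Science and Engineering, University of Electronic Science and Technology of China; Massachusetts General Hospital, Harvard University; Nota AI

AI summary: Proposes the InSemRAG framework, which combines an intent-aware retriever and a semantics-preserving chunking module with an iterative retrieve-and-check mechanism to address information insufficiency in conventional RAG, reporting gains on multi-hop and evidence-sensitive tasks.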

Abstract

The demand for strong instruction-following and reasoning capabilities in large language models (LLMs) has driven the rapid development of retrieval-augmented generation (RAG). A RAG system assists LLM generation by retrieving chunks of supplementary knowledge that match the query from an external database. Conventional RAG systems, however, suffer from information insufficiency due to two factors: intent-agnostic retrieval and information fragmentation. Our work proposes a RAG framework, termed InSemRAG, that addresses these challenges via an iterative retrieve-and-check mechanism with two supporting modules, an intention-aware retriever (IAR) and semantics-preserving chunking (SPC). IAR implements a dynamic hybrid retrieval method that adaptively weights the retrieval channels based on the query intent, while SPC detects and repairs damaged evidence chunks to preserve semantic integrity. To reduce the latency introduced by the iterative mechanism, we leverage small language models (SLMs). Extensive experiments across several benchmark datasets consistently demonstrate that our method is competitive with recent state-of-the-art RAG mechanisms. In particular, our method achieves significant gains on multi-hop and evidence-sensitive tasks, with a 2.65-point improvement in F1 on HotPotQA and a 1.5-point increase in accuracy on FEVER. Using SLMs, it also performs competitively with Multi-Hop RAG at 4.32$\times$ lower latency.

2605.04819 2026-06-02 cs.LG

Unsat Core Prediction through Polarity-Aware Representation Learning over Clause-Literal Hypergraphs

Zhenchao Sun, Shuai Ma, Ping Lu, Chongyang Tao

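Affiliations: University of Science and Technology of China

AI summary: Proposes a polarity-aware representation learning framework that models SAT formulas as clause-literal hypergraphs and predicts unsat cores using polarity-aware decomposition and polarity-inversion consistency regularization.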

Comments Accepted at ICML 2026

Abstract

Graph neural networks have been widely used in Boolean satisfiability (SAT) tasks to learn structural information from SAT formulas. The goal of these studies is to solve SAT instances or to enhance SAT solvers, including tasks such as unsat-core prediction. However, most existing approaches model a SAT formula as a bipartite graph or a directed acyclic graph, which are less direct in capturing clause-level and higher-order interactions among literals and clauses. Moreover, these approaches are limited in modeling intrinsic polarity-related properties of SAT, such as the complementary relationship between the positive and negative literals of a variable. To address these limitations, we propose a polarity-aware representation learning framework over clause-literal hypergraphs. We model SAT formulas as clause-literal hypergraphs augmented with a clause incidence graph to capture higher-order structural interactions. We then introduce a polarity-aware decomposition mechanism that separates variable representations into polarity invariant and equivariant components, explicitly modeling the relationship between positive and negative literals, with the resulting literal representations propagated along the hypergraph structure. We further incorporate a polarity-inversion consistency regularization to reinforce polarity-consistent representations during training. Experimental results on multiple SAT datasets demonstrate the effectiveness of the proposed approach.

2605.04638 2026-06-02 cs.CL cs.AI

Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models

Mingda Li, Rundong Lv, Xinyu Li, Weinan Zhang, Ting Liu

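Affiliations: University of Science and Technology of China

AI summary: Proposes SemGrad, described as the first gradient-based uncertainty quantification method for free-form generation, which computes gradients in semantic space to give efficient, sampling-free uncertainty estimates.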

Comments Accepted by ICML 2026

Abstract

Uncertainty quantification (UQ) is an important technique for ensuring the trustworthiness of LLMs, given their tendency to hallucinate. Existing state-of-the-art UQ approaches for free-form generation rely heavily on sampling, which incurs high computational cost and variance. In this work, we propose the first gradient-based UQ method for free-form generation, SemGrad, which is sampling-free and computationally efficient. Unlike prior gradient-based methods developed for classification tasks, which operate in parameter space, we propose to consider gradients in semantic space. Our method builds on the key intuition that a confident LLM should maintain stable output distributions under semantically equivalent input perturbations. We interpret this stability as gradients in semantic space and introduce a Semantic Preservation Score (SPS) to identify embeddings that best capture semantics, with respect to which gradients are computed. We further propose HybridGrad, which combines the strengths of SemGrad and parameter gradients. Experiments demonstrate that both of our methods provide efficient and effective uncertainty estimates, outperforming state-of-the-art methods, particularly in settings with multiple valid responses.

2605.04193 2026-06-02 cs.AI cs.LG cs.LO

ANDRE: An Attention-based Neuro-symbolic Differentiable Rule Extractor for Inductive Logic Programming

Iman Sharifi, Peng Wei, Saber Fallah

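Affiliations: Dept. of Mechanical and Aerospace Engineering, George Washington University, USA; Dept. of Mechanical Engineering Sciences, University of Surrey, UK

AI summary: Proposes the ANDRE framework, which optimizes over a continuous rule space with attention-driven differentiable logical operators to learn first-order logic rules from probabilistic data while remaining robust and interpretable under noise.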

Comments 35 pages, 8 figures, 10 tables

Abstract

Inductive Logic Programming (ILP) aims to learn interpretable first-order rules from data, but existing symbolic and neuro-symbolic approaches struggle to scale to noisy and probabilistic settings. Classical ILP relies on discrete combinatorial rule search and is brittle under uncertainty, while differentiable ILP methods typically depend on predefined rule templates or inaccurate fuzzy operators that suffer from vanishing gradients or poor approximation of logical structure when reasoning over probabilistic predicate valuations. This paper proposes an Attention-based Neuro-symbolic Differentiable Rule Extractor (ANDRE), a novel ILP framework that learns first-order logic programs by optimizing over a continuous rule space with attention-based logical operators. ANDRE replaces both rule templates and logical operators with fully differentiable, attention-driven conjunction and disjunction operators that approximate logical min-max semantics, enabling accurate, stable, and interpretable reasoning over probabilistic data. By softly selecting, negating, or excluding predicates within each rule, ANDRE supports flexible rule induction while preserving symbolic structure. Extensive experiments on classical ILP benchmarks, large-scale knowledge bases, and synthetic datasets with probabilistic predicates and noisy supervision demonstrate that ANDRE achieves competitive or superior predictive performance while reliably recovering correct symbolic rules under uncertainty. In particular, ANDRE remains robust to moderate label noise, substantially outperforming existing differentiable ILP methods in both rule extraction quality and stability.
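One generic way to see how attention weights can approximate min-max semantics is a softmax-weighted average that concentrates on the smallest input (for conjunction) or the largest input (for disjunction). The sketch below is only that generic construction, assuming a fixed temperature; the names `soft_and` and `soft_or` are hypothetical and the operators actually used in ANDRE are not specified in this abstract.

```python
import math


def _softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_and(values, temperature=0.05):
    """Attention-weighted average that concentrates on the smallest input."""
    weights = _softmax([-v / temperature for v in values])
    return sum(w * v for w, v in zip(weights, values))


def soft_or(values, temperature=0.05):
    """Attention-weighted average that concentrates on the largest input."""
    weights = _softmax([v / temperature for v in values])
    return sum(w * v for w, v in zip(weights, values))


valuations = [0.9, 0.2, 0.7]  # probabilistic truth values of three predicates
print("soft AND:", soft_and(valuations), "hard min:", min(valuations))
print("soft OR: ", soft_or(valuations), "hard max:", max(valuations))
```

Because each output is a convex combination of the inputs, it always stays between the hard min and the hard max, and lowering the temperature moves it closer to the corresponding extreme.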

2605.03403 2026-06-02 cs.CV cs.LG

GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning

Yujun Li, Hongyuan Zhang, Yuan Yuan

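Affiliations: School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University

AI summary: Proposes GRPO-TTA, which applies GRPO to test-time adaptation by recasting class-specific prompt prediction as a group-wise policy optimization problem with alignment and dispersion rewards, and reports gains over existing methods on diverse benchmarks.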

Abstract

Group Relative Policy Optimization (GRPO) has recently shown strong performance in post-training large language models and vision-language models. This raises the question of whether GRPO can also benefit test-time adaptation (TTA) of vision-language models. In this paper, we propose Group Relative Policy Optimization for Test-Time Adaptation (GRPO-TTA), which adapts GRPO to the TTA setting by reformulating class-specific prompt prediction as a group-wise policy optimization problem. Specifically, we construct output groups by sampling top-K class candidates from CLIP similarity distributions, enabling probability-driven optimization without access to ground-truth labels. Moreover, we design reward functions tailored to test-time adaptation, including alignment rewards and dispersion rewards, to guide effective visual encoder tuning. Extensive experiments across diverse benchmarks demonstrate that GRPO-TTA consistently outperforms existing test-time adaptation methods, with notably larger performance gains under natural distribution shifts.
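The group-wise part of GRPO can be illustrated with the usual group-relative advantage: each sampled candidate's reward is centred and scaled by the statistics of its own group, so no learned value function is needed. The sketch below assumes population standard deviation and made-up reward values; the function name is hypothetical, and the alignment and dispersion rewards from the abstract are not reproduced.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Centre and scale rewards within one sampled group.

    Assumption: population standard deviation is used; implementations
    differ on this detail.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four class candidates sampled for one test image.
rewards = [0.8, 0.5, 0.2, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Candidates that score above the group mean receive positive advantages and those below receive negative ones, which is what drives the policy update without ground-truth labels.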

2605.02277 2026-06-02 cs.CL

CECOR: Correction-oriented synthetic data construction for factual error correction

Lei Zhu, Xiaobao Wang, Jianbiao Yang, Chenyang Wang, Dongxiao He, Longbiao Wang, Jianwu Dang

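Affiliations: Tianjin University

AI summary: Proposes the CECoR framework, which synthesizes high-quality training data through a decomposition-and-injection paradigm and, with a two-stage learning strategy, improves the accuracy and robustness of multi-hop factual error correction.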

Abstract

Factual Error Correction (FEC) aims to revise inaccurate text into statements that are factually consistent with external evidence. Although recent methods perform well on single-hop correction, they often treat claims as atomic units and struggle with multi-hop cases that require compositional reasoning across multiple evidence sources. This challenge is further amplified by limited paired data and difficulties in locating semantic errors within complex reasoning chains. We present CECoR (Compositional Error Correction via Reasoning-aware Synthesis), a reasoning-aware framework that introduces a Decomposition and Injection paradigm for compositional error correction. CECoR decomposes multi-hop claims into interpretable reasoning steps and injects controlled perturbations to synthesize high-quality training pairs. A two-stage learning strategy combining supervised fine-tuning and reinforcement learning improves factual accuracy and robustness. Comprehensive evaluations show that CECoR achieves strong performance on multi-hop benchmarks, outperforming both distantly supervised methods and few-shot LLM baselines. It also generalizes effectively to single-hop correction and remains stable under noisy evidence, demonstrating its versatility for real-world factual correction.

2606.01238 2026-06-02 cs.RO cs.LG

Training-Free Imitation Learning with Closed-Form Diffusion Policies

Raghav Mishra, Ian R. Manchester

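Affiliations: Australian Center for Robotics, ARIAM Hub, and School of Aerospace, Mechanical and Mechatronic Engineering, University of Sydney

AI summary: Proposes Closed-Form Diffusion Policies (CFDP), training-free diffusion policies based on the closed-form score of the demonstration dataset, which run imitation in real time at millisecond latency, perform competitively with neural baselines that need hours of training, and support inference-time policy editing and demonstration augmentation.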

Abstract

While diffusion-based policies have impressive performance and expressivity, their long offline training slows down the data collection and policy deployment loop. We introduce Closed-Form Diffusion Policies (CFDP), a class of training-free diffusion-based policies for imitation learning using the closed-form score derived from the demonstration dataset. We deploy CFDP for real-time inference on a mobile CPU in hardware experiments, showing it can successfully perform imitation directly from the dataset in milliseconds and with faster inference than neural diffusion policies. In experiments on imitation learning benchmarks, we show that CFDP is competitive against neural baselines that require hours of training, providing a favorable tradeoff between training time and performance. Finally, we show how closed-form diffusion policies act as a composable primitive that enables data-driven inference-time editing of pre-trained neural diffusion policies, including policy guidance and novel demonstration augmentation.
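For intuition about what a closed-form score over a dataset can look like, the sketch below computes the score of a Gaussian-smoothed empirical distribution: a softmax over negative squared distances gives each demonstration a weight, and the score is the weighted pull towards the data divided by the variance. This is the standard textbook expression for a Gaussian mixture centred on the data points, offered under the assumption of a single isotropic noise level; conditioning on observations and the noise schedule used by CFDP are not modelled, and the function name is hypothetical.

```python
import math


def closed_form_score(x, data, sigma):
    """Score (gradient of the log density) of an equal-weight Gaussian
    mixture with one isotropic component of std `sigma` per data point."""
    logits = [
        -sum((a - b) ** 2 for a, b in zip(x, d)) / (2.0 * sigma ** 2)
        for d in data
    ]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [
        sum(w * (d[k] - x[k]) for w, d in zip(weights, data)) / sigma ** 2
        for k in range(len(x))
    ]


# Two toy "demonstration actions" in 2-D (illustrative values only).
demos = [(0.0, 0.0), (1.0, 1.0)]
near_first = closed_form_score((0.1, 0.0), demos, sigma=0.1)
midpoint = closed_form_score((0.5, 0.5), demos, sigma=1.0)
print(near_first)  # points back towards (0, 0)
print(midpoint)    # pulls cancel by symmetry
```

No parameters are fitted anywhere: the dataset itself defines the score, which is the sense in which such a policy can be called training-free.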

2606.01237 2026-06-02 cs.AI

Brain-Atlas-Guided Generative Counterfactual Attention for Explainable Cognitive Decline Diagnosis Using Multimodal Connectomes

Xiongri Shen, Jiaqi Wang, Zhenxi Song, Yi Zhong, Leilei Zhao, Xin He, Baiying Lei, Zhiguo Zhang

Affiliations: Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen); School of Intelligence Science and Engineering, College of Artificial Intelligence, Harbin Institute of Technology; School of Biomedical Engineering, National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, Shenzhen University Medical School, Shenzhen University

AI summary: Proposes a brain-atlas-knowledge-guided Generative Counterfactual Attention-guided Network (GCAN) that casts diagnosis as a source-to-target counterfactual generation problem and uses multimodal connectomes for explainable cognitive decline diagnosis.

Abstract

Mild cognitive impairment (MCI) and subjective cognitive decline (SCD) are closely associated with the early Alzheimer's disease continuum, where accurate and explainable diagnosis is important for early risk assessment and intervention. Existing connectome-based deep learning models can improve classification performance but often provide limited insight into disease-related functional and structural connectivity changes. This paper proposes an atlas-knowledge-guided Generative Counterfactual Attention-guided Network (GCAN) for explainable cognitive decline diagnosis using multimodal brain connectomes. GCAN formulates diagnosis as a source-to-target counterfactual generation problem, where target-label connectomes are generated from source-label inputs and their differences are used to construct counterfactual attention maps. To preserve connectome topology, an Atlas-aware Bidirectional Transformer (AABT) performs network-level token encoding and decoding under brain-atlas constraints. The framework is further extended from functional connectivity (FC) to joint functional and structural connectivity (SC) modeling, enabling counterfactual analysis of complementary functional reorganization and structural topology changes. Experiments on hospital-collected and ADNI datasets show that GCAN achieves competitive performance across HC vs. SCD, HC vs. MCI, and SCD vs. MCI classification tasks. Visualization, circular connectome analysis, CAM-based comparison, ablation studies, and confidence interval analysis further support the interpretability and reliability of the proposed framework. Modality-specific FC and SC pre-trained classifiers are used to provide target-state priors for counterfactual generation while being separated from the downstream diagnostic classifier to prevent data leakage.

2606.01230 2026-06-02 cs.AI

HomeFlow: A Data Flywheel for Smart Home Agent Training with Verifiable Simulation

Yi Gu, Huacan Wang, Shuo Zhang, Yuqing Hou, Lei Xue, Weipeng Ming, Chen Liu, Fangzhou Yu, Kuan Li, Ronghao Chen, Sen Hu, Xiaofeng Mou, Yi Xu

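Affiliations: Midea Group; Beijing University of Posts and Telecommunications; Donghua University; Peking University

AI summary: Proposes HomeFlow, a verifiable data flywheel that combines the unified simulation environment HomeEnv, procedural home generation with HomeMaker, Blueprint compilation of user intents, and MCTS-Flow synthesis of verifiable trajectories, then optimizes agents with supervised fine-tuning and step-wise RLVE; HomeFlow-RL-4B and HomeFlow-RL-8B reach 84.60% and 87.03% task success on SmartHome-Bench, with the 8B model surpassing GPT-5.5 by 1.23 percentage points.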

Abstract

Large language model agents are moving beyond text-only interaction toward physical-world control, with smart homes as a representative domain. Real domestic interaction requires understanding ambiguous intents, operating in dynamic environments, and performing multi-turn reasoning. However, existing methods struggle to generate high-quality training data for smart home agents. We propose HomeFlow, a verifiable data flywheel for this domain. HomeFlow uses HomeEnv as a unified simulation environment and HomeMaker to procedurally generate diverse home settings. Subsequently, Blueprint compiles open-ended user intents into executable state-based success conditions, while MCTS-Flow synthesizes diverse, verifiable multi-turn trajectories through environment-guided tree search. We then optimize the agents via supervised fine-tuning and step-wise RLVE, which facilitates iterative improvement through authentic physical feedback. We further construct SmartHome-Bench to evaluate the agent across various smart home tasks. On this benchmark, HomeFlow-RL-4B and HomeFlow-RL-8B achieve task success rates of 84.60% and 87.03%. It is worth noting that HomeFlow-RL-8B even surpasses the leading GPT-5.5 by 1.23 percentage points.

2606.01229 2026-06-02 cs.AI

Application of Algorithms in Energy-Efficient Design Platforms for Green Building

算法在绿色建筑节能设计平台中的应用

Na Yu, Fu Wenli, Guo Fei

发表机构 * First Highway Engineering Co., Ltd.(第一公路工程有限公司)

AI总结 提出一种结合BIM、传感器数据和进化多目标优化的算法平台,通过动态能耗仿真降低建筑能耗29.3%,并验证了其可扩展性和技术可行性。

Comments 9 pages, 4 figures. 2026 International Conference on Big Data Applications in Education and Engineering (ICBDAEE 2026)

详情
AI中文摘要

在绿色建筑设计过程中,计算机辅助能耗评估被广泛用于提高效率并实现整体优化。本文提出一个平台,该平台结合建筑信息模型(BIM)、传感器运行数据以及使用稳健算法的高级仿真工作流。该平台采用多层服务架构,包含动态能耗仿真和进化多目标优化,通过高性能C++核心和自适应代理模型连接。选取一栋中层办公楼作为案例研究。选择五个代表性区域收集建筑围护结构特征和占用模式数据。预处理后,缺失传感器数据占年度记录的3.2%,所有变量通过15分钟插值标准化。经过40轮优化,每平方米年能耗从315 kWh/m²降至223 kWh/m²,下降29.3%。居住者的生命周期成本增加限制在3.7%以内,不舒适小时数降至每年70小时以下。帕累托最优解分析显示,围护结构U值范围为1.05至1.57 W/m²K,夜间通风率范围为2.1至3.6 h⁻¹,两者均与能耗性能密切相关。结果证实,集成算法框架为绿色建筑设计提供了良好的可扩展性、强性能和技术可行性。该平台为设计工程师和可持续发展从业者提供了可靠的决策支持工具,实现了数据驱动的节能建筑精准交付。

英文摘要

During green building design, computer-aided energy assessment is widely used to improve efficiency and achieve overall optimization. This paper presents a platform that combines Building Information Modeling (BIM), sensor operational data, and advanced simulation workflows using robust algorithms. The platform uses a multi-layer service architecture with dynamic energy simulation and evolutionary multi-objective optimization, connected via a high-performance C++ core and adaptive agent models. A mid-rise office building was selected as the case study. Five representative areas were chosen to collect data on building envelope characteristics and occupancy patterns. After preprocessing, missing sensor data accounted for 3.2% of annual records, and all variables were standardized using 15-minute interpolation. After 40 optimization rounds, annual energy consumption per square meter dropped by 29.3% from 315 kWh/m² to 223 kWh/m². The lifecycle cost increase for occupants was limited to 3.7%, and discomfort hours were reduced to under 70 hours per year. Analysis of Pareto optimal solutions shows that the envelope U-value ranges from 1.05 to 1.57 W/m²K, and nighttime ventilation rate ranges from 2.1 to 3.6 h⁻¹, both closely linked to energy performance. The results confirm that the integrated algorithm framework offers good scalability, strong performance, and technical feasibility for green building design. This platform provides a reliable decision-support tool for design engineers and sustainability practitioners, enabling accurate, data-driven delivery of energy-efficient buildings.

2606.01227 2026-06-02 cs.LG q-bio.NC

DAGGER: Gradient-Free Construction of Transiently Amplifying Networks under Hard Connectivity Constraints

DAGGER: 硬连接约束下瞬态放大网络的无梯度构造

James C. Ferguson

发表机构 * The African Institute for Mathematical Sciences(非洲数学科学研究所) Institute of Science and Technology Austria(奥地利科学技术研究所)

AI总结 提出无梯度单遍算法DAGGER,在硬符号/稀疏/对角约束下构造瞬态放大网络,通过单一标量β控制Wasserstein-2预算实现放大与多重集保留的平滑权衡。

Comments 12 pages, 7 figures

详情
AI中文摘要

许多网络不仅支持而且依赖于瞬态非正态放大,即稳定系统的活动增加数个数量级。在硬符号/稀疏/对角约束(与生物连接组和结构化RNN初始化相关的区域)下构造此类网络,迄今为止需要基于梯度的局部搜索(包含数千次内循环特征分解)或基于Schur形式的直接构造(在抽象基中,投影后破坏约束)。 本文提出DAGGER(有向无环图引导边重加权),一种无梯度单遍算法。给定稳定的有符号稀疏矩阵,DAGGER产生具有相同符号、稀疏性和对角的输出。单一标量β控制Wasserstein-2预算,平滑地权衡精确多重集保留(β=0)与放大;峰值放大随β几乎无界增长,经验上在数值溢出前达到10^10。 在单次前向传递中,DAGGER在多重集保留方面匹配或超过基于梯度的方法(比典型梯度内循环少30-100倍特征分解),并且在中等β下,在精确保持连接性的同时,超过它们数个数量级。我们开发了该算法,将其与现有方法以及下游信号检测任务进行比较,并检查了显示DAGGER在结构上与其他放大网络不同的诊断结果。

英文摘要

Many networks not only support but also rely on transient non-normal amplification, an orders-of-magnitude increase in the activity of an otherwise stable system. Constructing such networks under hard sign/sparsity/diagonal constraints -- the regime relevant for biological connectomes and structured RNN initializations -- has so far required either gradient-based local search with thousands of inner-loop eigendecompositions or Schur-form direct construction in an abstract basis that breaks the constraints under projection. Here we introduce DAGGER (Directed Acyclic Graph Guided Edge Reweighting), a gradient-free single-pass algorithm. Given a stable signed sparse matrix, DAGGER produces an output with the same sign, sparsity, and diagonal. A single scalar $β$ controls a Wasserstein-2 budget that smoothly trades exact multiset preservation ($β= 0$) for amplification; peak amplification grows essentially without bound with $β$, empirically reaching $10^{10}$ before numerical overflow. DAGGER matches or exceeds gradient-based methods at multiset preservation in a single forward pass -- 30-100$\times$ fewer eigendecompositions than a typical gradient inner loop -- and at moderate $β$ beats them by orders of magnitude with connectivity exactly preserved. We develop the algorithm, compare it to the existing methods and on a downstream signal-detection task, and examine the diagnostics that show why DAGGER is structurally different from other amplifying networks.
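The phenomenon the abstract builds on, transient non-normal amplification, can be seen in a two-state toy system. The sketch below is not the DAGGER algorithm; it assumes a hand-picked stable matrix A = [[-1, k], [0, -2]] (both eigenvalues negative) and uses the closed-form propagator exp(At) for this triangular case. The function names and the coupling parameter `k` are illustrative only.

```python
import math

def propagator(k, t):
    """exp(A t) for A = [[-1, k], [0, -2]], closed form for this triangular A."""
    e1, e2 = math.exp(-t), math.exp(-2.0 * t)
    # Off-diagonal entry is k * (e^{-t} - e^{-2t}) / ((-1) - (-2)) = k * (e1 - e2).
    return [[e1, k * (e1 - e2)], [0.0, e2]]

def spectral_norm_2x2(m):
    """Largest singular value of a 2x2 matrix."""
    (a, b), (c, d) = m
    s = a * a + b * b + c * c + d * d
    det = a * d - b * c
    disc = max(0.0, s * s - 4.0 * det * det)
    return math.sqrt((s + math.sqrt(disc)) / 2.0)

def peak_amplification(k, t_max=10.0, steps=2000):
    """Maximum of ||exp(A t)|| over a uniform time grid on [0, t_max]."""
    return max(
        spectral_norm_2x2(propagator(k, i * t_max / steps))
        for i in range(steps + 1)
    )

if __name__ == "__main__":
    for k in (0.0, 10.0, 100.0):
        print(k, peak_amplification(k))
```

With `k = 0` the matrix is normal and the norm never exceeds its initial value of 1; increasing `k` leaves the eigenvalues untouched but raises the transient peak, which is the kind of quantity the abstract's β budget is said to trade against.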

2606.01224 2026-06-02 cs.AI

Advanced Mathematics Learning Behavior Prediction and Academic Early Warning Model Based on Multimodal Data Analysis

基于多模态数据分析的高等数学学习行为预测与学业预警模型

Liu Qiong, Li Zhengbo

发表机构 * Moutai Institute(茅台学院)

AI总结 针对高等数学教育中高风险学生早期识别与干预的挑战,提出一种融合知识图谱与多模态时序建模的动态预测框架,通过异构注意力机制和自适应边权重实现精准预警与个性化干预。

Comments 12 pages, 5 figures

详情
AI中文摘要

高风险学生的早期发现和及时学业干预是高等数学教育中的主要挑战,其中复杂的概念层次和非线性学习轨迹常常阻碍学生的学业表现。本研究采用多模态数据分析,构建了一个用于学习行为预测和学业预警的动态框架。它构建了层次化的知识图谱本体,根据问题难度和学生表现实现自适应边权重,并结合异构图注意力与时间序列建模来捕捉学生不断演变的知识状态。在学期多模态数据集上的实证测试证明,该方法能够准确识别高风险学生,并有效跟踪错误传播。有针对性的干预显著提高了学生的知识掌握程度并降低了学业风险。结果验证了将知识图谱分析与多模态时序建模相结合,可以为高等数学教育提供更高效、更个性化的学习支持。

英文摘要

Early detection of at-risk students and timely academic intervention pose major challenges in advanced mathematics education, where complex conceptual hierarchies and nonlinear learning trajectories often hold back students' academic performance. This study adopts multimodal data analytics to build a dynamic framework for learning behavior prediction and academic early warning. It constructs a hierarchical knowledge graph ontology, realizes adaptive edge weighting according to problem difficulty and student performance, and combines heterogeneous graph attention with temporal sequence modeling to capture students' evolving knowledge states. Empirical tests on semester-long multimodal datasets prove that this method can accurately identify high-risk students and effectively track error propagation. Targeted interventions greatly improve students' knowledge mastery and reduce academic risks. The results verify that integrating knowledge graph analytics with multimodal temporal modeling can deliver more efficient and personalized learning support for advanced mathematics education.

2606.01223 2026-06-02 cs.CL cs.AI

Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue

连接点:长时对话中的反思性记忆基准测试

Jingjie Lin, Bingbing Wang, Zihan Wang, Zhengda Jin, Weiming Qiao, Jing Li, Ruifeng Xu

发表机构 * Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) The Hong Kong Polytechnic University(香港理工大学) Fudan University(复旦大学) Peng Cheng Laboratory(鹏城实验室)

AI总结 针对现有基准无法衡量从碎片化多模态线索合成高层解释的反思性记忆问题,提出RefMem-Bench基准和REMIND层次框架,通过渐进式证据感知、定位和抽象提升模型反思性记忆能力。

Comments 9 pages, 6 figures

详情
AI中文摘要

尽管长上下文建模取得了显著进展,现有基准仍局限于显式回忆的事实性记忆,未能衡量将碎片化、多模态线索合成为高层解释所需的反思性记忆。为填补这一空白,我们引入了RefMem-Bench,一个用于长时对话中反思性记忆的基准。RefMem-Bench包含26K个带注释的问答实例,涵盖八个反思性记忆维度和三种任务格式,要求模型超越表面检索,从分布在整个交互历史中的证据推断潜在含义。为增强反思性记忆能力,我们提出了反思性记忆归纳(REMIND),一个将反思性记忆视为渐进意义构建的层次框架。REMIND结合了问题条件证据检索、显著性感知定位和抽象级别监督,并使用渐进式反思对齐将高层反思性推理提炼到事实推理路径中。实验表明,RefMem-Bench对当前模型构成了重大挑战,而REMIND通过渐进式证据感知、定位和抽象,持续提高了答案准确性和记忆回忆率。

英文摘要

Despite substantial progress in long-context modeling, existing benchmarks remain confined to factual memory for explicit recall, failing to measure the reflective memory required to synthesize fragmented, multimodal cues into high-level interpretations. To address this gap, we introduce RefMem-Bench, a benchmark for reflective memory in long-horizon dialogue. RefMem-Bench contains 26K annotated QA instances with eight reflective-memory dimensions and three task formats, requiring models to move beyond surface-level retrieval and infer latent meanings from evidence distributed across interaction histories. To enhance reflective memory capability, we propose REflective Memory INDuction (REMIND), a hierarchical framework that treats reflective memory as progressive meaning construction. REMIND couples question-conditioned evidence retrieval, salience-aware grounding, and abstraction-level supervision, and uses Progressive Reflective Alignment to distill high-level reflective reasoning into the factual inference pathway. Experiments show RefMem-Bench poses a substantial challenge to current models, while REMIND consistently improves both answer accuracy and memory recall through progressive evidence perception, grounding, and abstraction.

2606.01221 2026-06-02 cs.LG cs.AI

Hybrid Imbalanced Regression Through Unified Data-Level and Algorithm-Level Balancing

混合不平衡回归:统一的数据级与算法级平衡方法

Shermin Shahbazi, Hossein Mohammadi, Mohsen Afsharchi

发表机构 * Zahedan National University(札赫德安国立大学)

AI总结 提出一个五阶段混合框架,结合自适应分箱、条件变分自编码器、特征空间聚类过采样、潜在密度加权损失和注意力门控融合,解决回归中的不平衡问题。

Comments 52 pages, 20 figures, accepted at Expert Systems with Applications

Journal ref Expert Systems with Applications, Date: 1 August 2026, Article: 131908, Volume: 322

详情
AI中文摘要

不平衡学习是机器学习中的一个关键挑战,其中代表性不足的目标值可能使模型产生偏差,并降低对罕见但重要案例的预测性能。尽管在分类中得到了广泛研究,不平衡回归仍然相对未被充分探索。现有方法主要关注数据级平衡(可能引入噪声和过拟合)或算法级平衡(通常难以处理高度复杂的目标分布)。为了解决这些局限性,我们提出了一个统一的混合框架,将数据级和算法级平衡策略集成到一个与回归器无关的流水线中。该框架包括五个阶段:(1)自适应分箱划分,基于局部线性一致性动态分割目标空间;(2)使用条件变分自编码器进行目标条件表示学习;(3)通过特征空间聚类和少数类过采样进行多阶段数据级平衡;(4)使用新颖的潜在密度加权损失(LDWL)进行算法级平衡,以强调潜在空间和目标空间中的稀有样本;(5)基于注意力的门控融合用于最终回归。在基准数据集上的实验结果表明,与单独的回归器和现有的不平衡回归方法相比,所提出的框架持续提高了预测性能。

英文摘要

Imbalanced learning is a critical challenge in machine learning, where underrepresented target values can bias models and degrade prediction performance on rare but important cases. Although extensively studied in classification, imbalanced regression remains relatively underexplored. Existing methods mainly focus on either data-level balancing, which may introduce noise and overfitting, or algorithm-level balancing, which often struggles with highly complex target distributions. To address these limitations, we propose a unified hybrid framework that integrates both data- and algorithm-level balancing strategies into a regressor-agnostic pipeline. The proposed framework consists of five stages: (1) adaptive bin partitioning to dynamically segment the target space based on local linear coherence; (2) target-conditioned representation learning using a Conditional Variational Autoencoder; (3) multistage data-level balancing through feature-space clustering and oversampling of minority clusters; (4) algorithm-level balancing using a novel Latent-Density Weighted Loss (LDWL) to emphasize rare samples in latent and target spaces; and (5) attention-based gated fusion for final regression. Experimental results on benchmark datasets demonstrate that the proposed framework consistently improves predictive performance compared to standalone regressors and existing imbalanced regression approaches.
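Stage (4) describes a loss that emphasizes rare samples. The abstract does not give the LDWL formula, so the following is only a generic illustration of density-based weighting: it assumes a simple histogram over the target values alone (no latent space) and weights each sample by the inverse of its bin count. `inverse_density_weights` and `weighted_mse` are hypothetical names, not the authors' API.

```python
def inverse_density_weights(targets, n_bins=5):
    """Weight each sample by 1 / (count of its target-value bin), normalized to mean 1."""
    lo, hi = min(targets), max(targets)
    width = (hi - lo) / n_bins
    if width == 0.0:
        width = 1.0  # all targets identical: a single bin
    bins = [min(int((y - lo) / width), n_bins - 1) for y in targets]
    counts = {}
    for b in bins:
        counts[b] = counts.get(b, 0) + 1
    raw = [1.0 / counts[b] for b in bins]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]

def weighted_mse(preds, targets, weights):
    """Mean squared error with per-sample weights."""
    total = sum(w * (p - y) ** 2 for p, y, w in zip(preds, targets, weights))
    return total / sum(weights)

if __name__ == "__main__":
    y = [1.0, 1.1, 1.2, 1.3, 9.0]          # one rare, large target
    w = inverse_density_weights(y)
    print(w)
    print(weighted_mse([1.0] * 5, y, w))   # the rare sample dominates the loss
```

In this toy case the single rare target receives four times the weight of each common one, so an error on it costs proportionally more than under plain MSE.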

2606.01220 2026-06-02 cs.LG cs.AI

Fine-Tuning Diffusion Models for Molecular Generation via Reinforcement Learning and Fast Sampling

通过强化学习和快速采样微调扩散模型用于分子生成

Guang Lin, Shikui Tu, Lei Xu

发表机构 * Department of Computer Science and Engineering, Shanghai Jiao Tong University(上海交通大学计算机科学与工程系) Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)(广东人工智能与数字经济实验室(深圳))

AI总结 提出FTDiff框架,结合组相对策略优化和快速采样机制,微调扩散模型以生成满足多目标药物设计约束的高质量分子。

Comments 13 pages, 7 figures

详情
AI中文摘要

生成同时满足类药性质并符合目标蛋白三维结构的分子是基于结构的药物设计(SBDD)中的核心挑战。然而,现有的生成方法通常依赖于采样过程中昂贵的后处理或训练时需要精心策划的数据集,但增益仍然有限。这些限制在多目标设置中尤为突出,平衡冲突标准仍是一个核心挑战。为了解决这些问题,我们提出了FTDiff,一个专为结构约束下基于扩散的分子生成量身定制的强化学习微调框架。为了确保稳定且样本高效的优化,FTDiff采用了组相对策略优化(GRPO)风格策略。此外,FTDiff基于一个无时间预训练扩散模型,并集成了快速采样机制,减少了去噪步数,在保持生成质量的同时显著加速了训练和推理。通过优化一个固定阈值感知的奖励,FTDiff有效引导模型生成有效、多样且高质量的分子,平衡多个药物设计目标。在基准数据集上的大量实验表明,FTDiff始终优于先前的方法,且无需昂贵的后处理优化或复杂的数据工程。

英文摘要

Generating molecules that simultaneously satisfy drug-like properties and conform to the 3D structure of a target protein is a core challenge in structure-based drug design (SBDD). Existing generative approaches, however, often rely on costly post-hoc processing during sampling or require carefully curated datasets during training, yet still achieve modest gains. These limitations are especially pronounced in multi-objective settings, where balancing conflicting criteria remains a core challenge. To address these challenges, we propose FTDiff, a reinforcement learning fine-tuning framework tailored for diffusion-based molecular generation under structural constraints. To ensure stable and sample-efficient optimization, FTDiff adopts a group relative policy optimization (GRPO) style strategy. Furthermore, FTDiff builds upon a time-free pretrained diffusion model and incorporates a fast sampling mechanism that reduces the number of denoising steps, significantly accelerating both training and inference while maintaining generation quality. By optimizing a fixed threshold-aware reward, FTDiff effectively guides the model to produce valid, diverse, and high-quality molecules that balance multiple drug design objectives. Extensive experiments on benchmark datasets demonstrate that FTDiff consistently outperforms prior methods, without requiring expensive post-hoc optimization or intricate data engineering.
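The "GRPO style strategy" mentioned above usually refers to scoring each sample relative to the other samples drawn for the same condition. A minimal sketch of that normalization step follows, assuming the common formulation (reward minus group mean, divided by group standard deviation); how FTDiff defines its reward or groups is not stated in the abstract, and `group_relative_advantages` is an illustrative name.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

if __name__ == "__main__":
    # Hypothetical rewards for four molecules sampled for the same protein pocket.
    print(group_relative_advantages([0.2, 0.5, 0.5, 0.9]))
```

Because advantages are centered within the group, no separate value network is needed as a baseline; a group with identical rewards yields all-zero advantages and therefore no gradient signal.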

2606.01217 2026-06-02 cs.CV cs.LG stat.AP

Analysis of Ethnic Disparities in Autism Spectrum Disorder among Toddlers

幼儿自闭症谱系障碍中的种族差异分析

Aadithya Prabha Ramaharsha, Deevna Reddy, Uma Ranjan

发表机构 * Sri Ramachandra Institute of Higher Education and Research(Sri Ramachandra高等教育与研究学院)

AI总结 通过逻辑回归分析,研究种族、行为评分、性别和新生儿黄疸对幼儿自闭症谱系障碍(ASD)的影响,发现白种人ASD风险比亚洲人高81%,中东人低79%,并确认新生儿黄疸和男性为显著风险因素。

Comments Third International Conference Biomedical Engineering Science and technology

详情
AI中文摘要

自闭症谱系障碍(ASD)是一种以沟通和行为挑战为特征的神经发育障碍。本研究考察了种族与ASD特征之间的关系,以及行为评分、性别和新生儿黄疸在三个种族群体(白种人、亚洲人和中东人)中的差异。我们进行了逻辑回归分析,表明种族对ASD发病率有显著影响。与亚洲人相比,白种人患ASD的风险增加81%,而中东人患ASD的风险降低79%。我们还证实了早期研究,即新生儿黄疸是ASD的重要预测因子,而男性儿童患ASD的风险远高于女性儿童。这些结果表明,需要建立在ASD特征的表现和评估中考虑种族差异的诊断框架和干预措施。

英文摘要

Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder characterized by challenges in communication and behavior. This study examines the relationship between ethnicity and ASD traits, along with behavioural scores, sex and neonatal jaundice across three ethnic groups: White Europeans, Asians, and Middle Eastern individuals. We perform a logistic regression and show that ethnicity has a significant effect on the incidence of ASD. Compared to Asians, White Europeans are at 81% increased risk of ASD and Middle Easterners are at 79% reduced risk. We also confirm earlier studies which show that neonatal jaundice is a significant predictor of ASD, while male children are at much higher risk of ASD than female children. These results suggest the need for diagnostic frameworks and interventions that account for ethnic differences in the presentation and assessment of ASD traits.
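Percentages such as "81% increased" and "79% reduced" are what one obtains when exponentiating logistic regression coefficients. The sketch below shows that conversion under the assumption that the reported figures are odds ratios relative to the Asian reference group (1.81 and 0.21); the abstract does not state the coefficients, and strictly speaking an odds ratio describes odds, not risk.

```python
import math

def percent_change_in_odds(beta):
    """Convert a logistic regression coefficient into a percent change in odds."""
    return (math.exp(beta) - 1.0) * 100.0

if __name__ == "__main__":
    # Coefficients back-computed from the assumed odds ratios, for illustration only.
    beta_white_european = math.log(1.81)
    beta_middle_eastern = math.log(0.21)
    print(percent_change_in_odds(beta_white_european))
    print(percent_change_in_odds(beta_middle_eastern))
```

A coefficient of zero corresponds to no change against the reference group, positive coefficients to increased odds and negative ones to reduced odds.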

2606.01216 2026-06-02 cs.LG math.OC

Riemannian Optimization for Hadamard Products of Low-Rank Matrices

低秩矩阵的Hadamard积的黎曼优化

Pratik Jawanpuria, Ankish Chandresh, Bamdev Mishra

发表机构 * Centre for Machine Intelligence and Data Science, Indian Institute of Technology Bombay, India(印度理工学院孟买分校机器智能与数据科学中心,印度) Microsoft India(微软印度)

AI总结 针对低秩矩阵Hadamard积因子的耦合缩放对称性,提出一种基于黎曼商流形的块对角度量,并开发了线性复杂度的梯度下降算法。

详情
AI中文摘要

两个低秩矩阵的逐元素Hadamard积为具有乘法结构的数据提供了一种参数高效的模型,但由于两个因子之间存在耦合的行/列缩放对称性,其建模具有挑战性。为了利用空间几何,我们将此类矩阵的学习问题转化为黎曼商流形上的优化问题。我们提出了一种新的块对角黎曼度量,该度量由Frobenius内积的拉回导出,并证明该度量在完整对称群下不变。我们开发了一种黎曼梯度下降算法,该算法使用无需调参的Gauss-Newton步长,且每次迭代的计算复杂度与观测条目数呈线性关系。在真实和合成数据集上的实验验证了我们提出的黎曼方法的有效性。

英文摘要

The elementwise Hadamard product of two low-rank matrices provides a parameter-efficient model for data with multiplicative structure, but its modeling is challenging due to the presence of additional symmetries under coupled row/column scalings between the two factors. In order to leverage the geometry of the space, we formulate the learning of such matrices as optimization on a Riemannian quotient manifold. We propose a novel block-diagonal Riemannian metric derived from the pullback of the Frobenius inner product. The metric is shown to be invariant under the full symmetry group. We develop a Riemannian gradient descent algorithm that uses a tuning-free Gauss--Newton step size and scales linearly in the number of observed entries per iteration. Experiments on real and synthetic datasets illustrate the efficacy of our proposed Riemannian approach.

2606.01215 2026-06-02 cs.CV cs.AI cs.CL cs.MM

Distilling Neuro-Symbolic Programs into 3D Multi-modal LLMs

将神经符号程序蒸馏到3D多模态大语言模型中

Wentao Mo, Yang Liu

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出APEIRIA,通过三阶段课程学习将符号推理模式蒸馏到3D多模态大语言模型中,实现透明推理与开放词汇空间推理的统一。

Comments To appear in ICML 2026

详情
AI中文摘要

当前的3D空间推理方法面临根本性权衡:神经符号3D(NS3D)概念学习器通过组合程序实现可解释推理,但受限于封闭集概念词汇和简单程序;端到端3D多模态大语言模型(3D MLLMs)能处理复杂自然语言和开放词汇概念,但缺乏显式空间验证的黑箱推理。我们提出APEIRIA,一种神经符号3D MLLM,通过将符号推理模式以自然语言思维链形式蒸馏到MLLMs中,桥接两种范式。我们的三阶段课程逐步构建推理能力:a) 3D感知对齐将物体视觉-几何特征接地到LLM,b) CoT-SFT从符号程序轨迹中教授查询分解和逐步验证,c) CoT-RL将推理模式扩展到开放集概念和深度嵌套指令。通过迁移推理模式而非概念特定知识,APEIRIA保留了NS3D的关键优点:透明推理以及规划和感知组件的模块化可互换性。在接地、问答和描述任务上的评估表明,APEIRIA超越了先前的NS3D方法,并在3D空间推理数据集上匹配最先进的3D MLLMs,统一了符号方法的系统推理与MLLMs的灵活性。代码见https://github.com/oceanflowlab/APEIRIA。

英文摘要

Current 3D spatial reasoning methods face a fundamental trade-off: neuro-symbolic 3D (NS3D) concept learners achieve interpretable reasoning through compositional programs but are constrained to closed-set concept vocabularies and simple programs; end-to-end 3D multi-modal LLMs (3D MLLMs) could handle complex natural language and open-vocabulary concepts but suffer from black-box reasoning without explicit spatial verification. We introduce APEIRIA, a neuro-symbolic 3D MLLM to bridge two paradigms by distilling symbolic reasoning patterns into MLLMs with natural language chain-of-thought. Our three-stage curriculum progressively builds reasoning capabilities: a) 3D perception alignment grounds object visual-geometric features to the LLM, b) CoT-SFT teaches query decomposition and stepwise verification from symbolic program traces, and c) CoT-RL extends reasoning patterns to open-set concepts and deeply nested instructions. By transferring reasoning patterns rather than concept-specific knowledge, APEIRIA preserves key NS3D virtues: transparent reasoning and modular interchangeability of planning and perception components. Evaluations on grounding, question answering, and captioning show that APEIRIA surpasses prior NS3D methods and matches state-of-the-art 3D MLLMs on 3D spatial reasoning datasets, unifying symbolic methods' systematic reasoning with MLLMs' flexibility. Code is available at https://github.com/oceanflowlab/APEIRIA.

2606.01213 2026-06-02 cs.CV cs.AI cs.CL

TECCI: Tricky Edits of Collected and Curated Images

TECCI:收集与策划图像的棘手编辑

Aishwarya Agrawal, Roy Hirsch, Yasumasa Onoe, Sherry Ben, Jason Baldridge

发表机构 * Google Research(谷歌研究院) Google DeepMind(谷歌DeepMind)

AI总结 提出TECCI基准,包含7550对图像与编辑指令,通过人工与自动评估揭示现有图像编辑模型在指令遵循、最小编辑和视觉质量方面的不足。

详情
AI中文摘要

尽管近期取得了巨大进展,但当前的文本引导图像编辑方法在涉及指令遵循、最小化编辑源图像以及确保高视觉质量等多个方面仍面临困难。当请求的编辑具有挑战性时,例如涉及位置、运动、视角、比例和创意编辑,这些问题尤为明显。为了系统性地测试生成式图像编辑器,我们提出了一个新的图像编辑基准——TECCI:收集与策划图像的棘手编辑。TECCI包含我们发布的全新图像集。TECCI中的图像涵盖7个图像类别。这些图像和类别经过有意策划,以针对现有方法的弱点。TECCI中的编辑指令由Gemini自动生成,每个源图像覆盖5种编辑类型。我们还策划了一组530张图像,为其创建了具有挑战性的人工编写编辑指令。总体而言,TECCI包含7550对图像和编辑指令。我们对TECCI上的五个领先图像编辑模型进行了人工评估。人类从三个维度判断输出:1)指令遵循,2)编辑的最小性,以及3)视觉质量。为了扩大评估规模,我们还使用Gemini构建了一个自动评分器,在匹配人类评估方面达到了74.7%的准确率。我们的评估揭示:1)没有一个模型的总体成功率超过22%,这显示了TECCI的挑战性;2)Nano Banana Pro是整体表现最好的模型;3)模型在指令遵循方面表现显著优于最小编辑和视觉质量;4)模型在编辑建筑和自然图像方面存在困难,这些需要较强的空间布局和复杂视觉细节理解能力;5)推理和创意编辑是最困难的,而颜色和外观编辑是最容易的。

英文摘要

Despite tremendous recent progress, current text-guided image editing methods still struggle with many aspects of editing involving instruction following, minimally editing the source image, and ensuring high visual quality. These problems are especially apparent when the requested edit is challenging, such as those that involve position, motion, viewpoint, scale and creative edits. To systematically test generative image editors, we propose a novel image editing benchmark -- TECCI: Tricky Edits of Collected and Curated Images. TECCI consists of a completely new set of images we are releasing. The images in TECCI span 7 image categories. The images and these categories were curated intentionally to target weaknesses of existing methods. The edit instructions in TECCI are automatically generated by Gemini, covering 5 edit types per source image. We also curated a set of 530 images for which we created challenging manually written edit instructions. Overall, TECCI contains 7550 pairs of images and edit instructions. We conduct human evaluations of five leading image editing models on TECCI. Humans judge outputs along three dimensions: 1) instruction following, 2) minimality of the edits, and 3) visual quality. To scale up the evaluation, we also build an auto-rater using Gemini that achieves 74.7% accuracy in matching human evaluations. Our evaluations reveal that: 1) none of the models exceeds a 22% overall success rate, demonstrating the challenging nature of TECCI, 2) Nano Banana Pro is the best performing model overall, 3) models perform significantly better at instruction following compared to minimal edits and visual quality, 4) models struggle with editing architecture and nature images which require strong understanding of spatial layout and intricate visual details, and 5) reasoning and creative edits are the most difficult, whereas color and appearance edits are the easiest.

2606.01207 2026-06-02 cs.CV cs.LG

Feature Alignment Determines Fusion Strategy: A Comparative Study of Cross-Attention and Concatenation in Multimodal Learning

特征对齐决定融合策略:多模态学习中交叉注意力与拼接的比较研究

Zhiqiang Zhou, Xuezhen Xie

发表机构 * Hunan Chemical Industry Vocational and Technical College(湖南化学工业职业技术学院)

AI总结 通过实验和理论分析,证明特征对齐质量而非数据规模是决定多模态融合策略优劣的关键因素,当特征预对齐时拼接优于交叉注意力。

Comments 8 pages, 6 figures, 4 tables

详情
AI中文摘要

在多模态融合中,交叉注意力与拼接的选择仍由实践者直觉而非原理性理解主导。本文通过使用两个特征提取骨干(ResNet18和CLIP ViT-B/32)在Flickr8k上的控制实验,证明特征对齐质量(而非仅数据规模)是决定哪种融合策略更优的主要因素。当特征通过视觉语言预训练目标预对齐时,在所有测试规模(2048-16384样本)下,拼接比交叉注意力高出4.1-5.1个百分点。我们提供了基于样本复杂度分析的理论解释:拼接需要O(d_v + d_t)个样本来学习其融合投影,而交叉注意力需要O(d_v * d_t)个样本来学习双线性注意力权重,对于512维CLIP特征,后者是前者的256倍以上。当特征已经对齐时,两种方法的近似误差差距消失,拼接的样本效率在所有实际数据集规模上占优。对齐退化研究证实了单调趋势:随着特征对齐退化,拼接的优势从1.3%增长到2.8%。这些发现为多模态系统中的融合方法选择提供了原理性决策框架,对多模态大语言模型的设计具有直接影响。

英文摘要

The choice between cross-attention and concatenation for multimodal fusion remains governed by practitioner intuition rather than principled understanding. In this paper, we demonstrate that feature alignment quality, not data scale alone, is the primary determinant of which fusion strategy excels. Through controlled experiments on Flickr8k using two feature extraction backbones (ResNet18 and CLIP ViT-B/32), we show that concatenation outperforms cross-attention by 4.1-5.1 percentage points across all tested scales (2048-16384 samples) when features are pre-aligned by a vision-language pretraining objective. We provide a theoretical explanation grounded in sample complexity analysis: concatenation requires O(d_v + d_t) samples to learn its fusion projection, while cross-attention requires O(d_v * d_t) samples to learn bilinear attention weights, over 256 times as many for 512-dimensional CLIP features. When features are already aligned, the approximation error gap between the two methods vanishes, and concatenation's sample efficiency dominates at all practical dataset sizes. An alignment degradation study confirms a monotonic trend: as feature alignment degrades, concatenation's advantage grows from 1.3% to 2.8%. These findings provide a principled decision framework for fusion method selection in multimodal systems, with direct implications for the design of Multimodal Large Language Models.
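The "256 times" figure follows directly from the two stated orders of growth. The short calculation below plugs in d_v = d_t = 512 and ignores constants hidden by the O(·) notation; the function name is illustrative.

```python
def fusion_sample_orders(d_v, d_t):
    """Return (additive order, bilinear order, ratio) for the two fusion strategies."""
    additive = d_v + d_t      # concatenation: O(d_v + d_t)
    bilinear = d_v * d_t      # cross-attention: O(d_v * d_t)
    return additive, bilinear, bilinear / additive

if __name__ == "__main__":
    print(fusion_sample_orders(512, 512))
```

For equal dimensions d the ratio is d²/(2d) = d/2, so the gap widens linearly with feature width.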

2606.01204 2026-06-02 cs.CL cs.AI cs.CY

Implicit Geographic Inference in LLM Medical Triage: Language-Driven Disparities in Emergency Recommendations

LLM医疗分诊中的隐式地理推断:语言驱动的急诊推荐差异

Qi Han Wong

发表机构 * GitHub

AI总结 研究大型语言模型在相同症状下,仅因患者提示语言不同而产生不同的医疗分诊推荐,发现模型根据输入语言隐式推断地理位置,导致急诊推荐率差异显著。

Comments 7 pages, 4 tables. Code and data at https://github.com/wongqihan/ai-behavioral-experiments

详情
AI中文摘要

我们研究大型语言模型是否仅根据患者提示的语言,对相同症状产生不同的医疗分诊推荐。使用Gemini 3.5 Flash,我们评估了六种语言(英语、西班牙语、中文、印地语、日语、阿拉伯语)下的神经症状特征(持续性头痛、视力模糊、恶心),每种条件运行30次(共450次API调用)。我们发现,尽管模型在所有语言中分配的严重程度评分几乎相同(7.7-8.0/10),但急诊室就诊推荐率从0%(日语、印地语)到30%(英语、阿拉伯语)不等。添加一句指定患者位于美国的句子,非英语提示的急诊推荐率最多增加76.7个百分点,而反向锚定(英语提示加上东京地点)将急诊率从30%降至6.7%。回译控制(日语到英语)产生的急诊率与英语基线相当,证实差异并非由翻译质量引起,而是由输入语言的隐式地理推断所致。我们发布了完整的数据集、实验代码和结果。

英文摘要

We investigate whether large language models produce different medical triage recommendations for identical symptoms based solely on the language of the patient prompt. Using Gemini 3.5 Flash, we evaluate a neurological symptom profile (persistent headache, blurred vision, nausea) across six languages (English, Spanish, Chinese, Hindi, Japanese, Arabic) with 30 runs per condition (n=450 total API calls). We find that the model recommends emergency room visits at rates ranging from 0% (Japanese, Hindi) to 30% (English, Arabic), despite assigning nearly identical severity scores (7.7-8.0/10) across all languages. Adding a single sentence specifying the patient's US location increases ER recommendations by up to 76.7 percentage points for non-English prompts, while the reverse anchor (English prompt with a Tokyo location) reduces the ER rate from 30% to 6.7%. A back-translation control (Japanese to English) produces ER rates comparable to the English baseline, confirming that the disparity is not caused by translation quality but by implicit geographic inference from the input language. We release the complete dataset, experiment code, and results.
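The reported percentages are consistent with whole-number counts out of 30 runs, which helps when reading the effect sizes. The sketch below back-computes those counts; the counts themselves are inferred from the rounded rates and are not reported in the abstract, and `count_for_rate` is an illustrative helper.

```python
RUNS_PER_CONDITION = 30
TOTAL_CALLS = 450

def count_for_rate(percent, runs=RUNS_PER_CONDITION):
    """Return the count k (0..runs) whose rate rounds to `percent`, or None."""
    for k in range(runs + 1):
        if round(100.0 * k / runs, 1) == percent:
            return k
    return None

if __name__ == "__main__":
    print("conditions:", TOTAL_CALLS // RUNS_PER_CONDITION)
    for rate in (0.0, 6.7, 30.0, 76.7):
        print(rate, "->", count_for_rate(rate), "of", RUNS_PER_CONDITION)
```

With 30 runs each step is about 3.3 percentage points, so a shift such as 30% to 6.7% corresponds to a change of seven recommendations; 450 calls at 30 runs each imply 15 conditions, of which the six languages account for only the baseline set.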

2606.01202 2026-06-02 cs.AI cs.CL cs.LG

The Shape of Wisdom: Decision Trajectories in Language Models

智慧的形状:语言模型中的决策轨迹

Shailesh Rana

发表机构 * Independent Researcher(独立研究者)

AI总结 本文通过分析三种语言模型在MMLU上的9000条轨迹,提出用答案边际、边际变化和决策翻转距离描述轨迹,发现正确性与稳定性不同,并探究了注意力与MLP标量对边际的影响。

Comments 6 pages, 5 figures. Code and derived artifacts: https://github.com/gut-puncture/The-Shape-of-Wisdom

详情
AI中文摘要

语言模型并非简单地在输出层选择一个答案。在一项包含9000条轨迹的MMLU研究中,涉及Qwen2.5-7B-Instruct、Llama-3.1-8B-Instruct和Mistral-7B-Instruct-v0.3,答案的分数在深度上以结构化方式移动。我们用三个量描述每条轨迹:当前答案边际、该边际的下一层变化,以及距离决策翻转的距离。主要经验图景是正确性和稳定性是不同的:最大的群体是不稳定-正确的,而不是稳定-正确的。然后,一个追踪的子集询问是什么推动了边际。在稳定-正确的情况下,平均注意力标量指向正确的方向,而平均MLP标量则不然;跨度删除显示,移除支持答案的文本会损害边际,而移除类似干扰项的文本则有助于边际。结果并非完整的电路解释。它是一种可重复的方式,用于查看哪些答案已确定,哪些仍然脆弱,以及哪些测量来源推动了它们。

英文摘要

Language models do not simply choose an answer at the output layer. In a 9,000-trajectory MMLU study across Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, and Mistral-7B-Instruct-v0.3, the score of the answer moves across depth in structured ways. We describe each trajectory with three quantities: the current answer margin, the next-layer change in that margin, and the distance from a decision flip. The main empirical picture is that correctness and stability are different: the largest group is unstable-correct, not stable-correct. A traced subset then asks what moves the margin. In stable-correct cases, the average attention scalar points in the correct direction, while the average MLP scalar does not; span deletion shows that removing answer-supporting text hurts the margin and removing distractor-like text helps it. The result is not a full circuit explanation. It is a reproducible way to see which answers are settled, which remain fragile, and which measured sources move them.
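The first two trajectory quantities can be written down in a few lines. The sketch below assumes one plausible definition: the margin at a layer is the correct option's score minus the best competing option's score, and the change is the difference between consecutive layers. The paper's exact definitions, including its "distance from a decision flip", are not given in the abstract, so the flip count used here is a stand-in, and all names are hypothetical.

```python
def margins(layer_scores, correct):
    """Per-layer margin: score of the correct option minus the best other option."""
    out = []
    for scores in layer_scores:
        best_other = max(s for i, s in enumerate(scores) if i != correct)
        out.append(scores[correct] - best_other)
    return out

def deltas(m):
    """Next-layer change in margin."""
    return [b - a for a, b in zip(m, m[1:])]

def sign_flips(m):
    """Number of times the leading answer switches between correct and incorrect."""
    return sum(1 for a, b in zip(m, m[1:]) if (a > 0) != (b > 0))

if __name__ == "__main__":
    # Hypothetical scores for a 4-option question at four depths; option 0 is correct.
    traj = [
        [0.1, 0.3, 0.0, 0.0],
        [0.5, 0.2, 0.0, 0.0],
        [0.2, 0.4, 0.0, 0.0],
        [0.9, 0.1, 0.0, 0.0],
    ]
    m = margins(traj, correct=0)
    print(m, deltas(m), sign_flips(m))
```

A trajectory like this one ends with a positive margin but flips several times on the way, which is the kind of case the abstract would call unstable-correct.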

2606.01199 2026-06-02 cs.AI

Can LLM Agents Sustain Long-Horizon Organizational Dynamics?

LLM智能体能否维持长期组织动态?

Xuancheng Zhu, Yang Yue, Shuaibing Wan, Zihan Dou, Xiaohan Zhang, Yongrui Liu, Guoshun Nan

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出TaskWeave分层智能体框架,通过记忆中心的协调机制(规划-分解-诊断-对齐循环和依赖感知追踪记忆)实现长期组织模拟,实验表明该框架能维持连贯的组织动态并产生可靠的人工制品。

详情
AI中文摘要

大型语言智能体越来越多地用于社会模拟,但尚不清楚它们能否在结构化组织中维持连贯行为,其中目标必须通过层级传播,任务依赖于先前执行,并且人工制品在长期范围内积累。我们将长期组织模拟定义为以记忆为中心的协调问题,并引入TaskWeave,这是一个分层智能体框架,通过制定-分解-诊断-对齐循环维护规划状态,并通过依赖感知追踪记忆来接地执行。我们在一个为期一年的IT公司模拟中评估TaskWeave,并将其与其他多智能体框架在组织连贯性、执行接地和下游企业NLP效用方面进行比较。实验表明,TaskWeave支持连贯且长期的组织动态,同时产生接地的人工制品并适应外部环境。这些发现表明,结构化模拟记忆是构建可靠的基于LLM的组织模拟器的关键机制。

英文摘要

Large language agents are increasingly used for social simulation, yet it remains unclear whether they can sustain coherent behavior in structured organizations, where goals must propagate through hierarchy, tasks depend on prior execution, and artifacts accumulate over long horizons. We formulate long-horizon organizational simulation as a memory-centered coordination problem and introduce TaskWeave, a hierarchical agentic framework that maintains planning states through a Formulate-Partition-Diagnose-Align cycle and grounds execution through dependency-aware trace memory. We evaluate TaskWeave in a year-long IT company simulation and compare it with other multi-agent frameworks on organizational coherence, execution grounding, and downstream enterprise NLP utility. Experiments show that TaskWeave supports coherent and long-horizon organizational dynamics while producing grounded artifacts and adapting to external environments. These findings suggest that structured simulation memory is a key mechanism for building reliable LLM-based organizational simulators.

2606.01198 2026-06-02 cs.LG

Linear Strategic Classification with Endogenous Improvements

具有内生改进的线性战略分类

Siddharth Shrivastava, Mahvith Akshintala, B Vamsha Vardhan Reddy, Naresh Manwani, Sujit Gujar, Ganesh Ghalme

发表机构 * Department of Artificial Intelligence(人工智能系) IIT Hyderabad(印度理工学院海得拉巴分校) IIIT Hyderabad(海得拉巴国际信息技术学院)

AI总结 研究智能体通过修改特征响应分类器时,能产生真实结果改进的战略分类问题,提出线性分类器下的最优决策边界平移方法,并给出PAC保证和实用算法。

详情
AI中文摘要

战略分类研究智能体通过以成本修改可观察特征来响应已部署分类器的设置。经典模型通常将此类响应视为装饰性的:特征可能改变,但真实标签保持不变。我们研究了一种考虑改进的变体,其中战略响应可以引起结果相关特征的真正变化。智能体策略性地选择部署后的特征向量,然后根据一个稳定的条件结果分布生成标签,该分布保留了特征与结果之间的关系。我们在单指标资格模型和线性可分解成本下形式化了线性分类器的这一问题。我们证明,战略最优分类器是通过贝叶斯最优决策边界的平行移动获得的,并且它比贝叶斯分类器为改进感知目标提供了更好的代理。由于改进感知学习需要部署后的标签,而这些标签通常在部署前不可用,我们在预言机模型下提供了PAC风格的保证,提出了一种实用的插件算法,建立了其泛化界,并在合成和真实数据集上进行了评估。

英文摘要

Strategic classification studies settings in which agents respond to a deployed classifier by modifying observable features at a cost. Classical models typically treat such responses as cosmetic: features may change, but true labels remain fixed. We study an improvement-aware variant in which strategic responses can induce genuine changes in outcome-relevant features. Agents choose post-deployment feature vectors strategically, and labels are then generated according to a stable conditional outcome law that preserves the relationship between features and outcomes. We formalize this problem for linear classifiers under a single-index qualification model and linear-decomposable costs. We show that the strategic-optimal classifier is obtained by a parallel shift of the Bayes-optimal decision boundary, and that it provides a better surrogate for the improvement-aware objective than the Bayes classifier. Since improvement-aware learning requires post-deployment labels, which are typically unavailable before deployment, we provide PAC-style guarantees under an oracle model, propose a practical plug-in algorithm, establish its generalization bound, and evaluate it on synthetic and real-world datasets.
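The agent side of this setting can be illustrated with a one-dimensional toy model. The sketch assumes each agent is summarized by a scalar score (for example w·x), the classifier accepts scores at or above a threshold, acceptance is worth a fixed gain, and raising the score costs `unit_cost` per unit. These are simplifying assumptions for illustration, not the paper's single-index model or cost structure.

```python
def best_response(score, threshold, unit_cost, gain=1.0):
    """Score an agent ends up with after responding to the deployed threshold."""
    gap = threshold - score
    if gap <= 0.0:
        return score          # already accepted: no reason to move
    if unit_cost * gap <= gain:
        return threshold      # moving is worth it: go exactly to the boundary
    return score              # too expensive: stay put

def accepted(score, threshold, unit_cost):
    """Whether the agent is accepted after its best response."""
    return best_response(score, threshold, unit_cost) >= threshold

if __name__ == "__main__":
    for s in (0.4, 0.6, 1.3):
        print(s, "->", best_response(s, threshold=1.0, unit_cost=2.0))
```

In this toy model every agent within gain / unit_cost of the threshold moves up to it, so the set of accepted agents is itself a shifted threshold set. Whether the learner should shift its boundary, and in which direction, depends on whether those moves are cosmetic or genuine improvements, which is the distinction the abstract draws.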