arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.15834 2026-06-16 cs.AI cs.CR cs.SY eess.SY New submission

AIChilles: Automatically Uncovering Hidden Weaknesses in AI-Evolved Systems

Yajie Zhou, Ao Li, Ashwin Silla, Zaoxing Liu, Vyas Sekar

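Affiliations: Carnegie Mellon University; Boston University

AI summary: Proposes the AIChilles framework, which combines deterministic workload-parameter extraction, agent-based constraint inference, differential oracles, and code-frequency coverage to automatically find regressions of AI-evolved programs relative to baseline programs in correctness, runtime, memory usage, or output quality. It uncovers 49 hidden weaknesses across 5 system applications and 30 AI-evolved programs.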

Abstract

The computer systems community has recently seen growing interest in AI-driven system evolution, where AI agents iteratively rewrite systems. Frameworks such as AdaEvolve and Engram report 12-60% score improvements over human-designed algorithms. While these results are promising, there are practical concerns that these AI-evolved programs may perform worse on unseen workloads and exhibit scalability regressions. Given the speed and scale of AI-generated code, we need automated mechanisms to uncover such hidden weaknesses in AI-evolved systems programs. To this end, we develop AIChilles. Taking as input a baseline program $P$ and an AI-evolved program $P'$, AIChilles searches for valid workloads where $P'$ regresses relative to $P$ in correctness, runtime, memory usage, or output quality. To tackle the diversity in system applications, weakness types, and potential bugs, AIChilles combines deterministic workload-parameter extraction, agent-based constraint inference, differential oracles, and code-frequency coverage to discover diverse failures. Across five system applications and 30 AI-evolved programs, AIChilles finds 49 distinct hidden weaknesses. We also show that explicitly including AIChilles in the AI-driven development lifecycle can mitigate several of these weaknesses.
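The differential-oracle idea in this abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy baseline and "evolved" programs, the `quality` score, and the `find_regressions` helper are all hypothetical, and only output quality is compared (runtime and memory checks are left out so the example stays deterministic).

```python
from typing import Callable, Iterable, List, Tuple

Workload = List[int]


def baseline_program(items: Workload) -> List[int]:
    """Toy baseline P: returns the three largest items."""
    return sorted(items, reverse=True)[:3]


def evolved_program(items: Workload) -> List[int]:
    """Toy evolved P': a variant that silently assumes distinct items."""
    return sorted(set(items), reverse=True)[:3]


def quality(output: List[int]) -> int:
    """Hypothetical output-quality score: higher is better."""
    return sum(output)


def find_regressions(
    p: Callable[[Workload], List[int]],
    p_prime: Callable[[Workload], List[int]],
    workloads: Iterable[Workload],
) -> List[Tuple[Workload, int, int]]:
    """Return the workloads on which P' scores strictly worse than P."""
    regressions = []
    for w in workloads:
        base_q, evolved_q = quality(p(w)), quality(p_prime(w))
        if evolved_q < base_q:
            regressions.append((w, base_q, evolved_q))
    return regressions


workloads = [[1, 2, 3, 4], [5, 5, 5, 1], [9, 9, 2, 1]]
regressions = find_regressions(baseline_program, evolved_program, workloads)
```

On the first workload both programs agree; the two workloads with duplicate items expose the hidden assumption in the toy evolved program, which is the kind of unseen-workload regression the abstract describes.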

2606.15833 2026-06-16 cs.CL New submission

When Correct Edges Cannot Be Verified: A Provenance Gap in Incomplete KGQA and a Provenance-Favoring Completion Policy

Yongqi Kang, Yu Fu, Yong Zhao

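Affiliations: University of Science and Technology of China

AI summary: Examines the verifiability of completed edges in incomplete knowledge graph question answering, finds that 76-96% of correct edges lack textual support, and proposes TGComplete, a provenance-favoring policy that markedly improves edge precision and strict faithfulness while preserving answer quality.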

Abstract

Incomplete Knowledge Graph Question Answering (IKGQA) requires completing missing edges to continue reasoning. A growing line of work verifies completed edges against retrieved text, treating textual support as a proxy for edge quality. We ask a question that, to our knowledge, has not been systematically tested: does textual verifiability actually track correctness? Exploiting the gold deleted triples provided by the standard random-deletion protocol, we measure both. The finding is counterintuitive: among gold-correct completed edges, 76-96% have no supporting passage even under exhaustive retrieval, robustly across deletion rates (20%/40%), datasets (CWQ/WebQSP), and relation types (structural, commonsense, long-tail). Most Freebase-style facts simply do not occur as head-tail co-mentions in text. Textual faithfulness therefore measures provenance, not correctness -- separated by a paradigm-level gap no in-corpus retrieval closes. This reframes edge completion. Since most completed edges -- correct or not -- are causally redundant for the answer (95-97% of correct answers do not depend on any unsupported edge), the central question shifts from "is the edge correct?" to "admit or abstain under provenance uncertainty?" Within this framing we present TGComplete, a provenance-favoring admission policy that retrieves evidence at a reasoning breakpoint, verifies a candidate through a lightweight loop, and abstains when support is absent. Against the generate-to-complete baseline GoG, it attains higher edge precision against gold (15-21% vs 3-14%), with no statistically detectable EM loss and 3.1-7.4 times higher strict faithfulness of admitted edges -- at the cost of lower recall. We position TGComplete not as uniformly better, but as a principled point on a precision/provenance-recall trade-off, appropriate when auditability matters.
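The admit-or-abstain framing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: support is approximated as a case-insensitive head-tail co-mention in a passage, and the function names and the tiny corpus are hypothetical rather than taken from TGComplete.

```python
from typing import List, Optional, Tuple

Edge = Tuple[str, str, str]  # (head, relation, tail)


def supporting_passage(edge: Edge, passages: List[str]) -> Optional[str]:
    """Return the first passage that mentions both head and tail, if any."""
    head, _, tail = edge
    for passage in passages:
        text = passage.lower()
        if head.lower() in text and tail.lower() in text:
            return passage
    return None


def admit_or_abstain(edge: Edge, passages: List[str]) -> str:
    """Provenance-favoring rule: admit only edges with textual support."""
    return "admit" if supporting_passage(edge, passages) else "abstain"


corpus = [
    "Marie Curie was born in Warsaw in 1867.",
    "Pierre Curie shared the 1903 Nobel Prize in Physics.",
]
supported = admit_or_abstain(("Marie Curie", "place_of_birth", "Warsaw"), corpus)
unsupported = admit_or_abstain(("Marie Curie", "spouse", "Pierre Curie"), corpus)
```

The second edge is factually correct but has no co-mention in this corpus, so the policy abstains. That is a small instance of the gap between provenance and correctness that the abstract reports.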

2606.15831 2026-06-16 cs.AI cs.LG cs.NE cs.SY eess.SY New submission

An Integrated System for Real-Time Student Assessment and Career Guidance Using Neural Networks in Computing Disciplines

Sakir Hossain Faruque, Md. Jubair Hossain, Sharun Akter Khushbu

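Affiliations: Daffodil International University; Barishal Engineering College

AI summary: To help computing students who struggle to choose a career path, proposes an AI-driven system that integrates a career guidance expert system with a web-based assessment platform; its multilayer perceptron model reaches 94.71% validation accuracy in career-path prediction.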

Comments 25 pages, 24 figures

Abstract

Many undergraduate students in Computer Science (CS) and Software Engineering (SWE) struggle to identify suitable career paths, particularly when their academic performance, abilities, and interests do not fully align. To address this issue, this study proposes an AI-driven Student Assessment and Career Prediction System that integrates a Career Guidance Expert (CGE) system with a Web-Based Student Assessment (WBSA) platform. Within the integrated framework, CGE enhances personalized career recommendations using AI while also assisting students after graduation in identifying suitable jobs, research domains, and higher study opportunities aligned with their skills and interests. The WBSA platform further strengthens interaction between students and faculty through assessments, personalized tasks, mentorship activities, and a secure real-time chat application. The CGE system employs a Multilayer Perceptron (MLP) model trained on real-world academic and extracurricular data collected from university students using the snowball sampling method, achieving a validation accuracy of 94.71% in predicting personalized career paths. A pre-survey was conducted across universities to evaluate the proposed model before deployment. The WBSA system was developed as a modern web application using technologies such as Node.js, Next.js, and PostgreSQL to ensure scalability, responsiveness, and secure data management. The overall system is supported by a secure cloud-based infrastructure, and the platform provides reliable performance while helping graduates select a suitable career path in the IT sector. In addition, a post-survey involving both students and faculty was conducted to gather feedback and further improve the overall effectiveness and usability of the system.

2606.15822 2026-06-16 cs.AI cs.CR New submission

TrustedARI: Towards Trust-Native Agentic Routing Infrastructure for Agentic AI

Qi Li, Zhenhua Zou, Shuo Li, Mingwei Xu, Zhuotao Liu

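Affiliations: Tsinghua University

AI summary: Addresses the trust risks of Agentic Routing Infrastructure (ARI), where queries and responses are accessed in plaintext and routing integrity cannot be verified. Proposes TrustedARI, which achieves trust-native routing through a three-party TLS handshake, privacy-preserving query construction, and a verifiable billing protocol; experiments show it is efficient and requires no changes to service providers.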

Abstract

AI agents increasingly access external models, tools, and services through Agentic Routing Infrastructure (ARI) to manage the overhead of heterogeneous interfaces and fragmented subscriptions. Yet, the architecture of ARI introduces fundamental trust risks: it obtains plaintext access to agent queries and service responses, while leaving agents unable to verify that their queries are routed to intended service providers or that requests and responses remain untampered. To address this problem, we present TrustedARI, the first trust-native agentic routing infrastructure for agentic AI. Architecturally, TrustedARI is built upon three core innovations: (i) an ARI-adapted three-party TLS handshake that enables the agent and ARI to jointly authenticate the service provider through role-specific distribution of TLS key materials; (ii) a privacy-preserving query-construction protocol that allows the agent and ARI to collaboratively construct well-formed queries without exposing their respective private inputs; and (iii) a verifiable billing protocol that supports fair usage-based settlement while preserving the integrity and confidentiality of service responses. We implemented and extensively evaluated a prototype of TrustedARI to validate its performance. Experiments confirm that TrustedARI is highly efficient: our ARI-adapted handshake protocol reduces communication overhead by 39.34% compared to the existing three-party TLS handshake. Furthermore, the privacy-preserving query-construction protocol imposes negligible overhead (averaging 0.19 seconds in computation time and 0.58 MB in communication cost), while the verifiable billing protocol speeds up proof generation by 28.20x. Crucially, TrustedARI is readily deployable without any modification to the service providers.

2606.15821 2026-06-16 cs.CL cs.AI cs.LG New submission

The Truth Stays in the Family: Enhancing Contextual Grounding via Inherited Truthful Heads in Model Lineages

Miso Choi, Seonga Choi, Mincheol Kwon, Woosung Joung, Jinkyu Kim, Jungbeom Lee

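Affiliations: University of Science and Technology of China

AI summary: Finds that context-truthfulness scores are strongly inherited between base LLMs and their downstream variants, and proposes TruthProbe, a soft-gating strategy that amplifies truthful heads to improve contextual truthfulness and reduce multimodal hallucination.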

Comments Accepted at ICML 2026

Abstract

Recent advances in large language models (LLMs) have produced many specialized multimodal LLMs (MLLMs) that share common foundational LLMs, forming distinct model lineages. It remains unclear whether a fundamental behavioral link exists between the foundational LLMs and downstream variants. We investigate this question by quantifying head-level context-truthfulness scores. Across diverse LLM and MLLM lineages, including Vicuna-, Qwen2.5-, LLaMA2-, and Mistral-based models, we find that Truth Scores are strongly preserved within model families, even after instruction tuning or multimodal adaptation. We further show that this inheritance is consistent with attention-head weight preservation, and that context-truthful heads attend to query-relevant evidence. Building on this finding, we propose TruthProbe, a soft-gating strategy that amplifies context-truthful heads while preserving other head contributions. TruthProbe improves contextual truthfulness on HaluEval and reduces multimodal hallucination on POPE and CHAIR, with base-LLM Truth Scores transferring effectively to their fine-tuned LLM and MLLM descendants. Code is available at https://github.com/miso-choi/TruthProbe.
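A rough Python sketch may help picture what "soft gating" of attention heads means. The gating rule below (scale heads whose Truth Score exceeds a threshold by `1 + alpha`, leave the rest untouched) is an assumption made for illustration; the abstract does not give TruthProbe's actual formula, and the names `soft_gate`, `alpha`, and `threshold` are hypothetical.

```python
from typing import List


def soft_gate(
    head_outputs: List[List[float]],
    truth_scores: List[float],
    alpha: float = 0.5,
    threshold: float = 0.7,
) -> List[float]:
    """Sum head outputs, scaling heads at or above `threshold` by (1 + alpha)."""
    dim = len(head_outputs[0])
    combined = [0.0] * dim
    for output, score in zip(head_outputs, truth_scores):
        # Assumed rule: amplify context-truthful heads, keep the others as-is.
        gate = 1.0 + alpha if score >= threshold else 1.0
        for i in range(dim):
            combined[i] += gate * output[i]
    return combined


heads = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
scores = [0.9, 0.2, 0.5]
mixed = soft_gate(heads, scores)
```

Unlike hard masking, no head is removed: the low-scoring heads still contribute with weight 1, which matches the abstract's description of preserving other head contributions.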

2606.15819 2026-06-16 cs.CV cs.AI New submission

SACE: Concept Erasure at the Semantic Singularity in Visual Autoregressive Models

Siya Yang, Nanxiang Jiang, Zhaoxin Fan, Yunfeng Diao

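Affiliations: School of Computer Science and Engineering, Sun Yat-sen University; School of Computer Science and Technology, Beijing Institute of Technology

AI summary: Applying existing erasure techniques to visual autoregressive models causes semantic collapse and visual artifacts. The paper proposes the Semantic Singularity Axiom, validates it with Incremental Semantic Saliency Analysis, and introduces SACE, the first scale-aware concept erasure framework, which couples an entropy-regularized erasure objective with a restorative preservation loss at the first scale to achieve precise concept erasure.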

Abstract

The rapid progress of visual autoregressive (VAR) models has unlocked a transformative frontier for high-fidelity text-to-image synthesis, while heightening concerns over the safety alignment of generated content. Naive application of existing erasure techniques to VAR models causes catastrophic semantic collapse and visual artifacts, since they are predominantly designed for the homogeneous denoising steps of diffusion models. To address this foundational challenge, we first propose the Semantic Singularity Axiom, which posits that any target semantic concept embedded within a prompt is definitively locked at Scale-0. We then rigorously validate this axiom through our proposed Incremental Semantic Saliency Analysis (ISSA), which also enables the community to transparently inspect the coarse-to-fine semantic injection process. Guided by this insight, we introduce the first scale-aware concept erasure framework (SACE) for VAR models. By strictly confining interventions to the first scale, our approach couples an Entropy-Regularized Erasure Objective to prevent high-entropy sampling degeneration, alongside a restorative preservation loss to safely anchor the integrity of entangled benign priors. Extensive experiments demonstrate that our method achieves surgical concept erasure performance across various domains with minimal training overhead, offering a timely and elegant resolution of the critical safety vulnerabilities inherent in emerging VAR architectures. Code is available at: https://github.com/limerenceysy/SACE.

2606.15815 2026-06-16 cs.CL New submission

On Defining Erasure Harms for NLP

Yu Lu Liu, Arnav Goel, Jackie Chi Kit Cheung, Alexandra Olteanu, Ziang Xiao, Su Lin Blodgett

Affiliations: Johns Hopkins University; Carnegie Mellon University; Mila – Québec Artificial Intelligence Institute; McGill University; Canada CIFAR AI Chair, Mila

AI summary: To address the vague notion of erasure harms in deployed NLP systems, proposes a structured definition that spells out the components needed to establish and measure erasure, easing application across settings.

Abstract

The deployment of NLP systems has raised concerns about harms they might produce, including representational harms. Recent literature has begun to conceptualize and measure one such harm, the harm of erasure. Nevertheless, the field lacks a clear and cohesive conceptual foundation for identifying and measuring erasure. Existing conceptualizations of erasure are often broad -- making it difficult to identify what is needed to establish and measure erasure -- or else specific to particular settings -- facilitating measurement for those settings but potentially challenging to adapt to other settings. To address this gap, we develop and propose a structured definition of erasure that clarifies what components are necessary for establishing whether erasure has occurred, which practitioners need to explicitly articulate and operationalize in order to measure erasure.

2606.15812 2026-06-16 cs.LG New submission

Brownian Kernel Ladders

Mahdi Mohammadigohari, Giuseppe Di Fatta, Giuseppe Nicosia, Panos M Pardalos

Affiliations: Faculty of Engineering, Free University of Bozen-Bolzano; Department of Biomedical and Biotechnological Sciences, University of Catania; Center for Applied Optimization, Department of Industrial and Systems Engineering, University of Florida

AI summary: Introduces Brownian kernel ladders (BKLs), a recursive hierarchy of function spaces built through Brownian-kernel integrals, proves that they are quasi-Banach spaces with depth-dependent Hölder regularity, and offers a tractable framework for compositional representations in deep learning.

Comments Submitted to JMLR

Abstract

Constructing mathematically tractable function spaces that capture hierarchical compositional representations remains a central challenge in statistical learning theory. We introduce Brownian kernel ladders (BKLs), a recursively defined hierarchy of integral reproducing kernel Hilbert spaces generated through Brownian-kernel integral constructions. Starting from linear functionals, each layer is obtained by integrating Brownian kernels over probability measures supported on subsets of the previous layer, yielding a recursive function-space model in which depth is encoded directly through the hierarchy. Based on this framework, we define canonical BKL spaces together with an associated complexity functional. We establish several analytical and statistical properties of these spaces. In particular, we show that BKL spaces form quasi-Banach spaces, satisfy depth-dependent Hölder regularity estimates, and exhibit strict monotonicity with respect to depth. We further prove existence results for regularized empirical risk minimization and derive Gaussian complexity bounds that remain uniformly controlled with respect to both the ambient dimension and the hierarchy depth. A key ingredient of the analysis is a combinatorial proof technique based on recursive subset decompositions and Brownian-kernel threshold representations. These estimates yield excess-risk guarantees of near-parametric order for regularized empirical risk minimization over BKL spaces. Our results provide a mathematically tractable hierarchical function-space framework for studying compositional representations in deep learning.
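As a small illustration of the basic building block, the Python sketch below evaluates a Brownian kernel and a function obtained by integrating it against a discrete probability measure. It assumes the common form k(x, y) = (||x|| + ||y|| - ||x - y||) / 2, which the abstract does not spell out, and it does not attempt the paper's recursive layer construction; `mixture_function` is a hypothetical helper.

```python
import math
from typing import List, Sequence


def norm(x: Sequence[float]) -> float:
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v * v for v in x))


def brownian_kernel(x: Sequence[float], y: Sequence[float]) -> float:
    """Assumed form: k(x, y) = (||x|| + ||y|| - ||x - y||) / 2."""
    diff = [a - b for a, b in zip(x, y)]
    return 0.5 * (norm(x) + norm(y) - norm(diff))


def mixture_function(
    x: Sequence[float], centers: List[Sequence[float]], weights: List[float]
) -> float:
    """f(x) = sum_i w_i k(x, c_i): the kernel integrated against a discrete measure."""
    return sum(w * brownian_kernel(x, c) for c, w in zip(centers, weights))


k_1d = brownian_kernel([3.0], [5.0])
k_diag = brownian_kernel([3.0, 4.0], [3.0, 4.0])
f_val = mixture_function([4.0], [[1.0], [2.0]], [0.5, 0.5])
```

In one dimension with non-negative arguments this kernel reduces to min(x, y), the covariance of standard Brownian motion, which is where the name comes from.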

2606.15807 2026-06-16 cs.LG cs.AI New submission

Continuous Cross-Domain Traffic State Prediction via Memory-Augmented Graph Liquid Time-Constant Networks

Jinrong Xiang, Ming Xu

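Affiliations: Software College, Liaoning Technical University

AI summary: Proposes the Memory-Augmented Graph Liquid Time-Constant Network (MA-GLTC), which uses spatio-temporal unit decomposition, graph liquid time-constant dynamics, and a memory-based transfer storage mechanism for continuous-time cross-domain traffic state prediction, outperforming existing methods on five datasets.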

Abstract

Traffic state prediction is a fundamental task in intelligent transportation systems. In practical applications, some regions suffer from limited traffic observations due to insufficient sensing infrastructure, making cross-domain knowledge transfer an important solution for data-scarce traffic prediction. However, existing cross-domain traffic prediction methods still face several limitations, including coarse-grained source-target adaptation, limited capability in handling unseen target-domain patterns, and insufficient modeling of continuous traffic dynamics under irregular or heterogeneous temporal conditions. To address these issues, this paper proposes a continuous cross-domain traffic prediction framework, termed Memory-Augmented Graph Liquid Time-Constant Network (MA-GLTC). Specifically, we first construct spatio-temporal units (STUs) to decompose traffic networks into transferable local units, enabling fine-grained knowledge alignment across domains. Then, a graph liquid time-constant network (GLTC) is developed to model graph-coupled traffic evolution in continuous time. Different from generic graph neural ODE-based models, GLTC introduces graph-coupled recurrent conductance into liquid time-constant dynamics, allowing node states to evolve with leakage, adaptive time constants, and neighborhood-aware feedback. Furthermore, a Memory-based Transfer Storage (MTS) mechanism is designed to preserve source-domain knowledge, retrieve matched traffic patterns, and update reliable target-domain patterns when unseen states emerge. Experiments on five public traffic datasets demonstrate that MA-GLTC consistently outperforms representative inner-domain and cross-domain baselines in both short-term and long-term prediction tasks. Compared with the second-best method, MA-GLTC reduces the average prediction errors by 3.02%, 0.33%, 8.92%, 10.09%, and 2.11%, respectively.
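The phrase "leakage, adaptive time constants, and neighborhood-aware feedback" can be made concrete with a small Python sketch of one integration step. It uses the standard liquid time-constant form dx/dt = -(1/tau + f) x + f A; the way the gate f mixes in neighbor states, the explicit Euler solver, and all parameter names are assumptions for illustration and are not taken from MA-GLTC.

```python
import math
from typing import Dict, List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gltc_euler_step(
    x: List[float],
    inputs: List[float],
    neighbors: Dict[int, List[int]],
    tau: float = 1.0,
    bias_a: float = 1.0,
    w_nbr: float = 0.5,
    dt: float = 0.1,
) -> List[float]:
    """One explicit Euler step of dx_i/dt = -(1/tau + f_i) * x_i + f_i * A."""
    new_x = []
    for i, x_i in enumerate(x):
        nbrs = neighbors.get(i, [])
        nbr_mean = sum(x[j] for j in nbrs) / len(nbrs) if nbrs else 0.0
        # Hypothetical gate: own state, external input, and neighbor feedback.
        f_i = sigmoid(x_i + inputs[i] + w_nbr * nbr_mean)
        # The -(1/tau) term is the leak; f_i makes the time constant state-dependent.
        dx = -(1.0 / tau + f_i) * x_i + f_i * bias_a
        new_x.append(x_i + dt * dx)
    return new_x


graph = {0: [1], 1: [0]}
state = gltc_euler_step([0.0, 0.0], [0.0, 0.0], graph)
```

Because the gate multiplies the state itself, the effective time constant 1 / (1/tau + f_i) changes with the input and with neighboring nodes, which is what distinguishes this family from a fixed-rate leaky integrator.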

2606.15805 2026-06-16 cs.LG New submission

Mean-Field Parallel Decoding for Discrete Diffusion Language Models

Tamim Zoabi, Ameen Ali, Liran Ringel, Lior Wolf

Affiliations: School of Electrical & Computer Engineering, Tel Aviv University; School of Computer Science and AI, Tel Aviv University; Department of Computer Science, Technion, Israel Institute of Technology

AI summary: Proposes a training-free decoding framework that coordinates parallel token updates through a mean-field variational relaxation, suppressing conflicts within a single forward pass and improving the quality-latency trade-off.

Abstract

Discrete diffusion language models enable parallel token generation, offering a pathway to low-latency decoding. However, selecting tokens independently by marginal confidence limits effective parallelism: tokens that appear reliable in isolation can form incompatible configurations when several positions are updated at once. We introduce a training-free decoding framework that coordinates these parallel updates. At each forward pass, the method assigns a commit score to each masked position and refines these scores using pairwise interactions derived from the model's predictive distributions. A variational relaxation yields a simple fixed-point update that suppresses conflicting simultaneous commitments within a single forward pass. This mechanism allows the decoder to commit more tokens in parallel while maintaining competitive generation quality. The method is lightweight, requires no auxiliary model or retraining, and drops into existing diffusion decoding pipelines without modification. Experiments on reasoning and code-generation benchmarks show consistent improvements in the quality-latency trade-off.
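A toy Python sketch can show how a mean-field style fixed-point update suppresses conflicting commitments. The update rule q_i = sigmoid(a_i - sum_j J_ij q_j), the sequential sweep order, and the hand-written conflict matrix are assumptions for illustration; the abstract does not give the paper's exact objective or how interactions are derived from the model's predictive distributions.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def refine_commit_scores(
    logits: List[float], conflict: List[List[float]], iterations: int = 20
) -> List[float]:
    """Sequential fixed-point sweeps of q_i = sigmoid(a_i - sum_j J_ij * q_j)."""
    q = [sigmoid(a) for a in logits]  # start from independent marginal confidence
    for _ in range(iterations):
        for i, a_i in enumerate(logits):
            penalty = sum(conflict[i][j] * q[j] for j in range(len(q)) if j != i)
            q[i] = sigmoid(a_i - penalty)
    return q


logits = [2.0, 1.0]                     # both positions look confident in isolation
conflict = [[0.0, 4.0], [4.0, 0.0]]     # hypothetical strong pairwise conflict
q = refine_commit_scores(logits, conflict)
committed = [i for i, q_i in enumerate(q) if q_i > 0.5]
```

Taken independently, both positions would be committed. After refinement only the more confident one stays above 0.5, so the weaker, conflicting token is deferred to a later step. The sweeps operate on scores only and need no extra model calls, in the spirit of the single-forward-pass claim.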

2606.15802 2026-06-16 cs.CV New submission

CPS4: Class Prompt driven Semi-Supervised Spine Segmentation with Class-specific Consistency Constraint

Qingtao Pan, Hongzan Sun, Bing Ji, Shuo Li

Affiliations: School of Control Science and Engineering, Shandong University; Department of Nuclear Medicine, Shengjing Hospital of China Medical University; Department of Computer and Data Science, Case Western Reserve University; Department of Biomedical Engineering, Case Western Reserve University

AI summary: Proposes CPS4, the first semi-supervised spine segmentation network that uses textual class prompts to improve pseudo-label quality. It is trained in two stages (VLM pretraining and semi-supervised segmentation) and reaches 80.44% Dice with only 5% labeled data.

Abstract

Vision-language models (VLMs) have great potential to enhance the quality of pseudo labels in semi-supervised spine segmentation by leveraging textual class prompts to generate segmentation maps, but this has not yet been studied. Although promising, VLMs lack explicit constraints to ensure consistency between spine class prompts and spine unit regions, resulting in unsatisfactory performance in multi-class segmentation map generation. In this paper, we propose CPS4, the first text-guided semi-supervised spine segmentation network using class prompts to enhance the quality of spine pseudo labels. Specifically, CPS4 is implemented through two training stages. (i) Class-specific consistency constrained VLM pretraining stage: we propose token- and pixel-level attention losses to optimize the consistency between class prompts and spine units, forcing the textual class prompt to be closely coupled with the target spine unit in the semantic space. (ii) Class Prompt driven semi-supervised spine segmentation stage: using the pretrained vision-text encoder, we derive each class-specific binary segmentation map for the unlabeled spine image and integrate them into a unified multi-class segmentation map, improving the quality of the spine pseudo label generated by the semi-supervised spine segmentation network. Experimental results show that CPS4 achieves superior spine segmentation performance, with a Dice of 80.44% using only 5% labeled data on the public spine segmentation dataset, surpassing popular semi-supervised learning and VLM methods. Our code will be made available.
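The step of integrating class-specific binary maps into one multi-class map, and the Dice metric used to report results, can be sketched in plain Python. The merging rule here (take the most confident class per pixel, fall back to background below a threshold) is an assumption for illustration, since the abstract does not state how CPS4 performs the integration; the function names are hypothetical.

```python
from typing import List

ProbMap = List[List[float]]


def merge_binary_maps(class_maps: List[ProbMap], threshold: float = 0.5) -> List[List[int]]:
    """Label each pixel with its most confident class, or 0 for background."""
    rows, cols = len(class_maps[0]), len(class_maps[0][0])
    merged = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            probs = [m[r][c] for m in class_maps]
            best = max(range(len(probs)), key=probs.__getitem__)
            if probs[best] >= threshold:
                merged[r][c] = best + 1  # class ids start at 1
    return merged


def dice(pred: List[List[int]], target: List[List[int]], label: int) -> float:
    """Dice = 2 * |P and T| / (|P| + |T|) for one class label."""
    p = [v == label for row in pred for v in row]
    t = [v == label for row in target for v in row]
    overlap = sum(a and b for a, b in zip(p, t))
    total = sum(p) + sum(t)
    return 2.0 * overlap / total if total else 1.0


class_maps = [
    [[0.9, 0.2], [0.1, 0.3]],  # per-pixel probabilities for class 1
    [[0.4, 0.8], [0.1, 0.6]],  # per-pixel probabilities for class 2
]
merged = merge_binary_maps(class_maps)
score = dice(merged, [[1, 2], [0, 0]], label=2)
```

Resolving overlaps per pixel keeps the pseudo label consistent: each pixel ends up with exactly one spine class or background, even when two class prompts both respond to the same region.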

2606.15797 2026-06-16 cs.AI New submission

Unassigned Agents in Compilation-based Multi-agent Path Finding

Pavel Surynek

Affiliations: Faculty of Information Technology, Czech Technical University in Prague

AI summary: For multi-agent path finding with unassigned agents, proposes a SAT-based compilation approach realized by adapting the SMT-CBS and NRF-SAT solvers.

Abstract

Compilation-based techniques represent an important stream of solvers for multi-agent path finding (MAPF) due to their modularity and adaptability to non-standard variants of the problem. While in standard MAPF the task is to navigate all agents from their initial positions to given individual goal positions without any collision, variants with different requirements for agents are also relevant. One such variant is MAPF with unassigned agents (UA-MAPF), where some agents have the same setting as in standard MAPF, with initial positions and goals, while the remaining agents, the unassigned agents, have an initial position but no goal. Although unassigned agents do not need to reach any goal position, they have to be moved out of the way of the standard agents if needed, which represents a specific challenge. We show in this paper that UA-MAPF can be expressed in recent compilation-based techniques for MAPF that formulate the problem as Boolean satisfiability; namely, we adapt SMT-CBS and NRF-SAT, recent solvers based on counterexample-guided abstraction refinement and non-refined abstractions.
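The UA-MAPF problem statement can be made concrete with a small plan checker in Python. This is a sketch of the problem definition only, not of the SAT encodings: it assumes grid-cell vertices, treats finished agents as waiting at their last vertex, checks vertex and swap collisions, and does not verify that consecutive cells are adjacent. The name `is_valid_ua_mapf_plan` is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Vertex = Tuple[int, int]


def is_valid_ua_mapf_plan(
    paths: Dict[str, List[Vertex]],
    goals: Dict[str, Optional[Vertex]],
) -> bool:
    """Check goals of assigned agents and vertex/swap collisions for all agents."""
    horizon = max(len(p) for p in paths.values())
    # Agents wait at their last vertex once their path ends.
    padded = {a: p + [p[-1]] * (horizon - len(p)) for a, p in paths.items()}
    for agent, goal in goals.items():
        if goal is not None and padded[agent][-1] != goal:
            return False  # an assigned agent missed its goal
    agents = list(padded)
    for t in range(horizon):
        for i, a in enumerate(agents):
            for b in agents[i + 1:]:
                if padded[a][t] == padded[b][t]:
                    return False  # vertex collision
                if (t > 0 and padded[a][t] == padded[b][t - 1]
                        and padded[b][t] == padded[a][t - 1]):
                    return False  # the two agents swapped along one edge
    return True


goals = {"a1": (0, 2), "u1": None}  # u1 is unassigned: it has no goal
good = {"a1": [(0, 0), (0, 0), (0, 1), (0, 2)], "u1": [(0, 1), (1, 1)]}
blocked = {"a1": [(0, 0), (0, 1), (0, 2)], "u1": [(0, 1)]}
good_ok = is_valid_ua_mapf_plan(good, goals)
blocked_ok = is_valid_ua_mapf_plan(blocked, goals)
```

In the valid plan the unassigned agent `u1` steps aside into a side cell so that `a1` can pass; if it stays put, `a1` collides with it. This is the "moved out of the way" requirement that distinguishes UA-MAPF from simply ignoring agents without goals.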

2606.15796 2026-06-16 cs.CV cs.AI New submission

DifFRACT: Diffusion Feature Reconstruction and Attribution for Circuit Tracing

Artyom Mazur, Nina Konovalova, Aibek Alanov

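Affiliations: HSE University; FusionBrain Lab

AI summary: Extends transcoder-based circuit tracing to multimodal diffusion transformers. Timestep-conditioned transcoders trained to approximate MLP sublayers yield exact feature-level attribution and recover interpretable circuits, revealing mechanisms of attribute binding and cross-stream semantic propagation.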

Abstract

Mechanistic interpretability seeks to explain neural network behavior by decomposing model computations into interpretable features and circuits. While transcoder-based circuit tracing has recently enabled detailed causal analyses of large language models, multimodal diffusion transformers for image generation remain comparatively opaque. We still lack tools for understanding how semantic information propagates across denoising steps and how text and image representations interact within double-stream MM-DiT architectures. Existing methods provide only partial insight: attention maps expose a limited view of token interactions, while sparse autoencoders can discover interpretable features but do not directly reveal how these features are transformed and composed through nonlinear MLP layers. In this work, we extend transcoder-based circuit tracing to multimodal diffusion transformers. We train timestep-conditioned transcoders that faithfully approximate the input-output behavior of MLP sublayers in FLUX.1[schnell]. By replacing MLPs with transcoders and linearizing the remaining computation, we obtain exact feature-to-feature attribution and recover compact, interpretable circuits. Empirically, our transcoders match or slightly outperform sparse autoencoders on the sparsity-faithfulness tradeoff. The resulting circuits reveal mechanisms underlying attribute binding and cross-stream semantic propagation, and provide causal explanations for systematic generation errors. Moreover, circuit-guided interventions are substantially more precise and effective than standard SAE-based steering. Our results demonstrate that transcoder-based circuit analysis is feasible for state-of-the-art diffusion transformers and provides a powerful framework for understanding and controlling multimodal generative models. The code is available at https://github.com/Artalmaz31/DifFRACT

2606.15793 2026-06-16 cs.LG cs.AI stat.ML New submission

Proximal Policy Optimization for Amortized Discrete Sampling

Anna Zykova-Myzina, Timofei Gritsaev, Daniil Tiapkin, Nikita Morozov

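Affiliations: HSE University; Constructor University; CMAP, CNRS, École polytechnique, IPP

AI summary: Within the generative flow network framework, derives policy gradient algorithms and applies proximal policy optimization for the first time, improving convergence speed and data efficiency when sampling from discrete probability distributions.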

英文摘要

This paper explores policy gradient algorithms for training stochastic policies to sample from structured discrete probability distributions under the Generative Flow Network (GFlowNet) framework. Building on extensive theoretical connections between GFlowNets and entropy-regularized reinforcement learning, we derive equivalents of standard policy gradient algorithms for training GFlowNets and experimentally explore their various methodological aspects, including baseline training and advantage estimation. Most importantly, our work is the first to derive and successfully apply proximal policy optimization to GFlowNets, showing improved convergence speed and data efficiency compared to standard GFlowNet training objectives on benchmarks ranging from synthetic energies to molecular graph generation.
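
For readers unfamiliar with proximal policy optimization, the clipped surrogate objective of standard PPO can be sketched as below. This is the generic reinforcement-learning form, not the GFlowNet-specific derivation from the paper; the function name, the clipping range `eps=0.2`, and the argument layout are illustrative assumptions.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped surrogate objective of standard PPO (to be maximized).

    logp_new / logp_old: log-probabilities of the sampled actions under the
    current and the data-collecting policy; advantages: advantage estimates.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                        # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))  # keep the ratio near 1
        total += min(ratio * adv, clipped * adv)         # pessimistic (lower) bound
    return total / len(advantages)
```

With a positive advantage the objective stops rewarding ratio increases beyond `1 + eps`, which is what limits the size of each policy update.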

2606.15786 2026-06-16 cs.CV cs.AI physics.geo-ph New submission

Domain-Guided Prompting of the Segment Anything Model for Seismic Interpretation: The Role of Attributes, Visualization, and Hybrid Prompts

Aniq Ahmad, Heather Bedle, Ahmad Mustafa

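Affiliations School of Geosciences, University of Oklahoma; King Fahd University of Petroleum and Minerals

AI summary Proposes a zero-shot adaptation framework that improves SAM's segmentation accuracy in seismic interpretation, without fine-tuning, by choosing seismic attributes and colormaps to suit the geological target and combining them with a hybrid prompting strategy.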

Abstract

The advent of large pretrained foundation models for computer vision has significantly improved the efficiency of visual data interpretation. The Segment Anything Model (SAM), in particular, offers powerful zero-shot segmentation capabilities through prompt-based interaction, making it a promising tool for seismic interpretation. However, most existing applications of SAM rely on fine-tuning for specific geological targets, which requires extensive labeled data, incurs high computational cost, and often compromises the model's generalization capability. In this study, we introduce a principled framework for zero-shot adaptation of foundation models to seismic data. The framework is built on two key components: (1) aligning seismic attributes and visualization choices (e.g., colormaps) with the geological target of interest, and (2) employing a hybrid prompting strategy that combines sparse user-defined point prompts with dense mask prompts derived from SAM's internal feature activations. We systematically evaluate this framework across multiple geological targets, datasets, prompt configurations, and seismic attribute representations. Our results demonstrate that geologic-target-aware selection of seismic attributes and colormaps, combined with hybrid prompting, enhances the separability of geological features and improves boundary delineation and segmentation accuracy relative to point-based prompting alone. Our findings show that, when these components are jointly applied, SAM can achieve competitive segmentation performance in a fully zero-shot setting, thereby eliminating the need to retrain SAM for each geologic feature. This work establishes a practical and scalable pathway to leverage foundation models in seismic interpretation, reducing reliance on labeled data while preserving model generality.

2606.15784 2026-06-16 cs.LG cs.CE New submission

Bayesian Networks with Latent Time Embedding for Stage-Aware Causal Modeling of Alzheimer's Disease Progression

Nguyen Linh Dan Le

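Affiliations Alzheimer's Disease Neuroimaging Initiative; Open Access Series of Imaging Studies

AI summary Proposes BN-LTE, a framework combining Bayesian networks with a latent time embedding that models AD progression under AT(N) cascade constraints; it outperforms baselines on ADNI data and identifies a mid-pseudotime window of amyloid sensitivity.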

Comments 7 pages, 5 figures

Abstract

Alzheimer's disease (AD) progression is often described through the amyloid-tau-neurodegeneration, or AT(N), cascade. However, most longitudinal models represent this cascade either as a fixed sequence of biomarkers or as a black-box forecasting task. This makes it difficult to determine when biologically guided biomarker relationships influence future regional pathology. In this study, we introduce Bayesian Networks with Latent Time Embedding (BN-LTE), a Bayesian structural framework for stage-aware modeling of AD progression. BN-LTE estimates disease pseudotime from baseline biomarker profiles and constrains directed dependencies according to biologically plausible AT(N) ordering. Posterior spline-varying structural equations are then used to link initial multimodal measurements with future annualized regional tau-PET change. Across repeated subject-disjoint evaluations using ADNI data, BN-LTE shows strong spatial reconstruction of tau progression compared with the included forecasting baselines. Beyond spatial reconstruction, BN-LTE recovers posterior stage-varying AT(N)-constrained effects and identifies a mid-pseudotime window of amyloid sensitivity. This window is supported by model-implied g-formula contrasts, root-adjusted AIPW, mechanism-sensitive ablations, and robustness analyses across spline and prior specifications. Overall, these findings position BN-LTE as a Bayesian structural framework for forecasting tau progression while examining stage-dependent AT(N)-cascade mechanisms in observational longitudinal neuroimaging data. Our code is available at https://github.com/danleneurocom/BN-LTE.

2606.15783 2026-06-16 cs.CL New submission

ttda704 at SemEval-2026 Task 4: Modeling Narrative Structures via Pseudonymization and Multi-View Sentence Alignment

Tai Tran Tan, An Dinh Thien

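Affiliations University of Information Technology, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam

AI summary Proposes a narrative-similarity method based on contrastive learning and fine-tuned sentence transformers, with two pipelines trained on synthetic data: a single-view pipeline (smart layer freezing) and a multi-view pipeline (theme/plot/outcome projection heads plus self-supervised alignment).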

Abstract

We present our approach to SemEval 2026 Task 4: Narrative Story Similarity and Narrative Representation Learning. Our solution uses contrastive learning with fine-tuned sentence transformers to capture narrative similarity across abstract themes, course of action, and outcomes. We develop two pipelines: (Track A) a single-view method that encodes full narratives with smart layer freezing to reduce overfitting, and (Track B) a multi-view method that models theme, plot, and outcome with view-specific projection heads and self-supervised alignment. Both pipelines build on sentence-transformers models and are trained with contrastive loss on synthetic data. The code is available at the following GitHub repository: https://github.com/dinhthienan33/SemEval2026-Task4-ttda704.

2606.15782 2026-06-16 cs.AI cs.CV New submission

Mitigating Visual Hallucinations in Multimodal Systems through Retrieval-Augmented Reliability-Aware Inference

Pratheswaran Hariharan, Haiping Xu, Donghui Yan

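Affiliations University of Massachusetts, Dartmouth

AI summary Proposes a retrieval-augmented, reliability-aware inference framework that gates decisions using an external visual evidence database and multiple reliability indicators, reducing visual hallucinations without retraining the model and raising accepted-prediction accuracy from 85.84% to 88.88%.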

Comments 28 pages, 9 figures

Abstract

Multimodal large language models (MLLMs) have demonstrated strong capabilities in vision-language understanding and natural-language response generation. However, these systems can still produce overconfident predictions and hallucination-like outputs, particularly when the visual evidence is weak, ambiguous, or semantically inconsistent. Most existing approaches focus on improving multimodal representation alignment or retrieval-augmented generation, while providing limited mechanisms to quantify instance-level prediction reliability or identify incorrect visual outputs. This work proposes a retrieval-augmented reliability-aware inference framework for trustworthy multimodal visual understanding. The proposed framework constructs an external visual evidence database using pretrained visual embeddings and nearest-neighbor retrieval over normalized feature representations. Retrieved evidence is used to estimate prediction trustworthiness through multiple reliability indicators, including similarity strength, class-support agreement, evidence margin, entropy-based uncertainty, and an aggregate reliability score. Based on these signals, a decision gate determines whether the system should accept the prediction, answer with caution, or abstain/fallback when evidence is insufficient. A multimodal response-generation layer then produces a final user-facing response conditioned on the reliability decision. Experiments on ImageNet-100 demonstrate that the proposed reliability-aware framework improves accepted prediction accuracy from 85.84% to 88.88% at 89.04% coverage. The hallucination-like accepted wrong-answer rate is reduced from 14.16% to 11.12%. These results show that integrating retrieval evidence, reliability estimation, and selective decision gating can improve calibration and reduce overconfident visual errors without retraining large multimodal models.
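
The relationship between coverage, accepted accuracy, and accepted wrong-answer rate reported above can be illustrated with a minimal gating sketch. It assumes a single scalar reliability score and a fixed threshold; the function name, the threshold, and the toy data are hypothetical and do not reproduce the paper's indicators or its three-way accept/caution/abstain gate.

```python
def selective_metrics(scores, correct, threshold):
    """Coverage and accuracy over predictions whose reliability score passes the gate."""
    accepted = [c for s, c in zip(scores, correct) if s >= threshold]
    coverage = len(accepted) / len(scores)
    accuracy = sum(accepted) / len(accepted) if accepted else float("nan")
    return coverage, accuracy

# Toy data (hypothetical): a reliability score per prediction and whether it was right.
scores = [0.95, 0.90, 0.80, 0.40, 0.30]
correct = [True, True, False, True, False]

cov, acc = selective_metrics(scores, correct, threshold=0.5)
wrong_rate = 1.0 - acc  # accepted wrong-answer rate is the complement of accepted accuracy
```

The same complement holds for the figures in the abstract: 100% - 88.88% = 11.12% with the gate, and 100% - 85.84% = 14.16% without it.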

2606.15779 2026-06-16 cs.CV cs.LG New submission

Faithful Action-unit Causal Reasoning for Counterfactually Faithful Emotion Explanations

Van Thong Huynh, Hong Hai Nguyen, Thuy Pham, Trong Nghia Nguyen, Soo-Hyung Kim

Affiliations Faculty of CSE, Ho Chi Minh City University of Technology (HCMUT), VNUHCM; Dept. of AI, FPT University; Faculty of DSAI, College of Technology, National Economic University; Dept. of AI Convergence, Chonnam National University

AI summary Proposes FACR, which uses a counterfactual-consistency objective and a polarity-aware causal graph to train models toward measurable causal faithfulness between action units and emotions, raising fidelity on the UNBC-PAIN dataset from 0.08 to 0.57.

Abstract

Multimodal models can name the action units (AUs) behind a facial emotion, but their AU->emotion rationales are typically plausible rather than faithful: nothing forces the AUs a model invokes to be the AUs that actually drive its prediction. We cast AU->emotion reasoning as a counterfactual-consistency problem between the rationale, the label, and a structural AU->emotion causal graph G, and propose FACR, which grounds the reasoner in an independently induced, polarity-aware G and trains a counterfactual-faithfulness objective: a do-intervention on an AU that G marks causal for a class must move the prediction, while one it marks irrelevant must leave it unchanged. Faithfulness is thereby both trainable and measurable through a matching interventional metric, which we evaluate against a known causal structure, the PSPI pain-AU composition, as no existing affective-reasoning benchmark allows. We are explicit that this metric tests fidelity to the supplied structure rather than its rediscovery: it asks whether the trained reasoner invokes the AUs the structure marks causal, on held-out subjects and a second dataset. Under subject-independent evaluation on UNBC-PAIN, the objective raises the agreement between the invoked AUs and the PSPI composition from a no-objective baseline of 0.08 to 0.57, at a small detection cost; an unfaithfulness control attributes the gain to the objective. On a cross-dataset emotion transfer, the objective likewise raises fidelity to G on a seven-class task (0.50 to 0.84). Finally, we attach a language verbalizer and extend the audit to the generated text: biasing each action unit's emission by its latent activation makes the rationale faithful by construction, so that ablating an AU removes it from the explanation, a property that transfers to a second language-model backbone, whereas a freely generated rationale is unfaithful.

2606.15778 2026-06-16 cs.CL cs.AI cs.LG cs.SI New submission

DYNA: Dynamic Episodic Memory Networks for Augmenting Large Language Models with Temporal Knowledge Graphs in Continuous Learning

Ali Sarabadani, Mahtab Tajvidiyan

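Affiliations Department of Computer Engineering and Information Technology, University of Qom

AI summary Proposes DYNA, a framework that augments a frozen large language model with a temporal knowledge graph serving as external, updatable memory; on three temporal recall tasks it reduces catastrophic forgetting by about 7% and improves temporal ordering by about 5%.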

Abstract

Large Language Models (LLMs) struggle to incorporate new knowledge without forgetting or costly retraining. We propose DYNA, a lightweight framework that augments a frozen LLM with a temporal knowledge graph where events are nodes and temporal relations are directed, timestamped edges. The graph serves as an external, updatable memory. At query time, DYNA retrieves relevant nodes via random walks and centrality measures, then augments the LLM's response. Evaluated on three temporal recall tasks, DYNA reduces catastrophic forgetting by ~7% compared to fine-tuning and improves temporal ordering by ~5% over standard RAG. Higher graph clustering coefficients correlate with better retrieval, showing that graph structure matters. Contributions: (1) episodic memory as temporal KG, (2) retraining-free LLM augmentation, (3) graph properties as predictors of retrieval performance.
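
A temporal knowledge graph of the kind described (events as nodes, directed timestamped edges, retrieval by random walks) might be sketched as follows. This is an illustration under stated assumptions, not DYNA's implementation: the class, method names, walk parameters, and toy events are hypothetical, and the centrality measures mentioned in the abstract are omitted.

```python
import random
from collections import Counter, defaultdict

class TemporalKG:
    """Events are nodes; edges are directed and carry a timestamp."""

    def __init__(self):
        self.out_edges = defaultdict(list)  # node -> [(neighbor, relation, timestamp)]

    def add_edge(self, src, dst, relation, timestamp):
        # Memory is updated by adding edges; no model weights are touched.
        self.out_edges[src].append((dst, relation, timestamp))

    def random_walk_retrieve(self, seed_node, walks=200, length=3, top_k=2, rng=None):
        """Rank nodes by visit counts over short random walks started at seed_node."""
        rng = rng or random.Random(0)
        visits = Counter()
        for _ in range(walks):
            node = seed_node
            for _ in range(length):
                edges = self.out_edges.get(node)
                if not edges:
                    break
                node = rng.choice(edges)[0]
                visits[node] += 1
        return [n for n, _ in visits.most_common(top_k)]

kg = TemporalKG()
kg.add_edge("launch", "outage", "before", 2021)
kg.add_edge("outage", "patch", "before", 2022)
kg.add_edge("patch", "audit", "before", 2023)
retrieved = kg.random_walk_retrieve("launch")
```

In a full system the retrieved events and their timestamps would be placed in the prompt of the frozen LLM; that step is not shown here.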

2606.15772 2026-06-16 cs.CV New submission

Ellipse Meets Bit-Planes: A Novel Approach to RNFL based Glaucoma Detection Using Advanced Image Processing and Deep Learning

Snigdha Paul, Sambit Mallick, Anindya Sen

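Affiliations Heritage Institute of Technology

AI summary Proposes an adaptive ellipse-based polar transformation to enhance RNFL analysis, then detects glaucoma either with deep-learning feature fusion (99.3% detection rate) or with image processing based on bit-plane slicing (92.31% accuracy).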

Abstract

This work proposes an integrated pipeline for automatic glaucoma detection from readily available colour fundus images. The pipeline is built on an adaptive algorithm for ellipse-based polar transformation, which enhances the analysis of the Retinal Nerve Fiber Layer (RNFL) as the primary biomarker for observing glaucomatous changes, regardless of optic disc and macula position. Utilizing this transformation, we introduce two distinct frameworks tailored to different operational needs. The first framework, a deep learning-inspired feature fusion approach, achieves a 99.3% detection rate, ideal for settings where high precision is essential, despite higher computational demands. The second framework employs a novel image-processing algorithm based on bit-plane slicing, offering 92.31% accuracy and optimized for environments requiring rapid inference with minimal resource consumption. Both frameworks provide scalable and cost-effective solutions for early glaucoma detection. This study highlights the potential of RNFL-based diagnostic tools in addressing the global challenge of glaucoma, particularly in underserved regions.
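
Bit-plane slicing itself is a standard image-processing operation: an 8-bit grayscale image is split into eight binary images, one per bit position. A minimal sketch on a toy image follows; the helper names and pixel values are illustrative, and the paper's RNFL-specific processing on top of the planes is not shown.

```python
def bit_plane(image, k):
    """Return bit plane k (0 = least significant) of an 8-bit grayscale image."""
    return [[(pixel >> k) & 1 for pixel in row] for row in image]

def reconstruct(planes):
    """Recombine a list of bit planes (index = bit position) into pixel values."""
    rows, cols = len(planes[0]), len(planes[0][0])
    return [[sum(planes[k][r][c] << k for k in range(len(planes)))
             for c in range(cols)] for r in range(rows)]

image = [[0, 255], [128, 37]]                     # toy 2x2 image, values for illustration
planes = [bit_plane(image, k) for k in range(8)]
msb = planes[7]                                   # most significant plane: coarse structure
```

The higher planes carry most of the visually significant structure, while the lower planes are dominated by fine detail and noise, which is why the operation is cheap enough for low-resource inference.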

2606.15770 2026-06-16 cs.CL New submission

ttda704 at SemEval-2026 Task 6: Structured Chain-of-Thought Prompting for Political Evasion Detection

Tai Tran Tan, An Dinh Thien

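Affiliations University of Information Technology, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam

AI summary For classifying political evasion strategies in presidential interviews, compares QLoRA fine-tuning of Qwen3 with structured chain-of-thought prompting of DeepSeek-V3.2 and Grok-4-Fast; the latter achieves substantially higher Macro F1, with the best system reaching 0.5147 Macro F1 on the 9-class evasion task.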

Abstract

This paper describes our system for SemEval-2026 Task 6, which addresses the classification of political evasion strategies in English question-answer pairs extracted from U.S. presidential interviews. We systematically compare two distinct paradigms: (1) Parameter-Efficient Fine-Tuning of Qwen3 models (4B-32B) using QLoRA, enhanced with tiered upsampling and weighted cross-entropy loss to address severe class imbalance, and (2) structured Chain-of-Thought (CoT) prompting of reasoning-capable API models, namely DeepSeek-V3.2 and Grok-4-Fast. Our evaluation demonstrates that structured CoT prompting of reasoning-enabled models substantially outperforms our baseline parameter-efficient fine-tuning implementation in absolute Macro F1. Our best system, Grok-4-Fast with extended reasoning and few-shot hierarchical CoT prompting, achieves a Macro F1 of 0.5147 on Subtask 2 (9-class evasion) and 0.7979 on Subtask 1 (3-class clarity), ranking 8th out of 33 teams on Subtask 2 and 13th out of 41 teams on Subtask 1 on the official leaderboard. Furthermore, our ablation studies reveal key insights into effective prompt design for evasion detection: presenting labels within a hierarchical taxonomy helps structure model reasoning, while few-shot exemplars provide task calibration. However, the strongest prompt variants are not statistically distinguishable in Macro F1, and explicitly enabling extended reasoning modes yields substantial performance gains by facilitating the multi-step pragmatic analysis required to detect evasive intent.
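
Because the results above are reported in Macro F1 under severe class imbalance, a short sketch of the metric may help. It computes the unweighted mean of per-class F1 scores; the label names and toy predictions are hypothetical and are not taken from the task data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over the classes present in y_true."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # F1 = 2TP / (2TP + FP + FN)
    return sum(scores) / len(scores)

# Toy imbalanced example: a classifier that always predicts the majority class.
y_true = ["clear"] * 8 + ["evasive"] * 2
y_pred = ["clear"] * 10
score = macro_f1(y_true, y_pred)
```

Here accuracy is 0.8 but Macro F1 is only 4/9 (about 0.44), because the minority class contributes an F1 of zero with equal weight.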

2606.15768 2026-06-16 cs.RO cs.AI New submission

LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies

Jialei Chen, Kai Wang, Kang Chen, Shuaihang Chen, Feng Gao, Wenhao Tang, Zhiyuan Li, Weilin Liu, Zhuyu Yao, Boxun Li, Yuanbo Xu, Chao Yu

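Affiliations Tsinghua University; Jilin University; Nankai University; Peking University; Harbin Institute of Technology; Zhongguancun Academy; Striding.AI; Infinigence AI

AI summary Proposes LaWAM, a model that predicts scene changes through latent visual subgoals to enable dynamics-aware robot control, achieving state-of-the-art or competitive success rates on several benchmarks with low inference latency.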

Abstract

Vision-Language-Action models (VLAs) leverage large-scale vision-language pretraining for semantic robot control, but often lack explicit foresight into how robot actions change the scene. World-Action Models (WAMs) address this limitation by conditioning policies on predicted futures, yet existing approaches typically rely on computationally expensive video generation with substantial pixel-level redundancy. We present LaWAM, a Latent World Action Model that exposes predictive dynamics to robot policies through compact latent visual subgoals instead of reconstructed future video. At the core of LaWAM is a latent-action-conditioned Latent World Model (LaWM). We obtain LaWM by training a latent action model in the latent space of a pretrained vision foundation model and repurposing its forward decoder to predict future observation features for scene evolution. LaWAM then conditions action generation on these predicted latent visual subgoals to enable dynamics-aware robot control. LaWAM achieves state-of-the-art or competitive success rates (SRs) across LIBERO (98.6% SR), RoboTwin (91.22% SR), and real-world manipulation tasks while retaining low-latency inference. LaWAM runs in 187 ms per action-chunk prediction and achieves up to 24x lower wall-clock latency than pixel-space WAMs.

2606.15767 2026-06-16 cs.LG cs.AI New submission

Visualizing Uncertainty: Spatial Maps of Missing and Conflicting Evidence in Deep Learning

Dong Hyun Jeong, Feng Chen, Jin-Hee Cho, Lance M. Kaplan, Audun Jøsang, Soo-Yeon Ji

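Affiliations University of the District of Columbia; University of Texas at Dallas; Virginia Tech; U.S. Army DEVCOM Army Research Laboratory; University of Oslo; Bowie State University

AI summary Proposes the Uncertainty Activation Map (UAM) framework, which combines evidential deep learning with full-gradient class activation mapping to produce spatial uncertainty activation maps that separate vacuity (lack of evidence) from dissonance (conflicting hypotheses), bridging uncertainty quantification and explainability.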

Abstract

Understanding when and why deep neural networks are uncertain is crucial for deploying reliable machine learning systems in safety-critical domains. While existing uncertainty quantification methods provide scalar measures of model confidence, they offer limited insight into which spatial regions of an input contribute to different types of uncertainty. We propose a novel visualization framework, Uncertainty Activation Map (UAM), that combines Evidential Deep Learning (EDL) with Full-Gradient Class Activation Mapping (FullGrad) to generate interpretable spatial uncertainty activation maps. Our approach distinguishes between two fundamental types of uncertainty: vacuity, representing lack of evidence, and dissonance, capturing conflicting evidence between competing hypotheses. By leveraging the complete gradient decomposition property of FullGrad and the principled uncertainty quantification of Subjective Logic, our method produces theoretically grounded visualizations that highlight specific image regions responsible for model uncertainty. With this framework, vacuity and dissonance activation maps are generated by computing belief-weighted attributions, enabling identification of where models lack knowledge versus where they encounter ambiguous evidence. Extensive evaluations across multiple benchmark datasets demonstrate that the proposed framework effectively addresses the critical gap between uncertainty quantification and explainability, providing intuitive visual feedback to assess model reliability in complex visual recognition tasks.
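
The two uncertainty types can be made concrete with the definitions commonly used when Subjective Logic is paired with evidential deep learning. The sketch below assumes those standard definitions (Dirichlet parameters `e_k + 1`, vacuity `K / S`, dissonance built from pairwise belief balance) and that the paper follows them; the function name is hypothetical, and the FullGrad-based spatial attribution is not shown.

```python
def vacuity_and_dissonance(evidence):
    """Vacuity and dissonance of a multinomial opinion built from non-negative evidence."""
    k = len(evidence)
    strength = sum(e + 1.0 for e in evidence)    # Dirichlet strength S = sum(e_k + 1)
    belief = [e / strength for e in evidence]    # b_k = e_k / S
    vacuity = k / strength                       # u = K / S, so sum(b) + u = 1

    def balance(bj, bk):
        # Relative balance of two belief masses: 1 when equal, 0 when either is zero.
        return 1.0 - abs(bj - bk) / (bj + bk) if bj > 0 and bk > 0 else 0.0

    dissonance = 0.0
    for i, bi in enumerate(belief):
        others = [bj for j, bj in enumerate(belief) if j != i]
        total = sum(others)
        if total > 0:
            dissonance += bi * sum(bj * balance(bj, bi) for bj in others) / total
    return vacuity, dissonance
```

With no evidence at all (`[0, 0, 0]`) vacuity is 1 and dissonance is 0; with strong but evenly split evidence (`[10, 10]`) vacuity is low and dissonance is high; with strong one-sided evidence (`[20, 0]`) both are low.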

2606.15766 2026-06-16 cs.AI cs.HC New submission

Rethinking Scaffolding in LLM Tutors: The Interactional Mismatch Between Benchmarks and Real-World Deployments

Alexandra Neagu, Jeffrey T. H. Wong, Marcus Messer, Rhodri Nelson, Peter B. Johnson

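Affiliations University of Cambridge

AI summary An analysis of 9,490 chats finds that AI tutor benchmarks assume students actively take up scaffolding, whereas students in real-world settings often bypass it, revealing an interactional mismatch between benchmarks and real-world deployments.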

Comments Pluralistic Alignment Workshop @ ICML 2026, Seoul, South Korea

Abstract

A central pedagogical value evaluated in AI tutor benchmarks is scaffolding: guiding students through graduated steps toward a solution. Alignment and evaluation methods for embedding scaffolding behaviour into chatbots, however, rest on an implicit assumption: that students will take up the scaffolding and engage in the conversation. To examine whether this assumption holds, we introduce an evaluation pipeline around two metrics - Chatbot Scaffolding and Student Uptake - and apply them across nine datasets of 9,490 chats, spanning AI tutor benchmarks and real-world deployments of educational chatbots. Our analysis reveals that while benchmarks assume a high-scaffolding, high-student-uptake environment, students in real-world settings exhibit lower levels of uptake overall - frequently bypassing the chatbot's pedagogical framing to drive the interaction toward their own learning goals at little interpersonal cost. We argue that bypassing scaffolding is not necessarily detrimental; rather, it frequently highlights a mismatch between a chatbot's pedagogical framing and the student's learning goals. To meaningfully evaluate the effectiveness of a chatbot's assistance, future benchmarks must move beyond the assumption that students will simply take up the scaffolding, and instead evaluate how these chatbots navigate diverse learning contexts and student-driven interaction patterns.

2606.15765 2026-06-16 cs.CV New submission

Task-Instructed Causal Routing of Vision Foundation Models for Multi-Task Learning

Donghyun Han, Yuseok Bae, Jung Uk Kim, Hyung-Il Kim

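Affiliations Electronics and Telecommunications Research Institute (ETRI); Kyung Hee University; Chonnam National University

AI summary Proposes TIGER, a framework in which natural-language task instructions guide a routing network, combined with counterfactual causal alignment, to coordinate multiple heterogeneous vision foundation models for multi-task dense prediction; it outperforms existing methods on NYUD-v2 and Pascal Context.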

Comments 17 pages, 6 figures

Abstract

Vision foundation models (VFMs) have demonstrated strong robustness and transferability across a wide range of visual tasks. However, each model typically encodes strong inductive biases shaped by its pre-training objective and data domain, resulting in fragmented yet complementary visual knowledge. As a result, a single model often struggles to capture the diverse visual representations required across multiple dense prediction tasks. To address this limitation, we propose TIGER (Task-Instruction-Guided Expert Routing), a framework that coordinates multiple heterogeneous VFMs for multi-task dense prediction. Instead of naively aggregating expert features, TIGER leverages natural-language task instructions to guide a routing network that assigns token-level expert weights conditioned on task semantics, enabling adaptive integration of complementary expert features. TIGER further introduces a counterfactual loss that aligns routing decisions with each expert's causal contribution by measuring prediction changes when experts are excluded, encouraging more reliable and interpretable routing. We evaluate TIGER on two multi-task dense prediction benchmarks, NYUD-v2 and Pascal Context, where it consistently outperforms recent multi-task learning baselines while keeping all VFMs frozen. These results demonstrate that combining instruction-guided expert routing with counterfactual causal alignment enables effective coordination of heterogeneous vision foundation models.

2606.15763 2026-06-16 cs.CV New submission

The Circumplex Degeneracy Behind the Rare-Class Limit in Affect Recognition

Van Thong Huynh, Hong Hai Nguyen, Soo-Hyung Kim

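Affiliations Faculty of CSE, Ho Chi Minh City University of Technology (HCMUT), VNUHCM; Dept. of AI, FPT University; Dept. of AI Convergence, Chonnam National University

AI summary A multi-task study shows that rare-expression recognition failures stem from degeneracy on Russell's circumplex rather than class imbalance; the paper proposes a circumplex-cost optimal-transport term, but its gains are not geometric, and the error structure of the rare classes is affected by a visual confound.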

Abstract

In-the-wild expression recognition persistently fails on a few rare emotions, and the standard explanation is class imbalance. Through a controlled multi-task study on two benchmarks, we show the failure is instead a property of affect geometry: the rare classes are degenerate on Russell's circumplex, and that degeneracy bounds what any loss or cost can achieve. Our instrument is a circumplex-cost optimal-transport term that prices expression confusions by their valence-arousal distance. The term improves the official score and expression macro-F1, but a control most studies omit shows the gain is not geometric: a uniform cost, equivalent to a generic confidence penalty, matches it on Aff-Wild2 (p=0.625) and significantly exceeds it on AffectNet (+0.057 over base, larger than the circumplex). What the geometry reshapes is the structure of the errors, making them affectively nearer the truth on Aff-Wild2 (p=0.031 against the uniform control), an effect that does not survive on AffectNet, where a visual confound at the far corner of the circumplex overwhelms it. The rare-class failure, by contrast, is stable across both datasets we examine: the degenerate pairs (anger-fear on Aff-Wild2, anger-contempt on AffectNet) resist frequency-based interventions, the transport term, and an action-unit-augmented cost built specifically to separate them. We conclude that progress on rare expressions requires representations that distinguish the classes, not supervision that reprices their confusions, and we provide the controls and metrics needed to tell the two apart.

2606.15760 2026-06-16 cs.LG stat.ML New submission

The Data Manifold under the Microscope

Marios Koulakis, Constantin Seibold

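Affiliations * University of California, Berkeley

AI summary To address the gap between deep learning theory and practice, this work proposes a benchmarking framework that extends the dSprites and COIL-20 datasets and pairs them with finite-difference estimators, yielding near-ground-truth estimates of curvature, reach, and volume for calibrating geometric estimators and testing theoretical assumptions.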

Comments Accepted at ICML 2026. Camera-ready version

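Details
AI abstract

A significant gap exists between theory and practice in deep learning. Generalization and approximation error bounds are usually derived for simplified models or are too loose to be informative. Many works rely on the manifold hypothesis and on geometric regularity such as intrinsic dimension, curvature, and reach. Progress requires a deeper understanding of data-manifold geometry and suitable benchmarks, but existing options are polarized: analytic manifolds with known geometry but limited applicability, or real-world datasets whose geometry can only be coarsely estimated. We introduce a benchmarking framework for studying data geometry. We repurpose and extend dSprites and COIL-20 with additional transformation dimensions and dense, axis-aligned sampling, and pair them with finite-difference estimators that recover curvature, reach, and volume at near-ground-truth accuracy in settings where general-purpose estimators are unreliable or difficult to deploy. The framework is intended as a controlled testbed that can serve as a calibration environment for geometric estimators and a sandbox for probing theoretical assumptions. To illustrate its use, we present two application studies: assessing the scaling behavior of the bounds of Genovese et al. and Fefferman et al., and tracking the layer-wise geometry of a $β$-VAE. Both highlight the behavior of current bounds and the value of controlled benchmarks for guiding and validating future theory. A reference implementation is available at https://github.com/koulakis/manifold-microscope.

English abstract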

A significant gap exists between theory and practice in deep learning. Generalization and approximation error bounds are often derived for simplified models or are too loose to be informative. Many rely on the manifold hypothesis and on geometric regularity such as intrinsic dimension, curvature, and reach. Progress requires insight into data-manifold geometry and suitable benchmarks, yet existing options are polarized: analytic manifolds with known geometry but limited applicability, or real-world datasets where geometry is only coarsely estimable. We introduce a benchmarking framework for studying data geometry. We repurpose and extend dSprites and COIL-20 with additional transformation dimensions and dense, axis-aligned sampling, and pair them with finite-difference estimators that recover curvature, reach, and volume at near-ground-truth accuracy in a regime where general-purpose estimators are unreliable or difficult to deploy. The framework is intended as a controlled testbed, useful as a calibration environment for geometric estimators and a sandbox for probing theoretical assumptions. To illustrate its use, we present two application studies, namely assessing the scaling behavior of the bounds of Genovese et al. and Fefferman et al., and tracking the layer-wise geometry of a $β$-VAE, highlighting the behavior of current bounds and the value of controlled benchmarks for guiding and validating future theory. A reference implementation is available at https://github.com/koulakis/manifold-microscope.
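To make the idea of a finite-difference curvature estimator concrete, here is a minimal sketch on a densely and uniformly sampled closed curve. It is not the authors' implementation: the circle test case, the function names, and the choice of central differences are assumptions made for illustration. The curvature formula used is the standard one for a parametrized curve in any dimension, $\kappa = \sqrt{\lVert\gamma'\rVert^2 \lVert\gamma''\rVert^2 - (\gamma' \cdot \gamma'')^2} / \lVert\gamma'\rVert^3$.

```python
import math


def sample_circle(radius, n):
    """Uniformly sample a circle embedded in R^3; returns points and step size."""
    h = 2 * math.pi / n
    pts = [(radius * math.cos(i * h), radius * math.sin(i * h), 0.0)
           for i in range(n)]
    return pts, h


def fd_curvature(prev, cur, nxt, h):
    """Curvature at `cur` from its two neighbours using central differences."""
    d1 = [(b - a) / (2 * h) for a, b in zip(prev, nxt)]          # first derivative
    d2 = [(a - 2 * c + b) / (h * h) for a, c, b in zip(prev, cur, nxt)]  # second derivative
    n1 = sum(x * x for x in d1)
    n2 = sum(x * x for x in d2)
    dot = sum(x * y for x, y in zip(d1, d2))
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(n1 * n2 - dot * dot, 0.0)) / n1 ** 1.5


pts, h = sample_circle(radius=2.0, n=1000)
# The curve is closed, so the neighbour before index 0 is the last sample.
kappa = fd_curvature(pts[-1], pts[0], pts[1], h)
```

For a circle the ground truth is known (curvature is the reciprocal of the radius), so the estimate can be checked directly. That is the same role a benchmark with controlled, densely sampled transformations plays for estimators on image data.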

2606.15756 2026-06-16 cs.LG cs.AI New submission

From Correlation to Causation in Lane Change Prediction for Automated Driving: A Causal Explanation Framework

Mohamed Manzour, Aditya Kumar, Augusto Luis Ballardini, Miguel Ángel Sotelo

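Affiliations * University of Alcalá

AI summary Proposes a causal-inference-based lane-change prediction framework that combines deep structural causal modeling with intervention-based effect analysis; it achieves average F1-scores above 95% while identifying directly contributing variables and their causal chains, enabling interpretable causal reasoning.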

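Details
AI abstract

Lane-change prediction is a core task for intelligent vehicles, where anticipating maneuvers early supports safer decision-making. However, existing approaches mainly learn statistical associations between observed driving variables and future maneuvers while ignoring the causal dependencies among the input variables. This limits interpretability, especially when physically related variables such as longitudinal gap, relative longitudinal velocity, and Time-To-Collision (TTC) are treated as independent flat inputs. This article proposes a causal-inference-based framework for lane-change prediction and explanation. The approach combines linguistic feature construction, expert-constrained causal discovery, deep structural causal modeling based on Deep End-to-end Causal Inference (DECI), intervention-based effect analysis, refutation testing, and recursive causal-chain explanation. The goal is not only to predict the future maneuver but also to identify candidate variables that contribute directly to the prediction, the upstream factors that influence them, and the causal chains through which these effects propagate. The framework achieves average F1-scores above 95% during the first three seconds before the lane-marking crossing event. Beyond prediction accuracy, the framework uses intervention-based effect analysis to distinguish influential from weakly influential variables under the learned causal structure. It further separates candidate direct contributors from mediated effects and generates contrastive causal-chain explanations that clarify why the predicted maneuver is favored and why alternative maneuvers receive less support. The main contribution is therefore a mechanism-aware lane-change prediction pipeline that moves from correlation-based classification toward more interpretable causal reasoning for maneuver prediction.

English abstract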

Lane-change prediction is a central task in intelligent vehicles, where early maneuver anticipation can support safer decision-making. However, many existing approaches mainly learn statistical associations between observed driving variables and future maneuvers, while overlooking the causal dependencies among the input variables themselves. This limits interpretability, especially when physically related variables such as longitudinal gap, relative longitudinal velocity, and Time-To-Collision (TTC) are treated as independent flat inputs. This article presents a causal-inference-based framework for lane-change prediction and explanation. The proposed approach combines linguistic feature construction, expert-constrained causal discovery, deep structural causal modeling with Deep End-to-end Causal Inference (DECI), intervention-based effect analysis, refutation testing, and recursive causal-chain explanation. The objective is not only to predict the future maneuver, but also to identify candidate variables that directly contribute to the prediction, the upstream factors influencing them, and the causal chains through which these effects propagate. The framework achieves average F1-scores above 95% during the first three seconds before the lane-marking crossing event. Beyond prediction accuracy, the framework uses intervention-based effect analysis to distinguish influential from weakly influential variables under the learned causal structure. It further distinguishes candidate direct contributors from mediated effects and generates contrastive causal-chain explanations that clarify why the predicted maneuver is favored and why the alternative maneuvers are less supported. The main contribution is therefore a mechanism-aware lane-change prediction pipeline that moves beyond correlation-based classification toward more interpretable causal reasoning for maneuver prediction.
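The distinction between direct contributors and mediated effects can be illustrated with a toy structural causal model and do-interventions. The sketch below is hand-written and is not DECI or the authors' learned graph: the structural equations, the 4-second threshold, the variable names, and the logistic link are all assumptions chosen only to show the mechanics.

```python
import math


def simulate(gap, closing_speed, do=None):
    """Evaluate a toy structural causal model in topological order.

    `do` maps variable names to forced values (a do-intervention), which
    overrides that variable's structural equation. Assumes closing_speed > 0.
    """
    do = do or {}
    v = {}
    v["gap"] = do.get("gap", gap)
    v["closing_speed"] = do.get("closing_speed", closing_speed)
    # TTC is caused by gap and closing speed (hypothetical equation).
    v["ttc"] = do.get("ttc", v["gap"] / v["closing_speed"])
    # Lane-change probability depends only on TTC in this toy model.
    v["p_change"] = 1.0 / (1.0 + math.exp(v["ttc"] - 4.0))
    return v


def effect(var, start, end, fixed=None, **obs):
    """Change in p_change when `var` is moved from `start` to `end` by
    intervention, optionally holding other variables fixed by intervention."""
    fixed = fixed or {}
    after = simulate(**obs, do={**fixed, var: end})["p_change"]
    before = simulate(**obs, do={**fixed, var: start})["p_change"]
    return after - before


obs = {"gap": 30.0, "closing_speed": 5.0}
# Total effect of shrinking the gap from 40 m to 10 m.
total = effect("gap", 40.0, 10.0, **obs)
# Same change with TTC held fixed: any remaining effect would be direct.
direct = effect("gap", 40.0, 10.0, fixed={"ttc": 6.0}, **obs)
```

In this toy graph the gap has a large total effect but no effect once TTC is held fixed, so the gap is an upstream factor whose influence is entirely mediated by TTC, while TTC is the direct contributor. A learned causal model would replace the hand-written equations with fitted ones, but the intervention logic is the same.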

2606.15753 2026-06-16 cs.AI New submission

RoboPIN: Grounded Embodied Reasoning via Pinned Chain-of-Thought

Yaoting Huang, Yifu Yuan, Linqi Han, Chengwen Li, Shuoheng Zhang, Xianze Yao, Hongyao Tang, Yan Zheng, Jianye Hao

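Affiliations * Tianjin University

AI summary Proposes the Pinned Chain-of-Thought (PinCoT) reasoning paradigm, which binds entities to structured visual anchors to counter entity-reference drift and visual decoupling in multi-step reasoning; the trained 4B-parameter model outperforms the strongest 7B baseline by an average of 12% across 14 benchmarks.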

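Details
AI abstract

Embodied reasoning requires a model to perceive task-relevant objects and spaces in a physical environment and to maintain consistent visual grounding throughout multi-step reasoning. However, current vision-language models rely on text-only or coordinate-augmented chain-of-thought, in which entity references remain implicit and ambiguous. This can decouple the reasoning process from visual evidence, let entity references drift across steps, and break the causal link between the reasoning trajectory and the final answer; these problems are further amplified in multi-view scenarios because appearance changes across views. To address them, we propose Pinned Chain-of-Thought (PinCoT), a structured reasoning paradigm that anchors every reasoning step to visual evidence. PinCoT introduces the concept of a reasoning anchor, which binds each task-relevant entity to a structured visual anchor containing the entity name, a unique identity, a view index, and spatial grounding, enabling consistent entity tracking across reasoning steps and views. We build a fully automated data generation pipeline to construct a high-quality PinCoT-formatted reasoning dataset. We then train our method through three-stage post-training that progressively injects embodied knowledge, structured reasoning ability, and process-supervised alignment, with rewards that directly constrain anchor localization and identity consistency during reasoning. On 14 benchmarks covering embodied spatial reasoning, multi-view reasoning, and pointing, the method, with only 4B parameters, consistently outperforms 7B-level open-source embodied models, improving on the strongest 7B baseline, Mimo-Embodied, by 12% on average. Further analysis shows that PinCoT improves grounding accuracy and cross-step identity consistency, validating the effectiveness of process supervision.

English abstract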

Embodied reasoning requires models to perceive task-relevant objects and spaces in physical environments and maintain consistent visual grounding throughout multi-step reasoning. However, current vision-language models rely on text-only or coordinate-augmented chain-of-thought, where entity references remain implicit and ambiguous. This may cause the reasoning process to decouple from visual evidence, entity references to drift across steps, and the reasoning trajectory to become causally disconnected from the final answer, with these problems further amplified in multi-view scenarios due to cross-view appearance changes. To address these issues, we propose Pinned Chain-of-Thought (PinCoT), a structured reasoning paradigm that pins every reasoning step to visual evidence. PinCoT introduces the concept of a reasoning anchor, which binds each task-relevant entity to a structured visual anchor with entity name, unique identity, view index, and spatial grounding, enabling consistent entity tracking across reasoning steps and views. We build a fully automated data generation pipeline to construct a high-quality PinCoT-formatted reasoning dataset. We then train RoboPIN through three-stage post-training that progressively injects embodied knowledge, structured reasoning ability, and process-supervised alignment, with rewards that directly constrain both anchor localization and identity consistency during reasoning. On 14 benchmarks covering embodied spatial reasoning, multi-view reasoning, and pointing, RoboPIN with only 4B parameters consistently outperforms 7B-level open-source embodied models, achieving a 12% average improvement over the strongest 7B baseline, Mimo-Embodied. Further analysis shows that PinCoT improves grounding accuracy and cross-step identity consistency, validating the effectiveness of process supervision.
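The anchor structure described above (entity name, unique identity, view index, spatial grounding) can be pictured as a small record plus two checks. The sketch below is a guess at a minimal representation, not the paper's actual format or reward: the field names, the bounding-box encoding, the use of IoU as a localization score, and the consistency rule are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    """Hypothetical anchor record; field names are illustrative."""
    name: str    # entity name, e.g. "mug"
    uid: int     # identity shared across steps and views
    view: int    # index of the camera view
    box: tuple   # spatial grounding as (x1, y1, x2, y2)


def identity_consistent(steps):
    """True if each uid refers to the same entity name in every step."""
    seen = {}
    for anchors in steps:
        for a in anchors:
            # setdefault records the first name seen for this uid.
            if seen.setdefault(a.uid, a.name) != a.name:
                return False
    return True


def iou(a, b):
    """Intersection over union of two boxes, one possible localization score."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# Two reasoning steps that refer to the same mug in two different views.
stable = [
    [Anchor("mug", 1, 0, (10, 10, 50, 60))],
    [Anchor("mug", 1, 1, (30, 12, 70, 58))],
]
# A drifted trace: identity 1 silently becomes a different entity.
drifted = [
    [Anchor("mug", 1, 0, (10, 10, 50, 60))],
    [Anchor("bowl", 1, 0, (80, 20, 120, 60))],
]
```

Here `identity_consistent(stable)` holds while `identity_consistent(drifted)` does not, which is the kind of cross-step drift the explicit identity field is meant to expose. A process-level reward could combine such a check with a localization score like `iou` against reference boxes, though how the paper weights or defines these terms is not stated in the abstract.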