arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.30051 2026-05-29 cs.CL cs.CY

Who Am I? History-Aware Profiles for Student Simulation in Tutoring Dialogues

Zhangqi Duan, Shuyan Huang, Alexander Scarlatos, Jaewook Lee, Simon Woodhead, Andrew Lan

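AI summary: Proposes the task of history-conditioned student simulation, trains a profile generator and a simulator with reinforcement learning, and uses student history to predict dialogue turns accurately; significantly outperforms baselines on a dataset from a math learning platform.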

Abstract

A key part of developing large language model (LLM)-powered, automated tutoring tools is student simulation, i.e., using LLMs to role-play as students, which can facilitate tutor model evaluation and training. Existing work mostly focuses on within-dialogue simulation, which lacks context on student knowledge and behavior, partly due to not grounding in past student question-answering or dialogue interactions. In this work, we introduce the task of history-conditioned student simulation, where the goal is to accurately predict student dialogue turns by leveraging information in the student's learning history. We propose a two-component framework in which a profile generator summarizes a student's history and a simulator predicts student turns conditioned on the resulting profile. We train both components with reinforcement learning (RL), yielding profiles optimized for faithful student simulation. We evaluate our method and baselines on the first-of-its-kind real-world dataset of student dialogues and question responses that we collect from a math learning platform. Extensive experiments show that our method significantly outperforms baselines, and demonstrate the importance of history, profiles, and RL training.

2605.30049 2026-05-29 cs.AI

Robust and Generalizable Safety Steering for Text-to-Image Diffusion Transformers

Zihao Xue, Yan Wang, Zhen Bi, Long Ma, Zhonglong Zheng, Zeyu Yang, Bingyu Zhu, Longtao Huang, Jie Xiao, Jungang Lou

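AI summary: Proposes SafeDIG, a framework that performs safety steering for Diffusion Transformers through position-aware sparse feature transfer, reducing target-domain and overall unsafe generation rates while preserving source-domain safety and image quality.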

Abstract

Diffusion Transformers have become a powerful backbone for text-to-image generation, but their layered and cross-modal generation process makes safety control fundamentally different from prompt-level filtering or output-level detection. Harmful semantics may be weakly expressed in text representations, progressively bound to visual latents, and finally entangled with rendering dynamics. As a result, safety steering at a fixed layer can be unstable, and a steering mechanism learned from known risks may not transfer reliably to a shifted target risk domain. We propose SafeDIG, a safety steering framework that formulates DiT safety adaptation as position-aware sparse feature transfer. SafeDIG first constructs Sparse Autoencoders over functionally distinct DiT intervention positions and uses robustness-aware pre-training routing to prioritize intervention sites that are expected to remain stable under source-target risk shift. It then separates transferable safety features from domain-specific activation geometry by freezing the SAE encoder as a reusable sparse safety dictionary and adapting only the decoder to the target-domain activation manifold. During inference, SafeDIG combines Blend and Repel operations to steer unsafe activations toward transferred safety manifolds or away from harmful sparse directions. Experiments on FLUX.1 Dev and Stable Diffusion 3.5 Large show that SafeDIG consistently reduces target-domain and overall unsafe generation rates while preserving source-domain safety and image quality.

2605.30046 2026-05-29 cs.LG cs.AI

Masked Diffusion Modeling for Anomaly Detection

Lixing Zhang, Yuchen Liang, Liyan Xie

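AI summary: Proposes MaskDiff-AD, a method based on masked diffusion models that builds anomaly scores from the difficulty of reconstructing randomly masked coordinates, enabling efficient anomaly detection on categorical, mixed-type, and discrete sequence data.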

Abstract

Anomaly detection aims to identify samples that deviate from the nominal data distribution and is central to many safety-critical applications. However, developing effective anomaly detection methods for categorical, mixed-type, and discrete sequence data remains challenging and relatively underexplored. Masked diffusion models provide a natural way to model such data by learning to recover masked values from the remaining visible context. In this paper, we propose Masked Diffusion for Anomaly Detection (MaskDiff-AD), a forward-only method based on masked diffusion models trained only on nominal data. Given a test sample, MaskDiff-AD constructs anomaly scores from the difficulty of reconstructing randomly masked coordinates, yielding a content-sensitive score that operates directly on discrete state spaces while avoiding reverse-time sampling. We also develop a non-parametric variant of MaskDiff-AD and provide theoretical guarantees by characterizing Type-I and Type-II errors under a fixed detection threshold. Experiments on fourteen categorical and mixed-type tabular datasets from ADBench and UADAD, as well as four text anomaly detection datasets from NLP-ADBench, show that MaskDiff-AD achieves competitive performance against classical, diffusion-based, and recent tabular/text anomaly detection baselines. Notably, MaskDiff-AD achieves the best overall average rank, outperforming all twelve tabular baseline methods.
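
The scoring idea in this abstract (mask random coordinates, then measure how hard they are to reconstruct from the visible ones) can be illustrated with a small sketch. This is not the authors' implementation: `PairwisePredictor` is a hypothetical stand-in for a trained masked diffusion model that uses smoothed co-occurrence counts, and the mask ratio, number of masks, and toy data are all assumptions made for illustration.

```python
import math
import random
from collections import Counter, defaultdict


class PairwisePredictor:
    """Toy stand-in (assumption) for a masked denoiser: p(x_j | visible context)."""

    def __init__(self, rows, alpha=1.0):
        self.alpha = alpha
        self.n_cols = len(rows[0])
        # Values seen per column, used for Laplace smoothing.
        self.vocab = [set(r[j] for r in rows) for j in range(self.n_cols)]
        # counts[(j, k, u)][v] = number of nominal rows with x_k == u and x_j == v.
        self.counts = defaultdict(Counter)
        for r in rows:
            for j in range(self.n_cols):
                for k in range(self.n_cols):
                    if j != k:
                        self.counts[(j, k, r[k])][r[j]] += 1

    def prob(self, row, j, visible):
        """Average the smoothed conditionals p(x_j = row[j] | x_k) over visible k."""
        size = len(self.vocab[j]) + 1  # one extra slot for unseen values
        probs = []
        for k in visible:
            c = self.counts.get((j, k, row[k]), Counter())
            probs.append((c[row[j]] + self.alpha) / (sum(c.values()) + self.alpha * size))
        return sum(probs) / len(probs)


def anomaly_score(model, row, mask_ratio=0.5, n_masks=20, seed=0):
    """Mean negative log-likelihood of randomly masked coordinates (forward-only)."""
    rng = random.Random(seed)
    n = model.n_cols
    # Always mask at least one coordinate and keep at least one visible.
    n_masked = max(1, min(n - 1, round(mask_ratio * n)))
    total = 0.0
    for _ in range(n_masks):
        masked = rng.sample(range(n), n_masked)
        visible = [k for k in range(n) if k not in masked]
        total += sum(-math.log(model.prob(row, j, visible)) for j in masked) / n_masked
    return total / n_masks


# Toy nominal data (illustrative only).
nominal = [("red", "apple", "sweet"),
           ("yellow", "banana", "sweet"),
           ("green", "lime", "sour")] * 10
model = PairwisePredictor(nominal)
normal_score = anomaly_score(model, ("red", "apple", "sweet"))
odd_score = anomaly_score(model, ("red", "banana", "sour"))
print(normal_score, odd_score)
```

The row whose values never co-occur in the nominal data is harder to reconstruct and therefore receives the higher score; no reverse-time sampling is involved.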

2605.30045 2026-05-29 cs.CV

GenEraser: Generalizable Video Object Removal via Balanced Text-Mask Guidance and Decoupled Locator-Preserver

Yuqing Chen, Lin Liu, Haisu Wu, Xiaopeng Zhang, Yaowei Wang, Yujiu Yang, Qi Tian

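AI summary: Proposes GenEraser, a framework that combines a Multi-Conditional Mixture-of-Experts, a Learnable Deep CFG Fusion mechanism, and a Decoupled Expert Architecture to address the generalization challenge of removing target objects together with their physical effects in video object removal, improving results by 2.16 dB on ROSE and 1.44 dB on VOR-Eval.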

Abstract

Video object removal frequently struggles to simultaneously eliminate target objects and their associated physical effects (e.g., smoke, reflections, light, and ripples) in out-of-domain scenarios due to complex spatiotemporal ambiguities. While existing methods primarily rely on spatial masks, they often fail to capture weakly correlated effects, and the potential of explicit textual guidance remains underexplored. Furthermore, a fundamental optimization conflict exists in removal models between high-level semantic generalization and precise pixel-level background preservation. To address these challenges, we propose GenEraser, a novel framework for generalized and high-fidelity video object and effect removal. First, we introduce a Multi-Conditional Mixture-of-Experts (MC-MoE) paired with Bipartite Text guidance to fully exploit the multimodal priors of Diffusion Transformers, significantly enhancing the identification of complex effects. Second, a Learnable Deep "CFG" Fusion mechanism (LD-CFG) is developed to adaptively balance the relative dominance of mask and textual conditions across diverse scenarios. Finally, we propose a Decoupled Expert Architecture, comprising a Locator and a Preserver, to mitigate the inherent trade-off between semantic generalization and pixel alignment. Extensive experiments demonstrate that our GenEraser surpasses recent state-of-the-art approaches, achieving significant quantitative improvements (e.g., 2.16 dB and 1.44 dB on the ROSE Benchmark and VOR-Eval, respectively) while maintaining exceptionally robust generalization in open-world scenarios. https://cyqii.github.io/GenEraser.github.io/

2605.30042 2026-05-29 cs.AI

Learning to Choose: An Empowerment-Guided Multi-Agent System with semantic communication for Adaptive Method Selection

Geremy Loachamín-Suntaxi, Robert Lazar, Dimitrios G. Giovanis, Ioannis G. Kevrekidis, Eleni D. Koronaki

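AI summary: Proposes a multi-agent framework that combines contextual bandits, structured inter-agent communication, and semantic checkpoints, preserving action-outcome causal consistency to improve the convergence, robustness, and generalization of adaptive decision-making in scientific computing workflows.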

Abstract

Automating scientific computing workflows requires more than generating executable code: autonomous systems must also select appropriate computational strategies, implement them faithfully, and ensure that the resulting outcomes remain causally attributable to the decisions that produced them. In multi-agent pipelines, this process is particularly fragile, as small inconsistencies between agent intentions and actions can lead to semantic drift, where the eventually executed procedure no longer reflects the originally selected strategy, thereby corrupting downstream evaluation and adaptation. In this work, motivated by the ATHENA framework (Toscano et al., 2025; Toscano et al., 2026) and the concept of empowerment (Yiu et al., 2025), we introduce a multi-agent framework that combines contextual bandits with structured inter-agent communication and, most importantly, semantic checkpoints that preserve action-outcome fidelity throughout the pipeline. The system integrates specialized large language model (LLM) agents, grounded code generation, and self-healing execution loops within an adaptive decision-making architecture. Interpreting the framework through the lens of empowerment, we show that reliable autonomous learning requires not only identifying high-quality actions, but also preserving the integrity of their propagation across agents. Using sensitivity analysis and uncertainty quantification workflows as representative case studies, we demonstrate that unchecked semantic drift degrades policy learning, whereas the proposed framework improves convergence, robustness, and adaptation to novel problem contexts. These results suggest a broader design principle for scientific multi-agent systems: adaptive decision-making must be coupled with explicit mechanisms that guarantee semantic consistency and reliable information flow across the computational pipeline.
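
The coupling of a contextual bandit with a checkpoint that protects action-outcome attribution can be sketched as follows. This is only an illustration under stated assumptions: the epsilon-greedy policy, the label-equality check standing in for a semantic checkpoint, and the names `method_a`, `method_b`, `smooth`, and `rough` are all hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


class EpsilonGreedyBandit:
    """Minimal contextual bandit: one running mean reward per (context, action)."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)
        self.values = defaultdict(float)

    def select(self, context):
        # Explore with probability epsilon, otherwise pick the best-known action.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.values[(context, a)])

    def update(self, context, action, reward):
        # Incremental mean of the observed rewards.
        key = (context, action)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]


def checkpointed_update(bandit, context, selected, executed, reward):
    """Credit the reward only if the executed method matches the selected one.

    A plain equality check is a simplifying assumption; it stands in for
    whatever semantic verification a real pipeline would perform.
    """
    if executed != selected:
        return False  # drift detected: skip the update instead of mis-attributing
    bandit.update(context, selected, reward)
    return True


# Toy environment with deterministic rewards (illustrative values).
true_reward = {("smooth", "method_a"): 1.0, ("smooth", "method_b"): 0.2,
               ("rough", "method_a"): 0.1, ("rough", "method_b"): 0.9}
bandit = EpsilonGreedyBandit(["method_a", "method_b"], epsilon=0.2, seed=1)
for step in range(500):
    context = "smooth" if step % 2 == 0 else "rough"
    chosen = bandit.select(context)
    checkpointed_update(bandit, context, chosen, executed=chosen,
                        reward=true_reward[(context, chosen)])
```

Skipping the update on a mismatch keeps each learned value tied to the action that actually ran, which is the property the abstract calls action-outcome fidelity.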

2605.30040 2026-05-29 cs.CR cs.AI cs.CL

Token Inflation: How Dishonest Providers Can Overcharge for Large Language Model Usage

Shahinul Hoque, Jinghuai Zhang, Jinyuan Sun, Fnu Suya

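AI summary: Shows that in commercial LLM services billed per token, providers can exploit a trust paradox in auditing to systematically over-report token counts, substantially increasing what users pay.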

Abstract

Per-token billing is now the standard pricing model for commercial large language models (LLMs), so the honesty of reported token counts directly affects what users pay. We show that this kind of billing is hard to audit by design: providers hide the model, the tokenizer, and the execution to protect their IP, mitigate jailbreaks, and preserve user privacy, which means an auditor can only inspect proofs the provider supplies. The audit therefore reduces to a consistency check on the provider's own reports. We call this a trust paradox: every audit must trust some artifact, but current frameworks trust exactly the ones a provider has the strongest reason to manipulate. We study three recent token auditing frameworks and show that a provider with ordinary commercial capabilities can systematically inflate billed token counts. In the most permissive setting, hidden reasoning usage can be inflated by 1,469% on average without detection. At current frontier reasoning prices, that turns a $100 honest bill into roughly a $1,569 bill on the same query. Even when the user can see the full reasoning string, tokenization ambiguity alone still allows 50.85% over-reporting below the detection threshold. These results suggest the problem is not in any specific auditor but in any audit whose evidence comes from the audited party. Restoring honest billing will require verification that ties reported token counts to evidence the provider does not control, such as trusted execution attestation, cryptographic proofs of inference, or third-party re-execution.
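
The bill figures quoted in this abstract follow from simple percentage arithmetic. A minimal sketch, assuming the bill scales linearly with the number of billed tokens (the function name `inflated_bill` is illustrative and does not come from the paper):

```python
def inflated_bill(honest_bill: float, inflation_pct: float) -> float:
    """Return the bill after billed tokens are inflated by inflation_pct percent.

    Assumes cost is strictly proportional to billed tokens, which is a
    simplification; real price sheets may differ by token type.
    """
    return honest_bill * (1.0 + inflation_pct / 100.0)


# Percentages quoted in the abstract, applied to a 100-dollar honest bill.
hidden_reasoning = inflated_bill(100.0, 1469.0)   # most permissive setting
visible_reasoning = inflated_bill(100.0, 50.85)   # tokenization ambiguity only

print(f"{hidden_reasoning:.2f}")
print(f"{visible_reasoning:.2f}")
```

Under the same linear assumption, the 50.85% over-reporting case turns the same honest bill into about 150.85 dollars.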

2605.30038 2026-05-29 cs.LG cs.AI cs.CV

Alignment-Guided Score Matching for Text-to-Image Alignment in Diffusion Models

Jaa-Yeon Lee, Yeobin Hong, Taesung Kwon, Jong Chul Ye

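AI summary: Proposes a lightweight, reward-free post-training method that integrates contrastive alignment guidance directly into the score-matching objective of diffusion models, addressing over-penalization and counting errors in text-image alignment.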

Comments
ICML 2026, Project page: https://jaayeon.github.io/AGSM
Abstract

Diffusion models generate highly realistic images but often struggle with precise text-image alignment. While recent post-training methods improve alignment using external rewards or human preference signals, their performance heavily depends on reward quality and does not directly address alignment within the diffusion process itself. Recent reward-free approaches such as SoftREPA demonstrate that optimizing soft text tokens via contrastive learning can effectively improve text-image representation alignment, outperforming standard parameter-efficient fine-tuning baselines. However, the contrastive formulation can excessively penalize negative pairs, which manifests as characteristic failure cases such as over-counting and repetition. To address this issue, we propose a lightweight, reward-free post-training method that refines soft tokens by integrating contrastive alignment guidance directly into the score-matching objective of diffusion models. By assigning alignment directions at the score level, our approach mitigates these limitations and yields more coherent and semantically faithful generations. Experiments show that our method matches SoftREPA while substantially improving its failure cases, achieving over 35% improvement in counting accuracy on the GenEval benchmark. Our method is seamlessly applicable to existing diffusion backbones (SD1.5, SDXL, and SD3), and is complementary to existing RL-based diffusion post-training methods. Project page: https://jaayeon.github.io/AGSM

2605.30036 2026-05-29 cs.AI cs.CL

Teaching Values to Machines: Simulating Human-Like Behavior in LLMs

Asaf Yehudai, Naama Rozen, Ariel Gera

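AI summary: Drawing on psychological value theory, this study runs large-scale experiments (over 5 million questions) to evaluate how well value-prompted LLMs agree with humans in value structure and value-behavior relationships, and shows that incorporating human value distributions improves population-level simulation.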

Comments
GEM Workshop at ACL 2026
Abstract

Large Language Models (LLMs) demonstrate a remarkable capacity to adopt different personas and roles; however, it remains unclear whether they can manifest behavior that adheres to a coherent, human-like value structure. In this work, we draw on established psychological value theory to induce human-like values in LLMs and assess their alignment with patterns observed in human studies. Using validated psychological questionnaires, we conduct large-scale experiments -- over 5 million questions -- to evaluate value structures and value-behavior relationships in leading LLMs and compare them to humans. Our findings reveal strong agreement between value-prompted LLMs and humans across both dimensions. Moreover, incorporating human value distributions enhances population-level simulations with value-induced LLMs. These findings highlight the potential of value-induced LLMs as effective, psychologically grounded tools for simulating human behavior.

2605.30031 2026-05-29 cs.SD cs.AI cs.CL

Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation

Bo-Han Feng, Yu-Hsuan Li Liang, Chien-Feng Liu, You-Hsuan Chang, Yun-Nung Chen

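AI summary: Presents a unified taxonomy and a controlled empirical evaluation of audio jailbreak attacks and defenses in large audio-language models, finding that Acoustic Best-of-N exposes worst-case audio-space vulnerabilities, that Narrative Framing is an effective low-latency semantic threat, and that existing defenses trade robustness against benign usability.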

Comments
Submitted to ACL ARR, May 2026
Abstract

Large Audio Language Models (LALMs) expand jailbreak risks from token-level prompting to the full speech perception-to-reasoning pipeline, where unsafe behavior can be induced through semantics, acoustic style, signal artifacts, or internal representations. Existing work studies these risks under heterogeneous threat models and evaluation protocols, making it difficult to compare attack practicality or defense utility. This paper provides a unified taxonomy and a controlled empirical evaluation of LALM jailbreak attacks and defenses. We organize prior work into semantic, acoustic, signal, and embedding-layer attacks; guard-based, training-free, and training-based defenses; and cross-modal, audio-native, and interactive benchmarks. We then evaluate representative attacks and defenses across ten open-source LALMs, measuring not only attack success rate but also benign refusal and latency. Our results show that Acoustic Best-of-N reveals strong worst-case audio-space vulnerabilities, Narrative Framing is an effective low-latency semantic threat, and current defenses trade robustness against benign usability. These findings support cost- and utility-aware evaluation as a necessary complement to success-rate-only LALM safety benchmarks.

2605.30029 2026-05-29 cs.AI

RAISE: RAG Design as an Architecture Search Problem

Zhen Chen, Yibing Liu, Weihao Xie, Yu Liang, Peilin Chen, Shiqi Wang

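AI summary: Formulates the design choices of retrieval-augmented generation (RAG) systems as an architecture search problem and builds the RAISE framework and benchmark, which evaluates 13 search algorithms on seven datasets under standardized search spaces and budgets, finding that optimization performance is highly task-dependent.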

Abstract

Retrieval-augmented generation (RAG) systems expose numerous design choices spanning query rewriting, chunking, retrieval depth, reranking, and context compression. In practice, these choices are often configured through heuristics, hindering systematic evaluation and reproducibility across settings. We argue that this challenge is best formulated as RAG architecture search. To support controlled and reproducible study of this problem, we introduce the RAG Intelligence Search Engine (RAISE), a comprehensive framework and benchmark for RAG hyperparameter optimization, which evaluates optimization methods for RAG pipelines under standardized search spaces and budgets. RAISE implements 13 search algorithms and evaluates them across seven public text and multimodal datasets using three random seeds. Our experiments show that optimization performance is highly task-dependent: methods that perform strongly on one dataset may not generalize consistently across others, cautioning against interpreting aggregate rankings as evidence of universally superior strategies. RAISE provides a common experimental substrate for fair, reproducible, and systematic research on RAG hyperparameter optimization.
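
Treating RAG configuration as a search problem under a fixed budget and several seeds can be sketched with plain random search. Everything below is an assumption made for illustration: the knob names mirror the design choices listed in the abstract, but the candidate values, the `toy_score` objective, and the budget are hypothetical and are not RAISE's actual search space or API.

```python
import random

# Hypothetical search space; candidate values are illustrative assumptions.
SEARCH_SPACE = {
    "query_rewriting": [False, True],
    "chunk_size": [128, 256, 512],
    "retrieval_depth": [3, 5, 10],
    "reranking": [False, True],
    "context_compression": [False, True],
}


def toy_score(config):
    """Placeholder objective standing in for a real RAG evaluation run."""
    score = 0.5
    score += 0.1 if config["reranking"] else 0.0
    score += 0.05 if config["query_rewriting"] else 0.0
    score -= abs(config["chunk_size"] - 256) / 2560
    score += 0.01 * min(config["retrieval_depth"], 5)
    return score


def random_search(space, evaluate, budget, seed):
    """Evaluate `budget` random configurations and return the best one found."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(budget):
        config = {name: rng.choice(values) for name, values in space.items()}
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Same budget for every seed, mirroring a fixed-budget, multi-seed comparison.
results = [random_search(SEARCH_SPACE, toy_score, budget=20, seed=s) for s in range(3)]
for config, score in results:
    print(round(score, 3), config)
```

Holding the search space and budget fixed while varying only the seed is what makes results from different search algorithms comparable; swapping `random_search` for another optimizer with the same signature would leave the rest of the loop unchanged.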

2605.30027 2026-05-29 cs.CV cs.IR

DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark

Ruofan Hu, Menghui Zhu, Jieming Zhu, Bo Chen, Shengyang Xu, Minjie Hong, Xiaoda Yang, Sashuai Zhou, Li Tang, Tao Jin, Zhou Zhao

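AI summary: Proposes DocRetriever, a plug-and-play framework that uses layout-aware sparse embeddings and a reasoning-augmented reranker to address obscured semantics and generalization bottlenecks in multimodal document retrieval, and builds the MultiDocR benchmark for more rigorous evaluation.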

Comments
Accepted at KDD 2026 Research Track
Abstract

Multimodal documents contain diverse elements, such as tables, figures, and layouts, which can complicate retrieval tasks. While current approaches typically combine dense visual embedding models with supervised rerankers to achieve high-precision retrieval, they face inherent limitations. First, the coarse-grained nature of dense embeddings tends to obfuscate explicit semantics, failing to leverage structurally salient information. Second, supervised reranking models suffer from generalization bottlenecks, as their performance heavily relies on domain-specific training data. Furthermore, existing benchmarks often lack diverse assessment dimensions and comprehensive relevance annotations, limiting reliable evaluation. To address these challenges, we propose DocRetriever, a plug-and-play framework. It enhances visual retrieval via a layout-aware sparse embedding technique, enabling effective hybrid encoding without the overhead of optical character recognition (OCR). We also introduce a generalizable reranker that leverages reasoning-augmented demonstrations and optimized sampling to improve accuracy in few-shot settings. Finally, we construct a new benchmark, MultiDocR, to enable more rigorous evaluation. Experiments across diverse benchmarks validate DocRetriever's superiority over state-of-the-art methods.

2605.30022 2026-05-29 cs.CL cs.AI

Give it Space! Explicit Disentangling of Positional and Semantic Representations in Encoders

Pierre-Antoine Lequeu, Camille Barboule, Benjamin Piwowarski

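AI summary: Studies the mechanism of positional encoding in Transformers by separating positional and semantic signals into three independent streams, finding that the disentangled approach preserves macroscopic structure and improves linguistic representation.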

Comments
8 pages + 10 pages of bibliography and appendix
Abstract

Positional encoding (PE) underpins how permutation-invariant Transformers represent sequence order, yet how positional information is processed and stored remains poorly understood. Modern PE methods such as RoPE still struggle on tasks such as long-context understanding or retrieval (Chen et al., 2025). Hence, a better understanding of the internal positional mechanism could help design better PE. Building on evidence that positional and semantic signals occupy nearly orthogonal subspaces in trained Transformers, we modify an encoder Transformer to process three explicitly disentangled streams: semantic, absolute positional (AP), and relative positional (RP), and confine the masked-language-modeling (MLM) objective to the semantic stream. This decoupling enables a clean mechanistic study and yields three take-aways. (1) The isolated AP subspace spontaneously collapses into a low-frequency two-dimensional manifold that captures the structure of the document; (2) Attention heads specialize into structure-oriented and semantic-oriented groups, with RP exclusively supporting the latter; (3) Standard positional encodings do not robustly retain macroscopic structure: RoPE and RP only weakly encode it, and entangled AP loses it in the final layers under MLM pressure. The disentangled approach preserves positional encoding, which improves linguistic representation on 49 of the 65 linguistic phenomena of the Flash-Holmes probing benchmark.

2605.30015 2026-05-29 cs.LG cs.AI

Test Time Training for Supervised Causal Learning

Zizhen Deng, Jiaru Zhang, Rui Ding, Huang Bojun, Jinzhuo Wang, Qiang Fu, Shi Han, Dongmei Zhang

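AI summary: To address the weak out-of-distribution generalization of supervised causal learning, proposes TTT-SCL, a test-time training framework that dynamically generates training sets aligned with each test instance, significantly improving causal discovery performance.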

Abstract

Supervised Causal Learning (SCL) has shown promise in causal discovery by framing it as a supervised learning problem. However, it suffers from significant out-of-distribution generalization challenges. We reveal three limitations of previous SCL practices: a significant performance gap between synthetic benchmarks and real-world data, fragility to distribution shifts, and failure in compositional generalization, collectively questioning its real-world applicability. To address this, we propose Test-Time Training for Supervised Causal Learning (TTT-SCL), a novel framework that dynamically generates training sets explicitly aligned with any specific test instance. We demonstrate the correlation between TTT-SCL and score-based methods, and design an efficient module for generating training sets based on the classic scoring function. Experiments on synthetic benchmarks, pseudo-real and real-world datasets demonstrate that TTT-SCL significantly outperforms existing SCL and traditional causal discovery methods.

2605.30014 2026-05-29 cs.AI

From GPS Points to Travel Patterns: Flexible and Semantic Trajectory Generation with LLMs

Silin Zhou, Chenhao Wang, Yuntao Wen, Shuo Shang, Lisi Chen, Panos Kalnis

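AI summary: Proposes HTP, which hierarchically generates travel patterns and then GPS points, using an LLM and an RQ-VAE for flexible, semantically rich trajectory generation, improving generation quality by 29.78% on average over the strongest baseline.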

Comments
Accepted at KDD 2026 (second round)
Abstract

Urban trajectories play a crucial role in modeling urban dynamics and supporting various smart city applications. However, privacy concerns restrict access to large-scale and high-quality trajectory datasets. Trajectory generation provides a promising alternative by synthesizing realistic data to mitigate privacy risks. However, existing methods fail to explicitly capture travel patterns and can only generate fixed-length trajectories under a single condition. To address these limitations, we propose HTP, which Hierarchically generates Travel patterns first and then generates GPS Points by using large language models (LLMs), rather than directly generating GPS points. We first design a trajectory-specific residual quantization variational autoencoder (RQ-VAE) that quantizes micro-level GPS trajectories into compact, macro-level travel pattern tokens in a coarse-to-fine manner. These tokens capture rich segment spatial irregularities, such as point density variations caused by traffic conditions. Then, we extend the LLM vocabulary with travel pattern tokens to align trajectory representations with the LLM input, and apply supervised fine-tuning (SFT) to align the LLM with the trajectory generation task, enabling generation of travel pattern sequences under various conditions. Extensive experiments on two real-world datasets show that HTP outperforms the strongest baseline by an average of 29.78% in terms of generation quality. Our code is available at https://github.com/slzhou-xy/HTP.

2605.30011 2026-05-29 cs.CV cs.AI

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Mingjian Gao, Wenqiao Zhang, Yuqian Yuan, Yang Dai, Binhe Yu, Zheqi Lv, Haoyu Zheng, Jiaqi Zhu, Zhiqi Ge, Zixuan Wan, Siliang Tang, Yueting Zhuang

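AI summary: Proposes VisualThink-VLA, a framework that uses visual intermediate reasoning and a selective routing mechanism to cut inference latency from several seconds to under a second while maintaining high accuracy.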

Abstract

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, while autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VISUALTHINK-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our bootstrapping philosophy is to guide action with effective visual thinking: VISUALTHINK-VLA bootstraps action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. In addition, to further improve performance and efficiency, VISUALTHINK-VLA adopts a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on a VisualEvidence-Agent that constructs a VisualEvidence-Set of 754.7k VLA instructions for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VISUALTHINK-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces step latency from 8.377 s with ECoT to 0.367 s, achieving a 22.8x speedup.
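
The speedup quoted for BridgeData V2 is the ratio of the two step latencies given in this abstract. A minimal check (the helper name `speedup` is illustrative):

```python
def speedup(baseline_s: float, new_s: float) -> float:
    """Ratio of baseline latency to new latency, both in seconds."""
    return baseline_s / new_s


# Step latencies quoted in the abstract: ECoT baseline vs. the proposed policy.
ratio = speedup(8.377, 0.367)
print(f"{ratio:.1f}x")
```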

2605.30010 2026-05-29 cs.CV

EarlyTom: Early Token Compression Completes Fast Video Understanding

EarlyTom: 早期令牌压缩实现快速视频理解

Hesong Wang, Xin Jin, Lu Lu, Chenhaowen Li, Jian Chen, Qiang Liu, Huan Wang

AI总结 针对视频大语言模型中视觉编码阶段效率低下的问题,提出EarlyTom无训练令牌压缩框架,通过在视觉编码器内部进行早期压缩,显著降低首令牌延迟并提升吞吐量。

Comments
Accepted by CVPR 2026. 16 pages, 8 figures, 8 tables. Project page: https://viridisgreen.github.io/EarlyTom
AI中文摘要

视频大语言模型(Video-LLMs)在视频理解任务中展现了强大的能力。然而,处理大量视觉令牌带来的低效率仍然阻碍了它们的实际部署。尽管近期的方法在保持与全令牌基线相当准确性的同时实现了极低的令牌保留率,但大多数方法仅在预填充的后期阶段进行压缩,视觉编码器的效率未得到优化。在本文中,我们首先表明视觉编码对首令牌时间(TTFT)贡献很大。因此,与仅在视觉编码器之后压缩视觉令牌不同,在编码器内部进行压缩仍有很大的探索空间。基于这一见解,我们提出了EarlyTom,一种无训练的令牌压缩框架,在视觉编码器内部执行早期视觉令牌压缩,从而显著降低TTFT并提高吞吐量。此外,我们引入了一种解耦的空间令牌选择策略,提高了整体压缩效果。在单个NVIDIA A100 GPU上,对于LLaVA-OneVision-7B模型,EarlyTom将TTFT降低高达2.65倍,FLOPs降低高达61%,同时保持与全令牌基线相当的准确性。这些改进显著增强了Video-LLMs在实际生产场景中部署的实用性。

英文摘要

Video large language models (Video-LLMs) have demonstrated strong capabilities in video understanding tasks. However, their practical deployment is still hindered by the inefficiency introduced by processing massive amounts of visual tokens. Although recent approaches achieve extremely low token retention ratios while maintaining accuracy comparable to full-token baselines, most of them perform compression only at the late stage of prefilling, leaving the efficiency of the vision encoder unoptimized. In this paper, we first show that vision encoding contributes a large portion to the time-to-first-token (TTFT). Therefore, instead of compressing visual tokens only after the vision encoder, performing compression inside the encoder still leaves substantial room for exploration. Based on this insight, we propose EarlyTom, a training-free token compression framework that performs early-stage visual token compression inside the vision encoder, enabling significantly better TTFT reduction and higher throughput. In addition, we introduce a decoupled spatial token selection strategy that improves the overall compression effectiveness. EarlyTom reduces TTFT by up to 2.65x and FLOPs by up to 61% on a single NVIDIA A100 GPU for the LLaVA-OneVision-7B model, while maintaining accuracy comparable to the full-token baseline. These improvements substantially enhance the practicality of deploying Video-LLMs in real-world production scenarios.

2605.30003 2026-05-29 cs.MA cs.AI cs.LG

Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas

发现合作管线:面向序列社会困境的自动研究

Víctor Gallego

AI总结 本文提出一种双层自动研究框架,其中外层AI智能体自动重新设计内层LLM策略合成管线,以解决多智能体序列社会困境,实验表明该方法在多个游戏和福利目标下优于手工基线。

Comments
Accepted to the AI Agents for Discovery in the Wild (AID-Wild) Workshop at ACM CAIS 2026
AI中文摘要

我们研究了两层自动研究合作问题:外层AI智能体自主重新设计用于多智能体序列社会困境(SSD)的LLM策略合成系统的内层管线。研究者智能体$\mathcal{R}$(作为编码智能体运行)读取内层源代码,编辑系统提示、反馈函数、辅助库和迭代逻辑,运行评估,并决定保留什么,遵循自动研究范式。在两个游戏(Cleanup和Gathering)、两个策略合成器LLM和两个福利目标(功利主义效率和Rawlsian最大最小原则)下,研究者可靠地超越了手工设计的基线,显著缩小了运行间方差,并优于仅提示优化。发现的管线依赖于目标:只有在最大最小原则下,研究者才会向合成器管线注入显式的公平机制,而这类机制在其自身目标无关的系统提示和每个效率优化的管线中都不存在。这支持了一种信息设计解读,即研究者根据福利目标选择向有限理性的合成器揭示什么。代码见https://github.com/vicgalle/autoresearch-social-dilemmas。

英文摘要

We study two-level autoresearch for cooperation: an outer-loop AI agent autonomously redesigns the inner-loop pipeline of an LLM policy-synthesis system for multi-agent Sequential Social Dilemmas (SSDs). A researcher agent $\mathcal{R}$ (run as a coding agent) reads the inner-loop source code, edits system prompts, feedback functions, helper libraries, and iteration logic, runs evaluations, and decides what to keep, following the autoresearch paradigm. Across two games (Cleanup and Gathering), two policy-synthesizer LLMs, and two welfare objectives (utilitarian efficiency and Rawlsian maximin), the researcher reliably exceeds hand-designed baselines, sharply tightens run-to-run variance, and outperforms prompt-only optimization. The discovered pipelines are objective-dependent: only under maximin does the researcher inject an explicit fairness mechanism into synthesizer pipelines, a class of mechanism that is absent from its own objective-agnostic system prompt and from every efficiency-optimized pipeline. This supports an information-design reading in which the researcher chooses what to reveal to the boundedly rational synthesizer as a function of the welfare objective. Code at https://github.com/vicgalle/autoresearch-social-dilemmas.
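The two welfare objectives named in the abstract can rank the same pair of pipelines differently, which is why the discovered pipelines are objective-dependent. The sketch below assumes utilitarian efficiency is the mean per-agent return and Rawlsian maximin is the minimum per-agent return. The paper's exact definitions may differ, and the return values are made-up illustrative numbers.

```python
def utilitarian(returns):
    """Assumed utilitarian objective: mean return over agents."""
    return sum(returns) / len(returns)

def maximin(returns):
    """Assumed Rawlsian objective: return of the worst-off agent."""
    return min(returns)

# Hypothetical per-agent returns under two candidate pipelines.
pipeline_a = [10.0, 10.0, 1.0]  # high total, one agent left behind
pipeline_b = [6.0, 6.0, 6.0]    # lower total, evenly shared

best_utilitarian = max(("A", pipeline_a), ("B", pipeline_b),
                       key=lambda item: utilitarian(item[1]))[0]
best_maximin = max(("A", pipeline_a), ("B", pipeline_b),
                   key=lambda item: maximin(item[1]))[0]
print(best_utilitarian, best_maximin)
```

Under these assumptions the utilitarian objective prefers pipeline A while maximin prefers pipeline B, so only the maximin objective rewards an explicit fairness mechanism.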

2605.30002 2026-05-29 cs.AI

KairosAgent: Agentic Time Series Forecasting with Fused Semantic Reasoning

KairosAgent:融合语义推理的智能体时间序列预测

Kun Feng, Ziwei Shan, Yuchen Fang, Yiyang Tan, Sihan Lu, Shuqi Gu, Lintao Ma, Xingyu Lu, Kan Ren

AI总结 提出KairosAgent框架,通过结合基于LLM的推理器和基于TSFM的预测器,并引入强化学习范式,实现跨模态时间序列的零样本预测。

AI中文摘要

跨领域多模态时间序列预测是一项具有挑战性的任务,要求模型整合精确的数值理解、跨领域语义理解和有效的多模态融合。现有方法要么从头构建时间序列基础模型(TSFM),要么利用预训练的大语言模型(LLM)。然而,TSFM通常忽略语义理解且缺乏面向未来的语义推理能力,而LLM在数值理解和准确的定量预测方面存在困难。为克服这些限制,我们提出KairosAgent,一种用于多模态时间序列预测的新型智能体框架,包括基于LLM的推理器和基于TSFM的预测器。KairosAgent通过动态调用分析工具来增强LLM的数值理解和语义推理能力,从而统一文本推理和数值预测。推理结果随后融合到TSFM流程中,实现更准确可靠的未来预测。为进一步改进推理,我们整理了一个大规模高质量轨迹语料库,并引入了一种基于预测的强化学习范式,包含多轮细化和轮次级别信用分配。实验表明,KairosAgent在最大化预训练LLM和TSFM效用的同时,实现了卓越的零样本预测性能,为高效且可解释的时间序列智能体提供了有前景的方向。项目页面位于https://foundation-model-research.github.io/KairosAgent。

英文摘要

Cross-domain multimodal time series forecasting is a challenging task, requiring models to integrate precise numerical comprehension, cross-domain semantic understanding, and effective multimodal fusion. Existing approaches either build Time Series Foundation Models (TSFMs) from scratch or leverage pretrained Large Language Models (LLMs). However, TSFMs often overlook semantic understanding and lack the ability to perform future-oriented semantic reasoning, and LLMs struggle with numerical comprehension and accurate quantitative forecasting. To overcome these limitations, we propose KairosAgent, a novel agentic framework for multimodal time series forecasting, including an LLM-based reasoner and a TSFM-based forecaster. KairosAgent unifies textual reasoning and numerical forecasting by dynamically invoking analytical tools to enhance the numerical understanding and semantic reasoning capabilities of LLMs. The reasoning results are subsequently fused into the TSFM pipeline, enabling more accurate and reliable future predictions. To further improve the reasoning, we curate a large-scale corpus of high-quality trajectories, alongside a reinforcement learning from forecasting paradigm with multi-turn refinement and turn-level credit assignment. Experiments demonstrate that KairosAgent achieves superior zero-shot forecasting performance while maximizing the utility of pretrained LLMs and TSFMs, presenting a promising direction for efficient and interpretable time series agents. The project page is at https://foundation-model-research.github.io/KairosAgent.

2605.29997 2026-05-29 cs.CV

FRUC: Feedforward Dynamic Scene Reconstruction from Uncalibrated Collaborative Driving Views

FRUC:来自未标定协作驾驶视图的前馈动态场景重建

Yihang Tao, Yu Guo, Zhengru Fang, Haonan An, Yuguang Fang

AI总结 提出FRUC框架,基于前馈3D高斯泼溅和视觉几何Transformer,从未标定的多车协作视图实现动态场景的一次性、免标定重建,通过自中心因果遮挡场和零初始化残差去噪实现非破坏性几何补充。

AI中文摘要

我们提出了FRUC,一个用于从未标定协作驾驶视图进行动态场景重建的前馈3D高斯泼溅框架。现有的多智能体重建框架常常受到严格先决条件的阻碍,需要精确的空间标定和缓慢的逐场景优化。在本文中,我们通过将分布式多车辆网络概念化为一个时空非结构化的自中心多相机系统来重新思考这一任务,其核心挑战在于在不降低自中心准确观测到的可见几何的情况下,通过协作增强自中心遮挡几何,同时保持重建效率。为了实现高效重建,FRUC基于视觉几何Transformer骨干网络,支持从灵活数量的多车辆视图进行一次性、免标定推理。为了在未标定的跨智能体错位下实现非破坏性几何补充,FRUC首先引入了一个自中心因果遮挡场,通过建模智能体时空相关性,将遮挡演化显式推导为潜在先验。在这些遮挡先验的指导下,它进一步将跨智能体集成公式化为一个通过零初始化注入的确定性残差去噪过程,将具有挑战性的跨智能体融合转化为有界残差学习,以实现鲁棒的协作盲点补全。通过在真实世界V2XReal和UrbanIng-V2X数据集上的广泛评估,FRUC被证明是动态协作驾驶环境场景重建的新最先进方法,在渲染质量和效率上均显著优于现有方法。

英文摘要

We present FRUC, a feed-forward 3D Gaussian splatting framework for dynamic scene reconstruction from uncalibrated collaborative driving views. Existing multi-agent reconstruction frameworks are often hindered by rigid prerequisites, demanding precise spatial calibration and slow per-scene optimization. In this paper, we rethink this task by conceptualizing a distributed multi-vehicle network as a spatio-temporally unstructured ego-centric multi-camera system, where the core challenge lies in enhancing ego-centric occluded geometry through collaboration without degrading the ego's accurately observed visible geometry, while preserving reconstruction efficiency. For efficient reconstruction, FRUC is built upon a visual grounded geometric Transformer backbone to enable one-shot, calibration-free inference from a flexible number of multi-vehicle views. To achieve non-destructive geometric supplementation under uncalibrated cross-agent misalignment, FRUC first introduces an ego-centric causal occlusion field that explicitly derives occlusion evolution as latent priors by modeling agent-wise spatio-temporal correlations. Guided by these occlusion priors, it further formulates cross-agent integration as a deterministic residual denoising process via zero-initialized injection, turning challenging cross-agent fusion into bounded residual learning for robust collaborative blind-spot completion. Through extensive evaluations on the real-world V2XReal and UrbanIng-V2X datasets, FRUC is shown to be a new state-of-the-art for the scene reconstruction of dynamic collaborative driving environments, significantly outperforming existing methods in both rendering quality and efficiency.

2605.29992 2026-05-29 cs.CL

Adapting Multilingual Embedding Models to Turkish via Cross-Lingual Tokenizer Surgery and Offline Distillation

通过跨语言分词器手术和离线蒸馏使多语言嵌入模型适应土耳其语

M. Ali Bayram, Banu Diri, Savaş Yıldırım

AI总结 提出一种高效的三阶段适应流程,通过跨语言分词器优化、教师模型克隆和离线蒸馏,构建了土耳其语句子嵌入模型embeddingmagibu-200m,在STSbTR上超越教师模型,并在TR-MTEB上以更少参数达到竞争性能。

Comments
14 pages, 2 figures, 4 tables, Appendix included
AI中文摘要

句子嵌入是语义搜索、聚类、分类和检索增强生成的基础组件。本文提出了embeddingmagibu-200m,一个专注于土耳其语的句子嵌入模型,生成768维L2归一化向量,支持8192个token的上下文窗口,远超早期基于BERT的土耳其语编码器的512 token限制。无需完整预训练,引入了一个高效的三阶段适应流程:(1) 通过从教师词汇表中修剪冗余token,并基于40语言语料库的频率分析纳入多语言token,构建一个词汇量为131,072的土耳其语优化多语言分词器;(2) 克隆教师嵌入模型,同时保留transformer骨干权重,并通过均值组合token映射为新的词汇表初始化兼容的嵌入表;(3) 使用余弦相似度目标,在平衡的40语言维基百科语料库上,从预计算的教师向量进行离线嵌入蒸馏。得到的student模型约有2亿参数,在单个GPU上训练约四小时,通过避免训练期间的在线教师推理,总成本为5-20美元。实验表明,在STSbTR上,Pearson/Spearman相关系数达到77.55%/77.45%,超过了3亿参数的教师模型(73.84%/72.92%)。在TR-MTEB(26个任务)上,平均得分为63.9%(在26个模型中排名第7),提供了有竞争力的成本-质量权衡,参数比教师少33%。为促进可复现性和下游使用,所有工件均已发布,包括模型权重、分词器文件、预计算嵌入数据集以及开源克隆和蒸馏工具。

英文摘要

Sentence embeddings are a foundational component for semantic search, clustering, classification, and retrieval-augmented generation. This paper presents embeddingmagibu-200m, a Turkish-focused sentence embedding model that produces 768-dimensional L2-normalized vectors and supports an 8,192-token context window, far exceeding the 512-token limit of earlier BERT-based Turkish encoders. Instead of full pretraining, an efficient three-stage adaptation pipeline is introduced: (1) construct a Turkish-optimized multilingual tokenizer with a 131,072-token vocabulary by pruning redundant tokens from the teacher's vocabulary and incorporating multilingual tokens via frequency analysis on a 40-language corpus, (2) clone a teacher embedding model while preserving transformer backbone weights and initializing a compatible embedding table for the new vocabulary via mean-composition token mapping, and (3) perform offline embedding distillation from precomputed teacher vectors using a cosine similarity objective over a balanced 40-language Wikipedia corpus. The resulting student model contains approximately 200M parameters and trains in roughly four hours on a single GPU by avoiding online teacher inference during training, at a total cost of $5-$20. Empirically, Pearson/Spearman correlations of 77.55%/77.45% are obtained on STSbTR, surpassing the 300M-parameter teacher model (73.84%/72.92%). On TR-MTEB (26 tasks), a mean score of 63.9% is achieved (7th out of 26 models), providing a competitive cost-quality trade-off with 33% fewer parameters than the teacher. To facilitate reproducibility and downstream use, all artifacts are released including model weights, tokenizer files, precomputed embedding datasets, and open-source cloning and distillation tooling.
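Stages (2) and (3) of the pipeline can be illustrated with a toy example. It assumes that mean-composition means averaging the teacher embeddings of the pieces the teacher tokenizer produces for a new token, and that the distillation objective is `1 - cosine similarity`. The paper may define both differently. The vocabulary, vectors, and helper names are hypothetical.

```python
import math

def mean_compose(new_token, teacher_tokenize, teacher_embeddings):
    """Initialize a new token's vector as the mean of its teacher sub-token vectors."""
    pieces = teacher_tokenize(new_token)
    vectors = [teacher_embeddings[piece] for piece in pieces]
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

def cosine_loss(student_vec, teacher_vec):
    """Assumed distillation loss: 1 - cos(student, teacher)."""
    dot = sum(s * t for s, t in zip(student_vec, teacher_vec))
    norm_s = math.sqrt(sum(s * s for s in student_vec))
    norm_t = math.sqrt(sum(t * t for t in teacher_vec))
    return 1.0 - dot / (norm_s * norm_t)

# Hypothetical 2-dimensional teacher embedding table and tokenizer.
teacher_embeddings = {"kitap": [1.0, 0.0], "lar": [0.0, 1.0]}

def teacher_tokenize(token):
    # Toy rule: the teacher splits "kitaplar" into two known pieces.
    return ["kitap", "lar"] if token == "kitaplar" else [token]

init_vec = mean_compose("kitaplar", teacher_tokenize, teacher_embeddings)
print(init_vec)                              # mean of the two piece vectors
print(cosine_loss([1.0, 0.0], [1.0, 0.0]))   # identical direction
print(cosine_loss([1.0, 0.0], [0.0, 1.0]))   # orthogonal direction
```

Because the teacher vectors are precomputed, a training loop built on this loss only needs to read stored targets, which is the property the abstract credits for the low training cost.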

2605.29986 2026-05-29 cs.AI

Accelerating Constrained Decoding with Token Space Compression

加速受限解码:通过词元空间压缩

Michael Sullivan, Alexander Koller

AI总结 提出CFGzip离线压缩词元搜索空间,大幅降低上下文无关文法约束解码的开销,实现高达两个数量级的延迟减少和7.5倍的总生成速度提升。

Comments
13 pages; 5 figures; under review at EMNLP 2026
AI中文摘要

为了保证LLM的输出符合指定结构,上下文无关文法(CFG)解码引擎强制选择能够产生符合给定CFG的字符串的下一个词元。虽然当前的CFG受限解码引擎已经高度优化,但由于每一步搜索空间(即整个词元词汇表)巨大,导致对于更复杂的CFG会产生难以承受的高开销——而这正是CFG引擎最有用的情况。在本文中,我们引入了CFGzip,一种离线压缩词元搜索空间的技术,它大幅减少了CFG引擎的开销。实验中,我们报告了当CFGzip与最先进的语法引擎一起使用时,延迟减少高达两个数量级,在总受限生成时间上实现了高达7.5倍的加速:借助CFGzip,受限解码现在可以大规模应用于复杂CFG。

英文摘要

To guarantee that an LLM's outputs conform to a specified structure, context-free grammar (CFG) decoding engines force the selection of next tokens that produce strings that conform to a given CFG. While current CFG-constrained decoding engines are highly optimized, the inherent costs arising from the massive per-step search space -- i.e. the entire token vocabulary -- result in intractably high overhead for more complex CFGs: precisely the situation where CFG engines are most useful. In this paper, we introduce CFGzip, an offline technique for compressing the token search space, which massively reduces CFG engine overhead. In experiments, we report latency reduction of up to two orders of magnitude when CFGzip is used with a SoTA grammar engine, yielding an up to 7.5x speedup in total constrained generation time: with CFGzip, constrained decoding is now feasible at scale for complex CFGs.
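To make the per-step cost concrete, here is a minimal sketch of grammar-constrained greedy decoding over a toy vocabulary, with balanced parentheses standing in for the CFG. It shows only the baseline behavior the abstract describes, namely scanning the whole vocabulary at every step. It does not implement CFGzip, whose compression scheme is not specified in the abstract. The vocabulary and scores are hypothetical.

```python
EOS = "<eos>"
VOCAB = ["(", ")", "((", "))", "()", "a", EOS]

def depth_after(prefix):
    """Nesting depth after reading prefix, or None if it cannot be extended to a balanced string."""
    depth = 0
    for ch in prefix:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return None
        else:
            return None  # character outside the grammar's alphabet
    return depth

def allowed_tokens(prefix):
    """Scan the full vocabulary and keep tokens that preserve a viable prefix."""
    allowed = []
    for token in VOCAB:  # this full scan is the per-step overhead
        if token == EOS:
            if prefix and depth_after(prefix) == 0:
                allowed.append(token)
        elif depth_after(prefix + token) is not None:
            allowed.append(token)
    return allowed

def constrained_greedy(scores, max_steps=8):
    """Pick the highest-scoring allowed token at each step until EOS."""
    prefix = ""
    for _ in range(max_steps):
        token = max(allowed_tokens(prefix), key=lambda t: scores[t])
        if token == EOS:
            break
        prefix += token
    return prefix

# Hypothetical fixed model scores; unconstrained greedy would pick ")".
scores = {")": 5.0, "a": 4.0, EOS: 3.0, "()": 2.0, "(": 1.0, "((": 0.0, "))": 0.0}
output = constrained_greedy(scores)
print(output)
```

With a real tokenizer the vocabulary has on the order of a hundred thousand entries, so the loop in `allowed_tokens` dominates once the grammar check itself becomes expensive.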

2605.29983 2026-05-29 cs.LG cs.CV

Improving Adversarial Robustness of Attribution via Implicit Regularization

通过隐式正则化提高归因的对抗鲁棒性

Amir Mehrpanah, Matteo Gamba, Hossein Azizpour

AI总结 本文发现标准随机梯度下降的学习动态可以隐式地提高归因的对抗鲁棒性,并证明在softmax归一化下注意力归因的鲁棒性提升受限,而基于核的注意力可恢复鲁棒性。

Comments
39 pages, 22 figures, to be published in International Conference on Machine Learning 2026
AI中文摘要

归因的对抗鲁棒性是深度学习中可靠可解释性的基本要求,但现有方法通常依赖计算昂贵的显式正则化。在这项工作中,我们表明归因鲁棒性可以从标准随机梯度下降的学习动态中隐式产生。我们通过参数空间和输入空间曲率之间的联系从理论上论证了这种效应,并在各种架构、数据集和归因方法上进行了验证,计算开销可忽略不计。相反,我们证明由于固有的熵约束,这种鲁棒性提升通常不会转移到softmax归一化下的注意力归因,并通过实验验证了这一局限性。最后,我们表明用基于核的注意力替换softmax注意力可以恢复Transformer模型中的鲁棒性提升。我们的结果突出了学习动态作为鲁棒可解释性的一种原则性且实用的机制,并揭示了归一化下注意力归因的基本局限性。

英文摘要

The adversarial robustness of attributions is a fundamental requirement for reliable explainability in deep learning, yet existing approaches typically rely on computationally expensive explicit regularization. In this work, we show that attribution robustness can arise implicitly from the learning dynamics of standard stochastic gradient descent. We theoretically motivate this effect through connections between parameter-space and input-space curvature, and validate it across architectures, datasets, and attribution methods, with negligible computational overhead. In contrast, we prove that such robustness gains often do not transfer to attention-based attribution under softmax normalization, due to inherent entropy constraints, and we validate this limitation experimentally. Finally, we show that replacing softmax attention with kernel-based attention restores the robustness gains in transformer models. Our results highlight learning dynamics as a principled and practical mechanism for robust explainability, and reveal fundamental limitations of attention-based attribution under normalization.

2605.29980 2026-05-29 cs.CV cs.AI cs.LG

Genetically Aligned Patient Representations Improve Hematological Diagnosis

基因对齐的患者表示改善血液学诊断

Muhammed Furkan Dasdelen, Fatih Ozlugedik, Ilaria Looser, Rao Muhammad Umer, Christian Pohlkamp, Carsten Marr

AI总结 提出一种两阶段框架,通过自监督视觉预训练和监督对比学习对齐白细胞图像与染色体畸变及体细胞突变,提升血液学诊断性能。

Comments
Accepted for publication at the 29th International Conference on Medical Image Computing and Computer Assisted Intervention - MICCAI 2026
AI中文摘要

组织病理学编码器与转录组和基因组数据的多模态对齐已被证明能显著提高下游诊断任务的性能。血液学细胞学的独特之处在于,视觉单细胞评估通常与细胞遗传学和分子遗传学相结合用于血癌诊断。在本研究中,我们提出了一个框架,将单个白细胞图像与染色体畸变(核型)以及来自靶向基因面板的体细胞突变对齐。我们的训练策略采用两阶段方法:(i)在超过1500名患者的队列上,使用iBOT头进行自监督、仅视觉的Transformer聚合器预训练;(ii)通过急性髓系白血病患者的监督对比损失进行基因对齐。我们的基因对齐患者编码器改善了血液学诊断任务,优于切片级组织病理学基础模型。此外,该模型为疾病和遗传改变提供了即用型检索能力。将遗传数据纳入患者编码器提高了患者表示的质量,提供了一个与临床诊断工作流程对齐的框架,并为未来的多模态血液学特定AI铺平了道路。代码和模型权重可在https://github.com/marrlab/GenBloom获取。

英文摘要

Multimodal alignment of histopathology encoders with transcriptomic and genomic data has been shown to significantly improve performance in downstream diagnostic tasks. Hematological cytology is unique in that visual single-cell evaluation is often paired with cytogenetics and molecular genetics for blood cancer diagnosis. In this study, we present a framework to align single white blood cell images with chromosomal aberrations (karyotype) and somatic mutations from targeted gene panels. Our training strategy follows a two-stage approach: (i) self-supervised, vision-only pretraining of a transformer aggregator using an iBOT head on a cohort of over 1500 patients, and (ii) genetic alignment via supervised contrastive loss on acute myeloid leukemia patients. Our genetically aligned patient encoder improves hematological diagnostic tasks, outperforming slide-level histopathology foundation models. Additionally, the model provides off-the-shelf retrieval capabilities for diseases and genetic alterations. Incorporating genetic data into patient encoders increases the quality of patient representations, providing a framework that aligns with clinical diagnostic workflows and paves the way for future multimodal hematology-specific AI. The code and model weights are available at https://github.com/marrlab/GenBloom.

2605.29979 2026-05-29 cs.CR cs.LG

Fingerprinting Inference Systems of Large Language Models

大型语言模型的推理系统指纹识别

Anna Wimbauer, Jonas Möller, Erik Imgrund, Konrad Rieck

AI总结 本文提出一种通过分析LLM的提示-响应行为来识别推理系统组件(如推理引擎、注意力后端和硬件平台)的指纹方法,并论证了防御该指纹识别的根本困难性。

AI中文摘要

LLM的行为不仅仅取决于模型本身。推理系统的组件,如推理引擎、注意力后端和硬件平台,微妙地影响输入的处理方式。这些组件在实现上存在差异,因此在运行相同模型时,不同系统之间会产生微小的数值偏差。虽然先前的工作已经建立了这种偏差的理论存在性,但其安全影响尚未被探索。在本文中,我们表明这些偏差是特定组件的特征,并传播到可观察的文本输出中,从而将推理系统暴露给任何能够查询模型的方。基于这一观察,我们引入了一种指纹识别方法,通过分析LLM的提示-响应行为来识别推理系统的组件。我们的实证评估表明,即使在LLM以非零温度运行时,推理引擎、注意力后端和底层硬件平台也能被可靠地识别。我们证明,防止指纹识别从根本上来说是困难的,因为它需要消除硬件和软件堆栈之间的数值差异。因此,我们提出了部分缓解措施并讨论了它们的影响。

英文摘要

The behavior of LLMs does not depend solely on the model itself. Components of the inference system, such as the inference engine, attention backend, and hardware platform, subtly influence how inputs are processed. These components differ in their implementations and thereby induce small numerical deviations across systems when running the same model. While prior work has established the theoretical existence of such deviations, their security implications have remained unexplored. In this paper, we show that these deviations are characteristic of specific components and propagate to observable textual outputs, exposing the inference system to any party that can query the model. Building on this observation, we introduce a fingerprinting method that analyzes the prompt-response behavior of LLMs to identify components of the inference system. Our empirical evaluation demonstrates that the inference engine, attention backend, and underlying hardware platform can be identified reliably, even when the LLM is operated at non-zero temperature. We show that preventing fingerprinting is fundamentally hard, as it would require eliminating numerical differences between hardware and software stacks. We therefore propose partial mitigations and discuss their impact.
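One commonly cited source of such numerical deviations is that floating-point addition is not associative, so two implementations that accumulate the same terms in a different order can produce slightly different results. The toy example below illustrates how that can flip a greedy token choice. It is an assumption-laden sketch, not the paper's fingerprinting method, and the logit values are made up.

```python
def sum_left_to_right(terms):
    total = 0.0
    for term in terms:
        total += term
    return total

def sum_right_to_left(terms):
    total = 0.0
    for term in reversed(terms):
        total += term
    return total

def greedy_choice(logit_x, logit_y):
    """Return "X" only if its logit is strictly larger than the rival's."""
    return "X" if logit_x > logit_y else "Y"

# Hypothetical partial sums contributing to token X's logit.
terms = [0.1, 0.2, 0.3]
rival_logit = 0.6  # logit of competing token Y

logit_a = sum_left_to_right(terms)   # "system A" accumulation order
logit_b = sum_right_to_left(terms)   # "system B" accumulation order

print(logit_a == logit_b)
print(greedy_choice(logit_a, rival_logit), greedy_choice(logit_b, rival_logit))
```

The two accumulation orders differ only in the last bit of the result, yet the selected token changes, which is the kind of observable output difference a fingerprinting approach could exploit.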

2605.29976 2026-05-29 physics.ao-ph cs.AI

Evaluating Skill and Stability of ArchesWeather and ArchesWeatherGen under Multi-Decadal Climate Simulations

评估 ArchesWeather 和 ArchesWeatherGen 在多年代际气候模拟中的技能和稳定性

Renu Singh, Robert Brunstein, Antonia Jost, Thomas Rackow, Claire Monteleoni, Yana Hasson, Christian Lessig, Guillaume Couairon

AI总结 本研究将两个原本用于天气预报的机器学习模型 ArchesWeather(确定性)和 ArchesWeatherGen(概率流匹配)改造为强迫大气模型,通过月平均海表温度和海冰覆盖作为边界条件,遵循 AIMIP Phase 1 协议,评估其多年代际气候模拟能力,发现它们能产生稳定的长期气候模拟、稳定的年循环,并捕捉许多气候变量的漂移。

Comments
29 pages, 16 figures, preprint
AI中文摘要

我们评估了 ArchesWeather 和 ArchesWeatherGen 的气候模拟能力,这两个机器学习模型最初训练用于天气预报,并评估了长达10天的预报时效。ArchesWeather 是一个确定性模型,而 ArchesWeatherGen 是一个概率流匹配模型,利用 ArchesWeather 的预报,实现基于集合的不确定性量化。在这项工作中,我们通过额外以月平均海表温度(SST)和海冰覆盖(SIC)作为边界条件进行条件化,将这些模型改造为强迫大气模型。具体地,我们遵循 AI 模型比较项目(AIMIP)第一阶段协议,该协议类似于大气模型比较项目(AMIP),提出了一个标准化的实验设置,以评估基于 ML 的强迫大气模型的气候技能。我们在这两种条件下对两个模型进行了全面评估,包括与数值气候模型的比较、检查扩展中关键设计选择的消融研究,以及强迫与非强迫配置的分析。尽管最初是为天气预报开发的,但我们证明,ArchesWeather 和 ArchesWeatherGen 的强迫配置能产生稳定的长期气候模拟,具有稳定的年循环,并捕捉许多气候变量的漂移。这些模型忠实地再现了 ERA5 的气候态、大尺度环流和年际变率,并捕捉了分布的尾部。

英文摘要

We evaluate the climate simulation capabilities of ArchesWeather and ArchesWeatherGen, two machine learning models originally trained for weather forecasting and evaluated up to a 10-day lead time. ArchesWeather is a deterministic model, while ArchesWeatherGen is a probabilistic flow-matching model leveraging ArchesWeather's forecasts, enabling ensemble-based uncertainty quantification. In this work, we adapt these models to act as forced atmospheric models by using additional conditioning on the monthly mean sea surface temperature (SST) and sea ice cover (SIC) as boundary conditions. In particular, we follow the AI Model Intercomparison Project (AIMIP) Phase 1 protocol, which, analogous to the Atmospheric Model Intercomparison Project (AMIP), proposes a standardized experimental setup to evaluate the climate skill of ML-based forced atmospheric models. We present a comprehensive evaluation of both models under these conditions, including comparison against numerical climate models, ablation studies that examine key design choices in the extension, and an analysis of forced versus unforced configurations. Despite being originally developed for weather forecasting, we demonstrate that forced configurations of ArchesWeather and ArchesWeatherGen produce stable long-term climate simulations, have a stable annual cycle, and capture the drift of many climate variables. The models faithfully reproduce ERA5's climatology, large-scale circulations and interannual variability, and they capture the tails of the distributions.

2605.29975 2026-05-29 cs.LG eess.SP

A Fully Convolutional Approach to Denoising Structural Dynamics Data from X-Ray Photon Correlation Spectroscopy

一种全卷积方法用于X射线光子相关光谱中结构动力学数据的去噪

Nisar Nellikunnummel, Andi Barbour, Lutz Wiegart, Tatiana Konstantinova, Anthony DeGennaro

AI总结 提出全卷积去噪自编码器(FC-DAE),用于去噪X射线光子相关光谱中的双时间强度-强度相关函数,支持任意输入尺寸,在低信噪比条件下恢复复杂动力学特征并保持结构保真度。

AI中文摘要

我们提出了一种全卷积去噪自编码器(FC-DAE),用于去噪X射线光子相关光谱(XPCS)中的双时间强度-强度相关函数($C_2$)。与通常限制为固定输入尺寸的传统去噪自编码器不同,FC-DAE接受任意维度的输入,同时保留不同动力学范围内的相关结构。该模型使用在NSLS-II光束线收集的实验$C_2$数据进行训练,并应用数据增强来扩展数据集的多样性并减少过拟合。FC-DAE在低信噪比条件下成功恢复复杂的动力学特征,同时保持结构保真度。为了评估重建可靠性,我们采用定量指标来评估结构保真度并识别潜在的模型引入偏差。我们的结果表明,FC-DAE提供了具有高计算效率的鲁棒去噪性能,使得在光子受限和低剂量测量条件下恢复XPCS动力学成为可能。

英文摘要

We present a fully convolutional denoising autoencoder (FC-DAE) for denoising two-time intensity-intensity correlation functions ($C_2$) in X-ray photon correlation spectroscopy (XPCS). Unlike conventional denoising autoencoders that are typically restricted to fixed input sizes, the FC-DAE accepts inputs of arbitrary dimensions while preserving correlation structures across diverse dynamical regimes. The model is trained using experimentally derived $C_2$ data collected at NSLS-II beamlines, with data augmentation applied to expand the diversity of the dataset and reduce overfitting. The FC-DAE successfully recovers intricate dynamical features in low signal-to-noise conditions while maintaining structural fidelity. To assess reconstruction reliability, we employ quantitative metrics to evaluate structural fidelity and identify potential model-induced bias. Our results demonstrate that the FC-DAE provides robust denoising performance with high computational efficiency, enabling recovery of XPCS dynamics under photon-limited and low-dose measurement conditions.
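For readers unfamiliar with the quantity being denoised, the sketch below computes a two-time correlation matrix from a short stack of intensity frames using the standard normalized form, C2(t1, t2) = <I(t1) I(t2)> / (<I(t1)> <I(t2)>), with the average taken over pixels in one q-region. Whether the paper uses exactly this normalization is an assumption, and the frame values are made up.

```python
def mean(values):
    return sum(values) / len(values)

def two_time_correlation(frames):
    """frames[t][p] is the intensity of pixel p (one q-region) at time t."""
    frame_means = [mean(frame) for frame in frames]
    n = len(frames)
    c2 = [[0.0] * n for _ in range(n)]
    for t1 in range(n):
        for t2 in range(n):
            cross = mean([a * b for a, b in zip(frames[t1], frames[t2])])
            c2[t1][t2] = cross / (frame_means[t1] * frame_means[t2])
    return c2

# Hypothetical 3 frames of 2 pixels each.
frames = [[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]]
c2 = two_time_correlation(frames)
for row in c2:
    print(row)
```

The resulting matrix is symmetric and its side length equals the number of frames, so its size varies between measurements. That is why a denoiser accepting inputs of arbitrary dimensions is useful here.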

2605.29971 2026-05-29 cs.CL

Causal Interventions on Continuous Variables: A Case Study on Verb Bias in Steering Vectors for In-Context Learning

连续变量的因果干预:以上下文学习中转向向量的动词偏向为例

Zhenghao Herbert Zhou, R. Thomas McCoy, Robert Frank

AI总结 提出一种对连续变量进行因果干预的方法,通过定位低维方向并编辑向量实现反事实目标值,应用于动词偏向特征,证明其在语言模型中的因果表示,并探讨与上下文学习的关系。

AI中文摘要

语言模型表示中的因果干预主要针对离散特征,如语法数。然而,语言模型也必须利用分级特征。我们引入了一种对连续变量进行因果干预的方法:给定与分级目标变量配对的激活向量,我们定位该变量的低维方向,并使用该方向将向量编辑为反事实目标值。我们将此方法应用于心理语言学中研究充分的连续特征,即动词偏向(反映给定动词后倾向于出现哪种句法结构)。我们表明,动词偏向因果地表示在从大型语言模型中提取的转向向量中:对动词偏向的反事实编辑系统地改变了下游结构偏好。动词偏向此前也与上下文学习相关联;在进一步分析中,我们发现转向向量编码了可能驱动上下文学习中观察到的误差驱动更新行为的误差信号,但这些转向向量的方面在下游生成中并未被因果使用。总体而言,这些结果表明因果干预可以应用于连续变量,尽管将连续变量与上下文学习联系起来仍然是一个挑战。

英文摘要

Causal interventions in language model representations have largely targeted discrete features, like grammatical number. However, language models must also make use of features that are graded. We introduce a method for causal intervention on continuous variables: given activation vectors paired with a graded target variable, we localize a low-dimensional direction for that variable and use this direction to edit vectors toward counterfactual target values. We apply this method to a continuous feature that is well-studied in psycholinguistics, namely verb bias (which reflects which syntactic structures tend to follow a given verb). We show that verb bias is causally represented in steering vectors extracted from large language models: counterfactual edits to verb bias systematically shift downstream structural preferences. Verb bias has also previously been linked to in-context learning; in further analyses, we find that steering vectors encode error signals that could drive the error-driven update behavior seen in in-context learning but that these aspects of the steering vectors are not causally used in downstream production. Overall, these results show causal interventions can be applied to continuous variables, though connecting continuous variables to in-context learning remains a challenge.
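The localize-then-edit recipe can be sketched in one dimension. The code below assumes the direction is taken as the normalized covariance between activations and the target variable, with a scalar slope fitted along that direction. This is a simple stand-in: the abstract does not specify how the direction is localized, and the data and function names are hypothetical.

```python
import math

def fit_direction(xs, ys):
    """Return (unit direction, activation mean, target mean, slope along direction)."""
    n, d = len(xs), len(xs[0])
    x_mean = [sum(x[j] for x in xs) / n for j in range(d)]
    y_mean = sum(ys) / n
    # Covariance between each activation dimension and the target variable.
    w = [sum((ys[i] - y_mean) * (xs[i][j] - x_mean[j]) for i in range(n))
         for j in range(d)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    proj = [sum(w[j] * (x[j] - x_mean[j]) for j in range(d)) for x in xs]
    beta = sum(p * (y - y_mean) for p, y in zip(proj, ys)) / sum(p * p for p in proj)
    return w, x_mean, y_mean, beta

def predict(x, model):
    """Decode the graded variable from an activation vector."""
    w, x_mean, y_mean, beta = model
    return y_mean + beta * sum(wj * (xj - mj) for wj, xj, mj in zip(w, x, x_mean))

def edit(x, target, model):
    """Shift x along the direction so the decoded value equals target."""
    w, _, _, beta = model
    shift = (target - predict(x, model)) / beta
    return [xj + shift * wj for xj, wj in zip(x, w)]

# Hypothetical activations paired with a graded variable (e.g. a verb-bias score).
xs = [[0.0, 1.0], [1.0, -1.0], [2.0, 1.0], [3.0, -1.0]]
ys = [0.0, 1.0, 2.0, 3.0]
model = fit_direction(xs, ys)
edited = edit(xs[0], 2.5, model)
print(predict(xs[0], model), predict(edited, model))
```

Because the edit moves the vector only along the fitted direction, components orthogonal to it are left unchanged. Whether downstream behavior then shifts is the causal test the abstract describes.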

2605.29966 2026-05-29 cs.AI

Compass: Navigating Global Marine Lead Data Integration through Expert-Guided LLM Agent

Compass: 通过专家引导的LLM代理导航全球海洋铅数据整合

Yiming Liu, Bin Lu, Meng Jin, Ziyuan Sang, Shuo Jiang, Lei Zhou, Xinbing Wang, Chenghu Zhou, Jing Zhang

AI总结 针对海洋铅数据分散于非结构化论文中的问题,提出专家引导的LLM代理框架Compass,结合知识树分解任务,从23万篇论文中提取3751条铅记录,构建最大海洋铅数据库,准确率达92%。

AI中文摘要

海洋铅及其同位素是海洋环流和人为污染的关键示踪剂,然而实地观测仍然成本高昂且稀疏。尽管存在大量历史记录,但它们被埋藏在学术论文的非结构化内容中,形成了无法进行综合分析的数据孤岛。手动提取不可扩展,而通用大语言模型缺乏必要的领域特定知识,导致幻觉和科学上无效的输出。为了解决这个问题,我们引入了一种专家引导的适应方法,使LLM能够在不进行微调的情况下执行严格的科学数据提取。我们通过Compass(一个由与海洋科学家共同设计的知识树增强的LLM代理框架)来实现这种方法,该框架将复杂任务分解为可验证的步骤,引导代理的推理以确保科学有效性。将Compass应用于超过23万篇相关开放获取论文的语料库,我们成功提取了3751条先前未纳入的铅记录。这项工作建立了迄今为止最大的综合海洋铅数据库。除了标准指标外,Compass通过多层验证展示了卓越的可靠性,经专家手动验证确认准确率达到92%。新整合的数据扩展了先前采样不足区域(如东海和南大洋)的覆盖范围,为未来的科学发现提供了丰富的数据基础。我们发布了一个交互式可视化平台以促进开放科学访问。我们的工作表明,专家引导的代理可以有效弥合通用LLM与高风险科学领域之间的差距,实现地球科学中的可扩展数据发现。

英文摘要

Marine lead (Pb) and its isotopes are critical tracers for ocean circulation and anthropogenic pollution, yet in-situ observations remain costly and sparse. While vast historical records exist, they lie buried within the unstructured content of academic papers, creating "data silos" inaccessible to comprehensive analysis. Manual extraction is unscalable, while general-purpose Large Language Models (LLMs) lack the necessary domain-specific knowledge, leading to hallucinations and scientifically invalid outputs. To address this, we introduce an expert-guided adaptation approach that enables LLMs to perform rigorous scientific data extraction without fine-tuning. We operationalize this approach through Compass, an LLM agent framework enhanced by a Knowledge Tree co-designed with marine scientists, which decomposes complex tasks into verifiable steps, guiding the agent's reasoning to ensure scientific validity. Deploying Compass across a corpus of over 230,000 relevant open-access papers, we successfully extract 3,751 previously unincorporated Pb records. This effort establishes the largest integrated marine Pb database to date. Beyond standard metrics, Compass demonstrates superior reliability through multi-layered validation, achieving 92% accuracy as confirmed through expert manual verification. The newly integrated data expand coverage in previously under-sampled regions such as the East China Sea and the Southern Ocean, providing an enriched data foundation for future scientific discoveries. We release an interactive visualization platform to facilitate open scientific access. Our work demonstrates that expert-guided agents can effectively bridge the gap between general-purpose LLMs and high-stakes scientific domains, enabling scalable data discovery in geosciences.

2605.29965 2026-05-29 cs.AI

Meta-Programming for Linear-time Temporal Answer Set Programming

线性时态回答集编程的元编程

Susana Hahn, Amade Nems, Javier Romero, Torsten Schaub

AI总结 提出一种统一的元编程框架,通过扩展clingo的理论语法并引入转换管道保护嵌套模态,实现了对多种线性时态逻辑(TEL、MEL、DEL)的语义操作化,并开发了metasp系统。

AI中文摘要

回答集编程(ASP)的时态扩展的发展导致了非单调线性时态(TEL)、动态(DEL)和度量(MEL)时态均衡逻辑的出现。然而,高度优化的ASP系统固有的刚性常常阻碍了替代逻辑设计的快速探索和实现。在这项工作中,我们提出了一个灵活的元编程框架,通过统一的声明性框架操作化各种时态逻辑的语义。我们的方法通过用形式类型规范和嵌套能力增强clingo的理论语法,扩展了标准ASP元编程。为了确保语义正确性,我们引入了一个转换管道,在实例化过程中保护嵌套模态免受基于稳定模型的简化。我们通过实现TEL、MEL和DEL的元编码来展示我们框架的可扩展性。我们提供了TEL的全面说明,并突出了管理MEL的区间约束和DEL中的Fischer-Ladner闭包的关键特性。最后,我们介绍了metasp系统,这是一个封装了此工作流程的多功能工具。

英文摘要

The development of temporal extensions of Answer Set Programming (ASP) has led to the emergence of non-monotonic linear-time (TEL), dynamic (DEL), and metric (MEL) temporal equilibrium logics. However, the inherent rigidity of highly optimized ASP systems often hinders the rapid exploration and implementation of alternative logical designs. In this work, we propose a flexible meta-programming framework that operationalizes the semantics of varied temporal logics through a unified, declarative framework. Our approach extends standard ASP meta-programming by augmenting clingo's theory grammar with formal type specifications and nesting capabilities. To ensure semantic correctness, we introduce a transformation pipeline that protects nested modalities from stable-model-based simplifications during grounding. We demonstrate the extensibility of our framework by implementing meta-encodings for TEL, MEL, and DEL. We provide a comprehensive account of TEL and highlight the key features for managing the interval constraints of MEL and the Fischer-Ladner closure in DEL. Finally, we introduce the metasp system, a versatile tool that encapsulates this workflow.

2605.29963 2026-05-29 cs.CR cs.AI cs.LG

Honeyval: A Comprehensive Evaluation Framework for LLM-powered HTTP Honeypots

Honeyval: 基于LLM的HTTP蜜罐综合评估框架

Mark Vero, Fabian Kaczmarczyck, Ivan Petrov, Ilia Shumailov, Jamie Hayes, Niels Heinen, Tianqi Fan, Luca Invernizzi, Martin Vechev

AI总结 提出Honeyval评估框架,通过16个后端应用、AI攻击代理、控制任务和可验证利用目标,系统评估LLM驱动的HTTP蜜罐,发现其相比规则基线能显著延长攻击交互、降低被前沿模型检测率,且保持成本优势。

AI中文摘要

蜜罐是模拟真实系统组件的诱饵系统,旨在防御网络攻击。最近,LLM越来越多地作为蜜罐的模拟骨干。它们使防御者能够构建高交互蜜罐,同时降低系统安全风险。然而,基于LLM的蜜罐开发缺乏统一的评估框架。大多数评估包括测量固定命令上的响应相似性、手动测试或实际部署。这些方法通常不可扩展用于开发、不可跨评估复现、不能代表实际攻击,或不能适应各种攻击者和蜜罐配置。在这项工作中,我们弥补了这一差距,提出了Honeyval,一个针对LLM驱动的HTTP蜜罐的综合评估框架。我们通过将蜜罐基于16个后端应用程序、使用AI黑客代理作为攻击者、采用两个控制任务来监控代理和蜜罐在定制化方面的能力,以及为攻击者定义清晰且可验证的利用目标,解决了先前评估的局限性。使用Honeyval,我们对近期成本高效的LLM作为HTTP蜜罐进行了广泛评估。我们的实验突出了LLM驱动的蜜罐的前景;它们与基于规则的基线蜜罐相比,导致与攻击者的交互时间显著延长,并且即使被前沿模型检测到的频率也远低得多,同时平均而言,保持了针对代理攻击者的运行成本优势。此外,我们实验了不同的反攻蜜罐配置,并观察到了独特的权衡,例如以增加检测为代价获得更长的交互。

英文摘要

Honeypots are decoy systems mimicking real system components designed to defend against cyber attacks. Recently, LLMs increasingly serve as simulation backbones for honeypots. They enable defenders to construct high-interaction honeypots with low system security risks. However, LLM-powered honeypot development lacks a unified evaluation framework. Most evaluations consist of measuring response similarity on fixed commands, manual testing, or real-world deployment. These methods are often not scalable for development, reproducible across evaluations, representative of practical attacks, or adaptable to various attacker and honeypot configurations. In this work, we bridge this gap and propose Honeyval, a comprehensive evaluation framework for LLM-powered HTTP honeypots. We address the limitations of prior evaluations by grounding the honeypots in 16 backend applications, using AI hacking agents as attackers, employing two control tasks to monitor agent and honeypot capabilities across customizations, and defining clear and verifiable exploit goals for the attacker. Using Honeyval, we conduct an extensive evaluation of recent cost-efficient LLMs as HTTP honeypots. Our experiments highlight the promise of LLM-powered honeypots; they lead to substantially longer interactions with the attacker than rule-based baseline honeypots and are far less frequently detected even by frontier models, all while, on average, preserving a running cost advantage against agentic attackers. Further, we experiment with different counter-offensive honeypot configurations and observe unique trade-offs, such as longer interactions at the cost of increased detection.