arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.30059 2026-05-29 cs.LG cond-mat.stat-mech stat.ML

Ridge Regression from Poisson Resetting: A Renewal Perspective on Spectral Regularization

Petar Jolakoski

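Affiliations: manu.edu.mk

AI summary: Connects stochastic resetting from non-equilibrium statistical physics with ridge regularization in statistical learning, showing that for linear gradient flow the stationary mean under resetting to the origin at rate r is the ridge estimator, and extends the result to general renewal reset laws, which generate alternative spectral filters.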

Abstract

We connect stochastic resetting from non-equilibrium statistical physics with ridge regularization in statistical learning. For linear gradient flow, resetting to the origin at rate $r$ produces stationary mean $(X^\top X+rI)^{-1}X^\top y$, exactly the ridge estimator with penalty $\lambda=r$. This uses the known Laplace-transform relationship between ridge regression and exponential-time averaging of gradient flow, with the exponential time now interpreted as the stationary age associated with Poisson resetting. We then extend this identity to general renewal reset laws: the exponential reset time distribution is the unique renewal law whose stationary mean reproduces scalar ridge in every eigendirection as an exact filter identity for every positive curvature, while non-exponential renewal laws generate alternative spectral filters. At the fluctuation level, we study a separate additive Ornstein-Uhlenbeck extension with constant diffusion, interpreted as a stylized SGD approximation. In this setting, the equality holds only at the level of the mean, since the reset process has a nonzero stationary covariance from accumulated OU noise and reset-timing variance, whereas deterministic ridge is a fixed estimator with the same center. Stylized experiments compare the deterministic renewal-induced filters directly and illustrate when filters induced by non-exponential reset-time laws can differ predictively from ridge. The results for the stationary mean and the induced spectral filters are established for continuous-time gradient flow with isotropic resetting on quadratic objectives; the covariance and risk formulas additionally assume additive noise with state-independent covariance.
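
The scalar filter identity behind this result can be checked numerically. The sketch below is a minimal illustration, not the paper's code. It assumes gradient flow on the least-squares loss $\frac{1}{2}\|X\theta-y\|^2$ started from the origin, so that an eigendirection with curvature `s` has filter `(1 - exp(-s t)) / s` at time `t`, and it averages that filter over the stationary age of the reset process. The function names, the trapezoidal quadrature, and the periodic-reset comparison are assumptions made for illustration.

```python
import math

def gradient_flow_filter(s, t):
    """Filter of gradient flow from the origin at time t: (1 - exp(-s t)) / s."""
    return -math.expm1(-s * t) / s

def ridge_filter(s, lam):
    """Scalar ridge filter 1 / (s + lam)."""
    return 1.0 / (s + lam)

def poisson_reset_filter(s, r, t_max=80.0, n=50_000):
    """Average the flow filter over the exponential age density r * exp(-r t),
    using the trapezoidal rule on [0, t_max] (t_max must cover the tail)."""
    h = t_max / n
    total = 0.0
    for i in range(n + 1):
        t = i * h
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * r * math.exp(-r * t) * gradient_flow_filter(s, t)
    return total * h

def periodic_reset_filter(s, period):
    """Same average when resets occur every `period` time units, so the
    stationary age is uniform on [0, period] (closed form of the integral)."""
    return (1.0 + math.expm1(-s * period) / (s * period)) / s

r = 0.5
for s in (0.1, 1.0, 4.0):
    # Columns: curvature, Poisson-reset average, ridge filter, periodic-reset average.
    print(s, poisson_reset_filter(s, r), ridge_filter(s, r),
          periodic_reset_filter(s, 2.0 / r))
```

With exponential ages the quadrature agrees with `1 / (s + r)` up to discretization error, while the periodic-reset average (chosen here with the same mean age `1 / r`) gives a different filter shape, in line with the abstract's statement that non-exponential renewal laws generate alternative spectral filters.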

2605.30058 2026-05-29 cs.CL

HEART-Bench: Do LLM Agents Exhibit Human-like Psychology?

Weihan Peng, Chenxu Zhang, Qianao Wang, Yuling Shi, Heng Lian, Qihong Mao, Jiahao Pang, Chunliang Feng, Bowen Li, Xiaodong Gu

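Affiliations: Shanghai Jiao Tong University; Imperial College London; Quwan Group; University of Washington; South China Normal University

AI summary: Proposes the HEART-Bench benchmark, which builds virtual characters grounded in Big Five personality traits and autobiographical memories and evaluates, under the DIAMONDS situational framework, whether LLM agents exhibit consistent human-like psychological profiles.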

Comments GitHub: https://github.com/peng-weihan/HEART-BENCH

Abstract

While LLM agents have demonstrated remarkable task-oriented abilities such as planning, reasoning, and action, few works have treated them as complete human personalities where emotional dimensions hold equal importance. In this paper, we introduce a novel benchmark to systematically assess whether LLM agents can simulate coherent, human-like psychology. Specifically, our benchmark constructs 11 diverse human characters grounded in orthogonal Big Five personality traits, with each profile deeply integrated with 1,000 structured autobiographical-style episodic memories distributed across theory-grounded developmental life stages. To rigorously evaluate the psychological manifestations of LLMs, we designed a curated suite of 64 decision-making scenarios, guided by the DIAMONDS taxonomy, a psychological framework that characterizes situations along eight dimensions: Duty, Intellect, Adversity, Mating, pOsitivity, Negativity, Deception, and Sociality. By subjecting agents to varying scenarios, the benchmark evaluates whether they can consolidate their innate personality traits and autobiographical memories to make behavioral decisions that are consistent with their specific psychological profiles. After systematic human validation and filtering, we obtained a benchmark consisting of 673 multiple-choice questions (MCQs). We believe this benchmark provides a principled and scalable testbed for studying human-like emotions, personality consistency, and value-consistent behavioural decision-making in LLM-based agents.

2605.30056 2026-05-29 cs.RO cs.LG

Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance

Shutong Ding, Zejia Zhong, Zhongyi Wang, Ke Hu, Bikang Pan, Jingya Wang, Ye Shi

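Affiliations: ShanghaiTech University

AI summary: To address the imbalance between exploration and exploitation in diffusion policies for reinforcement learning, proposes Critic-Guided diffusion Policy Optimization (CGPO), which balances the two with a training-free guidance technique and reports state-of-the-art performance on MuJoCo and Franka robot tasks.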

Comments accepted by ICML2026

Abstract

Recent advances in reinforcement learning (RL) have achieved great success by leveraging the multimodality and exploration capability of diffusion policies. Among these approaches, one representative branch focuses on sampling-based policy optimization. This design gives the diffusion model better exploration capability, particularly at the beginning of training, but suffers from low exploitation of Q-value information, resulting in slow policy convergence. Another branch focuses on gradient-based policy optimization, which fully exploits the gradient of the Q function yet tends to collapse into a unimodal policy with low diversity. To address this issue, we propose CGPO (Critic-Guided diffusion Policy Optimization), which effectively balances exploration and exploitation by integrating a training-free guidance technique into the denoising process of the diffusion policy. Concretely, CGPO steers action generation toward high-value regions defined by the critic network and uses the guided actions as regression targets. In this manner, CGPO reduces the time required to obtain high-quality actions and improves final performance through a better exploration-exploitation balance. We validate the effectiveness of CGPO on 5 MuJoCo locomotion tasks, where it achieves state-of-the-art performance compared with existing diffusion-based RL methods. Notably, CGPO is the first method to successfully incorporate diffusion policy into real-world RL, showing superior performance on Franka robot arm grasping tasks. Our official page is released at https://dingsht.tech/cgpo-webpage.
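
The idea of steering a denoising update with a critic gradient, without retraining the policy, can be illustrated with a toy one-dimensional example. This is a minimal sketch and not the authors' CGPO implementation: the quadratic critic `q_value`, the standard-normal stand-in for the learned policy score, the deterministic update rule, the step size, and `guidance_scale` are all assumptions chosen for illustration.

```python
def q_value(action):
    """Hypothetical critic that prefers actions near 2.0."""
    return -(action - 2.0) ** 2

def q_grad(action):
    """Analytic gradient of the toy critic with respect to the action."""
    return -2.0 * (action - 2.0)

def prior_score(action):
    """Score of a standard normal, standing in for the learned diffusion policy."""
    return -action

def denoise(action, steps=200, step_size=0.05, guidance_scale=0.0):
    """Deterministic denoising-style updates. The critic gradient is added to the
    policy score at every step, so no extra training of the policy is needed."""
    for _ in range(steps):
        direction = prior_score(action) + guidance_scale * q_grad(action)
        action += step_size * direction
    return action

unguided = denoise(3.0)                       # drifts to the prior mode near 0
guided = denoise(3.0, guidance_scale=1.0)     # pulled toward the high-value region
print(unguided, q_value(unguided))
print(guided, q_value(guided))
```

In this toy setting the guided action ends at a higher critic value than the unguided one. In the abstract's description, such guided actions would then serve as regression targets for the policy; that training step is omitted here.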

2605.30051 2026-05-29 cs.CL cs.CY

Who Am I? History-Aware Profiles for Student Simulation in Tutoring Dialogues

Zhangqi Duan, Shuyan Huang, Alexander Scarlatos, Jaewook Lee, Simon Woodhead, Andrew Lan

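Affiliations: University of Massachusetts Amherst; Eedi

AI summary: Introduces the task of history-conditioned student simulation, training a profile generator and a simulator with reinforcement learning so that a student's history can be used to predict dialogue turns accurately; the method significantly outperforms baselines on a dataset from a math learning platform.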

Abstract

A key part of developing large language model (LLM)-powered, automated tutoring tools is student simulation, i.e., using LLMs to role-play as students, which can facilitate tutor model evaluation and training. Existing work mostly focuses on within-dialogue simulation, which lacks context on student knowledge and behavior, partly due to not grounding in past student question-answering or dialogue interactions. In this work, we introduce the task of history-conditioned student simulation, where the goal is to accurately predict student dialogue turns by leveraging information in the student's learning history. We propose a two-component framework in which a profile generator summarizes a student's history and a simulator predicts student turns conditioned on the resulting profile. We train both components with reinforcement learning (RL), yielding profiles optimized for faithful student simulation. We evaluate our method and baselines on the first-of-its-kind real-world dataset of student dialogues and question responses that we collect from a math learning platform. Extensive experiments show that our method significantly outperforms baselines, and demonstrate the importance of history, profiles, and RL training.

2605.30049 2026-05-29 cs.AI

Robust and Generalizable Safety Steering for Text-to-Image Diffusion Transformers

Zihao Xue, Yan Wang, Zhen Bi, Long Ma, Zhonglong Zheng, Zeyu Yang, Bingyu Zhu, Longtao Huang, Jie Xiao, Jungang Lou

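Affiliations: Huzhou Normal University; Alibaba Group; University of Science and Technology of China; Zhejiang Normal University; Zhejiang University of Technology

AI summary: Proposes the SafeDIG framework, which performs safety steering of Diffusion Transformers through position-aware sparse feature transfer, reducing target-domain and overall unsafe generation rates while preserving source-domain safety and image quality.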

Abstract

Diffusion Transformers have become a powerful backbone for text-to-image generation, but their layered and cross-modal generation process makes safety control fundamentally different from prompt-level filtering or output-level detection. Harmful semantics may be weakly expressed in text representations, progressively bound to visual latents, and finally entangled with rendering dynamics. As a result, safety steering at a fixed layer can be unstable, and a steering mechanism learned from known risks may not transfer reliably to a shifted target risk domain. We propose SafeDIG, a safety steering framework that formulates DiT safety adaptation as position-aware sparse feature transfer. SafeDIG first constructs Sparse Autoencoders over functionally distinct DiT intervention positions and uses robustness-aware pre-training routing to prioritize intervention sites that are expected to remain stable under source-target risk shift. It then separates transferable safety features from domain-specific activation geometry by freezing the SAE encoder as a reusable sparse safety dictionary and adapting only the decoder to the target-domain activation manifold. During inference, SafeDIG combines Blend and Repel operations to steer unsafe activations toward transferred safety manifolds or away from harmful sparse directions. Experiments on FLUX.1 Dev and Stable Diffusion 3.5 Large show that SafeDIG consistently reduces target-domain and overall unsafe generation rates while preserving source-domain safety and image quality.

2605.30046 2026-05-29 cs.LG cs.AI

Masked Diffusion Modeling for Anomaly Detection

Lixing Zhang, Yuchen Liang, Liyan Xie

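Affiliations: University of Minnesota; Ohio State University

AI summary: Proposes MaskDiff-AD, a method based on masked diffusion models that builds anomaly scores from the difficulty of reconstructing randomly masked coordinates, enabling efficient anomaly detection on categorical, mixed-type, and discrete sequence data.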

Abstract

Anomaly detection aims to identify samples that deviate from the nominal data distribution and is central to many safety-critical applications. However, developing effective anomaly detection methods for categorical, mixed-type, and discrete sequence data remains challenging and relatively underexplored. Masked diffusion models provide a natural way to model such data by learning to recover masked values from the remaining visible context. In this paper, we propose Masked Diffusion for Anomaly Detection (MaskDiff-AD), a forward-only method based on masked diffusion models trained only on nominal data. Given a test sample, MaskDiff-AD constructs anomaly scores from the difficulty of reconstructing randomly masked coordinates, yielding a content-sensitive score that operates directly on discrete state spaces while avoiding reverse-time sampling. We also develop a non-parametric variant of MaskDiff-AD and provide theoretical guarantees by characterizing Type-I and Type-II errors under a fixed detection threshold. Experiments on fourteen categorical and mixed-type tabular datasets from ADBench and UADAD, as well as four text anomaly detection datasets from NLP-ADBench, show that MaskDiff-AD achieves competitive performance against classical, diffusion-based, and recent tabular/text anomaly detection baselines. Notably, MaskDiff-AD achieves the best overall average rank, outperforming all twelve tabular baseline methods.
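
The scoring idea, masking random coordinates and measuring how hard they are to reconstruct from the visible ones, can be sketched on toy categorical data. This is a minimal sketch under stated assumptions, not the MaskDiff-AD implementation: the count-based `PairwiseDenoiser` is a hypothetical stand-in for a trained masked diffusion model, and the mask rate, number of masks, and toy rows are invented for illustration.

```python
import math
import random
from collections import Counter, defaultdict

class PairwiseDenoiser:
    """Hypothetical stand-in for a masked-diffusion denoiser fit on nominal rows."""

    def __init__(self, rows, alpha=1.0):
        self.alpha = alpha
        self.dim = len(rows[0])
        self.values = [sorted({row[j] for row in rows}) for j in range(self.dim)]
        self.marginal = [Counter(row[j] for row in rows) for j in range(self.dim)]
        self.pair = defaultdict(Counter)  # (j, k, x_k) -> counts of x_j
        for row in rows:
            for j in range(self.dim):
                for k in range(self.dim):
                    if j != k:
                        self.pair[(j, k, row[k])][row[j]] += 1

    def prob(self, j, row, visible):
        """Smoothed p(x_j = row[j] | visible coordinates), averaged over visible k."""
        n_vals = len(self.values[j]) + 1  # one extra slot for unseen categories
        counters = [self.pair[(j, k, row[k])] for k in visible] or [self.marginal[j]]
        probs = [(c[row[j]] + self.alpha) / (sum(c.values()) + self.alpha * n_vals)
                 for c in counters]
        return sum(probs) / len(probs)

def anomaly_score(model, row, n_masks=64, mask_rate=0.5, seed=0):
    """Mean negative log-likelihood of randomly masked coordinates (forward-only)."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(n_masks):
        masked = [j for j in range(model.dim) if rng.random() < mask_rate]
        visible = [j for j in range(model.dim) if j not in masked]
        for j in masked:
            total -= math.log(model.prob(j, row, visible))
            count += 1
    return total / max(count, 1)

nominal_rows = [("red", "circle", "small")] * 20 + [("blue", "square", "large")] * 20
model = PairwiseDenoiser(nominal_rows)
print(anomaly_score(model, ("red", "circle", "small")))   # nominal combination
print(anomaly_score(model, ("red", "square", "small")))   # unusual combination
```

The unusual combination receives the higher score because its masked coordinates are poorly predicted by the visible ones. No reverse-time sampling is involved, only forward evaluations of the predictor.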

2605.30045 2026-05-29 cs.CV

GenEraser: Generalizable Video Object Removal via Balanced Text-Mask Guidance and Decoupled Locator-Preserver

Yuqing Chen, Lin Liu, Haisu Wu, Xiaopeng Zhang, Yaowei Wang, Yujiu Yang, Qi Tian

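Affiliations: Tsinghua University; Pengcheng National Laboratory; Huawei; Southeast University; Harbin Institute of Technology

AI summary: Proposes the GenEraser framework, which uses a Multi-Conditional Mixture-of-Experts, a Learnable Deep CFG Fusion mechanism, and a Decoupled Expert Architecture to address the generalization challenge of removing both target objects and their physical effects in video object removal, improving by 2.16 dB and 1.44 dB on ROSE and VOR-Eval, respectively.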

Abstract

Video object removal frequently struggles to simultaneously eliminate target objects and their associated physical effects (e.g., smoke, reflections, light, and ripples) in out-of-domain scenarios due to complex spatiotemporal ambiguities. While existing methods primarily rely on spatial masks, they often fail to capture weakly correlated effects, and the potential of explicit textual guidance remains underexplored. Furthermore, a fundamental optimization conflict exists in removal models between high-level semantic generalization and precise pixel-level background preservation. To address these challenges, we propose GenEraser, a novel framework for generalized and high-fidelity video object and effect removal. First, we introduce a Multi-Conditional Mixture-of-Experts (MC-MoE) paired with Bipartite Text guidance to fully exploit the multimodal priors of Diffusion Transformers, significantly enhancing the identification of complex effects. Second, a Learnable Deep "CFG" Fusion mechanism (LD-CFG) is developed to adaptively balance the relative dominance of mask and textual conditions across diverse scenarios. Finally, we propose a Decoupled Expert Architecture, comprising a Locator and a Preserver, to mitigate the inherent trade-off between semantic generalization and pixel alignment. Extensive experiments demonstrate that our GenEraser surpasses recent state-of-the-art approaches, achieving significant quantitative improvements (e.g., 2.16 dB and 1.44 dB on the ROSE Benchmark and VOR-Eval, respectively) while maintaining exceptionally robust generalization in open-world scenarios. Project page: https://cyqii.github.io/GenEraser.github.io/

2605.30042 2026-05-29 cs.AI

Learning to Choose: An Empowerment-Guided Multi-Agent System with semantic communication for Adaptive Method Selection

Geremy Loachamín-Suntaxi, Robert Lazar, Dimitrios G. Giovanis, Ioannis G. Kevrekidis, Eleni D. Koronaki

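Affiliations: Faculty of Science, Technology and Medicine; University of Luxembourg; Johns Hopkins University; Luxembourg Institute of Science and Technology

AI summary: Proposes a multi-agent framework that combines contextual bandits, structured inter-agent communication, and semantic checkpoints, preserving action-outcome causal consistency to improve the convergence, robustness, and generalization of adaptive decision-making in scientific computing workflows.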

Abstract

Automating scientific computing workflows requires more than generating executable code: autonomous systems must also select appropriate computational strategies, implement them faithfully, and ensure that the resulting outcomes remain causally attributable to the decisions that produced them. In multi-agent pipelines, this process is particularly fragile, as small inconsistencies between agent intentions and actions can lead to semantic drift, where the eventually executed procedure no longer reflects the originally selected strategy, thereby corrupting downstream evaluation and adaptation. In this work, motivated by the ATHENA framework (Toscano et al., 2025; Toscano et al., 2026) and the concept of empowerment (Yiu et al., 2025), we introduce a multi-agent framework that combines contextual bandits with structured inter-agent communication and, most importantly, semantic checkpoints that preserve action-outcome fidelity throughout the pipeline. The system integrates specialized large language model (LLM) agents, grounded code generation, and self-healing execution loops within an adaptive decision-making architecture. Interpreting the framework through the lens of empowerment, we show that reliable autonomous learning requires not only identifying high-quality actions, but also preserving the integrity of their propagation across agents. Using sensitivity analysis and uncertainty quantification workflows as representative case studies, we demonstrate that unchecked semantic drift degrades policy learning, whereas the proposed framework improves convergence, robustness, and adaptation to novel problem contexts. These results suggest a broader design principle for scientific multi-agent systems: adaptive decision-making must be coupled with explicit mechanisms that guarantee semantic consistency and reliable information flow across the computational pipeline.
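
The effect of semantic drift on bandit learning, and of a checkpoint that rejects outcomes not attributable to the selected action, can be sketched with a small simulation. This is a hedged toy model, not the paper's system: the tabular epsilon-greedy bandit, the two contexts, the reward table, and the executor that silently swaps methods on every third call are all assumptions invented for illustration.

```python
import random

METHODS = ["sobol", "morris"]
# Hypothetical true reward of running each method in each problem context.
TRUE_REWARD = {("smooth", "sobol"): 0.9, ("smooth", "morris"): 0.4,
               ("noisy", "sobol"): 0.3, ("noisy", "morris"): 0.8}

def execute(context, selected, call_index):
    """Hypothetical executor: every third call drifts to the other method."""
    executed = selected
    if call_index % 3 == 0:
        executed = METHODS[1 - METHODS.index(selected)]
    return executed, TRUE_REWARD[(context, executed)]

def run(use_checkpoint, rounds=600, epsilon=0.2, seed=0):
    """Epsilon-greedy contextual bandit with an optional semantic checkpoint."""
    rng = random.Random(seed)
    value = {key: 0.0 for key in TRUE_REWARD}
    count = {key: 0 for key in TRUE_REWARD}
    for i in range(rounds):
        context = rng.choice(["smooth", "noisy"])
        if rng.random() < epsilon:
            selected = rng.choice(METHODS)
        else:
            selected = max(METHODS, key=lambda m: value[(context, m)])
        executed, reward = execute(context, selected, i)
        if use_checkpoint and executed != selected:
            continue  # checkpoint: outcome is not attributable to the decision
        key = (context, selected)
        count[key] += 1
        value[key] += (reward - value[key]) / count[key]  # running mean
    return value

checked = run(use_checkpoint=True)
unchecked = run(use_checkpoint=False)
for key in TRUE_REWARD:
    print(key, TRUE_REWARD[key], round(checked[key], 3), round(unchecked[key], 3))
```

With the checkpoint, every update is credited to the method that actually ran, so the value estimates match the reward table. Without it, rewards from the swapped method contaminate the estimates of the selected one.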

2605.30038 2026-05-29 cs.LG cs.AI cs.CV

Alignment-Guided Score Matching for Text-to-Image Alignment in Diffusion Models

Jaa-Yeon Lee, Yeobin Hong, Taesung Kwon, Jong Chul Ye

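Affiliations: Graduate School of AI, KAIST, South Korea

AI summary: Proposes a lightweight, reward-free post-training method that integrates contrastive alignment guidance directly into the score-matching objective of diffusion models to address over-penalization and counting errors in text-image alignment.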

Comments ICML 2026, Project page: https://jaayeon.github.io/AGSM

Abstract

Diffusion models generate highly realistic images but often struggle with precise text-image alignment. While recent post-training methods improve alignment using external rewards or human preference signals, their performance depends heavily on reward quality, and they do not directly address alignment within the diffusion process itself. Recent reward-free approaches such as SoftREPA demonstrate that optimizing soft text tokens via contrastive learning can effectively improve text-image representation alignment, outperforming standard parameter-efficient fine-tuning baselines. However, the contrastive formulation can excessively penalize negative pairs, which manifests as characteristic failure cases such as over-counting and repetition. To address this issue, we propose a lightweight, reward-free post-training method that refines soft tokens by integrating contrastive alignment guidance directly into the score-matching objective of diffusion models. By assigning alignment directions at the score level, our approach mitigates these limitations and yields more coherent and semantically faithful generations. Experiments show that our method matches SoftREPA while substantially improving its failure cases, achieving over 35% improvement in counting accuracy on the GenEval benchmark. Our method is seamlessly applicable to existing diffusion backbones (SD1.5, SDXL, and SD3), and is complementary to existing RL-based diffusion post-training methods. Project page: https://jaayeon.github.io/AGSM

2605.30031 2026-05-29 cs.SD cs.AI cs.CL

Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation

Bo-Han Feng, Yu-Hsuan Li Liang, Chien-Feng Liu, You-Hsuan Chang, Yun-Nung Chen

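Affiliations: National Taiwan University

AI summary: Presents a unified taxonomy and a controlled empirical evaluation of audio jailbreak attacks and defenses in large audio-language models, finding that Acoustic Best-of-N exposes worst-case audio-space vulnerabilities, that Narrative Framing is an effective low-latency semantic threat, and that existing defenses trade robustness against benign usability.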

Comments Submitted to ACL ARR 2026 May

Abstract

Large Audio Language Models (LALMs) expand jailbreak risks from token-level prompting to the full speech perception-to-reasoning pipeline, where unsafe behavior can be induced through semantics, acoustic style, signal artifacts, or internal representations. Existing work studies these risks under heterogeneous threat models and evaluation protocols, making it difficult to compare attack practicality or defense utility. This paper provides a unified taxonomy and a controlled empirical evaluation of LALM jailbreak attacks and defenses. We organize prior work into semantic, acoustic, signal, and embedding-layer attacks; guard-based, training-free, and training-based defenses; and cross-modal, audio-native, and interactive benchmarks. We then evaluate representative attacks and defenses across ten open-source LALMs, measuring not only attack success rate but also benign refusal and latency. Our results show that Acoustic Best-of-N reveals strong worst-case audio-space vulnerabilities, Narrative Framing is an effective low-latency semantic threat, and current defenses trade robustness against benign usability. These findings support cost- and utility-aware evaluation as a necessary complement to success-rate-only LALM safety benchmarks.
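
The three quantities the evaluation reports, attack success rate, benign refusal rate, and latency, can be computed from per-request records as sketched below. This is an illustrative sketch only: the record fields (`kind`, `harmful`, `refused`, `latency_s`) and the sample values are assumptions, not the paper's data format or results.

```python
def summarize(records):
    """Aggregate per-request outcomes into cost- and utility-aware metrics."""
    attacks = [r for r in records if r["kind"] == "attack"]
    benign = [r for r in records if r["kind"] == "benign"]
    return {
        # Share of attack prompts that produced a harmful response.
        "attack_success_rate": sum(1 for r in attacks if r["harmful"]) / len(attacks),
        # Share of benign prompts the model refused to answer.
        "benign_refusal_rate": sum(1 for r in benign if r["refused"]) / len(benign),
        # Mean end-to-end latency over all requests, in seconds.
        "mean_latency_s": sum(r["latency_s"] for r in records) / len(records),
    }

# Hypothetical records for one model under one defense.
records = [
    {"kind": "attack", "harmful": True, "refused": False, "latency_s": 1.0},
    {"kind": "attack", "harmful": True, "refused": False, "latency_s": 2.0},
    {"kind": "attack", "harmful": False, "refused": True, "latency_s": 1.5},
    {"kind": "attack", "harmful": False, "refused": True, "latency_s": 0.5},
    {"kind": "benign", "harmful": False, "refused": True, "latency_s": 1.0},
    {"kind": "benign", "harmful": False, "refused": False, "latency_s": 1.0},
    {"kind": "benign", "harmful": False, "refused": False, "latency_s": 0.5},
    {"kind": "benign", "harmful": False, "refused": False, "latency_s": 0.5},
]
summary = summarize(records)
print(summary)
```

Reporting the three numbers together makes the trade-off visible: a defense that lowers attack success by refusing more benign requests, or by adding latency, shows up in the other two columns.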

2605.30029 2026-05-29 cs.AI

RAISE: RAG Design as an Architecture Search Problem

Zhen Chen, Yibing Liu, Weihao Xie, Yu Liang, Peilin Chen, Shiqi Wang

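Affiliations: City University of Hong Kong; Baidu Inc.

AI summary: Formulates the design choices of retrieval-augmented generation (RAG) systems as an architecture search problem and builds the RAISE framework and benchmark, which evaluates 13 optimization algorithms on 7 datasets under standardized search spaces and budgets and finds that optimization performance is highly task-dependent.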

Abstract

Retrieval-augmented generation (RAG) systems expose numerous design choices spanning query rewriting, chunking, retrieval depth, reranking, and context compression. In practice, these choices are often configured through heuristics, hindering systematic evaluation and reproducibility across settings. We argue that this challenge is best formulated as RAG architecture search. To support controlled and reproducible study of this problem, we introduce the RAG Intelligence Search Engine (RAISE), a comprehensive framework and benchmark for RAG hyperparameter optimization, which evaluates optimization methods for RAG pipelines under standardized search spaces and budgets. RAISE implements 13 search algorithms and evaluates them across seven public text and multimodal datasets using three random seeds. Our experiments show that optimization performance is highly task-dependent: methods that perform strongly on one dataset may not generalize consistently across others, cautioning against interpreting aggregate rankings as evidence of universally superior strategies. RAISE provides a common experimental substrate for fair, reproducible, and systematic research on RAG hyperparameter optimization.
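
Treating RAG configuration as a search problem under a fixed evaluation budget can be sketched with a random-search baseline. This is a minimal sketch, not the RAISE code or API: the search space values, the `toy_score` function standing in for an actual pipeline evaluation, and the budget of 20 are all assumptions made for illustration.

```python
import random

# Hypothetical discrete search space over the design choices named in the abstract.
SEARCH_SPACE = {
    "query_rewrite": [False, True],
    "chunk_size": [256, 512, 1024],
    "top_k": [3, 5, 10],
    "rerank": [False, True],
    "compress_context": [False, True],
}

def toy_score(config):
    """Hypothetical stand-in for running a RAG pipeline and scoring its answers."""
    score = 0.5
    score += 0.1 if config["rerank"] else 0.0
    score += 0.05 if config["query_rewrite"] else 0.0
    score -= abs(config["chunk_size"] - 512) / 5120
    score -= abs(config["top_k"] - 5) / 100
    score -= 0.02 if config["compress_context"] else 0.0
    return score

def random_search(budget, seed):
    """Evaluate `budget` uniformly sampled configurations and keep the best one."""
    rng = random.Random(seed)
    history = []
    for _ in range(budget):
        config = {name: rng.choice(choices) for name, choices in SEARCH_SPACE.items()}
        history.append((toy_score(config), config))
    best_score, best_config = max(history, key=lambda item: item[0])
    return best_score, best_config, history

for seed in (0, 1, 2):
    best_score, best_config, _ = random_search(budget=20, seed=seed)
    print(seed, round(best_score, 3), best_config)
```

Fixing the search space, the budget, and the seeds is what makes different search algorithms comparable. Any other optimizer would replace the sampling line while keeping the same evaluation loop.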

2605.30027 2026-05-29 cs.CV cs.IR

DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark

Ruofan Hu, Menghui Zhu, Jieming Zhu, Bo Chen, Shengyang Xu, Minjie Hong, Xiaoda Yang, Sashuai Zhou, Li Tang, Tao Jin, Zhou Zhao

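Affiliations: Zhejiang University; Huawei Technologies Co., Ltd

AI summary: Proposes DocRetriever, a plug-and-play framework that uses layout-aware sparse embeddings and a reasoning-augmented reranker to address semantic blurring and generalization bottlenecks in multimodal document retrieval, and builds the MultiDocR benchmark for more rigorous evaluation.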

Comments Accepted at KDD 2026 Research Track

Abstract

Multimodal documents contain diverse elements, such as tables, figures, and layouts, which can complicate retrieval tasks. While current approaches typically combine dense visual embedding models with supervised rerankers to achieve high-precision retrieval, they face inherent limitations. First, the coarse-grained nature of dense embeddings tends to obfuscate explicit semantics, failing to leverage structurally salient information. Second, supervised reranking models suffer from generalization bottlenecks, as their performance heavily relies on domain-specific training data. Furthermore, existing benchmarks often lack diverse assessment dimensions and comprehensive relevance annotations, limiting reliable evaluation. To address these challenges, we propose DocRetriever, a plug-and-play framework. It enhances visual retrieval via a layout-aware sparse embedding technique, enabling effective hybrid encoding without the overhead of optical character recognition (OCR). We also introduce a generalizable reranker that leverages reasoning-augmented demonstrations and optimized sampling to improve accuracy in few-shot settings. Finally, we construct a new benchmark, MultiDocR, to enable more rigorous evaluation. Experiments across diverse benchmarks validate DocRetriever's superiority over state-of-the-art methods.

2605.30022 2026-05-29 cs.CL cs.AI

Give it Space! Explicit Disentangling of Positional and Semantic Representations in Encoders

Pierre-Antoine Lequeu, Camille Barboule, Benjamin Piwowarski

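Affiliations: Sorbonne Université, CNRS, ISIR; Orange Innovation

AI summary: Studies the mechanism of positional encoding in Transformers by separating positional and semantic signals into three independent streams, finding that the disentangled approach preserves macroscopic structure and improves linguistic representation.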

Comments 8 pages + 10 pages of bibliography and appendix

Abstract

Positional encoding (PE) underpins how permutation-invariant Transformers represent sequence order, yet how positional information is processed and stored remains poorly understood. Modern PE methods such as RoPE still struggle on tasks such as long-context understanding or retrieval (Chen et al., 2025). Hence, a better understanding of the internal positional mechanism could help design better PE. Building on evidence that positional and semantic signals occupy nearly orthogonal subspaces in trained Transformers, we modify an encoder Transformer to process three explicitly disentangled streams: semantic, absolute positional (AP) and relative positional (RP), and confine the masked-language-modeling (MLM) objective to the semantic stream. This decoupling enables a clean mechanistic study and yields three take-aways: (1) the isolated AP subspace spontaneously collapses into a low-frequency two-dimensional manifold that captures the structure of the document; (2) attention heads specialize into structure-oriented and semantic-oriented groups, with RP exclusively supporting the latter; (3) standard positional encodings do not robustly retain macroscopic structure: RoPE and RP only weakly encode it, and entangled AP loses it in the final layers under MLM pressure. The disentangled approach preserves positional encoding, which improves linguistic representation on 49 of the 65 linguistic phenomena of the Flash-Holmes probing benchmark.

2605.30015 2026-05-29 cs.LG cs.AI

Test Time Training for Supervised Causal Learning

Zizhen Deng, Jiaru Zhang, Rui Ding, Huang Bojun, Jinzhuo Wang, Qiang Fu, Shi Han, Dongmei Zhang

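Affiliations: Peking University; Shanghai Jiao Tong University; Microsoft; Sony Research

AI summary: To address the weak out-of-distribution generalization of supervised causal learning, proposes TTT-SCL, a test-time training framework that dynamically generates training sets aligned with the test instance and significantly improves causal discovery performance.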

Abstract

Supervised Causal Learning (SCL) has shown promise in causal discovery by framing it as a supervised learning problem. However, it suffers from significant out-of-distribution generalization challenges. We reveal three limitations of previous SCL practices: a significant performance gap between synthetic benchmarks and real-world data, fragility to distribution shifts, and failure in compositional generalization, collectively questioning its real-world applicability. To address this, we propose Test-Time Training for Supervised Causal Learning (TTT-SCL), a novel framework that dynamically generates training sets explicitly aligned with any specific test instance. We demonstrate the correlation between TTT-SCL and score-based methods, and design an efficient module for generating training sets based on the classic scoring function. Experiments on synthetic benchmarks, pseudo-real and real-world datasets demonstrate that TTT-SCL significantly outperforms existing SCL and traditional causal discovery methods.

2605.30014 2026-05-29 cs.AI

From GPS Points to Travel Patterns: Flexible and Semantic Trajectory Generation with LLMs

从GPS点到出行模式:基于LLM的灵活语义轨迹生成

Silin Zhou, Chenhao Wang, Yuntao Wen, Shuo Shang, Lisi Chen, Panos Kalnis

发表机构 * University of Electronic Science and Technology of China(电子科技大学) King Abdullah University of Science and Technology(阿卜杜拉国王科技大学)

AI总结 提出HTP方法,通过层次化生成出行模式再生成GPS点,利用LLM和RQ-VAE实现灵活、语义丰富的轨迹生成,在质量上平均提升29.78%。

Comments This paper is accepted by KDD2026 second round

详情
AI中文摘要

城市轨迹在建模城市动态和支持各种智慧城市应用中起着关键作用。然而,隐私问题限制了对大规模高质量轨迹数据集的访问。轨迹生成通过合成现实数据来减轻隐私风险,提供了一种有前景的替代方案。然而,现有方法未能显式捕获出行模式,并且只能在单一条件下生成固定长度的轨迹。为了解决这些局限性,我们提出了HTP,它层次化地首先生成出行模式,然后使用大语言模型(LLM)生成GPS点,而不是直接生成GPS点。我们首先设计了一个轨迹特定的残差量化变分自编码器(RQ-VAE),它以从粗到细的方式将微观级别的GPS轨迹量化为紧凑的宏观级别出行模式令牌。这些令牌捕获了丰富的段空间不规则性,例如由交通条件引起的点密度变化。然后,我们用出行模式令牌扩展LLM词汇表,以对齐轨迹表示与LLM输入,并应用监督微调(SFT)使LLM与轨迹生成任务对齐,从而能够在各种条件下生成出行模式序列。在两个真实世界数据集上的大量实验表明,HTP在生成质量上平均比最强基线高出29.78%。我们的代码可在https://github.com/slzhou-xy/HTP获取。

英文摘要

Urban trajectories play a crucial role in modeling urban dynamics and supporting various smart city applications. However, privacy concerns restrict access to large-scale and high-quality trajectory datasets. Trajectory generation provides a promising alternative by synthesizing realistic data to mitigate privacy risks. However, existing methods fail to explicitly capture travel patterns and can only generate fixed-length trajectories under a single condition. To address these limitations, we propose HTP, which Hierarchically generates Travel patterns first and then generates GPS Points by using large language models (LLMs), rather than directly generating GPS points. We first design a trajectory-specific residual quantization variational autoencoder (RQ-VAE) that quantizes micro-level GPS trajectories into compact, macro-level travel pattern tokens in a coarse-to-fine manner. These tokens capture rich segment spatial irregularities, such as point density variations caused by traffic conditions. Then, we extend the LLM vocabulary with travel pattern tokens to align trajectory representations with the LLM input, and apply supervised fine-tuning (SFT) to align the LLM with the trajectory generation task, enabling generation of travel pattern sequences under various conditions. Extensive experiments on two real-world datasets show that HTP outperforms the strongest baseline by an average of 29.78% in terms of generation quality. Our code is available at https://github.com/slzhou-xy/HTP.

2605.30011 2026-05-29 cs.CV cs.AI

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

VisualThink-VLA:用于高效低延迟视觉-语言-动作策略的视觉中间推理

Mingjian Gao, Wenqiao Zhang, Yuqian Yuan, Yang Dai, Binhe Yu, Zheqi Lv, Haoyu Zheng, Jiaqi Zhu, Zhiqi Ge, Zixuan Wan, Siliang Tang, Yueting Zhuang

发表机构 * Zhejiang University(浙江大学) Cornell University(康奈尔大学) National University of Singapore(新加坡国立大学) Xidian University(西安电子科技大学)

AI总结 提出VisualThink-VLA框架,通过视觉中间推理和选择性路由机制,在保持高精度的同时将推理延迟从数秒降至亚秒级。

详情
AI中文摘要

近期工作开始为视觉-语言-动作(VLA)策略配备显式的中间推理。然而,在具身控制中,文本思维链并不适用:无关或弱文本信息会干扰动作预测,而自回归文本解码为实时闭环执行增加了过多延迟。我们提出VISUALTHINK-VLA,一个用于准确、低延迟VLA策略的视觉中间推理框架。我们的引导哲学是通过有效的视觉思维来指导动作:VISUALTHINK-VLA通过一个紧凑的视觉证据接口引导动作预测,该接口在避免解码开销的同时保持空间精度。此外,为了进一步提升性能和效率,VISUALTHINK-VLA采用了一种定制的选择性路由机制来学习视觉证据令牌,从而实现低延迟推理同时保持高容量专用性。我们还引入了VisualEvidence-Kit,这是一个以VisualEvidence-Agent为核心的监督与审计资源,该智能体构建了754.7k条VLA指令的VisualEvidence-Set,用于路由监督和反事实忠实性测试。在多个基准测试和真实机器人评估中,VISUALTHINK-VLA在大多数基准测试上实现了最高成功率,同时将推理增强基线的多秒延迟降至亚秒级。例如,在BridgeData V2上,它将步骤延迟从ECoT的8.377秒降至0.367秒,实现了22.8倍的加速。

英文摘要

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, while autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VISUALTHINK-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our bootstrapping philosophy is to guide action with effective visual thinking: VISUALTHINK-VLA bootstraps action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. In addition, to further improve performance and efficiency, VISUALTHINK-VLA adopts a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on a VisualEvidence-Agent that constructs a VisualEvidence-Set of 754.7k VLA instructions for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VISUALTHINK-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces step latency from 8.377 s with ECoT to 0.367 s, a 22.8x speedup.

2605.30010 2026-05-29 cs.CV

EarlyTom: Early Token Compression Completes Fast Video Understanding

EarlyTom: 早期令牌压缩实现快速视频理解

Hesong Wang, Xin Jin, Lu Lu, Chenhaowen Li, Jian Chen, Qiang Liu, Huan Wang

发表机构 * Zhejiang University(浙江大学) Westlake University(西湖大学) Alibaba Cloud Computing(阿里云计算)

AI总结 针对视频大语言模型中视觉编码阶段效率低下的问题,提出EarlyTom无训练令牌压缩框架,通过在视觉编码器内部进行早期压缩,显著降低首令牌延迟并提升吞吐量。

Comments Accepted by CVPR 2026. 16 pages, 8 figures, 8 tables. Project page: https://viridisgreen.github.io/EarlyTom

详情
AI中文摘要

视频大语言模型(Video-LLMs)在视频理解任务中展现了强大的能力。然而,处理大量视觉令牌带来的低效率仍然阻碍了它们的实际部署。尽管近期的方法在保持与全令牌基线相当准确性的同时实现了极低的令牌保留率,但大多数方法仅在预填充的后期阶段进行压缩,视觉编码器的效率未得到优化。在本文中,我们首先表明视觉编码对首令牌时间(TTFT)贡献很大。因此,与仅在视觉编码器之后压缩视觉令牌不同,在编码器内部进行压缩仍有很大的探索空间。基于这一见解,我们提出了EarlyTom,一种无训练的令牌压缩框架,在视觉编码器内部执行早期视觉令牌压缩,从而显著降低TTFT并提高吞吐量。此外,我们引入了一种解耦的空间令牌选择策略,提高了整体压缩效果。在单个NVIDIA A100 GPU上,对于LLaVA-OneVision-7B模型,EarlyTom将TTFT降低高达2.65倍,FLOPs降低高达61%,同时保持与全令牌基线相当的准确性。这些改进显著增强了Video-LLMs在实际生产场景中部署的实用性。

英文摘要

Video large language models (Video-LLMs) have demonstrated strong capabilities in video understanding tasks. However, their practical deployment is still hindered by the inefficiency introduced by processing massive amounts of visual tokens. Although recent approaches achieve extremely low token retention ratios while maintaining accuracy comparable to full-token baselines, most of them perform compression only at the late stage of prefilling, leaving the efficiency of the vision encoder unoptimized. In this paper, we first show that vision encoding contributes a large portion to the time-to-first-token (TTFT). Therefore, instead of compressing visual tokens only after the vision encoder, performing compression inside the encoder still leaves substantial room for exploration. Based on this insight, we propose EarlyTom, a training-free token compression framework that performs early-stage visual token compression inside the vision encoder, enabling significantly better TTFT reduction and higher throughput. In addition, we introduce a decoupled spatial token selection strategy that improves the overall compression effectiveness. EarlyTom reduces TTFT by up to 2.65x and FLOPs by up to 61% on a single NVIDIA A100 GPU for the LLaVA-OneVision-7B model, while maintaining accuracy comparable to the full-token baseline. These improvements substantially enhance the practicality of deploying Video-LLMs in real-world production scenarios.

2605.30002 2026-05-29 cs.AI

KairosAgent: Agentic Time Series Forecasting with Fused Semantic Reasoning

KairosAgent:融合语义推理的智能体时间序列预测

Kun Feng, Ziwei Shan, Yuchen Fang, Yiyang Tan, Sihan Lu, Shuqi Gu, Lintao Ma, Xingyu Lu, Kan Ren

发表机构 * School of Information Science and Technology, ShanghaiTech University(信息科学与技术学院,上海科技大学) Ant Group(蚂蚁集团)

AI总结 提出KairosAgent框架,通过结合基于LLM的推理器和基于TSFM的预测器,并引入强化学习范式,实现跨模态时间序列的零样本预测。

详情
AI中文摘要

跨领域多模态时间序列预测是一项具有挑战性的任务,要求模型整合精确的数值理解、跨领域语义理解和有效的多模态融合。现有方法要么从头构建时间序列基础模型(TSFM),要么利用预训练的大语言模型(LLM)。然而,TSFM通常忽略语义理解且缺乏面向未来的语义推理能力,而LLM在数值理解和准确的定量预测方面存在困难。为克服这些限制,我们提出KairosAgent,一种用于多模态时间序列预测的新型智能体框架,包括基于LLM的推理器和基于TSFM的预测器。KairosAgent通过动态调用分析工具来增强LLM的数值理解和语义推理能力,从而统一文本推理和数值预测。推理结果随后融合到TSFM流程中,实现更准确可靠的未来预测。为进一步改进推理,我们整理了一个大规模高质量轨迹语料库,并引入了一种基于预测的强化学习范式,包含多轮细化和轮次级别信用分配。实验表明,KairosAgent在最大化预训练LLM和TSFM效用的同时,实现了卓越的零样本预测性能,为高效且可解释的时间序列智能体提供了有前景的方向。项目页面位于https://foundation-model-research.github.io/KairosAgent。

英文摘要

Cross-domain multimodal time series forecasting is a challenging task, requiring models to integrate precise numerical comprehension, cross-domain semantic understanding, and effective multimodal fusion. Existing approaches either build Time Series Foundation Models (TSFMs) from scratch or leverage pretrained Large Language Models (LLMs). However, TSFMs often overlook semantic understanding and lack the ability to perform future-oriented semantic reasoning, while LLMs struggle with numerical comprehension and accurate quantitative forecasting. To overcome these limitations, we propose KairosAgent, a novel agentic framework for multimodal time series forecasting that comprises an LLM-based reasoner and a TSFM-based forecaster. KairosAgent unifies textual reasoning and numerical forecasting by dynamically invoking analytical tools to enhance the numerical understanding and semantic reasoning capabilities of LLMs. The reasoning results are subsequently fused into the TSFM pipeline, enabling more accurate and reliable future predictions. To further improve the reasoning, we curate a large-scale corpus of high-quality trajectories and introduce a reinforcement-learning-from-forecasting paradigm with multi-turn refinement and turn-level credit assignment. Experiments demonstrate that KairosAgent achieves superior zero-shot forecasting performance while maximizing the utility of pretrained LLMs and TSFMs, presenting a promising direction for efficient and interpretable time series agents. The project page is at https://foundation-model-research.github.io/KairosAgent.

2605.29997 2026-05-29 cs.CV

FRUC: Feedforward Dynamic Scene Reconstruction from Uncalibrated Collaborative Driving Views

FRUC:来自未标定协作驾驶视图的前馈动态场景重建

Yihang Tao, Yu Guo, Zhengru Fang, Haonan An, Yuguang Fang

发表机构 * Hong Kong JC STEM Lab of Smart City, City University of Hong Kong(香港JC STEM智慧城市实验室,香港城市大学)

AI总结 提出FRUC框架,基于前馈3D高斯泼溅和视觉几何Transformer,从未标定的多车协作视图实现动态场景的一次性、免标定重建,通过自中心因果遮挡场和零初始化残差去噪实现非破坏性几何补充。

详情
AI中文摘要

我们提出了FRUC,一个用于从未标定协作驾驶视图进行动态场景重建的前馈3D高斯泼溅框架。现有的多智能体重建框架常常受到严格先决条件的阻碍,需要精确的空间标定和缓慢的逐场景优化。在本文中,我们通过将分布式多车辆网络概念化为一个时空非结构化的自中心多相机系统来重新思考这一任务,其核心挑战在于在不降低自中心准确观测到的可见几何的情况下,通过协作增强自中心遮挡几何,同时保持重建效率。为了实现高效重建,FRUC基于视觉几何Transformer骨干网络,支持从灵活数量的多车辆视图进行一次性、免标定推理。为了在未标定的跨智能体错位下实现非破坏性几何补充,FRUC首先引入了一个自中心因果遮挡场,通过建模智能体时空相关性,将遮挡演化显式推导为潜在先验。在这些遮挡先验的指导下,它进一步将跨智能体集成公式化为一个通过零初始化注入的确定性残差去噪过程,将具有挑战性的跨智能体融合转化为有界残差学习,以实现鲁棒的协作盲点补全。通过在真实世界V2XReal和UrbanIng-V2X数据集上的广泛评估,FRUC被证明是动态协作驾驶环境场景重建的新最先进方法,在渲染质量和效率上均显著优于现有方法。

英文摘要

We present FRUC, a feed-forward 3D Gaussian splatting framework for dynamic scene reconstruction from uncalibrated collaborative driving views. Existing multi-agent reconstruction frameworks are often hindered by rigid prerequisites, demanding precise spatial calibration and slow per-scene optimization. In this paper, we rethink this task by conceptualizing a distributed multi-vehicle network as a spatio-temporally unstructured ego-centric multi-camera system, where the core challenge lies in enhancing ego-centric occluded geometry through collaboration without degrading the ego's accurately observed visible geometry, while preserving reconstruction efficiency. For efficient reconstruction, FRUC is built upon a visual grounded geometric Transformer backbone to enable one-shot, calibration-free inference from a flexible number of multi-vehicle views. To achieve non-destructive geometric supplementation under uncalibrated cross-agent misalignment, FRUC first introduces an ego-centric causal occlusion field that explicitly derives occlusion evolution as latent priors by modeling agent-wise spatio-temporal correlations. Guided by these occlusion priors, it further formulates cross-agent integration as a deterministic residual denoising process via zero-initialized injection, turning challenging cross-agent fusion into bounded residual learning for robust collaborative blind-spot completion. Through extensive evaluations on the real-world V2XReal and UrbanIng-V2X datasets, FRUC is shown to be a new state-of-the-art for the scene reconstruction of dynamic collaborative driving environments, significantly outperforming existing methods in both rendering quality and efficiency.

2605.29992 2026-05-29 cs.CL

Adapting Multilingual Embedding Models to Turkish via Cross-Lingual Tokenizer Surgery and Offline Distillation

通过跨语言分词器手术和离线蒸馏使多语言嵌入模型适应土耳其语

M. Ali Bayram, Banu Diri, Savaş Yıldırım

发表机构 * Yıldız Technical University(Yıldız技术大学) Istanbul Bilgi University(伊斯坦布尔比尔格大学)

AI总结 提出一种高效的三阶段适应流程,通过跨语言分词器优化、教师模型克隆和离线蒸馏,构建了土耳其语句子嵌入模型embeddingmagibu-200m,在STSbTR上超越教师模型,并在TR-MTEB上以更少参数达到竞争性能。

Comments 14 pages, 2 figures, 4 tables, Appendix included

详情
AI中文摘要

句子嵌入是语义搜索、聚类、分类和检索增强生成的基础组件。本文提出了embeddingmagibu-200m,一个专注于土耳其语的句子嵌入模型,生成768维L2归一化向量,支持8192个token的上下文窗口,远超早期基于BERT的土耳其语编码器的512 token限制。无需完整预训练,引入了一个高效的三阶段适应流程:(1) 通过从教师词汇表中修剪冗余token,并基于40语言语料库的频率分析纳入多语言token,构建一个词汇量为131,072的土耳其语优化多语言分词器;(2) 克隆教师嵌入模型,同时保留transformer骨干权重,并通过均值组合token映射为新的词汇表初始化兼容的嵌入表;(3) 使用余弦相似度目标,在平衡的40语言维基百科语料库上,从预计算的教师向量进行离线嵌入蒸馏。得到的student模型约有2亿参数,在单个GPU上训练约四小时,通过避免训练期间的在线教师推理,总成本为5-20美元。实验表明,在STSbTR上,Pearson/Spearman相关系数达到77.55%/77.45%,超过了3亿参数的教师模型(73.84%/72.92%)。在TR-MTEB(26个任务)上,平均得分为63.9%(在26个模型中排名第7),提供了有竞争力的成本-质量权衡,参数比教师少33%。为促进可复现性和下游使用,所有工件均已发布,包括模型权重、分词器文件、预计算嵌入数据集以及开源克隆和蒸馏工具。

英文摘要

Sentence embeddings are a foundational component for semantic search, clustering, classification, and retrieval-augmented generation. This paper presents embeddingmagibu-200m, a Turkish-focused sentence embedding model that produces 768-dimensional L2-normalized vectors and supports an 8,192-token context window, far exceeding the 512-token limit of earlier BERT-based Turkish encoders. Instead of full pretraining, an efficient three-stage adaptation pipeline is introduced: (1) construct a Turkish-optimized multilingual tokenizer with a 131,072 vocabulary by pruning redundant tokens from the teacher's vocabulary and incorporating multilingual tokens via frequency analysis on a 40-language corpus, (2) clone a teacher embedding model while preserving transformer backbone weights and initializing a compatible embedding table for the new vocabulary via mean-composition token mapping, and (3) perform offline embedding distillation from precomputed teacher vectors using a cosine similarity objective over a balanced 40-language Wikipedia corpus. The resulting student model contains approximately 200M parameters and trains in roughly four hours on a single GPU by avoiding online teacher inference during training, at a total cost of $5-$20. Empirically, Pearson/Spearman correlations of 77.55%/77.45% are obtained on STSbTR, surpassing the 300M-parameter teacher model (73.84%/72.92%). On TR-MTEB (26 tasks), a mean score of 63.9% is achieved (7th out of 26 models), providing a competitive cost-quality trade-off with 33% fewer parameters than the teacher. To facilitate reproducibility and downstream use, all artifacts are released including model weights, tokenizer files, precomputed embedding datasets, and open-source cloning and distillation tooling.
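As a rough illustration of stage (3) above, the sketch below computes a cosine-similarity distillation loss between student embeddings and teacher vectors that were computed once and stored. It is a minimal sketch, not the authors' released tooling: the function name `cosine_distillation_loss`, the batch size, and the random stand-in vectors are assumptions made here for illustration; only the 768-dimensional width and the cosine objective come from the abstract.

```python
import numpy as np


def cosine_distillation_loss(student: np.ndarray, teacher: np.ndarray,
                             eps: float = 1e-12) -> float:
    """Mean of (1 - cosine similarity) over row-aligned student/teacher vectors."""
    # L2-normalize each row; eps guards against division by zero.
    s = student / np.maximum(np.linalg.norm(student, axis=1, keepdims=True), eps)
    t = teacher / np.maximum(np.linalg.norm(teacher, axis=1, keepdims=True), eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))


# Hypothetical stand-ins: in an offline setup the teacher vectors would be
# read from disk rather than produced by running the teacher during training.
rng = np.random.default_rng(0)
teacher_vectors = rng.normal(size=(4, 768))
student_vectors = teacher_vectors + 0.1 * rng.normal(size=(4, 768))

loss = cosine_distillation_loss(student_vectors, teacher_vectors)
print(f"distillation loss: {loss:.4f}")
```

Because the teacher vectors are fixed inputs, each training step only needs a forward and backward pass through the student, which is consistent with the abstract's point about avoiding online teacher inference.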

2605.29986 2026-05-29 cs.AI

Accelerating Constrained Decoding with Token Space Compression

通过词元空间压缩加速受限解码

Michael Sullivan, Alexander Koller

发表机构 * Department of Language Science and Technology(语言科学与技术系) Saarland Informatics Campus(萨尔兰州信息学校区) Saarland University(萨尔兰大学)

AI总结 提出CFGzip离线压缩词元搜索空间,大幅降低上下文无关文法约束解码的开销,实现高达两个数量级的延迟减少和7.5倍的总生成速度提升。

Comments 13 pages; 5 figures; under review at EMNLP 2026

详情
AI中文摘要

为了保证LLM的输出符合指定结构,上下文无关文法(CFG)解码引擎强制选择能够产生符合给定CFG的字符串的下一个词元。虽然当前的CFG受限解码引擎已经高度优化,但由于每一步搜索空间(即整个词元词汇表)巨大,导致对于更复杂的CFG会产生难以承受的高开销——而这正是CFG引擎最有用的情况。在本文中,我们引入了CFGzip,一种离线压缩词元搜索空间的技术,它大幅减少了CFG引擎的开销。实验中,我们报告了当CFGzip与最先进的语法引擎一起使用时,延迟减少高达两个数量级,在总受限生成时间上实现了高达7.5倍的加速:借助CFGzip,受限解码现在可以大规模应用于复杂CFG。

英文摘要

To guarantee that an LLM's outputs conform to a specified structure, context-free grammar (CFG) decoding engines force the selection of next tokens that produce strings that conform to a given CFG. While current CFG-constrained decoding engines are highly optimized, the inherent costs arising from the massive per-step search space -- i.e. the entire token vocabulary -- result in intractably high overhead for more complex CFGs: precisely the situation where CFG engines are most useful. In this paper, we introduce CFGzip, an offline technique for compressing the token search space, which massively reduces CFG engine overhead. In experiments, we report latency reduction of up to two orders of magnitude when CFGzip is used with a SoTA grammar engine, yielding an up to 7.5x speedup in total constrained generation time: with CFGzip, constrained decoding is now feasible at scale for complex CFGs.
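To make the per-step cost described above concrete, the sketch below shows a naive grammar-constrained greedy step that scans the whole vocabulary to find the tokens keeping the output a viable prefix. It is an illustrative baseline only and does not implement CFGzip, whose compression scheme the abstract does not detail. The toy balanced-parentheses grammar, the five-token vocabulary, and every function name are assumptions introduced here.

```python
import math


def is_viable_prefix(text: str) -> bool:
    """True if text can still be extended to a balanced-parentheses string."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        else:
            return False  # character outside the toy grammar's alphabet
    return True


def allowed_token_ids(prefix: str, vocab: list[str]) -> set[int]:
    # Naive approach: test every vocabulary entry at every decoding step.
    return {i for i, tok in enumerate(vocab) if is_viable_prefix(prefix + tok)}


def constrained_greedy_step(prefix: str, vocab: list[str],
                            logits: list[float]) -> int:
    """Pick the highest-scoring token among those the grammar allows."""
    allowed = allowed_token_ids(prefix, vocab)
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    return max(range(len(vocab)), key=lambda i: masked[i])


vocab = ["(", ")", "()", "))", "a"]
logits = [0.1, 0.2, 0.3, 2.0, 5.0]  # the model prefers tokens the grammar forbids
next_id = constrained_greedy_step("(", vocab, logits)
print(vocab[next_id])
```

The scan in `allowed_token_ids` grows with vocabulary size and is repeated at every step, which is the search-space cost the abstract says an offline compression of the token space is meant to reduce.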

2605.29983 2026-05-29 cs.LG cs.CV

Improving Adversarial Robustness of Attribution via Implicit Regularization

通过隐式正则化提高归因的对抗鲁棒性

Amir Mehrpanah, Matteo Gamba, Hossein Azizpour

发表机构 * Department of Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden(瑞典皇家理工学院计算机科学系) Science for Life Laboratory, Stockholm, Sweden(瑞典斯德哥尔摩生命科学实验室) Department of Computer Science, Brown University, USA(美国布朗大学计算机科学系)

AI总结 本文发现标准随机梯度下降的学习动态可以隐式地提高归因的对抗鲁棒性,并证明在softmax归一化下注意力归因的鲁棒性提升受限,而基于核的注意力可恢复鲁棒性。

Comments 39 pages, 22 figures, to be published in International Conference on Machine Learning 2026

详情
AI中文摘要

归因的对抗鲁棒性是深度学习中可靠可解释性的基本要求,但现有方法通常依赖计算昂贵的显式正则化。在这项工作中,我们表明归因鲁棒性可以从标准随机梯度下降的学习动态中隐式产生。我们通过参数空间和输入空间曲率之间的联系从理论上论证了这种效应,并在各种架构、数据集和归因方法上进行了验证,计算开销可忽略不计。相反,我们证明由于固有的熵约束,这种鲁棒性提升通常不会转移到softmax归一化下的注意力归因,并通过实验验证了这一局限性。最后,我们表明用基于核的注意力替换softmax注意力可以恢复Transformer模型中的鲁棒性提升。我们的结果突出了学习动态作为鲁棒可解释性的一种原则性且实用的机制,并揭示了归一化下注意力归因的基本局限性。

英文摘要

The adversarial robustness of attributions is a fundamental requirement for reliable explainability in deep learning, yet existing approaches typically rely on computationally expensive explicit regularization. In this work, we show that attribution robustness can arise implicitly from the learning dynamics of standard stochastic gradient descent. We theoretically motivate this effect through connections between parameter-space and input-space curvature, and validate it across architectures, datasets, and attribution methods, with negligible computational overhead. In contrast, we prove that such robustness gains often do not transfer to attention-based attribution under softmax normalization, due to inherent entropy constraints, and we validate this limitation experimentally. Finally, we show that replacing softmax attention with kernel-based attention restores the robustness gains in transformer models. Our results highlight learning dynamics as a principled and practical mechanism for robust explainability, and reveal fundamental limitations of attention-based attribution under normalization.

2605.29980 2026-05-29 cs.CV cs.AI cs.LG

Genetically Aligned Patient Representations Improve Hematological Diagnosis

基因对齐的患者表示改善血液学诊断

Muhammed Furkan Dasdelen, Fatih Ozlugedik, Ilaria Looser, Rao Muhammad Umer, Christian Pohlkamp, Carsten Marr

发表机构 * Institute of AI for Health, Helmholtz Munich, Germany; International School of Medicine, Istanbul Medipol University, Türkiye; Munich Leukemia Laboratory, Germany; Department of Medicine III, Ludwig-Maximilian-University Hospital, Germany; Department of Physics, University of Munich, Germany; Munich Center for Machine Learning (MCML), Germany; DKTK, German Cancer Consortium, Germany

AI总结 提出一种两阶段框架,通过自监督视觉预训练和监督对比学习对齐白细胞图像与染色体畸变及体细胞突变,提升血液学诊断性能。

Comments Accepted for publication at the 29th International Conference on Medical Image Computing and Computer Assisted Intervention - MICCAI 2026

详情
AI中文摘要

组织病理学编码器与转录组和基因组数据的多模态对齐已被证明能显著提高下游诊断任务的性能。血液学细胞学的独特之处在于,视觉单细胞评估通常与细胞遗传学和分子遗传学相结合用于血癌诊断。在本研究中,我们提出了一个框架,将单个白细胞图像与染色体畸变(核型)以及来自靶向基因面板的体细胞突变对齐。我们的训练策略采用两阶段方法:(i)在超过1500名患者的队列上,使用iBOT头进行自监督、仅视觉的Transformer聚合器预训练;(ii)通过急性髓系白血病患者的监督对比损失进行基因对齐。我们的基因对齐患者编码器改善了血液学诊断任务,优于切片级组织病理学基础模型。此外,该模型为疾病和遗传改变提供了即用型检索能力。将遗传数据纳入患者编码器提高了患者表示的质量,提供了一个与临床诊断工作流程对齐的框架,并为未来的多模态血液学特定AI铺平了道路。代码和模型权重可在https://github.com/marrlab/GenBloom获取。

英文摘要

Multimodal alignment of histopathology encoders with transcriptomic and genomic data has been shown to significantly improve performance in downstream diagnostic tasks. Hematological cytology is unique in that visual single-cell evaluation is often paired with cytogenetics and molecular genetics for blood cancer diagnosis. In this study, we present a framework to align single white blood cell images with chromosomal aberrations (karyotype) and somatic mutations from targeted gene panels. Our training strategy follows a two-stage approach: (i) self-supervised, vision-only pretraining of a transformer aggregator using an iBOT head on a cohort of over 1500 patients, and (ii) genetic alignment via supervised contrastive loss on acute myeloid leukemia patients. Our genetically aligned patient encoder improves hematological diagnostic tasks, outperforming slide-level histopathology foundation models. Additionally, the model provides off-the-shelf retrieval capabilities for diseases and genetic alterations. Incorporating genetic data into patient encoders increases the quality of patient representations, providing a framework that aligns with clinical diagnostic workflows and paves the way for future multimodal hematology-specific AI. The code and model weights are available at https://github.com/marrlab/GenBloom.

2605.29975 2026-05-29 cs.LG eess.SP

A Fully Convolutional Approach to Denoising Structural Dynamics Data from X-Ray Photon Correlation Spectroscopy

一种全卷积方法用于X射线光子相关光谱中结构动力学数据的去噪

Nisar Nellikunnummel, Andi Barbour, Lutz Wiegart, Tatiana Konstantinova, Anthony DeGennaro

发表机构 * Amazon(亚马逊) GE Aerospace Research(通用电气航空航天研究)

AI总结 提出全卷积去噪自编码器(FC-DAE),用于去噪X射线光子相关光谱中的双时间强度-强度相关函数,支持任意输入尺寸,在低信噪比条件下恢复复杂动力学特征并保持结构保真度。

详情
AI中文摘要

我们提出了一种全卷积去噪自编码器(FC-DAE),用于去噪X射线光子相关光谱(XPCS)中的双时间强度-强度相关函数($C_2$)。与通常限制为固定输入尺寸的传统去噪自编码器不同,FC-DAE接受任意维度的输入,同时保留不同动力学范围内的相关结构。该模型使用在NSLS-II光束线收集的实验$C_2$数据进行训练,并应用数据增强来扩展数据集的多样性并减少过拟合。FC-DAE在低信噪比条件下成功恢复复杂的动力学特征,同时保持结构保真度。为了评估重建可靠性,我们采用定量指标来评估结构保真度并识别潜在的模型引入偏差。我们的结果表明,FC-DAE提供了具有高计算效率的鲁棒去噪性能,使得在光子受限和低剂量测量条件下恢复XPCS动力学成为可能。

英文摘要

We present a fully convolutional denoising autoencoder (FC-DAE) for denoising two-time intensity-intensity correlation functions ($C_2$) in X-ray photon correlation spectroscopy (XPCS). Unlike conventional denoising autoencoders that are typically restricted to fixed input sizes, the FC-DAE accepts inputs of arbitrary dimensions while preserving correlation structures across diverse dynamical regimes. The model is trained using experimentally derived $C_2$ data collected at NSLS-II beamlines, with data augmentation applied to expand the diversity of the dataset and reduce overfitting. The FC-DAE successfully recovers intricate dynamical features in low signal-to-noise conditions while maintaining structural fidelity. To assess reconstruction reliability, we employ quantitative metrics to evaluate structural fidelity and identify potential model-induced bias. Our results demonstrate that the FC-DAE provides robust denoising performance with high computational efficiency, enabling recovery of XPCS dynamics under photon-limited and low-dose measurement conditions.

2605.29971 2026-05-29 cs.CL

Causal Interventions on Continuous Variables: A Case Study on Verb Bias in Steering Vectors for In-Context Learning

连续变量的因果干预:以上下文学习中转向向量的动词偏向为例

Zhenghao Herbert Zhou, R. Thomas McCoy, Robert Frank

发表机构 * Yale University(耶鲁大学)

AI总结 提出一种对连续变量进行因果干预的方法,通过定位低维方向并编辑向量实现反事实目标值,应用于动词偏向特征,证明其在语言模型中的因果表示,并探讨与上下文学习的关系。

详情
AI中文摘要

语言模型表示中的因果干预主要针对离散特征,如语法数。然而,语言模型也必须利用分级特征。我们引入了一种对连续变量进行因果干预的方法:给定与分级目标变量配对的激活向量,我们定位该变量的低维方向,并使用该方向将向量编辑为反事实目标值。我们将此方法应用于心理语言学中研究充分的连续特征,即动词偏向(反映给定动词后倾向于出现哪种句法结构)。我们表明,动词偏向因果地表示在从大型语言模型中提取的转向向量中:对动词偏向的反事实编辑系统地改变了下游结构偏好。动词偏向此前也与上下文学习相关联;在进一步分析中,我们发现转向向量编码了可能驱动上下文学习中观察到的误差驱动更新行为的误差信号,但这些转向向量的方面在下游生成中并未被因果使用。总体而言,这些结果表明因果干预可以应用于连续变量,尽管将连续变量与上下文学习联系起来仍然是一个挑战。

英文摘要

Causal interventions in language model representations have largely targeted discrete features, like grammatical number. However, language models must also make use of features that are graded. We introduce a method for causal intervention on continuous variables: given activation vectors paired with a graded target variable, we localize a low-dimensional direction for that variable and use this direction to edit vectors toward counterfactual target values. We apply this method to a continuous feature that is well-studied in psycholinguistics, namely verb bias (which reflects which syntactic structures tend to follow a given verb). We show that verb bias is causally represented in steering vectors extracted from large language models: counterfactual edits to verb bias systematically shift downstream structural preferences. Verb bias has also previously been linked to in-context learning; in further analyses, we find that steering vectors encode error signals that could drive the error-driven update behavior seen in in-context learning but that these aspects of the steering vectors are not causally used in downstream production. Overall, these results show causal interventions can be applied to continuous variables, though connecting continuous variables to in-context learning remains a challenge.

2605.29966 2026-05-29 cs.AI

Compass: Navigating Global Marine Lead Data Integration through Expert-Guided LLM Agent

Compass: 通过专家引导的LLM代理导航全球海洋铅数据整合

Yiming Liu, Bin Lu, Meng Jin, Ziyuan Sang, Shuo Jiang, Lei Zhou, Xinbing Wang, Chenghu Zhou, Jing Zhang

发表机构 * School of Information Science Electronic Engineering, Jiao Tong University, Shanghai, China; School of Artificial Intelligence, Jiao Tong University, Shanghai, China; State Key Laboratory of Estuarine Coastal Research, China Normal University, Shanghai, China; School of Oceanography, Jiao Tong University, Shanghai, China; Institute of Geographical Science Natural Resources Research, Academy of Sciences, Beijing, China

AI总结 针对海洋铅数据分散于非结构化论文中的问题,提出专家引导的LLM代理框架Compass,结合知识树分解任务,从23万篇论文中提取3751条铅记录,构建最大海洋铅数据库,准确率达92%。

详情
AI中文摘要

海洋铅及其同位素是海洋环流和人为污染的关键示踪剂,然而实地观测仍然成本高昂且稀疏。尽管存在大量历史记录,但它们被埋藏在学术论文的非结构化内容中,形成了无法进行综合分析的数据孤岛。手动提取不可扩展,而通用大语言模型缺乏必要的领域特定知识,导致幻觉和科学上无效的输出。为了解决这个问题,我们引入了一种专家引导的适应方法,使LLM能够在不进行微调的情况下执行严格的科学数据提取。我们通过Compass(一个由与海洋科学家共同设计的知识树增强的LLM代理框架)来实现这种方法,该框架将复杂任务分解为可验证的步骤,引导代理的推理以确保科学有效性。将Compass应用于超过23万篇相关开放获取论文的语料库,我们成功提取了3751条先前未纳入的铅记录。这项工作建立了迄今为止最大的综合海洋铅数据库。除了标准指标外,Compass通过多层验证展示了卓越的可靠性,经专家手动验证确认准确率达到92%。新整合的数据扩展了先前采样不足区域(如东海和南大洋)的覆盖范围,为未来的科学发现提供了丰富的数据基础。我们发布了一个交互式可视化平台以促进开放科学访问。我们的工作表明,专家引导的代理可以有效弥合通用LLM与高风险科学领域之间的差距,实现地球科学中的可扩展数据发现。

英文摘要

Marine lead (Pb) and its isotopes are critical tracers for ocean circulation and anthropogenic pollution, yet in-situ observations remain costly and sparse. While vast historical records exist, they lie buried within the unstructured content of academic papers, creating "data silos" inaccessible to comprehensive analysis. Manual extraction is unscalable, while general-purpose Large Language Models (LLMs) lack the necessary domain-specific knowledge, leading to hallucinations and scientifically invalid outputs. To address this, we introduce an expert-guided adaptation approach that enables LLMs to perform rigorous scientific data extraction without fine-tuning. We operationalize this approach through Compass, an LLM agent framework enhanced by a Knowledge Tree co-designed with marine scientists, which decomposes complex tasks into verifiable steps, guiding the agent's reasoning to ensure scientific validity. Deploying Compass across a corpus of over 230,000 relevant open-access papers, we successfully extract 3,751 previously unincorporated Pb records. This effort establishes the largest integrated marine Pb database to date. Beyond standard metrics, Compass demonstrates superior reliability through multi-layered validation, achieving 92% accuracy as confirmed through expert manual verification. The newly integrated data expand coverage in previously under-sampled regions such as the East China Sea and the Southern Ocean, providing an enriched data foundation for future scientific discoveries. We release an interactive visualization platform to facilitate open scientific access. Our work demonstrates that expert-guided agents can effectively bridge the gap between general-purpose LLMs and high-stakes scientific domains, enabling scalable data discovery in geosciences.

2605.29965 2026-05-29 cs.AI

Meta-Programming for Linear-time Temporal Answer Set Programming

线性时态回答集编程的元编程

Susana Hahn, Amade Nems, Javier Romero, Torsten Schaub

发表机构 * University of Potsdam, Germany(波恩大学)

AI总结 提出一种统一的元编程框架,通过扩展clingo的理论语法并引入转换管道保护嵌套模态,实现了对多种线性时态逻辑(TEL、MEL、DEL)的语义操作化,并开发了metasp系统。

详情
AI中文摘要

回答集编程(ASP)的时态扩展的发展导致了非单调线性时态(TEL)、动态(DEL)和度量(MEL)时态均衡逻辑的出现。然而,高度优化的ASP系统固有的刚性常常阻碍了替代逻辑设计的快速探索和实现。在这项工作中,我们提出了一个灵活的元编程框架,通过统一的声明性框架操作化各种时态逻辑的语义。我们的方法通过用形式类型规范和嵌套能力增强clingo的理论语法,扩展了标准ASP元编程。为了确保语义正确性,我们引入了一个转换管道,在实例化过程中保护嵌套模态免受基于稳定模型的简化。我们通过实现TEL、MEL和DEL的元编码来展示我们框架的可扩展性。我们提供了TEL的全面说明,并突出了管理MEL的区间约束和DEL中的Fischer-Ladner闭包的关键特性。最后,我们介绍了metasp系统,这是一个封装了此工作流程的多功能工具。

英文摘要

The development of temporal extensions of Answer Set Programming (ASP) has led to the emergence of non-monotonic linear-time (TEL), dynamic (DEL), and metric (MEL) temporal equilibrium logics. However, the inherent rigidity of highly optimized ASP systems often hinders the rapid exploration and implementation of alternative logical designs. In this work, we propose a flexible meta-programming framework that operationalizes the semantics of varied temporal logics through a unified, declarative framework. Our approach extends standard ASP meta-programming by augmenting clingo's theory grammar with formal type specifications and nesting capabilities. To ensure semantic correctness, we introduce a transformation pipeline that protects nested modalities from stable-model-based simplifications during grounding. We demonstrate the extensibility of our framework by implementing meta-encodings for TEL, MEL, and DEL. We provide a comprehensive account of TEL and highlight the key features for managing the interval constraints of MEL and the Fischer-Ladner closure in DEL. Finally, we introduce the metasp system, a versatile tool that encapsulates this workflow.

2605.29955 2026-05-29 cs.AI

Formalizing Mathematics at Scale

大规模形式化数学

Ahmad Rammal, Niket Patel, Fabian Gloeckle, Amaury Hayat, Julia Kempe, Remi Munos, Charles Arnal, Vivien Cabannes

发表机构 * FAIR at Meta(Meta的FAIR) New York University(纽约大学) Korea Institute for Advanced Study(韩国高级研究院)

AI总结 提出多智能体系统AutoformBot,利用LLM和形式化验证工具,自动将非正式教材翻译为Lean 4可验证代码,构建了包含超过45,000个声明和50万行代码的Atlas形式化库。

详情
AI中文摘要

我们提出了AutoformBot,一个用于在Lean 4中大规模构建自动形式化教材库(Atlas)的多智能体系统。AutoformBot协调数千个LLM智能体,配备形式化验证工具、依赖感知的任务调度和协作版本控制,将非正式的教材文本转化为机器可检查的定义和证明。我们将方法应用于26本开放获取教材,涵盖分析、代数、拓扑、组合学和概率论,生成了Atlas:一个包含超过45,000个Lean 4声明和50万行代码的已验证库。我们发布两个成果:(i)AutoformBot,开源的多智能体框架;(ii)Atlas,生成的形式化库。我们的结果表明,大规模自动形式化研究生级别数学的核心内容在经济和技术上现在是可行的。这为在研究层面上自动验证人类和机器生成的数学打开了大门。

英文摘要

We present AutoformBot, a multi-agent system for building an Autoformalized Textbook Library At Scale (Atlas) in Lean 4. AutoformBot orchestrates thousands of LLM agents, equipped with formal verification tools, dependency-aware task scheduling, and collaborative version control, to translate informal textbook prose into machine-checked definitions and proofs. We apply our methods to a corpus of 26 open-access textbooks spanning analysis, algebra, topology, combinatorics, and probability, producing Atlas: a verified library of over 45,000 Lean 4 declarations and 500 thousand lines of code. We release two artifacts: (i) AutoformBot, the open-source multi-agent framework; and (ii) Atlas, the resulting formal library. Our results suggest that autoformalizing the core content of graduate-level mathematics at scale is now economically and technically feasible. This opens the door to the automated verification of both human- and machine-generated mathematics at a research level.

2605.29954 2026-05-29 cs.CV

SwInception -- Local Attention Meets Convolutions

David Hagerman, Roman Naeem, Jakob Lindqvist, Carl Lindström, Fredrik Kahl, Lennart Svensson

Affiliations * Chalmers University of Technology; Zenseact

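AI summary Proposes the SwInception architecture, which strengthens the inductive bias of the Swin Transformer by introducing Inception blocks in its feed-forward layers and modifies the decoder to capture finer details with fewer parameters, improving segmentation performance on multiple medical datasets.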

Comments International Conference on Pattern Recognition and Artificial Intelligence, 2024

English abstract

Sparse vision transformers have gained popularity as efficient encoders for medical volumetric segmentation, with Swin emerging as a prominent choice. Swin uses local attention to reduce complexity and yields excellent performance for many tasks but still tends to overfit on small datasets. To mitigate this weakness, we propose a novel architecture that further enhances Swin's inductive bias by introducing Inception blocks in the feed-forward layers. The introduction of these multi-branch convolutions enables more direct reasoning over local, multi-scale features within the transformer block. We have also modified the decoder layers in order to capture finer details using fewer parameters. We demonstrate a performance improvement on eleven different medical datasets through extensive experimentation. We specifically showcase advancements over the previous state-of-the-art backbones on benchmark challenges like the Medical Segmentation Decathlon and Beyond the Cranial Vault. By showing that the existing inductive bias in Swin can be further improved, our work presents a promising avenue for enhancing the capabilities of sparse vision transformers for both medical and natural image segmentation tasks. Code and pre-trained weights can be accessed at https://github.com/Eiphodos/SwInception.

2605.29953 2026-05-29 cs.CV

Mesh-Aware Epipolar Matching for Multi-View Multi-Person 3D Pose Estimation in Basketball

Li Yin, Qin Haobin, Tomohiro Suzuki, Calvin Yeung, Mariko Isogawa, Keisuke Fujii

Affiliations * RIKEN Center for Advanced Intelligence Project

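AI summary Proposes MAEM, a training-free framework that uses a monocular 3D human mesh recovery model and a two-stage epipolar matching strategy to address occlusion and appearance similarity in multi-view multi-person 3D pose estimation in team sports scenes.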

English abstract

Multi-view multi-person 3D pose estimation in team sports scenarios remains challenging due to player occlusions, appearance similarity caused by team uniforms, and the scarcity of annotated multi-view data, all of which limit the effectiveness and generalization capability of learning-based methods. In contrast, the performance of training-free approaches is inherently constrained by the accuracy of 2D keypoint detection and the robustness of cross-view association. To address these challenges, we propose Mesh-Aware Epipolar Matching (MAEM), a training-free framework for multi-view multi-person 3D pose estimation. Our method employs a monocular 3D human mesh recovery model as the frontend and introduces a two-stage epipolar matching strategy based on the recovered mesh outputs. Specifically, the proposed framework combines disjoint-set-union-based clustering with per-joint triangulation to achieve robust cross-view association and accurate 3D pose reconstruction. Experiments on two public multi-view basketball datasets demonstrate that MAEM consistently outperforms existing training-free association baselines while achieving competitive RGB-only performance in both indoor and outdoor basketball scenarios. MAEM achieves MPJPE/PA-MPJPE scores of 59.8/40.7 mm on SportCenter EPFL and 74.0/51.8 mm on Human-M3 Basketball, highlighting the effectiveness of dense mesh geometry for cross-view association without requiring target-domain training or fine-tuning.