arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.31419 2026-06-12 cs.CV cs.RO Version update

Triangle Splatting SLAM

Nicholas Fry, Eric Dexheimer, Kirill Mazur, Paul H. J. Kelly, Andrew J. Davison

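Affiliations: Software Performance Optimisation Group; Department of Computing

AI summary: Proposes the first dense RGB-D SLAM system to use differentiable triangles as the 3D map representation; it tracks and maps through online differentiable rendering and supports real-time mesh conversion and editing.
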
Comments
26 pages, 11 figures

Abstract

We present a dense RGB-D SLAM system using differentiable triangles as the 3D map representation. While 3D Gaussian Splatting has emerged as the leading method for novel-view synthesis, triangles remain the standard primitive for traditional rendering hardware, game engines, and downstream tasks requiring explicit geometry such as simulation, collision, and editing. Recent offline methods have demonstrated that an unstructured 'triangle soup' can be optimised into a photorealistic mesh via Delaunay triangulation across a set of posed images. Building upon this insight, we present the first dense SLAM system to employ Triangle Splatting to perform both tracking and mapping through online differentiable rendering of a triangle soup. The map can be converted into a connected mesh on-the-fly via restricted Delaunay triangulation, enabling new online capabilities such as mesh deformation and collision checking. On Replica and TUM-RGBD, our system outperforms baselines on 3D geometry, matches the camera-tracking accuracy, and enables online mesh-based scene editing.

2605.28507 2026-06-12 cs.LG Version update

Universal Time Series Generation with Neural Controlled Differential Equations

Torben Berndt, Elyes Farjallah, Leif Seute, Raeid Saqur, Benjamin Walker, Jan Stühmer

Affiliations: Heidelberg Institute for Theoretical Studies; IAR, Karlsruhe Institute of Technology; Max Planck Institute for Polymer Research; IWR, Heidelberg University; Dept. of Computer Science, University of Toronto; Mathematical Institute, University of Oxford; Vector Institute, Toronto, Canada

AI summary: Proves that Structured Linear Controlled Differential Equations (SLiCEs) are universal time-series generators and proposes Generative SLiCEs (G-SLiCEs) for flow matching on path space, which perform well in probabilistic forecasting and downstream tasks, particularly on irregular grids.

Abstract

Recent work on the sequence universality of State Space Models (SSMs) has introduced efficient, maximally expressive continuous-time approaches for time-series modelling. While these works focus on discriminative settings, we extend this perspective to generative time-series modelling by proving that maximally expressive Structured Linear Controlled Differential Equations (SLiCEs) are universal time-series generators, in the sense that they can approximate the induced path laws of continuous causal pushforwards on compact latent sets in $W_\infty$. Building on these theoretical results, we propose Generative SLiCEs (G-SLiCEs), a maximally expressive continuous-time model for flow matching on path-space. Empirically, we show that expressivity improves performance in probabilistic forecasting and downstream tasks, while retaining the advantages of continuous-time models such as generalising to arbitrary observation grids. This is particularly beneficial for irregular grids, where fixed-grid models often struggle.

2605.29906 2026-06-12 cs.LG Version update

Plan, Don't Pose: Long Composite Motion Generation with Text-Aligned BFM

Nikolay Shvetsov, Maksim Bobrin, Nazar Buzun, Anton Bozhedarov, Dmitry V. Dylov

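Affiliations: AvaCapo; Potsdam University; Applied AI Institute; Computational Imaging Lab; AXXX; Innopolis University

AI summary: Proposes the Text2BFM framework, which aligns natural language with a pretrained Behavioral Foundation Model and generates long composite motions in the latent policy space without an end-to-end motion generator.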

Abstract

Text-to-motion (T2M) generation has broad applications in character animation, virtual avatars, and human-robot interaction. Existing methods typically generate pose trajectories or motion tokens directly from language, forcing a single model to handle semantic interpretation, long-horizon structure, and low-level physical realization. This coupling makes them costly and often unreliable for long, compositional, or semantically dense prompts. We propose Text2BFM, the first framework that aligns natural language with pretrained Behavioral Foundation Models (BFMs) for T2M generation without relying on heavy end-to-end motion generators. Text2BFM operates in the latent policy space of a frozen BFM, using it as an executable motion prior. A text-aligned variational behavioral bottleneck compresses BFM policy-latent sequences into compact motion representations that are compatible with language and preserve long-horizon behavioral structure. Generation is performed in this compact behavioral manifold with a lightweight conditional generator, and the resulting latent encoded behaviors are decoded into policy latents that drive the pretrained frozen BFM. By decoupling semantic planning from motion execution, Text2BFM achieves efficient, robust T2M generation and strong performance on long, compositional textual descriptions.

2601.01901 2026-06-12 cs.LG Version update

FedBiCross: Personalized One-Shot Federated Learning on Medical Images

Yuexuan Xia, Yinghao Zhang, Yalin Liu, Hong-Ning Dai, Yong Xia

Affiliations: School of Computer Science and Engineering, Northwestern Polytechnical University, China; School of Science and Technology, Hong Kong Metropolitan University, Hong Kong; Department of Computer Science, Hong Kong Baptist University, Hong Kong

AI summary: Proposes FedBiCross, which uses clustering, bi-level cross-cluster optimization, and personalized distillation to address weak knowledge distillation in one-shot federated learning under non-IID data, and outperforms existing methods on four medical image datasets.

Comments
Accepted by BlockSys 2026. This version of the contribution has been accepted for publication, after peer review (when applicable) but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections

Abstract

Data-free knowledge distillation-based one-shot federated learning (OSFL) trains a model in a single communication round without sharing raw data, making OSFL attractive for privacy-sensitive medical applications. However, existing methods aggregate predictions from all clients to form a global teacher. Under non-IID data, conflicting predictions dilute each other during averaging, yielding less informative soft labels that weaken distillation. We propose FedBiCross, a personalized OSFL framework with three stages: (1) clustering clients by model output similarity to form coherent sub-ensembles, (2) bi-level cross-cluster optimization that learns adaptive weights to selectively leverage beneficial cross-cluster knowledge while suppressing negative transfer, and (3) personalized distillation for client-specific adaptation. Experiments on four medical image datasets demonstrate that FedBiCross consistently outperforms state-of-the-art baselines across different non-IID degrees.
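
The dilution effect described in this abstract can be illustrated with a toy calculation. The sketch below is not FedBiCross; it only averages made-up client predictions (all numbers are hypothetical) and compares the entropy of the resulting soft label when clients conflict versus when they agree.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def average(dists):
    """Element-wise mean of several class distributions (a naive global teacher)."""
    n = len(dists)
    return [sum(col) / n for col in zip(*dists)]

# Hypothetical client predictions over three classes for one sample.
conflicting = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
agreeing = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1], [0.85, 0.1, 0.05]]

h_conflicting = entropy(average(conflicting))  # close to log(3): nearly uniform
h_agreeing = entropy(average(agreeing))        # much lower: still peaked
```

Averaging the conflicting clients gives an almost uniform soft label, which is the "less informative" teacher signal the abstract refers to; grouping similar clients first keeps the averaged label peaked.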

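2605.25225 2026-06-12 cs.LG cs.AI Version update

Transformer Field Theory: A Response-Theoretic Approach to Mechanistic Interpretability

Title (translated from Chinese): A Continuous-Depth Field Theory for Transformer Patching and Mechanistic Interpretability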

David N. Olivieri, Antonio F. Pérez Rodríguez

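Affiliations: Universidade de Vigo; Independent Researcher

AI summary: Proposes a field-theoretic framework that treats the residual stream as a depth-token field and organizes and predicts Transformer activation-patching interventions through localized source insertion, sensitivity-field prediction, empirical Green-function responses, and an adjoint variational problem; the forward response theory is tested in GPT-2-style autoregressive Transformers.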
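AI abstract (translated from Chinese)

Mechanistic interpretability commonly uses activation patching, causal tracing, path patching, and steering directions to reveal behaviorally meaningful subspaces in Transformer activation space. This paper develops a field-theoretic framework to organize and predict such interventions. Treating the residual stream as a depth-token field, we formulate patching as localized source insertion, patch effects as sensitivity-field predictions, downstream propagation as an empirical Green-function response, and patch selection as an adjoint variational problem. Empirically, we test the forward response theory by applying localized residual-field interventions in GPT-2-style autoregressive Transformers and observing the induced residual-field differences and logit-difference responses. We identify a bounded local linear regime; predict patch effects from first-order sensitivities across residual sites; measure structured anisotropic propagation across depth and token position; construct response descriptions from high-sensitivity sites and sliced Green operators; and show that prompt-induced residual displacements can transfer answer behavior. These results establish the response objects (sensitivities, propagation fields, and Green-operator slices) as a practical language for organizing patching experiments and as the forward mathematical basis for formulating patch-site inference and cross-scale transfer.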

Abstract

Mechanistic interpretability often studies Transformer behavior by intervening on internal activations through activation patching, causal tracing, path patching, and steering directions. This paper develops Transformer Field Theory: a response-theoretic framework in which the residual stream of a fixed forward pass is treated as a Transformer field over layer depth and token position. In this formulation, patching becomes a localized source insertion into the Transformer field, first-order sensitivity fields predict patch effects, Green functions describe downstream propagation, and patch selection is posed as an adjoint inverse problem. Empirically, we test the theory's forward response objects in GPT-2-style autoregressive Transformers. Localized Transformer-field interventions exhibit a bounded local linear regime; first-order sensitivities predict patch effects across layer-token sites; localized sources generate structured anisotropic Transformer-field propagation; high-sensitivity sites and sliced Green operators provide reduced response descriptions; and prompt-induced Transformer-field displacements partially transfer answer behavior. These results establish sensitivities, Transformer-field responses, and sliced Green operators as practical objects for organizing patching experiments, while providing the forward mathematical basis for patch-site inference and cross-scale response transfer.
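
The statement that first-order sensitivities predict patch effects can be written as a generic linear-response approximation. The notation below is an assumption made for illustration and is not taken from the paper: $h_{\ell,t}$ is the residual vector at layer $\ell$ and token $t$, $\delta h_{\ell,t}$ is a localized patch at that site, and $y$ is a scalar readout such as a logit difference.

```latex
% First-order prediction of the effect of a patch at a single site (\ell, t)
\Delta y \;\approx\; \left\langle \frac{\partial y}{\partial h_{\ell,t}},\; \delta h_{\ell,t} \right\rangle

% Downstream propagation to a later site (\ell', t'), with \ell' > \ell,
% through a Jacobian that plays the role of a Green function
\delta h_{\ell',t'} \;\approx\; G_{(\ell',t'),(\ell,t)}\, \delta h_{\ell,t},
\qquad
G_{(\ell',t'),(\ell,t)} \;=\; \frac{\partial h_{\ell',t'}}{\partial h_{\ell,t}}
```

Both approximations would be expected to hold only for small patches, which is consistent with the bounded local linear regime the abstract reports.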

2605.03460 2026-06-12 cs.AI cs.LG Version update

FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models

Seunghan Lee, Jun Seo, Jaehoon Lee, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, Soonyoung Lee, Wonbin Ahn

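Affiliations: LG AI Research

AI summary: To address the failure of time-series reasoning models in the financial domain, proposes FinSTaR, built on a 2 x 2 capability taxonomy, which uses Compute-in-CoT and Scenario-Aware CoT strategies to reach 78.9% average accuracy on FinTSR-Bench.
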
Comments
KDD Workshop on SciSoc Agents & LLMs 2026

Abstract

Time series (TS) reasoning models (TSRMs) have shown promising capabilities in general domains, yet they consistently fail in the financial domain, which exhibits unique characteristics. We propose a general 2 x 2 capability taxonomy for TSRMs by crossing 1) single-entity vs. multi-entity analysis with 2) assessment of the current state vs. prediction of future behavior. We instantiate this taxonomy in the financial domain, where the distinction between deterministic assessment and stochastic prediction is particularly critical, as ten financial reasoning tasks, forming the FinTSR-Bench benchmark based on S&P stocks. To this end, we propose FinSTaR (Financial Time Series Thinking and Reasoning), trained on FinTSR-Bench with distinct chain-of-thought (CoT) strategies tailored to each category. For assessment, which is deterministic (i.e., computable from observable data), we employ Compute-in-CoT, a programmatic CoT that enables models to derive answers directly from raw prices. For prediction, which is inherently stochastic (i.e., subject to unobservable factors), we adopt Scenario-Aware CoT, which generates diverse scenarios before making a judgment, mirroring how financial analysts reason under uncertainty. The proposed method achieves 78.9% average accuracy on FinTSR-Bench, substantially outperforming LLM and TSRM baselines. Furthermore, we show that the four capability categories are complementary and mutually reinforcing through joint training, and that Scenario-Aware CoT consistently improves prediction accuracy over standard CoT. Code is available at https://github.com/seunghan96/FinSTaR.

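2605.24488 2026-06-12 cs.CV cs.GR Version update

Appearance-Invariant Detection of Suggestive Motion via Laban Movement Descriptors

Title (translated from Chinese): Appearance-Invariant Detection of Suggestive Motion via Laban Movement Descriptors on SMPL Skeletons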

Jaehoon Ahn, Jeonghan Kong, Moon-Ryul Jung

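Affiliations: Sogang University

AI summary: Proposes a motion classification pipeline based only on SMPL skeleton trajectories and Laban Movement Analysis descriptors to detect suggestive and explicit movement, reaching 57.3% four-class accuracy across four tiers.
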
Comments
5 pages, 2 figures, 3 tables. Extended version of a poster accepted to SIGGRAPH 2026
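AI abstract (translated from Chinese)

Content moderation in online multiplayer 3D virtual environments has recently been handed over to automated, AI-based pipelines. However, the field mainly concerns detecting illegal content in images, video, and audio, leaving a blind spot in techniques for detecting suggestive motion. We propose a motion-only classification pipeline that uses Laban Movement Analysis (LMA) descriptors to detect suggestive and explicit movement from SMPL skeleton trajectories. On 20,514 motion clips (17+ hours) spanning four ordinal tiers (everyday, artistic, suggestive, explicit), logistic regression on 110 LMA features achieves 57.3% four-class accuracy (2.3 times chance), 72.1% three-class accuracy, and 78.7% binary SFW/NSFW accuracy. Classification errors are concentrated in adjacent tiers rather than non-adjacent ones. In addition, different movement qualities dominate at each tier of the taxonomy; no single feature drives the classification, indicating that the four-tier structure reflects genuinely distinct movement patterns.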

Abstract

Content moderation in online multiplayer 3D virtual environments is increasingly automated, yet detection has focused on images, video, and audio, leaving suggestive motion a blind spot. We present a motion-only classification pipeline that detects suggestive and explicit movement from SMPL skeleton trajectories using Laban Movement Analysis (LMA) descriptors. On a dataset spanning everyday, artistic, suggestive, and explicit movement (17+ hours of video), a logistic regression trained on 61-feature LMA descriptors reaches 68% binary SFW/NSFW accuracy (70% random forest) under a leak-free evaluation protocol. At this level, our descriptor performs comparably to a learned video model trained on the same motion re-rendered as appearance-free video, a gray figure with no clothing, skin, or scene. The indirectness (tortuosity) of each joint's trajectory, measured as the ratio of the joint's path length to its net displacement, peaks at the suggestive tier, showing that the Direct-to-Indirect polarity of Laban's Space factor provides an interpretable marker of the shift from functional to suggestive motion. Ultimately, Laban-based kinematic descriptors offer a lightweight, interpretable approach to suggestive-motion detection: every decision decomposes into named, theory-grounded features. Because the classifier operates on pose trajectories alone, moderation can run directly on avatar poses in virtual environments, with no appearance data.
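
The indirectness measure defined in this abstract (path length divided by net displacement) is simple enough to sketch directly. The function below is an illustrative sketch, not the authors' code; the function name, the sample trajectories, and the handling of zero net displacement are assumptions.

```python
import math

def tortuosity(points):
    """Ratio of path length to net displacement for one joint trajectory.

    `points` is a sequence of (x, y, z) positions over time. Assumed
    conventions: a joint that moves but returns to its start gets math.inf,
    and a joint that never moves gets 1.0.
    """
    path = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    net = math.dist(points[0], points[-1])
    if net == 0.0:
        return math.inf if path > 0.0 else 1.0
    return path / net

# Hypothetical trajectories: a direct reach and the same reach with a detour.
straight = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
detour = [(0, 0, 0), (1, 1, 0), (2, 0, 0)]

direct_score = tortuosity(straight)   # path equals displacement
indirect_score = tortuosity(detour)   # path is longer than displacement
```

A value of 1 corresponds to perfectly direct motion, and larger values to more indirect motion, which matches the Direct-to-Indirect polarity the abstract describes.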

2605.17770 2026-06-12 cs.AI cs.CL Version update

Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

Junyao Yang, Chen Qian, Kun Wang, Linfeng Zhang, Quanshi Zhang, Yong Liu, Dongrui Liu

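Affiliations: National University of Singapore; Renmin University of China; Shanghai Jiao Tong University; Nanyang Technological University

AI summary: Identifies a robust negative correlation between token entropy and logit gradients in large reasoning models (Entropy-Gradient Inversion) and proposes Correlation-Regularized Group Policy Optimization (CorR-PO), which embeds it into RL reward regularization to improve reasoning performance.
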
Comments
The authors are withdrawing this manuscript due to fundamental inaccuracies in the institutional affiliations and administrative attributions provided at the time of submission. As this version cannot be validated under the correct institutional framework, the authors request its formal withdrawal from the repository. No immediate replacement is intended

Abstract

The advancement of Large Reasoning Models (LRMs) has catalyzed a paradigm shift from reactive "fast thinking" text generation to systematic, step-by-step "slow thinking" reasoning, unlocking state-of-the-art performance in complex mathematical and logical tasks. However, the field faces the fundamental gap between token-level behavioral analysis and internal reasoning mechanisms, and the instability of reinforcement learning (RL) for reasoning optimization relying on costly external verifiers. We identify and formally define Entropy-Gradient Inversion, a robust negative correlation between token entropy and logit gradients that acts as a definitive geometric fingerprint for LRM reasoning capability. Building on this, we propose Correlation-Regularized Group Policy Optimization (CorR-PO), which embeds this inversion signature into RL reward regularization. Extensive experiments on various reasoning benchmarks across multiple model scales show CorR-PO consistently outperforms state-of-the-art baselines, confirming that stronger inversion directly correlates with superior reasoning performance.

2605.22641 2026-06-12 cs.CL cs.AI cs.LG Version update

More Context, Larger Models, or Moral Knowledge? A Systematic Study of Schwartz Value Detection in Political Texts

Víctor Yeste, Paolo Rosso

Affiliations: PRHLT Research Center, Universitat Politècnica de València, Spain; School of Science, Engineering and Design, Universidad Europea de Valencia, Spain; Valencian Graduate School and Research Network of Artificial Intelligence (ValgrAI)

AI summary: Systematically compares the effects of context scope, retrieval-augmented moral knowledge, and model scale on Schwartz value detection in political texts; full-document context helps supervised encoders but not zero-shot LLMs consistently, retrieved moral knowledge helps more consistently, and scaling models does not guarantee gains.

Comments
Code: https://github.com/VictorMYeste/human-value-detection-context-rag, best model: https://huggingface.co/VictorYeste/value-context-rag-deberta-v3-base-doc-rag, 18 pages, 3 figures

Abstract

Detecting Schwartz values in political text is difficult because implicit cues often depend on surrounding arguments and fine-grained distinctions between neighboring values. We study when context and explicit moral knowledge help sentence-level value detection. Using the ValuesML/Touché ValueEval format, we compare sentence, window, and full-document inputs; no-RAG and retrieval-augmented settings with a curated moral knowledge base; supervised DeBERTa-v3-base/large encoders; and zero-shot LLMs from 12B to 123B parameters. The results show that more context is not uniformly better: full-document context improves supervised DeBERTa encoders by 3.8-4.8 macro-F1 points over sentence-only input, but does not consistently help zero-shot LLMs. Retrieved moral knowledge is more consistently useful in matched comparisons, improving each tested model family and context condition under early fusion. However, scaling from DeBERTa-v3-base to large and from 12B to larger LLMs does not guarantee gains, and simple early fusion outperforms the tested late-fusion and cross-attention RAG variants for encoders. Per-value analyses show that context and retrieval help most for socially situated or conceptually confusable values. These findings suggest that value-sensitive NLP should evaluate context, knowledge, and model family jointly rather than treating longer inputs or larger models as universal improvements.

2602.00122 2026-06-12 cs.CV cs.AI cs.MM Version update

VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents

Hongzhu Yi, Yujia Yang, Yuanxiang Wang, Tong Li, Zhenyu Guan, Tianyu Zong, Jiahuan Chen, Chenxi Bao, Tiankun Yang, Haopeng Jin, Yixuan Yuan, Xinming Wang, Tao Yu, Ruilin Gao, Ruiwen Tao, Haijin Liang, Jin Ma, Jinwen Luo, Yeshani, Xinyu Zuo, Jungang Xu

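Affiliations: UCAS (University of Chinese Academy of Sciences); CASIA (Institute of Automation, Chinese Academy of Sciences); Tencent; CMU (Carnegie Mellon University); WashU (Washington University); SJTU (Shanghai Jiao Tong University); XDU (Xidian University)

AI summary: Proposes VDE Bench, a benchmark for evaluating image editing models on bilingual Chinese-English and complex visual document editing tasks, with a high-quality dataset and a new evaluation framework that systematically quantifies text modification accuracy.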

Abstract

In recent years, image editing models have made significant progress, enabling users to manipulate visual content in a flexible and interactive manner through natural language instructions. However, an important yet underexplored research direction remains dense visual document image editing, which involves modifying textual content within images while faithfully preserving the original text style and background context. Existing methods primarily focus on English scenarios and images with relatively sparse text, and thus cannot adequately address dense, structurally complex documents or non-Latin scripts such as Chinese. To bridge this gap, we propose VDE Bench (Visual Doc Edit Bench), a rigorously human annotated and evaluated benchmark specifically designed to assess the performance of image editing models on bilingual Chinese-English and complex visual document editing tasks. The benchmark comprises a high quality dataset of 942 instruction based image editing samples, whose seed images encompass dense Chinese and English text documents including academic papers, posters, presentation slides, examination materials, and newspapers. Furthermore, we introduce a novel evaluation framework that systematically quantifies editing performance at the OCR parsing level, thereby enabling fine grained assessment of text modification accuracy. Based on this benchmark, we conduct a comprehensive evaluation of representative image editing models. Human verification demonstrates a high degree of consistency between human judgments and automated evaluation metrics. VDE Bench constitutes the first systematic benchmark for evaluating the performance of image editing models on bilingual dense text visual documents.

2605.20763 2026-06-12 cs.LG Version update

ShapeBench: A Scalable Benchmark and Diagnostic Suite for Standardized Evaluation in Aerodynamic Shape Optimization

Shaghayegh Fazliani, Krissh Chawla, Jack Guo, Yiren Shen, Matthias Ihme, Madeleine Udell

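Affiliations: Stanford University; Spinoza Labs

AI summary: Proposes ShapeBench, an open-source aerodynamic shape optimization benchmark with a unified API covering 103 tasks across eight shape categories; using validated surrogates and a high-fidelity CFD pipeline, it shows substantial variance in optimizer rankings across shape categories and problem formulations, underscoring the need for more general-purpose methods.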

Abstract

Rapid progress in aerodynamic shape optimization (ASO) has outpaced currently available standardized evaluation frameworks. Fair comparison requires a unified benchmark spanning diverse shape classes, objective formulations, and matched-budget state-of-the-art baselines. We introduce ShapeBench, an open-source ASO benchmark with a unified API spanning 103 tasks across eight shape categories and multiple optimization regimes. Each ShapeBench task includes a validated surrogate for fast search; when feasible, a high-fidelity Computational Fluid Dynamics (CFD) pipeline for final verification is available, enabling systematic fidelity-gap analysis. ShapeBench provides a reproducible protocol with well-configured baselines to compare fairly using a consistent budget metric, allowing for comparison among both classical and LLM-driven methods, including general-purpose optimizers and a new domain-specialized evolutionary LLM baseline, ShapeEvolve. Results on ShapeBench demonstrate substantial variance in optimizer rankings across shape categories and problem formulations, with mean pairwise Spearman $\rho = 0.013$, so single-task conclusions do not reliably generalize across problem classes. The benchmark is also far from saturation; classical methods are rarely applicable across all shape categories and tasks, further highlighting the need for more general-purpose approaches.
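
The mean pairwise Spearman statistic quoted in this abstract can be computed as sketched below. This is a minimal illustration, not the ShapeBench API: the task names and ranks are invented, and the closed-form Spearman formula used here assumes rankings without ties.

```python
from itertools import combinations

def spearman_rho(rank_a, rank_b):
    """Spearman correlation between two rankings of the same items (no ties)."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def mean_pairwise_rho(rankings):
    """Average Spearman rho over every pair of tasks."""
    pairs = list(combinations(rankings.values(), 2))
    return sum(spearman_rho(a, b) for a, b in pairs) / len(pairs)

# Hypothetical ranks of four optimizers (1 = best) on three made-up tasks.
rankings = {
    "task_a": [1, 2, 3, 4],
    "task_b": [4, 3, 2, 1],
    "task_c": [2, 1, 4, 3],
}

agreement = mean_pairwise_rho(rankings)
```

A value near 1 would mean the optimizers rank the same way on every task; a value near 0, such as the 0.013 reported in the abstract, means a ranking on one task says almost nothing about the ranking on another.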

2605.01733 2026-06-12 cs.CV cs.AI Version update

GEASS: Gated Evidence-Adaptive Selective Caption Trust for Vision-Language Models

Zeshang Li, Shuoyang Zhang

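Affiliations: arXiv.org

AI summary: Proposes GEASS, a training-free module that uses gating, weighting, and an evidence bar to decide how much caption information the model consumes per query, improving the accuracy of vision-language models.
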
Comments
18 pages, 12 figures
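AI abstract (translated from Chinese)

Vision-Language Models (VLMs) perform well at grounded reasoning but remain prone to object hallucination. Recent work treats automatically generated captions as a uniformly positive resource, but we find that blindly embedding a caption can lower performance rather than raise it: on HallusionBench, the accuracy of Qwen2.5-VL-3B drops by nearly 10 points. Two structural properties explain this. First, a caption anchors not only the model's final answer but also its reasoning trajectory and lexical choices. Second, caption errors are asymmetric: omissions far outnumber fabrications, but each fabrication has a larger per-instance impact. Caption utility is therefore query-specific rather than corpus-specific. We propose GEASS (Gated Evidence-Adaptive Selective Caption Trust), a training-free module that decides how much caption information the model consumes for each query: it gates the caption by the clean path's confidence, weights it by the entropy reduction it produces, and raises the evidence bar when the two pathways disagree. Experiments on four VLMs on POPE and HallusionBench show that GEASS compares favorably with both vanilla inference and contrastive decoding, requiring only two extra forward passes per query.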

Abstract

Vision-Language Models (VLMs) hallucinate objects that are not present, and a growing line of work tries to curb this by feeding the model its own generated caption as auxiliary evidence, assuming that a caption, once available, is something to consume. We show this fails: naively appending a caption can lower accuracy rather than raise it, dropping Qwen2.5-VL-3B on HallusionBench by nearly ten points. To understand why, we build GD-Probe, a diagnostic set that pairs a global and a detail question on the same image, so that any difference in caption effect is attributable to the question alone. Caption utility proves to be a per-query property: the same caption helps global questions and harms detail ones through a single mechanism (an embedded caption competes with the image for attention and pulls the model's evidence onto its own text) whose sign is set by whether the caption covers the queried content. Crucially, this regime is readable from quantities the decoder already emits, with no attention access or grounding. We turn this into GEASS (Gated Evidence-Adaptive Selective Caption Trust), a training-free, logit-level module that decides per query how much of the caption to trust, gating it by the clean path's confidence, weighting it by the entropy reduction it induces, and raising the evidence bar when the two pathways disagree. Across four VLMs and two benchmarks (POPE and HallusionBench), GEASS improves over both vanilla inference and contrastive decoding under a single fixed setting, adding only two forward passes and no parameters.

2605.18817 2026-06-12 cs.LG Version update

Multi-Token Residual Prediction

Yufeng Xu, Zishuo Bao, Qian Wang, Zeshen Zhang, Haoqi Zhang, Bowen Peng, Ang Li, Rahul Chalamala, Yucheng Lu

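Affiliations: New York University; New York University Shanghai; Nos Research; Modal

AI summary: Proposes Multi-token Residual Prediction, a lightweight module that exploits the similarity of logit distributions at adjacent denoising steps to perform dependency-aware multi-token denoising within a single backbone forward pass, improving denoising efficiency at low cost.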
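AI abstract (translated from Chinese)

Diffusion Language Models (DLMs) generate text by iteratively denoising masked token sequences, offering a tradeoff between parallelism and quality compared with autoregressive models. In current practice, the number of tokens decoded per step is controlled by a confidence threshold, and quality degrades monotonically as more tokens are denoised per step. We introduce Multi-token Residual Prediction (MRP), a lightweight module that enables dependency-aware multi-token denoising within a single backbone forward pass. MRP exploits a key property of the denoising process: the logit distributions at adjacent denoising steps are remarkably similar. Rather than running the backbone again to obtain the next-step logits, MRP predicts the residual between steps from the backbone's hidden states, denoising more tokens per backbone forward pass at lower cost. We deploy MRP in two inference modes: direct decoding, which uses the corrected logits without verification for a tunable quality-speed tradeoff; and speculative decoding, which verifies MRP's proposals against the backbone for lossless acceleration. Experiments on SDAR models at the 1.7B, 4B, and 8B scales show lossless speedups of up to 1.42x in SGLang on reasoning and code generation benchmarks.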

Abstract

Diffusion Language Models (DLMs) generate text by iteratively denoising masked token sequences, offering a tradeoff between parallelism and quality compared to autoregressive models. In current practice, the number of tokens decoded per step is controlled by a confidence threshold, and quality degrades monotonically as more tokens are denoised per step. We introduce Multi-token Residual Prediction (MRP), a lightweight module that enables dependency-aware multi-token denoising within a single backbone forward pass. MRP exploits a key property of the denoising process: the logit distributions at adjacent denoising steps are remarkably similar. Rather than running the backbone a second time to obtain the next-step logits, MRP predicts the residual between steps from the backbone's hidden states, effectively denoising more tokens per backbone forward at a fraction of the cost. We apply MRP across the two operating regimes of DLM decoding. In the high-quality-low-throughput static denoising regime, MRP serves as a drafter for speculative decoding: its proposals are verified against the backbone, yielding lossless acceleration of up to 1.4x in SGLang. In the low-quality-high-throughput dynamic denoising regime, MRP instead drives a remasking scheme that revokes over-eager reveals, recovering most of the accuracy lost to aggressive low-threshold decoding and improving accuracy by up to 22.6 points on code generation task HumanEval and 17.7 points on reasoning task GSM8K.
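
The confidence-threshold rule that this abstract takes as the baseline can be sketched in a few lines. This is not MRP and not any particular DLM's decoder; the function name, the data layout, and the fallback when no position clears the threshold are assumptions made for illustration.

```python
def reveal_by_confidence(probs, threshold):
    """Choose which masked positions to reveal in one denoising step.

    `probs` maps a masked position to a {token: probability} dict. A position
    is revealed when its most likely token has probability >= threshold.
    Assumption: if nothing clears the threshold, the single most confident
    position is revealed so that decoding always makes progress.
    """
    best = {pos: max(dist.items(), key=lambda kv: kv[1]) for pos, dist in probs.items()}
    chosen = {pos: tok for pos, (tok, p) in best.items() if p >= threshold}
    if not chosen:
        pos = max(best, key=lambda q: best[q][1])
        chosen = {pos: best[pos][0]}
    return chosen

# Hypothetical per-position predictions for three masked slots.
probs = {
    0: {"the": 0.95, "a": 0.05},
    1: {"cat": 0.60, "dog": 0.40},
    2: {"sat": 0.85, "ran": 0.15},
}

strict = reveal_by_confidence(probs, 0.9)  # few tokens per step
loose = reveal_by_confidence(probs, 0.5)   # many tokens per step
```

Lowering the threshold reveals more tokens per step, including less certain ones such as position 1 here, which is the throughput-versus-quality tradeoff the abstract describes.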

2605.18231 2026-06-12 cs.LG Version update

Attacking the First-Principle: A Black-Box, Query-Free Targeted Mimicry Attack on Binary Function Classifiers

Gabriel Sauger, Jean-Yves Marion, Sazzadur Rahaman, Victor Matrat, Vincent Tourneur, Muaz Ali

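Affiliations: LORIA; University of Arizona

AI summary: Proposes the Kelpie framework, the first to carry out mimicry attacks on binary function classifiers in a black-box, query-free setting; it demonstrates effectiveness across model architectures, validates feasibility with practical cases, and raises questions about the reliability and security of existing machine-learning-based binary function classifiers.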

Abstract

Binary function classifiers play a crucial role in maintaining the security and integrity of software systems by detecting malicious code and unauthorized modifications. However, machine learning-based classifiers are vulnerable to adversarial attacks that can evade detection. In this study, we present Kelpie, a novel framework for executing mimicry attacks, a stronger type of targeted evasion attack, on binary function classifiers in a black-box, zero-query setting. Unlike previous approaches that rely on querying the target classifier to refine untargeted evasion attacks, Kelpie leverages code transformations that preserve the functionality of malicious payloads while causing them to be misclassified as a class of the attacker's choosing. Through extensive experimentation, we demonstrate that Kelpie can successfully execute mimicry attacks against six state-of-the-art binary function classifiers representing different model architectures without requiring direct interaction with them. We further validate our approach with a practical demonstration involving a keylogger and a wiper concealed within benign-looking functions embedded in an application. This work, to the best of our knowledge, is the first to demonstrate such a mimicry attack in a black-box, zero-query context, raising important questions about the reliability and security of existing machine learning-based binary function classifiers.

2603.11395 2026-06-12 cs.LG cs.AI 版本更新

ARROW: Augmented Replay for RObust World models

ARROW:增强重放用于鲁棒世界模型

Abdulaziz Alyahya, Abdallah Al Siyabi, Markus R. Ernst, Luke Yang, Levin Kuhlmann, Gideon Kowadlo

发表机构 * Imam Mohammad Ibn Saud Islamic University (IMSIU)(伊玛姆·穆罕默德·本·沙特伊斯兰大学) Monash University(莫纳什大学) University of New South Wales, Sydney(新南威尔士大学,悉尼) Cerenaut

AI总结 本文提出ARROW算法,一种基于模型的持续强化学习方法,通过高效的重放缓冲区减少灾难性遗忘,提升在无共享结构任务和有共享结构任务中的表现。

详情
Journal ref
Transactions on Machine Learning Research, 2026
Comments
36 pages and 11 figures (includes Appendix)
AI中文摘要

持续强化学习挑战智能体在获取新技能的同时保留已学习技能,以提高过去和未来任务的性能。大多数现有方法依赖于无模型方法和重放缓冲区来缓解灾难性遗忘;然而,这些解决方案往往面临显著的可扩展性挑战,因为内存需求大。受神经科学启发,其中大脑将经验重放给预测世界模型而不是直接重放到策略中,我们提出了ARROW(增强重放用于鲁棒世界模型),一种扩展DreamerV3的基于模型的持续RL算法,具有内存高效、分布匹配的重放缓冲区。与标准固定大小的FIFO缓冲区不同,ARROW维护两个互补的缓冲区:一个短期缓冲区用于近期经验,一个长期缓冲区通过智能采样保留任务多样性。我们在两个具有挑战性的持续RL设置中评估了ARROW:无共享结构任务(Atari)和有共享结构任务(Procgen CoinRun变体)。与相同大小的无模型和基于模型的基线方法相比,ARROW在无共享结构任务中表现出显著减少的遗忘,同时保持可比的前向转移。我们的发现突显了基于模型的RL和生物启发方法在持续强化学习中的潜力,值得进一步研究。

英文摘要

Continual reinforcement learning challenges agents to acquire new skills while retaining previously learned ones, with the goal of improving performance on both past and future tasks. Most existing approaches rely on model-free methods with replay buffers to mitigate catastrophic forgetting; however, these solutions often face significant scalability challenges due to large memory demands. Drawing inspiration from neuroscience, where the brain replays experiences to a predictive world model rather than directly to the policy, we present ARROW (Augmented Replay for RObust World models), a model-based continual RL algorithm that extends DreamerV3 with a memory-efficient, distribution-matching replay buffer. Unlike standard fixed-size FIFO buffers, ARROW maintains two complementary buffers: a short-term buffer for recent experiences and a long-term buffer that preserves task diversity through intelligent sampling. We evaluate ARROW on two challenging continual RL settings: tasks without shared structure (Atari) and tasks with shared structure, where knowledge transfer is possible (Procgen CoinRun variants). Compared to model-free and model-based baselines with replay buffers of the same size, ARROW demonstrates substantially less forgetting on tasks without shared structure, while maintaining comparable forward transfer. Our findings highlight the potential of model-based RL and bio-inspired approaches for continual reinforcement learning, warranting further research.
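The two-buffer idea in this abstract can be sketched in a few lines. This is a minimal illustration, not the ARROW implementation: the class name `DualReplayBuffer`, the capacities, the even split when sampling, and the use of reservoir sampling for the long-term buffer are all assumptions. The abstract only states that the long-term buffer preserves task diversity through "intelligent sampling".

```python
import random
from collections import deque


class DualReplayBuffer:
    """Hypothetical two-buffer replay store (illustrative only)."""

    def __init__(self, short_capacity, long_capacity, seed=0):
        # Short-term buffer: plain FIFO, so it only ever holds recent items.
        self.short = deque(maxlen=short_capacity)
        # Long-term buffer: a uniform reservoir sample over everything seen.
        # Reservoir sampling is an assumption, not the paper's stated method.
        self.long = []
        self.long_capacity = long_capacity
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.short.append(item)
        self.seen += 1
        if len(self.long) < self.long_capacity:
            self.long.append(item)
        else:
            # Replace a random slot with probability long_capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.long_capacity:
                self.long[j] = item

    def sample(self, batch_size):
        # Draw roughly half the batch from each buffer, with replacement.
        # Requires at least one prior call to add().
        half = batch_size // 2
        recent = self.rng.choices(list(self.short), k=batch_size - half)
        older = self.rng.choices(self.long, k=half)
        return recent + older
```

Reservoir sampling keeps every item seen so far with equal probability, so experience from earlier tasks stays represented after the FIFO buffer has discarded it, while memory stays fixed.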

2605.16713 2026-06-12 cs.CV cs.AI 版本更新

GeoWorld-VLM: Geometry from World Models for Vision-Language Models

GeoWorld-VLM:从世界模型中获取几何结构用于视觉-语言模型

Renjie Gu, Kaichen Zhou, Yan Luo, Mengyu Wang

发表机构 * Harvard AI and Robotics Lab(哈佛人工智能与机器人实验室) Kempner Institute for the Study of Natural and Artificial Intelligence(凯普纳自然与人工智能研究所) Harvard University(哈佛大学)

AI总结 GeoWorld-VLM通过将冻结的摄像机条件视频世界模型的几何结构转移到视觉-语言模型中,提升空间关系推理能力,实验显示在两个不同架构上均提升了约4%的性能。

详情
AI中文摘要

现代视觉-语言模型(VLMs)在语义识别方面表现优异,但在“在……左侧”“在……之上”“在……之后”“在……之间”等基本空间关系上仍显脆弱。这一失败的原因之一出现在语言推理之前:视觉路径在特征提取过程中可能压缩或丢弃关键的3D结构线索,导致语言模型接收到的图像表示不足以支持可靠的空间判断。我们引入GeoWorld-VLM,一种VLM侧蒸馏框架,将冻结的摄像机条件视频世界模型的几何结构转移到VLMs中。GeoWorld-VLM仅微调图像编码器和多模态投影器,使后投影器图像特征与中间世界模型表示对齐,同时保持主骨干冻结。给定图像、提示和采样的摄像机轨迹,世界模型教师将静态视觉输入转换为合成多视角空间信号。训练结合空间答案监督、教师-学生特征对齐和对原VLM的保留锚点。由于语言模型保持冻结,GeoWorld-VLM保留原始模型的语言能力,同时将空间改进归因于增强的视觉路径。为了评估所提方法的有效性和通用性,我们将GeoWorld-VLM应用于两种不同的VLM架构,并在两个骨干上观察到一致的改进。GeoWorld-VLM在What'sUp和VSR基准上均提升了约4%的性能,表明世界模型引导的视觉对齐在模型结构和空间推理数据集上具有泛化能力。

英文摘要

Modern Vision-Language Models (VLMs) achieve strong semantic recognition, yet remain brittle on elementary spatial relations such as left of, on, behind, and between. One cause of this failure arises before language reasoning begins: the visual pathway may compress or discard critical 3D structural cues during feature extraction, so the language model receives image representations that are already insufficient for reliable spatial judgment. We introduce GeoWorld-VLM, a VLM-side distillation framework that transfers geometric structure from frozen camera-conditioned video world models into VLMs. GeoWorld-VLM fine-tunes only the image encoder and multimodal projector, aligning post-projector image features with intermediate world-model representations while leaving the main backbone frozen. Given images, a prompt, and a sampled camera trajectory, the world-model teacher converts static visual input into a synthetic multi-view spatial signal. Training combines spatial answer supervision, teacher-student feature alignment, and a preservation anchor to the original VLM. Since the language model remains frozen, GeoWorld-VLM preserves the original model's linguistic capabilities while attributing spatial improvements to the enhanced visual pathway. To evaluate the effectiveness and generality of the proposed method, we apply GeoWorld-VLM to two distinct VLM architectures and observe consistent improvements across both backbones. GeoWorld-VLM improves performance by approximately 4 percent on both the What'sUp and VSR benchmarks, suggesting that world-model-guided visual alignment generalizes across model structures and spatial reasoning datasets.

2605.16430 2026-06-12 cs.LG cs.AI 版本更新

A Theory of Training Profit-Optimal LLMs

训练利润最优大语言模型的理论

Sophie Hao, William Merrill

发表机构 * Boston University(波士顿大学) Allen Institute for AI(艾伦人工智能研究所)

AI总结 本文提出一个经济模型,结合扩展定律与微观经济学理论,分析大语言模型训练的利润最大化问题,探讨模型规模与训练成本的关系及对利润的影响。

详情
Comments
Minor edits for preprint
AI中文摘要

扩展大语言模型(LLM)需要巨大的计算资源,近年来人工智能的进步与大量资本支出相伴而生。尽管已经确认扩大LLM规模能可靠地提高模型质量(以损失或下游评估量化),但这些质量提升如何转化为潜在收入,以及收入增长能否抵消更大规模训练和推理的成本,仍不清楚。本文发展了一个经济模型,结合扩展定律与微观经济学理论,以刻画LLM训练公司的理性行为。在我们的模型中,增加参数和训练词元可提高LLM质量,从而吸引更多消费者采用,每个消费者都有一个使用LLM的质量阈值。另一方面,额外的参数和训练词元都会带来额外成本。我们分析了该模型在计算受限和数据受限两种情形下的利润最大化问题。在计算受限情形下,最优模型规模和词元预算随硬件效率$E$(FLOPs/\$)近似线性增长;总训练成本则随$E$亚二次增长。数据效率的提升会激励更大规模的模型和更高的训练支出。当数据量受限于$D$时,利润最优的训练支出按$D^2/E$缩放,即随数据增加而增加,随硬件效率(以及数据效率)提高而减少。最后,我们分析了训练支出的实际趋势:当前趋势与计算受限情形下最宽松的模型变体一致,但在数据受限情形下或假设硬件进步停滞时并非利润最优。总体而言,我们的结果提供了利润最优LLM训练的理论,为批判性地看待行业声明和支持长期经济决策提供了基础。

英文摘要

Scaling LLMs requires tremendous computational resources, and recent advances in AI have gone hand in hand with massive amounts of capital expenditure. While it is established that scaling up LLMs reliably increases model quality (quantified in terms of loss or downstream evaluations), it is unclear how these quality improvements translate to potential revenue, and whether revenue increases would offset the costs of larger-scale training and inference. In this work, we develop an economic model for characterizing the rational behavior of an LLM training firm by combining scaling laws with microeconomic theory. Under our model of firm behavior, LLM quality can be increased with more parameters and training tokens, leading to more potential adoption by consumers, who each have a quality threshold for using the LLM. On the other hand, additional parameters and training tokens both incur additional costs. We analyze the profit maximization problem for this model under compute-bound and data-bound regimes. In the compute-bound regime, optimal model size and token budget track hardware efficiency $E$ (FLOPs/\$) at a near-linear rate; total training cost then scales sub-quadratically in $E$. Data efficiency improvements incentivize larger models and training expenditure. When we are limited to $D$ data, profit-optimal training expenditure scales as $D^2/E$, i.e., it increases with data and decreases with hardware efficiency (as well as data efficiency). Finally, we analyze practical trends in training expenditure: current trends are consistent with our most permissive model variants in the compute-bound regime, but are not profit-optimal in the data-bound regime or assuming hardware advances will stall. Overall, our results provide a theory of profit-optimal LLM training, offering a foundation for engaging critically with industry statements and supporting long-term economic decision making.
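The data-bound scaling claim, expenditure proportional to $D^2/E$, is easy to check numerically. The snippet below is only a worked illustration of that proportionality: the function name `data_bound_spend` and the constant `k` are hypothetical, and the abstract gives no value for the constant of proportionality.

```python
def data_bound_spend(data_tokens, hardware_efficiency, k=1.0):
    """Training expenditure under the stated D^2 / E scaling.

    k is an unknown proportionality constant (an assumption here);
    only ratios between two calls are meaningful.
    """
    return k * data_tokens ** 2 / hardware_efficiency


base = data_bound_spend(1e12, 1e17)
more_data = data_bound_spend(2e12, 1e17)      # double the data
better_chips = data_bound_spend(1e12, 2e17)   # double FLOPs per dollar

print(more_data / base)      # ratio of spend after doubling D
print(better_chips / base)   # ratio of spend after doubling E
```

Under this scaling, doubling the available data quadruples the profit-optimal spend, while doubling hardware efficiency halves it.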

2605.13426 2026-06-12 cs.LG math.AG 版本更新

Strategic PAC Learnability via Geometric Definability

通过几何可定义性实现策略PAC可学习性

Yuval Filmus, Shay Moran, Elizaveta Nesterova, Nir Rosenfeld, Alexander Shlimovich

发表机构 * Weizmann Institute of Science(魏茨曼研究院) University of Waterloo(滑铁卢大学) ETH Zurich(苏黎世联邦理工学院) University of Washington(华盛顿大学)

AI总结 研究个体通过成本修改特征影响分类器决策的策略学习问题,证明在简单情况下策略行为可使易学问题变为不可学,并引入几何可定义性假设以控制样本复杂度。

详情
AI中文摘要

策略分类研究个体可以付出成本修改自身特征以影响分类器决策的学习场景。核心问题是诱导的(策略性)假设类的样本复杂度如何依赖于基础假设类的复杂度和约束可行操纵的成本结构。先前工作表明,在若干自然设置下(如带范数成本的线性分类器),诱导复杂度可被控制。我们首先证明此类保证在一般情况下并不成立,即使在简单情形下也是如此:实数轴上存在VC维为$1$的假设类,即使在最简单的区间邻域下,其诱导类的VC维也为无穷。因此,策略行为可将易学问题变为不可学问题。为克服这一问题,我们通过几何可定义性假设引入结构:假设类和成本诱导的邻域关系均可由$\mathbb{R}_{\mathtt{exp}}$上的一阶公式定义。直观地说,这意味着假设和成本可以用算术运算、指数、对数和比较来描述。该假设涵盖了广泛的自然假设类和成本函数,包括$\ell_p$距离、Wasserstein距离和信息论散度。在此假设下,我们证明可学习性得以保持,且样本复杂度由定义公式的复杂度控制。

英文摘要

Strategic classification studies learning settings in which individuals can modify their features, at a cost, in order to influence the classifier's decision. A central question is how the sample complexity of the induced (strategic) hypothesis class depends on the complexities of the underlying hypothesis class and the cost structure governing feasible manipulations. Prior work has shown that in several natural settings, such as linear classifiers with norm costs, the induced complexity can be controlled. We begin by showing that such guarantees fail in general, even in simple cases: there exist hypothesis classes of VC dimension $1$ on the real line such that, even under the simplest interval neighborhoods, the induced class has infinite VC dimension. Thus, strategic behavior can turn an easy learning problem into a non-learnable one. To overcome this, we introduce structure via a geometric definability assumption: both the hypothesis class and the cost-induced neighborhood relation can be defined by first-order formulas over $\mathbb{R}_{\mathtt{exp}}$. Intuitively, this means that hypotheses and costs can be described using arithmetic operations, exponentiation, logarithms, and comparisons. This captures a broad range of natural classes and cost functions, including $\ell_p$ distances, Wasserstein distance, and information-theoretic divergences. Under this assumption, we prove that learnability is preserved, with sample complexity controlled by the complexity of the defining formulas.

2605.11165 2026-06-12 cs.LG 版本更新

COSMOS: Model-Agnostic Personalized Federated Learning with Clustered Server Models and Pseudo-Label-Only Communication

COSMOS:基于聚类服务器模型和伪标签通信的模型无关个性化联邦学习

Ben Rachmut, Luise Ge, William Yeoh, Ning Zhang, Yevgeniy Vorobeychik

发表机构 * Washington University in St. Louis(圣路易斯华盛顿大学)

AI总结 COSMOS通过伪标签通信实现服务器端个性化,利用客户端本地模型预测公共数据并聚类,训练集群特定模型并回传知识蒸馏,理论分析显示其能有效降低个性化风险,实验验证其在异构环境中优于现有基线方法。

详情
AI中文摘要

联邦学习在异构环境中面临挑战,因为客户端模型在架构和数据分布上差异显著。尽管近期方法通过客户端聚类和知识蒸馏应对,但同时处理架构和统计异质性仍困难。我们引入COSMOS,一种模型无关框架,通过仅使用伪标签通信实现服务器端个性化。客户端训练本地模型并在公共数据上进行预测;服务器根据预测相似性聚类客户端,利用自身计算为每个群组训练特定模型,并将所得模型蒸馏回客户端。我们提供了首个理论分析,证明从学习的集群模型蒸馏可产生指数级个性化风险收缩,超越模型无关联邦学习通常提供的收敛到平稳状态保证。在基准测试中,COSMOS在异构环境中一致优于所有模型无关联邦学习基线方法,同时与最先进的个性化联邦学习方法竞争。更广泛地说,我们的结果强调了使用伪标签实现个性化服务器端学习作为可扩展且模型无关联邦学习的有前景范式。

英文摘要

Federated learning (FL) in heterogeneous environments remains challenging because client models often differ in both architecture and data distribution. While recent approaches attempt to address this challenge through client clustering and knowledge distillation, simultaneously handling architectural and statistical heterogeneity remains difficult. We introduce COSMOS, a model-agnostic framework that enables server-side personalization using only pseudo-label communication. Clients train local models and predict on the public data; the server clusters clients by prediction similarity, trains a cluster-specific model for each group using its own compute, and distills the resulting models back to clients. We provide the first theoretical analysis showing that distillation from the learned cluster models can yield exponential personalization risk contraction, going beyond the convergence-to-stationarity guarantees typically provided in model-agnostic FL. Experiments across benchmarks demonstrate that COSMOS consistently outperforms all model-agnostic FL baselines while remaining competitive with state-of-the-art personalized FL methods. More broadly, our results highlight personalized server-side learning with pseudo-labels as a promising paradigm for scalable and model-agnostic federated learning in highly heterogeneous environments.
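The server-side clustering step described above ("clusters clients by prediction similarity") can be sketched as follows. This is an assumption-laden toy, not the COSMOS algorithm: the agreement-rate similarity, the greedy threshold rule, and the names `agreement` and `cluster_clients` are all hypothetical, and the sketch omits the cluster-model training and distillation steps entirely.

```python
def agreement(labels_a, labels_b):
    """Fraction of public examples on which two clients predict the same label."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cluster_clients(pseudo_labels, threshold=0.8):
    """Greedy clustering of clients by pseudo-label agreement.

    pseudo_labels maps a client id to its predicted labels on the shared
    public dataset. A client joins the first cluster whose representative
    (the cluster's first member) it agrees with at least `threshold` of the
    time; otherwise it starts a new cluster. The rule is an illustrative
    assumption, chosen for brevity.
    """
    clusters = []
    for client_id, labels in pseudo_labels.items():
        for cluster in clusters:
            representative = pseudo_labels[cluster[0]]
            if agreement(labels, representative) >= threshold:
                cluster.append(client_id)
                break
        else:
            clusters.append([client_id])
    return clusters


# Hypothetical pseudo-labels from four clients on five public examples.
pseudo_labels = {
    "a": [0, 0, 1, 1, 2],
    "b": [0, 0, 1, 1, 1],
    "c": [2, 2, 0, 0, 1],
    "d": [2, 2, 0, 0, 1],
}
clusters = cluster_clients(pseudo_labels, threshold=0.75)
```

Because only predicted labels cross the network, this kind of grouping needs no knowledge of each client's architecture, which is the model-agnostic property the abstract emphasizes.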

2503.17182 2026-06-12 cs.CV 版本更新

Radar-Guided Polynomial Fitting for Metric Depth Estimation

雷达引导的多项式拟合用于度量深度估计

Patrick Rim, Hyoungseob Park, Vadim Ezhov, Jeffrey Moon, Alex Wong

发表机构 * Yale University(耶鲁大学) University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出POLAR方法,利用雷达数据预测多项式系数,对单目深度估计的无尺度深度进行非均匀校正,实现度量深度估计,性能在三个数据集上平均提升24.9% MAE和33.2% RMSE。

详情
Comments
CVPR 2026
AI中文摘要

我们提出POLAR,一种新颖的雷达引导深度估计方法,引入多项式拟合以高效地将预训练单目深度估计(MDE)模型的无尺度深度预测转换为度量深度图。与依赖复杂架构或昂贵传感器的现有方法不同,我们的方法基于一个基本洞察:尽管MDE模型通常能在每个物体或局部区域内推断合理的局部深度结构,但它们可能使这些区域相互错位,使得在三个或更多区域的情况下线性尺度和偏移(仿射)变换不足。为解决这一限制,我们使用从廉价、普遍存在的雷达数据预测的多项式系数,在深度范围内非均匀地自适应调整预测。通过这种方式,POLAR超越了仿射变换,并能够通过引入拐点来纠正此类错位。重要的是,我们的多项式拟合框架通过一种新颖的训练目标保持结构一致性,该目标通过一阶导数正则化强制局部单调性。POLAR在三个数据集上实现了最先进的性能,在MAE和RMSE上平均优于现有方法24.9%和33.2%,同时在延迟和计算成本方面也实现了最先进的效率。

英文摘要

We propose POLAR, a novel radar-guided depth estimation method that introduces polynomial fitting to efficiently transform scaleless depth predictions from pretrained monocular depth estimation (MDE) models into metric depth maps. Unlike existing approaches that rely on complex architectures or expensive sensors, our method is grounded in a fundamental insight: although MDE models often infer reasonable local depth structure within each object or local region, they may misalign these regions relative to one another, making a linear scale and shift (affine) transformation insufficient given three or more of these regions. To address this limitation, we use polynomial coefficients predicted from cheap, ubiquitous radar data to adaptively adjust predictions non-uniformly across depth ranges. In this way, POLAR generalizes beyond affine transformations and is able to correct such misalignments by introducing inflection points. Importantly, our polynomial fitting framework preserves structural consistency through a novel training objective that enforces local monotonicity via first-derivative regularization. POLAR achieves state-of-the-art performance across three datasets, outperforming existing methods by an average of 24.9% in MAE and 33.2% in RMSE, while also achieving state-of-the-art efficiency in terms of latency and computational cost.
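The core idea, mapping scaleless depth to metric depth with a polynomial while penalizing negative slope, can be illustrated without any learning machinery. The sketch below is not the POLAR model: in the paper the coefficients are predicted from radar, whereas here they are passed in by hand, and the hinge-style penalty and the names `poly_eval`, `poly_derivative`, and `monotonicity_penalty` are assumptions.

```python
def poly_eval(coeffs, z):
    """Evaluate sum_i coeffs[i] * z**i (metric depth for scaleless depth z)."""
    return sum(c * z ** i for i, c in enumerate(coeffs))


def poly_derivative(coeffs, z):
    """First derivative of the polynomial at z."""
    return sum(i * c * z ** (i - 1) for i, c in enumerate(coeffs) if i > 0)


def monotonicity_penalty(coeffs, samples):
    """Mean hinge penalty on negative slope at the sampled depths.

    Zero when the mapping is non-decreasing at every sample, so that
    nearer points stay nearer after the correction. The hinge form is an
    illustrative choice of first-derivative regularizer.
    """
    total = sum(max(0.0, -poly_derivative(coeffs, z)) for z in samples)
    return total / len(samples)


samples = [0.0, 0.5, 1.0]
affine = [1.0, 2.0]            # depth = 1 + 2z: scale and shift only
cubic = [0.0, 1.0, -3.0, 2.0]  # depth = z - 3z^2 + 2z^3: dips near z = 0.5

affine_penalty = monotonicity_penalty(affine, samples)
cubic_penalty = monotonicity_penalty(cubic, samples)
```

An affine map has constant slope, so it can only rescale and shift all regions together; the higher-order terms are what allow different depth ranges to be corrected by different amounts, and the penalty discourages corrections that would reverse depth ordering.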

2605.08116 2026-06-12 cs.LG cs.AI 版本更新

The Safety-Aware Denoiser for Text Diffusion Models

文本扩散模型的安全感知去噪器

Amman Yusuf, Zhejun Jiang, Mijung Park

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出安全感知去噪器(SAD),在文本扩散模型的迭代去噪过程中引导生成文本进入安全区域,无需重训练即可实现灵活的安全约束,有效降低不安全生成同时保持生成质量。

详情
Comments
28 pages, 12 figures. Code available at: https://github.com/ParkLabML/SAD
AI中文摘要

最近关于文本扩散模型的工作为自回归生成提供了一种有前景的替代方案,但对其安全性的控制仍未被充分探索。现有的安全方法面向自回归模型,通常依赖于事后过滤或推理时干预,不足以有效应对文本扩散模型中的安全风险。我们提出了安全感知去噪器(SAD),一种面向文本扩散模型的安全引导框架。SAD修改了迭代去噪过程,使得最终去噪步骤中的文本样本被引导至文本空间中可证明安全的区域。这种推理时方法可以将安全约束集成到去噪器中,避免了对底层扩散模型进行计算昂贵的重训练,并实现了灵活、轻量级的安全引导。我们从危害分类体系、记忆化和越狱三个方面评估了使用SAD生成的文本的安全性。实验结果表明,SAD在保持生成质量、多样性和流畅性的同时,显著减少了不安全生成,优于现有方法。这些结果表明,我们在去噪过程中的安全引导为在文本扩散模型中实施安全提供了一种有效且可扩展的机制。

英文摘要

Recent work on text diffusion models offers a promising alternative to autoregressive generation, but controlling their safety remains underexplored. Existing safety approaches are geared toward autoregressive models and typically rely on post-hoc filtering or inference-time interventions. These are inadequate for effectively addressing safety risks in text diffusion models. We propose the Safety-Aware Denoiser (SAD), a safety-guidance framework for text diffusion models. SAD modifies the iterative denoising process such that the text sample at the final denoising step is steered toward provably safe regions of the text space. This inference-time method can integrate safety constraints into the denoiser, avoiding computationally expensive retraining of the underlying diffusion model and enabling flexible, lightweight safety guidance. We evaluate the safety of text generated with SAD with respect to hazard taxonomy, memorization, and jailbreaking. Experimental results show that SAD substantially reduces unsafe generations while preserving generation quality, diversity, and fluency, outperforming existing methods. These results demonstrate that our safety guidance during denoising provides an effective and scalable mechanism for enforcing safety in text diffusion models.

2605.01391 2026-06-12 cs.CV 版本更新

VISTA: Video Interaction Spatio-Temporal Analysis Benchmark

VISTA:视频交互时空分析基准

Alejandro Aparcedo, Akash Kumar, Aaryan Garg, Dalton Pham, Wen-Kai Chen, Anirudh Bharadwaj, Aman Chadha, Yogesh Rawat

发表机构 * University of Central Florida(中央佛罗里达大学) BITS Pilani(比特斯理工学院) Ho Chi Minh City University of Science(胡志明市科学大学) Amazon GenAI Project(亚马逊生成人工智能项目)

AI总结 提出VISTA基准,通过分解视频为实体、动作和关系,实现开放集多实体多动作的时空理解评估,揭示传统指标掩盖的偏差。

详情
Comments
Accepted to CVPR 2026 Workshop on Pixel-level Video Understanding in the Wild (PVUW)
AI中文摘要

现有的视觉-语言模型(VLM)基准主要评估简单单动作视频、封闭属性集和受限实体类型的时空理解,未能捕捉真实世界视频理解中多样实体之间的自由形式多动作交互。此外,缺乏一个系统性的框架来分析模型在互补时空轴上的失败,阻碍了全面评估。为解决这些问题,我们引入了VISTA,一个视频交互时空分析基准,专为VLM中的开放集、多实体和多动作时空理解设计。VISTA将视频分解为可解释的实体、其关联动作和关系动态,实现多轴诊断以及关系、空间和时间理解的统一评估。我们的基准将多个数据集整合到一个单一的交互感知分类法中,包含约12K个精心策划的视频-查询对,涵盖多样场景和复杂性。我们在VISTA上系统评估了11个最先进的VLM,并分解了跨分类法的聚合性能,揭示了传统指标掩盖的缺陷和显著的时空偏差。通过在具有挑战性的数据集上提供详细的、分类法驱动的诊断,VISTA提供了一个精细的框架来指导模型设计、预训练策略和评估协议的进步。总体而言,VISTA是第一个大规模、交互感知的VLM时空理解诊断基准。

英文摘要

Existing benchmarks for Vision-Language Models (VLMs) primarily evaluate spatio-temporal understanding on simple single-action videos, closed attribute sets, and restricted entity types, failing to capture the freeform, multi-action interactions between diverse entities that characterize real-world video understanding. Furthermore, the lack of a systematic framework for analyzing model failures across complementary spatio-temporal axes hinders comprehensive evaluation. To address these gaps, we introduce VISTA, a Video Interaction Spatio-Temporal Analysis benchmark designed for open-set, multi-entity, and multi-action spatio-temporal understanding in VLMs. VISTA decomposes videos into interpretable entities, their associated actions, and relational dynamics, enabling multi-axis diagnostics and unified assessment of relational, spatial, and temporal understanding. Our benchmark integrates multiple datasets into a single interaction-aware taxonomy and comprises ~12K curated video-query pairs spanning diverse scenes and complexities. We systematically evaluate 11 state-of-the-art VLMs on VISTA, and break down aggregate performance across our taxonomy to reveal shortcomings and pronounced spatio-temporal biases obscured by traditional metrics. By providing detailed, taxonomy-driven diagnostics on a challenging dataset, VISTA offers a nuanced framework to guide advances in model design, pretraining strategies, and evaluation protocols. Overall, VISTA is the first large-scale, interaction-aware diagnostic benchmark for spatio-temporal understanding in VLMs.

2601.19827 2026-06-12 cs.CL cs.AI cs.IR 版本更新

When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering

当迭代RAG优于理想证据:科学多跳问答中的诊断研究

Mahdi Astaraki, Mohammad Arshi Saloot, Ali Shiraee Kasmaee, Hamidreza Mahyar, Soheila Samiee

发表机构 * Faculty of Engineering, McMaster University, Canada(麦克马斯特大学工程学院,加拿大) BASF Canada Inc., Canada(巴斯夫加拿大公司,加拿大)

AI总结 通过化学多跳问答数据集,诊断发现迭代检索-推理循环在科学领域显著优于静态RAG上限,揭示了阶段式检索的优势与失败模式。

详情
Comments
51 pages, 29 figures
AI中文摘要

检索增强生成(RAG)将大型语言模型(LLMs)扩展到参数化知识之外,但目前尚不清楚迭代检索-推理循环何时能有效超越静态RAG,尤其是在涉及多跳推理、稀疏领域知识和异构证据的科学领域。我们首次进行了受控的、机制层面的诊断研究,以探究同步迭代检索和推理能否超越理想化的静态上限(Gold Context)RAG。我们在三种设置下对十一个最先进的LLM进行了基准测试:(i)无上下文,衡量对参数化记忆的依赖;(ii)Gold Context,一次性提供所有真实证据;(iii)迭代RAG,一种无需训练的控制器,交替进行检索、假设细化和证据感知停止。使用以化学为中心的ChemKGMultiHopQA数据集,我们分离出需要真正检索的问题,并通过诊断分析行为,涵盖检索覆盖缺口、锚点携带下降、查询质量、组合保真度和控制校准。在所有模型中,迭代RAG始终优于Gold Context,增益高达25.6个百分点,尤其对于非推理微调模型。阶段式检索减少了后期跳失败,缓解了上下文过载,并实现了对早期假设漂移的动态修正,但剩余的失败模式包括跳覆盖不完整、干扰物锁定轨迹、过早停止校准错误以及即使检索完美时的高组合失败率。总体而言,阶段式检索通常比理想证据的单纯存在更具影响力;我们为在专门科学环境中部署和诊断RAG系统提供了实用指导,并为更可靠、可控的迭代检索-推理框架奠定了基础。

英文摘要

Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond parametric knowledge, yet it is unclear when iterative retrieval-reasoning loops meaningfully outperform static RAG, particularly in scientific domains with multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. We provide the first controlled, mechanism-level diagnostic study of whether synchronized iterative retrieval and reasoning can surpass an idealized static upper bound (Gold Context) RAG. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, measuring reliance on parametric memory; (ii) Gold Context, where all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions requiring genuine retrieval and analyze behavior with diagnostics spanning retrieval coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration. Across models, Iterative RAG consistently outperforms Gold Context, with gains up to 25.6 percentage points, especially for non-reasoning fine-tuned models. Staged retrieval reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift, but remaining failure modes include incomplete hop coverage, distractor latch trajectories, early stopping miscalibration, and high composition failure rates even with perfect retrieval. Overall, staged retrieval is often more influential than the mere presence of ideal evidence; we provide practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.

2605.00600 2026-06-12 cs.LG cs.AI cs.CV 版本更新

Possibilistic Predictive Uncertainty for Deep Learning

深度学习的可能性预测不确定性

Yao Ni, Jeremie Houssineau, Yew-Soon Ong, Piotr Koniusz

发表机构 * University of Cambridge(剑桥大学) National University of Singapore(新加坡国立大学) University of Warsaw(华沙大学)

AI总结 提出基于可能性理论的Dirichlet近似可能性后验预测(DAPPr)框架,通过投影-近似策略实现高效且原则性的认知不确定性量化,在多个基准上达到竞争性能。

详情
Comments
Accepted by ICML 2026, 20 pages
AI中文摘要

深度神经网络在多种应用中取得了令人印象深刻的结果,然而它们对未见输入的过度自信需要可靠的认知不确定性建模。现有的不确定性建模方法面临一个基本困境:贝叶斯方法提供原则性的估计,但计算成本高昂,而高效的二阶预测器在其特定目标与认知不确定性量化之间缺乏严格联系。为解决这一困境,我们引入了Dirichlet近似可能性后验预测(DAPPr),一个基于可能性理论的原则性框架。我们定义了参数上的可能性后验,通过上确界算子将其投影到预测空间,并使用可学习的Dirichlet可能性函数近似投影后的后验。这种投影-近似策略产生了一个具有闭式解的简单训练目标。尽管简单,跨多个不同基准的大量实验表明,DAPPr在保持原则性推导和计算效率的同时,实现了与最先进的二阶预测器相当或更优的不确定性量化性能。代码可在 https://github.com/MaxwellYaoNi/DAPPr 获取。

英文摘要

Deep neural networks achieve impressive results across diverse applications, yet their overconfidence on unseen inputs necessitates reliable epistemic uncertainty modeling. Existing methods for uncertainty modeling face a fundamental dilemma: Bayesian approaches provide principled estimates but remain computationally prohibitive, while efficient second-order predictors lack rigorous connections between their specific objectives and epistemic uncertainty quantification. To resolve this dilemma, we introduce Dirichlet-approximated possibilistic posterior predictions (DAPPr), a principled framework grounded in possibility theory. We define a possibilistic posterior over parameters, project it to the prediction space via supremum operators, and approximate the projected posterior using learnable Dirichlet possibility functions. This projection-and-approximation strategy yields a simple training objective with closed-form solutions. Despite its simplicity, extensive experiments across diverse benchmarks show that DAPPr achieves competitive or superior uncertainty quantification performance over state-of-the-art second-order predictors while maintaining both principled derivation and computational efficiency. Code is available at https://github.com/MaxwellYaoNi/DAPPr.

2604.27960 2026-06-12 cs.AI 版本更新

LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning

LLMs 作为 ASP 程序员:自我纠正实现任务无关的非单调推理

Adam Ishay, Joohyung Lee

发表机构 * Arizona State University(亚利桑那州立大学) Samsung Research(三星研究院)

AI总结 提出 LLM+ASP 框架,通过自我纠正循环将自然语言转化为答案集程序,实现无需任务特定工程的非单调推理,在非单调任务上显著优于基于 SMT 的方法。

详情
Comments
30 pages
AI中文摘要

近期的大语言模型(LLMs)在推理方面取得了令人瞩目的进展,但仍面临高计算成本、逻辑不一致性以及在高度复杂问题上性能急剧下降等问题。神经符号方法通过将 LLMs 与符号推理器结合来缓解这些问题,但现有方法通常依赖于单调逻辑(如 SMT),无法表示可废止推理,而这是人类认知的重要组成部分。我们提出了“LLM+ASP”框架,该框架将自然语言转化为答案集编程(ASP),一种基于稳定模型语义的非单调形式化方法。与先前需要手动编写知识模块、领域特定提示或仅限于单一问题类别评估的“LLM+ASP”方法不同,我们的框架无需任何针对单个任务的工程,并统一适用于多种推理任务。我们的系统利用自动化的自我纠正循环,其中来自 ASP 求解器的结构化反馈能够实现迭代优化。在六个不同基准上的评估表明:(1)稳定模型语义使 LLMs 能够自然地表达默认规则和例外,在非单调任务上显著优于基于 SMT 的替代方法;(2)迭代自我纠正是性能的主要驱动力,有效替代了手工领域知识的需求;(3)紧凑的上下文参考指南显著优于冗长的文档,揭示了“上下文腐烂”现象,即过多上下文会阻碍约束遵循。

英文摘要

Recent large language models (LLMs) have achieved impressive reasoning milestones but continue to struggle with high computational costs, logical inconsistencies, and sharp performance degradation on high-complexity problems. While neuro-symbolic methods attempt to mitigate these issues by coupling LLMs with symbolic reasoners, existing approaches typically rely on monotonic logics (e.g., SMT) that cannot represent defeasible reasoning, an essential component of human cognition. We present "LLM+ASP," a framework that translates natural language into Answer Set Programming (ASP), a nonmonotonic formalism based on stable model semantics. Unlike prior "LLM+ASP" approaches that require manually authored knowledge modules, domain-specific prompts, or evaluation restricted to single problem classes, our framework operates without any per-task engineering and applies uniformly across diverse reasoning tasks. Our system utilizes an automated self-correction loop where structured feedback from the ASP solver enables iterative refinement. Evaluating across six diverse benchmarks, we demonstrate that: (1) stable model semantics allow LLMs to naturally express default rules and exceptions, outperforming SMT-based alternatives by significant margins on nonmonotonic tasks; (2) iterative self-correction is the primary driver of performance, effectively replacing the need for handcrafted domain knowledge; (3) compact in-context reference guides substantially outperform verbose documentation, revealing a "context rot" phenomenon where excessive context hinders constraint adherence.
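The self-correction loop described above has a simple control structure, sketched below under stated assumptions. The function `self_correct` and its two callables are hypothetical: `generate` stands in for an LLM call that writes an ASP program (optionally conditioned on feedback), and `check` stands in for running an ASP solver and returning structured feedback. No real LLM or solver API is used or implied.

```python
def self_correct(generate, check, max_rounds=3):
    """Generate a program, validate it, and retry with solver feedback.

    generate(feedback) -> candidate program (feedback is None on round 1).
    check(candidate)   -> (ok, feedback) where feedback describes the error.
    Returns (program, rounds_used), or (None, max_rounds) if every round fails.
    """
    feedback = None
    for round_number in range(1, max_rounds + 1):
        candidate = generate(feedback)
        ok, feedback = check(candidate)
        if ok:
            return candidate, round_number
    return None, max_rounds


# Stub stand-ins so the loop can be exercised without an LLM or solver.
seen_feedback = []


def fake_generate(feedback):
    seen_feedback.append(feedback)
    return "fixed program" if feedback else "broken program"


def fake_check(program):
    if program == "fixed program":
        return True, None
    return False, "syntax error near line 1"


program, rounds = self_correct(fake_generate, fake_check)
```

The point of the structure is that the solver's error message, rather than hand-written domain knowledge, is what steers the next attempt.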

2604.27277 2026-06-12 cs.LG cs.AI cs.CV 版本更新

BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning

BrainDINO:一种用于通用临床表征学习的脑MRI基础模型

Yizhou Wu, Shansong Wang, Yuheng Li, Mojtaba Safari, Mingzhe Hu, Chih-Wei Chang, Harini Veeraraghavan, Xiaofeng Yang

发表机构 * Department of Radiation Oncology and Winship Cancer Institute, Emory University(放射肿瘤科和Winship癌症研究所,埃默里大学) Department of Radiation and Cellular Oncology, The University of Chicago(放射肿瘤学与细胞肿瘤学部,芝加哥大学) Department of Electrical and Computer Engineering, Georgia Institute of Technology(电气与计算机工程系,佐治亚理工学院) Department of Biomedical Engineering, Georgia Institute of Technology(生物医学工程系,佐治亚理工学院) Department of Biomedical Informatics, Emory University(生物医学信息学系,埃默里大学) Department of Medical Physics, Memorial Sloan Kettering Cancer Center(医学物理系,纪念斯隆凯特琳癌症中心)

AI总结 提出BrainDINO,一种基于自蒸馏的基础模型,在约660万张未标记轴向切片上训练,通过冻结编码器加轻量任务头,在多种脑MRI任务上达到或超越基线,尤其在小样本场景下优势显著。

详情
Comments
25 pages, 5 figures
AI中文摘要

脑MRI支撑着广泛的神经科学和临床应用,然而大多数基于学习的方法仍针对特定任务且需要大量标注数据。本文表明,单一的自监督表征可以泛化到异质的脑MRI终点。我们训练了BrainDINO,一个自蒸馏的基础模型,使用了来自20个数据集的约660万张未标记轴向切片,这些数据集涵盖了人群、疾病和采集设置的广泛变异。通过使用冻结编码器加轻量任务头,BrainDINO支持肿瘤分割、神经退行性和神经发育性疾病分类、脑年龄估计、卒中后时间预测、分子状态预测、MRI序列分类和生存建模等任务的迁移。在各种任务和监督机制下,BrainDINO始终等于或超过自然图像和MRI特定自监督基线,在标签稀缺时尤其具有优势。表征分析进一步显示,在缺乏任务特定监督的情况下,特征结构具有解剖学组织和病理敏感性。我们的发现表明,大规模切片级自监督学习可以产生统一的脑MRI表征,支持多样化的神经影像任务,无需体积预训练或全网络微调,为稳健且数据高效的脑影像分析建立了可扩展的基础。代码可在 https://github.com/mclwu22/BrainDINO 获取。

英文摘要

Brain MRI underpins a wide range of neuroscientific and clinical applications, yet most learning-based methods remain task-specific and require substantial labeled data. Here we show that a single self-supervised representation can generalize across heterogeneous brain MRI endpoints. We trained BrainDINO, a self-distilled foundation model, on approximately 6.6 million unlabeled axial slices from 20 datasets encompassing broad variation in population, disease, and acquisition setting. Using a frozen encoder with lightweight task heads, BrainDINO supported transfer across tumor segmentation, classification of neurodegenerative and neurodevelopmental conditions, brain age estimation, post-stroke temporal prediction, molecular status prediction, MRI sequence classification, and survival modeling. Across tasks and supervision regimes, BrainDINO consistently equaled or exceeded natural-image and MRI-specific self-supervised baselines, with particularly strong advantages under label scarcity. Representation analyses further showed anatomically organized and pathology-sensitive feature structure in the absence of task-specific supervision. Our findings indicate that large-scale slice-wise self-supervised learning can yield a unified brain MRI representation that supports diverse neuroimaging tasks without volumetric pretraining or full-network fine-tuning, establishing a scalable foundation for robust and data-efficient brain imaging analysis. Code is available at https://github.com/mclwu22/BrainDINO

2604.26940 2026-06-12 cs.CL 版本更新

Select to Think: Unlocking SLM Potential with Local Sufficiency

Select to Think: 利用局部充分性解锁小语言模型潜力

Wenxuan Ye, Yangyang Zhang, Xueli An, Georg Carle, Yunpu Ma

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出Select to Think (S2T)方法,通过将大语言模型角色从生成转为选择,并蒸馏选择逻辑到小语言模型,使其在推理时无需依赖大模型,显著提升性能。

详情
Comments
Accepted to ICML 2026. Code is available at https://github.com/YeRona/Select-to-Think
AI中文摘要

小语言模型(SLM)部署高效,但在推理能力上常落后于大语言模型(LLM)。现有解决方案要么在推理分歧点调用LLM,导致大量延迟和成本,要么依赖标准蒸馏,受限于SLM准确模仿LLM复杂生成分布的能力。我们通过识别局部充分性来解决这一困境:在分歧点,LLM偏好的token通常位于SLM的top-K预测中,即使未能成为SLM的top-1选择。因此,我们提出Select to Think(S2T),将LLM的角色从开放式生成重新定义为在SLM的候选提案中进行选择,将监督信号简化为离散的候选排名。利用这一点,我们引入S2T-Local,将选择逻辑蒸馏到SLM中,使其能够在推理时自主重新排序,无需依赖LLM。实验表明,1.5B SLM的top-8候选包含32B LLM选择的命中率达95%,S2T-Local使1.5B SLM的数学平均相对贪心解码提升24.1%,以单轨迹效率达到8路径自一致性的效果。

英文摘要

Small language models (SLMs) offer efficient deployment, yet they often lag behind their larger counterparts (LLMs) in reasoning. Existing remedies either invoke an LLM at points of reasoning divergence, incurring substantial latency and cost, or rely on standard distillation, which is limited by the SLM's capacity to accurately mimic the LLM's complex generative distribution. We address this dilemma by identifying local sufficiency: at divergence points, the LLM's preferred token often resides within the SLM's top-K next-token predictions, even when it is not the SLM's top-1 choice. We therefore propose Select to Think (S2T), which reframes the LLM's role from open-ended generation to selection among the SLM's proposals, simplifying the supervision signal to discrete candidate rankings. Leveraging this, we introduce S2T-Local, which distills the selection logic into the SLM, empowering it to perform autonomous re-ranking without inference-time LLM dependency. Empirically, a 1.5B SLM's top-8 candidates contain the 32B LLM's choice with a 95% hit rate, and S2T-Local improves the 1.5B SLM's Math Avg. by 24.1% (relative) over greedy decoding, matching the efficacy of 8-path self-consistency with single-trajectory efficiency.
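The "local sufficiency" measurement, how often the larger model's preferred token falls inside the smaller model's top-K, can be sketched as below. This is a toy under assumptions, not the S2T code: the function names, the plain-list logits, and the `selector_score` callable (standing in for whatever ranks the candidates) are all hypothetical.

```python
def top_k_ids(logits, k):
    """Token ids of the k highest logits, best first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def top_k_hit_rate(slm_logits_per_step, llm_choices, k):
    """Fraction of steps where the LLM's chosen token is in the SLM's top-k."""
    hits = sum(
        choice in top_k_ids(logits, k)
        for logits, choice in zip(slm_logits_per_step, llm_choices)
    )
    return hits / len(llm_choices)


def select_from_candidates(slm_logits, selector_score, k):
    """Pick the best of the SLM's top-k candidates according to a selector.

    selector_score(token_id) -> float is a hypothetical stand-in for the
    selection signal; it never proposes tokens outside the top-k.
    """
    return max(top_k_ids(slm_logits, k), key=selector_score)


# Hypothetical logits over a 4-token vocabulary at two decoding steps.
slm_logits_per_step = [[0.1, 2.0, 1.5, -1.0], [3.0, 0.2, 0.1, 2.9]]
llm_choices = [2, 1]  # token ids a larger model would have preferred

rate_at_1 = top_k_hit_rate(slm_logits_per_step, llm_choices, k=1)
rate_at_2 = top_k_hit_rate(slm_logits_per_step, llm_choices, k=2)
picked = select_from_candidates(
    slm_logits_per_step[0], lambda tok: 1.0 if tok == 2 else 0.0, k=2
)
```

Restricting the selector to the SLM's own candidates turns an open-ended generation problem into a K-way choice, which is the simplification the abstract attributes to S2T.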

2604.24079 2026-06-12 cs.CL cs.AI 版本更新

The Pragmatic Persona: Discovering LLM Persona through Bridging Inference

语用人格:通过桥接推理发现LLM人格

Jisoo Yang, Jongwon Ryu, Minuk Ma, Trung X. Pham, Junyeong Kim

发表机构 * Department of Artificial Intelligence, Chung-Ang University, Seoul, 06974, Republic of Korea(中央大学人工智能系,韩国首尔) Department of Computer Science, University of British Columbia, Vancouver, BC V6T 1Z4, Canada(不列颠哥伦比亚大学计算机科学系,加拿大温哥华) Van Lang University, Ho Chi Minh City, Vietnam(文朗大学,越南胡志明市)

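AI summary: Proposes a bridging-inference framework that builds discourse-level knowledge graphs to capture implicit semantic links in LLM dialogue, enabling the discovery of stable persona traits at the level of discourse coherence; it outperforms frequency- and style-based baselines.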

Details
Comments
15 pages, 4 figures, accepted to ICPR 2026
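AI abstract (translated from Chinese)

Large language models (LLMs) exhibit inherent and distinctive personas through dialogue. However, most existing persona discovery methods rely on surface-level lexical or stylistic cues, treat dialogue as a flat sequence of tokens, and fail to capture the deeper discourse structures that sustain persona consistency. To address this limitation, we propose a novel analytical framework that interprets LLM dialogue through bridging inference, that is, the implicit conceptual relations that connect utterances via shared world knowledge and discourse coherence. By modeling these relations as structured knowledge graphs, our method captures the latent semantic links that govern how LLMs organize meaning across dialogue turns, enabling persona discovery at the level of discourse coherence rather than surface realization. Experimental results across multiple reasoning backbones and target LLMs, ranging from small models to 80B-parameter systems, show that bridging-inference graphs yield significantly stronger semantic coherence and more stable persona identification than frequency- or style-based baselines. These results indicate that persona traits are consistently encoded in the structural organization of discourse rather than in isolated lexical patterns. This work presents a systematic framework for probing, extracting, and visualizing latent LLM personas through the lens of Cognitive Discourse Theory, bridging computational linguistics, cognitive semantics, and persona reasoning in large language models. Code is available at https://github.com/JiSoo-Yang/Persona_Bridging.git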

English abstract

Large Language Models (LLMs) reveal inherent and distinctive personas through dialogue. However, most existing persona discovery approaches rely on surface-level lexical or stylistic cues, treating dialogue as a flat sequence of tokens and failing to capture the deeper discourse-level structures that sustain persona consistency. To address this limitation, we propose a novel analytical framework that interprets LLM dialogue through bridging inference -- implicit conceptual relations that connect utterances via shared world knowledge and discourse coherence. By modeling these relations as structured knowledge graphs, our approach captures latent semantic links that govern how LLMs organize meaning across turns, enabling persona discovery at the level of discourse coherence rather than surface realizations. Experimental results across multiple reasoning backbones and target LLMs, ranging from small-scale models to 80B-parameter systems, demonstrate that bridging-inference graphs yield significantly stronger semantic coherence and more stable persona identification than frequency or style-based baselines. These results show that persona traits are consistently encoded in the structural organization of discourse rather than isolated lexical patterns. This work presents a systematic framework for probing, extracting, and visualizing latent LLM personas through the lens of Cognitive Discourse Theory, bridging computational linguistics, cognitive semantics, and persona reasoning in large language models. Codes are available at https://github.com/JiSoo-Yang/Persona_Bridging.git

2508.04427 2026-06-12 cs.LG cs.AI updated version

Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models

Md Raisul Kibria, Sébastien Lafond, Janan Arslan

Affiliations: University of Science and Technology of China

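AI summary: This paper systematically reviews research on the explainability of multimodal models from 2020 to early 2024. It finds that most work focuses on vision-language and language-only models and that attention mechanisms are the main explanation method, but that evaluation lacks systematicity and robustness; it also proposes recommendations for improvement.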

Details
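AI abstract (translated from Chinese)

Multimodal learning has made remarkable progress in recent years, particularly with the integration of attention-based models, which has brought significant performance gains across a variety of tasks. At the same time, the demand for explainable artificial intelligence (XAI) has driven a growing body of research aimed at explaining the complex decision-making processes of these models. This systematic literature review analyzes studies published between January 2020 and early 2024 that focus on the explainability of multimodal models. Within the framework of the broader goals of XAI, we examine the literature along multiple dimensions, including model architecture, modalities involved, explanation algorithms, and evaluation methods. Our analysis shows that most studies concentrate on vision-language and language-only models, with attention mechanisms being the most commonly used explanation method. However, these methods often fail to capture the full spectrum of cross-modal interactions, a problem further aggravated by architectural heterogeneity across domains. Importantly, we find that evaluation methods for XAI in multimodal settings are largely non-systematic, lack consistency and robustness, and do not account for modality-specific cognitive and contextual factors. To address these shortcomings, we not only synthesize the findings of the surveyed studies but also include a complementary analysis that integrates recent and emerging advances in multimodal explainability. Based on these insights, we propose a comprehensive set of recommendations intended to promote rigorous, transparent, and standardized evaluation and reporting practices in multimodal XAI research. Our goal is to support the future development of more interpretable, accountable, and responsible multimodal AI systems with explainability at their core.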

English abstract

Multimodal learning has witnessed remarkable advancements in recent years, particularly with the integration of attention-based models, leading to significant performance gains across a variety of tasks. Parallel to this progress, the demand for explainable artificial intelligence (XAI) has spurred a growing body of research aimed at interpreting the complex decision-making processes of these models. This systematic literature review analyzes research published between January 2020 and early 2024 that focuses on the explainability of multimodal models. Framed within the broader goals of XAI, we examine the literature across multiple dimensions, including model architecture, modalities involved, explanation algorithms and evaluation methodologies. Our analysis reveals that most studies are concentrated on vision-language and language-only models, with attention-based techniques being the most commonly employed for explanation. However, these methods often fall short in capturing the full spectrum of interactions between modalities, a challenge further compounded by the architectural heterogeneity across domains. Importantly, we find that evaluation methods for XAI in multimodal settings are largely non-systematic, lacking consistency, robustness, and consideration for modality-specific cognitive and contextual factors. To address these gaps, we not only synthesize findings from the surveyed works but also incorporate a complementary analysis that integrates recent and emerging advances driving multimodal explainability. Based on these insights, we provide a comprehensive set of recommendations aimed at promoting rigorous, transparent, and standardized evaluation and reporting practices in multimodal XAI research. Our goal is to support future research in more interpretable, accountable, and responsible multimodal AI systems, with explainability at their core.

2604.23165 2026-06-12 cs.CV updated version

BSViT: A Burst Spiking Vision Transformer for Expressive and Efficient Visual Representation Learning

Hongxiang Peng, Dewei Bai, Hong Qu

Affiliations: School of Computer Science and Engineering, University of Electronic Science and Technology of China

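AI summary: Proposes BSViT, which uses a dual-channel burst spiking self-attention mechanism and a local-neighborhood masking strategy to address the limited information capacity of binary spikes and the dense interactions of global self-attention in spiking Vision Transformers, achieving higher accuracy and energy efficiency on static and event-based vision benchmarks.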

Details
Comments
Accepted by ECML PKDD 2026
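AI abstract (translated from Chinese)

Spiking Vision Transformers (S-ViTs) provide a promising framework for energy-efficient visual learning. However, existing designs remain limited by two fundamental problems: the limited information capacity of binary spike coding and the dense token interactions introduced by global self-attention. To address these challenges, this paper proposes BSViT, a burst-spiking-driven Vision Transformer with a Dual-Channel Burst Spiking Self-Attention (DBSSA) mechanism. DBSSA encodes queries with binary spikes and keys with burst spikes to enhance representational capacity. The value pathway uses dual excitatory and inhibitory binary channels, enabling signed modulation and richer spike interactions. Importantly, the entire attention operation remains addition-only, ensuring compatibility with energy-efficient neuromorphic hardware. To further reduce spike activity and incorporate spatial priors, a patch adjacency masking strategy is introduced that restricts attention to local neighborhoods, yielding structure-aware sparsity and lower computational overhead. In addition, burst spike coding is systematically integrated throughout the network to raise spike-level representational capacity beyond that of conventional binary spikes. Extensive experiments on static and event-based vision benchmarks show that BSViT consistently outperforms existing spiking Transformers in accuracy while maintaining competitive energy efficiency.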

English abstract

Spiking Vision Transformers (S-ViTs) offer a promising framework for energy-efficient visual learning. However, existing designs remain limited by two fundamental issues: the restricted information capacity of binary spike coding and the dense token interactions introduced by global self-attention. To address these challenges, this work proposes BSViT, a burst spiking-driven Vision Transformer featuring a Dual-Channel Burst Spiking Self-Attention (DBSSA) mechanism. DBSSA encodes queries with binary spikes and keys with burst spikes to enhance representational capacity. The value pathway adopts dual excitatory and inhibitory binary channels, enabling signed modulation and richer spike interactions. Importantly, the entire attention operation preserves addition-only computation, ensuring compatibility with energy-efficient neuromorphic hardware. To further reduce spike activity and incorporate spatial priors, a patch adjacency masking strategy is introduced to restrict attention to local neighborhoods, resulting in structure-aware sparsity and reduced computational overhead. In addition, burst spike coding is systematically integrated across the network to increase spike-level representational capacity beyond conventional binary spiking. Extensive experiments on both static and event-based vision benchmarks demonstrate that BSViT consistently outperforms existing spiking Transformers in accuracy while maintaining competitive energy efficiency.
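
To make the idea of restricting attention to local patch neighborhoods more concrete, here is a minimal Python sketch of one way such a mask could be built. It is an illustration under stated assumptions, not the BSViT implementation: the abstract does not specify the neighborhood definition, so the Chebyshev-distance `radius`, the function names `patch_adjacency_mask` and `masked_scores`, and the choice to zero out masked scores are all hypothetical.

```python
from typing import List


def patch_adjacency_mask(height: int, width: int,
                         radius: int = 1) -> List[List[bool]]:
    """Boolean mask where mask[i][j] is True if patch j lies within `radius`
    grid steps (Chebyshev distance) of patch i on a height x width grid."""
    n = height * width
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        row_i, col_i = divmod(i, width)
        for j in range(n):
            row_j, col_j = divmod(j, width)
            mask[i][j] = max(abs(row_i - row_j), abs(col_i - col_j)) <= radius
    return mask


def masked_scores(scores: List[List[float]],
                  mask: List[List[bool]]) -> List[List[float]]:
    """Zero out attention scores between non-adjacent patches (assumed rule)."""
    return [[s if keep else 0.0 for s, keep in zip(score_row, mask_row)]
            for score_row, mask_row in zip(scores, mask)]


# A 3 x 3 patch grid: each patch may attend only to itself and its neighbors.
mask = patch_adjacency_mask(3, 3, radius=1)
kept = sum(sum(row) for row in mask)
print(f"{kept} of {len(mask) ** 2} patch pairs kept")
```

On a 3 x 3 grid with radius 1 this keeps 49 of the 81 patch pairs; the kept fraction shrinks on larger grids, which is where the sparsity benefit would come from.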