arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.01813 2026-06-02 cs.CL

Cost-Aware Diffusion Draft Trees for Speculative Decoding

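Shuai Zhang, Huachuan Qiu, Hongliang He, Yong Dai

Affiliations: Zhejiang University; Westlake University

AI summary: Proposes CaDDTree, which directly maximizes token throughput by jointly optimizing the tree structure and node budget, and proves that the throughput function is unimodal under a convex verification cost, removing the need for offline budget search.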

Abstract

Speculative decoding accelerates inference by having a lightweight drafter propose tokens verified in parallel by the target language model. Block diffusion drafters such as DFlash generate an entire draft block in one pass, yielding per-position marginals; DDTree uses these to build a candidate tree that maximizes expected acceptance length under a fixed node budget. We observe, however, that acceptance length is non-decreasing in budget: it always favors larger trees regardless of verification cost, offering no principled basis for budget selection. We introduce CaDDTree (Cost-aware Diffusion Draft Tree), a method that directly optimizes token throughput (expected tokens generated per unit time) by jointly selecting the tree structure and node budget. We model draft and verification latencies explicitly, show that the throughput objective decomposes into a per-round one-dimensional search over the budget, and prove that under a convex verification cost the throughput function is unimodal, enabling an efficient greedy stopping rule. CaDDTree requires no offline budget search, adapting the budget each round from the current per-position distributions and verification cost. Experiments on Qwen3-4B and Qwen3-8B across eight benchmarks spanning reasoning, coding, and instruction-following tasks show that CaDDTree matches or surpasses DDTree with oracle budget selection on nearly all tasks.
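
The greedy stopping rule described in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes independent per-position marginals, a best-first tree whose nodes are added in order of decreasing path probability, one guaranteed token per round from the target model, and a caller-supplied convex latency function. The names `choose_budget`, `draft_ms`, and `verify_ms` are hypothetical.

```python
import heapq


def choose_budget(marginals, draft_ms, verify_ms):
    """Pick a node budget by greedy stopping on estimated throughput.

    marginals[d] lists token probabilities at draft position d, sorted
    in descending order. verify_ms(n) is an assumed convex latency model
    for verifying a tree with n draft nodes.
    """
    # Heap entries: (-path_probability, depth, rank, parent_path_probability)
    heap = [(-marginals[0][0], 0, 0, 1.0)]
    expected_len = 1.0  # assumption: the target model always yields one token
    best_n = 0
    best_tp = expected_len / (draft_ms + verify_ms(0))
    while heap:
        neg_p, depth, rank, parent_p = heapq.heappop(heap)
        p = -neg_p
        tp = (expected_len + p) / (draft_ms + verify_ms(best_n + 1))
        if tp <= best_tp:
            break  # under unimodality, the first non-improvement is the peak
        best_n += 1
        best_tp = tp
        expected_len += p
        # Next sibling: same parent, next most likely token at this position.
        if rank + 1 < len(marginals[depth]):
            sibling_p = parent_p * marginals[depth][rank + 1]
            heapq.heappush(heap, (-sibling_p, depth, rank + 1, parent_p))
        # First child: extend this path with the top token of the next position.
        if depth + 1 < len(marginals):
            child_p = p * marginals[depth + 1][0]
            heapq.heappush(heap, (-child_p, depth + 1, 0, p))
    return best_n, best_tp


marginals = [[0.9, 0.05], [0.8, 0.1], [0.5, 0.3]]
budget, tokens_per_ms = choose_budget(marginals, 2.0, lambda n: 10.0 + 0.5 * n * n)
print(budget)  # 2
```

With a flat verification cost the same loop keeps every candidate node, which mirrors the abstract's observation that acceptance length alone always favors larger trees.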

2606.01811 2026-06-02 cs.CL cs.AI cs.LG

"I've Seen How This Goes": Characterizing Diversity via Progressive Conditional Surprise

Matthew Khoriaty, David Williams-King, Shi Feng

Affiliations: University of California, Berkeley; University of Cambridge; Stanford University

AI summary: Proposes Decan ($D_{Ca_n}$), a diversity metric based on in-context learning that computes a per-byte score in a single forward pass, with no embedding model, reference corpus, or human labels, and validates it on several benchmarks.

Comments 28 pages, 18 figures, 9 tables. Accepted to the Workshop on Generative AI, Creativity, and Human-AI Co-Creation @ ICML 2026 (non-archival). Code and data: https://github.com/AMindToThink/icl-diversity

Abstract

Measuring the diversity of creative outputs is central to evaluating post-training mode collapse, comparing decoding strategies, and quantifying creative behavior in both AI and human writing. We propose a new approach to measuring diversity using in-context learning, of which the "Decan" metric, $D_{Ca_n} = C \times a_n$, is the working instance we evaluate: a per-byte score read off the per-token log-probabilities of a base model $\theta$ in a single forward pass per permutation, with no embedding model, no reference corpus, and no human labels. This approach is grounded in information theory, makes use of language model in-context learning to detect a wide range of similarities between any number of inputs, and obviates the need to train a special-purpose model. The same pipeline scores AI samples and human-written response sets, with diversity treated as a property of (responses, prompt, scoring model). On Tevet and Berant's human-grounded McDiv benchmark, $D_{Ca_n}$ reaches OCA 0.846 on the McDiv prompt_gen set where it performs best, behind the strongest neural baseline reported in Tevet and Berant (SentBERT, 0.897). On the OLMo-2-7B post-training pipeline, $D_{Ca_n}$ drops monotonically across the base $\to$ SFT $\to$ DPO $\to$ RLVR stages, detecting the type of diversity loss that creative-writing applications care about.

2606.01810 2026-06-02 cs.AI

Token Predictors Are Not Planners: Building Physically Grounded Causal Reasoners

Zheng Lu, Mingqi Gao, Qinlei Xie, Wanqi Zhong, Hanwen Cui, Heng Cao, Zirui Song, Yifan Yang, Chong Luo, Bei Liu, Yiming Li

Affiliations: Tsinghua University; Microsoft Research Asia; MBZUAI

AI summary: To address embodied vision-language planning models that rely on statistical language priors rather than causal reasoning, introduces the Causal-Plan-Bench benchmark and the Causal-Plan-1M dataset, and trains the Causal Planner model to move from token prediction to physically grounded causal reasoning.

Comments 77 pages, appendices included. Code: https://github.com/THUSI-Lab/Causal-Reasoner

Abstract

Current benchmarks for embodied vision-language planning often favor linguistic next-token prediction over physically grounded next-state reasoning. This rewards models that mimic statistical language priors rather than track causal dependencies, reducing physical planning to shallow sequence modeling. We argue that reliable physical autonomy requires a shift from linguistically grounded token prediction toward physically grounded causal reasoning. To this end, we introduce Causal-Plan-Bench, a high-fidelity diagnostic suite curated through multi-stage verification to evaluate embodied planning across four causal dimensions. We also construct Causal-Plan-1M, a million-scale corpus of explicit reasoning traces produced by a four-stage annotation pipeline over egocentric videos. Extensive evaluation shows that leading models still struggle to demonstrate genuine physical agency, with Gemini 3 Pro reaching only 38.18 on our benchmark. In contrast, our training recipe enables Causal Planner, built on Qwen3-VL-8B, to internalize physical logic for more accurate next-state estimation. The model achieves strong in-domain performance and cross-benchmark generalization, and reveals a Causal Scaling Law: scaling causal training data to one million instances yields a 36.3% relative gain, from 33.22 to 45.28. Overall, our work provides a concrete step toward turning agents from superficial token predictors into physically grounded causal reasoners.
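
The 36.3% figure quoted in this abstract is a relative gain, not a difference in benchmark points. As a quick check using only the two scores given above:

```latex
\[
\text{relative gain}
  = \frac{45.28 - 33.22}{33.22}
  = \frac{12.06}{33.22}
  \approx 0.363
  = 36.3\%
\]
```

The absolute improvement is therefore 12.06 points on the benchmark's scale.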

2606.01808 2026-06-02 cs.CV

Personalized 3D Myocardial Infarct Geometry Reconstruction from Cine MRI for Cardiac Digital Twins

Yilin Lyu, Mark YY Chan, Ching-Hui Sia, Lei Li

Affiliations: Department of Biomedical Engineering, National University of Singapore; Department of Medicine, National University of Singapore; Department of Cardiology, National University Heart Centre Singapore

AI summary: Proposes an explicit geometry-motion embedded model that fully automatically reconstructs personalized, simulation-ready 3D myocardial infarct geometries from multi-view cine MRI, using dual-branch adaptive fusion and AHA-17-guided multi-scale supervision for contrast-free infarct characterization.

Comments 14 pages

Abstract

Accurate 3D geometric characterization of myocardial infarction (MI) is essential for building cardiac digital twins (CDTs) to precisely simulate infarct-related electrophysiology. Late gadolinium enhancement magnetic resonance imaging (LGE MRI) is the clinical reference for locating MI, yet its reliance on contrast agents restricts use in renally impaired patients and limits longitudinal follow-ups. As an alternative, contrast-free cine MRI visualizes abnormal ventricular wall motion, which is highly indicative of the infarcted area. In this study, we propose a novel explicit geometry-motion embedded model to fully automatically reconstruct personalized, simulation-ready 3D MI geometries directly from multi-view cine MRIs. Specifically, we construct a 4D (3D + t) biventricular mesh to explicitly extract and decouple geometry-aware and motion-aware features. We further design a dual-branch module for adaptive geometry-motion fusion to capture spatiotemporal dependencies for mapping the infarcted region. Furthermore, we introduce multi-scale supervision utilizing an AHA-17 segment-guided cross-attention mechanism to steer the prediction, ensuring biophysically consistent reconstruction. Experimental results on 225 cine MRIs demonstrated that the proposed 3D MI reconstruction achieved high performance with an average Dice score of $0.678 \pm 0.011$. In the downstream in-silico electrophysiological simulation evaluations, the results were highly consistent with the LGE-derived ground truth, highlighting the great potential of the proposed model for contrast-free scar characterization and seamless integration into CDT modeling. The code will be released publicly upon acceptance of the manuscript for publication.
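
For readers unfamiliar with the Dice score reported in this abstract, the standard definition is twice the overlap divided by the total size of the two regions. The sketch below assumes the predicted and reference infarct regions are given as collections of voxel or mesh-node identifiers; the function name and the empty-mask convention are illustrative choices, not taken from the paper.

```python
def dice_score(pred, truth):
    """Dice coefficient between two regions given as iterables of element ids."""
    pred, truth = set(pred), set(truth)
    if not pred and not truth:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * len(pred & truth) / (len(pred) + len(truth))


predicted = {(0, 0, 0), (0, 0, 1), (0, 1, 1)}
reference = {(0, 0, 1), (0, 1, 1), (1, 1, 1)}
print(round(dice_score(predicted, reference), 3))  # 0.667
```

A score of 1.0 means the two regions coincide exactly, and 0.0 means they do not overlap at all.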

2606.01806 2026-06-02 cs.CL cs.AI cs.LG

ProbeScale: Probing Analysis to Optimize Neural Scaling Laws for Efficient Small Language Model Inference

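Sourav Das

Affiliations: Department of Computer Science and Engineering, Indian Institute of Information Technology Kalyani

AI summary: Proposes ProbeScale, a framework that uses scaling laws and probing analysis to identify parameter-efficient subnetworks in pre-trained small language models by maximizing task-weighted probe performance under a parameter budget, achieving 5x to 10x parameter reduction while retaining 95% to 98% of the original performance.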

Comments 7 pages, 2 figures, ACL

Abstract

Small Language Models (SLMs) offer a balance between capability and computational feasibility. Neural scaling laws inform their optimal training, suggesting that they possess rich internal representations that scale with their size. However, deploying even these SLMs can be challenging under strict resource constraints. Language model probing provides methods for analyzing the linguistic knowledge encoded in a model's internals. We propose ProbeScale, a framework that unifies insights from scaling laws and probing to identify parameter-efficient subnetworks within pre-trained SLMs. ProbeScale utilizes the high-quality representations of well-scaled SLMs and uses task-specific probes to mathematically quantify the relevance of each layer for target downstream capabilities. This allows selecting subnetworks that optimally trade off performance against parameter size. We formulate the subnetwork selection as finding a layer subset maximizing aggregated, task-weighted probe performance under a parameter budget. Experiments on representative SLMs such as RoBERTa-Large and T5-Base demonstrate that ProbeScale identifies subnetworks achieving significant parameter reduction, from 5 to 10 times, while maintaining high performance (95% to 98% of the original SLMs) on targeted tasks, outperforming heuristic baselines.
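
The layer-subset formulation in this abstract resembles a 0/1 knapsack problem if one assumes that each layer's contribution is additive. The abstract does not state that assumption, so the sketch below is only one possible reading: each layer's value is the task-weighted sum of its probe scores, its cost is an integer parameter count, and a standard dynamic program picks the best subset within the budget. All names and numbers are hypothetical.

```python
def select_layers(probe_scores, task_weights, layer_params, budget):
    """Choose layers maximizing weighted probe score within a parameter budget.

    probe_scores[i][t] is the probe score of layer i on task t,
    layer_params[i] is an integer cost (e.g. millions of parameters).
    Returns (total_value, tuple_of_chosen_layer_indices).
    """
    values = [sum(w * s for w, s in zip(task_weights, row)) for row in probe_scores]
    # best[b] holds the best (value, layers) achievable with cost at most b.
    best = [(0.0, ())] * (budget + 1)
    for i, (value, cost) in enumerate(zip(values, layer_params)):
        # Iterate budgets downward so each layer is used at most once.
        for b in range(budget, cost - 1, -1):
            candidate = best[b - cost][0] + value
            if candidate > best[b][0]:
                best[b] = (candidate, best[b - cost][1] + (i,))
    return best[budget]


scores = [[0.5, 0.4], [0.8, 0.7], [0.9, 0.6], [0.6, 0.9]]
total, layers = select_layers(scores, [0.5, 0.5], [10, 20, 30, 20], budget=50)
print(layers)  # (0, 1, 3)
```

Real layers interact, so an additive objective is a simplification; the sketch covers only the selection step, not how the probes are trained or how the retained layers are reconnected.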

2606.01803 2026-06-02 cs.AI

OctoT2I: A Self-Evolving Agentic Text-to-Image Router

Xu Jiang, Bin Chen, Gehui Li, Yule Duan, Ronggang Wang, Jian Zhang

Affiliations: School of Electronic and Computer Engineering, Peking University; Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University

AI summary: Proposes OctoT2I, which builds a knowledge base through a self-evolving mechanism and uses a stateful multi-round routing strategy to jointly optimize generation quality and inference efficiency, reaching 0.96 on GenEval with a 90.3% inference speedup and a 56.6% energy-efficiency gain.

Abstract

The explosive growth of Text-to-Image (T2I) models, from large-scale versions to lightweight, real-time ones, now faces diminishing marginal returns from single-model scaling. Agentic T2I methods emerged to alleviate this bottleneck by using multiple models. However, existing agentic T2I methods suffer from three key challenges: reliance on expensive handcrafted priors or human annotations, rigid single-path decision mechanisms, and a neglect of inference efficiency. To address these challenges, we introduce OctoT2I, a novel agentic framework that reformulates the T2I task as a joint optimization of generation quality and inference efficiency. OctoT2I implements a stateful, multi-round routing strategy that adaptively selects the most suitable tool based on its knowledge and memory. This strategy is enabled by a knowledge base built from scratch by our novel Self-Evolving Mechanism. This mechanism, which requires no human supervision, first autonomously defines foundational Conceptual Dimensions (e.g., style, color, count) and then intelligently explores their combinations via an iterative "Propose-Solve-Evaluate-Learn" (PSEL) loop. The PSEL loop efficiently discovers each tool's capability frontier, driving continuous improvement without external guidance. Extensive experiments demonstrate that OctoT2I achieves competitive performance (0.96) on GenEval while delivering a 90.3% inference speedup and a 56.6% energy-efficiency gain over the leading baseline (Flow-GRPO), striking an exceptional balance between performance and efficiency. Code and models will be made available.

2606.01800 2026-06-02 cs.CL cs.AI cs.LG

Multilinguality of Large Language Models From a Structural Perspective

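Haruki Sakajo, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

Affiliations: Nara Institute of Science and Technology

AI summary: Explores the multilinguality of large language models through representational structural analysis, finding that low-resource languages differ structurally from English more than high- and mid-resource languages do, and that language-specific post-training alters structures while preserving inter-language relationships.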

Abstract

Large language models (LLMs) have excelled in processing multiple languages through pre- and post-training on multilingual data, even though English dominates the training data. Prior work focusing on token representations has revealed how those LLMs process non-English text. Although these analyses have provided insightful findings, they fail to capture a structural view, which is an inherent property of language. In this study, we explore the multilinguality of LLMs through representational structural analysis. Our findings reveal that low-resource languages are structurally more different from English than high- and mid-resource languages, and that language-specific post-training alters their structures while preserving inter-language relationships.

2606.01799 2026-06-02 cs.LG stat.ML

Tree-Guided Identify-Then-Exploit: A Unified Framework of Best Arm Identification and Regret Minimization for Dueling Bandits

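Pu Wang, Yao-Xiang Ding

Affiliations: State Key Lab of CAD&CG

AI summary: For $N$-armed stochastic dueling bandits under the Condorcet-winner assumption, proposes the unified Tree-Guided Identify-Then-Exploit (TG-ITE) framework. A shared tree-guided identification step finds a high-confidence candidate within $O(N)$ comparisons, and objective-specific exploitation strategies then give, for the first time in a single framework, $O(N)$ sample complexity for best-arm identification, $O(N)$ weak regret, and $O(N \log T)$ strong regret, eliminating the sub-optimal $O(\log N)$ gap of existing methods.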

Abstract

We study $N$-armed stochastic dueling bandits under the Condorcet-winner assumption, where three widely adopted objectives are considered: best-arm identification (BAI), weak regret, and strong regret. We propose Tree-Guided Identify-Then-Exploit (TG-ITE), the first unified framework to tackle all these objectives to our knowledge. Without requiring stronger assumptions, we propose a shared tree-guided identification approach to find a high-confidence incumbent within $O(N)$ comparisons. We further propose varied exploitation strategies to utilize this warm-start stage to optimize the specific objectives at hand. This methodology enables our approach to (1) achieve $O(N)$ sample complexity in BAI without commonly adopted stronger assumptions; (2) build the first winner-stays-style algorithm to achieve $O(N)$ weak regret; (3) enjoy the same $O(N \log T)$ guarantee as specialized strong-regret approaches; (4) realize the joint optimization of BAI and weak regret with $O(N)$ guarantees for both, eliminating the sub-optimal gap of $O(\log N)$ in the existing approach. Our results provide evidence that the trade-off between BAI and regret minimization is relatively benign in dueling bandits.
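
The Condorcet-winner assumption in this abstract means that one arm beats every other arm with probability above one half. The sketch below is a noise-free illustration of that definition, not the TG-ITE algorithm: given a known preference matrix, a single winner-stays pass uses $N - 1$ comparisons to find the only possible candidate, followed by a verification step. The function name and matrices are hypothetical.

```python
def condorcet_winner(P):
    """Return the Condorcet winner of preference matrix P, or None.

    P[i][j] is the probability that arm i beats arm j.
    """
    n = len(P)
    champion = 0
    # Winner-stays pass: a Condorcet winner, once it becomes champion,
    # is never displaced, so it must be the champion at the end.
    for challenger in range(1, n):
        if P[challenger][champion] > 0.5:
            champion = challenger
    # Verify, since a champion can survive even when no winner exists.
    if all(P[champion][j] > 0.5 for j in range(n) if j != champion):
        return champion
    return None


P = [
    [0.5, 0.6, 0.4, 0.3],
    [0.4, 0.5, 0.3, 0.7],
    [0.6, 0.7, 0.5, 0.8],
    [0.7, 0.3, 0.2, 0.5],
]
print(condorcet_winner(P))  # 2
```

In the stochastic setting studied in the abstract the matrix is unknown and each duel is a noisy sample, which is why reaching $O(N)$ comparisons with high confidence is the hard part.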

2606.01790 2026-06-02 cs.CV cs.AI

STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models

Yuhang Han, Wenzheng Yang, Yujie Chen, Xiangqi Jin, Yaojie Zhang, Siteng Huang, Linfeng Zhang

Affiliations: EPIC Lab, SJTU; HKUST (GZ); The University of Sydney; UESTC; ZJU

AI summary: Proposes STaR-KV, a training-free KV cache compression framework that adaptively calibrates token importance along three axes (subspace-aware scoring, a temporal stability discount, and an entropy-driven temperature), achieving high accuracy on GUI tasks and cutting peak GPU memory by nearly 40%.

Abstract

Vision-language-model-based graphical user interface (GUI) agents have shown broad automation capabilities, yet deployment is bottlenecked by a key-value (KV) cache that grows linearly with interaction steps. For instance, UI-TARS-1.5-7B consumes 76 GB of GPU memory on merely five screenshots, approaching the capacity of mainstream 80 GB accelerators. Existing KV compression methods share two structural assumptions: aggregating visual-token importance into a single shared saliency map, and applying a fixed top-B cutoff to the fused score distribution. Pilot measurements refute both: spatial specialization lives at the attention-subspace level and migrates across layers, while the score distribution drifts in shape along a trajectory. We propose STaR-KV (Spatio-Temporal Adaptive Re-weighting), a training-free KV cache compression framework that calibrates token importance along three axes: (i) subspace-aware scoring driven by online spatial mutual information; (ii) a temporal stability discount that suppresses redundant cache entries from persistently attended subspaces; and (iii) an entropy-derived temperature that adaptively reshapes the score distribution. Across four GUI benchmarks, STaR-KV achieves the strongest average accuracy among state-of-the-art KV compression methods (e.g., GUIKV, SnapKV) at matched budgets, with no compression-stage FLOPs overhead (-0.07%) and cutting peak GPU memory by nearly 40% at a 20% KV-cache budget. Code is available at https://github.com/kawhiiiileo/STaR-KV.
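
To make the linear growth of the KV cache concrete, the sketch below computes the size of the key and value tensors for a given number of cached tokens and shows the effect of a 20% token budget. The model dimensions are hypothetical placeholders, not the configuration of UI-TARS-1.5-7B, and the calculation covers only the KV tensors (not weights or activations), so it does not reproduce the 76 GB or 40% figures above.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of the KV cache: keys and values per layer, KV head, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def tokens_kept(num_tokens, budget_percent):
    """Number of cached tokens retained under a percentage budget."""
    return num_tokens * budget_percent // 100


# Hypothetical configuration: 28 layers, 4 KV heads, head dimension 128, fp16.
full = kv_cache_bytes(10_000, 28, 4, 128)
kept = kv_cache_bytes(tokens_kept(10_000, 20), 28, 4, 128)
print(full, kept)  # 573440000 114688000
```

Because the size is proportional to the token count, every additional screenshot adds a fixed number of bytes, and any compression scheme that keeps a fraction of the tokens shrinks the cache by the same fraction.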

2606.01789 2026-06-02 cs.AI

Consistency evaluation of benchmarks used for causal discovery

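Yuzhe Zhang, Chihui Chen, Lina Yao, Chen Wang

Affiliations: Independent researcher; UNSW Australia; CSIRO Australia

AI summary: Proposes a pipeline that automatically retrieves papers and uses large language models to check the consistency of benchmark causal graphs with domain research; evaluating 11 popular benchmarks shows that their consistency varies significantly.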

Abstract

In graphical causal models, causal discovery aims to construct a causal graph from numerical data and domain knowledge in plain text. However, evaluating causal discovery methods remains a challenge because progress in domain research often leaves benchmark causal graphs containing misaligned knowledge. This problem especially affects the evaluation of large language model (LLM) based causal discovery methods, as they are sensitive to new discoveries in the literature. This work is the first to systematically study the quality of benchmark causal graphs. Specifically, we design a pipeline that automatically retrieves relevant research papers from scientific databases and prompts LLMs to check the consistency between the benchmark causal graphs and domain research papers. We evaluate 11 popular real-world benchmarks, for which our pipeline processes a total of 38,081 domain papers. Our results show that popular benchmarks vary significantly in their consistency with domain research, with clear implications for causal discovery research.

2606.01788 2026-06-02 cs.CV

PlatonicNav: Unveiling Semantic Correspondence in Navigation with Platonic Topological Maps

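Junlin Long, Zeyu Zhang, Xu Deng, Yiran Wang, Yue Yang, Luke Borgnolo, Maxwell Twelftree, Yang Zhao

Affiliations: USYD; Maincode; UNSW; La Trobe

AI summary: Proposes PlatonicNav, a framework that builds a Platonic Topological Map from a self-supervised visual encoder and unifies visual object-goal navigation, cross-modal object-goal navigation, and vision-and-language navigation without cross-modal training.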

Abstract

Embodied visual navigation, where an agent perceives a complex environment and acts to reach a goal from raw sensory input, underpins a wide range of applications such as household service robotics, assistive robotics, and large-scale autonomous exploration. However, recent attempts to unify vision-and-language navigation (VLN) and object goal navigation (ObjNav) remain at the level of architectural fusion, mixed-task training, and large vision-language pretraining, without examining whether independently trained vision and language encoders may already share a common semantic structure. Moreover, even object-centric topological maps still ground language goals through explicit cross-modal supervision such as CLIP or large vision-language models, leaving open whether such grounding is possible from a purely vision-built map. To address these challenges, we extend the Platonic Representation Hypothesis to embodied navigation and recast vision-only ObjNav, cross-modal ObjNav, and VLN as three different interfaces to the same object-centric semantic manifold. We further introduce PlatonicNav, a training-free framework whose Platonic Topological Map fuses geometric and semantic node distances from a self-supervised visual encoder, and grounds language goals via blind matching without any paired vision-language data. Extensive experiments on simulation benchmarks including HM3D-IIN, OVON, and R2R-CE on MP3D, together with deployment on Unitree Go2, demonstrate that PlatonicNav generalizes across tasks, modalities, and embodiments without explicit cross-modal training. Code: https://github.com/AIGeeksGroup/PlatonicNav. Website: https://aigeeksgroup.github.io/PlatonicNav.

2606.01787 2026-06-02 cs.AI math.OC

Stochastic convergence of parallel asynchronous adaptive first-order methods

Serge Gratton, Philippe L. Toint

Affiliations: Université de Toulouse, INP, IRIT, Toulouse, France; 3IA Artificial and Natural Intelligence Toulouse Institute (ANITI); NAXYS, University of Namur, Namur, Belgium

AI summary: Introduces a new class of asynchronous adaptive first-order optimization methods, including asynchronous variants of several popular algorithms, and analyzes their stochastic convergence on non-convex functions, obtaining an $O(1/\sqrt{t})$ rate up to logarithmic factors.

Abstract

A new class of asynchronous adaptive first-order optimization methods is introduced, comprising asynchronous variants of several popular algorithms. Versions of these methods using momentum and/or inexact normalization are also considered. The convergence of methods in the class on non-convex functions is analyzed in a fully stochastic setting, and is shown to be (up to logarithmic factors) of order $O(1/\sqrt{t})$ under reasonable assumptions. Numerical experiments suggest that such asynchronous adaptive algorithms are very relevant in heterogeneous large-scale machine learning systems.

2606.01781 2026-06-02 cs.AI

Structure-Guided Adaptive Propagation for Protein-Protein Interaction Site Prediction

Enqiang Zhu, Yizi Liu, Yilong Luo, Yao Chen, Yu Zhang, Baoshan Ma

Affiliations: Institute of Computing Science and Technology, Guangzhou University; School of Computer Science, Peking University; Information Science & Technology Department, Beijing Capital International Airport Co., Ltd.; School of Information Science and Technology, Dalian Maritime University

AI summary: Proposes SGAP-PPIS, which uses multi-scale geometric states from an equivariant graph neural network to generate residue-wise propagation coefficients for adaptive information diffusion, achieving competitive performance on Test_60.

Comments 9 pages, 3 figures

Abstract

Accurate prediction of protein-protein interaction sites (PPIS) is essential for understanding cellular processes, disease mechanisms, and therapeutic target discovery. Graph-based deep learning has advanced PPIS prediction by incorporating residue-level structural context. However, most graph-based models still rely on fixed propagation schemes that treat all residues similarly, despite the structural and functional heterogeneity of protein interfaces. Such propagation may limit the ability to adapt information diffusion to local geometric environments, making it difficult to distinguish true interaction sites from structurally similar non-interacting neighbors. We present SGAP-PPIS, a structure-guided adaptive propagation model for PPIS prediction. Rather than using a fixed propagation mechanism, SGAP-PPIS leverages multi-scale geometric states from an equivariant graph neural network to generate residue-wise propagation coefficients. This design allows each residue to adaptively balance local feature preservation and neighborhood diffusion according to its geometric microenvironment. Experimental results show that SGAP-PPIS achieves competitive performance among the state-of-the-art methods on Test_60. Ablation studies show that geometry-conditioned adaptive propagation, scale-aligned geometric guidance, and multi-step propagation-state representation jointly drive these improvements.
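
One simple way to picture residue-wise propagation coefficients is a per-node blend between a residue's own features and the mean of its neighbors' features. The sketch below is an assumed form chosen for illustration, not the SGAP-PPIS update rule: in the paper the coefficients come from an equivariant graph neural network, whereas here `alpha` is simply passed in. All names are hypothetical.

```python
def adaptive_propagate(features, neighbors, alpha):
    """One propagation step with a separate mixing coefficient per node.

    features[i] is the feature vector of residue i, neighbors[i] lists the
    indices of its graph neighbors, and alpha[i] in [0, 1] is the weight on
    the residue's own features (1.0 keeps them unchanged, 0.0 replaces them
    with the neighborhood mean).
    """
    updated = []
    for i, own in enumerate(features):
        nbrs = neighbors[i]
        if not nbrs:
            updated.append(list(own))  # isolated node: nothing to diffuse
            continue
        mean = [sum(features[j][k] for j in nbrs) / len(nbrs) for k in range(len(own))]
        updated.append([alpha[i] * x + (1 - alpha[i]) * m for x, m in zip(own, mean)])
    return updated


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = [[1, 2], [0], [0, 1]]
print(adaptive_propagate(features, neighbors, alpha=[1.0, 0.0, 0.5]))
# [[1.0, 0.0], [1.0, 0.0], [0.75, 0.75]]
```

A fixed propagation scheme corresponds to using the same `alpha` for every residue; letting it vary per residue is what allows interface and non-interface residues to diffuse information differently.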

2606.01779 2026-06-02 cs.CL

HarnessForge: Joint Harness and Policy Evolution for Adaptive Agent Systems

Mingju Chen, Can Lv, Guibin Zhang, Heng Chang, Shiji Zhou

Affiliations: Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, School of Artificial Intelligence, Beihang University; Tsinghua University

AI summary: Proposes HarnessForge, a meta-adaptive framework that adapts LLM agent systems as a whole through harness-policy co-evolution, improving performance across multiple benchmarks.

Comments 25 pages, 13 figures

Abstract

LLM agents are increasingly expected to operate across heterogeneous task regimes that require distinct execution paradigms. This challenges fixed agent systems and motivates system-level meta-adaptation beyond isolated component updates. While existing works have adapted the external harness or trained the underlying reasoning policies, full-system adaptation remains insufficiently characterized. The adaptation space between structure and execution is rarely made explicit, and the compatibility between the external harness and the internal reasoner is not optimized jointly. We propose HarnessForge, a meta-adaptive framework for evolving LLM agent systems. HarnessForge formulates an agent system as a harness-policy pair, defining a stable adaptation space that separates harness-level execution structure from policy-level reasoning behavior. It then performs harness-policy co-evolution through fault-guided harness tailoring and harness-conditioned policy alignment. Experiments across five benchmarks from diverse domains show that HarnessForge consistently improves both Qwen3-4B and Qwen3-8B backbones, outperforming harness-only and policy-only baselines with gains of up to 12.0% over the strongest baseline and achieving favorable rollout-efficiency tradeoffs. These results demonstrate that harness-policy co-evolution is effective and that executable compatibility between the harness and the reasoning policy is essential for agent-system adaptation. The code is available at https://github.com/mingju-c/HarnessForge.

2606.01777 2026-06-02 cs.RO

Trans2Occ: Voxel Occupancy Estimation and Grasp for Transparent Objects from Simulation to Reality

Yixuan Yang, Sha Zhang, Rui Li, Zhenfei Yin, Xinzhu Ma, Yiran Qin, Lei Bai, Xudong Xu, Shilin Shan, Wangmeng Zuo, Yanyong Zhang, Wanli Ouyang, Feng Zheng, Shixiang Tang, Dongzhan Zhou

Affiliations: Shanghai AI Laboratory; SUSTech; CUHK; Harbin Institute of Technology; University of Oxford; Beihang University; Nanyang Technological University; University of Science and Technology of China

AI summary: Proposes a voxel occupancy prediction framework based on single-view RGB input, combined with simulated data generation and a rule-based grasping strategy, for robust 3D perception and manipulation of transparent objects.

详情
AI中文摘要

透明物体由于折射和反射导致的深度感知不可靠,对机器人感知构成挑战。先前的方法依赖多视图重建或深度补全,但往往难以在真实机器人系统中扩展或部署。本文提出一个基于单视图RGB输入的透明物体感知与操作实用框架。我们的方法直接从单张图像预测体素空间占用,提供支持下游机器人抓取的几何感知表示。为实现大规模训练,我们构建了一个仿真流水线,在不同材质和光照条件下生成配对的RGB图像和体素占用标注。我们证明预测的占用表示对领域偏移具有鲁棒性,并能从仿真有效迁移到真实机器人设置,无需微调。基于占用构建的简单规则抓取策略进一步实现了透明物体的可靠抓取性能。在仿真和真实环境中的大量实验表明,我们的框架提供了准确的3D理解,并实现了透明物体的实用操作。这些结果表明,单视图占用预测为机器人中的透明物体感知提供了一种可扩展且有效的解决方案。

英文摘要

Transparent objects remain challenging for robotic perception due to unreliable depth sensing caused by refraction and reflection. While prior approaches rely on multi-view reconstruction or depth completion, they are often difficult to scale or deploy in real-world robotic systems. In this paper, we present a practical framework for transparent object perception and manipulation based on single-view RGB input. Our approach predicts voxel-space occupancy directly from a single image, providing a geometry-aware representation that supports downstream robotic grasping. To enable large-scale training, we construct a simulation pipeline that generates paired RGB images and voxel occupancy annotations under diverse materials and lighting conditions. We demonstrate that the predicted occupancy representation is robust to domain shifts and transfers effectively from simulation to real-world robotic setups without fine-tuning. A simple rule-based grasping strategy built on top of the occupancy further achieves reliable grasp performance on transparent objects. Extensive experiments in both simulation and real-world environments show that our framework provides accurate 3D understanding and enables practical manipulation of transparent objects. These results suggest that single-view occupancy prediction offers a scalable and effective solution for transparent object perception in robotics.

2606.01774 2026-06-02 cs.LG cs.AI

FLARE: Diffusion for Hybrid Language Model

FLARE: 混合语言模型的扩散方法

Yuchen Zhu, Jing Shi, Chongjian Ge, Hao Tan, Yiran Xu, Wanrong Zhu, Jason Kuen, Koustava Goswami, Rajiv Jain, Yongxin Chen, Molei Tao, Jiuxiang Gu

发表机构 * Adobe Research(Adobe研究院) Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出FLARE框架,通过结合自回归和扩散目标、硬件感知内核和统一推理,将混合注意力LLM转换为支持并行解码的扩散模型,在保持能力的同时提升吞吐量。

详情
AI中文摘要

自回归(AR)大型语言模型(LLM)已取得广泛的实际成功,但顺序解码仍然是低延迟部署的关键瓶颈。近期的高效推理工作沿着两个方向推进:通过高效架构降低每次模型调用的成本,以及通过并行生成减少串行解码步骤。混合注意力骨干解决了前者,而扩散语言模型(dLLM)通过迭代并行去噪追求后者。结合这些优势仍然具有挑战性:AR到dLLM的转换通常无法保留种子检查点的能力,并且混合注意力循环状态和掩码约束使得扩散训练和服务变得复杂。我们提出了FLARE,一个针对混合注意力LLM的系统转换框架。我们的分析确定迁移数据质量是能力保留的主要决定因素,其重要性超过损失公式和注意力掩码设计。最终框架结合了token等价的AR和扩散目标、硬件感知内核以及统一推理,使得一个检查点能够同时支持AR风格的验证解码和扩散风格的并行去噪。从强大的AR检查点出发,使用有限的训练后数据,FLARE在模型规模上与领先的开源dLLM竞争,并在单GPU并发服务中相比开源dLLM基线实现了持续的吞吐量提升。我们的结果进一步表明,实际dLLM不仅受限于解码算法,还受限于迁移数据质量和当前块扩散目标的训练低效性,这促使我们联合设计数据、目标、架构和推理系统。

英文摘要

Autoregressive (AR) large language models (LLMs) have achieved broad practical success, but sequential decoding remains a key bottleneck for low-latency deployment. Recent efficient-inference work has progressed along two axes: reducing the cost of each model invocation through efficient architectures, and reducing serial decoding steps through parallel generation. Hybrid attention backbones address the former, while diffusion language models (dLLMs) pursue the latter via iterative parallel denoising. Combining these advantages remains challenging: AR-to-dLLM conversion often fails to preserve seed-checkpoint capability, and hybrid-attention recurrent states and masking constraints make diffusion training and serving nontrivial. We present FLARE, a systematic conversion framework for hybrid-attention LLMs. Our analysis identifies transfer data quality as the primary determinant of capability preservation, outweighing loss formulation and attention-mask design. The resulting framework combines a token-equal AR-and-diffusion objective, hardware-aware kernels, and unified inference, enabling one checkpoint to support both AR-style verified decoding and diffusion-style parallel denoising. Starting from strong AR checkpoints with limited post-training data, FLARE is competitive with leading open-source dLLMs across model scales and delivers consistent throughput gains over open-source dLLM baselines in single-GPU concurrent serving. Our results further suggest that practical dLLMs are limited not only by decoding algorithms, but also by transfer data quality and the training inefficiency of current block-diffusion objectives, motivating joint design of data, objectives, architectures, and inference systems.

2606.01757 2026-06-02 cs.CV

PillarDETR: YOLO-Backbone and RT-DETR Head for Real-Time 3D Object Detection

PillarDETR:基于YOLO骨干和RT-DETR头的实时3D目标检测

Smit Kadvani, Shriya Gumber, Kriti Faujdar, Harsh Dave

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出PillarDETR架构,结合YOLOv8的CSP骨干和RT-DETR解码器,实现无需NMS的端到端实时3D目标检测,在KITTI和nuScenes上取得精度与速度的良好平衡。

Comments 6 pages, 1 figure, 8 tables

详情
AI中文摘要

实时3D目标检测是自动驾驶系统和机器人安全运行的关键组成部分。虽然LiDAR点云提供准确的空间信息,但高效处理它们仍然是一个重大挑战。传统方法依赖于复杂的3D卷积或基于锚点的范式,难以平衡检测精度与推理速度。在本文中,我们提出PillarDETR,一种新颖的端到端3D目标检测架构,它将基于柱体的LiDAR编码的效率与现代2D视觉模型的表示能力相结合。具体来说,PillarDETR用源自YOLOv8的跨阶段局部(CSP)网络替代标准卷积骨干,从而能够从伪图像中提取更丰富的特征。此外,我们摒弃了传统的基于锚点或基于中心的检测头,转而采用实时检测Transformer(RT-DETR)解码器。这种混合设计使网络能够捕获全局上下文并直接预测3D边界框,而无需依赖非极大值抑制(NMS)。在KITTI和nuScenes基准上的大量实验表明,PillarDETR在平均精度(mAP)和推理延迟之间实现了令人信服的权衡。我们的消融研究证实,集成YOLOv8骨干和RT-DETR头相比PointPillars基线带来了显著改进,使PillarDETR成为实时3D感知的高效解决方案。

英文摘要

Real-time 3D object detection is a critical component for the safe operation of autonomous driving systems and robotics. While LiDAR point clouds provide accurate spatial information, processing them efficiently remains a significant challenge. Traditional methods rely on complex 3D convolutions or anchor-based paradigms that struggle to balance detection accuracy with inference speed. In this paper, we propose PillarDETR, a novel end-to-end 3D object detection architecture that combines the efficiency of pillar-based LiDAR encoding with the representational power of modern 2D vision models. Specifically, PillarDETR replaces standard convolutional backbones with a Cross Stage Partial (CSP) network derived from YOLOv8, enabling richer feature extraction from pseudoimages. Furthermore, we discard conventional anchor-based or center-based detection heads in favor of a Real-Time Detection Transformer (RT-DETR) decoder. This hybrid design allows the network to capture global context and directly predict 3D bounding boxes without relying on non-maximum suppression (NMS). Extensive experiments on the KITTI and nuScenes benchmarks demonstrate that PillarDETR achieves a compelling trade-off between mean Average Precision (mAP) and inference latency. Our ablation studies confirm that integrating the YOLOv8 backbone and RT-DETR head yields substantial improvements over the PointPillars baseline, establishing PillarDETR as a highly effective solution for real-time 3D perception.

2606.01756 2026-06-02 cs.CV

EvoCut: Multi-Layer Evolution-Aware Visual Token Compression for Efficient Large Vision-Language Models

EvoCut:面向高效大型视觉语言模型的多层演化感知视觉标记压缩

Hongyu Lu, Feng Zhang, Wenwei Jin, Huanling Hu, Pengfei Zhang, Yao Hu, Jiawei Li, Shikai Jiang

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Xiaohongshu(小红书) Fudan University(复旦大学)

AI总结 提出一种无需训练和注意力的视觉标记压缩方法EvoCut,通过分析多层演化偏差估计标记重要性,在LLaVA-1.5-7B上仅保留11.1%的视觉标记即可保持94.4%的平均性能。

Comments Preprint. 12 pages, 6 figures, 7 tables

详情
AI中文摘要

大型视觉语言模型(LVLMs)在图像和视频理解任务上取得了强大性能,但其推理效率受到视觉编码器产生的大量视觉标记的限制。现有大多数视觉标记压缩方法从特定层的注意力分数或表示属性估计标记重要性,忽略了视觉标记在视觉编码器中的演化过程。这种逐层标准可能提供不完整的重要性估计,并限制压缩后的性能保持。为解决此问题,我们分析了逐层视觉标记演化方向,并观察到标记在视觉编码器各层形成多个组演化方向。进一步分析表明,信息性标记往往表现出与共同组演化方向的持续偏离。基于这一观察,我们提出了EvoCut,一种无需训练和注意力的视觉标记压缩方法,通过多层演化偏差估计标记重要性。实验结果表明,EvoCut在LLaVA-1.5-7B上仅保留11.1%的视觉标记即可保持94.4%的平均性能,展示了其在平衡效率和准确性方面的有效性。

英文摘要

Large vision-language models (LVLMs) achieve strong performance on image and video understanding tasks, but their inference efficiency is constrained by the large number of visual tokens produced by vision encoders. Most existing visual token compression methods estimate token importance from attention scores or representation properties at specific layers, overlooking how visual tokens evolve across the vision encoder. Such layer-specific criteria may provide incomplete importance estimates and limit performance preservation after compression. To address this issue, we analyze layer-wise visual token evolution directions and observe that tokens form multiple group evolution directions across vision-encoder layers. Our analysis further shows that informative tokens tend to exhibit persistent deviations from common group evolution directions. Based on this observation, we propose EvoCut, a training-free and attention-free visual token compression method that estimates token importance from multi-layer evolution deviation. Experimental results show that EvoCut can retain only 11.1% of the visual tokens on LLaVA-1.5-7B while preserving 94.4% of the average performance, demonstrating its effectiveness in balancing efficiency and accuracy.

2606.01755 2026-06-02 cs.AI cs.CL

TriAlign: Towards Universal Truth Consistency in Personalized LLM Alignment

TriAlign: 迈向个性化大语言模型对齐中的通用真值一致性

Thi-Nhung Nguyen, Linhao Luo, Rollin Omari, Junae Kim, Thuy-Trang Vu, Dinh Phung

发表机构 * Department of Data Science & AI, Monash University(数据科学与人工智能系,蒙纳士大学) Defence Science and Technology Group, Australia(澳大利亚国防科学与技术集团)

AI总结 针对个性化大语言模型在不同社会群体间存在的通用真值不一致问题,提出TriAlign框架,通过离线多智能体强化学习联合优化真值准确性、跨群体一致性和个性化,实现公平对齐。

详情
AI中文摘要

个性化大语言模型根据用户的偏好和社会属性调整响应,但可能在不同社会群体间引入显著的通用真值不一致性,即某些群体在客观任务上系统性地获得较不准确的响应。现有的对齐方法要么忽略个性化,要么主要关注主观偏好对齐,很大程度上忽视了通用真值的公平性和一致性。为填补这一空白,我们研究了真值不变对齐(TIA),这是一个针对个性化LLM的对齐问题,旨在确保通用真值在不同社会群体间保持一致,同时保留个性化。我们提出TriAlign,这是首个用于TIA的离线多智能体强化学习(MARL)框架,其中每个社会群体被建模为一个交互的智能体。TriAlign通过一个公平感知目标和一个显式的不一致性惩罚,联合优化通用真值准确性、跨群体真值一致性和个性化。跨多个基准的实验表明,TriAlign在这三个目标之间实现了比强基线更强的平衡,减少了跨社会群体的通用真值差异,同时提高了客观任务性能和个性化质量。

英文摘要

Personalized large language models adapt responses to users' preferences and social attributes, but can introduce substantial universal truth inconsistencies across social groups, where some groups systematically receive less accurate responses on objective tasks. Existing alignment methods either ignore personalization or mainly focus on subjective preference alignment, largely overlooking fairness and consistency in universal truths. To address this gap, we study Truth-Invariant Alignment (TIA), an alignment problem for personalized LLMs that aims to ensure universal truths remain consistent across social groups while preserving personalization. We propose TriAlign, the first offline multi-agent reinforcement learning (MARL) framework for TIA, where each social group is modeled as an interacting agent. TriAlign jointly optimizes universal truth accuracy, cross-group truth consistency, and personalization through a fairness-aware objective and an explicit inconsistency penalty. Experiments across diverse benchmarks demonstrate that TriAlign achieves a stronger balance among these three objectives than strong baselines, reducing universal truth disparities across social groups while improving both objective task performance and personalization quality.

2606.01753 2026-06-02 cs.CV

Quality-Guided Semi-Supervised Learning for Medical Image Segmentation

质量引导的半监督学习用于医学图像分割

Kumar Abhishek, Ghassan Hamarneh

发表机构 * School of Computing Science, Simon Fraser University, Canada(Simon Fraser大学计算机科学学院)

AI总结 提出一种质量引导的半监督学习框架,通过专用网络估计分割质量,并利用质量感知正则化和伪标签重加权提升医学图像分割性能。

Comments Early Accept at MICCAI 2026, 13 pages, 2 figures

详情
AI中文摘要

训练准确的医学图像分割模型需要大量密集标注的数据,这既昂贵又耗时。半监督学习通过从大量未标注数据和少量标注数据中学习来缓解这一问题。然而,大多数现代半监督学习方法依赖未标注数据的伪标签,并通常通过模型置信度或不确定性来评估其可靠性,这些度量是自我指涉的,缺乏对分割质量的明确基础。相反,我们提出了一种质量引导的半监督学习框架,训练一个专用网络从图像-掩膜对中估计分割质量。该预测器在通过合成损坏生成的变质量掩膜上进行训练,这些损坏结合了部分训练分割模型产生的不完美输出,捕捉训练中遇到的真实错误模式。我们通过两种互补机制将质量预测器集成到半监督学习中:质量感知正则化损失和基于质量的伪标签样本重新加权方案。我们表明,我们的方法可以作为现有半监督学习框架的即插即用增强。在五个数据集和多种架构上的大量实验表明,与竞争性的半监督学习方法相比,我们的方法取得了一致的改进,推进了半监督医学图像分割的最新水平。

英文摘要

Training accurate medical image segmentation models requires large amounts of densely annotated data, which is costly and time-consuming to obtain. Semi-supervised learning (SSL) alleviates this by learning from both abundant unlabeled data and limited labeled data. However, most modern SSL methods rely on pseudolabels for unlabeled data, and typically assess their reliability through model confidence or uncertainty, measures that are self-referential and lack explicit grounding in segmentation quality. Instead, we propose a quality-guided SSL framework that trains a dedicated network to estimate segmentation quality from image-mask pairs. The predictor is trained on variable-quality masks generated through synthetic corruptions augmented with imperfect outputs from partially trained segmentation models, capturing realistic error patterns encountered during training. We integrate the quality predictor into SSL through two complementary mechanisms: a quality-aware regularization loss and a quality-based pseudolabel sample reweighting scheme. We show that our method serves as a drop-in enhancement to existing SSL frameworks. Extensive experiments across five datasets and multiple architectures demonstrate consistent improvements over competing SSL methods, advancing the state-of-the-art in semi-supervised medical image segmentation.

2606.01747 2026-06-02 cs.CL cs.AI

Construction of Historical Knowledge Graphs Based on BERT and Graph Neural Networks

基于BERT和图神经网络的历史知识图谱构建

Ping Li, Bartlomiej Brzozka

发表机构 * Shandong Management University(山东管理大学) Maria Curie-Sklodowska University(玛丽·居里-斯洛多夫斯卡大学)

AI总结 本文提出结合BERT和图神经网络的高层架构,从历史文本中提取实体和关系,构建知识图谱,在精度、召回率和F1分数上优于传统方法和深度学习基线。

Comments 9 pages, 4 figures

详情
AI中文摘要

通过数字人文研究和规模化历史数据分析,大量传统历史文本被转换为结构化知识图谱。本文提出一种结合双向编码器表示(BERT)和图神经网络(GNN)的高层架构,用于从各类历史文本中提取实体和关系。传统历史文本系统地解决了语言歧义、上下文限制的引用以及缺乏既定语法规范的问题。本研究根据上述建议,开发了一种基于FastRQNet和预训练视觉-语言模型Vilt-qaformer+RoBInet的新型图像检索系统。实验充分利用了市政记录、议会文件和历史信函的全面数据集。与传统基于规则的技术和其他流行的深度学习基线相比,联合BERT-GNN系统获得了更高的精度、召回率和F1分数(表2)。该结构在创建知识图谱时能够以足够的准确性和全面性处理复杂的嵌套结构和隐式引用问题。上述实验表明,将关系图学习算法与上下文敏感的语义表示技术相结合,可以自动提取历史数据,为知识库积累累积的智慧。

英文摘要

Through digital humanities research and large-scale historical data analysis, a significant amount of traditional historical text is converted into structured knowledge graphs. This paper provides a high-level architecture that combines Bidirectional Encoder Representations from Transformers (BERT) and graph neural networks (GNNs) to extract entities and relationships from various types of historical texts. The texts of traditional history resolve linguistic ambiguities, references limited by context, and a lack of established grammatical norms in a systematic way. This study develops a new image retrieval system based on FastRQNet and pre-trained vision-language model Vilt-qaformer+RoBInet in accordance with the aforementioned recommendations. The experiments make full use of a comprehensive collection of municipal records, parliamentary documents, and historical correspondence. When compared to conventional rule-based techniques and other popular deep-learning baselines, the joint BERT-GNN system obtains greater Precision, Recall, and F1-score (Table 2). Complex nested structures and implicit reference issues can be handled by this structure with sufficient accuracy and thoroughness when creating knowledge graphs. The aforementioned experiments show that combining relational graph learning algorithms with context-sensitive semantic representation techniques can automatically extract historical data to add accumulated wisdom to the knowledge repository.

2606.01746 2026-06-02 cs.CV cs.LG

Sensitivity as a Double-Edged Sword: A Trade-off Between Discriminability and Adversarial Robustness

敏感性是一把双刃剑:判别性与对抗鲁棒性之间的权衡

Kai Wang

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文发现全连接分类器的高敏感性带来判别性但也导致脆弱性,而ℓ2距离分类器的不敏感性带来鲁棒性但限制性能,为此提出基于混合原型混合框架的ℓ2重分类器,通过融合稳定原型和动态原型实现判别性与鲁棒性的平衡,并设计混合替代攻击评估协议。

Comments 13 pages including references, 4 figures

详情
AI中文摘要

现代神经网络极易受到对抗性扰动的影响。在这项工作中,我们指出这种脆弱性部分源于广泛使用的全连接分类器对此类扰动的敏感性。相比之下,简单的基于ℓ2距离的分类器表现出显著更强的鲁棒性。我们提供了充分的理论和实证分析,表明全连接分类器的高敏感性使其具有判别性,但也使其脆弱;相反,ℓ2分类器的不敏感性赋予了鲁棒性但限制了性能。受这种权衡的启发,我们提出了一种基于混合原型混合框架的新型ℓ2重分类器。该方法保留了全连接分类器的判别能力,同时利用了ℓ2距离的鲁棒性。它通过融合两种原型类型来产生基于ℓ2距离的预测:(1)通过指数移动平均更新的稳定数据集级原型,以及(2)使用直通估计器从全连接分类器预测生成的动态批量级原型。然而,这种基于直通估计器的动态架构给评估带来了重大挑战,例如梯度混淆和前向不连续性。为了解决这个问题,我们提出了一种新的严格评估协议——混合替代攻击,该协议使用多个替代模型以及强大的AutoAttack,以确保公平和稳健的评估。大量实验表明,我们的轻量级即插即用模块只需极少的微调,就能有效增强各种现有最先进对抗训练模型的对抗鲁棒性。

英文摘要

Modern neural networks are highly susceptible to adversarial perturbations. In this work, we identify that part of this vulnerability stems from the sensitivity of the widely used fully connected (FC) classifiers to such perturbations. In contrast, simple $\ell_2$ distance-based classifiers exhibit significantly greater robustness. We provide thorough theoretical and empirical analysis showing that while FC classifiers' high sensitivity makes them discriminative, it also makes them vulnerable. Conversely, $\ell_2$-classifiers' insensitivity grants robustness but limits performance. Motivated by this trade-off, we propose a novel $\ell_2$-reclassifier based on a Hybrid Prototype Mixing (HPM) framework. This method retains the discriminative power of FC classifiers while leveraging the robustness of $\ell_2$ distance. It yields $\ell_2$-distance-based predictions by fusing two prototype types: (1) stable, dataset-level prototypes updated via an exponential moving average (EMA), and (2) dynamic, batch-level prototypes generated from the FC classifier's predictions using a Straight-Through Estimator (STE). However, this dynamic, STE-based architecture introduces significant challenges for evaluation, such as gradient obfuscation and forward discontinuity. To address this, we propose a new, rigorous evaluation protocol, the Mixed Surrogate Attack (MSA), which uses multiple surrogates along with the powerful AutoAttack to ensure a fair and robust assessment. Extensive experiments demonstrate that our lightweight, plug-and-play module, with minimal fine-tuning, effectively enhances the adversarial robustness of various existing SOTA adversarially trained models.
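The prototype fusion described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the helper names `ema_update`, `mix`, and `predict`, the momentum of 0.9, the equal mixing weight, and the toy 2-D prototypes are all assumptions, and the Straight-Through Estimator that produces the dynamic prototypes is omitted (they are supplied here as plain numbers).

```python
def ema_update(prototype, batch_mean, momentum=0.9):
    """Exponential-moving-average update of a stable prototype."""
    return [momentum * p + (1.0 - momentum) * b
            for p, b in zip(prototype, batch_mean)]


def mix(stable, dynamic, alpha=0.5):
    """Convex combination of a stable and a dynamic prototype
    (the mixing rule and alpha are assumptions)."""
    return [alpha * s + (1.0 - alpha) * d for s, d in zip(stable, dynamic)]


def predict(feature, prototypes):
    """Index of the prototype with the smallest squared l2 distance."""
    dists = [sum((f - p) ** 2 for f, p in zip(feature, proto))
             for proto in prototypes]
    return dists.index(min(dists))


# Toy data: two classes, 2-D features (hypothetical values).
stable = [[0.0, 0.0], [4.0, 4.0]]
dynamic = [[1.0, 0.0], [4.0, 2.0]]
mixed = [mix(s, d) for s, d in zip(stable, dynamic)]
label = predict([1.0, 1.0], mixed)
updated = ema_update(stable[0], [1.0, 1.0])
```

In this toy setting the feature `[1.0, 1.0]` is assigned to class 0 because the mixed prototype of class 0 is closer in l2 distance than that of class 1.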

2606.01738 2026-06-02 cs.CL cs.AI

THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models

THRD:一种针对大语言模型越狱攻击的无训练多轮防御框架

Zhiqing Ma, Zhonghao Xu, Dong Yu, Chen Kang, Changliang Li, Pengyuan Liu

发表机构 * Beijing Language and Culture University(北京语言大学)

AI总结 提出无训练框架THRD,通过显式建模时间风险累积(包括逐轮风险评估、跨轮意图检测、响应评估和决策模块)防御多轮越狱攻击,将攻击成功率降至0.2-4.0%且模型效用损失小于1.5%。

详情
AI中文摘要

多轮越狱攻击通过利用对话动态(如逐步升级和跨轮协调)对LLM构成日益严重的威胁。现有防御要么依赖昂贵的重新训练(通常会降低模型效用),要么在每一轮独立应用单轮分析,无法捕捉风险沿交互轨迹的累积。我们观察到多轮交互中的安全行为是轨迹依赖的:对话历史不断重塑模型的调节上下文,使得孤立评估每一轮变得不足。基于这一洞察,我们提出THRD,这是第一个显式建模多轮越狱防御中时间风险累积的无训练框架。THRD集成了四个模块:用于即时风险评估的逐轮风险评估器(TRA)、用于跨轮意图升级检测的历史上下文分析器(HCA)、用于识别促进性输出的响应评估器(RE),以及通过带衰减调制和趋势感知调整的时间演化评分机制组合这些信号的决策模块。在两个目标模型上针对最先进的多轮攻击(包括基于树搜索和多智能体协作方法)的实验表明,THRD将攻击成功率降至0.2-4.0%,同时在MMLU和GSM8K上将模型效用退化控制在1.5%以内。消融研究证实了模块的非冗余贡献和稳定的跨架构泛化。对首次拒绝触发器的分析显示,超过70%的多轮攻击需要在第2轮或之后才能检测到,验证了显式时间聚合的必要性。

英文摘要

Multi-turn jailbreak attacks pose a growing threat to LLMs by exploiting conversational dynamics such as gradual escalation and cross-turn coordination. Existing defenses either rely on costly retraining, which often degrades model utility, or apply single-turn analysis independently at each turn, failing to capture how risk accumulates along interaction trajectories. We observe that safety behavior in multi-turn interaction is trajectory-dependent: dialogue history continuously reshapes the model's conditioning context, making it insufficient to evaluate each turn in isolation. Motivated by this insight, we present THRD, the first training-free framework that explicitly models temporal risk accumulation for multi-turn jailbreak defense. THRD integrates four modules: a Turn-level Risk Assessor (TRA) for instantaneous risk estimation, a Historical Context Analyzer (HCA) for cross-turn intent escalation detection, a Response Evaluator (RE) for identifying facilitative outputs, and a Decision Module that combines these signals through a time-evolving scoring mechanism with attenuation-based modulation and trend-aware adjustment. Experiments against state-of-the-art multi-turn attacks, including tree-search-based and multi-agent collaborative methods, across two target models show that THRD reduces the attack success rate (ASR) to 0.2-4.0% while preserving model utility within 1.5% degradation on MMLU and GSM8K. Ablation studies confirm non-redundant module contributions and stable cross-architecture generalization. Analysis of first rejection triggers reveals that over 70% of multi-turn attacks require Turn 2 or later to detect, validating the necessity of explicit temporal aggregation.
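To make the idea of temporal risk accumulation concrete, here is a minimal sketch under stated assumptions. The abstract does not give the scoring formula, so the exponential attenuation, the `decay` and `threshold` values, the function name `first_refusal_turn`, and the per-turn risk numbers are all hypothetical; the sketch only shows how turns that each look mild in isolation can cross a threshold once their risk is accumulated.

```python
def first_refusal_turn(turn_risks, decay=0.8, threshold=1.0):
    """Return the 1-based turn at which the accumulated risk first
    exceeds the threshold, or None if it never does."""
    accumulated = 0.0
    for turn, risk in enumerate(turn_risks, start=1):
        # Older evidence is attenuated before the new turn is added.
        accumulated = decay * accumulated + risk
        if accumulated > threshold:
            return turn
    return None


# Hypothetical per-turn risk scores for a gradually escalating dialogue.
escalating = [0.2, 0.3, 0.4, 0.5]
turn = first_refusal_turn(escalating)
```

No single score in `escalating` exceeds the threshold, yet the accumulated score does at the fourth turn, which is the kind of case a purely per-turn check would miss.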

2606.01737 2026-06-02 cs.AI

TrafficRAG: A Multimodal RAG Framework for Traffic Accident Liability Determination

TrafficRAG:用于交通事故责任认定的多模态RAG框架

Xu Li, Zedong Fu, Xinyi Li, Xun Han

发表机构 * Southwest Petroleum University(西南石油大学) Sichuan Police College(四川警察学院)

AI总结 提出TrafficRAG框架,通过视觉语言模型生成结构化描述、混合检索获取法规和案例、大语言模型融合多模态证据进行推理,实现自动化交通事故责任分析报告生成。

Comments 12 pages, 3 figures, accepted at ICANN 2026

详情
AI中文摘要

交通事故责任分析是智能交通和法律辅助中一项关键但具有挑战性的任务。现有方法通常存在效率低、主观判断和不一致的分析结果等问题。同时,大语言模型受到噪声视频输入和法律领域知识不足的限制。为了解决这些问题,本文提出了TrafficRAG,一个用于自动化交通事故分析和报告生成的多模态检索增强框架。具体来说,该框架首先采用视觉语言模型生成事故场景的结构化文本描述,作为准确的检索查询。基于这些文本查询,采用结合BM25稀疏检索和稠密嵌入检索的混合检索策略来获取相关交通法规和类似历史案例。最后,大语言模型整合检索到的法律知识和多模态事故证据进行综合推理,生成标准化、有法律依据的责任分析报告。大量实验表明,TrafficRAG始终优于基线方法,实现了77.32%的法律规范适配准确率、81.71%的事实忠实度以及5.48%的责任比例平均绝对误差。结果验证了通过检索增强将多模态事实证据与法律条款相结合,可以有效提高交通事故责任认定的可靠性和准确性。

英文摘要

Traffic accident liability analysis is a critical yet challenging task in intelligent transportation and legal assistance. Existing methods often suffer from low efficiency, subjective judgment, and inconsistent analysis results. Meanwhile, large language models are constrained by noisy video inputs and insufficient legal domain knowledge. To address these issues, this work presents TrafficRAG, a multimodal retrieval-augmented framework for automated traffic accident analysis and report generation. Specifically, the proposed framework first adopts a vision-language model to produce structured textual descriptions of accident scenarios, which serve as accurate retrieval queries. Based on these textual queries, a hybrid retrieval strategy integrating BM25 sparse retrieval and dense embedding retrieval is employed to fetch relevant traffic regulations and similar historical cases. Finally, the large language model incorporates retrieved legal knowledge and multimodal accident evidence for comprehensive reasoning, and generates standardized, legally grounded liability analysis reports. Extensive experiments show that TrafficRAG consistently outperforms baseline methods, achieving 77.32% Legal Norm Adaptation Accuracy, 81.71% Factual Faithfulness, and a Liability Ratio MAE of 5.48%. The results validate that integrating multimodal factual evidence with legal clauses via retrieval augmentation can effectively improve the reliability and accuracy of traffic accident liability determination.
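The hybrid retrieval step can be sketched as a fusion of two ranked lists, one from a sparse (BM25) retriever and one from a dense embedding retriever. The abstract does not say how the two result sets are combined, so reciprocal rank fusion is used here purely as an assumed stand-in; the function name, the constant `k=60`, and the document identifiers are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids.

    Each document receives 1 / (k + rank) from every list it appears in;
    documents are returned by descending fused score (ties broken by id).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of the sparse and dense retrievers for one query.
sparse_ranking = ["statute_12", "case_7", "statute_3"]
dense_ranking = ["case_7", "case_9", "statute_12"]
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

Rank-based fusion avoids having to put BM25 scores and embedding similarities on a common scale, which is one reason it is a common default when the score distributions of the two retrievers differ.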

2606.01734 2026-06-02 cs.CV cs.LG cs.RO

FlatVPR: Plug-and-play Geo-linear Residual Adapter for Geometric Rectification of Foundation Model Feature Manifolds

FlatVPR: 用于基础模型特征流形几何校正的即插即用地线性残差适配器

Rai Hisada, Kanji Tanaka

发表机构 * Fundamental Engineering for Knowledge-Based Society, Graduate School of Engineering, University of Fukui(知识社会基础工程,工程研究生院,福井大学)

AI总结 提出FlatVPR范式,通过可学习残差适配器和Pullback Flatness Loss抑制特征流形曲率,实现稀疏锚点下的线性插值重建,在NCLT数据集上显著提升视觉位置识别精度。

Comments 5 pages, 1 figure, technical report

详情
AI中文摘要

本文提出“FlatVPR”,一种新颖的几何校正范式,通过强制特征流形结构,使得两个相邻锚点 $\mathbf{z}_A$ 和 $\mathbf{z}_B$ 之间的任何描述符都可以通过线性插值 $\hat{\mathbf{z}}_{pseudo} = (1-t)\mathbf{z}_A + t\mathbf{z}_B$(其中 $t \in [0,1]$ 表示相对位置)精确重建,从而有效平衡视觉位置识别(VPR)中地图轻量化和定位精度之间的权衡。尽管最先进的基础模型(如DINOv2-ViT-S/14)提供了鲁棒的语义特征,但其潜在流形表现出显著的曲率,将物理空间中的均匀线性运动投影到特征空间中高度非线性的轨迹上,这阻碍了稀疏锚点条件下的可靠重建。为了实现上述基于插值的重建,我们对原始基础特征 $\mathbf{z}$ 引入残差变换 $\hat{\mathbf{z}} = \mathbf{z} + \text{Res}(\mathbf{z})$,其中 $\text{Res}(\cdot)$ 表示可学习的适配器。我们的方法通过数学上严谨的Pullback Flatness Loss显式抑制流形曲率,该损失最小化中间特征与连接相邻锚点的线性段之间的偏差,从而最小化流形的内在曲率。通过这种空间展平,地图构建被公式化为期望最大化(EM)框架,解耦为用于流形适应的连续M步和用于最优锚点选择准则的概念性E步。在NCLT数据集上的实验表明,即使在100米间隔的极端稀疏锚点和极端季节变化条件下,应用我们的适配器也能带来显著的性能提升。

英文摘要

This paper proposes "FlatVPR," a novel geometric rectification paradigm that effectively bridges the trade-off between map lightweightness and localization accuracy in visual place recognition (VPR) by enforcing a feature manifold structure where any descriptor between two adjacent anchors $\mathbf{z}_A$ and $\mathbf{z}_B$ can be accurately reconstructed via linear interpolation $\hat{\mathbf{z}}_{pseudo} = (1-t)\mathbf{z}_A + t\mathbf{z}_B$, where $t \in [0,1]$ denotes the relative position. While state-of-the-art foundation models such as DINOv2-ViT-S/14 provide robust semantic features, their latent manifolds exhibit prominent curvature, projecting uniform linear motion in physical space onto highly non-linear trajectories in the feature space, which hinders reliable reconstruction under sparse anchor conditions. To enable the aforementioned interpolation-based reconstruction, we introduce a residual transformation $\hat{\mathbf{z}} = \mathbf{z} + \text{Res}(\mathbf{z})$ to the raw foundation features $\mathbf{z}$, where $\text{Res}(\cdot)$ represents a learnable adapter. Our method explicitly suppresses manifold curvature using a mathematically grounded Pullback Flatness Loss that minimizes the deviation of intermediate features from the linear segment connecting adjacent anchors, thereby minimizing the intrinsic curvature of the manifold. Through this spatial flattening, map construction is formulated within an Expectation-Maximization (EM) framework, decoupled into a continuous M-step for manifold adaptation and a conceptual E-step for optimal anchor selection guidelines. Experiments on the NCLT dataset demonstrate that the application of our adapter leads to significant performance improvements even under extremely sparse anchor conditions with 100 m intervals and extreme seasonal changes.
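The interpolation formula and the flatness deviation in this abstract can be sketched as follows. This is an illustration rather than the authors' code: the function names `interpolate` and `flatness_deviation`, the squared Euclidean form of the deviation, and the toy 2-D descriptors are assumptions. The abstract states only that the loss minimizes the deviation of intermediate features from the linear segment between adjacent anchors.

```python
def interpolate(z_a, z_b, t):
    """Pseudo-descriptor (1 - t) * z_a + t * z_b for t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def flatness_deviation(z_a, z_b, z_mid, t):
    """Squared Euclidean distance between an intermediate descriptor and
    its interpolated pseudo-descriptor (assumed form of the penalty)."""
    z_pseudo = interpolate(z_a, z_b, t)
    return sum((m - p) ** 2 for m, p in zip(z_mid, z_pseudo))


# Two anchors and two candidate intermediate descriptors (toy values).
z_a, z_b = [0.0, 0.0], [2.0, 0.0]
on_segment = flatness_deviation(z_a, z_b, [1.0, 0.0], t=0.5)
off_segment = flatness_deviation(z_a, z_b, [1.0, 0.5], t=0.5)
```

A descriptor that lies exactly on the segment between the anchors incurs zero deviation, while one displaced from it is penalized; an adapter trained against such a penalty would be pushed to straighten the trajectory between anchors.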

2606.01725 2026-06-02 cs.AI cs.LG

Characterization of Multi-Model Agentic AI Systems on General Tasks via Trace-Driven Simulation

基于迹驱动仿真的通用任务多模型智能体AI系统特征分析

Donghwan Kim, Prakhar Singh, Younghoon Min, Jongryool Kim, Jongse Park, Kiwan Maeng

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学) SK Hynix(SK海力士) KAIST(韩国科学技术院)

AI总结 本文提出GAIATrace数据集和Vidur-Agent仿真器,通过迹驱动仿真分析多模型智能体AI系统在通用任务上的行为特征。

Comments 13 pages, 18 figures, 2 tables

详情
AI中文摘要

智能体AI通过迭代规划、工具使用和基于观察结果的推理来完成任务。尽管其流行,但其系统级行为仍然知之甚少,特别是对于复杂数据集和智能体架构——由于高度非确定性执行、高昂的评估成本以及对专有模型的有限可见性。本文提出了GAIATrace,这是两个最先进的智能体系统(MiroThinker和OWL)运行GAIA(一个由异构通用任务组成的基准测试)的首个token级迹数据集。与先前的迹数据集不同,GAIATrace捕获了完整的推理token、任务级结构以及每个主要参与LLM的活动,从而支持深入的系统研究。作为数据集的补充,我们提出了Vidur-Agent,一个迹驱动的仿真器,可以重放GAIATrace以在多种模拟环境中进行可重复、低成本的系统评估。利用这两个工件,我们描述了现代智能体系统如何处理通用任务以及各种系统设计选择如何塑造其行为,得出了若干独特的发现。

英文摘要

Agentic AI completes tasks through iterative planning, tool use, and reasoning based on observed outcomes. Despite its popularity, its system-level behavior remains poorly understood, particularly for complex datasets and agent architectures, owing to highly non-deterministic execution, prohibitive evaluation costs, and limited visibility into proprietary models. This paper presents GAIATrace, the first token-level trace dataset of two state-of-the-art agentic systems (MiroThinker and OWL) running GAIA, a benchmark composed of a heterogeneous mix of general-purpose tasks. Unlike prior trace datasets, GAIATrace captures full reasoning tokens, task-level structures, and the activities of every major participating LLM, enabling in-depth systems research. Complementing the dataset, we present Vidur-Agent, a trace-driven simulator that can replay GAIATrace to perform reproducible, low-cost system evaluation across diverse simulated environments. Using both artifacts, we characterize how modern agentic systems handle general tasks and how various system design choices shape their behavior, yielding several unique findings.

2606.01723 2026-06-02 cs.LG cs.AI

Shortcut to Nowhere: Demystifying Deep Spurious Regression

捷径通往虚无:揭秘深度虚假回归

Guanrong Xu, Jessica Li, Hao Wang, Yuzhe Yang

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Rutgers University(罗格斯大学) Yang AI Lab(杨人工智能实验室)

AI总结 针对连续预测中的虚假相关性,提出利用标签和特征空间中虚假属性的相似性来校准分布,从而提升模型在分布偏移下的泛化能力。

详情
AI中文摘要

现实世界中的回归常常存在捷径:在训练中与连续目标虚假相关的属性,在部署偏移下不可靠;使用此类捷径回归目标可能在测试时灾难性失败。现有关于虚假相关性的研究主要关注分类,其中标签是分类的且组是自然定义的。然而,许多现实任务需要连续预测,其中不存在硬标签边界或离散的组-标签对。我们将深度虚假回归(DSR)定义为从具有属性-标签混淆的回归数据中学习,处理连续虚假相关性,并在测试时泛化到所有属性-标签组合。受分类和回归捷径内在差异的启发,我们提出利用标签和特征空间中虚假属性之间的相似性,从而在跨属性校准标签和学习特征分布时考虑邻近目标和相关组。在涵盖计算机视觉、环境感知和大语言模型(LLM)回归的常见真实世界DSR数据集上的大量实验验证了我们策略的优越性能。我们的工作填补了研究连续预测中虚假相关性的基准和技术空白。

英文摘要

Real-world regression often exhibits shortcuts: attributes that are spuriously correlated with continuous targets in training, yet unreliable under deployment shifts; regressing targets using such shortcuts may fail catastrophically at test time. Existing studies on spurious correlations focus primarily on classification, where labels are categorical and groups are naturally defined. However, many real-world tasks require continuous prediction, where hard label boundaries or discrete group-label pairs do not exist. We define Deep Spurious Regression (DSR) as learning from regression data with attribute-label confounding, addressing continuous spurious correlations, and generalizing to all attribute-label combinations at test time. Motivated by the intrinsic difference between classification and regression shortcuts, we propose to exploit the similarity among spurious attributes in both label and feature spaces, thereby accounting for nearby targets and related groups while calibrating both label and learned feature distributions across attributes. Extensive experiments on common real-world DSR datasets that span computer vision, environmental sensing, and large language model (LLM) regression verify the superior performance of our strategies. Our work fills the gap in benchmarks and techniques for studying spurious correlations in continuous prediction.

2606.01722 2026-06-02 cs.LG cs.AI cs.DC

Post-Deterministic Distributed Systems: A New Foundation for Trustworthy Autonomous Infrastructure

后确定性分布式系统:可信自主基础设施的新基础

Jun He, Deying Yu

发表机构 * OpenKedge Inc.(OpenKedge公司)

AI总结 本文提出后确定性分布式系统(PDDS)模型,以协调确定性代码、随机模型和自主代理共存的异构环境,并定义了五大架构支柱及新的故障分类。

Comments 8 pages, 1 table

详情
AI中文摘要

几十年来,分布式系统通常假设正确的参与者执行协议指定的行为,具有稳定、外部定义和确定性的语义。经典理论广泛参数化了网络时序、通信拓扑和故障域,但参与者模型相对固定。将自主推理引擎、随机模型驱动代理和策略驱动参与者集成到云控制平面、事件响应系统和金融基础设施中,挑战了这一假设的普遍性。这些代理通常产生不同的推理路径、不同的操作轨迹和异构的内部表示,同时实现语义等价且正确的结果。在本文中,我们引入后确定性分布式系统(PDDS)作为研究和工程模型,用于协调确定性代码、随机模型和自主代理共存的异构环境。我们表明,经典分布式计算模型构成了这种参与者通用模型的零歧义特例。我们并非主张确定性系统消失;而是确定性执行不能再作为自主基础设施的通用参与者假设。最后,我们概述了后确定性基础设施的五大架构支柱:协议驱动开发、可验证代理基础设施、自主状态控制平面、语义法定保证和认知状态复制。认知状态复制将持久性和一致性模型从数据可见性扩展到知识可见性,实现代理记忆、可验证语义回滚以及跨推理参与者的连贯性。我们还定义了在此环境中出现的故障类别的分类法。

英文摘要

For decades, distributed systems have typically assumed that correct participants execute protocol-specified behavior with stable, externally defined, and deterministic semantics. Classical theory has extensively parameterized network timing, communication topologies, and failure domains, but this participant model has remained comparatively fixed. The integration of autonomous reasoning engines, stochastic model-driven agents, and policy-driven actors into cloud control planes, incident response systems, and financial infrastructure challenges the universality of this assumption. These agents often produce divergent reasoning paths, distinct operational traces, and heterogeneous internal representations while achieving semantically equivalent and correct outcomes. In this paper, we introduce Post-Deterministic Distributed Systems (PDDS) as a research and engineering model for coordinating heterogeneous environments where deterministic code, stochastic models, and autonomous agents coexist. We show that classical distributed computing models form a zero-ambiguity special case of this participant-general model. We do not argue that deterministic systems disappear; rather, deterministic execution can no longer serve as the universal participant assumption for autonomous infrastructure. Finally, we outline five architectural pillars of post-deterministic infrastructure: Protocol-Driven Development, Verifiable Agentic Infrastructure, Autonomous State Control Planes, Semantic Quorum Assurance, and Epistemic State Replication. Epistemic State Replication extends persistence and consistency models from data visibility to knowledge visibility, enabling agentic memory, Verifiable Semantic Rollback, and coherence across reasoning participants. We also define a taxonomy of failure classes that arise in this setting.

2606.01720 2026-06-02 cs.LG

A Note on Stability for Orthogonalized Matrix Momentum with Client Sampling

关于带客户端采样的正交化矩阵动量的稳定性注记

Da Chang, Qiankun Shi, Lvgang Zhang, Yu Li, Ruijie Zhang

Affiliations: University of Chinese Academy of Sciences (中国科学院大学), Sun Yat-sen University (中山大学), Southern University of Science and Technology (南方科技大学), George Washington University (乔治华盛顿大学)

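AI summary: Studies finite-sample generalization bounds for orthogonalized momentum updates in client-sampled distributed matrix optimization, deriving an upper-tail guarantee from a coupled-neighbor stability recursion and a weighted concentration step.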

AI Chinese abstract

我们研究了带矩阵值参数和正交化动量更新的客户端采样分布式优化方案的有限样本泛化。核心量是当每轮只有一部分客户端参与时,返回模型上总体目标与经验目标之间的差距。在独立异构客户端数据、不等本地样本计数和固定聚合权重下,我们通过耦合邻域稳定性递归和加权集中步骤导出了有限轮上尾保证。该界限通过放大因子 \(Y_i(\mathcal C)\) 保留客户端选择计数;在均匀全参与全批次情况下,当控制依赖于时间范围的放大项时,它产生 \(\widetilde{\mathcal O}(n^{-1}+n^{-1/2})\) 的缩放。矩阵正交化规则要求沿配对轨迹是Lipschitz的,该条件由正则化极型映射和归一化有限步Newton-Schulz正交化器满足。对于未正则化的矩阵符号,相同的论证需要耦合谱分离,而高斯平滑给出了有限轮平滑变体。一个一维反例说明了为什么间隙、平滑或正则性条件是必要的。

English abstract

We study finite-sample generalization for a client-sampled distributed optimization scheme with matrix-valued parameters and orthogonalized momentum updates. The central quantity is the gap between the population and empirical objectives at the returned model when only a subset of clients participates in each round. Under independent heterogeneous client data, unequal local sample counts, and fixed aggregation weights, we derive a finite-round upper-tail guarantee from a coupled-neighbor stability recursion and a weighted concentration step. The bound keeps the client-selection counts through the amplification factor \(Y_i(\mathcal C)\); in the uniform full-participation full-batch regime, it yields \(\widetilde{\mathcal O}(n^{-1}+n^{-1/2})\) scaling whenever the horizon-dependent amplification terms are controlled. The matrix-orthogonalization rule is required to be Lipschitz along paired trajectories, a condition satisfied by regularized polar-type maps and normalized finite-step Newton-Schulz orthogonalizers. For the unregularized matrix sign, the same argument requires coupled spectral separation, whereas Gaussian smoothing gives a finite-round smoothed variant. A one-dimensional counterexample shows why a gap, smoothing, or regularity condition is necessary.
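
The abstract above names normalized finite-step Newton-Schulz orthogonalizers as one rule that satisfies the Lipschitz condition. As a rough illustration only, the sketch below runs the classical cubic Newton-Schulz iteration after Frobenius normalization, in plain Python. The function names, the default step count, and the cubic coefficients (1.5 and -0.5) are assumptions made for this example; the paper may use a different polynomial or normalization.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(m, steps=30, eps=1e-12):
    """Approximate the orthogonal polar factor of m with a fixed number of steps.

    Assumption: classical cubic update X <- 1.5 X - 0.5 X X^T X, started from
    m divided by its Frobenius norm so that all singular values lie in [0, 1].
    """
    fro = math.sqrt(sum(v * v for row in m for v in row))
    x = [[v / (fro + eps) for v in row] for row in m]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * x[i][j] - 0.5 * xxt_x[i][j] for j in range(len(x[0]))]
            for i in range(len(x))
        ]
    return x


if __name__ == "__main__":
    q = newton_schulz_orthogonalize([[2.0, 0.0], [1.0, 1.0]])
    # For a full-rank input, q times its transpose should be close to the identity.
    print(matmul(q, transpose(q)))
```

Fixing the number of steps makes the map a polynomial in the normalized input, which is the kind of finite-step construction the abstract refers to. The normalization and the step count here are illustrative choices.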

2606.01719 2026-06-02 cs.LG cs.AI cs.CR

Fair Finetuning Mitigates Distribution Inference Attacks

公平微调缓解分布推断攻击

Rakshit Naidu

Affiliations: Rakshit Naidu

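AI summary: Proposes Fair Fine-tuning (FFt), which fine-tunes on samples from the complementary distribution under an Equalized Odds constraint, links a model's fairness metric to the adversarial advantage in distribution inference attacks, and gives a theoretical bound; experiments show that it effectively lowers the attack success rate.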

Comments 16 pages (11 main, 5 appendix)

AI Chinese abstract

在敏感数据上训练的机器学习模型可能会无意中泄露其训练分布的群体级信息——这种威胁被称为分布推断攻击(DIA)。具有黑盒访问权限的对手可以在不直接观察任何训练数据的情况下推断敏感的人口统计属性,如子群比例。尽管已经提出了差分隐私和属性遗忘等防御措施,但公平性约束与分布泄漏之间的联系尚未被探索。我们提出了公平微调(FFt):在等几率(EO)约束下,对来自互补分布的样本进行微调。我们提供了完整的理论刻画,证明了紧界 $\text{Adv}(\mathcal{A},M_f) \le \Delta_{\text{EO}} \cdot W$,其中 $W$ 量化了两个训练分布通过其敏感属性组成的可区分程度。我们还建立了FFt降低对抗优势的必要条件,并证明了该界的紧性。我们在六个数据集上进行了评估,涵盖表格数据(ACS Income、COMPAS、German Credit)、图像数据(UTKFaces)和自然语言处理数据(Bias in Bios)。基于重演的FFt在所有设置中一致地将对抗准确率差距降低到检测阈值 $\tau=0.1$ 以下;在ACS Income上,差距从约15%下降到4%以下。我们的工作提供了第一个将模型测量的EO差异直接与其在DIA博弈中的对抗优势联系起来的正式界限,为统一的公平性和隐私防御开辟了新途径。

English abstract

Machine learning models trained on sensitive data can inadvertently leak population-level information about their training distributions, a threat known as a distribution inference attack (DIA). An adversary with black-box access can infer sensitive demographic properties, such as subgroup proportions, without observing any training data directly. While defenses such as differential privacy and property unlearning have been proposed, the link between fairness constraints and distributional leakage remains unexplored. We propose Fair Fine-tuning (FFt): a trained model is fine-tuned on samples from the complementary distribution under an Equalized Odds (EO) constraint. We provide a complete theoretical characterization, proving the tight bound $\text{Adv}(\mathcal{A},M_f) \le \Delta_{\text{EO}} \cdot W$, where $W$ quantifies how distinguishable the two training distributions are by their sensitive-attribute composition. We also establish a necessary condition for FFt to reduce adversarial advantage and prove tightness of the bound. We evaluate across six datasets spanning tabular (ACS Income, COMPAS, German Credit), image (UTKFaces), and NLP (Bias in Bios) modalities. Rehearsal-based FFt consistently reduces the adversarial accuracy gap below the detection threshold $\tau = 0.1$ across all settings; on ACS Income, the gap falls from $\sim 15\%$ to under $4\%$. Our work provides the first formal bound connecting a model's measured EO disparity directly to its adversarial advantage in the DIA game, opening a new avenue for unified fairness-and-privacy defenses.
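
The bound in the abstract above depends on a measured Equalized Odds disparity. The abstract does not give the exact definition of that quantity, so the sketch below assumes one common convention: the larger of the true-positive-rate gap and the false-positive-rate gap across groups, for binary predictions and labels. The function names and the toy data are hypothetical and serve only to illustrate the calculation.

```python
def conditional_positive_rate(preds, labels, groups, group, label_value):
    """Fraction of positive predictions among samples of one group with a given true label."""
    selected = [p for p, y, g in zip(preds, labels, groups) if g == group and y == label_value]
    if not selected:
        raise ValueError(f"no samples for group={group!r} with label={label_value}")
    return sum(selected) / len(selected)


def equalized_odds_gap(preds, labels, groups):
    """Assumed convention: max over {TPR gap, FPR gap} between the best and worst group."""
    group_ids = sorted(set(groups))
    gaps = []
    for label_value in (1, 0):  # label 1 gives the TPR, label 0 gives the FPR
        rates = [
            conditional_positive_rate(preds, labels, groups, g, label_value)
            for g in group_ids
        ]
        gaps.append(max(rates) - min(rates))
    return max(gaps)


if __name__ == "__main__":
    # Toy data: both groups have TPR 0.5; group "a" has FPR 1.0 and group "b" has FPR 0.0.
    preds = [1, 0, 1, 1, 1, 0, 0, 0]
    labels = [1, 1, 0, 0, 1, 1, 0, 0]
    groups = ["a"] * 4 + ["b"] * 4
    print(equalized_odds_gap(preds, labels, groups))
```

Under this convention, a model that satisfies Equalized Odds exactly has a gap of zero, and the stated bound would then force the adversarial advantage to zero regardless of $W$.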