arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.01828 2026-06-02 cs.MA cs.AI

Dynamic Trust-Aware Sparse Communication Topology for LLM-Based Multi-Agent Consensus

Wanshuang Gou, Zihan Liu

AI summary: Proposes DySCo, a dynamic sparse consensus mechanism that uses trust-aware edge selection to cut communication overhead while preserving consensus quality.

Comments
11 pages, 3 figures, 5 tables
Abstract

Large language model-driven multi-agent systems enhance the reliability of complex reasoning tasks through multi-round deliberation, role specialization, and cross-validation. However, existing multi-agent debate and collaboration frameworks typically adopt fully connected communication, causing the number of messages, token costs, and end-to-end latency to grow approximately quadratically with the number of agents; although fixed sparse topologies reduce overhead, they cannot adapt communication relationships to different task instances or intermediate reasoning states, making them prone either to preserving low-value interactions or to losing critical error-correction information. To address this problem, this paper proposes DySCo (Dynamic Sparse Consensus), a dynamic trust-aware sparse consensus mechanism. In each round of reasoning, DySCo estimates the value of communication edges based on agent reliability, answer divergence, and task relevance, and selects a small number of high-value edges for message exchange under budget constraints; it then aggregates the answers of different agents through dynamic trust weights and terminates the discussion early once consensus stabilizes. This mechanism replaces universal broadcasting with on-demand communication, thereby reducing communication overhead while preserving essential cross-validation information. We further present analyses of communication complexity and consensus stability, and evaluate the performance of DySCo on mathematical reasoning, logical reasoning, and factual question-answering tasks.
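
The per-round procedure described in the abstract (score edges, keep a budgeted few, aggregate with trust weights) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the scoring formula in `edge_value`, the function names, and the toy trust values are all assumptions, since the abstract does not give the actual definitions.

```python
from collections import defaultdict

def edge_value(trust, answers, relevance, src, dst):
    """Hypothetical edge score: reliable, relevant senders that disagree
    with the receiver are assumed to carry the most useful messages."""
    disagreement = 1.0 if answers[src] != answers[dst] else 0.0
    return trust[src] * relevance[src] * (0.5 + disagreement)

def select_edges(trust, answers, relevance, budget):
    """Keep only the `budget` highest-value directed edges (src -> dst)."""
    agents = list(answers)
    scored = [
        (edge_value(trust, answers, relevance, s, d), s, d)
        for s in agents for d in agents if s != d
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(s, d) for _, s, d in scored[:budget]]

def aggregate(trust, answers):
    """Trust-weighted vote over the agents' current answers."""
    votes = defaultdict(float)
    for agent, answer in answers.items():
        votes[answer] += trust[agent]
    return max(votes, key=votes.get)

# Toy round: one reliable agent disagrees with two less reliable ones.
trust = {"a": 0.9, "b": 0.4, "c": 0.3}
relevance = {"a": 1.0, "b": 1.0, "c": 1.0}
answers = {"a": "42", "b": "41", "c": "41"}

edges = select_edges(trust, answers, relevance, budget=2)
consensus = aggregate(trust, answers)
```

With a budget of 2, only two of the six possible directed edges carry messages, which is the source of the savings over full broadcast in this toy setting.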

2606.01827 2026-06-02 math.OC cs.LG stat.ML

Adaptive Sharpness-Aware Minimization with a Polyak-type Step size: A Theory-Grounded Scheduler

Dimitris Oikonomou, Nicolas Loizou

AI summary: To address the sensitivity of Sharpness-Aware Minimization (SAM) to the learning rate, derives Polyak schedulers for SAM inspired by stochastic Polyak step sizes, yielding adaptive algorithms in deterministic and stochastic settings with convergence proofs; experiments show performance comparable to or better than tuned SAM baselines.

Comments
43rd International Conference on Machine Learning (ICML 2026)
Abstract

Sharpness-Aware Minimization (SAM) has established itself as a powerful and widely adopted optimizer for training machine learning models. By explicitly minimizing the sharpness of the loss landscape, SAM often improves generalization while delivering strong empirical performance. However, SAM and its variants, like most training algorithms, are sensitive to the choice of learning rate, which is typically selected through extensive hyperparameter tuning or predefined schedulers. In this work, motivated by recent advances on the effectiveness of stochastic Polyak step sizes for Stochastic Gradient Descent (SGD), we derive Polyak schedulers tailored to SAM-style updates, yielding novel adaptive algorithms in both deterministic and stochastic settings. In the smooth setting, we prove linear convergence for strongly convex objectives and an $\mathcal{O}(1/T)$ convergence rate for convex objectives in the deterministic case. In the stochastic setting, we establish analogous convergence guarantees up to a neighborhood of the optimum. Numerical experiments demonstrate that the proposed Polyak schedulers achieve performance comparable to or better than carefully tuned SAM baselines, while substantially reducing the need for learning-rate tuning.
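
To make the idea of a Polyak-type step for a SAM-style update concrete, here is a small sketch on a toy quadratic. It is not the scheduler derived in the paper, whose exact form the abstract does not state: it simply plugs the SAM gradient into the classical Polyak rule, step = (f(x) - f*) / ||g||^2, and assumes the optimal value `f_star` is known. The names `sam_polyak_step`, `rho`, and `f_star` are illustrative.

```python
import numpy as np

def loss(x):
    # Toy objective f(x) = 0.5 * ||x||^2, whose minimum value is 0.
    return 0.5 * float(x @ x)

def grad(x):
    return x

def sam_polyak_step(x, rho=0.05, f_star=0.0, eps=1e-12):
    """One SAM-style update with a classical Polyak step size (assumed form)."""
    g = grad(x)
    # SAM ascent step: move to the approximate worst case in a rho-ball.
    x_adv = x + rho * g / (np.linalg.norm(g) + eps)
    g_sam = grad(x_adv)
    # Polyak step: no hand-tuned learning rate, only f(x) - f_star.
    step = (loss(x) - f_star) / (float(g_sam @ g_sam) + eps)
    return x - step * g_sam

x = np.array([3.0, -4.0])
history = [loss(x)]
for _ in range(50):
    x = sam_polyak_step(x)
    history.append(loss(x))
```

On this toy problem the loss decreases at every iteration without any learning-rate tuning, which is the practical appeal the abstract highlights.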

2606.01825 2026-06-02 cs.CV cs.MM

ROGLE: Robust Global-Local Alignment with Automated Region Supervision for Text-Based Person Search

Zequn Xie, Xibei Jia, Sihang Cai, Shulei Wang, Tao Jin

AI summary: Proposes the ROGLE framework, which addresses insufficient fine-grained alignment in text-based person search through an automated region-sentence matching strategy and multi-granular learning, and achieves the best performance on the new P-VLG benchmark.

Comments
12 pages, 5 figures
Abstract

Text-Based Person Search (TBPS) aims to retrieve pedestrian images using natural language queries. However, existing TBPS models, especially those based on CLIP, struggle with fine-grained understanding due to global representational bias and semantic sparsity inherited from training on short captions. This results in weak fine-grained alignment, exacerbated by the scarcity of region-level annotations. To address this, we propose ROGLE (Robust Global-Local Embedding), a unified framework that overcomes reliance on costly manual annotations through an automated Region-to-Sentence Matching (RSM) strategy. RSM automatically mines pseudo region-sentence pairs for scalable fine-grained supervision. Furthermore, ROGLE employs a multi-granular learning strategy that fuses global contrastive learning with region-level local alignment. We also introduce the P-VLG Benchmark, a large-scale dataset constructed by curating and enriching images from established public benchmarks. It features over 100,000 annotated regions and rich long-form captions, making it the first TBPS benchmark to support both global and local assessment protocols. Extensive experiments show that ROGLE significantly outperforms existing approaches, particularly on challenging long-form queries. Code and the P-VLG benchmark will be made publicly available.

2606.01824 2026-06-02 cs.RO

DisFlow: Scene Flow from Distance Field for Object Pose, Velocity Tracking, and Dynamic Object Reconstruction

Lan Wu, Sheila Sutjipto, Jennifer Wakulicz, Teresa Vidal-Calleja

AI summary: Proposes the DisFlow framework, which estimates scene flow from a distance field using a Gaussian Process Implicit Surface representation, enabling 6DoF dynamic object pose estimation, motion tracking, and surface reconstruction.

Abstract

We present DisFlow, a novel framework for online scene flow estimation from a distance field that enables 6DoF dynamic object pose estimation, motion tracking, and surface reconstruction. The scene is represented by Gaussian Process Implicit Surfaces (GPIS), with surface normals serving as derivative constraints, enabling accurate signed distance computations near the surface and gradient queries with uncertainty. With this representation as a foundation, we compute a scene flow from the distance field that describes how surface points are transported over time in consecutive frames. Through our flow, we can estimate an object's pose and motion by incrementally registering a newly observed point cloud via an elegant closed-form optimisation. Unlike prior methods that operate in the camera or world frame, our approach performs probabilistic fusion directly in the object frame, where the object remains geometrically consistent over time. The tight coupling of the DisFlow method in space and time yields dense geometry, surface normals, object pose trajectories, velocities, and uncertainty, all at real-time rates. We evaluate DisFlow on dynamic object sequences and demonstrate that it achieves accurate pose and motion tracking while simultaneously reconstructing high-quality object surfaces. Code is publicly available at https://github.com/LanWu076/disflow_ros2.

2606.01820 2026-06-02 cs.CL

TalkTag: Fine-Grained Morphosyntactic Error Annotation for Transcribed Speech

Shamira Venturini, Oliver Hennhöfer, Steffen Kinkel, Jannik Strötgen

AI summary: Proposes TalkTag, a lightweight LLM-based tool that automates CHAT-style error annotation of spoken-language transcripts under data scarcity, producing precise annotations in low-resource settings and identifying ambiguous cases.

Abstract

Fine-grained morphosyntactic error annotation is important in clinical and developmental language research, yet it is labour-intensive, expert-dependent, and difficult to scale. We present TalkTag, an LLM-based lightweight tool fine-tuned to automate CHAT-style error annotation in spoken-language transcripts. Developed under conditions of extreme data scarcity using children's narrative data, the system shows the feasibility of linguistic analysis in low-resource settings. Our evaluation demonstrates that TalkTag produces encouragingly precise annotation while effectively identifying instances where linguistic ambiguity makes automated tagging genuinely complex. In summary, with TalkTag, we provide a scalable alternative to manual error annotation and practically viable support for morphosyntactic error annotation.

2606.01819 2026-06-02 cs.CV eess.IV

Hist2Style: Histogram-Guided Stylization with Bilateral Grids

Dekel Galor, Adam Pikielny, Zhoutong Zhang, Ke Wang, Laura Waller, Jiawen Chen, Ilya Chugunov

AI summary: Proposes Hist2Style, which uses bilateral grids for fast, edge-aware photorealistic style transfer, distills a large model into a lightweight network, and offers interpretable user control through histogram embeddings.

Comments
10 pages, 8 figures. Extended results are at https://www.dekelgalor.com/hist2style
Abstract

Photorealistic style transfer aims to match the color and tone of an input image to that of a style target while preserving the content and details of the original scene. Although existing large image models can facilitate these kinds of appearance edits, their high computational demands, potential for hallucinations, and limited user control make them unsuitable for high-resolution, real-time workflows. We introduce Hist2Style, a bilateral-grid formulation for fast, edge-aware stylization that preserves visual fidelity by constraining operations to locally affine transforms in bilateral space. Our model distills a large image editing model into a lightweight network by training on a large supervised corpus generated with language and vision-language models, targeting spatially varying color edits. The network conditions on a histogram-based embedding of the style target to provide an interpretable interface for adjusting the output style by modifying the target color distribution. Overall, Hist2Style maintains content structure by construction, avoids hallucinations, and supports real-time, high-resolution photorealistic stylization with interactive user-controllable color and tone adjustments.
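
The constraint to locally affine transforms can be illustrated with a short sketch: each output pixel is an affine function of the input pixel's own color, so the edit can recolor but cannot invent new structure. The code below assumes the per-pixel 3x4 coefficients have already been obtained (in a bilateral-grid method they would be sliced from the grid, a step omitted here); `apply_local_affine` and the array shapes are illustrative choices, not taken from the paper.

```python
import numpy as np

def apply_local_affine(image, coeffs):
    """Apply a per-pixel affine color transform.

    image:  (H, W, 3) float array of RGB values.
    coeffs: (H, W, 3, 4) array; each pixel has a 3x3 matrix plus an offset column.
    """
    ones = np.ones(image.shape[:2] + (1,), dtype=image.dtype)
    homogeneous = np.concatenate([image, ones], axis=-1)  # (H, W, 4)
    return np.einsum("hwij,hwj->hwi", coeffs, homogeneous)

rng = np.random.default_rng(0)
image = rng.random((4, 5, 3))

# Identity coefficients leave the image untouched.
identity = np.zeros((4, 5, 3, 4))
identity[..., :3, :3] = np.eye(3)
unchanged = apply_local_affine(image, identity)

# Adding a constant offset column brightens every channel by 0.1.
brighter = identity.copy()
brighter[..., :, 3] = 0.1
shifted = apply_local_affine(image, brighter)
```

Because the output at each pixel depends only on that pixel's input color and its coefficients, content structure is preserved by construction, as the abstract notes.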

2606.01818 2026-06-02 cs.CV

Unsupervised Collaborative Domain Adaptation for Driving Scene Parsing

Jiahe Fan, Shaolong Shu, Mingjian Sun, Tiehua Zhang, Bohong Xiao, Hanli Wang, Rui Fan

AI summary: Proposes UCDA, an unsupervised collaborative domain adaptation framework that uses collaborative optimization of multiple source models and knowledge distillation to improve the robustness and generalization of target-domain driving scene parsing without access to source data.

Abstract

Reliable driving scene parsing is a fundamental capability for autonomous vehicles operating in open and dynamic driving environments. However, adapting perception models to new deployment domains remains challenging because pixel-level annotations are expensive to obtain, while source-domain data are often inaccessible due to privacy, security, or ownership constraints. Existing source-free unsupervised domain adaptation methods typically rely on a single pre-trained source model, which makes the adapted perception system vulnerable to source-specific biases and limits its robustness under diverse road layouts, illumination conditions, weather patterns, and traffic conditions. This article presents an unsupervised collaborative domain adaptation (UCDA) framework for driving scene parsing in a source-free setting, which transfers complementary knowledge from multiple pre-trained source models to a unified target model without accessing any original source samples. To compare predictions from independently trained models, UCDA constructs a class-level prototype memory bank and estimates cross-model prediction reliability through prototype similarity, reducing the effect of inconsistent confidence scales across source models. Based on the resulting complementary supervision, UCDA adopts a two-stage transfer strategy: multiple source models are first refined on unlabeled target-domain driving data through collaborative optimization with positive and negative consistency constraints, and their validated expertise is then distilled into a single deployable target model. Comprehensive evaluations on public driving-scene datasets and real-world data collected from an autonomous vehicle platform demonstrate that UCDA effectively consolidates complementary multi-source knowledge, improving target-domain scene parsing reliability and generalization across diverse driving environments.

2606.01816 2026-06-02 q-bio.BM cs.LG

Site4Drug: Predicting Drug-Binding Target Sites with an AI Agent

Taehan Kim, Sarrah Rose Mikhail Leung, Bharat Mekala, Jeongbin Park

AI summary: Proposes Site4Drug, a modality-aware site-finding agent that integrates evidence such as topology, hydropathy, and post-translational modifications to output a ranked list of targetable regions with constraints, risk flags, and a decision log, and automatically recommends a binding modality.

Comments
Accepted to the ICML 2026 Workshop on Generative and Agentic AI for Biology (GenBio)
Abstract

Selecting where to intervene on a protein (i.e., choosing a targetable site) is often a more ambiguous and failure-prone bottleneck than selecting what binds, especially for membrane proteins where accessibility, topology, and post-translational modifications (PTMs) constrain actionable regions. We present Site4Drug, a modality-aware site-finding agent that outputs a ranked list of targetable regions with explicit constraints, evidence summaries, risk flags, and a traceable decision log. Rather than requiring users to specify the drug modality upfront, Site4Drug can recommend a binding modality (e.g., antibody/peptide-like vs small-molecule) from the same evidence used for site discovery, including topology, hydropathy, PTM propensity, disulfides, domain context, and sequence. Importantly, this evidence is applied consistently across modalities, including small-molecule pocket discovery, to avoid selecting chemically plausible but biologically occluded sites.

2606.01815 2026-06-02 cs.CL

CRAB-Bench: Evaluating LLM Agents under Complex Task Dependencies and Human-aligned User Simulation

Danqing Wang, Akshay Sivaraman, Lei Li

AI summary: Proposes CRAB-Bench and the RUSE framework, which evaluate LLM agents in realistic service scenarios using task dependencies generated from a constraint graph and user simulation grounded in human behavioral studies; the best model reaches only 61% pass@1, and user simulation causes drops of up to 57%.

Abstract

Evaluating LLM agents in realistic service scenarios requires complex task dependencies, imperfect user behavior, and an evaluation that accommodates multiple valid solutions. We introduce CRAB-Bench (Constraint-based Realistic Agent Benchmark) and RUSE (Realistic User Simulation Engine) to address this gap. CRAB-Bench generates tasks via a constraint graph over multiple interdependent entities with structured distractors, requiring agents to reason carefully over thousands of misleading candidates where only a tiny fraction of solutions are valid. RUSE replaces cooperative, template-like simulators with realistic users grounded in human behavioral studies, instantiated across diverse personas and four behavioral dimensions. Experiments on four frontier LLM agents show that the best model achieves only 61% pass@1 on CRAB-Bench, and switching to RUSE causes further drops of up to 57%, concentrated in task-solving ability rather than conversational quality. Information Disclosure is the most damaging behavioral dimension, and agents interacting with RUSE are less likely to admit mistakes, instead masking errors through implicit corrections.

2606.01813 2026-06-02 cs.CL

Cost-Aware Diffusion Draft Trees for Speculative Decoding

Shuai Zhang, Huachuan Qiu, Hongliang He, Yong Dai

AI summary: Proposes CaDDTree, which directly maximizes token throughput by jointly optimizing the tree structure and node budget, and proves that the throughput function is unimodal under a convex verification cost, removing the need for offline budget search.

Abstract

Speculative decoding accelerates inference by having a lightweight drafter propose tokens verified in parallel by the target language model. Block diffusion drafters such as DFlash generate an entire draft block in one pass, yielding per-position marginals; DDTree uses these to build a candidate tree that maximizes expected acceptance length under a fixed node budget. We observe, however, that acceptance length is non-decreasing in budget: it always favors larger trees regardless of verification cost, offering no principled basis for budget selection. We introduce CaDDTree (Cost-aware Diffusion Draft Tree), a method that directly optimizes token throughput (expected tokens generated per unit time) by jointly selecting the tree structure and node budget. We model draft and verification latencies explicitly, show that the throughput objective decomposes into a per-round one-dimensional search over the budget, and prove that under a convex verification cost the throughput function is unimodal, enabling an efficient greedy stopping rule. CaDDTree requires no offline budget search, adapting the budget each round from the current per-position distributions and verification cost. Experiments on Qwen3-4B and Qwen3-8B across eight benchmarks spanning reasoning, coding, and instruction-following tasks show that CaDDTree matches or surpasses DDTree with oracle budget selection on nearly all tasks.
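
The greedy stopping rule that unimodality makes possible can be sketched as follows. This is an illustrative model under stated assumptions, not the paper's algorithm: expected tokens per round are taken to be one guaranteed token plus the sum of the per-node acceptance gains, nodes are added in decreasing order of gain, and `verify_cost` is a made-up convex latency function. All names and constants are hypothetical.

```python
def verify_cost(budget, base=10.0, per_node=0.05, quad=0.002):
    """Hypothetical convex verification latency as a function of node budget."""
    return base + per_node * budget + quad * budget ** 2

def throughput(gains, budget, draft_cost=2.0):
    """Expected tokens per unit time for a tree with `budget` nodes."""
    expected_tokens = 1.0 + sum(gains[:budget])
    return expected_tokens / (draft_cost + verify_cost(budget))

def greedy_budget(gains, draft_cost=2.0):
    """Grow the budget one node at a time; stop at the first non-improvement.

    If throughput is unimodal in the budget, the first drop marks the peak.
    """
    gains = sorted(gains, reverse=True)
    best_budget, best_rate = 0, throughput(gains, 0, draft_cost)
    for budget in range(1, len(gains) + 1):
        rate = throughput(gains, budget, draft_cost)
        if rate <= best_rate:
            break
        best_budget, best_rate = budget, rate
    return best_budget, best_rate

# Toy acceptance gains that decay geometrically with node rank.
gains = [0.9 * 0.8 ** i for i in range(64)]
budget, rate = greedy_budget(gains)

# Exhaustive search over every budget, for comparison with the greedy rule.
exhaustive = max(range(len(gains) + 1), key=lambda b: throughput(gains, b))
```

In this toy setting the numerator is concave in the budget and the denominator is convex, so the ratio has a single peak and the greedy rule lands on the same budget as the exhaustive search while evaluating far fewer candidates.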

2606.01811 2026-06-02 cs.CL cs.AI cs.LG

"I've Seen How This Goes": Characterizing Diversity via Progressive Conditional Surprise

Matthew Khoriaty, David Williams-King, Shi Feng

AI summary: Proposes Decan ($D_{Ca_n}$), a diversity metric based on in-context learning that computes a per-byte score in a single forward pass with no embedding model, reference corpus, or human labels, and validates it on several benchmarks.

Comments
28 pages, 18 figures, 9 tables. Accepted to the Workshop on Generative AI, Creativity, and Human-AI Co-Creation @ ICML 2026 (non-archival). Code and data: https://github.com/AMindToThink/icl-diversity
Abstract

Measuring the diversity of creative outputs is central to evaluating post-training mode collapse, comparing decoding strategies, and quantifying creative behavior in both AI and human writing. We propose a new approach to measuring diversity using in-context learning, of which the "Decan" metric, $D_{Ca_n} = C \times a_n$, is the working instance we evaluate: a per-byte score read off the per-token log-probabilities of a base model $\theta$ in a single forward pass per permutation, with no embedding model, no reference corpus, and no human labels. This approach is grounded in information theory, makes use of language model in-context learning to detect a wide range of similarities between any number of inputs, and obviates the need to train a special-purpose model. The same pipeline scores AI samples and human-written response sets, with diversity treated as a property of (responses, prompt, scoring model). On Tevet and Berant's human-grounded McDiv benchmark, $D_{Ca_n}$ reaches OCA 0.846 on the McDiv prompt_gen set where it performs best, behind the strongest neural baseline reported in Tevet and Berant (SentBERT, 0.897). On the OLMo-2-7B post-training pipeline, $D_{Ca_n}$ drops monotonically across the base $\to$ SFT $\to$ DPO $\to$ RLVR stages, detecting the type of diversity loss that creative-writing applications care about.

2606.01810 2026-06-02 cs.AI

Token Predictors Are Not Planners: Building Physically Grounded Causal Reasoners

Zheng Lu, Mingqi Gao, Qinlei Xie, Wanqi Zhong, Hanwen Cui, Heng Cao, Zirui Song, Yifan Yang, Chong Luo, Bei Liu, Yiming Li

AI summary: To address embodied vision-language planning models relying on statistical language priors rather than causal reasoning, proposes the Causal-Plan-Bench benchmark and the Causal-Plan-1M dataset and trains the Causal Planner model, shifting from token prediction to physically grounded causal reasoning.

Comments
77 pages, appendices included. Code: https://github.com/THUSI-Lab/Causal-Reasoner
Abstract

Current benchmarks for embodied vision-language planning often favor linguistic next-token prediction over physically grounded next-state reasoning. This rewards models that mimic statistical language priors rather than track causal dependencies, reducing physical planning to shallow sequence modeling. We argue that reliable physical autonomy requires a shift from linguistically grounded token prediction toward physically grounded causal reasoning. To this end, we introduce Causal-Plan-Bench, a high-fidelity diagnostic suite curated through multi-stage verification to evaluate embodied planning across four causal dimensions. We also construct Causal-Plan-1M, a million-scale corpus of explicit reasoning traces produced by a four-stage annotation pipeline over egocentric videos. Extensive evaluation shows that leading models still struggle to demonstrate genuine physical agency, with Gemini 3 Pro reaching only 38.18 on our benchmark. In contrast, our training recipe enables Causal Planner, built on Qwen3-VL-8B, to internalize physical logic for more accurate next-state estimation. The model achieves strong in-domain performance and cross-benchmark generalization, and reveals a Causal Scaling Law: scaling causal training data to one million instances yields a 36.3% relative gain, from 33.22 to 45.28. Overall, our work provides a concrete step toward turning agents from superficial token predictors into physically grounded causal reasoners.
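
The reported relative gain can be checked directly from the two scores quoted in the abstract; the variable names below are only for illustration.

```python
# Scores quoted in the abstract for the smallest and largest training sets.
baseline, scaled = 33.22, 45.28

# Relative gain = absolute improvement divided by the starting score.
relative_gain = (scaled - baseline) / baseline
print(f"{relative_gain:.1%}")  # prints "36.3%"
```

The absolute improvement is 12.06 points, which is 36.3% of the 33.22 baseline, so the quoted figure is a relative rather than an absolute gain.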

2606.01808 2026-06-02 cs.CV

Personalized 3D Myocardial Infarct Geometry Reconstruction from Cine MRI for Cardiac Digital Twins

Yilin Lyu, Mark YY Chan, Ching-Hui Sia, Lei Li

AI summary: Proposes an explicit geometry-motion embedded model that fully automatically reconstructs personalized, simulation-ready 3D myocardial infarct geometries from multi-view cine MRI, using dual-branch adaptive fusion and AHA-17-guided multi-scale supervision for contrast-free infarct characterization.

Comments
14 pages
Abstract

Accurate 3D geometric characterization of myocardial infarction (MI) is essential for building cardiac digital twins (CDTs) to precisely simulate infarct-related electrophysiology. Late gadolinium enhancement magnetic resonance imaging (LGE MRI) is the clinical reference for locating MI, yet its reliance on contrast agents restricts use in renally impaired patients and limits longitudinal follow-ups. As an alternative, contrast-free cine MRI visualizes abnormal ventricular wall motion, which is highly indicative of the infarcted area. In this study, we propose a novel explicit geometry-motion embedded model to fully automatically reconstruct personalized, simulation-ready 3D MI geometries directly from multi-view cine MRIs. Specifically, we construct a 4D (3D + t) biventricular mesh to explicitly extract and decouple geometry-aware and motion-aware features. We further design a dual-branch module for adaptive geometry-motion fusion that captures spatiotemporal dependencies to map the infarcted region. Furthermore, we introduce multi-scale supervision utilizing an AHA-17 segment-guided cross-attention mechanism to steer the prediction, ensuring biophysically consistent reconstruction. Experimental results on 225 cine MRIs demonstrated that the proposed 3D MI reconstruction achieved high performance with an average Dice score of 0.678 $\pm$ 0.011. In the downstream in-silico electrophysiological simulation evaluations, the results were highly consistent with the LGE-derived ground truth, highlighting the great potential of the proposed model for contrast-free scar characterization and seamless integration into CDT modeling. The code will be released publicly upon acceptance of the manuscript for publication.
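
For readers unfamiliar with the reported metric, the Dice score measures the overlap between a predicted region and the reference region. The sketch below computes it on binary voxel masks; that representation is an assumption made for illustration, since the paper works with mesh geometries and the abstract does not say how the score is evaluated.

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# Toy 3D masks: the prediction covers 8 voxels, the reference 12,
# and all 8 predicted voxels lie inside the reference.
pred = np.zeros((4, 4, 4), dtype=bool)
pred[1:3, 1:3, 1:3] = True
target = np.zeros((4, 4, 4), dtype=bool)
target[1:3, 1:3, 1:4] = True

score = dice_score(pred, target)  # 2 * 8 / (8 + 12) = 0.8
```

A score of 1 means perfect overlap and 0 means none, which gives context for the reported average of 0.678.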

2606.01806 2026-06-02 cs.CL cs.AI cs.LG

ProbeScale: Probing Analysis to Optimize Neural Scaling Laws for Efficient Small Language Model Inference

Sourav Das

AI summary: Proposes the ProbeScale framework, which uses scaling laws and probing analysis to identify parameter-efficient subnetworks in pre-trained small language models by maximizing task-weighted probe performance under a parameter budget, achieving 5-10x parameter reduction while retaining 95%-98% of the original performance.

Comments
7 pages, 2 figures, ACL
Abstract

Small Language Models (SLMs) offer a balance between capability and computational feasibility. Neural scaling laws inform their optimal training, suggesting that they possess rich internal representations that scale with their size. However, deploying even these SLMs can be challenging under strict resource constraints. Language model probing provides methods for analyzing the linguistic knowledge encoded in a model's internals. We propose ProbeScale, a framework that unifies insights from scaling laws and probing to identify parameter-efficient subnetworks within pre-trained SLMs. ProbeScale utilizes the high-quality representations of well-scaled SLMs and uses task-specific probes to mathematically quantify the relevance of each layer for target downstream capabilities. This allows selecting subnetworks that optimally trade off performance against parameter size. We formulate the subnetwork selection as finding a layer subset maximizing aggregated, task-weighted probe performance under a parameter budget. Experiments on representative SLMs such as RoBERTa-Large and T5-Base demonstrate that ProbeScale identifies subnetworks achieving significant parameter reduction, from 5 to 10 times, while maintaining high performance (95% to 98% of the original SLMs) on targeted tasks, outperforming heuristic baselines.
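
Choosing a layer subset that maximizes summed probe performance under a parameter budget has the shape of a 0/1 knapsack problem. The sketch below solves a toy instance exactly with dynamic programming. It is an assumed reading of the formulation: the abstract does not say how the authors solve the selection problem, and the scores, parameter counts, and the name `select_layers` are invented for illustration.

```python
def select_layers(scores, params, budget):
    """0/1 knapsack over layers.

    scores: per-layer probe score (assumed already aggregated and task-weighted).
    params: per-layer parameter count, as integers (e.g. millions).
    budget: maximum total parameter count.
    Returns (best total score, tuple of chosen layer indices).
    """
    best = {0: (0.0, ())}  # parameters used -> (total score, chosen layers)
    for layer, (score, cost) in enumerate(zip(scores, params)):
        # Iterate over a snapshot so each layer is added at most once.
        for used, (total, chosen) in list(best.items()):
            new_used = used + cost
            if new_used > budget:
                continue
            candidate = (total + score, chosen + (layer,))
            if new_used not in best or candidate[0] > best[new_used][0]:
                best[new_used] = candidate
    return max(best.values(), key=lambda item: item[0])

# Toy example: four layers, budget of 50 (million parameters).
scores = [0.9, 0.6, 0.8, 0.3]
params = [40, 10, 30, 10]
total, layers = select_layers(scores, params, budget=50)
```

Here the single highest-scoring layer is not selected: three cheaper layers together fit the budget and score higher, which is the kind of trade-off a budgeted selection is meant to capture.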

2606.01803 2026-06-02 cs.AI

OctoT2I: A Self-Evolving Agentic Text-to-Image Router

Xu Jiang, Bin Chen, Gehui Li, Yule Duan, Ronggang Wang, Jian Zhang

AI summary: Proposes the OctoT2I framework, which builds a knowledge base through a self-evolving mechanism and uses a stateful multi-round routing strategy to jointly optimize generation quality and inference efficiency, reaching 0.96 on GenEval with a 90.3% inference speedup and a 56.6% energy-efficiency gain.

Abstract

The explosive growth of Text-to-Image (T2I) models, from large-scale versions to lightweight, real-time ones, now faces diminishing marginal returns from single-model scaling. Agentic T2I methods emerged to alleviate this bottleneck by using multiple models. However, existing agentic T2I methods suffer from three key challenges: reliance on expensive handcrafted priors or human annotations, rigid single-path decision mechanisms, and a neglect of inference efficiency. To address these challenges, we introduce OctoT2I, a novel agentic framework that reformulates the T2I task as a joint optimization of generation quality and inference efficiency. OctoT2I implements a stateful, multi-round routing strategy that adaptively selects the most suitable tool based on its knowledge and memory. This strategy is enabled by a knowledge base built from scratch by our novel Self-Evolving Mechanism. This mechanism, which requires no human supervision, first autonomously defines foundational Conceptual Dimensions (e.g., style, color, count) and then intelligently explores their combinations via an iterative "Propose-Solve-Evaluate-Learn" (PSEL) loop. The PSEL loop efficiently discovers each tool's capability frontier, driving continuous improvement without external guidance. Extensive experiments demonstrate that OctoT2I achieves competitive performance (0.96) on GenEval while delivering a 90.3% inference speedup and a 56.6% energy-efficiency gain over the leading baseline (Flow-GRPO), striking an exceptional balance between performance and efficiency. Code and models will be made available.

2606.01800 2026-06-02 cs.CL cs.AI cs.LG

Multilinguality of Large Language Models From a Structural Perspective

从结构视角看大语言模型的多语言性

Haruki Sakajo, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

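AI summary: This study explores the multilinguality of large language models through representational structural analysis, finding that low-resource languages differ structurally from English more than high- and mid-resource languages do, and that language-specific post-training alters the structures while preserving inter-language relationships.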

详情
AI中文摘要

大型语言模型(LLMs)通过在多语言数据上进行预训练和后训练,在处理多种语言方面表现出色,尽管英语在训练数据中占主导地位。先前关注标记表示的研究揭示了这些LLMs如何处理非英语文本。尽管这些分析提供了有见地的发现,但它们未能捕捉到结构视角,而结构是语言的内在属性。在本研究中,我们通过表示结构分析探索LLMs的多语言性。我们的发现表明,低资源语言在结构上与英语的差异大于高资源和中资源语言,并且语言特定的后训练改变了它们的结构,同时保留了语言间的关系。

英文摘要

Large language models (LLMs) have excelled in processing multiple languages through pre- and post-training on multilingual data, even though English dominates the training data. Prior work focusing on token representations has revealed how those LLMs process non-English text. Although these analyses have provided insightful findings, they fail to capture a structural view, which is an inherent property of language. In this study, we explore the multilinguality of LLMs through representational structural analysis. Our findings reveal that low-resource languages are structurally more different from English than high- and mid-resource languages, and that language-specific post-training alters their structures while preserving inter-language relationships.

2606.01799 2026-06-02 cs.LG stat.ML

Tree-Guided Identify-Then-Exploit: A Unified Framework of Best Arm Identification and Regret Minimization for Dueling Bandits

树引导的识别-然后-利用:决斗式赌博机中最佳臂识别与遗憾最小化的统一框架

Pu Wang, Yao-Xiang Ding

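AI summary: For N-armed stochastic dueling bandits under the Condorcet-winner assumption, proposes the unified Tree-Guided Identify-Then-Exploit (TG-ITE) framework, which uses a shared tree-guided identification method to find a high-confidence incumbent within O(N) comparisons and designs exploitation strategies for each objective. It is the first to simultaneously guarantee O(N) sample complexity for best-arm identification, O(N) weak regret, and O(N log T) strong regret, and it eliminates the O(log N) sub-optimality gap of existing methods.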

详情
AI中文摘要

我们研究在Condorcet赢家假设下的$N$臂随机决斗式赌博机,考虑三个广泛采用的目标:最佳臂识别(BAI)、弱遗憾和强遗憾。我们提出树引导的识别-然后-利用(TG-ITE),据我们所知,这是第一个统一处理所有这些目标的框架。无需更强的假设,我们提出一种共享的树引导识别方法,在$O(N)$次比较内找到高置信度的候选。我们进一步提出不同的利用策略,利用这个热启动阶段来优化具体目标。这种方法使得我们的方法能够:(1)在没有通常采用的更强假设的情况下,实现BAI的$O(N)$样本复杂度;(2)构建第一个赢家保持风格的算法,实现$O(N)$弱遗憾;(3)享有与专门强遗憾方法相同的$O(N \log T)$保证;(4)实现BAI和弱遗憾的联合优化,两者均具有$O(N)$保证,消除了现有方法中$O(\log N)$的次优差距。我们的结果提供了证据,表明在决斗式赌博机中,BAI和遗憾最小化之间的权衡相对温和。

英文摘要

We study $N$-armed stochastic dueling bandits under the Condorcet-winner assumption, where three widely adopted objectives are considered: best-arm identification (BAI), weak regret, and strong regret. We propose Tree-Guided Identify-Then-Exploit (TG-ITE), the first unified framework to tackle all these objectives to our knowledge. Without requiring stronger assumptions, we propose a shared tree-guided identification approach to find a high-confidence incumbent within $O(N)$ comparisons. We further propose varied exploitation strategies to utilize this warm-start stage to optimize the specific objectives at hand. This methodology enables our approach to (1) achieve $O(N)$ sample complexity in BAI without commonly adopted stronger assumptions; (2) build the first winner-stays-style algorithm to achieve $O(N)$ weak regret; (3) enjoy the same $O(N \log T)$ guarantee as specialized strong-regret approaches; (4) realize the joint optimization of BAI and weak regret with $O(N)$ guarantees for both, eliminating the sub-optimal gap of $O(\log N)$ in the existing approach. Our results provide evidence that the trade-off between BAI and regret minimization is relatively benign in dueling bandits.
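The Condorcet-winner assumption used above can be made concrete with a small check. The sketch below is not the TG-ITE algorithm; it only illustrates the standard definition, under which an arm is a Condorcet winner if it beats every other arm with probability greater than 1/2. The function name `condorcet_winner` and both preference matrices are hypothetical, chosen for illustration.

```python
from typing import List, Optional


def condorcet_winner(pref: List[List[float]]) -> Optional[int]:
    """Return the index of the Condorcet winner, or None if there is none.

    pref[i][j] is the probability that arm i beats arm j in a duel.
    """
    n = len(pref)
    for i in range(n):
        # Arm i must win every pairwise duel with probability above 1/2.
        if all(pref[i][j] > 0.5 for j in range(n) if j != i):
            return i
    return None


# Hypothetical 3-arm preference matrix (illustrative values only).
P = [
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.8],
    [0.3, 0.2, 0.5],
]
winner = condorcet_winner(P)

# A cyclic (rock-paper-scissors) matrix has no Condorcet winner,
# so the assumption in the abstract would not hold for it.
cyclic = [
    [0.5, 0.6, 0.4],
    [0.4, 0.5, 0.6],
    [0.6, 0.4, 0.5],
]
no_winner = condorcet_winner(cyclic)
```

In a bandit setting the matrix is unknown and has to be estimated from noisy duels, which is where the sample-complexity and regret guarantees in the abstract come in.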

2606.01790 2026-06-02 cs.CV cs.AI

STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models

STaR-KV: 面向GUI视觉语言模型的时空自适应KV缓存压缩重加权方法

Yuhang Han, Wenzheng Yang, Yujie Chen, Xiangqi Jin, Yaojie Zhang, Siteng Huang, Linfeng Zhang

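AI summary: Proposes STaR-KV, a training-free KV cache compression framework that adaptively calibrates token importance along three axes (subspace-aware scoring, a temporal stability discount, and an entropy-driven temperature), achieving high accuracy on GUI tasks and nearly 40% savings in peak GPU memory.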

详情
AI中文摘要

基于视觉语言模型的图形用户界面(GUI)代理展现出广泛的自动化能力,但其部署受限于随交互步骤线性增长的键值(KV)缓存。例如,UI-TARS-1.5-7B在仅五个屏幕截图上消耗76 GB的GPU内存,接近主流80 GB加速器的容量。现有的KV压缩方法共享两个结构假设:将视觉令牌重要性聚合为单个共享显著性图,并对融合的分数分布应用固定的top-B截断。初步测量反驳了这两点:空间专门化存在于注意力子空间层面并在层间迁移,而分数分布沿轨迹漂移。我们提出STaR-KV(时空自适应重加权),一种无需训练的KV缓存压缩框架,沿三个维度校准令牌重要性:(i)由在线空间互信息驱动的子空间感知评分;(ii)时间稳定性折扣,抑制来自持续关注子空间的冗余缓存条目;(iii)熵导出的温度,自适应重塑分数分布。在四个GUI基准测试中,STaR-KV在匹配预算下实现了最先进的KV压缩方法(如GUIKV、SnapKV)中最强的平均准确率,无压缩阶段FLOPs开销(-0.07%),并在20% KV缓存预算下削减近40%的峰值GPU内存。代码可在https://github.com/kawhiiiileo/STaR-KV获取。

英文摘要

Vision-language-model-based graphical user interface (GUI) agents have shown broad automation capabilities, yet deployment is bottlenecked by a key-value (KV) cache that grows linearly with interaction steps. For instance, UI-TARS-1.5-7B consumes 76 GB of GPU memory on merely five screenshots, approaching the capacity of mainstream 80 GB accelerators. Existing KV compression methods share two structural assumptions: aggregating visual-token importance into a single shared saliency map, and applying a fixed top-B cutoff to the fused score distribution. Pilot measurements refute both: spatial specialization lives at the attention-subspace level and migrates across layers, while the score distribution drifts in shape along a trajectory. We propose STaR-KV (Spatio-Temporal Adaptive Re-weighting), a training-free KV cache compression framework that calibrates token importance along three axes: (i) subspace-aware scoring driven by online spatial mutual information; (ii) a temporal stability discount that suppresses redundant cache entries from persistently attended subspaces; and (iii) an entropy-derived temperature that adaptively reshapes the score distribution. Across four GUI benchmarks, STaR-KV achieves the strongest average accuracy among state-of-the-art KV compression methods (e.g., GUIKV, SnapKV) at matched budgets, with no compression-stage FLOPs overhead (-0.07%) and cutting peak GPU memory by nearly 40% at a 20% KV-cache budget. Code is available at https://github.com/kawhiiiileo/STaR-KV.

2606.01789 2026-06-02 cs.AI

Consistency evaluation of benchmarks used for causal discovery

用于因果发现的基准一致性评估

Yuzhe Zhang, Chihui Chen, Lina Yao, Chen Wang

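AI summary: Proposes a pipeline that automatically retrieves papers and uses large language models to check the consistency of benchmark causal graphs with domain research; an evaluation of 11 popular benchmarks finds that their consistency varies significantly.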

详情
AI中文摘要

在图形因果模型中,因果发现旨在基于数值数据和领域知识(以纯文本形式)构建因果图。然而,因果发现方法的评估在该领域仍然是一个挑战,因为领域研究的进展常常使得基准因果图包含不一致的知识。这个问题尤其影响基于大语言模型(LLM)的因果发现方法,因为它们对文献中的新发现敏感。本文首次系统研究基准因果图的质量。具体来说,我们设计了一个流程,自动从科学数据库中检索相关研究论文,并提示LLM检查基准因果图与领域研究论文之间的一致性。我们评估了11个流行的真实世界基准,我们的流程总共处理了38,081篇领域论文。结果表明,流行基准与领域研究的一致性差异显著,这对因果发现研究具有明确的意义。

英文摘要

In graphical causal models, causal discovery aims to construct a causal graph from numerical data and domain knowledge in plain text. However, evaluating causal discovery methods remains a challenge, because progress in domain research often leaves benchmark causal graphs containing misaligned knowledge. This problem especially affects the evaluation of large language model (LLM) based causal discovery methods, as they are sensitive to new discoveries in the literature. This work is the first to systematically study the quality of benchmark causal graphs. Specifically, we design a pipeline that automatically retrieves relevant research papers from scientific databases and prompts LLMs to check the consistency between the benchmark causal graphs and domain research papers. We evaluate 11 popular real-world benchmarks, for which our pipeline processes 38,081 domain papers in total. Our results show that popular benchmarks vary significantly in their consistency with domain research, with clear implications for causal discovery research.

2606.01788 2026-06-02 cs.CV

PlatonicNav: Unveiling Semantic Correspondence in Navigation with Platonic Topological Maps

PlatonicNav: 在导航中揭示柏拉图式拓扑地图的语义对应

Junlin Long, Zeyu Zhang, Xu Deng, Yiran Wang, Yue Yang, Luke Borgnolo, Maxwell Twelftree, Yang Zhao

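AI summary: Proposes the PlatonicNav framework, which builds a Platonic Topological Map from a self-supervised visual encoder and unifies visual object-goal navigation, cross-modal object-goal navigation, and vision-and-language navigation without cross-modal training.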

详情
AI中文摘要

具身视觉导航中,智能体感知复杂环境并从原始感官输入出发行动以到达目标,支撑了家庭服务机器人、辅助机器人和大规模自主探索等广泛应用。然而,最近统一视觉语言导航(VLN)和物体目标导航(ObjNav)的尝试仍停留在架构融合、混合任务训练和大规模视觉语言预训练层面,未检验独立训练的视觉和语言编码器是否已共享共同的语义结构。此外,即使是以物体为中心的拓扑地图,仍通过显式跨模态监督(如CLIP或大型视觉语言模型)来锚定语言目标,尚不清楚这种锚定是否可能仅从纯视觉构建的地图实现。为解决这些挑战,我们将柏拉图式表示假说扩展到具身导航,并将纯视觉ObjNav、跨模态ObjNav和VLN重新解释为同一以物体为中心的语义流形的三种不同接口。我们进一步引入PlatonicNav,一个无需训练的框架,其柏拉图式拓扑地图融合来自自监督视觉编码器的几何和语义节点距离,并通过盲匹配(无需任何配对视觉语言数据)锚定语言目标。在HM3D-IIN、OVON和R2R-CE(基于MP3D)等仿真基准以及宇树Go2机器人上的实验表明,PlatonicNav无需显式跨模态训练即可跨任务、模态和具身形式泛化。代码:https://github.com/AIGeeksGroup/PlatonicNav。网站:https://aigeeksgroup.github.io/PlatonicNav。

英文摘要

Embodied visual navigation, where an agent perceives a complex environment and acts to reach a goal from raw sensory input, underpins a wide range of applications such as household service robotics, assistive robotics, and large-scale autonomous exploration. However, recent attempts to unify vision-and-language navigation (VLN) and object goal navigation (ObjNav) remain at the level of architectural fusion, mixed-task training, and large vision-language pretraining, without examining whether independently trained vision and language encoders may already share a common semantic structure. Moreover, even object-centric topological maps still ground language goals through explicit cross-modal supervision such as CLIP or large vision-language models, leaving open whether such grounding is possible from a purely vision-built map. To address these challenges, we extend the Platonic Representation Hypothesis to embodied navigation and recast vision-only ObjNav, cross-modal ObjNav, and VLN as three different interfaces to the same object-centric semantic manifold. We further introduce PlatonicNav, a training-free framework whose Platonic Topological Map fuses geometric and semantic node distances from a self-supervised visual encoder, and grounds language goals via blind matching without any paired vision-language data. Extensive experiments on simulation benchmarks including HM3D-IIN, OVON, and R2R-CE on MP3D, together with deployment on Unitree Go2, demonstrate that PlatonicNav generalizes across tasks, modalities, and embodiments without explicit cross-modal training. Code: https://github.com/AIGeeksGroup/PlatonicNav. Website: https://aigeeksgroup.github.io/PlatonicNav.

2606.01787 2026-06-02 cs.AI math.OC

Stochastic convergence of parallel asynchronous adaptive first-order methods

并行异步自适应一阶方法的随机收敛性

Serge Gratton, Philippe L. Toint

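AI summary: Proposes a new class of asynchronous adaptive first-order optimization methods, including asynchronous variants of several popular algorithms, and analyzes their stochastic convergence on non-convex functions, obtaining an O(1/√t) convergence rate.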

详情
AI中文摘要

本文介绍了一类新的异步自适应一阶优化方法,包括几种流行算法的异步变体。还考虑了使用动量和/或非精确归一化的这些方法的版本。在完全随机环境下分析了该类方法在非凸函数上的收敛性,并证明在合理假设下,收敛阶为O(1/√t)(忽略对数因子)。数值实验表明,这种异步自适应算法在异构大规模机器学习系统中非常有用。

英文摘要

A new class of asynchronous adaptive first-order optimization methods is introduced, comprising asynchronous variants of several popular algorithms. Versions of these methods using momentum and/or inexact normalization are also considered. The convergence of methods in the class on non-convex functions is analyzed in a fully stochastic setting, and is shown to be (up to logarithmic factors) of order $O(1/\sqrt{t})$ under reasonable assumptions. Numerical experiments suggest that such asynchronous adaptive algorithms are very relevant in heterogeneous large-scale machine learning systems.

2606.01783 2026-06-02 cs.IR cs.AI

Breaking the Information Silo: Semantic Personas for Cross-Domain Recommendation

打破信息孤岛:面向跨域推荐的语义人物画像

Jonathan Mayo, Moshe Unger, Konstantin Bauman

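AI summary: Proposes SPHERE, which uses large language models to generate semantic personas, enabling cross-domain recommendation with no shared users or items, and strengthens recommendation performance with a dual-tower architecture and a dynamic fusion gate.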

详情
AI中文摘要

数字平台日益成为孤立的信息孤岛,限制了它们跨域构建全面用户表征的能力。跨域推荐系统试图通过将知识从源域迁移到目标域来克服这一限制,但大多数现有方法依赖于共享用户、共享物品或结构相似的交互图。这些假设在独立平台上往往不切实际。我们提出SPHERE(面向异构跨域推荐的语义人物画像),一种设计构件,能够在严格不相交的域之间实现推荐知识迁移,无需共享用户或物品。SPHERE不通过身份或图结构对齐域,而是使用大语言模型诱导共享行为词汇,为用户生成结构化语义人物画像,并检索行为相似的源域社区,形成社区源人物画像。该语义信号通过双塔架构和动态融合门与协同信号集成,使SPHERE能够增强标准推荐骨干。在Amazon Books、Goodreads和Steam上的实证评估表明,在全排名评估下,SPHERE在NCF、SVD++和LightGCN基线上取得了一致的改进。结果表明,跨域迁移效果不仅由域之间的语义接近度决定,还关键取决于目标域的结构密度和原生预测强度。该研究通过将跨域个性化重新定义为基于行为的语义对齐,为信息系统研究做出贡献,提供了一种在保持可解释性和模块化的同时克服信息孤岛的实用机制。

英文摘要

Digital platforms increasingly operate as isolated information silos, limiting their ability to construct comprehensive user representations across domains. Cross-domain recommender systems seek to overcome this limitation by transferring knowledge from a source domain to a target domain, yet most existing approaches depend on shared users, shared items, or structurally similar interaction graphs. These assumptions are often unrealistic across independent platforms. We propose SPHERE (Semantic Personas for Heterogeneous cross-domain Recommendation), a design artifact that enables recommendation knowledge transfer across strictly disjoint domains with no shared users or items. Rather than aligning domains through identity or graph structure, SPHERE uses large language models to induce a shared behavioral vocabulary, generate structured semantic personas for users, and retrieve behaviorally similar source-domain communities that form a Community Source Persona. This semantic signal is integrated with collaborative signals through a dual-tower architecture and dynamic fusion gate, allowing SPHERE to augment standard recommender backbones. Empirical evaluation across Amazon Books, Goodreads, and Steam demonstrates consistent improvements over NCF, SVD++, and LightGCN baselines under full-ranking evaluation. The results show that cross-domain transfer effectiveness is not determined solely by semantic proximity between domains; rather, it depends critically on the structural density and native predictive strength of the target domain. The study contributes to information systems research by reframing cross-domain personalization as behavior-based semantic alignment, offering a practical mechanism for overcoming information silos while preserving interpretability and modularity.

2606.01781 2026-06-02 cs.AI

Structure-Guided Adaptive Propagation for Protein-Protein Interaction Site Prediction

结构引导的自适应传播用于蛋白质-蛋白质相互作用位点预测

Enqiang Zhu, Yizi Liu, Yilong Luo, Yao Chen, Yu Zhang, Baoshan Ma

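AI summary: Proposes the SGAP-PPIS model, which uses multi-scale geometric states from an equivariant graph neural network to generate residue-wise propagation coefficients for adaptive information diffusion, achieving competitive performance on Test_60.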

详情
Comments
9 pages, 3 figures
AI中文摘要

准确预测蛋白质-蛋白质相互作用位点(PPIS)对于理解细胞过程、疾病机制和治疗靶点发现至关重要。基于图的深度学习通过整合残基级结构上下文推进了PPIS预测。然而,尽管蛋白质界面存在结构和功能异质性,大多数基于图的模型仍依赖固定传播方案,对所有残基一视同仁。这种传播可能限制信息扩散适应局部几何环境的能力,使得难以区分真正的相互作用位点和结构相似的非相互作用邻居。我们提出SGAP-PPIS,一种用于PPIS预测的结构引导自适应传播模型。SGAP-PPIS不使用固定传播机制,而是利用等变图神经网络的多尺度几何状态生成残基级传播系数。这种设计允许每个残基根据其几何微环境自适应地平衡局部特征保留和邻域扩散。实验结果表明,SGAP-PPIS在Test_60上达到了与最先进方法竞争的性能。消融研究表明,几何条件自适应传播、尺度对齐几何引导和多步传播状态表示共同推动了这些改进。

英文摘要

Accurate prediction of protein-protein interaction sites (PPIS) is essential for understanding cellular processes, disease mechanisms, and therapeutic target discovery. Graph-based deep learning has advanced PPIS prediction by incorporating residue-level structural context. However, most graph-based models still rely on fixed propagation schemes that treat all residues similarly, despite the structural and functional heterogeneity of protein interfaces. Such propagation may limit the ability to adapt information diffusion to local geometric environments, making it difficult to distinguish true interaction sites from structurally similar non-interacting neighbors. We present SGAP-PPIS, a structure-guided adaptive propagation model for PPIS prediction. Rather than using a fixed propagation mechanism, SGAP-PPIS leverages multi-scale geometric states from an equivariant graph neural network to generate residue-wise propagation coefficients. This design allows each residue to adaptively balance local feature preservation and neighborhood diffusion according to its geometric microenvironment. Experimental results show that SGAP-PPIS achieves performance competitive with state-of-the-art methods on Test_60. Ablation studies show that geometry-conditioned adaptive propagation, scale-aligned geometric guidance, and multi-step propagation-state representation jointly drive these improvements.

2606.01779 2026-06-02 cs.CL

HarnessForge: Joint Harness and Policy Evolution for Adaptive Agent Systems

HarnessForge:面向自适应智能体系统的协同框架与策略进化

Mingju Chen, Can Lv, Guibin Zhang, Heng Chang, Shiji Zhou

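AI summary: Proposes HarnessForge, a meta-adaptive framework that achieves full-system adaptation of LLM agent systems through harness-policy co-evolution, with significant performance gains on multiple benchmarks.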

详情
Comments
25 pages, 13 figures
AI中文摘要

LLM智能体越来越需要在需要不同执行范式的异构任务环境中运行。这对固定智能体系统提出了挑战,并推动了超越孤立组件更新的系统级元自适应。虽然现有工作已自适应外部框架或训练底层推理策略,但全系统自适应仍未被充分表征。结构与执行之间的自适应空间很少被明确化,外部框架与内部推理器之间的兼容性也未得到联合优化。我们提出HarnessForge,一个用于进化LLM智能体系统的元自适应框架。HarnessForge将智能体系统形式化为一个框架-策略对,定义了一个稳定的自适应空间,将框架级执行结构与策略级推理行为分离。然后,它通过故障引导的框架裁剪和框架条件化的策略对齐执行框架-策略协同进化。在来自不同领域的五个基准上的实验表明,HarnessForge一致地改进了Qwen3-4B和Qwen3-8B骨干网络,优于仅框架和仅策略的基线,比最强基线提升高达12.0%,并实现了有利的展开效率权衡,证明了框架-策略协同进化是有效的,并且框架与推理策略之间的可执行兼容性对于智能体系统自适应至关重要。代码可在https://github.com/mingju-c/HarnessForge获取。

英文摘要

LLM agents are increasingly expected to operate across heterogeneous task regimes that require distinct execution paradigms. This challenges fixed agent systems and motivates system-level meta-adaptation beyond isolated component updates. While existing works have adapted external harnesses or trained underlying reasoning policies, full-system adaptation remains insufficiently characterized. The adaptation space between structure and execution is rarely made explicit, and the compatibility between the external harness and the internal reasoner is not optimized jointly. We propose HarnessForge, a meta-adaptive framework for evolving LLM agent systems. HarnessForge formulates an agent system as a harness-policy pair, defining a stable adaptation space that separates harness-level execution structure from policy-level reasoning behavior. It then performs harness-policy co-evolution through fault-guided harness tailoring and harness-conditioned policy alignment. Experiments across five benchmarks from diverse domains show that HarnessForge consistently improves both Qwen3-4B and Qwen3-8B backbones, outperforming harness-only and policy-only baselines with gains of up to 12.0% over the strongest baseline and achieving favorable rollout-efficiency tradeoffs. These results demonstrate that harness-policy co-evolution is effective and that executable compatibility between the harness and the reasoning policy is essential for agent-system adaptation. The code is available at https://github.com/mingju-c/HarnessForge.

2606.01777 2026-06-02 cs.RO

Trans2Occ: Voxel Occupancy Estimation and Grasp for Transparent Objects from Simulation to Reality

Trans2Occ: 从仿真到现实的透明物体体素占用估计与抓取

Yixuan Yang, Sha Zhang, Rui Li, Zhenfei Yin, Xinzhu Ma, Yiran Qin, Lei Bai, Xudong Xu, Shilin Shan, Wangmeng Zuo, Yanyong Zhang, Wanli Ouyang, Feng Zheng, Shixiang Tang, Dongzhan Zhou

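AI summary: Proposes a voxel occupancy prediction framework based on single-view RGB input, combined with simulated data generation and a rule-based grasping strategy, for robust 3D perception and manipulation of transparent objects.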

详情
AI中文摘要

透明物体由于折射和反射导致的深度感知不可靠,对机器人感知构成挑战。先前的方法依赖多视图重建或深度补全,但往往难以在真实机器人系统中扩展或部署。本文提出一个基于单视图RGB输入的透明物体感知与操作实用框架。我们的方法直接从单张图像预测体素空间占用,提供支持下游机器人抓取的几何感知表示。为实现大规模训练,我们构建了一个仿真流水线,在不同材质和光照条件下生成配对的RGB图像和体素占用标注。我们证明预测的占用表示对领域偏移具有鲁棒性,并能从仿真有效迁移到真实机器人设置,无需微调。基于占用构建的简单规则抓取策略进一步实现了透明物体的可靠抓取性能。在仿真和真实环境中的大量实验表明,我们的框架提供了准确的3D理解,并实现了透明物体的实用操作。这些结果表明,单视图占用预测为机器人中的透明物体感知提供了一种可扩展且有效的解决方案。

英文摘要

Transparent objects remain challenging for robotic perception due to unreliable depth sensing caused by refraction and reflection. While prior approaches rely on multi-view reconstruction or depth completion, they are often difficult to scale or deploy in real-world robotic systems. In this paper, we present a practical framework for transparent object perception and manipulation based on single-view RGB input. Our approach predicts voxel-space occupancy directly from a single image, providing a geometry-aware representation that supports downstream robotic grasping. To enable large-scale training, we construct a simulation pipeline that generates paired RGB images and voxel occupancy annotations under diverse materials and lighting conditions. We demonstrate that the predicted occupancy representation is robust to domain shifts and transfers effectively from simulation to real-world robotic setups without fine-tuning. A simple rule-based grasping strategy built on top of the occupancy further achieves reliable grasp performance on transparent objects. Extensive experiments in both simulation and real-world environments show that our framework provides accurate 3D understanding and enables practical manipulation of transparent objects. These results suggest that single-view occupancy prediction offers a scalable and effective solution for transparent object perception in robotics.

2606.01774 2026-06-02 cs.LG cs.AI

FLARE: Diffusion for Hybrid Language Model

FLARE: 混合语言模型的扩散方法

Yuchen Zhu, Jing Shi, Chongjian Ge, Hao Tan, Yiran Xu, Wanrong Zhu, Jason Kuen, Koustava Goswami, Rajiv Jain, Yongxin Chen, Molei Tao, Jiuxiang Gu

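AI summary: Proposes the FLARE framework, which combines autoregressive and diffusion objectives, hardware-aware kernels, and unified inference to convert hybrid-attention LLMs into diffusion models that support parallel decoding, improving throughput while preserving capability.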

详情
AI中文摘要

自回归(AR)大型语言模型(LLM)已取得广泛的实际成功,但顺序解码仍然是低延迟部署的关键瓶颈。近期的高效推理工作沿着两个方向推进:通过高效架构降低每次模型调用的成本,以及通过并行生成减少串行解码步骤。混合注意力骨干解决了前者,而扩散语言模型(dLLM)通过迭代并行去噪追求后者。结合这些优势仍然具有挑战性:AR到dLLM的转换通常无法保留种子检查点的能力,并且混合注意力循环状态和掩码约束使得扩散训练和服务变得复杂。我们提出了FLARE,一个针对混合注意力LLM的系统转换框架。我们的分析确定迁移数据质量是能力保留的主要决定因素,其重要性超过损失公式和注意力掩码设计。最终框架结合了token等价的AR和扩散目标、硬件感知内核以及统一推理,使得一个检查点能够同时支持AR风格的验证解码和扩散风格的并行去噪。从强大的AR检查点出发,使用有限的训练后数据,FLARE在模型规模上与领先的开源dLLM竞争,并在单GPU并发服务中相比开源dLLM基线实现了持续的吞吐量提升。我们的结果进一步表明,实际dLLM不仅受限于解码算法,还受限于迁移数据质量和当前块扩散目标的训练低效性,这促使我们联合设计数据、目标、架构和推理系统。

英文摘要

Autoregressive (AR) large language models (LLMs) have achieved broad practical success, but sequential decoding remains a key bottleneck for low-latency deployment. Recent efficient-inference work has progressed along two axes: reducing the cost of each model invocation through efficient architectures, and reducing serial decoding steps through parallel generation. Hybrid attention backbones address the former, while diffusion language models (dLLMs) pursue the latter via iterative parallel denoising. Combining these advantages remains challenging: AR-to-dLLM conversion often fails to preserve seed-checkpoint capability, and hybrid-attention recurrent states and masking constraints make diffusion training and serving nontrivial. We present FLARE, a systematic conversion framework for hybrid-attention LLMs. Our analysis identifies transfer data quality as the primary determinant of capability preservation, outweighing loss formulation and attention-mask design. The resulting framework combines a token-equal AR-and-diffusion objective, hardware-aware kernels, and unified inference, enabling one checkpoint to support both AR-style verified decoding and diffusion-style parallel denoising. Starting from strong AR checkpoints with limited post-training data, FLARE is competitive with leading open-source dLLMs across model scales and delivers consistent throughput gains over open-source dLLM baselines in single-GPU concurrent serving. Our results further suggest that practical dLLMs are limited not only by decoding algorithms, but also by transfer data quality and the training inefficiency of current block-diffusion objectives, motivating joint design of data, objectives, architectures, and inference systems.

2606.01764 2026-06-02 math.OC cs.GT cs.LG

Accelerating Min-Max Optimization via Power-Law Stepsizes

通过幂律步长加速极小极大优化

Yue Wu, Weiqiang Zheng, Yang Cai, Haipeng Luo

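AI summary: Proposes a deterministic dynamic stepsize schedule that accelerates the last-iterate convergence rate of the extragradient method from Θ(T^{-1/2}) to O(T^{-2/3+ε}), and further reaches the near-optimal O(T^{-1+ε}) by using separate extrapolation and update stepsizes.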

详情
Comments
56 pages
AI中文摘要

我们重新审视了无约束双仿射极小极大优化的外梯度(EG)方法的收敛保证。已知固定步长的EG实现了$Θ(T^{-1/2})$的最后迭代收敛率,这比通过引入锚定等额外机制可达到的最优$\mathcal{O}(T^{-1})$率要慢。受最近进展(动态步长本身可以显著加速梯度下降)的启发,我们询问动态步长是否也能类似地加速EG的最后迭代收敛。我们在此方向上给出了第一个正面结果。具体地,我们提供了一个确定性动态步长调度,将EG的收敛率加速到$\mathcal{O}(T^{-2/3+\varepsilon})$,对于任意$\varepsilon > 0$。我们还证明,当EG的外推和更新步使用相同步长时,该率是紧的。然后我们表明,允许外推和更新步使用不同步长进一步将收敛率提高到近最优的$\mathcal{O}(T^{-1+\varepsilon})$。我们的分析将步长调度简化为一个优化问题,其解导致遵循幂律分布(的离散化)的步长调度。我们提出的步长调度和分析可扩展到其他方法,如乐观梯度(OG),并表明对一般极小极大优化问题的更广泛适用性。

英文摘要

We revisit the convergence guarantees of the Extragradient (EG) method for unconstrained biaffine min-max optimization. It is known that EG with a fixed stepsize achieves a $\Theta(T^{-1/2})$ last-iterate convergence rate, which is slower than the optimal $\mathcal{O}(T^{-1})$ rate attainable by incorporating additional mechanisms such as anchoring. Motivated by recent advances showing that dynamic stepsizes alone can significantly accelerate gradient descent, we ask whether dynamic stepsizes can similarly accelerate the last-iterate convergence of EG. We present the first positive result in this direction. Specifically, we provide a deterministic dynamic stepsize schedule that accelerates the convergence rate of EG to $\mathcal{O}(T^{-2/3+\varepsilon})$ for any $\varepsilon > 0$. We also show that this rate is tight when the extrapolation and update steps of EG use the same stepsize. We then show that allowing different stepsizes for the extrapolation and update steps further improves the convergence rate to the near-optimal $\mathcal{O}(T^{-1+\varepsilon})$. Our analysis reduces stepsize scheduling to an optimization problem, whose solution leads to a stepsize schedule that follows (a discretization of) a power-law distribution. Our proposed stepsize schedules and analysis extend to other methods, such as Optimistic Gradient (OG), and suggest broader applicability to general min-max optimization problems.
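For readers unfamiliar with the extragradient update, a minimal sketch on the toy bilinear problem $\min_x \max_y xy$ may help. It exposes separate extrapolation and update stepsizes as callables `alpha(t)` and `beta(t)`, but it does not implement the paper's power-law schedule, which the abstract does not specify; constant stepsizes are used purely as an assumption. All function names are hypothetical. On this two-dimensional instance EG contracts geometrically while plain gradient descent-ascent spirals outward, so the sketch illustrates the update rule only, not the worst-case rates quoted above.

```python
import math
from typing import Callable, Tuple

Point = Tuple[float, float]


def operator(z: Point) -> Point:
    """F(x, y) = (df/dx, -df/dy) for f(x, y) = x * y (min over x, max over y)."""
    x, y = z
    return (y, -x)


def extragradient(z0: Point, steps: int,
                  alpha: Callable[[int], float],
                  beta: Callable[[int], float]) -> Point:
    """Run EG with extrapolation stepsize alpha(t) and update stepsize beta(t)."""
    z = z0
    for t in range(steps):
        gx, gy = operator(z)
        # Extrapolation step: look ahead from the current iterate.
        half = (z[0] - alpha(t) * gx, z[1] - alpha(t) * gy)
        hx, hy = operator(half)
        # Update step: move the current iterate using the look-ahead operator.
        z = (z[0] - beta(t) * hx, z[1] - beta(t) * hy)
    return z


def gradient_descent_ascent(z0: Point, steps: int, eta: float) -> Point:
    """Plain simultaneous gradient descent-ascent, for comparison."""
    z = z0
    for _ in range(steps):
        gx, gy = operator(z)
        z = (z[0] - eta * gx, z[1] - eta * gy)
    return z


def norm(z: Point) -> float:
    """Distance to the saddle point (0, 0)."""
    return math.hypot(z[0], z[1])


start = (1.0, 1.0)
# Constant stepsizes are an illustrative assumption, not the paper's schedule.
eg_last = extragradient(start, 100, alpha=lambda t: 0.5, beta=lambda t: 0.5)
gda_last = gradient_descent_ascent(start, 100, eta=0.5)
```

Passing different callables for `alpha` and `beta` gives the decoupled variant mentioned in the abstract; which schedules achieve the stated rates is a result of the paper and is not reproduced here.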

2606.01757 2026-06-02 cs.CV

PillarDETR: YOLO-Backbone and RT-DETR Head for Real-Time 3D Object Detection

PillarDETR:基于YOLO骨干和RT-DETR头的实时3D目标检测

Smit Kadvani, Shriya Gumber, Kriti Faujdar, Harsh Dave

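AI summary: Proposes the PillarDETR architecture, which combines a YOLOv8 CSP backbone with an RT-DETR decoder for NMS-free, end-to-end real-time 3D object detection, achieving a good balance of accuracy and speed on KITTI and nuScenes.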

详情
Comments
6 pages, 1 figures, 8 tables
AI中文摘要

实时3D目标检测是自动驾驶系统和机器人安全运行的关键组成部分。虽然LiDAR点云提供准确的空间信息,但高效处理它们仍然是一个重大挑战。传统方法依赖于复杂的3D卷积或基于锚点的范式,难以平衡检测精度与推理速度。在本文中,我们提出PillarDETR,一种新颖的端到端3D目标检测架构,它将基于柱体的LiDAR编码的效率与现代2D视觉模型的表示能力相结合。具体来说,PillarDETR用源自YOLOv8的跨阶段局部(CSP)网络替代标准卷积骨干,从而能够从伪图像中提取更丰富的特征。此外,我们摒弃了传统的基于锚点或基于中心的检测头,转而采用实时检测Transformer(RT-DETR)解码器。这种混合设计使网络能够捕获全局上下文并直接预测3D边界框,而无需依赖非极大值抑制(NMS)。在KITTI和nuScenes基准上的大量实验表明,PillarDETR在平均精度(mAP)和推理延迟之间实现了令人信服的权衡。我们的消融研究证实,集成YOLOv8骨干和RT-DETR头相比PointPillars基线带来了显著改进,使PillarDETR成为实时3D感知的高效解决方案。

英文摘要

Real-time 3D object detection is a critical component for the safe operation of autonomous driving systems and robotics. While LiDAR point clouds provide accurate spatial information, processing them efficiently remains a significant challenge. Traditional methods rely on complex 3D convolutions or anchor-based paradigms that struggle to balance detection accuracy with inference speed. In this paper, we propose PillarDETR, a novel end-to-end 3D object detection architecture that combines the efficiency of pillar-based LiDAR encoding with the representational power of modern 2D vision models. Specifically, PillarDETR replaces standard convolutional backbones with a Cross Stage Partial (CSP) network derived from YOLOv8, enabling richer feature extraction from pseudoimages. Furthermore, we discard conventional anchor-based or center-based detection heads in favor of a Real-Time Detection Transformer (RT-DETR) decoder. This hybrid design allows the network to capture global context and directly predict 3D bounding boxes without relying on non-maximum suppression (NMS). Extensive experiments on the KITTI and nuScenes benchmarks demonstrate that PillarDETR achieves a compelling trade-off between mean Average Precision (mAP) and inference latency. Our ablation studies confirm that integrating the YOLOv8 backbone and RT-DETR head yields substantial improvements over the PointPillars baseline, establishing PillarDETR as a highly effective solution for real-time 3D perception.

2606.01756 2026-06-02 cs.CV

EvoCut: Multi-Layer Evolution-Aware Visual Token Compression for Efficient Large Vision-Language Models

EvoCut:面向高效大型视觉语言模型的多层演化感知视觉标记压缩

Hongyu Lu, Feng Zhang, Wenwei Jin, Huanling Hu, Pengfei Zhang, Yao Hu, Jiawei Li, Shikai Jiang

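AI summary: Proposes EvoCut, a training-free and attention-free visual token compression method that estimates token importance by analyzing multi-layer evolution deviation, preserving 94.4% of the average performance on LLaVA-1.5-7B while retaining only 11.1% of the visual tokens.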

详情
Comments
Preprint. 12 pages, 6 figures, 7 tables
AI中文摘要

大型视觉语言模型(LVLMs)在图像和视频理解任务上取得了强大性能,但其推理效率受到视觉编码器产生的大量视觉标记的限制。现有大多数视觉标记压缩方法从特定层的注意力分数或表示属性估计标记重要性,忽略了视觉标记在视觉编码器中的演化过程。这种逐层标准可能提供不完整的重要性估计,并限制压缩后的性能保持。为解决此问题,我们分析了逐层视觉标记演化方向,并观察到标记在视觉编码器各层形成多个组演化方向。进一步分析表明,信息性标记往往表现出与共同组演化方向的持续偏离。基于这一观察,我们提出了EvoCut,一种无需训练和注意力的视觉标记压缩方法,通过多层演化偏差估计标记重要性。实验结果表明,EvoCut在LLaVA-1.5-7B上仅保留11.1%的视觉标记即可保持94.4%的平均性能,展示了其在平衡效率和准确性方面的有效性。

英文摘要

Large vision-language models (LVLMs) achieve strong performance on image and video understanding tasks, but their inference efficiency is constrained by the large number of visual tokens produced by vision encoders. Most existing visual token compression methods estimate token importance from attention scores or representation properties at specific layers, overlooking how visual tokens evolve across the vision encoder. Such layer-specific criteria may provide incomplete importance estimates and limit performance preservation after compression. To address this issue, we analyze layer-wise visual token evolution directions and observe that tokens form multiple group evolution directions across vision-encoder layers. Our analysis further shows that informative tokens tend to exhibit persistent deviations from common group evolution directions. Based on this observation, we propose EvoCut, a training-free and attention-free visual token compression method that estimates token importance from multi-layer evolution deviation. Experimental results show that EvoCut can retain only 11.1% of the visual tokens on LLaVA-1.5-7B while preserving 94.4% of the average performance, demonstrating its effectiveness in balancing efficiency and accuracy.
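To give a sense of what an 11.1% retention budget means in practice, the sketch below keeps the highest-scoring tokens under a fixed ratio. It assumes 576 visual tokens per image (the 24 x 24 patch grid commonly associated with LLaVA-1.5 at 336 px, a figure not stated in the abstract), in which case 11.1% corresponds to 64 tokens. The scores are placeholders: EvoCut's multi-layer evolution-deviation scoring is not reproduced here, and `keep_top_tokens` is a hypothetical helper.

```python
from typing import List


def keep_top_tokens(scores: List[float], keep_ratio: float) -> List[int]:
    """Return indices of the highest-scoring tokens, in their original order."""
    budget = max(1, round(len(scores) * keep_ratio))
    # Rank token indices by descending importance score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the kept tokens stay a subsequence of the input.
    return sorted(ranked[:budget])


num_visual_tokens = 576  # assumption: 24 x 24 patches, not given in the abstract
# Placeholder importance scores; a real method would compute these per token.
scores = [float((i * 37) % 101) for i in range(num_visual_tokens)]
kept = keep_top_tokens(scores, keep_ratio=0.111)
```

Any per-token importance estimate can be plugged in as `scores`; the selection step itself is independent of how the scores are obtained.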

2606.01755 2026-06-02 cs.AI cs.CL

TriAlign: Towards Universal Truth Consistency in Personalized LLM Alignment

TriAlign: 迈向个性化大语言模型对齐中的通用真值一致性

Thi-Nhung Nguyen, Linhao Luo, Rollin Omari, Junae Kim, Thuy-Trang Vu, Dinh Phung

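AI summary: To address universal truth inconsistencies of personalized large language models across social groups, proposes the TriAlign framework, which uses offline multi-agent reinforcement learning to jointly optimize truth accuracy, cross-group consistency, and personalization for fair alignment.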

详情
AI中文摘要

个性化大语言模型根据用户的偏好和社会属性调整响应,但可能在不同社会群体间引入显著的通用真值不一致性,即某些群体在客观任务上系统性地获得较不准确的响应。现有的对齐方法要么忽略个性化,要么主要关注主观偏好对齐,很大程度上忽视了通用真值的公平性和一致性。为填补这一空白,我们研究了真值不变对齐(TIA),这是一个针对个性化LLM的对齐问题,旨在确保通用真值在不同社会群体间保持一致,同时保留个性化。我们提出TriAlign,这是首个用于TIA的离线多智能体强化学习(MARL)框架,其中每个社会群体被建模为一个交互的智能体。TriAlign通过一个公平感知目标和一个显式的不一致性惩罚,联合优化通用真值准确性、跨群体真值一致性和个性化。跨多个基准的实验表明,TriAlign在这三个目标之间实现了比强基线更强的平衡,减少了跨社会群体的通用真值差异,同时提高了客观任务性能和个性化质量。

英文摘要

Personalized large language models adapt responses to users' preferences and social attributes, but can introduce substantial universal truth inconsistencies across social groups, where some groups systematically receive less accurate responses on objective tasks. Existing alignment methods either ignore personalization or mainly focus on subjective preference alignment, largely overlooking fairness and consistency in universal truths. To address this gap, we study Truth-Invariant Alignment (TIA), an alignment problem for personalized LLMs that aims to ensure universal truths remain consistent across social groups while preserving personalization. We propose TriAlign, the first offline multi-agent reinforcement learning (MARL) framework for TIA, where each social group is modeled as an interacting agent. TriAlign jointly optimizes universal truth accuracy, cross-group truth consistency, and personalization through a fairness-aware objective and an explicit inconsistency penalty. Experiments across diverse benchmarks demonstrate that TriAlign achieves a stronger balance among these three objectives than strong baselines, reducing universal truth disparities across social groups while improving both objective task performance and personalization quality.