arXivDaily: daily arXiv digest, updated Monday through Friday
2606.24888 2026-06-24 cs.CV New submission

DiffusionBench: On Holistic Evaluation of Diffusion Transformers

Xingjian Leng, Jaskirat Singh, Zhanhao Liang, Ethan Smith, Martin Bell, Aninda Saha, Yuhui Yuan, Liang Zheng

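Affiliations: Australian National University; Canva Research

AI summary: Proposes the unified NanoGen framework. After training 21 latent diffusion models, the authors find no strong correlation between method rankings on ImageNet and on text-to-image generation, so they build the DiffusionBench benchmark and recommend evaluating on both tasks.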

Abstract

Diffusion transformer (DiT) research on image generation has converged to a single evaluation setup: class-conditional generation on ImageNet. While methods improve the FID and related metrics, it is increasingly unclear whether they reflect real progress in generative modeling. The natural alternative, i.e., text-to-image (T2I) generation, is perceived as too costly or inconvenient to train and evaluate and is often skipped. We argue that this perception no longer holds. We introduce NanoGen, a unified DiT training and evaluation framework. NanoGen matches state-of-the-art DiT baselines on ImageNet and, with 12 lines of configuration change, also trains competitive text-to-image models. It currently supports RAE, VAE, pixel-space, and MeanFlow diffusion methods under both ImageNet and T2I setups. Under NanoGen, training T2I requires comparable compute to ImageNet. After training 21 latent diffusion models with NanoGen, we observe that method ranking shows no strong correlation between ImageNet and T2I generation: Pearson correlation is between -0.377 and -0.580 across three metrics. This suggests that a method which improves class-conditional ImageNet FID may show no corresponding improvement on T2I, clearly indicating the necessity of evaluating DiTs on both tasks. To this end, we summarize ImageNet and text-to-image results, which yields DiffusionBench, a holistic benchmark for DiT research. We recommend reporting DiffusionBench in place of ImageNet alone: methods that improve DiffusionBench are more likely to reflect broader progress.
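
The ranking comparison in this abstract rests on a Pearson correlation between per-method scores on the two tasks. A minimal sketch of that computation follows. The function name `pearson` and both score lists are illustrative placeholders, not values or code from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for five methods (placeholders, not from the paper).
imagenet_scores = [2.1, 2.4, 1.9, 3.0, 2.7]
t2i_scores = [0.61, 0.58, 0.55, 0.60, 0.63]
r = pearson(imagenet_scores, t2i_scores)
print(f"Pearson r across methods: {r:.3f}")
```

A value near +1 would mean that the two tasks rank methods the same way; values near zero or below would mean that an ImageNet gain says little about the text-to-image result.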

2606.24884 2026-06-24 cs.RO cs.AI cs.LG New submission

InSight: Self-Guided Skill Acquisition via Steerable VLAs

Maggie Wang, Lars Osterberg, Stephen Tian, Ola Shorinwa, Jiajun Wu, Mac Schwager

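Affiliations: Stanford University; Princeton University

AI summary: Proposes the InSight framework, which enables autonomous skill acquisition by making VLA models steerable at the primitive-action level. It comprises automatic segmentation of demonstrations into primitives and a VLM-guided data flywheel, and learns new skills without human demonstrations.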

Comments Project website: https://insight-vla.github.io

Abstract

Vision-language-action (VLA) models can learn manipulation skills from demonstrations, but their capabilities are bounded by the skills in the training data. We present InSight, a framework that unlocks autonomous skill acquisition by rendering VLAs steerable at the primitive-action level (e.g., "move gripper to the bowl", "lift upward", "pour the bottle"). InSight consists of two primary stages: (1) an automated segmentation pipeline that partitions demonstrations into labeled primitives via VLM plan decomposition and end-effector poses to enable VLA primitive steerability, and (2) a VLM-guided data flywheel that identifies missing primitives required to accomplish a novel task, autonomously attempts demonstrations of the missing primitives with VLM-proposed low-level control, and automatically labels, stores, and integrates successful demonstrations into the VLA training set. We evaluate InSight across simulation and real-world manipulation tasks, including block flipping, drawer closing, sweeping, twisting, and pouring, without any human demonstrations of these target skills. Once learned, these primitives can be composed to execute novel, long-horizon tasks without additional human demonstrations. Our findings demonstrate that primitive steerability provides a practical foundation for continual skill acquisition in VLA policies. Project website: https://insight-vla.github.io.

2606.24883 2026-06-24 cs.CV New submission

BenchX: Benchmarking AI Models for Cancer Detection and Localization with Demographic and Protocol Biases

Qi Chen, Wenxuan Li, Pedro R. A. S. Bassi, Xinze Zhou, Jakob Wasserthal, Ibrahim Ethem Hamamci, Sezgin Er, Ashwin Kumar, Yiwen Ye, Yuhan Wang, Yuyin Zhou, Akshay S. Chaudhari, Curtis Langlotz, Kang Wang, Yang Yang, Alan L. Yuille, Zongwei Zhou

Affiliations: Johns Hopkins University; German Cancer Research Center; University Hospital Basel; University of Zurich; ETH AI Center; Istanbul Medipol University; Stanford University; École Polytechnique Fédérale de Lausanne; Nanyang Technological University; University of California, Santa Cruz; University of California, San Francisco; Johns Hopkins Medicine

AI summary: Proposes the BenchX benchmark, which uses 85,355 CT scans to systematically evaluate 12 tumor-detection AI models across tumor size, location, patient subgroup, and imaging protocol, and reveals poor model performance on rare subgroups.

Abstract

Artificial intelligence (AI) has achieved remarkable success in medical imaging, but it is widely recognized that these models often perform inconsistently across real-world clinical settings. Such inconsistencies occur when patient demographics and imaging protocols vary, for example, in detecting small tumors, analyzing scans from different contrast phases, or evaluating patients of different ages or sexes. To quantify these inconsistencies, we develop a large-scale, open benchmark of 85,355 CT scans that systematically evaluates 12 tumor-detection AI models across tumor size, location, patient subgroup, and imaging protocol. We leverage large language models (LLMs) to extract and organize subgroup information from clinical data, which makes the analysis both scalable and reproducible. Our benchmark reveals that current state-of-the-art AI models, optimized for average accuracy, perform poorly in rare or underrepresented subgroups, such as young, female African Americans. However, collecting sufficient annotated data for these rare cases is often impractical. The benchmark provides a foundation for building more reliable and robust AI models for tumor detection and highlights the need for rigorous, subgroup-level evaluation in medical imaging and computer vision. Datasets, code

2606.24876 2026-06-24 cs.CV New submission

FLAT: Feedforward Latent Triangle Splatting for Geometrically Accurate Scene Generation

Orest Kupyn, Goutam Bhat, Philipp Henzler, Fabian Manhardt, Christian Rupprecht, Federico Tombari

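Affiliations: Google Research; University of Oxford, Visual Geometry Group; TU Munich

AI summary: Proposes FLAT, the first method to decode triangle splats directly from video diffusion latents. A ray-centered rotation parameterization and a product window function address the gradient-flow problem, substantially improving geometric accuracy while preserving visual quality.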

Abstract

Generating explorable 3D scenes from a single image requires strong generative priors and accurate geometric representations suitable for downstream use. Current video diffusion models offer high-quality generation and implicitly encode multi-view geometric structure in latent space. However, existing feedforward latent scene decoders typically output volumetric 3D Gaussians that lack a well-defined surface, limiting their use in simulation or standard graphics pipelines. This motivates decoding surface-aligned primitives that are not only renderable but also closer to explicit geometric assets. We ask whether compressed video diffusion latents can be mapped directly to explicit surface primitives in a single pass. To this end, we introduce FLAT and, for the first time, show that triangle splats can be decoded directly from video diffusion latents. Compared with decoding 3D Gaussians, predicting flat primitives is notoriously more challenging due to high sensitivity to primitive orientations, which often leads to poor gradient flow. FLAT addresses this with two key ingredients: a ray-centered rotation parameterization for triangle regression and a novel product window function that improves gradient flow during differentiable triangle rendering. On standard benchmarks, FLAT achieves significantly better geometric accuracy while maintaining competitive visual quality compared to state-of-the-art feedforward baselines. We further show that a lightweight test-time refinement step converts the predicted triangle soup into a fully opaque, game-engine-ready representation that supports real-time rendering. By evaluating 3DGS, 2DGS, and triangle splatting variants under an identical training setup, we provide the first systematic analysis of representation tradeoffs in feedforward scene generation. The project page is available at https://flat-splat.github.io

2606.24874 2026-06-24 cs.CV cs.AI New submission

FLUX3D: High-Fidelity 3D Gaussian Generation with Diffusion-Aligned Sparse Representation

Haorui Ji, Weizhe Liu, Hongdong Li, Hengkai Guo

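Affiliations: The Australian National University; ByteDance

AI summary: Proposes the FLUX3D framework, which uses Diffusion-Aligned Structured Latents (DA-SLAT) and a Sparse-structure Multimodal Diffusion Transformer (SMDiT) to address the loss of high-frequency detail and the cross-modal alignment problem in sparse voxel representations, achieving high-fidelity image-to-3D Gaussian generation.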

Abstract

Sparse voxel representation has emerged as a scalable foundation for image-to-3D Gaussian Splatting (3DGS) generation, yet current methods struggle to preserve high-frequency visual details of input images due to two structural bottlenecks. First, they adopt discriminative 2D features optimized for semantic abstraction to construct sparse voxel latents, which suppress reconstructive cues and induce a representation bottleneck. Second, in the generation stage, standard diffusion transformers lack effective mechanisms to align dense 2D image tokens with sparse 3D voxel latents, resulting in a cross-modal correspondence bottleneck. To address these issues, we propose FLUX3D, a scalable image-to-3DGS framework that boosts both representation learning and cross-modal alignment during generation. We first revisit 2D feature selection for sparse-voxel-based 3D representation learning, propose Diffusion-Aligned Structured Latents (DA-SLAT) and couple it with a decoder-only architecture to improve 3DGS reconstruction fidelity. We also design a sparse-structure-aware diffusion framework, which integrates the Sparse-structure Multimodal Diffusion Transformer (SMDiT) and Modal-Aware Rotary Positional Embedding (MARoPE) to achieve geometry-agnostic 2D-3D alignment. Extensive benchmark experiments demonstrate that FLUX3D yields substantial improvements in appearance fidelity and significantly outperforms all state-of-the-art (SOTA) methods in generating high-quality 3DGS assets.

2606.24855 2026-06-24 cs.AI New submission

OpenThoughts-Agent: Data Recipes for Agentic Models

Negin Raoof, Richard Zhuang, Marianna Nezhurina, Etash Guha, Atula Tejaswi, Ryan Marten, Charlie F. Ruan, Tyler Griggs, Alexander Glenn Shaw, Hritik Bansal, E. Kelly Buchanan, Artem Gazizov, Reinhard Heckel, Chinmay Hegde, Sankalp Jajee, Daanish Khazi, Emmanouil Koukoumidis, Xiangyi Li, Hange Liu, Shlok Natarajan, Harsh Raj, Nicholas Roberts, Ethan Shen, Nishad Singhi, Michael Siu, Ashima Suvarna, Hanwen Xing, Patrick Yubeaton, Robert Zhang, Leon Liangyu Chen, Xiaokun Chen, Steven Dillmann, Saadia Gabriel, Xunyi Jiang, Anurag Kashyap, Boxuan Li, Yein Park, Minh Pham, Sujay Sanghavi, Lin Shi, Ke Sun, Yixin Wang, Zhiwei Xu, Erica Zhang, Siyan Zhao, Wanjia Zhao, Jenia Jitsev, Alex Dimakis, Benjamin Feuer, Ludwig Schmidt

Affiliations: UC Berkeley; Stanford University; JSC (Jülich Supercomputing Centre); LAION; University of Texas at Austin; Bespoke Labs; Laude Institute; UCLA; Harvard University & Harvard Medical School; TU Munich & Munich Center for Machine Learning; New York University; Medical University of South Carolina; The LLM Data Company; BenchFlow; Independent Researcher; Northeastern University; University of Wisconsin–Madison; University of Washington; TU Darmstadt; University of Southern California; UC San Diego; Amazon; Microsoft; Korea University; Cornell Tech; University of Michigan

AI summary: Presents a fully open data curation pipeline, studies task sources and diversity through more than 100 ablation experiments, and builds a 100K-example training set that reaches 44.8% average accuracy across 7 agentic benchmarks, 3.9 percentage points above the strongest existing open-data agentic model.

Abstract

Agentic language models dramatically expand the applications of AI, yet little is publicly known about how to curate training data for broadly capable agents. Existing open efforts such as SWE-Smith, SERA, and Nemotron-Terminal typically target a single benchmark, leaving open the question of how to train models that generalize across diverse agentic tasks. The OpenThoughts-Agent (OT-Agent) project addresses this gap with a fully open data curation pipeline for training agentic models. We conduct more than 100 controlled ablation experiments to systematically investigate each stage of the pipeline, yielding insights on the importance of task sources and diversity. We then assemble a training set of 100K examples from our pipeline and fine-tune Qwen3-32B on this dataset, which yields an average accuracy of 44.8% across seven agentic benchmarks and a 3.9 percentage point improvement over the strongest existing open-data agentic model (Nemotron-Terminal-32B, 40.9%). Moreover, our training data exhibits strong scaling properties, outperforming alternative open datasets at every training set size in compute-controlled comparisons. We publicly release our training sets, data pipeline, experimental data, and models at openthoughts.ai to support future open research on agentic model training.

2606.24849 2026-06-24 cs.CV cs.AI New submission

IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation

Zixuan Li, Haokun Lin, Yicheng Xiao, Zhiwei Li, Xinyang Song, Zelong Zheng, Yong He, Heng Yao, Ke Ding, Chao Yu, Chuan Yuan, Qi Li, Zhenan Sun

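Affiliations: NLPR, Institute of Automation, Chinese Academy of Sciences; Ant Group; The University of Hong Kong

AI summary: Proposes the Implicit Visual Chain-of-Thought (IV-CoT) framework, which improves the structure awareness of text-to-image generation through a structural-to-semantic cascade decomposition and training-time sketch supervision, without adding inference overhead.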

Abstract

Unified multi-modal large language models (MLLMs) have achieved strong text-to-image generation quality, but still struggle with structure-aware prompt following, where object counts, spatial relations, attribute bindings, and coarse layouts must be preserved. We attribute this limitation in part to the entanglement of structural planning and appearance rendering within a single conditioning stream. To address this issue, we propose Implicit Visual Chain-of-Thought (IV-CoT), a latent visual reasoning framework for query-conditioned image generation. IV-CoT decomposes the visual conditioning queries into a structural-to-semantic cascade, where structural queries first form a latent visual plan and semantic queries then render appearance conditioned on this plan. To guide the structural queries, we introduce training-only sketch supervision, which encourages them to capture structure from sketches without requiring sketch extraction or intermediate decoding at inference time. IV-CoT performs implicit CoT reasoning in a single forward pass and achieves superior results on GenEval and T2I-CompBench. Visualizations and analyses demonstrate that the learned structural and semantic queries play complementary roles in structure-aware generation.

2606.24844 2026-06-24 cs.CV New submission

Bridging the Manifold Gap: Riemannian Residual Line Search for One-Step Image Editing

Hongzhu Yi, Zhongtian Luo, Tong Li, Yiyan Fan, Jungang Xu

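Affiliations: UCAS (University of Chinese Academy of Sciences); WashU (Washington University); SHU (Shanghai University)

AI summary: Addresses the problem that a fixed update strength in one-step diffusion editing cannot balance fidelity to the target prompt against fidelity to the source image. Proposes Riemannian Residual Line Search, which uses local time-curvature estimation and residual-path selection to reach state-of-the-art results among one-step methods on PIE-Bench++.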

Abstract

One-step diffusion editors are fast because they avoid inversion and iterative optimization, but a single transport update must be aggressive enough to realize the target prompt and conservative enough to preserve the source image--and no fixed update strength satisfies both demands across edit types. We treat this tension as a post-hoc candidate-selection problem on top of energy-field transport rather than as a new editing model. Our proposed method, Riemannian Residual Line Search, first builds a stronger edit by estimating the local time curvature of the prompt-delta field and projecting the corrected direction back onto the update norm of the original first-order energy-field transport estimation. It then forms a small residual path from the source image to this strong edit, retains the original first-order output as one candidate, and picks the final image by maximizing target-prompt CLIP alignment. On a 700-sample PIE-Bench++ evaluation across 10 edit type IDs, our method achieves state-of-the-art (SOTA) performance among current one-step update algorithms.
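
The selection step described in this abstract can be sketched on toy vectors. This is only an illustration under stated assumptions: "projecting back onto the update norm" is read here as rescaling the corrected direction to the length of the original update, the images are 2-D lists, and `toy_score` stands in for target-prompt CLIP alignment. None of the names below come from the paper.

```python
import math

def rescale_to_norm(direction, target_norm):
    """Rescale `direction` so that its Euclidean norm equals `target_norm`."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        return list(direction)  # a zero vector has no direction to rescale
    return [d * target_norm / norm for d in direction]

def residual_line_search(source, first_order_edit, strong_edit, score,
                         alphas=(0.25, 0.5, 0.75, 1.0)):
    """Return the best-scoring candidate among the first-order edit and
    points on the straight path from `source` to `strong_edit`."""
    candidates = [list(first_order_edit)]  # keep the original output as a fallback
    for alpha in alphas:
        candidates.append([s + alpha * (e - s) for s, e in zip(source, strong_edit)])
    return max(candidates, key=score)

# Toy 2-D stand-ins for images (hypothetical values).
source = [0.0, 0.0]
first_order_update = [1.0, 0.0]
corrected_direction = [1.0, 1.0]
update_norm = math.sqrt(sum(u * u for u in first_order_update))
step = rescale_to_norm(corrected_direction, update_norm)
first_order_edit = [s + u for s, u in zip(source, first_order_update)]
strong_edit = [s + d for s, d in zip(source, step)]

target = [0.5, 0.5]  # stand-in for what the target prompt asks for

def toy_score(candidate):
    # Placeholder for target-prompt CLIP alignment: higher is better.
    return -math.dist(candidate, target)

best = residual_line_search(source, first_order_edit, strong_edit, toy_score)
print(best)
```

Because the first-order output stays in the candidate set, the selection can never score worse than the unmodified editor under the chosen scoring function.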

2606.24839 2026-06-24 cs.AI stat.AP New submission

Grading the Grader: Lessons from Evaluating an Agentic Data Analysis System

Tian Zheng, Kai-Tai Hsu

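Affiliations: Columbia University

AI summary: Because agentic data analysis systems produce complex outputs that are hard to evaluate, the authors propose a three-layer human-AI grading cascade (strict regex matching, LLM-based lenient grading, and snippet-based human inspection). On 153 tasks, both automated graders reach 100% precision, the lenient grader reaches 97% recall, and an iterative nudge mechanism raises grading run success from 36% to 97%.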

Abstract

Agentic data analysis systems produce rich outputs, including code, numerical results, and verbal diagnostics. This makes them more challenging to evaluate than single-turn LLM responses. It is therefore necessary to distinguish genuine disagreement between an agent's output and a ground-truth answer from grading artifacts. We investigate how reliably automated graders assess such a system and what strategies improve grading quality by applying LAMBDA, a multi-agent data-analysis system, on 153 numerical QRData tasks from DSGym. We develop and evaluate a three-layer human-AI grading cascade: strict regex matching, LLM-based lenient grading, and snippet-based human inspection, which combines non-GenAI and GenAI strategies with different failure profiles. Both automated graders achieve 100% observed precision (0/70 false positives). The lenient grader's recall is 97% against human labels. A keyword-anchored extraction pipeline raises the strict grader's recall by 60 percentage points over a last-number heuristic; the lenient grader is architecturally parser-independent. An iterative nudge mechanism raises grading run success from 36% to 97% and lenient-pass rates from 16% to 46%; comparing nudging with and without original-question re-injection shows that re-injection offers no benefit, confirming the nudge as an answer template cue. We further observe in this case study that variable type is the task metadata field most consistently associated with grading pipeline dynamics and observed outcome grades.
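
The gap between a last-number heuristic and keyword-anchored extraction can be illustrated with a small sketch. The regular expression, keyword list, tolerance, and sample output are assumptions chosen for illustration; the abstract does not describe the paper's actual extraction pipeline.

```python
import re

# Integers, decimals, and simple scientific notation.
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")

def last_number(text):
    """Baseline heuristic: take the last number that appears in the output."""
    matches = NUMBER.findall(text)
    return float(matches[-1]) if matches else None

def keyword_anchored_number(text, keywords=("final answer", "answer")):
    """Take the first number that follows an answer keyword, if any."""
    for keyword in keywords:
        anchor = re.search(re.escape(keyword), text, flags=re.IGNORECASE)
        if anchor:
            match = NUMBER.search(text, anchor.end())
            if match:
                return float(match.group())
    return None

def strict_grade(text, truth, extractor, rel_tol=1e-6):
    """Pass only if the extracted number matches the ground truth."""
    value = extractor(text)
    if value is None:
        return False
    return abs(value - truth) <= rel_tol * max(1.0, abs(truth))

# Hypothetical agent output: the answer is followed by a diagnostic number.
output = "Fitted on 153 rows. Final answer: 2.5 (standard error 0.4)."
print(strict_grade(output, 2.5, last_number))
print(strict_grade(output, 2.5, keyword_anchored_number))
```

In this example the last-number heuristic picks up the trailing standard error and rejects a correct answer, which is the kind of grading artifact that lowers a strict grader's recall.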

2606.24834 2026-06-24 cs.AI New submission

Accuracy and Satisfaction in Multi-Turn LLM Dialogues for NFR Assessment

Ali Pourghasemi Fatideh, Wilder Baldwin, Maria Dhakal, Collin McMillan, Sepideh Ghanavati

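Affiliations: University of Maine; University of Notre Dame

AI summary: Studies the accuracy and satisfaction of developers who assess non-functional requirements (NFRs) with an LLM dialogue assistant. Developers tend to agree with the LLM's assessments, yet accuracy is low; long responses and many information-providing turns lower satisfaction, while proactive interaction raises it.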

Comments 9 pages, 5 figures. Accepted to SIGDIAL 2026 (27th Annual Meeting of the Special Interest Group on Discourse and Dialogue)

Abstract

LLM-based dialogue assistants have become mainstream tools for software developers, yet current evaluation benchmarks focus exclusively on functional correctness. This leaves a critical gap in assessing the quality and accuracy of these conversations when handling Non-Functional Requirements (NFRs), which are inherently vague, context-dependent, and involve many parts of a program. Evaluating how well these systems support collaborative reasoning about NFRs requires methods that go beyond single-turn accuracy to capture both the correctness of the system's outputs and the quality of the multi-turn interaction. In this paper, we investigate the accuracy and quality of multi-turn conversations between developers and an LLM-based agent in the domain of Health Insurance Portability and Accountability Act (HIPAA) regulatory compliance. We hired 49 programmers to interact with GitHub Copilot to assess 148 HIPAA-derived NFRs against the iTrust codebase, a system designed to comply with HIPAA regulations, across three dimensions: requirement satisfaction level, reasoning, and code localization. We find that developers tend to agree with LLM assessments, but accuracy against expert ground truth is low. We model user satisfaction and find that longer system responses and more information-providing turns negatively affect user satisfaction, whereas proactive interactions positively affect it. Our findings provide insights for designing LLM-based dialogue systems that support NFR assessment.

2606.24825 2026-06-24 cs.CL cs.LG New submission

L3Cube-MahaPOS: A Marathi Part-of-Speech Tagging Dataset and BERT Models

Hariom Ingle, Ronit Ghode, Ishwari Gondkar, Jidnyasa Harad, Raviraj Joshi

Affiliations: Department of Information Technology, PICT, Pune, India; Indian Institute of Technology Madras, Chennai, India; L3 Cube Labs, Pune, India

AI summary: For the low-resource Marathi language, builds MahaPOS, a POS tagging dataset of 32,354 manually annotated sentences based on a 16-tag UD-aligned scheme, and evaluates HMM, CRF, BiLSTM, MuRIL, and other models; the best system reaches 88.67% accuracy.

Abstract

Part-of-Speech (POS) tagging is a foundational NLP task underpinning machine translation, information extraction, and syntactic parsing. Despite Marathi being spoken by over 83 million people and ranking among the top twenty most spoken languages worldwide, it remains severely under-resourced in annotated corpora and standardised evaluation benchmarks. Marathi presents unique challenges for computational modelling owing to its rich morphology, relatively free word order, lack of capitalisation conventions, and pervasive code-mixing with Hindi and English. We introduce L3Cube-MahaPOS, a gold-standard POS tagging dataset for Marathi comprising 32,354 manually annotated sentences drawn from news text. Annotation was performed entirely manually by a team of Marathi-proficient annotators following a 16-tag Universal Dependencies-aligned scheme. A structured preprocessing pipeline covering Unicode normalisation, Devanagari-aware tokenisation, and noise filtering ensures label consistency across all splits. We benchmark the dataset across six model families spanning HMM, CRF, BiLSTM, BiLSTM+CharCNN, MuRIL, and the Marathi-specific transformer MahaBERT-v2. The best system achieves 88.67% token-level accuracy and a macro-F1 of 81.67% over 15 evaluated tag classes. We release the dataset, annotation guidelines, and trained model checkpoints to foster further research in Marathi NLP.
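
Token-level accuracy and macro-F1, the two metrics reported in this abstract, answer different questions, and a short sketch makes the difference concrete. The tag sequences below are invented for illustration and are not drawn from the dataset; the function names are likewise hypothetical.

```python
def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of per-tag F1 over the tags present in `gold`."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("need two non-empty sequences of equal length")
    scores = []
    for tag in sorted(set(gold)):
        tp = sum(g == tag and p == tag for g, p in zip(gold, pred))
        fp = sum(g != tag and p == tag for g, p in zip(gold, pred))
        fn = sum(g == tag and p != tag for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Invented example: the frequent tag is easy, the rare tag is missed.
gold = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"]
pred = ["NOUN", "VERB", "NOUN", "NOUN", "NOUN", "ADJ"]
print(f"accuracy = {token_accuracy(gold, pred):.3f}")
print(f"macro-F1 = {macro_f1(gold, pred):.3f}")
```

Because macro-F1 weights every tag equally, errors on rare tags pull it down more than they pull down accuracy, which is one plausible reason a macro-F1 can sit below the accuracy figure for the same system.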

2606.24824 2026-06-24 cs.AI New submission

Solving Inverse Problems of Chaotic Systems with Bidirectional Conditional Flow Matching

Peiyan Hu, Jian Zhang, Jiashu Pan, Ruiqi Feng, Tao Zhang, Zhi-Ming Ma, Yuan-Sen Ting, Gongjie Li, Tailin Wu

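Affiliations: Westlake University; Georgia Institute of Technology; Chinese Academy of Sciences; Max-Planck-Institut für Astronomie; The Ohio State University

AI summary: Proposes Bidirectional Conditional Flow Matching (Bi-CFM), which learns bidirectional mappings between the distributions of initial and final states to address ill-posedness, non-uniqueness, and exponential error accumulation in inverse problems of chaotic systems. On several classic systems it markedly improves distribution-level metrics and runs more than two orders of magnitude faster.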

Comments 50 pages, 17 figures

Abstract

Modeling chaotic systems is crucial yet challenging. Inverse problems in chaotic dynamics, namely inferring initial conditions from final states, remain largely unsolved because of ill-posedness, non-uniqueness, instability, and potentially chaotic time-reverse dynamics. We address this open problem with Bidirectional Conditional Flow Matching (Bi-CFM), which learns bidirectional mappings between distributions of initial and final states to capture the stochasticity of chaotic evolution and mitigate exponential error accumulation over time. Furthermore, for systems with conservation laws, we extend it to Conservation-constrained Bi-CFM (CBi-CFM). Across the classic Lorenz, Circuit, and high-dimensional Lorenz 96 systems, Bi-CFM improves five distribution-level metrics over baselines while achieving a speedup of more than two orders of magnitude. In the three-body planet-planet scattering problem in planetary dynamics, CBi-CFM better respects conservation laws, with conservation errors comparable to those of the ground truth. Finally, on real observations of globular clusters, collisional million-body systems shaped by $\sim 10^{10}$ years (10 Gyr) of evolution, our method represents an advance in accuracy, establishing a scalable route to solving inverse problems of long-timescale real-world chaotic dynamics.
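
The exponential error growth that makes this inverse problem hard can be seen directly on the Lorenz system. The sketch below is not the paper's method; it only integrates two nearly identical initial states with a fourth-order Runge-Kutta scheme, using the standard parameters sigma = 10, rho = 28, beta = 8/3. The step size, horizon, and initial states are arbitrary choices made for illustration.

```python
import math

def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz equations."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

def rk4_step(f, state, dt):
    """Advance `state` by one classical Runge-Kutta step of size `dt`."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(
        s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

def separation_history(a, b, dt=0.01, steps=4000):
    """Distance between two trajectories at every step, including the start."""
    history = [math.dist(a, b)]
    for _ in range(steps):
        a = rk4_step(lorenz, a, dt)
        b = rk4_step(lorenz, b, dt)
        history.append(math.dist(a, b))
    return history

# Two initial states that differ by 1e-8 in the x coordinate.
history = separation_history((1.0, 1.0, 1.0), (1.0 + 1e-8, 1.0, 1.0))
print(f"initial separation: {history[0]:.1e}")
print(f"largest separation: {max(history):.1e}")
```

A tiny perturbation grows to the scale of the attractor, so a point estimate of the initial condition recovered from a final state is fragile, which motivates working with distributions of states instead.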

2606.24820 2026-06-24 cs.CL New submission

SHERLOC: Structured Diagnostic Localization for Code Repair Agents

Hovhannes Tamoyan, Sean Narenthiran, Erik Arakelyan, Mira Mezini, Boris Ginsburg

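Affiliations: NVIDIA, Santa Clara, CA 95051, USA; TU Darmstadt; National Research Center for Applied Cybersecurity ATHENE

AI summary: Proposes the SHERLOC framework, which pairs a reasoning LLM with compact repository tools and a self-recovery mechanism to perform structured diagnostic localization without fine-tuning. It reaches state-of-the-art localization accuracy on SWE-Bench, raises repair rates, and lowers token consumption.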

Abstract

LLM agents solve repository-level coding tasks through multi-turn tool use, but utilize half their budget on locating faults before editing. Dedicated localization frameworks have emerged, yet are still evaluated as file retrieval rather than actionable diagnosis, producing locations without the diagnostic context a repair agent needs. We introduce SHERLOC (Structured Hypothesis-driven Exploration and Reasoning for Localization), a training-free framework pairing a reasoning LLM with compact repository tools and self-recovery, without fine-tuning or multi-agent orchestration. SHERLOC reaches state-of-the-art localization across model scales: 84.33% accuracy@1 on SWE-Bench Lite and 81.27% recall@1 on SWE-Bench Verified; at ~30B parameters, it matches or outperforms other agentic methods. Injecting our locations and diagnostic findings into repair agents yields, on average, +5.95 pp resolve rate on SWE-Bench Verified while cutting localization and total tokens by 36.7% and 23.1%.

2606.24814 2026-06-24 cs.RO New submission

Vision-Language Model Reasoning for Contextual Semantic Mapping in Intralogistics

Marvin Rüdt, Hao Pang, Constantin Enke, Zäzilia Seibold, Kai Furmans

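Affiliations: Institute for Material Handling and Logistics, Karlsruhe Institute of Technology

AI summary: Proposes a contextual semantic mapping pipeline that combines SLAM, SAM, instance clustering, and VLM multi-view reasoning to infer object movability zero-shot, reaching 98.93% mIoU for semantic classification and 89.17% mAcc for movability estimation.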

Comments Accepted for publication at IEEE ETFA 2026

详情
AI中文摘要

在内部物流环境中运行的自主移动机器人依赖几何地图进行定位和导航,但缺乏对物体及其上下文属性的语义理解。我们提出了一种上下文语义建图流程,该流程结合了基于SLAM的几何建图、基于SAM的实例分割、实例聚类和VLM多视角推理,以生成编码几何结构、物体类别和物体可移动性的上下文语义地图表示。通过聚合多个视角的观测并在零样本、开放词汇设置下查询VLM,该流程推断出上下文物体属性——此处通过可移动性展示——无需特定任务训练或预定义物体类别。我们在两种提示策略下评估了三种VLM,并对流程进行了组件分析。所提出的流程在语义分类上达到98.93%的mIoU,在物体可移动性估计上达到89.17%的mAcc。组件分析表明,VLM推理是上下文理解的主要瓶颈,而实例聚类是全景性能的主要限制。生成的语义地图支持动态内部物流环境中的上下文感知过滤和鲁棒导航。

英文摘要

Autonomous mobile robots operating in intralogistics environments rely on geometric maps for localization and navigation, but lack semantic understanding of objects and their contextual properties. We present a contextual semantic mapping pipeline that combines SLAM-based geometric mapping, SAM-based instance segmentation, instance clustering, and VLM multi-view reasoning to produce a contextual semantic map representation encoding geometric structure, object class, and object movability. By aggregating observations across multiple viewpoints and querying a VLM in a zero-shot, open-vocabulary setting, the pipeline infers contextual object properties, here demonstrated through movability, without requiring task-specific training or predefined object categories. We evaluate three VLMs under two prompting strategies and conduct a component-wise analysis of the pipeline. The proposed pipeline achieves 98.93% mIoU for semantic classification and 89.17% mAcc for object movability estimation. Component analysis identifies VLM reasoning as the primary bottleneck for contextual understanding and instance clustering as the main limitation for panoptic performance. The resulting semantic map supports context-aware filtering and robust navigation in dynamic intralogistics environments.
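The abstract reports mIoU and mAcc without defining them. The sketch below shows the conventional definitions computed from a confusion matrix. It assumes that mAcc means mean per-class accuracy, which the abstract does not state. The function names and the 2-class matrix are hypothetical and are not taken from the paper.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix (rows: ground truth, columns: prediction)."""
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # class c predicted as something else
        fp = sum(confusion[r][c] for r in range(n)) - tp  # other classes predicted as c
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both ground truth and predictions
            ious.append(tp / denom)
    return sum(ious) / len(ious)


def mean_class_accuracy(confusion):
    """Mean of per-class recall (assumed here to be what 'mAcc' denotes)."""
    accs = [row[i] / sum(row) for i, row in enumerate(confusion) if sum(row) > 0]
    return sum(accs) / len(accs)


# Hypothetical counts for illustration only, e.g. movable vs. non-movable objects.
example = [[8, 2],
           [1, 9]]
print(mean_iou(example), mean_class_accuracy(example))
```

Averaging per class rather than per sample keeps a rare class from being swamped by a frequent one, which is presumably why both metrics are reported as means over classes.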

2606.24805 2026-06-24 cs.CV 新提交

DDStereo: Efficient Dual Decoder Transformers for Stereo 3D Road Anomaly Detection

DDStereo: 用于立体3D道路异常检测的高效双解码器Transformer

Shiyi Mu, Zichong Gu, Zhiqi Ai, Yilin Gao, Shugong Xu

发表机构 * Shanghai University(上海大学) Xi’an Jiaotong-Liverpool University(西交利物浦大学)

AI总结 提出DDStereo,一种双解码器立体Transformer,通过轻量级解码器分支和紧凑视差特征提取器,实现实时开放集3D目标检测,在封闭和开放集协议下均达到最先进精度。

详情
AI中文摘要

基于立体的3D目标检测仍面临两个关键安全挑战:实时性能和开放集泛化。现有的立体3D方法通常达到单目方法两倍的精度,但推理速度显著较低,不适合实时应用。同时,开放世界检测的最新进展在单目2D和3D设置中引入了开放集和开放词汇算法,但基于立体的开放集检测仍基本未被探索。为弥合这一差距,我们提出DDStereo,一种新颖的双解码器立体Transformer,用于实时开放集3D目标检测。DDStereo具有两个轻量级解码器分支:一个用于开放集前景2D检测,另一个用于3D属性回归。这些解码器共享对象级查询以实现统一的目标级对齐。为提高推理效率,我们设计了紧凑的视差特征提取器和精简的解码器架构。在公共立体3D基准上的实验表明,DDStereo在封闭集和开放集协议下均达到最先进的精度。值得注意的是,我们的方法在推理速度上超越了现有的立体3D检测器,并首次实现了与单目方法相当的实时性能。

英文摘要

Stereo-based 3D object detection still faces two critical safety challenges: real-time performance and open-set generalization. Existing stereo 3D methods typically achieve twice the accuracy of monocular methods but suffer from significantly lower inference speeds, making them unsuitable for real-time applications. Meanwhile, recent advances in open-world detection have introduced open-set and open-vocabulary algorithms in monocular 2D and 3D settings, yet stereo-based open-set detection remains largely unexplored. To bridge this gap, we propose DDStereo, a novel Dual-Decoder Stereo Transformer for real-time open-set 3D object detection. DDStereo features two lightweight decoder branches: one for open-set foreground 2D detection and the other for 3D attribute regression. These decoders share object-level queries to achieve unified target-level alignment. To enhance inference efficiency, we designed a compact disparity feature extractor and a streamlined decoder architecture. Experiments on public stereo 3D benchmarks demonstrate that DDStereo achieves state-of-the-art accuracy under both closed-set and open-set protocols. Notably, our method surpasses existing stereo 3D detectors in inference speed and, for the first time, achieves real-time performance comparable to monocular approaches.

2606.24797 2026-06-24 cs.CV cs.AI 新提交

EG-VQA: Benchmarking Verifiable Video Question Answering with Grounded Temporal Evidence

EG-VQA: 基于接地时间证据的可验证视频问答基准

Linpeng Huang, Weixing Chen, Zexin Chen, Yang Liu, Liang Lin

发表机构 * Sun Yat-sen University(中山大学) Peng Cheng Laboratory(鹏城实验室) Shenzhen University(深圳大学)

AI总结 提出EG-VQA基准,通过细粒度时间证据标注和EG-F1指标,揭示视频大模型在答案正确性与证据定位之间的差距,并引入EG-Reasoner模型弥合这一差距。

详情
AI中文摘要

近期视频大语言模型(Video-LLMs)的进展在视频问答(VideoQA)上取得了有前景的性能。然而,现有基准主要通过答案正确性进行评估,而预测结果在相关视频证据中的定位在很大程度上仍未得到检验。答案生成与证据理解之间的脱节促使我们构建了证据接地视频问答基准(EG-VQA),这是一个开放式评估协议,其中每个问答对都明确标注了支持性时间证据,从而要求联合推理和精确证据定位。EG-VQA包含2,067个视频和11,838个问答对,带有细粒度证据标注。为了评估预测证据,引入了证据接地F1(EG-F1)作为统一指标,该指标联合衡量与真实证据的时间对齐和语义一致性。实验评估显示,即使是强大的专有模型也难以准确定位其预测,揭示了答案正确性与忠实证据定位之间的根本差距。为弥合这一差距,提出了EG-Reasoner,一种通过显式监督训练的证据接地推理模型。在开源模型中取得了最先进的性能,结果与专有系统具有竞争力,特别是在反事实问题等推理密集型任务上观察到显著提升。这些发现表明,仅靠扩展不足以实现稳健的视频理解,结构化证据监督对于开发更可靠和可解释的VideoQA系统至关重要。

英文摘要

Recent advances in Video Large Language Models (Video-LLMs) have yielded promising performance on video question answering (VideoQA). Nevertheless, existing benchmarks are predominantly evaluated through answer correctness, while the grounding of predictions in relevant video evidence remains largely unexamined. This disconnect between answer generation and evidence understanding motivates the construction of the Evidence-Grounded Video Question Answering Benchmark (EG-VQA), an open-ended evaluation protocol in which each QA pair is explicitly annotated with supporting temporal evidence, thereby requiring joint reasoning and precise evidence localization. EG-VQA comprises 2,067 videos and 11,838 QA pairs with fine-grained evidence annotations. To evaluate predicted evidence, Evidence-Grounded F1 (EG-F1) is introduced as a unified metric in which temporal alignment and semantic consistency against ground-truth evidence are jointly measured. Experimental evaluation reveals that even strong proprietary models struggle to accurately ground their predictions, exposing a fundamental discrepancy between answer correctness and faithful evidence localization. To bridge this gap, EG-Reasoner, an evidence-grounded reasoning model trained with explicit supervision, is proposed. State-of-the-art performance is achieved among open-source models, with results competitive against proprietary systems; particularly pronounced gains are observed on reasoning-intensive tasks such as counterfactual questions. These findings demonstrate that scaling alone is insufficient for robust video understanding and that structured evidence supervision is essential for the development of more reliable and interpretable VideoQA systems.
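The abstract does not give the formula for EG-F1. As a rough illustration of the temporal-alignment half only, the sketch below computes the temporal IoU of two time intervals and a segment-level F1 at an IoU threshold. This is a generic construction and not the paper's metric: the greedy matching, the 0.5 threshold, and the function names are all assumptions, and the semantic-consistency component is left out entirely.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def segment_f1(preds, gts, thr=0.5):
    """F1 over segments; a prediction counts as a hit if it overlaps an
    unmatched ground-truth segment with IoU >= thr (greedy, one-to-one)."""
    unmatched = list(gts)
    tp = 0
    for p in preds:
        best = max(unmatched, key=lambda g: temporal_iou(p, g), default=None)
        if best is not None and temporal_iou(p, best) >= thr:
            unmatched.remove(best)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical segments: one prediction overlaps its evidence well, one misses.
print(segment_f1([(0, 10), (20, 30)], [(1, 10), (50, 60)]))
```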

2606.24796 2026-06-24 cs.CV 新提交

Pocket-SLAM: Rendering-Area-Aware Pruning for Memory-Efficient 3DGS-SLAM

Pocket-SLAM:面向内存高效的3DGS-SLAM的渲染区域感知剪枝

Leshu Li, Jie Peng, Yang Zhao

发表机构 * University of Minnesota, Twin Cities(明尼苏达大学双城分校) University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 提出一种渲染区域感知的剪枝策略,根据高斯点对有效渲染区域的贡献进行选择性移除,在EuRoC和KITTI数据集上实现超过60%的内存减少和2倍以上的FPS提升,同时保持定位与建图精度。

Comments 2026 IEEE International Conference on Robotics and Automation(ICRA)

详情
AI中文摘要

3D高斯泼溅(3DGS)因其在捕捉细粒度几何特征和合成新视角方面的进展,在同步定位与地图构建(SLAM)中引起了广泛关注。对于大规模场景(如自动驾驶)中的SLAM,3DGS-SLAM面临一个关键限制:随着高斯点的累积,内存消耗随时间持续增加,导致内存效率低下并限制了其适用性。在这项工作中,我们提出了一种渲染区域感知的剪枝策略,该策略根据高斯点对有效渲染区域的贡献进行选择性移除,而不是仅依赖于基于高斯点的启发式方法(如不透明度或梯度幅度)。这一视角直接针对内存冗余的来源,有效降低了3DGS-SLAM在运行时的峰值内存占用。在EuRoC和KITTI数据集上的评估表明,我们的方法在大型户外场景中持续优于现有的剪枝方法,实现了超过60%的内存减少和2倍以上的FPS提升,同时保持了定位与建图精度。这些结果突显了渲染区域感知剪枝作为将3DGS-SLAM扩展到真实世界自动驾驶场景的一个有前景的方向。我们的代码公开于 https://github.com/UMN-ZhaoLab/Pocket-SLAM.git。

英文摘要

3D Gaussian Splatting (3DGS) has garnered significant attention in Simultaneous Localization and Mapping (SLAM) due to its advances in capturing fine-grained geometry features and synthesizing novel views. For SLAM in large-scale scenes, such as autonomous driving, 3DGS-SLAM faces a critical limitation: memory consumption increases continuously over time as Gaussian points accumulate, leading to poor memory efficiency and limiting its applicability. In this work, we propose a rendering-area-aware pruning strategy that selectively removes Gaussians based on their contribution to the effective rendering area, rather than solely relying on Gaussian-level heuristics such as opacity or gradient magnitude. This perspective directly targets the sources of memory redundancy, effectively reducing the peak memory footprint of 3DGS-SLAM during runtime. Evaluations on the EuRoC and KITTI datasets demonstrate that our method consistently outperforms existing pruning approaches in large-scale outdoor scenes, achieving over 60% memory reduction and more than 2 times FPS improvement while preserving localization and mapping accuracy. These results highlight rendering-area-aware pruning as a promising direction for scaling 3DGS-SLAM to real-world autonomous driving scenarios. Our code is publicly available at https://github.com/UMN-ZhaoLab/Pocket-SLAM.git.

2606.24786 2026-06-24 cs.CV 新提交

Counting Trees from Satellite Imagery with Noisy Supervision

基于噪声监督的卫星图像树木计数

Dimitri Gominski, Maurice Mugabowindekwe, Qiue Xu, Xiaowei Tong, Martin Brandt, Hieu Le, Rasmus Fensholt, Dimitris Samaras, Loic Landrieu

发表机构 * University of Copenhagen(哥本哈根大学) University of Rwanda(卢旺达大学) University of Chinese Academy of Sciences(中国科学院大学) University of North Carolina at Charlotte(北卡罗来纳大学夏洛特分校) Stony Brook University(石溪大学) LIGM, CNRS, Univ Gustave Eiffel, ENPC, IPP(LIGM,CNRS,古斯塔夫·埃菲尔大学,ENPC,IPP)

AI总结 针对卫星图像中树木计数任务,提出基于非平衡最优传输的空间密度匹配方法,并引入自校正机制利用传输残差逐步优化噪声监督,在跨三大洲的TinyTrees基准上优于检测、回归和传输匹配基线。

详情
AI中文摘要

计数单棵树木是环境监测的一项基本任务,但在卫星图像中仍鲜有探索。在这些分辨率下,孤立树木可能仍可识别,但在密集森林中树冠边界变得模糊,使得单棵树木的概念本身定义不清。此外,大规模人工标注单棵树木成本过高。虽然可从机载LiDAR获得可扩展的监督,但由此产生的标注存在噪声且难以有效利用。我们通过将树木计数公式化为一个通过非平衡最优传输监督的空间密度匹配问题来应对这些挑战。该公式自然适应了孤立树木的精确定位和密集森林中的稳健密度估计。我们进一步引入了一种自校正机制,利用传输残差在训练过程中逐步细化噪声监督。我们在TinyTrees上评估了我们的方法,这是一个新的基准,涵盖三大洲和三个卫星传感器,包含超过2.15亿个树木标注(包括77.3万个手动验证实例),覆盖23,000平方公里。我们的方法一致优于基于检测、回归和基于传输的分布匹配基线,证明了非平衡传输和可靠性感知监督对于从卫星图像进行大规模树木计数的有效性。代码、数据和模型可在以下网址获取:https://github.com/dgominski/treematch

英文摘要

Counting individual trees is a fundamental task for environmental monitoring, yet remains largely unexplored with satellite imagery. At these resolutions, isolated trees may still be identifiable, but crown boundaries become ambiguous in dense forests, making the notion of an individual tree inherently ill-defined. Moreover, large-scale manual annotations of individual trees are prohibitively expensive. While scalable supervision can be derived from airborne LiDAR, the resulting annotations are noisy and difficult to exploit effectively. We address these challenges by formulating tree counting as a spatial density matching problem supervised through Unbalanced Optimal Transport. This formulation naturally accommodates both precise localization of isolated trees and robust density estimation in dense forests. We further introduce a self-correction mechanism that leverages transport residuals to progressively refine noisy supervision during training. We evaluate our approach on TinyTrees, a new benchmark spanning three continents and three satellite sensors, comprising over 215 million tree annotations (including 773K manually verified instances) across 23,000 sq.km. Our method consistently outperforms detection-based, regression-based, and transport-based distribution-matching baselines, demonstrating the effectiveness of unbalanced transport and reliability-aware supervision for large-scale tree counting from satellite imagery. Code, data and models are available at https://github.com/dgominski/treematch.

2606.24784 2026-06-24 cs.CV 新提交

AerialFusionMapNet: Online HD Map Construction with Aerial-Onboard BEV Fusion

AerialFusionMapNet: 基于航拍-车载BEV融合的在线高清地图构建

Daniel Lengerer, Mathias Pechinger, Klaus Bogenberger, Carsten Markgraf

发表机构 * Technical University of Applied Sciences Augsburg(奥格斯堡应用技术大学) Technical University of Munich(慕尼黑工业大学)

AI总结 提出AerialFusionMapNet,通过结构化两阶段训练策略有效融合航拍与车载传感器特征,在nuScenes数据集上mAP达54.7,相比基线提升12.1%。

Comments Accepted at the IEEE International Conference on Intelligent Transportation Systems (ITSC) 2026

详情
AI中文摘要

高分辨率航拍图像最近作为自动驾驶感知的补充模态出现,并在与车载传感器融合时显示出改善鸟瞰图(BEV)场景理解的潜力。先前的工作通过航拍-车载融合展示了在线高清(HD)地图构建的性能提升;然而,传统的端到端融合并未充分利用航拍表示中包含的结构信息。在这项工作中,我们引入了AerialFusionMapNet,这是一个基于融合的地图构建框架,采用结构化的两阶段训练策略,在统一流程中明确增强航拍特征的贡献。所提出的训练方案能够更有效地整合结构化的航拍先验。在nuScenes地理划分上,AerialFusionMapNet最高达到54.7 mAP,相比先前的航拍-车载融合基线的48.8 mAP,绝对提升5.9,相对提升12.1%。结果表明,结构化的训练设计,而非增加架构复杂性,在释放航拍图像用于在线高清地图构建的全部潜力方面起着更具决定性的作用。代码和训练模型可在以下网址获取:https://github.com/DriverlessMobility/AerialFusionMapNet

英文摘要

High-resolution aerial imagery has recently emerged as a complementary modality for automated driving perception and has shown potential to improve bird's-eye-view (BEV) scene understanding when fused with onboard sensors. Prior work demonstrated performance gains for online high-definition (HD) map construction through aerial-onboard fusion; however, conventional end-to-end fusion does not fully exploit the structural information contained in aerial representations. In this work, we introduce AerialFusionMapNet, a fusion-based mapping framework with a structured two-stage training strategy that explicitly enhances the contribution of aerial features within a unified pipeline. The proposed training scheme enables more effective integration of structural aerial priors. On the nuScenes geographic split, AerialFusionMapNet achieves up to 54.7 mAP, improving over prior aerial-onboard fusion baselines from 48.8 mAP by +5.9 absolute and +12.1% relative. The results suggest that structured training design, rather than increased architectural complexity, plays a more decisive role in unlocking the full potential of aerial imagery for online HD map construction. Code and trained models are available at https://github.com/DriverlessMobility/AerialFusionMapNet.
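The absolute and relative gains quoted in the abstract can be checked from the two mAP values it gives. The snippet below is only an arithmetic check; the helper name `improvement` is made up for illustration.

```python
def improvement(baseline, new):
    """Return (absolute gain, relative gain in percent) of `new` over `baseline`."""
    absolute = new - baseline
    relative_pct = 100.0 * absolute / baseline
    return absolute, relative_pct


# mAP values as stated in the abstract: 48.8 (prior fusion baseline) and 54.7.
abs_gain, rel_gain = improvement(48.8, 54.7)
print(f"+{abs_gain:.1f} absolute, +{rel_gain:.1f}% relative")
# prints "+5.9 absolute, +12.1% relative"
```

The relative figure is taken with respect to the baseline (5.9 / 48.8), which is why it is larger than the gain in raw mAP points.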

2606.24783 2026-06-24 cs.CL cs.AI 新提交

Paying to Know: Micro-Transaction Markets for Verified Product Information in Agentic E-Commerce

付费知情:代理型电商中已验证产品信息的微交易市场

Filippos Ventirozos, Matthew Shardlow

发表机构 * Manchester Metropolitan University(曼彻斯特城市大学)

AI总结 提出代理型电商作为已验证信息的微交易市场,买家代理通过微支付逐步获取卖家与评论者提供的数据,以解决信息可信度瓶颈,并转化为具体NLP问题。

Comments 8 pages, 1 figure. Vision paper, under review

详情
AI中文摘要

商业NLP将购物聊天机器人视为推荐或转化工具:其任务是匹配用户与目录条目并完成销售。我们认为,代理原生微支付通道(例如x402、AP2)的出现改变了稀缺性。当买家是能够详尽调查的自主代理时,瓶颈不再是匹配产品,而是获取关于产品的可信、决策相关信息。我们设想代理型电商作为已验证信息的微交易市场:买家代理花费几分钱逐步解锁卖家与评论者提供的数据——服务历史、第三方测试报告、物料清单、经审计的销售和支持指标——在免费增值模式下按需付费,评论者信任度通过声誉评分。我们勾勒了这样一个市场的架构,并论证它奖励真正的产品质量,并产生比基于排名的店面更真实的竞争。然后我们将这一愿景转化为具体的NLP问题——成本最优信息获取、数据定价与谈判、实时实体解析、基于基础的价值交换以及隐私保护的人物建模——并论证这些问题,而非聊天流畅性,值得该领域关注。

英文摘要

Commercial NLP treats the shopping chatbot as a recommender or a conversion tool: its job is to match a user to a catalogue entry and close a sale. We argue that the arrival of agent-native micro-payment rails (e.g., x402, AP2) changes what is scarce. When the buyer is an autonomous agent that can investigate exhaustively, the bottleneck is no longer matching products but acquiring trustworthy, decision-relevant information about them. We envision agentic e-commerce as a micro-transaction market for verified information: buyer agents spend fractions of a cent to progressively unlock seller- and reviewer-supplied data -- service histories, third-party test reports, bills of materials, audited sales and support metrics -- paid for a la carte under a freemium model, with reviewer trust scored reputationally. We sketch the architecture of such a market and argue that it rewards genuine product quality and yields truer competition than ranking-based storefronts. We then translate the vision into concrete NLP problems -- cost-optimal information acquisition, data pricing and negotiation, real-time entity resolution, grounded value exchange, and privacy-preserving persona modelling -- and argue that these, not chat fluency, deserve the field's attention.

2606.24781 2026-06-24 cs.AI cs.HC 新提交

Assessing Distribution Shift in Human Activity Recognition for Domain Generalization

评估人类活动识别中的分布偏移以进行域泛化

Rebecca Adaimi, Edison Thomaz

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 系统评估设备类型、传感器位置、采样率和用户行为四种分布偏移对HAR模型的影响,建立统一基准并评估28种域泛化方法,发现现有方法泛化能力有限。

Comments 22 pages with references

详情
AI中文摘要

尽管人类活动识别(HAR)领域持续吸引研究者的兴趣并取得重要进展,但仍存在一些关键挑战。构建在现实环境中表现良好的HAR模型最困难的方面之一是处理来自设备和传感器异构性的数据多样性,以及现实应用固有的上下文变化。虽然文献中已充分认识到HAR中的数据多样性,但在理解各种类型的分布偏移对HAR模型的影响以及由此产生的域泛化问题方面仍存在差距。为此,本文系统评估了4种不同类型的分布偏移,包括设备类型、传感器位置、采样率和用户行为的变化。通过量化其影响,我们说明多样性偏移主要定义了所有类型的偏移,表明存在不同域之间不共享的独特特征。然后,我们引入统一的基于HAR的分布偏移基准,并对多达28种域泛化方法进行全面评估。我们的分析揭示了当前域泛化算法在实现模型泛化性方面的局限性,仅略微优于经验风险最小化基线。这项工作首次系统探索了传感器HAR中特定分布偏移的域泛化和适应,提供了开源基准平台和数据集以促进进一步研究。

英文摘要

While the field of Human Activity Recognition (HAR) continues to draw interest from researchers and advance in important ways, some key challenges remain. One of the most difficult aspects of building HAR models that show good performance in real-world settings is dealing with data diversity from device and sensor heterogeneity, and contextual changes that are intrinsic to real-world applications. While data diversity in HAR has been well-acknowledged in the literature, there remains a gap in understanding the effect of various types of distribution shifts on HAR models and the domain generalization problem that arises. Towards that end, this paper systematically evaluates 4 different types of distribution shifts, including variations in device type, sensor placement, sampling rate, and user behavior. Quantifying their effects, we illustrate that diversity shifts predominantly define all types of shifts, indicating the existence of unique features that are not shared across different domains. We then introduce a uniform HAR-based distribution shift benchmark and conduct a comprehensive evaluation of up to 28 domain generalization methods. Our analysis exposes the limitations of current domain generalization algorithms in achieving model generalizability, as they only marginally outperform the empirical risk minimization baseline. This work represents the first systematic exploration of domain generalization and adaptation concerning specific distribution shifts in sensor-based HAR, offering an open-source benchmark platform and datasets to spur further research.

2606.24775 2026-06-24 cs.CL cs.DB cs.IR 新提交

Are We Ready For An Agent-Native Memory System?

我们准备好构建原生智能体记忆系统了吗?

Wei Zhou, Xuanhe Zhou, Shaokun Han, Hongming Xu, Guoliang Li, Zhiyu Li, Feiyu Xiong, Fan Wu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Tsinghua University(清华大学) MemTensor (Shanghai) Technology Co., Ltd(MemTensor(上海)科技有限公司)

AI总结 从数据管理视角系统研究智能体记忆,提出四模块分析框架,评估12种记忆系统,发现无单一架构占优,并揭示成本-性能权衡。

Comments Paper list available at: https://github.com/OpenDataBox/awesome-agent-memory. Source code available at: https://github.com/OpenDataBox/MemoryData

详情
AI中文摘要

大语言模型(LLM)智能体的记忆已从简单的检索增强机制迅速演变为支持持久信息存储、检索、更新、整合以及整个智能体执行过程中动态生命周期治理的数据管理系统。尽管有这种演变,现有的评估仍然主要通过端到端任务成功指标(如F1、BLEU)来基准测试智能体记忆,而将底层系统视为一个整体黑盒。因此,关键的系统级问题,包括操作成本、跨记忆模块的架构权衡以及动态知识更新下的鲁棒性,仍未得到充分探索。在本文中,我们从数据管理的角度对智能体记忆进行了系统的实验研究。我们提出了一个分析框架,将智能体记忆分解为四个核心模块:记忆表示与存储、提取、检索与路由以及维护。在此框架下,我们在涵盖11个数据集的5个基准工作负载上评估了12个代表性记忆系统和两个参考基线。我们广泛的端到端评估表明,没有单一架构在所有场景中占主导地位;相反,有效性在很大程度上取决于记忆结构与工作负载瓶颈的对齐程度。此外,通过细粒度的消融研究,我们量化了它们对表示保真度、检索精度、更新正确性和长期稳定性的个体影响。最后,我们揭示了现实工作负载下的成本-性能权衡,表明局部维护比全局重组更具成本效益。基于这些发现,我们指出了构建真正原生智能体记忆系统的有前景的方向。代码公开于 https://github.com/OpenDataBox/MemoryData

英文摘要

Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic lifecycle governance throughout agent execution. Despite this evolution, existing evaluations still benchmark agent memory mainly through end-to-end task success metrics (e.g., F1, BLEU), while treating the underlying system as a monolithic black box. As a result, critical system-level concerns, including operational costs, architectural trade-offs across memory modules, and robustness under dynamic knowledge updates, remain insufficiently explored. In this paper, we present a systematic experimental study of agent memory from a data management perspective. We propose an analytical framework that decomposes agent memory into four core modules: memory representation and storage, extraction, retrieval and routing, and maintenance. Under this framework, we evaluate 12 representative memory systems and two reference baselines across five benchmark workloads spanning 11 datasets. Our extensive end-to-end evaluation shows that no single architecture dominates across all scenarios; instead, effectiveness depends heavily on how well the memory structure aligns with the workload bottleneck. Furthermore, through fine-grained ablation studies, we quantify their individual effects on representation fidelity, retrieval precision, update correctness, and long-horizon stability. Finally, we reveal cost-performance trade-offs under realistic workloads, showing localized maintenance is more cost-efficient than global reorganization. Based on these findings, we identify promising directions towards building truly agent-native memory systems. The code is publicly available at https://github.com/OpenDataBox/MemoryData.

2606.24767 2026-06-24 cs.CV cs.RO 新提交

Compact Object-Level Representations with Open-Vocabulary Understanding for Indoor Visual Relocalization

具有开放词汇理解的紧凑对象级表示用于室内视觉重定位

Zhaopeng Cui, Jiarui Hu, Jingbo Liu, Boming Zhao, Xiyue Guo, Boyin Feng, Haocheng Peng, Yujun Shen, Hujun Bao, Guofeng Zhang

发表机构 * State Key Lab of CAD&CG, Zhejiang University(浙江大学CAD&CG国家重点实验室) Ant Group(蚂蚁集团)

AI总结 提出OpenReLoc系统,利用基础模型实现多模态开放词汇语义匹配、基于DIOU的参考帧选择和双路径2D ICP损失,仅用对象单元实现室内视觉重定位,提升可解释性和准确性。

Comments Accepted by RA-L 2026

详情
AI中文摘要

室内视觉重定位在空间和具身AI应用中扮演关键角色。然而,先前研究主要致力于低级视觉方案,难以感知场景语义和组成,限制了可解释性和适用性。本文探索如何将场景中的丰富对象信息(包括语义、布局和几何)组织成结构化地图表示,从而仅利用对象单元驱动相机重定位任务。为此,我们提出OpenReLoc,一种提供场景理解和精确位姿估计能力的相机重定位系统。利用最新基础模型,我们首先引入多模态机制整合开放词汇语义知识,以实现有效的2D-3D对象匹配。此外,我们设计面向对象的参考帧作为位置先验,并基于距离交并比(DIOU)的参考帧选择策略,使其可扩展到大规模场景。同时,为确保稳定精确的位姿优化,我们提出由对象形状引导的双路径2D迭代最近像素损失。实验结果表明,OpenReLoc在多个数据集上实现了优越的重定位召回率和精度。源代码将在接收后发布。

英文摘要

Indoor visual relocalization plays a critical role in emerging spatial and embodied AI applications. However, prior research was predominantly devoted to low-level vision schemes, struggling to perceive scene semantics and compositions, which limits both interpretability and applicability. In this paper, we explore the issue of how to organize rich object information in a scene, including semantics, layout, and geometry, into a structured map representation, thereby utilizing object units exclusively to drive the camera relocalization task. To this end, we propose OpenReLoc, a camera relocalization system designed to provide scene understanding and accurate pose estimation capabilities. Leveraging recent foundation models, we first introduce a multi-modal mechanism to integrate open-vocabulary semantic knowledge for effective 2D-3D object matching. Additionally, we design object-oriented reference frames as position priors, paired with a reference frame selection strategy based on the Distance-IoU (DIOU), enabling extension to scalable scenes. Moreover, to ensure stable and accurate pose optimization, we also propose a dual-path 2D Iterative Closest Pixel loss guided by object shape. Experimental results demonstrate that OpenReLoc achieves superior relocalization recall and accuracy across various datasets. Our source code will be released upon acceptance.
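The abstract names Distance-IoU (DIoU) as the score behind reference frame selection but does not spell it out. The sketch below gives the commonly used DIoU definition for two axis-aligned 2D boxes: IoU minus the squared distance between box centers divided by the squared diagonal of the smallest enclosing box. How OpenReLoc applies this score to reference frames is not described in the abstract, so the box format and function name are assumptions for illustration.

```python
def diou(a, b):
    """Distance-IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (zero area if the boxes do not overlap).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distance between the two box centers.
    d2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 \
       + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both.
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    c2 = cw ** 2 + ch ** 2

    return iou - d2 / c2 if c2 > 0 else iou


print(diou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Unlike plain IoU, the score still separates candidates when boxes do not overlap at all: it becomes negative and decreases as the centers move apart.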

2606.24752 2026-06-24 cs.AI 新提交

Can Scale Save Us From Plasticity Loss in Large Language Models?

规模能拯救大型语言模型的可塑性丧失吗?

J. Fernando Hernandez-Garcia, Tomás Figliolia, Beren Millidge

发表机构 * Zyphra

AI总结 研究GPT风格Transformer模型在持续学习中的可塑性丧失问题,发现可塑性丧失的出现时间随模型规模呈亚线性增长,表明增大模型只能延缓而非阻止该现象。

详情
AI中文摘要

可塑性丧失(即网络在学习旧信息之后学习新信息的能力下降)是创建能够持续学习的人工神经网络的基本挑战。尽管这一现象已为人所知数十年,但主要是在较旧、相对较小的架构中研究,很少涉及自然语言领域。为了确定可塑性丧失在现代基于Transformer的LLM范式中是否仍然是一个问题,我们研究了在多语言持续学习问题上训练的GPT风格Transformer模型中的可塑性丧失。与先前工作一致,我们在5M到314M非嵌入参数的模型中发现了可塑性丧失的证据,通过在一个保留的越南语探测任务上的性能恶化来衡量。我们进一步发现,可塑性丧失的出现时间遵循可预测的缩放定律,随模型大小亚线性增长。这些结果表明,更大的模型可能延迟可塑性丧失的可测量影响,但仅增加参数数量可能不足以完全阻止它。我们还在平稳(stationary)的多语言训练下发现了可塑性丧失的证据,挑战了该现象仅存在于具有突然任务变化的持续学习中的观点。总体而言,我们的结果表明,即使是在自然语言上训练的大型Transformer语言模型,在经过足够长的训练后,最终也会丧失高效适应新数据的能力,无论是在持续学习还是平稳设置中。

英文摘要

Loss of plasticity, the reduced ability of a network to learn new information after having already learned older information, is a fundamental challenge in creating artificial neural networks capable of continual learning. Although this phenomenon has been known for decades, it has mostly been studied in older, relatively small architectures and rarely in natural-language domains. To determine whether loss of plasticity remains a problem in the modern transformer-based LLM paradigm, we study plasticity loss in GPT-style Transformer models trained on a multilingual continual learning problem. Consistent with prior work, we find evidence of plasticity loss across models ranging from 5M to 314M non-embedding parameters, as measured by deterioration on a held-out Vietnamese probing task. We further find that the onset of plasticity loss follows a predictable scaling law, growing sublinearly with model size. These results suggest that larger models may delay the measurable effects of plasticity loss, but that increasing parameter count alone is likely to be insufficient to completely prevent it. We also find evidence of plasticity loss under stationary multilingual training, challenging the view that the phenomenon is exclusive to continual learning with abrupt task changes. Overall, our results suggest that even large Transformer language models trained on natural language will eventually lose the ability to efficiently adapt to new data after sufficiently long training, in both continual and stationary settings.

2606.24747 2026-06-24 cs.AI cs.CE 新提交

Scaling Laws for Task-Specific LLM Distillation

任务特定LLM蒸馏的缩放定律

Lavinia Ghita, Dhruv Desai, Ioana Boier

发表机构 * NVIDIA(英伟达)

AI总结 本文推导了领域特定LLM压缩的经验缩放定律,通过定量金融领域的实验,比较了基于logit和LoRA的蒸馏方法,并引入混合思维链监督损失,揭示了监督格式对领域知识与通用知识权衡的关键作用。

Comments 24 pages, 13 figures

详情
AI中文摘要

大型语言模型(LLM)在越来越多的领域展现出强大性能,但其规模在延迟和成本受限的应用中带来了部署挑战。本文推导了领域特定LLM压缩的经验缩放定律,量化了领域内和通用知识性能如何随数据集大小、压缩比、监督格式和迭代剪枝计划而变化。以定量金融为应用领域,我们比较了在迭代结构剪枝下基于logit和基于LoRA的蒸馏方法,引入了一种混合思维链监督损失,该损失在推理轨迹上稳定了KL散度蒸馏。领域内任务质量在压缩下可预测地下降,而通用知识基准在相同点之前就崩溃;监督格式是这一权衡的关键驱动因素,思维链监督主动恢复了剪枝抹去的通用知识。我们发布了头条数据集FinHeadlineMix、缩放定律结果和实用建议,为领域特定压缩决策提供了一个可复用的框架。

英文摘要

Large Language Models (LLMs) achieve strong performance across a growing range of domains, yet their scale poses deployment challenges in applications where latency and cost constraints are critical. This paper derives empirical scaling laws for domain-specific LLM compression, quantifying how in-domain and general knowledge performance scale with dataset size, compression ratio, supervision format, and iterative pruning schedule. Using quantitative finance as our application domain, we compare logit-based and LoRA-based distillation under iterative structural pruning, introducing a blended chain-of-thought supervision loss that stabilizes KL-divergence distillation over reasoning traces. In-domain task quality degrades predictably under compression while general-knowledge benchmarks collapse well before the same point; supervision format is the key driver of this tradeoff, with chain-of-thought supervision actively recovering general knowledge that pruning erases. We release the headline dataset FinHeadlineMix, scaling law results, and practical recommendations to provide a reusable framework for domain-specific compression decisions.
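The abstract describes a blended supervision loss that stabilizes KL-divergence distillation but gives no formula. One common way to blend the two signals is a weighted sum of the teacher-student KL term and a cross-entropy term on the reference token, sketched below for a single token position. This is a generic construction and not necessarily the paper's loss: the weight `alpha`, the direction of the KL term, and the function names are all assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def blended_loss(student_logits, teacher_logits, target, alpha=0.5):
    """alpha * KL(teacher || student) + (1 - alpha) * cross-entropy on `target`.

    `target` is the index of the reference token, e.g. the next token of a
    chain-of-thought trace (assumed setup, not taken from the paper).
    """
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    kl = kl_divergence(p_teacher, p_student)
    ce = -math.log(p_student[target])
    return alpha * kl + (1 - alpha) * ce


# Hypothetical 3-token vocabulary; the student roughly agrees with the teacher.
print(blended_loss([2.0, 0.5, 0.1], [2.5, 0.3, 0.0], target=0, alpha=0.7))
```

With `alpha=1` the loss reduces to pure logit distillation, and with `alpha=0` to ordinary supervised fine-tuning on the trace, so the weight controls how much the hard labels anchor the soft targets.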

2606.24740 2026-06-24 cs.CV 新提交

BioMedVR: Confusion-Aware Mixture-of-Prompt Experts for Biomedical Visual Reprogramming

BioMedVR: 面向生物医学视觉重编程的混淆感知提示专家混合模型

Jiaxiang Liu, Tianxiang Hu, Juwei Guan, Yujie Wu, Yusong Wang, Yao Mu, Zuozhu Liu, Mingkun Xu

发表机构 * Guangdong Institute of Intelligence Science and Technology(广东智能科技研究院) Zhejiang University(浙江大学) Southeast University(东南大学) The Hong Kong Polytechnic University(香港理工大学) Institute of Science Tokyo(东京科学大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出BioMedVR框架,通过可学习的VR模块实现预训练VLM在生物医学图像上的少样本适应,利用混淆最小化机制和提示专家混合模型解决细粒度分类中的类间混淆问题,在18个数据集上验证了优越性能。

Comments Accepted at ECCV 2026. 19 pages, 6 figures. Project page: https://jxliu-ai.github.io/biomedvr-page/

详情
AI中文摘要

最近视觉语言模型(如CLIP)的进展展示了其在自然图像领域的强大泛化能力。然而,将这些模型适应到生物医学成像并非易事:全模型微调计算成本高,而医学数据通常稀缺且表现出细微的细粒度类间差异,使得参数高效适应尤为关键。视觉重编程(VR)通过在输入空间中注入可学习扰动提供了一种参数高效的替代方案,但现有的针对VLM的VR方法主要关注正类提示,忽略了混淆负类,导致在细粒度医学场景中预测校准不佳。我们提出了BioMedVR,这是首个基于VR的生物医学成像框架,通过紧凑的可学习VR模块实现预训练VLM的少样本适应。为了缓解类混淆,我们引入了一种混淆最小化机制,利用LLM生成的混淆感知属性以及混淆抑制损失,明确减少假阳性对齐。此外,设计的提示专家混合模型结合了用于主类区分的正专家和用于混淆抑制的负专家,通过自适应门控进行平衡。在18个数据集(包括11个生物医学数据集和7个自然图像基准)上的大量实验表明,BioMedVR实现了优越的准确性和泛化能力,有效桥接了生物医学领域中的VR和VLM。

英文摘要

Recent advances in vision-language models (VLMs) such as CLIP have demonstrated strong generalization across natural-image domains. However, adapting these models to biomedical imaging is non-trivial: full-model fine-tuning is computationally expensive, while medical data are often scarce and exhibit subtle, fine-grained inter-class differences, making parameter-efficient adaptation particularly critical. Visual Reprogramming (VR) offers a parameter-efficient alternative by injecting learnable perturbations into the input space, but existing VR approaches for VLMs mainly focus on positive class prompts and overlook confusing negatives, leading to miscalibrated predictions in fine-grained medical scenarios. We present BioMedVR, the first VR-based framework for biomedical imaging, enabling few-shot adaptation of pretrained VLMs through compact learnable VR modules. To mitigate class confusion, we introduce a Confusion Minimization Mechanism that leverages LLM-generated confusion-aware attributes together with a Confusion-Suppression Loss to explicitly reduce false-positive alignment. Moreover, the designed Mixture-of-Prompt Experts combines a positive expert for main-class discrimination and a negative expert for confusion suppression, balanced via adaptive gating. Extensive experiments on 18 datasets, including 11 biomedical datasets and 7 natural image benchmarks, demonstrate that BioMedVR achieves superior accuracy and generalization, effectively bridging VR and VLMs in biomedical domains.

2606.24737 2026-06-24 cs.CV 新提交

VSANet: View-aware Sparse Attention Network for Light Field Image Denoising

VSANet: 面向光场图像去噪的视图感知稀疏注意力网络

Gargi Panda, Soumitra Kundu, Saumik Bhattacharya, Aurobinda Routray

发表机构 * IIT Kharagpur(印度理工学院克勒格布尔分校)

AI总结 提出VSANet,通过视图感知稀疏注意力实现全局特征交互与线性复杂度,结合特征细化模块,在光场去噪中取得最优性能。

详情
AI中文摘要

由于光场数据的高维结构,光场图像去噪具有挑战性。虽然噪声在子孔径图像之间是独立的,但场景内容表现出强烈的跨视图相关性。我们引入了VSANet,一种用于光场去噪的视图感知稀疏注意力网络。具体来说,我们提出了一个视图感知稀疏注意力块,将4D光场特征图表示为统一的空间-角度标记空间,并通过基于局部敏感哈希的稀疏注意力执行跨视图聚合。这使得全局特征交互具有线性复杂度,有效地利用了跨视图和空间位置的光场相关性。此外,我们设计了一个特征细化块来强调空间、角度和对极子空间中的信息特征。VSA和FR块集成在一个顺序注意力细化模块中,构成了VSANet的核心。实验表明,VSANet优于最先进的光场去噪方法。

英文摘要

Light field (LF) image denoising is challenging due to the high-dimensional structure of LF data. While noise is independent across sub-aperture images, scene content exhibits strong cross-view correlations. We introduce VSANet, a view-aware sparse attention network for LF denoising. Specifically, we propose a view-aware sparse attention (VSA) block that represents the 4D LF feature map as a unified spatial-angular token space and performs cross-view aggregation via locality-sensitive hashing-based sparse attention. This enables global feature interactions with linear complexity, effectively exploiting LF correlations across views and spatial locations. In addition, we design a feature refinement (FR) block to emphasize informative features in spatial, angular, and epipolar subspaces. The VSA and FR blocks are integrated within a sequential attention refinement module, forming the core of VSANet. Experiments demonstrate that VSANet outperforms state-of-the-art LF denoising methods.

2606.24734 2026-06-24 cs.CL cs.AI cs.HC 新提交

Task Decomposition for Efficient Annotation

高效标注的任务分解

Nupoor Gandhi, Emma Strubell

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出将结构化标注任务分解为子任务以减少推理负担,基于中心理论建模推理负荷,并给出分配策略以在固定预算下最大化质量。

详情
AI中文摘要

高质量的结构化表示标注在大规模语料上收集成本高昂。手动标注结构费力,而基于模型的标注虽然生成成本较低,但需要昂贵的验证和潜在的显著监督,以确保标注质量足够强,以便在下游任务中有用。在传统标注工作流中,每个完整示例的标注由单个标注者端到端执行。然而,结构化标注是复杂的,任务的每个方面都代表了独特的挑战,并给特定标注者带来相应的推理负担。现代标注项目可以包含异质的标注者群体,包括模型和具有不同领域及语言专业知识的人类标注者。然而,在这种设置下,如何重新设计标注任务仍不清楚,其中努力根据不同的标注挑战有区别地分配给异质标注者。我们建议将标注任务分解为子任务,以减少标注项目的总体推理负担。受中心理论中中心概念的启发,我们引入了一个基于有效标注空间自由度的推理负荷形式化模型。使用该模型,我们表明识别这些中心(即由标注子任务实现的显著锚点实体)约束了输出空间复杂度,而隔离并推进中心识别的分解减少了总体推理负荷。我们提供了分解复杂结构化标注任务的指南,并附有来自我们先前工作的示例,展示了改进的成本效率。最后,我们提出了一个在固定预算下跨标注者分配子任务以最大化质量的程序。

英文摘要

High-quality annotations of structured representations are expensive to collect over large corpora. Manual annotation of structure is laborious, and model-based annotation, although cheaper to generate, requires expensive validation and potentially significant supervision to ensure that the annotation quality is strong enough to be useful downstream. In traditional annotation workflows, annotation of each complete example is performed end-to-end by a single annotator. However, structured annotation is complex, and each aspect of the task represents a unique challenge with an associated inferential load for a given annotator. Modern annotation projects can incorporate heterogeneous groups of annotators, including both models and human annotators with varying domain and linguistic expertise. It remains unclear, however, how to redesign annotation tasks in this setting, where efforts are discriminately allocated across heterogeneous annotators with respect to distinct annotation challenges. We propose to decompose annotation tasks into sub-tasks in order to reduce the aggregate inferential load of annotation projects. Inspired by the notion of centers from centering theory, we introduce a formal model of inferential load based on the degrees of freedom in the space of valid annotations. Using this model, we show that identifying these centers (i.e. salient anchor entities realized by annotation sub-tasks) constrains the output space complexity, and decompositions which isolate and advance center identification reduce the aggregate inferential load. We provide guidelines for decomposing complex structured annotation tasks, supported by examples demonstrating improved cost-efficiency from our prior work. Finally, we present a procedure for allocating sub-tasks across annotators to maximize quality under a fixed budget.
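
The final step of the abstract above, allocating sub-tasks across annotators to maximize quality under a fixed budget, can be sketched as a small exhaustive search. This is an illustrative assumption rather than the authors' procedure: the sub-task names, annotator names, costs, quality scores, the additive quality objective, and the function `allocate` are all hypothetical.

```python
from itertools import product


def allocate(subtasks, annotators, cost, quality, budget):
    """Pick one annotator per sub-task to maximize summed quality within budget.

    cost and quality map (subtask, annotator) pairs to numbers.
    Returns (total_quality, total_cost, assignment), or None if no
    assignment fits the budget. Exhaustive search, so only suitable
    for a handful of sub-tasks.
    """
    best = None
    for choice in product(annotators, repeat=len(subtasks)):
        pairs = list(zip(subtasks, choice))
        total_cost = sum(cost[p] for p in pairs)
        if total_cost > budget:
            continue
        total_quality = sum(quality[p] for p in pairs)
        if best is None or total_quality > best[0]:
            best = (total_quality, total_cost, dict(pairs))
    return best


# Hypothetical numbers: an expert is costly but much better at the
# center-identification sub-task; a model is nearly as good at linking.
subtasks = ["identify_centers", "link_relations"]
annotators = ["model", "expert"]
cost = {
    ("identify_centers", "model"): 1, ("identify_centers", "expert"): 5,
    ("link_relations", "model"): 1, ("link_relations", "expert"): 5,
}
quality = {
    ("identify_centers", "model"): 0.60, ("identify_centers", "expert"): 0.95,
    ("link_relations", "model"): 0.85, ("link_relations", "expert"): 0.90,
}
best = allocate(subtasks, annotators, cost, quality, budget=6)
```

With these made-up numbers the budget cannot cover the expert on both sub-tasks, so the search spends it where the quality gap is largest: the expert identifies centers and the model links relations.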

2606.24726 2026-06-24 cs.CV 新提交

SER: Learning to Ground Video Reasoning with Semantic Evidence Rewards

SER: 用语义证据奖励学习视频推理定位

Sheng Xia, Zhengqin Lai, Tianxiang Jiang, Kanghui Tian, Shoujun Zhou, Bin Li, Yi Wang

发表机构 * Nanjing University(南京大学) Shanghai Innovation Institute(上海创新研究院) Shanghai AI Laboratory(上海人工智能实验室) Harbin Institute of Technology(哈尔滨工业大学) Shenzhen Institutes of Advanced Technology(深圳先进技术研究院)

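AI summary: Proposes Semantic Evidence Reward (SER), which reformulates spatio-temporal evidence grounding as a constrained verification task and uses a referee VLM to judge the relevance and localization quality of evidence, improving mLGM by 3.0 points on the V-STAR benchmark.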

详情
AI中文摘要

视频多模态大语言模型在细粒度时空推理中常遇到困难,有时会基于不相关的帧或对象生成正确答案。尽管在推理过程中输出时空证据是一个有前景的方向,但现有的强化学习框架通常仅依赖几何(IoU)奖励,这容易受到边界扰动的影响并忽略语义对齐。为解决此问题,我们提出语义证据奖励(SER),将时空证据定位重构为约束验证任务。SER不计算像素级重叠,而是使用裁判VLM作为局部检查器,从两个维度(相关性和定位质量)评估模型生成的证据声明,并结合时间惩罚。这种设计减少了对密集框标注的依赖,并能够直接在标准视频问答数据上训练。在V-STAR基准上,SER实现了49.6%的mLGM,比强证据定位基线Open-o3-Video提高了3.0个百分点,展示了其在提升答案准确性和证据定位方面的潜力。

英文摘要

Video MLLMs often struggle with fine-grained spatio-temporal reasoning, sometimes generating correct answers based on irrelevant frames or objects. Although outputting spatio-temporal evidence during reasoning is a promising direction, existing RL frameworks typically rely on geometry-only (IoU) rewards, which can be sensitive to boundary perturbations and overlook semantic alignment. To address this, we propose Semantic Evidence Reward (SER), which reformulates spatio-temporal evidence grounding as a constrained verification task. Instead of computing pixel-level overlap, SER uses a referee VLM as a local checker to evaluate model-generated evidence claims across two dimensions: relevance and localization quality, combined with a temporal penalty. This design reduces the reliance on dense box annotations and enables training directly on standard video QA data. On the V-STAR benchmark, SER achieves 49.6% mLGM, improving by 3.0 points over the strong evidence-grounded baseline Open-o3-Video, demonstrating its potential in enhancing both answer accuracy and evidence grounding.
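
A small sketch can make the contrast in the abstract above concrete. `temporal_iou` is the standard intersection-over-union for time intervals and shows how a geometry-only reward reacts to a boundary shift. `semantic_evidence_reward` is purely hypothetical: the abstract says only that relevance and localization scores from a referee VLM are combined with a temporal penalty, so the weighted sum, the weights, and the clipping at zero are assumptions, and the referee scores are passed in as plain numbers rather than produced by a model.

```python
def temporal_iou(pred, gold):
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def semantic_evidence_reward(relevance, localization, temporal_penalty,
                             w_rel=0.5, w_loc=0.5):
    """Hypothetical combination of referee scores (assumed to lie in [0, 1]).

    The weighted sum and the clipping at zero are assumptions of this
    sketch, not the formula used in the paper.
    """
    for score in (relevance, localization):
        if not 0.0 <= score <= 1.0:
            raise ValueError("referee scores are assumed to lie in [0, 1]")
    return max(0.0, w_rel * relevance + w_loc * localization - temporal_penalty)


# A one-second shift on a two-second segment cuts IoU to one third,
# even though the predicted span may still show the relevant event.
shifted_iou = temporal_iou((11.0, 13.0), (10.0, 12.0))

# Under the assumed formula, a claim judged relevant and well localized
# keeps a high reward despite a small temporal penalty.
reward = semantic_evidence_reward(relevance=1.0, localization=0.8,
                                  temporal_penalty=0.1)
```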

2606.24722 2026-06-24 cs.AI 新提交

Decentralised AI Training and Inference with BlockTrain

BlockTrain:去中心化AI训练与推理

Peter Toth

发表机构 * Spheroid Labs

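AI summary: Proposes BlockTrain, a decentralized training protocol that partitions a model into independently trainable blocks; on byte-level WikiText it reaches cross entropy 1.359, close to an end-to-end Transformer reference, while each worker trains only one block and avoids full-model optimizer state.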

Comments First arXiv version. 17 pages

详情
AI中文摘要

前沿AI训练越来越依赖于对密集、集中控制的加速器集群的访问。这为超大规模云服务商和大型集中化实验室创造了结构性优势,并使开放或独立的AI工作依赖于稀缺资本、特权基础设施和数据中心地理位置。我们提出了Spheroid BlockTrain,一种去中心化训练协议,其中模型被划分为独立可训练的块,每个块基于同一全局目标派生的局部目标进行优化,并在推理时组合成一个模型。在字节级WikiText上,BlockTrain达到了交叉熵1.359(困惑度3.89),与相同设置的端到端Transformer参考相差约0.04 CE,而每个活跃工人只训练一个块,避免了全模型优化器状态。一个共享的六工人块训练运行通过将相同块的更新平均到一个组装模型中,达到了CE 1.385。HTTP/TCP传输实验移动了真实的序列化检查点和更新,包括一个公共IP三主机运行,将CE从5.580改进到1.811,同时移动了15.22 GB数据。对于推理,当前的BlockTrain路径每次完整输出使用一次块堆栈遍历,并通过三个公共网络GPU主机上的直接TCP提供服务,逻辑fp16形状高达75.80B参数,优于匹配的纯自回归TCP流水线基线,因为它每次WAN流水线遍历生成一个完整序列,而不是每个遍历生成一个token。

英文摘要

Frontier AI training is increasingly shaped by access to dense, centrally controlled accelerator clusters. This creates a structural advantage for hyperscalers and large centralized laboratories, and makes open or independent AI efforts depend on scarce capital, privileged infrastructure, and data-center geography. We present Spheroid BlockTrain, a decentralized training protocol in which a model is partitioned into independently trainable blocks, each optimized on a local objective derived from the same global target and composed at inference into one model. On byte-level WikiText, BlockTrain reaches cross entropy 1.359 (perplexity 3.89), within about 0.04 CE of a same-setup end-to-end Transformer reference, while each active worker trains only one block and avoids full-model optimizer state. A shared six-worker block training run reaches CE 1.385 by averaging same-block updates into one assembled model. HTTP/TCP transport experiments move real serialized checkpoints and updates, including a public-IP three-host run that improves CE from 5.580 to 1.811 while moving 15.22 GB. For inference, the current BlockTrain path uses one block-stack traversal per full output and serves over direct TCP across three public-network GPU hosts up to a 75.80B-parameter logical fp16 shape, outperforming a matched plain-autoregressive TCP pipeline baseline because it emits a full sequence per WAN pipeline traversal rather than one token per traversal.
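
The reported figures above can be checked with a little arithmetic. Perplexity is the exponential of cross entropy measured in nats, which is consistent with the stated pair of 1.359 and 3.89. The sketch also shows an element-wise mean as one simple way to average same-block updates from several workers; the abstract does not specify the averaging rule, so `average_updates` and its unweighted mean are assumptions for illustration.

```python
import math


def perplexity(cross_entropy_nats):
    """Perplexity corresponding to a cross entropy given in nats."""
    return math.exp(cross_entropy_nats)


def bits_per_byte(cross_entropy_nats):
    """Convert a per-byte cross entropy from nats to bits."""
    return cross_entropy_nats / math.log(2)


def average_updates(updates):
    """Element-wise mean of same-block parameter updates.

    updates: list of equal-length float lists, one per worker.
    An unweighted mean is an assumption of this sketch.
    """
    n_workers = len(updates)
    return [sum(values) / n_workers for values in zip(*updates)]


print(round(perplexity(1.359), 2))  # 3.89, matching the reported perplexity

# Two hypothetical workers training the same block.
merged = average_updates([[1.0, 2.0], [3.0, 4.0]])
```

Since the model is byte-level, the same cross entropy can also be read as roughly 1.96 bits per byte via `bits_per_byte(1.359)`.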