arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.06978 2026-06-08 cs.CV New submission

CL-CLIP: CLIP-Based Continual Learning Framework with Cost-Volume Category Decoupling for Object Detection

Zihan Liu, Yuguang Yang, Shengjie Su, Jianing Pang, Linlin Yang, Chunyu Xie, Nikolai Yu. Zolotykh, Baochang Zhang

Affiliations: National College for Excellent Engineers, Beihang University; AI Research, Qihoo 360; School of Electronic Information Engineering, Beihang University; School of Cyber Science and Technology, Beihang University; School of Computer Science and Engineering, Beihang University; State Key Laboratory of Media Convergence and Communication, Communication University of China; Institute of Information Technology, Mathematics and Mechanics, Lobachevsky University; School of Artificial Intelligence, Beihang University

AI summary: Proposes CL-CLIP, a framework that uses cost-volume-guided category decoupling to strengthen the continual learning ability of open-vocabulary detectors and mitigate catastrophic forgetting; on PASCAL VOC and MS-COCO it substantially improves the F-ViT baseline.

Abstract

Continual Object Detection (COD) requires a detector to acquire new categories over time while preserving previously learned ones. This goal is closely related to open-vocabulary detection, since both settings require reasoning over categories that are not fully covered by the annotations available at the current training stage. Recent CLIP-based open-vocabulary detectors have shown strong zero-shot generalization, and frameworks such as F-ViT demonstrate that vision-language pretraining can provide powerful zero-shot detection ability for unseen categories. However, real-world deployments cannot remain purely zero-shot: once these detectors are continually updated on newly introduced categories, they suffer severe catastrophic forgetting and quickly lose their previously calibrated detection ability. We therefore propose CL-CLIP, a CLIP-based COD framework that equips open-vocabulary detectors with better continual learning ability through cost-volume-guided category decoupling. Specifically, following CAT-Seg, we compute a CLIP image-text similarity cost volume, defined as dense category-wise response maps between visual tokens and class text embeddings. This zero-shot spatial prior decomposes shared region features into class-specific pathways, which are then processed by a Multi-Expert RoI head. Extensive experiments on PASCAL VOC and MS-COCO show that CL-CLIP substantially improves the F-ViT baseline under continual fine-tuning and achieves competitive performance with existing continual object detectors, especially in adapting to newly introduced categories while preserving competitive base-class performance.
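
The cost volume described in the abstract (dense category-wise response maps between visual tokens and class text embeddings) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes cosine similarity as the image-text similarity, uses random arrays in place of real CLIP features, and the function name `cost_volume` is hypothetical.

```python
import numpy as np

def cost_volume(visual_tokens, text_embeds, eps=1e-8):
    """Cosine similarity between every visual token and every class embedding.

    visual_tokens: (H, W, D) patch features (stand-ins for CLIP image tokens).
    text_embeds:   (C, D) class text embeddings.
    Returns a (C, H, W) array: one dense response map per category.
    """
    v = visual_tokens / (np.linalg.norm(visual_tokens, axis=-1, keepdims=True) + eps)
    t = text_embeds / (np.linalg.norm(text_embeds, axis=-1, keepdims=True) + eps)
    return np.einsum("hwd,cd->chw", v, t)

# Random placeholders; real inputs would come from a CLIP image and text encoder.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 5, 16))
texts = rng.normal(size=(3, 16))
cv = cost_volume(tokens, texts)  # shape (3, 4, 5)
```

Each slice `cv[c]` is a spatial map for one category, which is the kind of class-specific signal a per-category pathway could consume.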

2606.06977 2026-06-08 cs.RO New submission

Compliance-Based Sensor Placement for Force Sensing on a Sensorized Prostate Phantom

Sizhe Tian, Yinoussa Adagolodjo, Jeremie Dequidt

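Affiliations: CRIStAL, DEFROST, Polytech Lille

AI summary: Proposes a compliance-based weighted greedy sensor placement method for force sensing on a digital rectal examination training phantom, increasing force reconstructability in the target region by 22.5% over a global QR-based method.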

Abstract

This work presents a compliance-based sensor placement method for force sensing on a sensorized prostate phantom designed for Digital Rectal Examination training. The phantom combines three internal pneumatic chambers, used as intrinsic pressure sensors, with ten surface displacement markers. A finite-element simulation dataset is generated by applying external forces at sampled surface locations, from which a compliance matrix relating force inputs to pressure and displacement responses is constructed. Based on this matrix, we propose a weighted greedy selection strategy that maximizes local force reconstructability while prioritizing the clinically relevant posterior contact region and avoiding marker placement directly within the Region of Interest. Compared with a global QR-based placement strategy, the proposed method increases the mean reconstructability score in the target region by 22.5%. These results suggest that region-aware sparse sensor placement can improve force observability in soft robotic medical phantoms while maintaining a limited and practical sensing configuration.
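
A weighted greedy selection over a compliance matrix might look like the sketch below. The abstract does not define its reconstructability score, so this example substitutes a standard D-optimal (log-determinant) criterion; the region weights, the `forbidden` set of candidate indices, and the function name are all assumptions made for illustration.

```python
import numpy as np

def greedy_placement(C, weights, k, forbidden=()):
    """Greedily pick k sensor candidates (rows of C).

    C:         (n_candidates, n_forces) compliance rows, one per candidate sensor.
    weights:   (n_forces,) emphasis per force location (larger in the target region).
    forbidden: candidate indices that may not be selected.
    Criterion (an assumption): maximize logdet of the weighted information matrix.
    """
    A = C * np.sqrt(weights)              # scale columns by region weights
    M = 1e-6 * np.eye(A.shape[1])         # small regularizer keeps M invertible
    chosen = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(A.shape[0]):
            if i in chosen or i in forbidden:
                continue
            # Matrix determinant lemma: logdet(M + a a^T) - logdet(M) = log(1 + a^T M^-1 a)
            gain = np.log1p(A[i] @ np.linalg.solve(M, A[i]))
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:                  # no admissible candidates left
            break
        chosen.append(best)
        M = M + np.outer(A[best], A[best])
    return chosen

C = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [2.0, 0.0]])
picked = greedy_placement(C, weights=np.array([1.0, 1.0]), k=2, forbidden={3})
```

Here candidate 3 has the strongest response but is excluded, mirroring the idea of keeping markers out of the region of interest.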

2606.06976 2026-06-08 cs.AI New submission

Exploring Agentic Tool-Calling Decisions via Uncertainty-Aligned Reinforcement Learning

Yijin Zhou, Linqian Zeng, Xiaoya Lu, Wenyuan Xie, Dongrui Liu, Junchi Yan, Jing Shao

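Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; Shanghai Innovation Institute

AI summary: Addresses error accumulation in agentic tool calling with TRUST, which incorporates uncertainty quantification into reward design as a repulsive force and labels lightweight key-turn annotations for unified post-training on multi-turn trajectories, significantly improving decision quality and agent performance.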

Abstract

Large language model (LLM)-based agents often make suboptimal tool-use decisions, including unsupported tool invocation and hallucinated direct responses, which may accumulate errors throughout multi-step interactions. Existing approaches mainly improve these behaviors through inference-time correction or coarse-grained reward signals based on decision outcomes and structured checklists, leaving the uncertainty characteristics of agent decisions underexplored. We observe that decision-oriented reinforcement learning tends to weaken the uncertainty separation between correct and incorrect actions, resulting in overconfident mistakes and weaker exploration signals. Therefore, we propose TRUST, which incorporates uncertainty quantification into reward design as a repulsive force for maintaining uncertainty separation, and labels lightweight key-turn annotations for unified post-training of multi-turn trajectories. Experimental results across diverse tool-use benchmarks show that TRUST consistently enhances both decision quality and agent performance while maintaining more reliable uncertainty estimates during optimization.

2606.06975 2026-06-08 cs.SD eess.AS New submission

MyGardenBird: A Machine-Learning-Ready Bird Sound Dataset for Twelve Common Malaysian Birds

Muhammad Mun'im Ahmad Zabidi, Mohd Yamani Idna Idris, Norisma Idris

Affiliations: Faculty of Computer Science and Information Technology, Universiti Malaya; Faculty of Electrical Engineering, Universiti Teknologi Malaysia

AI summary: Presents MyGardenBird, a dataset of 7,200 manually validated audio clips of 12 common Malaysian bird species sourced from Xeno-canto; baseline convolutional neural network experiments reach 92–96% classification accuracy.

Comments
17 pages, 9 figures

Abstract

Bioacoustic datasets from tropical regions remain limited, in part due to the absence of reproducible workflows for aggregating recordings from public archives. We present MyGardenBird, a curated dataset of bird vocalisations representing twelve common species across Peninsular Malaysia and the Indo-Malayan region. Recordings were sourced from Xeno-canto and processed through species-level filtering, manual spectrogram segmentation, and quality control checks. The primary release comprises 7,200 manually validated audio clips (16 kHz, 16-bit PCM mono WAV), balanced at 600 three-second clips per species (6.0 hours total) derived from 1,381 distinct recordings. Metadata includes geospatial coordinates, vocalisation categories, and signal-to-noise ratio (SNR) values (range: 0.83–59.18 dB; mean: 15.80 dB). A supplementary 44.1 kHz version is also provided. To mitigate data leakage, dataset partitions are defined at the source-recording level. Baseline classification experiments using convolutional neural networks on Mel-spectrograms achieved test accuracies of 92–96%, indicating strong interspecies separability. Limitations include reliance on single-annotator curation; however, validation with BirdNET confirmed label consistency. MyGardenBird is openly available at https://doi.org/10.5281/zenodo.20306877 under a CC BY-NC-SA 4.0 licence. Complete preprocessing code accompanies the release to support reproducibility and future expansion.
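
Defining partitions at the source-recording level means that all clips cut from one recording land on the same side of the split. The sketch below shows one way to do that, together with the clip-count arithmetic from the abstract. It is an illustration only: the clip and recording identifiers are made up, and the dataset's actual split procedure and ratios are not stated here.

```python
import random
from collections import defaultdict

def split_by_recording(clips, test_fraction=0.2, seed=0):
    """clips: list of (clip_id, recording_id) pairs.

    Returns (train_ids, test_ids) such that no recording contributes
    clips to both partitions.
    """
    by_rec = defaultdict(list)
    for clip_id, rec_id in clips:
        by_rec[rec_id].append(clip_id)
    rec_ids = sorted(by_rec)
    random.Random(seed).shuffle(rec_ids)
    n_test = max(1, round(len(rec_ids) * test_fraction))
    test_recs = set(rec_ids[:n_test])
    train = [c for r in rec_ids if r not in test_recs for c in by_rec[r]]
    test = [c for r in rec_ids if r in test_recs for c in by_rec[r]]
    return train, test

# Hypothetical toy data: 30 clips cut from 10 recordings, 3 clips each.
clips = [(f"clip{i}", f"rec{i // 3}") for i in range(30)]
train_ids, test_ids = split_by_recording(clips)

# Sanity check of the stated size: 12 species x 600 clips x 3 s each.
total_hours = 12 * 600 * 3 / 3600  # 6.0
```

Splitting on clips instead of recordings would let near-identical segments of one recording appear in both train and test, which inflates accuracy.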

2606.06967 2026-06-08 cs.LG New submission

GenPO++: Generative Policy Optimization with Jacobian-free Likelihood Ratios

Ke Hu, Shutong Ding, Panxin Tao, Jingya Wang, Ye Shi

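Affiliations: ShanghaiTech University

AI summary: Proposes GenPO++, a framework that uses history states in a high-order reversible ODE solver as auxiliary memory to obtain an exactly invertible map, enabling unbiased and efficient likelihood-ratio computation for generative flow policies and achieving competitive or superior performance on continuous-control tasks.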

Abstract

Generative policies provide expressive and multimodal action distributions, making them attractive for reinforcement learning (RL) in complex continuous-control tasks. Among them, flow-based policies are especially appealing because they generate actions through deterministic transport maps. However, applying such generative policies to likelihood-based on-policy learning remains limited by the difficulty of evaluating the probability of executed actions. Existing flow RL methods either replace the true action-density ratio with approximate surrogates, which can introduce biased updates, or recover exact likelihoods through dummy-action augmentation, which enlarges the policy space and increases computation. In this work, we propose GenPO++, a reversible generative policy optimization framework that uses history states as auxiliary memory in a high-order reversible ODE solver, yielding exact inversion without changing the original action dimension. The resulting generative policy map has a log-determinant determined only by fixed solver coefficients, enabling exact and Jacobian-free likelihood-ratio computation. This design preserves the expressiveness of generative flow policies while avoiding both action ratio bias and dummy-action overhead. We evaluate GenPO++ on large-scale simulated control, fine-tuning, and real-world robotic manipulation tasks, where it achieves competitive or superior performance over state-of-the-art on-policy RL methods, while improving training stability and computational efficiency.
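
To see how a history state can make an ODE step exactly invertible with a Jacobian-free log-determinant, consider a two-step leapfrog-style update on the pair (previous state, current state). This is a textbook reversible scheme chosen for illustration, not the solver used in GenPO++, and the velocity field `f` is a toy stand-in for a learned network.

```python
import numpy as np

def f(x):
    """Toy velocity field standing in for a learned flow network (hypothetical)."""
    return np.tanh(x)

def forward(prev, cur, h):
    """One reversible step: (prev, cur) -> (cur, prev + 2 h f(cur))."""
    return cur, prev + 2.0 * h * f(cur)

def inverse(cur, nxt, h):
    """Exact algebraic inverse of forward: recovers (prev, cur) from (cur, nxt)."""
    return nxt - 2.0 * h * f(cur), cur

h = 0.1
prev0 = np.array([0.3, -1.2])
cur0 = np.array([0.5, 0.7])
a, b = forward(prev0, cur0, h)
prev_rec, cur_rec = inverse(a, b, h)  # equals (prev0, cur0) up to float round-off
```

The Jacobian of this step on the augmented state has the block form [[0, I], [I, 2h Df]], whose determinant has absolute value 1 for any `f`. The log-determinant is therefore a constant (zero here) that never requires differentiating the network, which is the flavor of property the abstract describes.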

2606.06966 2026-06-08 cs.CV New submission

From Vision to Text: A Compact Multimodal Approach for Robust, Cross-Domain Presentation Attack Detection on ID Cards

Qingwen Zeng, Juan E. Tapia, Sneha Das, Christoph Busch

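Affiliations: da/sec-Biometrics and Security Research Group, Hochschule Darmstadt; Technical University of Denmark (DTU)

AI summary: Addresses cross-domain shift in presentation attack detection on ID cards with a compact multimodal model that combines visual and textual data; finds that generalisation is strong after supervised fine-tuning but fails in zero-shot settings, underscoring the importance of model capacity and real-world data.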

Comments
Publication under the revision process on IEEE

Abstract

Cross-domain shifts challenge Presentation Attack Detection (PAD) on ID Cards, given the restricted data available due to privacy concerns. This work proposes a compact multimodal model, based on new generative and discriminative blocks, which combines visual and textual data for PAD on genuine and synthetic ID images. While multimodal models exhibit strong generalisation after supervised fine-tuning, they fail in zero-shot settings. Our findings underscore that model capacity and real-world data are essential for reliable PAD, while existing synthetic datasets may not reflect real-world challenges. We argue for a re-evaluation of synthetic data as a benchmark and emphasise the need for more realistic, diverse datasets to advance PAD research.

2606.06959 2026-06-08 cs.CL cs.AI New submission

OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios

Xinyi Li, Zhen Fang, Yongxin Deng, Jinyuan Luo, Hongnan Ma, Changdae Oh, Zijing Shi, Shanshan Ye, Hanchen Wang, Shu-Lin Chen, Yadan Luo, Mengyue Yang, Sean Du, Sharon Li, Ling Chen

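Affiliations: University of Technology Sydney; University of Wisconsin–Madison; University of Bristol; The University of Queensland; Nanyang Technological University

AI summary: Proposes the OpenHalDet benchmark, which standardizes the hallucination detection evaluation pipeline, supports black-box, gray-box, and white-box detectors, and enables controlled comparison across tasks, models, and detectors.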

Comments
Preprint. Code and data are available at https://github.com/Nellie179/Hallucination-Detection

Abstract

Hallucination detection is essential for the reliable deployment of large language models (LLMs). However, existing evaluations face two core challenges: inconsistent inference configuration and evaluation, and limited coverage of downstream domains and tasks. Consequently, reported detector performance is often difficult to compare, reproduce, and generalize beyond specific experimental settings. We introduce OpenHalDet, a unified benchmark for hallucination detection across diverse generation scenarios. OpenHalDet standardizes the evaluation pipeline, from prompt construction and response generation to truthfulness annotation, detector scoring, and metric computation. It supports heterogeneous detector families under different access settings, including black-box methods that use only generated outputs, gray-box methods that rely on probability-based signals, and white-box methods that exploit internal model signals. By bringing diverse tasks, models, and detectors into a shared framework, OpenHalDet enables controlled comparison and provides a systematic view of how different detection paradigms behave in LLM applications. We release OpenHalDet as an open and extensible codebase to facilitate reproducible evaluation and future development of hallucination detection methods. The code and datasets are available at https://github.com/Nellie179/Hallucination-Detection.

2606.06958 2026-06-08 cs.CV New submission

MVSegNet: A Lightweight Boundary-Aware Network for Fetal Lateral Ventricle Segmentation and Atrial Width Estimation in Prenatal Ultrasound

Arafat Hossain Sayem

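Affiliations: Department of Computer Science & Engineering, Stamford University Bangladesh

AI summary: Proposes MVSegNet, a lightweight boundary-aware network combining multi-scale feature extraction with boundary refinement; on 584 ultrasound frames it segments the lateral ventricle with a Dice score of 80.79% and an atrial width mean absolute error of 3.40 mm, while remaining fast and parameter-efficient.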

Comments
11 pages, 3 figures, 4 tables. Code and trained models will be released upon acceptance. Supplementary material available upon request

Abstract

Fetal ventriculomegaly is assessed by measuring the atrial width of the lateral ventricle in prenatal ultrasound. Accurate segmentation is essential for this measurement, but acoustic shadowing, speckle noise, and poor contrast make it difficult. We developed MVSegNet, a lightweight encoder-decoder network combining multi-scale feature extraction and boundary-aware refinement. The model was trained and evaluated on 584 expert-annotated transventricular ultrasound frames using a 70/15/15 split. Performance was compared against six segmentation baselines using overlap, boundary, and measurement metrics. MVSegNet achieved a Dice score of 80.79%, IoU of 68.47%, Hausdorff distance of 4.07 mm, and atrial width mean absolute error of 3.40 mm. The model contains 2.31 million parameters and runs at 165.6 frames per second on an NVIDIA T4 GPU. MVSegNet outperformed all evaluated baselines on boundary and measurement metrics while maintaining low computational cost, supporting its use in automated fetal ultrasound analysis.
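
For readers less familiar with the overlap metrics quoted above, the following sketch computes Dice and IoU for a pair of binary masks using their standard definitions. The masks are toy arrays, and the convention of returning 1.0 when both masks are empty is an assumption rather than something stated in the abstract.

```python
import numpy as np

def dice_and_iou(pred, target):
    """Dice = 2|P∩T| / (|P| + |T|) and IoU = |P∩T| / |P∪T| for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    dice = 2.0 * inter / total if total else 1.0   # both masks empty: perfect match
    iou = inter / union if union else 1.0
    return float(dice), float(iou)

pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
dice, iou = dice_and_iou(pred, target)  # intersection 2, sizes 3 and 3, union 4
```

For a single mask pair the two metrics are tied by IoU = Dice / (2 - Dice). That identity need not hold between dataset-level averages, so a reported mean Dice and mean IoU can legitimately deviate from it.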

2606.06953 2026-06-08 cs.RO New submission

LIMMT: Less is More for Motion Tracking

Yu Guan, Zekun Qi, Chenghuai Lin, Xuchuan Chen, Dairu Liu, Wenyao Zhang, Jilong Wang, Xinqiang Yu, He Wang, Li Yi

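Affiliations: Tsinghua University; GalBot; Shanghai Jiao Tong University; Peking University; Shanghai Qi Zhi Institute

AI summary: Proposes LIMMT, a data-centric motion tracking framework that selects high-quality motion data along three dimensions (physics feasibility, diversity, and complexity); training on under 3% of AMASS outperforms training on the full dataset.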

Comments
Accepted at ICML 2026

Abstract

We argue that high-quality motion data can steer tracking policies toward better optimization trajectories early in training. In this work, we introduce LIMMT (Less Is More for Motion Tracking). To our knowledge, this is the first data-centric study for physics-based humanoid motion tracking. We go beyond simply removing low-quality and erroneous clips and define motion data quality through three dimensions: physics feasibility, diversity, and complexity. We show that even training with under 3% of AMASS yields better tracking performance than training with the full dataset. We further conduct data cleaning on the estimated web-sourced mocap data. Extensive experiments and analyses validate the effectiveness of our framework.

2606.06950 2026-06-08 cs.CV cs.AI New submission

When is 3D Worth It? A Resource-Performance Frontier for CNNs and Transformers in Lung CT

Md Enamul Hoq, Sharafat Hossain, Imraul Emmaka, Linda Larson-Prior, Lawrence Tarbox, Jonathan Bona, Donald Johann Jr. and Fred Prior

Affiliations: Department of Biomedical Informatics, University of Arkansas for Medical Sciences; Department of Information Science, University of Arkansas at Little Rock; Department of Neuroscience, University of Arkansas for Medical Sciences

AI summary: Studies how 2D, 2.5D, and 3D inputs affect CNNs and Transformers on lung CT; finds that the 2.5D CNN offers the best discrimination-stability trade-off, while 3D CNNs show threshold instability and Transformers produce degenerate predictions.

Comments
8 pages, 6 figures

Abstract

Three-dimensional models are widely assumed preferable for volumetric medical imaging, yet their practical value depends on whether performance gains justify added computational cost and complexity. Rather than proposing a new architecture, we study how input dimensionality (2D, 2.5D, 3D) affects model behavior across convolutional neural networks (CNNs) and Vision Transformers (ViTs) under a fixed training protocol. Using a leakage-free NLST cohort (n = 1,977) with supporting LIDC-IDRI data, we find that the 2.5D CNN offers the most favorable discrimination-stability trade-off in our comparison (ROC-AUC 0.682, 95% CI [0.546, 0.799]) with a stable operating point. In contrast, 3D CNNs show threshold instability, and transformers exhibit degenerate predictions, such as all-positive predictions. Confidence intervals are wide and overlapping, so we present these results as a controlled resource-performance frontier and a failure-mode taxonomy rather than as definitive superiority claims. For class-imbalanced lung cancer screening classification, 2D and 2.5D inputs provide a more reliable trade-off between performance, stability, and computational efficiency than full 3D representations.
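
The abstract does not spell out how its 2.5D inputs are built. A common construction, assumed here, stacks a slice with its immediate neighbours as channels so that a 2D network sees a thin slab of the volume. The function name, the `context` parameter, and the edge-clamping behaviour are illustrative choices.

```python
import numpy as np

def make_2p5d(volume, index, context=1):
    """Build a 2.5D input from a (depth, H, W) CT volume.

    Returns the slices index-context .. index+context stacked as channels,
    shape (2*context + 1, H, W). Indices past either end are clamped, so
    edge slices are repeated instead of padded with zeros.
    """
    depth = volume.shape[0]
    idx = np.clip(np.arange(index - context, index + context + 1), 0, depth - 1)
    return volume[idx]

vol = np.arange(5 * 2 * 2).reshape(5, 2, 2)   # toy 5-slice volume
x_mid = make_2p5d(vol, 2)                     # slices 1, 2, 3
x_edge = make_2p5d(vol, 0)                    # slices 0, 0, 1 (clamped)
```

With `context=0` this reduces to the plain 2D case, while feeding the whole volume corresponds to the 3D case, which is why 2.5D is often treated as a middle point on the cost axis.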

2606.06944 2026-06-08 cs.RO New submission

T-GMP: Terrain-conditioned Generative Motion Priors for Versatile and Natural Humanoid Locomotion

Junhong Guo, Hao Hu, Chen Chen, Haoxuan Han, Linao Gong, Xin Yang, Zhicheng He, Yao Su, Fenghua He

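Affiliations: Harbin Institute of Technology; Leju Robotics

AI summary: Proposes the T-GMP module, which uses a conditional variational autoencoder to learn a terrain-conditioned latent motion manifold from a few expert demonstrations and combines adversarial learning with a foothold penalty, enabling versatile, natural locomotion that adapts to terrain changes under a unified policy.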

Abstract

Achieving both anthropomorphic naturalness and robust terrain traversal remains a fundamental challenge in humanoid locomotion. Existing Reinforcement Learning (RL) approaches typically rely on fixed motion priors, limiting their adaptability to varying environments. We propose Terrain-conditioned Generative Motion Priors (T-GMP), a module that captures a terrain-conditioned latent motion manifold from a few expert state-terrain demonstrations using a Conditional Variational Autoencoder (CVAE). The learned priors enable smooth style transitions, facilitating a unified policy that adapts to terrain variations. We integrate T-GMP into an adversarial learning pipeline with our proposed Foothold Penalty, where a discriminator dynamically modulates naturalness constraints conditioned on local terrain features, guiding the generation of versatile and human-like motions. Experimental results demonstrate that our method outperforms existing baselines in traversal success rate and motion smoothness, while preserving biomimetically natural and physically coordinated motions.

2606.06942 2026-06-08 cs.CL cs.AI New submission

Didact: A Cross-Domain Capability Discovery System for Defence

Aarya Bodhankar, Aditya Joshi, Bao Gia Doan, Thomas Marchant, Oscar Leslie, Flora Salim

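Affiliations: University of New South Wales, Sydney, Australia; Cyndr.ai, Australia

AI summary: Presents Didact, a prototype system that builds a knowledge graph and a composite retrieval-augmented generation pipeline to integrate heterogeneous defence reports and policy documents, supporting natural language conversation and visual evidence tracing to address fragmented cross-domain capability discovery.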

Comments
Under Review at CIKM 2026 (System Demonstration Track)

Abstract

Policymakers in defence and defence-aligned sectors must monitor rapidly evolving research alongside sector priorities relevant to operational and strategic needs. In practice, these sources are fragmented across heterogeneous formats, disjoint repositories, and siloed update streams, making capability discovery slow and difficult to audit. We present Didact, a prototype that integrates publicly available defence reports and policy documents from Australia with a purpose-built knowledge graph derived from Australian research publications. Didact provides natural language conversations for policy-oriented workflows, and leverages a composite retrieval-augmented generation (RAG) pipeline. A key feature of Didact is an interactive Evidence Rail that visualises retrieved evidence and source relationships. Our evaluation of the output quality and runtime of Didact highlights its utility. While Didact has been co-developed as an academia-industry project for the Australian context, it is adaptable to other domains where knowledge is similarly fragmented. A demonstration video is available here:

2606.06941 2026-06-08 cs.AI New submission

Quantum-Inspired Trace-Augmented Evidence Selection for Reasoning over Structured Hypothesis Spaces

Laura Wynter, Nirvik Sahoo, Paul Griffin

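Affiliations: School of Computing and Information Systems, Singapore Management University

AI summary: Proposes EP-HUBO, which casts the selection of CoT reasoning fragments as a combinatorial optimisation problem and aggregates evidence via higher-order binary optimisation, giving more weight to minority-but-correct hypotheses on evidence-intensive legal reasoning benchmarks.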

Abstract

Large language models (LLMs) now solve a wide range of expert-level exams at or above human level, yet remain brittle on specialised, evidence-intensive domains such as law. On these tasks, errors arise not only from gaps in world knowledge but also from subtle distinctions between pieces of evidence and inconsistent use of supporting evidence. The most common aggregator over sampled chain-of-thought (CoT) traces, majority vote, returns the most popular answer regardless of whether its evidence is actually strongest. We propose to treat the selection of CoT reasoning fragments into a set of evidence as an explicit combinatorial optimisation problem, allowing well-supported but minority hypotheses to override noisy majorities, and to evaluate the approach on legal-reasoning benchmarks that are particularly sensitive to evidence quality. We introduce EP-HUBO (Evidence Pool Higher-Order Binary Optimisation), which generates multiple CoT traces with a small local model, parses fragments into per-hypothesis evidence pools, solves a higher-order unconstrained binary optimisation per pool with quality-derived weights (relevance, specificity, distinctiveness), and delegates a single adjudication call per question to a frontier model. We evaluate EP-HUBO on two evidence-intensive legal benchmarks using both simulated annealing on classical hardware and the Dirac-3 photonic entropy-quantum machine from Quantum Computing Inc. HUBO-style optimisation gives a principled way to aggregate reasoning fragments while preserving minority-but-correct hypotheses, and is most valuable in low-contamination domains where frontier models have not already absorbed the benchmark material.
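
A higher-order unconstrained binary optimisation (HUBO) objective assigns a coefficient to each product of binary selection variables, including products of three or more. The toy sketch below selects evidence fragments from one pool by exhaustive search. It is not the EP-HUBO pipeline: the coefficients are invented, and brute force replaces the simulated annealing and photonic hardware mentioned in the abstract, which only makes sense for a handful of variables.

```python
from itertools import product

def hubo_energy(x, terms):
    """Energy of binary vector x.

    terms maps a tuple of variable indices to a coefficient; a term
    contributes only when every variable in its tuple is 1.
    """
    return sum(coef for idx, coef in terms.items() if all(x[i] for i in idx))

def brute_force_minimum(n_vars, terms):
    """Exhaustively search all 2**n_vars assignments for the lowest energy."""
    return min(product((0, 1), repeat=n_vars), key=lambda x: hubo_energy(x, terms))

# Hypothetical pool of 3 fragments (negative = reward, positive = penalty):
terms = {
    (0,): -1.0, (1,): -0.8, (2,): -0.5,   # per-fragment quality rewards
    (0, 1): 1.5,                          # fragments 0 and 1 are redundant together
    (0, 1, 2): -2.0,                      # all three corroborate each other
}
best = brute_force_minimum(3, terms)      # (1, 1, 1), energy -2.8
```

The cubic term is what a purely quadratic model could not express: fragments 0 and 1 look redundant as a pair, yet the full triple is the best selection.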

2606.06934 2026-06-08 cs.LG New submission

Uniform Stability and Generalization Error of GD and SGD on Fixed-Point Parameters

Jonghyun Shin, Sejun Park

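Affiliations: Department of Artificial Intelligence, Korea University

AI summary: Studies the generalization error and uniform stability of gradient descent (GD) and stochastic gradient descent (SGD) over discrete parameter spaces; finds that deterministic rounding degrades the GD generalization error rate from $O(T/n)$ to $O(T/\sqrt{n})$, that SGD retains nontrivial stability guarantees under deterministic rounding, and that stochastic rounding can introduce generalization error that grows with the dimension.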

Abstract

We analyze generalization error, uniform stability, and uniform argument stability of gradient descent (GD) and stochastic gradient descent (SGD) over discrete parameter spaces, where each update involves deterministic or stochastic rounding. We show that deterministic rounding degrades the generalization error of GD on convex, Lipschitz, and smooth loss functions, increasing the rate from $O(T/n)$ to $O(T/\sqrt{n})$, and establish matching lower bounds. We further prove that uniform stability of GD becomes $\Omega(T)$, showing that stability-based generalization bounds are vacuous in this setting. In contrast, for the same losses, stochastic gradient descent with deterministic rounding admits nontrivial uniform stability guarantees, which differ qualitatively from the real-valued case and exhibit distinct dependencies on the number of iterations and the dimension: we prove tight bounds $O(T/n)$ for one dimension and $O(T^2/n)$ for higher dimensions. We also show that stochastic rounding can introduce generalization error that increases with the dimension; such a phenomenon is absent in standard real-valued optimization and in the deterministic rounding case. Finally, we provide upper bounds on uniform argument stability for stochastic rounding schemes and show that these bounds are tight when the loss can be represented as a sum of coordinate-wise functions.
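
The two rounding modes can be illustrated on a single gradient step over a fixed-point grid. This sketch assumes a uniform grid with spacing `delta` and the usual unbiased form of stochastic rounding; the paper's exact rounding operators and parameter ranges may differ, and the numbers are chosen only to make the effect visible.

```python
import numpy as np

def round_deterministic(x, delta):
    """Round to the nearest multiple of delta (ties to even, as np.round does)."""
    return delta * np.round(np.asarray(x, dtype=float) / delta)

def round_stochastic(x, delta, rng):
    """Round down or up to a multiple of delta.

    The probability of rounding up equals the fractional position of x
    between its two grid neighbours, so the result equals x in expectation.
    """
    scaled = np.asarray(x, dtype=float) / delta
    low = np.floor(scaled)
    go_up = rng.random(scaled.shape) < (scaled - low)
    return delta * (low + go_up)

delta = 0.125
w, grad, lr = 1.0, 0.1, 0.1
target = w - lr * grad                       # 0.99 before rounding
stuck = round_deterministic(target, delta)   # rounds back to 1.0: the step is lost

rng = np.random.default_rng(0)
samples = round_stochastic(np.full(100_000, target), delta, rng)
# Each sample is 0.875 or 1.0, and their mean is close to 0.99.
```

Deterministic rounding silently discards updates smaller than half the grid spacing, whereas stochastic rounding keeps them on average at the cost of added per-coordinate noise. This only gives intuition for why the two modes behave differently; it does not reproduce the paper's bounds.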

2606.06926 2026-06-08 cs.CV cs.MM New submission

SVHighlights: Towards Extremely Long Sport Video Highlight Detection

Donggyu Lee, Youngbin Ki, Jeonghun Kang, Taehwan Kim

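Affiliations: Ulsan National Institute of Science and Technology

AI summary: To address the inability of existing methods to handle highlight detection in extremely long videos, introduces SVHighlights, the first benchmark of its kind (320 sports videos averaging 2 hours each), and TF-SELECTOR, a training-free segment-based method that uses a large language model to fuse multimodal information and predict segment-level saliency scores, outperforming existing baselines on most metrics.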

Comments
Accepted to KDD 2026 (Datasets and Benchmarks Track). Project Page: https://leedongkyu2019.github.io/SVHighlights/

Abstract

While highlight detection for long-form videos is of great practical importance, most existing methods remain limited to short-form content, largely due to the absence of a suitable benchmark. To bridge this gap, we introduce SVHighlights, to the best of our knowledge, the first benchmark for highlight detection in extremely long sports videos, each exceeding one hour in duration, across multiple sports categories. SVHighlights is constructed from pairs of full-length sports videos and their corresponding official highlight videos using a dataset generation pipeline, enabling scalable label generation without conventional per-clip saliency annotation. The benchmark comprises 320 videos with an average duration of 2.00 hours and a total of 640.18 hours, substantially exceeding previous datasets. Existing methods also face fundamental challenges on long videos: models trained on short clips fail to generalize to hour-long content, and their clip-level scoring lacks the broader context needed to identify highlights. To address this and provide a strong baseline, we present TF-SELECTOR, a training-free segment-based approach that divides each video into context-aware segments by merging adjacent shots sharing the same semantic content, and predicts segment-level saliency scores using a large language model with multimodal inputs including visual captions, transcripts, and audio volume. Experiments demonstrate that TF-SELECTOR achieves superior performance across most metrics compared to Video Temporal Grounding (VTG)-tuned baselines, with improvements of +3.12 in HIT@1, +4.06 in HIT@K, and +2.95 in IoU. These results establish SVHighlights as a challenging testbed for long-form highlight detection and demonstrate that a simple segment-based strategy can effectively scale to hour-long videos.
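
Two pieces of the description above lend themselves to a short sketch: merging adjacent shots that share the same content into segments, and scoring a predicted segment against a reference with temporal IoU. How TF-SELECTOR decides that two shots share semantic content is not given in the abstract, so the per-shot `label` below is a hypothetical placeholder, and the IoU shown is the standard interval definition rather than the benchmark's exact protocol.

```python
def merge_adjacent_shots(shots):
    """Merge consecutive shots with the same label into one segment.

    shots: list of (start_sec, end_sec, label), sorted by start time.
    Returns a list of (start_sec, end_sec, label) segments.
    """
    segments = []
    for start, end, label in shots:
        if segments and segments[-1][2] == label:
            prev_start = segments[-1][0]
            segments[-1] = (prev_start, end, label)   # extend the open segment
        else:
            segments.append((start, end, label))
    return segments

def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

shots = [(0, 5, "rally"), (5, 9, "rally"), (9, 12, "replay"), (12, 20, "rally")]
segments = merge_adjacent_shots(shots)
# [(0, 9, "rally"), (9, 12, "replay"), (12, 20, "rally")]
overlap = temporal_iou((0, 10), (5, 15))   # 5 / 15
```

Scoring segments rather than fixed-length clips gives the scoring model a complete unit of play to judge, which matches the motivation stated for the segment-based design.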

2606.06924 2026-06-08 cs.LG 新提交

From Sampled Outcomes to Capability Distributions: Rethinking Supervision for LLM Routing

从采样结果到能力分布:重新思考LLM路由的监督

Guannan Lai, Haoran Hu, Long Chen, Zhenguo Li, Han-Jia Ye

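Affiliations: School of Artificial Intelligence, Nanjing University; National Key Laboratory for Novel Software Technology, Nanjing University; Hong Kong University of Science and Technology; Frontier Robotics

AI summary: A single response is a noisy supervision signal for LLM routing. The DARS framework instead builds routing supervision from a distributional view that accounts for both input-side and output-side uncertainty; experiments indicate that distribution-aware supervision is more stable and effective.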

详情
AI中文摘要

现有的LLM路由方法通常将模型对查询的单个响应作为训练路由器的能力标签。然而,由于LLM生成本质上是随机的,这种单次监督仅提供了查询-模型对行为的噪声观测,而非可靠的能力估计。我们表明,这种假设会向路由监督中引入系统性噪声,使得学习到的路由策略可靠性降低。为解决此问题,我们提出DARS(分布感知路由监督)框架,该框架从模型行为的分布视角构建路由监督。DARS不依赖单个生成的响应,而是考虑来自输入侧和输出侧的不确定性,捕捉语义等价的查询表述和随机生成如何影响模型性能。基于这些分布感知的观测,DARS为路由构建更可靠的监督信号。跨不同任务的实验表明,单次标签可能对模型选择产生误导,而分布感知监督提供更稳定的标签并改进学习到的路由行为。我们的结果表明,可靠的LLM路由应超越单次响应观测,并基于查询级模型能力分布。

英文摘要

Existing LLM routing methods typically treat a model's single response to a query as its capability label for training routers. However, because LLM generation is inherently stochastic, such single-shot supervision provides only a noisy observation of a query-model pair's behavior rather than a reliable capability estimate. We show that this assumption introduces systematic noise into routing supervision, making learned routing policies less reliable. To address this issue, we propose DARS (Distribution-Aware Routing Supervision), a framework that constructs routing supervision from a distributional view of model behavior. Instead of relying on a single generated response, DARS considers uncertainty from both the input side and the output side, capturing how semantically equivalent query formulations and stochastic generations affect model performance. Based on these distribution-aware observations, DARS builds more reliable supervision signals for routing. Experiments across diverse tasks show that single-shot labels can be misleading for model selection, while distribution-aware supervision provides more stable labels and improves learned routing behavior. Our results suggest that reliable LLM routing should move beyond single-response observations and be grounded in query-level model capability distributions.
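
The contrast between single-shot labels and distribution-aware labels can be sketched as follows. This is only an illustration of the idea, not the DARS procedure: the model names, the 0/1 correctness values, and the choice to average first over samples and then over paraphrases are assumptions made for the example.

```python
from statistics import mean
from typing import Dict, List

# outcomes[model][p][s] is 1 if the s-th sampled generation for the p-th
# paraphrase of the query was correct, else 0. All values are made up.
Outcomes = Dict[str, List[List[int]]]


def single_shot_label(outcomes: Outcomes, model: str) -> int:
    """Label from one response: first sample of the first phrasing."""
    return outcomes[model][0][0]


def capability_estimate(outcomes: Outcomes, model: str) -> float:
    """Mean success rate over paraphrases (input side) and samples (output side)."""
    per_paraphrase = [mean(samples) for samples in outcomes[model]]
    return mean(per_paraphrase)


outcomes: Outcomes = {
    "model_a": [[1, 0, 0, 0], [0, 0, 1, 0]],
    "model_b": [[0, 1, 1, 1], [1, 1, 0, 1]],
}

best_single_shot = max(outcomes, key=lambda m: single_shot_label(outcomes, m))
best_by_distribution = max(outcomes, key=lambda m: capability_estimate(outcomes, m))
```

In this toy data the single observed response favours `model_a`, while the averaged estimate favours `model_b`, which is the kind of misleading single-shot label the abstract describes.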

2606.06923 2026-06-08 cs.AI cs.SE 新提交

Declarative Skills for AI Agents in Knowledge-Grounded Tool-Use Workflows

知识驱动工具使用工作流中AI代理的声明式技能

M. Danish Lim, I. Danial Bin Sharudin, Wen Han Chen, Cedric Lim, Laura Wynter

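Affiliations: School of Computing and Information Systems; Singapore Management University

AI summary: Declarative agents, whose control flow is steered by natural-language skill files, are reported to outperform an imperative state machine and an unscaffolded baseline on knowledge-intensive customer-service workflows, but retrieval quality is the main bottleneck.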

详情
AI中文摘要

我们研究了在非结构化知识库上的现实客服工作流中,使用工具的AI代理的编排机制。我们认为声明式代理(即在系统提示中附加自然语言技能文件的AI代理)是一种有效的编排范式。具体地,我们比较了(i) 在推理时读取三个领域特定技能文件并自行决定控制流的DeclarativeAgent,(ii) 基于具有显式阶段的程序化状态机的ImperativeAgent,以及(iii) 基于$\tau$-Knowledge基准代理的无脚手架基线代理。我们的ImperativeAgent受递归语言模型和图编排框架中的外部化控制推理启发。我们将三种代理形式化为分散部分可观察马尔可夫决策过程中的策略类,并分析其信息论和结构特性;然后在五个语言模型和两种检索机制下实证测试预测的差异。结果表明,检索质量是AI代理的主要瓶颈:当证据不完整或偏斜时,所有代理性能大幅下降,技能文件无法恢复损失的性能。然而,在高品质检索下,声明式技能在程序性任务上持续提高准确性并减少编排错误,而命令式状态机的脆弱性并未可靠地提高任务成功或合规性。

英文摘要

We study orchestration mechanisms for tool-using AI agents in realistic customer-service workflows over an unstructured knowledge base. We argue that declarative agents -- AI agents equipped with natural-language skill files appended to the system prompt -- are an effective orchestration paradigm. Concretely, we compare (i) a DeclarativeAgent that reads three domain-specific skill files at inference time and decides its own control flow, (ii) an ImperativeAgent based on a programmatic state machine with explicit phases, and (iii) an unscaffolded baseline agent modeled after the $τ$-Knowledge benchmark agent. Our ImperativeAgent is motivated by externalised-control inference as in Recursive Language Models and graph-based orchestration frameworks. We formalise the three agents as policy classes within a decentralised partially-observable Markov decision process and analyse their information-theoretic and structural properties; we then test the predicted differences empirically on five language models and two retrieval regimes. Our results show that retrieval quality is a dominant bottleneck for AI agents: when evidence is incomplete or skewed, all agents degrade substantially, and skill files cannot recover lost performance. Under high-quality retrieval, however, declarative skills consistently improve accuracy on procedural tasks and reduce orchestration errors, while the imperative state machine's brittleness does not reliably improve task success or compliance.

2606.06918 2026-06-08 cs.CV 新提交

DRIFT: From Robustness Gaps to Invariance Manifolds for AI-Generated Image Detection

DRIFT: 从鲁棒性差距到AI生成图像检测的不变流形

Abhishek Ameta, Sayan Banerjee, Shreyas Pandith, Harshit, Ankita Chatterjee, Akshay Janardan Bankar, Amit Satish Unde

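Affiliations: Samsung Research Institute, Bangalore, India

AI summary: DRIFT freezes a vision foundation model and learns a structured invariance manifold of real images, using a decomposition into robust and fragile subspaces together with an ordering margin to detect AI-generated images; it performs well on unseen generators and resolutions.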

详情
Comments
Submitted to ECCV 2026
AI中文摘要

生成图像模型的快速演进挑战了现有的AI生成图像检测器,尤其是在面对未见生成器的开放世界场景中。近期无训练方法通过测量冻结视觉基础模型(VFM)中的鲁棒性差距,利用扰动引起的嵌入漂移检测伪造图像。然而,这些方法依赖于预训练继承的固定不变几何结构,缺乏针对检测任务的原则性适应。我们转而将AI生成图像检测表述为在单类监督下学习真实图像的结构化不变流形。基于冻结的VFM,我们引入轻量级投影头,将表示空间分解为互补的鲁棒子空间和脆弱子空间。鲁棒子空间被显式训练以抑制由物理上合理的成像变换引起的变异,近似真实图像流形的切方向,而脆弱子空间则保持对类似编辑扰动的敏感性。结构化的排序间隔强制实现物理不变性与编辑诱导变异性之间的层次分离,使得检测成为相对于所学流形的间隔违反测试。在推理时,两种变换族下的多尺度逐块漂移产生双通道不变性特征和可解释的定位。大量实验表明,该方法在未见生成器和分辨率上具有强大的开放世界泛化能力,始终优于基于无训练鲁棒性的基线方法,同时提供可解释的不变性违反图。

英文摘要

The rapid evolution of generative image models challenges existing AI-generated image detectors, particularly in open-world settings with unseen generators. Recent training-free approaches measure robustness gaps in frozen vision foundation models (VFMs), detecting fakes via perturbation-induced embedding drift. However, these methods rely on fixed invariance geometry inherited from pretraining and lack principled adaptation to the detection task. We instead formulate AI-generated image detection as learning a structured invariance manifold of real images under one-class supervision. Building upon a frozen VFM, we introduce lightweight projection heads that decompose representation space into complementary robust and fragile subspaces. The robust subspace is explicitly trained to suppress variations induced by physically plausible imaging transformations, approximating tangent directions of a real-image manifold, while the fragile subspace retains sensitivity to edit-like perturbations. A structured ordering margin enforces hierarchical separation between physical invariance and edit-induced variability, enabling detection as a margin-violation test relative to the learned manifold. At inference, multi-scale patch-wise drift under both transformation families yields a dual-channel invariance signature and interpretable localization. Extensive experiments demonstrate strong open-world generalization across unseen generators and resolutions, consistently outperforming training-free robustness-based baselines while providing interpretable invariance-violation maps.

2606.06908 2026-06-08 cs.CV 新提交

polyDAG: Polynomial Acyclicity Constraints for Efficient Continuous Causal Discovery in Visual Semantic Graphs

polyDAG:用于视觉语义图中高效连续因果发现的多项式无环性约束

Wenhao Zhang, Ramin Ramezani, Tao Han, Kai Hwang, Minyi Guo

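Affiliations: Shanghai Jiao Tong University; University of California, Los Angeles; The Chinese University of Hong Kong, Shenzhen

AI summary: polyDAG is a polynomial acyclicity framework that replaces the matrix-exponential constraint with a finite polynomial trace constraint, enabling efficient continuous DAG learning for visual semantic graphs and improving efficiency and structure recovery on synthetic graphs and CelebA.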

详情
AI中文摘要

现代图像分析流程通常将图像转换为结构化语义变量,如面部属性、对象概念和场景描述符。学习这些变量之间的有向依赖关系可以生成可解释的视觉语义图,但连续有向无环图学习受到执行无环性成本的限制。我们提出了polyDAG,一个用于视觉语义图中高效连续因果发现的多项式无环性框架。polyDAG用有限多项式迹约束替代矩阵指数无环性约束,并证明了新约束恰好对有向无环图为零。我们进一步推导了一种几何级数实现,避免了显式求和循环,同时保持了相同的无环性条件。在合成Erdos-Renyi图和CelebA面部视觉属性上的实验表明,polyDAG提高了效率和结构恢复能力。在d∈{100,200,500}的修订合成协议上平均,polyDAG将平均结构汉明距离从318.4降低到285.4,并将平均F1分数从0.725提高到0.756。在100个节点时,几何变体运行时间为3.44秒,而指数基线为5.16秒,对应33.4%的加速。代码和数据公开于 https://github.com/wenhaoz-fengcai/polyDAG。

英文摘要

Modern image-analysis pipelines often convert images into structured semantic variables, such as facial attributes, object concepts, and scene descriptors. Learning directed dependencies among these variables can produce interpretable visual semantic graphs, but continuous directed acyclic graph learning is limited by the cost of enforcing acyclicity. We present polyDAG, a polynomial acyclicity framework for efficient continuous causal discovery in visual semantic graphs. polyDAG replaces the matrix-exponential acyclicity constraint with a finite polynomial trace constraint and proves that the new constraint is zero exactly for acyclic graphs. We further derive a geometric-series implementation that avoids the explicit summation loop while preserving the same acyclicity condition. Experiments on synthetic Erdos-Renyi graphs and CelebA facial visual attributes show that polyDAG improves efficiency and structure recovery. Averaged over the revised synthetic protocol with d in {100, 200, 500}, polyDAG reduces mean structural Hamming distance from 318.4 to 285.4 and improves mean F1 score from 0.725 to 0.756. At 100 nodes, the geometric variant runs in 3.44 seconds compared with 5.16 seconds for the exponential baseline, corresponding to a 33.4 percent speedup. Code and data are publicly available at https://github.com/wenhaoz-fengcai/polyDAG.
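
To make the idea of a finite polynomial trace constraint concrete, here is a minimal pure-Python sketch. It uses the generic power-sum form h(W) = sum over k = 1..d of tr((W∘W)^k), which is zero exactly when the weighted graph has no directed cycle, because tr(A^k) sums the weights of closed walks of length k and any cycle in a d-node graph has length at most d. This form is an assumption chosen for illustration; the abstract does not state polyDAG's exact polynomial or its geometric-series implementation, and the helper names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain O(d^3) product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def trace(a: Matrix) -> float:
    return sum(a[i][i] for i in range(len(a)))


def poly_acyclicity(w: Matrix) -> float:
    """Sum of tr((W∘W)^k) for k = 1..d; zero iff W has no directed cycle."""
    d = len(w)
    a = [[x * x for x in row] for row in w]  # Hadamard square: non-negative entries
    power = a
    total = trace(power)
    for _ in range(d - 1):
        power = matmul(power, a)
        total += trace(power)
    return total


# Chain 0 -> 1 -> 2 is acyclic; adding the edge 2 -> 0 closes a 3-cycle.
dag = [[0.0, 1.5, 0.0],
       [0.0, 0.0, -2.0],
       [0.0, 0.0, 0.0]]
cyclic = [[0.0, 1.5, 0.0],
          [0.0, 0.0, -2.0],
          [0.5, 0.0, 0.0]]

h_dag = poly_acyclicity(dag)
h_cyclic = poly_acyclicity(cyclic)
```

The element-wise square keeps every entry non-negative, so contributions from different cycles cannot cancel; in a continuous optimizer the same quantity would be used as a differentiable penalty rather than evaluated with Python lists.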

2606.06906 2026-06-08 cs.CL cs.AI 新提交

EASE-TTT: Evidence-Aligned Selective Test-Time Training for Long-Context Question Answering

EASE-TTT: 面向长上下文问答的基于证据对齐的选择性测试时训练

Xiaopeng Yuan, Zebin Wang, Suwen Wang, Zongxin Yang, Haohan Wang, Yushun Dong

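Affiliations: University of Illinois Urbana-Champaign; Harvard University; Brion, ASML US LP; Florida State University

AI summary: EASE-TTT converts retrieved evidence chunks into soft attention supervision targets that guide query-side parameter adaptation, improving long-context QA for small models while keeping the full context.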

详情
Comments
13 pages, 4 figures, 3 tables
AI中文摘要

长上下文问答(QA)对于较小的语言模型来说仍然具有挑战性,即使输入中已经存在包含答案的证据。现有的上下文内检索方法定位并暴露候选证据块给问题,但它们止步于输入级证据暴露,而不是调整控制模型如何在整个上下文位置上分配注意力的查询侧注意力参数。相比之下,轻量级的测试时适应方法,如仅查询的测试时训练(qTTT),由于它们通用的跨度级自监督目标无法识别哪些上下文位置支持当前答案,因此未能解决证据定位问题。在本文中,我们提出了基于证据对齐的选择性测试时训练(EASE-TTT),这是一个上下文内检索增强的测试时训练框架,它将选定的证据块转换为对其标记位置的软注意力监督目标。EASE-TTT不是用检索到的块替换完整上下文,而是使用生成的注意力目标来指导查询侧适应,适应后的模型从原始完整上下文中生成最终答案。在六个LongBench QA任务和三个小型仅解码器语言模型上的实验表明,EASE-TTT在全上下文推理、仅检索基线和qTTT中实现了最强的宏平均性能,支持了长上下文QA中基于证据对齐的测试时适应。

英文摘要

Long-context question answering (QA) remains challenging for smaller language models even when answer-bearing evidence is already present in the input. Existing within-context retrieval methods localize and expose candidate evidence chunks for the question, but they stop at input-level evidence exposure rather than adapting the query-side attention parameters that control how the model allocates attention over full-context positions. In contrast, lightweight test-time adaptation methods, such as query-only test-time training (qTTT), leave evidence localization unresolved because their generic span-level self-supervised objectives do not identify which context positions support the current answer. In this paper, we propose Evidence-Aligned SElective Test-Time Training (EASE-TTT), a within-context retrieval-augmented test-time training framework that converts selected evidence chunks into a soft attention supervision target over their token positions. Instead of replacing the full context with retrieved chunks, EASE-TTT uses the resulting attention target to guide query-side adaptation, with the adapted model generating the final answer from the original full context. Experiments on six LongBench QA tasks and three small decoder-only language models show that EASE-TTT achieves the strongest macro-average performance among full-context inference, retrieval-only baselines, and qTTT, supporting evidence-aligned test-time adaptation in long-context QA.

2606.06903 2026-06-08 cs.CV cs.AI 新提交

Beyond Skeletons: Learning Animation Directly from Driving Videos with Same2X Training Strategy

超越骨架:使用Same2X训练策略直接从驱动视频学习动画

Yuan Zeng, Yujia Shi, Yuhao Yang, Dongxia Liu, Zongqing Lu, Wenming Yang, Qingmin Liao

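Affiliations: Tsinghua University; Harbin Institute of Technology; Pengcheng Laboratory

AI summary: DirectAnimator bypasses pose extraction and learns animation directly from raw driving videos through a Driving Cue Triplet and the Same2X training strategy, yielding robust, high-quality human image animation.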

详情
Comments
Accepted to ICLR 2026
AI中文摘要

人体图像动画旨在根据从驱动视频中提取的姿态信息,从静态参考图像生成视频。现有方法通常依赖姿态估计器提取中间表示,但在遮挡或复杂姿态下这些信号容易出错。基于这些观察,我们提出了DirectAnimator,一个绕过姿态提取并直接从原始驱动视频学习的框架。我们引入了一个由姿态、面部和位置线索组成的驱动线索三元组,以语义丰富且稳定的形式捕捉运动、表情和对齐,并通过CueFusion DiT块融合它们,以实现去噪过程中的可靠控制。为了使学习在驱动和参考身份不同时依然可靠,我们设计了Same2X训练策略,将跨身份特征与从相同身份数据中学到的特征对齐,从而正则化优化并加速收敛。大量实验表明,DirectAnimator在保持身份的同时达到了最先进的视觉质量,对遮挡和复杂关节运动具有鲁棒性,并且计算资源更少。我们的项目页面位于 https://directanimator.github.io/。

英文摘要

Human image animation aims to generate a video from a static reference image, guided by pose information extracted from a driving video. Existing approaches often rely on pose estimators to extract intermediate representations, but such signals are prone to errors under occlusion or complex poses. Building on these observations, we present DirectAnimator, a framework that bypasses pose extraction and directly learns from raw driving videos. We introduce a Driving Cue Triplet consisting of pose, face, and location cues that captures motion, expression, and alignment in a semantically rich yet stable form, and we fuse them through a CueFusion DiT block for reliable control during denoising. To make learning dependable when the driving and reference identities differ, we devise a Same2X training strategy that aligns cross-ID features with those learned from same-ID data, regularizing optimization and accelerating convergence. Extensive experiments demonstrate that DirectAnimator attains state-of-the-art visual quality and identity preservation while remaining robust to occlusions and complex articulation, and it does so with fewer computational resources. Our project page is at https://directanimator.github.io/.

2606.06902 2026-06-08 cs.LG 新提交

TALAN: Task-Aligned Latent Adaptation Networks for Targeted Post-Training of Large Language Models

TALAN:面向大型语言模型目标后训练的任务对齐潜在自适应网络

Chengkai Zhang, Ziteng Liu, Junpu Wang, Zeyi Tao, Yang Wang, Sagar Chordia, Qin Huang

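Affiliations: Meta AI

AI summary: TALAN is a sequence-conditioned latent side path inserted into the transformer residual stream and co-trained with a low-rank adapter. On STEM/code benchmarks it improves LoRA by 1.41 points and DoRA by 1.85 points on average, while adding <1% trainable parameters and 1.01-1.02x inference overhead.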

详情
AI中文摘要

目标后训练旨在提升推理、数学和代码能力而不损害原有优势。低秩适配器高效但任务全局;激活干预输入感知但通常需要独立的探针、向量或推理时引导。我们提出TALAN(任务对齐潜在自适应网络),一种序列条件潜在旁路,插入Transformer的残差流中,并在一个SFT循环中与低秩适配器协同训练。TALAN将活动序列压缩为潜在记忆,将其重新混合为令牌级扰动,并通过受控残差更新写回。它沿六个轴配置:插入位置、记忆大小、混合器、写回规则、可训练范围和梯度尺度。在四个Qwen3系列骨干和四个STEM/代码基准上,TALAN改进了匹配的LoRA和DoRA基线。使用LoRA,它实现了+1.41点的跨模型平均增益,在所有四个骨干上为正,在所有16个模型-基准单元上非负。使用DoRA,它实现了+1.85点的平均增益,在所有骨干上为正,在16个单元中的13个上为正。配对种子检查支持正平均效应但显示非平凡方差,因此我们将其视为敏感性检查。成本很小:相对于骨干的可训练参数<1%,推理开销为匹配LoRA的1.01-1.02倍。在Llama-3.2-1B上的迁移探针在LoRA和rsLoRA下,跨七个配对种子也呈正效应,支持超越Qwen的迁移。内部状态分析表明TALAN是一种小的互补激活干预。匹配的适配器更新比TALAN扰动大80-1700倍,但它们的余弦接近零;逐层测量显示这种小的正交扰动通过深度传播和放大。TALAN为在标准适配器后训练中研究可引导的激活级自适应提供了一个实用平台。

英文摘要

Targeted post-training aims to improve reasoning, math, and code without degrading strengths. Low-rank adapters are efficient but task-global; activation interventions are input-aware but often require separate probes, vectors, or inference-time steering. We introduce TALAN (Task-Aligned Latent Adaptation Networks), a sequence-conditioned latent side path inserted into a transformer's residual stream and co-trained with a low-rank adapter in one SFT loop. TALAN compresses the active sequence into latent memory, remixes it into token-level perturbations, and writes them back through a controlled residual update. It is configured along six axes: insertion location, memory size, mixer, writeback rule, trainability scope, and gradient scale. Across four Qwen3-family backbones and four STEM/code benchmarks, TALAN improves matched LoRA and DoRA baselines. With LoRA, it yields a +1.41 point cross-model mean gain, positive on all four backbones and non-negative on all 16 model-benchmark cells. With DoRA, it yields a +1.85 point mean gain, positive on all backbones and on 13 of 16 cells. Paired seed checks support positive average effects but show nontrivial variance, so we treat them as sensitivity checks. Cost is small: <1% trainable parameters relative to the backbone and 1.01-1.02x inference overhead versus matched LoRA. A Llama-3.2-1B transfer probe is also positive under LoRA and rsLoRA across seven paired seeds, supporting a transfer beyond Qwen. Internal-state analyses suggest TALAN is a small complementary activation intervention. The matched adapter update is 80-1,700x larger than the TALAN perturbation, yet their directions have near-zero cosine; per-layer measurements show this small orthogonal perturbation propagates and amplifies through depth. TALAN offers a practical platform for studying steerable activation-level adaptation within standard adapter-based post-training.

2606.06901 2026-06-08 cs.CV 新提交

LUCID: Learning Unified Control for Image Deflaring and Exposure Mastery in Nighttime Photography

LUCID:夜间摄影中图像去眩光与曝光控制的统一学习

Tingyu Yang, Yuan Cheng, Xiaoyun Yuan

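Affiliations: MoE Key Lab of Artificial Intelligence; AI Institute; School of Computer Science; School of Biomedical Engineering; School of Artificial Intelligence

AI summary: LUCID is a unified framework that jointly handles flare and noise in nighttime images with a flare disentanglement module and a diffusion-driven module. A four-mode training strategy enables controllable restoration and supports HDR reconstruction, and the method is reported to outperform existing approaches.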

详情
Comments
Accepted by SIGGRAPH 2026
AI中文摘要

摄影是用光绘画的艺术,但夜间场景受到相互竞争的退化影响:强烈的眩光掩盖了场景结构,而光子受限区域则陷入噪声。传统方法孤立地处理这些因素,忽略了这些退化本质上是纠缠的。为弥补这一差距,我们引入了LUCID,一个统一框架,将夜间恢复重新定义为连续且可控的过程,而非固定的校正。我们将夜间恢复分解为两个协作组件:一个眩光解缠模块,用于揭开光学伪影的“幕布”,提供可靠的结构指导;以及一个扩散驱动模块,利用生成先验重建干净且曝光良好的图像。关键的是,LUCID通过一种新颖的四模式训练策略引入了显式的可控性,使用户能够通过无分类器引导(CFG)引导恢复过程,并允许对光源及其相关的眩光和鬼影伪影进行选择性控制,同时通过连续曝光控制支持高动态范围(HDR)重建。大量实验表明,LUCID在多种真实夜间场景中始终优于最先进的方法。

英文摘要

Photography is the art of painting with light, yet nighttime scenes are shaped by competing degradations: intense flares obscure scene structure, while photon-limited regions collapse into noise. Conventional approaches address these factors in isolation, overlooking the fact that these degradations are fundamentally entangled. To bridge this gap, we introduce LUCID, a unified framework that reframes nighttime restoration as a continuous and controllable process rather than a fixed correction. We decompose nighttime restoration into two cooperative components: a flare disentanglement module that lifts the 'curtain' of optical artifacts to provide reliable structural guidance, and a diffusion-driven module that leverages generative priors to reconstruct clean and well-exposed imagery. Crucially, LUCID introduces explicit controllability through a novel four-mode training strategy, enabling users to steer the restoration process via classifier-free guidance (CFG) and allowing selective control over light sources and their associated flare and ghosting artifacts, while also supporting high dynamic range (HDR) reconstruction through continuous exposure control. Extensive experiments demonstrate that LUCID consistently outperforms state-of-the-art methods across diverse real-world nighttime scenarios.

2606.06899 2026-06-08 cs.CV cs.LG 新提交

Lighting-Aware Representation Learning under Controllable Lighting Variation

可控光照变化下的光照感知表示学习

Lizhen Zhu, Charantej Reddy Pochimireddy, James Z Wang, Brad Wyble

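Affiliations: The Pennsylvania State University

AI summary: A lighting-aware representation learning framework that treats illumination variation as an explicit training signal and captures lighting-dependent variation through an auxiliary objective; it outperforms standard contrastive learning baselines on classification and detection tasks.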

详情
AI中文摘要

光照变化仍然是视觉表示学习的主要挑战,因为它们会在环境内部和之间引起显著的外观变化。虽然现有方法通常通过数据增强来鼓励模型对光照变化具有不变性,但这些策略在学习过程中并未显式建模光照信息。受人类视觉理论的启发,我们提出了一种光照感知表示学习框架,该框架将光照变化作为显式训练信号而非需要抑制的干扰因素。我们的方法通过引入一个辅助目标来扩展对比学习,该目标捕获渲染场景中光照依赖的变化,使模型能够联合学习保持语义一致性的表示,同时保持对光照依赖的视觉结构的敏感性。我们在ImageNet、ExDark和PASCAL VOC基准测试上评估了所提模型的图像分类和物体检测任务。结果表明,所提出的光照感知训练在保持相同架构和训练预算的情况下,始终优于标准对比学习基线。此外,我们的方法在监督学习框架和涉及更简单光照变化的设置中表现出有前景的性能,表明其具有超越复杂光照场景的广泛适用性。这些结果显示了它在复杂视觉环境以及更常规的图像处理任务中增强模型鲁棒性和适应性的潜力。

英文摘要

Variations in illumination remain a major challenge for visual representation learning, as they induce substantial appearance changes both across and within environments. While existing approaches typically address this issue through data augmentations that encourage models to become invariant to lighting changes, such strategies do not explicitly model lighting information during learning. Inspired by theories of human vision, we propose a lighting-aware representation learning framework that incorporates illumination variation as an explicit training signal rather than a nuisance factor to be suppressed. Our method extends contrastive learning by introducing an auxiliary objective that captures illumination-dependent variation in rendered scenes, enabling the model to jointly learn representations that preserve semantic consistency while remaining sensitive to lighting-dependent visual structure. We evaluate the proposed model on image classification and object detection tasks across the ImageNet, ExDark, and PASCAL VOC benchmarks. Results demonstrate that the proposed lighting-aware training consistently improves downstream performance over standard contrastive learning baselines, while maintaining the same architecture and training budget. Furthermore, our approach shows promising performance in supervised learning frameworks and under settings involving simpler lighting variation, suggesting broad applicability beyond complex illumination scenarios. These results indicate its potential to enhance model robustness and adaptability in complex visual environments as well as in more conventional image processing tasks.

2606.06893 2026-06-08 cs.AI 新提交

Workflow-to-Skill: Skill Creation via Routing-Workflow-Semantics-Attachments Decomposition

工作流到技能:通过路由-工作流-语义-附件分解创建技能

Yuyang Zhang, Xinyuan Han, Xudong Jiang, Run Wang

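Affiliations: Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University; Nanchang University

AI summary: The RWSA intermediate representation and the W2S framework automatically construct Skills from heterogeneous interaction evidence by decomposing workflow structure, execution semantics, and runtime attachments, improving behavioral replay consistency by 10.5%.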

详情
Comments
10 pages, 2 figures
AI中文摘要

大型语言模型代理越来越依赖技能来编码程序性知识,但高质量技能的手工编写成本高昂。本文研究从异构交互证据(包括演示、代理轨迹、工具痕迹和执行日志)自动构建技能。我们认为,从痕迹到技能的构建并非简单的摘要任务,因为痕迹是碎片化、冗余的,并且可能遗漏罕见但安全关键的行为。为此,我们引入RWSA,一种面向工作流的中间表示,将技能分解为工作流结构、执行语义和运行时附件,捕获任务分解、控制流、验证、安全、回滚和状态管理。基于RWSA,我们提出W2S框架,该框架分割痕迹、诱导局部技能草稿、对齐共享结构、协调分支、压缩冗余,同时保留证据和置信度注释。在70个技能上的实验表明,W2S相比基于摘要和提示的基线,行为重放一致性提高了10.5%,凸显了将痕迹视为可执行运行时规范而非可压缩文本的必要性。

英文摘要

Large language model agents increasingly rely on Skills to encode procedural knowledge, yet high-quality Skills remain costly to hand-write. This paper studies automatic Skill construction from heterogeneous interaction evidence, including demonstrations, agent trajectories, tool traces, and execution logs. We argue that trace-to-skill construction is not a simple summarization task, because traces are fragmented and redundant and may miss rare but safety-critical behaviors. To address this, we introduce RWSA, a workflow-oriented intermediate representation that decomposes Skills into Workflow structure, execution Semantics, and runtime Attachments, capturing task decomposition, control flow, verification, safety, rollback, and state management. Building on RWSA, we propose W2S, a framework that segments traces, induces local Skill drafts, aligns shared structures, reconciles branches, and compresses redundancy while preserving evidence and confidence annotations. Experiments on 70 Skills show that W2S improves behavioral replay consistency by 10.5% over summarization- and prompting-based baselines, highlighting the need to treat traces as executable runtime specifications rather than compressible text.

2606.06892 2026-06-08 cs.LG 新提交

GRASP: Geometry-aware Residual Alignment for Scalable Pretraining Data Attribution

GRASP:面向可扩展预训练数据归因的几何感知残差对齐

Yue Min, Ruining Chen, Yujun Li

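Affiliations: Wizard Quant; University of Science and Technology of China

AI summary: GRASP models subset interactions with a quadratic geometric penalty and combines low-dimensional feature sketches with a finite lower-confidence-bound selection protocol, giving scalable pretraining data attribution with substantially higher counterfactual subset fidelity and lower compute cost.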

详情
AI中文摘要

可扩展的数据归因方法通常为单个训练样本分配孤立的效用分数。这种普遍的加性假设从根本上无法捕捉关键的子集动态,包括数据冗余和互补覆盖。在这项工作中,我们将归因重新定义为子集级别的反事实效用预测,并引入GRASP,一种交互感知的替代方法。基于理论平滑度下界,GRASP通过二次几何惩罚显式建模子集交互。为了实现预训练规模的效率而不依赖隐藏的oracle调优,我们将低维特征草图与严格有限下置信度选择协议相结合。广泛的子集重训练评估表明,GRASP显著优于现有的可扩展基线。它将反事实子集保真度的任务级秩相关性提高了一倍以上,同时将前期工件构建成本降低了近一个数量级。下游诊断进一步表明,这种评分机制可迁移到语言模型策展和跨领域视觉选择,为优化大规模预训练语料库奠定了坚实基础。

英文摘要

Scalable data attribution methods typically assign isolated utility scores to individual training examples. This prevalent additive assumption fundamentally fails to capture critical subset dynamics, including data redundancy and complementary coverage. In this work, we reframe attribution as subset-level counterfactual utility prediction and introduce GRASP, an interaction-aware surrogate. Grounded in a theoretical smoothness lower bound, GRASP explicitly models subset interactions through a quadratic geometric penalty. To achieve pretraining-scale efficiency without relying on hidden oracle tuning, we couple low-dimensional feature sketches with a strictly finite lower-confidence bound selection protocol. Extensive subset-retraining evaluations demonstrate that GRASP decisively outperforms existing scalable baselines. It more than doubles the task-level rank correlation for counterfactual subset fidelity while reducing upfront artifact construction costs by nearly an order of magnitude. Downstream diagnostics further show that this scoring mechanism transfers to language model curation and cross-domain vision selection, establishing a robust foundation for optimizing massive pretraining corpora.

2606.06891 2026-06-08 cs.CV 新提交

Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors

Stream3D-VLM:基于增量几何先验的在线3D空间理解

Hanxun Yu, Xuan Qu, Lei Ke, Boqiang Zhang, Yuxin Wang, Jianke Zhu, Dong Yu

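Affiliations: Zhejiang University; Tencent Hunyuan; HKUST; Shenzhen Loop Area Institute

AI summary: Stream3D-VLM is an online 3D vision-language model that understands 3D space in real time from streaming video, using autoregressive streaming control, a lightweight Visual-Spatial Feature Integration module, and Geometry-Adaptive Voxel Compression. The authors also build a dataset of over one million online 3D QA pairs and report gains over existing models on multiple tasks.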

详情
Comments
Project Page: https://stream3d-vlm.github.io/
AI中文摘要

尽管3D场景理解取得了进展,但现有的3D大型多模态模型在离线设置下运行,需要完整的场景观测或预定义的视频片段。在本文中,我们提出了一种在线3D视觉语言模型,能够从流式视频中实现实时空间理解。我们的方法基于LLM的下一个词预测目标,采用自回归流控制建模来学习何时响应,并使用轻量级的视觉-空间特征融合(VSFI)模块,将时间对齐的几何先验增量注入视觉流。为了减轻长上下文解码开销,我们提出了一种即插即用的几何自适应体素压缩(GAVC)模块,用于高效的视觉令牌压缩。为了解决流式3D语言数据的稀缺问题,我们进一步开发了一个可扩展的数据生成流程,策划了超过100万个在线时空3D问答对,并建立了一个涵盖29个任务的全面基准。大量实验表明,我们的方法在在线和离线3D空间理解、推理和定位任务上均显著优于专有和开源模型。项目页面见 https://stream3d-vlm.github.io/。

英文摘要

Despite advances in 3D scene understanding, existing 3D Large Multimodal Models operate in offline settings, requiring complete scene observations or predefined video clips. In this paper, we present an online 3D vision-language model that enables real-time spatial understanding from streaming video. Our approach uses autoregressive streaming-control modeling, based on the LLM's next-token prediction objective, to learn when to respond, and employs a lightweight Visual-Spatial Feature Integration (VSFI) module to incrementally inject temporally aligned geometry priors into the visual stream. To alleviate long-context decoding overhead, we propose a plug-and-play Geometry-Adaptive Voxel Compression (GAVC) module for efficient visual token compression. To address the scarcity of streaming 3D-language data, we further develop a scalable data generation pipeline that curates over 1M online spatio-temporal 3D QA pairs and establishes a comprehensive benchmark spanning 29 tasks. Extensive experiments show that our approach significantly outperforms both proprietary and open-source models across online and offline 3D spatial understanding, reasoning, and grounding tasks. The project page is available at https://stream3d-vlm.github.io/

2606.06890 2026-06-08 cs.CV cs.LG 新提交

Diagnosing Visual Ignorance in Vision-Language Models

诊断视觉语言模型中的视觉忽视

Runyu Zhou, Qi Zhang, Qixun Wang, Yisen Wang

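Affiliations: Peking University

AI summary: Studies the internal mechanisms behind VLM reliance on language priors, using layer replacement and probing to reveal a multi-stage bottleneck, and introduces a progressive visual decay metric showing that benchmarks may reward visual ignorance.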

详情
AI中文摘要

视觉语言模型(VLM)经常依赖语言先验,产生自信但缺乏视觉证据支持的答案。虽然这种行为被广泛观察到,但其内部机制及对基准评估的影响仍未被充分理解。在这项工作中,我们从机制和行为两个角度研究语言先验依赖。在内部,我们将反事实层替换与有监督的逐层MLP探针相结合,以追踪真实视觉语义和语言先验语义如何在语言解码器中竞争。我们的分析揭示了一个多阶段瓶颈:中间层通常无法有效检索视觉信息,而后续层可能进一步抑制存活的视觉信号,偏向文本空间偏差。在外部,我们引入了一种基于多步高斯模糊的渐进视觉退化度量,用于识别那些即使视觉内容被逐渐破坏,答案仍保持不变的实例。在十二个视觉问答基准和三个代表性VLM上,我们发现相当一部分示例在严重或完全视觉混淆下仍可回答,表明当前基准可能无意中奖励视觉忽视。这些发现表明,语言先验依赖是一种系统性的路由故障,影响模型内部和基准有效性。最后,我们概述了未来的关键研究方向,强调需要设计基于结构隔离或反事实数据的训练分布和评估协议,以强制执行真正的跨模态基础。

英文摘要

Vision-Language Models (VLMs) frequently rely on language priors, producing confident answers that are weakly grounded in visual evidence. While this behavior is widely observed, its internal mechanisms and its impact on benchmark evaluation remain insufficiently understood. In this work, we study language-prior reliance from both mechanistic and behavioral perspectives. Internally, we combine counterfactual layer replacement with supervised layer-wise MLP probing to trace how ground-truth visual semantics and language-prior semantics compete across the language decoder. Our analysis reveals a multi-stage bottleneck: intermediate layers often fail to effectively retrieve visual information, while later layers can further suppress surviving visual signals in favor of text-space biases. Externally, we introduce a progressive visual decay metric based on multi-step Gaussian blurring, which identifies instances whose answers remain invariant even as visual content is increasingly destroyed. Across twelve visual question-answering benchmarks and three representative VLMs, we find that a substantial fraction of examples remain answerable under severe or total visual obfuscation, indicating that current benchmarks can inadvertently reward visual ignorance. These findings demonstrate that language-prior reliance is a systematic routing failure affecting both model internals and benchmark validity. Finally, we outline critical pathways for future research, highlighting the necessity of designing training distributions and evaluation protocols built on structurally isolated or counterfactual data to enforce genuine cross-modal grounding.
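
Once a model's answers have been collected at each blur step, the invariance check behind a progressive visual decay measurement could look like the sketch below. The question ids, the answers, and the two helper functions are hypothetical, and the abstract does not give the paper's exact metric; the sketch only counts examples whose answer never changes as blur increases.

```python
from typing import Dict, List


def first_change_level(answers: List[str]) -> int:
    """Index of the first blur step whose answer differs from the clean answer.

    answers[0] is the answer on the unblurred image; later entries correspond
    to increasingly strong blur. Returns len(answers) if nothing changes.
    """
    for level, answer in enumerate(answers[1:], start=1):
        if answer != answers[0]:
            return level
    return len(answers)


def invariant_fraction(all_answers: Dict[str, List[str]]) -> float:
    """Fraction of examples whose answer is unchanged at every blur step."""
    invariant = [q for q, a in all_answers.items()
                 if first_change_level(a) == len(a)]
    return len(invariant) / len(all_answers)


# Made-up answers for four questions over four blur levels.
answers_by_question = {
    "q1": ["red", "red", "red", "red"],
    "q2": ["two", "two", "three", "one"],
    "q3": ["yes", "yes", "yes", "yes"],
    "q4": ["cat", "dog", "dog", "dog"],
}
rate = invariant_fraction(answers_by_question)
```

A high invariant fraction would suggest that many answers do not depend on the image content, which is the symptom the abstract calls visual ignorance; producing the blurred inputs and querying the model are left out of this sketch.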

2606.06887 2026-06-08 cs.CV 新提交

ARAPDiffusion: ARAP Regularization for Diffusion-Based Deformable Shape Space Learning

ARAPDiffusion: 基于ARAP正则化的扩散变形形状空间学习

Haibo Liu, Jinghan Ke, Haitao Yang, Xiangru Huang, Georgios Pavlakos, Qixing Huang

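Affiliations: University of Texas at Austin; Westlake University

AI summary: ARAPDiffusion is a latent diffusion model that injects the ARAP deformation model as regularization losses to learn a continuous shape space for deformable shape collections, reducing the need for abundant 3D training data.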

详情
AI中文摘要

本文介绍了ARAPDiffusion,一种潜在扩散模型,用于学习变形形状集合的潜在连续形状空间。关键创新在于将尽可能刚性(ARAP)变形模型作为正则化损失注入潜在扩散(LD),从而减少学习生成模型所需的大量3D训练数据。与标准LD相比,我们展示了如何利用ARAP模型同时改进编码器/解码器和LD模型。训练过程交替使用LD模型定义的合成分布来开发增强形状编码器/解码器的正则化损失,以及使用形状解码器来开发改进LD模型的正则化损失。我们还展示了LD范式在结合无表示LD模型和适用于无序点云的隐式形状解码器方面的优势。无条件和条件形状生成的实验结果证明了ARAPDiffusion相对于基线方法的优势。

英文摘要

This paper introduces ARAPDiffusion, a latent diffusion model that learns the underlying continuous shape space of a deformable shape collection. The key innovation is injecting the as-rigid-as-possible (ARAP) deformation model as regularization losses into latent diffusion (LD), relaxing the requirement for abundant 3D training data when learning generative models. In contrast to the standard LD, we show how the ARAP model can be used to improve both the encoder/decoder and the LD model. The training procedure alternates between using the synthetic distribution defined by the LD model to develop a regularization loss that enhances the shape encoder/decoder and using the shape decoder to develop a regularization loss that improves the LD model. We also show the benefit of the LD paradigm in combining a representation-free LD process and an implicit shape decoder that is applicable to unorganized point clouds. The experimental results of unconditional and conditional shape generation demonstrate the advantages of ARAPDiffusion over baseline approaches.

2606.06878 2026-06-08 cs.RO cs.CV 新提交

A Cross-view Fusion Framework for Robust 6-DoF Grasp Pose Estimation

Kangjian Zhu, Haobo Jiang, Jianjun Qian, Jin Xie

Affiliations: Nanjing University of Science and Technology; Nanyang Technological University; Nanjing University

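AI summary: Proposes a cross-view fusion framework that mitigates occlusion with an auxiliary view, uses self-supervised contrastive learning to strengthen the spatial consistency and direction distinctiveness of point cloud features, and designs a cross-view-aligned cylinder integration module to fuse grasp-relevant geometry, improving the robustness of 6-DoF grasp pose estimation in corner views.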

Comments
Corresponding author: Jin Xie
English abstract

In this paper, we propose a cross-view fusion framework that enhances the robustness of 6-DoF grasp pose estimation in corner views. Our framework alleviates occlusion by incorporating an auxiliary view and avoids the time-consuming, task-agnostic multi-view reconstruction through a post-fusion strategy. To enhance cross-view fusion, we propose a self-supervised contrastive learning strategy that leverages cross-view associations to regularize point cloud features. In brief, a cross-view point pair is considered a match if the two points correspond to the same 3D location, and a non-match if they represent distinct grasp directions. The learning strategy significantly enhances the spatial consistency and direction distinctiveness of point features, thereby facilitating cross-view fusion and improving estimation robustness. Furthermore, we propose a cross-view-aligned cylinder integration module to fuse grasp-relevant geometry into a comprehensive representation. Specifically, the module first aligns the cross-view points and features according to their similarity to enhance the robustness against noise. Subsequently, these points are registered into the cylindrical coordinate frame, emphasizing the rotation-symmetric geometry which is important for grasping. Finally, local self-attention and seed cross-attention layers are alternately employed, respectively enabling interactions within single views and across views, which supports fine-grained representation of grasp-relevant geometry. Our framework achieves strong performance on the GraspNet-1Billion benchmark and in real-world applications. Code is available at https://github.com/KJZhuAutomatic/Cross-view-Grasp.
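The step of registering points into a cylindrical coordinate frame can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: the function name `to_cylindrical`, the choice of `origin` and `axis` (for example a grasp approach direction), and the way the zero-angle reference direction is picked are all hypothetical.

```python
import numpy as np

def to_cylindrical(points, origin, axis):
    """Express 3D points as (radius, angle, height) about a given axis.

    points: (n, 3) array; origin: (3,) point on the axis; axis: (3,) direction.
    The zero-angle direction is an arbitrary perpendicular chosen below.
    """
    axis = axis / np.linalg.norm(axis)
    rel = points - origin
    height = rel @ axis                          # signed distance along the axis
    radial = rel - np.outer(height, axis)        # component perpendicular to the axis
    radius = np.linalg.norm(radial, axis=1)
    # Build an orthonormal pair (u, v) perpendicular to the axis.
    helper = np.array([1.0, 0.0, 0.0]) if abs(axis[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(axis, helper)
    u /= np.linalg.norm(u)
    v = np.cross(axis, u)
    angle = np.arctan2(radial @ v, radial @ u)
    return np.stack([radius, angle, height], axis=1)
```

In this parameterization, rotating the points about the axis changes only the angle column while radius and height stay fixed, which is one way to make rotation-symmetric geometry explicit.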