arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.00906 2026-06-02 cs.CV

hZACH-ViT: Curved Latent Geometry for Compact Vision Transformers in Low-Data Medical Imaging

Athanasios Angelakis

Affiliations BioML Lab, Research Institute CODE, UniBw, Munich, Germany; Department of Epidemiology and Data Science, Amsterdam UMC, Amsterdam, Netherlands

AI summary Proposes hZACH-ViT, which extends the latent space of ZACH-ViT to hyperbolic or spherical geometry to improve compact Vision Transformers in low-data medical imaging, with an average gain of +0.021 across seven MedMNIST datasets.

Comments 17 pages, 2 figures, 4 tables. Code, execution notebooks, and aggregated result summaries will be released at https://github.com/Bluesman79/hZACH-ViT upon publication

Abstract

Compact Vision Transformers are attractive for medical imaging in low-data and resource-constrained settings, but most existing variants assume that Euclidean latent geometry is sufficient for organizing image representations. We introduce hZACH-ViT, a family of curved-geometry extensions of ZACH-ViT, a compact zero-token Vision Transformer that removes positional embeddings and the class token and relies on global average pooling over patch representations. To isolate the role of geometry, we preserve the verified ZACH-ViT backbone and modify only the final representation space and prototype-based classifier head, enabling a controlled comparison between Euclidean, hyperbolic, and spherical latent geometries. We evaluate Poincaré, Klein, and spherical hZACH-ViT heads on seven MedMNIST datasets under an identical few-shot protocol with 50 samples per class and five random seeds. The completed benchmark contains 770 training runs spanning seven datasets, three non-Euclidean geometries, seven curvature magnitudes, and a Euclidean baseline. Across all seven datasets, the best non-Euclidean hZACH-ViT configuration improves over Euclidean ZACH-ViT, with an average gain of +0.021 in the dataset-specific primary metric and the largest improvement on OCTMNIST (+0.055 MacroF1). Fixed low-curvature configurations retain positive gains on the majority of datasets, and low curvature values (c = 0.1 or 0.2) account for six of the seven dataset-level winners. Rather than identifying a universally optimal manifold, our results establish geometry and curvature as dataset-dependent model-selection variables, with fixed low-curvature analyses confirming that gains persist beyond exhaustive per-dataset tuning.
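As a rough illustration of what a curved-geometry prototype head computes, the sketch below classifies an embedding by its nearest class prototype under the Poincaré-ball distance with curvature magnitude c. It is a minimal sketch under stated assumptions, not the hZACH-ViT implementation: the function names, the two-dimensional prototypes, and the class labels are all hypothetical, and the closed-form distance used is the standard one for a ball of curvature -c.

```python
import math


def poincare_distance(x, y, c=0.1):
    """Geodesic distance on the Poincare ball of curvature -c (c > 0).

    Both points must lie inside the ball, i.e. c * ||v||^2 < 1.
    """
    def sq_norm(v):
        return sum(t * t for t in v)

    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * sq_norm(x)) * (1.0 - c * sq_norm(y))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)


def nearest_prototype(z, prototypes, c=0.1):
    """Return the label whose prototype is closest to embedding z."""
    return min(prototypes, key=lambda label: poincare_distance(z, prototypes[label], c))


# Hypothetical prototypes, invented for illustration only.
prototypes = {"benign": [0.5, 0.0], "malignant": [-0.5, 0.0]}
print(nearest_prototype([0.3, 0.1], prototypes, c=0.1))
```

With this convention the distance approaches twice the Euclidean distance as c goes to zero, which is one way to see low curvature as a small departure from the Euclidean baseline.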

2606.00902 2026-06-02 cs.AI

Ryze: Evidence-Enriched Data Synthesis from Biomedical Papers

Yeqi Huang, Yue Chen, Yanwei Ye, Guanhao Su, Luo Mai

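Affiliations University of Edinburgh

AI summary Proposes Ryze, a system that automatically generates training data with complete evidence structure from biomedical papers and trains the domain-specialized VLM BioVLM-8B, which reaches 48.0% weighted accuracy on LAB-Bench at a cost under USD 200.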

Comments Accepted at ACL 2026 System Demonstrations Track. 8 pages, 6 figures

Abstract

General-purpose VLMs remain unreliable for biomedical research because valid answers in scientific papers depend on evidence split across figures, tables, charts, captions, and referring text. Existing post-training pipelines are bottlenecked by costly expert annotation and by synthetic data that drops this evidence structure. We present Ryze, a fully automated system that converts raw biomedical papers into an evidence-enriched training set and a domain-specialized VLM. Ryze synthesizes QA pairs with complete supporting evidence (visual element, caption, extracted structure, and referring paragraphs), reduces layout and OCR errors via chart/table-aware extraction and LLM-based cleansing, and applies a progress-gated post-training strategy combining supervised fine-tuning with reinforcement learning. Starting from Qwen3-VL-8B, Ryze produces BioVLM-8B at under USD 200, achieving 48.0% weighted accuracy on LAB-Bench, outperforming the base model by +12.6 percentage points (pp) and surpassing GPT-5.2 by +3.8 pp. We release Ryze as open source together with the trained BioVLM-8B model.

2606.00898 2026-06-02 cs.CL cs.DL

Citation Grounding: Detecting and Reducing LLM Citation Hallucinations via Legal Citation Graphs

Volodymyr Ovcharov

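Affiliations LEX AI LLC

AI summary Proposes the citation grounding (CG) metric, which checks LLM legal citations against a citation graph built from Ukrainian court decisions (100.8 million decisions, 502 million edges), and the CG-DPO method, which builds preference pairs from real decisions to reduce hallucinations. On 100 legal queries, the five evaluated systems score CG between 0.791 and 0.873, with 13-21% of citations hallucinated.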

Comments 14 pages, 3 figures, 3 tables. Code and data: https://huggingface.co/datasets/overthelex/citation-grounding-eval

Abstract

Large language models systematically hallucinate legal citations -- fabricating statute references, citing repealed provisions, and confusing jurisdictions -- yet no automated method exists to measure or reduce this behavior at scale. We propose citation grounding (CG), a metric that verifies LLM-generated legal citations against a ground-truth citation graph extracted from 100.8 million Ukrainian court decisions (502 million edges, 21,736 unique statute nodes). CG decomposes into three components -- citation precision (does the cited provision exist?), citation relevance (is it contextually appropriate?), and citation temporality (was it valid at the relevant date?) -- enabling differential diagnosis of hallucination types. Empirical evaluation on 100 Ukrainian legal queries across five systems -- four commercial LLMs via AWS Bedrock (Claude Haiku 4.5, Mistral Pixtral Large, Amazon Nova Pro/Lite) and one RAG-augmented production system -- reveals CG ranging from 0.791 to 0.873, with 13-21% of citations hallucinated. To reduce hallucinations without human annotation, we introduce Citation Grounding DPO (CG-DPO): a method that constructs preference pairs algorithmically by corrupting verified citations from real court decisions via four targeted strategies. On a dataset of 2,244 court decisions, a Qwen2.5-7B-Instruct model fine-tuned with LoRA achieves 98.5% mean validation accuracy in distinguishing correct from corrupted citations (rewards margin +14.9, std < 0.3 pp across 3 seeds). The citation graph, evaluation framework, and CG-DPO dataset are released as open resources.
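To make the two ideas concrete, the sketch below shows a citation-precision check against a set of known statute nodes and the construction of one preference pair by corrupting a verified citation. It is only a sketch under assumptions: the node identifiers are made up, the single corruption shown (shifting the article number) is a hypothetical stand-in because the abstract does not name the four strategies, and treating an answer with no citations as scoring zero is also an assumption.

```python
def citation_precision(cited, statute_nodes):
    """Fraction of cited provisions that exist as nodes in the reference graph."""
    if not cited:
        return 0.0  # assumption: an answer with no citations scores zero
    return sum(1 for c in cited if c in statute_nodes) / len(cited)


def shift_article_number(citation, offset=9000):
    """Hypothetical corruption: move the article number to a different value."""
    prefix, number = citation.split("art.")
    return f"{prefix}art.{int(number) + offset}"


def make_preference_pair(prompt, verified, statute_nodes, corrupt):
    """Pair a verified citation (chosen) with a corrupted one (rejected)."""
    rejected = corrupt(verified)
    if rejected in statute_nodes:
        return None  # the corruption landed on a real provision; skip it
    return {"prompt": prompt, "chosen": verified, "rejected": rejected}


# Made-up identifiers, not taken from the released graph.
statute_nodes = {"CC-UA:art.625", "CC-UA:art.1166", "CPC-UA:art.141"}
pair = make_preference_pair(
    "Which provision governs liability for a late payment?",
    "CC-UA:art.625",
    statute_nodes,
    shift_article_number,
)
print(citation_precision(["CC-UA:art.625", "CC-UA:art.9999"], statute_nodes), pair)
```

A list of such pairs has the prompt/chosen/rejected shape that DPO-style trainers commonly consume, which is why no human annotation is needed once verified citations are available.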

2606.00892 2026-06-02 cs.LG cs.CE physics.comp-ph

An Exploratory Study into using Machine-Learning for Fast Step-by-step Emulation of Numerical Mechanical Thrombectomy Simulations for Ischemic Stroke

Thijs Stessen

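Affiliations MSc Artificial Intelligence master thesis; Thijs Stessen; MSc. Thijs Kuipers; Dr. Simone Saitta

AI summary Explores machine-learning surrogate models for step-by-step acceleration of numerical mechanical thrombectomy simulations; the surrogates give substantial speedups on a simplified aspiration procedure but lack long-horizon stability on complex geometries.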

Comments 40 pages, 16 figures, master thesis artificial intelligence

Abstract

The treatment of ischemic stroke using mechanical thrombectomy involves difficult decisions under intense time constraints. Numerical physics simulations can in theory help operators make better decisions regarding treatment approaches and device selection, but are too slow to do so in practice. In this thesis, we investigate whether current machine-learning-based surrogates can accurately emulate these simulations in a step-by-step manner while making them significantly faster. To do this, we train three surrogate models on two simulations that involve a simplified aspiration procedure, with varying levels of geometric complexity. Our results show that two of our models accurately predict single simulation steps and provide substantial speedups, especially when combined with specific data augmentations. However, the models lack stability when emulating simulations with complex geometries over longer time periods. Overall, this work provides a foundation for future studies to develop stable methods that scale to realistic numerical physics simulations of mechanical thrombectomy.

2606.00891 2026-06-02 cs.CV

MMDG-Bench: A Benchmark for Multimodal Domain Generalization

Qianshan Zhan, Qian Wang, Da Li, Xiao-Jun Zeng, Xiatian Zhu

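Affiliations University of Manchester; Jiyue AI; Samsung AI Centre Cambridge; University of Surrey

AI summary Proposes the MMDG-Bench benchmark, which unifies multimodal learning and domain generalization through two frameworks, D2M and M2D; on tasks such as action recognition and face anti-spoofing, the structured combinations frequently outperform existing methods, and the analysis yields key design guidelines.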

Abstract

Multi-modal Domain Generalization (MMDG) seeks to leverage complementary modalities to enhance model robustness on unseen domains. Despite extensive progress in Multi-modal Learning (MML) and Domain Generalization (DG) as individual fields, their systematic integration remains under-explored. Current MMDG research is largely confined to action recognition and lacks standardized evaluation protocols. To address this, we introduce MMDG-Bench, a comprehensive benchmark featuring two foundational frameworks: DG then MML (D2M) and MML then DG (M2D). We provide unified experimental protocols across diverse tasks, including video-audio-flow action recognition and RGB-Depth-IR face anti-spoofing. By instantiating ten MMDG baselines through pairing a unified MML configuration with five DG techniques under both D2M and M2D orderings, we demonstrate that these structured combinations frequently outperform existing state-of-the-art methods, underscoring the necessity of a unified benchmarking effort. Our analysis yields three key insights: (1) Integrating DG techniques provides consistent generalization gains across various backbones, whereas non-DG methods are highly sensitive to backbone shifts; (2) The optimal framework choice depends on inter-modal stability: D2M excels when modal relations are stable across domains, while M2D is more robust to cross-domain relational variance; (3) Stronger backbones yield amplified performance dividends when integrated into our structured frameworks. MMDG-Bench provides a principled foundation and actionable design guidelines for future research in multi-modal robustness. Code is released at https://github.com/qszhan/MMDG-Bench.

2606.00890 2026-06-02 cs.CV

Cohort-Scale Neural Atlases of Ultrasound Video

Zhuorui Zhang, Roger Pallarès-López, Xuan Wu, Praneeth Namburi, Brian W. Anthony

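Affiliations Department of Mechanical Engineering, MIT; Institute for Medical Engineering and Science, MIT; MIT.nano Immersion Lab, MIT

AI summary Proposes a cohort-scale neural atlas trained jointly over thousands of frames in DINOv3 feature space, with per-video Generative Latent Optimization embeddings for accurate annotation transfer; evaluated on five cardiac and musculoskeletal datasets, it reaches accuracy competitive with strong baselines.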

Abstract

Ultrasound is the most widely used real-time imaging modality in clinical practice, yet per-frame video annotation remains a major bottleneck: expert labels are scarce and costly, and image appearance varies with speckle, shadowing, attenuation, and operator-dependent probe pose. This is especially limiting because clinically relevant information is often dynamic, from left-ventricular motion in echocardiography to muscle and bone kinematics in musculoskeletal imaging. Population atlases can amortize annotation cost by registering observations to a shared canonical coordinate system, but existing neural atlas methods mainly target single videos, small test-time image sets, or object-centric image collections. We introduce a cohort-scale neural atlas for ultrasound video: a single canonical chart with per-video Generative Latent Optimization embeddings, trained jointly over thousands of frames in DINOv3 feature space. Across five cardiac and musculoskeletal datasets with point landmarks and segmentation masks, our method learns coherent canonical templates and enables accurate atlas-space annotation transfer. On EchoNet-Dynamic and MSK-Bone, it supports single- and few-shot transfer with accuracy competitive with strong dense-correspondence baselines, while training in minutes on a single consumer GPU. The learned embeddings are interpretable: linear projections reveal structured cohort variation, image-decoder interpolation produces anatomically plausible intermediate frames, and test-time latent inversion reconstructs held-out frames through the atlas. These results suggest that cohort-scale neural atlases offer a practical, interpretable representation for reducing expert annotation burden in ultrasound video analysis.

2606.00888 2026-06-02 cs.LG cs.AI

Memory-Efficient LLM Training with Dynamic Sparsity: From Stability to Practical Scaling

Qiao Xiao, Boqian Wu, Patrik Okanovic, Tomasz Sternal, Maurice van Keulen, Elena Mocanu, Mykola Pechenizkiy, Decebal Constantin Mocanu, Torsten Hoefler

Affiliations University of Waterloo; University of California, Berkeley; ETH Zurich; University of Texas at Austin; University of Michigan

AI summary Proposes SMET, which addresses optimization instability in dynamic sparse training through optimizer warm-up and density-aware learning-rate scaling, enabling stable, scalable, and memory-efficient sparse pre-training of LLMs.

Comments Accepted at ICML2026

Abstract

Dynamic Sparse Training (DST) offers a promising paradigm for improving the training and inference efficiency of deep neural networks; however, we find that in large language model training, DST can suffer from optimization instability, manifested as loss spikes after topology updates. In this work, we show that the naive use of standard Adam-based optimizers leads to a cold-start issue for newly regrown parameters, resulting in excessively large updates and disrupted training dynamics. To address this issue, we propose Sparse Memory-Efficient Training (SMET), which stabilizes DST with optimizer warm-up and improves training progress through density-aware learning-rate scaling. SMET further reduces memory consumption by storing gradients and optimizer states only for active parameters. We provide a theoretical analysis of the update behaviors under SMET, showing improved optimization stability. Extensive experiments demonstrate that SMET enables stable, scalable, and memory-efficient sparse pre-training of LLMs, paving the way for sparse training as a practical alternative to dense training. Our code is publicly available at: https://github.com/QiaoXiao7282/SMET.
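One plausible way to see the cold-start effect is to run a single Adam step on a parameter whose moment estimates were just reset to zero while the optimizer's global step count is already large, so bias correction no longer compensates. The sketch below is an assumption-laden illustration, not the SMET algorithm: `adam_update` is a plain scalar Adam step written for this example, and restarting the step counter for a regrown weight is only a hypothetical stand-in for the paper's optimizer warm-up.

```python
import math


def adam_update(grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam step; returns (parameter delta, new m, new v)."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** step)  # bias correction depends on the step count
    v_hat = v / (1.0 - beta2 ** step)
    return lr * m_hat / (math.sqrt(v_hat) + eps), m, v


lr, grad = 1e-3, 0.5

# Regrown weight: zeroed moments, but the global step count is already large.
cold_delta, _, _ = adam_update(grad, m=0.0, v=0.0, step=10_000, lr=lr)

# Same weight if its own step counter restarted at 1 (hypothetical remedy).
fresh_delta, _, _ = adam_update(grad, m=0.0, v=0.0, step=1, lr=lr)

print(cold_delta / lr, fresh_delta / lr)
```

With the default betas, the cold step is roughly (1 - beta1) / sqrt(1 - beta2), about 3.2 times the learning rate, whereas the step with a restarted counter is about one learning rate in magnitude.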

2606.00886 2026-06-02 cs.CV cs.RO

GABI: Geometry-Aware Boundary Integration for Spacecraft Segmentation

Iason Georgios Velentzas, Dhruv Ahuja, Panagiotis Tsiotras

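Affiliations Georgia Institute of Technology

AI summary Proposes GABI, a lightweight boundary-aware multi-task segmentation architecture that augments a convolutional backbone with an auxiliary distance-field prediction head, improving spacecraft segmentation while keeping model complexity low; Average Precision improves by up to 5% on the SPARK benchmark and by more than 50% in generalization experiments.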

Comments Accepted to AI4Space at CVPR 2026

Abstract

Accurate segmentation is crucial for autonomous spacecraft, as it directly affects downstream tasks related to 3D situational awareness. The harsh illumination conditions of space, however, produce images with high variability in appearance, hindering the generalization of segmentation approaches across different spacecraft and environments. In this work, we propose GABI, a lightweight boundary-aware multi-task segmentation architecture that augments a convolutional backbone with an auxiliary distance-field prediction head. The distance field provides dense geometric supervision around object boundaries, encouraging the network to learn spatially consistent representations of spacecraft structures while maintaining low model complexity suitable for onboard perception systems. We evaluate GABI against both an established convolutional baseline and a heavier transformer-based architecture. On the SPARK benchmark, distance-field supervision improves the baseline by up to 5% in Average Precision while achieving performance comparable to the transformer models. In generalization experiments, GABI improves Average Precision by more than 50% over the baseline. In cross-domain evaluation, the lightweight GABI variant performs within 5% in IoU and F1-score of the heavier transformer model while being approximately ten times smaller. At the same time, the heavier GABI variant surpasses the transformer architectures while remaining nearly three times lighter.
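A distance-field target of the kind an auxiliary head could regress can be derived directly from a binary segmentation mask. The sketch below is a brute-force illustration under assumptions: it uses an unsigned Euclidean distance to the nearest boundary pixel, defines a boundary pixel as a foreground pixel with a background 4-neighbour (pixels outside the image count as background), and requires at least one foreground pixel. The abstract does not specify GABI's exact definition, so these choices are hypothetical.

```python
import math


def boundary_pixels(mask):
    """Foreground pixels that touch background in their 4-neighbourhood."""
    h, w = len(mask), len(mask[0])

    def is_background(r, c):
        return not (0 <= r < h and 0 <= c < w) or mask[r][c] == 0

    return [
        (r, c)
        for r in range(h)
        for c in range(w)
        if mask[r][c] == 1
        and any(is_background(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)))
    ]


def distance_field(mask):
    """Unsigned Euclidean distance from every pixel to the nearest boundary pixel."""
    edge = boundary_pixels(mask)
    return [
        [min(math.hypot(r - er, c - ec) for er, ec in edge) for c in range(len(mask[0]))]
        for r in range(len(mask))
    ]


# Toy 5x5 mask with a 3x3 object in the middle.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
field = distance_field(mask)
```

The brute-force search is quadratic in the number of pixels; for real images a library routine such as a Euclidean distance transform would be the practical choice.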

2606.00884 2026-06-02 cs.LG cs.AI

Dive into Waves: Morlet Spectral Transformer for Cross-Subject Emotion Decoding from EEG

Jiaxin Qing, Lexin Li

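Affiliations University of California, Berkeley

AI summary To address cross-subject variability in EEG emotion recognition, proposes the Morlet Spectral Transformer (MST), built on Morlet wavelet tokenization, long-context baseline removal, and band-specific spatial projection; without pretraining, it outperforms large pretrained models and frequency-domain methods on the SEED-family datasets.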

Abstract

We study cross-subject emotion recognition from EEG, a practically important yet challenging problem in brain-computer interfaces. Unlike tasks with clear waveform signatures, emotion-related EEG signals are primarily encoded in spectral power and are weak, noisy, and highly variable across subjects. Existing approaches rely either on large pretrained EEG foundation models, which require massive data yet still struggle with cross-subject variability, or frequency-domain encoders, which better reflect spectral structure but suffer from mismatched representations, drift-dominated tokenization, and lack of band-specific spatial modeling. In this article, we propose the Morlet Spectral Transformer (MST), built around three key components and integrated with a spatiotemporal Transformer backbone. First, Morlet wavelet tokenization provides a time-frequency representation that matches the multi-scale structure of brain rhythms, and extends classical differential entropy features to a form suitable for Transformers. Second, long-context baseline removal acts as a simple temporal normalization that removes subject-specific drift and redundancy across nearby windows. Third, frequency-specific spatial projection learns a separate channel mixer for each frequency band, capturing interpretable band-specific patterns and reducing cross-channel mixing. We show that, even without pretraining, MST consistently outperforms both large pretrained EEG foundation models and frequency-based methods across all SEED-family datasets. These results suggest that careful representation design can yield an accurate, cost-effective, and interpretable alternative to large-scale pretraining.
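The tokenization idea can be sketched for a single channel and a single window: correlate the signal with complex Morlet wavelets at a few centre frequencies and keep the log power at each one. This is a minimal sketch under assumptions rather than the MST tokenizer: the centre frequencies, the six-cycle wavelet width, the normalization, and the function names are all hypothetical. The log is used because the differential entropy of a Gaussian signal with variance s^2 is 0.5 * ln(2 * pi * e * s^2), so log power tracks the classical feature up to constants.

```python
import cmath
import math


def morlet_power(signal, fs, freq, n_cycles=6.0):
    """Power of `signal` at `freq` Hz, measured at the window centre with a
    complex Morlet wavelet (a Gaussian-windowed complex exponential)."""
    sigma = n_cycles / (2.0 * math.pi * freq)  # Gaussian width in seconds
    centre = (len(signal) - 1) / 2.0
    acc, norm = 0j, 0.0
    for n, x in enumerate(signal):
        t = (n - centre) / fs
        w = math.exp(-t * t / (2.0 * sigma * sigma))
        acc += x * w * cmath.exp(-2j * math.pi * freq * t)
        norm += w
    return abs(acc / norm) ** 2


def spectral_token(signal, fs, freqs=(4.0, 10.0, 20.0, 40.0)):
    """One token per window: log wavelet power at each centre frequency."""
    return [math.log(morlet_power(signal, fs, f) + 1e-12) for f in freqs]


fs = 250.0
alpha = [math.sin(2.0 * math.pi * 10.0 * n / fs) for n in range(250)]  # 1 s of a 10 Hz tone
token = spectral_token(alpha, fs)
```

For the synthetic 10 Hz tone, the largest entry of the token is the one at the 10 Hz centre frequency, which is the behaviour a rhythm-matched tokenizer should show.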

2606.00881 2026-06-02 cs.CL

Chunking Methods on Retrieval-Augmented Generation - Effectiveness Evaluation Against Computational Cost and Limitations

Mateusz Śmigielski, Michał Rajkowski, Mateusz Zbrocki, Michał Bernacki-Janson, Karol Kunicki, Julianna Godziszewska, Maciej Piasecki, Konrad Wojtasik

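Affiliations Department of Artificial Intelligence, Faculty of Information and Communication Technology, Wrocław University of Science and Technology

AI summary Presents what the authors describe as the first systematic evaluation of a wide range of chunking methods in RAG systems, highlighting important and often overlooked issues in chunking strategies.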

Abstract

Retrieval-Augmented Generation (RAG) has demonstrated significant capabilities in enhancing the performance of Large Language Models (LLMs). One of the key tasks in RAG systems is the chunking process. Traditionally, fixed-size chunking and semantic chunking have been the standard approaches. However, interest in chunking strategies has been increasing, leading to a growing number of proposed methods that often claim improved performance over these conventional techniques. Many of these approaches are tailored to specific use cases and data types, with limited evidence of their effectiveness across diverse scenarios. As a result, it remains challenging to directly compare different techniques and assess their relative strengths. To the best of our knowledge, this study is the first to systematically evaluate the effectiveness of a wide range of chunking methods and emphasize the underlying challenges of chunking strategies in RAG systems. While chunking is commonly treated as a simple preprocessing step, we show that it introduces a range of impactful and often overlooked issues.
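For readers unfamiliar with the baseline, fixed-size chunking can be sketched in a few lines. The version below splits on whitespace and counts words, with an overlap between consecutive chunks; the word-based unit, the default sizes, and the function name are assumptions made for this example, and the paper's own configurations may differ (for instance by counting tokens or characters).

```python
def fixed_size_chunks(text, size=200, overlap=20):
    """Split text into chunks of at most `size` words, where consecutive
    chunks share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


print(fixed_size_chunks("a b c d e f g", size=3, overlap=1))
```

Even this simple scheme exposes choices such as the unit of length, the overlap, and how a short tail is handled, which is the kind of detail that makes chunking less trivial than a preprocessing step suggests.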

2606.00880 2026-06-02 cs.LG cs.AI

Task diversity produces systematic transfer but inhibits continual reinforcement learning

Purab Seth, Neil Shah, Kunal Jha, Samuel J. Gershman, Max Kleiman-Weiner, Wilka Carvalho

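Affiliations MIT; University of California, Berkeley; Princeton University; Harvard University

AI summary Introduces Banyan, a GPU-accelerated continual reinforcement learning domain, to study how task diversity (map layouts, interactive objects, sub-goal hierarchies) affects an agent's ability to keep learning under distribution shift; diversity promotes local transfer, but longer-horizon tasks plateau and earlier task distributions are forgotten.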

Abstract

Continual reinforcement learning aims to produce agents that learn not only to improve at their current tasks but also to adapt as task distributions change. Training an agent on many diverse tasks can induce zero-shot generalization, but previous work generally evaluates this generalization after training -- with frozen weights. Whether task diversity also improves an agent's ability to continue learning across distribution shifts remains unclear. We introduce Banyan, a GPU-accelerated continual RL domain in which task diversity factors into three independently controllable axes: the map layouts an agent must navigate, the objects it must interact with, and the hierarchical structures of sub-goal dependencies. Across individual distribution shifts, increasing diversity along each axis causes agents to begin training on the new tasks near the performance attained on the previous one, even when the shift changes the structure of the optimal policy. However, as the number of shifts increases, this local transfer does not by itself yield sustained continual learning: longer-horizon tasks plateau, and earlier task distributions are forgotten after later training. Banyan is a benchmark for studying when controlled task diversity produces transferable learning, when that transfer persists, and where it falls short of proper continual learning.

2606.00875 2026-06-02 cs.CL

IDEAFix: Evaluation Framework for Creative Defixation Prompting in LLMs

F. Carichon, S. Sharma, M. Girard, R. Rampa, G. Farnadi

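Affiliations McGill University; Mila; Concordia University; ÉTS

AI summary Proposes the IDEAFix framework, which systematically analyzes the divergent thinking of LLMs in open-ended idea generation by controlling task variants and prompting strategies; task formulation and attribute selection significantly affect performance, and simple prompts can boost originality, but output homogenization persists.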

Abstract

Large language models (LLMs) are increasingly used for tasks involving creative problem solving and idea generation. However, there is a lack of consensus concerning their creative capabilities: some studies report superior performance compared to humans, while others highlight structural limitations such as fixation and the homogenization of outputs. Existing evaluation approaches either rely on narrow, decontextualized tasks that do not capture goal-oriented generation or on broader settings that confound multiple aspects of the creative process, making it difficult to isolate the effects of task formulation, prompting, and evaluation design. Notably, the role of structured prompting strategies in shaping idea generation remains underexplored. Therefore, we introduce IDEAFix, an evaluation framework for analyzing divergent thinking in open-ended idea generation tasks. We prompt models to generate multiple original solutions to controlled variations of short design scenarios, task attributes, and defixation prompting strategies. This design enables systematic analysis of how structured guidance influences LLMs' idea generation. Our results show that both task formulation and attribute selection significantly affect models' performance, and that simple prompting strategies can boost the originality of solutions. However, we also observe persistent output homogenization across models, confirming inherent limits in their ability to generate diverse solutions. Overall, IDEAFix provides a controlled, extensible framework for studying the mechanisms underlying LLMs' creativity.

2606.00872 2026-06-02 cs.CV

Images as Tables: In-Context Learning with TabPFN for Low-Data Detection of AI-Generated Images

Jan Philip Walter, Shashank Agnihotri, Margret Keuper

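Affiliations Jan Philip Walter; Shashank Agnihotri; Margret Keuper

AI summary Proposes converting images to tabular form: a frozen DINOv3 backbone extracts features, and TabPFN classifies them by in-context learning, detecting AI-generated images effectively in low-data settings, where it outperforms the recent detector LATTE.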

Comments Accepted as a Spotlight Oral at the ICML 2026 Workshop Foundation Models for Structured Data. *Equal Contribution

Abstract

AI-generated image detection is a moving-target problem: detectors trained on one generator often fail when a new generator appears, and only a few labeled examples are available. We study a simple image-to-table formulation for this regime, where each image is encoded by a frozen DINOv3 backbone, its CLS feature is reduced to a 500-dimensional structured row with PCA, and TabPFN performs real/fake classification by in-context tabular inference rather than task-specific classifier training. This turns fake-image detection into low-data structured prediction over learned visual features, making detector adaptation depend on the labeled context set instead of gradient-based fine-tuning. On GenImage, LATTE, a recent state-of-the-art detector, remains stronger when many labeled samples from all generators are available, by 7.4% in the largest pooled setting, but DINOv3-PCA-TabPFN is stronger in the practically important low-data regime, outperforming LATTE by up to 8.2%, and in transfer settings where the detector must generalize from one generator to another. These results position tabular foundation models as a strong complementary adaptation mechanism for image forensics, shifting adaptation from detector retraining to lightweight in-context updates with a small labeled set of examples. Code URL: https://github.com/jpwalter30/Towards-Generalizable-Detection-of-AI-Generated-Images

2606.00871 2026-06-02 cs.CV cs.AI

Benchmarks for Vision-Language Models in Urban Perception Should Be Reliability-Aware and Negotiated

Rashid Mushkani

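Affiliations Rashid Mushkani

AI summary Argues that VLM benchmarks for urban perception should treat disagreement and abstention as measurement outcomes, report inter-annotator reliability, and treat the label space and scoring policy as negotiable artifacts.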

Comments To appear in the Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Vision-language models (VLMs) are increasingly used to generate structured descriptions of street-level imagery for tasks such as streetscape auditing, mapping, and public consultation. These uses combine observable attributes with appraisal categories, and the human targets are often distributions of judgments with disagreement and explicit non-response. This paper argues that benchmarking VLMs for urban perception should treat disagreement and abstention as measurement outcomes, report inter-annotator reliability alongside model alignment, and treat the label space and scoring policy as negotiable artifacts when outputs are intended to inform urban governance. We ground the argument in a benchmark of 100 Montreal street scenes annotated along 30 dimensions by 12 participants from seven community organizations, and in a deterministic zero-shot evaluation of seven VLMs. Across dimensions, model agreement with human consensus co-varies with dimension-level human reliability, and for the appraisal dimension Overall Impression models and annotators exhibit distributional mismatch including different rates of Not applicable. We close with actions for benchmark creators, model developers, and institutions to make uncertainty and benchmark assumptions visible in evaluation reports.

2606.00869 2026-06-02 cs.LG

Enhancing LLM Metacognition via Cognitive Pairwise Training

通过认知成对训练增强LLM元认知

Weitao Li, Hao Zhou, Xuanyu Lei, Fandong Meng, Yuanhang Liu, Jingyi Ren, Ante Wang, Xiaolong Wang, Yuanchi Zhang, Fuwen Luo, Guangwen Yang, Lin Gan, Weizhi Ma, Yang Liu

发表机构 * National Engineering Laboratory for Intelligent Information Processing, Academy of Mathematics and Physics, Chinese Academy of Sciences(智能信息处理国家工程实验室,中国科学院数学物理研究所) University of Science and Technology of China(中国科学技术大学)

AI总结 提出认知成对训练(CPT),通过成对比较推理轨迹来学习区分可靠与不可靠推理,从而提升LLM的推理与元认知权衡。

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)已成为LLM推理的核心,但其结果级奖励可能使模型在证据或推理不可靠时更愿意给出自信答案。现有的SFT或RL方法主要在响应级别教导LLM拒绝或表达不确定性,这可能导致过度拟合拒绝行为,而非提高推理可靠性。为解决这一局限,我们提出认知成对训练(CPT),这是一种认知中期训练对齐阶段,将推理轨迹上的成对比较转化为可复用的对齐信号。通过学习区分可信与有缺陷的推理,CPT鼓励模型内化推理质量判别边界,而非记忆表面拒绝模式。在五个模型规模和三个模型家族上,CPT改善了推理与元认知的权衡。在14B规模上,CPT+RL相比标准SFT+RL流水线在数学平均分上提升2.2分,在拒绝F1上提升5.2分。进一步分析表明,CPT提高了轨迹质量,并在评估和训练设置中表现出强鲁棒性和可扩展性。代码和模型已发布在https://github.com/Tsinghua-dhy/CPT。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become central to LLM reasoning, but its outcome-level rewards can make models more willing to give confident answers when evidence or reasoning is unreliable. Existing SFT or RL methods mainly teach LLMs to refuse or express uncertainty at the response level, which can overfit abstention behavior rather than improve reasoning reliability. To address this limitation, we propose Cognitive Pairwise Training (CPT), a cognitive mid-training alignment stage that turns pairwise comparisons over reasoning traces into a reusable alignment signal. By learning to distinguish trustworthy from flawed reasoning, CPT encourages the model to internalize a reasoning-quality discrimination boundary rather than memorize surface refusal patterns. Across five model scales and three model families, CPT improves the reasoning-metacognition trade-off. At 14B, CPT+RL outperforms the standard SFT+RL pipeline by +2.2 math-average points and +5.2 abstention-F1 points. Further analyses show that CPT improves trace quality and exhibits strong robustness and scalability across evaluation and training settings. Code and models are released at https://github.com/Tsinghua-dhy/CPT.

2606.00857 2026-06-02 cs.RO cs.AI

From Cues to Horizons: Dynamic Risk Horizon Profiling for Trajectory Prediction

从线索到视野:轨迹预测的动态风险视界剖面

Xinyi Ning, Zilin Bian, Dachuan Zuo, Semiha Ergan, Kaan Ozbay

发表机构 * Department of Civil and Urban Engineering, New York University(纽约大学土木与城市工程系) Department of Civil Engineering Technology and Environmental Management Safety, Rochester Institute of Technology(罗切斯特理工学院土木工程技术与环境安全管理系)

AI总结 提出风险视界剖面(RHP)模块,通过连续可学习的势场模型对未来风险分布进行建模,以提升轨迹预测的准确性,在highD和SHRP2数据集上分别降低5秒RMSE 25.0%和5秒minFDE 29.1%。

Comments 11 pages, 7 figures, submitted to IEEE Transactions on Intelligent Transportation Systems (T-ITS)

详情
AI中文摘要

准确可靠的车辆轨迹预测对于安全自动驾驶至关重要。最近的研究将安全风险纳入轨迹预测,以量化周围代理带来的危险。然而,大多数风险感知方法将过去的风险信息作为辅助信号来帮助决策,忽视了其未来的演变和不确定性。在本文中,我们提出了一种风险视界剖面(RHP)模块,该模块结合了连续、可学习的势场模型,用于风险感知轨迹预测。RHP模块计算周围物体的时空接近度,以描绘未来视界上的风险分布,通过自适应识别人类驾驶员认为的关键时刻,支持更好的轨迹预测。我们在两个不同驾驶设置的数据集上评估了我们的方法:highD(高速公路走廊)和SHRP2(城市街道),涵盖了包括安全、近碰撞和碰撞事件在内的多种风险场景。与基线方法相比,我们的框架在highD数据集上实现了5秒RMSE降低25.0%,在SHRP2上实现了5秒minFDE降低29.1%。这些结果表明,该方法在短视界和长视界预测中均表现出色,并且在高速公路和城市场景中具有强大的泛化能力。所提出的方法能够实现更真实的自动驾驶车辆路径规划和策略选择,从而支持更安全的自动驾驶和更先进的驾驶员辅助系统。本工作的源代码可在以下网址获取:https://github.com/bilab-nyu/RHP

英文摘要

Accurate and reliable vehicle trajectory prediction is essential for safe autonomous driving. Recent studies have incorporated safety risk into trajectory prediction to quantify dangers posed by surrounding agents. However, most risk-aware approaches use past risk information as a secondary signal to help guide decisions, overlooking its future evolution and uncertainty. In this paper, we propose a risk horizon profiling (RHP) module that incorporates a continuous, learnable potential field model for risk-aware trajectory prediction. The RHP module calculates the spatial-temporal proximity of surrounding objects to profile risk distributions across future horizons, which supports better trajectory prediction by adaptively identifying what human drivers perceive as critical moments. We evaluate our method on two datasets from different driving settings, highD for highway corridors and SHRP2 for urban streets, which cover diverse risk scenarios including safe, near-crash, and crash events. Compared to the baseline methods, our framework achieves a 25.0% reduction in 5s RMSE on the highD dataset and a 29.1% reduction in 5s minFDE on SHRP2. These results indicate strong performance for both short and long horizon prediction and robust generalization across highway and urban scenarios. The proposed method enables more realistic AV path planning and strategic selection, thereby supporting safer autonomous driving and more advanced driver-assistance systems. The source code for this work is available at: https://github.com/bilab-nyu/RHP
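To make the idea of profiling risk across future horizons concrete, here is a minimal sketch under strong assumptions: positions are one-dimensional, the potential is a fixed Gaussian of the gap between the ego vehicle and one surrounding vehicle, and the width `sigma` is a hand-set constant rather than a learned parameter. The function name and all numbers are hypothetical and are not taken from the RHP code.

```python
import math


def risk_profile(ego_positions, other_positions, sigma=10.0):
    """Per-step risk weights from a Gaussian potential of the gap, normalized to sum to 1."""
    potentials = [
        math.exp(-((ego - other) ** 2) / (2 * sigma ** 2))
        for ego, other in zip(ego_positions, other_positions)
    ]
    total = sum(potentials)
    return [p / total for p in potentials]


# Hypothetical longitudinal positions (metres) over four future steps.
ego = [0.0, 10.0, 20.0, 30.0]
other = [40.0, 35.0, 30.0, 32.0]
profile = risk_profile(ego, other)
critical_step = max(range(len(profile)), key=lambda i: profile[i])
```

The step with the largest weight is the one where the two vehicles are closest, which is the kind of critical moment a downstream predictor could attend to.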

2606.00852 2026-06-02 cs.CV cs.AI cs.LG

RefDiffNet: Learning to Expose Subtle PCB Defects Before Detection

RefDiffNet: 在检测前学习暴露细微PCB缺陷

Vinay Edula, Nilesh Badwe, Priyanka Bagade

发表机构 * Department of Computer Science and Engineering, Indian Institute of Technology Kanpur(印度理工学院坎普尔分校计算机科学与工程系) Department of Materials Science and Engineering, Indian Institute of Technology Kanpur(印度理工学院坎普尔分校材料科学与工程系)

AI总结 提出RefDiffNet,一种轻量级即插即用的输入增强模块,通过引入无缺陷参考图像来突出缺陷区域,从而提升下游检测器在PCB缺陷检测中的性能。

详情
AI中文摘要

印刷电路板(PCB)缺陷检测具有挑战性,因为许多缺陷很小且难以与复杂的背景图案区分。大多数基于深度学习的PCB检测方法仅依赖被检测的PCB图像进行缺陷检测,忽略了编码走线、焊盘和其他PCB结构预期布局的无缺陷参考图像。在这项工作中,我们提出了RefDiffNet,一种轻量级即插即用的输入增强模块,放置在检测器主干之前,用于在缺陷检测前增强图像。RefDiffNet将经典检测中的一个成熟思想带入深度学习时代,利用无缺陷参考图像来揭示缺陷。RefDiffNet比较缺陷图像与对齐的参考图像,捕获相对于参考图像的结构变化,并使用轻量级编码器输出缺陷区域被突出的原始图像,从而简化下游检测器的任务。在HRIPCB和DeepPCB上的结果表明,RefDiffNet在各类检测器上一致地提升了性能,包括从YOLOv8到YOLOv26的单阶段检测器、基于Transformer的RT-DETR以及两阶段Faster R-CNN。它实现了高达18%的相对mAP50:95增益,且开销可忽略,仅引入0.004-0.005M额外参数和0.7-0.8 GFLOPs,最多占任何评估检测器参数量的0.25%。结果确立了RefDiffNet作为一种轻量级、即插即用、检测器无关的输入增强模块,以最小的计算成本显著提升PCB缺陷检测性能。

英文摘要

Printed circuit board (PCB) defect detection is challenging because many defects are small and difficult to distinguish from complex background patterns. Most deep learning-based PCB inspection methods rely only on the inspected PCB image for defect detection, ignoring the defect-free reference image that encodes the expected layout of traces, pads, and other PCB structures. In this work, we propose RefDiffNet, a lightweight plug-and-play input enhancement block placed before the detector backbone to enhance the image before defect detection. RefDiffNet brings one proven idea from classical inspection into the deep learning era, using a defect-free reference image to reveal defects. RefDiffNet compares the defective image with the aligned reference, captures structural changes relative to the reference, and uses a lightweight encoder to output the original image with defective regions highlighted, thereby making the downstream detector's task easier. Results on HRIPCB and DeepPCB show that RefDiffNet consistently improves performance across detector families, including one-stage detectors from YOLOv8 to YOLOv26, the transformer-based RT-DETR, and the two-stage Faster R-CNN. It achieves up to 18% relative mAP50:95 gain with negligible overhead, introducing only 0.004 - 0.005M additional parameters and 0.7 - 0.8 GFLOPs, amounting to at most 0.25% of the parameter count of any evaluated detector. Results establish RefDiffNet as a lightweight, plug-and-play, detector-agnostic input enhancement module that substantially improves PCB defect detection with minimal computational cost.
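The reference-comparison idea can be sketched with a fixed, hand-written rule: boost each pixel of the inspected image by its absolute difference from the aligned reference. This is only an illustration of highlighting by reference difference. RefDiffNet itself uses a learned lightweight encoder whose details are not given in the abstract, and the function name, `gain` parameter, and toy images below are hypothetical.

```python
def highlight_defects(inspected, reference, gain=1.0):
    """Boost pixels in proportion to their absolute difference from the reference.

    Both images are equally sized 2D lists of floats in [0, 1]; output is clipped to 1.0.
    """
    enhanced = []
    for row_i, row_r in zip(inspected, reference):
        enhanced.append([min(1.0, p + gain * abs(p - r)) for p, r in zip(row_i, row_r)])
    return enhanced


# Hypothetical 3x3 grayscale patches; the inspected patch differs at the bottom right.
reference = [[0.2, 0.2, 0.2],
             [0.2, 0.8, 0.2],
             [0.2, 0.2, 0.2]]
inspected = [[0.2, 0.2, 0.2],
             [0.2, 0.8, 0.2],
             [0.2, 0.2, 0.6]]
enhanced = highlight_defects(inspected, reference)
```

Pixels that match the reference pass through unchanged, so the downstream detector still sees the original image with only the deviating region amplified.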

2606.00851 2026-06-02 cs.SD cs.CL cs.HC cs.LG eess.AS

Sympatheia: Emotionally Adaptive Voice Assistant with Continuous Affect Conditioning

Sympatheia: 具有连续情感调节的情感自适应语音助手

Sukru Samet Dindar, Riki Shimizu, Xilin Jiang, Nima Mesgarani

发表机构 * Department of Electrical Engineering, Columbia University(电气工程系,哥伦比亚大学)

AI总结 提出Sympatheia语音对话框架,通过从用户语音推断情感并结合连续效价-唤醒度控制信号,实现情感自适应响应,优于基线模型。

详情
AI中文摘要

共情口语对话系统必须推断用户的情感状态以做出适当响应,然而日常语音通常带有微弱、中性或模糊的情感线索。为解决这一问题,我们引入了Sympatheia,一种语音到语音对话框架,其条件基于从用户语音中推断出的情感,并且在可用时,基于多模态感知模块或用户界面提供的连续效价-唤醒度(VA)控制信号中的明确情感规格。为了训练我们的模型,我们构建了Sympatheia-18k,一个包含12个情感锚点的情感条件合成口语对话语料库。该数据集包括用于学习情感语音行为的情感分割,以及一个中性分割,该分割将情感中性查询与多个情感条件响应配对,以在情感模糊情况下隔离明确的情感控制。实验结果表明,Sympatheia在生成语义内容和口语表达均情感适当的响应方面优于语音对话基线。我们进一步表明,相同的VA界面可以整合来自不同感知模块(包括面部表情、生物信号和文本情感描述)的情感估计,从而在语音单独提供有限情感证据时改善响应对齐。这些结果表明,连续情感调节是构建情感自适应语音助手的有效实际步骤。

英文摘要

Empathetic spoken dialogue systems must infer a user's emotional state to respond appropriately, yet everyday speech often carries weak, neutral, or ambiguous affective cues. To address this, we introduce Sympatheia, a speech-to-speech dialogue framework conditioned on affect inferred from the user's speech and, when available, explicit affect specifications provided as a continuous valence-arousal (VA) control signal by a multimodal sensing module or user interface. To train our model, we construct Sympatheia-18k, an emotion-conditioned synthetic spoken dialogue corpus with 12 emotion anchors. This dataset includes an emotional split for learning affective speech behavior, and a neutral split that pairs emotionally neutral queries with multiple emotion-conditioned responses to isolate explicit emotion control in emotionally ambiguous cases. Empirical results show that Sympatheia outperforms speech conversational baselines in generating responses whose semantic content and spoken delivery are both emotionally appropriate. We further show that the same VA interface can integrate emotion estimates from diverse sensing modules, including facial expression, biosignals, and textual affect descriptions, improving response alignment when speech alone provides limited emotional evidence. These results suggest that continuous affect conditioning is an effective practical step for building emotionally adaptive voice assistants.

2606.00846 2026-06-02 cs.LG

CUPID in the Model Zoo: Online Matchmaking for Selecting Your Dream LLM

模型动物园中的丘比特:在线匹配以选择你的梦想大语言模型

Son Nguyen, Xinyuan Liu, Ransalu Senanayake

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种基于决斗老虎机算法的主动学习框架,通过迭代选择大语言模型对并收集用户反馈,高效匹配用户偏好与模型能力。

Comments 38 pages, 11 figures

详情
AI中文摘要

用户越来越面临从快速增长的大语言模型池中为给定任务选择合适的LLM的挑战,每个模型具有独特但通常不透明的潜在属性。加剧这一挑战的是,用户可能缺乏词汇或意识来明确表达他们在LLM的响应或部署中所重视的特征。我们提出了一种交互高效的主动学习框架,其中决斗老虎机算法迭代选择LLM对,收集用户关于其响应的反馈,并更新其对用户潜在偏好的信念。我们引入了一种新颖的信念感知上置信界策略,平衡模型池的探索与推断偏好的利用,从而在用户指定的成本和时间预算下实现用户需求与LLM能力之间的高效对齐。通过在LLM和人类研究上的多样化实验,我们实验验证了我们的模型能够以较低成本高效地将良好对齐的LLM匹配给用户。

英文摘要

Users increasingly face the challenge of selecting an appropriate LLM for a given task from a rapidly growing pool of LLMs, each with distinct but often opaque latent properties. Compounding this challenge, users may lack the vocabulary or awareness to explicitly articulate the characteristics they value in an LLM's responses or deployment. We propose an interaction-efficient active learning framework in which a dueling bandit algorithm iteratively selects pairs of LLMs, collects user feedback about their responses, and updates its belief about the user's latent preferences. We introduce a novel belief-aware upper confidence bound strategy that balances exploration of the model pool with exploitation of inferred preferences, enabling efficient alignment between user needs and LLM capabilities under user-specified cost and time budgets. Through diverse experiments on LLMs and human studies, we experimentally verify that our model can efficiently match well-aligned LLMs to users at a lower cost.
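The select-compare-update loop described above can be sketched with a generic upper-confidence-bound rule on each model's empirical win rate. This is not the paper's belief-aware strategy, whose form the abstract does not specify. The simulated user, who always prefers the model with the higher hidden score, and all names and numbers are assumptions for illustration.

```python
import math


def run_duels(hidden_scores, rounds):
    """Duel the two models with the highest UCB each round; return (best, wins, plays)."""
    k = len(hidden_scores)
    wins = [0] * k
    plays = [0] * k

    def ucb(i, t):
        if plays[i] == 0:
            return float("inf")  # force every model to be tried at least once
        return wins[i] / plays[i] + math.sqrt(2 * math.log(t) / plays[i])

    for t in range(1, rounds + 1):
        ranked = sorted(range(k), key=lambda i: ucb(i, t), reverse=True)
        a, b = ranked[0], ranked[1]
        # Simulated feedback: the user prefers the model with the higher hidden score.
        winner = a if hidden_scores[a] >= hidden_scores[b] else b
        wins[winner] += 1
        plays[a] += 1
        plays[b] += 1

    best = max(range(k), key=lambda i: wins[i] / plays[i] if plays[i] else 0.0)
    return best, wins, plays


# Hypothetical pool of three models with hidden user-specific quality scores.
best, wins, plays = run_duels([0.2, 0.9, 0.5], rounds=30)
```

A cost or time budget, as mentioned in the abstract, would replace the fixed `rounds` argument with a stopping condition on accumulated cost.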

2606.00844 2026-06-02 cs.CV cs.AI cs.LG

MoEIoU: Rethinking Bounding-Box Regression as a Mixture of Experts

MoEIoU:将边界框回归重新思考为混合专家模型

Vinay Edula, Priyanka Bagade

发表机构 * Indian Institute of Technology Kanpur(印度理工学院坎普尔分校)

AI总结 提出MoEIoU损失函数,通过混合专家模型联合优化重叠、中心对齐和长宽比,并采用课程学习权重调度,在多个数据集和YOLO架构上超越现有IoU损失。

详情
AI中文摘要

边界框回归是目标检测的基本组成部分,在精确目标定位中起着关键作用。现有的基于交并比(IoU)的损失函数通过引入几何惩罚项(如中心距离和长宽比不匹配)来扩展IoU目标,以改进边界框回归。然而,这些惩罚项通常在训练过程中保持不变,没有考虑优化动态:预测框在初始阶段表现出较大的中心距离和形状误差,而后期阶段则侧重于提高与真实框的重叠。为了解决这一局限性,我们引入了MoEIoU,一种基于混合专家的回归损失,它联合建模了重叠、中心对齐和长宽比不匹配。MoEIoU使用log-sum-exp函数聚合这些组件,该函数强调主要的定位误差,同时保持其他项的平滑贡献。此外,采用基于课程的权重调度,在早期训练阶段优先纠正框的位置和形状,在后期阶段提高重叠。我们在PASCAL VOC、HRIPCB和MS COCO上使用多种YOLO架构以及大规模模拟实验评估了所提出的MoEIoU。它始终优于标准和最新的最先进损失,表现出更快的收敛速度和更高的定位精度。我们进一步表明,这种自适应聚合改进了现有的基于IoU的损失,带来了一致的增益,并为目标检测框架中的边界框回归提供了更有效的优化指导。

英文摘要

Bounding-box regression is a fundamental component of object detection, playing a critical role in precise object localization. Existing Intersection-over-Union (IoU)-based loss functions extend the IoU objective by incorporating geometric penalties, such as center-distance and aspect-ratio mismatch, to improve bounding-box regression. However, these penalties typically remain fixed throughout training and do not account for the optimization dynamics in which predicted boxes initially exhibit large center-distance and shape errors, with later stages focusing on improving overlap with the ground truth. To address this limitation, we introduce MoEIoU, a mixture-of-experts based regression loss that jointly models overlap, center alignment, and aspect-ratio mismatch. MoEIoU aggregates these components using a log-sum-exp function, which emphasizes the dominant localization error while maintaining smooth contributions from other terms. Additionally, a curriculum-based weighting schedule is employed to prioritize correcting box position and shape in early training stages and improving overlap in later stages. We evaluate the proposed MoEIoU on PASCAL VOC, HRIPCB, and MS COCO using multiple YOLO architectures, along with large-scale simulation experiments. It consistently outperforms standard and recent state-of-the-art losses, demonstrating faster convergence and improved localization accuracy. We further show that this adaptive aggregation improves existing IoU-based losses, yielding consistent gains and providing more effective optimization guidance for bounding-box regression in object detection frameworks.
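A minimal sketch of the aggregation described above, under stated assumptions: the overlap term is 1 - IoU, the center and aspect-ratio terms borrow the familiar DIoU and CIoU forms, and the curriculum is a linear schedule. None of these specifics is confirmed by the abstract, and the function names are hypothetical; only the use of log-sum-exp over the three components and an early-to-late shift in emphasis come from the text.

```python
import math


def iou_terms(pred, target):
    """Return (overlap_loss, center_loss, aspect_loss) for (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(pred[2], target[2]) - max(pred[0], target[0]))
    ih = max(0.0, min(pred[3], target[3]) - max(pred[1], target[1]))
    inter = iw * ih
    wp, hp = pred[2] - pred[0], pred[3] - pred[1]
    wt, ht = target[2] - target[0], target[3] - target[1]
    iou = inter / (wp * hp + wt * ht - inter)
    # Squared center distance over the squared diagonal of the enclosing box (DIoU-style).
    dx = (pred[0] + pred[2]) / 2 - (target[0] + target[2]) / 2
    dy = (pred[1] + pred[3]) / 2 - (target[1] + target[3]) / 2
    ew = max(pred[2], target[2]) - min(pred[0], target[0])
    eh = max(pred[3], target[3]) - min(pred[1], target[1])
    center = (dx ** 2 + dy ** 2) / (ew ** 2 + eh ** 2)
    # Aspect-ratio mismatch (CIoU-style).
    aspect = 4 / math.pi ** 2 * (math.atan(wt / ht) - math.atan(wp / hp)) ** 2
    return 1.0 - iou, center, aspect


def curriculum_weights(progress):
    """Assumed linear schedule: progress 0 favors center/shape, progress 1 favors overlap."""
    return 0.5 + 0.5 * progress, 1.0 - 0.5 * progress, 1.0 - 0.5 * progress


def lse_loss(terms, weights, temperature=1.0):
    """Smooth maximum of the weighted terms via a numerically stable log-sum-exp."""
    scaled = [w * t / temperature for w, t in zip(weights, terms)]
    m = max(scaled)
    return temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))


terms = iou_terms((1.0, 0.0, 3.0, 2.0), (0.0, 0.0, 2.0, 2.0))
loss_early = lse_loss(terms, curriculum_weights(0.0))
perfect = lse_loss(iou_terms((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)), (1.0, 1.0, 1.0))
```

Note that log-sum-exp over three zero terms equals log 3 rather than 0, so a practical loss would likely subtract that constant; whether MoEIoU does so is not stated.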

2606.00840 2026-06-02 cs.AI

Certificate-Guided Evaluation of Reinforcement Learning Generalization

证书引导的强化学习泛化评估

Vignesh Subramanian, Đorđe Žikelić, Suguman Bansal

发表机构 * School of Computer Science, Georgia Institute of Technology(佐治亚理工学院计算机科学学院) School of Computing and Information Systems, Singapore Management University(新加坡管理大学计算与信息系统学院)

AI总结 提出一个逻辑驱动框架,通过神经证书函数验证强化学习算法在未见任务上的泛化能力,并证明证书违规率与测试任务成功率负相关。

详情
AI中文摘要

本文提出了一个逻辑驱动框架,用于评估强化学习算法在泛化到未见任务方面的性能。我们的框架定义了一类归纳可达-避免任务,这些任务在任务动态中具有结构相似性,从而能够评估泛化能力。我们引入了一个神经证书函数,通过强制执行关键条件来验证强化学习算法生成的轨迹,从而作为强化学习泛化的试金石。我们通过实验证明了该方法在几个最先进的可泛化强化学习算法上的能力,在具有挑战性的连续环境中验证了泛化能力。我们的结果表明,证书函数违规率越低,成功解决的测试任务数量越多,突显了我们的框架在评估和区分强化学习算法泛化能力方面的有效性。这项工作为基准测试强化学习泛化提供了一种原则性方法。

英文摘要

This work presents a logic-driven framework to evaluate the performance of reinforcement learning (RL) algorithms in their ability to generalize to unseen tasks. Our framework defines a family of inductive reach-avoid tasks, characterized by structural similarities in task dynamics, enabling evaluation of generalization capabilities. We introduce a neural certificate function that validates trajectories generated by RL algorithms by enforcing key conditions, thereby serving as a litmus test for RL generalization. We empirically demonstrate our method's capability in certifying generalization for several state-of-the-art generalizable RL algorithms on challenging continuous environments. Our results show that a lower percentage of certificate function violations correlates with a higher number of test tasks successfully solved, highlighting the effectiveness of our framework in evaluating and distinguishing generalization capabilities of RL algorithms. This work provides a principled approach for benchmarking RL generalization.
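The abstract does not list the "key conditions" the certificate enforces. As a hedged illustration, the sketch below assumes two conditions that are common for reach-avoid certificates: the next state must not be unsafe, and the certificate value must strictly decrease by a margin until the goal is reached. The hand-written distance-to-goal certificate stands in for the paper's neural certificate, and all names and numbers are hypothetical.

```python
def violation_rate(trajectory, certificate, is_goal, is_unsafe, epsilon=1e-3):
    """Fraction of pre-goal transitions that enter an unsafe state or fail to
    decrease the certificate value by at least `epsilon`."""
    violations = 0
    steps = 0
    for state, next_state in zip(trajectory, trajectory[1:]):
        if is_goal(state):
            break  # conditions only need to hold until the goal is reached
        steps += 1
        if is_unsafe(next_state) or certificate(next_state) > certificate(state) - epsilon:
            violations += 1
    return violations / steps if steps else 0.0


# Hypothetical 1D task: reach x >= 5 while never going below x = 0.
trajectory = [0.0, 1.0, 2.0, 1.5, 3.0, 5.0]
rate = violation_rate(
    trajectory,
    certificate=lambda x: abs(5.0 - x),  # stand-in for a learned certificate
    is_goal=lambda x: x >= 5.0,
    is_unsafe=lambda x: x < 0.0,
)
```

Here one of five transitions moves away from the goal, so the rate is 0.2. The paper's finding is that lower rates of this kind go with more test tasks solved.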

2606.00838 2026-06-02 cs.AI

Decoupled Behavioral Cloning for Scalable Inductive Generalization in RL from Specifications

解耦行为克隆实现基于规范的强化学习中的可扩展归纳泛化

Vignesh Subramanian, Subhajit Roy, Suguman Bansal

发表机构 * School of Computer Science, Georgia Institute of Technology, USA(美国佐治亚理工学院计算机科学学院) Department of Computer Science and Engineering, Indian Institute of Technology Kanpur, India(印度理工学院坎浦尔分校计算机科学与工程系)

AI总结 提出DIBS方法,通过解耦任务特定策略学习与演化函数学习,利用行为克隆替代噪声奖励聚合,提升训练稳定性和零样本泛化能力。

详情
AI中文摘要

归纳泛化是强化学习泛化的一种框架,其中归纳相关的任务实例允许归纳相关的策略。先前的工作通过直接使用强化学习学习的高阶策略演化函数捕捉这种结构,但存在训练可扩展性差的问题:随着训练任务增加,聚合的奖励反馈变得嘈杂且冲突,破坏训练稳定性并削弱泛化能力。我们提出DIBS,一种解耦的行为克隆方法,将学习任务特定策略与学习演化函数分离。我们首先通过标准强化学习为每个任务学习独立的教师策略,然后通过行为克隆在教师标记的状态-动作对上拟合演化函数。这用密集、稳定的监督取代了嘈杂的奖励聚合。DIBS在训练稳定性和零样本泛化方面相比现有强化学习和元强化学习算法取得了显著改进。

英文摘要

Inductive generalization is a framework for reinforcement learning (RL) generalization in which inductively related task instances admit inductively related policies. Prior work captures this structure via a higher-order policy-evolution function learned directly with RL, but suffers from poor training scalability: as training tasks grow, aggregated reward feedback becomes noisy and conflicting, destabilizing training and weakening generalization. We propose DIBS, a decoupled behavioral cloning approach that separates learning task-specific policies from learning the evolution function. We first learn individual teacher policies per task via standard RL, then fit the evolution function via behavioral cloning on teacher-labeled state-action pairs. This replaces noisy reward aggregation with dense, stable supervision. DIBS achieves significant improvements in both training stability and zero-shot generalization against existing RL and meta-RL algorithms.
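The two-stage recipe can be sketched on a toy problem. The assumptions are loud ones: each teacher is a linear policy with a made-up gain (in the paper teachers come from standard RL), and the evolution function is a single additive step `delta` applied once per task index. Neither is taken from DIBS; the point is only that the evolution function is fitted by regression on teacher-labeled state-action pairs rather than from aggregated rewards.

```python
def fit_evolution_step(theta_0, demos):
    """Least-squares behavioral cloning of one step size `delta`.

    The student policy for task n is a = (theta_0 + n * delta) * s, and
    `demos` is a list of (task_index, state, teacher_action) triples.
    """
    num = sum(n * s * (a - theta_0 * s) for n, s, a in demos)
    den = sum((n * s) ** 2 for n, s, a in demos)
    return num / den


# Hypothetical teachers: task n uses the linear policy a = teacher_gains[n] * s.
teacher_gains = {0: 1.0, 1: 1.5, 2: 2.0, 3: 2.5}
states = [-1.0, 0.5, 2.0]
demos = [(n, s, g * s) for n, g in teacher_gains.items() if n > 0 for s in states]

delta = fit_evolution_step(teacher_gains[0], demos)
zero_shot_gain = teacher_gains[0] + 4 * delta  # predicted policy for unseen task 4
```

Because the supervision is a dense regression target per state, adding more training tasks adds more consistent data instead of more conflicting reward signals, which is the stability argument the abstract makes.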

2606.00837 2026-06-02 cs.RO cs.LG

Coarse-to-Fine Compositional Diffusion for Long-Horizon Planning

粗到细的组合扩散用于长时域规划

Byoungwoo Park, Utkarsh A. Mishra, Jaemoo Choi, Juho Lee, Yongxin Chen

发表机构 * KAIST(韩国科学技术院) Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出Coarse-to-Fine Compositional Diffusion (CoFi)方法,通过先形成全局骨架再细化局部细节,在长时域机器人规划、全景图像生成和长视频生成中提升全局一致性和局部质量,同时减少2-8倍去噪评估次数。

Comments Project page: https://cofi-diffusion.github.io

详情
AI中文摘要

扩散模型为生成结构化数据提供了强先验,但许多任务需要输出超出这些模型通常训练规模的范围。组合生成通过将来自预训练短时域先验的重叠局部计划组合成长时域输出来解决这一问题。然而,标准组合主要强制相邻局部计划之间的一致性,产生局部一致性而不直接指定完整组合的全局结构。因此,局部兼容的计划仍可能形成不合理的路线、任务序列或时间演化。现有方法通过重复传播局部一致性信号或添加推理时优化来提高全局连贯性,但随着局部计划数量或维度的增加,这些过程变得昂贵。我们提出粗到细组合扩散(CoFi),一种推理时采样器,将全局结构形成与局部细节细化分离。CoFi首先将局部去噪估计围绕共享的粗结构对齐,产生捕获长程任务级排列的全局骨架。然后将该骨架扩散到中间噪声水平,并使用相同的预训练局部先验去噪,在保留骨架诱导的全局连贯性的同时恢复局部精细结构。在长时域机器人规划、全景图像生成和长视频生成中,CoFi不仅比先前的组合基线提高了全局连贯性和局部样本质量,而且需要2-8倍更少的去噪评估次数。

英文摘要

Diffusion models provide strong priors for generating structured data, but many tasks require outputs beyond the scale on which these models are typically trained. Compositional generation addresses this by composing overlapping local plans from a pretrained short-horizon prior into a long-horizon output. However, standard composition primarily enforces agreement between neighboring local plans, yielding local consistency without directly specifying the global structure of the full composition. As a result, locally compatible plans may still form an implausible route, task sequence, or temporal evolution. Existing methods improve global coherence by repeatedly propagating local consistency signals or by adding inference-time optimization, but these procedures become expensive as the number or dimensionality of local plans increases. We propose Coarse-to-Fine Compositional Diffusion (CoFi), an inference-time sampler that separates global structure formation from local detail refinement. CoFi first aligns local denoised estimates around a shared coarse structure, producing a global scaffold that captures the long-range task-level arrangement. It then diffuses this scaffold to an intermediate noise level and denoises it with the same pretrained local prior, restoring local fine structure while preserving the scaffold-induced global coherence. Across long-horizon robotic planning, panoramic image generation, and long video generation, CoFi not only improves both global coherence and local sample quality over prior compositional baselines, but also requires 2-8x fewer denoiser evaluations.

2606.00835 2026-06-02 cs.LG

Online Packet Scheduling with Deadlines and Learning

具有截止日期和学习的在线数据包调度

Gianmarco Genalti, Achraf Azize, Vianney Perchet

发表机构 * Politecnico di Milano(米兰理工大学) FairPlay Joint Team, CREST, ENSAE, IP Paris(FairPlay联合团队,CREST,ENSAE,IP巴黎)

AI总结 针对部分反馈下未知权重的在线数据包调度问题,通过连接睡眠强盗问题,提出算法实现α-遗憾最小化,并在不同松弛度下达到最优界。

详情
AI中文摘要

强制执行服务质量(QoS)保证的网络路由器必须在每个时钟周期决定传输哪个即将过期的数据包,即使数据包的值在处理之前是未知的。我们将此问题框架化为部分反馈下的在线数据包调度(OPSD)问题:数据包在每个时钟周期到达,具有不同的截止日期,但权重仅在执行后观察到。在未知权重的随机假设下,我们探索了具有强盗反馈的OPSD问题的不同变体。我们在我们的设置和睡眠强盗问题之间建立了联系,并将学习目标设定为α-遗憾最小化。我们提供了在不同松弛度下具有可证明α-遗憾保证的算法,区分了允许随机化的系统和不允许的系统。在每种情况下,我们的算法实现了$\widetilde{\mathcal{O}}\left(\sqrt{KT}\right)$的α-遗憾上界,与标准强盗设置的下界匹配。在实际相关的2-有界截止日期实例中,其中截止日期最多设置在到达后的一个时钟周期,我们的确定性算法实现了可证明的最紧竞争比。值得注意的是,当不同数据包类型数量$K\ge 2$有限时,有可能打破已建立的$\Phi=\frac{1+\sqrt{5}}{2}$竞争比障碍,并获得范围在$[\sqrt{2}, \Phi)$内的更紧竞争比$\theta_K$。

英文摘要

Network routers that enforce Quality-of-Service (QoS) guarantees must decide, at every clock cycle, which expiring packet of information to transmit, even when the value of the packet is unknown until it is processed. We frame this problem as the Online Packet Scheduling with Deadlines (OPSD) problem under Partial Feedback: packets arrive at every clock cycle, with different deadlines, but the weights are only observed after execution. Under a stochastic assumption on the unknown weights, we explore different variants of the OPSD problem with bandit feedback. We establish a connection between our setting and the sleeping bandits problem, and set our learning goal to $\alpha$-regret minimization. We provide algorithms with provable $\alpha$-regret guarantees under different spans of slackness, distinguishing systems allowing for randomization and systems that do not. In every scenario, our algorithms achieve an $\alpha$-regret upper bound of $\widetilde{\mathcal{O}}\left(\sqrt{KT}\right)$, matching the lower bound for the standard bandit setting. In the practically relevant case of $2$-bounded deadline instances, where the deadline is set at most one clock cycle away from the arrival, our deterministic algorithm achieves the provably tightest possible competitive ratio. Remarkably, when the number of distinct packet types $K\ge 2$ is finite, it is possible to break the well-established $\Phi=\frac{1+\sqrt{5}}{2}$ competitive ratio barrier and attain a tighter competitive ratio $\theta_K$ ranging in $[\sqrt{2}, \Phi)$.
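For scale, the two constants that bound the competitive ratio in the abstract above evaluate as follows. This is plain arithmetic on the stated interval, not an additional result from the paper.

```latex
\Phi = \frac{1+\sqrt{5}}{2} \approx 1.618, \qquad
\sqrt{2} \approx 1.414, \qquad
\text{so } \theta_K \in [\sqrt{2}, \Phi) \approx [1.414,\ 1.618).
```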

2606.00832 2026-06-02 cs.CL

Momento: Evaluating Persistent Memory and Reasoning with Multi-Session Agentic Conversations

Momento:评估多会话代理对话中的持久记忆与推理

Adril Putra Merin, David Anugraha, Ayu Purwarianti, Genta Indra Winata

发表机构 * Institut Teknologi Bandung(万隆技术大学) Stanford University(斯坦福大学) Capital One

AI总结 提出Momento基准,通过多会话服务环境评估代理在跨会话中利用持久记忆和推理完成个性化任务的能力,发现现有代理因误估用户状态而表现不佳。

Comments Preprint

详情
AI中文摘要

近期代理人工智能的进展使得代理能够通过工具使用、推理和多步规划完成复杂任务。然而,现有基准在单会话内评估代理,忽略了代理必须整合的过去行动、陈述偏好和先前决策,以实现个性化用户目标。我们引入了Momento,一个用于多会话服务环境中持久代理任务完成的基准,要求代理在跨会话中处理时间依赖和演变的用户目标,同时采取重要的、工具介导的行动。实验结果表明,当前代理主要因误估用户状态而失败,将会话历史视为当前上下文的可靠代理,而非需要重新验证的过时信息,凸显了当前代理能力与现实长期人机交互之间的巨大差距。

英文摘要

Recent advances in agentic AI have enabled agents to complete complex tasks through tool use, reasoning, and multi-step planning. Yet existing benchmarks evaluate agents within a single session, ignoring past actions, stated preferences, and prior decisions that agents must integrate to fulfill personalized user goals. We introduce Momento, a benchmark for persistent agentic task completion in multi-session service environments, requiring agents to take consequential, tool-mediated actions while resolving temporal dependencies and evolving user goals across sessions. Experimental results reveal that current agents fail primarily through misestimation of user state, treating prior session history as a reliable proxy for current context rather than stale information requiring re-validation, highlighting a substantial gap between current agent capabilities and realistic long-horizon human-agent interaction.

2606.00831 2026-06-02 cs.AI cs.LG

Subliminal Learning is a LoRA Artifact

潜意识学习是LoRA的伪影

Todd Nief, Harvey Yiyun Fu, Mark Muchane, Ari Holtzman

发表机构 * Department of Computer Science, University of Chicago(芝加哥大学计算机科学系) Data Science Institute, University of Chicago(芝加哥大学数据科学研究所)

AI总结 本文发现潜意识学习是LoRA微调产生的伪影,其传递行为与LoRA秩呈倒U型关系,且完全微调下消失,表明该现象依赖于微调和评估上下文。

详情
AI中文摘要

潜意识学习是一种现象,语言模型可以通过看似无害的数据将行为特征传递给其他模型(Cloud et al., 2025)。在潜意识学习中,具有行为特征(例如对猫的痴迷)的教师模型可以将这种猫痴迷传递给仅在教师生成的数字序列上微调的学生模型。在本文中,我们提出疑问:这种意想不到的行为传递是如何发生的?我们表明,潜意识学习是LoRA的伪影。当潜意识学习发生时,传递与LoRA秩呈倒U型关系;在完全微调下也会消失。我们表明,潜意识学习高度依赖于微调和评估期间看到的上下文。例如,在微调期间使用默认系统提示(“你是Qwen,由阿里云创建。你是一个有用的助手。”)的Qwen模型,在生成时如果没有包含系统提示,则不会表现出潜意识学习。我们进一步证明,潜意识行为局限于在微调和评估期间都看到的标记(例如模型的默认系统提示、标准聊天模板标记等)上的计算。总体而言,潜意识学习似乎是LoRA超参数和微调上下文的脆弱伪影,使其成为行为传递的不稳定渠道。

英文摘要

Subliminal learning is a phenomenon where language models can transmit behavioral traits to other models through seemingly innocuous data (Cloud et al., 2025). In subliminal learning, a teacher model with a behavioral trait (e.g. obsession with cats) can transmit this cat obsession to a student model finetuned only on numerical sequences generated by the teacher. In this paper, we ask: how does this unexpected behavioral transmission occur? We show that subliminal learning is a LoRA artifact. When subliminal learning occurs, transmission has an inverted U-shaped relationship with LoRA rank; it also disappears with full finetuning. We show that subliminal learning is highly dependent on the context seen during finetuning and evaluation. For example, a Qwen model with the default system prompt during finetuning ("You are Qwen, created by Alibaba Cloud. You are a helpful assistant.") does not show subliminal learning during generation when no system prompt is included. We further demonstrate that subliminal behavior is localized to computation at tokens seen during both finetuning and evaluation (e.g. the model's default system prompt, the standard chat template tokens, etc.). Overall, subliminal learning seems to be a fragile artifact of LoRA hyperparameters and finetuning context, making it an unstable channel for behavioral transmission.

2606.00829 2026-06-02 cs.CV

The Right Inference Strategy Is All You Need: Nearly Training-Free Domain-Wise Inference for EgoCross Challenge

正确的推理策略即是一切:面向EgoCross挑战的近乎无需训练的领域感知推理

Leyi Wu, Yifan Zhao, Jinjie Zhang, Yinchuan Li, Ying-Cong Chen

发表机构 * HKUST(GZ)(香港科技大学(广州)) HKUST(香港科技大学) Knowin

AI总结 针对EgoCross挑战中源受限场景下多模态大模型在领域偏移严重的自我中心视频问答任务上表现不佳的问题,提出一种领域感知推理策略,通过为四个目标领域分别设计不同的输入、提示和答案映射流程,在不进行额外训练的情况下显著提升基线模型性能。

详情
AI中文摘要

EgoCross评估多模态大语言模型在显著领域偏移下的自我中心视频问答,其中测试视频来自手术、工业装配、极限运动和动物佩戴相机,而非日常场景。在源受限赛道中,基础模型固定为Qwen3-VL-4B,而官方任务特定支持集仅包含20个训练样本。这一设置使得挑战更侧重于向受限模型暴露正确的视觉、时序和答案选择线索,而非模型规模。我们的关键观察是,冻结的基线模型并非完全无法处理这些罕见场景;相反,它往往缺乏合适的接口来将其现有的视觉-语言知识迁移到新任务格式。因此,我们采用领域感知推理策略,将四个目标领域分开处理,并根据每个领域的任务特点设计不同的输入、提示和答案映射流程。这些策略通过强调每个领域重要的线索,使罕见自我中心场景对VLM更具可解释性。最终系统几乎无需训练:手术和动物问题使用基础Qwen3-VL-4B模型回答,而极限运动和工业问题仅使用在提供的20个训练样本上训练两个epoch的官方SFT检查点。在最终评估中,这一简单策略达到了66.98%的整体准确率,表明精心设计的领域感知推理可以弥补基础模型能力的不足,并恢复基线模型中已存在的大部分能力。

英文摘要

EgoCross evaluates multimodal large language models on egocentric video question answering under substantial domain shift, where test videos come from surgery, industrial assembly, extreme sports, and animal-mounted cameras rather than ordinary daily-life scenes. In the source-limited track, the base model is fixed to Qwen3-VL-4B, while the official task-specific support set contains only 20 training samples. This setting makes the challenge less about model scaling and more about exposing the right visual, temporal, and answer-selection cues to a constrained model. Our key observation is that the frozen baseline model is not simply incapable of these rare scenarios; rather, it often fails to transfer its existing visual-language knowledge to the new task format without an appropriate interface. We therefore use a domain-wise inference strategy that treats the four target domains separately and designs different input, prompting, and answer-mapping procedures according to each domain's task characteristics. These strategies make the rare egocentric scenes more interpretable to the VLM by emphasizing the cues that matter for each domain. The resulting system is nearly training-free: surgery and animal questions are answered with the base Qwen3-VL-4B model, while XSports and industry use only the official SFT checkpoint trained for two epochs on the provided 20 training samples. On the final evaluation, this simple strategy reaches 66.98% overall accuracy, suggesting that careful domain-aware inference can compensate for limited base-model strength and recover much of the ability already present in the baseline model.

2606.00828 2026-06-02 cs.CV

RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes

RoboStressBench: 在具身场景中基准测试VLM对物理视觉压力的鲁棒性

Leyi Wu, Yifan Zhao, Jinjie Zhang, Suzeyu Chen, Wosong Chen, Zhifei Chen, Tianshuo Xu, Qingchun He, Hongxin Hu, Haojian Huang, Yangkai Wei, Wenqian Li, Yinchuan Li, Ying-Cong Chen

发表机构 * HKUST(GZ)(香港科技大学(广州))

AI总结 本文提出RoboStressBench,从逆图形学角度将视觉压力分解为材质、视角、光照和几何四个物理维度,系统评估VLM在真实物理压力下的鲁棒性,并引入压力感知求解器提升高压力场景下的性能。

详情
AI中文摘要

视觉语言模型(VLM)展现出强大的视觉理解能力,并越来越多地部署在具身AI系统中,这些系统需要在真实条件下进行可靠的感知。然而,现有的基准测试使用干净图像或孤立扰动来评估VLM,而非由物理场景形成引起的压力。这种设计有两个局限性:它仅覆盖了日常视觉压力的一小部分子集,并且某些扰动在现实具身场景中很少出现。这一差距引发了一个基本问题:我们如何以一种原则性的方式定义视觉压力,以捕捉物理环境中遇到的各种因素?为了解决这个问题,我们从逆图形学角度构建视觉感知,并引入RoboStressBench,这是一个用于评估VLM在具身场景中对物理视觉压力鲁棒性的基准测试。受物理渲染方程的启发,RoboStressBench将视觉压力分解为四个物理基础维度:材质(M)、视角(V)、光照(L)和几何(G)。这种设计使RoboStressBench能够覆盖现实世界环境中的广泛视觉压力,同时允许对其在VLM能力(如视觉识别、推理和规划)上的影响进行受控分析。通过对最先进的VLM进行全面评估,我们识别出特定于压力的失败模式,并揭示了不同的物理因素会降低不同的具身能力,而这些往往被总体准确率所掩盖。我们进一步引入了一种压力感知的智能求解器,它在推理前检测视觉压力源并调用视觉编辑技能,从而提高了高压力场景下的鲁棒性。总体而言,RoboStressBench提供了一个原则性的评估框架,用于诊断和改进VLM在真实物理压力下的感知能力,支持开发更可靠的具身AI系统。

英文摘要

Vision-Language Models (VLMs) have shown strong visual understanding and are increasingly deployed in embodied AI systems, where reliable perception under real conditions is essential. However, existing benchmarks assess VLMs using clean images or isolated perturbations rather than stresses caused by physical scene formation. This design has two limitations: it covers only a narrow subset of everyday visual stresses, and some perturbations rarely appear in realistic embodied scenes. This gap raises a fundamental question: how can we define visual stress in a principled way that captures the diverse factors encountered in physical environments? To address this question, we formulate visual perception from an inverse graphics perspective and introduce RoboStressBench, a benchmark for evaluating VLM robustness to physical visual stress in embodied scenes. Inspired by the physical rendering equation, RoboStressBench decomposes visual stress into four physically grounded dimensions: Material (M), Viewpoint (V), Lighting (L), and Geometry (G). This design enables RoboStressBench to cover a broad range of visual stresses in real-world environments, while allowing controlled analysis of their effects on VLM capabilities such as visual recognition, reasoning, and planning. Through comprehensive evaluations of state-of-the-art VLMs, we identify stress-specific failure modes and reveal that different physical factors degrade different embodied capabilities, which are often obscured by aggregate accuracy. We further introduce a stress-aware agentic solver that detects visual stressors and invokes visual-editing skills before reasoning, improving robustness in high-stress scenarios. Overall, RoboStressBench provides a principled evaluation framework for diagnosing and improving VLM perception under real-world physical stress, supporting the development of more reliable embodied AI systems.

2606.00826 2026-06-02 cs.LG

Partial Fairness Awareness: Belief-Guided Strategic Mechanism for Strategic Agents

部分公平意识:面向策略代理的信念引导策略机制

Xinpeng Lv, Chunyuan Zheng, Yunxin Mao, Renzhe Xu, Hao Zou, Shanzhi Gu, Liyang Xu, Huan Chen, Yuanlong Chen, Wenjing Yang, Haotian Wang

发表机构 * National University of Defense Technology, Changsha, China(国防科技大学) Peking University, Beijing, China(北京大学) Shanghai University of Finance and Economics, Shanghai, China(上海财经大学) ZGC Laboratory, Beijing, China(ZGC实验室) Faculty of Computing, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机学院)

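AI summary To address the fairness-exposure dilemma in strategic classification, the paper poses the partial fairness awareness (PFA) problem: the system releases a candidate set of fairness constraints while concealing the true one, and a belief-guided mechanism aligns agents' beliefs with the system's fairness constraint. Experiments show that PFA outperforms fully public or fully private fairness regimes in reducing group fairness gaps, raising the acceptance rate of qualified individuals, and stabilizing outcomes.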

Comments Accepted by AAAI2026

English abstract

Strategic machine learning studies settings in which agents manipulate their features to obtain favorable decisions from predictive models. To address the fairness concerns intrinsic to strategic classification, recent work has introduced group-specific fairness constraints. However, current fairness-aware approaches face a fundamental dilemma over fairness exposure: making these constraints public enables strategic manipulation and can lead to fairness reversal, while keeping them hidden may reduce social welfare and discourage genuine improvement. To resolve this dilemma, we propose the problem of partial fairness awareness (PFA), motivated by our theoretical analysis, which shows that the dilemma can be mitigated by releasing a candidate set of fairness constraints while concealing the grounding constraint, that is, the one actually in force. Specifically, we introduce a belief-guided strategic mechanism in which agents iteratively interact with the decision system and maintain a belief distribution over the candidate set. Through this interaction and feedback, agents update their beliefs and gradually align them with the grounding fairness constraint employed by the system. Extensive experiments on real-world and synthetic datasets demonstrate that PFA achieves lower group fairness gaps, higher acceptance of truly qualified individuals, and more stable outcomes than fully public or fully private fairness regimes.
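
The idea of an agent maintaining and updating a belief distribution over a released candidate set can be sketched as follows. This is a minimal illustration, not the paper's mechanism: the abstract does not state the update rule, so the sketch assumes plain Bayes' rule, models each candidate constraint as a hypothetical acceptance threshold, and uses an assumed logistic noise model for the likelihood. All names and numbers are illustrative.

```python
import math

def likelihood(accepted, score, threshold, noise=0.1):
    """Assumed likelihood of the observed decision if `threshold` were the true constraint."""
    p_accept = 1.0 / (1.0 + math.exp(-(score - threshold) / noise))
    return p_accept if accepted else 1.0 - p_accept

def update_belief(belief, candidates, score, accepted):
    """One Bayes step over the candidate set; returns a normalized posterior."""
    unnormalized = [b * likelihood(accepted, score, c)
                    for b, c in zip(belief, candidates)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

candidates = [0.4, 0.5, 0.6]  # hypothetical candidate thresholds released by the system
true_threshold = 0.5          # the concealed constraint actually in force (assumption)
belief = [1 / 3] * 3          # the agent starts from a uniform prior

# Simulated interaction: the agent submits scores and observes accept/reject feedback.
for score in [0.45, 0.55, 0.47, 0.53, 0.42, 0.58]:
    accepted = score >= true_threshold
    belief = update_belief(belief, candidates, score, accepted)

best_guess = candidates[belief.index(max(belief))]
print(belief, best_guess)
```

Under these assumptions the posterior mass shifts toward the candidate that matches the concealed threshold as feedback accumulates, which mirrors the gradual alignment described in the abstract.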

2606.00825 2026-06-02 cs.CV cs.ET cs.HC cs.MA

SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory

Samiul Alam, Shakhrul Iman Siam, Michael J. Proulx, James Fort, Richard Newcombe, Hyo Jin Kim, Mi Zhang

Affiliations The Ohio State University; Meta Project

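AI summary Introduces the SuperMemory-VQA dataset, comprising 52.9 hours of everyday activities recorded with AI glasses and 4,853 multiple-choice question-answer pairs, for evaluating AI assistants on long-horizon memory tasks; existing systems are found to be insufficiently reliable.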

Comments 34 pages, 21 figures, 5 tables

English abstract

AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genuinely useful, such systems must move beyond short-term video comprehension and address the memory gaps that people experience in practical, personal, and social situations across longitudinal egocentric video streams. However, existing egocentric datasets predominantly focus on action recognition or generic question answering over short clips, measuring perceptual capabilities rather than realistic human memory needs. We introduce SuperMemory-VQA, an egocentric visual question answering (VQA) dataset for evaluating AI assistants on practical, long-horizon memory tasks. It contains 52.9 hours of everyday activities recorded with AI glasses, including synchronized RGB video, audio transcription, eye gaze, IMU, and SLAM trajectories. Through a human-verified annotation pipeline, we construct 4,853 grounded question-answer pairs that span object and location memory, intent recall, visual scene recall, timeline reconstruction, conversational memory, and in-context retrieval. Each question is posed as multiple-choice with an explicit "unanswerable" option to test hallucination robustness. Benchmarking leading agentic frameworks and LLM backbones reveals that existing systems remain far from reliable on real-world memory tasks, highlighting the need for new architectures for grounded AI memory that answer only when evidence is sufficient. A participant survey further supports that our questions are realistic, useful, and aligned with everyday memory needs.
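
One way to score multiple-choice answers that include an explicit "unanswerable" option is sketched below. The abstract does not specify the metrics, so this is an assumption: it reports accuracy on answerable questions separately from the rate at which a system picks a concrete option when the gold answer is "unanswerable". The option label `"E"` and the record format are hypothetical.

```python
UNANSWERABLE = "E"  # hypothetical label for the "unanswerable" option

def score(records):
    """records: list of (predicted_option, gold_option) pairs."""
    answerable = [(p, g) for p, g in records if g != UNANSWERABLE]
    unanswerable = [(p, g) for p, g in records if g == UNANSWERABLE]

    # Accuracy on questions that do have an answer in the recording.
    accuracy = (sum(p == g for p, g in answerable) / len(answerable)
                if answerable else None)
    # Share of unanswerable questions where the system still chose a concrete option.
    hallucination_rate = (sum(p != UNANSWERABLE for p, _ in unanswerable) / len(unanswerable)
                          if unanswerable else None)
    return {"answerable_accuracy": accuracy,
            "hallucination_rate": hallucination_rate}

# Made-up predictions for illustration only.
records = [("A", "A"), ("B", "C"), ("E", "E"), ("D", "E")]
metrics = score(records)
print(metrics)
```

Keeping the two numbers separate avoids rewarding a system that answers confidently on every question regardless of evidence.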