arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.24003 2026-06-17 cs.CV cs.AI stat.AP (updated version)

Remote sensing data imputation using deep learning for multispectral imagery

Shuang Liu, Fiona Johnson, Rohitash Chandra

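Affiliations: Water Research Centre, University of New South Wales; ARC ITTC Data Analytics for Resources and Environments, University of New South Wales; Transitional Artificial Intelligence Research Group, School of Mathematics and Statistics, University of New South Wales

AI summary: To address missing optical satellite data caused by cloud cover, this study compares linear interpolation with several deep learning models (CNN, Inception Resnet, Autoencoder, and their combinations with LSTM) for reconstructing missing spectral bands across four lakes with historical records of algal blooms. The deep learning models substantially outperform the baseline, with CNN performing best, and algal bloom indices derived from the imputed imagery agree well with the observed data.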

Abstract

Remote sensing techniques have been increasingly utilised in aquatic applications in recent years. A common challenge in using optical satellite data is the presence of missing observations due to cloud cover. These data gaps can lead to missed detection of critical events, such as algal blooms, in lakes of high interest to water authorities. As a result, enhancing the completeness of optical satellite datasets is crucial for improving the monitoring and prediction of algal blooms. In this study, we compared a traditional data imputation method (i.e., linear interpolation) with deep learning models for reconstructing missing spectral bands across four lakes with historical records of algal blooms. The deep learning models adopted include CNN-based architectures (i.e., CNN, Inception Resnet, and Autoencoder) and CNN-LSTM-based architectures (i.e., CNN-LSTM, Resnet-LSTM, and Autoencoder-LSTM). Our results demonstrated that deep learning models substantially outperformed the baseline linear interpolation method in imputing spectral band values within artificially masked regions. Among these models, CNN delivered the best performance across most lakes. Furthermore, we evaluated the performance of algal bloom indices (i.e., Green/Red and NDCI) derived from the imputed imagery by comparing them with the observed data. Our results demonstrate that deep learning models are effective for imputing missing data in PlanetScope SuperDove imagery, enabling more reliable applications in water monitoring.
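The two algal bloom indices and the interpolation baseline named in the abstract can be illustrated with a minimal sketch. NDCI is used here in its commonly cited form, (RedEdge - Red) / (RedEdge + Red); the function names, the per-pixel time-series layout, and the choice to interpolate along time (with edge gaps copying the nearest valid value) are assumptions for illustration, since the abstract does not specify how the baseline was set up.

```python
def ndci(red_edge, red):
    """Normalised Difference Chlorophyll Index: (RE - R) / (RE + R)."""
    denom = red_edge + red
    return (red_edge - red) / denom if denom != 0 else float("nan")


def green_red_ratio(green, red):
    """Simple Green/Red band ratio."""
    return green / red if red != 0 else float("nan")


def linear_interpolate(series):
    """Fill None gaps in one pixel's time series.

    Interior gaps are filled on the straight line between the nearest
    valid neighbours; leading/trailing gaps copy the nearest valid value
    (an assumption of this sketch, not taken from the paper).
    """
    filled = list(series)
    valid = [i for i, v in enumerate(series) if v is not None]
    if not valid:
        return filled  # nothing to interpolate from
    for i, v in enumerate(series):
        if v is not None:
            continue
        prev = max((j for j in valid if j < i), default=None)
        nxt = min((j for j in valid if j > i), default=None)
        if prev is None:
            filled[i] = series[nxt]
        elif nxt is None:
            filled[i] = series[prev]
        else:
            w = (i - prev) / (nxt - prev)
            filled[i] = series[prev] + w * (series[nxt] - series[prev])
    return filled
```

In this setup, the indices would be computed from the imputed Red, Green, and Red Edge reflectances and compared against the same indices computed from cloud-free observations.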

2602.10635 2026-06-17 cs.AI cs.LG (updated version)

OmniSapiens: A Foundation Model for Social Behavior Processing via Heterogeneity-Aware Relative Policy Optimization

Keane Ong, Sabri Boughorbel, Luwei Xiao, Chanakya Ekbote, Wei Dai, Ao Qu, Jingyao Wu, Rui Mao, Ehsan Hoque, Erik Cambria, Gianmarco Mengaldo, Paul Pu Liang

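Affiliations: Massachusetts Institute of Technology; National University of Singapore; Nanyang Technological University; Prince Sattam bin Abdulaziz University; University of Rochester

AI summary: To address the training imbalance caused by heterogeneous behavioral data, the paper proposes the Omnisapiens-7B 2.0 foundation model, trained with Heterogeneity-Aware Relative Policy Optimization (HARPO), which achieves the best performance on 10 behavioral tasks and 5 zero-shot generalization benchmarks.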

Comments Accepted to ICML 2026 Main Conference

Abstract

Socially intelligent AI systems must reason across diverse human behavioral tasks and generalize to new social contexts. However, behavioral data is inherently heterogeneous, comprising diverse modalities and prediction targets that produce uneven training signals across samples, creating imbalanced learning dynamics that challenge existing AI models. To address this, we develop Omnisapiens-7B 2.0, a foundation model for social behavior processing that explicitly addresses learning from heterogeneous behavioral data. This is enabled through Heterogeneity-Aware Relative Policy Optimization, a new RL method that rebalances learning signals across samples by approximating each sample's contribution to the policy update and using these estimates to drive geometrically centered, inertially smoothed advantage modulation for stable training. Omnisapiens-7B 2.0 achieves the best and most consistent performance across 10 behavioral tasks, while also attaining the best performance on all five held-out benchmarks, with gains of up to +12.02% and +9.37% respectively. Furthermore, it demonstrates more consistent and interpretable reasoning traces, supporting reliable real-world behavioral applications. Our model is available at https://github.com/MIT-MI/human_behavior_atlas.

2605.23176 2026-06-17 cs.CV (updated version)

DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving

Hao Vo, Khoa Vo, Phu Loc Nguyen, Sieu Tran, Duc Minh Nguyen, Ngo Xuan Cuong, Gladys Gawugah, Sreevenkata Anjani Tishita Godavarthi, Chase Rainwater, Nghi D. Q. Bui, Anh Nguyen, Duy Minh Ho Nguyen, Ngan Le

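Affiliations: University of Arkansas, USA; Google Research, Google; University of Liverpool, UK; Max Planck Research School for Intelligent Systems

AI summary: Proposes the DriveSpatial benchmark, which uses multi-view spatiotemporal reasoning tasks to evaluate vision-language models on scene construction, relational understanding, temporal reasoning, and generalization in autonomous driving, and finds a substantial gap between humans and models.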

Abstract

Spatiotemporal intelligence in autonomous driving (AD) requires an agent to integrate multi-view observations into a coherent scene representation, maintain object continuity across viewpoints and time, and reason about spatial relations, interactions, and future dynamics. However, existing AD vision-language benchmarks largely focus on single-view, static, ego-centric, or single-source question answering, leaving it unclear whether current Vision-Language Models (VLMs) can truly construct and reason over dynamic driving scenes. We introduce DriveSpatial, a benchmark of 15.6K human-verified QA pairs across 20 tasks from five large-scale AD datasets. DriveSpatial evaluates four abilities: Cognitive Scene Construction, Multi-view Relational Understanding, Temporal Reasoning, and Generalization. Unlike prior benchmarks, DriveSpatial is generated from a dynamic multi-relational scene graph that encodes object states, spatial relations, interactions, camera visibility, and temporal correspondences, enabling QA pairs that enforce genuine cross-view and spatiotemporal reasoning. Evaluating 15 representative VLMs reveals a substantial human-model gap: the strongest model trails humans by 28.4 points, with Cognitive Scene Construction emerging as the key bottleneck. Further diagnostics show that language-only prompting is insufficient, while explicit BEV grounding consistently improves performance. These results suggest that current VLMs lack the scene-construction ability needed for reliable spatiotemporal driving intelligence. DriveSpatial and its construction pipeline will be released to support future research.

2605.21135 2026-06-17 cs.CL (updated version)

Smarter edits? Post-editing with error highlights and translation suggestions

Fleur V. J. van Tellingen, Gautam Ranka, Dora Žugčić, Joyce van der Wal, Andrea Camasta, Livio Guerra, Alina Karakanta

Affiliations: Leiden University Centre for Linguistics; Visvesvaraya National Institute of Technology; Department of Bionanoscience, Faculty of Applied Sciences, Delft University of Technology; Pedagogical Sciences, Leiden University; Faculty of Science, Leiden University

AI summary: This paper studies the effectiveness of error highlights and correction suggestions based on automatic post-editing (APE) in post-editing tasks. Neither improved productivity or quality, but APE highlights and correction suggestions improved the user experience.

Comments Accepted at EAMT 2026

Abstract

As MT quality increases, interest in enhanced post-editing features such as QE-derived error highlights is growing, yet evidence for their usefulness remains limited. In this work, we explore the usefulness of LLM-derived error highlights and correction suggestions based on automatic post-editing (APE). We conduct a study where professional translators (En-Nl) post-edit translations using APE error highlights and correction suggestions and compare productivity, quality and user experience to regular PE and PE with QE-derived highlights. While no condition yielded productivity or quality gains compared to regular PE, APE highlights were better received than QE-derived highlights, and correction suggestions improved overall user experience.

2605.20708 2026-06-17 cs.CV cs.AI (updated version)

Rethinking Cross-Layer Information Routing in Diffusion Transformers

Chao Xu, Maohua Li, Qirui Li, Yixuan Xu, Yanke Zhou, Yunhe Li, Cuifeng Shen, Hanlin Tang, Kan Liu, Tao Lan, Lin Qu, Shao-Qun Zhang

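Affiliations: Nanjing University; Alibaba Group; Zhejiang University; City University of Hong Kong

AI summary: This paper studies cross-layer information flow in diffusion transformers. Through a systematic empirical analysis it identifies three concrete symptoms of traditional residual addition and proposes Diffusion-Adaptive Routing (DAR), which performs learnable, timestep-adaptive, non-incremental aggregation of sublayer outputs to improve model performance.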

Abstract

Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design (tokenization, attention, conditioning, objectives, and latent autoencoders) has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been directly inherited from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition, namely monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs. Moreover, the proposed DAR is compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet 256×256, DAR improves SiT-XL/2 by 2.11 FID (7.56 vs. 9.67) and matches the baseline's converged quality with 8.75× fewer training iterations. Stacked on top of REPA, it yields a 2× training acceleration in the early stage, suggesting cross-layer information routing as an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, DAR can also be applied during the fine-tuning stage of large-scale T2I models and preserves high-frequency details during Distribution Matching Distillation.

2510.21583 2026-06-17 cs.CV cs.AI (updated version)

Principled RL for Flow Matching Emerges from the Chunk-level Policy Optimization

Yifu Luo, Haoyuan Sun, Xinhao Hu, Penghui Du, Keyu Fan, Bo Li, Sinan Du, Xu Wan, Zhiyu Chen, Bo Xia, Yongzhe Chang, Changqian Yu, Kun Gai, Tiantian Zhang, Xueqian Wang

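AI summary: This paper proposes GCPO, a reinforcement learning method for flow matching based on chunk-level policy optimization. By aggregating consecutive steps into coherent chunks and moving policy optimization to the chunk level, it mitigates inaccurate advantage attribution; experiments show that it outperforms existing methods on text-to-image generation.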

Comments ICML 2026

Abstract

Recent progress in post-training flow matching for text-to-image (T2I) generation with Group Relative Policy Optimization (GRPO) has demonstrated strong potential. However, it is hindered by a critical limitation: inaccurate advantage attribution. In this work, we argue that aggregating consecutive steps into a coherent 'chunk' and shifting the policy optimization paradigm from GRPO's step level to the chunk level can effectively mitigate the negative impact of this issue. Building on this insight, we propose Group Chunking Policy Optimization (GCPO), the first chunk-level reinforcement learning approach for post-training flow matching. Extensive experiments demonstrate that GCPO achieves superior performance on both standard T2I benchmarks and preference alignment, with up to 43% relative gains over GRPO, highlighting the promise of chunk-level policy optimization. The code is available at https://github.com/xingzhejun/GCPO.
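To make the step-level versus chunk-level distinction concrete, here is a minimal sketch under stated assumptions. The group-relative advantage follows the usual GRPO normalisation (reward minus group mean, divided by group standard deviation). The fixed chunk size and the idea of summing per-step log-probability ratios within a chunk are illustrative guesses; the abstract does not say how GCPO forms chunks or chunk-level ratios, and all function names are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one prompt's group of samples (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ here
    return [(r - mu) / (sigma + eps) for r in rewards]


def chunk_steps(num_steps, chunk_size):
    """Partition step indices 0..num_steps-1 into consecutive chunks."""
    return [
        list(range(start, min(start + chunk_size, num_steps)))
        for start in range(0, num_steps, chunk_size)
    ]


def chunk_log_ratios(step_log_ratios, chunk_size):
    """Assumed aggregation: one log-ratio per chunk, summed over its steps."""
    chunks = chunk_steps(len(step_log_ratios), chunk_size)
    return [sum(step_log_ratios[i] for i in chunk) for chunk in chunks]
```

With this layout, a single sample-level advantage would be paired with a handful of chunk-level ratios rather than one ratio per denoising step, which is the granularity shift the abstract describes.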

2605.15980 2026-06-17 cs.CV (updated version)

Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization

Xiaoxuan He, Siming Fu, Zeyue Xue, Weijie Wang, Ruizhe He, Yuming Li, Dacheng Yin, Shuai Dong, Haoyang Huang, Hongfa Wang, Nan Duan, Bohan Zhuang

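Affiliations: Zhejiang University; Joy Future Academy; Independent Researcher; Tsinghua University

AI summary: Proposes Flash-GRPO, a single-step training framework that addresses the computational bottleneck through iso-temporal grouping and temporal gradient rectification, achieving better alignment quality and training efficiency than full-trajectory training under low compute budgets.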

Abstract

Group Relative Policy Optimization has emerged as essential for aligning video diffusion models with human preferences, but it faces a critical computational bottleneck: training a 14B-parameter model typically demands hundreds of GPU days per experiment. Existing efficiency methods reduce costs by subsampling training timesteps with a sliding window, but this fundamentally compromises optimization, exhibiting severe instability and failing to reach full-trajectory performance. We present Flash-GRPO, a single-step training framework that outperforms full-trajectory training in alignment quality under low computational budgets while substantially improving training efficiency. Flash-GRPO addresses two critical challenges: iso-temporal grouping eliminates timestep-confounded variance by enforcing prompt-wise temporal consistency, decoupling policy performance from timestep difficulty; temporal gradient rectification neutralizes the time-dependent scaling factor that causes vastly inconsistent gradient magnitudes across timesteps. Experiments on 1.3B to 14B parameter models validate Flash-GRPO's effectiveness, demonstrating substantial training acceleration with consistent stability and state-of-the-art alignment quality.

2506.13127 2026-06-17 cs.SD eess.AS (updated version)

Leveraging Local and Global Knowledge Integration with Time-Frequency Calibrated Distillation for Speech Enhancement

Jiaming Cheng, Ruiyu Liang, Ye Ni, Chao Xu, Jing Li, Wei Zhou, Rui Liu, Björn W. Schuller, Xiaoshuai Hao

Affiliations: School of Computer Science, Nanjing Audit University; School of Communication Engineering, Nanjing Institute of Technology; School of Information Science and Engineering, Southeast University; Cardiff University; Inner Mongolia University; CHI – the Chair of Health Informatics, TUM University Hospital; GLAM – the Group on Language, Audio, & Music, Imperial College London; Xiaomi EV

AI summary: This paper proposes a fusion framework that improves speech enhancement through time-frequency calibrated knowledge distillation, combining local information focusing with global knowledge circulation to improve the performance of a low-complexity student model.

Comments submitted to IEEE Transactions on Cognitive and Developmental Systems

Abstract

In this paper, we propose an intra-set and inter-set recursive fusion framework with time-frequency calibrated knowledge distillation (I$^2$SRF-TFCKD) for speech enhancement (SE). Unlike previous distillation strategies for SE, the proposed framework fully exploits the time-frequency differential information of speech while facilitating both local information focusing and global knowledge circulation. First, we construct a collaborative distillation paradigm for intra-set and inter-set correlations. Within a correlated set, multi-layer teacher-student features are pairwise matched for calibrated distillation. Subsequently, we generate representative features from each correlated set through recursive fusion to form the fused feature set that enables inter-set knowledge interaction. Second, we propose a multi-layer interactive distillation based on dual-stream time-frequency cross-calibration, which calculates the teacher-student similarity calibration weights in the time and frequency domains respectively and performs cross-weighting, thus enabling refined allocation of distillation contributions across different layers according to speech characteristics. The proposed distillation strategy is applied to the dual-path dilated convolutional recurrent network (DPDCRN) that ranked first in the SE track of the L3DAS23 challenge. To evaluate the effectiveness of I$^2$SRF-TFCKD, we conduct experiments on both single-channel and multi-channel SE datasets. Objective evaluations demonstrate that the proposed knowledge distillation (KD) strategy consistently and effectively improves the performance of the low-complexity student model and outperforms other distillation schemes.

2605.12646 2026-06-17 cs.LG cs.AI cs.HC (updated version)

Learning to Decide with AI Assistance under Human-Alignment

Nina Corvelo Benz, Eleni Straitouri, Manuel Gomez-Rodriguez

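AI summary: This paper studies how AI can help decision-makers in high-stakes domains by predicting outcomes, and examines how the degree of alignment between the AI's confidence and the decision-makers' own confidence affects the complexity of learning to decide.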

Abstract

It is widely agreed that when AI models assist decision-makers in high-stakes domains by predicting an outcome of interest, they should communicate the confidence of their predictions. However, empirical evidence suggests that decision-makers often struggle to determine when to trust a prediction based solely on this communicated confidence. In this context, recent theoretical and empirical work suggests a positive correlation between the utility of AI-assisted decision-making and the degree of alignment between the AI confidence and the decision-makers' confidence in their own predictions. Crucially, these findings do not yet elucidate the extent to which this alignment influences the complexity of learning to make optimal decisions through repeated interactions. In this paper, we address this question in the canonical case of binary predictions and binary decisions. We first show that this problem is equivalent to a two-armed online contextual learning problem with full feedback, and establish a lower bound of $\Omega(\sqrt{|H| \cdot |B| \cdot T})$ on the expected regret any learner can attain, where $H$ and $B$ denote the sets of human and AI confidence values. We then demonstrate that, under perfect alignment between AI and human confidence, a learner can attain an expected regret of $O(\sqrt{|H| \cdot T\log T})$ and, when $\sqrt{|H|} = O(\log T)$ and $B$ is countable, a non-trivial generalization of the Dvoretzky-Kiefer-Wolfowitz inequality improves the regret bound to $O(\sqrt{T\log T})$. Taken together, these results reveal that alignment can reduce the complexity of learning to make decisions with AI assistance. Experiments on real data from two different human-subject studies where participants solve simple decision-making tasks assisted by AI models show that our theoretical results are robust to violations of perfect alignment.
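For readers unfamiliar with the inequality mentioned above, its classical form (with the tight constant 2) is reproduced below. The abstract does not state the generalization the paper uses, so this is only the textbook statement; the symbols $n$, $F$, $F_n$, $\varepsilon$, and $\delta$ are notation introduced here, with $F_n$ the empirical distribution function of $n$ i.i.d. samples drawn from $F$.

```latex
\[
\Pr\!\left( \sup_{x \in \mathbb{R}} \bigl| F_n(x) - F(x) \bigr| > \varepsilon \right)
\le 2 e^{-2 n \varepsilon^{2}}
\qquad \text{for every } \varepsilon > 0,
\]
\[
\text{so with probability at least } 1 - \delta:\quad
\sup_{x} \bigl| F_n(x) - F(x) \bigr| \le \sqrt{\frac{\ln(2/\delta)}{2n}} .
\]
```

The second line follows by setting $2 e^{-2 n \varepsilon^{2}} = \delta$ and solving for $\varepsilon$. Because the bound is uniform in $x$, it does not degrade with the number of points at which the distribution function is evaluated; that this is what lets the dependence on $|B|$ drop out of the regret bound is a plausible reading, not something the abstract states.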

2605.12227 2026-06-17 cs.CL (updated version)

A Recipe for Long-Context Reasoning in Large Language Models via On-Policy Optimization and Distillation

Miguel Moura Ramos, Duarte M. Alves, André F. T. Martins

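Affiliations: Instituto Superior Técnico, Universidade de Lisboa; Instituto de Telecomunicações; TransPerfect

AI summary: This paper proposes dGRPO, which combines on-policy optimization with distillation: a strong teacher model provides dense guidance that improves long-context reasoning while preserving short-context capabilities.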

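AI-generated Chinese abstract (translated)

Adapting large language models (LLMs) to long-context tasks requires post-training methods that preserve accuracy and coherence. Existing methods have limitations: (1) offline methods such as supervised fine-tuning (SFT) and knowledge distillation (KD) suffer from exposure bias and struggle to recover from model-generated errors; (2) on-policy reinforcement learning methods such as Group Relative Policy Optimization (GRPO) better match model-generated states, but sparse rewards make them unstable and sample-inefficient; (3) on-policy distillation (OPD) provides dense token-level guidance but does not directly optimize arbitrary reward signals. This paper proposes Distilled Group Relative Policy Optimization (dGRPO), which strengthens GRPO with dense guidance obtained through OPD from a stronger teacher model. We also introduce LongBlocks, a synthetic long-context dataset covering multi-hop reasoning, contextual grounding, and long-form generation. We conduct extensive experiments and ablation studies comparing offline training, sparse-reward GRPO, and our combined method, and derive an improved recipe for long-context alignment. Overall, our results show that combining outcome-based policy optimization with knowledge distillation in a single objective provides a more stable and effective approach to long-context reasoning while preserving short-context capabilities.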

Abstract

Existing approaches to post-train models for long-context tasks face complementary limitations: (i) supervised fine-tuning (SFT) provides stable supervision but suffers from exposure bias; (ii) reinforcement learning methods such as Group Relative Policy Optimization (GRPO) train on model-generated trajectories but struggle with long-horizon credit assignment and sparse rewards; and (iii) on-policy distillation (OPD) provides dense token-level guidance but does not directly optimize task rewards. We study these complementary strategies for long-context alignment and derive a recipe that combines GRPO with OPD-style teacher guidance: the student learns from its own rollouts using outcome-level rewards, while a stronger teacher provides dense token-level regularization in place of the standard reference policy. This is especially useful when process-level supervision is difficult to obtain. To support this study, we introduce LongBlocks, a synthetic multilingual dataset spanning multi-hop reasoning, contextual grounding, and long-form generation. Through controlled ablations, we isolate the roles of cold-start initialization, teacher anchoring, and data mixing, showing that our recipe yields a more stable and effective path to long-context reasoning than GRPO or OPD while preserving short-context capabilities.

2605.07971 2026-06-17 cs.CV cs.LG (updated version)

DVD: Discrete Voxel Diffusion for 3D Generation and Editing

Zhengrui Xiang, Jiaqi Wu, Fupeng Sun, Heliang Zheng, Yingzhen Li

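Affiliations: Imperial College London; Math Magic Hitem3D

AI summary: Proposes Discrete Voxel Diffusion (DVD), a framework that treats voxel occupancy as a discrete variable to support sparse-voxel generation, uncertainty estimation, and editing, avoiding continuous-to-discrete thresholding and offering interpretable generation dynamics.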

Abstract

We introduce Discrete Voxel Diffusion (DVD), a discrete diffusion framework to generate, assess, and edit sparse voxels for SLat (Structured LATent) based 3D generative pipelines. Although discrete diffusion has not generally displaced continuous diffusion in image-like generation, we show that it can be an effective first-stage prior for sparse voxel scaffolds. By treating voxel occupancy as a native discrete variable, DVD avoids continuous-to-discrete thresholding and provides a simple framework for voxel generation, uncertainty estimation, and editing. Beyond quality gains, DVD provides more interpretable generation dynamics through explicit categorical modeling. Furthermore, we leverage the predictive entropy as a robust uncertainty metric to identify ambiguous voxel regions and complicated samples, facilitating tasks such as data filtering and quality assessment. Finally, we propose a lightweight fine-tuning strategy using block-structured perturbation patterns. This approach empowers the model to inpaint and edit voxels within a single sampling round, requiring negligible auxiliary computation and no additional model evaluations. Code is available at https://github.com/TeCai/DVD.
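The use of predictive entropy as an uncertainty signal can be sketched as follows. This is a minimal illustration that assumes each voxel carries a single predicted occupancy probability (a binary categorical); the dictionary layout, the function names, and the 0.5-nat threshold are hypothetical choices, and the paper's actual state space and scoring rule may differ.

```python
import math


def binary_entropy(p):
    """Entropy in nats of a Bernoulli(p) occupancy prediction."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a fully confident prediction carries no uncertainty
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def ambiguous_voxels(occupancy_probs, threshold=0.5):
    """Return voxel coordinates whose entropy exceeds the threshold."""
    return [
        coord
        for coord, p in occupancy_probs.items()
        if binary_entropy(p) > threshold
    ]


def sample_uncertainty(occupancy_probs):
    """Mean per-voxel entropy, usable as a per-sample score for filtering."""
    if not occupancy_probs:
        return 0.0
    total = sum(binary_entropy(p) for p in occupancy_probs.values())
    return total / len(occupancy_probs)
```

Voxels flagged by `ambiguous_voxels` would correspond to the "ambiguous voxel regions" in the abstract, while a high `sample_uncertainty` would mark a sample as a candidate for data filtering or closer quality review.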

2512.09373 2026-06-17 cs.CV (updated version)

FUSER: Feed-Forward MUltiview 3D Registration Transformer and SE(3)$^N$ Diffusion Refinement

Haobo Jiang, Jin Xie, Jian Yang, Liang Yu, Jianmin Zheng

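Affiliations: Nanyang Technological University; Alibaba Group; Nanjing University

AI summary: Proposes FUSER, the first feed-forward multiview registration transformer, which directly predicts global poses in a unified latent space without pairwise matching, and introduces FUSER-DF, an SE(3)^N diffusion refinement framework that corrects its estimates.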

Comments Accepted to CVPR 2026 (Oral)

Abstract

Registration of multiview point clouds conventionally relies on extensive pairwise matching to build a pose graph for global synchronization, which is computationally expensive and inherently ill-posed without holistic geometric constraints. This paper proposes FUSER, the first feed-forward multiview registration transformer that jointly processes all scans in a unified, compact latent space to directly predict global poses without any pairwise estimation. To maintain tractability, FUSER encodes each scan into low-resolution superpoint features via a sparse 3D CNN that preserves absolute translation cues, and performs efficient intra- and inter-scan reasoning through a Geometric Alternating Attention module. In particular, we transfer 2D attention priors from off-the-shelf foundation models to enhance 3D feature interaction and geometric consistency. Building upon FUSER, we further introduce FUSER-DF, an SE(3)$^N$ diffusion refinement framework to correct FUSER's estimates via denoising in the joint SE(3)$^N$ space. FUSER acts as a surrogate multiview registration model to construct the denoiser, and a prior-conditioned SE(3)$^N$ variational lower bound is derived for denoising supervision. Extensive experiments on 3DMatch, ScanNet and ArkitScenes demonstrate that our approach achieves superior registration accuracy and outstanding computational efficiency.

2604.22748 2026-06-17 cs.AI (updated version)

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin, Lingdong Kong, Jize Zhang, Teng Tu, Weijian Ma, Ziqi Huang, Senqiao Yang, Wei Huang, Yeying Jin, Zhefan Rao, Jinhui Ye, Xinyu Lin, Xichen Zhang, Qisheng Hu, Shuai Yang, Leyang Shen, Wei Chow, Yifei Dong, Fengyi Wu, Quanyu Long, Bin Xia, Shaozuo Yu, Mingkang Zhu, Wenhu Zhang, Jiehui Huang, Haokun Gui, Runyi Li, Chenyu Tang, Dong Huang, Xuhang Chen, Rui Liu, Chengzu Li, Shiyi Du, Xu Huang, Haoxuan Che, Long Chen, Qifeng Chen, Wenya Wang, Wenxuan Zhang, Xiaojuan Qi, Yang Deng, Yanwei Li, Mike Zheng Shou, Zhi-Qi Cheng, See-Kiong Ng, Ziwei Liu, Philip Torr, Jiaya Jia

Affiliations: Hong Kong University of Science and Technology; National University of Singapore; University of Oxford; Nanyang Technological University; Chinese University of Hong Kong; University of Hong Kong; University of Washington; University of Tokyo; Carnegie Mellon University; University of California, Berkeley; University of Cambridge; Singapore University of Technology and Design; Singapore Management University; XGEN Labs

AI summary: This paper proposes a "levels x laws" taxonomy that defines three capability levels and four governing-law regimes, synthesizes over 400 works and summarizes more than 100 systems, analyzes methods, failure modes, and evaluation practices, proposes decision-centric evaluation principles and a reproducible evaluation package, and charts a path for agentic world modeling from passive prediction to reshaping environments.

Abstract

As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities. We introduce a "levels x laws" taxonomy organized along two axes. The first defines three capability levels: L1 Predictor, which learns one-step local transition operators; L2 Simulator, which composes them into multi-step, action-conditioned rollouts that respect domain laws; and L3 Evolver, which autonomously revises its own model when predictions fail against new evidence. The second identifies four governing-law regimes: physical, digital, social, and scientific. These regimes determine what constraints a world model must satisfy and where it is most likely to fail. Using this framework, we synthesize over 400 works and summarize more than 100 representative systems spanning model-based reinforcement learning, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific discovery. We analyze methods, failure modes, and evaluation practices across level-regime pairs, propose decision-centric evaluation principles and a minimal reproducible evaluation package, and outline architectural guidance, open problems, and governance challenges. The resulting roadmap connects previously isolated communities and charts a path from passive next-step prediction toward world models that can simulate, and ultimately reshape, the environments in which agents operate. Code and resources are available at: https://github.com/matrix-agent/awesome-agentic-world-modeling.

2604.10827 2026-06-17 cs.AI 版本更新

Know Thy Reasoner: Not All Language Models Explore Alike

了解你的推理器:并非所有语言模型都以相同方式探索

Moulik Choraria, Argyrios Gerogiannis, Anirban Das, Supriyo Chakraborty, Sourya Basu, Sambit Sahu, Lav R. Varshney

发表机构 * UIUC(伊利诺伊大学香槟分校) Capital One

AI总结 本文提出模型多样性影响推理策略,通过理论框架分析推理不确定性,验证了不同模型在深度精炼和并行采样中的表现差异。

Comments This is a full-length extension of the workshop paper that appeared in the ICLR 2026 Workshop on LLM Reasoning

详情
AI中文摘要

LLM推理的计算扩展需要在探索不同解题思路(广度)和精炼有前景的解(深度)之间分配预算。大多数方法隐式地在两者之间权衡,但某种特定权衡为何有效仍不明确,而仅在单一模型上验证又掩盖了模型自身的作用。我们主张,最优策略取决于模型的多样性分布(diversity profile),即概率质量在各解题思路上的分散程度,并且在采用任何探索策略之前必须先对其进行刻画。我们通过一个分解推理不确定性的理论框架将其形式化,并推导出树状深度精炼优于并行采样的条件。我们在Qwen-3 4B和Olmo-3 7B系列上验证了这一点:在低多样性的对齐模型上,轻量信号足以支撑基于深度的精炼;而在高多样性的基础模型上,这类信号的效用有限,我们推测后者需要更强的信号来弥补较低的探索覆盖度。

英文摘要

Compute scaling for LLM reasoning trades off exploring solution approaches (breadth) against refining promising ones (depth), yet why a given trade-off works, and why it often fails to transfer across models, remains unclear. We argue that the optimal strategy depends on the model's diversity profile, the spread of probability mass across solution approaches, and that this must be characterized before any exploration strategy is adopted. We formalize this with a framework decomposing reasoning uncertainty, deriving when depth-based refinement outperforms parallel sampling, and validate it across three model families at both inference and training. Our central finding is that the diversity regime dictates the strategy: low-diversity aligned models benefit from depth-based refinement with lightweight intrinsic signals, whereas high-diversity base models are often harmed by it, and instead need breadth or stronger signals to compensate.

2512.04524 2026-06-17 cs.LG cs.AI 版本更新

Prototype-Based Semantic Consistency Alignment for Domain Adaptive Retrieval

基于原型语义一致性对齐的域自适应检索

Tianle Hu, Weijun Lv, Na Han, Xiaozhao Fang, Jie Wen, Jiaxing Li, Guoxu Zhou

发表机构 * School of Computer Science and Technology, Guangdong University of Technology(广东工业大学计算机科学与技术学院) School of Automation, Guangdong University of Technology(广东工业大学自动化学院) School of Computer Science, Guangdong Polytechnic Normal University(广东技术师范大学计算机科学学院) School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)计算机科学与技术学院) School of Artificial Intelligence, Guangzhou University(广州大学人工智能学院)

AI总结 提出原型语义一致性对齐(PSCA)两阶段框架,通过正交原型建立类级语义连接,利用几何邻近性加权伪标签置信度,并在重构特征上量化生成统一哈希码,解决域自适应检索中的类级对齐缺失和量化质量下降问题。

Comments AAAI2026

详情
AI中文摘要

域自适应检索旨在将知识从有标签的源域迁移到无标签的目标域,实现有效检索的同时缓解域差异。然而,现有方法存在几个根本性局限:1)忽略类级语义对齐,过度追求成对样本对齐;2)缺乏伪标签可靠性考虑或评估标签正确性的几何指导;3)直接量化受域偏移影响的原始特征,损害所学哈希码的质量。鉴于这些局限,我们提出基于原型的语义一致性对齐(PSCA),一种用于有效域自适应检索的两阶段框架。在第一阶段,一组正交原型直接建立类级语义连接,在聚集类内样本的同时最大化类间分离性。在原型学习过程中,几何邻近性通过自适应加权伪标签置信度,为语义一致性对齐提供可靠性指标。所得的隶属度矩阵和原型促进特征重建,确保在重建特征而非原始特征上进行量化,从而改善后续哈希编码质量并无缝连接两个阶段。在第二阶段,特定域的量化函数在相互逼近约束下处理重建特征,生成跨域的统一二进制哈希码。大量实验验证了PSCA在多个数据集上的优越性能。

英文摘要

Domain adaptive retrieval aims to transfer knowledge from a labeled source domain to an unlabeled target domain, enabling effective retrieval while mitigating domain discrepancies. However, existing methods encounter several fundamental limitations: 1) neglecting class-level semantic alignment and excessively pursuing pair-wise sample alignment; 2) lacking either pseudo-label reliability consideration or geometric guidance for assessing label correctness; 3) directly quantizing original features affected by domain shift, undermining the quality of learned hash codes. In view of these limitations, we propose Prototype-Based Semantic Consistency Alignment (PSCA), a two-stage framework for effective domain adaptive retrieval. In the first stage, a set of orthogonal prototypes directly establishes class-level semantic connections, maximizing inter-class separability while gathering intra-class samples. During the prototype learning, geometric proximity provides a reliability indicator for semantic consistency alignment through adaptive weighting of pseudo-label confidences. The resulting membership matrix and prototypes facilitate feature reconstruction, ensuring quantization on reconstructed rather than original features, thereby improving subsequent hash coding quality and seamlessly connecting both stages. In the second stage, domain-specific quantization functions process the reconstructed features under mutual approximation constraints, generating unified binary hash codes across domains. Extensive experiments validate PSCA's superior performance across multiple datasets.
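
The retrieval step that hash codes enable can be illustrated with a minimal sketch. It assumes plain sign quantization and Hamming-distance ranking, which is a generic stand-in and not PSCA's domain-specific quantization functions; the feature values and the helper names `to_hash_code`, `hamming_distance`, and `retrieve` are hypothetical.

```python
def to_hash_code(features):
    """Quantize a real-valued feature vector to a {-1, +1} code with the sign function."""
    return [1 if x >= 0 else -1 for x in features]


def hamming_distance(code_a, code_b):
    """Count the positions where two equal-length binary codes differ."""
    return sum(1 for a, b in zip(code_a, code_b) if a != b)


def retrieve(query_code, database_codes, top_k=2):
    """Return indices of the top_k database codes closest to the query in Hamming distance."""
    ranked = sorted(
        range(len(database_codes)),
        key=lambda i: hamming_distance(query_code, database_codes[i]),
    )
    return ranked[:top_k]


# Hypothetical (already reconstructed) features for three database items and one query.
database = [
    [0.9, -0.2, 0.4, -0.7],
    [-0.8, 0.3, -0.5, 0.6],
    [0.7, -0.1, -0.3, -0.9],
]
codes = [to_hash_code(f) for f in database]
query = to_hash_code([0.8, -0.3, 0.5, -0.6])
print(retrieve(query, codes))
```

Because the ranking depends only on the binary codes, any degradation introduced at quantization time carries straight into retrieval, which is the motivation the abstract gives for quantizing reconstructed rather than raw features.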

2603.26551 2026-06-17 cs.CV cs.AI 版本更新

Beyond MACs: Hardware Efficient Architecture Design for Vision Backbones

超越MACs:面向视觉骨干网络的硬件高效架构设计

Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni

发表机构 * Machine Learning and Perception Lab, University of Udine(乌迪内大学机器学习与感知实验室) Centre for Vision Research, York University(约克大学视觉研究中心)

AI总结 针对MACs指标在边缘设备上的不足,提出基于硬件效率洞察的LowFormer骨干网络,通过轻量级Lowtention模块实现显著加速。

Comments Accepted at International Journal of Computer Vision (IJCV)

Journal ref Int J Comput Vis 134, 295 (2026)

详情
AI中文摘要

视觉骨干网络在现代计算机视觉中扮演核心角色。提升其效率直接惠及广泛的下游应用。为衡量效率,许多论文依赖MACs(乘累加操作数)作为执行时间的预测指标。本文通过实验证明该指标的缺陷,尤其在边缘设备场景下。通过对比常见架构设计元素的MAC计数和执行时间,我们识别出高效执行的关键因素,并提供优化骨干设计的见解。基于这些见解,我们提出LowFormer,一种新型视觉骨干家族。LowFormer采用流线型的宏观和微观设计,其中包含Lowtention,一种多头自注意力的轻量级替代方案。Lowtention不仅更高效,还在ImageNet上取得了更优结果。此外,我们提出LowFormer的边缘GPU版本,可在边缘GPU和桌面GPU上进一步超越其基线版本的速度。通过在更小的图像分类数据集上进行评估,并将其适配到多个下游任务(如目标检测、语义分割、图像检索和视觉目标跟踪),我们展示了LowFormer的广泛适用性。与近期最先进的骨干网络相比,LowFormer模型在各种硬件平台上均实现了显著加速。代码和模型见:https://github.com/altair199797/LowFormer/blob/main/Beyond_MACs.md

英文摘要

Vision backbone networks play a central role in modern computer vision. Enhancing their efficiency directly benefits a wide range of downstream applications. To measure efficiency, many publications rely on MACs (Multiply Accumulate operations) as a predictor of execution time. In this paper, we experimentally demonstrate the shortcomings of such a metric, especially in the context of edge devices. By contrasting the MAC count and execution time of common architectural design elements, we identify key factors for efficient execution and provide insights to optimize backbone design. Based on these insights, we present LowFormer, a novel vision backbone family. LowFormer features a streamlined macro and micro design that includes Lowtention, a lightweight alternative to Multi-Head Self-Attention. Lowtention not only proves more efficient, but also enables superior results on ImageNet. Additionally, we present an edge GPU version of LowFormer that further improves upon its baseline's speed on edge GPU and desktop GPU. We demonstrate LowFormer's wide applicability by evaluating it on smaller image classification datasets, as well as adapting it to several downstream tasks, such as object detection, semantic segmentation, image retrieval, and visual object tracking. LowFormer models consistently achieve remarkable speed-ups across various hardware platforms compared to recent state-of-the-art backbones. Code and models are available at https://github.com/altair199797/LowFormer/blob/main/Beyond_MACs.md.
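
For readers unfamiliar with the metric, the following sketch shows how a MAC count is typically derived for a 2D convolution. The layer shapes are hypothetical and do not come from the paper; the formula is the standard one for (grouped) convolutions and assumes `c_in` is divisible by `groups`.

```python
def conv2d_macs(h_out, w_out, c_in, c_out, kernel, groups=1):
    """Multiply-accumulate count of a 2D convolution with a square kernel.

    Each output element needs (c_in / groups) * kernel * kernel MACs,
    and there are h_out * w_out * c_out output elements.
    """
    return h_out * w_out * c_out * (c_in // groups) * kernel * kernel


# Hypothetical layer: 56x56 output, 64 input and 64 output channels.
standard = conv2d_macs(56, 56, 64, 64, kernel=3)
depthwise = conv2d_macs(56, 56, 64, 64, kernel=3, groups=64)
pointwise = conv2d_macs(56, 56, 64, 64, kernel=1)

print("standard 3x3:", standard)
print("depthwise 3x3 + pointwise 1x1:", depthwise + pointwise)
print("MAC ratio:", standard / (depthwise + pointwise))
```

The count is a property of the layer's shapes only; it says nothing about memory access or how well a device parallelizes the operation. That gap between counted operations and measured latency is what the abstract argues makes MACs an unreliable predictor of execution time.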

2603.25937 2026-06-17 cs.RO cs.LG 版本更新

Can Vision Foundation Models Navigate? Zero-Shot Real-World Evaluation and Lessons Learned

视觉基础模型能否导航?零样本真实世界评估与经验教训

Maeva Guerrier, Karthik Soma, Jana Pavlasek, Giovanni Beltrame

发表机构 * Polytechnique Montreal(蒙特利尔理工学院)

AI总结 本文对五种视觉导航模型在真实环境中进行零样本评估,发现其存在几何理解不足、感知混淆和分布漂移等系统性问题,并公开评估代码与数据集。

详情
AI中文摘要

视觉导航模型(VNMs)通过从大规模视觉演示中学习,有望实现通用化的机器人导航。尽管在真实世界部署日益增多,现有评估几乎完全依赖成功率(机器人是否到达目标),这掩盖了轨迹质量、碰撞行为以及对环境变化的鲁棒性。我们针对五种最先进的VNMs(GNM、ViNT、NoMaD、NaviBridger和CrossFormer)在两个机器人平台和五个室内外环境中进行了真实世界评估。除了成功率,我们结合了基于路径的指标与基于视觉的目标识别分数,并通过受控图像扰动(运动模糊、太阳眩光)评估鲁棒性。我们的分析揭示了三个系统性问题:(a) 即使是架构复杂的扩散和Transformer模型也频繁发生碰撞,表明几何理解有限;(b) 模型无法区分感知相似但存在语义差异的不同位置,导致在重复环境中出现目标预测错误;(c) 在分布偏移下性能下降。我们将公开发布评估代码和数据集,以促进VNMs的可重复基准测试。

英文摘要

Visual Navigation Models (VNMs) promise generalizable robot navigation by learning from large-scale visual demonstrations. Despite growing real-world deployment, existing evaluations rely almost exclusively on success rate (whether the robot reaches its goal), which conceals trajectory quality, collision behavior, and robustness to environmental change. We present a real-world evaluation of five state-of-the-art VNMs (GNM, ViNT, NoMaD, NaviBridger, and CrossFormer) across two robot platforms and five environments spanning indoor and outdoor settings. Beyond success rate, we combine path-based metrics with vision-based goal-recognition scores and assess robustness through controlled image perturbations (motion blur, sun flare). Our analysis uncovers three systematic limitations: (a) even architecturally sophisticated diffusion and transformer-based models exhibit frequent collisions, indicating limited geometric understanding; (b) models fail to discriminate between different locations that are perceptually similar despite some semantic differences, causing goal prediction errors in repetitive environments; and (c) performance degrades under distribution shift. We will publicly release our evaluation codebase and dataset to facilitate reproducible benchmarking of VNMs.

2602.10384 2026-06-17 cs.CL 版本更新

When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents

当表格失控:评估多模态模型在法语金融文档上的表现

Virginie Mouilleron, Théo Lasnier, Anna Mosolova, Djamé Seddah

发表机构 * Inria Paris(法国国家信息与自动化研究所巴黎中心) Sorbonne Université(索邦大学)

AI总结 提出Scribe Finance基准,评估多模态模型在法语金融文档上的文本、表格、图表及多轮对话理解能力,发现模型在图表和多轮对话中表现脆弱。

Comments 16 pages, 13 figures

详情
AI中文摘要

视觉语言模型(VLM)在许多文档理解任务上表现良好,但它们在专业非英语领域的可靠性仍未得到充分探索。这一差距在金融领域尤为关键,因为金融文档混合了密集的监管文本、数值表格和视觉图表,且提取错误可能产生实际后果。我们引入了Scribe Finance,这是首个用于评估法语金融文档理解的多模态基准。该数据集包含1,204个经过专家验证的问题,涵盖文本提取、表格理解、图表解释和多轮对话推理,数据来自真实的投资招募说明书、KID和PRIIP。我们使用LLM-as-judge协议评估了六个开放权重VLM(8B-124B参数)。虽然模型在文本和表格任务上表现强劲(85-90%准确率),但在图表解释上表现不佳(34-62%)。最值得注意的是,多轮对话揭示了一个严重的失败模式:早期错误在轮次间传播,导致准确率降至约50%,无论模型大小如何。这些结果表明,当前的VLM在定义明确的提取任务上有效,但在交互式、多步骤的金融分析中仍然脆弱。Scribe Finance提供了一个具有挑战性的基准,用于衡量和推动这一高风险场景的进展。

英文摘要

Vision-language models (VLMs) perform well on many document understanding tasks, yet their reliability in specialized, non-English domains remains underexplored. This gap is especially critical in finance, where documents mix dense regulatory text, numerical tables, and visual charts, and where extraction errors can have real-world consequences. We introduce Scribe Finance, the first multimodal benchmark for evaluating French financial document understanding. The dataset contains 1,204 expert-validated questions spanning text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning, drawn from real investment prospectuses, KIDs, and PRIIPs. We evaluate six open-weight VLMs (8B-124B parameters) using an LLM-as-judge protocol. While models achieve strong performance on text and table tasks (85-90% accuracy), they struggle with chart interpretation (34-62%). Most notably, multi-turn dialogue reveals a sharp failure mode: early mistakes propagate across turns, driving accuracy down to roughly 50% regardless of model size. These results show that current VLMs are effective for well-defined extraction tasks but remain brittle in interactive, multi-step financial analysis. Scribe Finance offers a challenging benchmark to measure and drive progress in this high-stakes setting.

2601.16407 2026-06-17 cs.CL cs.AI 版本更新

Jacobian Scopes: token-level causal attributions in LLMs

Jacobian Scopes: LLM中的令牌级因果归因

Toni J. B. Liu, Baran Zadeoğlu, Nicolas Boullé, Raphaël Sarfati, Gurbir Arora, Christopher J. Earls

发表机构 * Cornell University(康奈尔大学) Imperial College London(伦敦帝国理工学院) Goodfire AI

AI总结 提出Jacobian Scopes,一种基于梯度的令牌级因果归因方法,用于解释LLM预测,揭示政治偏见、翻译策略和上下文学习机制。

Comments 25 pages, 16 figures

详情
AI中文摘要

大型语言模型(LLM)基于上下文中的线索(如语义描述和上下文示例)进行下一个令牌预测。然而,由于现代架构中层和注意力头数量激增,阐明哪些先前的令牌对给定预测影响最大仍然具有挑战性。我们提出Jacobian Scopes,一套基于梯度的令牌级因果归因方法,用于解释LLM预测。基于微扰理论和信息几何,Jacobian Scopes量化输入令牌如何影响模型预测的各个方面,例如特定logits、完整预测分布和模型不确定性(有效温度)。通过涵盖指令理解、翻译和上下文学习(ICL)的案例研究,我们展示了Jacobian Scopes如何揭示隐含的政治偏见,发现词级和短语级翻译策略,并阐明近期存在争议的上下文时间序列预测的潜在机制。为了便于在自定义文本上探索Jacobian Scopes,我们开源了实现,并提供了云托管的交互式演示:https://huggingface.co/spaces/Typony/JacobianScopes

英文摘要

Large language models (LLMs) make next-token predictions based on clues present in their context, such as semantic descriptions and in-context examples. Yet, elucidating which prior tokens most strongly influence a given prediction remains challenging due to the proliferation of layers and attention heads in modern architectures. We propose Jacobian Scopes, a suite of gradient-based, token-level causal attribution methods for interpreting LLM predictions. Grounded in perturbation theory and information geometry, Jacobian Scopes quantify how input tokens influence various aspects of a model's prediction, such as specific logits, the full predictive distribution, and model uncertainty (effective temperature). Through case studies spanning instruction understanding, translation, and in-context learning (ICL), we demonstrate how Jacobian Scopes reveal implicit political biases, uncover word- and phrase-level translation strategies, and shed light on recently debated mechanisms underlying in-context time-series forecasting. To facilitate exploration of Jacobian Scopes on custom text, we open-source our implementations and provide a cloud-hosted interactive demo at https://huggingface.co/spaces/Typony/JacobianScopes.
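
The general idea of gradient-based token attribution can be sketched on a toy function. This is not the Jacobian Scopes implementation: the "model" below is a hypothetical scalar function of token embeddings, and the Jacobian is approximated with central finite differences instead of automatic differentiation on a real LLM. All names and values are assumptions made for illustration.

```python
import math


def toy_logit(embeddings, weights):
    """Hypothetical scalar 'logit': tanh of a position-weighted sum of embedding entries."""
    total = sum(w * sum(e) for w, e in zip(weights, embeddings))
    return math.tanh(total)


def token_attributions(embeddings, weights, eps=1e-6):
    """Score each token by the L2 norm of d(logit)/d(embedding), via central differences."""
    scores = []
    for t, emb in enumerate(embeddings):
        grad_sq = 0.0
        for d in range(len(emb)):
            plus = [list(e) for e in embeddings]
            minus = [list(e) for e in embeddings]
            plus[t][d] += eps
            minus[t][d] -= eps
            grad = (toy_logit(plus, weights) - toy_logit(minus, weights)) / (2 * eps)
            grad_sq += grad * grad
        scores.append(math.sqrt(grad_sq))
    return scores


# Three hypothetical tokens with 2-dimensional embeddings; token 1 has the largest weight.
embeddings = [[0.1, -0.2], [0.3, 0.1], [-0.1, 0.2]]
weights = [0.1, 2.0, 0.5]
scores = token_attributions(embeddings, weights)
most_influential = scores.index(max(scores))
print(most_influential, scores)
```

In this toy setting the score of each token is proportional to its weight, so the token with the largest weight is ranked as most influential. On a real model the same quantity would be obtained from backpropagated gradients rather than finite differences.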

2603.05171 2026-06-17 cs.CL cs.AI 版本更新

Guidelines for the Annotation and Visualization of Legal Argumentation Structures in Chinese Judicial Decisions

中国司法判决中法律论证结构的标注与可视化指南

Kun Chen, Xianglei Liao, Kaixue Fei, Yi Xing, Xinrui Li

发表机构 * Law School, Nanjing University(南京大学法学院)

AI总结 提出一个系统化、可操作的标注框架,用于表示司法判决中的法律论证结构,支持大规模司法推理分析和法律论证挖掘。

Comments This Guideline has been developed through revision and refinement based on the first edition. The element label system has been adjusted, and the annotation granularity and annotation workflow have been further optimized

详情
AI中文摘要

本指南提出了一个系统化且可操作的标注框架,用于表示司法判决中的法律论证结构。该框架基于法律推理和论证理论,旨在揭示司法推理的逻辑组织,并为计算分析提供可靠基础。在元素层面,本指南区分了非命题层和命题层。非命题层由两个元素组成:议题和非论证性成分。在命题层面,本指南定义了四种命题类型:一般规范性判断、特殊规范性判断、一般事实判断和特殊事实判断。在关系层面,定义了五种关系类型来表示论证结构:支持、攻击、联合、匹配和同一性。这些关系捕捉了正面和负面的论证连接、合取推理结构、法律规范与案件事实之间的对应关系,以及命题之间的同一性或语义等价性。本指南进一步规定了基本结构和嵌套结构的形式化表示规则和可视化约定,使得复杂论证模式的可视化保持一致。此外,它建立了标准化的标注工作流程和一致性控制机制,以确保标注数据的可重复性和可靠性。通过提供清晰的概念模型、形式化表示规则和实用的标注程序,本指南支持大规模司法推理分析以及未来在法律论证挖掘、法律推理计算建模和人工智能辅助法律分析方面的研究。

英文摘要

This Guideline presents a systematic and operationalizable annotation framework for representing legal argumentation structures in judicial decisions. Grounded in theories of legal reasoning and argumentation, the framework aims to reveal the logical organization of judicial reasoning and provide a reliable foundation for computational analysis. At the element level, the Guideline distinguishes between the non-propositional layer and the propositional layer. The non-propositional layer consists of two elements: Issue and Non-argumentative Component. At the propositional level, the Guideline defines four proposition types: General Normative Judgment, Particular Normative Judgment, General Factual Judgment, and Particular Factual Judgment. At the relational level, five relation types are defined to represent argumentative structures: Support, Attack, Joint, Match, and Identity. These relations capture positive and negative argumentative connections, conjunctive reasoning structures, correspondences between legal norms and case facts, and identity or semantic equivalence between propositions. The Guideline further specifies formal representation rules and visualization conventions for both basic and nested structures, enabling consistent visualization of complex argumentation patterns. In addition, it establishes a standardized annotation workflow and consistency control mechanisms to ensure the reproducibility and reliability of annotated data. By providing a clear conceptual model, formal representation rules, and practical annotation procedures, this Guideline supports large-scale analysis of judicial reasoning and future research in legal argument mining, computational modeling of legal reasoning, and AI-assisted legal analysis.

2602.11590 2026-06-17 cs.LG 版本更新

Learn from Your Mistakes: Self-Correcting Masked Diffusion Models

从错误中学习:自纠正掩码扩散模型

Yair Schiff, Omer Belhasin, Roy Uziel, Guanghan Wang, Marianne Arriola, Gilad Turok, Ran Zilberstein, Michael Elad, Volodymyr Kuleshov

发表机构 * Cornell(康奈尔大学) NVIDIA(英伟达)

AI总结 提出ProSeCo框架,通过训练模型同时进行掩码去除和错误纠正,在生成过程中迭代修正已解码标记,提升样本质量并实现更快的采样速度。

Comments Code to reproduce our experiments is available here: https://github.com/kuleshov-group/proseco

详情
AI中文摘要

掩码扩散模型(MDMs)已成为自回归模型的有前途的替代方案,能够实现并行标记生成,同时保持竞争性能。尽管有这些优势,MDMs面临一个根本性限制:一旦标记被解除掩码,它们就保持固定,导致错误累积并最终降低样本质量。我们通过提出一个框架来解决这个问题,该框架训练模型同时执行掩码去除和纠正。通过重用MDM去噪网络的输出作为纠正器训练的输入,我们训练模型从潜在错误中恢复。在生成过程中,我们在掩码去除步骤之间应用额外的纠正性细化步骤,以更改解码的标记并改进输出。我们将我们的训练和采样方法命名为渐进式自纠正(ProSeCo),因为它具有独特的能力,可以迭代地细化整个序列,包括已生成的标记。我们在多个条件和无条件任务上进行了广泛的实验验证,表明我们的方法产生了更好的质量-效率权衡(采样速度提升高达约4倍),并实现了推理时计算缩放,以进一步提高样本质量,超越标准MDMs(在基准测试上提升高达约1.2倍)。

英文摘要

Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models, enabling parallel token generation while achieving competitive performance. Despite these advantages, MDMs face a fundamental limitation: once tokens are unmasked, they remain fixed, leading to error accumulation and ultimately degrading sample quality. We address this by proposing a framework that trains a model to perform both unmasking and correction. By reusing outputs from the MDM denoising network as inputs for corrector training, we train a model to recover from potential mistakes. During generation, we apply additional corrective refinement steps between unmasking ones in order to change decoded tokens and improve outputs. We name our training and sampling method Progressive Self-Correction (ProSeCo) for its unique ability to iteratively refine an entire sequence, including already generated tokens. We conduct extensive experimental validation across multiple conditional and unconditional tasks, demonstrating that ProSeCo yields better quality-efficiency trade-offs (up to ~4x faster sampling) and enables inference-time compute scaling to further increase sample quality beyond standard MDMs (up to ~1.2x improvement on benchmarks).

2602.06806 2026-06-17 cs.CV cs.LG 版本更新

RAIGen: Rare Attribute Identification in Text-to-Image Generative Models

RAIGen: 文本到图像生成模型中的罕见属性识别

Silpa Vadakkeeveetil Sreelatha, Dan Wang, Serge Belongie, Muhammad Awais, Anjan Dutta

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出RAIGen框架,利用Matryoshka稀疏自编码器和新颖的少数度量,在无标签条件下发现扩散模型中的罕见属性,并支持属性放大。

Comments Accepted at ICML 2026. Webpage and code available at https://github.com/VSSILPA/RAIGen

详情
AI中文摘要

文本到图像扩散模型实现了令人印象深刻的生成质量,但继承并放大了训练数据中的偏差,扭曲了语义属性的覆盖。先前的工作以两种方式解决这一问题。封闭集方法在预定义的公平性类别(如性别、种族)中减轻偏差,假设社会显著的少数属性是先验已知的。开放集方法将任务框架化为偏差识别,突出主导输出的多数属性。两者都忽略了一个互补的任务:揭示在数据分布中代表性不足(社会、文化或风格)但仍编码在模型表示中的罕见或少数特征。我们介绍了RAIGen,据我们所知,这是第一个用于扩散模型中无标签罕见属性发现的框架,不需要预定义的少数类别。RAIGen利用Matryoshka稀疏自编码器和一种新颖的少数度量,结合神经元激活频率与语义独特性,识别出那些其最高激活图像揭示代表性不足属性的可解释神经元。实验表明,RAIGen在Stable Diffusion中发现了超出固定公平性类别的属性,可扩展到更大的模型如SDXL,支持跨架构的系统审计,并在生成过程中实现罕见属性的定向放大。项目页面可在 https://vssilpa.github.io/RAIGen_webpage/ 获取。

英文摘要

Text-to-image diffusion models achieve impressive generation quality but inherit and amplify training-data biases, skewing coverage of semantic attributes. Prior work addresses this in two ways. Closed-set approaches mitigate biases in predefined fairness categories (e.g., gender, race), assuming socially salient minority attributes are known a priori. Open-set approaches frame the task as bias identification, highlighting majority attributes that dominate outputs. Both overlook a complementary task: uncovering rare or minority features underrepresented in the data distribution (social, cultural, or stylistic) yet still encoded in model representations. We introduce RAIGen, the first framework, to our knowledge, for label-free rare-attribute discovery in diffusion models, requiring no predefined minority categories. RAIGen leverages Matryoshka Sparse Autoencoders and a novel minority metric combining neuron activation frequency with semantic distinctiveness to identify interpretable neurons whose top-activating images reveal underrepresented attributes. Experiments show RAIGen discovers attributes beyond fixed fairness categories in Stable Diffusion, scales to larger models such as SDXL, supports systematic auditing across architectures, and enables targeted amplification of rare attributes during generation. The project page is available at https://vssilpa.github.io/RAIGen_webpage/ .

2602.06276 2026-06-17 cs.LG stat.ML 版本更新

Statistical Learning from Attribution Sets

从归因集合中进行统计学习

Lorne Applebaum, Robert Busa-Fekete, August Y. Chen, Claudio Gentile, Tomer Koren, Aryan Mokhtari

发表机构 * Google Research(谷歌研究院) Cornell University(康奈尔大学) Tel Aviv University(特拉维夫大学) UT Austin(德克萨斯大学奥斯汀分校)

AI总结 针对隐私约束下广告点击与转化无法直接关联的问题,提出基于归因集合的无偏损失估计方法,实现经验风险最小化的泛化保证,并优于行业启发式方法。

Comments COLT 2026. 45 pages

详情
AI中文摘要

我们研究隐私约束下广告领域转化预测模型的训练问题,此时广告点击与转化之间无法直接关联。受隐私保护浏览器API和第三方cookie弃用的启发,我们研究这样一种设定:学习器观察到一个点击序列和一个转化序列,但只能将每次转化关联到一组候选点击(归因集合),而不是唯一的来源。我们将其形式化为从归因集合中学习,这些集合由一个持有候选点击先验分布的非自适应(oblivious)对手生成。尽管缺乏显式标签,我们通过一种新颖的方法从这些粗粒度信号中构建了总体损失的无偏估计量。利用该估计量,我们证明经验风险最小化可获得随先验信息量而变化的泛化保证,并且对先验的估计误差具有鲁棒性,即使归因集合之间存在复杂的依赖关系。在标准数据集上的简单实证评估表明,我们的无偏方法显著优于常见的行业启发式方法,特别是在归因集合较大或相互重叠的情况下。

英文摘要

We address the problem of training conversion prediction models in advertising domains under privacy constraints, where direct links between ad clicks and conversions are unavailable. Motivated by privacy-preserving browser APIs and the deprecation of third-party cookies, we study a setting where the learner observes a sequence of clicks and a sequence of conversions, but can only link a conversion to a set of candidate clicks (an attribution set) rather than a unique source. We formalize this as learning from attribution sets generated by an oblivious adversary equipped with a prior distribution over the candidates. Despite the lack of explicit labels, we construct an unbiased estimator of the population loss from these coarse signals via a novel approach. Leveraging this estimator, we show that Empirical Risk Minimization achieves generalization guarantees that scale with the informativeness of the prior and is also robust against estimation errors in the prior, despite complex dependencies among attribution sets. Simple empirical evaluations on standard datasets suggest our unbiased approach significantly outperforms common industry heuristics, particularly in regimes where attribution sets are large or overlapping.

2601.05212 2026-06-17 cs.CV 版本更新

FlowLet: Conditional 3D Brain MRI Synthesis using Wavelet Flow Matching

FlowLet: 基于小波流匹配的条件性3D脑MRI合成

Danilo Danese, Angela Lombardi, Matteo Attimonelli, Giuseppe Fasano, Tommaso Di Noia

发表机构 * Politecnico di Bari(巴里理工学院) Sapienza University of Rome(罗马萨皮恩扎大学)

AI总结 提出FlowLet框架,利用可逆3D小波域中的流匹配生成年龄条件化的3D脑MRI,避免重建伪影并降低计算需求,实验证明其生成高保真体积且提升脑年龄预测模型对低代表性年龄组的性能。

Comments Accepted at Medical Image Analysis (Elsevier)

详情
AI中文摘要

脑磁共振成像(MRI)在研究神经发育、衰老和疾病中起着核心作用。一个关键应用是脑年龄预测(BAP),它从MRI数据中估计个体的生物脑年龄。有效的BAP模型需要大规模、多样化和年龄平衡的数据集,而现有的3D MRI数据集在人口统计学上存在偏差,限制了公平性和泛化能力。获取新数据成本高昂且受到伦理约束,这促使了生成性数据增强。当前的生成方法通常基于潜在扩散模型,这些模型在学习的低维潜在空间中操作,以应对体积MRI数据的内存需求。然而,这些方法在推理时通常较慢,可能因潜在压缩而引入伪影,并且很少以年龄为条件,从而影响BAP性能。在这项工作中,我们提出了FlowLet,一个条件生成框架,通过在可逆3D小波域中利用流匹配来合成年龄条件化的3D MRI,有助于避免重建伪影并降低计算需求。实验表明,FlowLet以少量采样步骤生成高保真体积。使用FlowLet生成的数据训练BAP模型可改善低代表性年龄组的性能,基于区域的分析确认了解剖结构的保留。

英文摘要

Brain Magnetic Resonance Imaging (MRI) plays a central role in studying neurological development, aging, and diseases. One key application is Brain Age Prediction (BAP), which estimates an individual's biological brain age from MRI data. Effective BAP models require large, diverse, and age-balanced datasets, whereas existing 3D MRI datasets are demographically skewed, limiting fairness and generalizability. Acquiring new data is costly and ethically constrained, motivating generative data augmentation. Current generative methods are often based on latent diffusion models, which operate in learned low dimensional latent spaces to address the memory demands of volumetric MRI data. However, these methods are typically slow at inference, may introduce artifacts due to latent compression, and are rarely conditioned on age, thereby affecting the BAP performance. In this work, we propose FlowLet, a conditional generative framework that synthesizes age-conditioned 3D MRIs by leveraging flow matching within an invertible 3D wavelet domain, helping to avoid reconstruction artifacts and reducing computational demands. Experiments show that FlowLet generates high-fidelity volumes with few sampling steps. Training BAP models with data generated by FlowLet improves performance for underrepresented age groups, and region-based analysis confirms preservation of anatomical structures.
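
The property that makes a wavelet domain attractive here is exact invertibility: unlike a learned latent space, the transform loses no information. The sketch below shows this with a single-level 1D Haar transform. The choice of Haar and the 1D setting are assumptions made for brevity; the abstract does not state which wavelet FlowLet uses, and the paper works with 3D volumes.

```python
import math

SCALE = 1.0 / math.sqrt(2.0)


def haar_forward(signal):
    """Single-level orthonormal Haar transform of an even-length sequence."""
    approx = [(signal[i] + signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Exactly undo haar_forward (up to floating-point rounding)."""
    signal = []
    for a, d in zip(approx, detail):
        signal.append((a + d) * SCALE)
        signal.append((a - d) * SCALE)
    return signal


# Hypothetical intensity profile along one axis of a volume.
voxels = [4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 1.0, 3.0]
approx, detail = haar_forward(voxels)
restored = haar_inverse(approx, detail)
max_error = max(abs(a - b) for a, b in zip(voxels, restored))
print(max_error)
```

A generative model trained on the `approx` and `detail` coefficients can map its samples back to the signal domain with `haar_inverse` without a learned decoder, which is the sense in which the abstract speaks of avoiding reconstruction artifacts.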

2601.04574 2026-06-17 cs.CL 版本更新

FeedEval: Pedagogically Aligned Evaluation of LLM-Generated Essay Feedback

FeedEval: 面向教学法的LLM生成作文反馈评估

Seongyeub Chu, Jongwoo Kim, Munyong Yi

发表机构 * Graduate School of Data Science, KAIST(数据科学研究生院,韩国科学技术院) Department of Industrial & Systems Engineering, KAIST(工业与系统工程系,韩国科学技术院)

AI总结 提出FeedEval框架,沿特异性、帮助性和有效性三个教学维度评估LLM生成的作文反馈,通过专用评估器筛选高质量反馈,提升下游评分和修订效果。

详情
AI中文摘要

超越数值分数预测,近期自动作文评分研究日益强调生成提供理由和可操作指导的高质量反馈。为减轻专家标注的高成本,先前工作通常依赖LLM生成的反馈来训练作文评估模型。然而,此类反馈常未经明确质量验证即被纳入,导致下游应用中噪声的传播。为解决这一局限,我们提出FeedEval,一个基于LLM的框架,用于沿三个教学维度(特异性、帮助性和有效性)评估LLM生成的作文反馈。FeedEval采用维度专用的LLM评估器,这些评估器在本研究整理的数据集上训练,以评估多个候选反馈并选择高质量反馈供下游使用。在ASAP++基准上的实验表明,FeedEval与人类专家判断高度一致,且使用FeedEval筛选的高质量反馈训练的作文评分模型取得了更优的评分性能。此外,使用小型LLM进行的修订实验表明,FeedEval识别的高质量反馈能带来更有效的作文修订。我们在以下网址发布代码和整理的数据集:https://github.com/BBeeChu/FeedEval.git

英文摘要

Going beyond the prediction of numerical scores, recent research in automated essay scoring has increasingly emphasized the generation of high-quality feedback that provides justification and actionable guidance. To mitigate the high cost of expert annotation, prior work has commonly relied on LLM-generated feedback to train essay assessment models. However, such feedback is often incorporated without explicit quality validation, resulting in the propagation of noise in downstream applications. To address this limitation, we propose FeedEval, an LLM-based framework for evaluating LLM-generated essay feedback along three pedagogically grounded dimensions: specificity, helpfulness, and validity. FeedEval employs dimension-specialized LLM evaluators trained on datasets curated in this study to assess multiple feedback candidates and select high-quality feedback for downstream use. Experiments on the ASAP++ benchmark show that FeedEval closely aligns with human expert judgments and that essay scoring models trained with FeedEval-filtered high-quality feedback achieve superior scoring performance. Furthermore, revision experiments using small LLMs show that the high-quality feedback identified by FeedEval leads to more effective essay revisions. We release our code and curated datasets at: https://github.com/BBeeChu/FeedEval.git.

2505.23939 2026-06-17 cs.LG cs.NI 版本更新

Searching Neural Architectures for Sensor Nodes on IoT Gateways

在物联网网关上为传感器节点搜索神经架构

Andrea Mattia Garavagno, Edoardo Ragusa, Antonio Frisoli, Paolo Gastaldo

发表机构 * University of Genoa(热那亚大学)

AI总结 提出一种在物联网网关上自动设计神经网络的方法,保护数据隐私,在Raspberry Pi Zero 2上10小时内搜索出达到SOTA的架构。

Journal ref IEEE Internet of Things Journal, vol. 12, no. 21, pp. 44492-44501, 2025

详情
AI中文摘要

本文提出一种在边缘自动设计神经网络的方法,即使在隐私敏感的物联网应用中也能实现机器学习。该方法在物联网网关上运行,为连接的传感器节点设计神经网络,而无需将收集的数据共享到本地网络之外,将数据保留在采集现场。这种方法有潜力为医疗物联网和工业物联网启用机器学习,在边缘设计硬件友好且定制的神经网络,用于个性化医疗和高级工业服务,如质量控制、预测性维护或故障诊断。通过防止数据泄露到云服务,该方法保护了敏感信息,包括工业机密和个人数据。全面的实验结果表明,在Visual Wake Words数据集上,所提出的方法通过在Raspberry Pi Zero 2上运行不到10小时的搜索过程,可以达到最先进的结果。

英文摘要

This paper presents an automatic method for the design of Neural Networks (NNs) at the edge, enabling Machine Learning (ML) access even in privacy-sensitive Internet of Things (IoT) applications. The proposed method runs on IoT gateways and designs NNs for connected sensor nodes without sharing the collected data outside the local network, keeping the data at the site of collection. This approach has the potential to enable ML for Healthcare Internet of Things (HIoT) and Industrial Internet of Things (IIoT), designing hardware-friendly and custom NNs at the edge for personalized healthcare and advanced industrial services such as quality control, predictive maintenance, or fault diagnosis. By preventing data from being disclosed to cloud services, this method safeguards sensitive information, including industrial secrets and personal data. The outcomes of a thorough experimental session confirm that, on the Visual Wake Words dataset, the proposed approach can achieve state-of-the-art results by exploiting a search procedure that runs in less than 10 hours on the Raspberry Pi Zero 2.

2506.05797 2026-06-17 cs.LG cs.CE cs.RO 版本更新

EqCollide: Equivariant and Collision-Aware Deformable Objects Neural Simulator

EqCollide: 等变且碰撞感知的可变形物体神经模拟器

Qianyi Chen, Tianrun Gao, Chenbo Jiang, Tailin Wu

发表机构 * Westlake University(西湖大学) Fudan University(复旦大学) Tongji University(同济大学) McGill University(麦吉尔大学)

AI总结 提出首个端到端等变神经场模拟器EqCollide,通过等变编码器和碰撞感知消息传递的图神经网络常微分方程,实现可变形物体碰撞的准确、稳定和可扩展模拟。

Comments SIGKDD 2026 Oral AI4S Track. 20 pages, 16 figures

详情
AI中文摘要

模拟可变形物体的碰撞是一项基础但具有挑战性的任务,因为涉及固体力学和多体相互作用的复杂性。现有的数据驱动方法通常缺乏对物理对称性的等变性、对碰撞处理不足以及可扩展性有限。本文介绍EqCollide,这是首个用于可变形物体及其碰撞的端到端等变神经场模拟器。我们提出一个等变编码器,将物体几何和速度映射到潜在控制点。随后,基于等变图神经网络的神经常微分方程通过碰撞感知消息传递建模控制点之间的相互作用。为了重建速度场,我们查询一个以控制点特征为条件的神经场,实现连续且分辨率无关的运动预测。在2D和3D场景上的实验结果表明,EqCollide在不同物体配置下实现了准确、稳定且可扩展的模拟。即使与表现最佳的基线模型相比,其滚动(rollout)均方误差也降低了24.34%至57.62%。此外,EqCollide能够泛化到更多碰撞物体和更长的时间范围,并对群作用下的输入变换保持鲁棒。代码可在以下网址获取:https://github.com/AI4Science-WestlakeU/EqCollide

英文摘要

Simulating collisions of deformable objects is a fundamental yet challenging task due to the complexity of modeling solid mechanics and multi-body interactions. Existing data-driven methods often suffer from lack of equivariance to physical symmetries, inadequate handling of collisions, and limited scalability. Here we introduce EqCollide, the first end-to-end equivariant neural fields simulator for deformable objects and their collisions. We propose an equivariant encoder to map object geometry and velocity into latent control points. A subsequent equivariant Graph Neural Network-based Neural Ordinary Differential Equation models the interactions among control points via collision-aware message passing. To reconstruct velocity fields, we query a neural field conditioned on control point features, enabling continuous and resolution-independent motion predictions. Experimental results on 2D and 3D scenarios show that EqCollide achieves accurate, stable, and scalable simulations across diverse object configurations. It achieves 24.34% to 57.62% lower rollout MSE, even compared with the best-performing baseline model. Furthermore, EqCollide can generalize to more colliding objects and extended temporal horizons, and stays robust to inputs transformed by group actions.
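
Equivariance to a symmetry group means that transforming the input and then applying the model gives the same result as applying the model and then transforming the output. A minimal numerical check for 2D rotations is sketched below. The two "models" are hypothetical toy functions chosen for illustration and are unrelated to the EqCollide architecture.

```python
import math


def rotate(points, angle):
    """Rotate a list of 2D points counter-clockwise about the origin."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def toy_velocity(points):
    """Rotation-equivariant toy field: each point moves away from the centroid."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return [(x - cx, y - cy) for x, y in points]


def project_x(points):
    """Non-equivariant toy field: keeps only the x component."""
    return [(x, 0.0) for x, _ in points]


def equivariance_error(model, points, angle):
    """Largest gap between rotate(model(p)) and model(rotate(p)) over all points."""
    out_then_rot = rotate(model(points), angle)
    rot_then_out = model(rotate(points, angle))
    return max(
        math.hypot(ax - bx, ay - by)
        for (ax, ay), (bx, by) in zip(out_then_rot, rot_then_out)
    )


points = [(0.0, 0.0), (1.0, 0.0), (0.5, 2.0)]
angle = math.pi / 3
print(equivariance_error(toy_velocity, points, angle))
print(equivariance_error(project_x, points, angle))
```

The first error is zero up to rounding, while the second is clearly nonzero. The same kind of check can be applied to a learned simulator to test robustness to transformed inputs.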

2504.14582 2026-06-17 cs.CV 版本更新

NTIRE 2025 Challenge on Image Super-Resolution (x4): Methods and Results

NTIRE 2025 图像超分辨率(×4)挑战赛:方法与结果

Zheng Chen, Kai Liu, Jue Gong, Jingkai Wang, Lei Sun, Zongwei Wu, Radu Timofte, Yulun Zhang, Xiangyu Kong, Xiaoxuan Yu, Hyunhee Park, Suejin Han, Hakjae Jeon, Dafeng Zhang, Hyung-Ju Chun, Donghun Ryou, Inju Ha, Bohyung Han, Lu Zhao, Yuyi Zhang, Pengyu Yan, Jiawei Hu, Pengwei Liu, Fengjun Guo, Hongyuan Yu, Pufan Xu, Zhijuan Huang, Shuyuan Cui, Peng Guo, Jiahui Liu, Dongkai Zhang, Heng Zhang, Huiyuan Fu, Huadong Ma, Yanhui Guo, Sisi Tian, Xin Liu, Jinwen Liang, Jie Liu, Jie Tang, Gangshan Wu, Zeyu Xiao, Zhuoyuan Li, Yinxiang Zhang, Wenxuan Cai, Vijayalaxmi Ashok Aralikatti, Nikhil Akalwadi, G Gyaneshwar Rao, Chaitra Desai, Ramesh Ashok Tabib, Uma Mudenagudi, Marcos V. Conde, Alejandro Merino, Bruno Longarela, Javier Abad, Weijun Yuan, Zhan Li, Zhanglu Chen, Boyang Yao, Aagam Jain, Milan Kumar Singh, Ankit Kumar, Shubh Kawa, Divyavardhan Singh, Anjali Sarvaiya, Kishor Upla, Raghavendra Ramachandra, Chia-Ming Lee, Yu-Fan Lin, Chih-Chung Hsu, Risheek V Hiremath, Yashaswini Palani, Yuxuan Jiang, Qiang Zhu, Siyue Teng, Fan Zhang, Shuyuan Zhu, Bing Zeng, David Bull, Jingwei Liao, Yuqing Yang, Wenda Shao, Junyi Zhao, Qisheng Xu, Kele Xu, Sunder Ali Khowaja, Ik Hyun Lee, Snehal Singh Tomar, Rajarshi Ray, Klaus Mueller, Sachin Chaudhary, Surya Vashisth, Akshay Dudhane, Praful Hambarde, Satya Naryan Tazi, Prashant Patil, Santosh Kumar Vipparthi, Subrahmanyam Murala, Bilel Benjdira, Anas M. Ali, Wadii Boulila, Zahra Moammeri, Ahmad Mahmoudi-Aznaveh, Ali Karbasi, Hossein Motamednia, Liangyan Li, Guanhua Zhao, Kevin Le, Yimo Ning, Haoxuan Huang, Jun Chen

Affiliations * CVPR 2025

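AI summary This paper presents the NTIRE 2025 image super-resolution (×4) challenge, which has two sub-tracks (restoration and perceptual), and summarizes the challenge design, datasets, evaluation protocol, and the methods submitted by 25 teams.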

Comments NTIRE 2025 webpage: https://www.cvlai.net/ntire/2025. Code: https://github.com/zhengchen1999/NTIRE2025_ImageSR_x4

Journal ref Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025, pp. 1525-1535

Details
AI Chinese abstract

本文介绍了NTIRE 2025图像超分辨率(×4)挑战赛,这是CVPR 2025第10届NTIRE Workshop的关联竞赛之一。该挑战旨在从通过双三次下采样生成的×4比例低分辨率图像中恢复高分辨率图像,目标是开发有效的网络设计或解决方案以实现最先进的超分辨率性能。为反映图像超分辨率研究的双重目标,挑战包含两个子赛道:(1)恢复赛道,强调像素级精度,根据PSNR对提交结果进行排名;(2)感知赛道,关注视觉真实感,根据感知分数对结果进行排名。共有286名参与者注册了比赛,25个团队提交了有效作品。本报告总结了挑战设计、数据集、评估协议、主要结果以及每个团队的方法。该挑战作为基准,旨在推动图像超分辨率领域的最先进技术并促进其进步。

English abstract

This paper presents the NTIRE 2025 image super-resolution ($\times$4) challenge, one of the associated competitions of the 10th NTIRE Workshop at CVPR 2025. The challenge aims to recover high-resolution (HR) images from low-resolution (LR) counterparts generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective network designs or solutions that achieve state-of-the-art SR performance. To reflect the dual objectives of image SR research, the challenge includes two sub-tracks: (1) a restoration track, which emphasizes pixel-wise accuracy and ranks submissions by PSNR; and (2) a perceptual track, which focuses on visual realism and ranks submissions by a perceptual score. A total of 286 participants registered for the competition, and 25 teams submitted valid entries. This report summarizes the challenge design, datasets, evaluation protocol, main results, and the methods of each team. The challenge serves as a benchmark to advance the state of the art and foster progress in image SR.
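The restoration track ranks submissions by PSNR. A minimal sketch of the standard PSNR formula, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, is given below for small grayscale images stored as nested lists. The function name `psnr` and the sample pixel values are illustrative assumptions. The abstract does not describe the challenge's exact evaluation protocol (for example colour channel handling or border cropping), so this should not be read as the official scoring code.

```python
import math

def psnr(reference, estimate, max_value=255.0):
    """PSNR in dB between two equally sized grayscale images (lists of rows)."""
    squared_error = 0.0
    count = 0
    for ref_row, est_row in zip(reference, estimate):
        for r, e in zip(ref_row, est_row):
            squared_error += (r - e) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy 4x4 "ground truth" and a reconstruction that is off by 5 grey levels everywhere.
hr = [[52, 55, 61, 66], [70, 61, 64, 73], [63, 59, 55, 90], [67, 61, 68, 104]]
sr = [[v + 5 for v in row] for row in hr]
print(round(psnr(hr, sr), 2))  # prints 34.15
```

Here the MSE is 25, so the score is $10 \log_{10}(255^2 / 25) = 20 \log_{10}(51) \approx 34.15$ dB. A higher PSNR means a smaller pixel-wise error.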

2504.03991 2026-06-17 cs.CL cs.AI cs.HC cs.MA Updated version

Algorithmic Prompt Generation for Diverse Human-like Teaming and Communication with Large Language Models

面向多样化类人团队协作与通信的算法化提示生成与大型语言模型

Siddharth Srikanth, Varun Bhatt, Boshen Zhang, Werner Hager, Charles Michael Lewis, Katia P. Sycara, Aaquib Tabrez, Stefanos Nikolaidis

Affiliations * Thomas Lord Department of Computer Science, University of Southern California; School of Computing and Information, University of Pittsburgh; Robotics Institute, Carnegie Mellon University; Sibley School of Mechanical and Aerospace Engineering, Cornell University

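AI summary Combines Quality Diversity optimization with LLM agents to automatically search for prompts that generate diverse team behaviors, capturing human collaboration and communication strategies, and validates the human-likeness of these behaviors through a user study.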

Details
AI Chinese abstract

理解人类如何在团队中协作和通信对于改善人-代理团队协作和AI辅助决策至关重要。然而,由于后勤、伦理和实际限制,仅依赖大规模用户研究的数据是不切实际的,因此需要多种多样化人类行为的合成模型。最近,基于大型语言模型(LLM)的代理已被证明能够在社交环境中模拟类人行为。但是,获得大量多样化行为需要手动设计提示。另一方面,质量多样性(QD)优化已被证明能够生成多样化的强化学习(RL)代理行为。在这项工作中,我们将QD优化与LLM驱动的代理相结合,以迭代搜索在长时域、多步骤协作环境中生成多样化团队行为的提示。我们首先通过一项人类受试者实验表明,人类在该领域中表现出多样化的协调和通信行为。然后,我们进行一系列实验,表明我们的方法捕获了在没有大规模数据收集的情况下难以观察到的行为,并通过后续用户研究表明这些生成的行为是类人的。我们的发现凸显了QD与LLM驱动代理的结合作为研究多代理协作中团队协作和通信策略的有效工具。

English abstract

Understanding how humans collaborate and communicate in teams is essential for improving human-agent teaming and AI-assisted decision-making. However, relying solely on data from large-scale user studies is impractical due to logistical, ethical, and practical constraints, necessitating synthetic models of multiple diverse human behaviors. Recently, agents powered by Large Language Models (LLMs) have been shown to emulate human-like behavior in social settings. Yet obtaining a large set of diverse behaviors requires manual effort in the form of designing prompts. On the other hand, Quality Diversity (QD) optimization has been shown to be capable of generating diverse Reinforcement Learning (RL) agent behavior. In this work, we combine QD optimization with LLM-powered agents to iteratively search for prompts that generate diverse team behavior in a long-horizon, multi-step collaborative environment. We first show, through a human-subjects experiment, that humans exhibit diverse coordination and communication behavior in this domain. We then present a series of experiments showing that our approach captures behaviors that are difficult to observe without large-scale data collection, and a follow-up user study showing that these generated behaviors are human-like. Our findings highlight the combination of QD and LLM-powered agents as an effective tool for studying teaming and communication strategies in multi-agent collaboration.
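The idea of iteratively searching for prompts with QD optimization can be sketched with MAP-Elites, one common QD algorithm. The abstract does not say which QD algorithm the authors use, so that choice is an assumption here. Everything else is also hypothetical: prompts are reduced to tuples of trait words, `mutate` swaps one trait at random, and `evaluate` is a made-up stand-in for running an LLM-driven team episode and measuring its quality and behavior.

```python
import random

TRAITS = ["terse", "talkative", "cautious", "bold", "leader", "follower"]

def mutate(prompt_traits, rng):
    """Hypothetical mutation: replace one trait with a randomly chosen one."""
    child = list(prompt_traits)
    child[rng.randrange(len(child))] = rng.choice(TRAITS)
    return tuple(child)

def evaluate(prompt_traits):
    """Stand-in for an LLM team episode (no model is called here).

    Returns (quality, behavior descriptor). Both are invented functions of
    the traits; a real system would measure them from the rollout.
    """
    talk = sum(t == "talkative" for t in prompt_traits) - sum(t == "terse" for t in prompt_traits)
    risk = sum(t == "bold" for t in prompt_traits) - sum(t == "cautious" for t in prompt_traits)
    quality = sum(t in ("leader", "follower") for t in prompt_traits)
    return quality, (talk, risk)

def map_elites(iterations=500, seed=0):
    rng = random.Random(seed)
    start = ("leader", "terse", "cautious")
    archive = {}  # descriptor cell -> (quality, prompt traits)
    quality, cell = evaluate(start)
    archive[cell] = (quality, start)
    for _ in range(iterations):
        # Pick an existing elite, mutate it, and evaluate the child.
        _, parent = rng.choice(list(archive.values()))
        child = mutate(parent, rng)
        quality, cell = evaluate(child)
        # Keep the child if its cell is empty or it beats the current elite.
        if cell not in archive or quality > archive[cell][0]:
            archive[cell] = (quality, child)
    return archive

archive = map_elites()
for cell, (quality, traits) in sorted(archive.items()):
    print(cell, quality, traits)
```

The archive keeps the best prompt found for each behavior cell, so the result is a set of prompts that differ in behavior, not a single best prompt.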

2401.14381 2026-06-17 cs.LG math.DG Updated version

Manifold GCN: Diffusion-based Convolutional Neural Network for Manifold-valued Graphs

Manifold GCN:基于扩散的流形值图卷积神经网络

Martin Hanik, Gabriele Steidl, Christoph von Tycowicz

Affiliations * BIFOLD (Berlin Institute for the Foundations of Learning and Data); Technical University Berlin; Zuse Institute Berlin

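AI summary Proposes two graph neural network layers for graphs with features in a Riemannian manifold: a diffusion layer based on a manifold-valued graph diffusion equation and a tangent multilayer perceptron inspired by vector neurons. Both layers are equivariant under node permutations and isometries of the manifold, and they outperform task-specific networks while applying to a broader class of problems.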

Comments Extended ADNI experiment

Journal ref International Journal of Computer Vision, Volume 134, article number 315 (2026)

Details
AI Chinese abstract

我们提出了两种适用于黎曼流形中特征图的图神经网络层。首先,基于流形值图扩散方程,我们构建了一个可应用于任意数量节点和图连接模式的扩散层。其次,通过将向量神经元框架的思想迁移到我们的通用设置中,我们建模了一个切向多层感知器。这两层在节点置换和特征流形的等距变换下都是等变的。这些特性在许多深度学习任务中带来了有益的归纳偏置。此外,它们还支持新颖、更灵活的特征设计。合成数据上的数值示例以及基于右海马体三角网格的阿尔茨海默病分类应用证明了我们新层的实用性:虽然它们适用于更广泛的问题类别,但在性能上优于任务特定的最先进网络。

English abstract

We propose two graph neural network layers for graphs with features in a Riemannian manifold. First, based on a manifold-valued graph diffusion equation, we construct a diffusion layer that can be applied to an arbitrary number of nodes and graph connectivity patterns. Second, we model a tangent multilayer perceptron by transferring ideas from the vector neuron framework to our general setting. Both layers are equivariant under node permutations and the feature manifold's isometries. These properties have led to a beneficial inductive bias in many deep-learning tasks. Furthermore, they enable novel, more flexible feature designs. Numerical examples on synthetic data and an Alzheimer's classification application on triangle meshes of the right hippocampus demonstrate the usefulness of our new layers: While they apply to a much broader class of problems, they outperform task-specific state-of-the-art networks.
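To make the notion of diffusion on a manifold-valued graph more concrete, the sketch below runs one explicit Euler step of graph diffusion for node features on the unit sphere and checks that the step commutes with a rotation, which is an isometry of the sphere. This is not the paper's layer. The choice of the sphere, the unit edge weights, the step size, and all function names are assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_map(p, q):
    """Tangent vector at p pointing toward q along the geodesic on the unit sphere."""
    c = max(-1.0, min(1.0, dot(p, q)))
    angle = math.acos(c)
    tangent = [qi - c * pi for pi, qi in zip(p, q)]  # part of q orthogonal to p
    norm = math.sqrt(dot(tangent, tangent))
    if norm < 1e-12:
        return [0.0, 0.0, 0.0]
    return [angle * t / norm for t in tangent]

def exp_map(p, v):
    """Walk from p along the geodesic with initial tangent velocity v."""
    norm = math.sqrt(dot(v, v))
    if norm < 1e-12:
        return list(p)
    return [math.cos(norm) * pi + math.sin(norm) * vi / norm for pi, vi in zip(p, v)]

def diffusion_step(features, edges, step=0.2):
    """Explicit Euler step: move every node toward its graph neighbours."""
    updated = []
    for i, p in enumerate(features):
        drift = [0.0, 0.0, 0.0]
        for a, b in edges:
            if i in (a, b):
                j = b if i == a else a
                drift = [d + l for d, l in zip(drift, log_map(p, features[j]))]
        updated.append(exp_map(p, [step * d for d in drift]))
    return updated

def rotate_z(p, theta):
    """Rotate a 3D point about the z-axis (an isometry of the sphere)."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * p[0] - s * p[1], s * p[0] + c * p[1], p[2]]

r3 = 1.0 / math.sqrt(3.0)
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [r3, r3, r3]]
edges = [(0, 1), (1, 2), (0, 2)]
theta = 1.1

diffused = diffusion_step(features, edges)
# Diffuse then rotate versus rotate then diffuse.
a = [rotate_z(p, theta) for p in diffused]
b = diffusion_step([rotate_z(p, theta) for p in features], edges)
gap = max(abs(x - y) for u, w in zip(a, b) for x, y in zip(u, w))
print(gap < 1e-9)
```

Because the step is built only from the sphere's logarithm and exponential maps, it stays on the manifold and commutes with rotations. Node permutation equivariance follows from summing over neighbours without any dependence on node order.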