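arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.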

2605.24003 2026-06-17 cs.CV cs.AI stat.AP (updated version)

Remote sensing data imputation using deep learning for multispectral imagery

Shuang Liu, Fiona Johnson, Rohitash Chandra

Affiliations: Water Research Centre, University of New South Wales; ARC ITTC Data Analytics for Resources and Environments, University of New South Wales; Transitional Artificial Intelligence Research Group, School of Mathematics and Statistics, University of New South Wales

AI summary: To address gaps in optical satellite data caused by cloud cover, this study compares linear interpolation with several deep learning models (CNN, Inception Resnet, Autoencoder, and their combinations with LSTM) for reconstructing missing spectral bands over four lakes with historical algal bloom records. The deep learning models substantially outperform the baseline, CNN performs best, and algal bloom indices derived from the imputed imagery agree well with the observed data.

Abstract

Remote sensing techniques have been increasingly utilised in aquatic applications in recent years. A common challenge in using optical satellite data is the presence of missing observations due to cloud cover. These data gaps can lead to missed detection of critical events, such as algal blooms, in lakes of high interest to water authorities. As a result, enhancing the completeness of optical satellite datasets is crucial for improving the monitoring and prediction of algal blooms. In this study, we compared a traditional data imputation method (i.e., linear interpolation) with deep learning models for reconstructing missing spectral bands across four lakes with historical records of algal blooms. The deep learning models adopted include CNN-based architectures (i.e., CNN, Inception Resnet, and Autoencoder) and CNN-LSTM-based architectures (i.e., CNN-LSTM, Resnet-LSTM, and Autoencoder-LSTM). Our results demonstrated that deep learning models substantially outperformed the baseline linear interpolation method in imputing spectral band values within artificially masked regions. Among these models, CNN delivered the best performance across most lakes. Furthermore, we evaluated the performance of algal bloom indices (i.e., Green/Red and NDCI) derived from the imputed imagery by comparing them with the observed data. Our results demonstrate that deep learning models are effective for imputing missing data in PlanetScope SuperDove imagery, enabling more reliable applications in water monitoring.
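The baseline and the two bloom indices named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the linear interpolation runs along the time axis of a per-band image stack, and that NDCI is computed from a red-edge and a red band in the usual normalized-difference form. The function names and array layout are hypothetical.

```python
import numpy as np

def interpolate_time_series(stack: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Fill masked pixels by linear interpolation along the time axis.

    stack: (T, H, W) values for one spectral band.
    mask:  (T, H, W) boolean array, True where the observation is missing.
    Pixels that are missing at every time step are left unchanged.
    """
    filled = stack.astype(float).copy()
    t = np.arange(stack.shape[0])
    for i in range(stack.shape[1]):
        for j in range(stack.shape[2]):
            missing = mask[:, i, j]
            if missing.any() and not missing.all():
                filled[missing, i, j] = np.interp(
                    t[missing], t[~missing], stack[~missing, i, j]
                )
    return filled

def ndci(red_edge, red, eps=1e-12):
    """Normalized difference chlorophyll index (assumed red-edge/red form)."""
    return (red_edge - red) / (red_edge + red + eps)

def green_red(green, red, eps=1e-12):
    """Green/Red band ratio."""
    return green / (red + eps)
```

Computing the indices on both the imputed and the observed bands, then comparing them, mirrors the evaluation described above in spirit only; the masking scheme and error metrics used in the study are not specified here.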


2602.10635 2026-06-17 cs.AI cs.LG (updated version)

OmniSapiens: A Foundation Model for Social Behavior Processing via Heterogeneity-Aware Relative Policy Optimization

Keane Ong, Sabri Boughorbel, Luwei Xiao, Chanakya Ekbote, Wei Dai, Ao Qu, Jingyao Wu, Rui Mao, Ehsan Hoque, Erik Cambria, Gianmarco Mengaldo, Paul Pu Liang

Affiliations: Massachusetts Institute of Technology; National University of Singapore; Nanyang Technological University; Prince Sattam bin Abdulaziz University; University of Rochester

AI summary: To address the training imbalance caused by heterogeneous behavioral data, the authors propose Omnisapiens-7B 2.0, a foundation model trained with Heterogeneity-Aware Relative Policy Optimization (HARPO), which achieves the best performance on 10 behavioral tasks and 5 zero-shot generalization benchmarks.

Comments Accepted to ICML 2026 Main Conference

Abstract

Socially intelligent AI systems must reason across diverse human behavioral tasks and generalize to new social contexts. However, behavioral data is inherently heterogeneous, comprising diverse modalities and prediction targets that produce uneven training signals across samples, creating imbalanced learning dynamics that challenge existing AI models. To address this, we develop Omnisapiens-7B 2.0, a foundation model for social behavior processing that explicitly addresses learning from heterogeneous behavioral data. This is enabled through Heterogeneity-Aware Relative Policy Optimization, a new RL method that rebalances learning signals across samples by approximating each sample's contribution to the policy update and using these estimates to drive geometrically centered, inertially smoothed advantage modulation for stable training. Omnisapiens-7B 2.0 achieves the best and most consistent performance across 10 behavioral tasks, while also attaining the best performance on all five held-out benchmarks, with gains of up to +12.02% and +9.37% respectively. Furthermore, it demonstrates more consistent and interpretable reasoning traces, supporting reliable real-world behavioral applications. Our model is available at https://github.com/MIT-MI/human_behavior_atlas.


2605.23176 2026-06-17 cs.CV (updated version)

DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving

Hao Vo, Khoa Vo, Phu Loc Nguyen, Sieu Tran, Duc Minh Nguyen, Ngo Xuan Cuong, Gladys Gawugah, Sreevenkata Anjani Tishita Godavarthi, Chase Rainwater, Nghi D. Q. Bui, Anh Nguyen, Duy Minh Ho Nguyen, Ngan Le

Affiliations: University of Arkansas, USA; Google Research, Google; University of Liverpool, UK; Max Planck Research School for Intelligent Systems

AI summary: Proposes the DriveSpatial benchmark, which uses multi-view spatiotemporal reasoning tasks to evaluate vision-language models on scene construction, relational understanding, temporal reasoning, and generalization in autonomous driving, and finds a substantial gap between humans and models.

Abstract

Spatiotemporal intelligence in autonomous driving (AD) requires an agent to integrate multi-view observations into a coherent scene representation, maintain object continuity across viewpoints and time, and reason about spatial relations, interactions, and future dynamics. However, existing AD vision-language benchmarks largely focus on single-view, static, ego-centric, or single-source question answering, leaving it unclear whether current Vision-Language Models (VLMs) can truly construct and reason over dynamic driving scenes. We introduce DriveSpatial, a benchmark of 15.6K human-verified QA pairs across 20 tasks from five large-scale AD datasets. DriveSpatial evaluates four abilities: Cognitive Scene Construction, Multi-view Relational Understanding, Temporal Reasoning, and Generalization. Unlike prior benchmarks, DriveSpatial is generated from a dynamic multi-relational scene graph that encodes object states, spatial relations, interactions, camera visibility, and temporal correspondences, enabling QA pairs that enforce genuine cross-view and spatiotemporal reasoning. Evaluating 15 representative VLMs reveals a substantial human-model gap: the strongest model trails humans by 28.4 points, with Cognitive Scene Construction emerging as the key bottleneck. Further diagnostics show that language-only prompting is insufficient, while explicit BEV grounding consistently improves performance. These results suggest that current VLMs lack the scene-construction ability needed for reliable spatiotemporal driving intelligence. DriveSpatial and its construction pipeline will be released to support future research.


2605.21135 2026-06-17 cs.CL (updated version)

Smarter edits? Post-editing with error highlights and translation suggestions

Fleur V. J. van Tellingen, Gautam Ranka, Dora Žugčić, Joyce van der Wal, Andrea Camasta, Livio Guerra, Alina Karakanta

Affiliations: Leiden University Centre for Linguistics; Visvesvaraya National Institute of Technology; Department of Bionanoscience, Faculty of Applied Sciences, Delft University of Technology; Pedagogical Sciences, Leiden University; Faculty of Science, Leiden University

AI summary: This paper studies how effective error highlights and correction suggestions based on automatic post-editing (APE) are in post-editing tasks. They did not improve productivity or quality, but APE highlights and correction suggestions improved the user experience.

Comments Accepted at EAMT 2026

2605.20708 2026-06-17 cs.CV cs.AI (updated version)

Rethinking Cross-Layer Information Routing in Diffusion Transformers

Chao Xu, Maohua Li, Qirui Li, Yixuan Xu, Yanke Zhou, Yunhe Li, Cuifeng Shen, Hanlin Tang, Kan Liu, Tao Lan, Lin Qu, Shao-Qun Zhang

Affiliations: Nanjing University; Alibaba Group; Zhejiang University; City University of Hong Kong

AI summary: This paper studies cross-layer information flow in Diffusion Transformers. Through systematic empirical analysis it identifies three concrete symptoms of traditional residual addition and proposes Diffusion-Adaptive Routing (DAR), which aggregates sublayer outputs in a learnable, timestep-adaptive, and non-incremental way to improve model performance.

Abstract

Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design (tokenization, attention, conditioning, objectives, and latent autoencoders) has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been directly inherited from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition, namely monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs. Moreover, the proposed DAR is compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet 256×256, DAR improves SiT-XL/2 by 2.11 FID (7.56 vs. 9.67) and matches the baseline's converged quality with 8.75× fewer training iterations. Stacked on top of REPA, it yields a 2× training acceleration in the early stage, suggesting cross-layer information routing as an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, DAR can also be applied during the fine-tuning stage of large-scale T2I models and preserves high-frequency details during Distribution Matching Distillation.


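2510.21583 2026-06-17 cs.CV cs.AI (updated version)

Principled RL for Flow Matching Emerges from the Chunk-level Policy Optimization

Yifu Luo, Haoyuan Sun, Xinhao Hu, Penghui Du, Keyu Fan, Bo Li, Sinan Du, Xu Wan, Zhiyu Chen, Bo Xia, Yongzhe Chang, Changqian Yu, Kun Gai, Tiantian Zhang, Xueqian Wang

Affiliations: GitHub

AI summary: This paper proposes GCPO, a reinforcement learning method for flow matching based on chunk-level policy optimization. By grouping consecutive steps into coherent chunks and shifting the level at which the policy is optimized, it mitigates inaccurate advantage attribution, and experiments show that it outperforms existing methods on text-to-image generation.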

Comments ICML 2026

2605.15980 2026-06-17 cs.CV (updated version)

Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization

Xiaoxuan He, Siming Fu, Zeyue Xue, Weijie Wang, Ruizhe He, Yuming Li, Dacheng Yin, Shuai Dong, Haoyang Huang, Hongfa Wang, Nan Duan, Bohan Zhuang

Affiliations: Zhejiang University; Joy Future Academy; Independent Researcher; Tsinghua University

AI summary: Proposes Flash-GRPO, a single-step training framework that uses iso-temporal grouping and temporal gradient rectification to remove the computational bottleneck, achieving better alignment quality and training efficiency than full-trajectory training under low compute budgets.

Abstract

Group Relative Policy Optimization has emerged as essential for aligning video diffusion models with human preferences, but faces a critical computational bottleneck: training a 14B-parameter model typically demands hundreds of GPU days per experiment. Existing efficiency methods reduce costs through sliding-window subsampling of training timesteps, but fundamentally compromise optimization, exhibiting severe instability and failing to reach full-trajectory performance. We present Flash-GRPO, a single-step training framework that outperforms full-trajectory training in alignment quality under low computational budgets while substantially improving training efficiency. Flash-GRPO addresses two critical challenges: iso-temporal grouping eliminates timestep-confounded variance by enforcing prompt-wise temporal consistency, decoupling policy performance from timestep difficulty; temporal gradient rectification neutralizes the time-dependent scaling factor that causes vastly inconsistent gradient magnitudes across timesteps. Experiments on 1.3B to 14B parameter models validate Flash-GRPO's effectiveness, demonstrating substantial training acceleration with consistent stability and state-of-the-art alignment quality.
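One way to read "prompt-wise temporal consistency" is that every sample in a prompt's group is trained at the same denoising timestep, so reward differences inside the group are not confounded by timestep difficulty. The sketch below pairs that reading with the standard group-relative advantage normalization used in GRPO. It is an interpretation under those assumptions, not the Flash-GRPO implementation; both function names are hypothetical and the temporal gradient rectification step is not modeled.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize rewards within each prompt group.

    rewards: (num_prompts, group_size) array of scalar rewards.
    Returns advantages with zero mean inside every group.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

def sample_iso_temporal_timesteps(num_prompts, group_size, num_timesteps, rng):
    """Draw one timestep per prompt and share it across that prompt's group."""
    per_prompt = rng.integers(0, num_timesteps, size=num_prompts)
    return np.repeat(per_prompt[:, None], group_size, axis=1)
```

With independent timesteps per sample, two rollouts of the same prompt could differ in reward simply because one was trained at an easier timestep; sharing the timestep removes that source of variance from the within-group comparison.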


2506.13127 2026-06-17 cs.SD eess.AS (updated version)

Leveraging Local and Global Knowledge Integration with Time-Frequency Calibrated Distillation for Speech Enhancement

Jiaming Cheng, Ruiyu Liang, Ye Ni, Chao Xu, Jing Li, Wei Zhou, Rui Liu, Björn W. Schuller, Xiaoshuai Hao

Affiliations: School of Computer Science, Nanjing Audit University; School of Communication Engineering, Nanjing Institute of Technology; School of Information Science and Engineering, Southeast University; Cardiff University; Inner Mongolia University; CHI – the Chair of Health Informatics, TUM University Hospital; GLAM – the Group on Language, Audio, & Music, Imperial College London; Xiaomi EV

AI summary: This paper proposes a fusion framework that improves speech enhancement through time-frequency calibrated knowledge distillation, combining local information focusing with global knowledge circulation to improve a low-complexity student model.

Comments submitted to IEEE Transactions on Cognitive and Developmental Systems

Abstract

In this paper, we propose an intra-set and inter-set recursive fusion framework with time-frequency calibrated knowledge distillation (I$^2$SRF-TFCKD) for SE. Different from previous distillation strategies for SE, the proposed framework fully exploits the time-frequency differential information of speech while facilitating both local information focusing and global knowledge circulation. Firstly, we construct a collaborative distillation paradigm for intra-set and inter-set correlations. Within a correlated set, multi-layer teacher-student features are pairwise matched for calibrated distillation. Subsequently, we generate representative features from each correlated set through recursive fusion to form the fused feature set that enables inter-set knowledge interaction. Secondly, we propose a multi-layer interactive distillation based on dual-stream time-frequency cross-calibration, which calculates the teacher-student similarity calibration weights in the time and frequency domains respectively and performs cross-weighting, thus enabling refined allocation of distillation contributions across different layers according to speech characteristics. The proposed distillation strategy is applied to the dual-path dilated convolutional recurrent network (DPDCRN) that ranked first in the SE track of the L3DAS23 challenge. To evaluate the effectiveness of I$^2$SRF-TFCKD, we conduct experiments on both single-channel and multi-channel SE datasets. Objective evaluations demonstrate that the proposed KD strategy consistently and effectively improves the performance of the low-complexity student model and outperforms other distillation schemes.


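2605.12646 2026-06-17 cs.LG cs.AI cs.HC (updated version)

Learning to Decide with AI Assistance under Human-Alignment

Nina Corvelo Benz, Eleni Straitouri, Manuel Gomez-Rodriguez

Affiliations: GitHub

AI summary: This paper studies how AI models assist decision-makers in high-stakes domains by predicting outcomes, and examines how the degree of alignment between the AI's confidence and the decision-maker's own confidence affects the complexity of learning to decide.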

Abstract

It is widely agreed that when AI models assist decision-makers in high-stakes domains by predicting an outcome of interest, they should communicate the confidence of their predictions. However, empirical evidence suggests that decision-makers often struggle to determine when to trust a prediction based solely on this communicated confidence. In this context, recent theoretical and empirical work suggests a positive correlation between the utility of AI-assisted decision-making and the degree of alignment between the AI confidence and the decision-makers' confidence in their own predictions. Crucially, these findings do not yet elucidate the extent to which this alignment influences the complexity of learning to make optimal decisions through repeated interactions. In this paper, we address this question in the canonical case of binary predictions and binary decisions. We first show that this problem is equivalent to a two-armed online contextual learning problem with full feedback, and establish a lower bound of $\Omega(\sqrt{|H| \cdot |B| \cdot T})$ on the expected regret any learner can attain, where $H$ and $B$ denote the sets of human and AI confidence values. We then demonstrate that, under perfect alignment between AI and human confidence, a learner can attain an expected regret of $O(\sqrt{|H| \cdot T\log T})$ and, when $\sqrt{|H|} = O(\log T)$ and $B$ is countable, a non-trivial generalization of the Dvoretzky-Kiefer-Wolfowitz inequality improves the regret bound to $O(\sqrt{T\log T})$. Taken together, these results reveal that alignment can reduce the complexity of learning to make decisions with AI assistance. Experiments on real data from two different human-subject studies where participants solve simple decision-making tasks assisted by AI models show that our theoretical results are robust to violations of perfect alignment.
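To get a feel for the two rates quoted above, the following sketch evaluates them with all hidden constants set to 1. That is an assumption made purely for illustration: asymptotic bounds say nothing about constants, so the numbers indicate scaling only, and the example values for |H|, |B|, and T are hypothetical.

```python
import math

def general_lower_bound(num_human: int, num_ai: int, horizon: int) -> float:
    """sqrt(|H| * |B| * T): rate of the lower bound without alignment."""
    return math.sqrt(num_human * num_ai * horizon)

def aligned_upper_bound(num_human: int, horizon: int) -> float:
    """sqrt(|H| * T * log T): rate of the upper bound under perfect alignment."""
    return math.sqrt(num_human * horizon * math.log(horizon))

def rate_ratio(num_ai: int, horizon: int) -> float:
    """Ratio of the two rates, which simplifies to sqrt(|B| / log T)."""
    return math.sqrt(num_ai / math.log(horizon))

# Hypothetical sizes: 10 human confidence levels, 100 AI confidence levels.
lower = general_lower_bound(10, 100, 10_000)
upper = aligned_upper_bound(10, 10_000)
```

The ratio depends only on |B| and log T, which shows where the benefit of alignment comes from in these rates: the dependence on the number of AI confidence values is replaced by a logarithmic factor in the horizon.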


2605.12227 2026-06-17 cs.CL (updated version)

A Recipe for Long-Context Reasoning in Large Language Models via On-Policy Optimization and Distillation

Miguel Moura Ramos, Duarte M. Alves, André F. T. Martins

Affiliations: Instituto Superior Técnico, Universidade de Lisboa; Instituto de Telecomunicações; TransPerfect

AI summary: This paper proposes Distilled Group Relative Policy Optimization (dGRPO), which combines on-policy optimization with distillation: a strong teacher model provides dense guidance, improving long-context reasoning while preserving short-context capabilities.

Abstract

Existing approaches to post-train models for long-context tasks face complementary limitations: (i) supervised fine-tuning (SFT) provides stable supervision but suffers from exposure bias; (ii) reinforcement learning methods such as Group Relative Policy Optimization (GRPO) train on model-generated trajectories but struggle with long-horizon credit assignment and sparse rewards; and (iii) on-policy distillation (OPD) provides dense token-level guidance but does not directly optimize task rewards. We study these complementary strategies for long-context alignment and derive a recipe that combines GRPO with OPD-style teacher guidance: the student learns from its own rollouts using outcome-level rewards, while a stronger teacher provides dense token-level regularization in place of the standard reference policy. This is especially useful when process-level supervision is difficult to obtain. To support this study, we introduce LongBlocks, a synthetic multilingual dataset spanning multi-hop reasoning, contextual grounding, and long-form generation. Through controlled ablations, we isolate the roles of cold-start initialization, teacher anchoring, and data mixing, showing that our recipe yields a more stable and effective path to long-context reasoning than GRPO or OPD while preserving short-context capabilities.
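The combination described above, outcome-level rewards plus dense token-level regularization toward a teacher, can be sketched for a single rollout as below. This is a simplified reading, not the paper's objective: it omits importance ratios and clipping, assumes the regularizer is a per-token KL divergence from student to teacher, and uses a hypothetical weight `beta`.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def teacher_anchored_loss(student_logits, teacher_logits, tokens, advantage, beta=0.1):
    """Policy-gradient term plus a token-level KL penalty toward the teacher.

    student_logits, teacher_logits: (seq_len, vocab) logits on the student's rollout.
    tokens: (seq_len,) token ids sampled by the student.
    advantage: scalar outcome-level advantage for this rollout.
    """
    s_logp = log_softmax(student_logits)
    t_logp = log_softmax(teacher_logits)
    token_logp = s_logp[np.arange(len(tokens)), tokens]
    policy_term = -advantage * token_logp.mean()
    # KL(student || teacher), averaged over positions.
    kl = (np.exp(s_logp) * (s_logp - t_logp)).sum(axis=-1).mean()
    return policy_term + beta * kl
```

In a standard GRPO setup the KL term would anchor the student to a frozen reference copy of itself; replacing that reference with a stronger teacher is what turns the regularizer into a source of token-level guidance.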


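2605.07971 2026-06-17 cs.CV cs.LG (updated version)

DVD: Discrete Voxel Diffusion for 3D Generation and Editing

Zhengrui Xiang, Jiaqi Wu, Fupeng Sun, Heliang Zheng, Yingzhen Li

Affiliations: Imperial College London; Math Magic; Hitem3D

AI summary: Proposes Discrete Voxel Diffusion (DVD), a framework that treats voxel occupancy as a discrete variable to support sparse voxel generation, uncertainty estimation, and editing, avoiding continuous-to-discrete thresholding and providing interpretable generation dynamics.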

Abstract

We introduce Discrete Voxel Diffusion (DVD), a discrete diffusion framework to generate, assess, and edit sparse voxels for SLat (Structured LATent) based 3D generative pipelines. Although discrete diffusion has not generally displaced continuous diffusion in image-like generation, we show that it can be an effective first-stage prior for sparse voxel scaffolds. By treating voxel occupancy as a native discrete variable, DVD avoids continuous-to-discrete thresholding and provides a simple framework for voxel generation, uncertainty estimation, and editing. Beyond quality gains, DVD provides more interpretable generation dynamics through explicit categorical modeling. Furthermore, we leverage the predictive entropy as a robust uncertainty metric to identify ambiguous voxel regions and complicated samples, facilitating tasks such as data filtering and quality assessment. Finally, we propose a lightweight fine-tuning strategy using block-structured perturbation patterns. This approach empowers the model to inpaint and edit voxels within a single sampling round, requiring negligible auxiliary computation and no additional model evaluations. Code is available at https://github.com/TeCai/DVD.
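Predictive entropy as an uncertainty signal is easy to illustrate for the two-state case. The sketch below assumes each voxel carries a predicted occupancy probability and flags voxels whose binary entropy exceeds a threshold; the threshold value and function names are hypothetical, and DVD's actual categorical state space may differ.

```python
import numpy as np

def occupancy_entropy(p: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Binary entropy (in nats) of per-voxel occupancy probabilities."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def ambiguous_voxels(p: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Boolean mask of voxels whose entropy exceeds the threshold."""
    return occupancy_entropy(p) > threshold
```

Entropy peaks at ln 2 when the predicted occupancy is 0.5 and falls toward zero as the prediction becomes confident, so averaging it over a sample gives a simple score for filtering or ranking.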


2512.09373 2026-06-17 cs.CV (updated version)

FUSER: Feed-Forward MUltiview 3D Registration Transformer and SE(3)$^N$ Diffusion Refinement

Haobo Jiang, Jin Xie, Jian Yang, Liang Yu, Jianmin Zheng

Affiliations: Nanyang Technological University; Alibaba Group; Nanjing University

AI summary: Proposes FUSER, the first feed-forward multiview registration transformer, which predicts global poses directly in a unified latent space and avoids pairwise matching, and introduces FUSER-DF, an SE(3)^N diffusion refinement framework that corrects its estimates.

Comments Accepted to CVPR 2026 (Oral)

Abstract

Registration of multiview point clouds conventionally relies on extensive pairwise matching to build a pose graph for global synchronization, which is computationally expensive and inherently ill-posed without holistic geometric constraints. This paper proposes FUSER, the first feed-forward multiview registration transformer that jointly processes all scans in a unified, compact latent space to directly predict global poses without any pairwise estimation. To maintain tractability, FUSER encodes each scan into low-resolution superpoint features via a sparse 3D CNN that preserves absolute translation cues, and performs efficient intra- and inter-scan reasoning through a Geometric Alternating Attention module. In particular, we transfer 2D attention priors from off-the-shelf foundation models to enhance 3D feature interaction and geometric consistency. Building upon FUSER, we further introduce FUSER-DF, an SE(3)$^N$ diffusion refinement framework to correct FUSER's estimates via denoising in the joint SE(3)$^N$ space. FUSER acts as a surrogate multiview registration model to construct the denoiser, and a prior-conditioned SE(3)$^N$ variational lower bound is derived for denoising supervision. Extensive experiments on 3DMatch, ScanNet and ARKitScenes demonstrate that our approach achieves superior registration accuracy and outstanding computational efficiency.
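What "directly predict global poses" buys downstream can be shown with a few lines: once each of the N scans has its own SE(3) pose, registration reduces to applying one rigid transform per scan and merging. The sketch assumes poses are 4×4 homogeneous matrices mapping scan coordinates into a common frame; the helper names are hypothetical and the pose prediction itself is not modeled.

```python
import numpy as np

def make_pose(rotation_z_rad: float, translation) -> np.ndarray:
    """Build a 4x4 SE(3) matrix from a rotation about z and a translation."""
    c, s = np.cos(rotation_z_rad), np.sin(rotation_z_rad)
    pose = np.eye(4)
    pose[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    pose[:3, 3] = translation
    return pose

def register_scans(scans, poses) -> np.ndarray:
    """Apply one global pose per scan and stack the results.

    scans: list of (num_points, 3) arrays, each in its own sensor frame.
    poses: list of 4x4 matrices, one per scan (an element of SE(3)^N).
    """
    merged = []
    for points, pose in zip(scans, poses):
        merged.append(points @ pose[:3, :3].T + pose[:3, 3])
    return np.vstack(merged)
```

A pairwise pipeline would instead estimate relative transforms between scan pairs and then solve a synchronization problem to recover these N global poses.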


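2604.22748 2026-06-17 cs.AI (updated version)

Beyond MACs: Hardware-Efficient Architecture Design for Vision Backbones

Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni

Affiliations: Machine Learning and Perception Lab, University of Udine; Centre for Vision Research, York University

AI summary: To address the shortcomings of the MACs metric on edge devices, proposes the LowFormer backbone family, built on hardware-efficiency insights, which achieves significant speed-ups with the lightweight Lowtention module.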

Comments Accepted at International Journal of Computer Vision (IJCV)

Journal ref Int J Comput Vis 134, 295 (2026)

DOI: 10.1007/s11263-026-02873-5

Abstract

Vision backbone networks play a central role in modern computer vision. Enhancing their efficiency directly benefits a wide range of downstream applications. To measure efficiency, many publications rely on MACs (Multiply Accumulate operations) as a predictor of execution time. In this paper, we experimentally demonstrate the shortcomings of such a metric, especially in the context of edge devices. By contrasting the MAC count and execution time of common architectural design elements, we identify key factors for efficient execution and provide insights to optimize backbone design. Based on these insights, we present LowFormer, a novel vision backbone family. LowFormer features a streamlined macro and micro design that includes Lowtention, a lightweight alternative to Multi-Head Self-Attention. Lowtention not only proves more efficient, but also enables superior results on ImageNet. Additionally, we present an edge GPU version of LowFormer, that can further improve upon its baseline's speed on edge GPU and desktop GPU. We demonstrate LowFormer's wide applicability by evaluating it on smaller image classification datasets, as well as adapting it to several downstream tasks, such as object detection, semantic segmentation, image retrieval, and visual object tracking. LowFormer models consistently achieve remarkable speed-ups across various hardware platforms compared to recent state-of-the-art backbones. Code and models are available at https://github.com/altair199797/LowFormer/blob/main/Beyond_MACs.md.
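For reference, the MAC count that the abstract argues is a poor latency predictor is computed as below for 2D convolutions. This sketch covers only the counting side, with hypothetical layer sizes; it makes no claim about measured execution time, which is what the paper actually compares against.

```python
def conv2d_macs(h_out: int, w_out: int, c_in: int, c_out: int,
                kernel: int, groups: int = 1) -> int:
    """Multiply-accumulate operations of one 2D convolution (bias ignored)."""
    return h_out * w_out * c_out * (c_in // groups) * kernel * kernel

# Hypothetical layer: 56x56 output, 64 input and 64 output channels.
standard = conv2d_macs(56, 56, 64, 64, kernel=3)
depthwise = conv2d_macs(56, 56, 64, 64, kernel=3, groups=64)
pointwise = conv2d_macs(56, 56, 64, 64, kernel=1)
separable = depthwise + pointwise
```

Here the depthwise-separable pair needs roughly 8× fewer MACs than the standard 3×3 convolution. Whether it runs 8× faster on a given device is a separate, empirical question.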


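2603.25937 2026-06-17 cs.RO cs.LG (updated version)

FlowLet: Conditional 3D Brain MRI Synthesis Using Wavelet Flow Matching

Danilo Danese, Angela Lombardi, Matteo Attimonelli, Giuseppe Fasano, Tommaso Di Noia

Affiliations: Politecnico di Bari; Sapienza University of Rome

AI summary: Proposes the FlowLet framework, which uses flow matching in an invertible 3D wavelet domain to generate age-conditioned 3D brain MRIs, avoiding reconstruction artifacts and reducing computational demands. Experiments show that it generates high-fidelity volumes and improves brain age prediction models for underrepresented age groups.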

Comments Accepted at Medical Image Analysis (Elsevier)

DOI: 10.1016/j.media.2026.104161

Abstract

Brain Magnetic Resonance Imaging (MRI) plays a central role in studying neurological development, aging, and diseases. One key application is Brain Age Prediction (BAP), which estimates an individual's biological brain age from MRI data. Effective BAP models require large, diverse, and age-balanced datasets, whereas existing 3D MRI datasets are demographically skewed, limiting fairness and generalizability. Acquiring new data is costly and ethically constrained, motivating generative data augmentation. Current generative methods are often based on latent diffusion models, which operate in learned low dimensional latent spaces to address the memory demands of volumetric MRI data. However, these methods are typically slow at inference, may introduce artifacts due to latent compression, and are rarely conditioned on age, thereby affecting the BAP performance. In this work, we propose FlowLet, a conditional generative framework that synthesizes age-conditioned 3D MRIs by leveraging flow matching within an invertible 3D wavelet domain, helping to avoid reconstruction artifacts and reducing computational demands. Experiments show that FlowLet generates high-fidelity volumes with few sampling steps. Training BAP models with data generated by FlowLet improves performance for underrepresented age groups, and region-based analysis confirms preservation of anatomical structures.
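The appeal of an invertible wavelet domain is that a volume can be mapped to coefficients and back without loss, unlike a learned latent space that needs a lossy decoder. The sketch below shows this with a one-level orthonormal Haar transform applied along each axis of a 3D array. The choice of Haar is an assumption for illustration; the wavelet used by FlowLet is not stated here, and the function names are hypothetical.

```python
import numpy as np

def haar_forward_axis(x: np.ndarray, axis: int) -> np.ndarray:
    """One-level Haar transform along one axis (length must be even)."""
    even = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)
    odd = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)
    approx = (even + odd) / np.sqrt(2.0)
    detail = (even - odd) / np.sqrt(2.0)
    return np.concatenate([approx, detail], axis=axis)

def haar_inverse_axis(y: np.ndarray, axis: int) -> np.ndarray:
    """Invert haar_forward_axis along the same axis."""
    half = y.shape[axis] // 2
    approx = np.take(y, np.arange(0, half), axis=axis)
    detail = np.take(y, np.arange(half, 2 * half), axis=axis)
    out = np.empty_like(y, dtype=float)
    even_idx = [slice(None)] * y.ndim
    odd_idx = [slice(None)] * y.ndim
    even_idx[axis] = slice(0, None, 2)
    odd_idx[axis] = slice(1, None, 2)
    out[tuple(even_idx)] = (approx + detail) / np.sqrt(2.0)
    out[tuple(odd_idx)] = (approx - detail) / np.sqrt(2.0)
    return out

def haar3d(volume: np.ndarray) -> np.ndarray:
    """Apply the one-level transform along all three axes."""
    for axis in range(3):
        volume = haar_forward_axis(volume, axis)
    return volume

def ihaar3d(coeffs: np.ndarray) -> np.ndarray:
    """Undo haar3d by inverting the axes in reverse order."""
    for axis in reversed(range(3)):
        coeffs = haar_inverse_axis(coeffs, axis)
    return coeffs
```

A generative model trained on such coefficients can return to voxel space with the exact inverse transform, so no decoder network sits between the generated sample and the final volume.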


2601.04574 2026-06-17 cs.CL (updated version)

FeedEval: Pedagogically Aligned Evaluation of LLM-Generated Essay Feedback

Seongyeub Chu, Jongwoo Kim, Munyong Yi

Affiliations: Graduate School of Data Science, KAIST; Department of Industrial & Systems Engineering, KAIST

AI summary: Proposes FeedEval, a framework that evaluates LLM-generated essay feedback along three pedagogical dimensions (specificity, helpfulness, and validity) and uses specialized evaluators to select high-quality feedback, improving downstream scoring and revision.

Abstract

Going beyond the prediction of numerical scores, recent research in automated essay scoring has increasingly emphasized the generation of high-quality feedback that provides justification and actionable guidance. To mitigate the high cost of expert annotation, prior work has commonly relied on LLM-generated feedback to train essay assessment models. However, such feedback is often incorporated without explicit quality validation, resulting in the propagation of noise in downstream applications. To address this limitation, we propose FeedEval, an LLM-based framework for evaluating LLM-generated essay feedback along three pedagogically grounded dimensions: specificity, helpfulness, and validity. FeedEval employs dimension-specialized LLM evaluators trained on datasets curated in this study to assess multiple feedback candidates and select high-quality feedback for downstream use. Experiments on the ASAP++ benchmark show that FeedEval closely aligns with human expert judgments and that essay scoring models trained with FeedEval-filtered high-quality feedback achieve superior scoring performance. Furthermore, revision experiments using small LLMs show that the high-quality feedback identified by FeedEval leads to more effective essay revisions. We release our code and curated datasets at: https://github.com/BBeeChu/FeedEval.git.

2505.23939 2026-06-17 cs.LG cs.NI (updated version)

Searching Neural Architectures for Sensor Nodes on IoT Gateways

Andrea Mattia Garavagno, Edoardo Ragusa, Antonio Frisoli, Paolo Gastaldo

Affiliations: University of Genoa

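AI summary: Proposes a method for automatically designing neural networks on IoT gateways that preserves data privacy, finding an architecture with state-of-the-art results in a search of under 10 hours on a Raspberry Pi Zero 2.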

Journal ref IEEE Internet of Things Journal, vol. 12, no. 21, pp. 44492-44501, 2025

DOI: 10.1109/JIOT.2025.3581442

Abstract

This paper presents an automatic method for the design of Neural Networks (NNs) at the edge, enabling Machine Learning (ML) access even in privacy-sensitive Internet of Things (IoT) applications. The proposed method runs on IoT gateways and designs NNs for connected sensor nodes without sharing the collected data outside the local network, keeping the data at the site of collection. This approach has the potential to enable ML for Healthcare Internet of Things (HIoT) and Industrial Internet of Things (IIoT), designing hardware-friendly and custom NNs at the edge for personalized healthcare and advanced industrial services such as quality control, predictive maintenance, or fault diagnosis. By preventing data from being disclosed to cloud services, this method safeguards sensitive information, including industrial secrets and personal data. The outcomes of a thorough experimental session confirm that, on the Visual Wake Words dataset, the proposed approach can achieve state-of-the-art results by exploiting a search procedure that runs in less than 10 hours on the Raspberry Pi Zero 2.

2506.05797 2026-06-17 cs.LG cs.CE cs.RO (updated version)

EqCollide: Equivariant and Collision-Aware Deformable Objects Neural Simulator

Qianyi Chen, Tianrun Gao, Chenbo Jiang, Tailin Wu

Affiliations: Westlake University; Fudan University; Tongji University; McGill University

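AI summary: Proposes EqCollide, the first end-to-end equivariant neural fields simulator, which combines an equivariant encoder with a graph neural network-based neural ODE using collision-aware message passing to simulate deformable-object collisions accurately, stably, and scalably.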

Comments SIGKDD 2026 Oral AI4S Track. 20 pages, 16 figures

Abstract

Simulating collisions of deformable objects is a fundamental yet challenging task due to the complexity of modeling solid mechanics and multi-body interactions. Existing data-driven methods often suffer from lack of equivariance to physical symmetries, inadequate handling of collisions, and limited scalability. Here we introduce EqCollide, the first end-to-end equivariant neural fields simulator for deformable objects and their collisions. We propose an equivariant encoder to map object geometry and velocity into latent control points. A subsequent equivariant Graph Neural Network-based Neural Ordinary Differential Equation models the interactions among control points via collision-aware message passing. To reconstruct velocity fields, we query a neural field conditioned on control point features, enabling continuous and resolution-independent motion predictions. Experimental results on 2D and 3D scenarios show that EqCollide achieves accurate, stable, and scalable simulations across diverse object configurations. It achieves $24.34\%$ to $57.62\%$ lower rollout MSE, even compared with the best-performing baseline model. Furthermore, EqCollide could generalize to more colliding objects and extended temporal horizons, and stay robust to input transformed with group action. Code is available at: https://github.com/AI4Science-WestlakeU/EqCollide

2504.14582 2026-06-17 cs.CV (updated version)

NTIRE 2025 Challenge on Image Super-Resolution (x4): Methods and Results

Zheng Chen, Kai Liu, Jue Gong, Jingkai Wang, Lei Sun, Zongwei Wu, Radu Timofte, Yulun Zhang, Xiangyu Kong, Xiaoxuan Yu, Hyunhee Park, Suejin Han, Hakjae Jeon, Dafeng Zhang, Hyung-Ju Chun, Donghun Ryou, Inju Ha, Bohyung Han, Lu Zhao, Yuyi Zhang, Pengyu Yan, Jiawei Hu, Pengwei Liu, Fengjun Guo, Hongyuan Yu, Pufan Xu, Zhijuan Huang, Shuyuan Cui, Peng Guo, Jiahui Liu, Dongkai Zhang, Heng Zhang, Huiyuan Fu, Huadong Ma, Yanhui Guo, Sisi Tian, Xin Liu, Jinwen Liang, Jie Liu, Jie Tang, Gangshan Wu, Zeyu Xiao, Zhuoyuan Li, Yinxiang Zhang, Wenxuan Cai, Vijayalaxmi Ashok Aralikatti, Nikhil Akalwadi, G Gyaneshwar Rao, Chaitra Desai, Ramesh Ashok Tabib, Uma Mudenagudi, Marcos V. Conde, Alejandro Merino, Bruno Longarela, Javier Abad, Weijun Yuan, Zhan Li, Zhanglu Chen, Boyang Yao, Aagam Jain, Milan Kumar Singh, Ankit Kumar, Shubh Kawa, Divyavardhan Singh, Anjali Sarvaiya, Kishor Upla, Raghavendra Ramachandra, Chia-Ming Lee, Yu-Fan Lin, Chih-Chung Hsu, Risheek V Hiremath, Yashaswini Palani, Yuxuan Jiang, Qiang Zhu, Siyue Teng, Fan Zhang, Shuyuan Zhu, Bing Zeng, David Bull, Jingwei Liao, Yuqing Yang, Wenda Shao, Junyi Zhao, Qisheng Xu, Kele Xu, Sunder Ali Khowaja, Ik Hyun Lee, Snehal Singh Tomar, Rajarshi Ray, Klaus Mueller, Sachin Chaudhary, Surya Vashisth, Akshay Dudhane, Praful Hambarde, Satya Naryan Tazi, Prashant Patil, Santosh Kumar Vipparthi, Subrahmanyam Murala, Bilel Benjdira, Anas M. Ali, Wadii Boulila, Zahra Moammeri, Ahmad Mahmoudi-Aznaveh, Ali Karbasi, Hossein Motamednia, Liangyan Li, Guanhua Zhao, Kevin Le, Yimo Ning, Haoxuan Huang, Jun Chen

Affiliations: CVPR 2025

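AI summary: Presents the NTIRE 2025 image super-resolution (×4) challenge, which has restoration and perceptual sub-tracks, and summarizes the challenge design, datasets, evaluation protocol, and the methods submitted by 25 teams.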

Comments NTIRE 2025 webpage: https://www.cvlai.net/ntire/2025. Code: https://github.com/zhengchen1999/NTIRE2025_ImageSR_x4

Journal ref Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025, pp. 1525-1535

DOI: 10.1109/CVPRW67362.2025.00141

Abstract

This paper presents the NTIRE 2025 image super-resolution ($\times$4) challenge, one of the associated competitions of the 10th NTIRE Workshop at CVPR 2025. The challenge aims to recover high-resolution (HR) images from low-resolution (LR) counterparts generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective network designs or solutions that achieve state-of-the-art SR performance. To reflect the dual objectives of image SR research, the challenge includes two sub-tracks: (1) a restoration track, which emphasizes pixel-wise accuracy and ranks submissions based on PSNR; (2) a perceptual track, which focuses on visual realism and ranks results by a perceptual score. A total of 286 participants registered for the competition, with 25 teams submitting valid entries. This report summarizes the challenge design, datasets, evaluation protocol, the main results, and methods of each team. The challenge serves as a benchmark to advance the state of the art and foster progress in image SR.
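
The restoration track ranks submissions by PSNR. As a reminder of what that metric measures, here is a minimal sketch, assuming 8-bit single-channel images stored as nested lists. The function name `psnr` and the `max_value` parameter are illustrative, and the challenge's official evaluation script may differ, for example in border cropping or colour-channel handling.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    flat_ref = [pixel for row in reference for pixel in row]
    flat_est = [pixel for row in estimate for pixel in row]
    if len(flat_ref) != len(flat_est) or not flat_ref:
        raise ValueError("images must be non-empty and have the same size")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(flat_ref, flat_est)) / len(flat_ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    hr = [[100, 100], [100, 100]]   # toy ground-truth image
    sr = [[110, 90], [100, 100]]    # toy super-resolved output
    print(f"PSNR: {psnr(hr, sr):.2f} dB")
```

Higher values mean the restored image is closer to the ground truth pixel by pixel, which is why the abstract pairs this metric with a separate perceptual score for visual realism.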

2504.03991 2026-06-17 cs.CL cs.AI cs.HC cs.MA (updated version)

Algorithmic Prompt Generation for Diverse Human-like Teaming and Communication with Large Language Models

Siddharth Srikanth, Varun Bhatt, Boshen Zhang, Werner Hager, Charles Michael Lewis, Katia P. Sycara, Aaquib Tabrez, Stefanos Nikolaidis

Affiliations: Thomas Lord Department of Computer Science, University of Southern California; School of Computing and Information, University of Pittsburgh; Robotics Institute, Carnegie Mellon University; Sibley School of Mechanical and Aerospace Engineering, Cornell University

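AI summary: Combines Quality Diversity optimization with LLM-powered agents to automatically search for prompts that generate diverse team behavior, capturing human collaboration and communication strategies, and validates the human-likeness of the generated behaviors through a user study.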

Abstract

Understanding how humans collaborate and communicate in teams is essential for improving human-agent teaming and AI-assisted decision-making. However, relying solely on data from large-scale user studies is impractical due to logistical, ethical, and practical constraints, necessitating synthetic models of multiple diverse human behaviors. Recently, agents powered by Large Language Models (LLMs) have been shown to emulate human-like behavior in social settings. But obtaining a large set of diverse behaviors requires manual effort in the form of designing prompts. On the other hand, Quality Diversity (QD) optimization has been shown to be capable of generating diverse Reinforcement Learning (RL) agent behavior. In this work, we combine QD optimization with LLM-powered agents to iteratively search for prompts that generate diverse team behavior in a long-horizon, multi-step collaborative environment. We first show, through a human-subjects experiment, that humans exhibit diverse coordination and communication behavior in this domain. We then present a series of experiments showing that our approach captures behaviors that are difficult to observe without large-scale data collection, and a follow-up user study showing that these generated behaviors are human-like. Our findings highlight the combination of QD and LLM-powered agents as an effective tool for studying teaming and communication strategies in multi-agent collaboration.

2401.14381 2026-06-17 cs.LG math.DG (updated version)

Manifold GCN: Diffusion-based Convolutional Neural Network for Manifold-valued Graphs

Martin Hanik, Gabriele Steidl, Christoph von Tycowicz

Affiliations: BIFOLD (Berlin Institute for the Foundations of Learning and Data); Technical University Berlin; Zuse Institute Berlin

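AI summary: Proposes two graph neural network layers for graphs with features in a Riemannian manifold: a diffusion layer based on a manifold-valued graph diffusion equation, and a tangent multilayer perceptron inspired by vector neurons. Both are equivariant under node permutations and manifold isometries, and they outperform task-specific networks while applying to a broader class of problems.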

Comments Extended ADNI experiment

Journal ref International Journal of Computer Vision, Volume 134, article number 315 (2026)

DOI: 10.1007/s11263-026-02899-9

Abstract

We propose two graph neural network layers for graphs with features in a Riemannian manifold. First, based on a manifold-valued graph diffusion equation, we construct a diffusion layer that can be applied to an arbitrary number of nodes and graph connectivity patterns. Second, we model a tangent multilayer perceptron by transferring ideas from the vector neuron framework to our general setting. Both layers are equivariant under node permutations and the feature manifold's isometries. These properties have led to a beneficial inductive bias in many deep-learning tasks. Furthermore, they enable novel, more flexible feature designs. Numerical examples on synthetic data and an Alzheimer's classification application on triangle meshes of the right hippocampus demonstrate the usefulness of our new layers: While they apply to a much broader class of problems, they outperform task-specific state-of-the-art networks.
