arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.13665 2026-05-14 cs.RO

Robot Squid Game: Quadrupedal Locomotion for Traversing Narrow Tunnels

Amir Hossain Raj, Dibyendu Das, Xuesu Xiao

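AI summary: This paper studies autonomous locomotion of quadruped robots through complex 3D environments such as narrow tunnels. To address the limited adaptability of existing methods to diverse terrain and complex structures, the authors propose a reinforcement learning framework that combines procedural environment generation with policy distillation. Using a teacher-student training paradigm, it transfers the knowledge of expert policies trained on different tunnel structures into a unified policy model. The method requires no complex reward design, improves the robustness and generality of quadruped robots in confined spaces, and its advantages are validated in both simulation and real-world experiments.

Abstract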

Quadruped robots demonstrate exceptional potential for navigating complex terrain in critical applications such as search and rescue missions and infrastructure inspection. However, autonomous traversal of confined 3D environments, including tunnels, caves, and collapsed structures, remains a significant challenge. Existing methods often struggle with rigid gait patterns, limited adaptability to diverse geometries, and reliance on oversimplified environmental assumptions. This paper introduces a Reinforcement Learning (RL) framework that combines procedural environment generation with policy distillation to enable robust locomotion across various tunnel configurations. Our approach leverages a teacher-student training paradigm, where specialized expert policies trained on procedurally generated tunnel geometries transfer their knowledge to a unified student policy. This strategy eliminates the need for complex reward shaping in end-to-end RL training, simplifying the process by breaking down complicated tasks into smaller, more manageable components that are easier for the robot to learn. By synthesizing diverse tunnel structures during training and distilling navigation strategies into a generalizable policy, our method achieves consistent traversal across complex spatial constraints where conventional approaches fail. We demonstrate through both simulation and real-world experiments that our method enables quadruped robots to successfully traverse challenging confined tunnel environments.

2605.13664 2026-05-14 cs.CV physics.optics

HADAR-Based Thermal Infrared Hyperspectral Image Restoration

Cheng Dai, Jiale Lin, Bingxuan Song, Yifei Chen, Jiashuo Chen, Xin Yuan, Fanglin Bao

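AI summary: Thermal-infrared hyperspectral imagery (TIR-HSI) is valuable in many applications, but its practical use is severely limited by factors such as sensor degradation. This paper proposes HAIR, a physics-driven framework based on the HADAR rendering equation that uses a physical model of temperature, emissivity, and texture (TeX) triplets to restore ground-based TIR-HSI with high accuracy. The method guarantees physical consistency and robustness to spatio-spectral noise, and it enables spectral calibration and generation through an atmospheric downwelling radiation reference and the spectral smoothness of emissivity. Experiments show that it outperforms existing methods on denoising, inpainting, spectral calibration, and super-resolution.

Comments: 17 pages, 18 figures

Abstract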

Thermal-infrared (TIR) hyperspectral imagery (HSI) provides critical scene information for various applications. However, its practical utility is severely limited by unique sensor degradations beyond the capabilities of existing restoration methods, which are ignorant of underlying thermal physics. Here, we propose HAIR (HADAR-based Image Restoration) as a physics-driven framework for ground-based TIR-HSI restoration. HAIR utilizes the HADAR rendering equation (HRE) and combines it with the atmospheric downwelling radiative transfer equation (RTE) to model TIR-HSI using temperature, emissivity, and texture (TeX) physical triplets. This physical model leads to a TeX decompose-synthesize strategy that guarantees physical consistency and spatio-spectral noise resilience, in stark contrast to existing approaches. Moreover, our framework uses a forward-modeled atmospheric downwelling reference, along with spectral smoothness of emissivity and blackbody radiation, to enable spectral calibration and generation that would otherwise be elusive. Our extensive experiments on the outdoor DARPA Invisible Headlights dataset and in-lab FTIR measurements show that HAIR consistently outperforms state-of-the-art methods across denoising, inpainting, spectral calibration, and spectral super-resolution, establishing a benchmark in objective accuracy and visual quality.

2605.13663 2026-05-14 cs.CL cs.CY

Fine-tuning with Hierarchical Prompting for Robust Propaganda Classification Across Annotation Schemas

Lukas Stähelin, Veronika Solopova, Max Upravitelev, David Kaplan, Ariana Sahitaj, Premtim Sahitaj, Charlott Jakob, Sebastian Möller, Vera Schmitt

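AI summary: This paper studies how to make propaganda classification in social media more robust across annotation schemas. It proposes an intent-focused taxonomy of propaganda techniques and compares it with an existing annotation standard. Experiments with four large language models show that fine-tuning is essential for classification performance, and that the proposed hierarchical prompting method (HiPP) performs particularly well after fine-tuning, especially on the schema with higher annotation disagreement. The study also releases the HQP dataset, annotated with the new schema, as a more challenging benchmark for future research.

Abstract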

Propaganda detection in social media is challenging due to noisy, short texts and low annotation agreements. We introduce a new intent-focused taxonomy of propaganda techniques and compare it against an established, higher-agreement schema. Along three dimensions (model portfolio, schema effects, and prompting strategy) we evaluate the taxonomies as a classification task with the help of four language models (GPT-4.1-nano, Phi-4 14B, Qwen2.5-14B, Qwen3-14B). Our results show that fine-tuning is essential, since it transforms weak zero-shot baselines into competitive systems and reveals methodological differences that are hidden using base models. Across schemas, the Qwen models achieve the strongest overall performance, and Phi-4 14B consistently outperforms GPT-4.1-nano. Our hierarchical prompting method (HiPP), which predicts fine-grained techniques before aggregating them, is especially beneficial after fine-tuning and on the more ambiguous, low-agreement taxonomy, while remaining competitive on the simpler schema. The HQP dataset, annotated with the new intent-based labels, provides a richer lens on propaganda's strategic goals and a challenging benchmark for future work on robust, real-world detection.
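
The roll-up step described for HiPP, predicting fine-grained techniques first and aggregating them afterwards, can be sketched as follows. This is a minimal illustration only: the technique names, the coarse categories, and the `aggregate` helper are hypothetical placeholders and are not the paper's taxonomy or code.

```python
# Hypothetical mapping from fine-grained techniques to coarse labels.
# These names are illustrative only and are not taken from the paper.
FINE_TO_COARSE = {
    "loaded_language": "emotional_appeal",
    "fear_appeal": "emotional_appeal",
    "strawman": "distortion",
    "whataboutism": "distortion",
}


def aggregate(fine_predictions):
    """Roll fine-grained predictions up to a sorted list of coarse labels."""
    coarse = {FINE_TO_COARSE[t] for t in fine_predictions if t in FINE_TO_COARSE}
    # An empty result is read here as "no propaganda detected".
    return sorted(coarse)


example = aggregate(["fear_appeal", "strawman", "loaded_language"])
```

In a real pipeline the fine-grained list would come from the fine-tuned model's output; here it is hard-coded to keep the sketch self-contained.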

2605.13651 2026-05-14 cs.SD cs.AI

NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating

Zhongju Yuan, Geraint Wiggins, Dick Botteldooren

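AI summary: This paper proposes NAACA, a training-free neuro-auditory attentive cognitive architecture that addresses the attention bottleneck in detecting salient events in long-form audio. At its core is a neuro-inspired Oscillatory Working Memory (OWM) that triggers higher-level language model processing based on perceptual salience, improving event detection accuracy while reducing unnecessary computation. Experiments show that NAACA substantially improves detection performance on the XD-Violence dataset and is robust to noise and transient pauses on an urban soundscape dataset.

Comments: Accepted as a regular paper by ICML 2026

Abstract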

Audio provides critical situational cues, yet current Audio Language Models (ALMs) face an attention bottleneck in long-form recordings where dominant background patterns can dilute rare, salient events. We introduce NAACA, a training-free NeuroAuditory Attentive Cognitive Architecture that reframes attention allocation as an auditory salience filtering problem. At its core is OWM, a neuro-inspired Oscillatory Working Memory that maintains stable attractor-like states and triggers higher-level ALM reasoning only when adaptive energy fluctuations signal perceptual salience. On XD-Violence, NAACA improves AudioQwen's average precision (AP) from 53.50% to 70.60% while reducing unnecessary ALM invocations. Furthermore, qualitative case studies on the Urban Soundscapes of the World (USoW) dataset show that OWM captures novel events and subcategory shifts while remaining robust to transient pauses and ambient urban noise.
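
The gating idea, invoking the expensive model only when an energy signal departs from its adaptive baseline, can be illustrated with a simple stand-in. The sketch below assumes an exponential moving mean and variance as the baseline and a `k`-sigma trigger; it does not reproduce the oscillatory dynamics of OWM, and the function name and parameters are hypothetical.

```python
import math


def salience_gate(energies, k=3.0, warmup=5, alpha=0.1):
    """Return indices of frames whose energy deviates strongly from a running baseline.

    Assumption: an exponential moving mean/variance stands in for the adaptive
    energy tracking; the paper's OWM dynamics are not reproduced here.
    """
    mean, var = energies[0], 0.0
    triggered = []
    for i, e in enumerate(energies[1:], start=1):
        std = math.sqrt(var)
        if i >= warmup and abs(e - mean) > k * std + 1e-9:
            triggered.append(i)  # only these frames would be sent to the ALM
        # Update the running baseline after the decision.
        diff = e - mean
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return triggered


frames = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 5.0, 1.0, 1.02]
calls = salience_gate(frames)
```

With these toy values only the burst at index 7 passes the gate, so the downstream model would be called once instead of ten times.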

2605.13647 2026-05-14 cs.CL

FlowCompile: An Optimizing Compiler for Structured LLM Workflows

Junyan Li, Zhang-Wei Hong, Maohao Shen, Yang Zhang, Chuang Gan

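AI summary: FlowCompile is an optimizing compiler for structured large language model (LLM) workflows. It targets the problem of balancing accuracy and latency when multiple sub-agents execute together in a predefined graph. Borrowing ideas from machine learning compilers, it explores the workflow design space globally before deployment and produces a reusable set of workflow configurations covering different accuracy-latency trade-offs. Experiments show that FlowCompile outperforms heuristic optimization and routing-based methods across several workflows and benchmarks, with up to 6.4x speedup.

Abstract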

Structured LLM workflows, where specialized LLM sub-agents execute according to a predefined graph, have become a powerful abstraction for solving complex tasks. Optimizing such workflows, i.e., selecting configurations for each sub-agent to balance accuracy and latency, is challenging due to the combinatorial design space over model choices, reasoning budgets, and workflow structures. Existing cost-aware methods largely treat workflow optimization as a routing problem, selecting a configuration at inference time for each query according to the accuracy-latency objective used during training. We argue that structured LLM workflows can also be optimized from a compilation perspective: before deployment, the system can globally explore the workflow design space and construct a reusable set of workflow-level configurations spanning diverse accuracy-latency trade-offs. Drawing inspiration from machine learning compilers, we introduce FlowCompile, a structured LLM workflow compiler that performs compile-time design space exploration to identify a high-quality, reusable trade-off set. FlowCompile decomposes a workflow into sub-agents, profiles each sub-agent under diverse configurations, and composes these measurements through a structure-aware proxy to estimate workflow-level accuracy and latency. It then identifies diverse high-quality configurations in a single compile-time pass, without retraining or online adaptation. Experiments across diverse workflows and challenging benchmarks show that FlowCompile consistently outperforms heuristically optimized workflow configurations and routing-based baselines, delivering up to 6.4x speedup. The compiled configuration set further serves as a reusable optimization artifact, enabling flexible deployment under varying runtime preferences and supporting downstream selection or routing.
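
A reusable trade-off set of this kind is naturally expressed as a Pareto front over accuracy and latency. The sketch below shows only that selection step on hypothetical workflow-level estimates; the profiling and the structure-aware proxy described in the abstract are not modeled, and the configuration names and numbers are invented for illustration.

```python
def pareto_front(configs):
    """Keep configurations not dominated in (higher accuracy, lower latency)."""
    front = []
    for name, acc, lat in configs:
        dominated = any(
            (a >= acc and l <= lat) and (a > acc or l < lat)
            for _, a, l in configs
        )
        if not dominated:
            front.append((name, acc, lat))
    # Sort by latency so the set reads from fastest to most accurate.
    return sorted(front, key=lambda c: c[2])


# Hypothetical workflow-level estimates: (name, accuracy, latency in seconds).
candidates = [
    ("small-small", 0.71, 2.0),
    ("small-large", 0.80, 5.0),
    ("large-small", 0.78, 6.0),  # dominated by small-large
    ("large-large", 0.86, 11.0),
]
trade_off_set = pareto_front(candidates)
```

At deployment time, a runtime preference such as a latency budget could pick one entry from `trade_off_set` without re-running the search.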

2605.13641 2026-05-14 cs.LG cs.CL

Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization

Yang Bai, Kaiyuan Liu, Ziyuan Zhuang, Jiahong Zhou, Rongxiang Weng, Xin Chen, Jingang Wang, Xunliang Cai

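AI summary: This paper studies policy optimization under multi-task and mixed-reward settings in complex reinforcement learning environments. To handle heterogeneous reward distributions and correlated reward dimensions, it proposes a reward-processing method called RDPO. RDPO uses magnitude-aware quantile normalization to stabilize reward allocation and Mahalanobis whitening to reduce correlation redundancy, improving the stability and effectiveness of policy training. Experiments show that, in the post-training of LongCat-Flash, the method improves instruction following, writing quality, and robustness to hard prompts while remaining competitive on reasoning and coding tasks.

Abstract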

Complex reinforcement learning environments frequently employ multi-task and mixed-reward formulations. In these settings, heterogeneous reward distributions and correlated reward dimensions often destabilize the construction of scalar advantages. To address these challenges, we propose Reward-Decorrelated Policy Optimization (RDPO), a reward-processing method designed to explicitly target both failure modes. RDPO first utilizes Magnitude-Aware Quantile normalization to stabilize prompt-level advantage allocation across binary, fractional, and continuous rewards. It then applies Mahalanobis whitening within each active reward subspace to mitigate correlation redundancy prior to aggregation. When applied during the post-training of LongCat-Flash, RDPO enhances instruction following, writing quality, and robustness to hard prompts while remaining broadly competitive on reasoning and coding evaluations.
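
To make the decorrelation step concrete, the sketch below whitens a small group of two-dimensional reward vectors so that their sample covariance becomes approximately the identity. It is an illustration under stated assumptions: it uses Cholesky-based whitening as a stand-in (the abstract does not specify the exact transform), it omits the quantile normalization step entirely, and summing the whitened components into a scalar is an assumption rather than the paper's aggregation rule.

```python
import math


def whiten(rewards, eps=1e-6):
    """Whiten per-sample reward vectors (a list of equal-length lists).

    Uses the Cholesky factor L of the covariance (cov = L L^T) and returns
    z = L^{-1} (r - mean), whose sample covariance is close to identity.
    """
    n, d = len(rewards), len(rewards[0])
    mean = [sum(r[j] for r in rewards) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rewards]
    cov = [[sum(c[i] * c[j] for c in centered) / n + (eps if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    # Cholesky decomposition of the regularized covariance.
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = cov[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    # Forward substitution: solve L z = c for every centered sample.
    out = []
    for c in centered:
        z = []
        for i in range(d):
            z.append((c[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
        out.append(z)
    return out


# Two strongly correlated reward dimensions for one prompt's samples.
group = [[0.0, 0.1], [1.0, 0.9], [2.0, 2.2], [3.0, 2.8]]
white = whiten(group)
# Scalar advantage as the sum of decorrelated components (an assumption).
advantages = [sum(z) for z in white]
```

Because the two toy reward dimensions are almost redundant, whitening keeps their shared signal from being counted twice when the components are combined.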

2605.13639 2026-05-14 cs.LG math.OC stat.ML

Achieving $ε^{-2}$ Sample Complexity for Single-Loop Actor-Critic under Minimal Assumptions

Ishaq Hamza, Zaiwei Chen

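AI summary: This paper studies the sample complexity of off-policy actor-critic methods under a single-loop implementation. Assuming only the existence of a policy that induces an irreducible Markov chain, it proves the first $\tilde{\mathcal{O}}(ε^{-2})$ sample complexity guarantee for finding an $ε$-optimal policy in a single-loop, single-timescale framework. In contrast to prior work that requires nested loops or strong algorithm-dependent assumptions, the paper builds a coupled Lyapunov drift framework to handle the challenges of single-loop updates and off-policy learning, establishing a geometric convergence rate for the actor and an $\tilde{\mathcal{O}}(1/T)$ rate for the critic and combining the two through a cross-domination property. The result is of theoretical interest and may have practical applications.

Abstract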

In this paper, we establish last-iterate convergence rates for off-policy actor-critic methods in reinforcement learning. In particular, under a single-loop, single-timescale implementation and a broad class of policy updates, including approximate policy iteration and natural policy gradient methods, we prove the first $\tilde{\mathcal{O}}(ε^{-2})$ sample complexity guarantee for finding an $ε$-optimal policy under minimal assumptions, namely, the existence of a policy that induces an irreducible Markov chain. This stands in stark contrast to the existing literature, where an $\tilde{\mathcal{O}}(ε^{-2})$ sample complexity is achieved only through nested-loop updates and/or under strong, algorithm-dependent assumptions on the policies, such as uniform mixing and uniform exploration. Technically, to address the challenges posed by the coupled update equations arising from the single-loop implementation, as well as the potentially unbounded iterates induced by off-policy learning, our analysis is based on a coupled Lyapunov drift framework. Specifically, we establish a geometric convergence rate for the actor and an $\tilde{\mathcal{O}}(1/T)$ convergence rate for the critic, and combine the two Lyapunov drift inequalities through a cross-domination property. We believe this analytical framework is of independent interest and may be applicable to other coupled iterative algorithms with unbounded

2605.13632 2026-05-14 cs.RO cs.CV

Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models

Yiran Ling, Qing Lian, Jinghang Li, Qing Jiang, Tianming Zhang, Xiaoke Jiang, Chuanxiu Liu, Jie Liu, Lei Zhang

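AI summary: This paper proposes GTA-VLA, an interactive vision-language-action framework that enables spatially steerable embodied reasoning by letting users guide robot policies with explicit visual cues. The framework introduces an optional, user-provided spatial prior and combines it with internal task planning to generate a unified visual-spatial reasoning chain, raising task success rates in complex or unseen environments. Experiments show strong results on standard benchmarks and greater robustness and recovery ability under visual shifts and spatial ambiguity.

Abstract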

In this paper, we propose GTA-VLA (Guide, Think, Act), an interactive Vision-Language-Action (VLA) framework that enables spatially steerable embodied reasoning by allowing users to guide robot policies with explicit visual cues. Existing VLA models learn a direct "Sense-to-Act" mapping from multimodal observations to robot actions. While effective within the training distribution, such tightly coupled policies are brittle under out-of-domain (OOD) shifts and difficult to correct when failures occur. Although recent embodied Chain-of-Thought (CoT) approaches expose intermediate reasoning, they still lack a mechanism for incorporating human spatial guidance, limiting their ability to resolve visual ambiguities or recover from mistakes. To address this gap, our framework allows users to optionally guide the policy with spatial priors, such as affordance points, boxes, and traces, which the subsequent reasoning process can directly condition on. Based on these inputs, the model generates a unified spatial-visual Chain-of-Thought that integrates external guidance with internal task planning, aligning human visual intent with autonomous decision-making. For practical deployment, we further couple the reasoning module with a lightweight reactive action head for efficient action execution. Extensive experiments demonstrate the effectiveness of our approach. On the in-domain SimplerEnv WidowX benchmark, our framework achieves a state-of-the-art 81.2% success rate. Under OOD visual shifts and spatial ambiguities, a single visual interaction substantially improves task success over existing methods, highlighting the value of interactive reasoning for failure recovery in embodied control. Details of the project can be found here: https://signalispupupu.github.io/GTA-VLA_ProjPage/

2605.13625 2026-05-14 cs.AI

How to Interpret Agent Behavior

Jie Gao, Kaiser Sun, Jen-tse Huang, Katherine Van Koevering, Sijie Ji, Heyuan Huang, Weiyan Shi, Zhuoran Lu, Ziang Xiao, Daniel Khashabi, Mark Dredze

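AI summary: This paper studies how to interpret the runtime behavior of autonomous agents such as Claude Code and Codex, and proposes ACT*ONOMY, a behavioral taxonomy for describing and analyzing agent trajectories. It combines action classification with a theory-driven framework to build a three-level hierarchy of 10 actions, 46 subactions, and 120 leaf categories, and it provides an open-source analysis platform that supports dynamic updates and extension. Experiments show that ACT*ONOMY can compare behavioral profiles across agents and identify anomalous patterns at runtime, giving researchers and users a consistent vocabulary that supports better understanding and oversight of agent behavior.

Comments: 34 pages in total

Abstract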

Autonomous agents such as Claude Code and Codex now operate for hours or even days. Understanding their runtime behavior has become critical for downstream tasks such as diagnosing inefficiencies, fixing bugs, and ensuring better oversight. A primary way to gain this understanding is analyzing the reasoning trajectories and execution traces these agents generate. Yet such data remains in unstructured natural-language form, making it difficult for humans to interpret at scale. We introduce ACT*ONOMY (a combination of Action and Taxonomy), a taxonomy for describing and analyzing agent behavior at runtime. ACT*ONOMY has two components: (1) the taxonomy itself, developed through Grounded Theory and structured as a three-level hierarchy of 10 actions, 46 subactions, and 120 leaf categories; and (2) an open repository that hosts the living taxonomy, provides an automated analysis pipeline that applies it to agent trajectory analysis, and defines an extension protocol for customization and growth. Our experiments show that ACT*ONOMY can compare behavioral profiles across agents and characterize a single agent's behavior across diverse trajectories, surfacing patterns indicative of failure modes. By providing a shared vocabulary, ACT*ONOMY helps researchers, agent designers, and end users interpret agent behavior more consistently, enabling better oversight and control.

2605.13624 2026-05-14 cs.CL

Edit-level Majority Voting Mitigates Over-Correction in LLM-based Grammatical Error Correction

Takumi Goto, Yusuke Sakai, Taro Watanabe

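AI summary: This paper studies the over-correction problem common in grammatical error correction with large language models. It proposes a training-free inference method in which a single model generates multiple candidate corrections and edit-level majority voting is applied over them, which mitigates over-correction. The method outperforms greedy decoding and minimum Bayes risk (MBR) decoding in most cases on nine benchmarks across multiple languages and keeps correction quality stable across different instruction prompts.

Comments: BEA Workshop 2026

Abstract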

Grammatical error correction using large language models often suffers from the over-correction issue. To mitigate this, we propose a training-free inference method that performs edit-level majority voting over multiple candidates generated by a single model, without requiring model modifications or additional training. Across nine benchmarks covering English, Czech, German, Ukrainian, Korean, Hindi, and Romanian, the proposed method outperforms both greedy and MBR decoding in most cases. Moreover, it yields stable correction quality regardless of the instruction prompts used. We release two repositories supporting GEC dataset loading and LLM inference.
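
Edit-level majority voting can be sketched as follows, assuming each sampled candidate has already been converted into a set of token-span edits against the same source sentence. The edit representation `(start, end, replacement)`, the strict more-than-half threshold, and both helper names are assumptions made for illustration; the abstract does not specify these details.

```python
from collections import Counter


def majority_vote_edits(candidate_edits):
    """Keep edits proposed by more than half of the sampled candidates.

    Each candidate is a set of (start, end, replacement) token-span edits
    relative to the same source sentence.
    """
    counts = Counter(e for edits in candidate_edits for e in set(edits))
    threshold = len(candidate_edits) / 2
    return sorted(e for e, c in counts.items() if c > threshold)


def apply_edits(tokens, edits):
    """Apply non-overlapping edits right to left so earlier offsets stay valid."""
    out = list(tokens)
    for start, end, replacement in sorted(edits, reverse=True):
        out[start:end] = replacement.split()
    return out


source = "He go to school yesterday".split()
candidates = [
    {(1, 2, "went")},
    {(1, 2, "went"), (3, 4, "the school")},  # second edit is an over-correction
    {(1, 2, "went")},
]
voted = majority_vote_edits(candidates)
corrected = " ".join(apply_edits(source, voted))
```

The over-correcting edit appears in only one of three candidates and is voted out, while the edit shared by all candidates survives.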

2605.13623 2026-05-14 cs.LG

Multimodal Graph-based Classification of Esophageal Motility Disorders

Alexander Geiger, Lars Wagner, Daniel Rueckert, Alois Knoll, Dirk Wilhelm, Alissa Jell

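AI summary: This paper studies a multimodal, graph neural network-based approach to classifying esophageal motility disorders, motivated by the complexity of high-resolution impedance manometry (HRIM) data and the variability of its clinical interpretation. The method combines HRIM recordings with patient-specific information and models esophageal physiology as a graph; a graph neural network learns physiologically meaningful representations, which are fused with patient features for multi-category classification. Experiments indicate that the multimodal approach outperforms models that rely only on HRIM features as well as vision-based classifiers, supporting the value of combining graph modeling with patient information.

Abstract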

Diagnosing esophageal motility disorders poses significant challenges due to the complexity of high-resolution impedance manometry (HRIM) data and variability in clinical interpretation. This work explores the feasibility of a multimodal Machine Learning (ML)-based classification approach that combines HRIM recordings with patient-specific information and incorporates a graph-based modeling of esophageal physiology. We analyze HRIM recordings with corresponding patient information from 104 patients with esophageal motility disorders. Patient data includes demographic, clinical, and symptom information extracted from structured questionnaires and free-text notes using keyword detection and large language model-based processing. HRIM data is represented as spatio-temporal graphs, where nodes correspond to pressure values along the esophagus and edges encode spatial adjacency and impedance dynamics. A graph neural network (GNN) is applied to learn physiologically meaningful representations, which are fused with patient embeddings for multi-category, multi-class classification of swallow events. The impact of patient features and graph-based modeling is evaluated by ablation studies and comparison to vision-based classifier baselines. The proposed multimodal approach indicates improvements over models that rely solely on HRIM-derived features across all classification categories. Additionally, the graph-based modeling provides gains compared to vision-based baselines. Our experiments systematically assess the complementary contribution of multiple modalities and demonstrate the feasibility of our proposed graph-based approach. Our initial findings suggest that integrating patient-level data with graph-based representations of HRIM signals is a promising direction for more accurate classification of esophageal motility disorders.

2605.13621 2026-05-14 cs.CV

WD-FQDet: Multispectral Detection Transformer via Wavelet Decomposition and Frequency-aware Query Learning

Chunjin Yang, Xiwei Zhang, Yiming Xiao, Fanman Meng

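AI summary: WD-FQDet is a multispectral detection transformer framework based on wavelet decomposition and frequency-aware query learning. It targets biased modality-shared features and insufficient modality-specific features in fused infrared-visible detection. A low-frequency alignment module and a high-frequency retention module respectively strengthen the consistency of cross-modal features and the expression of modality-specific features, and a frequency-aware query selection mechanism dynamically adjusts the contributions of the different features. Experiments show that WD-FQDet achieves leading detection performance on multiple datasets.

Abstract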

Infrared-visible object detection improves detection performance by combining complementary features from multispectral images. Existing backbone-specific and backbone-shared approaches still suffer from the problems of severe bias of modality-shared features and the insufficiency of modality-specific features. To address these issues, we propose a novel detection framework WD-FQDet that explicitly decouples modality-shared and modality-specific information from infrared and visible modalities in the new view of low- and high-frequency domains, allowing fusion strategies tailored to their frequency characteristics. Specifically, a low-frequency homogeneity alignment module is proposed to align modality-shared features across modalities via a cross-modal attention mechanism, and a high-frequency specificity retention module is proposed to preserve modality-specific features through the multi-scale gradient consistency loss. To reinforce the feature representation in the frequency domain, we propose a hybrid feature enhancement module that incorporates spatial cues. Furthermore, considering that the contributions of homogeneous and modality-specific features to object detection vary across scenarios, we propose a frequency-aware query selection module to dynamically regulate their contributions. Experimental results on the FLIR, LLVIP, and M3FD datasets demonstrate that WD-FQDet achieves state-of-the-art performance across multiple evaluation metrics.

2605.13613 2026-05-14 cs.RO

Design of Magnetic Continuum Robots with Tunable Force Response Using Rotational Ring Pairs

Alex Sayres, Giovanni Pittiglio

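AI summary: This paper presents a new continuum robot design whose magnetic response at the tip can be tuned online, allowing the effective magnetic direction and intensity to be adjusted dynamically and adding steering degrees of freedom without control of the external magnetic field. The design removes the dependence of conventional robots on fixed internal magnetic content and works in both controllable and fixed magnetic fields, which may broaden its use in medicine and other areas. Experiments show a maximum tip deflection of 23% of the robot's length, and a mechanical model based on modified beam theory achieves high tip tracking accuracy.

Comments: 7 pages, 6 figures, Accepted to ISMR 2026

Abstract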

In this paper, we discuss a novel continuum robot design that enables the online tuning of the magnetic response at its tip. The proposed method allows for the change of both effective magnetic direction and intensity, introducing steering DOF without the need to control the external fields. This is unattainable with classical designs, which rely on fixed internal magnetic content and steer solely under the effect of a controllable magnetic field. The proposed robot design can be used in both controllable and fixed magnetic fields, potentially widening the clinical applicability of these robots. We experimentally show a max tip deflection of 33.8 mm from the resting state (23 % of the length of the robot). We discuss a model based on modified beam theory that captures the mechanical behavior of the continuum robot, with a mean absolute tip tracking error of 1.86 mm (1.2 % of the length) and maximum errors of less than 4.8 mm (3.2 % of the length) for all experimental points.

2605.13612 2026-05-14 cs.LG cond-mat.dis-nn stat.ML

Deep Learning as Neural Low-Degree Filtering: A Spectral Theory of Hierarchical Feature Learning

Yatin Dandi, Matteo Vilucchio, Luca Arnaboldi, Hugo Tabanelli, Florent Krzakala

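AI summary: This paper proposes Neural Low-Degree Filtering (Neural LoFi), a theoretical framework for explaining how deep neural networks extract useful representations from data through hierarchical feature learning. It reduces gradient-based training to an explicit iterative spectral method in which each layer builds features step by step by selecting the directions with maximal low-degree correlation to the label. The theory gives a mathematical account of how features evolve in deep learning, and experiments on fully connected and convolutional networks show its advantages in feature selection and structured filtering.

Comments: 62 pages, many figures, companion code at https://github.com/IdePHICS/Neural-LoFi-Theory

Abstract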

Understanding how deep neural networks learn useful internal representations from data remains a central open problem in the theory of deep learning. We introduce Neural Low-Degree Filtering (Neural LoFi), a stylized limit of gradient-based training in which hierarchical feature learning becomes an explicit iterative spectral procedure. In this limit, the dynamics at each layer decouple: given the current representation, the next layer selects directions with maximal accessible low-degree correlation to the label. This yields a tractable surrogate mechanism for deep learning, together with a natural kernel-space interpretation. Neural LoFi provides a mathematically explicit framework for studying multi-layer feature learning beyond the lazy regime. It predicts how representations are selected layer by layer, explains how emergence of concepts arises with given sample complexity, and gives a concrete mechanism by which depth progressively constructs new features from old ones through low-degree compositionality. We complement the theory with mechanistic experiments on fully connected and convolutional architectures, showing that Neural LoFi improves over lazy random-feature baselines, recovers meaningful structured filters, and predicts representations aligned with early gradient-descent feature discovery with real datasets.

2605.13604 2026-05-14 cs.CV

Rethinking Graph Convolution for 2D-to-3D Hand Pose Lifting

Chanyoung Kim, Donghyun Kim, Dong-Hyun Sim, Seong Jae Hwang, Youngjoong Kwon

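AI summary: This paper revisits the use of graph convolutional networks for 2D-to-3D hand pose lifting and asks whether the hand skeleton should be encoded as a fixed adjacency graph. Parameter-matched ablations on the FPHA dataset show that multi-head self-attention clearly outperforms conventional graph convolution, and that graph-distance positional encoding used as a soft structural prior is more effective than a hard adjacency constraint. The results indicate that adaptive spatial attention improves hand pose estimation accuracy more than fixed graph convolution.

Abstract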

Graph convolutional networks (GCNs) are widely used for 3D hand pose estimation, where the hand skeleton is encoded as a fixed adjacency graph. We revisit whether this is the most effective way to incorporate hand topology in 2D-to-3D lifting. In this paper, we perform controlled, parameter-matched ablations on the FPHA benchmark and show that standard multi-head self-attention consistently outperforms GCN baselines. Even when the GCN is strengthened with multi-hop adjacency and matched parameter count, self-attention reduces MPJPE from 12.36 mm to 10.09 mm. A skeleton-constrained graph attention network recovers most of this gap, indicating that input-dependent aggregation is a major source of improvement, while fully connected attention yields additional gains. We further show that hand topology is most effective when introduced as a soft structural prior through graph-distance positional encoding, rather than as a hard adjacency constraint. These results suggest that, for hand pose lifting, adaptive spatial attention is a more effective inductive bias than fixed graph convolution.
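
A graph-distance positional encoding starts from the hop distance between every pair of joints on the skeleton. The sketch below computes that distance matrix with breadth-first search for a 21-joint hand; the joint ordering (wrist at index 0, then five chains of four joints) is a common convention assumed here, not something stated in the abstract, and how the distances enter the attention layer is left out.

```python
from collections import deque


def hand_edges():
    """Edges of a 21-joint hand: wrist (0) plus five 4-joint finger chains.

    This joint ordering is a common convention and is assumed here.
    """
    edges = []
    for finger in range(5):
        base = 1 + 4 * finger
        edges.append((0, base))
        edges.extend((base + k, base + k + 1) for k in range(3))
    return edges


def graph_distances(num_nodes, edges):
    """All-pairs hop distances via breadth-first search from every node."""
    adj = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = [[-1] * num_nodes for _ in range(num_nodes)]
    for src in range(num_nodes):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[src][v] < 0:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


D = graph_distances(21, hand_edges())
```

One plausible use, offered as an assumption, is to let each distance value index a learned bias added to the attention logits, so that topology nudges attention without masking any joint pair.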

2605.13601 2026-05-14 cs.AI cs.MA

Unweighted ranking for value-based decision making with uncertainty

Aarón López García, Natalia Criado, Jose Such

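AI summary: As intelligent systems are increasingly used for autonomous decision making in society, their adherence to human values has drawn wide attention. This paper proposes a fuzzy-logic-based unweighted value-based decision-making framework (FUW-VBDM) that incorporates qualitative and quantitative criteria to make decisions more human-centred and removes the bias introduced by stakeholders' subjective weights. To this end, the authors design the Rankzzy method, which uses fuzzy reasoning to quantify uncertainty, and show its advantages in computational cost and ranking performance on large-scale cases.

Comments: 21 pages

Abstract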

As intelligent systems are increasingly implemented in our society to make autonomous decisions, their commitment to human values raises serious concerns. Their alignment with human values remains a critical challenge because it can jeopardise the integrity and security of citizens. For this reason, an innovative human-centred and values-driven approach to decision making is required. In this work, we introduce the Fuzzy-Unweighted Value-Based Decision Making (FUW-VBDM) framework, where agents incorporate both quantitative and qualitative criteria to generate human-centred decisions. We also address the normative bias introduced by stakeholders with arbitrary weights by removing prior weights and introducing a fuzzy domain of decision variables defined for a score function. This concept allows us to generalise any VBDM problem as the search for feasible solutions when optimising the score in the weight domain. To provide a solution to FUW-VBDM, we present Rankzzy, a customizable unweighted ranking method that integrates fuzzy-based reasoning to quantify uncertainty. We mathematically prove the consistency of Rankzzy for any admissible configuration selected by stakeholders. We show the applicability of our method through an illustrative case study, which we also use as a running example. The evaluation conducted indicates a reduced computational cost in large-scale value-based decision-making problems and strong ranking performance relative to existing approaches when employing aggregation via Pythagorean means.
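
The Pythagorean means mentioned for aggregation are the arithmetic, geometric, and harmonic means. The sketch below only shows how the choice of mean can change a ranking on hypothetical scores; it is not an implementation of Rankzzy, whose fuzzy reasoning and weight-domain search are not modeled, and the alternatives and values are invented for illustration.

```python
from statistics import fmean, geometric_mean, harmonic_mean


def pythagorean_scores(values):
    """Aggregate positive criterion scores with the three Pythagorean means."""
    return {
        "arithmetic": fmean(values),
        "geometric": geometric_mean(values),
        "harmonic": harmonic_mean(values),
    }


# Hypothetical per-value satisfaction scores in (0, 1] for two alternatives.
alternatives = {"A": [0.9, 0.9, 0.2], "B": [0.6, 0.7, 0.6]}


def rank_by(kind):
    """Order alternatives from best to worst under one kind of mean."""
    return sorted(
        alternatives,
        key=lambda name: pythagorean_scores(alternatives[name])[kind],
        reverse=True,
    )


ranking = {kind: rank_by(kind) for kind in ("arithmetic", "geometric", "harmonic")}
```

Alternative A wins under the arithmetic mean, but its one weak score drags it below B under the geometric and harmonic means, which penalize unbalanced profiles more heavily.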

2605.13600 2026-05-14 cs.CV

Sparse Code Uplifting for Efficient 3D Language Gaussian Splatting

Lovre Antonio Budimir, Yushi Guan, Steve Ryhner, Sven Lončarić, Nandita Vijaykumar

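AI summary: This paper proposes SCOUP, an efficient method for 3D language Gaussian splatting that addresses how to efficiently associate high-dimensional vision-language embeddings with large numbers of 3D Gaussians for open-vocabulary 3D scene understanding. By decoupling language representation learning from 3D Gaussian optimization, it learns sparse code representations from features of 2D image regions and uplifts them to 3D Gaussians through weighted sparse aggregation, enabling efficient storage and fast rendering. Experiments show substantial gains in training speed and memory efficiency, with open-vocabulary querying accuracy on multiple benchmarks that matches or exceeds existing methods.

Comments: 18 pages (9 pages main paper), 10 figures, preprint

Abstract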

3D Language Gaussian Splatting (3DLGS) augments 3D Gaussian Splatting with language-aligned visual features for open-vocabulary 3D scene understanding. A core challenge is efficiently associating high-dimensional vision-language embeddings with millions of 3D Gaussians while preserving efficient feature rendering for text-based querying. Existing methods either store dense features directly on Gaussians, causing high storage costs and slow rendering, or learn compact representations through expensive per-scene optimization with repeated feature rasterization. No existing method simultaneously achieves fast 3D semantic reconstruction, efficient storage, and fast rendering. We propose SCOUP (Sparse COde UPlifting), which addresses all three by decoupling language representation learning from 3D Gaussian optimization. Rather than working directly in 3D, we learn sparse codebook-based representations entirely using features associated with 2D image regions, associating each region with a sparse set of codebook coefficients. We then uplift these coefficients to 3D Gaussians with our weighted sparse aggregation using Gaussian-to-pixel associations, where each Gaussian accumulates coefficients over codebook atoms across views. Top-$K$ filtering then extracts the most dominant multi-view coefficients per Gaussian, enabling efficient storage and fast rendering. Our method achieves up to $400\times$ training speedup while being $3\times$ more memory efficient during training compared to the state-of-the-art in rendering speed. Across multiple benchmarks, SCOUP matches or outperforms existing methods in open-vocabulary querying accuracy.
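
The uplifting and Top-K steps can be pictured with plain dictionaries: each Gaussian accumulates weighted codebook coefficients from the regions it contributes to, then keeps only its K largest entries. This is a toy sketch under stated assumptions; the association weights, data layout, and function name are hypothetical, and real implementations would operate on GPU tensors rather than Python dicts.

```python
from collections import defaultdict
import heapq


def uplift_top_k(associations, region_codes, k=2):
    """Accumulate sparse codebook coefficients per Gaussian and keep the top K.

    associations: iterable of (gaussian_id, region_id, weight), e.g. a
        blending weight of the Gaussian for a pixel region in one view.
    region_codes: region_id -> {atom_index: coefficient} (sparse code).
    """
    acc = defaultdict(lambda: defaultdict(float))
    for gid, rid, w in associations:
        for atom, coeff in region_codes[rid].items():
            acc[gid][atom] += w * coeff
    # Keep only the K largest accumulated coefficients for each Gaussian.
    return {
        gid: dict(heapq.nlargest(k, atoms.items(), key=lambda kv: kv[1]))
        for gid, atoms in acc.items()
    }


# Hypothetical toy data: two regions seen across views, one Gaussian.
codes = {"r1": {0: 0.8, 3: 0.2}, "r2": {0: 0.5, 7: 0.5}}
links = [(42, "r1", 0.6), (42, "r2", 0.4)]
sparse = uplift_top_k(links, codes, k=2)
```

Storing only K (atom, coefficient) pairs per Gaussian is what keeps the representation compact compared with a dense high-dimensional feature per Gaussian.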

2605.13597 2026-05-14 cs.LG

Rethinking Generalization in Graph Neural Networks: A Structural Complexity Perspective

Peiyao Wang, Liang Bai, Xian Yang, Richard Yi Da Xu, Jiye Liang

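AI summary: This paper rethinks the generalization of graph neural networks (GNNs) from the perspective of structural complexity and examines how graph structure affects generalization. It proves that adding edges makes the input representations overly accommodating to the output model, which leads to overfitting, and it proposes a structural complexity measure based on the number of effective edges together with a corresponding generalization bound. Building on these results, the authors propose a structural entropy regularization method that regulates the number of effective edges to balance underfitting and overfitting, improving GNN generalization.

Comments: 44 pages, 10 figures

Abstract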

Graph neural networks (GNNs) have emerged as a fundamental tool for learning from graph-structured data, achieving strong performance across a wide range of applications. However, understanding their generalization capabilities remains challenging due to the complex structural dependencies inherent in such data. Existing generalization analyses largely follow the classical machine learning paradigm, focusing primarily on model complexity while overlooking the fundamental role of graph structure. Therefore, in this work, we systematically investigate this role by asking: does the graph structure actually influence generalization, and if so, by how much? To answer the first question and validate our intuition, we theoretically prove that incorporating more edges into the prediction process transforms the input representations to be overly accommodating to the output model, thereby inducing overfitting. To address the second question, we formulate a structural complexity measure based on the number of effective edges and derive a Rademacher complexity-based generalization bound. In doing so, we demonstrate that GNN generalization depends explicitly on structural complexity, alongside traditional parameter-dependent factors. Motivated by these theoretical findings, we propose a structural entropy regularization method. This approach controls structural complexity by regulating effective edges to balance underfitting and overfitting, ultimately improving the generalization performance of GNNs.

2605.13596 2026-05-14 cs.CL

Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations

Kyo Gerrits, Rik van Noord, Ana Guerberof Arenas

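AI summary This paper studies how automatic evaluation metrics (AEMs) and LLM-as-a-judge evaluation perform on literary translation across multiple languages, genres, and translation modalities. Using a dataset that covers human translation, machine translation, and post-editing, annotated for creativity by professional literary translators, the study finds that these automatic methods correlate poorly with professional evaluations of creativity, and perform worse still on more literary genres such as poetry. It also finds that LLM-based evaluation is systematically biased: it favours machine-translated text and penalizes creative and culturally appropriate solutions, which highlights fundamental limitations of current automatic evaluation tools for literary translation.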

Comments This paper has been accepted to the EAMT Conference 2026 in Tilburg on June 15-18 2026

Details
English abstract

This article investigates the performance of automatic evaluation metrics (AEMs) and LLM-as-a-judge evaluation on literary translation across multiple languages, genres, and translation modalities. The aim is to assess how well these tools align with professionals when evaluating translation, creativity (creative shifts & errors), and to see whether they can substitute for laborious manual annotations. A dataset of literary translations across three modalities (human translation, machine translation, and post-editing), three genres and three language pairs was created and annotated in detail for creativity by experienced professional literary translators. The results show that both AEMs and LLM-as-a-judge evaluations correlate poorly with professional evaluations on creativity, with LLM-as-a-judge showing a systematic bias in favour of machine-translated texts and penalising creative and culturally appropriate solutions. Moreover, performance is consistently worse for more literary genres such as poetry. This highlights fundamental limitations of current automatic evaluation tools for literary translation and the need to create new tools that do not frequently consider out-of-routine translations as errors.

2605.13595 2026-05-14 cs.CL

Inducing Artificial Uncertainty in Language Models

Sophia Hager, Simon Zeng, Nicholas Andrews

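AI summary In safety-critical applications, language models need to express their uncertainty with meaningful probabilities. This paper proposes inducing artificial uncertainty in language models to address the difficulty of training uncertainty quantification methods when challenging data is unavailable. By inducing artificial uncertainty on easy data and training dedicated probes to recognize it, the approach achieves notably better calibration on hard data with minimal loss of performance on easy data.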

Details
English abstract

In safety-critical applications, language models should be able to characterize their uncertainty with meaningful probabilities. Many uncertainty quantification approaches require supervised data; however, finding suitable unseen challenging data is increasingly difficult for large language models trained on vast amounts of scraped data. If the model is consistently (and correctly) confident in its predictions, the uncertainty quantification method may consistently overestimate confidence on new and unfamiliar data. Finding data which exhibits enough uncertainty to train supervised uncertainty quantification methods for high-performance models may therefore be challenging, and will increase in difficulty as LLMs saturate datasets. To address this issue, we first introduce the problem of inducing artificial uncertainty in language models, then investigate methods of inducing artificial uncertainty on trivially easy data in the absence of challenging data at training time. We use probes trained to recognize artificial uncertainty on the original model, and find that these probes trained on artificial uncertainty outperform probes trained without artificial uncertainty in recognizing real uncertainty, achieving notably higher calibration on hard data with minimal loss of performance on easy data.

2605.13591 2026-05-14 cs.CV

Real2Sim: A Physics-driven and Editable Gaussian Splatting Framework for Autonomous Driving Scenes

Kaicong Huang, Talha Azfar, Weisong Shi, Ruimin Ke

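AI summary This paper proposes Real2Sim, a physics-driven and editable Gaussian Splatting framework for generating autonomous driving scenes. It combines 4D Gaussian Splatting with a differentiable Material Point Method solver to reconstruct temporally continuous dynamic driving scenes, support instance-level editing, and simulate realistic object-object and object-environment interactions. The framework generates diverse, high-fidelity scenes, including complex cases such as collisions, while remaining physically plausible. Experiments show strong results in rendering, reconstruction, editing, and physics simulation, suggesting broad applicability to tasks such as autonomous driving perception and trajectory prediction.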

Details
English abstract

Reliable autonomous driving relies on large-scale, well-labeled data and robust models. However, manual data collection is resource-intensive, and traditional simulation suffers from a persistent reality gap. While recent generative frameworks and radiance-field methods improve visual fidelity, they still struggle with temporal and spatial consistency and cannot ensure physics-aware behavior, limiting their applicability to driving scenario generation. To address these challenges, we propose Real2Sim, a unified framework that combines 4D Gaussian Splatting (4DGS) with a differentiable Material Point Method (MPM) solver. Real2Sim explicitly reconstructs dynamic driving scenes as temporally continuous Gaussian primitives, supports instance-level editing, and simulates realistic object-object and object-environment interactions. This framework enables physics-aware, high-fidelity synthesis of diverse, editable scenarios, including challenging corner cases such as collisions and post-impact trajectories. Experiments on the Waymo Open Dataset validate Real2Sim's capabilities in rendering, reconstruction, editing, and physics simulation, demonstrating its potential as a scalable tool for data generation in downstream tasks such as perception, tracking, trajectory prediction, and end-to-end policy learning.

2605.13586 2026-05-14 cs.CV cs.AI

HetScene: Heterogeneity-Aware Diffusion for Dense Indoor Scene Generation

Zini Chen, Junming Huang, Rong Zhang, Jiamin Xu, Cheng Peng, Chi Wang, Weiwei Xu

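AI summary This paper proposes HetScene, a heterogeneity-aware diffusion model for generating dense, physically plausible indoor scenes. By distinguishing primary objects from secondary objects, it decomposes scene generation into two stages, structural layout generation and contextual layout generation, so that complex object distributions and spatial dependencies are modelled more effectively. The framework improves the controllability and physical plausibility of generated scenes, supporting the construction of simulation environments for embodied AI.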

Details
English abstract

Generating controllable and physically plausible indoor scenes is a pivotal prerequisite for constructing high-fidelity simulation environments for embodied AI. However, existing deep-learning-based methods usually treat all objects as homogeneous instances within a unified generation process. While effective for sparse and simplistic layouts, they struggle to model realistic layouts with dense object arrangements and complex spatial dependencies, leading to limited scalability and degraded physical plausibility. To deal with these challenges, we revisit indoor layout generation from the perspective of structural heterogeneity and decompose the objects into primary objects and secondary objects according to their distinct roles in shaping a scene. Based on this decomposition, we propose HetScene, a heterogeneous two-stage generation framework that decouples indoor layout synthesis into Structural Layout Generation (SLG) and Contextual Layout Generation (CLG). SLG first generates globally coherent structural layouts with only primary objects conditioned on text descriptions, top-down binary room masks, and spatial relation graphs, establishing a stable global macro-skeleton of large core furniture.

2605.13583 2026-05-14 cs.CV

Phy-CoSF: Physics-Guided Continuous Spectral Fields Reconstruction and Super-Resolution for Snapshot Compressive Imaging

Wudi Chen, Zhiyuan Zha, Xin Yuan, Shigang Wang, Bihan Wen, Jiantao Zhou, Gang Yan, Zipei Fan, Ce Zhu

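AI summary This paper proposes Phy-CoSF, a method for continuous spectral reconstruction and super-resolution of hyperspectral images in coded aperture snapshot spectral imaging (CASSI) systems. It combines deep unfolding networks with implicit neural representations to establish a new paradigm for continuous spectral reconstruction that can generate high-fidelity hyperspectral images at arbitrary wavelengths. The core module, continuous spectral fields (CoSF), uses cross-domain feature fusion and a dynamic prior mechanism to improve reconstruction accuracy and the preservation of spectral detail. Experiments show that it outperforms existing state-of-the-art methods on several metrics.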

Comments 15 pages, 10 figures, accepted by ICML 2026!

Details
English abstract

Recent advances have demonstrated that coded aperture snapshot spectral imaging (CASSI) systems show great potential for capturing 3D hyperspectral images (HSIs) from a single 2D measurement. Despite the inherent spectral continuity of scenes captured by CASSI, most existing reconstruction methods are restricted to fixed, discrete spectral outputs, thereby precluding continuous spectral reconstruction or spectral super-resolution. To address this challenge, we propose Phy-CoSF, which synergizes deep unfolding networks with implicit neural representations, establishing a new paradigm for continuous spectral reconstruction and super-resolution in CASSI. Specifically, we propose a two-phase architecture that bridges discrete-wavelength training with continuous spectral rendering, enabling the synthesis of high-fidelity HSIs at arbitrary target wavelengths. At the core of our framework lies the continuous spectral fields (CoSF) module, embedded within each unfolding stage as a dynamic prior, which comprises a triple-branch cross-domain feature mixer for comprehensive spatial-frequency-channel feature fusion, alongside a spectral synthesis head that generates spectral intensities by querying continuous wavelength coordinates. Extensive experimental results demonstrate that Phy-CoSF not only achieves continuous modeling at arbitrary spectral resolutions but also outperforms many state-of-the-art methods in both reconstruction fidelity and spectral detail preservation. Our code and more results are available at: https://github.com/PaiDii/Phy-CoSF.git.

2605.13581 2026-05-14 cs.CV

HIR-ALIGN: Enhancing Hyperspectral Image Restoration via Diffusion-Based Data Generation

Li Pang, Heng Zhao, Yijia Zhang, Deyu Meng, Xiangyong Cao

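AI summary Hyperspectral image (HSI) restoration has to cope with noise, blur, and resolution loss in practice, and existing models perform poorly on target-domain data that lack clean references. This paper proposes the HIR-ALIGN framework, which uses a diffusion model to generate synthetic data matching the target-domain distribution and thereby improves restoration. The method comprises three stages: proxy generation, distribution-adaptive synthesis, and aligned supervised finetuning. It improves restoration performance on the target domain, with experimental results on denoising and super-resolution tasks that surpass existing methods.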

Details
English abstract

Hyperspectral image (HSI) restoration is crucial for reliable analysis, as real HSIs suffer from degradations like noise, blur, and resolution loss. However, existing models trained on source data often fail on target domains lacking clean references, a common occurrence in practice. To address this issue, we present HIR-ALIGN, a plug-and-play target-adaptive augmentation framework that enhances hyperspectral image restoration by augmenting limited training images with synthetic data that closely matches the target distribution using no extra data. It consists of three stages: (i) proxy generation, where off-the-shelf restoration models restore degraded target observations to produce semantics-preserving proxy HSIs that approximate target-domain clean images; (ii) distribution-adaptive synthesis, where a blur-robust unCLIP diffusion model generates target-aligned RGBs from proxy RGBs, with prompt conditioning and embedding-space noise initialization. Then, a warp-based spectral transfer module synthesizes HSIs by aligning each generated RGB with the proxy RGB, estimating soft patch-wise transport weights, and applying these weights and learnable local interpolation kernels to the proxy HSI; and (iii) aligned supervised finetuning, where restoration networks pretrained on the source distribution are finetuned using both the proxy HSIs and synthesized target-aligned HSIs, and are then deployed on degraded target images. We further provide theoretical analysis showing that augmentation-based finetuning can achieve lower target-domain restoration risk by jointly improving target distribution coverage and controlling spectral bias. Extensive experiments on simulated and real datasets across denoising and super-resolution tasks demonstrate that HIR-ALIGN consistently improves source-only supervised baselines, outperforming both source-only counterparts and representative unsupervised methods.

2605.13579 2026-05-14 cs.AI

Position: Assistive Agents Need Accessibility Alignment

Jie Hu, Changyuan Yan, Yu Zheng, Ziqian Wang, Jiaming Zhang

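AI summary This paper examines the accessibility alignment problem facing assistive agents designed for blind and visually impaired users. It notes that most current agent systems are designed and evaluated under the interaction assumptions of sighted users, which leads to frequent failures in assistive scenarios. An analysis of 778 assistive task instances reveals mismatches between current agents and the needs of visually impaired users in terms of verification, risk, and interaction constraints. The paper argues that accessibility should be treated as an alignment problem, introduces the concept of accessibility alignment, and lays out a lifecycle design pipeline spanning user research, system design, deployment, and iteration, pushing toward more inclusive agent design.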

Comments 9 pages, 1 figure, Accepted to ICML 2026

Details
English abstract

Assistive agents for Blind and Visually Impaired (BVI) users require accessibility alignment as a first-class design objective. Despite rapid progress in agentic AI, most systems are designed and evaluated under assumptions of sighted interaction, low-cost verification, and tolerable trial-and-error, leading to systematic failures in assistive scenarios that cannot be resolved by model scaling or post-hoc interface adaptations alone. Drawing on an analysis of 778 assistance task instances from prior work, we show that current agentic AI systems remain prone to failure in assistive scenarios due to mismatches between sighted-user design assumptions and the verification, risk, and interaction constraints faced by BVI users. We argue that accessibility should be treated as an alignment problem rather than a peripheral usability concern. To this end, we introduce accessibility alignment and propose a lifecycle-oriented design pipeline for accessibility-aligned assistive agents, spanning user research, system design, deployment and post-deployment iteration. We conclude that BVI-centered assistive tasks provide a critical stress test for agentic AI and motivate a broader shift toward inclusive agent design.

2605.13570 2026-05-14 cs.AI cs.LG

Learning Local Constraints for Reinforcement-Learned Content Generators

Debosmita Bhaumik, Julian Togelius, Georgios N. Yannakakis, Ahmed Khalifa

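AI summary This paper studies how to combine constraint-based game content generation methods (such as Wave Function Collapse) with reinforcement-learning generators so that generated content is both locally visually plausible and globally playable. The authors apply the local constraints learned by WFC to the action space of the reinforcement-learning generator, so that the generator follows local rules while satisfying global properties. Experiments show that, with appropriate tuning, the hybrid method can generate visually pleasing and playable puzzle-platform game levels, such as those of Lode Runner.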

Details
English abstract

Constraint-based game content generators that learn local constraints from existing content, such as Wave Function Collapse (WFC), can generate visually satisfying game levels but face challenges in guaranteeing global properties, such as playability. On the other hand, reinforcement-learning trained generators can guarantee global properties -- because such properties can easily be included in reward functions -- but the results can be visually dissatisfying. In this paper, we explore ways to combine these methods. Specifically, we constrain the action space of a PCGRL generator with constraints learned by WFC, effectively allowing the PCGRL generator to achieve global properties while forced to adhere to local constraints. To better analyze how this hybrid content generation method operates, we vary the number and type of inputs, and we test whether to randomly collapse the starting state and exclude rare patterns. While the method is sensitive to hyperparameter tuning, the best of our trained generators produce visually satisfying and playable puzzle-platform game levels -- such as Lode Runner levels -- with desired global properties.
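
As a rough illustration of constraining a generator's action space with learned local constraints, the sketch below learns which tile pairs may sit next to each other in an example level and then filters the tiles an agent is allowed to place at a cell. It is a simplified stand-in, not the paper's method: it uses plain pairwise adjacency rather than WFC's overlapping patterns, and the names `learn_adjacency` and `valid_actions` are hypothetical.

```python
def learn_adjacency(level):
    """Collect the (tile, neighbour) pairs seen in a rectangular example level."""
    right, down = set(), set()
    for y, row in enumerate(level):
        for x, tile in enumerate(row):
            if x + 1 < len(row):
                right.add((tile, row[x + 1]))
            if y + 1 < len(level):
                down.add((tile, level[y + 1][x]))
    return {"right": right, "down": down}


def valid_actions(grid, x, y, tiles, allowed):
    """Return the tiles that can be placed at (x, y) without creating an
    adjacency that never occurred in the example. None marks an empty cell."""
    def cell(cx, cy):
        if 0 <= cy < len(grid) and 0 <= cx < len(grid[cy]):
            return grid[cy][cx]
        return None  # outside the grid counts as unconstrained

    left, right = cell(x - 1, y), cell(x + 1, y)
    up, down = cell(x, y - 1), cell(x, y + 1)
    ok = []
    for t in tiles:
        checks = [
            left is None or (left, t) in allowed["right"],
            right is None or (t, right) in allowed["right"],
            up is None or (up, t) in allowed["down"],
            down is None or (t, down) in allowed["down"],
        ]
        if all(checks):
            ok.append(t)
    return ok
```

In a reinforcement-learning loop, the list returned by `valid_actions` would serve as an action mask, so the reward only needs to encode global properties such as playability.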

2605.13568 2026-05-14 cs.LG cs.AI

Dynamical Predictive Modelling of Cardiovascular Disease Progression Post-Myocardial Infarction via ECG-Trained Artificial Intelligence Model

Riccardo Cavarra, Lupo Lovatelli, Shaheim Ogbomo-Harmitt, Shahid Aziz, Adelaide De Vecchi, Andrew King, Oleg Aslanidi

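AI summary This study aims to use electrocardiogram (ECG) data to predict the progression of cardiovascular disease after myocardial infarction (MI). It proposes a pretrained AI model that combines patient-specific temporal information, learned through contrastive learning, with supervised multitask heads, and is fine-tuned on a small amount of labelled data to improve predictive performance. Experiments show that the model outperforms a model trained from scratch under limited data, demonstrating the effectiveness of clinically structured ECG modelling for predicting disease progression.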

Comments submitted to the 9th International Conference on Computational and Mathematical Biomedical Engineering, 4 pages, 1 figure, 1 table

Details
English abstract

Myocardial infarction (MI) is a leading cause of death, and predicting its adverse outcomes is an urgent need. Yet ECG-based prognostic models underperform because deep learning requires large, labelled datasets, which are scarce in medicine. Foundation models can learn from unlabelled ECGs via self-supervision, but medically relevant training strategies remain underexplored. We propose a pretrained artificial intelligence model that combines patient-specific temporal information using contrastive learning with supervised multitask heads, then fine-tunes on post-MI outcome prediction. The proposed model outperformed a model trained from scratch (0.794 vs 0.608 AUC), showing that clinically structured ECG modelling improves classification in limited data regimes.

2605.13566 2026-05-14 cs.LG

Spatiotemporal downscaling and nowcasting of urban land surface temperatures with deep neural networks

Solomiia Kurchaba, Angela Meyer

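AI summary This study addresses the trade-off between spatial and temporal resolution in urban land surface temperature (LST) data. It combines geostationary and polar-orbiting satellite data and uses deep neural networks to produce LST fields at high spatiotemporal resolution (1 km, 15 minutes). The study presents a U-Net-based model that maps low-resolution LST data to high-resolution data, and then builds a ConvLSTM-based LST nowcasting model that forecasts 15 to 75 minutes ahead. The nowcasting model clearly outperforms conventional benchmark methods, with high accuracy and stability, and can be applied to operational satellite LST monitoring.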

Details
English abstract

Land Surface Temperature (LST) is a key variable for various applications, such as urban climate and ecology studies. Yet, existing satellite-derived LST products provide either high spatial or high temporal resolution, resulting in a fundamental trade-off between the two. To address this trade-off, we combine observations from a geostationary and a polar orbiting satellite and provide LST fields at high spatial and high temporal resolution (1 km at 15-min intervals). We demonstrate their application for intraday forecasting of LSTs. To estimate LST fields at high spatiotemporal resolution, a U-Net model is trained to map LST fields from SEVIRI/MSG (3 km and 15 min resolution) to LST fields from Terra/Aqua MODIS (1 km, 4 overpasses per day) that are collocated in space and time. The presented model has been trained on LSTs across large European cities with a population exceeding 1 million inhabitants, and achieves an RMSE of $1.92$°C and a near-zero bias (MBE = $0.01$°C) on the hold-out test set. As a second step, we present an LST nowcasting model based on a ConvLSTM architecture, trained across downscaled LST fields with forecast lead times of 15 to 75 minutes. The nowcasting model outperforms both a persistence benchmark and a Climatological Rolling Median benchmark, with RMSEs of $0.57$ to $1.15$°C for the considered lead times and biases ranging from $-0.1$ to $0.14$°C. An additional validation conducted against independent MODIS overpasses confirms robust performance. Our LST forecast model at high spatiotemporal resolution is directly applicable to operational satellite-based LST monitoring.
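
For readers unfamiliar with the evaluation terms, the following sketch shows how RMSE, mean bias error (MBE), and a persistence benchmark are commonly computed for a single time series. It is an illustrative assumption rather than the authors' evaluation code: the function names are hypothetical, the bias is taken as prediction minus observation, and `lead_steps` counts 15-minute steps (so 1 to 5 steps would span 15 to 75 minutes).

```python
import math


def rmse(pred, obs):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def mbe(pred, obs):
    """Mean bias error, with the sign convention prediction minus observation."""
    return sum(p - o for p, o in zip(pred, obs)) / len(obs)


def persistence_forecast(series, lead_steps):
    """Persistence benchmark: the forecast for time t + lead_steps is the
    value observed at time t. Returns aligned (predictions, observations).
    lead_steps must be at least 1."""
    if lead_steps < 1:
        raise ValueError("lead_steps must be >= 1")
    return series[:-lead_steps], series[lead_steps:]
```

A learned nowcasting model is only useful if its RMSE at a given lead time is lower than what `persistence_forecast` achieves on the same data.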

2605.13565 2026-05-14 cs.CV

Qwen-Image-VAE-2.0 Technical Report

Zekai Zhang, Deqing Li, Kuan Cao, Yujia Wu, Chenfei Wu, Yu Wu, Liang Peng, Hao Meng, Jiahao Li, Jie Zhang, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Shengming Yin, Tianhe Wu, Xiao Xu, Xiaoyue Chen, Yan Shu, Yanran Zhang, Yilei Chen, Yixian Xu, Yuxiang Chen, Zhendong Wang, Zihao Liu, Zikai Zhou, Yiliang Gu, Yi Wang, Xiaoxiao Xu, Lin Qu

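AI summary This paper introduces Qwen-Image-VAE-2.0, a suite of high-compression variational autoencoders (VAEs) that make significant advances in reconstruction fidelity and diffusability. By introducing Global Skip Connections and expanding the latent channels, the model addresses the reconstruction bottleneck under high compression, and large-scale image training combined with a synthetic rendering engine improves performance in text-rich scenarios. The work also proposes an enhanced semantic alignment strategy to improve convergence in the high-dimensional latent space, and adopts an asymmetric, attention-free encoder-decoder architecture to improve computational efficiency. Experiments show that the model reaches state-of-the-art results on several benchmarks, with especially strong reconstruction and diffusability at high compression ratios.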

Details
English abstract

We present Qwen-Image-VAE-2.0, a suite of high-compression Variational Autoencoders (VAEs) that achieve significant advances in both reconstruction fidelity and diffusability. To address the reconstruction bottlenecks of high compression, we adopt an improved architecture featuring Global Skip Connections (GSC) and expanded latent channels. Moreover, we scale training to billions of images and incorporate a synthetic rendering engine to improve performance in text-rich scenarios. To tackle the convergence challenges of high-dimensional latent space, we implement an enhanced semantic alignment strategy to make the latent space highly amenable to diffusion modeling. To optimize computational efficiency, we leverage an asymmetric and attention-free encoder-decoder backbone to minimize encoding overhead. We present a comprehensive evaluation of Qwen-Image-VAE-2.0 on public reconstruction benchmarks. To evaluate performance in text-rich scenarios, we propose OmniDoc-TokenBench, a new benchmark comprising a diverse collection of real-world documents coupled with specialized OCR-based evaluation metrics. Qwen-Image-VAE-2.0 achieves state-of-the-art reconstruction performance, demonstrating exceptional capabilities in both general domains and text-rich scenarios at high compression ratios. Furthermore, downstream DiT experiments reveal that our models possess superior diffusability, significantly accelerating convergence compared to existing high-compression baselines. These results establish Qwen-Image-VAE-2.0 as a leading model with high compression, superior reconstruction, and exceptional diffusability.

2605.13560 2026-05-14 cs.LG

Uncertainty-Aware Prediction of Lung Tumor Growth from Sparse Longitudinal CT Data via Bayesian Physics-Informed Neural Networks

Lingfei Kong, Haoran Ma

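AI summary This paper studies how to predict lung tumor growth from sparse and irregular longitudinal CT data while accounting for measurement error. It proposes a physics-informed neural network method that combines the Gompertz growth model with Bayesian inference, performing low-dimensional Bayesian estimation in the log-volume domain and using a two-stage inference strategy (maximum a posteriori estimation followed by Hamiltonian Monte Carlo sampling) to estimate predictive distributions and uncertainty intervals. The method was validated on the National Lung Screening Trial dataset. The results show that it captures heterogeneous tumor growth patterns and provides well-calibrated uncertainty estimates from only a few observations, suggesting potential clinical usefulness.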

Comments 8 pages, 15 figures

Details
English abstract

This work studies lung tumor growth prediction from sparse and irregular longitudinal computed tomography (CT) observations with measurement variability. A Bayesian physics-informed neural network is developed by combining Gompertz growth dynamics with low-dimensional Bayesian inference in the log-volume domain. The framework employs a two-stage inference strategy combining maximum a posteriori (MAP) estimation and Hamiltonian Monte Carlo (HMC) sampling to estimate posterior predictive distributions and uncertainty intervals. The method was evaluated on longitudinal data from the National Lung Screening Trial (30 patients). Results show that the model captures heterogeneous tumor growth patterns while maintaining reasonable prediction accuracy under limited observations. Compared with deterministic modeling approaches, the proposed approach additionally provides calibrated uncertainty estimates. The inferred posterior parameter correlations were consistent with expected biological growth behavior. The proposed framework achieved a cohort-level log-space RMSE of approximately 0.20 together with well-calibrated 95% credible interval coverage across 30 patients. These findings suggest that Bayesian physics-informed modeling may be useful for uncertainty-aware tumor growth assessment when only limited longitudinal follow-up scans are available.
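
To make the phrase "Gompertz growth dynamics in the log-volume domain" concrete, here is a small sketch under a standard textbook parameterization, which is an assumption since the abstract does not state the one used. With $y = \ln V$, the Gompertz equation $dV/dt = \alpha V \ln(K/V)$ becomes the linear ODE $dy/dt = \alpha(\ln K - y)$, whose closed-form solution is coded below. The names `gompertz_log_volume`, `log_k`, and `alpha` are hypothetical.

```python
import math


def gompertz_log_volume(t, y0, log_k, alpha):
    """Log-volume y(t) = ln V(t) under Gompertz growth.

    Solves dy/dt = alpha * (log_k - y) with y(0) = y0, where log_k is the
    log of the carrying capacity K and alpha > 0 is the growth-rate decay.
    """
    return log_k + (y0 - log_k) * math.exp(-alpha * t)


def gompertz_volume(t, v0, k, alpha):
    """Volume V(t) obtained by exponentiating the log-volume solution."""
    return math.exp(gompertz_log_volume(t, math.log(v0), math.log(k), alpha))
```

Working in log-volume leaves only a few parameters (here `y0`, `log_k`, `alpha`) to infer, which is what makes low-dimensional Bayesian inference with MAP estimation and HMC sampling practical when each patient has only a handful of scans.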