arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.21157 2026-05-21 cs.CV cs.AI cs.LG cs.RO

Comparative Analysis of Military Detection Using Drone Imagery Across Multiple Visual Spectrums

Sourov Roy Shuvo, Prajwal Panth, Rajesh Chowdhury, Sorup Chakraborty, Sudip Chakrabarty, Prasant Kumar Pattnaik

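AI summary: This paper studies military object detection in drone imagery under different spectral conditions. It builds four types of datasets (Gray Scale, Thermal Vision, Night Vision, and Obscura Vision) to evaluate model performance in different environments, and trains a YOLOv11-small model on them to improve the performance and reliability of drone operations.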

Comments
6 pages, 7 figures. Accepted at the 16th International Conference on Computing, Communication and Networking Technologies (ICCCNT), July 6-11, 2025, IIT Indore. Proceedings pending publication
Abstract

In modern warfare, drones are becoming an essential part of intelligence gathering and precision strikes in many kinds of hostile environments. Their ability to operate in real time in hostile environments from a safe distance makes them invaluable for surveillance and military operations. The KIIT-MiTA dataset comprises drone images of different military scenarios and provides a foundation for detecting military objects, but it does not account for the range of conditions found in the real world. To evaluate how models perform under varying conditions, four types of datasets are created: Gray Scale, Thermal Vision, Night Vision, and Obscura Vision. These simulate real-world conditions such as low visibility, heat-based imagery, and nighttime operation. A YOLOv11-small model is trained and used to detect objects across these settings. This research contributes to the development of advanced detection systems for both defensive and offensive missions, improving the performance and reliability of drone-based operations.

2605.21154 2026-05-21 cs.CL cs.AI cs.LG

Automated ICD Classification of Psychiatric Diagnoses: From Classical NLP to Large Language Models

Fernando Ortega, Raúl Lara-Cabrera, Jorge Dueñas-Lerín, Alejandro de la Torre-Luque, Mercé Salvador Robert, Enrique Baca-García

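AI summary: This study uses NLP and machine learning to map free-text descriptions to the International Classification of Diseases (ICD) in order to automate psychiatric diagnostic analysis. It evaluates text representations ranging from classical frequency-based models to state-of-the-art large language models and shows that transformer embeddings are better at capturing implicit semantic cues and nuanced medical terminology.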

Abstract

Mental health has become a global priority, leading to a massive administrative burden in the coding of clinical diagnoses. This study proposes automating psychiatric diagnostic analysis by mapping free-text descriptions to the International Classification of Diseases (ICD) using Natural Language Processing (NLP) and Machine Learning (ML) techniques. Using a specialized dataset of 145,513 Spanish psychiatric descriptions, various text representation paradigms were evaluated, ranging from classical frequency-based models (BoW, TF-IDF) to state-of-the-art Large Language Models (LLMs) such as e5_large, BioLORD, and Llama-3-8B. Results indicate that transformer-based embeddings consistently outperform traditional methods by capturing implicit semantic cues and nuanced medical terminology. The e5_large model, through end-to-end fine-tuning, achieved the highest performance with an $F1_{micro}$ score of 0.866. This research demonstrates that adapting LLMs to specific clinical nomenclature is essential for overcoming the challenges of "long-tail" label distributions and the inherent ambiguity of psychiatric discourse.
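
The abstract above reports a micro-averaged F1 score. As a reminder of what that metric aggregates, here is a minimal sketch that pools true positives, false positives, and false negatives over all descriptions before computing F1. The function name `micro_f1`, the set-of-codes input format, and the example ICD codes are illustrative assumptions; the abstract does not say whether the task is single-label or multi-label, nor how the authors computed the score.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over a list of examples.

    y_true and y_pred are lists of sets of codes, one set per description
    (hypothetical input format chosen for this sketch).
    """
    tp = fp = fn = 0
    for true_codes, pred_codes in zip(y_true, y_pred):
        tp += len(true_codes & pred_codes)   # codes predicted and correct
        fp += len(pred_codes - true_codes)   # codes predicted but wrong
        fn += len(true_codes - pred_codes)   # codes missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Illustrative labels only; not taken from the paper's dataset.
y_true = [{"F32"}, {"F41", "F10"}, {"F20"}]
y_pred = [{"F32"}, {"F41"}, {"F31"}]
score = micro_f1(y_true, y_pred)  # 2 TP, 1 FP, 2 FN -> 4/7
```

Because the counts are pooled before the ratio is taken, frequent codes dominate the score, which is worth keeping in mind when labels follow a long-tail distribution.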

2605.21150 2026-05-21 cs.RO

EllipseLIO: Adaptive LiDAR Inertial Odometry with an Ellipsoid Representation

Rowan Border, Margarita Chli

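AI summary: This paper presents EllipseLIO, a real-time LiDAR inertial odometry method built on an ellipsoid representation. Its adaptive LiDAR scan filtering and registration give robust odometry across different environments and sensors, and experiments show that it performs best overall across a range of challenging scenarios.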

Comments
8 pages, 6 figures, 2 tables
Abstract

LiDAR Inertial Odometry (LIO) is a critical component for many mobile robots that need to navigate without relying on external positioning (e.g., GPS). Platforms that operate autonomously in different environments and with heterogeneous LiDAR sensors require a LIO approach that can adapt to these different scenarios without human intervention. Existing LIO approaches can typically provide reliable and accurate odometry in scenarios with similar environments and sensors when suitably tuned. However, many approaches struggle to retain robust odometry across heterogeneous environments and sensors while using a consistent configuration. This paper presents EllipseLIO, a real-time LIO approach that generalises between scenarios by using methods for LiDAR scan filtering and registration that adapt to the sensor capabilities and environment without requiring scenario-specific tuning. Experiments with EllipseLIO and state-of-the-art LIO approaches on five datasets with diverse and challenging scenarios demonstrate that EllipseLIO is the best-performing approach overall. It achieves a 38% lower odometry error on average than the second-best approach and is the only approach that does not diverge in any experiment. An open-source version of EllipseLIO will be available at github.com/v4rl-ucy/ellipselio.

2605.21147 2026-05-21 cs.LG cs.CL

SMoA: Spectrum Modulation Adapter for Parameter-Efficient Fine-Tuning

Yongkang Liu, Xing Li, Mengjie Zhao, Shanru Zhang, Zijing Wang, Qian Li, Shi Feng, Feiliang Ren, Daling Wang, Hinrich Schütze

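AI summary: This paper proposes SMoA, a spectrum modulation adapter that enlarges the accessible family of spectrum-aware updates under a smaller parameter budget in order to improve parameter-efficient fine-tuning.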

Abstract

As the number of model parameters increases, parameter-efficient fine-tuning (PEFT) has become the go-to choice for tailoring pre-trained large language models. Low-rank Adaptation (LoRA) uses a low-rank update method to simulate full parameter fine-tuning, which is widely used to reduce resource requirements. However, decreasing the rank encounters challenges with limited representational capacity. Theory suggests that LoRA fine-tuning with rank r converges toward the top r singular values of the pre-trained weight matrix. As the rank increases, more principal singular directions are preserved, which generally improves the model's performance. However, a larger rank also introduces more trainable parameters, leading to higher computational cost. To overcome this dilemma, we propose SMoA, a Spectrum Modulation Adapter that enlarges the accessible family of spectrum-aware updates under a smaller parameter budget. SMoA partitions the layer into multiple aligned spectral blocks and applies one in-block Hadamard-modulated low-rank branch to each diagonal block, yielding broader coverage of pretrained spectral directions. We provide theoretical analysis and empirical results on multiple tasks. In our experiments, SMoA improves average performance in the current lower-budget setting over LoRA and competitive LoRA-style baselines.
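
To make the rank-versus-budget trade-off in the abstract above concrete, the following sketch counts trainable parameters for a standard LoRA update and for a plain block-diagonal low-rank layout. The block-diagonal variant is an assumption made for illustration only: it ignores the Hadamard modulation and the spectral alignment described in the abstract, so it should not be read as SMoA's actual construction. The layer size and function names are hypothetical.

```python
def lora_params(d_out, d_in, rank):
    """Trainable parameters of a LoRA update B @ A,
    with B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return rank * (d_out + d_in)


def block_diag_lowrank(d_out, d_in, rank, blocks):
    """Parameter count and maximum achievable rank when each of `blocks`
    diagonal blocks gets its own rank-`rank` factor pair.
    Illustrative layout only, not the paper's method."""
    if d_out % blocks or d_in % blocks:
        raise ValueError("dimensions must be divisible by the number of blocks")
    rows, cols = d_out // blocks, d_in // blocks
    params = blocks * rank * (rows + cols)
    max_rank = blocks * min(rank, rows, cols)
    return params, max_rank


d = 4096  # hypothetical square layer
full = d * d
lora = lora_params(d, d, rank=16)
block_params, block_rank = block_diag_lowrank(d, d, rank=16, blocks=4)
```

In this toy setting the block-diagonal layout uses the same number of parameters as rank-16 LoRA but can reach rank 64, which gives one intuition for how a blocked design might cover more directions under a fixed budget.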

2605.21138 2026-05-21 cs.RO

Safety-Critical Control for Smoothed Implicit Contact Dynamics

Haegu Lee, Yitaek Kim, Christoffer Sloth

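AI summary: This paper presents a method for bounding contact forces under smoothed implicit contact dynamics, combining boundary-focused rollouts with a discrete-time control barrier function framework to improve safety.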

Abstract

Smoothed implicit contact dynamics enables gradient-based planning and control for contact-rich tasks without predefined mode sequences. However, safety-critical control remains challenging because implicit contact dynamics makes safety-filter design nontrivial. The smoothing parameter $κ$ relaxes contact complementarity constraints, which makes the dynamics smooth but affects the contact force. This paper provides a method for bounding the actual contact force despite the use of relaxed complementarity constraints. We show that constraint violations can be non-monotonic in $κ$. Smaller $κ$ reduces force-approximation error, but it does not necessarily improve safety performance. To address this issue, we introduce boundary-focused rollouts to screen $κ$ by comparing the safety margin with the approximation error. We then develop a discrete-time control barrier function (CBF) framework based on a first-order Taylor approximation of the implicitly defined contact force. To account for possible force under-prediction, we augment the resulting safety constraint with a fixed robust margin. Simulations on four contact-rich systems show that the proposed method eliminates force violations observed under a standard CBF.
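
The abstract above mentions a discrete-time CBF constraint tightened by a fixed robust margin. A minimal sketch of that idea, using the common discrete-time condition $h(x_{k+1}) \ge (1-\gamma)\,h(x_k)$ with an added margin, is given below. The barrier choice `force_max - force`, the parameter names, and the numbers are assumptions for illustration; the paper's constraint is built on a first-order Taylor approximation of the implicit contact force, which is not reproduced here.

```python
def force_barrier(force, force_max):
    """Barrier value h; h >= 0 exactly when the contact force respects the limit."""
    return force_max - force


def dcbf_satisfied(h_now, h_next, gamma, margin=0.0):
    """Check the discrete-time CBF condition
    h(x_{k+1}) >= (1 - gamma) * h(x_k) + margin.
    A positive margin tightens the constraint to absorb force under-prediction."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    return h_next >= (1.0 - gamma) * h_now + margin


f_max = 10.0                          # hypothetical force limit
h_now = force_barrier(6.0, f_max)     # 4.0
h_next = force_barrier(8.5, f_max)    # 1.5, from a predicted next-step force
ok_plain = dcbf_satisfied(h_now, h_next, gamma=0.7)
ok_robust = dcbf_satisfied(h_now, h_next, gamma=0.7, margin=0.5)
```

Here the predicted step passes the plain condition but fails once the margin is added, which is the intended effect: a step that is only marginally safe under the predicted force is rejected in case the true force is larger.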

2605.21135 2026-05-21 cs.CL

Smarter edits? Post-editing with error highlights and translation suggestions

Fleur V. J. van Tellingen, Gautam Ranka, Dora Žugčić, Joyce van der Wal, Andrea Camasta, Livio Guerra, Alina Karakanta

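AI summary: This paper studies how useful error highlights and correction suggestions based on automatic post-editing (APE) are in post-editing. No condition improved productivity or quality over regular post-editing, but APE highlights were better received than QE-derived highlights, and correction suggestions improved the overall user experience.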

Comments
Accepted at EAMT 2026
Abstract

As MT quality increases, interest in enhanced post-editing features such as QE-derived error highlights is growing, yet evidence for their usefulness remains limited. In this work, we explore the usefulness of LLM-derived error highlights and correction suggestions based on automatic post-editing (APE). We conduct a study where professional translators (En-Nl) post-edit translations using APE error highlights and correction suggestions and compare productivity, quality and user experience to regular PE and PE with QE-derived highlights. While no condition yielded productivity or quality gains compared to regular PE, APE highlights were better received than QE-derived highlights, and correction suggestions improved overall user experience.

2605.21133 2026-05-21 cs.RO

Humanoid Whole-Body Manipulation via Active Spatial Brain and Generalizable Action Cerebellum

Zhizhao Liang, Yi-Lin Wei, Xuhang Chen, Mu Lin, Yi-Xiang He, Zhexi Luo, Jun-Hui Liu, Kun-Yu Lin, Wei-Shi Zheng

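AI summary: This paper proposes a generalizable humanoid loco-manipulation framework that uses an Active Spatial Brain and a Generalizable Action Cerebellum to address two problems, difficult spatial understanding in complex 3D environments and action generation that generalizes poorly, and reports strong performance across diverse tasks and environments.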

Comments
Project page: https://leungchaos.github.io/Humanoid-Whole-Body-Manipulation-via-Active-Spatial-Brain-and-Generalizable-Action-Cerebellum/
Abstract

In this paper, we explore the task of spatially aware humanoid whole-body manipulation. Compared with tabletop settings, this task poses two key challenges: 1) Spatial understanding is challenging in complex 3D environments with diverse spatial relations. 2) Action generation is difficult to generalize, as limited and costly real-robot data restricts the generalization of data-driven models. To address these challenges, we propose a generalizable humanoid loco-manipulation framework that leverages the spatial perception and action generation capabilities of multi-agent large models. Specifically, our framework includes two components: Active Spatial Brain for active spatial perception and decision-making, and Generalizable Action Cerebellum for executable robot action generation. The first component actively perceives the spatial scene and makes decisions on task planning and subtask decomposition. The second component generates executable robot actions based on the decisions made by the first component, without needing task-specific real-robot data. To benchmark our framework, we design a set of spatial manipulation tasks from two perspectives: evaluating spatial perception and understanding, and assessing real-robot task performance. The results demonstrate strong performance in both respects across diverse tasks and environments.

2605.21132 2026-05-21 cs.CV

SurgOnAir: Hierarchy-Aware Real-Time Surgical Video Commentary

Jingyi He, Yue Zhou, Long Bai, Kun Yuan, Nassir Navab, Yuan Bi

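AI summary: This study proposes SurgOnAir, a streaming vision-language model trained on a hierarchical dataset to understand the surgical workflow at multiple levels in real time and generate commentary, improving responsiveness during surgery.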

Abstract

Understanding surgical workflow in real time is fundamental for intelligent surgical embodiment, where AI systems continuously perceive and respond as surgery proceeds. In the operating room, critical decisions depend on subtle, moment-to-moment changes, such as fine instrument movements and evolving tissue states, where even slight perceptual delays can limit assistance or compromise safety. Yet existing methods remain offline or operate at coarse temporal scales, generating descriptions only after processing clips, preventing immediate reaction. We address this by proposing SurgOnAir, a streaming vision-language model that processes frames sequentially without future access and progressively generates narration tokens as visual input arrives. SurgOnAir achieves fine-grained frame-to-token generation, enabling instant responsiveness to evolving surgical dynamics. Built upon our curated hierarchical dataset SurgOnAir-11k spanning action-, step-, and phase-level supervision, the model is trained to produce multi-level textual responses that reflect the inherent hierarchy of surgical procedures. Furthermore, special transition tokens are generated to explicitly mark state changes, allowing SurgOnAir to capture and signal key workflow transitions as they occur. Experiments show that SurgOnAir enables real-time understanding through a single vision-language model that unifies streaming across multiple hierarchies of the surgical workflow, generating superior and hierarchy-aware narrations. Code and dataset will be public.

2605.21131 2026-05-21 cs.CV

UniT: Unified Geometry Learning with Group Autoregressive Transformer

Haotian Wang, Yusong Huang, Zhaonian Kuang, Hongliang Lu, Xinhu Zheng, Meng Yang, Gang Hua

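AI summary: This paper proposes UniT, which uses a Group Autoregressive Transformer to unify several geometry perception capabilities: online perception, offline reconstruction, multi-modal integration, long-horizon scalability, and metric-scale estimation. It also introduces a scale-adaptive geometry loss to improve metric-scale generalization across scenes.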

Comments
Submitted to IEEE T-PAMI
Abstract

Recent feed-forward models have significantly advanced geometry perception for inferring dense 3D structure from sensor observations. However, its essential capabilities remain fragmented across multiple incompatible paradigms, including online perception, offline reconstruction, multi-modal integration, long-horizon scalability, and metric-scale estimation. We present UniT, a unified model built upon a novel Group Autoregressive Transformer, which reformulates these seemingly disparate capabilities within a single framework. The key idea is to treat groups of sensor observations as the basic autoregressive units and predict the corresponding point maps in an anchor-free and scale-adaptive manner. More specifically, diverse view configurations in both online and offline settings are naturally unified within a single group autoregression process. By varying the group size, online mode operates over multiple autoregressive steps with single-frame groups, whereas offline mode aggregates a multi-frame group in a single forward pass. Meanwhile, a queue-style KV caching mechanism ensures bounded autoregressive memory over long horizons. This is enabled by reducing long-range dependencies on early frames through anchor-free relational modeling, thereby allowing outdated memory to be discarded on the fly. To improve metric-scale generalization across scenes, a scale-adaptive geometry loss is further introduced within this framework. It couples relative geometric constraints with a partial absolute scale term, implicitly regularizing global scale and inducing a progressive transition from scale-invariant geometry to metric-scale solutions. Together with a dedicated modal attention module for integrating auxiliary modalities, UniT achieves state-of-the-art performance in unified geometry perception, as validated on ten benchmarks spanning seven representative tasks.
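
The queue-style KV caching mentioned in the abstract above can be pictured as a first-in, first-out buffer with a fixed capacity, so that memory stays bounded however long the sequence runs. The sketch below is a generic illustration using Python's `collections.deque`; the class name `QueueKVCache`, the per-group granularity, and the placeholder key/value payloads are assumptions, not details taken from the paper.

```python
from collections import deque


class QueueKVCache:
    """Bounded FIFO cache: once `max_groups` entries are stored,
    appending a new group silently evicts the oldest one."""

    def __init__(self, max_groups):
        self._entries = deque(maxlen=max_groups)

    def append(self, group_id, keys, values):
        self._entries.append((group_id, keys, values))

    def group_ids(self):
        return [gid for gid, _, _ in self._entries]

    def __len__(self):
        return len(self._entries)


cache = QueueKVCache(max_groups=3)
for step in range(5):
    # Placeholder payloads; a real model would store attention keys and values.
    cache.append(step, keys=[step], values=[step * 10])
remaining = cache.group_ids()
```

After five steps only the three most recent groups remain. Discarding the oldest entries is only safe if later predictions do not depend on them, which is the role the abstract assigns to anchor-free relational modeling.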

2605.21130 2026-05-21 cs.CV

VersusQ: Pairwise Margin Reasoning for Generalizable Video Quality Assessment

Shibei Meng, Binxin Yang, Yuan Liu, Jiexuan Zhang, Zhengyao Lv, Hubery Yin, Qiang Xu

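AI summary: This paper proposes VersusQ, a pairwise margin reasoning framework that compares videos directly to alleviate absolute-scale calibration bias and improve cross-domain video quality assessment.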

Abstract

Large Multimodal Models (LMMs) have shown promise for video quality assessment, but most methods still predict an absolute score for each video. Such pointwise supervision often mixes perceptual quality with dataset-specific calibration, including annotation protocols, rating habits, and score distributions. As a result, the learned scoring rule may work well within a benchmark but transfer poorly across unseen domains. We argue that relative comparisons alleviate the absolute-scale calibration bias by focusing purely on perceptual differences rather than dataset-specific rating habits. Consequently, we propose VersusQ, a pairwise margin reasoning framework driven entirely by direct comparisons. Specifically, VersusQ performs LMM-based comparison between two videos, reasons about their visual and temporal quality differences, and predicts a signed continuous margin that captures both the preferred choice and the degree of difference. Furthermore, to align interpretable comparison rationales with fine-grained numerical differences, we introduce Margin-Coupled GRPO, which jointly optimizes rollout-based relational reasoning and continuous margin regression. Extensive experiments on multiple public VQA benchmarks demonstrate that VersusQ achieves state-of-the-art performance, strong cross-domain generalization, and reliable fine-grained ranking under heterogeneous evaluation scenarios.

2605.21127 2026-05-21 cs.LG

Reasoning-Trace Collapse: Evaluating the Loss of Explicit Reasoning During Fine-Tuning

Lukas Twist, Helen Yannakoudakis, Jie M. Zhang

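AI summary: This paper studies the loss of explicit reasoning during fine-tuning. It introduces a structural evaluation framework that separates answer correctness from reasoning-trace validity and finds that standard supervised fine-tuning can rapidly suppress valid reasoning traces, a failure that answer-only metrics obscure.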

Comments
22 pages, 3 tables, 3 figures
Abstract

Explicit reasoning models are trained to produce intermediate reasoning traces before final answers, but downstream fine-tuning is often performed on ordinary instruction-response data that contains no such traces. We show that this mismatch can induce reasoning-trace collapse: a fine-tuned model continues to produce plausible final answers while losing the structurally valid explicit reasoning traces that made it a reasoning model in the first place. We introduce a structural evaluation framework that separates answer correctness from reasoning-trace validity, measuring valid, empty, missing, and truncated reasoning alongside reasoning-conditioned task performance. Using this framework, we study four open-weight reasoning models and find that standard supervised fine-tuning can rapidly suppress valid reasoning traces, and that answer-only metrics can substantially obscure this failure: in several settings, performance conditional on valid reasoning remains high while the rate of valid reasoning falls sharply. We further show that simple loss-masking strategies can substantially mitigate collapse without requiring teacher-generated reasoning traces. These results suggest that evaluations of fine-tuned reasoning models should report structural reasoning reliability metrics in addition to final-answer performance, especially when adaptation data does not contain explicit reasoning traces.
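
The abstract above separates answer correctness from reasoning-trace validity and names four structural outcomes: valid, empty, missing, and truncated. A rough sketch of such a check follows. It assumes the model wraps its reasoning in `<think>...</think>` tags and treats any non-blank body as valid; both are simplifying assumptions for illustration, and the paper's own validity criteria may well be stricter.

```python
def classify_trace(output, open_tag="<think>", close_tag="</think>"):
    """Label one model output as 'valid', 'empty', 'missing', or 'truncated'.
    The tag names are assumptions; adjust them to the model's chat template."""
    start = output.find(open_tag)
    if start == -1:
        return "missing"                  # no reasoning block at all
    end = output.find(close_tag, start + len(open_tag))
    if end == -1:
        return "truncated"                # block opened but never closed
    body = output[start + len(open_tag):end]
    return "valid" if body.strip() else "empty"


def summarize(samples):
    """samples: list of (model_output, is_correct) pairs."""
    counts = {"valid": 0, "empty": 0, "missing": 0, "truncated": 0}
    correct = valid_correct = 0
    for output, is_correct in samples:
        label = classify_trace(output)
        counts[label] += 1
        correct += bool(is_correct)
        valid_correct += label == "valid" and bool(is_correct)
    total = len(samples)
    return {
        "rates": {k: v / total for k, v in counts.items()} if total else {},
        "accuracy": correct / total if total else None,
        "accuracy_given_valid": (
            valid_correct / counts["valid"] if counts["valid"] else None
        ),
    }


samples = [
    ("<think>2 + 2 = 4</think>4", True),
    ("<think></think>4", True),
    ("4", True),
    ("<think>2 + 2", False),
]
report = summarize(samples)
```

In this toy batch, overall accuracy is 0.75 while only a quarter of the outputs carry a valid trace, which illustrates how an answer-only metric can hide the structural failure.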

2605.21123 2026-05-21 cs.CV cs.LG

Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models

Kesong Li, Yixuan Xu, Kuo-kun Tseng, Weiyi Lu, Kan Liu, Tao Lan

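AI summary: This paper proposes Linear-DPO. It derives a generalized DPO objective covering both diffusion and flow-matching through a unified reverse-time SDE framework, argues that the standard DPO objective is suboptimal for text-to-image generation, and reports qualitative and quantitative gains over existing baselines on diffusion and flow-matching models.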

Comments
Code and models are available at: https://github.com/Whynot0101/Linear-DPO . Work done during an internship at Alibaba Group
Abstract

Direct Preference Optimization (DPO) is successful for alignment in LLMs but still faces challenges in text-to-image generation. Existing studies are confined to denoising diffusion models while overlooking flow-matching, and suffer from an objective mismatch when applying discrete NLP-based DPO to regression-based generative tasks. In this paper, we derive a generalized DPO objective that covers both diffusion and flow-matching via a unified reverse-time SDE framework, and point out from a gradient perspective that the standard DPO objective is suboptimal for text-to-image generation. Consequently, we propose Linear-DPO, which replaces the aggressive sigmoid-based utility function with a sustained linear utility and incorporates an EMA-updated reference model. Qualitative and quantitative experiments on diffusion models (SD1.5, SDXL) and a flow-matching model (SD3-Medium) demonstrate the superiority of our approach over existing baselines.
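
One plausible reading of the "gradient perspective" in the abstract above is that the sigmoid-based DPO loss saturates: for a scaled preference margin $z$, the loss $-\log\sigma(z)$ has gradient magnitude $\sigma(-z)$, which shrinks toward zero as the margin grows, whereas a linear utility $-z$ keeps a constant gradient. The sketch below only illustrates that textbook property. Equating it with the paper's argument, and using $-z$ as the linear utility, are assumptions; the actual Linear-DPO objective is not given in the abstract.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dpo_grad_magnitude(z):
    """|d/dz of -log(sigmoid(z))| = sigmoid(-z); decays as the margin z grows."""
    return sigmoid(-z)


def linear_grad_magnitude(z):
    """|d/dz of -z| = 1 for every z (hypothetical linear utility)."""
    return 1.0


margins = [0.0, 1.0, 5.0]
table = [(z, dpo_grad_magnitude(z), linear_grad_magnitude(z)) for z in margins]
```

At a margin of 5 the sigmoid-based gradient has fallen below 1% of the linear one, so pairs that are already ranked correctly contribute almost nothing to further updates under the standard loss.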

2605.21121 2026-05-21 cs.CV cs.GR

ROAR-3D: Routing Arbitrary Views for High-Fidelity 3D Generation

Hanxiao Sun, Mingxin Yang, Shuhui Yang, Zebin He, Xintong Han, Hongbo Fu, Chunchao Guo, Wenhan Luo

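AI summary: This paper proposes ROAR-3D, which upgrades a pretrained single-view model to accept an arbitrary number of unposed images. A view router and a dual-stream attention design enable efficient multi-view 3D generation, improving generation quality and supporting test-time view scaling.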

Abstract

Single-image-to-3D generative models can now produce high-quality geometry, yet conditioning on a single view inevitably introduces ambiguity about unseen regions. Multi-view conditioning can reduce this ambiguity, but existing methods either require fixed canonical viewpoints or rely on external reconstruction modules that impose heavy training costs and limit generation quality. We observe that pretrained single-view models already possess strong 2D-to-3D grounding that can be reused for multi-view conditioning. However, a closer analysis reveals that their conditioning mechanism entangles orientation control with geometry transfer, two functions that conflict when images from different viewpoints are naively combined. Based on this analysis, we propose ROAR-3D, a lightweight method that upgrades a pretrained single-view model to accept an arbitrary number of unposed images. A token-wise view router assigns each 3D latent token to its most relevant view, implicitly establishing 2D-to-3D correspondences without explicit pose input. A dual-stream attention design preserves the pretrained primary-view behavior while routing auxiliary views through a separate path dedicated to geometric enrichment. An orientation perturbation strategy ensures the auxiliary path learns orientation-independent geometry transfer. These components introduce minimal trainable parameters and add negligible inference overhead relative to the single-view baseline. ROAR-3D achieves state-of-the-art multi-view 3D generation quality and supports test-time view scaling from 1 to 12+ views with consistent improvements.

2605.21114 2026-05-21 cs.LG

A Unified Framework for Uncertainty-Aware Explainable Artificial Intelligence: A Case Study in Power Quality Disturbance Classification

Yinsong Chen, Samson S. Yu, Zhong Li, Chee Peng Lim

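AI summary: This paper proposes a unified framework for uncertainty-aware explainable AI. Using Bayesian neural networks on a power quality disturbance classification task, it captures the variability of the explanation distribution to support uncertainty-aware decision-making.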

Abstract

Post-hoc explainable AI (XAI) methods typically produce deterministic attribution maps, whereas Bayesian neural networks (BNNs) induce a distribution over explanations. Capturing the variability of this distribution is important for uncertainty-aware decision-making. This paper formalises the explanation distribution as the push-forward measure of the BNN posterior through any Lipschitz-continuous attribution operator. It further proposes the uncertainty-aware relevance attribution operator (UA-RAO), a general family of operators that summarises the explanation distribution using the mean, variance, coefficient of variation, quantiles, and set-theoretic aggregation measures. Theoretical support is provided through Monte Carlo accessibility and Wasserstein approximation bounds. The framework is evaluated on a 15-class power quality disturbance (PQD) classification benchmark, comparing three BNN approximations paired with three attribution operators using relevance mass accuracy and intersection-over-union as localisation metrics. Results show that deep ensembles with the mean UA-RAO improve localisation over the deterministic baseline, while other UA-RAO summaries reveal uncertainty patterns absent from point-estimate attributions. Qualitative results on measured signals further suggest that these patterns generalise beyond the synthetic training distribution. The framework is domain-agnostic and can be applied to any BNN paired with a Lipschitz-continuous attribution operator.
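
The summaries listed in the abstract above (mean, variance, coefficient of variation, quantiles) can be estimated from Monte Carlo samples: draw several sets of weights from the approximate posterior, compute one attribution map per draw, and summarise per feature. The sketch below does that with the standard library. The function name, the input layout, the use of the median as the example quantile, and the `eps` guard are assumptions; the set-theoretic aggregations and the paper's exact definitions are not covered.

```python
import statistics


def summarize_explanations(attributions, eps=1e-12):
    """attributions: list of S attribution vectors, one per posterior sample,
    each of the same length. Returns per-feature summary statistics."""
    per_feature = list(zip(*attributions))  # transpose to features x samples
    mean = [statistics.fmean(f) for f in per_feature]
    variance = [statistics.pvariance(f) for f in per_feature]
    # Coefficient of variation: std / |mean|, with eps guarding against zero mean.
    coeff_var = [(v ** 0.5) / (abs(m) + eps) for m, v in zip(mean, variance)]
    median = [statistics.median(f) for f in per_feature]
    return {
        "mean": mean,
        "variance": variance,
        "coeff_of_variation": coeff_var,
        "median": median,
    }


# Three hypothetical posterior draws over a three-feature input.
samples = [
    [0.9, 0.1, 0.0],
    [1.1, 0.3, 0.0],
    [1.0, 0.2, 0.0],
]
summary = summarize_explanations(samples)
```

In this toy input the first two features have the same variance but very different coefficients of variation, showing how a summary other than the mean can flag relevance that is small but unstable.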

2605.21112 2026-05-21 cs.CV

RCGDet3D: Rethinking 4D Radar-Camera Fusion-based 3D Object Detection with Enhanced Radar Feature Encoding

Weiyi Xiong, Bing Zhu

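AI summary: This paper proposes RCGDet3D, which improves radar feature encoding instead of relying on complex multi-modal fusion strategies, achieving higher accuracy and real-time speed in 3D object detection and setting a new benchmark for real-time deployment.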

Abstract

4D automotive radar is indispensable for autonomous driving due to its low cost and robustness, yet its point cloud sparsity challenges 3D object detection. Existing 4D radar-camera fusion methods focus on complex fusion strategies, trading inference speed for marginal gains. This trade-off hinders real-time deployment due to heavy computation on dense feature maps. In contrast, feature extraction from sparse radar points is less time-consuming but remains under-explored. This work uncovers that simply enhancing radar feature extraction can achieve comparable or even higher performance than elaborate fusion modules, while maintaining real-time performance. Based on this finding, we propose RCGDet3D, which centers on radar feature encoding and simplifies multi-modal fusion. Its encoder inherits from the efficient Gaussian Splatting-based Point Gaussian Encoder (PGE) in RadarGaussianDet3D with two key improvements. First, the Ray-centric PGE (R-PGE) predicts Gaussian attributes in ray-aligned coordinate systems before unifying them to Bird's-Eye View (BEV) space, significantly improving geometric consistency and reducing learning difficulty by decoupling the coordinate transformation from representation learning. Second, a Semantic Injection (SI) module incorporates visual cues from images, producing more geometrically accurate and semantically enriched radar features. Experiments on View-of-Delft (VoD) and TJ4DRadSet show that RCGDet3D outperforms state-of-the-art methods in both accuracy and speed, setting a new benchmark for real-time deployment.

2605.21111 2026-05-21 cs.RO cs.SY eess.SY

Benchmarking Empirical and Learning-Based Approaches for Feedforward Steering Control in Autonomous Racing

Georg Jank, Mattia Piccinini, Sebastian Wenk, Phillip Pitschi, Johannes Betz, Boris Lohmann

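AI summary: This paper systematically benchmarks two learning-based and two empirical feedforward steering controllers. The learning-based controllers achieve the lowest prediction error in open-loop evaluation, but in closed-loop tests their path tracking and lap times do not surpass the proposed EHD method, which shows that feedforward strategies need to be evaluated within the complete trajectory planning and control software stack.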

详情
Comments
8 pages, 12 figures, Accepted to be published as part of the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026), Naples, Italy, September 15-18, 2026
AI中文摘要

前馈转向控制是自动驾驶赛车分层控制架构中的关键组成部分。其目标是通过预测车辆的逆横向动力学来减少反馈控制器的转向修正。本文系统地比较了两种基于学习的和两种经验(解析)前馈转向控制器。我们提出了一种基于多项式曲面拟合的新EHD公式,能够以最少的参数捕捉与速度相关的非线性转向行为。我们在基于现实世界阿布扎比自动驾驶赛车联赛(Abu Dhabi Autonomous Racing League)的高保真仿真框架中,使用高保真双轨车辆动力学仿真器测试前馈控制器。开环评估显示,基于学习的控制器实现了最低的预测误差;然而闭环测试显示,这种精度提升并未转化为更好的路径跟踪性能或圈速,即使经过迭代微调后也是如此。相比之下,所提出的EHD方法在整体闭环鲁棒性和圈速方面表现最佳,突显了在完整轨迹规划和控制软件栈中评估前馈策略的必要性。我们的代码可在https://github.com/TUMRT/steering_ff_control获取。

英文摘要

Feedforward steering control is a key component of hierarchical control architectures for autonomous racing. The goal is to reduce steering corrections from the feedback controllers by predicting the vehicle's inverse lateral dynamics. This paper presents a systematic benchmark of two learning-based and two empirical (analytical) feedforward steering controllers. We introduce a new EHD formulation based on a polynomial surface fit that captures velocity-dependent nonlinear steering behavior with minimal parametrization. We test the feedforward controllers in a high-fidelity simulation framework based on the real-world Abu Dhabi Autonomous Racing League competition, using a high-fidelity double-track vehicle dynamics simulator. Open-loop evaluation shows that the learning-based controllers achieve the lowest prediction errors; however, closed-loop testing reveals that this improved accuracy does not translate into superior path tracking performance or lap times, even after iterative fine-tuning. In contrast, the proposed EHD approach achieves the best overall closed-loop robustness and lap time, highlighting the necessity of evaluating feedforward strategies within the complete trajectory planning and control software stack. Our code is available at https://github.com/TUMRT/steering_ff_control.

2605.21109 2026-05-21 cs.RO

Anomaly-Informed Confidence Calibration for Vision-Based Safety Prediction

基于异常的置信度校准用于基于视觉的安全预测

Zhenjiang Mao, Jiawen Wu, Gabriel Wagner, Zhongzheng Zhang, Ivan Ruchkin

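AI summary: This paper proposes an anomaly-informed online calibration method that fuses perceptual and dynamics anomaly scores to improve confidence estimates in vision-based safety prediction, reducing overconfidence under distribution shift while preserving nominal-condition performance.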

详情
AI中文摘要

可靠的置信度估计对于安全部署基于视觉的控制器至关重要,特别是在自动驾驶赛车中,安全预测必须从摄像头图像中推导出来,但现代预测器在测试时面临分布偏移时会变得危险地自信。我们发现现有异常信号中存在一个关键的感知-动态差距:广泛使用的分数,如自编码器重构误差,只能捕捉视觉损坏,却无法捕捉动态异常(例如执行偏差、延迟),其中图像仍然合理而轨迹却恶化。为了解决这个问题,我们提出了一种基于异常的在线校准方法,该方法不重新训练任何模型组件,融合了从世界模型中提取的两个互补的异常分数:一个来自重构误差的感知分数和一个来自epistemic不确定性及控制流统计的动态分数。基于这些融合的分数,一个轻量级的温度缩放校准器利用测试时增强来选择性地减少偏移下的过自信,同时保持正常条件下的性能。在四个未在训练中见过的真实世界异常协议(黑暗、模糊、执行偏差、处理延迟)下的物理DonkeyCar上进行实验,将平均预期校准误差从0.184降低到0.116,比最佳基线提高了37%,而无需修改基础安全预测器。

英文摘要

Reliable confidence estimates are important for safely deploying vision-based controllers in autonomous racing, where safety predictions must be derived from camera images, yet modern predictors become dangerously overconfident under test-time distribution shifts. We identify a critical perception-dynamics gap in existing anomaly signals: widely used scores, such as autoencoder reconstruction error, capture visual corruptions but miss dynamics anomalies (e.g., actuation bias, latency), where images remain plausible while the trajectory degrades. To address this, we propose an Anomaly-Informed Online Calibration approach that, without retraining any model component, fuses two complementary anomaly scores extracted from a world model: a perceptual score from reconstruction error and a dynamics score from epistemic uncertainty and control-stream statistics. Based on these fused scores, a lightweight temperature-scaling calibrator leverages test-time augmentation to selectively reduce overconfidence under shift while preserving nominal-condition performance. Experiments on a physical DonkeyCar under four real-world anomaly protocols unseen during training (darkness, blur, actuation bias, processing latency) reduce average expected calibration error from 0.184 to 0.116, a 37% improvement over the best baseline, without modifying the base safety predictor.
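The calibration metric and the temperature-scaling idea in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the binning scheme, the binary-logit form of the safety predictor, and the `anomaly_temperature` mapping (with its `gain` parameter) are all assumptions introduced here.

```python
import math


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


def scaled_confidence(logit, temperature):
    """Sigmoid of a binary safety logit divided by a temperature."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))


def anomaly_temperature(anomaly_score, gain=2.0):
    """Hypothetical mapping: a larger fused anomaly score gives a larger temperature."""
    return 1.0 + gain * max(anomaly_score, 0.0)


# Four predictions at 85% confidence, only half of them correct: overconfident.
print(expected_calibration_error([0.85] * 4, [1, 1, 0, 0]))
# The same logit is softened when the anomaly score is high.
print(scaled_confidence(2.0, anomaly_temperature(0.0)))
print(scaled_confidence(2.0, anomaly_temperature(1.0)))
```

A temperature above 1 pulls confidences toward 0.5, which is why raising it only when anomaly scores are high can reduce overconfidence under shift without touching nominal predictions. As a side check on the reported numbers, going from 0.184 to 0.116 is a relative reduction of (0.184 - 0.116) / 0.184, roughly 37%.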

2605.21107 2026-05-21 cs.LG stat.ML

Improved Guarantees for Constrained Online Convex Optimization via Self-Contraction

通过自收缩性获得约束在线凸优化的改进保证

Dhruv Sarkar, Abhishek Sinha

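AI summary: This paper proposes a projection-based algorithm that simultaneously achieves O(log T) regret and O(log T) CCV for strongly convex losses, and for convex losses improves the CCV to O(√T) while keeping the optimal O(√T) regret.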

详情
AI中文摘要

我们考虑具有对抗性选择约束的约束在线凸优化(COCO)。在每一轮中,学习者在观察该轮的损失函数和约束函数之前选择动作。目标是相对于满足所有约束的最佳点实现较小的静态遗憾,同时控制累积约束违反(CCV)。对于强凸损失,最先进的算法实现 O(log T) 的遗憾和 O(√(T log T)) 的 CCV。对于凸损失,相应的最佳已知界是 O(√T) 的遗憾和 O(√T log T) 的 CCV。在本文中,我们提出了一种简单的基于投影的算法,对于强凸损失同时实现 O(log T) 的遗憾和 O(log T) 的 CCV,从而在 CCV 方面实现了指数级改进。对于凸损失,我们的算法将 CCV 改进到 O(√T),同时保持最优的 O(√T) 遗憾。我们改进的关键是最近关于自收缩曲线的一个几何结果,该结果本身可能也具有独立的研究价值。

英文摘要

We consider Constrained Online Convex Optimization (COCO) with adversarially chosen constraints. At each round, the learner chooses an action before observing the loss and constraint function for that round. The goal is to achieve small static regret against the best point satisfying all constraints while also controlling cumulative constraint violation ($\mathsf{CCV}$). For strongly convex losses, state-of-the-art algorithms achieve $O(\log T)$ regret and $O(\sqrt{T \log T})$ $\mathsf{CCV}$. The corresponding best-known bounds for convex losses are $O(\sqrt{T})$ regret and $O(\sqrt{T} \log T)$ $\mathsf{CCV}$. In this paper, we give a simple projection-based algorithm that simultaneously achieves $O(\log T)$ regret and $O(\log T)$ $\mathsf{CCV}$ for strongly convex losses, yielding an exponential improvement in the $\mathsf{CCV}$. For convex losses, our algorithm improves the $\mathsf{CCV}$ to $O(\sqrt{T})$ while maintaining the optimal $O(\sqrt{T})$ regret. The key to our improvement is a recent geometric result for self-contracted curves, which may be of independent interest.
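For readers unfamiliar with the two quantities being bounded, they are commonly written as follows in the COCO literature. The abstract does not spell them out, so the notation is an assumption: $f_t$ and $g_t$ denote the round-$t$ loss and constraint functions, $x_t$ the learner's action, $\mathcal{X}$ the action set, and $\mathcal{X}^\star$ the set of points that satisfy every constraint.

```latex
\mathcal{X}^\star = \{\, x \in \mathcal{X} : g_t(x) \le 0 \ \text{for all } t = 1, \dots, T \,\},
\qquad
\mathrm{Regret}_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}^\star} \sum_{t=1}^{T} f_t(x),
\qquad
\mathsf{CCV}_T = \sum_{t=1}^{T} \max\bigl( g_t(x_t),\, 0 \bigr).
```

Under these definitions, the strongly convex result replaces a $\mathsf{CCV}$ bound of order $\sqrt{T \log T}$ with one of order $\log T$, which is the sense in which the improvement is exponential.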

2605.21104 2026-05-21 cs.LG

HORST: Composing Optimizer Geometries for Sparse Transformer Training

HORST:用于稀疏Transformer训练的优化几何组合

Tom Jacobs, Rohan Jain, Rebekka Burkholz

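AI summary: This paper proposes HORST, a modular optimizer that composes optimizer geometries and introduces an L1 sparsity bias through a hyperbolic mirror map, promoting sparsity while maintaining training stability.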

详情
Comments
22 pages, 8 figures
AI中文摘要

稀疏化Transformer仍然是一个根本性挑战,因为标准优化器无法同时促进稀疏性和保持训练稳定性。有效的自适应优化器表现出隐含的L∞偏置,有利于稳定性,但稀疏性需要L1偏置。为了整合稀疏性,我们提出了一种优化器步骤的组合,将其视为非交换算子,以系统的方式分析和结合其优化几何。这导致了HORST(Hyperbolic Operator for Robust Sparse Training),一种模块化优化器,继承自自适应方法的稳定性,同时通过双曲镜像映射引入L1稀疏性偏置。我们的实验表明,HORST在视觉和语言任务上的稀疏Transformer训练中具有实用性。HORST在所有稀疏性水平上都显著优于AdamW基线,特别是在高稀疏性时有显著提升。

英文摘要

Sparsifying transformers remains a fundamental challenge, as standard optimizers fail to simultaneously encourage sparsity and maintain training stability. Effective adaptive optimizers exhibit an implicit $L_{\infty}$ bias favoring stability, yet sparsity requires an $L_1$ bias. To integrate sparsity, we propose a composition of optimizer steps, which we cast as non-commutative operators to analyze and combine their optimization geometry in a principled way. This yields HORST (Hyperbolic Operator for Robust Sparse Training), a modular optimizer that inherits stability from adaptive methods while inducing $L_1$ sparsity bias through a hyperbolic mirror map. Our experiments demonstrate its utility for sparse training of transformers on both vision and language tasks. HORST consistently and significantly outperforms AdamW baselines across all sparsity levels, with large gains at higher sparsity.
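To give a feel for what a hyperbolic mirror map does, here is a small sketch of a mirror descent step under the hyperbolic-entropy regularizer, whose gradient is arcsinh(w / beta). The abstract does not state which map HORST uses or how it is composed with the adaptive step, so this choice, the function name `hyperbolic_mirror_step`, and the parameters `lr` and `beta` are assumptions made for illustration only.

```python
import math


def hyperbolic_mirror_step(weights, grads, lr=0.1, beta=1e-3):
    """One mirror descent step with mirror map gradient arcsinh(w / beta).

    The update is taken in the dual space and mapped back with sinh:
        w_new = beta * sinh(arcsinh(w / beta) - lr * g)
    """
    updated = []
    for w, g in zip(weights, grads):
        dual = math.asinh(w / beta)
        updated.append(beta * math.sinh(dual - lr * g))
    return updated


# For |w| much larger than beta the step is close to multiplicative:
# a weight of 1.0 with gradient 1.0 shrinks by roughly a factor exp(-lr).
print(hyperbolic_mirror_step([1.0, -1.0, 0.0], [1.0, -1.0, 0.0]))
```

When |w| is far above beta the update behaves like exponentiated gradient (multiplicative shrinkage), and when |w| is far below beta it behaves like plain gradient descent with step size lr * beta. That interpolation is the usual reason such maps are associated with sparsity-favoring geometry.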

2605.21103 2026-05-21 cs.LG

A Typed Tensor Language for Federated Learning

一种用于联邦学习的类型化张量语言

Theofilos Mailis, Kalliopi-Christina Despotidou, Konstantinos Filippopolitis, Yannis Foufoulas, Thanasis-Michail Karampatsis, Andreas Ktenidis, Evdokia Mailli, Theodore Papamarkou, Yannis Ioannidis

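AI summary: This paper introduces a typed tensor language that formalizes the common structure of federated learning computations, giving a formal account of them through a shared-state factorization theory and a differentiable fragment.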

详情
AI中文摘要

联邦学习和分析通常被描述为多个独立协议的集合,即使它们共享相同的数学形式:客户端本地张量计算、可合并到共享状态的聚合,以及仅基于共享状态的后处理。我们引入了一种类型化的张量语言,该语言形式化了这种结构。该语言区分了联邦张量(其记录沿被跟踪的记录轴在客户端之间划分)和共享张量(其全局可用)。其语义通过与虚拟全局张量的比较来定义,后者仅用作参考对象。主要结果是共享状态因子分解理论。我们证明了类型化的单轮程序可以通过固定维度的共享状态进行因子分解,该共享状态的大小与客户端和记录的数量无关,由客户端本地张量表达式计算并跨客户端合并。我们还证明了一个逆向的可表示性结果:凡是编码器和解码器可以由该语言表达的因子分解,都可以由类型化的单轮程序实现,并且这种对应关系可以扩展到跨轮状态为共享状态的迭代程序。这给出了该语言中可以表示为编码、合并和解码过程的计算的形式化描述。然后,我们开发了一个用于学习的可微片段。如果每条记录的损失及其梯度由客户端本地张量表达式表示,则全局梯度由联邦梯度张量沿记录轴求和表示。这产生了用于服务器端梯度下降和基于共享线性代数的二阶更新的类型化迭代程序。该框架刻画了一大类通信通过固定维度共享状态进行的联邦学习计算。

英文摘要

Federated learning and analytics are often described as collections of separate protocols, even when they share the same mathematical form: client-local tensor computation, mergeable aggregation into shared state, and shared-only post-processing. We introduce a typed tensor language that formalizes this structure. The language distinguishes federated tensors, whose records are partitioned across clients along a tracked record axis, from shared tensors, which are available globally. Its semantics are defined by comparison with a virtual global tensor, used only as a reference object. The main result is a shared-state factorization theory. We show that typed one-round programs factor through fixed-dimensional shared state whose size is independent of the number of clients and records, computed from client-local tensor expressions and merged across clients. We also prove a converse representability result; factorizations whose encoders and decoders are expressible in the language are realized by typed one-round programs, and the correspondence extends to iterative programs whose cross-round state is shared. This gives a formal account of the computations in the language that can be expressed as encode, merge, and decode procedures. We then develop a differentiable fragment for learning. If a per-record loss and its per-record gradient are represented by client-local tensor expressions, the global gradient is represented by record-axis summation of the federated gradient tensor. This yields typed iterative programs for server-side gradient descent and shared-linear-algebra second-order updates. The framework characterizes a broad class of federated learning computations whose communication passes through fixed-dimensional shared state.
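The encode, merge, and decode pattern described above can be illustrated with a federated mean. This is a plain-Python sketch of the general idea rather than the paper's typed language: the function names and the choice of a (sum, count) pair as the fixed-dimensional shared state are assumptions made here.

```python
from functools import reduce


def encode(client_records):
    """Client-local computation: reduce local records to a fixed-size state."""
    return (sum(client_records), len(client_records))


def merge(state_a, state_b):
    """Mergeable aggregation: combine two shared states of the same shape."""
    return (state_a[0] + state_b[0], state_a[1] + state_b[1])


def decode(state):
    """Shared-only post-processing: turn the merged state into the result."""
    total, count = state
    return total / count


# Three clients hold disjoint slices of the record axis.
clients = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
shared_state = reduce(merge, (encode(records) for records in clients))
federated_mean = decode(shared_state)

# Reference semantics: the same statistic on the virtual global tensor.
virtual_global = [x for records in clients for x in records]
print(federated_mean, sum(virtual_global) / len(virtual_global))
```

The shared state stays two numbers regardless of how many clients or records exist, which mirrors the fixed-dimensional property in the factorization result. Replacing the per-record value with a per-record gradient gives the record-axis summation used for server-side gradient descent.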

2605.21102 2026-05-21 cs.CL cs.AI cs.SE

ACL-Verbatim: hallucination-free question answering for research

ACL-Verbatim: 无幻觉的科研问答

Gábor Recski, Szilveszter Tóth, Nadia Verdha, István Boros, Ádám Kovács

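AI summary: This work presents ACL-Verbatim, which uses extractive question answering to map user queries to relevant verbatim text spans in research papers. It contributes a new ground truth dataset, trains and evaluates several extractive models, and reports that a 150M-parameter ModernBERT model reaches a word-level F1 of 53.6, ahead of the strongest evaluated LLM extractor.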

详情
Comments
13 pages
AI中文摘要

学术研究者需要高效可靠的工具从可信来源获取高质量信息,但现代AI辅助研究工具仍受大语言模型(LLMs)产生事实不准确或不合逻辑输出(即幻觉)的影响。我们应用提取式问答系统VerbatimRAG到ACL Anthology中的研究论文,直接将用户查询映射到检索文档中的原文文本片段。我们贡献了一个新的真实数据集,用于将用户查询映射到科研论文中的相关文本片段,并利用该数据集训练和评估多种提取模型。人工标注由自然语言处理研究人员完成,基于使用ScIRGen方法生成的合成用户查询,配以由VerbatimRAG检索的论文片段。在该基准上,一个基于我们流水线银色监督训练的150M参数ModernBERT标记分类器在词级F1得分上达到53.6,优于最强的评估LLM提取器(48.7)

英文摘要

Academic researchers need efficient and reliable methods for collecting high-quality information from trusted sources, but modern tools for AI-assisted research still suffer from the tendency of Large Language Models (LLMs) to produce factually inaccurate or nonsensical output, commonly referred to as hallucinations. We apply the extractive question answering system VerbatimRAG to research papers in the ACL Anthology, directly mapping user queries to verbatim text spans in retrieved documents. We contribute a novel ground truth dataset for the task of mapping user queries to relevant text spans in research papers, and use it to train and evaluate a variety of extractive models. Human annotation is performed by NLP researchers and is based on synthetic user queries generated using a custom pipeline based on the ScIRGen methodology, paired with chunks of research papers retrieved by VerbatimRAG. On this benchmark, a 150M-parameter ModernBERT token classifier trained on silver supervision from our pipeline achieves the best word-level F1 (53.6), ahead of the strongest evaluated LLM extractor (48.7).
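The word-level F1 reported here can be illustrated with the usual bag-of-words overlap between extracted and reference text. This is a sketch under assumptions: the paper's exact tokenization, casing rules, and aggregation over examples are not given in the abstract, and `word_f1` is a name introduced here.

```python
from collections import Counter


def word_f1(predicted: str, reference: str) -> float:
    """Harmonic mean of word precision and recall, using multiset overlap."""
    pred_words = predicted.lower().split()
    ref_words = reference.lower().split()
    if not pred_words or not ref_words:
        # Both empty counts as a perfect match, one empty as a miss.
        return float(pred_words == ref_words)
    overlap = sum((Counter(pred_words) & Counter(ref_words)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_words)
    recall = overlap / len(ref_words)
    return 2 * precision * recall / (precision + recall)


# The extractor returned only part of the annotated span.
print(word_f1("the cat", "the cat sat on"))
```

Because an extractive system can only return text that exists in the retrieved chunk, a low score under this metric reflects missed or extra words rather than fabricated content.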

2605.21099 2026-05-21 cs.CV

R2AoP: Reliable and Robust Angle of Progression Estimation from Intrapartum Ultrasound

R2AoP: 从产时超声可靠且鲁棒地估计进展角

Yuanhan Wang, Yifei Chen, Beining Wu, Mingxuan Liu, Xiaotian Hu, Chunbo Jiang, Yijin Li, Changmiao Wang, Feiwei Qin, Qiyuan Tian

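AI summary: This paper proposes R2AoP, a framework that combines structurally informed segmentation with confidence-guided geometric modeling for stable Angle of Progression estimation, and adds a lightweight geometry-reliable test-time adaptation strategy to improve performance under heterogeneous acquisition conditions.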

详情
Comments
11 pages, 4 figures. Accepted by MICCAI 2026
AI中文摘要

从产时经会阴超声中准确估计进展角(AoP)对于客观评估产程进展至关重要,但该估计仍然对成像噪声、边界模糊以及局部分割误差的几何放大高度敏感。我们提出R2AoP,一种可靠且鲁棒的AoP估计框架,整合了结构引导的分割和置信度引导的几何建模,以实现稳定且可重复的测量。一个三分支局部结构增强的主干网络改善了耻骨联合(PS)和胎头(FH)的勾画,而置信度加权轮廓拟合明确抑制了AoP计算中不可靠边界点的影响。为进一步提高在异质采集条件下的性能,我们引入了一种轻量级几何可靠的测试时适应策略作为辅助组件,使推理过程稳定且无需目标标注。在多中心基准上的广泛评估显示,与最先进的AoP方法相比,AoP误差和边界指标均一致降低。我们的源代码可在https://github.com/baiyou1234/R2AoP获取。

英文摘要

Accurate estimation of the Angle of Progression (AoP) from intrapartum transperineal ultrasound is critical for objective assessment of labor progression, yet remains highly sensitive to imaging noise, boundary ambiguities, and the geometric amplification of local segmentation errors. We propose R2AoP, a reliable and robust AoP estimation framework that integrates structurally informed segmentation and confidence-guided geometric modeling to achieve stable and reproducible measurements. A three-branch local-structure-enhanced backbone improves the delineation of the pubic symphysis (PS) and fetal head (FH), while confidence-weighted contour fitting explicitly suppresses the influence of unreliable boundary points in AoP computation. To further improve performance under heterogeneous acquisition conditions, we introduce a lightweight geometry-reliable test-time adaptation strategy as an auxiliary component, enabling stable inference without target annotations. Extensive evaluations on multi-center benchmarks demonstrate consistent reductions in AoP error and boundary metrics compared with state-of-the-art AoP methods. Our source code is available at https://github.com/baiyou1234/R2AoP.

2605.21097 2026-05-21 cs.CL

WCXB: A Multi-Type Web Content Extraction Benchmark

WCXB:一个多类型网络内容提取基准

Murrough Foley

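AI summary: This paper introduces the WCXB benchmark of 2,008 web pages spanning seven structurally distinct page types, with ground truth annotations produced through a five-stage pipeline. Evaluating 13 extraction systems, it finds blind spots on structured pages that existing article-only benchmarks cannot reveal.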

详情
Comments
Dataset: github.com/Murrough-Foley/web-content-extraction-benchmark, doi.org/10.5281/zenodo.19316874. Leaderboard: webcontentextraction.org. Preprint also deposited at doi.org/10.5281/zenodo.19664685
AI中文摘要

网络内容提取——从网页中隔离主要内容以排除周围模板内容——是搜索引擎索引、检索增强生成、NLP数据集构建和大语言模型训练的前提。该领域的进展受到现有评估基准的限制,这些基准规模小(100-800页)、仅限于新闻文章或基于超过十年前的网页。我们介绍了网络内容提取基准(WCXB),包含来自1613个域的2008个网页,涵盖七种结构不同的页面类型:文章、论坛、产品、集合、列表、文档和服务页面。该数据集包括1497页的开发集和511页的测试集,具有匹配的页面类型分布。真实标注通过五阶段流程生成:LLM辅助草稿、自动化验证、四轮前沿模型审查、片段和质量验证脚本以及人工审查。我们评估了13种提取系统——11种启发式和2种神经网络——发现尽管顶级系统在文章上(F1=0.93)表现良好,但在结构化页面类型上性能差异显著(F1=0.41-0.84),揭示了现有文章-only基准无法发现的盲区。该数据集以CC-BY-4.0许可证发布,包含HTML源文件、真实标注、页面类型标签和基线结果。

英文摘要

Web content extraction - isolating a page's main content from surrounding boilerplate - is a prerequisite for search indexing, retrieval-augmented generation, NLP dataset construction, and large language model training. Progress in this area has been constrained by the limitations of existing evaluation benchmarks, which are small (100-800 pages), restricted to news articles, or based on web pages from over a decade ago. We introduce the Web Content Extraction Benchmark (WCXB), a dataset of 2,008 web pages from 1,613 domains spanning seven structurally distinct page types: articles, forums, products, collections, listings, documentation, and service pages. The dataset includes a 1,497-page development set and a 511-page held-out test set with matched page type distributions. Ground truth annotations were produced through a five-stage pipeline: LLM-assisted drafting, automated verification, four-pass frontier model review, snippet and quality verification scripts, and human review. We evaluate 13 extraction systems - 11 heuristic and 2 neural - and find that while top systems converge on articles (F1 = 0.93), performance diverges sharply on structured page types (F1 = 0.41-0.84), revealing blind spots invisible to existing article-only benchmarks. The dataset is released under CC-BY-4.0 with HTML source files, ground truth annotations, page type labels, and baseline results.

2605.21094 2026-05-21 cs.LG

UOTIP: Unbalanced Optimal Transport Map for Unpaired Inverse Problems

UOTIP:用于无配对逆问题的不平衡最优传输映射

Donggyu Lee, Taekyung Lee, Jaewoong Choi

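AI summary: This paper proposes UOTIP, an inverse problem solver based on unbalanced optimal transport. By incorporating a likelihood-based cost function, it casts reconstruction as learning a transport map from the noisy measurement distribution to the clean signal distribution, achieving state-of-the-art performance on unpaired inverse problems.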

详情
Comments
Accepted at ICML 2026
AI中文摘要

我们研究了无配对图像逆问题,这是一种具有挑战性的设置,其中只有独立的、未配对的噪声测量和干净目标信号集可用进行训练。我们提出了一种基于不平衡最优传输的新型逆问题求解器,称为用于逆问题的不平衡最优传输映射(UOTIP)。我们的方法将重建任务建模为学习从噪声测量分布到干净信号分布的UOT映射,通过引入基于似然的成本函数进行预测。通过放松精确边缘约束,UOT框架为我们的模型提供了关键优势:对多级观测噪声的鲁棒性、适应噪声和干净数据集之间的类别不平衡,以及对不同噪声类型场景的泛化能力。此外,我们理论证明,引入二次成本项通过满足扭条件确保了运输映射的存在性和唯一性,即使在病态逆问题中也是如此。我们的实验表明,UOTIP在无配对图像逆问题基准上实现了最先进的性能,涵盖了线性和非线性逆问题。

英文摘要

We investigate unpaired image inverse problems, a challenging setting where only independent, non-paired sets of noisy measurements and clean target signals are available for training. We propose a novel inverse problem solver based on Unbalanced Optimal Transport, called Unbalanced Optimal Transport Map for Inverse Problems (UOTIP). Our method formulates the reconstruction task, predicting clean target signals from noisy measurements, as learning a UOT Map from noisy measurement distribution to clean signal distribution by incorporating a likelihood-based cost function. By relaxing the exact marginal constraint, the UOT framework provides key advantages to our model: robustness to multi-level observation noise, adaptability to class imbalance between noisy and clean datasets, and generalizability to diverse noise-type scenarios. Furthermore, we theoretically demonstrate that incorporating a quadratic cost term ensures the existence and uniqueness of the transport map by satisfying the twist condition, even for ill-posed inverse problems. Our experiments demonstrate that UOTIP achieves state-of-the-art performance on unpaired image inverse problem benchmarks, across linear and nonlinear inverse problems.

2605.21090 2026-05-21 cs.CV

TextSculptor: Training and Benchmarking Scene Text Editing

TextSculptor: 训练和评估场景文本编辑

Yiheng Lin, Siyu Jiao, Xiaohan Lan, Wei Zhou, Qi She, Fei Yu, Heyun Chen, Zhengwei Wang, Jinghuan Chen, Moran Li, Yingchen Yu, Zijian Feng, Yao Zhao, Yunchao Wei, Yujie Zhong

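AI summary: This paper presents TextSculptor, a framework that builds a large-scale dataset and a benchmark to address the scarcity of high-quality training data and the lack of standardized evaluation in scene text editing, improving the performance of open-source models.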

详情
AI中文摘要

近年来,多模态大语言模型(MLLMs)和基于扩散的生成模型的进展显著提升了基于提示的图像编辑能力。然而,场景文本编辑仍具挑战性,因为模型需要精确修改文本内容,同时保持视觉真实性和非目标区域的完整性。当前开源模型仍落后于专有系统,主要由于高质量训练数据稀缺和缺乏针对文本编辑的标准化基准。为解决这些问题,我们提出了TextSculptor,一个全面的场景文本编辑数据构建和评估框架。我们首先开发了一个自动化数据构建管道,结合文本感知图像合成、程序化文本渲染和合成。基于此管道,我们构建了TextSculpt-Data,一个包含320万训练样本的大规模数据集,包括120万经过OCR验证的文本到图像样本和200万配对的文本编辑样本,具有自然对齐的源-目标图像和强背景一致性。我们进一步引入了TextSculpt-Bench,涵盖四个基本文本编辑任务:文本添加、文本替换、文本删除和混合编辑。为了支持可靠的评估,我们设计了一个定制协议,通过OCR文本对齐、多模态判断和背景区域相似性测量文本准确性、视觉质量和背景保持。广泛的实验表明,TextSculptor提升了开源文本编辑性能,缩小了与专有模型之间的差距。数据和基准可在https://github.com/linyiheng123/TextSculptor获取。

英文摘要

Recent advances in Multimodal Large Language Models (MLLMs) and diffusion-based generative models have substantially improved prompt-driven image editing. However, scene text editing remains challenging, as it requires models to precisely modify textual content while preserving visual realism and non-target regions. Current open-source models still lag behind proprietary systems, largely due to the scarcity of high-quality training data and the lack of standardized benchmarks tailored to text editing. To address these challenges, we present TextSculptor, a comprehensive framework for data construction and evaluation of scene text editing. We first develop an automated data construction pipeline that combines text-aware image synthesis with programmatic text rendering and compositing. Based on this pipeline, we build TextSculpt-Data, a large-scale dataset containing 3.2M training samples, including 1.2M OCR-verified text-to-image samples and 2M paired text editing samples with naturally aligned source-target images and strong background consistency. We further introduce TextSculpt-Bench, a benchmark covering four fundamental text editing tasks: text addition, text replacement, text removal, and hybrid editing. To support reliable evaluation, we design a tailored protocol that measures text accuracy, visual quality, and background preservation through OCR-based text alignment, multimodal judgment, and background-region similarity. Extensive experiments show that TextSculptor improves open-source text editing performance and narrows the gap to proprietary models. The data and benchmark are available at https://github.com/linyiheng123/TextSculptor.

2605.21088 2026-05-21 cs.LG

Reviving Error Correction in Modern Deep Time-Series Forecasting

在现代深度时间序列预测中复兴误差校正

Minh Hoang Nguyen, Dai Do, Huu Hiep Nguyen, Dung Nguyen, Kien Do, Hung Le

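AI summary: This paper studies error accumulation in deep time-series forecasting and proposes a universal error corrector that decomposes predictions into trend and seasonal components to improve forecast accuracy and robustness.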

详情
Comments
27 pages
AI中文摘要

现代深度学习模型在时间序列预测中取得了显著成功。然而,由于自回归推理中的误差累积,其在长期预测中的性能会下降。尽管经典的误差校正机制(ECMs)长期以来被用于统计方法,但它们在深度学习模型中的应用仍然有限或无效。在本文中,我们重新审视了深度时间序列预测中的误差累积问题,并探讨了ECMs在此新背景中的作用和必要性。我们提出了一种简单、架构无关的误差校正模型,可以与任何现有的预测器集成,而无需重新训练。通过显式地将预测分解为趋势和季节性成分,并分别训练校正器来调整每个成分,我们引入了具有季节-趋势分解的通用误差校正器(UEC-STD),在4种骨干网络和10个数据集上显著提高了校正精度和鲁棒性。我们的发现提供了一种实用工具来增强预测,同时为减轻深度时间序列模型中的自回归误差提供了新的见解。代码可在https://github.com/DA2I2-SLM/UEC-STD上获得。

英文摘要

Modern deep-learning models have achieved remarkable success in time-series forecasting. Yet, their performance degrades in long-term prediction due to error accumulation in autoregressive inference, where predictions are recursively used as inputs. While classical error correction mechanisms (ECMs) have long been used in statistical methods, their applicability to deep learning models remains limited or ineffective. In this work, we revisit the error accumulation problem in deep time-series forecasting and investigate the role and necessity of ECMs in this new context. We propose a simple, architecture-agnostic error correction model that can be integrated with any existing forecaster without requiring retraining. By explicitly decomposing predictions into trend and seasonal components and training the corrector to adjust each separately, we introduce the Universal Error Corrector with Seasonal-Trend Decomposition (UEC-STD), which significantly improves correction accuracy and robustness across 4 backbones and 10 datasets. Our findings provide a practical tool for enhancing forecasts while offering new insights into mitigating autoregressive errors in deep time-series models. Code is available at https://github.com/DA2I2-SLM/UEC-STD.
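The decompose-then-correct idea can be sketched with a moving-average trend and a residual seasonal part. This is an illustration under assumptions, not the UEC-STD code: the moving-average decomposition, the edge padding, and the corrector callables (`trend_corrector`, `seasonal_corrector`) are hypothetical stand-ins for whatever the paper trains.

```python
def moving_average(series, window):
    """Centered moving average with edge padding, same length as the input."""
    left = (window - 1) // 2
    right = window - 1 - left
    padded = [series[0]] * left + list(series) + [series[-1]] * right
    return [sum(padded[i:i + window]) / window for i in range(len(series))]


def decompose(series, window):
    """Split a series into a smooth trend and the remaining seasonal part."""
    trend = moving_average(series, window)
    seasonal = [x - t for x, t in zip(series, trend)]
    return trend, seasonal


def correct_forecast(forecast, window, trend_corrector, seasonal_corrector):
    """Correct each component separately, then add them back together."""
    trend, seasonal = decompose(forecast, window)
    fixed_trend = trend_corrector(trend)
    fixed_seasonal = seasonal_corrector(seasonal)
    return [t + s for t, s in zip(fixed_trend, fixed_seasonal)]


# A frozen backbone's forecast, corrected without retraining the backbone.
forecast = [10.0, 12.0, 9.0, 13.0, 10.0, 14.0]
shift_trend = lambda trend: [t - 0.5 for t in trend]  # toy bias correction
keep_seasonal = lambda seasonal: list(seasonal)       # toy identity corrector
print(correct_forecast(forecast, 3, shift_trend, keep_seasonal))
```

Because the corrector only consumes the forecaster's outputs, the same wrapper can sit on top of any backbone, which is what makes the approach architecture-agnostic.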

2605.21086 2026-05-21 cs.CL

LoCar: Localization-Aware Evaluation of In-Vehicle Assistants through Fine-Grained Sociolinguistic Control

LoCar: 通过细粒度社会语言学控制评估车载助手的本地化

Seogyeong Jeong, Kiwoong Park, Seyoung Song, Eunsu Kim, Ken E. Friedl, Jaeho Kim, Alice Oh

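AI summary: This paper proposes a new evaluation framework for in-vehicle assistants focused on Korean-language localization. It shows that fine-grained Korean honorific control is unstable in current LLMs and that models are weaker on strategic conversational metrics, and argues that automotive AI needs precise linguistic tailoring and reliable, safety-oriented interaction management.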

详情
Comments
To appear in ACL 2026 Industry Track
AI中文摘要

尽管大型语言模型(LLMs)越来越多地集成到车载对话系统中,但由于缺乏针对实际部署需求定制的领域特定评估标准,确定最优模型仍具挑战性。在本文中,我们提出了一种新的车载助手评估框架,特别关注韩语本地化。我们的实证分析揭示了模型行为中的显著模式。首先,细粒度韩语敬语控制在当前LLM中仍然不稳定,表明在本地化设置中必须明确评估话阶(speech level)的精确实现。其次,模型在澄清和主动性等策略性对话指标上表现较弱。我们的分析表明,这源于这些任务本身的主观复杂性,对此我们的框架采取保守的评估立场以优先考虑可靠性。总体而言,我们的发现强调,汽车AI必须超越一般能力,向精确的语言定制和可靠、以安全为导向的交互管理发展。

英文摘要

While Large Language Models (LLMs) are increasingly integrated into in-vehicle conversational systems, identifying the optimal model remains challenging due to the lack of domain-specific evaluation standards tailored to real-world deployment requirements. In this paper, we propose a novel evaluation framework for in-vehicle assistants, with a particular focus on Korean-language localization. Our empirical analysis reveals notable patterns in model behavior. First, fine-grained Korean honorific control remains unstable in current LLMs, indicating that precise speech-level realization must be explicitly evaluated in localization settings. Second, models exhibit weaker performance in strategic conversational metrics like clarification and proactivity. Our analysis suggests this stems from the inherent subjective complexity of these tasks, where our framework adopts a conservative evaluation stance to prioritize reliability. Together, our findings underscore that automotive AI must move beyond general competence toward precise linguistic tailoring and reliable, safety-oriented interaction management.

2605.21082 2026-05-21 cs.AI

AutoRPA: Efficient GUI Automation through LLM-Driven Code Synthesis from Interactions

AutoRPA: 通过基于LLM的代码合成实现高效的GUI自动化

Minghao Chen, Xinyi Hu, Zhou Yu, Yufei Yin

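AI summary: This paper proposes AutoRPA, a framework that automatically distills the decision logic of ReAct-style agents into robust RPA functions, improving the efficiency and reusability of GUI automation while reducing token usage by 82% to 96%.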

详情
Comments
Accepted in ICML 2026
AI中文摘要

基于大型语言模型(LLM)的代理在多步骤的图形用户界面(GUI)交互中表现出色。尽管大多数研究集中在提升单任务性能,但实际场景中往往涉及重复的GUI任务,而频繁调用LLM推理(即ReAct范式)效率低下。在LLM之前,传统的机器人流程自动化(RPA)提供运行时效率,但需要大量手动努力来开发和维护。为弥合这一差距,我们提出AutoRPA框架,该框架能够自动将ReAct风格代理的决策逻辑转化为鲁棒的RPA功能。AutoRPA引入了两个核心创新:(1)一个翻译-构建流水线,其中翻译代理将硬编码的ReAct动作转换为软编码的流程,构建代理通过多轨迹检索增强生成合成鲁棒的RPA功能;(2)在代码验证期间的混合修复策略,结合RPA执行与基于ReAct的回退机制进行迭代优化。在多个GUI环境中的实验表明,由AutoRPA生成的RPA功能能够成功解决类似任务,同时减少82%到96%的token使用量,显著提高运行时效率和可重用性。

英文摘要

Large Language Model (LLM) based agents have demonstrated proficiency in multi-step interactions with graphical user interfaces (GUIs). While most research focuses on improving single-task performance, practical scenarios often involve repetitive GUI tasks for which invoking LLM reasoning repeatedly, i.e., the ReAct paradigm, is inefficient. Prior to LLMs, traditional Robotic Process Automation (RPA) offers runtime efficiency but demands significant manual effort to develop and maintain. To bridge this gap, we propose AutoRPA, a framework that automatically distills the decision logic of ReAct-style agents into robust RPA functions. AutoRPA introduces two core innovations: (1) A translator-builder pipeline, where a translator agent converts hard-coded ReAct actions into soft-coded procedures, and a builder agent synthesizes robust RPA functions via retrieval-augmented generation over multiple trajectories; (2) A hybrid repair strategy during code verification, combining RPA execution with ReAct-based fallback for iterative refinement. Experiments across multiple GUI environments demonstrate that RPA functions generated by AutoRPA successfully solve similar tasks while reducing token usage by 82% to 96%, significantly improving runtime efficiency and reusability.

2605.21081 2026-05-21 cs.SD cs.LG

Musical Attention Transformer: Music Generation Using a Music-Specific Attention Model

音乐注意力转换器:使用音乐特定的注意力模型进行音乐生成

Shinnosuke Taksuka, Hideo Mukai

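AI summary: This paper proposes a music-specific attention model that incorporates meta-information to improve music generation. The core idea is to combine musical structure with metadata, and the main contribution is improved coherence and diversity of the generated music.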

详情
Comments
32 pages, 13 figures
AI中文摘要

本研究旨在通过引入元信息来提升使用Transformer进行音乐生成的质量。尽管基于Transformer的方法在捕捉音乐作品中的长期依赖性方面有效,但它们生成的音乐常出现重复或音符重复的问题,导致不自然的旋律。为了解决这些限制,我们提出了音乐注意力机制,该机制将元信息如小节号、调性、节拍等整合到注意力过程中。音乐注意力显式利用音乐的结构属性及其相关元数据,使Transformer的注意力机制能够更有效地运作,从而提高生成输出的质量。在我们的框架中,每个音乐音符被表示为五个事件(音高、小节号、起始时间、持续时间和力度)以及三个元数据元素的组合。然后将注意力机制修改为反映这些八个特征之间的相关性,使模型能够更好地捕捉音乐编排的内在特性。实验结果表明,整合音乐注意力的模型在音乐连贯性、变化性和整体质量方面优于先前的方法,如全注意力和步进注意力。值得注意的是,它显著减少了重复并增强了模型生成多样化、和谐一致的旋律的能力。音乐注意力因此在AI驱动的音乐生成中代表了重要的进展,有助于创建更自然和富有表现力的音乐作品。

英文摘要

This study aims to enhance the quality of music generation using Transformers by incorporating meta-information. While Transformer-based approaches are effective at capturing long-term dependencies in musical compositions, the music they generate often suffers from issues such as excessive repetition or duplication of notes, leading to unnatural melodies. To address these limitations, we propose Musical Attention, a mechanism that incorporates meta-information such as bar numbers, key, signatures, and tempos into the attention process. Musical Attention explicitly leverages both the structural properties of music and its associated metadata, enabling the Transformer's attention mechanism to operate more effectively and thereby improving the quality of the generated output. In our framework, each musical note is represented as a combination of five events (pitch, bar number, onset, duration, and velocity) in addition to the three metadata elements. The attention mechanism is then modified to reflect the correlations among these eight features, allowing the model to better capture the inherent characteristics of musical composition. Experimental results demonstrate that the model incorporating Musical Attention outperforms prior methods, such as Full Attention and Strided Attention, in terms of musical coherence, variation, and overall quality. Notably, it significantly reduces repetition and enhances the model's ability to generate diverse, harmonically consistent melodies. Musical Attention thus represents a meaningful advancement in AI-driven music generation, facilitating the creation of more natural and expressive compositions.

2605.21076 2026-05-21 cs.CL

GradeLegal: Automated Grading for German Legal Cases

GradeLegal: 自动化评分德国法律案例

Abdullah Al Zubaer, Lorenz Wendlinger, Simon Alexander Nonn, Michael Granitzer, Jelena Mitrovic

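AI summary: This study examines whether large language models can automate the grading of German case solutions in criminal and public law. A systematic evaluation of 27 proprietary and open-source LLMs finds that prompting with a sample solution and a grading rubric lets models approximate expert grading, reaching a QWK of up to 0.91 in public law but only 0.60 in criminal law, which suggests that criminal law is the harder grading task. Ensembling further improves agreement and can offer an alternative to stronger closed-source single models.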

详情
AI中文摘要

德国法律考试答卷的评分面临数量不断增长和合格评分员短缺的问题,导致反馈延迟并形成瓶颈。同时,这是一项高风险的专家任务,因为国家考试成绩对德国的职业发展有很大影响。尽管具有实际意义,文献中仍缺乏关于法律考试有效评分方法的系统研究。为填补这一空白,我们研究了大型语言模型(LLMs)能否支持对德国刑法和公法案例解答的自动评分,从而实现可扩展的反馈和学生自测。我们系统评估了27个专有和开源LLM,并对逐步加入任务相关信息(如参考答案和评分标准)的提示策略进行了基准测试。以二次加权kappa(QWK)衡量,面向推理的LLM在获得参考答案和评分标准时,能在公法领域接近专家评分(最高0.91),而在刑法领域为0.60,表明刑法的评分任务更难。除单模型评分外,集成方法相比其最佳成员可将一致性最多提高0.15,并可作为更强的闭源单模型的替代方案。此外,我们的发现表明,有效的提示设计和模型选择对于基于LLM的法律考试可靠评分是必要的。

英文摘要

Grading German legal exam solutions faces growing volumes and a shortage of qualified graders, delaying feedback and creating a bottleneck. At the same time, it is a high-stakes expert task, since state exam grades strongly influence career outcomes in Germany. Despite this practical relevance, the literature lacks systematic studies on effective methods for grading legal exams. To address this gap, we investigate whether large language models (LLMs) can support the automated grading of German legal case solutions in criminal and public law, thereby enabling scalable feedback and student self-testing. We present a systematic evaluation of 27 proprietary and open-source LLMs, benchmarking prompting strategies that incrementally add task-related information, such as a sample solution and a grading rubric. Measured by quadratic weighted kappa (QWK), reasoning-oriented LLMs can approximate expert grading in public law when given a sample solution and a grading rubric (up to 0.91), compared to 0.60 in criminal law, suggesting a harder grading task in criminal law. Beyond single-model grading, ensembling improves agreement by up to 0.15 over its best member and can offer an alternative to stronger closed-source single models. In addition, our findings suggest that effective prompt design and model selection are necessary for reliable LLM-based grading of legal exams.
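Quadratic weighted kappa, the agreement measure used in this abstract, can be computed directly from two lists of grades. The sketch below is a generic implementation of the standard formula, assuming integer grades from 0 to `n_grades - 1`; the grade scale and the function name are assumptions, not details taken from the paper.

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_grades):
    """QWK = 1 - sum(W * O) / sum(W * E) with W[i][j] = (i - j)^2 / (n - 1)^2."""
    total = len(rater_a)
    observed = [[0.0] * n_grades for _ in range(n_grades)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_grades)) for j in range(n_grades)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_grades):
        for j in range(n_grades):
            weight = (i - j) ** 2 / (n_grades - 1) ** 2
            # Expected counts assume the two raters grade independently.
            expected = hist_a[i] * hist_b[j] / total
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    return 1.0 - weighted_observed / weighted_expected


expert = [0, 1, 2, 3, 3, 2]
model = [0, 1, 2, 3, 2, 2]
print(quadratic_weighted_kappa(expert, model, n_grades=4))
```

Because the penalty grows with the square of the grade distance, a model that is off by one point is punished far less than one that is off by several, which suits ordinal grading scales. A value of 1 means perfect agreement and 0 means agreement no better than chance.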