arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.02328 2026-06-02 cs.LG

Riemannian Gradient Descent for Low-Rank Architectures

低秩架构的黎曼梯度下降

Nicholas Knight

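AI summary Explores Riemannian optimization techniques for low-rank matrix parameters and applies them to the multihead attention parameters of small language models; the methods do not conclusively outperform an AdamW baseline.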

AI中文摘要

我们探索了用于秩因子矩阵参数的黎曼优化技术,针对当代深度学习应用。我们考察了算法设计空间中的十个点:秩为$r$的矩阵的两种几何结构,秩为$r$的部分等距的三种几何结构,以及这五种几何结构的块矩阵变体,其中因子在块行和块列之间共享。我们将我们的方法应用于小语言模型中的多头注意力参数。在调整学习率后,我们的方法并未决定性地优于AdamW基线。我们的实现可在网上获取。

英文摘要

We explore Riemannian optimization techniques for rank-factored matrix parameters, targeting contemporary deep learning applications. We examine ten points in the algorithm design space: two geometries for rank-$r$ matrices, three geometries for rank-$r$ partial isometries, and block-matrix variants of these five, where factors are shared across block-rows and block-columns. We apply our methods to the multihead attention parameters in small language models. After tuning learning rates, our methods do not conclusively outperform an AdamW baseline. Our implementations are available online.

2606.02326 2026-06-02 cs.AI

Repair Before Veto: Repair-Augmented Constraint Learning for Contextual Decisions

否决前修复:面向上下文决策的修复增强约束学习

Yifan Wang

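AI summary Proposes the Repair-Augmented Constraint Learning (RACL) framework, which lifts known repair operations into the classifier semantics and considers affordable repairs before vetoing, reducing the false-veto rate and clarifying the learnability of decision rules.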

Comments 7 pages, 3 figures

AI中文摘要

硬约束通常被视为最终否决:一旦候选违反要求,学习规则拒绝它,任何修复都在决策语义之外处理。这忽略了一种常见的部署场景,即系统已经知道有限的修改菜单,例如添加票务选项、更改配置或请求可用的服务升级。现有的约束学习、软松弛和补救方法解决了邻近问题,但它们没有学习在否决前是否应修复某个选项。我们引入修复增强约束学习(RACL),一种上下文决策框架,将已知修复算子提升到分类器语义中。当可负担的修复使候选可行且足够偏好时,候选被接受;否则系统返回结构化的拒绝信用,并在适用时返回修复计划。这种否决前修复视图严格推广了无修复的HASSLE风格语义,揭示了终端否决规则不可约的错误否决差距,将二分类不可识别性与决策规则可学习性分离,并为观测可行性共享权重设置提供了容量和校准界限。在受控和DB1B衍生基准测试中,RACL恢复了预期的信用和修复结构。在最难的原始数据衍生层级上,验证选择的RACL将错误否决减少到10/4039(FVR 0.0025),而最强的修复搜索黑盒基线约为1064/4039,同时明确展示了FVR/EDR权衡。

英文摘要

Hard constraints are usually treated as terminal vetoes: once a candidate violates a requirement, the learned rule rejects it and any repair is handled outside the decision semantics. This misses a common deployed regime in which the system already knows a finite menu of modifications, such as adding a ticket option, changing a configuration, or requesting an available service upgrade. Existing constraint-learning, soft-relaxation, and recourse methods address nearby problems, but they do not learn whether an option should be repaired before being vetoed. We introduce Repair-Augmented Constraint Learning (RACL), a contextual decision framework that lifts known repair operators into the classifier semantics. A candidate is accepted when an affordable repair makes it feasible and preferred enough; otherwise the system returns a structured rejection credit and, when applicable, a repair plan. This repair-before-veto view strictly generalizes no-repair HASSLE-style semantics, reveals an irreducible false-veto gap for terminal-veto rules, separates binary-label non-identifiability from decision-rule learnability, and gives capacity and calibration bounds for the observed-feasibility shared-weight setting. Across controlled and DB1B-derived benchmarks, RACL recovers the intended credit and repair structure. On the hardest raw-data-derived tier, validation-selected RACL reduces false vetoes to 10/4039 (FVR 0.0025), versus about 1064/4039 for the strongest repair-search black-box baseline, while making the FVR/EDR trade-off explicit.

2606.02304 2026-06-02 cs.CL

Unified Context Evolution for LLM Agents

统一上下文演化:面向LLM智能体

Zixuan Zhu, Yitong Hu, Yong Dai, Junfeng Fang, Chunyang Jiang, Senkang Hu, Yuzhi Zhao

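AI summary Proposes the Unified Context Evolution (UCE) framework, which externalizes agent experience into four types of Evolvable Context Units (ECUs) to accumulate knowledge across tasks and schedule generation dynamically, substantially improving performance on ALFWorld and WebShop.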

AI中文摘要

基于LLM的智能体可以通过结合推理与环境反馈来解决多步交互任务,然而每个回合从相同的固定上下文开始,任务结束后沿途发现的任何有用策略都会丢失。现有方法要么将学习限制在当前任务,要么将所有经验汇集到单一的无类型存储中,而不区分知识类型、通过使用跟踪质量或平衡库中仍缺乏的内容。我们引入了统一上下文演化(UCE),一种无梯度框架,将智能体经验外部化到不断演化的类型化可演化上下文单元(ECU)库中。UCE将经验分解为四种互补类型(记忆、策略、工作流和技能),每种类型从轨迹中根据类型特定条件生成,在决策时检索,通过重复使用结果评分,并在不再有价值时修剪。调度模块将每个周期的生成预算分配给库中最弱的类型。在两个交互基准测试中,UCE将ALFWorld的成功率从75.4%提高到96.3%,将WebShop的任务得分从45.1%提高到61.3%,并且累积的库无需重新训练即可迁移到其他智能体主干。

英文摘要

LLM-based agents can solve multi-step interactive tasks by combining reasoning with environment feedback, yet each episode starts from the same fixed context and any useful strategy discovered along the way is lost once the task ends. Existing approaches either limit learning to the current task or pool all experience into a single untyped store, without distinguishing knowledge types, tracking quality through use, or balancing what the library still lacks. We introduce Unified Context Evolution (UCE), a gradient-free framework that externalizes agent experience into an evolving library of typed Evolvable Context Units (ECUs). UCE decomposes experience into four complementary types (Memory, Strategy, Workflow, and Skill), each generated from trajectories under type-specific conditions, retrieved at decision time, scored through repeated usage outcomes, and pruned when no longer valuable. A scheduling module allocates each cycle's generation budget toward the types where the library is weakest. Across two interactive benchmarks, UCE raises ALFWorld success from 75.4% to 96.3% and WebShop task score from 45.1% to 61.3%, and the accumulated library transfers to alternative actor backbones without retraining.
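The generate, score, prune, and schedule cycle described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `ECU` class, the Laplace-smoothed score, the pruning thresholds, and the inverse-strength budget rule are all hypothetical choices made only to show the shape of such a library.

```python
from dataclasses import dataclass

# The four ECU types named in the abstract.
TYPES = ("Memory", "Strategy", "Workflow", "Skill")


@dataclass
class ECU:
    """One context unit (hypothetical structure, for illustration only)."""
    kind: str
    text: str
    uses: int = 0
    successes: int = 0

    @property
    def score(self) -> float:
        # Laplace-smoothed success rate; an assumed scoring rule.
        return (self.successes + 1) / (self.uses + 2)


def record_outcome(unit: ECU, success: bool) -> None:
    """Update usage statistics after the unit was retrieved in an episode."""
    unit.uses += 1
    unit.successes += int(success)


def prune(library, min_score=0.3, min_uses=3):
    """Drop units that were tried at least min_uses times and score poorly."""
    return [u for u in library if u.uses < min_uses or u.score >= min_score]


def allocate_budget(library, budget):
    """Split a generation budget so that weaker types receive more of it."""
    strength = {t: 0.0 for t in TYPES}
    for u in library:
        strength[u.kind] += u.score
    weights = {t: 1.0 / (1.0 + s) for t, s in strength.items()}
    total = sum(weights.values())
    alloc = {t: int(budget * w / total) for t, w in weights.items()}
    # Rounding leftovers go to the weakest type.
    weakest = min(TYPES, key=lambda t: strength[t])
    alloc[weakest] += budget - sum(alloc.values())
    return alloc
```

With a library that holds only Memory units, `allocate_budget` gives the other three types the larger share of the budget, which mirrors the idea of steering generation toward the types where the library is weakest.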

2606.02253 2026-06-02 cs.AI

CEON: Circular Economy Ontology Network

CEON: 循环经济本体网络

Huanyu Li, Els de Vleeschauwer, Robin Keskisärkkä, Mikael Lindecrantz, Mina Abd Nikooie Pour, Ying Li, Ben De Meester, Patrick Lambrix, Eva Blomqvist

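AI summary To address semantic interoperability in cross-sector information sharing for the circular economy, proposes the Circular Economy Ontology Network (CEON), which defines cross-sectorial concepts and enables semantics-aware data documentation, demonstrated in construction, electronics, and textile scenarios.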

AI中文摘要

提高社会中资源利用的循环性已被视为实现可持续性的一条途径,即向更加循环的经济转型。为此有许多不同的循环策略,例如重复使用产品和组件、翻新和再制造旧产品,或回收剩余或使用过的材料。为了实现这些策略,有必要在基础设施层面共享信息,并在产品生命周期内跨行业部门进行沟通。因此,在这种信息共享和沟通中实现语义互操作性是提高循环性的关键。然而,涉及产品生命周期相关众多行业的循环经济(CE)领域的知识表示仍然具有挑战性。为弥补这一差距,我们在Onto-DESIDE项目中开发了循环经济本体网络(CEON)。该本体网络旨在通过定义跨行业概念来填补CE领域的空白,并实现语义感知的数据文档化。我们通过跨行业数据文档化场景(涵盖建筑、电子和纺织行业)展示了CEON。

英文摘要

Increasing the circularity of resource use in our society has been recognized as a path to sustainability, i.e., transitioning into a more circular economy. There are many different circular strategies to do so, such as reusing products and components, refurbishing and remanufacturing used products, or recycling left-over or used materials. To enable these strategies, it is necessary to share information at the infrastructure level and to communicate between industry sectors along the product life cycle. Enabling semantic interoperability in this information sharing and communication is therefore a key to increasing circularity. However, knowledge representation for the circular economy (CE) domain, which involves many relevant industry sectors related to product life cycles, remains challenging. To bridge this gap, we developed the Circular Economy Ontology Network (CEON) within the Onto-DESIDE project. This ontology network aims to fill gaps in CE by defining cross-sectorial concepts and to enable semantics-aware data documentation. We demonstrate CEON through cross-industry data documentation scenarios spanning construction, electronics, and textile sectors.

2606.02198 2026-06-02 cs.LG cs.CY

Model Multiplicity and Predictive Arbitrariness in Recidivism Risk Assessment

模型多重性与再犯风险评估中的预测任意性

Ashwin Singh, Carlos Castillo

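AI summary Addresses predictive arbitrariness in recidivism risk assessment through a theoretical lower bound and empirical analysis, finds that predictive agreement among similarly accurate models is typically higher than worst-case theoretical guarantees, and proposes a lowest-risk assignment policy to mitigate arbitrariness.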

Comments 17 pages, 12 figures

AI中文摘要

针对个体未来的预测任务本质上是嘈杂的,通常会产生多个相似精度的模型。当这些模型对同一个人产生不同预测时,会引发决策中的任意性问题。这种任意性在理论和实践中可能有多严重?如何解决以支持高风险风险评估?我们通过对一个已使用超过15年的基于机器学习的再犯风险评估决策支持系统的研究来回答这些问题。通过将复杂的法律规则转化为标记释放后结果(再犯或非再犯)的算法,我们首先构建了一个包含数千名囚犯释放的数据集。利用该数据集,我们学习可解释的模型,这些模型提高了预测性能,减少了群体间的错误率差异,并确保康复进展降低风险评分。接下来,我们研究预测多重性,首先推导出数据集上任何有限模型集的期望预测一致性的紧下界,然后评估该集合内的结构多样性(例如,不同的模型系数)在多大程度上转化为预测多重性(即对同一人的不同预测)。我们的实验表明,存在许多相似精度的模型且具有可比较的错误率差异并不一定意味着严重的预测多重性。经验上,性能相似的模型可以表现出比最坏情况理论保证高得多的预测一致性。我们发现,一种简单的策略——为每个囚犯分配这些模型中的最低风险——对于解决预测任意性是有效的。

英文摘要

Prediction tasks over individual futures, which are inherently noisy, often admit multiple similarly accurate models. When these models produce different predictions for the same individual, they raise concerns of arbitrariness in decision-making. How severe can this arbitrariness be, in theory and in practice? How can it be resolved to support high-stakes risk assessment? We address these questions through a study of a machine learning-based decision support system for recidivism risk assessment that has been in use for over 15 years. By translating complex legal rules into an algorithm for labeling post-release outcomes (recidivist or non-recidivist), we first construct a dataset of thousands of inmate releases. Using this dataset, we learn interpretable models that improve predictive performance, reduce error-rate disparities between groups, and ensure that rehabilitative progress lowers risk scores. Next, we study predictive multiplicity by first deriving a tight lower bound on the expected predictive agreement of any finite set of models over a dataset, and then evaluating the extent to which structural diversity (e.g., different model coefficients) within this set translates to predictive multiplicity (i.e., different predictions for the same individual). Our experiments indicate that the existence of many similarly accurate models with comparable error-rate disparities does not necessarily translate into severe predictive multiplicity. Empirically, similarly performant models can exhibit substantially higher predictive agreement than worst-case theoretical guarantees suggest. We find that a simple policy that assigns each inmate the lowest risk among these models is effective for addressing predictive arbitrariness.
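Two quantities from the abstract, predictive agreement across a set of models and the lowest-risk assignment policy, can be illustrated with a small sketch. The agreement measure here (mean pairwise fraction of matching labels) is an assumption; the paper's exact definition of expected predictive agreement may differ, and the function names are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(predictions):
    """Mean fraction of individuals on which two models give the same label.

    predictions: one list of 0/1 labels per model, all over the same individuals.
    """
    n = len(predictions[0])
    pairs = list(combinations(predictions, 2))
    total = sum(sum(a == b for a, b in zip(p, q)) / n for p, q in pairs)
    return total / len(pairs)


def lowest_risk(scores):
    """Assign each individual the minimum risk score across all models.

    scores: one list of risk scores per model, all over the same individuals.
    """
    return [min(column) for column in zip(*scores)]
```

For example, `lowest_risk([[0.2, 0.9], [0.4, 0.5]])` returns `[0.2, 0.5]`: each individual receives the most favorable score any of the models assigns.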

2606.02162 2026-06-02 cs.CV cs.AI cs.CL cs.IR

Multimodal Approaches for Visually-Rich Document Type Classification: A Comparative Analysis

视觉丰富文档类型分类的多模态方法:一项比较分析

Catyana Heyne, Jürgen Frikel, Filippo Riccio

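AI summary To address the difficulty of systematically comparing multimodal modeling strategies for visually rich document type classification, runs a controlled comparison of four representative Transformer- and LLM-based models within a unified experimental framework, finding that specialized multimodal Transformers outperform LLM-based approaches and that image information contributes most.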

AI中文摘要

视觉丰富文档中的文档类型分类仍然具有挑战性,因为相关信息分布在文本、视觉和布局模态中。为了捕捉这种复杂性,当前方法依赖于多样化的多模态建模策略,导致异构架构使得系统比较复杂化。这种变异性也反映在现有的比较研究中,这些研究通常依赖于异构评估设置,进一步复杂化了系统比较,并使得评估进展变得困难。为了解决这些局限性,本文提供了跨基于Transformer和基于LLM架构的多模态设计策略的结构化分析,并结合统一实验框架内的受控实证比较。具体来说,在RVL-CDIP基准上评估了四种代表性模型(LayoutLMv3、Donut、Qwen3-VL-32B-Instruct和Qwen3-32B),以系统分析文本、图像和布局信息对文档类型分类的贡献,特别关注对比OCR依赖和OCR无关的方法。结果表明,专用多模态Transformer在视觉丰富和布局密集型文档上优于基于LLM的方法。图像信息对可靠分类贡献最大,而OCR派生的文本提供有用但次要的支持。这些发现强调,对于具有显著布局结构的文档,多模态处理仍然是必不可少的。总体而言,该研究为比较多模态架构提供了系统基础,并为选择有效的特征组合和模型设计以进行文档类型分类提供了实用指导。

英文摘要

Document type classification in visually rich documents remains challenging, as relevant information is distributed across textual, visual, and layout modalities. To capture this complexity, current approaches rely on diverse multimodal modeling strategies, resulting in heterogeneous architectures that complicate systematic comparison. This variability is also reflected in existing comparative studies, which often rely on heterogeneous evaluation setups, further complicating systematic comparison and making it difficult to assess progress. To address these limitations, this work provides a structured analysis of multimodal design strategies across transformer- and LLM-based architectures, combined with a controlled empirical comparison within a unified experimental framework. Specifically, four representative models (LayoutLMv3, Donut, Qwen3-VL-32B-Instruct, and Qwen3-32B) are evaluated on the RVL-CDIP benchmark to systematically analyze the contributions of text, image, and layout information for document type classification, with a particular focus on contrasting OCR-dependent and OCR-free approaches. The results show that specialized multimodal Transformers outperform LLM-based approaches on visually rich and layout-intensive documents. Image information contributes most strongly to reliable classification, while OCR-derived text provides useful but secondary support. These findings highlight that multimodal processing remains essential for documents with pronounced layout structure. Overall, the study provides a systematic basis for comparing multimodal architectures and offers practical guidance for selecting effective feature combinations and model designs for document type classification.

2606.02142 2026-06-02 cs.LG cs.DB

TimeBlocks: Foundational and Continual Time-Series Blockbase -- Extended Version

TimeBlocks: 基础与持续时间序列块库——扩展版本

David Campos, Bin Yang, Tung Kieu, Lei Chen, Chenjuan Guo, Christian S. Jensen

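AI summary Proposes TimeBlocks, which builds lightweight, multi-task time-series models from interchangeable modular model blocks and a routing strategy, and introduces StreamCore for continual calibration; it outperforms existing baselines on multiple datasets.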

Comments 15 pages. An extended version of "TimeBlocks: Versatile and Continual Time-Series Blockbase" accepted at SIGKDD 2026

AI中文摘要

持续的数字化导致监控各种过程的时间序列数据流激增,从中可以获得有价值的见解。此外,成功的基础语言模型的出现引发了一个问题:是否可能实现具有处理多个任务的基础属性的时间序列模型,同时足够轻量以允许实时数据流处理。现有的基础时间序列模型通常很大,并且仅在离线设置中有效,没有严格的时间和计算约束,且不需要重复的模型校准。然而,当应用于数据流时,这些模型由于规模大且缺乏对持续校准的支持而效率低下,这损害了它们提供准确实时响应的能力、耐久性以及在硬件受限环境中的可部署性。我们提出TimeBlocks,通过促进在可变条件下适用于多个任务的轻量级模型的高效构建,实现多用途的时间序列处理。特别是,该方法维护一个可互换和模块化的模型块池,可用于构建新的时间序列模型。当面对特定的时间序列数据时,路由策略迭代选择最合适的块来为数据构建轻量级且准确的模型。我们为TimeBlocks配备了一种称为StreamCore的方法,以构建数据流的代表性小子集,该子集随时间保持流的保证近似,从而实现持续的模型校准。在多个数据集和多个任务上的实验研究表明,TimeBlocks能够构建优于现有基线的模型。

英文摘要

Ongoing digitization has led to a proliferation of time-series data streams that monitor a variety of processes, from which valuable insights may be obtained. Further, the emergence of successful foundational language models raises the question of whether it is possible to achieve time-series models with the foundational property of handling multiple tasks, while being sufficiently lightweight to allow real-time data stream processing. Existing foundational time-series models are often large and only effective in offline settings without stringent time and computational constraints, and where repeated model calibration is not needed. However, when applied to data streams, these models are ineffective due to their size and lack of support for continual calibration, which compromise their ability to deliver accurate real-time responses, their durability, and their deployability in hardware-limited settings. We propose TimeBlocks to enable versatile time-series processing by facilitating the efficient building of lightweight models suitable for multiple tasks under variable conditions. In particular, the method maintains a pool of interchangeable and modular model blocks that can be used to construct new time-series models. When presented with specific time-series data, a routing strategy iteratively selects the most suitable blocks to construct a lightweight and accurate model for the data. We equip TimeBlocks with a method called StreamCore to build a representative small subset of the data stream, which preserves a guaranteed approximation of the stream over time, enabling continual model calibration. An experimental study on multiple data sets and covering multiple tasks shows that TimeBlocks enables building models capable of outperforming existing baselines.

2606.02113 2026-06-02 cs.CL cs.AI

A Primer in Post-Training Reasoning Data: What We Know About How It Works

后训练推理数据入门:我们对其运作机制的了解

Yaoming Li, Guangxiang Zhao, Qilong Shi, Lin Sun, Xiangzheng Zhang, Tong Yang

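AI summary Surveys the types, utility, construction methods, and scaling behavior of post-training reasoning data, providing an attribution framework for future reasoning-data releases and post-training recipes.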

Comments 22 pages. Project Repository: https://github.com/RenBing-Sumeru/Awesome-LLM-Reasoning-Data

AI中文摘要

后训练已成为大型推理模型近期进展的主要驱动力,而推理数据通常是决定这一阶段成功与否的关键变量。关于后训练推理数据的研究迅速增长,但相关文献仍分散在数据集论文、强化学习方案、奖励模型研究、基准测试和前沿系统报告中。本文是首篇综合了超过150篇关键公开研究和系统报告的后训练推理数据入门文章。我们围绕四个问题组织该领域:存在哪些数据对象、什么使它们有用、它们如何构建以及它们如何扩展。这一组织方式为未来的推理数据发布和后训练方案提供了归因框架。

英文摘要

Post-training has become a primary driver of recent progress in large reasoning models, and reasoning data are often the key variable determining whether this stage succeeds. Work on post-training reasoning data has grown rapidly, yet this literature remains scattered across dataset papers, reinforcement-learning recipes, reward-model studies, benchmarks, and frontier system reports. This paper is the first primer to synthesize over 150 key public studies and system reports on post-training reasoning data. We organize the field around four questions: what data objects exist, what makes them useful, how they are constructed, and how they scale. Together, this organization provides an attribution framework for future reasoning-data releases and post-training recipes.

2606.02100 2026-06-02 cs.CL

PortBERT: Navigating the Depths of Portuguese Language Models

PortBERT:探索葡萄牙语语言模型的深度

Raphael Scheible-Schmitt, Henry He, Armando B. Mendes

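AI summary Presents PortBERT, a family of RoBERTa-based Portuguese language models trained on over 450 GB of data with byte-level BPE tokenization and stable pre-training; it achieves competitive performance on the ExtraGLUE benchmark, with a focus on training and inference efficiency.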

Journal ref
Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Models, 2025, pp. 59-71
AI中文摘要

Transformer模型主导现代自然语言处理,但高效的语言特定模型仍然稀缺。在葡萄牙语中,大多数工作侧重于规模或准确性,往往忽略了训练和部署效率。在本文中,我们介绍了PortBERT,一个基于RoBERTa的葡萄牙语语言模型家族,旨在平衡性能和效率。使用fairseq在来自CulturaX的超过450GB去重和过滤的mC4和OSCAR23数据上从头训练,PortBERT利用字节级BPE分词以及在GPU和TPU处理器上的稳定预训练流程。我们发布了两个变体,PortBERT base和PortBERT large,并在ExtraGLUE(一组翻译的GLUE和SuperGLUE任务)上评估它们。两个模型都表现出竞争力,匹配或超越现有的单语和多语言模型。除了准确性,我们还报告了训练和推理时间以及微调吞吐量,提供了模型效率的实用见解。因此,PortBERT通过解决葡萄牙语NLP中计算-性能权衡这一未被充分探索的维度,补充了先前的工作。我们在Huggingface上发布所有模型,并提供fairseq检查点以支持进一步的研究和应用。

英文摘要

Transformer models dominate modern NLP, but efficient, language-specific models remain scarce. For Portuguese, most work focuses on scale or accuracy, often neglecting training and deployment efficiency. In the present work, we introduce PortBERT, a family of RoBERTa-based language models for Portuguese, designed to balance performance and efficiency. Trained from scratch on over 450 GB of deduplicated and filtered mC4 and OSCAR23 from CulturaX using fairseq, PortBERT leverages byte-level BPE tokenization and stable pre-training routines across both GPU and TPU processors. We release two variants, PortBERT base and PortBERT large, and evaluate them on ExtraGLUE, a suite of translated GLUE and SuperGLUE tasks. Both models perform competitively, matching or surpassing existing monolingual and multilingual models. Beyond accuracy, we report training and inference times as well as fine-tuning throughput, providing practical insights into model efficiency. PortBERT thus complements prior work by addressing the underexplored dimension of compute-performance tradeoffs in Portuguese NLP. We release all models on Hugging Face and provide fairseq checkpoints to support further research and applications.

2606.01982 2026-06-02 cs.AI

An NLP-Driven Framework for Curriculum-Labor Market Alignment: Schema-Constrained LLM Extraction, ESCO-Anchored Semantic Matching, and Multi-Dimensional Gap Quantification

一种基于NLP的课程-劳动力市场对齐框架:模式约束的LLM抽取、ESCO锚定的语义匹配和多维差距量化

Sherzod Turaev, Mary John, Mamoun Awad, Nazar Zaki, Khaled Shuaib

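AI summary Proposes a four-stage NLP framework that aligns curricula with the labor market through schema-constrained LLM extraction, ESCO semantic matching, an adjudication protocol, and a verification mechanism, and quantifies multi-dimensional supply-demand gaps.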

Comments 53 pages, 9 figures, 4 tables

AI中文摘要

从多样化的教育和劳动力市场语料库中进行模式约束的信息抽取仍然是自然语言处理中的一个开放挑战,因为现有流程主要依赖于无法恢复隐含能力的词汇表面方法,缺乏共享分类法的基础,并且没有提供抽取可靠性或文档级完整性的正式度量。为了解决这些限制,本文提出了一个四阶段NLP框架,结合了(i) 对两个前沿LLM集成模型进行模式约束提示,针对JSON Schema强制实施的七槽能力形式;(ii) 使用Sentence-BERT (SBERT)将抽取的记录与十一个领域的ESCO v1.2.1受控词汇表对齐;(iii) 一个解决模型间分歧的两级裁决协议;(iv) 一个结合每槽Cohen's kappa、模式符合性和文档级完整性审计的验证机制。该框架在高等教育质量保证的关键应用中实例化,即阿联酋大学ABET认证的计算机科学学士学位课程的课程-劳动力市场对齐。该流程从2025-2026学年的85门课程学习计划中抽取400条能力记录,并在从计算核心到概率加权学生轨迹的五范围分析下,与30个职位发布(483个要求条款)以0.50的SBERT余弦阈值对齐。抽取器在技能槽上达到0.79的Cohen's kappa,模式符合性100%,文档级完整性100%。对齐揭示了可解释的供需差距:通用和横向技能差距25.0%,算法与计算理论差距13.8%,软件工程与项目管理差距12.2%,而人工智能与数据科学差距接近零的1.8%,尽管供应覆盖率为38.6%。

英文摘要

Schema-constrained information extraction from diverse educational and labor-market corpora remains an open challenge in natural language processing because existing pipelines rely primarily on lexical-surface methods that cannot recover implicit competencies, lack grounding in shared taxonomies, and provide no formal measures of extraction reliability or document-level completeness. To address these limitations, this paper proposes a four-stage NLP framework that combines (i) schema-constrained prompting of a two-model frontier-LLM ensemble against a JSON Schema-enforced seven-slot competency formalism, (ii) Sentence-BERT (SBERT) alignment of the extracted records against an eleven-domain ESCO v1.2.1 controlled vocabulary, (iii) a two-tier adjudication protocol that resolves inter-model disagreements, and (iv) a verification mechanism that combines per-slot Cohen's kappa, schema conformance, and document-level completeness audits. The framework is instantiated for a critical application in higher-education quality assurance, namely curriculum-labor market alignment for the ABET-accredited BSc Computer Science program at the United Arab Emirates University. The pipeline extracts 400 competency records from the 85-course 2025-2026 study plan and aligns them, under a five-scope analysis ranging from the computing core to a probability-weighted student trajectory, with 30 job postings (483 requirement clauses) at an SBERT cosine threshold of 0.50. The extractor achieves Cohen's kappa of 0.79 on the skill slot, with 100% schema conformance and 100% document-level completeness. The alignment surfaces interpretable supply-demand gaps of 25.0% in general and transversal skills, 13.8% in algorithms and computational theory, and 12.2% in software engineering and project management, with a near-zero 1.8% gap in artificial intelligence and data science despite 38.6% supply coverage.
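Two of the measurements mentioned in the abstract, cosine-threshold matching at 0.50 and Cohen's kappa, can be sketched in plain Python. This is an illustration under stated assumptions: the vectors stand in for SBERT embeddings (no embedding model is called), and `demand_gap` uses an assumed definition of a gap (the share of demand items with no supply item at or above the threshold), which may not match the paper's formula.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def demand_gap(demand_vecs, supply_vecs, threshold=0.50):
    """Fraction of demand items whose best supply match is below threshold.

    Hypothetical gap definition, used here only for illustration.
    """
    unmatched = sum(
        1 for d in demand_vecs
        if max(cosine(d, s) for s in supply_vecs) < threshold
    )
    return unmatched / len(demand_vecs)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

In this sketch the two extractors' outputs for one slot would play the role of `labels_a` and `labels_b`, so a value such as the reported 0.79 would measure agreement beyond chance on that slot.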

2606.01981 2026-06-02 cs.CV

Generalization Limits in Vehicle Re-Identification

车辆再识别中的泛化极限

Anis Yassine Ben Mabrouk, Antoine Tadros, Rafael Grompone von Gioi, Gabriele Facciolo, Axel Davy, Rodrigo Verschae

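AI summary To address poor generalization of vehicle re-identification models to unseen vehicle types, proposes a new evaluation approach and uses a view-based split analysis to expose the limits of existing methods in viewpoint robustness and attention to detail.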

AI中文摘要

车辆再识别关注于根据查询图像从图库中检索同一车辆的图像。通过仔细检查常用数据集,我们观察到视觉差异很小的车辆——例如相同的品牌、型号和颜色——同时出现在训练集和测试集中。因此,有效记忆训练数据的方法在这些测试集上表现良好,但难以泛化到其他数据集。在本文中,我们通过提出一种新的评估方法来解决这个问题,该方法能更有效地衡量对未见车辆类型的泛化能力。为了进一步研究泛化性能,我们还提出基于视角进行分割评估,从而区分视角鲁棒性与同视角再识别的影响。我们的发现表明,大多数最先进的方法在处理未见车辆类型时存在困难,并且它们对视角变化的鲁棒性和对细节的关注仅限于训练中见过的车辆类型。

英文摘要

Vehicle re-identification focuses on retrieving images of the same vehicle from a gallery given a query image. Upon closer inspection of commonly used datasets, we observe that vehicles with few visual differences (e.g., the same make, model, and color) appear in both the training and test sets. As a result, methods that effectively memorize the training data tend to perform well on these test sets but struggle to generalize to other datasets. In this paper, we address this issue by proposing a novel evaluation approach that more effectively measures generalization capability to unseen vehicle types. To further study generalization performance, we also propose splitting the evaluation based on view, allowing us to differentiate the effect of viewpoint robustness from that of same-view re-identification. Our findings reveal that most state-of-the-art methods struggle with unseen vehicle types, and that their robustness to viewpoint changes and attention to detail are limited to vehicle types seen during training.

2606.01946 2026-06-02 cs.RO

Closed-Form Pose Estimation of Endoluminal Medical Devices via Gradiometer-Based Electromagnetic Localization System

基于梯度计的电磁定位系统实现腔内医疗器械的闭式位姿估计

Zhiwei Wu, Jiahao Luo, Yubo Pu, Siyi Wei, Yuankai Chen, Jinhui Zhang

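AI summary Proposes a Gradiometer-Based Electromagnetic Localization System (GELS) that uses a compact magnetometer array as a quasi-gradiometer to estimate local magnetic fields and gradient tensors, maps them to displacements via the Euler homogeneous relation, and recovers the pose in closed form through multi-source Procrustes registration, without pre-calibrated field maps or iterative optimization.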

AI中文摘要

嵌入式磁跟踪对于腔内医疗器械的远程导航具有极具吸引力的前景。然而,现有的六自由度位姿恢复方法通常需要预校准的工作空间场图或迭代非线性优化。本文提出了一种基于梯度计的电磁定位系统(GELS),这是一种闭式跟踪框架,使用紧凑型磁力计阵列作为嵌入式准梯度计来估计局部磁场和梯度张量。这些量通过欧拉齐次关系映射为源与阵列之间的位移,随后利用至少三个非共线源的多源Procrustes配准恢复阵列的方向和位置。该算法需要已知的源位置和阵列几何结构,但无需预校准的工作空间场图、初始位姿猜测或校准的激励源矩。恢复的位姿还可作为移动磁参考框架,实现概念验证的子级偶极子定位任务。跨传感器阵列配置和激励模式的台架实验显示,序列平均位置误差为10.80 mm至15.57 mm,最快更新率为14.49 Hz,中位求解器运行时间为172.00 µs。基于扰动的误差传播分析进一步确定了传感器间不一致性和偶极子模型失配是主要的精度限制因素,从而为未来进一步减少位姿估计误差的传感器阵列和磁源设计提供指导。

英文摘要

Embedded magnetic tracking holds highly attractive prospects for remote navigation of endoluminal medical devices. However, existing six-degree-of-freedom pose recovery approaches often require pre-calibrated workspace field maps or iterative nonlinear optimization. This letter presents a Gradiometer-Based Electromagnetic Localization System (GELS), a closed-form tracking framework that uses a compact magnetometer array as an embedded quasi-gradiometer to estimate local magnetic fields and gradient tensors. These quantities are mapped by the Euler homogeneous relation to displacements between source and array, from which multi-source Procrustes registration recovers the array orientation and position using at least three non-collinear sources. The algorithm requires known source positions and array geometry, but no pre-calibrated workspace field maps, initial pose guesses, or calibrated excitation-source moments. The recovered pose also enables a proof-of-concept sub-level dipole localization task by serving as a mobile magnetic reference frame. Benchtop experiments across sensor-array configurations and excitation modes demonstrate sequence-averaged position errors of 10.80 mm to 15.57 mm, a fastest update rate of 14.49 Hz, and a median solver runtime of 172.00 µs. A perturbation-based error propagation analysis further identifies inter-sensor inconsistency and dipole-model mismatch as the dominant accuracy limits, thereby informing future sensor array and magnetic source design for further reducing pose-estimation error.
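The step from field and gradient tensor to displacement can be illustrated with the standard Euler relation for a point dipole. This is a sketch under assumptions not stated in the abstract: the source is modeled as an ideal dipole, $\mathbf{r}$ denotes the displacement from source to sensor, $G$ is the gradient tensor at the sensor, and $G$ is taken to be invertible. The paper's exact formulation may differ.

```latex
% A dipole field is homogeneous of degree -3 in the displacement r:
\mathbf{B}(\lambda \mathbf{r}) = \lambda^{-3}\,\mathbf{B}(\mathbf{r}), \qquad \lambda > 0 .
% Euler's theorem for homogeneous functions, applied to each component B_i:
\sum_{j} r_j \,\frac{\partial B_i}{\partial r_j} = -3\,B_i
\quad\Longleftrightarrow\quad
G\,\mathbf{r} = -3\,\mathbf{B}, \qquad G_{ij} = \frac{\partial B_i}{\partial r_j} .
% When G is invertible, the displacement follows in closed form:
\mathbf{r} = -3\,G^{-1}\,\mathbf{B} .
```

Because the relation is linear in $\mathbf{r}$, no initial guess or iteration is needed, and the dipole moment cancels out of the expression, which is consistent with the abstract's claim that calibrated excitation-source moments are not required.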

2606.01641 2026-06-02 cs.CV

Edge-directed geometric partitioning for versatile video coding

面向多功能视频编码的边缘导向几何划分

Xuewei Meng, Xinfeng Zhang, Chuanmin Jia, Xia Li, Shanshe Wang, Siwei Ma

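AI summary For the VVC standard, proposes a geometric partition mode prediction strategy that builds a Most Probable Mode list from spatio-temporal edge information to reduce index overhead and improve coding efficiency, with average BD-rate gains of 0.58% (RA) and 1.00% (LDB).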

Comments This paper has been published in IEEE ICME

Journal ref
IEEE International Conference on Multimedia and Expo (ICME), 2020, pp. 1-6
AI中文摘要

为了提升编码性能,针对即将到来的VVC标准提出了几何划分(GEO)。GEO提供140个划分候选。最优GEO模式的索引需要显式地信令。考虑到不同CU的结构特性以及空间相邻块与时序同位块之间的相关性,我们提出了一种GEO模式预测策略,通过构建最可能模式(MPM)列表来减少GEO索引的开销并提高编码效率。基于划分模式与物体边界高度相关的观察,提出了一种边缘导向的几何划分方案,根据时空边缘信息构建MPM列表。与VTM-6.0相比,所提方法在RA和LDB配置下平均提供了0.58%和1.00%的客观BD-rate增益。此外,它还提升了物体边界的视觉质量。

英文摘要

To improve coding performance, geometric partition (GEO) was proposed for the upcoming VVC standard. GEO provides 140 partition candidates. The index of the optimal GEO mode must be signaled explicitly. Considering the different structural characteristics of different CUs and the correlation between spatially adjacent blocks and temporally collocated blocks, we propose a GEO mode prediction strategy that constructs a Most Probable Mode (MPM) list to reduce the overhead of the GEO index and improve coding efficiency. Based on the observed high correlation between the partition mode and object boundaries, an edge-directed geometric partition scheme is proposed to construct the MPM list according to spatio-temporal edge information. Compared to VTM-6.0, the proposed method provides average objective BD-rate gains of 0.58% and 1.00% for the RA and LDB configurations, respectively. It also improves the visual quality of object boundaries.

2606.01157 2026-06-02 cs.CV

HiTokSR: A Coarse-to-Fine Tokenizer with Hierarchical Codebooks for High-Fidelity Real-World Image Super-Resolution

HiTokSR: 一种用于高保真真实世界图像超分辨率的具有层次化码本的从粗到细分词器

Mingxi Li

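AI summary Proposes HiTokSR, a hierarchical token prediction framework that partitions the latent space along the channel dimension into frequency-aware groups and quantizes each independently, decoupling global structure from detail; combined with vision foundation model priors and an index-level perturbation strategy, it achieves state-of-the-art perceptual quality and reconstruction fidelity in real-world image super-resolution.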

AI中文摘要

向量量化(VQ)生成模型在真实世界图像超分辨率(Real-ISR)中显示出有希望的结果。然而,现有方法通常依赖于一个将低频结构与高频纹理纠缠在一起的单一潜在空间。这种纠缠迫使单个码本捕获组合上复杂的结构-纹理配对集合,这限制了表示能力并降低了码本利用率。为了解决这个问题,我们提出了HiTokSR,一个层次化的标记预测框架。HiTokSR不使用单一码本,而是将潜在空间沿通道维度划分为频率感知组,并用独立的子码本对每组进行量化。这种从粗到细的设计将全局结构与精细细节解耦,增强了组合表达能力,同时避免了高维最近邻查找的优化不稳定性。为了进一步提高语义一致性,我们的生成器通过自适应特征调制、多尺度类别标记和表示对齐损失,整合了来自视觉基础模型的先验。此外,我们在解码器微调过程中引入了一种索引级扰动策略,以弥合离散标记预测中的训练-测试差异。在真实世界基准上的大量实验表明,HiTokSR在感知质量和重建保真度方面均达到了最先进的性能。

英文摘要

Vector-quantized (VQ) generative models have shown promising results in real-world image super-resolution (Real-ISR). However, existing methods typically rely on a monolithic latent space that entangles low-frequency structures with high-frequency textures. This entanglement forces a single codebook to capture a combinatorially complex set of structure-texture pairings, which constrains representational capacity and limits codebook utilization. To address this issue, we present HiTokSR, a hierarchical token prediction framework. Instead of using a single codebook, HiTokSR partitions the latent space along the channel dimension into frequency-aware groups, quantizing each with an independent sub-codebook. This coarse-to-fine design disentangles global structures from fine details, enhancing combinatorial expressiveness while circumventing the optimization instability of high-dimensional nearest-neighbor lookups. To further improve semantic consistency, our generator integrates priors from a vision foundation model via adaptive feature modulation, multi-scale class tokens, and a representation alignment loss. Additionally, we introduce an index-level perturbation strategy during decoder fine-tuning to bridge the train-test discrepancy in discrete token prediction. Extensive experiments on real-world benchmarks demonstrate that HiTokSR achieves state-of-the-art performance in both perceptual quality and reconstruction fidelity.
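The idea of quantizing channel groups with independent sub-codebooks can be sketched as follows. This is a toy illustration, not the HiTokSR implementation: it uses plain lists, equal-sized groups, and nearest-neighbor lookup, and it does not model the frequency-aware grouping or the learned encoder. The function names are hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - x) ** 2 for c, x in zip(codebook[i], vec)),
    )


def grouped_quantize(latent, codebooks):
    """Split a latent vector into equal channel groups and quantize each one
    with its own sub-codebook.

    Returns the per-group indices and the reconstructed (quantized) vector.
    """
    size = len(latent) // len(codebooks)
    indices, recon = [], []
    for k, codebook in enumerate(codebooks):
        chunk = latent[k * size:(k + 1) * size]
        idx = nearest(codebook, chunk)
        indices.append(idx)
        recon.extend(codebook[idx])
    return indices, recon
```

With two sub-codebooks of K entries each, the pair of indices can address K × K combined codes while each lookup runs in a lower-dimensional space, which is the combinatorial benefit the abstract points to.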

2606.00675 2026-06-02 cs.LG

Mapping the evolution of small reservoirs in Brazil from 1984 to 2025 using deep learning

利用深度学习绘制1984年至2025年巴西小型水库的演变

Kylen Solvik, Luis Gustavo Carvalho, Marcia N. Macedo

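AI summary To address the neglect of small reservoirs in Brazil, uses a deep learning computer vision model to segment small reservoirs from Landsat data, producing annual country-wide reservoir maps for 1984 to 2025 that reveal large increases in reservoir count and area.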

Comments 33 pages, 5 figures, 2 tables

AI中文摘要

巴西的水研究在很大程度上忽视了为农业用途(如牲畜饮水、农场规模水电、灌溉和水产养殖)而广泛筑坝的小溪流。这些无处不在的水坝及其水库会改变水温、河流连通性、水生栖息地、温室气体排放和蒸发水损失。绘制小型水库地图具有挑战性,因为需要可靠地检测小型水体并将人工水库与天然湖泊区分开来。因此,大多数区域和全球数据集都将其排除在外。为了解决这一空白,我们训练了一个深度学习计算机视觉模型,利用Landsat 5-9的数据,准确分割巴西境内的小型(<1平方公里)、溪流补给的地表水水库。从1984年到2025年应用我们的模型,我们为整个国家创建了年度水库地图,以评估其数量、大小和分布随时间的变化。检测到的水库数量从263,913个增加到996,245个,增长了近四倍,而它们的总表面积从3510平方公里增加到8550平方公里。据我们所知,这是第一个代表四十年来小型水库演变的全国年度数据集。公开可用的年度地图突出了巴西各地小溪流蓄水工程的范围和累积影响,为管理淡水生态系统和水资源提供了可操作的见解。

英文摘要

Water research in Brazil largely overlooks the widespread damming of small streams for agricultural uses such as watering cattle, farm-scale hydropower, irrigation, and aquaculture. These ubiquitous dams and their reservoirs can alter water temperature, stream connectivity, aquatic habitats, greenhouse gas emissions, and evaporative water losses. Mapping small reservoirs is challenging because it requires reliably detecting small water bodies and distinguishing artificial reservoirs from natural lakes. As a result, most regional and global datasets exclude them. To address this gap, we trained a deep learning computer vision model to accurately segment small ($< 1\,\mathrm{km}^2$), stream-fed, surface water reservoirs in Brazil, leveraging data from Landsat 5-9. Applying our model from 1984 to 2025, we created annual reservoir maps for the entire country to evaluate how their count, size, and distribution have changed over time. The number of detected reservoirs grew nearly fourfold from 263,913 to 996,245, while their total surface area increased from $3510\,\mathrm{km}^2$ to $8550\,\mathrm{km}^2$. To our knowledge, this is the first country-wide annual dataset representing the evolution of small reservoirs over four decades. The publicly available annual maps highlight the extent and cumulative impacts of small stream impoundments across Brazil, providing actionable insights for managing freshwater ecosystems and water resources.
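The growth figures quoted in the abstract can be checked with a few lines of arithmetic. The mean area per detected reservoir computed below is a derived quantity, not a number reported in the abstract, and it is only a rough average over all detections.

```python
# Figures taken from the abstract.
count_1984, count_2025 = 263_913, 996_245
area_1984, area_2025 = 3510.0, 8550.0  # total surface area in km^2

# Growth factors between 1984 and 2025.
count_ratio = count_2025 / count_1984
area_ratio = area_2025 / area_1984

# Derived, not reported in the abstract: mean area per detected reservoir.
mean_area_1984 = area_1984 / count_1984
mean_area_2025 = area_2025 / count_2025

print(f"count grew {count_ratio:.2f}x, area grew {area_ratio:.2f}x")
print(f"mean area: {mean_area_1984:.4f} -> {mean_area_2025:.4f} km^2")
```

The count grows by a factor of about 3.77 ("nearly fourfold") while total area grows by about 2.44, so the average detected reservoir is smaller in 2025 than in 1984.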

2606.00141 2026-06-02 cs.LG cs.AI

Adaptive data selection improves wearable prediction under low baseline performance

自适应数据选择改善低基线性能下的可穿戴预测

Ali Kargarandehkordi

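AI summary Evaluates adaptive time-window selection strategies across multiple modalities and finds that they substantially improve AUROC for participants with low baseline performance (gains up to 0.7), while yielding limited or negative gains for those with strong baselines; gains are strongly inversely correlated with baseline performance.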

AI中文摘要

自适应传感策略通过选择性采样数据,在有限数据预算下提高预测性能,在可穿戴健康系统中应用日益广泛,但其在不同个体间的收益尚不明确。本文基于纵向可穿戴数据集,评估了在固定测量预算下,针对心率、活动和生态瞬时评估(EMA)等多种传感模态,自适应选择时间窗口进行模型训练的效果。我们使用接收者操作特征曲线下面积(AUROC)和F1分数量化了相对于随机采样的性能提升。自适应策略为基线性能较低的参与者带来了显著的AUROC提升(增益高达0.7),而对基线性能较强的参与者增益有限甚至为负。跨模态来看,自适应增益与基线性能呈强负相关(Pearson r = -0.67;Spearman ρ = -0.62)。在参与者层面,大多数个体在AUROC上受益(跨模态为60-80%),尽管F1的改进较小且一致性较差。这些发现表明,自适应传感并非普遍有益,而是在性能不佳的情况下提供最大价值。我们的结果支持基于基线性能定制自适应传感的选择性部署策略,以提高可穿戴健康监测的效率。

英文摘要

Adaptive sensing strategies that selectively sample data are increasingly used in wearable health systems to improve prediction performance under limited data budgets, yet their benefits across individuals remain poorly understood. Here, we evaluate adaptive selection of time windows for model training under fixed measurement budgets across multiple sensing modalities, including heart rate, activity, and ecological momentary assessment (EMA), in a longitudinal wearable dataset. We quantify performance gains relative to random sampling using both area under the receiver operating characteristic curve (AUROC) and F1 score. Adaptive strategies yield substantial improvements in AUROC for participants with low baseline performance (with gains up to 0.7), while offering limited or negative gains for participants with strong baselines. Across modalities, adaptive gain is strongly inversely correlated with baseline performance (Pearson r = -0.67; Spearman ρ = -0.62). At the participant level, most individuals benefit in AUROC (60-80% across modalities), although improvements in F1 are smaller and less consistent. These findings show that adaptive sensing is not uniformly beneficial, but instead provides the greatest value in underperforming settings. Our results support selective deployment strategies that tailor adaptive sensing based on baseline performance to improve efficiency in wearable health monitoring.
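
The correlation between baseline performance and adaptive gain reported above can be illustrated with a short sketch. The Python below computes Pearson r and Spearman ρ from scratch for per-participant pairs of baseline AUROC and AUROC gain. The five data points are invented for illustration and are not taken from the paper; the function names are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Illustrative values only (assumption): baseline AUROC and adaptive gain.
baseline = [0.30, 0.45, 0.55, 0.70, 0.85]
gain = [0.60, 0.35, 0.20, 0.05, -0.05]

r = pearson(baseline, gain)
rho = spearman(baseline, gain)
```

On this toy data the gain falls monotonically as the baseline rises, so ρ is -1 and r is close to -1. The values reported in the abstract (r = -0.67, ρ = -0.62) describe a noisier relationship with the same sign.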

2605.27590 2026-06-02 cs.CV cs.MM

ForestHG-Trace: Traceable Long-Horizon Ecological Reasoning over Large-Scale Forest Scenes

Zihang Cheng, Duanchu Wang, Cheng Li, Jing Huang, Huanzhao Fu, Di Wang

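AI summary Proposes ForestHG-Trace, a framework that combines an ecological hypergraph representation with an LLM-guided chain of deterministic tools to enable traceable multi-step ecological reasoning over forest scenes, and builds the ForestTraceQA benchmark; the framework substantially improves answer accuracy and execution faithfulness on long-horizon ecological QA.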

Comments It has theoretical flaws and experimental errors

Details
Abstract

Remote sensing question answering (RS-QA) often requires more than direct semantic prediction, especially in large-scale forest scenes where ecological analysis involves multi-step filtering, numerical aggregation, neighborhood reasoning, and verifiable evidence. We introduce ForestHG-Trace, a framework for traceable long-horizon ecological reasoning over forest environments. It represents multimodal NEON forest scenes as ecological hypergraphs, where tree instances, spatial units, semantic groups, and neighborhood relations support higher-order reasoning beyond pairwise scene graphs. An LLM-guided agent then invokes deterministic tools for reading, filtering, expansion, aggregation, comparison, and auditing, producing replayable execution traces and compact evidence records rather than only free-form answers. We further construct ForestTraceQA, an executable benchmark for evaluating ecological QA across diverse task types and reasoning depths. Experiments show that ForestHG-Trace substantially improves answer accuracy and execution faithfulness over single-step baselines and scene-graph agents, while highlighting execution depth as the main bottleneck for long-horizon ecological QA.

2605.26436 2026-06-02 cs.CL cs.AI

Targeted Remasking: Replacing Token Editing with Token-to-Mask Refinement in Discrete Diffusion Language Models

Lin Yao

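AI summary To address the limitations of token editing in discrete masked diffusion language models, proposes a training-free token-to-mask remasking method that resets suspected erroneous tokens to the mask state so that the diffusion process can re-predict them under cleaner context, markedly improving performance on mathematics and other tasks.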

Comments This paper has been significantly revised, expanded, and superseded by a more comprehensive version available at arXiv:2604.18738. The authors have chosen to withdraw this version to avoid overlap and direct readers to the updated work

Details
Abstract

Discrete masked diffusion language models such as LLaDA generate text through iterative denoising, where mask tokens are progressively replaced with predicted tokens. LLaDA2.1 introduced a Token-to-Token (T2T) editing mechanism that accelerates generation by directly replacing committed tokens suspected of being incorrect. However, we identify fundamental limitations of T2T editing: it couples error detection with replacement, pollutes the generation context with potentially incorrect tokens, and introduces a train-inference noise mismatch where systematic model-generated errors differ from the random perturbations seen during training. We propose Token-to-Mask (T2M) remasking, a training-free, drop-in replacement for T2T editing that resets suspected erroneous tokens back to the mask state, allowing the diffusion process to re-predict them under cleaner context. We design and empirically validate three complementary error detection strategies -- probability-based, trigger-mirrored, and temporal-difference-based -- and provide a unified theoretical analysis showing that T2M remasking purifies the generation context, converts systematic inference errors back to the model's native mask noise type, and enables delayed commitment for joint multi-position optimization. Comprehensive experiments across 12 benchmarks spanning knowledge, reasoning, mathematics, coding, and instruction following show that T2M generally improves performance on tasks requiring precise token-level output, with the largest gain on mathematics (+5.92% on CMATH). Error analysis on CMATH reveals that the dominant failure mode is last-mile token corruption -- where correct reasoning produces a corrupted final answer -- and that T2M repairs 59.4% of such cases.
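
The core T2M move described above, resetting suspected tokens to the mask state instead of overwriting them, can be sketched in a few lines. The Python below is a simplified illustration of a probability-based detector only: it assumes each committed token comes with a scalar confidence and remasks those below a threshold. The mask symbol, the threshold value, and the function name are assumptions made for this sketch, not details from the paper, and the trigger-mirrored and temporal-difference strategies are not shown.

```python
MASK = "<mask>"  # placeholder mask symbol (assumption)


def remask_low_confidence(tokens, confidences, threshold=0.5):
    """Reset committed tokens whose confidence is below `threshold` to MASK.

    tokens: list of committed tokens (already-masked positions are left alone)
    confidences: per-token probabilities assigned by the model (assumed given)
    """
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")
    return [
        MASK if (tok != MASK and p < threshold) else tok
        for tok, p in zip(tokens, confidences)
    ]


# Toy example: the final answer token is the only low-confidence position.
tokens = ["2", "+", "2", "=", "5"]
confidences = [0.90, 0.95, 0.90, 0.97, 0.30]
remasked = remask_low_confidence(tokens, confidences)
```

Here only the last position is returned to the mask state, so a later denoising step would re-predict it with the four confident tokens as context rather than with a possibly wrong replacement in place.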

2605.17554 2026-06-02 cs.AI cs.LG

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

Tanmay Asthana, Aman Saksena, Divyansh Sahu

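AI summary Introduces a benchmark of 42 expert-authored tasks that uses deterministic verifiers and a five-criterion rubric to evaluate three frontier deep research agents (Claude, OpenAI o3, Gemini) on management-consulting-style structured analytical deliverables, with embedded cognitive traps; joint acceptance rates are low for all agents (at most 21.4%), and each agent fails in its own distinctive way.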

Comments Updating the paper with more data. Will resubmit

Details
Abstract

Frontier deep research agents (DRAs) plan a research task, synthesize across documents, and return a structured deliverable on demand. They are being deployed in enterprise workflows faster than they are being evaluated. Existing benchmarks measure factual recall, single-hop QA, or generic agentic skill, missing the multi-document, decision-grade work DRAs are deployed to produce. We introduce a benchmark targeting the structured analytical deliverables that fill a management consultant's typical week. We grade three frontier agents, namely Claude Opus 4.6 with web search, OpenAI o3-deep-research, and Google Gemini 3.1 Pro deep-research, on 42 SME-authored prompts. Each of the 126 responses is scored on two layers: deterministic ground-truth verifiers (mean 13.8 per task) and a five-criterion 0-3 SME rubric, composed into a Verifier-Rubric Score (VRS) on 0-100. Most prompts embed cognitive traps that penalize surface-pattern matching. Acceptance under our joint threshold (rubric mean >= 2.5 and verifier rate >= 80%) is uniformly low: Gemini 21.4%, o3 9.5%, Claude 9.5%. Mean VRS scores agree with published rubric-based benchmarks (our top 62.6 vs. APEX-v1 64.2, ProfBench 65.9, ResearchRubrics < 68%), validating the rubric construct. ACCEPT rates sit below APEX-Agents' MC-segment Pass@1 band (12.3-22.7%) on dedicated DR agents; our floor is three points lower despite the harness advantage, opened by stricter conjunctive grading and trap design. Each agent fails distinctively. Claude produces the deliverable most reliably (4.5x the others' rate on file-required tasks) but carries the highest fabrication signature. o3 has the cleanest reasoning average yet drops required sections and propagates arithmetic errors. Gemini is bimodal, with the highest acceptance rate alongside the most zero-scored rubric cells.
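
The joint acceptance threshold stated above (rubric mean >= 2.5 and verifier rate >= 80%) is a conjunction, which is part of why acceptance is low: a response must clear both bars at once. A minimal Python sketch of that rule follows. The function name and argument layout are assumptions for illustration, and the abstract does not say how the 0-100 VRS is composed, so that score is not reproduced here.

```python
def is_accepted(rubric_scores, verifiers_passed, verifiers_total):
    """Apply the joint threshold: rubric mean >= 2.5 AND verifier rate >= 80%.

    rubric_scores: the 0-3 scores for the rubric criteria of one response
    verifiers_passed / verifiers_total: deterministic verifier outcomes
    """
    if not rubric_scores or verifiers_total <= 0:
        raise ValueError("need at least one rubric score and one verifier")
    rubric_mean = sum(rubric_scores) / len(rubric_scores)
    verifier_rate = verifiers_passed / verifiers_total
    return rubric_mean >= 2.5 and verifier_rate >= 0.80


# Hypothetical responses: only the first clears both bars.
strong_on_both = is_accepted([3, 3, 2, 3, 2], 12, 14)      # mean 2.6, rate ~0.86
fails_verifiers = is_accepted([3, 3, 3, 3, 3], 10, 14)     # mean 3.0, rate ~0.71
fails_rubric = is_accepted([2, 2, 3, 2, 3], 14, 14)        # mean 2.4, rate 1.0
```

A response with a perfect rubric still fails if too many verifiers fail, and the reverse also holds, which matches the stricter conjunctive grading the abstract describes.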

2604.09487 2026-06-02 cs.RO cs.LG

Sim-to-Real Transfer for Muscle-Actuated Robots via Generalized Actuator Networks

Jan Schneider, Mridul Mahajan, Le Chen, Simon Guist, Bernhard Schölkopf, Ingmar Posner, Dieter Büchler

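AI summary Proposes the Generalized Actuator Network (GenAN), which learns an actuator model from joint position trajectories to enable sim-to-real policy transfer for muscle-actuated robots, and for the first time successfully completes dynamic tasks on a four-degrees-of-freedom muscle-actuated robot arm.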

Details
Abstract

Tendon drives paired with soft muscle actuation enable faster and safer robots while potentially accelerating skill acquisition. Still, these systems are rarely used in practice due to inherent nonlinearities, friction, and hysteresis, which complicate modeling and control. So far, these challenges have hindered policy transfer from simulation to real systems. To bridge this gap, we propose a sim-to-real pipeline that learns a neural network model of this complex actuation and leverages established rigid body simulation for the arm dynamics and interactions with the environment. Our method, called Generalized Actuator Network (GenAN), enables actuation model identification across a wide range of robots by learning directly from joint position trajectories rather than requiring torque sensors. Using GenAN on PAMY2, a tendon-driven robot powered by pneumatic artificial muscles, we successfully deploy dynamic but precise goal-reaching, ball-in-a-cup, and table tennis policies, trained entirely in simulation. To the best of our knowledge, this result constitutes the first successful sim-to-real transfer for a four-degrees-of-freedom muscle-actuated robot arm.

2601.13574 2026-06-02 cs.RO

Highly Deformable Proprioceptive Membrane for Real-Time 3D Shape Reconstruction

Guanyu Xu, Jiaqi Wang, Dezhong Tong, Xiaonan Huang

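AI summary Proposes a soft, stretchable proprioceptive silicone membrane based on optical waveguide sensing; a data-driven model decodes deformation-dependent light-intensity signals for real-time 3D shape reconstruction, reaching a 90 Hz update rate and a 1.307 mm average reconstruction error on a 140 mm square membrane.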

Comments 13 pages, 9 figures

Details
Abstract

Reconstructing the three-dimensional (3D) geometry of object surfaces is essential for robot perception, yet vision-based approaches degrade under low illumination or occlusion. This limitation motivates the design of a proprioceptive membrane that conforms to the surface of interest and infers 3D geometry by reconstructing its own deformation. Conventional deformation-aware membranes typically rely on resistive, capacitive, or magneto-sensitive mechanisms, but can suffer from structural complexity, limited compliance during large-scale deformation, and susceptibility to electromagnetic interference. This work presents a soft, flexible, and stretchable proprioceptive silicone membrane based on optical waveguide sensing. The membrane integrates edge-mounted LEDs and centrally-distributed photodiodes (PDs) within a multilayer elastomeric composite. Rich deformation-dependent light-intensity signals are decoded by a data-driven model to recover the membrane geometry. Real-time reconstruction is demonstrated on a customized 140 mm square membrane at an end-to-end update rate of 90 Hz, achieving an average reconstruction error of 1.307 mm for out-of-plane deformation of up to 25 mm. The proposed sensor also demonstrates accurate reconstruction under large in-plane deformation, achieving reliable shape recovery up to 75% strain with an average Chamfer distance of 1.214 mm. The proposed framework provides a scalable, robust, and low-profile solution for global shape perception in deformable robotic systems.
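
The Chamfer distance used above to score in-plane reconstruction compares two point sets by nearest-neighbor distances. Several variants exist (summed or averaged, squared or unsquared), and the abstract does not say which one the authors use. The Python sketch below assumes a symmetric variant that averages unsquared Euclidean nearest-neighbor distances in both directions; the function name and the toy points are illustrative only.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Assumed variant: mean nearest-neighbor distance from A to B, mean
    nearest-neighbor distance from B to A, then the average of the two.
    """
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")
    a_to_b = sum(min(dist(p, q) for q in points_b) for p in points_a) / len(points_a)
    b_to_a = sum(min(dist(q, p) for p in points_a) for q in points_b) / len(points_b)
    return 0.5 * (a_to_b + b_to_a)


# Toy example: a reconstructed surface offset by 1 unit along z.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reconstruction = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
error = chamfer_distance(ground_truth, reconstruction)
```

This brute-force version costs O(|A| x |B|) distance evaluations, which is fine for a sketch; dense meshes would normally use a spatial index for the nearest-neighbor queries.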

2511.02937 2026-06-02 cs.RO cs.SE cs.SY eess.SY

Toward an Agricultural Operational Design Domain: A Framework

Mirco Felske, Jannik Redenius, Georg Happich, Julius Schöning

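AI summary To address the particular challenges of autonomous agricultural systems operating in complex, variable environments, proposes an agricultural operational design domain framework comprising the Ag-ODD description concept, a 7-Layer Model, and an iterative verification process, so that environment descriptions are structured, transparent, and verifiable.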

Comments 18 pages, 7 figures, 2 tables

Details
Journal ref
Smart Agricultural Technology, Volume 14, August 2026
Abstract

The agricultural sector increasingly relies on autonomous systems that operate in complex and variable environments. Unlike on-road applications, agricultural automation integrates driving and working processes, each of which imposes distinct operational constraints. Handling this complexity and ensuring consistency throughout the development and validation processes requires a structured, transparent, and verified description of the environment. However, existing Operational Design Domain (ODD) concepts do not yet address the unique challenges of agricultural applications. Therefore, this work introduces the Agricultural ODD (Ag-ODD) Framework, which can be used to describe and verify the operational boundaries of autonomous agricultural systems. The Ag-ODD Framework consists of three core elements. First, the Ag-ODD description concept provides a structured method for unambiguously defining environmental and operational parameters using concepts from ASAM Open ODD and CityGML. Second, the 7-Layer Model, derived from the PEGASUS 6-Layer Model, adds a process layer to capture dynamic agricultural operations. Third, the iterative verification process verifies the Ag-ODD against its corresponding logical scenarios, derived from the 7-Layer Model, to ensure the Ag-ODD's completeness and consistency. Together, these elements provide a consistent approach for creating unambiguous and verifiable Ag-ODDs. Demonstrative use cases show how the Ag-ODD Framework can support the standardization and scalability of environmental descriptions for autonomous agricultural systems.

2510.17532 2026-06-02 cs.CL cs.LG

OncoReason: Structuring Clinical Reasoning in LLMs for Robust and Interpretable Survival Prediction

Raghu Vamshi Hemadri, Geetha Krishna Guruju, Kristi Topollai, Anna Ewa Choromanska

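AI summary Proposes a unified multi-task learning framework that uses three alignment strategies (supervised fine-tuning, chain-of-thought prompting, and reinforcement learning) to train autoregressive LLMs on the MSK-CHORD dataset to jointly perform binary survival classification, continuous survival-time regression, and natural-language rationale generation, yielding interpretable survival prediction.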

Comments This manuscript is withdrawn to allow careful review and correction of bibliographic issues identified after submission, including references that could not be adequately verified. These matters should be resolved before further circulation

Details
Abstract

Predicting cancer treatment outcomes requires models that are both accurate and interpretable, particularly in the presence of heterogeneous clinical data. While large language models (LLMs) have shown strong performance in biomedical NLP, they often lack structured reasoning capabilities critical for high-stakes decision support. We present a unified, multi-task learning framework that aligns autoregressive LLMs with clinical reasoning for outcome prediction on the MSK-CHORD dataset. Our models are trained to jointly perform binary survival classification, continuous survival time regression, and natural language rationale generation. We evaluate three alignment strategies: (1) standard supervised fine-tuning (SFT), (2) SFT with Chain-of-Thought (CoT) prompting to elicit step-by-step reasoning, and (3) Group Relative Policy Optimization (GRPO), a reinforcement learning method that aligns model outputs to expert-derived reasoning trajectories. Experiments with LLaMa3-8B and Med42-8B backbones demonstrate that CoT prompting improves F1 by +6.0 and reduces MAE by 12%, while GRPO achieves state-of-the-art interpretability and predictive performance across BLEU, ROUGE, and BERTScore. We further show that existing biomedical LLMs often fail to produce valid reasoning traces due to architectural constraints. Our findings underscore the importance of reasoning-aware alignment in multi-task clinical modeling and set a new benchmark for interpretable, trustworthy LLMs in precision oncology.

2510.10982 2026-06-02 cs.LG cs.AI

Catch-Only-One: Non-Transferable Examples for Model-Specific Authorization

Zihan Wang, Zhiyong Ma, Zhongkui Ma, Shuofeng Liu, Akide Liu, Derui Wang, Minhui Xue, Guangdong Bai

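AI summary Proposes non-transferable examples (NTEs), which recode data as a "ciphertext" decodable only by a designated model; the training-free method exploits a model-specific low-sensitivity subspace to preserve fidelity for the authorized model while degrading the performance of unauthorized models.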

Details
Abstract

Recent AI regulations increasingly emphasize the need for mechanisms that preserve the utility of data for AI innovation while preventing misuse, particularly by enforcing purpose limitation in downstream AI applications. In practice, enforcing this principle remains challenging, as released data can be trivially fed into arbitrary models beyond its declared intent. Existing approaches attempt to mitigate this risk by either perturbing data or retraining models to limit unintended use. These strategies, however, offer no protection against inference by unknown or externally trained models, or fundamentally rely on control over the training or deployment. In this work, we introduce non-transferable examples (NTEs), recoded data that act as a task-level "ciphertext" decodable only by a designated model. Whereas adversarial examples exploit directions of high model sensitivity, NTEs leverage the complementary insensitive subspace. We propose a training-free, data-agnostic method that recodes data within a model-specific low-sensitivity subspace, preserving outputs for the authorized model while degrading unauthorized ones through subspace misalignment. We establish formal bounds certifying authorized-model fidelity and showing that unauthorized degradation scales with measurable spectral misalignment between models. Empirically, NTEs preserve performance across diverse vision backbones and state-of-the-art vision-language models under common preprocessing, while unauthorized models collapse even under adaptive reconstruction attacks. These results establish NTEs as a practical means to preserve intended data utility while preventing unauthorized exploitation. Our project is available at https://trusted-system-lab.github.io/model-specificity

2510.09222 2026-06-02 cs.LG

FM-IRL: Flow-Matching for Reward Modeling and Policy Regularization in Reinforcement Learning

Zhenglin Wan, Jingxuan Wu, Xingrui Yu, Chubin Zhang, Mingcong Lei, Bo An, Ivor Tsang

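AI summary Proposes using a flow matching (FM) model as a teacher that guides a student policy with a simple MLP structure through online reinforcement learning via reward modeling and policy regularization, addressing the instability and inefficiency of optimizing FM policies through online interaction.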

Comments We have submitted a new version of this paper to arxiv (with new framing and title), arXiv:2605.27095. To avoid the misunderstanding of the readers, we request to withdraw the old-version of this paper

Details
Abstract

Flow Matching (FM) has shown remarkable ability in modeling complex distributions and achieves strong performance in offline imitation learning for cloning expert behaviors. However, despite their expressiveness in behavioral cloning, FM-based policies are inherently limited by their lack of environmental interaction and exploration. This leads to poor generalization in unseen scenarios beyond the expert demonstrations, underscoring the necessity of online interaction with the environment. Unfortunately, optimizing FM policies via online interaction is challenging and inefficient due to instability in gradient computation and high inference costs. To address these issues, we propose letting a student policy with a simple MLP structure explore the environment and be updated online via an RL algorithm with a reward model. This reward model is associated with a teacher FM model, which contains rich information about the expert data distribution. Furthermore, the same teacher FM model is used to regularize the student policy's behavior and stabilize policy learning. Due to the student's simple architecture, we avoid the gradient instability of FM policies and enable efficient online exploration, while still leveraging the expressiveness of the teacher FM model. Extensive experiments show that our approach significantly enhances learning efficiency, generalization, and robustness, especially when learning from suboptimal expert data.

2510.01167 2026-06-02 cs.LG cs.AI cs.CL

Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards

Yiran Shen, Yu Xia, Jonathan Chang, Prithviraj Ammanabrolu

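AI summary Proposes MAHALO, a framework that combines standardized PRM training, Multi-Action-Head DPO, and PRM-guided decoding to align large language models on multiple objectives across verifiable and non-verifiable rewards, reducing conflicts between objectives and supporting control at inference time.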

Comments ICML 2026

Details
Abstract

Aligning large language models to human preferences is inherently multidimensional, yet most pipelines collapse heterogeneous signals into a single objective. We seek to answer what it would take to simultaneously align a model across domains spanning verifiable rewards, non-verifiable subjective preferences, and complex interactive scenarios. Such multi-objective alignment setups are often plagued by individual objectives being at odds with each other, resulting in inefficient training and limited user control during inference. To address these issues, we propose $\textbf{M}$ulti-$\textbf{A}$ction-$\textbf{H}$ead $\textbf{AL}$ignment with PRM-guided Dec$\textbf{O}$ding ($\textbf{MAHALO}$), a unified framework that standardizes PRM training across verifiable and non-verifiable settings for step-level supervision, performs vectorized multi-objective alignment with Multi-Action-Head DPO, and enables controllable inference through objective-specific weighting and PRM-guided decoding. Experiments across math reasoning, human values alignment, and multi-turn tutoring show that MAHALO improves multiple objectives simultaneously with limited interference, while remaining generalizable and adaptable across domains and offering flexible user control at inference time. Our code is available at: https://github.com/pearls-lab/multiobj-align.

2506.11903 2026-06-02 cs.CL

GeistBERT: Breathing Life into German NLP

Raphael Scheible-Schmitt, Johann Frei

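AI summary Through incremental pre-training on a large German corpus with the RoBERTa architecture and Whole Word Masking, GeistBERT achieves leading performance on several German NLP tasks and sets a new SOTA on GermEval 2018 fine-grained text classification.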

Details
Journal ref
Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Models, 2025, pp. 42-50
Abstract

Advances in transformer-based language models have highlighted the benefits of language-specific pre-training on high-quality corpora. In this context, German NLP stands to gain from updated architectures and modern datasets tailored to the linguistic characteristics of the German language. GeistBERT seeks to improve German language processing by incrementally training on a diverse corpus and optimizing model performance across various NLP tasks. We pre-trained GeistBERT using fairseq, following the RoBERTa base configuration with Whole Word Masking (WWM), and initialized from GottBERT weights. The model was trained on a 1.3 TB German corpus with dynamic masking and a fixed sequence length of 512 tokens. For evaluation, we fine-tuned the model on standard downstream tasks, including NER (CoNLL 2003, GermEval 2014), text classification (GermEval 2018 coarse/fine, 10kGNAD), and NLI (German XNLI), using $F_1$ score and accuracy as evaluation metrics. GeistBERT achieved strong results across all tasks, leading among base models and setting a new state-of-the-art (SOTA) in GermEval 2018 fine text classification. It also outperformed several larger models, particularly in classification benchmarks. To support research in German NLP, we release GeistBERT under the MIT license.

2411.11436 2026-06-02 cs.LG cs.AI

Implicit Regularization for Multi-label Feature Selection

Dou El Kefel Mansouri, Khalid Benabdeslem, Seif-Eddine Benkabou

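AI summary For feature selection in multi-label learning, proposes an estimator based on implicit regularization and label embedding that uses Hadamard product parameterization to avoid the extra bias of explicit regularization terms; experiments suggest the method reduces bias and may lead to benign overfitting.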

Comments 14 pages, 11 figures, Submitted for publication and currently under review

Details
Abstract

In this paper, we address the problem of feature selection in multi-label learning by using a new estimator based on implicit regularization and label embedding. Unlike sparse feature selection methods that use a penalized estimator with explicit regularization terms such as the $l_{2,1}$-norm, MCP, or SCAD, we propose a simple alternative via Hadamard product parameterization. To guide the feature selection process, a multi-label informed latent semantic method is adopted as a label embedding. Experimental results on some known benchmark datasets suggest that the proposed estimator suffers much less from extra bias, and may lead to benign overfitting.
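
The idea of implicit regularization through a Hadamard product parameterization can be shown on a toy least-squares problem. The Python sketch below writes the weight vector as w = u ⊙ v, starts both factors at a small value, and runs plain gradient descent with no penalty term. Everything here is an assumption made for illustration: the single-output setting, the identity design matrix, the initialization scale, the step size, and the function name. The paper's actual estimator also involves a label embedding, which is not reproduced.

```python
def fit_hadamard(X, y, init=0.1, lr=0.1, steps=500):
    """Gradient descent on 0.5 * ||X w - y||^2 with w = u * v (elementwise).

    No explicit penalty is used; sparsity pressure comes only from the
    small initialization of the factors u and v.
    """
    n_features = len(X[0])
    u = [init] * n_features
    v = [init] * n_features
    for _ in range(steps):
        w = [ui * vi for ui, vi in zip(u, v)]
        residual = [
            sum(xij * wj for xij, wj in zip(row, w)) - yi
            for row, yi in zip(X, y)
        ]
        # Gradient of the loss with respect to w: X^T (X w - y).
        grad_w = [
            sum(row[j] * r for row, r in zip(X, residual))
            for j in range(n_features)
        ]
        # Chain rule: dL/du = grad_w * v and dL/dv = grad_w * u.
        # Both updates use the old u and v (tuple assignment).
        u, v = (
            [ui - lr * g * vi for ui, vi, g in zip(u, v, grad_w)],
            [vi - lr * g * ui for ui, vi, g in zip(u, v, grad_w)],
        )
    return [ui * vi for ui, vi in zip(u, v)]


# Toy problem (assumption): only the first of three features carries signal.
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [1.0, 0.0, 0.0]
w = fit_hadamard(X, y)
```

The weight on the informative feature grows quickly toward 1, while the other two only shrink from their small starting value, so a simple magnitude threshold separates selected from unselected features. Because both factors start equal and positive, this particular sketch can only represent non-negative weights; a signed variant would need a different initialization or parameterization.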

2402.14521 2026-06-02 cs.CL

Malaysian English News Decoded: A Linguistic Resource for Named Entity and Relation Extraction

Mohan Raj Chanthran, Lay-Ki Soon, Huey Fang Ong, Bhawani Selvaretnam

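AI summary Because differences between Malaysian English and standard English make NLP tasks harder, the authors build the MEN dataset of 200 news articles with manually annotated entities and relations, and show by fine-tuning the spaCy NER tool that a tailor-made dataset significantly improves NER performance.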

Comments Accepted at LREC-COLING 2024

Details
Abstract

Standard English and Malaysian English exhibit notable differences, posing challenges for natural language processing (NLP) tasks on Malaysian English. Unfortunately, most existing datasets are based on standard English and are therefore inadequate for improving NLP tasks in Malaysian English. An experiment using state-of-the-art Named Entity Recognition (NER) solutions on Malaysian English news articles shows that they cannot handle morphosyntactic variations in Malaysian English. To the best of our knowledge, there is no annotated dataset available to improve the model. To address these issues, we constructed a Malaysian English News (MEN) dataset, which contains 200 news articles that are manually annotated with entities and relations. We then fine-tuned the spaCy NER tool and validated that having a dataset tailor-made for Malaysian English could improve the performance of NER in Malaysian English significantly. This paper presents our effort in data acquisition, the annotation methodology, and a thorough analysis of the annotated dataset. To validate the quality of the annotation, inter-annotator agreement was used, followed by adjudication of disagreements by a subject matter expert. Upon completion of these tasks, we developed a dataset with 6,061 entities and 3,268 relation instances. Finally, we discuss the spaCy fine-tuning setup and analyze the NER performance. This unique dataset will contribute significantly to the advancement of NLP research in Malaysian English, allowing researchers to accelerate their progress, particularly in NER and relation extraction. The dataset and annotation guidelines have been published on GitHub.

2606.02528 2026-06-02 q-fin.GN cs.CY cs.LG

Auditing Asset-Specific Preferences in Financial Large Language Models: Evidence from Bitcoin Representations and Portfolio Allocation

审计金融大语言模型中的资产特定偏好:来自比特币表征与投资组合配置的证据

Wenbin Wu

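AI summary: Using a three-level audit protocol, this study finds that large language models show a frame-dependent preference for Bitcoin, and it identifies a Bitcoin-selective internal feature that can be causally intervened on and that significantly affects downstream portfolio allocation.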

Comments 28 pages, 5 figures, 18 tables

Details
AI Chinese abstract

大型语言模型现已驱动机器人顾问和交易代理,但它们是否对特定资产存在固有偏见尚未得到充分检验。我们提出三个问题:LLMs是否系统性地偏好某些金融工具;能否识别出对这些偏好具有因果杠杆作用的内部表征;以及该表征是否影响下游金融决策。我们开发了一个三级审计协议并将其应用于比特币。首先,对八个前沿LLMs的行为审计显示,比特币在货币类工具中的排名具有框架依赖性:模型将其置于“可靠货币”的第5位(共8位),但在危机和自主代理框架下接近榜首,且属性交换实验确认排名追踪功能属性而非名称。其次,我们打开模型内部:在Gemma 3中搜索数千个稀疏自编码器特征,识别出一个主导的比特币选择性特征。放大该特征会使模型偏向该资产,抑制则使其远离,即使提示中从未出现“比特币”。第三,我们测试金融后果:放大使比特币在投资组合中的份额提高5.2个百分点,而抑制降低4.6个百分点,放大在加密资产内重新分配,抑制则削减总加密敞口。我们将此描述为有界行为杠杆(杠杆指对输出的因果影响,而非金融杠杆):一个可识别的内部特征可被扰动以改变金融选择,但仅在可测量的限度内。该框架将内部表征与外部建议联系起来,并通过随机对照和机制边界进行验证。随着LLMs成为自主金融代理,这是迈向新兴“了解你的代理”(KYA)标准的行为层的第一步:了解代理偏好什么,以及该偏好可被移动多远。

English abstract

Large language models now power robo-advisors and trading agents, yet whether they carry built-in biases toward specific assets is largely untested. We ask three questions: do LLMs systematically prefer certain financial instruments; can an internal representation with causal leverage over those preferences be identified; and does that representation affect downstream financial decisions? We develop a three-level audit protocol and apply it to Bitcoin. First, a behavioral audit of eight frontier LLMs shows that Bitcoin's ranking among money-like instruments is frame-dependent: models place it around rank 5 of 8 as "reliable money" but near the top under crisis and autonomous-agent frames, and an attribute-swap experiment confirms rankings track functional properties, not names. Second, we open a model's internals: a search across thousands of sparse-autoencoder features in Gemma 3 identifies a dominant Bitcoin-selective feature. Amplifying it shifts the model toward the asset and suppressing it shifts the model away, even when "Bitcoin" never appears in the prompt. Third, we test financial consequences: amplification raises Bitcoin's portfolio share by 5.2 percentage points while suppression lowers it by 4.6 pp, with amplification reallocating within crypto and suppression cutting total crypto exposure. We characterize this as bounded behavioral leverage (leverage meaning causal influence over outputs, not financial leverage): an identifiable internal feature can be perturbed to move financial choices, but only within measurable limits. The framework links internal representations to external recommendations, validated with random controls and mechanism boundaries. As LLMs become autonomous financial agents, this is a first step toward a behavioral layer for emerging know-your-agent (KYA) standards: knowing what an agent prefers, and how far that preference can be moved.
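
The abstract describes amplifying and suppressing a sparse-autoencoder feature but does not say how the intervention is implemented. One common approach, assumed here purely for illustration, is to add a scaled copy of the feature's decoder direction to a hidden activation. The sketch below uses toy vectors; `steer`, `feature_readout`, and all numbers are hypothetical and are not taken from the paper or from Gemma 3.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def steer(activation, feature_direction, scale):
    """Return activation + scale * unit(feature_direction).

    scale > 0 amplifies the feature; scale < 0 suppresses it.
    """
    norm = dot(feature_direction, feature_direction) ** 0.5
    unit = [d / norm for d in feature_direction]
    return [a + scale * u for a, u in zip(activation, unit)]


def feature_readout(activation, feature_direction):
    """Projection of the activation onto the unit feature direction."""
    norm = dot(feature_direction, feature_direction) ** 0.5
    return dot(activation, feature_direction) / norm


# Toy hidden state and toy decoder direction (assumed values, norm 5).
activation = [0.5, -1.0, 2.0]
direction = [3.0, 0.0, 4.0]

base = feature_readout(activation, direction)
amplified = feature_readout(steer(activation, direction, 2.0), direction)
suppressed = feature_readout(steer(activation, direction, -2.0), direction)
print(base, amplified, suppressed)
```

In this toy setup the readout along the feature moves by exactly the chosen scale in either direction. In a real model the effect on downstream outputs is indirect, which is consistent with the abstract's point that the leverage is bounded.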