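arXivDaily: a daily academic digest synced with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.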

2509.00834 2026-06-01 cs.AI cs.FL cs.LG cs.LO

Neuro-Symbolic Predictive Process Monitoring

Axel Mezini, Elena Umili, Ivan Donadello, Fabrizio Maria Maggi, Matteo Mancanelli, Fabio Patrizi

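AI summary: Proposes a neuro-symbolic method that combines data-driven learning with temporal-logic prior knowledge, training autoregressive sequence predictors with a differentiable logic loss to improve the accuracy and logical consistency of suffix prediction in business process management.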

Abstract

This paper addresses the problem of suffix prediction in Business Process Management (BPM) by proposing a Neuro-Symbolic Predictive Process Monitoring (PPM) approach that integrates data-driven learning with temporal logic-based prior knowledge. While recent approaches leverage deep learning models for suffix prediction, they often fail to satisfy even basic logical constraints due to the lack of explicit integration of domain knowledge during training. We propose a novel method to incorporate Linear Temporal Logic over finite traces (LTLf) into the training process of autoregressive sequence predictors. Our approach introduces a differentiable logical loss function, defined using a soft approximation of LTLf semantics and the Gumbel-Softmax trick, which can be combined with standard predictive losses. This ensures that the model learns to generate suffixes that are both accurate and logically consistent. Experimental evaluation on three real-world datasets shows that our method improves suffix prediction accuracy and compliance with temporal constraints. We also introduce two variants of the logic loss (local and global) and demonstrate their effectiveness under noisy and realistic settings. While developed in the context of BPM, our framework is applicable to any symbolic sequence generation task and contributes to advancing Neuro-Symbolic AI.
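A rough illustration of how a differentiable logic loss of this kind can be assembled. This is a sketch under stated assumptions, not the authors' implementation: the product-based soft semantics for "eventually" and "always", the function names, and the toy logits are all assumptions made here. The abstract only states that a soft approximation of LTLf semantics and the Gumbel-Softmax trick are used.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Draw a relaxed one-hot sample from categorical logits."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_eventually(step_probs):
    """Assumed soft 'eventually a': 1 - prod_t (1 - p_t)."""
    miss = 1.0
    for p in step_probs:
        miss *= 1.0 - p
    return 1.0 - miss


def soft_always(step_probs):
    """Assumed soft 'always a': prod_t p_t."""
    sat = 1.0
    for p in step_probs:
        sat *= p
    return sat


def logic_loss(satisfaction, eps=1e-9):
    """Negative log of the soft satisfaction degree."""
    return -math.log(satisfaction + eps)


rng = random.Random(0)
logits_per_step = [[2.0, 0.1, -1.0], [0.0, 1.5, 0.2], [0.3, 0.3, 2.5]]
suffix = [gumbel_softmax(step, tau=0.5, rng=rng) for step in logits_per_step]
# Hypothetical constraint: activity 2 must eventually occur in the suffix.
loss = logic_loss(soft_eventually([step[2] for step in suffix]))
```

In a training loop, a term like `loss` would be added to the usual next-activity prediction loss with some weight; that weighting is also an assumption here.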

2508.20478 2026-06-01 cs.CV

Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding

Yuan Xie, Tianshui Chen, Zheng Ge, Lionel Ni

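AI summary: Proposes Video-MTR, a framework that uses reinforced multi-turn reasoning to iteratively select key video segments and comprehend the question; a gated bi-level reward system enables end-to-end training, improving accuracy and efficiency on long-video understanding benchmarks.

Comments: Accepted by ICML 2026. Camera-ready version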

Abstract

Long-form video understanding, characterized by long-range temporal dependencies and multiple events, remains a challenge. Existing methods often rely on static reasoning or external visual-language models (VLMs), which face issues like complexity and sub-optimal performance due to the lack of end-to-end training. In this paper, we propose Video-MTR, a reinforced multi-turn reasoning framework designed to enable iterative key video segment selection and question comprehension. Unlike traditional video reasoning pipelines, which generate predictions in a single turn, Video-MTR performs reasoning in multiple turns, selecting video segments progressively based on the evolving understanding of previously processed segments and the current question. This iterative process allows for a more refined and contextually aware analysis of the video. To ensure a sound intermediate reasoning process, we introduce a novel gated bi-level reward system, combining trajectory-level rewards based on answer correctness and turn-level rewards emphasizing frame-query relevance. This system optimizes both video segment selection and question comprehension, eliminating the need for external VLMs and allowing end-to-end training. Extensive experiments on benchmarks like VideoMME, MLVU, and EgoSchema demonstrate that Video-MTR outperforms existing methods in both accuracy and efficiency, advancing the state of the art in long video understanding.

2508.19830 2026-06-01 cs.CV cs.AI

Target-Agnostic Calibration under Distribution Shift with Frequency-Aware Gradient Rectification

Yilin Zhang, Cai Xu, You Wu, Ziyu Guan, Wei Zhao

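AI summary: Proposes the Frequency-aware Gradient Rectification (FGR) framework, which low-pass filters a subset of training images to reduce spurious high-frequency cues and learn domain-invariant features, and uses geometric projection so that in-distribution calibration does not degrade, improving calibration under distribution shift without any target-domain information.

Comments: 25 pages, Accepted at ICML 2026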

Abstract

Real-world model deployments inevitably encounter distribution shifts, rendering the confidence estimates of deep neural networks highly unreliable, posing severe risks in safety-critical applications. Existing methods improve calibration via training-time regularization or post-hoc adjustment, but often rely on access to (or simulation of) target domains, limiting practicality. We propose Frequency-aware Gradient Rectification (FGR), a target-agnostic training framework for robust calibration. From a frequency perspective, FGR applies low-pass filtering to a subset of training images to diminish spurious high-frequency cues and encourage the learning of domain-invariant features. However, the associated information loss can degrade In-Distribution (ID) calibration. To resolve this trade-off, FGR treats ID calibration as a hard constraint and rectifies conflicting parameter updates via geometric projection. This ensures a first-order non-increase in the ID calibration objective without introducing an additional loss-balancing coefficient. Extensive experiments on synthetic, real-world, and semantic shift datasets demonstrate that FGR significantly improves calibration under diverse shifts while preserving ID performance, and it remains compatible with post-hoc calibration methods. Our code is available at https://github.com/YilinZhang107/FGR-Calib.
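The "geometric projection" step can be pictured with a small sketch. It assumes the simplest possible rule: when the proposed update direction has a negative inner product with the gradient of the ID calibration objective, the conflicting component is projected out. The function names and the exact rule are assumptions made for illustration; the abstract does not spell out the authors' formulation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rectify_gradient(g_update, g_id):
    """Remove the component of g_update that conflicts with g_id.

    g_update: gradient the optimizer would descend along (e.g. from the
              low-pass-filtered batch).
    g_id:     gradient of the in-distribution calibration objective.
    A step theta <- theta - lr * g changes the ID objective by roughly
    -lr * dot(g, g_id), so a non-negative inner product means a
    first-order non-increase.
    """
    conflict = dot(g_update, g_id)
    norm_sq = dot(g_id, g_id)
    if conflict >= 0.0 or norm_sq == 0.0:
        return list(g_update)  # no conflict: keep the update unchanged
    scale = conflict / norm_sq
    return [gu - scale * gi for gu, gi in zip(g_update, g_id)]


rectified = rectify_gradient([1.0, -2.0], [0.0, 1.0])
unchanged = rectify_gradient([1.0, 1.0], [0.0, 1.0])
```

Because the constraint is enforced by projection rather than by a weighted penalty, no extra loss-balancing coefficient appears, which matches the claim in the abstract.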

2508.18730 2026-06-01 cs.LG cs.AR

Beyond Tokens: Enhancing RTL Quality Estimation via Structural Graph Learning

Yi Liu, Hongji Zhang, Yiwen Wang, Dimitris Tsaras, Lei Chen, Mingxuan Yuan, Qiang Xu

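AI summary: Proposes the StructRTL framework, which exploits the structural semantics of control data flow graphs through self-supervised learning, combined with knowledge distillation, to significantly improve the accuracy of register-transfer-level design quality estimation.

Comments: Forty-Third International Conference on Machine Learning (ICML 2026)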

Abstract

Estimating the quality of register transfer level (RTL) designs is crucial in the electronic design automation (EDA) workflow, as it enables instant feedback on key performance metrics like area and delay without the need for time-consuming logic synthesis. While recent approaches have leveraged large language models (LLMs) to derive embeddings from RTL code and achieved promising results, they overlook the structural semantics essential for accurate quality estimation. In contrast, the control data flow graph (CDFG) view exposes the design's structural characteristics more explicitly, offering richer cues for representation learning. In this work, we introduce StructRTL, a novel structure-aware graph self-supervised learning framework for improved RTL design quality estimation. By learning structure-informed representations from CDFGs, StructRTL significantly outperforms prior art on various quality estimation tasks. To further boost performance, we incorporate a knowledge distillation strategy that transfers low-level insights from post-mapping netlists into the CDFG-based predictor. Experimental results demonstrate that StructRTL establishes new state-of-the-art results, highlighting the effectiveness of combining structural learning with cross-stage supervision.

2508.16687 2026-06-01 cs.LG

Native Hierarchical and Compositional Representations with Subspace Embeddings

Gabriel Moreira, Zita Marinho, Manuel Marques, João Paulo Costeira, Chenyan Xiong

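AI summary: Proposes representing concepts as linear subspaces instead of vectors, modeling hierarchy and compositionality naturally through subspace dimension and inclusion, and introduces differentiable soft projection matrices for end-to-end training, reaching state-of-the-art performance on hierarchical reasoning and natural language inference tasks.

Comments: KDD 2026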

Abstract

Traditional embeddings represent datapoints as vectors, which makes similarity easy to compute but limits how well they capture hierarchies and compositionality. We propose a fundamentally different approach: representing concepts as linear subspaces. By spanning multiple dimensions, subspaces can model broader concepts with higher-dimensional regions and nest more specific concepts within them. This geometry naturally captures generality through dimension, hierarchy through inclusion, and enables an emergent structure for composition via linear algebraic operations. To make this paradigm trainable, we introduce a differentiable subspace parameterization via soft projection matrices, allowing the effective dimension of each subspace to be learned. Our method not only achieves state-of-the-art performance on hierarchical and natural language inference benchmarks but also provides a geometrically-grounded model of entailment. Further, we demonstrate that while standard vector embeddings degrade to near-random performance on negated queries, subspace embeddings natively capture logical composition without explicit supervision, while preserving compatibility with efficient Euclidean vector search.
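To make "generality through dimension, hierarchy through inclusion" concrete, here is a small sketch using hard orthogonal projections. Note the assumption: the abstract describes learned soft projection matrices, whereas this example uses plain Gram-Schmidt bases, and the names `inclusion_score`, `animal`, and `dog` are hypothetical.

```python
import math


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis of the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def project(v, basis):
    """Orthogonal projection of v onto the span of an orthonormal basis."""
    out = [0.0] * len(v)
    for b in basis:
        coeff = sum(vi * bi for vi, bi in zip(v, b))
        out = [oi + coeff * bi for oi, bi in zip(out, b)]
    return out


def inclusion_score(specific, general):
    """1.0 when span(specific) lies inside span(general), smaller otherwise."""
    spec = orthonormal_basis(specific)
    gen = orthonormal_basis(general)
    if not spec:
        return 1.0
    kept = sum(sum(p * p for p in project(b, gen)) for b in spec)
    return kept / len(spec)


animal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # broader concept: 2 dimensions
dog = [[1.0, 1.0, 0.0]]                      # narrower concept: 1 dimension
dog_in_animal = inclusion_score(dog, animal)
animal_in_dog = inclusion_score(animal, dog)
```

The score is asymmetric, which is what lets inclusion encode entailment direction: the narrower subspace fits entirely inside the broader one, but not the other way around.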

2503.21168 2026-06-01 cs.RO cs.SY eess.SY

TAGA: A Tangent-Based Reactive Approach for Socially Compliant Robot Navigation Around Human Groups

Utsha Kumar Roy, Sejuti Rahman

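AI summary: Proposes TAGA, which steers around detected group boundaries along tangent paths and coordinates group-level and individual avoidance; introduces the Group Crossing Rate (GCR) metric; and, under several crowd-dynamics models, finds an asymmetry between its effect on reactive and on learned methods.

Comments: 8 pages, 3 figures, 3 tables. Submitted to IEEE Robotics and Automation Letters (RA-L)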

Abstract

Robots navigating human-populated environments must avoid collisions while respecting the social structure of crowds, particularly the implicit boundaries of social groups. Most navigation approaches model humans as independent individuals, causing socially disruptive behavior even when collision-free. This paper presents TAGA (Tangent Action for Group Avoidance), which avoids detected group boundaries via tangent-path maneuvers without modifying the underlying navigation policy. A hierarchical safety controller coordinates group-level avoidance with individual collision prevention. We propose the Group Crossing Rate (GCR), a continuous metric measuring the fraction of timesteps the robot spends inside any group convex hull, providing finer-grained social compliance assessment than terminal metrics alone. We introduce a realistic crowd simulation benchmark with five empirically grounded phases: individual speed heterogeneity, group speed coupling, F-formation static groups, leader-follower dynamics, and convex-hull boundaries, evaluated under both ORCA and Social Force pedestrian dynamics. Experiments across ORCA, Social Force, DS-RNN, and Intention-RL reveal a reactive-learning asymmetry: TAGA provides the largest gains for classical reactive baselines (up to +8pp success rate, GCR halved) with near-zero cost for learned policies. These findings offer actionable guidance for when modular group-awareness adds value versus when end-to-end group-aware training is preferable.
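The GCR definition (fraction of timesteps spent inside any group convex hull) translates fairly directly into code. The following is a minimal 2D sketch; the function names, the treatment of boundary points as "inside", and the rule that groups of fewer than three non-collinear members have no hull are assumptions made here, not details taken from the paper.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_hull(point, hull):
    """True if point lies inside or on a convex polygon with >= 3 vertices."""
    n = len(hull)
    if n < 3:
        return False
    return all(cross(hull[i], hull[(i + 1) % n], point) >= 0 for i in range(n))


def group_crossing_rate(robot_path, groups_per_step):
    """Fraction of timesteps at which the robot is inside any group hull."""
    if not robot_path:
        return 0.0
    inside = 0
    for pos, groups in zip(robot_path, groups_per_step):
        if any(inside_hull(pos, convex_hull(g)) for g in groups):
            inside += 1
    return inside / len(robot_path)


# One static group of four people standing on the corners of a square.
group = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
path = [(-1.0, 1.0), (1.0, 1.0), (3.0, 1.0), (1.0, 0.5)]
gcr = group_crossing_rate(path, [[group]] * len(path))
```

Because the metric accumulates over the whole trajectory, a robot that reaches its goal after briefly cutting through a group still gets a non-zero GCR, which a terminal success flag would not show.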

2501.04661 2026-06-01 cs.CL cs.AI

Beyond Memorization: Assessing Semantic Generalization in Large Language Models Using Phrasal Constructions

Wesley Scivetti, Melissa Torgbi, Austin Blodgett, Mollie Shichman, Taylor Hudson, Claire Bonial, Harish Tayyar Madabushi

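AI summary: Builds a diagnostic evaluation based on Construction Grammar to test the semantic understanding and generalization of large language models on phrasal constructions that are low-frequency yet easy for humans to understand, finding that model performance drops by more than 40% on constructions that are syntactically identical but semantically different.

Comments: Camera Ready: AACL-IJCNLP (2025)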

Abstract

The web scale of pretraining data has created an important evaluation challenge: to disentangle linguistic competence on cases well-represented in pretraining data from generalization to out-of-domain language, specifically the dynamic, real-world instances less common in pretraining data. To this end, we construct a diagnostic evaluation to systematically assess natural language understanding in LLMs by leveraging Construction Grammar (CxG). CxG provides a psycholinguistically grounded framework for testing generalization, as it explicitly links syntactic forms to abstract, non-lexical meanings. Our novel inference evaluation dataset consists of English phrasal constructions, for which speakers are known to be able to abstract over commonplace instantiations in order to understand and produce creative instantiations. Our evaluation dataset uses CxG to evaluate two central questions: first, whether models can 'understand' the semantics of sentences for instances that are likely to appear in pretraining data less often, but are intuitive and easy for people to understand; second, whether LLMs can deploy the appropriate constructional semantics given constructions that are syntactically identical but with divergent meanings. Our results demonstrate that state-of-the-art models, including GPT-o1, exhibit a performance drop of over 40% on our second task, revealing a failure to generalize over syntactically identical forms to arrive at distinct constructional meanings in the way humans do. We make our novel dataset and associated experimental data, including prompts and model responses, publicly available.

2508.08204 2026-06-01 cs.CL cs.AI

Human-Alignment and Calibration of Inference-Time Uncertainty in Large Language Models

Kyle Moore, Jesse Roberts, Daryl Watson

AI summary: Evaluates a range of inference-time uncertainty measures and finds that many align strongly with human group-level uncertainty despite not aligning with human answer preferences, with moderate to strong evidence of calibration in correctness correlation and distributional analysis.

Comments: We have discovered a critical error in the normalized entropy calculation that may have substantially inflated nearly all results herein. We have since fixed this error in a new work, but we believe that the new work is sufficiently dissimilar in focus, methods, dataset, and results as to be misleading if presented as a simple replacement. As such, we propose removal and retraction instead.

Abstract

There has been much recent interest in evaluating large language models for uncertainty calibration to facilitate model control and modulate user trust. Inference-time uncertainty, which may provide a real-time signal to the model or external control modules, is particularly important for applying these concepts to improve LLM-user experience in practice. While many of the existing papers consider model calibration, comparatively little work has sought to evaluate how closely model uncertainty aligns with human uncertainty. In this work, we evaluate a collection of inference-time uncertainty measures, using both established metrics and novel variations, to determine how closely they align with both human group-level uncertainty and traditional notions of model calibration. We find that numerous measures show evidence of strong alignment with human uncertainty, despite a lack of alignment with human answer preference. For those successful metrics, we find moderate to strong evidence of model calibration in terms of both correctness correlation and distributional analysis.

2508.02217 2026-06-01 cs.LG

Population-Free Pareto Tracking for Sample-Efficient Multi-Policy MORL

Zeyu Zhao, Yueling Che, Kaichen Liu, Jian Li, Junmei Yao

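AI summary: Proposes the MPFT framework, which uses a Pareto-tracking mechanism without a self-evolving population, initialized with single-objective extreme policies, to efficiently approximate the full Pareto front, substantially improving sample efficiency and reducing agent-environment interactions.

Comments: 37 pages, 10 figures, ICML26 accepted paper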

Abstract

Multi-objective reinforcement learning (MORL) is a fundamental framework for real-world decision-making problems involving multiple conflicting criteria. Existing multi-policy (MP) methods typically rely on online evolutionary frameworks that maintain large policy populations, leading to high sample complexity and excessive agent-environment interactions. To mitigate these limitations, we present Multi-policy Pareto Front Tracking (MPFT), a framework without a self-evolving population. It leverages an efficient Pareto-tracking mechanism initialized with single-objective extreme policies to trace the Pareto front, and further densifies sparse regions to achieve an accurate approximation of the full Pareto front. MPFT can be seamlessly integrated with advanced offline MORL algorithms, thereby substantially improving sample efficiency. We evaluate MPFT on six robotic control tasks with up to three objectives and three high-dimensional tasks with more than three objectives. Experimental results show that MPFT outperforms state-of-the-art baselines in terms of hypervolume and expected utility. It also significantly reduces agent-environment interactions. These results further demonstrate that MPFT serves as a general-purpose framework that can seamlessly integrate both online and offline MORL algorithms.
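For readers unfamiliar with the hypervolume metric mentioned above, a minimal two-objective sketch follows. It assumes maximization and a user-chosen reference point; the function names and the toy return vectors are hypothetical, and this is the generic metric, not MPFT's tracking mechanism.

```python
def dominates(a, b):
    """a dominates b: no worse in every objective, better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def hypervolume_2d(points, reference):
    """Area dominated by the front and bounded below by the reference point."""
    front = sorted(pareto_front(points), key=lambda p: p[0], reverse=True)
    area, prev_y = 0.0, reference[1]
    for x, y in front:
        # Sweeping from large x to small x, each front point adds a new strip.
        if x > reference[0] and y > prev_y:
            area += (x - reference[0]) * (y - prev_y)
            prev_y = y
    return area


returns = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0), (1.0, 1.0)]
front = pareto_front(returns)
volume = hypervolume_2d(returns, reference=(0.0, 0.0))
```

A denser, better-spread front covers more area, so filling in sparse regions of the front tends to raise this number.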

2507.16362 2026-06-01 cs.CV

LPTR-AFLNet: Lightweight Integrated Chinese License Plate Rectification and Recognition Network

Guangzhu Xu, Pengcheng Zuo, Zhi Ke, Bangjun Lei

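AI summary: Proposes LPTR-AFLNet, a lightweight unified network that combines a perspective transformation correction module with the optimized AFLNet recognition network, uses the recognition output as a weak supervisory signal to guide correction, and adds an improved attention module and Focal Loss, achieving efficient and accurate license plate rectification and recognition.

Comments: 28 pages, 33 figures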

Abstract

Chinese License Plate Recognition (CLPR) faces numerous challenges in unconstrained and complex environments, particularly due to perspective distortions caused by various shooting angles and the correction of single-line and double-line license plates. Considering the limited computational resources of edge devices, developing a low-complexity, end-to-end integrated network for both correction and recognition is essential for achieving real-time and efficient deployment. In this work, we propose a lightweight, unified network named LPTR-AFLNet for correcting and recognizing Chinese license plates, which combines a perspective transformation correction module (PTR) with an optimized license plate recognition network, AFLNet. The network leverages the recognition output as a weak supervisory signal to effectively guide the correction process, ensuring accurate perspective distortion correction. To enhance recognition accuracy, we introduce several improvements to LPRNet, including an improved attention module to reduce confusion among similar characters and the use of Focal Loss to address class imbalance during training. Experimental results demonstrate the exceptional performance of LPTR-AFLNet in rectifying perspective distortion and recognizing double-line license plate images, maintaining high recognition accuracy across various challenging scenarios. Moreover, on low- to mid-range GPU platforms, the method runs in less than 10 milliseconds, indicating its practical efficiency and broad applicability.
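Focal Loss, which the abstract uses against class imbalance, has a compact standard form: cross-entropy scaled by (1 - p)^gamma, where p is the probability assigned to the true class. The sketch below shows that generic formula for a single sample; the default `gamma` and `alpha` values are common choices assumed here, not settings reported for AFLNet.

```python
import math


def focal_loss(p_true, gamma=2.0, alpha=1.0, eps=1e-12):
    """Focal loss for one sample, given the probability of its true class."""
    p = min(max(p_true, eps), 1.0)
    # (1 - p) ** gamma shrinks the loss of already well-classified samples.
    return -alpha * (1.0 - p) ** gamma * math.log(p)


easy = focal_loss(0.9)             # confident, correct prediction
hard = focal_loss(0.1)             # confident, wrong prediction
plain_ce = focal_loss(0.9, gamma=0.0)  # gamma = 0 recovers cross-entropy
```

With `gamma=2`, the easy sample's loss is scaled by 0.01 relative to cross-entropy, so rare or confusable characters contribute relatively more to the gradient.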

2507.17335 2026-06-01 cs.CV cs.CL

TransLPRNet: Lite Vision-Language Network for Single/Dual-line Chinese License Plate Recognition

Guangzhu Xu, Zhi Ke, Pengcheng Zuo, Bangjun Lei

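AI summary: To handle the diversity of license plate types and the complex imaging conditions of open environments, proposes a unified solution that integrates a lightweight visual encoder with a text decoder, using a pre-training framework and a perspective correction network to recognize single- and double-line Chinese license plates with high accuracy.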

Abstract

License plate recognition in open environments is widely applicable across various domains; however, the diversity of license plate types and imaging conditions presents significant challenges. To address the limitations encountered by CNN and CRNN-based approaches in license plate recognition, this paper proposes a unified solution that integrates a lightweight visual encoder with a text decoder, within a pre-training framework tailored for single and double-line Chinese license plates. To mitigate the scarcity of double-line license plate datasets, we constructed a single/double-line license plate dataset by synthesizing images, applying texture mapping onto real scenes, and blending them with authentic license plate images. Furthermore, to enhance the system's recognition accuracy, we introduce a perspective correction network (PTN) that employs license plate corner coordinate regression as an implicit variable, supervised by license plate view classification information. This network offers improved stability, interpretability, and low annotation costs. The proposed algorithm achieves an average recognition accuracy of 99.34% on the corrected CCPD test set under coarse localization disturbance. When evaluated under fine localization disturbance, the accuracy further improves to 99.58%. On the double-line license plate test set, it achieves an average recognition accuracy of 98.70%, with processing speeds reaching up to 167 frames per second, indicating strong practical applicability.

2507.11075 2026-06-01 cs.CV cs.AI

Joint angle based learning to refine kinematic human pose estimation

Chang Peng, Yifei Zhou, Haoqiang Ren, Shiqing Huang, Chuangye Chen, Jianming Yang, Bao Yang, Huifeng Xi, Zhenyu Jiang

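AI summary: Proposes a joint-angle-based bidirectional recurrent network as a post-processing module, with reliable ground truth generated by a high-order Fourier series approximation, to refine single-image human pose estimates by correcting wrong keypoints and smoothing trajectories.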

Abstract

Marker-free human pose estimation (HPE) has found increasing applications in various fields. Current HPE suffers from occasional errors in keypoint recognition and random fluctuation in keypoint trajectories when analyzing kinematic human poses. The performance of existing deep learning-based models for HPE refinement is considerably limited by inaccurate training datasets in which the keypoints are manually annotated. This paper proposes a novel method to overcome this difficulty. Its key techniques are: (i) a robust joint angle-based description of kinematic human poses; (ii) approximation of the temporal variation of joint angles with a high-order Fourier series to obtain reliable "ground truth"; and (iii) a bidirectional recurrent network designed as a post-processing module to refine the estimates of single image-based HPE models. Trained with the high-quality dataset constructed using our method, the network demonstrates outstanding performance in correcting wrongly recognized joints and smoothing their spatiotemporal trajectories. Tests show that joint angle-based refinement (JAR) outperforms the state-of-the-art HPE refinement network in challenging cases like figure skating and breaking. JAR also demonstrates great potential to rectify existing datasets.
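Technique (ii) can be illustrated with a truncated Fourier series applied to a jittery joint-angle trace. This sketch assumes uniformly sampled data that is treated as one period of a periodic signal, and it computes coefficients by direct projection rather than by whatever fitting procedure the authors use; `fourier_smooth` and the synthetic trace are hypothetical.

```python
import math


def fourier_smooth(samples, order):
    """Project uniformly sampled, one-period data onto a truncated Fourier series."""
    n = len(samples)
    if order >= n / 2:
        raise ValueError("order must be below the Nyquist limit n / 2")
    mean = sum(samples) / n
    coeffs = []
    for k in range(1, order + 1):
        a = 2.0 / n * sum(y * math.cos(2 * math.pi * k * t / n) for t, y in enumerate(samples))
        b = 2.0 / n * sum(y * math.sin(2 * math.pi * k * t / n) for t, y in enumerate(samples))
        coeffs.append((a, b))
    smooth = []
    for t in range(n):
        value = mean
        for k, (a, b) in enumerate(coeffs, start=1):
            angle = 2 * math.pi * k * t / n
            value += a * math.cos(angle) + b * math.sin(angle)
        smooth.append(value)
    return smooth


n = 64
clean = [1.0 + 2.0 * math.sin(2 * math.pi * t / n) for t in range(n)]
# Synthetic high-frequency jitter standing in for keypoint fluctuation.
noisy = [c + 0.5 * math.cos(2 * math.pi * 10 * t / n) for t, c in enumerate(clean)]
recovered = fourier_smooth(noisy, order=3)
```

Keeping only the low harmonics removes the frame-to-frame jitter while preserving the slow joint motion; a higher order keeps faster motion at the cost of admitting more noise.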

2507.06161 2026-06-01 cs.CV

Sinkhorn Normalization of Diffusion Kernels

Nathan Kessler, Robin Magnet, Jean Feydy

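AI summary: Proposes a symmetric variant of the Sinkhorn algorithm that normalizes general similarity matrices into diffusion-like operators inheriting the desirable properties of Laplacians, for smoothing irregular data (such as point clouds, sparse voxel grids, and mixtures of Gaussians) while retaining spectral information for shape analysis and matching.

Comments: 33 pages, 25 figures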

Abstract

Smoothing a signal based on local neighborhoods is a core operation in machine learning and geometry processing. On well-structured domains such as vector spaces and manifolds, the Laplace operator derived from differential geometry offers a principled approach to smoothing via heat diffusion, with strong theoretical guarantees. However, constructing such Laplacians requires a carefully defined domain structure, which is not always available. Most practitioners thus rely on simple convolution kernels and message-passing layers, which are biased against the boundaries of the domain. We bridge this gap by introducing a broad class of smoothing operators, derived from general similarity or adjacency matrices, and demonstrate that they can be normalized into diffusion-like operators that inherit desirable properties from Laplacians. Our approach relies on a symmetric variant of the Sinkhorn algorithm, which rescales positive smoothing operators to match the structural behavior of heat diffusion. This construction enables Laplacian-like smoothing and processing of irregular data such as point clouds, sparse voxel grids, or mixtures of Gaussians. We show that the resulting operators not only approximate heat diffusion but also retain spectral information from the Laplacian itself, with applications to shape analysis and matching.
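A small sketch of symmetric Sinkhorn scaling may help: given a symmetric kernel with positive entries, it looks for one positive scaling vector d such that diag(d) K diag(d) is symmetric with unit row sums. The averaged update used below is a commonly used fixed-point iteration for the symmetric case; the function name, the iteration count, and the three-point Gaussian kernel are assumptions for illustration and are not taken from the paper.

```python
import math


def symmetric_sinkhorn(kernel, iterations=100):
    """Find d > 0 so that diag(d) @ kernel @ diag(d) has unit row sums."""
    n = len(kernel)
    d = [1.0] * n
    for _ in range(iterations):
        kd = [sum(kernel[i][j] * d[j] for j in range(n)) for i in range(n)]
        # Averaged fixed-point step: geometric mean of d and 1 / (K d).
        d = [math.sqrt(d[i] / kd[i]) for i in range(n)]
    return [[d[i] * kernel[i][j] * d[j] for j in range(n)] for i in range(n)]


# Gaussian similarity between three irregularly spaced 1D points.
points = [0.0, 1.0, 3.0]
kernel = [[math.exp(-0.5 * (x - y) ** 2) for y in points] for x in points]
diffusion = symmetric_sinkhorn(kernel)
smoothed_constant = [sum(row) * 5.0 for row in diffusion]
```

Unit row sums mean a constant signal is left unchanged, and symmetry means total mass is preserved; a plain row normalization of the kernel would give the first property but not the second.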

2507.05488 2026-06-01 cs.AI cs.CY

OLG++: A Semantic Extension of Obligation Logic Graph

Subhasis Dasgupta, Jon Stephens, Amarnath Gupta

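AI summary: Proposes OLG++, which extends the Obligation Logic Graph with node and edge types for space, time, party groups, defeasibility, and logical grouping to model municipal and interjurisdictional regulatory rules, and demonstrates its use for legal question answering through examples from food-business regulations.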

Abstract

We present OLG++, a semantic extension of the Obligation Logic Graph (OLG) for modeling regulatory and legal rules in municipal and interjurisdictional contexts. OLG++ introduces richer node and edge types, including spatial, temporal, party group, defeasibility, and logical grouping constructs, enabling nuanced representations of legal obligations, exceptions, and hierarchies. The model supports structured representation of rules with contextual conditions, precedence, and complex triggers. We demonstrate its use through examples from food-business regulations, showing how OLG++ supports legal question answering using property-graph queries. We also discuss how OLG++ can complement LegalRuleML by providing graph-native constructs for subclass relations, spatial constraints, and reified exception structures. The worked examples and first-pass coverage analysis show that, on the dimensions studied, OLG++ is more expressive than the baseline OLG model for municipal regulatory representation.

2506.23768 2026-06-01 cs.RO

Motion Tracking with Muscles: Predictive Control of a Parametric Musculoskeletal Canine Model

Vittorio La Barbera, Steven Bohez, Leonard Hasenclever, Yuval Tassa, John R. Hutchinson

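AI summary: Proposes a canine musculoskeletal model generated procedurally from accurate 3D muscle meshes, combined with an improved muscle dynamics model and a motion capture task, and validated by comparing simulated muscle activation patterns with experimental EMG data, aiming to bridge biomechanics, robotics, and computational neuroscience.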

2502.12851 2026-06-01 cs.CL cs.AI

MeMo: Towards Language Models with Associative Memory Mechanisms

Fabio Massimo Zanzotto, Elena Sofia Ruzzetti, Giancarlo A. Xompero, Leonardo Ranaldi, Davide Venditti, Federico Ranaldi, Cristina Giannone, Andrea Favalli, Raniero Romagnoli

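AI summary: Proposes the MeMo architecture, which memorizes text directly in layered associative memories, enabling transparency and model editing; experiments demonstrate the memorization capacity of single-layer and multi-layer configurations.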

2506.14842 2026-06-01 cs.CV cs.AI

PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers

Lukas Schiesser, Cornelius Wolff, Sophie Haas, Simon Pukrop

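AI summary: Presents PictSure, a vision-only in-context learning model family, and finds that the quality of the pretrained embeddings is the key bottleneck for downstream performance, whereas the diversity of the fusion layer's training data has limited impact.

Comments: 10 pages, 2 figures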

Abstract

Building image classification models remains cumbersome in data-scarce domains, where collecting large labeled datasets is impractical. In-context learning (ICL) is a promising paradigm for few-shot image classification (FSIC), but prior work has underexplored the relative importance of encoder pretraining versus fusion-layer training data. We present PictSure, a vision-only ICL family of models that demonstrates the potential of easy-to-use fusion transformer architectures, as well as the need for better embedding representations across a wider range of image domains. In both in-domain and out-of-domain evaluations, we find that representation quality induced by pretraining strongly correlates with downstream ICL performance. Crucially, varying the training dataset for the fusion transformer, from ImageNet alone to diverse multi-domain mixtures, provides limited additional performance gains under the evaluated settings, demonstrating that the fusion layer appears capable of adapting effectively once embeddings are sufficiently structured. These results show that the bottleneck in visual ICL is representation quality, not fusion-module training diversity. To facilitate adoption and reproducibility, we release all model weights as open-source artifacts and provide an MCP server that exposes PictSure as a callable tool for LLM-based agentic systems, enabling few-shot image classification to be invoked directly within AI pipelines without integration overhead. Code can be found at https://github.com/PictSure and models at https://huggingface.co/pictsure.

2505.22934 2026-06-01 cs.CL cs.AI cs.LG

Unraveling LoRA Interference: Orthogonal Subspaces for Robust Model Merging

解开LoRA干扰：用于鲁棒模型合并的正交子空间

Haobo Zhang, Jiayu Zhou

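AI summary: To address the performance drop that occurs when merging LoRA fine-tuned models, the paper proposes OSRM, which constrains the LoRA subspaces to be orthogonal before fine-tuning to reduce cross-task interference. It integrates seamlessly with existing merging algorithms, improving merged performance while preserving single-task accuracy.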

Comments 14 pages, 5 figures, 16 tables, accepted by ACL 2025

AI中文摘要

针对单个任务微调大型语言模型（LM）虽然性能强劲，但部署和存储成本高昂。近期研究探索模型合并，将多个任务特定模型组合成单个多任务模型，无需额外训练。然而，现有合并方法对于使用低秩适应（LoRA）微调的模型往往失败，导致性能显著下降。本文表明，这一问题源于模型参数与数据分布之间先前被忽视的相互作用。我们提出用于鲁棒模型合并的正交子空间（OSRM），在微调*之前*约束LoRA子空间，确保与一个任务相关的更新不会对其他任务的输出产生不利偏移。我们的方法可以无缝集成到大多数现有合并算法中，减少任务间的意外干扰。在八个数据集上使用三种广泛使用的LM和两种大型LM进行的广泛实验表明，我们的方法不仅提升了合并性能，还保持了单任务准确率。此外，我们的方法对合并的超参数表现出更强的鲁棒性。这些结果突显了数据-参数交互在模型合并中的重要性，并为合并LoRA模型提供了一种即插即用的解决方案。

英文摘要

Fine-tuning large language models (LMs) for individual tasks yields strong performance but is expensive for deployment and storage. Recent works explore model merging to combine multiple task-specific models into a single multi-task model without additional training. However, existing merging methods often fail for models fine-tuned with low-rank adaptation (LoRA), resulting in significant performance degradation. In this paper, we show that this issue arises from a previously overlooked interplay between model parameters and data distributions. We propose Orthogonal Subspaces for Robust model Merging (OSRM) to constrain the LoRA subspace *prior* to fine-tuning, ensuring that updates relevant to one task do not adversely shift outputs for others. Our approach can seamlessly integrate with most existing merging algorithms, reducing the unintended interference among tasks. Extensive experiments on eight datasets, tested with three widely used LMs and two large LMs, demonstrate that our method not only boosts merging performance but also preserves single-task accuracy. Furthermore, our approach exhibits greater robustness to the hyperparameters of merging. These results highlight the importance of data-parameter interaction in model merging and offer a plug-and-play solution for merging LoRA models.

2505.20840 2026-06-01 cs.LG

Aggregation Buffer: Revisiting DropEdge with a New Parameter Block

聚合缓冲区：用新参数块重新审视 DropEdge

Dooho Lee, Myeong Kong, Sagad Hamid, Cheonwoo Lee, Jaemin Yoo

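AI summary: To address DropEdge's limited gains in supervised learning, the paper proposes Aggregation Buffer, a parameter block that improves GNN robustness to raise performance and offers a unified remedy for problems such as degree bias and structural disparity.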

Comments Published at ICML 2025

AI中文摘要

我们重新审视了 DropEdge，这是一种用于 GNN 的数据增强技术，通过在训练过程中随机移除边来暴露多样化的图结构。虽然这是一种有效减少对图中特定连接过拟合的有前途的方法，但我们观察到其在监督学习任务中的潜在性能提升非常有限。为了理解原因，我们提供了理论分析，表明 DropEdge 的有限性能来自于许多 GNN 架构中存在的根本性限制。基于此分析，我们提出了 Aggregation Buffer，这是一个专门设计的参数块，通过解决 DropEdge 的限制来提高 GNN 的鲁棒性。我们的方法与任何 GNN 模型兼容，并在多个数据集上展示了一致的性能提升。此外，我们的方法作为统一解决方案，有效解决了度偏差或结构差异等众所周知的问题。代码和数据集可在 https://github.com/dooho00/agg-buffer 获取。

英文摘要

We revisit DropEdge, a data augmentation technique for GNNs which randomly removes edges to expose diverse graph structures during training. While being a promising approach to effectively reduce overfitting on specific connections in the graph, we observe that its potential performance gain in supervised learning tasks is significantly limited. To understand why, we provide a theoretical analysis showing that the limited performance of DropEdge comes from the fundamental limitation that exists in many GNN architectures. Based on this analysis, we propose Aggregation Buffer, a parameter block specifically designed to improve the robustness of GNNs by addressing the limitation of DropEdge. Our method is compatible with any GNN model, and shows consistent performance improvements on multiple datasets. Moreover, our method effectively addresses well-known problems such as degree bias or structural disparity as a unifying solution. Code and datasets are available at https://github.com/dooho00/agg-buffer.
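The edge-removal step that DropEdge applies during training can be illustrated with a minimal sketch. This is only the baseline augmentation described in the abstract, not the Aggregation Buffer block; the function name `drop_edge`, the edge-list representation, and the `drop_prob` parameter are assumptions made for illustration.

```python
import random

def drop_edge(edges, drop_prob, rng=None):
    """Return a copy of `edges` in which each edge is removed
    independently with probability `drop_prob` (hypothetical helper)."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if rng is None:
        rng = random.Random()
    # Keep an edge when the uniform draw falls at or above the drop probability.
    return [e for e in edges if rng.random() >= drop_prob]

# Toy graph as a list of (source, target) pairs.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
rng = random.Random(0)

# A fresh random subgraph is sampled at every training step,
# while the full edge list would be used at evaluation time.
for step in range(3):
    kept = drop_edge(edges, drop_prob=0.4, rng=rng)
    print(step, kept)
```

For undirected graphs stored as two directed edges, both directions would normally be dropped together; the sketch leaves that out for brevity.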

2505.20795 2026-06-01 cs.RO

Learning Generalizable Robot Policy with Human Demonstration Video as a Prompt

以人类演示视频为提示学习可泛化的机器人策略

Xiang Zhu, Yichen Liu, Hezhong Li, Jianyu Chen

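AI summary: Proposes a two-stage framework that learns a generalizable robot policy from human demonstration videos and can perform new tasks without teleoperation data or fine-tuning.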

Comments Accepted to the IEEE International Conference on Robotics and Automation (ICRA), 2026

AI中文摘要

最近的机器人学习方法通常依赖于通过遥操作收集的大规模机器人数据集的模仿学习。面对新任务时，这些方法通常需要收集一组新的遥操作数据并微调策略。此外，遥操作数据收集流程也繁琐且昂贵。相反，人类能够通过观察他人操作高效学习新任务。在本文中，我们介绍了一种新颖的两阶段框架，利用人类演示学习可泛化的机器人策略。该策略可以直接以人类演示视频为提示，执行新任务，无需任何新的遥操作数据和模型微调。在第一阶段，我们训练视频生成模型，通过交叉预测捕获人类和机器人演示视频数据的联合表示。在第二阶段，我们使用新颖的原型对比损失将学习到的表示与人类和机器人之间的共享动作空间融合。在真实世界灵巧操作任务上的实证评估显示了所提出方法的有效性和泛化能力。

英文摘要

Recent robot learning methods commonly rely on imitation learning from massive robotic datasets collected with teleoperation. When facing a new task, such methods generally require collecting a new set of teleoperation data and finetuning the policy. Furthermore, the teleoperation data collection pipeline is tedious and expensive. In contrast, humans can efficiently learn new tasks just by watching others perform them. In this paper, we introduce a novel two-stage framework that utilizes human demonstrations to learn a generalizable robot policy. Such a policy can directly take a human demonstration video as a prompt and perform new tasks without any new teleoperation data or model finetuning. In the first stage, we train a video generation model that captures a joint representation for both the human and robot demonstration video data using cross-prediction. In the second stage, we fuse the learned representation with a shared action space between human and robot using a novel prototypical contrastive loss. Empirical evaluations on real-world dexterous manipulation tasks show the effectiveness and generalization capabilities of our proposed method.

2503.14190 2026-06-01 cs.AI

Inferring Events from Time Series using Language Models

利用语言模型从时间序列中推断事件

Mingtian Tan, Mike A. Merrill, Zack Gottesman, Tim Althoff, David Evans, Tom Hartvigsen

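AI summary: Studies whether large language models can infer natural-language events from time series data, proposes an automated task-generation method and a new benchmark, and improves small-model performance through distillation and reinforcement learning.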

Comments 21 pages, 15 Figures

2504.11972 2026-06-01 cs.CL

Reassessing Extractive QA Datasets at Scale: LLM-as-a-Judge and In-Depth Analyses

重新评估大规模抽取式问答数据集：LLM作为评判者与深入分析

Xanh Ho, Jiahao Huang, Florian Boudin, Akiko Aizawa

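AI summary: Systematically uses LLM-as-a-judge to evaluate four extractive QA datasets, finds that its correlation with human judgments is far higher than that of EM and F1, and analyzes factors such as answer-type sensitivity, prompt variations, and self-preference bias.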

Comments GEM Workshop at ACL 2026; code and data are available at https://github.com/Alab-NII/llm-judge-extract-qa

AI中文摘要

抽取式问答任务通常使用精确匹配（EM）和F1分数进行评估，但这些指标往往无法反映模型的真实性能。最近的研究提出使用大型语言模型（LLM）作为评判者（LLM-as-a-judge），然而它们通常缺乏跨数据集的全面评估，并忽略了关键因素，如对答案类型的敏感性、提示变化和自偏好偏差。在这项工作中，我们跨四个抽取式问答数据集和各种提示变化对LLM作为评判者进行了系统研究，评估了多个LLM系列在回答和评判角色中的表现。我们的结果表明，LLM作为评判者的判断与人类评价的相关性远高于EM（0.22）和F1（0.40），使用开源模型时相关性最高可达0.85。进一步分析显示，LLM作为评判者在数字相关答案上表现特别好，但在更复杂的类型（如职位名称）上面临挑战。与其他NLP任务的发现相反，我们未观察到自偏好偏差，即使同一模型同时作为问答模型和评判者。最后，我们发现提示措辞的影响很小，零样本、无上下文的评判通常能带来最佳的评估性能。

英文摘要

Extractive QA tasks are commonly evaluated using Exact Match (EM) and F1-score, but these metrics often fail to reflect true model performance. Recent studies have proposed using large language models (LLMs) as judges (LLM-as-a-judge), yet they often lack comprehensive evaluation across datasets and overlook key factors such as sensitivity to answer types, prompt variations, and self-preference bias. In this work, we conduct a systematic study of LLM-as-a-judge across four extractive QA datasets and various prompt variations, assessing multiple LLM families in both answering and judging roles. Our results show that LLM-as-a-judge judgments correlate much more strongly with human evaluations than EM (0.22) and F1 (0.40), achieving correlations up to 0.85 with open-source models. Further analysis reveals that LLM-as-a-judge performs particularly well on number-related answers but faces challenges with more complex types, such as job titles. Contrary to findings in other NLP tasks, we observe no self-preference bias, even when the same model serves as both QA model and judge. Finally, we find that prompt phrasing has minimal impact, and zero-shot, context-free judging often yields the best evaluation performance.
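To make the two baseline metrics concrete, here is a minimal sketch of Exact Match and token-level F1 with SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles). The normalization rules and function names are assumptions for illustration; the paper's evaluation scripts may differ in detail.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

prediction = "the 44th President of the United States"
reference = "President of the United States"
print(exact_match(prediction, reference))  # 0.0
print(round(token_f1(prediction, reference), 3))  # 0.889
```

The example shows the kind of case where string-overlap metrics can penalize an answer that a human might accept: one extra correct token drops EM to 0 and F1 below 1.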

2409.14583 2026-06-01 cs.AI

LLM Bias Evaluation: Gender, Racial, and Age Disparities in Occupational and Crime Scenarios

LLM偏差评估：职业与犯罪场景中的性别、种族和年龄差异

Vishal Mirza, Rahul Kulkarni, Aakanksha Jadhav

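AI summary: Evaluates gender, racial, and age bias in occupational and crime scenarios across four leading LLMs from 2024, finding that debiasing efforts often create new fairness trade-offs, the "debiasing paradox".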

Comments Updated title and abstract to emphasize key findings on the debiasing paradox for improved discoverability. Content and findings unchanged. 11 pages, 17 figures, Accepted at IEEE Conference on Artificial Intelligence (IEEE CAI) 2025. Full Paper acceptance in the Vertical HUMAN-CENTERED AI category

DOI: 10.1109/CAI64502.2025.00045
Journal ref: 2025 IEEE Conference on Artificial Intelligence (CAI)

AI中文摘要

LLM偏差评估至关重要，因为大型语言模型（LLM）越来越多地影响高风险决策。本文对领先LLM中的性别、种族和年龄差异进行了全面评估，揭示出去偏努力常常创造新的公平性权衡。近年来LLM的进展显著，但由于各种限制，企业广泛采用仍然有限。本文考察了LLM中的偏差——这是一个影响其可用性、可靠性和公平性的关键问题。我们的研究评估了2024年发布的四个领先LLM（Gemini 1.5 Pro、Llama 3 70B、Claude 3 Opus和GPT-4o）在职业场景中的性别偏差以及犯罪场景中的性别、年龄和种族偏差。结果显示，LLM在各种职业中描绘女性角色的频率往往高于男性，与美国劳工统计局数据相比偏差达37%。在犯罪场景中，与美国联邦调查局数据的偏差在性别上为54%，种族上为28%，年龄上为17%。关键的是，我们观察到减少性别和种族偏差的努力常常导致过度偏向某一子类的结果，可能加剧差异——这种“去偏悖论”凸显了当前偏差缓解技术的局限性，并强调了更有效方法的必要性。

英文摘要

LLM bias evaluation is critical as large language models (LLMs) increasingly influence high-stakes decisions. This paper provides a comprehensive assessment of gender, racial, and age disparities in leading LLMs, revealing that debiasing efforts often create new fairness trade-offs. Recent advancements in LLMs have been notable, yet widespread enterprise adoption remains limited due to various constraints. This paper examines bias in LLMs - a crucial issue affecting their usability, reliability, and fairness. Our study evaluates gender bias in occupational scenarios and gender, age, and racial bias in crime scenarios across four leading LLMs released in 2024: Gemini 1.5 Pro, Llama 3 70B, Claude 3 Opus, and GPT-4o. Findings reveal that LLMs often depict female characters more frequently than male ones in various occupations, showing a 37% deviation from US BLS data. In crime scenarios, deviations from US FBI data are 54% for gender, 28% for race, and 17% for age. Critically, we observe that efforts to reduce gender and racial bias often lead to outcomes that may over-index one sub-class, potentially exacerbating disparities - a "debiasing paradox" that highlights the limitations of current bias mitigation techniques and underscores the need for more effective approaches.

2501.01926 2026-06-01 cs.CV cs.AI

Cross-Modal Attention Calibration for LVLM Hallucination Mitigation

跨模态注意力校准用于LVLM幻觉缓解

Jiaming Li, Jiacheng Zhang, Zequn Jie, Lin Ma, Guanbin Li

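AI summary: Proposes a training-free cross-modal attention calibration method that mitigates hallucination in large vision-language models through an inter-modality decoding module and a position calibration module.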

Comments CVPR2026

AI中文摘要

大型视觉语言模型（LVLM）在视觉-语言理解方面表现出显著能力。尽管取得了成功，LVLM在复杂生成任务中仍然会产生幻觉，导致视觉输入与生成内容不一致。为了解决这个问题，一些方法引入了推理时干预，如对比解码，以减少对语言先验的过度依赖。然而，这些方法忽略了由位置偏差和虚假跨模态相关性引起的幻觉。在本文中，我们提出了一种跨模态注意力校准（CMAC）方法，以无需训练的方式缓解LVLM中的幻觉。在该方法中，我们设计了一个模态间解码（IMD）模块，通过一种新颖的对比解码机制来减轻幻觉。IMD将具有显著跨模态注意力权重的值向量掩蔽为失真，从而同时解决了单模态过度依赖和误导性跨模态相关性问题。此外，跨模态位置校准（CMPC）模块缩小了图像标记的位置差距，缓解了跨模态注意力中的位置偏差。在多种幻觉基准上的实验结果验证了我们的方法在减少LVLM幻觉方面优于现有最先进技术。我们的代码将在https://github.com/lijm48/IMCCD上提供。

英文摘要

Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding. Despite their success, LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. To address this issue, some approaches have introduced inference-time interventions, such as contrastive decoding, to reduce overreliance on language priors. However, these approaches overlook hallucinations stemming from position bias and spurious inter-modality correlations. In this paper, we propose a Cross-Modal Attention Calibration (CMAC) method to mitigate hallucinations in LVLMs in a training-free manner. In this method, we design an Inter-Modality Decoding (IMD) module to alleviate hallucination by a novel contrastive decoding mechanism. IMD masks the value vectors associated with significant cross-modal attention weights as distortion, which addresses both uni-modality overreliance and misleading inter-modality correlations. Additionally, a Cross-Modal Position Calibration (CMPC) module shrinks the position gap of image tokens, alleviating the position bias in cross-modal attention. Experimental results on diverse hallucination benchmarks validate the superiority of our method over existing state-of-the-art techniques in reducing hallucinations for LVLM. Our code will be available at https://github.com/lijm48/IMCCD.
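The general idea of contrastive decoding mentioned above can be sketched as follows. This is the generic form used in earlier inference-time interventions, not the paper's IMD module: next-token logits from a clean forward pass are contrasted with logits from a distorted pass, so that tokens whose score does not depend on the distorted signal are down-weighted. The function names, the weight `alpha`, and the toy logits are all assumptions for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_logits(clean, distorted, alpha=1.0):
    """Amplify what the clean pass supports and the distorted pass does not:
    (1 + alpha) * clean - alpha * distorted, elementwise."""
    return [(1 + alpha) * c - alpha * d for c, d in zip(clean, distorted)]

# Toy vocabulary of three tokens.
# Token 0 loses support when the input is distorted (grounded in the image);
# token 1 is unaffected by the distortion (driven by a language prior).
clean = [2.0, 1.5, 0.1]
distorted = [0.5, 1.5, 0.1]

adjusted = contrastive_logits(clean, distorted, alpha=1.0)
print([round(p, 3) for p in softmax(clean)])
print([round(p, 3) for p in softmax(adjusted)])
```

In this toy case the probability of the grounded token rises after the contrast, while the prior-driven token loses mass.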

2502.15224 2026-06-01 cs.LG cs.AI

Auto-Discovery-Bench: Diagnosing Structured State Tracking in Oracle-Guided Discovery

自动发现基准：在Oracle引导发现中诊断结构化状态追踪

Tingting Chen, Beibei Lin, Srinivas Anumasa, Vedant Shah, Zifeng Yuan, Qiran Zou, Anirudh Goyal, Dianbo Liu

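AI summary: Proposes the Auto-Discovery-Bench benchmark, which uses deterministic oracle-guided hypothesis-intervention-feedback cycles to diagnose agents' bottlenecks in structured state tracking.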

Comments 13 pages

AI中文摘要

交互式发现要求智能体在多轮反馈中维护和更新结构化信念。在评估智能体于嘈杂、开放的科学环境中的表现之前，有必要在受控条件下隔离这一先决能力。我们引入了Auto-Discovery-Bench，一个确定性的Oracle引导诊断基准，其中智能体通过重复的假设-干预-反馈循环恢复隐藏结构。该基准实例化了三种受控发现抽象：有向图发现、无向关系发现和符号方程发现。在所有模型中，性能随着变量数量、轨迹长度和干扰项的增加而下降。一个独立的轨迹追踪诊断表明，即使移除了干预选择和假设生成，许多失败仍然存在，这表明在维护和整合长程结构化信息方面的限制是Oracle引导发现的重要瓶颈。Auto-Discovery-Bench并非旨在取代真实的发现环境；相反，它提供了一个可重复、低混淆的诊断测试平台，用于隔离交互式科学智能体的先决能力。

英文摘要

Interactive discovery requires agents to maintain and update structured beliefs over many rounds of feedback. Before evaluating agents in noisy, open-ended scientific environments, it is useful to isolate this prerequisite capability under controlled conditions. We introduce Auto-Discovery-Bench, a deterministic oracle-guided diagnostic benchmark in which agents recover hidden structures through repeated hypothesis-intervention-feedback cycles. The benchmark instantiates three controlled discovery abstractions: directed graph discovery, undirected relational discovery, and symbolic equation discovery. Across models, performance degrades as the number of variables, trajectory length, and distractors increase. A separate trajectory-tracking diagnostic shows that many failures persist even when intervention selection and hypothesis generation are removed, suggesting that limitations in maintaining and integrating long-range structured information are an important bottleneck for oracle-guided discovery. Auto-Discovery-Bench is not intended to replace realistic discovery environments; rather, it provides a reproducible, low-confound diagnostic testbed for isolating a prerequisite capability for interactive scientific agents.

2502.04671 2026-06-01 cs.AI cs.LG cs.LO cs.PL

ProofWala: A Framework for Multilingual Proof Data Synthesis and Theorem-Proving

ProofWala: 多语言证明数据合成与定理证明框架

Amitayush Thakur, George Tsoukalas, Greg Durrett, Swarat Chaudhuri

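AI summary: Proposes the ProofWala framework, which uses the itp-interface library for programmatic interaction with interactive theorem provers, supports multilingual proof data synthesis and parallel proof search, and validates cross-lingual and cross-domain transfer.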

AI中文摘要

神经定理证明方法需要强大的基础设施来与交互式定理证明器（ITP）交互、提取结构化证明数据以及大规模执行证明搜索。然而，现有工具通常针对特定助手且面向文件级执行，使得仓库级分析和并行实验变得困难。我们提出ProofWala，一个多语言证明工程框架，基于itp-interface构建，这是一个用于与ITP进行程序化交互的可重用库。对于Lean 4，我们实现了一个在阐释器内部执行的元编程交互层，支持语义上忠实的策略级跟踪，以及跨整个仓库的声明和依赖级提取。该设计超越了传统的REPL式交互，支持项目范围的分析、环境克隆和证明状态的池化执行。相同的接口抽象支持多个版本的Rocq，形成统一的跨助手流水线。基于此基础设施，ProofWala提供标准化的多语言证明数据集、模型训练工具和并行证明搜索算法。使用该框架，我们展示了跨Lean和Rocq的多语言训练能够实现跨语言和跨领域迁移。我们在Lean Mathlib和领域适应（CategoryTheory）上观察到统计显著的改进，而其他设置也呈现一致的增长趋势。我们在两个仓库中开源了完整框架、并行证明搜索模块、数据集和模型：ProofWala (https://github.com/trishullab/proof-wala) 和 itp-interface 库 (https://github.com/trishullab/itp-interface)。

英文摘要

Neural approaches to theorem proving require robust infrastructure for interfacing with interactive theorem provers (ITPs), extracting structured proof data, and executing proof search at scale. However, existing tooling is often assistant-specific and oriented toward file-level execution, making repository-scale analysis and parallel experimentation challenging. We present ProofWala, a multilingual proof engineering framework built around itp-interface, a reusable library for programmatic interaction with ITPs. For Lean 4, we implement a meta-programmed interaction layer executing inside the elaborator, enabling semantically faithful tactic-level tracing alongside declaration- and dependency-level extraction across entire repositories. This design extends beyond traditional REPL-style interaction by supporting project-wide analysis, environment cloning, and pooled execution of proof states. The same interface abstraction supports multiple versions of Rocq, yielding a unified cross-assistant pipeline. Built on this infrastructure, ProofWala provides standardized multilingual proof datasets, model training utilities, and parallel proof search algorithms. Using the framework, we demonstrate that multilingual training across Lean and Rocq enables cross-lingual and cross-domain transfer. We observe statistically significant improvements on Lean Mathlib and in domain adaptation (CategoryTheory), while other settings exhibit consistent upward trends. We open-source the full framework, parallel proof search module, datasets, and models across two repositories: ProofWala (https://github.com/trishullab/proof-wala) and the itp-interface library (https://github.com/trishullab/itp-interface).

2502.04554 2026-06-01 cs.AI

Unifying and Optimizing Data Values for Selection via Sequential Decision-Making

通过序列决策统一和优化数据选择的数据价值

Hongliang Chi, Qiong Wu, Zhengyi Zhou, Jonathan Light, Emily Dodwell, Yao Ma

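AI summary: Recasts data selection as a sequential decision-making problem whose optimal selection sequence is given by dynamic programming, reinterprets existing methods such as Data Shapley as myopic linear approximations, and proposes an efficient bipartite graph-based surrogate that substantially outperforms existing methods on classical ML and large-scale LLM fine-tuning data selection.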

AI中文摘要

数据选择已成为数据价值的一个关键下游应用，然而在数据价值用于选择的理论基础方面仍未被充分探索。我们将数据选择重新表述为一个序列决策问题，其中最优选择序列由动态规划产生，而数据价值可以被理解为该最优序列的编码。这一框架通过近似动态规划的视角统一并重新解释了现有方法（如Data Shapley），揭示它们是对序列问题的近视线性近似。我们进一步分析了在子模性下选择最优性如何随效用曲率下降，解释了这些近似何时以及为何失败。为了弥合理论与实践，我们提出了一种基于二分图的高效替代方法，该方法在保持子模结构的同时，实现了具有可证明保证的可扩展贪心选择。在经典机器学习基准和大规模LLM微调数据选择上的实验表明，该方法显著优于现有方法。代码公开于https://github.com/frankhlchi/SeqDataVal。

英文摘要

Data selection has emerged as a crucial downstream application of data valuation, yet the theoretical foundations for using data values in selection remain underexplored. We reformulate data selection as a sequential decision-making problem where the optimal selection sequence arises from dynamic programming, and data values can be understood as encodings of this optimal sequence. This framework unifies and reinterprets existing methods like Data Shapley through the lens of approximate dynamic programming, revealing them as myopic linear approximations to the sequential problem. We further analyze how selection optimality degrades with utility curvature under submodularity, explaining when and why these approximations fail. To bridge theory and practice, we propose an efficient bipartite graph-based surrogate that preserves submodular structure while enabling scalable greedy selection with provable guarantees. Experiments on classical ML benchmarks and large-scale LLM fine-tuning data selection demonstrate substantial improvements over existing methods. Code is publicly available at https://github.com/frankhlchi/SeqDataVal
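As a rough illustration of greedy selection under a submodular utility defined on a bipartite graph, the sketch below uses weighted coverage: each training point is linked to the validation points it "covers", and the utility of a subset is the total weight of covered validation points. This is a textbook monotone submodular function chosen for illustration, not the surrogate from the paper; the graph, weights, and function names are hypothetical.

```python
def coverage(selected, neighbors, weights):
    """Utility of a subset: total weight of validation nodes it covers."""
    covered = set()
    for i in selected:
        covered |= neighbors[i]
    return sum(weights[v] for v in covered)

def greedy_select(neighbors, weights, budget):
    """Pick up to `budget` training points, each time adding the one
    with the largest marginal gain in covered weight."""
    selected, covered = [], set()
    candidates = set(neighbors)
    for _ in range(min(budget, len(candidates))):
        best, best_gain = None, 0.0
        for i in sorted(candidates):  # sorted for deterministic tie-breaking
            gain = sum(weights[v] for v in neighbors[i] - covered)
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # no remaining candidate adds value
            break
        selected.append(best)
        covered |= neighbors[best]
        candidates.remove(best)
    return selected

# Hypothetical bipartite graph: training point -> validation points it covers.
neighbors = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5}, "d": {1}}
weights = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0}

chosen = greedy_select(neighbors, weights, budget=2)
print(chosen, coverage(chosen, neighbors, weights))  # ['a', 'c'] 5.0
```

For monotone submodular utilities under a cardinality budget, this kind of greedy procedure carries the classical (1 - 1/e) approximation guarantee, which is one reason preserving submodular structure is attractive.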

2407.16167 2026-06-01 cs.RO cs.SY eess.SY

Consideration of Vehicle Characteristics on the Motion Planner Algorithm

运动规划算法中车辆特性的考虑

Syed Adil Ahmed, Taehyun Shim

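AI summary: Existing trajectory planners ignore the effect of CG height, so their trajectories are suboptimal across vehicle types, especially high-CG vehicles. The paper proposes a planner that uses a simplified double-track model, steady-state estimates of lateral and roll load transfer, and a simplified tire model to reduce solver workload, and compares it with particle-model and kinematic-model planners under high and low acceleration and different vehicle heights.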

Comments This paper has been accepted for conference proceedings in MECC 2024, Chicago under a Creative Commons License CC-BY-NC-ND

DOI: 10.1016/j.ifacol.2025.01.086
Journal ref: IFAC-PapersOnLine, Vol 58, Num 28, 2024, pgs 444-449

AI中文摘要

自主车辆控制通常分为两个主要领域：轨迹规划和轨迹跟踪。目前，轨迹规划大多通过粒子或基于运动学模型的优化控制器完成。由于这些规划器不考虑质心高度及其影响，其输出对于不同车辆类型（尤其是高质心车辆）并非唯一。因此，跟踪控制器在尝试实现这些次优轨迹时，可能需要付出较大努力以避免车辆操纵性和舒适性约束。本文尝试通过考虑一种采用简化双轨模型的规划器来解决该问题，该模型利用稳态方程估计侧向和侧倾载荷转移，并采用简化轮胎模型以降低求解器负担。将所开发的规划器与广泛使用的粒子模型和运动学模型规划器在碰撞避免场景下进行对比，涵盖高/低加速度条件和不同车辆高度。

英文摘要

Autonomous vehicle control is generally divided into two main areas: trajectory planning and trajectory tracking. Currently, trajectory planning is mostly done by particle or kinematic model-based optimization controllers. Because these planners do not consider CG height and its effects, their output is not specific to the vehicle type, which matters especially for high-CG vehicles. As a result, the tracking controller may have to work hard to avoid violating vehicle handling and comfort constraints while trying to realize these sub-optimal trajectories. This paper addresses this problem with a planner that uses a simplified double-track model, estimates lateral and roll-based load transfer using steady-state equations, and adopts a simplified tire model to reduce solver workload. The developed planner is compared with the widely used particle and kinematic model planners in collision avoidance scenarios under both high and low acceleration conditions and with different vehicle heights.
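To show why CG height matters, the sketch below evaluates the standard rigid-body, steady-state estimate of total lateral load transfer, ΔFz = m · a_y · h / t, where m is vehicle mass, a_y lateral acceleration, h the CG height, and t the track width. This is a simplified textbook relation that ignores suspension roll and the front/rear split; it is not necessarily the set of equations used in the paper, and the vehicle parameters are made up for illustration.

```python
def lateral_load_transfer(mass_kg, lat_accel_mps2, cg_height_m, track_width_m):
    """Total vertical load shifted from inner to outer wheels in steady-state
    cornering, in newtons (rigid-body approximation)."""
    if track_width_m <= 0:
        raise ValueError("track_width_m must be positive")
    return mass_kg * lat_accel_mps2 * cg_height_m / track_width_m

# Hypothetical vehicles that differ only in CG height.
low_cg = lateral_load_transfer(1500.0, 5.0, 0.5, 1.6)
high_cg = lateral_load_transfer(1500.0, 5.0, 0.9, 1.6)
print(low_cg, high_cg)  # 2343.75 4218.75
```

At the same lateral acceleration, the higher-CG vehicle shifts 80% more load to the outer wheels in this example, which a particle or kinematic planner cannot represent.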

2501.12020 2026-06-01 cs.CV

On the Illusion of Gender Bias in Face Recognition: Explaining the Fairness Issue Through Non-demographic Attributes

论人脸识别中的性别偏见错觉：通过非人口统计属性解释公平性问题

Paul Jonas Kurz, Haiyu Wu, Rouqaiah Al-Refai, Kevin W. Bowyer, Philipp Terhörst

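AI summary: Using decorrelated combinations of 40 non-demographic facial characteristics, the paper proposes an unsupervised joint investigation framework and finds that the gender gap vanishes when male and female images share specific attributes, indicating that the performance difference stems from the social definition of appearance rather than biology.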

Comments Accepted at IEEE TBIOM

AI中文摘要

人脸识别系统（FRS）根据用户性别表现出显著的准确性差异。由于这种性别差距降低了FRS的可信度，最近的努力试图找到原因。然而，这些研究使用手动选择、相关且小规模的面部特征集来支持其主张。在这项工作中，我们通过成功地将搜索域扩展到40种非人口统计面部特征的去相关组合来分析人脸识别中的性别偏见。首先，我们引入了一个工具链，以有效去相关和聚合面部属性，从而在大规模数据上实现较少偏见的性别分析。其次，我们定制了两个专门指标来量化面部属性对绝对和相对公平性的影响。基于这些基础，我们第三提出了一种新颖的无监督联合调查框架，能够识别当用作平衡测试数据集的过滤谓词时导致偏见消失的属性组合。实验表明，当男性和女性受试者的图像共享特定属性时，性别差距消失，这清楚地表明性能差异不是生物学问题，而是外貌的社会定义问题。这些发现可能重塑我们对人脸生物识别中公平性的理解，并为FRS提供见解，有助于解决性别偏见问题。

英文摘要

Face recognition systems (FRS) exhibit significant accuracy differences based on the user's gender. Since such a gender gap reduces the trustworthiness of FRS, more recent efforts have tried to find the causes. However, these studies make use of manually selected, correlated, and small-sized sets of facial features to support their claims. In this work, we analyze gender bias in face recognition by successfully extending the search domain to decorrelated combinations of 40 non-demographic facial characteristics. First, we introduce a toolchain to effectively decorrelate and aggregate facial attributes to enable a less-biased gender analysis on large-scale data. Second, we tailor two specialized metrics to quantify the effect of facial attributes on absolute and relative fairness. Based on these grounds, we thirdly present a novel unsupervised joint investigation framework capable of identifying attribute combinations leading to vanishing bias when used as filter predicates for balanced testing datasets. Experiments show the gender gap vanishing when images of male and female subjects share specific attributes, clearly indicating that the disparate performance is not a question of biology but of the social definition of appearance. These findings could reshape our understanding of fairness in face biometrics and provide insights into FRS, helping to address gender bias issues.

2412.03876 2026-06-01 cs.CV

Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization

通过推理时提示-噪声优化保障文本到图像生成

Jiangweizhi Peng, Zhiwei Tang, Gaowen Liu, Charles Fleming, Mingyi Hong

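AI summary: Proposes Prompt-Noise Optimization (PNO), a training-free inference-time method that jointly optimizes the continuous prompt embedding and the injected noise trajectory to suppress unsafe image generation, achieving state-of-the-art performance and robustness to adversarial attacks.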

AI中文摘要

文本到图像（T2I）扩散模型因其基于文本提示生成高质量、多样化图像的能力而被广泛认可。然而，尽管近期取得了进展，这些模型仍然容易生成包含敏感或不适当内容的不安全图像，这可能对用户造成伤害。当前防止扩散模型生成不当图像的努力容易被绕过且易受对抗攻击。如何确保T2I模型符合特定安全目标仍然是一个重大挑战。在这项工作中，我们提出了一种新颖的、无需训练的方法，称为提示-噪声优化（PNO），以减轻不安全图像生成。我们的方法引入了一个新颖的优化框架，利用采样过程中的连续提示嵌入和注入噪声轨迹来生成安全图像。大量的数值结果表明，我们的框架在抑制有毒图像生成方面达到了最先进的性能，并且对对抗攻击表现出鲁棒性，无需调整模型参数。此外，与现有方法相比，PNO在保持相当生成时间的同时，在安全生成和提示-图像对齐这两个冲突目标之间提供了最佳权衡。

英文摘要

Text-to-Image (T2I) diffusion models are widely recognized for their ability to generate high-quality and diverse images based on text prompts. However, despite recent advances, these models are still prone to generating unsafe images containing sensitive or inappropriate content, which can be harmful to users. Current efforts to prevent inappropriate image generation for diffusion models are easy to bypass and vulnerable to adversarial attacks. How to ensure that T2I models align with specific safety goals remains a significant challenge. In this work, we propose a novel, training-free approach, called Prompt-Noise Optimization (PNO), to mitigate unsafe image generation. Our method introduces a novel optimization framework that leverages both the continuous prompt embedding and the injected noise trajectory in the sampling process to generate safe images. Extensive numerical results demonstrate that our framework achieves state-of-the-art performance in suppressing toxic image generations and demonstrates robustness to adversarial attacks, without needing to tune the model parameters. Furthermore, compared with existing methods, PNO uses comparable generation time while offering the best tradeoff between the conflicting goals of safe generation and prompt-image alignment.
