arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.27129 2026-05-27 cs.CV cs.RO

YOLO26-RipeLoc Lite: A lightweight architecture for tomato ripeness detection and picking point localization in greenhouse robotic harvesting

Rajmeet Singh, Manveen Kaur, Shahpour Alirezaee, Irfan Hussain

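AI summary: Proposes YOLO26-RipeLoc Lite, a lightweight YOLO26-based architecture that combines a lightweight feature pyramid network, a ripeness-aware attention module, and a compact detection head to perform ripeness classification and center-point localization of greenhouse tomatoes, reaching 92.9% mAP@0.5 with only 2.38M parameters.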

English abstract

In greenhouse tomato production, automated harvesting requires accurate detection of ripe tomatoes, ripeness classification, and precise picking-point localization for robotic end-effectors. This paper proposes YOLO26-RipeLoc Lite, a lightweight deep learning architecture based on YOLO26 for simultaneous detection, ripeness classification, and center-point localization of greenhouse tomatoes. The model introduces three modifications: (1) a Lightweight Feature Pyramid Network (LFPN) with depthwise separable convolutions for efficient multi-scale fusion, (2) a Ripeness-Aware Attention Module (RAAM) with dual pooling and a learnable ripeness bias vector for enhanced color-texture discrimination, and (3) a Compact Detection Head (CDH) with shared convolutions and an integrated center-point regression branch for direct grasp planning. The model is evaluated on a custom dataset of 1,500 images with 6,227 instances (3,566 ripe, 2,661 unripe) from the SILAL greenhouse, Abu Dhabi, UAE. YOLO26-RipeLoc Lite achieves mAP@0.5 of 92.9% (95.2% ripe, 90.6% unripe) with the highest precision (95.2%) among all evaluated architectures using only 2.38M parameters. Post-training BatchNorm pruning at 30% reduces parameters to ~1.8M with negligible accuracy loss. Ablation studies confirm that greenhouse-aware HSV augmentation provides the largest improvement (+2.02 pp mAP@50), backbone freezing achieves peak precision (93.8%), and 3-phase progressive unfreezing yields the best localization quality (mAP@50:95 of 64.6%). Comparisons with YOLOv8n/s, YOLO11n/s, YOLO12n/s, and YOLO26s confirm superior accuracy-efficiency: 2.9 pp higher precision than YOLO12n with 7.0% fewer parameters and integrated center-point localization for robotic end-effector guidance.
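
As a rough illustration of why depthwise separable convolutions make multi-scale fusion cheaper, the sketch below counts the weights of a standard convolution and of its depthwise separable replacement. The layer size (128 input and output channels, 3x3 kernel) and the function names are hypothetical choices for illustration, not values taken from the paper.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k x k standard convolution (bias omitted)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer size, chosen only for illustration.
c_in, c_out, k = 128, 128, 3
std = standard_conv_params(c_in, c_out, k)
dws = depthwise_separable_params(c_in, c_out, k)
print(std, dws, round(std / dws, 2))
```

For this assumed layer the separable form needs roughly one eighth of the weights, which is the kind of saving a lightweight neck relies on.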

2605.27128 2026-05-27 cs.CV cs.LG

PILOT: A Data-Free Continual Learning Approach for Real-Time Semantic Segmentation via Boundary Guidance

Yujing Zhou, Prashant Shekhar, Thomas Yang, Yongxin Liu

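AI summary: Proposes the PILOT framework, which freezes the original network's parameters and adds a parallel Derivative branch that captures boundary information for new classes, enabling real-time semantic segmentation models to learn incrementally without old data and effectively mitigating catastrophic forgetting.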

English abstract

Real-time semantic segmentation models offer an excellent balance between accuracy and inference speed. However, deploying these models in dynamic real-world environments often requires the ability to learn novel classes incrementally without retraining on the entire dataset. This capability is known as continual learning. In this regard, the standard fine-tuning methods in deep learning often fail due to catastrophic forgetting, where the model learns new information but forgets previously learned classes. Contributing to this crucial domain, this paper proposes a novel continual learning framework tailored for PIDNet, which is a widely cited state-of-the-art real-time semantic segmentation model. Our method, PILOT (Parallel Incremental Learning Over Time), introduces a real-time and lightweight strategy by implementing a parallel Derivative-branch (D-branch) designed to capture the high-frequency boundary information of novel classes while freezing the trained parameters of the original segmentation network. This novel setup allows the model to adapt to new semantic categories while preserving the knowledge of previously learned classes. By using only data associated with the new class, our model significantly reduces training overhead. Experimental results demonstrate that our approach successfully segments new classes while maintaining high mean Intersection over Union (mIoU) on the original base classes, thereby comfortably outperforming all major continual learning approaches in this domain. Overall, PILOT is shown to effectively mitigate catastrophic forgetting with minimal impact on inference latency, thus maintaining real-time performance.
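
The mIoU metric mentioned in the abstract can be reported separately for base and novel classes to expose forgetting. The sketch below is a generic, minimal implementation on flat label lists; the toy labels and the split into base classes (0, 1) and a novel class (2) are assumptions for illustration and do not come from the paper.

```python
def mean_iou(pred, target, class_ids):
    """Mean intersection-over-union over the given class ids.

    pred and target are flat sequences of integer labels of equal length.
    Classes absent from both pred and target are skipped.
    """
    ious = []
    for c in class_ids:
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy labels, invented for illustration.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
base_miou = mean_iou(pred, target, class_ids=[0, 1])   # previously learned classes
novel_miou = mean_iou(pred, target, class_ids=[2])     # newly added class
print(base_miou, novel_miou)
```

Tracking the base-class score before and after an incremental step gives a direct measure of how much old knowledge was lost.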

2605.27117 2026-05-27 cs.AI

Position: AI Safety Requires Effective Controllability

Yige Li, Yunhao Feng, Jun Sun

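AI summary: Argues that AI safety should treat controllability as a first-class objective; the paper defines controllability, introduces the ControlBench benchmark, analyzes the shortcomings of existing alignment mechanisms, and proposes a control-centric architectural framework.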

Comments
23 pages

English abstract

AI safety is still largely framed as alignment: training models to follow human preferences, safety policies, and normative constraints. That framing has improved the behavior of modern language models, but aligned behavior does not by itself guarantee that a deployed agent can be stopped, overridden, or constrained once it operates in open-ended, interactive, and tool-using environments. A system may be safe in expectation and still fail to yield to explicit runtime authority under conflicting instructions, long-horizon execution, adversarial inputs, or risky tool use. This position paper argues that AI safety therefore requires controllability as a first-class objective. We define controllability as the ability of an AI system to remain reliably interruptible, overridable, redirectable, and constrainable by explicit control signals at runtime while preserving ordinary utility when such signals are absent. To study this gap, we introduce ControlBench, a benchmark for evaluating controllability failures in high-risk agentic scenarios. Experiments with OpenClaw-based agents show that current alignment and guardrail mechanisms reduce risk, but often fail to provide persistent, authoritative, and enforceable runtime control. We therefore propose a control-centric architectural framework that highlights explicit control planes, runtime intervention pathways, persistent control states, and auditable decision interfaces as key design principles for future controllable AI systems.

2605.27116 2026-05-27 cs.CV

COVD: Continual Open-Vocabulary Object Detection with Novel Concept Injection

Yupeng Zhang, Ruize Han, Yuzhong Feng, Zixin Ren, Yuntong Tian, Liang Wan

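AI summary: Introduces COVD, a new continual open-vocabulary object detection task, and injects novel concepts by freezing the visual encoder and updating only text-branch parameters, achieving efficient continual learning without additional parameters.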

English abstract

Open-vocabulary object detection (OVD) has made significant progress, enabling detectors to generalize from seen to unseen categories. However, real-world category spaces continually evolve, and existing OVD models still struggle with newly emerging concepts, while repeated full retraining is prohibitively expensive. To this end, we introduce a new task setting, termed Continual OVD with Novel Concept Injection (COVD), where models sequentially learn incoming novel concept groups while preserving prior concepts and original open-vocabulary knowledge, along with a new benchmark, Novel-114. Our key observation is that pretrained visual encoders often already perceive and represent many novel concepts, and the main bottleneck lies in the lack of stable semantic alignment between visual representations and textual concepts. Based on this, we propose NoIn-Det, an efficient continual injection framework without additional parameters. NoIn-Det freezes the visual encoder, preserves the text representation space using only texts of common concepts and previously injected concepts, and injects novel concepts by updating only a small subset of text-branch parameters beneficial to novel concept learning. Extensive experiments show that NoIn-Det effectively learns novel concepts, preserves old knowledge, and consistently outperforms existing continual learning methods for VLMs without introducing additional parameters. Novel-114 and the code will be released.

2605.27115 2026-05-27 cs.AI

Counteraction-Aware Multi-Teacher On-Policy Distillation for General Capability Recovery with Domain Preservation

Tianlei Chen, Jiao Ou, Ziyuan Liu, Ruiming Tang, Jian Liang, Han Li

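AI summary: Addresses recovery-preservation counteraction and weak-signal flattening, two failure modes of multi-teacher on-policy distillation under incomplete prompt coverage, by proposing CaMOPD, which uses decoupled alternating training and gap-based sample selection to recover general capabilities while preserving domain performance.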

English abstract

Domain specialization can improve LLM behavior in vertical domains, but often weakens the general capabilities inherited from the original model. Recent Multi-Teacher On-Policy Distillation (MOPD) pipelines recover model capabilities by supervising student-generated trajectories with teacher feedback, but typically assume teacher-aligned prompt coverage, requiring prompts to match the teachers' training distributions. This assumption is difficult to satisfy when the general teacher is an open-source model whose post-training data are unknown. Instead of attempting to reconstruct this hidden distribution, we study general capability recovery with readily available proxy general prompts. We identify two failure modes of vanilla MOPD in this incomplete-coverage situation: recovery-preservation counteraction from mixing conflicting recovery and preservation gradients, and weak-signal flattening from uniformly averaging samples with unequal correction demand. We propose Counteraction-Aware Multi-Teacher On-Policy Distillation (CaMOPD), which addresses these issues with decoupled alternating training and gap-based sample selection. CaMOPD gives general recovery dedicated updates, periodically reviews domain prompts for preservation, and selects samples with larger averaged token-level teacher-student log-probability gaps to concentrate correction signals. Across role-play dialogue and medical reasoning QA scenarios, CaMOPD performs best in general recovery over baselines while maintaining domain-specific behavior. Gradient coherence analyses further support the intended effect of CaMOPD in producing more coherent correction signals.
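
The gap-based sample selection described above can be pictured with a small sketch. It assumes the gap is the signed difference between teacher and student token log-probabilities, averaged over tokens, and that the top-k samples are kept; the abstract does not specify either detail, and the field names and numbers below are hypothetical.

```python
def mean_token_gap(teacher_logps, student_logps):
    """Average per-token gap between teacher and student log-probabilities."""
    gaps = [t - s for t, s in zip(teacher_logps, student_logps)]
    return sum(gaps) / len(gaps)


def select_top_gap(samples, k):
    """Keep the k samples with the largest averaged teacher-student gap."""
    ranked = sorted(
        samples,
        key=lambda s: mean_token_gap(s["teacher"], s["student"]),
        reverse=True,
    )
    return ranked[:k]


# Invented log-probabilities for three student-generated samples.
samples = [
    {"id": "a", "teacher": [-0.1, -0.2], "student": [-1.1, -1.2]},  # large gap
    {"id": "b", "teacher": [-0.5, -0.5], "student": [-0.6, -0.4]},  # gap near zero
    {"id": "c", "teacher": [-0.3], "student": [-0.8]},              # medium gap
]
selected = select_top_gap(samples, k=2)
print([s["id"] for s in selected])
```

Samples on which the student already agrees with the teacher contribute little correction signal, so dropping them concentrates the update on the samples that need it.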

2605.27113 2026-05-27 cs.LG cs.AI

High-Quality Synthetic Financial Time-Series using a GAN-Diffusion Framework

Giuseppe Masi, Andrea Coletta, Novella Bartolini

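AI summary: Proposes a quality-aware generative framework that combines a GAN with a diffusion model; the GAN's Critic guides the diffusion process to generate more realistic synthetic data that preserves the stylized facts and inter-asset correlation structure of financial time series.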

English abstract

In recent years, financial institutions and firms have increasingly adopted synthetic data to address data scarcity and to generate counterfactual market scenarios. However, reproducing all the statistical properties of financial time series, commonly known as stylized facts, remains an open challenge for many existing general-purpose architectures. In this paper, we present a quality-aware generative framework that combines two classes of generative methods, demonstrating how their integration addresses existing limitations while enhancing the realism of synthetic data. Specifically, we first introduce CoMeTS-GAN (Correlated Multivariate Time Series GAN), a Conditional Generative Adversarial Network (C-GAN) designed to jointly generate mid-price and volume time-series for correlated stocks. We then show how our GAN architecture can be incorporated into state-of-the-art diffusion models to enhance the quality of generated correlation structures. Specifically, the GAN's Critic serves as a quality evaluation module that guides the diffusion process, enforcing learned correlation structures in the generated time-series. Our framework offers a lightweight and responsive solution for realistic stock market simulation, explicitly modeling inter-asset correlation structures. We experimentally validate our framework against leading generative architectures, showing that it more effectively captures the stylized facts of stock markets and models inter-asset correlations.

2605.27110 2026-05-27 cs.CR cs.CL

BAIT: Boundary-Guided Disclosure Escalation via Self-Conditioned Reasoning

Xuan Luo, Yue Wang, Geng Tu, Jing Li, Ruifeng Xu

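AI summary: Proposes BAIT, a three-step jailbreak framework that identifies a protection boundary, refines it, and requests a detailed example, exploiting the model's own reasoning and consistency tendency to reach disclosure of malicious goals; experiments on several benchmarks show attack success rates significantly above those of conventional methods.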

English abstract

In this work, we propose BAIT (Boundary-Aware Iterative Trap), a three-step jailbreak framework that approaches malicious goals through internal disclosure. BAIT first asks the model to identify the protection boundary, then requires it to refine that boundary, and finally requests a detailed example. By expanding each step upon the model's previous responses, BAIT turns the model's own reasoning and consistency tendency into a disclosure pathway. Experiments on AdvBench, JailbreakBench, AIR-Bench, and SORRY-Bench demonstrate that BAIT consistently achieves strong attack success rates across top-tier large language models, significantly outperforming conventional jailbreak baselines. Further analysis reveals that: 1) prevention-oriented framing significantly outperforms direct knowledge requests; 2) the refinement step plays a critical role in disclosure escalation; and 3) the first two steps have a certain chance of eliciting harmful content while triggering little filtering.

2605.27101 2026-05-27 cs.CV cs.CL

Pop-Up Distractions Reveal Bag-of-Events Behavior in Video Large Language Models

Oscar Chew, Serhii Honcharenko, Qian-Hui Chen, Patricia Lu, Dishant Zaveri, Khoa D. Doan, Kuan-Hao Huang

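AI summary: By inserting unrelated advertisement clips, shows that video large language models often wrongly associate events from different segments, exhibiting "bag-of-events" behavior in which a video is treated as a collection of events rather than a temporal sequence.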

English abstract

A key capability for video understanding is reliably linking subjects to events across time, yet whether Video Large Language Models (VideoLLMs) actually achieve this remains unclear. In this work, we introduce DistractionBench to evaluate whether VideoLLMs can robustly link subjects and events in the presence of unrelated video segments. Through controlled interventions, such as inserting short advertisement clips into longer videos, we show that VideoLLMs frequently hallucinate interactions between entities from different segments, incorrectly attributing actions from injected advertisements to subjects in the main video. We characterize this systematic hallucination as bag-of-events (BoE) behavior, where models process videos as collections of events rather than temporally structured sequences. Evaluating 11 popular VideoLLMs, we find that all models exhibit substantial BoE behavior. Our findings suggest that VideoLLMs lack reliable mechanisms for temporal grounding and motivate the development of models with more robust subject-event association.

2605.27097 2026-05-27 cs.LG stat.ML

Mildly Overparameterized ReLU Networks on Orthogonal Data: Incremental Learning and Implicit Bias

James Town, Etienne Boursier, Ben Lewis, Matthias Englert, Ranko Lazic

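AI summary: Studies the gradient flow dynamics of two-layer ReLU networks trained from small initialization on orthogonal data, shows that the limiting flow converges to a saddle-to-saddle jump process as the initialization scale tends to zero, and proves that the network interpolates the training data with high probability once the width m is roughly larger than log(n); the squared ℓ2 norm of the learned interpolator scales as √n, within a constant factor of the minimum-ℓ2-norm interpolator.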

Comments
66 pages, 6 figures

English abstract

The successful training of neural networks hinges on the use of first order optimization methods, yet the theoretical characterization of these methods remains incomplete. This is especially true in settings with mild overparameterization. In this work, we study the gradient flow dynamics of two-layer ReLU networks from small initialization with orthogonal training data. We prove the limiting flow converges to a saddle-to-saddle jump process as the initialization scale tends to zero, revealing an incremental learning phenomenon in which a new neuron activates at each saddle. This analysis recovers the known result of Dana et al. (2025, arXiv:2502.16977) that the network interpolates the training data with high probability as soon as $m \gtrsim \log(n)$, where $m$ is the network width and $n$ is the number of training samples. This incremental process characterization also allows us to derive a novel implicit bias result: the learned interpolator has a squared $\ell_2$-norm scaling as $\sqrt{n}$, which is within a constant factor of the minimal $\ell_2$-norm interpolator. More broadly, our work provides the first rigorous proof of an incremental learning process for ReLU networks, whilst suggesting mildly overparameterized networks can converge to interpolating solutions whose complexity is of the same order as that of the optimal interpolator.

2605.27093 2026-05-27 stat.ML cs.LG

Gaussian Process-based learning with new MCMC-based implementation of Wishart prior on correlation matrix

Kane Warrior, Dalia Chakrabarty

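AI summary: Proposes a self-assembled Wishart prior for the covariance matrix, combined with MCMC-based Bayesian inference on the kernel hyperparameters; a look-back window introduces adaptiveness, and the prior helps diagnose weakly informative inputs.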

English abstract

In probabilistic supervised learning of an input-output relationship, as a sample function of a Gaussian Process (GP), priors are typically specified for the hyperparameters of the kernel that parametrises the covariance function of the GP, where the induced covariance matrix of the (resulting multivariate Normal) likelihood governs the learning and prediction. When the sought function is highly multivariate, multiple lengthscale parameters must be learnt simultaneously, making inference difficult. We develop a "self-assembled" Wishart prior for the covariance matrix, while undertaking Bayesian inference on the kernel hyperparameters using MCMC. The construction uses a look-back window over recent MCMC iterations to define a time-step-dependent scale matrix, thereby introducing adaptiveness to the chain. Results suggest that direct prior specification on the covariance matrix can be useful for diagnosing weakly informative inputs within the GP-based learning paradigm. We support our prior development with two distinct empirical illustrations: one on synthetic data and another on a real-world dataset.

2605.27091 2026-05-27 cs.CL cs.AI

MiRD: Reliable Set-Valued Prediction for Open-Ended Question Answering via Miscoverage Risk Decomposition

Anqi Hu, Zhiyuan Wang, Zijun Jia, Bo Fu

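AI summary: Proposes MiRD, a two-stage framework that decomposes overall miscoverage into sampling failure and conditional selection failure to deliver reliable set-valued prediction in open-ended question answering, controlling sampling risk and conditional selection risk while yielding tighter bounds and more adaptive prediction sets.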

English abstract

Reliable set-valued prediction provides a principled way to mitigate hallucinations in open-ended question answering (QA), yet existing conformal approaches typically rely on a fragile premise: finite sampling must already produce at least one admissible candidate, or calibration examples violating this condition are discarded. In this paper, we introduce MiRD, a two-stage framework that decomposes overall miscoverage into sampling failure and conditional selection failure. In Stage I, MiRD establishes an expectation-level marginal upper bound on the probability that finite sampling produces no admissible answer under a fixed budget. In Stage II, conditioned on sampling success, MiRD calibrates a conformal selection threshold using admission-correlated nonconformity scores defined over the full calibration set, thereby preserving calibration-set integrity. Across three open-ended QA datasets and eight models, MiRD controls sampling risk, conditional selection risk, and overall miscoverage, while yielding tighter first-stage bounds than PAC-style alternatives and more adaptive prediction sets than successful-only calibration.
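
For context on the conformal selection threshold, the sketch below shows the standard split-conformal quantile rule on a calibration set. It is a generic textbook construction, not MiRD's Stage II procedure; the nonconformity scores, candidate answers, and function names are invented for illustration.

```python
import math


def conformal_threshold(scores, alpha):
    """Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def prediction_set(candidates, threshold):
    """Keep every sampled answer whose nonconformity score is within the threshold."""
    return [answer for answer, score in candidates if score <= threshold]


# Invented calibration scores (lower means the answer conforms better).
calibration_scores = [0.9, 0.1, 0.4, 0.7, 0.2, 0.5, 0.3]
threshold = conformal_threshold(calibration_scores, alpha=0.25)

# Invented sampled candidates for one test question.
candidates = [("Paris", 0.15), ("Lyon", 0.65), ("Nice", 0.95)]
print(threshold, prediction_set(candidates, threshold))
```

A rule of this kind only guarantees coverage when an admissible answer is among the sampled candidates, which is the premise the abstract identifies as fragile and handles with a separate sampling-failure bound.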

2605.27088 2026-05-27 cs.CL cs.LG

LLMs Are Already Good Tutors: Training-Free Prompt Optimization for Pedagogical Math Tutoring

Unggi Lee, Minchul Shin, Yeil Jeong, Sookbun Lee, Jeongsu Moon, Kyungtae Joo, Eunjoo Lee, Hoilym Kwon

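AI summary: Explores training-free optimization of the system prompt via API calls, proposes 5 education-specialized methods, and evaluates 12 methods on 2 OOD benchmarks; every method surpasses the strongest RL-trained baseline, and ParetoGrad achieves the best Pareto balance across post-test solve rate, leak control, and helpfulness.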

Comments
17 pages, 5 figures

English abstract

Aligning LLMs for math tutoring typically requires RL-based training with multi-GPU infrastructure. We investigate whether training-free prompt optimization (evolving only the system prompt via API calls) can serve as a practical alternative. We adapt 7 published methods and propose 5 education-specialized methods, evaluating these 12 methods under 5 conditions on 2 OOD benchmark suites. All 12 best-per-method configurations surpass the strongest RL-trained baseline (R_total = 0.633), and our ParetoGrad achieves the best Pareto balance across post-test solve rate, leak control, and helpfulness, rather than dominating any single component. Behavioral analysis with an 82-code educational codebook reveals that training-free methods rely on teaching-knowledge patterns at 2-3x the rate of RL-trained models, with a compensating ~10 percentage-point reduction in intent-level scaffolding. We also find a task-dependent reasoning mode effect consistent across training-free and RL-based paradigms. Our approach enables efficient development of pedagogically aligned LLM tutors with prompts alone and minimal compute.
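
To make the notion of a Pareto balance concrete, the sketch below extracts the non-dominated configurations from a set of three-objective scores (solve rate, leak control, helpfulness, all treated as higher-is-better). It is a generic Pareto-front filter, not ParetoGrad itself, and the method names and scores are made up.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Return the names of configurations that no other configuration dominates."""
    front = []
    for name, vec in scores.items():
        dominated = any(
            dominates(other, vec)
            for other_name, other in scores.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical (solve rate, leak control, helpfulness) scores.
scores = {
    "method_a": (0.6, 0.9, 0.7),
    "method_b": (0.7, 0.8, 0.7),
    "method_c": (0.5, 0.8, 0.6),  # dominated by method_a
    "method_d": (0.7, 0.8, 0.6),  # dominated by method_b
}
print(pareto_front(scores))
```

A configuration can sit on the front without being the best on any single objective, which matches the abstract's point about balance rather than dominance.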

2605.27083 2026-05-27 cs.CL cs.CR

On the Hidden Costs of Counterfactual Knowledge Training in LLM Unlearning

Xiaotian Ye, Xiaohan Wang, Mengqi Zhang, Shu Wu

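AI summary: Finds that counterfactual tuning (CFT) for LLM unlearning suffers from two problems, knowledge conflict and hallucination spillover, and introduces the extended benchmark RWKU+ together with diagnostic tools for systematic analysis.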

English abstract

Counterfactual tuning (CFT) has emerged as a promising paradigm for Large Language Model (LLM) unlearning by training models to generate alternative fictitious knowledge in place of undesired content. However, in this work, we find that this paradigm still underperforms other paradigms in some aspects, and identify two previously overlooked pitfalls underlying this gap: (1) knowledge conflict, where mutual inconsistencies within counterfactual corpora induce conflicting gradients that disrupt parameter optimization, and (2) hallucination spillover, where fitting false targets instills a persistent fabrication bias, inflating hallucination rates on unrelated domains. To systematically diagnose these issues, we introduce RWKU+, an extended benchmark equipped with novel trade-off metrics and gradient-level diagnostic tools. Our work further discusses the limitations and overhead of the paradigm, aiming to provide insights and actionable guidance for more rigorous LLM unlearning research.
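
One common way to quantify conflicting gradients is the share of gradient pairs with negative cosine similarity. The sketch below implements that generic measure; whether the gradient-level diagnostics in RWKU+ use it is not stated in the abstract, and the toy gradient vectors are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def conflict_rate(grads):
    """Fraction of gradient pairs that point in opposing directions."""
    pairs = [(i, j) for i in range(len(grads)) for j in range(i + 1, len(grads))]
    if not pairs:
        return 0.0
    conflicts = sum(1 for i, j in pairs if cosine(grads[i], grads[j]) < 0)
    return conflicts / len(pairs)


# Toy per-sample gradients, invented for illustration.
grads = [[1.0, 0.0], [0.5, 0.5], [-1.0, 0.2]]
print(conflict_rate(grads))
```

A high conflict rate means that updates from different training samples partly cancel, which is the optimization disruption the abstract attributes to inconsistent counterfactual corpora.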

2605.27082 2026-05-27 cs.AI

Can Broad Biomedical Knowledge be Contextualized into Scenario-Grounded Propositions?

Qingyuan Zeng, Ziyang Chen, Pengxiang Cai, Zixin Guan, Anglin Liu, Lang Qin, Xinyao Lai, Jintai Chen

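AI summary: Proposes SCENE, a bi-level multi-agent framework that uses iterative search to turn broad biomedical knowledge into evidence-supported, scenario-grounded propositions, and validates it on clinical trials and LINCS L1000 studies.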

English abstract

Biomedical discovery often requires connecting broad biomedical knowledge with specific experimental or clinical data. Background knowledge suggests relevant mechanisms but is usually too general to map directly onto dataset variables, while data-driven patterns can be dataset-specific and hard to interpret mechanistically. We study this missing link as knowledge contextualization: transforming broad biomedical knowledge into evidence-supported, scenario-grounded propositions that domain experts can inspect, replay, and validate. We propose SCENE, a bi-level multi-agent framework that treats knowledge contextualization as iterative search. The upper level converts broad knowledge into search directions and grounds them in the dataset schema. The lower level executes these directions through multi-objective optimization to identify concrete propositions that balance evidential strength and data support. Feedback between the two levels progressively refines the search. We evaluate SCENE in two settings: discovering patient subgroups with heterogeneous treatment benefits in clinical trial scenarios, and identifying context-specific biological responses in LINCS L1000 studies. In clinical trials, SCENE discovers specific, well-supported subgroups and outperforms existing baselines. In L1000 studies, SCENE identifies perturbational contexts with strong target-response matching and high positive rates. These results show that SCENE bridges broad knowledge and scenario-specific evidence, producing traceable, inspectable hypotheses for follow-up validation.

2605.27081 2026-05-27 cs.LG cs.AI cs.DC

ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference

Xiongwei Zhu, Xiaojian Liao, Tianyang Jiang, Yusen Zhang, Liang Wang, Limin Xiao

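AI summary: Proposes ReMoE, a router fine-tuning framework that biases routing toward recently selected experts to obtain temporally stable routing and fewer expert fetches from external storage; it improves expert reuse by 26% while maintaining downstream task performance, and in real systems delivers an 8.4% throughput gain and a 1.77-1.99x decode speedup.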

Comments
Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

English abstract

Fine-grained Mixture-of-Experts (MoE) models sparsely activate only a subset of experts per token, reducing activated computation while maintaining high model capacity. However, in memory-constrained inference scenarios, only a small set of experts can be cached. Experts not in the cache must be fetched from slow external storage (e.g., UFS), leading to frequent evictions and substantial I/O overhead. We propose ReMoE, a router fine-tuning framework designed to boost token-wise expert reuse. ReMoE biases the router toward recently selected experts, producing temporally stable routing that better matches cache locality constraints. By increasing short-horizon expert reuse, ReMoE reduces expert fetches from storage without adding inference-time computation. Experiments on DeepSeek and Qwen models show that ReMoE improves expert reuse by 26% while maintaining downstream task performance. Real-system evaluations further confirm these benefits, improving output throughput by 8.4% under vLLM GPU-CPU expert offloading and reducing TPOT by 43.6-49.8% under llama.cpp on Jetson Orin NX, corresponding to a 1.77-1.99$\times$ decode speedup across diverse workloads. Checkpoints and usage instructions are available at https://github.com/BUAA-OSCAR/ReMoE.
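
Two quick checks help in reading these numbers. The sketch below converts a TPOT reduction into the implied decode speedup, and counts expert fetches for two routing traces under a small cache. The LRU eviction policy, the cache size, and the traces are assumptions for illustration; the abstract does not say which cache policy the evaluated systems use.

```python
from collections import OrderedDict


def speedup_from_tpot_reduction(reduction: float) -> float:
    """Decode speedup implied by a fractional reduction in time per output token."""
    return 1.0 / (1.0 - reduction)


def count_fetches(expert_trace, cache_size):
    """Count cache misses for a sequence of expert ids under LRU eviction."""
    cache = OrderedDict()
    fetches = 0
    for expert in expert_trace:
        if expert in cache:
            cache.move_to_end(expert)      # mark as most recently used
        else:
            fetches += 1                   # expert must be loaded from storage
            cache[expert] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used expert
    return fetches


# TPOT reductions quoted in the abstract.
print(round(speedup_from_tpot_reduction(0.436), 2))
print(round(speedup_from_tpot_reduction(0.498), 2))

# Invented routing traces: same experts, different temporal locality.
scattered = [0, 1, 2, 3, 0, 1, 2, 3]
sticky = [0, 0, 1, 1, 2, 2, 3, 3]
print(count_fetches(scattered, cache_size=2), count_fetches(sticky, cache_size=2))
```

The first two values reproduce the 1.77-1.99x range stated in the abstract, and the traces show how routing that revisits recent experts halves the fetch count in this toy setting without changing which experts are used overall.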

2605.27080 2026-05-27 cs.CV

Semi-Supervised Gaze Estimation via Disentangled Subspace Contrastive Learning

Qida Tan, Hongyu Yang, Wenchao Du

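AI summary: Proposes DSCL, a semi-supervised learning framework that uses Jacobian regularization to disentangle features into pitch and yaw subspaces and applies ordinal contrastive learning within each subspace, reaching competitive performance with only 5%-20% of the labeled data.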

Comments
ICML 2026

英文摘要

Appearance-based gaze estimation often suffers from poor generalization due to limited annotated samples and insufficient dataset diversity. Leading approaches adopt weakly supervised learning to generate large-scale pseudo-labeled data from unconstrained real-world scenarios, aiming to mitigate domain shift. In this work, we devise a simple yet effective semi-supervised learning architecture that leverages unlabeled data to enhance domain generalization, thereby reducing reliance on labor-intensive manual annotation. Our key insight is to impose Jacobian regularization to disentangle feature representations into discriminative subspaces dedicated to specific gaze components, such as pitch and yaw angles. We further exploit the intrinsic ordinal ranking within each subspace for contrastive learning, enabling the model to learn robust gaze representations from a small set of labeled samples and an abundance of unlabeled ones. This ultimately yields our Disentangled Subspace Contrastive Learning (DSCL) framework. Extensive experiments on multiple benchmarks verify that the proposed DSCL is plug-and-play, achieving competitive performance using only 20%, 10%, and even 5% of the annotated data under both in-domain and cross-domain evaluation settings. The public code is available at https://github.com/da60266/DSCL.

2605.27079 2026-05-27 cs.LG cs.AI cs.RO

Trust Region Q Adjoint Matching

信任区域Q伴随匹配

Yonghoon Dong, Kyungmin Lee, Changyeon Kim, Jaehyuk Kim, Jinwoo Shin

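AI summary: To address the instability of off-policy reinforcement learning for pretrained flow policies, proposes Trust Region Q Adjoint Matching, which adaptively controls the path-space KL divergence through projected dual descent for stable fine-tuning and reaches a 68% offline RL success rate across 50 OGBench tasks.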

详情
AI中文摘要

由于多步采样过程带来的优化不稳定性,预训练流策略的离策略强化学习仍然具有挑战性。最近,带有伴随匹配的Q学习(QAM)通过将问题重新表述为一个具有学习评论家的无记忆随机最优控制(SOC)问题来解决这一问题。然而,QAM继承了评论家引导改进的根本脆弱性:当评论家病态时,小的评论家误差会被放大,通常导致模型崩溃。本文引入了信任区域Q伴随匹配(TRQAM),一种稳定的离策略微调算法,通过投影对偶下降自适应地控制与预训练流策略的路径空间KL散度。具体来说,我们优化SOC动力学中的信任区域参数$λ$,并从理论上证明路径空间KL可以用$λ$的闭式函数表示。因此,我们的方法可以精确控制与预训练流策略的精确偏差,实现稳定的离策略强化学习。通过在50个OGBench任务上的实验,TRQAM在离线强化学习和离线到在线强化学习中都持续优于先前的方法。特别是,TRQAM在离线强化学习中实现了68%的总体成功率,显著提高了最强基线的46%。

英文摘要

Off-policy reinforcement learning of pretrained flow policies remains challenging due to the optimization instability arising from the multi-step sampling process. Recently, Q-learning with Adjoint Matching (QAM) addressed this issue by reformulating it as a memoryless stochastic optimal control (SOC) problem with a learned critic. However, QAM inherits a fundamental fragility of critic-guided improvement: small critic errors are amplified when critics are ill-conditioned, often leading to model collapse. This paper introduces Trust Region Q-Adjoint Matching (TRQAM), a stable off-policy fine-tuning algorithm that adaptively controls the path-space KL divergence from the pretrained flow policy through projected dual descent. Specifically, we optimize the trust-region parameter $λ$ in the SOC dynamics, and theoretically show that the path-space KL can be represented by a closed-form function of $λ$. As a result, our method can precisely control the deviation from the pretrained flow policy, achieving stable off-policy RL. In experiments on 50 OGBench tasks, TRQAM consistently outperforms prior methods in both offline RL and offline-to-online RL. In particular, TRQAM achieves an overall success rate of 68% in offline RL, a substantial improvement over the strongest baseline at 46%.

2605.27076 2026-05-27 cs.MA cs.LG

Cost of Structural Learning Under Censored Feedback: A Threshold-Bandit Approach

审查反馈下结构学习的代价:一种阈值-老虎机方法

Michael Ledford, William Regli

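AI summary: For censored-feedback problems in which a task yields reward only when the coalition reaches an unknown size threshold, proposes a threshold-activated cooperative multi-armed bandit model; the centralized algorithm C-TAC achieves O(log T) cumulative regret, and the decentralized event-triggered protocol D-TAC cuts communication 23-fold while preserving feasibility alignment.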

详情
AI中文摘要

在许多多智能体应用中,任务仅当由满足未知规模阈值的联盟执行时才产生奖励;否则,反馈完全被审查。这种审查造成了可识别性问题:智能体无法区分随机失败与协调不足。我们将此设置形式化为阈值激活合作多臂老虎机(TAC-MAB),并在集中式和去中心化协调下进行分析。我们证明集中式算法(C-TAC)实现了累积遗憾O(log T),该遗憾分解为结构搜索项(捕获在审查反馈下解决可行性的代价)和统计监控项(用于价值估计)。然后我们引入D-TAC,一种去中心化事件触发协议,其中智能体仅在其结构信念改变时进行同步。实验表明,在保守信念融合下,D-TAC相对于集中式基线实现了23倍的通信减少,同时保持了可行性对齐。这些结果刻画了在审查反馈下学习的协调代价,并表明无需持续同步即可实现接近集中式的通信效率。

英文摘要

In many multi-agent applications, tasks yield rewards only when executed by a coalition meeting an unknown size threshold; otherwise, feedback is fully censored. This censorship creates an identifiability problem: agents cannot distinguish stochastic failure from insufficient coordination. We formalize this setting as the Threshold-Activated Cooperative Multi-Armed Bandit (TAC-MAB) and analyze it under both centralized and decentralized coordination. We show that a centralized algorithm (C-TAC) achieves cumulative regret O(log T), decomposed into a structural-search term that captures the cost of resolving feasibility under censored feedback and a statistical-monitoring term for value estimation. We then introduce D-TAC, a decentralized event-triggered protocol in which agents synchronize only when their structural beliefs change. Empirically, D-TAC achieves a 23x reduction in communication relative to the centralized baseline while preserving feasibility alignment under conservative belief fusion. These results characterize the coordination cost of learning under censored feedback and show that near-centralized communication efficiency is achievable without continuous synchronization.
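
The identifiability problem described above can be made concrete with a toy environment. This is a minimal sketch under stated assumptions: Bernoulli rewards, a single task, and a naive linear search over coalition sizes. It is not C-TAC or D-TAC, and all names are hypothetical.

```python
import random


def pull(coalition_size, threshold, mean_reward, rng):
    """Return a Bernoulli reward if the coalition meets the hidden threshold, else a censored 0."""
    if coalition_size < threshold:
        return 0  # censored: looks exactly like an unlucky draw
    return 1 if rng.random() < mean_reward else 0


def smallest_rewarding_size(threshold, mean_reward, max_size, pulls_per_size, rng):
    """Naive structural search: grow the coalition until some pull returns a reward."""
    for size in range(1, max_size + 1):
        if any(pull(size, threshold, mean_reward, rng) for _ in range(pulls_per_size)):
            return size
    return None  # no feasible size observed within the sampling budget


rng = random.Random(0)
print(smallest_rewarding_size(threshold=3, mean_reward=0.9, max_size=6,
                              pulls_per_size=50, rng=rng))
```

A single zero reward never tells the learner whether the coalition was too small or the draw was unlucky, so every candidate size needs repeated pulls. That repeated probing is the kind of cost the abstract's structural-search regret term accounts for.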

2605.27075 2026-05-27 cs.CV

SoftCap: Soft-Budget Control for Diffusion Transformer Acceleration

SoftCap: 扩散Transformer加速的软预算控制

Yuhang Zhang, Junxiang Qiu, Huixia Ben, Zhenhua Tang, Shuo Wang, Yanbin Hao

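AI summary: Proposes SoftCap, a training-free soft-budget control layer that uses a trajectory drift observer and a soft-budget PI controller to adjust the Full-step triggering threshold dynamically, improving image quality while keeping a soft ceiling on the compute budget.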

详情
AI中文摘要

扩散Transformer(DiTs)实现了强大的视觉质量,但其迭代去噪过程需要大量昂贵的Transformer评估。无训练加速方法通过缓存、预测或验证中间特征来降低这一成本,然而何时执行全步的运行时决策通常由固定调度或手动调整的阈值驱动。我们提出SoftCap,一种用于基于缓存的DiT推理的无训练控制层。SoftCap将轨迹漂移观测器(通过轻量级隐藏状态统计估计局部缓存风险)与软预算PI控制器(根据相对于固定参考配置的实际计算调整全步触发阈值)相结合。预算是软上限:它塑造阈值,但不要求运行消耗预定数量的全步评估。在FLUX.1-dev上,在可比的中等计算操作点下,SoftCap优于SpeCa,在几乎相同的FLOPs下将ImageReward从0.967提升至0.981,并将LPIPS-Full从0.518降至0.498,而目标扫描诊断显示随着预算放宽,预期的软上限行为得以实现。

英文摘要

Diffusion Transformers (DiTs) achieve strong visual quality, but their iterative denoising process requires many costly Transformer evaluations. Training-free acceleration methods reduce this cost by caching, forecasting, or verifying intermediate features, yet the runtime decision of when to execute a Full step is often driven by fixed schedules or hand-tuned thresholds. We propose SoftCap, a training-free control layer for cache-based DiT inference. SoftCap couples a Trajectory Drift Observer, which estimates local cache risk from lightweight hidden-state statistics, with a Soft-Budget PI Controller, which adjusts the Full-triggering threshold from realized compute relative to a fixed reference profile. The budget is a soft ceiling: it shapes the threshold but does not require a run to spend a prescribed number of Full evaluations. On FLUX.1-dev, SoftCap improves over SpeCa at a comparable middle-compute operating point, raising ImageReward from 0.967 to 0.981 and reducing LPIPS-Full from 0.518 to 0.498 at nearly identical FLOPs, while target-sweep diagnostics show the intended soft-ceiling behavior as the budget is relaxed.
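
A PI controller that shapes a trigger threshold from realized versus reference compute might look like the sketch below. The gains, the clamping range, the position-form update, and the rule "run a Full step when drift exceeds the threshold" are all assumptions for illustration. The abstract does not specify SoftCap's actual controller or drift statistic.

```python
class SoftBudgetPI:
    """Position-form PI controller for the Full-step trigger threshold (illustrative)."""

    def __init__(self, base_threshold, kp=0.5, ki=0.1, lo=0.0, hi=1.0):
        self.base = base_threshold
        self.threshold = base_threshold
        self.kp, self.ki = kp, ki
        self.lo, self.hi = lo, hi
        self.integral = 0.0

    def update(self, realized_full, reference_full):
        # Positive error: more Full steps spent so far than the reference profile allows.
        error = realized_full - reference_full
        self.integral += error
        raw = self.base + self.kp * error + self.ki * self.integral
        self.threshold = min(self.hi, max(self.lo, raw))  # keep the threshold in range
        return self.threshold


def run(drifts, reference_profile, controller):
    """Return per-step Full (True) or cached (False) decisions for a drift sequence."""
    full_steps = 0
    decisions = []
    for drift, reference in zip(drifts, reference_profile):
        is_full = drift > controller.threshold
        full_steps += int(is_full)
        decisions.append(is_full)
        controller.update(full_steps, reference)
    return decisions


print(run([0.6] * 5, [0] * 5, SoftBudgetPI(0.5)))   # tight budget
print(run([0.6] * 5, [10] * 5, SoftBudgetPI(0.5)))  # loose budget
```

In this sketch a loose budget only lowers the threshold. A run whose drift stays at zero still executes no Full steps, which mirrors the soft-ceiling behavior described in the abstract.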

2605.27074 2026-05-27 cs.CV

IPIBench: Evaluating Interactive Proactive Intelligence of MLLMs under Continuous Streams

IPIBench: 在连续流下评估多模态大模型的交互式主动智能

Jinzhao Li, Yinuo Chen, Wenxuan Song, Yijia Lei, Yichi Zhang, Honglei Yan, Panwang Pan, Miao Liu

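AI summary: Proposes the IPIBench benchmark for evaluating the interactive proactive intelligence of multimodal large language models in streaming-video scenarios, and designs the IPI-Agent framework to improve proactive triggering and interaction coordination.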

详情
AI中文摘要

最近的多模态大模型在反应式问答上表现强劲,但现实世界的流式助手需要对连续视觉输入进行主动推理。现有基准主要研究孤立的单轮设置中的反应式或主动式交互,忽视了用户可能在交错反应式查询中添加、修改或取消主动请求的动态多轮场景。为填补这一空白,我们引入IPIBench,这是首个在流式视频设置下评估多模态大模型交互式主动智能的基准。IPIBench涵盖主动监控、主动任务管理以及交错的反应式-主动式请求。对代表性多模态大模型的评估揭示了两个主要限制:不稳定的主动触发以及反应式和主动行为之间的弱协调。我们进一步提出IPI-Agent,一个无训练的智能体框架,包含交互控制策略和时间门控机制,用于稳定主动触发和协调多轮交互。实验表明,IPI-Agent在所有基准设置上持续改进现有多模态大模型。

英文摘要

Recent multimodal large language models (MLLMs) achieve strong performance on reactive question answering, but real-world streaming assistants require proactive reasoning over continuous visual inputs. Existing benchmarks mainly study reactive or proactive interactions in isolated single-turn settings, overlooking dynamic multi-turn scenarios where users may add, modify, or cancel proactive requests alongside interleaved reactive queries. To address this gap, we introduce IPIBench, the first benchmark for evaluating Interactive Proactive Intelligence of MLLMs under streaming video settings. IPIBench covers proactive monitoring, proactive task management, and interleaved reactive-proactive requests. Evaluations on representative MLLMs reveal two major limitations: unstable proactive triggering and weak coordination between reactive and proactive behaviors. We further propose IPI-Agent, a training-free agentic framework with an interaction-control policy and a temporal-gating mechanism for stabilizing proactive triggering and coordinating multi-turn interactions. Experiments show that IPI-Agent consistently improves existing MLLMs across all benchmark settings.

2605.27073 2026-05-27 cs.LG

Learning to Orchestrate Agents under Uncertainty

学习在不确定性下编排智能体

Mary Chriselda Antony Oliver, Lan Jiang, Aaron Bundi Anampiu, Elaf Almahmoud, Francesco Quinzan, Umang Bhatt

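AI summary: Proposes BOT-Orch, which recasts orchestration as a regularized multi-armed bandit problem to orchestrate heterogeneous agents adaptively under uncertainty, with a theoretical O(√T) regret bound and better results than the baselines.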

详情
AI中文摘要

异构智能体的自适应编排需要在不确定且不断演化的智能体行为下做出顺序委派决策,例如协调具有不同可靠性、成本和响应质量的专门AI模型。虽然先前关于智能体编排的工作侧重于性能或成本,但通常未在编排层面显式建模智能体可靠性和输出分布的不确定性。在这项工作中,我们研究了不确定性下异构智能体的自适应编排问题,其中元控制器必须决定何时委派给某个智能体,同时考虑可靠性、成本和不确定性。我们提出了BOT-Orch,一个轻量级框架,将编排重新表述为智能体上的赌博机问题,并通过智能体输出分布与任务特定参考分布之间的OT距离进行正则化。我们证明,在标准假设下,正则化编排享有O(√T)的遗憾界,并能在具有相同平均奖励但分布对齐不同的智能体之间可证明地诱导偏好排序。实验上,我们展示了BOT-Orch在具有异构、非独立同分布智能体行为的合成但对抗性任务分配设置中优于标准赌博机和启发式基线。

英文摘要

Adaptive orchestration of heterogeneous agents requires making sequential delegation decisions under uncertain and evolving agent behaviour, e.g., coordinating specialised AI models with varying reliability, cost, and response quality. While prior work on agent orchestration focuses on performance or cost, uncertainty in agent reliability and output distributions is typically not modelled explicitly at the orchestration level. In this work, we study the problem of adaptive orchestration of heterogeneous agents under uncertainty, where a meta-controller must decide when to delegate to an agent, accounting for reliability, cost, and uncertainty. We propose BOT-Orch, a lightweight framework that recasts orchestration as a bandit problem over agents, regularized by OT distances between agent output distributions and task-specific reference distributions. We show that the regularised orchestration enjoys $\mathcal{O}(\sqrt{T})$ regret under standard assumptions, and provably induces preference ordering among agents with identical mean rewards but differing distributional alignment. Empirically, we demonstrate that BOT-Orch outperforms standard bandit and heuristic baselines in synthetic but adversarial task allocation settings with heterogeneous, non-i.i.d. agent behaviour.
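
The claim that an OT regularizer can order agents with identical mean rewards can be sketched as follows. The sketch assumes a UCB-style index, scalar agent outputs, and a one-dimensional Wasserstein-1 distance between equal-size samples. BOT-Orch's actual index, distance, and parameter names are not given in the abstract, so everything here is hypothetical.

```python
import math


def wasserstein_1d(xs, ys):
    """W1 distance between two equal-size one-dimensional empirical samples."""
    assert len(xs) == len(ys), "this sketch assumes equal sample sizes"
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def select_agent(stats, outputs, reference, t, lam=1.0):
    """Pick the agent with the best UCB score minus an OT penalty.

    stats[i] = (pull_count, mean_reward); outputs[i] = sample of agent i's outputs.
    """
    best, best_score = None, -math.inf
    for i, (n, mean) in enumerate(stats):
        if n == 0:
            return i  # try every agent once before comparing scores
        bonus = math.sqrt(2 * math.log(t) / n)  # standard UCB1 exploration bonus
        score = mean + bonus - lam * wasserstein_1d(outputs[i], reference)
        if score > best_score:
            best, best_score = i, score
    return best


stats = [(10, 0.5), (10, 0.5)]            # identical pull counts and mean rewards
outputs = [[0, 0, 1, 1], [2, 2, 3, 3]]    # agent 0 matches the reference, agent 1 drifts
reference = [0, 0, 1, 1]
print(select_agent(stats, outputs, reference, t=20))
```

Both agents have the same reward statistics, so the only term separating them is the distance to the reference distribution. The better-aligned agent is the one selected.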

2605.27072 2026-05-27 cs.CL cs.AI

E3: Issue-Level Backtesting for Automated Research Critique

E3: 面向自动化研究评论的问题级回测

Yashwardhan Chaudhuri, Sanyam Jain, Paridhi Mundra

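AI summary: Proposes E3, an automated review assistant, and evaluates how well it identifies technical concerns in research papers using an issue-level backtesting protocol; it attains the highest recall compared with human reviews and LLM baselines.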

详情
AI中文摘要

我们提出E3,一个自动化评论助手,通过识别研究论文中与决策相关的技术问题来增强评审者和工程团队。对于每个问题,E3报告其性质、位置、对贡献的影响以及解决该问题所需的分析或证据,涵盖无根据的主张、缺失的消融实验、弱基线、隐藏假设、有效性威胁和数据泄露风险。为了在没有污染混杂因素的情况下评估E3,我们采用问题级回测协议:语料库仅限于每个自动化来源训练截止日期之后发表的论文,并且对于每篇论文,一个仅观察匿名评审的元裁判将每个问题-来源对标记为“捕获”、“部分”或“遗漏”。应用于100篇ICLR 2026论文和4598个被评判的问题行,将E3与ICLR人类评审以及基于OpenAI的gpt-5.4和Anthropic的claude-opus-4-6构建的两个提示匹配的LLM基线进行比较,使用元裁判gpt-5.5,E3在每个聚合指标上达到最高召回率。包含部分的召回率达到90.2%,比GPT高15.5个百分点,比Claude高17.1个百分点,比人类评审高29.2个百分点,严格召回率保持顺序为65.8%。在人类评审提出的问题上,E3恢复了89.6%;在人类评审遗漏的问题上,它额外发现了1635行被纳入评判联合集,比次优来源多406行。语料库、基线提示、裁判提示模板和评估代码已发布。

英文摘要

We present E3, an automated review assistant that augments reviewers and engineering teams by identifying decision-relevant technical concerns in research papers. For each concern, E3 reports its nature, its location, its bearing on the contribution, and the analysis or evidence that would resolve it, covering unsupported claims, missing ablations, weak baselines, hidden assumptions, threats to validity, and leakage risks. To evaluate E3 without contamination confounds, we adopt an issue-level backtesting protocol: the corpus is restricted to papers postdating the training cutoff of every automated source, and for each paper a meta-judge that observes only anonymised reviews labels every issue-source pair as Caught, Partial, or Missed. Applied to 100 ICLR 2026 papers and 4598 judged issue rows, comparing E3 against the ICLR human reviews and two prompt-matched LLM baselines built on gpt-5.4 from OpenAI and claude-opus-4-6 from Anthropic, with meta-judge gpt-5.5, E3 attains the highest recall on every aggregate metric. Partial-inclusive recall reaches 90.2 percent, which is 15.5 points over GPT, 17.1 points over Claude, and 29.2 points over the human reviews, and strict recall preserves the ordering at 65.8 percent. On concerns raised by the human reviewers, E3 recovers 89.6 percent; on concerns the human reviewers missed, it surfaces 1635 additional rows admitted into the judged union, 406 above the next-best source. Corpus, baseline prompts, judge prompt template, and evaluation code are released.
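
The two recall figures can be read as simple ratios over the meta-judge labels. The sketch below assumes that strict recall counts only Caught rows and that partial-inclusive recall counts Caught plus Partial rows, each divided by all judged rows for a source. The abstract does not spell out the exact definitions, so treat these as an assumption.

```python
from collections import Counter


def recalls(labels):
    """Return (strict, partial_inclusive) recall for one source's judged issue rows."""
    counts = Counter(labels)
    total = len(labels)
    strict = counts["Caught"] / total
    partial_inclusive = (counts["Caught"] + counts["Partial"]) / total
    return strict, partial_inclusive


# Hypothetical labels for ten issue rows, not data from the paper.
labels = ["Caught"] * 6 + ["Partial"] * 3 + ["Missed"] * 1
print(recalls(labels))
```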

2605.27071 2026-05-27 cs.AI

Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry

可溯源知识图谱推理助力钢铁行业工业VOCs的LLM辅助决策支持

Changqing Su, Yu Ding, Zuhong Lin, Hongyu Liu, Xi He, Zheng Zeng, Liqing Li

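AI summary: Because knowledge on steel-industry VOCs governance is scattered and general-purpose large language models are prone to hallucination, proposes Chat-ISV, a knowledge-graph-enhanced multi-agent question-answering system that provides high-reliability decision support through topology optimization, multi-agent routing, and source-backtracking retrieval.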

详情
AI中文摘要

钢铁行业挥发性有机化合物(VOCs)治理的关键知识分散在非结构化的科学文献中,使得整合工艺、污染物和控制技术证据变得困难,并增加了通用大语言模型(LLM)在回答低频工业问题时产生幻觉的风险。为此,我们开发了Chat-ISV,一个知识图谱(KG)增强的多智能体问答系统,该系统解析精选的钢铁行业VOCs文献语料库,构建包含27180个节点和81779条语义边的Neo4j知识图谱,并结合提示约束提取、以块为中心的拓扑优化、多智能体路由、源回溯检索、本地文献检索、开放域知识访问和交互式子图可视化。基准测试和400份专家盲评表明,拓扑优化将孤立节点从57%降至4.08%,Chat-ISV实现了高事实可靠性,精确率96.93%,召回率72.63%,F1分数0.830,平均得分1.69/2.00。通过将碎片化的环境工程文献转化为可溯源、可查询、面向决策支持的知识,Chat-ISV为专业工业领域中可靠的LLM部署和智能污染控制决策支持建立了一种可扩展的环境信息学范式。

英文摘要

Key knowledge for steel-industry volatile organic compounds (VOCs) governance is scattered across unstructured scientific literature, making it difficult to integrate process, pollutant, and control-technology evidence and increasing the risk of hallucination when general large language models (LLMs) answer low-frequency industrial questions. Here we developed Chat-ISV, a knowledge graph (KG) enhanced multi-agent Q&A system that parses a curated steel-industry VOCs literature corpus, constructs a Neo4j KG with 27180 nodes and 81779 semantic edges, and combines prompt-constrained extraction, chunk-centered topology optimization, multi-agent routing, source-backtracking retrieval, local literature retrieval, open-domain knowledge access, and interactive subgraph visualization. Benchmark tests and 400 expert blind evaluations showed that topology optimization reduced isolated nodes from 57% to 4.08% and that Chat-ISV achieved high factual reliability, with 96.93% precision, 72.63% recall, an F1-score of 0.830, and a mean score of 1.69/2.00. By converting fragmented environmental-engineering literature into traceable, queryable, and decision-support-oriented knowledge, Chat-ISV establishes a scalable environmental-informatics paradigm for reliable LLM deployment and intelligent pollution-control decision support in specialized industrial domains.
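
As a quick consistency check, the reported F1-score follows from the reported precision and recall if F1 is taken as their harmonic mean over the aggregate figures. That is an assumption here, since the abstract does not state how the scores were averaged.

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.9693 \times 0.7263}{0.9693 + 0.7263}
    \approx \frac{1.4080}{1.6956}
    \approx 0.830
```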

2605.27068 2026-05-27 cs.CL cs.AI cs.MA

QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

QUACK: 多模态社交推理智能体中的沟通知识质疑、理解与审计

Ye Yuan, Rui Song, Weien Li, Zeyu Li, Haochen Liu, Xiangyu Kong, Changjiang Han, Yonghan Yang, Zichen Zhao, Zixuan Dong, Fuyuan Lyu, Bowei He, Haolun Wu, Jikun Kang, Xue Liu

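AI summary: Proposes the QUACK framework, which audits the consistency between the language and the perceived behavior of multimodal social deduction agents through three levels of evaluation (game outcomes, behavioral trajectories, and utterance consistency), and finds that even the strongest agent hallucinates 15.1% of its spatial claims and makes more than half of its accusations without evidence.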

详情
AI中文摘要

社交推理游戏已成为探测大型语言模型智能体推理、欺骗、协调和信念建模的热门测试平台。然而,大多数环境仅通过胜率等游戏结果评分,且主要局限于纯文本交互,难以判断智能体的语言是否真正基于其感知和行动,也难以识别其行为背后的失败模式。为填补这一空白,我们引入了QUACK,一个用于审计多模态社交推理中智能体语言基础的开源环境和评估框架。QUACK在三个层面评估智能体:游戏结果、行为轨迹和话语级一致性。其核心的陈述验证流水线从引擎日志重建每个智能体的真实轨迹,并对照检查每个讨论声明,自动标记空间幻觉、无据指控、欺骗崩溃和语言-行动不一致。在同质和跨模型对抗设置下评估三个前沿视觉语言模型,我们发现即使是最强的智能体,其可验证的空间声明中有15.1%是幻觉,且超过一半的指控缺乏有据证据。我们在https://github.com/AAAAA-Academia-Attractions/QUACK发布完整的引擎、评估框架、工具包和日志。

英文摘要

Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language Model (LLM) agents. However, most environments are scored only by game outcomes such as win rates and remain largely limited to text-only interaction, making it difficult to tell whether an agent's language is actually grounded in what it perceived and did, or to identify the failure modes underlying its behavior. To address this gap, we introduce QUACK, an open-source environment and evaluation framework for auditing the grounding of agent language in multimodal social reasoning. QUACK evaluates agents at three levels: game outcomes, behavioral trajectories, and utterance-level consistency. Its core Statement Verification Pipeline reconstructs each agent's ground-truth trajectory from engine logs and checks every discussion claim against it, automatically flagging spatial hallucination, unsupported accusation, deception collapse, and language-action inconsistency. Evaluating three frontier VLMs in both homogeneous and cross-model adversarial settings, we find that even the strongest agent hallucinates 15.1% of its verifiable spatial claims and makes over half of its accusations without grounded evidence. We release the full engine, evaluation framework, toolkit, and logs at https://github.com/AAAAA-Academia-Attractions/QUACK.
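
Checking discussion claims against a logged trajectory can be sketched in a few lines. The data shapes below (a timestep-to-room mapping and claims as timestep and room pairs) and the label names are assumptions for illustration, and they cover only spatial claims. QUACK's actual pipeline and log format are defined in its repository, not here.

```python
def verify_location_claims(trajectory, claims):
    """Label each (timestep, room) claim against the logged ground-truth trajectory."""
    labels = []
    for timestep, room in claims:
        if timestep not in trajectory:
            labels.append("unverifiable")   # the log has no record for this timestep
        elif trajectory[timestep] == room:
            labels.append("grounded")
        else:
            labels.append("hallucinated")   # the claim contradicts the log
    return labels


def hallucination_rate(labels):
    """Share of verifiable claims that contradict the log."""
    verifiable = [label for label in labels if label != "unverifiable"]
    if not verifiable:
        return 0.0
    return verifiable.count("hallucinated") / len(verifiable)


trajectory = {1: "kitchen", 2: "hall"}                 # hypothetical engine log
claims = [(1, "kitchen"), (2, "lab"), (5, "hall")]     # hypothetical agent statements
labels = verify_location_claims(trajectory, claims)
print(labels, hallucination_rate(labels))
```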

2605.27067 2026-05-27 cs.CV

BEAT: Rhythm-Elastic Alignment for Agentic Music-guided Movie Trailer Generation

BEAT: 节奏弹性对齐用于智能音乐引导的电影预告片生成

Yutong Wang, Yunke Wang, Xinyuan Chen, Chang Xu

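AI summary: Proposes the BEAT framework, which achieves elastic many-to-one rhythm alignment for end-to-end movie trailer generation using the music-visual alignment encoder MuVA and the energy-adaptive dynamic programming algorithm Bar-DP.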

详情
AI中文摘要

自动电影预告片生成必须从整部电影中选择镜头并与背景音乐同步。现有方法要么将音乐对齐归为后处理,要么强制执行刚性的一对一镜头-音乐映射,忽略了专业剪辑节奏的弹性:快速剪辑伴随高能量段落,而持续镜头跨越较安静的小节。我们提出BEAT,一个解决这一差距的框架,包含两个核心组件:MuVA,一个紧凑的音乐-视觉对齐编码器,通过Sinkhorn正则化的两阶段学习训练;以及Bar-DP,一种能量自适应动态规划算法,根据音乐动态产生弹性的多对一对齐。这些组件被集成到一个五阶段智能管道中,该管道将核心对齐建立在学习的跨模态特征上,同时通过结构化文本信号协调更高层次的创意决策。为了支持全面评估,我们还引入了TrailerArena,一个包含四个互补维度20多个指标的基准。在TrailerArena上,BEAT在镜头选择、排序和感知质量方面实现了最先进的性能,同时端到端地生成完整制作的预告片。

英文摘要

Automatic movie trailer generation must select shots from a full-length film and synchronize them with background music. Existing methods either relegate music alignment to post-processing or enforce rigid one-to-one shot-music mappings, overlooking that professional editing rhythm is elastic: rapid cuts accompany high-energy passages while sustained shots span quieter bars. We introduce BEAT, a framework that addresses this gap with two core components: MuVA, a compact music-visual alignment encoder trained with Sinkhorn-regularized two-stage learning, and Bar-DP, an energy-adaptive dynamic programming algorithm that produces elastic many-to-one alignments following musical dynamics. These components are integrated into a five-phase agentic pipeline that grounds the core alignment in learned cross-modal features while coordinating higher-level creative decisions through structured text signals. To support comprehensive evaluation, we also introduce TrailerArena, a benchmark with 20+ metrics across four complementary dimensions. On TrailerArena, BEAT achieves state-of-the-art performance across shot selection, ordering, and perceptual quality, while producing fully composed trailers end-to-end.

2605.27066 2026-05-27 cs.CL cs.IR

Large Language Model-Powered Query-Driven Event Timeline Summarization in Industrial Search

工业搜索中基于大语言模型的查询驱动事件时间线摘要

Mingyue Wang, Xingyu Xie, Hang Yang, Li Gao, Lixin Su, Ge Chen, Dawei Yin, Daiting Shi

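AI summary: Proposes the QDET system, which performs query-driven event timeline summarization through multi-task fine-tuning and reinforcement learning and significantly improves user engagement on Baidu Search.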

详情
Comments
Accepted at KDD 2026
AI中文摘要

理解事件如何随时间演变对于处理热门新闻查询的搜索引擎至关重要。我们提出了QDET(查询驱动事件时间线摘要),这是一个部署在百度搜索上的生产系统,用于构建聚焦的事件时间线以解释特定查询事件。与传统的以主题为中心、旨在全面覆盖的方法不同,QDET从每天检索的数百万文档形成的嘈杂候选集中识别并组织与查询密切相关的子事件。QDET包含两个关键创新:(1)多任务监督微调,包含三个辅助任务——时间顺序、因果判断和时间线完成——使紧凑模型在专业领域匹配更大通用模型的性能;(2)基于强化学习的事件简洁摘要,在保持语义质量的同时强制执行严格长度约束,实现了88.2%的长度合规性,并在约束满足上比671B规模模型高出7.7个百分点。我们微调的7B参数模型在时间线摘要上达到76.2%的F1分数,略超DeepSeek-R1-671B的零样本性能(76.1% F1),而仅使用其1%的参数——表明领域特定优化能够以大幅降低的计算成本实现质量相当的生产就绪模型。百度搜索上的在线A/B测试验证了实际效果,与单任务基线相比,点击率提升5.5%,停留时间延长4.6%,探索深度增加4.4%。我们进一步证明时间线理解可迁移到热度预测,确认了对下游任务的有效知识迁移。

英文摘要

Understanding how events evolve over time is essential for search engines handling queries about trending news. We present QDET (Query-Driven Event Timeline Summarization), a production system deployed on Baidu Search that constructs focused event timelines to explain specific query events. Unlike traditional topic-centric approaches that aim for comprehensive coverage, QDET identifies and organizes sub-events closely relevant to the query from noisy candidate sets formed by millions of documents retrieved daily. QDET incorporates two key innovations: (1) multi-task supervised fine-tuning with three auxiliary tasks (temporal ordering, causal judgment, and timeline completion) that enable compact models to match the performance of much larger general-purpose models in specialized domains; (2) reinforcement learning-based concise event summarization that enforces strict length constraints while maintaining semantic quality, achieving 88.2% length compliance and outperforming 671B-scale models by 7.7 points in constraint satisfaction. Our fine-tuned 7B-parameter model achieves a 76.2% F1 score on timeline summarization, slightly surpassing the zero-shot performance of DeepSeek-R1-671B (76.1% F1) while using only about 1% of its parameters, demonstrating that domain-specific optimization enables production-ready models with comparable quality at drastically reduced computational cost. Online A/B tests on Baidu Search validate real-world effectiveness, showing a 5.5% CTR improvement, 4.6% longer dwell time, and 4.4% deeper exploration compared to single-task baselines. We further demonstrate that timeline understanding transfers to heat prediction, confirming effective knowledge transfer to downstream tasks.

2605.27063 2026-05-27 cs.LG

Learning Dynamic Graph Representations through Timespan View Contrasts

通过时间跨度视图对比学习动态图表示

Yiming Xu, Zhen Peng, Bin Shi, Xu Hua, Bo Dong

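AI summary: Proposes CLDG and CLDG++, dynamic graph representation frameworks based on temporal translation invariance that improve node classification and dynamic graph anomaly detection through contrastive learning across timespans and multi-scale contrastive learning.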

详情
Comments
Accepted by Neural Networks
AI中文摘要

图蕴含的丰富信息激发了对无监督图表示的进一步研究。现有研究主要依赖静态图中的节点特征和拓扑属性来创建自监督信号,忽略了真实世界图数据携带的时间成分,例如边的时间戳。为了克服这一局限,本文探索了如何在动态图上优雅地建模时间演化。具体地,我们引入了一种新的归纳偏置,即时间平移不变性,它说明了同一节点在不同时间跨度上倾向于保持相似标签。基于这一假设,我们开发了一个动态图表示框架CLDG,通过在不同时间跨度上进行对比学习,鼓励节点保持局部一致的时间平移不变性。除了仅考虑显式拓扑链接的标准CLDG,我们进一步提出的CLDG++额外采用图扩散来揭示节点之间的全局上下文相关性,并设计了一个由局部-局部、局部-全局和全局-全局对比组成的多尺度对比学习目标,以增强表示能力。有趣的是,通过测量不同时间跨度之间的一致性来形成异常指标,CLDG和CLDG++无缝集成到动态图异常检测任务中,这在金融、网络安全和医疗保健等许多高影响力领域具有广泛应用。实验表明,CLDG和CLDG++在节点分类和动态图异常检测等下游任务中均表现出理想的性能。此外,CLDG通过隐式利用时间线索而不是复杂的序列模型,显著降低了时间和空间复杂度。

英文摘要

The rich information underlying graphs has inspired further investigation of unsupervised graph representation learning. Existing studies mainly depend on node features and topological properties within static graphs to create self-supervised signals, neglecting the temporal components carried by real-world graph data, such as edge timestamps. To overcome this limitation, this paper explores how to model temporal evolution on dynamic graphs elegantly. Specifically, we introduce a new inductive bias, temporal translation invariance: the tendency of the same node to keep similar labels across different timespans. Based on this assumption, we develop a dynamic graph representation framework, CLDG, that encourages each node to maintain locally consistent temporal translation invariance through contrastive learning on different timespans. Beyond the standard CLDG, which considers only explicit topological links, we further propose CLDG++, which additionally employs graph diffusion to uncover global contextual correlations between nodes and uses a multi-scale contrastive learning objective composed of local-local, local-global, and global-global contrasts to enhance representation capabilities. Interestingly, by measuring the consistency between different timespans to form anomaly indicators, CLDG and CLDG++ integrate seamlessly with the task of spotting anomalies on dynamic graphs, which has broad applications in many high-impact domains, such as finance, cybersecurity, and healthcare. Experiments demonstrate that CLDG and CLDG++ both exhibit desirable performance in downstream tasks including node classification and dynamic graph anomaly detection. Moreover, CLDG significantly reduces time and space complexity by implicitly exploiting temporal cues instead of complicated sequence models.
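
Contrasting the same node across two timespans can be sketched with an InfoNCE-style loss. The embeddings are treated as given lists of floats, and the loss form, temperature, and function names are assumptions for illustration. The abstract does not state CLDG's encoder or exact objective.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def timespan_infonce(view_a, view_b, temperature=0.5):
    """InfoNCE over two timespan views: node i in view_a should match node i in view_b."""
    loss = 0.0
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        log_denominator = math.log(sum(math.exp(value) for value in logits))
        loss += log_denominator - logits[i]  # negative log-softmax of the positive pair
    return loss / len(view_a)


view_t1 = [[1.0, 0.0], [0.0, 1.0]]           # two nodes embedded from one timespan
consistent_t2 = [[1.0, 0.0], [0.0, 1.0]]     # same nodes, unchanged in another timespan
swapped_t2 = [[0.0, 1.0], [1.0, 0.0]]        # same nodes, embeddings swapped
print(timespan_infonce(view_t1, consistent_t2))
print(timespan_infonce(view_t1, swapped_t2))
```

The loss is low when each node stays close to itself across timespans and high when it does not. The same cross-timespan consistency is what the abstract reuses as an anomaly indicator.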

2605.27062 2026-05-27 cs.CL cs.LG

FalAR: A Large-scale Speaker-Annotated European Portuguese Speech Corpus of Parliamentary Sessions

FalAR: 一个大规模说话人标注的欧洲葡萄牙语议会会议语音语料库

Francisco Teixeira, Carlos Carvalho, Mariana Julião, Catarina Botelho, Rubén Solera-Ureña, Sérgio Paulo, Thomas Rolland, Ben Peters, Isabel Trancoso, Alberto Abad

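AI summary: To address the shortage of European Portuguese speech resources, builds the FalAR corpus of 5,800 hours of parliamentary session speech with speaker annotations; experiments show that using it as pre-training data reduces ASR word error rate by up to 14% relative.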

详情
Comments
Published in LREC2026
AI中文摘要

自动语音识别(ASR)的最先进性能在很大程度上依赖于大规模标注语料库的可用性。这增加了数据收集工作的需求,特别是对于代表性不足的语言和方言变体。由于欧洲葡萄牙语(EP)的说话人数量较少(约1100万),在目前可用的大规模语音数据资源中,它被巴西葡萄牙语(BP)(约2亿说话人)所掩盖,导致EP用户的语音系统性能不佳。为了弥补这一差距,并遵循其他语言的类似数据收集工作,我们提出了FalAR,一个大规模、说话人标注的欧洲葡萄牙语议会会议语音语料库。FalAR涵盖约20年,包含5800小时的语音数据。此外,4850小时具有说话人身份标注,总共1180个说话人,附带元数据包括年龄、性别、政治派别和议会角色。该语料库使用最先进的EP CAMÕES ASR模型进行转录参考对齐。在本文中,我们描述了数据收集过程以及FalAR语料库的主要特征。此外,我们评估了数据量和对齐准确性对ASR性能的权衡,实验表明,将FalAR作为预训练数据可以使基线模型的词错误率相对降低高达14%。

英文摘要

State-of-the-art performance for Automatic Speech Recognition (ASR) largely depends on the availability of large-scale labeled corpora. This creates a demand for increased data collection efforts, particularly for under-represented languages and dialectal varieties. Due to having considerably fewer speakers (around 11 million), European Portuguese (EP) is overshadowed by Brazilian Portuguese (BP) (around 200 million speakers) in currently available large-scale speech data resources, resulting in under-performing speech-based systems for EP users. To address this gap, and following similar data collection efforts for other languages, we present FalAR, a large-scale, speaker-annotated speech corpus of European Portuguese parliamentary sessions. Spanning approximately 20 years, FalAR comprises 5,800 hours of speech data. In addition, 4,850 hours have speaker identity annotations, for a total of 1,180 speakers with associated metadata including age, gender, political affiliation, and parliamentary role. The corpus was built using a state-of-the-art EP CAMÕES ASR model for transcription-reference alignment. In this paper, we describe the data collection process, together with the main characteristics of the FalAR corpus. Furthermore, we evaluate the trade-off between data quantity and alignment accuracy on ASR performance, with our experiments demonstrating that incorporating FalAR as pre-training data yields up to 14% relative WER improvement over baseline models.

2605.27051 2026-05-27 cs.SE cs.AI

ConVer: Using Contracts and Loop Invariant Synthesis for Scalable Formal Software Verification

ConVer:使用合约和循环不变式合成实现可扩展的形式化软件验证

Muhammad A. A. Pirzada, Weiqi Wang, Yiannis Charalambous, Konstantin Korovin, Lucas C. Cordeiro

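AI summary: Proposes ConVer, a top-down compositional verification tool that uses a large language model to synthesize function contracts and refines them iteratively in a CEGAR-CEGIS loop, addressing state-space explosion in the formal verification of large C programs.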

详情
Comments
12 pages; 6 figures
AI中文摘要

大型C程序的形式化验证受到状态空间爆炸的阻碍:有界模型检验(BMC)工具必须通过展开所有嵌套结构来编码整个状态空间直至预定边界。我们提出了ConVer,一种自上而下的组合验证工具。给定一个带有顶层断言的C程序,ConVer自上而下地分解验证:它使用大语言模型(LLM)从系统属性中合成函数合约,然后在CEGAR-CEGIS循环中交替进行系统级和函数级检查,每当检查失败时通过SMART ICE学习精炼合约。我们在四个难度递增的基准测试套件上评估了ConVer,并与其他最先进(SOTA)工具进行了比较。在包含45个简单C程序的Frama-C基准测试中,ConVer在三个LLM后端上实现了82-96%的验证成功率,其中93-95%的收敛程序仅需一次CEGAR-CEGIS迭代。在X.509解析器基准测试(6个程序)和LF2C-Simple套件(17个程序)上,ConVer分别实现了33-50%和82-88%的成功率。在包含11个递归和循环密集型程序的VerifyThis套件上,预抽象策略实现了55-64%的成功率。此外,我们提出了ESBMC-LF,一个预处理工具,它将LF模型转换为C语言,同时保留LF文件的属性,使ConVer能够验证它们。我们使用ESBMC-LF将LF验证器基准测试转换为C语言;我们将这些称为LF-Hard。我们表明,ConVer总体上成功验证了67%的LF-Hard基准测试。

英文摘要

Formal verification of large C programs is impeded by state-space explosion: Bounded Model Checking (BMC) tools must encode the entire state space up to the predetermined bound by unrolling all nested constructs. We present ConVer, a top-down compositional verification tool. Given a C program with a top-level assertion, ConVer decomposes verification top-down: it uses a large language model (LLM) to synthesise function contracts from the system property, then alternates system-level and function-level checks in a CEGAR-CEGIS loop, refining contracts via SMART ICE learning whenever a check fails. We evaluate ConVer on four benchmark suites of increasing difficulty and against other state-of-the-art (SOTA) tools. On the Frama-C benchmark of 45 simple C programs, ConVer achieves 82-96% verification success across three LLM backends, with 93-95% of converged programs requiring only a single CEGAR-CEGIS iteration. On the X.509 parser benchmark (6 programs) and the LF2C-Simple suite (17 programs), ConVer achieves 33-50% and 82-88% success, respectively. On the VerifyThis suite of 11 recursive and loop-intensive programs, the Pre-Abstraction strategy achieves 55-64% success. In addition, we present ESBMC-LF, a preprocessor tool that converts LF models to C while preserving the properties of the LF files, enabling ConVer to verify them. We transpile the LF Verifier Benchmarks to C using ESBMC-LF and refer to the resulting suite as LF-Hard. We show that ConVer successfully verifies 67% of the LF-Hard benchmarks overall.

2605.27050 2026-05-27 cs.CL cs.LG

BhashaSetu: A Data-Centric Approach to Low-Resource Machine Translation

BhashaSetu:一种以数据为中心的低资源机器翻译方法

Param Thakkar, Anushka Yadav, Michael Tiemann, Abhi Mehta, Akshita Bhasin, Shrinivas Khedkar

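AI summary: Presents the BhashaSetu dataset, a large-scale, multi-domain, morphology-aware English-Marathi parallel corpus, and shows that corpus-level deduplication has a key impact on low-resource neural machine translation quality.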

详情
AI中文摘要

我们提出了BhashaSetu,一个语言丰富的英语-马拉地语平行数据集,解决了低资源神经机器翻译(NMT)中持续存在的数据限制问题。马拉地语有超过9500万使用者,但在不同领域的高质量平行语料库中仍然代表性不足。我们的数据集包含来自新闻、政治、医疗、文学和文化等异构来源的278万个句子对,并提供了词干化和词形还原表示以支持形态感知分析。我们使用BLEU、spBLEU、chrF++和TER指标对多个最先进的翻译模型进行了基准测试,并使用LoRA对NLLB-200-distilled-600M进行了参数高效微调。我们消融实验的一个关键发现是:语料库级去重是预处理中对下游质量贡献最大的单一因素(去除它会使性能降低1.17 BLEU和2.21 chrF++),这表明对于低资源、形态丰富的语言,有纪律的跨源语料库卫生是一种低成本、高影响力的干预措施。该数据集已公开发布,以促进可重复且语言信息丰富的低资源NMT研究。

英文摘要

We present BhashaSetu, a linguistically enriched English-Marathi parallel dataset addressing persistent data limitations in low-resource neural machine translation (NMT). Marathi, spoken by over 95 million people, remains underrepresented in high-quality parallel corpora across diverse domains. Our dataset comprises 2.78 million sentence pairs from heterogeneous sources including news, politics, healthcare, literature, and culture, with stemmed and lemmatized representations to support morphology-aware analysis. We benchmark multiple state-of-the-art translation models using BLEU, spBLEU, chrF++, and TER metrics, and conduct parameter-efficient fine-tuning of NLLB-200-distilled-600M using LoRA. A key finding from our ablation: corpus-level deduplication is the single largest preprocessing contributor to downstream quality (removing it reduces performance by 1.17 BLEU and 2.21 chrF++), demonstrating that disciplined cross-source corpus hygiene is a low-cost, high-impact intervention for low-resource, morphologically rich languages. The dataset is publicly released to promote reproducible and linguistically informed low-resource NMT research.
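
Corpus-level deduplication across sources can be sketched as exact matching on normalized sentence pairs. The normalization choices below (Unicode NFKC, case folding, whitespace collapsing) and the keep-first policy are assumptions for illustration. The abstract does not describe BhashaSetu's actual preprocessing rules.

```python
import unicodedata


def normalize(text):
    """Canonicalize a sentence so that trivial variants compare equal."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(folded.split())  # collapse runs of whitespace


def deduplicate(pairs):
    """Keep the first occurrence of each normalized (source, target) pair across all sources."""
    seen = set()
    unique = []
    for source, target in pairs:
        key = (normalize(source), normalize(target))
        if key not in seen:
            seen.add(key)
            unique.append((source, target))
    return unique


pairs = [
    ("Hello world", "नमस्कार जग"),
    ("hello   world ", "नमस्कार जग"),   # same pair from another source, different casing and spacing
    ("Good morning", "शुभ प्रभात"),
]
print(len(deduplicate(pairs)))
```

Running the pass over the merged corpus rather than per source is what catches pairs that appear in more than one of the heterogeneous sources.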