arXivDaily: arXiv daily academic digest, updated Monday through Friday
2604.17621 2026-06-02 cs.AI

KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models

Xiao Zhang, Qianru Meng, Yongjian Chen, Yumeng Wang, Johan Bos

Affiliations: University of Groningen; LIACS, Leiden University

AI summary: Proposes KnowledgeBerg, a benchmark of 4,800 multiple-choice questions that evaluates systematic knowledge coverage and compositional reasoning in large language models along two dimensions, knowledge width and reasoning depth, and finds severe shortcomings in current models.

Comments: ACL Findings

Abstract

Many real-world questions appear deceptively simple yet implicitly demand two capabilities: (i) systematic coverage of a bounded knowledge universe and (ii) compositional set-based reasoning over that universe, a phenomenon we term "the tip of the iceberg." We formalize this challenge through two orthogonal dimensions: knowledge width, the cardinality of the required universe, and reasoning depth, the number of compositional set operations. We introduce KnowledgeBerg, a benchmark of 4,800 multiple-choice questions derived from 1,183 enumeration seeds spanning 10 domains and 17 languages, with universes grounded in authoritative sources to ensure reproducibility. Representative open-source LLMs demonstrate severe limitations, achieving only 5.26-36.88 F1 on universe enumeration and 16.00-44.19 accuracy on knowledge-grounded reasoning. Diagnostic analyses reveal three stages of failure: completeness, or missing knowledge; awareness, or failure to identify requirements; and application, or incorrect reasoning execution. This pattern persists across languages and model scales. Although test-time compute and retrieval augmentation yield measurable gains -- up to 4.35 and 3.78 points, respectively -- substantial gaps remain, exposing limitations in how current LLMs organize structured knowledge and execute compositional reasoning over bounded domains. The dataset is available at https://huggingface.co/datasets/2npc/KnowledgeBerg
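
The two quantities named in this abstract, set-level F1 on universe enumeration and reasoning by composed set operations, can be illustrated with a minimal Python sketch. The toy universe, the attribute sets, and the exact F1 definition below are assumptions made for illustration; the abstract does not state how the benchmark computes its scores.

```python
def enumeration_f1(predicted, gold):
    """Set-level F1 between a predicted enumeration and the gold universe."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical bounded universe (knowledge width = 5) and two attribute sets.
universe = {"A", "B", "C", "D", "E"}
has_property_p = {"A", "B", "C"}
has_property_q = {"B", "C", "D"}

# Reasoning depth 2: an intersection followed by a complement in the universe.
answer = universe - (has_property_p & has_property_q)

# A hypothetical model enumeration that misses D and E and hallucinates X.
model_enumeration = {"A", "B", "C", "X"}
score = enumeration_f1(model_enumeration, universe)
```

Under these assumptions, an incomplete enumeration lowers recall, and any set operation computed over it inherits the gap, which is one way to read the "completeness" failure stage.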

2604.17456 2026-06-02 cs.AI

TrafficClaw: A Generalizable LLM Agent in the Unified Physical Environment for Urban Traffic Control

Siqi Lai, Pan Zhang, Yuping Zhou, Jindong Han, Yansong Ning, Hao Liu

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Shandong University

AI summary: Proposes TrafficClaw, a generalizable LLM-based traffic control agent that achieves coordinated optimization across subsystems through a unified physical environment, executable spatiotemporal reasoning, and multi-stage agentic reinforcement learning.

Abstract

Large language model (LLM) agents have shown strong capabilities in long-horizon reasoning, tool use, and decision-making in digital environments, yet extending them to physically grounded systems remains challenging. Unlike web, code, or game environments, where objectives are often weakly coupled, physical systems evolve through tightly coupled dynamics in which local interventions propagate across interacting subsystems over time. Urban traffic control exemplifies this challenge, as traffic signals, freeways, public transit, and taxi systems continuously interact through shared spatial infrastructure and temporal mobility demand. Existing optimization, reinforcement learning (RL), and LLM-based approaches are largely designed for isolated subsystems, limiting coordinated reasoning and system-level optimization. We propose TrafficClaw, an LLM-based generalizable traffic control agent for physical urban systems. TrafficClaw operates within a unified traffic environment that exposes coupled urban dynamics and feedback, performs executable spatiotemporal reasoning with persistent memory for long-horizon adaptation, and leverages multi-stage agentic RL for coordinated system-level optimization. Experiments across three metropolitan regions and six traffic-control tasks demonstrate strong generalization, robustness, and cross-subsystem coordination. Our project is available at https://github.com/usail-hkust/TrafficClaw.

2601.14750 2026-06-02 cs.CL cs.CV

Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning

Yifan Wang, Shiyu Li, Peiming Li, Xiaochen Yang, Yang Tang, Zheng Wei

Affiliations: Tencent BAC; Shenzhen International Graduate School, Tsinghua University; School of Electronic and Computer Engineering, Peking University; School of Mathematics and Statistics, University of Glasgow

AI summary: Proposes the Render-of-Thought framework, which renders the textual steps of a chain of thought as images and uses the vision encoder of a vision-language model for semantic alignment, achieving 3-4x token compression and faster inference while maintaining competitive performance.

Comments: Accepted by ACL 2026 Main Conference

Abstract

Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). Although CoT prompting enhances reasoning, its verbosity imposes substantial computational overhead. Recent works often focus exclusively on outcome alignment and lack supervision on the intermediate reasoning process. These deficiencies obscure the analyzability of the latent reasoning chain. To address these challenges, we introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual steps into images, making the latent rationale explicit and traceable. Specifically, we leverage the vision encoders of existing Vision Language Models (VLMs) as semantic anchors to align the vision embeddings with the textual space. This design ensures plug-and-play implementation without incurring additional pre-training overhead. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that our method achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT. Furthermore, it maintains competitive performance against other methods, validating the feasibility of this paradigm. Our code is available at https://github.com/TencentBAC/RoT

2507.05179 2026-06-02 cs.CL

From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations

Pulkit Bansal, Raghvendra Kumar, Shakti Singh, Adam Jatowt, Sriparna Saha

Affiliations: TCS Research; Department of Computer Science and Engineering, Indian Institute of Technology Patna; Indian Institute of Technology Patna; University of Innsbruck

AI summary: Targeting misinformation detection for low-resource languages such as Hindi, proposes a framework that combines Direct Preference Optimization (DPO) with curriculum learning and introduces the Actuality and Finesse parameters to improve explanation quality; experiments confirm its effectiveness in generating coherent, relevant explanations.

Comments: Accepted at ACL 2026 Findings

Abstract

In an era of rampant misinformation, generating reliable news explanations is vital, especially for under-represented languages like Hindi. Lacking robust automated tools, Hindi faces challenges in scaling misinformation detection. To bridge this gap, we propose a novel framework integrating Direct Preference Optimization (DPO) with curriculum learning to align machine-generated explanations with human reasoning. Fact-checked explanations from credible sources serve as preferred responses, while LLM outputs highlight system limitations and serve as non-preferred responses. To refine task-specific alignment, we introduce two key parameters -- Actuality and Finesse -- into the DPO loss function, enhancing explanation quality and consistency. Experiments with LLMs (Mistral, Llama, Gemma) and PLMs (mBART, mT5) confirm the framework's effectiveness in generating coherent, contextually relevant explanations. This scalable approach combats misinformation and extends automated explanation generation to low-resource languages.

2604.17007 2026-06-02 cs.CV cs.AI

MobileAgeNet: Lightweight Facial Age Estimation for Mobile Deployment

Arun Kumar, Aswathy Baiju, Radu Timofte, Dmitry Ignatov

Affiliations: Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany

AI summary: Proposes MobileAgeNet, a lightweight age-regression framework built on MobileNetV3-Large that uses two-stage fine-tuning and bounded regression to reach an MAE of 4.65 years on the UTKFace test set, with 14.4 ms on-device latency and 3.23M parameters.

Comments: 9 pages including references, 3 figures

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3810-3818, 2026

Abstract

Mobile deployment of facial age estimation requires models that balance predictive accuracy with low latency and compact size. In this work, we present MobileAgeNet, a lightweight age-regression framework that achieves an MAE of 4.65 years on the UTKFace held-out test set while maintaining efficient on-device inference with an average latency of 14.4 ms measured using the AI Benchmark application. The model is built on a pretrained MobileNetV3-Large backbone combined with a compact regression head, enabling real-time prediction on mobile devices. The training and evaluation pipeline is integrated into the NN LEMUR Dataset framework, supporting reproducible experimentation, structured hyperparameter optimization, and consistent evaluation. We employ bounded age regression together with a two-stage fine-tuning strategy to improve training stability and generalization. Experimental results show that MobileAgeNet achieves competitive accuracy with 3.23M parameters, and that the deployment pipeline, from PyTorch training through ONNX export to TensorFlow Lite conversion, preserves predictive behavior without measurable degradation under practical on-device conditions. Overall, this work provides a practical, deployment-ready baseline for mobile-oriented facial age estimation.

2604.15231 2026-06-02 cs.AI

RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

Mélanie Roschewitz, Kenneth Styppa, Yitian Tao, Jiwoong Sohn, Jean-Benoit Delbrouck, Benjamin Gundersen, Nicolas Deperrois, Christian Bluethgen, Julia E. Vogt, Bjoern Menze, Farhad Nooralahzadeh, Michael Krauthammer, Michael Moor

Affiliations: Department of Biosystems Science and Engineering, ETH Zurich; ETH AI Center, Zurich; Department of Computer Science, ETH Zurich; Faculty of Computer Science and Mathematics, Heidelberg University; Stanford Center for Artificial Intelligence in Medicine and Imaging, Stanford University; Department of Radiology, Stanford University; Department of Quantitative Biomedicine, University of Zurich; Institute of Computer Science, Zurich University of Applied Sciences

AI summary: Proposes RadAgent, a tool-using AI agent that generates CT reports through a stepwise, interpretable process and outperforms its 3D VLM counterpart in clinical accuracy, robustness, and faithfulness.

Abstract

Vision-language models (VLMs) have markedly advanced AI-driven interpretation and reporting of complex medical imaging, such as computed tomography (CT). Yet, existing methods largely relegate clinicians to passive observers of final outputs, offering no interpretable reasoning trace for them to inspect, validate, or refine. To address this, we introduce RadAgent, a tool-using AI agent that generates CT reports through a stepwise and interpretable process. Each resulting report is accompanied by a fully inspectable trace of intermediate decisions and tool interactions, allowing clinicians to examine how the reported findings are derived. In our experiments, we observe that RadAgent improves chest CT report generation over its 3D VLM counterpart, CT-Chat, across three dimensions. Clinical accuracy improves by 5.8 points (35.4% relative) in macro-F1 and 5.1 points (18.6% relative) in micro-F1. Robustness under adversarial conditions improves by 24.7 points (41.9% relative). Furthermore, RadAgent achieves 37.0% in faithfulness, a new capability entirely absent in its 3D VLM counterpart. By structuring the interpretation of chest CT as an explicit, tool-augmented, and iterative reasoning trace, RadAgent brings us closer to transparent and reliable AI for radiology.

2601.02997 2026-06-02 cs.LG cs.CV

From Memorization to Creativity: LLM as a Designer of Novel Neural Architectures

Waleed Khalid, Dmitry Ignatov, Radu Timofte

Affiliations: Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany

AI summary: Proposes a closed-loop architecture synthesis pipeline within the NNGPT framework that runs supervised fine-tuning cycles on a code-oriented LLM, combined with MinHash-Jaccard novelty filtering and low-fidelity performance signals, to iteratively improve the validity, performance, and diversity of generated architectures, moving from memorization to creativity.

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3252-3261, 2026

Abstract

Large language models (LLMs) excel in program synthesis, yet their capacity for neural architecture design -- balancing syntactic reliability, performance, and structural novelty -- remains underexplored. We present a closed-loop architecture synthesis pipeline within the NNGPT framework, in which a code-oriented LLM evolves over 22 supervised fine-tuning cycles. At each cycle, the LLM synthesizes PyTorch convolutional networks, validated via low-fidelity performance signals and filtered via a MinHash--Jaccard criterion to prevent structural redundancy before being incorporated into the LEMUR dataset. High-performing candidates with novel architectures are converted into prompt--code pairs for parameter-efficient LoRA fine-tuning. This feedback loop drives a measurable distributional shift, progressively internalizing empirical architectural priors such that valid and high-performing outputs evolve from scarce to dominant across cycles. On CIFAR-10, the valid generation rate stabilizes at 50.6% (peaking at 74.5%), mean first-epoch accuracy rises from 28.1% to 51.0%, and candidates exceeding 40% accuracy grow from 2.0% to 96.8%. Cross-dataset transfer to CIFAR-100 and SVHN confirms that improved validity, shifted accuracy distributions, and sustained novelty generalize across benchmarks of varying difficulty and visual domain. Across 22 cycles, 455 unique architectures absent from the original corpus are admitted under the novelty filter. By grounding synthesis in execution feedback and novelty filtering, we demonstrate that iterative self-supervised fine-tuning reshapes an LLM into a task-specialized architectural prior -- improving generation reliability, proxy performance, and structural diversity -- offering a reproducible, annotation-free alternative to hand-crafted search spaces.
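
A MinHash-Jaccard novelty filter of the kind this abstract mentions can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the NNGPT implementation: the token-shingle size, the number of hash functions, the 0.8 similarity threshold, and the function names are all hypothetical.

```python
import hashlib


def shingles(code, k=3):
    """Set of k-token shingles from whitespace-tokenized source code."""
    tokens = code.split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()[:8],
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def is_novel(candidate, corpus, threshold=0.8):
    """Admit a candidate only if it stays below the threshold for every corpus entry."""
    candidate_sig = minhash_signature(shingles(candidate))
    return all(
        estimated_jaccard(candidate_sig, minhash_signature(shingles(known))) < threshold
        for known in corpus
    )


# Hypothetical architecture descriptions standing in for generated PyTorch code.
corpus = ["conv3x3 relu conv3x3 relu pool linear"]
repeat = "conv3x3 relu conv3x3 relu pool linear"
different = "depthwise conv gelu squeeze excite linear head"
```

In a real pipeline the corpus signatures would be computed once and cached rather than recomputed per candidate; the loop above favors brevity over speed.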

2512.24120 2026-06-02 cs.CV cs.AI

Enhancing LLM-Based Neural Network Generation: Few-Shot Prompting and Efficient Validation for Automated Architecture Design

Raghuvir Duvvuri, Chandini Vysyaraju, Avi Goyal, Dmitry Ignatov, Radu Timofte

Affiliations: Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany

AI summary: Proposes Few-Shot Architecture Prompting (FSAP) and Whitespace-Normalized Hash Validation to make LLM-based automated generation of computer vision architectures more efficient, and validates both through large-scale experiments.

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3242-3251, 2026

Abstract

Automated neural network architecture design remains a significant challenge in computer vision. Task diversity and computational constraints require both effective architectures and efficient search methods. Large Language Models (LLMs) present a promising alternative to computationally intensive Neural Architecture Search (NAS), but their application to architecture generation in computer vision has not been systematically studied, particularly regarding prompt engineering and validation strategies. Building on the task-agnostic NNGPT/LEMUR framework, this work introduces and validates two key contributions for computer vision. First, we present Few-Shot Architecture Prompting (FSAP), the first systematic study of the number of supporting examples (n = 1, 2, 3, 4, 5, 6) for LLM-based architecture generation. We find that using n = 3 examples best balances architectural diversity and context focus for vision tasks. Second, we introduce Whitespace-Normalized Hash Validation, a lightweight deduplication method (less than 1 ms) that provides a 100x speedup over AST parsing and prevents redundant training of duplicate computer vision architectures. In large-scale experiments across seven computer vision benchmarks (MNIST, CIFAR-10, CIFAR-100, CelebA, ImageNette, SVHN, Places365), we generated 1,900 unique architectures. We also introduce a dataset-balanced evaluation methodology to address the challenge of comparing architectures across heterogeneous vision tasks. These contributions provide actionable guidelines for LLM-based architecture search in computer vision and establish rigorous evaluation practices, making automated design more accessible to researchers with limited computational resources.
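
The idea behind whitespace-normalized hash deduplication can be sketched in a few lines of Python. The normalization rule (collapse every whitespace run to one space), the choice of SHA-256, and the class name are assumptions for illustration; the abstract does not specify them.

```python
import hashlib
import re


def whitespace_normalized_hash(source):
    """Hash source code after collapsing all whitespace runs to a single space."""
    normalized = re.sub(r"\s+", " ", source).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class DuplicateFilter:
    """Remembers hashes of code seen so far; hypothetical helper, not the paper's API."""

    def __init__(self):
        self._seen = set()

    def is_duplicate(self, source):
        digest = whitespace_normalized_hash(source)
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False


# Two hypothetical generations that differ only in spacing and indentation.
first = "def f(x):\n    return x + 1\n"
second = "def f(x):\n\treturn   x + 1"
```

Hashing avoids parsing, which is consistent with the speedup over AST parsing that the abstract reports. The trade-off under this particular normalization is that it ignores indentation, so two Python snippets whose meaning differs only by indentation would collide.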

2604.14698 2026-06-02 cs.LG

Mean Flow Policy Optimization

Xiaoyi Dong, Xi Sheryl Zhang, Jian Cheng

Affiliations: University of California, Berkeley

AI summary: Proposes MeanFlow models as the policy representation, optimized by soft policy iteration under the maximum entropy RL framework, to improve training and inference efficiency in online RL; experiments show performance comparable to or better than diffusion-based baselines with considerably less training and inference time.

Comments: ICML 2026

Abstract

Diffusion models have recently emerged as expressive policy representations for online reinforcement learning (RL). However, their iterative generative processes introduce substantial training and inference overhead. To overcome this limitation, we propose to represent policies using MeanFlow models, a class of few-step flow-based generative models, to improve training and inference efficiency over diffusion-based RL approaches. To promote exploration, we optimize MeanFlow policies under the maximum entropy RL framework via soft policy iteration, and address two key challenges specific to MeanFlow policies: action likelihood evaluation and soft policy improvement. Experiments on MuJoCo, DeepMind Control Suite and HumanoidBench benchmarks demonstrate that our method, Mean Flow Policy Optimization (MFPO), achieves performance comparable to or exceeding current diffusion-based baselines while considerably reducing training and inference time. Our code is available at https://github.com/dongxiaoyi-xyz/MFPO.

2604.14514 2026-06-02 cs.AI cs.CE

Perspective on Bias in Biomedical AI: Preventing Downstream Healthcare Disparities

Michal Rosen-Zvi, Yoav Kan-Tor, Michael Danziger, Agata Ferretti, Javier Aula-Blasco, Julia Falcao, Ron Shamir, Mira Marcus-Kalish, Mordechai Muszkat

Affiliations: Weizmann Institute of Science; Hebrew University of Jerusalem

AI summary: Analyzes 4,514 omics publications from 2015 to 2024 together with large-scale datasets, reveals severe population bias in data collection and research, and proposes three principles (Provenance, Openness, and Reliability through Evaluation Transparency) for preventing downstream healthcare disparities.

Comments: This manuscript has been accepted for publication in the 2026 IEEE International Conference on Digital Health (ICDH). The final version will appear in IEEE Xplore

Abstract

Healthcare disparities persist across socioeconomic boundaries, often attributed to unequal access to screening, diagnostics, and therapeutics. However, this perspective highlights that critical biases can emerge much earlier, during data collection and research prioritization, long before clinical implementation, particularly in studies focused on molecular and omics data. A vast number of studies focus on collecting omics data, but the demographic information associated with these datasets is often not reported, and when it is reported, it reveals substantial biases. An automated analysis of 4,514 PubMed-indexed omics publications from 2015 to 2024, examining reporting across multiple demographic dimensions, reveals limited reporting overall; for example, only 2.7% of studies report ancestry or ethnicity information, and geographic origin reporting is limited to 2.5%. Analysis of large-scale datasets commonly used for model training, such as CellxGene and GEO, reveals substantial population bias in which European-ancestry data dominates. As biomedical foundation models become central to biomedical discovery, under a paradigm in which base models are pretrained on large datasets and reused repeatedly for many different downstream tasks, they risk perpetuating or amplifying these early-stage biases, leading to cascading inequities that regulatory interventions cannot fully reverse. We propose a community-wide focus on three foundational principles: Provenance, Openness, and Reliability through Evaluation Transparency. Together, these principles can help make biases and limitations more visible to model developers and users, supporting more informed model development, evaluation, and deployment decisions in biomedical AI.

2604.14344 2026-06-02 cs.RO

CART: Context-Aware Terrain Adaptation using Temporal Sequence Selection for Legged Robots

Kartikeya Singh, Youngjin Kim, Yash Turkar, Karthik Dantu

Affiliations: DRONES LAB, University at Buffalo, NY, USA

AI summary: Proposes CART, a high-level controller that fuses contextual information from proprioception and exteroception to help legged robots walk stably on complex terrain, raising the average success rate by 5% in simulation and lowering base oscillation by up to 41% in simulation and 22% in the real world.

Abstract

Animals in nature combine multiple modalities, such as sight and feel, to perceive terrain and develop an understanding of how to walk on uneven terrain in an efficient manner. Similarly, legged robots need to develop their ability to stably walk on complex terrains by developing an understanding of the relationship between vision and proprioception. Most current terrain-adaptation methods remain susceptible to failure on complex off-road terrain because they do not explicitly model the context between exteroceptive terrain appearance and proprioceptive physical interaction. This experience-based learning often creates a Visual-Texture Paradox between what has been seen and how it actually feels. In this work, we introduce CART, a high-level controller built on a context-aware terrain adaptation approach that integrates proprioception and exteroception from onboard sensing to achieve a robust understanding of terrain. We evaluate our method on multiple terrains using the Unitree Go2 and ANYmal-C robots in the IsaacSim simulator and a Boston Dynamics SPOT robot for our real-world experiments. To evaluate whether the learned context improves locomotion behavior under the various paradox circumstances, we measure the robot's stability, traversal success, and task completion time in both simulation and real-world experiments. We compare CART against state-of-the-art locomotion and terrain-adaptation baselines across diverse terrain conditions. CART improves the average success rate by 5% over the baselines in simulation, while improving context-conditioned locomotion behavior, including up to 41% lower base oscillation in simulation and 22% in the real world, without increasing the time required to complete the locomotion tasks.

2604.03588 2026-06-02 cs.AI

Rashomon Memory: Towards Argumentation-Driven Retrieval for Multi-Perspective Agent Memory

Albert Sadowski, Jarosław A. Chudziak

Affiliations: Warsaw University of Technology

AI summary: Proposes the Rashomon Memory architecture, in which parallel goal-conditioned agents encode experiences according to their own priorities and negotiate through argumentation at query time, using Dung's argumentation semantics to select interpretations and supporting a conflict surfacing mode.

Comments: Presented at EXTRAAMAS workshop at AAMAS 2026

Abstract

AI agents operating over extended time horizons accumulate experiences that serve multiple concurrent goals, and must often maintain conflicting interpretations of the same events. A concession during a client negotiation encodes as a "trust-building investment" for one strategic goal and a "contractual liability" for another. Current memory architectures assume a single correct encoding, or at best support multiple views over unified storage. We propose Rashomon Memory: an architecture where parallel goal-conditioned agents encode experiences according to their priorities and negotiate at query time through argumentation. Each perspective maintains its own ontology and knowledge graph. At retrieval, perspectives propose interpretations, critique each other's proposals using asymmetric domain knowledge, and Dung's argumentation semantics determines which proposals survive. The resulting attack graph is itself an explanation: it records which interpretation was selected, which alternatives were considered, and on what grounds they were rejected. We present a proof-of-concept showing that retrieval modes (selection, composition, conflict surfacing) emerge from attack graph topology, and that the conflict surfacing mode, where the system reports genuine disagreement rather than forcing resolution, lets decision-makers see the underlying interpretive conflict directly.
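
How an attack graph can decide which proposals survive is illustrated below with the grounded extension of a Dung argumentation framework, computed as the least fixed point of the characteristic function. The abstract does not say which Dung semantics the system uses, so the choice of grounded semantics is an assumption, as are the argument names and the reading of an empty extension as a surfaced conflict.

```python
def grounded_extension(arguments, attacks):
    """Least fixed point of F(S) = {a : every attacker of a is attacked by S}."""
    attackers = {argument: set() for argument in arguments}
    for source, target in attacks:
        attackers[target].add(source)

    def defended_by(accepted):
        # Arguments whose attackers are all counter-attacked by `accepted`.
        counter_attacked = {t for (s, t) in attacks if s in accepted}
        return {a for a in arguments if attackers[a] <= counter_attacked}

    extension = set()
    while True:
        next_extension = defended_by(extension)
        if next_extension == extension:
            return extension
        extension = next_extension


# Hypothetical chain: a critique defeats "liability", which reinstates "trust".
chain = grounded_extension(
    {"critique", "liability", "trust"},
    {("critique", "liability"), ("liability", "trust")},
)

# Hypothetical mutual attack: neither interpretation can be defended.
standoff = grounded_extension(
    {"liability", "trust"},
    {("liability", "trust"), ("trust", "liability")},
)
```

In the chain, a single interpretation set survives, which resembles the selection mode. In the mutual attack, the grounded extension is empty, which is one plausible trigger for reporting the disagreement instead of forcing a resolution.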

2603.18373 2026-06-02 cs.CV cs.AI

To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs

Rui Hong, Shuxue Quan

Affiliations: George Mason University; Independent Researcher

AI summary: Proposes a three-layer diagnostic framework and, through counterfactual intervention experiments, finds that Visual Sycophancy (internal evidence is preserved but a hallucinated answer is decoded) is widespread in vision-language models and that scaling alone does not resolve it.

Comments: 14 pages, 1 figure

Abstract

When VLMs answer correctly, do they genuinely rely on visual information? We introduce a Tri-Layer Diagnostic Framework with three per-sample metrics: Latent Anomaly Detection, Visual Necessity Score, and Competition Score, which disentangle perception, dependency, and alignment failures. Across 9 VLMs and 9,000 model-sample pairs under counterfactual blind, noise, and conflict interventions, 72.9% of samples exhibit Visual Sycophancy, a Split Beliefs pattern in which internal evidence is preserved yet a hallucinated answer is decoded, while zero samples show Robust Refusal, indicating that current alignment training has eliminated refusal as a decoding outcome. Scaling within the Qwen-VL family, both within- and across-generation, monotonically reduces Language Shortcuts but amplifies Visual Sycophancy, showing that scale and newer post-training alone cannot resolve the grounding problem. Diagnostic scores further enable a training-free selective-prediction strategy yielding up to +9.5 percentage points accuracy at 50% coverage.
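
Selective prediction at a fixed coverage, the evaluation setting named at the end of this abstract, can be sketched as follows. The confidence values are hypothetical stand-ins; the abstract does not say how the three diagnostic scores are combined into a ranking signal.

```python
def selective_accuracy(confidences, correct, coverage=0.5):
    """Accuracy on the `coverage` fraction of samples with the highest confidence."""
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)
    kept = ranked[:max(1, int(coverage * len(ranked)))]
    return sum(is_correct for _, is_correct in kept) / len(kept)


# Hypothetical per-sample confidences and correctness labels.
confidences = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
correct = [True, True, True, False, True, False, False, False]

full_accuracy = selective_accuracy(confidences, correct, coverage=1.0)
half_coverage_accuracy = selective_accuracy(confidences, correct, coverage=0.5)
```

The model answers only the samples it ranks highest and abstains on the rest, so accuracy at 50% coverage exceeds full-coverage accuracy whenever the ranking signal correlates with correctness.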

2604.12792 2026-06-02 cs.RO

Actuation space reduction to facilitate insightful shape matching in a novel reconfigurable tendon driven continuum manipulator

Sabyasachi Dash, John Golden, Girish Krishnan

Affiliations: Department of Industrial and Enterprise Systems Engineering, University of Illinois Urbana-Champaign; Department of Mechanical Science and Engineering, University of Illinois, Urbana-Champaign

AI summary: Proposes a continuum manipulator design that reroutes tendon paths by rotating spacer disks and uses an intermediate curvature-torsion space to simplify the mapping from a desired backbone curve to actuator inputs, enabling a model-free, sequential shape-matching strategy.

Abstract

In tendon driven continuum manipulators (TDCMs), reconfiguring the tendon routing enables tailored spatial deformation of the backbone. This work presents a design in which tendons can be rerouted either prior to or after actuation by actively rotating the individual spacer disks. Each disk rotation thus adds a degree of freedom to the actuation space, complicating the mapping from a desired backbone curve to the corresponding actuator inputs. However, when the backbone shape is projected into an intermediate space defined by curvature and torsion (C-T), patterns emerge that highlight which disks are most influential in achieving a global shape. This insight enables a simplified, sequential shape-matching strategy: first, the proximal and intermediate disks are rotated to approximate the global shape; then, the distal disks are adjusted to fine-tune the end-effector position with minimal impact on the overall shape. The proposed actuation framework offers a model-free alternative to conventional control approaches, bypassing the complexities of modeling reconfigurable TDCMs.

2604.11424 2026-06-02 cs.CL

Bridging What the Model Thinks and How It Speaks: Expressive Speech Generation via Self-Aware Intent-Realization Alignment

弥合模型所想与所言之间的鸿沟:通过自我意识意图-实现对齐的富有表现力的语音生成

Kuang Wang, Lai Wei, Ping Lin, Qibing Bai, Wenkai Fang, Li Zhou, Feng Jiang, Zhongjie Jiang, Jun Huang, Yannan Wang, Haizhou Li

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Tencent Ethereal Audio Lab(腾讯天籁实验室) Southeast University(东南大学) Zhejiang University(浙江大学) Shenzhen University of Advanced Technology(深圳理工大学)

AI总结 提出SASLM框架,通过自蒸馏意图和自奖励优化,无需外部标注即可弥合语义理解与声学实现之间的差距,实现富有表现力的语音生成。

Comments Submitted to EMNLP 2026. Project page: https://wangkevin02.github.io/SASLM/

详情
AI中文摘要

语音语言模型(SLM)表现出强大的语义理解能力,但往往无法将这种能力转化为富有表现力的声学实现,生成的语音韵律平淡且情感错位。我们将这种不匹配识别为语义理解-声学实现差距。现有方法通常依赖外部指定的代理,如情感标签或风格提示,这些需要标注且难以捕捉对话中动态变化的表达意图。为克服这些限制,我们提出SASLM(自我意识语音语言模型),一种无代理框架,通过自我意识意图-实现对齐弥合模型所想与所言之间的鸿沟:(1)意图感知桥接通过变分信息瓶颈(VIB)从模型自身的演化语义生成状态中自蒸馏表达意图,从而在无外部表达监督下指导表达性语音实现;(2)实现感知对齐通过自奖励优化反思性地将生成的声学与预期表达对齐,在语音生成过程中逐步提高意图-实现一致性。尽管仅使用3B参数和800小时表达性语音数据,SASLM在EchoMind上达到了开源系统中的最先进性能,超越了规模大10倍以上的模型,并接近商业系统。

英文摘要

Speech Language Models (SLMs) exhibit strong semantic understanding, yet often fail to translate this capacity into expressive acoustic realization, producing speech with flattened prosody and misaligned emotion. We identify this mismatch as the semantic understanding-acoustic realization gap. Existing approaches typically rely on externally specified proxies, such as emotion labels or style prompts, which require annotations and struggle to capture dynamically evolving expressive intent throughout dialogue. To overcome these limitations, we propose SASLM (Self-Aware Speech Language Model), a proxy-free framework that bridges what the model thinks and how it speaks through self-aware intent-realization alignment: (1) Intent-Aware Bridging self-distills expressive intent from the model's own evolving semantic generation states via a Variational Information Bottleneck (VIB), thereby guiding expressive speech realization without external expressive supervision; while (2) Realization-Aware Alignment reflectively aligns generated acoustics with intended expression through self-reward optimization, progressively improving intent-realization consistency during speech generation. Despite using only 3B parameters and 800 hours of expressive speech data, SASLM achieves state-of-the-art performance on EchoMind among open-source systems, surpassing models over 10 times larger and approaching commercial systems.

2604.11283 2026-06-02 cs.CV

Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey

多模态大语言模型驱动的视频翻译:面向角色的综述

Bingzheng Qu, Kehai Chen, Xuefeng Bai, Min Zhang

发表机构 * School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)计算机科学与技术学院)

AI总结 本文通过面向角色的分类法,系统综述了多模态大语言模型在视频翻译中的应用,将其分为语义推理器、表达执行器和视觉合成器三个功能角色,并总结了数据集、基准和评估指标,指出了端到端视频翻译的挑战与未来方向。

详情
AI中文摘要

多模态大语言模型(MLLMs)的最新进展正在将视频翻译从自动语音识别、机器翻译、文本到语音和唇形同步的级联管道重塑为统一的多模态推理和生成问题。高质量的视频翻译不仅需要语义保真度,还需要跨视觉、听觉和语言流的时间对齐、说话者一致性和情感表现力。本综述通过面向角色的分类法,对MLLM驱动的视频翻译进行了重点回顾。我们将MLLM驱动和MLLM相关的研究组织为三个功能角色:语义推理器,将翻译基于视频理解、时间推理和多模态融合;表达执行器,支持可控和上下文感知的语音生成;视觉合成器,实现唇形同步和视觉连贯的说话者渲染。我们进一步总结了每个角色的代表性数据集、基准和指标,并讨论了当前评估协议如何未能满足端到端视频翻译的要求。最后,我们指出了长视频理解、时间建模、多模态对齐、多语言鲁棒性和负责任部署方面的开放挑战,为自然和可信的跨语言视频通信勾勒了未来方向。

英文摘要

Recent progress in multimodal large language models (MLLMs) is reshaping video translation from a cascaded pipeline of automatic speech recognition, machine translation, text-to-speech, and lip synchronization into a unified multimodal reasoning and generation problem. High-quality video translation requires not only semantic fidelity, but also temporal alignment, speaker consistency, and emotional expressiveness across visual, acoustic, and linguistic streams. This survey provides a focused review of MLLM-enabled video translation through a role-oriented taxonomy. We organize MLLM-enabled and MLLM-relevant studies into three functional roles: Semantic Reasoner, which grounds translation in video understanding, temporal reasoning, and multimodal fusion; Expressive Performer, which supports controllable and context-aware speech generation; and Visual Synthesizer, which enables lip synchronization and visually coherent speaker rendering. We further summarize representative datasets, benchmarks, and metrics for each role, and discuss how current evaluation protocols fall short of end-to-end video translation requirements. Finally, we identify open challenges in long-form video understanding, temporal modeling, multimodal alignment, multilingual robustness, and responsible deployment, outlining future directions for natural and trustworthy cross-lingual video communication.

2604.10788 2026-06-02 cs.CL cs.AI

TInR: Exploring Tool-Internalized Reasoning in Large Language Models

TInR:探索大语言模型中的工具内化推理

Qiancheng Xu, Yongqi Li, Fan Liu, Hongru Wang, Min Yang, Wenjie Li

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Southeast University(东南大学) University of Edinburgh(爱丁堡大学) Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(中国科学院深圳先进技术研究院)

AI总结 本文提出TInR-U框架,通过工具内化、监督微调和强化学习三阶段训练,使LLM无需外部文档即可进行工具集成推理,在域内和域外设置中均取得优越性能。

Comments Accepted to ACL 2026

详情
AI中文摘要

工具集成推理(TIR)通过在推理过程中借助外部工具来扩展大语言模型(LLM)的能力,已成为一个有前景的方向。现有的TIR方法通常在推理过程中依赖外部工具文档。然而,这导致了工具掌握困难、工具规模受限和推理效率低下等问题。为了缓解这些问题,我们探索了工具内化推理(TInR),旨在利用内化到LLM中的工具知识进行推理。实现这一目标提出了若干重要要求,包括工具内化和工具-推理协调。为了满足这些要求,我们提出了TInR-U,一个用于统一推理和工具使用的工具内化推理框架。TInR-U通过三阶段流水线进行训练:1)使用双向知识对齐策略进行工具内化;2)使用高质量推理注释进行监督微调预热;3)使用TInR特定奖励进行强化学习。我们在域内和域外设置中全面评估了我们的方法。实验结果表明,TInR-U在两种设置下均实现了优越的性能,突显了其有效性和效率。

英文摘要

Tool-Integrated Reasoning (TIR) has emerged as a promising direction by extending Large Language Models' (LLMs) capabilities with external tools during reasoning. Existing TIR methods typically rely on external tool documentation during reasoning. However, this leads to tool mastery difficulty, tool size constraints, and inference inefficiency. To mitigate these issues, we explore Tool-Internalized Reasoning (TInR), aiming at facilitating reasoning with tool knowledge internalized into LLMs. Achieving this goal presents notable requirements, including tool internalization and tool-reasoning coordination. To address them, we propose TInR-U, a tool-internalized reasoning framework for unified reasoning and tool usage. TInR-U is trained through a three-phase pipeline: 1) tool internalization with a bidirectional knowledge alignment strategy; 2) supervised fine-tuning warm-up using high-quality reasoning annotations, and 3) reinforcement learning with TInR-specific rewards. We comprehensively evaluate our method across in-domain and out-of-domain settings. Experiment results show that TInR-U achieves superior performance in both settings, highlighting its effectiveness and efficiency.

2604.10688 2026-06-02 cs.LG cs.AI cs.CL

SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

SCOPE: 信号校准的在线策略蒸馏增强与双路径自适应加权

Binbin Zheng, Xing Ma, Yiheng Liang, Jingqing Ruan, Xiaoliang Fu, Kepeng Lin, Benchang Zhu, Ke Zeng, Xunliang Cai

发表机构 * University of Science and Technology of China(中国科学技术大学) Meituan LongCat Interaction Team(美团 LongCat 交互团队) Nanjing University(南京大学) Fudan University(复旦大学) Huazhong University of Science and Technology(华中科技大学)

AI总结 针对在线策略强化学习中奖励稀疏导致的信用分配难题,提出SCOPE框架,通过双路径自适应加权机制分别处理正确与错误轨迹,实现信号校准的蒸馏增强,在六个推理基准上平均提升11.42%的Avg@32和7.30%的Pass@32。

详情
AI中文摘要

在线策略强化学习已成为大型语言模型推理对齐的主导范式,但其稀疏的结果级奖励使得令牌级信用分配异常困难。在线策略蒸馏(OPD)通过引入来自教师模型的密集令牌级KL监督缓解了这一问题,但通常对所有rollout均匀应用这种监督,忽略了信号质量的根本差异。我们提出信号校准的在线策略蒸馏增强(SCOPE),一种双路径自适应训练框架,根据正确性将在线策略rollout路由到两个互补的监督路径。对于错误轨迹,SCOPE执行教师困惑度加权的KL蒸馏,优先考虑教师展现出真正纠正能力的实例,同时降低不可靠指导的权重。对于正确轨迹,它应用学生困惑度加权的MLE,将强化集中在能力边界上的低置信度样本,而不是过度强化已掌握的样本。两条路径都采用组级归一化来自适应校准权重分布,考虑不同提示的内在难度差异。在六个推理基准上的大量实验表明,SCOPE在Avg@32和Pass@32上分别比竞争基线平均相对提升11.42%和7.30%,证明了其一致的有效性。

英文摘要

On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level rewards make token-level credit assignment notoriously difficult. On-Policy Distillation (OPD) alleviates this by introducing dense, token-level KL supervision from a teacher model, but typically applies this supervision uniformly across all rollouts, ignoring fundamental differences in signal quality. We propose Signal-Calibrated On-Policy Distillation Enhancement (SCOPE), a dual-path adaptive training framework that routes on-policy rollouts by correctness into two complementary supervision paths. For incorrect trajectories, SCOPE performs teacher-perplexity-weighted KL distillation to prioritize instances where the teacher demonstrates genuine corrective capability, while down-weighting unreliable guidance. For correct trajectories, it applies student-perplexity-weighted MLE to concentrate reinforcement on low-confidence samples at the capability boundary rather than over-reinforcing already mastered ones. Both paths employ a group-level normalization to adaptively calibrate weight distributions, accounting for the intrinsic difficulty variance across prompts. Extensive experiments on six reasoning benchmarks show that SCOPE achieves an average relative improvement of 11.42% in Avg@32 and 7.30% in Pass@32 over competitive baselines, demonstrating its consistent effectiveness.

2604.10579 2026-06-02 cs.RO cs.AI

AffordGen: Generating Diverse Demonstrations for Generalizable Object Manipulation with Afford Correspondence

AffordGen: 通过可供性对应生成多样化演示以实现通用物体操作

Jiawei Zhang, Kaizhe Hu, Yingqian Huang, Yuanchen Ju, Zhengrong Xue, Huazhe Xu

发表机构 * Shanghai Qi Zhi Institute(上海期智研究院) Tsinghua University(清华大学) Fudan University(复旦大学) UC Berkeley(加州大学伯克利分校)

AI总结 提出AffordGen框架,利用3D生成模型和视觉基础模型在大规模3D网格上的语义对应生成多样化操作轨迹,训练鲁棒的闭环视觉运动策略,实现零样本泛化到未见物体。

详情
AI中文摘要

尽管现代模仿学习方法在机器人操作中取得了近期成功,但其性能常常受到数据多样性不足导致的几何变化的限制。利用强大的3D生成模型和视觉基础模型(VFMs),所提出的AffordGen框架通过利用大规模3D网格上有意义关键点的语义对应来生成新的机器人操作轨迹,从而克服了这一限制。然后,这个大规模、可供性感知的数据集被用于训练一个鲁棒的、闭环的视觉运动策略,结合了可供性的语义泛化能力和端到端学习的反应性鲁棒性。在仿真和现实世界中的实验表明,使用AffordGen训练的策略实现了高成功率,并能够零样本泛化到真正未见过的物体,显著提高了机器人学习中的数据效率。项目页面:https://jiaweiz9.github.io/AffordGen-release/

英文摘要

Despite the recent success of modern imitation learning methods in robot manipulation, their performance is often constrained by geometric variations due to limited data diversity. Leveraging powerful 3D generative models and vision foundation models (VFMs), the proposed AffordGen framework overcomes this limitation by utilizing the semantic correspondence of meaningful keypoints across large-scale 3D meshes to generate new robot manipulation trajectories. This large-scale, affordance-aware dataset is then used to train a robust, closed-loop visuomotor policy, combining the semantic generalizability of affordances with the reactive robustness of end-to-end learning. Experiments in simulation and the real world show that policies trained with AffordGen achieve high success rates and enable zero-shot generalization to truly unseen objects, significantly improving data efficiency in robot learning. Project Page: https://jiaweiz9.github.io/AffordGen-release/

2604.09877 2026-06-02 cs.CV cs.AI cs.RO

Genie 4D: Semantic-Prior-Guided 4D Dynamic Scene Reconstruction

Genie 4D:语义先验引导的4D动态场景重建

Yiru Yang, Zhuojie Wu, Nishant Kumar Singh, Max Schulthess

发表机构 * University of Zurich(苏黎世大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 提出Genie 4D框架,结合实时视觉惯性高斯泼溅前端和前馈4D骨干网络,利用冻结的DINOv3特征作为结构先验抑制身份漂移,并通过条件扩散精炼器恢复高频细节,最终通过轻量级潜在动作头实现用户可控的4D世界模型重建。

详情
AI中文摘要

在计算机视觉与机器人感知的交汇处,动态场景的4D重建将低层几何感知与高层语义理解联系起来。我们提出Genie 4D,一个将手持手机拍摄转化为语义化、动作可控的4D世界模型的框架。Genie 4D将用于度量几何的实时视觉惯性高斯泼溅前端与由冻结的DINOv3特征(作为结构先验)正则化的前馈4D骨干网络相结合。语义先验抑制了动态跟踪中的身份漂移,而短条件扩散精炼器恢复了回归骨干网络平滑掉的高频表面细节。最后,一个轻量级潜在动作头将重建的4D状态暴露给以JEPA风格下一嵌入目标训练的Genie式世界模型,使得场景可以在用户动作下向前推进。在Point Odyssey和TUM-Dynamics基准测试上,Genie 4D保留了前馈基线的线性时间复杂度O(T),同时提高了3D跟踪精度(APD)和重建完整性,并且可以在单个消费级GPU(RTX 5090)上通过iPhone、Mac、Windows和Linux采集客户端交互式运行。Genie 4D为走向物理基础的世界模型提供了一条实用的、语义先验引导的路径。

英文摘要

At the intersection of computer vision and robotic perception, 4D reconstruction of dynamic scenes connects low-level geometric sensing with high-level semantic understanding. We present Genie 4D, a framework that turns hand-held phone capture into a semantically grounded, action-controllable 4D world model. Genie 4D couples a real-time visual-inertial Gaussian splatting front-end for metric geometry with a feed-forward 4D backbone regularized by frozen DINOv3 features acting as structural priors. The semantic priors suppress identity drift during dynamic tracking, while a short conditional diffusion refiner recovers high-frequency surface detail that regression backbones smooth away. Finally, a lightweight latent-action head exposes the reconstructed 4D state to a Genie-style world model trained with a JEPA-style next-embedding objective, so that the scene can be rolled forward under user actions. On the Point Odyssey and TUM-Dynamics benchmarks, Genie 4D retains the linear time complexity O(T) of feed-forward baselines while improving 3D tracking accuracy (APD) and reconstruction completeness, and it runs interactively on a single consumer GPU (RTX 5090) from iPhone, Mac, Windows, and Linux capture clients. Genie 4D offers a practical, semantic-prior-guided path toward physically grounded world models.

2604.09482 2026-06-02 cs.AI

Process Reward Agents for Steering Knowledge-Intensive Reasoning

过程奖励智能体:引导知识密集型推理

Jiwoong Sohn, Tomasz Sternal, Kenneth Styppa, Torsten Hoefler, Michael Moor

发表机构 * University of Michigan(密歇根大学)

AI总结 提出过程奖励智能体(PRA),通过在线、分步奖励指导冻结策略模型进行搜索式解码,在医疗推理基准上取得新最优结果,并泛化至不同规模模型。

Comments Accepted to ICML 2026

详情
AI中文摘要

知识密集型领域的推理仍然具有挑战性,因为中间步骤通常无法局部验证:与数学或代码不同,评估步骤的正确性可能需要跨大型外部知识源综合线索。因此,细微错误可能在推理轨迹中传播,可能永远不被检测到。先前的工作提出了过程奖励模型(PRM),包括检索增强变体,但这些方法事后操作,对完成的轨迹进行评分,这阻止了它们集成到动态推理过程中。在这里,我们引入了过程奖励智能体(PRA),这是一种推理时方法,用于向冻结策略提供基于领域、在线、分步的奖励。与先前的检索增强PRM相比,PRA能够实现基于搜索的解码,在每个生成步骤中对候选轨迹进行排序和剪枝。在多个医疗推理基准上的实验表明,PRA持续优于强基线,在MedQA上使用Qwen3-4B达到81.9%的准确率,这是4B规模下的新最优结果。重要的是,PRA泛化到未见过的冻结策略模型(参数从0.5B到8B),在无需任何策略模型更新的情况下,将其准确率提升高达25.7%。更广泛地说,PRA提出了一种范式,其中冻结推理器与领域特定奖励模块解耦,允许在复杂领域中部署新主干而无需重新训练。

英文摘要

Reasoning in knowledge-intensive domains remains challenging as intermediate steps are often not locally verifiable: unlike math or code, evaluating step correctness may require synthesizing clues across large external knowledge sources. As a result, subtle errors can propagate through reasoning traces, potentially never to be detected. Prior work has proposed process reward models (PRMs), including retrieval-augmented variants, but these methods operate post hoc, scoring completed trajectories, which prevents their integration into dynamic inference procedures. Here, we introduce Process Reward Agents (PRA), an inference-time method for providing domain-grounded, online, step-wise rewards to a frozen policy. In contrast to prior retrieval-augmented PRMs, PRA enables search-based decoding to rank and prune candidate trajectories at every generation step. Experiments on multiple medical reasoning benchmarks demonstrate that PRA consistently outperforms strong baselines, achieving 81.9% accuracy on MedQA with Qwen3-4B, a new state of the art at the 4B scale. Importantly, PRA generalizes to unseen frozen policy models ranging from 0.5B to 8B parameters, improving their accuracy by up to 25.7% without any policy model updates. More broadly, PRA suggests a paradigm in which frozen reasoners are decoupled from domain-specific reward modules, allowing the deployment of new backbones in complex domains without retraining.

2604.09041 2026-06-02 cs.LG cs.AI physics.ao-ph stat.ML

U-Cast: A Surprisingly Simple and Efficient Frontier Probabilistic AI Weather Forecaster

U-Cast:一种惊人简单且高效的前沿概率AI天气预报器

Salva Rühling Cachay, Duncan Watson-Parris, Rose Yu

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出基于标准U-Net骨架的概率天气预报模型U-Cast,通过确定性预训练和简短的概率微调,以不到1/10的计算成本匹配或超越GenCast和IFS ENS的预报技能。

Comments ICML 2026. Our code is available at: https://github.com/Rose-STL-Lab/u-cast

详情
AI中文摘要

基于AI的天气预报现在可以与传统的基于物理的集合预报相媲美,但最先进的模型依赖于专门的架构和巨大的计算预算,造成了很高的进入门槛。我们证明,对于前沿性能而言,这种复杂性是不必要的。我们引入了U-Cast,一种基于标准U-Net骨架的概率预报器,采用简单的训练方案:先进行基于平均绝对误差的确定性预训练,然后使用蒙特卡洛Dropout引入随机性,基于连续排序概率评分(CRPS)进行简短的概率微调。结果,我们的模型在$1.5^\circ$分辨率下匹配或超过了GenCast和IFS ENS的概率技能,同时与领先的基于CRPS的模型相比,训练计算量减少了10倍以上,与基于扩散的模型相比,推理延迟减少了10倍以上。U-Cast在不到12个H200 GPU天内完成训练,并在3秒内生成15天的集合预报。这些结果表明,可扩展的通用架构与高效的训练课程相结合,可以以极低的成本匹配复杂的领域特定设计,从而向更广泛的社区开放前沿概率天气模型的训练。

英文摘要

AI-based weather forecasting now rivals traditional physics-based ensembles, but state-of-the-art (SOTA) models rely on specialized architectures and massive computational budgets, creating a high barrier to entry. We demonstrate that such complexity is unnecessary for frontier performance. We introduce U-Cast, a probabilistic forecaster built on a standard U-Net backbone trained with a simple recipe: deterministic pre-training on Mean Absolute Error followed by short probabilistic fine-tuning on the Continuous Ranked Probability Score (CRPS) using Monte Carlo Dropout for stochasticity. As a result, our model matches or exceeds the probabilistic skill of GenCast and IFS ENS at $1.5^\circ$ resolution while reducing training compute by over $10\times$ compared to leading CRPS-based models and inference latency by over $10\times$ compared to diffusion-based models. U-Cast trains in under 12 H200 GPU-days and generates a 15-day ensemble forecast in 3 seconds. These results suggest that scalable, general-purpose architectures paired with efficient training curricula can match complex domain-specific designs at a fraction of the cost, opening the training of frontier probabilistic weather models to the broader community.
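As a rough illustration of the CRPS objective mentioned above, the sketch below computes the standard ensemble estimator of the CRPS for one scalar target, CRPS ≈ mean|x_i − y| − ½·mean|x_i − x_j|. Whether U-Cast uses this exact estimator (rather than, for example, a bias-corrected "fair" variant) is an assumption; the function name `crps_ensemble` is illustrative only.

```python
def crps_ensemble(members, obs):
    """Ensemble estimate of the CRPS for a single scalar observation.

    members -- ensemble forecasts, e.g. several stochastic forward passes
               (hypothetical here; Monte Carlo Dropout would be one source)
    obs     -- the verifying observation
    Lower is better; with one member it reduces to the absolute error.
    """
    m = len(members)
    if m == 0:
        raise ValueError("need at least one ensemble member")
    # Mean absolute error of the members against the observation.
    skill = sum(abs(x - obs) for x in members) / m
    # Half the mean pairwise distance between members (rewards spread).
    spread = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return skill - spread
```

The second term is what distinguishes CRPS training from plain MAE training: it rewards ensembles whose spread matches the forecast uncertainty instead of collapsing onto a single value.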

2604.08161 2026-06-02 cs.LG

Shift- and stretch-invariant non-negative matrix factorization with an application to brain tissue delineation in emission tomography data

位移与伸缩不变的非负矩阵分解及其在脑组织发射断层成像数据分割中的应用

Anders S. Olsen, Miriam L. Navarro, Claus Svarer, Jesper L. Hinrich, Morten Mørup, Gitte M. Knudsen

发表机构 * Neurobiology Research Unit, Copenhagen University Hospital Rigshospitalet(哥本哈根大学医院神经生物学研究单位) Department of Neuroscience, Faculty of Health and Medical Sciences, University of Copenhagen(哥本哈根大学健康与医学科学学院神经科学系) Department of Applied Mathematics and Computer Science, Technical University of Denmark(丹麦技术大学应用数学与计算机科学系) Department of Clinical Medicine, Faculty of Health and Medical Sciences, University of Copenhagen(哥本哈根大学健康与医学科学学院临床医学系)

AI总结 提出频域实现的位移与伸缩不变非负矩阵分解方法,解决动态神经影像中扩散导致的时延和伸缩问题,在合成数据和脑发射断层数据上验证了其对脑组织结构的精细刻画能力。

Comments Accepted at ICASSP2026

详情
AI中文摘要

动态神经影像数据,例如血液或脑脊液中放射性示踪剂传输的发射断层测量,通常表现出类似扩散的特性。这些特性引入了距离依赖的时间延迟、尺度差异和伸缩效应,限制了传统线性建模和分解方法的有效性。为了解决这一问题,我们提出了位移与伸缩不变的非负矩阵分解框架。我们的方法估计整数和非整数的时间位移以及时间伸缩,全部在频域中实现,其中位移对应于相位修改,而伸缩通过零填充或截断处理。该模型在PyTorch中实现(https://github.com/anders-s-olsen/shiftstretchNMF)。我们在合成数据和脑发射断层成像数据上证明,该模型能够解释伸缩效应,从而提供更详细的脑组织结构表征。

英文摘要

Dynamic neuroimaging data, such as emission tomography measurements of radiotracer transport in blood or cerebrospinal fluid, often exhibit diffusion-like properties. These introduce distance-dependent temporal delays, scale-differences, and stretching effects that limit the effectiveness of conventional linear modeling and decomposition methods. To address this, we present the shift- and stretch-invariant non-negative matrix factorization framework. Our approach estimates both integer and non-integer temporal shifts as well as temporal stretching, all implemented in the frequency domain, where shifts correspond to phase modifications, and where stretching is handled via zero-padding or truncation. The model is implemented in PyTorch (https://github.com/anders-s-olsen/shiftstretchNMF). We demonstrate on synthetic data and brain emission tomography data that the model is able to account for stretching to provide more detailed characterization of brain tissue structure.
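The statement that shifts correspond to phase modifications in the frequency domain can be sketched in a few lines. This is a pure-Python illustration of the Fourier shift theorem using a naive O(n²) DFT, not the authors' PyTorch implementation; the function names are hypothetical, and stretching via zero-padding or truncation is not covered.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform (O(n^2)), adequate for short signals."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of `dft`."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def fractional_shift(x, tau):
    """Circularly delay a real signal by `tau` samples (integer or not).

    Each frequency bin is multiplied by exp(-2*pi*i*f*tau/n), where f is the
    signed frequency index, so a non-integer delay still yields a real signal.
    """
    n = len(x)
    spectrum = dft(x)
    shifted = []
    for k in range(n):
        f = k if k <= n // 2 else k - n  # signed frequency index
        shifted.append(spectrum[k] * cmath.exp(-2j * cmath.pi * f * tau / n))
    return [value.real for value in idft(shifted)]
```

With an integer `tau` this reproduces an ordinary circular shift; with a non-integer `tau` it interpolates between samples, which is what makes sub-sample delays estimable.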

2604.08149 2026-06-02 cs.LG stat.ML

A Direct Approach for Handling Contextual Bandits with Latent State Dynamics

处理具有隐状态动态的上下文赌博机的直接方法

Zhen Li, Gilles Stoltz

AI总结 本文提出一种直接方法处理隐马尔可夫链驱动的线性上下文赌博机,通过简化模型归约到标准线性上下文赌博机,并扩展理论分析以考虑HMM参数估计,同时针对更复杂的隐状态依赖模型引入周期性参数更新算法。

详情
Journal ref
ICML 2026 - Forty-Third International Conference on Machine Learning, Jul 2026, Seoul, South Korea
AI中文摘要

我们考虑一个线性上下文赌博机模型,其中上下文和奖励由有限隐马尔可夫链控制。我们首先重新审视Nelson等人(2022)的简化模型,其中奖励是给定观察上下文的隐状态后验概率(称为信念)的线性函数,而不是隐状态本身的函数。这个简化模型可以通过直接归约到标准线性上下文赌博机来处理。我们扩展了这一归约的理论分析,在遗憾界中考虑了隐马尔可夫模型[HMM]参数的估计,并提供了不再依赖于奖励函数、而仅通过HMM参数估计依赖于模型的高概率界。其次,也是最重要的,我们转而研究更自然且更复杂的模型,该模型引入了对隐状态的直接依赖(此外还有对观察上下文的依赖,这对于上下文赌博机是自然的)。在经典的HMM遗忘条件下,为应对奖励结构引入的各种统计依赖,引入的主要算法工具是仅周期性地更新奖励模型参数。

英文摘要

We consider a linear contextual bandit model where contexts and rewards are governed by a finite hidden Markov chain. We first revisit the simplified model by Nelson et al. (2022), in which rewards are linear functions of the posterior probabilities over the hidden states given the observed contexts (called beliefs), rather than functions of the hidden states themselves. This simplified model may be handled through a direct reduction to standard linear contextual bandits. We extend the theoretical analysis of this reduction to take into account the estimation of the parameters of the hidden Markov model [HMM] in the regret bound and to provide high-probability bounds not depending anymore on the reward functions and only depending on the model through the estimation of the HMM parameters. Second, and most importantly, we instead study the more natural and more complex model incorporating direct dependencies in the hidden states (on top of dependencies on the observed contexts, as is natural for contextual bandits). Under a classic HMM forgetting condition, the main algorithmic tool introduced to cope with the various statistical dependencies that the reward structure introduces is to only periodically update reward-model parameters.

2604.06995 2026-06-02 cs.AI

What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning

屏幕到动作中缺失了什么?面向多模态GUI推理的UI-in-the-Loop范式

Songze Li, Xiaoke Guo, Tianqi Liu, Biao Yi, Zhaoyan Gong, Zhiqiang Liu, Huajun Chen, Wen Zhang

发表机构 * Zhejiang University(浙江大学) ZJU-Ant Group Joint Lab of Knowledge Graph(浙大蚂蚁集团知识图谱联合实验室)

AI总结 提出UI-in-the-Loop (UILoop) 范式,通过循环的屏幕-UI元素-动作过程,让多模态大模型显式学习UI元素的定位、语义和用法,实现可解释推理,并在UI理解任务上达到最优。

Comments Accepted by ACL 2026 Findings

详情
AI中文摘要

现有的图形用户界面(GUI)推理任务仍然具有挑战性,特别是在UI理解方面。当前方法通常依赖于直接的基于屏幕的决策,缺乏可解释性,并忽略了对UI元素的全面理解,最终导致任务失败。为了增强对UI的理解和交互,我们提出了一种创新的GUI推理范式,称为UI-in-the-Loop(UILoop)。我们的方法将GUI推理任务视为一个循环的屏幕-UI元素-动作过程。通过使多模态大语言模型(MLLMs)显式学习关键UI元素的定位、语义功能和实际用法,UILoop实现了精确的元素发现和可解释推理。此外,我们引入了一个更具挑战性的UI理解任务,该任务围绕UI元素展开,并包含三个评估指标。相应地,我们贡献了一个包含26K样本的基准(UI Comprehension-Bench),以全面评估现有方法对UI元素的掌握程度。大量实验表明,UILoop在UI理解性能上达到了最先进水平,同时在GUI推理任务中也取得了优异结果。

英文摘要

Existing Graphical User Interface (GUI) reasoning tasks remain challenging, particularly in UI understanding. Current methods typically rely on direct screen-based decision-making, which lacks interpretability and overlooks a comprehensive understanding of UI elements, ultimately leading to task failure. To enhance the understanding and interaction with UIs, we propose an innovative GUI reasoning paradigm called UI-in-the-Loop (UILoop). Our approach treats the GUI reasoning task as a cyclic Screen-UI elements-Action process. By enabling Multimodal Large Language Models (MLLMs) to explicitly learn the localization, semantic functions, and practical usage of key UI elements, UILoop achieves precise element discovery and performs interpretable reasoning. Furthermore, we introduce a more challenging UI Comprehension task centered on UI elements with three evaluation metrics. Correspondingly, we contribute a benchmark of 26K samples (UI Comprehension-Bench) to comprehensively evaluate existing methods' mastery of UI elements. Extensive experiments demonstrate that UILoop achieves state-of-the-art UI understanding performance while yielding superior results in GUI reasoning tasks.

2602.03912 2026-06-02 cs.LG

Echo State Networks for Time Series Forecasting: Hyperparameter Sweep and Benchmarking

回声状态网络用于时间序列预测:超参数扫描与基准测试

Alexander Häußer

AI总结 本文研究回声状态网络(ESN)在M4竞赛数据集上的单变量预测性能,通过超参数扫描和基准测试,发现简单的一阶自回归ESN在月度数据上与ARIMA和TBATS相当,在季度数据上取得最低平均MASE。

详情
AI中文摘要

本文研究了回声状态网络(ESN)对M4预测竞赛数据集中月度与季度时间序列的单变量预测性能。我们评估了一个简单的一阶自回归ESN是否能成为广泛使用的预测方法的竞争性替代方案。研究采用两阶段设计:使用参数数据集分析泄漏率、谱半径、储层大小和正则化选择下的ESN模型配置,同时保留一个不相交的预测数据集用于样本外基准测试。预测精度通过平均绝对缩放误差(MASE)和对称平均绝对百分比误差(sMAPE)衡量,并与简单基准和统计模型(包括自回归积分滑动平均(ARIMA)、指数平滑状态空间(ETS)、Theta方法和TBATS)进行比较。模型配置分析揭示了频率特定的模式:月度序列倾向于中等持久性的储层,而季度序列则偏好更收缩的动态;两种频率下,高泄漏率普遍更受青睐。在最终基准测试中,ESN在月度数据上与ARIMA和TBATS表现相当,并在季度数据上取得最低平均MASE,尽管并非在所有指标上均一致最优。总体而言,结果表明,在考虑过滤后的M4子集上,简单的自回归ESN能提供有竞争力的预测精度(特别是在MASE下),且一旦ESN配置固定,训练和预测时间需求较低。

英文摘要

This paper investigates the performance of Echo State Networks (ESNs) for univariate forecasting of monthly and quarterly time series from the M4 Forecasting Competition dataset. We evaluate whether a simple first-order autoregressive ESN can serve as a competitive alternative to widely used forecasting methods. The study uses a two-stage design: a Parameter dataset is used to analyze ESN model configurations over leakage rate, spectral radius, reservoir size, and regularization selection, while a disjoint Forecast dataset is reserved for out-of-sample benchmarking. Forecast accuracy is measured using mean absolute scaled error (MASE) and symmetric mean absolute percentage error (sMAPE) and compared with simple benchmarks and statistical models including autoregressive integrated moving average (ARIMA), exponential smoothing state space (ETS), the Theta method, and TBATS. The model-configuration analysis reveals frequency-specific patterns: monthly series tend to favor moderately persistent reservoirs, whereas quarterly series favor more contractive dynamics; across both frequencies, high leakage rates are generally preferred. In the final benchmark, the ESN performs on par with ARIMA and TBATS for monthly data and achieves the lowest mean MASE for quarterly data, although it is not uniformly best across all metrics. Overall, the results indicate that a simple autoregressive ESN can provide competitive forecast accuracy on the considered filtered M4 subsets, particularly under MASE, while requiring low training and forecasting time once the ESN configuration has been fixed.
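The two accuracy measures named above can be written down compactly. The sketch follows the usual M4-style definitions (MASE scaled by the in-sample seasonal-naive error, sMAPE in percent); the function names and the convention of passing the seasonal period `m` explicitly (12 for monthly, 4 for quarterly data) are assumptions for illustration.

```python
def mase(train, actual, forecast, m=1):
    """Mean absolute scaled error.

    The forecast MAE is divided by the in-sample MAE of the seasonal-naive
    forecast y[t] = y[t - m]; values below 1 beat that naive benchmark.
    """
    if len(train) <= m:
        raise ValueError("training series must be longer than the period m")
    scale = sum(abs(train[t] - train[t - m]) for t in range(m, len(train))) / (len(train) - m)
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    terms = (abs(a - f) / (abs(a) + abs(f)) for a, f in zip(actual, forecast))
    return 200.0 / len(actual) * sum(terms)
```

Because MASE is scaled per series, it can be averaged across series of very different magnitudes, which is why a lowest mean MASE is a meaningful headline number. Note that `smape` as written is undefined when an actual value and its forecast are both zero.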

2604.05634 2026-06-02 cs.AI

PECKER: A Precisely Efficient Critical Knowledge Erasure Recipe For Machine Unlearning in Diffusion Models

PECKER: 一种用于扩散模型机器遗忘的精确高效关键知识擦除方法

Zhiyong Ma, Zhitao Deng, Huan Tang, Jialin Chen, Zhijun Zheng, Zhengping Li, Qingyuan Chuai

发表机构 * Cao Tu Li(Guangzhou) Technology Co., Ltd, China(广州曹图利科技有限公司,中国) South China University of Technology, China(华南理工大学,中国) Guangzhou Xinhua University, China(广州新华学院,中国) Hong Kong Baptist University, Hong Kong(香港浸会大学,香港)

AI总结 提出PECKER方法,通过显著性掩码优先更新关键参数,在蒸馏框架下实现高效机器遗忘,减少训练时间并保持遗忘效果。

Comments Accepted by ICPR 2026

详情
AI中文摘要

机器遗忘已成为生成式AI模型安全合规运行的关键技术。尽管现有MU方法有效,但大多数方法带来了高昂的训练时间和计算开销。我们的分析表明,根本原因在于梯度更新方向不佳,降低了训练效率并破坏了收敛稳定性。为缓解这些问题,我们提出PECKER,一种高效的MU方法,其性能匹配或优于主流方法。在蒸馏框架内,PECKER引入显著性掩码,优先更新对遗忘目标数据贡献最大的参数,从而减少不必要的梯度计算并缩短整体训练时间,同时不牺牲遗忘效果。我们的方法能够更快地生成遗忘相关类别或概念的样本,并在CIFAR-10和STL-10数据集上与真实图像分布紧密对齐,在类别遗忘和概念遗忘任务中均实现了更短的训练时间。

英文摘要

Machine unlearning (MU) has become a critical technique for GenAI models' safe and compliant operation. While existing MU methods are effective, most impose prohibitive training time and computational overhead. Our analysis suggests the root cause lies in poorly directed gradient updates, which reduce training efficiency and destabilize convergence. To mitigate these issues, we propose PECKER, an efficient MU approach that matches or outperforms prevailing methods. Within a distillation framework, PECKER introduces a saliency mask to prioritize updates to parameters that contribute most to forgetting the targeted data, thereby reducing unnecessary gradient computation and shortening overall training time without sacrificing unlearning efficacy. Our method generates samples that unlearn related class or concept more quickly, while closely aligning with the true image distribution on CIFAR-10 and STL-10 datasets, achieving shorter training times for both class forgetting and concept forgetting.

2604.05324 2026-06-02 cs.LG cs.IT math.IT

A Theoretical Framework for Statistical Evaluability of Generative Models

生成模型统计可评估性的理论框架

Shashaank Aiyer, Yishay Mansour, Shay Moran, Han Shao

发表机构 * University of Maryland(马里兰大学) Tel Aviv University and Google Research(特拉维夫大学和谷歌研究院) Technion and Google Research(以色列理工学院和谷歌研究院)

AI总结 提出一个理论框架,研究生成模型的统计可评估性,证明基于有界测试类的积分概率度量可有限样本评估,而Rényi和KL散度不可评估。

Comments 30 pages

详情
AI中文摘要

统计评估旨在使用从真实分布中采样的独立同分布留出测试数据来估计模型的泛化性能。在分类等监督学习设置中,错误率等性能指标定义明确,给定足够大的数据集,测试误差可靠地近似总体误差。相比之下,由于生成模型的开放性,评估更具挑战性:不清楚哪些指标是合适的,以及这些指标是否可以从有限样本中可靠评估。在这项工作中,我们引入了一个评估生成模型的理论框架,并建立了常用指标的可评估性结果。我们研究了两类指标:基于测试的指标(包括积分概率度量,IPMs)和Rényi散度。我们证明,对于任何有界测试类,IPMs可以从有限样本中评估,至多相差乘性和加性近似误差。此外,当测试类具有有限的fat-shattering维数时,IPMs可以以任意精度评估。相比之下,Rényi和KL散度不能从有限样本中评估,因为它们的值可能关键性地取决于罕见事件。我们还分析了困惑度作为评估方法的潜力和局限性。

英文摘要

Statistical evaluation aims to estimate the generalization performance of a model using held-out i.i.d. test data sampled from the ground-truth distribution. In supervised learning settings such as classification, performance metrics such as error rate are well-defined, and test error reliably approximates population error given sufficiently large datasets. In contrast, evaluation is more challenging for generative models due to their open-ended nature: it is unclear which metrics are appropriate and whether such metrics can be reliably evaluated from finite samples. In this work, we introduce a theoretical framework for evaluating generative models and establish evaluability results for commonly used metrics. We study two categories of metrics: test-based metrics, including integral probability metrics (IPMs), and Rényi divergences. We show that IPMs with respect to any bounded test class can be evaluated from finite samples up to multiplicative and additive approximation errors. Moreover, when the test class has finite fat-shattering dimension, IPMs can be evaluated with arbitrary precision. In contrast, Rényi and KL divergences are not evaluable from finite samples, as their values can be critically determined by rare events. We also analyze the potential and limitations of perplexity as an evaluation method.
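For reference, the two families of metrics named in the abstract have the following textbook definitions. The notation (a test class $\mathcal{F}$, densities $p$ and $q$ of $P$ and $Q$) is an assumption made here and may differ from the paper's.

```latex
% Integral probability metric with respect to a test class F
d_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}}
  \left| \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q}[f(x)] \right|

% Renyi divergence of order alpha; the KL divergence is the limit alpha -> 1
D_{\alpha}(P \,\|\, Q) = \frac{1}{\alpha - 1}
  \log \mathbb{E}_{x \sim Q}\!\left[ \left( \frac{p(x)}{q(x)} \right)^{\alpha} \right],
  \qquad \alpha > 0,\ \alpha \neq 1
```

Informally, each expectation in the IPM is an average of a bounded function and therefore concentrates around its sample mean, whereas the density ratio inside the Rényi divergence is unbounded, so a region that a finite sample almost never visits can still dominate the value.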

2604.04937 2026-06-02 cs.AI cs.CL

Pramana: Fine-Tuning Large Language Models for Epistemic Reasoning through Navya-Nyaya

Pramana: 通过 Navya-Nyaya 微调大型语言模型进行认知推理

Sharath Sathish

发表机构 * University of York(约克大学)

AI总结 提出 Pramana 方法,利用 2500 年历史的印度 Navya-Nyaya 逻辑框架微调 LLM,通过结构化六阶段推理解决认知差距,提升可溯源性。

Comments 52 pages + appendices, comprehensive treatment of Navya-Nyaya computational formalization

详情
AI中文摘要

大型语言模型能生成流畅文本,但在系统推理方面存在困难,常常产生自信但无根据的幻觉。当苹果研究人员向数学问题添加无关背景时,LLM 性能下降了 65% (Apple Machine Learning Research),暴露出表面推理之下脆弱的模式匹配。这种认知差距,即无法将主张建立在可追溯证据之上,限制了 AI 在需要论证的领域的可靠性。我们引入 Pramana,一种新颖的方法,通过在 Navya-Nyaya 逻辑(一种 2500 年历史的印度推理框架)上进行微调,教导 LLM 显式的认识论方法论。与通用的思维链提示不同,Navya-Nyaya 强制执行结构化的六阶段推理:SAMSHAYA(怀疑分析)、PRAMANA(证据源识别)、PANCHA AVAYAVA(包含普遍规则的五支论式)、TARKA(反事实验证)、HETVABHASA(谬误检测)和 NIRNAYA(区分知识与假设的判定)。这种逻辑与认识论的整合提供了标准推理方法所缺乏的认知支架。我们在 55 个 Nyaya 结构化的逻辑问题(约束满足、布尔 SAT、多步演绎)上微调了 Llama 3.2-3B 和 DeepSeek-R1-Distill-Llama-8B。第一阶段在留出评估上实现了 100% 的语义正确性,尽管严格格式遵循率仅为 40%,这表明即使结构约束执行得不完美,模型也能内化推理内容。消融研究表明格式提示和温度对性能有关键影响,且不同阶段的最优配置不同。我们在 Hugging Face 上发布所有模型、数据集和训练基础设施,以促进关于 AI 推理认识论框架的进一步研究。

English abstract

Large language models produce fluent text but struggle with systematic reasoning, often hallucinating confident but unfounded claims. When Apple researchers added irrelevant context to mathematical problems, LLM performance degraded by 65% (Apple Machine Learning Research), exposing brittle pattern-matching beneath apparent reasoning. This epistemic gap, the inability to ground claims in traceable evidence, limits AI reliability in domains requiring justification. We introduce Pramana, a novel approach that teaches LLMs explicit epistemological methodology by fine-tuning on Navya-Nyaya logic, a 2,500-year-old Indian reasoning framework. Unlike generic chain-of-thought prompting, Navya-Nyaya enforces structured 6-phase reasoning: SAMSHAYA (doubt analysis), PRAMANA (evidence source identification), PANCHA AVAYAVA (5-member syllogism with universal rules), TARKA (counterfactual verification), HETVABHASA (fallacy detection), and NIRNAYA (ascertainment distinguishing knowledge from hypothesis). This integration of logic and epistemology provides cognitive scaffolding absent from standard reasoning approaches. We fine-tune Llama 3.2-3B and DeepSeek-R1-Distill-Llama-8B on 55 Nyaya-structured logical problems (constraint satisfaction, Boolean SAT, multi-step deduction). Stage 1 achieves 100% semantic correctness on held-out evaluation despite only 40% strict format adherence, revealing that models internalize reasoning content even when structural enforcement is imperfect. Ablation studies show that format prompting and temperature critically affect performance, with optimal configurations differing by stage. We release all models, datasets, and training infrastructure on Hugging Face to enable further research on epistemic frameworks for AI reasoning.
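To make the distinction between format adherence and semantic correctness more concrete, here is a minimal sketch of one possible format check: it tests whether a model output mentions the six phase names in the order the abstract lists them. The check itself, the assumption that phases are marked by their uppercase names, and the function names are all hypothetical; the abstract does not say how the authors measure strict format adherence.

```python
from typing import List

# Phase names in the order given in the abstract.
PHASES = ["SAMSHAYA", "PRAMANA", "PANCHA AVAYAVA", "TARKA", "HETVABHASA", "NIRNAYA"]


def phases_in_order(text: str) -> bool:
    """Return True if every phase name occurs, each one after the previous phase."""
    position = 0
    for name in PHASES:
        found = text.find(name, position)
        if found == -1:
            return False  # phase missing, or only present before an earlier phase
        position = found + len(name)
    return True


def adherence_rate(outputs: List[str]) -> float:
    """Fraction of outputs that pass the (hypothetical) ordering check."""
    return sum(phases_in_order(o) for o in outputs) / len(outputs)


# Toy outputs: one well-ordered, one with TARKA and HETVABHASA swapped.
good = "\n".join(f"{name}: ..." for name in PHASES)
swapped_order = PHASES[:3] + ["HETVABHASA", "TARKA", "NIRNAYA"]
swapped = "\n".join(f"{name}: ..." for name in swapped_order)

print(adherence_rate([good, swapped]))
```

A check like this only looks at structure, so an output can fail it while still reaching the right answer, which is the kind of gap the abstract reports between the two metrics.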

2604.04199 2026-06-02 cs.LG

Which Leakage Types Matter? A Quantitative Landscape Across 2,047 Benchmark Datasets

Simon Roth

Affiliation Simon Roth

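AI summary Quantifies the severity of four classes of data leakage in machine learning through 28 within-subject counterfactual experiments on 2,047 iid tabular datasets, plus a boundary experiment on 129 temporal datasets.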

Comments 39 pages, 6 figures, 13 tables. Companion to arXiv:2603.10742

English abstract

Twenty-eight within-subject counterfactual experiments across 2,047 iid tabular datasets, plus a boundary experiment on 129 temporal datasets, measure the severity of four data leakage classes in machine learning. Class I (estimation: fitting scalers on full data) is negligible: all nine conditions produce $|\Delta\mathrm{AUC}| \leq 0.005$. Class II (selection: peeking, seed cherry-picking) is substantial: the measured effect is consistent with about 90% noise exploitation inflating reported scores. Class III (memorization) scales with model capacity: $d_z = 0.37$ (Naive Bayes) to $1.11$ (Decision Tree) at 10% duplication. Class IV (boundary) is invisible under random cross-validation. Within this iid tabular regime, the textbook emphasis is inverted: normalization leakage matters least; selection leakage at practical dataset sizes matters most.
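For readers unfamiliar with Class I leakage, the following sketch contrasts a scaler fitted on the full data with one fitted on the training split only. It is a toy illustration under stated assumptions (a single synthetic Gaussian feature, an 800/200 split, hand-written helper names), not the paper's experimental protocol, and it measures how far the standardized test features move rather than any change in AUC.

```python
import random
import statistics
from typing import List, Tuple

Scaler = Tuple[float, float]  # (mean, standard deviation)


def fit_scaler(values: List[float]) -> Scaler:
    """Return the mean and population standard deviation of the values."""
    return statistics.fmean(values), statistics.pstdev(values)


def transform(values: List[float], scaler: Scaler) -> List[float]:
    """Standardize values with previously fitted statistics."""
    mean, std = scaler
    return [(v - mean) / std for v in values]


rng = random.Random(0)  # fixed seed so the run is repeatable
feature = [rng.gauss(10.0, 3.0) for _ in range(1000)]  # synthetic iid feature
train, test = feature[:800], feature[800:]

leaky_scaler = fit_scaler(feature)  # Class I leakage: test rows influence the statistics
clean_scaler = fit_scaler(train)    # leakage-free: statistics come from training rows only

leaky_test = transform(test, leaky_scaler)
clean_test = transform(test, clean_scaler)

# Largest difference between the two standardized versions of the test feature.
max_shift = max(abs(a - b) for a, b in zip(leaky_test, clean_test))
print(max_shift)
```

On iid data the two sets of statistics are estimates of the same quantities, so the standardized test values differ only slightly, which fits the abstract's finding that this class of leakage is negligible in the iid tabular regime.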