arXivDaily: daily arXiv research digest, updated Monday through Friday
2604.11674 2026-05-12 cs.RO cs.AI

AffordSim: A Scalable Data Generator and Benchmark for Affordance-Aware Robotic Manipulation

Mingyang Li, Haofan Xu, Haowen Sun, Xinzhe Chen, Sihua Ren, Liqi Huang, Xinyang Sui, Chenyang Miao, Jiawei Ye, Qiongjie Cui, Zeyang Liu, Xingyu Chen, Xuguang Lan

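AI summary: AffordSim is a scalable data generator and benchmark that aims to improve robots' perception of functional object regions for more precise manipulation. It incorporates open-vocabulary 3D affordance prediction: given a natural-language task description, it generates a scene, localizes the affordance regions, and generates the corresponding grasps, which raises task success rates. AffordSim is validated across multiple robot platforms and complex objects, shows strong sim-to-real transfer, and reaches performance close to manually annotated data on several key tasks.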

Abstract

Many everyday robot manipulation skills are affordance-dependent, with success determined by whether the robot contacts the functional object region required by the subsequent action. Current simulation data generators obtain contacts from generic grasp estimators or per-object manual contact annotations, but generic estimators rank stable grasps without task semantics and often select contacts that are misaligned with the downstream action, while manual contact annotations must be rewritten for each new object and task. To solve these challenges, we introduce AffordSim, a scalable data generator and benchmark that integrates open-vocabulary 3D affordance prediction into simulation-based trajectory generation. Given a natural-language task description, AffordSim synthesizes a task-relevant scene, emits affordance queries, grounds them on object surfaces, samples region-conditioned grasps, and selects executable candidates with motion planning. It further randomizes object pose, texture, lighting, image noise, and cross-viewpoint backgrounds for sim-to-real transfer. We instantiate AffordSim as a 50-task benchmark across diverse manipulation skills, five robot embodiments, and 500+ rigid and articulated objects. AffordSim achieves 93% of the trajectory collection success rate of manual contact annotations on affordance-critical tasks and 89% on hard composite tasks. Vision-language-action policies trained on AffordSim data transfer zero-shot to a real Franka FR3, reaching 24% average success.

2604.08577 2026-05-12 cs.LG cs.AI

Distributionally Robust Token Optimization in RLHF

Yeping Jin, Jiaming Hu, Ioannis Ch. Paschalidis

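AI summary: This work addresses the performance drop that large language models can suffer under small input changes and proposes Distributionally Robust Token Optimization (DRTO). The method combines reinforcement learning from human feedback (RLHF) with distributionally robust optimization (DRO), constructing f-divergence ambiguity sets to strengthen learning on difficult response segments. Experiments show that DRTO significantly improves performance under distribution shift on several reasoning tasks and outperforms conventional methods.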

Abstract

Large Language Models (LLMs) tend to respond correctly to prompts that align well with the data they were trained and fine-tuned on. Yet, small shifts in wording, format, or language can trigger surprisingly large failures, especially on multi-step reasoning problems. To address this problem, we propose a Distributionally Robust Token Optimization (DRTO) approach, which combines token-level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO). DRTO constructs f-divergence ambiguity sets over span-level actor losses, providing a principled way to emphasize difficult response segments during policy optimization. Empirically, DRTO enhances consistency under distribution shifts in multiple reasoning benchmarks among different tasks, achieving $+4.4$ percentage points on MATH-500 and $+2.7$ percentage points on LiveCodeBench over standard RTO.
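As a rough illustration of how an f-divergence ambiguity set can emphasize difficult segments, the sketch below uses the KL divergence, which is one member of the f-divergence family; the abstract does not say which divergence DRTO uses. For a KL-penalized objective with a fixed temperature, the worst-case distribution is an exponential tilt of the uniform weights, so spans with higher loss receive more weight. The function names, the temperature `tau`, and the example losses are hypothetical and not taken from the paper.

```python
import math

def kl_tilted_weights(span_losses, tau=1.0):
    """Worst-case weights for a KL-penalized objective: w_i proportional to exp(loss_i / tau).

    Hypothetical helper; `tau` is an assumed temperature, not a DRTO parameter.
    """
    m = max(span_losses)  # subtract the max for numerical stability
    exps = [math.exp((loss - m) / tau) for loss in span_losses]
    z = sum(exps)
    return [e / z for e in exps]

def robust_loss(span_losses, tau=1.0):
    """Loss averaged under the tilted (worst-case) weights instead of uniform ones."""
    weights = kl_tilted_weights(span_losses, tau)
    return sum(w * loss for w, loss in zip(weights, span_losses))

# Made-up per-span actor losses for one response.
span_losses = [0.2, 0.5, 2.0]
weights = kl_tilted_weights(span_losses, tau=1.0)
```

Lowering `tau` concentrates the weight on the hardest span, while a large `tau` approaches the ordinary uniform average.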

2604.07098 2026-05-12 cs.LG cs.CL

Selective Neuron Amplification in Transformer Language Models

Ryyan Akhtar, Payal Pahwa, Monika Arora

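AI summary: This paper studies why large language models still fail on tasks they appear to understand, and finds that the main cause is not missing knowledge but certain internal circuits being insufficiently activated during inference. The authors propose Selective Neuron Amplification (SNA), which boosts the activity of task-relevant neurons at inference time without modifying model parameters. The method is most effective when the model is uncertain, suggesting that some failures stem from weak activation rather than lack of capability.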

Comments 11 pages, 3 figures. Preprint. Code and experiments conducted independently

Abstract

Large language models often fail on tasks they seem to already understand. In our experiments, this appears to be less about missing knowledge and more about certain internal circuits not being strongly activated during inference. We explore Selective Neuron Amplification (SNA), which increases the influence of task-relevant neurons without changing the model's parameters. The method works at inference time and does not permanently alter the model. SNA helps mainly when the model is uncertain and has little effect when the model is already confident. This suggests that some model failures are due to weak activation rather than lack of capability.
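A minimal sketch of the idea, assuming that "amplification" means multiplying selected hidden activations by a gain greater than 1 at inference time. The abstract does not specify how neurons are chosen or scaled, so the function name, the neuron indices, and the gain are all hypothetical.

```python
def amplify_neurons(activations, neuron_ids, gain=1.5):
    """Return a copy of `activations` with the selected units scaled by `gain`.

    The stored parameters are untouched; only this forward-pass vector changes.
    """
    selected = set(neuron_ids)
    return [a * gain if i in selected else a for i, a in enumerate(activations)]

# Made-up hidden activations for one token; unit 2 is assumed to be task-relevant.
hidden = [0.1, -0.4, 0.8, 0.0]
boosted = amplify_neurons(hidden, neuron_ids=[2], gain=2.0)
```

Because the function returns a new list, dropping the call restores the original behavior, which matches the claim that the model is not permanently altered.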

2604.06720 2026-05-12 cs.CV

Exploring 6D Object Pose Estimation with Deformation

Zhiqiang Liu, Rui Song, Duanmu Chuangqi, Jiaojiao Li, David Ferstl, Yinlin Hu

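AI summary: This paper presents DeSOPE, a large-scale dataset for 6-degree-of-freedom (6DoF) pose estimation of deformed objects. Conventional 6D pose estimation methods usually assume rigid or articulated objects, but in practice objects deviate from their canonical shapes through wear, impact, or deformation, causing these methods to fail. DeSOPE contains high-fidelity 3D scans of 26 common object categories in a canonical state and three deformed states, together with 133K RGB-D frames and 665K pose annotations, providing an important resource for studying pose estimation of deformed objects.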

Comments Accepted at CVPR 2026

Abstract

We present DeSOPE, a large-scale dataset for 6DoF deformed objects. Most 6D object pose methods assume rigid or articulated objects, an assumption that fails in practice as objects deviate from their canonical shapes due to wear, impact, or deformation. To model this, we introduce the DeSOPE dataset, which features high-fidelity 3D scans of 26 common object categories, each captured in one canonical state and three deformed configurations, with accurate 3D registration to the canonical mesh. Additionally, it features an RGB-D dataset with 133K frames across diverse scenarios and 665K pose annotations produced via a semi-automatic pipeline. We begin by annotating 2D masks for each instance, then compute initial poses using an object pose method, refine them through an object-level SLAM system, and finally perform manual verification to produce the final annotations. We evaluate several object pose methods and find that performance drops sharply with increasing deformation, suggesting that robust handling of such deformations is critical for practical applications.

2604.04306 2026-05-12 cs.CV cs.AI

HighFM: Towards a Foundation Model for Learning Representations from High-Frequency Earth Observation Data

Stella Girtsou, Konstantinos Alexis, Giorgos Giannopoulos, Charalambos Kontoes

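AI summary: As climate-related disasters become more frequent, the need for real-time monitoring and early warning grows more urgent. This paper presents HighFM, a foundation model for high-temporal-resolution multispectral remote sensing data. Using more than 2 TB of SEVIRI satellite imagery, it adapts a masked autoencoding framework to learn robust spatiotemporal representations and outperforms traditional methods and recent geospatial foundation models on cloud detection and fire detection, demonstrating the potential of geostationary satellite data for real-time remote sensing.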

Abstract

The increasing frequency and severity of climate-related disasters have intensified the need for real-time monitoring, early warning, and informed decision-making. Earth Observation (EO), powered by satellite data and Machine Learning (ML), offers powerful tools to meet these challenges. Foundation Models (FMs) have revolutionized EO ML by enabling general-purpose pretraining on large-scale remote sensing datasets. However, most existing models rely on high-resolution satellite imagery with low revisit rates, limiting their suitability for fast-evolving phenomena and time-critical emergency response. In this work, we present HighFM, a first-cut approach towards an FM for high-temporal-resolution, multispectral EO data. Leveraging over 2 TB of SEVIRI imagery from the Meteosat Second Generation (MSG) platform, we adapt the SatMAE masked autoencoding framework to learn robust spatiotemporal representations. To support real-time monitoring, we enhance the original architecture with fine-grained temporal encodings to capture short-term variability. The pretrained models are then finetuned on cloud masking and active fire detection tasks. We benchmark our SEVIRI-pretrained Vision Transformers against traditional baselines and recent geospatial FMs, demonstrating consistent gains across both balanced accuracy and IoU metrics. Our results highlight the potential of temporally dense geostationary data for real-time EO, offering a scalable path toward foundation models for disaster detection and tracking.

2603.28902 2026-05-12 cs.AI

ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts

Rongtian Ye

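AI summary: ChartDiff is the first large-scale benchmark for cross-chart comparative understanding, addressing the lack of multi-chart comparative analysis in existing chart understanding tasks. It contains 8,541 chart pairs spanning different data sources, chart types, and visual styles; each pair comes with an LLM-generated, human-verified summary describing differences in trends, fluctuations, and anomalies. The study evaluates general-purpose models, chart-specialized models, and pipeline methods, and finds that general-purpose models produce the best generation quality, while specialized and pipeline methods score higher on ROUGE but lower on human-aligned evaluation, revealing a mismatch between lexical overlap and actual summary quality.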

Comments 21 pages, 17 figures, accepted to ACL 2026: the 4th Workshop on Advances in Language and Vision Research

Abstract

Charts are central to analytical reasoning, yet existing benchmarks for chart understanding focus almost exclusively on single-chart interpretation rather than comparative reasoning across multiple charts. To address this gap, we introduce ChartDiff, the first large-scale benchmark for cross-chart comparative summarization. ChartDiff consists of 8,541 chart pairs spanning diverse data sources, chart types, and visual styles, each annotated with LLM-generated and human-verified summaries describing differences in trends, fluctuations, and anomalies. Using ChartDiff, we evaluate general-purpose, chart-specialized, and pipeline-based models. Our results show that frontier general-purpose models achieve the highest GPT-based quality, while specialized and pipeline-based methods obtain higher ROUGE scores but lower human-aligned evaluation, revealing a clear mismatch between lexical overlap and actual summary quality. We further find that multi-series charts remain challenging across model families, whereas strong end-to-end models are relatively robust to differences in plotting libraries. Overall, our findings demonstrate that comparative chart reasoning remains a significant challenge for current vision-language models and position ChartDiff as a new benchmark for advancing research on multi-chart understanding.
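To make the mismatch between lexical overlap and summary quality concrete, here is a simplified unigram-overlap F1 in the spirit of ROUGE-1. It is a sketch only: it uses whitespace tokenization with no stemming, it is not the official ROUGE implementation, and the example sentences are invented rather than taken from ChartDiff.

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """F1 over overlapping unigram counts (a simplified ROUGE-1-style score)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection of the two bags
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "sales rise in chart a but fall in chart b"
# Same words, opposite meaning: factually wrong, yet a perfect overlap score.
wrong_but_similar = "sales fall in chart a but rise in chart b"
# Correct meaning in different words: low overlap score.
right_but_reworded = "revenue grows in the first plot and drops in the second"
```

The factually reversed summary scores higher than the correct paraphrase, which is the kind of gap between ROUGE and human-aligned judgments that the abstract describes.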

2603.26680 2026-05-12 cs.CL cs.AI

AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment

Jianfei Xiao, Xiang Yu, Chengbing Wang, Wuqiang Zheng, Xinyu Lin, Kaining Liu, Hongxun Ding, Yang Zhang, Wenjie Wang, Fuli Feng, Xiangnan He

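AI summary: As large language models (LLMs) evolve into lifelong AI assistants, model personalization has become a key research direction, yet the field lacks a standard evaluation benchmark. This paper introduces AlpsBench, an LLM personalization benchmark built from real human-LLM dialogues. It contains 2,500 long-term interaction sequences with human-verified structured memories and evaluates the core tasks of extracting, updating, retrieving, and utilizing personalized information. The results expose several challenges current models face in handling personalization and offer a comprehensive evaluation framework for future research.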

Abstract

As Large Language Models (LLMs) evolve into lifelong AI assistants, LLM personalization has become a critical frontier. However, progress is currently bottlenecked by the absence of a gold-standard evaluation benchmark. Existing benchmarks either overlook personalized information management that is critical for personalization or rely heavily on synthetic dialogues, which exhibit an inherent distribution gap from real-world dialogue. To bridge this gap, we introduce AlpsBench, An LLM PerSonalization benchmark derived from real-world human-LLM dialogues. AlpsBench comprises 2,500 long-term interaction sequences curated from WildChat, paired with human-verified structured memories that encapsulate both explicit and implicit personalization signals. We define four pivotal tasks - personalized information extraction, updating, retrieval, and utilization - and establish protocols to evaluate the entire lifecycle of memory management. Our benchmarking of frontier LLMs and memory-centric systems reveals that: (i) models struggle to reliably extract latent user traits; (ii) memory updating faces a performance ceiling even in the strongest models; (iii) retrieval accuracy declines sharply in the presence of large distractor pools; and (iv) while explicit memory mechanisms improve recall, they do not inherently guarantee more preference-aligned or emotionally resonant responses. AlpsBench aims to provide a comprehensive framework.

2603.21901 2026-05-12 cs.CV

CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal

Qingdong He, Chaoyi Wang, Peng Tang, Yifan Yang, Xiaobin Hu

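AI summary: CLEAR is a mask-free, end-to-end video subtitle removal framework that distinguishes subtitles from background content while preserving temporal coherence. It uses a two-stage design: the first stage learns disentangled subtitle representations through self-supervised orthogonality constraints, and the second stage uses LoRA fine-tuning with a generation feedback mechanism for dynamic context adjustment, enabling adaptive inference without ground-truth masks. CLEAR is parameter-efficient and generalizes well across languages: training only 0.77% of the base diffusion model's parameters, it outperforms mask-dependent baselines on Chinese subtitle benchmarks and shows strong zero-shot generalization across six languages.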

Comments Accepted by ICML 2026 (Spotlight)

Abstract

Video subtitle removal aims to distinguish text overlays from background content while preserving temporal coherence. Existing diffusion-based methods necessitate explicit mask sequences during both training and inference phases, which restricts their practical deployment. In this paper, we present CLEAR (Context-aware Learning for End-to-end Adaptive Video Subtitle Removal), a mask-free framework that achieves truly end-to-end inference through context-aware adaptive learning. Our two-stage design decouples prior extraction from generative refinement: Stage I learns disentangled subtitle representations via self-supervised orthogonality constraints on dual encoders, while Stage II employs LoRA-based adaptation with generation feedback for dynamic context adjustment. Notably, our method only requires 0.77% of the parameters of the base diffusion model for training. On Chinese subtitle benchmarks, CLEAR outperforms mask-dependent baselines by +6.77 dB PSNR and -74.7% VFID, while demonstrating superior zero-shot generalization across six languages (English, Korean, French, Japanese, Russian, German), a performance enabled by our generation-driven feedback mechanism that ensures robust subtitle removal without ground-truth masks during inference.

2603.18256 2026-05-12 cs.LG cs.AI

MolRGen: A Training and Evaluation Setting for De Novo Molecular Generation with Reasoning Models

Philippe Formont, Maxime Darrin, Ismail Ben Ayed, Pablo Piantanida

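AI summary: MolRGen is a training and evaluation framework for de novo molecular generation, aimed at the lack of effective reward mechanisms for reasoning-based large language models on molecular generation tasks. It contains about 4,500 protein-pocket targets, yielding 50k multi-objective optimization prompts that combine docking scores with molecular properties, and it scores generated molecules by computing rewards at generation time. The study introduces a diversity-aware top-k metric and uses the verifier to fine-tune a large language model, showing its potential to improve performance in molecular design.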

Abstract

Recent reasoning-based large language models have shown strong performance on tasks with verifiable outcomes, but their use in de novo molecular generation remains limited by the lack of training environments where rewards can be computed without reference molecules. We introduce MolRGen, a benchmark and molecular verifier for training and evaluating reasoning LLMs on de novo molecular generation. MolRGen contains approximately 4,500 protein-pocket targets, resulting in 50k multi-objective optimization prompts combining docking scores with molecular properties such as QED, synthetic accessibility, logP, and physicochemical descriptors. Unlike caption-based generation or molecule-editing benchmarks, MolRGen evaluates molecules proposed from scratch by computing rewards at generation time. We benchmark general-purpose and chemistry-specialized open-source LLMs and introduce a diversity-aware top-k metric to measure whether models can generate a diverse set of high-scoring molecules. Finally, we use the verifier to fine-tune a 128B LLM with GRPO, showing improved performance, at the cost of a diversity-exploitation trade-off. MolRGen provides a scalable testbed for studying verifier-based reasoning and reinforcement learning in molecular design.
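The abstract does not define the diversity-aware top-k metric, so the sketch below shows only one plausible reading: pick candidates greedily by score, skip any that are too similar to an already selected one, and average the selected scores over k. Similarity is computed as a Jaccard index over feature sets, which is how Tanimoto similarity is computed on bit fingerprints. The threshold `max_sim`, the feature sets, and the scores are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two feature sets (a stand-in for fingerprint similarity)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def diverse_top_k(candidates, k, max_sim=0.5):
    """Mean score of up to k high-scoring, mutually dissimilar candidates.

    candidates: list of (score, feature_set) pairs. Unfilled slots count as zero,
    so a model that only produces near-duplicates is penalized.
    """
    picked = []
    for score, feats in sorted(candidates, key=lambda c: c[0], reverse=True):
        if all(jaccard(feats, other) <= max_sim for _, other in picked):
            picked.append((score, feats))
        if len(picked) == k:
            break
    return sum(score for score, _ in picked) / k

# Made-up molecules: the second is a near-duplicate of the first.
candidates = [(0.9, {1, 2, 3}), (0.85, {1, 2, 3, 4}), (0.6, {7, 8})]
metric = diverse_top_k(candidates, k=2)
```

With these inputs the near-duplicate is skipped, so the metric averages the 0.9 and 0.6 candidates rather than the two highest raw scores.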

2603.16964 2026-05-12 cs.CV cs.LG

Behavior-Centric Extraction of Scenarios from Highway Traffic Data and their Domain-Knowledge-Guided Clustering using CVQ-VAE

Niklas Roßberg, Sinan Hasirlioglu, Mohamed Essayed Bouzouraa, Wolfgang Utschick, Michael Botsch

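AI summary: This work aims to standardize the extraction of scenarios from highway traffic data and to cluster them using domain knowledge, in support of behavior evaluation for automated driving systems. It proposes a scenario extraction method based on the Scenario-as-Specification concept and uses a CVQ-VAE model for domain-knowledge-guided clustering, improving the interpretability and consistency of scenario categorization. Experiments show that the method reliably extracts scenarios from real data and effectively integrates domain knowledge, offering a more efficient and standardized scenario categorization framework for validating automated driving systems.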

Comments Accepted as a conference paper in IEEE Intelligent Vehicles Symposium (IV) 2026, Detroit, MI, United States

Abstract

Approval of an automated driving system (ADS) depends on evaluating its behavior within representative real-world traffic scenarios. A common way to obtain such scenarios is to extract them from real-world data recordings. These can then be grouped and serve as a basis on which the ADS is subsequently tested. This poses two central challenges: how scenarios are extracted and how they are grouped. Existing extraction methods rely on heterogeneous definitions, hindering scenario comparability. For the grouping of scenarios, rule-based or ML-based methods can be utilized. However, while modern ML-based approaches, unlike rule-based ones, can handle the complexity of traffic scenarios, they lack interpretability and may not align with domain knowledge. This work contributes to a standardized scenario extraction based on the Scenario-as-Specification concept, as well as a domain-knowledge-guided scenario clustering process. Experiments on the highD dataset demonstrate that scenarios can be extracted reliably and that domain knowledge can be effectively integrated into the clustering process. As a result, the proposed methodology supports a more standardized process for deriving scenario categories from highway data recordings and thus enables a more efficient validation process of automated vehicles.

2603.16593 2026-05-12 cs.RO

Scalable Inspection Planning via Flow-based Mixed Integer Linear Programming

Adir Morgan, Kiril Solovey, Oren Salzman

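AI summary: This paper studies path planning for a robot that must inspect a given set of points of interest (POIs), seeking the shortest robot path that completes the inspection. To handle the problem's complexity, the authors propose a network-flow-based mixed integer linear programming (MILP) approach that recasts the core constraints as a network flow model, and they design a specialized Branch-and-Cut solver, which markedly improves runtime and solution quality. Experiments show that the method scales well to large instances and substantially reduces optimality gaps.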

Abstract

Inspection planning is concerned with computing the shortest robot path to inspect a given set of points of interest (POIs) using the robot's sensors. This problem arises in a wide range of applications from manufacturing to medical robotics. To alleviate the problem's complexity, recent methods rely on sampling-based methods to obtain a more manageable (discrete) graph inspection planning (GIP) problem. Unfortunately, GIP still remains highly difficult to solve at scale as it requires simultaneously satisfying POI-coverage and path-connectivity constraints, giving rise to a challenging optimization problem, particularly at scales encountered in real-world scenarios. In this work, we present highly scalable Mixed Integer Linear Programming (MILP) solutions for GIP that significantly advance the state-of-the-art in both runtime and solution quality. Our key insight is a reformulation of the problem's core constraints as a network flow, which enables effective MILP models and a specialized Branch-and-Cut solver that exploits the combinatorial structure of flows. We evaluate our approach on medical and infrastructure benchmarks alongside large-scale synthetic instances. Across all scenarios, our method produces substantially tighter lower bounds than existing formulations, reducing optimality gaps by 30-50% on large instances. Furthermore, our solver demonstrates unprecedented scalability: it provides non-trivial solutions for problems with up to 15,000 vertices and thousands of POIs, where prior state-of-the-art methods typically exhaust memory or fail to provide any meaningful optimality guarantees.

2603.12275 2026-05-12 cs.CL cs.LG

GONE: Structural Knowledge Unlearning via Neighborhood-Expanded Distribution Shaping

Chahana Dahal, Ashutosh Balasubramaniam, Zuobin Xiong

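AI summary: This paper proposes GONE, a benchmark for evaluating the unlearning of structured knowledge in large language models, addressing the shortcomings of existing methods on relational, multi-hop knowledge. The study introduces a knowledge-graph-based benchmark and a new framework, NEDS, which uses neighbor information in the graph structure to precisely control the boundary between the forgotten fact and its semantic neighborhood, improving both unlearning efficacy and locality. Experiments show that NEDS performs strongly on multiple benchmarks, with high unlearning efficacy and locality.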

Abstract

Unlearning knowledge is a pressing and challenging task in Large Language Models (LLMs) because of their unprecedented capability to memorize and digest training data at scale, raising more significant issues regarding safety, privacy, and intellectual property. However, existing works, including parameter editing, fine-tuning, and distillation-based methods, are all focused on flat sentence-level data but overlook the relational, multi-hop, and reasoned knowledge in naturally structured data. In response to this gap, this paper introduces Graph Oblivion and Node Erasure (GONE), a benchmark for evaluating knowledge unlearning over structured knowledge graph (KG) facts in LLMs. This KG-based benchmark enables the disentanglement of three effects of unlearning: direct fact removal, reasoning-based leakage, and catastrophic forgetting. In addition, Neighborhood-Expanded Distribution Shaping (NEDS), a novel unlearning framework, is designed to leverage graph connectivity and identify anchor correlated neighbors, enforcing a precise decision boundary between the forgotten fact and its semantic neighborhood. Evaluations on LLaMA-3-8B and Mistral-7B across multiple knowledge editing and unlearning methods showcase NEDS's superior performance (1.000 on unlearning efficacy and 0.839 on locality) on GONE and other benchmarks. Code is available at https://anonymous.4open.science/r/GONE-4679/.

2603.11969 2026-05-12 cs.CV

AstroSplat: Physics-Based Gaussian Splatting for Rendering and Reconstruction of Small Celestial Bodies

Jennifer Nolan, Travis Driver, John Christian

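AI summary: This paper proposes AstroSplat, a physics-based Gaussian splatting framework for rendering and reconstructing the surfaces of small celestial bodies such as asteroids. The method incorporates planetary reflectance models to explicitly model surface material properties and light-surface interactions, overcoming the limited physical expressiveness of conventional spherical-harmonic appearance parameterizations. Experiments on real imagery from NASA's Dawn mission show better rendering performance and surface reconstruction accuracy.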

Comments 10 pages, 6 figures, conference

Abstract

Image-based surface reconstruction and characterization are crucial for missions to small celestial bodies (e.g., asteroids), as they inform mission planning, navigation, and scientific analysis. Recent advances in Gaussian splatting enable high-fidelity neural scene representations but typically rely on a spherical harmonic intensity parameterization that is strictly appearance-based and does not explicitly model material properties or light-surface interactions. We introduce AstroSplat, a physics-based Gaussian splatting framework that integrates planetary reflectance models to improve the autonomous reconstruction and photometric characterization of small-body surfaces from in-situ imagery. The proposed framework is validated on real imagery taken by NASA's Dawn mission, where we demonstrate superior rendering performance and surface reconstruction accuracy compared to the typical spherical harmonic parameterization.

2603.11566 2026-05-12 cs.CV

R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection

Zhongyu Xia, Yousen Tang, Yongtao Wang, Zhifeng Wang, Weijun Qin

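AI summary: This paper proposes R4Det, a 4D radar-camera fusion method for improving 3D object detection in autonomous driving. To address the weaknesses of existing methods in depth estimation, temporal fusion, and small-object detection, R4Det introduces a Panoramic Depth Fusion module to improve depth estimation accuracy, a Deformable Gated Temporal Fusion module that does not rely on the ego vehicle's pose, and an Instance-Guided Dynamic Refinement module to improve small-object detection. Experiments show that R4Det achieves state-of-the-art 3D detection results on the TJ4DRadSet and VoD datasets.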

Comments Accepted to CVPR 2026

Abstract

The 4D radar-camera sensing configuration has gained increasing importance in autonomous driving. However, existing 3D object detection methods that fuse 4D radar and camera data confront several challenges. First, their absolute depth estimation module is not robust and accurate enough, leading to inaccurate 3D localization. Second, the performance of their temporal fusion module degrades dramatically or even fails when the ego vehicle's pose is missing or inaccurate. Third, for some small objects, the sparse radar point clouds may completely fail to reflect from their surfaces. In such cases, detection must rely solely on visual unimodal priors. To address these limitations, we propose R4Det, which enhances depth estimation quality via the Panoramic Depth Fusion module, enabling mutual reinforcement between absolute and relative depth. For temporal fusion, we design a Deformable Gated Temporal Fusion module that does not rely on the ego vehicle's pose. In addition, we build an Instance-Guided Dynamic Refinement module that extracts semantic prototypes from 2D instance guidance. Experiments show that R4Det achieves state-of-the-art 3D object detection results on the TJ4DRadSet and VoD datasets. The source code and models will be released at https://github.com/VDIGPKU/R4Det.

2603.10165 2026-05-12 cs.CL cs.AI cs.CV cs.LG

OpenClaw-RL: Train Any Agent Simply by Talking

Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, Ling Yang

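AI summary: OpenClaw-RL is a reinforcement learning framework that optimizes agents online using next-state signals such as user feedback, tool outputs, and interface state changes. On the infrastructure side it adopts a server-client architecture that separates signal extraction from policy optimization, improving training efficiency; on the methodology side it proposes a hybrid RL objective that combines sparse but fine-grained directive signals with broadly available evaluative signals, improving learning stability. The study demonstrates OpenClaw-RL on both personal-agent and general-agent tasks, with notable results in long-horizon settings.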

Comments Code: https://github.com/Gen-Verse/OpenClaw-RL

Abstract

Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework that employs next-state signals to optimize personal agents online through infrastructure and methodology innovations. On the infrastructure side, we extend existing RL systems to a server-client architecture where the RL server hosts the policy behind an inference API and user terminals stream interaction data back over HTTP. From each observed next state, the system extracts two complementary training signals, evaluative and directive, via a separate asynchronous server so that neither signal extraction nor optimization blocks inference. On the methodology side, we introduce a hybrid RL objective that unifies both signal types in a single update: directive signals provide richer, token-level supervision but are sparser, while evaluative signals are more broadly available. To stabilize distillation under teacher-student mismatch, we propose overlap-guided hint selection, which picks the hint whose induced teacher distribution maximally overlaps with the student's top-$k$ tokens, together with a log-probability-difference clip that bounds per-token advantages. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, OpenClaw-RL is the first RL framework to unify real-world agent settings spanning terminal, GUI, SWE, and tool-call environments, where we additionally demonstrate the utility of next-state signals in long-horizon settings.
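The overlap-guided hint selection and the log-probability-difference clip can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: overlap is taken to be the teacher probability mass that falls on the student's top-k tokens, the advantage is taken to be the clipped difference of teacher and student log-probabilities, and the distributions, hint names, and clip bound are all made up.

```python
import math

def top_k_tokens(dist, k):
    """The k most probable tokens of a {token: prob} distribution."""
    return set(sorted(dist, key=dist.get, reverse=True)[:k])

def select_hint(teacher_dists, student_dist, k=2):
    """Pick the hint whose teacher distribution puts the most mass on the student's top-k.

    teacher_dists: {hint_name: {token: prob}}, one distribution per candidate hint.
    """
    top = top_k_tokens(student_dist, k)

    def overlap(hint):
        return sum(p for tok, p in teacher_dists[hint].items() if tok in top)

    return max(teacher_dists, key=overlap)

def clipped_advantage(p_teacher, p_student, clip=1.0):
    """Per-token log-probability difference, bounded to [-clip, clip]."""
    diff = math.log(p_teacher) - math.log(p_student)
    return max(-clip, min(clip, diff))

# Made-up next-token distributions over three shell commands.
student = {"ls": 0.5, "cd": 0.3, "rm": 0.2}
teachers = {
    "hint_a": {"ls": 0.1, "cd": 0.1, "rm": 0.8},
    "hint_b": {"ls": 0.6, "cd": 0.3, "rm": 0.1},
}
best = select_hint(teachers, student, k=2)
```

Here `hint_b` is selected because its teacher distribution agrees with the tokens the student already favors, and the clip keeps a large teacher-student disagreement on any single token from dominating the update.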

2603.10126 2026-05-12 cs.RO cs.AI

AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

Yutong Hu, Jan-Nico Zaech, Nikolay Nikolov, Yuanqi Yao, Sombit Dey, Giuliano Albanese, Renaud Detry, Luc Van Gool, Danda Paudel

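AI summary: This paper proposes AR-VLA, a standalone autoregressive (AR) action expert that generates a continuous causal action sequence conditioned on refreshable vision-language prefixes. Unlike existing vision-language-action (VLA) models and diffusion policies, the action expert keeps its own history in a long-lived memory and is inherently context-aware, which resolves the frequency mismatch between fast control and slow reasoning. Experiments show that AR-VLA matches or exceeds the task success rates of existing reactive VLAs while showing stronger history awareness and smoother action trajectories, providing a scalable structural foundation for training effective robot policies.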

Comments Accepted at RSS 2026

Abstract

We propose a standalone autoregressive (AR) Action Expert that generates actions as a continuous causal sequence while conditioning on refreshable vision-language prefixes. In contrast to existing Vision-Language-Action (VLA) models and diffusion policies that reset temporal context with each new observation and predict actions reactively, our Action Expert maintains its own history through a long-lived memory and is inherently context-aware. This structure addresses the frequency mismatch between fast control and slow reasoning, enabling efficient independent pretraining of kinematic syntax and modular integration with heavy perception backbones, naturally ensuring spatio-temporally consistent action generation across frames. To synchronize these asynchronous hybrid V-L-A modalities, we utilize a re-anchoring mechanism that mathematically accounts for perception staleness during both training and inference. Experiments on simulated and real-robot manipulation tasks demonstrate that the proposed method can effectively replace traditional chunk-based action heads for both specialist and generalist policies. AR-VLA exhibits superior history awareness and substantially smoother action trajectories while maintaining or exceeding the task success rates of state-of-the-art reactive VLAs. Overall, our work introduces a scalable, context-aware action generation schema that provides a robust structural foundation for training effective robotic policies. Code and Videos available at https://arvla.insait.ai

2603.09970 2026-05-12 cs.CL

CREATE: Testing LLMs for Associative Creativity

Manya Wadhwa, Tiasa Singha Roy, Harvey Lederman, Junyi Jessy Li, Greg Durrett

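AI summary: CREATE is a benchmark for evaluating the associative creativity of large language models. The task asks models to generate paths connecting concepts; paths should be highly specific and diverse, and a model scores higher the more high-quality paths it produces. The study finds that the strongest current models perform better on the creative task, but the huge search space makes the benchmark hard to saturate, and thinking models are not necessarily better even with high token budgets. CREATE offers a testbed for improving models' associative creativity.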

Abstract

A key component of creativity is associative reasoning: the ability to draw novel yet meaningful connections between concepts. We introduce CREATE, a benchmark designed to evaluate models' capacity for creative associative reasoning. CREATE requires models to generate sets of paths connecting concepts in a model's parametric knowledge. Paths should have high specificity (distinctiveness and closeness of the concept connection) and high diversity (dissimilarity from other paths), and models are scored more highly if they produce a larger set of strong, diverse paths. This task shares demands of real creativity tasks like hypothesis generation, including an extremely large search space, but enables collection of a sizable benchmark with objective answer grading. Evaluation of frontier models shows that the strongest models achieve higher creative utility than others, with the high multiplicity of answers and complexity of the search making benchmark saturation difficult to achieve. Furthermore, our results illustrate that thinking models are not always more effective on our task, even with high token budgets. Recent approaches for creative prompting give some but limited additional improvement. CREATE provides a sandbox for developing new methods to improve models' capacity for associative creativity.

2603.09465 2026-05-12 cs.CV cs.AI

EvoDriveVLA: Evolving Driving VLA Models via Collaborative Perception-Planning Distillation

Jiajun Cao, Xiaoan Zhang, Xiaobao Wei, Liyuqiu Huang, Zijian Wang, Hanzhen Zhang, Zhengyu Jia, Wei Mao, Hao Wang, Xianming Liu, Shuchang Zhou, Yang Wang, Shanghang Zhang

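AI summary: This paper proposes EvoDriveVLA, a collaborative perception-planning distillation framework that addresses two problems of vision-language-action models in autonomous driving: degraded perception after unfreezing the visual encoder and instability in long-term planning. The method combines self-anchored perceptual constraints with future-informed trajectory optimization: a self-anchor teacher guides the student to attend to key regions, and a future-aware oracle teacher performs trajectory refinement and uncertainty modeling, improving the model's perception and planning. Experiments show that EvoDriveVLA achieves strong performance on both the nuScenes and NAVSIM datasets.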

Comments 19 pages, 5 figures, 5 tables

Abstract

Vision-Language-Action models have shown great promise for autonomous driving, yet they suffer from degraded perception after unfreezing the visual encoder and struggle with accumulated instability in long-term planning. To address these challenges, we propose EvoDriveVLA, a novel collaborative perception-planning distillation framework that integrates self-anchored perceptual constraints and future-informed trajectory optimization. Specifically, self-anchored visual distillation leverages a self-anchor teacher to deliver visual anchoring constraints, regularizing student representations via trajectory-guided key-region awareness. In parallel, future-informed trajectory distillation employs a future-aware oracle teacher with coarse-to-fine trajectory refinement and Monte Carlo dropout sampling to synthesize reasoning trajectories that model future evolutions, enabling the student model to internalize the future-aware insights of the teacher. EvoDriveVLA achieves SOTA performance in nuScenes open-loop evaluation and significantly enhances performance in NAVSIM closed-loop evaluation. Our code is available at: https://github.com/hey-cjj/EvoDriveVLA.

2603.08588 2026-05-12 cs.LG cs.AI

Towards Batch-to-Streaming Deep Reinforcement Learning for Continuous Control

Riccardo De Monte, Matteo Cederle, Gian Antonio Susto

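AI summary: This paper studies how to adapt existing batch deep reinforcement learning methods to the streaming setting to meet the needs of resource-limited hardware. The authors propose two new streaming deep RL algorithms, S2AC and SDAC, which remain compatible with state-of-the-art batch RL methods while matching existing streaming methods on standard benchmarks without tedious hyperparameter tuning. The study also examines the batch-to-streaming transition and proposes a method that preserves the performance of the pre-trained policy.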

Abstract

State-of-the-art deep reinforcement learning (RL) methods have achieved remarkable performance in continuous control tasks, yet their computational complexity is often incompatible with the constraints of resource-limited hardware, due to their reliance on replay buffers, batch updates, and target networks. The emerging paradigm of streaming deep RL addresses this limitation through purely online updates, achieving strong empirical performance on standard benchmarks. In this work, we propose two novel streaming deep RL algorithms, Streaming Soft Actor-Critic (S2AC) and Streaming Deterministic Actor-Critic (SDAC), explicitly designed to be compatible with state-of-the-art batch RL methods, making them particularly suitable for on-device finetuning applications such as Sim2Real transfer. Both algorithms achieve performance comparable to state-of-the-art streaming baselines on standard benchmarks without requiring tedious per-environment hyperparameter tuning. We further investigate the batch-to-streaming transition, showing that a naive transition does not guarantee preservation of pre-trained policy performance, and propose a principled approach to address this challenge.
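To illustrate what "purely online updates" means in contrast to replay buffers and batch updates, the sketch below performs one TD(0) step on a linear value function from a single transition and then discards it. This is textbook TD(0), not S2AC or SDAC, whose update rules the abstract does not give; the feature vectors, learning rate, and discount are arbitrary.

```python
def streaming_td0_step(weights, feats, reward, next_feats, done, lr=0.1, gamma=0.99):
    """One online TD(0) update from a single transition.

    No replay buffer, no mini-batch, no target network: the transition is used
    once and nothing is stored afterwards.
    """
    value = sum(w * x for w, x in zip(weights, feats))
    next_value = 0.0 if done else sum(w * x for w, x in zip(weights, next_feats))
    td_error = reward + gamma * next_value - value
    new_weights = [w + lr * td_error * x for w, x in zip(weights, feats)]
    return new_weights, td_error

# A single terminal transition with reward 1, starting from zero weights.
weights = [0.0, 0.0]
weights, delta = streaming_td0_step(
    weights, [1.0, 0.0], reward=1.0, next_feats=[0.0, 1.0], done=True
)
```

The memory footprint is just the weight vector, which is why this style of update suits the resource-limited, on-device fine-tuning setting the abstract targets.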

2603.08065 2026-05-12 cs.LG cs.CL

Deterministic Differentiable Structured Pruning for Large Language Models

Weiyu Huang, Pengle Zhang, Xiaolu Zhang, Jun Zhou, Jun Zhu, Jianfei Chen

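AI summary This study proposes Deterministic Differentiable Pruning (DDP), a structured pruning method for reducing the inference cost of large language models. Unlike earlier methods that rely on stochastic hard-concrete relaxations, DDP directly optimizes a deterministic soft surrogate of the discrete l0 objective, which removes the stochasticity, reduces train-test mismatch, and speeds up convergence. Experiments on several dense and MoE models show performance close to the original models, better results than existing methods at 20% sparsity, and significant inference speedups in real deployments.

Comments Published at ICML26

Details
English abstract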

Structured pruning reduces LLM inference cost by removing low-importance architectural components. This can be viewed as learning a multiplicative gate for each component under an l0 sparsity constraint. Due to the discreteness of the l0 norm, prior work typically adopts stochastic hard-concrete relaxations to enable differentiable optimization; however, this stochasticity can introduce a train-test mismatch when sampled masks are discretized for deployment and restricts masks to a bounded, near-binary range. To address this, we propose Deterministic Differentiable Pruning (DDP), a mask-only optimization method that eliminates stochasticity by directly optimizing a deterministic soft surrogate of the discrete l0 objective. Compared with prior approaches, DDP offers greater expressiveness, reduced train-test mismatch, and faster convergence. We apply our method to several dense and MoE models, including Qwen3-32B and Qwen3-30B-A3B, achieving a performance loss as small as 1% on downstream tasks while outperforming previous methods at 20% sparsity. We further demonstrate end-to-end inference speedups in realistic deployment settings with vLLM.

2603.04783 2026-05-12 cs.AI cs.CL

Breaking Contextual Inertia: Reinforcement Learning with Single-Turn Anchors for Stable Multi-Turn Interaction

Xingwu Chen, Zhanqiu Zhang, Yiwen Guo, Difan Zou

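AI summary Although large language models reason well in single-turn settings, their performance tends to degrade in multi-turn interactions when information is revealed incrementally or needs updating. The root cause is "contextual inertia": the model clings to its earlier reasoning path and ignores corrections in later input. The study proposes RLSTA, a reinforcement learning method based on single-turn anchors, which uses the model's single-turn strength as a stable reference point to guide it to adjust its reasoning dynamically across turns, thereby breaking contextual inertia. Experiments show that RLSTA performs strongly and generalizes well across several domains, and achieves stable, effective multi-turn dialogue without external verification.

Details
English abstract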

While LLMs demonstrate strong reasoning capabilities when provided with full information in a single turn, they exhibit substantial vulnerability in multi-turn interactions. Specifically, when information is revealed incrementally or requires updates, models frequently fail to integrate new constraints, leading to a collapse in performance compared to their single-turn baselines. We term the root cause Contextual Inertia: a phenomenon where models rigidly adhere to previous reasoning traces. Even when users explicitly provide corrections or new data in later turns, the model ignores them, preferring to maintain consistency with its previous (incorrect) reasoning path. To address this, we introduce Reinforcement Learning with Single-Turn Anchors (RLSTA), a generalizable training approach designed to stabilize multi-turn interaction across diverse scenarios and domains. RLSTA leverages the model's superior single-turn capabilities as stable internal anchors to provide reward signals. By aligning multi-turn responses with these anchors, RLSTA empowers models to break contextual inertia and self-calibrate their reasoning based on the latest information. Experiments show that RLSTA significantly outperforms standard fine-tuning and abstention-based methods. Notably, our method exhibits strong cross-domain generalization (e.g., math to code) and proves effective even without external verifiers, highlighting its potential for general-domain applications. Code is available at https://github.com/Tencent/RLSTA.

2603.03756 2026-05-12 cs.LG cs.CE cs.CL

MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier

Zonglin Yang, Lidong Bing

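AI summary Although large language models show promise for scientific discovery, existing research focuses on inference or feedback-driven training and does not directly model the generative reasoning process $P(h|b)$. This paper proposes the MOOSE-Star framework, which uses subtask decomposition, motivation-guided hierarchical search, and bounded composition to reduce training complexity from exponential to logarithmic, enabling scalable training of $P(h|b)$. To support the framework, the authors also release TOMATO-Star, a dataset of 108,717 decomposed papers. Experiments show that MOOSE-Star keeps scaling with training data and inference budget, whereas direct sampling is limited by a complexity bottleneck.

Comments Accepted by ICML 2026

Details
English abstract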

While large language models (LLMs) show promise in scientific discovery, existing research focuses on inference or feedback-driven training, leaving the direct modeling of the generative reasoning process, $P(\text{hypothesis}|\text{background})$ ($P(h|b)$), unexplored. We demonstrate that directly training $P(h|b)$ is mathematically intractable due to the combinatorial complexity ($O(N^k)$) inherent in retrieving and composing inspirations from a vast knowledge base. To break this barrier, we introduce MOOSE-Star, a unified framework that enables tractable and scalable training of $P(h|b)$, while supporting more scalable inference. In the best case, MOOSE-Star reduces complexity from exponential to logarithmic ($O(\log N)$) by (1) training on decomposed subtasks derived from the probabilistic equation of discovery, (2) employing motivation-guided hierarchical search to enable logarithmic retrieval and prune irrelevant subspaces, and (3) utilizing bounded composition for robustness against retrieval noise. To facilitate this, we release TOMATO-Star, a dataset of 108,717 decomposed papers (38,400 GPU hours) for training. Empirically, MOOSE-Star scales continuously with training data and inference budget, whereas direct brute-force sampling hits a complexity wall.

2603.03239 2026-05-12 cs.CV

COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data

Miguel Espinosa, Eva Gmelich Meijling, Valerio Marsocci, Elliot J. Crowley, Mikolaj Czerkawski

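AI summary This study proposes COP-GEN, a multimodal latent diffusion transformer for generating Copernicus Earth observation data, which models the joint distribution of different sensors (such as optical, radar, elevation, and land cover) at their native spatial resolutions. By parameterizing cross-modal mappings as conditional distributions, COP-GEN supports flexible any-to-any conditional generation, including zero-shot modality translation without task-specific retraining. Experiments show that the model generates diverse, physically consistent observations while maintaining high peak fidelity, and that it clearly outperforms existing methods on the benchmark the authors constructed.

Details
English abstract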

Earth observation applications increasingly rely on data from multiple sensors, including optical, radar, elevation, and land-cover. Relationships between modalities are fundamental for data integration but are inherently non-injective: identical conditioning information can correspond to multiple physically plausible observations, and should be parametrised as conditional distributions. Deterministic models, by contrast, collapse toward conditional means and fail to represent the uncertainty and variability required for tasks such as data completion and cross-sensor translation. We introduce COP-GEN, a multimodal latent diffusion transformer that models the joint distribution of heterogeneous EO modalities at their native spatial resolutions. By parameterising cross-modal mappings as conditional distributions, COP-GEN enables flexible any-to-any conditional generation, including zero-shot modality translation without task-specific retraining. Experiments show that COP-GEN generates diverse yet physically consistent realisations while maintaining strong peak fidelity across optical, radar, and elevation modalities. Qualitative and quantitative analyses demonstrate that the model captures meaningful cross-modal structure and adapts its output uncertainty as conditioning information increases. We release a stochastic benchmark built from multi-temporal Sentinel-2 observations that enables distribution-level comparison of generative EO models. On this benchmark, COP-GEN covers 90% of the real observation manifold and 63% of its per-band reflectance range, while the strongest competing method collapses to 2.8% and 18%, respectively. These results highlight the importance of stochastic generative modeling for EO and motivate evaluation protocols beyond single-reference, pointwise metrics. Website: https://miquel-espinosa.github.io/cop-gen

2603.01960 2026-05-12 cs.LG cs.AI

TiledAttention: a CUDA Tile SDPA Kernel for PyTorch

Taimur Khan

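AI summary TiledAttention is a scaled dot-product attention (SDPA) forward operator for NVIDIA GPUs, intended to speed up SDPA research. It follows FlashAttention's online-softmax formulation and is implemented with cuTile/TileIR, so scheduling can be modified from Python, balancing performance and customizability. Experiments show significant speedups over the standard eager attention path, and the operator integrates directly into PyTorch workflows, providing a practical tool for efficient attention research.

Details
English abstract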

TiledAttention is a scaled dot-product attention (SDPA) forward operator for SDPA research on NVIDIA GPUs. Implemented in cuTile Python (TileIR) and exposed as a PyTorch-callable function, it is easier to modify than low-level CUDA templates while retaining realistic behavior via online softmax and tiled $K,V$ streaming. Algorithmically, TiledAttention follows the established FlashAttention-style online-softmax formulation; our novelty is the cuTile/TileIR implementation strategy, schedule-level modifiability, and reproducible benchmarking/profiling workflow. The approach is both performant and directly editable at the schedule level from Python (tile shapes, staging, shared-memory layout), enabling rapid, reproducible kernel research without template-heavy CUDA/CUTLASS rewrites. We benchmark TiledAttention on an NVIDIA DGX GB10 node with a reproducible harness and compare against PyTorch SDPA (auto-dispatch), explicit unfused baselines (torch_sdpa_math, standard eager attention), and forced backend probes (FlashAttention2, EfficientAttention, CuDNN Attention) across sequence length, head dimension, and precision (FP16/BF16). While production fused baselines remain stronger overall, TiledAttention delivers large speedups over standard eager attention paths and is available for direct use within PyTorch workflows, providing a practical balance between performance and customizability.
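
The online-softmax formulation with tiled $K,V$ streaming mentioned in the abstract can be illustrated on the CPU. The following is a minimal pure-Python sketch for a single query vector, not the cuTile kernel itself; the function names `reference_attention` and `tiled_attention` and the `tile` parameter are assumptions made for illustration.

```python
import math


def reference_attention(q, K, V):
    """Plain softmax attention for one query vector over all keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim_v = len(V[0])
    return [sum(w * v[j] for w, v in zip(weights, V)) / total for j in range(dim_v)]


def tiled_attention(q, K, V, tile=2):
    """Same result, but K and V are consumed one tile at a time.

    Keeps a running maximum, a running softmax denominator, and an
    unnormalised output accumulator, rescaling the latter two whenever
    a new tile raises the maximum.
    """
    scale = 1.0 / math.sqrt(len(q))
    dim_v = len(V[0])
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * dim_v
    for start in range(0, len(K), tile):
        k_tile = K[start:start + tile]
        v_tile = V[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale previous partial sums to the new maximum
        # (math.exp(-inf) is 0.0, so the first tile starts from zero).
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            p = math.exp(s - new_max)
            denom += p
            for j in range(dim_v):
                acc[j] += p * v[j]
        running_max = new_max
    return [a / denom for a in acc]
```

Calling `tiled_attention(q, K, V, tile=t)` for any tile size should agree with `reference_attention(q, K, V)` up to floating-point rounding, which is the property that lets a kernel stream $K,V$ without materialising the full score matrix.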

2603.00541 2026-05-12 cs.LG stat.ML

Spectral Condition for $μ$P under Width-Depth Scaling

Chenyu Zheng, Rongzhen Wang, Xinyu Zhang, Chongxuan Li

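AI summary As generative foundation models scale in both width and depth, stable feature learning and reliable hyperparameter transfer become challenging. This paper proposes a unified spectral framework for maximal update parameterization ($μ$P) under joint width-depth scaling. It specifies how the norms of weights and their per-step updates should scale with width and depth, and reveals a transition from the single-transformation case ($k=1$) to the multi-transformation case ($k\geq 2$). The framework applies to a range of optimizers, and experiments on GPT-2 style language models show stable feature learning and robust hyperparameter transfer, outperforming standard parameterization and $μ$P in the $k=1$ case.

Comments 76 pages, 13 figures, 40 tables

Details
English abstract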

Generative foundation models are increasingly scaled in both width and depth, posing significant challenges for stable feature learning and reliable hyperparameter (HP) transfer across model sizes. While maximal update parameterization ($μ$P) has provided a principled solution to both problems for width scaling, existing extensions to the joint width-depth scaling regime remain fragmented, architecture- and optimizer-specific, and often rely on technically involved theories. In this work, we develop a simple and unified spectral framework for $μ$P under joint width-depth scaling. For deep residual networks whose residual blocks contain $k$ transformations, the framework specifies how the norms of weights and their per-step updates should scale with width and depth. It reveals a fundamental transition from $k=1$ to $k\geq 2$, unifying previously disparate $μ$P formulations and identifying the $k\geq 2$ case as more appropriate for practical architectures with multi-transformation branches such as Transformers. Building on this framework, we derive a general recipe for implementing $μ$P across a broad class of optimizers by mapping spectral constraints to concrete HP parameterizations, recovering existing results and extending them to additional optimizers. Finally, experiments on GPT-2 style language models show that the $μ$P formulation derived from the $k\geq 2$ case achieves stable feature learning and robust HP transfer under width-depth scaling, whereas standard parameterization and $μ$P in the $k=1$ case often fail to do so. These results support the practical effectiveness of the proposed spectral framework.

2602.23928 2026-05-12 cs.CL

The Astonishing Ability of Large Language Models to Parse Jabberwockified Language

Gary Lupyan, Senyi Yang

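AI summary This study shows that large language models have an astonishing ability to parse severely degraded English text. Given "Jabberwockified" text, in which content words are randomly replaced with nonsense strings, the models can still recover conventional English sentences close to the original meaning. The results indicate that cues such as syntactic structure and closed-class words constrain word meaning far more than previously recognized, and they offer insight into the mechanisms of language processing.

Comments Submitted to the 2026 Annual Meeting of the Cognitive Science Society

Details
English abstract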

We show that large language models (LLMs) have an astonishing ability to recover meaning from severely degraded English texts. Texts in which content words have been randomly substituted by nonsense strings, e.g., "At the ghybe of the swuint, we are haiveed to Wourge Phrear-gwurr, who sproles into an ghitch flount with his crurp", can be translated to conventional English that is, in many cases, close to the original text, e.g., "At the start of the story, we meet a man, Chow, who moves into an apartment building with his wife." These results show that structural cues (e.g., morphosyntax, closed-class words) constrain lexical meaning to a much larger degree than imagined. Although the abilities of LLMs to make sense of "Jabberwockified" English are clearly superhuman, they are highly relevant to understanding linguistic structure and suggest that efficient language processing either in biological or artificial systems likely benefits from very tight integration between syntax, lexical semantics, and general world knowledge.
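
The substitution described in the abstract (content words swapped for nonsense strings, closed-class words left intact) can be sketched in a few lines. This is only an illustration: the abstract does not specify the authors' procedure, and the `CLOSED_CLASS` list, the consonant-vowel generator, and the function names below are all hypothetical.

```python
import random
import re

# Tiny, hypothetical closed-class list; a real study would need a full one.
CLOSED_CLASS = {
    "a", "an", "the", "of", "to", "in", "into", "with", "at", "we",
    "who", "his", "her", "are", "is", "and", "he", "she", "it",
}


def nonsense(rng, length):
    """Build a pronounceable-looking string by alternating consonants and vowels."""
    consonants = "bcdfghjklmnprstvw"
    vowels = "aeiou"
    return "".join(
        rng.choice(consonants if i % 2 == 0 else vowels)
        for i in range(max(length, 3))
    )


def jabberwockify(text, seed=0):
    """Replace every word outside CLOSED_CLASS with a nonsense string.

    The same source word always maps to the same nonsense string, and
    capitalisation and punctuation are preserved.
    """
    rng = random.Random(seed)
    mapping = {}

    def swap(match):
        word = match.group(0)
        key = word.lower()
        if key in CLOSED_CLASS:
            return word
        if key not in mapping:
            mapping[key] = nonsense(rng, len(key))
        fake = mapping[key]
        return fake.capitalize() if word[0].isupper() else fake

    return re.sub(r"[A-Za-z]+", swap, text)
```

For example, `jabberwockify("At the start of the story, we meet a man.")` keeps "At", "the", "of", "we", and "a" in place while the remaining words become nonsense, leaving only the structural cues the abstract refers to.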

2602.22953 2026-05-12 cs.AI

General Agent Evaluation

Elron Bandel, Asaf Yehudai, Lilach Eden, Yehoshua Sagron, Yotam Perlitz, Elad Venezian, Natalia Razinkov, Natan Ergas, Shlomit Shachor Ifergan, Segev Shlomov, Michal Jacovi, Leshem Choshen, Liat Ein-Dor, Yoav Katz, Michal Shmueli-Scheuer

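AI summary This study systematically evaluates general-purpose agents across different protocols and unfamiliar environments, comparing agent architectures including tool-calling, MCP, code-generation, and CLI. It proposes a unifying protocol and evaluation harness and builds the first open general agent leaderboard, covering multiple backbone models and benchmarks. Experiments find that general agents adapt to different tasks without domain customization, but architecture choice significantly affects performance, and open-weight models show clear shortfalls in generality.

Comments Presented at the ICLR 2026 Workshop on Agents in the Wild

Details
English abstract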

General-purpose agents perform tasks in unfamiliar environments without domain-specific manual customization. Yet no study has systematically measured how agent architecture shapes performance across heterogeneous protocols and diverse unfamiliar environments. This is the first systematic study, comparing tool-calling, MCP, code-generation, and CLI agents on the same benchmarks with the same models. Two gaps blocked such a study: existing harnesses require per-benchmark wiring or fixed protocol classes (web for BrowserGym, CLI for Harbor), and benchmarks themselves expect human-authored prompts, context, and integration glue. To enable this study, we contribute (1) a unifying protocol that bridges existing benchmark and agent protocols; (2) an evaluation harness that surfaces any benchmark to any general-purpose agent and backbone model; and (3) the first Open General Agent Leaderboard of agent configurations, a full factorial over 5 agent architectures x 5 backbone LLMs (three closed-source, two open-weight) x 6 benchmarks spanning software engineering, customer service, deep research, and personal assistance. We find that (i) general agents adapt to every tested domain without per-domain customization; (ii) agent architecture choice swings results by up to 12pp within a single model, yet backbone model choice dominates overall performance; (iii) on 4 of 6 tested benchmarks, top general agents are indistinguishable from the leading heavily-customized domain-specific agents; (iv) open-weight models tested exhibit "generality sinks" absent from frontier closed-source models: they consistently collapse on specific agent architectures or benchmarks; (v) a behavioral failure analysis reveals architecture-distinctive error signatures that aggregate scoring cannot discriminate. Code, harness, leaderboard, and traces are at https://www.exgentic.ai.

2602.22611 2026-05-12 cs.LG

Mitigating Membership Inference in Intermediate Representations with Differentially Private Training

Jiayang Meng, Tao Huang, Chen Hou, Guolong Zheng, Hong Chen

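AI summary In Embedding-as-an-Interface (EaaI) settings, pre-trained models are used to produce intermediate representations (IRs), which can leak training-set membership information and enable membership inference attacks (MIAs). This paper proposes LM-DP-SGD, a layer-wise differentially private training method that estimates each layer's MIA risk and adjusts the strength of privacy protection accordingly, mitigating membership inference on intermediate representations while preserving model utility. Experiments show that, under the same privacy budget, the method significantly reduces IR-level MIA risk and achieves a better privacy-utility trade-off.

Details
English abstract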

In Embedding-as-an-Interface (EaaI) settings, pre-trained models are queried for Intermediate Representations (IRs). The distributional properties of IRs can leak training-set membership signals, enabling Membership Inference Attacks (MIAs) whose strength varies across layers. Although Differentially Private Stochastic Gradient Descent (DP-SGD) mitigates such leakage, existing implementations employ per-example gradient clipping and a uniform, layer-agnostic noise multiplier, ignoring heterogeneous layer-wise MIA vulnerability. This paper introduces Layer-wise MIA-risk-aware DP-SGD (LM-DP-SGD), which adaptively allocates privacy protection across layers in proportion to their MIA risk. Specifically, LM-DP-SGD trains a shadow model on a public shadow dataset, extracts per-layer IRs from its train/test splits, and fits layer-specific MIA adversaries, using their attack error rates as MIA-risk estimates. Leveraging the cross-dataset transferability of MIAs, these estimates are then used to reweight each layer's contribution to the globally clipped gradient during private training, providing layer-appropriate protection under a fixed noise magnitude. We further establish theoretical guarantees on both privacy and convergence of LM-DP-SGD. Extensive experiments show that, under the same privacy budget, LM-DP-SGD reduces the peak IR-level MIA risk while preserving utility, yielding a superior privacy-utility trade-off.
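
To make the reweighting step concrete, here is a rough sketch of a DP-SGD-style update in which each layer's gradient is scaled by a per-layer weight before global clipping and Gaussian noise. It is not the authors' algorithm: how LM-DP-SGD maps MIA-risk estimates to weights is not given in the abstract, so `layer_weights` is simply an input here, and the function names are hypothetical.

```python
import math
import random


def clip_reweighted(per_layer_grad, layer_weights, clip_norm):
    """Scale each layer by its weight, then clip the whole gradient to clip_norm."""
    scaled = {
        name: [layer_weights[name] * g for g in grad]
        for name, grad in per_layer_grad.items()
    }
    global_norm = math.sqrt(sum(g * g for grad in scaled.values() for g in grad))
    factor = min(1.0, clip_norm / global_norm) if global_norm > 0 else 1.0
    return {name: [factor * g for g in grad] for name, grad in scaled.items()}


def private_step(per_example_grads, layer_weights, clip_norm, noise_multiplier, rng):
    """Return one noisy, averaged gradient from a batch of per-example gradients.

    Noise has the same standard deviation (noise_multiplier * clip_norm)
    for every layer; only the pre-clipping weights differ across layers.
    """
    clipped = [clip_reweighted(g, layer_weights, clip_norm) for g in per_example_grads]
    batch = len(clipped)
    result = {}
    for name in clipped[0]:
        size = len(clipped[0][name])
        summed = [sum(c[name][i] for c in clipped) for i in range(size)]
        result[name] = [
            (s + rng.gauss(0.0, noise_multiplier * clip_norm)) / batch
            for s in summed
        ]
    return result
```

Because the noise scale is shared, a layer whose weight shrinks its clipped contribution ends up with a lower signal-to-noise ratio, which is one way to read "layer-appropriate protection under a fixed noise magnitude".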

2602.21307 2026-05-12 cs.LG

SymTorch: Symbolic Distillation of Neural Networks

Elizabeth S. Z. Tan, Adil Soubki, Miles Cranmer

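AI summary This paper presents SymTorch, a symbolic distillation approach that aims to reveal the mathematical functions learned by neural network components and express them as interpretable closed-form expressions. Built on PySR, it applies to a range of network architectures and has been used for automated discovery of physical laws, improved model interpretability, and more efficient neural networks. The study demonstrates SymTorch's broad applicability and strong performance in symbolic regression, model interpretation, and resource optimization.

Details
English abstract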

What mathematical functions do neural network components learn? Symbolic distillation addresses this question by expressing neural network components with interpretable, closed-form mathematical expressions that expose the functional structure learned during training. We develop symbolic distillation as a systematic, architecture-agnostic methodology, and release our approach as the open-source SymTorch package, a PySR-powered library built natively for the PyTorch ecosystem. Applying this methodology across diverse architectures, we find that SymTorch is successful in the automated discovery of physical laws. Specifically, our approach (1) recovers pairwise interaction forces from graph neural networks trained on empirical $n$-body observations, (2) distills the exact closed-form PDE/ODE solutions of multiple physical systems, including the value of constants, from physics-informed neural networks trained on sparse data, and (3) uncovers the chaotic dynamics of the Lorenz system from high-dimensional data, ultimately outperforming the base neural network on downstream prediction tasks. We further demonstrate the utility of our framework for model interpretability by providing an optimized implementation of SLIME, a symbolic extension to the LIME explainability method. SLIME consistently outperforms LIME on predictive metrics across eight popular classification and regression benchmarks, while still providing an interpretable local symbolic model. Lastly, we investigate replacing transformer MLP layers with symbolic surrogates: replacing 1-7 layers with symbolic approximations yields 2-19% throughput improvements and up to 18.7% VRAM reduction, with the resulting hybrid models lying on the Pareto front of throughput versus perplexity among open-source LLMs of comparable scale.

2602.18866 2026-05-12 cs.LG stat.ML

$(α,β)$-Stability for Boosting Vector-Valued Prediction

Jian Qian, Shu Ge

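AI summary This paper studies boosting for vector-valued prediction and introduces $(α,β)$-stability, a notion based on the geometric median, to analyze how aggregation lifts weak predictors into strong ones. The authors characterize this stability property under several natural divergences and, building on it, propose a generic boosting framework, \geomedboost, based on exponential reweighting and geometric-median aggregation. Under a weak learner condition, the framework guarantees exponential decay of the empirical divergence error, from which a bound on the population error is derived.

Details
English abstract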

Despite the widespread use of boosting in structured prediction, a general theoretical understanding of aggregation beyond scalar prediction remains incomplete. We study vector-valued prediction under a target divergence and identify a geometric stability property under which aggregation amplifies weak guarantees into strong ones. We formalize this property as $(α,β)$-stability by geometric median and show how it supports a boosting framework based on exponential reweighting and geometric-median aggregation. For vector-valued prediction, we characterize this stability property under several natural divergences: $\ell_1$ and $\ell_2$ distances for unconstrained vector-valued prediction, and TV, Hellinger, and KL for density estimation over finite probability vectors. Building on these results, we propose a generic boosting framework \geomedboost. Under a weak learner condition and $(α,β)$-stability, we obtain exponential decay of the empirical divergence error, which then yields population guarantees through a generalization bound.
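
Geometric-median aggregation, the building block named in the abstract, can be approximated with Weiszfeld's classical iteration. The sketch below is a generic $\ell_2$ version and does not reproduce the paper's framework or its $(α,β)$ constants; the function name, the optional `weights` argument, and the fixed iteration count are assumptions for illustration.

```python
import math


def geometric_median(points, weights=None, iters=200, eps=1e-12):
    """Approximate the (weighted) geometric median with Weiszfeld's iteration.

    points  -- list of equal-length coordinate lists (one per weak predictor)
    weights -- optional non-negative weights, e.g. from a boosting round
    """
    if weights is None:
        weights = [1.0] * len(points)
    dim = len(points[0])
    total = sum(weights)
    # Start from the weighted mean.
    y = [sum(w * p[j] for w, p in zip(weights, points)) / total for j in range(dim)]
    for _ in range(iters):
        num = [0.0] * dim
        den = 0.0
        for w, p in zip(weights, points):
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, y)))
            # Guard against division by zero when y lands on a data point.
            coef = w / max(dist, eps)
            den += coef
            for j in range(dim):
                num[j] += coef * p[j]
        y = [n / den for n in num]
    return y
```

Unlike the coordinate-wise mean, the result moves only slightly when a single prediction is far off: aggregating four predictions near the unit square with one outlier at (100, 100) stays inside the square, whereas the mean is pulled to roughly (20.4, 20.4).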