arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.31304 2026-06-01 cs.LG cs.CV

Interpretability Without Tradeoffs: Disentangling Polysemanticity At Equal Predictive Performance

Doğukan Bağcı, Bernt Schiele, Simone Schaub-Meyer, Jonas Fischer, Robin Hesse

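AI summary: Proposes ELUDe, which losslessly reorganizes the information flow between layers to decompose polysemantic neurons into monosemantic features without changing the model's outputs, improving the interpretability of deep neural networks.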

Comments
Preprint
English abstract

Deep neural networks (DNNs) are widely used, but interpreting what they actually learn remains difficult. A major obstacle is that individual neurons often encode multiple unrelated concepts, obscuring the decision process of the network. While prior work, such as sparse autoencoders, can separate these mixed signals into more meaningful, "monosemantic" features, this typically requires altering the model in ways that can degrade downstream performance. To overcome this, we introduce ELUDe (explicit, lossless, unsupervised disentanglement), a method for improving the interpretability of DNNs while preserving their functional equivalence. ELUDe breaks latent representations into clear, inspectable sub-units that behave like interpretable features, while guaranteeing that the model's outputs remain exactly the same. It requires no explicit training, no labels, and can be applied to pretrained models. ELUDe works by reorganizing how information flows between layers, re-routing concept-specific contributions while preserving the original computation by construction. Across several vision models, including DINOv2 and supervised ViT-B/16, ELUDe improves interpretability, keeps downstream accuracy unchanged, runs efficiently, and supports practical uses such as steering model representations. In short, ELUDe offers interpretability (almost) without a tradeoff: clearer, scalable, and actionable model insights with no loss in performance.

2605.31302 2026-06-01 eess.IV cs.CV eess.SP

MoE-dqINR: A Unified Mixture-of-Experts Implicit Neural Representation Framework for Scan-Specific Dynamic and Quantitative MRI Reconstruction

Yinzhe Wu, Fanwen Wang, Zhenxuan Zhang, Zi Wang, Chengyan Wang, Guang Yang

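AI summary: Proposes the MoE-dqINR framework, which uses shared spatial experts and a state-conditioned routing pathway for efficient, unified scan-specific multicoil dynamic and quantitative MRI reconstruction, with an optimization time of about 30 seconds.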

English abstract

Undersampled magnetic resonance imaging (MRI) reconstruction seeks to recover temporally or contrast-varying image series from incomplete multicoil k-space data while preserving state-dependent fidelity for dynamic and quantitative MRI (qMRI). Existing scan-specific implicit neural representations (INRs) often use monolithic spatiotemporal coordinate fields, explicit subspaces, motion or deformation models, calibration variables, or sequence-specific quantitative signal models. These design choices can limit flexibility in sharing spatial information while adapting image synthesis across acquisition states. Moreover, many INR-based baselines remain computationally demanding, typically requiring per-scan optimization times on the order of hundreds to thousands of seconds. We propose MoE-dqINR, a scan-specific multicoil MRI reconstruction framework that factorizes the image-domain representation into shared spatial experts and a state-conditioned routing pathway. Spatial experts encode reusable coordinate-dependent image content, whereas routing weights, conditioned on ordered acquisition states, synthesize each dynamic frame or contrast state from a common expert bank. The representation is coupled to a multicoil MRI forward model and uses the normalized state index to drive routing in both dynamic and quantitative MRI. By separating shared spatial representation from state-dependent synthesis, the framework provides an image-first architecture for dynamic and quantitative MRI while reducing scan-specific INR optimization to approximately 30 s per scan in our experiments. The proposed formulation establishes state-conditioned mixture-of-experts INR as a scan-specific multicoil MRI reconstruction prior that unifies shared spatial representation, dynamic- and qMRI-specific synthesis, and practical per-scan efficiency.
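
The factorization described in this abstract, shared spatial experts mixed by routing weights that depend only on the acquisition state, can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the three expert functions, the softmax router, and the names `spatial_experts`, `routing_weights`, and `synthesize` are all hypothetical, and the coupling to the multicoil forward model is omitted.

```python
import math


def spatial_experts(x, y):
    """Hypothetical shared experts: each maps a coordinate to an intensity."""
    return [
        math.sin(math.pi * x) * math.sin(math.pi * y),  # smooth central blob
        x,                                              # horizontal ramp
        y,                                              # vertical ramp
    ]


def routing_weights(state, num_experts=3):
    """Hypothetical router: a softmax over logits that depend only on the
    normalized state index in [0, 1] (frame or contrast index)."""
    logits = [-((state - k / (num_experts - 1)) ** 2) / 0.1
              for k in range(num_experts)]
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def synthesize(x, y, state):
    """Image value at (x, y) for one acquisition state: a routed mixture
    of the shared spatial experts."""
    experts = spatial_experts(x, y)
    weights = routing_weights(state)
    return sum(w * e for w, e in zip(weights, experts))
```

In this sketch the spatial content is evaluated once per coordinate and reused for every state; only the small weight vector changes between frames or contrasts.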

2605.31296 2026-06-01 q-bio.BM cs.LG

mRNAutilus: Multi-Objective-Guided Discrete Generation of mRNA with Optimized Therapeutic Properties

Sawan Patel, Sophia Tang, Yesol Kim, Yinuo Zhang, Divya Srijay, Ping-Jung Lin, Shambhavi Shubham, Fengmei Pi, Cedric Wu, Sherwood Yao, Pranam Chatterjee

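AI summary: Proposes the mRNAutilus framework, which combines a masked discrete diffusion model with Monte Carlo Tree Guidance to optimize codons and design UTRs de novo at the same time, generating multi-objective Pareto-optimal full mRNA sequences that significantly improve expression and stability across multiple targets.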

English abstract

Therapeutic mRNA design requires coordinating multiple interacting sequence features across the full transcript, where codon usage, untranslated regions (UTRs), and their coupling jointly determine stability, translation efficiency, and protein expression. Here, we present mRNA generation via unrolled trajectories and informed latent updates (mRNAutilus), a framework for simultaneous codon optimization and de novo UTR design directly from sequence. mRNAutilus combines a masked discrete diffusion model trained on millions of full-length mRNAs with Monte Carlo Tree Guidance to generate Pareto-efficient sequences under multiple functional objectives, using lightweight regressors over model embeddings to predict half-life, translation efficiency, and protein abundance. Unlike recent methods that design coding sequences and UTRs separately or rely on post hoc assembly and screening, mRNAutilus generates complete transcripts in a single process optimized across properties. Across diverse targets, zero-shot mRNAs encoding P. pyralis luciferase achieve over 400-fold higher expression than wild-type and outperform commercial and machine learning-designed baselines, including zero-shot generative approaches. Zero-shot SARS-CoV-2 Spike mRNAs exceed clinically used and commercial constructs and match or surpass lab-optimized designs with improved durability. We further demonstrate generality in therapeutic settings, including prime editing (PEMax) and programmable proteome modulation, where mRNAutilus-designed constructs enhance expression of peptide-guided E3 ligases (uAbs) for beta-catenin degradation. These results establish a sequence-based, multi-objective framework for generating functional mRNAs tailored to diverse biological applications.
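
The notion of Pareto-efficient sequences under several objectives can be illustrated with a small non-dominated filter. This is a generic sketch, not the paper's Monte Carlo Tree Guidance: the sequence names and the score triples (half-life, translation efficiency, protein abundance) are made-up values, and higher is assumed to be better for every objective.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly
    better on at least one (higher is better)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(candidates):
    """Keep the candidates whose score vectors no other candidate dominates."""
    return [
        (name, scores)
        for name, scores in candidates
        if not any(dominates(other, scores) for _, other in candidates)
    ]


# Hypothetical predicted scores: (half-life, translation efficiency, abundance)
candidates = [
    ("seq_a", (0.9, 0.4, 0.6)),
    ("seq_b", (0.5, 0.8, 0.7)),
    ("seq_c", (0.4, 0.3, 0.5)),  # dominated by seq_a and seq_b
    ("seq_d", (0.9, 0.4, 0.5)),  # dominated by seq_a
]
front = pareto_front(candidates)
```

Here `seq_a` and `seq_b` both survive because neither beats the other on all three objectives.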

2605.31295 2026-06-01 cs.SD cs.AI cs.IR cs.LG

Latent Space Disentanglement via Activation Steering for Interpretable Attribute Control in Symbolic Music Generation

Ioannis Prokopiou, Pantelis Vikatos, Maximos Kaliakatsos-Papakostas, Theodoros Giannakopoulos, Themos Stafylakis

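AI summary: Uses the Difference-in-Means method to isolate latent directions for pitch and duration in the residual stream of the Multitrack Music Transformer, and applies Gram-Schmidt orthogonalization for dual-attribute steering, enabling interpretable, deterministic attribute modulation at inference time.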

Comments
Accepted at EUSIPCO 2026 (34th European Signal Processing Conference), 5 pages, 2 figures
English abstract

Transformer-based architectures have significantly advanced the generation of complex symbolic sequences, yet a significant gap remains in achieving fine-grained, interpretable control over discrete signal attributes. This paper investigates the mechanistic interpretability of the Multitrack Music Transformer (MMT) and proposes a framework for deterministic attribute modulation without retraining to bridge this gap via inference-time activation steering. Utilizing the Difference-in-Means (DiffMean) methodology, we isolate latent directions for signal attributes, specifically Pitch and Duration, within the residual stream. We validate the Linear Representation Hypothesis in this domain, achieving high correlation between steering magnitude and attribute shift. To address the inherent feature entanglement in multi-attribute steering, we introduce a Dual Steering framework utilizing Gram-Schmidt Orthogonalization. Experimental results demonstrate that this geometric decoupling reduces conceptual interference and signal degradation compared to naive vector addition, enabling independent deterministic control even against strong autoregressive conditioning.
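
The two geometric steps named in this abstract, a Difference-in-Means direction per attribute followed by Gram-Schmidt orthogonalization of the second direction against the first, can be sketched on toy vectors. The 3-dimensional activations below are made up, and the helper names are hypothetical; where the paper reads activations from and how it scales the steering vectors are not stated here.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def diff_mean(positive, negative):
    """Difference-in-means direction: mean(positive) - mean(negative)."""
    mean_pos, mean_neg = mean_vector(positive), mean_vector(negative)
    return [p - q for p, q in zip(mean_pos, mean_neg)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthogonalize(v, u):
    """One Gram-Schmidt step: remove from v its component along u."""
    scale = dot(v, u) / dot(u, u)
    return [a - scale * b for a, b in zip(v, u)]


def steer(activation, directions, strengths):
    """Add scaled steering directions to an activation vector."""
    out = list(activation)
    for direction, strength in zip(directions, strengths):
        out = [o + strength * d for o, d in zip(out, direction)]
    return out


# Toy activations (made-up numbers) for two contrasting sets per attribute.
high_pitch = [[2.0, 1.0, 0.0], [2.0, 3.0, 0.0]]
low_pitch = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
long_duration = [[1.0, 2.0, 1.0], [1.0, 2.0, 3.0]]
short_duration = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

pitch_dir = diff_mean(high_pitch, low_pitch)
duration_dir = orthogonalize(diff_mean(long_duration, short_duration), pitch_dir)
steered = steer([0.0, 0.0, 0.0], [duration_dir], [3.0])
```

Because `duration_dir` has no component along `pitch_dir`, steering along it leaves the projection onto the pitch direction unchanged, which is the decoupling that naive vector addition lacks.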

2605.31294 2026-06-01 cs.CV

TokTalk: Expressive Real-time Facial Animation from Audio-LLM Tokens

Qingcheng Zhao, Yifang Pan, Karan Singh

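AI summary: Proposes TokTalk, a system that uses the audio tokens produced by Audio-LLMs to generate expressive 3D facial animation directly in real time, achieving low latency and high quality through a chunk-based conditional flow matching model and a lightweight adaptation strategy.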

English abstract

Recent advances in Audio-LLMs like GPT-4o have ushered in an era of conversational interaction with language models. Conversational avatars, however, still seem robotic in facial expression and conversational flow, in part due to sequential stages of speech recognition, text generation, turn-based text response, speech synthesis, and audio-driven facial animation. Based on our insight that audio-tokens produced by current Audio-LLMs carry sufficient information to reconstruct a plausible facial performance, we present TokTalk, a system that directly outputs expressive facial animation in real time from streaming audio-tokens. We construct a novel audio-token to 3D facial motion dataset, on which TokTalk is trained using a Chunk-based Conditional Flow Matching model. A lightweight adaptation strategy allows our trained model to seamlessly connect to any token-based Audio-LLM at minimal computational overhead. Our chunk-based processing further enables a parametric trade-off between latency and facial quality, shown through ablation studies. We further show that the real-time performance of TokTalk is comparable in latency to prior-art solutions, and significantly more favorable (via a perceptual study) in terms of quality, expressivity, and control of the 3D facial performance. We showcase TokTalk's flexibility using a chatbot Avatar, a voice-driven user Avatar, and an animation Director's interface, as diverse audio-visual face applications.

2605.31293 2026-06-01 cs.CL

Divergence Decoding: Inference-Time Unlearning via Auxiliary Models

Humzah Merchant, Bradford Levy

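AI summary: Proposes Divergence Decoding (DD), which uses small auxiliary models to steer an LLM's logits away from specific data at inference time, achieving effective, low-cost unlearning and outperforming existing methods on multiple benchmarks.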

English abstract

Large Language Models (LLMs) frequently memorize sensitive training data, thereby creating significant privacy and copyright risks. Addressing these risks, i.e., removing such knowledge from an existing model checkpoint, has proven challenging, as many unlearning methods lead to catastrophic utility loss or are ineffective for complex queries. We introduce Divergence Decoding (DD), a mechanism that uses small auxiliary models to steer the logits of the LLM away from specific data during inference. Training these models is straightforward, i.e., we use standard pre-training and fine-tuning setups. We find the method decisively outperforms state-of-the-art (SOTA) baselines on unlearning benchmarks across a variety of model and training dataset scales, consistent with DD being an effective and inexpensive solution to unlearning. We then demonstrate that this steered distribution can be trivially distilled back into the base model. Since the method is generally applicable to any probabilistic model, we explore its efficacy outside of text generation and find evidence of generalization to the domain of images.
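
The abstract does not give the steering rule, so the following is only one plausible form of inference-time logit steering, written as an assumption: the base logits are shifted by the scaled difference between an auxiliary model trained with the data to be forgotten and one trained without it. The function name `steer_logits`, the `alpha` parameter, the two-auxiliary-model setup, and the toy logits are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer_logits(base, aux_forget, aux_reference, alpha=1.0):
    """Hypothetical rule (an assumption, not the paper's formula): subtract
    the scaled divergence between the two auxiliary models from the base."""
    return [b - alpha * (f - r)
            for b, f, r in zip(base, aux_forget, aux_reference)]


vocab = ["the", "secret", "public"]
base = [1.0, 2.0, 0.5]            # base LLM prefers the memorized token
aux_forget = [0.0, 3.0, 0.0]      # auxiliary model that saw the forget data
aux_reference = [0.0, 0.5, 0.0]   # auxiliary model that did not
steered = steer_logits(base, aux_forget, aux_reference, alpha=1.0)
```

In this toy case only the token on which the auxiliary models disagree is pushed down; the other logits are untouched.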

2605.31292 2026-06-01 cs.CV

Authentication of Copy Detection Patterns via Cross-Camera Dual-Synthetic Referencing

Ivan Oleksiyuk, Roman Chaban, Slava Voloshynovskiy

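AI summary: Proposes an enrolment-based cross-camera dual-synthetic referencing framework in which a deep-learning translator jointly uses the digital template and the enrolled capture to generate high-quality reference images, countering printing stochasticity and camera distortion and improving the authentication performance of copy detection patterns.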

Comments
To appear in Proc. ICIP2026, September 13-17, 2026, Tampere, Finland
English abstract

Copy Detection Patterns (CDPs) are structures printed on physical objects to enable cost-effective authentication. Verification is achieved by comparing a captured image with the digital template from which the CDP was printed. In practice, printer stochasticity and camera distortions hinder this comparison, limiting robustness against counterfeiting. Prior work addressed camera effects by synthesising reference images in the verification camera domain, but it ignored printing variability. We introduce an enrolment-based cross-camera dual-synthetic referencing framework. Each printed CDP is first captured by a controlled enrolment camera, and a deep-learning-based translator jointly exploits the digital template and the enrolled capture to generate a high-quality reference for the verification image. We provide an information-theoretic justification showing that the dual reference is more informative than template-based references. Experiments on heterogeneous mobile cameras demonstrate improved authentication performance, robustness to machine-learning-based copy attacks, and reliable verification from small CDP regions and on low-end devices.

2605.31291 2026-06-01 cs.IR cs.LG

Contextual Scalarisation Thompson Sampling for multi-objective decisions in public media

Théo Maëtz, Luc Guillet, Andrea Cavallaro

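AI summary: Proposes Contextual Scalarisation Thompson Sampling (CSTS), which learns context-dependent objective weights to balance multiple competing objectives in public media recommendation; experiments show it outperforms fixed-weight and standard contextual bandit methods.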

Comments
15 pages, 3 figures, 3 tables. Submitted-manuscript version of a paper accepted at ICPR 2026. The Version of Record will be published in the Springer Lecture Notes in Computer Science series; DOI will be added when available
English abstract

Recommender systems may operate under multiple, competing objectives. For example, audience reach, cultural values, public service mandate, and operational constraints must be balanced in editorial decisions of public service media. Existing approaches relying on fixed combinations of objectives or Pareto-based optimisation do not adapt to changing priorities across situations. In this paper, we propose Contextual Scalarisation Thompson Sampler (CSTS), a multi-objective contextual bandit method that learns to weight objectives as a function of the observed context. We evaluate CSTS on real programming data from Radio Télévision Suisse, the Swiss national broadcaster, showing improved contextual relevance and better alignment with expert curation practices compared to fixed-weight and standard contextual bandit approaches.

2605.31289 2026-06-01 cs.LG cs.AI

The Terminal Representation in Reinforcement Learning

Amir Esterhuysen, Anders Jonsson

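AI summary: Proposes the terminal representation (TR), a reward-weighted state representation that can be used directly in downstream tasks without eigendecomposition and with lower computational overhead.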

English abstract

Representation learning is a powerful tool for spatio-temporal abstraction within reinforcement learning (RL). Two well-established approaches are through the successor representation (SR) and the default representation (DR). The SR encodes states by the future trajectories they induce, capturing information flow decoupled from reward. The DR builds on this by weighting trajectories with reward, integrating credit-assignment structure into the representation. Eigenvectors of both representations have been used to support a range of downstream tasks -- including option discovery, reward shaping, transfer learning, and exploration. We introduce a structurally distinct formulation: the terminal representation (TR). The TR encodes reward-weighted trajectories similarly to the DR, but can be learned as a lower-dimensional object, and can be used directly for the mentioned applications without eigenvector computations. Eigendecomposition also imposes the assumption of symmetric transition dynamics, which the TR can bypass. In this work we develop the theoretical foundations of the TR: its derivation, convergence of two learning algorithms, its use for zero-shot compositionality, and equivalences between alternative reward formulations. We further show the TR is embedded in the top DR eigenvector, allowing it to capture the same underlying knowledge without eigendecomposition. Additionally, we provide empirical evidence of the TR as a viable alternative to existing representations in subsidiary applications, while requiring less computational overhead to learn, store, and use.
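
For readers unfamiliar with the successor representation mentioned above, its standard textbook form for a fixed policy with transition matrix P and discount gamma is the discounted sum of powers of P, which equals the inverse of (I - gamma P). The sketch below computes it by fixed-point iteration on a two-state chain; it illustrates the SR only, and says nothing about how the DR or the proposed TR is computed. The function name and the example chain are assumptions.

```python
def successor_representation(P, gamma, iterations=500):
    """Iterate M <- I + gamma * P @ M, which converges to the discounted
    sum of powers of P when 0 <= gamma < 1."""
    n = len(P)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    M = [row[:] for row in identity]
    for _ in range(iterations):
        M = [
            [identity[i][j] + gamma * sum(P[i][k] * M[k][j] for k in range(n))
             for j in range(n)]
            for i in range(n)
        ]
    return M


# Two-state chain that deterministically swaps states at every step.
P = [[0.0, 1.0], [1.0, 0.0]]
M = successor_representation(P, gamma=0.5)
```

Each entry M[i][j] is the expected discounted number of visits to state j when starting from state i, so every row sums to 1 / (1 - gamma).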

2605.31287 2026-06-01 cs.CY cs.AI cs.HC

Neither Replacement nor Panacea: Comparing LLM-Based Conversational and Graphical Decision Support in Industrial Tasks

Roberto Figliè, Simone Caputo, Alan Serrano, Daria Mikhaylova, Tommaso Turchi, Daniele Mazzei

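AI summary: Uses a mixed factorial experiment to compare an LLM-based conversational interface with a dashboard for industrial decision support, finding that the conversational interface lowers cognitive load and speeds up completion on low-complexity tasks, but that these advantages vanish as task complexity increases and decision accuracy does not improve.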

English abstract

Managers in manufacturing settings rely on digital interfaces to interpret operational data for decision-making, but growing data volume and complexity can make relevant insights difficult to identify efficiently. While dashboards remain dominant in industrial contexts, Large Language Model (LLM)-based conversational agents (CAs), accessed through conversational user interfaces (CUIs), may provide more direct access to such data. However, their effectiveness may depend on the information-processing demands of the task. This study compares an LLM-based CA delivered through a CUI with a dashboard in a manufacturing decision-support scenario. In a mixed factorial experiment with a 2x3 design, 134 industrial decision-makers were assigned to one interface condition and completed three tasks of increasing complexity. We examined perceived Mental Workload (MWL), decision accuracy, completion time, and intended reliance, and tested self-reported data literacy as a moderator. Results showed that the CUI reduced perceived MWL overall and supported faster completion in less demanding tasks, but both advantages diminished as task complexity increased. Neither interface produced a consistent overall advantage in decision accuracy, and the CUI was not preferred as a sole basis for subsequent decisions. Furthermore, data literacy did not reliably moderate interface effects. These findings indicate that conversational interaction offers conditional rather than universal benefits for industrial decision support. LLM-based CAs may reduce information-access effort, whereas complex decisions continue to benefit from persistent, inspectable visual representations.

2605.31286 2026-06-01 cs.RO cs.AI

DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation

Taiyi Su, Jian Zhu, Tianjian Wang, Youzhang He, Zitai Huang, Jianjun Zhang, Chong Ma, Hanyang Wang, Tianjiao Zhang, Munan Yin, Weihao Ding, Yi Xu

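AI summary: Proposes DeMaVLA, which pairs a VLM backbone with an action expert and uses flow matching to generate continuous actions, prunes transformer layers for efficiency, and trains on large-scale real-world data plus data aggregated from human feedback, generalizing deformable-object folding across multiple categories.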

Comments
14 pages, 2 figures
English abstract

Real-world household robots require Vision-Language-Action (VLA) foundation models that can acquire reusable manipulation skills across diverse objects, task conditions, and household environments. Deformable-object folding is a representative challenge, requiring robots to handle clothing items from random initial states across varying categories, geometries, materials, and scenes. However, existing VLA systems commonly train separate policies for different object categories, while naively mixed multi-task training often suffers from task interference and degraded performance. To move beyond category-specific folding policies, we introduce DeMaVLA, a VLA foundation model for generalizable Deformable Manipulation. DeMaVLA adopts a VLM backbone with an action expert and formulates continuous action generation using flow matching. To improve efficiency, the action expert is constructed by pruning every other transformer layer while preserving layer-wise alignment with the VLM backbone, reducing training and inference cost. DeMaVLA is first pre-trained on approximately 5,000 hours of selected real-world dual-arm demonstrations to acquire general manipulation priors. It is then post-trained on mixed folding data that aggregates self-collected demonstrations and corrective trajectories from real-robot failures across multiple folding tasks through a human-in-the-loop Data Aggregation (DAgger) pipeline. Experiments show that DeMaVLA achieves competitive performance on RoboTwin and strong real-world results on our household folding benchmark. These results highlight the value of scalable real-world data, efficient action generation, and corrective learning for general-purpose VLA policies in deformable-object manipulation.

2605.31284 2026-06-01 cs.CV cs.AI

SAM for Robust Mitochondria Instance Segmentation in Fluorescence Microscopy

Suyog Jadhav, Dilip K. Prasad, Krishna Agarwal

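AI summary: By fine-tuning SAM exclusively on synthetic fluorescence microscopy data, addresses the scarcity of real data and improves the precision and average Dice score of mitochondria instance segmentation.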

Comments
Accepted at PHAROS-AIF-MIH workshop @ CVPR 2026
English abstract

The morphological analysis of mitochondria in fluorescence microscopy (FM) is crucial for understanding cellular health, energy production, and metabolic regulation. While foundation models like the Segment Anything Model (SAM) have revolutionized natural image segmentation, their direct application to FM is hindered by a significant domain shift characterized by diffraction-limited resolution, low contrast, and complex overlapping organelle networks. Furthermore, the development of robust models is bottlenecked by a severe lack of high-quality, manually annotated instance segmentation datasets for mitochondria. In this paper, we propose a scalable solution to this data scarcity by fine-tuning SAM exclusively on synthetically generated FM data. We simulate realistic mitochondria data and emulate the optical properties of fluorescence microscopes to create a large-scale annotated dataset. We evaluate our fine-tuned model on a curated dataset of real, manually annotated FM images. Qualitative and quantitative analyses demonstrate that our synthetically fine-tuned model improves precision and average Dice score over strong baselines. This work establishes the potential of simulation-assisted training for FM instance segmentation.
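
The Dice score reported above is the standard overlap measure between a predicted mask and a reference mask. A minimal sketch on binary masks follows; how the paper matches instances and averages the score is not stated in the abstract, and returning 1.0 for two empty masks is a convention chosen here as an assumption.

```python
def dice_score(pred, target):
    """Dice = 2 * overlap / (size(pred) + size(target)) for two binary
    masks given as equally shaped nested lists of 0/1 values."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    overlap = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    total = sum(1 for p in flat_pred if p) + sum(1 for t in flat_target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * overlap / total


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
score = dice_score(pred, target)
```

In this example each mask has three foreground pixels and two of them coincide, so the score is 2 * 2 / (3 + 3).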

2605.31283 2026-06-01 cs.CV

Topologically Consistent Multi-view 3D Head Reconstruction via Coarse-Guided Layered Surface Sampling

Timo Bolkart, Daoye Wang, Prashanth Chandran

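AI summary: Proposes the SHELLS framework, which decouples feature extraction from mesh resolution through a hierarchical sampling strategy for efficient, topologically consistent multi-view 3D head reconstruction, generalizing to real-world captures when trained on synthetic data.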

Comments
SIGGRAPH Conference Papers 2026
English abstract

We present SHELLS (Semantic Head Estimation via Layered Local Sampling), an efficient feed-forward framework for 3D head reconstruction in dense semantic correspondence from multi-view images. Existing methods typically refine vertices independently via localized feature volumes. This approach couples memory-intensive feature sampling to mesh resolution, which limits scalability for dense topologies (> 10k vertices) and introduces surface noise. In contrast, SHELLS decouples feature extraction from mesh resolution via a hierarchical sampling strategy. We extract multi-view features using a DINOv2 backbone with LoRA adaptation, projectively sample a sparse global feature cloud, and predict an intermediate coarse mesh. This coarse prior guides the construction of layered, surface-aware sampling shells that serve as a discrete search space for the final reconstruction. SHELLS maintains surface consistency while using 88% less inference GPU memory (2.4GB vs. 20GB) than volumetric baselines. It reduces median registration error by 21% to 29% with a 3.5x inference speedup (0.08s vs. 0.29s) for 18k-vertex meshes. Notably, our model is trained exclusively on synthetic data yet generalizes effectively to real-world captures, eliminating the need for the costly, pre-registered multi-view datasets common in prior work.

2605.31281 2026-06-01 cs.CL

Wind Turbine Maintenance Log Labelling Framework: LLM-Driven Data Correction and Enrichment via Semantic Extraction of Reliability Intelligence

Max Malyi, Jonathan Shek, Alasdair McDonald, Andre Biscaya

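AI summary: Proposes a method that uses a large language model to automatically standardize and structure wind turbine maintenance logs, converting unstructured text into quantitative reliability metrics by correcting system codes and extracting taxonomies of failure modes and maintenance actions.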

Comments
An adjustable template containing the Python script architecture, applied dynamic prompts, and data schemas is hosted in an open-source GitHub repository: https://github.com/mvmalyi/llm-driven-wind-turbine-maintenance-log-labelling
English abstract

As wind turbine fleets age, data-driven reliability engineering is essential to optimise their operation and maintenance for service life extension and levelised cost of energy reduction. Failure event descriptions within historical maintenance logs are a source of valuable reliability intelligence. However, they typically appear as unstructured natural language entries, rendering them inaccessible for quantitative analysis. This paper presents a novel methodology leveraging a large language model (LLM) to systematically standardise and structure maintenance logs based on their free-text descriptors. Operating on a dataset of 16,316 maintenance logs from 280 turbines monitored over nine years, the developed model-agnostic framework autonomously corrected hierarchical system codes and extracted evidence-based taxonomies of maintenance actions and failure modes. The automated pipeline successfully structured over 70% of the dataset. It resolved pervasive misclassification issues, such as isolating previously unclassified pitch system faults and restoring missing system codes, and enriched the records by applying empirical taxonomies to label specific actions taken and failure modes addressed. By using system-based log batches to construct empirical dictionaries of failure modes, observable symptoms, dominant mechanisms, and candidate causes, this approach reduces the inherent subjectivity of manual failure modes and effects analysis (FMEA). Ultimately, the methodology provides a highly scalable, cost-effective blueprint for translating large sets of qualitative field observations into quantitative reliability metrics, laying the foundation for integrated root-cause analysis across the renewable energy sector, improved FMEA, and advanced predictive maintenance.

2605.31279 2026-06-01 eess.SP cs.AI cs.NI

Practical Cross-Band Channel Prediction for AI-RAN via Physics-Guided Deep Unfolding

Ruiqi Kong, He Chen, Xiaojun Lin

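AI summary: Proposes the GUIDE framework, which embeds wireless channel physics into differentiable layers to achieve generalization and real-time inference for cross-band channel prediction; in unseen environments its beamforming gain is 2.75x that of the deep learning baseline FIRE and 1.39x that of the model-based baseline R2F2, while running over 1610x faster than R2F2.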

详情
Comments
2 pages
AI中文摘要

为了使跨频段信道预测对AI原生RAN实用化,算法必须能够泛化到不同环境并支持实时推理。现有方法只能实现其中之一。为弥合这一差距,我们引入了GUIDE,一种物理引导的深度展开框架,将无线信道物理嵌入到可微层中。在未见环境中无需重新训练,GUIDE的波束赋形增益比基于深度学习的基线FIRE高2.75倍,且推理时间仅略有增加;比最强的基于模型的基线R2F2的波束赋形增益高1.39倍,同时运行速度快1610倍以上。

英文摘要

To make cross-band channel prediction practical for AI-native RAN, algorithms must generalize across diverse environments and support real-time inference. Existing approaches achieve one but not both. To bridge this gap, we introduce GUIDE, a physics-guided deep unfolding framework that embeds wireless channel physics into differentiable layers. Without retraining in unseen environments, GUIDE achieves 2.75x the beamforming gain of the deep learning-based baseline FIRE with only a slight increase in inference time, and 1.39x the beamforming gain of the strongest model-based baseline R2F2 while running over 1610x faster.
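
For readers unfamiliar with the metric, beamforming gain is commonly measured by building a maximum-ratio beamformer from the predicted channel and evaluating the received power on the true channel. The sketch below assumes that common definition; the abstract does not say which definition the paper uses, and the function name and the example vectors are purely illustrative.

```python
import math

def beamforming_gain(h_true, h_pred):
    """Received power |w^H h_true|^2 for w = h_pred / ||h_pred||.

    Assumption: maximum-ratio beamforming on the predicted channel;
    this is a textbook definition, not taken from the GUIDE paper.
    """
    norm = math.sqrt(sum(abs(x) ** 2 for x in h_pred))
    if norm == 0:
        raise ValueError("predicted channel must be non-zero")
    w = [x / norm for x in h_pred]
    # Inner product w^H h_true (conjugate the beamformer weights).
    inner = sum(wi.conjugate() * hi for wi, hi in zip(w, h_true))
    return abs(inner) ** 2

# Illustrative 3-antenna channel (made-up values).
h = [1 + 1j, 0.5 - 0.2j, -0.3 + 0.8j]
perfect = beamforming_gain(h, h)             # perfect prediction: equals ||h||^2
mismatched = beamforming_gain(h, [1, 1, 1])  # crude prediction: lower gain
```

Under this definition a perfect prediction attains the upper bound `||h||^2`, so a ratio such as "2.75x" compares how close two predictors come to that bound.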

2605.31277 2026-06-01 cs.CR cs.LG

GETA: Generalized Encrypted Traffic Analysis

GETA: 通用加密流量分析

Ransika Gunasekara, Rahat Masood, Salil Kanhere

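AI summary: Proposes GETA, a framework that combines meta-learning, embedding refinement, and self-attention and models flows as multivariate time series using only traffic metadata, enabling cross-protocol, few-shot encrypted traffic analysis; it outperforms existing methods on application identification, VPN classification, IoT device fingerprinting, and attack detection.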

详情
AI中文摘要

传统流量分析正受到加密、隧道和隐私保护协议的快速采用的根本性挑战,这些协议日益模糊数据包载荷并限制深度包检测(DPI)的实用性。尽管机器学习推进了加密流量分析,但现有方法通常仍依赖于特定协议的头部特征,依赖大量标注数据集,并在跨异构网络环境部署时性能下降。我们提出GETA,一个协议无关的加密流量分析框架,它仅使用流量元数据将网络流建模为多变量时间序列,从而避免了对数据包载荷或头部语义的依赖。GETA结合了元学习、嵌入优化和自注意力机制,以支持对未见过的领域进行少样本适应,仅需极少的标注数据。在涵盖应用识别、VPN流量分类、IoT设备指纹识别和攻击检测的九个公开数据集上,GETA始终优于最先进的基线方法。这些结果表明,GETA为现代加密网络中的鲁棒流量分析提供了一个实用且可泛化的基础。

英文摘要

Traditional traffic analysis is being fundamentally challenged by the rapid adoption of encryption, tunnelling, and privacy-preserving protocols, which increasingly obscure packet payloads and limit the usefulness of Deep Packet Inspection (DPI). Although machine learning has advanced encrypted traffic analysis, existing approaches often remain tied to protocol-specific header features, depend on large labelled datasets, and degrade when deployed across heterogeneous network environments. We present GETA, a protocol-agnostic framework for encrypted traffic analysis that models network flows as multivariate time series using only traffic metadata, thereby avoiding reliance on packet payloads or header semantics. GETA combines meta-learning, embedding refinement, and self-attention to support few-shot adaptation to previously unseen domains with minimal labelled data. Across nine public datasets spanning application identification, VPN traffic classification, IoT device fingerprinting, and attack detection, GETA consistently outperforms state-of-the-art baselines. These results show that GETA offers a practical and generalisable foundation for robust traffic analysis in modern encrypted networks.

2605.31276 2026-06-01 cs.LG

Learning Parametric Nitrogen Fertilizer Response Curves Using Neuro Symbolic Regression

使用神经符号回归学习参数化氮肥响应曲线

Giorgio Morales, John Sheppard

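AI summary: Proposes a neuro symbolic regression method that learns nitrogen fertilizer response curves without a preset functional form, and validates on real winter wheat data that it outperforms traditional models.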

详情
Comments
Accepted at the Workshop on Symbolic Regression and Equation Discovery, part of the 2026 IEEE World Congress on Computational Intelligence (WCCI) and the IEEE Congress on Evolutionary Computation (CEC)
AI中文摘要

准确模拟作物对氮肥的响应是精准农业中的基本挑战,因为它影响经济效益和环境可持续性。现有方法要么依赖预定义的参数形式,要么使用不透明的机器学习模型,限制了它们从数据中解释或发现特定地点函数关系的能力。在这项工作中,我们提出了一种神经符号回归方法,无需假设预定义的函数形式即可学习参数化的氮响应曲线。我们的方法集成了基于Transformer的多集符号骨架预测策略,能够发现多个子域或管理区之间的共享函数结构。通过构建多样化的输入子集并强制它们之间的一致性,该方法恢复了稳健的符号骨架,随后使用遗传算法将其拟合到观测数据上。该框架首先在合成一维问题上进行评估,以评估其在不同认知不确定性水平下的稳健性。结果表明,即使在数据稀缺的情况下,所提出的符号回归方法也能恢复正确的表达式。在这项工作中,我们展示了将我们的方法应用于真实冬小麦数据的结果,学习了田间不同管理区的不同参数化氮响应曲线。结果表明,发现的表达式不仅比二次平台和指数函数等传统模型实现了更低的拟合误差,而且还捕捉了不同空间区域的多样化函数行为。这证明了神经符号回归在发现特定地点农学关系和支持精准农业中知情决策方面的潜力。

英文摘要

Accurately modeling crop response to nitrogen (N) fertilization is a fundamental challenge in precision agriculture, as it impacts both economic returns and environmental sustainability. Existing approaches rely either on predefined parametric forms or on opaque machine learning models, limiting their ability to interpret or discover site-specific functional relationships from data. In this work, we propose a neuro symbolic regression (SR) approach to learn parametric N-response curves without assuming a predefined functional form. Our approach integrates a transformer-based Multi-Set Symbolic Skeleton Prediction strategy, enabling the discovery of shared functional structures across multiple subdomains or management zones (MZs). By constructing diverse input subsets and enforcing consistency across them, the method recovers robust symbolic skeletons that are subsequently fitted to observed data using a genetic algorithm. The framework was first evaluated on synthetic one-dimensional problems to assess its robustness under varying levels of epistemic uncertainty; the results demonstrate that the proposed SR approach recovers correct expressions even in data-scarce regimes. We then present the results of applying our method to real-world winter wheat data, learning distinct parametric N-response curves for different MZs within a field. The discovered expressions not only achieve lower fitting errors than traditional models such as quadratic-plateau and exponential functions, but also capture diverse functional behaviors across spatial regions. This demonstrates the potential of neuro SR to enable the discovery of site-specific agronomic relationships and support informed decision-making in precision agriculture.
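
The traditional baselines named in the abstract have simple closed forms. The following is a minimal sketch of one common parameterization of each; the exact forms and parameter values used in the paper are not given in the abstract, so the function names, the rise-to-maximum exponential form, and the numbers below are assumptions for illustration only.

```python
import math

def quadratic_plateau(n_rate, a, b, c):
    """Yield rises as a + b*N + c*N^2 (c < 0) up to the vertex, then stays flat."""
    if c >= 0:
        raise ValueError("c must be negative for a plateau to exist")
    n_star = -b / (2 * c)          # N rate at which the quadratic peaks
    n_eff = min(n_rate, n_star)    # beyond n_star the response is constant
    return a + b * n_eff + c * n_eff ** 2

def exponential_response(n_rate, y0, y_max, k):
    """Rise-to-maximum form: starts at y0 and approaches y_max at rate k."""
    return y_max - (y_max - y0) * math.exp(-k * n_rate)

# Hypothetical coefficients: the plateau starts at N = 200 with yield 6.0.
at_plateau = quadratic_plateau(250, a=2.0, b=0.04, c=-0.0001)
unfertilized = exponential_response(0, y0=2.0, y_max=6.0, k=0.01)
```

Both baselines fix the functional form in advance and only fit coefficients, which is the restriction the symbolic regression approach is meant to lift.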

2605.31275 2026-06-01 cs.HC cs.AI

Personalized to Persuade: The Effects of Contextualization and Warmth on Trust and Reliance in Conversational AI

个性化以说服:情境化和温暖对对话式AI中信任与依赖的影响

Mert Yazan, Suzan Verberne, Frederik Bungaran Ishak Situmeang

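AI summary: A 2x2 between-subjects experiment (N=380) examines how contextualization and conversational warmth interact to affect the persuasiveness of, and user reliance on, an AI assistant arguing against expert recommendations; contextualization reduces persuasiveness, combining it with warmth restores it through a crossover interaction, and AI literacy decouples trust from behavior.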

详情
AI中文摘要

人工智能(AI)代理通过根据用户的背景、兴趣和先前交互来定制解释,即情境化,从而个性化其响应。个性化已被视为政治或营销中的说服策略。然而,在用户通常缺乏先验知识的日常任务中,情境化的说服效果仍不明确。我们进行了一项2×2被试间实验(N=380),研究情境化与对话温暖相结合如何影响AI助手在反驳专家建议时的依赖性和说服力。我们的发现表明,情境化降低了AI的说服力,但其与温暖的结合通过交叉交互恢复了说服力。对AI的依赖在所有条件下都存在,并且不受对话设计的影响。信任强烈预测说服力和依赖,但情境化和温暖都不通过信任起作用。AI素养解耦了信任与行为:素养更高的用户对助手报告的信任较低,但更易被说服且更依赖其建议。这些结果表明,用户倾向于依赖AI代理而非人类专家判断;然而,界面级别的对话设计选择在塑造行为方面的作用有限。

英文摘要

Artificial Intelligence (AI) agents personalize their responses by tailoring explanations to users' backgrounds, interests, and prior interactions, a practice referred to as contextualization. Personalization has been identified as a persuasive strategy in politics and marketing. However, the persuasive effect of contextualization in everyday tasks, where users often lack prior knowledge, remains unclear. We conducted a $2\times2$ between-subjects experiment ($N = 380$) examining how contextualization, combined with conversational warmth, shapes reliance and persuasiveness of an AI assistant arguing against expert recommendations. Our findings reveal that contextualization reduces the persuasive power of AI, but its combination with warmth restores persuasiveness through a crossover interaction. Reliance on AI is present across conditions and is invariant to the conversational design. Trust strongly predicts both persuasion and reliance, yet neither contextualization nor warmth operates through trust. AI literacy decouples trust from behavior: more literate users report lower trust in the assistant, yet are more persuaded and more reliant on its advice. These results suggest that users are prone to deferring to AI agents over human expert judgment; however, interface-level conversational design choices have a limited role in shaping the behavior.

2605.31273 2026-06-01 cs.LG

Survival Reinforcement Learning: Toward Scalable Self-Supervised RL

生存强化学习:迈向可扩展的自监督强化学习

Franki Nguimatsia-Tiofack, Fabian Schramm, Théotime Le Hellard, Justin Carpentier

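AI summary: Proposes Survival Reinforcement Learning (SRL), an online classification-based method that maximizes the agent's dwell time at goal states to address the uniformity-tolerance dilemma in contrastive reinforcement learning, improving performance by 2x to 8x on long-horizon locomotion tasks.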

详情
AI中文摘要

虽然自监督对比强化学习(CRL)展现了显著的深度扩展能力,成功使用了超过64层的网络,但由于对比损失固有的均匀性-容忍性困境,扩展的CRL在长时程目标条件规划中仍然存在困难。我们引入了生存强化学习(SRL),一种基于在线分类的替代方法,通过最大化智能体在目标状态的停留时间来扩展生存价值学习框架。SRL绕过了CRL的结构约束,并缓解了生存框架固有的“bang-bang”控制解,这种控制解在复杂动态系统中往往引发不良行为。在多种机器人基准测试中,扩展的SRL在操作任务上与最先进的CRL相当,并在稳定的长时程运动任务上性能提升2至8倍。我们的结果提供了强有力的额外证据,表明基于分类的方法可能成为扩展强化学习这一更广泛努力中的关键原语。

英文摘要

While self-supervised Contrastive Reinforcement Learning (CRL) has shown remarkable depth-scaling capabilities, successfully using networks over 64 layers, scaled CRL still struggles with long-horizon goal-conditioned planning due to the uniformity-tolerance dilemma inherent in contrastive losses. We introduce Survival Reinforcement Learning (SRL), an online classification-based alternative that extends the survival value learning framework by maximizing the agent's dwell time at target goals. SRL bypasses the structural constraints of CRL and mitigates the "bang-bang" control solutions inherent to survival frameworks, which often induce undesirable behavior in complex dynamical systems. Evaluated across diverse robotic benchmarks, scaled SRL matches state-of-the-art CRL on manipulation tasks and outperforms it by 2x to 8x on stable, long-horizon locomotion tasks. Our results provide strong additional evidence that classification-based methods may serve as a key primitive in the broader effort to scale reinforcement learning.

2605.31272 2026-06-01 cs.LG

Algorithmic Recourse of In-Context Learning for Tabular Data

表格数据的上下文学习算法补救

Wenshuo Dong, Jiaming Zhang, Shaopeng Fu, Hongbin Lin, Di Wang, Lijie Hu

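AI summary: For black-box models in tabular in-context learning, proposes ASR-ICL, an adaptive subspace recourse framework that uses zeroth-order optimization to efficiently generate actionable, sparse recourse; the theory shows that recourse is bounded and converges to classical solutions as the context grows.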

详情
Comments
Accepted by ICML 2026
AI中文摘要

随着预测模型越来越多地部署在信用审批等高风险场景中,对受影响的个体提供补救的后验方法需求日益增长。许多此类模型处理表格数据,其中特征对应现实世界的属性。最近,上下文学习(ICL)使大型语言模型能够通过在推理时以标注示例为条件进行表格预测,而无需显式训练。然而,ICL下表格决策的算法补救仍基本未被探索。在这项工作中,我们首次研究了ICL下表格数据的算法补救。我们进行了理论分析,表明补救仍然定义良好且有界,并刻画了随着上下文增大,补救如何收敛到经典解。在实践中,我们提出了一种新颖的零阶补救框架——自适应子空间补救用于上下文学习(ASR-ICL),该框架高效地为黑箱ICL模型生成可操作且稀疏的补救。所提出的框架自然地扩展到多类表格任务。在多个真实世界数据集和模型上的实验表明,ASR-ICL以更少的查询实现了与现有方法相当的补救质量,并经验性地验证了预测的收敛行为,支持了我们的理论分析。

英文摘要

As predictive models are increasingly deployed in high-stakes settings such as credit approval, there is a growing need for post-hoc methods that provide recourse to affected individuals. Many such models operate on tabular data, where features correspond to real-world attributes. Recently, in-context learning (ICL) has enabled large language models to perform tabular prediction by conditioning on labeled examples at inference time, without explicit training. However, algorithmic recourse for tabular decision-making under ICL remains largely unexplored. In this work, we present the first study of algorithmic recourse for tabular data under ICL. We carry out a theoretical analysis, showing that recourse remains well-defined and bounded, and we characterize how recourse converges toward classical solutions as the context size increases. In practice, we propose a novel zeroth-order recourse framework, Adaptive Subspace Recourse for In-Context Learning (ASR-ICL), that efficiently generates actionable and sparse recourse for black-box ICL models. The proposed framework naturally extends to multi-class tabular tasks. Experiments across multiple real-world datasets and models demonstrate that ASR-ICL achieves recourse quality comparable to existing methods with fewer queries and empirically confirm the predicted convergence behavior, supporting our theoretical analysis.
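
To make "zeroth-order, actionable, and sparse" concrete, here is a generic greedy coordinate search that only queries a black-box score and only moves features marked as actionable. It is a sketch of the general idea under stated assumptions, not the ASR-ICL algorithm: the function names, the step sizes, and the toy logistic scorer are all hypothetical.

```python
import math

def greedy_zeroth_order_recourse(score, x, actionable, target=0.5,
                                 step=0.1, eps=1e-3, max_iters=200):
    """Move one actionable feature at a time until score(x) >= target.

    Gradients are never computed analytically; each partial derivative is
    estimated from two black-box queries (central finite difference).
    """
    x = list(x)
    for _ in range(max_iters):
        if score(x) >= target:
            return x
        best_i, best_g = None, 0.0
        for i in actionable:
            up, down = x[:], x[:]
            up[i] += eps
            down[i] -= eps
            g = (score(up) - score(down)) / (2 * eps)
            if abs(g) > abs(best_g):
                best_i, best_g = i, g
        if best_i is None:      # score is flat in every actionable direction
            return None
        # Changing only the most sensitive feature keeps the recourse sparse.
        x[best_i] += step if best_g > 0 else -step
    return x if score(x) >= target else None

def toy_score(x):
    """Stand-in for a black-box approval probability (hypothetical)."""
    return 1.0 / (1.0 + math.exp(-(2.0 * x[0] - x[1] + 0.5 * x[2] - 1.0)))

# Feature 1 is treated as immutable; only features 0 and 2 may change.
recourse = greedy_zeroth_order_recourse(toy_score, [0.0, 1.0, 0.0], actionable=[0, 2])
```

Each iteration costs two queries per actionable feature, which is why query efficiency matters when the black box is a large ICL model.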

2605.31271 2026-06-01 cs.CV

DriveMA: Driving Vision-Language-Action Models with verifiable Meta-Actions

DriveMA:基于可验证元动作的驾驶视觉-语言-动作模型

Weicheng Zheng, Yixin Huang, Qiao Sun, Derun Li, Hang Zhao

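AI summary: Proposes DriveMA, a framework that bridges the language-action gap with verifiable meta-actions and combines action-centric supervised training with reinforcement learning for end-to-end driving planning, achieving state-of-the-art performance on the Waymo Open Dataset.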

详情
Comments
arXiv admin note: text overlap with arXiv:2605.21273
AI中文摘要

驾驶视觉-语言-动作模型(Driving VLAs)旨在利用语言改进端到端规划,但语言-动作差距限制了这一前景。我们提出DriveMA,一个基于可验证元动作的Driving VLA框架,该元动作将未来自我运动总结为紧凑的语言域意图,并可通过轨迹接地标注流水线从专家轨迹构建,以及通过基于规则的投影对生成轨迹进行验证。DriveMA利用这种可验证性,采用以动作为中心的监督训练和数据高效的回合级信用分配强化学习框架,通过密集奖励和精确信用分配明确地将高层决策与低层轨迹规划对齐。DriveMA在Waymo Open Dataset基于视觉的端到端驾驶上设立了新的最先进水平,2B模型获得8.060的评分者反馈分数,4B模型进一步提升至8.079;同时在NAVSIM上获得了具有竞争力的闭环规划性能。这些结果表明,即使是一个简单的元动作接口,在可验证并针对语言-动作对齐优化后,也能实现最先进的规划。代码、数据和模型将公开发布以促进未来研究。

英文摘要

Driving Vision-Language-Action Models (Driving VLAs) aim to use language to improve end-to-end planning, but the language-action gap limits this promise. We propose DriveMA, a Driving VLA framework built on verifiable meta-actions, which summarize future ego motion into compact language-domain intentions. Meta-actions can be constructed from expert trajectories with a trajectory-grounded annotation pipeline and verified against generated trajectories through rule-based projection. DriveMA exploits this verifiability with action-centric supervised training and a data-efficient turn-level credit assignment reinforcement learning framework, explicitly aligning high-level decisions with low-level trajectory planning through dense rewards and precise credit assignment. DriveMA sets a new state of the art on the Waymo Open Dataset Vision-based E2E Driving, achieving a Rater Feedback Score of 8.060 with a 2B model and further improving it to 8.079 with a 4B model; it also obtains competitive closed-loop planning performance on NAVSIM. These results show that even a simple meta-action interface can achieve state-of-the-art planning when made verifiable and optimized for language-action alignment. Code, data, and models will be released to facilitate future research.

2605.31268 2026-06-01 cs.CL

Mellum2 Technical Report

Mellum2 技术报告

Marko Kojic, Ivan Bondyrev, Aral de Moor, Joseph Shtok, Petr Borovlev, Kseniia Lysaniuk, Madeeswaran Kannan, Ivan Dolgov, Nikita Pavlichenko

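AI summary: Introduces Mellum 2, a 12B-parameter Mixture-of-Experts language model (2.5B active parameters per token) specialized in software engineering; through architectural choices and training optimizations it is competitive with 4B-14B open-weight models on code generation, mathematical reasoning, and other benchmarks.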

详情
AI中文摘要

我们提出 Mellum 2,一个开放权重的12B参数混合专家(MoE)语言模型,每token激活2.5B参数。Mellum 2是一个通用语言模型,专精于软件工程,涵盖代码生成与编辑、调试、多步推理、工具使用与函数调用、智能体编码以及对话式编程辅助,它是之前专注于补全的4B密集模型Mellum的继任者。架构基于混合专家(64个专家,8个激活),结合了分组查询注意力(4个KV头)、每四层中三层使用滑动窗口注意力,以及一个多token预测头,该头同时作为辅助预训练目标和内置的推测解码草稿模型;每个选择都通过消融实验验证,并以商品GPU上的推理效率作为设计约束。预训练涵盖约10.6万亿token,通过三个阶段的学习课程,从多样化的网络数据逐步转向精选的代码和数学内容,使用Muon优化器在FP8混合精度下进行优化,并采用预热-保持-衰减调度(线性衰减至零)。预训练基础模型通过层选择性YaRN扩展到128K上下文窗口,然后分两个阶段进行后训练(监督微调后接RLVR),产生两个发布变体:直接回答的Instruct模型和在最终答案前输出显式推理轨迹的Thinking模型。在代码生成、数学与推理、工具使用、知识和安全基准上,Mellum 2与4B-14B范围内的开放权重基线模型竞争,同时每token计算量相当于2.5B密集模型。我们在Apache 2.0许可下发布基础、指令和思考检查点,以及关于架构决策、数据管道和训练配方的本报告。

英文摘要

We present Mellum 2, an open-weight 12B-parameter Mixture-of-Experts (MoE) language model with 2.5B active parameters per token. Mellum 2 is a general-purpose language model specialized in software engineering, spanning code generation and editing, debugging, multi-step reasoning, tool use and function calling, agentic coding, and conversational programming assistance, and it is the successor to the completion-focused 4B dense Mellum model. The architecture builds on the Mixture-of-Experts (64 experts, 8 active) and combines Grouped-Query Attention with 4 KV heads, Sliding Window Attention on three of every four layers, and a single Multi-Token Prediction head that doubles as both an auxiliary pre-training objective and a built-in draft model for speculative decoding; each choice was validated by ablation with inference efficiency on commodity GPUs as a design constraint. Pre-training spans approximately 10.6 trillion tokens through a three-phase curriculum that progressively shifts the mixture from diverse web data toward curated code and mathematical content, optimized with Muon under FP8 hybrid precision and a Warmup-Hold-Decay schedule with linear decay to zero. The pre-trained base is extended to a 128K context window via a layer-selective YaRN and then post-trained in two stages (supervised fine-tuning followed by RLVR), yielding two released variants: an Instruct model that answers directly and a Thinking model that emits an explicit reasoning trace before its final answer. Across code generation, math and reasoning, tool use, knowledge, and safety benchmarks, Mellum 2 is competitive with open-weight baselines in the 4B-14B range while running at the per-token compute of a 2.5B dense model. We release the base, instruct, and thinking checkpoints, together with this report on the architecture decisions, data pipeline, and training recipe behind them, under the Apache 2.0 license.
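
The Warmup-Hold-Decay schedule mentioned in the abstract can be sketched as a piecewise function of the training step. The abstract only states that the decay is linear to zero; the linear warm-up shape, the function name, and all step counts below are assumptions chosen for illustration, not the values used to train Mellum 2.

```python
def warmup_hold_decay_lr(step, peak_lr, warmup_steps, hold_steps, decay_steps):
    """Learning rate at `step` for a Warmup-Hold-Decay schedule.

    Assumed shape: linear warm-up from 0, constant hold at peak_lr,
    then linear decay to exactly zero.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < warmup_steps + hold_steps:
        return peak_lr
    into_decay = step - warmup_steps - hold_steps
    return peak_lr * max(0.0, 1.0 - into_decay / decay_steps)

# Hypothetical configuration: 100 warm-up, 200 hold, 100 decay steps.
schedule = [warmup_hold_decay_lr(s, 1e-3, 100, 200, 100) for s in (0, 50, 150, 350, 400)]
```

Compared with a cosine schedule, the long hold phase keeps the learning rate at its peak for most of training, and the final decay phase can be started from any held checkpoint.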

2605.31266 2026-06-01 cs.CV cs.AI cs.LG

Envisioning Beyond the Few: Disentangled Semantics and Primitives for Few-Shot Atypical Layout-to-Image Generation

超越少数:用于少样本非典型布局到图像生成的解耦语义与基元

Nan Bao, Yifan Zhao, Wenzhuang Wang, Jia Li

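AI summary: To address representation fragmentation in few-shot atypical layout-to-image generation, proposes disentangling semantics from visual details via Semantic Anchoring and Primitive Imbuing, enabling robust few-shot adaptation.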

详情
Comments
Accepted to ICML 2026; code available at https://github.com/iCVTEAM/DSP
AI中文摘要

布局到图像(L2I)任务通过对象类别和空间布局实现对图像生成的细粒度控制。然而,现有的L2I方法在少样本非典型设置下会产生碎片化和扭曲的生成结果。我们将这种失败称为表示碎片化,源于将语义身份与视觉细节纠缠在一起的粒度不匹配。为了解决这个问题,我们提出了一种表示驱动的框架,将语义与基元解耦,以实现鲁棒的少样本适应。具体来说,语义锚定将类别语义聚合到锚点中以实现稳定的身份,而基元注入则建模可重新组合的基元以实现鲁棒的局部细节建模。概念引导进一步通过显著性感知目标调节优化,以保持前景语义一致性。大量实验表明,在5样本设置下,我们的方法在视觉保真度和跨不同非典型领域的对齐方面,均优于最先进的L2I方法。源代码公开于 https://github.com/iCVTEAM/DSP。

英文摘要

The layout-to-image (L2I) task enables fine-grained control over image generation via object categories and spatial layouts. However, existing L2I methods yield fragmented and distorted generations under few-shot atypical settings. We term this failure representation fragmentation; it arises from a granularity mismatch that entangles semantic identity with visual details. To address this issue, we propose a representation-driven framework that disentangles semantics from primitives for robust few-shot adaptation. Specifically, Semantic Anchoring aggregates categorical semantics into anchors for stable identity, while Primitive Imbuing models recomposable primitives for robust local detail modeling. Conceptual Steering further regulates optimization with a saliency-aware objective to preserve foreground semantic consistency. Extensive experiments demonstrate consistent improvements in the 5-shot regime over state-of-the-art L2I methods in both visual fidelity and alignment across diverse atypical domains. The source code is publicly available at https://github.com/iCVTEAM/DSP.

2605.31264 2026-06-01 cs.AI cs.CL cs.LG

COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation

COLLEAGUE.SKILL: 通过专家知识蒸馏实现自动化AI技能生成

Tianyi Zhou, Dongrui Liu, Leitao Yuan, Jing Shao, Xia Hu

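AI summary: Proposes an automated system that distills heterogeneous traces into inspectable, correctable, agent-usable skill packages, for generating person-grounded AI skills.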

详情
Comments
12 pages, 4 figures
AI中文摘要

LLM代理不仅被期望完成孤立的任务,还要承载人类专业知识、判断和互动风格的有限表示。构建这种基于人的代理仍然困难,因为与人或角色相关的可操作知识通常嵌入在异构痕迹中,而不是写成清晰的指令。现有的记忆和角色系统捕捉了这些证据的片段,而技能框架提供了可移植的打包格式;然而,没有端到端的工作流将这些痕迹蒸馏成可检查、可修正和代理可用的技能。我们提出了一个自动化的痕迹到技能蒸馏系统,通过专家知识蒸馏生成基于人的AI技能。给定目标人物或角色的材料,COLLEAGUE.SKILL 生成一个版本化的技能包,包含两个协调的轨道:一个能力轨道,用于实践、心理模型和决策启发式;一个边界行为轨道,用于沟通风格、互动规则和修正历史。该包可以被检查、调用、通过自然语言反馈更新、回滚、跨代理主机安装,并可选择性地为受控分发做准备。我们描述了开源系统中实现的人工制品契约、生成工作流、修正生命周期、部署表面和领域预设。在撰写本文时,公共仓库拥有约18.5k个GitHub星标;画廊列出了来自165位贡献者的215个技能,以及跨列出的技能卡累计超过10万个星标。该系统说明了基于人的技能如何表示为可移植、可修正的包,而不是不透明的提示或隐藏的记忆。

英文摘要

LLM agents are increasingly expected not only to complete isolated tasks, but also to carry bounded representations of human expertise, judgment, and interaction style. Building such person-grounded agents remains difficult because actionable knowledge associated with a person or role is usually embedded in heterogeneous traces rather than written as clean instructions. Existing memory and persona systems capture fragments of this evidence, while skill frameworks provide portable packaging formats; however, there is no end-to-end workflow for distilling these traces into inspectable, correctable, and agent-usable skills. We present an automated trace-to-skill distillation system for generating person-grounded AI skills via expert knowledge distillation. Given materials from a target person or role, COLLEAGUE.SKILL produces a versioned skill package with two coordinated tracks: a capability track for practices, mental models, and decision heuristics, and a bounded behavior track for communication style, interaction rules, and correction history. The package can be inspected, invoked, updated through natural-language feedback, rolled back, installed across agent hosts, and optionally prepared for controlled distribution. We describe the artifact contract, generation workflow, correction lifecycle, deployment surface, and domain presets implemented in the open-source system. At the time of writing, the public repository has approximately 18.5k GitHub stars; the gallery lists 215 skills from 165 contributors and more than 100k cumulative stars across listed skill cards. The system illustrates how person-grounded skills can be represented as portable, correctable packages rather than opaque prompts or hidden memories.

2605.31261 2026-06-01 cs.LG cs.AI stat.ML

Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning

为什么线性循环记忆在部分可观测强化学习中有效

Yike Zhao, Onno Eberhard, Malek Khammassi, Ali H. Sayed, Michael Muehlebach

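AI summary: By constructing two linear filters, the paper gives a theoretical justification for the effectiveness of linear recurrent neural networks as memory units in partially observable reinforcement learning, and extends the results to action-controlled hidden Markov models.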

详情
AI中文摘要

线性循环神经网络家族在部分可观测强化学习中作为循环记忆单元表现出色。我们通过构造并研究两种线性滤波器为其经验有效性提供了理论依据:(i) 第一种在确定性转移矩阵下精确重现隐马尔可夫模型(HMM)中信念向量的预softmax logits,从而作为最优策略学习的充分统计量;(ii) 第二种在近似确定性转移矩阵下实现状态解码误差趋近于零,从而将状态模糊性降至接近零。结果扩展到动作控制的HMM,其中相应的线性滤波器变为随时间变化且依赖于动作的动态。我们通过数值实验说明了主要结果,并进一步展示了所构造的线性滤波器在小型强化学习游戏中作为强特征提取器的能力。

英文摘要

The family of linear recurrent neural networks has shown strong performance as recurrent memory units in partially observable reinforcement learning. We provide a theoretical justification for their empirical effectiveness by constructing and studying two linear filters: (i) the first exactly reproduces the pre-softmax logits of the belief vector in a hidden Markov model (HMM) under a deterministic transition matrix, thereby serving as a sufficient statistic for optimal policy learning; (ii) the second achieves vanishing state-decoding error under a nearly deterministic transition matrix, thus reducing state ambiguity to near zero. The results extend to action-controlled HMMs, where the corresponding linear filters become time-varying with action-dependent dynamics. We illustrate our main results through numerical experiments and further show that the constructed linear filter serves as a strong feature extractor in a small reinforcement learning game.
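
Claim (i) can be checked numerically in a toy case. The sketch below makes a simplifying assumption that is not stated in the abstract: the deterministic transition is a permutation of the states, so each state has exactly one predecessor. In that case the Bayes update becomes additive in log space, `z' = P z + log O(o | .)`, which is a linear recurrence driven by the observation, and its softmax equals the exact belief. The paper's actual construction may differ; all names and probabilities here are illustrative.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def bayes_step(belief, obs, next_state, emission):
    """Exact HMM forward filter: predict through the transition, then reweight."""
    n = len(belief)
    predicted = [0.0] * n
    for s, p in enumerate(belief):
        predicted[next_state[s]] += p
    unnorm = [predicted[s] * emission[s][obs] for s in range(n)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def linear_logit_step(z, obs, next_state, emission):
    """Linear recurrence on logits: permute, then add log-emission terms."""
    shifted = [0.0] * len(z)
    for s, v in enumerate(z):
        shifted[next_state[s]] = v   # valid because next_state is a permutation
    return [shifted[s] + math.log(emission[s][obs]) for s in range(len(z))]

next_state = [1, 2, 0]                             # deterministic cyclic transition
emission = [[0.7, 0.3], [0.4, 0.6], [0.1, 0.9]]    # emission[state][observation]
belief = [0.5, 0.3, 0.2]
z = [math.log(p) for p in belief]
for obs in [0, 1, 1, 0, 1]:
    belief = bayes_step(belief, obs, next_state, emission)
    z = linear_logit_step(z, obs, next_state, emission)
max_gap = max(abs(a - b) for a, b in zip(belief, softmax(z)))
```

The logit recurrence never normalizes; the softmax absorbs the missing constant, which is why the filter state can stay linear while still determining the belief.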

2605.31259 2026-06-01 cs.LG

Lightweight CNN-Based Anomaly Detection for High Voltage Converter Modulators in the Spallation Neutron Source

基于轻量级CNN的散裂中子源高压转换器调制器异常检测

Alberto D. Cencillo, Leonardo Concepción, Julián Luengo, Isaac Triguero

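AI summary: For anomaly detection on multi-channel signals from high voltage converter modulators, varies the order of temporal filtering and cross-channel mixing and adds adaptive channel reweighting, reaching an AUC-PR of 0.816 and an AUC-ROC of 0.934 on a public dataset and outperforming existing methods.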

详情
Comments
21 pages, 8 figures
AI中文摘要

高功率脉冲转换器的非计划停机是大型加速器设施停机的主要原因。在散裂中子源(SNS)中,高压转换器调制器(HVCM)始终是丢失束流时间的第二大贡献者。每个HVCM脉冲通过跨电流、电压和磁通量的传感器通道记录,这些通道的相互交互编码了系统的运行状态。故障前兆在这些通道中并非均匀显现:根据故障类型,它们可能改变单个信号的时间结构,改变通道间的统计依赖性,或两者兼有。现有的深度学习方法通常使用标准卷积流水线处理多通道信号,该流水线从第一层开始就纠缠时间和跨通道操作,使得模型没有明确的机制来表示通道独立性或结构化的通道间交互。我们假设架构归纳偏差,特别是时间滤波和跨通道混合的顺序,在这类数据的检测性能中起着核心作用。为了验证这一点,我们改变了这两个操作的顺序,并检查每个脉冲的自适应通道重加权是否进一步提高灵敏度。在涵盖所有四个SNS子系统(RFQ、DTL、CCL、SCL)的公开HVCM数据集上评估,我们最好的变体实现了池化AUC-PR为0.816和AUC-ROC为0.934,在大多数子系统和六个故障家族中的五个上优于现有技术。消融实验识别出三个主导输入通道,并将每个故障家族的性能与前兆表现为单个通道的幅度偏移还是需要联合通道表示才能显现的更细微模式联系起来。

英文摘要

Unscheduled trips of high-power pulsed converters are a leading source of downtime at large accelerator facilities. At the Spallation Neutron Source (SNS), the High Voltage Converter Modulators (HVCMs) are consistently the second-largest contributor to lost beam time. Each HVCM pulse is recorded across sensor channels spanning currents, voltages, and magnetic fluxes, whose mutual interactions encode the operating state of the system. Fault precursors do not manifest uniformly across these channels: depending on fault type, they may alter the temporal structure of individual signals, change the statistical dependencies among channels, or both. Existing deep-learning approaches typically process multi-channel signals with standard convolutional pipelines that entangle temporal and cross-channel operations from the first layer, giving the model no explicit mechanism to represent channel independence or structured inter-channel interaction. We hypothesise that architectural inductive bias, specifically the ordering of temporal filtering and cross-channel mixing, plays a central role in detection performance on this class of data. To test this, we vary the order in which these two operations are applied, and examine whether per-pulse adaptive channel reweighting further improves sensitivity. Evaluated on the public HVCM dataset across all four SNS subsystems (RFQ, DTL, CCL, SCL), our best variant achieves a pooled AUC-PR of 0.816 and AUC-ROC of 0.934, outperforming the state of the art on most subsystems and five of the six fault families. Ablations identify three dominant input channels and link per-fault-family performance to whether precursors manifest as amplitude shifts in individual channels or as subtler patterns requiring joint channel representations to surface.

2605.31257 2026-06-01 cs.LG stat.ML

Fraud Type Decomposition and the Observation-Mechanism Taxonomy: Class-Specific Detection Limits in Payment Networks

欺诈类型分解与观测机制分类:支付网络中的类别特定检测极限

Gaurav Dhama

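AI summary: Introduces an observation-mechanism taxonomy that partitions fraud into five classes, proves that estimating fraud rates per class and then aggregating beats pooled estimation, and derives the theoretical constraint on detection for each class.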

详情
Comments
59 pages
AI中文摘要

支付网络中的欺诈检测依赖于通过异质且不完美的观测过程生成的标签,但现有方法将欺诈视为同质二元变量。我们证明这一假设在结构上不正确,并导致可证明的低效。我们引入一个观测机制分类,将欺诈分为五类,每类由不同的审查和标记流程定义。我们证明按类别分别估计欺诈率并聚合严格优于整体估计,效率差距由异质观测率导致的Jensen惩罚刻画。对于每类,我们推导了检测的绑定理论约束,包括内生标签腐败、结构不可观测性和特征非信息性。这些结果确立了欺诈检测本质上是一组不同的估计问题,每个问题由其自身的观测结构和检测极限支配。

英文摘要

Fraud detection in payment networks relies on labels generated through heterogeneous and imperfect observation processes, yet existing approaches treat fraud as a homogeneous binary variable. We show that this assumption is structurally incorrect and leads to provable inefficiency. We introduce an observation-mechanism taxonomy that partitions fraud into five classes, each defined by a distinct censorship and labeling pipeline. We prove that estimating fraud rates separately by class and aggregating strictly dominates pooled estimation, with the efficiency gap characterized as a Jensen penalty arising from heterogeneous observation rates. For each class, we derive the binding theoretical constraint on detection, including endogenous label corruption, structural non-observability, and feature non-informativeness. These results establish that fraud detection is fundamentally a collection of distinct estimation problems, each governed by its own observation structure and detection limit.

2605.31256 2026-06-01 cs.RO

Before Parc Fermé: RL-Time Pruning for Efficient Embodied LLMs in Autonomous Driving

在封闭停车场之前:面向自动驾驶高效具身大语言模型的强化学习时间剪枝

Luca Benfenati, Ali Azimi, Matteo Risso, Fabio Carapellese, Daniele Jahier Pagliari, Alessio Burrello

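AI summary: Proposes BPF, a strategy that prunes during reinforcement learning, compressing embodied LLM controllers under task-specific supervision and closed-loop feedback and achieving a better performance-memory-throughput trade-off in an autonomous-driving control pipeline.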

详情
AI中文摘要

具身大语言模型越来越多地被用作机器人控制管道中的推理模块,以改善人机交互,但其内存和生成延迟使得实时部署变得困难。剪枝可以降低这些成本,但对于经历多个预训练和后训练阶段的控制器,关键问题不仅在于剪枝多少,还在于何时进行剪枝。在这项工作中,我们提出了Before Parc Fermé(BPF),一种在强化学习期间执行的剪枝策略,它在具身大语言模型控制器仍在针对闭环行为进行优化时对其进行压缩。这使得剪枝决策能够考虑塑造最终控制器的任务特定监督和闭环反馈。我们提出了两种变体:BPF-RL,它在强化学习期间通过按预定义训练间隔移除部分模型来执行迭代剪枝;以及BPF-SFT/RL,它首先在SFT期间移除部分模型结构,然后在强化学习期间使用与BPF-RL相同的迭代策略进一步压缩,直到达到目标剪枝比率。我们在基于LLM的自动驾驶控制管道RobotxR1上,使用已建立的LLM剪枝框架(LLM-Pruner)评估BPF,并将其与训练后剪枝、带有强化学习恢复的训练后剪枝、SFT阶段剪枝以及来自同一系列的小型密集模型进行比较。我们的结果表明,在所考虑的剪枝策略中,BPF提供了最佳的任务性能与内存和吞吐量之间的权衡。在压缩较大的RobotxR1模型时,BPF-SFT/RL实现了比直接选择同一系列中较小密集模型更好的尺寸-端到端性能权衡,以每损失一个百分点的控制适应性所移除的参数数量衡量,提升幅度为1.69倍。在目标机器人平台上搭载的Jetson AGX Orin上,紧凑模型将解码吞吐量提高了高达27%。

英文摘要

Embodied Large Language Models (LLMs) are increasingly used as reasoning modules in robotic control pipelines to improve human-robot interaction, but their memory and generation latency make real-time deployment difficult. Pruning can reduce these costs, but for controllers that undergo multiple pre- and post-training phases, the crucial question is not only how much to prune, but when pruning should occur. In this work, we propose Before Parc Fermé (BPF), a pruning strategy performed during RL that compresses embodied LLM controllers while they are still being optimized for closed-loop behavior. This allows pruning decisions to account for the task-specific supervision and closed-loop feedback that shape the final controller. We propose two variants: BPF-RL, which performs iterative pruning during RL by removing part of the model at predefined training intervals, and BPF-SFT/RL, which first prunes part of the model structure during SFT and then further compresses it during RL using the same iterative strategy as BPF-RL until the target pruning ratio is reached. We evaluate BPF on RobotxR1, an LLM-based autonomous-driving control pipeline, using an established LLM pruning framework (LLM-Pruner), and compare it against post-training pruning, post-training pruning with RL recovery, SFT-stage pruning, and smaller dense models from the same family. Our results show that BPF provides the best task-performance vs. memory and throughput trade-off among the considered pruning strategies. When compressing the larger RobotxR1 models, BPF-SFT/RL achieves a $1.69\times$ better size-end-to-end performance trade-off than directly selecting a smaller dense model from the same family, measured as removed parameters per lost percentage point of control adaptability. On the Jetson AGX Orin mounted on the target robotic platform, the compact models improve decode throughput by up to $27\%$.
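
Two quantities in this abstract are easy to pin down in code: the cumulative pruning ratio under an interval-based iterative schedule, and the trade-off metric "removed parameters per lost percentage point". The sketch below assumes equal pruning increments per round, which the abstract does not specify; the function names and every number are hypothetical and do not reproduce results from the paper.

```python
def cumulative_pruning_ratio(step, interval, num_rounds, target_ratio):
    """Fraction of the model removed by `step`, pruning once every `interval` steps.

    Assumption: each of the `num_rounds` rounds removes an equal share
    of `target_ratio`; pruning stops once the target is reached.
    """
    rounds_done = min(step // interval, num_rounds)
    return target_ratio * rounds_done / num_rounds

def params_removed_per_lost_point(params_before, params_after,
                                  score_before, score_after):
    """Trade-off metric: parameters removed per percentage point of score lost."""
    lost_points = score_before - score_after
    if lost_points <= 0:
        raise ValueError("metric is only defined when the score decreases")
    return (params_before - params_after) / lost_points

# Hypothetical run: 4 pruning rounds, one every 1000 RL steps, 20% target.
midway = cumulative_pruning_ratio(2500, interval=1000, num_rounds=4, target_ratio=0.2)
final = cumulative_pruning_ratio(10000, interval=1000, num_rounds=4, target_ratio=0.2)
tradeoff = params_removed_per_lost_point(1000, 800, 90.0, 88.0)
```

A higher value of the metric means more parameters were removed for the same loss in task performance, which is the sense in which the abstract reports a ratio between two compression routes.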

2605.31254 2026-06-01 cs.AI

Formalizing and falsifying causal pathways of rare events

罕见事件因果路径的形式化与证伪

Anahita Haghighat, Dominik Janzing

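AI summary: Building on formalizations of root cause analysis for rare events in structural equation models, proposes a formal definition of a causal pathway and discusses its testable implications, introducing an abstraction to pathways of rare events that bridges simple causal explanations and detailed causal modeling.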

详情
Comments
accepted for ICML 2026
AI中文摘要

基于最近在结构方程模型中对罕见事件(“异常值”)根因分析的形式化,我们提出了因果路径的形式定义并讨论了其可检验含义。我们识别了这些含义仅依赖于由罕见事件路径定义的因果抽象而非底层系统完整因果图的条件。据此,我们引入了一种因果结构到罕见事件路径的抽象,该抽象桥接了简单的口头因果解释与详细的因果建模。

英文摘要

Building on recent formalizations of root cause analysis for rare events ("outliers") in structural equation models, we propose a formal definition of a causal pathway and discuss its testable implications. We identify conditions under which these implications depend only on a causal abstraction defined by the pathway of rare events, rather than on the full causal graph of the underlying system. Accordingly, we introduce an abstraction of causal structure to pathways of rare events that bridges simple verbal causal explanations and detailed causal modeling.

2605.31251 2026-06-01 cs.CV cs.AI

ERGeoBench: A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models

ERGeoBench:多模态大语言模型中具身推理与地理定位的综合基准

Kaiwen Xue, Tao Wei, Guoxin Zhang, Zhonghong Ou, Kaoyan Lu, Yu Feng, Yifan Zhu, Haoran Luo

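AI summary: Proposes the ERGeoBench benchmark, which evaluates multimodal large language models on vision-driven embodied geo-localization under three progressive settings (single-view, panorama-view, and embodied-view); current models do well at high-level geographic semantic reasoning but still fall short on fine-grained perception, metric localization, and cross-view spatial consistency.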

详情
AI中文摘要

多模态大语言模型(MLLMs)作为具身代理展现出强大潜力,然而由于缺乏细粒度评估,具身地理定位仍未被充分探索。我们引入ERGeoBench,一个用于视觉驱动的具身地理定位的诊断基准。ERGeoBench在三种渐进设置下评估模型——单视图、全景视图和具身视图——其中代理可以通过偏航、俯仰和缩放的顺序变化主动获取观察。该基准包含2,207个全球分布的街景全景图,并衡量四种互补能力:基础感知、空间意识、常识推理和地理定位推理。对领先的专有和开源MLLMs的评估表明,当前模型能够推断高层次的地理语义,但在细粒度感知操作、度量定位和跨视图空间一致性方面仍然困难。我们进一步观察到,地理定位与其他能力维度强相关,表明准确定位依赖于集成的感知、空间推理和常识推理,而非孤立的视觉识别。总体而言,ERGeoBench为诊断和推进类人具身地理定位提供了一个统一框架。项目页面:https://kaixuewen.github.io/ERGeoBench/

英文摘要

Multimodal large language models (MLLMs) have shown strong potential as embodied agents, yet embodied geo-localization remains underexplored due to the lack of fine-grained evaluation. We introduce ERGeoBench, a diagnostic benchmark for vision-driven embodied geo-localization. ERGeoBench evaluates models under three progressive settings -- single-view, panorama-view, and embodied-view -- where agents may actively acquire observations through sequential changes in yaw, pitch, and zoom. The benchmark contains 2,207 globally distributed street-view panoramas and measures four complementary capabilities: foundational perception, spatial awareness, common sense reasoning, and geo-localization reasoning. Evaluations of leading proprietary and open-source MLLMs show that current models can infer high-level geographic semantics, but still struggle with fine-grained perceptual operations, metric localization, and spatial consistency across views. We further observe that geo-localization is strongly correlated with the other capability dimensions, suggesting that accurate localization depends on integrated perception, spatial reasoning, and commonsense inference rather than isolated visual recognition. Overall, ERGeoBench provides a unified framework for diagnosing and advancing human-like embodied geo-localization. Project Page: https://kaixuewen.github.io/ERGeoBench/