arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.10772 2026-05-12 cs.CV cs.AI eess.IV

Towards a Large Language-Vision Question Answering Model for MSTAR Automatic Target Recognition

David F. Ramirez, Tim L. Overman, Kristen Jaskie, Marv Kleine, Andreas Spanias

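AI summary This paper studies the application of large language-vision models (LLVMs) to target recognition in synthetic aperture radar (SAR) imagery, specifically automatic target recognition (ATR) of military vehicles. The authors build a training and evaluation benchmark from the MSTAR Public Dataset, extend it with descriptive captions and question-answer pairs, and examine LLVM performance on remote sensing image captioning and visual question answering (VQA). With parameter-efficient fine-tuning, the model identifies fine-grained target qualities at 98% accuracy, offering a new technical route for machine-assisted remote sensing ATR in military and intelligence contexts.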

Comments Accepted to SPIE Defense + Commercial Sensing, Automatic Target Recognition XXXV

Journal ref
Proc. SPIE 13463, Automatic Target Recognition XXXV, 134630D (29 May 2025)
Abstract

Large language-vision models (LLVM), such as OpenAI's ChatGPT and GPT-4, have gained prominence as powerful tools for analyzing text and imagery. The merging of these data domains represents a significant paradigm shift with far-reaching implications for automatic target recognition (ATR). Recent transformer-based LLVM research has shown substantial improvements for geospatial perception tasks. Our study examines the application of LLVM to remote sensing image captioning and visual question-answering (VQA), with a specific focus on synthetic aperture radar (SAR) imagery. We examine newly published LLVM methods, including CLIP and LLaVA neural network transformer architectures. We have developed a work-in-progress SAR training and evaluation benchmark derived from the MSTAR Public Dataset. This has been extended to include descriptive text captions and question-answer pairs for VQA tasks. This challenge dataset is designed to push the boundaries of an LLVM in identifying nuanced ATR details in SAR imagery. Utilizing parameter-efficient fine-tuning, we train an LLVM method to identify fine-grained target qualities at 98% accuracy. We detail our data setup and experiments, addressing potential pitfalls that could lead to misleading conclusions. Accurately identifying and differentiating military vehicle types in SAR data poses a critical challenge, especially under complex environmental conditions. Mastering this target recognition skill may require a human analyst months of training and years of practice. This research represents a unique effort to apply LLVM to SAR applications, advancing machine-assisted remote sensing ATR for military and intelligence contexts.

2605.10770 2026-05-12 cs.LG

DynaMiCS: Fine-tuning LLMs with Performance Constraints using Dynamic Mixtures

Eleonora Gualdoni, Sonia Laguna, Louis Bethune, Joao Monteiro, Pierre Ablin, Marco Cuturi

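AI summary This paper proposes DynaMiCS, a dynamic mixture optimizer for fine-tuning large language models that improves target-domain performance while preserving performance on constrained domains. At each update it performs short domain-specific probing runs to estimate cross-domain effects and computes mixture weights from them, so that target-domain performance improves while constrained-domain losses stay below reference levels. Across multi-domain fine-tuning scenarios, DynaMiCS achieves stronger target-domain gains and higher constraint satisfaction than fixed-mixture baselines, at lower computational cost and without reference models or manually tuned mixture weights.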

Abstract

Multi-domain fine-tuning of large language models requires improving performance on target domains while preserving performance on constrained domains, such as general knowledge, instruction following, or safety evaluations. Existing data mixing strategies rely on fixed heuristics or adaptive rules that cannot explicitly enforce preservation of such capabilities. We propose DynaMiCS, a dynamic mixture optimizer that casts multi-domain fine-tuning as a constrained optimization problem. At each update, DynaMiCS performs short domain-specific probing runs to estimate a slope matrix of local cross-domain effects, capturing how training on each fine-tuning dataset affects each evaluation domain. These estimates are then used to compute mixture weights through optimization over the probability simplex, with the objective of improving target-domain performance while keeping constrained-domain losses below reference levels. Across multi-domain fine-tuning scenarios with varying numbers of target and constrained domains, DynaMiCS achieves stronger target-domain improvements and higher constraint satisfaction than fixed-mixture baselines, at lower computational cost and without reference models, per-example scoring, or manually tuned mixture weights.
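
The weight-selection step described in the DynaMiCS abstract can be illustrated with a small sketch. It assumes a slope matrix whose entry for (evaluation domain, dataset) estimates the local change in that domain's loss per unit of training on that dataset, and it swaps the paper's optimizer for a coarse grid search over the probability simplex. The function names, the grid search, the `slack` parameter, and all numbers are illustrative assumptions, not the authors' implementation.

```python
from itertools import product


def simplex_grid(n_datasets, steps):
    """Yield weight vectors on the probability simplex with resolution 1/steps."""
    for counts in product(range(steps + 1), repeat=n_datasets):
        if sum(counts) == steps:
            yield [c / steps for c in counts]


def predicted_change(slope_row, weights):
    """Linearized loss change on one evaluation domain under a mixture."""
    return sum(s * w for s, w in zip(slope_row, weights))


def choose_mixture(target_slopes, constraint_slopes, slack, steps=20):
    """Minimize the predicted target loss change while keeping every
    constrained domain's predicted loss change at or below its slack."""
    best_w, best_obj = None, float("inf")
    for w in simplex_grid(len(target_slopes[0]), steps):
        feasible = all(
            predicted_change(row, w) <= s + 1e-12
            for row, s in zip(constraint_slopes, slack)
        )
        if not feasible:
            continue
        obj = sum(predicted_change(row, w) for row in target_slopes)
        if obj < best_obj:
            best_w, best_obj = w, obj
    return best_w, best_obj


# Hypothetical slopes; columns are [target-domain dataset, general dataset].
target_slopes = [[-1.0, -0.1]]      # both datasets lower the target loss
constraint_slopes = [[0.5, -0.5]]   # target data raises the constrained loss
weights, objective = choose_mixture(target_slopes, constraint_slopes, slack=[0.0])
print(weights, objective)
```

In this toy case the constraint caps the target dataset at half of the mixture, so the search settles on equal weights. A real implementation would re-estimate the slopes at every update, as the abstract describes.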

2605.10769 2026-05-12 cs.CV cs.AI

MPerS: Dynamic MLLM MixExperts Perception-Guided Remote Sensing Scene Segmentation

Ziyi Wang, Xianping Ma, Ziyao Wang, Hongyang Zhang, Man On Pun

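AI summary This paper proposes MPerS, a dynamic multimodal large language model (MLLM) mixture-of-experts perception-guided method for remote sensing scene segmentation, aimed at improving semantic segmentation of remote sensing images. It designs multiple prompts that guide MLLMs to generate high-quality remote sensing scene captions, uses DINOv3 to extract dense visual features of land cover, and adaptively fuses the most effective textual semantics through a Dynamic MixExperts module to segment scenes more precisely. The method achieves superior performance on three public remote sensing semantic segmentation datasets.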

Comments Accepted to CVPR 2026 Findings. 11 pages, 6 figures

Abstract

The multimodal fusion of images and scene captions has been extensively explored and applied in various fields. However, when dealing with complex remote sensing (RS) scenes, existing studies have predominantly concentrated on architectural optimizations for integrating textual semantic information with visual features, while largely neglecting the generation of high-quality RS captions and the investigation of their effectiveness in multimodal semantic fusion. In this context, we propose the Dynamic MLLM Mixture-of-Experts Perception-Guided Remote Sensing Scene Segmentation, referred to as MPerS. We design multiple prompts for MLLMs to generate high-quality RS captions, enabling MLLMs to perceive RS scenes from diverse expert perspectives. DINOv3 is employed to extract dense visual representations of land-covers. We design a Dynamic MixExperts module that adaptively integrates the most effective textual semantics. Linguistic Query Guided Attention is constructed to utilize textual semantic information to guide visual features for precise segmentation. The MLLMs include LLaVA, ChatGPT, and Qwen. Our method achieves superior performance on three public semantic segmentation RS datasets.

2605.10765 2026-05-12 cs.CV cs.AI cs.LG

Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning

Tao Hu, Da-Wei Zhou

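AI summary Multimodal large language models (MLLMs) achieve strong performance through instruction tuning, but real-world use often requires expanding capabilities across sequential tasks while avoiding catastrophic forgetting. Existing methods mainly follow a module-composition paradigm, which struggles with the variation in visual scenes, question intents, and reasoning demands within a single task. The paper proposes DRAPE, a dynamic cross-modal prompt generation framework that derives prompt queries from the textual instruction and cross-attends to visual features, generating an instance-specific soft prompt for each query-image pair and enabling finer-grained, instance-level adaptation. DRAPE achieves state-of-the-art performance on multimodal continual instruction tuning benchmarks.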

Abstract

Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, yet real-world deployment often requires continual capability expansion across sequential tasks. In such scenarios, Multimodal Continual Instruction Tuning (MCIT) aims to acquire new capabilities while limiting catastrophic forgetting. Existing methods mainly follow a module-composition paradigm: they maintain task-level prompts or LoRA experts and dynamically route or aggregate a subset of them at inference. However, samples within the same task can still differ substantially in visual scenes, question intents, and reasoning demands. This motivates instance-level adaptation to individual query-image pairs rather than only selecting or combining task-level modules. To this end, we propose DRAPE (Dynamic Cross-Modal Prompt Generation), a prompt-learning framework that synthesizes continuous instance-specific soft prompts for MCIT. Instead of selecting prompts from a fixed pool, DRAPE derives prompt queries from the textual instruction and cross-attends to visual patch features, producing query-image conditioned prompts that are prepended to the frozen LLM. To mitigate forgetting during sequential updates, DRAPE applies null-space gradient projection to the shared projector and uses CLIP-based prototype routing for task-label-free generator selection at inference. Extensive experiments on MCIT benchmarks show that DRAPE achieves state-of-the-art performance among representative prompt-based and LoRA-based continual-learning baselines.

2605.10763 2026-05-12 cs.AI cs.CR

MATRA: Modeling the Attack Surface of Agentic AI Systems -- OpenClaw Case Study

Tim Van hamme, Thomas Vissers, Javier Carnerero-Cano, Mario Fritz, Emil C. Lupu, Lieven Desmet, Dinil Mon Divakaran

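AI summary As large language models are increasingly deployed as autonomous agents with access to tools, databases, and external services, practitioners lack systematic methods for assessing how known threat classes translate into concrete risks in a specific deployment. The paper presents MATRA, a pragmatic threat modeling framework for agentic AI systems that uses asset-based impact assessment and attack-tree analysis to translate known LLM threats into deployment-specific risks. Using the OpenClaw personal AI agent as a case study, it shows how MATRA quantifies the risk reduction from architectural controls such as network sandboxing and least-privilege access.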

Comments Accepted for presentation at the 5th International Workshop on Designing and Measuring Security in Systems with AI (DeMeSSAI 2026), co-located with the 11th IEEE European Symposium on Security and Privacy (EuroS&P 2026), Lisbon, Portugal, July 10, 2026

Abstract

LLMs are increasingly deployed as autonomous agents with access to tools, databases, and external services, yet practitioners (across different sectors) lack systematic methods to assess how known threat classes translate into concrete risks within a specific agentic deployment. We present MATRA, a pragmatic threat modeling framework for agentic AI systems that adapts established risk assessment methodology to systematically assess how known LLM threats translate into deployment-specific risks. MATRA begins with an asset-based impact assessment and utilizes attack trees to determine the likelihood of these impacts occurring within the system architecture. We demonstrate MATRA on a personal AI agent deployment using OpenClaw, quantifying how architectural controls such as network sandboxing and least-privilege access reduce risk by limiting the blast radius of successful injections.
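
To make the attack-tree idea concrete, here is a minimal sketch of how a likelihood could be propagated through AND/OR nodes and combined with an impact score. It assumes independent leaf events and a simple likelihood-times-impact risk score; the tree, the probabilities, and the impact value are hypothetical and are not taken from the MATRA paper.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class Node:
    name: str
    kind: str = "leaf"            # "leaf", "and", or "or"
    probability: float = 0.0      # used only by leaves
    children: list = field(default_factory=list)


def likelihood(node):
    """Probability that the node's event occurs, assuming independent leaves."""
    if node.kind == "leaf":
        return node.probability
    child_ps = [likelihood(c) for c in node.children]
    if node.kind == "and":        # every child step must succeed
        return prod(child_ps)
    if node.kind == "or":         # at least one child step succeeds
        return 1.0 - prod(1.0 - p for p in child_ps)
    raise ValueError(f"unknown node kind: {node.kind}")


def exfiltration_tree(p_network):
    """Hypothetical tree: an injection must land AND the agent must reach out."""
    injection = Node("prompt injection succeeds", "or", children=[
        Node("malicious email", probability=0.3),
        Node("poisoned web page", probability=0.2),
    ])
    egress = Node("agent can reach attacker server", probability=p_network)
    return Node("data exfiltration", "and", children=[injection, egress])


IMPACT = 10  # hypothetical asset impact score
risk_open = IMPACT * likelihood(exfiltration_tree(1.0))       # no sandbox
risk_sandboxed = IMPACT * likelihood(exfiltration_tree(0.1))  # network sandbox
print(risk_open, risk_sandboxed)
```

Lowering the single egress probability scales the risk of the whole branch, which is one way to read the abstract's point that sandboxing limits the blast radius of a successful injection.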

2605.10762 2026-05-12 cs.CV cs.AI

GridProbe: Posterior-Probing for Adaptive Test-Time Compute in Long-Video VLMs

Mohamed Eltahir, Lama Ayash, Ali Habibullah, Tanveer Hussain, Naeemullah Khan

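AI summary In long-video understanding, vision-language models (VLMs) are bottlenecked by quadratic attention cost over thousands of frames. The paper proposes GridProbe, an efficient training-free posterior-probing inference framework that uses a frozen VLM's own reasoning to score evidence in answer space and adaptively select question-relevant frames, cutting compute substantially with little to no accuracy loss. GridProbe arranges frames on a K×K grid and runs lightweight row and column probes to produce an interpretable importance map, which drives shape-adaptive frame selection and improves the efficiency of long-video understanding.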

Abstract

Long-video understanding in VLMs is bottlenecked by a single monolithic forward pass over thousands of frames at quadratic attention cost. A common mitigation is to first select a small subset of informative frames before the forward pass; common for training-free selectors via auxiliary encoder-space similarities. Such signals are capped by contrastive pretraining, which usually fails on reasoning-heavy queries (negation, cross-frame counting, holistic summarization). We propose GridProbe, an efficient training-free posterior-probing inference paradigm that scores evidence in answer space using a frozen VLM's own reasoning and then selects question-relevant frames adaptively, resulting in sub-quadratic attention cost with little to no accuracy loss. We arrange frames on a $K{\times}K$ grid and run lightweight row R and column C probes, where each probe reads its peak posterior as a query-conditioned confidence. The outer product of R and C yields an interpretable importance map whose skewness and kurtosis drive Shape-Adaptive Selection, a closed-form rule that reliably replaces the fixed frame budget $M$ with a per-question $M_{\mathrm{eff}}$. We show empirically that $M_{\mathrm{eff}}$ tracks intrinsic question difficulty without ever seeing the answer, a sign of test-time adaptive compute. On Video-MME-v2, GridProbe matches the monolithic baseline within $1.6$ pp Avg Acc at $3.36\times$ TFLOPs reduction, while on LongVideoBench it Pareto-dominates the baseline ($+0.9$ pp at $0.35\times$ compute). Because the selector and QA models can be decoupled, pairing a small 2B selector with a stronger 4B or 8B QA is strictly Pareto-dominant over the 2B monolithic baseline (up to $+4.0$ pp at $0.52\times$ compute, on average), with no retraining. Finally, the interpretability of the importance maps opens future avenues for behavioral diagnostics, grounding, and frame-selection distillation.
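
The grid construction in the GridProbe abstract can be sketched as follows: row and column confidences are combined by an outer product into an importance map, and the highest-scoring cells are kept. This sketch assumes a row-major frame layout and uses a fixed budget `m`; the closed-form Shape-Adaptive Selection rule that derives a per-question budget from skewness and kurtosis is not given in the abstract and is not reproduced here. The confidence values are made up.

```python
def importance_map(row_conf, col_conf):
    """Outer product of row and column probe confidences."""
    return [[r * c for c in col_conf] for r in row_conf]


def top_frames(imp, m):
    """Return the indices of the m most important frames.

    Frames are assumed to be laid out row-major on the K x K grid,
    so the cell (i, j) holds frame i * K + j.
    """
    k = len(imp)
    scored = [(imp[i][j], i * k + j) for i in range(k) for j in range(k)]
    # Sort by descending importance, breaking ties by frame index.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return sorted(idx for _, idx in scored[:m])


# Hypothetical probe confidences for a 3 x 3 grid of frames.
row_conf = [0.2, 0.9, 0.4]
col_conf = [0.1, 0.8, 0.5]
selected = top_frames(importance_map(row_conf, col_conf), m=3)
print(selected)
```

With these values the selection concentrates on the middle row and the middle column, where both probes are confident.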

2605.10761 2026-05-12 cs.CV

RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

Wenxuan Li, Pedro R. A. S. Bassi, Xinze Zhou, Jakob Wasserthal, Alan L. Yuille, Zongwei Zhou

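AI summary RadThinking is a visual question answering (VQA) dataset for longitudinal clinical reasoning in radiology, intended to make the diagnostic reasoning behind cancer screening explicit and trainable. It contains question-answer pairs at several difficulty tiers, from basic perception questions to compositional questions that require multi-step reasoning, and provides the reasoning chain for each compositional question in line with clinical reporting standards. RadThinking covers CT scans from a large number of patients, providing a resource for systematic training and evaluation of reasoning in AI systems.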

Abstract

Cancer screening is a reasoning task. A radiologist observes findings, compares them to prior scans, integrates clinical context, and reaches a diagnostic conclusion confirmed by pathology. We present RadThinking, a Visual Question Answering (VQA) dataset that makes this reasoning explicit and trainable. RadThinking releases VQA pairs at three difficulty tiers. Foundation VQAs are atomic perception questions. Single-step reasoning VQAs apply one clinical rule. Compositional VQAs require multi-step chain-of-thought to reach a guideline category such as LI-RADS-5. For every compositional VQA, we release the chain of foundation VQAs that solves it. The chain follows the rules of the governing clinical reporting standard. The dataset spans 20,362 CT scans from 9,131 patients across 43 cancer groups, plus 2,077 verified healthy controls with >1-year follow-up. To our knowledge, RadThinking is the first cancer-screening VQA corpus that stratifies questions by reasoning depth and grounds compositions in clinical reporting standards. The foundation tier supplies atomic perception supervision. The compositional tier supplies chain-of-thought data and verifiable rewards for reinforcement-learning recipes such as DeepSeek-R1 and OpenAI o1. RadThinking enables systematic training and evaluation of whether AI systems can reason about cancer, not merely detect it.

2605.10760 2026-05-12 cs.RO

MAGS-SLAM: Monocular Multi-Agent Gaussian Splatting SLAM for Geometrically and Photometrically Consistent Reconstruction

Zhihao Cao, Qi Shao, Shuhao Zhai, Jing Zhang, Anh Nguyen, Baoru Huang

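AI summary MAGS-SLAM is a monocular multi-agent 3D Gaussian Splatting (3DGS) SLAM framework for geometrically and photometrically consistent collaborative scene reconstruction. Each agent independently builds local monocular Gaussian submaps and transmits compact submap summaries, removing the dependence on depth sensors and making the method suitable for lightweight, low-cost, or power-constrained platforms. The work introduces compact submap communication, geometry- and appearance-aware loop verification, and occupancy-aware Gaussian fusion, achieving globally consistent reconstruction without active depth sensors, and reports competitive tracking accuracy and comparable or superior rendering quality on synthetic and real-world datasets.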

Abstract

Collaborative photorealistic 3D reconstruction from multiple agents enables rapid large-scale scene capture for virtual production and cooperative multi-robot exploration. While recent 3D Gaussian Splatting (3DGS) SLAM algorithms can generate high-fidelity real-time mapping, most of the existing multi-agent Gaussian SLAM methods still rely on RGB-D sensors to obtain metric depth and simplify cross-agent alignment, which limits deployment on lightweight, low-cost, or power-constrained robotic platforms. To address this challenge, we propose MAGS-SLAM, the first RGB-only multi-agent 3DGS SLAM framework for collaborative scene reconstruction. Each agent independently builds local monocular Gaussian submaps and transmits compact submap summaries rather than raw observations or dense maps. To facilitate robust collaboration in the presence of monocular scale ambiguity, our framework integrates compact submap communication, geometry- and appearance-aware loop verification, and occupancy-aware Gaussian fusion, enabling coherent global reconstruction without active depth sensors. We further introduce the ReplicaMultiagent Plus benchmark for evaluating collaborative Gaussian SLAM. Intensive experiments on synthetic and real-world datasets show that MAGS-SLAM achieves competitive tracking accuracy and comparable or superior rendering quality to state-of-the-art RGB-D collaborative Gaussian SLAM methods while relying only on RGB images.

2605.10756 2026-05-12 cs.CV

TINS: Test-time ID-prototype-separated Negative Semantics Learning for OOD Detection

Yifeng Yang, Jubo Feng, Jing Xu, Xinbing Wang, Qinying Gu, Nanyang Ye

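AI summary This work proposes TINS, a test-time ID-prototype-separated negative semantics learning method that improves out-of-distribution (OOD) detection with vision-language models. To address the reliance of existing methods on static negative labels, which adapt poorly to diverse and evolving OOD concepts, TINS learns sample-specific negative semantic embeddings via image-to-text modality inversion and adds ID-prototype-separated regularization to avoid confusion with ID semantics. TINS outperforms existing methods on multiple benchmarks; on the Four-OOD benchmark it reduces the average FPR95 from 14.04% to 6.72%.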

Abstract

Vision-language models enable OOD detection by comparing image alignment with ID labels and negative semantics. Existing negative-label-based methods mainly rely on static negative labels constructed before inference, limiting their ability to cover diverse and evolving OOD concepts. Although test-time expansion provides a natural solution, naively learning negative semantics from potential OOD samples may introduce hard ID contamination. To address this issue, we propose a Test-time ID-prototype-separated Negative Semantics learning method, termed TINS. TINS learns sample-specific negative text embeddings via image-to-text modality inversion and introduces ID-prototype-separated regularization to keep them separated from ID semantics. To further stabilize negative semantics expansion, TINS employs group-wise aggregation scoring and a buffer update strategy. Extensive experiments across Four-OOD, OpenOOD, Temporal-shift, and Various ID settings show consistent improvements over strong baselines. Notably, on the Four-OOD benchmark with ImageNet-1K as ID, TINS reduces the average FPR95 from 14.04% to 6.72%. Our code is available at https://github.com/zxk1212/tins.
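
For readers unfamiliar with the FPR95 figure quoted above, the sketch below computes the standard metric: the fraction of OOD samples still accepted as in-distribution when the threshold is set so that 95% of ID samples are accepted. It assumes that a higher score means "more in-distribution"; the scores are made up and the scoring function used by TINS is not reproduced.

```python
import math


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """False positive rate on OOD data at the given true positive rate on ID data.

    Assumes a higher score means the sample looks more in-distribution.
    """
    ranked = sorted(id_scores, reverse=True)
    # Number of ID samples that must be accepted to reach the target TPR
    # (the small epsilon guards against floating-point round-up).
    n_keep = math.ceil(tpr * len(ranked) - 1e-9)
    threshold = ranked[n_keep - 1]
    false_pos = sum(1 for s in ood_scores if s >= threshold)
    return false_pos / len(ood_scores)


# Hypothetical detector scores.
id_scores = list(range(81, 101))                      # 20 ID samples
ood_scores = [10, 20, 30, 40, 50, 60, 70, 85, 90, 95]  # 10 OOD samples
fpr95 = fpr_at_tpr(id_scores, ood_scores)
print(fpr95)
```

Here 19 of the 20 ID samples must be accepted, which puts the threshold at 82, and three of the ten OOD scores clear it. Lower values are better, which is the direction of the improvement the abstract reports.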

2605.10754 2026-05-12 cs.AI

The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents

Xinrun Wang, Chang Yang, He Zhao, Zhuoyi Lin, Shuyue Hu

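AI summary This paper examines the core scientific questions facing LLM-based foundation agents that run for long horizons on complex tasks, and argues that current research relies mainly on empirical trial and error without theoretical guidance. The authors propose Agent Cybernetics, mapping six laws of classical cybernetics onto six agent design principles and distilling them into three engineering desiderata: reliability, lifelong running, and self-improvement. Three application domains (code generation, computer use, and automated research) illustrate the framework, which the authors offer as a scientific foundation for foundation agents.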

Comments Preliminary Work

Abstract

LLM-based foundation agents that perceive, reason, and act across thousands of reasoning steps are rapidly becoming the dominant paradigm for deploying artificial intelligence in open-ended, long-horizon complex tasks. Despite this significance, the field remains overwhelmingly engineering-driven. Engineering practice has converged on useful primitives (tool loops, memory banks, harnesses, reflection steps), yet these are assembled by empirical trial and error rather than from first principles. Fundamental questions remain open: under what conditions does a long-running agent remain on-task? How should an agent respond when its environment exceeds its representational capacity? What architectural properties are necessary for safe self-improvement? We argue that cybernetics, the mid-twentieth-century science of control and communication in complex systems, provides the missing theoretical scaffold for foundation agents. By mapping six canonical laws of classical cybernetics onto six agent design principles, and synthesizing those principles into three engineering desiderata (reliability, lifelong running, and self-improvement), we arrive at a framework termed Agent Cybernetics. Three application domains (code generation, computer use, and automated research) exemplify the analytical framework of agent cybernetics by identifying failure modes and concrete engineering recommendations. We hope that agent cybernetics opens a new research avenue and establishes the scientific foundation that foundation agents need for principled, reliable real-world deployment.

2605.10748 2026-05-12 cs.LG cs.AI

Provable Sparse Inversion and Token Relabel Enhanced One-shot Federated Learning with ViTs

Li Shen, Xiaolei Hao, Qinglun Li, Xiaochun Cao, Zhifeng Hao, Xun Yang

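AI summary This paper studies how to improve global-model performance in one-shot federated learning under extremely non-IID settings. It proposes FedMITR, a framework that uses sparse model inversion and token relabeling to generate high-quality synthetic data and improve Vision Transformer (ViT) predictions. During data generation it inverts only semantic foregrounds and skips uninformative backgrounds; patches with high information density use generated pseudo-labels, while patches with low information density are relabeled by ensemble models. The theoretical analysis yields a tighter generalization bound, and experiments show that FedMITR outperforms existing methods under various settings.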

Comments 18 Pages

Abstract

One-Shot Federated Learning, where a central server learns a global model in a single communication round, has emerged as a promising paradigm. However, under extremely non-IID settings, existing data-free methods often generate low-quality data that suffers from severe semantic misalignment with ground-truth labels. To overcome these issues, we propose a novel Federated Model Inversion and Token Relabel (FedMITR) framework, which trains the global model by fully exploiting all patches of synthetic images. Specifically, FedMITR employs sparse model inversion during data generation, selectively inverting semantic foregrounds while halting the inversion of uninformative backgrounds. To address semantically meaningless tokens that hinder ViT predictions, we implement a differentiated strategy: patches with high information density utilize generated pseudo-labels, while patches with low information density are relabeled via ensemble models for robust distillation. Theoretically, our analysis based on algorithmic stability reveals that Sparse Model Inversion eliminates gradient instability arising from background noise, while Token Relabel effectively reduces gradient variance, collectively guaranteeing a tighter generalization bound. Empirically, extensive experimental results demonstrate that FedMITR substantially outperforms existing baselines under various settings.

2605.10744 2026-05-12 cs.CV cs.RO

C-CoT: Counterfactual Chain-of-Thought with Vision-Language Models for Safe Autonomous Driving

Kefei Tian, Yuansheng Lian, Kai Yang, Xiangdong Chen, Shen Li

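AI summary This paper proposes C-CoT, a counterfactual chain-of-thought framework based on vision-language models that improves autonomous-driving decisions in safety-critical scenarios such as complex urban intersections. It decomposes driving decisions into five stages and, in the counterfactual reasoning stage, introduces a structured meta-action evaluation tree that explicitly assesses the potential consequences of alternative action combinations, establishing causal links between actions and safety outcomes and improving robustness in rare and out-of-distribution scenarios. The method outperforms existing approaches on metrics such as risk prediction and collision rate, improving the safety and interpretability of the driving system.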

Abstract

Safety-critical planning in complex environments, particularly at urban intersections, remains a fundamental challenge for autonomous driving. Existing methods, whether rule-based or data-driven, frequently struggle to capture complex scene semantics, infer potential risks, and make reliable decisions in rare, high-risk situations. While vision-language models (VLMs) offer promising approaches for safe decision-making in these environments, most current approaches lack reflective and causal reasoning, thereby limiting their overall robustness. To address this, we propose a counterfactual chain-of-thought (C-CoT) framework that leverages VLMs to decompose driving decisions into five sequential stages: scene description, critical object identification, risk prediction, counterfactual risk reasoning, and final action planning. Within the counterfactual reasoning stage, we introduce a structured meta-action evaluation tree to explicitly assess the potential consequences of alternative action combinations. This self-reflective reasoning establishes causal links between action choices and safety outcomes, improving robustness in long-tail and out-of-distribution scenarios. To validate our approach, we construct the DeepAccident-CCoT dataset based on the DeepAccident benchmark and fine-tune a Qwen2.5-VL (7B) model using low-rank adaptation. Our model achieves a risk prediction recall of 81.9%, reduces the collision rate to 3.52%, and lowers L2 error to 1.98 m. Ablation studies further confirm the critical role of counterfactual reasoning and the meta-action evaluation tree in enhancing safety and interpretability.

2605.10741 2026-05-12 cs.LG

AdaPaD: Adaptive Parallel Deflation for PEFT with Self-Correcting Rank Discovery

Barbara Su, Fangshuo Liao, Anastasios Kyrillidis

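AI summary This paper proposes AdaPaD, an adaptive parallel deflation method for parameter-efficient fine-tuning (PEFT) that discovers the adapter's rank distribution during training. It trains all rank-1 components simultaneously, and each deflation target improves as the estimates of its predecessors improve, which makes deflation errors self-correcting. AdaPaD also adds advance learning (private pre-training before activation) and per-module dynamic rank discovery, so the rank allocation becomes an output rather than an input. AdaPaD is competitive with adaptive-rank LoRA baselines on GLUE at matched parameter budgets, and competitive with fixed-rank LoRA on Qwen3-0.6B SQuAD/SQuAD v2 while deploying an adapter that is on average 30.7% smaller.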

Abstract

Fine-tuning large language models with LoRA requires choosing a rank r before training starts. Existing approaches either extract rank-1 components sequentially, freezing each component's error permanently into every subsequent residual, or optimize the full low-rank factorization jointly with guarantees that describe only the joint update, not individual rank-1 directions. We present AdaPaD (Adaptive Parallel Deflation), which trains all rank-1 components simultaneously: each worker refines its component against a deflation target built from the latest estimates of all predecessors, and as those estimates improve, the targets improve too. We call this property self-correction: deflation errors converge to zero over rounds rather than persisting as fixed residuals. On top of this backbone, AdaPaD adds advance learning (private pre-training before activation) and per-module dynamic rank discovery (importance-based growth until a shared budget is exhausted), making the rank distribution an output rather than an input. We prove that every component's error decays exponentially after a warm-up period, with a generalization bound that splits into a vanishing algorithmic term and an irreducible statistical floor. Empirically, AdaPaD is competitive with adaptive-rank LoRA baselines on GLUE with DeBERTaV3-base at matched parameter budgets, and competitive with fixed-rank LoRA on Qwen3-0.6B SQuAD/SQuAD v2 while deploying an adapter that is on average 30.7% smaller.
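
The self-correction property can be seen in a toy setting. The sketch below recovers rank-1 components of a fixed matrix in parallel: in every round, worker k takes one power-iteration step against a target formed by subtracting the previous round's estimates of components 0..k-1. Power iteration on a fixed matrix is a stand-in chosen for illustration; the paper's setting is gradient-based LoRA training, and its advance learning and rank discovery are not shown. All function names are hypothetical.

```python
import math


def matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def normalize(x):
    n = math.sqrt(sum(v * v for v in x))
    return [v / n for v in x] if n > 0 else x


def deflated_target(m, preds):
    """m minus the current rank-1 estimates (sigma, u, v) of all predecessors."""
    t = [row[:] for row in m]
    for sigma, u, v in preds:
        for i in range(len(t)):
            for j in range(len(t[0])):
                t[i][j] -= sigma * u[i] * v[j]
    return t


def parallel_deflation(m, rank, rounds=50):
    """Refine all rank-1 components in every round instead of one after another."""
    start = normalize([1.0] * len(m[0]))
    comps = [(0.0, [0.0] * len(m), start) for _ in range(rank)]
    for _ in range(rounds):
        snapshot = list(comps)   # every worker reads the previous round's estimates
        updated = []
        for k in range(rank):
            t = deflated_target(m, snapshot[:k])
            u = normalize(matvec(t, snapshot[k][2]))
            w = matvec(transpose(t), u)
            sigma = math.sqrt(sum(x * x for x in w))
            updated.append((sigma, u, normalize(w)))
        comps = updated
    return comps


components = parallel_deflation([[3.0, 0.0], [0.0, 1.0]], rank=2)
print([round(sigma, 6) for sigma, _, _ in components])
```

Early on, the second worker chases a target polluted by the first worker's error, but that error shrinks every round, so the second component converges as well. In a sequential scheme the first component would be frozen before the second one starts.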

2605.10734 2026-05-12 cs.LG

XQCfD: Accelerating Fast Actor-Critic Algorithms with Prior Data and Prior Policies

Daniel Palenicek, Florian Vogt, Joe Watson, Ingmar Posner, Danica Kragic, Jan Peters

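AI summary This paper proposes XQCfD, a reinforcement learning algorithm that uses prior data and prior policies to improve the sample efficiency of fast actor-critic methods. With augmented replay buffers, pretrained policies, and stationary policy architectures designed to preserve the initial policy, it avoids the rapid unlearning of the initial policy seen in earlier algorithms. XQCfD performs strongly on several complex manipulation tasks with sparse rewards and is more data-efficient than existing methods.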

Comments 22 pages, 10 figures, 2 tables

Abstract

For reinforcement learning in the real world, online exploration is expensive. A common practice in robotic reinforcement learning is to incorporate additional data to improve sample efficiency. Expert demonstration data is often crucial for solving hard exploration tasks with sparse rewards. While prior data is used to augment experience and pretrain models, we show that the design of existing algorithms fails to achieve the sample efficiency that is possible in this setting, due to a failure to use pretrained policies effectively. We propose XQCfD, which extends the sample-efficient XQC actor-critic to learn from demonstrations using augmented replay buffers, pretrained policies, and stationary policy architectures designed to avoid rapidly unlearning the strong initial policy, as prior works do. We show our stationary network architecture enables policy improvement out-of-distribution better than standard network architectures due to its higher entropy predictions. XQCfD achieves state-of-the-art performance across a range of complex manipulation tasks with sparse rewards from the popular Adroit, Robomimic, and MimicGen benchmarks, notably with a low update-to-data ratio and no ensemble networks.

2605.10732 2026-05-12 cs.CV cs.AI

iPay: Integrated Payment Action Recognition via Multimodal Networks and Adaptive Spatial Prior Learning

Kaicong Huang, Weiheng Oh, Thomas Guggisberg, Ruimin Ke

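AI summary This paper proposes iPay, an integrated payment action recognition framework for onboard public-transit surveillance systems. It combines RGB images and skeleton data in a multimodal mixture-of-experts architecture that captures local detail and global motion respectively, and adds a dual-attention fusion mechanism and a Spatial Difference Discriminator to improve recognition of payment actions. On real surveillance data, iPay achieves 83.45% recognition accuracy with competitive computational efficiency, making it suitable for edge deployment.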

Abstract

Automated transit payment analysis is vital for scalable fare auditing and passenger analytics, yet practice still relies on limited manual inspection. Prior vision- and skeleton-based methods remain brittle under noisy onboard surveillance and often depend on poorly generalizable handcrafted features. Building on the success of graph convolutional networks in human action recognition, we observe that skeleton features excel at modeling global spatiotemporal dependencies but tend to underemphasize the subtle local relative motions that distinguish payment actions. In contrast, RGB features preserve fine-grained spatial details yet often lack reliable temporal continuity in surveillance footage. To bridge both system-level deployment needs and model-level design challenges, we present iPay, an integrated payment action recognition framework for onboard transit surveillance systems. iPay adopts a multimodal mixture-of-experts architecture with four tightly coupled streams: (1) an RGB expert stream emphasizing local evidence via region-focused computation; (2) a skeleton expert stream modeling articulated motion with a graph convolutional backbone; (3) a dual-attention fusion stream enabling skeleton-to-RGB temporal transfer and RGB-to-skeleton spatial enhancement; and (4) a prior-driven Spatial Difference Discriminator (SDD) that explicitly models hand-to-anchor relative motion to improve task-specific discriminability. We also collaborate with local transit agencies to collect over 55 hours of real onboard surveillance footage, yielding 500+ payment clips. Experiments show that iPay outperforms prior methods and achieves 83.45% recognition accuracy with competitive computational efficiency, making it suitable for edge deployment. Code is available at https://github.com/ccoopq/iPay.

2605.10730 2026-05-12 cs.CV

Qwen-Image-2.0 Technical Report

Bing Zhao, Chenfei Wu, Deqing Li, Hao Meng, Jiahao Li, Jie Zhang, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kuan Cao, Kun Yan, Liang Peng, Lihan Jiang, Niantong Li, Ningyuan Tang, Shengming Yin, Tianhe Wu, Xiao Xu, Xiaoyue Chen, Xihua Wang, Yan Shu, Yanran Zhang, Yi Wang, Yilei Chen, Ying Ba, Yixian Xu, Yujia Wu, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhendong Wang, Zihao Liu, Zikai Zhou, An Yang, Chen Cheng, Chenxu Lv, Dayiheng Liu, Fan Zhou, Hantian Xiong, Hongzhu Shi, Hu Wei, Huihong Zhao, Ivy Liu, Jianwei Zhang, Jiawei Zhang, Kai Chen, Kang He, Levon Xue, Lin Qu, Linhan Tang, Luwen Feng, Minggang Wu, Minmin Sun, Na Ni, Rui Men, Shuai Bai, Sishou Zheng, Tao Lan, Tianqi Zhang, Tingkun Wen, Wei Wang, Weixu Qiao, Weiyi Lu, Wenmeng Zhou, Xiaodong Deng, Xiaoxiao Xu, Xinlei Fang, Xionghui Chen, Yanan Wang, Yang Fan, Yichang Zhang, Yixuan Xu, Yu Wu, Zhiyuan Ma, Zhizhi Cai

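AI summary This report introduces Qwen-Image-2.0, an omni-capable image generation foundation model that unifies high-fidelity image generation and precise image editing. By coupling Qwen3-VL as the condition encoder with a Multimodal Diffusion Transformer, the model addresses challenges such as ultra-long text rendering, multilingual typography, and high-resolution photorealistic generation; supported by large-scale data curation and a customized multi-stage training pipeline, it achieves strong multimodal understanding with flexible generation and editing. Qwen-Image-2.0 substantially outperforms previous versions in generation and editing, a step toward more general, reliable, and practical image generation models.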

Abstract

We present Qwen-Image-2.0, an omni-capable image generation foundation model that unifies high-fidelity generation and precise image editing within a single framework. Despite recent progress, existing models still struggle with ultra-long text rendering, multilingual typography, high-resolution photorealism, robust instruction following, and efficient deployment, especially in text-rich and compositionally complex scenarios. Qwen-Image-2.0 addresses these challenges by coupling Qwen3-VL as the condition encoder with a Multimodal Diffusion Transformer for joint condition-target modeling, supported by large-scale data curation and a customized multi-stage training pipeline. This enables strong multimodal understanding while preserving flexible generation and editing capabilities. The model supports instructions of up to 1K tokens for generating text-rich content such as slides, posters, infographics, and comics, while significantly improving multilingual text fidelity and typography. It also enhances photorealistic generation with richer details, more realistic textures, and coherent lighting, and follows complex prompts more reliably across diverse styles. Extensive human evaluations show that Qwen-Image-2.0 substantially outperforms previous Qwen-Image models in both generation and editing, marking a step toward more general, reliable, and practical image generation foundation models.

2605.10727 2026-05-12 cs.LG math.DG

Kernel-Gradient Drifting Models

Maria Esteban-Casadevall, Jorge Carrasco-Pollo, Max Welling, Jan-Willem van de Meent, Erik J. Bekkers, Floor Eijkelboom

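AI summary This paper proposes kernel-gradient drifting, a generative modeling framework that replaces the fixed Euclidean direction of standard drifting models with directions induced by the kernel, giving a more flexible generative mechanism. It shows that, for general kernels, the resulting drift is the score difference between kernel-smoothed distributions, which yields identifiability for characteristic kernels and extends naturally to Riemannian manifolds and discrete data. On spherical geospatial data, DNA sequences, and molecule generation, the method achieves high-quality one-step generation without distilling a pretrained model.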

Abstract

We propose kernel-gradient drifting, a one-step generative modeling framework that replaces the fixed Euclidean displacement direction in drifting models with directions induced by the kernel itself. Standard drifting is attractive because it enables fast, high-quality generation without distilling a large pretrained diffusion model, but its theory is currently understood mainly for Gaussian kernels, where the drift coincides with smoothed score matching and is identifiable. Our gradient-based reformulation exposes this score-based structure for general kernels: the resulting drift is the score difference between kernel-smoothed data and model distributions, yielding identifiability for characteristic kernels and a smoothed-KL descent interpretation of the drifting dynamics. Since kernel gradients are intrinsic tangent vectors, the same construction extends naturally to Riemannian manifolds and to discrete data via the Fisher-Rao geometry of the probability simplex. Across spherical geospatial data, promoter DNA and molecule generation, kernel-gradient drifting enables state-of-the-art one-step generation beyond the Euclidean setting without distillation.

2605.10723 2026-05-12 cs.CV cs.AI cs.LG cs.MA

AllocMV: Optimal Resource Allocation for Music Video Generation via Structured Persistent State

Huimin Wang, Leilei Ouyang, Chang Xia, Yongqi Kang, Yu Fu, Yuqi Ouyang

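AI summary AllocMV is a hierarchical framework for music video generation that targets the high computational cost and the difficulty of maintaining cross-shot consistency in long-horizon video generation. It formulates video synthesis as a Multiple-Choice Knapsack Problem, allocates resources through a structured persistent state object, and uses a dynamic-programming solver for efficient resource scheduling. Experiments show that AllocMV achieves an optimal trade-off between generation quality and resource consumption under strict budget and rhythm constraints.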

Abstract

Generating long-horizon music videos (MVs) is frequently constrained by prohibitive computational costs and difficulty maintaining cross-shot consistency. We propose AllocMV, a hierarchical framework formulating music video synthesis as a Multiple-Choice Knapsack Problem (MCKP). AllocMV represents the video's persistent state as a compact, structured object comprising character entities, scene priors, and sharing graphs, produced by a global planner prior to realization. By estimating segment saliency from multimodal cues, a group-level MCKP solver based on dynamic programming optimally allocates resources across High-Gen, Mid-Gen, and Reuse branches. For repetitive musical motifs, we implement a divergence-based forking strategy that reuses visual prefixes to reduce costs while ensuring motif-level continuity. Evaluated via the Cost-Quality Ratio (CQR), AllocMV achieves an optimal trade-off between perceived quality and resource expenditure under strict budgetary and rhythmic constraints.
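
A Multiple-Choice Knapsack Problem of the kind named in the abstract can be solved with a textbook dynamic program. The sketch below assumes each segment must receive exactly one branch, that branch costs are small integers, and that a branch's value is the segment saliency times a fixed branch quality. The cost and quality numbers, the `BRANCHES` table, and the helper names are hypothetical; this is not the authors' group-level solver.

```python
from typing import List, Sequence, Tuple

# (branch name, integer cost, value); every number below is illustrative.
Option = Tuple[str, int, float]

# Hypothetical per-branch cost and quality, not taken from the paper.
BRANCHES = [("High-Gen", 5, 1.0), ("Mid-Gen", 3, 0.7), ("Reuse", 1, 0.3)]


def build_groups(saliency: Sequence[float]) -> List[List[Option]]:
    """One option group per segment, with value = saliency * branch quality."""
    return [[(name, cost, s * quality) for name, cost, quality in BRANCHES]
            for s in saliency]


def solve_mckp(groups: List[List[Option]], budget: int) -> Tuple[float, List[str]]:
    """Choose exactly one option per group to maximize total value within budget."""
    neg_inf = float("-inf")
    # best[b]: highest value over the groups seen so far with total cost <= b.
    best = [0.0] * (budget + 1)
    choices: List[List[int]] = []  # choices[g][b]: option index picked for group g
    for options in groups:
        new_best = [neg_inf] * (budget + 1)
        picked = [-1] * (budget + 1)
        for b in range(budget + 1):
            for idx, (_, cost, value) in enumerate(options):
                if cost <= b and best[b - cost] + value > new_best[b]:
                    new_best[b] = best[b - cost] + value
                    picked[b] = idx
        best = new_best
        choices.append(picked)
    if best[budget] == neg_inf:
        raise ValueError("budget too small to assign one branch to every segment")
    # Walk backwards through the choice tables to recover the allocation.
    plan: List[str] = []
    b = budget
    for options, picked in zip(reversed(groups), reversed(choices)):
        name, cost, _ = options[picked[b]]
        plan.append(name)
        b -= cost
    plan.reverse()
    return best[budget], plan
```

With saliencies 1.0, 0.5, and 0.2 and a budget of 9, `solve_mckp(build_groups([1.0, 0.5, 0.2]), 9)` assigns High-Gen, Mid-Gen, and Reuse in that order, spending the most on the most salient segment. The running time is proportional to the number of segments times the budget times the number of branches.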

2605.10722 2026-05-12 cs.LG

On Improving Graph Neural Networks for QSAR by Pre-training on Extended-Connectivity Fingerprints

Sam Money-Kyrle, Markus Dablander, Thierry Hanser, Stephane Werner, Charlotte M. Deane, Garrett M. Morris

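AI summary This paper studies how pre-training can improve graph neural networks (GNNs) for quantitative structure-activity relationship (QSAR) studies and proposes a pre-training strategy based on Extended-Connectivity Fingerprints (ECFP). Experiments show that the method significantly improves GNN predictive performance on multiple benchmark datasets, although performance drops in more heterogeneous settings or on more complex tasks. The study also analyzes how substructure-level data leakage during pre-training affects downstream tasks and confirms that ECFP pre-training has the potential to improve model generalization on practical QSAR tasks.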

Abstract

Molecular Graph Neural Networks (GNNs) are increasingly common in drug discovery, particularly for Quantitative Structure-Activity Relationship (QSAR) studies; yet, their superiority compared to classical molecular featurisation approaches is disputed. We report a general strategy for improving GNNs for QSAR by pre-training to predict Extended-Connectivity Fingerprints (ECFP). We validate our approach with statistical tests and challenging out-of-distribution (OOD) splits. Across five out of six Biogen benchmarks, we observed a statistically significant improvement in standard performance metrics over all evaluated baselines when using ECFP pre-trained GNNs. However, for more heterogeneous datasets and more complex endpoints, such as binding affinity prediction, pre-trained GNNs underperformed in OOD settings. Importantly, we investigated the impact of substructure-level data leakage during pre-training on downstream performance. While we identified scenarios where pre-training on ECFPs was less effective, our findings show that ECFP-based pre-training can enhance downstream OOD performance on a diverse set of practically relevant QSAR tasks.

2605.10717 2026-05-12 cs.LG cs.CV

Heteroscedastic Diffusion for Multi-Agent Trajectory Modeling

Guillem Capellera, Antonio Rubio, Luis Ferraz, Antonio Agudo

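AI summary This paper proposes U2Diffine, a heteroscedastic diffusion model for multi-agent trajectory modeling that also provides a per-state uncertainty estimate, addressing the shortcomings of conventional methods in trajectory completion and uncertainty quantification. By adding the negative log-likelihood of the predicted noise to the denoising loss and propagating latent-space uncertainty to the real state space with a first-order Taylor expansion, it unifies trajectory completion and uncertainty estimation. It also proposes U2Diff, a more efficient baseline, and adds a rank neural network for post-processing, which significantly improves inference speed and prediction reliability and outperforms existing methods on several sports datasets.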

Comments Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Extended version of arXiv:2503.18589 (CVPR 2025)

Abstract

Multi-agent trajectory modeling traditionally focuses on forecasting, often neglecting more general tasks like trajectory completion, which is essential for real-world applications such as correcting tracking data. Existing methods also generally predict agents' states without offering any state-wise measure of heteroscedastic uncertainty. Moreover, popular multi-modal sampling methods lack error probability estimates for each generated scene under the same prior observations, which makes it difficult to rank the predictions at inference time. We introduce U2Diffine, a unified diffusion model built to perform trajectory completion while simultaneously offering state-wise heteroscedastic uncertainty estimates. This is achieved by augmenting the standard denoising loss with the negative log-likelihood of the predicted noise, and then propagating the latent space uncertainty to the real state space using a first-order Taylor approximation. We also propose U2Diff, a faster baseline that avoids gradient computation during sampling. This approach significantly increases inference speed, making it as efficient as a standard generative-only diffusion model. For post-processing, we integrate a Rank Neural Network (RankNN) that enables error probability estimation for each generated mode, demonstrating strong correlation with ground truth errors. Our method outperforms state-of-the-art solutions in both trajectory completion and forecasting across four challenging sports datasets (NBA, Basketball-U, Football-U, Soccer-U), underscoring the effectiveness of our uncertainty and error probability estimation.
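
The two ingredients named in the abstract, a Gaussian negative log-likelihood with a predicted variance and first-order Taylor propagation of uncertainty, can be sketched in scalar form. The exact loss weighting, parameterization, and network outputs used by U2Diffine are not given in the abstract, so the log-variance parameterization, the finite-difference derivative, and both function names below are assumptions made for illustration.

```python
import math
from typing import Callable, Sequence


def gaussian_nll(target: Sequence[float], mean: Sequence[float],
                 log_var: Sequence[float]) -> float:
    """Mean per-element Gaussian NLL with one predicted log-variance per element."""
    total = 0.0
    for t, mu, lv in zip(target, mean, log_var):
        # 0.5 * [log sigma^2 + (t - mu)^2 / sigma^2 + log(2 pi)]
        total += 0.5 * (lv + (t - mu) ** 2 / math.exp(lv) + math.log(2.0 * math.pi))
    return total / len(target)


def propagate_variance(f: Callable[[float], float], x: float, var_x: float,
                       eps: float = 1e-5) -> float:
    """First-order Taylor (delta-method) estimate: Var[f(X)] ~= f'(x)^2 * Var[X]."""
    derivative = (f(x + eps) - f(x - eps)) / (2.0 * eps)  # central difference
    return derivative ** 2 * var_x
```

The heteroscedastic term lets the model lower its loss on hard-to-predict states by predicting a larger variance there, at the price of the log-variance penalty, which is what makes the predicted variance usable as a state-wise uncertainty estimate.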

2605.10716 2026-05-12 cs.LG stat.ML

What should post-training optimize? A test-time scaling law perspective

Muheng Li, Jian Qian, Wenlong Mou

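AI summary This paper studies the mismatch between the best-of-N strategy commonly used when deploying large language models and the post-training objective. The authors show that, when training resources are limited, the gradient of the best-of-N objective can be approximated by extrapolating upper-tail statistics of the reward distribution, which leads to efficient post-training methods. The proposed Tail-Extrapolated Advantage (TEA) and its refined variant Prefix-TEA improve best-of-N performance across multiple language models and datasets.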

Abstract

Large language models are increasingly deployed with test-time strategies: sample $N$ responses, score them with a reward model or verifier, and return the best. This deployment rule exposes a mismatch in post-training: standard objectives optimize the mean reward of a single response, whereas best-of-$N$ performance is governed by the upper tail of the reward distribution. Recent test-time-aware objectives partly address this mismatch, but typically assume that training can use the same per-prompt rollout budget as deployment, which is impractical when post-training must cover many prompts while deployment can allocate much larger per-prompt test-time compute. We study this budget-mismatch regime, where only $m\ll N$ per-prompt rollouts are available during training but the target objective is best-of-$N$ deployment. Under structural assumptions on the reward tails, we show that the policy gradient of the best-of-$N$ objective can be approximated from a much smaller rollout group by extrapolating upper-tail statistics. This yields a family of Tail-Extrapolated estimators for best-of-$N$-oriented post-training: a simple direct estimator, Tail-Extrapolated Advantage (TEA), and a fixed-order debiased Prefix-TEA estimator based on moment cancellation. Experiments on instruction-following tasks show that TEA and Prefix-TEA improve best-of-$N$ performance across different language models, reward models and datasets under various training and test-time budget settings.
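
The budget mismatch can be made concrete with the standard order-statistic estimator of expected best-of-$k$ reward from $m$ i.i.d. rollouts. It is unbiased but only defined for $k \le m$, which is why a target of best-of-$N$ with $m \ll N$ needs some form of extrapolation. This sketch is not the paper's TEA or Prefix-TEA estimator, and the function name is an assumption.

```python
from math import comb
from typing import Sequence


def best_of_k_estimate(rewards: Sequence[float], k: int) -> float:
    """Unbiased estimate of E[max of k i.i.d. rewards] from m >= k sampled rewards."""
    m = len(rewards)
    if not 1 <= k <= m:
        raise ValueError("requires 1 <= k <= m; larger k needs extrapolation")
    ordered = sorted(rewards)
    # The i-th smallest reward is the maximum of C(i-1, k-1) of the C(m, k) subsets.
    weighted = sum(comb(i - 1, k - 1) * ordered[i - 1] for i in range(k, m + 1))
    return weighted / comb(m, k)
```

For $k = 1$ this reduces to the sample mean, the quantity standard post-training objectives optimize, while for $k = m$ it returns the single largest observed reward.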

2605.10715 2026-05-12 cs.CV

UAV-Assisted Scan-to-Simulation for Landslides Using Physics-Informed Gaussian Splatting

Zhenyu Liang, Jack C. P. Cheng

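AI summary This paper proposes a UAV-based scan-to-simulation framework to improve the realism and accuracy of landslide monitoring and simulation. The method combines physics-informed Gaussian splatting (3DGS) with the Material Point Method (MPM), covering the full path from UAV-captured imagery to a physically grounded landslide simulation. Validation on a real landslide site in Hong Kong shows the method's strengths in both visual reconstruction and physical simulation, offering a more effective tool for disaster prevention and public education.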

Abstract

Landslide monitoring and simulation play an important role in urban safety assessment and disaster prevention. Existing landslide simulation pipelines typically rely on digital elevation models and mesh-based representations, which are suitable for geometric analysis but often lack visual realism. This limitation reduces their effectiveness in interactive applications, hazard communication, and public education. In this paper, we propose a UAV-based scan-to-simulation framework that bridges photorealistic scene capture and physics-based landslide simulation through 3DGS. Specifically, our pipeline includes four stages: (1) UAV-based acquisition of slope imagery, (2) reconstruction of a low-anisotropy 3DGS scene representation, (3) volumetric conversion of the target simulation region by filling the interior of the surface-based model, and (4) integration with the Material Point Method (MPM) for landslide simulation. We validate the proposed framework on a real landslide site in Hong Kong that experienced a severe landslide event. The results show that our method supports both realistic visual reconstruction and effective simulation.

2605.10714 2026-05-12 cs.CL cs.AI

Why Low-Resource NLP Needs More Than Cross-Lingual Transfer: Lessons Learned from Luxembourgish

Fred Philippy, Siwen Guo, Jacques Klein, Tegawendé F. Bissyandé

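AI summary Through a study of Luxembourgish in natural language processing, this paper examines the limits of cross-lingual transfer for low-resource languages and the need for language-specific efforts. It finds that cross-lingual transfer can effectively improve low-resource performance, but its effect depends heavily on high-quality, task-aligned target-language data, which is usually scarce in low-resource settings. Cross-lingual transfer and language-specific efforts should therefore be seen as complementary rather than opposed, and the paper offers recommendations for building sustainable low-resource NLP systems.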

Comments Accepted at BigPicture Workshop 2026 (co-located with ACL 2026)

Abstract

Cross-lingual transfer has become a central paradigm for extending natural language processing (NLP) technologies to low-resource languages. By leveraging supervision from high-resource languages, multilingual language models can achieve strong task performance with little or no labeled target-language data. However, it remains unclear to what extent cross-lingual transfer can substitute for language-specific efforts. In this paper, we synthesize prior research findings and data collection results on Luxembourgish, which, despite its typological proximity to high-resource languages and its presence in a multilingual context, remains insufficiently represented in modern NLP technologies. Across findings, we observe a fundamental interdependence between cross-lingual transfer and language-specific efforts. Cross-lingual transfer can substantially improve target-language performance, but its success depends critically on the availability of sufficiently high-quality, task-aligned target-language data. At the same time, such resources, particularly in low-resource settings, are typically too limited in scale to drive strong performance on their own. Instead, such resources reach their full potential only when leveraged within a cross-lingual framework. We therefore argue that cross-lingual transfer and language-specific efforts should not be viewed as competing alternatives. Instead, they function as complementary components of a sustainable low-resource NLP pipeline. Based on these insights, we provide practical guidelines for integrating and balancing cross-lingual transfer with language-specific development in sustainable low-resource NLP pipelines.

2605.10707 2026-05-12 cs.RO

ObjView-Bench: Rethinking Difficulty and Deployment for Object-Centric View Planning

Sicong Pan, Hao Hu, Xuying Huang, Benno Wingender, Maren Bennewitz

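AI summary This paper proposes ObjView-Bench, which rethinks difficulty assessment and deployment in object-centric view planning. The framework separates three key factors, object self-occlusion, observation saturation, and planning difficulty, to enable finer-grained analysis of view-planning evaluation, and shows that sampling that accounts for planning difficulty can improve learned view planners. ObjView-Bench also defines deployment-oriented evaluation protocols that reveal how budget limits and reachable-view constraints affect the performance and failure modes of different planners, providing a more reliable benchmark for view planning in robotic 3D reconstruction.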

Abstract

Object-centric view planning is a core component of active geometric 3D reconstruction in robotics, yet existing evaluations often conflate object complexity, planning difficulty, budget assumptions, and physical reachability constraints. As a result, conclusions drawn from idealized view-planning evaluations may not reliably predict performance under realistic reconstruction settings. We introduce ObjView-Bench, an evaluation framework for rethinking difficulty and deployment in object-centric view planning. First, we disentangle three quantities underlying view-planning evaluation: omnidirectional self-occlusion as an object-side attribute, observation saturation difficulty, and protocol-dependent planning difficulty defined through a set-cover formulation. This separation supports controlled dataset construction, analysis of slow-saturation objects, and a case study showing that planning difficulty-aware sampling can improve learned view planners. Second, we design deployment-oriented evaluation protocols that reveal how budget regimes and reachable-view constraints alter method behavior. Across classical, learned, and hybrid planners, ObjView-Bench shows that difficulty, budget, and reachability constraints substantially change method rankings and failure modes.
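
A set-cover view of planning can be sketched as follows: each candidate view sees a set of surface points, and a plan is a set of views that together see every target point. The greedy rule below is the classical approximation for set cover, used here only for illustration; the abstract does not say how ObjView-Bench computes its difficulty measure, and the view names and point indices in the usage note are hypothetical.

```python
from typing import Dict, List, Set


def greedy_view_cover(visible: Dict[str, Set[int]], targets: Set[int]) -> List[str]:
    """Greedily pick views until every target point visible from some view is covered."""
    uncovered = set(targets)
    chosen: List[str] = []
    while uncovered:
        # Pick the view that sees the most still-uncovered points
        # (ties go to the alphabetically first view name).
        best = max(sorted(visible), key=lambda v: len(visible[v] & uncovered))
        gain = visible[best] & uncovered
        if not gain:
            break  # the remaining points are not visible from any candidate view
        chosen.append(best)
        uncovered -= gain
    return chosen
```

For example, with views `front = {0, 1, 2, 3}`, `back = {4, 5, 6}`, `top = {2, 3, 4}`, and `side = {6, 7}` and target points 0 through 7, the greedy plan is front, back, side. Restricting `visible` to reachable views or truncating the plan to a fixed budget mimics, in a toy way, the deployment constraints the abstract discusses.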

2605.10706 2026-05-12 cs.LG

RelFlexformer: Efficient Attention 3D-Transformers for Integrable Relative Positional Encodings

Byeongchan Kim, Arijit Sehanobish, Avinava Dubey, Min-hwan Oh, Krzysztof Choromanski

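AI summary This paper proposes RelFlexformer, a new efficient attention mechanism for 3D relative positional encodings (RPE) defined by arbitrary integrable modulation functions. It processes 3D input sequences efficiently, with attention computed in $O(L \log L)$ time. Built on the theory of the Non-Uniform Fourier Transform (NU-FFT), the method naturally extends existing efficient RPE-attention methods from structured grids to arbitrarily distributed, unstructured 3D positions, which makes it particularly suitable for point-cloud modeling. Experiments show that RelFlexformer achieves superior performance on multiple 3D datasets.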

Abstract

We present a new class of efficient attention mechanisms applying universal 3D Relative Positional Encoding (RPE) methods given by arbitrary integrable modulation functions $f$. They lead to a new class of 3D-Transformer models, called RelFlexformers, that flexibly integrate those RPEs and are characterized by $O(L \log L)$ time complexity of the attention computation for input sequences of length $L$. RelFlexformers build on the theory of the Non-Uniform Fourier Transform (NU-FFT), naturally generalizing several existing efficient RPE-attention methods from structured settings with tokens homogeneously embedded in unweighted grids to general non-structured heterogeneous scenarios, where tokens' positions are arbitrarily distributed in the corresponding 3D spaces. As such, RelFlexformers can be applied in particular to model point clouds. Our extensive empirical evaluation on a large portfolio of 3D datasets confirms the quality improvements provided by the NU-FFT-driven attention modulation techniques in RelFlexformers.

2605.10705 2026-05-12 cs.CV

TransmissiveGS: Residual-Guided Disentangled Gaussian Splatting for Transmissive Scene Reconstruction and Rendering

Zhenyu Liang, Xiao Zhang, Tianchao Li, Jack C. P. Cheng, Chi-Keung Tang

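AI summary This paper proposes TransmissiveGS, a new framework for the challenging problem of reconstructing and rendering transmissive scenes. It introduces a dual-Gaussian representation and a deferred shading function to reconstruct reflection and transmission components separately, uses multi-view inconsistency and residual information to separate surface geometry from lighting attributes, and proposes a reflection light field to improve the accuracy of near-field reflection estimation. Experiments show that the method outperforms existing Gaussian Splatting techniques on both synthetic and real scenes and significantly improves reconstruction and rendering quality for transmissive scenes.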

Abstract

Transmissive scenes are ubiquitous in daily life, yet reconstructing and rendering them remains highly challenging due to the inherent entanglement between near-field reflections from the surrounding environment on the transmissive surface, and the transmitted content of the scene behind it. This coupling gives rise to dual surface geometries and dual radiance components within each observation, posing ambiguities for standard methods. We present TransmissiveGS, a novel framework for disentangled reconstruction and rendering of transmissive scenes. Specifically, we model the scene with a dual-Gaussian representation and introduce a deferred shading function to jointly render the two Gaussian components. To separate reflection and transmission, we exploit the inherent multi-view inconsistency of reflections and leverage the residuals from reconstructing multi-view consistent content as cues for disentangled geometry and appearance modeling. We further propose a reflection light field that enables high-fidelity estimation of near-field reflections. During training, we introduce a high-frequency regularization to preserve fine details. We also contribute a new synthetic dataset for evaluating transmissive surface reconstruction. Experiments on both synthetic and real-world scenes demonstrate that TransmissiveGS consistently outperforms prior Gaussian Splatting-based methods in both reconstruction and rendering quality for transmissive scenes.

2605.10688 2026-05-12 cs.LG eess.SP

DANCE: Detect and Classify Events in EEG

Jarod Lévy, Hubert Banville, Jérémy Rapin, Jean-Remi King, Thomas Moreau, Stéphane d'Ascoli

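AI summary This paper proposes DANCE, a deep learning method that detects and classifies events directly from raw, unaligned electroencephalography (EEG) signals, removing the reliance of conventional methods on known event onsets. It frames neural decoding as a set-prediction problem and achieves end-to-end asynchronous decoding. Experiments show that DANCE outperforms existing methods on a range of cognitive, clinical, and brain-computer interface tasks and sets a new performance level in seizure monitoring.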

Comments 29 pages

Abstract

Event identification in continuous neural recordings is a critical task in neuroscience. Decoding in EEG is dominated by classifying windows aligned to known event onsets. However, while available in controlled experiments, such onsets are absent in continuous real-world monitoring. Here, we introduce DANCE, a deep learning pipeline that frames neural decoding as a set-prediction problem and jointly detects and classifies events directly from raw, unaligned signals. Evaluated separately on ten datasets curated from the literature with a wide variety of event types (ranging from milliseconds to minutes in duration), our model outperforms existing methods on a broad range of cognitive, clinical and BCI tasks. This single architecture establishes a new state of the art in the competitive task of seizure monitoring and matches the accuracy of onset-informed models for BCI tasks. Overall, our method marks a step towards end-to-end asynchronous neural decoding models.

2605.10687 2026-05-12 cs.LG

The finite expression method for turbulent dynamics with high-order moment recovery

Xingjian Xu, Di Qi, Chunmei Wang

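AI summary To address the difficulty of accurately capturing higher-order statistical moments in turbulent dynamical systems, this study proposes a two-stage data-driven modeling framework that combines symbolic regression with generative models to jointly identify the system dynamics and predict its key statistics. Stage I uses the Finite Expression Method (FEX) to discover closed-form expressions of the deterministic dynamics, recovering nonlinear interaction and external forcing terms without a predefined function library. Stage II introduces a generative model to learn the residual stochastic component and correct the Stage I model error, which allows higher-order statistics to be characterized accurately. Experiments show that the method recovers the interaction terms and forcing expressions in multiple regimes and accurately predicts statistical moments up to order five, demonstrating the potential of combining symbolic discovery with data-driven stochastic modeling for complex turbulent systems.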

Comments 20 pages, 8 figures, 1 table

Abstract

Turbulent dynamical systems are characterized by nonlinear interactions and stochastic effects that generate coupled statistical quantities, such as non-zero higher-order moments, which are difficult to capture from data with accuracy. We propose a two-stage data-driven modeling framework that combines symbolic regression with generative models to jointly identify the governing dynamics and predict their key statistical quantities. In Stage I of the framework, the Finite Expression Method (FEX) is adopted to discover closed-form expressions of the deterministic dynamics, recovering nonlinear interaction terms and external forcing without predefined libraries. In Stage II, generative models are introduced to learn the residual stochastic components as a refined correction to the model error from the Stage I approximation, enabling accurate characterization of higher-order statistics. Theoretical analysis establishes the consistency of the symbolic estimator and quantifies the estimation error in terms of data size and numerical discretization. The model performance is verified through detailed numerical experiments on the stochastic triad models across multiple regimes, demonstrating that the framework successfully recovers interaction terms and forcing expressions, and accurately predicts statistical moments up to order five. These results highlight the potential of integrating interpretable symbolic discovery with data-driven stochastic modeling for complex turbulent systems.

2605.10680 2026-05-12 cs.LG

Exact Unlearning from Proxies Induces Closeness Guarantees on Approximate Unlearning

Virgile Dine, Teddy Furon

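AI summary This paper proposes a new paradigm that links machine unlearning directly to the structure of the data distributions rather than relying only on updates to the neural network parameters. By inferring the data distributions precisely, the method extracts the exact unlearning signal induced by the model and, under a verifiable admissibility criterion, gives theoretical bounds on the KL divergence between the ideal retrained model and the unlearned model, which supports the soundness of the method. Experiments show that, in three forgetting scenarios, the method yields the classifier closest to the ideal retrained model compared with existing methods.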

Abstract

This paper proposes a paradigm shift linking machine unlearning directly to the structure of the data distributions rather than a mere update of the neural network parameters. We show that inferring these distributions with precision enables distilling the exact unlearning signal induced by the modeling. Theoretical bounds on the Kullback-Leibler divergence from the ideal retrained model to our unlearned model, under a verifiable admissibility criterion, reveal the soundness of our framework. The method is experimentally validated over three forgetting scenarios, where it reaches the classifier closest to the ideal retrained model when compared to competitors.

2605.10676 2026-05-12 cs.CV cs.LG

Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium

Qingxin Xiao, Peilin Zhao, Yangyang Zhao, Lingwei Dang, Qingyao Wu

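AI summary During decoding in multimodal language models, attention often concentrates abnormally on task-irrelevant image regions. Existing methods usually treat these regions as noise and forcibly redirect attention, but this paper argues that they carry important visual and narrative logic and that forced redirection worsens the imbalance between vision and language. The study proposes Adversarial Counter-Commonsense Equilibrium (ACE), a training-free framework that introduces counter-commonsense image perturbation patches and dynamically adjusts the attention distribution during decoding, suppressing hallucinated content and restoring the vision-language balance without additional training. Experiments show that the method significantly improves model trustworthiness with almost no added inference overhead.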

Abstract

During MLLM decoding, attention often abnormally concentrates on irrelevant image tokens. While existing research dismisses this as invalid noise and forcibly redirects attention toward key image information, we argue that these tokens are critical carriers of visual and narrative logic, and that such coercive corrections exacerbate the visual-language imbalance. Adopting a "decoding-as-game" perspective, we reveal that hallucinations stem from an equilibrium imbalance between linguistic priors and visual information. We propose Adversarial Counter-Commonsense Equilibrium (ACE), a training-free framework that perturbs visual context via counter-commonsense patches. Leveraging the fact that authentic visual features remain stable under perturbation while hallucinations fluctuate, ACE implements a dynamic game decoding strategy. This approach precisely suppresses perturbation-sensitive priors while compensating for stable visual signals to restore balance. Extensive experiments demonstrate that ACE, as a plug-and-play strategy, enhances model trustworthiness with negligible inference overhead.