arXivDaily arXiv daily academic digest, updated Monday through Friday
2605.13170 2026-05-14 cs.LG cs.MA

Finding the Weakest Link: Adversarial Attack against Multi-Agent Communications

Maxwell Standen, Junae Kim, Claudia Szabo

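AI summary This paper studies adversarial attacks against multi-agent reinforcement learning systems, focusing on how perturbing communicated messages can degrade system performance. The authors use gradient information from the Jacobian to identify the messages, agent, and timesteps most vulnerable to attack, and design two new adversarial loss functions that balance attack success rate against attack impact. Experiments in several environments show that the message selection method matches or exceeds random message selection in almost all tested scenarios.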

Comments Full version of the Extended Abstract presented at AAMAS 2026

Abstract

Multi-agent systems rely on communication for information sharing and action coordination, which exposes a vulnerability to attacks. We investigate single-victim communication perturbation attacks against Multi-Agent Reinforcement Learning-trained systems and propose methods that use gradient information from the Jacobian to identify which messages, agent, and timesteps are most susceptible to attack and have the greatest impact on the system. We enhance these methods with two proposed adversarial loss functions that trade off attack success for attack impact and also create more effective perturbations. We empirically demonstrate the effectiveness of our methods against two different multi-agent communication methods in navigation, PredatorPrey, and TrafficJunction environments. Our results show that our novel message selection method achieves a similar or greater impact than random message selection across almost all tested scenarios. Our victim selection, message selection, tempo, and loss functions improve attack effectiveness in half of the thirty scenarios we tested.

2605.13167 2026-05-14 cs.CL

GeoBuildBench: A Benchmark for Interactive and Executable Geometry Construction from Natural Language

Jinwoong Kim, Rui Yang, Huishuai Zhang

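AI summary This paper introduces GeoBuildBench, a benchmark for evaluating whether large language models and multimodal agents can turn informal natural-language plane geometry problems into executable geometric constructions. Unlike earlier geometry test sets that focus on answer correctness or static diagram understanding, it treats diagramming as an interactive construction task in which the model must generate a domain-specific language program that satisfies explicitly specified geometric objects and verifiable constraints. The study finds that, although existing models achieve some success on the task, they still frequently exhibit structural hallucinations, omit objects, and fail to satisfy geometric constraints, which makes geometry construction a rigorous testbed for executable reasoning.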

Abstract

We introduce GeoBuildBench, a benchmark designed to evaluate whether large language models and multimodal agents can ground informal natural-language plane geometry problems into executable geometric constructions. Unlike existing geometry benchmarks that focus on answer correctness or static diagram interpretation, GeoBuildBench treats geometry diagramming as an interactive construction task: given a textual problem, an agent must generate a domain-specific language (DSL) program to produce a diagram satisfying explicitly specified geometric objects and verifiable constraints. The benchmark features 489 Chinese textbook-style problems, curated through automated filtering and human validation to ensure text-complete, constructible problem specifications. We evaluate several state-of-the-art multimodal models in a bounded iterative setting and show that, despite reasonable success rates, models frequently exhibit structural hallucinations, missing objects, and failures to satisfy geometric constraints, with limited ability to exploit visual and constraint-based feedback for self-correction. These results highlight geometry construction as a rigorous testbed for grounded, executable reasoning beyond textual or visual plausibility. Our benchmark and code are publicly available.

2605.13165 2026-05-14 cs.CL

STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes

Chenjun Xu, Zhennan Zhou, Zhan Su, Bill Howe, Lucy Lu Wang, Bingbing Wen

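AI summary This paper proposes STOP, a structured strategy for efficiently pruning long chain-of-thought reasoning when data is limited. The method generates reasoning traces through self-distillation, maps them into a structured reasoning interface, and then applies the Earliest Correct Node (ECN) strategy to remove redundant reasoning steps, substantially reducing the number of generated tokens while maintaining reasoning accuracy. Experiments show that STOP improves reasoning efficiency on several mathematical reasoning tasks, reduces distribution shift, and yields a more efficient reasoning structure.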

Comments 20 pages, 6 figures, 6 tables. Code available at: https://github.com/chenjux/ECN-STOP

Abstract

Long chain-of-thought (Long CoT) reasoning improves performance on multi-step problems, but it also induces overthinking: models often generate low-yield reasoning that increases inference cost and latency. This inefficiency is especially problematic in low-data fine-tuning regimes, where real applications adapt reasoning models with limited supervision and cannot rely on large-scale teacher distillation or heavy test-time control. To address this, we propose STOP (Structured On-policy Pruning), an on-policy algorithm for analyzing and pruning long-form reasoning traces. STOP constructs self-distilled traces from the model. Then it maps each trace into a structured reasoning interface through node segmentation, taxonomy annotation, and reasoning-tree construction. On top of this interface, we introduce ECN (Earliest Correct Node), which retains the shortest prefix ending at the earliest node that both functions as an answering conclusion and yields the correct final answer, removing redundant post-solution reasoning while preserving semantic continuity. Experiments on DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-LLaMA-3-8B across GSM8K, Math 500, and AIME 2024 show that STOP reduces generated tokens by 19.4-42.4% while largely preserving accuracy in low-data fine-tuning. Beyond efficiency, our analyses show that STOP induces much smaller distributional shift than teacher-guided pruning, improves the structural efficiency of generated reasoning, and reallocates reasoning effort away from redundant verification and backtracking toward more productive exploration.
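The ECN rule described in the abstract can be illustrated with a minimal sketch. It assumes a trace has already been segmented into a flat list of nodes, each annotated with whether it states an answer; the `Node` class, its fields, and `earliest_correct_prefix` are hypothetical names for illustration and are not taken from the paper or its repository, which works on a reasoning tree rather than a flat list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """One segmented reasoning step (hypothetical structure)."""
    text: str
    is_conclusion: bool            # assumed taxonomy label: does the node state an answer?
    answer: Optional[str] = None   # the answer stated by the node, if any


def earliest_correct_prefix(trace: List[Node], gold: str) -> List[Node]:
    """Keep the shortest prefix ending at the first conclusion node whose
    answer matches the gold answer; return the trace unchanged if none does."""
    for i, node in enumerate(trace):
        if node.is_conclusion and node.answer == gold:
            return trace[: i + 1]
    return list(trace)


trace = [
    Node("Let x be the number of apples.", False),
    Node("So x = 12.", True, "12"),
    Node("Wait, let me double-check the arithmetic.", False),
    Node("Yes, the answer is 12.", True, "12"),
]
pruned = earliest_correct_prefix(trace, gold="12")
print(len(trace), "->", len(pruned), "nodes kept")
```

In this toy trace the post-solution verification steps are dropped, while a trace that never reaches the gold answer is left untouched.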

2605.13162 2026-05-14 cs.LG

Continual Fine-Tuning of Large Language Models via Program Memory

Hung Le, Svetha Venkatesh

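AI summary This paper studies how to fine-tune large language models efficiently in continual learning settings and proposes ProCL, a continual LoRA framework built on program memory. Inspired by Complementary Learning Systems in neuroscience, the method balances rapid adaptation and knowledge retention through structured program memory slots and an input-conditioned attention mechanism. Experiments show that ProCL retains knowledge better and suffers less catastrophic forgetting on several benchmarks.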

Comments 18 pages, preprint

Abstract

Parameter-Efficient Fine-Tuning (PEFT), particularly Low-Rank Adaptation (LoRA), has become a standard approach for adapting Large Language Models (LLMs) under limited compute. However, in continual settings where models are updated sequentially with small datasets, conventional LoRA updates struggle to balance rapid adaptation and knowledge retention. Existing methods typically treat the low-rank space as a homogeneous update region, lacking mechanisms to regulate how short-term updates are consolidated over time. We propose a continual LoRA framework with Program memory, inspired by Complementary Learning Systems in neuroscience. Our approach, dubbed ProCL, organizes LoRA adapters into structured program memory slots that are dynamically retrieved through input-conditioned attention. This enables rapid and localized adaptation, encouraging similar inputs to reuse shared adapter regions while reserving unused capacity for future data. The slots are then combined with the underlying adapter, which maintains a distributed representation that gradually accumulates knowledge across tasks to balance plasticity and stability. Our method operates entirely within the LoRA parameterization and incurs no additional inference cost. Experiments on diverse benchmarks demonstrate improved retention and reduced catastrophic forgetting over other continual LoRA strategies.

2605.13158 2026-05-14 cs.CV

Unifying Physically-Informed Weather Priors in A Single Model for Image Restoration Across Multiple Adverse Weather Conditions

Jiaqi Xu, Xiaowei Hu, Lei Zhu, Pheng-Ann Heng

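AI summary This paper studies image restoration under multiple adverse weather conditions and proposes a unified, physically informed weather-prior model that handles degradations caused by different weather phenomena, such as raindrops and fog, in a single model. Based on an analysis of weather-related visual factors, the method builds an imaging model that incorporates particle scattering and fog-like aggregation effects, and designs a weather-prior-based network that enhances features using estimated occlusion and transmission to recover a clear scene. Experiments show that the method outperforms existing state-of-the-art methods across multiple adverse weather scenarios.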

Comments Accepted by TCSVT

Abstract

Image restoration under multiple adverse weather conditions aims to develop a single model to recover the underlying scene with high visibility. Weather-related artifacts vary with the particle's distance to the camera according to the established scene visibility analysis, where close and faraway regions are more affected by falling drops and fog effects, respectively. Existing methods fail to consider this weather-specific physical visual process; thus, the restoration performance is limited. In this work, we analyze the common visual factors in adverse weather conditions and present a unified imaging model that considers the individually visible particles and fog-like aggregate scattering effects. Further, we design a novel weather-prior-based network, which leverages the weather-related prior information to help recover the scene by enhancing the features using the estimated occlusion and transmission. Experimental results in multiple adverse scenarios show the superiority of our method against state-of-the-art methods.

2605.13156 2026-05-14 cs.CV

Dual-Pathway Circuits of Object Hallucination in Vision-Language Models

Jiaxin Liu, Ding Zhong, Yue Wang, Zhidong Yang, Zhaolu Kang, Guangyuan Dong, Qishi Zhan, Pengcheng Fang, Aofan Liu

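AI summary Vision-language models (VLMs) perform well on cross-modal understanding tasks but often suffer from object hallucination, describing content that does not exist in the input image, which undermines their reliability and interpretability. This paper proposes a dual-pathway circuit analysis framework for identifying and analyzing hallucination-related circuit mechanisms in VLMs. Through activation patching and conditional pathway analysis, the study identifies a visual grounding pathway that supports correct predictions and a hallucination pathway that leads to erroneous outputs, and reveals how the two interact. Experiments show that suppressing hallucination-pathway components significantly reduces object hallucination, and that the circuit mechanism is consistent and transferable across model architectures and hallucination types.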

Abstract

Vision-language models (VLMs) have demonstrated remarkable capabilities in bridging visual perception and natural language understanding, enabling a wide range of multimodal reasoning tasks. However, they often produce object hallucinations, describing content absent from the input image, which limits their reliability and interpretability. To address this limitation, we propose Dual-Pathway Circuit Analysis, a framework that identifies and characterizes hallucination-related circuits in VLMs for mechanistic understanding and causal probing. We first apply activation patching across five architecturally diverse VLMs to identify a visual grounding pathway that supports correct predictions and a hallucination pathway that drives erroneous outputs. We then introduce Conditional Pathway Analysis (CPA) to characterize pathway-level interactions, revealing that grounding components remain strongly redundant in both correct and hallucinating samples but undergo a consistent polarity flip, shifting from supporting the ground truth on correct samples to aligning with the hallucinated answer on erroneous ones. We further perform targeted suppression of hallucination-pathway components, showing that scaling these components reduces object hallucination by up to 76% with minimal accuracy cost, and validate that the same circuit selectively transfers to relational but not attribute hallucination. Evaluations on POPE-adversarial and AMBER show that the identified circuits are consistent across architectures, support causal intervention, and transfer selectively across hallucination types.

2605.13155 2026-05-14 cs.CV

Pareto-Guided Optimal Transport for Multi-Reward Alignment

Ying Ba, Tianyu Zhang, Mohan Zhou, Yalong Bai, Wenyi Mo, Guiwei Zhang, Bing Su, Ji-Rong Wen

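AI summary Text-to-image generation models have made notable progress in preference optimization, but robust alignment across diverse reward models remains a major challenge. This paper proposes a Pareto frontier-guided optimal transport (PG-OT) framework that constructs a prompt-specific Pareto frontier and uses distribution-aware optimal transport to map dominated samples toward that frontier, effectively mitigating reward hacking. The authors also introduce the Joint Domination Rate (JDR) and Joint Collapse Rate (JCR) as metrics for multi-reward synergy and reward-hacking risk, and experiments show that the method outperforms existing methods on multiple metrics.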

Comments Accepted to ICML 2026

Abstract

Text-to-image generation models have achieved remarkable progress in preference optimization, yet achieving robust alignment across diverse reward models remains a significant challenge. Existing multi-reward fusion approaches rely on weighted summation, which is costly to tune and insufficient for balancing conflicting objectives. More critically, optimization with reward models is highly susceptible to reward hacking, where reward scores increase while the perceived quality of generated images deteriorates. We demonstrate that optimizing against a unified global target under heterogeneous reward upper bounds can induce reward hacking, a risk further exacerbated by the inherent instability of weak reward models. To mitigate this, we propose a Pareto Frontier-Guided Optimal Transport (PG-OT) framework. Our method constructs a prompt-specific Pareto frontier and maps dominated samples toward it via distribution-aware optimal transport. Furthermore, we develop both online and offline optimization strategies tailored to diverse reward signal characteristics. To provide a more rigorous assessment, we introduce the Joint Domination Rate (JDR) and Joint Collapse Rate (JCR) as principled metrics to quantify multi-reward synergy and reward hacking. Experimental results show that our approach outperforms strong baselines with an 11% gain in JDR and achieves a near 80% win rate in human evaluations.
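The notion of a prompt-specific Pareto frontier and of dominated samples can be sketched as follows. This is only the standard dominance test applied to per-sample reward vectors; it does not implement the optimal transport step, and the reward values and function names are illustrative assumptions rather than anything from the paper.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if it is at least as good on every reward
    and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: List[Sequence[float]]) -> List[int]:
    """Return the indices of samples that no other sample dominates."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Hypothetical (reward_1, reward_2) scores for five images from one prompt.
scores = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.2, 0.8), (0.9, 0.1)]
front = pareto_front(scores)
dominated = [i for i in range(len(scores)) if i not in front]
print("frontier:", front, "dominated:", dominated)
```

A weighted sum would collapse each score pair to a single number, whereas the dominance test keeps every trade-off that is not strictly worse than another sample.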

2605.13153 2026-05-14 cs.AI

Strikingness-Aware Evaluation for Temporal Knowledge Graph Reasoning

Rikui Huang, Shengzhe Zhang, Wei Wei

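AI summary This paper proposes an improved evaluation method for temporal knowledge graph reasoning (TKGR), pointing out that current methods treat all events equally and ignore the fact that most events are repetitive, which overestimates models' reasoning ability. The authors propose a strikingness-based evaluation framework that uses a rule-guided strikingness measure to distinguish and emphasize the rare events that require deeper reasoning. Experiments show that the framework evaluates a model's ability to predict outstanding events more rigorously, offering a new evaluation perspective for TKGR research.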

Comments Accepted to IJCAI-ECAI 2026

Abstract

Temporal Knowledge Graph Reasoning (TKGR) aims at inferring missing (especially future) events from historical data. Current evaluation in TKGR uniformly weights all events, ignoring that most are trivial repetitions, which overestimates the true reasoning ability. Therefore, the rare outstanding events, whose prediction demands deeper reasoning, should be distinguished and emphasized. To this end, we propose a strikingness-aware evaluation framework, which introduces a rule-based strikingness measuring framework (RSMF) to quantify event strikingness by comparing its expected occurrence with peer events derived from temporal rules. Strikingness is then integrated as a weighting factor into metrics like weighted MRR and Hits@k. Experiments on four TKG benchmarks reveal: 1) All representative models perform worse as event strikingness increases, 2) Path-based methods excel on low-strikingness events and representation-based ones on high-strikingness events, 3) We design an ensemble method whose gains stem from fitting trivial events rather than reasoning improvement. Our framework provides a more rigorous evaluation, refocusing the field on predicting outstanding events.
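A weighted MRR and Hits@k of the kind the abstract mentions might look like the sketch below. Normalizing by the sum of the weights is an assumption made here for illustration; the paper's exact definition, and how RSMF produces the strikingness weights, are not given in the abstract. The ranks and weights are made-up values.

```python
from typing import Sequence


def weighted_mrr(ranks: Sequence[int], weights: Sequence[float]) -> float:
    """Mean reciprocal rank where each event counts in proportion to its weight."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w / r for r, w in zip(ranks, weights)) / total


def weighted_hits_at_k(ranks: Sequence[int], weights: Sequence[float], k: int) -> float:
    """Weighted share of events whose true answer is ranked within the top k."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w for r, w in zip(ranks, weights) if r <= k) / total


# Two trivial repeated events ranked first, one striking event ranked tenth.
ranks = [1, 1, 10]
uniform = [1.0, 1.0, 1.0]
strikingness = [0.1, 0.1, 0.8]  # hypothetical strikingness weights

plain_mrr = weighted_mrr(ranks, uniform)
striking_mrr = weighted_mrr(ranks, strikingness)
striking_hits3 = weighted_hits_at_k(ranks, strikingness, k=3)
print(plain_mrr, striking_mrr, striking_hits3)
```

With uniform weights the two easy events dominate the score; once the striking event carries most of the weight, the same predictions score much lower.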

2605.13152 2026-05-14 cs.CV cs.AI cs.LG cs.RO

EvObj: Learning Evolving Object-centric Representations for 3D Instance Segmentation without Scene Supervision

Jiahao Chen, Zihui Zhang, Yafei Yang, Jinxi Li, Shenxing Wei, Zhixuan Sun, Bo Yang

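AI summary This paper proposes EvObj, an unsupervised 3D instance segmentation method that aims to address the geometric domain gap between synthetic data and real point cloud scenes. By introducing an object discerning module and an object completion module, the method dynamically refines object priors and reconstructs partial geometry, improving segmentation performance in real scenes. Experiments show that EvObj achieves better segmentation results than existing methods on multiple datasets, reaching the state of the art.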

Comments CVPR 2026. Code and data are available at: https://github.com/vLAR-group/EvObj

Abstract

We introduce EvObj for unsupervised 3D instance segmentation that bridges the geometric domain gap between synthetic pretraining data and real-world point clouds. Current methods suffer from structural discrepancies when transferring object priors from synthetic datasets (e.g., ShapeNet) to real scans (e.g., ScanNet), particularly due to morphological variations and occlusion artifacts. To address this, EvObj integrates two innovative modules: (1) An object discerning module that dynamically refines object candidates, enabling continuous adaptation of object priors to target domains; and (2) An object completion module that reconstructs partial geometries after discovering objects. We conduct extensive experiments on both real-world and synthetic datasets, demonstrating superior 3D object segmentation performance over all baselines while achieving state-of-the-art results.

2605.13151 2026-05-14 cs.CV

GenCape: Structure-Inductive Generative Modeling for Category-Agnostic Pose Estimation

Jiyong Rao, Yu Wang, Shengjie Zhao

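AI summary GenCape is a generative framework for category-agnostic pose estimation (CAPE) that aims to localize keypoints in images of arbitrary categories using only a few annotated support samples. The method infers relationships between keypoints automatically from image-based support inputs, without additional textual descriptions or predefined skeleton structures, overcoming the reliance on manual annotation and the poor structural flexibility of traditional methods. GenCape comprises an iterative structure-aware variational autoencoder and a compositional graph transfer module, which capture instance-level structural information and achieve semantic alignment across categories; experiments show that it outperforms existing graph-support methods in few-shot settings while remaining competitive with text-support methods.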

Comments Accepted in ICLR 2026

Abstract

Category-agnostic pose estimation (CAPE) aims to localize keypoints on query images from arbitrary categories, using only a few annotated support examples for guidance. Recent approaches either treat keypoints as isolated entities or rely on manually defined skeleton priors, which are costly to annotate and inherently inflexible across diverse categories. Such oversimplification limits the model's capacity to capture instance-wise structural cues critical for accurate pixel-level localization. To overcome these limitations, we propose GenCape, a generative framework for CAPE that infers keypoint relationships solely from image-based support inputs, without additional textual descriptions or predefined skeletons. Our framework consists of two principal components: an iterative Structure-aware Variational Autoencoder (i-SVAE) and a Compositional Graph Transfer (CGT) module. The former infers soft, instance-specific adjacency matrices from support features through variational inference, embedded layer-wise into the Graph Transformer Decoder for progressive refinement of structural priors. The latter adaptively aggregates multiple latent graphs into a query-aware structure via Bayesian fusion and attention-based reweighting, enhancing resilience to visual uncertainty and support-induced bias. This structure-aware design facilitates effective message propagation among keypoints and promotes semantic alignment across object categories with diverse keypoint topologies. Experimental results on the MP-100 dataset show that our method achieves substantial gains over graph-support baselines under both 1- and 5-shot settings, while maintaining competitive performance against text-support counterparts.

2605.13149 2026-05-14 cs.CL cs.AI cs.LG

AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions

Ishika Agarwal, Sofia Stoica, Emre Can Acikgoz, Pradeep Natarajan, Mahdi Namazifar, Jiaqi Ma, Dilek Hakkani-Tür

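AI summary This paper proposes AcquisitionSynthesis, a method that uses acquisition functions from active learning as reward models to train language models to generate high-quality synthetic data, addressing the data-quality bottleneck in model training. By quantifying the impact of generated data on the downstream learner, the method makes data generation more targeted and effective. Experiments show that data generated with AcquisitionSynthesis improves student model performance and robustness, and that the method can also generate data for other models and for low-to-high resource training paradigms.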

Abstract

Data quality remains a critical bottleneck in developing capable, competitive models. Researchers have explored many ways to generate top quality samples. Some works rely on rejection sampling: generating lots of synthetic samples and filtering out low-quality samples. Other works rely on larger or closed-source models to extract model weaknesses, necessary skills, or a curriculum off of which to base data generation. These works have one common limitation: there is no quantitative approach to measure the impact of the generated samples on the downstream learner. Active learning literature provides exactly this, in the form of acquisition functions. Acquisition functions measure the informativeness and/or influence of data, providing interpretable, model-centric signals. Inspired by this, we propose AcquisitionSynthesis: using acquisition functions as reward models to train language models to generate higher-quality synthetic data. We conduct experiments on classic verifiable tasks of math, medical question-answering, and coding. Our experimental results indicate that (1) student models trained with AcquisitionSynthesis data achieve good performance on in-distribution tasks (2-7% gain) and are more robust to catastrophic forgetting, and (2) AcquisitionSynthesis models can generate data for other models and for low-to-high resource training paradigms. By leveraging acquisition rewards, we seek to demonstrate a principled path toward model-aware self-improvement that surpasses static datasets.

2605.13148 2026-05-14 cs.LG cs.CV

Understanding Generalization through Decision Pattern Shift

Huiqi Deng, Yibo Li, Quanshi Zhang, Peng Zhang, Hongbin Pei, Xia Hu

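AI summary This paper studies why deep neural networks fail to generalize to unseen samples and proposes a new analytical perspective, Decision Pattern Shift (DPS). The method analyzes the stability of a model's internal decision patterns and quantifies their deviation between training and test, thereby measuring generalization performance. The study finds that decision patterns are highly structured and class-consistent, and that the magnitude of their shift correlates strongly and linearly with the generalization gap, providing a unified explanatory framework for different generalization failure scenarios.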

Comments 14 pages, 12 figures, computer vision and pattern recognition

Abstract

Understanding why deep neural networks (DNNs) fail to generalize to unseen samples remains a long-standing challenge. Existing studies mainly examine changes in externally observable factors such as data, representations, or outputs, yet offer limited insight into how a model's internal decision mechanism evolves from training to test. To address this gap, we introduce Decision Pattern Shift (DPS), a new perspective that defines generalization through the stability of internal decision patterns and quantifies failure as their deviation from those learned during training. Specifically, we represent each sample's decision pattern as a GradCAM-based channel-contribution vector, which captures how feature channels collectively support a prediction, and we propose the DPS metric to measure its discrepancy from the class-average pattern. Empirical analyses across multiple datasets and architectures show that, (i) decision patterns form a highly structured, class-consistent space with strong intra-class cohesion and low inter-class confusion, enabling direct analysis of a model's decision logic; (ii) the DPS magnitude correlates linearly with the generalization gap (nearly all Pearson r > 0.8), revealing generalization as a systematic drift in the model's internal decision mechanism; (iii) the DPS spectrum organizes diverse generalization degradation scenarios (covering ideal generalization, in-distribution degradation, domain shift, out-of-distribution, and shortcut learning) into a continuous trajectory, providing a unified explanation of their failure modes. These findings open up new possibilities for early generalization-risk detection, failure-mode diagnosis, and channel-level defect localization.
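The idea of measuring a sample's deviation from its class-average decision pattern can be sketched as below. The abstract does not state which discrepancy measure DPS uses, so cosine distance is an assumption chosen for illustration, and the three-channel vectors stand in for GradCAM-based channel-contribution vectors that a real model would produce.

```python
import math
from typing import List, Sequence


def mean_vector(vectors: List[Sequence[float]]) -> List[float]:
    """Channel-wise average of a class's training decision patterns."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means the same direction (assumed discrepancy measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical channel-contribution vectors for one class.
train_patterns = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
class_avg = mean_vector(train_patterns)

typical_test = [0.9, 0.1, 0.0]   # relies on the same channels as training
shifted_test = [0.0, 0.1, 0.9]   # relies on a channel unused during training

shift_typical = cosine_distance(typical_test, class_avg)
shift_shifted = cosine_distance(shifted_test, class_avg)
print(shift_typical, shift_shifted)
```

A test sample that draws on the same channels as the training samples of its class gets a shift near zero, while one that relies on different channels gets a shift close to one.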

2605.13145 2026-05-14 cs.LG

Collaborating in Multi-Armed Bandits with Strategic Agents

Idan Barnea, Ofir Schlisselberg, Yishay Mansour

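AI summary This paper studies collaborative learning in multi-agent Bayesian bandit problems, where strategic agents jointly solve the same bandit instance. Unlike prior literature that assumes myopic, short-lived agents, it considers agents that participate over the long term and proposes a mechanism called CAOS that sustains collaboration as a Nash equilibrium while guaranteeing strong regret bounds. The results show that effective collaborative exploration can be achieved through information sharing alone, with performance close to that of a fully cooperative system.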

Abstract

We study collaborative learning in multi-agent Bayesian bandit problems, where strategic agents collectively solve the same bandit instance. While multiple agents can accelerate learning by sharing information, strategic agents might prefer to free-ride and avoid exploration. We consider a setting with persistent agents that participate in multiple time periods. This is in contrast to most previous works on incentives in multi-agent MAB, which assume short-lived agents, namely each agent has a single decision to make and optimizes their expected reward in that single decision. As in the multi-agent MAB model with incentives, our model does not have monetary transfers, and the only incentives are through information sharing. We propose CAOS, a mechanism that sustains collaboration as a Nash equilibrium while achieving strong regret guarantees. Our results demonstrate that collaborative exploration can be sustained purely through information sharing, achieving performance close to that of fully cooperative systems despite strategic behavior.

2605.13140 2026-05-14 cs.CV

Multi-Modal Guided Multi-Source Domain Adaptation for Object Detection

Sangin Lee, Seokjun Kwon, Jeongmin Shin, Namil Kim, Yukyung Choi

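AI summary This paper studies object detection under multi-source domain adaptation, aiming to improve detection performance in the target domain, particularly when the training data distribution differs from the target domain. To address the inability of existing methods to effectively preserve domain-specific information while learning domain-agnostic features, the authors propose MS-DePro, which combines depth maps and text prompts to guide object localization and classification feature alignment, respectively. The method achieves state-of-the-art performance on multiple benchmarks, confirming its effectiveness.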

Abstract

General object detection (OD) struggles to detect objects in the target domain that differ from the training distribution. To address this, recent studies demonstrate that training from multiple source domains and explicitly processing them separately for multi-source domain adaptation (MSDA) outperforms blending them for unsupervised domain adaptation (UDA). However, existing MSDA methods learn domain-agnostic features from domain-specific RGB images while preserving domain-specific information from the domain-agnostic feature map. To address this, we propose MS-DePro: Multi-Source Detector with Depth and Prompt, composed of (1) depth-guided localization and (2) multi-modal guided prompt learning. We leverage domain-agnostic input modalities, namely depth maps and text, to encode domain-agnostic characteristics. Specifically, we utilize depth maps to generate domain-agnostic region proposals for localization and integrate multi-modal features to align learnable text embeddings for classification. MS-DePro achieves state-of-the-art performance on MSDA benchmarks, and comprehensive ablations demonstrate the effectiveness of our contributions. Our code is available on https://github.com/sejong-rcv/Multi-Modal-Guided-Multi-Source-Domain-Adaptation-for-Object-Detection.

2605.13133 2026-05-14 cs.LG eess.SP

KAST-BAR: Knowledge-Anchored Semantically-Dynamic Topology Brain Autoregressive Modeling for Universal Neural Interpretation

Haoning Wang, Wenchao Yang, Shuai Shen, Yang Li

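AI summary This paper proposes KAST-BAR, a knowledge-anchored, semantically dynamic topology brain autoregressive model, which aims to address two problems that electroencephalography (EEG) foundation models face in universal cross-task neural decoding: inadequate modeling of spatiotemporal topology and the modality gap between physiological signals and high-level semantics. The model captures the brain's non-Euclidean topology with a dual-stream hierarchical attention encoder and, combined with a knowledge-anchored semantic profiling module, aligns physiological signals with an expert-level semantic space for more accurate neural signal decoding. Experiments show that KAST-BAR performs well on multiple downstream tasks and effectively integrates medical expert knowledge to improve the understanding and interpretation of EEG signals.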

Abstract

While EEG foundation models have shown significant potential in universal neural decoding across tasks, their advancement remains constrained by the inadequate modeling of complex spatiotemporal topology, as well as the inherent modality gap between low-level physiological signals and high-level textual semantics. To address these challenges, we propose a Knowledge-Anchored Semantically-Dynamic Topology Brain Autoregressive Model (KAST-BAR), which dynamically aligns physiological representations derived from multi-level brain topology with an expert-level semantic space. Specifically, we design a Dual-Stream Hierarchical Attention (DSHA) encoder that accurately captures the brain's intrinsic non-Euclidean topology by modeling local temporal dynamics with global spatial contexts. On this basis, a Knowledge-Anchored Semantic Profiler (KASP) is proposed to synthesize physically-grounded and instance-level textual profiles, which subsequently drive a Semantic Text-Aware Refiner (STAR) to dynamically reconstruct EEG representations using Latent Expert Queries. By conducting large-scale pre-training on 21 diverse datasets to build a foundation model, KAST-BAR effectively integrates expert-level medical knowledge into EEG signal representations, consistently achieving superior performance across six downstream tasks. Our code is available at https://github.com/KAST-BAR/KAST-BAR

2605.13131 2026-05-14 cs.LG cs.RO

ERPPO: Entropy Regularization-based Proximal Policy Optimization

Changha Lee, Gyusang Cho

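AI summary This paper proposes Entropy Regularization-based Proximal Policy Optimization (ERPPO), which aims to address the difficulty of policy optimization caused by non-stationary observations in multi-agent reinforcement learning. The method introduces a Distributional Spatiotemporal Ambiguity learner to estimate object detection uncertainty in multi-dimensional observation environments, and combines it with a dynamic entropy regularization term that strengthens exploration under high ambiguity and stabilizes policy updates under low ambiguity, improving object localization accuracy and search efficiency. Experiments show that ERPPO outperforms MAPPO in time-critical tasks such as maritime search, and in particular suppresses false detections effectively under visually uncertain conditions.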

Comments 9 pages, 5 figures

Abstract

Multi-Agent Proximal Policy Optimization (MAPPO) is a variant of the Proximal Policy Optimization (PPO) algorithm, specifically tailored for multi-agent reinforcement learning (MARL). MAPPO optimizes cooperative multi-agent settings by employing a centralized critic with decentralized actors. However, in multi-dimensional environments, MAPPO cannot extract an optimal policy because agent observations are non-stationary. To overcome this problem, we introduce a novel approach, Entropy Regularization-based Proximal Policy Optimization (ERPPO). For policy optimization, we first define object detection ambiguity under a multi-dimensional observation environment. A Distributional Spatiotemporal Ambiguity (DSA) learner is trained to estimate object detection uncertainty under non-stationary constraints. Then, we enhance PPO with a novel Entropy Regularization term. This regularization dynamically adjusts the policy update by applying a stronger (L1) regularization in high-ambiguity observations to encourage significant exploratory actions and a weaker (L2) regularization in low-ambiguity observations to stabilize the proximal policy optimization. This approach is designed to enhance the probability of successful object localization in time-critical operations by reducing detection failures and optimizing the search policy. Experiments on a testbed with AirSim-based maritime search scenarios show that the proposed ERPPO improves accuracy. Our proposed method improves higher gradient than MAPPO. Qualitative results confirm ERPPO's effectiveness in suppressing false detections in visually uncertain conditions.

2605.13130 2026-05-14 cs.AI

GRACE: Gradient-aligned Reasoning Data Curation for Efficient Post-training

Junjie Li, Ziao Wang, NingXuan Ma, Jianghong Ma, Xiaofeng Zhang

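AI summary This paper proposes GRACE, a gradient-aligned method for curating reasoning data for efficient model post-training. The method scores each reasoning step by its alignment with the answer gradient direction and its consistency with the preceding reasoning trajectory, and aggregates these scores into a sample-level selection criterion without external reward models or step annotations. Experiments show that GRACE maintains performance close to, or even above, that of the full dataset while using less data, and that it transfers well across models.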

Abstract

Existing reasoning data curation pipelines score whole samples, treating every intermediate step as equally valuable. In reality, steps within a trace contribute very unevenly, and selecting reasoning data well requires assessing them individually. We present GRACE, a gradient-aligned curation method that views each reasoning trace as a sequence of optimization events and scores every step by two complementary signals: its alignment with the answer-oriented gradient direction, and its consistency with the preceding reasoning trajectory. Step-level scores are aggregated into a sample-level value for subset selection, using only the model's internal optimization signals and no external reward models or step annotations. To make this scalable, GRACE introduces a representation-level gradient proxy that estimates step-level alignment from token-level upstream signals in a single forward pass. Post-training Qwen3-VL-2B-Instruct on MMathCoT-1M, GRACE reaches 108.8% of the full-data performance with 20% of the data and retains 100.2% with only 5%, with subsets that transfer effectively across model backbones.

2605.13125 2026-05-14 cs.RO

MoCCA: A Movable Circle Probability of Collision Approximation

Tobias Kern, Christian Birkner

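AI summary In automated driving, accurately estimating the Probability of Collision (POC) is essential for obstacle avoidance and safe driving. This paper proposes MoCCA, a shape approximation algorithm that approximates each vehicle's geometry with a single optimized circle, reducing over-conservatism while maintaining computational efficiency. The method establishes an upper bound on the approximation error and introduces a safety distance margin that can be calibrated based on orientation variance to address the underestimation of POC under partial coverage.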

Comments Accepted at ITSC 2026

Abstract

In automated driving, crash mitigation is crucial to ensure passenger safety. Accurate avoidance requires precise knowledge of the object's position and orientation. However, sensor noise and occlusions often result in tracking and prediction uncertainties. To account for these uncertainties, estimating the Probability of Collision (POC) is a critical requirement. While Monte Carlo sampling is a common estimation technique, its high computational demand and stochastic nature often render it unsuitable for real-time applications. Analytical POC calculations are simplified by approximating vehicle geometries using circular bounds. While multi-circle approximations offer higher fidelity than a single circumscribed circle, they significantly increase computational complexity. This paper proposes a shape approximation algorithm, MoCCA, which utilizes a single circle for each vehicle, optimized to minimize the relative distance between them. MoCCA maintains a computational efficiency comparable to standard single-circle techniques while reducing over-conservatism. To address the potential underestimation of POC inherent in partial coverage, we establish an upper bound for the approximation error, demonstrating that it depends primarily on inter-vehicle distance and orientation variance. Furthermore, we introduce a safety distance margin that can be calibrated solely based on orientation variance.

2605.13123 2026-05-14 cs.RO

Multi-Depth Uniform Coverage Path Planning for Unmanned Surface Vehicle Surveying

Maider Larrazabal, Tong Yang, Izaro Goienetxea, Jaime Valls Miro

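AI summary This paper proposes a new automatic coverage path planning algorithm for bathymetric surveying with unmanned surface vehicles. Traditional methods use back-and-forth paths at fixed depths, which cannot adapt to changes in seafloor terrain and result in uneven coverage; the proposed method incorporates coarse prior depth information to dynamically adjust path generation and sensor coverage range, achieving uniform coverage of the seafloor. Experiments show that the method significantly outperforms traditional methods in both synthetic and real scenarios, with coverage exceeding 99% and 92%, respectively, which gives it substantial practical value.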

Comments Accepted by ICRA 2026

详情
英文摘要

This paper introduces a novel automatic coverage path planning algorithm for bathymetry surveying with unmanned surface vehicles. The detection range of the mapping sensor employed, a multibeam echo sounder, is heavily influenced by local seafloor depths. Hence, a path designed to uniformly cover the sea surface does not guarantee uniform coverage of the seafloor. Yet this is currently the typical process for bathymetric surveys, with the simplistic boustrophedon scheme along manually selected waypoints at constant depths being the most widespread planner used. The proposed scheme incorporates coarse prior depth information to pre-process the target region and adaptively guide path generation and sensing range configuration. By explicitly accounting for depth variations, the proposed algorithm designs a coverage path with optimised spacing between survey passes that adjusts the sensing beam aperture to achieve more consistent seafloor coverage. The proposed method is shown to offer significant improvements in both synthetic and real-world scenarios. Validation in challenging synthetic terrains achieves coverage ratios beyond 99%, a marked improvement over traditional boustrophedon paths, which reach at most 75% coverage. The same trend appears in realistic simulations using real bathymetric data from a coastal harbour, with coverage reaching over 92%, significantly surpassing boustrophedon sweeps with coverage rates below 65%. Beyond improved performance, the scheme also brings a fully automated design, suitable for autonomous marine vehicles, thus offering practical utility for real-world applications.
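
To make the depth dependence concrete, here is a small geometric sketch of why a fixed pass spacing covers the seafloor unevenly. It assumes a flat seafloor directly below a surface-mounted sounder with a symmetric beam fan; the function names are hypothetical and this is not the paper's planner.

```python
import math


def swath_width(depth_m, aperture_deg):
    """Seafloor width covered by one pass over a flat bottom.

    Assumes a symmetric fan with total opening angle `aperture_deg`,
    emitted from the surface straight down.
    """
    return 2.0 * depth_m * math.tan(math.radians(aperture_deg) / 2.0)


def aperture_for_swath(depth_m, swath_m):
    """Opening angle (degrees) needed to cover `swath_m` at `depth_m`."""
    return math.degrees(2.0 * math.atan(swath_m / (2.0 * depth_m)))


if __name__ == "__main__":
    for depth in (5.0, 10.0, 20.0):
        print(depth, round(swath_width(depth, 90.0), 2))
```

Under these assumptions the swath grows linearly with depth, so passes spaced for deep water leave gaps in shallow water unless the spacing or the aperture is adjusted.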

2605.13122 2026-05-14 cs.CV

Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation

Jingxuan He, Xiyu Wang, Yunke Wang, Mengyu Zheng, Chang Xu

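AI Summary This paper studies the semantic grounding ability of instruction-based image editing models for zero-shot referring image segmentation. Its analysis finds that these models already produce internal representations with strong foreground-background separability early in the denoising process, which implicitly realizes language-conditioned semantic grounding. Building on this, the authors propose a training-free framework that uses the intermediate representations of pretrained image editing models and decomposes segmentation into spatial attention and semantic discrimination, yielding accurate segmentation masks without full image generation and outperforming existing zero-shot methods on several datasets.

Details
English Abstract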

Instruction-based image editing (IIE) models have recently demonstrated strong capability in modifying specific image regions according to natural language instructions, which implicitly requires identifying where an edit should be applied. This indicates that such models inherently perform language-conditioned visual semantic grounding. In this work, we investigate whether this implicit grounding can be leveraged for zero-shot referring image segmentation (RIS), a task that requires pixel-level localization of objects described by natural language expressions. Through systematic analysis, we reveal that strong foreground-background separability emerges in the internal representations of these models at the earliest denoising timestep, well before any visible image transformation occurs. Building on this insight, we propose a training-free framework that repurposes pretrained image editing models for RIS by exploiting their intermediate representations. Our approach decomposes localization into two complementary components: attention-based spatial priors that estimate where to focus, and feature-based semantic discrimination that determines what to segment. By leveraging feature-space separability, the framework produces accurate segmentation masks using only a single denoising step, without requiring full image synthesis. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg demonstrate that our method achieves superior performance over existing zero-shot baselines.

2605.13119 2026-05-14 cs.RO cs.AI cs.CV

Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

Zixing Lei, Changxing Liu, Yichen Xiong, Minhao Xiong, Yuanzhuo Ding, Zhipeng Zhang, Weixin Li, Siheng Chen

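AI Summary This work addresses the limited ability of vision-language-action (VLA) models to execute long-horizon tasks and proposes a strategy that combines a high-level vision-language model with specialized tool-style VLA modules. By introducing Tool-Aligned Post-Training (TAPT) and a tool-family interface, it couples long-horizon task planning with execution efficiently and substantially improves robots' task completion rate and instruction-following accuracy in complex environments.

Details
English Abstract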

Vision-language-action (VLA) models are effective robot action executors, but they remain limited on long-horizon tasks due to the dual burden of extended closed-loop planning and diverse physical operations. We therefore propose VLAs-as-Tools, a strategy that distributes this burden across a high-level vision language model (VLM) agent for temporal reasoning and a family of specialized VLA tools for diverse local physical operations. The VLM handles scene analysis, global planning, and recovery, while each VLA tool executes a bounded subtask. To tightly couple agent planning with VLA tool execution in long-horizon tasks, we introduce a VLA tool-family interface that exposes explicit tool selection and in-execution progress feedback, enabling efficient event-triggered agent replanning without continuous agent polling. To obtain diverse specialized VLA tools that faithfully follow agent invocations, we further propose Tool-Aligned Post-Training (TAPT), which constructs invocation-aligned training units for instruction following and adopts tool-family residual adapters for efficient tool specialization. Experiments show that VLAs-as-Tools improves the success rate of $π_{0.5}$ by 4.8 points on LIBERO-Long and 23.1 points on RoboTwin, and further enhances invocation fidelity by 15.0 points as measured by Non-biased Rate. Code will be released.

2605.13117 2026-05-14 cs.RO cs.AI

SECOND-Grasp: Semantic Contact-guided Dexterous Grasping

Han Yi Shin, Heeju Ko, Jaewon Mun, Qixing Huang, Jaehyeok Lee, Sung June Kim, Honglak Lee, Sujin Jang, Sangpil Kim

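AI Summary This paper proposes SECOND-Grasp, a semantically guided dexterous grasping framework that combines physical stability with semantic task understanding for more reliable robotic grasping. The method generates coarse contact regions through vision-language reasoning, improves contact prediction accuracy with semantic-geometric consistency refinement, and finally generates feasible grasp poses via inverse kinematics. Experiments show lifting success rates of 98.2% and 97.7% on seen and unseen object categories, respectively, along with clear gains in intent-aware grasping.

Details
English Abstract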

Achieving reliable robotic manipulation, such as dexterous grasping, requires a synergy between physically stable interactions and semantic task guidance, yet these objectives are often treated as separate, disjoint goals. In this paper, we investigate how to integrate dexterous grasping techniques, i.e., physically stable grasps for object lifting and language-guided grasp generation, to achieve both physical stability and semantic understanding. To this end, we propose SECOND-Grasp (SEmantic CONtact-guided Dexterous Grasping), a unified framework that enables robotic hands to dynamically adjust grasping strategies based on semantic reasoning while ensuring physical feasibility. We begin by obtaining coarse contact proposals through vision-language reasoning to infer where contacts should occur based on object properties, followed by segmentation to localize these regions across views. To further ensure consistency across multiple viewpoints, we introduce Semantic-Geometric Consistency Refinement (SGCR), which refines initial contact predictions by enforcing semantic consistency across views and removing geometrically invalid regions, yielding reliable 3D contact maps. Then, we derive a feasible hand pose for each contact map via inverse kinematics, generating a supervision signal for policy learning. Our approach, trained on DexGraspNet, consistently outperforms baselines in lifting success rate on both seen and unseen categories, achieving 98.2% and 97.7%, respectively, while also improving intent-aware grasping by 12.8% and 26.2%. We further show promising results on additional datasets and robotic hands, including Shadow Hand and Allegro Hand.

2605.13111 2026-05-14 cs.CV

Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation

Jiayu Chen, Junbei Tang, Wenbiao Zhao, Maoliang Li, Jiayi Luo, Zihao Zheng, Jiawei Yang, Guojie Luo, Xiang Chen

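AI Summary This paper proposes Pyramid Forcing, a head-aware pyramidal KV cache policy for improving high-quality long video generation. By analyzing how different attention heads attend to historical frames, the method identifies three head types with distinct behavior and designs a separate cache policy for each, mitigating the degradation caused by long-term error accumulation. Experiments show that it significantly improves long-horizon video generation quality across multiple metrics.

Details
English Abstract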

Autoregressive video generation enables streaming and open-ended long video synthesis, but still suffers from long-term degradation caused by accumulated errors. Existing KVCache strategies usually apply unified historical-frame retention, implicitly assuming homogeneous historical dependencies across attention heads. We revisit historical-frame attention and reveal three distinct head types: Anchor Heads require broad long-range context, Wave Heads exhibit periodic temporal dependencies, and Veil Heads focus on initial and adjacent frames. Based on this finding, we propose Pyramid Forcing, a head-aware pyramidal KVCache framework that identifies head types offline, assigns behavior-specific cache policies, and supports heterogeneous cache lengths via efficient ragged-cache attention. Experiments on Self Forcing and Causal Forcing show that Pyramid Forcing consistently improves long-horizon generation quality on VBench-Long, increasing the 60-second Self Forcing score from 77.87 to 81.21 while enhancing motion dynamics, visual fidelity, and semantic consistency. Project: https://if-lab-pku.github.io/Pyramid-Forcing/.

2605.13108 2026-05-14 cs.CV

Flow Augmentation and Knowledge Distillation for Lightweight Face Presentation Attack Detection

Muhammad Shahid Jabbar, Muhammad Sohail Ibrahim, Taha Hasan Masood Siddique, Kejie Huang, Shujaat Khan

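AI Summary This paper studies lightweight face presentation attack detection (FacePAD) under sophisticated attack types and varying capture conditions, and proposes a method that combines optical flow augmentation with knowledge distillation. Optical flow enhances motion representation during training but is not computed at inference; a dual-branch teacher model fuses appearance and motion cues, and knowledge distillation transfers its motion-aware knowledge to a lightweight student model, improving detection performance while reducing computational cost. Experiments show strong detection results on multiple benchmark datasets and real-time detection at 52 frames per second on an embedded device.

Comments Accepted at 2026 International Conference on Automatic Face and Gesture Recognition (FG)

Details
English Abstract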

Face presentation attack detection (FacePAD) remains challenging under diverse spoofing representation, including 2D print and replay, 3D mask-based spoofing, makeup-induced appearance manipulation, and physical occlusions, as well as under varying capture conditions. Motion cues are highly discriminative for FacePAD but typically require explicit optical flow estimation, which introduces substantial computational overhead and limits real-time deployment. In this work, we leverage optical flow to enhance motion representation during training while eliminating the need for flow computation at inference. We propose a dual-branch teacher model that fuses appearance cues from RGB frames with motion cues derived from colorwheel-encoded optical flow, enabling effective modeling of micro-motions and temporal consistency. To enable efficient deployment, we introduce a knowledge distillation framework that transfers motion-aware knowledge from the flow-augmented teacher to a lightweight RGB-only student via logit distillation. As a result, the student implicitly learns motion-sensitive representations without requiring explicit flow estimation or additional feature extraction blocks at inference. Extensive experiments demonstrate strong performance across multiple benchmarks, achieving 0.0% HTER on Replay-Attack and Replay-Mobile, 0.94% HTER on ROSE-Youtu, 5.65% HTER on SiW-Mv2, and 0.42% ACER on OULU-NPU. The distilled student achieves performance comparable to or better than the teacher while significantly reducing parameters and FLOPs, achieving 52 FPS on an NVIDIA Jetson Orin Nano, indicating its suitability for real-time and resource-constrained FacePAD deployment.
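
The abstract names logit distillation but does not give the loss. The sketch below shows one common formulation, a temperature-softened KL divergence between teacher and student outputs; the temperature value, the `T^2` scaling, and the function names are assumptions and may differ from what the authors use.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Scaled by temperature**2, a common convention for keeping gradient
    magnitudes comparable across temperatures. Intended for moderate
    logit values where no probability underflows to zero.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
    return temperature ** 2 * kl


if __name__ == "__main__":
    # Two-class example (e.g. live vs. spoof); the logits are made up.
    print(distillation_loss([1.0, -1.0], [3.0, -2.0]))
```

Only the student's logits are needed at inference, which matches the abstract's point that the RGB-only student runs without flow estimation.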

2605.13105 2026-05-14 cs.RO

What to Ignore, What to React: Visually Robust RL Fine-Tuning of VLA Models

Yuanfang Peng, Jingjing Fu, Chuheng Zhang, Li Zhao, Jiang Bian, Mingyu Liu, Ling Zhang, Jun Zhang, Rui Wang

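AI Summary This work addresses visual shifts faced by vision-language-action (VLA) models in robotic manipulation and proposes PAIR-VLA, a reinforcement learning fine-tuning framework. The method adds two auxiliary objectives during PPO optimization, an action invariance objective and an action sensitivity objective, which guide the model to distinguish task-relevant from task-irrelevant visual changes and thereby improve robustness. Experiments show that PAIR-VLA outperforms standard PPO under a range of out-of-distribution visual shifts, significantly improving generalization and manipulation success rate.

Details
English Abstract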

Reinforcement learning (RL) fine-tuning has shown promise for Vision-Language-Action (VLA) models in robotic manipulation, but deployment-time visual shifts pose practical challenges. A key difficulty is that standard task rewards supervise task success, but offer limited guidance on whether a visual change is task-irrelevant or changes the behavior required for manipulation. We propose PAIR-VLA (Paired Action Invariance & Sensitivity for Visually Robust VLA), an RL fine-tuning framework to address this difficulty by adding two auxiliary objectives over paired visual variants during PPO optimization: an invariance term that reduces the discrepancy between action distributions for a task-preserving pair (e.g., different distractors), and a sensitivity objective that encourages separable action distributions for a task-altering pair (e.g., target object in a different pose). Together, these objectives turn visual variants from mere observation diversity into behavior-level guidance on policy responses during RL fine-tuning. We evaluate on ManiSkill3 across two representative VLA architectures, OpenVLA and $π_{0.5}$, under diverse out-of-distribution visual shifts including unseen distractors, texture changes, target object pose variation, viewpoint shifts, and lighting changes. Our method consistently improves over standard PPO, achieving average improvements of 16.62% on $π_{0.5}$ and 9.10% on OpenVLA. Notably, ablations further show generalization across visual shifts: invariance guidance learned from distractor and texture variants transfers to target-pose and lighting shifts, while adding sensitivity guidance on target-pose variants further improves robustness to nuisance shifts, highlighting the broader transferability of behavior-level RL guidance.
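
The two auxiliary objectives can be pictured with a toy example over discrete action distributions. The abstract does not specify the divergence, the margin, or the weights, so the symmetric KL divergence, the hinge form of the sensitivity term, and every name below are assumptions for illustration only.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, smoothed by `eps`."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def invariance_term(pi_orig, pi_preserving):
    """Penalize discrepancy on a task-preserving pair (e.g. new distractor)."""
    return symmetric_kl(pi_orig, pi_preserving)


def sensitivity_term(pi_orig, pi_altering, margin=0.5):
    """Hinge penalty when a task-altering pair is not separated by `margin`."""
    return max(0.0, margin - symmetric_kl(pi_orig, pi_altering))


def auxiliary_loss(pi_orig, pi_preserving, pi_altering,
                   w_inv=1.0, w_sens=1.0, margin=0.5):
    """Weighted sum that would be added to the PPO loss in this sketch."""
    return (w_inv * invariance_term(pi_orig, pi_preserving)
            + w_sens * sensitivity_term(pi_orig, pi_altering, margin))


if __name__ == "__main__":
    base = [0.7, 0.2, 0.1]
    print(auxiliary_loss(base, [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]))
```

In this toy form the loss is zero when the policy acts identically on the task-preserving variant and differs by at least the margin on the task-altering one.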

2605.13101 2026-05-14 cs.LG cs.AI

Margin-calibrated Classifier Guidance for Property-driven Synthesis Planning

Najwa Laabid, Vikas Garg

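AI Summary This work proposes Sequence Completion Ranking (SCR), a new method for improving chemical synthesis planning based on single-step retrosynthesis models. Using contrastive argumentation and a margin-based loss, SCR calibrates a classifier so that it discriminates more effectively between property-satisfying reaction routes during decoding, improving the quality and diversity of the generated routes. Experiments show that the method substantially raises multi-step solve rates on the USPTO-190 dataset and closes the diversity gap between template-free and template-based methods.

Details
English Abstract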

Synthesis planning seeks an efficient sequence of chemical reactions that produce a target molecule. Typically, a pretrained single-step (autoregressive) retrosynthesis model is repeatedly invoked to generate such a sequence. Classifier guidance can, in principle, help steer the output of the single-step model toward reactions that satisfy specific constraints or accommodate a chemist's preferences during inference without having to retrain the autoregressive generator. We expose the insufficiency of auxiliary classifiers trained with cross-entropy loss to override the unconditional token-level distributions learned from typical sparse single-disconnection reaction datasets. We overcome this issue with a novel method called Sequence Completion Ranking (SCR), which employs contrastive argumentation and a margin-based loss to calibrate the classifier so that it can meaningfully discriminate between continuations during decoding. We formally establish that margin-calibrated classifiers can expand the set of property-satisfying sequences reachable under guided beam search. Empirically, on USPTO-190, given chemist-specified guidance targets, SCR substantially improves multi-step solve rates from $16.8\%$ (unguided generator) to $78.4\%$ with reaction-type guidance and $95.3\%$ with Tanimoto guidance, unlocking valid routes for 33 targets ($17.4\%$) previously unsolvable with baselines. Our method also effectively closes the long-standing diversity gap between template-free and template-based methods.
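
For readers unfamiliar with margin-based losses, the sketch below shows a generic pairwise hinge ranking loss: it is zero only when a property-satisfying continuation outscores a violating one by at least the margin. This is a textbook form, not the SCR objective itself, and the function names and scores are hypothetical.

```python
def margin_ranking_loss(score_positive, score_negative, margin=1.0):
    """Hinge loss on the score gap between a good and a bad continuation."""
    return max(0.0, margin - (score_positive - score_negative))


def batch_margin_loss(pairs, margin=1.0):
    """Mean hinge loss over (positive_score, negative_score) pairs."""
    return sum(margin_ranking_loss(p, n, margin) for p, n in pairs) / len(pairs)


if __name__ == "__main__":
    # Classifier scores for continuation pairs; the values are made up.
    print(batch_margin_loss([(2.0, 0.5), (0.5, 0.5)]))
```

Unlike cross-entropy on individual labels, a loss of this kind constrains the gap between competing continuations, which is the quantity that decides their order in a beam.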

2605.13099 2026-05-14 cs.SD

Bypassing Direct Reconstruction: Speech Detection from MEG via Large-Scale Audio Retrieval

Boda Xiao, Bo Wang, Heping Cheng

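AI Summary This paper studies how to detect speech from non-invasive brain signals (MEG) and proposes a new method that does not reconstruct the speech signal directly. The method first uses a contrastive learning model to retrieve, from a large-scale audio library, the speech segment matching the test MEG signal, and then uses a speech detection model to generate a binary silence/speech sequence. It achieved a top result in the LibriBrain 2025 Speech Detection task, supporting the effectiveness of using an external audio database for speech detection.

Comments ranked first at LibriBrain Competition 2025 https://neural-processing-lab.github.io/2025-libribrain-competition/prizes/

Details
English Abstract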

Decoding speech from non-invasive brain signals is challenging. For the LibriBrain 2025 Speech Detection task, we propose a novel two-step framework that bypasses direct reconstruction. First, a contrastive learning model retrieves the matching speech segment for the given test MEG from a large-scale audio library (LibriVox). Second, a speech detection model generates the binary silence/speech sequence directly from this retrieved audio. With this approach, our team Sherlock Holmes achieved first place in the extended track (F1-score: 0.962), demonstrating that leveraging external audio databases is a highly effective strategy.
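
The retrieval step can be sketched as a nearest-neighbour search in a shared embedding space. The example assumes the contrastive model has already mapped the MEG window and every candidate audio segment to vectors of equal length; cosine similarity, the function names, and the toy vectors are assumptions, not details from the paper.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_best_segment(meg_embedding, audio_embeddings):
    """Return (index, score) of the audio segment most similar to the MEG."""
    scores = [cosine_similarity(meg_embedding, a) for a in audio_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


if __name__ == "__main__":
    library = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # toy audio embeddings
    index, score = retrieve_best_segment([0.1, 0.9], library)
    print(index, round(score, 3))
```

In the two-step framework described above, the silence/speech labels would then be computed from the retrieved audio rather than from the MEG signal.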

2605.13094 2026-05-14 cs.RO

Identification of Non-Transversal Bifurcations of Linkages

Andreas Mueller, P. C. López Custodio, J. S. Dai

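AI Summary This paper studies the identification of motion branches of linkages at non-transversal bifurcations and proposes a local analysis method based on the kinematic tangent cone. The method extracts the necessary information from the constructive definition of the kinematic tangent cone to separate different motion branches, addressing a limitation of conventional local analysis at non-transversal bifurcations. The paper also presents a computational method that extends an existing algorithmic framework, providing a new tool for studying the singularities and mobility of linkages.

Comments Paper No: DETC2020-22301, V010T10A090; 8 pages

Details
Journal ref
Proceedings of the ASME 2020 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference. Volume 10: 44th Mechanisms and Robotics Conference (MR). Virtual, Online
English Abstract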

Local analysis is an established approach to the study of singularities and mobility of linkages. The key result of such an analysis is a local picture of the finite motion through a configuration. This reveals the finite mobility at that point and the tangents to smooth motion curves. It does not, however, immediately allow one to distinguish between motion branches that do not intersect transversally (a rather uncommon situation that has only recently been discussed in the literature). The mathematical framework for such a local analysis is the kinematic tangent cone. It is shown in this paper that the constructive definition of the kinematic tangent cone already involves all information necessary to separate different motion branches. A computational method is derived by amending the algorithmic framework reported in previous publications.

2605.13093 2026-05-14 cs.CV

RoSplat: Robust Feed-Forward Pixel-wise Gaussian Splatting for Varying Input Views and High-Resolution Rendering

Hoang Chuong Nguyen, Renjie Wu, Jose M. Alvarez, Miaomiao Liu

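AI Summary RoSplat is a robust feed-forward pixel-wise Gaussian splatting method that targets the over-bright renderings and hole artifacts that arise when the number of input views varies and when rendering at high resolution. By introducing a pixel-wise alpha normalization strategy and an auxiliary regularizer based on 3D sampling, it improves Gaussian scale estimation and rendering consistency. Experiments on several benchmark datasets show that RoSplat significantly improves on baseline models, particularly with varying input views and at high resolution.

Details
English Abstract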

Generalizable 3D Gaussian Splatting has recently emerged as an efficient approach for novel-view synthesis, enabling feed-forward synthesis from only a few input views. However, existing pixel-wise feed-forward methods suffer from over-bright renderings when the number of input views varies during inference, as well as insufficient supervision for accurate Gaussian scale estimation, which leads to hole artifacts, particularly in high-resolution renderings. To address these issues, we identify that the over-brightness is caused by the varying number of overlapping Gaussians and propose a simple alpha normalization strategy to maintain brightness consistency across different numbers of input views. In addition, we introduce an auxiliary 3D sampling-based regularizer to improve Gaussian scale estimation, thereby mitigating hole artifacts in high-resolution rendering. Experiments on benchmark datasets demonstrate that our method significantly improves baseline models under varying input-view and high-resolution rendering settings.

2605.13088 2026-05-14 cs.LG

Bayesian Nonparametric Mixed-Effect ODEs with Gaussian Processes

Julien Martinelli, Maksim Sinelnikov, Harri Lähdesmäki, Quentin Clairon, Mélanie Prague

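AI Summary This paper proposes a Bayesian nonparametric mixed-effect ordinary differential equation (ODE) model for dynamical systems with subject-level heterogeneity. The method decomposes each subject's vector field into a shared population component and a subject-specific deviation and places Gaussian process priors on both, gaining flexibility while retaining uncertainty quantification. Training combines state-space Gaussian process trajectory priors with virtual collocation observations, which improves recovery of the population vector field and prediction of subject-level trajectories.

Details
English Abstract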

Dynamical modelling is central to many scientific domains, including pharmacometrics, systems biology, physiology, and epidemiology. In these settings, heterogeneity is often intrinsic: different subjects or units follow related but distinct continuous-time dynamics. Classical nonlinear mixed-effects Ordinary Differential Equation (ODE) models address this by combining population-level structure with subject-specific effects, but they rely on a parametric vector field and are therefore vulnerable to structural misspecification and unmodelled mechanisms. This motivates nonparametric approaches that can retain principled uncertainty quantification, yet existing nonparametric ODE methods typically assume a single shared dynamical system rather than an explicit mixed-effect hierarchy over subject-specific dynamics. We propose MEGPODE, a Bayesian nonparametric mixed-effect ODE model in which each subject's vector field is decomposed into a shared population component and a subject-specific deviation, both endowed with Gaussian process (GP) priors. To avoid repeated ODE solves per subject during training, we combine state-space GP trajectory priors with virtual collocation observations, yielding Kalman-smoothing trajectory updates and closed-form regressions for the vector fields. Across controlled heterogeneous ODE benchmarks spanning oscillatory and biomedical systems, MEGPODE improves population-field recovery and subject-level trajectory prediction relative to strong baselines.
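
The decomposition described in this abstract can be written compactly as follows. The notation is introduced here for illustration: the symbols $f_{\mathrm{pop}}$, $\delta_i$, $k_{\mathrm{pop}}$, $k_{\mathrm{dev}}$, the zero prior means, and the subject count $N$ are assumptions, since the abstract states only that both components carry GP priors.

```latex
\[
\begin{aligned}
\dot{x}_i(t) &= f_i\bigl(x_i(t)\bigr), \qquad f_i = f_{\mathrm{pop}} + \delta_i, \qquad i = 1, \dots, N, \\
f_{\mathrm{pop}} &\sim \mathcal{GP}\bigl(0,\, k_{\mathrm{pop}}\bigr), \qquad
\delta_i \sim \mathcal{GP}\bigl(0,\, k_{\mathrm{dev}}\bigr).
\end{aligned}
\]
```

Here $x_i(t)$ is the state of subject $i$, $f_{\mathrm{pop}}$ is shared by all subjects, and $\delta_i$ captures how subject $i$ departs from the population dynamics.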