arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.09431 2026-05-12 cs.CL

PumpSense: Real-Time Detection and Target Extraction of Crypto Pump-and-Dumps on Telegram

Ahmed Mahrous, Roberto Di Pietro

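AI summary PumpSense is a study of real-time detection and target extraction for cryptocurrency pump-and-dump schemes on Telegram. The authors build a labeled dataset of more than 280,000 messages for identifying pump announcements and their target coins and exchanges, and propose detection and extraction methods based on machine learning and large language models that achieve near-real-time detection. The study also establishes the first benchmark for these tasks and shows that LLM-based methods have a clear advantage in target extraction.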

Comments Accepted to the 2026 IEEE International Conference on Blockchain and Cryptocurrency (ICBC)

Abstract

Cryptocurrency pump-and-dump schemes coordinated via Telegram threaten market integrity. However, existing research addressing this specific threat has not yet produced solutions that combine reliable results with fast response. This is due in part to the absence of publicly available, message-level labeled data, and in part to design choices. In this paper, we address both issues. In particular, we introduce a corpus of over 280,000 Telegram posts from 39 pump-organizing groups, all manually reviewed to identify 2,246 pump announcements and their targeted cryptocurrency and exchange. Leveraging this dataset, we define two tasks: real-time pump-announcement detection and target cryptocurrency/exchange extraction. For detection, we compare two machine-learning models: a lightweight tree-based LightGBM classifier (F1=0.79, latency=9.4 µs/sample) and a transformer-based BGE-M3 (F1=0.83, latency=50 ms/sample). With our proposed approach, we show that message analysis can achieve near-instant pump detection at the level of individual Telegram message windows. Unlike prior work that relies purely on market data and typically detects pumps tens of seconds after abnormal trading activity is observed, our method operates directly on the coordination messages themselves and can be evaluated in microseconds per window on commodity hardware. To our knowledge, we also establish the first benchmark for manipulated coin and exchange extraction. We demonstrate that traditional rule-based extraction methods, widely relied upon in prior literature, are ineffective due to ticker ambiguity. In contrast, LLMs achieve the highest accuracy with a score of 0.91.
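The ticker-ambiguity problem mentioned in the abstract can be illustrated with a small sketch. It is not the paper's baseline: the regular expression, the `KNOWN_TICKERS` set, and the sample message are all hypothetical and chosen only to show how a dictionary-plus-regex rule can return several candidates for one announcement.

```python
import re

# Hypothetical ticker list for illustration; real exchanges list far more symbols.
KNOWN_TICKERS = {"ONE", "BTC", "PUMP", "NOW"}

def rule_based_candidates(message):
    """Return every all-caps token (2 to 6 letters) that matches a known ticker."""
    tokens = re.findall(r"\b[A-Z]{2,6}\b", message)
    return [t for t in tokens if t in KNOWN_TICKERS]

# Hypothetical announcement text.
message = "PUMP starts NOW on KUCOIN, the coin is ONE"
candidates = rule_based_candidates(message)
```

Here the rule returns three candidates even though only one token names the target coin, so any rule of this shape needs extra context to pick the right one.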

2605.09429 2026-05-12 cs.CV cs.AI

Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models

Jie Ma, Yihang Liu, Zhike Qiu, Jiayi Ji, Xiaoshuai Sun

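AI summary This study asks whether low-attention visual tokens in vision-language models are truly redundant, and argues that existing methods that prune by shallow attention scores can impair reasoning over complex scenes, causing "Visual Aphasia". The authors propose COAST, a training-free framework for contrastive adaptive semantic token pruning that identifies key tokens through cross-modal attention and balances semantic evidence against spatial context. Experiments show that COAST substantially reduces the number of visual tokens and speeds up inference on multiple benchmarks while largely preserving model performance, and that it applies across different models and compression settings.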

Abstract

Are low-attention visual tokens truly redundant in vision-language reasoning? Existing pruning methods often assume so, ranking visual tokens by shallow text-to-image attention and discarding low-scoring patches to accelerate LVLM inference. We show that this scalar criterion is unreliable for compositional reasoning: tokens ignored in early layers can later become essential for resolving secondary objects, spatial relations, and contextual cues. Premature pruning can therefore induce Visual Aphasia, a failure mode in which the model loses visual grounding and falls back on language priors. We introduce COAST (COntrastive Adaptive Semantic Token Pruning), a training-free pruning framework that casts compression as adaptive semantic routing. COAST uses native cross-modal attention to identify query-specific anchors and estimate contextual dispersion via attention entropy, then adapts the retention trade-off between semantic evidence and spatial context. It further uses a contrastive routing score to preserve both anchor-aligned evidence and complementary spatial context. Across seven benchmarks, COAST reduces visual tokens by 77.8% and achieves a 2.15x latency speedup while retaining 98.64% of the original average performance. Beyond a single backbone or compression setting, COAST consistently outperforms strong pruning baselines across token budgets and generalizes across multiple LVLM families, showing that adaptive semantic routing is a robust alternative to one-shot scalar pruning.
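As a rough illustration of the "attention entropy" quantity mentioned in the abstract, the sketch below computes the Shannon entropy of one text-to-image attention row and contrasts it with the plain top-k scalar rule that the abstract criticizes. The function names and the normalization by log(n) are assumptions made here; COAST's actual routing score is not specified in this listing.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution over visual tokens."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)

def normalized_entropy(weights):
    """Entropy divided by its maximum, log(n), so the value lies in [0, 1]."""
    n = len(weights)
    if n < 2:
        return 0.0
    return attention_entropy(weights) / math.log(n)

def keep_top_k(weights, k):
    """One-shot scalar baseline: indices of the k highest-attention tokens."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(order[:k])
```

A value near 1 means attention is spread over many patches, which under the abstract's argument is the case where a fixed top-k cut is most likely to discard useful context.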

2605.09428 2026-05-12 cs.LG

FedCIGAR: A Personalized Reconstruction Approach for Federated Graph-level Anomaly Detection

Yunfeng Zhao, Yixin Liu, Qingfeng Chen, Shiyuan Li, Yue Tan, Shirui Pan

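AI summary This paper proposes FedCIGAR, a federated graph-level anomaly detection method that targets the tension between privacy protection and model generalization in distributed settings. It learns by reconstruction on normal graphs, which avoids unrealistic synthetic anomalies, and introduces a client-side node contribution gating mechanism and a server-side sliding-window clustering strategy to handle data heterogeneity. Experiments show that FedCIGAR achieves superior detection performance and robustness on multiple benchmark datasets.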

Comments Accepted by IJCAI 2026

Abstract

Graph-level anomaly detection (GLAD) is crucial for ensuring the reliability of graph-driven applications by identifying abnormal graphs that deviate from the majority. Considering the privacy concerns in distributed scenarios, federated graph-level anomaly detection (FedGLAD) has emerged as a promising solution to enable collaborative detection without sharing raw data. However, existing methods suffer from poor generalization due to the reliance on unrealistic synthetic anomalies and insufficient personalization capabilities under data heterogeneity. To address these challenges, we propose a novel Federated graph-level anomaly detection approach with Cluster-adaptIve GAted Reconstruction (FedCIGAR). Specifically, we design a reconstruction-based paradigm trained on normal graphs to avoid synthetic data. Furthermore, we introduce a client-side node contribution gating mechanism and a server-side sliding window-based clustering strategy to tackle data heterogeneity. Extensive experiments demonstrate that FedCIGAR achieves superior performance and robustness compared with state-of-the-art methods.

2605.09425 2026-05-12 cs.CV cs.AI

AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation

Shogo Noguchi

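AI summary This paper studies how conflicts among conditions affect structural fidelity in multi-condition diffusion models and proposes an attention-based conflict suppression method that improves the high-level structural consistency of generated images. Using semantic segmentation, depth maps, and edges as joint conditions, the model generates high-quality images that preserve scene detail, for use as data augmentation in autonomous-driving tasks. Beyond addressing condition conflicts, the work builds a generation framework and evaluation protocol for driving tasks, supporting efforts to mitigate data scarcity in high-level autonomous driving.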

Comments 44 pages, 20 figures. Code and project page available at: https://github.com/ShogoNoguchi/AtteConDA

Abstract

Recent conditional image generation methods can improve controllability by generating images that are faithful to conditions such as sketches, human poses, segmentation maps, and depth. By applying these techniques to image augmentation while preserving annotations, generated images can be used as additional training data and can improve recognition performance. However, for high-level driving tasks such as traffic-rule extraction and driving-behavior understanding, simply using annotations as conditions is insufficient. Instead, images must be augmented while preserving the detailed high-level structure of the original scene. One possible solution is to use multiple conditions so that generated images retain diverse structural cues after generation. However, when multiple conditions are used, conflicts among conditions can prevent reliable structure preservation. In this work, we input semantic segmentation, depth, and edges extracted from the original image into a multi-condition image generation model, thereby providing rich structural information as conditions. We further propose a modeling approach for handling conflicts among multiple conditions and show that it enables image generation with stronger structural preservation. We also build a generation framework and evaluation protocol for driving tasks, establishing a basis for comparison with prior and future models. As a result, this work contributes to image generation research by addressing condition conflicts in multi-condition generation and provides an important step toward mitigating data scarcity in high-level autonomous-driving tasks.

2605.09424 2026-05-12 cs.LG

Tabular Foundation Model for Generative Modelling

Xiangjian Jiang, Mingxuan Liu, Nikola Simidjievski, Tassilo Klein, Mateja Jamnik

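AI summary This paper proposes TabFORGE, a new generative tabular foundation model aimed at the shortfall in synthetic data quality of existing tabular generators. Using a pretrained causality-aware feature encoder, it learns the implicit causal information of tabular data in a unified latent space. A two-stage design first pretrains a score-based diffusion transformer and then a denoising-aligned decoder, which mitigates the distribution shift in latent representations between training and inference. Experiments show that TabFORGE efficiently generates high-quality synthetic tabular data, with particularly strong structural fidelity.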

Abstract

Generative modelling is a demanding test of foundation models, because it requires robust, holistic representation learning for a given data modality, rather than optimisation for a supervised prediction target alone. While recent work on tabular foundation models has achieved remarkable progress in predictive modelling, generative tabular foundation models remain underexplored. Existing tabular foundation generators, in particular, have not yet consistently matched strong dataset-specific generators in synthetic data quality. A key reason is their misalignment with the distinctive causal structural prior of heterogeneous tabular data. In this paper, we address this gap by introducing a novel tabular foundation model, TabFORGE, built on pretrained Tabular FOundational Representations for GEneration. TabFORGE is designed to utilise the implicitly learned causal information underlying diverse tabular datasets in a unified latent space induced by a pretrained causality-aware feature encoder. It further decouples latent modelling from decoding through a two-stage design: we first pretrain a score-based diffusion transformer, and then pretrain a denoising-aligned decoder using the denoised latent embeddings. This design elegantly mitigates the distribution shifts in latent embeddings that typically arise between training and inference. We evaluate TabFORGE comprehensively against 22 benchmark methods on 45 real-world datasets. Our results show that TabFORGE effectively learns and leverages generalisable tabular representations, enabling efficient generation of high-quality synthetic tabular data, particularly with strong structural fidelity.

2605.09422 2026-05-12 cs.CL cs.CV

Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs

Jiafeng Liang, Zhihao Zhu, Zihan Zhang, Baoqi Ren, Shixin Jiang, Runxuan Liu, Tao Ren, Ming Liu, See-Kiong Ng, Bing Qin

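AI summary Although large multimodal models (LMMs) perform well on video understanding, they tend to rely on textual priors during causal discovery, a deficit that is not yet well understood. This paper proposes ProCauEval, a perturbation-based evaluation protocol that systematically controls the visual and textual inputs to expose failure modes in causal reasoning. The study finds that mainstream LMMs perceive video content accurately but underuse it in causal reasoning, and that stronger post-training increases reliance on textual priors. The authors therefore propose Anti-Distillation Policy Optimization (ADPO), a reinforcement learning framework that pushes the model to rely on visual evidence rather than textual shortcuts. Experiments show that it improves visual engagement while preserving basic comprehension.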

Comments 17 pages, 5 figures

Abstract

Although Large Multimodal Models (LMMs) have achieved strong performance on general video understanding, their susceptibility to textual prior shortcuts during causal discovery has been recognized as a critical deficit. The underlying mechanisms of this phenomenon remain incompletely understood, as existing benchmarks only measure response accuracy without revealing the sources and extent of the deficit. We introduce ProCauEval, a perturbation-based evaluation protocol that shifts from outcome assessment to mechanism diagnosis, probing causal discovery through five controlled configurations that systematically manipulate visual and textual modalities to decompose their respective contributions to model behavior and dissect the failure modes. Evaluating 17 mainstream LMMs, we find that models faithfully perceive video content yet systematically underexploit it during causal reasoning. We further observe that stronger post-training amplifies rather than mitigates textual prior reliance, and that higher baseline performance correlates with greater fragility under perturbation. To address these issues, we propose Anti-Distillation Policy Optimization (ADPO), a reinforcement learning framework built on negative teacher alignment, which augments GRPO by explicitly pushing the policy away from a prior-only counterfactual teacher induced by visual corruption. Specifically, ADPO maximizes the divergence between the policy distributions conditioned on the original and visually corrupted inputs, thereby forcing the model to ground its reasoning in visual evidence rather than textual shortcuts. Extensive experiments show that ADPO improves visual engagement without sacrificing fundamental comprehension, thus offering a preliminary step toward reliable causal discovery.
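The idea of rewarding divergence between the policy's output on the original input and on a visually corrupted input can be sketched as follows. The abstract does not say which divergence ADPO uses or how it enters the GRPO objective, so the choice of KL divergence, the function names, and the toy logits are all assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions over the same answer set."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def visual_engagement_bonus(logits_clean, logits_corrupted):
    """Hypothetical bonus: larger when corrupting the video changes the answer distribution."""
    return kl_divergence(softmax(logits_clean), softmax(logits_corrupted))

# Toy logits over three answer options (illustrative values only).
same = visual_engagement_bonus([2.0, 0.0, 0.0], [2.0, 0.0, 0.0])
different = visual_engagement_bonus([2.0, 0.0, 0.0], [0.0, 0.0, 0.0])
```

A model that answers from textual priors alone gives the same distribution with or without the video, so its bonus stays near zero. A model that uses the video produces a positive value.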

2605.09419 2026-05-12 cs.AI

From Passive Reuse to Active Reasoning: Grounding Large Language Models for Neuro-Symbolic Experience Replay

Yanan Xiao, Yixiang Tang, Zechen Feng, Lu Jiang, Minghao Yin, Pengyang Wang

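AI summary This paper proposes Neuro-Symbolic Experience Replay (NSER), a framework that turns experience replay in reinforcement learning from a passive memory mechanism into a knowledge-construction engine capable of active reasoning. It combines large language models (LLMs) with symbolic logic representations to induce behavioral rules from accumulated trajectories, converts them into differentiable logic expressions, and uses these to dynamically reweight the replay distribution. By letting abstract knowledge guide policy optimization directly, NSER achieves higher sample efficiency and faster convergence on a range of benchmark tasks.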

Abstract

While experience replay is essential for data efficiency in reinforcement learning (RL), standard methods treat the replay buffer as a passive memory system, prioritizing samples based on numerical prediction errors rather than their semantic significance. This approach stands in contrast to human learning, which accelerates mastery by actively abstracting fragmented experiences into behavioral rules. To bridge this gap, we propose Neuro-Symbolic Experience Replay (NSER), a framework that transforms experience replay from a passive sample reuse mechanism into an active engine for knowledge construction. Specifically, NSER addresses the incompatibility between linguistic reasoning and numerical optimization through a novel neuro-symbolic grounding pipeline. It leverages Large Language Models (LLMs) in a zero-shot manner to induce candidate behavioral rules from accumulated trajectories, grounds these insights into differentiable first-order logic representations, and utilizes the resulting symbolic structures to dynamically reweight the replay distribution. By allowing abstract knowledge to directly shape policy optimization, NSER achieves consistently superior sample efficiency and convergence speed across reactive, rule-based, and procedural benchmarks.

2605.09418 2026-05-12 cs.CV cs.RO

MAG-VLAQ: Multi-modal Aerial-Ground Query Aggregation for Cross-View Place Recognition

Zhengyi Xu, Yuhang Ming, Zhihao Zhan, Hanyu Zhu, Javier Civera, Wanzeng Kong

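AI summary Cross-view place recognition is challenging in computer vision and robotics, particularly because of the large viewpoint, modality, and structural differences between ground observations and aerial references. This paper proposes the MAG-VLAQ framework, which fuses multi-modal features extracted by pre-trained foundation models and aligns ground and aerial inputs in a shared embedding space. Its core contribution is an ODE-conditioned VLAQ mechanism that dynamically adapts the query centers to the multi-modal information, preserving global retrieval prototypes while improving scene-specific matching. Experiments show that the method clearly outperforms existing approaches on KITTI360-AG, reaching 61.1 Recall@1.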

Comments 16 pages, 4 figures, 3 tables

Abstract

Multi-modal cross-view place recognition remains a fundamental challenge in computer vision and robotics due to the severe viewpoint, modality, and spatial-structure discrepancies between ground observations and aerial references. To address this challenge, we present MAG-VLAQ, a foundation-model-enhanced query aggregation framework for multi-modal aerial-ground cross-view place recognition. Specifically, our approach leverages pre-trained foundation models to extract dense visual tokens from both ground and aerial images, as well as expressive geometric tokens from ground LiDAR observations. These heterogeneous tokens are then projected into a shared embedding space for cross-modal alignment and fusion. As our main contribution, we propose ODE-conditioned VLAQ, which tightly couples neural ordinary differential equations (ODE)-based RGB-LiDAR fusion with vectors of locally aggregated queries (VLAQ). In this design, the VLAQ query centers are dynamically adapted according to the fused multi-modal state. This mechanism allows the final global descriptor to preserve globally learned retrieval prototypes while remaining responsive to scene-specific visual and geometric evidence, significantly improving aerial-ground matching. Extensive experiments on KITTI360-AG and nuScenes-AG validate the effectiveness of our proposed MAG-VLAQ. Notably, on KITTI360-AG, our MAG-VLAQ nearly doubles the state-of-the-art performance, achieving 61.1 Recall@1 in the satellite setting, compared with 34.5 from the closest competing approach.

2605.09417 2026-05-12 cs.CV

SAMOFT: Robust Multi-Object Tracking via Region and Flow

Yanchao Wang, Dawei Zhang, Chengzhuan Yang, Wei Liu, Minglu Li, Hua Wang, Zhonglong Zheng, Ming-Hsuan Yang

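AI summary This paper proposes SAMOFT, a robust multi-object tracker aimed at the difficulties that object deformation, nonlinear motion, and occlusion cause in complex motion scenarios. It introduces a Pixel Motion Matching (PMM) module that combines the Segment Anything Model (SAM) with dense optical flow to refine Kalman filter-based motion prediction. It also adds a Centroid Distance Matching (CDM) module for robustness to low-confidence detections and a Distribution-Based Correction (DBC) module for online correction of trajectory states. Experiments on multiple benchmarks show that SAMOFT consistently improves baseline trackers and is competitive with recent state-of-the-art methods.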

Abstract

Multi-object tracking (MOT) is a fundamental task in computer vision that requires continuously tracking multiple targets while maintaining consistent identities across frames. However, most existing approaches primarily rely on instance-level object features for trajectory association, which often leads to degraded performance under challenging conditions such as object deformation, nonlinear motion, and occlusion. In this work, we propose SAMOFT, a robust tracker that leverages pixel-level cues to improve robustness under complex motion scenarios. Specifically, we introduce a Pixel Motion Matching (PMM) module that integrates the Segment Anything Model (SAM) with dense optical flow to refine Kalman filter-based motion prediction using instantaneous foreground pixel motion. To further enhance robustness under unreliable detections, we design a Centroid Distance Matching (CDM) module that performs flexible mask-based centroid matching for low-confidence or partially occluded observations. Moreover, a Distribution-Based Correction (DBC) module models long-tailed motion patterns in a training-free manner using historical optical flow statistics and dynamically corrects trajectory states online. We also incorporate a Cluster-Aware ReID (CA-ReID) strategy to improve the stability and discriminative power of trajectory appearance features. Extensive experiments on the DanceTrack and MOTChallenge benchmarks demonstrate that SAMOFT consistently improves baseline trackers and achieves competitive performance compared with recent state-of-the-art methods, validating the effectiveness of leveraging pixel-level cues for robust multi-object tracking.

2605.09416 2026-05-12 cs.LG

A Controlled Diagnostic Study of Hardware-Induced Distortions in Hardware-Aware Training

Yunxuan Fang, Xinhe Wang

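AI summary This paper studies how hardware non-idealities affect neural network training. It proposes a diagnostic framework that models hardware-induced distortions as structured perturbations of the forward operator and evaluates their compatibility with gradient-based optimization. Analyzing six representative perturbation classes, it identifies three key diagnostics and shows which hardware distortions can be compensated by training and which break optimization, giving guidance for hardware-software co-design.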

Abstract

Hardware-aware training (HAT) is widely used to improve the robustness of neural networks on non-ideal AI accelerators, such as analog in-memory computing (IMC) systems. However, not all hardware-induced distortions are equally compensable by training. This paper presents a diagnostic framework that models hardware non-idealities as structured perturbations of the forward operator and evaluates their compatibility with gradient-based optimization. We analyze six representative perturbation classes (read noise, variability, drift, stuck-at faults, IR-drop, and ADC discretization) and identify three key diagnostics: gradient expectation consistency, bounded gradient variance, and non-degenerate sensitivity. Our results show a clear separation between perturbations that can be compensated by HAT and those that consistently break optimization. This provides practical guidance for hardware-software co-design, clarifying which non-idealities can be addressed at the training level and which require circuit-, architecture-, or calibration-level mitigation. This study should be interpreted as a controlled empirical analysis under vanilla forward-perturbation HAT, rather than as a universal theory of hardware-aware training.

2605.09414 2026-05-12 cs.CL

Cross-Cultural Transfer of Emoji Semantics and Sentiment in Financial Social Media

Ahmed Mahrous, Roberto Di Pietro

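AI summary This study examines whether the semantics and sentiment of emojis in financial social media transfer across languages, platforms, and asset communities. Analyzing multilingual Twitter and StockTwits data, it finds that although emoji frequencies differ across communities, their semantics and sentiment polarity are largely stable. It also shows that including emoji information improves sentiment transfer models, most notably in cross-language transfer, which points to a partially shared "emoji code" in financial communication.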

Comments Accepted to Findings of the Association for Computational Linguistics: ACL 2026

Abstract

Emojis are widely used in online financial communication, but it is unclear whether they provide transferable sentiment signals across languages, platforms, and asset communities. This study examines the extent to which emoji usage, semantics, and sentiment polarity remain stable across financial communities, and how these layers influence zero-shot sentiment transfer. Using large corpora of Twitter and StockTwits posts in four languages, we measure cross-community divergence and evaluate sentiment models trained under emoji-only, text-only, and text+emoji inputs. We find that emoji frequencies differ across communities, especially across languages, but their semantics and sentiment polarity are largely stable. Cross-asset transferability shows minimal degradation, while cross-language transfer remains the most challenging. Including emojis consistently reduces transfer gaps relative to text-only models. These results indicate that financial communication exhibits a partially shared "emoji code," and that emojis provide compact, language-independent sentiment cues that improve model generalization across markets and platforms.

2605.09410 2026-05-12 cs.RO cs.AI

RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models

Weijia Liufu, Xiaoyu Guo, Ruiyi Chen, Jingzhi Liu, Kaidong Zhang, Xiwen Liang, Jianqi Lin, Dawei Sun, Yuze Wang, Rongtao Xu, Bingqian Lin, Bowen Yang, Tongtong Cao, Bowen Peng, Dongyu Zhang, Guangrun Wang, Min Wang, Liang Lin, Xiaodan Liang

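AI summary This paper proposes RePO-VLA, a recovery-driven policy optimization framework for vision-language-action (VLA) models that aims to improve their robustness in complex manipulation tasks. It assigns distinct roles to success, recovery, and failure trajectories and combines recovery-aware initialization with a progress-aware semantic value function, so that useful information in failure data can be used for policy optimization. Experiments on simulated and real-world bimanual tasks show a large gain in success under adversarial conditions, from 20% to 75% on average and up to 80% in real-world trials.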

Abstract

Vision-Language-Action (VLA) models remain brittle in long-horizon, contact-rich manipulation because success-only imitation provides little supervision for execution drift, while failed rollouts are often discarded. We introduce RePO-VLA, a recovery-driven policy optimization framework that assigns distinct roles to success, recovery, and failure trajectories. RePO-VLA first applies Recovery-Aware Initialization (RAI), slicing recovery segments and resetting history so corrective actions depend on the current adverse state rather than the preceding failure. It then learns a Progress-Aware Semantic Value Function (PAS-VF), aligning spatiotemporal trajectory features with instructions and successful references. The resulting labels salvage useful failure prefixes via reliability decay, while low-value labels mark drift and terminal breakdowns, teaching differences among nominal, failed, and corrective actions. The data engine turns adverse states into planner-generated or human-collected corrective rollouts, teaching recovery to the success manifold. Value-Conditioned Refinement (VCR) trains the policy to prefer high-progress actions. At deployment, a fixed high value ($v=1.0$) biases actions toward the learned success manifold without online failure detectors or heuristic retries. We introduce FRBench, with standardized error injection and recovery-focused evaluation. Across simulated and real-world bimanual tasks, RePO-VLA improves robustness, raising adversarial success from 20% to 75% on average and up to 80% in scaled real-world trials.

2605.09408 2026-05-12 cs.LG cs.SI stat.ML

GravityGraphSAGE: Link Prediction in Directed Attributed Graphs

Riccardo Porcedda, Francesca Chiaromonte, Fabrizio Lillo, Andrea Vandin

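AI summary This paper studies link prediction in directed attributed graphs, that is, predicting missing or future connections between nodes. To address the limits of existing methods in handling edge direction and node attributes, the authors propose GravityGraphSAGE (GG-SAGE), a GraphSAGE variant with a gravity-inspired mechanism, and describe it as the first use of GraphSAGE for directed link prediction. Experiments show that the model outperforms state-of-the-art graph deep learning link prediction methods on several benchmark datasets and real-world networks, and that it scales to graphs of increasing complexity.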

Abstract

Link prediction (inferring missing or future connections between nodes in a graph) is a fundamental problem in network science with widespread applications in, e.g., biological systems, recommender systems, finance and cybersecurity. The ability to accurately predict links has significant real-world applications, such as detecting fraudulent financial transactions or identifying drug-target interactions in biomedicine. Despite a rich literature, link prediction is still challenging, especially for graphs enriched with information on edges (direction) and nodes (attributes). In fact, research on link prediction, especially work based on Graph Deep Learning (GDL), has mostly focused on undirected graphs, without fully leveraging node attributes. Here, we fill this gap by proposing Gravity-GraphSAGE (GG-SAGE), a modified version of GraphSAGE, a GDL model for node embeddings, paired with a gravity-inspired decoder. This implementation is the first example in the literature of a GraphSAGE backbone adopted for directed link prediction. Using the benchmark datasets Cora, Citeseer, PubMed and 16 real-world graphs from the online Netzschleuder repository, we show that our proposed model outperforms state-of-the-art GDL link prediction techniques. Using further experimental evidence, we relate the quality of the output of our model with various characteristics of the graph, suggesting that our framework scales well when applied to data of increasing complexity.
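A gravity-inspired decoder scores a directed edge asymmetrically by giving each node a learned "mass" in addition to its embedding. The sketch below assumes the form used in earlier gravity-inspired graph autoencoder work, sigmoid(m_j - lambda * log ||z_i - z_j||^2). Whether GG-SAGE uses exactly this expression is not stated in the abstract, and the embeddings and masses here are made-up values.

```python
import math

def gravity_score(z_i, z_j, mass_j, lam=1.0, eps=1e-9):
    """Probability of a directed edge i -> j: sigmoid(mass_j - lam * log ||z_i - z_j||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(z_i, z_j))
    logit = mass_j - lam * math.log(sq_dist + eps)
    return 1.0 / (1.0 + math.exp(-logit))

# Two hypothetical nodes: a has a large mass (attracts links), b a small one.
z_a, mass_a = [0.0, 0.0], 2.0
z_b, mass_b = [1.0, 1.0], -1.0

score_a_to_b = gravity_score(z_a, z_b, mass_b)  # uses the mass of the target b
score_b_to_a = gravity_score(z_b, z_a, mass_a)  # uses the mass of the target a
```

Because the distance term is symmetric and only the target's mass enters the logit, the two directions get different scores, which is what lets the decoder represent edge direction.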

2605.09407 2026-05-12 cs.CV

AnyDepth-DETR/-YOLO: Any-depth object detection with a single network

Woochul Kang, Hyungseop Lee, Jiho Lee

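AI summary This paper proposes AnyDepth-DETR/-YOLO, an any-depth object detection framework in which a single network trades accuracy against efficiency continuously by controlling depth at inference time, without retraining. It splits the backbone and neck into an essential path that always executes and a skippable refinement path, which preserves the multi-scale feature hierarchy at every depth configuration. Self-distillation between the deepest and shallowest networks, with prediction-level and feature-level alignment losses, keeps the outputs of each stage compatible. On RT-DETR and YOLOv12, the method matches or exceeds the best existing models, and its most efficient configuration is up to 1.82x faster at a cost of only 2.0 AP.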

Comments 16 pages, 5 figures, 9 tables

Abstract

Modern object detectors are static, fixed-depth networks optimized for a single operating point, requiring separate models for different deployment scenarios. We present an any-depth detection framework that enables a single network to span a continuous range of accuracy-efficiency trade-offs by controlling depth at inference time without retraining. Each backbone and neck stage is divided into an essential path, which always executes, and a skippable refinement path; this decomposition preserves the full multi-scale feature hierarchy at every depth configuration, unlike conventional early exiting that discards entire stages. To train such a network, jointly optimizing many sub-networks of varying depth introduces conflicting gradient signals. We address this via self-distillation between only the two extremes, with prediction-level and feature-level alignment losses that enforce stage-wise modularity, ensuring the outputs of each stage remain compatible regardless of the paths taken. Instantiated on RT-DETR and YOLOv12, our full-depth configurations match or surpass their respective SOTA baselines with negligible parameter overhead, while the most efficient configurations achieve up to 1.82x speedup at a cost of only 2.0 AP, all from a single set of weights.
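The essential-path and skippable-refinement-path decomposition can be sketched with plain Python callables. This is only an illustration of the control flow: the residual form of the refinement, the class and function names, and the toy "layers" are assumptions, not the paper's architecture.

```python
class AnyDepthStage:
    """One stage: an essential path that always runs plus an optional refinement path."""

    def __init__(self, essential, refinement):
        self.essential = essential
        self.refinement = refinement

    def __call__(self, x, use_refinement=True):
        y = self.essential(x)
        if use_refinement:
            # Assumed residual refinement, so skipping it keeps the output shape unchanged.
            y = [a + b for a, b in zip(y, self.refinement(y))]
        return y

def run_network(stages, x, depth_mask):
    """Run every stage; depth_mask[i] says whether stage i executes its refinement path."""
    for stage, keep in zip(stages, depth_mask):
        x = stage(x, use_refinement=keep)
    return x

# Toy stand-ins for real layers: the essential path doubles, the refinement adds 1.
stages = [AnyDepthStage(lambda v: [2 * a for a in v], lambda v: [1.0 for _ in v])
          for _ in range(2)]

full = run_network(stages, [1.0], [True, True])
shallow = run_network(stages, [1.0], [False, False])
mixed = run_network(stages, [1.0], [True, False])
```

Every depth mask still passes through all stages, which is the difference from early exiting that the abstract points out: no stage, and therefore no feature scale, is dropped.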

2605.09404 2026-05-12 cs.LG cs.CL cs.CV

Let the Target Select for Itself: Data Selection via Target-Aligned Paths

Huitao Yang, Hengzhi He, Guang Cheng

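AI summary This study addresses targeted data selection and proposes a new reference path to reduce the bias that conventional methods can show on heterogeneous data pools. A short warmup on the target validation set produces a validation-induced reference path, and candidates are scored by their endpoint loss drop along this path, giving a selection rule that needs no gradients or Hessian approximations. Across several experiments the method is competitive with dynamic attribution methods while substantially reducing warmup and storage cost, and the same warmup can be reused across different data pools.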

Abstract

Targeted data selection aims to identify training samples from a large candidate pool that improve performance on a specific downstream task. Many recent methods estimate candidate utility by aggregating local attribution scores along a trajectory induced by the candidate pool. When the pool is heterogeneous, however, this reference trajectory may be misaligned with the dynamics of a target-aligned selected subset, creating what we call reference path bias. We propose an alternative reference path: a validation-induced flow obtained from a short, capacity-limited warmup on the available target validation proxy. Along this path, candidates are scored by a normalized endpoint loss drop, yielding a simple zero-order selection rule that requires no candidate gradients or Hessian approximations. Across controlled logistic, vision, and instruction-tuning experiments, this score is competitive with strong dynamic attribution baselines while substantially reducing warmup and storage cost. Moreover, since the reference trajectory is decoupled from any specific candidate pool, the same compact warmup can be reused across additional pools without recomputing the trajectory.
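A zero-order score of the kind described here needs only two loss evaluations per candidate: one at the start of the warmup path and one at its end. The sketch below assumes the normalization is division by the starting loss. The abstract does not define the normalization, and the function names and example losses are hypothetical.

```python
def endpoint_loss_drop(loss_start, loss_end, eps=1e-8):
    """Relative loss drop of one candidate between the start and end of the warmup path."""
    return (loss_start - loss_end) / (loss_start + eps)

def select_top(candidates, budget):
    """candidates: list of (name, loss_at_start, loss_at_end); returns the best `budget` names."""
    scored = [(endpoint_loss_drop(l0, l1), name) for name, l0, l1 in candidates]
    scored.sort(reverse=True)
    return [name for _, name in scored[:budget]]

# Hypothetical candidates: "a" benefits most from the target-aligned warmup, "c" gets worse.
pool = [("a", 2.0, 1.0), ("b", 2.0, 1.9), ("c", 1.0, 1.5)]
chosen = select_top(pool, 2)
```

No candidate gradients are computed, and the two checkpoints do not depend on the pool, which is why the same warmup could be reused to score a different pool.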

2605.09400 2026-05-12 cs.LG

D2ACE: Multi-Label Batch Selection Guided by Dual Dynamics and Adaptive Correlation Enhancement

Bin Liu, Haoyu Peng, Zhijia Wei, Jiajing Zhang, Grigorios Tsoumakas

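AI summary In deep multi-label classification, batch selection is important for both training efficiency and predictive performance. Existing methods usually assess instance importance with a single metric and use static label weights, ignoring how metric utility and label importance change during training. To address this, the paper proposes D2ACE, which combines dual dynamics with adaptive correlation enhancement: stage-wise Bernoulli mixture sampling and dynamic label weighting adjust label priorities over time, and a local context-aware correlation enhancement focuses on relevant labels. Experiments show stronger predictive performance and more efficient label correlation modeling across multiple models and datasets.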

Comments 18 pages

Abstract

Batch selection is crucial for improving both training efficiency and predictive performance in deep multi-label classification (MLC). Existing batch selection methods typically rely on a single metric to assess instance importance and use static label weights to distinguish label significance, neglecting the dynamic evolution of metric utility and label significance during training. In addition, methods that explicitly exploit label correlations are strongly affected by abundant irrelevant labels and are insensitive to local label distributions. To address these issues, we propose D2ACE, a novel multi-label batch selection method guided by Dual Dynamics and Adaptive Correlation Enhancement. D2ACE explicitly captures metric and label-level training dynamics by combining stage-wise Bernoulli mixture sampling, which balances uncertainty and noise-resistant hardness, with dynamic label weighting to recalibrate label priorities at each epoch based on current metric statistics. Furthermore, D2ACE introduces a local context-aware correlation enhancement to focus on relevant labels with instance-adaptive dependencies. Extensive experiments on tabular and image benchmarks demonstrate that D2ACE outperforms existing batch selection approaches across various deep MLC models, achieving stronger predictive performance and more efficient correlation modeling.

2605.09392 2026-05-12 cs.CV

HyNeuralMap: Hyperbolic Mapping of Visual Semantics to Neural Hierarchies

Zihan Ma, Tian Xia, Kexin Wang, Xiao Li, Xiaowei He, Yudan Ren

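AI summary This paper proposes HyNeuralMap, a framework that maps visual semantics into a cross-subject neural hierarchy to better understand the complex mapping between visual stimuli and neural responses. It uses the hyperbolic Lorentz model, with the negative curvature of hyperbolic space as an inductive bias, to capture the hierarchical structure of visual semantics and cross-subject neural similarity. Experiments show that HyNeuralMap outperforms existing Euclidean methods in multi-label semantic prediction and cross-modal retrieval, supporting the use of hyperbolic geometry for cross-modal semantic alignment and hierarchical modeling.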

Comments 14 pages, 4 figures

Abstract

Understanding the intricate mappings between visual stimuli and neural responses is a fundamental challenge in cognitive neuroscience. While current approaches predominantly align images and functional magnetic resonance imaging (fMRI) responses in Euclidean space, this geometry often struggles to preserve fine-grained semantic relationships and latent hierarchical structures across visual and neural modalities. To overcome this, we propose HyNeuralMap, a framework that employs the hyperbolic Lorentz model to map visual semantics into a shared, cross-subject neural hierarchy. By leveraging the negative curvature of hyperbolic space as an inductive bias, the proposed framework better captures hierarchical semantic organization and cross-subject neural similarities. Specifically, visual and neural embeddings are jointly optimized through hyperbolic geometric alignment, where geodesic distances preserve semantic proximity and hierarchical relationships more effectively than Euclidean embeddings. Experiments demonstrate that HyNeuralMap consistently outperforms state-of-the-art Euclidean baselines in both multi-label semantic prediction and cross-modal retrieval tasks. This confirms hyperbolic geometry's superiority for cross-modal semantic alignment and hierarchical modeling, providing a new avenue for vision-neural representation learning.
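For readers unfamiliar with the Lorentz (hyperboloid) model, the geodesic distance the abstract refers to has a closed form: d(x, y) = arccosh(-<x, y>_L), where <x, y>_L is the Lorentzian inner product. The sketch below fixes curvature at -1 and uses hypothetical helper names. It shows only the standard geometry, not HyNeuralMap's training objective.

```python
import math

def lift_to_hyperboloid(v):
    """Map a Euclidean vector v to the hyperboloid {x : <x, x>_L = -1, x0 > 0}."""
    x0 = math.sqrt(1.0 + sum(a * a for a in v))
    return [x0] + list(v)

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum of the remaining coordinate products."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lorentz_distance(x, y):
    """Geodesic distance at curvature -1; the clamp guards against rounding below 1."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

origin = lift_to_hyperboloid([0.0, 0.0])
point = lift_to_hyperboloid([math.sinh(1.0), 0.0])  # lies at hyperbolic distance 1 from the origin
d = lorentz_distance(origin, point)
```

Distances grow roughly exponentially with the Euclidean norm of `v`, which is why tree-like hierarchies embed with low distortion in this space.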

2605.09387 2026-05-12 cs.AI cs.RO

NEXUS: Continual Learning of Symbolic Constraints for Safe and Robust Embodied Planning

Tiehan Cui, Peipei Liu, Yanxu Mao, Congying Liu, Mingzhe Xing, Datao You

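AI summary This paper proposes NEXUS, a modular framework for learning symbolic constraints during the continual learning of embodied agents. By decoupling physical feasibility from safety specifications and combining closed-loop execution feedback with probabilistic risk assessment, the framework enables rigorous verification of safety instructions and risk avoidance. Experiments show that NEXUS performs strongly in task success rate, refusal of unsafe instructions, and defense against adversarial attacks, and that it progressively improves planning efficiency through knowledge accumulation.

Details
English abstract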

While Large Language Models (LLMs) have catalyzed progress in embodied intelligence, a fundamental gap remains between their inherent probabilistic uncertainty and the strict determinism and verifiable safety required in the physical world. To mitigate this gap, this paper introduces NEXUS, a modular framework designed for continual learning in embodied agents. Unlike prior works that treat symbolic artifacts merely as static interfaces, NEXUS leverages them for symbolic grounding and knowledge evolution. The framework explicitly decouples physical feasibility from safety specifications: agent capability is improved through closed-loop execution feedback, while probabilistic risk assessments are grounded into deterministic hard constraints to establish a rigorous pre-action defense. Experiments on SafeAgentBench demonstrate that NEXUS achieves superior task success rates while effectively refusing unsafe instructions, exhibiting robust defense against adversarial attacks, and progressively improving planning efficiency through knowledge accumulation.

2605.09384 2026-05-12 cs.CV cs.AI q-bio.QM

LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering

Runze Ma, Shunbo Jia, Haonan Lyu, Guo Liu, Caizhi Liao

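AI summary This paper proposes LiteMedCoT-VL, a parameter-efficient adaptation method aimed at improving the reasoning ability of medical visual question answering (VQA) models on resource-constrained devices. Through LoRA-based fine-tuning, the method transfers the chain-of-thought reasoning ability of a large teacher model to small student models without relying on image captions, which is closer to real clinical scenarios. Experiments show that LiteMedCoT-VL reaches 64.9% accuracy on the PMC-VQA benchmark, clearly outperforming existing baselines and indicating that small models with reasoning distillation can match or even exceed larger models.

Comments 17 pages, 5 figures

Details
English abstract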

The reasoning gap between large and compact vision-language models (VLMs) limits the deployment of medical AI on portable clinical devices. Compact VLMs of 2-4B parameters can run on resource-constrained hardware but lack the multi-step reasoning capacity needed for interpretable clinical decision support. Existing knowledge distillation methods transfer answers without the reasoning process behind them. Medical visual question answering (VQA) serves as a testbed for this problem, as it requires models to integrate visual evidence with clinical knowledge through structured reasoning chains. We introduce LiteMedCoT-VL, a pipeline that transfers chain-of-thought reasoning from a 235B teacher model to 2B student models through LoRA-based fine-tuning on explanation-enriched training data. All inference is conducted without image captions by default, simulating the clinical scenario in which a physician interprets a medical image directly without an accompanying radiology report. On the PMC-VQA benchmark, LiteMedCoT-VL achieves 64.9% accuracy, exceeding the zero-shot Qwen3-VL-4B baseline of 53.9% by 11.0 percentage points and outperforming all published baselines. This result indicates that a 2B model with reasoning distillation can match or exceed models with twice the parameters. Visual grounding analysis shows that the model relies on image content rather than exploiting textual priors. Our code is publicly available at https://anonymous.4open.science/r/LiteMedCoT-VL.
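
As a rough illustration of the LoRA-based fine-tuning mentioned in the abstract, the sketch below shows the standard low-rank update: a frozen weight matrix W plus a scaled product of two small trainable matrices B and A. The matrices, rank, and scaling value are hypothetical and are not taken from the paper, which does not state its LoRA configuration here.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha, rank):
    """Compute y = W x + (alpha / rank) * B (A x).

    W (d_out x d_in) stays frozen; only A (rank x d_in) and
    B (d_out x rank) would be trained.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

def lora_trainable_params(d_in, d_out, rank):
    """Number of trainable parameters in one LoRA adapter."""
    return rank * (d_in + d_out)

# Toy example with hypothetical values: a 2x2 frozen layer and a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[1.0], [2.0]]
print(lora_forward(W, A, B, [1.0, 2.0], alpha=2.0, rank=1))

# Adapter size versus a full 4096 x 4096 weight matrix (illustrative dimensions).
print(lora_trainable_params(4096, 4096, 8), 4096 * 4096)
```

With B initialized to zeros, as is common practice, the adapted layer starts out identical to the frozen one, so fine-tuning begins from the pretrained behavior.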

2605.09383 2026-05-12 cs.RO

Safety-Critical LiDAR-Inertial Odometry with On-Manifold Deterministic Protection Level

Yueqi Zhu, Yan Pan, Chufan Rui, Jiasheng Luo, Shihua Li, Bo Zhou

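AI summary In safety-critical scenarios, the protection level of an autonomous navigation system is crucial for mobile robots to carry out tasks safely. This paper proposes a safety-critical LiDAR-inertial odometry (LIO) method based on on-manifold deterministic state estimation. It derives, for the iterative closest point algorithm, an explicit relationship between point cloud noise and estimation uncertainty, and designs an on-manifold ellipsoidal set-membership filter, which yields deterministic protection levels that serve as safety references for the robot's downstream autonomous operations. Experiments show that the method provides effective online safety assurance for a variety of robots in different environments.

Details
English abstract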

In safety-critical scenarios, the protection level of the autonomous navigation system is crucial for enabling mobile robots to perform safe tasks. However, existing studies on probabilistic navigation systems for robots usually perform offline accuracy evaluations using limited datasets and assume that the results can be applied to unknown real-world environments. As a result, current autonomous mobile robots often lack protection levels for online safety assessment. To fill this gap, we propose a safety-critical LiDAR-inertial odometry (LIO) that provides deterministic protection levels based on on-manifold deterministic state estimation. By adopting the unknown but bounded assumption, we derive a neat closed-form relationship between point cloud noise and the uncertainty of the estimation from the iterated closest point algorithm. Using this relationship, we design an on-manifold ellipsoidal set-membership filter and implement it within the LIO system. Leveraging the properties of the set-membership filter, our system offers the feasible sets of the estimated locations as the deterministic protection levels, serving as safety references for the robots' downstream autonomous operations. The experimental results show that our system can provide effective deterministic online safety references for diverse robots in various environments.

2605.09378 2026-05-12 cs.CV cs.AI cs.CL

EduStory: A Unified Framework for Pedagogically-Consistent Multi-Shot STEM Instructional Video Generation

Xinyi Wu, Jayant Teotia, Shuai Zhao, Erik Cambria

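AI summary EduStory is a unified framework for generating multi-shot STEM instructional videos that follow pedagogical logic. By integrating pedagogical state modeling, script-guided structured control, and learning-oriented evaluation metrics, the method improves the knowledge consistency and pedagogical narrative coherence of the generated videos. The work also introduces the EduVideoBench benchmark, which supports multi-granularity analysis and evaluation of generated videos. Experiments show that the framework has clear advantages in preserving instructional intent and knowledge accuracy.

Details
English abstract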

Long-horizon video generation has advanced in visual quality, yet existing methods still struggle to maintain knowledge consistency and coherent pedagogical narratives across multi-shot instructional videos, especially in STEM domains. To address these challenges, we propose EduStory, a unified framework for reliable instructional video generation. EduStory integrates pedagogical state modeling to track persistent knowledge states, script-guided structured control to organize multi-shot narratives, and learning-oriented evaluation metrics to assess knowledge fidelity and constraint satisfaction. To support rigorous evaluation, we further introduce EduVideoBench, a diagnostic benchmark with multi-granularity annotations, including pedagogical storyboards, shot-level semantics, and knowledge state transitions, together with baseline tasks for controllable instructional video generation. Extensive experiments demonstrate that domain-aware state modeling and structured control substantially reduce narrative breakdown and improve alignment with instructional intent. These results highlight the significance of domain-specific structural constraints and tailored benchmarks for advancing reliable, controllable, and also trustworthy long-horizon video generation.

2605.09376 2026-05-12 cs.RO

Mismatch-Aware Adaptive Constraint Tightening for Bicycle-Model Trajectory Optimization

Lingxue Lyu, Zihui Liu

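AI summary This paper addresses safety-constraint violations in autonomous-vehicle trajectory optimization caused by mismatch between the planning model and the actual dynamics, and proposes an adaptive constraint tightening method based on the characteristics of that mismatch. The theoretical analysis yields a characteristic speed and a deviation that scales with the square of the horizon, and derives an analytical coefficient that depends only on vehicle parameters and the planning horizon, from which a state-dependent constraint tightening formula is built. Experiments show that the method substantially reduces redundant safety margin while preserving safety, applies to several vehicle models, and performs well in closed-loop MPC.

Details
English abstract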

Trajectory optimization for autonomous vehicles usually relies on the kinematic bicycle model because of its computational simplicity. However, when the planned trajectory is executed under the true vehicle dynamics, which include lateral slip, tire stiffness and yaw-lateral coupling, safety constraints can be violated owing to the model mismatch. In this paper, we make three theoretical contributions. First, we derive a characteristic speed $v_c=\sqrt{C_\alpha L/M}$ which separates two different mismatch regimes: below $v_c$ the dynamic bicycle initially oversteers inward (safe); above $v_c$ it understeers outward (safety-critical). Second, we prove that the peak outward deviation $\varepsilon^*$ follows a $T^2$ horizon scaling whose coefficient transitions between a transient bound $\frac{1}{2}(v^2-v_c^2)\kappa$ and a steady-state bound. Third, we obtain a simulation-free analytical coefficient $a_2^{\mathrm{anal}}=\frac{1}{2}(1-v_c^2/v_{\max}^2)T^2$ that is computable from vehicle parameters and the planning horizon alone. Putting these together, we propose Mismatch-Aware Adaptive Constraint Tightening (MACT), $\varepsilon(v,\kappa)=a_2 v^2|\kappa|$, which replaces a fixed worst-case margin by a state-dependent one that is large at high speed/curvature but nearly zero on gentle paths. Eight numerical experiments confirm the scaling laws. MACT reaches 100% safety with 84% less wasted margin than a fixed-margin baseline on the 2-DOF vehicle, extends to a nonlinear leaning bicycle, and in a closed-loop direct-shooting MPC comparison it cuts the applied margin by 34% compared with tube MPC while keeping the same safety.
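
The three formulas quoted in the abstract can be evaluated directly, as in the minimal Python sketch below. The abstract does not define its symbols, so reading C_alpha as a cornering stiffness, L as the wheelbase, and M as the vehicle mass is an assumption, and the numeric vehicle parameters are hypothetical values chosen only to give round numbers.

```python
import math

def characteristic_speed(c_alpha, wheelbase, mass):
    """v_c = sqrt(C_alpha * L / M), as quoted in the abstract."""
    return math.sqrt(c_alpha * wheelbase / mass)

def analytical_coefficient(v_c, v_max, horizon):
    """a_2^anal = 0.5 * (1 - v_c^2 / v_max^2) * T^2."""
    return 0.5 * (1.0 - v_c**2 / v_max**2) * horizon**2

def mact_margin(a2, v, kappa):
    """State-dependent tightening eps(v, kappa) = a_2 * v^2 * |kappa|."""
    return a2 * v**2 * abs(kappa)

# Hypothetical parameters (assumed units: N/rad, m, kg, m/s, s).
v_c = characteristic_speed(c_alpha=80000.0, wheelbase=2.7, mass=1500.0)
a2 = analytical_coefficient(v_c, v_max=30.0, horizon=1.0)

print(v_c)                          # about 12 m/s for these values
print(mact_margin(a2, 20.0, 0.01))  # margin on a curved segment
print(mact_margin(a2, 20.0, 0.0))   # zero margin on a straight segment
```

The margin vanishes on straight segments and grows with speed squared and curvature, which is the behavior the abstract contrasts with a fixed worst-case margin.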

2605.09369 2026-05-12 cs.AI

Explainable Knowledge Tracing via Probabilistic Embeddings and Pattern-based Reasoning

Siyu Wu, Cong Xu, Wei Zhang

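AI summary This paper proposes PLKT, an interpretable knowledge tracing model that addresses the lack of interpretability of conventional deep learning models when predicting students' knowledge states. PLKT uses probabilistic embeddings and pattern-based reasoning: it represents knowledge states as Beta-distributed random variables and builds transparent reasoning paths through explicit logical operations, revealing how past learning behaviors influence the prediction. Experiments show that PLKT markedly improves interpretability while maintaining high predictive performance.

Details
English abstract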

Knowledge Tracing (KT) models students' knowledge states based on learning interactions to predict performance. While deep learning-based KT models have boosted predictive accuracy, most models rely on deterministic vector embeddings and opaque latent state transitions, limiting interpretability regarding how specific past behaviors influence predictions. To address this limitation, we propose Probabilistic Logical Knowledge Tracing (PLKT), an interpretable KT framework that formulates prediction as a goal-conditioned evidence reasoning process over historical learning behaviors. Instead of representing knowledge states as deterministic vector embeddings, PLKT employs robust Beta-distributed probabilistic embeddings to represent student knowledge states. This probabilistic foundation allows us to model the uncertainty of historical behaviors and perform explicit logical operations (e.g., conjunction), constructing transparent reasoning paths that reveal how specific past interactions contribute to the prediction. Extensive experiments show that PLKT outperforms state-of-the-art KT methods while achieving superior interpretability. Our code is available at https://anonymous.4open.science/r/PLKT-D3CE/.

2605.09365 2026-05-12 cs.AI cs.CL

Position: Avoid Overstretching LLMs for every Enterprise Task

Kuldeep Singh, Anson Bastos, Isaiah Onando Mulang'

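AI summary This paper examines the inefficiency and reliability problems that can arise from over-reliance on large language models (LLMs) for enterprise tasks, noting that such tasks are typically deterministic, structured, and knowledge-dependent and are subject to strict cost, latency, and reliability requirements. The authors argue that language models should serve as interfaces rather than monolithic engines, with knowledge and computation separated into dedicated components to improve reliability, scalability, and transparency. The paper shows theoretically that finite-capacity models cannot fully cover the breadth of knowledge that enterprise tasks require, and proposes using language models mainly for structured information extraction while delegating computation and storage to knowledge bases and symbolic procedures, yielding a more reliable and sustainable enterprise AI architecture.

Details
English abstract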

Enterprise workloads are dominated by deterministic, structured, and knowledge-dependent tasks operating under strict cost, latency, and reliability constraints. While these are often addressed through large language model (LLM) deployment or distillation into smaller models, we argue this is inefficient, unreliable, and misaligned with enterprise task structures. Instead, AI systems should treat language models as interfaces rather than monolithic engines, externalizing knowledge and computation into dedicated components for greater reliability, scalability, and transparency. Our theoretical evidence shows that finite-capacity models cannot fully capture the breadth of knowledge required for enterprise tasks, creating inherent limits to efficiency and interpretability. Building on this, we take the position that language models should primarily be used for structured extraction in deterministic enterprise workflows, while computation and storage are delegated to knowledge bases and symbolic procedures. We formally demonstrate that such modular architectures are more reliable and maintainable than monolithic frameworks, offering a sustainable foundation for enterprise tasks.

2605.09364 2026-05-12 cs.LG

Multi-scale Predictive Representations for Goal-conditioned Reinforcement Learning

Valliappan Chidambaram Adaikkappan, David Meger, Sai Rajeswar, Pietro Mazzaglia

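AI summary This paper studies robust representation learning in offline goal-conditioned reinforcement learning (GCRL), in particular the challenge of learning representations that align state and goal latents in sparse-reward settings. To counter representation drift, the authors propose Ms.PR, a framework based on multi-scale predictive supervision that lets the agent understand its environment at multiple scales, from local physical dynamics to long-horizon goal structure, and thereby achieves goal-directed alignment in the latent space. Experiments show that Ms.PR delivers strong representation quality and performance on both vision-based and state-based tasks and is highly robust under a range of challenging data conditions.

Details
English abstract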

This paper investigates robust representation learning in offline goal-conditioned reinforcement learning (GCRL). Particularly in sparse reward scenarios, learning representations that align state and goal latents is a challenge that frequently culminates in representation divergence where the encoder drifts toward a low-dimensional, goal-agnostic subspace that destabilizes policy learning. We address this issue by showing that an agent must acquire a fundamental understanding of its environment across multiple scales, from local physical dynamics to long-horizon goal-directed structure. Building on this insight, we propose Ms.PR, a framework that leverages multi-scale predictive supervision to enforce goal-directed alignment within the latent space. We demonstrate that Ms.PR leads to improved representation quality and strong performance on both vision and state-based tasks. Furthermore, we show that our approach is exceptionally resilient under realistic, challenging data regimes, maintaining state-of-the-art performance across a wide variety of tasks, trajectory stitching scenarios, and extreme noise conditions.

2605.09363 2026-05-12 cs.LG

Near-Optimal Last-Iterate Convergence for Zero-Sum Games with Bandit Feedback and Opponent Actions

Soumita Hait, Ping Li, Haipeng Luo, Mengxiao Zhang

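AI summary This paper studies last-iterate convergence of learning dynamics in zero-sum games where players observe only their own loss and the opponent's action. The authors propose an efficient algorithm that updates its strategy infrequently and solves an estimated log-barrier-regularized game, achieving a last-iterate convergence rate of $t^{-1/2}$ with high probability. The work overcomes the limitations of standard multi-armed bandit analysis in the game setting. Experiments show that the algorithm converges faster than existing methods, and the results also improve on known results for dueling bandits, a special case.

Details
English abstract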

Last-iterate convergence of learning dynamics in games has attracted significant recent attention. In two-player zero-sum games with bandit feedback, where only the loss of the selected action pair is observed, Fiegel et al. (2025) show a separation between average-iterate and last-iterate convergence in duality gap: while the optimal $t^{-1/2}$ rate after $t$ rounds is achievable for the former via standard no-regret algorithms, the latter cannot converge faster than $t^{-1/3}$ in expectation or $t^{-1/4}$ with high probability. However, in many practical settings, such as preference learning, the players observe not only their loss but also the opponent's action. This raises a natural question: can such additional information enable faster last-iterate convergence? We answer this question affirmatively, showing that $t^{-1/2}$ last-iterate convergence is achievable with high probability in this setting, via an efficient algorithm that updates its strategy infrequently by solving an estimated log-barrier-regularized game. We identify fundamental obstacles preventing standard analysis for multi-armed bandits, the single-player case, from generalizing to games, and develop a novel analysis to overcome them. Experiments confirm that our algorithm indeed converges faster than naive baselines and prior methods that do not exploit opponent-action feedback. Finally, we note that our results also improve those for dueling bandits, a special case with skew-symmetric game matrices.
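
The duality gap that these convergence rates refer to can be computed directly for a matrix game. The Python sketch below uses the standard definition, assuming a loss matrix A in which the row player minimizes and the column player maximizes the expected loss. The function name and the matching-pennies example are illustrative and are not taken from the paper.

```python
def duality_gap(A, x, y):
    """Duality gap of the mixed-strategy pair (x, y) in a zero-sum matrix game.

    A[i][j] is the row player's loss when row i meets column j.
    The gap is max_j (x^T A)_j - min_i (A y)_i; it is nonnegative and
    equals zero exactly at a Nash equilibrium.
    """
    n_rows, n_cols = len(A), len(A[0])
    # Expected loss of x against each pure column action.
    loss_vs_col = [sum(x[i] * A[i][j] for i in range(n_rows)) for j in range(n_cols)]
    # Expected loss of each pure row action against y.
    loss_of_row = [sum(A[i][j] * y[j] for j in range(n_cols)) for i in range(n_rows)]
    return max(loss_vs_col) - min(loss_of_row)

# Matching pennies: the uniform strategies form the unique equilibrium.
A = [[1, -1], [-1, 1]]
print(duality_gap(A, [0.5, 0.5], [0.5, 0.5]))  # → 0.0
print(duality_gap(A, [1, 0], [1, 0]))          # → 2
```

Last-iterate convergence means the gap of the current strategy pair itself shrinks over rounds, rather than the gap of the time-averaged strategies.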

2605.09360 2026-05-12 cs.LG cs.AI cs.CL cs.SE

Your Simulation Runs but Solves the Wrong Physics: PDE-Grounded Intent Verification for LLM-Generated Multiphysics Simulation Code

Zhenghan Song, Yulong Liu, Cheng Wan, Chenjun Li, Lingfu Liu, Yunyi Li, Congcong Yuan

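AI summary This paper studies the mismatch between LLM-generated multiphysics simulation code and user intent and proposes a partial differential equation (PDE)-grounded intent verification method. By defining the Intent Fidelity Score (IFS) and designing a PDE-grounded refinement loop, the method detects and corrects key parts of the generated code, such as governing equations and boundary conditions, that do not match the user's intent. Experiments show that the method clearly improves intent consistency on multiple benchmarks and indicate that executability and physical correctness should be treated as two separate verification dimensions.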

Comments Preprint

Details
English abstract

Execution-based evaluation of LLM-generated code implicitly treats successful execution as a proxy for correctness. In scientific simulation, this proxy is insufficient: a generated input file can run, mesh, and converge while encoding governing equations that differ from the user's intent. We call this mismatch between intended physics and generated code the comprehension-generation gap. We instantiate this in MOOSE, where Kernel and BC objects map compositionally to weak-form residual terms, enabling deterministic reconstruction of the encoded PDE and comparison against an intended contract. We formalize this comparison as the Intent Fidelity Score (IFS), a structural metric covering governing terms, BCs, ICs, coefficients, and time scheme. Building on IFS, we develop a PDE-grounded refinement loop that uses deterministic violation reports to correct generated code iteratively. We evaluate on MooseBench, a 220-case multiphysics benchmark with PDE-level ground truth released with this work. On this benchmark, our method consistently improves mean IFS over direct generation, with gains concentrated on hard cases. On the subset where direct generation falls below IFS 0.7, refinement adds +0.22 to +0.41 absolute IFS. In the deployment audit, execution-only repair improves execution success while leaving 39-40% of all 220 cases runnable but still solving the wrong physics across the three main deployment-audit models, exposing executability and intent fidelity as separable failure modes. Static proof-of-concept experiments on four PDE-oriented DSLs (UFL/FEniCS, FreeFEM, FiPy, and Devito) suggest that the reconstruction-and-comparison pattern extends beyond MOOSE. These findings reinforce that executable simulation code should be verified against the mathematical structure it is intended to encode, not accepted on execution alone.

2605.09359 2026-05-12 cs.LG cs.AI

Skill-R1: Agent Skill Evolution via Reinforcement Learning

Yash Vishe, Rohan Surana, Xunyi Jiang, Zihan Huang, Xintong Li, Nikki Lijing Kuang, Tong Yu, Ryan A. Rossi, Jingbo Shang, Julian McAuley, Junda Wu

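AI summary This study proposes Skill-R1, a reinforcement learning framework for instance-level recurrent skill optimization from verifiable rewards. Unlike conventional approaches that rely on prompt engineering or on aligning the task model itself, Skill-R1 trains a lightweight skill generator that produces skills to guide a frozen task model based on the task context, prior rollouts, and their verified outcomes, enabling low-cost adaptation that is compatible with both open- and closed-source models. By introducing a bi-level group-relative policy optimization objective, Skill-R1 achieves directional skill evolution. Experiments show that it outperforms no-skill baselines and standard GRPO on multiple benchmarks, especially on complex multi-step tasks.

Details
English abstract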

Agentic large language models often rely on skills, reusable natural language procedures that guide planning, action, and tool use. In practice, skills are typically improved through prompt engineering or by aligning the task LLM itself, which is costly, model-specific, and often infeasible for closed-source models. Skill optimization is not a one-step problem but a recurrent process with two coupled levels of credit assignment: a useful skill must improve rollout quality under current conditioning, while a useful revision must turn observed outcomes into a better skill for the next round. We propose Skill-R1, a reinforcement learning framework for instance-level recurrent skill optimization from verifiable rewards. Rather than updating the task LLM, Skill-R1 trains a lightweight skill generator that conditions on the task context, prior rollouts, and their verified outcomes to produce skills that steer a frozen task LLM. This preserves black-box compatibility with both open- and closed-source models while making adaptation substantially cheaper than model-level updates. Skill-R1 proceeds over multiple generations: at each step, the current skill induces rollouts whose verified outcomes are fed back to produce the next revision. To optimize this recurrent process, we introduce a bi-level group-relative policy optimization objective combining intra-generation and inter-generation advantages. The intra-generation term compares rollouts under shared skill conditioning, while the inter-generation term rewards revisions that improve behavior across successive generations. Together, these provide a principled objective for directional skill evolution rather than one-shot self-refinement. Empirically, Skill-R1 achieves consistent gains over no-skill baselines and standard GRPO across benchmarks with verifiable rewards, with particularly strong improvements on complex, multi-step tasks.
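
The two advantage terms described above can be sketched in Python under loud assumptions. The intra-generation part uses the usual group-relative normalization from GRPO (reward minus group mean, divided by group standard deviation). The inter-generation part is purely hypothetical: the abstract does not give its form, so the sketch simply uses the change in mean verified reward between successive generations. All function names and reward values are illustrative.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization of rewards within one group of rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def inter_generation_gains(generation_rewards):
    """Hypothetical inter-generation signal: change in mean reward per revision.

    The first generation has no predecessor, so its gain is set to 0.0.
    """
    means = [statistics.fmean(g) for g in generation_rewards]
    return [0.0] + [means[t] - means[t - 1] for t in range(1, len(means))]

# Verified 0/1 rewards for rollouts under two successive skill revisions.
generations = [[0.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 1.0]]

for rewards in generations:
    print(group_relative_advantages(rewards))  # compares rollouts under a shared skill
print(inter_generation_gains(generations))     # credits the revision that raised the mean
```

How the two terms are actually weighted and combined in the bi-level objective is not specified in the abstract, so the sketch keeps them separate.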

2605.09356 2026-05-12 cs.LG cs.NI

Function-Space ADMM for Decentralized Federated Learning: A Control Theoretic Perspective

Akihito Taya, Yuuki Nishiyama, Kaoru Sezaki

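AI summary Taking a control-theoretic perspective, this paper proposes FedF-ADMM, a function-space decentralized federated learning algorithm that addresses the non-IID data problem encountered when training machine learning models on networks of edge devices without a central server. The method exploits the convexity of loss functionals in function space to derive ADMM-based update directions and projects them onto the parameter space via knowledge distillation, improving the convergence and robustness of training. Experiments show that under severe non-IID conditions FedF-ADMM converges faster, attains higher accuracy, and achieves better consensus among devices.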

Comments (c) 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Details
Journal ref
IEEE Internet of Things Journal, 2026
English abstract

Decentralized federated learning (FL) is a promising approach for training machine learning models on sensor networks, Internet of Things (IoT) devices, and other edge systems where no central server exists. While federated learning offers advantages such as preserving data privacy, it often suffers from non-independent and identically distributed (non-IID) data across devices, which causes significant performance degradation. This issue is particularly severe when directly optimizing model parameters, because neural network training is inherently non-convex and standard convergence guarantees for convex optimization do not apply. Unlike existing decentralized FL methods that primarily operate in parameter space, we propose federated function-space alternating direction method of multipliers (FedF-ADMM). FedF-ADMM exploits the convexity of loss functionals within function space to derive alternating direction method of multipliers (ADMM)-based update directions, which are subsequently projected onto the parameter space via knowledge distillation. We further introduce a stabilization coefficient to enhance robustness under severe non-IID settings and analyze its behavior from a control-theoretic perspective by interpreting it as a proportional-integral (PI) term. Experiments under challenging non-IID scenarios, including settings where each device has data from only a single label, demonstrate that FedF-ADMM achieves faster and more stable convergence than existing decentralized FL methods, while attaining higher accuracy and better consensus among devices.

2605.09355 2026-05-12 cs.LG

FLAME: Adaptive Mixture-of-Experts for Continual Multimodal Multi-Task Learning

Xing Han, Shravan Chaudhari, Tanvi Ranade, Rama Chellappa, Suchi Saria

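AI summary This paper proposes FLAME, an adaptive Mixture-of-Experts framework that supports continual multimodal multi-task learning. The method combines the multi-task pretraining and continual adaptation regimes, uses modality-specific routing to learn from flexible modality combinations, and compresses expert knowledge into low-rank memory subspaces to improve parameter efficiency and mitigate catastrophic forgetting. Experiments show that the method performs strongly on multiple healthcare multimodal benchmarks.

Comments 37 pages, 25 figures, 6 tables

Details
English abstract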

Real-world model deployment across multiple domains requires multimodal models to operate under two complementary regimes: (1) multi-task pretraining, in which tasks are co-available at design time and related tasks can borrow representational strength from one another, and (2) continual adaptation, in which new tasks emerge after deployment with previously unseen modality combinations. However, neither regime alone suffices: the pretraining task set is never exhaustive, while bypassing joint training forfeits the transfer gains and efficiency among co-trainable tasks. Sparse Mixture-of-Experts (MoE) is a natural fit for this dual requirement: sparse activation enables modular capacity expansion as new tasks arrive, while routing decouples modality-level computation from task-level composition. In this work, we propose a scalable MoE framework for multitask pretraining and continual learning across flexible modality combinations. The framework is designed to support training on multimodal tasks with diverse modality configurations by leveraging modality-specific routers that process tokens from each modality across tasks. Furthermore, it enables continual learning over sequential multimodal tasks within a fixed-capacity MoE by compressing accumulated expert knowledge into low-rank memory subspaces, while expanding only the lightweight routers. We validate the effectiveness of our method on multiple healthcare multimodal benchmarks. It demonstrates competitive multitask pretraining performance while alleviating catastrophic forgetting and improving parameter efficiency.