arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.10335 2026-05-12 cs.LG cs.AI cs.CL cs.NA math.NA math.OC

PowerStep: Memory-Efficient Adaptive Optimization via $\ell_p$-Norm Steepest Descent

Yao Lu, Dengdong Fan, Shixun Zhang, Yonghong Tian

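AI summary This paper proposes PowerStep, a memory-efficient adaptive optimization algorithm that addresses the excessive memory overhead of conventional adaptive optimizers such as Adam when training large-scale neural networks. The method applies a nonlinear transform directly to the momentum buffer, achieving coordinate-wise adaptivity without storing second-moment statistics. Experiments show that PowerStep matches Adam's convergence speed while substantially reducing optimizer memory, and that combining it with quantization improves memory efficiency further.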

Abstract

Adaptive optimizers, most notably Adam, have become the default standard for training large-scale neural networks such as Transformers. These methods maintain running estimates of gradient first and second moments, incurring substantial memory overhead. We introduce PowerStep, a memory-efficient optimizer that achieves coordinate-wise adaptivity without storing second-moment statistics. Motivated by steepest descent under an $\ell_p$-norm geometry, we show that applying a nonlinear transform directly to a momentum buffer yields coordinate-wise adaptivity. We prove that PowerStep converges at the optimal $O(1/\sqrt{T})$ rate for non-convex stochastic optimization. Extensive experiments on Transformer models ranging from 124M to 235B parameters demonstrate that PowerStep matches Adam's convergence speed while halving optimizer memory. Furthermore, when combined with aggressive \texttt{int8} quantization, PowerStep remains numerically stable and reduces optimizer memory by $\sim\!8\times$ compared to full-precision Adam. PowerStep thus provides a principled, scalable and resource-efficient alternative for large-scale training. Code is available at https://github.com/yaolubrain/PowerStep.
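
For orientation, the $\ell_p$ steepest-descent direction mentioned in the abstract has a closed form: by Hölder's inequality, the unit-$\ell_p$-norm vector that maximizes the inner product with $g$ is $\mathrm{sign}(g)\,|g|^{q-1} / \|g\|_q^{q-1}$, where $1/p + 1/q = 1$. The sketch below applies that transform to a momentum buffer. It is a minimal illustration under stated assumptions, not PowerStep itself: the function names, the hyperparameters (`lr`, `beta`, `p`), and the normalization are hypothetical, and the paper's actual update rule may differ.

```python
import math


def lp_steepest_direction(g, p):
    """Unit-l_p-norm direction maximizing <g, d>, for 1 < p < infinity."""
    q = p / (p - 1.0)  # Hölder conjugate exponent: 1/p + 1/q = 1
    norm_q = sum(abs(x) ** q for x in g) ** (1.0 / q)
    if norm_q == 0.0:
        return [0.0 for _ in g]
    scale = norm_q ** (q - 1.0)
    # Elementwise nonlinear transform: sign(g_i) * |g_i|^(q-1), then normalize.
    return [math.copysign(abs(x) ** (q - 1.0), x) / scale for x in g]


def momentum_lp_step(params, grads, momentum, lr=0.1, beta=0.9, p=4.0):
    """One hypothetical update: transform the momentum buffer, then descend.

    Only a single buffer (momentum) is stored; there is no second-moment state.
    """
    new_momentum = [beta * m + (1.0 - beta) * g for m, g in zip(momentum, grads)]
    direction = lp_steepest_direction(new_momentum, p)
    new_params = [x - lr * d for x, d in zip(params, direction)]
    return new_params, new_momentum
```

With `p = 2` the direction reduces to the normalized buffer, and as `p` grows the transform approaches a sign-like update, which is one way to see how a single buffer can yield coordinate-wise scaling.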

2605.10334 2026-05-12 cs.CV

The Alpha Blending Hypothesis: Compositing Shortcut in Deepfake Detection

Andrii Yermakov, Jan Cech, Mario Fritz, Jiri Matas

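AI summary Deepfake detection methods have recently improved in cross-dataset generalization, but the mechanisms behind this remain unclear. This paper proposes the "Alpha Blending Hypothesis": current state-of-the-art frame-based detectors actually search for alpha blending artifacts rather than learning semantic anomalies or fingerprints of generative models. The study validates the hypothesis experimentally and proposes BlenD, a detection method built on a dataset of real face images augmented with self-blended images, which achieves the best cross-dataset generalization on multiple compositional deepfake datasets without using explicitly generated deepfake samples during training.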

Abstract

Recent deepfake detection methods demonstrate improved cross-dataset generalization, yet the underlying mechanisms remain underexplored. We introduce the Alpha Blending Hypothesis, positing that state-of-the-art frame-based detectors primarily function as alpha blending searchers; rather than learning semantic anomalies or specific generative neural fingerprints, they localize low-level compositing artifacts introduced during the integration of manipulated faces into target frames. We experimentally validate the hypothesis, demonstrating that deepfake detectors exhibit high sensitivity to the so-called self-blended images (SBI) and non-generative manipulations. We propose the method BlenD that leverages a large-scale, diverse dataset of real-only facial images augmented with SBI. This approach achieves the best average cross-dataset generalization on 15 compositional deepfake datasets released between 2019 and 2025 without utilizing explicitly generated deepfakes during training. Furthermore, we show that predictions from explicit blending searchers and models resilient to blending shortcuts are highly complementary, yielding a state-of-the-art AUROC of 94.0% in an ensemble configuration. The code with experiments and the trained model will be publicly released.
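
The compositing step the hypothesis refers to is ordinary alpha blending, `out = alpha * foreground + (1 - alpha) * background`, applied per pixel. The sketch below shows that formula on small grayscale images stored as nested lists, plus a toy self-blend in which a slightly altered copy of an image is blended back onto the original. The function names and the `gain` alteration are hypothetical stand-ins chosen for illustration; they are not the BlenD pipeline or any specific SBI implementation.

```python
def alpha_blend(foreground, background, alpha):
    """Per-pixel compositing: out = alpha * fg + (1 - alpha) * bg."""
    return [
        [a * f + (1.0 - a) * b for f, b, a in zip(f_row, b_row, a_row)]
        for f_row, b_row, a_row in zip(foreground, background, alpha)
    ]


def self_blend(image, mask, gain=1.1):
    """Blend a brightened copy of an image onto itself inside a soft mask.

    The brightness gain is an assumed placeholder for whatever source
    augmentation a real self-blending pipeline would apply.
    """
    altered = [[min(1.0, gain * px) for px in row] for row in image]
    return alpha_blend(altered, image, mask)
```

Because the output contains no generated content, any detector that flags it must be reacting to the blending boundary rather than to a generative fingerprint.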

2605.10332 2026-05-12 cs.AI

EmbodiSkill: Skill-Aware Reflection for Self-Evolving Embodied Agents

Ruofei Ju, Xinrui Wang, Xin Ding, Yifan Yang, Hao Wu, Shiqi Jiang, Qianxi Zhang, Hao Wen, Xiangyu Li, Weijun Wang, Kun Li, Yunxin Liu, Haipeng Dai, Wei Wang, Ting Cao

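AI summary EmbodiSkill is a training-free framework for skill self-evolution in embodied agents. It targets the problem that, in embodied environments, a task failure may stem from either incorrect skill content or an execution lapse. Through a skill-aware reflection mechanism, the method distinguishes skill errors from execution lapses and applies targeted revision or reinforcement to each. Experiments show that EmbodiSkill effectively improves embodied task success, reaching up to 93.28% task success on ALFWorld and significantly outperforming a large language model used directly without skills.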

Abstract

Embodied agents can benefit from skills that guide object search, action execution, and state changes across diverse environments. Since embodied environments vary across layouts, object states, and other execution factors, these skills must self-evolve from trajectories generated during task execution. However, existing skill self-evolution methods are mainly developed in digital environments and often convert trajectories into coarse skill updates. Directly applying this paradigm to embodied settings is problematic, because a failed task execution may reflect not only incorrect skill content, but also an execution lapse in which the agent fails to follow valid guidance. We propose EmbodiSkill, a training-free framework for embodied skill self-evolution through skill-aware reflection and targeted revision. EmbodiSkill interprets each trajectory with respect to the current skill, uses skill-changing evidence to update the skill body, and uses execution-lapse evidence to preserve and emphasize valid guidance. Experiments on ALFWorld and EmbodiedBench show that EmbodiSkill consistently improves embodied task success. On ALFWorld, EmbodiSkill enables a frozen Qwen3.5-27B executor to reach 93.28% task success, outperforming GPT-5.2 used as a direct agent without skills by 31.58%. These results show that skill-aware self-evolution helps embodied agents accumulate reusable procedural knowledge from their own trajectories.

2605.10319 2026-05-12 cs.CV

LimeCross: Context-Conditioned Layered Image Editing with Structural Consistency

Ryugo Morita, Stanislav Frolov, Brian Bernhard Moser, Ko Watanabe, Riku Takahashi, Issey Sukeda, Andreas Dengel

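AI summary This paper proposes LimeCross, a training-free, context-conditioned layered image editing framework that edits user-selected RGBA layers according to text instructions while keeping unselected layers unchanged. It uses a bi-stream attention mechanism to draw on contextual information from other layers, preserving cross-layer consistency and preventing contamination of the edited layers. The study also introduces the LayerEditBench dataset and evaluation protocols; experiments show that LimeCross outperforms existing methods in layer purity and composite realism, offering a new layered editing paradigm for controllable generative creation.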

Abstract

Layered image assets are widely used in real-world creative workflows, enabling non-destructive iteration and flexible re-composition. Recent advances in layered image generation and decomposition synthesize or recover layered representations, yet controllable editing of layered images remains challenging. Manual editing requires careful coordination across layers to maintain consistent illumination and contact, while AI-based pipelines collapse layers into a flattened image for editing, then decompose them again, introducing background-to-foreground leakage and unstable transparency. To address these limitations, we propose LimeCross, a training-free context-conditioned layered image editing framework that edits user-selected RGBA layers according to text while keeping the remaining layers unchanged. It leverages contextual cues from other layers using a bi-stream attention mechanism to preserve cross-layer consistency, while explicitly maintaining layer integrity to prevent the contamination of edited layers. To evaluate our approach, we introduce LayerEditBench, a benchmark of 1500 layered scenes with paired source/target prompts, along with evaluation protocols that assess both edit fidelity and alpha channel stability. Extensive experiments demonstrate that LimeCross improves layer purity and composite realism over strong editing baselines, establishing context-conditioned layered editing as a principled framework for controllable generative creation.

2605.10318 2026-05-12 cs.CL

Extending Confidence-Based Text2Cypher with Grammar and Schema Aware Filtering

Makbule Gulcin Ozsoy

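AI summary This study examines how structural constraints can improve the reliability of generated queries in the Text2Cypher task. The author proposes a filtering framework that combines confidence scoring, grammar validation, and schema constraints, applying multi-stage validation after generation to improve query correctness. Experiments show that grammar-aware and schema-aware filtering improve the syntactic validity and execution quality of generated queries, respectively, but also increase the number of empty predictions and reduce coverage. The study offers a new perspective on how different constraints affect generation.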

Abstract

Large language models (LLMs) allow users to query databases using natural language by translating questions into executable queries. Despite strong progress on tasks such as Text2SQL, Text2SPARQL, and Text2Cypher, most existing methods focus on better prompting, fine-tuning, or iterative refinement. However, they often do not explicitly enforce structural constraints, such as syntactic validity and schema consistency. This can reduce reliability, since generated queries must satisfy both syntax rules and database schema constraints to be executable. In this work, we study how structured constraints can be used in test-time inference for Text2Cypher. We focus on post-generation validation to improve query correctness. We extend a confidence-based inference framework with a sequential filtering process that combines confidence scoring, grammar validation, and schema constraints before final aggregation. This lets us analyze how different constraint types affect generated queries. Our experiments with two instruction-tuned models show that grammar-based filtering improves syntactic validity. Schema-aware filtering further improves execution quality by enforcing consistency with the database structure. However, stronger filtering also increases the number of empty predictions and reduces execution coverage. Overall, we show that adding simple structural checks at test time improves the reliability of Text2Cypher generation, and we provide a clearer view of how syntax and schema constraints contribute differently.
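
The sequential filtering described above (confidence, then grammar, then schema, then aggregation) can be sketched as follows. Everything here is an illustrative assumption: the bracket-balance check is a toy stand-in for a real Cypher grammar, the regular expression only looks at node labels, the threshold `min_conf` is arbitrary, and majority voting is one possible aggregation rule rather than the one used in the paper.

```python
import re
from collections import Counter


def toy_grammar_ok(query):
    """Toy syntax check: balanced brackets and a RETURN clause (not a real parser)."""
    closing = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in query:
        if ch in "([{":
            stack.append(ch)
        elif ch in closing:
            if not stack or stack.pop() != closing[ch]:
                return False
    return not stack and "RETURN" in query.upper()


def schema_ok(query, node_labels):
    """Check that every node label such as (p:Person) exists in the schema."""
    used = re.findall(r"\(\s*\w*\s*:\s*(\w+)", query)
    return all(label in node_labels for label in used)


def filter_and_vote(candidates, node_labels, min_conf=0.5):
    """candidates: list of (query, confidence). Returns a query or None."""
    kept = [q for q, conf in candidates if conf >= min_conf]   # 1. confidence
    kept = [q for q in kept if toy_grammar_ok(q)]              # 2. grammar
    kept = [q for q in kept if schema_ok(q, node_labels)]      # 3. schema
    if not kept:
        return None  # empty prediction: every candidate was filtered out
    return Counter(kept).most_common(1)[0][0]                  # 4. aggregate
```

The `None` branch makes the trade-off reported in the abstract visible: each additional filter can only shrink the candidate pool, so stricter checks raise the share of empty predictions.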

2605.10317 2026-05-12 cs.LG cs.AI

Relations Are Channels: Knowledge Graph Embedding via Kraus Decompositions

Sayan Kumar Chaki

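AI summary This paper proposes a knowledge graph embedding method based on Kraus decompositions. By introducing three structural axioms (linearity, trace preservation, and complete positivity), it formalizes relation operators as Kraus channels, giving relation modeling a theoretical foundation. The method naturally handles many-to-many relations, supports multi-hop reasoning, and removes norm constraints on entity embeddings; it also yields the first theoretically grounded measure of relation complexity. Experiments show that the model significantly outperforms existing methods on many-to-many relation tasks.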

Abstract

Knowledge graph embedding (KGE) models typically represent each relation as an operator on entity embeddings. In this work, we identify three structural axioms that any principled relation operator must satisfy (linearity, trace preservation, and complete positivity) and show that they characterize a Kraus channel structure via the Kraus representation theorem. The completeness constraint defining this family is equivalent to these axioms, providing a principled foundation rather than an externally imposed condition. Under this formulation, most existing operator-based KGE models are recoverable as special cases with Kraus rank $\kappa = 1$ under specific embedding choices. We further generalize this characterization to arbitrary metric geometries by introducing \mbox{w-Kraus} channels, which satisfy completeness by construction within their respective spaces. Building on this theory, we propose \textsc{KrausKGE}, a principled KGE model that naturally handles $1$-to-$N$ and $N$-to-$N$ relations, supports $k$-hop reasoning without requiring explicit path encoders, and eliminates the need for norm constraints on entity embeddings. Additionally, our framework yields the first theoretically grounded per-relation complexity measure in the KGE literature, with a provable lower bound in terms of the empirical relation matrix rank. Empirical evaluation demonstrates that \textsc{KrausKGE} consistently outperforms strong baselines on $N$-to-$N$ relations, with performance gains that increase monotonically with relation fan-out, in alignment with theoretical predictions.
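
As background on the Kraus representation the abstract relies on: a channel acts as $\Phi(\rho) = \sum_k K_k \rho K_k^\dagger$, and the completeness constraint $\sum_k K_k^\dagger K_k = I$ is what makes it trace preserving. The sketch below checks both facts numerically for a textbook rank-2 example (the amplitude damping channel) using real matrices, so the adjoint is just the transpose. It illustrates the constraint only; it is not the \textsc{KrausKGE} model, and how that model parameterizes entities and relations is not assumed here.

```python
import math


def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(col) for col in zip(*a)]


def apply_channel(kraus_ops, rho):
    """Phi(rho) = sum_k K_k rho K_k^T (real matrices, so adjoint = transpose)."""
    n = len(rho)
    out = [[0.0] * n for _ in range(n)]
    for k_op in kraus_ops:
        term = matmul(matmul(k_op, rho), transpose(k_op))
        for i in range(n):
            for j in range(n):
                out[i][j] += term[i][j]
    return out


def completeness_gap(kraus_ops):
    """Largest entrywise deviation of sum_k K_k^T K_k from the identity."""
    n = len(kraus_ops[0][0])
    total = [[0.0] * n for _ in range(n)]
    for k_op in kraus_ops:
        term = matmul(transpose(k_op), k_op)
        for i in range(n):
            for j in range(n):
                total[i][j] += term[i][j]
    return max(
        abs(total[i][j] - (1.0 if i == j else 0.0))
        for i in range(n)
        for j in range(n)
    )


def amplitude_damping(gamma):
    """Two Kraus operators of the standard amplitude damping channel."""
    k0 = [[1.0, 0.0], [0.0, math.sqrt(1.0 - gamma)]]
    k1 = [[0.0, math.sqrt(gamma)], [0.0, 0.0]]
    return [k0, k1]
```

A channel with a single Kraus operator satisfying completeness is an orthogonal (or unitary) map, which is consistent with the abstract's remark that rank-1 cases recover existing operator-based models.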

2605.10315 2026-05-12 cs.LG cs.AI

Active Tabular Augmentation via Policy-Guided Diffusion Inpainting

Zheyu Zhang, Shuo Yang, Bardh Prenkaj, Gjergji Kasneci

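AI summary This paper studies how generating tabular data can improve downstream model performance in data-scarce settings. Conventional methods emphasize the distributional fidelity of generated data but do not reliably improve model performance. The authors propose TAP, which combines diffusion inpainting with a learner-conditioned policy to choose dynamically what to generate and when to inject it, maximizing the benefit to the current learner. Experiments show that TAP significantly outperforms existing methods on multiple real-world datasets, improving classification accuracy by up to 15.6 percentage points and reducing regression RMSE by up to 32%.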

Comments Accepted for publication at ICML 2026

Abstract

Generative tabular augmentation is appealing in data-scarce domains, yet the prevailing focus on distributional fidelity does not reliably translate into better downstream models. We formalize a fidelity-utility gap: common generative objectives prioritize distributional plausibility, whereas augmentation succeeds only when injected samples reduce the current learner's held-out evaluation loss. This gap motivates learning not just how to generate, but what to generate and when to inject as training evolves. We propose TAP (Tabular Augmentation Policy), which couples diffusion inpainting with a lightweight, learner-conditioned policy to steer generation toward high-utility regions and controls safe injection via explicit gating and conservative windowed commitment. Under severe data scarcity, TAP consistently outperforms strong generative baselines on seven real-world datasets, improving classification accuracy by up to 15.6 percentage points and reducing regression RMSE by up to 32%.

2605.10313 2026-05-12 cs.LG math.OC

Signature Approach for Contextual Bandits with Nonlinear and Path-dependent Rewards

Xin Guo, Grace He, Xinyu Li

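AI summary This paper studies contextual bandits with nonlinear and path-dependent rewards and proposes a new approach based on the signature transform, which approximates continuous path-dependent reward functionals by linear functionals in signature space. This allows efficient linear contextual bandit algorithms to be applied while preserving sequential structure. Building on this framework, the authors design DisSigUCB, a signature-based disjoint upper confidence bound algorithm, and prove a high-probability, data-dependent sublinear regret bound under certain assumptions. Experiments show that the algorithm outperforms classical linear and kernelized methods in nonlinear and path-dependent settings.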

Abstract

We study contextual bandits with nonlinear and path-dependent rewards through a novel signature-transform-based approach. Leveraging the universal nonlinearity property of signatures, we approximate continuous path-dependent reward functionals by linear functionals in the signature space. This representation enables the use of efficient linear contextual bandit methods while preserving expressive sequential structure. Building on this framework, we propose \texttt{DisSigUCB}, a signature-based disjoint upper confidence bound (UCB) algorithm. Under boundedness and non-degeneracy assumptions, we prove a high-probability data-dependent sublinear regret bound of order \(\tilde{\mathcal O}(\sqrt{(d+m)KT})\) where \(d\) is the context dimension and \(m\) is the signature feature dimension. Synthetic experiments and numerical applications on temperature sensor monitoring, sleep-stage classification, and hospital nurse staffing demonstrate that \texttt{DisSigUCB} consistently outperforms classical linear and kernelized contextual bandit baselines in nonlinear and path-dependent settings.
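
To make the "linear functional in signature space" idea concrete, the sketch below computes the signature of a piecewise-linear path truncated at depth 2 and evaluates a linear reward model on the resulting feature vector. The truncation depth, the feature layout (constant term, level 1, flattened level 2), and the names `signature_features` and `linear_reward` are assumptions made for illustration; the UCB confidence machinery of \texttt{DisSigUCB} is not reproduced.

```python
def signature_depth2(path):
    """Depth-2 signature of a piecewise-linear path given as a list of points.

    Level 1 holds the total increments; level 2 holds the iterated integrals.
    Segments are combined with Chen's identity: a straight segment with
    increment dx contributes dx_i * dx_j / 2 at level 2.
    """
    dim = len(path[0])
    level1 = [0.0] * dim
    level2 = [[0.0] * dim for _ in range(dim)]
    for prev, curr in zip(path, path[1:]):
        dx = [c - p for p, c in zip(prev, curr)]
        for i in range(dim):
            for j in range(dim):
                level2[i][j] += level1[i] * dx[j] + 0.5 * dx[i] * dx[j]
        for i in range(dim):
            level1[i] += dx[i]
    return level1, level2


def signature_features(path):
    """Flatten the truncated signature into one feature vector."""
    level1, level2 = signature_depth2(path)
    return [1.0] + level1 + [value for row in level2 for value in row]


def linear_reward(theta, path):
    """Reward model that is linear in the signature features."""
    return sum(t * f for t, f in zip(theta, signature_features(path)))
```

The antisymmetric part of level 2 depends on the order in which the path moves, so two paths with the same endpoints can receive different rewards, which is the path dependence a plain linear bandit on the final context would miss.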

2605.10307 2026-05-12 cs.CV cs.GR cs.RO

PaMoSplat: Part-Aware Motion-Guided Gaussian Splatting for Dynamic Scene Reconstruction

Yinan Deng, Jianyu Dou, Jiahui Wang, Jingyu Zhao, Yi Yang, Yufeng Yue

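AI summary Dynamic scene reconstruction is a fundamental yet challenging problem in computer vision and robotics. To achieve high-fidelity rendering and accurate tracking in scenes with complex motion, this paper proposes PaMoSplat, a new dynamic Gaussian splatting framework that combines part awareness with motion priors. By lifting multi-view segmentation masks into 3D and estimating part motion guided by optical flow, PaMoSplat achieves higher-quality rendering and more accurate tracking, and shows better performance and faster convergence than existing methods across multiple real-world scenes.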

Comments Accepted by TCSVT. Project URL: https://pamosplat.github.io

Abstract

Dynamic scene reconstruction represents a fundamental yet demanding challenge in computer vision and robotics. While recent progress in 3DGS-based methods has advanced dynamic scene modeling, obtaining high-fidelity rendering and accurate tracking in scenarios with substantial, intricate motions remains significantly challenging. To address these challenges, we propose PaMoSplat, a novel dynamic Gaussian splatting framework incorporating part awareness and motion priors. Our approach is grounded in two key observations: 1) Parts serve as primitives for scene deformation, and 2) Motion cues from optical flow can effectively guide part motion. Specifically, PaMoSplat initializes by lifting multi-view segmentation masks into 3D space via graph clustering, establishing coherent Gaussian parts. For subsequent timestamps, we leverage a differential evolutionary algorithm to estimate the rigid motion of these parts using multi-view optical flow cues, providing a robust warm-start for further optimization. Additionally, PaMoSplat introduces an adaptive iteration count mechanism, internal learnable rigidity, and flow-supervised rendering loss to accelerate and optimize the training process. Comprehensive evaluations across diverse scenes, including real-world environments, demonstrate that PaMoSplat delivers superior rendering quality, improved tracking precision, and faster convergence compared to existing methods. Furthermore, it enables multiple part-level downstream applications, such as 4D scene editing.

2605.10298 2026-05-12 cs.LG

Set Prediction for Next-Day Active Fire Forecasting

Yuchen Bai, Georgios Athanasiou, Xin Yu, Diogenis Antonopoulos, Ioannis Papoutsis, Stijn Hantson, Nuno Carvalhais

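AI summary This paper proposes WISP, a model for high-resolution next-day active fire forecasting that reformulates the problem as point-set prediction. From 48 hours of multi-source data, including meteorology, vegetation, geography, and fire history, the model predicts a fixed-size ranked set of future fire cluster centres on a 375 m grid and is trained end-to-end with Hungarian matching. Experiments show that the method achieves relatively high average precision and fire coverage on a global test set, providing a new approach and benchmark for high-resolution fire forecasting.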

Abstract

Accurate next-day active fire forecasts can support early warning, disaster response, forest risk assessment, and downstream estimation of fire-related carbon emissions. Existing machine learning approaches to wildfire forecasting typically predict wildfire danger or fire probability on kilometre-scale daily grids, which is useful for regional warning but does not directly represent localized fire events. We propose Wildfire Ignition Set Predictor (WISP), a query-based model that reformulates next-day active fire forecasting as point-set prediction. From 48 hours of covariates including meteorology, satellite vegetation products, static land, and fire history, WISP predicts a fixed-size ranked set of future active fire cluster centres on a 375 m grid across globally distributed regions. The model is trained end-to-end with Hungarian matching; to address the conflicting roles of the classification score in assignment, ranking, and query activation, we use asymmetric classification-localization weighting in matching and loss. We further construct a globally distributed, hourly, multi-source benchmark for this task. On a held-out test set spanning fire regions worldwide, the best WISP variant achieves 38.2% average precision (AP) for ranked fire-centre detections, covers 53.4% of fire cluster mass weighted by fire radiative power (FRP), and localizes 54.1% of observed clusters within 5 km. These results establish sparse set prediction as a viable formulation for high-resolution wildfire forecasting and provide a benchmark for future work in this regime.
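
Set prediction losses of this kind first pair each observed target with one distinct prediction by solving a minimum-cost assignment. The sketch below finds that optimal assignment by brute force over permutations, which is only practical for a handful of points; it is swapped in for the Hungarian algorithm to keep the example dependency-free and yields the same optimum. The cost (weighted distance minus weighted score) and the weights `w_loc` and `w_cls` are hypothetical and do not reproduce the asymmetric weighting the paper describes.

```python
import math
from itertools import permutations


def match_predictions(pred_points, pred_scores, targets, w_loc=1.0, w_cls=1.0):
    """Optimal one-to-one matching of targets to predictions.

    Returns a dict mapping target index -> prediction index. Predictions
    left unmatched would be trained as "no fire" in a set-prediction loss.
    """
    if len(targets) > len(pred_points):
        raise ValueError("need at least as many predictions as targets")
    best_cost, best_perm = math.inf, ()
    # Try every ordered choice of len(targets) distinct predictions.
    for perm in permutations(range(len(pred_points)), len(targets)):
        cost = sum(
            w_loc * math.dist(pred_points[i], target) - w_cls * pred_scores[i]
            for i, target in zip(perm, targets)
        )
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return {j: i for j, i in enumerate(best_perm)}
```

Because the classification score enters the matching cost as well as the ranking, changing `w_cls` relative to `w_loc` changes which query is assigned to which fire cluster, which is the kind of interaction the abstract's weighting scheme is meant to control.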

2605.10296 2026-05-12 cs.CL cs.AI cs.IR cs.LG

Qwen Goes Brrr: Off-the-Shelf RAG for Ukrainian Multi-Domain Document Understanding

Anton Bazdyrev, Ivan Bashtovyi, Ivan Havlytskyi, Oleksandr Kharytonov, Artur Khodakovskyi

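AI summary This paper studies how off-the-shelf retrieval-augmented generation (RAG) can solve a Ukrainian multi-domain document understanding task: answering multiple-choice questions from PDF documents and localizing the supporting information. The authors propose a pipeline built on contextual chunking, question-aware dense retrieval and reranking, and constrained answer generation, which effectively improves system performance. Experiments show that using Qwen-series models for retrieval and reranking substantially improves recall and answer accuracy, with strong results on both the public and private test sets, confirming the value of structure preservation and answer-space awareness under strict competition conditions.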

Comments Accepted to The Fifth Ukrainian Natural Language Processing Conference (UNLP 2026)

Abstract

We participated in the Fifth UNLP shared task on multi-domain document understanding, where systems must answer Ukrainian multiple-choice questions from PDF collections and localize the supporting document and page. We propose a retrieval-augmented pipeline built around three ideas: contextual chunking of PDFs, question-aware dense retrieval and reranking conditioned on both the question and answer options, and constrained answer generation from a small set of reranked passages. Our final system uses Qwen3-Embedding-8B for retrieval, a fine-tuned Qwen3-Reranker-8B for passage ranking, and Qwen3-32B for answer selection. On a held-out split, reranking improves Recall@1 from 0.6957 to 0.7935, while using the top-2 reranked passages raises answer accuracy from 0.9348 to 0.9674. Our best leaderboard run reached 0.9452 on the public leaderboard and 0.9598 on the private leaderboard. Our results suggest that, under strict code-competition constraints, preserving document structure and making relevance estimation aware of the answer space are more effective than adding complex downstream heuristics.
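
The retrieval numbers above are Recall@k values: the fraction of questions whose gold passage appears among the top k ranked passages. A minimal sketch of that metric follows, assuming one gold passage per question; the function name and the passage identifiers in the usage note are hypothetical.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of questions whose gold passage is in the top-k ranking.

    ranked_ids: one ranked list of passage ids per question.
    gold_ids:   the single gold passage id for each question.
    """
    hits = sum(
        1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k]
    )
    return hits / len(gold_ids)
```

For example, with rankings `[["a", "b"], ["c", "d"]]` and gold passages `["a", "d"]`, only the first question is a hit at `k = 1`, while both are hits at `k = 2`; a reranker improves Recall@1 by moving gold passages from lower ranks to the top.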

2605.10295 2026-05-12 cs.CL

DECO-MWE: building a linguistic resource of Korean multiword expressions for feature-based sentiment analysis

Jaeho Han, Changhoe Hwang, Seongyong Choi, Gwanghoon Yoo, Eric Laporte, Jeesun Nam

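AI summary This paper builds DECO-MWE, a linguistic resource of Korean multiword expressions (MWEs) for feature-based sentiment analysis. To construct sentiment-related MWE resources efficiently, the study uses the Local Grammar Graph (LGG) methodology, formalizing DECO-MWE as a finite-state transducer that expresses lexical and syntactic restrictions on MWEs. By building a corpus of cosmetics reviews and analyzing it empirically, the study identifies four types of MWEs and achieves an F-measure of 0.806 on the test corpus, providing a general-purpose MWE lexicon and a reusable finite-state methodology for feature-based sentiment analysis.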

Journal ref
13th Workshop on Asian Language Resources, May 2018, Miyazaki, Japan, pp.14-20
Abstract

This paper aims to construct a linguistic resource of Korean Multiword Expressions for Feature-Based Sentiment Analysis (FBSA): DECO-MWE. Dealing with multiword expressions (MWEs) has been a critical issue in FBSA since many constructs reveal lexical idiosyncrasy. To construct linguistic resources of sentiment MWEs efficiently, we utilize the Local Grammar Graph (LGG) methodology: DECO-MWE is formalized as a Finite-State Transducer that represents lexical-syntactic restrictions on MWEs. In this study, we built a corpus of cosmetics review texts, which show particularly frequent occurrences of MWEs. Based on an empirical examination of the corpus, four types of MWEs have been distinguished. The DECO-MWE thus covers the following four categories: Standard Polarity MWEs (SMWEs), Domain-Dependent Polarity MWEs (DMWEs), Compound Named Entity MWEs (EMWEs) and Compound Feature MWEs (FMWEs). The retrieval performance of the DECO-MWE shows 0.806 f-measure in the test corpus. This study brings a twofold outcome: first, a sizeable general-purpose polarity MWE lexicon, which may be broadly used in FBSA; second, a finite-state methodology adopted in this study to treat domain-dependent MWEs such as idiosyncratic polarity expressions, named entity expressions or feature expressions, and which may be reused in describing linguistic properties of other corpus domains.

2605.10293 2026-05-12 cs.LG cs.AI

Robust Probabilistic Shielding for Safe Offline Reinforcement Learning

Maris F. L. Galesloot, Thomas Rhemrev, Nils Jansen

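AI summary This paper studies safe policy improvement in offline reinforcement learning and proposes robust probabilistic shielding. By combining safe policy improvement (SPI) with shielding, and using only the existing dataset and knowledge of safe states, it provides guarantees on both performance and safety during policy optimization. With high probability, the improved policy both outperforms the baseline policy and satisfies the safety constraints; experiments show better average and worst-case performance when data is scarce.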

Abstract

In offline reinforcement learning (RL), we learn policies from fixed datasets without environment interaction. The major challenges are to provide guarantees on the (1) performance and (2) safety of the resulting policy. A technique called safe policy improvement (SPI) provides a performance guarantee: with high probability, the new policy outperforms a given baseline policy, which is assumed to be safe. Orthogonally, in the context of safe RL, a shield provides a safety guarantee by restricting the action space to those actions that are provably safe with respect to a given safety-relevant model. We integrate these paradigms by extending shielding to offline RL, relying solely on the available dataset and knowledge of safe and unsafe states. Then, we shield the policy improvement steps, guaranteeing, with high probability, a safe policy. Experimental results demonstrate that shielded SPI outperforms its unshielded counterpart, improving both average and worst-case performance, particularly in low-data regimes.
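
The shielding idea, restricting the action space before the policy chooses, can be illustrated with a one-step risk check. This is a deliberately simplified sketch: the transition model is given as a plain dictionary rather than estimated robustly from a dataset, the risk is measured over a single step, and the names `shielded_actions`, `shielded_greedy`, and `max_risk` are hypothetical. It does not reproduce the paper's shield construction or its SPI guarantees.

```python
def shielded_actions(state, actions, transition_probs, unsafe_states, max_risk=0.0):
    """Keep only actions whose one-step probability of reaching an unsafe state
    is at most max_risk under the given transition model."""
    allowed = []
    for action in actions:
        successors = transition_probs[(state, action)]
        risk = sum(p for nxt, p in successors.items() if nxt in unsafe_states)
        if risk <= max_risk:
            allowed.append(action)
    return allowed


def shielded_greedy(state, actions, q_values, transition_probs, unsafe_states, max_risk=0.0):
    """Greedy choice restricted to the shielded action set (None if it is empty)."""
    allowed = shielded_actions(state, actions, transition_probs, unsafe_states, max_risk)
    if not allowed:
        return None
    return max(allowed, key=lambda action: q_values[(state, action)])
```

In the usage below, the higher-valued action is blocked because it can reach an unsafe state, so the shielded policy falls back to the safe one unless the risk tolerance is raised.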

2605.10292 2026-05-12 cs.LG cs.AI

LeapTS: Rethinking Time Series Forecasting as Adaptive Multi-Horizon Scheduling

Sheng Pan, Ming Jin, Bo Du, Shirui Pan

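AI summary This paper proposes LeapTS, a new time series forecasting framework that recasts the conventional fixed-mapping forecasting task as a dynamic multi-step scheduling process, so as to adapt better as future time points unfold. LeapTS makes multi-level decisions with a hierarchical controller and neural controlled differential equations, dynamically selecting the prediction scale and advancement length to better capture non-stationary dynamics. Experiments show that LeapTS significantly improves forecasting performance on multiple real-world and synthetic datasets and achieves faster inference than Transformer-based models.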

Abstract

Time series forecasting serves as an essential tool for many real-world applications, supporting tasks such as resource optimization and decision-making. Despite significant architectural advancements, most modern models still treat the forecasting task as a fixed mapping from history to target horizons. This induces temporal decoupling across future time points and limits the model's ability to adapt to the evolving context as forecasting progresses. In this work, we present LeapTS, a novel framework that reformulates time series forecasting as a dynamic scheduling process over the prediction horizon. Specifically, LeapTS organizes the forecasting process into multi-level decisions using: (1) a hierarchical controller that dynamically selects the optimal prediction scale and advancement length at each step, and (2) continuous-time state evolution driven by neural controlled differential equations. Within this process, the controlled update mechanism explicitly couples the irregular temporal dynamics with discrete scheduling feedback. Extensive evaluations on both real-world and synthetic datasets demonstrate that LeapTS improves overall forecasting performance by at least 7.4% while achieving a 2.6$\times$ to 5.3$\times$ inference speedup over representative Transformer-based models. Furthermore, by explicitly tracing the scheduling trajectories, we reveal how the model autonomously adapts its forecasting behavior to capture non-stationary dynamics.

2605.10286 2026-05-12 cs.AI

AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks

Baraa Al Jorf, Farah E. Shamout

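AI summary This paper evaluates large language model (LLM)-based agents on multimodal clinical prediction tasks, studying their performance on heterogeneous data such as electronic health records, medical images, reports, and clinical notes. Systematic experiments on large-scale real-world medical data show that single-agent frameworks outperform naive multi-agent systems on multimodal tasks, handle the data better, and are better calibrated. The study provides a new benchmark for further development of agentic systems in healthcare and open-sources its code and evaluation framework.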

Comments Accepted at the AHLI Conference on Health, Inference, and Learning 2026

Abstract

Building effective clinical decision support systems requires the synthesis of complex heterogeneous multimodal data. Such modalities include temporal electronic health records data, medical images, radiology reports, and clinical notes. Large language model (LLM)-based agents have shown impressive performance in various healthcare tasks, especially those involving textual modalities. Considering the fragmentation of healthcare data across hospital systems, collaborative agent frameworks present a promising direction to mitigate data sharing challenges. However, the effectiveness of LLM agents for multimodal clinical risk prediction remains largely unexamined. In this work, we conduct a systematic evaluation of LLM-based agents for clinical prediction tasks using large-scale real-world data. We assess performance in unimodal and multimodal settings and quantify performance gaps between single agent and multi-agent systems. Our findings highlight that single agent frameworks outperform naive multi-agent systems, are better at handling multimodal data, and are better calibrated. This underscores a critical need for improving multi-agent collaboration to better handle heterogeneous inputs. By open-sourcing our code and evaluation framework, this work offers a new benchmark to support future developments relating to agentic systems in healthcare.

2605.10281 2026-05-12 cs.SD cs.AI

Drum Synthesis from Expressive Drum Grids via Neural Audio Codecs

Konstantinos Soiledis, Maximos Kaliakatsos-Papakostas, Dimos Makris, Konstantinos Tsamis

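AI summary This paper studies how to generate realistic drum audio directly from expressive drum grids (a MIDI representation) carrying microtiming and velocity information, and proposes a method based on neural audio codecs. A Transformer-based model maps the drum grid to a sequence of discrete codec tokens, and a pre-trained codec decoder converts them to waveform audio. Experiments on E-GMD, a large dataset of human drum performances, show good audio fidelity and musical alignment, offering an effective route for drum grid-to-audio generation and practical guidance on choosing audio tokenizers for percussive synthesis.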

Abstract

Generating realistic drum audio directly from symbolic representations is a challenging task at the intersection of music perception and machine learning. We propose a system that transforms an expressive drum grid, a time-aligned MIDI representation with microtiming and velocity information, into drum audio by predicting discrete codes of a neural audio codec. Our approach uses a Transformer-based model to map the drum grid input to a sequence of codec tokens, which are then converted to waveform audio via a pre-trained codec decoder. We experiment with multiple state-of-the-art neural codecs, namely EnCodec, DAC, and X-Codec, to assess how the choice of audio representation impacts the quality of the generated drums. The system is trained and evaluated on the Expanded Groove MIDI Dataset, E-GMD, a large collection of human drum performances with paired MIDI and audio. We evaluate the fidelity and musical alignment of the generated audio using objective metrics. Overall, our results establish codec-token prediction as an effective route for drum grid-to-audio generation and provide practical insights into selecting audio tokenizers for percussive synthesis.

2605.10279 2026-05-12 cs.LG

DeepLog: A Software Framework for Modular Neurosymbolic AI

Robin Manhaeve, Stefano Colamonaco, Vincent Derkinderen, Rik Adriaensen, Lucas Van Praet, Luc De Raedt, Giuseppe Marra

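AI summary DeepLog is a modular, PyTorch-based neurosymbolic AI framework that aims to unify logical reasoning and deep learning within a single workflow. It treats a variety of neurosymbolic languages as high-level specifications and automatically compiles them into optimized arithmetic circuits, lowering the barrier for machine learning practitioners and giving neurosymbolic system developers a shared, high-performance platform. Its core contribution is making neurosymbolic systems modular and general, easing the integration of and experimentation with different approaches.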

Comments Preprint accepted at IJCAI2026 Demo Track

Abstract

DeepLog is an operational neurosymbolic framework that unifies logic and deep learning within standard PyTorch workflows. While existing neurosymbolic systems focus on a particular paradigm and semantics, DeepLog serves as a universal backend that can emulate many systems in the neurosymbolic alphabet soup. By treating diverse neurosymbolic languages as high-level specifications, the DeepLog software automatically compiles them into optimized arithmetic circuits. This design lowers the barrier for machine learning practitioners by treating logic as composable modules, while providing neurosymbolic developers with a shared, high-performance basis for prototyping new integration strategies. The code is available here: https://github.com/ML-KULeuven/deeplog

2605.10278 2026-05-12 cs.LG

Predictive Radiomics for Evaluation of Cancer Immune SignaturE in Glioblastoma: the PRECISE-GBM study

Prajwal Ghimire, Junjie Li, Liu Yaou, Marc Modat, Thomas Booth

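AI summary This study aims to develop and validate, through radiogenomics, imaging biomarkers for assessing immune signatures in IDH-wildtype glioblastoma. Using multicenter retrospective data, it combines MRI radiomic features from deep learning-based segmentation with genomic data to build and validate radiomics-based models that predict immune signatures. The results show that the proposed models can non-invasively predict the macrophage M0 subtype immune signature with good stability and generalizability, and could help stratify glioblastoma patients for immunotherapy.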

Comments Abstract : 226; Importance of study: 109; Manuscript: 5690 (excluding references) Figures: 4, Tables: 2 Supplemental File: 1

Journal ref
Neuro-Oncology Advances 2026. Published online May 2, 2026
Abstract

Background: Radiogenomics allows identification of radiological biomarkers for genomic phenotypes. In glioblastoma, these biomarkers could potentially complement patient stratification strategies. We aim to develop and analytically validate radiological biomarkers that capture immune cell signatures within the IDH-wildtype glioblastoma microenvironment using radiogenomic analysis. Methods: This was a retrospective multicenter study using curated open-access anonymized imaging and genomic data from the TCGA-GBM, CPTAC, IvyGAP, REMBRANDT and CGGA datasets. Imaging data consisted of MRI-based radiomic features extracted from the necrotic core, enhancing and edema regions of deep learning-based auto-segmented tumors. Radiomic feature selection was performed using nested cross-validated LASSO. Support vector machine and ensemble models were trained using seventeen immune and cell-specific score labels extracted from deconvoluted transcriptomic data using pan-cancer and glioblastoma immune signature matrices as reference standards. Seventeen classifier models trained in three cross-cohort strategies were validated on three held-out datasets to assess stability and generalizability. Results: One hundred and seventy-six patients were included in the study. The immune-related radiomic signatures obtained after feature selection were shape, first-order and higher-order radiomic features. Models predicting the macrophage subtype immune signature showed stable mean performance on balanced accuracy (0.67) and precision (0.89) metrics for three independent holdout datasets, with the ensemble model outperforming the support vector machine model. Conclusion: Radiogenomic models non-invasively predicted the macrophage subtype M0 immune signature in IDH-wildtype glioblastoma. These biomarkers have the potential to stratify patients for immunotherapy within prospective glioblastoma clinical trials.

2605.10277 2026-05-12 cs.LG math.AP stat.ML

Generalization Error Bounds for Picard-Type Operator Learning in Nonlinear Parabolic PDEs

Koichi Taniguchi, Sho Sonoda

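AI summary This paper studies learning the solution operators of nonlinear parabolic partial differential equations (PDEs) based on Duhamel-Picard iteration. It proposes an abstract state-transition model framework and derives implementation-agnostic generalization error bounds that separate the implementation error from the estimation error. The key finding is that increasing the Picard depth reduces the truncation error without unbounded growth of the entropy-based estimation error; the theory is illustrated with a Picard-type Fourier neural operator for nonlinear heat equations on the torus.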

Comments 39 pages

Abstract

Operator learning for partial differential equations (PDEs) aims to learn solution operators on infinite-dimensional function spaces from finite-resolution data. In this setting, it is important for the learned model to be discretization-invariant, or resolution-robust, and to reflect PDE-specific structure. It is therefore natural to ask how such structure should be encoded in the model architecture, hypothesis class, or learning procedure. In this paper, we study operator learning for solution operators of nonlinear parabolic PDEs based on Duhamel--Picard iteration. We formulate Picard iteration as an abstract state-transition model and present a theoretical framework for Picard-type operator learning. We derive implementation-agnostic generalization error bounds that separate the implementation error from the estimation error associated with the abstract state-transition model induced by Picard iteration. A key consequence is that increasing the Picard depth reduces the Picard truncation error without causing an unbounded growth of the entropy-based estimation error. We also extend the analysis to long-time prediction by rolling out the same learned local model over successive time blocks. Finally, we illustrate the theory for nonlinear heat equations on the torus using a Picard-type Fourier neural operator as a concrete implementation.
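
A scalar analogue shows what Duhamel-Picard iteration does and why depth controls the truncation error. For $u' = -a\,u + N(u)$, Duhamel's formula gives $u(t) = e^{-at}u_0 + \int_0^t e^{-a(t-s)} N(u(s))\,ds$, and Picard iteration repeatedly substitutes the previous iterate into the integral. The sketch below does this on a uniform time grid with the trapezoid rule. The scalar decay rate `a` stands in for the heat semigroup acting on a single mode, and the function name, grid size, and quadrature are assumptions made for illustration; no learned operator is involved.

```python
import math


def duhamel_picard(u0, a, nonlinearity, t_end, depth, n_steps=200):
    """Picard iterates for u' = -a*u + N(u) in Duhamel (mild) form.

    Returns the depth-th iterate sampled on a uniform grid over [0, t_end].
    The starting iterate is the free evolution exp(-a*t) * u0.
    """
    h = t_end / n_steps
    decay = math.exp(-a * h)
    free = [u0 * math.exp(-a * i * h) for i in range(n_steps + 1)]
    u = list(free)
    for _ in range(depth):
        forcing = [nonlinearity(value) for value in u]
        integral = 0.0
        new_u = [free[0]]
        for i in range(n_steps):
            # Trapezoid rule for the Duhamel integral, advanced one step:
            # I(t+h) = exp(-a*h) * I(t) + integral over [t, t+h].
            integral = decay * integral + 0.5 * h * (decay * forcing[i] + forcing[i + 1])
            new_u.append(free[i + 1] + integral)
        u = new_u
    return u
```

For $a = 1$, $N(u) = u^2$, and $u_0 = 0.5$ the exact solution is $u(t) = 1/(1 + e^{t})$, so the error of the final value can be compared across depths; it shrinks rapidly as the depth grows until the quadrature error of the grid dominates.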

2605.10275 2026-05-12 cs.CV

PolarVSR: A Unified Framework and Benchmark for Continuous Space-Time Polarization Video Reconstruction

Chenggong Li, Yidong Luo, Junchao Zhang, Boxin Shi, Degui Yang

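AI summary: This paper proposes PolarVSR, a unified space-time polarization video reconstruction framework, addressing the challenging inverse problem of recovering polarization parameters from mosaic arrays in mainstream division-of-focal-plane (DoFP) polarization imaging. The method jointly models polarization directions in space and time and uses a polarization-aware implicit neural representation for continuous, high-fidelity super-resolution reconstruction. It also introduces a flow-guided polarization variation loss to supervise polarization dynamics and establishes the first large-scale color DoFP polarization video benchmark; experiments on this benchmark validate the method.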

Details
English abstract

Polarimetric imaging captures surface polarization characteristics, such as the Degree of Linear Polarization (DoLP) and the Angle of Polarization (AoP). In mainstream Division-of-Focal-Plane (DoFP) color polarization imaging, recovering polarization parameters from captured mosaic arrays remains a challenging inverse problem. Existing DoFP cameras also face hardware bottlenecks and often cannot support high-frame-rate acquisition, limiting polarimetric imaging in dynamic video tasks. These limitations motivate joint spatial and temporal enhancement. To this end, we propose the first space-time polarization video reconstruction architecture. The method jointly models polarization directions in space and time and uses a polarization-aware implicit neural representation for continuous, high-fidelity upsampling. By analyzing temporal variations in polarization parameters, we further introduce a flow-guided polarization variation loss to supervise polarization dynamics. We also establish the first large-scale color DoFP polarization video benchmark to support this research direction. Extensive experiments on this benchmark demonstrate the effectiveness of the method.
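As background on the quantities named in the abstract, DoLP and AoP are conventionally derived from the linear Stokes parameters, which a DoFP sensor samples through polarizers at 0°, 45°, 90° and 135°. The sketch below shows that textbook per-pixel computation only; it is not the reconstruction method of the paper, and the function names are hypothetical.

```python
import math

def stokes_from_dofp(i0, i45, i90, i135):
    """Linear Stokes parameters from four polarizer-angle intensities (one pixel)."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90                       # horizontal vs. vertical component
    s2 = i45 - i135                     # +45 deg vs. -45 deg component
    return s0, s1, s2

def dolp_aop(s0, s1, s2, eps=1e-12):
    """Degree of Linear Polarization and Angle of Polarization (radians)."""
    dolp = math.hypot(s1, s2) / max(s0, eps)  # eps guards against division by zero
    aop = 0.5 * math.atan2(s2, s1)
    return dolp, aop
```

For example, light fully polarized at 45° gives intensities `(0.5, 1.0, 0.5, 0.0)`, so `s1 = 0` and `s2 = 1`, and the resulting AoP is `pi / 4` with a DoLP of 1.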

2605.10272 2026-05-12 cs.LG cs.AI cs.CR cs.DC

DP-LAC: Lightweight Adaptive Clipping for Differentially Private Federated Fine-tuning of Language Models

Haaris Mehmood, Jie Xu, Karthikeyan Saravanan, Rogier Van Dalen, Mete Ozay

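AI summary: This paper proposes DP-LAC, a lightweight adaptive clipping method for differentially private fine-tuning of language models in federated learning. It first estimates an initial clipping threshold via private histogram estimation, then adapts the threshold during training without consuming additional privacy budget or introducing new hyperparameters. Experiments show that DP-LAC outperforms existing adaptive clipping methods and vanilla DP-SGD, with an average accuracy gain of 6.6%.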

Comments Accepted at ICASSP 2026

Details
English abstract

Federated learning (FL) enables the collaborative training of large-scale language models (LLMs) across edge devices while keeping user data on-device. However, FL still exposes sensitive information through client-provided gradients. Differentially private stochastic gradient descent (DP-SGD) mitigates this risk by clipping each client's contribution to a threshold $C$ and adding noise proportional to $C$. Existing adaptive clipping techniques dynamically adjust $C$ but demand tedious hyperparameter tuning, which can erode the privacy budget. In this paper, we introduce DP-LAC, a method that first estimates an initial clipping threshold within an order of magnitude of the optimum using private histogram estimation, and then adapts this threshold during training without consuming additional privacy budget or introducing new hyperparameters. Empirical results show that DP-LAC outperforms both state-of-the-art adaptive clipping methods and vanilla DP-SGD, achieving an average accuracy gain of $6.6\%$.
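The clip-and-noise step that the abstract attributes to DP-SGD can be sketched as follows. This is a minimal illustration of fixed-threshold clipping with Gaussian noise whose standard deviation is proportional to $C$; it does not implement DP-LAC's histogram-based estimate or its threshold adaptation, and the names `clip_update`, `dp_aggregate` and `noise_multiplier` are assumptions made for the example.

```python
import math
import random

def clip_update(update, clip_c):
    """Scale a client update so that its L2 norm is at most clip_c."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_c / norm) if norm > 0 else 1.0
    return [x * scale for x in update]

def dp_aggregate(client_updates, clip_c, noise_multiplier, rng):
    """Sum clipped updates, add Gaussian noise with std noise_multiplier * clip_c, then average."""
    clipped = [clip_update(u, clip_c) for u in client_updates]
    dim = len(clipped[0])
    sigma = noise_multiplier * clip_c  # noise scale is proportional to the threshold C
    noisy_sum = [
        sum(u[j] for u in clipped) + rng.gauss(0.0, sigma)
        for j in range(dim)
    ]
    return [v / len(clipped) for v in noisy_sum]
```

A call such as `dp_aggregate(updates, clip_c=1.0, noise_multiplier=1.0, rng=random.Random(0))` makes the trade-off visible: a larger `clip_c` distorts fewer updates but injects more noise, which is why the choice of threshold matters.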

2605.10269 2026-05-12 cs.CV cs.RO

Increasing the Efficiency of DETR for Maritime High-Resolution Images

Tinsae Yehuala, Hao Cheng, Ville Lehtola

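AI summary: Targeting object detection in high-resolution images for the safe navigation of unmanned surface vessels (USVs), this paper studies how to make DETR more efficient. The authors use Vision Mamba (ViM), built on state space models (SSMs), as the backbone, and combine it with tokenizing images into sequences and a tailored Feature Pyramid Network to handle distant objects, small objects, and large variations in scale. With token pruning to cut computation on background regions, the approach achieves a better balance between detection performance and computational efficiency than methods such as RT-DETR with a ResNet50 backbone.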

Comments Accepted to IEEE ITSC 2026. Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses. DOI to be added upon publication

Details
English abstract

Maritime object detection is critical for the safe navigation of unmanned surface vessels (USVs), requiring accurate recognition of obstacles from small buoys to large vessels. Real-time detection is challenging due to long distances, small object sizes, large-scale variations, edge computing limitations, and the high memory demands of high-resolution imagery. Existing solutions, such as downsampling or image splitting, often reduce accuracy or require additional processing, while memory-efficient models typically handle only limited resolutions. To overcome these limitations, we leverage Vision Mamba (ViM) backbones, which build on State Space Models (SSMs) to capture long-range dependencies while scaling linearly with sequence length. Images are tokenized into sequences for efficient high-resolution processing. For further computational efficiency, we design a tailored Feature Pyramid Network with successive downsampling and SSM layers, as well as token pruning to reduce unnecessary computation on background regions. Compared to state-of-the-art methods like RT-DETR with ResNet50 backbone, our approach achieves a better balance between performance and computational efficiency in maritime object detection.

2605.10268 2026-05-12 cs.CL cs.AI

MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading

Baibei Ji, Xiaoyang Weng, Juntao Li, Zecheng Tang, Yihang Lou, Min Zhang

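AI summary: Agent-memory approaches handle long-context reasoning without the quadratic complexity of standard attention by processing document chunks linearly while maintaining a dynamically updated memory, but they can lose latent evidence when memory is overwritten. MemReread adds question decomposition and rereading, triggered when the final memory is insufficient, to recover indirect facts that were discarded prematurely; this supports non-linear reasoning while preserving the logical flow of document comprehension. A reinforcement learning framework further improves length extrapolation and dynamically determines the number of rereading passes based on task complexity, balancing performance against computational overhead.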

Details
English abstract

To tackle long-context reasoning tasks without the quadratic complexity of standard attention mechanisms, approaches based on agent memory have emerged, which typically maintain a dynamically updated memory when linearly processing document chunks. To mitigate the potential loss of latent evidence in this memorize-while-reading paradigm, recent works have integrated retrieval modules that allow agents to recall information previously discarded during memory overwriting. However, retrieval-based recall suffers from both evidence loss during memory formation and interference induced by invalid queries. To overcome these limitations, we propose MemReread. Built upon streaming reading, MemReread circumvents intermediate retrieval. It triggers question decomposition and rereading when the final memory is insufficient, enabling the recovery of indirect facts that were prematurely discarded. This design supports non-linear reasoning while preserving the inherent logical flow of document comprehension. To further enhance practicality, we introduce a reinforcement learning framework that enhances length extrapolation capability while dynamically determining the number of rereading passes based on task complexity, thereby flexibly controlling computational overhead. Extensive experiments demonstrate that MemReread consistently outperforms baseline frameworks on long-context reasoning tasks, while maintaining linear time complexity with respect to context length.

2605.10261 2026-05-12 cs.AI cs.LG

E-TCAV: Formalizing Penultimate Proxies for Efficient Concept Based Interpretability

Hasib Aslam, Muhammad Ali Chattha, Muhammad Taha Mukhtar, Muhammad Imran Malik, Andreas Dengel, Sheraz Ahmed

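AI summary: This paper proposes E-TCAV, an efficient concept-based interpretability framework that addresses TCAV's computational overhead, inter-layer score disagreement, and statistical instability. Based on an investigation of three key aspects of the TCAV methodology, E-TCAV uses the penultimate layer as a fast proxy for earlier layers, which substantially improves computational efficiency; it is evaluated across multiple architectures and datasets. Experiments show that layers in the final block agree strongly with the penultimate layer in their TCAV scores, and that score variance stems mainly from the choice of latent classifier, which makes efficient model debugging and real-time concept-guided training more feasible.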

Details
English abstract

TCAV (Testing with Concept Activation Vectors) is an interpretability method that assesses the alignment between the internal representations of a trained neural network and human-understandable, high-level concepts. Though effective, TCAV suffers from significant computational overhead, inter-layer disagreement of TCAV scores, and statistical instability. This work takes a step toward addressing these challenges by introducing E-TCAV, a framework for efficient approximation of TCAV scores, which is based on extensive investigation into three key aspects of the TCAV methodology: 1) the effect of latent classifiers on the stability of TCAV scores, 2) the inter-layer agreement of TCAV scores, and 3) the use of the penultimate layer as a fast proxy for earlier layers for TCAV computation. To ensure a solid foundation for E-TCAV, we conduct extensive evaluations across four different architectures and five datasets, encompassing problems from both computer vision and natural language domains. Our results show that the layers in the final block of the neural network strongly agree with the penultimate layer in terms of the TCAV scores, and the commonly observed variance of the TCAV scores can be attributed to the choice of the latent classifier. Leveraging this inter-layer agreement and the degeneracy of directional sensitivities at the penultimate layer, E-TCAV guarantees linearly scaling speed-ups with respect to the network's size and the number of evaluation samples, marking a step towards efficient model debugging and real-time concept-guided training.
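For readers unfamiliar with the quantity being approximated, a standard TCAV score is the fraction of inputs whose directional derivative along the concept activation vector (CAV) is positive. The sketch below shows that baseline computation only, assuming the per-sample gradients of the class logit with respect to the layer activations have already been computed; it is not E-TCAV's approximation, and the function name is hypothetical.

```python
def tcav_score(gradients, cav):
    """Fraction of samples with a positive directional derivative along the CAV.

    gradients: one gradient vector per sample (class logit w.r.t. layer activations).
    cav: concept activation vector with the same dimension as each gradient.
    """
    if not gradients:
        raise ValueError("need at least one gradient vector")
    positive = 0
    for grad in gradients:
        sensitivity = sum(g * v for g, v in zip(grad, cav))  # dot product
        if sensitivity > 0:
            positive += 1
    return positive / len(gradients)
```

The cost of the full method comes from obtaining `gradients` at each layer of interest for every evaluation sample, which is the overhead the abstract sets out to reduce.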

2605.10257 2026-05-12 cs.AI

Towards Autonomous Railway Operations: A Semi-Hierarchical Deep Reinforcement Learning Approach to the Vehicle Rescheduling Problem

Alberto Castagna, Stefan Zahlner, Adrian Egli, Christian Eichenberger, Daniel Boos, Manuel Meyer, Anton Fuxjager

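AI summary: This paper studies a semi-hierarchical deep reinforcement learning approach to handling disruptions in railway vehicle rescheduling, with the aim of increasing automation in railway operations. The method designs dedicated action and observation spaces for dispatching and routing, so that policies can specialise in different decision scopes and cope with the imbalance between rare dispatch decisions and frequent routing updates. Experiments show better coordination, resource utilisation, and robustness than heuristic baselines and monolithic RL, nearly doubling the number of trains that reach their destinations while keeping deadlock rates low under dense traffic.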

Details
English abstract

Managing disruptions in railway traffic management is a major challenge. Rising traffic density and infrastructure limits increase complexity, making the Vehicle Routing and Scheduling Problem (VRSP) difficult to solve reliably and in real time. While Operational Research (OR) methods are widely used, most dispatching still relies on human expertise due to the problem's exponential combinatorial complexity. Reinforcement Learning (RL) has gained attention for its potential in multi-agent coordination, but existing RL approaches often underperform OR methods and struggle to scale in dense rail networks. This paper addresses this gap from a machine learning perspective by introducing a semi-hierarchical RL formulation tailored to operational railway constraints. The method separates dispatching from routing through dedicated action and observation spaces, enabling policies to specialise in distinct decision scopes and addressing the imbalance between rare dispatch decisions and frequent routing updates. The approach is evaluated on the Flatland-RL simulator across five difficulty levels and 50 random seeds, with 7 to 80 trains. Results show substantially improved coordination, resource utilisation, and robustness compared with heuristic baselines and monolithic RL, nearly doubling the number of trains reaching their destinations, while keeping deadlock rates below 5% and adaptively sequencing, delaying, or cancelling trains under heavy congestion.

2605.10256 2026-05-12 cs.SD cs.AI

A Cold Diffusion Approach for Percussive Dereverberation

Dimos Makris, András Barják, Maximos Kaliakatsos-Papakostas

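AI summary: This paper proposes a cold diffusion framework for percussive dereverberation, motivated by the fact that audio dereverberation research has focused on speech and largely overlooked percussive signals. It models reverberation as a deterministic degradation process that progressively transforms anechoic signals into reverberant ones. The study investigates two reverse-process parameterizations with UNet and diffusion Transformer backbones, trains and evaluates on datasets of acoustic and electronic drum recordings, and reports that the method outperforms score-based and conditional diffusion baselines on multiple metrics.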

Comments Accepted for the 2026 IEEE World Congress on Computational Intelligence, IJCNN Track, 21-26 June 2026, Maastricht, the Netherlands

Details
English abstract

Most recent advances in audio dereverberation focus almost exclusively on speech, leaving percussive and drum signals largely unexplored despite their importance in music production. Percussive dereverberation poses distinct challenges due to sharp transients and dense temporal structure. In this work, we propose a cold diffusion framework for dereverberating stereo drum stems (downmixes), modeling reverberation as a deterministic degradation process that progressively transforms anechoic signals into reverberant ones. We investigate two reverse-process parameterizations, Direct (next-state) and a Delta-normalized residual (velocity-style) prediction, and implement the framework using both a UNet and a diffusion Transformer backbone. The models are trained and evaluated on curated datasets comprising both acoustic and electronic drum recordings, with reverberation generated using a combination of synthetic and real room impulse responses. Extensive experiments on in-domain and fully out-of-domain test sets demonstrate that the proposed method consistently outperforms strong score-based and conditional diffusion baselines, evaluated using signal-based and perceptual metrics tailored to percussive audio.

2605.10251 2026-05-12 cs.CV

Efficient Hybrid CNN-GNN Architecture for Monocular Depth Estimation

Ishan Narayan

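AI summary: This paper presents GraphDepth, a monocular depth estimation architecture that integrates graph neural networks (GNNs) into a convolutional encoder-decoder framework to model long-range spatial relationships that local convolutions struggle to capture. It embeds efficient GraphSAGE layers at multiple scales of a ResNet-101 U-Net backbone and combines them with channel-attention gated skip connections and a heteroscedastic uncertainty estimation head, improving accuracy and robustness. Experiments show that, compared with transformer-based hybrids, GraphDepth attains comparable global receptive fields at lower computational cost and performs competitively on several benchmarks.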

Details
English abstract

We present GraphDepth, a monocular depth estimation architecture that synergistically integrates Graph Neural Networks (GNNs) within a convolutional encoder-decoder framework. Our approach embeds efficient GraphSAGE layers at multiple scales of a ResNet-101 U-Net backbone, enabling explicit modeling of long-range spatial relationships that lie beyond the receptive field of local convolutions. Key technical contributions include: (1) batch-parallelized graph construction with configurable k-NN and grid-based adjacency for scalable training; (2) multi-scale GraphSAGE integration at bottleneck and decoder stages (1/32, 1/16, 1/8 resolution) to propagate global context throughout the feature hierarchy; (3) channel-attention gated skip connections that adaptively weight encoder features before fusion; and (4) heteroscedastic uncertainty estimation via a dedicated aleatoric uncertainty head, enabling confidence-aware loss weighting during optimization. Unlike transformer-based hybrids, which suffer from quadratic complexity in sequence length, GraphDepth scales linearly with spatial resolution while achieving comparable global receptive fields through iterative message passing. Experiments on NYU Depth V2, WHU Aerial, ETH3D, and Mid-Air benchmarks demonstrate competitive accuracy within 4.6\% of state-of-the-art transformers on indoor scenes with substantially lower computational cost (25 FPS vs 9 FPS, 3.8 GB vs 8.8 GB VRAM). GraphDepth achieves the best reported result on WHU Aerial (RMSE 8.24 m) and exhibits superior zero-shot cross-domain transfer to the Mid-Air synthetic aerial dataset, validating the generalization power of explicit relational reasoning for depth estimation.

2605.10247 2026-05-12 cs.LG

Teaching LLMs to See Graphs: Unifying Text and Structural Reasoning

Dario Vajda

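AI summary: This paper studies how to make large language models (LLMs) process graph-structured data more effectively and proposes the Graph Transformer Language Model (GTLM). By injecting graph-aware attention biases into the attention modules, it lets an LLM process graph structure natively while avoiding the semantic bottleneck of conventional pipelines that compress textual attributes into single tokens. GTLM is highly parameter-efficient, adding only 0.015% parameters relative to the base model, and a 1B-parameter GTLM matches or exceeds 7B-parameter state-of-the-art models on standard text-attributed graph benchmarks while surpassing baselines on GraphQA.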

Details
English abstract

Using Large Language Models (LLMs) to process graph-structured data is an active research area, yet current state-of-the-art approaches typically rely on multi-step pipelines with Graph Neural Network (GNN) encoders that compress rich textual attributes into solitary tokens, creating a significant semantic bottleneck. In this paper, we introduce the Graph Transformer Language Model (GTLM), a novel architecture that enables pretrained LLMs to natively process graph topologies while entirely eliminating this compressive bottleneck. GTLM is exceptionally parameter-efficient: by injecting graph-aware attention biases directly into the LLM's attention modules, it introduces only 0.015% additional parameters relative to the base model. We theoretically prove that our bidirectional attention prefix preserves node permutation equivariance while maintaining exact backward compatibility with the pretrained base model. Extensive evaluations demonstrate that a 1B-parameter GTLM matches or exceeds the performance of 7B-parameter state-of-the-art models on standard Text-Attributed Graph benchmarks, while significantly surpassing baselines on GraphQA. Finally, we demonstrate that GTLM attention heads implicitly learn to simulate message passing, explaining its superior performance on algorithmic tasks. This paradigm shift enables true algorithmic reasoning within LLMs and provides a scalable foundation for next-generation GraphRAG and relational deep learning.
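The general idea of an additive attention bias can be illustrated with a toy example. The sketch below adds a single scalar `edge_bias` to the attention logit of every connected node pair before the softmax; this parameterization, the adjacency-matrix input, and the function names are assumptions for illustration, since the abstract does not specify how GTLM computes its biases.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention_weights(scores, adjacency, edge_bias):
    """Add edge_bias to logit (i, j) when nodes i and j are connected, then softmax each row."""
    weights = []
    for i, row in enumerate(scores):
        biased = [
            s + (edge_bias if adjacency[i][j] else 0.0)
            for j, s in enumerate(row)
        ]
        weights.append(softmax(biased))
    return weights
```

With `edge_bias = 0.0` the output equals the plain softmax of `scores`, which is one simple way an additive bias can leave the behaviour of a pretrained model unchanged at initialization.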

2605.10242 2026-05-12 cs.LG cs.AI

When Normality Shifts: Risk-Aware Test-Time Adaptation for Unsupervised Tabular Anomaly Detection

Wei Huang, Hezhe Qiao, Kailai Zhang, Zaisheng Ye, Yu-Ming Shang, Xiangling Fu

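AI summary: This paper addresses the incomplete characterization of normal patterns caused by limited training data in unsupervised tabular anomaly detection and proposes RTTAD, a risk-aware test-time adaptation method. During training, collaborative dual-task learning establishes a robust normal prior; during testing, a test-time contrastive learning module updates the model with high-confidence pseudo-normal samples while constraining anomalous ones, which handles shifts in normality. Experiments on 15 tabular datasets show state-of-the-art overall detection performance.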

Comments 13 pages, 6 figures

Details
English abstract

Unsupervised tabular anomaly detection methods typically learn feature patterns from normal samples during training and subsequently identify samples that deviate from these patterns as anomalies during testing. However, in practical scenarios, the limited scale and diversity of training data often lead to an incomplete characterization of normal patterns. While test-time adaptation offers a remedy, its isolated focus on test-time optimization ignores the critical synergy with training-phase learning. Furthermore, indiscriminate adaptation to unlabeled test data inevitably triggers anomaly contamination, preventing the model from fully realizing its discriminative capability between normal and anomalous samples. To address these issues, we propose RTTAD, a Risk-aware Test-time adaptation method for unsupervised Tabular Anomaly Detection. RTTAD holistically tackles normality shifts via a synergistic two-stage mechanism. During training, collaborative dual-task learning captures multi-level representations to establish a robust normal prior. During testing, a Test-Time Contrastive Learning (TTCL) module explicitly accounts for adaptation risk by selectively updating the model using high-confidence pseudo-normal samples while constraining anomalous ones. Additionally, TTCL incorporates a k-nearest neighbor-based contrastive objective to refine embedding distributions, thereby further enhancing the model's discriminative capacity. Extensive experiments on 15 tabular datasets demonstrate that RTTAD achieves state-of-the-art overall detection performance.

2605.10241 2026-05-12 cs.CL cs.LG

Building Korean linguistic resource for NLU data generation of banking app CS dialog system

Jeongwoo Yoon, On-yu Park, Changhoe Hwang, Gwanghoon Yoo, Eric Laporte, Jeesun Nam

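AI summary: This paper builds Korean annotated training data for natural language understanding (NLU) in banking customer service dialog systems. It presents a linguistic resource named FIAD (Financial Annotated Dataset), identifies three linguistic patterns in Korean request utterances from a corpus of banking app reviews, and uses Local Grammar Graphs (LGGs) to generate annotated data covering diverse intents and entities. Models trained on FIAD-generated data reach high scores on intent and topic recognition, supporting the practicality of the resource.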

详情
Journal ref
29th International Conference on Computational Linguistics (COLING), Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning (Pan-DL), Oct 2022, Gyeongju, South Korea, pp.29-37
English abstract

Natural language understanding (NLU) is integral to task-oriented dialog systems, but demands a considerable amount of annotated training data to increase the coverage of diverse utterances. In this study, we report the construction of a linguistic resource named FIAD (Financial Annotated Dataset) and its use to generate Korean annotated training data for NLU in the banking customer service (CS) domain. By an empirical examination of a corpus of banking app reviews, we identified three linguistic patterns occurring in Korean request utterances: TOPIC (ENTITY, FEATURE), EVENT, and DISCOURSE MARKER. We represented them in LGGs (Local Grammar Graphs) to generate annotated data covering diverse intents and entities. To assess the practicality of the resource, we evaluate the performance of DIET-only (Intent: 0.91 / Topic [entity+feature]: 0.83), DIET+HANBERT (I: 0.94 / T: 0.85), DIET+KoBERT (I: 0.94 / T: 0.86), and DIET+KorBERT (I: 0.95 / T: 0.84) models trained on FIAD-generated data to extract various types of semantic items.