arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.12071 2026-05-13 cs.RO cs.SY eess.SY

Control of Fully Actuated Aerial Vehicles: A Comparison of Model-based and Sensor-based Dynamic Inversion

Ali Sidar Yilmaz, Buday Turan, Lukas Pries, Markus Ryll

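AI summary: This paper compares a model-based geometric nonlinear dynamic inversion controller (geometric NDI) with a sensor-based incremental dynamic inversion controller (INDI) on a fixed-tilt hexarotor. Several experiments evaluate both controllers under parameter mismatch, wind disturbances, sensor degradation, and other conditions. INDI shows clear advantages under parameter mismatch and sensor degradation, while geometric NDI tracks attitude better at reduced control frequencies. The work presents the first experimental validation of a full pose tracking INDI controller with decoupled translational and rotational dynamics, and highlights the trade-off between measurement-based and model-based dynamic inversion for robust control and rapid deployment.

Abstract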

Fully actuated multirotor platforms decouple translational force generation from vehicle attitude, enabling independent control of position and orientation and shifting performance limitations from attitude authority to actuator dynamics and control effectiveness. This paper compares a model-based nonlinear dynamic inversion controller (geometric NDI) with a sensor-based incremental dynamic inversion controller (INDI) on a fixed-tilt fully actuated hexarotor. Both controllers share an identical outer-loop structure and are both executed at 500 Hz; therefore, performance differences can be attributed primarily to the inversion strategy. Controller performance is evaluated in five experiments covering attitude step tracking under nominal conditions and under a 50% mismatch in the rotor force coefficient, hover disturbance rejection under an external lateral load, waypoint tracking in the presence of wind gust disturbances, reduced control frequency, and injected sensor degradation. The results show that INDI offers clear advantages under parameter mismatch, gust disturbances, and sensor degradation, and maintains lower position errors across the controller-frequency sweep. However, its advantages are not universal: geometric NDI yields better attitude tracking at reduced control frequencies. To the authors' best knowledge, this work presents the first experimental validation of a full pose tracking INDI controller with decoupled translational and rotational dynamics. These findings highlight the trade-off between measurement-based and model-based inversion for robust control and rapid deployment of fully actuated UAVs.
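As a rough illustration of the difference between the two inversion strategies, the toy sketch below (not the authors' controller) uses a scalar static plant `a = g_true * u + d`. The gain values, the disturbance `d`, and all function names are assumptions made for illustration; actuator dynamics, filtering, and delays, which matter on a real vehicle, are ignored.

```python
def ndi_step(nu, g_model):
    """Model-based inversion: solve g_model * u = nu for the input u."""
    return nu / g_model


def indi_step(nu, a_meas, u_prev, g_model):
    """Incremental inversion: correct the previous input using the
    measured acceleration error instead of a full model."""
    return u_prev + (nu - a_meas) / g_model


def plant(u, g_true, d):
    """Toy plant: true control effectiveness plus an unmodeled disturbance."""
    return g_true * u + d


# Assumed values: the model overestimates the effectiveness by 50%.
g_true, g_model, d = 1.0, 1.5, 0.3
nu = 2.0  # commanded (virtual) acceleration

# Model-based inversion applies its input once and inherits the model error.
a_ndi = plant(ndi_step(nu, g_model), g_true, d)

# Incremental inversion repeatedly corrects itself from the measurement.
u, a = 0.0, plant(0.0, g_true, d)
for _ in range(20):
    u = indi_step(nu, a, u, g_model)
    a = plant(u, g_true, d)
a_indi = a

print(f"NDI  tracking error: {abs(nu - a_ndi):.4f}")
print(f"INDI tracking error: {abs(nu - a_indi):.2e}")
```

In this toy setting the incremental error shrinks by a factor of `1 - g_true / g_model` per step, so it converges whenever that factor has magnitude below one, while the model-based input keeps a steady offset.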

2605.12069 2026-05-13 cs.CV cs.AI cs.LG

Anomaly-Aware Vision-Language Adapters for Zero-Shot Anomaly Detection

Muhammad Aqeel, Maham Nazir, Uzair Khan, Marco Cristani, Francesco Setti

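AI summary: This paper studies zero-shot anomaly detection, which requires no training on the target categories. Observing that existing methods underuse the asymmetry between normal and anomalous data distributions, it proposes AVA-DINO, an anomaly-aware vision-language adaptation framework. Two specialized branches handle normal and anomalous patterns, and a text-guided routing mechanism with explicit routing regularization drives branch specialization during training. At test time, the branches are combined dynamically using only the input image and predefined language descriptions, giving asymmetric activation. Experiments show state-of-the-art performance on multiple industrial and medical benchmarks and good cross-domain generalization.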

Comments Accepted to ICIP 2026

Abstract

Zero-shot anomaly detection aims to identify defects in unseen categories without target-specific training. Existing methods usually apply the same feature transformation to all samples, treating normal and anomalous data uniformly despite their fundamentally asymmetric distributions, compact normals versus diverse anomalies. We instead exploit this natural asymmetry by proposing AVA-DINO, an anomaly-aware vision-language adaptation framework with dual specialized branches for normal and anomalous patterns that adapt frozen DINOv3 visual features. During training on auxiliary data, the two branches are learned jointly with a text-guided routing mechanism and explicit routing regularization that encourages branch specialization. At test time, only the input image and fixed, predefined language descriptions are used to dynamically combine the two branches, enabling an asymmetric activation. This design prevents degenerate uniform routing and allows context-specific feature transformations. Experiments across nine industrial and medical benchmarks demonstrate state-of-the-art performance, achieving 93.5% image-AUROC on MVTec-AD and strong cross-domain generalization to medical imaging without domain-specific fine-tuning. https://github.com/aqeeelmirza/AVA-DINO

2605.12064 2026-05-13 cs.CV

TAR: Text Semantic Assisted Cross-modal Image Registration Framework for Optical and SAR Images

Zhuoyu Cai, Dou Quan, Ning Huyan, Pei He, Shuang Wang, Licheng Jiao

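AI summary: This paper proposes TAR, a text semantic-assisted cross-modal image registration framework for optical and synthetic aperture radar (SAR) images. By introducing text semantic priors from remote sensing scenes and land-cover types, the method alleviates the modality gap between optical and SAR images and strengthens cross-modal feature learning. TAR comprises three modules: multi-scale visual feature learning, text-assisted feature enhancement, and coarse-to-fine dense matching. Experiments show that it outperforms existing methods even under large deformations.

Abstract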

Existing deep learning-based methods can capture shared features from optical and synthetic aperture radar (SAR) images for spatial alignment. However, optical-SAR registration remains challenging under large geometric deformations, because the model needs to simultaneously handle cross-modal appearance discrepancies and complex spatial transformations. To address this issue, this paper proposes a text semantic-assisted cross-modal image registration framework, named TAR, for optical and SAR images. TAR exploits text semantic priors from remote sensing scenes and land-cover categories to alleviate the modality gap and enhance cross-modal feature learning. TAR consists of three components: a multi-scale visual feature learning (MSFL) module, a text-assisted feature enhancement (TAFE) module, and a coarse-to-fine dense matching (CFDM) module. MSFL extracts multi-scale visual features from optical and SAR images. TAFE constructs text descriptors related to remote sensing scenes and land-cover objects, and uses a frozen RemoteCLIP text encoder to extract text features. These text features are introduced through visual-text interaction to enhance high-level visual features for more reliable coarse matching. CFDM then establishes coarse correspondences based on the enhanced high-level features and refines the matched locations using low-level features. Experimental results on cross-modal remote sensing images demonstrate the effectiveness of TAR, which achieves stronger matching performance than several state-of-the-art methods and yields significant gains under large geometric deformations.

2605.12061 2026-05-13 cs.AI

SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory

Juntong Wang, Haoyue Zhao, Guanghui Pan, Xiyuan Wang, Yanbo Wang, Qiyan Deng, Muhan Zhang

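AI summary: This paper proposes SAGE, a self-evolving agentic graph-memory engine that targets the long-term memory bottleneck of language agents. SAGE models graph memory as a dynamic long-term memory substrate. It couples a memory writer that builds structured graph memory with a memory reader based on a graph foundation model, refines the memory structure incrementally from interaction histories, and self-evolves through a feedback mechanism. Experiments show that SAGE markedly improves evidence recovery, answer grounding, and retrieval efficiency on multi-hop QA, open-domain retrieval, and long-term memory evaluation.

Abstract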

Long-term memory is becoming a central bottleneck for language agents. Existing RAG and GraphRAG systems largely treat memory graphs as static retrieval middleware, which limits their ability to recover complete evidence chains from partial cues, exploit reusable graph-structural roles, and improve the memory itself through downstream feedback. We introduce SAGE, a Self-evolving Agentic Graph-memory Engine that models graph memory as a dynamic long-term memory substrate. SAGE couples two roles: a memory writer that incrementally constructs structured graph memory from interaction histories, and a Graph Foundation Model-based memory reader that performs retrieval and provides feedback to the memory writer. We provide rigorous theoretical analyses supporting the framework. Across multi-hop QA, open-domain retrieval, domain-specific review QA, and long-term agent-memory benchmarks, SAGE improves evidence recovery, answer grounding, and retrieval efficiency: after two self-evolution rounds, it achieves the best average rank on multi-hop QA; in zero-shot open-domain transfer, it reaches 82.5/91.6 Recall@2/5 on NQ. Further results on LongMemEval and HaluMem show that training and reader-writer feedback improve multiple long-term memory and hallucination-diagnostic metrics, suggesting that self-evolving, structure-aware graph memory is a promising foundation for robust long-horizon language agents.

2605.12056 2026-05-13 cs.AI

OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models

Yuchen Deng, Zidang Cai, Hai-Tao Zheng, Jie Wang, Feidiao Yang, Yuxing Han

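AI summary: OmniRefine is a training-free two-stage compression framework for efficient omnimodal large language models, aimed at the high inference cost of long video and dense audio sequences. Through cross-modally aligned chunk refinement and modality-aware cooperative compression, it preserves critical information while reducing redundancy, improving inference efficiency while maintaining model performance. Experiments show a better efficiency-performance trade-off than existing methods on multiple tasks, with stable performance even at lower compression ratios.

Abstract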

Omnimodal large language models (Omni-LLMs) show strong capability in audio-video understanding, but their practical deployment remains limited by high inference cost of long video streams and dense audio sequences. Despite recent progress, existing compression methods for Omni-LLMs typically rely on fixed or native compression units, which can disrupt cross-modal correspondence and the complementary information required for audio-video reasoning, making it difficult to improve inference efficiency while stably preserving performance. To address this, we propose OmniRefine, a training-free two-stage framework for efficient audio-visual token compression in Omni-LLMs. First, Correspondence-Preserving Chunk Refinement refines native chunk boundaries into cross-modally aligned compression units through frame-audio similarity and dynamic programming. Second, Modality-Aware Cooperative Compression jointly compresses video and audio tokens within each refined unit to reduce redundancy while preserving critical evidence. Extensive experiments show that OmniRefine achieves a better efficiency-performance trade-off than strong baselines and maintains stable performance under lower compression ratios. On WorldSense, it still reaches 46.7% accuracy at a 44% token retention ratio, nearly matching the full-token baseline. The code and interface will be released to facilitate further research.

2605.12051 2026-05-13 cs.LG

Learning plug-in surrogate endpoints for randomized experiments

Alessandro-Umberto Margueritte, Ahmet Zahid Balcıoğlu, Jesse Krijthe, Dave Zachariah, Fredrik D. Johansson

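AI summary: In randomized experiments, short-term surrogate endpoints are often used to assess intervention effects when long-term outcomes are hard to observe. This paper studies plug-in composite surrogate endpoints that can be substituted directly for the primary outcome, proposes two methods that maximize their predictiveness of the true effect, and analyzes when surrogates yielding unbiased effect estimates can be found in representative scenarios. Experiments show that the method based on directly modeling the surrogate effect produces more predictive plug-in endpoints than existing methods.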

Comments 29 pages, 5 figures

Abstract

Surrogate endpoints are used in place of long-term outcomes in randomized experiments when observing the real outcome for a large enough cohort is prohibitively expensive or impractical. A short-term surrogate is good if the result of an experiment using the surrogate is predictive of the result of a hypothetical study using the real outcome. Much attention has been paid to formalizing this property in causal terms, but most criteria are unidentifiable and cannot be turned into practical algorithms for learning surrogate endpoints from data. To address this, we study plug-in composite surrogates, functions of post-treatment variables that may be substituted directly for the primary outcome in a randomized experiment. We propose two methods for learning plug-in surrogates that maximize effect predictiveness, and characterize the possibility of finding endpoints that yield unbiased effect estimates in representative scenarios. Finally, in both synthetic experiments with known effects and in data from a real-world experiment, we find that our method, based on directly modeling the surrogate effect, returns plug-in endpoints more predictive of the primary effect than established methods.

2605.12049 2026-05-13 cs.LG cs.AI cs.IT cs.NE math.IT

Scaling Laws and Tradeoffs in Recurrent Networks of Expressive Neurons

Aaron Spieler, Georg Martius, Anna Levina

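AI summary: This paper asks how a fixed parameter budget should be split among the number of units, per-unit complexity, and connectivity in a neural network. It introduces a recurrent architecture built from Expressive Leaky Memory (ELM) neurons that allows width, unit complexity, and connectivity to be adjusted independently and that trains stably across scales. Experiments show a non-trivial optimal trade-off under a fixed budget, with larger budgets favoring both more and more complex neurons. An information-theoretic model explains the mechanism behind this trade-off.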

Comments 25 pages, 21 figures, 3 tables, including derivations. Submitted for peer review

Abstract

Cortical neurons are complex, multi-timescale processors wired into recurrent circuits, shaped by long evolutionary pressure under stringent biological constraints. Mainstream machine learning, by contrast, predominantly builds models from extremely simple units, a default inherited from early neural-network theory. We treat this as a normative architectural question. How should one split a fixed parameter budget $P$ between the number of units $N$, per-unit effective complexity $k_e$, and per-unit connectivity $k_c$? What controls the optimal allocation? This calls for a model in which per-unit complexity can be tuned independently of width and connectivity. Accordingly, we introduce the ELM Network, whose recurrent layer is built from Expressive Leaky Memory (ELM) neurons, chosen to mirror functional components of cortical neurons. The architecture allows for individually adjusting $N$, $k_e$, and $k_c$ and trains stably across orders of magnitude in scale. We evaluate the model on two qualitatively different sequence benchmarks: the neuromorphic SHD-Adding task and Enwik8 character-level language modeling. Performance improves monotonically along each of the three axes individually. Under a fixed budget, a clear non-trivial optimum emerges in their tradeoff, and larger budgets favor both more and more complex neurons. A closed-form information-theoretic model captures these tradeoffs and attributes the diminishing returns at two ends to: per-neuron signal-to-noise saturation and across-neuron redundancy. A hyperparameter sweep spanning three orders of magnitude in trainable parameters traces a near-Pareto-frontier scaling law consistent with the framework. This suggests that the simple-unit default in ML is not obviously optimal once this tradeoff surface is probed, and offers a normative lens on cortex's reliance on complex spatio-temporal integrators.

2605.12047 2026-05-13 cs.CL

Is Child-Directed Language Optimized for Word Learning? A Computational Study of Verb Meaning Acquisition

Francesca Padovani, Jaap Jumelet, Yevgen Matusevych, Arianna Bisazza

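AI summary: This study examines whether child-directed language (CDL) is optimized for word learning, in particular for acquiring verb meanings. Comparing neural language models trained on CDL and on adult-directed language (ADL), it finds that models trained on CDL and spoken ADL learn more robustly when syntax is disrupted. Verb meanings are acquired before syntactic proficiency improves, and this asynchrony is most pronounced in spoken language, suggesting that CDL's advantage for verb learning may stem from properties of spoken language itself rather than from an optimization unique to CDL.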

Comments 8 pages

Abstract

Is child-directed language (CDL) optimized to support language learning, and which aspects of linguistic development does it facilitate? We investigate this question using neural language models trained on CDL versus adult-directed language (ADL). We selectively remove syntactic or lexical co-occurrence information from the model training data, and evaluate the impact of these manipulations on verb meaning acquisition. While disrupting syntax impairs learning across all datasets, models trained on CDL and spoken ADL show significantly higher resilience than those trained on written input. Tracking semantic and syntactic performance over training, we observe a semantic-first trajectory, with verb meanings emerging prior to robust syntactic proficiency, an asynchrony most pronounced in the spoken domain, especially CDL. These results suggest that the advantage for verb learning previously attributed to CDL may instead reflect broader properties of the spoken register, rather than a uniquely CDL-specific optimization.

2605.12039 2026-05-13 cs.CL

SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs

Xiaoyuan Li, Moxin Li, Keqin Bao, Yubo Ma, Wenjie Wang, Dayiheng Liu, Fuli Feng

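AI summary: SkillGraph strengthens reinforcement learning for agents through a dynamically evolving skill graph, addressing the difficulty existing skill libraries have with identifying dependencies and with maintenance in compositional tasks. Reusable skills are represented as nodes in a directed graph, with typed edges encoding prerequisite, enhancement, and co-occurrence relations, which supports multi-step decision making. Experiments show that SkillGraph performs strongly on several complex tasks and significantly outperforms conventional memory-augmented reinforcement learning methods.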

Comments Under Review

Abstract

Skill libraries enable large language model agents to reuse experience from past interactions, but most existing libraries store skills as isolated entries and retrieve them only by semantic similarity. This leads to two key challenges for compositional tasks. Firstly, an agent must identify not only relevant skills but also how they depend on and build upon each other. Secondly, it also makes library maintenance difficult, since the system lacks structural cues for deciding when skills should be merged, split, or removed. We propose SKILLGRAPH, a framework that represents reusable skills as nodes in a directed graph, with typed edges encoding prerequisite, enhancement, and co-occurrence relations. Given a new task, SKILLGRAPH retrieves not just individual skills, but an ordered skill subgraph that can guide multi-step decision making. The graph is continuously updated from agent trajectories and reinforcement learning feedback, allowing both the skill library and the agent policy to improve together. Experiments on ALFWorld, WebShop, and seven search-augmented QA tasks show that SKILLGRAPH achieves state-of-the-art performance against memory-augmented RL methods, with especially large gains on complex tasks that require composing multiple skills.
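To make the idea of an ordered skill subgraph concrete, here is a minimal sketch, not the authors' implementation, that stores typed edges and orders a target skill's transitive prerequisites with the standard-library `graphlib.TopologicalSorter`. The skill names, the edge list, and the `ordered_skills` helper are hypothetical, and the sketch ignores semantic retrieval and graph updates entirely.

```python
from graphlib import TopologicalSorter

# Hypothetical typed edges: (source skill, relation, target skill).
EDGES = [
    ("find_object", "prerequisite", "pick_up"),
    ("pick_up", "prerequisite", "heat_object"),
    ("open_microwave", "prerequisite", "heat_object"),
    ("pick_up", "co-occurrence", "open_microwave"),
]


def ordered_skills(edges, target):
    """Return the target skill and its transitive prerequisites,
    ordered so that every prerequisite precedes its dependents."""
    # Only 'prerequisite' edges constrain the execution order.
    prereqs = {}
    for src, rel, dst in edges:
        if rel == "prerequisite":
            prereqs.setdefault(dst, set()).add(src)

    # Walk backwards from the target to collect the needed subgraph.
    needed, stack = set(), [target]
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            stack.extend(prereqs.get(node, ()))

    # TopologicalSorter expects a mapping node -> set of predecessors.
    subgraph = {n: prereqs.get(n, set()) & needed for n in needed}
    return list(TopologicalSorter(subgraph).static_order())


plan = ordered_skills(EDGES, "heat_object")
print(plan)
```

The relative order of skills with no prerequisite relation between them (here `find_object` and `open_microwave`) is not fixed by this sketch.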

2605.12038 2026-05-13 cs.CV

OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation

Yiren Song, Xiyao Deng, Pei Yang, Yihan Wang, Mike Zheng Shou

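AI summary: OmniHumanoid is a streaming generation framework for cross-embodiment video generation that transfers motion from human to robot or from robot to robot. By separating transferable motion learning from embodiment-specific adaptation, it removes the factor entanglement and paired-data dependence of earlier methods and adapts to a new embodiment using only unpaired videos. The work also introduces a branch-isolated attention mechanism and builds a synthetic dataset covering multiple embodiments and scenes. Experiments show strong motion fidelity and embodiment consistency, with extension to new robots without retraining the shared motion model.

Abstract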

Cross-embodiment video generation aims to transfer motions across different humanoid embodiments, such as human-to-robot and robot-to-robot, enabling scalable data generation for embodied intelligence. A major challenge in this setting is that motion dynamics are partly transferable across embodiments, whereas appearance and morphology remain embodiment-specific. Existing approaches often entangle these factors, and many require paired data for every target embodiment, which limits scalability to new robots. We present OmniHumanoid, a framework that factorizes transferable motion learning and embodiment-specific adaptation. Our method learns a shared motion transfer model from motion-aligned paired videos spanning multiple embodiments, while adapting to a new embodiment using only unpaired videos through lightweight embodiment-specific adapters. To reduce interference between motion transfer and embodiment adaptation, we further introduce a branch-isolated attention design that separates motion conditioning from embodiment-specific modulation. In addition, we construct a synthetic cross-embodiment dataset with motion-aligned paired videos rendered across diverse humanoid assets, scenes, and viewpoints. Experiments on both synthetic and real-world benchmarks show that OmniHumanoid achieves strong motion fidelity and embodiment consistency, while enabling scalable adaptation to unseen humanoid embodiments without retraining the shared motion model.

2605.12031 2026-05-13 cs.LG cs.CV

Resilient Vision-Tabular Multimodal Learning under Modality Missingness

Camillo Maria Caruso, Valerio Guarrasi, Paolo Soda

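AI summary: Targeting the modality missingness common in medical multimodal learning, this study proposes a joint vision-tabular learning framework that needs neither imputation nor heuristic switching. Unimodal representations are weighted by learnable modality tokens and fused at an intermediate stage through masked self-attention, which excludes missing modalities and features. A modality-dropout regularization strategy further improves robustness. Experiments show that the method outperforms existing baselines across missingness scenarios, with more stable performance and greater robustness.

Abstract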

Multimodal deep learning has shown strong potential in medical applications by integrating heterogeneous data sources such as medical images and structured clinical variables. However, most existing approaches implicitly assume complete modality availability, an assumption that rarely holds in real-world clinical settings where entire modalities and individual features are frequently missing. In this work, we propose a multimodal transformer framework for joint vision-tabular learning explicitly designed to operate under pervasive modality missingness, without relying on imputation or heuristic model switching. The architecture integrates three components: a vision, a tabular, and a multimodal fusion encoder. Unimodal representations are weighted through learnable modality tokens and fused via intermediate fusion with masked self-attention, which excludes missing tokens and modalities from information aggregation and gradient propagation. To further enhance resilience, we introduce a modality-dropout regularization strategy that stochastically removes available modalities during training, encouraging the model to exploit complementary information under partial data availability. We evaluate our approach on the MIMIC-CXR dataset paired with structured clinical data from MIMIC-IV for multilabel classification of 14 diagnostic findings with incomplete annotations. Two parallel systematic stress-test protocols progressively increase training and inference missingness in each modality separately, spanning fully multimodal to fully unimodal scenarios. Across all missingness regimes, the proposed method consistently outperforms representative baselines, showing smoother performance degradation and improved robustness. Ablation studies further demonstrate that attention-level masking and intermediate fusion with joint fine-tuning are key to resilient multimodal inference.
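The masking idea described above can be sketched in a few lines. This is an illustrative, single-query version under assumed inputs (a list of raw attention scores and a presence flag per token), not the paper's architecture; the function name is hypothetical. It shows only the forward pass: missing tokens receive exactly zero attention weight, so they contribute nothing to the aggregated representation.

```python
import math


def masked_attention_weights(scores, present):
    """Softmax over attention scores with missing tokens forced to weight 0.

    scores  -- raw attention scores for one query, one per token
    present -- booleans, False where the token (or modality) is missing
    """
    kept = [s for s, p in zip(scores, present) if p]
    if not kept:
        raise ValueError("at least one token must be present")
    # Subtract the maximum kept score for numerical stability.
    m = max(kept)
    exps = [math.exp(s - m) if p else 0.0 for s, p in zip(scores, present)]
    z = sum(exps)
    return [e / z for e in exps]


# The middle token has the largest score but is marked as missing.
weights = masked_attention_weights([1.0, 5.0, 1.0], [True, False, True])
print(weights)
```

In an autodiff framework the same effect is usually obtained by adding a large negative value to the masked scores before the softmax, which also keeps gradients from flowing through the excluded positions.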

2605.12028 2026-05-13 cs.CL cs.IR

Caraman at SemEval-2026 Task 8: Three-Stage Multi-Turn Retrieval with Query Rewriting, Hybrid Search, and Cross-Encoder Reranking

David-Maximilian Caraman, Gheorghe Cosmin Silaghi

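AI summary: This paper describes a system submitted to SemEval-2026 Task 8 (MTRAGEval). For multi-turn retrieval it proposes a three-stage approach: query rewriting with a LoRA-fine-tuned model, hybrid search combining BM25 and dense retrieval, and cross-encoder reranking. The system reaches an nDCG@5 of 0.531 across four English-language domains, ranking 8th and clearly outperforming the baseline system. The study also finds that tuning the query-generation temperature per domain improves performance, whereas other more complex strategies can degrade it.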

Comments Accepted at SemEval2026, task 8: MTRAGEval

Abstract

We describe our system for SemEval-2026 Task 8 (MTRAGEval), participating in Task A (Retrieval) across four English-language domains. Our approach employs a three-stage pipeline: (1) query rewriting via a LoRA-fine-tuned Qwen 2.5 7B model that transforms context-dependent follow-up questions into standalone queries, (2) hybrid BM25 and dense retrieval combined through Reciprocal Rank Fusion, and (3) cross-encoder reranking with BGE-reranker-v2-m3. On the official test set, the system achieves nDCG@5 of 0.531, ranking 8th out of 38 participating systems and 10.7% above the organizer baseline. Development comparisons reveal that domain-specific temperature tuning for query generation, where technical domains benefit from deterministic decoding and general domains from controlled randomness, provides consistent gains, while more complex strategies such as domain-aware prompting and multi-query expansion degrade performance.
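Stage (2) combines two ranked lists with Reciprocal Rank Fusion. The sketch below shows the usual form of that rule, where each document scores the sum of `1 / (k + rank)` over the lists in which it appears. The constant `k = 60` is a commonly used default and an assumption here, since the abstract does not state the value used; the document IDs and rankings are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs.

    Each document receives sum(1 / (k + rank)) over the lists that
    contain it, with ranks starting at 1. Ties are broken by ID so
    the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a lexical and a dense retriever.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['d1', 'd3', 'd2', 'd4']
```

Because the rule uses only ranks, it needs no calibration between BM25 scores and dense similarity scores, which is one reason it is a popular choice for hybrid search.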

2605.12027 2026-05-13 cs.CV

4DVGGT-D: 4D Visual Geometry Transformer with Improved Dynamic Depth Estimation

Ying Zang, Xuanyi Liu, Yidong Han, Deyi Ji, Chaotao Ding, Yuanqi Hu, Qi Zhu, Xuanfu Li, Jin Ma, Lingyun Sun, Tianrun Chen, Lanyun Zhu

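AI summary: This paper proposes 4DVGGT-D, a 4D visual geometry transformer for reconstructing dynamic 4D scenes from monocular video. At its core is a training-free progressive decoupling framework that separates dynamic from static elements to make depth estimation more stable and accurate. The method has three key modules: dynamic-mask-guided pose decoupling, topological subspace surgery, and information-theoretic confidence-aware fusion, which together improve the quality and robustness of 4D reconstruction.

Abstract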

Reconstructing dynamic 4D scenes from monocular videos is a fundamental yet challenging task. While recent 3D foundation models provide strong geometric priors, their performance significantly degrades in dynamic environments. This degradation stems from a fundamental tension: the inherent coupling of camera ego-motion and object motion within global attention mechanisms. In this paper, we propose a novel, training-free progressive decoupling framework that disentangles dynamics from statics in a principled, coarse-to-fine manner. Our core insight is to resolve the tension by first stabilizing the camera pose, followed by geometric refinement. Specifically, our approach consists of three synergistic components: (1) a Dynamic-Mask-Guided Pose Decoupling module that isolates pose estimation from dynamic interference, yielding a stable motion-free reference frame; (2) a Topological Subspace Surgery mechanism that orthogonally decomposes the depth manifold, safely preserving dynamic objects while injecting refined, mask-aware geometry into static regions; and (3) an Information-Theoretic Confidence-Aware Fusion strategy that formulates depth integration as a heteroscedastic Bayesian inference problem, adaptively blending multi-pass predictions via inverse-variance weighting. Extensive experiments on standard 4D reconstruction benchmarks demonstrate that our method achieves consistent and substantial improvements across principal point-cloud metrics. Notably, our approach shows competitive performance in robust 4D scene reconstruction without requiring fine-tuning, suggesting the potential of mathematically grounded dynamic-static disentanglement.
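The inverse-variance weighting mentioned in component (3) has a simple closed form for independent Gaussian estimates. The sketch below applies it to a single pixel with two depth predictions; the numbers and the function name are illustrative assumptions, and how the paper obtains its per-pass variances is not stated in the abstract.

```python
def inverse_variance_fuse(depths, variances):
    """Fuse independent depth estimates of the same pixel.

    Each estimate is weighted by 1 / variance, so more confident
    predictions dominate. Returns the fused depth and its variance.
    """
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fused = sum(w * d for w, d in zip(weights, depths)) / total
    return fused, 1.0 / total


# Two passes predict 2.0 m (variance 1.0) and 4.0 m (variance 3.0).
fused_depth, fused_var = inverse_variance_fuse([2.0, 4.0], [1.0, 3.0])
print(fused_depth, fused_var)
```

The fused value lies closer to the lower-variance estimate, and the fused variance is smaller than either input variance.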

2605.12026 2026-05-13 cs.CV cs.AI eess.SP

Spectral Vision Transformer for Efficient Tokenization with Limited Data

Alexandra G. Roberts, Maneesh John, Jinwei Zhang, Dominick Romano, Mert Sisman, Ki Sueng Choi, Heejong Kim, Mert R. Sabuncu, Thanh D. Nguyen, Alexey V. Dimov, Pascal Spincemaille, Brian H. Kopell, Yi Wang

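AI summary: This paper proposes a new spectral vision transformer architecture for efficient tokenization when data are limited, with a focus on medical imaging. The choice of spectral basis brings theoretical benefits such as spatial invariance and optimal signal-to-noise ratio, and the spectral projection reduces model complexity. Experiments show performance comparable or superior to a range of mainstream models with fewer parameters, across several types of datasets.

Abstract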

We propose a novel spectral vision transformer architecture for efficient tokenization with limited data, with an emphasis on medical imaging. We outline convenient theoretical properties arising from the choice of basis, including spatial invariance and optimal signal-to-noise ratio. We show reduced complexity arising from the spectral projection compared to spatial vision transformers. We show comparable or superior performance with a reduced number of parameters as compared to a variety of models including compact and standard vision transformers, convolutional neural networks with attention, shifted window transformers, multi-layer perceptrons, and logistic regression. We include simulated, public, and clinical data in our analysis and release our code at: github.com/agr78/spectralViT

2605.12025 2026-05-13 cs.LG stat.ML

Approximation Theory of Laplacian-Based Neural Operators for Reaction-Diffusion System

Takashi Furuya, Ryo Ozawa, Jenn-Nan Wang

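AI summary: This paper studies the approximation theory of Laplacian-based neural operators for nonlinear reaction-diffusion systems, taking a generalized Gierer-Meinhardt model as its example and analyzing the problem of learning the map from initial conditions to time-dependent solutions. Using the Laplacian spectral representation of the PDE's Green's function, the authors establish explicit approximation error bounds in terms of network depth, width, and spectral rank. They prove that the required parameter complexity grows at most polynomially with the target accuracy, alleviating the curse of parametric complexity encountered in generic operator learning. Numerical experiments support the theoretical results.

Abstract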

Neural operators provide a framework for learning solution operators of partial differential equations (PDEs), enabling efficient surrogate modeling for complex systems. While universal approximation results are now well understood, approximation analysis specific to nonlinear reaction-diffusion systems remains limited. In this paper, we study neural operators applied to the solution mapping from initial conditions to time-dependent solutions of a generalized Gierer-Meinhardt reaction-diffusion system, a prototypical model of nonlinear pattern formation. Our main results establish explicit approximation error bounds in terms of network depth, width, and spectral rank by exploiting the Laplacian spectral representation of the Green's function underlying the PDE. We show that the required parameter complexity grows at most polynomially with respect to the target accuracy, demonstrating that Laplacian eigenfunction-based neural operator architectures alleviate the curse of parametric complexity encountered in generic operator learning. Numerical experiments on the Gierer-Meinhardt system support the theoretical findings.

2605.12022 2026-05-13 cs.CL

SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation

Xiaoyuan Li, Yuzhe Wang, Moxin Li, Keqin Bao, Rui Men, Yichang Zhang, Dayiheng Liu, Wenjie Wang, Fuli Feng

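AI summary: This study proposes SAGE, a scalable automated robustness augmentation framework that makes knowledge evaluation benchmarks for large language models more robust. SAGE fine-tunes smaller models to generate and verify question variants efficiently: VariantGen generates variants, and VariantQual, trained on human-labeled data, verifies their quality. Experiments show that SAGE builds a large-scale robustness-augmented benchmark at far below the cost of human annotation, and that the fine-tuned models generalize to other benchmarks such as MMLU without benchmark-specific fine-tuning.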

Comments Under Review

Abstract

Large Language Models (LLMs) achieve strong performance on standard knowledge evaluation benchmarks, yet recent work shows that their knowledge capabilities remain brittle under question variants that test the same knowledge in different forms. Robustness augmentation of existing knowledge evaluation benchmarks is therefore necessary, but current LLM-assisted generate-then-verify pipelines are costly and difficult to scale due to low-yield variant generation and unreliable variant verification. We propose SAGE (Scalable Automated Generation of Robustness BEnchmarks), a framework for scalable robustness augmentation of knowledge evaluation benchmarks using fine-tuned smaller models. SAGE consists of VariantQual, a rubric-based verifier trained on human-labeled seed data, and VariantGen, a variant generator initialized with supervised fine-tuning and further optimized with reinforcement learning using VariantQual as the reward model. Experiments on HellaSwag show that SAGE constructs a large-scale robustness-augmented benchmark with quality comparable to the human-annotated HellaSwag-Pro at substantially lower cost, while the fine-tuned models further generalize to MMLU without benchmark-specific fine-tuning.

2605.12021 2026-05-13 cs.CV

What-Where Transformer: A Slot-Centric Visual Backbone for Concurrent Representation and Localization

Ryota Yoshihashi, Masahiro Kada, Satoshi Ikehata, Rei Kawakami, Ikuro Sato

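AI summary: This paper proposes the What-Where Transformer (WWT), a visual backbone that learns object appearance (what) and location (where) at the same time. Building on the inductive bias of what-where separation, it uses a multi-stream architecture that processes object representations and attention maps separately, yielding a decoupled representation of appearance and spatial location. Experiments show that WWT discovers multiple objects directly from raw attention maps without additional postprocessing and performs better on tasks such as zero-shot object discovery and weakly supervised semantic segmentation.

Abstract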

Many image understanding tasks involve identifying what is present and where it appears. However, tasks that address where, such as object discovery, detection, and segmentation, are often considerably more complex than image classification, which primarily focuses on what. One possible reason is that classification-oriented backbones tend to emphasize semantic information about what, while implicitly entangling or suppressing information about where. In this work, we focus on an inductive bias termed what-where separation, which encourages models to represent object appearance and spatial location in a decomposed manner. To incorporate this bias throughout an attentive backbone in the style of Vision Transformer (ViT), we propose the What-Where Transformer (WWT). Our method introduces two key novel designs: (1) it treats tokens as representations of what and attention maps as representations of where, and processes them in concurrent feed-forward modules via a multi-stream, slot-based architecture; (2) it reuses both the final-layer tokens and attention maps for downstream tasks, and directly exposes them to gradients derived from task losses, thereby facilitating more effective and explicit learning of localization. We demonstrate that even under standard single-label classification-based supervision on ImageNet, WWT exhibits emergent multiple object discovery directly from raw attention maps, rather than via additional postprocessing such as token clustering. Furthermore, WWT achieves superior performance compared to ViT-based methods on zero-shot object discovery and weakly supervised semantic segmentation, and it is transferable to various localization setups with minimal modifications. Code will be published after acceptance.

2605.12019 2026-05-13 cs.LG cs.AI

Efficient and Adaptive Human Activity Recognition via LLM Backbones

Aleksandr Bredikhin, Philippe Lalanda, German Vega

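AI summary: This paper proposes an efficient and adaptive approach to human activity recognition (HAR) built on large language models (LLMs), addressing the computational cost and limited domain adaptability of conventional methods. A pretrained LLM serves as a generic temporal feature extractor, and a structured convolutional projection maps sensor signals into the LLM's latent space, which greatly reduces the number of trainable parameters and the training cost while improving generalization. Experiments show strong results in low-data and few-shot settings, offering a scalable and efficient basis for HAR systems.

Abstract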

Human Activity Recognition (HAR) is a core task in pervasive computing systems, where models must operate under strict computational constraints while remaining robust to heterogeneous and evolving deployment conditions. Recent advances based on Transformer architectures have significantly improved recognition performance, but typically rely on task-specific models trained from scratch, resulting in high training cost, large data requirements, and limited adaptability to domain shifts. In this paper, we propose a paradigm shift that reuses large pretrained language models (LLMs) as generic temporal backbones for sensor-based HAR, instead of designing domain-specific Transformers. To bridge the modality gap between inertial time series and language models, we introduce a structured convolutional projection that maps multivariate accelerometer and gyroscope signals into the latent space of the LLM. The pretrained backbone is kept frozen and adapted using parameter-efficient Low-Rank Adaptation (LoRA), drastically reducing the number of trainable parameters and the overall training cost. Through extensive experiments on standard HAR benchmarks, we show that this approach enables rapid convergence, strong data efficiency, and robust cross-dataset transfer, particularly in low-data and few-shot settings. At the same time, our results highlight the complementary roles of convolutional frontends and LLMs, where local invariances are handled at the signal level while long-range temporal dependencies are captured by the pretrained backbone. Overall, this work demonstrates that LLMs can serve as a practical, frugal, and scalable foundation for adaptive HAR systems, opening new directions for reusing foundation models beyond their original language domain.
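To show why LoRA cuts the number of trainable parameters so sharply, the sketch below writes out the standard low-rank update `W + (alpha / r) * B @ A` with plain Python lists and counts the parameters that would be trained. The matrix sizes, rank, and helper names are assumptions for illustration; the abstract does not give the backbone dimensions or the rank used.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_weight(w, a, b, alpha, r):
    """Effective weight W + (alpha / r) * B @ A, where W stays frozen
    and only the low-rank factors A (r x d_in) and B (d_out x r) train."""
    delta = matmul(b, a)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d_out, d_in, r):
    """Parameters in the two low-rank factors A and B."""
    return r * (d_in + d_out)


# Tiny worked example with rank 1.
W = [[1, 0], [0, 1]]   # frozen 2 x 2 weight
B = [[1], [2]]         # 2 x 1 factor
A = [[3, 4]]           # 1 x 2 factor
W_eff = lora_weight(W, A, B, alpha=2, r=1)
print(W_eff)  # [[7.0, 8.0], [12.0, 17.0]]

# Assumed layer size: a 4096 x 4096 projection adapted at rank 8.
print(trainable_params(4096, 4096, 8), "vs", 4096 * 4096)
```

For the assumed 4096 x 4096 layer, the rank-8 factors hold 65,536 parameters against roughly 16.8 million in the full matrix.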

2605.12017 2026-05-13 cs.CV

FAME: Feature Activation Map Explanation on Image Classification and Face Recognition

Xinyi Zhang, Manuel Günther

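AI Summary This paper proposes FAME, a feature activation map explanation method for image classification and face recognition that aims to improve the interpretability of deep learning models. FAME combines the strengths of gradient-based feature-map methods and perturbation-based methods: it manipulates the input image in a gradient-driven way rather than with fixed patches, producing pixel-level attribution maps. Experiments show that the assumption behind CAM methods does not hold for deeper networks, and that FAME is competitive in both qualitative and quantitative evaluations.

Comments Accepted for CVPR Workshop 2026

Details
English abstract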

Deep Learning has revolutionized machine learning, reaching unprecedented levels of accuracy, but at the cost of reduced interpretability. Especially in image processing systems, deep networks transform local pixel information into more global concepts in a highly obscured manner. Explainable AI methods for image processing try to shed light on this issue by highlighting the regions of the image that are important for the prediction task. Among these, Class Activation Mapping (CAM) and its gradient-based variants compute attributions based on the feature map and upscale them to the image resolution, assuming that feature map locations are influenced only by underlying regions. Perturbation-based methods, such as CorrRISE, on the other hand, try to provide pixel-level attributions by perturbing the input with fixed patches and checking how the output of the network changes. In this work, we propose Feature Activation Map Explanation (FAME), which combines both worlds by using network gradients to compute changes to the input image, manipulating it in a gradient-driven way rather than using fixed patches. We apply this technique to two common tasks, image classification and face recognition, and show that CAM's above-mentioned assumption does not hold for deeper networks. We qualitatively and quantitatively show that FAME produces attribution maps that are competitive with state-of-the-art systems. Our code is available at https://github.com/AIML-IfI/fame.

2605.12016 2026-05-13 cs.AI

LLMs and the ZPD

Peter Wallis

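AI Summary This paper explores the relationship between large language models (LLMs) and Vygotsky's Zone of Proximal Development (ZPD), proposing that LLMs do not "think" with distributed representations but instead perform a practice-based "primitive thinking". It argues that LLM behavior is better described as dreaming than as hallucinating, and that interaction is core to human communication rather than an add-on to understanding, offering a new perspective on the cognitive mechanisms of LLMs.

Comments Short paper submitted to Interspeech 2026 (Desk Reject) 4 pages, plus references. 2 figures

Details
English abstract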

One hundred years ago Vygotsky and his circle were exploring the nature of consciousness and defining what would become psychology in the Soviet Union. They concluded that children develop "scientific thinking" through interacting with enculturated adults in Zones of Proximal Development or ZPDs. The proposal is that, contrary to the claims of some, the LLM mechanism is not doing thinking with "distributed representations," but rather the completion model is doing "primitive thinking" in terms of *practices*. Viewed from this perspective, it would seem our large language models don't hallucinate, but rather dream, and that what is needed is not "guard rails" but an investigation of the set of cognitive tools that enable us to do things that look like common-sense. The proposal here is that *interaction* is core to human communication rather than just an add-on to "real" understanding.

2605.12013 2026-05-13 cs.CV cs.AI

L2P: Unlocking Latent Potential for Pixel Generation

Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Jiawei Chen, Zhuoqi Zeng, Wei Zhang, Chengjie Wang, Jian Yang, Ying Tai

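AI Summary This paper proposes L2P, an efficient pixel-generation framework that addresses the prohibitive compute and data cost of training advanced pixel-space models from scratch. L2P directly reuses the knowledge of a pretrained latent diffusion model (LDM): it replaces the VAE with large-patch tokenization, freezes the LDM's intermediate layers, and trains only the shallow layers to learn the latent-to-pixel mapping. Training uses only LDM-generated synthetic images, with no real-data collection, which enables rapid convergence. The transfer requires only 8 GPUs, and removing the VAE enables native 4K ultra-high-resolution generation. Experiments show performance on par with the source LDM on DPG-Bench and 93% performance on GenEval.

Comments project page: https://nju-pcalab.github.io/projects/L2P/

Details
English abstract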

Pixel diffusion models have recently regained attention for visual generation. However, training advanced pixel-space models from scratch demands prohibitive computational and data resources. To address this, we propose the Latent-to-Pixel (L2P) transfer paradigm, an efficient framework that directly harnesses the rich knowledge of pre-trained LDMs to build powerful pixel-space models. Specifically, L2P discards the VAE in favor of large-patch tokenization and freezes the source LDM's intermediate layers, exclusively training shallow layers to learn the latent-to-pixel transformation. By utilizing LDM-generated synthetic images as the sole training corpus, L2P fits an already smooth data manifold, enabling rapid convergence with zero real-data collection. This strategy allows L2P to seamlessly migrate massive latent priors to the pixel space using only 8 GPUs. Furthermore, eliminating the VAE memory bottleneck unlocks native 4K ultra-high resolution generation. Extensive experiments across mainstream LDM architectures show that L2P incurs negligible training overhead, yet performs on par with the source LDM on DPG-Bench and reaches 93% performance on GenEval.
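
To make "large-patch tokenization" concrete, the following sketch splits an image into non-overlapping patches and flattens each patch into one token, as in standard ViT-style patchification. It is only an illustration under assumed shapes; the function name, patch size, and toy image are hypothetical and are not taken from L2P.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C), with patches
    ordered row by row. A larger `patch` yields fewer, wider tokens.
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be divisible by patch")
    gh, gw = h // patch, w // patch
    tokens = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return tokens.reshape(gh * gw, patch * patch * c)

# Toy 4x4 single-channel image with pixel values 0..15.
toy = np.arange(16).reshape(4, 4, 1)
tokens = patchify(toy, patch=2)
```

Doubling the patch side length cuts the token count by a factor of four, which is the usual motivation for large patches at high resolution.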

2605.12010 2026-05-13 cs.LG

Limits of Learning Linear Dynamics from Experiments

Aybüke Ulusarslan, Niki Kilbertus, Nora Schneider

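AI Summary This paper studies the limits of identifiability when learning linear dynamical systems from experimental data. The authors note that data-driven methods often implicitly assume identifiability, and that estimated models can yield spurious predictions when this assumption fails. Through a geometric analysis, the paper shows that the experimental setup (the initial state and control input) bounds the information recoverable from the observed trajectory, derives a closed-form description of all systems consistent with that setup, and proves that the dynamics on the subspace reachable by the experiment remain uniquely determined even when the full system is not identifiable.

Details
English abstract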

Learning governing dynamics from data is a common goal across the sciences, yet it is only well-posed when the underlying mechanisms are identifiable. In practice, many data-driven methods implicitly assume identifiability; when this assumption fails, estimated models can yield spurious predictions and invalid mechanistic conclusions. Classical identifiability guarantees for controlled linear time-invariant (LTI) systems provide sufficient conditions -- controllability and persistent excitation -- but leave open whether identifiability holds when these conditions fail, and which parts of the system remain identifiable without full identifiability. We show that the experimental setup, i.e., the realized initial state and control input, dictates a fundamental limit on the information recoverable from the observed trajectory. We develop a geometric characterization of this limit and derive a closed-form description of all systems consistent with the experimental setup. Crucially, we prove that even when the full system is not identifiable, the restricted dynamics on the subspace reachable by the experiment remain uniquely determined.
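
The last claim can be illustrated numerically in a deliberately simplified setting. The sketch below assumes an uncontrolled, noise-free, discrete-time system x_{t+1} = A x_t, which is a simplification and not necessarily the setting treated in the paper; the matrix, initial state, and function names are hypothetical. The initial state never excites the third mode, so the trajectory spans only a two-dimensional subspace: the least-squares estimate matches A on that subspace but not elsewhere.

```python
import numpy as np

def simulate(A, x0, steps):
    """Roll out x_{t+1} = A x_t and return the states as columns."""
    xs = [x0]
    for _ in range(steps):
        xs.append(A @ xs[-1])
    return np.stack(xs, axis=1)

def min_norm_estimate(X):
    """Minimum-norm least-squares fit of A from consecutive state pairs."""
    X_past, X_next = X[:, :-1], X[:, 1:]
    return X_next @ np.linalg.pinv(X_past)

A = np.diag([0.9, 0.5, -0.3])
x0 = np.array([1.0, 1.0, 0.0])      # third mode is never excited
X = simulate(A, x0, steps=6)
A_hat = min_norm_estimate(X)

reached_dim = np.linalg.matrix_rank(X)  # dimension of the explored subspace
v = np.array([2.0, -1.0, 0.0])          # a vector inside that subspace
```

Here `A_hat @ v` agrees with `A @ v` for any `v` in the explored subspace, while the entry of `A` that governs the unexcited mode cannot be recovered from this experiment.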

2605.12009 2026-05-13 cs.LG

Estimating Subgraph Importance with Structural Prior Domain Knowledge

Changhyun Kim, Seunghwan An, Jong-June Jeon

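AI Summary This paper proposes a subgraph importance estimation method for pretrained graph neural networks (GNNs) on graph-level tasks, formulated as a linear Group Lasso regression problem in the embedding space. The method exploits prior domain knowledge of graph substructures, does not depend on the form of the GNN's output layer or readout function, and requires no ground-truth target labels. Experiments on several real-world graph datasets show that it outperforms existing baselines, and the method is further extended to identify important nodes in the graph.

Details
English abstract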

We propose a subgraph importance estimation method for pretrained Graph Neural Networks (GNNs) on graph-level tasks, formulated as a linear Group Lasso regression problem in the embedding space. Our method effectively leverages prior domain knowledge of graph substructures, while remaining independent of the specific form of the output layer or readout function used in the GNN architecture, and it does not require access to ground-truth target labels. Experiments on real-world graph datasets demonstrate that our method consistently outperforms existing baselines in subgraph importance estimation. Furthermore, we extend our method to identify important nodes within the graph.
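
For readers unfamiliar with Group Lasso, its defining feature is that the penalty shrinks whole groups of coefficients to zero together. The sketch below shows the standard block soft-thresholding step (the proximal operator of the group penalty) on a toy coefficient vector. How the paper maps subgraphs to coefficient groups is not stated in the abstract, so the grouping, values, and function name here are purely hypothetical.

```python
import numpy as np

def group_soft_threshold(beta, groups, threshold):
    """Proximal operator of the group-lasso penalty.

    Each group is shrunk toward zero by `threshold` in Euclidean norm;
    groups whose norm is at most `threshold` are set to zero entirely.
    """
    beta = np.asarray(beta, dtype=float)
    out = np.zeros_like(beta)
    for idx in groups:
        norm = np.linalg.norm(beta[idx])
        if norm > threshold:
            out[idx] = (1.0 - threshold / norm) * beta[idx]
    return out

# Two hypothetical groups, e.g. coefficients tied to two candidate subgraphs.
beta = np.array([3.0, 4.0, 0.1, 0.1])
groups = [[0, 1], [2, 3]]
shrunk = group_soft_threshold(beta, groups, threshold=1.0)
```

The first group (norm 5) is scaled by 0.8, while the second group is removed entirely, which is the behaviour that lets a group-sparse fit rank groups as important or unimportant.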

2605.12006 2026-05-13 cs.CV

Robust Promptable Video Object Segmentation

Sohyun Lee, Yeho Gwon, Lukas Hoyer, Konrad Schindler, Christos Sakaridis, Suha Kwak

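AI Summary This paper studies the performance degradation of promptable video object segmentation (PVOS) models under input corruptions and presents the first comprehensive study of robust PVOS (RobustPVOS). The authors build a benchmark with 351 video clips and more than 2,500 object masks covering real-world adverse conditions, and generate synthetic training data with diverse, temporally varying corruptions. They propose MoGA, a new RobustPVOS method that uses object-specific representations held in memory to handle each object's degradation differently while keeping predictions temporally consistent. Experiments show significant improvements across diverse corruption types, establishing a strong baseline for future RobustPVOS research.

Comments Accepted to CVPR 2026

Details
English abstract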

The performance of promptable video object segmentation (PVOS) models substantially degrades under input corruptions, which prevents PVOS deployment in safety-critical domains. This paper offers the first comprehensive study on robust PVOS (RobustPVOS). We first construct a new, comprehensive benchmark with two real-world evaluation datasets of 351 video clips and more than 2,500 object masks under real-world adverse conditions. At the same time, we generate synthetic training data by applying diverse and temporally varying corruptions to existing VOS datasets. Moreover, we present a new RobustPVOS method, dubbed Memory-object-conditioned Gated-rank Adaptation (MoGA). The key to successfully performing RobustPVOS is two-fold: effectively handling object-specific degradation and ensuring temporal consistency in predictions. MoGA leverages object-specific representations maintained in memory across frames to condition the robustification process, which allows the model to handle each tracked object differently in a temporally consistent way. Extensive experiments on our benchmark validate MoGA's efficacy, showing consistent and significant improvements across diverse corruption types on both synthetic and real-world datasets, establishing a strong baseline for future RobustPVOS research. Our benchmark is publicly available at https://sohyun-l.github.io/RobustPVOS_project_page/.

2605.12004 2026-05-13 cs.CL

Learning Agentic Policy from Action Guidance

Yuxiang Ji, Zengbin Wang, Yong Wang, Shidong Yang, Ziyu Ma, Guanhua Chen, Zonghua Sun, Liaoni Wu, Xiangxiang Chu

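AI Summary This work addresses the limited exploration capability of large language models in agentic reinforcement learning and proposes ActGuide-RL, an RL method based on action guidance. It uses the abundant action data produced in everyday interactions as plan-style reference guidance to help the agent overcome reachability barriers to reward states, and uses mixed-policy training to internalize the exploration gains of the guided policy into the unguided policy. On search-agent benchmarks the method substantially outperforms zero RL and performs on par with the SFT+RL pipeline, suggesting a paradigm for agentic RL that relies less on heavy supervised data.

Comments Work in progress

Details
English abstract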

Agentic reinforcement learning (RL) for Large Language Models (LLMs) critically depends on the exploration capability of the base policy, as training signals emerge only within its in-capability region. For tasks where the base policy cannot reach reward states, additional training or external guidance is needed to recover effective learning signals. Rather than relying on costly iterative supervised fine-tuning (SFT), we exploit the abundant action data generated in everyday human interactions. We propose ActGuide-RL, which injects action data as plan-style reference guidance, enabling the agentic policy to overcome reachability barriers to reward states. Guided and unguided rollouts are then jointly optimized via mixed-policy training, internalizing the exploration gains back into the unguided policy. Motivated by a theoretical and empirical analysis of the benefit-risk trade-off, we adopt a minimal intervention principle that invokes guidance only as an adaptive fallback, matching task difficulty while minimizing off-policy risk. On search-agent benchmarks, ActGuide-RL substantially improves over zero RL (+10.7 pp on GAIA and +19 pp on XBench with Qwen3-4B), and performs on par with the SFT+RL pipeline without any cold start. This suggests a new paradigm for agentic RL that reduces the reliance on heavy SFT data by using scalable action guidance instead.

2605.12002 2026-05-13 cs.CV

EDGER: EDge-Guided with HEatmap Refinement for Generalizable Image Forgery Localization

Minh-Khoa Le-Phan, Minh-Hoang Le, Minh-Triet Tran, Trong-Le Do

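AI Summary This paper proposes EDGER, an image forgery localization method designed to meet the challenge posed by text-guided inpainting and to improve cross-domain detection. It uses a dual-branch framework that combines frequency-based edge detection with synthetic heatmapping, localizing manipulated regions at the pixel level and the patch level respectively, to achieve accurate and generalizable localization at native resolution. Evaluated in the setting of the MediaEval 2025 SynthIM challenge, EDGER shows strong cross-domain generalization and scales to high-resolution images.

Comments Accepted for publication in the Proceedings of the 14th International Symposium on Information and Communication Technology (SOICT 2025)

Details
English abstract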

Text-guided inpainting has made image forgery increasingly realistic, challenging both synthetic image detection (SID) and image forgery localization (IFL). However, existing methods often struggle to point out suspicious signals across domains. To address this problem, we propose EDGER, a patch-based, dual-branch framework that localizes manipulated regions in arbitrary-resolution images without sacrificing native resolution. The first branch, Edge-Guided Segmentation, introduces a Frequency-based Edge Detector to emphasize high-frequency inconsistencies at manipulation boundaries, and fine-tunes a SegFormer to fuse RGB and edge features for pixel-level masks. Since edge evidence is informative only when patches contain both authentic and manipulated pixels, we complement Edge-Guided Segmentation with a Synthetic Heatmapping branch, a classification-based localizer that fine-tunes a CLIP-ViT image encoder with LoRA to flag fully synthetic patches. Together, Synthetic Heatmapping provides coarse, patch-level synthetic priors, while Edge-Guided Segmentation sharpens boundaries within partially manipulated patches, yielding comprehensive localization. Evaluated in the setting of the Manipulated Region Localization task of the MediaEval 2025 SynthIM challenge, our approach scales to multi-megapixel imagery and exhibits strong cross-domain generalization. Extensive ablations highlight the complementary roles of frequency-based edge cues and patch-level synthetic priors in driving accurate, resolution-agnostic localization.

2605.11996 2026-05-13 cs.AI

BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts

Xiaoting Lyu, Yufei Han, Hangwei Qian, Haoyuan Yu, Xiang Ao, Bin Wang, Chenxu Wang, Xiaobo Ma, Wei Wang

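AI Summary This paper studies backdoor attacks on knowledge graph-enhanced large language models (KG-LLMs), specifically architectures that encode knowledge graphs into soft prompts via graph neural networks. This design introduces a graph-conditioned channel against which existing text-channel backdoor attacks become largely ineffective. The authors propose BadSKP, which manipulates graph representations through a multi-stage optimization strategy to steer the soft prompt toward adversarial semantics. Experiments show that BadSKP achieves high attack success across settings, while text-only attacks remain unreliable.

Details
English abstract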

Recent knowledge graph (KG)-enhanced large language models (LLMs) move beyond purely textual knowledge augmentation by encoding retrieved subgraphs into continuous soft prompts via graph neural networks, introducing a graph-conditioned channel that operates alongside the standard text interface. However, existing backdoor attacks are largely designed for the textual channel, and their effectiveness against this dual-channel architecture remains unclear. We show that this architecture creates a robustness gap: text-channel backdoor attacks that readily compromise textual KG prompting systems become largely ineffective against soft-prompt-based counterparts. We interpret this gap through semantic anchoring, whereby graph-derived soft prompts bias the generation-driving hidden state toward query-consistent semantics and suppress surface-level malicious instructions. Because this anchoring effect is itself induced by the graph channel, an attacker who manipulates graph-level representations can in turn redirect it toward adversarial semantics. To demonstrate this risk, we propose BadSKP, a backdoor attack that targets the graph-to-prompt interface through a multi-stage optimization strategy: it constructs adversarial target embeddings, optimizes poisoned node embeddings to steer the induced soft prompt, and approximates the optimized representations with fluent adversarial node attributes. Experiments on two soft-prompt KG-enhanced LLMs across four datasets show that BadSKP achieves high attack success under both frozen and trojaned settings, while text-only attacks remain unreliable even under perplexity-based defenses.

2605.11993 2026-05-13 cs.CL

Towards Visually-Guided Movie Subtitle Translation for Indic Languages

Tarun Chintada, Kshetrimayum Boynao Singh, Asif Ekbal

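AI Summary This study examines movie subtitle translation for low-resource Indic languages such as Hindi, noting that text-only systems often miss the emotion, action, and social context carried by visual information. It compares two lightweight visual grounding strategies and finds that selectively enhancing only the lowest-quality subtitle segments improves translation quality while greatly reducing visual processing. Of the two strategies, coarse attribute-based visual context summarization is the more robust at capturing scene-level emotion and subtle context.

Details
English abstract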

Movie subtitle translation is inherently multimodal, yet text-only systems often miss visual cues needed to convey emotion, action, and social nuance, especially for low-resource Indic languages (English to Hindi, Bengali, Telugu, Tamil and Kannada). We present a case study on five full-length films and compare two lightweight visual grounding strategies: structured attribute summaries from a 5-minute sliding window and free-text summaries of inter-subtitle visual gaps. Our analysis shows that temporal misalignment between subtitles and frames is a major obstacle in long-form video, often rendering indiscriminate visual grounding ineffective. However, oracle selective grounding, which replaces only the lowest-quality 20-30% of baseline segments with visually enhanced outputs, consistently improves COMET over the text-only baseline while requiring far less visual processing. Of the two approaches, coarse attribute-based visual context summarization is more robust, capturing scene-level emotion and subtle contextual cues that text alone often misses.
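
The selective grounding step can be sketched as a simple merge over per-segment quality scores. This is a minimal illustration assuming that a score is already available for every baseline segment (an oracle setting, as in the abstract); the function name, the score values, and the 25% fraction are hypothetical choices for the example.

```python
def selective_grounding(baseline, enhanced, scores, fraction=0.25):
    """Replace the lowest-scoring fraction of baseline segments.

    baseline, enhanced: translations per subtitle segment, aligned by index.
    scores: quality score of each baseline segment (higher is better).
    """
    n_replace = int(len(scores) * fraction)
    worst = sorted(range(len(scores)), key=lambda i: scores[i])[:n_replace]
    merged = list(baseline)
    for i in worst:
        merged[i] = enhanced[i]
    return merged

baseline = ["b0", "b1", "b2", "b3"]
enhanced = ["e0", "e1", "e2", "e3"]
scores = [0.9, 0.2, 0.7, 0.8]
merged = selective_grounding(baseline, enhanced, scores, fraction=0.25)
```

Only the segments selected for replacement need visual processing, which is where the reduction in visual workload comes from.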

2605.11987 2026-05-13 cs.AI cs.LG stat.AP stat.ML

Random-Set Graph Neural Networks

Tommy Woodley, Shireen Kudukkil Manchingal, Matteo Tolloso, Davide Bacciu, Fabio Cuzzolin

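AI Summary This paper proposes Random-Set Graph Neural Networks (RS-GNN), a new GNN framework for quantifying node-level uncertainty. It models node-level epistemic uncertainty in the belief-function formalism and outputs both a precise probability prediction and an uncertainty measure. Experiments on several real-world graph learning datasets show superior uncertainty quantification.

Comments 23 pages, 6 figures

Details
English abstract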

Uncertainty quantification has become an important factor in understanding the data representations produced by Graph Neural Networks (GNNs). Although their predictive capabilities are useful across industrial settings, the inherent uncertainty induced by the nature of the data is a major limiting factor for GNN performance. While aleatoric uncertainty is the result of noisy and incomplete stochastic data such as missing edges or over-smoothing, epistemic uncertainty arises from lack of knowledge about a system or model (e.g., a graph's topology or node feature representation), which can be reduced by gathering more data and information. In this paper, we propose a new framework in which node-level epistemic uncertainty is modelled in a belief function (finite random set) formalism. The resulting Random-Set Graph Neural Networks (RS-GNNs) have a belief-function head predicting a random set over the list of classes, from which both a precise probability prediction and a measure of epistemic uncertainty can be obtained. Extensive experiments on 9 different graph learning datasets, including real-world autonomous driving benchmarks such as nuScenes and ROAD, demonstrate RS-GNN's superior uncertainty quantification capabilities.
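
One common way to obtain a precise probability and an uncertainty measure from a belief function is the pignistic transform together with the belief-plausibility gap. The sketch below shows these standard constructions on a toy mass function; whether RS-GNN uses exactly these quantities is an assumption, and the class names and mass values are made up for illustration.

```python
def pignistic(mass):
    """Pignistic probability: share each focal set's mass equally among its classes."""
    classes = set().union(*mass)
    betp = {c: 0.0 for c in classes}
    for focal_set, m in mass.items():
        for c in focal_set:
            betp[c] += m / len(focal_set)
    return betp

def belief(mass, event):
    """Total mass of focal sets contained in the event (lower bound)."""
    return sum(m for s, m in mass.items() if s <= event)

def plausibility(mass, event):
    """Total mass of focal sets intersecting the event (upper bound)."""
    return sum(m for s, m in mass.items() if s & event)

# Hypothetical mass assignment predicted for one node over three classes.
mass = {
    frozenset({"car"}): 0.5,
    frozenset({"car", "bus"}): 0.3,
    frozenset({"car", "bus", "bike"}): 0.2,
}
betp = pignistic(mass)
car = frozenset({"car"})
gap = plausibility(mass, car) - belief(mass, car)  # width of the probability interval
```

Mass placed on larger sets widens the gap between belief and plausibility, so the gap can serve as a simple indicator of epistemic uncertainty for a class.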

2605.11986 2026-05-13 cs.AI

On the Limitations of Large Language Models for Conceptual Database Modeling

Arthur F. Siqueira, Carlos D. S. Nogueira, Eduarda Farias, Claudio E. C. Campelo, Júlia Menezes

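AI Summary This article analyzes large language models (LLMs) as support for the conceptual modeling of relational databases, specifically the automatic generation of Entity-Relationship (ER) diagrams from natural language requirements. It combines different language models with prompt engineering techniques to evaluate how consistently they identify entities, relationships, and attributes. The results indicate that LLMs perform reasonably in simple scenarios but become less reliable as requirement complexity grows, with more inconsistencies, ambiguities, and failures in representing constraints. In their current state they are not mature enough for complex scenarios, and the cost of validation may offset the apparent productivity gains.

Details
English abstract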

This article analyzes the use of Large Language Models (LLMs) as support for the conceptual modeling of relational databases through the automatic generation of Entity-Relationship (ER) diagrams from natural language requirements. The approach combines different language models with prompt engineering techniques to evaluate their ability to identify entities, relationships, and attributes in a conceptually consistent manner. The experimental evaluation involved three LLMs, each subjected to three prompting techniques (Zero-Shot, Chain of Thought, and Chain of Thought + Verifier), applied to the same requirements scenario with progressively increasing complexity. The generated diagrams were qualitatively analyzed through direct comparison with the textual requirements, considering the structural and semantic adherence of the modeled elements. The results indicate that, although LLMs show reasonable performance in less complex scenarios, their reliability decreases as the complexity of the requirements increases, with a rise in inconsistencies, ambiguities, and failures in representing constraints. These findings reinforce that, in their current state, LLMs are not sufficiently mature for reliable use in complex scenarios, and the cost of validation may offset the apparent productivity gains.