arXivDaily arXiv daily academic digest, updated Monday through Friday
2605.12332 2026-05-13 cs.AI

Towards Automated Air Traffic Safety Assessment Around Non-Towered Airports Using Large Language Models

Torsten Darrell, Mahyar Ghazanfari, Jordan Kam, Alexandre Bayen, Amin Tabrizian, Peng Wei

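AI summary This paper studies a framework for post-flight safety analysis at non-towered airports using large language models (LLMs), with the aim of improving air traffic safety assessment at such airports. The study combines CTAF communication transcripts, weather data, ADS-B flight trajectories, and Visual Flight Rules charts, and proposes a general vision-language model approach for identifying potential safety hazards. A real-world case study and an evaluation on a synthetic dataset support the method's effectiveness at identifying hazardous situations such as right-of-way violations, offering a feasible technical path toward automated safety assessment.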

Comments 25 pages, 17 figures, 5 tables, Accepted to AIAA 2026

Abstract

We investigate frameworks for post-flight safety analysis at non-towered airports using large language models (LLMs). Non-towered airports rely on the Common Traffic Advisory Frequency (CTAF) for air traffic coordination and experience frequent near mid-air collisions due to the pilot self-announcement communication protocol. We propose a general vision-language model (VLM) approach to analyze the transcribed CTAF radio communications in natural language, METeorological Aerodrome Report (METAR) weather data, Automatic Dependent Surveillance-Broadcast (ADS-B) flight trajectories, and Visual Flight Rules sectional charts of the airfield. We provide a preliminary study at Half Moon Bay Airport, with a qualitative real-world case study and a quantitative evaluation using a new synthetic dataset of communications and weather modalities. We qualitatively evaluate our framework on real flight data using Gemini 2.5 Pro, demonstrating accurate identification of a right-of-way violation. The synthetic dataset is derived from real examples and includes a 12-category hazard taxonomy, and is used to benchmark three open-source (Qwen 2.5-7B, Mistral-7B, Gemma-2-9B) and three closed-source (GPT-4o, GPT-5.4, Claude Sonnet 4.6) LLMs on the subset of inputs related to CTAF and METAR. Even limited to CTAF and METAR inputs and open-source LLMs, instances of our framework typically achieve a macro F1 score above 0.85 on a binary nominal/danger classification task. Future work includes a quantitative evaluation across all modalities and a larger number of real-world examples. Taken together, our results suggest that VLM analysis of safety at non-towered airports may be a valuable future capability.
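The macro F1 score quoted above is the unweighted mean of the per-class F1 scores, so the rarer class counts as much as the common one. The sketch below shows the computation for a binary nominal/danger task; the label strings and the five example predictions are made up for illustration and are not taken from the paper's dataset.

```python
def macro_f1(y_true, y_pred, labels=("nominal", "danger")):
    """Macro F1: unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels and predictions, for illustration only.
y_true = ["nominal", "nominal", "danger", "danger", "danger"]
y_pred = ["nominal", "danger", "danger", "danger", "nominal"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

Here the nominal class scores 0.5 and the danger class scores 2/3, so the macro average sits between them regardless of how many samples each class has.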

2605.12328 2026-05-13 cs.CL

A categorical error sensitivity index (ISEC): A preventive ordinal decision-support measure for irrecoverable errors in manual data entry systems

Ricardo Raúl Palma, Mauro Anibal Benetti, Fabricio Orlando Sanchez Varretti

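AI summary Manual data entry systems remain structurally vulnerable to categorical misclassification, especially in small and medium-sized enterprises, where semantic or morphological proximity between categories readily leads to irreversible errors that distort key performance indicators and mislead managerial decisions. This paper proposes a new Categorical Error Sensitivity Index (ISEC), which combines semantic distance, morphological transformation cost, and empirical usage frequency into a unified preventive assessment framework for identifying error risk. ISEC uses a vector database architecture to substantially reduce computational cost and is validated on several heterogeneous datasets, giving small and medium-sized enterprises a scalable data governance tool.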

Comments 15 pages, 4 figures

Abstract

Data entry systems remain structurally vulnerable to categorical misclassifications, particularly in small and medium sized enterprises (SMEs). When nominal categories exhibit semantic or morphological proximity, human machine interaction may produce errors that are irrecoverable ex post. In the absence of automated input controls, manual data entry frequently generates irrecoverable categorical distortions that propagate into Key Performance Indicators (KPIs), thereby misleading managerial decision making. State of the art normalization tools typically evaluate semantic and morphological dimensions in isolation and rely heavily on standard dictionaries, rendering them ineffective for SME master data rich in custom SKUs, abbreviations, and domain-specific technical jargon. This paper introduces the Categorical Error Sensitivity Index (ISEC), an ordinal composite score designed to rank category pairs according to their structural susceptibility to confusion. ISEC integrates semantic distance (via word embeddings), custom weighted morphological transformation costs (through an adapted Damerau Levenshtein algorithm), and empirical frequency into a unified, mathematically robust preventive framework. By leveraging vector database architectures, ISEC reduces computational complexity, achieving approximately a 195x performance improvement over brute-force methods. Validated across three heterogeneous datasets: governmental judicial records, retail inventory, and a synthetic ISO coded metalworking catalog, ISEC provides a scalable and proactive data governance instrument that enables SMEs to detect latent structural risk embedded within their categorical data assets.
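To make the morphological component concrete, here is a minimal sketch of a weighted edit distance with adjacent transpositions. It implements the restricted Damerau-Levenshtein variant (optimal string alignment); the abstract does not describe how the authors adapted the algorithm, so the function name `osa_distance`, the three cost parameters, and the SKU strings are assumptions made for illustration.

```python
def osa_distance(a, b, sub_cost=1.0, indel_cost=1.0, swap_cost=1.0):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Costs are configurable so that, for example, a transposition of two
    adjacent characters can be made cheaper than two substitutions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i * indel_cost
    for j in range(cols):
        d[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub_cost
            d[i][j] = min(
                d[i - 1][j] + indel_cost,   # deletion
                d[i][j - 1] + indel_cost,   # insertion
                d[i - 1][j - 1] + cost,     # match or substitution
            )
            # Adjacent transposition, e.g. "23" <-> "32".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + swap_cost)
    return d[-1][-1]


# Hypothetical SKU codes that differ by one transposed digit pair.
print(osa_distance("SKU-1023", "SKU-1032"))
print(osa_distance("SKU-1023", "SKU-1032", swap_cost=0.5))
```

Lowering `swap_cost` makes category pairs that differ only by a typing slip look closer, which is the kind of pair a confusion-risk ranking would want to surface.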

2605.12327 2026-05-13 cs.LG

Grid Games: The Power of Multiple Grids for Quantizing Large Language Models

Vage Egiazarian, Erik Schultheis, Andrei Panferov, Earl Killian, Torsten Hoefler, Dan Alistarh

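AI summary This paper studies how using multiple 4-bit floating-point grids can improve the quantization of large language models, and proposes a strategy called power-of-two-grids (PO2) that lets each group of values select the better-suited grid. Theoretical results show that the approach can bring significant gains for small groups, while the advantage vanishes for very large groups. Across several instantiated grid families, adaptive grids outperform single-grid FP4 quantization in both post-training quantization and pre-training.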

Comments Preprint

Abstract

A major recent advance in quantization is given by microscaled 4-bit formats such as NVFP4 and MXFP4, quantizing values into small groups sharing a scale, assuming a fixed floating-point grid. In this paper, we study the following natural extension: assume that, for each group of values, we are free to select the "better" among two or more 4-bit grids marked by one or more bits in the scale value. We formalize the power-of-two-grids (PO2) problem, and provide theoretical results showing that practical small-group formats such as MXFP or NVFP can benefit significantly from PO2 grids, while the advantage vanishes for very large groups. On the practical side, we instantiate several grid families, including 1) PO2(NF4), which pairs the standard NF4 normal grid with a learned grid, 2) MPO2, a grid pair that is fully learned over real weights and activations, 3) PO2(Split87), an explicit-zero asymmetric grid and 4) SFP4, a TensorCore-implementable triple which pairs NVFP4 with two shifted variants. Results for post-training quantization of standard open models and pre-training of Llama-like models show that adaptive grids consistently improve accuracy vs single-grid FP4 under both weight-only and weight+activation. Source code is available at https://github.com/IST-DASLab/GridGames.
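The per-group grid choice can be sketched as follows: quantize the group against each candidate grid with a shared absmax scale, then keep whichever grid gives the lower squared error and record its index (one bit for two grids). `GRID_A` uses the FP4 E2M1 magnitudes; `GRID_B` is a hypothetical uniform grid chosen only for illustration and is not one of the paper's grids. The scale is kept as a plain float here, whereas real microscaled formats also quantize the scale.

```python
# FP4 (E2M1) representable magnitudes, mirrored to both signs.
FP4_E2M1 = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GRID_A = tuple(sorted({s * v for v in FP4_E2M1 for s in (-1.0, 1.0)}))
# Hypothetical second grid: 15 uniformly spaced points on [-6, 6].
GRID_B = tuple(6.0 * i / 7.0 for i in range(-7, 8))


def quantize_group(values, grid):
    """Round-to-nearest quantization of one group with an absmax scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / max(grid) if absmax > 0 else 1.0
    dequant = [scale * min(grid, key=lambda g: abs(v / scale - g)) for v in values]
    err = sum((v - q) ** 2 for v, q in zip(values, dequant))
    return dequant, err


def quantize_po2(values, grids=(GRID_A, GRID_B)):
    """Try every grid and keep the one with the lowest squared error."""
    results = [quantize_group(values, g) for g in grids]
    best = min(range(len(grids)), key=lambda i: results[i][1])
    return best, results[best][0], results[best][1]


grid_id, dequantized, error = quantize_po2([0.5, 1.0, 1.5, 6.0, -3.0])
print(grid_id, error)
```

A group whose values happen to sit on floating-point-style spacing selects `GRID_A`, while an evenly spread group selects `GRID_B`; with many values per group, neither grid fits well everywhere, which matches the intuition that the benefit shrinks for large groups.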

2605.12316 2026-05-13 cs.LG

Autoregressive Learning in Joint KL: Sharp Oracle Bounds and Lower Bounds

Yunbei Xu, Yuzhe Yuan, Ruohan Zhan

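AI summary This paper studies how sequence length affects approximation and estimation errors in autoregressive modeling and next-token prediction under model misspecification, measured by the joint KL divergence. By establishing matching upper and lower bounds, the authors give what they describe as the first complete characterization of long-horizon error behavior, with improved rates and optimality analysis relative to existing work. The study shows that joint KL admits a horizon-free approximation factor, and proves an Ω(H) lower bound on the estimation error that matches the upper bounds of computationally efficient algorithms, thereby aligning the training objective, the evaluation metric, and the approximation metric.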

Abstract

We study the fundamental and timely problem of learning long sequences in autoregressive modeling and next-token prediction under model misspecification, measured by the joint Kullback-Leibler (KL) divergence. Our goal is to characterize how the sequence horizon \(H\) affects both approximation and estimation errors in this joint-distribution, sequence-level regime. By establishing matching upper and lower bounds, we provide, to our knowledge, the first complete characterization of long-horizon error behavior under the natural joint KL objective, with improved rates and optimality justification relative to existing work. On the approximation side, we show that joint KL admits a horizon-free approximation factor, in sharp contrast to Hellinger-based analyses that exhibit an \(\Omega(H)\) dependence for computationally efficient methods; this isolates the choice of divergence as the source of approximation amplification. On the estimation side, we prove a fundamental information-theoretic lower bound of order \(\Omega(H)\) that holds for both decomposable policy classes and fully shared policies, matching the \(\widetilde O(H)\) upper bounds achieved by computationally efficient algorithms. Our analysis clarifies the landscape of recent autoregressive learning results by aligning the log-loss training objective, the sequence-level evaluation metric, and the approximation metric through a sharp joint-KL oracle theory. We further show that these joint-KL guarantees imply policy learning regret bounds at rates matching prior imitation learning literature.
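For orientation, the joint KL divergence between two autoregressive distributions over length-\(H\) sequences decomposes, by the standard chain rule, into a sum of expected per-step conditional divergences. The symbols \(P\) (data distribution), \(Q\) (model), and \(x_{<h}\) (the prefix before step \(h\)) are notation introduced here for illustration and are not taken from the paper.

```latex
\[
\mathrm{KL}\bigl(P \,\|\, Q\bigr)
  = \sum_{h=1}^{H} \mathbb{E}_{x_{<h} \sim P}
    \Bigl[ \mathrm{KL}\bigl( P(\cdot \mid x_{<h}) \,\|\, Q(\cdot \mid x_{<h}) \bigr) \Bigr]
\]
```

Because the expectation is taken over prefixes drawn from \(P\), the expected next-token log-loss differs from this sequence-level quantity only by a term that does not depend on \(Q\), which is consistent with the alignment between training objective and evaluation metric that the abstract describes.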

2605.12313 2026-05-13 cs.CL cs.IR

Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering

Rezarta Islamaj, Joey Chan, Robert Leaman, Jongmyung Jung, Hyeongsoon Hwang, Quoc-An Nguyen, Hoang-Quynh Le, Harikrishnan Gurushankar Saisudha, Ganesh Chandrasekar, Rustam R. Taktashov, Nadezhda Yu. Bizyukova, Sofia I. R. Conceição, Paulo R. C. Lopes, Reem Abdel Salam, Mary Adewunmi, Zhiyong Lu

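AI summary The MedHopQA shared task at BioCreative IX evaluates the reasoning ability of large language models on multi-hop medical question answering. It introduces a new dataset of 1,000 complex question-answer pairs, each requiring two-hop reasoning over information from two distinct Wikipedia pages, with particular emphasis on rare diseases. The task attracted 48 submissions from 13 teams, and the results show that systems using retrieval-augmented generation (RAG) and related strategies substantially outperform baseline models; the best system reached 89.30% on concept-level accuracy (MedCPT) and 87.30% on exact match (EM). The dataset is publicly available to support further work on multi-hop medical question answering.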

Abstract

Multi-hop question answering (QA) remains a significant challenge in the biomedical domain, requiring systems to integrate information across multiple sources to answer complex questions. To address this problem, the BioCreative IX MedHopQA shared task was designed to benchmark multi-hop reasoning in large language models (LLMs). We developed a novel dataset of 1,000 challenging QA pairs spanning diseases, genes, and chemicals, with particular emphasis on rare diseases. Each question was constructed to require two-hop reasoning through the integration of information from two distinct Wikipedia pages. The challenge attracted 48 submissions from 13 teams. Systems were evaluated using both surface string comparison and conceptual accuracy (MedCPT score). The results showed a substantial performance gap between baseline LLMs and enhanced systems. The top-ranked submission achieved an 89.30% F1 score on the MedCPT metric and an 87.30% exact match (EM) score, compared with 67.40% and 60.20%, respectively, for the zero-shot baseline. A central finding of the challenge was that retrieval-augmented generation (RAG) and related retrieval-based strategies were critical for strong performance. In addition, concept-level evaluation improved answer assessment when correct responses differed in surface form. The MedHopQA dataset is publicly available to support continued progress in this important area. Challenge materials: https://www.ncbi.nlm.nih.gov/research/bionlp/medhopqa; benchmark: https://www.codabench.org/competitions/7609/

2605.12312 2026-05-13 cs.LG cs.AI

Transferable Delay-Aware Reinforcement Learning via Implicit Causal Graph Modeling

Chenran Zhao, Dianxi Shi, Yaowen Zhang, Chunping Qiu, Shaowu Yang

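AI summary This paper studies how to improve policy transferability and adaptability in cross-task reinforcement learning with random delays. To address the temporal misalignment between actions and state feedback caused by delays, and the difficulty of reusing knowledge when task objectives change, the authors propose a transferable delay-aware reinforcement learning method based on implicit causal graph modeling. The method uses a field-node encoder to turn high-dimensional observations into latent states with node-level semantics, and uses message passing to learn dynamic causal dependencies among nodes, yielding transferable structured representations and knowledge of environment dynamics that improve the efficiency and performance of cross-task learning.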

Abstract

Random delays weaken the temporal correspondence between actions and subsequent state feedback, making it difficult for agents to identify the true propagation process of action effects. In cross-task scenarios, changes in task objectives and reward formulations further reduce the reusability of previously acquired task knowledge. To address this problem, this paper proposes a transferable delay-aware reinforcement learning method based on implicit causal graph modeling. The proposed method uses a field-node encoder to represent high-dimensional observations as latent states with node-level semantics, and employs a message-passing mechanism to characterize dynamic causal dependencies among nodes, thereby learning transferable structured representations and environment dynamics knowledge. On this basis, imagination-driven behavior learning and planning are incorporated to optimize policies in the latent space, enabling cross-task knowledge transfer and rapid adaptation. Experimental results show that the proposed method outperforms baseline methods on DMC continuous control tasks with random delays. Cross-task transfer experiments further demonstrate that the learned structured representations and dynamics knowledge can be effectively transferred to new tasks and significantly accelerate policy adaptation.

2605.12310 2026-05-13 cs.SD

Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling

Chen Geng, Meng Chen, Ruohua Zhou, Ruolan Liu, Weifeng Zhao

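AI summary This paper proposes Poly-SVC, a polyphony-aware singing voice conversion system that converts a source singer's voice into a target singer's voice while preserving lyrics and melody. The method addresses the residual harmonies left in vocals extracted from accompanied recordings. It combines a Constant-Q Transform-based pitch extractor, a random sampler, and a diffusion decoder based on conditional flow matching to fuse melody and harmony features into natural, expressive polyphonic output. Experiments show that Poly-SVC outperforms existing baseline models in naturalness, timbre similarity, and harmony reconstruction.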

Comments Accepted by ICASSP 2026

Abstract

Singing Voice Conversion (SVC) aims to transform a source singing voice into a target singer while preserving lyrics and melody. Most existing SVC methods depend on F0 extractors to capture the lead melody from clean vocals. However, no existing method can reliably extract clean vocals from accompanied recordings without leaving residual harmonies behind. In this paper, we innovatively propose Poly-SVC, a zero-shot, cross-lingual singing voice conversion system designed to process residual harmonies. Poly-SVC is composed of three key components: a Constant-Q Transform (CQT)-based pitch extractor to preserve both the lead melody and residual harmony, a random sampler to reduce interference information from the CQT and a diffusion decoder based on Conditional Flow Matching (CFM) that fuses pitch, content, and timbre features into natural-sounding polyphonic outputs. Experiments demonstrate that Poly-SVC surpasses the baseline models in naturalness, timbre similarity and harmony reconstruction across both harmony-rich and single-melody recordings.

2605.12308 2026-05-13 cs.LG

In-context learning to predict critical transitions in dynamical systems

Yunus Sevinchan, Juan Nathaniel, Kai Ueltzhöffer, Carla Roesch, Tobias Weber, Vaios Laschos, Hang Fan, Gregor Ramien, Johannes Haux, Pierre Gentine, Benjamin Herdeanu

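AI summary This study addresses early warning of critical transitions in dynamical systems, which are often abrupt and irreversible and for which real-world observations are scarce. The authors propose TipPFN, a deep learning framework based on in-context learning that is trained on a synthetic data generator and adapts flexibly to contexts of varying size, complexity, and dimensionality. The method shows state-of-the-art early detection on previously unseen tipping regimes, sim-to-real cases, and real-world observations, offering a new route toward reliable early warning systems.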

Comments 14+38 pages, 5+23 figures

Abstract

Critical transitions - abrupt, often irreversible changes in system dynamics - arise across human and natural systems, often with catastrophic consequences. Real-world observations of such shifts remain scarce, preventing the development of reliable early warning systems. Conventional statistical and spectral indicators, such as increasing variance, tend to fail under realistic conditions of limited data and correlated noise, whereas existing deep learning classifiers do not extrapolate beyond their training data distribution. In this work, we introduce TipPFN, an in-context learning (ICL) framework that uses a prior-data fitted network to infer a system's proximity to a critical transition. Trained on our novel synthetic data generator, which is based on canonical bifurcation scenarios coupled to diverse, randomized stochastic dynamics, TipPFN flexibly capitalizes on contexts of various sizes, complexity and dimensionalities. We demonstrate robust, state-of-the-art early detection of critical transitions in previously unseen tipping regimes, sim-to-real examples, and real-world observations in both ICL and zero-shot settings.

2605.12306 2026-05-13 cs.LG cs.AI cs.CV

KAN-CL: Per-Knot Importance Regularization for Continual Learning with Kolmogorov-Arnold Networks

Minjong Cheon

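AI summary This paper proposes KAN-CL, a continual learning framework aimed at catastrophic forgetting caused by parameter interference across tasks. The method exploits the compact-support spline parameterization of Kolmogorov-Arnold Networks (KANs) to apply importance-weighted anchoring at the level of individual spline knots, giving finer-grained parameter regularization. Experiments show that KAN-CL substantially reduces forgetting on the Split-CIFAR-10 and Split-CIFAR-100 benchmarks while maintaining high classification accuracy, and a Neural Tangent Kernel analysis further explains its theoretical advantage.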

Abstract

Catastrophic forgetting remains the central obstacle in continual learning (CL): parameters shared across tasks interfere with one another, and existing regularization methods such as EWC and SI apply uniform penalties without awareness of which input region a parameter serves. We propose KAN-CL, a continual learning framework that exploits the compact-support spline parameterization of Kolmogorov-Arnold Networks (KANs) to perform importance-weighted anchoring at per-knot granularity. Deployed as a classification head on a convolutional backbone with standard EWC regularization on the backbone (bbEWC), KAN-CL achieves forgetting reductions of 88% and 93% over a head-only KAN baseline on Split-CIFAR-10/5T and Split-CIFAR-100/10T, respectively, while matching or exceeding the accuracy of all baselines on both benchmarks. We further provide a Neural Tangent Kernel (NTK) analysis showing that KAN's spline locality induces a structural rank deficit in the cross-task NTK, yielding a forgetting bound that holds even in the feature-learning regime. These results establish that combining an architecture with natural parameter locality (KAN head) with a complementary backbone regularizer (bbEWC) yields a compositional and principled approach to catastrophic forgetting.

2605.12305 2026-05-13 cs.CV

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

Yabo Zhang, Kunchang Li, Dewei Zhou, Xinyu Huang, Xun Wang

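AI summary This study addresses the weak image generation performance of multimodal language models under complex interleaved instructions. It proposes INSET, a unified visual generation model that embeds images as native vocabulary within textual instructions so that descriptions are matched to visual targets more precisely. A scalable data engine generates a large number of high-quality interleaved samples, and the model shows better multi-image consistency and text alignment than existing methods, while also extending to applications such as multimodal image editing.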

Abstract

While recent advancements in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems from the structural separation of images and text in current paradigms, which forces models to bridge difficult long-range dependencies to match descriptions with visual targets. To address these challenges, we propose Images iN SEnTences (a.k.a. INSET), a unified generation model that seamlessly embeds images as native vocabulary within textual instructions. By positioning visual features directly at their corresponding semantic slots, INSET leverages the contextual locality of transformers for precise object binding, effectively treating images as dense, expressive language tokens. Furthermore, we introduce a scalable data engine that synthesizes 15M high-quality interleaved samples from standard image and video datasets, utilizing VLMs and LLMs to construct rich, long-horizon sequences. Evaluation results on InterleaveBench demonstrate that INSET significantly outperforms state-of-the-art methods in multi-image consistency and text alignment, with performance gaps widening as input complexity increases. Beyond standard generation, our approach inherently extends to multimodal image editing, integrating visual content as part of the instruction to facilitate highly expressive and creative visual manipulations.

2605.12301 2026-05-13 cs.LG math.ST stat.TH

Approximation of Maximally Monotone Operators: A Graph Convergence Perspective

Takashi Furuya, Yury Korolev, Takaharu Yaguchi

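AI summary This paper studies the approximation of maximally monotone operators, which have important applications in mathematics and machine learning, through graph convergence. Conventional uniform or $L^p$ approximation is limited for such operators, so the authors adopt graph convergence (Painlevé-Kuratowski convergence) as the approximation framework. They prove that any maximally monotone operator can be approximated in the sense of local graph convergence by encoder-decoder architectures, and construct structure-preserving approximations that retain maximal monotonicity. The result provides a new theoretical basis and method for operator learning with discontinuous or set-valued operators.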

Abstract

Operator learning has been highly successful for continuous mappings between infinite-dimensional spaces, such as PDE solution operators. However, many operators of interest, including differential operators, are discontinuous or set-valued, and lie outside classical approximation frameworks. We propose a paradigm shift by formulating approximation via graph convergence (Painlevé-Kuratowski convergence), which is well-suited for closed operators. We show that uniform and $L^p$ approximation are fundamentally inadequate in this setting. Focusing on maximally monotone operators, we prove that any such operator can be approximated in the sense of local graph convergence by continuous encoder-decoder architectures, and further construct structure-preserving approximations that retain maximal monotonicity via resolvent-based parameterizations.
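As background for the phrase "resolvent-based parameterizations", the standard Hilbert-space definition of the resolvent of an operator $A$ is given below. The abstract does not state the setting or the exact construction, so reading the phrase this way is an assumption, and the symbols $J_{\lambda A}$, $I$, and $\lambda$ are introduced here for illustration.

```latex
\[
J_{\lambda A} = (I + \lambda A)^{-1}, \qquad \lambda > 0,
\qquad\text{so that}\qquad
A = \tfrac{1}{\lambda}\bigl(J_{\lambda A}^{-1} - I\bigr).
\]
```

In a Hilbert space, $A$ is maximally monotone exactly when $J_{\lambda A}$ is single-valued, defined everywhere, and firmly nonexpansive (Minty's theorem). A possibly set-valued, discontinuous $A$ can therefore be represented through a well-behaved continuous map, which suggests why learning the resolvent could be a natural way to keep maximal monotonicity by construction.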

2605.12299 2026-05-13 cs.CL

GKnow: Measuring the Entanglement of Gender Bias and Factual Gender

Leonor Veloso, Hinrich Schütze

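AI summary This study proposes GKnow, a benchmark for assessing gender knowledge and gender bias in language models across different gender-related prediction tasks. It finds that gender bias and factual gender are heavily entangled at the level of both circuits and individual neurons, which makes debiasing methods such as neuron ablation unreliable. GKnow helps identify and analyze the model components responsible for gendered predictions, and reveals that existing gender bias benchmarks can mask a decline in factual gender knowledge.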

Comments Accepted to ACL 2026

Abstract

Recent works have analyzed the impact of individual components of neural networks on gendered predictions, often with a focus on mitigating gender bias. However, mechanistic interpretations of gender tend to (i) focus on a very specific gender-related task, such as gendered pronoun prediction, or (ii) fail to distinguish between the production of factually gendered outputs (the correct assumption of gender given a word that carries gender as a semantic property) and gender biased outputs (based on a stereotype). To address these issues, we curate GKnow, a benchmark to assess gender knowledge and gender bias in language models across different types of gender-related predictions. GKnow allows us to identify and analyze circuits and individual neurons responsible for gendered predictions. We test the impact of neuron ablation on benchmarks for disentangling stereotypical and factual gender (DiFair and the test set of GKnow), as well as StereoSet. Results show that gender bias and factual gender are severely entangled on the level of both circuits and neurons, entailing that ablation is an unreliable debiasing method. Furthermore, we show that benchmarks for evaluating gender bias can hide the decrease in factual gender knowledge that accompanies neuron ablation. We curate GKnow as a contribution to the continuous development of robust gender bias benchmarks.

2605.12297 2026-05-13 cs.CV cs.RO eess.IV

EgoEV-HandPose: Egocentric 3D Hand Pose Estimation and Gesture Recognition with Stereo Event Cameras

Luming Wang, Hao Shi, Jiajun Zhai, Kailun Yang, Kaiwei Wang

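AI summary This paper proposes EgoEV-HandPose, an end-to-end framework based on stereo event cameras for egocentric 3D bimanual pose estimation and gesture recognition. Its core module, KeypointBEV, lifts features into a unified bird's-eye-view space and applies an iterative reprojection-guided refinement loop to resolve depth uncertainty and enforce kinematic consistency. The study also introduces EgoEVHands, described as the first large-scale real-world stereo event-camera dataset for egocentric hand perception, and reports markedly better performance in low-light and bimanual occlusion scenarios, providing a new benchmark for event-based egocentric perception.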

Comments Extended version of SMC 2025 paper arXiv:2503.12419. The established dataset and source code will be publicly released at https://github.com/ZJUWang01/EgoEV-HandPose

Abstract

Egocentric 3D hand pose estimation and gesture recognition are essential for immersive augmented/virtual reality, human-computer interaction, and robotics. However, conventional frame-based cameras suffer from motion blur and limited dynamic range, while existing event-based methods are hindered by ego-motion interference, monocular depth ambiguity, and the lack of large-scale real-world stereo datasets. To overcome these limitations, we propose EgoEV-HandPose, an end-to-end framework for joint 3D bimanual pose estimation and gesture recognition from stereo event streams. Central to our approach is KeypointBEV, a flexible stereo fusion module that lifts features into a canonical bird's-eye-view space and employs an iterative reprojection-guided refinement loop to progressively resolve depth uncertainty and enforce kinematic consistency. In addition, we introduce EgoEVHands, the first large-scale real-world stereo event-camera dataset for egocentric hand perception, containing 5,419 annotated sequences with dense 3D/2D keypoints across 38 gesture classes under varying illumination. Extensive experiments demonstrate that EgoEV-HandPose achieves state-of-the-art performance with an MPJPE of 30.54mm and 86.87% Top-1 gesture recognition accuracy, significantly outperforming RGB-based stereo and prior event-camera methods, particularly in low-light and bimanual occlusion scenarios, thereby setting a new benchmark for event-based egocentric perception. The established dataset and source code will be publicly released at https://github.com/ZJUWang01/EgoEV-HandPose.
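The MPJPE figure reported above (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joints. A minimal sketch, assuming both poses are given as equal-length lists of (x, y, z) coordinates in millimetres; the two-joint example is made up and is not data from the paper.

```python
import math


def mpjpe(pred_joints, gt_joints):
    """Mean per-joint position error: average Euclidean distance over joints."""
    if len(pred_joints) != len(gt_joints) or not pred_joints:
        raise ValueError("pose lists must be non-empty and of equal length")
    dists = [math.dist(p, g) for p, g in zip(pred_joints, gt_joints)]
    return sum(dists) / len(dists)


# Hypothetical two-joint pose, coordinates in millimetres.
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(mpjpe(pred, gt))
```

Evaluation protocols differ in whether poses are aligned (for example, root-relative) before the distance is taken; the sketch applies no alignment.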

2605.12294 2026-05-13 cs.AI

Executable Agentic Memory for GUI Agent

Zerui Qin, Sheng Yue, Xingyuan Hua, Yongjian Fu, Ju Ren

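AI summary This paper proposes Executable Agentic Memory (EAM), a new method for improving the stability and efficiency of graphical user interface (GUI) agents on long-horizon tasks. EAM builds a structured knowledge graph that turns free-form planning into a retrieval-and-execution process, and constructs the memory efficiently using state-aware depth-first search and action-group mining. It also introduces a value-guided graph search in which a lightweight Q-function model steers Monte Carlo Tree Search, improving task success rate and cost efficiency while keeping planning efficient.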

Abstract

Modern GUI agents typically rely on a model-centric and step-wise interaction paradigm, where LLMs must re-interpret the UI and re-decide actions at every screen, which is fragile in long-horizon tasks. In this paper, we propose Executable Agentic Memory (EAM), a structured Knowledge Graph (KG) that shifts GUI planning from free-form generation to a robust retrieval-and-execution process. Our approach includes a sample-efficient memory construction pipeline using state-aware DFS and action-group mining to compress multi-step routines. To ensure efficient planning, we introduce a value-guided graph search where a lightweight Q-function model steers Monte Carlo Tree Search (MCTS) over the KG. We theoretically establish bias-consistency for the Q-model and derive sample complexity bounds for path recovery. Empirically, EAM outperforms state-of-the-art baselines like UI-TARS-7B by up to 19.6% on AndroidWorld, while reducing token costs 6× relative to GPT-4o. With a 2.8 s average latency, EAM enables reliable, quick, and long-horizon GUI automation.

2605.12292 2026-05-13 cs.LG

STRABLE: Benchmarking Tabular Machine Learning with Strings

Gioia Blayer, Myung Jun Kim, Félix Lefebvre, Lennart Purucker, Alan Arazi, Eilam Shapira, Roi Reichart, Frank Hutter, Marine Le Morvan, David Holzmüller, Gaël Varoquaux

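AI summary This paper presents STRABLE, a benchmark of 108 real-world tables for evaluating tabular machine learning methods on data containing both strings and numbers. It examines whether tables need models that handle strings directly or whether encoding strings as numbers suffices, and compares the approaches. Experiments show that on tables dominated by categorical variables, simple string embeddings combined with advanced tabular learners already perform well, whereas on tables dominated by free text, large language model encoders become competitive. STRABLE provides a reliable benchmark for learning on tables with strings and should help advance research in this area.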

Abstract

Benchmarking tabular learning has revealed the benefit of dedicated architectures, pushing the state of the art. But real-world tables often contain string entries, beyond numbers, and these settings have been understudied due to a lack of a solid benchmarking suite. They lead to new research questions: Are dedicated learners needed, with end-to-end modeling of strings and numbers? Or does it suffice to encode strings as numbers, as with a categorical encoding? And if so, do the resulting tables resemble numerical tabular data, calling for the same learners? To enable these studies, we contribute STRABLE, a benchmarking corpus of 108 tables, all real-world learning problems with strings and numbers across diverse application fields. We run the first large-scale empirical study of tabular learning with strings, evaluating 445 pipelines. These pipelines span end-to-end architectures and modular pipelines, where strings are first encoded, then post-processed, and finally passed to a tabular learner. We find that, because most tables in the wild are categorical-dominant, advanced tabular learners paired with simple string embeddings achieve good predictions at low computational cost. On free-text-dominant tables, large LLM encoders become competitive. Their performance also appears sensitive to post-processing, with differences across LLM families. Finally, we show that STRABLE is a good set of tables to study "string tabular" learning as it leads to generalizable pipeline rankings that are close to the oracle rankings. We thus establish STRABLE as a foundation for research on tabular learning with strings, an important yet understudied area.

2605.12290 2026-05-13 cs.LG

Targeted Neuron Modulation via Contrastive Pair Search

Sam Herring, Jake Naviasky, Karan Malhotra

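AI summary This study investigates the mechanism by which instruction-tuned language models refuse harmful requests, and proposes contrastive neuron attribution (CNA), a method that identifies a small set of key neurons that distinguish harmful from benign prompts. Experiments show that intervening on these neurons lowers the model's refusal rate while preserving output quality, whereas base models lack such a steerable refusal mechanism. The study suggests that alignment fine-tuning transforms pre-existing discrimination structure into a targetable refusal gate, offering a more reliable approach to behavioral steering.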

Abstract

Language models are instruction-tuned to refuse harmful requests, but the mechanisms underlying this behavior remain poorly understood. Popular steering methods operate on the residual stream and degrade output coherence at high intervention strengths, limiting their practical use. We introduce contrastive neuron attribution (CNA), which identifies the 0.1% of MLP neurons whose activations most distinguish harmful from benign prompts, requiring only forward passes with no gradients or auxiliary training. In instruct models, ablating the discovered circuit reduces refusal rates by over 50% on a standard jailbreak benchmark while preserving fluency and non-degeneracy across all steering strengths. Applying CNA to matched base and instruct models across Llama and Qwen architectures (from 1B to 72B parameters), we find that base models contain similar late-layer discrimination structures but steering these neurons produces only content shifts, not behavioral change. These results demonstrate that neuron-level intervention enables reliable behavioral steering without the quality tradeoffs of residual-stream methods. More broadly, our findings suggest that alignment fine-tuning transforms pre-existing discrimination structure into a sparse, targetable refusal gate.
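The selection step can be illustrated with a small sketch: record per-neuron activations from forward passes on harmful and benign prompts, score each neuron by how far its mean activation differs between the two sets, and keep the top fraction. The abstract does not give the scoring rule, so the absolute difference of means, the function names, and the toy activations below are assumptions; the default fraction of 0.001 mirrors the 0.1% quoted in the abstract.

```python
import math


def top_contrastive_neurons(harmful_acts, benign_acts, fraction=0.001):
    """Rank neurons by |mean harmful activation - mean benign activation|.

    Each argument is a list of activation vectors (one per prompt), all of
    the same length. Only forward-pass activations are needed.
    """
    n = len(harmful_acts[0])
    mean_h = [sum(row[j] for row in harmful_acts) / len(harmful_acts) for j in range(n)]
    mean_b = [sum(row[j] for row in benign_acts) / len(benign_acts) for j in range(n)]
    scores = [abs(h - b) for h, b in zip(mean_h, mean_b)]
    k = max(1, math.ceil(fraction * n))
    return sorted(range(n), key=lambda j: scores[j], reverse=True)[:k]


def ablate(activations, neuron_ids):
    """Zero out the selected neurons in one activation vector."""
    ids = set(neuron_ids)
    return [0.0 if j in ids else a for j, a in enumerate(activations)]


# Toy activations for 4 neurons; neuron 1 separates the two prompt sets.
harmful = [[0.1, 2.0, 0.0, 0.3], [0.2, 2.2, 0.1, 0.2]]
benign = [[0.1, 0.1, 0.0, 0.3], [0.2, 0.0, 0.1, 0.3]]
selected = top_contrastive_neurons(harmful, benign, fraction=0.25)
print(selected, ablate([1.0, 2.0, 3.0, 4.0], selected))
```

In a real model the vectors would be MLP activations collected with forward hooks, and the intervention would be applied inside the forward pass rather than to a standalone list.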

2605.12289 2026-05-13 cs.LG cs.AI

PriorZero: Bridging Language Priors and World Models for Decision Making

Junyu Xiong, Yuan Pu, Jia Tang, Yazhe Niu

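AI summary This paper proposes PriorZero, a unified framework that combines the language priors of large language models (LLMs) with world-model-based planning to improve the decision making of reinforcement learning agents on long-horizon tasks. Through a decoupled rollout-training design, the method injects LLM conceptual priors only at the root node of Monte Carlo Tree Search (MCTS), steering search toward semantically promising actions while preserving the world model's deep lookahead. Experiments show that PriorZero improves exploration efficiency and final performance on several benchmark tasks, providing a promising framework for LLM-based decision making.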

Comments 30 pages, 12 figures

Abstract

Leveraging the rich world knowledge of Large Language Models (LLMs) to enhance Reinforcement Learning (RL) agents offers a promising path toward general intelligence. However, a fundamental prior-dynamics mismatch hinders existing approaches: static LLM knowledge cannot directly adapt to the complex transition dynamics of long-horizon tasks. Using LLM priors as fixed policies limits exploration diversity, as the prior is blind to environment-specific dynamics; while end-to-end fine-tuning suffers from optimization instability and credit assignment issues. To bridge this gap, we propose PriorZero, a unified framework that integrates LLM-derived conceptual priors into world-model-based planning through a decoupled rollout-training design. During rollout, a novel root-prior injection mechanism incorporates LLM priors exclusively at the root node of Monte Carlo Tree Search (MCTS), focusing search on semantically promising actions while preserving the world model's deep lookahead capability. During training, PriorZero decouples world-model learning from LLM adaptation: the world model is continuously refined on interaction data to jointly improve its dynamics, policy, and value predictions, its value estimates are then leveraged to provide fine-grained credit assignment signals for stable LLM fine-tuning via alternating optimization. Experiments across diverse benchmarks, including text-based adventure games in Jericho and instruction-following gridworld tasks in BabyAI, demonstrate that PriorZero consistently improves both exploration efficiency and asymptotic performance, establishing a promising framework for LLM-empowered decision-making. Our code is available at https://github.com/opendilab/LightZero.

2605.12282 2026-05-13 cs.CV

Large-Small Model Collaboration for Farmland Semantic Change Detection

Xinjia Li, Rui Wang, Qiurong Peng, Lingfei Ye, Dengrong Zhang, Haoyu Zhang

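AI summary This paper addresses insufficient annotation and pseudo-change interference in fine-grained farmland semantic change detection (SCD). It builds HZNU-FCD, a large-scale fine-grained farmland change detection benchmark, and proposes a detection framework in which a large model and a small model collaborate. The framework pairs FD-Mamba, a task-driven small visual model, with a frozen large vision-language model, and uses cross-modal logical arbitration and a hard-region co-training strategy to improve boundary preservation and the detection of small-region changes. Experiments show strong performance on several datasets, with good robustness and generalization.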

Abstract

Farmland Semantic Change Detection (SCD) is essential for cultivated land protection, yet existing benchmarks and models remain insufficient for fine-grained farmland conversion monitoring. Current datasets often lack dedicated "from-to" annotations, while visual change detection models are easily disturbed by phenology-induced pseudo-changes caused by crop rotation, seasonal variation, and illumination differences. To address these challenges, we construct HZNU-FCD, a large-scale fine-grained farmland SCD benchmark with a unified five-class farmland-to-non-farmland annotation protocol. It contains 4,588 bitemporal image pairs with pixel-level labels for practical farmland protection. Based on this benchmark, we propose a large-small collaborative SCD framework that integrates a task-driven small visual model with a frozen large vision-language model. The small model, Fine-grained Difference-aware Mamba (FD-Mamba), learns dense change representations for boundary preservation and small-region localization. The large-model pathway, Cross-modal Logical Arbitration (CMLA), introduces CLIP-based textual priors for prompt-guided semantic arbitration and pseudo-change suppression. To enable effective collaboration, we design a hard-region co-training strategy that supervises the CMLA semantic score map only on low-confidence pixels. Experiments show that our method achieves 97.63% F1, 96.32% IoU, and 96.35% SCD_IoU_mean on HZNU-FCD with only 6.65M trainable parameters. Compared with the multimodal ChangeCLIP-ViT, which leverages vision-language information for change detection, our method improves F1 by 10.19 percentage points on HZNU-FCD. It also achieves 91.43% F1 and 84.21% IoU on LEVIR-CD, and 93.85% F1 and 88.41% IoU on WHU-CD, demonstrating strong robustness and generalization. The code is available at https://github.com/Lovelymili/FD-Mamba.

2605.12281 2026-05-13 cs.CL cs.LG

What makes a word hard to learn? Modeling L1 influence on English vocabulary difficulty

Jonas Mayer Martins, Zhuojing Huang, Aaricia Herygers, Lisa Beinborn

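AI summary This study investigates why English words are hard to master for learners whose native language is Spanish, German, or Chinese, and computationally models vocabulary difficulty using word familiarity, meaning, surface form, and cross-linguistic transfer. A Shapley value analysis finds that word familiarity is the dominant factor shared by all three learner groups; Spanish- and German-speaking learners are additionally influenced by orthographic transfer, whereas Chinese learners rely mainly on familiarity and surface features. The models provide interpretable vocabulary difficulty estimates tailored to learners' native-language background, which can help design more effective vocabulary curricula.

Comments Submitted to BEA 2026 at ACL. 18 pages, 13 figures

Details
English abstract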

What makes a word difficult to learn, and how does the difficulty depend on the learner's native language? We computationally model vocabulary difficulty for English learners whose first language is Spanish, German, or Chinese with gradient-boosted models trained on features related to a word's familiarity (e.g., frequency), meaning, surface form, and cross-linguistic transfer. Using Shapley values, we determine the importance of each feature group. Word familiarity is the dominant feature group shared by all three languages. However, predictions for Spanish- and German-speaking learners rely additionally on orthographic transfer. This transfer mechanism is unavailable to Chinese learners, whose difficulty is shaped by a combination of familiarity and surface features alone. Our models provide interpretable, L1-tailored difficulty estimates that can be used to design vocabulary curricula.
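
To make the idea of feature-group importance concrete, the sketch below computes exact Shapley values by enumerating coalitions of feature groups. It is an illustration only: the value function `toy_value` and the numbers in `GAINS` are made up, and the paper's own pipeline (gradient-boosted models with Shapley values on real predictions) is not reproduced here.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values for a coalition value function (exponential in len(players))."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        phi = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                phi += weight * (value(s | {p}) - value(s))
        result[p] = phi
    return result

# Hypothetical standalone gains per feature group (illustrative numbers only).
GAINS = {"familiarity": 0.5, "meaning": 0.1, "surface": 0.1, "transfer": 0.0}

def toy_value(coalition):
    """Toy 'explained variance': additive gains plus a surface/transfer synergy."""
    total = sum(GAINS[g] for g in coalition)
    if {"surface", "transfer"} <= coalition:
        total += 0.2  # transfer only helps when surface form is also available
    return total

values = shapley_values(list(GAINS), toy_value)
```

In this toy game the synergy of 0.2 is split evenly between `surface` and `transfer`, and the values sum to the worth of the full coalition, which is the efficiency property that makes Shapley values convenient for attributing importance to groups.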

2605.12278 2026-05-13 cs.LG

Hypernetworks for Dynamic Feature Selection

Javier Fumanal-Idocin, Raquel Fernandez-Peralta, Javier Andreu-Perez

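AI summary This paper studies structural limitations of the dynamic feature selection (DFS) framework and proposes Hyper-DFS, a hypernetwork-based method that generates classifier parameters for specific feature subsets on demand, lowering structural complexity while preserving generalization. With a Set Transformer encoding, Hyper-DFS also builds a smooth conditioning space in which functionally similar tasks are geometrically close. Experiments show that Hyper-DFS outperforms existing methods on synthetic and real tabular data, is competitive or superior on the image datasets tested, and generalizes better zero-shot to unseen feature subsets.

Details
English abstract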

Dynamic feature selection (DFS) is a machine learning framework in which features are acquired sequentially for individual samples under budget constraints. The exponential growth in the number of possible feature acquisition paths forces a DFS model to balance fitting specific scenarios against maintaining general performance, even when the feature space is moderate in size. In this paper, we study the structural limitations that prevent existing DFS approaches from achieving an optimal solution. Then, we propose Hyper-DFS, a hypernetwork-based DFS approach that generates feature subset-specific classifier parameters on demand. We show that using hypernetworks instead of mask-embedding methods results in a smaller structural complexity bound. We also use a Set Transformer encoding to create a smooth conditioning space for the hypernetwork, so that functionally similar tasks are also geometrically close. In our benchmarks, Hyper-DFS outperforms all state-of-the-art approaches on synthetic and real-life tabular data. It is also competitive or superior across all image datasets tested, and shows substantially stronger zero-shot generalisation to feature subsets never seen during training than existing DFS approaches.
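
The core mechanism, a hypernetwork that emits classifier parameters for whichever feature subset has been acquired, can be sketched in a few lines. This is a deliberately simplified illustration with assumed names (`make_hypernetwork`, `predict`): the hypernetwork is a single random linear map from a 0/1 mask to the weights of a linear classifier, whereas the paper conditions on a Set Transformer encoding and trains the hypernetwork.

```python
import random

def make_hypernetwork(num_features, num_classes, seed=0):
    """Return a function mapping a 0/1 feature mask to linear-classifier parameters.

    The hypernetwork is one untrained linear layer from the mask to the
    flattened weights and biases; this is an assumption made for brevity.
    """
    rng = random.Random(seed)
    out_dim = num_classes * num_features + num_classes
    H = [[rng.gauss(0, 0.1) for _ in range(num_features)] for _ in range(out_dim)]
    h0 = [rng.gauss(0, 0.1) for _ in range(out_dim)]

    def generate(mask):
        flat = [h0[i] + sum(H[i][j] * mask[j] for j in range(num_features))
                for i in range(out_dim)]
        # Unflatten into one weight row per class, followed by the biases.
        W = [flat[c * num_features:(c + 1) * num_features] for c in range(num_classes)]
        b = flat[num_classes * num_features:]
        return W, b

    return generate

def predict(generate, x, mask):
    """Classify x using parameters generated for the observed subset in mask."""
    W, b = generate(mask)
    # Unobserved features are zeroed, so their values cannot affect the logits.
    logits = [b[c] + sum(W[c][j] * x[j] * mask[j] for j in range(len(x)))
              for c in range(len(b))]
    return max(range(len(logits)), key=logits.__getitem__)
```

Each mask yields its own classifier, so a new acquisition step simply calls `generate` again with the updated mask rather than feeding the mask into one fixed classifier.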

2605.12276 2026-05-13 cs.AI

NARA: Anchor-Conditioned Relation-Aware Contextualization of Heterogeneous Geoentities

Jina Kim, Gengchen Mai, Lingyi Zhao, Khurram Shafique, Yao-Yi Chiang

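AI summary This study proposes NARA, a self-supervised learning framework for heterogeneous vector geoentities, addressing the inability of existing methods to model geometry, semantics, and spatial relations in a unified way. By jointly modeling semantics, geometric structure, and spatial relations, NARA produces context-aware representations of points, polylines, and polygons. Experiments show that it outperforms existing methods on building function classification, traffic speed prediction, and point-of-interest recommendation, supporting the value of unified relational modeling.

Details
English abstract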

Geospatial foundation models have primarily focused on raster data such as satellite imagery, where self-supervised learning has been widely studied. Vector geospatial data instead represent the world as discrete geoentities with explicit geometry, semantics, and structured spatial relations, including metric proximity and topological relationships. These relations jointly determine how entities interact within space, yet existing representation learning methods remain fragmented, often restricted to specific geometry types or partial spatial relations, limiting their ability to capture unified spatial context across heterogeneous geoentities. We propose NARA (Neural Anchor-conditioned Relation-Aware representation learning), a self-supervised framework for vector geoentities. NARA learns context-dependent representations by jointly modeling semantics, geometry, and spatial relations within a unified framework and captures relational spatial structure beyond proximity alone, enabling rich contextualized representations across heterogeneous geoentities of points, polylines, and polygons. Evaluation on building function classification, traffic speed prediction, and next point-of-interest recommendation shows consistent improvements over prior methods, highlighting the benefit of unified relational modeling for vector geospatial data.

2605.12266 2026-05-13 cs.CV

CAD-feature enhanced machine learning for manufacturing effort estimation on sheet metal bending parts

Matteo Ballegeer, Toon Van Camp, Willem Jaspers, Alp Bayar, Aung Nyein Soe, Martin Roelfs, Dries F. Benoit, Bieke Decraemer, Joost R. Duflou

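AI summary This study addresses manufacturing effort estimation for sheet metal bending parts with a hybrid method that combines CAD features and graph-based learning. Manufacturing features recognized by a rule-based module, such as bend characteristics and flange lengths, are added to the B-rep topology graph, strengthening the model's ability to learn process-relevant geometric patterns. Experiments show improved prediction accuracy on both a synthetic dataset and a real industrial dataset, supporting the combination of domain knowledge and graph learning for manufacturability assessment.

Details
English abstract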

Graph-based machine learning has emerged as a promising approach for manufacturability analysis by learning directly from CAD models represented as Boundary Representations (B-reps), exploiting both surface geometry and topological connectivity. However, purely geometric representations often lack the process-specific semantics required for accurate manufacturability prediction: many manufacturing factors, such as surface roles or bend intent, are not explicitly encoded in shape alone and are difficult for data-driven models to infer reliably. We propose a hybrid approach that addresses this challenge by enriching B-rep attributed adjacency graphs with manufacturing features recognized through a rule-based module. Applied to sheet metal bending, recognized features, such as bend characteristics, flange lengths, and surface roles, are integrated as node attributes, concentrating the learning signal on process-relevant geometric patterns. Experiments on both a large-scale synthetic manufacturability benchmark and a real-world industrial dataset with measured bending times, one of the first such validations on genuine production data, demonstrate that combining domain knowledge with graph-based learning improves prediction accuracy across both tasks. The results indicate that hybrid modeling offers a feasible and effective path toward deployable tools for manufacturability assessment and effort estimation in industrial CAD environments.

2605.12265 2026-05-13 cs.AI

How Useful Is Cross-Domain Generalization for Training LLM Monitors?

Sam Martin, Fabien Roger

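AI summary This paper studies prompted language models as classifiers when training data is limited, and examines how useful cross-domain generalization is for training LLM classifiers. Training on multiple prompted classification tasks improves classification performance in adjacent domains, but in some edge cases the fine-tuned models fail to follow prompts when the prompt changes. Mixing classification training with general instruction-following training preserves classification performance while mitigating these generalization failures, and this no-thinking classification training may be useful for building other classifiers and monitoring systems.

Details
English abstract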

Using prompted language models as classifiers enables classification in domains with limited training data, but misses some of the robustness and performance benefits that fine-tuning can bring. We study whether training on multiple classification tasks, each with its own prompt, improves performance on new domains with new classification prompts. We show that such training partially generalizes to adjacent domains, improving classification performance on tasks that are unseen during training. However, we identify specific edge cases where the fine-tuned models fail to follow prompts, such as when the classification prompt changes completely while the data domain remains the same as during training. We show that classification training can be mixed with general instruction-following training, and that (when done well) such training keeps the benefits of classification training and mitigates its generalization failures. Surprisingly, we see that this no-thinking supervised classification training can generalize to with-thinking classification and summarization, suggesting that no-thinking classification training might be instrumentally useful in building other kinds of classifiers and monitoring systems.

2605.12262 2026-05-13 cs.AI cs.LG

Missingness-MDPs: Bridging the Theory of Missing Data and POMDPs

Joshua Wendland, Markel Zubia, Roman Andriushchenko, Maris F. L. Galesloot, Milan Ceska, Henrik von Kleist, Thiago D. Simao, Maximilian Weininger, Nils Jansen

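AI summary This paper introduces missingness-MDPs (miss-MDPs), a new subclass of partially observable Markov decision processes (POMDPs) that incorporates the theory of missing data. The model uses a missingness function to describe the probability that state features are missing at each time step. For the case of an unknown missingness function, the paper proposes algorithms that exploit the structural properties of different missingness types to learn the missingness function from observed data and then compute near-optimal policies. The resulting policies are proven to be ε-optimal in the true miss-MDP with high probability, and experiments confirm the method's effectiveness.

Details
English abstract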

We introduce missingness-MDPs (miss-MDPs), a novel subclass of partially observable Markov decision processes (POMDPs) that incorporates the theory of missing data. A miss-MDP is a POMDP whose observation function is a missingness function, specifying the probability that individual state features are missing (i.e., unobserved) at a time step. The literature distinguishes three canonical missingness types: missing (1) completely at random (MCAR), (2) at random (MAR), and (3) not at random (MNAR). Our planning problem is to compute near-optimal policies for a miss-MDP with an unknown missingness function, given a dataset of action-observation trajectories. Achieving such optimality guarantees for policies requires learning the missingness function from data, which is infeasible for general POMDPs. To overcome this challenge, we exploit the structural properties of different missingness types to derive probably approximately correct (PAC) algorithms for learning the missingness function. These algorithms yield an approximate but fully specified miss-MDP that we solve using off-the-shelf planning methods. We prove that, with high probability, the resulting policies are epsilon-optimal in the true miss-MDP. Empirical results confirm the theory and demonstrate superior performance of our approach over two model-free POMDP methods.
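
For the simplest missingness type, MCAR, learning the missingness function reduces to estimating one missing probability per feature. The sketch below shows how a PAC-style guarantee can arise in that case using Hoeffding's inequality and a union bound. It is an assumption-laden illustration, not the paper's algorithm: it treats the missingness indicators as i.i.d. across observations, and the function name `estimate_mcar` is hypothetical.

```python
import math

def estimate_mcar(observations, delta=0.05):
    """Estimate per-feature missingness probabilities under an MCAR assumption.

    observations: list of equal-length tuples; None marks a missing feature value
    delta:        allowed joint failure probability across all features
    Returns (estimates, epsilon) such that, with probability at least 1 - delta,
    every true probability lies within epsilon of its estimate.
    """
    n = len(observations)
    k = len(observations[0])
    estimates = [sum(obs[j] is None for obs in observations) / n for j in range(k)]
    # Hoeffding: P(|p_hat - p| >= eps) <= 2 * exp(-2 * n * eps**2) per feature;
    # a union bound over k features gives the log(2 * k / delta) term.
    epsilon = math.sqrt(math.log(2 * k / delta) / (2 * n))
    return estimates, epsilon

data = [(1, None), (None, None), (2, 3), (4, None)]
estimates, epsilon = estimate_mcar(data)
```

The interval width shrinks at rate 1/sqrt(n), so more trajectories give a tighter approximate model to hand to a planner. MAR and MNAR need more structure than this, because missingness there depends on feature values.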

2605.12261 2026-05-13 cs.LG

Delay-Empowered Causal Hierarchical Reinforcement Learning

Chenran Zhao, Dianxi Shi, Haotian Wang, Mengzhu Wang, Yaowen Zhang, Chunping Qiu, Shaowu Yang

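AI summary Many real-world tasks involve delayed effects, where the consequences of actions emerge after varying time lags. Existing delay-aware reinforcement learning methods typically rely on state augmentation, prior knowledge of delay distributions, or non-delayed data, which limits their generalization. This paper proposes Delay-Empowered Causal Hierarchical Reinforcement Learning (DECHRL), which explicitly models the causal structure of state transitions and their associated stochastic delay distributions, and incorporates them into a delay-aware empowerment objective that guides the agent to proactively explore highly controllable states, improving performance under temporal uncertainty. Experiments show that DECHRL significantly outperforms baselines in modified 2D-Minecraft and MiniGrid environments with stochastic delays.

Details
English abstract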

Many real-world tasks involve delayed effects, where the outcomes of actions emerge after varying time lags. Existing delay-aware reinforcement learning methods often rely on state augmentation, prior knowledge of delay distributions, or access to non-delayed data, limiting their generalization. Hierarchical reinforcement learning, by contrast, inherently offers advantages in handling delays due to its hierarchical structure, yet existing methods are restricted to fixed delays. To address these limitations, we propose Delay-Empowered Causal Hierarchical Reinforcement Learning (DECHRL). DECHRL explicitly models both the causal structure of state transitions and their associated stochastic delay distributions. These are then incorporated into a delay-aware empowerment objective that drives proactive exploration toward highly controllable states, thereby improving performance under temporal uncertainty. We evaluate DECHRL in modified 2D-Minecraft and MiniGrid environments featuring stochastic delays. Experimental results show that DECHRL effectively models temporal delays and significantly outperforms baselines in decision-making under temporal uncertainty.

2605.12259 2026-05-13 cs.CV

From Image Hashing to Scene Change Detection

Anh-Kiet Duong, Marie-Claire Iatrides, Petra Gomez-Krämer, Jean-Michel Carozza

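AI summary Image hashing enables efficient storage and retrieval, but its global comparison cannot localize where changes occur, which limits its use in scene change detection. This paper revisits image hashing from a scene change detection perspective and proposes HashSCD, a patch-wise hashing framework that performs global change detection and local change localization directly in Hamming space, without repeated inference on previous images. The method is trained without supervision using contrastive learning and substantially reduces computation and storage costs while remaining competitive in performance.

Comments 18 pages; accepted to ICPR 2026

Details
English abstract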

Image hashing provides compact representations for efficient storage and retrieval but is inherently limited to global comparison and cannot reason about where changes occur. This limitation prevents hashing from being directly applicable to scene change detection, where spatial localization is essential. In this work, we revisit hashing from a scene change detection perspective and propose HashSCD, a patch-wise hashing framework that enables both efficient global change detection and localized change identification. HashSCD encodes spatially aligned patches into compact hash codes and aggregates them through an XOR-like operation, allowing change detection and localization to be performed directly in the Hamming space without repeated inference on previous images. The model is trained in an unsupervised manner using contrastive learning at both patch and global levels. Experiments demonstrate that HashSCD achieves competitive performance compared to state-of-the-art unsupervised hashing and scene change detection methods, while significantly reducing computational cost and storage requirements.
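
The benefit of comparing patches in Hamming space can be shown with a classical stand-in for the learned encoder. The sketch below uses a simple average hash per patch (an assumption; HashSCD learns its codes with contrastive training) and localizes change by XOR-ing stored codes with new ones. All function names are illustrative.

```python
def patch_codes(image, patch):
    """Average-hash each non-overlapping patch x patch block into an integer code.

    image: 2D list of pixel intensities; patch: block side length in pixels.
    """
    rows, cols = len(image), len(image[0])
    codes = []
    for r in range(0, rows - patch + 1, patch):
        row_codes = []
        for c in range(0, cols - patch + 1, patch):
            pixels = [image[r + i][c + j] for i in range(patch) for j in range(patch)]
            mean = sum(pixels) / len(pixels)
            code = 0
            for value in pixels:
                code = (code << 1) | (value > mean)  # one bit per pixel
            row_codes.append(code)
        codes.append(row_codes)
    return codes

def change_map(codes_a, codes_b):
    """Per-patch Hamming distance, computed as the popcount of the XOR."""
    return [[bin(a ^ b).count("1") for a, b in zip(ra, rb)]
            for ra, rb in zip(codes_a, codes_b)]

def global_change(codes_a, codes_b):
    """Total Hamming distance over all patches, a single global change score."""
    return sum(map(sum, change_map(codes_a, codes_b)))

before = [[0, 9, 0, 9], [9, 0, 9, 0], [0, 9, 0, 9], [9, 0, 9, 0]]
after = [[0, 9, 0, 9], [9, 0, 9, 0], [0, 9, 9, 0], [9, 0, 0, 9]]
stored = patch_codes(before, 2)  # only the codes need to be kept, not the image
```

Because only `stored` is needed for later comparisons, the earlier image never has to be re-encoded, which is the storage and compute saving the abstract refers to.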

2605.12258 2026-05-13 cs.LG

Instruction Lens Score: Your Instruction Contributes a Powerful Object Hallucination Detector for Multimodal Large Language Models

Runhe Lai, Xinhua Lu, Yanqi Wu, Jinlun Ye, Weijiang Yu, Ruixuan Wang

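AI summary Multimodal large language models still face object hallucination in practical applications. This paper analyzes instruction token embeddings in depth and finds that they implicitly encode visual information and effectively filter out erroneous information introduced by misleading visual embeddings. Building on this, it proposes the Instruction Lens Score (InsLen), an object hallucination detection method that requires no additional training or auxiliary models. The method combines a Calibrated Local Score with a Context Consistency Score, and experiments show that it outperforms existing methods across multiple benchmarks and model architectures, with good effectiveness and robustness.

Comments Accepted by ICML-2026

Details
English abstract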

Multimodal large language models (MLLMs) have achieved remarkable progress, yet object hallucination remains a critical challenge for reliable deployment. In this paper, we present an in-depth analysis of instruction token embeddings and reveal that they implicitly encode visual information while effectively filtering erroneous information introduced by misleading visual embeddings. Building on this insight, we propose the Instruction Lens Score (InsLen), which combines a Calibrated Local Score with a Context Consistency Score that measures the context consistency of object tokens. The proposed approach serves as a plug-and-play object hallucination detector without relying on auxiliary models or additional training. Extensive experiments across multiple benchmarks and diverse MLLM architectures demonstrate that InsLen consistently outperforms existing hallucination detection methods, highlighting its effectiveness and robustness. The code is available at https://github.com/Fraserlairh/Instruction-Lens-Score.

2605.12255 2026-05-13 cs.AI cs.CY cs.LG

Why Conclusions Diverge from the Same Observations: Formalizing World-Model Non-Identifiability via an Inference

Toru Takahashi

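AI summary This paper examines why people reach different conclusions from the same observations, arguing that such disagreement stems from non-identifiability in inference and learning rather than from cognitive defects of the other party. It divides non-identifiability into two levels: conclusions diverge under the same world model because inference settings differ, and the inference setting itself biases data exposure and update rules, causing world models to diverge. The paper introduces the notion of an inference profile, analyzes how disagreements are shaped by computational, observational, and coordination constraints, relates the framework to concepts in deep representation learning, and illustrates it with a case study on AI regulation debates.

Comments 12 pages, 2 figures, 1 table. Extended English version of a paper accepted for presentation at JSAI 2026

Details
English abstract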

When people share the same documents and observations yet reach different conclusions, the disagreement often shifts into a judgment that the other party is cognitively defective, irrational, or acting in bad faith. This paper argues that such divergence is better described as a form of non-identifiability inherent in inference and learning, rather than as a defect of the other party. We organize the phenomenon into two levels: (i) $\theta$-level non-identifiability, where conclusions diverge under the same world model $W$ because inference settings differ; and (ii) $W$-level non-identifiability, where repeated use of an inference setting $\theta$ biases data exposure and update rules, causing the learned world model $W$ itself to diverge. We introduce an inference profile $\theta = (R, E, S, D)$, consisting of Reference, Exploration, Stabilization, and Horizon, and show how outputs can split even for the same observation $o$ and the same $W$. We further explain why disagreements tend to project onto a small number of bases -- abstract versus concrete, externalizability, and order versus freedom -- as a consequence of general constraints on learning systems: computational, observational, and coordination constraints. Finally, we relate the framework to deep representation learning, including representation hierarchy, latent-state estimation, and regularization-exploration trade-offs, and illustrate the framework through a case study on AI regulation debates.

2605.12252 2026-05-13 cs.CV

H3D-MarNet: Wavelet-Guided Dual-Path Learning for Metal Artifact Suppression and CT Modality Transformation for Radiotherapy Workflows

Mubashara Rehman, Niki Martinel, Michele Avanzo, Riccardo Spizzo, Christian Micheloni

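AI summary This study proposes H3D-MarNet, a two-stage framework for metal artifact suppression and CT modality transformation from kilo-voltage CT (kVCT) to mega-voltage CT (MVCT), aimed at improving image quality in radiotherapy workflows. In the first stage, a wavelet-guided preprocessing module removes metal artifacts while preserving anatomical structures. In the second stage, Domain-TransNet, which combines a convolutional neural network with a Transformer, fuses local detail and global context through an attention mechanism to perform high-fidelity CT modality transformation. Experiments report 28.14 dB PSNR and 0.717 SSIM on artifact-affected slices, indicating potential for use in clinical radiotherapy.

Comments Accepted for publication at the 28th International Conference on Pattern Recognition, Lyon, France, August 17-22, 2026

Details
English abstract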

Metal artifacts in computed tomography (CT) severely degrade image quality, compromising diagnostic accuracy and radiotherapy planning, especially in cancer patients with high-density implants. We propose H3D-MarNet, a two-stage framework for artifact-aware CT domain transformation from kilo-voltage CT (kVCT) to mega-voltage CT (MVCT). In the first stage, a wavelet-based preprocessing module suppresses metal-induced artifacts through frequency-aware denoising while preserving anatomical structures. In the second stage, Domain-TransNet performs kVCT-to-MVCT domain transformation using a hybrid volumetric learning architecture. Domain-TransNet integrates a CNN-based encoder to capture fine-grained local anatomical details and a transformer-based encoder to model long-range volumetric dependencies. The complementary representations are fused through an attention-based feature fusion mechanism to ensure spatial and contextual coherence across slices. A multi-stage, attention-guided decoder, supported by deep supervision, progressively reconstructs artifact-suppressed MVCT volumes. Extensive experiments demonstrate that H3D-MarNet achieves 28.14 dB PSNR and 0.717 SSIM on artifact-affected slices from the full dataset, indicating effective metal artifact suppression and anatomical preservation and highlighting its potential for reliable CT modality transformation in clinical radiotherapy workflows.
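
For readers unfamiliar with the reported metric, PSNR is defined as 10 log10(MAX^2 / MSE). The sketch below computes it for flat lists of intensities. The normalization to a maximum of 1.0 is an assumption, since the abstract does not state the intensity range used in the paper.

```python
import math

def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat lists.

    max_value is the largest possible intensity (1.0 assumes normalized images).
    """
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Under that normalization, a PSNR of 28.14 dB would correspond to a mean squared error of roughly 1.5e-3.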

2605.12247 2026-05-13 cs.RO

SI-Diff: A Framework for Learning Search and High-Precision Insertion with a Force-Domain Diffusion Policy

Yibo Liu, Stanko Oparnica, Simon Shewchun-Jakaitis, Guoyi Fu, Jie Wang, Jun Yang, Anand Jagannathan, Tony Hong-Yau Lo

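AI summary In contact-rich robotic assembly tasks, uncertainties in relative pose, such as misalignments and small clearances, make search and high-precision insertion challenging. This paper proposes SI-Diff, a framework that learns search and high-precision insertion actions jointly through a force-domain diffusion policy. It introduces a new mode-conditioning mechanism to capture distinct action behaviors within a single model, and designs a new search teacher policy that generates diverse trajectories, improving the model's tolerance to initial pose misalignment and its generalization to unseen shapes.

Comments 9 pages, 8 figures

Details
English abstract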

Contact-rich assembly is fundamental in robotics but poses significant challenges due to uncertainties in relative poses, such as misalignments and small clearances in peg-in-hole tasks. Existing approaches typically address search and high-precision insertion separately, because these tasks involve distinct action patterns. However, supporting both tasks within a single model, without switching models or weights, is desirable for intelligent assembly systems. In this work, we propose SI-Diff, a framework that learns both search and high-precision insertion through a force-domain diffusion policy. To this end, we introduce a new mode-conditioning mechanism that enables the policy to capture distinct action behaviors under a single framework. Moreover, we develop a new search teacher policy that can generate diverse trajectories. By training on successful and efficient demonstrations provided by the teacher policy, the model learns the mapping from tactile and end-effector velocity observations to effective action behaviors. We conduct thorough experiments to show that SI-Diff extends the tolerance to x-y misalignments from 2 mm to 5 mm compared to the state-of-the-art baseline, TacDiffusion, while also demonstrating strong zero-shot transferability to unseen shapes.