arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.09511 2026-05-12 cs.AI

WindINR: Latent-State INR for Fast Local Wind Query and Correction in Complex Terrain

Yi Xiao, Qilong Jia, Hang Fan, Pascal Fua, Robert Jenssen, Xiaosong Ma, Wei Xue

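AI summary: In complex terrain, many downstream decisions need fast wind estimates at specific locations and heights rather than a conventional dense forecast field on a fixed grid. The paper proposes WindINR, a latent-state implicit neural representation framework for fast high-resolution local wind query and sparse-observation correction. A latent-conditioned decoder maps static terrain descriptors, a low-resolution background field, and continuous query coordinates to a high-resolution wind state, and separating reusable representation learning from sample-specific latent-state correction makes inference-time correction efficient. Experiments show that WindINR stays continuously queryable while correcting about 2.6 times faster than full-network fine-tuning, offering a practical interface between background fields, sparse observations, and wind queries in complex terrain.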

Abstract

Many downstream decisions in complex terrain require fast wind estimates at a small number of user-specified locations and heights for a given forecast valid time, rather than another dense forecast field on a fixed grid. We present WindINR, a latent-state implicit neural representation framework for continuous high-resolution local wind query and sparse-observation correction. WindINR maps static terrain descriptors, a low-resolution background field, and continuous query coordinates to a high-resolution wind state through a latent-conditioned decoder. To enable rapid inference-time correction, WindINR separates reusable representation learning from sample-specific latent-state correction. During training, a privileged encoder infers a reference latent state from high-resolution supervision, a deployable latent predictor estimates an initial latent state from inference-time inputs alone, and their discrepancies are summarized into a dataset-adaptive Gaussian prior over latent corrections. At inference time, within the WindINR module, network weights remain fixed and only the latent state is updated by minimizing a regularized correction objective using sparse observations and their uncertainty. In controlled OSSEs over the Senja region, including a UAV-aided approach scenario and random-observation robustness tests, WindINR improves local high-resolution wind estimates by updating only a compact latent state rather than the full network. The corrected representation remains continuously queryable at arbitrary coordinates and, in our CPU benchmark, yields about a $2.6\times$ online-correction speedup over full-network fine-tuning, suggesting a practical interface between kilometer-scale background products, sparse local observations, and wind queries in complex terrain.

2605.09507 2026-05-12 cs.CV

Uncertainty-Aware and Decoder-Aligned Learning for Video Summarization

Omer Tariq, Syed Muhammad Raza, Jeongbae Son

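AI summary: The paper proposes VASTSum, an uncertainty-aware and decoder-aligned learning framework for video summarization that targets the challenges posed by subjective annotation and discrete decoding. It predicts probabilistic frame-level importance scores through a variational formulation, explicitly models uncertainty under multi-annotator supervision, and adds a decoder-aligned regularization that stabilizes summary selection. Experiments on multiple datasets show improved robustness with efficient single-pass inference, positioning the method as an alternative to deterministic and diffusion-based approaches.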

Comments Accepted for presentation at the 2026 International Joint Conference on Neural Networks (IJCNN 2026)

Abstract

Video summarization aims to produce a compact representation of a long video by selecting a subset of temporally important segments that best reflect human preferences. This task is inherently difficult due to strong annotation subjectivity and the reliance on discrete decoding procedures, such as temporal segmentation and knapsack-based selection, during evaluation. Most existing approaches either learn deterministic importance scores that overlook these characteristics or adopt complex generative models that increase training and inference cost. In this paper, we propose VASTSum, an uncertainty-aware and decoder-aligned learning framework for video summarization that addresses both challenges within a single-pass model. The proposed method predicts probabilistic frame-level importance scores using a variational formulation, enabling explicit modeling of uncertainty arising from multi-annotator supervision. To account for subjectivity, particularly under binary annotations, we employ a supervision strategy that encourages alignment with plausible human annotation modes rather than enforcing a single consensus target. Furthermore, we introduce a decoder-aligned regularization that promotes stability of knapsack-based summary selection, reducing sensitivity to small perturbations in predicted scores. We evaluate the proposed framework on the SumMe and TVSum benchmarks using standard rank-based metrics. Experimental results show consistent and competitive Kendall and Spearman correlations across multiple data splits, demonstrating improved robustness under annotation disagreement while maintaining efficient single-forward inference. These results indicate that explicitly modeling uncertainty and aligning learning objectives with the decoding stage provide a principled alternative to both deterministic and diffusion-based video summarization methods.
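
The knapsack-based selection step that the abstract refers to can be illustrated with a minimal sketch. It assumes per-segment importance scores, integer segment lengths, and a total length budget; the function name `select_segments` and the example values are hypothetical and not taken from the paper.

```python
def select_segments(scores, lengths, budget):
    """0/1 knapsack: choose segments that maximize total score within a length budget."""
    n = len(scores)
    # best[i][c] = best total score using the first i segments with capacity c
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]              # option 1: skip segment i-1
            if lengths[i - 1] <= c:                  # option 2: take it if it fits
                cand = best[i - 1][c - lengths[i - 1]] + scores[i - 1]
                if cand > best[i][c]:
                    best[i][c] = cand
    # Backtrack to recover which segments were taken.
    chosen, c = [], budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    return sorted(chosen)


# Hypothetical example: four segments, summary budget of 7 frames.
summary = select_segments([0.9, 0.2, 0.6, 0.5], [5, 3, 4, 2], budget=7)
```

Because the selection is discrete, a small change in the predicted scores can flip which segments are chosen, which is the kind of sensitivity a decoder-aligned regularizer would aim to reduce.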

2605.09502 2026-05-12 cs.CL cs.AI cs.LG

Hidden Error Awareness in Chain-of-Thought Reasoning: The Signal Is Diagnostic, Not Causal

Aojie Yuan, Zhiyuan Julian Su, Haiyue Zhang, Yi Nian, Yue Zhao

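AI summary: The study reveals a mismatch between internal state and outward behavior in chain-of-thought (CoT) reasoning: models express high confidence while generating, yet their hidden states accurately detect reasoning errors. A linear probe predicts trace correctness from the very first reasoning step, whereas a classifier on the surface text of the generated trace cannot match it. The study further shows that this error signal only diagnoses reasoning quality and does not correct errors: none of several interventions managed to use it to improve reasoning outcomes. The finding marks a boundary for mechanistic interpretability, indicating that representations of reasoning errors differ fundamentally from representations of factual knowledge.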

Comments 10 pages, 5 figures, 10 tables. Mechanistic Interpretability @ ICML 2026

Abstract

Chain-of-thought (CoT) prompting assumes that generated reasoning reflects a model's internal computation. We show this assumption is wrong in a specific, measurable way: models internally detect their own reasoning errors but outwardly express confidence in them. A linear probe on hidden states predicts trace correctness with 0.95 AUROC -- from the very first reasoning step (0.79) -- while verbalized confidence for wrong traces is 4.55/5, nearly identical to correct ones (4.87/5). A text-surface classifier achieves only 0.59 on the same data, confirming a 0.20-point gap invisible in the generated text. This hidden error awareness holds across three model families (Qwen, Llama, Phi), 1.5B-72B parameters, and RL-trained reasoning models (DeepSeek-R1, 0.852 AUROC). The natural question is whether this signal can fix the errors it detects. It cannot. Four interventions -- activation steering, probe-guided best-of-N, self-correction, and activation patching -- all fail; patching destroys output coherence entirely. The signal is diagnostic, not causal: a readout of computation quality, not a lever to redirect it. This delineates a boundary for mechanistic interpretability: error representations during reasoning are fundamentally different from the factual knowledge representations that prior work has successfully edited.
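
As a rough illustration of what "a linear probe scored by AUROC" means, the sketch below fits a logistic-regression probe on synthetic vectors and computes AUROC from pairwise rankings. The synthetic data, the function names, and the training settings are assumptions for illustration only; they do not reproduce the paper's models, layers, or numbers.

```python
import numpy as np


def fit_linear_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(trace is correct)
        grad = p - y                              # log-loss gradient w.r.t. the logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg)))


# Synthetic stand-in for hidden states: one direction carries the correctness label.
rng = np.random.default_rng(0)
y = (np.arange(200) % 2).astype(float)
X = rng.normal(size=(200, 16))
X[:, 0] += 2.0 * y
w, b = fit_linear_probe(X, y)
probe_auroc = auroc(X @ w + b, y)
```

In practice the probe would be evaluated on held-out traces rather than on its own training data, as the toy example does here.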

2605.09498 2026-05-12 cs.LG cs.AI

Spectral Transformer Neural Processes

Xianhe Chen, Hao Chen, Yingzhen Li

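AI summary: The paper proposes Spectral Transformer Neural Processes (STNPs) for time series, spatial data, and images with strong periodicity and quasi-periodicity. Building on Transformer Neural Processes (TNPs), the method adds a frequency-aware mechanism: a Spectral Aggregator estimates the context spectrum and produces task-adaptive spectral features, strengthening the model's ability to capture periodic structure. Experiments on several synthetic and real-world datasets show that STNPs consistently improve predictive performance over existing methods, extending Neural Process models to periodic modelling.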

Comments 37 pages, 10 figures, 18 tables

Abstract

Time series, spatial data, and images are natural applications of Neural Processes. However, when such data exhibit strong periodicity and quasi-periodicity, existing methods often suffer from underfitting and generalise poorly beyond the training distribution. In this work, we propose Spectral Transformer Neural Processes (STNPs), a frequency-aware extension of Transformer Neural Processes (TNPs). STNPs introduce a Spectral Aggregator that estimates an empirical context spectrum, compresses it into a spectral mixture, samples task-adaptive spectral features, and concatenates them with time-domain embeddings, thereby injecting a spectral-mixture-kernel bias into TNPs. This design reshapes the similarity geometry, allowing inputs that are distant in Euclidean space to remain close in an induced periodic manifold while enhancing time-frequency interactions. Extensive experiments on synthetic regression tasks, real-world time-series datasets, and an image dataset demonstrate that STNPs consistently improve predictive performance over existing baselines, extending Neural Processes beyond translation equivariance towards effective modelling of periodicity and quasi-periodicity.

2605.09497 2026-05-12 cs.AI cs.CR

Don't Click That: Teaching Web Agents to Resist Deceptive Interfaces

Yilin Zhang, Yingkai Hua, Chunyu Wei, Xin Wang, Yueguo Chen

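AI summary: The paper studies how vulnerable vision-language-model-based web agents are to deceptive interfaces and proposes DUDE, a two-stage defense framework that combines hybrid-reward learning with asymmetric penalties to improve an agent's ability to recognize and resist deceptive interfaces. It also builds RUC, a benchmark for evaluating and advancing work in this area. Experiments show that DUDE reduces susceptibility to deceptive interfaces while maintaining task performance, providing an effective foundation for safer web agent systems.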

Comments Accepted to ACL 2026 Main Conference. 23 pages, 8 figures, 19 tables

Abstract

Vision-language model (VLM) based web agents demonstrate impressive autonomous GUI interaction but remain vulnerable to deceptive interface elements. Existing approaches either detect deception without task integration or document attacks without proposing defenses. We formalize deception-aware web agent defense and propose DUDE (Deceptive UI Detector & Evaluator), a two-stage framework combining hybrid-reward learning with asymmetric penalties and experience summarization to distill failure patterns into transferable guidance. We introduce RUC (Real UI Clickboxes), a benchmark of 1,407 scenarios spanning four domains and deception categories. Experiments show DUDE reduces deception susceptibility by 53.8% while maintaining task performance, establishing an effective foundation for robust web agent deployment.

2605.09496 2026-05-12 cs.CL cs.LG

Beyond Language: Format-Agnostic Reasoning Subspaces in Large Language Models

Aojie Yuan, Zhiyuan Su

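AI summary: The study asks whether large language models share a unified reasoning representation across different symbolic systems such as English prose, code, and mathematical notation. Using the TriForm benchmark, it finds a Format-Agnostic Reasoning Subspace (FARS) in the middle layers that extracts concept structure while suppressing form information. Replacing only the 10 dimensions of this subspace preserves 90-96% of model output, which confirms its central role in cross-form reasoning and supports the Platonic Representation Hypothesis. The study also uncovers a declarative-procedural asymmetry, indicating that the key axis of divergence between forms is declarative versus procedural rather than linguistic versus formal.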

Comments Preprint. 13 pages, 13 figures, 12 tables

Abstract

Large language models represent the same reasoning in vastly different surface forms -- English prose, Python code, mathematical notation -- yet whether they share a common internal substrate across these symbolic systems remains unknown. We introduce the TriForm Benchmark (18 concepts x 6 forms x 3 instances = 324 stimuli) and study five LLMs (1.6B-8B) across three architecture families. Using permutation-corrected RSA, cross-form probing, and activation patching, we find converging evidence for a Format-Agnostic Reasoning Subspace (FARS) in middle layers. We make FARS concrete: concept-centroid PCA extracts a 10-dimensional subspace that amplifies concept structure 3x while suppressing form information to near zero. Replacing only these 10 dimensions during cross-form patching preserves 90-96% of model output -- far exceeding both full activation replacement (44-56%) and variance-maximizing PCA (60-74%) -- while ablating them causes targeted disruption. FARS generalizes to held-out concepts and converges across architectures (CCA > 0.79 for all model pairs), providing within-modality evidence for the Platonic Representation Hypothesis. We further discover a declarative-procedural asymmetry: representations are far more compatible between prose and mathematics than between either and code, suggesting that the critical axis of divergence is not linguistic vs. formal but declarative vs. procedural.
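
The idea of concept-centroid PCA followed by subspace-only patching can be sketched as follows. This is a minimal illustration under stated assumptions: activations are random stand-ins, centroids are plain per-concept means, and the helper names `concept_centroid_pca` and `patch_subspace` are hypothetical. The paper's exact averaging and patching procedure may differ.

```python
import numpy as np


def concept_centroid_pca(acts, concept_ids, k):
    """Top-k principal directions of the per-concept mean activations."""
    concepts = np.unique(concept_ids)
    centroids = np.stack([acts[concept_ids == c].mean(axis=0) for c in concepts])
    centered = centroids - centroids.mean(axis=0)
    # Rows of Vt are orthonormal directions ordered by explained variance.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:k]


def patch_subspace(target, source, basis):
    """Replace the target's component inside span(basis) with the source's component."""
    return target + (source - target) @ basis.T @ basis


# Random stand-in activations: 18 concepts x 3 instances, hidden size 32.
rng = np.random.default_rng(0)
concept_ids = np.repeat(np.arange(18), 3)
acts = rng.normal(size=(54, 32))
basis = concept_centroid_pca(acts, concept_ids, k=10)
patched = patch_subspace(acts[0], acts[1], basis)
```

After patching, the activation agrees with the source inside the 10-dimensional subspace and with the target everywhere else, which is what distinguishes this from full activation replacement.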

2605.09494 2026-05-12 cs.RO cs.AI

LASSA Architecture-Based Autonomous Fault-Tolerant Control of Unmanned Underwater Vehicles

Hong Chen, Zixiang Tang, Yuanbao Chen, Yu Liu

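AI summary: The paper proposes an autonomous fault-tolerant control method based on the LASSA architecture for highly reliable operation of unmanned underwater vehicles (UUVs) in communication-constrained environments. The method combines a large language model (LLM) with an intelligent agent to identify unknown faults and replan tasks autonomously, while a solver verifies physical constraints, suppressing model hallucinations and keeping decisions interpretable. Experiments show that under anomalies such as a rudder fault the framework adjusts trajectory parameters, satisfies the constraints, and completes the mission, balancing decision intelligence against real-time control.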

Abstract

Unmanned underwater vehicles (UUVs) operate persistently in communication-constrained environments, thus requiring high-level autonomous fault-tolerant control under faulty operating conditions. Existing approaches rely heavily on predefined hard-coded rules and struggle to achieve effective fault-tolerant control against unforeseen faults. Although large language models (LLMs) possess powerful cognitive and reasoning capabilities, their inherent hallucinations remain a major obstacle to their application in UUV control systems. This paper proposes an intelligent control method based on the LASSA (LLM-based Agent with Solver, Sensor and Actuator) architecture. Within this architecture, an LLM identifies unknown faults and accomplishes task replanning via autonomous reasoning without hard-coded rules; the intelligent agent undertakes perception, scheduling and decision evaluation; the solver verifies physical boundary feasibility constraints prior to command transmission to the actuators. This architecture suppresses physically infeasible LLM hallucinations and ensures interpretable, verifiable decision-making. Moreover, it enables fast-slow dual closed-loop collaborative control, where the slow loop undertakes high-level dynamic decision-making and the fast loop guarantees high-frequency real-time control, simultaneously balancing decision intelligence and control timeliness. Lake experiments under normal and lower-rudder-fault conditions show that the framework detects trajectory tracking abnormalities, replans the route by adjusting the turning radius from 4 m to 12 m and reducing speed from 2 kn to 1 kn, passes all three solver constraints on the first invocation, and guides the UUV to complete the full mission; under normal conditions no false fault alarms are raised throughout the run.

2605.09490 2026-05-12 cs.CL cs.AR cs.LG

Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning

Aojie Yuan, Tianqi Shen, Dajun Zhang

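AI summary: Reasoning with large language models generates many intermediate thinking tokens whose cache must occupy scarce GPU high-bandwidth memory (HBM), creating a bottleneck. The paper proposes a semantics-aware memory hierarchy that assigns tokens of differing importance to different tiers (HBM, DDR memory, compressed storage, and eviction), reducing reliance on HBM. Tiering is driven by cumulative attention scoring, and offloading incurs zero approximation error; experiments show substantially lower HBM occupancy at a modest transfer overhead while retaining high reasoning accuracy.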

Comments Preprint. 14 pages + appendix. Under review at AdaptFM Workshop @ ICML 2026

Abstract

Reasoning LLMs produce thousands of chain-of-thought tokens whose KV cache must reside in scarce GPU HBM. The dominant response -- permanently evicting low-importance tokens -- is catastrophic for reasoning: accuracy collapses to 0-2.5% when half the cache is removed. We ask a different question: must every token live in HBM, or can some live elsewhere? We introduce a semantics-aware memory hierarchy that sorts tokens into four tiers -- HBM, DDR, compressed, and evicted -- using cumulative attention scoring. Low-importance tokens are moved to CPU memory rather than destroyed; before each attention step they are prefetched back at full precision, contributing exactly the same terms as if they had never left the GPU. We formalize this as zero-approximation-error offloading and derive our central finding: accuracy depends solely on how many tokens are permanently discarded (the eviction ratio), not on how many remain in HBM. A controlled 3x3 grid over HBM and eviction ratios confirms this across three model scales (7B-32B) and four benchmarks. With only 3% eviction, the hierarchy retains 91% of full-cache accuracy on GSM8K and 71% on MATH-500 (n=200); at 14B scale it matches the uncompressed baseline (90% vs. 86%) while halving HBM occupancy. A head-to-head reproduction of R-KV -- the current SOTA eviction method -- on our setup achieves only 0-32% at comparable budgets. A system prototype with real GPU-CPU data movement shows that the price of this preservation is modest -- 5-7% transfer overhead -- and scaling analysis projects 2-48 GB HBM savings at production batch sizes.
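
The claim that prefetched tokens contribute "exactly the same terms as if they had never left the GPU" can be checked on a toy single-query attention. The sketch below is an illustration only: the half-and-half tiering rule and the array copies standing in for CPU memory are assumptions, not the paper's scoring policy or system prototype.

```python
import numpy as np


def attention(q, K, V):
    """Single-query softmax attention over a key/value cache."""
    logits = K @ q / np.sqrt(q.shape[0])
    w = np.exp(logits - logits.max())   # subtract the max for numerical stability
    w /= w.sum()
    return w @ V


rng = np.random.default_rng(0)
d, n = 8, 32
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

# Hypothetical tiering: the first half stays "in HBM", the rest is "offloaded".
in_hbm = np.arange(n) < n // 2
K_cpu, V_cpu = K[~in_hbm].copy(), V[~in_hbm].copy()   # stand-in for a CPU-side copy

full = attention(q, K, V)
# Offload + prefetch: the offloaded entries rejoin the cache at full precision.
prefetched = attention(q, np.vstack([K[in_hbm], K_cpu]), np.vstack([V[in_hbm], V_cpu]))
# Eviction: the offloaded entries are dropped for good.
evicted = attention(q, K[in_hbm], V[in_hbm])
```

Here `prefetched` matches `full` while `evicted` does not, which is consistent with the abstract's finding that accuracy depends on the eviction ratio rather than on how many tokens remain resident in HBM.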

2605.09487 2026-05-12 cs.LG

Kintsugi: Learning Policies by Repairing Executable Knowledge Bases

Teng Cao, Yu Deng, Hikaru Shindo, Quentin Delfosse, Lanxi Wen, Suli Wang, Jannis Blüml, Christopher Tauchmann, Kristian Kersting

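AI summary: The paper proposes Kintsugi, a white-box policy-learning framework that addresses the difficulty of inspecting, recombining, and reusing the task knowledge of modern embodied agents. It treats policy improvement as verifier-gated construction of an executable knowledge base and improves policy knowledge through localized typed edits rather than language-model reasoning. At inference Kintsugi makes no LLM calls: a deterministic symbolic executor runs the knowledge base directly, achieving strong performance on long-horizon text-agent and object-centric manipulation tasks while keeping the knowledge inspectable and editable.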

Abstract

Modern embodied agents achieve impressive performance, but their task knowledge is often stored in neural weights, latent state, or prompt-bound memory, making individual policy knowledge difficult to inspect, validate, recombine, and reuse. We introduce Kintsugi, a white-box policy-learning framework that treats embodied policy improvement as verifier-gated construction of a typed executable Knowledge Base (KB). Kintsugi represents task-level policy knowledge as composable typed entries -- predicates, operators, policy schemas, monitors, recovery rules, experience records, and goals -- and improves this artifact through localized typed edits induced from rollout evidence, rather than relying on test-time language-model reasoning. Between rollouts, a tool-constrained agentic editing loop diagnoses trajectory failures, localizes them to editable KB layers, and proposes candidate edits. A deterministic verification gate admits an edit only when the candidate type-checks, the resulting KB executes, and focused validation success or trajectory-health metrics improve without violating protected-regression checks. At inference, the accepted KB is executed by a deterministic symbolic executor with zero LLM calls. Across long-horizon text-agent benchmarks and representative object-centric manipulation settings, Kintsugi achieves strong endpoint performance while preserving inspectability, local editability, and verifier-gated deployment. These results suggest that embodied policy improvement can be organized around executable task knowledge.

2605.09486 2026-05-12 cs.LG cs.AI quant-ph

CTQWformer: A CTQW-based Transformer for Graph Classification

Zhan Li, Wuqing Yu, Yusen Wu, Chuan Wang

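AI summary: The paper proposes CTQWformer, a graph classification model based on continuous-time quantum walks (CTQW), to address the difficulty graph neural networks and Transformer architectures have in capturing global structural dependencies and dynamic information propagation. A trainable Hamiltonian fuses graph structure and node features to model quantum-walk dynamics in a physically grounded way; the extracted structural information feeds a Graph Transformer module, where it acts as a structural bias in self-attention, and a Graph Recurrent Module, which models temporal evolution patterns. Experiments show that CTQWformer outperforms conventional graph kernels and graph neural networks on several benchmark graph classification datasets, and the authors describe it as the first hybrid CTQW-based graph Transformer combining quantum dynamics with a trainable deep learning framework.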

Abstract

Graph Neural Networks (GNNs) and Transformer-based architectures have achieved remarkable progress in graph learning, yet they still struggle both to capture global structural dependencies and to model dynamic information propagation. In this paper, we propose CTQWformer, a hybrid graph learning framework that integrates continuous-time quantum walks (CTQW) with GNNs. CTQWformer employs a trainable Hamiltonian that fuses graph topology and node features, enabling physically grounded modeling of quantum walk dynamics that captures rich and intricate graph structure information. The extracted CTQW-based representations are incorporated into two complementary modules: (i) a Graph Transformer module that embeds final-time propagation probabilities as structural biases in the self-attention mechanism, and (ii) a Graph Recurrent Module that captures temporal evolution patterns with bidirectional recurrent networks. Extensive experiments on benchmark graph classification datasets show that CTQWformer outperforms graph kernel and GNN-based methods, highlighting the potential of integrating quantum dynamics into trainable deep learning frameworks for graph representation learning. To the best of our knowledge, CTQWformer is the first hybrid CTQW-based Transformer, integrating CTQW-derived structural bias with temporal evolution modeling to advance graph learning.
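
For readers unfamiliar with continuous-time quantum walks, the propagation probabilities mentioned above come from the unitary evolution U(t) = exp(-iHt), with the probability of moving from node j to node i given by |U_ij(t)|^2. The sketch below is a minimal illustration that uses a fixed adjacency matrix as the Hamiltonian; that choice and the function name `ctqw_probabilities` are assumptions, whereas the paper learns a Hamiltonian that also incorporates node features.

```python
import numpy as np


def ctqw_probabilities(H, t):
    """Transition probabilities |exp(-iHt)|^2 for a Hermitian Hamiltonian H."""
    evals, evecs = np.linalg.eigh(H)
    # Spectral form of the matrix exponential: U = V diag(exp(-i * lambda * t)) V^H
    U = (evecs * np.exp(-1j * evals * t)) @ evecs.conj().T
    return np.abs(U) ** 2


# Path graph on three nodes, with the adjacency matrix standing in for H.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
P = ctqw_probabilities(A, t=1.0)
```

Each row of `P` sums to one because the evolution is unitary, so a matrix like this could serve as an additive bias on attention logits in the spirit of the Graph Transformer module described in the abstract.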

2605.09485 2026-05-12 cs.LG stat.ML

SEMASIA: A Large-Scale Dataset of Semantically Structured Latent Representations

Mario Edoardo Pandolfo, Enrico Grimaldi, Lorenzo Marinucci, Leonardo Di Nino, Simone Fiorellino, Sergio Barbarossa, Paolo Di Lorenzo

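AI summary: The paper introduces SEMASIA, a large-scale dataset of semantically structured latent representations extracted from roughly 1,700 pretrained vision models across eight standard image-classification benchmarks. The dataset comes with structured metadata describing model architecture, training regime, pretraining source, and model scale, and targets the problem that latent-space geometry is incompatible across models. By analyzing the conceptual organization of latent spaces, the performance of alignment mappings, and the influence of pretraining data and model properties on representations, the study demonstrates SEMASIA's value for interpretability, transfer learning, and related tasks.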

Abstract

Latent representations learned by neural networks often exhibit semantic structure, where concept similarity is reflected by geometric proximity in embedding space. However, comparing such spaces across models remains difficult: changes in architecture, pretraining data, objective, or random seed can yield embeddings with similar content but incompatible geometry. This latent space alignment problem is central to interpretability, transfer and multimodal learning, federated systems, and semantic communication; however, progress remains limited by the lack of large-scale, model-diverse, and metadata-rich benchmarks. To address this gap, we introduce SEMASIA, a large-scale collection of latent representations extracted from approximately 1,700 pretrained vision models across eight standard image-classification benchmarks. SEMASIA pairs embeddings with structured metadata describing architectures, training regimes, pretraining sources, and model scale. We demonstrate three applications of the resource. First, we analyze the conceptual organization of individual latent spaces, showing consistent prototype-like clustering and hierarchical semantic neighborhoods across models and datasets. Second, we benchmark supervised alignment mappings between latent spaces using reconstruction error and downstream task performance. Third, we perform a large-scale regression analysis of how pretraining-data complexity, specialization, transfer learning, augmentation, and model scale relate to geometric and probing properties of embeddings. By coupling representational scale with standardized metadata, SEMASIA provides a reproducible foundation for studying latent geometry, evaluating alignment methods, and developing next-generation heterogeneous and interoperable AI systems.

2605.09483 2026-05-12 cs.CL cs.AI cs.LG

A Cognitively Grounded Bayesian Framework for Misinformation Susceptibility

Pranava Madhyastha

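AI summary: The paper proposes the Bounded Pragmatic Listener (BPL), a cognitively grounded Bayesian framework for modelling susceptibility to misinformation. Drawing on bounded rationality, the framework introduces three cognitive bounds (working-memory limits, an information bottleneck, and importance sampling) to simulate human decision-making in information processing more realistically. Experiments on the LIAR and MultiFC datasets validate BPL on veracity classification and support theoretical predictions such as the depth-mismatch paradox.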

Comments work in progress

Abstract

In this work-in-progress paper, we present the Bounded Pragmatic Listener (BPL), a cognitively grounded Bayesian framework for modelling susceptibility to information disorder. BPL extends Rational Speech Act theory with three cognitively motivated bounds derived from the bounded rationality literature: (a) a recursion depth bound (reflecting working memory limits); (b) a prior compression parameter (aimed at capturing an information bottleneck); and (c) an availability sample size (operationalising importance sampling with saliency-weighted proposals). This allows us to test predictions about misinformation susceptibility, annotator disagreement, and the differential vulnerability to mis-, dis-, and mal-information as defined in the Information Disorder framework. We validate BPL on the LIAR and MultiFC benchmarks, showing competitive veracity classification and experimental support for the depth-mismatch paradox.

2605.09477 2026-05-12 cs.CV cs.AI

Outlier-Robust Diffusion Solvers for Inverse Problems

Yang Zheng, Jiahua Liu, Tongyao Pang, Wen Li, Zhaoqiang Liu

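AI summary: The paper studies how to solve inverse problems with diffusion models when outliers are present. To improve robustness, the authors first refine the measurement via explicit noise estimation and formulate an iteratively reweighted least squares objective based on the Huber loss; they then propose a gradient-descent-based optimization method and employ the conjugate gradient method to avoid learning-rate tuning. Experiments on multiple image datasets show robustness to outliers and better results than recent diffusion-based methods in most cases.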

Comments Accepted by CVPR 2026

Abstract

Methods based on diffusion models (DMs) for solving inverse problems (IPs) have recently achieved remarkable performance. However, DM-based methods typically struggle against outliers, which are common in real-world measurements. In this work, to tackle IPs with outliers, we first refine the measurement via explicit noise estimation to mitigate the effect of noise. Subsequently, we formulate an iteratively reweighted least squares objective based on the Huber loss to address the outliers. We propose a method utilizing gradient descent to approximately solve the corresponding optimization problem for the robust objective. To avoid delicate tuning of the learning rate required by the gradient descent method, we further employ the conjugate gradient method with an efficient strategy for updating. Extensive experiments on multiple image datasets for linear and nonlinear tasks under various conditions demonstrate that our proposed methods exhibit robustness to outliers and outperform recent DM-based methods in most cases.
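
The iteratively reweighted least squares (IRLS) treatment of the Huber loss can be illustrated on a plain linear problem y = Ax + noise. The sketch below is a generic textbook IRLS with a direct solve, not the paper's diffusion-based solver; the function name `huber_irls`, the threshold `delta`, and the synthetic data are assumptions for illustration.

```python
import numpy as np


def huber_irls(A, y, delta=1.0, iters=20):
    """Minimize the Huber loss of y - A x by iteratively reweighted least squares."""
    x = np.linalg.lstsq(A, y, rcond=None)[0]          # ordinary least-squares start
    for _ in range(iters):
        r = np.abs(y - A @ x)
        # Huber weights: 1 for small residuals, delta/|r| for large ones.
        w = np.where(r <= delta, 1.0, delta / np.maximum(r, 1e-12))
        Aw = A * w[:, None]
        x = np.linalg.solve(A.T @ Aw, Aw.T @ y)       # weighted normal equations
    return x


# Synthetic measurements with a few gross outliers.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))
x_true = np.array([1.0, -2.0, 0.5])
y = A @ x_true + 0.01 * rng.normal(size=100)
y[:5] += 50.0                                          # corrupt five measurements
x_ls = np.linalg.lstsq(A, y, rcond=None)[0]
x_robust = huber_irls(A, y)
```

Each pass down-weights measurements with large residuals, so the corrupted entries pull on the estimate far less than they do in ordinary least squares. For large problems the inner solve would be replaced by an iterative method such as conjugate gradient, which is the direction the abstract describes.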

2605.09476 2026-05-12 cs.CL cs.AI

Align and Shine: Building High-Quality Sentence-Aligned Corpora for Multilingual Text Simplification

Kenji Hilasaca, Nouran Khallaf, Serge Sharoff

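AI summary: The paper studies how to build high-quality sentence-aligned corpora for multilingual text simplification. Because large-scale, high-quality datasets are scarce for languages other than English, it proposes a method for collecting and processing crowd-sourced simplification data from comparable corpora. Sentence-level alignment is derived from document-level data, yielding a publicly available dataset for training and testing text simplification systems in several languages (Catalan, English, French, Italian, and Spanish).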

Comments Accepted at BUCC 2026 workshop at LREC 2026

Abstract

Text simplification plays a crucial role in improving the accessibility and comprehensibility of written information for diverse audiences, including language learners and readers with limited literacy. Despite its importance, large-scale, high-quality datasets for training and evaluating text simplification models remain scarce for languages other than English. This paper reports an experimental study on the collection and processing of crowd-sourced simplification data from comparable corpora to construct a corpus suitable for both training and testing text simplification systems across multiple languages (Catalan, English, French, Italian and Spanish). We report mechanisms for sentence-level alignment from document-level data. The resulting dataset of aligned sentence pairs is publicly available.

2605.09472 2026-05-12 cs.LG cs.DS

Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases

Daniel Wolfson, Tal Wagner

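AI summary: The paper studies attention with positional bias in Transformer models through the lens of locality-sensitive hashing (LSH) and proposes a positional LSH method. The core idea is to view the ALiBi positional bias matrix as the expectation of block-diagonal binary masks generated by positional LSH, and to prove that the mean of sampled masks gives spectral-norm and max-norm approximation guarantees with high probability. The method reduces long-context ALiBi attention to multiple randomized short-context attention operations without positional bias, which improves computational efficiency, and experiments validate the theoretical analysis.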

Abstract

Positional encoding in transformers is commonly implemented through positional embeddings, attention masks, or bias terms, but formal connections between these mechanisms remain limited. We study attention with positional bias through the lens of locality-sensitive hashing (LSH), focusing on Attention with Linear Biases (ALiBi). We show that the ALiBi bias matrix is the expectation of contiguous block-diagonal binary masks induced by a "positional LSH" scheme. The empirical mean of masks sampled from this scheme yields spectral norm and max-norm approximation guarantees with bounded block sizes with high probability. This structural theorem implies a uniform approximation theorem for ALiBi-biased attention: with high probability over the sampled masks, the approximate attention output is accurate simultaneously for all query-key-value inputs and can be computed in near-linear time in the context length, reducing long-context ALiBi to a collection of randomized short-context regular (positionally unbiased) attention operations. Conceptually, this connects positional bias, masks, and positional embeddings in a single formal framework and suggests an approach to efficient ALiBi-biased attention. Experiments on large language models validate our theoretical findings.

2605.09469 2026-05-12 cs.CL

FinMoji: A Framework for Emoji-driven Sentiment Analysis in Financial Social Media

Ahmed Mahrous, Roberto Di Pietro

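AI summary: The paper studies emoji-based sentiment analysis on the financial social platform StockTwits, examining how reliable emojis are as indicators of investor sentiment and how they compare with traditional text analysis. Experiments with logistic regression and Transformer models find that emoji-only models reach an F1 score of about 0.75, while models combining text and emojis reach about 0.88; the emoji-only models have a far lower computational cost, which suits time-sensitive settings such as high-frequency trading. In addition, certain emojis and emoji combinations predict bullish or bearish trends with over 90% accuracy, highlighting the distinctive value of emojis in financial sentiment analysis.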

Abstract

This paper explores the use of emojis in financial sentiment analysis, focusing on the social media platform StockTwits. Emojis, increasingly prevalent in digital communication, have potential as compact indicators of investor sentiment, which can be critical for predicting market trends. Our study examines whether emojis alone can serve as reliable proxies for financial sentiment and how they compare with traditional text-based analysis. We conduct a series of experiments using logistic regression and transformer models. We further analyze the performance, computational efficiency, and data requirements of emoji-based versus text-based sentiment classification. Using a balanced dataset of about 528,000 emoji-containing StockTwits posts, we find that emoji-only models achieve F1 approximately 0.75, lower than text-emoji combined models, which achieve F1 approximately 0.88, but with far lower computational cost. This is a useful feature in time-sensitive settings such as high-frequency trading. Furthermore, certain emojis and emoji pairs exhibit strong predictive power for market sentiment, demonstrating over 90 percent accuracy in predicting bullish or bearish trends. Finally, our research reveals large statistical differences in emoji usage between financial and general social media contexts, stressing the need for domain-specific sentiment analysis models.

2605.09465 2026-05-12 cs.RO

High Precision Hydraulic Excavator Control for Heavy-Duty Grading

Lennart Werner, Pol Eyschen, Sean Costello, Andrei Cramariuc, Marco Hutter

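AI summary: The paper studies how to achieve high-precision automated grading control of hydraulic excavators in heavy-duty earthworks. Because different hydraulic architectures respond differently to operator commands and soil interaction forces, the authors propose a layered control approach with a hydraulic-aware low-level loop and a path-tracking layer, which a calibration process makes applicable to both load-sensing and negative-flow-control machines. Experiments show a 2.6-fold improvement in precision over an existing commercial solution and more effective use of the machine's pressure capability.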

Comments 12 pages, 19 figures, RSS 2026

Abstract

High-precision heavy-duty grading is a common step in earthworks, traditionally carried out manually by skilled operators. Removing a significant amount of material while achieving a high-precision surface requires substantial machine-specific experience. Different hydraulic architectures react differently to operator inputs and soil interaction forces, which makes generalizable controllers challenging. In this paper, we present an autonomous controller that achieves high-precision grading at expert-operator speed on Load Sensing and Negative Flow Control machines alike. We split our controller into two parts: (1) a hydraulic-aware low-level loop that is hydraulic architecture-specific and (2) a path-tracking layer that coordinates joint motions and responses. Through a calibration process, our technique is applicable to load-sensing and negative-flow-control machinery. To showcase its versatility, we benchmark our approach on two excavators with different hydraulics and compare it against a commercial state-of-the-art solution. Our technique (RMSE 1.8 cm) outperforms the commercial solution (RMSE 4.7 cm) in precision by a factor of 2.6 and improves machine usage by leveraging the maximum function pressure, as opposed to commercial solutions that stall prematurely.

2605.09463 2026-05-12 cs.CL

Beyond Position Bias: Shifting Context Compression from Position-Driven to Semantic-Driven

Jiwei Tang, Zhijing Huang, Xinyu Zhang, Chen Jason Zhang, Jianxing Yu, Libin Zheng, Rui Meng, Jian Yin

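AI summary: Large language models perform well across many tasks but face high computational overhead and information redundancy when handling long contexts. Existing soft prompt compression methods are limited by position bias, which causes performance instability and semantic fragmentation. The paper proposes SeCo, a semantically consistent context compression method that dynamically selects query-relevant semantic centers in the semantic space and merges the remaining tokens with consistency weighting, removing the dependence on physical position. Experiments show that SeCo delivers superior task performance, inference speed, and out-of-domain robustness across multiple benchmarks.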

Comments 20 pages, 6 figures

Abstract

Large Language Models (LLMs) have demonstrated exceptional performance across diverse tasks. However, their deployment in long-context scenarios faces high computational overhead and information redundancy. While soft prompt compression has emerged as a promising way to mitigate these costs by compressing sequences into compact embeddings, existing paradigms remain fundamentally constrained by position bias: they primarily rely on inserting learnable tokens at fixed positions or on grouping tokens according to their physical layout, thereby inducing performance instability and semantic fragmentation. To overcome this bottleneck, we propose Semantic Consistency Context Compression (SeCo), a method that shifts context compression from position-driven to semantic-driven. Rather than being constrained by physical token layout, SeCo dynamically anchors compression directly in the semantic space by selecting query-relevant tokens as semantic centers and aggregating the remaining tokens via consistency-weighted merging. This design inherently preserves semantic consistency while eliminating position bias. Extensive experiments on 14 benchmarks across two backbone models demonstrate that SeCo is consistently superior in downstream task performance, inference latency, and out-of-domain robustness. The code is available at https://anonymous.4open.science/r/seco-EE5E.

2605.09460 2026-05-12 cs.CV cs.AI

When Few Steps Are Enough: Training-Free Acceleration of Identity-Preserved Generation

Dongqi Zheng

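AI summary This paper studies how to accelerate identity-preserved image generation by reducing the number of generation steps. The author proposes a retraining-free approach that swaps the pretrained diffusion backbone and disables classifier-free guidance, substantially improving generation efficiency while maintaining high identity similarity. Experiments show that identity fidelity is already high in the early generation steps, while later steps mainly refine detail, making this an efficient and practical optimization strategy for identity-preserved generation.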

English abstract

Identity-preserved image generation is typically built on many-step diffusion backbones, making personalized generation expensive at deployment time. We show that this cost is often unnecessary for identity-conditioned FLUX generation. A frozen InfuseNet identity adapter trained with dev transfers directly to the distilled schnell backbone without retraining. This two-line replacement -- changing the backbone path and disabling classifier-free guidance -- reduces latency by 5.9x while improving ArcFace identity similarity by +0.028 and LPIPS by -0.016 over the standard 28-step dev baseline. To explain why this works, we analyze the denoising trajectory and find that identity fidelity enters an early effective regime, often within 4-8 steps, while later steps primarily refine visual detail, sharpness, and contrast. Adapter ablations confirm that identity formation depends on the identity adapter, while attention-stream norm probes suggest that the relative conditioning contribution decreases as sampling proceeds. Preliminary style-adapter and object-adapter sweeps on SDXL and SD1.5 show similar diminishing returns after intermediate steps. These results position distilled backbone replacement as a simple, training-free strategy for improving the efficiency-fidelity tradeoff of identity-preserved generation.

2605.09457 2026-05-12 cs.LG cs.AI cs.SI

RAwR: Role-Aware Rewiring via Approximate Equitable Partition

Riccardo Porcedda, Giuseppe Squillace, Bastian Epping, Andrea Vandin, Michael Schaub, Mirco Tribastone, Francesca Chiaromonte

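AI summary This paper proposes RAwR, a rewiring framework for graph neural networks (GNNs) that addresses the performance drop of GNNs on prediction tasks that depend on long-range interactions. The method augments the input graph with a quotient graph derived from approximate equitable partitions, which speeds up information propagation between nodes that share the same structural role and thereby reduces the total effective resistance of the system. Experiments show that RAwR achieves state-of-the-art performance on a range of benchmark datasets, and a theoretical analysis yields Spectral Role Lift (SRL), a metric for choosing the approximate equitable partition that maximizes predictive performance.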

English abstract

While Graph Neural Networks (GNNs) have demonstrated significant efficacy in node classification tasks, where predictions rely on local neighborhood information, the performance of GNNs often drops when prediction tasks depend on long-range interactions. These limitations are attributed to phenomena such as oversquashing, where structural bottlenecks restrict signal propagation across the network topology. To address this challenge, we introduce RAwR, a computationally efficient rewiring framework that augments the input graph with a quotient graph derived from equitable partitions. This approach facilitates accelerated communication between nodes that share identical structural roles, as identified by the Weisfeiler-Leman graph coloring, and thereby reduces the total effective resistance of the system. Furthermore, by employing an approximate definition of the equitable partition, RAwR enables a controllable reduction of the quotient graph, which, in its most condensed state, recovers the conventional Master Node rewiring technique. Empirical evaluations across a diverse suite of benchmarks -- including homophilic, heterophilic, and synthetic long-range datasets -- demonstrate that RAwR achieves state-of-the-art results. Our contribution is further supported by an analytical investigation using a teacher-student model of linear GNNs, which elucidates the theoretical foundations of role-based rewiring. This analysis leads to the formulation of Spectral Role Lift (SRL), a metric designed to identify the optimal approximate equitable partition for maximizing predictive performance.
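
The role-based rewiring described above can be sketched with standard 1-WL color refinement, whose stable coloring gives the coarsest equitable partition. The sketch below is illustrative only: the function names and the way role nodes are attached to the graph (`rewire`) are assumptions, and the approximate partition and SRL metric from the paper are not reproduced.

```python
def equitable_partition(adj):
    """Coarsest equitable partition via 1-WL color refinement.

    adj maps each node to the set of its neighbours (undirected graph).
    Returns a dict node -> color id; equal ids mean the same structural role.
    """
    colors = {v: 0 for v in adj}
    while True:
        # Signature: own color plus the sorted multiset of neighbour colors.
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                for v in adj}
        palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        new = {v: palette[sigs[v]] for v in adj}
        # Refinement only splits classes, so an unchanged count means stable.
        if len(set(new.values())) == len(set(colors.values())):
            return new
        colors = new


def quotient_edges(adj, colors):
    """Pairs of color classes that have at least one edge between members."""
    return {tuple(sorted((colors[u], colors[v]))) for u in adj for v in adj[u]}


def rewire(adj, colors):
    """Hypothetical augmentation: add one role node per class, link every
    member to its role node, and link role nodes as in the quotient graph."""
    g = {v: set(nb) for v, nb in adj.items()}
    for v, c in colors.items():
        role = ("role", c)
        g.setdefault(role, set()).add(v)
        g[v].add(role)
    for a, b in quotient_edges(adj, colors):
        if a != b:
            g[("role", a)].add(("role", b))
            g[("role", b)].add(("role", a))
    return g
```

Passing a constant coloring to `rewire` produces a single role node linked to every original node, which mirrors the abstract's remark that the most condensed partition recovers Master Node rewiring.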

2605.09455 2026-05-12 cs.CV

Adaptive 3D Convolution for Remote Sensing Image Fusion

Siran Peng, Xiangyu Zhu, Shang-Qi Deng, Liang-Jian Deng, Zhen Lei

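AI summary This paper studies remote sensing image fusion, which aims to produce a high-resolution multi/hyper-spectral image from a high-resolution image with limited spectral information and a low-resolution image with rich spectral data. To address the shortcomings of existing methods in spectral preservation and computational efficiency, the authors propose Adaptive 3D Convolution (Ada3D), which generates a unique set of 3D kernels for each input voxel by combining spatial and spectral information, and uses group convolution to reduce computational complexity. Experiments show state-of-the-art performance on five datasets.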

Comments Accepted by IEEE Transactions on Image Processing (TIP), Early Access, 2026

English abstract

Remote sensing image fusion aims to create a high-resolution multi/hyper-spectral image from a high-resolution image with limited spectral information and a low-resolution image with abundant spectral data. Recently, deep learning (DL) techniques have shown significant effectiveness in this area. Most DL-based methods approach image fusion as a 2D problem by encoding spectral information into feature map channels. However, our research suggests that this strategy introduces notable spectral distortions. In contrast, some methods consider spectral data as an additional dimension, utilizing standard 3D convolutions to preserve spectral information. Nevertheless, in a standard 3D convolutional layer, the same set of kernels is applied across all input regions, which we have found to be sub-optimal for image fusion. Furthermore, standard 3D convolutions necessitate substantial computational resources. To address these challenges, we propose a novel convolutional paradigm called Adaptive 3D Convolution (Ada3D) for remote sensing image fusion. Ada3D applies a unique set of 3D kernels to each input voxel, enabling the capture of fine-grained details. These adaptive kernels are generated through a two-step process: (i) spatial and spectral kernels are derived from their respective image sources; (ii) these two types of kernels are then combined to form content-aware 3D kernels that effectively integrate spatial and spectral information. Additionally, adaptive biases are introduced to enhance the convolutional outcome at the voxel level. Furthermore, we incorporate the group convolution technique to reduce computational complexity. As a result, Ada3D offers full adaptivity in an efficient manner. Evaluation results across five datasets demonstrate that our method achieves SOTA performance, underscoring the superiority of Ada3D. The code is available at https://github.com/PSRben/Ada3D.

2605.09449 2026-05-12 cs.CV

SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs

Bo Gu, Zhikang Zhang, Zizhuang Wei, Zhenyuan Chen, Lingyun Li, Zhuoyi Song

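AI summary Multimodal large language models (MLLMs) have made notable progress in visual understanding and language reasoning, but they lack a persistent, world-centered spatial representation of 3D environments. This work proposes SpaceMind++, a video MLLM architecture that builds a voxelized cognitive map from RGB video to preserve object permanence and spatial topology. The model introduces Coordinate-Guided Deep Iterative Fusion, which feeds map-level spatial knowledge back into the original 2D visual features, strengthening spatial reasoning without disrupting the native visual interface. Experiments show that SpaceMind++ performs strongly on multiple benchmarks and generalizes better to unseen 3D environments.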

Comments 14 pages, 3 figures

English abstract

Recent multimodal large language models (MLLMs) have made remarkable progress in visual understanding and language-based reasoning, yet they lack a persistent world-centered representation for spatially consistent reasoning in 3D environments. Inspired by the mammalian dual-stream system, where semantic and spatial cues are processed separately and integrated into an allocentric cognitive map, we propose SpaceMind++, a video MLLM architecture that explicitly builds a voxelized cognitive map from RGB videos. This map reorganizes fragmented egocentric observations into a shared 3D metric representation, enabling the model to preserve object permanence and spatial topology across changing viewpoints. To make this allocentric representation usable by a pretrained video MLLM without disrupting its native visual-token interface, we introduce Coordinate-Guided Deep Iterative Fusion, a new mechanism that relays map-level spatial knowledge back into the original 2D visual features. This fusion is explicitly guided by coordinate embeddings and 3D Rotary Positional Encoding, which ground semantic interactions in metric 3D space, resembling the entorhinal binding of sensory features to metric space. Extensive experiments show that SpaceMind++ achieves new state-of-the-art performance on VSI-Bench. Furthermore, it demonstrates superior out-of-distribution generalization on SPBench, SITE-Bench, and SPAR-Bench, underscoring its robustness in unseen 3D environments.

2605.09448 2026-05-12 cs.LG

Learning to Bid with Unknown Private Values in Budget-Constrained First-Price Auctions

Zihao Hu, Yuxiao Wen, Yuan Yao, Jiheng Zhang, Zhengyuan Zhou

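AI summary This paper studies learning to bid in first-price auctions under budget constraints and return-on-spend targets, where the bidder's valuations cannot be observed directly and must be inferred from censored data. The authors propose a unified primal-dual framework that jointly learns the latent valuation parameters and the competitor's bid distribution. By leveraging a strong Slater condition and an adaptive burn-in procedure, the method stabilizes the dual variables and achieves near-optimal regret bounds, giving the first theoretically grounded solution for constrained bidding with latent valuations.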

English abstract

The transition to First-Price Auctions (FPA) in digital advertising has spurred significant research, yet existing work typically assumes access to a valuation oracle, ignoring the reality that values must be inferred from censored data. While Linear Treatment Effect (LTE) models address this by learning value uplift, they have not been adapted to realistic settings with hard budget constraints or Return-on-Spend (RoS) targets requiring regret and violation control. In this work, we propose a unified primal-dual framework for constrained FPAs that jointly learns the latent LTE valuation parameters and the competitor's bid distribution. This simultaneous learning introduces a critical technical challenge: the estimation error is dynamically scaled by the Lagrangian multiplier, potentially leading to unbounded regret. We resolve this by leveraging a strong Slater condition and a novel adaptive burn-in procedure to stabilize the dual variables. Our approach achieves near-optimal regret guarantees, providing the first theoretically grounded solution for constrained bidding with latent valuations.
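
For readers unfamiliar with primal-dual bidding, a generic budget-paced bidder can be sketched as follows. This is a textbook-style dual-ascent loop, not the paper's algorithm: it assumes the values are known, that rival bids are revealed after every round, and it omits the LTE value learning, the RoS constraint, and the burn-in procedure. All names (`win_prob`, `best_bid`, `run_bidder`) and the bid grid are hypothetical.

```python
def win_prob(bid, rival_bids):
    """Empirical probability that `bid` beats a rival bid (ties count as wins)."""
    return sum(1 for r in rival_bids if r <= bid) / len(rival_bids)


def best_bid(value, lam, rival_bids, grid):
    """Pick the grid bid maximizing (value - (1 + lam) * bid) * P(win)."""
    return max(grid,
               key=lambda b: (value - (1.0 + lam) * b) * win_prob(b, rival_bids))


def run_bidder(values, rivals, budget, eta=0.05):
    rho = budget / len(values)          # target spend per round
    grid = [i / 20 for i in range(21)]  # candidate bids in [0, 1]
    lam, spent = 0.0, 0.0
    seen = [0.5]                        # hypothetical prior guess of a rival bid
    history = []
    for value, rival in zip(values, rivals):
        bid = best_bid(value, lam, seen, grid)
        affordable = spent + bid <= budget
        # First-price rule: the winner pays its own bid.
        pay = bid if (bid >= rival and affordable) else 0.0
        spent += pay
        # Dual ascent: raise lam after overspending the target, lower it otherwise.
        lam = max(0.0, lam + eta * (pay - rho))
        seen.append(rival)
        history.append((bid, pay, lam))
    return spent, history
```

The multiplier `lam` scales the effective cost of each bid, which is where, per the abstract, estimation error in the learned quantities gets amplified and why stabilizing the dual variable matters.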

2605.09443 2026-05-12 cs.CV cs.CL

Through the Lens of Character: Resolving Modality-Role Interference in Multimodal Role-Playing Agent

Yihong Tang, Kehai Chen, Xuefeng Bai, Min Zhang

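AI summary As multimodal large language models advance, Role-Playing Agents (RPAs) are moving into visually grounded environments, but the generic visual features extracted by existing models tend to overpower character traits, causing Modality-Role Interference (MRI). This work proposes CAVI, a training-free Character-Aware Visual Intervention framework that mitigates MRI through Character-Guided Token Pruning, Orthogonal Feature Modulation, and Modality-Adaptive Role Steering, significantly improving character-consistent multimodal interaction.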

English abstract

The advancement of Multimodal Large Language Models (MLLMs) has expanded Role-Playing Agents (RPAs) into visually grounded environments. However, human vision is inherently subjective and identity-driven, whereas existing MLLMs extract objective, character-agnostic features for general tasks. In RPAs, this generic visual noise overpowers fragile character traits, causing Modality-Role Interference (MRI), where agents struggle to integrate visual grounding and character consistency. To address this, we introduce the training-free Character-Aware Visual Intervention (CAVI) framework, enabling agents to perceive the world through the lens of character. CAVI systematically targets MRI: macroscopically, Character-Guided Token Pruning (CTP) restricts the visual receptive field to role-relevant entities; microscopically, Orthogonal Feature Modulation (OFM) projects tokens onto a character-context subspace to extract aligned facts; and during decoding, Modality-Adaptive Role Steering (MARS) dynamically optimizes steering intensity based on visual reliance. Extensive experiments show CAVI effectively alleviates MRI, significantly enhancing character-consistent multimodal interactions.

2605.09442 2026-05-12 cs.CV cs.AI

SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation

Shanwen Tan, Hao Li, Jingtao Zhang, Xiaosong Jia, Xue Yang, Shaofeng Zhang, Yanyong Zhang

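AI summary SWIFT is an efficient framework for multi-prompt long-video generation that addresses the tension between semantic coherence and computational efficiency under continuous semantic switching. It introduces a lightweight Semantic Injection Cache and an Adaptive Dynamic Window, enabling efficient semantic switching without rebuilding the cache, and maintains temporal consistency through head-wise semantic injection and segment-level semantic anchors. Experiments show that SWIFT reaches 22.6 FPS on a single H100 GPU, substantially improving the efficiency of long-video generation.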

Comments Code is available at https://github.com/ShanwenTan/SWIFT

English abstract

Streaming long-video generation faces a central challenge in continuous semantic switching, requiring adaptive memory to preserve coherent visual evolution. Current approaches rely on cache rebuilding at prompt boundaries or fixed memory budgets, but they introduce redundant computation and limit flexible semantic adaptation. This limitation arises from a mismatch between cached video history and prompt updates, as memory preserves visual continuity while prompt switches demand rapid semantic adaptation. Motivated by this observation, we present SWIFT, Semantic Windowing and Injection for Flexible Transitions, a training-free framework for multi-prompt long-video generation that enables efficient semantic switching while preserving temporal coherence in causal video diffusion models. SWIFT introduces a lightweight Semantic Injection Cache that augments cached video memory rather than reconstructing it from scratch at every prompt boundary. To avoid uniformly perturbing all attention channels, we further perform head-wise semantic injection, so that each attention head receives a prompt update proportional to its alignment with the current video state. In addition, we introduce an Adaptive Dynamic Window that allocates temporal memory according to prompt phase, using larger local context near switching boundaries and smaller windows during stable segments to reduce average inference cost. To preserve long-range semantic consistency under compressed local attention, we further maintain segment-level semantic anchors that summarize prompt-conditioned video history and reintroduce it as compact memory tokens. Compared with current state-of-the-art methods, SWIFT preserves generation quality while achieving 22.6 FPS on a single H100 GPU, establishing a substantially more efficient solution for multi-prompt long-video generation. Our code is available at https://github.com/ShanwenTan/SWIFT.

2605.09441 2026-05-12 cs.RO

Beyond Isolation: A Unified Benchmark for General-Purpose Navigation

Samson Sun, Tianyi Yang, Tengyue Wang, Yikai Xue, Zhengjie Xu, Lingming Zhang, Qichen Zhang, Chao Liang, Zhipeng Zhang

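AI summary Progress on general-purpose navigation agents is held back by fragmented evaluation protocols that test navigation skills in isolation and focus on specific robot morphologies, failing to reflect the need to coordinate diverse behaviors in real-world scenarios. This work proposes the OmniNavBench benchmark to evaluate cross-skill coordination and cross-embodiment generalization. The benchmark introduces composite task instructions, support for multiple robot morphologies, and high-quality human demonstrations; it exposes the shortcomings of existing methods on general-purpose navigation and provides a new testbed for next-generation generalist navigation systems.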

Comments Accepted at RSS 2026

English abstract

The pursuit of general-purpose embodied agents is hindered by fragmented evaluation protocols that isolate navigation skills and fixate on specific robot morphologies, failing to reflect real-world scenarios where agents must orchestrate diverse behaviors across varying embodiments. To bridge this gap, we introduce OmniNavBench, a benchmark for cross-skill coordination and cross-embodiment generalization. OmniNavBench introduces three paradigm shifts: (1) Compositional Complexity. We propose composite instructions that interleave sub-tasks from 6 categories (PointNav, VLN, ObjectNav, SocialNav, Human Following and EQA), compelling agents to transition between exploration, interaction, and social compliance within a single episode. (2) Morphological Universality and Sensor Flexibility. We present a simulation platform that breaks the reliance on single-morphology evaluation, enabling generalization tests across humanoid, quadrupedal, and wheeled robots, with a modular sensor interface and 170 environments blending synthetic assets with real-world scans. (3) Demonstration Quality. Moving beyond shortest-path algorithms, we curate 1779 expert trajectories via human teleoperation, capturing behavioral nuances such as exploratory glances and anticipatory avoidance. Extensive evaluations demonstrate that current methods, despite their claimed unified design, struggle with the complex, interleaved nature of general-purpose navigation. This exposes a critical disparity between existing capabilities and real-world deployment demands, underscoring OmniNavBench as a testbed for the next generation of generalist navigators. Dataset, code, and leaderboard are available at http://omninavbench.cloud-ip.cc.

2605.09440 2026-05-12 cs.CL cs.AI

Key Coverage Matters: Semi-Structured Extraction of OCR Clinical Reports

Yu Wang, Yingyun Li, Ying Qin, Haiyang Qian

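AI summary Clinical reports are often scattered across healthcare institutions because of privacy regulations and data silos, which makes electronic health record integration and longitudinal tracking difficult. This paper proposes a key-based semi-structured information extraction method that builds a canonical key inventory through iterative mining, normalization, and clustering, and introduces key coverage as a measure of inventory completeness. Experiments show that model performance improves monotonically as key coverage increases; once the Top-90 keys are covered, F1 reaches 0.839 under exact match and 0.893 under boundary-tolerant matching, and the method can be adapted to other languages.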

Comments Preprint. Under review at MLHC 2026

English abstract

Clinical reports are often fragmented across healthcare institutions because privacy regulations and data silos limit direct information sharing. When patients seek care at a different hospital, they often carry paper or scanned reports from prior visits. This hinders EHR integration, longitudinal review, and downstream applications that depend on more complete patient records, such as patient management, follow-up care, real-world studies, and clinical-trial matching. Although OCR can digitize such reports, reliable extraction remains challenging because clinical documents are heterogeneous, OCR text is noisy, and many healthcare settings require low-cost on-premise deployment. We formulate this problem as canonical key-conditioned extractive question answering over OCR-derived clinical reports. Because the key fields are neither fixed nor known in advance, the key space is open. We maintain a canonical key inventory through iterative key mining, normalization, clustering, and lightweight human verification, and introduce key coverage as a metric to quantify inventory completeness. Using a 0.2B BERT-based model, experiments on real-world reports from more than 20 hospitals show performance improves monotonically with key coverage. The model achieves F1 scores of 0.839 and 0.893 under exact match and boundary-tolerant matching, respectively, once the Top-90 canonical keys are covered. These results show that key coverage is a dominant factor for end-to-end performance. At Top-90 coverage, our model outperforms a fine-tuned Qwen3-0.6B baseline under exact match. Although our annotated corpus is Chinese, the method relies on the language-agnostic key-value organization of semi-structured clinical reports and can be adapted to other settings given an appropriate canonical key inventory and alias mapping.
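
One plausible reading of key coverage is the share of key occurrences in the corpus that map to one of the Top-K canonical keys. The sketch below computes that quantity; the exact definition used in the paper is not given in the abstract, so the formula, the function name `key_coverage`, and the sample alias map are assumptions.

```python
from collections import Counter


def key_coverage(reports, alias_map, top_k):
    """Fraction of raw key occurrences covered by the top_k canonical keys.

    reports: list of reports, each a list of raw key strings from OCR text.
    alias_map: hypothetical mapping from a lowercased alias to a canonical key.
    """
    counts = Counter()
    total = 0
    for keys in reports:
        for raw in keys:
            total += 1
            canon = alias_map.get(raw.strip().lower())
            if canon is not None:
                counts[canon] += 1
    # Occurrences that fall under the top_k most frequent canonical keys.
    covered = sum(n for _, n in counts.most_common(top_k))
    return covered / total if total else 0.0


alias_map = {
    "wbc": "white_blood_cell_count",
    "white blood cell": "white_blood_cell_count",
    "hgb": "hemoglobin",
}
reports = [["WBC", "HGB", "PLT"], ["White blood cell", "hgb"]]
```

Here `PLT` has no entry in the alias map, so it counts against coverage until the inventory is extended, which is the kind of gap the iterative key-mining loop is meant to close.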

2605.09439 2026-05-12 cs.LG stat.ML

Inverse Design for Conditional Distribution Matching

Ori Meidler, Shaul Tolkovsky, Or Zuk

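AI summary This paper introduces Conditional Distribution Matching (CDM), a new inverse-design problem: given a joint distribution $\mathcal{P}(X, Y)$, find an input $x^*$ whose induced conditional distribution $\mathcal{P}(Y \mid X = x^*)$ matches a target distribution $\mathcal{G}(Y)$. To solve it, the authors propose the MLGD-F algorithm, which combines a pretrained diffusion model with a fast conditional sampler and requires no additional training. Experiments show that the method reliably recovers inputs that meet user-specified distributional targets across a range of tasks.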

English abstract

Generative models are powerful tools for sampling from a learned distribution $\mathcal{P}(Y \mid X)$, and inverse-design methods invert this map to find an input $x$ that produces a desired point output $y^*$. However, many design goals are naturally distributional rather than pointwise, incorporating the inherent uncertainty of $Y$ and targeting a specific form for it, a task not addressed by standard inverse design. To address this issue we introduce Conditional Distribution Matching (CDM), a new inverse-design problem class in generative modeling: given a joint distribution $\mathcal{P}(X, Y)$ and a target distribution $\mathcal{G}(Y)$, find an input $x^*$ whose induced conditional distribution $\mathcal{P}(Y \mid X = x^*)$ matches $\mathcal{G}$. We formally define two variants: Conditional Distribution Matching Sampling (CDMS) and Conditional Distribution Matching Optimization (CDMO). To solve these problems, we propose MLGD-F (Matching-Loss Guided Diffusion with a Fast inner sampler), a plug-and-play inference-time algorithm that combines a pretrained score-based diffusion model with a pretrained fast conditional sampler, requiring no additional training or fine-tuning. By leveraging single-step conditional sampling, MLGD-F enables tractable gradient computation, making the estimation of $\mathcal{P}(Y \mid X)$ both memory-efficient and computationally lightweight. We validate MLGD-F on synthetic benchmarks, structured image transformations, and generative editing optimization, demonstrating reliable recovery of inputs whose conditional distributions match diverse user-specified targets, including discrete mixtures and continuous low-rank supports.
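
One natural way to write the optimization variant of this problem, assuming a divergence $D$ between distributions over $Y$ (the choice of $D$ and the argmin form are assumptions for illustration, not taken from the abstract):

$$x^* \in \arg\min_{x} \; D\big(\mathcal{P}(Y \mid X = x) \,\|\, \mathcal{G}(Y)\big)$$

Under this reading, an exact match corresponds to $D = 0$, and standard point-target inverse design is the special case where $\mathcal{G}$ is a point mass at $y^*$.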

2605.09438 2026-05-12 cs.LG

fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery

Andreas D. Demou, Panagiotis Koromilas, James Oldfield, Yannis Panagakis, Mihalis A. Nicolaou

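AI summary This work addresses cross-layer feature extraction in pretrained Transformers and proposes fmxcoders, a method for recovering cross-layer features more effectively in a shared latent space. Standard crosscoders are limited by their cross-layer parameterization and cross-layer dependence, so the latents they learn mostly capture surface-level patterns rather than semantic concepts. By introducing low-rank tensor factorization and stochastic layer masking, fmxcoders improve the cross-layer coherence and semantic interpretability of the latents, and substantially improve probing and reconstruction performance on several base models.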

English abstract

Many features in pretrained Transformers span multiple layers: they emerge through stages of inference, persist in the residual stream, or are built jointly by parallel MLPs. Crosscoders (namely, sparse dictionaries trained jointly across layers) aim to recover these cross-layer features in a single shared latent space. We show that standard crosscoders largely fail at this purpose. Although their decoder weight norms spread evenly across layers, a functional coherence metric we introduce reveals that each latent's activation is effectively driven by only one or two layers on average. While functionally coherent latents act as human-interpretable concept detectors (e.g., US states and cities), the layer-localized latents that crosscoders predominantly learn collapse onto surface-level patterns such as digit detectors. We trace this failure to two structural limitations: unconstrained cross-layer parameterization and unregularized cross-layer dependence. We address both by introducing fmxcoders, which (i) replace the encoder and decoder with low-rank tensor factorizations that draw every latent's per-layer weights from a shared cross-layer basis, and (ii) apply stochastic layer masking, a denoising regularizer along the layer axis that penalizes latents whose contribution collapses when a single layer is masked. Across GPT2-Small, Pythia-410M, Pythia-1.4B, and Gemma2-2B, fmxcoders lift mean probing F1 by 10-30 points, surpassing per-layer SAE baselines that standard crosscoders fail to reach, reduce reconstruction MSE by 25-50%, and roughly double mean functional coherence. An LLM-as-a-judge evaluation further shows that fmxcoders recover 3-13$\times$ more semantically coherent latents than standard crosscoders across all four base LLMs.

2605.09433 2026-05-12 cs.CV

Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs

Yunhong Lu, Qichao Wang, Hengyuan Cao, Xiaoyin Xu, Min Zhang

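AI summary Existing preference datasets for text-to-image models usually store only the final winner and loser images, which is insufficient for rectified flow (RF) models, whose generation depends on a specific prior noise sample and follows a nearly straight denoising trajectory. This paper proposes Prior Noise-Aware Preference Optimization (PNAPO), an offline preference optimization framework for rectified flow that retains the paired prior noises used to generate the winner and loser images, extending the standard triplet to a sextuple, and uses the straight-line property of RF for noise-image interpolation, giving more accurate trajectory estimates and a tighter optimization objective. Experiments show that PNAPO significantly improves preference metrics on mainstream RF text-to-image models while reducing training compute.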

Comments Accepted by ICML 2026

English abstract

Existing preference datasets for text-to-image models typically store only the final winner/loser images. This representation is insufficient for rectified flow (RF) models, whose generation is naturally indexed by a specific prior noise sample and follows a nearly straight denoising trajectory. In contrast, prior DPO-style alignment for diffusion models commonly estimates trajectories using an independent forward noising process, which can be mismatched to the true reverse dynamics and introduces unnecessary variance. We propose Prior Noise-Aware Preference Optimization (PNAPO), an off-policy alignment framework specialized for rectified flow. PNAPO augments preference data by retaining the paired prior noises used to generate each winner/loser image, turning the standard (prompt, winner, loser) triplet into a sextuple. Leveraging the straight-line property of RF, we estimate intermediate states via noise-image interpolation, which constrains the trajectory estimation space and yields a tighter surrogate objective for preference optimization. In addition, we introduce a dynamic regularization strategy that adapts the DPO regularization based on (i) the reward gap between winner and loser and (ii) training progress, improving stability and sample efficiency. Experiments on state-of-the-art RF T2I backbones show that PNAPO consistently improves preference metrics while substantially reducing training compute.
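
The noise-image interpolation mentioned above follows from the straight-line form of rectified flow trajectories. A minimal sketch, assuming the convention that $t = 0$ is the clean image and $t = 1$ is the prior noise, and using flat Python lists in place of image tensors (the function names are hypothetical and this is not the PNAPO code):

```python
def interpolate(image, noise, t):
    """Intermediate state on the straight line between an image and the
    specific prior noise that generated it: x_t = (1 - t) * x_0 + t * eps."""
    return [(1.0 - t) * x + t * e for x, e in zip(image, noise)]


def velocity(image, noise):
    """Constant velocity along that line under this convention: eps - x_0."""
    return [e - x for x, e in zip(image, noise)]
```

Because the stored noise is the one that actually produced the image, every `interpolate` call stays on that sample's own trajectory, in contrast to drawing fresh independent noise for each training step.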