arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.13292 2026-05-14 cs.CL cs.AI cs.IR cs.LG

IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages

Shubham Kumar Nigam, Suparnojit Sarkar, Piyush Patel

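AI summary This paper introduces IndicMedDialog, a parallel multi-turn medical dialogue dataset covering English and nine Indic languages, aimed at improving the multilingual applicability and conversational realism of medical dialogue systems. The dialogues are generated with large language models, verified by native speakers, and refined through post-processing. Building on the dataset, the authors fine-tune IndicMedLM, a parameter-efficient medical language model, to support more personalised symptom elicitation. Comparisons against multilingual baselines and medical expert evaluation are used to assess the model's effectiveness and clinical plausibility.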

Comments Accepted in BioNLP @ ACL 2026 Conference

Abstract

Most existing medical dialogue systems operate in a single-turn question--answering paradigm or rely on template-based datasets, limiting conversational realism and multilingual applicability. We introduce IndicMedDialog, a parallel multi-turn medical dialogue dataset spanning English and nine Indic languages: Assamese, Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, Telugu, and Urdu. The dataset extends MDDial with LLM-generated synthetic consultations, translated using TranslateGemma, verified by native speakers, and refined through a script-aware post-processing pipeline to correct phonetic, lexical, and character-spacing errors. Building on this dataset, we fine-tune IndicMedLM via parameter-efficient adaptation of a quantized small language model, incorporating optional patient pre-context to personalise multi-turn symptom elicitation. We evaluate against zero-shot multilingual baselines, conduct systematic error analysis across ten languages, and validate clinical plausibility through medical expert evaluation.

2605.13290 2026-05-14 cs.AI

What properties of reasoning supervision are associated with improved downstream model quality?

Mikołaj Langner, Dzmitry Pihulski, Jan Eliasz, Michał Rajkowski, Przemysław Kazienko, Maciej Piasecki, Jan Kocoń, Teddy Ferdinan

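AI summary This paper investigates whether the utility of a reasoning dataset can be reliably predicted before training from intrinsic data metrics, reducing reliance on expensive trial-and-error fine-tuning. The authors propose a suite of quantitative metrics and validate them by fine-tuning 8B and 11B models on semantically distinct variants of a Polish reasoning dataset, finding significant correlations with downstream model performance. The predictors of utility turn out to be scale-dependent: smaller models rely on alignment-focused metrics to ensure precision, whereas larger models benefit from high redundancy and verbose reasoning traces when solving complex tasks. The result is a scale-aware framework for validating reasoning data that helps practitioners select training sets more efficiently.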

Comments To appear in the Proceedings of the International Conference on Computational Science (ICCS) 2026

Abstract

Validating training data for reasoning models typically requires expensive trial-and-error fine-tuning cycles. In this work, we investigate whether the utility of a reasoning dataset can be reliably predicted prior to training using intrinsic data metrics. We propose a suite of quantitative measures and evaluate their predictive power by fine-tuning 8B and 11B models on semantically distinct variants of a Polish reasoning dataset. Our analysis reveals that these intrinsic metrics demonstrate strong and significant correlations with downstream model performance. Crucially, we find that the predictors of utility are scale-dependent: smaller models rely on alignment-focused metrics to ensure precision, whereas larger models benefit from high redundancy, utilizing verbose traces to solve complex tasks. These findings establish a scale-aware framework for validating reasoning data, enabling practitioners to select effective training sets without the need for exhaustive empirical testing.

2605.13287 2026-05-14 cs.LG cs.AI math.OC stat.ML

Delightful Exploration

Ian Osband

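AI summary This paper proposes Delight-gated exploration (DE), an exploration strategy for large action spaces where the exploration budget is limited. DE decides whether to explore by comparing an action's "delight", the product of its expected improvement and its surprisal, against a gate price, so that scarce exploratory actions are spent more efficiently. Across several task families, DE shows weaker regret growth than Thompson Sampling and $\varepsilon$-greedy, and its hyperparameters transfer across tasks without retuning.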

Abstract

Most exploration algorithms search broadly until uncertainty is resolved. When the action space is too large to resolve within budget, practitioners default to $\varepsilon$-greedy, which bounds disruption but spends its override blindly. We introduce \textit{Delight-gated exploration} (DE), a host--override rule that spends exploratory actions only when their prospective delight (expected improvement times surprisal) exceeds a gate price. This practical heuristic recovers a classical result: Pandora's reservation-value rule for costly search, with surprisal setting the effective inspection cost. Resolved arms exit the gate, fresh arms shut off above a prior-determined threshold, and selected linear-bandit overrides consume finite information budget. Across Bernoulli bandits, linear bandits, and tabular MDPs, the same hyperparameters transfer without retuning, and DE shows much weaker regret growth than Thompson Sampling and $\varepsilon$-greedy in the tested unresolved regimes. Delight improves acting for the same reason it improves learning: it prices scarce resources by the product of upside and surprisal.
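The gating rule in this abstract can be illustrated with a minimal sketch. The abstract does not say how expected improvement or surprisal are computed, so both are taken here as inputs: `expected_improvement` is a number supplied by the caller, and surprisal is the standard `-log(p)` of a caller-supplied probability `p`. The function names and the candidate format are hypothetical and are not the paper's algorithm.

```python
import math

def surprisal(prob: float) -> float:
    """Standard surprisal -log(p), in nats, of an event with probability p."""
    return -math.log(prob)

def delight(expected_improvement: float, prob: float) -> float:
    """Delight as described in the abstract: expected improvement times surprisal."""
    return expected_improvement * surprisal(prob)

def gated_action(host_action, candidates, gate_price):
    """Keep the host's action unless some candidate's delight exceeds the gate price.

    candidates maps action -> (expected_improvement, prob); this input format
    is an assumption made for illustration.
    """
    best_action, best_delight = host_action, gate_price
    for action, (improvement, prob) in candidates.items():
        value = delight(improvement, prob)
        if value > best_delight:
            best_action, best_delight = action, value
    return best_action

candidates = {"a": (0.1, 0.5), "b": (0.3, 0.05)}
chosen_cheap_gate = gated_action("host", candidates, gate_price=0.5)
chosen_pricey_gate = gated_action("host", candidates, gate_price=1.0)
```

With a low gate price the rare, high-upside candidate overrides the host; with a higher price no candidate clears the gate and the host's action is kept.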

2605.13283 2026-05-14 cs.LG math.ST stat.TH

Byzantine-Robust Distributed Sparse Learning Revisited

Yuxuan Wang, Lixin Zhang, Kangqiang Li

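AI summary This paper revisits Byzantine-robust distributed estimation for high-dimensional sparse linear models. The authors propose a framework that combines local robust $\ell_1$-regularized estimation with robust aggregation at the server, covering pseudo-Huber regression, quantile regression, and sparse support vector machines. The method comes with non-asymptotic guarantees under mild conditions and attains near-optimal statistical rates while remaining communication-efficient. Simulations confirm robust estimation, support recovery, and classification accuracy under a range of Byzantine attacks.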

Abstract

We revisit Byzantine robust distributed estimation for high-dimensional sparse linear models. By combining local $\ell_1$-regularized robust estimation with robust aggregation at the server, the framework applies to pseudo-Huber regression, quantile regression, and sparse SVM. We show that the resulting estimators yield non-asymptotic guarantees and attain near-optimal statistical rates under mild conditions, while remaining communication-efficient. Simulations confirm strong robustness in estimation, support recovery and classification accuracy under various Byzantine attacks.
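To see why server-side robust aggregation matters, here is a minimal sketch comparing a plain average with a coordinate-wise median over client estimates. The abstract does not name the aggregator used in the paper, so the coordinate-wise median is an assumption chosen for illustration, and the local estimates are simulated rather than fitted with an $\ell_1$-regularized solver.

```python
import numpy as np

def coordinatewise_median(local_estimates: np.ndarray) -> np.ndarray:
    """Aggregate client estimates of shape (clients, dim) by a per-coordinate median."""
    return np.median(local_estimates, axis=0)

rng = np.random.default_rng(0)

# Sparse ground-truth coefficient vector (hypothetical example values).
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]

# Nine honest clients report noisy estimates; three Byzantine clients report garbage.
honest = beta_true + 0.05 * rng.standard_normal((9, 20))
byzantine = np.full((3, 20), 100.0)
estimates = np.vstack([honest, byzantine])

mean_agg = estimates.mean(axis=0)            # dragged far away by the attackers
median_agg = coordinatewise_median(estimates)  # stays close to beta_true
```

Because fewer than half of the clients are corrupted, each coordinate's median is determined by honest values, while the mean is shifted by roughly a quarter of the attackers' value.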

2605.13277 2026-05-14 cs.CL cs.AI cs.CV cs.IR cs.LG

Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation

Weiqing Luo, Zongye Hu, Xiao Wang, Zhiyuan Yu, Haofeng Zhang, Ziyi Huang

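AI summary This paper studies visual evidence selection in multimodal retrieval-augmented generation (RAG), observing that existing methods usually rely on semantic relevance or surface-level similarity, which do not reliably reflect how useful a piece of evidence is for downstream reasoning. The authors redefine evidence utility from an information-theoretic perspective, measuring it as the information gain induced on the model's output distribution, and design a training-free estimation framework built on lightweight multimodal models. Experiments show that the method outperforms existing RAG methods on multiple benchmarks while substantially reducing computational cost.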

Comments Accepted to ACL 2026

Abstract

Visual evidence selection is a critical component of multimodal retrieval-augmented generation (RAG), yet existing methods typically rely on semantic relevance or surface-level similarity, which are often misaligned with the actual utility of visual evidence for downstream reasoning. We reformulate multimodal evidence selection from an information-theoretic perspective by defining evidence utility as the information gain induced on a model's output distribution. To overcome the intractability of answer-space optimization, we introduce a latent notion of evidence helpfulness and theoretically show that, under mild assumptions, ranking evidence by information gain on this latent variable is equivalent to answer-space utility. We further propose a training-free, surrogate-accelerated framework that efficiently estimates evidence utility using lightweight multimodal models. Experiments on MRAG-Bench and Visual-RAG across multiple model families demonstrate that our method consistently outperforms state-of-the-art RAG baselines while achieving substantial reductions in computational cost.
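As a rough illustration of scoring evidence by information gain on an output distribution, the sketch below uses entropy reduction: the gain of a piece of evidence is the entropy of the answer distribution without it minus the entropy with it. Entropy reduction is one common reading of "information gain" and is an assumption here; the paper's exact definition, its latent helpfulness variable, and its surrogate models are not reproduced. The distributions and evidence names are made-up example values.

```python
import numpy as np

def entropy(probs) -> float:
    """Shannon entropy in nats of a discrete distribution (renormalised for safety)."""
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()
    nonzero = p[p > 0]
    return float(-(nonzero * np.log(nonzero)).sum())

def information_gain(prior, posterior) -> float:
    """Entropy reduction when conditioning the answer distribution on evidence."""
    return entropy(prior) - entropy(posterior)

def rank_evidence(prior, posteriors: dict) -> list:
    """Return evidence names sorted from most to least informative."""
    gains = {name: information_gain(prior, post) for name, post in posteriors.items()}
    return sorted(gains, key=gains.get, reverse=True)

# Answer distribution over four options without evidence, then with each image.
prior = [0.25, 0.25, 0.25, 0.25]
posteriors = {
    "img_a": [0.85, 0.05, 0.05, 0.05],  # sharpens the answer a lot
    "img_b": [0.40, 0.30, 0.20, 0.10],  # helps a little
    "img_c": [0.25, 0.25, 0.25, 0.25],  # changes nothing
}
ranking = rank_evidence(prior, posteriors)
```

An image that looks relevant but leaves the answer distribution unchanged, like `img_c`, receives zero utility under this score, which is the misalignment with similarity-based selection that the abstract points to.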

2605.13266 2026-05-14 cs.RO

Galilean State Estimation for Inertial Navigation Systems with Unknown Time Delay

Giulio Delama, Martin Scheiber, Yixiao Ge, Tarek Hamel, Stephan Weiss, Robert Mahony

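AI summary This paper studies state estimation for inertial navigation systems (INS) in the presence of an unknown time delay. The authors propose a geometric framework based on Galilean symmetry that models space and time jointly, enabling coupled estimation of the navigation states and the time delay, and derive an Equivariant Filter (EqF) for online estimation. Experiments show that the method preserves estimation accuracy with better consistency than the existing Extended Kalman Filter (EKF) approach, particularly when the time delay is large.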

Abstract

Many Inertial Navigation Systems (INS) use Global Navigation Satellite System (GNSS) position as the primary measurement to drive filter performance and bound error growth. However, commercial-grade GNSS receivers introduce unknown measurement delays ranging from 50 ms to 300 ms depending on sensor quality and operating mode. Such time delays can significantly degrade INS performance unless they are explicitly compensated for. Existing algorithms commonly estimate this delay offline, run the filter concurrently with GNSS measurements using buffered Inertial Measurement Unit (IMU) data, and predict the current state by forward-integrating buffered inertial measurements via IMU preintegration. The state-of-the-art online method is an Extended Kalman Filter (EKF) that explicitly models the time delay as a state parameter, which defines the preintegration duration. This paper introduces a novel geometric framework for modeling time-delayed INS, in which Galilean symmetry is leveraged to provide a joint representation of space and time for consistent state estimation. An Equivariant Filter (EqF) is derived for the coupled estimation of navigation states and time delay. Validation is performed on two fixed-wing Uncrewed Aerial Vehicles (UAV) with GNSS time lags of 90 ms and 120 ms. The test flights last two to three minutes. Simulations further investigate delays up to 500 ms and provide a statistical comparison against the state-of-the-art EKF. Results show that the EqF preserves accuracy and consistency, while the EKF lacks consistency and its performance degrades significantly with increasing measurement delays.

2605.13265 2026-05-14 cs.LG

LightSplit: Practical Privacy-Preserving Split Learning via Orthogonal Projections

Mert Cihangiroglu, Alessandro Pegoraro, Phillip Rieger, Antonino Nocera, Ahmad-Reza Sadeghi

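AI summary Split learning (SL) enables collaborative training by splitting a neural network between clients and a central server, but the cut-layer interface brings two problems: high-dimensional activations are costly to communicate, and the exposed representations are vulnerable to reconstruction attacks. This paper proposes LightSplit, which applies a lightweight, fixed orthogonal random projection at the cut layer to reduce both information exposure and communication overhead. Grounded in information theory, the projection restricts instance-specific information and suppresses exploitable per-sample signals. It requires no changes to existing architectures, is suitable for edge devices, and preserves end-to-end differentiability. Experiments show that LightSplit retains more than 95% of baseline accuracy while substantially reducing the transmitted dimensionality.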

Abstract

Split learning (SL) enables collaborative training by partitioning a neural network across clients and a central server, but the cut-layer interface introduces a key challenge: high-dimensional activations incur substantial communication overhead while exposing representations vulnerable to reconstruction attacks. Existing approaches typically address efficiency or privacy in isolation, relying on additional mechanisms such as sparsification, quantization, or noise injection. We propose LightSplit, which limits information exposure and reduces communication overhead by applying a lightweight fixed orthogonal random projection at the cut layer. Based on Shannon's information theory, this projection acts as an information bottleneck that restricts instance-specific information and suppresses exploitable per-sample signals. By transmitting low-dimensional projections instead of raw activations, the server operates on lifted representations without requiring architectural modifications, ensuring compatibility with existing SL architectures. By avoiding additional trainable components on the client, the method remains lightweight and suitable for edge devices while preserving end-to-end differentiability via exact gradient propagation. Because the projection is non-invertible, part of the original representation is irreversibly discarded at the client, so LightSplit reduces the information available for reconstruction and limits information exposure. We extensively evaluate LightSplit on state-of-the-art benchmarks in both IID and non-IID settings across varying projection dimensions and client scales. Our results show that the method retains more than 95% of the baseline accuracy at up to 32x reduction in transmitted dimensionality while maintaining stable training dynamics.
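The cut-layer projection can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the paper's implementation: the matrix is built by QR-decomposing a Gaussian matrix, the server "lifts" by multiplying with the transpose, and the sizes (256 to 8, a 32x reduction) and function names are hypothetical.

```python
import numpy as np

def make_orthogonal_projection(d: int, k: int, seed: int = 0) -> np.ndarray:
    """Return a fixed (d, k) matrix with orthonormal columns, for k < d."""
    rng = np.random.default_rng(seed)
    gaussian = rng.standard_normal((d, k))
    q, _ = np.linalg.qr(gaussian)  # reduced QR: q has shape (d, k)
    return q

def client_project(activations: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Client side: transmit k-dimensional projections instead of d-dimensional activations."""
    return activations @ proj

def server_lift(projected: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Server side: map back to d dimensions; the discarded component is not recovered."""
    return projected @ proj.T

d, k = 256, 8  # hypothetical sizes giving a 32x reduction
proj = make_orthogonal_projection(d, k)
acts = np.random.default_rng(1).standard_normal((4, d))

sent = client_project(acts, proj)   # shape (4, 8): what crosses the network
lifted = server_lift(sent, proj)    # shape (4, 256): what the server works with
```

Since the projection matrix is fixed and not trained, the client adds no learnable parameters, and gradients flow through a plain matrix product. The lifted activations differ from the originals because everything outside the 8-dimensional subspace was dropped on the client.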

2605.13262 2026-05-14 cs.LG q-bio.QM

Chem-GMNet: A Sphere-Native Geometric Transformer for Molecular Property Prediction

Deepak Warrier, Raja Sekhar Pappala

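AI summary This paper proposes Chem-GMNet, a sphere-native geometric transformer for molecular property prediction. The model replaces each module of a conventional transformer with a counterpart built on spherical geometry, making direct use of the structural priors of chemistry. Experiments show that Chem-GMNet outperforms existing methods such as ChemBERTa with fewer parameters, and that it performs well even without pretraining.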

Abstract

Modern SMILES-based chemical language models obtain strong MoleculeNet performance by treating SMILES as generic text and compensating with multi-million-molecule self-supervised pretraining. We ask: when a domain carries structural priors as rich as chemistry's, does it warrant a domain-native transformer rather than a generic one rescued by scale? We answer affirmatively with \textbf{GM-Net} (Geometric Measure Network), a transformer family in which every module is replaced by a sphere-native counterpart, and instantiate it as \textbf{Chem-GMNet}. Three blocks follow: SH-Embedding (tokens as learnable directions on $S^{k-1}$ lifted through a Gegenbauer feature map); DualSKA (a per-head fusion of a linear-time gated Sphere-Flow recurrence whose persistent state we prove is the truncated multipole expansion of the input distribution, and a softmax Sphere-Kernel branch over the same Schoenberg-valid kernel); and SH-FFN (sphere projection $\to$ Gegenbauer lift $\to$ moment readout). On canonical DeepChem scaffold splits, against same-shape ChemBERTa-2 baselines under the chemberta3-faithful protocol: (i) random-initialised, Chem-GMNet wins on 7 of 10 MoleculeNet endpoints at $\sim\!35\%$ fewer parameters; (ii) pretrained on the same 10M-SMILES ZINC corpus as ChemBERTa-2 MLM-10M, it matches or beats the public release on 6 of 8 shared endpoints (5/7 excluding a known ClinTox release anomaly). A $(k,L)$ ablation shows that increasing the sphere dimension from $k\!=\!8$ to $k\!=\!10$ at fixed $L\!=\!3$ lowers ESOL RMSE to $0.938$ at scratch, beating pretrained ChemBERTa-2 MLM-10M on this endpoint without any pretraining at all.

2605.13260 2026-05-14 cs.LG math.AP math.FA stat.ML

Unified generalization analysis for physics informed neural networks

Yuka Hashimoto, Tomoharu Iwata

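AI summary This paper presents a unified theoretical analysis of the generalization of physics-informed neural networks (PINNs) and their variational counterparts (VPINNs). Using Taylor expansion, the authors represent nonlinear differential operators as linear operators on a high-dimensional space and combine this with Koopman-based analysis to derive generalization bounds for neural networks that involve differentiation. The approach removes the earlier reliance on stability conditions or linear ellipticity and shows that the nonlinearity of the differential operator has a significant effect on generalization, offering a new theoretical view of how physics-informed networks train and generalize.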

Abstract

Physics-Informed Neural Networks (PINNs) and their variational counterparts (VPINNs) are neural networks that incorporate physical laws, making them useful for scientific problems. Existing generalization analyses for PINNs and VPINNs remain limited, often requiring restrictive assumptions such as stability conditions or linear ellipticity. In this paper, we derive generalization bounds for neural networks that involve differentiation with respect to input variables, covering PINNs and VPINNs under a unified framework. We apply Taylor expansion to represent nonlinear differential operators as linear operators on a high-dimensional space, enabling the use of Koopman-based analysis and showing that high-rank networks can generalize well even in settings involving differential operators. We also show that the nonlinearity of the differential operator exponentially enlarges the bound, highlighting its significant impact on generalization.

2605.13255 2026-05-14 cs.AI

Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

Junlong Ke, Zichen Wen, Weijia Li, Conghui He, Linfeng Zhang

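AI summary This paper studies how to make better use of the teacher's uncertainty in on-policy self-distillation in order to improve the reasoning efficiency of large language models. It proposes EGRSD, an entropy-guided reinforced self-distillation method that combines a reward-grounded direction, a teacher-student likelihood-ratio magnitude, and a teacher-entropy confidence gate to adjust the supervision weight of each token position dynamically. It further introduces CL-EGRSD, a causal-lookahead variant that distinguishes sustained high-entropy spans from transient high-entropy positions. Experiments show that the method achieves a better accuracy-length trade-off than the compared trainable methods.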

Abstract

On-policy self-distillation trains a reasoning model on its own rollouts while a teacher, often the same model conditioned on privileged context, provides dense token-level supervision. Existing objectives typically weight the teacher's token-level signal uniformly across a chain-of-thought sequence, despite substantial variation in the entropy of the teacher's predictive distribution. We propose EGRSD (Entropy-Guided Reinforced Self-Distillation), which unifies token-level updates through three signals: a reward-grounded direction, a teacher-student likelihood-ratio magnitude, and the proposed teacher-entropy confidence gate that down-weights high-entropy token positions while maintaining a nonzero lower bound on every token weight. We further introduce CL-EGRSD, a causal-lookahead variant that distinguishes sustained high-entropy spans from transient high-entropy positions whose following context rapidly becomes low entropy. Experiments with Qwen3-4B and Qwen3-8B in thinking mode show that EGRSD and CL-EGRSD advance the accuracy-length frontier among the compared trainable methods.
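The teacher-entropy confidence gate can be pictured with a small sketch. The abstract states only that high-entropy positions are down-weighted and that every token weight keeps a nonzero lower bound; the specific formula below (a linear map from normalised entropy to a weight in `[w_min, 1]`) and the names `entropy_gate` and `w_min` are assumptions for illustration, not the paper's definition.

```python
import numpy as np

def token_entropy(probs: np.ndarray) -> np.ndarray:
    """Entropy in nats of each row of a (tokens, vocab) probability matrix."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def entropy_gate(teacher_probs: np.ndarray, w_min: float = 0.1) -> np.ndarray:
    """Per-token weights in [w_min, 1]: confident teacher positions get weights near 1.

    Entropy is normalised by log(vocab) so it lies in [0, 1]; the linear form
    is a hypothetical choice.
    """
    vocab = teacher_probs.shape[-1]
    normalised = np.clip(token_entropy(teacher_probs) / np.log(vocab), 0.0, 1.0)
    return w_min + (1.0 - w_min) * (1.0 - normalised)

# Two token positions over a toy 4-word vocabulary.
teacher_probs = np.array([
    [0.97, 0.01, 0.01, 0.01],  # teacher is confident
    [0.25, 0.25, 0.25, 0.25],  # teacher is maximally uncertain
])
weights = entropy_gate(teacher_probs)
```

The uncertain position still receives weight `w_min` rather than zero, so it continues to contribute a small amount of supervision.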

2605.13245 2026-05-14 cs.AI

It's not the Language Model, it's the Tool: Deterministic Mediation for Scientific Workflows

Marios Adamidis, Danae Katrisioti, Yannis Tzitzikas, Emmanuel Stratakis

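AI summary This study examines the reproducibility of analyses that language models generate in scientific workflows, noting that repeated generations on the same data can yield different results with no obvious way to decide which to trust. The authors propose typed mediation, in which the model calls deterministic tools to perform the analysis; each tool encodes one researcher's exact procedure for one instrument, so the result does not change on regeneration. When the same photoluminescence analysis was run repeatedly across four platforms, the typed tool produced identical results on every run, whereas the commercial models either varied across runs or failed to produce valid results. The pattern offers a practical route to deploying language models where reproducibility is mandatory.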

Comments 18 pages, 4 figures, 2 appendices. Submitted to SETN 2026

Abstract

Language models can produce convincing scientific analyses, but repeated generations on the same data do not guarantee the same result. A researcher may regenerate an identical query and receive a different fit, a different peak position or a different analysis procedure, without an obvious way to decide which output to trust. We propose typed mediation, a pattern in which the model orchestrates deterministic tools rather than generating analytical code. Each tool encodes one researcher's exact procedure for one instrument, ported through structured interviews. The model selects which tool to call and with what parameters. The tool produces the result. Regeneration does not change it. We evaluate this claim by running the same photoluminescence analysis on four platforms, including three commercial foundation models, four times each with the same prompt. The typed tool produces identical results across all runs. The commercial platforms either vary in numerical output and analytical methodology across runs, or fail to produce valid results on the task. We deploy this pattern on two instruments serving users over approximately six months, with very positive user feedback. Both cases are very challenging: they involve proprietary binary formats and per-seat licensed software, which force the tool to remain on local infrastructure alongside the data and the instrument it operates. We argue that deployment topology is not just a preference, but a structural requirement of scientific tool mediation. The result is a practical pattern for deploying language models in scientific workflows where reproducibility is mandatory, reducing analysis time from weeks to minutes while guaranteeing identical outputs across runs.
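The typed-mediation pattern can be sketched as a small dispatcher: the model's only output is a tool name plus parameters, the parameters are validated against a typed schema, and a deterministic function produces the result. Everything named below (`PeakParams`, `find_peak`, `TOOLS`, `dispatch`, and the peak-finding procedure itself) is hypothetical; the paper's tools encode real researchers' procedures that are not described in the abstract.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass(frozen=True)
class PeakParams:
    """Typed parameters the model is allowed to choose for the peak tool."""
    min_wavelength_nm: float
    max_wavelength_nm: float

def find_peak(wavelengths_nm: Sequence[float], intensities: Sequence[float],
              params: PeakParams) -> float:
    """Deterministic tool: wavelength of the highest intensity inside the window."""
    in_window = [(i, w) for w, i in zip(wavelengths_nm, intensities)
                 if params.min_wavelength_nm <= w <= params.max_wavelength_nm]
    if not in_window:
        raise ValueError("no samples inside the requested window")
    return max(in_window)[1]  # ties on intensity resolve to the larger wavelength

# Registry: tool name -> (parameter type, implementation).
TOOLS = {"find_peak": (PeakParams, find_peak)}

def dispatch(call: dict, wavelengths_nm, intensities) -> float:
    """Validate a model-issued call against the registry and run the tool."""
    params_type, tool = TOOLS[call["tool"]]      # unknown tool -> KeyError
    params = params_type(**call["params"])       # unexpected field -> TypeError
    return tool(wavelengths_nm, intensities, params)

wavelengths = [480.0, 520.0, 560.0, 600.0, 640.0]
intensities = [9.0, 1.0, 5.0, 3.0, 2.0]
call = {"tool": "find_peak",
        "params": {"min_wavelength_nm": 500.0, "max_wavelength_nm": 700.0}}

# Re-issuing the same call four times yields the same answer every time.
results = [dispatch(call, wavelengths, intensities) for _ in range(4)]
```

The model influences the outcome only through the choice of tool and parameters, so two runs that issue the same call cannot disagree on the numbers.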

2605.13236 2026-05-14 cs.CL

A Hybrid Framework for Natural Language Querying of IFC Models with Relational and Graph Representations

Rabindra Lamsal, Sisi Zlatanova, Haowen Xu, Yafei Sun, Johnson Xuesong Shen

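AI summary This paper proposes IfcLLM, a hybrid framework for querying Building Information Modeling (BIM) models in IFC format using natural language. The framework converts an IFC model into complementary representations, a relational representation for structured element properties and geometry and a graph representation for topological relationships, and integrates the two through iterative retry-and-refine reasoning with a large language model. Experiments show first-attempt accuracy of 93.3% to 100% across the evaluated scenarios, making BIM data easier for non-expert users to access and analyze.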

Abstract

Building Information Modeling (BIM) is widely used in the Architecture, Engineering, and Construction (AEC) industry, but the complexity of Industry Foundation Classes (IFC) limits accessibility for non-expert users. To address this, we introduce IfcLLM, a hybrid framework for natural language interaction with IFC-based BIM models. It transforms IFC models into complementary representations: a relational representation for structured element properties and geometry, and a graph representation for topological relationships. These representations are integrated through iterative retry-and-refine LLM reasoning. We implement the framework using an open-weight LLM (GPT OSS 120B), supporting reproducible and deployment-oriented workflows. Evaluation on three IFC models with queries derived from 30 scenarios shows first-attempt accuracy of 93.3%-100%, with all failures recovered using a fallback LLM. The results show that combining complementary representations with iterative reasoning enables more accessible natural language querying of IFC data while supporting routine BIM analysis tasks.

2605.13229 2026-05-14 cs.AI cs.SE

Improving Code Translation with Syntax-Guided and Semantic-aware Preference Optimization

Yuhan Wu, Huan Zhang, Wei Cheng, Chen Shen, Jingyue Yang, Wei Hu

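AI summary This paper studies how to improve the accuracy and semantic consistency of code translation and proposes CTO, a syntax-guided and semantic-aware preference optimization method. Through contrastive learning, it trains a cross-lingual semantic model that directly assesses functional equivalence between source and translated code, and it unifies this semantic signal with compiler-based syntactic feedback in a multi-objective optimization framework. Experiments show that CTO significantly outperforms existing methods on C++, Java, and Python translation tasks.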

Comments Accepted in the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026)

Abstract

LLMs have shown immense potential for code translation, yet they often struggle to ensure both syntactic correctness and semantic consistency. While preference-based learning offers a promising alignment strategy, it is hindered by unreliable semantic rewards derived from sparse test cases or restrictive reference translations. We argue that a robust semantic reward for code translation must be derived directly from the source code. In this paper, we propose CTO to improve code translation with syntax-guided and semantic-aware preference optimization. Through contrastive learning, we train a cross-lingual semantic model to directly assess functional equivalence between source and translated code. By formulating code translation as a multi-objective optimization problem, this robust semantic signal is seamlessly unified with compiler-based syntactic feedback within the direct preference optimization framework. Extensive experiments on C++, Java, and Python translations demonstrate that CTO significantly outperforms existing baselines and alternative preference optimization strategies.

2605.13228 2026-05-14 cs.CV cs.AI

ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding

Xiao Liu, Nayu Liu, Junnan Zhu, Ruirui Chen, Guohui Xiang, Changjian Wang, Kaiwen Wei, Rongzhen Li, Jiang Zhong

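AI summary This paper proposes ReTool-Video, a recursive tool-using video agent designed to strengthen complex reasoning and cross-modal analysis in video understanding. To address the coarse tool granularity and flat action space of existing video agents, the authors build the MetaAug-Video Tool Library (MVTL), which contains 134 tools and supports fine-grained operations and dual-level information access, and design a recursive tool-calling mechanism that progressively decomposes high-level video intents into executable tool chains. Experiments show strong results on several benchmarks, with improved stability and effectiveness on complex video understanding.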

Abstract

Video understanding requires active evidence seeking, motivating tool-augmented video agents for temporal reasoning, cross-modal understanding, and complex question answering. Existing video agents have improved video reasoning with retrieval, memory, frame inspection, and verifier tools, but they still face two limitations: (1) a coarse tool space that lacks fine-grained operations for compositional reasoning; and (2) a flat action space that forces high-level video intents into primitive executable tool calls. In this paper, we address these challenges with two complementary designs. First, we construct a MetaAug-Video Tool Library (MVTL), an extensible tool library with 134 registered tools, including 26 base tools for general multimodal signal processing and 108 meta tools for filtering, aggregation, reranking, formatting, and other intermediate-result operations. MVTL supports dual-level access to both structured video information and raw modal evidence, enabling diverse video reasoning scenarios. Second, we propose ReTool-Video, a recursive tool-using method that grounds high-level video intents into executable tool chains. In ReTool-Video, matched actions are executed directly, while unmatched intents are delegated to a resolver for parameter repair, tool substitution, or decomposition. This allows abstract actions such as temporal merging, cross-modal verification, or repeated-event aggregation to be progressively translated into concrete multimodal operations at runtime. Experiments on MVBench, MLVU, and Video-MME w/o sub. show that ReTool-Video consistently outperforms strong baselines. Further analysis demonstrates that recursive grounding and fine-grained meta tools improve the stability and effectiveness of complex video understanding.

2605.13225 2026-05-14 cs.LG

Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings

Paul Jeha, Anastasiia Sedova, Louis Béthune, Skyler Seto, Jes Frellsen, Pierre Ablin, Natalie Schluter

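AI summary For language model pre-training in data-constrained settings, this study compares hyperparameter tuning with bilingual data mixing and finds that mixing is better on both validation loss and downstream task accuracy, with the advantage growing as model size increases. The study quantifies the gain from mixing as equivalent to 2 to 13 times more unique target-language data and shows that validation loss does not fully capture the benefit of mixing. Based on these results, the authors recommend mixing in data from a high-resource language first and transferring hyperparameter settings via μP.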

Abstract

For most languages of the world, language model pre-training operates in a data-constrained regime where models must repeat their training data many times, degrading generalization. Two remedies exist: aggressive hyperparameter tuning such as high weight decay, and mixing in data from a high-resource auxiliary language to directly aid the low-resource target. While hyperparameter tuning regularizes the model by shrinking weights to restrict network capacity, auxiliary data mixing uses a tunable mixing ratio to expand the training distribution and diversify the training signal with new knowledge. Both offer a principled way to improve training in a data-constrained domain. We compare these levers systematically across four model scales from 150M to 1.43B parameters, using Arabic as the low-resource target and English as the auxiliary, over approximately 1000 pre-training runs. Three findings emerge. First, mixing yields larger improvements than hyperparameter tuning on both validation loss and downstream task accuracy, and the gap grows with model size. Second, we quantify how much mixing helps: it boosts performance by an amount equivalent to 2--3$\times$ the unique target data on validation loss and 2--13$\times$ on downstream task accuracy, with the gain scaling steeply with model size. Third, this divergence reveals that target-language validation loss systematically underestimates mixing's value. Mixing regularizes by diversifying the training signal and contributes knowledge the repeated target corpus cannot supply; validation loss captures only the first effect. Our practical recommendations are: mix in a high-resource language, prioritize the mixing ratio over hyperparameter tuning, and transfer hyperparameters from a small proxy model via $μ$P.

2605.13223 2026-05-14 cs.CV

Skill-Aligned Annotation for Reliable Evaluation in Text-to-Image Generation

Abdelrahman Eldesokey, Merey Ramazanova, Ahmad Sait, Ansar Khangeldin, Karen Sanchez, Tong Zhang, Bernard Ghanem

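AI summary As text-to-image generation advances rapidly, reliable model evaluation becomes increasingly important. This paper proposes skill-aligned annotation, in which the annotation strategy matches the underlying characteristics of each evaluation skill, improving the consistency and stability of evaluation. The authors also build an automated evaluation pipeline for scalable, fine-grained evaluation, and argue that strengthening the foundations of evaluation can improve reliability and efficiency without simply scaling up annotation effort.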

Comments Project Page: https://abdo-eldesokey.github.io/skill-aligned-eval/

Abstract

Text-to-image (T2I) generation has advanced rapidly, making reliable evaluation critical as performance differences between models narrow. Existing evaluation practices typically apply uniform annotation mechanisms, such as Likert-scale or binary question answering (BQA), across heterogeneous evaluation skills, despite fundamental differences in their nature. In this work, we revisit T2I evaluation through the lens of skill-aligned annotation, where annotation strategies reflect the underlying characteristics of each evaluation skill. We systematically compare skill-aligned annotation against uniform baselines and show that it produces more consistent evaluation signals, with higher inter-annotator agreement and improved stability across models. Finally, we present an automated pipeline that instantiates the proposed evaluation protocol, enabling scalable and fine-grained evaluation with spatially grounded feedback. Our work highlights that improving the foundations of image evaluation can increase reliability and efficiency without simply scaling annotation effort. We hope this motivates further research on refining evaluation protocols as a central component of reliable model assessment.

2605.13221 2026-05-14 cs.AI cs.LG

An Agentic AI Framework with Large Language Models and Chain-of-Thought for UAV-Assisted Logistics Scheduling with Mobile Edge Computing

Hanwen Zhang, Dusit Niyato, Wei Zhang, Xin Lou, Malcolm Yoke Hean Low

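AI summary This paper studies a hybrid scheduling problem in UAV-assisted logistics with mobile edge computing, in which physical logistics decisions are coupled with computational task scheduling. To address it, the authors propose an agentic-AI-based optimization framework that combines large language models with chain-of-thought reasoning to translate user input into an interpretable mathematical formulation, together with a hierarchical deep reinforcement learning method based on proximal policy optimization that optimizes UAV routing and the allocation of resources for task execution. Simulations show strong deadline satisfaction and product collection rates, with more stable performance than the advantage actor-critic baseline.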

Comments 15 pages

Abstract

In cloud manufacturing, unmanned aerial vehicles (UAVs) can support both product collection and mobile edge computing (MEC). This joint operation forms a hybrid scheduling problem, where physical logistics decisions are coupled with computational task scheduling. In this paper, UAVs collect finished products from manufacturing stations and transport them back to a central depot. Meanwhile, computational tasks generated by industrial sensor devices at these stations are processed locally, at UAVs, or offloaded via UAVs to the cloud. This coupling makes the problem challenging. A UAV can provide MEC services only during its service window at a station, so routing decisions directly determine when UAV-assisted offloading is available. Routing decisions also affect the UAV energy budget and the availability of onboard computing and communication resources for computational task execution under task deadline constraints. To address this, we propose an agentic-AI-assisted optimization framework with two components. First, we develop an agentic AI that combines large language models, retrieval-augmented generation, and chain-of-thought reasoning to translate user input into an interpretable mathematical formulation for the hybrid scheduling problem. Second, we design a hierarchical deep reinforcement learning approach based on proximal policy optimization (PPO), where the upper layer learns UAV routing and the lower layer optimizes per-slot task execution and resource allocation. Simulation results show that the proposed framework yields more consistent formulations, while the hierarchical PPO achieves full product collection in 99.6% of the last 500 episodes and maintains a 100% deadline satisfaction rate, with more stable performance than the advantage actor-critic approach.

2605.13218 2026-05-14 cs.LG

Machine Learning-Driven Multimodal Spectroscopic Liquid Biopsy for Early Multicancer Detection

Alejandro Leonardo García Navarro, Javier Cachón Ortiz, Javier González Colsa, Samuel García Díaz, Carlos Viadero Valderrama

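AI summary This study proposes a multimodal liquid biopsy method that combines several spectroscopic techniques with machine learning for early multicancer detection. By combining Fourier Transform Infrared (FTIR) spectroscopy, Raman spectroscopy, and Excitation-Emission Matrix (EEM) fluorescence spectroscopy, and using machine learning for data fusion and classification, it detects breast cancer and colorectal cancer with high accuracy. The results show that multimodal fusion gives the most balanced sensitivity and specificity, with ROC-AUC values of 0.997 for breast cancer and 0.994 for colorectal cancer.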

Abstract

Cancer is one of the leading causes of death worldwide, making the development of rapid, minimally invasive, label-free and scalable diagnostic strategies a major challenge in modern oncology. In this context, spectroscopic liquid biopsy has emerged as a promising alternative, as it enables the holistic characterization of biochemical alterations in biological fluids. In this work, we propose a multimodal spectroscopic liquid biopsy framework for multicancer detection based on the combination of Fourier Transform Infrared (FTIR) spectroscopy, Raman spectroscopy, and Excitation-Emission Matrix (EEM) fluorescence spectroscopy together with Machine Learning (ML) methodologies. Serum samples from breast cancer patients, colorectal cancer patients, and healthy controls were analyzed through the three spectroscopic modalities. After modality-specific preprocessing, low-level data fusion (LLDF) was employed to integrate the complementary biochemical information encoded within the different spectroscopic measurements, and classification was performed using XGBoost models. Seven experimental configurations were evaluated, including the three unimodal approaches, all pairwise bimodal configurations, and the full multimodal approach of FTIR, Raman, and EEM fluorescence. The results show that although several individual modalities achieved high discrimination performance, the multimodal fusion provided the most balanced overall results, reaching a ROC-AUC of 0.997 for breast cancer and 0.994 for colorectal cancer, together with highly balanced sensitivity and specificity values.

2605.13208 2026-05-14 cs.RO

Calibration-Free Gas Source Localization with Mobile Robots: Source Term Estimation Based on Concentration Measurement Ranking

Wanting Jin, Agatha Duranceau, İzzet Kağan Erünsal, Alcherio Martinoli

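AI summary This paper studies calibration-free gas source localization with mobile robots and proposes a source term estimation method based on the ranking of concentration measurements. By comparing the rank differences between dynamically gathered data and a physical dispersion model, the method estimates the probability distribution of the gas source location over the environment. It removes the need to calibrate low-cost sensors, shows consistent localization accuracy in both simulation and physical experiments, and suits real-world applications such as emergency monitoring.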

Comments This paper has been accepted for publication in the IEEE International Conference on Robotics and Automation (ICRA), 2026

Details
Abstract

Efficient Gas Source Localization (GSL) in real-world settings is crucial, especially in emergency scenarios. Mobile robots equipped with low-cost, in-situ gas sensors offer a safer alternative to human inspection in hazardous environments. Probabilistic algorithms enhance GSL efficiency with scattered gas measurements by comparing gas concentration measurements gathered by robots to physical dispersion models. However, accurately deriving gas concentrations from data acquired with low-cost sensors is challenging due to the nonlinear sensor response, environmental dependencies (e.g., humidity, temperature, and other gas influences), and robot motion. Mitigating these disturbance factors requires frequent sensor calibration in controlled environments, which is often impractical for real-world deployments. To overcome these issues, we propose a novel feature extraction algorithm that leverages the relative ranking of gas measurements within the dynamically accumulated dataset. By comparing the rank differences between gathered and modeled values, we estimate the probabilistic distribution of source locations across the entire environment. We validate our approach in high-fidelity simulations and physical experiments, demonstrating consistent localization accuracy with uncalibrated gas sensors. Compared to existing methods, our technique eliminates the need for gas sensor calibration, making it well-suited for real-world applications.
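
The rank-comparison idea can be sketched as follows. Because ranks are unchanged by any monotone distortion of the sensor response, raw uncalibrated readings can be compared with a dispersion model through their orderings alone. Everything concrete here is an assumption for illustration: the isotropic `toy_model`, the exponential scoring of rank mismatch, and the grid of candidates are not taken from the paper.

```python
import math


def ranks(values):
    """Rank of each value (0 = smallest); ties are broken by order of appearance."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, index in enumerate(order):
        out[index] = rank
    return out


def toy_model(source, position):
    """Hypothetical isotropic plume: concentration decays with squared distance."""
    d2 = (source[0] - position[0]) ** 2 + (source[1] - position[1]) ** 2
    return 1.0 / (1.0 + d2)


def source_posterior(positions, readings, candidates, scale=1.0):
    """Score each candidate source by how well modeled ranks match measured ranks."""
    measured = ranks(readings)
    weights = []
    for candidate in candidates:
        modeled = ranks([toy_model(candidate, p) for p in positions])
        mismatch = sum(abs(a - b) for a, b in zip(measured, modeled))
        weights.append(math.exp(-mismatch / scale))
    total = sum(weights)
    return [w / total for w in weights]


# Uncalibrated sensor: an unknown monotone (here cubic, offset) response.
true_source = (2, 1)
positions = [(0, 0), (1, 2), (3, 1), (4, 4), (2, 3)]
readings = [toy_model(true_source, p) ** 3 * 500 + 7 for p in positions]
candidates = [(x, y) for x in range(5) for y in range(5)]

posterior = source_posterior(positions, readings, candidates)
best = candidates[max(range(len(candidates)), key=posterior.__getitem__)]
```

In this toy setup the candidate with the highest probability coincides with `true_source`, even though the readings are not in concentration units.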

2605.13207 2026-05-14 cs.LG

Switching Successor Measures for Hierarchical Zero-shot Reinforcement Learning

Stefan Stojanovic, Alexandre Proutiere

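AI summary This paper studies hierarchical control in zero-shot reinforcement learning and proposes switching successor measures, which enable hierarchical decision-making without additional supervision, fixed horizons, or manually designed subgoals. The approach extends classical successor measures while preserving their structure. Building on it, the FB $π$-Switch algorithm extracts a high-level subgoal-selection policy and a low-level control policy directly from forward-backward representations. Experiments show that the method improves over non-hierarchical baselines on both goal-conditioned and general reward-based tasks, and matches existing hierarchical methods on goal-conditioned tasks.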

Details
Abstract

Hierarchical reinforcement learning can improve generalization by decomposing long-horizon decision-making into simpler subproblems. However, existing approaches often rely on restrictive design choices, such as fixed temporal abstractions or goal-conditioned objectives, which largely confine them to goal-reaching tasks and limit their applicability to general reward functions. In this paper, we introduce switching successor measures, an extension of successor measures that enables hierarchical control in zero-shot reinforcement learning without additional supervision, fixed horizons, or manually designed subgoals. We show that switching successor measures arise naturally from classical successor measures while preserving their underlying structure. Building on this result, we propose FB $π$-Switch, an algorithm that extracts both a high-level subgoal-selection policy and a low-level control policy directly from forward-backward (FB) representations, allowing hierarchical behavior to emerge from a single learned representation. Experiments on both goal-conditioned and general reward-based tasks show that FB $π$-Switch improves over non-hierarchical baselines and matches state-of-the-art hierarchical methods in goal-conditioned settings. These results demonstrate that structured successor representations provide a flexible foundation for hierarchical zero-shot reinforcement learning beyond goal-reaching tasks. Our project website is available at: https://stestokth.github.io/switching-successors/.

2605.13202 2026-05-14 cs.CV cs.AI

STAR: Semantic-Temporal Adaptive Representation Learning for Few-Shot Action Recognition

Hongli Liu, Yu Wang, Shengjie Zhao

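AI summary This paper studies semantic-temporal alignment in few-shot action recognition (FSAR) and proposes STAR, a unified semantic-temporal adaptive representation learning framework. By introducing a Temporal Semantic Attention mechanism and a Semantic Temporal Prototype Refiner module, the method aligns textual prompts with the sparse visual cues in action sequences and strengthens the modeling of multi-scale temporal dynamics. Experiments show that STAR outperforms existing methods on several benchmark datasets, supporting its effectiveness under limited supervision.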

Comments Accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)

Details
Abstract

Few-shot action recognition (FSAR) requires models to generalize to novel action categories from only a handful of annotated samples. Despite progress with vision-language models, existing approaches still suffer from semantic-temporal misalignment, where static textual prompts fail to capture decisive visual cues that appear sparsely across sequences, and from inadequate modeling of multi-scale temporal dynamics, as short-term discriminative cues and long-range dependencies are often either oversmoothed or fragmented. To address these challenges, we propose Semantic Temporal Adaptive Representation Learning (STAR), a unified framework consisting of a semantic-alignment component and a temporal-aware component, which bridges the semantic and temporal gaps and transfers the sequence modeling capability of Mamba to FSAR. The semantic alignment module introduces a Temporal Semantic Attention (TSA) mechanism, which performs frame-level cross-modal alignment with textual cues, ensuring fine-grained semantic-temporal consistency. The temporal-aware module incorporates a Semantic Temporal Prototype Refiner (STPR) that integrates semantic-guided Mamba blocks with multi-frequency temporal sampling and bidirectional state-space refinement, yielding semantically aligned prototypes with enhanced discriminative fidelity and temporal consistency. Furthermore, temporally dependent class descriptors derived from large language models (LLMs) provide long-range semantic guidance. Extensive experiments on five FSAR benchmarks demonstrate the consistent superiority of STAR over state-of-the-art methods. For instance, STAR achieves up to 8.1% and 6.7% gains on the SSv2-Full and SSv2-Small datasets under the 1-shot setting, and 7.3% on HMDB51, validating its effectiveness under limited supervision. The code is available at https://github.com/HongliLiu1/STAR-main.

2605.13200 2026-05-14 cs.LG cs.ET

A Hybrid Tucker-LSTM Tensor Network Model for SOC Prediction in Electric Vehicles

Han Wang, Ying Wang, Bing Wang

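AI summary This paper proposes a hybrid model that combines Tucker tensor decomposition with a long short-term memory (LSTM) network to predict the state of charge (SOC) of electric vehicle batteries. The method uses full-lifecycle field data from electric vehicles and applies Tucker decomposition to reduce dimensionality while preserving the temporal structure, which improves LSTM prediction. The experiments show that the hybrid model outperforms a standard LSTM on all reported metrics and improves SOC prediction accuracy, pointing to a new direction for tensor-based analytics in battery management systems.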

Details
Abstract

Accurate state-of-charge (SOC) estimation is critical for electric vehicle battery management strategies, but conventional estimators suffer from two fundamental shortcomings: cumulative errors that grow over time and reliance on simplified battery models that do not reflect real-world dynamics. This paper therefore presents a hybrid approach that combines Tucker tensor decomposition with LSTM networks, using full-lifecycle EV field data for SOC prediction. The inputs are charge status, mileage, voltage, current, cell differentials, and temporal features. Tucker decomposition is used to reduce dimensionality while maintaining the temporal structure, which allows a direct, fair comparison with a standard LSTM. Tucker-LSTM outperforms the baseline on all metrics, with MSE dropping 70.5% (from 21.07 to 6.22), MAE improving 48.7% (from 3.37% to 1.73%), RMSE falling from 4.59% to 2.49%, and $R^2$ rising from 0.918 to 0.976. Because the experimental results show that tensor decomposition compresses high-dimensional battery data without loss of predictive fidelity, this work points to a new direction for tensor-based analytics in electric vehicle battery management.
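
The reported improvements can be cross-checked with a few lines of arithmetic. This is only a consistency check on the numbers quoted in the abstract (relative reduction, and RMSE as the square root of MSE); the helper name `relative_reduction` is introduced here for illustration.

```python
import math


def relative_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return 100.0 * (before - after) / before


# Figures quoted in the abstract.
mse_drop = relative_reduction(21.07, 6.22)   # reported as 70.5%
mae_drop = relative_reduction(3.37, 1.73)    # reported as 48.7%

# RMSE is the square root of MSE, so the two pairs should agree.
rmse_baseline = math.sqrt(21.07)             # reported as 4.59
rmse_hybrid = math.sqrt(6.22)                # reported as 2.49
```

Rounded to the precision used in the abstract, the computed values agree with the reported ones.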

2605.13197 2026-05-14 cs.LG cs.AI

McCast: Memory-Guided Latent Drift Correction for Long-Horizon Precipitation Nowcasting

Penghui Wen, Yu Luo, Lintao Wang, Mengwei He, Patrick Filippi, Thomas Francis Bishop, Zhiyong Wang

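AI summary Existing precipitation nowcasting methods usually adopt an autoregressive formulation, which accumulates errors over long rollouts and causes forecasts to drift away from physically plausible evolution trajectories. To address this, the paper proposes McCast, a memory-guided latent drift correction method that introduces a temporally organized memory bank to actively correct latent evolution during autoregressive rollouts, producing more temporally coherent and reliable long-horizon forecasts. Experiments show that McCast achieves state-of-the-art performance on the SEVIR and MeteoNet benchmarks, particularly in long-horizon forecasting.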

Details
Abstract

Existing precipitation nowcasting methods typically adopt an autoregressive formulation, where future states are predicted from previous outputs. However, such an approach accumulates errors over long rollouts, causing forecasts to drift away from physically plausible evolution trajectories. Although various studies have attempted to alleviate this problem by improving step-wise prediction accuracy, they largely neglect the global temporal evolution of meteorological systems and lack mechanisms to actively correct drift during rollouts. To address this issue, we propose McCast, a memory-guided latent drift correction method for precipitation nowcasting. Rather than treating memory as an unordered dictionary of latent states for passive conditioning, McCast leverages temporally organized memory to actively correct autoregressive latent evolution. Specifically, McCast introduces a Drift-Corrective Memory Bank (DCBank) that explicitly estimates the temporally consistent drift corrections to calibrate the divergent trajectory. DCBank performs drift correction in two stages: a Corrective Latent Extractor first predicts an initial correction from the current prediction and a reference latent state, and a Correction-Aware Memory Retrieval module then refines the initial correction using temporally organized historical memory. By explicitly correcting latent evolution, instead of improving step-wise prediction accuracy only, McCast produces more temporally coherent and reliable long-horizon forecasts. Experiments on two widely used benchmarks, SEVIR and MeteoNet, show that McCast achieves state-of-the-art performance, particularly in challenging long-horizon forecasting scenarios.

2605.13194 2026-05-14 cs.LG cs.AI

ECG-NAT: A Self-supervised Neighborhood Attention Transformer for Multi-lead Electrocardiogram Classification

Mahsa Gazeran, Sayvan Soleymanbaigi, Fatemeh Daneshfar, Amjad Seyedi, Fardin Akhlaghian Tab

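AI summary This paper proposes ECG-NAT, a self-supervised neighborhood attention transformer for multi-lead electrocardiogram (ECG) classification. Training has two stages: generative pretraining with a masked autoencoder on unlabeled data to learn robust cross-dataset representations, followed by discriminative fine-tuning with a dual-loss function that combines supervised contrastive and cross-entropy losses to improve classification. ECG-NAT uses a hierarchical attention mechanism to capture multi-scale temporal features efficiently, from fine-grained beat morphology to broader rhythm patterns. It retains strong classification accuracy with little labeled data and is suited to real-time ECG diagnosis.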

Details
Abstract

Electrocardiogram (ECG) arrhythmia classification remains challenging due to signal variability, noise, limited labeled data, and the difficulty in achieving both accuracy and efficiency in models. While self-supervised learning reduces label dependency, most methods target either global contextual features or local morphological patterns, but rarely implement hierarchical multi-scale feature extraction. ECG signals require architectures that simultaneously capture fine-grained beat-level morphology and broader rhythm-level dependencies with computational efficiency. To overcome this limitation, this paper proposes the Electrocardiogram Neighborhood Attention Transformer (ECG-NAT), a novel self-supervised learning approach tailored for multi-lead ECG classification. Our two-stage approach begins with generative pretraining, using a masked autoencoder to reconstruct partially masked ECG signals across multiple diverse datasets, enabling the model to learn robust, domain-invariant representations from unlabeled data. This is followed by discriminative fine-tuning with a dual-loss function that combines supervised contrastive and cross-entropy losses, aligning representation learning with label prediction. The hierarchical attention mechanism efficiently captures multi-scale temporal features from localized beat morphology to broader rhythm patterns at low computational cost. ECG-NAT achieves robust performance on benchmark datasets, with 88.1% accuracy using only 1% labeled data, demonstrating strong efficacy in low-resource settings. The framework combines superior classification performance with computational efficiency, making it practical for real-time ECG diagnosis. The code will be made available upon acceptance at: https://github.com/Mahsagazeran/ECG-NAT.

2605.13192 2026-05-14 cs.RO

Dynamics Computation of Soft-Rigid Hybrid-Link System and Its Application to Motion Analysis of an Athlete Wearing Sport Prosthesis

Sunghee Kim, Yuta Shimane, Taiki Ishigaki, Ko Yamamoto

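AI summary This paper proposes a motion analysis framework based on a soft-rigid hybrid-link system for analyzing the motion of an athlete wearing a sport-specific flexible prosthesis. By modeling the rigid human skeleton and the flexible prosthesis in a unified formulation that includes their interaction force, the method addresses the difficulty that conventional rigid-body multi-link models have with flexible components. The study applies inverse kinematics of the hybrid-link system to motion reconstruction and estimates joint torques and ground reaction force by inverse dynamics. Experiments show an error of approximately 12% in the ground reaction force estimate, and the muscle force estimation accounts for muscle amputation and the interaction with prosthesis deformation.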

Details
Journal ref
Advanced Robotics, Vol.40, No.4, 2026
Abstract

This paper presents a motion analysis framework for an athlete wearing a sport-specific flexible prosthesis, based on the soft-rigid hybrid-link system. Such motion analysis is challenging because it must account for the interaction force between the rigid human skeleton system and a flexible prosthesis. However, most human musculoskeletal models are based on the computation framework of a rigid-body multi-link system. Recently, the soft robotics research field has developed fast and efficient modeling methods for flexible rod deformation, which allow us to build a hybrid-link system that integrates rigid links and soft bodies in a unified formulation. We apply inverse kinematics of the hybrid-link system to motion reconstruction from motion capture data, and also present the estimation of the joint torques and ground reaction force by inverse dynamics. Through a human subject experiment, we show that the inverse dynamics achieved approximately 12% error on the ground reaction force estimation. Furthermore, we provide muscle force estimation that considers muscle amputation and the interaction force with the prosthesis leg deformation.

2605.13190 2026-05-14 cs.LG cs.AI

N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation

Aleksander Lorenc, Frédéric Berdoz, Joël Mathys, Roger Wattenhofer

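AI summary This paper proposes N-vium, a mixture-of-exits transformer that aims to improve the inference efficiency of autoregressive transformers. The method attaches prediction heads at multiple depths and uses token-adaptive routing, partially parallelizing computation across depth so that effective FLOPs per second increase, instead of only reducing compute per token. Experiments show that N-vium reaches up to a 57.9% wall-clock speedup over a standard transformer at the same perplexity.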

Details
Abstract

Improving the inference efficiency of autoregressive transformers typically means reducing FLOPs per token, usually through approximations that degrade model quality. We introduce N-vium, a mixture-of-exits transformer that partially parallelizes computation across depth on standard hardware, increasing effective FLOPs per second rather than minimizing compute per token. N-vium attaches prediction heads at multiple depths and defines the next-token distribution as a learned mixture over these exits, with token-adaptive routing. This formulation strictly generalizes the standard transformer, which is recovered exactly when routing assigns zero mass to all intermediate heads. Sampling from the mixture is exact, and complete KV caches are recovered by deferring the upper-layer computation and batching it with later tokens. We pretrain N-vium at scales up to 1.5B parameters. Our largest model reaches 57.9% wall-clock speedup over a parameter- and data-matched standard transformer at no perplexity cost.
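
The mixture-over-exits formulation can be written as $p(x) = \sum_k w_k \, p_k(x)$, where $p_k$ is the distribution of the head at exit $k$ and $w_k$ is its routing weight. The sketch below illustrates two properties stated in the abstract: the standard transformer is recovered when all intermediate heads get zero weight, and sampling from the mixture can be exact. The two-stage (pick an exit, then pick a token) sampler is the textbook way to sample a mixture and is an assumption here, as are the toy probabilities and function names; the abstract does not describe N-vium's actual routing or sampling code.

```python
import random


def mixture_distribution(exit_probs, routing_weights):
    """Next-token distribution as a weighted sum of the per-exit distributions."""
    vocab_size = len(exit_probs[0])
    return [
        sum(w * probs[v] for w, probs in zip(routing_weights, exit_probs))
        for v in range(vocab_size)
    ]


def sample_token(exit_probs, routing_weights, rng):
    """Two-stage sampling: choose an exit, then a token from that exit's head."""
    exit_index = rng.choices(range(len(exit_probs)), weights=routing_weights)[0]
    head = exit_probs[exit_index]
    token = rng.choices(range(len(head)), weights=head)[0]
    return exit_index, token


# Toy example: two intermediate heads plus the final head, vocabulary of 3 tokens.
exit_probs = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.1, 0.3, 0.6]]
routing = [0.2, 0.3, 0.5]

mixed = mixture_distribution(exit_probs, routing)
# Zero mass on intermediate heads gives back the final head's distribution.
standard = mixture_distribution(exit_probs, [0.0, 0.0, 1.0])

# Empirical token frequencies from the two-stage sampler approach `mixed`.
rng = random.Random(0)
draws = 20000
counts = [0, 0, 0]
for _ in range(draws):
    _, token = sample_token(exit_probs, routing, rng)
    counts[token] += 1
frequencies = [c / draws for c in counts]
```

When the sampled exit is an early one, the token is available before the upper layers run, which is what makes deferring and batching that upper-layer computation possible.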

2605.13182 2026-05-14 cs.CV

DiffST: Spatiotemporal-Aware Diffusion for Real-World Space-Time Video Super-Resolution

Zheng Chen, Ruofan Yang, Jin Han, Dehua Song, Zichen Zou, Chunming He, Yong Guo, Yulun Zhang

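AI summary DiffST is an efficient spatiotemporal-aware diffusion framework for real-world space-time video super-resolution (STVSR). It introduces cross-frame context aggregation and video representation guidance modules to make better use of spatiotemporal information, and adopts one-step sampling to speed up inference. Experiments show that DiffST obtains leading results on real-world STVSR tasks and runs about 17 times faster than previous diffusion-based STVSR methods.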

Comments Code is available at: https://github.com/zhengchen1999/DiffST

Details
Abstract

Diffusion-based models have shown strong performance in video super-resolution (VSR) and video frame interpolation (VFI). However, their role in the coupled space-time video super-resolution (STVSR) setting remains limited. Existing diffusion-based STVSR approaches suffer from two issues: (1) low inference efficiency and (2) insufficient utilization of spatiotemporal information. These limitations impede deployment. To address these issues, we introduce DiffST, an efficient spatiotemporal-aware video diffusion framework for real-world STVSR. To improve efficiency, we adapt a pre-trained diffusion model for one-step sampling and process the entire video directly rather than operating on individual frames. Furthermore, to enhance spatiotemporal information utilization, we introduce cross-frame context aggregation (CFCA) and video representation guidance (VRG). The CFCA module aggregates information across multiple keyframes to produce intermediate frames. The VRG module extracts video-level global features to guide the diffusion process. Extensive experiments show that DiffST obtains leading results on real-world STVSR tasks. It also maintains high inference efficiency, running about 17$\times$ faster than previous diffusion-based STVSR methods. Code is available at: https://github.com/zhengchen1999/DiffST.

2605.13181 2026-05-14 cs.LG cs.AI

Stable Attention Response for Reliable Precipitation Nowcasting

Penghui Wen, Zexin Hu, Sen Zhang, Patrick Filippi, Xiaogang Zhu, Allen Benter, Thomas Bishop, Zhiyong Wang, Kun Hu

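AI summary Precipitation nowcasting is challenging because atmospheric dynamics are highly localized, rapidly evolving, and heterogeneous. Recent methods increasingly adopt attention-based architectures in both unimodal and multimodal settings, but they focus on stronger representation learning and prediction capacity and overlook the stability of attention responses across samples. This paper proposes HARECast, a precipitation nowcasting framework that regulates head-wise attention-response energy. By reducing cross-sample fluctuations in attention-response energy, it improves the stability and reliability of forecasts and achieves state-of-the-art performance on the SEVIR and MeteoNet benchmarks.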

Details
Abstract

Precipitation nowcasting remains challenging due to the highly localized, rapidly evolving, and heterogeneous nature of atmospheric dynamics. Although recent methods increasingly adopt attention-based architectures in both unimodal and multimodal settings, they mainly emphasize stronger representation learning and prediction capacity, while paying less attention to the stability of attention responses across samples. In this work, we show that cross-sample instability of attention-response energy is an important and previously underexplored source of forecasting unreliability. Empirically, inaccurate forecasts are associated with larger attention-response energy variance across heads and layers. Theoretically, we show that cross-sample variability can propagate through self-attention, and enlarge a lower bound on prediction error. Based on this insight, we propose HARECast, a Head-wise Attention Response Energy-regulated framework for precipitation nowcasting. HARECast explicitly models head-wise attention-response energy and stabilizes it through a group-wise regularization objective that reduces cross-sample fluctuations. The proposed formulation is generic and applicable to both unimodal and multimodal nowcasting architectures. We instantiate HARECast in a standard forecasting pipeline with reconstruction branches and a diffusion-based predictor, and evaluate it on commonly used benchmarks--SEVIR and MeteoNet. Experimental results demonstrate that HARECast achieves state-of-the-art performance.

2605.13179 2026-05-14 cs.CV

Does Engram Do Memory Retrieval in Autoregressive Image Generation?

Jinghao Wang, Qiyuan He, Chunbin Gu, Pheng-Ann Heng

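AI summary This study examines the role of the Engram module in autoregressive image generation and finds that, although it reduces backbone compute, it does not by itself improve sample quality. The experiments indicate that the Engram module behaves as a gated side-pathway, not as a content-addressed recall mechanism. The study further shows that the module's benefit comes mainly from the pathway structure itself, with the contents of the memory table contributing little.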

Comments 9 pages

Details
Abstract

The Engram module -- a hash-keyed, O(1) associative memory injected into Transformer layers -- was recently shown to improve large language model pretraining, with the appealing interpretation that it provides a content-addressed shortcut to recurring local token patterns. We ask whether this interpretation transfers to autoregressive (AR) image generation, or whether the observed gains, if any, come from a different mechanism. We adapt the Engram module to vision with 2D spatial $n$-gram hashing, gated fusion, and KV-cache-compatible incremental inference, and inject it into a class-conditional AR generator trained on ImageNet 256x256. Across a sweep of backbone-to-memory budget ratios $ρ{\in}[0.17, 0.90]$, every Engram-augmented variant trails the pure AR baseline in FID, indicating that the module saves backbone FLOPs but does not, by itself, improve sample quality. We then probe how the module is used. A gate-clamp sweep shows that disabling the Engram pathway entirely is catastrophic, yet a tiny constant gate (g=0.10) matches or beats the learned gate -- inconsistent with a heavily content-addressed recall mechanism. A donor-probe experiment shows that swapping the hash inputs for matched, adversarial, or random same-class exemplars produces statistically indistinguishable next-token distributions, while collapsing or randomising the table degrades them by two to three orders of magnitude. Finally, training a model from scratch with the entire memory table frozen to $\mathcal{N}(0, 1)$ noise costs only $Δ\text{FID}{=}0.10$ and actually raises Inception Score. Together, these findings indicate that the Engram in AR image generation behaves not as a content-addressed retriever but as a gated architectural side-pathway: a hash-keyed residual stream whose benefit is dominated by the pathway itself, with the learned table contributing only a small distributional refinement.

2605.13171 2026-05-14 cs.AI

Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

Moritz Firsching, Paul Lezeau, Salvatore Mercuri, Miklós Z. Horváth, Yaël Dillies, Calle Sönne, Eric Wieser, Fred Zhang, Thomas Hubert, Blaise Agüera y Arcas, Pushmeet Kohli

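AI summary As automated reasoning systems advance, high-quality mathematical problems are needed to evaluate their capabilities. The authors present Formal Conjectures, an evolving benchmark of 2615 problem statements formalized in Lean 4, including 836 solved problems and 1029 open research conjectures, for evaluating automated proof discovery. The benchmark relies on a collaborative open-source project to ensure the correctness of the formalizations, uses AI-generated proofs and disproofs to improve it iteratively, and has already contributed to new mathematical discoveries.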

Comments 21 pages, 4 figures, 5 tables

Details
Abstract

As automated reasoning systems advance rapidly, there is a growing need for research-level formal mathematical problems to accurately evaluate their capabilities. To address this, we present Formal Conjectures, an evolving benchmark of currently 2615 mathematical problem statements formalized in Lean 4. Sourced from areas of active mathematical research, the dataset features 1029 open research conjectures providing a zero-contamination benchmark for mathematical proof discovery, and 836 solved problems for proof autoformalization. Notably, the repository provides a structured interface connecting mathematicians who formalize and clarify problems with the AI systems and humans attempting to solve them. Demonstrating its immediate utility, the benchmark has already been leveraged to make new mathematical discoveries, including the resolution of open research conjectures. We describe our approach to ensuring the correctness of these formalizations in a collaborative open-source project where contributions stem from an active community. In this framework, AI-generated proofs and disproofs serve as a valuable auditing mechanism to iteratively improve the fidelity of the benchmark. Finally, we provide a standardized evaluation setup and report baseline results on frozen evaluation subsets, demonstrating a climbable signal that measures the current frontier of automated reasoning on research-level mathematics.