arXivDaily: arXiv daily academic digest, updated Monday to Friday
All subject categories: 2075
2605.13751 2026-05-14 cs.RO cs.SE cs.SY eess.SY

Learning Responsibility-Attributed Adversarial Scenarios for Testing Autonomous Vehicles

Yizhuo Xiao, Haotian Yan, Ying Wang, Zhongpan Zhu, Yuxin Zhang, Xintao Yan, Mustafa Suphi Erden, Cheng Wang

Affiliations: (1) School of Engineering and Physical Sciences, Heriot-Watt University, Edinburgh, U.K.; (2) State Key Laboratory of Autonomous Intelligent Unmanned Systems, Tongji University, Shanghai, China; (3) College of Computer Science and Technology, Jilin University, Changchun, China; (4) University of Shanghai for Science and Technology, Shanghai, China; (5) National Key Laboratory of Automotive Chassis Integration and Bionics, Jilin University, Changchun, China; (6) Department of Civil Engineering, The University of Hong Kong, Hong Kong, China

AI summary: This study aims to establish trustworthy safety assurance for autonomous driving systems (ADSs) by generating responsibility-attributed adversarial scenarios that distinguish system deficiencies from unavoidable traffic conflicts. The proposed method, CARS, combines context-aware adversary selection with a generative adversarial policy optimized in closed-loop simulation, producing collision scenarios that are physically feasible and whose responsibility can be attributed. Across traffic environments from multiple countries, the framework discovers feasible collision scenarios with high attribution rates under regulation-prescribed driver models, pointing toward interpretable validation of ADSs.

Abstract

Establishing trustworthy safety assurance for autonomous driving systems (ADSs) requires evidence that failures arise from avoidable system deficiencies rather than unavoidable traffic conflicts. Current adversarial simulation methods can efficiently expose collisions, but generally lack mechanisms to distinguish these fundamentally different failure modes. Here we present CARS (Context-Aware, Responsibility-attributed Scenario generation), a framework that integrates responsibility attribution directly into adversarial scenario generation. CARS combines context-aware adversary selection with a generative adversarial policy optimized in closed-loop simulation to construct collision scenarios that are both physically feasible and diagnostically attributable. Across benchmark datasets spanning heterogeneous national traffic environments, CARS consistently discovers feasible collision scenarios with high attribution rates under multiple regulation-prescribed careful and competent driver models. By coupling adversarial generation with normative responsibility assessment, CARS moves simulation testing beyond collision discovery toward the construction of interpretable, regulation-aligned safety evidence for scalable ADS validation.

2605.13746 2026-05-14 cs.CV cs.AI

Weakly-Supervised Spatiotemporal Anomaly Detection

Urvi Gianchandani, Praveen Tirupattur, Mubarak Shah

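Affiliations: University of Texas at Dallas; University of Central Florida

AI summary: This paper studies weakly supervised spatiotemporal anomaly detection, training with video-level labels only and no frame-level annotation. Features are extracted from normal and anomalous video clips, and a multiple instance ranking loss (MIL) is used to assign anomaly scores to spatiotemporal regions, reflecting the fact that anomalies are localized in both time and space. Results are reported on the UCF Crime2Local dataset, which includes spatiotemporal annotations.

Abstract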

In this paper, we explore a weakly supervised method for anomaly detection. Since annotating videos is time-consuming, we only look at weak video-level labels during training. This means that given a video, we know that it is either normal or contains an anomaly, but no further annotations are used to train the network. Features are extracted from video clips that are either normal or anomalous. These features are used to determine anomaly scores for spatiotemporal regions of the clips based on a classifier and the implementation of a multiple instance ranking loss (MIL). We represent both anomalous and normal video clips as positive and negative bags, respectively, to apply MIL. Furthermore, since anomalies are usually localized to a part of a frame rather than the whole frame, we chose to explore temporal as well as spatial anomaly detection. We show our results on the UCF Crime2Local Dataset, which contains spatiotemporal annotations for a portion of the UCF Crime Dataset.
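
As a rough illustration of the multiple instance ranking idea described above, the sketch below scores one anomalous (positive) bag against one normal (negative) bag with a hinge loss on their top-scoring instances. The function name, the margin value, and the example scores are assumptions for illustration; the abstract does not specify the exact loss, so this is the commonly used form rather than the authors' implementation.

```python
from typing import Sequence


def mil_ranking_loss(
    anomalous_bag: Sequence[float],
    normal_bag: Sequence[float],
    margin: float = 1.0,  # assumed margin; not given in the abstract
) -> float:
    """Hinge ranking loss between the top-scoring instance of each bag.

    Each bag holds anomaly scores for the spatiotemporal regions of one video.
    Only the video-level label (anomalous vs. normal) is needed.
    """
    top_anomalous = max(anomalous_bag)  # most suspicious region in the positive bag
    top_normal = max(normal_bag)        # most suspicious region in the negative bag
    return max(0.0, margin - top_anomalous + top_normal)


# Hypothetical region scores for one anomalous and one normal video.
scores_anomalous = [0.1, 0.9, 0.2]
scores_normal = [0.1, 0.3, 0.2]
loss = mil_ranking_loss(scores_anomalous, scores_normal)
```

The loss is zero only when the highest-scoring region of the anomalous video exceeds the highest-scoring region of the normal video by at least the margin, which is what lets video-level labels supervise region-level scores.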

2605.13744 2026-05-14 cs.CV

Aligning Network Equivariance with Data Symmetry: A Theoretical Framework and Adaptive Approach for Image Restoration

Feiyu Tan, Qi Xie, Zongben Xu, Deyu Meng

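Affiliations: School of Mathematics and Statistics

AI summary: Image restoration is an inherently ill-posed inverse problem, and equivariant networks that embed geometric symmetry priors can mitigate this and improve performance. However, the relationship between network equivariance and data symmetry is still understood only heuristically, with no systematic theoretical framework for quantifying symmetry, selecting transformation groups, or evaluating model-data alignment. Taking an optimization perspective, the paper proposes a quantifiable, dataset-level definition of non-strict symmetry, uses it as a constraint in formulating the restoration inverse problem, and relates data symmetry, model equivariance, and generalization. It also proposes a sample-adaptive equivariant network that dynamically aligns with each sample's inherent symmetry; experiments on super-resolution, denoising, and deraining show significant gains over conventional methods.

Comments: 30 pages, 9 figures, Supplementary Material can be found at https://github.com/tanfy929/SA-Conv

Abstract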

Image restoration is an inherently ill-posed inverse problem. Equivariant networks that embed geometric symmetry priors can mitigate this ill-posedness and improve performance. However, current understanding of the relationship between network equivariance and data symmetry remains largely heuristic. Particularly for real-world data with imperfect symmetry, existing research lacks a systematic theoretical framework to quantify symmetry, select transformation groups, or evaluate model-data alignment. To bridge this gap, we conduct an analysis from an optimization perspective and formalize the intrinsic relationship among data symmetry priors, model equivariance, and generalization capability. Specifically, we propose for the first time a quantifiable definition of non-strict symmetry at the dataset level (rather than the sample level) and use it as a constraint to formulate the restoration inverse problem. We then show that equivariance for restoration models can be naturally derived from this inverse problem incorporating the proposed symmetry constraints, and that the equivariance error of the optimal restoration operator is strictly bounded by the data symmetry error and the discretization mesh size. Furthermore, by analyzing the network's empirical risk, we demonstrate that aligning equivariance with data symmetry optimizes the bias-variance trade-off, minimizing the total expected risk. Guided by these insights, we propose a Sample-Adaptive Equivariant Network that uses a hypernetwork and transformation-learnable equivariant convolutions to dynamically align with each sample's inherent symmetry. Extensive experiments on super-resolution, denoising, and deraining validate our theoretical findings and show significant superiority over standard baselines and traditional equivariant models. Our code and supplementary material are available at https://github.com/tanfy929/SA-Conv.
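
To make the notion of equivariance error concrete, here is a minimal sketch that measures how far an image operator is from commuting with a 90-degree rotation. The helper names, the choice of the rotation group, and the toy operators are all assumptions for illustration; the paper's formal definition at the dataset level is not reproduced in the abstract.

```python
from typing import Callable, List

Image = List[List[float]]


def rot90(img: Image) -> Image:
    """Rotate a square image by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def equivariance_error(op: Callable[[Image], Image], img: Image) -> float:
    """Largest absolute gap between op(rot90(x)) and rot90(op(x))."""
    rotated_then_applied = op(rot90(img))
    applied_then_rotated = rot90(op(img))
    return max(
        abs(p - q)
        for row_a, row_b in zip(rotated_then_applied, applied_then_rotated)
        for p, q in zip(row_a, row_b)
    )


def pointwise_square(img: Image) -> Image:
    """Acts on each pixel independently, so it commutes with any rotation."""
    return [[v * v for v in row] for row in img]


def horizontal_diff(img: Image) -> Image:
    """Difference with the left neighbour (zero padding): direction-dependent."""
    return [
        [row[j] - (row[j - 1] if j > 0 else 0.0) for j in range(len(row))]
        for row in img
    ]


x = [[1.0, 2.0], [3.0, 4.0]]
err_pointwise = equivariance_error(pointwise_square, x)
err_directional = equivariance_error(horizontal_diff, x)
```

A pointwise operator has zero error here, while a filter with a preferred direction does not; an operator that is only approximately equivariant would sit between these two cases.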

2605.13740 2026-05-14 cs.LG

Learning POMDP World Models from Observations with Language-Model Priors

Valentin Six, Frederik Panse, Mathis Fajeau, Lancelot Da Costa, Mridul Sharma, Alfonso Amayuelas, Tim Z. Xiao, David Hyland, Philipp Hennig, Bernhard Schölkopf

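Affiliations: Max Planck Institute for Intelligent Systems; IRIIS; University of California, Santa Barbara; University of Tübingen; University of Oxford; ELLIS Institute Tübingen

AI summary: This study examines how language-model priors can help learn partially observable Markov decision process (POMDP) world models from observations while reducing reliance on environment interaction. It proposes Pinductor, in which a language model proposes candidate POMDP models from a few observation-action trajectories and iteratively refines them to optimize a belief-based likelihood score. Experiments show that Pinductor is more sample-efficient than tabular POMDP baselines and that performance improves with LLM capability, offering a route to efficient world-model learning under partial observability.

Abstract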

Whether navigating a building, operating a robot, or playing a game, an agent that acts effectively in an environment must first learn an internal model of how that environment works. Partially observable Markov decision processes (POMDPs) provide a flexible modeling class for such internal world models, but learning them from observation-action trajectories alone is challenging and typically requires extensive environment interaction. We ask whether language-model priors can reduce costly interaction by leveraging prior knowledge, and introduce Pinductor (POMDP-inductor): an LLM proposes candidate POMDP models from a few observation-action trajectories and iteratively refines them to optimize a belief-based likelihood score. Despite using strictly less information, Pinductor matches the performance and sample efficiency of LLM-based POMDP learning methods that assume privileged access to the hidden state, while significantly surpassing the sample efficiency of tabular POMDP baselines. Further results show that performance scales with LLM capability and degrades gracefully as semantic information about the environment is withheld. Together, these results position language-model priors as a practical tool for sample-efficient world-model learning under partial observability, and a step toward generalist agents in real-world environments. Code is available at https://github.com/atomresearch/pinductor.
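
One way to read "belief-based likelihood score" is as the log-likelihood of the observed trajectory under a candidate POMDP, computed by filtering the belief forward without ever seeing the hidden state. The sketch below does this for a small tabular model. The function name, the table layout, and the toy numbers are assumptions; the abstract does not define the score, so this is the standard forward filter, not necessarily Pinductor's exact objective.

```python
import math
from typing import List, Sequence, Tuple


def belief_log_likelihood(
    belief: Sequence[float],                      # initial belief over hidden states
    transition: Sequence[Sequence[Sequence[float]]],  # transition[a][s][s2] = P(s2 | s, a)
    observation: Sequence[Sequence[float]],       # observation[s2][o] = P(o | s2)
    trajectory: Sequence[Tuple[int, int]],        # (action, observation) pairs
) -> float:
    """Log-likelihood of the observations given the actions, via belief filtering."""
    b: List[float] = list(belief)
    n = len(b)
    log_lik = 0.0
    for action, obs in trajectory:
        # Predict: push the belief through the transition model.
        predicted = [
            sum(b[s] * transition[action][s][s2] for s in range(n)) for s2 in range(n)
        ]
        # Weight by how well each state explains the observation.
        joint = [predicted[s2] * observation[s2][obs] for s2 in range(n)]
        p_obs = sum(joint)
        if p_obs == 0.0:
            return float("-inf")  # the candidate model rules this trajectory out
        log_lik += math.log(p_obs)
        # Correct: normalise to obtain the posterior belief.
        b = [j / p_obs for j in joint]
    return log_lik


# Toy candidate model: two hidden states, one "stay" action, two observations.
T = [[[1.0, 0.0], [0.0, 1.0]]]
O = [[0.9, 0.1], [0.2, 0.8]]
score = belief_log_likelihood([0.5, 0.5], T, O, [(0, 0), (0, 0)])
```

Candidate models proposed by an LLM could then be ranked by this score on held-out trajectories, with higher values indicating a better fit.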

2605.13737 2026-05-14 cs.AI cs.CL

Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs

Trung Nguyen Quang, Yiming Gao, Fanyi Pu, Kaichen Zhang, Shuo Sun, Ziwei Liu

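Affiliations: Nanyang Technological University; LMMs-Lab Team; Johns Hopkins University

AI summary: This paper studies a "representation-action gap" in omnimodal large language models when a question's textual premise contradicts what the model actually perceives. The authors build a benchmark, IMAVB, to evaluate how well models detect conflicts between perception and textual premises, and find that hidden states reliably encode the contradiction while output behavior shows either under-rejection or over-rejection. They also propose a probe-guided logit adjustment that improves rejection behavior, suggesting that the bottleneck for omnimodal models lies in translating encoded information into action rather than in perception.

Abstract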

When an omnimodal large language model accepts a question whose textual premise contradicts what it actually sees or hears, does the failure lie in perception or in action? Recent omnimodal models are positioned as perception-grounded agents that jointly process video, audio, and text, yet a basic form of grounding remains untested: catching a textual claim that conflicts with the model's own sensory input. We introduce IMAVB, a curated 500-clip benchmark of long-form movies with a 2x2 design crossing target modality (vision, audio) and premise condition (standard, misleading), which lets us measure conflict detection separately from ordinary multimodal comprehension. Across eight open-source omnimodal LLMs and Gemini 3.1 Pro, we document a Representation-Action Gap: hidden states reliably encode premise-perception mismatches even when the same models almost never reject the false claim in their outputs. Behaviorally, models fall into two failure modes: under-rejection, in which they answer misleading questions as if the false premise were true; and over-rejection, in which they reject more often but also reject standard questions, sacrificing ordinary comprehension accuracy. The gap is modality-asymmetric (audio grounding underperforms vision) and prompt-resistant across seven variants. As an initial diagnostic intervention, a probe-guided logit adjustment (PGLA) re-injects the encoded mismatch signal into decoding and consistently improves rejection behavior. Together, these results suggest the bottleneck for omnimodal grounding lies in translation, not perception.

2605.13731 2026-05-14 cs.LG cs.HC

Distinguishing performance gains from learning when using generative AI

Lixiang Yan, Samuel Greiff, Jason M. Lodge, Dragan Gašević

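Affiliations: Faculty of Information Technology, Monash University; School of Education, The University of Queensland

AI summary: This article asks whether the performance gains from using generative artificial intelligence (AI) in education reflect genuine, high-quality learning. It notes that although generative AI can boost learners' performance, its use may not promote the deep cognitive and metacognitive processing that such learning requires, and it stresses that educational applications should attend to the depth and quality of learning.

Journal ref: Nature Reviews Psychology, 4(7), 435-436 (2025)

Abstract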

Generative artificial intelligence (AI) is increasingly being integrated into education, where it can boost learners' performance. However, these uses do not promote the deep cognitive and metacognitive processing that is required for high-quality learning.

2605.13730 2026-05-14 cs.LG cs.AI cs.CV

Robust and Explainable Bicuspid Aortic Valve Diagnosis Using Stacked Ensembles on Echocardiography

Christos Chrysanthos Nikolaidis, Vasileios Sachpekidis, Nikolas Moustakidis, Theofilos Moustakidis, Pavlos S. Efraimidis

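Affiliations: Department of Electrical and Computer Engineering, Democritus University of Thrace

AI summary: This study aims to diagnose bicuspid aortic valve (BAV) reliably from echocardiographic images, addressing inconsistency caused by differences in operator experience and image quality. It proposes an explainable video-ensemble AI model that classifies BAV versus tricuspid aortic valve (TAV) from routinely acquired parasternal long-axis (PLAX) cine loops. The model shows strong classification performance on data from 90 patients and provides interpretable evidence through Grad-CAM and SHAP values, supporting transparent and auditable clinical diagnosis.

Abstract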

Transthoracic echocardiography (TTE) is the first-line imaging modality for diagnosing bicuspid aortic valve (BAV), yet diagnostic performance varies with operator expertise and image quality. We developed an explainable AI model that distinguishes BAV from tricuspid aortic valves (TAV) using routinely acquired parasternal long-axis (PLAX) cine loops. A multi-backbone video ensemble was trained and evaluated using a leakage-aware, stratified outer cross-validation protocol on $N{=}90$ patient studies (48 BAV, 42 TAV). Across fixed outer splits and 10 random seeds, the calibrated stacked ensemble achieved an outer-CV F1-score of $0.907$ and recall of $0.877$. Frame-level Grad-CAM localized salient evidence to the aortic root and leaflet plane, while globally aggregated SHAP values quantified each video backbone's contribution to the stacked prediction, enabling transparent, case-level auditability. These findings indicate that PLAX-based video ensembles can support reliable BAV/TAV classification from routine echocardiographic cine loops and may facilitate earlier detection in non-specialist or resource-limited clinical settings.

2605.13729 2026-05-14 cs.CV cs.AI

Coordinating Multiple Conditions for Trajectory-Controlled Human Motion Generation

Deli Cai, Haoyang Ma, Changxing Ding

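Affiliations: School of Electronic and Information Engineering, South China University of Technology; Pazhou Lab

AI summary: This paper studies the generation of realistic human motion conditioned on both a text description and a spatial trajectory. Existing methods suffer from conflicts between conditions and from redundant motion representations, which degrade motion quality or destabilize trajectory control. The authors propose CMC, a decoupled framework that uses a divide-and-conquer strategy to split the task into a trajectory-control stage and a motion-completion stage, ensuring accurate trajectory following and then generating the full motion. A selective inpainting mechanism is introduced to mitigate overfitting caused by limited data, and experiments show that CMC achieves superior control accuracy and motion quality on multiple datasets.

Abstract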

Trajectory-controlled human motion generation aims to synthesize realistic human motions conditioned on both textual descriptions and spatial trajectories. However, existing methods suffer from two critical limitations: first, the conflict between text and trajectory conditions disrupts the denoising process, resulting in compromised motion quality or inaccurate trajectory following; second, the use of redundant motion representations introduces inconsistencies between motion components, leading to instability during trajectory control. To address these challenges, we propose CMC, a decoupled framework that coordinates text and trajectory conditions through a divide-and-conquer strategy with two cascaded stages: Trajectory Control and Motion Completion. In the first stage, a diffusion model generates a simplified representation of the controlled joints under guidance from the given trajectories, ensuring accurate and stable trajectory following. In the second stage, a text-conditioned diffusion inpainting model generates full-body motions using the simplified representation from the first stage as partial observations. To mitigate overfitting caused by limited inpainting training data, we further introduce the Selective Inpainting Mechanism (SIM), which alternates between text-to-motion generation and motion inpainting tasks during training. Experiments on the HumanML3D and KIT datasets show that CMC achieves state-of-the-art control accuracy and motion quality, demonstrating its effectiveness in coordinating multimodal conditions and representations.

2605.13725 2026-05-14 cs.AI cs.SI

ScioMind: Cognitively Grounded Multi-Agent Social Simulation with Anchoring-Based Belief Dynamics and Dynamic Profiles

Yitian Yang, Yiqun Duan, Linghan Huang, Yiqi Zhu, Francesco Bailo, Chunmeizi Su, Huaming Chen

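Affiliations: The University of Sydney

AI summary: ScioMind is a cognitively grounded multi-agent social simulation framework intended to make LLM-based studies of social opinion dynamics more realistic. It combines structured opinion dynamics with LLM-based agent reasoning, introducing a memory-anchored belief update rule, a hierarchical memory architecture, and corpus-grounded dynamic agent profiles to better simulate how human beliefs and behavior change during social interaction. Experiments show that ScioMind yields more realistic behavior in terms of opinion polarisation, diversity, and trajectory stability.

Abstract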

Large language model (LLM)-based multi-agent simulation offers a powerful testbed for studying social opinion dynamics. Yet current approaches often adopt one of two contrasting methods: either relying on fixed update rules with limited cognitive grounding or delegating belief change largely to unconstrained LLM interaction. We introduce ScioMind, a cognitively grounded simulation framework that bridges these paradigms by combining structured opinion dynamics with LLM-based agent reasoning. ScioMind integrates three key components: 1) a memory-anchored belief update rule that modulates susceptibility to influence via personality-conditioned anchoring strength; 2) a hierarchical memory architecture that supports persistent, experience-driven belief formation; and 3) dynamic agent profiles derived from a corpus-grounded retrieval pipeline, enabling heterogeneous personalities, rationales, and evolving internal states. We evaluate ScioMind on multiple case studies in a real-world policy debate scenario. Across metrics including polarisation, diversity, extremization, and trajectory stability, the proposed components consistently yield improvements in behavioural realism. In particular, dynamic profiles increase opinion diversity, memory and reflection reduce unstable oscillation, and anchoring induces persistent belief trajectories that better align with patterns reported in political psychology. These results suggest that our cognitively grounded design provides a novel solution to LLM-based social simulation that improves both stability and behavioural realism.

2605.13724 2026-05-14 cs.CV cs.AI

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han, Han Cai, Mike Zheng Shou

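Affiliations: NVIDIA; Show Lab, National University of Singapore; MIT

AI summary: This paper proposes AnyFlow, a flow-map-based distillation framework for any-step video diffusion models, addressing the performance drop that consistency-distilled models show when more sampling steps are allocated at test time. AnyFlow shifts the distillation target from an endpoint consistency mapping to flow-map transitions over arbitrary time intervals, optimizing the full ODE sampling trajectory, and introduces Flow Map Backward Simulation to enable efficient on-policy distillation and reduce test-time errors. Experiments show that AnyFlow matches or surpasses existing methods in few-step generation while scaling flexibly to any number of steps.

Comments: Project page at https://nvlabs.github.io/AnyFlow/

Abstract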

Few-step video generation has been significantly advanced by consistency distillation. However, the performance of consistency-distilled models often degrades as more sampling steps are allocated at test time, limiting their effectiveness for any-step video diffusion. This limitation arises because consistency distillation replaces the original probability-flow ODE trajectory with a consistency-sampling trajectory, weakening the desirable test-time scaling behavior of ODE sampling. To address this limitation, we introduce AnyFlow, the first any-step video diffusion distillation framework based on flow maps. Instead of distilling a model for only a few fixed sampling steps, AnyFlow optimizes the full ODE sampling trajectory. To this end, we shift the distillation target from endpoint consistency mapping $(z_{t}\rightarrow z_{0})$ to flow-map transition learning $(z_{t}\rightarrow z_{r})$ over arbitrary time intervals. We further propose Flow Map Backward Simulation, which decomposes a full Euler rollout into shortcut flow-map transitions, enabling efficient on-policy distillation that reduces test-time errors (i.e., discretization error in few-step sampling and exposure bias in causal generation). Extensive experiments across both bidirectional and causal architectures, at scales ranging from 1.3B to 14B parameters, demonstrate that AnyFlow matches or surpasses consistency-based counterparts in the few-step regime, while scaling with sampling step budgets.
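
The distinction between the two distillation targets can be written out with the standard definition of an ODE flow map. The symbols $v$ and $\Phi$ are notation assumed here for illustration; the abstract itself only gives the mappings $(z_t \rightarrow z_0)$ and $(z_t \rightarrow z_r)$.

```latex
% Assumed notation: v(z, s) is the velocity field of the probability-flow ODE.
\frac{dz_s}{ds} = v(z_s, s), \qquad
\Phi_{t \to r}(z_t) = z_t + \int_t^r v(z_s, s)\, ds = z_r .

% The endpoint consistency mapping is the special case r = 0:
\Phi_{t \to 0}(z_t) = z_0 .

% Flow maps compose across an intermediate time s between t and r:
\Phi_{s \to r} \circ \Phi_{t \to s} = \Phi_{t \to r} .

% A single Euler step approximates a short transition:
z_r \approx z_t + (r - t)\, v(z_t, t) .
```

The composition property is what allows a long rollout to be split into shorter transitions, so a model trained on arbitrary $(t, r)$ pairs can in principle be sampled with any number of steps.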

2605.13717 2026-05-14 cs.LG stat.ML

Tight Sample Complexity Bounds for Entropic Best Policy Identification

Amer Essakine, Claire Vernade

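Affiliations: ENS Paris Saclay; University of Technology Nuremberg

AI summary: This paper studies best-policy identification in finite-horizon risk-sensitive reinforcement learning under the entropic risk measure. Targeting the exponential gap between existing upper and lower bounds on sample complexity, the authors propose a forward-model-based algorithm with KL-based exploration bonuses. By exploiting the smoothness of the exponential utility, they sharpen the concentration analysis and remove the extra exponential factor, so that the sample complexity matches the lower bound and the gap is closed.

Abstract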

We study best-policy identification for finite-horizon risk-sensitive reinforcement learning under the entropic risk measure. Recent work established a constant gap in the exponential horizon dependence between lower and upper bounds on the number of samples required to identify an approximately optimal policy. Precisely, known lower bounds scale as $\Omega(e^{|\beta| H})$, where $H$ is the horizon of the MDP, while the state-of-the-art upper bound achieves at best $O(e^{2|\beta| H})$ (arXiv:2506.00286v2) using a generative model. We show that this extra exponential factor can be traced to overly loose concentration control for exponential utilities. To close this gap, we revisit the analysis through a forward-model-based algorithm that builds on KL-based exploration bonuses, which we adapt to the entropic criterion. The improvement rests on two technical innovations: we leverage the smoothness properties of the exponential utility to derive sharper concentration bounds, and we propose a new stopping rule that further exploits this tightness to obtain a sample complexity that matches the lower bound.
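
For readers unfamiliar with the criterion, the entropic risk of a random return $X$ with parameter $\beta \neq 0$ is $\frac{1}{\beta}\log \mathbb{E}[e^{\beta X}]$. The sketch below estimates it from sampled returns using a log-sum-exp shift for numerical stability. The function name and the sample values are assumptions for illustration; this is the textbook definition, not code from the paper.

```python
import math
from typing import Sequence


def entropic_risk(returns: Sequence[float], beta: float) -> float:
    """Empirical entropic risk (1 / beta) * log(mean(exp(beta * x)))."""
    if beta == 0.0:
        # The limit beta -> 0 recovers the ordinary (risk-neutral) mean.
        return sum(returns) / len(returns)
    scaled = [beta * x for x in returns]
    shift = max(scaled)  # log-sum-exp trick: avoid overflow in exp
    log_mean_exp = shift + math.log(
        sum(math.exp(s - shift) for s in scaled) / len(scaled)
    )
    return log_mean_exp / beta


samples = [0.0, 1.0]
risk_averse = entropic_risk(samples, beta=-1.0)   # below the mean of 0.5
risk_seeking = entropic_risk(samples, beta=1.0)   # above the mean of 0.5
```

Because $e^{\beta X}$ with $X \in [0, H]$ ranges over a factor of $e^{|\beta| H}$, estimating this expectation from samples naturally brings in the exponential horizon dependence that the bounds above are about.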

2605.13713 2026-05-14 cs.CV eess.IV

Learning to Optimize Radiotherapy Plans via Fluence Maps Diffusion Model Generation and LSTM-based Optimization

Isabella Poles, Simon Arberet, Riqiang Gao, Martin Kraus, Marco D. Santambrogio, Florin C. Ghesu, Ali Kamen, Dorin Comaniciu

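Affiliations: Politecnico di Milano; Digital Technology and Innovation, Siemens Healthineers

AI summary: This paper proposes an end-to-end optimization method for radiotherapy plan generation based on a diffusion model and an LSTM. A distribution-matching distilled diffusion model generates clinically feasible fluence maps, and an LSTM module learns gradient update dynamics to refine the dose distribution quickly. Experiments show improved planning efficiency, flexibility, and machine deliverability over existing methods.

Comments: Early Accept at MICCAI 2026

Abstract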

Volumetric Modulated Arc Therapy (VMAT) is a cornerstone of modern radiation therapy, enabling highly conformal tumor irradiation and healthy-tissue sparing. Yet its planning involves solving inverse, nested optimization problems for multi-leaf collimators, monitor units, and dose parameters, while enforcing their consistency to ensure mechanical deliverability. Moreover, this process often requires repeated re-optimization when treatment configurations change, resulting in substantial planning time per patient. To address these problems, we present a diffusion-driven Learning-to-Optimize (L2O) method for end-to-end VMAT planning. A distribution-matching distilled diffusion model learns a clinically feasible manifold of fluence maps, enabling their one-shot generation. On top of this, an LSTM-based L2O module learns gradient update dynamics to swiftly refine fluence maps toward prescribed dose objectives during inference. Experimental results on clinical and public prostate cancer cohorts demonstrate improved planning efficiency, flexibility, and machine deliverability over currently available end-to-end VMAT planners.

2605.13711 2026-05-14 cs.LG

MILM: Large Language Models for Multimodal Irregular Time Series with Informative Sampling

Hsing-Huan Chung, Shijun Li, Yoav Wald, Xing Han, Suchi Saria, Joydeep Ghosh

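Affiliations: University of Texas at Austin; Technion-IIT; Johns Hopkins University

AI summary: This study proposes MILM, a language model for multimodal irregular time series, which handles asynchronous, irregularly sampled data from heterogeneous numerical and textual channels, such as electronic health records in healthcare. MILM represents the series as time-ordered triplets in XML format and uses a two-stage fine-tuning strategy that first learns from sampling patterns alone and then jointly models sampling patterns and observed values, improving classification performance. Experiments show that MILM achieves the best or second-best results on multiple healthcare datasets and is more robust when values are missing.

Abstract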

Multimodal irregular time series (MITS) consist of asynchronous and irregularly sampled observations from heterogeneous numerical and textual channels. In healthcare, for example, patients' electronic health records (EHR) include irregular lab measurements and clinical notes. The irregular timing and channel patterns of observations carry predictive signal alongside the numerical values and textual content. LLMs are natural candidates for processing such heterogeneous data, given their extensive pretrained knowledge spanning textual and numerical domains. We introduce MILM (Multimodal Irregular time series Language Model), which represents MITS as time-ordered triplets in Extensible Markup Language (XML) format and fine-tunes an LLM through a two-stage strategy for MITS classification. The first stage trains on value-redacted MITS to predict from sampling patterns alone, and the second stage trains on full MITS to jointly model sampling patterns and observed values. Our two-stage model (MILM-2S) and its single-stage counterpart (MILM-Direct) achieve the best and second-best average performance on multiple EHR datasets. Further value redaction evaluations confirm that sampling patterns carry predictive signal and that MILM-2S learns to exploit them. In the value pending evaluation we introduce, where some values are unavailable at prediction time, MILM-2S outperforms MILM-Direct by a larger margin compared to standard evaluation. For MILM-2S, preserving the time and channel of value-pending observations as additional sampling information further improves in-hospital mortality prediction.
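
A possible shape for the XML triplet representation, including the value-redacted variant used in the first training stage, is sketched below. The tag names (`series`, `obs`), the attribute names, and the redaction placeholder are assumptions made for illustration; the abstract states only that observations are time-ordered triplets in XML format.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple, Union

Observation = Tuple[float, str, Union[float, int, str]]  # (time, channel, value)


def serialize_mits(observations: Iterable[Observation], redact_values: bool = False) -> str:
    """Serialize (time, channel, value) triplets as time-ordered XML.

    With redact_values=True the values are masked, leaving only the sampling
    pattern (when and on which channel something was observed).
    """
    root = ET.Element("series")  # hypothetical tag name
    for time, channel, value in sorted(observations, key=lambda o: o[0]):
        obs = ET.SubElement(root, "obs", {"time": str(time), "channel": channel})
        obs.text = "[REDACTED]" if redact_values else str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical record mixing a numerical channel and a clinical note.
record = [
    (2.5, "note", "patient stable"),
    (0.0, "heart_rate", 88),
]
full_xml = serialize_mits(record)
pattern_only_xml = serialize_mits(record, redact_values=True)
```

The redacted string still exposes timing and channel identity, which is the information the first stage is meant to learn from before values are added back in the second stage.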

2605.13709 2026-05-14 cs.CL cs.AI cs.LG

Children's English Reading Story Generation via Supervised Fine-Tuning of Compact LLMs with Controllable Difficulty and Safety

Qian Shen, Fanghua Cao, Min Yao, Shlok Gilda, Bonnie J. Dorr, Walter L. Leite

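Affiliations: University of Florida

AI summary: This study aims to generate English stories suitable for children to read while controlling difficulty and ensuring safety. Three compact 8B-parameter large language models are trained with supervised fine-tuning to generate stories matched to children's reading levels. Experiments show that appropriately fine-tuned 8B models control difficulty better than larger models used zero-shot and raise almost no safety issues, offering a low-cost option for generating children's reading material in educational settings.

Comments: 15 pages, 4 figures. Author Two and Author Three contributed equally. Accepted by the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026), ACL 2026

Abstract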

Large Language Models (LLMs) are widely applied in educational practices, such as for generating children's stories. However, the generated stories are often too difficult for children to read, and the operational cost of LLMs hinders their widespread adoption in educational settings. We used an existing expert-designed children's reading curriculum and its corresponding generated stories from GPT-4o and Llama 3.3 70B to design different experiments for fine-tuning three 8B-parameter LLMs, which then generated new English reading stories that were subjected to quantitative and qualitative evaluation. Our method prioritizes controllability over scale, enabling educators to target reading levels and error patterns with a compact, affordable model. Our evaluation results show that with appropriate fine-tuning designs, children's English reading stories generated by 8B LLMs perform better on difficulty-related metrics than those from zero-shot GPT-4o and Llama 3.3 70B, with almost no discernible safety issues. Such fine-tuned LLMs could be more broadly used by teachers, parents, and children in classrooms and at home to generate engaging English reading stories that match children's interests, with controllable difficulty and safety.

2605.13702 2026-05-14 cs.AI

Adaptive mine planning under geological uncertainty: A POMDP framework for sequential decision-making

Hamza Khalifi, Jef Caers, Yassine Taha, Mostafa Benzaazoua, Abdellatif Elghali

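Affiliations: Geology & Sustainable Mining Institute (GSMI), University Mohammed VI Polytechnic (UM6P); Department of Earth and Planetary Sciences, Stanford University

AI summary: This paper proposes a framework based on partially observable Markov decision processes (POMDPs) for adaptive mine planning under geological uncertainty. The method progressively updates beliefs about geological conditions and dynamically adjusts extraction and routing decisions, replacing the traditional fixed-plan approach. It introduces a hybrid architecture combining simulated annealing with an ensemble smoother to achieve computational tractability, and in a copper-gold open-pit case study it improves net present value and shows robustness under uncertainty.

Abstract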

Strategic mine production scheduling under geological uncertainty is conventionally formulated as a stochastic optimization problem in which a fixed extraction sequence and routing decisions are computed ex ante. This plan-driven paradigm treats uncertainty as passive: decisions are hedged across geological scenarios, but planning does not anticipate how future observations will inform future decisions. We propose a different perspective by formulating mine scheduling as a Partially Observable Markov Decision Process (POMDP), in which extraction and routing decisions are made sequentially with planning explicitly integrating the expectation of future belief updates. To achieve computational tractability, we introduce a hybrid SA-POMDP architecture that combines simulated annealing-based (SA) value approximation with ensemble-based belief updating via ensemble smoother with multiple data assimilation (ES-MDA). At each decision epoch, candidate actions are evaluated through their expected long-term value under the current belief, and the belief is updated as mining observations are assimilated. This yields an adaptive policy rather than a fixed plan. We evaluate the framework on a copper-gold open-pit mining complex with multiple processing destinations. Under a statistically consistent prior, the SA-POMDP reduces the expectation-reality gap from 22.3% to 4.6%, improving realized NPV by USD8.4M relative to one-shot stochastic optimization. Under systematic prior misspecification of 10%, the adaptive framework outperforms static planning by up to USD44.6M (36.9%), demonstrating structural robustness beyond scenario hedging. These results show that sequential belief updating transforms geological uncertainty from a passive constraint into an active component of value creation.

2605.13695 2026-05-14 cs.CL cs.AI

RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning

Andrea Morandi

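Affiliations: Cisco

AI summary: This study proposes RTLC, a three-stage prompting paradigm inspired by the Feynman Learning Technique that aims to improve the accuracy of large language models used as judges, with no fine-tuning. Through its Research, Teach-to-Learn, and Critique stages, RTLC has the model generate multiple candidate verdicts, cross-compare them, and output a single critiqued verdict. On the JudgeBench benchmark, RTLC raises judging accuracy above both self-consistency voting and zero-shot prompting.

Abstract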

LLM-as-a-judge is now the default measurement instrument for open-ended generation, but on the public JudgeBench benchmark even strong instruction-tuned judges barely scrape past random on objective-correctness pairwise items. We introduce RTLC, a three-stage prompting recipe -- Research, Teach-to-Learn, Critique -- that promotes a single black-box LLM into an ensemble-of-thought judge with no fine-tuning, retrieval, or external tools. Stage 1 wraps the input in a fixed pedagogical scaffold porting the Feynman Learning Technique (study $\to$ teach $\to$ find gaps $\to$ simplify) into LLM prompting. Stage 2 draws N=10 independent candidate verdicts at temperature 0.4. Stage 3 acts as its own critic, cross-comparing the candidate set against the original question to emit one critiqued verdict at temperature 0. On JudgeBench-GPT (350 hard pairwise items), Claude 3.7 Sonnet's pairwise accuracy climbs from 64.6% (single-shot vanilla prompt) to 78.6% (RTLC critique-of-10) -- an absolute 14.0-percentage-point gain. RTLC also beats N=10 self-consistency majority voting (77.7%) and a zero-shot first candidate (74.0%). A clean three-step ablation attributes +9.4 pp to the Teach-to-Learn scaffold, +3.7 pp to N=10 marginalisation, and +0.9 pp to explicit critique. We discuss the cost-accuracy frontier (RTLC sits above self-consistency at every working point), the error-budget breakdown across the four JudgeBench categories (knowledge, reasoning, math, coding), and how RTLC composes orthogonally with post-hoc judge-score calibration, with the two interventions compounding multiplicatively in practice.
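
The three stages can be sketched as a small pipeline around any text-in, text-out model. Everything below is hypothetical scaffolding: the `LLM` callable interface, the prompt wording, and the `'A'`/`'B'` verdict format are assumptions, and the paper's actual prompts are not given in the abstract. Only the sample count (N=10), the temperatures (0.4 and 0), and the reported accuracy figures come from the text above.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical interface: (prompt, temperature) -> verdict text such as "A" or "B".
LLM = Callable[[str, float], str]

# Placeholder wording for the Feynman-style scaffold (study, teach, find gaps, simplify).
SCAFFOLD = (
    "Study the question, teach it in plain terms, find gaps in your explanation, "
    "simplify, then answer with 'A' or 'B'.\n\n"
)


def majority_vote(verdicts: List[str]) -> str:
    """Self-consistency baseline: the most frequent candidate verdict."""
    return Counter(verdicts).most_common(1)[0][0]


def rtlc_judge(item: str, llm: LLM, n: int = 10) -> str:
    """Sketch of the three-stage recipe for one pairwise item."""
    prompt = SCAFFOLD + item                              # Stage 1: wrap in the scaffold
    candidates = [llm(prompt, 0.4) for _ in range(n)]     # Stage 2: N candidates at T=0.4
    critique_prompt = (                                   # Stage 3: critique at T=0
        f"{item}\n\nCandidate verdicts: {', '.join(candidates)}\n"
        "Compare these against the question and give one final verdict, 'A' or 'B'."
    )
    return llm(critique_prompt, 0.0)


# Consistency check of the reported ablation: baseline plus the three gains.
baseline = 64.6
gains = [9.4, 3.7, 0.9]  # scaffold, N=10 marginalisation, explicit critique
rtlc_accuracy = round(baseline + sum(gains), 1)
```

The reported numbers are internally consistent: 64.6 + 9.4 gives the 74.0% first-candidate figure, adding 3.7 gives the 77.7% majority-vote figure, and adding 0.9 gives the 78.6% RTLC result.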

2605.13692 2026-05-14 cs.LG cs.CC

Polyhedral Instability Governs Regret in Online Learning

Yuetai Li, Fengqing Jiang, Yichen Feng, Kaiyuan Zheng, Luyao Niu, Bhaskar Ramasubramanian, Basel Alomair, Linda Bushnell, Radha Poovendran

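Affiliations: University of Washington; Western Washington University; King Abdulaziz City for Science and Technology; HUMAIN

AI summary: This paper studies regret bounds for online decision problems over combinatorial actions, showing that regret in such problems is governed by the instability of the polyhedral structure, that is, the number of changes of the active region. Under full-information feedback and a fixed partition, the authors prove a tight regret bound in terms of the number of region switches and the number of vertices per region. The result applies to online convex optimization and online submodular-concave games, and experiments support the theoretical analysis.

Abstract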

Many online decision problems over combinatorial actions are addressed via convex relaxations, leading to online convex optimization with piecewise linear objectives and induced polyhedral structure. We show that regret in such problems is governed by polyhedral instability: the number of changes of the active region. Under full information feedback and fixed partition assumptions, if $\mathrm{RS}_T$ denotes the number of region switches and $V_{\max}$ the maximum number of vertices per region, we prove $\mathrm{Regret}_T = \Theta(\sqrt{(1+\mathrm{RS}_T)\,T\,\log V_{\max}})$, interpolating between experts-like and dimension-dependent OCO rates. For online submodular-concave games under Lovász convexification, this reduces to the permutation-switch count $\mathrm{SC}_T$, yielding the matching rate $\mathrm{Regret}_T = \Theta(\sqrt{(1+\mathrm{SC}_T)\,T\,\log n})$. Experiments on synthetic and real combinatorial problems (shortest path, influence maximization) validate the predicted scaling and indicate that low-instability regimes can arise in practice without explicit enumeration of actions.
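
To get a feel for how the stated rate behaves, the sketch below evaluates the expression inside the $\Theta(\cdot)$ with the constant dropped. The function name and the example values of $T$, $\mathrm{RS}_T$, and $V_{\max}$ are assumptions for illustration; the numbers show scaling only, not actual regret.

```python
import math


def regret_rate(horizon: int, region_switches: int, max_vertices: int) -> float:
    """Evaluate sqrt((1 + RS_T) * T * log(V_max)), ignoring constant factors."""
    return math.sqrt((1 + region_switches) * horizon * math.log(max_vertices))


# With no region switches the rate has the familiar experts-like form sqrt(T log V).
stable = regret_rate(horizon=10_000, region_switches=0, max_vertices=16)

# Three switches multiply (1 + RS_T) by four, so the rate doubles.
unstable = regret_rate(horizon=10_000, region_switches=3, max_vertices=16)
```

The dependence on the switch count is through a square root, so occasional changes of the active region cost comparatively little.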

2605.13690 2026-05-14 cs.LG cs.AI

The WidthWall: A Strict Expressivity Hierarchy for Hypergraph Neural Networks

Fengqing Jiang, Yuetai Li, Yichen Feng, Kaiyuan Zheng, Luyao Niu, Bhaskar Ramasubramanian, Basel Alomair, Linda Bushnell, Radha Poovendran

Affiliations: University of Washington; Western Washington University; King Abdulaziz City for Science and Technology; HUMAIN

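AI summary: This study examines the ability of hypergraph neural networks (HGNNs) to express complex higher-order interaction structures and argues that a model's expressive power depends on which local structural patterns it can detect and count. Using homomorphism densities, it establishes a strict expressivity hierarchy indexed by hypertree width and identifies a "Width Wall": once the width of a structural pattern exceeds a certain threshold, no fixed-depth HGNN can represent it. The work gives a unified theoretical analysis of 15 HGNN architectures and validates on real-world hypergraph datasets that the Width Wall predicts model performance.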

Abstract

Hypergraphs provide a natural framework to model higher-order interactions in scientific, social, and biological systems. Hypergraph neural networks (HGNNs) aim to learn from such data, yet it remains unclear which higher-order structures these models can represent. We show that hypergraph expressivity is governed by which small patterns an architecture can detect and count. We formalize this via homomorphism densities, which measure how often a structural motif appears in a hypergraph. Combining classical homomorphism-count completeness with invariant approximation, we show that homomorphism densities generate all continuous hypergraph invariants and organize them into a strict hierarchy indexed by hypertree width. This yields a Width Wall: a fundamental architectural limit beyond which no hidden dimension, training procedure or fixed-depth HGNN can represent invariants requiring wider patterns. Our framework provides a unified characterization of 15 HGNN architectures, precisely identifies information lost by clique expansion, and motivates density-aware models that extend expressivity beyond bounded-width message passing. We experimentally validate this finding on an application node classification suite of real-world hypergraphs, where the Width Wall predicts when graph-reduction baselines fail and when density features help.
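As a rough illustration of what a homomorphism density measures, the brute-force sketch below counts vertex maps from a small pattern into a host hypergraph and normalises by the number of all maps. The convention used here (a map is a homomorphism when the image of every pattern hyperedge is exactly a host hyperedge) is one common choice and an assumption; the paper's definition may differ in detail.

```python
from itertools import product
from typing import Iterable, Tuple


def hom_density(
    pattern_vertices: int,
    pattern_edges: Iterable[Tuple[int, ...]],
    host_vertices: int,
    host_edges: Iterable[Tuple[int, ...]],
) -> float:
    """Fraction of vertex maps pattern -> host that send every pattern
    hyperedge onto a hyperedge of the host (brute force, small inputs only)."""
    host = {frozenset(edge) for edge in host_edges}
    pattern = [tuple(edge) for edge in pattern_edges]
    count = 0
    for phi in product(range(host_vertices), repeat=pattern_vertices):
        if all(frozenset(phi[v] for v in edge) in host for edge in pattern):
            count += 1
    return count / host_vertices ** pattern_vertices
```

For a single-edge pattern and a triangle host, 6 of the 9 vertex maps are homomorphisms, so the density is 2/3. The enumeration is exponential in the pattern size, which is why it is only suitable for small motifs.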

2605.13688 2026-05-14 cs.CV cs.LG

MedCore: Boundary-Preserving Medical Core Pruning for MedSAM

Cenwei Zhang, Suncheng Xiang, Lei You

Affiliations: Shanghai Jiao Tong University; Technical University of Denmark

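AI summary: MedCore is a structured pruning framework for MedSAM that aims to compress the model substantially while preserving boundary accuracy in medical image segmentation. It prunes efficiently by preserving two kinds of structures: those that became important during SAM-to-MedSAM adaptation and those with high boundary leverage. Experiments show that on several polyp segmentation benchmarks MedCore greatly reduces parameters and computation while maintaining high Dice and boundary metrics, supporting its effectiveness and reliability for medical image segmentation.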

Comments 3 figures, 17 pages

Abstract

Medical segmentation foundation models such as SAM and MedSAM provide strong prompt-driven segmentation, but their image encoders are still too large for many clinical settings. Compression is also risky in medicine because a model can keep high Dice while losing boundary fidelity. We propose MedCore, a structured pruning framework for MedSAM. The main idea is to preserve two kinds of structures: structures that became important during SAM-to-MedSAM adaptation, and structures that have high boundary leverage. We identify the first type by a dual-intervention score that compares zeroing a group with resetting it to its original SAM weight. We identify the second type by boundary-aware Fisher estimation. We also introduce a boundary leverage principle, which shows that compression-induced boundary displacement is controlled by logit perturbation on the boundary divided by the logit spatial gradient. This principle explains why boundary metrics can degrade even when Dice remains high. On polyp segmentation benchmarks, MedCore reduces parameters by 60.0% and FLOPs by 58.4% while achieving Dice 0.9549, Boundary F1 0.6388, and HD95 5.14 after recovery fine-tuning. It also reaches 86.6% parameter reduction and 90.4G FLOPs with strong boundary quality. Our analysis further shows that MedSAM lies in a head-fragile boundary regime: head-pruning steps have 2.887 times larger 95th-percentile boundary leverage than MLP-pruning steps, and this logit-level effect is consistent with BF1 and HD95 degradation. Our code is available at https://github.com/cenweizhang/MedCore.
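The boundary leverage principle stated above (displacement controlled by logit perturbation divided by the logit spatial gradient) can be checked on a one-dimensional toy profile. The sketch below is an illustration under simplifying assumptions, not the paper's estimator: the logit is linear, the perturbation is a constant offset, and the helper names are hypothetical.

```python
from typing import Callable


def predicted_shift(logit_perturbation: float, logit_gradient: float) -> float:
    """First-order estimate of boundary displacement: |delta| / |gradient|."""
    return abs(logit_perturbation) / abs(logit_gradient)


def zero_crossing(f: Callable[[float], float], lo: float, hi: float) -> float:
    """Locate the decision boundary f(x) = 0 by bisection (needs f(lo) < 0 < f(hi))."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def measured_shift(slope: float, delta: float, boundary: float = 5.0) -> float:
    """Boundary displacement of the logit slope * (x - boundary) after adding delta."""
    perturbed = lambda x: slope * (x - boundary) + delta
    return abs(zero_crossing(perturbed, 0.0, 10.0) - boundary)
```

The same perturbation of 0.2 moves a steep boundary (slope 4.0) by 0.05 but a flat one (slope 0.5) by 0.4. Shifts of this size can leave region overlap nearly unchanged while still degrading boundary metrics, which is consistent with the abstract's explanation.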

2605.13687 2026-05-14 cs.LG cs.AI stat.ML

A Hierarchical Language Model with Predictable Scaling Laws and Provable Benefits of Reasoning

Jason Gaitonde, Frederic Koehler, Elchanan Mossel, Joonhyung Shin, Allan Sly

Affiliations: Duke University; University of Chicago; Massachusetts Institute of Technology; Princeton University

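AI summary: This paper introduces a class of synthetic languages with hierarchical structure, generated by a broadcast process on trees, for which the role of context length and reasoning in autoregressive generation can be analyzed precisely. It introduces an exact $k$-gram ansatz in place of conventional transformer models and validates it empirically. For specific language models, generation with insufficient context length deviates from the true language distribution, whereas a model with reasoning ability needs only logarithmic-length memory to sample exactly from the true language, an exponential improvement.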

Abstract

We introduce a family of synthetic languages with hierarchical structure -- generated by a broadcast process on trees -- for which the role of context length and reasoning in autoregressive generation can be analyzed precisely. At the heart of our analytic approach is an exact $k$-gram ansatz in place of transformers with context length $k$, a substitution we then validate empirically. Using this ansatz we derive explicit asymptotic predictions for distributional statistics of the sequences produced by a trained model, instantiated in two settings. For the Ising broadcast process (a soft-constrained language), we prove that the variance of the generated sum scales log-linearly in the context depth and its kurtosis converges to that of a Gaussian -- both deviating from the true language for any sublinear context. For the coloring broadcast process (a hard-constrained language) in the freezing regime, bounded-context autoregression produces sequences that, with high probability, are inconsistent with any valid coloring of the underlying tree. Together these results imply an $\Omega(n)$ lower bound on the context length required to faithfully sample length-$n$ sequences. In contrast, we prove that an autoregressive reasoning model with only $\Theta(\log n)$ working memory can sample exactly from the true language -- an exponential improvement. We confirm both the lower-bound predictions and the reasoning-based upper bound empirically with transformers trained on the synthetic language; the trained models track our asymptotic predictions quantitatively across a wide range of context sizes.
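For readers unfamiliar with broadcast processes, the sketch below generates the leaf sequence of an Ising broadcast on a complete tree. The binary branching, the uniform root spin, and the `flip_prob` parameter are illustrative assumptions; the abstract does not specify the exact tree or channel used in the paper.

```python
import random
from typing import List


def ising_broadcast_leaves(depth: int, flip_prob: float, rng: random.Random) -> List[int]:
    """Propagate a +/-1 root spin down a complete binary tree.

    Each child copies its parent's spin and flips it independently with
    probability flip_prob. The returned list is the left-to-right leaf
    sequence, which plays the role of one 'sentence'.
    """
    level = [rng.choice([-1, 1])]
    for _ in range(depth):
        next_level = []
        for spin in level:
            for _child in range(2):  # binary branching is an assumption
                flipped = rng.random() < flip_prob
                next_level.append(-spin if flipped else spin)
        level = next_level
    return level
```

Nearby leaves share recent ancestors and are strongly correlated, while distant leaves are coupled only through nodes near the root. That is the kind of long-range dependence a bounded context window has difficulty reproducing.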

2605.13686 2026-05-14 cs.CV cs.AI

Cross Modality Image Translation In Medical Imaging Using Generative Frameworks

Giulia Romoli, Alessia Capoccia, Filippo Ruffini, Francesco Di Feola, Luca Boldrini, Arturo Chiti, Renato Cuocolo, Tugba Akinci D'Antonoli, Fatemeh Darvizeh, Marcello Di Pumpo, Bradley J. Erickson, Liu Fang, Deborah Fazzini, Paola Feraco, Fabrizia Gelardi, Francesco Gossetti, Ana Isabel Hernáiz Ferrer, Michail E. Klontzas, Seyedmehdi Payabvash, Katrine Riklund, Sara N. Strandberg, Valerio Guarrasi, Paolo Soda

Affiliations: Department of Diagnostics and Intervention, Radiation Physics, Biomedical Engineering, Umeå University; Unit of Artificial Intelligence and Computer Systems, Department of Engineering, Università Campus Bio-Medico di Roma; Vita-Salute San Raffaele University; Department of Medicine, Surgery and Dentistry, University of Salerno; Division of Diagnostic and Interventional Neuroradiology, Department of Radiology, University Hospital Basel; Department of Pediatric Radiology, University Children’s Hospital Basel; Department of Life Science and Public Health, Università Cattolica del Sacro Cuore; Athinoula A. Martinos Center for Biomedical Imaging; Artificial Intelligence and Translational Imaging (ATI) Lab, Department of Radiology, School of Medicine, University of Crete; Division of Radiology, Department of Clinical Science, Intervention and Technology (CLINTEC), Karolinska Institute; Columbia University Medical Center; Department of Diagnostics and Intervention, Diagnostic Radiology, Umeå University

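AI summary: This paper studies cross-modality image translation in medical imaging, which aims to generate images of a target modality from a source modality without additional acquisitions. The authors propose a reproducible, standardized evaluation framework and systematically compare seven generative models across multiple clinical tasks and datasets, finding that GAN-based models outperform latent generative models overall, with SRGAN performing best on several tasks. The experiments also reveal weaknesses in synthesizing small lesions and a gap between quantitative metrics and clinical preference, and indicate that synthetic images are close to real ones in clinical discrimination.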

Abstract

Medical image-to-image (I2I) translation enables virtual scanning, i.e. the synthesis of a target imaging modality from a source one without additional acquisitions. Despite growing interest, most proposed methods operate on 2D slices, are evaluated on isolated tasks with different experimental set-ups and lack clinical validation. The primary contribution of this work is a reproducible, standardized comparative evaluation of 3D I2I translation methods in oncological imaging, designed to standardize preprocessing, splitting, inference, and multi-level evaluation across heterogeneous clinical tasks. Within this framework, we compare seven generative models, three Generative Adversarial Networks (GANs: Pix2Pix, CycleGAN, SRGAN) and four latent generative models (Latent Diffusion Model, Latent Diffusion Model+ControlNet, Brownian Bridge, Flow Matching), across eleven datasets spanning three anatomical regions (head/neck, lung, pelvis) and four translation directions (cone-beam CT to CT, MRI to CT, CT to PET, MRI T2-weighted to T2-FLAIR), for a total of 77 experiments under uniform training, inference, and evaluation conditions. The results show that GANs outperform latent generative models across all tasks, with SRGAN achieving statistically significant superiority. Our lesion-level analysis reveals that all models struggle with small lesions and that, in CT to PET synthesis, models reproduce lesion shape more reliably than absolute uptake-related intensity. We also performed a Visual Turing test administered to 17 physicians, including 15 radiologists, which shows near-chance classification accuracy (56.7%), confirming that synthetic volumes are largely indistinguishable from real acquisitions, while exposing a dissociation between quantitative metrics and clinical preference.

2605.13684 2026-05-14 cs.LG cs.IT math.IT

Scale-Sensitive Shattering: Learnability and Evaluability at Optimal Scale

Shashaank Aiyer, Yishay Mansour, Shay Moran, Han Shao, Tom Waknine

Affiliations: University of Maryland; Tel Aviv University and Google Research; Technion and Google Research

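AI summary: This paper studies the optimal scale at which real-valued function classes exhibit uniform convergence and learnability. By establishing a scale-sensitive fundamental theorem of PAC learning, the authors show that uniform convergence, learnability, and finiteness of the fat-shattering dimension are equivalent at specific scales, settling a long-standing question about the scales of learnability and improving existing upper bounds. The paper also gives sharp metric-entropy bounds in terms of the fat-shattering scale and applies them to estimating integral probability metrics, revealing a dichotomy between estimability and weak evaluability.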

Comments 32 pages, 1 figure

Abstract

We study the optimal scale at which real-valued function classes exhibit uniform convergence and learnability. Our main result establishes a scale-sensitive generalization of the fundamental theorem of PAC learning: for every bounded real-valued class and every $\gamma>0$, uniform convergence at scale $\gamma$, agnostic learnability at scale $\gamma/2$, and finiteness of the fat-shattering dimension at every scale $\gamma'>\gamma$ are equivalent. This resolves a question by Anthony and Bartlett (Cambridge Univ. Press 1999) on the precise scales governing learnability, refuting a conjecture attributed there to Phil Long that a multiplicative 2-factor gap is unavoidable, and improves the upper bounds of Bartlett and Long (JCSS 1998), which incur such a loss. The key technical ingredient is a direct bound on empirical $\ell_\infty$ covering numbers, avoiding the standard detour through packing numbers. As a consequence, we obtain sharp asymptotic metric-entropy bounds in terms of the fat-shattering scale $\gamma$: an $O(\log^2 n)$ bound holds already at scale $\gamma/2$, while an $O(\log n)$ bound holds at scale $2\gamma$. We further show that the $O(\log^2 n)$ bound is sometimes tight. These results resolve open questions by Alon et al. (JACM 1997) and Rudelson and Vershynin (Ann. of Math. 2006). As an application, we establish a sharp dichotomy for bounded integral probability metrics: every such IPM is either estimable or cannot be weakly evaluated within any multiplicative factor $c<3$, while $3$-weak evaluability always holds, resolving an open question from Aiyer et al. (ICML 2026). We also highlight several open questions on quantitative sample complexity and evaluability.
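For orientation, the sketch below checks whether a finite function class $\gamma$-shatters a set of points for given witness values, using the textbook definition: every above/below pattern must be realised by some function with margin $\gamma$ around the witnesses. Margin conventions differ across the literature (some use $\gamma/2$), so whether this matches the paper's normalisation is an assumption, and the witnesses are supplied by the caller rather than searched for.

```python
from itertools import product
from typing import Callable, Sequence


def is_gamma_shattered(
    points: Sequence[float],
    witnesses: Sequence[float],
    functions: Sequence[Callable[[float], float]],
    gamma: float,
) -> bool:
    """True if, for every above/below pattern, some function clears the
    witness by at least gamma on the 'above' points and stays at least
    gamma under it on the 'below' points."""
    for pattern in product([False, True], repeat=len(points)):
        realised = any(
            all(
                (f(x) >= r + gamma) if above else (f(x) <= r - gamma)
                for x, r, above in zip(points, witnesses, pattern)
            )
            for f in functions
        )
        if not realised:
            return False
    return True
```

The fat-shattering dimension at scale $\gamma$ is then the size of the largest point set for which some choice of witnesses passes this check. The check itself costs $2^d$ pattern evaluations for $d$ points.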

2605.13681 2026-05-14 cs.LG stat.ML

Sampling from Flow Language Models via Marginal-Conditioned Bridges

Iskander Azangulov, Leo Zhang

Affiliations: Department of Statistics, University of Oxford

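AI summary: This paper studies how to sample effectively from flow language models (FLMs) and proposes a sampler based on marginal-conditioned bridges. Unlike conventional approaches, at each reverse step the method draws a clean one-hot endpoint from the FLM's posterior marginals and then generates the continuous state through an analytic Ornstein-Uhlenbeck bridge, which better preserves the structure of the language model. The method needs no additional training, naturally supports decoding controls such as temperature scaling and nucleus truncation, and in experiments achieves a better balance between generation quality and diversity.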

Abstract

Flow Language Models (FLMs) are a recently introduced class of language models which adapt continuous flow matching for one-hot encoded token sequences. Their denoisers have a special structure absent from generic continuous diffusion models: each block of the denoising mean is a posterior marginal distribution over the clean token at that position. Standard DDPM-style samplers collapse these marginals to a single conditional-mean endpoint and bridge toward this simplex-valued point, which is generally not a valid one-hot sequence. We argue that the natural sampler for an FLM is instead posterior-predictive. At each reverse step, we sample a clean one-hot endpoint from the factorized posterior defined by the FLM token marginals, and then sample the next continuous state from the analytic Ornstein--Uhlenbeck bridge conditioned on that endpoint. The method is training-free, uses the same model evaluations as standard sampling, and gives a principled interface for token-level decoding controls such as temperature scaling and nucleus truncation. We show that, under exact posterior marginals, the endpoint approximation error is exactly the conditional multi-information among token positions. The induced one-step bridge kernel preserves all token-wise posterior-predictive marginals and loses only the residual cross-position dependence. Finally, we prove a Girsanov path-space comparison showing that the marginal-conditioned bridge has a no-larger denoising-error term than the frozen conditional-mean bridge, with strict improvement whenever intermediate coordinate-wise bridge observations reveal additional information about the clean token. Experiments with FLMs show that the sampler improves the quality--diversity tradeoff. Code is available at: github.com/imbirik/mcb.
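The endpoint-sampling half of the reverse step can be sketched as below: per-position token marginals are sharpened by a temperature, truncated to a nucleus, and sampled independently. This is a minimal illustration with hypothetical function names; the marginals are passed in as plain lists rather than produced by a model, and the Ornstein--Uhlenbeck bridge step is omitted because its parameterisation is not given in the abstract.

```python
import random
from typing import List, Sequence


def apply_temperature(probs: Sequence[float], temperature: float) -> List[float]:
    """Rescale a categorical distribution as p ** (1 / temperature), renormalised."""
    weights = [p ** (1.0 / temperature) for p in probs]
    total = sum(weights)
    return [w / total for w in weights]


def nucleus_truncate(probs: Sequence[float], top_p: float) -> List[float]:
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    truncated = [0.0] * len(probs)
    for i in kept:
        truncated[i] = probs[i] / mass
    return truncated


def sample_endpoint_tokens(
    marginals: Sequence[Sequence[float]],
    rng: random.Random,
    temperature: float = 1.0,
    top_p: float = 1.0,
) -> List[int]:
    """Sample one token index per position from the factorized posterior;
    the one-hot endpoint is the indicator vector of each sampled index."""
    tokens = []
    for probs in marginals:
        adjusted = nucleus_truncate(apply_temperature(probs, temperature), top_p)
        tokens.append(rng.choices(range(len(adjusted)), weights=adjusted, k=1)[0])
    return tokens
```

Because each position is sampled independently, cross-position dependence is ignored at this step. The abstract identifies exactly this residual dependence as the approximation error of the endpoint.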

2605.13678 2026-05-14 cs.LG

Three-Stage Learning Unlocks Strong Performance in Simple Models for Long-Term Time Series Forecasting

Zhenan Yu, Guangxin Jiang, Jin Yang

Affiliations: Harbin Institute of Technology

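AI summary: This paper proposes STAIR, a three-stage training framework that aims to unlock the potential of simple temporal mapping models for long-term time series forecasting without introducing complex structural modules. STAIR first learns dynamics common across variables through a shared temporal mapping, then fine-tunes per variable to capture variable-specific patterns, and finally brings in cross-variable information through residual learning, gradually increasing model flexibility. Experiments on nine long-term forecasting benchmarks show strong performance while keeping the model simple.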

Abstract

Recent studies on long-term time series forecasting have shown that simple linear models and MLP-based predictors can achieve strong performance without increasingly complex architectures. However, many competitive baselines still rely on structural priors such as frequency-domain modeling, explicit decomposition, multi-scale mixing, or sophisticated cross-variable interaction modules, while paying less attention to how simple temporal mappings should be trained and organized. In this paper, we propose STAIR, short for Stagewise Temporal Adaptation via Individualization and Residual Learning, a training paradigm for long-term time series forecasting that aims to unlock the capacity of simple temporal mapping models without introducing complex architectural modules. STAIR decomposes forecasting ability into three progressive stages: it first learns common temporal dynamics across variables through a shared temporal mapping, then adapts the shared model to each variable via channel-wise fine-tuning to capture variable-specific patterns, and finally complements the backbone with cross-variable information through residual learning. We further introduce Shared-to-Individual Fine-tuning and alpha-RevIN to mitigate the limitations of strict channel independence and the overly strong normalization prior induced by standard RevIN. This design gradually increases modeling flexibility while keeping the core temporal predictor as a shallow MLP in the main experiments, with linear variants analyzed separately. Experiments on nine long-term forecasting benchmarks show that STAIR matches or outperforms recent strong baselines while preserving a simple temporal backbone, providing a concise and effective modeling perspective for long-term time series forecasting.

2605.13675 2026-05-14 cs.CV cs.LG q-bio.NC

Characterizing Universal Object Representations Across Vision Models

Florian P. Mahner, Johannes Roth, Ka Chun Lam, Michael F. Bonner, Francisco Pereira, Martin N. Hebart

Affiliations: Vision and Computational Cognition Group; Max Planck Institute; Justus-Liebig-University Giessen; Machine Learning Core; Department of Cognitive Science; National Institute of Mental Health; Johns Hopkins University

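AI summary: This study examines the convergence of visual representations in deep neural networks trained with different architectures, objectives, and datasets, asking which visual properties models actually converge on and what drives that convergence. By decomposing the object similarity structure of 162 diverse vision models into a small set of non-negative dimensions and analysing how often these dimensions recur across models, it finds that some dimensions are universal across models, and that these are more interpretable and more strongly driven by semantic image properties. Models with more universal dimensions also better predict primate visual cortex activity and human similarity judgments, suggesting that universality may reflect representational properties relevant to biological vision.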

Abstract

Deep neural networks trained with different architectures, objectives, and datasets have been reported to converge on similar visual representations. However, what remains unknown is which visual properties models actually converge on and which factors may underlie this convergence. To address this, we decompose the object similarity structure of 162 diverse vision models into a small set of non-negative dimensions. To determine universal versus model-specific dimensions, we then estimate how often each dimension reappears across models. In contrast to model-specific dimensions, universal dimensions are more interpretable and more strongly driven by conceptual image properties, indicating the relevance of interpretability and semantic content as implicit factors driving universality across models. Differences in architecture, objective function, training data, model size, and model performance do not explain the emergence of universal dimensions. However, models with more universal dimensions also better predict macaque IT activity and human similarity judgments, suggesting that universality reflects representations relevant to biological vision. These findings have important implications for understanding the emergent representations underlying deep neural network models and their alignment with biological vision.

2605.13673 2026-05-14 cs.LG

Graph Neural Networks with Triangle-Based Messages for the Multicut Problem

Jannik Irmai, Lucas Fabian Naumann, Bjoern Andres

Affiliations: Faculty of Computer Science, TU Dresden; Center for Scalable Data Analytics and AI, Dresden/Leipzig

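AI summary: This paper studies graph neural network methods for the multicut problem, a computationally hard combinatorial optimization problem with broad applications in bioinformatics, data mining, and computer vision. The authors propose an adapted graph neural network architecture in which features are assigned only to edges and messages are passed along triangles in the graph, to better fit the objective and constraints of the multicut problem. Experiments show that the method outperforms existing heuristic solvers while keeping runtimes feasible, and on some instances finds optimal solutions in seconds where exact solvers need hours.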

Comments 21 pages, 5 figures

Abstract

The multicut problem is an NP-hard combinatorial optimization problem with diverse applications in fields such as bioinformatics, data mining and computer vision. Graph neural networks have been defined for the multicut problem but can be adapted further to its specific objective function and constraints. In this article, we introduce such an adapted graph neural network architecture in which features are assigned only to edges, and the computation of messages is based on triangles in the underlying graph. Experiments with synthetic and real-world instances with up to 200 nodes show that our method outperforms state-of-the-art heuristic solvers in terms of solution quality while maintaining feasible runtimes. For some instances, our method finds optimal solutions in seconds whereas exact solvers need hours to find and certify optimal solutions.
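One reason triangles are a natural unit for this problem is that a 0/1 edge labeling is a valid multicut only if no cycle contains exactly one cut edge, and triangles are the shortest such cycles. The sketch below enumerates triangles and reports those violating that condition. It illustrates the constraint only and is not the authors' message-passing scheme; the dictionary-based edge labeling is an assumed representation.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Tuple


def triangles(num_nodes: int, edges: Iterable[FrozenSet[int]]) -> List[Tuple[int, int, int]]:
    """All node triples whose three connecting edges are present in the graph."""
    edge_set = set(edges)
    return [
        (u, v, w)
        for u, v, w in combinations(range(num_nodes), 3)
        if {frozenset((u, v)), frozenset((v, w)), frozenset((u, w))} <= edge_set
    ]


def violated_triangles(
    num_nodes: int, cut: Dict[FrozenSet[int], int]
) -> List[Tuple[int, int, int]]:
    """Triangles with exactly one cut edge (label 1), which no valid multicut allows."""
    return [
        tri
        for tri in triangles(num_nodes, cut.keys())
        if sum(cut[frozenset(pair)] for pair in combinations(tri, 2)) == 1
    ]
```

In a complete graph the triangle conditions alone characterise valid multicuts. In sparser graphs longer cycles also have to be checked, so an empty result here is necessary but not sufficient.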

2605.13670 2026-05-14 cs.CV

Pattern-Enhanced RT-DETR for Multi-Class Battery Detection

Xu Zhong, Enyuan Hu

Affiliations: Independent Researcher; Chemistry Division, Brookhaven National Laboratory, NY, USA

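AI summary: For multi-class battery detection, this paper proposes PaQ-RT-DETR, a pattern-enhanced RT-DETR that introduces pattern-based dynamic query generation to alleviate query activation imbalance while keeping computational overhead low. The study systematically compares several detection models on a public dataset of approximately 8,591 annotated images; PaQ-RT-DETR-X achieves a higher overall mAP@50 than the baselines, with notable gains on data-scarce battery classes, offering practical guidance for choosing object detection models in battery-related industrial applications.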

Comments 4 pages, 3 figures

Abstract

Accurate and efficient battery detection is increasingly important for applications in electronic waste recycling, industrial quality control, and automated sorting systems. In this paper, we present both a comprehensive benchmark and a novel method for multi-class battery detection. We systematically compare three CNN-based detectors (YOLOv8n, YOLOv8s, YOLO11n) and two transformer-based detectors (RT-DETR-L, RT-DETR-X) on a publicly available dataset of approximately 8,591 annotated images under identical experimental conditions, and further propose PaQ-RT-DETR, which introduces pattern-based dynamic query generation into RT-DETR to alleviate query activation imbalance with negligible computational overhead. Among baselines, YOLO11n achieves the best CNN-based accuracy (mAP@50: 0.779) at only 2.6M parameters, while YOLOv8n delivers the fastest inference at ~1,667 FPS. PaQ-RT-DETR-X achieves the highest overall mAP@50 of 0.782, surpassing RT-DETR-X by +2.8% with consistent per-class gains across all six battery categories including the data-scarce Bike Battery class. Our findings provide practical guidance for selecting object detection models in battery-related industrial applications.

2605.13667 2026-05-14 cs.CV

SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models

Vladislav Makarov, Mark Gizetdinov, Dmitry Yudin

Affiliations: MIRAI

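AI summary: SceneGraphVLM is a compact, vision-language-model-based method for generating structured scene graphs from images and videos. It serializes graphs in the token-efficient TOON format and uses a two-stage training strategy, supervised fine-tuning followed by reinforcement learning, to improve relation coverage and precision while avoiding irrelevant objects and relations. For video, the model can condition on the scene graph generated for the previous frame, providing lightweight short-term context without tracking or post-processing. Experiments show a good balance between quality and generation speed across several datasets and a clear improvement in scene graph precision.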

Abstract

Scene graph generation provides a compact structured representation for visual perception, but accurate and fast graph prediction from images and videos remains challenging. Recent VLM-based methods can generate scene graphs end-to-end as structured text, yet often produce long outputs with irrelevant objects and relations. We present SceneGraphVLM, a compact method for image and video scene graph generation with small visual language models. SceneGraphVLM serializes graphs in a token-efficient TOON format and trains the model in two stages: supervised fine-tuning followed by reinforcement learning with hallucination-aware rewards that balance relation coverage and precision while penalizing unsupported objects and relations. For videos, the model can optionally condition each frame on the previously generated graph, providing lightweight short-term context without tracking or post-processing. We evaluate SceneGraphVLM on PSG, PVSG, and Action Genome. With compact VLMs and vLLM-accelerated decoding, SceneGraphVLM achieves a strong quality-speed trade-off, improves precision-oriented SGG metrics while preserving reasonable recall, and generates complete scene graphs with approximately one-second latency. Code and implementation details are available at: https://github.com/markus0440/SceneGraphVLM.git.
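One way a reward could balance relation coverage and precision while penalising unsupported objects is sketched below. The abstract does not give the paper's actual reward, so this is a hypothetical construction: the F1-minus-penalty form, the `penalty_weight` parameter, and the exact-match comparison of (subject, predicate, object) triplets are all assumptions.

```python
from typing import Iterable, Set, Tuple

Triplet = Tuple[str, str, str]  # (subject, predicate, object)


def _objects(triplets: Set[Triplet]) -> Set[str]:
    """All object names mentioned as subject or object of any triplet."""
    return {name for subj, _, obj in triplets for name in (subj, obj)}


def hallucination_aware_reward(
    predicted: Iterable[Triplet],
    reference: Iterable[Triplet],
    penalty_weight: float = 0.5,
) -> float:
    """F1 over relation triplets minus a penalty for objects that never
    appear in the reference graph (a stand-in for 'unsupported' objects)."""
    pred, ref = set(predicted), set(reference)
    if not pred or not ref:
        return 0.0
    matched = pred & ref
    precision = len(matched) / len(pred)
    recall = len(matched) / len(ref)
    f1 = 0.0 if not matched else 2 * precision * recall / (precision + recall)
    pred_objects = _objects(pred)
    unsupported = len(pred_objects - _objects(ref)) / len(pred_objects)
    return f1 - penalty_weight * unsupported
```

Under this form a model that pads its output with extra relations loses reward twice, through lower precision and through the unsupported-object penalty. That matches the abstract's stated goal of avoiding long outputs with irrelevant content.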

2605.13665 2026-05-14 cs.RO

Robot Squid Game: Quadrupedal Locomotion for Traversing Narrow Tunnels

Amir Hossain Raj, Dibyendu Das, Xuesu Xiao

Affiliations: Department of Computer Science, George Mason University

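AI summary: This paper studies autonomous locomotion of quadruped robots through complex 3D environments such as narrow tunnels. To address the limited adaptability of existing methods to diverse terrain and complex structures, the authors propose a reinforcement learning framework that combines procedural environment generation with policy distillation, using a teacher-student paradigm to transfer knowledge from expert policies trained on different tunnel structures into a unified policy. The method needs no complex reward design, improves the robustness and generality of quadruped robots in confined spaces, and is validated in simulation and real-world experiments.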

Abstract

Quadruped robots demonstrate exceptional potential for navigating complex terrain in critical applications such as search and rescue missions and infrastructure inspection. However, autonomous traversal of confined 3D environments, including tunnels, caves, and collapsed structures, remains a significant challenge. Existing methods often struggle with rigid gait patterns, limited adaptability to diverse geometries, and reliance on oversimplified environmental assumptions. This paper introduces a Reinforcement Learning (RL) framework that combines procedural environment generation with policy distillation to enable robust locomotion across various tunnel configurations. Our approach leverages a teacher-student training paradigm in which specialized expert policies, trained on procedurally generated tunnel geometries, transfer their knowledge to a unified student policy. This strategy eliminates the need for complex reward shaping in end-to-end RL training, simplifying the process by breaking down complicated tasks into smaller, more manageable components that are easier for the robot to learn. By synthesizing diverse tunnel structures during training and distilling navigation strategies into a generalizable policy, our method achieves consistent traversal across complex spatial constraints where conventional approaches fail. We demonstrate through both simulation and real-world experiments that our method enables quadruped robots to successfully traverse challenging confined tunnel environments.

2605.13664 2026-05-14 cs.CV physics.optics

HADAR-Based Thermal Infrared Hyperspectral Image Restoration

Cheng Dai, Jiale Lin, Bingxuan Song, Yifei Chen, Jiashuo Chen, Xin Yuan, Fanglin Bao

Affiliations: School of Science, Westlake University; School of Engineering, Westlake University

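AI summary: Thermal-infrared hyperspectral imagery (TIR-HSI) is valuable for many applications, but its practical use is severely limited by factors such as sensor degradation. This paper proposes HAIR, a physics-driven framework based on the HADAR rendering equation that uses a physical model of temperature, emissivity, and texture (TeX) triplets to restore ground-based TIR-HSI with high accuracy. The method provides physical consistency and robustness to spatio-spectral noise, and uses an atmospheric downwelling reference together with the spectral smoothness of emissivity to enable spectral calibration and generation; experiments show it outperforms existing methods on denoising, inpainting, spectral calibration, and spectral super-resolution.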

Comments 17 pages, 18 figures

Abstract

Thermal-infrared (TIR) hyperspectral imagery (HSI) provides critical scene information for various applications. However, its practical utility is severely limited by unique sensor degradations beyond the capabilities of existing restoration methods, which are ignorant of underlying thermal physics. Here, we propose HAIR (HADAR-based Image Restoration) as a physics-driven framework for ground-based TIR-HSI restoration. HAIR utilizes the HADAR rendering equation (HRE) and combines it with the atmospheric downwelling radiative transfer equation (RTE) to model TIR-HSI using temperature, emissivity, and texture (TeX) physical triplets. This physical model leads to a TeX decompose-synthesize strategy that guarantees physical consistency and spatio-spectral noise resilience, in stark contrast to existing approaches. Moreover, our framework uses a forward-modeled atmospheric downwelling reference, along with spectral smoothness of emissivity and blackbody radiation, to enable spectral calibration and generation that would otherwise be elusive. Our extensive experiments on the outdoor DARPA Invisible Headlights dataset and in-lab FTIR measurements show that HAIR consistently outperforms state-of-the-art methods across denoising, inpainting, spectral calibration, and spectral super-resolution, establishing a benchmark in objective accuracy and visual quality.
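A forward model built on temperature, emissivity, and an environmental (texture) term can be sketched with Planck's law plus a reflected component. The abstract does not state the rendering equation, so the emitted-plus-reflected form used here, $S = e\,B_\lambda(T) + (1-e)\,X$, is an assumption about its general shape, and the function and parameter names are hypothetical.

```python
import math

PLANCK_H = 6.62607015e-34   # Planck constant, J s
LIGHT_C = 2.99792458e8      # speed of light, m / s
BOLTZMANN_K = 1.380649e-23  # Boltzmann constant, J / K


def planck_radiance(wavelength_m: float, temperature_k: float) -> float:
    """Blackbody spectral radiance B_lambda(T) in W / (m^2 sr m)."""
    prefactor = 2.0 * PLANCK_H * LIGHT_C ** 2 / wavelength_m ** 5
    exponent = PLANCK_H * LIGHT_C / (wavelength_m * BOLTZMANN_K * temperature_k)
    return prefactor / math.expm1(exponent)


def rendered_radiance(
    emissivity: float,
    temperature_k: float,
    wavelength_m: float,
    environment_radiance: float,
) -> float:
    """Assumed form: thermal emission plus reflected environmental radiance."""
    emitted = emissivity * planck_radiance(wavelength_m, temperature_k)
    reflected = (1.0 - emissivity) * environment_radiance
    return emitted + reflected
```

Restoring an image under such a model means recovering the temperature, emissivity, and environment term per pixel and re-synthesising the radiance from them. This is the decompose-synthesize idea the abstract describes.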