arXivDaily: daily arXiv digest, updated Monday through Friday
2605.13753 2026-05-14 cs.LG cs.CV

Min Generalized Sliced Gromov Wasserstein: A Scalable Path to Gromov Wasserstein

Ashkan Shahbazi, Xinran Liu, Ping He, Soheil Kolouri

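AI summary: This paper proposes min Generalized Sliced Gromov-Wasserstein (min-GSGW), a new method for solving the Gromov-Wasserstein (GW) problem efficiently. It introduces expressive generalized slicers and learns coupled nonlinear slicers for the two input measures, so that the GW objective is minimized directly in the original spaces. min-GSGW is rigid-motion invariant, which suits geometric matching and shape analysis, and the experiments report meaningful geometric correspondences at substantially lower computational cost than existing GW solvers.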

Abstract

We propose min Generalized Sliced Gromov-Wasserstein (min-GSGW), a sliced formulation for the Gromov-Wasserstein (GW) problem using expressive generalized slicers. The key idea is to learn coupled nonlinear slicers that assign compatible push-forward values to both input measures, so that monotone coupling in the projected domain lifts to a transport plan evaluated against the GW objective in the original spaces. The resulting plan induces a GW objective value, and min-GSGW minimizes this cost directly in the original spaces. We further show that min-GSGW is rigid-motion invariant, a crucial property for geometric matching and shape analysis tasks. Our contributions are threefold: 1) we introduce generalized slicers into the sliced GW framework; 2) we construct a slicing-based efficient GW transport plan; and 3) we develop an amortized variant that replaces per-instance optimization with a learned slicer for unseen input pairs. We perform experiments on animal mesh matching, horse mesh interpolation, and ShapeNet part transfer. Results show that min-GSGW produces meaningful geometric correspondences and GW objective values at substantially lower computational cost than existing GW solvers.
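
The lifting step described in this abstract (sort the 1D slicer values, pair them monotonically, then score the induced plan with the GW objective in the original spaces) can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes equal-size point clouds with uniform weights and Euclidean ground distances, and the fixed slicers `radial` and the first-coordinate projection are hypothetical stand-ins for the learned nonlinear slicers.

```python
import numpy as np

def pairwise_dist(P):
    """Euclidean distance matrix of a point cloud of shape (n, d)."""
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def sliced_gw_cost(X, Y, slicer_x, slicer_y):
    """GW cost of the plan induced by sorting 1D slicer values.

    Assumes equal-size clouds with uniform weights, so the monotone
    coupling is a permutation pairing the k-th smallest slicer value
    of X with the k-th smallest slicer value of Y.
    """
    ix = np.argsort(slicer_x(X))
    iy = np.argsort(slicer_y(Y))
    DX = pairwise_dist(X)[np.ix_(ix, ix)]
    DY = pairwise_dist(Y)[np.ix_(iy, iy)]
    return float(np.mean((DX - DY) ** 2))

def radial(P):
    """Hypothetical nonlinear slicer: distance of each point to the centroid."""
    return np.linalg.norm(P - P.mean(axis=0), axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
Y = X @ R.T + np.array([3.0, -1.0])  # rigid motion of X

cost_radial = sliced_gw_cost(X, Y, radial, radial)
cost_linear = sliced_gw_cost(X, Y, lambda P: P[:, 0], lambda P: P[:, 0])
```

In this toy setup the rigid-motion-invariant slicer recovers the true correspondence and gives a GW cost of essentially zero, while the linear slicer does not. A learned slicer would instead be optimized to drive this cost down.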

2605.13751 2026-05-14 cs.RO cs.SE cs.SY eess.SY

Learning Responsibility-Attributed Adversarial Scenarios for Testing Autonomous Vehicles

Yizhuo Xiao, Haotian Yan, Ying Wang, Zhongpan Zhu, Yuxin Zhang, Xintao Yan, Mustafa Suphi Erden, Cheng Wang

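AI summary: This study aims to establish trustworthy safety assurance for autonomous driving systems (ADSs) by generating responsibility-attributed adversarial scenarios that separate system deficiencies from unavoidable traffic conflicts. The proposed method, CARS, combines context-aware adversary selection with a generative adversarial policy optimized in closed-loop simulation to produce collision scenarios that are physically feasible and attributable. Across traffic datasets from several countries, the framework finds feasible collision scenarios with high attribution rates under regulation-prescribed driver models, pointing toward interpretable, regulation-aligned validation of ADSs.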

Abstract

Establishing trustworthy safety assurance for autonomous driving systems (ADSs) requires evidence that failures arise from avoidable system deficiencies rather than unavoidable traffic conflicts. Current adversarial simulation methods can efficiently expose collisions, but generally lack mechanisms to distinguish these fundamentally different failure modes. Here we present CARS (Context-Aware, Responsibility-attributed Scenario generation), a framework that integrates responsibility attribution directly into adversarial scenario generation. CARS combines context-aware adversary selection with a generative adversarial policy optimized in closed-loop simulation to construct collision scenarios that are both physically feasible and diagnostically attributable. Across benchmark datasets spanning heterogeneous national traffic environments, CARS consistently discovers feasible collision scenarios with high attribution rates under multiple regulation-prescribed careful and competent driver models. By coupling adversarial generation with normative responsibility assessment, CARS moves simulation testing beyond collision discovery toward the construction of interpretable, regulation-aligned safety evidence for scalable ADS validation.

2605.13746 2026-05-14 cs.CV cs.AI

Weakly-Supervised Spatiotemporal Anomaly Detection

Urvi Gianchandani, Praveen Tirupattur, Mubarak Shah

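AI summary: This paper studies weakly supervised spatiotemporal anomaly detection, training only on video-level labels with no frame-level annotation. Features are extracted from normal and anomalous video clips, and spatiotemporal regions are assigned anomaly scores using a multiple instance ranking loss (MIL), accounting for the fact that anomalies are localized in both time and space. Results are reported on the UCF Crime2Local dataset, which provides spatiotemporal annotations.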

Abstract

In this paper, we explore a weakly supervised method for anomaly detection. Since annotating videos is time-consuming, we only look at weak video-level labels during training. This means that given a video, we know that it is either normal or contains an anomaly, but no further annotations are used to train the network. Features are extracted from video clips that are either normal or anomalous. These features are used to determine anomaly scores for spatiotemporal regions of the clips based on a classifier and the implementation of a multiple instance ranking loss (MIL). We represent both anomalous and normal video clips as positive and negative bags, respectively, to apply MIL. Furthermore, since anomalies are usually localized to a part of a frame rather than the whole frame, we chose to explore temporal as well as spatial anomaly detection. We show our results on the UCF Crime2Local Dataset, which contains spatiotemporal annotations for a portion of the UCF Crime Dataset.
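
One common form of a multiple instance ranking loss can be sketched as below. The abstract does not give the exact formula, so the hinge form, the margin of 1, and the use of the maximum score per bag are assumptions; the function name is hypothetical.

```python
import numpy as np

def mil_ranking_loss(pos_bag_scores, neg_bag_scores, margin=1.0):
    """Hinge ranking loss between the top-scoring instances of two bags.

    pos_bag_scores: anomaly scores of regions from an anomalous video
                    (positive bag; at least one region is anomalous).
    neg_bag_scores: anomaly scores of regions from a normal video
                    (negative bag; no region is anomalous).
    The loss is zero once the highest positive score exceeds the
    highest negative score by at least `margin`.
    """
    top_pos = float(np.max(pos_bag_scores))
    top_neg = float(np.max(neg_bag_scores))
    return max(0.0, margin - top_pos + top_neg)

loss_good = mil_ranking_loss([0.1, 0.9, 0.2], [0.1, 0.2, 0.05])
loss_bad = mil_ranking_loss([0.2, 0.1], [0.9, 0.3])
```

Only video-level labels are needed: the loss never asks which region of the anomalous video is the anomalous one, only that some region outranks every region of the normal video.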

2605.13744 2026-05-14 cs.CV

Aligning Network Equivariance with Data Symmetry: A Theoretical Framework and Adaptive Approach for Image Restoration

Feiyu Tan, Qi Xie, Zongben Xu, Deyu Meng

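AI summary: Image restoration is an inherently ill-posed inverse problem, and equivariant networks that embed geometric symmetry priors can mitigate this and improve performance. However, the relationship between network equivariance and data symmetry is still understood only heuristically, with no systematic theoretical framework for quantifying symmetry, selecting transformation groups, or evaluating how well a model aligns with the data. Taking an optimization perspective, the paper gives what the authors describe as the first quantifiable definition of non-strict symmetry at the dataset level, uses it as a constraint in the restoration inverse problem, and relates data symmetry, model equivariance, and generalization. It also proposes a sample-adaptive equivariant network that dynamically aligns with each sample's intrinsic symmetry; experiments on super-resolution, denoising, and deraining show significant gains over standard baselines and traditional equivariant models.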

Comments 30 pages, 9 figures, Supplementary Material can be found at https://github.com/tanfy929/SA-Conv

Abstract

Image restoration is an inherently ill-posed inverse problem. Equivariant networks that embed geometric symmetry priors can mitigate this ill-posedness and improve performance. However, current understanding of the relationship between network equivariance and data symmetry remains largely heuristic. Particularly for real-world data with imperfect symmetry, existing research lacks a systematic theoretical framework to quantify symmetry, select transformation groups, or evaluate model-data alignment. To bridge this gap, we conduct an analysis from an optimization perspective and formalize the intrinsic relationship among data symmetry priors, model equivariance, and generalization capability. Specifically, we propose for the first time a quantifiable definition of non-strict symmetry at the dataset level (rather than sample level) and use it as a constraint to formulate the restoration inverse problem. We then show that the equivariance of restoration models can be naturally derived from this inverse problem once the proposed symmetry constraints are incorporated, and that the equivariance error of the optimal restoration operator is strictly bounded by the data symmetry error and the discretization mesh size. Furthermore, by analyzing the network's empirical risk, we demonstrate that aligning equivariance with data symmetry optimizes the bias-variance trade-off, minimizing the total expected risk. Guided by these insights, we propose a Sample-Adaptive Equivariant Network that uses a hypernetwork and transformation-learnable equivariant convolutions to dynamically align with each sample's inherent symmetry. Extensive experiments on super-resolution, denoising, and deraining validate our theoretical findings and show significant superiority over standard baselines and traditional equivariant models. Our code and supplementary material are available at https://github.com/tanfy929/SA-Conv.
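
To make "equivariance error" concrete, the sketch below measures how far an operator f is from commuting with a transformation g, using 90-degree rotation as g. This is a generic textbook-style measure, not the paper's dataset-level symmetry definition, and both example operators are hypothetical.

```python
import numpy as np

def equivariance_error(f, x, g):
    """Relative error || f(g(x)) - g(f(x)) || / || g(f(x)) ||."""
    a, b = f(g(x)), g(f(x))
    return float(np.linalg.norm(a - b) / np.linalg.norm(b))

def rot90(img):
    """Rotate an image by 90 degrees counterclockwise."""
    return np.rot90(img)

def pointwise(img):
    """Pointwise nonlinearity: commutes with any pixel permutation."""
    return np.tanh(img)

def blend_with_left_neighbour(img):
    """Direction-dependent filter: average each pixel with its left neighbour (circular)."""
    return 0.5 * img + 0.5 * np.roll(img, 1, axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
err_pointwise = equivariance_error(pointwise, x, rot90)
err_directional = equivariance_error(blend_with_left_neighbour, x, rot90)
```

The pointwise operator is exactly rotation-equivariant (error 0), whereas the directional filter is not. An equivariant network constrains its layers so that this kind of error stays small for the chosen transformation group.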

2605.13741 2026-05-14 cs.RO cs.CV

LEXI-SG: Monocular 3D Scene Graph Mapping with Room-Guided Feed-Forward Reconstruction

Christina Kassab, Hyeonjae Gil, Matías Mattamala, Ayoung Kim, Maurice Fallon

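AI summary: This paper presents LEXI-SG, described as the first monocular 3D scene graph mapping system that relies only on RGB camera input, providing scalable dense reconstruction of open-vocabulary scenes. It uses the semantic priors of open-vocabulary foundation models to partition the scene into rooms and runs feed-forward reconstruction once each room is fully observed, which avoids sliding-window scale inconsistencies. A room-based factor graph optimization globally aligns the room reconstructions while preserving local map consistency, naturally imposes the semantic scene graph hierarchy, and supports open-vocabulary object segmentation and tracking. Experiments show improved trajectory estimation and dense reconstruction, with competitive open-vocabulary segmentation.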

Abstract

Scene graphs are becoming a standard representation for robot navigation, providing hierarchical geometric and semantic scene understanding. However, most scene graph mapping methods rely on depth cameras or LiDAR sensors. In this work, we present LEXI-SG, the first dense monocular visual mapping system for open-vocabulary 3D scene graphs using only RGB camera input. Our approach exploits the semantic priors of open-vocabulary foundation models to partition the scene into rooms, deferring feed-forward reconstruction to when each room is fully observed, which enables scalable dense mapping without sliding-window scale inconsistencies. We propose a room-based factor graph formulation to globally align room reconstructions while preserving local map consistency and naturally imposing the semantic scene graph hierarchy. Within each room, we further support open-vocabulary object segmentation and tracking. We validate LEXI-SG on indoor scenes from the Habitat-Matterport 3D dataset and on self-collected egocentric office sequences. We evaluate its performance against existing feed-forward SLAM methods, as well as established scene graph baselines. We demonstrate improved trajectory estimation and dense reconstruction, as well as competitive performance in open-vocabulary segmentation. LEXI-SG shows that accurate, scalable, open-vocabulary 3D scene graphs can be achieved from monocular RGB alone. Our project page and office sequences are available here: https://ori-drs.github.io/lexisg-web/.

2605.13740 2026-05-14 cs.LG

Learning POMDP World Models from Observations with Language-Model Priors

Valentin Six, Frederik Panse, Mathis Fajeau, Lancelot Da Costa, Mridul Sharma, Alfonso Amayuelas, Tim Z. Xiao, David Hyland, Philipp Hennig, Bernhard Schölkopf

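AI summary: This study asks whether language-model priors can help learn partially observable Markov decision process (POMDP) world models from observations while reducing reliance on environment interaction. The proposed method, Pinductor, has a language model propose candidate POMDP models from a few observation-action trajectories and refines them iteratively to optimize a belief-based likelihood score. Experiments show that Pinductor is more sample-efficient than tabular POMDP baselines and that performance improves with language-model capability, offering a route to efficient world-model learning under partial observability.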

Abstract

Whether navigating a building, operating a robot, or playing a game, an agent that acts effectively in an environment must first learn an internal model of how that environment works. Partially-observable Markov decision processes (POMDPs) provide a flexible modeling class for such internal world models, but learning them from observation-action trajectories alone is challenging and typically requires extensive environment interaction. We ask whether language-model priors can reduce costly interaction by leveraging prior knowledge, and introduce Pinductor (POMDP-inductor): an LLM proposes candidate POMDP models from a few observation-action trajectories and iteratively refines them to optimize a belief-based likelihood score. Despite using strictly less information, Pinductor matches the performance and sample efficiency of LLM-based POMDP learning methods that assume privileged access to the hidden state, while significantly surpassing the sample efficiency of tabular POMDP baselines. Further results show that performance scales with LLM capability and degrades gracefully as semantic information about the environment is withheld. Together, these results position language-model priors as a practical tool for sample-efficient world-model learning under partial observability, and a step toward generalist agents in real-world environments. Code is available at https://github.com/atomresearch/pinductor.
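
A belief-based likelihood score for a candidate model can be illustrated with the standard belief filter for a tabular POMDP. The abstract does not define the score, so treating it as the log-likelihood of the observed sequence is an assumption, and the array layout (`T[a, s, s']`, `O[s', o]`) is a hypothetical convention chosen for this sketch.

```python
import numpy as np

def belief_log_likelihood(b0, T, O, actions, observations):
    """Log-likelihood of observations given actions under a tabular POMDP.

    b0: initial belief over states, shape (S,)
    T:  transition probabilities, T[a, s, s'] = P(s' | s, a)
    O:  observation probabilities, O[s', o] = P(o | s')
    The hidden state is never used: the belief is predicted forward,
    weighted by the observation likelihood, and renormalized.
    """
    b = np.asarray(b0, dtype=float)
    ll = 0.0
    for a, o in zip(actions, observations):
        predicted = b @ T[a]            # belief after the transition
        joint = predicted * O[:, o]     # P(s', o | history)
        p_obs = joint.sum()             # P(o | history)
        ll += np.log(p_obs)
        b = joint / p_obs               # posterior belief
    return float(ll), b

b0 = np.array([0.5, 0.5])
T = np.eye(2)[None]                          # one action, state never changes
O_sharp = np.array([[0.9, 0.1], [0.2, 0.8]])  # informative sensor model
O_flat = np.array([[0.5, 0.5], [0.5, 0.5]])   # uninformative sensor model
ll_sharp, belief = belief_log_likelihood(b0, T, O_sharp, [0, 0], [0, 0])
ll_flat, _ = belief_log_likelihood(b0, T, O_flat, [0, 0], [0, 0])
```

A score of this kind lets candidate models be ranked from observation-action trajectories alone, which is consistent with the abstract's claim that no privileged access to the hidden state is needed.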

2605.13737 2026-05-14 cs.AI cs.CL

Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs

Trung Nguyen Quang, Yiming Gao, Fanyi Pu, Kaichen Zhang, Shuo Sun, Ziwei Liu

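AI summary: This paper studies the "representation-action gap" that omnimodal large language models exhibit when a question's textual premise contradicts what the model actually perceives. The authors build a benchmark, IMAVB, to evaluate detection of conflicts between perception and textual premises, and find that hidden states encode the contradiction accurately while output behaviour shows either under-rejection or over-rejection. They also propose a probe-guided logit adjustment that improves rejection behaviour, suggesting that the bottleneck for omnimodal models lies in translating information into action rather than in perception.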

Abstract

When an omnimodal large language model accepts a question whose textual premise contradicts what it actually sees or hears, does the failure lie in perception or in action? Recent omnimodal models are positioned as perception-grounded agents that jointly process video, audio, and text, yet a basic form of grounding remains untested: catching a textual claim that conflicts with the model's own sensory input. We introduce IMAVB, a curated 500-clip benchmark of long-form movies with a 2x2 design crossing target modality (vision, audio) and premise condition (standard, misleading), which lets us measure conflict detection separately from ordinary multimodal comprehension. Across eight open-source omnimodal LLMs and Gemini 3.1 Pro, we document a Representation-Action Gap: hidden states reliably encode premise-perception mismatches even when the same models almost never reject the false claim in their outputs. Behaviorally, models fall into two failure modes: under-rejection, in which they answer misleading questions as if the false premise were true; and over-rejection, in which they reject more often but also reject standard questions, sacrificing ordinary comprehension accuracy. The gap is modality-asymmetric (audio grounding underperforms vision) and prompt-resistant across seven variants. As an initial diagnostic intervention, a probe-guided logit adjustment (PGLA) re-injects the encoded mismatch signal into decoding and consistently improves rejection behavior. Together, these results suggest the bottleneck for omnimodal grounding lies in translation, not perception.

2605.13731 2026-05-14 cs.LG cs.HC

Distinguishing performance gains from learning when using generative AI

Lixiang Yan, Samuel Greiff, Jason M. Lodge, Dragan Gašević

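AI summary: This article examines whether the performance gains from using generative artificial intelligence (AI) in education reflect genuine high-quality learning. It notes that although generative AI can boost learners' performance, such use may not promote the deep cognitive and metacognitive processing that learning requires. The article highlights potential cognitive limitations of AI-assisted learning and stresses that educational applications should attend to the depth and quality of learning.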

Journal ref
Nature Reviews Psychology, 4(7), 435-436 (2025)
Abstract

Generative artificial intelligence (AI) is increasingly being integrated into education, where it can boost learners' performance. However, these uses do not promote the deep cognitive and metacognitive processing that is required for high-quality learning.

2605.13730 2026-05-14 cs.LG cs.AI cs.CV

Robust and Explainable Bicuspid Aortic Valve Diagnosis Using Stacked Ensembles on Echocardiography

Christos Chrysanthos Nikolaidis, Vasileios Sachpekidis, Nikolas Moustakidis, Theofilos Moustakidis, Pavlos S. Efraimidis

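AI summary: This study aims to diagnose bicuspid aortic valve (BAV) reliably from echocardiography, addressing diagnostic inconsistency caused by differences in operator expertise and image quality. It proposes an explainable video-ensemble AI model that classifies BAV versus tricuspid aortic valve (TAV) from routinely acquired parasternal long-axis (PLAX) cine loops. On data from 90 patients the model achieves strong classification performance, and Grad-CAM and SHAP values provide interpretable evidence that supports transparent, auditable clinical diagnosis.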

Abstract

Transthoracic echocardiography (TTE) is the first-line imaging modality for diagnosing bicuspid aortic valve (BAV), yet diagnostic performance varies with operator expertise and image quality. We developed an explainable AI model that distinguishes BAV from tricuspid aortic valves (TAV) using routinely acquired parasternal long-axis (PLAX) cine loops. A multi-backbone video ensemble was trained and evaluated using a leakage-aware, stratified outer cross-validation protocol on $N = 90$ patient studies (48 BAV, 42 TAV). Across fixed outer splits and 10 random seeds, the calibrated stacked ensemble achieved an outer-CV F1-score of 0.907 and recall of 0.877. Frame-level Grad-CAM localized salient evidence to the aortic root and leaflet plane, while globally aggregated SHAP values quantified each video backbone's contribution to the stacked prediction, enabling transparent, case-level auditability. These findings indicate that PLAX-based video ensembles can support reliable BAV/TAV classification from routine echocardiographic cine loops and may facilitate earlier detection in non-specialist or resource-limited clinical settings.

2605.13729 2026-05-14 cs.CV cs.AI

Coordinating Multiple Conditions for Trajectory-Controlled Human Motion Generation

Deli Cai, Haoyang Ma, Changxing Ding

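AI summary: This paper addresses generating realistic human motion conditioned on both a text description and a spatial trajectory. Existing methods suffer from conflicts between the conditions and from redundant motion representations, which degrade motion quality or destabilize trajectory control. The authors propose CMC, a decoupled framework that uses a divide-and-conquer strategy with two stages, Trajectory Control and Motion Completion, to ensure accurate trajectory following and then generate the full-body motion. A Selective Inpainting Mechanism mitigates overfitting caused by limited training data, and experiments on multiple datasets show superior control accuracy and motion quality.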

Abstract

Trajectory-controlled human motion generation aims to synthesize realistic human motions conditioned on both textual descriptions and spatial trajectories. However, existing methods suffer from two critical limitations: first, the conflict between text and trajectory conditions disrupts the denoising process, resulting in compromised motion quality or inaccurate trajectory following; second, the use of redundant motion representations introduces inconsistencies between motion components, leading to instability during trajectory control. To address these challenges, we propose CMC, a decoupled framework that coordinates text and trajectory conditions through a divide-and-conquer strategy comprising two cascaded stages: Trajectory Control and Motion Completion. In the first stage, a diffusion model generates a simplified representation of the controlled joints under trajectory guidance, ensuring accurate and stable trajectory following. In the second stage, a text-conditioned diffusion inpainting model generates full-body motions using the simplified representation from the first stage as partial observations. To mitigate overfitting caused by limited inpainting training data, we further introduce the Selective Inpainting Mechanism (SIM), which alternates between text-to-motion generation and motion inpainting tasks during training. Experiments on the HumanML3D and KIT datasets show that CMC achieves state-of-the-art performance in control accuracy and motion quality, demonstrating its effectiveness in coordinating multimodal conditions and representations.

2605.13725 2026-05-14 cs.AI cs.SI

ScioMind: Cognitively Grounded Multi-Agent Social Simulation with Anchoring-Based Belief Dynamics and Dynamic Profiles

Yitian Yang, Yiqun Duan, Linghan Huang, Yiqi Zhu, Francesco Bailo, Chunmeizi Su, Huaming Chen

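AI summary: ScioMind is a cognitively grounded multi-agent social simulation framework intended to make LLM-based studies of social opinion dynamics more realistic. It combines structured opinion evolution with LLM-based agent reasoning, introducing a memory-anchored belief update rule, a hierarchical memory architecture, and corpus-grounded dynamic agent profiles to better simulate how beliefs and behaviour change during social interaction. Experiments show more realistic behaviour in terms of opinion polarisation, diversity, and trajectory stability, suggesting a cognitively grounded design approach for social simulation.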

Abstract

Large language model (LLM)-based multi-agent simulation offers a powerful testbed for studying social opinion dynamics. Yet current approaches often adopt one of two contrasting methods: either relying on fixed update rules with limited cognitive grounding or delegating belief change largely to unconstrained LLM interaction. We introduce ScioMind, a cognitively grounded simulation framework that bridges these paradigms by combining structured opinion dynamics with LLM-based agent reasoning. ScioMind integrates three key components: 1) a memory-anchored belief update rule that modulates susceptibility to influence via personality-conditioned anchoring strength; 2) a hierarchical memory architecture that supports persistent, experience-driven belief formation; and 3) dynamic agent profiles derived from a corpus-grounded retrieval pipeline, enabling heterogeneous personalities, rationales, and evolving internal states. We evaluate ScioMind on multiple case studies in a real-world policy debate scenario. Across metrics including polarisation, diversity, extremization, and trajectory stability, the proposed components consistently yield improvements in behavioural realism. In particular, dynamic profiles increase opinion diversity, memory and reflection reduce unstable oscillation, and anchoring induces persistent belief trajectories that better align with patterns reported in political psychology. These results suggest that our cognitively grounded design provides a novel solution to LLM-based social simulation that improves both stability and behavioural realism.

2605.13724 2026-05-14 cs.CV cs.AI

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han, Han Cai, Mike Zheng Shou

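AI summary: This paper proposes AnyFlow, a flow-map-based distillation framework for any-step video diffusion models, aimed at the performance drop that consistency-distilled models show when more sampling steps are allocated at test time. AnyFlow changes the distillation target from endpoint consistency mapping to flow-map transitions over arbitrary time intervals, optimizes the full ODE sampling trajectory, and introduces Flow Map Backward Simulation to make on-policy distillation efficient and reduce test-time errors. Experiments show that AnyFlow matches or surpasses existing methods in the few-step regime while scaling with the sampling step budget.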

Comments Project page at https://nvlabs.github.io/AnyFlow/

Abstract

Few-step video generation has been significantly advanced by consistency distillation. However, the performance of consistency-distilled models often degrades as more sampling steps are allocated at test time, limiting their effectiveness for any-step video diffusion. This limitation arises because consistency distillation replaces the original probability-flow ODE trajectory with a consistency-sampling trajectory, weakening the desirable test-time scaling behavior of ODE sampling. To address this limitation, we introduce AnyFlow, the first any-step video diffusion distillation framework based on flow maps. Instead of distilling a model for only a few fixed sampling steps, AnyFlow optimizes the full ODE sampling trajectory. To this end, we shift the distillation target from endpoint consistency mapping $(z_{t}\rightarrow z_{0})$ to flow-map transition learning $(z_{t}\rightarrow z_{r})$ over arbitrary time intervals. We further propose Flow Map Backward Simulation, which decomposes a full Euler rollout into shortcut flow-map transitions, enabling efficient on-policy distillation that reduces test-time errors (i.e., discretization error in few-step sampling and exposure bias in causal generation). Extensive experiments across both bidirectional and causal architectures, at scales ranging from 1.3B to 14B parameters, demonstrate that AnyFlow achieves performance that matches or surpasses consistency-based counterparts in the few-step regime, while scaling with sampling step budgets.
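
The difference between a flow map $(z_{t}\rightarrow z_{r})$ and a few-step Euler rollout can be seen on a toy ODE whose flow map is known in closed form. The sketch below is only an illustration of discretization error and of why composing flow-map transitions gives the same endpoint for any step count. The ODE dz/dt = z and all function names are assumptions; nothing here reproduces the AnyFlow model or its training.

```python
import math

def velocity(z, t):
    """Toy ODE dz/dt = z, chosen because its flow map is known exactly."""
    return z

def flow_map(z, t, r):
    """Exact transition z_t -> z_r for the toy ODE."""
    return z * math.exp(r - t)

def euler_rollout(z, t_start, t_end, steps):
    """Plain Euler integration from t_start to t_end."""
    h = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        z = z + h * velocity(z, t)
        t += h
    return z

def flow_map_rollout(z, t_start, t_end, steps):
    """Compose `steps` flow-map transitions over equal sub-intervals."""
    h = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        z = flow_map(z, t, t + h)
        t += h
    return z

z1 = 2.0
exact = flow_map(z1, 1.0, 0.0)
euler_errors = {k: abs(euler_rollout(z1, 1.0, 0.0, k) - exact) for k in (2, 8, 64)}
flow_errors = {k: abs(flow_map_rollout(z1, 1.0, 0.0, k) - exact) for k in (1, 4, 64)}
```

Euler only approaches the exact endpoint as the step count grows, while an accurate flow map lands on it with one step or many. That is the behaviour an any-step sampler needs.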

2605.13717 2026-05-14 cs.LG stat.ML

Tight Sample Complexity Bounds for Entropic Best Policy Identification

Amer Essakine, Claire Vernade

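AI summary: This paper studies best-policy identification in finite-horizon risk-sensitive reinforcement learning under the entropic risk measure. To close the gap in exponential horizon dependence between existing upper and lower bounds on sample complexity, the authors propose a forward-model-based algorithm with KL-based exploration bonuses. By exploiting the smoothness of the exponential utility they sharpen the concentration analysis, remove the extra exponential factor, and obtain a sample complexity that matches the lower bound.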

Abstract

We study best-policy identification for finite-horizon risk-sensitive reinforcement learning under the entropic risk measure. Recent work established a constant gap in the exponential horizon dependence between lower and upper bounds on the number of samples required to identify an approximately optimal policy. Precisely, known lower bounds scale as $\Omega(e^{|\beta| H})$, where $H$ is the horizon of the MDP, while the state-of-the-art upper bound achieves at best $O(e^{2|\beta| H})$ (arXiv:2506.00286v2) using a generative model. We show that this extra exponential factor can be traced to overly loose concentration control for exponential utilities. To close this open gap, we revisit the analysis of this problem through a forward-model-based algorithm building on KL-based exploration bonuses that we adapt to the entropic criterion. The improvement is due to two technical innovations: we leverage the smoothness properties of the exponential utility to derive sharper concentration bounds, and we propose a new stopping rule that further exploits this tightness to obtain a sample complexity that matches the lower bound.
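
For readers unfamiliar with the criterion, the sketch below computes the entropic risk measure in its standard form, $\frac{1}{\beta}\log \mathbb{E}[e^{\beta R}]$, from sampled returns, and evaluates the extra horizon factor $e^{2|\beta|H} / e^{|\beta|H} = e^{|\beta|H}$ that separates the two bounds quoted in the abstract. The function names are hypothetical, and all other terms in the bounds (states, actions, accuracy) are deliberately left out.

```python
import numpy as np

def entropic_risk(returns, beta):
    """Entropic risk (1/beta) * log E[exp(beta * R)] from sampled returns.

    beta < 0 is risk-averse, beta > 0 is risk-seeking, and beta -> 0
    recovers the plain expectation. Uses the log-sum-exp trick for stability.
    """
    returns = np.asarray(returns, dtype=float)
    if beta == 0:
        return float(returns.mean())
    scaled = beta * returns
    m = scaled.max()
    return float((m + np.log(np.mean(np.exp(scaled - m)))) / beta)

def horizon_factor_gap(beta, H):
    """Ratio e^{2|beta|H} / e^{|beta|H} = e^{|beta|H} between the quoted bounds."""
    return float(np.exp(abs(beta) * H))

risk_averse = entropic_risk([0.0, 2.0], beta=-1.0)
risk_seeking = entropic_risk([0.0, 2.0], beta=1.0)
gap = horizon_factor_gap(beta=0.5, H=10)
```

With a mean return of 1, the risk-averse value falls below 1 and the risk-seeking value above it. Even for moderate values such as |β| = 0.5 and H = 10, the gap factor is already about 148.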

2605.13713 2026-05-14 cs.CV eess.IV

Learning to Optimize Radiotherapy Plans via Fluence Maps Diffusion Model Generation and LSTM-based Optimization

Isabella Poles, Simon Arberet, Riqiang Gao, Martin Kraus, Marco D. Santambrogio, Florin C. Ghesu, Ali Kamen, Dorin Comaniciu

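AI summary: This paper proposes an end-to-end optimization method for radiotherapy plan generation based on a diffusion model and an LSTM. A distribution-matching distilled diffusion model generates clinically feasible fluence maps, and an LSTM module learns gradient update dynamics to refine them quickly toward the prescribed dose objectives. Experiments show better planning efficiency, flexibility, and machine deliverability than existing methods.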

Comments Early Accept at MICCAI 2026

Abstract

Volumetric Modulated Arc Therapy (VMAT) is a cornerstone of modern radiation therapy, enabling highly conformal tumor irradiation and healthy-tissue sparing. Yet planning requires solving nested inverse optimization problems for multi-leaf collimators, monitor units, and dose parameters, while enforcing their consistency to ensure mechanical deliverability. Moreover, this process often requires repeated re-optimization when treatment configurations change, resulting in substantial planning time per patient. To address these problems, we present a diffusion-driven Learning-to-Optimize (L2O) method for end-to-end VMAT planning. A distribution-matching distilled diffusion model learns a clinically feasible manifold of fluence maps, enabling their one-shot generation. On top of this, an LSTM-based L2O module learns gradient update dynamics to swiftly refine fluence maps toward prescribed dose objectives during inference. Experimental results on clinical and public prostate cancer cohorts demonstrate improved planning efficiency, flexibility, and machine deliverability over currently available end-to-end VMAT planners.

2605.13711 2026-05-14 cs.LG

MILM: Large Language Models for Multimodal Irregular Time Series with Informative Sampling

Hsing-Huan Chung, Shijun Li, Yoav Wald, Xing Han, Suchi Saria, Joydeep Ghosh

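AI summary: This study proposes MILM, a language model for multimodal irregular time series: asynchronous, irregularly sampled data from heterogeneous numerical and textual channels, such as electronic health records in healthcare. MILM represents the series as time-ordered triplets in XML format and uses a two-stage fine-tuning strategy that first learns from sampling patterns alone and then jointly models sampling patterns and observed values, improving classification. Experiments show the best or second-best results on multiple healthcare datasets and greater robustness when values are unavailable.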

Abstract

Multimodal irregular time series (MITS) consist of asynchronous and irregularly sampled observations from heterogeneous numerical and textual channels. In healthcare, for example, patients' electronic health records (EHR) include irregular lab measurements and clinical notes. The irregular timing and channel patterns of observations carry predictive signal alongside the numerical values and textual content. LLMs are natural candidates for processing such heterogeneous data, given their extensive pretrained knowledge spanning textual and numerical domains. We introduce MILM (Multimodal Irregular time series Language Model), which represents MITS as time-ordered triplets in Extensible Markup Language (XML) format and fine-tunes an LLM through a two-stage strategy for MITS classification. The first stage trains on value-redacted MITS to predict from sampling patterns alone, and the second stage trains on full MITS to jointly model sampling patterns and observed values. Our two-stage model (MILM-2S) and its single-stage counterpart (MILM-Direct) achieve the best and second-best average performance on multiple EHR datasets. Further value redaction evaluations confirm that sampling patterns carry predictive signal and that MILM-2S learns to exploit them. In the value pending evaluation we introduce, where some values are unavailable at prediction time, MILM-2S outperforms MILM-Direct by a larger margin compared to standard evaluation. For MILM-2S, preserving the time and channel of value-pending observations as additional sampling information further improves in-hospital mortality prediction.
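
The representation described above (time-ordered triplets in XML, with an optional value-redacted variant for the first training stage) might look like the sketch below. The abstract does not give the actual schema, so the tag name `obs`, the attribute names, and the `[REDACTED]` placeholder are all hypothetical.

```python
import xml.etree.ElementTree as ET

def mits_to_xml(observations, redact_values=False):
    """Serialize (time, channel, value) triplets as time-ordered XML.

    With redact_values=True the values are hidden but the time and
    channel of every observation are kept, so only the sampling
    pattern remains visible.
    """
    root = ET.Element("series")
    for time, channel, value in sorted(observations, key=lambda obs: obs[0]):
        node = ET.SubElement(root, "obs", time=str(time), channel=channel)
        node.text = "[REDACTED]" if redact_values else str(value)
    return ET.tostring(root, encoding="unicode")

record = [
    (3.5, "note", "patient reports chest pain"),
    (0.0, "heart_rate", 88),
    (1.25, "lactate", 2.1),
]
full = mits_to_xml(record)
redacted = mits_to_xml(record, redact_values=True)
```

Numerical and textual channels share one format, and the redacted string still shows which channel was sampled and when. That timing and channel information is the signal the first training stage is described as learning from.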

2605.13709 2026-05-14 cs.CL cs.AI cs.LG

Children's English Reading Story Generation via Supervised Fine-Tuning of Compact LLMs with Controllable Difficulty and Safety

Qian Shen, Fanghua Cao, Min Yao, Shlok Gilda, Bonnie J. Dorr, Walter L. Leite

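AI summary: This study aims to generate English stories suitable for children to read while controlling difficulty and ensuring safety. Using supervised fine-tuning, it trains three compact 8B-parameter large language models to generate stories that match children's reading levels. Experiments show that appropriately fine-tuned 8B models control difficulty better than larger models used zero-shot, with almost no safety issues, offering a low-cost and efficient way to produce children's reading material in educational settings.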

Comments 15 pages, 4 figures. Author Two and Author Three contributed equally. Accepted by the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026), ACL 2026

Abstract

Large Language Models (LLMs) are widely applied in educational practices, such as for generating children's stories. However, the generated stories are often too difficult for children to read, and the operational cost of LLMs hinders their widespread adoption in educational settings. We used an existing expert-designed children's reading curriculum and its corresponding generated stories from GPT-4o and Llama 3.3 70B to design different experiments for fine-tuning three 8B-parameter LLMs, which then generated new English reading stories that were subjected to quantitative and qualitative evaluation. Our method prioritizes controllability over scale, enabling educators to target reading levels and error patterns with a compact, affordable model. Our evaluation results show that with appropriate fine-tuning designs, children's English reading stories generated by 8B LLMs perform better on difficulty-related metrics than those from zero-shot GPT-4o and Llama 3.3 70B, with almost no discernible safety issues. Such fine-tuned LLMs could be more broadly used by teachers, parents, and children in classrooms and at home to generate engaging English reading stories that reflect children's interests, with controllable difficulty and safety.

2605.13702 2026-05-14 cs.AI

Adaptive mine planning under geological uncertainty: A POMDP framework for sequential decision-making

Hamza Khalifi, Jef Caers, Yassine Taha, Mostafa Benzaazoua, Abdellatif Elghali

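AI summary: This paper proposes a framework based on partially observable Markov decision processes (POMDPs) for adaptive mine planning under geological uncertainty. It updates the belief about geological conditions step by step and adjusts extraction and routing decisions dynamically, replacing the traditional fixed-plan approach. A hybrid architecture combining simulated annealing with an ensemble smoother makes the problem computationally tractable, and in a copper-gold open-pit case study it substantially increases net present value, demonstrating robustness under uncertainty.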

Abstract

Strategic mine production scheduling under geological uncertainty is conventionally formulated as a stochastic optimization problem in which a fixed extraction sequence and routing decisions are computed ex ante. This plan-driven paradigm treats uncertainty as passive: decisions are hedged across geological scenarios, but planning does not anticipate how future observations will inform future decisions. We propose a different perspective by formulating mine scheduling as a Partially Observable Markov Decision Process (POMDP), in which extraction and routing decisions are made sequentially with planning explicitly integrating the expectation of future belief updates. To achieve computational tractability, we introduce a hybrid SA-POMDP architecture that combines simulated annealing-based (SA) value approximation with ensemble-based belief updating via ensemble smoother with multiple data assimilation (ES-MDA). At each decision epoch, candidate actions are evaluated through their expected long-term value under the current belief, and the belief is updated as mining observations are assimilated. This yields an adaptive policy rather than a fixed plan. We evaluate the framework on a copper-gold open-pit mining complex with multiple processing destinations. Under a statistically consistent prior, the SA-POMDP reduces the expectation-reality gap from 22.3% to 4.6%, improving realized NPV by USD8.4M relative to one-shot stochastic optimization. Under systematic prior misspecification of 10%, the adaptive framework outperforms static planning by up to USD44.6M (36.9%), demonstrating structural robustness beyond scenario hedging. These results show that sequential belief updating transforms geological uncertainty from a passive constraint into an active component of value creation.

2605.13695 2026-05-14 cs.CL cs.AI

RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning

Andrea Morandi

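AI summary: This study proposes RTLC, a three-stage prompting paradigm inspired by the Feynman Learning Technique that improves the accuracy of large language models acting as judges without any fine-tuning. Through its Research, Teach-to-Learn, and Critique stages, RTLC guides the model to generate several candidate verdicts, cross-compare them, and output a single critiqued verdict. On the JudgeBench benchmark, RTLC clearly improves judgment accuracy and outperforms both self-consistency majority voting and a zero-shot candidate.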

Abstract

LLM-as-a-judge is now the default measurement instrument for open-ended generation, but on the public JudgeBench benchmark even strong instruction-tuned judges barely scrape past random on objective-correctness pairwise items. We introduce RTLC, a three-stage prompting recipe (Research, Teach-to-Learn, Critique) that promotes a single black-box LLM into an ensemble-of-thought judge with no fine-tuning, retrieval, or external tools. Stage 1 wraps the input in a fixed pedagogical scaffold porting the Feynman Learning Technique (study $\to$ teach $\to$ find gaps $\to$ simplify) into LLM prompting. Stage 2 draws N=10 independent candidate verdicts at temperature 0.4. Stage 3 acts as its own critic, cross-comparing the candidate set against the original question to emit one critiqued verdict at temperature 0. On JudgeBench-GPT (350 hard pairwise items), Claude 3.7 Sonnet's pairwise accuracy climbs from 64.6% (single-shot vanilla prompt) to 78.6% (RTLC critique-of-10), an absolute 14.0-percentage-point gain. RTLC also beats N=10 self-consistency majority voting (77.7%) and a zero-shot first candidate (74.0%). A clean three-step ablation attributes +9.4 pp to the Teach-to-Learn scaffold, +3.7 pp to N=10 marginalisation, and +0.9 pp to explicit critique. We discuss the cost-accuracy frontier (RTLC sits above self-consistency at every working point), the error-budget breakdown across the four JudgeBench categories (knowledge, reasoning, math, coding), and how RTLC composes orthogonally with post-hoc judge-score calibration, with the two interventions compounding multiplicatively in practice.
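
The three stages can be sketched as a small pipeline around any black-box model call. This is a structural outline under stated assumptions, not the author's prompts: the `SCAFFOLD` wording, the `llm(prompt, temperature)` callable, and the `fake_llm` stand-in are all hypothetical. Only N=10 and the temperatures 0.4 and 0 come from the abstract.

```python
from collections import Counter

# Hypothetical scaffold text; the paper's actual prompt is not given in the abstract.
SCAFFOLD = (
    "Study the question, explain it as if teaching a beginner, "
    "note any gaps in your explanation, then simplify.\n\n"
)

def rtlc_judge(llm, question, n=10):
    """Three-stage judging: scaffold, sample n candidates, critique them."""
    prompt = SCAFFOLD + question                                    # Stage 1
    candidates = [llm(prompt, temperature=0.4) for _ in range(n)]   # Stage 2
    critique_prompt = (
        prompt
        + "\n\nCandidate verdicts:\n" + "\n".join(candidates)
        + "\n\nCompare these verdicts against the question and give one final verdict."
    )
    return llm(critique_prompt, temperature=0.0), candidates        # Stage 3

def majority_vote(candidates):
    """Self-consistency baseline: the most frequent candidate verdict."""
    return Counter(candidates).most_common(1)[0][0]

def fake_llm(prompt, temperature):
    """Offline stand-in for a real model call so the sketch runs as written."""
    return "A>B" if temperature == 0.0 else "B>A"

verdict, candidates = rtlc_judge(fake_llm, "Which response is correct, A or B?")
baseline = majority_vote(candidates)

# The ablation figures quoted in the abstract add up to the headline accuracy.
ablation_total = 64.6 + 9.4 + 3.7 + 0.9
```

The critique stage can overrule the majority, which is the difference from plain self-consistency voting. The quoted ablation gains sum to 78.6%, matching the reported RTLC accuracy.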

2605.13692 2026-05-14 cs.LG cs.CC

Polyhedral Instability Governs Regret in Online Learning

Yuetai Li, Fengqing Jiang, Yichen Feng, Kaiyuan Zheng, Luyao Niu, Bhaskar Ramasubramanian, Basel Alomair, Linda Bushnell, Radha Poovendran

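AI summary: This paper studies regret bounds for online decision problems over combinatorial actions, arguing that regret in such problems is governed mainly by the instability of the polyhedral structure, i.e. the number of changes of the active region. The authors propose a regret analysis based on the number of region switches and the number of vertices per region and, under full-information feedback and a fixed-partition assumption, prove an asymptotic regret bound. The result applies to settings such as online convex optimization and online submodular-concave games, and experiments validate the theoretical analysis.

Details
English abstract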

Many online decision problems over combinatorial actions are addressed via convex relaxations, leading to online convex optimization with piecewise linear objectives and induced polyhedral structure. We show that regret in such problems is governed by \emph{polyhedral instability}: the number of changes of the active region. Under full information feedback and fixed partition assumptions, if $\mathrm{RS}_T$ denotes the number of region switches and $V_{\max}$ the maximum number of vertices per region, we prove $\mathrm{Regret}_T = \Theta(\sqrt{(1+\mathrm{RS}_T)\,T\,\log V_{\max}})$, interpolating between experts-like and dimension-dependent OCO rates. For online submodular--concave games under Lovász convexification, this reduces to the permutation-switch count $\mathrm{SC}_T$, yielding the matching rate $\mathrm{Regret}_T = \Theta(\sqrt{(1+\mathrm{SC}_T)\,T\,\log n})$. Experiments on synthetic and real combinatorial problems (shortest path, influence maximization) validate the predicted scaling and indicate that low-instability regimes can arise in practice without explicit enumeration of actions.
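
To make the scaling in the stated rate concrete, the expression inside the bound can be evaluated directly. This is only a sketch: the hidden constant is set to 1 as an assumption, and the function name `regret_rate` is illustrative.

```python
import math


def regret_rate(T: int, region_switches: int, v_max: int) -> float:
    """sqrt((1 + RS_T) * T * log V_max), with the hidden constant assumed to be 1."""
    if T < 1 or region_switches < 0 or v_max < 2:
        raise ValueError("need T >= 1, region_switches >= 0, v_max >= 2")
    return math.sqrt((1 + region_switches) * T * math.log(v_max))
```

With no region switches the expression reduces to sqrt(T log V_max), the experts-like end of the interpolation; quadrupling 1 + RS_T doubles the rate.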

2605.13690 2026-05-14 cs.LG cs.AI

The WidthWall: A Strict Expressivity Hierarchy for Hypergraph Neural Networks

Fengqing Jiang, Yuetai Li, Yichen Feng, Kaiyuan Zheng, Luyao Niu, Bhaskar Ramasubramanian, Basel Alomair, Linda Bushnell, Radha Poovendran

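AI summary: This study examines how well hypergraph neural networks (HGNNs) can express complex higher-order interaction structures, showing that a model's expressive power depends on which local structural patterns it can detect and count. Using homomorphism densities, it establishes a strict expressivity hierarchy indexed by hypertree width and reveals a "Width Wall": once the width of a structural pattern exceeds a threshold, no fixed-depth HGNN can represent it. The work gives a unified theoretical analysis of 15 HGNN architectures and validates on real-world hypergraph datasets that the Width Wall predicts model performance.

Details
English abstract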

Hypergraphs provide a natural framework to model higher-order interactions in scientific, social, and biological systems. Hypergraph neural networks (HGNNs) aim to learn from such data, yet it remains unclear which higher-order structures these models can represent. We show that hypergraph expressivity is governed by which small patterns an architecture can detect and count. We formalize this via homomorphism densities, which measure how often a structural motif appears in a hypergraph. Combining classical homomorphism-count completeness with invariant approximation, we show that homomorphism densities generate all continuous hypergraph invariants and organize them into a strict hierarchy indexed by hypertree width. This yields a Width Wall: a fundamental architectural limit beyond which no hidden dimension, training procedure or fixed-depth HGNN can represent invariants requiring wider patterns. Our framework provides a unified characterization of 15 HGNN architectures, precisely identifies information lost by clique expansion, and motivates density-aware models that extend expressivity beyond bounded-width message passing. We experimentally validate this finding on an APPLICATION NODE CLASSIFICATION SUITE of real-world hypergraphs, where the Width Wall predicts when graph-reduction baselines fail and when density features help.
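
As a small illustration of what a homomorphism density measures, the brute-force sketch below counts vertex maps from a pattern into a host hypergraph that send every pattern hyperedge onto a host hyperedge, and divides by the number of all vertex maps. This follows one common convention and is an assumption; the paper's exact definition and normalization may differ, and the function name is illustrative.

```python
from itertools import product
from typing import Iterable, Sequence


def hom_density(
    pattern_n: int,
    pattern_edges: Iterable[Sequence[int]],
    host_n: int,
    host_edges: Iterable[Sequence[int]],
) -> float:
    """Fraction of maps V(pattern) -> V(host) under which the image of every
    pattern hyperedge is a host hyperedge. Exponential time; small inputs only."""
    host = {frozenset(e) for e in host_edges}
    pattern = [tuple(e) for e in pattern_edges]
    hits = 0
    for phi in product(range(host_n), repeat=pattern_n):
        if all(frozenset(phi[v] for v in e) in host for e in pattern):
            hits += 1
    return hits / host_n ** pattern_n
```

For an ordinary graph, a single-edge pattern in a triangle host gives 6 of 9 maps, i.e. a density of 2/3.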

2605.13688 2026-05-14 cs.CV cs.LG

MedCore: Boundary-Preserving Medical Core Pruning for MedSAM

Cenwei Zhang, Suncheng Xiang, Lei You

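AI summary: MedCore is a structured pruning framework for MedSAM that aims to compress the model substantially while preserving boundary accuracy in medical image segmentation. It prunes efficiently by preserving two kinds of key structures: those that became important during SAM-to-MedSAM adaptation, and those with high boundary leverage. Experiments on several polyp segmentation benchmarks show that MedCore greatly reduces parameters and computation while maintaining high Dice and boundary metrics, supporting its effectiveness and reliability for medical image segmentation.

Comments 3 figures, 17 pages

Details
English abstract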

Medical segmentation foundation models such as SAM and MedSAM provide strong prompt-driven segmentation, but their image encoders are still too large for many clinical settings. Compression is also risky in medicine because a model can keep high Dice while losing boundary fidelity. We propose MedCore, a structured pruning framework for MedSAM. The main idea is to preserve two kinds of structures: structures that became important during SAM-to-MedSAM adaptation, and structures that have high boundary leverage. We identify the first type by a dual-intervention score that compares zeroing a group with resetting it to its original SAM weight. We identify the second type by boundary-aware Fisher estimation. We also introduce a boundary leverage principle, which shows that compression-induced boundary displacement is controlled by logit perturbation on the boundary divided by the logit spatial gradient. This principle explains why boundary metrics can degrade even when Dice remains high. On polyp segmentation benchmarks, MedCore reduces parameters by 60.0% and FLOPs by 58.4% while achieving Dice 0.9549, Boundary F1 0.6388, and HD95 5.14 after recovery fine-tuning. It also reaches 86.6% parameter reduction and 90.4G FLOPs with strong boundary quality. Our analysis further shows that MedSAM lies in a head-fragile boundary regime: head-pruning steps have 2.887 times larger 95th-percentile boundary leverage than MLP-pruning steps, and this logit-level effect is consistent with BF1 and HD95 degradation. Our code is available at https://github.com/cenweizhang/MedCore.
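
The boundary leverage principle stated above (displacement controlled by the logit perturbation on the boundary divided by the logit spatial gradient) can be checked on a one-dimensional toy case. The sketch assumes a linear logit z(x) = slope * (x - offset) whose zero crossing is the boundary; both function names are illustrative and not from the MedCore code.

```python
def predicted_displacement(logit_shift: float, gradient_norm: float) -> float:
    """First-order estimate of boundary movement: |delta z| / |grad z|."""
    if gradient_norm <= 0:
        raise ValueError("gradient norm must be positive")
    return abs(logit_shift) / gradient_norm


def boundary_of_linear_logit(slope: float, offset: float, logit_shift: float = 0.0) -> float:
    """Zero crossing of z(x) = slope * (x - offset) + logit_shift (toy 1-D model)."""
    return offset - logit_shift / slope
```

The same logit shift of 0.5 moves the boundary by 0.25 when the slope is 2 but by 2.5 when the slope is 0.2, which is consistent with the abstract's point that boundary metrics can degrade where the logit is spatially flat even if region overlap barely changes.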

2605.13687 2026-05-14 cs.LG cs.AI stat.ML

A Hierarchical Language Model with Predictable Scaling Laws and Provable Benefits of Reasoning

Jason Gaitonde, Frederic Koehler, Elchanan Mossel, Joonhyung Shin, Allan Sly

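AI summary: This paper proposes a family of synthetic languages with hierarchical structure, generated by a broadcast process on trees, which makes it possible to analyze precisely the role of context length and reasoning in autoregressive generation. It introduces an exact $k$-gram ansatz in place of transformers and validates it empirically. The study finds that, under specific language models, insufficient context length makes generated sequences deviate from the true language distribution, whereas a model with reasoning ability needs only logarithmic memory to sample exactly from the true language, an exponential improvement.

Details
English abstract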

We introduce a family of synthetic languages with hierarchical structure -- generated by a broadcast process on trees -- for which the role of context length and reasoning in autoregressive generation can be analyzed precisely. At the heart of our analytic approach is an \emph{exact $k$-gram ansatz} in place of transformers with context length $k$, a substitution we then validate empirically. Using this ansatz we derive explicit asymptotic predictions for distributional statistics of the sequences produced by a trained model, instantiated in two settings. For the \emph{Ising broadcast process} (a soft-constrained language), we prove that the variance of the generated sum scales log-linearly in the context depth and its kurtosis converges to that of a Gaussian -- both deviating from the true language for any sublinear context. For the \emph{coloring broadcast process} (a hard-constrained language) in the freezing regime, bounded-context autoregression produces sequences that, with high probability, are inconsistent with \emph{any} valid coloring of the underlying tree. Together these results imply an $\Omega(n)$ lower bound on the context length required to faithfully sample length-$n$ sequences. In contrast, we prove that an autoregressive \emph{reasoning} model with only $\Theta(\log n)$ working memory can sample exactly from the true language -- an exponential improvement. We confirm both the lower-bound predictions and the reasoning-based upper bound empirically with transformers trained on the synthetic language; the trained models track our asymptotic predictions quantitatively across a wide range of context sizes.
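
For readers unfamiliar with broadcast processes, the sketch below samples the leaves of an Ising-type broadcast on a complete tree: the root spin is uniform on {-1, +1} and each child copies its parent's spin, flipping it independently with a fixed probability. The binary branching, the flip-probability parametrization, and the function name are assumptions made for illustration; the paper's exact construction may differ.

```python
import random
from typing import List


def ising_broadcast_leaves(
    depth: int, flip_prob: float, rng: random.Random, branching: int = 2
) -> List[int]:
    """Return the left-to-right leaf spins of a depth-`depth` broadcast tree."""
    level = [rng.choice((-1, 1))]  # root spin
    for _ in range(depth):
        next_level = []
        for spin in level:
            for _child in range(branching):
                # Each child flips the parent's spin with probability flip_prob.
                flipped = rng.random() < flip_prob
                next_level.append(-spin if flipped else spin)
        level = next_level
    return level
```

The resulting leaf sequence has length branching ** depth, and nearby leaves share more ancestors than distant ones, which is the source of the hierarchical correlations the abstract refers to.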

2605.13686 2026-05-14 cs.CV cs.AI

Cross Modality Image Translation In Medical Imaging Using Generative Frameworks

Giulia Romoli, Alessia Capoccia, Filippo Ruffini, Francesco Di Feola, Luca Boldrini, Arturo Chiti, Renato Cuocolo, Tugba Akinci D'Antonoli, Fatemeh Darvizeh, Marcello Di Pumpo, Bradley J. Erickson, Liu Fang, Deborah Fazzini, Paola Feraco, Fabrizia Gelardi, Francesco Gossetti, Ana Isabel Hernáiz Ferrer, Michail E. Klontzas, Seyedmehdi Payabvash, Katrine Riklund, Sara N. Strandberg, Valerio Guarrasi, Paolo Soda

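AI summary: This paper studies cross-modality image translation in medical imaging, which generates images of a target modality from a source modality without additional acquisitions. The authors propose a reproducible, standardized evaluation framework and systematically compare seven generative models across multiple clinical tasks and datasets, finding that models based on generative adversarial networks (GANs) outperform latent generative models overall, with SRGAN performing best on several tasks. The experiments also reveal that models struggle to synthesize small lesions and that quantitative metrics diverge from clinical preference, and indicate that synthetic images are close to real ones in clinical discrimination.

Details
English abstract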

Medical image-to-image (I2I) translation enables virtual scanning, i.e. the synthesis of a target imaging modality from a source one without additional acquisitions. Despite growing interest, most proposed methods operate on 2D slices, are evaluated on isolated tasks with different experimental set-ups and lack clinical validation. The primary contribution of this work is a reproducible, standardized comparative evaluation of 3D I2I translation methods in oncological imaging, designed to standardize preprocessing, splitting, inference, and multi-level evaluation across heterogeneous clinical tasks. Within this framework, we compare seven generative models, three Generative Adversarial Networks (GANs: Pix2Pix, CycleGAN, SRGAN) and four latent generative models (Latent Diffusion Model, Latent Diffusion Model+ControlNet, Brownian Bridge, Flow Matching), across eleven datasets spanning three anatomical regions (head/neck, lung, pelvis) and four translation directions (cone-beam CT to CT, MRI to CT, CT to PET, MRI T2-weighted to T2-FLAIR), for a total of 77 experiments under uniform training, inference, and evaluation conditions. The results show that GANs outperform latent generative models across all tasks, with SRGAN achieving statistically significant superiority. Our lesion-level analysis reveals that all models struggle with small lesions and that, in CT to PET synthesis, models reproduce lesion shape more reliably than absolute uptake-related intensity. We also performed a Visual Turing test administered to 17 physicians, including 15 radiologists, which shows near-chance classification accuracy (56.7%), confirming that synthetic volumes are largely indistinguishable from real acquisitions, while exposing a dissociation between quantitative metrics and clinical preference.

2605.13684 2026-05-14 cs.LG cs.IT math.IT

Scale-Sensitive Shattering: Learnability and Evaluability at Optimal Scale

Shashaank Aiyer, Yishay Mansour, Shay Moran, Han Shao, Tom Waknine

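AI summary: This paper studies the conditions under which real-valued function classes exhibit uniform convergence and learnability at the optimal scale. By establishing a scale-sensitive fundamental theorem of PAC learning, the authors prove that uniform convergence, learnability, and finiteness of the fat-shattering dimension are equivalent at specific scales, resolving a long-standing question about the scales governing learnability and improving known upper bounds. The paper also gives sharp metric-entropy bounds in terms of the fat-shattering scale and applies them to estimating integral probability metrics, revealing a dichotomy between estimability and weak evaluability.

Comments 32 pages, 1 figure

Details
English abstract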

We study the optimal scale at which real-valued function classes exhibit uniform convergence and learnability. Our main result establishes a scale-sensitive generalization of the fundamental theorem of PAC learning: for every bounded real-valued class and every $\gamma>0$, uniform convergence at scale $\gamma$, agnostic learnability at scale $\gamma/2$, and finiteness of the fat-shattering dimension at every scale $\gamma'>\gamma$ are equivalent. This resolves a question by Anthony and Bartlett (Cambridge Univ. Press 1999) on the precise scales governing learnability, refuting a conjecture attributed there to Phil Long that a multiplicative 2-factor gap is unavoidable, and improves the upper bounds of Bartlett and Long (JCSS 1998), which incur such a loss. The key technical ingredient is a direct bound on empirical $\ell_\infty$ covering numbers, avoiding the standard detour through packing numbers. As a consequence, we obtain sharp asymptotic metric-entropy bounds in terms of the fat-shattering scale $\gamma$: an $O(\log^2 n)$ bound holds already at scale $\gamma/2$, while an $O(\log n)$ bound holds at scale $2\gamma$. We further show that the $O(\log^2 n)$ bound is sometimes tight. These results resolve open questions by Alon et al. (JACM 1997) and Rudelson and Vershynin (Ann. of Math. 2006). As an application, we establish a sharp dichotomy for bounded integral probability metrics: every such IPM is either estimable or cannot be weakly evaluated within any multiplicative factor $c<3$, while $3$-weak evaluability always holds, resolving an open question from Aiyer et al. (ICML 2026). We also highlight several open questions on quantitative sample complexity and evaluability.

2605.13681 2026-05-14 cs.LG stat.ML

Sampling from Flow Language Models via Marginal-Conditioned Bridges

Iskander Azangulov, Leo Zhang

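AI summary: This paper studies how to sample effectively from Flow Language Models (FLMs) and proposes a sampler based on marginal-conditioned bridges. Unlike conventional methods, at each reverse step it samples a clean one-hot endpoint from the FLM's marginal posterior and then generates the continuous state through an analytic Ornstein-Uhlenbeck bridge, which better preserves the structure of the language model. The method requires no additional training and naturally supports decoding controls such as temperature scaling and nucleus truncation; experiments show that it achieves a better balance between generation quality and diversity.

Details
English abstract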

Flow Language Models (FLMs) are a recently introduced class of language models which adapt continuous flow matching for one-hot encoded token sequences. Their denoisers have a special structure absent from generic continuous diffusion models: each block of the denoising mean is a posterior marginal distribution over the clean token at that position. Standard DDPM-style samplers collapse these marginals to a single conditional-mean endpoint and bridge toward this simplex-valued point, which is generally not a valid one-hot sequence. We argue that the natural sampler for an FLM is instead posterior-predictive. At each reverse step, we sample a clean one-hot endpoint from the factorized posterior defined by the FLM token marginals, and then sample the next continuous state from the analytic Ornstein--Uhlenbeck bridge conditioned on that endpoint. The method is training-free, uses the same model evaluations as standard sampling, and gives a principled interface for token-level decoding controls such as temperature scaling and nucleus truncation. We show that, under exact posterior marginals, the endpoint approximation error is exactly the conditional multi-information among token positions. The induced one-step bridge kernel preserves all token-wise posterior-predictive marginals and loses only the residual cross-position dependence. Finally, we prove a Girsanov path-space comparison showing that the marginal-conditioned bridge has a no-larger denoising-error term than the frozen conditional-mean bridge, with strict improvement whenever intermediate coordinate-wise bridge observations reveal additional information about the clean token. Experiments with FLMs show that the sampler improves the quality--diversity tradeoff. Code is available at: github.com/imbirik/mcb.
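
The endpoint-sampling half of each reverse step can be sketched as below: every position's posterior marginal is temperature-scaled, truncated to its top-p nucleus, and sampled independently to give a valid one-hot sequence. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the marginals are passed in as plain lists rather than produced by an FLM, and the subsequent Ornstein-Uhlenbeck bridge step is omitted because its parametrization is not given in the abstract.

```python
import random
from typing import List, Sequence


def adjust(probs: Sequence[float], temperature: float = 1.0, top_p: float = 1.0) -> List[float]:
    """Temperature-scale a categorical distribution, then renormalize over the
    smallest set of tokens whose cumulative mass reaches top_p."""
    scaled = [p ** (1.0 / temperature) for p in probs]  # same as softmax(logits / T)
    total = sum(scaled)
    scaled = [s / total for s in scaled]
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += scaled[i]
        if mass >= top_p:
            break
    out = [0.0] * len(scaled)
    for i in kept:
        out[i] = scaled[i] / mass
    return out


def sample_one_hot_endpoint(
    marginals: Sequence[Sequence[float]],
    rng: random.Random,
    temperature: float = 1.0,
    top_p: float = 1.0,
) -> List[List[int]]:
    """Sample one token per position from the factorized posterior; return one-hot rows."""
    endpoint = []
    for probs in marginals:
        p = adjust(probs, temperature, top_p)
        token = rng.choices(range(len(p)), weights=p, k=1)[0]
        endpoint.append([1 if j == token else 0 for j in range(len(p))])
    return endpoint
```

Sampling positions independently is exactly the factorization the abstract describes; the cross-position dependence that it discards is what the abstract quantifies as the conditional multi-information.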

2605.13678 2026-05-14 cs.LG

Three-Stage Learning Unlocks Strong Performance in Simple Models for Long-Term Time Series Forecasting

Zhenan Yu, Guangxin Jiang, Jin Yang

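AI summary: This paper proposes STAIR, a three-stage training framework that aims to unlock the potential of simple temporal mapping models for long-term time series forecasting without introducing complex architectural modules. STAIR learns dynamics common across variables through a shared temporal mapping, then fine-tunes per variable to capture variable-specific patterns, and finally brings in cross-variable information through residual learning, gradually increasing model flexibility. Experiments show that STAIR performs strongly on nine long-term forecasting benchmarks, supporting its ability to achieve high performance while keeping the model simple.

Details
English abstract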

Recent studies on long-term time series forecasting have shown that simple linear models and MLP-based predictors can achieve strong performance without increasingly complex architectures. However, many competitive baselines still rely on structural priors such as frequency-domain modeling, explicit decomposition, multi-scale mixing, or sophisticated cross-variable interaction modules, while paying less attention to how simple temporal mappings should be trained and organized. In this paper, we propose STAIR, short for Stagewise Temporal Adaptation via Individualization and Residual Learning, a training paradigm for long-term time series forecasting that aims to unlock the capacity of simple temporal mapping models without introducing complex architectural modules. STAIR decomposes forecasting ability into three progressive stages: it first learns common temporal dynamics across variables through a shared temporal mapping, then adapts the shared model to each variable via channel-wise fine-tuning to capture variable-specific patterns, and finally complements the backbone with cross-variable information through residual learning. We further introduce Shared-to-Individual Fine-tuning and alpha-RevIN to mitigate the limitations of strict channel independence and the overly strong normalization prior induced by standard RevIN. This design gradually increases modeling flexibility while keeping the core temporal predictor as a shallow MLP in the main experiments, with linear variants analyzed separately. Experiments on nine long-term forecasting benchmarks show that STAIR matches or outperforms recent strong baselines while preserving a simple temporal backbone, providing a concise and effective modeling perspective for long-term time series forecasting.

2605.13675 2026-05-14 cs.CV cs.LG q-bio.NC

Characterizing Universal Object Representations Across Vision Models

Florian P. Mahner, Johannes Roth, Ka Chun Lam, Michael F. Bonner, Francisco Pereira, Martin N. Hebart

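AI summary: This study investigates the convergence of visual representations in deep neural networks trained with different architectures, objective functions, and datasets, aiming to reveal which visual properties models actually converge on and which factors drive this convergence. By decomposing the object similarity structure of 162 diverse vision models into a small set of non-negative dimensions and analyzing how often these dimensions reappear across models, the study finds that some dimensions are universal across models, more interpretable, and more strongly driven by semantic image properties. It also shows that models with more universal dimensions better predict primate visual cortex activity and human similarity judgments, suggesting that universality may reflect representational properties relevant to biological vision.

Details
English abstract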

Deep neural networks trained with different architectures, objectives, and datasets have been reported to converge on similar visual representations. However, what remains unknown is which visual properties models actually converge on and which factors may underlie this convergence. To address this, we decompose the object similarity structure of 162 diverse vision models into a small set of non-negative dimensions. To determine universal versus model-specific dimensions, we then estimate how often each dimension reappears across models. In contrast to model-specific dimensions, universal dimensions are more interpretable and more strongly driven by conceptual image properties, indicating the relevance of interpretability and semantic content as implicit factors driving universality across models. Differences in architecture, objective function, training data, model size, and model performance do not explain the emergence of universal dimensions. However, models with more universal dimensions also better predict macaque IT activity and human similarity judgments, suggesting that universality reflects representations relevant to biological vision. These findings have important implications for understanding the emergent representations underlying deep neural network models and their alignment with biological vision.

2605.13673 2026-05-14 cs.LG

Graph Neural Networks with Triangle-Based Messages for the Multicut Problem

Jannik Irmai, Lucas Fabian Naumann, Bjoern Andres

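AI summary: This paper studies graph neural network methods for the multicut problem, a computationally hard combinatorial optimization problem with broad applications in bioinformatics, data mining, and computer vision. The authors propose an adapted graph neural network architecture in which features are assigned only to edges and messages are passed based on triangles in the graph, to better fit the objective function and constraints of the multicut problem. Experiments show that the method outperforms existing heuristic solvers while keeping runtimes feasible, and on some instances it finds optimal solutions in seconds where exact solvers need hours.

Comments 21 pages, 5 figures

Details
English abstract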

The multicut problem is an NP-hard combinatorial optimization problem with diverse applications in fields such as bioinformatics, data mining and computer vision. Graph neural networks have been defined for the multicut problem but can be adapted further to its specific objective function and constraints. In this article, we introduce such an adapted graph neural network architecture in which features are assigned only to edges, and the computation of messages is based on triangles in the underlying graph. Experiments with synthetic and real-world instances with up to 200 nodes show that our method outperforms state-of-the-art heuristic solvers in terms of solution quality while maintaining feasible runtimes. For some instances, our method finds optimal solutions in seconds whereas exact solvers need hours to find and certify optimal solutions.
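
Two ingredients that a triangle-based, edge-only architecture needs can be sketched without any learning: enumerating the triangles of the input graph, and checking the multicut constraint restricted to a triangle (a valid multicut never cuts exactly one edge of a cycle). The message-passing layers themselves are not reproduced here; the function names and the 0/1 edge-label dictionary are assumptions for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Edge = FrozenSet[int]
Triangle = Tuple[int, int, int]


def triangles(edges: Iterable[Tuple[int, int]]) -> List[Triangle]:
    """List every triangle (u, v, w) with u < v < w in an undirected graph."""
    adj: Dict[int, Set[int]] = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    found = []
    for u in sorted(adj):
        higher = sorted(n for n in adj[u] if n > u)
        for v, w in combinations(higher, 2):
            if w in adj[v]:
                found.append((u, v, w))
    return found


def violates_triangle(cut: Dict[Edge, int], tri: Triangle) -> bool:
    """True if exactly one edge of the triangle is labelled cut (1), which no
    valid multicut allows."""
    a, b, c = tri
    labels = cut[frozenset((a, b))] + cut[frozenset((b, c))] + cut[frozenset((a, c))]
    return labels == 1
```

Because each triangle couples exactly three edge variables, it is a natural unit over which edge features can exchange information about this constraint.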

2605.13670 2026-05-14 cs.CV

Pattern-Enhanced RT-DETR for Multi-Class Battery Detection

Xu Zhong, Enyuan Hu

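AI summary: For multi-class battery detection, this paper proposes PaQ-RT-DETR, a pattern-enhanced RT-DETR method that introduces pattern-based dynamic query generation to alleviate query activation imbalance while keeping computational overhead low. The study systematically compares several detection models on a public dataset of approximately 8,591 annotated images; the results show that PaQ-RT-DETR-X outperforms the baselines in overall mAP@50, with notable gains on data-scarce battery categories, and provide practical guidance for selecting object detection models in battery-related industrial applications.

Comments 4 pages, 3 figures

Details
English abstract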

Accurate and efficient battery detection is increasingly important for applications in electronic waste recycling, industrial quality control, and automated sorting systems. In this paper, we present both a comprehensive benchmark and a novel method for multi-class battery detection. We systematically compare three CNN-based detectors (YOLOv8n, YOLOv8s, YOLO11n) and two transformer-based detectors (RT-DETR-L, RT-DETR-X) on a publicly available dataset of approximately 8,591 annotated images under identical experimental conditions, and further propose PaQ-RT-DETR, which introduces pattern-based dynamic query generation into RT-DETR to alleviate query activation imbalance with negligible computational overhead. Among baselines, YOLO11n achieves the best CNN-based accuracy (mAP@50: 0.779) at only 2.6M parameters, while YOLOv8n delivers the fastest inference at ~1,667 FPS. PaQ-RT-DETR-X achieves the highest overall mAP@50 of 0.782, surpassing RT-DETR-X by +2.8% with consistent per-class gains across all six battery categories including the data-scarce Bike Battery class. Our findings provide practical guidance for selecting object detection models in battery-related industrial applications.

2605.13667 2026-05-14 cs.CV

SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models

Vladislav Makarov, Mark Gizetdinov, Dmitry Yudin

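AI summary: SceneGraphVLM is a compact method based on vision-language models for generating structured scene graphs from images and videos. It serializes graphs in the token-efficient TOON format and uses a two-stage training strategy that combines supervised fine-tuning with reinforcement learning to improve relation coverage and precision while avoiding irrelevant objects and relations. For video, the model can condition on the scene graph generated for the previous frame, providing lightweight short-term context without tracking or post-processing. Experiments show that SceneGraphVLM achieves a good balance between quality and generation speed on several datasets and markedly improves the precision of scene graph generation.

Details
English abstract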

Scene graph generation provides a compact structured representation for visual perception, but accurate and fast graph prediction from images and videos remains challenging. Recent VLM-based methods can generate scene graphs end-to-end as structured text, yet often produce long outputs with irrelevant objects and relations. We present SceneGraphVLM, a compact method for image and video scene graph generation with small visual language models. SceneGraphVLM serializes graphs in a token-efficient TOON format and trains the model in two stages: supervised fine-tuning followed by reinforcement learning with hallucination-aware rewards that balance relation coverage and precision while penalizing unsupported objects and relations. For videos, the model can optionally condition each frame on the previously generated graph, providing lightweight short-term context without tracking or post-processing. We evaluate SceneGraphVLM on PSG, PVSG, and Action Genome. With compact VLMs and vLLM-accelerated decoding, SceneGraphVLM achieves a strong quality-speed trade-off, improves precision-oriented SGG metrics while preserving reasonable recall, and generates complete scene graphs with approximately one-second latency. Code and implementation details are available at: https://github.com/markus0440/SceneGraphVLM.git.
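
The quantities that a coverage-versus-precision reward has to trade off can be computed from sets of (subject, predicate, object) triples, as in the sketch below. This is not the paper's reward: the exact-match criterion, the function names, and the notion of an "unsupported" object as one absent from the annotated object set are all assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def triple_precision_recall(
    predicted: Iterable[Triple], reference: Iterable[Triple]
) -> Tuple[float, float]:
    """Exact-match precision and recall over relation triples."""
    pred, ref = set(predicted), set(reference)
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall


def unsupported_objects(predicted: Iterable[Triple], known_objects: Set[str]) -> Set[str]:
    """Objects mentioned in predicted triples but absent from the annotated set."""
    mentioned = {name for s, _, o in predicted for name in (s, o)}
    return mentioned - known_objects
```

Long outputs with irrelevant relations lower precision without raising recall, which is the failure mode the abstract attributes to earlier VLM-based methods.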