arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.14742 2026-05-15 cs.CV cs.RO

EARL: Towards a Unified Analysis-Guided Reinforcement Learning Framework for Egocentric Interaction Reasoning and Pixel Grounding

Yuejiao Su, Xinshen Zhang, Zhen Ye, Lei Yao, Lap-Pui Chau, Yi Wang

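Affiliations: Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University, Hong Kong SAR; Division of Emerging Interdisciplinary Areas (EMIA), The Hong Kong University of Science and Technology, Hong Kong SAR

AI summary: This paper proposes EARL, an egocentric analysis-guided reinforcement learning framework that aims to improve reasoning about human-environment interactions and pixel-level grounding accuracy for assistive robotics and embodied agents. EARL uses a two-stage parsing structure: it first generates a structured textual description, then produces an answer and a pixel mask in response to the user query, with an Analysis-guided Feature Synthesizer integrating the semantic prior. Experiments show that EARL clearly outperforms previous RL-based methods on pixel grounding and transfers well to unseen scenarios.

Comments: Accepted at ICML 2026. Project page: https://github.com/yuggiehk/EARL

Abstract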

Understanding human-environment interactions from egocentric vision is essential for assistive robotics and embodied intelligent agents, yet existing multimodal large language models (MLLMs) still struggle with accurate interaction reasoning and fine-grained pixel grounding. To this end, this paper introduces EARL, an Egocentric Analysis-guided Reinforcement Learning framework that explicitly transfers coarse interaction semantics to query-oriented answering and grounding. Specifically, EARL adopts a two-stage parsing framework including coarse-grained interpretation and fine-grained response. The first stage holistically interprets egocentric interactions and generates a structured textual description. The second stage produces the textual answer and pixel-level mask in response to the user query. To bridge the two stages, we extract a global interaction descriptor as a semantic prior, which is integrated via a novel Analysis-guided Feature Synthesizer (AFS) for query-oriented reasoning. To optimize heterogeneous outputs, including textual answers, bounding boxes, and grounding masks, we design a multi-faceted reward function and train the response stage with GRPO. Experiments on Ego-IRGBench show that EARL achieves 65.48% cIoU for pixel grounding, outperforming previous RL-based methods by 8.37%, while OOD grounding results on EgoHOS indicate strong transferability to unseen egocentric grounding scenarios.
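
The abstract reports cIoU without defining it. As a rough sketch, assuming cIoU means cumulative IoU as commonly used in referring and reasoning segmentation (total intersection over total union across the whole evaluation set, rather than a per-image average), it could be computed as follows. The function name and the flat 0/1 mask representation are illustrative and not taken from the paper.

```python
from typing import Sequence


def cumulative_iou(
    preds: Sequence[Sequence[int]], targets: Sequence[Sequence[int]]
) -> float:
    """Cumulative IoU: summed intersections divided by summed unions.

    Each mask is a flat sequence of 0/1 pixel labels. This representation is
    an assumption made for the sketch; real masks would be 2D arrays.
    """
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        if len(pred) != len(target):
            raise ValueError("prediction and target masks must have the same size")
        total_inter += sum(1 for p, t in zip(pred, target) if p and t)
        total_union += sum(1 for p, t in zip(pred, target) if p or t)
    # Convention chosen here: an empty union counts as a perfect score.
    return total_inter / total_union if total_union else 1.0


preds = [[1, 1, 0, 0], [1, 0, 0, 0]]
targets = [[1, 0, 0, 0], [1, 1, 1, 1]]
print(cumulative_iou(preds, targets))
```

In this toy input the per-image IoUs are 0.5 and 0.25 (mean 0.375), while the cumulative value is 2/6, because masks with more pixels carry more weight.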

2605.14733 2026-05-15 cs.CV

Video-Zero: Self-Evolution Video Understanding

Ruixu Zhang, Deyi Ji, Lanyun Zhu, Xuanyi Liu, Yuxin Meng, Ruihang Chu, Yujiu Yang

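Affiliations: Tsinghua University; Tencent; Tongji University; Peking University

AI summary: This paper proposes Video-Zero, a self-evolving video understanding framework that aims to improve the reasoning ability of video models without human annotation. A Questioner-Solver co-evolution system centers on temporally localized evidence in the video: it generates evidence-grounded questions and aligns the Solver's predictions with the supporting evidence, yielding better-grounded supervision. Experiments show that Video-Zero consistently improves multiple video VLM backbones across 13 benchmarks, supporting its effectiveness and transferability.

Abstract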

Self-evolution offers a promising path for improving reasoning models without relying on intensive human annotation. However, extending this paradigm to video understanding remains underexplored and challenging: videos are long, dynamic, and redundant, while the evidence needed for reasoning is often sparse and temporally localized. Naively generating difficult question-answer pairs from full videos can therefore produce supervision that appears challenging but is weakly grounded, relying on static cues or language priors rather than temporal evidence. In this work, we argue that the key bottleneck of video self-evolution is not difficulty alone, but grounding. We propose Video-Zero, an annotation-free Questioner-Solver co-evolution framework that centers self-evolution on temporally localized evidence. The Questioner discovers informative evidence segments and generates evidence-grounded questions, while the Solver learns to answer and align its predictions with the supporting evidence. This closes an iterative loop of evidence discovery, grounded supervision, and evidence-aligned learning. Across 13 benchmarks spanning temporal grounding, long-video understanding, and video reasoning, Video-Zero consistently improves multiple video VLM backbones, demonstrating the effectiveness and transferability of evidence-centered self-evolution.

2605.14727 2026-05-15 cs.CV

CHASM: Cross-frequency Harmonized Axis-Separable Mixing for Spectral Token Operators

Pengcheng Fang, Hongli Chen, Yuxia Chen, Tengjiao Sun, Jiaxin Liu, Xiaohao Cai

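Affiliations: University of Southampton; University of Queensland; Chengdu University of Technology; University of Illinois Urbana-Champaign

AI summary: This paper proposes CHASM, a Cross-frequency Harmonized Axis-Separable Mixer that improves Fourier-based spectral token operators. All frequencies share a learned channel eigenbasis while each frequency keeps its own positive spectral gains, combining cross-frequency alignment of channel directions with local spectral adaptivity. CHASM improves over same-backbone spectral-mixer baselines on accelerated MRI reconstruction, undersampled MRI segmentation, and natural-image reconstruction, and ablations support cross-frequency harmonization as a useful inductive bias for spectral operators.

Abstract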

Spectral token mixers based on Fourier transforms provide an efficient way to model global interactions in visual feature maps. Existing designs often either apply filter-wise spectral responses along fixed channel axes, or learn adaptive frequency-indexed channel mixing without explicitly aligning the channel directions used across frequencies. We propose CHASM, a Cross-frequency Harmonized Axis-Separable Mixer, as a structured middle ground. CHASM separates what should be shared from what should remain frequency-specific: all frequencies share a learned channel eigenbasis, while each frequency retains its own positive spectral gains. The shared basis makes channel directions comparable across the spectrum, whereas the positive gains preserve local spectral adaptivity. CHASM applies this structured operator separably along the height and width axes and is used as a drop-in replacement mixer inside existing backbones. We provide a structural characterization of the shared-basis operator family and evaluate CHASM through controlled same-backbone comparisons. Across accelerated MRI reconstruction, undersampled MRI segmentation, and natural-image reconstruction, CHASM consistently improves over same-backbone spectral-mixer baselines. Ablations show that removing the shared-basis constraint weakens performance, and randomizing coherent sampling geometry substantially reduces the gain, supporting cross-frequency harmonization as a useful inductive bias for spectral token operators.

2605.14723 2026-05-15 cs.AI cs.CL cs.LG

Agentifying Patient Dynamics within LLMs through Interacting with Clinical World Model

Minghao Wu, Yuting Yan, Zhenyang Cai, Ke Ji, Chuangsen Fang, Ziying Sheng, Xidong Wang, Rongsheng Wang, Hejia Zhang, Shuang Li, Benyou Wang, Hongyuan Zha

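Affiliations: The Chinese University of Hong Kong, Shenzhen

AI summary: This paper proposes SepsisAgent, an LLM agent for sepsis treatment decisions in intensive care. It uses a Clinical World Model to simulate patient responses to candidate treatments and follows a propose-simulate-refine workflow before committing to a decision. SepsisAgent outperforms traditional RL and LLM-based baselines and achieves the best results on guideline-adherence and safety metrics. The core contribution is that repeated interaction with the Clinical World Model lets the agent learn regularities in patient physiology and make more reliable decisions.

Abstract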

Sepsis management in the ICU requires sequential treatment decisions under rapidly evolving patient physiology. Although large language models (LLMs) encode broad clinical knowledge and can reason over guidelines, they are not inherently grounded in action-conditioned patient dynamics. We introduce SepsisAgent, a world model-augmented LLM agent for sepsis treatment recommendation. SepsisAgent uses a learned Clinical World Model to simulate patient responses under candidate fluid-vasopressor interventions, and follows a propose-simulate-refine workflow before committing to a prescription. We first show that world-model access alone yields inconsistent LLM decision performance, motivating agent-specific training. We then train SepsisAgent through a three-stage curriculum: patient-dynamics supervised fine-tuning, propose-simulate-refine behavior cloning, and world-model-based agentic reinforcement learning. On MIMIC-IV sepsis trajectories, SepsisAgent outperforms all traditional RL and LLM-based baselines in off-policy value while achieving the best safety profile under guideline adherence and unsafe-action metrics. Further analysis shows that repeated interaction with the Clinical World Model enables the agent to learn regularities in patient evolution, which remain useful even when simulator access is removed.

2605.14721 2026-05-15 cs.AI

On Strong Equivalence Notions in Logic Programming and Abstract Argumentation

Giovanni Buraglio, Wolfgang Dvorak, Stefan Woltran

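Affiliations: TU Wien, Austria

AI summary: This paper studies how strong equivalence differs between logic programming and abstract argumentation. In dynamic settings the two formalisms use different notions of update, so strong equivalence does not carry over directly from one to the other. The authors propose a new notion of strong equivalence for logic programs that is preserved under translation between certain classes of logic programs and both Dung-style and claim-augmented argumentation frameworks, restoring compatibility across the formalisms.

Abstract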

Strong equivalence between knowledge bases ensures the possibility of replacing one with the other without affecting reasoning outcomes, in any given context. This makes it a crucial property in nonmonotonic formalisms. In particular, the fields of logic programming and abstract argumentation provide primary examples in which this property has been subject to vast investigations. However, while (classes of) logic programs and abstract argumentation frameworks are known to be semantically equivalent in static settings, this alignment breaks in dynamic contexts due to differing notions of update. As a result, strong equivalence does not always carry over from one formalism to the other. In this paper, we carefully investigate this discrepancy and introduce a new notion of strong equivalence for logic programs. Our approach preserves strong equivalence under translation between certain classes of logic programs and both Dung-style and claim-augmented argumentation frameworks, thus restoring compatibility across these formalisms.

2605.14717 2026-05-15 cs.CV cs.AI

Towards Label-Free Single-Cell Phenotyping Using Multi-Task Learning

Saqib Nazir, Ardhendu Behera

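Affiliations: Department of Computer Science, Edge Hill University, UK

AI summary: This study addresses the difficulty of inferring molecular phenotypes directly from bright-field images in label-free single-cell imaging. It proposes a multi-task deep learning framework that jointly performs white blood cell classification and regression of protein expression levels. The model uses a hybrid CNN-Transformer architecture that fuses local texture features with global representations through a learnable cross-branch gating module, enabling robust joint morpho-molecular inference from Differential Phase Contrast images. Experiments on the BSCCM and Blood Cells Image benchmarks show strong performance, pointing to a route to low-cost hematological analysis without fluorescent staining.

Comments: Accepted at the 28th International Conference on Pattern Recognition (ICPR) 2026

Abstract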

Label-free single-cell imaging offers a scalable, non-invasive alternative to fluorescence-based cytometry, yet inferring molecular phenotypes directly from bright-field morphology remains challenging. We present a unified Deep Learning (DL) framework that jointly performs White Blood Cell (WBC) classification and continuous protein-expression regression from label-free Differential Phase Contrast (DPC) images. Our model employs a hybrid architecture that fuses convolutional fine-grained texture features with transformer-based global representations through a learnable cross-branch gating module, enabling robust morpho-molecular inference from DPC images. To support downstream interpretability, we further incorporate a Large Language Model (LLM) that generates concise, biologically grounded summaries of the predicted cell states. Experiments on the Berkeley Single Cell Computational Microscopy (BSCCM) and Blood Cells Image benchmarks demonstrate strong performance, achieving 91.3% WBC classification accuracy and a 0.72 Pearson correlation for CD16 expression regression on BSCCM. These results underscore the promise of label-free single-cell imaging for cost-effective hematological profiling, enabling simultaneous phenotype identification and quantitative biomarker estimation without fluorescent staining. The source code is available at https://github.com/saqibnaziir/Single-Cell-Phenotyping.

2605.14712 2026-05-15 cs.RO cs.AI cs.CL cs.CV

IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation

Shijie Lian, Bin Yu, Xiaopeng Lin, Zhaolong Shen, Laurence Tianruo Yang, Yurun Jin, Haishan Liu, Changti Wu, Hang Yuan, Cong Huang, Kai Chen

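Affiliations: HUST; ZGCA; ZGCI; HIT; HKUST(GZ); BUAA; ZZU; ECNU; USTC; DeepCybo

AI summary: This study targets action conflicts in robot imitation learning caused by differing short-horizon intents. It proposes IntentVLA, a history-conditioned vision-language-action (VLA) framework that encodes recent visual observations into a compact short-horizon intent representation used to guide action generation. The study also builds AliasBench, a benchmark for evaluating policies under short-horizon observation ambiguity. Experiments show that IntentVLA improves execution stability across multiple benchmarks and outperforms existing VLA methods.

Comments: Code is available at https://github.com/ZGC-EmbodyAI/IntentVLA

Abstract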

Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We introduce IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12-task ambiguity-aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short-horizon observation aliasing. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines.

2605.14710 2026-05-15 cs.CV cs.AI

Vision-Core Guided Contrastive Learning for Balanced Multi-modal Prognosis Prediction of Stroke

Liren Chen, Lidong Sun, Mingyan Huang, Junzhe Tang, Yinghui Zhu, Guanjie Wang, Yiqing Xia, Ting Xiao

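Affiliations: School of Information Science and Engineering, East China University of Science and Technology

AI summary: This study addresses insufficient multi-modal fusion in ischemic stroke prognosis prediction and proposes a tri-modal fusion model that integrates medical images, structured clinical data, and unstructured text. A large language model automatically generates semi-structured diagnostic text, easing the scarcity of expert annotations, and a vision-conditioned alignment fusion module enables deep cross-modal interaction and mitigates modal heterogeneity. Experiments on real-world clinical data show state-of-the-art predictive performance.

Comments: Corresponding author: Ting Xiao

Abstract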

Deep learning and multi-modal fusion have demonstrated transformative potential in medical diagnosis by integrating diverse data sources. However, accurate prognosis for ischemic stroke remains challenging due to limitations in existing multi-modal approaches. First, current methods are predominantly confined to dual-modal fusion, lacking a framework that effectively integrates the trifecta of medical images, structured clinical data, and unstructured text. Second, they often fail to establish deep bidirectional interactions between modalities. To address these critical gaps, this paper proposes a novel tri-modal fusion model for ischemic stroke prognosis. Our approach first enriches the data representation by employing a Large Language Model (LLM) to automatically generate semi-structured diagnostic text from brain MRIs. This process not only addresses the scarcity of expert annotations but also serves as a regularized semantic enhancement, improving multimodal fusion robustness. Furthermore, we design a core component termed the Vision-Conditioned Dual Alignment Fusion Module (VDAFM), which strategically uses visual features as a conditional prior to guide fine-grained interaction with the generated text. This module achieves a dynamic and profound fusion through a dual semantic alignment loss, effectively mitigating modal heterogeneity. Extensive experiments on a real-world clinical dataset demonstrate that our model achieves state-of-the-art performance.

2605.14708 2026-05-15 cs.CV

StyleTextGen: Style-Conditioned Multilingual Scene Text Generation

Zeyu Chen, Fangmin Zhao, Yan Shu, Yichao Liu, Liu Yu, Yu Zhou

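Affiliations: Nankai University; University of Trento; Institute of Information Engineering, Chinese Academy of Sciences

AI summary: StyleTextGen is a style-conditioned framework for multilingual scene text generation. It addresses the challenges of extracting text styles accurately from complex backgrounds and keeping fine-grained style consistent across characters. The method introduces a dual-branch style encoder, a text style consistency loss, and a mask-guided inference strategy, improving the perception and replication of multilingual text styles. The study also builds StyleText-CE, a bilingual scene text style benchmark, and reports state-of-the-art results in style consistency and cross-lingual generalization.

Comments: This paper has been accepted to CVPR 2026

Abstract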

Style-conditioned scene text generation faces unique challenges in extracting precise text styles from complex backgrounds and maintaining fine-grained style consistency across characters, especially for multilingual scripts. We propose StyleTextGen, a novel framework that learns to perceive and replicate visual text styles across different languages and writing systems. Our approach features three key contributions: First, we introduce a dual-branch style encoder dedicated to style modeling, yielding robust multilingual text style representations in complex real-world scenes. Second, we design a text style consistency loss that enhances style coherence and improves overall visual quality. Third, we develop a mask-guided inference strategy that ensures precise style alignment between generated and reference text. To facilitate systematic evaluation, we construct StyleText-CE, a bilingual scene text style benchmark covering both monolingual and cross-lingual settings. Extensive experiments demonstrate that StyleTextGen significantly outperforms existing methods in style consistency and cross-lingual generalization, establishing new state-of-the-art performance in multilingual style-conditioned text generation.

2605.14705 2026-05-15 cs.CV

Towards Continuous Sign Language Conversation from Isolated Signs

Youngmin Kim, Kyobin Choo, Jiwoo Park, Minseo Kim, Chanyoung Kim, Junhyeok Kim, Seong Jae Hwang

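Affiliations: Yonsei University; LG Electronics; Emory University

AI summary: This study aims to model sign language conversation directly, to better support Deaf and Hard-of-Hearing signers who communicate in sign language. Because existing sign language datasets have limited vocabulary coverage and generalize weakly, the authors build SignaVox-W, a large-scale dataset of isolated signs, and use it to construct SignaVox-U, a continuous sign conversation dataset. A retrieval-guided spoken-to-gloss translator and BRAID, a diffusion Transformer, turn isolated clips into continuous conversations. The resulting data are used to train SignaVox, a direct sign-to-sign conversational model that needs no spoken-language text at inference time, with improved motion quality and semantic alignment.

Abstract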

Sign language is the primary language for many Deaf and Hard-of-Hearing (DHH) signers, yet most conversational AI systems still mediate interaction through spoken or written language. This spoken-language-centered interface can limit access for signers for whom spoken or written language is not the most accessible medium, motivating direct sign-to-sign conversational modeling. However, sentence-level sign video data are expensive to collect and annotate, leaving existing sign translation and production models with limited vocabulary coverage and weak open-domain generalization. We address this bottleneck by constructing continuous sign conversations from isolated signs: large-scale labeled isolated clips are collected as lexically grounded motion primitives and recomposed into sign-language-ordered utterances derived from existing dialogue corpora. We introduce SignaVox-W, which provides, to our knowledge, the largest labeled isolated-sign vocabulary to date, and SignaVox-U, a continuous 3D sign conversation dataset built from SignaVox-W. To bridge the structural mismatch between spoken and signed languages, we use a retrieval-guided spoken-to-gloss translator; to bridge independently collected isolated clips, we propose BRAID, a diffusion Transformer that performs duration alignment and co-articulatory boundary inpainting. With the resulting data, we train SignaVox, a direct sign-to-sign conversational model that generates 3D body, hand, and facial motion responses from prior signing context without spoken-language text or externally provided glosses at inference time. Quantitative and qualitative evaluations show improved isolated-to-continuous motion quality, stronger response-level semantic alignment, and scalable signer-centered interaction that better supports visual-spatial articulation.

2605.14704 2026-05-15 cs.CV cs.AI cs.RO

SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization

Posheng Chen, Powen Cheng, Gueter Josmy Faure, Hung-Ting Su, Winston H. Hsu

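Affiliations: National Taiwan University; Delta Robotics Innovation Center

AI summary: In real-world scenes, target objects may lie in regions that are not visible, and current vision-language models (VLMs) still struggle to reason about where such objects are. The study proposes SceneFunRI, a benchmark built on the SceneFun3D dataset that poses a 2D spatial reasoning task with 855 instances, requiring models to locate invisible functional objects from task instructions and commonsense reasoning. Experiments show that even the strongest baseline performs poorly, exposing the weakness of current models at invisible-region reasoning and motivating models that more tightly integrate task intent, commonsense priors, spatial grounding, and uncertainty-aware search.

Abstract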

In real-world scenes, target objects may reside in regions that are not visible. While humans can often infer the locations of occluded objects from context and commonsense knowledge, this capability remains a major challenge for vision-language models (VLMs). To address this gap, we introduce SceneFunRI, a benchmark for Reasoning the Invisible. Based on the SceneFun3D dataset, SceneFunRI formulates the task as a 2D spatial reasoning problem via a semi-automatic pipeline and comprises 855 instances. It requires models to infer the locations of invisible functional objects from task instructions and commonsense reasoning. The strongest baseline model (Gemini 3 Flash) achieves only a CAcc@75 of 15.20, an mIoU of 0.74, and a Dist of 28.65. We group our prompting analysis into three categories: Strong Instruction Prompting, Reasoning-based Prompting, and Spatial Process of Elimination (SPoE). These findings indicate that invisible-region reasoning remains an unstable capability in current VLMs, motivating future work on models that more tightly integrate task intent, commonsense priors, spatial grounding, and uncertainty-aware search.

2605.14703 2026-05-15 cs.CV

Generating HDR Video from SDR Video

SaiKiran Tedla, Francesco Banterle, Trevor Canham, Karanpreet Raja, David B. Lindell, Kiriakos N. Kutulakos, Jiacheng Li, Feiran Li, Daisuke Iso

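Affiliations: Sony Research; York University; Vector Institute; University of Toronto

AI summary: This paper studies how to generate high dynamic range (HDR) video from standard dynamic range (SDR) video using large-scale generative video models. It introduces a Multi-Exposure Video Model (MEVM), which predicts an exposure-bracketed SDR sequence from a single nonlinear SDR video, and a learnable Video Merging Model (VMM), which merges that sequence into a high-quality HDR video while preserving detail in shadows and highlights. Experiments show robust HDR conversion for in-the-wild consumer videos and iconic films, and the method can be combined with existing SDR generative video models to build HDR synthesis pipelines.

Abstract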

The high dynamic range (HDR) video ecosystem is approaching maturity, but the problem of upconverting legacy standard dynamic range (SDR) videos persists without a convincing solution. We propose a framework for HDR video synthesis from casual SDR footage by leveraging large-scale generative video models. We introduce a Multi-Exposure Video Model (MEVM) that can predict exposure-bracketed linear SDR video sequences from a single nonlinear SDR video input. We further propose a learnable Video Merging Model (VMM) that merges the predicted exposure-bracketed video into a high-quality HDR sequence while preserving detail in both shadows and highlights. Extensive experiments, quantitative and qualitative evaluation, and a user study demonstrate that our approach enables robust HDR conversion for in-the-wild examples from casual consumer videos and even iconic films. Finally, our model can support HDR synthesis pipelines built upon existing SDR generative video models. Output HDR videos can be viewed on our supplementary webpage: sdr2hdrvideo.github.io

2605.14700 2026-05-15 cs.RO

SR-Platform: An Agentic Pipeline for Natural Language-Driven Robot Simulation Environment Synthesis

Ben Wei Lim, Minh Duc Le, Thang Truong, Thanh Nguyen Canh

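Affiliations: Strike Robotics

AI summary: SR-Platform is an agentic system that generates MuJoCo simulation environments for robot learning from natural language instructions. It decomposes scene synthesis into four stages (intent parsing, 3D asset generation, layout planning, and scene assembly), lowering the technical barrier to building training environments. Production telemetry shows that SR-Platform can generate executable MuJoCo scenes in under one minute.

Abstract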

Generating robot simulation environments remains a major bottleneck in simulation-based robot learning. Constructing a training-ready MuJoCo scene typically requires expertise in 3D asset modeling, MJCF specification, spatial layout, collision avoidance, and robot-model integration. We present SR-Platform, a production-deployed agentic system that converts free-form natural language descriptions into executable, physically valid MuJoCo environments. SR-Platform decomposes scene synthesis into four stages: an LLM-based orchestrator that converts user intent into a structured scene plan; an asset forge that retrieves cached assets or generates new 3D geometry through LLM-to-CadQuery synthesis; a layout architect that assigns object poses and verifies industrial constraints; and a bridge layer that assembles the final MJCF scene and merges the selected robot model. The system is deployed as a nine-service Docker stack with WebSocket progress streaming, MinIO-backed mesh storage, Qdrant-based semantic asset retrieval, Redis job state, and InfluxDB telemetry. Using 30 days of production telemetry covering 611 successful LLM calls, SR-Platform generates five-object scenes with a median end-to-end latency of approximately 50 s, while cache-accelerated scenes complete in approximately 30-40 s. The asset forge shows an 11.3% first-attempt retry rate with automatic recovery, and cached asset retrieval removes per-object LLM calls for previously generated object types. These results show that agentic scene synthesis can reduce the manual effort required to create diverse robot training environments, enabling users to produce executable MuJoCo scenes from plain English prompts in under one minute.

2605.14698 2026-05-15 cs.LG cs.AI

NeuroAtlas: Benchmarking Foundation Models for Clinical EEG and Brain-Computer Interfaces

Konstantinos Kontras, Trui Osselaer, Stylianos G. Mouslech, Angeliki-Ilektra Karaiskou, Guido Gagliardi, Thomas Strypsteen, Mohammad Hossein Badiei, Anku Rani, Maarten Vanmarcke, Miguel Bhagubai, Chanakya Ekbote, Jaedong Hwang, Christos Chatzichristos, Paul Pu Liang, Maarten De Vos

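Affiliations: KU Leuven; MIT

AI summary: This paper introduces NeuroAtlas, described as the largest EEG benchmark to date, with 42 datasets and 260k hours of EEG covering epilepsy, sleep medicine, brain age estimation, and brain-computer interfaces, along with dedicated clinical evaluation metrics. It compares EEG-specific pretrained models with generic time-series models and finds that the latter perform comparably or better on some tasks. It also argues that standard machine learning metrics do not capture clinical utility, proposes evaluation measures closer to practical use, and shows that current pretrained models remain far from a unified EEG model.

Abstract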

Foundation models (FMs) promise to extract unified representations that generalize across downstream tasks. They have emerged across fields, including electroencephalography (EEG), but it is less clear how effective they are in this particular field. Published evaluations differ in datasets, in the EEG-specific preprocessing that might influence reported results, and in the reported metrics, frequently obscuring the clinical relevance in EEG. We introduce NeuroAtlas, the largest EEG benchmark to date: 42 datasets and 260k hours covering clinical EEG (epilepsy, sleep medicine, brain age estimation) and brain-computer interfaces, with multiple datasets per task and bespoke clinical evaluation metrics. Besides evaluating EEG-FMs with respect to supervised baselines, we present results from generic time-series FMs. We report three findings. First, EEG-specific FMs do not consistently outperform time-series FMs, which neither have EEG-focused architectures nor have been pretrained on EEG. Second, standard machine learning metrics are insufficient to assess clinical utility; we therefore thoroughly evaluate more appropriate measures such as the quality of event-level decision-making, hypnogram-derived features, and the brain-age gap in the domains of epilepsy, sleep, and brain age, respectively. Third, model rankings and performance can vary substantially within domains. We conclude that pretrained models perform largely on par, with only narrow advantages for a few, and that current models do not yet deliver on the promise of an out-of-the-box unified EEG model. NeuroAtlas exposes this gap and provides the datasets and metrics for the next generation of unified EEG FMs.

2605.14696 2026-05-15 cs.CV

EponaV2: Driving World Model with Comprehensive Future Reasoning

Jiawei Xu, Zhizhou Zhong, Zhijian Shu, Mingkai Jia, Mingxiao Li, Jia-Wang Bian, Qian Zhang, Kaicheng Zhang, Jin Xie, Jian Yang, Wei Yin

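Affiliations: PCA Lab, VCIP, College of Computer Science, Nankai University; Horizon Robotics; HKUST; NJUPT; NTU; Anyverse; School of Intelligence Science and Technology, Nanjing University

AI summary: This paper proposes EponaV2, a driving world model that targets the reliance of existing autonomous driving systems on large amounts of manual annotation for trajectory planning. Through comprehensive future reasoning, the model forecasts future geometry and semantics, improving its understanding of the environment and its planning ability. Inspired by the training recipe of large language models, EponaV2 also introduces a flow matching group relative policy optimization mechanism that further improves planning accuracy, outperforming existing perception-free models on three NAVSIM benchmarks.

Abstract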

Data scaling plays a pivotal role in the pursuit of general intelligence. However, the prevailing perception-planning paradigm in autonomous driving relies heavily on expensive manual annotations to supervise trajectory planning, which severely limits its scalability. Conversely, although existing perception-free driving world models achieve impressive driving performance, their real-world reasoning ability for planning is built solely on next-frame image forecasting. Due to insufficient supervision, these models often struggle with comprehensive scene understanding, resulting in unsatisfactory trajectory planning. In this paper, we propose EponaV2, a novel paradigm of driving world models, which achieves high-quality planning with comprehensive future reasoning. Inspired by how human drivers anticipate 3D geometry and semantics, we train our model to forecast more comprehensive future representations, which can be additionally decoded to future geometry and semantic maps. Extracting the 3D and semantic modalities enables our model to deeply understand the surrounding environment, and the future prediction task significantly enhances the real-world reasoning capabilities of EponaV2, ultimately leading to improved trajectory planning. Moreover, inspired by the training recipe of Large Language Models (LLMs), we introduce a flow matching group relative policy optimization mechanism to further improve planning accuracy. The state-of-the-art (SOTA) performance of EponaV2 among perception-free models on three NAVSIM benchmarks (+1.3 PDMS, +5.5 EPDMS) demonstrates the effectiveness of our methods.
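
The abstract mentions a flow matching group relative policy optimization mechanism without giving details. The sketch below shows only the group-relative advantage step that GRPO-style methods are generally built on: several rollouts sampled for the same input are scored, and each reward is standardized against its own group. How EponaV2 combines this with flow matching is not described in the abstract, and the function name, reward values, and `eps` constant here are placeholders.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of rollouts for the same input.

    Uses the population standard deviation; implementations differ on this
    detail, so treat it as an assumption of the sketch.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical planning rollouts for one scene, scored by some reward.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Rollouts that score above their group mean receive positive advantages and the rest negative ones, so no separate learned value function is needed as a baseline.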

2605.14694 2026-05-15 cs.LG

The Rate-Distortion-Polysemanticity Tradeoff in SAEs

Tommaso Mencattini, Francesco Montagna, Francesco Locatello

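Affiliations: EPFL; Institute of Science and Technology Austria

AI summary: This paper studies the tradeoff in sparse autoencoders (SAEs) among reconstruction accuracy (minimizing distortion), coding efficiency (minimizing rate), and monosemantic representations, which it calls the Rate-Distortion-Polysemanticity tradeoff. Through theoretical analysis and experiments, the authors show that forcing an SAE to be monosemantic increases rate and distortion, and that the degree of polysemanticity of optimal SAEs is determined by the training data distribution, especially the probability that features co-occur. The study extends to real-world settings by deriving necessary conditions that a polysemanticity measure should satisfy and by benchmarking existing proxy metrics on SAEs trained on large language models. It concludes that polysemanticity is a data problem that should be accounted for at the architectural and optimization level.

Abstract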

Sparse Autoencoders (SAEs) that can accurately reconstruct their input (minimizing distortion) by making efficient use of few features (minimizing the rate) often fail to learn monosemantic representations (highly interpretable), limiting their usefulness for mechanistic interpretability. In this paper, we characterise this tension in learning faithful, efficient, and interpretable explanations, introducing the Rate-Distortion-Polysemanticity tradeoff in SAEs. Under toy-modeling assumptions, we theoretically and empirically show that restricting the SAE to be monosemantic necessarily comes with an increase in rate and distortion. Assuming a generative model behind the input observations, we further demonstrate that the degree of polysemanticity of optimal SAEs is determined by the training data distribution, especially by the probability that features co-occur. Finally, we extend the analysis to real-world settings by deriving necessary conditions that a polysemanticity measure should satisfy when the data-generating process is unknown, and we benchmark existing proxy metrics on SAEs trained on Large Language Models. Taken together, our findings show that polysemanticity is a data problem that should be accounted for when addressing it at the architectural and optimization level.

2605.14689 2026-05-15 cs.CV

Are Candidate Models Really Needed for Active Learning?

Harshini Mridula Mohan, Maanya Manjunath, Vipul Arya, S. H. Shabbeer Basha, Nitin Cheekatla

Affiliations: SoCSE, RV University, Bengaluru, India; School of Engineering and Technology, Vidyashilp University, Bengaluru, India; Dataplex Inc., USA

AI summary: This paper asks whether candidate models are really needed in active learning and proposes an active learning approach that requires no initial candidate model. Using randomly initialized CNNs and transformers with confidence-based sampling strategies, it reports results comparable to conventional frameworks in reducing the labeling burden. Low-confidence sampling performed best in most experiments.

Comments: Accepted for publication in Computer Vision and Image Understanding (CVIU)

Abstract

Deep learning has profoundly impacted domains such as computer vision and natural language processing by uncovering complex patterns in vast datasets. However, the reliance on extensive labeled data poses significant challenges, including resource constraints and annotation errors, particularly in training Convolutional Neural Networks (CNNs) and transformers due to their large number of parameters. Active learning offers a promising solution to reduce labeling burdens by strategically selecting the most informative samples for annotation. However, current active learning frameworks are time-intensive because they select samples iteratively with the help of initial candidate models. This study investigates the feasibility of using CNNs and transformers with randomly initialized weights, eliminating the need for initial candidate models while achieving results comparable to active learning frameworks that depend on such candidate models. We evaluate three confidence-based sampling strategies: high confidence (HC), low confidence (LC), and a combination of high confidence in the early stages of training and low confidence at later stages of training (HCLC). Among these, LC demonstrated the best performance in most of our experiments, showcasing its effectiveness as an active learning strategy without the need for candidate models. Further, extensive experiments verify the robustness of the proposed active learning methods. By challenging traditional frameworks, the proposed work introduces a streamlined approach to active learning, advancing efficiency and flexibility across diverse datasets and domains.
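
As an illustration of the HC and LC strategies named above, the following sketch ranks unlabeled samples by the model's top predicted class probability and picks either the least or the most confident ones for annotation. The function name, the `budget` parameter, and the use of the maximum class probability as the confidence score are assumptions for illustration; the abstract does not specify how confidence is computed.

```python
from typing import List, Sequence


def select_by_confidence(
    probs: Sequence[Sequence[float]], budget: int, lowest: bool = True
) -> List[int]:
    """Return indices of `budget` samples ranked by top class probability.

    lowest=True  -> low-confidence (LC) selection
    lowest=False -> high-confidence (HC) selection
    """
    confidence = [max(p) for p in probs]
    ranked = sorted(
        range(len(probs)), key=lambda i: confidence[i], reverse=not lowest
    )
    return ranked[:budget]


# Hypothetical class probabilities for four unlabeled samples.
probs = [[0.9, 0.1], [0.55, 0.45], [0.6, 0.4], [0.99, 0.01]]
print(select_by_confidence(probs, budget=2))
print(select_by_confidence(probs, budget=1, lowest=False))
```

An HCLC-style schedule could call the same function with `lowest=False` in early rounds and `lowest=True` later; when to switch is not stated in the abstract.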

2605.14686 2026-05-15 cs.LG

ReMIA: a Powerful and Efficient Alternative to Membership Inference Attacks against Synthetic Data Generators

Davide Scassola, Andrea Coser, Sebastiano Saccani

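Affiliations * Aindo SpA

AI summary As privacy protection grows in importance, synthetic data generators (SDGs) are widely used for data sharing, yet the data they produce remains exposed to membership inference attacks (MIAs). This paper proposes ReMIA, a privacy evaluation method that needs only two SDG training runs and auxiliary data no larger than the original training set, which makes MIA-style evaluation far more practical. Experiments show that ReMIA reaches sensitivity comparable to state-of-the-art MIAs at much lower cost, and that SDGs can achieve privacy-utility trade-offs that traditional noise-based anonymization methods do not match.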

Details
English abstract

Tabular data sharing under privacy constraints is increasingly important for research and collaboration. Synthetic data generators (SDGs) are a promising solution, but synthetic data remains vulnerable to attacks, such as membership inference attacks (MIAs), which aim to determine whether a specific record was part of the training data. State-of-the-art MIAs are powerful but impractical: they rely on shadow modeling, requiring hundreds of SDG training runs, and need auxiliary data several times larger than the original training set. Fast proxy metrics like distance to closest record (DCR) are efficient but have limited sensitivity to MIA risk. We introduce ReMIA (Relative Membership Inference Attack), a practical privacy metric that requires only two SDG training runs and additional data no larger than the original training set. Rather than predicting whether a record was in the training set, ReMIA generates two synthetic datasets from two source datasets and measures whether a classifier can identify which source a record came from. Experiments across multiple tabular datasets and SDGs show that ReMIA has a sensitivity comparable to state-of-the-art MIAs while being substantially more practical. We further observe that SDGs can achieve privacy-utility trade-offs that traditional noise-based anonymization methods do not match. Code is available at https://github.com/aindo-com/remia.

2605.14685 2026-05-15 cs.LG cond-mat.stat-mech cs.AI

Spontaneous symmetry breaking and Goldstone modes for deep information propagation

Nabil Iqbal, T. Anderson Keller, Yue Song, Takeru Miyato, Max Welling

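Affiliations * Dept. of Mathematical Sciences, Durham University; Kempner Institute, Harvard University; AMLab, University of Amsterdam; College of AI, Tsinghua University; University of Tübingen, Tübingen AI Center; CuspAI

AI summary This paper studies spontaneous symmetry breaking in deep neural networks with a continuous symmetry and the Goldstone-like degrees of freedom that result, showing that these degrees of freedom support coherent information propagation across depth and recurrent iterations. Through theoretical analysis and experiments, the authors show that this mechanism yields stable information flow without architectural stabilizers such as residual connections or normalization. It improves the trainability and representational diversity of feedforward networks and strengthens long-term memory in recurrent networks, improving performance on long-sequence modeling tasks.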

Comments 28 pages. Code at https://github.com/nabiliqbal/ssb-goldstone-deep-info-prop

Details
English abstract

In physical systems, whenever a continuous symmetry is spontaneously broken, the system possesses excitations called Goldstone modes, which allow coherent information propagation over long distances and times. In this work, we study deep neural networks whose internal layers are equivariant under a continuous symmetry and may therefore support analogous Goldstone-like degrees of freedom. We demonstrate, both analytically and empirically, that these degrees of freedom enable coherent signal propagation across depth and recurrent iterations, providing a mechanism for stable information flow without relying on architectural stabilizers such as residual connections or normalization. In feedforward networks, this results in improved trainability and representational diversity across layers. In recurrent settings, we demonstrate the same mechanism is valuable for long-term memory by propagating information over recurrent iterations, thereby improving performance of RNNs and GRUs on long-sequence modeling tasks.

2605.14683 2026-05-15 cs.RO cs.SY eess.SY

SeaVis: Modeling and Control of a Remotely Operated Towed Vehicle for Seabed Visualization and Mapping

Abdelhakim Amer, Aske Alstrup, Frederik Rasmussen, Yury Brodskiy, Andriy Sarabakha, Erdal Kayacan

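Affiliations * Artificial Intelligence in Robotics Laboratory (AiR Lab); Department of Electrical and Computer Engineering; EIVA a/s; Automatic Control Group (RAT); Department of Electrical Engineering and Information Technology

AI summary This paper presents a new mathematical model for SeaVis, a remotely operated towed vehicle for seabed visualization and mapping, and designs a gain-scheduled linear-quadratic regulator (LQR) for robust depth and attitude control. In high-fidelity simulation, the LQR outperforms a conventional PID controller in disturbance rejection, control efficiency, and flap actuation, and remains effective across the full operational velocity range. The work offers an effective control method for stable, high-precision operation of underwater robots.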

Comments Accepted at IEEE/ASME AIM 2026

Details
English abstract

High-resolution seafloor mapping necessitates stable and precise positioning for underwater robots. This paper introduces a novel mathematical model for SeaVis remotely operated towed vehicles (ROTVs) and develops a gain-scheduled linear-quadratic regulator (LQR) for robust depth and attitude control. We validate the approach in a high-fidelity simulation, benchmarking the LQR against a conventional PID controller over a challenging seabed profile. The presented results demonstrate the LQR's superior performance, with significantly enhanced robustness to disturbances, greater control efficiency, and substantially reduced flap actuation. The gain scheduling also confirms the controller's effectiveness across the full operational velocity range. The complete simulation environment and controller are open-sourced.

2605.14679 2026-05-15 cs.CL cs.AI

AI-assisted cultural heritage dissemination: Comparing NMT and glossary-augmented LLM translation in rock art documents

Vicent Briva-Iglesias, María Ferre-Fernández

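Affiliations * Dublin City University; CTTS; ADAPT Centre; SALIS Universidad de Almería

AI summary This study examines how AI assistance can improve multilingual dissemination in terminology-dense cultural heritage domains such as rock art documentation. It compares three setups for translating a Spanish academic text into English, focusing on how much glossary-augmented prompting improves the accuracy of specialised terms. The glossary-augmented LLM (Gemini-RAG) achieves the highest terminology accuracy while matching the basic-prompt LLM and exceeding the NMT baseline on overall quality, giving cultural institutions a low-overhead way to control terminology.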

Details
English abstract

Cultural heritage institutions increasingly disseminate research and interpretive materials globally, but multilingual dissemination is constrained by limited budgets and staffing. In terminology-dense domains such as rock art, translation quality depends on accurate, consistent specialised terms, and small lexical errors can mislead non-specialists and reduce reuse. We compare three English MT setups for a Spanish academic rock art text, focusing on simple, operationally feasible interventions rather than complex model-side modifications: (1) DeepL as a strong NMT baseline, (2) Gemini-Simple (LLM with a basic prompt), and (3) Gemini-RAG (the same LLM with glossary-augmented prompting via term-pair retrieval). Using PEARMUT, we conduct a human evaluation via (i) multi-way Direct Assessment (0-100) and (ii) targeted terminology auditing with a restricted MQM taxonomy. Gemini-RAG yields the highest exact-match terminology accuracy (81.4%), versus Gemini-Simple (69.1%) and DeepL (64.4%), while preserving overall quality (mean DA 85.3 Gemini-RAG vs. 85.2 Gemini-Simple), outperforming DeepL (80.3). These results show that glossary-augmented prompting is a low-overhead way to improve terminology control in cultural-heritage translation if institutions maintain minimal terminology resources and lightweight evaluation procedures.
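
Glossary-augmented prompting via term-pair retrieval can be pictured with a small sketch. Everything here is an assumption made for illustration: the glossary entries, the substring-based retrieval, the prompt wording, and the automated `exact_match_accuracy` check (the study itself reports a human terminology audit). No call to any translation service is shown.

```python
def retrieve_term_pairs(source_text, glossary):
    """Return (source_term, target_term) pairs whose source term occurs in the text."""
    lowered = source_text.lower()
    return [(src, tgt) for src, tgt in glossary.items() if src.lower() in lowered]


def build_prompt(source_text, glossary):
    """Build a translation prompt, adding a glossary section only when terms match."""
    pairs = retrieve_term_pairs(source_text, glossary)
    lines = ["Translate the following Spanish text into English."]
    if pairs:
        lines.append("Use these term translations:")
        lines.extend(f"- {src} -> {tgt}" for src, tgt in pairs)
    lines.append("Text: " + source_text)
    return "\n".join(lines)


def exact_match_accuracy(translation, expected_terms):
    """Fraction of expected target terms that appear verbatim in the translation."""
    if not expected_terms:
        return 0.0
    lowered = translation.lower()
    hits = sum(1 for term in expected_terms if term.lower() in lowered)
    return hits / len(expected_terms)


# Hypothetical two-entry glossary and source sentence.
glossary = {"arte rupestre": "rock art", "abrigo": "rock shelter"}
source = "El arte rupestre del abrigo."
print(build_prompt(source, glossary))
print(exact_match_accuracy("The rock art of the shelter.", ["rock art", "rock shelter"]))
```

Retrieving only the pairs that occur in the source keeps the prompt short, which fits the abstract's emphasis on simple, operationally feasible interventions.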

2605.14672 2026-05-15 cs.LG

AQKA: Active Quantum Kernel Acquisition Under a Shot Budget

Jian Xu, Chao Li, Delu Zeng, John Paisley, Qibin Zhao

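Affiliations * RIKEN iTHEMS; RIKEN AIP; South China University of Technology; Columbia University

AI summary This paper addresses the efficient estimation of quantum kernel matrices under a limited measurement budget and proposes AQKA, a method that allocates shots adaptively to improve classification performance. Its main contributions are a complete regime decomposition that indicates when each allocation strategy wins, and a closed-form pair-level allocation theory based on gradients and kernel values. In the experiments, AQKA outperforms uniform allocation in the budget-limited regime, including on IBM quantum hardware, with the clearest gains on sparse-sensitivity tasks.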

Details
English abstract

Estimating an $N \times N$ quantum kernel from circuit fidelities requires $\Theta(N^2 S)$ measurement shots, the dominant bottleneck for deployment on near-term hardware. Existing budget-saving methods (Nyström-QKE, ShoFaR, kernel-target alignment) sub-sample which entries to measure but allocate shots uniformly within their chosen subset, ignoring how much each entry drives the downstream classifier. We close this gap with two contributions. First, a complete regime decomposition for shot-budgeted quantum kernel learning: a principled menu of when each allocator wins. Our method, AQKA, dominates the budget-limited regime ($B \lesssim 16 n_{\mathrm{pairs}}$) on sparse-sensitivity KRR, with the gap growing from $+8$ to $+25$ pts over uniform as $N$ scales from 225 to 1000 and reaching $+26$ to $+32$ pts on an ibm_pittsburgh (156-qubit Heron) hardware kernel; Nyström-QKE wins at saturating budgets on planted-sparse via low-rank reconstruction; ShoFaR is competitive only at extreme low budgets. Second, a closed-form pair-level acquisition theory: $s_{ij}^{\star} \propto |g_{ij}|\sqrt{K_{ij}(1-K_{ij})}$ with explicit gradient $g_{ij}$ for KRR (Lemma 1, $|\beta_i\alpha_j+\beta_j\alpha_i|\sqrt{K_{ij}(1-K_{ij})}$) and SVM via the envelope theorem ($|\eta_i^*\eta_j^*|\sqrt{K_{ij}(1-K_{ij})}$); a corrected sparsity-aware Cauchy-Schwarz rate $\rho \le 2m/N$ matching empirics (vs. the naive $m^2/N^2$); an explicit-constant plug-in regret bound (Theorem 2); and a tighter SVM ceiling $\rho^{\mathrm{SVM}} \le m_{\mathrm{sv}}^2/N^2$. We close with the first multi-seed live online adaptive shot allocation on quantum hardware: $+17.0 \pm 4.8$ pts at $N = 20$ on ibm_aachen ($3.5\sigma$, 5 seeds), with the advantage holding at $N = 30$ at higher budget on ibm_berlin ($+14.0 \pm 8.5$ pts, 5 seeds).
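
The pair-level rule stated in the abstract, shots proportional to $|g_{ij}|\sqrt{K_{ij}(1-K_{ij})}$, can be turned into a budget split along the following lines. This is a sketch under assumptions rather than the AQKA algorithm: the function name, the dictionary inputs, the flooring of fractional shots, and the even-split fallback are all hypothetical choices, and the sensitivities `grads` are taken as given instead of being derived from a fitted KRR or SVM model.

```python
import math


def allocate_shots(kernel, grads, budget):
    """Split `budget` shots across pairs in proportion to |g_ij| * sqrt(K_ij * (1 - K_ij)).

    kernel -- dict mapping a pair (i, j) to a current kernel estimate K_ij in [0, 1]
    grads  -- dict mapping the same pairs to a sensitivity g_ij
    """
    if not kernel:
        return {}
    weights = {
        pair: abs(grads[pair]) * math.sqrt(kernel[pair] * (1.0 - kernel[pair]))
        for pair in kernel
    }
    total = sum(weights.values())
    if total == 0.0:
        # Assumed fallback: with no usable signal, spread the shots evenly.
        return {pair: budget // len(kernel) for pair in kernel}
    # Floor each share; shots lost to rounding are simply left unspent here.
    return {pair: int(budget * w / total) for pair, w in weights.items()}


# Made-up kernel estimates and sensitivities for three pairs.
kernel = {(0, 1): 0.5, (0, 2): 0.5, (1, 2): 1.0}
grads = {(0, 1): 3.0, (0, 2): 1.0, (1, 2): 5.0}
print(allocate_shots(kernel, grads, 1000))
```

The pair (1, 2) receives no shots despite its large sensitivity because an estimate of exactly 1 has zero Bernoulli variance, which is the role of the $\sqrt{K_{ij}(1-K_{ij})}$ factor.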

2605.14667 2026-05-15 cs.AI

How Sensitive Are Radiomic AI Models to Acquisition Parameters?

D. Gil, I. Sanchez, C. Sanchez

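Affiliations * Computer Vision Center; Universitat Autònoma de Barcelona

AI summary This paper studies how sensitive radiomic AI models are to image acquisition parameters. It proposes a mixed-effects framework that quantifies the influence of clinically relevant parameters on model performance and identifies parameter ranges associated with better cross-dataset robustness. Applied to two independent multicentre CT datasets, the framework selects a scan configuration (tube current >= 200 mA, spiral pitch <= 1.5, slice thickness <= 1.25 mm) that balances diagnostic quality with low radiation dose and markedly improves sensitivity and specificity.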

Details
English abstract

A main barrier to deploying AI radiomic systems in clinical routine is their drop in performance under heterogeneous multicentre acquisition protocols. This work presents a performance-oriented framework for quantifying the scan parameter sensitivity of radiomic AI models and for identifying clinically significant parameter regions associated with improved cross-dataset robustness. We formulate a mixed-effects framework that quantifies the influence of clinically relevant acquisition parameters on model performance while accounting for subject-level random effects. We apply the framework to lung cancer diagnosis in CT scans using two independent multicentre datasets (a public database and data we collected) and several state-of-the-art architectures. To evaluate cross-database reproducibility, CT parameters were adjusted on the collected data and tested on the public set. The selected optimal configuration is an X-ray tube current >= 200 mA, spiral pitch <= 1.5, and slice thickness <= 1.25 mm, which balances diagnostic quality with low radiation dose. This configuration raises performance from 0.79 ± 0.04 sensitivity and 0.47 ± 0.10 specificity on low-quality scans to 0.90 ± 0.10 sensitivity and 0.79 ± 0.13 specificity on high-quality ones.
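
The reported configuration amounts to a three-way threshold check on scan metadata. A minimal sketch, assuming hypothetical field names (`tube_current_ma`, `spiral_pitch`, `slice_thickness_mm`); the abstract does not say how the authors store or label scans, so this only restates the thresholds as code.

```python
def is_high_quality(scan):
    """Check a scan's acquisition parameters against the thresholds in the abstract."""
    return (
        scan["tube_current_ma"] >= 200
        and scan["spiral_pitch"] <= 1.5
        and scan["slice_thickness_mm"] <= 1.25
    )


# Two made-up scans: the second fails on tube current alone.
scans = [
    {"tube_current_ma": 250, "spiral_pitch": 1.0, "slice_thickness_mm": 1.25},
    {"tube_current_ma": 150, "spiral_pitch": 1.0, "slice_thickness_mm": 1.0},
]
print([is_high_quality(s) for s in scans])
```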

2605.14666 2026-05-15 cs.AI

Monitoring Data-aware Temporal Properties (Extended Version)

Alessandro Gianola, Marco Montali, Sarah Winkler

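Affiliations * INESC-ID/Instituto Superior Técnico, Universidade de Lisboa, Portugal; Free University of Bozen-Bolzano, Italy

AI summary This paper studies anticipatory monitoring of linear temporal logic over finite traces enriched with an arbitrary SMT theory (LTLfMT), for dynamic systems whose internal specification is not accessible. It presents a framework that combines automata-theoretic methods with automated reasoning techniques and proves it correct for monitoring complex properties over finite traces. The work identifies, for the first time, decidable fragments that combine linear arithmetic with uninterpreted functions, covering data-aware business processes and dynamic systems over a read-only database, and demonstrates feasibility with a prototype implementation.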

Comments This is the extended version of a paper accepted to IJCAI 2026

Details
English abstract

Dynamic systems in AI are often complex and heterogeneous, so that an internal specification is not accessible and verification techniques such as model checking are not applicable. Monitoring is in such cases an attractive alternative, as it evaluates desirable properties along traces generated by an unknown dynamic system. In this work, we consider anticipatory monitoring of linear-time properties enriched with an arbitrary SMT theory over finite traces (LTLfMT). Anticipatory monitoring in this setting is highly challenging, as the monitoring state depends on both the trace prefix seen so far and all its possible finite continuations. Under reasonable assumptions on the background theory, we present and formally prove the correctness of a novel foundational framework for monitoring properties in an expressive fragment of LTLfMT. The framework combines automata-theoretic methods to handle the temporal aspects of the logic, with automated reasoning techniques to address the first-order dimension. Moreover, we identify for the first time decidable fragments of this monitoring problem that are practically relevant as they combine linear arithmetic with uninterpreted functions, which covers e.g. data-aware business processes and dynamic systems operating over a read-only database. Feasibility is witnessed by a prototype implementation and preliminary evaluation.

2605.14660 2026-05-15 cs.AI

MindGap: A Conversational AI Framework for Upstream Neuroplastic Intervention in Post-Traumatic Stress Disorder

Eranga Bandara, Ross Gore, Asanga Gunaratna, Ravi Mukkamala, Nihal Siriwardanagea, Sachini Rajapakse, Isurunima Kularathna, Pramoda Karunarathna, Wathsala Herath, Chalani Rajapakse, Sachin Shetty, Anita H. Clayton, Christopher K. Rhea, Ng Wee Keong, Kasun De Zoysa, Amin Hass, Shaifali Kaushik, Preston Samuel, Atmaram Yarlagadda

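Affiliations * Old Dominion University; AI Motion Labs; Nanyang Technological University; University of Colombo; Accenture Technology Labs; Department of Psychiatry and Neurobehavioral Sciences; University of Virginia School of Medicine; Blanchfield Army Community Hospital; McDonald Army Health Center

AI summary This paper proposes MindGap, a conversational AI framework intended to treat post-traumatic stress disorder (PTSD) through upstream neuroplastic intervention. Drawing on dependent origination, a Buddhist psychological framework, it guides patients to observe the gap between perception and reaction, with the aim of structurally reshaping over-reactive neural pathways. Through three progressive layers of observation, MindGap helps patients identify and weaken the implicit beliefs that trigger the stress response, aiming to relieve symptoms at the source rather than suppress them after the reaction occurs. The framework runs entirely on-device to protect privacy, which suits deployment in clinical and military settings with strict data security requirements.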

Details
English abstract

Post-Traumatic Stress Disorder (PTSD) is fundamentally a neuroplastic problem: traumatic contact events encode over-reactive neural pathways through Hebbian long-term potentiation, producing hair-triggered amygdala-HPA stress cascades that fire before conscious awareness can intercept them. Existing therapeutic approaches (prolonged exposure, EMDR, cognitive behavioural therapy) operate predominantly downstream of the reactive cascade, teaching patients to tolerate or reframe distress after it has arisen. While clinically valuable, these suppression-based approaches do not produce the upstream pathway dissolution that constitutes lasting structural neural reorganisation. This paper proposes MindGap, a privacy-preserving on-device conversational AI framework that delivers structured neuroplastic rehabilitation for PTSD through the practice of dependent origination, a Buddhist psychological framework that identifies the precise moment between the pre-cognitive affective signal and the reactive elaboration that follows as the site of therapeutic intervention. MindGap guides patients through three progressive layers of observation at this feeling-tone gap: noticing the bare affective signal before reactive elaboration, recognising it as self-arising rather than caused by the stimulus, and recognising the conditioned implicit belief beneath the feeling. Each layer corresponds to progressively deeper prefrontal regulatory engagement and progressively deeper long-term depression-mediated weakening of the reactive pathway, producing genuine upstream dissolution rather than downstream suppression. Running entirely on-device with no data egress, MindGap delivers daily calibrated exposure sessions through a fine-tuned lightweight large language model, making it deployable in sensitive clinical and military contexts where cloud-based solutions are not permitted.

2605.14659 2026-05-15 cs.LG

Slower Generalization, Faster Memorization: A Sweet Spot in Algorithmic Learning

Shin So, Kyelim Lee, Albert No

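Affiliations * Yonsei University

AI summary This study examines the relationship between generalization and memorization in algorithmic learning. It shows that, past a data threshold, adding data may not speed up validation accuracy and can instead require more gradient updates. In a structured-output task, Needleman-Wunsch matrix generation, models reach high validation accuracy fastest at an intermediate dataset size; larger datasets still generalize but converge more slowly. The study separates the data size needed for the onset of generalization from the dataset size that optimizes update-based convergence, and notes that in some structured tasks learning the rule and exact fitting can diverge.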

Details
English abstract

Critical-data-size accounts of grokking suggest a natural post-threshold intuition: once training data is sufficient to identify the underlying rule, additional data should accelerate validation convergence. We show that this intuition can fail in a controlled structured-output task. In Needleman-Wunsch (NW) matrix generation, small Transformers reach high validation exact-match accuracy fastest at an intermediate dataset size, not at the largest one. Past this dataset-size sweet spot, generalization remains achievable but requires more gradient updates. Conversely, in the regime where partial validation competence first appears, larger datasets can require fewer updates to reach high training accuracy, suggesting that emerging rule structure can accelerate fitting beyond example-wise memorization. A multiplication baseline does not show the same post-threshold slowdown. These results separate the critical data size for the onset of generalization from the dataset size that optimizes update-based convergence, and identify structured-output tasks where learning the rule and completing exact-fitting can diverge.
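
For readers unfamiliar with the target of the NW matrix generation task, the standard Needleman-Wunsch dynamic program fills a score matrix as sketched below. The scoring values (`match=1`, `mismatch=-1`, `gap=-1`) are assumptions chosen for illustration; the abstract does not state the scoring scheme or the sequence alphabet used in the experiments.

```python
def nw_matrix(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the Needleman-Wunsch global-alignment score matrix for sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)
    return score


for row in nw_matrix("GA", "GA"):
    print(row)
```

Every cell depends on three neighbours, so a model that must emit the whole matrix exactly has to apply the recurrence consistently, which is what makes this a structured-output task.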

2605.14654 2026-05-15 cs.CV

Beyond Instance-Level Self-Supervision in 3D Multi-Modal Medical Imaging

Tan Pan, Shuhao Mei, Yixuan Sun, Kaiyu Guo, Chen Jiang, Zhaorui Tan, Mengzhu Li, Limei Han, Xiang Zou, Yuan Cheng, Mahsa Baktashmotlagh

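Affiliations * Fudan University, China; University of Queensland, Australia; Shanghai Academy of AI for Science, China; Huashan Hospital, National Center for Neurological Disorders, Fudan University, China; Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), Singapore

AI summary Targeting multi-modal 3D medical imaging, this study proposes a method that goes beyond instance-level self-supervision by using the topological consistency that anatomical structures maintain across individuals as a supervisory signal. Two alignment strategies, an intra-instance cross-modal triplet objective and inter-instance pseudo-correspondence generation, help the model learn local and global topology. Experiments show performance gains on multiple downstream tasks and greater robustness when modalities are missing at test time.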

Comments ICML2026

Details
English abstract

Self-supervised pre-training methods in medical imaging typically treat each individual as an isolated instance, learning representations through augmentation-based objectives or masked reconstruction. They often do not adequately capitalize on a key characteristic of physiological features: anatomical structures maintain consistent spatial relationships across individuals (instances), such as the thalamus being medial to the basal ganglia, regardless of variations in brain size, shape, or pathology. We propose leveraging this cross-instance topological consistency as a supervisory signal. The challenge arises from the inherent variability in medical imaging, which can differ significantly across instances and modalities. To tackle this, we focus on two alignment regimes. (i) Intra-instance: with pixel-level correspondences available, a cross-modal triplet objective explicitly preserves local neighborhood topology. (ii) Inter-instance: without such supervision, we derive pseudo-correspondences to control partial neighborhood alignment and prevent topology collapse across modalities. We validate our approach across 7 downstream multi-modal tasks, achieving average improvements of 1.1% and 5.94% in segmentation and classification tasks, respectively, and demonstrating significantly better robustness when modalities are missing at test time.

2605.14651 2026-05-15 cs.CV

TERRA-CD: Multi-Temporal Framework for Multi-class and Semantic Change Detection

Omkar Oak, Rukmini Nazre, Rujuta Budke, Suraj Sawant

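Affiliations * COEP Technological University, Pune, India; University of Massachusetts, Amherst, USA; North Carolina State University, USA

AI summary This paper presents TERRA-CD, a multi-temporal remote-sensing framework for multi-class and semantic change detection. The study builds a benchmark dataset of 5,221 Sentinel-2 image pairs covering 232 cities across the USA and Europe, with three annotation schemes covering land cover classification, vegetation change, and semantic change. Several deep learning methods are used to evaluate the dataset for multi-class and semantic change detection, making it a resource for urban vegetation monitoring and environmental change analysis.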

Comments Paper presented at 11th International Congress on Information and Communication Technology (ICICT) 2026, London

Details
English abstract

Urban vegetation monitoring plays a vital role in understanding environmental changes, yet comprehensive datasets for this purpose remain limited. To address this gap, we present the Temporal Remote-sensing Repository for Analyzing Change Detection (TERRA-CD), a benchmark dataset comprising 5,221 Sentinel-2 image pairs from 2019 and 2024, covering 232 cities across the USA and Europe. The dataset features three distinct annotation schemes: 4-class land cover mapping masks, 3-class vegetation change masks, and 13-class semantic change masks capturing all possible land cover transitions. Using various deep learning approaches including Siamese networks, STANet variants, Bi-SRNet, Changemask, Post-Classification Comparison, and HRSCD strategies, we evaluated the dataset's effectiveness for both vegetation Multi-class Change Detection and Semantic Change Detection. The proposed dataset and methods are available at https://github.com/omkarsoak/TERRA-CD.
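
The class counts in the abstract are consistent with a simple enumeration: 4 land cover classes give 4 x 3 = 12 ordered transitions between different classes, and one extra no-change label brings the total to 13. This reading is an assumption, as are the class names and the integer encoding below; the abstract does not list the classes or describe how the masks are labelled.

```python
from itertools import permutations


def build_change_classes(land_cover_classes):
    """Map 'no_change' and every ordered (before, after) class pair to an integer label."""
    labels = {"no_change": 0}
    for idx, pair in enumerate(permutations(land_cover_classes, 2), start=1):
        labels[pair] = idx
    return labels


# Hypothetical class names; only the count of four comes from the abstract.
classes = ["vegetation", "built_up", "water", "bare"]
change_labels = build_change_classes(classes)
print(len(change_labels))
```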

2605.14645 2026-05-15 cs.CV cs.AI

Vision-Based Water Level and Flow Estimation

ZhiXin Sun

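Affiliations * PowerChina Zhongnan Engineering Corporation Limited

AI summary This study proposes an integrated framework that combines state-of-the-art vision models with statistical modeling to improve the accuracy of water level detection and flow estimation. Physical priors and robust filtering strategies address environmental sensitivity, limited precision, and complex site calibration. The method retains the automation and interpretability advantages of vision-based approaches while making them more reliable for hydrological monitoring.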

Details
English abstract

With the rapid evolution of computer vision, vision-based methodologies for water level and river surface velocity estimation have reached significant maturity. Compared to traditional sensing, these techniques offer superior interpretability, automated data archiving, and enhanced system robustness. However, challenges such as environmental sensitivity, limited precision, and complex site calibration persist. This work proposes an integrated framework that synergizes state-of-the-art (SOTA) vision models with statistical modeling. By leveraging physical priors and robust filtering strategies, we improve the accuracy of water level detection and flow estimation. Code will be available at https://github.com/sunzx97/Vision_Based_Water_Level_and_Flow_Estimation.git

2605.14643 2026-05-15 cs.LG cs.NA math.NA math.OC

Unbiased and Second-Order-Free Training for High-Dimensional PDEs

Jaemin Seo, Surin Lee, Jae Yong Lee

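Affiliations * Department of Artificial Intelligence, Chung-Ang University, Seoul, Republic of Korea

AI summary This paper studies training bias in deep learning methods based on backward stochastic differential equations (BSDEs) for solving high-dimensional partial differential equations, noting that the commonly used Euler-Maruyama time discretization induces an intrinsic bias in the loss function. The authors propose an unbiased training framework that needs no second-order derivatives, removing the bias while preserving computational efficiency, with the aim of more accurate and stable high-dimensional PDE solving.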

Comments Accepted at ICML 2026

Journal ref International Conference on Machine Learning 2026

Details
English abstract

Deep learning methods based on backward stochastic differential equations (BSDEs) have emerged as competitive alternatives to physics-informed neural networks (PINNs) for solving high-dimensional partial differential equations (PDEs). By leveraging probabilistic representations, BSDE approaches can avoid the curse of dimensionality and often admit second-order-free training objectives that do not require explicit Hessian evaluations. It has recently been established that the commonly used Euler-Maruyama (EM) time discretization induces an intrinsic bias in BSDE training losses. While high-order schemes such as Heun can fully eliminate this bias, such schemes re-introduce second-order spatial derivatives and incur substantial computational overhead. In this work, we provide a principled analysis of EM-induced loss bias and propose an unbiased, second-order-free training framework that preserves the computational advantages of BSDE methods. Our code is available at https://github.com/seojaemin22/Un-EM-BSDE.
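
As background for the Euler-Maruyama (EM) scheme mentioned in the abstract, the sketch below simulates one path of a scalar forward SDE, dX = drift(X) dt + diffusion(X) dW, with EM time stepping. It shows only the discretization itself: it does not reproduce the BSDE training loss, the bias analysed in the paper, or the authors' unbiased framework. The function name, the scalar setting, and the example drift and diffusion are assumptions.

```python
import math
import random


def euler_maruyama(x0, drift, diffusion, t_end, n_steps, rng):
    """Simulate one path of dX = drift(X) dt + diffusion(X) dW on [0, t_end].

    Returns the list [X_0, X_1, ..., X_n] on a uniform time grid.
    """
    dt = t_end / n_steps
    path = [x0]
    x = x0
    for _ in range(n_steps):
        dw = math.sqrt(dt) * rng.gauss(0.0, 1.0)  # Brownian increment ~ N(0, dt)
        x = x + drift(x) * dt + diffusion(x) * dw
        path.append(x)
    return path


# Example: a mean-reverting process with constant noise (made-up coefficients).
rng = random.Random(0)
path = euler_maruyama(1.0, lambda x: -x, lambda x: 0.2, t_end=1.0, n_steps=100, rng=rng)
print(len(path))
```

Each step uses only first-order information about the coefficients, which is why EM is cheap, and why replacing it with a higher-order scheme such as Heun changes the cost profile the abstract discusses.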