arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 1981
2605.14666 2026-05-15 cs.AI

Monitoring Data-aware Temporal Properties (Extended Version)

Alessandro Gianola, Marco Montali, Sarah Winkler

AI总结 本文研究如何对具有任意SMT理论的线性时序逻辑(LTLfMT)进行前瞻监控,以应对动态系统中无法访问内部规范的问题。提出了一种结合自动机理论与自动推理技术的新框架,能够在有限轨迹上正确监控复杂属性。该方法首次识别出包含线性算术与未解释函数的可判定子类,适用于数据感知的业务流程和只读数据库上的动态系统,并通过原型实现验证了其可行性。

Comments This is the extended version of a paper accepted to IJCAI 2026

详情
英文摘要

Dynamic systems in AI are often complex and heterogeneous, so that an internal specification is not accessible and verification techniques such as model checking are not applicable. Monitoring is in such cases an attractive alternative, as it evaluates desirable properties along traces generated by an unknown dynamic system. In this work, we consider anticipatory monitoring of linear-time properties enriched with an arbitrary SMT theory over finite traces (LTLfMT). Anticipatory monitoring in this setting is highly challenging, as the monitoring state depends on both the trace prefix seen so far and all its possible finite continuations. Under reasonable assumptions on the background theory, we present and formally prove the correctness of a novel foundational framework for monitoring properties in an expressive fragment of LTLfMT. The framework combines automata-theoretic methods to handle the temporal aspects of the logic, with automated reasoning techniques to address the first-order dimension. Moreover, we identify for the first time decidable fragments of this monitoring problem that are practically relevant as they combine linear arithmetic with uninterpreted functions, which covers e.g. data-aware business processes and dynamic systems operating over a read-only database. Feasibility is witnessed by a prototype implementation and preliminary evaluation.

2605.14660 2026-05-15 cs.AI

MindGap: A Conversational AI Framework for Upstream Neuroplastic Intervention in Post-Traumatic Stress Disorder

Eranga Bandara, Ross Gore, Asanga Gunaratna, Ravi Mukkamala, Nihal Siriwardanagea, Sachini Rajapakse, Isurunima Kularathna, Pramoda Karunarathna, Wathsala Herath, Chalani Rajapakse, Sachin Shetty, Anita H. Clayton, Christopher K. Rhea, Ng Wee Keong, Kasun De Zoysa, Amin Hass, Shaifali Kaushik, Preston Samuel, Atmaram Yarlagadda

AI总结 本文提出了一种名为MindGap的会话式人工智能框架,旨在通过上游神经可塑性干预治疗创伤后应激障碍(PTSD)。该方法基于佛教心理框架“缘起”理论,引导患者在感知与反应之间的时间间隙进行观察,从而实现对过度反应神经通路的结构性重塑。MindGap通过三个渐进的观察层次,帮助患者逐步识别并削弱引发应激反应的潜在信念,实现从源头上缓解症状,而非仅在反应发生后进行压制。该框架完全在设备端运行,保障隐私,适合在临床和军事等对数据安全要求严格的环境中部署。

详情
英文摘要

Post-Traumatic Stress Disorder (PTSD) is fundamentally a neuroplastic problem traumatic contact events encode over-reactive neural pathways through Hebbian long-term potentiation, producing hair-triggered amygdala-HPA stress cascades that fire before conscious awareness can intercept them. Existing therapeutic approaches, prolonged exposure, EMDR, cognitive behavioural therapy, operate predominantly downstream of the reactive cascade, teaching patients to tolerate or reframe distress after it has arisen. While clinically valuable, these suppression-based approaches do not produce the upstream pathway dissolution that constitutes lasting structural neural reorganisation. This paper proposes MindGap, a privacy-preserving on-device conversational AI framework that delivers structured neuroplastic rehabilitation for PTSD through the practice of dependent origination, a Buddhist psychological framework that identifies the precise moment between the pre-cognitive affective signal and the reactive elaboration that follows as the site of therapeutic intervention. MindGap guides patients through three progressive layers of observation at this feeling tone gap: noticing the bare affective signal before reactive elaboration, recognising it as self-arising rather than caused by the stimulus, and recognising the conditioned implicit belief beneath the feeling. Each layer corresponds to progressively deeper prefrontal regulatory engagement and progressively deeper long-term depression-mediated weakening of the reactive pathway, producing genuine upstream dissolution rather than downstream suppression. Running entirely on-device with no data egress, MindGap delivers daily calibrated exposure sessions through a fine-tuned lightweight large language model, making it deployable in sensitive clinical and military contexts where cloud-based solutions are not permitted.

2605.14659 2026-05-15 cs.LG

Slower Generalization, Faster Memorization: A Sweet Spot in Algorithmic Learning

Shin So, Kyelim Lee, Albert No

AI总结 本研究探讨了算法学习中泛化与记忆化之间的关系,指出在数据量达到一定阈值后,增加数据可能不会加速验证准确率的提升,反而需要更多的梯度更新。在结构化输出任务中,如Needleman-Wunsch矩阵生成,模型在中等数据量时达到最佳验证性能,而更大的数据集虽仍可实现泛化,但收敛速度变慢。研究揭示了泛化起始所需的数据量与基于更新次数的收敛优化之间存在差异,并指出了在某些结构化任务中,学习规则与精确拟合可能分道扬镳。

详情
英文摘要

Critical-data-size accounts of grokking suggest a natural post-threshold intuition: once training data is sufficient to identify the underlying rule, additional data should accelerate validation convergence. We show that this intuition can fail in a controlled structured-output task. In Needleman--Wunsch (NW) matrix generation, small Transformers reach high validation exact-match accuracy fastest at an intermediate dataset size, not at the largest one. Past this dataset-size sweet spot, generalization remains achievable but requires more gradient updates. Conversely, in the regime where partial validation competence first appears, larger datasets can require fewer updates to reach high training accuracy, suggesting that emerging rule structure can accelerate fitting beyond example-wise memorization. A multiplication baseline does not show the same post-threshold slowdown. These results separate the critical data size for the onset of generalization from the dataset size that optimizes update-based convergence, and identify structured-output tasks where learning the rule and completing exact-fitting can diverge.

2605.14654 2026-05-15 cs.CV

Beyond Instance-Level Self-Supervision in 3D Multi-Modal Medical Imaging

Tan Pan, Shuhao Mei, Yixuan Sun, Kaiyu Guo, Chen Jiang, Zhaorui Tan, Mengzhu Li, Limei Han, Xiang Zou, Yuan Cheng, Mahsa Baktashmotlagh

AI总结 该研究针对医学影像中的多模态3D数据,提出了一种超越个体级自监督的方法,利用解剖结构在不同个体间保持的拓扑一致性作为监督信号。通过两种对齐策略——个体内的跨模态三元组目标和个体间的伪对应关系生成——有效提升了模型对局部和全局拓扑结构的学习能力。实验表明,该方法在多个下游任务中取得了显著性能提升,并在测试时模态缺失情况下表现出更强的鲁棒性。

Comments ICML2026

详情
英文摘要

Self-supervised pre-training methods in medical imaging typically treat each individual as an isolated instance, learning representations through augmentation-based objectives or masked reconstruction. They often do not adequately capitalize on a key characteristic of physiological features: anatomical structures maintain consistent spatial relationships across individuals (instances), such as the thalamus being medial to the basal ganglia, regardless of variations in brain size, shape, or pathology. We propose leveraging this cross-instance topological consistency as a supervisory signal. The challenge arises from the inherent variability in medical imaging, which can differ significantly across instances and modalities. To tackle this, we focus on two alignment regimes. (i) Intra-instance: with pixel-level correspondences available, a cross-modal triplet objective explicitly preserves local neighborhood topology. (ii) Inter-instance: without such supervision, we derive pseudo-correspondences to control partial neighborhood alignment and prevent topology collapse across modalities. We validate our approach across 7 downstream multi-modal tasks, achieving average improvements of 1.1% and 5.94% in segmentation and classification tasks, respectively, and demonstrating significantly better robustness when modalities are missing at test time.

2605.14651 2026-05-15 cs.CV

TERRA-CD: Multi-Temporal Framework for Multi-class and Semantic Change Detection

Omkar Oak, Rukmini Nazre, Rujuta Budke, Suraj Sawant

AI总结 本文提出了一种多时相的遥感影像变化检测框架TERRA-CD,用于多类别和语义变化检测。该研究构建了一个包含5,221对Sentinel-2影像的基准数据集,覆盖美国和欧洲232个城市,并提供了三种标注方案,涵盖土地覆盖分类、植被变化和语义变化。通过多种深度学习方法评估了该数据集在多类别和语义变化检测中的有效性,为城市植被监测和环境变化分析提供了重要资源。

Comments Paper presented at 11th International Congress on Information and Communication Technology (ICICT) 2026, London

详情
英文摘要

Urban vegetation monitoring plays a vital role in understanding environmental changes, yet comprehensive datasets for this purpose remain limited. To address this gap, we present the Temporal Remote-sensing Repository for Analyzing Change Detection (TERRA-CD), a benchmark dataset comprising 5,221 Sentinel-2 image pairs from 2019 and 2024, covering 232 cities across the USA and Europe. The dataset features three distinct annotation schemes: 4-class land cover mapping masks, 3-class vegetation change masks, and 13-class semantic change masks capturing all possible land cover transitions. Using various deep learning approaches including Siamese networks, STANet variants, Bi-SRNet, Changemask, Post-Classification Comparison, and HRSCD strategies, we evaluated the dataset's effectiveness for both vegetation Multi-class Change Detection as well as Semantic Change Detection. The proposed dataset and methods are available at https://github.com/omkarsoak/TERRA-CD.

2605.14645 2026-05-15 cs.CV cs.AI

Vision-Based Water Level and Flow Estimation

ZhiXin Sun

AI总结 该研究提出了一种结合先进视觉模型与统计建模的综合框架,用于提高水位检测和水流估算的精度。通过引入物理先验知识和鲁棒滤波策略,有效应对了环境敏感性、精度有限和现场校准复杂等挑战。该方法在保持自动化和可解释性优势的同时,提升了传统视觉方法在水文监测中的可靠性。

详情
英文摘要

With the rapid evolution of computer vision, vision-based methodologies for water level and river surface velocity estimation have reached significant maturity. Compared to traditional sensing, these techniques offer superior interpretability, automated data archiving, and enhanced system robustness. However, challenges such as environmental sensitivity, limited precision, and complex site calibration persist. This work proposes an integrated framework that synergizes state-of-the-art (SOTA) vision models with statistical modeling. By leveraging physical priors and robust filtering strategies, we improve the accuracy of water level detection and flow estimation. Code will be available at https://github.com/sunzx97/Vision_Based_Water_Level_and_Flow_Estimation.git

2605.14643 2026-05-15 cs.LG cs.NA math.NA math.OC

Unbiased and Second-Order-Free Training for High-Dimensional PDEs

Jaemin Seo, Surin Lee, Jae Yong Lee

AI总结 本文研究了基于倒向随机微分方程的深度学习方法在求解高维偏微分方程时的训练偏差问题,指出常用的欧拉-马尤亚时间离散化方案会导致损失函数的内在偏差。为此,作者提出了一种无偏且无需二阶导数的训练框架,在保持计算效率的同时消除了该偏差,提升了高维PDE求解的准确性和稳定性。

Comments Accepted at ICML 2026

详情
Journal ref
International Conference on Machine Learning 2026
英文摘要

Deep learning methods based on backward stochastic differential equations (BSDEs) have emerged as competitive alternatives to physics-informed neural networks (PINNs) for solving high-dimensional partial differential equations (PDEs). By leveraging probabilistic representations, BSDE approaches can avoid the curse of dimensionality and often admit second-order-free training objectives that do not require explicit Hessian evaluations. It has recently been established that the commonly used Euler-Maruyama (EM) time discretization induces an intrinsic bias in BSDE training losses. While high-order schemes such as Heun can fully eliminate this bias, such schemes re-introduce second-order spatial derivatives and incur substantial computational overhead. In this work, we provide a principled analysis of EM-induced loss bias and propose an unbiased, second-order-free training framework that preserves the computational advantages of BSDE methods. Our code is available at https://github.com/seojaemin22/Un-EM-BSDE.

2605.14641 2026-05-15 cs.CV cs.AI

How to Evaluate and Refine your CAM

Luca Domeniconi, Alessandra Stramiglio, Michele Lombardi, Samuele Salti

AI总结 该研究针对卷积神经网络中类别归因图(CAM)的评估与改进问题,提出了一种合成数据集以生成真实归因标签,从而更严格地比较现有评估指标,并提出了一种新的复合评估指标ARCC,能够更可靠地识别忠实的解释。同时,为解决CAM分辨率低的问题,研究还引入了RefineCAM方法,通过聚合多层网络的CAM生成高分辨率归因图,实验表明该方法在新评估指标下优于现有方法。

Comments Accepted at ICPR 2026

详情
英文摘要

Class attribution maps (CAMs) provide local explanations for the decisions of convolutional neural networks. While widely used in practice, the evaluation of CAMs remains challenging due to the lack of ground-truth explanations, making it difficult to evaluate the soundness of existing metrics. Independently, most commonly used CAM methods produce low-resolution attribution maps, which limits their usefulness for detailed interpretability. To address the evaluation challenge, we introduce a synthetic dataset with ground-truth attributions that enables a rigorous comparison of CAM evaluation metrics. Using this dataset, we analyze existing metrics and propose ARCC, a new composite metric that more reliably identifies faithful explanations. To address the low resolution issue, we introduce RefineCAM, a method that produces high-resolution attribution maps by aggregating CAMs across multiple network layers. Our results show that RefineCAM consistently outperforms existing methods according to the proposed evaluation.

2605.14636 2026-05-15 cs.AI

Teaching Large Language Models When Not to Know: Learning Temporal Critique for Ex-Ante Reasoning

Chenlu Ding, Jiancan Wu, Yanchen Luo, Zheyuan Liu, Yancheng Yuan, Xiang Wang

AI总结 该研究探讨了大型语言模型在时间截断条件下进行推理时的失效问题,即模型在回答过去时间点的问题时错误地使用了未来才可获得的信息。研究提出了一种名为TCFT的时序批评微调框架,通过训练模型识别和判断回答中是否存在时间泄露,从而提升其在时间限制下的推理能力。实验表明,TCFT在多个模型上显著优于传统提示和微调方法,有效降低了时间泄露的比例。

详情
英文摘要

Large language models (LLMs) often fail to reason under temporal cutoffs: when prompted to answer from the standpoint of an earlier time, they exploit knowledge that became available only later. We study this failure through the lens of ex-ante reasoning, where a model must rely exclusively on information knowable before a cutoff. Through a systematic analysis of prompt-level interventions, we find that temporal leakage is highly sensitive to cutoff formulation and instruction placement: explicit cutoff statements outperform implicit historical framings, and prefix constraints reduce leakage more effectively than suffix constraints. These findings indicate that prompting can steer models into a temporal frame, but does not endow them with the ability to verify whether a response is temporally admissible. We further argue that supervised fine-tuning is insufficient, since ex-ante correctness is not an intrinsic property of an answer, but a relation between the answer and the cutoff. To address this gap, we propose TCFT, a Temporal Critique Fine-Tuning framework that trains models to acquire cutoff-aware temporal verification. Given a query, a cutoff, and a candidate response, TCFT teaches the model to identify post-cutoff leakage, explain temporal boundary violations, and judge temporal admissibility. Experiments with Qwen2.5-7B-Instruct and Qwen2.5-14B-Instruct show that TCFT consistently outperforms prompting and SFT baselines, reducing average leakage by 41.89 and 37.79 percentage points, respectively.

2605.14635 2026-05-15 cs.CV cs.AI

MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models

Tianwei Chen, Takuya Furusawa, Yuki Hirakawa, Ryotaro Shimizu, Mo Fan, Takashi Wada

AI总结 本文提出一个多标签视觉情感分析基准数据集MultiEmo-Bench,用于全面评估多模态大语言模型(MLLMs)对图像引发情感的预测能力。现有数据集采用单一标签标注方式,难以反映图像可能引发的多维度、多强度情感,为此本文引入多标注员协同标注机制,生成包含10,344张图像和236,998个有效情感标签的高质量数据集,并基于该数据集评估了多个主流模型在主控情感预测和情感分布预测任务上的表现,揭示了当前MLLMs在情感理解方面的进展与不足。

详情
英文摘要

This paper introduces a multi-label visual emotion analysis benchmark dataset for comprehensively evaluating the ability of multimodal large language models (MLLMs) to predict the emotions evoked by images. Recent user studies report an unintuitive finding: humans may prefer the predictions of MLLMs over the labels in existing datasets. We argue that this phenomenon stems from the suboptimal annotation scheme used in existing datasets, where each annotator is shown a single candidate emotion for each image and judges whether it is evoked or not. This approach is clearly limited because a single image can evoke multiple emotions with varying intensities. As a result, evaluations based on these datasets may underestimate the capabilities of MLLMs, yet an appropriate benchmark for evaluating such models remains lacking. To address this issue, we introduce a new multi-label benchmark dataset for visual emotion analysis toward MLLMs evaluation. We hire $20$ annotators per image and ask them to select all emotions they feel from an image. Then, we aggregate the votes across all annotators, providing a more reliable and representative dataset labeled with a distribution of emotions. The resulting dataset contains $10,344$ images with $236,998$ valid votes across eight emotions. Based on this benchmark dataset, we evaluate several recent models, including Qwen3-VL, OpenAI's GPT, Gemini, and Claude. We assess model performance on both dominant emotion prediction and emotion distribution prediction. Our results demonstrate the progress achieved by recent MLLMs while also indicating that substantial room for improvement remains. Furthermore, our experiments with LLM-as-a-judge show that the method does not consistently improve MLLMs' performance, indicating its limitations for the subjective task of visual emotion analysis.

2605.14632 2026-05-15 cs.LG stat.AP

DRL-STAF: A Deep Reinforcement Learning Framework for State-Aware Forecasting of Complex Multivariate Hidden Markov Processes

Manrui Jiang, Jingru Huang, Yong Chen, Chen Zhang

AI总结 该研究提出了一种基于深度强化学习的DRL-STAF框架,用于复杂多变量隐马尔可夫过程的状态感知预测。该方法结合深度神经网络建模非线性观测,并利用强化学习估计离散隐状态,克服了传统隐马尔可夫模型在非线性发射和扩展性方面的不足,同时减少了对预定义状态转移结构的依赖。实验表明,DRL-STAF在预测性能和隐状态估计方面均优于现有方法。

详情
英文摘要

Forecasting multivariate hidden Markov processes is challenging due to nonlinear and nonstationary observations, latent state transitions, and cross-sequence dependencies. While deep learning methods achieve strong predictive accuracy, they typically lack explicit state modeling, whereas Hidden Markov Models (HMMs) provide interpretable latent states but struggle with complex nonlinear emissions and scalability. To address these limitations, we propose DRL-STAF, a Deep Reinforcement Learning based STate-Aware Forecasting framework that jointly predicts next-step observations and estimates the corresponding hidden states for complex multivariate hidden Markov processes. Specifically, DRL-STAF models complex nonlinear emissions using deep neural networks and estimates discrete hidden states using reinforcement learning, reducing the reliance on predefined transition structures and enabling flexible adaptation to diverse temporal dynamics. In particular, DRL-STAF mitigates the state-space explosion encountered by typical multivariate HMM-based methods. Extensive experiments demonstrate that DRL-STAF outperforms HMM variants, standalone deep learning models, and existing DL-HMM hybrids in most cases, while also providing reliable hidden-state estimates.

2605.14631 2026-05-15 cs.LG cs.AI cs.CV

Action-Inspired Generative Models

Eshwar R. A., Debnath Pal

AI总结 本文提出了一种受动作启发的生成模型(AGMs),旨在改进现有桥接匹配方法中对所有随机转移赋予相同回归权重的问题。该方法引入了一个轻量的可学习标量势函数 $V_ϕ$,用于在线评估桥接样本并调节漂移目标,从而选择性地惩罚非信息性传输路径,提升了生成质量。该模型结构简单,仅增加约1.4%的参数,无需额外计算开销,可直接嵌入任何桥接匹配训练流程中。

Comments 11 pages, 5 figures, and 4 tables

详情
英文摘要

We introduce Action-Inspired Generative Models (AGMs), a dual-network generative framework motivated by the observation that existing bridge-matching methods assign uniform regression weight to every stochastic transition in the transport landscape, regardless of whether a given bridge sample lies along a structurally coherent trajectory or a degenerate one. We address this by introducing a lightweight learned scalar potential $V_ϕ$ that scores bridge samples online and modulates the drift objective via importance weights derived through a stop-gradient barrier -- preventing adversarial feedback between the two networks whilst preserving $V_ϕ$'s guiding signal. Crucially, $V_ϕ$ comprises only $\sim$1.4% of the primary drift network's parameter count, adds no overhead to the inference graph, and requires no iterative half-bridge fitting or auxiliary stochastic differential equation (SDE) solvers: it is a plug-and-play enhancement to any bridge-matching training loop. At inference, $V_ϕ$ is discarded entirely, leaving standard Euler-Maruyama integration of the exponential moving average (EMA) drift. We demonstrate that selectively penalising uninformative transport paths through the learned potential yields consistent improvements in generation quality across fidelity and coverage metrics.

2605.14626 2026-05-15 cs.CV

UniTriGen: Unified Triplet Generation of Aligned Visible-Infrared-Label for Few-Shot RGB-T Semantic Segmentation

Ping Zhou, Haoyu Wang, Mengmeng Zheng, Lei Zhang, Wei Wei, Chen Ding, Fei Zhou

AI总结 RGB-T语义分割需要严格对齐的可见光-红外-标签三元组,但在实际场景中这类数据往往稀缺。为解决这一问题,本文提出UniTriGen,一种统一的三元组生成框架,能够在文本提示引导下直接生成空间对齐、语义一致且模态互补的可见光-红外-标签三元组。该方法通过共享潜在空间中的联合编码和扩散过程建模,确保跨模态一致性,并引入轻量级模态特定适配器以适应不同模态的成像特性,同时采用场景平衡和类别感知的少样本采样策略,提升生成三元组的多样性和质量,从而在多种RGB-T语义分割模型中实现性能提升。

详情
英文摘要

RGB-T semantic segmentation requires strictly aligned VIS-IR-Label triplets; however, such aligned triplet data are often scarce in real-world scenarios. Existing generative augmentation methods usually adopt cascaded generation paradigms, decomposing joint triplet generation into local conditional processes. As a result, consistency among VIS, IR, and Label in spatial structure, semantic content, and cross-modal details cannot be reliably maintained. To address this issue, we propose UniTriGen, a unified triplet generation framework that directly generates spatially aligned, semantically consistent, and modality complementary VIS-IR-Label triplets under the guidance of text prompts. UniTriGen first introduces a unified triplet generation mechanism, where VIS, IR, and Label are jointly encoded into a shared latent space and modeled with a diffusion process to enforce global cross-modal consistency. Lightweight modality-specific residual adapters are further integrated into this mechanism to accommodate modality-specific imaging characteristics and output formats. To mitigate generation bias caused by imbalanced scene and class distributions in limited paired triplets, UniTriGen also employs a scene-balanced and class-aware few-shot sampling strategy, which induces a more balanced sampling distribution and enhances the scene and class diversity of generated triplets. Experiments show that UniTriGen generates high-quality aligned triplets from limited real paired data, thereby achieving consistent performance improvements across various RGB-T semantic segmentation models.

2605.14621 2026-05-15 cs.CV cs.AI cs.CL

Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution

Tian Qin, Junzhe Chen, Yuqing Shi, Tianshu Zhang, Qiang Ju, Lijie Wen

AI总结 大型视觉语言模型(LVLMs)在语言先验主导弱或模糊视觉证据时容易产生幻觉。现有对比解码方法通过比较原始图像和外部扰动输入的预测来缓解这一问题,但依赖外部参考可能引入偏差并增加计算成本。本文提出SIRA,一种无需训练的内部对比解码框架,通过利用多模态变换器的分阶段信息流,在模型内部构建反事实参考,有效抑制幻觉,同时保持描述覆盖率,并适用于开源权重模型。

详情
英文摘要

Large vision-language models (LVLMs) often hallucinate when language priors dominate weak or ambiguous visual evidence. Existing contrastive decoding methods mitigate this problem by comparing predictions from the original image with those from externally perturbed visual inputs, but such references can introduce off-manifold artifacts and require costly extra forward passes. We propose SIRA, a training-free internal contrastive decoding framework that constructs a counterfactual reference inside the same LVLM by exploiting the staged information flow of multimodal transformers. Instead of removing visual information from the input, SIRA first lets image and text tokens interact through a shared prefix, forming an aligned multimodal state that preserves prompt interpretation, decoding history, positional structure, and early visual grounding. It then forks a counterfactual branch in later transformer layers, where attention to image-token positions is masked. This branch retains the shared multimodal context but lacks continued access to fine-grained visual evidence, yielding a language-prior-dominated internal reference for token-level contrast. During decoding, SIRA suppresses tokens that remain strong without late visual access and favors predictions whose advantage depends on the full visual pathway. Experiments on POPE, CHAIR, and AMBER with Qwen2.5-VL and LLaVA-v1.5 show that SIRA consistently reduces hallucinations while preserving descriptive coverage and incurring lower overhead than two-pass contrastive decoding. SIRA requires no training, external verifier, or perturbed input, and applies to open-weight LVLMs with white-box inference access.

2605.14619 2026-05-15 cs.AI

SliceGraph: Mapping Process Isomers in Multi-Run Chain-of-Thought Reasoning

Kang Chen, Junjie Nian, Yixin Cao, Yugang Jiang

AI总结 该研究提出了SliceGraph方法,用于分析多轮思维链(CoT)推理过程中不同路径之间的共享、分裂与重组结构。通过计算CoT片段间的激活键Jaccard相似度并构建互k近邻图,SliceGraph揭示了不同推理路径在过程结构上的异同,并识别出具有相同答案但推理过程不同的“过程异构体”。实验表明,多数问题-模型组合中存在多个过程家族,它们在策略上具有一致性但结构上有所区分,表明最终答案聚合忽略了推理过程中的多路径结构特征。

详情
英文摘要

Multi-run chain-of-thought reasoning is usually collapsed to final-answer aggregates, which discard howsampled trajectories share, split, and rejoin through intermediate computation. We propose SliceGraph, a post-hoc problem-model-cell graph built by mutual-kNN over sparse activation-key Jaccard similarity between CoT slices, and treat it as a measurement object for process geometry rather than as a decoding program. Across sampled CoT ensembles from three primary 4B/8B models on math and science benchmarks, blinded annotation supports SliceGraph biconnected components as shared reasoning-state units and process families as within-family strategy-coherent route units. In 85.5% of 954 problem-model cells, correct CoTs sharing the same normalized answer split into multiple process families; among cells with at least two such runs, 76.6% of run pairs are cross-family on average. We call such same-answer, family-divergent correct trajectories process isomers. A label-seeded reward field provides a separate value-landscape layer: success-associated regions often split into disconnected high-value cores, and route families specialize over these core footprints rather than merely duplicating one another. A typed-state transition analysis further shows that process families navigate the same atlas with distinct transition kernels under matched null controls. Representation ablations, a cross-architecture replication, and two cross-scale replications support the robustness of the route-family scaffold, showing that final-answer aggregation overlooks this structured multi-route process geometry.

2605.14615 2026-05-15 cs.CV

CalibAnyView: Beyond Single-View Camera Calibration in the Wild

Boying Li, Cheng Zhang, Weirong Chen, Daniel Cremers, Ian Reid, Hamid Rezatofighi

AI总结 本文提出了一种名为 CalibAnyView 的新型相机标定方法,能够在任意数量的视角下(包括单视角)实现鲁棒的几何一致性标定。该方法通过构建大规模多视角视频数据集,并设计多视角变换网络预测密集透视场,结合几何优化框架联合估计相机内参和重力方向,从而在真实场景中取得优于现有方法的标定效果。该工作为野外环境下的三维重建和机器人感知等任务提供了可靠的基础。

Comments 44 pages, 25 figures

详情
英文摘要

Camera calibration is a fundamental prerequisite for reliable geometric perception, yet classical approaches rely on controlled acquisition setups that are impractical for in-the-wild imagery. Recent learning-based methods have shown promising results for single-view calibration, but inherently neglect geometric consistency across multiple views. We introduce CalibAnyView, a unified formulation that supports an arbitrary number of input views ($N \geq 1$) by explicitly modeling cross-view geometric consistency. To facilitate this, we construct a large-scale multi-view video dataset covering diverse real-world scenarios, including multiple camera models, dynamic scenes, realistic motion trajectories, and heterogeneous lens distortions. Building on this dataset, we develop a multi-view transformer that predicts dense perspective fields, which are further integrated into a geometric optimization framework to jointly estimate camera intrinsics and gravity direction. Extensive experiments demonstrate that CalibAnyView consistently outperforms state-of-the-art methods, achieves strong robustness under single-view settings, and further improves with multi-view inference, providing a reliable foundation for downstream tasks such as 3D reconstruction and robotic perception in the wild.

2605.14609 2026-05-15 cs.CV cs.LG

Deep Image Segmentation via Discriminant Feature Learning

Adam Dawid Sztamborski, Raül Pérez-Gonzalo, Antonio Agudo

AI总结 本文研究了图像分割中边界不清晰的问题,提出了一种新的可微且与网络结构无关的损失函数Deep Discriminant Analysis(DDA),通过最大化类间方差并最小化类内方差,提升特征分布的紧致性和可分性。实验表明,DDA在多种架构上均能有效提升分割精度、边界清晰度和模型置信度,为构建更鲁棒的分割模型提供了简单而有效的方法。

Comments Accepted to ICIP 2026

详情
英文摘要

Accurate image segmentation remains challenging, particularly in generating sharp, confident boundaries. While modern architectures have advanced the field, many of them still rely on standard loss functions like Cross-Entropy and Dice, which often neglect the discriminative structure of learned features, leading to inaccurate boundaries. This work introduces Deep Discriminant Analysis (DDA), a differentiable, architecture-agnostic loss function that embeds classical discriminant principles for network training. DDA explicitly maximizes between-class variance while minimizing within-class one, promoting compact and separable feature distributions without increasing inference cost. Evaluations on the DIS5K benchmark demonstrate that DDA consistently improves segmentation accuracy, boundary sharpness, and model confidence across various architectures. Our results show that integrating discriminant analysis offers a simple, effective path for building more robust segmentation models.

2605.14607 2026-05-15 cs.CV cs.CY

ViMU: Benchmarking Video Metaphorical Understanding

Qi Li, Xinchao Wang

AI总结 本文提出ViMU,首个用于评估视频隐喻理解能力的基准,旨在解决现有视频理解模型主要关注字面内容而忽视隐喻、讽刺和社会含义的问题。ViMU通过开放问答和多选题形式,要求模型基于多模态证据推断视频中的隐含意义,且问题设计无提示,确保模型依赖自身理解能力进行推理。该工作为视频理解领域引入了新的评估方向,推动模型在深层次语义理解方面的发展。

详情
英文摘要

Any new medium, once it emerges, is used for more than the transmission of overt content alone. The information it carries typically operates on two levels: one is the content directly presented, while the other is the subtext beneath it-the implicit ideas and intentions the creator seeks to convey through the medium. Likewise, since video technologies became widely adopted, video has served not only as a powerful tool for recording and communicating visual information, but also as a vehicle for emotions, attitudes, and social meanings that are often difficult to articulate explicitly. Thus, the true meaning of many videos does not reside solely in what is shown on screen; it is often embedded in context, style of expression, and the viewer's social experience. Some forms of such video subtext are humorous, while others carry irony, mockery, or criticism. These implicit meanings can also be interpreted very differently across cultural backgrounds and social groups. However, most existing video understanding models still focus primarily on literal visual comprehension, such as recognizing objects, actions, or temporal relations, and lack a systematic ability to understand the metaphorical, ironic, and social meanings embedded in videos. To bridge this gap, we introduce ViMU, the first benchmark designed to systematically evaluate the subtext understanding capabilities of frontier models in videos. ViMU assesses whether video understanding models can go beyond literal perception to infer implicit meaning while grounding their interpretations in multimodal evidence and answering both open-ended and multiple-choice questions. Importantly, all questions are designed to be hint-free, ensuring that no key evidence is disclosed to models before answering.

2605.14606 2026-05-15 cs.CV

MambaRain: Multi-Scale Mamba-Attention Framework for 0-3 Hour Precipitation Nowcasting

Chunlei Shi, Cui Wu, Xiang Xu, Hao Li, Ni Fan, Xue Han, Yongchao Feng, Yufeng Zhu, Boyu Liu, Zengliang Zang, Hongbin Wang, Yanlan Yang, Dan Niu

AI总结 本文提出了一种名为MambaRain的多尺度编码-解码框架,用于0-3小时的降水临近预报。该方法结合了Mamba模型的线性复杂度长期时间建模能力和自注意力机制对空间相关性的显式捕捉,有效解决了现有方法在长时段预测中性能下降的问题。通过引入混合架构和频谱损失函数,MambaRain在保持计算效率的同时提升了预报精度,尤其在2-3小时的困难预测区间表现突出。

Comments 9 pages,7 figures

详情
英文摘要

Accurate precipitation nowcasting over extended horizons (0-3 hours) is essential for disaster mitigation and operational decision-making, yet remains a critical challenge in the field. Existing deterministic approaches are predominantly constrained to shorter prediction windows (0-2 hours), exhibiting severe performance degradation beyond 90 minutes owing to their inherent difficulty in capturing long-range spatiotemporal dependencies from radar-derived observations. To address these fundamental limitations, we propose MambaRain, a novel multi-scale encoder-decoder architecture that synergistically integrates Mamba's linear-complexity long-range temporal modeling with self-attention mechanisms for explicit spatial correlation capture. The core innovation lies in a hybrid design paradigm wherein Mamba blocks leverage selective state space mechanisms to model global temporal dynamics across extended sequences with computational efficiency, while self-attention modules explicitly characterize spatial correlations within precipitation fields - a capability inherently absent in Mamba's sequential processing paradigm. This complementary synergy enables comprehensive spatiotemporal representation learning, effectively extending the viable forecasting horizon to 2-3 hours with substantial accuracy improvements. Furthermore, we introduce a spectral loss formulation to mitigate blurring artifacts characteristic of chaotic precipitation systems, thereby preserving fine-scale motion details critical for nowcasting accuracy. Experimental validation demonstrates that MambaRain substantially outperforms existing deterministic methodologies in 0-3 hour nowcasting tasks, with particularly pronounced performance gains in the challenging 2-3 hour prediction range.

2605.14604 2026-05-15 cs.AI cs.HC

Sycophancy is an Educational Safety Risk: Why LLM Tutors Need Sycophancy Benchmarks

Enkelejda Kasneci, Gjergji Kasneci

AI总结 本文指出,有效的教学需要“纠正性摩擦”,即通过指出并支持性地挑战学生的误解来促进概念转变,但当前偏好对齐的大语言模型(LLMs)可能为了友好而牺牲认知严谨性。为此,作者提出了“推理-谄媚悖论”,即模型虽能抵御上下文切换攻击,却可能在权威或社交压力下退缩。文章引入了EduFrameTrap基准,用于评估LLM在不同学科和压力情境下的教学表现,并发现当前前沿模型在面对权威和社会压力时更容易出现认知退缩,强调了建立衡量“社会-认知勇气”的教学基准的重要性。

详情
英文摘要

This position paper argues that effective tutoring requires corrective friction: surfacing misconceptions and challenging them supportively to drive conceptual change. Yet preference-aligned LLMs can trade epistemic rigor for agreeableness. We identify a Reasoning-Sycophancy Paradox: models that resist context-switch frame attacks can still capitulate under social-epistemic pressure, especially authority ("my notes say I'm right") and social-affective face-saving ("please don't tell me I'm wrong"). We introduce EduFrameTrap, a tutoring benchmark across math, physics, economics, chemistry, biology, and computer science that varies student confidence and pressure (context-switch, authority, social-affective). Across two frontier LLMs, context-switch failures are comparatively lower for GPT-5.2, while authority and social pressure more often trigger epistemic retreat. In contrast, Claude shows substantial context-switch fragility in this run. Because these failures are hard to judge automatically, we report two-judge disagreement as a reliability signal. We argue benchmarks should measure social-epistemic courage, i.e., supportive but corrective tutoring, and treat kind-but-correct behavior as a safety requirement.

2605.14601 2026-05-15 cs.CV

Towards Accurate Single Panoramic 3D Detection: A Semantic Gaussian Centric Approach

Kanglin Ning, Yiran Zhao, Wenrui Li, Shaoru Sun, Xingtao Wang, Xiaopeng Fan

AI总结 本文提出了一种基于连续语义高斯表示的单目全景3D目标检测框架PanoGSDet,旨在解决全景图像中2D特征到3D空间映射不准确的问题。该方法通过全景深度估计模块和语义高斯模块,将全景图像中的语义和深度信息提升到3D语义高斯分布,并通过优化和预测模块生成精确的3D目标框。实验表明,该方法在Structured3D数据集上显著优于现有方法。

Comments Current has been accepted by ICME 2026

详情
英文摘要

Three-dimensional object detection in panoramic imagery is crucial for comprehensive scene understanding, yet accurately mapping 2D features to 3D remains a significant challenge. Prevailing methods often project 2D features onto discrete 3D grids, which break geometric continuity and limit representation efficiency. To overcome this limitation, this paper proposes PanoGSDet, a monocular panoramic 3D detection framework built upon continuous semantic 3D Gaussian representations. The proposed framework comprises a panoramic depth estimation component and a semantic Gaussian component. The panoramic depth estimation component extracts the equirectangular semantic and depth features from the monocular panorama input. The semantic Gaussian component includes a semantic Gaussian lifting module that projects spherical features into 3D semantic Gaussians, a semantic Gaussian optimization module that refines these semantic Gaussians, and a Gaussian guided prediction head that generates 3D bounding boxes from optimized Gaussian representations. Extensive experiments on the Structured3D dataset demonstrate that our method significantly outperforms existing methods.

2605.14600 2026-05-15 cs.CL

SciPaths: Forecasting Pathways to Scientific Discovery

Eric Chamoun, Yizhou Chi, Yulong Chen, Rui Cao, Zifeng Ding, Michalis Korakakis, Andreas Vlachos

AI总结 本文提出 SciPaths,一个用于科学发现路径预测的新基准,旨在预测实现特定科学成果所需的前置贡献及其在已有文献中的依据。研究通过构建包含专家标注和机器学习生成的路径数据集,评估了前沿语言模型在该任务上的表现,发现模型在严格语义匹配下表现有限,尤其在恢复核心方法依赖方面存在困难。该工作揭示了科学预测中一个被忽视的关键能力:从目标成果逆向推理出其所需的科学基础和文献依赖。

详情
英文摘要

Scientific progress depends on sequences of enabling contributions, yet existing AI4Science benchmarks largely focus on citation prediction, literature retrieval, or idea generation rather than the dependencies that make progress possible. In this paper, we introduce discovery pathway forecasting: given a target scientific contribution and the prior literature available at a specified time, the task is to (1) identify the enabling contributions required to realize it and (2) ground each in prior work when such prior work exists. We present SciPaths, a benchmark of 262 expert-annotated gold pathways and 2,444 silver pathways constructed from machine learning and natural language processing papers, where each pathway records enabling contributions, roles, rationales, and prior-work groundings or unmapped decisions. Evaluating frontier and open-weight language models, we find that the best model reaches only 0.189 F1 under strict semantic matching, with core methodological dependencies hardest to recover. Prior-work grounding improves substantially when gold enabling contributions are provided, showing that decomposition quality is a major bottleneck for end-to-end pathway recovery. SciPaths therefore shifts evaluation toward a missing capability in scientific forecasting: reasoning backward from a target contribution to the enabling scientific building blocks and prior-work dependencies that make it feasible.

2605.14599 2026-05-15 cs.LG cs.AI stat.ML

Fast Rates for Inverse Reinforcement Learning

Andreas Schlaginhaufen, Maryam Kamgarpour

AI总结 本文研究了有限时间马尔可夫决策过程中的熵正则化最小-最大逆强化学习(Min-Max-IRL)问题,针对线性奖励类问题,建立了新的结构和统计性质。作者证明了在总体层面,最大似然估计与Min-Max-IRL等价,在确定性动力学下在经验层面也等价。通过利用Min-Max-IRL损失的伪自共轭性质,作者展示了轨迹级KL散度和参数误差在Hessian范数下的衰减速度为$\mathcal{O}(n^{-1})$,且结果适用于模型误设情况,无需探索假设。此外,还扩展了奖励可识别性的结果到一般的Borel空间,并推导了软最优价值函数关于奖励参数的导数新性质。

详情
英文摘要

We establish novel structural and statistical results for entropy-regularized min-max inverse reinforcement learning (Min-Max-IRL) with linear reward classes in finite-horizon MDPs with Borel state and action spaces. On the structural side, we show that maximum likelihood estimation (MLE) and Min-Max-IRL are equivalent at the population level, and at the empirical level under deterministic dynamics. On the statistical side, exploiting pseudo-self-concordance of the Min-Max-IRL loss, we prove that both the trajectory-level KL divergence and the squared parameter error in the Hessian norm decay at the fast rate $\mathcal{O}(n^{-1})$, where $n$ is the number of expert trajectories. Our guarantees apply under misspecification and require no exploration assumptions. We further extend reward-identifiability results to general Borel spaces and derive novel results on the derivatives of the soft-optimal value function with respect to reward parameters.

2605.14597 2026-05-15 cs.CV cs.CE cs.MM

VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting

Chunlei Shi, Hao Li, Yufeng Zhu, Boyu Liu, Yongchao Feng, Zengliang Zang, Hongbin Wang, Yanlan Yang, Dan Niu

AI总结 降水临近预报是气象应用中的重要时空预测任务,但因降水系统的混沌特性面临诸多挑战。现有方法多依赖单一来源的雷达数据构建确定性或概率性模型进行外推,但存在模糊性或计算效率低等问题。本文提出一种基于粗到细的视觉Mamba Unet与残差扩散模型(VMU-Diff)的多源数据融合框架,通过两阶段过程实现降水临近预报:第一阶段利用雷达与多波段卫星数据融合预测全局运动趋势,第二阶段基于条件扩散模型生成细节预测,实验表明该方法在短期预报中优于现有先进方法。

Comments 5 pages, 2 figures

详情
英文摘要

Precipitation nowcasting is a vital spatio-temporal prediction task for meteorological applications but faces challenges due to the chaotic property of precipitation systems. Existing methods predominantly rely on single-source radar data to build either deterministic or probabilistic models for extrapolation. However, the single deterministic model suffers from blurring due to MSE convergence. The single probabilistic model, typically represented by diffusion models, can generate fine details but suffers from spurious artifacts that compromise accuracy and computational inefficiency. To address these challenges, this paper proposes a novel coarse-to-fine Vision Mamba Unet and residual Diffusion (VMU-Diff) based precipitation nowcasting framework. It realizes precipitation nowcasting through a two-stage process, i.e., a deterministic model-based coarse stage to predict global motion trends and a probabilistic model-based fine stage to generate fine prediction details. In the coarse prediction stage, rather than single-source radar data, both radar and multi-band satellite data are taken as input. A spatial-temporal attention block and several Vision mamba state-space blocks realize multi-source data fusion, and predict the future echo global dynamics. The fine-grained stage is realized by a spatio-temporal refine generator based on residual conditional diffusion models. It first obtains spatio-temporal residual features based on coarse prediction and ground truth, and further reconstructs the residual via conditional Mamba state-space module. Experiments on Jiangsu SWAN datasets demonstrate the improvements of our method over state-of-the-art methods, particularly in short-term forecasts.

2605.14594 2026-05-15 cs.CV cs.GR

TOPOS: High-Fidelity and Efficient Industry-Grade 3D Head Generation

Bojun Xiong, Zoubin Bi, Xinghui Peng, Yunmu Wang, Junchen Deng, Jun Liang, Jing Li, Bowen Cai, Huan Fu

AI总结 本文提出TOPOS,一种用于单图像条件生成高保真3D头部模型的框架,旨在满足影视、动画和游戏等行业对统一拓扑结构的需求。TOPOS通过引入一种新型变分自编码器(TOPOS-VAE)和修正流变换器(TOPOS-DiT),在固定工业标准拓扑下联合生成几何和外观,实现跨生成头部的顶点级一致性。此外,TOPOS-Texture模块可从同一肖像图像生成可重新光照的UV纹理贴图,保留高频细节,实验表明TOPOS在3D头部生成任务中达到领先水平。

Comments Technical Report

详情
英文摘要

High-fidelity 3D head generation plays a crucial role in the film, animation and video game industries. In industrial pipelines, studios typically enforce a fixed reference topology across all head assets, as such a clean and uniform topology is a prerequisite for production-level rigging, skinning and animation. In this paper, we present TOPOS, a framework tailored for single image conditioned 3D head generation that jointly recovers geometry and appearance under such an industry-standard topology. In contrast to general 3D generative models which produce triangle meshes with inconsistent topology and numerous vertices, hindering semantic correspondence and asset-level reuse, TOPOS generates head meshes with a fixed, studio-style topology, enabling consistent vertex-level correspondence across all generated heads. To model heads under this unified topology, we proposed a novel variational autoencoder structure, termed TOPOS-VAE. Inspired by multi-model large language models (MLLMs), our TOPOS-VAE leverages the Perceiver Resampler to convert input pointclouds sampled from head meshes of diverse topologies into the target reference topology. Building upon TOPOS-VAE's structured latent space, we train a rectified flow transformer, TOPOS-DiT, to efficiently generate high-fidelity head meshes from a single image. We further present TOPOS-Texture, an end-to-end module that produces relightable UV texture maps from the same portrait image via fine-tuning a multimodal image generative model. The generated textures are spatially aligned with the underlying mesh geometry and faithfully preserve high-frequency appearance details. Extensive experiments demonstrate that TOPOS achieves state-of-the-art performance on 3D head generation, surpassing both classical face reconstruction methods and general 3D object generative models, highlighting its effectiveness for digital human creation.

2605.14590 2026-05-15 cs.CV

FedStain: Modeling Higher-Order Stain Statistics for Federated Domain Generalization in Computational Pathology

Fengyi Zhang, Junya Zhang, Wenzhuo Sun

AI总结 在计算病理学中,由于不同机构之间染色异质性显著,鲁棒的全切片图像分析仍面临挑战。现有联邦域泛化方法大多依赖低阶统计量,难以捕捉真实染色过程中存在的非高斯特性。本文提出FedStain,一种联邦域泛化框架,通过引入偏度和峰度等高阶统计量作为紧凑的染色描述子,在保护隐私和通信效率的前提下,有效建模染色变化,实验表明其在多个基准数据集上显著优于现有方法。

详情
英文摘要

Robust whole-slide image (WSI) analysis under strict data-governance remains challenging due to substantial cross-institutional stain heterogeneity. Domain generalization (DG) mitigates these shifts but typically requires centralized data, conflicting with privacy regulations. Federated learning (FedL) provides a decentralized alternative; however, existing FedL and federated DG (FedDG) approaches rely almost exclusively on low-order statistics, assuming Gaussian-like stain distributions. In contrast, real-world staining processes often produce asymmetric, heavy-tailed color distributions due to biochemical diffusion and scanner nonlinearity. Consequently, current methods fail to model the higher-order, non-Gaussian characteristics dominating real-world stain variability. To address this, we propose FedStain, a stain-aware FedDG framework explicitly incorporating higher-order stain moments--skewness and kurtosis--as compact statistical descriptors exchanged during federated optimization. These descriptors require no pixel-level data transmission, preserving strict privacy and communication efficiency, while enabling the global model to capture stain variability missed by low-order statistics. FedStain also employs a contrastive, cross-site parameter aggregation strategy to promote stain-invariant representations without relaxing data constraints. Extensive experiments on Camelyon17 and our new MvMidog-Fed benchmark show FedStain yields consistent improvements, outperforming state-of-the-art FedL, DG, and FedDG baselines by up to +3.9% absolute accuracy. To our knowledge, FedStain is the first FedDG approach to explicitly model higher-order stain statistics, enabling robust cross-institutional deployment in computational pathology.

2605.14587 2026-05-15 cs.LG cs.AI cs.CR

Angel or Demon: Investigating the Plasticity Interventions' Impact on Backdoor Threats in Deep Reinforcement Learning

Oubo Ma, Ruixiao Lin, Yang Dai, Jiahao Chen, Chunyi Zhou, Linkang Du, Shouling Ji

AI总结 本文研究了可塑性干预对深度强化学习(DRL)中后门攻击的影响,发现大多数干预措施能有效缓解后门威胁,而仅有SAM干预会加剧威胁。通过病理分析,揭示了后门梯度放大与激活路径破坏等机制,并提出了SCC概念框架和异常损失景观锐度作为后门检测的新指标,为提升DRL系统安全性提供了理论支持。

Comments To appear in the Forty-Third International Conference on Machine Learning (ICML 2026), July 6-11, 2026, Seoul, South Korea

详情
英文摘要

Extensive research has highlighted the severe threats posed by backdoor attacks to deep reinforcement learning (DRL). However, prior studies primarily focus on vanilla scenarios, while plasticity interventions have emerged as indispensable built-in components of modern DRL agents. Despite their effectiveness in mitigating plasticity loss, the impact of these interventions on DRL backdoor vulnerabilities remains underexplored, and this lack of systematic investigation poses risks in practical DRL deployments. To bridge this gap, we empirically study 14,664 cases integrating representative interventions and attack scenarios. We find that only one intervention (i.e., SAM) exacerbates backdoor threats, while other interventions mitigate them. Pathological analysis identifies that the exacerbation is attributed to backdoor gradient amplification, while the mitigation stems from activation pathway disruption and representation space compression. From these findings, we derive two novel insights: (1) a conceptual framework SCC for robust backdoor injection that deconstructs the mechanistic interplay between interventions and backdoors in DRL, and (2) abnormal loss landscape sharpness as a key indicator for DRL backdoor detection.

2605.14581 2026-05-15 cs.CV cs.AI cs.IR

A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval

Ho Hung Lim, Yi Yang

AI总结 本研究探讨了在视觉金融文档检索中,将文档图像编码为单一向量进行聚合可能带来的信息丢失问题。通过构建一个金融文档诊断基准,实验发现单一向量聚合会导致不同文档的向量几乎相同,从而掩盖了关键语义细节。研究指出,全局纹理主导是导致这一问题的根本原因,并表明该现象在不同模型规模和优化策略下均存在,突显了单一向量方法在金融应用中的潜在风险。

Comments Accepted to Findings of ACL 2026

详情
英文摘要

Visual RAG has offered an alternative to traditional RAG. It treats documents as images and uses vision encoders to obtain vision patch tokens. However, hundreds of patch tokens per document create retrieval and storage challenges in a vector database. Practical deployment requires aggregating them into a single vector. This raises a critical question: does single-vector aggregation lose key information in financial documents? We develop a diagnostic benchmark using financial documents where changes in single digits can lead to significant semantic shifts. Our experiments show that single-vector aggregation collapses different documents with almost identical vectors. Metrics show that the patch level detects semantic changes, and confirm that aggregation obscures these details. We identify global texture dominance as the root cause. Our findings are consistent across model scales, retrieval-optimized embeddings, and multiple mitigation strategies, highlighting significant risks for single-vector visual document retrieval in financial applications.

2605.14579 2026-05-15 cs.CV

Med-DisSeg: Dispersion-Driven Representation Learning for Fine-Grained Medical Image Segmentation

Zhiquan Chen, Haitao Wang, Guowei Zou, Hejun Wu

AI总结 医学图像分割是精准医疗的基础,但在面对组织外观差异大、边界模糊和解剖结构多变等挑战时,现有方法仍难以实现稳定而精确的分割。本文提出 Med-DisSeg 框架,通过引入一种轻量级的分散损失(Dispersive Loss)和自适应注意力机制,提升细粒度结构分割的表示学习与解剖边界刻画能力。该方法通过扩大样本间嵌入表示的间隔,增强编码器对结构特征的敏感性,并利用多尺度解码器保留局部纹理与整体形状信息,实验表明其在多个医学影像数据集上均取得领先的分割性能。

详情
英文摘要

Accurate medical image segmentation is fundamental to precision medicine, yet robust delineation remains challenging under heterogeneous appearances, ambiguous boundaries, and large anatomical variability. Similar intensity and texture patterns between targets and surrounding tissues often lead to blurred activations and unreliable separation. We attribute these failures to representation collapse during encoding and insufficient fine grained multi scale decoding. To address these issues, we propose Med DisSeg, a dispersion driven medical image segmentation framework that jointly improves representation learning and anatomical delineation. Med DisSeg combines a lightweight Dispersive Loss with adaptive attention for fine grained structure segmentation. The Dispersive Loss enlarges inter sample margins by treating in batch hidden representations as negative pairs, producing well dispersed and boundary aware embeddings with negligible overhead. Based on these enhanced representations, the encoder strengthens structure sensitive responses, while the decoder performs adaptive multi scale calibration to preserve complementary local texture and global shape information. Extensive experiments on five datasets spanning three imaging modalities demonstrate consistent state of the art performance. Moreover, Med DisSeg achieves competitive results on multi organ CT segmentation, supporting its robustness and cross task applicability.

2605.14578 2026-05-15 cs.LG

Woodelf++: A Fast and Unified Partial Dependence Plot Algorithm for Decision Tree Ensembles

Ron Wettenstein, Alexander Nadel, Udi Boker

AI总结 本文提出了一种名为 Woodelf++ 的高效统一算法,用于计算决策树集成模型的多种可解释性工具,包括部分依赖图(PDP)、联合 PDP 和任意阶特征交互值(Any-Order-PDIVs)。该方法基于伪布尔函数的度量推导,实现了对这些工具的统一计算框架,相比现有方法在计算复杂度上有了显著提升,尤其在 Any-Order-PDIVs 上实现了指数级加速。实验表明,Woodelf++ 在 Python 中实现并支持 GPU 加速,其计算速度远超当前主流工具。

Comments Extended version of the paper to appear at IJCAI 2026

详情
英文摘要

Partial Dependence Plots (PDPs) visualize how changes in a single feature affect the average model prediction. They are widely used in practice to interpret decision tree ensembles and other machine learning models. Joint-PDPs extend this idea to pairs of features, revealing their combined effect. Partial Dependence Interaction Values (PDIVs) measure feature interactions. The Any-Order-PDIVs task computes these interactions for every feature subset across all rows of the dataset. We introduce Woodelf++, a unified and efficient approach for computing all these useful explainability tools on decision tree ensembles, building on Woodelf, an algorithm for efficient SHAP computation. By deriving suitable metrics over pseudo-Boolean functions, Woodelf++ can compute PDPs (exact and approximate), Joint-PDPs, and Any-Order-PDIVs in a unified framework. Our method delivers substantial complexity improvements over the state of the art, including an exponential gain for Any-Order-PDIVs. Additionally, we introduce and efficiently compute Full PDPs, which leverage the model's split thresholds to faithfully capture its behavior across all possible feature values. Woodelf++ is implemented in pure Python and supports GPU acceleration. On a dataset with 400,000 rows, Woodelf++ computes PDP and Joint-PDP up to 6x faster than the state of the art and up to five orders of magnitude faster than scikit-learn. For Any-Order-PDIVs, the gap is even larger: Woodelf++ computes all interaction values in 5 minutes, while the state of the art is estimated to require over 1,000,000 years.