arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.08238 2026-05-12 cs.CV cs.AI cs.ET cs.LG

Resource-Aware Evolutionary Neural Architecture Search for Cardiac MRI Segmentation

Farhana Yasmin, Mahade Hasan, Haipeng Liu, Amjad Ali, Ghulam Muhammad, Yu Xue

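AI summary This study proposes CardiacNAS, a resource-aware evolutionary neural architecture search method for cardiac magnetic resonance (CMR) segmentation. It combines a UNet-like supernet with a search space designed for cardiac segmentation, and uses an evolutionary algorithm to jointly optimize segmentation accuracy and model efficiency under a fixed compute budget. Experiments show that the method achieves high segmentation accuracy at low computational cost on the ACDC dataset, demonstrating a good balance between accuracy and efficiency.

Journal ref
F. Yasmin et al., "Resource-Aware Evolutionary Neural Architecture Search for Cardiac MRI Segmentation," 28th International Conference on Computer and Information Technology (ICCIT), 2025, pp. 2819-2824
Abstract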

Cardiac magnetic resonance (CMR) segmentation underpins quantitative assessment of ventricular structure and function, yet reliable delineation remains difficult due to low tissue contrast, fuzzy boundaries, and inter-scan variability. We present CardiacNAS, an evolutionary neural architecture search (NAS) framework that couples a UNet-like supernet with a cardiac-aware search space spanning depth, width, kernel size, filter size, attention, fusion, activation, dropout, and residual scaling. The search is explicitly resource-aware, jointly optimizing Dice similarity coefficient (DSC) and 95th-percentile Hausdorff distance (HD95) versus model size and floating-point operations (FLOPs) under fixed compute budgets. Candidate architectures are instantiated from the supernet, trained with proxy budgets, and evolved through crossover, mutation, and elitist selection. We evaluate on the ACDC dataset and compare against six state-of-the-art methods, using qualitative comparisons, learning-curve analyses, and design-factor correlation studies. The resulting model attains 93.22% average DSC and 4.73 mm HD95 with 3.58M parameters and 14.56 GFLOPs, demonstrating a favorable accuracy-efficiency trade-off. Analyses indicate that searched attention and fusion choices, together with residual scaling, contribute to improved boundary fidelity and stability. CardiacNAS offers a principled, resource-aware approach to deployable CMR segmentation with transparent reporting of architectural complexity and compute budgets.
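
The crossover, mutation, and elitist selection loop described in the abstract can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration: the four-field `SEARCH_SPACE`, the `cost` stand-in for parameters/FLOPs, and the made-up `fitness` formula (the paper instead trains candidates instantiated from a supernet under proxy budgets and scores DSC and HD95).

```python
import random

# Hypothetical toy search space; the paper's space also covers fusion,
# activation, dropout, and residual scaling.
SEARCH_SPACE = {
    "depth": [3, 4, 5],
    "width": [16, 32, 64],
    "kernel_size": [3, 5, 7],
    "attention": [0, 1],
}

def random_genome(rng):
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}

def cost(genome):
    # Stand-in for model size / FLOPs (not a real FLOP count).
    return genome["depth"] * genome["width"] * genome["kernel_size"] ** 2

def fitness(genome, budget=4000):
    # Made-up accuracy proxy minus a penalty for exceeding the compute budget.
    accuracy_proxy = genome["depth"] + genome["width"] / 16 + 2 * genome["attention"]
    penalty = max(0, cost(genome) - budget) / budget
    return accuracy_proxy - 10 * penalty

def crossover(a, b, rng):
    # Uniform crossover: each gene comes from one of the two parents.
    return {name: rng.choice([a[name], b[name]]) for name in SEARCH_SPACE}

def mutate(genome, rng, rate=0.2):
    # Resample each gene with probability `rate`.
    return {
        name: (rng.choice(SEARCH_SPACE[name]) if rng.random() < rate else value)
        for name, value in genome.items()
    }

def evolve(generations=10, pop_size=12, n_elite=2, seed=0):
    rng = random.Random(seed)
    population = [random_genome(rng) for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        best_history.append(fitness(population[0]))
        elites = population[:n_elite]            # elitist selection
        parents = population[: pop_size // 2]    # top half may reproduce
        children = []
        while len(children) < pop_size - n_elite:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = elites + children
    population.sort(key=fitness, reverse=True)
    best_history.append(fitness(population[0]))
    return population[0], best_history
```

Because the elites are copied unchanged into each new generation, the best fitness recorded in `best_history` can never decrease, which is the property elitism is meant to provide.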

2605.08237 2026-05-12 cs.LG stat.ML

Distributional Spectral Diagnostics for Localizing Grokking Transitions

Ziyue Wang, Yufeng Ying, Takafumi Kanamori

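AI summary This study examines how machine learning models move from memorizing training data to generalizing during "grokking", and proposes a distributional spectral analysis method for localizing that transition. Task-dependent observables are mapped to Wasserstein/quantile coordinates and analyzed with Hankel dynamic mode decomposition, yielding diagnostic quantities such as the residual, the spectrum, and the effective rank. Experiments show that, on modular-addition Transformer models, the method separates grokking from non-grokking runs and can raise early warnings under a fixed threshold, with high detection performance and practical utility.

Abstract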

In grokking, a model first fits the training data while test accuracy remains low, and only later begins to generalize. We ask whether this transition can be localized from observed training trajectories before the test accuracy rises, and formulate grokking transition localization as a diagnostic problem with an explicit threshold/FPR/lead-time trade-off. Task-dependent observables are summarized as empirical distributions, mapped to Wasserstein/quantile coordinates, and analyzed by Hankel dynamic mode decomposition (DMD); the resulting reconstruction residual, together with spectrum and effective rank, forms the diagnostic output. On held-out modular-addition Transformer runs, the residual achieves AUROC \(\approx \) 0.93 for grokking-vs-non-grokking discrimination at the run level; under a fixed sustained-threshold operating rule, true-positive alarms can precede onset, with lead time reported jointly with false-alarm rate and uncertainty intervals. Perturbation experiments show that, in the tested \(wd=1\) pool, high-residual windows exhibit about \(3\times\) larger short-horizon perturbation deviation than low-residual windows. In a same-data norm-window control, perturbation sensitivity aligns with the residual ordering rather than total-parameter-norm ordering, suggesting that the residual is not merely a total-norm proxy at the window level in the studied \(wd=1\) dynamics. Norm signals remain strong run-level regime indicators, and log-probability performs best among the observables tested under the current protocol. We position the residual as a window-level monitoring and localization signal in the studied modular-arithmetic Transformer settings, not a universal early-warning predictor or an intervention rule.
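
As a rough illustration of the pipeline in the abstract (distribution to quantile coordinates, Hankel embedding, DMD fit, reconstruction residual), the following sketch uses NumPy. The function names, the quantile levels, the number of delays, and the relative Frobenius-norm residual are all assumptions; the abstract does not specify the authors' exact choices.

```python
import numpy as np

def quantile_coords(samples, n_quantiles=5):
    """Summarize one step's empirical distribution by a few quantiles."""
    levels = np.linspace(0.1, 0.9, n_quantiles)
    return np.quantile(np.asarray(samples, dtype=float), levels)

def hankel_embed(series, delays):
    """Stack `delays` time-shifted copies of a (T,) or (T, d) series."""
    series = np.asarray(series, dtype=float)
    if series.ndim == 1:
        series = series[:, None]
    cols = series.shape[0] - delays + 1
    return np.vstack([series[i:i + cols].T for i in range(delays)])

def dmd_diagnostics(series, delays=4):
    """Fit Y ~ A X on the Hankel matrix; return residual and spectrum of A."""
    H = hankel_embed(series, delays)
    X, Y = H[:, :-1], H[:, 1:]
    # Least-squares solve of X^T A^T = Y^T.
    A_T, *_ = np.linalg.lstsq(X.T, Y.T, rcond=None)
    A = A_T.T
    residual = np.linalg.norm(Y - A @ X) / np.linalg.norm(Y)
    return residual, np.linalg.eigvals(A)

t = np.arange(200)
smooth = np.sin(0.1 * t)                      # exactly linear dynamics
rng = np.random.default_rng(0)
noisy = smooth + 0.5 * rng.standard_normal(200)
r_smooth, spectrum = dmd_diagnostics(smooth)
r_noisy, _ = dmd_diagnostics(noisy)
```

A window whose dynamics are well described by a linear operator gives a residual near zero, while a window the linear model cannot reconstruct gives a larger one; the abstract uses that residual as the window-level signal.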

2605.08234 2026-05-12 cs.LG cs.AI

When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression

Ruijie Zhang, Haozhe Liang, Da Chang, Li Hu, Fanqi Kong, Huaxiao Yin, Yu Li

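AI summary This study examines how value-aware key-value (KV) cache eviction can improve cache compression in long-context LLM inference. The authors propose a fixed-contract diagnostic that analyzes how a selector behaves at different decision points, giving a more accurate picture of how compression strategies affect task performance. Experiments show that the method distinguishes support recovery, output-value ranking, and boundary effects in cache compression, providing a useful diagnostic tool for non-monotone cache compression.

Abstract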

Long-context LLM inference is bottlenecked by the memory and bandwidth cost of reading large KV caches during decoding. KV compression reduces this cost by keeping only part of the cache, but task accuracy alone does not identify why a selector succeeds or fails. A selector can fail at three steps: it may miss the evidence future decoding needs, give high scores to tokens that do not affect the output, or break related evidence when fitting scores into a small cache. We introduce a fixed-contract diagnostic that holds the selector's setup fixed and changes one decision slot at a time. For value ranking, the probe combines a block's attention mass with the estimated output change from removing it. On LongBench across three models and two budgets, the probe is positive on 72.6% of positive-margin cells and 32.4% of nonpositive-margin cells. NeedleBench M-RT at 32k and a RULER 8k check probe support closure under branched retrieval, and a 264-cell sign evaluation separates support recovery and output-value ranking from leverage effects near the boundary. The resulting order is to recover decode-side evidence, rank its output value, and preserve coupled evidence during projection.
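
The value-ranking probe is described as combining a block's attention mass with the estimated output change from removing it. A minimal single-query sketch of that idea follows; the product as the combination rule, the renormalization of the remaining attention weights, and the names `block_scores` and `select_blocks` are assumptions, not the paper's definitions. It assumes at least two blocks.

```python
import numpy as np

def block_scores(attn, values, block_size):
    """Score each KV block by attention mass times output change on removal."""
    attn = np.asarray(attn, dtype=float)      # (n_tokens,), sums to 1
    values = np.asarray(values, dtype=float)  # (n_tokens, d)
    full_output = attn @ values
    n_blocks = len(attn) // block_size
    scores = np.zeros(n_blocks)
    for b in range(n_blocks):
        keep = np.ones(len(attn), dtype=bool)
        keep[b * block_size:(b + 1) * block_size] = False
        mass = attn[~keep].sum()
        kept = attn[keep]
        # Output if the block were evicted and attention renormalized.
        ablated_output = (kept / kept.sum()) @ values[keep]
        scores[b] = mass * np.linalg.norm(full_output - ablated_output)
    return scores

def select_blocks(scores, budget):
    """Keep the `budget` highest-scoring blocks."""
    return sorted(np.argsort(scores)[::-1][:budget].tolist())

attn = [0.3, 0.3, 0.1, 0.1, 0.1, 0.1]
values = [[1, 0], [1, 0], [1, 0], [1, 0], [0, 5], [0, 5]]
scores = block_scores(attn, values, block_size=2)
kept_blocks = select_blocks(scores, budget=2)
```

In this toy input, blocks 1 and 2 have the same attention mass, but block 2 carries values unlike the rest, so removing it changes the output more and it ranks higher. That is the second failure step the abstract names: high scores for tokens that do not affect the output.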

2605.08232 2026-05-12 cs.LG physics.comp-ph physics.flu-dyn

Hierarchical Multi-Fidelity Learning for Predicting Three-Dimensional Flame Wrinkling and Turbulent Burning Velocity

Saghar Zolfaghari, Yu Xie, Junfeng Yang, Safa Jamali

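AI summary To address the difficulty of obtaining high-fidelity experimental data, this study proposes a hierarchical multi-fidelity neural network framework (MuFiNNs) for predicting the three-dimensional wrinkling dynamics and turbulent burning velocity of turbulent premixed flames. The method combines sparse high-fidelity experimental data with structured low-fidelity models that encode dominant physical trends, and uses hierarchical construction and nonlinear correction to learn the coupling between flame geometry and reactive behavior. The study shows that the framework retains good predictive ability when data are sparse, noisy, or experimentally hard to obtain, offering a scalable and physically grounded approach to combustion modeling with limited data.

Abstract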

High-fidelity experimental characterization of turbulent premixed flames remains limited by the cost and complexity of advanced diagnostics, particularly under elevated pressures and intense turbulence where measurements of coupled flame morphology and burning dynamics are sparse. Here, we develop a hierarchical multi-fidelity neural network framework (MuFiNNs) to address this challenge by integrating sparse high-fidelity experimental data with structured low-fidelity representations encoding dominant physical trends. The framework combines hierarchical low-fidelity construction with nonlinear multi-fidelity correction to learn coupled geometric and reactive flame behavior while recovering discrepancies that simplified models alone cannot capture. The methodology is applied to expanding turbulent premixed flames to predict three-dimensional flame wrinkling dynamics and turbulent mass burning velocity across varying fuels, pressures, and turbulence intensities. Using experimentally informed low-fidelity trend models with sparse high-fidelity measurements, MuFiNNs accurately reconstruct observed flame behavior, enable interpolation across unseen operating conditions, and demonstrate robust extrapolation beyond the training domain. Importantly, the framework remains effective in noisy, weakly structured, or experimentally inaccessible regimes where conventional data-driven approaches often fail. These results show that hierarchical multi-fidelity learning provides a scalable and physically grounded strategy for predictive combustion modeling in data-limited regimes. More broadly, this work establishes multi-fidelity scientific machine learning as a practical framework for extracting physically meaningful predictive models from sparse experiments, particularly for instability-dominated and turbulence-sensitive reactive flows where high-fidelity data acquisition is demanding.

2605.08231 2026-05-12 cs.LG cs.AI cs.AR

TRAM: Training Approximate Multiplier Structures for Low-Power AI Accelerators

Chang Meng, Hanyu Wang, Yuyang Ye, Mingfei Yu, Wayne Burleson, Giovanni De Micheli

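AI summary As power consumption in AI accelerators becomes an increasingly prominent problem, this paper proposes TRAM, which jointly optimizes the approximate multiplier structure and the AI model parameters to significantly reduce power while keeping accuracy loss small. Unlike prior work that designs approximate multipliers independently, TRAM combines multiplier structure design with model training, achieving more efficient power optimization. Experiments show power reductions of up to 25.05% on CIFAR-10 and up to 27.09% on ImageNet.

Abstract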

Reducing power consumption in AI accelerators is increasingly important. Approximate computing can reduce power consumption while keeping the accuracy loss small. Since multipliers are power-hungry components in AI models, this paper focuses on synthesizing low-power approximate multipliers (AxMs). Unlike prior works that design AxMs separately from AI model training, we present TRAM, which jointly optimizes the AxM structure and AI model parameters to lower power with small accuracy loss. Experiments show that compared to state-of-the-art AxMs, TRAM achieves up to 25.05% AxM power reduction on CNNs with CIFAR-10, and reduces power by up to 27.09% on vision transformers with ImageNet.

2605.08230 2026-05-12 cs.LG stat.AP

Social Determinants of Health and Fentanyl Overdose Mortality Across US Counties: An XGBoost and SHAP Analysis Identifying Silent Risk Counties and Treatment Deserts

Kabi Raj Tiruwa, Abhisan Ghimire, Anuj Kumar Shah

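AI summary Using XGBoost and SHAP, this study analyzes the relationship between social determinants of health and fentanyl overdose mortality across US counties, aiming to identify high-risk but overlooked "silent risk counties" and "treatment desert counties". It integrates several public health datasets and finds that disability rate, hypertension, smoking, and lack of vehicle access are key predictors of overdose deaths, and that mortality risk is significantly elevated in treatment desert counties. The results provide a basis for targeted interventions, emphasizing that expansion of treatment resources for substance use disorders should be prioritized and that high-risk areas need early intervention.

Comments 21 pages, 7 figures, 4 tables

Abstract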

Background: Fentanyl overdose deaths are still increasing across the U.S. We do not fully understand which county-level social and structural conditions lead to higher overdose death rates. Social determinants of health, including disability, treatment access, and behavioral health issues, may help identify vulnerable counties before deaths become severe. No earlier study has used explainable machine learning with SHAP attribution on 2022 CDC WONDER data to study treatment access gaps and silent risk counties. Methods: We combined data from four government sources for 975 U.S. counties, including CDC WONDER (2022) overdose mortality data, CDC Social Vulnerability Index (SVI), CDC PLACES health behavior data, and Area Health Resources Files. An XGBoost model was used to predict overdose mortality risk using Standardized Mortality Ratio (SMR). Five-fold cross-validation was used to test model accuracy, and SHAP values were used to show which factors increase or decrease risk. Results: XGBoost outperformed all tested models (Spearman rho=0.67, R2=0.457, MAE=0.409, high-risk recall=71.1%). Top predictors were disability rate, hypertension, smoking, and lack of vehicle access. Treatment desert counties had 52.6% higher overdose mortality (SMR 1.786 vs 1.170; p<0.0001). K-means identified 143 silent risk counties. Overdose deaths were spatially clustered (Moran's I=0.505, p=0.001) with 75 hotspots and 136 coldspots. Suppressed counties were 58.2% of WONDER counties, mostly rural (72%) and treatment deserts (65%). Conclusions: County-level SDOH factors predict overdose deaths, especially disability, treatment access, and behavioral health burden. MOUD expansion should prioritize treatment desert counties, and silent risk counties need early intervention before mortality worsens.
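
The Standardized Mortality Ratio and the "52.6% higher" figure can be checked with a few lines of arithmetic. The sketch below uses the usual definition of SMR as observed deaths divided by expected deaths; how the authors computed expected deaths is not stated in the abstract, and the function names are illustrative.

```python
def standardized_mortality_ratio(observed_deaths, expected_deaths):
    """SMR = observed / expected; values above 1 mean excess mortality."""
    if expected_deaths <= 0:
        raise ValueError("expected_deaths must be positive")
    return observed_deaths / expected_deaths

def percent_higher(a, b):
    """Relative difference of a over b, in percent."""
    return 100.0 * (a - b) / b

# Reported mean SMRs: treatment desert counties vs. other counties.
excess = percent_higher(1.786, 1.170)
```

Rounded to one decimal place, `excess` reproduces the 52.6% stated in the Results.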

2605.08226 2026-05-12 cs.CV

SPECTRA-Net: Scalable Pipeline for Explainable Cross-domain Tensor Representations for AI-generated Images Detection

Sarra Arab, Anfal Achouri, Seif Eddine Bouziane

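AI summary The rapid growth of AI-generated images poses a major challenge to digital information integrity. This paper proposes SPECTRA-Net, a scalable, explainable pipeline of cross-domain tensor representations for detecting AI-generated images. The method combines global semantic features from a vision foundation model, spectral analysis, local patch-based anomaly detection, and statistical descriptors, achieving state-of-the-art detection performance in both in-domain and cross-domain settings. It provides explainability by localizing anomalous regions, offering a more reliable approach to content verification in real-world scenarios.

Comments 13 pages, 2 figures, submitted to a journal

Abstract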

The rapid proliferation of AI-generated images (AIGI) presents a significant challenge to digital information integrity. While human observers and existing detection models struggle to keep pace with the increasing sophistication of generative models, the need for robust, real-time detection systems has become critical. This paper introduces SPECTRA-Net, a scalable pipeline for explainable, cross-domain tensor representations for AIGI detection. Our approach leverages a multi-view representation of images, combining global semantic features from a Vision Foundation Model (VFM), spectral analysis, local patch-based anomaly detection, and statistical descriptors. By fusing these complementary data streams, SPECTRA-Net achieves state-of-the-art performance in both in-domain and cross-domain settings, demonstrating high accuracy and generalization capabilities across a wide range of challenging datasets, including WildFake, Chameleon, and RRDataset. The proposed pipeline not only provides a robust solution for AIGI detection but also offers explainability through artifact localization, paving the way for more trustworthy and reliable content verification in real-world applications.

2605.08223 2026-05-12 cs.LG

A Simulated Federated Analysis of MS-Induced Brain Lesions

Evelyn Trautmann, Joël Federer-Gsponer, Markus C. Elze, José-Tomás Prieto

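AI summary This paper presents a simulation framework for studying multi-center federated analysis of multiple sclerosis (MS) patient data. The framework comprises two tasks, image segmentation and clinical data analysis, the latter using federated survival analysis and principal component analysis. By constructing high-fidelity synthetic datasets and incorporating real imaging data, it simulates key elements of real federated studies, including data governance, local preprocessing, and model training. The study provides a realistic and reliable testbed for developing and evaluating federated learning methods in MS research.

Comments Accepted for publication at The 39th IEEE International Symposium on Computer-Based Medical Systems

Abstract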

Federated techniques such as federated learning and federated analysis have emerged as a powerful paradigm for enabling multi-center research on sensitive clinical data while preserving patient privacy. In this study, we introduce a simulation framework that emulates a real-world federated research project focused on the analysis of multiple sclerosis (MS) patient data. The project comprises two components: an image segmentation task and a clinical data analysis task, where federated variants of survival analysis and Principal Component Analysis (PCA) are employed. To capture the complexity and heterogeneity of real clinical datasets, we construct a federation of high-fidelity synthetic cohorts designed to mirror MS-related clinical and demographic characteristics, while the imaging component leverages publicly available real-world datasets. Our simulation replicates key elements of authentic federated workflows, including distributed data governance, site-specific preprocessing, model training across isolated nodes, and the secure aggregation of analytical outputs. This framework provides a realistic testbed for developing, evaluating, and benchmarking federated learning methods in the context of MS research.

2605.08222 2026-05-12 cs.CV cs.AI cs.IR

From Historical Tabular Image to Knowledge Graphs: A Provenance-Aware Modular Pipeline

Sarah Binta Alam Shoilee, Victor de Boer, Jacco van Ossenbruggen, Susan Legêne

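AI summary This study proposes a modular, provenance-aware pipeline for converting handwritten historical table images into knowledge graphs in support of human-AI collaboration. The pipeline has three stages (table reconstruction, information extraction, and knowledge graph construction) and preserves intermediate representations at each stage for human inspection and correction. Its core contribution is the systematic integration of provenance information at every processing stage, ensuring that all extracted entities and literals can be traced to their original visual and textual sources, which improves the transparency and controllability of the process.

Comments Shorter version of this paper has been accepted in the 5th International Conference on Hybrid Human-Artificial Intelligence (HHAI 2026)

Abstract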

Handwritten archival tables contain rich historical information, yet transforming them into structured representations, such as Knowledge Graphs, requires integrating table structure recognition, handwriting recognition, and semantic interpretation - a complex multimodal process. End-to-end AI implementations can obscure these steps, resulting in opaque algorithmic operations that hinder human oversight, critical assessment, and trust. To address this, we present a modular, provenance-aware pipeline to convert handwritten tabular images into KGs supporting human-AI collaboration. The pipeline decomposes the workflow into three stages - table reconstruction, information extraction, and KG construction - while exposing intermediate representations for inspection, evaluation, and correction. A key contribution of our approach is the systematic integration of data provenance at every stage, ensuring that all extracted entities and literals remain traceable to their visual and textual origins. The proposed pipeline is demonstrated through a number of experiments on real-world archival material concerning military careers. The results across three different table reconstruction variants highlight the importance of modularisation. By coupling modularity with data provenance, our work advances transparent and collaboratively controllable image-to-KG pipelines for complex historical data.

2605.08221 2026-05-12 cs.LG cs.AI

NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning

Michael Jerge, David Evans

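AI summary This paper proposes NoisyCoconut, a new method that improves model reliability by injecting controlled noise into the internal representations of large language models at inference time. The method requires no retraining; instead, it generates diverse reasoning paths and uses their agreement as a confidence signal, letting the model abstain when uncertain. Experiments show that the approach achieves effective accuracy-coverage trade-offs across several reasoning benchmarks and substantially reduces error rates, allowing models to exceed 95% accuracy on mathematical reasoning tasks.

Abstract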

This paper presents NoisyCoconut, a novel inference-time method that enhances large language model (LLM) reliability by manipulating internal representations. Unlike fine-tuning methods that require extensive retraining, NoisyCoconut operates directly on model representations during inference and requires no retraining. Rather than training models to reason in latent space, we inject controlled noise into latent trajectories to generate diverse reasoning paths. Agreement among these paths provides a confidence signal, enabling models to abstain when uncertain. We demonstrate that this approach achieves effective coverage-accuracy tradeoffs across multiple reasoning benchmarks without requiring access to training data or modification of model parameters. This approach provides a practical pathway to improving the reliability of LLM outputs while maintaining compatibility with existing models. Our experiments show that unanimous agreement among noise-perturbed paths reduces error rates from 40-70% to below 15%, enabling models to exceed 95% accuracy on mathematical reasoning tasks through selective abstention.
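
The agreement-then-abstain rule can be sketched independently of how the noisy paths are produced. The code below takes the final answers of several noise-perturbed paths as given (the latent noise injection itself is not shown) and is only an illustration: `consensus_answer`, `min_agreement`, and the toy data are assumptions.

```python
from collections import Counter

def consensus_answer(path_answers, min_agreement=1.0):
    """Return the majority answer if enough paths agree, else None (abstain).

    min_agreement=1.0 corresponds to requiring unanimous agreement.
    """
    answer, votes = Counter(path_answers).most_common(1)[0]
    if votes / len(path_answers) >= min_agreement:
        return answer
    return None

def coverage_and_accuracy(all_path_answers, gold_answers):
    """Coverage = fraction answered; accuracy is measured on answered items."""
    decided = [(consensus_answer(paths), gold)
               for paths, gold in zip(all_path_answers, gold_answers)]
    answered = [(ans, gold) for ans, gold in decided if ans is not None]
    coverage = len(answered) / len(decided)
    accuracy = (sum(ans == gold for ans, gold in answered) / len(answered)
                if answered else float("nan"))
    return coverage, accuracy

# Toy data: four questions, three noisy paths each.
runs = [["4", "4", "4"], ["7", "8", "7"], ["9", "9", "9"], ["1", "1", "1"]]
gold = ["4", "7", "9", "2"]
coverage, accuracy = coverage_and_accuracy(runs, gold)
```

Lowering `min_agreement` raises coverage at the cost of answering more uncertain items, which is the coverage-accuracy trade-off the abstract refers to.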

2605.08220 2026-05-12 cs.AI cs.CE cs.CL cs.CV cs.SE

Spatial Priming Outperforms Semantic Prompting: A Grid-Based Approach to Improving LLM Accuracy on Chart Data Extraction

Andrei Lazarev, Dmitrii Sedov, Alexander Galkin

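AI summary This paper studies how to improve the accuracy of multimodal large language models when extracting data from scientific charts, comparing high-level semantic priming with low-level spatial priming. Semantic methods such as a metadata-first framework and Chain-of-Thought did not significantly improve performance, whereas a simple spatial priming method, overlaying a coordinate grid on the chart image, significantly reduced extraction error. Experiments show that the grid method lowers SMAPE on a synthetic dataset from 25.5% to 19.5%, indicating that spatial context is more effective and more reliable for current multimodal models.

Comments This is the version of the article accepted for publication in SUMMA 2025 after peer review. The final, published version is available at IEEE Xplore: https://doi.org/10.1109/SUMMA68668.2025.11302248

Journal ref
2025 7th International Conference on Control Systems, Mathematical Modeling, Automation and Energy Efficiency (SUMMA), Lipetsk, Russian Federation, 2025, pp. 799-804
Abstract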

The automated extraction of data from scientific charts is a critical task for large-scale literature analysis. While multimodal Large Language Models (LLMs) show promise, their accuracy on non-standardized charts remains a challenge. This raises a key research question: which strategy improves model performance more effectively, high-level semantic priming or low-level spatial priming? This paper presents a comparative investigation into these two distinct strategies. We describe our exploratory experiments with semantic methods, such as a two-stage metadata-first framework and Chain-of-Thought, which failed to produce a statistically significant improvement. In contrast, we present a simple but highly effective spatial priming method: overlaying a coordinate grid onto the chart image before analysis. Our quantitative experiment on a synthetic dataset demonstrates that this grid-based approach provides a statistically significant reduction in data extraction error (SMAPE reduced from 25.5% to 19.5%, p < 0.05) compared to a baseline. We conclude that for the current generation of multimodal models, providing explicit spatial context is a more effective and reliable strategy than high-level semantic guidance for this class of tasks.
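
Both ingredients of the experiment, the grid overlay and the SMAPE metric, are easy to sketch with NumPy. The grid below is drawn directly on a pixel array at a fixed spacing (the paper's grid spacing, labeling, and rendering are not given in the abstract), and the SMAPE variant with the mean of |actual| and |predicted| in the denominator is one common definition, assumed here.

```python
import numpy as np

def overlay_grid(image, step, value=255):
    """Return a copy of a 2-D image array with grid lines every `step` pixels."""
    out = np.array(image, copy=True)
    out[::step, :] = value   # horizontal lines
    out[:, ::step] = value   # vertical lines
    return out

def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    denom = (np.abs(a) + np.abs(p)) / 2.0
    # Pairs where both values are zero contribute zero error.
    ratio = np.divide(np.abs(p - a), denom,
                      out=np.zeros_like(denom), where=denom != 0)
    return 100.0 * ratio.mean()

chart = np.zeros((10, 10), dtype=np.uint8)
gridded = overlay_grid(chart, step=5)
error = smape([100.0], [110.0])
```

In the described setup the gridded image, not the original, is what gets passed to the multimodal model, and SMAPE is computed between the extracted and ground-truth data points.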

2605.08218 2026-05-12 cs.LG cs.CV

Deep Dreams Are Made of This: Visualizing Monosemantic Features in Diffusion Models

Adam Szokalski, Mateusz Modrzejewski

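AI summary This paper proposes LVO, a mechanistic interpretability method that extends feature visualization by optimization from convolutional neural networks to the latent space of diffusion models, using sparse autoencoders (SAEs) to disentangle polysemantic features into monosemantic ones. The study applies LVO to Stable Diffusion 1.5, producing clear visualizations of recognizable concepts such as human figures, roses, and waterfall foam, and shows that regularization techniques from pixel space remain effective in latent space. Compared with dataset examples and feature steering, the method reveals more directly what activates a feature.

Abstract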

This paper proposes latent visualization by optimization (LVO), a mechanistic interpretability technique that extends feature visualization by optimization - originally developed for convolutional neural networks - to latent diffusion models. LVO employs sparse autoencoders (SAEs) to disentangle polysemantic layer representations into monosemantic features. Key contributions include latent-space optimization, time-step activity analysis, schedule-matched noise injection, prior initialization through feature steering, and suitable regularization strategies. We demonstrate the method on Stable Diffusion 1.5 fine-tuned on the Style50 dataset, showing that SAE features produce clear visualizations of recognizable concepts - including diagonal compositions, human figures, roses, cables, and waterfall foam - that correlate with dataset examples, while the baseline without disentanglement produces less coherent results. We further show that regularization techniques from pixel-space feature visualization transfer to the latent domain, though they require different configurations for the raw-layer and SAE variants. Compared to dataset examples and steering, LVO provides complementary insights by directly revealing what activates a feature rather than its downstream effects.

2605.08217 2026-05-12 cs.LG cs.IR

Retrieval Mechanisms Surpass Long-Context Scaling in Time Series Forecasting

Rishi Ahuja, Kumar Prateek, Simranjit Singh, Vijay Kumar

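AI summary This paper examines whether long context actually improves performance in time series forecasting. Experiments show that in highly stochastic domains, overly long history introduces noise and increases forecasting error. The authors therefore evaluate a retrieval-augmented forecasting method (RAFT), which selectively retrieves relevant historical segments as exogenous variables, significantly improving accuracy while reducing computation and pointing to a new direction for the design of future time series foundation models.

Abstract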

Time Series Foundation Models (TSFMs) have borrowed the long context paradigm from natural language processing under the premise that feeding more history into the model improves forecast quality. But in stochastic domains, distant history is often just high-frequency noise, not signal. Hence, the proposed work tests whether this premise actually holds by running continuous context architectures (PatchTST included) through the ETTh1 benchmark. The obtained results contradict the premise: an inverse scaling law shows up clearly, with forecasting error rising as context gets longer. A 3,000-step window causes performance to drop by over 68%, evidence that attention mechanisms are poor at ignoring irrelevant historical volatility. Retrieval-Augmented Forecasting (RAFT) is evaluated as an alternative. RAFT achieves a mean squared error (MSE) of 0.379 with a fixed 720-step window and selective retrieval, outperforming both long-context configurations and zero-shot foundation models (Chronos, Moirai) despite requiring far less computation. In addition, the retrieval step injects only the most relevant historical segments as dynamic exogenous variables, which gives the model a context-informed inductive bias it cannot build on its own from raw sequences. Therefore, foundation models going forward need to shift architecturally toward selective retrieval.
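
The retrieval step, which injects only the most relevant historical segments as exogenous variables, can be illustrated with a nearest-neighbor search over past windows. This is a sketch under assumptions: plain Euclidean distance as the similarity measure, the function name `retrieve_segments`, and returning the values that followed each match are illustrative choices, not RAFT's documented procedure.

```python
import numpy as np

def retrieve_segments(history, query, horizon, k=3):
    """Find the k past windows most similar to `query`.

    Returns the start indices of the matches and, for each, the `horizon`
    values that followed it, which can be fed to a forecaster as extra inputs.
    """
    history = np.asarray(history, dtype=float)
    query = np.asarray(query, dtype=float)
    w = len(query)
    n_candidates = len(history) - w - horizon + 1
    windows = np.stack([history[s:s + w] for s in range(n_candidates)])
    distances = np.linalg.norm(windows - query, axis=1)
    starts = np.argsort(distances)[:k]
    futures = np.stack([history[s + w:s + w + horizon] for s in starts])
    return starts, futures

t = np.arange(500)
series = np.sin(2 * np.pi * t / 50)          # period of 50 steps
starts, futures = retrieve_segments(series[:480], series[480:], horizon=10)
```

On this periodic toy series the retrieved windows all sit a whole number of periods before the query, so their continuations are informative about what comes next. The forecaster sees a short window plus a few retrieved segments rather than thousands of raw steps.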

2605.08214 2026-05-12 cs.SD cs.AI eess.AS

Bangla-WhisperDiar: Fine-Tuning Whisper and PyAnnote for Bangla Long-Form Speech Recognition and Speaker Diarization

Mohammed Aman Bhuiyan, Md Sazzad Hossain Adib, Samiul Basir Bhuiyan, Amit Chakraborty, Aritra Islam Saswato, Ahmed Faizul Haque Dhrubo, Mohammad Ashrafuzzaman Khan

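AI summary This paper addresses the challenges of long-form Bangla speech recognition and speaker diarization with improved methods based on Whisper and PyAnnote. By fine-tuning a Whisper model and the PyAnnote segmentation module, combined with data augmentation and training on a custom dataset, the study significantly improves long-form Bangla speech recognition and speaker diarization. The resulting systems achieve a word error rate (WER) of 0.2441 and a diarization error rate (DER) of 0.2392 on the test set, outperforming the original pretrained models.

Comments 3 figures and 5 tables

Abstract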

Automatic Speech Recognition (ASR) and speaker diarization in Bangla remain challenging due to long-form recordings, diverse acoustic conditions, and significant speaker variability. This work addresses these two core tasks in Bangla spoken language understanding by developing robust systems for long-form ASR and speaker diarization. For ASR (Problem 1), we fine-tune the tugstugi bengaliai regional asr whisper medium model on a custom-curated dataset of approximately 15,000 chunked and aligned Bangla audio segments, employing full weight training with extensive data augmentation including noise injection, reverb simulation, echo, clipping distortion, and pitch/time perturbation. For speaker diarization (Problem 2), we fine-tune the pyannote/segmentation-3.0 model using PyTorch Lightning on the competition-annotated diarization dataset, swapping the fine-tuned segmentation backbone into the pyannote/speaker-diarization-community-1 pipeline while retaining the pretrained speaker embedding and clustering components. Our ASR system achieves a Word Error Rate (WER) of 0.2441, while our diarization system achieves a Diarization Error Rate (DER) of 0.2392, both evaluated on the test set, demonstrating notable improvements over the respective pretrained baselines. We describe our complete pipeline, including data preprocessing, text normalization, audio augmentation, training strategies, inference optimization, and post-processing for both tasks.
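
For readers unfamiliar with the reported metric, WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a plain dynamic-programming version and assumes whitespace tokenization; the competition's actual scoring script and text normalization are not described in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)

example = word_error_rate("a b c d", "a x c")  # one substitution, one deletion
```

A WER of 0.2441 therefore means roughly one word-level error for every four reference words.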

2605.08213 2026-05-12 cs.CV

Low-Cost Stereo Vision for Robust 3D Positioning of Thin Radiata Pine Branches in Autonomous Drone Pruning

Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green

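AI summary This paper studies how a low-cost stereo vision system can provide high-accuracy 3D positioning of thin radiata pine branches to support autonomous drone pruning. It proposes a two-stage approach, branch segmentation followed by depth estimation, and compares several state-of-the-art algorithms on a custom dataset. The core contribution is coupling stereo segmentation with a triangulation algorithm based on median absolute deviation, which addresses the sparse texture, thin structures, and depth noise of forest scenes and offers a feasible route to autonomous pruning without expensive sensors.

Abstract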

Manual pruning of radiata pine, a species of major economic importance to New Zealand forestry, is hazardous, labour-intensive, and increasingly constrained by workforce shortages. Existing autonomous pruning platforms typically rely on expensive sensors such as LiDAR and are limited to thick branches, which restricts their wider adoption. This paper investigates whether a single low-cost stereo camera mounted on a drone can provide sufficiently accurate branch detection and three-dimensional positioning to support autonomous pruning of branches as thin as 10 mm, thereby removing the need for auxiliary depth sensors. The proposed pipeline comprises two stages: branch segmentation and depth estimation. For segmentation, Mask R-CNN variants and the YOLOv8 and YOLOv9 families are compared on a custom dataset of 71 stereo image pairs captured with a ZED Mini camera; YOLOv8 and YOLOv9 are selected as representative state-of-the-art real-time segmentors at the time of data collection, and the framework is designed to remain compatible with newer YOLO releases. For depth estimation, a traditional method (SGBM with WLS filtering) and deep-learning-based methods (PSMNet, ACVNet, GWCNet, MobileStereoNet, RAFT-Stereo, and NeRF-Supervised Deep Stereo) are evaluated, including cross-dataset fine-tuning experiments that expose the domain gap between urban driving benchmarks and natural forestry scenes. The main novelty of this work lies in coupling stereo segmentation with a centroid-based triangulation algorithm and Median-Absolute-Deviation outlier rejection that converts a segmentation mask and disparity map into a single robust branch-to-camera distance, addressing the challenges of sparse texture, thin structures, and noisy disparity values typical of forest scenes. Qualitative evaluations at distances of 1-2 m show that the learning-based stereo methods produce more coherent depth es...
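
The idea of turning a segmentation mask and a disparity map into one robust distance can be sketched with Median-Absolute-Deviation filtering and the standard pinhole stereo relation Z = f * B / d. This is not the paper's centroid-based algorithm, whose details are not given in the abstract; the threshold `k`, the 1.4826 scale factor, and the focal length and baseline in the example are illustrative assumptions.

```python
import numpy as np

def branch_distance(disparity, mask, focal_px, baseline_m, k=3.0):
    """Robust branch-to-camera distance (metres) from masked disparities.

    Assumes the mask contains at least one pixel with positive disparity.
    """
    d = disparity[mask & (disparity > 0)]   # valid disparities on the branch
    median = np.median(d)
    mad = np.median(np.abs(d - median))
    scale = 1.4826 * mad                    # MAD scaled to match a Gaussian sigma
    if scale > 0:
        inliers = d[np.abs(d - median) <= k * scale]
    else:
        inliers = d[d == median]
    # Pinhole stereo: depth = focal length (px) * baseline (m) / disparity (px).
    return focal_px * baseline_m / np.median(inliers)

disparity = np.array([[49.0, 50.0, 51.0],
                      [50.0, 5.0, 51.0],
                      [50.0, 200.0, 49.0]])
mask = np.ones_like(disparity, dtype=bool)
distance = branch_distance(disparity, mask, focal_px=700.0, baseline_m=0.063)
```

The two outlying disparities (5 and 200) are rejected before the depth is computed, so a few bad matches on a thin, weakly textured branch do not drag the estimate.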

2605.08212 2026-05-12 cs.LG cs.CL gr-qc hep-th

LLMs with in-context learning for Algorithmic Theoretical Physics

Anamaria Hell, Leander Thiele

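AI summary As the demand for algorithmic computation in theoretical physics grows, this paper explores the feasibility of combining large language models (LLMs) with computer algebra systems (CAS) for algorithmic tasks. By interfacing Claude with Maple and applying the setup to cosmological perturbation calculations in modified theories of gravity, the study shows how the approach performs on real problems, the common causes of failure, and directions for improvement. The results indicate that a frontier LLM supplied with sufficient worked examples can solve most of the test problems, suggesting a new approach to automated computation in theoretical physics.

Comments 8 pages, 2 figures

Abstract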

There is an increasing number of algorithmic computations in theoretical physics. These, while conceptually simple, can nevertheless be time-consuming and contain subtleties that should not be overlooked. Given the recent improvement of Large Language Models (LLMs), it is natural to investigate whether LLMs equipped with a computer algebra system (CAS) runtime and sufficiently informative context can reliably carry out these algorithmic tasks. In this work, we interface Claude with Maple, and apply this framework to cosmological perturbations in modified theories of gravity. We demonstrate the current capabilities of this approach, the typical failures, and how these can be improved. We find that a frontier LLM supplied with worked examples is able to solve most test problems.

2605.08210 2026-05-12 cs.CV

Harmonized Feature Conditioning and Frequency-Prompt Personalization for Multi-Rater Medical Segmentation

Sanaz Karimijafarbigloo, Armin Khosravi, Alireza Kheyrkhah, Reza Azad, Mauricio Reyes, Dorit Merhof

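AI summary To address annotation disagreement in multi-rater medical image segmentation, this study proposes a probabilistic framework combining feature harmonization with frequency-prompt personalization, aiming to reflect the uncertainty of clinical diagnosis more accurately. Through adaptive feature conditioning and a frequency-domain personalization module, the model distinguishes device noise from differences between expert annotators and produces more anatomically consistent segmentations. Experiments show that the method achieves leading segmentation performance and uncertainty estimation on multiple datasets, and performs especially well on noisy cases.

Comments Accepted in main CVPR 2026

Abstract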

Multi-rater medical image segmentation captures the inherent ambiguity of clinical interpretation, where diagnostic boundaries vary across experts and imaging devices. Existing approaches often reduce this diversity to consensus labels or treat rater differences as noise, resulting in overconfident and poorly calibrated models. We propose a harmonized probabilistic framework that disentangles acquisition artifacts from genuine annotator variability through adaptive feature conditioning and frequency-domain personalization. A lightweight Harmonizer Network implicitly models scanner-specific artifacts and performs dynamic feature modulation to standardize latent representations, ensuring that uncertainty reflects anatomy rather than noise. To represent rater-specific styles, we introduce novel High-Frequency Prompt Modules that operate in the spectral domain to encode annotator-dependent boundary precision and textural sensitivity. These prompts adaptively modulate harmonized features to produce personalized yet anatomically consistent segmentations. Furthermore, a Generalized Energy Distance-based regularization aligns the generative distribution with empirical annotation variability, promoting diversity where experts disagree and consensus where they converge. Experiments on LIDC-IDRI and NPC-170 show SOTA aggregated and individualized segmentation, with notable GED reductions and improved Dice scores, especially on noisy cases. Beyond accuracy, the model exhibits clinically meaningful uncertainty. Confidence rises in agreement regions and declines in ambiguous areas, supporting its use as a reliable and interpretable tool for multi-expert clinical workflows.
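
The Generalized Energy Distance compares the set of model samples with the set of expert annotations. A common form in multi-rater segmentation work is GED^2 = 2 E[d(S, Y)] - E[d(S, S')] - E[d(Y, Y')] with d = 1 - IoU. The sketch below implements that form as an evaluation metric; whether the paper's regularizer uses exactly this distance and estimator is an assumption.

```python
from itertools import product

import numpy as np

def iou_distance(a, b):
    """d(a, b) = 1 - IoU for binary masks; two empty masks have distance 0."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return 1.0 - np.logical_and(a, b).sum() / union

def mean_distance(xs, ys):
    """Average distance over all pairs drawn from the two sets."""
    return float(np.mean([iou_distance(x, y) for x, y in product(xs, ys)]))

def generalized_energy_distance(samples, annotations):
    """Squared GED between model samples and rater annotations."""
    return (2.0 * mean_distance(samples, annotations)
            - mean_distance(samples, samples)
            - mean_distance(annotations, annotations))

rater_a = [[1, 1], [0, 0]]
rater_b = [[1, 0], [0, 0]]
matched = generalized_energy_distance([rater_a, rater_b], [rater_a, rater_b])
collapsed = generalized_energy_distance([rater_a], [rater_b])
```

When the samples reproduce the raters' spread the distance is zero, while a model that collapses to a single mask that differs from the annotation is penalized. This matches the stated aim of diversity where experts disagree and consensus where they converge.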

2605.08209 2026-05-12 cs.LG

Learngene Search Across Multiple Datasets for Building Variable-Sized Models

Boyu Shi, Junbo Zhou, Chang Liu, Xu Yang, Qiufeng Wang, Xin Geng

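AI summary This paper proposes Learngene Search Across Multiple Datasets (LSAMD), a method for building variable-sized deep learning models. It expands the ancestry model (Ans-Net) into a super ancestry network containing dataset-specific blocks and adapters, enabling architecture search across datasets, and extracts the most frequently used blocks as "learngenes" to initialize descendant models of different sizes. Experiments show that LSAMD significantly reduces storage and training costs while maintaining model performance.

Abstract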

Deep learning methods are widely used under diverse resource constraints, resulting in models of varying sizes, such as the Vision Transformer (ViT) series. Deploying these models typically requires costly pretraining and finetuning. The Learngene paradigm addresses this issue by extracting transferable components, called learngenes, from a pretrained ancestry model (Ans-Net) to initialize variable-sized descendant models (Des-Nets). Existing learngene extraction methods rely on a single dataset, limiting downstream performance. To address this limitation, we propose Learngene Search Across Multiple Datasets for Building Variable-Sized Models (LSAMD). LSAMD expands the Ans-Net into a searchable super Ans-Net with dataset-specific blocks and dataset adapters (DADs). During training, LSAMD searches for an optimal architecture path for each dataset. The base blocks most frequently selected across datasets are extracted as learngenes for initializing Des-Nets. Experiments on multiple datasets show that LSAMD achieves performance comparable to pretrain-finetune methods while significantly reducing storage and training costs.
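
The extraction rule, keeping the blocks most frequently selected across the per-dataset architecture paths, amounts to a frequency count. The sketch below assumes each searched path is available as a list of block names; the names, the dictionary layout, and the alphabetical tie-breaking are hypothetical, and the search that produces the paths is not shown.

```python
from collections import Counter

def extract_learngenes(paths_per_dataset, n_genes):
    """Pick the n_genes blocks that appear in the most dataset paths."""
    counts = Counter(
        block
        for path in paths_per_dataset.values()
        for block in set(path)        # count each block once per dataset
    )
    # Most frequent first; ties broken alphabetically for determinism.
    ranked = sorted(counts, key=lambda block: (-counts[block], block))
    return ranked[:n_genes]

# Hypothetical searched paths for three datasets.
paths = {
    "dataset_a": ["base_0", "base_1", "specific_a"],
    "dataset_b": ["base_0", "base_2", "specific_b"],
    "dataset_c": ["base_0", "base_1", "specific_c"],
}
learngenes = extract_learngenes(paths, n_genes=2)
```

Dataset-specific blocks appear in only one path each, so they naturally fall to the bottom of the ranking and the shared base blocks are the ones extracted.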

2605.08207 2026-05-12 cs.CV

A Breast Vision Pathology Foundation Model for Real-world Clinical Utility

Yingxue Xu, Zhengyu Zhang, Xiuming Zhang, Mengwei Xu, Fengtao Zhou, Yihui Wang, Jiabo Ma, Yi Xin, Danyi Li, Chengyu Lu, Zhijian Cen, Ying Tan, Qingbing Yao, Qi Wang, Zizhao Gao, Yong Zhang, Jingjing Chen, Feifei Liu, Qian Xu, Yi Dai, Hongxuan Tan, Cheng Jin, Huajun Zhou, Zhengrui Guo, Ling Liang, Hongyi Wang, Yingcong Chen, Xi Wang, Zhenhui Li, Ronald Cheong Kin Chan, Ning Mao, Muyan Cai, Zhe Wang, Li Liang, Hao Chen

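AI summary This study presents BRAVE, a breast pathology foundation model intended to support pathological diagnosis and decision-making in real clinical settings. The model was developed and evaluated on more than 100,000 breast whole-slide images from 32 sources across Asia, Europe and North America, and it can contribute at multiple clinical stages, including pre-operative biopsy, intra-operative frozen section and post-operative resection. Experiments show that BRAVE offers significant advantages in excluding low-risk cases, helping to recover missed cases, and improving pathologists' diagnostic accuracy and efficiency, and that it independently predicts disease-free survival and overall survival.

Comments 60 pages

Details
English abstract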

Pathology foundation models have shown strong retrospective performance, but whether such systems can support clinically relevant use remains unclear. This challenge is particularly important in breast cancer, where pathological assessment serves as the gold standard for diagnosis and guides treatment planning, surgical decision-making and risk stratification across pre-, intra- and post-operative stages. Here we present BRAVE, a breast-adaptive pathology foundation model developed and evaluated using a total resource of 101,638 breast whole-slide images from 32 sources across Asia, Europe and North America. We assessed BRAVE across 34 tasks in 82 cohorts spanning pre-operative biopsy, intra-operative frozen section and post-operative resection, using an evidence chain comprising retrospective benchmarking, clinically challenging scenarios, workflow-oriented clinical impact simulations, prospective observational validation with the thresholds locked in the retrospective cohorts and crossover pathologist-AI interaction studies. Across these settings, BRAVE supported practical roles in the clinical workflow, including safe exclusion of low-risk cases from routine review, AI-assisted second-review rescue of initially missed positives and prioritization of cases for further assessment. In prospective validation across three centres, BRAVE excluded 76.9% of negative biopsy cases (NPV 0.953) and 70.1% of negative frozen-section cases (NPV 0.973), and triaged 78.8% of post-operative subtyping cases as high-confidence clear-cut cases (NPV 1.000). In reader studies, AI assistance improved balanced accuracy from 88.5% to 95.1% (OR 3.14, P<0.001), with better efficiency, confidence and inter-rater agreement. BRAVE-derived scores also independently predicted disease-free survival (adjusted HR 4.79, P<0.001) and overall survival (adjusted HR 8.14, P<0.001).

2605.08202 2026-05-12 cs.LG cs.AI

Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning

Qingjun Wang, Hongtu Zhou, Hang Yu, Junqiao Zhao, Yanping Zhao, Chen Ye, Ziqiao Wang, Guang Chen

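AI summary A key challenge in offline reinforcement learning is the overestimation of the value of out-of-distribution (OOD) actions. Existing methods usually mitigate this by penalizing unseen samples, but they struggle to identify OOD actions accurately and may suppress beneficial exploration. The paper therefore proposes DOSER, a framework that uses diffusion models to capture the behavior policy and the state distribution and takes single-step denoising reconstruction error as a reliable OOD indicator. During policy optimization it distinguishes beneficial from harmful OOD actions, selectively suppressing risky actions while encouraging exploration of high-potential ones, and it outperforms existing methods on multiple benchmarks.

Comments 10 pages, 5 figures. Accepted to ICLR 2026

Details
English abstract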

Offline reinforcement learning (RL) faces a critical challenge of overestimating the value of out-of-distribution (OOD) actions. Existing methods mitigate this issue by penalizing unseen samples, yet they fail to accurately identify OOD actions and may suppress beneficial exploration beyond the behavioral support. Although several methods have been proposed to differentiate OOD samples with distinct properties, they typically rely on restrictive assumptions about the data distribution and remain limited in discrimination ability. To address this problem, we propose DOSER (Diffusion-based OOD Detection and Selective Regularization), a novel framework that goes beyond uniform penalization. DOSER trains two diffusion models to capture the behavior policy and state distribution, using single-step denoising reconstruction error as a reliable OOD indicator. During policy optimization, it further distinguishes between beneficial and detrimental OOD actions by evaluating predicted transitions, selectively suppressing risky actions while encouraging exploration of high-potential ones. Theoretically, we prove that DOSER is a $γ$-contraction and therefore admits a unique fixed point with bounded value estimates. We further provide an asymptotic performance guarantee relative to the optimal policy under model approximation and OOD detection errors. Across extensive offline RL benchmarks, DOSER consistently attains superior performance to prior methods, especially on suboptimal datasets.
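
The OOD indicator mentioned above, single-step denoising reconstruction error, can be illustrated with a toy stand-in. In the sketch below a nearest-neighbor lookup plays the role of the trained diffusion denoiser, which is purely an assumption for illustration; the noise level, sample count, and all names are hypothetical, and none of this is DOSER's actual implementation.

```python
import random


def nearest_neighbor_denoiser(dataset):
    """Toy stand-in for a learned one-step denoiser: snap to the closest
    action seen in the offline dataset."""
    def denoise(noisy):
        return min(
            dataset,
            key=lambda a: sum((x - y) ** 2 for x, y in zip(a, noisy)),
        )
    return denoise


def reconstruction_error(action, denoise, sigma=0.1, n_samples=32, seed=0):
    """Add Gaussian noise once, denoise once, and average the squared
    distance between the reconstruction and the original action."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        noisy = [x + rng.gauss(0.0, sigma) for x in action]
        recon = denoise(noisy)
        total += sum((r - x) ** 2 for r, x in zip(recon, action))
    return total / n_samples


# Hypothetical behavior data: actions clustered near the origin.
dataset = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
denoise = nearest_neighbor_denoiser(dataset)
in_dist_error = reconstruction_error((0.05, 0.05), denoise)
ood_error = reconstruction_error((2.0, 2.0), denoise)
```

An action inside the data support is reconstructed almost unchanged, while a far-away action is pulled back toward the support and incurs a large error, which is the signal a threshold on this quantity would use to flag OOD actions.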

2605.08201 2026-05-12 cs.LG cs.AI cs.CV

Weakly Supervised Concept Learning for Object-centric Visual Reasoning

Sparsh Tiwari, Bettina Finzel, Gesina Schwalbe

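AI summary This paper studies concept learning for object-centric visual reasoning under weak supervision. It proposes a method that combines a slot-based architecture with a Variational Autoencoder (VAE), using self-supervision and concept guidance in the latent space to achieve interpretable grounding of perceptual symbols. The method can discover complex, abstract rules using only 1% of the labels, remains robust under domain shift, and even outperforms current state-of-the-art foundation models in the few-shot setting.

Details
English abstract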

Neurosymbolic systems promise to combine deep neural networks' (DNNs') processing of raw sensor inputs with the few-shot performance of symbolic artificial intelligence. Two-stage approaches explicitly decouple DNN-based perception from subsequent rule-based reasoning. This avoids the optimization and interpretability issues of end-to-end differentiable approaches, but requires costly labels for the perception output. This paper introduces an efficient weak supervision scheme for the perception stage to ground its output symbols for logical induction in object-centric reasoning tasks. It combines a slot-based architecture for object-centricity with a Variational Autoencoder (VAE) for self-supervision, competing with concept guidance on latent dimensions for human-interpretable grounding. The resulting predictions are translated into symbolic background knowledge for reasoning frameworks, such as Inductive Logic Programming (ILP), Decision Trees, and Bayesian Networks. Our extensive empirical evaluation on synthetic and real-world datasets shows that our approach can discover complex, abstract rules for object-centric reasoning whilst reducing supervision to as little as 1% of labels, and being robust even under substantial domain shift. Notably, at 1% supervision it even outperforms state-of-the-art foundation model baselines in domain generalization.

2605.08200 2026-05-12 cs.AI cs.CV cs.LG

Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits

Logan Mann, Ajit Saravanan, Ishan Dave, Shikhar Shiromani, Saadullah Ismail, Yi Xia, Emily Huang

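AI summary Using a unified mechanistic analysis framework, this study investigates where reliability actually originates in vision-language models (VLMs), challenging the intuitive assumption that sharper attention maps mean a more trustworthy model. It finds that attention structure is almost unrelated to model correctness, whereas hidden-state geometry and sparse late-layer circuits reflect reliability far more dependably. In addition, model architectures differ markedly in how reliability is distributed, with important implications for model design and monitoring.

Comments 15 pages, 4 figures, 10 tables. Accepted at the ICLR 2026 Workshop on Multimodal Reasoning. Code and probe-training pipelines: https://github.com/itsloganmann/VLM-Reliability-Probe

Details
English abstract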

A pervasive intuition holds that vision-language models (VLMs) are most trustworthy when their attention maps look sharp: concentrated attention on the queried region should imply a confident, calibrated answer. We test this Attention-Confidence Assumption directly. We instrument three open-weight VLM families (LLaVA-1.5, PaliGemma, Qwen2-VL; 3-7B parameters) with a unified mechanistic pipeline -- the VLM Reliability Probe (VRP) -- that compares attention structure, generation dynamics, and hidden-state geometry against a single correctness label. Three results emerge. (i) Attention structure is a near-zero predictor of correctness (R_pb(C_k,y)=0.001, 95% CI [-0.034,0.036]; R_pb(H_s,y)=-0.012, [-0.047,0.024] on a pooled n=3,090 split), even though attention remains causally necessary for feature extraction (top-30% patch masking drops accuracy by 8.2-11.3 pp, p<0.001). (ii) Reliability becomes legible later in the computation: a single hidden-state linear probe reaches AUROC>0.95 on POPE for two of three families, and self-consistency at K=10 is the strongest behavioral predictor we measure at 10x inference cost (R_pb=0.43). (iii) Causal neuron-level ablations expose a sharp architectural split with direct monitor-design implications: late-fusion LLaVA concentrates reliability in a fragile late bottleneck (-8.3 pp object-identification accuracy after top-5 probe-neuron ablation), whereas early-fusion PaliGemma and Qwen2-VL distribute it widely and absorb destruction of ~50% of their peak-layer hidden dimension with <=1 pp degradation. The takeaway is narrow but consequential: in 3-7B VLMs, reliability is read more reliably off hidden-state geometry, layer-wise margin formation, and sparse late-layer circuits than off attention-map sharpness.
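
The R_pb values quoted above are point-biserial correlations, that is, Pearson correlations between a continuous statistic and a binary correctness label. A minimal sketch follows; the attention statistics and labels are made-up numbers for illustration, not data from the paper, and the function name is hypothetical.

```python
import math


def point_biserial(x, y):
    """Point-biserial correlation between a continuous variable x and a
    binary label y in {0, 1}; numerically this is Pearson's r."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical correctness labels for four answers.
correct = [0, 0, 1, 1]
# A statistic that tracks correctness closely ...
informative = [0.1, 0.2, 0.8, 0.9]
# ... and one that varies independently of it.
uninformative = [0.2, 0.8, 0.2, 0.8]

r_informative = point_biserial(informative, correct)
r_uninformative = point_biserial(uninformative, correct)
```

A value near zero, as reported for the attention statistics, means that knowing the statistic says essentially nothing about whether the answer is correct, whereas a value near one would indicate a strong linear relationship.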

2605.08198 2026-05-12 cs.LG cs.AI cs.CY

FairHealth: An Open-Source Python Library for Trustworthy Healthcare AI in Low-Resource Settings

Farjana Yesmin

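AI summary This paper introduces FairHealth, an open-source Python library that provides a unified, modular framework for trustworthy healthcare AI in low-resource settings, with particular attention to use in low-income countries such as Bangladesh. Addressing four major gaps in existing healthcare AI tools, the library offers modules for fairness auditing, privacy-preserving federated learning, low-bandwidth explainability tools, and support for Global South healthcare datasets, and it can be installed directly via pip.

Comments 8 pages, open-source Python library

Details
English abstract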

We present FairHealth, an open-source Python library that provides a unified, modular framework for trustworthy machine learning in healthcare applications, with particular focus on low-resource and low- and middle-income country (LMIC) settings such as Bangladesh. FairHealth addresses four critical gaps in existing healthcare AI toolkits: (1) the absence of integrated fairness auditing for biosignals and clinical tabular data; (2) the lack of privacy-preserving federated learning tools compatible with standard ML workflows; (3) missing explainability tools tailored for low-bandwidth clinical decision support; and (4) no existing toolkit covering Global South healthcare datasets. Built from five peer-reviewed research contributions, FairHealth provides six modules covering federated learning with homomorphic encryption (fairhealth.federated), intersectional fairness metrics (fairhealth.fairness), hybrid fuzzy-SHAP explainability (fairhealth.explain), multilingual dengue triage (fairhealth.lowresource), equitable disaster aid allocation (fairhealth.equity), and public dataset loaders (fairhealth.datasets). All datasets used are publicly available without institutional data use agreements. FairHealth is installable via pip install fairhealth (PyPI: pypi.org/project/fairhealth/) and available at https://github.com/Farjana-Yesmin/fairhealth.

2605.08197 2026-05-12 cs.LG cs.AI

ReplaySCM: A Benchmark for Executable Causal Mechanism Induction from Interventions

Serafim Batzoglou

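AI summary ReplaySCM is a benchmark for evaluating the induction of executable causal mechanisms from finite interventional evidence. It comprises 1,300 items, each containing binary worlds generated by a latent, fully observed Boolean structural causal model. Systems must output a mechanism map in a restricted Boolean DSL, and the benchmark evaluates behavior by replaying that map on training and held-out intervention worlds rather than by comparing formula strings. The study shows that frontier language models recover parts of the structure, but performance drops sharply when the causal order or the roots are hidden, indicating that causal mechanism induction remains challenging.

Details
English abstract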

Most causal benchmarks for language models score local answers or graph structure. We introduce ReplaySCM, a 1,300-item benchmark for executable causal mechanism induction from finite interventional evidence. Each item contains binary worlds generated by a latent, fully observed, acyclic Boolean structural causal model (SCM). A system must output a mechanism map in a restricted Boolean DSL; the submission is parsed, checked for legality and acyclicity, and replayed on training and held-out intervention worlds. Scoring uses replay behavior rather than formula strings, so syntactically different mechanisms receive credit when they behave correctly. ReplaySCM varies the structural information disclosed to the model through Ordered, Block-order, Hidden-order, and Hidden-roots settings, and includes Alternative-SCM tasks that supply a valid reference SCM and ask for a semantically distinct alternative that fits the training worlds, together with a separating intervention and witness. Frontier LLMs infer parts of the functional-parent structure, but held-out replay drops sharply when order or root structure is hidden. We also evaluate a matched support-audit ladder (Original, Extra Worlds, and Counterexample Audit (CEx)) that raises mean local predecessor-pattern coverage from 0.8949 to 0.9815 to 1.0; under the audited searches, no discovered semantic alternative remains consistent with the training worlds. The Ordered/Hidden-order gap persists under this stronger evidence. ReplaySCM complements answer-level causal reasoning and graph-discovery benchmarks by evaluating executable replay generalization from finite interventional evidence, without claiming unique identification of the latent SCM.
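
Replay-based scoring can be illustrated with a small sketch: a candidate mechanism map is executed under each intervention and compared with the observed world, so two formulas that differ as strings but agree in behavior score the same. The representation below (Python lambdas instead of the benchmark's Boolean DSL, a known variable order, and the names `replay` and `replay_accuracy`) is an assumption made for illustration and is not the benchmark's actual harness.

```python
from itertools import product


def replay(mechanisms, order, root_values, intervention):
    """Evaluate variables in causal order; intervened variables are forced."""
    world = {}
    for var in order:
        if var in intervention:
            world[var] = intervention[var]
        elif var in root_values:
            world[var] = root_values[var]
        else:
            parents, fn = mechanisms[var]
            world[var] = fn(*(world[p] for p in parents))
    return world


def replay_accuracy(mechanisms, order, worlds):
    """Fraction of worlds that the candidate reproduces exactly."""
    hits = sum(
        replay(mechanisms, order, roots, do) == observed
        for roots, do, observed in worlds
    )
    return hits / len(worlds)


order = ["A", "B", "C", "D"]
# Hypothetical latent SCM: C = A and B, D = not C.
truth = {"C": (("A", "B"), lambda a, b: a and b),
         "D": (("C",), lambda c: not c)}
# Same behavior, different syntax (De Morgan form of C).
equivalent = {"C": (("A", "B"), lambda a, b: not ((not a) or (not b))),
              "D": (("C",), lambda c: not c)}
# Different behavior: C = A or B.
wrong = {"C": (("A", "B"), lambda a, b: a or b),
         "D": (("C",), lambda c: not c)}

# Worlds: every root assignment, with and without the intervention do(C=True).
worlds = []
for a, b in product([False, True], repeat=2):
    roots = {"A": a, "B": b}
    for do in ({}, {"C": True}):
        worlds.append((roots, do, replay(truth, order, roots, do)))
```

In this toy setup the De Morgan rewrite reproduces every world, while the `or` variant fails only on the unintervened worlds where A and B differ, which shows why interventions that fix C cannot by themselves separate the two candidates.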

2605.08196 2026-05-12 cs.CV

Survey on Disaster Management Datasets for Remote Sensing Based Emergency Applications

Alain P. Ndigande, Josiah Wiggins, Sedat Ozer

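AI summary This paper surveys disaster management datasets for remote-sensing-based emergency applications, focusing on publicly available image datasets that support computer vision and remote sensing tasks across the pre-disaster, during-disaster and post-disaster phases. The aim is to give researchers and practitioners a centralized reference to high-quality datasets and thereby speed up the development and deployment of remote-sensing-based disaster response solutions.

Comments This work has been accepted for publication at IEEE Transactions on Geoscience and Remote Sensing

Details
English abstract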

Recent natural disasters have highlighted the urgent need for efficient data-driven approaches to disaster management. Machine learning (ML) and deep learning (DL) techniques have shown considerable promise in enhancing the key phases of disaster management including mitigation, preparedness, detection, response, and recovery. A critical enabler of successful ML or DL based applications in remote sensing, however, is the accessibility and quality of annotated datasets. With the growing availability of high-resolution imagery from unmanned aerial vehicles (UAVs) and satellites, computer vision and remote sensing algorithms have become essential tools for rapid detection, situational assessment, and decision-making in disaster scenarios. This survey provides a comprehensive overview of publicly available image-based datasets relevant to ML/DL-based disaster management pipelines. Emphasis is placed on datasets that support computer vision and remote sensing tasks across all phases of disaster events including pre-disaster, during, and post-disaster. The goal of this work is to serve as a centralized reference for researchers and practitioners seeking high-quality datasets for rapid development and deployment of remote sensing-driven disaster response solutions.

2605.08195 2026-05-12 cs.LG

ExecuTorch -- A Unified PyTorch Solution to Run AI Models On-Device

Mergen Nachin, Digant Desai, Sicheng Stephen Jia, Chen Lai, Mengwei Liu, Jacob Szwejbka, Raziel Alvarez, RJ Ascani, Dave Bort, Manuel Candales, Andrew Caples, Yanan Cao, Zhengxu Chen, Soumith Chintala, Gregory Comer, Tanvir Islam, Songhao Jia, Tarun Karuturi, Jack Khuu, Abhinay Kukkadapu, Tugsbayasgalan Manlaibaatar, Andrew Or, Kimish Patel, Siddartha Pothapragada, Lucy Qiu, Supriya Rao, Orion Reblitz-Richardson, Max Ren, Scott Roy, Anthony Shoumikhin, Scott Wolchok, Guang Yang, Angela Yi, Martin Yuan, Hansong Zhang, Jack Zhang, Jerry Zhang, Shunting Zhang, C. Cagatay Bilgin

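AI summary ExecuTorch is a unified PyTorch deployment framework that addresses the hardware fragmentation faced when running AI models on edge devices. It supports heterogeneous compute environments ranging from microcontrollers to complex SoCs, and it preserves PyTorch semantics while offering features such as quantization and pluggable execution backends. ExecuTorch lets researchers validate deployment behavior within the PyTorch ecosystem, bridging the gap between research and production.

Details
English abstract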

Local execution of AI on edge devices is important for low latency and offline operation. However, deploying models on diverse hardware remains fragmented, often requiring model conversion or complete reimplementation outside the PyTorch ecosystem where the model was originally authored. We introduce ExecuTorch, a unified PyTorch-native deployment framework for edge AI. ExecuTorch enables seamless deployment of machine learning models across heterogeneous compute environments. It scales from embedded microcontrollers to complex system-on-chips (SoCs) with dedicated accelerators, powering devices ranging from wearables and smartphones to large compute clusters. ExecuTorch preserves PyTorch semantics while allowing customization, support for optimizations like quantization, and pluggable execution "backends". These features together enable fast experimentation, allowing researchers to validate deployment behavior entirely within PyTorch, bridging the gap between research and production.

2605.08194 2026-05-12 cs.SD eess.AS eess.SP

ShipEcho -- An Interactive Tool for Global Mapping of Underwater Radiated Noise from Vessels

Mark Shipton, Valentino Denona, Đula Nađ, Roee Diamant

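AI summary This paper introduces ShipEcho, an interactive web-based geographic information system (GIS) for near-real-time global mapping of underwater radiated noise from vessels (V-URN). The tool uses community-sourced Automatic Identification System (AIS) data and combines established vessel acoustic models with bathymetric data for propagation modeling, producing noise maps that include sound pressure levels and sound exposure levels in different frequency bands. The study demonstrates ShipEcho's potential to support environmental assessment, decision-making and policy-making, and validates map accuracy by comparison with real acoustic recordings.

Comments 34 pages

Details
English abstract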

Underwater radiated noise from vessels (V-URN) is a recognized environmental stressor that negatively impacts marine ecosystems. Significant resources are invested in the development of V-URN monitoring indicators, regulatory frameworks, and management-oriented assessments. One approach with high potential for impact is V-URN mapping, which can provide actionable spatiotemporal information for environmental assessment and mitigation planning. Producing management-scale maps remains challenging as passive acoustic measurements are spatially sparse and many operational systems depend on specialist workflows and costly access to wide-area vessel activity data. To address these constraints, we introduce ShipEcho, a freely accessible web-based Geographic Information System (GIS) that provides near-real-time V-URN mapping using vessel data acquired through a community-based AIS exchange. Using established vessel source level (SL) models and propagation modeling informed by bathymetric data, ShipEcho produces near-real-time and cumulative noise maps across regions worldwide. These include sound pressure levels and sound exposure levels using standard indicators, including the 63 Hz and 125 Hz one-third octave bands and a 20-2000 Hz broadband level. We describe the system architecture, data pipeline, modeling workflow, and key assumptions, and evaluate map accuracy through comparison with acoustic recordings. We then demonstrate how ShipEcho can support management-level assessment, decision-making, and policy initiatives through practical use cases.

2605.08191 2026-05-12 cs.CV cs.AI

A Robust Out-of-Distribution Detection Framework via Synergistic Smoothing

Maria Stoica, Abdelrahman Hekal, Alessio Lomuscio

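AI summary This paper proposes ROSS, a robust out-of-distribution detection framework that aims to improve the reliability of machine learning systems under adversarial attack. Its core idea is to apply median smoothing to baseline detection scores and to use the generated noisy samples to quantify the local instability of the original score, which then helps separate in-distribution from out-of-distribution samples. The method performs well across several datasets, achieves symmetric robustness against both types of adversarial attack, and clearly outperforms existing methods.

Comments Accepted to CVPR Findings 2026

Details
English abstract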

Reliable out-of-distribution (OOD) detection is a critical requirement for the safe deployment of machine learning systems. Despite recent progress, state-of-the-art OOD detectors are highly susceptible to adversarial attacks, which undermines their trustworthiness in automated systems. To address this vulnerability, we apply median smoothing to baseline OOD detection scores, balancing clean and adversarial accuracies. Our key insight is that the noisy samples generated for median smoothing can be repurposed to quantify the local instability of the base score. We observe that OOD samples exhibit higher instability under perturbation. Based on this, we propose ROSS, a novel and robust post-hoc OOD detector that leverages the instability of baseline scores to further distinguish between in-distribution (ID) and OOD samples. ROSS achieves symmetric robustness, performing strongly against both score-minimising and score-maximising attacks, unlike prior work. This symmetric defence leads to state-of-the-art robustness, outperforming prior methods by up to 40 AUROC points. We demonstrate ROSS's effectiveness on extensive experiments across CIFAR-10, CIFAR-100, and ImageNet. Code is available at: https://github.com/Abdu-Hekal/ROSS.
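
The two ingredients named above, a median-smoothed score and an instability measure computed from the same noisy samples, can be sketched as follows. Everything beyond that idea is an assumption for illustration: the interquartile range as the instability measure, the linear combination in `combined_score`, the convention that higher scores mean "more in-distribution", and the parameter values are hypothetical and are not taken from the ROSS code.

```python
import random
import statistics


def smoothed_score_and_instability(score_fn, x, sigma=0.25, n=101, seed=0):
    """Score n Gaussian-perturbed copies of x and return
    (median score, interquartile range of the scores)."""
    rng = random.Random(seed)
    scores = [
        score_fn([xi + rng.gauss(0.0, sigma) for xi in x])
        for _ in range(n)
    ]
    q1, _, q3 = statistics.quantiles(scores, n=4)  # quartile cut points
    return statistics.median(scores), q3 - q1


def combined_score(median, instability, lam=1.0):
    """Hypothetical combination: penalize inputs whose base score is
    unstable under perturbation."""
    return median - lam * instability


# A base score that ignores its input is perfectly stable.
flat_median, flat_instability = smoothed_score_and_instability(
    lambda z: 0.5, [0.0, 0.0]
)
# A base score that changes with the input shows non-zero instability.
lin_median, lin_instability = smoothed_score_and_instability(
    lambda z: sum(z), [1.0, 2.0]
)
```

Reusing the noisy samples means the instability term comes at no extra model evaluations beyond those already needed for median smoothing, which matches the abstract's description of repurposing those samples.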

2605.08190 2026-05-12 cs.LG cs.SY eess.SY

Synergistic Simplex: Cooperative Runtime Assurance for Safety-Critical Autonomous Systems

Ayoosh Bansal, Mikael Yeghiazaryan, Artyom Khachatryan, Tianyi Zhu, Hunmin Kim, Naira Hovakimyan, Lui Sha

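AI summary As autonomous driving systems increasingly rely on machine-learning components for safety-critical tasks, ensuring their reliability has become an important problem. This paper proposes Synergistic Simplex (SS), a cooperative runtime assurance architecture that allows bidirectional interaction between safety monitors and machine-learning components, improving system performance while preserving formal safety guarantees. Its core innovation is to lift a restriction of conventional runtime assurance systems so that safety monitors can use machine-learning outputs, with formal analysis establishing that safety is preserved. Experiments validate the approach on obstacle detection for autonomous driving.

Details
English abstract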

Autonomous systems increasingly rely on machine-learning (ML) components for safety-critical tasks such as perception and control in autonomous vehicles (AVs). While ML enables essential capabilities, it inevitably exhibits long-tail faults that make it unsuitable for safety-critical tasks. Runtime assurance (RTA) mitigates this issue by pairing ML components with verifiable safety monitors, e.g., Control Simplex and Perception Simplex architectures. However, the limited performance of safety monitors remains a major bottleneck. The Synergistic Simplex (SS) architecture improves system performance by enabling bidirectional integration between ML components and safety monitors while preserving formal safety guarantees. The key innovation here is allowing safety monitors to use ML outputs, which is typically prohibited in RTA systems. We formally derive conditions under which this integration preserves safety and demonstrate the performance benefits. We present the design, analysis, and evaluation of SS for AV obstacle detection.

2605.08188 2026-05-12 cs.CV cs.AI

Neuroscience-Inspired Analyses of Visual Interestingness in Multimodal Transformers

Mathis Immertreu, Fitim Abdullahu, Thomas Kinfe, Helmut Grabner, Patrick Krauss, Achim Schilling

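AI summary This study examines how visual interestingness is encoded in multimodal transformer models. Using human interest scores derived from the Flickr platform, it analyzes the representational structure of the vision and language components inside Qwen3-VL-8B. It finds that information related to visual interest can be linearly decoded from final-layer embeddings and emerges progressively across intermediate vision transformer layers and language model layers, indicating that the model encodes visual interestingness in a structured way without explicit supervision. The study also shows that concept vectors extracted by different methods converge in higher layers, offering a new perspective on the mechanisms of attention and interest in humans and AI systems.

Details
English abstract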

Human attention is the gateway to conscious perception, memory and decision-making. However, its role in modern transformer models remains largely unexplored. As these systems increasingly influence what people see, prefer and buy, the question arises as to whether they encode principles of human interest or merely exploit large-scale correlations. Addressing this issue is crucial for understanding cognition and ensuring the responsible use of AI in communication and marketing. To this end, the concept of visual interest was examined within the multimodal vision-language model Qwen3-VL-8B, using a pre-defined Common Interestingness (CI) score derived from large-scale human engagement data on the photo-sharing platform Flickr. Here, we analyzed internal representations across vision and language components using methods from the neurosciences. Our analyses revealed that CI information is linearly decodable from final-layer embeddings, indicating that it is aligned with human-derived measures of visual interestingness. Dimensionality reduction and Generalized Discrimination Value (GDV) analyses demonstrate that CI-related hidden representations emerge in intermediate vision transformer layers and become progressively more distinguishable across language model layers. Concept vectors derived using geometric, probe, and Sparse Auto-Encoder based methods converge in higher layers, as confirmed by representational similarity analysis. This indicates a robust and structured encoding of visual interestingness without explicit supervision. Future work will seek to identify shared computational principles linking human brain dynamics and transformer architectures, with the ultimate goal of uncovering the organizing mechanisms that give rise to attention and interest in both biological and artificial systems.