arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.15100 2026-05-15 cs.AI

Dual-Dimensional Consistency: Balancing Budget and Quality in Adaptive Inference-Time Scaling

Rongman Xu, Yifei Li, Tianzhe Zhao, Yanrui Wu, Bo Li, Hang Yan

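AI summary: This paper studies how to balance compute budget against reasoning quality when adaptively scaling large language models at inference time. Existing methods optimize width and depth as independent objectives, which makes efficiency and accuracy hard to achieve together. To address this, the authors propose the Dual-Dimensional Consistency (DDC) framework, which combines a confidence-weighted Bayesian protocol with a trend-aware stratified pruning strategy to concentrate compute on high-quality reasoning paths, reducing hallucinations and accelerating consensus. Experiments show that the method substantially reduces computational overhead on multiple benchmarks while matching or exceeding the accuracy of strong baselines.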

Abstract

Large Language Models (LLMs) have demonstrated remarkable abilities in reasoning. However, maximizing their potential through inference-time scaling requires a trade-off between sampling budget and reasoning quality. Current strategies remain inefficient because they typically treat sampling width and depth as orthogonal objectives: width consensus methods risk reinforcing hallucinations, while depth pruning mechanisms prematurely truncate complex yet valid reasoning chains. We therefore propose Dual-Dimensional Consistency (DDC), a unified framework that bridges path quality with adaptive termination. By coupling a Confidence-Weighted Bayesian protocol with Trend-Aware Stratified Pruning, our method concentrates computational resources on high-quality reasoning paths, filtering hallucinations while accelerating consensus. Evaluations across five benchmarks demonstrate that this approach reduces token consumption by over 10 times while maintaining or exceeding the accuracy of strong baselines across various LLMs.

2605.15088 2026-05-15 cs.CV

SAGE3D: Soft-guided attention and graph excitation for 3D point cloud corner detection

Batuhan Arda Bekar, Can Sarı, Hüseyin Can Gülkan, Barış Özcan

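AI summary: This paper proposes SAGE3D, a hybrid Transformer-based model for corner detection in airborne LiDAR point clouds. The method uses a hierarchical encoder-decoder architecture that progressively downsamples the point cloud through Set Abstraction layers and recovers per-point predictions via feature propagation. It introduces a soft-guided attention mechanism and an excitatory graph neural network: the former injects ground-truth corner labels as prior information into the attention computation during training to improve precision, while the latter strengthens high-confidence corner predictions through positive-only message passing at key scales to improve recall.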

Comments
5 pages, 4 figures
Abstract

We present SAGE3D, a hybrid Transformer-based model for corner detection in airborne LiDAR point clouds. We propose a multi-stage solution built on a hierarchical encoder-decoder architecture that progressively downsamples point clouds through Set Abstraction layers and recovers per-point predictions via Feature Propagation. We introduce two innovations: Soft-Guided Attention, which injects ground-truth corner labels as a log-prior into attention logits during training to improve precision; and an Excitatory Graph Neural Network positioned at strategic resolutions in the hierarchy, which employs positive-only message passing so that high-confidence corners reinforce predictions through learned boosting, optimizing for recall. The hierarchical design enables multi-scale feature extraction, while our guided attention and excitatory modules ensure corner signals are amplified rather than diluted across scales.
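
The log-prior idea behind Soft-Guided Attention can be illustrated with a minimal sketch. The abstract gives no formula, so everything here is an assumption made for illustration: a single query, plain Python lists, the function name `soft_guided_attention`, and the smoothing constant `eps`. The only point it demonstrates is that adding the log of a prior to attention logits before the softmax is the same as multiplying the unguided weights by that prior and renormalizing.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def soft_guided_attention(logits, corner_labels, eps=0.1):
    """Add a log-prior built from 0/1 corner labels to attention logits.

    eps is a hypothetical smoothing constant: corner keys get prior 1.0 and
    non-corner keys get prior eps, so non-corners are down-weighted, not masked.
    """
    priors = [1.0 if y == 1 else eps for y in corner_labels]
    guided = [l + math.log(p) for l, p in zip(logits, priors)]
    return softmax(guided)

# Four keys with identical logits; only key 1 is labelled as a corner.
logits = [0.0, 0.0, 0.0, 0.0]
labels = [0, 1, 0, 0]
weights = soft_guided_attention(logits, labels)
```

With identical logits the resulting weights are just the normalized priors, so the labelled corner receives most of the attention mass. At inference time no labels exist, so a model trained this way would have to run without the prior term; how SAGE3D handles that is not stated in the abstract.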

2605.15083 2026-05-15 cs.LG cs.AI

Novel Dynamic Batch-Sensitive Adam Optimiser for Vehicular Accident Injury Severity Prediction

Daniel Asare Kyei, Alimatu Saadia-Yussiff, Maame G. Asante-Mensah, Abdul Lateef-Yussiff, Charles Roland Haruna, Derry Emmanuel

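AI summary: This study proposes DBS-Adam, a dynamic batch-sensitive optimiser, to address class imbalance and sequential-data handling in vehicular accident injury severity prediction. DBS-Adam dynamically adjusts the learning rate using exponential moving averages of gradient norms and batch loss, improving training stability and accelerating convergence. Experiments show that DBS-Adam achieves high accuracy and precision on the test set and shows a significant advantage in comparisons with several state-of-the-art optimisers, supporting its effectiveness on imbalanced sequential-data tasks.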

Abstract

The choice of optimiser is important in deep learning, as it strongly influences model efficiency and speed of convergence. However, many commonly used optimisers encounter difficulties when applied to imbalanced and sequential datasets, limiting their ability to capture patterns of minority classes. In this study, we propose Dynamic Batch-Sensitive Adam (DBS-Adam), an optimiser that dynamically scales the learning rate using a batch difficulty score derived from exponential moving averages of gradient norms and batch loss. DBS-Adam improves training stability and accelerates convergence by increasing updates for difficult batches and reducing them for easier ones. We evaluate DBS-Adam by integrating it with Bi-Directional LSTM networks for accident injury severity prediction, addressing class imbalance through SMOTE-ENN resampling and Focal Loss. Four experimental configurations compare baseline Bi-LSTM models and alternative architectures to assess optimiser impact. Rigorous comparison against state-of-the-art optimisers (AMSGrad, AdamW, AdaBound) across five random seeds demonstrated DBS-Adam's competitive performance with statistically significant precision improvements (p=0.020). Results indicate that DBS-Adam outperforms standard optimisation approaches, achieving 95.22% test accuracy, 96.11% precision, 95.28% recall, 95.39% F1-score, and a test loss of 0.0086. The proposed framework enables effective real-time accident severity classification for targeted emergency response and road safety interventions, demonstrating the value of DBS-Adam for learning from imbalanced sequential data.

2605.15081 2026-05-15 cs.CL cs.AI

ML-Embed: Inclusive and Efficient Embeddings for a Multilingual World

Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang

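AI summary: This paper proposes ML-Embed, a multilingual embedding framework that aims to address the high computational cost, limited language coverage, and lack of model transparency in current high-quality text embedding development. Built on the 3-Dimensional Matryoshka Learning (3D-ML) framework, the method delivers efficiency gains across the entire model lifecycle and improves parameter efficiency and linguistic inclusiveness through a multilingual dataset and a model suite ranging from 140 million to 8 billion parameters. Experiments show that ML-Embed performs strongly on multiple benchmarks, with notable results on low-resource languages, offering a reproducible approach to building equitable and efficient global AI systems.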

Comments
Accepted by ICML 2026. The data has been released earlier in the preprint arXiv:2603.19223
Abstract

The development of high-quality text embeddings is increasingly drifting toward an exclusionary future, defined by three critical barriers: prohibitive computational costs, a narrow linguistic focus that neglects most of the world's languages, and a lack of transparency from closed-source or open-weight models that stifles research. To dismantle these barriers, we introduce ML-Embed, a suite of inclusive and efficient models built upon a new framework: 3-Dimensional Matryoshka Learning (3D-ML). Our framework addresses the computational challenge with comprehensive efficiency across the entire model lifecycle. Beyond the storage benefits of Matryoshka Representation Learning (MRL) and flexible inference-time depth provided by Matryoshka Layer Learning (MLL), we introduce Matryoshka Embedding Learning (MEL) for enhanced parameter efficiency. To address the linguistic challenge, we curate a massively multilingual dataset and train a suite of models ranging from 140M to 8B parameters. In a direct commitment to transparency, we release all models, data, and code. Extensive evaluation on 430 tasks demonstrates that our models set new records on 9 of 17 evaluated MTEB benchmarks, with particularly strong results in low-resource languages, providing a reproducible blueprint for building globally equitable and computationally efficient AI systems.
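
The storage benefit attributed to Matryoshka Representation Learning comes from the fact that a prefix of the embedding is meant to remain usable on its own. A minimal sketch of the inference-side step, assuming the usual truncate-then-renormalize recipe; the vectors are made up and the helper names are hypothetical, not part of any ML-Embed API:

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and L2-normalize the result."""
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("prefix has zero norm")
    return [x / norm for x in prefix]

def cosine(a, b):
    # Inputs are already unit-length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

# Made-up 8-dimensional embeddings, for illustration only.
query = [0.9, 0.1, 0.3, -0.2, 0.05, 0.0, 0.1, -0.1]
doc = [0.8, 0.2, 0.25, -0.1, 0.0, 0.1, 0.0, -0.05]

for dim in (2, 4, 8):
    q = truncate_embedding(query, dim)
    d = truncate_embedding(doc, dim)
    print(dim, round(cosine(q, d), 4))
```

Smaller prefixes cut index size and similarity cost proportionally; whether retrieval quality holds up at a given prefix length depends on how the model was trained and has to be measured.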

2605.15079 2026-05-15 cs.LG cs.DB cs.DL cs.IR

Croissant Baker: Metadata Generation for Discoverable, Governable, and Reusable ML Datasets

Rafi Al Attrach, Rajna Fani, Sebastian Lobentanzer, Joan Giner-Miguelez, Debanshu Das, Varuni H. K., Nobin Sarwar, Rajat Ghosh, Anwai Archit, Surbhi Motghare, Christina Conrad Parry, Luis Oala, Lara Grosso, Joaquin Vanschoren, Steffen Vogler, Sujata Goswami, Eric S. Rosenthal, Marzyeh Ghassemi, Matthew McDermott, Tom Pollard

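AI summary: This paper presents Croissant Baker, an open-source tool that locally generates Croissant-compliant metadata for machine learning datasets, addressing the difficulty of producing standardized metadata in governed environments and large local data repositories. Using a modular handler registry, the tool generates validated metadata directly from a data directory and handles large datasets efficiently. Experiments show that, across multiple domains, Croissant Baker reaches 97-100% agreement with manually authored or standards-derived metadata, indicating high accuracy and practical value.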

Comments
23 pages, 5 figures, 11 tables. Project: https://lcp.mit.edu/croissant-baker/ Code: https://github.com/MIT-LCP/croissant-baker
Abstract

Croissant has emerged as the metadata standard for machine learning datasets, providing a structured, JSON-LD-based format that makes dataset discovery, automated ingestion, and reproducible analysis machine-checkable across ML platforms. Adoption has accelerated, and NeurIPS now requires Croissant metadata in every submission to its dataset tracks. Yet in practice Croissant generation usually starts with uploading data to a public platform, a path infeasible for governed and large local repositories that hold much of the high-value data ML increasingly relies on. We release Croissant Baker, a local-first, open-source command-line tool that generates validated Croissant metadata directly from a dataset directory through a modular handler registry. We evaluate Croissant Baker on over 140 datasets, scaling to MIMIC-IV at 886 million rows and 374 Parquet files. On held-out comparisons against producer-authored or standards-derived ground truth, Croissant Baker reaches 97-100% agreement across multiple domains.

2605.15077 2026-05-15 cs.CL cs.AI cs.LG

Concurrency without Model Changes: Future-based Asynchronous Function Calling for LLMs

Guangyu Feng, Huanzhi Mao, Prabal Dutta, Joseph E. Gonzalez

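AI summary: This paper proposes AsyncFC, a pure execution-layer framework that enables asynchronous function calling for large language models (LLMs) without changing the model or the function implementations. By decoupling model decoding from function execution, the method lets the two proceed in parallel, significantly reducing end-to-end task completion latency. Experiments show that AsyncFC improves task processing efficiency on multiple benchmarks while preserving task accuracy, and reveal that LLMs have a native ability to handle symbolic futures that represent unresolved execution results.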

Abstract

Function calling, also known as tool use, is a core capability of modern LLM agents but is typically constrained by synchronous execution semantics. Under these semantics, LLM decoding is blocked until each function call completes, which increases end-to-end latency. In this work, we introduce AsyncFC, a pure execution-layer framework that decouples LLM decoding from function execution, enabling overlap between model decoding and function execution as well as inter-function parallelism when dependencies permit. AsyncFC layers over existing models and unmodified function implementations, requiring no fine-tuning or changes to the standard synchronous function-calling protocol. Across standard function-calling benchmarks and adapted software engineering benchmarks, AsyncFC significantly reduces end-to-end task completion time while preserving task accuracy. Furthermore, these results reveal that LLMs possess a native capability to reason over symbolic futures that represent unresolved execution results, enabling an asynchronous paradigm for model-tool interaction.
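
The latency argument can be made concrete with a small `asyncio` sketch. This is not AsyncFC's implementation: `slow_tool` and `decode_tokens` are hypothetical stand-ins that simply sleep, and the tasks returned by `asyncio.create_task` play the role of futures for results that have not resolved yet.

```python
import asyncio
import time

async def slow_tool(name, seconds):
    """Stand-in for a function call such as an API request."""
    await asyncio.sleep(seconds)
    return f"{name}-result"

async def decode_tokens(seconds):
    """Stand-in for model decoding that does not need the tool results yet."""
    await asyncio.sleep(seconds)
    return "draft-text"

async def synchronous_style():
    # Each step blocks until the previous one has finished.
    a = await slow_tool("a", 0.2)
    b = await slow_tool("b", 0.2)
    text = await decode_tokens(0.2)
    return [a, b, text]

async def future_style():
    # Launch both independent calls immediately; each task acts as a future.
    fut_a = asyncio.create_task(slow_tool("a", 0.2))
    fut_b = asyncio.create_task(slow_tool("b", 0.2))
    text = await decode_tokens(0.2)  # overlaps with both tool calls
    return [await fut_a, await fut_b, text]

def timed(coro_fn):
    """Run a coroutine function to completion and return (result, seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start

sync_result, sync_time = timed(synchronous_style)
async_result, async_time = timed(future_style)
print(f"synchronous: {sync_time:.2f}s, future-based: {async_time:.2f}s")
```

Both variants return the same values; only the waiting overlaps. The overlap is possible here because the two calls are independent, which matches the abstract's qualifier "when dependencies permit".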

2605.15074 2026-05-15 cs.RO

SOCC-ICP: Semantics-Assisted Odometry based on Occupancy Grids and ICP

Johannes Scherer, Sebastian Hirt, Henri Meeß

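AI summary: SOCC-ICP is a semantics-assisted odometry method based on occupancy grids and ICP that aims to improve the reliability of pose estimation for autonomous systems in unknown environments. The method combines semantic occupancy grid mapping with LiDAR scan alignment; each grid voxel encodes geometric and semantic information, supporting point-to-point or point-to-plane ICP matching based on local planarity, and dynamic objects are filtered through raycasting-based free-space updates. Experiments show that SOCC-ICP is competitive with state-of-the-art LiDAR odometry methods in a variety of scenarios, remains robust in geometrically degenerate environments, and gains further localization accuracy when semantic information is available.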

Comments
9 pages, 3 figures, Accepted May 2026 for publication in IEEE Robotics and Automation Letters (RA-L)
Abstract

Reliable pose estimation in previously unseen environments is a fundamental capability of autonomous systems. Existing LiDAR odometry methods typically employ point-, surfel-, or NDT-based map representations, which are distinct from the semantic occupancy grids commonly used for downstream tasks such as motion planning. We introduce SOCC-ICP, a semantics-assisted odometry framework that jointly performs Semantic OCCupancy grid mapping and LiDAR scan alignment. Each map voxel encodes geometric and semantic statistics, enabling adaptive point-to-point or point-to-plane ICP based on local planarity. Further, the occupancy grid naturally filters dynamic objects through raycasting-based free-space updates. Across diverse evaluation scenarios, SOCC-ICP achieves performance competitive with state-of-the-art LiDAR odometry and remains robust in geometrically degenerate environments, even in the absence of semantic cues. When semantic labels are available, integrating them into map construction, downsampling, and correspondence weighting yields further accuracy gains. By unifying odometry and semantic occupancy grid mapping within a single representation, SOCC-ICP eliminates redundant map structures and directly provides a map suitable for downstream robotic applications.

2605.15071 2026-05-15 cs.CV cs.AI cs.CL

On the Cultural Anachronism and Temporal Reasoning in Vision Language Models

Mukul Ranjan, Prince Jha, Khushboo Kumari, Zhiqiang Shen

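AI summary: This study shows that vision-language models suffer from "cultural anachronism" when handling cultural heritage materials: they tend to misinterpret historical artifacts using concepts, materials, or cultural frameworks that do not fit the historical period. The authors build the TAB-VLM benchmark, comprising 1,600 Indian cultural artifacts from different periods and 600 questions, to evaluate temporal reasoning. Experiments show that even the most advanced models perform poorly on the benchmark, revealing significant shortcomings in how current vision-language models understand non-Western cultural and historical materials.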

Comments
Project Page: https://khushboo0012.github.io/tab-vlm-webpage/
Abstract

Vision-Language Models (VLMs) are increasingly applied to cultural heritage materials, from digital archives to educational platforms. This work identifies a fundamental issue in how these models interpret historical artifacts. We define this phenomenon as cultural anachronism, the tendency to misinterpret historical objects using temporally inappropriate concepts, materials, or cultural frameworks. To quantify this phenomenon, we introduce the Temporal Anachronism Benchmark for Vision-Language Models (TAB-VLM), a dataset of 600 questions across six categories, designed to evaluate temporal reasoning on 1,600 Indian cultural artifacts spanning prehistoric to modern periods. Systematic evaluations of ten state-of-the-art models reveal significant deficiencies on our benchmark, and even the best model (GPT-5.2) achieves only 58.7% overall accuracy. The performance gap persists across varying architectures and scales, suggesting that cultural anachronism represents a significant limitation in visual AI systems, regardless of model size. These findings highlight the disparity between current VLM capabilities and the requirements for accurately interpreting cultural heritage materials, particularly for non-Western visual cultures underrepresented in training data. Our benchmark provides a foundation for enhancing temporal cognition in multimodal AI systems that interact with historical artifacts. The dataset and code are available on our project page.

2605.15062 2026-05-15 cs.CV

Computational Imaging Priors for Wireless Capsule Endoscopy: Monte Carlo-Guided Hemoglobin Mapping for Rare-Anomaly Detection

Chengshuai Yang, Lei Xing, Gregory Entin, Roopa Vemulapalli, Lisa Casey, Raiyan Tripti Zaman

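AI summary: This study addresses degraded classification performance in capsule endoscopy images caused by hemoglobin contrast being conflated with bile and illumination falloff. It proposes a Monte Carlo-inspired analytic prior that computes a hemoglobin map from the RGB signal to improve detection of rare vascular anomalies. In experiments on the Kvasir-Capsule dataset, the method yields direction-consistent AUC improvements across multiple seeds, with the most pronounced effect on classes such as Lymphangiectasia. The study also shows that the method can produce interpretable heatmaps and run on plain three-channel RGB input, which gives it practical value.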

Comments
24 pages, 6 figures, 3 tables. Code and trained-model checkpoints at https://github.com/integritynoble/GI_Multi_Task . 6-seed (seeds 41, 42, 43, 44, 45, 47) mean +/- SD ablation as the headline; per-class single-seed=42 analyses in Appendix A
Abstract

Background. RGB-trained capsule-endoscopy classifiers underperform on small-vessel vascular findings by conflating hemoglobin contrast with bile and illumination falloff. Thus, here we test whether a Monte Carlo-inspired analytic model can compute hemoglobin from RGB signal built upon extracted classifier. Methods. On Kvasir-Capsule (47,238 frames, video-level 70/15/15 split, 11 evaluable classes) we evaluate two software-only configurations against RGB-only EfficientNet-B0 across 6 seeds: (i) a prior P_blood = sigma(alpha * (H_norm - 0.5)) * Phi(r) fused as 2 zero-init auxiliary channels; (ii) a distillation head training a 3-channel RGB backbone to predict P_blood. Significance: paired DeLong, McNemar, bootstrap CIs with Bonferroni correction. Results. Across 6 seeds (n=6,423), the analytic prior provides a small but direction-consistent macro-AUC improvement: RGB-only 0.760 +/- 0.027, input-fusion 0.783 +/- 0.024 (paired Delta = +0.023, sign-positive on 5/6 seeds), distillation 0.773 +/- 0.028. The largest robust per-class lift is on Lymphangiectasia, where AUC rises from RGB 0.238 +/- 0.057 to input-fusion 0.337 +/- 0.019, sign-consistent across all 6 seeds. On rare focal-vascular classes (Angiectasia, Blood - fresh) the prior's per-seed effects are bimodal: seed=42 reaches Angiectasia AUC 0.528 -> 0.916, but the cross-seed mean is 0.646 -> 0.608 with sigma_PI = 0.23 - reported as a high-variance per-seed exemplar. Conclusion. A Monte Carlo-inspired analytic prior provides a small, direction-consistent macro-AUC improvement on Kvasir-Capsule across 6 seeds with the largest robust per-class lift on Lymphangiectasia; the distillation variant runs on plain 3-channel RGB and yields a free interpretability heatmap.

2605.15055 2026-05-15 cs.LG cs.CV

DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models

Quanhao Li, Junqiu Yu, Kaixun Jiang, Yujie Wei, Zhen Xing, Pandeng Li, Ruihang Chu, Shiwei Zhang, Yu Liu, Zuxuan Wu

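AI summary: This paper proposes DiffusionOPD, a multi-task training framework for improving the image generation ability of diffusion models. Based on Online Policy Distillation (OPD), the method trains task-specific teacher models independently and then distills their knowledge into a unified student model along the student's own generation trajectories, decoupling single-task exploration from multi-task integration and avoiding the interference and imbalance of joint optimization. The theoretical analysis extends the OPD framework from discrete tokens to continuous-state Markov processes and derives a unified KL-divergence objective; the method improves training efficiency and generation quality and achieves superior performance on multiple benchmarks.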

Abstract

Reinforcement learning has emerged as a powerful tool for improving diffusion-based text-to-image models, but existing methods are largely limited to single-task optimization. Extending RL to multiple tasks is challenging: joint optimization suffers from cross-task interference and imbalance, while cascade RL is cumbersome and prone to catastrophic forgetting. We propose DiffusionOPD, a new multi-task training paradigm for diffusion models based on Online Policy Distillation (OPD). DiffusionOPD first trains task-specific teachers independently, then distills their capabilities into a unified student along the student's own rollout trajectories. This decouples single-task exploration from multi-task integration and avoids the optimization burden of solving all tasks jointly from scratch. Theoretically, we lift the OPD framework from discrete tokens to continuous-state Markov processes, deriving a closed-form per-step KL objective that unifies both stochastic SDE and deterministic ODE refinement via mean-matching. We formally and empirically demonstrate that this analytic gradient provides lower variance and better generality compared to conventional PPO-style policy gradients. Extensive experiments show that DiffusionOPD consistently surpasses both multi-reward RL and cascade RL baselines in training efficiency and final performance, while achieving state-of-the-art results on all evaluated benchmarks.

2605.15054 2026-05-15 cs.CV

LATERN: Test-Time Context-Aware Explainable Video Anomaly Detection

Mitchell Piehl, Muchao Ye

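AI summary: This paper proposes LATERN, a context-aware explainable video anomaly detection framework that aims to address the lack of structured temporal context in existing vision-language model pipelines for video anomaly detection. By introducing a Context-Aware Anomaly Scoring module and a Recursive Evidence Aggregation module, the method models video anomaly detection as a temporal evidence aggregation process, producing more accurate and semantically coherent event-level explanations. Experiments show that LATERN improves the test-time detection accuracy and explanation consistency of frozen models on several challenging benchmark datasets.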

Abstract

Vision-language models (VLMs) have recently emerged as a promising paradigm for video anomaly detection (VAD) due to their strong visual reasoning ability and natural language-based explainability. In this paper, we address a key limitation of such pipelines: owing to token constraints, they perform segment-level inference independently and reason without structured temporal context. Our goal is to let VLMs interpret anomalies as deviations from evolving video dynamics rather than produce fragmented predictions and explanations. Specifically, we propose a context-aware framework named LATERN, which reformulates VAD as a temporal evidence aggregation process. LATERN consists of two complementary modules: Context-Aware Anomaly Scoring (CEA) and Recursive Evidence Aggregation (REA). CEA introduces a novel image-grounded memory mechanism, which selects historical content via frame diversity and visual-textual alignment as expanded context to help generate reliable anomaly scores. Building upon these scores, REA performs recursive temporal aggregation to identify coherent anomaly intervals and produce event-level decisions and explanations grounded in visual-textual evidence. Extensive experiments on challenging benchmarks, including UCF-Crime and XD-Violence, show that LATERN enhances detection accuracy and explanation consistency for frozen VLMs during test time, while generating temporally coherent and semantically grounded event-level explanations.

2605.15051 2026-05-15 cs.LG cs.PF

An Interpretable Latency Model for Speculative Decoding in LLM Serving

Linghao Kong, Megan Flynn, Michael Peng, Nir Shavit, Mark Kurtz, Alexandre Marques

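AI summary: This paper studies how an interpretable latency model can explain the performance of speculative decoding (SD) in large language model serving systems. The authors propose a simple, interpretable latency model based on Little's Law to analyze SD behavior under different load conditions, decomposing per-request latency into load-independent and load-dependent components. Validated through extensive experiments, the model accurately describes observed latency, explains why speedups diminish as server load increases, and shows how draft length, acceptance rate, and verifier-drafter model size affect latency, providing guidance for configuring SD in real deployments.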

Comments
10 pages, 8 figures
Abstract

Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller draft model to propose multiple tokens that are verified by a larger target model in parallel. While prior work demonstrates substantial speedups in isolated or fixed-batch settings, the behavior of SD in production serving systems remains poorly understood: request load varies over time, and effective batch size emerges from the serving system rather than being directly controlled or observed. In this work, we develop a simple and interpretable latency model for SD in LLM serving. We infer effective batch size from request rate using Little's Law and decompose per-request demand into load-independent and load-dependent components for prefill, drafting, and verification. We validate our model using extensive measurements from vLLM across verifier and drafter model sizes, prefill and decode lengths, request rates, draft lengths, and acceptance probabilities. The model accurately describes observed latency, explains why speedups often diminish as server load increases, and characterizes how draft length, acceptance rate, and verifier-drafter size shape latency across serving conditions, with implications for configuring SD in deployed systems. We further show how the framework extends to mixture of experts models, where sparse expert activation changes the effective service costs across load regimes. Together, our results provide a structured framework for understanding SD in real LLM serving systems.
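
Little's Law states that the mean number of requests in a system equals the arrival rate times the mean time each request spends there. A minimal worked sketch of how that can yield an effective batch size, under a strong assumption that is not taken from the paper: latency is linear in batch size, `W(B) = a + b * B`, with `a` load-independent and `b` the load-dependent cost per concurrent request. Substituting into `B = rate * W(B)` gives `B = rate * a / (1 - rate * b)`. All names and numbers below are hypothetical.

```python
def effective_batch_size(rate, base_latency, per_request_cost):
    """Solve B = rate * (base_latency + per_request_cost * B) for B.

    rate: requests per second (lambda in Little's Law).
    base_latency: assumed load-independent latency in seconds.
    per_request_cost: assumed extra seconds of latency per concurrent request.
    """
    utilization = rate * per_request_cost
    if utilization >= 1.0:
        raise ValueError("unstable: rate * per_request_cost must be below 1")
    return rate * base_latency / (1.0 - utilization)

def latency(batch, base_latency, per_request_cost):
    """Assumed linear latency model W(B) = a + b * B."""
    return base_latency + per_request_cost * batch

# Hypothetical numbers, chosen only to keep the arithmetic easy to follow.
rate, a, b = 4.0, 2.0, 0.05
batch = effective_batch_size(rate, a, b)
w = latency(batch, a, b)
print(f"batch ~ {batch:.2f}, latency ~ {w:.2f}s, rate * latency ~ {rate * w:.2f}")
```

With these numbers the fixed point is about 10 concurrent requests at about 2.5 s each, and `rate * latency` recovers the batch size, as Little's Law requires. The denominator shows why latency grows sharply as `rate * b` approaches 1, which is one simple way to picture speedups shrinking under load.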

2605.15050 2026-05-15 cs.LG

Separating Intrinsic Ambiguity from Estimation Uncertainty in Deep Generative Models for Linear Inverse Problems

Yuxin Guo, Dongrui Deng, Pulkit Grover

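AI summary: This paper studies how to separate intrinsic ambiguity from estimation uncertainty within posterior uncertainty when deep generative models are used for linear inverse problems. The authors propose a structural decomposition that splits posterior uncertainty into interpretable components, exposing potential problems in model predictions. A cascade formulation makes the intrinsic ambiguity accessible for analysis, and the method is applied to practical tasks such as magnetic resonance imaging and EEG source imaging, improving model interpretability and calibration.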

Abstract

Recently, deep generative models have been used for posterior inference in inverse problems, including high-stakes applications in medical imaging and scientific discovery, where the uncertainty of a prediction can matter as much as the prediction itself. However, posterior uncertainty is difficult to interpret because it can mix ambiguity inherent to the forward operator with uncertainty propagated through inference. We introduce a structural decomposition of posterior uncertainty that isolates intrinsic ambiguity. A cascade formulation makes this ambiguity accessible for calibration analysis, enabling qualitative diagnostics and simulation-based calibration tests that reveal failure modes that remain hidden when models are selected by reconstruction quality alone. We first validate the approach on a Gaussian example with analytical posterior structure, then illustrate the decomposition on accelerated magnetic resonance imaging (MRI), and finally apply the calibration diagnostics to electroencephalography (EEG) source imaging.

2605.15049 2026-05-15 cs.RO cs.MA cs.SY eess.SY

A Prototyping Framework for Distributed Control of Multi-Robot Systems

Junaid Ahmed Memon, Allan Andre Do Nascimento, Kostas Margellos, Antonis Papachristodoulou

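AI summary: This paper presents a prototyping framework for distributed control of multi-robot systems that aims to connect theoretical work on distributed optimization algorithms with practical testing. Based on the Single Program, Multiple Data (SPMD) paradigm, the framework emulates distributed control on a single computer, with each core running the same algorithm using local states and neighbour-to-neighbour communication. The framework is validated by applying a non-cooperative game-theoretic algorithm to a quadrotor position-swapping task across different dynamics models, including a point-mass model, a high-fidelity quadrotor model, and a real hardware testbed, demonstrating a low-cost and accessible way to validate algorithms.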

Comments
Accepted at IFAC World Congress 2026
Abstract

This paper presents a prototyping framework for distributed control of multi-robot systems, aimed at bridging theory and practical testing of distributed optimization algorithms. Using the Single Program, Multiple Data (SPMD) paradigm, the framework emulates distributed control on a single computer, with each core running the same algorithm using local states and neighbour-to-neighbour communication. We demonstrate the framework on a four-quadrotor position-swapping task using a non-cooperative game-theoretic distributed algorithm. Computational time and trajectory data are compared across the supported dynamics levels: a point-mass model, a high-fidelity quadrotor model, and an experimental hardware testbed using Crazyflie quadcopters. The results show that the framework provides a low-cost and accessible approach for validating distributed algorithms.

2605.15044 2026-05-15 cs.SD cs.AI cs.LG cs.MM eess.AS

SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning

KiHyun Nam, Jungwoo Heo, Siu Bae, Ha-Jin Yu, Joon Son Chung

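AI summary: As physical AI, conversational robots, and screenless wearables develop, audio large language models need speaker-specific understanding to support user authorization, personalization, and context-aware interaction. This paper proposes SpeakerLLM, a speaker-specialized audio-LLM framework that unifies single-utterance speaker profiling, recording-condition understanding, utterance-pair speaker comparison, and evidence-based verification reasoning. At its core is a hierarchical speaker tokenizer that captures multi-granularity information about speaker identity and recording conditions, while structured reasoning traces improve the accuracy and interpretability of verification reasoning.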

Abstract

As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) must integrate speaker-specific understanding to support user authorization, personalization, and context-aware interaction. This requires modeling who is speaking, how the voice sounds, and how recording conditions affect speaker cues. Conventional speaker verification systems provide strong scalar scores but little linguistic evidence, while current audio-LLMs and speaker-aware language models have limited ability to organize speaker information beyond binary labels or descriptive profiles. We present SpeakerLLM, a speaker-specialized audio-LLM framework that unifies single-utterance speaker profiling, recording-condition understanding, utterance-pair speaker comparison, and evidence-organized verification reasoning within a natural-language interface. We construct verification-reasoning targets and a decision-composition policy that separate profile-level evidence from the final same-or-different decision and organize recording condition, profile evidence, and the decision into a structured trace. At its core, SpeakerLLM uses a hierarchical speaker tokenizer designed to capture multiple granularities of speaker evidence. Utterance-level speaker embeddings summarize identity and profile-level cues, whereas frame-level speaker features preserve fine-grained acoustic descriptors. Experiments show that SpeakerLLM-Base improves speaker-profile and recording-condition understanding over general audio-LLMs, while SpeakerLLM-VR preserves strong generated-verdict accuracy and produces decision traces grounded in the supervised verification reasoning schema. We will release the metadata-enriched supervision dataset and target-construction code for reproducibility.

2605.15042 2026-05-15 cs.CV cs.AI

EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration

Wuyang Li, Yang Gao, Mariam Hassan, Lan Feng, Wentao Pan, Po-Chien Luan, Alexandre Alahi

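AI summary: EverAnimate is an efficient post-training method for generating high-quality long-horizon animated video while keeping visual quality and character identity consistent. By introducing two mechanisms, Persistent Latent Propagation and Restorative Flow Matching, the method addresses the detail degradation and semantic inconsistency caused by chunk-based generation in long videos. Experiments show that, with only lightweight LoRA tuning, EverAnimate outperforms existing methods on both short- and long-horizon animation generation, improving image fidelity and visual quality.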

Comments
Project Page: https://everanimate.github.io/homepage/
Abstract

We propose EverAnimate, an efficient post-training method for long-horizon animated video generation that preserves visual quality and character identity. Long-form animation remains challenging because highly dynamic human motion must be synthesized against relatively static environments, making chunk-based generation prone to accumulated drift: (i) low-level quality drift, such as progressive degradation of static backgrounds, and (ii) high-level semantic drift, such as inconsistent character identity and view-dependent attributes. To address this issue, EverAnimate restores drifted flow trajectories by anchoring generation to a persistent latent context memory, consisting of two complementary mechanisms. (i) Persistent Latent Propagation maintains a context memory across chunks to propagate identity and motion in latent space while mitigating temporal forgetting. (ii) Restorative Flow Matching introduces an implicit restoration objective during sampling through velocity adjustment, improving within-chunk fidelity. With only lightweight LoRA tuning, EverAnimate outperforms state-of-the-art long-animation methods in both short- and long-horizon settings: at 10 seconds, it improves PSNR/SSIM by 8%/7% and reduces LPIPS/FID by 22%/11%; at 90 seconds, the gains increase to 15%/15% and 32%/27%, respectively.

2605.15041 2026-05-15 cs.AI cs.CL

Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use

Renning Pang, Tian Lan, Leyuan Liu, Piao Tong, Sheng Cao, Xiaosong Zhang

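AI summary: This paper studies how a case-driven approach can improve the reasoning and execution abilities of large language models in tool use. It proposes CAST, a framework that treats historical execution trajectories as structured cases and extracts complexity and failure profiles from them to guide the model toward better reasoning strategies and away from structural errors. Experiments show that CAST raises tool-use success rates while keeping execution structurally correct and reduces unnecessary reasoning steps, improving overall performance.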

Abstract

Tool use extends large language models beyond parametric knowledge, but reliable execution requires balancing appropriate reasoning depth with strict structural validity. We approach this problem from a case-based perspective to present CAST, a case-driven framework that treats historical execution trajectories as structured cases. Instead of reusing raw exemplar outputs, CAST extracts case-derived signals to identify complexity profiles for estimating optimal reasoning strategies, alongside failure profiles to map likely structural breakdowns. The framework translates this knowledge into a fine-grained reward design and adaptive reasoning, enabling the model to autonomously internalize case-based strategies during reinforcement learning. Experiments on BFCLv2 and ToolBench demonstrate that CAST improves both schema-faithful execution and task-level tool-use success while reducing unnecessary deliberation. The approach achieves up to 5.85 percentage points gain in overall execution accuracy and reduces average reasoning length by 26%, significantly mitigating high-impact structural errors. Ultimately, this demonstrates how historical execution cases can provide reusable adaptation knowledge for calibrated tool use.

2605.15035 2026-05-15 cs.LG

TopoPrimer: The Missing Topological Context in Forecasting Models

Zara Zetlin, Kayhan Moharreri, Maria Safi

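AI summary: This paper proposes TopoPrimer, a framework that improves forecasting accuracy by making the global topological structure of the series population an explicit input to any forecasting model. The method precomputes topological information via persistent homology and spectral sheaf coordinates, keeps forecasts stable under seasonal demand peaks, and mitigates the cold-start problem. Experiments show that TopoPrimer improves forecasting performance on multiple public benchmark datasets, with the largest gains in complex and difficult regimes.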

Comments
29 pages, 16 figures
Abstract

We introduce TopoPrimer, a framework that makes the global topological structure of the series population an explicit input to any forecasting model. TopoPrimer improves accuracy across diverse domains, stabilizes forecasts under seasonal demand spikes, and closes the cold-start gap. Precomputed once per domain via persistent homology and spectral sheaf coordinates, TopoPrimer deploys per token for fully-trained models and as a lightweight adapter for pre-trained backbones. Of these two components, sheaf coordinates are the primary accuracy driver. Across four public benchmarks on Chronos and TimesFM, TopoPrimer consistently improves forecasting accuracy, with gains of up to 7.3% MSE on ECL. The topology advantage persists with near-identical magnitude across zero-shot and fine-tuned backbones, suggesting topology and per-series training capture complementary signals. The gains are most pronounced in difficult regimes. Under peak seasonal demand, classical and zero-shot models degrade by up to 50%, while TopoPrimer stays within 10%. At cold start with no item history, TopoPrimer reduces MAE by 27% over a topology-free baseline.

2605.15034 2026-05-15 cs.CL cs.AI cs.CY cs.MA

AI Knows When It's Being Watched: Functional Strategic Action and Contextual Register Modulation in Large Language Models

Vinicius Covas, Jorge Alberto Hidalgo Toledo

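AI summary: This study examines whether large language models (LLMs) exhibit systematic linguistic adaptation when they perceive a context of social observation, a question with direct implications for AI governance and auditing. Drawing on sociological theory, it uses a controlled experiment to analyze how a multi-agent debate system behaves under different observation conditions, and finds that the models adjust their linguistic style differently for human and AI observers, indicating that their behavior is sensitive to observer identity. The results offer a new perspective on LLMs as contextually sensitive communicative actors and carry implications for algorithmic auditing and AI governance.

Details
Comments
20 pages, 6 figures
Abstract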

Large language models (LLMs) have been extensively studied from computational and cognitive perspectives, yet their behavior as communicative actors in socially structured contexts remains underexplored. This study examines whether LLM-based multi-agent systems exhibit systematic linguistic adaptation in response to perceived social observation contexts -- a question with direct implications for AI governance and auditing. Drawing on Habermas's (1981) Theory of Communicative Action, Goffman's (1959) dramaturgical model, Bell's (1984) Audience Design framework, and the Hawthorne Effect, we report a controlled experiment involving 100 multi-agent debate sessions across five conditions (n = 20 each). Conditions varied the framing of social observation -- from explicit monitoring by university researchers, to negation of monitoring, to an observer-substitution condition replacing human researchers with an automated AI auditing system. Monitored conditions (Delta+24.9%, Delta+24.2%) and the automated AI monitoring condition (Delta+22.2%) produce higher TTR change than audience-framing conditions (Delta+17.7%), F(4, 94) = 2.79, p = .031. Message length shows a fully dissociated effect, F(4, 95) = 19.55, p < .001. A fifth condition -- replacing human with AI observers -- yields intermediate TTR adaptation, suggesting LLM behavior is sensitive to observer identity: human evaluation elicits stronger register formalization than automated AI surveillance. We discuss implications for AI governance, algorithmic auditing, and the repositioning of LLMs as contextually sensitive communicative actors.
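
The abstract reports "TTR change" without expanding the acronym. The sketch below assumes TTR means type-token ratio (unique word forms divided by total word tokens), a standard lexical-diversity measure. The tokenizer, the function names, and the choice of a relative percentage change against a baseline are illustrative assumptions, not the authors' measurement pipeline.

```python
import re

def type_token_ratio(text: str) -> float:
    """Unique word forms divided by total word tokens (0.0 for empty text)."""
    # Assumption: lowercase alphabetic tokens; the paper's tokenizer is not stated.
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def relative_change(baseline: float, observed: float) -> float:
    """Percentage change of `observed` relative to `baseline`."""
    return 100.0 * (observed - baseline) / baseline

# Hypothetical messages from a baseline session and a monitored session.
baseline_ttr = type_token_ratio("the model said the model agreed")
monitored_ttr = type_token_ratio("the model carefully qualified each claim")
print(f"TTR change: {relative_change(baseline_ttr, monitored_ttr):+.1f}%")
```

Raw TTR falls as texts get longer, so a comparison across conditions with different message lengths would normally need some length control; the abstract does not say how this was handled.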

2605.15024 2026-05-15 cs.CV

HiSem: Hierarchical Semantic Disentangling for Remote Sensing Image Change Captioning

Man Wang, Chenyang Liu, Wenjun Li, Feng Ni, Bing Jia, Baoqi Huang, Riting Xia, Zhenwei Shi

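AI summary: This paper proposes HiSem, a hierarchical semantic disentangling network that addresses semantic entanglement in remote sensing image change captioning. It introduces a Bidirectional Differential Attention Modulation module to strengthen cross-temporal interaction and a Hierarchical Adaptive Semantic Disentanglement module to separate semantic representations of different granularities, so that changed and unchanged image pairs are distinguished more accurately and fine-grained change semantics are modeled. Experiments show that HiSem outperforms existing methods on two benchmark datasets, improving BLEU-4 by 7.52% on the WHU-CDC dataset, and offers a structured modeling perspective for the task.

Details
Abstract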

Remote sensing image change captioning (RSICC) aims to achieve high-level semantic understanding of genuine changes occurring between bi-temporal images. Despite notable progress, existing methods are fundamentally limited by a shared modeling assumption: changed and unchanged image pairs, which have intrinsically different semantic granularities, are processed under a unified modeling strategy. This modeling inconsistency leads to semantic entanglement between coarse-grained change existence judgment and fine-grained semantic understanding. To address the above limitation, we propose a novel hierarchical semantic disentangling network (HiSem) that explicitly disentangles semantic representations of different granularities. Specifically, we first introduce the Bidirectional Differential Attention Modulation (BDAM) module that leverages discrepancy-aware attention to enhance cross-temporal interactions, thereby amplifying true change signals while suppressing irrelevant variations. Building upon this, we design a Hierarchical Adaptive Semantic Disentanglement (HASD) module that performs adaptive routing at two hierarchical levels: a coarse-grained image-level routing mechanism distinguishes changed and unchanged image pairs, while a fine-grained token-level Mixture-of-Experts (MoE) block models diverse and heterogeneous change semantics for changed samples. Extensive experiments on two benchmark datasets demonstrate that HiSem outperforms previous methods, achieving a significant improvement of +7.52% BLEU-4 on the WHU-CDC dataset. More importantly, our approach provides a structured perspective for RSICC by explicitly aligning model design with the intrinsic semantic heterogeneity of bi-temporal scenes. The code will be available at https://github.com/Man-Wang-star/HiSem

2605.15019 2026-05-15 cs.CL

From Scenes to Elements: Multi-Granularity Evidence Retrieval for Verifiable Multimodal RAG

Guanhua Chen, Chuyue Huang, Yutong Yao, Shudong Liu, Xueqing Song, Lidia S. Chao, Derek F. Wong

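AI summary: This paper studies evidence retrieval for fine-grained queries in multimodal Retrieval-Augmented Generation (RAG) systems and proposes GranuRAG, a multi-granularity evidence retrieval framework. The method treats visual elements as the basic retrieval units and proceeds in three stages (element-level detection, cross-modal alignment, and constrained generation), improving both retrieval precision and interpretability. Experiments show improvements of up to 29.2% over several strong baselines on a multimodal question-answering task built from real-world scenes.

Details
Abstract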

Multimodal Retrieval-Augmented Generation (RAG) systems retrieve evidence at coarse granularities (entire images or scenes), creating a mismatch with fine-grained user queries and making failures unverifiable. We introduce GranuVistaVQA, a multimodal benchmark featuring real-world landmarks with element-level annotations across multiple viewpoints, capturing the partial observation challenge where individual images contain only subsets of entities. We further propose GranuRAG, a multi-granularity framework that treats visual elements as first-class retrieval units through three stages: element-level detection and classification, multi-granularity cross-modal alignment for evidence retrieval, and attribution-constrained generation. By grounding retrieval at the element level rather than relying on implicit attention, our approach enables transparent error diagnosis. Experiments demonstrate that GranuRAG achieves up to 29.2% improvement over six strong baselines for this task.

2605.15018 2026-05-15 cs.LG cs.AI

Generalized Priority-Aware Shapley Value

Kiljae Lee, Ziqi Liu, Weijing Tang, Yuan Zhang

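AI summary: This paper proposes the generalized priority-aware Shapley value (GPASV) for valuation problems in machine learning. Existing methods require priority relations to be binary and acyclic, but cyclic or multi-criterion comparisons are common in practice. GPASV supports arbitrary directed weighted priority graphs in which edge weights penalize, rather than forbid, order violations, allowing more flexible modeling of the priority relations found in real data. The method is grounded in an axiomatic characterization and is applied to LLM ensemble valuation, showing that the choice of priority balance substantially affects the resulting valuations.

Details
Abstract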

Shapley value and its priority-aware extensions are widely used for valuation in machine learning, but existing methods require pairwise priority to be binary and acyclic, a restriction spectacularly violated in real-data examples such as aggregated human preferences and multi-criterion comparisons. We introduce the generalized priority-aware Shapley value (GPASV), a random order value defined on arbitrary directed weighted priority graphs, in which pairwise edges penalize rather than forbid order violations. GPASV covers a range of classical models as boundary cases. We establish GPASV through an axiomatic characterization, develop the associated computational methods, and introduce a priority sweeping diagnostic extending PASV's. We apply GPASV to LLM ensemble valuation on the cyclic Chatbot Arena preference graph, illustrating that priority-aware valuation is not a one-button operation: different balances of pairwise graph priority versus individual soft priority produce substantively different valuations of the same data.
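
To make the idea of a random order value whose priority edges "penalize rather than forbid" violations more concrete, here is a brute-force sketch. It is not the paper's definition of GPASV: the exponential down-weighting of orderings, the `strength` parameter, and the function name are all assumptions chosen for illustration. With `strength=0` the sketch reduces to the ordinary Shapley value, and cyclic priority graphs pose no problem because no ordering is ever excluded outright.

```python
from itertools import permutations
from math import exp

def penalized_random_order_value(players, value, edge_weights, strength=1.0):
    """Average marginal contributions over all orderings, down-weighting
    orderings that violate weighted priority edges.

    edge_weights[(u, v)] = w means "u should precede v" with weight w.
    Assumption: an ordering's weight is exp(-strength * total violated weight).
    """
    totals = {p: 0.0 for p in players}
    norm = 0.0
    for order in permutations(players):
        position = {p: i for i, p in enumerate(order)}
        # Sum the weights of edges whose preferred order is reversed.
        penalty = sum(w for (u, v), w in edge_weights.items()
                      if position[u] > position[v])
        weight = exp(-strength * penalty)
        norm += weight
        coalition = frozenset()
        for p in order:
            grown = coalition | {p}
            totals[p] += weight * (value(grown) - value(coalition))
            coalition = grown
    return {p: t / norm for p, t in totals.items()}

# Toy game: the coalition is worth 1 only when both "a" and "b" are present.
def pair_game(coalition):
    return 1.0 if {"a", "b"} <= coalition else 0.0

plain = penalized_random_order_value("abc", pair_game, {})
cyclic = penalized_random_order_value(
    "abc", pair_game, {("a", "b"): 2.0, ("b", "c"): 1.0, ("c", "a"): 1.0})
print(plain, cyclic)
```

Enumerating all orderings costs factorial time, so this is only usable for a handful of players; the abstract mentions dedicated computational methods, which are not reproduced here.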

2605.15016 2026-05-15 cs.CL cs.AI

COTCAgent: Preventive Consultation via Probabilistic Chain-of-Thought Completion

Zihan Deng, Xiaozhen Zhong, Chuanzhi Xu

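AI summary: As large language models are applied in healthcare, intelligent clinical decision support systems have developed rapidly. However, existing models show insufficient statistical reasoning and have difficulty modeling temporal dependencies when processing longitudinal electronic health records (EHR). To address this, the paper proposes COTCAgent, a hierarchical reasoning framework based on probabilistic chain-of-thought completion. By decoupling statistical computation, feature matching, and language generation, it improves the analysis of long-term health records and outperforms existing methods on several medical datasets.

Details
Abstract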

As large language models empower healthcare, intelligent clinical decision support has developed rapidly. Longitudinal electronic health records (EHR) provide essential temporal evidence for accurate clinical diagnosis and analysis. However, current large language models have critical flaws in longitudinal EHR reasoning. First, lacking fine-grained statistical reasoning, they often hallucinate clinical trends and metrics when quantitative evidence is textually implied, biasing diagnostic inference. Second, non-uniform time series and scarce labels in longitudinal EHR hinder models from capturing long-range temporal dependencies, limiting reliable clinical reasoning. To address the above limitations, this work presents the Probabilistic Chain-of-Thought Completion Agent (COTCAgent), a hierarchical reasoning framework for longitudinal electronic health records. It consists of three core modules. The Temporal-Statistics Adapter (TSA) converts analytical plans into executable code for standardized trend output. The Chain-of-Thought Completion (COTC) layer leverages a symptom-trend-disease knowledge base with weighted scoring to evaluate disease risk, while the bounded completion module acquires structured evidence through standardized inquiries and iterative scoring constraints to ensure rigorous reasoning. By decoupling statistical computation, feature matching, and language generation, the framework eliminates reliance on complex multi-modal inputs and enables efficient longitudinal record analysis with lower computational overhead. Experimental results show that COTCAgent powered by Baichuan-M2 achieves 90.47% Top-1 accuracy on the self-built dataset and 70.41% on HealthBench, outperforming existing medical agents and mainstream large language models. The code is available at https://github.com/FrankDengAI/COTCAgent/.

2605.15015 2026-05-15 cs.AI cs.CL cs.HC

Small, Private Language Models as Teammates for Educational Assessment Design

Chris Davis Jaldi, Anmol Saini, Shan Zhang, Noah Schroeder, Cogan Shimizu, Eleni Ilkou

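AI summary: This study explores the use of small, private language models (SLMs) in educational assessment design, aiming to address the privacy and resource limitations of large language models (LLMs). It systematically compares LLMs and SLMs on generating assessment questions, evaluates generation quality with reproducible, pedagogically grounded metrics, and analyzes the agreement and bias between model-based and expert ratings. The results show that SLMs perform competitively on key pedagogical quality dimensions and support local deployment, but model-based ratings still exhibit systematic inconsistencies and bias, underscoring the need for human-in-the-loop assessment workflows.

Details
Abstract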

Generative AI increasingly supports educational design tasks, e.g., through Large Language Models (LLMs), demonstrating the capability to design assessment questions that are aligned with pedagogical frameworks (e.g., Bloom's taxonomy). However, they often rely on subjective or limited evaluation methods; focus primarily on proprietary models; or rarely systematically examine generation, evaluation, or deployment constraints in real educational settings. Meanwhile, Small Language Models (SLMs) have emerged as local alternatives that better address privacy and resource limitations; yet their effectiveness for assessment tasks remains underexplored. To address this gap, we systematically compare LLMs and SLMs for assessment question design; evaluate generation quality across Bloom's taxonomy levels using reproducible, pedagogically grounded metrics; and further assess model-based judging against expert-informed evaluation by analyzing reliability and agreement patterns. Results show that SLMs achieve competitive performance across key pedagogically motivated quality dimensions while enabling local, privacy-sensitive deployment. However, model-based evaluations also exhibit systematic inconsistencies and bias relative to expert ratings. These findings provide evidence to posit language models as bounded assistants in assessment workflows; underscore the necessity of Human-in-the-Loop; and advance the automated educational question generation field by examining quality, reliability, and deployment-aware trade-offs.

2605.15012 2026-05-15 cs.LG cs.AI cs.CL

Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance

Kai Yan, Alexander G. Schwing, Yu-Xiong Wang

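AI summary: This paper proposes FEST, a new algorithm for reinforcement learning with verifiable rewards that targets low sample efficiency on difficult tasks. The method is guided by a small set of randomly selected demonstrations and achieves strong results with only 128 examples, greatly reducing the dependence on large amounts of supervised data. The study finds that combining a supervised signal, an on-policy signal, and decaying weights on the few-shot demonstration data is key to its performance. Experiments show that FEST outperforms conventional methods on several benchmarks and reaches comparable or better performance even with far less supervised data.

Details
Comments
25 pages, 11 figures
Abstract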

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved great success in developing Large Language Models (LLMs) with chain-of-thought rollouts for many tasks such as math and coding. Nevertheless, RLVR struggles with sample efficiency on difficult problems where correct rollouts are hard to generate. Prior works propose to address this issue via demonstration-guided RLVR, i.e., to conduct Supervised FineTuning (SFT) when RL fails; however, SFT often requires a lot of data, which can be expensive to acquire. In this paper, we propose FEST, a FEw-ShoT demonstration-guided RLVR algorithm. It attains compelling results with only 128 demonstrations randomly selected from an SFT dataset. We find that three components are vital for the success: supervised signal, on-policy signal, and decaying weights on the few-shot SFT dataset to prevent overfitting from multiple-epoch training. On several benchmarks, FEST outperforms baselines while using orders of magnitude less SFT data, even matching the performance they reach with the full dataset.
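
The abstract names three ingredients (a supervised signal, an on-policy signal, and decaying weights on the few-shot SFT data) but not how they are combined. The sketch below shows one plausible arrangement under loudly stated assumptions: an additive objective and an exponential decay schedule. The function names, the default `decay` value, and the schedule shape are hypothetical and are not taken from the paper.

```python
def sft_weight(step: int, initial: float = 1.0, decay: float = 0.99) -> float:
    """Hypothetical exponential decay for the few-shot SFT term.

    Assumption: the weight shrinks each training step so that repeated
    epochs over the same small demonstration set matter less and less.
    """
    return initial * decay ** step

def combined_loss(rl_loss: float, sft_loss: float, step: int) -> float:
    """On-policy (RL) loss plus a decaying supervised (SFT) loss term."""
    return rl_loss + sft_weight(step) * sft_loss

# The supervised term dominates early and fades as training proceeds.
for step in (0, 100, 500):
    print(step, round(combined_loss(rl_loss=1.0, sft_loss=2.0, step=step), 4))
```

The point of the decaying weight, as the abstract describes it, is to limit overfitting to the few demonstrations during multiple-epoch training; the numbers above are placeholders rather than reported hyperparameters.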

2605.15009 2026-05-15 cs.LG

DeepTokenEEG Enhancing Mild Cognitive Impairment and Alzheimers Classification via Tokenized EEG Features

Thinh Nguyen-Quang, Minh Long Ngo, Ngoc-Son Nguyen, Nguyen Thanh Vinh, Huy-Dung Han, Bui Thanh Tung, Nguyen Quang Linh, Khuong Vo, Manoj Vishwanath, Hung Cao

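AI summary: This study proposes DeepTokenEEG, a lightweight and efficient model for classifying Alzheimer's disease (AD) and mild cognitive impairment. The model uses a spatial and temporal tokenizer to extract AD-related biomarkers from EEG signals in both the temporal and frequency domains, achieving high classification accuracy with only 0.29 million parameters. Experiments show a maximum accuracy of 100% on specific frequency bands, an improvement of 1.41-15.35% over existing methods, indicating potential for early detection and screening of AD.

Details
Abstract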

The detection of Alzheimer's disease (AD) is considered crucial, as timely intervention can improve patient outcomes. Electroencephalogram (EEG)-based diagnosis has been recognized as a non-invasive, accessible, and cost-effective approach for AD detection; however, it faces challenges related to data availability, accuracy of modern deep learning methods, and the time-consuming nature of expert-based interpretation. In this study, a novel lightweight and high-performance model, DeepTokenEEG, was designed for the diagnosis of AD and the classification of EEG signals from AD patients, individuals with other neurological conditions, and healthy subjects. Unlike traditional heavy-weight models, DeepTokenEEG utilizes a spatial and temporal tokenizer that effectively captures AD-related biomarkers in both the temporal and frequency domains with only 0.29 million parameters. Trained on a combined dataset of 274 subjects, including 180 AD cases and 94 healthy controls, the proposed method achieves a maximum recorded accuracy of 100% on specific frequency bands, representing an improvement of 1.41-15.35% over state-of-the-art methods on the same dataset. These results indicate the potential of DeepTokenEEG for early detection and screening of AD, with promising applicability for deployment due to its compact size.

2605.15000 2026-05-15 cs.CL cs.AI

Quantifying and Mitigating Premature Closure in Frontier LLMs

Rebecca Handler, Suhana Bedi, Nigam Shah

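AI summary: This study examines premature closure in frontier large language models (LLMs), that is, reaching a conclusion too early when information is uncertain, and the risks this poses in medical tasks in particular. Five frontier models were evaluated on structured and open-ended medical tasks; they frequently gave definite answers without sufficient information, with high error rates. Safety-oriented prompting partly mitigates the problem, but substantial premature closure remains, indicating that current medical LLMs still need to improve at judging when not to answer.

Details
Comments
14 pages, 3 figures, 1 table
Abstract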

Premature closure, or committing to a conclusion before sufficient information is available, is a recognized contributor to diagnostic error but remains underexamined in large language models (LLMs). We define LLM premature closure as inappropriate commitment under uncertainty: providing an answer, recommendation, or clinical guidance when the safer response would be clarification, abstention, escalation, or refusal. We evaluated five frontier LLMs across structured and open-ended medical tasks. In MedQA (n = 500) and AfriMed-QA (n = 490) questions where the correct choice had been removed, models still selected an answer at high rates, with baseline false-action rates of 55-81% and 53-82%, respectively. In open-ended evaluation, models gave inappropriate answers on an average of 30% of 861 HealthBench questions and 78% of 191 physician-authored adversarial queries. Safety-oriented prompting reduced premature closure across models, but residual failure persisted, highlighting the need to evaluate whether medical LLMs know when not to answer.
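
As a reading aid for the false-action rates quoted above, the sketch below computes the share of questions on which a model still picks an option even though the correct choice was removed. The `ABSTAIN` label, the function name, and the assumption that each response has already been parsed into either an option letter or an abstention are hypothetical; the abstract does not describe the paper's scoring procedure.

```python
from typing import Iterable

ABSTAIN = "ABSTAIN"  # hypothetical label for "none of the options / cannot answer"

def false_action_rate(responses: Iterable[str]) -> float:
    """Fraction of responses that commit to an option.

    Assumption: every question in the batch had its correct choice removed,
    so any selected option counts as a false action.
    """
    responses = list(responses)
    if not responses:
        raise ValueError("at least one response is required")
    acted = sum(1 for r in responses if r != ABSTAIN)
    return acted / len(responses)

# Hypothetical parsed outputs for four modified multiple-choice questions.
rate = false_action_rate(["A", ABSTAIN, "C", "B"])
print(f"false-action rate: {rate:.0%}")
```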

2605.14995 2026-05-15 cs.AI cs.CL cs.LG cs.SI

Explainable Detection of Depression Status Shifts from User Digital Traces

Loris Belcastro, Francesco Gervino, Fabrizio Marozzo, Domenico Talia, Paolo Trunfio

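AI summary: This paper proposes an explainable framework for detecting and analyzing shifts in depression status from users' digital traces (e.g., social media posts and chat logs). The method combines multiple BERT-based models to extract signals along several dimensions, such as sentiment, emotion, and depression severity, aggregates them over time into user trajectories, and identifies meaningful change points. A large language model then generates concise, human-readable reports to improve interpretability. Experiments on two social media datasets show higher coverage of user history, stronger temporal coherence, and improved sensitivity to change points, supporting the dynamic analysis of mental health signals.

Details
Abstract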

Every day, users generate digital traces (e.g., social media posts, chats, and online interactions) that are inherently timestamped and may reflect aspects of their mental state. These traces can be organized into temporal trajectories that capture how a user's mental health signals evolve, including phases of improvement, deterioration, or stability. In this work, we propose an explainable framework for detecting and analyzing depression-related status shifts in user digital traces. The approach combines multiple BERT-based models to extract complementary signals across different dimensions (e.g., sentiment, emotion, and depression severity). Such signals are then aggregated over time to construct user-level trajectories that are analyzed to identify meaningful change points. To enhance interpretability, the framework integrates a large language model to generate concise and human-readable reports that describe the evolution of mental-health signals and highlight key transitions. We evaluate the framework on two social media datasets. Results show that the approach produces more coherent and informative summaries than direct LLM-based reporting, achieving higher coverage of user history, stronger temporal coherence, and improved sensitivity to change points. An ablation study confirms the contribution of each component, particularly temporal modeling and segmentation. Overall, the method provides an interpretable view of mental health signals over time, supporting research and decision making without aiming at clinical diagnosis.

2605.14991 2026-05-15 cs.CV cs.AI

Predicting Response to Neoadjuvant Chemotherapy in Ovarian Cancer from CT Baseline Using Multi-Loss Deep Learning

Francesco Pastori, Francesca Fati, Marina Rosanu, Luigi De Vitis, Lucia Ribero, Gabriella Schivardi, Giovanni Damiano Aletti, Nicoletta Colombo, Jvan Casarin, Francesco Multinu, Elena De Momi

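AI summary: This study aims to predict the response of ovarian cancer patients to neoadjuvant chemotherapy from pre-treatment contrast-enhanced CT, to help identify non-responders early. It proposes a non-invasive framework based on multi-loss deep learning that uses automatically derived 3D lesion masks together with a partially fine-tuned image encoder and an attention mechanism for feature aggregation and classification. Validated on a retrospective cohort of 280 patients, the model achieves a ROC-AUC of 0.73 and an F1-score of 0.70 on the test set, indicating some clinical predictive ability and providing a foundation for imaging-based patient stratification.

Details
Abstract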

Ovarian cancer is the most lethal gynecologic malignancy: around 60% of patients are diagnosed at an advanced stage, with an associated 5-year survival rate of about 30%. Early identification of non-responders to neoadjuvant chemotherapy remains a key unmet need, as it could prevent ineffective therapy and avoid delays in optimal surgical management. This work proposes a non-invasive deep learning framework to predict neoadjuvant chemotherapy response from pre-treatment contrast-enhanced CT by leveraging automatically derived 3D lesion masks. The approach encodes axial slices with a partially fine-tuned pretrained image encoder and aggregates slice-level representations into a volumetric embedding through an attention-based module. Training combines classification loss with supervised contrastive regularization and hard-negative mining to improve separation between ambiguous responders and non-responders. The method was developed on a retrospective single-center cohort from the European Institute of Oncology (Milan, IT), including 280 eligible patients (147 responder, 133 non-responder). On the test cohort, the model achieved a ROC-AUC of 0.73 (95% CI: 0.58-0.86) and an F1-score of 0.70 (95% CI: 0.56-0.82). Overall, these results suggest that the proposed architecture learns clinically relevant predictive patterns and provides a robust foundation for an imaging-based stratification tool.
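
The aggregation step described above, turning slice-level representations into one volumetric embedding through an attention-based module, can be illustrated with a softmax-weighted average. This is a minimal sketch of generic attention pooling, not the authors' architecture: in the paper the module is learned, whereas here the `query` vector is a fixed, hypothetical stand-in, and the dot-product scoring is an assumption.

```python
from math import exp

def attention_pool(slice_embeddings, query):
    """Softmax-weighted average of slice vectors.

    Each slice is scored by its dot product with `query` (assumed scoring rule);
    scores are normalized with a numerically stable softmax.
    Returns (pooled_vector, attention_weights).
    """
    scores = [sum(e * q for e, q in zip(emb, query)) for emb in slice_embeddings]
    top = max(scores)
    exps = [exp(s - top) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    dim = len(query)
    pooled = [sum(w * emb[d] for w, emb in zip(weights, slice_embeddings))
              for d in range(dim)]
    return pooled, weights

# Three hypothetical 2-D slice embeddings; the query favors the first dimension.
slices = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
volume_embedding, attn = attention_pool(slices, query=[1.0, 0.0])
print(volume_embedding, attn)
```

Slices that score higher against the query contribute more to the pooled vector, which is the property that lets such a module focus on the most informative axial slices.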

2605.14990 2026-05-15 cs.CV

Characterizing the visual representation of objects from the child's view

Jane Yang, Tarun Sepuri, Alvin Wei Ming Tan, Khai Loong Aw, Michael C. Frank, Bria Long

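AI summary: This study investigates how children learn object category representations from everyday visual experience, analyzing a large body of first-person video from the BabyView dataset. Using a supervised detection model to extract common object categories from millions of frames, it finds that the distribution of categories children encounter is highly skewed and that objects appear in highly variable ways, such as unusual angles, cluttered scenes, or partial occlusion. Even so, the detected categories show strong grouping within superordinate categories (e.g., animals, food), a pattern that also holds in high-dimensional embeddings from self-supervised models, pointing to the robustness and efficiency of children's visual learning.

Details
Comments
19 pages, 6 figures
Abstract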

Children acquire object category representations from their everyday experiences in the first few years of life. What do the inputs to this learning process look like? We analyzed first-person videos of young children's visual experience at home from the BabyView dataset (N = 31 participants, 868 hours, ages 5-36 months), using a supervised object detection model to extract common object categories from more than 3 million frames. We found that children's object category exposure was highly skewed: a few categories (e.g., cups, chairs) dominated children's visual experiences while most categories appeared rarely, replicating previous findings from a more restricted set of contexts. Category exemplars were highly variable: children encountered objects from unusual angles, in highly cluttered scenes, and partially occluded views; many categories (especially animals) were most frequently viewed as depictions. Surprisingly, despite this variability, detected categories (e.g., giraffes, apples) showed stronger groupings within superordinate categories (e.g., animals, food) relative to groupings derived from canonical photographs of these categories. We found this same pattern when using high-dimensional embeddings from both self-supervised visual and multimodal models; this effect was also recapitulated in densely sampled data from individual children. Understanding the robustness and efficiency of visual category learning will require the development of models that can exploit strong superordinate structure and learn from non-canonical, sparse, and variable exemplars.