arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.22368 2026-05-22 cs.LG cs.AI cs.SE

VeriScale: Adversarial Test-Suite Scaling for Verifiable Code Generation

Yifan Bai, Xiaoyang Liu, Zihao Mou, Guihong Wang, Jian Yu, Shuhan Xie, Yantao Li, Yangyu Zhang, Jingwei Liang, Tao Luo

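AI summary: This paper proposes VeriScale, a framework driven by adversarial implementations that first expands and then reduces test suites to make the evaluation of verifiable code generation more rigorous. Applied to Verina, the expanded VerinaPlus exposes substantial model weaknesses, while the reduced VerinaLite keeps that discriminative power at low cost.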

Abstract

As large language models (LLMs) are increasingly deployed for software engineering, constructing high-quality benchmarks is crucial for evaluating not just the functional correctness, but also the formal verifiability of generated code. However, existing benchmarks are limited by the quantity and quality of positive and negative test cases, leading to an overestimation of model capabilities in generating specifications and implementations. To address this, we propose VeriScale, a novel framework driven by adversarial implementations. It consists of two stages: test-suite expansion to construct diverse and challenging test cases, and test-suite reduction to distill them into compact yet discriminative suites. While VeriScale is general, we instantiate it on Verina to construct VerinaPlus, which expands the original test suites by over 83×, and VerinaLite, a lightweight 14× variant. Our experiments across eight state-of-the-art LLMs demonstrate that VerinaPlus exposes substantial model weaknesses hidden by the original benchmark, evidenced by sharp score drops on both SpecGen and CodeGen tasks, whereas VerinaLite maintains this discriminative power at a fraction of the evaluation cost. The enhanced benchmarks and source code are publicly available at https://github.com/XiaoyangLiu-sjtu/VeriScale.
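The abstract does not say how the reduction stage selects tests. One common way to distill a suite while keeping its ability to reject faulty implementations is a greedy set cover, sketched below. The function name `reduce_suite`, the `kills` map, and the toy data are hypothetical and are not taken from the VeriScale code base.

```python
from typing import Dict, List, Set


def reduce_suite(kills: Dict[str, Set[str]]) -> List[str]:
    """Greedy set cover over a hypothetical test -> rejected-implementations map."""
    # Every adversarial implementation that at least one test rejects.
    remaining: Set[str] = set().union(*kills.values())
    chosen: List[str] = []
    while remaining:
        # Take the test that rejects the most implementations not yet covered;
        # iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(kills), key=lambda t: len(kills[t] & remaining))
        chosen.append(best)
        remaining -= kills[best]
    return chosen


# Toy data: test name -> adversarial implementations it rejects (all hypothetical).
kills = {
    "t1": {"impl_a", "impl_b"},
    "t2": {"impl_b"},
    "t3": {"impl_c"},
    "t4": {"impl_a", "impl_c"},
}
print(reduce_suite(kills))  # ['t1', 't3']
```

Here two of the four tests are enough to reject all three faulty implementations. Greedy set cover is not guaranteed to find the smallest suite, only a reasonably small one.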

2605.22366 2026-05-22 cs.CV

AgroTools: A Benchmark for Tool-Augmented Multimodal Agents in Agriculture

Zi Ye, Yibin Wen, Xiaoya Fan, Xinyu Zhang, Jing Wu, Kun Zeng, Zurong Mai, Jiarui Zhang, Bohan Shi, Juepeng Zheng, Jianxi Huang, Yutong Lu, Haohuan Fu

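AI summary: This paper introduces AgroTools, a benchmark for tool-augmented multimodal agents in agriculture. It uses 539 question-answer instances and 1,097 heterogeneous agricultural images to evaluate both execution quality and task success in tool use.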

Abstract

Agricultural decision-making increasingly requires multimodal systems that can transform visual observations into reliable, executable actions. However, existing agricultural multimodal benchmarks mainly evaluate final-answer correctness and provide limited support for assessing whether models can use external tools to complete precision-sensitive workflows. In this paper, we introduce AgroTools, a benchmark for evaluating tool-augmented multimodal agents in agriculture. AgroTools contains 539 question-answer instances paired with 1,097 heterogeneous agricultural images, spanning five task families and an executable environment of 14 agricultural tools. Each query is annotated with structured tool-use traces, enabling a dual-view evaluation of both process-level execution quality and outcome-level task success. We benchmark 9 open-source and 4 closed-source multimodal large language models on AgroTools. Results show that current models remain far from reliable in agricultural tool-use settings, with clear bottlenecks in tool planning, argument generation, execution recovery, and final-answer synthesis. We hope AgroTools will support future research on multimodal agents for high-precision agricultural applications. The benchmark and evaluation are available at https://huggingface.co/datasets/AgroTools/AgroTools.

2605.22364 2026-05-22 cs.AI

Scaling Observation-aware Planning in Uncertain Domains

Adrian Zvizdenco, Arthur Conrado Veiga Bosquetti, Alberto Lluch Lafuente, Christoph Matheja

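AI summary: This paper studies how to scale observation-aware planning in uncertain domains. It uses (sub-)symbolic techniques to scale the solving of decidable fragments of the OOP, namely the Sensor Selection Problem and the Positional Observability Problem, and identifies sensible observation functions by decomposing POMDPs to improve performance.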

Abstract

Deciding which sensing capabilities to deploy on an agent in uncertain domains is a fundamental engineering challenge, in which one balances task achievability against the high costs of hardware and processing. This problem has previously been formalized as the Optimal Observability Problem (OOP), based on the well-known Partially Observable Markov Decision Process (POMDP) model for decision-making. This work studies (sub-)symbolic techniques to scale solving of decidable fragments of the OOP, namely the Sensor Selection Problem (SSP) and the Positional Observability Problem (POP). Besides improving the original approach based on parameter synthesis, we develop a new solving method that identifies sensible observation functions via decomposition of POMDPs, improving performance by 3 and 5 orders of magnitude for instance size and runtime, respectively.

2605.22359 2026-05-22 cs.CV

GazePrior: Zero-Shot AR/VR Eye Tracking via Learned 3D Gaze Reconstruction

Corentin Dumery, David Colmenares, Alexander Fix, Pascal Fua, Ali Behrooz, Jogendra Kundu

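AI summary: This paper proposes GazePrior, a data-driven 3D gaze prior that enables high-quality AR/VR eye tracking without additional data collection. Models trained on its synthesized data are more accurate and robust.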

Comments
Project page: https://corentindumery.github.io/projects/gazeprior.html
Abstract

Eye tracking (ET) is a foundational technology for advanced AR/VR applications. However, training ET models for every new ET device is challenging: real data collection is costly and time-consuming, while existing synthetic data generation methods lack realism. To remove the need for additional data collection while maintaining data quality, we introduce a data-driven 3D prior that models the distribution of human eyes across diverse identities, gaze directions, and light settings. This model, which we coin GazePrior, then enables sparse-input 3D reconstruction of annotated data collected with previous ET devices, which can in turn be rendered from the cameras of any target ET device. Our approach synthesizes data with the realism, diversity and ground-truth accuracy of real data collection without its prohibitive costs. Our experiments demonstrate that ET models trained with our synthesized data outperform previous zero-shot methods, achieving higher accuracy and robustness.

2605.22357 2026-05-22 cs.CV cs.AI

VEELA: A Clinically-Constrained Benchmark for Liver Vessel Segmentation in Computed Tomography Angiography

Ziya Ata Yazıcı, N. Sinem Gezer, İlkay Öksüz, İlker Özgür Koska, Tuğçe Toprak, Pervin Bulucu, Ufuk Beşenk, A. Emre Kavur, Pierre-Henri Conze, Hazım Kemal Ekenel, Oğuz Dicle, Mustafa Ege Şeker, Mustafa Said Kartal, Ariorad Moniri, Orhan Özkan, Osman Faruk Bayram, Hakan Polat, Musa Balcı, Ece Tuğba Cebeci, Baran Cılga, Kardelen Peçenek, M. Alper Selver

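AI summary: This paper presents VEELA, a dataset for hepatic and portal vessel segmentation in CT angiography. Strict manual annotation under multi-expert consensus keeps the labels clinically realistic and accurate, and several complementary metrics are used to evaluate vessel segmentation from multiple perspectives.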

Comments
27 pages, 25 figures, 5 tables
Abstract

Accurate segmentation of hepatic and portal vessels in contrast-enhanced computed tomography angiography (CTA) remains challenging due to complex vascular topology, peripheral visibility limitations, and acquisition-induced ambiguities. While existing public datasets offer valuable benchmarks, few include clinically realistic annotation constraints. We introduce VEELA (Vessel Extraction and Extrication for Liver Analysis), a rigorously curated liver vessel dataset derived from 40 CTA scans inherited from the CHAOS grand-challenge cohort. All vessels were manually delineated slice-by-slice under multi-expert consensus, using a strict visibility-driven annotation policy and avoiding anatomically inferred interpolation. This design explicitly captures anatomical variability and imaging-related uncertainty. As a continuation of the CHAOS challenge, VEELA enables reproducible cross-benchmark evaluation while extending the scope to fine-grained hepatic and portal vessel segmentation. We further establish a standardized benchmarking framework and analyze complementary evaluation metrics, including topology-aware (clDice), overlap-based (IoU), boundary-sensitive (NSD), and geometry-aware (area, length) measures. Our results demonstrate that different metrics capture distinct aspects of vascular integrity, underscoring the necessity of multi-perspective evaluation for clinically meaningful vessel segmentation. VEELA is publicly released to facilitate reproducible research and support the development of robust vascular segmentation methods. Researchers can access the evaluation metrics, dataset, and submission platform at https://www.synapse.org/Synapse:syn65471967.
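Of the metrics listed, the overlap-based one is the simplest to state. The sketch below computes IoU for two binary masks held as nested lists. The function name and the toy masks are illustrative and are not taken from the VEELA evaluation code.

```python
from typing import Sequence


def iou(pred: Sequence[Sequence[int]], truth: Sequence[Sequence[int]]) -> float:
    """Intersection over union of two equally shaped binary masks."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0  # pixel marked in both masks
            union += 1 if (p or t) else 0   # pixel marked in at least one mask
    # Two empty masks agree perfectly by convention.
    return inter / union if union else 1.0


pred = [[0, 1, 1],
        [0, 1, 0]]
truth = [[0, 1, 0],
         [0, 1, 1]]
print(iou(pred, truth))  # 0.5
```

IoU counts pixels only, so a one-pixel gap that disconnects a thin vessel barely changes the score. That is one reason a topology-aware metric such as clDice is reported alongside it.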

2605.22356 2026-05-22 cs.CL

Modeling Pathology-Like Behavioral Patterns in Language Models Through Behavioral Fine-Tuning

Nicola Milano, Davide Marocco

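AI summary: This paper models pathology-like behavioral patterns in language models through behavioral fine-tuning on structured decision-making tasks. The fine-tuned models show stable shifts in their generative distributions across contexts, indicating that behavioral optimization affects the distributional properties of language generation.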

Abstract

Large language models are increasingly used as computational tools for modeling human-like behavior. We introduce a behavioral induction framework that modifies model policies through fine-tuning on structured decision-making tasks: using synthetic datasets inspired by maladaptive behavioral patterns, including depression and paranoia, we train transformer-based language models to consistently select specific classes of actions across diverse contexts. We then test whether this behavioral optimization produces systematic changes in generative distributions. Across two architectures, fine-tuned models show stable, context-general shifts in next-token probability distributions, including increased probability assigned to negative and threat-related interpretations in open-ended language tasks. These effects generalize beyond training contexts and are detectable in qualitative completions, psychometric-style evaluations, and quantitative distributional metrics such as Jensen-Shannon divergence. Induced behavioral profiles also show partial specificity. Models optimized for different behavioral patterns exhibit dissociable response tendencies across evaluation probes, suggesting that structured behavioral training produces differentiated policy-level biases rather than generic distributional skew. We interpret these findings as evidence that consistent behavioral optimization in LLMs can generate stable behavioral and distributional patterns consistent with altered latent priors, linking action selection and language generation. More broadly, the results support a view of LLMs as policy-based systems in which behavioral constraints shape emergent representational structure, highlighting their potential as controlled testbeds for studying the relationship between behavior, interpretation, and generative language in computational models of cognition.
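For reference, the Jensen-Shannon divergence mentioned above can be computed between two next-token distributions as follows. This is a minimal sketch of the standard definition with base-2 logarithms. The two three-token distributions are invented for illustration and are not results from the paper.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical next-token probabilities over three candidate continuations.
base_model = [0.6, 0.3, 0.1]
tuned_model = [0.2, 0.3, 0.5]
print(jsd(base_model, tuned_model))
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (disjoint support), which makes shifts comparable across prompts.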

2605.22355 2026-05-22 cs.CL cs.AI cs.LG

TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation

Hanyu Guo, Jiedong Yang, Chao Chen, Longfei Xu, Kaikui Liu, Xiangxiang Chu

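AI summary: This paper presents TransitLM, a dataset of over 13 million transit route planning records for map-free transit route generation, and shows that a model trained on it can generate valid routes.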

Abstract

Public transit route planning traditionally depends on structured map infrastructure and complex routing engines, and no existing dataset supports training models to bypass this dependency. We present TransitLM, a large-scale dataset of over 13 million transit route planning records from four Chinese cities covering 120,845 stations and 13,666 lines, released as a continual pre-training corpus and benchmark data for three evaluation tasks with complementary metrics. Experiments show that an LLM trained on TransitLM produces structurally valid routes at high accuracy and implicitly grounds arbitrary GPS coordinates to appropriate stations without any explicit mapping. These results demonstrate that transit route planning can be learned entirely from data, enabling end-to-end, map-free route generation directly from origin-destination information. The dataset and benchmark are available at https://huggingface.co/datasets/GD-ML/TransitLM, with evaluation code at https://github.com/HotTricker/TransitLM.

2605.22351 2026-05-22 cs.CV

QuantSR+: Pushing the Limit of Quantized Image Super-Resolution Networks

Haotong Qin, Xudong Ma, Xianglong Liu, Jie Luo, Jinyang Guo, Michele Magno, Yulun Zhang

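AI summary: This paper proposes QuantSR+, a framework that improves quantization operators, network design, and training optimization to reach a better trade-off between accuracy and efficiency. To counter the performance drop at ultra-low precision, it makes three technical contributions: Redistribution-driven Bit Determination, Quantized Slimmable Architecture, and Slimming-guided Function-localized Distillation.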

Abstract

Low-bit quantization is widely used to compress super-resolution (SR) models and reduce storage and computation costs for deployment on resource-limited devices. However, when SR models are pushed to ultra-low precision (2-4 bits), performance can drop sharply due to diminished representational capacity and the detail-sensitive nature of SR. To address these issues, we propose QuantSR+, a unified framework that improves quantization operators, network design, and training optimization, achieving better trade-offs between accuracy and efficiency than prior low-bit SR methods. QuantSR+ mainly relies on three technical contributions: (1) Redistribution-driven Bit Determination (RBD), which reshapes quantization distributions in both forward and backward passes to preserve representation fidelity; (2) Quantized Slimmable Architecture (QSA), which begins with an over-parameterized model and progressively prunes less critical blocks to meet efficiency budgets while pushing the accuracy performance; and (3) Slimming-guided Function-localized Distillation (SFD), which enforces block-aware feature alignment via a direct loss and a progressive, function-local training schedule to capture quantization effects better and speed up convergence. Extensive experiments show that QuantSR+ achieves state-of-the-art performance against both specialized quantized SR methods and generic quantization approaches. For SwinIR-S on Urban100 (x4), it improves PSNR by 0.29 dB over the 2-bit SOTA baseline. Meanwhile, it delivers strong efficiency gains at 2-bit, reducing operations by up to 87.9% and storage by 89.4%. QuantSR+ is effective for both convolutional and transformer-based SR models, indicating broad applicability.

2605.22344 2026-05-22 cs.CV cs.AI cs.MM

Bernini: Latent Semantic Planning for Video Diffusion

Bernini Team, Chenchen Liu, Junyi Chen, Lei Li, Lu Chi, Mingzhen Sun, Zhuoying Li, Yi Fu, Ruoyu Guo, Yiheng Wu, Ge Bai, Zehuan Yuan

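AI summary: This paper proposes Bernini, a unified approach to video generation and editing in which a multimodal large language model performs semantic planning and a diffusion model renders pixels, improving generalization on editing tasks.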

Comments
Project Page: https://bernini-ai.github.io/
Abstract

Multimodal large language models (MLLMs) and diffusion models have each reached remarkable maturity: MLLMs excel at reasoning over heterogeneous multimodal inputs with strong semantic grounding, while diffusion models synthesize images and videos with photorealistic fidelity. We argue that these two families can be unified through a simple division of labor: MLLMs perform semantic planning, while diffusion models render pixels from high-level semantic guidance and low-level visual features. Building on this idea, we propose Bernini, a unified framework for video generation and editing. An MLLM-based planner predicts the target semantic representation directly in the ViT embedding space, and a DiT-based renderer synthesizes pixels conditioned on this plan, augmented by text features and, for editing, source VAE features for detail preservation. Because semantics serve as the interface, the planner and renderer can be trained separately and only lightly co-trained, preserving the pretrained strengths of both components while keeping training efficient. To better handle multiple visual inputs, we introduce Segment-Aware 3D Rotary Positional Embedding (SA-3D RoPE), and further incorporate chain-of-thought reasoning in the planner to better transfer understanding into generation. Bernini achieves state-of-the-art performance across a wide range of video generation and editing benchmarks, with the MLLM's pretrained understanding translating into strong generalization on challenging editing tasks.

2605.22342 2026-05-22 cs.CV cs.AI

4D-GSW: Kinematic-Aware Spatio-Temporal Consistent Watermarking for 4D Gaussian Splatting

Sifan Zhou, Hang Zhang, Yuhang Wang, Ming Li

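AI summary: This paper proposes 4D-GSW, a kinematic-aware, spatio-temporally consistent watermarking technique that embeds robust copyright information in 4D Gaussian Splatting while preserving high spatio-temporal consistency.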

Comments
9 pages main paper, 7 figures, 18 pages in total
Abstract

While 4D Gaussian Splatting (4DGS) has revolutionized high-fidelity dynamic reconstruction, safeguarding the intellectual property of these assets remains an open challenge. Conventional steganographic techniques often neglect the underlying kinematic manifolds, triggering non-physical artifacts such as severe temporal flickering and "FVD collapse". To address this, we propose 4D-GSW, a kinematic-aware watermarking framework designed to embed robust copyright information while preserving high spatio-temporal consistency. Unlike prior 4D steganography that primarily focuses on opacity-guided invisibility, our approach explicitly addresses the physical coherence of motion trajectories. We introduce a Spatio-Temporal Curvature (STC) metric to identify "Dynamic Instants," adaptively gating watermark gradient injection to shield critical motion manifolds from non-physical perturbations. To ensure global coherence across complex deformations, we formulate a joint HMM-MRF energy minimization model that synchronizes watermark phases within both temporal trajectories and spatial neighborhoods. Furthermore, an anisotropic gradient routing mechanism ensures that watermark embedding remains strictly decoupled from photometric reconstruction fidelity. Extensive experiments have demonstrated the superior performance of our method in robustly hiding watermarks while resisting various attacks and maintaining high rendering quality and spatio-temporal consistency.

2605.22341 2026-05-22 cs.LG cond-mat.dis-nn

A Boundary-Layer Mechanism for One-Third Scaling in Online Softmax Classification

Marcel Kühn, Yoon Thelge, Bernd Rosenow

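AI summary: This paper studies a boundary-layer mechanism by which the mismatch between smooth surrogate losses and discrete labels produces power-law learning curves in an online teacher-student model. It derives α^{-1/3} scaling for both the test loss and the generalization error, and shows that learning-rate schedules can improve the generalization error.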

Comments
20 pages, 7 figures
Abstract

Hard-label classification is usually trained with smooth surrogate losses, most prominently softmax cross-entropy. We isolate an asymptotic mechanism by which this mismatch between smooth surrogate and discrete labels produces power-law learning curves in an online teacher-student model. After subtracting the mean logit, the thermodynamic-limit dynamics close in centered variables: a growing centered student-teacher alignment $D$ and the residual student variance $\Delta$. At late times, examples away from teacher decision boundaries are already classified confidently and contribute exponentially little. Only boundary layers of width $O(D^{-1})$ remain active, while the noise of fixed-learning-rate online gradient descent maintains a nonzero $\Delta$. As a function of the training time $\alpha$, the late-time solution yields an $\alpha^{-1/3}$ power law not only for the test loss but also for the generalization error $\epsilon_g$, i.e., one minus test accuracy. This is much slower than the $\alpha^{-1}$ Bayes-optimal reference for the same model. We further show that learning-rate schedules can improve the generalization error towards an $\epsilon_g \sim \alpha^{-1/2}$ power law. Simulations support the predicted order parameter dynamics and learning curves. Controlled experiments with correlated Gaussian inputs and whitened pretrained features show that data structure can dominate transients. Therefore, our result is an asymptotic, complementary mechanism rather than an alternative to spectral explanations of neural scaling laws.
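To see what the three exponents in this abstract mean in practice, the sketch below works out how much longer training must run to halve the error under each power law. It assumes only that the error scales as the training time raised to a negative exponent. The function name `time_factor` is hypothetical, and prefactors and transients are ignored.

```python
def time_factor(error_ratio: float, exponent: float) -> float:
    """Growth in training time needed to cut the error by `error_ratio`
    when error ~ alpha ** (-exponent)."""
    # error_1 / error_2 = (alpha_2 / alpha_1) ** exponent, solved for alpha_2 / alpha_1.
    return error_ratio ** (1.0 / exponent)


regimes = [
    ("fixed learning rate, exponent 1/3", 1 / 3),
    ("learning-rate schedule, exponent 1/2", 1 / 2),
    ("Bayes-optimal reference, exponent 1", 1.0),
]
for name, exponent in regimes:
    print(f"{name}: {time_factor(2.0, exponent):.1f}x training time to halve the error")
```

Halving the error therefore takes about 8 times the training time in the one-third regime, 4 times with the improved schedule, and 2 times at the Bayes-optimal rate.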

2605.22340 2026-05-22 cs.LG

From Snapshots to Trajectories: Learning Single-Cell Gene Expression Dynamics via Conditional Flow Matching

Siyu Pu, Qingqing Long, Xiaohan Huang, Haotian Chen, Jiajia Wang, Meng Xiao, Xiao Luo, Hengshu Zhu, Yuanchun Zhou, Xuezhi Wang

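AI summary: This paper proposes single-cell Flow Matching (scFM), which learns single-cell gene expression dynamics through conditional flow matching. It addresses the gaps between sampled time points and the distribution drift in long-horizon prediction, improving the accuracy and temporal coherence of trajectory inference.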

Abstract

Single-cell RNA sequencing (scRNA-seq) provides high-dimensional profiles of cellular states, enabling data-driven modeling of cellular dynamics over time. In practice, time-resolved scRNA-seq is collected at only a few discrete time points as unpaired snapshot populations, leaving substantial temporal gaps. This motivates trajectory inference at unmeasured time points. Existing methods mainly follow two directions: optimal-transport (OT) alignment provides distribution-level matching between observed snapshots, while continuous-time generative models support forecasting via learned dynamics. However, two challenges remain: (i) unpaired snapshots render local transitions between adjacent time points ambiguous, leading to unstable supervision; and (ii) long-horizon prediction relies on repeated integration, where small modeling errors compound and cause distribution drift. To address these challenges, we propose single-cell Flow Matching (scFM), a latent generative framework based on coupling-conditioned flow matching. First, we compute entropically regularized OT couplings between adjacent snapshots and use them to construct soft, weighted flow-matching targets for learning time-dependent velocity fields. Second, we learn bidirectional velocity fields and leverage their consistency to refine couplings and improve temporal coherence under sparse supervision. Third, we introduce distribution-level alignment and latent dynamic regularization to anchor long rollouts and mitigate drift. Experiments on real-world time-series scRNA-seq datasets show that scFM consistently improves distributional prediction performance for both temporal interpolation and extrapolation. Moreover, scFM yields more accurate trajectory reconstruction and temporally coherent visualizations where intermediate time points are absent, indicating a more faithful recovery of underlying temporal gene expression dynamics.
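The first step, an entropically regularized OT coupling between two snapshots, is usually computed with Sinkhorn iterations. The sketch below is a plain-Python version of that standard algorithm on a toy two-by-two problem. It is not the scFM implementation, and the cost matrix, marginals, and `eps` value are made up for illustration.

```python
import math
from typing import List


def sinkhorn(cost: List[List[float]], a: List[float], b: List[float],
             eps: float = 0.1, iters: int = 200) -> List[List[float]]:
    """Entropic OT coupling with row marginals a and column marginals b."""
    n, m = len(a), len(b)
    # Gibbs kernel: cheap pairs get weights near 1, expensive pairs near 0.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Two cells at time t and two at time t+1; matching indices are the cheap pairs.
cost = [[0.0, 1.0],
        [1.0, 0.0]]
plan = sinkhorn(cost, a=[0.5, 0.5], b=[0.5, 0.5])
for row in plan:
    print([round(x, 5) for x in row])
```

Each entry `plan[i][j]` is the mass sent from cell `i` to cell `j`. Weights of this kind can then serve as the soft pairing weights when building flow-matching targets, as the abstract describes.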

2605.22338 2026-05-22 cs.LG

Physics-Informed Generative Solver: Bridging Data-Driven Priors and Conservation Laws for Stable Spatiotemporal Field Reconstruction

Ziyuan Zhu, Keyu Hu, Zhifei Chen, Yuhao Shi, Ming Bao, Jing Zhao, Gang Wang, Haitan Xu, Jiadong Li, Qijun Zhao, Xiaodong Li, Minghui Lu, Yanfeng Chen

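AI summary: This paper proposes a physics-informed generative solver that separates stable prior learning from inference-time enforcement of conservation laws. It reconstructs continuous physical fields from sparse measurements and achieves stable field reconstruction in both acoustics and meteorology.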

Abstract

Reconstructing continuous physical fields from sparse measurements is a central inverse problem, but data-driven generative models can produce states that violate governing dynamics. We introduce a physics-informed generative solver that separates stable prior learning from inference-time enforcement of conservation laws. Martingale-Regularized Score Matching regularizes score pretraining with a Score Fokker-Planck constraint, yielding a dynamically stable prior. Physics-Informed Implicit Score Sampling then guides denoising trajectories by gradients of physical residuals, projecting samples toward admissible manifolds without retraining. In acoustics, the method co-generates pressure and particle velocity from sparse sensors, enabling dense virtual arrays that suppress spatial aliasing. The same framework generalizes to real-world ERA5 meteorological fields under extreme sparsity. Together, this work establishes a rigorous and generalizable paradigm for solving high-dimensional inverse problems, bridging the gap between generative artificial intelligence and first-principles science.

2605.22335 2026-05-22 cs.LG

Learning Causal Orderings for In-Context Tabular Prediction

Sascha Xu, Sarah Mameche, Jilles Vreeken

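AI summary: This paper studies how to simultaneously infer and enforce causal structure, in the form of topological variable orderings, in tabular prediction. The proposed TabOrder model uses causal order-constrained attention so that predictions rely only on features that precede the target under a learned causal order, learns the variable ordering without supervision through a likelihood-based objective, and examines how sample missingness affects the identification of causal direction.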

Abstract

In-context learning for tabular data sets strong predictive standards in observational settings; however, it relies primarily on correlational structure, which becomes unreliable under distribution shift or intervention. While established methods to discover causal structure exist, they are often focused on structure identifiability and decoupled from the predictive architectures that could benefit from them. To bridge these perspectives, we study how to simultaneously infer causal structure, in the form of topological variable orderings, and enforce it in tabular prediction. Unlike standard architectures, our model TabOrder uses causal order-constrained attention, basing predictions only on features that precede a target under a learned causal order. Similar to causal discovery methods, TabOrder learns the optimal variable ordering in an unsupervised manner through a likelihood-based objective. We justify this choice under standard functional model classes and also study how sample missingness, a common challenge in tabular data, interacts with causal direction identification. Empirically, we confirm that TabOrder recovers accurate variable orderings while addressing prediction and imputation tasks, and gives insight into real-world biological data under intervention.
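One way to picture order-constrained attention is as a boolean mask derived from the variable ordering. The sketch below builds such a mask for a fixed ordering. It is an illustration only: in TabOrder the ordering is learned, and the function name `order_mask` and the three-feature example are hypothetical.

```python
from typing import List


def order_mask(order: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when feature j precedes feature i in `order`,
    i.e. when a prediction for feature i may attend to feature j."""
    rank = {feature: position for position, feature in enumerate(order)}
    d = len(order)
    return [[rank[j] < rank[i] for j in range(d)] for i in range(d)]


# Assumed causal order over three features: feature 2, then 0, then 1.
mask = order_mask([2, 0, 1])
for i, row in enumerate(mask):
    allowed = [j for j, ok in enumerate(row) if ok]
    print(f"feature {i} may use features {allowed}")
```

The first feature in the order sees no inputs and the last sees all others. Applying a different permutation changes which features count as candidate causes of a target.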

2605.22334 2026-05-22 cs.LG

Riemannian geometry meets fMRI: the advantages of modeling correlation manifolds and eigenvector subspaces

Mario Severino, Manuela Moretto, Robert A. McCutcheon, Mattia Veronese

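AI summary: This paper proposes a scalable geometric framework that combines the Off-log metric with Grassmannian subspace discrimination to improve the analysis of fMRI data, increasing sensitivity and predictive performance.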

Abstract

Correlation matrices are fundamental summaries of functional brain networks, yet standard analyses often treat entries independently, ignoring the curved geometry of correlation space. Existing geometric methods frequently lack closed-form operations or depend on arbitrary region ordering, limiting scalability. We introduce a scalable geometric framework with two components: (i) the Off-log metric, a smooth transformation mapping correlation matrices to symmetric zero-diagonal matrices. This enables closed-form expressions for distances, Frechet means, and linear models, allowing standard statistical modeling without complex manifold optimization. (ii) Grassmannian subspace discrimination, which compares subjects via principal-angle distances between eigenvector subspaces, resolving inherent sign and basis ambiguities. Both components integrate into standard machine-learning workflows for inference, regression, and classification. Validated across two clinical cohorts (Parkinson's and psychosis) and three ageing fMRI datasets, the Off-log metric increased sensitivity in permutation tests and matched or exceeded Riemannian and Euclidean baselines in classification. Brain-age prediction performance was comparable, with Riemannian metrics excelling in two of three cohorts. The Grassmannian method consistently outperformed Euclidean baselines, highlighting disease-relevant networks. Overall, geometry-aware representations improve sensitivity and predictive performance while remaining straightforward to deploy at scale.

2605.22331 2026-05-22 cs.LG cs.AI cs.DC

SepsisAI Orchestrator: A Containerized and Scalable Platform for Deploying AI Models and Real-Time Monitoring in Early Sepsis Detection

SepsisAI Orchestrator:一个容器化和可扩展的平台,用于部署AI模型和实时监控以实现早期败血症检测

Santiago Ospitia, John Sanabria, John Garcia-Henao

AI总结 本文提出SepsisAI-Orchestrator平台,通过整合HL7 FHIR启发的临床文档架构(CDA)预处理、NoSQL存储、容器化LightGBM分类器和Streamlit临床仪表板,解决了早期败血症检测中AI模型部署的挑战,并通过负载测试展示了U型扩展行为。

详情
Comments
13 pages, 5 figures. Submitted to BioCARLA 2025 Workshop
AI中文摘要

尽管在临床机器学习文献中预测结果强劲,但将这些模型转化为床边使用仍然受限于系统层面的障碍:异构数据表示、缺乏标准化的部署流程以及研究原型与医院环境的并发性和延迟需求之间的不匹配。我们提出了SepsisAI-Orchestrator,一个开源的模块化平台,旨在解决早期败血症检测中的部署缺口。该平台集成了HL7 FHIR启发的临床文档架构(CDA)预处理、NoSQL存储、通过REST API服务的容器化LightGBM分类器和Streamlit临床仪表板,并通过Docker和Kubernetes进行协调。一个之前已验证的LightGBM模型(在PhysioNet 2019上的F1值为0.87-0.94)在不进行修改的情况下被重用;贡献在于周围基础设施及其在负载下的实证表征。使用k6进行50-1000个并发虚拟用户测试,我们发现副本数量必须与主机的物理CPU线程数匹配:在12线程CPU上从3个副本扩展到12个副本,将p95延迟从3.3秒减少到1.41秒(减少57.3%)并消除所有请求失败,而过度配置到24或48个副本则由于调度器竞争导致性能下降。据我们所知,这种U型扩展行为此前尚未对临床AI推理工作负载进行量化。我们不声称具有前瞻性临床验证。源代码和部署清单可在https://github.com/nucleusai/sepsisai-orchestrator获取。

英文摘要

Despite strong predictive results in the clinical machine learning literature, the translation of these models into bedside use remains limited by systems-level barriers: heterogeneous data representations, the absence of standardized deployment workflows, and a mismatch between research prototypes and the concurrency and latency requirements of hospital environments. We present the SepsisAI-Orchestrator, an open-source modular platform that addresses this deployment gap for early sepsis detection. The platform integrates HL7 FHIR-inspired Clinical Document Architecture (CDA) preprocessing, NoSQL storage, a containerized LightGBM classifier served via REST APIs, and a Streamlit clinical dashboard, orchestrated with Docker and Kubernetes. A previously validated LightGBM model (F1 0.87-0.94 on PhysioNet 2019) is reused without modification; the contribution lies in the surrounding infrastructure and its empirical characterization under load. Using k6 with 50-1000 concurrent virtual users, we find that replica count must be matched to the physical CPU thread count of the host: scaling from 3 to 12 replicas on a 12-thread CPU reduces p95 latency from 3.3s to 1.41s (57.3% reduction) and eliminates all request failures, while over-provisioning to 24 or 48 replicas degrades performance due to scheduler contention. To our knowledge this U-shaped scaling behavior has not been quantified previously for clinical AI inference workloads. We do not claim prospective clinical validation. Source code and deployment manifests are available at https://github.com/nucleusai/sepsisai-orchestrator.
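The load-test figures above can be checked with a small sketch. It assumes a nearest-rank definition of the 95th percentile (k6 may interpolate differently), and the helper names `p95`, `relative_reduction`, and `classify_replicas` are hypothetical, not part of the platform.

```python
def p95(latencies_s):
    """Nearest-rank 95th percentile of a list of latencies in seconds."""
    ordered = sorted(latencies_s)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def relative_reduction(before, after):
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before


def classify_replicas(replicas, cpu_threads):
    """Label a replica count relative to the host's CPU thread count."""
    if replicas < cpu_threads:
        return "under-provisioned"
    if replicas == cpu_threads:
        return "matched"
    return "over-provisioned"


# Reported p95 latencies: 3.3 s at 3 replicas, 1.41 s at 12 replicas.
print(f"{relative_reduction(3.3, 1.41):.1%}")
```

On the 12-thread host from the abstract, `classify_replicas(12, 12)` returns "matched", while 24 or 48 replicas fall into the over-provisioned regime where the authors report degraded performance.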

2605.22328 2026-05-22 cs.CV

3D LULC classification using multispectral LiDAR and deep learning: current and prospective schemes

基于多光谱LiDAR和深度学习的3D土地利用/覆盖分类:当前和未来方案

Narges Takhtkeshha, Aldino Rizaldy, Markus Hollaus, Juha Hyyppä, Fabio Remondino, Gottfried Mandlburger

AI总结 本文提出了一种基于多光谱LiDAR和深度学习的3D土地利用/覆盖分类方法,介绍了NMCA对齐的L1和L2分类方案,并引入了一个新的多光谱LiDAR基准数据集,评估了七种最先进的深度学习模型,并展示了光谱信息对分类性能的提升。

详情
AI中文摘要

土地利用/覆盖(LULC)分类对于国家3D制图、地理空间分析和可持续规划至关重要。多光谱(MS)LiDAR提供同步的空谱信息,深度学习(DL)能够实现3D点云语义分割;然而,其应用受限于缺乏与国家制图和地籍机构(NMCAs)分类方案对齐的公开可用的城市和郊区MS LiDAR数据集。本文通过引入L1和L2 NMCA对齐的LULC分类方案和一个新的多光谱LiDAR数据集来填补这些空白。我们评估了七种最先进的深度学习模型,并在两个细节层次上进行了光谱消融研究。结果表明,Point Transformer V3在使用双波长LiDAR系统(532 nm和1064 nm)时,分别在L1(8类)和L2(20类)上实现了79.4%和58.9%的mIoU。消融结果表明,多光谱信息在几何信息基础上提升了性能,分别在L1和L2上提升了1.1个百分点和7.8个百分点。这些结果突显了LiDAR反射率在细粒度材料识别中的价值,并支持NMCA LULC方案向更高语义细节演进。Loosdorf-MSL数据集为一致的国家和国际LULC制图提供了新的基准。

英文摘要

Land Use Land Cover (LULC) classification is essential for national 3D mapping, geospatial analysis, and sustainable planning. Multispectral (MS) LiDAR provides synchronized spatial-spectral information, and deep learning (DL) enables 3D point cloud semantic segmentation; however, adoption is limited by the lack of publicly available urban and suburban MS LiDAR datasets aligned with National Mapping and Cadastral Agencies (NMCAs) classification schemes. This study addresses these gaps by introducing L1 and L2 NMCA-aligned LULC classification schemes and a new benchmark MS LiDAR dataset. We evaluate seven state-of-the-art DL models and perform spectral ablation studies at both levels of detail. Results show that Point Transformer V3 achieves the best performance, with mIoU of 79.4% (L1, 8 classes) and 58.9% (L2, 20 classes) using a dual-wavelength LiDAR system (532 nm and 1064 nm). Ablation results show that multispectral information improves performance over geometry-only inputs, with gains of 1.1 percentage points at L1 and 7.8 points at L2. These results highlight the value of LiDAR reflectance for fine-grained material discrimination and support the evolution of NMCA LULC schemes toward higher semantic detail. The Loosdorf-MSL dataset contributes a new benchmark for consistent national and international LULC mapping.
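The mIoU figures quoted above follow the usual per-class intersection-over-union averaged across classes. A minimal sketch, assuming a confusion matrix with ground-truth classes as rows and predicted classes as columns; the paper's exact handling of absent classes is not stated, so skipping them is an assumption.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix (rows: truth, columns: prediction)."""
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                           # truth c, predicted otherwise
        fp = sum(confusion[r][c] for r in range(n)) - tp      # predicted c, truth otherwise
        denom = tp + fp + fn
        if denom > 0:  # assumption: classes absent from truth and prediction are skipped
            ious.append(tp / denom)
    return sum(ious) / len(ious)
```

The ablation gains in the abstract are differences between two such mIoU values, expressed in percentage points.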

2605.22327 2026-05-22 cs.CV physics.med-ph

Robustness of breast lesion segmentation under MRI undersampling improves with k-space-aware deep learning

在MRI欠采样下,基于k空间的深度学习改进了乳腺病变分割的鲁棒性

Lukas T. Rotkopf, Marco Schlimbach, Julius C. Holzschuh, Heinz-Peter Schlemmer, Jens Kleesiek, Moritz Rempe

AI总结 本文研究了直接从获得的MRI k空间学习乳腺病变分割是否能提高在加速或噪声下的鲁棒性,通过比较不同模型发现基于k空间的深度学习方法在欠采样和噪声下表现更优。

详情
AI中文摘要

目的:评估是否可以直接从获得的MRI k空间学习乳腺病变分割,并判断在数据加速或噪声情况下这种学习方式是否能提高鲁棒性。材料和方法:本回顾性研究使用了公开的乳腺动态对比增强MRI(DCE-MRI)数据集,包含获得的和合成的k空间,以及数据集内的合成对照。我们比较了四种3D U-Net变体:混合k空间到图像模型、原生k空间模型以及幅度和复数图像空间基线。模型在增加的欠采样和添加复数高斯k空间噪声下进行评估。主要结果是交叉验证下的患者级Dice相似性系数,其中混合模型被预设为主要比较对象,与幅度图像空间基线进行比较。结果:在完全采样下,混合模型和图像空间模型表现相似。随着加速增加,混合模型在欠采样水平中保持了显著的分割准确性,并在中等至高欠采样水平上显著优于幅度图像空间基线。当直接向k空间添加噪声时,相同模式被观察到:混合模型退化更慢,而图像空间基线在更重噪声下失败。这种优势在数据集内的合成对照中被重复验证。特征分析表明,k空间阶段和图像空间阶段发挥了互补作用,频率域过滤集中在图像域病变定位之前。结论:基于k空间的深度学习在MRI欠采样和k空间噪声下提高了乳腺病变分割的鲁棒性,同时在完全采样下与图像空间方法相当。

英文摘要

Purpose: To assess whether breast lesion segmentation can be learned directly from acquired MRI k-space, and whether doing so improves robustness when data are accelerated or noisy. Materials and Methods: This retrospective study used public breast dynamic contrast-enhanced MRI (DCE-MRI) datasets with acquired and synthetic k-space, together with a within-dataset synthetic control. We compared four 3D U-Net variants: a hybrid k-space-to-image model, a native k-space model, and magnitude and complex image-space baselines. Models were evaluated under increasing undersampling and added complex Gaussian k-space noise. The primary outcome was patient-level Dice similarity coefficient under cross-validation, with the hybrid model prespecified as the main comparison against the magnitude image-space baseline. Results: At full sampling, the hybrid and image-space models performed similarly. As acceleration increased, the hybrid model retained substantially more segmentation accuracy and significantly outperformed the magnitude image-space baseline across moderate to high undersampling levels. The same pattern was observed when noise was added directly to k-space: the hybrid model degraded more slowly, whereas the image-space baseline failed under heavier noise. This advantage was reproduced in the within-dataset synthetic control. Feature analysis suggested that the k-space stage and image-space stage played complementary roles, with frequency-domain filtering concentrated before image-domain lesion localization. Conclusion: K-space-aware deep learning improves the robustness of breast lesion segmentation under MRI undersampling and k-space noise, while matching image-space methods at full sampling.
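The evaluation protocol above (undersample k-space, add complex Gaussian noise, score with Dice) can be sketched for a single 2D slice as follows. The regular every-n-th-line mask and the function names `degrade_kspace` and `dice` are assumptions for illustration; the abstract does not specify the sampling pattern or noise levels.

```python
import numpy as np


def dice(pred, target) -> float:
    """Dice similarity coefficient between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return float(2.0 * np.logical_and(pred, target).sum() / denom)


def degrade_kspace(image, keep_every, noise_std, rng):
    """Undersample and corrupt the k-space of a 2D image, then reconstruct.

    keep_every: keep every n-th phase-encode line (assumed regular pattern).
    noise_std:  standard deviation of the real and imaginary noise parts.
    """
    kspace = np.fft.fftshift(np.fft.fft2(image))
    mask = np.zeros(image.shape[0], dtype=bool)
    mask[::keep_every] = True
    noise = rng.normal(0.0, noise_std, kspace.shape) \
        + 1j * rng.normal(0.0, noise_std, kspace.shape)
    # Noise is added only on acquired lines; skipped lines stay zero-filled.
    degraded = (kspace + noise) * mask[:, None]
    return np.fft.ifft2(np.fft.ifftshift(degraded))
```

An image-space baseline would segment the magnitude of the returned array, whereas a k-space-aware model would consume the degraded k-space itself.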

2605.22322 2026-05-22 cs.RO

How can reasoning capability empower the AI copilot robot in endoscopic surgery

推理能力如何赋能内窥手术中的AI助手机器人

Guankun Wang, Long Bai, Hongliang Ren

AI总结 本文研究了推理能力在内窥手术中AI助手机器人中的应用,提出通过整合多模态线索、解读手术意图和推断隐藏组织动态来提高手术的精确性、安全性和可持续性。

详情
Comments
Accepted by npj digital medicine
AI中文摘要

推理能力已显著提升了复杂逻辑推理和机器人决策制定在一般领域的能力。然而,其在人工智能(AI)助手机器人——特别是基于视觉-语言-动作(VLA)模型实现——在内窥手术中的潜力仍待探索。有效的推理应使AI助手机器人能够整合多模态线索、解读手术意图并推断隐藏的组织动态,从而缓解术中不确定性和对外科医生的认知负担。正确实施的推理驱动自主性可以将AI助手机器人从被动执行者转变为认知合作者,从而在临床实践中提高精确性、安全性和可持续性。

英文摘要

Reasoning capability has significantly advanced complex logical inference and robotic decision-making in general domains. However, its potential in the Artificial Intelligence (AI) copilot robot, particularly when implemented on the Vision-Language-Action (VLA) model, remains unexplored in endoscopic surgery. Effective reasoning should enable AI copilot robots to integrate multimodal cues, interpret surgical intent, and infer hidden tissue dynamics, thereby alleviating intraoperative uncertainty and the cognitive burden on surgeons. Properly implemented, reasoning-driven autonomy can transform AI copilot robots from reactive executors into cognitive collaborators, enhancing precision, safety, and sustainability in clinical practice.

2605.22311 2026-05-22 cs.CV

PIU: Proximity-guided Identity Unlearning in ID-Conditioned Diffusion Models

PIU:基于接近性的身份去学习在ID条件化的扩散模型中

Jose Edgar Hernandez Cancino Estrada, Mauro Díaz Lupone, Žiga Emeršič, Vitomir Štruc, Peter Peer, Darian Tomašević

AI总结 本文研究了在ID条件化的扩散模型中身份去学习的问题,提出了一种基于接近性的身份去学习框架PIU,通过在学习的身份空间中重新分配源身份到选定的锚身份来实现身份移除,并结合基于ArcFace表示几何的锚点选择策略,通过局部微调少量身份敏感的交叉注意力层实现有效的去学习。

详情
AI中文摘要

身份条件化的扩散模型能够生成高质量且身份一致的面部图像,但它们也引发了严重的隐私问题,因为模型可能在个人被遗忘后仍继续合成个体。尽管机器去学习已被广泛研究用于概念和数据删除,但身份去学习仍鲜有探索,特别是在直接基于身份嵌入而非文本提示的模型中。在本文中,我们研究了Arc2Face,一个最先进的身份条件化的潜在扩散模型用于面部生成,并引入了基于接近性的身份去学习(PIU),一种锚点引导的身份去学习框架。具体而言,我们将身份移除建模为身份替换目标,该目标将源身份重新分配到学习身份空间中选定的锚身份,并补充了受ArcFace表示几何启发的基于接近性的锚点选择策略。我们进一步表明,通过局部微调少量身份敏感的交叉注意力层可以实现有效的去学习。在许多目标身份上的实验表明,我们的框架能够有效抑制目标身份的生成,同时保持保留身份的真实性和身份一致性,这通过改进的去学习和图像质量指标以及定性评估得到验证。PIU框架的源代码可在https://github.com/edgarcancinoe/piu_unlearning 公开获取。

英文摘要

Identity-conditioned diffusion models enable high-quality and identity-consistent face generation, but they also raise severe privacy concerns, as models may continue to synthesize individuals despite their right to be forgotten. While machine unlearning has been extensively studied for concept and data removal, identity unlearning remains largely unexplored, particularly in models conditioned directly on identity embeddings rather than text prompts. In this work, we study identity unlearning in Arc2Face, a state-of-the-art identity-conditioned latent diffusion model for face generation, and introduce Proximity-guided Identity Unlearning (PIU), an anchor-guided framework for identity unlearning. Specifically, we formulate identity removal as an identity replacement objective that reassigns the source identity to a selected anchor identity in the learned identity space, and we complement it with a proximity-based anchor selection strategy motivated by the geometry of ArcFace representations. We further show that effective unlearning can be achieved through localized fine-tuning of a small subset of identity-sensitive cross-attention layers. Experiments across many target identities show that our framework effectively suppresses generation of the target identity while preserving realism and identity consistency for retained identities, as validated by improved performance on unlearning and image-quality metrics, together with qualitative evaluation. The source code for the PIU framework is publicly available at https://github.com/edgarcancinoe/piu_unlearning.
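One way to read "proximity-based anchor selection" is as a nearest-neighbour search in the identity-embedding space. The sketch below is only that reading, not the PIU method: it assumes cosine similarity as the proximity measure and picks the closest candidate, whereas the paper may use a different criterion. The names `cosine` and `select_anchor` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_anchor(source, candidates):
    """Return the name of the candidate identity closest to `source`.

    candidates: dict mapping identity name -> embedding vector.
    Assumption: 'proximity' means highest cosine similarity.
    """
    return max(candidates, key=lambda name: cosine(source, candidates[name]))
```

The replacement objective would then fine-tune the model so that conditioning on the source embedding produces the selected anchor identity instead.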

2605.22310 2026-05-22 cs.CL

Pattern-and-root inflectional morphology: the Arabic broken plural

词形和根的屈折形态:阿拉伯语的破碎复数

Alexis Amid Neme, Eric Laporte

AI总结 本文提出了一种描述阿拉伯语名词屈折形态的模型,重点关注讲阿拉伯语的语言学家对词典和其他语言资源的管理。其突破在于将传统的闪语根-词形模型反转为词形-根模型,优先考虑词形。该模型包括破碎复数(BP),即通过修改词干形成的复数。它基于传统闪语形态学中的根和词形概念。与传统阿拉伯语形态学相比,它将屈折的形式化描述与派生和语义分开。与传统阿拉伯语词典一样,可更新的词典按词元的词条组织,且参考拼写带有完整的变音符号。在该模型中,阿拉伯语文本的形态分析直接使用词形词典进行,无需形态音位规则。名词屈折的分类体系简单、有序且详细。作者通过将元音长度标记为v或vv并忽略元音音质来简化单数词形的分类。根的交替和正字法变体独立于词形、以事实方式编码,不涉及深层根、形态音位规则或正字法规则。具有三辅音BP的名词按22个词形、细分为90个类别进行分类,具有四辅音BP的名词按3个词形、细分为70个类别进行分类。在考虑仅影响单数的屈折变化后,这160个类别变为300个屈折类别。作者提供了一种直接的编码方案,并将其应用于3200个BP名词条目。

详情
Journal ref
Language Sciences, 2013, 40, pp.221-250
AI中文摘要

我们提出了一种已基本实现的模型,用于描述阿拉伯语名词的屈折形态,特别关注讲阿拉伯语的语言学家对词典和其他语言资源的管理。突破点在于将传统的闪语根-词形模型反转为词形-根模型,优先考虑词形。我们的模型包括破碎复数(BP),即通过修改词干形成的复数。它基于传统闪语形态学中的根和词形概念。然而,与传统阿拉伯语形态学相比,它将屈折的形式化描述与派生和语义分开。与传统阿拉伯语词典一样,可更新的词典按词元的词条组织,且参考拼写带有完整的变音符号。在我们的模型中,阿拉伯语文本的形态分析直接使用词形词典进行,无需形态音位规则。我们的名词屈折分类体系简单、有序且详细。我们通过将元音长度标记为v或vv并忽略元音音质来简化单数词形的分类。根的交替和正字法变体独立于词形、以事实方式编码,不涉及深层根、形态音位规则或正字法规则。具有三辅音BP的名词按22个词形、细分为90个类别进行分类,具有四辅音BP的名词按3个词形、细分为70个类别进行分类。在考虑仅影响单数的屈折变化后,这160个类别变为300个屈折类别。我们提供了一种直接的编码方案,并将其应用于3200个BP名词条目。

英文摘要

We present a substantially implemented model for describing the inflectional morphology of Arabic nouns, with special attention to the management of dictionaries and other language resources by Arabic-speaking linguists. The breakthrough lies in the reversal of the traditional root-and-pattern Semitic model into pattern-and-root, giving precedence to patterns over roots. Our model includes broken plurals (BPs), i.e. plurals formed by modifying the stem. It is based on the traditional notions of root and pattern of Semitic morphology. However, as compared to traditional Arabic morphology, it keeps the formal description of inflection separate from that of derivation and semantics. As in traditional Arabic dictionaries, the updatable dictionary is structured as lexical entries for lemmas, and the reference spelling is fully diacritized. In our model, morphological analysis of Arabic text is performed directly with a dictionary of words and without morphophonological rules. Our taxonomy for noun inflection is simple, orderly and detailed. We simplify the taxonomy of singular patterns by specifying vowel quantity as v or vv, and ignoring vowel quality. Root alternations and orthographical variations are encoded independently from patterns and in a factual way, without deep roots or morphophonological or orthographical rules. Nouns with a triliteral BP are classified according to 22 patterns subdivided into 90 classes, and nouns with a quadriliteral BP according to 3 patterns subdivided into 70 classes. These 160 classes become 300 inflectional classes when we take into account inflectional variations that affect only the singular. We provide a straightforward encoding scheme that we applied to 3,200 entries of BP nouns.
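The idea of recording vowel quantity (v or vv) while ignoring vowel quality can be illustrated with a small sketch. It operates on a romanized transliteration in which long vowels are written doubled, and it writes every consonant as C; both conventions are assumptions for illustration, since the paper works on fully diacritized Arabic script and defines its own encoding. The function name `quantity_skeleton` is hypothetical.

```python
VOWELS = set("aiu")


def quantity_skeleton(word):
    """Reduce a transliterated word to a consonant/vowel-quantity skeleton.

    Short vowels become 'v', doubled (long) vowels become 'vv', and every
    other character is treated as a consonant 'C'. Vowel quality is dropped.
    """
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch in VOWELS:
            if i + 1 < len(word) and word[i + 1] == ch:
                out.append("vv")  # long vowel
                i += 2
            else:
                out.append("v")   # short vowel
                i += 1
        else:
            out.append("C")
            i += 1
    return "".join(out)


# Singular "kitaab" (book) and its broken plural "kutub":
print(quantity_skeleton("kitaab"), quantity_skeleton("kutub"))
```

Under this reduction, singulars that differ only in vowel quality share one skeleton, which is the kind of simplification the abstract describes for the taxonomy of singular patterns.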

2605.22304 2026-05-22 cs.AI cs.DB cs.LG

Evaluation of Pipelines for Data Integration into Knowledge Graphs

数据整合到知识图谱的管道评估

Marvin Hofer, Erhard Rahm

AI总结 本文提出KGI-Bench基准测试,用于评估将不同输入数据整合到现有知识图谱的管道,通过覆盖度、正确性和一致性三个指标分析输出的知识图谱质量,并在电影领域提供基准数据集以评估12种管道的性能。

详情
AI中文摘要

将新数据整合到知识图谱(KG)通常涉及在工作流或管道中执行的不同任务。对于特定的整合问题,有许多可能的管道,但目前尚无通用方法来评估此类管道的整体质量和性能,以确定最佳选择。因此,我们提出一个新的基准KGI-Bench,用于评估将不同类型的输入数据整合到现有KG的管道。我们通过分析输出,即更新后的KG,使用三个互补的质量度量:覆盖度、正确性和一致性来评估管道。我们还提供了基准数据集(种子KG、三种格式的重叠输入数据、参考KG作为地面真实值)用于电影领域。为了展示所提基准的适用性和有用性,我们比较评估了12种管道,并分析了它们在不同输入数据格式和设计选择下的行为。

英文摘要

Integrating new data into knowledge graphs (KGs) typically involves different tasks that are executed within workflows or pipelines. There are many possible pipelines for a specific integration problem, but there is not yet a general approach for evaluating the overall quality and performance of such pipelines in order to determine the best choices. We therefore propose a new benchmark, KGI-Bench, to evaluate integration pipelines that ingest different kinds of input data into an existing KG. We evaluate pipelines by analyzing their output, i.e., the updated KG, with three complementary quality metrics: coverage, correctness, and consistency. We also provide benchmark datasets (seed KG, overlapping input data in three formats, reference KG as ground truth) for the movie domain. To demonstrate the applicability and usefulness of the proposed benchmark, we comparatively evaluate 12 pipelines and analyze their behavior across different input data formats and design choices.
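Two of the three metrics can be illustrated on sets of (subject, predicate, object) triples. The definitions below are assumptions, since the abstract names the metrics without defining them: coverage is taken as the share of reference triples present in the output KG, and correctness as the share of output triples present in the reference KG. Consistency is omitted because it depends on schema constraints the abstract does not describe.

```python
def coverage(output_kg, reference_kg):
    """Assumed definition: fraction of reference triples found in the output."""
    return len(output_kg & reference_kg) / len(reference_kg)


def correctness(output_kg, reference_kg):
    """Assumed definition: fraction of output triples found in the reference."""
    return len(output_kg & reference_kg) / len(output_kg)


# Toy movie-domain example (made-up triples, not from the benchmark).
reference = {
    ("m1", "title", "Alien"),
    ("m1", "year", "1979"),
    ("m1", "director", "p1"),
    ("p1", "name", "Ridley Scott"),
}
output = {
    ("m1", "title", "Alien"),
    ("m1", "year", "1979"),
    ("m1", "director", "p2"),  # wrong link produced by the pipeline
}
print(coverage(output, reference), correctness(output, reference))
```

Reporting the two values separately shows whether a pipeline loses facts (low coverage) or introduces wrong ones (low correctness).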

2605.22300 2026-05-22 cs.AI cs.LG cs.MA

Cross-domain benchmarks reveal when coordinated AI agents improve scientific inference from partial evidence

跨领域基准测试揭示协调AI代理在部分证据下提升科学推断何时有效

Fiona Y. Wong, Markus J. Buehler

AI总结 本文通过跨领域基准测试探讨协调AI代理在部分证据下提升科学推断的有效性,发现当不同学科各自捕捉现象部分时,跨通道复合方法优于单一通道基线,但在某些情况下分解并不总是提升整体性能。

详情
AI中文摘要

科学证据通常跨越仪器、数据库和学科,因此没有单一来源能完整记录现象。这使得确定协调AI代理何时能比更简单的科学工作流带来额外价值变得困难。我们通过涵盖四个科学任务的跨领域基准测试评估了这一问题:将分子结构映射到音乐表示、检测科学史上的范式转变、识别媒介传播疾病的出现以及甄别凌星系外行星候选体。每个案例均使用冻结的评估面板、预定义的评分协议、明确的基线、消融或零对照,以及声明的局限性。结果定义了三种运行模式。当不同学科各自只捕捉现象的一部分时,跨通道复合方法优于单通道基线:气候-媒介疾病出现任务达到AUROC 0.944,系外行星甄别达到AUROC 0.955。然而,系外行星工作流与强的联合摘要基线基本持平,表明分解并不总能提升整体性能。当一个信号占主导时,如范式转变检测,协调主要提升解释性和可追溯性。对于分子声音化,收益是表征性的而非预测性的。ScienceClaw x Infinite为此评估提供了可审计的工件和溯源层。因此,该基准测试仅在相应的性能、溯源或表征主张有明确比较对象支持时才认定协调具有价值。

英文摘要

Scientific evidence often spans instruments, databases, and disciplines, so no single source records the full phenomenon. This makes it difficult to determine when coordinated AI agents add value over simpler scientific workflows. We evaluate this question with a cross-domain benchmark spanning four scientific tasks: mapping molecular structure into musical representations, detecting historical paradigm shifts in science, identifying vector-borne disease emergence, and vetting transiting-exoplanet candidates. Each case uses a frozen evaluation panel, predefined scoring protocols, explicit baselines, ablations or null controls, and stated limitations. The results define three operating regimes. When different disciplines each capture only part of the phenomenon, cross-channel composites improve over single-channel baselines: climate-vector emergence reaches AUROC 0.944 and exoplanet vetting reaches AUROC 0.955. However, the exoplanet workflow is effectively tied with a strong combined-summary baseline, showing that decomposition does not always improve top-line performance. When one signal dominates, as in paradigm-shift detection, coordination mainly improves interpretation and traceability. For molecular sonification, the gain is representational rather than predictive. ScienceClaw x Infinite provides the auditable artifact and provenance layer for this evaluation. The benchmark therefore assigns value to coordination only when the corresponding performance, provenance, or representation claim is supported by explicit comparators.
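The AUROC values quoted above measure how often a positive case is scored above a negative one. A minimal sketch of that pairwise definition, with ties counted as half a win; it is quadratic in the number of samples and intended for illustration, not for large panels.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: 1 for positive cases, 0 for negative cases.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

Comparing a cross-channel composite score against each single-channel score with the same function and the same frozen labels is one way to reproduce the kind of comparison the benchmark reports.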

2605.22291 2026-05-22 cs.LG

Long-term Fairness with Selective Labels

长期公平性与选择性标签

Giovani Valdrighi, Isabel Valera, Marcos Medeiros Raimundo

AI总结 本文研究了在选择性标签设置下长期公平性的问题,提出了一种新的框架,通过结合观测数据和标签预测模型来估计真实的公平性度量,并提出了一种新的强化学习算法以实现有效长期公平决策。

详情
AI中文摘要

长期公平性算法旨在通过考虑决策政策与人口行为之间的动态关系,满足超越静态和短期观念的公平性。大多数先前的方法从可观察特征和标签评估性能和公平性度量,其中标签被假设为完全可观测。然而,在招聘或贷款等场景中,标签(例如偿还贷款的能力)是选择性标签,因为它们仅在积极决定(例如贷款被批准时)后才被揭示。在本文中,我们研究了选择性标签设置下的长期公平性,并分析表明,朴素的解决方案无法保证公平性。为了解决这一差距,我们引入了一个新的框架,利用观测数据和标签预测模型来估计真实的公平性度量值,将其分解为观测公平性和标签预测中的偏差。这使我们能够通过使用预测模型的置信度来推导出满足真实公平性的充分条件。最后,我们依赖我们的理论结果,提出了一种新的强化学习算法,以实现有效长期公平决策。在半合成环境中,所提出的算法在公平性和性能方面与具有oracle访问真实标签的智能体相当。

英文摘要

Long-term fairness algorithms aim to satisfy fairness beyond static and short-term notions by accounting for the dynamics between decision-making policies and population behavior. Most previous approaches evaluate performance and fairness measures from observable features and a label, which is assumed to be fully observed. However, in scenarios such as hiring or lending, the labels (e.g., ability to repay the loan) are selective labels as they are only revealed based on positive decisions (e.g., when a loan is granted). In this paper, we study long-term fairness in the selective labels setting and analytically show that naive solutions do not guarantee fairness. To address this gap, we then introduce a novel framework that leverages both the observed data and a label predictor model to estimate the true fairness measure value by decomposing it into the observed fairness and bias from label predictions. This allows us to derive sufficient conditions to satisfy true fairness from observable quantities by using the confidence in the predictor model. Finally, we rely on our theoretical results to propose a novel reinforcement learning algorithm for effective long-term fair decision-making with selective labels. In semisynthetic environments, the proposed algorithm reached comparable fairness and performance to an agent with oracle access to the true labels.
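The selective-labels problem can be made concrete with a toy sketch. It shows why a rate estimated only from observed labels is biased, and how a label predictor can fill in the unobserved outcomes. This is only an illustration of the setting under stated assumptions; it is not the paper's decomposition or its reinforcement learning algorithm, and the function names are hypothetical.

```python
def naive_positive_rate(labels, decisions):
    """Positive-label rate using only individuals with a positive decision."""
    observed = [y for y, d in zip(labels, decisions) if d == 1]
    return sum(observed) / len(observed)


def imputed_positive_rate(labels, decisions, predicted):
    """Positive-label rate with unobserved labels replaced by predictions.

    labels[i] is only meaningful when decisions[i] == 1; otherwise the
    predictor's output predicted[i] is used instead.
    """
    filled = [y if d == 1 else p for y, d, p in zip(labels, decisions, predicted)]
    return sum(filled) / len(filled)


# Toy group: two loans granted (labels observed), two rejected (labels hidden).
labels = [1, 1, None, None]
decisions = [1, 1, 0, 0]
predicted = [1, 1, 0, 1]
print(naive_positive_rate(labels, decisions))
print(imputed_positive_rate(labels, decisions, predicted))
```

A group fairness gap computed from the imputed rates is only as reliable as the predictor, which is why the abstract ties its sufficient conditions to the confidence in the predictor model.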

2605.22290 2026-05-22 cs.CV

Detection of Virus and Small Cell Patches in Foci Images Using Switchable Convolution and Feature Pyramid Networks

利用可切换卷积和特征金字塔网络在焦点图像中检测病毒和小细胞斑块

Amrita Singh, Snehasis Mukherjee

AI总结 本文提出了一种改进的YOLOv2检测器,结合特征金字塔网络和可切换空洞卷积机制,以提高在生物医学焦点图像中检测病毒斑块和小细胞斑块的性能,实验结果显示在不同IoU阈值下的mAP值显著提升。

详情
AI中文摘要

准确检测和计数焦点形成单位(FFU)图像中的病毒斑块对于量化病毒感染和分析细胞结构至关重要。这项任务具有挑战性,因为生物医学目标在大小、密度、对比度和形状上往往差异显著。本文提出了一种增强的YOLOv2检测器,集成了特征金字塔网络(FPN)以提高多尺度特征表示。我们还引入了可切换空洞卷积机制,以适应密集显微图像中细粒度目标的接收域。所提出的方法在生物医学焦点图像数据集上进行评估,用于病毒斑块和小细胞斑块的检测。对于小细胞斑块检测,模型在25%的交并比(IoU)阈值下达到40.5%的平均精度均值(mAP)。对于FFU病毒斑块检测,模型达到68%的mAP。这些结果表明,结合FPN特征融合与可切换卷积能够提高YOLOv2在专门生物医学目标检测任务中的适用性。

英文摘要

Accurate detection and counting of virus patches in focus-forming unit (FFU) images, also known as foci images, are important for quantifying viral infection and analyzing cellular structures. This task is challenging because biomedical targets often vary substantially in size, density, contrast, and shape. In this paper, we propose an enhanced YOLOv2-based detector that integrates a Feature Pyramid Network (FPN) to improve multi-scale feature representation. We also incorporate a switchable atrous convolution mechanism to adapt the receptive field for fine-grained targets in dense microscopy images. The proposed method is evaluated on biomedical foci image datasets for virus patch and small cell patch detection. For small cell patch detection, the model achieves a mean average precision (mAP) of 40.5% at a 25% Intersection over Union (IoU) threshold. For FFU virus patch detection, the model achieves an mAP of 68%. These results indicate that combining FPN-based feature fusion with switchable convolution improves the suitability of YOLOv2 for specialized biomedical object detection tasks.
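The 25% IoU threshold mentioned above decides whether a predicted box counts as a match for a ground-truth box. A minimal sketch, assuming axis-aligned boxes given as (x1, y1, x2, y2); the helper names are hypothetical.

```python
def box_iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, truth, threshold=0.25):
    """A prediction matches a ground-truth box if their IoU reaches the threshold."""
    return box_iou(pred, truth) >= threshold
```

A looser threshold such as 0.25 is more forgiving of localization error, which matters for very small patches where a shift of a few pixels changes the IoU substantially.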

2605.22287 2026-05-22 cs.AI

SciCore-Mol: Augmenting Large Language Models with Pluggable Molecular Cognition Modules

SciCore-Mol: 通过可插拔分子认知模块增强大语言模型

Yuxuan Chen, Changwei Lv, Yunduo Xiao, Zhongjing Du, Daquan Zhou, Yukun Yan, Zheni Zeng, Zhiyuan Liu

AI总结 本文提出SciCore-Mol框架,通过三个深度集成的可插拔认知模块解决大语言模型处理异构科学数据(如分子)时的语义鸿沟问题,实现分子理解、生成、反应预测和化学知识的综合性能提升。

详情
Comments
15 pages, 4 figures, 9 tables. Preprint
AI中文摘要

大型语言模型(LLMs)是实现万物智能范式的中心,但处理异构科学数据如分子时面临根本性挑战:离散语言符号与拓扑分子或连续反应数据之间的固有差距导致文本推理中的信息丢失和语义噪声。我们提出了SciCore-Mol,一个模块化框架,通过三个深度集成的可插拔认知模块弥合这一差距:拓扑感知感知模块、基于潜在扩散的分子生成模块以及反应感知推理模块。每个模块通过学习的表示接口连接到LLM主干,使信息交换比仅使用文本工具反馈更丰富。我们在多样化的化学任务上的实验表明,SciCore-Mol在分子理解、生成、反应预测和一般化学知识方面实现了强大的综合性能,8B参数开源系统在多个维度上可与甚至超越专有大模型竞争。这项工作为通过解耦、可插拔和灵活编排的模块系统为LLM提供科学专业知识提供了系统蓝图,对药物设计、化学合成和更广泛的科学发现有直接意义。

英文摘要

Large Language Models (LLMs) are central to the one-for-all intelligent paradigm, but they face a fundamental challenge when dealing with heterogeneous scientific data such as molecules: the inherent gap between discrete linguistic symbols and topological molecular or continuous reaction data leads to significant information loss and semantic noise in text-based reasoning. We propose SciCore-Mol, a modular framework that bridges this gap through three deeply integrated pluggable cognitive modules: a topology-aware perception module, a latent diffusion-based molecular generation module, and a reaction-aware reasoning module. Each module is coupled to the LLM backbone through learned representation interfaces, enabling richer information exchange than is possible with text-only tool feedback. Our experiments on diverse chemical tasks demonstrate that SciCore-Mol achieves strong comprehensive performance across molecular understanding, generation, reaction prediction, and general chemistry knowledge, with an 8B-parameter open-source system that is competitive with and in several dimensions surpasses proprietary large models. This work provides a systematic blueprint for equipping LLMs with scientific expertise through decoupled, pluggable, and flexibly orchestrated modules, with direct implications for drug design, chemical synthesis, and broader scientific discovery.

2605.22286 2026-05-22 cs.LG cs.AI

EmoTrack: Robust Depression Tracking from Counseling Transcripts across Session Regimes

EmoTrack: 从咨询记录中跨会话制度实现稳健的抑郁跟踪

Zhaomin Wu, Jiayi Li, Bingsheng He

AI总结 本文研究了从单次会话和多会话制度中通过咨询记录进行稳健抑郁跟踪的问题,提出了LongCounsel多会话咨询数据集和EmoTrack框架,结合LLM提取的临床信号和冻结的轮次级语义嵌入,训练症状特定预测器,并通过紧凑的跨会话记忆进一步结合先前会话,实验表明在真实单次会话基准上表现优异。

详情
AI中文摘要

基于文本的咨询是人工智能心理健康支持的重要接口,其中记录可能用于监控抑郁严重程度并标记需要及时人工审查的会话。然而,跨会话制度实现稳健的PHQ-8预测仍然具有挑战性:基于微调的方法可以利用更丰富的监督但可能在数据稀缺时泛化能力差,而基于提示的LLM方法数据高效但通常将每个记录整体处理,对纵向上下文支持有限。我们研究了从咨询记录中跨单次会话和多会话制度进行稳健抑郁跟踪。我们引入了LongCounsel多会话咨询数据集,具有会话级PHQ-8监督,用于评估在部分症状披露和跨会话连续性下的重复会话跟踪。我们进一步提出了EmoTrack,一种PHQ-8预测框架,结合LLM提取的临床信号与冻结的轮次级语义嵌入,并在得到的记录表示上训练症状特定预测器。当先前会话可用时,EmoTrack可通过紧凑的跨会话记忆进一步结合它们。在LongCounsel和DAIC-WOZ上的实验表明,EmoTrack在真实单次会话基准上实现了明显优势,包括在最强DAIC-WOZ基线上的MAE相对减少13.5%,并在LongCounsel上与最强的纵向基线保持竞争力。

英文摘要

Text-based counseling is an important interface for AI mental-health support, where transcripts may be used to monitor depression severity and flag sessions requiring timely human review. However, robust PHQ-8 prediction across session regimes remains challenging: fine-tuning-based methods can exploit richer supervision but may generalize poorly under data scarcity, while prompt-based LLM methods are data-efficient but usually treat each transcript holistically and provide limited support for longitudinal context. We study robust depression tracking from counseling transcripts across single-session and multi-session regimes. We introduce LongCounsel, a multi-session counseling dataset with session-level PHQ-8 supervision for evaluating repeated-session tracking under partial symptom disclosure and cross-session continuity. We further propose EmoTrack, a PHQ-8 prediction framework that combines LLM-extracted clinical signals with frozen turn-level semantic embeddings and trains symptom-specific predictors over the resulting transcript representation. When prior sessions are available, EmoTrack can further incorporate them through compact cross-session memory. Experiments on LongCounsel and DAIC-WOZ show that EmoTrack achieves a clear gain on the real single-session benchmark, including a 13.5% relative MAE reduction over the strongest DAIC-WOZ baseline, and remains competitive with the strongest longitudinal baseline on LongCounsel.
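The prediction target and the reported metric can be sketched as follows. PHQ-8 has eight items scored 0 to 3, giving a total between 0 and 24. Summing symptom-specific predictions into a total is an assumption about how the framework's per-symptom predictors are combined, and the function names are hypothetical.

```python
def phq8_total(item_scores):
    """Total PHQ-8 score from eight item scores, each in the range 0..3."""
    if len(item_scores) != 8 or any(not 0 <= s <= 3 for s in item_scores):
        raise ValueError("PHQ-8 needs eight item scores between 0 and 3")
    return sum(item_scores)


def mae(predicted, actual):
    """Mean absolute error between predicted and actual totals."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def relative_mae_reduction(baseline_mae, model_mae):
    """Fractional MAE reduction of a model relative to a baseline."""
    return (baseline_mae - model_mae) / baseline_mae
```

The 13.5% figure in the abstract is a value of `relative_mae_reduction` computed against the strongest DAIC-WOZ baseline.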

2605.22283 2026-05-22 cs.RO

Spatial Memory for Out-of-Vision Manipulation in Vision-Language-Action

视觉-语言-动作中视界外操作的空间记忆

Pengteng Li, Weiyu Guo, He Zhang, Tiefu Cai, Xiao He, Yandong Guo, Hui Xiong

AI总结 本文提出SOMA框架,用于解决视觉-语言-动作模型中视界外操作的问题,通过构建持久的空间记忆,使模型能够超越当前视觉范围进行推理,提升任务成功率和操作行为质量。

详情
Comments
Accepted by ICML 2026
AI中文摘要

我们引入SOMA,即面向视觉-语言-动作(VLA)模型中视界外操作的空间记忆框架。大多数现有VLA隐式假设任务相关物体始终可见,当目标物体处于相机视野之外时,会导致行为脆弱且被动。SOMA通过为VLA配备持久的空间记忆来解决这一限制,该记忆由可移动头部相机获取的多视角观察构建,使模型能够对当前视锥之外的区域进行推理。该框架包含三个组件:空间记忆构建,通过扫描将按角度采集的观察聚合为统一的空间-语义表示;动态记忆细化,保持随时间的全局一致性;以及情境记忆检索,在操作过程中激活与指令相关的空间线索。我们在五个具有挑战性的真实世界视界外操作任务上评估SOMA,包括目标物体最初不可见的多步骤和双臂场景。实验结果表明,SOMA不仅提高了任务成功率,还诱导出性质不同的操作行为:在部分可观测条件下,目标定位更快,视角搜索更少,抓取接近一次成功。在RoboCasa GR1和SimplerEnv上的额外实验进一步验证了SOMA记忆设计在传统完全可观测设置下的有效性。代码将很快发布。

英文摘要

We introduce SOMA, the Spatial Memory framework for Out-of-Vision Manipulation in Vision-Language-Action (VLA) models. Most existing VLAs implicitly assume that task-relevant objects are always visible, leading to brittle and reactive behaviors when targets fall outside the camera's field of view. SOMA addresses this limitation by equipping VLAs with a persistent spatial memory constructed from multi-view observations acquired via a movable head camera, enabling reasoning beyond the current visual frustum. The framework consists of three components: Spatial Memory Construction, which aggregates angular-wise observations into a unified spatial-semantic representation through scanning; Dynamic Memory Refinement, which maintains global consistency over time; and Contextual Memory Retrieval, which activates instruction-relevant spatial cues during manipulation. We evaluate SOMA on five challenging real-world out-of-vision manipulation tasks, including multi-step and dual-arm scenarios where target objects are initially invisible. Experimental results show that SOMA not only improves task success rates, but also induces qualitatively different manipulation behaviors, with faster target localization, reduced viewpoint search, and near one-shot grasping under partial observability. Additional experiments on RoboCasa GR1 and SimplerEnv further validate the effectiveness of SOMA's memory design under conventional fully observable settings. Code will be released soon.

2605.22275 2026-05-22 cs.LG

Adaptive Measurement Allocation for Learning Kernelized SVMs Under Noisy Observations

适应性测量分配用于在噪声观测下学习核化SVM

Artur Miroszewski

AI总结 本文提出了一种适应性测量分配策略,用于在噪声观测下学习核化支持向量机,通过结合几何敏感性和主动集不稳定性,优化核矩阵中决策关键区域的测量分配,从而提升支持向量恢复、边距估计和决策函数准确性。

详情
Comments
20 pages, 9 figures
AI中文摘要

核方法通常是在假设能够精确获取Gram矩阵的情况下进行建模的。然而,在新兴领域如量子机器学习中,每个核元素必须从噪声观测中推断出来,其准确性取决于如何分配有限的测量预算。尽管如此,现有方法大多依赖于均匀分配,这虽然平等地降低了估计方差,但忽略了核化分类器对Gram矩阵的高度非均匀依赖。在本文中,我们提出了一种适应性测量分配策略,用于从噪声伯努利观测中学习核化支持向量机。我们的方法结合了两个互补原则:(i) 几何敏感性,捕捉单个核元素扰动对分类器边距的影响,以及 (ii) 主动集不稳定性,量化由测量噪声引起的支持向量成员身份的离散变化概率。这些信号定义了一个任务感知的分配方案,将测量集中在核矩阵中最关键的决策区域。我们提供了理论分析,表明适应性分配的益处由诱导核重要结构的异质性决定,导致在不同情况下适应性或均匀策略更优。在合成数据集上的实验证明,在固定测量预算下,适应性分配显著提高了支持向量恢复、边距估计和决策函数准确性。双系数稳定性准则进一步使早停成为可能,仅使用少量测量成本即可达到近最优性能。此外,在从真实数据导出的量子核上的额外实验揭示了与已知现象如核集中度相一致的领域依赖行为。

英文摘要

Kernel methods are typically formulated under the assumption of exact, noise-free access to the Gram matrix. However, in emerging settings such as quantum machine learning, each kernel entry must be inferred from noisy observations, and its accuracy depends on how a limited measurement budget is allocated. Despite this, existing approaches overwhelmingly rely on uniform allocation, which equalizes estimator variance but ignores the highly non-uniform dependence of kernelized classifiers on the Gram matrix. In this work, we introduce an adaptive measurement-allocation strategy for learning kernelized Support Vector Machines (SVMs) from noisy Bernoulli observations. Our approach combines two complementary principles: (i) geometric sensitivity, capturing how perturbations of individual kernel entries affect the classifier margin, and (ii) active-set instability, quantifying the probability of discrete changes in support-vector membership induced by measurement noise. These signals define a task-aware allocation scheme that concentrates measurements on the most decision-critical regions of the kernel matrix. We provide a theoretical analysis showing that the benefit of adaptive allocation is governed by the heterogeneity of the induced kernel importance structure, leading to distinct regimes in which adaptive or uniform strategies are preferable. Empirical evaluations on synthetic datasets demonstrate that adaptive allocation significantly improves support-vector recovery, margin estimation, and decision-function accuracy under fixed measurement budgets. A dual-coefficient stability criterion further enables early stopping, achieving near-optimal performance while using only a fraction of the measurement cost. Additional experiments on quantum kernels derived from real-world data reveal a regime-dependent behavior aligned with known phenomena such as kernel concentration. Together...
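The contrast between uniform and adaptive allocation can be sketched as a budgeting problem: each kernel entry is estimated from repeated Bernoulli measurements, and the standard error of that estimate shrinks with the number of shots assigned to it. The importance scores below stand in for the paper's sensitivity and instability signals, whose exact form the abstract does not give; the function names and the proportional rule are assumptions.

```python
import math


def bernoulli_std_error(p, shots):
    """Standard error of a Bernoulli mean estimated from `shots` samples."""
    return math.sqrt(p * (1.0 - p) / shots)


def allocate_shots(importance, total_shots, min_shots=1):
    """Split a measurement budget across kernel entries.

    Every entry receives `min_shots`; the rest is divided in proportion to
    its importance score, with leftovers assigned by largest remainder.
    """
    n = len(importance)
    remaining = total_shots - n * min_shots
    if remaining < 0:
        raise ValueError("budget too small for the minimum allocation")
    total_weight = sum(importance)
    if total_weight == 0:
        weights = [1.0] * n  # no signal: fall back to uniform allocation
        total_weight = float(n)
    else:
        weights = list(importance)
    raw = [remaining * w / total_weight for w in weights]
    shots = [min_shots + int(r) for r in raw]
    leftover = total_shots - sum(shots)
    by_remainder = sorted(range(n), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    for i in by_remainder[:leftover]:
        shots[i] += 1
    return shots
```

With equal importance scores the function reduces to uniform allocation, which matches the abstract's observation that adaptive allocation only helps when the importance structure is heterogeneous.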

2605.22273 2026-05-22 cs.CV

Exposing Vulnerabilities in Visible-Infrared VLMs: A Unified Geometric Adversarial Framework with Cross-Task Transferability

揭示可见-红外VLMs中的漏洞:一种具有跨任务迁移性的统一几何对抗框架

Xiang Chen, Yuxian Dong, Chao Li, Chengyin Hu, Jiaju Han, Fengyu Zhang, Yiwei Wei, Jiahuan Long, Jiujiang Guo

AI总结 本文针对可见-红外视觉语言模型在多模态任务中的对抗鲁棒性不足问题,提出了一种基于分形几何的对抗框架CFGPatch,通过引入曲边分形元素和Fraser螺旋渲染机制,有效攻击VLMs并展示出跨任务迁移能力。

详情
AI中文摘要

视觉语言模型(VLMs)在多样化的多模态任务中实现了强大的性能,但其在可见-红外(VIS-IR)场景中的对抗鲁棒性仍处于探索阶段。为了解决这种跨模态威胁设置,我们提出了CFGPatch,一种基于三角分形几何的曲边分形对抗补丁框架,用于攻击VIS-IR VLMs。CFGPatch基于三角分形几何,用贝塞尔曲线元素替代刚性的直边元素,在保持多尺度分形自相似性的同时引入更平滑的轮廓、更丰富的方向变化和更灵活的形状变形。此外,我们设计了模态特定的Fraser螺旋渲染机制,以在可见和红外图像中注入细粒度纹理扭曲和误导性感知线索。通过将全局曲边分形几何与局部螺旋基外观干扰相结合,CFGPatch破坏了形状感知和纹理解释。我们进一步采用期望超越变换(EOT)以提高对常见图像级变换的鲁棒性。大量实验表明,CFGPatch能够有效欺骗VIS-IR VLMs,并在攻击效果和鲁棒性上均优于标准补丁基线。此外,针对零样本分类优化的对抗样本在图像描述和视觉问答任务中表现出良好的迁移能力,展示了在下游任务中的强大跨任务迁移性和泛化能力。

英文摘要

Vision-language models (VLMs) have achieved strong performance across diverse multimodal tasks, but their adversarial robustness in visible-infrared (VIS-IR) scenarios remains underexplored. This gap is critical because VIS-IR sensing is widely used in real-world perception systems to support reliable understanding under challenging imaging conditions. To address this cross-modal threat setting, we propose CFGPatch, a curved-edge fractal geometric adversarial patch framework for attacking VIS-IR VLMs. CFGPatch builds on triangular fractal geometry and replaces rigid straight-edged primitives with Bezier-curved elements, preserving multi-scale fractal self-similarity while introducing smoother contours, richer directional variation, and more flexible shape deformation. In addition, we design a modality-specific Fraser-spiral rendering mechanism to inject fine-grained texture distortions and misleading perceptual cues into visible and infrared images. By coupling global curved-fractal geometry with local spiral-based appearance interference, CFGPatch disrupts both shape perception and texture interpretation. We further adopt expectation over transformation (EOT) to improve robustness against common image-level transformations. Extensive experiments show that CFGPatch effectively fools VIS-IR VLMs and consistently outperforms standard patch baselines in attack effectiveness and robustness. Moreover, adversarial samples optimized for zero-shot classification transfer well to image captioning and visual question answering, demonstrating strong cross-task transferability and generalizability across downstream tasks.
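Replacing a straight triangle edge with a Bezier-curved one can be sketched with a quadratic Bezier curve whose control point is the edge midpoint pushed along the edge normal. This is an illustrative construction only: the `bulge` parameter, the choice of a quadratic curve, and the function name `bezier_edge` are assumptions, and the paper's actual parameterization may differ.

```python
def bezier_edge(p0, p1, bulge, steps=8):
    """Sample a quadratic Bezier curve from p0 to p1.

    bulge: signed offset of the control point along the edge normal, as a
           fraction of the edge length (0 gives a straight edge).
    Returns a list of steps + 1 (x, y) points including both endpoints.
    """
    mid_x, mid_y = (p0[0] + p1[0]) / 2.0, (p0[1] + p1[1]) / 2.0
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    # Rotate the edge direction by 90 degrees to obtain the normal.
    control = (mid_x - dy * bulge, mid_y + dx * bulge)
    points = []
    for k in range(steps + 1):
        t = k / steps
        a, b, c = (1 - t) ** 2, 2 * (1 - t) * t, t ** 2
        points.append((
            a * p0[0] + b * control[0] + c * p1[0],
            a * p0[1] + b * control[1] + c * p1[1],
        ))
    return points
```

Applying the same construction to each edge at every recursion level of a triangular fractal would keep the self-similar layout while giving every primitive a curved outline.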