arXivDaily: daily arXiv academic digest, updated Monday through Friday
2606.07685 2026-06-09 cs.LG cs.AI New submission

Test-Time Adaptive Composition for Machine Learning as a Service (MLaaS) in IoT Environments

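Deepak Kanneganti, Sajib Mistry, Sheik Mohammad Mostakim Fattah, Aneesh Krishna

AI summary: To address MLaaS compositions in IoT environments losing effectiveness as conditions change, the paper proposes a test-time adaptive (TTA) composition framework. A TTA-aware composability model and a service-level adaptation model adjust services at inference time while preserving composition performance, and the framework significantly reduces computational time.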

Abstract

The dynamic nature of Internet of Things (IoT) environments affects the long-term effectiveness of Machine Learning as a Service (MLaaS) compositions. Existing adaptive composition methods are mainly based on service replacement or re-composition, where identifying suitable substitutes is difficult and time-consuming. To address this, we propose a novel Test-Time Adaptive (TTA) composition framework for MLaaS in IoT environments. First, we introduce a TTA-aware composability model to determine whether adapted services remain compatible with the existing composition. Next, we design a service-level adaptation model to adjust individual services during inference while preserving composition performance. Experimental results demonstrate that the proposed framework reduces computational time more effectively than traditional adaptive approaches.

2606.07684 2026-06-09 cs.LG cs.AI New submission

Semantic Cache Distillation: Efficient State Transfer via Reuse and Selective Patching

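Qianli Ma, Zhiqing Tang, Hanshuai Cui, Zhi Yao, Weijia Jia

Affiliations: University of Science and Technology of China

AI summary: Targeting the communication bottleneck of KV-cache transfer in large language model inference and the semantic misalignment that arises when caches are reused across models, the paper proposes Semantic Cache Distillation (SCD). Using low-rank subspace reconstruction and prediction of normalized inputs at sparse transition layers, SCD achieves up to a 2.65x time-to-first-token speedup with generation quality close to the oracle.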

Comments: Accepted to ICML 2026

Abstract

Disaggregated serving alleviates memory bottlenecks in Large Language Model (LLM) inference but creates a severe communication bottleneck: transmitting high-dimensional Key-Value (KV) caches often dominates time-to-first-token (TTFT). Moreover, reusing caches across heterogeneous models (e.g., base and fine-tuned variants) causes semantic misalignment that accumulates over layers, degrading generation quality. We propose Semantic Cache Distillation (SCD), a loss-constrained framework that replaces raw KV transmission with compact semantic codes. SCD addresses these challenges via two mechanisms: (1) Reuse, which reconstructs most layers from low-rank subspaces to minimize transfer cost, and (2) Patch, which predicts normalized inputs at sparse transition layers to truncate error propagation. Empirically, SCD delivers up to a 2.65$\times$ TTFT speedup over the oracle consumer prefill and dominates quantization and selective recomputation baselines on the quality-latency Pareto frontier in bandwidth-constrained regimes, while keeping generation quality within 5% F1 of the oracle.
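To make the transfer-cost argument behind the Reuse mechanism concrete, the following is a minimal sketch of why a low-rank representation is cheaper to transmit than a raw matrix. It is not SCD's actual encoder: the function names, the shapes (1000 tokens, head dimension 128, rank 16), and the idea of sending per-token codes plus a basis are all illustrative assumptions.

```python
def transfer_cost(num_tokens, head_dim, rank):
    """Return (full, low_rank) element counts for one cached matrix.

    Hypothetical accounting: the low-rank form sends per-token codes
    (num_tokens x rank) plus a basis (rank x head_dim).
    """
    full = num_tokens * head_dim
    low_rank = num_tokens * rank + rank * head_dim
    return full, low_rank


def reconstruct(codes, basis):
    """Rebuild an approximate matrix as codes @ basis using plain lists."""
    return [
        [sum(c * basis[k][j] for k, c in enumerate(row)) for j in range(len(basis[0]))]
        for row in codes
    ]


full, low = transfer_cost(num_tokens=1000, head_dim=128, rank=16)
print(full, low, round(full / low, 2))  # 128000 18048 7.09
```

Under these assumed shapes the low-rank form moves roughly seven times fewer values. The quality cost of the approximation is what the abstract's Patch mechanism is described as containing.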

2606.07674 2026-06-09 cs.CV q-bio.NC New submission

Simultaneous hyperkinetic movement disorders phenotyping: a cross-cohort pediatric transfer study using routine videos, markerless pose estimation and a tabular foundation model

Laura Cif, Diane Demailly, Zohra Souei, Muhammad Mushhood Ur Rehman, Juan Dario Ortigoza Escobar, Mayté Castro Jiménez, Cécile A. Hubsch, Sophie Huby, Morgan Dornadic, Gun-Marie Hariz, Eduardo M. Moraud, Jocelyne Bloch, Gabriella A. Horvath, Xavier Vasques

Affiliations: Lausanne University Hospital (CHUV); University of Lausanne (UNIL); Institut du Neurone; Clinique Beau Soleil; Institut Mutualiste Montpelliérain; Military University Hospital of Sfax; University of Edinburgh; Hospital Sant Joan de Déu; European Reference Network for Rare Neurological Diseases (ERN-RND); Instituto de Salud Carlos III; CHU Montpellier; Umeå University; University Hospital Lausanne; Ecole Polytechnique Fédérale de Lausanne; British Columbia Children’s Hospital

AI summary: Proposes a video-based framework that combines markerless pose estimation, kinematic descriptors, and a pretrained foundation model. Trained on adult data and transferred to a pediatric cohort, it detects multiple hyperkinetic movement disorder phenomenologies simultaneously after lightweight calibration.

Abstract

Objective: To develop and externally test a video-based framework for simultaneous detection of hyperkinetic movement disorder (MD) phenomenologies (dystonia, tremor, myoclonus, chorea, athetosis, ballismus, stereotypies, and tics) using routine clinical recordings, with explicit testing of external, cross-cohort transfer from adult to pediatric populations. Methods: In this proof-of-concept study, the framework combines markerless pose estimation, kinematic descriptors, and a pretrained foundation model. A shared predictive backbone was developed on 21 adults with confirmed hyperkinetic MDs and 4 healthy controls assessed under a standardized protocol. External validation was performed on an independent cohort: a real-world pediatric sample (n=12, monogenic combined MDs). For the external dataset, the backbone was deployed without retraining; lightweight calibration adjusted only the final subject-level decision step using a small labeled subset of patients selected by clinicians as representative of the cohort's phenotypic range. Results: After local calibration of the decision layer on the clinician-selected subset, performance improved consistently on the held-out pediatric patients (n=7): Hamming accuracy rose from 0.804 to 0.839 and the Jaccard index from 0.548 to 0.633. This calibrated performance was preserved, and the Jaccard index further improved, when the evaluation was restricted to the phenomenologies with more definite clinician agreement (Hamming accuracy 0.9, Jaccard index 0.786), indicating that the gains did not rest on the least-reliable labels.
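The two reported metrics are standard multi-label scores, and a small sketch may help in reading them. The code below assumes one binary label vector per subject (one entry per phenomenology) and a per-subject average for the Jaccard index. The abstract does not state which averaging the authors used, so that choice and the toy labels are assumptions.

```python
def hamming_accuracy(y_true, y_pred):
    """Fraction of individual (subject, label) entries predicted correctly."""
    total = sum(len(row) for row in y_true)
    matches = sum(
        t == p for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return matches / total


def jaccard_index(y_true, y_pred):
    """Mean over subjects of |true AND pred| / |true OR pred|."""
    scores = []
    for true_row, pred_row in zip(y_true, y_pred):
        inter = sum(1 for t, p in zip(true_row, pred_row) if t and p)
        union = sum(1 for t, p in zip(true_row, pred_row) if t or p)
        # Convention assumed here: two empty label sets count as a perfect match.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)


# Toy example: 2 subjects, 4 phenomenology labels each (made-up values).
y_true = [[1, 0, 1, 0], [0, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 1, 0]]
print(hamming_accuracy(y_true, y_pred))  # 0.75
print(jaccard_index(y_true, y_pred))     # 0.5
```

Hamming accuracy rewards every correctly absent label, while the Jaccard index only looks at labels that are present in the truth or the prediction, which is why the two numbers in the abstract differ so much.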

2606.07673 2026-06-09 cs.SD cs.AI cs.LG New submission

A Hierarchical Feature Engineering Framework for Automated Classification of Phonotraumatic and Non-Phonotraumatic Vocal Hyperfunction

June-Woo Kim, Kangwook Jang, Minu Kim, Hyunju Lee

Affiliations: Department of Electronic Engineering, Wonkwang University; AI Convergence Research Institute, Wonkwang University; GIST InnoCORE AI-Nano Convergence Institute for Early Detection of Neurodegenerative Diseases, Gwangju Institute of Science and Technology; School of Electrical Engineering, KAIST; Department of AI Convergence, Gwangju Institute of Science and Technology

AI summary: Proposes a hierarchical feature engineering framework with static, dynamic, ratio-based, and coupling features for classifying phonotraumatic and non-phonotraumatic vocal hyperfunction. Coupling features prove crucial for both tasks, with an AUC of 0.891 for PVH and 0.728 for NPVH.

Comments: Interspeech 2026

Abstract

Ambulatory neck-surface acceleration enables non-invasive monitoring of vocal hyperfunction, yet robust biomarkers for its subtypes remain limited. This study investigates the NeckVibe Challenge dataset to distinguish phonotraumatic (PVH) and non-phonotraumatic (NPVH) vocal hyperfunction from healthy controls. We propose a hierarchical feature engineering framework comprising (i) static, (ii) dynamic, (iii) ratio-based, and (iv) coupling features that capture source-filter interactions. While univariate statistical analysis shows strong separability for PVH but limited significance for NPVH, our machine learning pipeline, tailored for high-dimensional feature integration, identifies that coupling features are crucial for both tasks. We achieve an AUC of 0.891 for PVH and 0.728 for NPVH, suggesting that while PVH is near-linearly separable, NPVH discrimination benefits from modeling non-linear feature interactions.
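One way to read the AUC figures is as the probability that a randomly chosen patient receives a higher classifier score than a randomly chosen healthy control. The sketch below computes AUC directly from that pairwise definition. The scores are made up for illustration and have no connection to the NeckVibe data.

```python
def auc(pos_scores, neg_scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up scores: two patients, two controls.
print(auc([0.9, 0.8], [0.7, 0.85]))  # 0.75
```

On this reading, 0.891 for PVH means about nine in ten patient-control pairs are ordered correctly, and 0.728 for NPVH means closer to seven in ten.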

2606.07670 2026-06-09 cs.CV cs.AI New submission

Liquid Neural Networks as a Drop-in Continuous-Time Deformation Field for Dynamic 3D Gaussian Splatting

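Mingzhao Li, Arghya Pal, Guan Yuan Tan

Affiliations: Monash University

AI summary: Proposes replacing the MLP with Closed-form Continuous-time (CfC) cells, a Liquid Neural Network (LNN), to build an explicit continuous-time deformation field. In dynamic scene reconstruction it matches or exceeds the MLP baseline and is strongest on high-frequency articulated motion.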

Abstract

Deformable 3D Gaussian Splatting (D-3DGS) reconstructs dynamic scenes from monocular video by deforming a canonical set of 3D Gaussians through a positionally encoded MLP of frame time t. Although fitted to a continuous variable, the MLP couples no two values of t in its architecture and effectively predicts discrete per-frame offsets, leaving temporal smoothness to emerge only as a byproduct of optimisation. We redesign the deformation field as a stack of Closed-form Continuous-time (CfC) cells, a Liquid Neural Network (LNN) formulation that is the closed-form solution of the Liquid Time-constant ODE, while preserving every other part of the D-3DGS pipeline. Each cell exposes a sigmoidal time gate that interpolates between two candidate hidden states, baking a learned smooth response to t into the loss landscape without invoking any numerical solver. On the eight D-NeRF and seven NeRF-DS scenes the liquid field matches or exceeds the MLP baseline in aggregate, with its largest gains concentrated on the scenes with the most high-frequency articulated motion. The result is a near-zero-friction architectural design that turns the discrete MLP deformation field into an explicit continuous-time function of t.
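The sigmoidal time gate can be illustrated with a scalar sketch of the CfC update, in which a gate sigmoid(-f * t) blends two candidate states g and h. In a real cell, f, g, and h are outputs of learned network heads applied to the input and hidden state. Here they are plain numbers chosen for illustration, so this is a toy under that assumption rather than the paper's deformation field.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def cfc_gate_blend(f, g, h, t):
    """Scalar CfC-style update: gate * g + (1 - gate) * h, gate = sigmoid(-f * t).

    f, g, h stand in for the outputs of learned heads (hypothetical values here).
    """
    gate = sigmoid(-f * t)
    return gate * g + (1.0 - gate) * h


# With f > 0 the output moves smoothly from the midpoint of g and h (t = 0)
# toward h as t grows, with no ODE solver involved.
for t in (0.0, 0.5, 1.0, 5.0):
    print(t, round(cfc_gate_blend(f=1.0, g=2.0, h=4.0, t=t), 4))
```

Because t enters through a smooth gate, nearby times produce nearby outputs by construction, which is the coupling across t that the abstract says a per-frame MLP lacks.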

2606.07669 2026-06-09 cs.CV cs.AI New submission

MemoVAD: Resource-Efficient Video Anomaly Detection via Dynamic Semantic Memory in Edge Computing Scenarios

Guo Li, Jiandian Zeng, Yang Li, Zihao Peng, Ke Chen, Tian Wang

Affiliations: Institute of Artificial Intelligence and Future Networks, Beijing Normal University; School of Computing and Artificial Intelligence, Southwest Jiaotong University; Engineering Research Center of Cloud-Edge Intelligent Collaboration on Big Data, Ministry of Education, Beijing Normal University

AI summary: Proposes MemoVAD, an edge-cloud collaborative framework that selectively invokes a cloud vision-language model through an uncertainty-aware gating policy and caches prototypes in a dynamic semantic memory, improving video anomaly detection performance while reducing communication overhead.

Comments: Accepted by IJCAI 2026

Abstract

Deploying Video Anomaly Detection (VAD) in real-world surveillance faces a fundamental tension between the demand for high-level semantics to ensure effectiveness and the limited computational resources of edge devices. Vision-Language Models (VLMs) provide rich open-vocabulary semantics, but their latency and computational cost preclude on-device deployment. To address this challenge, we propose MemoVAD, an edge-cloud collaborative framework that selectively incorporates VLM semantics into streaming VAD. MemoVAD runs most inference on the edge with a lightweight detector and a causal Temporal Context Encoder (TCE) to model temporal dependencies. Specifically, we introduce an Uncertainty-Aware Gating (UAG) policy grounded in Subjective Logic to model perceived uncertainty and query the cloud-based VLM only for high-uncertainty and semantically novel clips. In addition, a Dynamic Semantic Memory (DSM) is designed to cache VLM-verified prototypes for efficient retrieval, enabling the edge model to progressively incorporate VLM-level semantics via a semantic adapter. Experiments on the UCF-Crime and XD-Violence datasets on a real edge device show that MemoVAD substantially reduces communication overhead while surpassing state-of-the-art performance.
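A gating rule of this kind can be sketched with the common evidential reading of Subjective Logic, where K class evidences e_k give a Dirichlet strength S = sum(e_k + 1) and an uncertainty mass u = K / S. Whether MemoVAD uses exactly this mapping is not stated in the abstract. The thresholds, the `novelty` score, and the function names below are hypothetical.

```python
def subjective_uncertainty(evidence):
    """Uncertainty mass u = K / S with S = sum(e_k + 1) over K classes."""
    num_classes = len(evidence)
    strength = sum(e + 1.0 for e in evidence)
    return num_classes / strength


def should_query_cloud(evidence, novelty, u_thresh=0.5, novelty_thresh=0.3):
    """Query the cloud VLM only for uncertain AND semantically novel clips.

    `novelty` is assumed to be a distance from the clip embedding to the
    nearest cached prototype; both thresholds are illustrative.
    """
    return subjective_uncertainty(evidence) > u_thresh and novelty > novelty_thresh


print(subjective_uncertainty([0.0, 0.0]))   # 1.0 (no evidence at all)
print(subjective_uncertainty([8.0, 0.0]))   # 0.2 (strong evidence for one class)
print(should_query_cloud([0.0, 0.0], novelty=0.9))  # True
print(should_query_cloud([0.0, 0.0], novelty=0.1))  # False, a cached prototype is close
```

Requiring both conditions is what keeps communication low in this sketch: a confident clip never leaves the device, and an uncertain clip that resembles a cached prototype can be resolved from memory instead.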

2606.07661 2026-06-09 cs.CV cs.DL New submission

PereStruct: Multimodal Semantic Assembly for Robust Historical Document Parsing

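Maksim Shandybo, Ivan Bespalov, Daniil Yefimov, Marina Kosheleva, Alexander Loukianov

Affiliations: IGIC RAS; Yandex Cloud; National University of Science and Technology MISIS; Nekrasov Central Universal Scientific Library

AI summary: Targeting the difficulty of parsing the complex multi-column layouts of historical newspapers, the paper proposes a multimodal approach that combines a fine-tuned YOLO with a semantic assembly module, reaching an F1 of 0.904 on block-to-article mapping and a BLEU of about 0.96, substantially better than general-purpose vision-language models.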

Comments: Code and data available at https://github.com/makSShandybo/PereStruct

Abstract

Parsing historical documents with complex, non-standard layouts remains a fundamental bottleneck in large-scale archival digitization. Unlike modern typography, historical newspapers exhibit severe physical degradation and highly irregular page structures that confound even state-of-the-art vision-language models, presenting severe out-of-distribution challenges. We address this gap with an automated pipeline specifically designed for parsing historical newspapers, documents characterized by particularly intricate multi-column layouts. Our approach combines a fine-tuned YOLO architecture for layout analysis and block detection, trained on 1,426 fully human-annotated scanned pages, with a novel semantic assembly module that reconstructs articles by jointly modeling lexical-semantic similarity via TF-IDF, visual embeddings from our fine-tuned YOLO, and geometric layout constraints. This multi-modal integration yields state-of-the-art performance, achieving an F1 score of 0.904 on block-to-article mapping. Notably, end-to-end evaluation against vision-language models (Qwen3.6-35B-A3B and Qwen3.6-Plus) demonstrates that PereStruct achieves substantially higher fidelity (BLEU approximately 0.96 vs 0.34), validating that modular architectures excel where generic VLMs fail on complex historical layouts. To support reproducibility and advance research in this domain, we release both the training corpus of 599 annotated pages and a curated PereStruct benchmark of 93 pages with expert-verified ground-truth block-to-article mappings. This framework establishes a robust foundation for high-fidelity digitization and semantic reconstruction of complex archival materials.
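The lexical part of the semantic assembly step can be illustrated with a small TF-IDF similarity between text blocks. This sketch covers only the TF-IDF signal. The visual embeddings and geometric constraints mentioned in the abstract are left out, and the tokenizer, the smoothed IDF formula, and the sample blocks are assumptions rather than PereStruct's actual implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per document (whitespace tokenization)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
            term: (c / len(toks)) * (math.log((1 + n) / (1 + doc_freq[term])) + 1.0)
            for term, c in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up text blocks: the first two plausibly belong to the same article.
blocks = [
    "the harvest festival opens",
    "harvest festival opens today",
    "railway strike continues",
]
vecs = tfidf_vectors(blocks)
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

In a full pipeline a score like this would be one of several inputs when deciding whether two detected blocks continue the same article.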

2606.07660 2026-06-09 cs.CV cs.LG New submission

Need We Teach Foundation Models What is a Generative Image? Gradient-Free Generative Artifact Detection via Analytic Spectral Adaptation

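Qiaoyu Chen, Bing Zhang

Affiliations: Harbin University of Commerce

AI summary: Proposes a gradient-free method that recasts generative artifact detection as an out-of-distribution anomaly measurement problem. By analytically decoupling statistical and semantic deviations, it significantly outperforms gradient-optimized methods in a zero-shot setting.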

Abstract

Adapting foundation models to detect generative artifacts via gradient-based updates compromises their intrinsic representations. Under optimization on limited samples, models overfit to local domain shortcuts. Fine-tuning massive weights on specialized data introduces erroneous inductive biases, inducing a measurable $\mathcal{L}_2$ norm perturbation in the high-dimensional feature space -- a phenomenon we formalize as anchor drift. Amplified by nonlinear activations, this drift impairs zero-shot forgery detection across unseen domains. We propose a gradient-free methodology reframing detection from binary classification to an out-of-distribution (OOD) anomaly measurement problem. Treating a frozen foundation model as a stable coordinate system, we establish an absolute natural anchor on the real visual manifold by analytically decoupling statistical and semantic deviations, derived from attention-weighted spatial moments and orthogonal projection of perceptual inconsistencies. Evaluated in an extreme zero-shot setting (trained on face forgeries, tested on universal Text-to-Image generations), our method significantly outperforms gradient-optimized paradigms. Backpropagation-free forward passes and linear solvers enable hardware-agnostic, edge-deployable calibration with minimal latency. Furthermore, the Sherman-Morrison formula unlocks instantaneous online learning against novel attacks and enables privacy-preserving federated collaboration via covariance delta transmission.
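The Sherman-Morrison formula updates a known inverse after a rank-one change without inverting again: (A + u v^T)^{-1} = A^{-1} - (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u). The sketch below implements that identity on plain lists. How the paper applies it is not detailed in the abstract, so the link to updating an inverse covariance as new samples arrive is an assumption.

```python
def sherman_morrison(a_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, for square A stored as nested lists."""
    n = len(a_inv)
    a_inv_u = [sum(a_inv[i][k] * u[k] for k in range(n)) for i in range(n)]
    vt_a_inv = [sum(v[k] * a_inv[k][j] for k in range(n)) for j in range(n)]
    denom = 1.0 + sum(v[k] * a_inv_u[k] for k in range(n))
    if abs(denom) < 1e-12:
        raise ValueError("update makes the matrix singular")
    return [
        [a_inv[i][j] - a_inv_u[i] * vt_a_inv[j] / denom for j in range(n)]
        for i in range(n)
    ]


# A = identity, u = v = [1, 0], so A + u v^T = [[2, 0], [0, 1]].
identity = [[1.0, 0.0], [0.0, 1.0]]
print(sherman_morrison(identity, [1.0, 0.0], [1.0, 0.0]))  # [[0.5, 0.0], [0.0, 1.0]]
```

Each update costs on the order of n squared operations instead of the n cubed of a fresh inversion, which is consistent with the abstract's emphasis on fast, backpropagation-free calibration.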

2606.07659 2026-06-09 cs.CV eess.IV New submission

Real-Time Industrial Defect Detection on Edge Hardware Using Fine-Tuned YOLOv8: A Systematic Benchmark on the NEU Surface Defect Database and MVTec AD with Automotive & Battery Manufacturing Extensions

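Emmanuel Ezeji Somtochukwu, Nitesh Rijal

Affiliations: Zema AI Labs

AI summary: Proposes Industrial-YOLO, a framework based on a fine-tuned YOLOv8 and accelerated with TensorRT and OpenVINO, which achieves real-time defect detection at over 120 FPS on edge hardware with an mAP of 98.5% and reports zero-latency performance in validation on an automotive assembly line.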

Comments: 11 pages, 4 figures, 7 tables. Includes edge optimization framework (TensorRT/OpenVINO) and industrial hardware benchmark analysis

Abstract

Automated surface defect detection is critical for ensuring rigorous quality control in high-speed manufacturing environments. While deep learning models offer remarkable accuracy, deploying them on resource-constrained edge hardware without introducing significant latency remains a persistent challenge. This paper presents Industrial-YOLO, an edge-optimized framework built upon a fine-tuned YOLOv8 architecture specifically engineered for real-time industrial defect detection. We conduct a systematic benchmark utilizing the NEU surface defect database for steel sheets and the MVTec AD dataset, supplemented with custom automotive manufacturing extensions representing real-world structural anomalies (scratches, pits, and inclusions). To bridge the gap between algorithmic complexity and edge hardware constraints, target-specific optimizations are introduced via TensorRT and OpenVINO acceleration engines. Experimental results demonstrate that Industrial-YOLO achieves a high-velocity inference speed exceeding 120 FPS on the NVIDIA Jetson Orin platform while maintaining an exceptional mean Average Precision (mAP) of 98.5%. The proposed framework showcases highly robust, zero-latency performance when deployed directly onto an active automotive assembly line, offering a scalable blueprint for next-generation automated optical inspection (AOI) systems.

2606.07658 2026-06-09 cs.CV cs.LG New submission

What neurosurgeons need to see: synthetic intra-operative MRI from ultrasound for brain-shift compensation in brain tumour surgery

Santiago Cepeda, Olga Esteban-Sinovas, Ignacio Arrese, Rosario Sarabia

Affiliations: Department of Neurosurgery, Neurovascular Unit, Río Hortega University Hospital, Valladolid, Spain; Specialized Group in Biomedical Imaging and Computational Analysis (GEIBAC), Instituto de Investigación Biosanitaria de Valladolid (IBioVALL), Valladolid, Spain

AI summary: Proposes an end-to-end pipeline that merges the preoperative MRI, a synthetic MRI generated from intraoperative ultrasound, and a deformable registration anchored on that synthetic image to produce a whole-brain MRI volume in the preoperative imaging space, compensating for brain shift and giving neuronavigation an MRI-like update of the operative field.

Abstract

Maximal safe resection is the primary objective in glioma surgery. Neuronavigation guidance is progressively degraded by brain shift after dural opening. Intraoperative MRI can compensate but needs dedicated infrastructure and is rarely available, whereas intraoperative ultrasound (ioUS) is inexpensive, repeatable, and compatible with routine workflows. Navigation systems combining ioUS with preoperative MRI usually rely on rigid registration; even deformable multimodal registration is limited by ultrasound speckle contrast, a narrow field of view, and the inability to represent structures absent from the preoperative scan, most critically the resection cavity and residual tumor. We propose an end-to-end pipeline that generates a new whole-brain MRI volume in the preoperative imaging space by merging the preoperative MRI, a synthetic MRI generated from the ioUS, and a deformable registration anchored on that synthetic image. It integrates a 2.5D residual-transformer synthesis backbone (ResViT-2.5D) and a two-stage registration coupling NiftyReg with a synthesis-anchored SynthMorph stage, operating directly on raw scanner inputs. On a post-resection ReMIND cohort, ResViT-2.5D produced synthetic images closely matching the intraoperative T2 across structural, intensity, and perceptual metrics. In 14 subjects with 215 expert landmarks, the synthesis-anchored registration reduced the mean target registration error from 6.27 to 5.86 mm, matching a strong classical NiftyReg baseline (5.85 mm) while yielding a diffeomorphic deformation field in every subject. The contribution is not a gain in registration accuracy but the integrated volume itself, which, inside the ultrasound field of view, reflects the intraoperative post-resection state. This provides the surgeon with an MRI-like update of the operative field with potential for integration into surgical-navigation workflows.
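The target registration error quoted above is, in its usual form, the mean Euclidean distance between corresponding landmarks after registration. A minimal sketch follows, assuming landmarks are given as (x, y, z) coordinates in millimetres. The coordinates are invented, and the paper's exact aggregation across subjects is not specified in the abstract.

```python
import math


def mean_tre(fixed_landmarks, warped_landmarks):
    """Mean Euclidean distance (same units as the input) between landmark pairs."""
    if len(fixed_landmarks) != len(warped_landmarks):
        raise ValueError("landmark lists must have the same length")
    distances = [math.dist(f, w) for f, w in zip(fixed_landmarks, warped_landmarks)]
    return sum(distances) / len(distances)


# Invented landmarks in millimetres: one pair is 5 mm apart, one coincides.
fixed = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
warped = [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]
print(mean_tre(fixed, warped))  # 2.5
```

Because the metric averages over landmarks, a change such as 6.27 mm to 5.86 mm summarizes all 215 points and says little about any single location, which fits the authors' framing that accuracy is not the main contribution.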

2606.07654 2026-06-09 cs.CV cs.AI New submission

MM-Matryoshka: Towards Budget-Elastic Visual Document Retrieval via a 2D Multimodal Matryoshka Training Framework

Haowen Xiang, Yibo Yan, Jiahao Huo, Yu Huang, Yi Cao, Mingdong Ou, Xuming Hu

Affiliations: Hong Kong University of Science and Technology (Guangzhou); Alibaba Cloud Computing; Hong Kong University of Science and Technology

AI summary: Proposes MM-Matryoshka, a 2D Matryoshka training framework that lets a visual document retriever choose its budget elastically along both vector dimension and encoder depth, without training separate models for different budgets.

Abstract

Multi-vector visual document retrievers achieve strong fine-grained matching by representing each page with multiple vectors from deep Vision-Language Models (VLMs), but this design makes deployment expensive in both storage and computational overhead. Existing efficiency techniques usually optimize only part of this budget, leaving multimodal retrievers without a unified way to trade accuracy for both vector width and encoder depth. Therefore, we propose MM-Matryoshka, a 2D Matryoshka training framework for budget-elastic Visual Document Retrieval (VDR) that makes ColPali-style multi-vector retrieval elastic along both vector dimension and encoder layer. At inference time, a single retriever can select a budget along both axes without training separate models for different budgets. Through comprehensive experiments across multiple representative backbones, we demonstrate that MM-Matryoshka retains significantly higher quality than direct truncation baselines while substantially reducing storage and computational overhead, offering robust budget elasticity for efficient VDR.
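The vector-width half of the budget can be sketched with a late-interaction (MaxSim) score computed on truncated embeddings: each query vector is matched to its best page vector, and a Matryoshka-style model is trained so that the leading dimensions remain useful on their own. The example below shows only the scoring and truncation mechanics on invented vectors. The encoder-depth axis and the training objective are not shown, and re-normalization after truncation is omitted for brevity.

```python
def truncate(vectors, dim):
    """Keep only the first `dim` coordinates of every vector."""
    return [v[:dim] for v in vectors]


def maxsim_score(query_vecs, page_vecs):
    """Late-interaction score: sum over query vectors of the best dot product."""
    return sum(
        max(sum(a * b for a, b in zip(q, p)) for p in page_vecs)
        for q in query_vecs
    )


# Invented 4-dimensional embeddings: 2 query tokens, 2 page patches.
query = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
page = [[1.0, 0.0, 0.0, 1.0], [0.0, 0.5, 1.0, 0.0]]

print(maxsim_score(query, page))                            # 1.5
print(maxsim_score(truncate(query, 2), truncate(page, 2)))  # 1.5 at half the storage
```

In this toy case the score survives truncation because the matching signal sits in the leading coordinates. Arranging for that to hold on real data is the job of Matryoshka-style training, which is why plain truncation of an ordinary model serves as the baseline in the abstract.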

2606.07653 2026-06-09 cs.CV cs.AI New submission

A Dataset for Dynamic Human Preferences for Vision Language Models

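Hannah Gao, Dylan Hadfield-Menell, Rachel Ma

Affiliations: Massachusetts Institute of Technology

AI summary: Proposes a benchmark for evaluating how well vision-language models understand dynamic human preferences, generates a dataset with variations in image dependence through an automated pipeline, and evaluates existing models.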

Abstract

Given the increased adoption of Vision Language Models (VLMs) in human-interactive settings, it is important that we evaluate how well these models can adapt to the real-time preferences of different users. While an increasing number of vision-language benchmarks have recently been introduced, they focus largely on evaluating static capabilities and generally held preferences learned from extensive training data. This work introduces a new benchmark for evaluating the ability of VLMs to understand dynamic human preferences, i.e., preferences that are passed in-context at inference time. We provide an automated pipeline for generating this benchmark with variations on image dependence, a dynamic multi-modal human-preference dataset, and evaluations of state-of-the-art models on the novel benchmark.

2606.07651 2026-06-09 cs.LG cs.CV New submission

KITE: A Tri-Modal Transformer Integrating Text, Images, and Knowledge Graphs for Fake News Detection

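Kevin Patel, Shashi Bhushan Jha

Affiliations: Department of Computer Science, University of West Florida

AI summary: Proposes KITE, a tri-modal fake news detection framework that jointly models textual, visual, and knowledge representations and integrates their features with cross-modal attention, significantly outperforming unimodal and bimodal baselines on benchmark datasets.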

Abstract

Traditional fake news detection methods are falling behind as multimodal misinformation grows more advanced, seamlessly blending deceptive text, manipulated visuals, and factually incorrect claims. Most prior work focuses on text-image fusion or applies external knowledge only as a post-processing step, limiting its ability to detect deeper semantic inconsistencies. In this paper, we introduce KITE (Knowledge-Integrated Text-Image Encoder), a tri-modal fake news detection framework that jointly models textual, visual, and factual knowledge representations. KITE leverages RoBERTa [23,14] and CLIP [24] for linguistic and visual encoding, while a Graph Attention Network (GAT) processes structured facts retrieved from Wikidata. KITE uses cross-modal attention [9] within a multimodal transformer to integrate text, visual, and knowledge features, helping it capture how the modalities relate to one another. Modality-specific confidence scores are generated alongside the final prediction, offering interpretability by indicating which input type most influenced the decision. Evaluations on benchmark datasets demonstrate that KITE significantly outperforms unimodal and bimodal baselines, particularly in scenarios involving image-text mismatches or contradictions with external knowledge.
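Cross-modal attention can be sketched as ordinary scaled dot-product attention in which the queries come from one modality and the keys and values from another. The example below uses a single head, no learned projections, and invented two-dimensional features, so it shows the mechanism only and is not KITE's architecture.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """For each query (e.g. a text token), mix values (e.g. knowledge facts)
    weighted by softmax(q . k / sqrt(d))."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Invented features: one text query attending over two knowledge entries.
text_query = [[0.0, 0.0]]          # uninformative query gives uniform weights
fact_keys = [[1.0, 0.0], [0.0, 1.0]]
fact_values = [[1.0, 0.0], [3.0, 4.0]]
print(cross_attention(text_query, fact_keys, fact_values))  # [[2.0, 2.0]]
```

The attention weights themselves indicate which facts or image regions each text token relied on, which relates to the interpretability angle the abstract describes through modality-specific confidence scores.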

2606.07649 2026-06-09 cs.CV cs.AI New submission

ViMax: Agentic Video Generation

Lingxuan Huang, Sizhe He, Hengji Zhou, Liqiang Nie, Lianghao Xia, Chao Huang

Affiliations: The University of Hong Kong; South China University of Technology; Harbin Institute of Technology, Shenzhen

AI summary: Proposes ViMax, a framework that generates long-form video through multi-agent collaboration, using a hierarchical narrative engine and a visual consistency mechanism to maintain narrative coherence and visual consistency.

Comments: 20 pages, 13 figures

Abstract

Long-form video generation requires systematic narrative planning and visual consistency that current short-clip methods cannot provide. Existing methods generate isolated sequences without narrative structure and lack mechanisms for maintaining character and environmental consistency across scenes. We present ViMax, an agentic video generation framework that addresses video creation through coordinated multi-agent collaboration where specialized components negotiate narrative decisions, visual continuity, and production quality. Our framework employs a hierarchical narrative engine with retrieval-augmented generation for global story coherence and a dependency-aware visual consistency mechanism that tracks character and environmental states across temporal boundaries, while VLM-guided agents continuously monitor and refine both narrative coherence and visual fidelity. The framework enables coordinated agent collaboration to generate extended narrative content. This maintains both storytelling integrity and visual coherence across multi-scene timelines.

2606.07648 2026-06-09 cs.CV cs.AI New submission

AQIFormer: A Transformer-Based Multi-View Architecture for Cross-City Air Quality Classification

Om Kathalkar, Nitin Nilesh, Sachin Chaudhari, Anoop Namboodiri

Affiliations: IIIT Hyderabad

AI summary: Proposes AQIFormer, a transformer-based ensemble architecture that uses front-rear view fusion, weather-aware attention, and multi-task learning to reach 89.96% accuracy in cross-city air quality classification, a 14.96% improvement over existing methods.

Comments: Accepted at ICVGIP 2025 (Indian Conference on Computer Vision, Graphics and Image Processing), 9 pages, 4 figures

Abstract

Air pollution represents one of the most critical environmental and public health challenges globally, with traditional sensor-based monitoring systems facing significant scalability and economic constraints. Image-based air quality estimation has emerged as a promising alternative, leveraging the visual characteristics of atmospheric pollutants in traffic scenes. However, existing methods suffer from limited cross-city generalization and inadequate exploitation of multi-view perspectives. We present AQIFormer, a novel transformer-based ensemble architecture that addresses these fundamental limitations through innovative dual-view integration, weather-aware attention mechanisms, and comprehensive multi-task learning. Our approach uniquely combines front and rear traffic imagery with meteorological parameters to achieve robust air quality classification across diverse urban environments. Extensive evaluation on a comprehensive dataset of 26,678 synchronized front-rear image pairs demonstrates good performance with 89.96% accuracy, representing a 14.96% improvement over state-of-the-art methods. Most importantly, our model maintains exceptional cross-city generalization capabilities, achieving 81.67% accuracy on an independent dataset collected in Nagpur, India with only 8.29% performance degradation using few-shot adaptation with minimal training samples.
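The reported cross-city gap can be checked with simple arithmetic. The sketch below assumes that the 8.29% degradation is an absolute difference in percentage points between the in-domain and Nagpur accuracies; the abstract does not say so explicitly, and the variable names are illustrative.

```python
# Accuracies quoted in the abstract, in percent.
in_domain_acc = 89.96
nagpur_acc = 81.67

# Assumption: "performance degradation" is the absolute gap in percentage points.
absolute_drop = round(in_domain_acc - nagpur_acc, 2)

# For comparison, the same gap expressed relative to the in-domain accuracy.
relative_drop = round(100 * (in_domain_acc - nagpur_acc) / in_domain_acc, 2)

print(absolute_drop)
print(relative_drop)
```

Under this reading the absolute gap reproduces the quoted 8.29, while the relative drop is somewhat larger, so the two interpretations should not be used interchangeably.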

2606.07647 2026-06-09 cs.CV cs.CL cs.LG 新提交

Steer Where It Matters: Token-Level Visual-Sensitivity Steering for LVLMs Hallucination Mitigation

关键位置引导:基于令牌级视觉敏感度引导的LVLMs幻觉缓解

Ruipeng Zhang, Zhihao Li, C. L. Philip Chen, Tong Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出令牌级视觉敏感度引导(TLVS)方法,通过提取令牌级引导向量并自适应调整引导强度,仅在关键解码步骤抑制幻觉,在多个基准上优于现有方法。

详情
AI中文摘要

大型视觉语言模型(LVLMs)取得了快速进展并部署在各种应用中,但幻觉仍然是一个主要挑战。激活引导因其训练开销小和推理时可控制而具有吸引力。然而,我们发现,在自回归解码过程中,视觉条件对令牌预测的影响是稀疏且局部的,许多现有方法对整个序列的图像与非图像差异进行平均,稀释了这些关键信号,导致引导方向信噪比低。此外,许多现有方法应用固定的引导强度,错误分配干预预算,过度扰动非关键令牌,并可能导致不稳定。为了解决这些限制,我们提出了令牌级视觉敏感度引导(TLVS)用于幻觉缓解。我们的方法首先提取令牌级引导向量并进行细化,然后仅在关键位置应用细粒度的、视觉敏感度自适应的引导。这种轻量级、即插即用的机制只需要最少的校准训练,可以应用于各种视觉语言模型。它在每个解码步骤调节引导强度,选择性地抑制易产生幻觉的片段,同时保留基于证据的内容。我们在多个基准上评估TLVS,包括POPE、AMBER、CHAIR(COCO)、MMHal和HallusionBench,证明其相对于先前引导方法的一致改进。

英文摘要

Large vision language models (LVLMs) have made rapid advancements and are deployed across various applications, yet hallucinations remain a major challenge. Activation steering is appealing due to its minimal training overhead and controllability at inference time. However, we found that during autoregressive decoding, visual conditioning affects token prediction sparsely and locally across decoding steps, and many existing methods that average image-versus-no-image differences over the entire sequence dilute these critical signals, yielding low signal-to-noise ratio steering directions. Additionally, many existing methods apply a fixed steering strength, which misallocates the intervention budget, over-perturbs non-critical tokens, and can cause instability. To address these limitations, we propose Token-Level Visual-Sensitivity Steering (TLVS) for hallucination mitigation. Our approach first extracts token-level steering vectors and refines them, and then applies fine-grained, visual-sensitivity-adaptive steering only where it matters. This lightweight, plug-and-play mechanism requires only minimal training for calibration and can be applied across diverse vision-language models. It modulates the steering strength at each decoding step, selectively suppressing hallucination-prone spans while preserving evidence-grounded content. We evaluate TLVS on several benchmarks, including POPE, AMBER, CHAIR (COCO), MMHal, and HallusionBench, demonstrating consistent improvements over previous steering methods.
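The idea of steering only at visually sensitive decoding steps can be illustrated with a toy rule. This is a minimal sketch under stated assumptions, not the TLVS method itself: measuring sensitivity as a KL divergence between with-image and without-image next-token distributions, the threshold, and the saturating strength schedule are all hypothetical choices made for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def steering_strength(p_with_image, p_without_image, max_strength=4.0, threshold=0.05):
    """Hypothetical rule: steer only where the image changes the prediction."""
    sensitivity = kl_divergence(p_with_image, p_without_image)
    if sensitivity < threshold:
        return 0.0  # non-critical token: leave the hidden state untouched
    return max_strength * (1.0 - math.exp(-sensitivity))

def steer(hidden, vector, strength):
    """Add the scaled steering vector to a hidden state."""
    return [h + strength * v for h, v in zip(hidden, vector)]

# Step 1: the image barely matters; step 2: it changes the prediction a lot.
insensitive = steering_strength([0.70, 0.20, 0.10], [0.69, 0.21, 0.10])
sensitive = steering_strength([0.90, 0.05, 0.05], [0.20, 0.40, 0.40])
hidden = steer([1.0, -1.0], [0.5, 0.5], sensitive)
```

A fixed-strength baseline would correspond to returning `max_strength` at every step, which is the behaviour the abstract argues over-perturbs non-critical tokens.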

2606.07646 2026-06-09 cs.CV cs.AI 新提交

DOME: Learning Transferable Domain Variables from Sparse Supervision for Test-Time Adaptation

DOME:从稀疏监督中学习可迁移域变量用于测试时自适应

Xiaoran Xu, Yifan Xu, Yupeng Wu, Xiaoshan Yang, Changsheng Xu

发表机构 * MAIS, IACAS(中国科学院自动化研究所多模态人工智能系统实验室)

AI总结 提出DOME域编码器,通过视觉-语言预训练提取密集连续表示,参数化域为分布变量并引入动量更新的稀疏域库,实现零样本显式域建模,在多个基准上超越复杂TTA方法。

详情
AI中文摘要

测试时自适应(TTA)旨在仅使用无标签流数据将模型对齐到变化的测试域。现有方法大多隐式推断单个全局域分布,忽略了真实世界域迁移的多维性和样本特异性,导致自适应脆弱。我们提出DOME,一种有效的域编码器,以零样本方式显式建模每个样本的域。DOME利用视觉-语言预训练提取密集、连续的表示,将域参数化为分布变量,并引入动量更新的稀疏域库用于解耦监督。通过将这些显式域线索注入下游模型,即使是最基本的熵最小化TTA策略也在ImageNet-C、ImageNet-R和ImageNet-Sketch上达到了最先进的性能,超越了复杂的TTA方法。我们的结果表明,鲁棒的自适应并非源于复杂的自适应算法,而是源于显式的、结构化的域表示。

英文摘要

Test-time adaptation (TTA) aims to align a model to shifting test domains using only unlabeled streaming data. Most existing methods implicitly infer a single global domain distribution, ignoring the multidimensional and sample-specific nature of real-world domain shifts, leading to fragile adaptation. We propose DOME, an effective domain encoder that explicitly models each sample's domain in a zero-shot manner. DOME leverages vision-language pretraining to extract dense, continuous representations, parameterizes domains as distributional variables, and introduces a momentum-updated sparse domain bank for disentangled supervision. By injecting these explicit domain cues into downstream models, even a basic entropy-minimization TTA strategy achieves state-of-the-art performance across ImageNet-C, ImageNet-R, and ImageNet-Sketch, outperforming complex TTA approaches. Our results demonstrate that robust adaptation stems not from intricate adaptation algorithms, but from explicit, structured domain representation.
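The "basic entropy-minimization TTA strategy" mentioned above can be sketched on a single prediction. The toy below updates the logits directly with the analytic gradient of the entropy; this is an assumption made to keep the example self-contained, since practical TTA methods update model parameters rather than logits, and the learning rate and step count are arbitrary.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_grad(logits):
    """Analytic gradient of the prediction entropy with respect to the logits."""
    probs = softmax(logits)
    h = entropy(probs)
    return [-p * (math.log(p) + h) for p in probs]

logits = [1.0, 0.5, 0.2]
before = entropy(softmax(logits))
for _ in range(20):
    grad = entropy_grad(logits)
    # Gradient descent on the entropy sharpens the prediction.
    logits = [z - 0.5 * g for z, g in zip(logits, grad)]
after = entropy(softmax(logits))
```

The gradient uses the identity dH/dz_i = -p_i (log p_i + H), which follows from differentiating H = -Σ p_j log p_j through the softmax.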

2606.07645 2026-06-09 cs.CV cs.AI 新提交

FineGen: A VLM-based Multi-Agent Framework for Fine-Grained Image-Text Dataset Construction

FineGen:基于VLM的多智能体框架用于细粒度图像-文本数据集构建

Chang Kong, Yuebing Li, Peng Mo, Haigang Zhang, Qiuming Luo

发表机构 * Shenzhen Polytechnic University(深圳职业技术大学) Institute of Applied Artificial Intelligence of the Guangdong-Hong Kong Macao Greater Bay Area(粤港澳大湾区应用人工智能研究所) Shenzhen University(深圳大学)

AI总结 提出FineGen框架,通过生成-验证-校正流水线和闭环反馈机制自动构建含硬负样本的细粒度数据集,在ImageNet上构建FineGen-100K,硬样本准确率提升14.4%。

Comments 15 pages, 2 figures, conference

详情
AI中文摘要

当前视觉-语言数据集中硬负样本的稀缺严重阻碍了细粒度感知。为此,我们提出FineGen,一种基于VLM的多智能体框架,用于自动化数据集构建。通过采用协作的生成-验证-校正流水线及闭环反馈机制,FineGen确保合成的硬负样本在语义上有效且与视觉内容严格矛盾。将其应用于ImageNet,我们构建了FineGen-100K,一个包含超过147,000个属性特定硬负样本的分层数据集,正负样本比严格为1:10。广泛评估证实了96.7%的属性有效性。关键的是,在FG-OVD基准上的下游验证表明,在FineGen-100K上微调后,硬样本准确率大幅提升14.4%,显著优于现有最先进方法。

英文摘要

The scarcity of hard negative samples in current vision-language datasets significantly hinders fine-grained perception. To address this, we propose FineGen, a VLM-based Multi-Agent framework for automated dataset construction. By employing a collaborative Generation-Verification-Correction pipeline with a closed-loop feedback mechanism, FineGen ensures synthesized hard negatives are semantically valid yet strictly contradictory to visual content. Applying this to ImageNet, we construct FineGen-100K, a hierarchical dataset containing over 147,000 attribute-specific hard negatives with a rigorous 1:10 positive-to-negative ratio. Extensive evaluations confirm a 96.7% attribute validity rate. Crucially, downstream validation on the FG-OVD benchmark shows that fine-tuning on FineGen-100K yields a substantial +14.4% accuracy improvement on hard samples, significantly outperforming state-of-the-art methods.
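The Generation-Verification-Correction loop with closed-loop feedback can be sketched with plain functions standing in for the agents. Everything here is hypothetical: in the described system the generator and verifier would be VLM calls, whereas this sketch uses string replacement and a set lookup purely to show the control flow.

```python
def generate(caption, attribute, candidates, rejected):
    """Generator stub: swap one attribute for a candidate not yet rejected."""
    for candidate in candidates:
        if candidate not in rejected:
            return caption.replace(attribute, candidate), candidate
    return None, None

def verify(candidate, image_attributes):
    """Verifier stub: a hard negative must contradict the image content."""
    return candidate not in image_attributes

def build_hard_negative(caption, attribute, candidates, image_attributes, max_rounds=5):
    rejected = set()
    for _ in range(max_rounds):
        negative, candidate = generate(caption, attribute, candidates, rejected)
        if negative is None:
            return None
        if verify(candidate, image_attributes):
            return negative
        rejected.add(candidate)  # feedback that corrects the next generation round
    return None

negative = build_hard_negative(
    caption="a red bird with a short beak",
    attribute="red",
    candidates=["crimson", "blue"],
    image_attributes={"red", "crimson", "short beak"},
)
print(negative)
```

The first candidate is rejected because it does not contradict the image, and the rejection is fed back so that the second round produces a valid hard negative.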

2606.07643 2026-06-09 cs.CV cs.AI cs.SD eess.AS 新提交

AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

AVI-Bench:迈向全模态大语言模型的类人视听智能

Yaoting Wang, Ziyi Zhang, Wenming Tu, Shaoxuan Xu, Wenjie Du, Cheng Liang, Weijun Wang, Yuanchao Li, Guangyao Li, Hao Fei, Yuanchun Li, Henghui Ding, Yunxin Liu

发表机构 * Nanyang Technological University(南洋理工大学)

AI总结 提出AVI-Bench基准,通过感知、理解、推理三阶段跨模态任务评估全模态大语言模型的视听智能,并引入AVI-Bench-PriSe测试原始视听感知,揭示当前模型局限,构建四级AVI分类体系。

Comments 31 pages, 8 figures, ICML 2026

详情
AI中文摘要

近期全模态大语言模型(Omni-MLLMs)的进展实现了视觉、音频和语言的强集成。然而,由于缺乏系统全面的基准,其视听智能(AVI)仍未被充分评估。我们提出AVI-Bench,一个受认知启发的基准,通过需要联合视听解释的跨模态任务,在感知、理解和推理三个阶段评估Omni-MLLMs。该设计能够细粒度诊断模型能力和失败模式。为进一步评估超出熟悉领域的鲁棒性,我们提出AVI-Bench-PriSe,一个扩展版本,使用不熟悉的、低语义刺激探测模型的原始视听感知,测试超出常见训练分布的泛化能力。对开源和闭源模型的大量实验揭示了当前Omni-MLLMs的显著局限性。基于这些发现,我们提出了一个四级AVI分类体系。总体而言,AVI-Bench提供了一个原则性的评估框架,以指导更鲁棒和可泛化AVI的发展。项目网站:https://fudancvl.github.io/AVI-Bench/

英文摘要

Recent advances in Omni-Multimodal Large Language Models (Omni-MLLMs) have enabled strong integration of vision, audio, and language. However, their audio-visual intelligence (AVI) remains insufficiently evaluated due to the lack of systematic and comprehensive benchmarks. We introduce AVI-Bench, a cognitively inspired benchmark that evaluates Omni-MLLMs across three stages, perception, understanding, and reasoning, through cross-modal tasks requiring joint audio-visual interpretation. This design enables fine-grained diagnosis of model capabilities and failure modes. To further assess robustness beyond familiar domains, we propose AVI-Bench-PriSe, an extension that probes models' primitive audio-visual sensation using unfamiliar, low-semantic stimuli, testing generalization beyond common training distributions. Extensive experiments on both open-source and closed-source models reveal substantial limitations in current Omni-MLLMs. Based on these findings, we present a four-level AVI taxonomy. Overall, AVI-Bench provides a principled evaluation framework to guide the development of more robust and generalizable AVI. Project website: https://fudancvl.github.io/AVI-Bench/

2606.07642 2026-06-09 cs.CV cs.CY 新提交

Do VLMs See What Sensors Feel? A Scalable Expert-Guided Design for Wheelchair Accessibility Assessment from Street View

视觉语言模型能否感知传感器所感?一种可扩展的专家引导设计用于从街景评估轮椅可达性

Dongdong Wang, Alina Hagen, Isabelle Gatmaitan, Hao Zhou, Yiwen Dong, Shabboo Valipoor, Vivian W. H. Wong, Lingyao Li

发表机构 * University of Florida(佛罗里达大学) University of South Florida(南佛罗里达大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文提出专家引导的检索增强框架,利用视觉语言模型从谷歌街景图像识别轮椅可达性障碍,通过GPS轮椅停留行为验证,表明VLM评分与移动摩擦部分一致,但细粒度障碍识别有限。

详情
AI中文摘要

评估建筑环境交互(如轮椅可达性)是困难的,因为现实世界的移动性受到分布式、上下文依赖和临时性障碍的影响,这些障碍难以大规模捕捉。为了支持可扩展的评估,本文研究了视觉语言模型(VLM)是否能够从谷歌街景(GSV)图像中识别可达性障碍。我们提出了一种专家引导的检索增强框架,结合GSV图像、ADA指导原则和专家制定的评分标准来评估可达性维度。我们在佛罗里达大学收集了一个校园规模的数据集,将407个独特的GSV位置与GPS衍生的轮椅停留行为作为移动摩擦信号相关联。结果表明,VLM评分与停留时间既呈负相关又在分布上相似,表明与移动摩擦的行为代理部分但一致的对齐。视觉线索分析显示,某些环境对象(如路缘坡道和人行横道)与较高的VLM可达性评分相关,而对于细微的表面条件、临时障碍物和视角依赖的障碍,对齐仍然有限。总体而言,我们的发现显示了专家引导的VLM在可扩展的可达性评估中的潜力,与真实世界轮椅导航的传感器衍生指标相一致。

英文摘要

Assessing built-environment interaction, such as wheelchair accessibility, is difficult because real-world mobility is shaped by distributed, context-dependent, and temporary barriers that are hard to capture at scale. To support scalable assessment, this paper examines whether vision-language models (VLMs) can identify accessibility barriers from Google Street View (GSV) imagery. We propose an expert-guided retrieval-augmented framework that combines GSV images, ADA-informed guidance, and expert-derived rubrics to evaluate accessibility dimensions. We collect a campus-scale dataset at the University of Florida, linking 407 unique GSV locations with GPS-derived wheelchair dwell behavior as a mobility-friction signal. Results show that VLM ratings are both negatively correlated and distributionally similar with dwell time, indicating partial but consistent alignment with a behavioral proxy for mobility friction. Visual cue analysis shows that certain environmental objects, such as curb ramps and crosswalks, are associated with higher VLM accessibility scores, while alignment remains limited for subtle surface conditions, transient obstructions, and viewpoint-dependent barriers. Overall, our findings show the potential of expert-guided VLMs for scalable accessibility assessment aligning with sensor-derived indicators of real-world wheelchair navigation.

2606.07641 2026-06-09 cs.CV 新提交

Readable Yet Unpredictable: Rotated-Outcome Prediction in Vision-Language Models

可读但不可预测:视觉语言模型中的旋转结果预测

Lexin Wang, Shenghua Liu, Yiwei Wang, Jiafeng Guo, Xueqi Cheng

发表机构 * Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所) University of California, Merced(加州大学默塞德分校)

AI总结 研究视觉语言模型能否仅从原图预测180°旋转后的内容,引入RotOutBench基准,发现模型能识别但无法预测旋转结果。

详情
AI中文摘要

视觉语言模型能否仅从原始图像预测180°旋转后会看到什么?我们通过旋转结果预测来研究这种能力:给定原始图像,模型必须回答在180°平面旋转后会看到或读到什么,而不直接观察旋转后的目标。为了隔离这一差距,我们引入了RotOutBench,一个涵盖开放视觉案例和受控文本图像旋转的配对诊断基准。一个明显的模式出现了:许多VLM在直接给出原始或旋转图像时能够识别相关内容,但仅从原始图像推断旋转结果时却失败。在受控文本图像旋转中,即使对于具有高直接读取准确性的模型,预测旋转的准确性也降至接近零。模型级别的案例研究进一步表明,预测状态可以接近旋转图像读取状态,而最终读出仍向原始字符串偏移。当前的VLM在展示变换后的视觉状态时能够识别,但往往无法从原始视角预测该状态。

英文摘要

Can vision-language models predict what a 180° rotation would reveal from the original image alone? We study this ability through Rotated-Outcome Prediction: given an original image, a model must answer what would be seen or read after a 180° in-plane rotation, without directly observing the rotated target. To isolate this gap, we introduce RotOutBench, a paired diagnostic benchmark spanning open visual cases and controlled text-image rotations. A sharp pattern emerges: many VLMs can recognize the relevant content when directly given either the original or rotated image, yet fail to infer the rotated result from the original image alone. On controlled text-image rotations, predicted-rotation accuracy collapses to near zero even for models with high direct-reading accuracy. A model-level case study further shows that the prediction state can approach a rotated-image reading state, while the final readout still shifts toward the original string. Current VLMs can recognize a transformed visual state when it is shown, but often fail to predict that state from the original view.
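The geometric transformation the models are asked to anticipate is simple to state: under a 180° in-plane rotation, the pixel at row r, column c of an H×W image moves to (H-1-r, W-1-c). A minimal sketch on a character grid follows; the grid is an illustrative stand-in for an image, and it cannot show that each glyph is itself turned upside down.

```python
def rotate_180(grid):
    """Rotate a 2-D grid by 180 degrees: (r, c) moves to (H-1-r, W-1-c)."""
    return [row[::-1] for row in grid[::-1]]

image = [
    "ab",
    "cd",
    "ef",
]
rotated = rotate_180(image)
print(rotated)
```

Applying the rotation twice returns the original grid, which is one way a benchmark harness could sanity-check its paired images.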

2606.07640 2026-06-09 cs.CV cs.AI cs.LG 新提交

No Free Lunch for Synthetic Images under Data Scarcity Conditions

数据稀缺条件下,合成图像没有免费午餐

Borja Arroyo Galende, Alejandro Almodóvar, Patricia A. Apellániz, Juan Parras, Silvia Uribe, Santiago Zazo

发表机构 * Universidad Politécnica de Madrid(马德里理工大学) Universidad de Alcalá(阿尔卡拉大学)

AI总结 研究数据稀缺和隐私敏感条件下合成数据的保真度、隐私和效用权衡,提出联合评估框架,比较VAE、GAN和DDPM在三个图像数据集上的表现,发现GAN和DDPM在差分隐私下更鲁棒。

详情
AI中文摘要

本研究探讨了在数据稀缺和隐私敏感条件下,合成数据生成中保真度、隐私和效用之间的权衡。我们提出了一个联合评估这三个维度的框架,并将其应用于三种广泛使用的生成模型:VAE、GAN和DDPM。评估涵盖三个图像数据集:MNIST、OCTMNIST和OrganAMNIST,包括通用和医学成像领域。在训练过程中引入差分隐私机制时,三种模型的行为出现了显著差异。GAN和DDPM表现出更强的鲁棒性,在一系列噪声水平下保持较高的保真度和下游效用,而VAE随着隐私约束的增加而更快地退化。本研究强调了深度生成模型多维评估的重要性,并指出应用隐私技术时它们的行为存在显著差异。

英文摘要

This study investigates the trade-offs between fidelity, privacy, and utility in synthetic data generation under conditions of data scarcity and privacy sensitivity. We propose an evaluation framework that jointly assesses these three dimensions and apply it to three widely used generative models, VAE, GAN, and DDPM. The evaluation spans three image datasets, MNIST, OCTMNIST, and OrganAMNIST, encompassing both general-purpose and medical imaging domains. Notable differences arise between the three models in their behaviour when differential privacy mechanisms are introduced during training. GAN and DDPM demonstrate greater robustness, maintaining higher fidelity and downstream utility across a range of noise levels, while VAE degrades more rapidly as privacy constraints increase. This study highlights the importance of a multidimensional evaluation of deep generative models, also noting that their behaviour significantly differs when privacy techniques are applied.
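The abstract does not name the differential privacy mechanism used during training. A common choice is DP-SGD style gradient privatization, which clips each per-example gradient and adds Gaussian noise scaled by a noise multiplier. The sketch below assumes that mechanism; the function names and parameters are illustrative.

```python
import math
import random

def clip(gradient, max_norm):
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in gradient]

def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip per-example gradients, sum them, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    return [n / len(per_example_grads) for n in noisy]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]
no_noise = private_mean_gradient(grads, max_norm=1.0, noise_multiplier=0.0, rng=rng)
with_noise = private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.0, rng=rng)
```

Raising `noise_multiplier` corresponds to the stronger privacy constraints under which the abstract reports that VAE degrades faster than GAN and DDPM.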

2606.07639 2026-06-09 cs.CV cs.AI 新提交

MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention

MOSS-Video-Preview: 通过交叉注意力实现实时视频理解

Pengyu Wang, Chenkun Tan, Shaojun Zhou, Wei Huang, Qirui Zhou, Zhan Huang, Zhen Ye, Jijun Cheng, Xiaomeng Qian, Yanxin Chen, Xingyang He, Huazheng Zeng, Chenghao Wang, Pengfei Wang, Hongkai Wang, Shanqing Gao, Yixian Tian, Chenghao Liu, Xinghao Wang, Botian Jiang, Xipeng Qiu

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出双通道交叉注意力架构MOSS-Video-Preview,通过非阻塞感知与生成实现实时视频理解,在单H200上实现5倍首词加速和2.7倍解码吞吐提升。

详情
AI中文摘要

视频理解正从离线范式——将完整录制的视频作为输入并在结束后产生单一答案——转向实时交互,其中模型在回复的同时感知新帧,随着新证据的出现修正答案,并在无话可说时保持沉默。我们提出MOSS-Video-Preview来验证这一范式。我们的核心主张是感知不能被生成阻塞;其自然实现是双通道架构。我们认为,交叉注意力主干比流行的仅解码器设计更适合实时视觉-语言融合:视觉特征通过侧通道进入,而不是加入自回归序列,因此感知和生成在独立的、非阻塞的路径上运行——降低了视觉处理的频率,并为独立压缩提供了清晰的通道级接口。我们辅以数据合成流水线,将密集字幕转换为实时理解问答,其答案被修正以匹配模型迄今为止感知到的内容,并在此数据上专门训练离线模型以引发实时行为。我们的模型总体上落后于强大的Qwen2.5-VL-7B基线——这一差距我们主要归因于数据和规模而非架构——但在离线视频和多模态理解上具有竞争力,在实时应用核心的空间和细粒度时间推理上保持稳健,并获得了离线模型缺乏的行为:持续感知、答案修正和及时沉默。在单个H200上,每视频256帧,它实现了约5倍的首词时间加速和2.7倍的解码吞吐提升,离线能力几乎没有下降。我们对范式、架构和数据的研究勾勒出通往实时视频理解的可行路径。

英文摘要

Video understanding is shifting from the offline paradigm -- taking a fully recorded video as input and producing a single answer after it ends -- toward real-time interaction, in which the model perceives new frames while still replying, revises its answer as new evidence appears, and remains silent when there is nothing to say. We present MOSS-Video-Preview to validate this paradigm. Our central claim is that perception must not be blocked by generation; its natural realization is a two-channel architecture. We argue that a cross-attention backbone is better suited to real-time vision-language fusion than the prevailing decoder-only design: visual features enter through a side channel rather than joining the autoregressive sequence, so perception and generation run on separate, non-blocking pathways -- reducing the frequency of visual processing and exposing a clean channel-wise interface for independent compression. We complement this with a data synthesis pipeline that converts dense captions into real-time understanding QA whose answers are revised to match what the model has perceived so far, and we specialize an offline model on these data to elicit real-time behavior. Our model trails the strong Qwen2.5-VL-7B baseline overall -- a gap we attribute primarily to data and scale rather than the architecture -- yet attains competitive offline video and multimodal understanding, remains robust on the spatial and fine-grained temporal reasoning central to real-time use, and acquires behaviors that offline models lack: continuous perception, answer revision, and timely silence. On a single H200 with 256 frames per video, it achieves about a 5x speedup in time to first token and 2.7x higher decoding throughput, with negligible degradation in offline ability. Our study of paradigm, architecture, and data outlines a viable path toward real-time video understanding.
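The side-channel idea can be illustrated with a single-head cross-attention step in which a text-side query reads from a separate visual memory. This is a minimal sketch, not the MOSS-Video-Preview architecture: the dimensions, the single head, and the absence of learned projections are simplifying assumptions.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, visual_keys, visual_values):
    """Single-head cross-attention: one text query reads from visual features."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in visual_keys]
    weights = softmax(scores)
    dim = len(visual_values[0])
    return [sum(w * v[i] for w, v in zip(weights, visual_values)) for i in range(dim)]

# The visual memory can grow as frames arrive without lengthening the text sequence.
visual_keys = [[1.0, 0.0], [0.0, 1.0]]
visual_values = [[10.0, 0.0], [0.0, 10.0]]
context = cross_attend([5.0, 0.0], visual_keys, visual_values)
```

Because the visual features never join the autoregressive sequence, appending new frames only extends `visual_keys` and `visual_values`, which is the property the abstract relies on for non-blocking perception.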

2606.07638 2026-06-09 cs.CV cs.AI 新提交

Anchor-Conditioned Compositional Control for Landscape Image Generation

基于锚点条件的景观图像生成组合控制

Gadha Lekshmi P, Govind Arun, Rohith Syam, Ahmed Elgammal

发表机构 * Rutgers University–New Brunswick(罗格斯大学新布朗斯维克分校) University of Maryland–College Park(马里兰大学帕克分校) University of Technology Sydney(悉尼科技大学)

AI总结 提出锚点条件微调框架,通过解耦交叉注意力机制注入四维组合锚点向量,实现景观图像生成中的组合控制,在水平线检测和三分法对齐上取得最优性能。

Comments Accepted to the International Conference on Computational Creativity, ICCC 2026

详情
AI中文摘要

图像生成模型虽然被广泛用作创意工具,但对摄影师和视觉艺术家常规执行的组合控制类型支持有限。本文提出了一个用于景观图像生成的锚点条件微调框架的早期结果,其中从训练图像中提取四维组合锚点向量,并通过带有傅里叶编码和三路分类器自由引导丢弃的解耦交叉注意力机制注入扩散模型。与基线和三个消融变体的定量评估表明,所提出的架构实现了最高的水平线检测率0.850和最高的三分法对齐度0.817。类别特定的消融进一步表明,在组合同质场景子集上训练相比混合训练可将水平线偏差降低多达40%。这确立了组合控制精度是类别依赖的。

英文摘要

Image generative models, though widely used as creative tools, offer limited support for the kind of compositional control that photographers and visual artists routinely exercise. This paper presents early results on an anchor-conditioned finetuning framework for landscape image generation, in which a four-dimensional compositional anchor vector is extracted from training images and injected into a diffusion model via a decoupled cross-attention mechanism with Fourier encoding and three-way classifier-free guidance dropout. Quantitative evaluation against a baseline and three ablation variants shows that the proposed architecture achieves the highest horizon detection rate of 0.850 and the highest rule-of-thirds alignment of 0.817. A category-specific ablation further demonstrates that training on compositionally homogeneous scene subsets reduces horizon deviation by up to 40 percent compared to mixed training. This establishes that compositional control precision is category-dependent.
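Fourier encoding of a low-dimensional conditioning vector is typically done by passing each coordinate through sines and cosines at several frequencies. The sketch below assumes a 2^k·π frequency schedule and four frequencies per coordinate; the abstract specifies neither, and the meaning assigned to the four anchor coordinates is hypothetical.

```python
import math

def fourier_encode(anchor, num_frequencies=4):
    """Encode each coordinate with sin/cos at frequencies 2^k * pi (assumed schedule)."""
    features = []
    for value in anchor:
        for k in range(num_frequencies):
            angle = (2 ** k) * math.pi * value
            features.extend([math.sin(angle), math.cos(angle)])
    return features

# Hypothetical anchor: horizon height, subject x, subject y, subject scale, all in [0, 1].
anchor = [0.33, 0.66, 0.5, 0.2]
encoded = fourier_encode(anchor)
print(len(encoded))
```

The encoding turns a 4-value anchor into a longer feature vector whose high-frequency components let small changes in, for example, horizon height produce clearly different conditioning signals.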

2606.07636 2026-06-09 cs.CV cs.CL cs.MA 新提交

Crayotter: Traceable Multi-Agent Workflows for Long-Form Video Editing

Crayotter: 用于长视频编辑的可追踪多智能体工作流

Lecheng Yan, Yichong Zhang, Ben Pan, Xiaoyu Zheng, Jiawei Qian, Anqi Wu, Wenxi Li, Chenyang Lyu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出Crayotter,一个开源多模态多智能体系统,通过三阶段工作流(材料准备、基于工件的编辑研究、工具驱动的执行)实现长视频编辑的可追踪性和选择性修订,在人类评估中优于基线方法。

Comments 11 pages, 5 figures

详情
AI中文摘要

从异构素材编辑长视频不仅需要选择片段:智能体必须在材料准备、时间线构建、后期制作和修订过程中保持叙事意图,同时留下足够的证据以诊断失败。我们提出Crayotter,一个用于提示驱动视频编辑的开源多模态多智能体系统。Crayotter将制作组织为三个阶段:覆盖感知的材料准备、基于工件的编辑研究以及工具驱动的时间线执行。每个阶段外化可检查的工件,包括覆盖报告、多模态分析、编辑蓝图、工具调用和中间渲染。这些工件使编辑运行可追踪,并允许诊断和选择性修订失败的片段,而无需完全重启。我们在23个编辑主题上评估Crayotter,与CapCut-Mate和CutClaw进行比较。在人类评估下,Crayotter的平均得分为3.40/5,而两个基线分别为2.44和1.70,在主题对齐、叙事连贯性和编辑流畅性方面持续提升。我们还描述了一个可重放的轨迹模式和可验证的奖励设计,为这些工作流未来的策略优化做准备。代码、轨迹和示例可在 https://github.com/idwts/Crayotter 公开获取。

英文摘要

Editing a long-form video from heterogeneous footage requires more than selecting clips: an agent must preserve narrative intent across material preparation, timeline construction, post-production, and revision while leaving enough evidence to diagnose failures. We present Crayotter, an open-source multimodal multi-agent system for prompt-driven video editing. Crayotter organizes production into three phases: coverage-aware material preparation, artifact-based editing research, and tool-grounded timeline execution. Each phase externalizes inspectable artifacts, including coverage reports, multimodal analyses, editing blueprints, tool calls, and intermediate renders. These artifacts make an editing run traceable and allow failed segments to be diagnosed and selectively revised instead of requiring a full restart. We evaluate Crayotter on 23 editing themes against CapCut-Mate and CutClaw. Under human evaluation, Crayotter achieves an average score of 3.40/5, compared with 2.44 and 1.70 for the two baselines, with consistent gains in theme alignment, narrative coherence, and editing smoothness. We additionally describe a replayable trajectory schema and verifiable reward design that prepare these workflows for future policy optimization. Code, traces, and examples are publicly available at https://github.com/idwts/Crayotter.

2606.07635 2026-06-09 cs.CV cs.AI 新提交

NeuroAlign: Hierarchical Multimodal Fusion of Dynamic and Structural Neuroimaging for MCI Analysis

NeuroAlign: 用于MCI分析的动态与结构性神经影像的分层多模态融合

Xiongri Shen, Zhenxi Song, Jiaqi wang, Yi Zhong, Leilei Zhao, Chenqi Xu, Linling Li, Yichen Wei, Lingyan Liang, Demao Deng, Luping Song, Ping Luan, Ahmed M. Anter, Shuqiang Wang, Baiying Lei, Zhiguo Zhang

发表机构 * Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)计算机科学与技术学院) School of Intelligence Science and Engineering, College of Artificial Intelligence, Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)人工智能学院智能科学与工程学院) School of Artificial Intelligence, Beijing University of Posts and Telecommunications(北京邮电大学人工智能学院) Guangdong Key Laboratory of Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University(深圳大学医学部生物医学工程学院广东省生物医学测量与超声成像重点实验室) Department of Radiology, The People’s Hospital of Guangxi Zhuang Autonomous Region, Guangxi Academy of Medical Sciences(广西壮族自治区人民医院放射科,广西医学科学院) Shenzhen Sixth People’s Hospital (Nanshan Hospital), Huazhong University of Science and Technology Union Shenzhen Hospital(华中科技大学协和深圳医院(深圳市第六人民医院)) School of Basic Medical Sciences, Shenzhen University(深圳大学基础医学院) Egypt-Japan University of Science and Technology (E-JUST)(埃及日本科技大学) School of Biomedical Engineering, National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, Shenzhen University Medical School(深圳大学医学部生物医学工程学院,国家地方联合医学超声关键技术工程实验室,广东省生物医学测量与超声成像重点实验室)

AI总结 提出NeuroAlign框架,通过双模态分层对齐和双域分层交互融合fMRI与DTI特征,实现MCI/SCD检测,并设计无梯度归因方法SAM进行特征分析。

详情
AI中文摘要

功能磁共振成像(fMRI)和弥散张量成像(DTI)的多模态神经影像融合为认知障碍分析提供了互补信息,但仍面临异构特征空间和表示不对齐的挑战。我们提出NeuroAlign,一个用于结构化多模态融合的分层框架。它引入了(1)双模态分层对齐(DMHA),该模块建模多尺度动态连接并对齐动态-静态和功能-结构嵌入;以及(2)双域分层交互(DDHI),该模块实现连接级和区域级特征之间的细粒度调制和全局交互。为了支持特征级检查,我们设计了协同激活映射(SAM),一种针对DFC、SFC、ALFF和FA的无梯度、面向标记的归因方法。在GUTCM、ADNI和OASIS数据集上通过五折验证评估,NeuroAlign在MCI/SCD检测中取得了竞争性结果,并展示了初步的跨数据集可迁移性。归因分析揭示了模态特异性和部分一致的脑区模式,为多模态表示分析提供了模型驱动的证据。

英文摘要

Multimodal neuroimaging fusion of functional MRI (fMRI) and diffusion tensor imaging (DTI) provides complementary information for cognitive impairment analysis, but remains challenged by heterogeneous feature spaces and misaligned representations. We propose NeuroAlign, a hierarchical framework for structured multimodal fusion. It introduces (1) Dual-Modal Hierarchical Alignment (DMHA), which models multi-scale dynamic connectivity and aligns dynamic-static and functional-structural embeddings; and (2) Dual-Domain Hierarchical Interaction (DDHI), which enables fine-grained modulation and global interaction between connectivity- and region-level features. To support feature-level inspection, we design Synergistic Activation Mapping (SAM), a gradient-free, marker-oriented attribution method for DFC, SFC, ALFF, and FA. Evaluated on GUTCM, ADNI, and OASIS under five-fold validation, NeuroAlign achieves competitive MCI/SCD detection and preliminary cross-dataset transferability. Attribution analyses reveal modality-specific and partially consistent brain patterns, providing model-derived evidence for multimodal representation analysis.

2606.07633 2026-06-09 cs.CV cs.AI 新提交

AMN: An Adaptive Multi-Scale Fusion Network with Boundary and Uncertainty Modeling for Nuclei Segmentation

AMN:一种用于细胞核分割的具有边界和不确定性建模的自适应多尺度融合网络

Spoorthi M, Suja Palaniswamy

发表机构 * Department of Computer Science & Engineering, Amrita School of Computing, Bengaluru, Amrita Vishwa Vidyapeetham, India

AI总结 提出AMN双编码器分割框架,融合Swin Transformer和ResNet-50特征金字塔,通过门控机制动态加权,结合多目标损失,在CoNIC基准上平均Dice 0.82,F1 0.68,优于八种基线模型。

详情
AI中文摘要

组织病理学图像中细胞核亚型的准确分类对于下游任务(包括肿瘤分级、免疫浸润量化和预后预测)至关重要。现有方法孤立地依赖卷积或基于Transformer的编码器,限制了它们同时捕捉细粒度局部纹理和长程空间上下文的能力。我们提出了AMN(自适应多尺度细胞核网络),一种双编码器分割框架,联合利用Swin Transformer和ResNet-50特征金字塔,通过学习的逐通道门控机制动态权衡每个编码器在每个尺度的贡献。AMN使用多目标损失进行训练,该损失结合了类别加权焦点损失、具有正像素强调的边界感知损失以及一种新颖的不确定性调制分类项,用于抑制过度自信的错误预测。在涵盖七个细胞核类别的CoNIC基准上评估,AMN实现了平均Dice 0.82和平均F1 0.68,在诊断上具有挑战性的淋巴细胞类别上F1为0.67。AMN优于八种基线模型,包括纯CNN、纯Transformer和最近的混合架构:U-Net、ResU-Net、DeepLabV3+、SegNet、ViT-Small、HmsU-Net、ConvFormer-UNet和BEFUnet。在MoNuSeg上的跨数据集评估证明了无需重新训练的强泛化能力,验证了所学表示的领域鲁棒性。

英文摘要

Accurate classification of nuclei subtypes in histopathology images is critical for downstream tasks including tumor grading, immune infiltrate quantification, and prognosis prediction. Existing approaches rely on either convolutional or transformer-based encoders in isolation, limiting their ability to simultaneously capture fine-grained local texture and long-range spatial context. We present AMN (Adaptive Multi-Scale Nuclei Network), a dual-encoder segmentation framework that jointly leverages a Swin Transformer and a ResNet-50 feature pyramid, fused via a learned per-channel gating mechanism that dynamically weighs each encoder's contribution at every scale. AMN is trained with a multi-objective loss combining class-weighted focal loss, boundary-aware loss with positive-pixel emphasis, and a novel uncertainty-modulated classification term that suppresses overconfident erroneous predictions. Evaluated on the CoNIC benchmark across seven nuclei classes, AMN achieves a mean Dice of 0.82 and mean F1 of 0.68, with an F1 of 0.67 on the diagnostically challenging lymphocyte class. AMN outperforms eight baseline models spanning pure-CNN, pure-transformer, and recent hybrid architectures: U-Net, ResU-Net, DeepLabV3+, SegNet, ViT-Small, HmsU-Net, ConvFormer-UNet, and BEFUnet. Cross-dataset evaluation on MoNuSeg demonstrates strong generalization without retraining, validating the domain robustness of the learned representations.
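A per-channel gate that blends two encoders can be written as a convex combination controlled by a sigmoid. The sketch below is an illustration under stated assumptions rather than AMN's fusion module: it treats the gate logits as fixed learned parameters, one per channel, whereas the actual mechanism may compute them from the features.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(cnn_features, transformer_features, gate_logits):
    """Blend two per-channel feature values with a gate in (0, 1)."""
    fused = []
    for c, t, logit in zip(cnn_features, transformer_features, gate_logits):
        g = sigmoid(logit)
        fused.append(g * c + (1.0 - g) * t)
    return fused

# Channel 0 trusts the transformer, channel 1 mixes evenly, channel 2 trusts the CNN.
fused = gated_fusion([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [-20.0, 0.0, 20.0])
```

Because the gate is per channel, the network can favour local texture from the convolutional branch on some channels and long-range context from the transformer branch on others.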

2606.07632 2026-06-09 cs.LG 新提交

Evaluation of ML Resource Utilization Requires Model Life Cycle Assessment

评估机器学习资源利用需要模型生命周期评估

Jared Fernandez, Clara Na, Yonatan Bisk, Constantine Samaras, Emma Strubell

AI总结 本文提出应用生命周期评估方法全面核算AI系统从硬件制造到训练推理的全链条资源消耗与环境影响,以弥补传统单一训练或推理成本评估的不足。

Comments ICML 2026: Position Paper Track

详情
AI中文摘要

正确核算人工智能(AI)系统的能源需求和环境影响对于研究人员、开发者、政策制定者和用户评估构建大规模系统的障碍是必要的。随着开发和部署AI系统所需的管道和底层基础设施日益复杂,以往侧重于单次训练运行或单个推理预测成本的AI效率评估方法已不再足够。在这篇立场论文中,我们阐述了应用生命周期评估来评估机器学习模型开发和部署管道成本的必要性,以正确核算所需资源和下游影响。生命周期评估能够将AI系统及其底层基础设施整个生命周期的成本纳入考量,从与物理计算硬件相关的隐含成本到训练和推理中的运营成本。

英文摘要

Proper accounting of the energy requirements and environmental impact of artificial intelligence (AI) systems is necessary for researchers, developers, policy makers, and users to assess the barriers to building systems at scale. With the growing complexity of pipelines and underlying infrastructure needed to develop and deploy AI systems, previous approaches for evaluating AI efficiency which focus on the costs of a single training run or an individual inference prediction are no longer sufficient. In this position paper, we enunciate the need for applying life cycle assessment to evaluate the costs of the machine learning model development and deployment pipeline to properly account for the required resources and downstream impact. Life cycle assessments enable the incorporation of costs across the full life cycle of an AI system and its underlying infrastructure, from the embodied costs associated with the physical computing hardware through the operational costs in training and inference.

2606.07631 2026-06-09 cs.LG cs.AI cs.CY 新提交

Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning

监督微调中涌现失调的性状空间监测

Huy Nghiem, Sy-Tuyen Ho, Sarah Wiegreffe, Hal Daumé

发表机构 * University of Maryland(马里兰大学)

AI总结 提出利用激活空间中的性状方向监测监督微调中的涌现失调,通过低维几何特征实现高效检测,在7-9B模型上达到0.990 AUROC。

Comments First version. 45 pages

详情
AI中文摘要

涌现失调(EM)发生在窄微调导致模型在微调任务之外出现危险行为时。标准训练信号可能忽略这种偏移,如果依赖重复的行为评估,可靠检测的成本会很高。我们探究是否可以在微调期间从内部表示中检测涌现失调。利用激活空间中编码为线性方向的七个对齐相关性状,我们在四个开源7-9B大语言模型的训练检查点中跟踪表示漂移。EM相关漂移集中在解释65.5%方差的低维轴上,揭示了所研究机制中的几何特征。基于该漂移轮廓构建的低开销监测器在保留的扰动类型上检测危险检查点,假阴性率为2.2%,假阳性率为2.9%,AUROC为0.990,优于无监督PCA和SAE基线。在两个14B模型、更长的微调运行以及失调起始点上的压力测试确定了关键的部署边界。这些结果将性状空间监测定位为基于LoRA的微调中EM检测的行为评估的实用补充,同时表明在显著不同机制下的部署可能需要重新校准。

英文摘要

Emergent misalignment (EM) occurs when narrow finetuning causes a model to behave dangerously outside the finetuning task. Standard training signals can miss this shift, making reliable detection costly if it depends on repeated behavioral evaluation. We ask whether emergent misalignment can instead be detected from internal representations during finetuning. Using seven alignment-relevant traits encoded as linear directions in activation space, we track representational drift across training checkpoints in four open-source 7-9B LLMs. EM-relevant drift concentrates on a low-dimensional axis that explains 65.5% of the variance, revealing a geometric signature in the studied regime. A low-overhead monitor built on this drift profile detects dangerous checkpoints with 2.2% false negative rate, 2.9% false positive rate, and 0.990 AUROC on held-out perturbation types, outperforming unsupervised PCA and SAE baselines. Stress tests on two 14B models, longer finetuning runs, and misaligned starting points identify key deployment boundaries. These results position trait-space monitoring as a practical complement to behavioral evaluation for EM detection during LoRA-based finetuning, while showing that deployment across substantially different regimes may require recalibration.
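Tracking drift along trait directions amounts to projecting the change in activations between checkpoints onto a small set of unit vectors. The sketch below shows that projection on toy data; the two trait directions, the 3-dimensional activation space, and the fixed threshold used to flag a checkpoint are all hypothetical, and the paper's actual monitor is not specified in the abstract.

```python
import math

def unit(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def drift_profile(base_activation, checkpoint_activation, trait_directions):
    """Project the activation shift between two checkpoints onto each trait direction."""
    shift = [c - b for b, c in zip(base_activation, checkpoint_activation)]
    return [
        sum(s * d for s, d in zip(shift, unit(direction)))
        for direction in trait_directions
    ]

# Two hypothetical trait directions in a 3-D activation space.
traits = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
profile = drift_profile([0.0, 0.0, 0.0], [0.5, -1.5, 3.0], traits)

# Illustrative decision rule: flag the checkpoint if any projection is large.
flagged = any(abs(p) > 1.0 for p in profile)
```

Note that the large shift along the third axis is invisible to this profile because no trait direction covers it, which illustrates why such a monitor complements rather than replaces behavioral evaluation.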

2606.07630 2026-06-09 cs.LG cs.AI stat.ML 新提交

Active Learning with Foundation Model Priors: Efficient Learning under Class Imbalance

基于基础模型先验的主动学习:类别不平衡下的高效学习

Jiancheng Zhang, Meiqing Li, Qi Zhang, Yinglun Zhu

发表机构 * University of California, Riverside(加州大学河滨分校) Carnegie Mellon University(卡内基梅隆大学) Worcester Polytechnic Institute(伍斯特理工学院)

AI总结 针对现实数据中的类别不平衡和噪声标注问题,提出一种利用基础模型先验的主动学习框架,通过不平衡感知的协同决策选择信息量最大的样本,在图像和文本数据集上实现超过50%的标注节省。

Comments To appear at ICML 2026

详情
AI中文摘要

现实世界中图像和文本领域的数据集通常具有偏斜的类别分布和噪声标注,这共同降低了模型性能,尤其是对少数类。在现有解决方案中,主动学习通过选择性地查询信息最丰富且平衡的样本进行标注,提供了一种有效且高效的范式。我们提出了一种创新的主动学习框架,该框架减轻了类别不平衡,并选择信息量最大的样本进行标注。利用基础模型先验,我们的算法使得基础模型和小模型之间能够进行不平衡感知的协同决策,以处理跨领域的有噪声和不平衡标签。我们首次系统性地研究了在图像和文本领域中标签噪声和类别不平衡双重挑战下的主动学习。在不平衡数据集上的大量实验表明,我们的方法实现了显著的标注节省——与最佳主动学习基线相比超过50%——同时保持了对标签噪声的性能和鲁棒性。

英文摘要

Real-world datasets across image and text domains are often characterized by skewed class distributions and noisy annotations, which jointly degrade model performance, particularly on minority classes. Among existing solutions, active learning offers an effective and efficient paradigm by selectively querying the most informative and balanced samples for annotation. We propose an innovative active learning framework that mitigates class imbalance and selects the most informative samples to annotate. Leveraging foundation model priors, our algorithm enables imbalance-aware co-decisions between a foundation model and a small model to tackle noisy and imbalanced labels across various domains. We introduce the first study to systematically explore active learning under the dual challenges of label noise and class imbalance across image and text domains. Extensive experiments on imbalanced datasets demonstrate that our method achieves substantial annotation savings (over 50% compared to the best active learning baseline) while preserving performance and robustness to label noise.