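arXivDaily: a daily academic digest that mirrors the full arXiv listing with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering and systems.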

2606.07685 2026-06-09 cs.LG cs.AI New submission

Test-Time Adaptive Composition for Machine Learning as a Service (MLaaS) in IoT Environments

Deepak Kanneganti, Sajib Mistry, Sheik Mohammad Mostakim Fattah, Aneesh Krishna

AI summary: MLaaS compositions in IoT environments break down as conditions change. This work proposes a test-time adaptive (TTA) composition framework that uses a TTA-aware composability model and a service-level adaptation model to adjust services at inference time while preserving composition performance, significantly reducing computation time.

2606.07684 2026-06-09 cs.LG cs.AI New submission

Semantic Cache Distillation: Efficient State Transfer via Reuse and Selective Patching

Qianli Ma, Zhiqing Tang, Hanshuai Cui, Zhi Yao, Weijia Jia

Affiliations: University of Science and Technology of China

AI summary: To address the communication bottleneck of KV-cache transfer in large language model inference and the semantic misalignment that arises when caches are reused across models, this work proposes the Semantic Cache Distillation (SCD) framework. SCD reconstructs most layers from low-rank subspaces and predicts normalized inputs at sparse transition layers, achieving up to a 2.65x time-to-first-token speedup with generation quality close to the oracle.

Comments: Accepted to ICML 2026

Abstract

Disaggregated serving alleviates memory bottlenecks in Large Language Model (LLM) inference but creates a severe communication bottleneck: transmitting high-dimensional Key-Value (KV) caches often dominates time-to-first-token (TTFT). Moreover, reusing caches across heterogeneous models (e.g., base and fine-tuned variants) causes semantic misalignment that accumulates over layers, degrading generation quality. We propose Semantic Cache Distillation (SCD), a loss-constrained framework that replaces raw KV transmission with compact semantic codes. SCD addresses these challenges via two mechanisms: (1) Reuse, which reconstructs most layers from low-rank subspaces to minimize transfer cost, and (2) Patch, which predicts normalized inputs at sparse transition layers to truncate error propagation. Empirically, SCD delivers up to 2.65 $\times$ TTFT speedup over the oracle consumer prefill and dominates quantization and selective recomputation baselines on the quality--latency Pareto frontier in bandwidth-constrained regimes, while keeping generation quality within 5\% F1 of the oracle.
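The Reuse mechanism in the abstract above can be illustrated with a minimal sketch. It assumes a low-rank basis that both sides already share, so only per-token codes are transmitted. The abstract does not say how the subspace is obtained, and the names `reconstruct` and `transfer_ratio` are hypothetical.

```python
def reconstruct(codes, basis):
    """Rebuild an n x d cache slice from n x r codes and an r x d basis.

    Assumption: the basis is pre-shared, so only `codes` cross the network.
    """
    rank = len(basis)
    width = len(basis[0])
    return [
        [sum(row[k] * basis[k][j] for k in range(rank)) for j in range(width)]
        for row in codes
    ]


def transfer_ratio(n_tokens, d_model, rank):
    """Floats sent as codes divided by floats in the raw cache slice."""
    return (n_tokens * rank) / (n_tokens * d_model)


# Toy example: rank-2 codes expanded back to 4 dimensions.
basis = [[1.0, 0.0, 1.0, 0.0],
         [0.0, 1.0, 0.0, 1.0]]
codes = [[2.0, 3.0]]
rebuilt = reconstruct(codes, basis)
```

Under these assumptions the transfer cost scales with the rank rather than the model width, which is the trade-off the abstract describes; the Patch step is not modeled here.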

2606.07674 2026-06-09 cs.CV q-bio.NC New submission

Simultaneous hyperkinetic movement disorders phenotyping: a cross-cohort pediatric transfer study using routine videos, markerless pose estimation and a tabular foundation model

Laura Cif, Diane Demailly, Zohra Souei, Muhammad Mushhood Ur Rehman, Juan Dario Ortigoza Escobar, Mayté Castro Jiménez, Cécile A. Hubsch, Sophie Huby, Morgan Dornadic, Gun-Marie Hariz, Eduardo M. Moraud, Jocelyne Bloch, Gabriella A. Horvath, Xavier Vasques

Affiliations: Lausanne University Hospital (CHUV); University of Lausanne (UNIL); Institut du Neurone; Clinique Beau Soleil; Institut Mutualiste Montpelliérain; Military University Hospital of Sfax; University of Edinburgh; Hospital Sant Joan de Déu; European Reference Network for Rare Neurological Diseases (ERN-RND); Instituto de Salud Carlos III; CHU Montpellier; Umeå University; University Hospital Lausanne; Ecole Polytechnique Fédérale de Lausanne; British Columbia Children's Hospital

AI summary: Proposes a video-based framework combining markerless pose estimation, kinematic descriptors, and a pretrained foundation model. Trained on adult data and transferred to a pediatric cohort, it simultaneously detects multiple hyperkinetic movement-disorder phenomenologies after lightweight calibration.

Abstract

Objective: To develop and externally test a video-based framework for simultaneous detection of hyperkinetic movement disorder (MD) phenomenologies (dystonia, tremor, myoclonus, chorea, athetosis, ballismus, stereotypies, and tics) using routine clinical recordings, with explicit testing of external, cross-cohort transfer from adult to pediatric populations. Methods: In this proof-of-concept study, the framework combines markerless pose estimation, kinematic descriptors, and a pretrained foundation model. A shared predictive backbone was developed on 21 adults with confirmed hyperkinetic MDs and 4 healthy controls assessed under a standardized protocol. External validation was performed on an independent cohort: a real-world pediatric sample (n=12, monogenic combined MDs). For the external dataset, the backbone was deployed without retraining; lightweight calibration adjusted only the final subject-level decision step using a small labeled subset of patients selected by clinicians as representative of the cohort's phenotypic range. Results: After local calibration of the decision layer on the clinician-selected subset, performance improved consistently on the held-out pediatric patients (n=7): Hamming accuracy rose from 0.804 to 0.839 and the Jaccard index from 0.548 to 0.633. This calibrated performance was preserved, and the Jaccard index further improved, when the evaluation was restricted to the phenomenologies with more definite clinician agreement (Hamming accuracy 0.9, Jaccard index 0.786), indicating that the gains did not rest on the least-reliable labels.
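For readers unfamiliar with the two multi-label metrics reported above, here is a minimal sketch of how they are commonly computed. The per-subject averaging of the Jaccard index is an assumption, since the abstract does not state the averaging convention, and the toy labels are invented for illustration.

```python
def hamming_accuracy(y_true, y_pred):
    """Fraction of (subject, label) cells that are predicted correctly."""
    total = sum(len(row) for row in y_true)
    correct = sum(
        t == p
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
    )
    return correct / total


def jaccard_index(y_true, y_pred):
    """Per-subject |true AND pred| / |true OR pred|, averaged over subjects."""
    scores = []
    for row_t, row_p in zip(y_true, y_pred):
        inter = sum(1 for t, p in zip(row_t, row_p) if t and p)
        union = sum(1 for t, p in zip(row_t, row_p) if t or p)
        # Convention (assumed): an empty true and predicted set scores 1.0.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)


# Toy data: 2 subjects, 4 phenomenology labels each (1 = present).
y_true = [[1, 0, 1, 0], [0, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 1, 0]]
```

Hamming accuracy rewards correct absences as well as correct detections, while the Jaccard index only looks at labels that are present or predicted, which is why the two numbers in the abstract differ so much.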

2606.07673 2026-06-09 cs.SD cs.AI cs.LG New submission

A Hierarchical Feature Engineering Framework for Automated Classification of Phonotraumatic and Non-Phonotraumatic Vocal Hyperfunction

June-Woo Kim, Kangwook Jang, Minu Kim, Hyunju Lee

Affiliations: Department of Electronic Engineering, Wonkwang University; AI Convergence Research Institute, Wonkwang University; GIST InnoCORE AI-Nano Convergence Institute for Early Detection of Neurodegenerative Diseases, Gwangju Institute of Science and Technology; School of Electrical Engineering, KAIST; Department of AI Convergence, Gwangju Institute of Science and Technology

AI summary: Proposes a hierarchical feature engineering framework with static, dynamic, ratio, and coupling features to distinguish phonotraumatic from non-phonotraumatic vocal hyperfunction, finding that coupling features are key for classifying both classes (PVH AUC 0.891, NPVH AUC 0.728).

Comments: Interspeech 2026

2606.07670 2026-06-09 cs.CV cs.AI New submission

Liquid Neural Networks as a Drop-in Continuous-Time Deformation Field for Dynamic 3D Gaussian Splatting

Mingzhao Li, Arghya Pal, Guan Yuan Tan

Affiliations: Monash University

AI summary: Replaces the MLP with Closed-form Continuous-time (CfC) cells from Liquid Neural Networks (LNNs) to build an explicit continuous-time deformation field that matches or exceeds the MLP baseline in dynamic scene reconstruction, particularly on high-frequency articulated motion.

Abstract

Deformable 3D Gaussian Splatting (D-3DGS) reconstructs dynamic scenes from monocular video by deforming a canonical set of 3D Gaussians through a positional-encoded MLP of frame time t. Although fitted to a continuous variable, the MLP couples no two values of t in its architecture and effectively predicts discrete per-frame offsets, leaving temporal smoothness to emerge only as a byproduct of optimisation. We redesign the deformation field as a stack of Closed-form Continuous-time (CfC) cells, a Liquid Neural Network (LNN) that is the closed-form solution of the Liquid Time-constant ODE, while preserving every other part of the D-3DGS pipeline. Each cell exposes a sigmoidal time gate that interpolates between two candidate hidden states, baking a learned smooth response to t into the loss landscape without invoking any numerical solver. On the eight D-NeRF and seven NeRF-DS scenes the liquid field matches or exceeds the MLP baseline in aggregate, with its largest gains concentrated on the scenes with the most high-frequency articulated motion. The result is a near-zero-friction architectural design that turns the discrete MLP deformation field into an explicit continuous-time function of t.
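The sigmoidal time gate described above can be sketched in scalar form. This is a minimal illustration of a CfC-style gate and not the paper's implementation: in a real cell `f`, `g`, and `h` would be outputs of learned networks, whereas here they are hypothetical plain numbers.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def cfc_gate_output(t, f, g, h):
    """Blend two candidate states with a gate that depends smoothly on time t.

    f: time-scale term (hypothetical scalar stand-in for a learned output)
    g, h: the two candidate hidden states being interpolated
    """
    gate = sigmoid(-f * t)
    return gate * g + (1.0 - gate) * h


# At t = 0 the gate is exactly 0.5, so the output is the midpoint of g and h.
midpoint = cfc_gate_output(0.0, 1.0, 2.0, 4.0)
# For positive f and large t the gate closes and the output approaches h.
late = cfc_gate_output(50.0, 1.0, 2.0, 4.0)
```

Because the output is a smooth closed-form function of t, nearby time values give nearby outputs without calling an ODE solver, which is the property the abstract emphasizes.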

2606.07669 2026-06-09 cs.CV cs.AI New submission

MemoVAD: Resource-Efficient Video Anomaly Detection via Dynamic Semantic Memory in Edge Computing Scenarios

Guo Li, Jiandian Zeng, Yang Li, Zihao Peng, Ke Chen, Tian Wang

Affiliations: Institute of Artificial Intelligence and Future Networks, Beijing Normal University; School of Computing and Artificial Intelligence, Southwest Jiaotong University; Engineering Research Center of Cloud-Edge Intelligent Collaboration on Big Data, Ministry of Education, Beijing Normal University

AI summary: Proposes MemoVAD, an edge-cloud collaborative framework that selectively calls a cloud vision-language model through an uncertainty-aware gating policy and caches prototypes in a dynamic semantic memory, improving video anomaly detection performance while reducing communication overhead.

Comments: Accepted by IJCAI2026

Abstract

Deploying Video Anomaly Detection (VAD) in real-world surveillance faces a fundamental tension between the demand for high-level semantics to ensure effectiveness and the limited computational resources of edge devices. Vision-Language Models (VLMs) provide rich open-vocabulary semantics, but their latency and computational cost preclude on-device deployment. To address the challenge, we propose MemoVAD, an edge-cloud collaborative framework that selectively incorporates VLM semantics into streaming VAD. MemoVAD runs most inference on the edge with a lightweight detector and a causal Temporal Context Encoder (TCE) to model temporal dependencies. Specifically, we introduce an Uncertainty-Aware Gating (UAG) policy grounded in Subjective Logic to model perceived uncertainty and query the cloud-based VLM only for high-uncertainty and semantically novel clips. Besides, a Dynamic Semantic Memory (DSM) is designed to cache VLM-verified prototypes for efficient retrieval, enabling the edge model to progressively incorporate VLM-level semantics via a semantic adapter. Experiments on UCF-Crime and XD-Violence datasets via a real edge device show that MemoVAD substantially reduces communication overhead while surpassing state-of-the-art performance.
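The gating idea in the abstract above can be sketched as follows. The sketch assumes the usual Subjective Logic convention in which per-class evidence yields an uncertainty mass u = K / (sum of evidence + K) for K classes. The thresholds, the novelty score, and the function names are hypothetical and are not taken from the paper.

```python
def subjective_uncertainty(evidence):
    """Uncertainty mass u = K / (sum(evidence) + K) for K classes."""
    k = len(evidence)
    return k / (sum(evidence) + k)


def should_query_cloud(evidence, novelty, u_thresh=0.5, novelty_thresh=0.5):
    """Escalate a clip to the cloud VLM only if it is both uncertain and novel.

    `novelty` is an assumed score in [0, 1], e.g. a distance to cached prototypes.
    """
    return subjective_uncertainty(evidence) > u_thresh and novelty > novelty_thresh


# No evidence for either class: maximal uncertainty.
u_vacuous = subjective_uncertainty([0.0, 0.0])
# Strong evidence for one class: low uncertainty.
u_confident = subjective_uncertainty([8.0, 0.0])
```

With a gate of this kind, most clips stay on the edge device and only the uncertain, unfamiliar ones incur a cloud round trip, which is the source of the communication savings the abstract claims.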

2606.07661 2026-06-09 cs.CV cs.DL New submission

PereStruct: Multimodal Semantic Assembly for Robust Historical Document Parsing

Maksim Shandybo, Ivan Bespalov, Daniil Yefimov, Marina Kosheleva, Alexander Loukianov

Affiliations: IGIC RAS; Yandex Cloud; National University of Science and Technology MISIS; Nekrasov Central Universal Scientific Library

AI summary: To parse the complex multi-column layouts of historical newspapers, this work proposes a multimodal method combining a fine-tuned YOLO with a semantic assembly module, reaching an F1 of 0.904 on block-to-article mapping and a BLEU of about 0.96, substantially better than general-purpose vision-language models.

Comments: Code and data available at https://github.com/makSShandybo/PereStruct

Abstract

Parsing historical documents with complex, non-standard layouts remains a fundamental bottleneck in large-scale archival digitization. Unlike modern typography, historical newspapers exhibit severe physical degradation and highly irregular page structures that confound even state-of-the-art vision-language models, presenting severe out-of-distribution challenges. We address this gap with an automated pipeline specifically designed for parsing historical newspapers, documents characterized by particularly intricate multi-column layouts. Our approach combines a fine-tuned YOLO architecture for layout analysis and block detection, trained on 1,426 fully human-annotated scanned pages, with a novel semantic assembly module that reconstructs articles by jointly modeling lexical-semantic similarity via TF-IDF, visual embeddings from our fine-tuned YOLO, and geometric layout constraints. This multi-modal integration yields state-of-the-art performance, achieving an F1 score of 0.904 on block-to-article mapping. Notably, end-to-end evaluation against vision-language models (Qwen3.6-35B-A3B and Qwen3.6-Plus) demonstrates that PereStruct achieves substantially higher fidelity (BLEU approximately 0.96 vs 0.34), validating that modular architectures excel where generic VLMs fail on complex historical layouts. To support reproducibility and advance research in this domain, we release both the training corpus of 599 annotated pages and a curated PereStruct benchmark of 93 pages with expert-verified ground-truth block-to-article mappings. This framework establishes a robust foundation for high-fidelity digitization and semantic reconstruction of complex archival materials.
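The lexical part of the semantic assembly step above can be sketched with a small TF-IDF cosine similarity. The weighted sum in `block_affinity` is an assumption made for illustration: the abstract says the three signals are modeled jointly but does not give the combination rule, and the weights and sample texts are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dicts) with smoothed inverse document frequency."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    return [
        {w: c * (math.log((1 + n) / (1 + df[w])) + 1.0) for w, c in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def block_affinity(lexical, visual, geometric, weights=(0.5, 0.3, 0.2)):
    """Hypothetical weighted combination of the three similarity signals."""
    wl, wv, wg = weights
    return wl * lexical + wv * visual + wg * geometric


blocks = [
    "harvest prices rise in the region",
    "prices of the harvest rise again",
    "theatre opens new season",
]
vecs = tfidf_vectors(blocks)
same_story = cosine(vecs[0], vecs[1])
other_story = cosine(vecs[0], vecs[2])
```

Blocks that belong to the same article tend to share vocabulary, so their lexical similarity is higher; the visual and geometric terms would come from the detector and the page layout and are passed in here as plain numbers.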

2606.07660 2026-06-09 cs.CV cs.LG New submission

Need We Teach Foundation Models What is a Generative Image? Gradient-Free Generative Artifact Detection via Analytic Spectral Adaptation

Qiaoyu Chen, Bing Zhang

Affiliations: Harbin University of Commerce

AI summary: Proposes a gradient-free method that reframes generative artifact detection as an out-of-distribution anomaly measurement problem and analytically decouples statistical and semantic deviations, significantly outperforming gradient-optimized methods in the zero-shot setting.

Abstract

Adapting foundation models to detect generative artifacts via gradient-based updates compromises their intrinsic representations. Under optimization on limited samples, models overfit to local domain shortcuts. Fine-tuning massive weights on specialized data introduces erroneous inductive biases, inducing a measurable $\mathcal{L}_2$ norm perturbation in the high-dimensional feature space -- a phenomenon we formalize as anchor drift. Amplified by nonlinear activations, this drift impairs zero-shot forgery detection across unseen domains. We propose a gradient-free methodology reframing detection from binary classification to an out-of-distribution (OOD) anomaly measurement problem. Treating a frozen foundation model as a stable coordinate system, we establish an absolute natural anchor on the real visual manifold by analytically decoupling statistical and semantic deviations, derived from attention-weighted spatial moments and orthogonal projection of perceptual inconsistencies. Evaluated in an extreme zero-shot setting (trained on face forgeries, tested on universal Text-to-Image generations), our method significantly outperforms gradient-optimized paradigms. Backpropagation-free forward passes and linear solvers enable hardware-agnostic, edge-deployable calibration with minimal latency. Furthermore, the Sherman-Morrison formula unlocks instantaneous online learning against novel attacks and enables privacy-preserving federated collaboration via covariance delta transmission.
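The Sherman-Morrison formula mentioned above updates a known matrix inverse after a rank-one change, without re-inverting: (A + u v^T)^{-1} = A^{-1} - (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u). The sketch below is a generic pure-Python version of that identity. How the paper applies it to its covariance statistics is not stated in the abstract, so the link to the method is an assumption.

```python
def mat_vec(m, v):
    """Matrix times column vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def sherman_morrison(a_inv, u, v):
    """Return the inverse of (A + u v^T) given the inverse of A."""
    au = mat_vec(a_inv, u)   # A^{-1} u
    va = vec_mat(v, a_inv)   # v^T A^{-1}
    denom = 1.0 + sum(vi * aui for vi, aui in zip(v, au))
    if abs(denom) < 1e-12:
        raise ValueError("update makes the matrix singular")
    n = len(a_inv)
    return [[a_inv[i][j] - au[i] * va[j] / denom for j in range(n)] for i in range(n)]


# A = I, u = v = [1, 0], so A + u v^T = diag(2, 1) and its inverse is diag(0.5, 1).
identity = [[1.0, 0.0], [0.0, 1.0]]
updated = sherman_morrison(identity, [1.0, 0.0], [1.0, 0.0])
```

Each update costs on the order of n^2 operations instead of the n^3 of a fresh inversion, which is what makes per-sample online updates cheap; for a covariance-style update with a new feature vector x, one would typically set u = v = x.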

2606.07659 2026-06-09 cs.CV eess.IV New submission

Real-Time Industrial Defect Detection on Edge Hardware Using Fine-Tuned YOLOv8: A Systematic Benchmark on the NEU Surface Defect Database and MVTec AD with Automotive & Battery Manufacturing Extensions

Emmanuel Ezeji Somtochukwu, Nitesh Rijal

Affiliations: Zema AI Labs

AI summary: Proposes Industrial-YOLO, a framework built on a fine-tuned YOLOv8 and accelerated with TensorRT and OpenVINO, which achieves real-time defect detection above 120 FPS on edge hardware with a 98.5% mAP and reports zero-latency performance on an automotive assembly line.

Comments: 11 pages, 4 figures, 7 tables. Includes edge optimization framework (TensorRT/OpenVINO) and industrial hardware benchmark analysis

Abstract

Automated surface defect detection is critical for ensuring rigorous quality control in high-speed manufacturing environments. While deep learning models offer remarkable accuracy, deploying them on resource-constrained edge hardware without introducing significant latency remains a persistent challenge. This paper presents Industrial-YOLO, an edge-optimized framework built upon a fine-tuned YOLOv8 architecture specifically engineered for real-time industrial defect detection. We conduct a systematic benchmark utilizing the NEU surface defect database for steel sheets and the MVTec AD dataset, supplemented with custom automotive manufacturing extensions representing real-world structural anomalies (scratches, pits, and inclusions). To bridge the gap between algorithmic complexity and edge hardware constraints, target-specific optimizations are introduced via TensorRT and OpenVINO acceleration engines. Experimental results demonstrate that Industrial-YOLO achieves a high-velocity inference speed exceeding 120 FPS on the NVIDIA Jetson Orin platform while maintaining an exceptional mean Average Precision (mAP) of 98.5%. The proposed framework showcases highly robust, zero-latency performance when deployed directly onto an active automotive assembly line, offering a scalable blueprint for next-generation automated optical inspection (AOI) systems.

2606.07658 2026-06-09 cs.CV cs.LG New submission

What neurosurgeons need to see: synthetic intra-operative MRI from ultrasound for brain-shift compensation in brain tumour surgery

Santiago Cepeda, Olga Esteban-Sinovas, Ignacio Arrese, Rosario Sarabia

Affiliations: Department of Neurosurgery, Neurovascular Unit, Río Hortega University Hospital, Valladolid, Spain; Specialized Group in Biomedical Imaging and Computational Analysis (GEIBAC), Instituto de Investigación Biosanitaria de Valladolid (IBioVALL), Valladolid, Spain

AI summary: Proposes an end-to-end pipeline that merges the preoperative MRI, a synthetic MRI generated from intraoperative ultrasound, and a deformable registration anchored on that synthetic image to produce a whole-brain MRI volume in the preoperative imaging space, compensating for brain shift and giving neuronavigation an MRI-like update of the intraoperative field.

Abstract

Maximal safe resection is the primary objective in glioma surgery. Neuronavigation guidance is progressively degraded by brain shift after dural opening. Intraoperative MRI can compensate but needs dedicated infrastructure and is rarely available, whereas intraoperative ultrasound (ioUS) is inexpensive, repeatable, and compatible with routine workflows. Navigation systems combining ioUS with preoperative MRI usually rely on rigid registration; even deformable multimodal registration is limited by ultrasound speckle contrast, a narrow field of view, and the inability to represent structures absent from the preoperative scan, most critically the resection cavity and residual tumor. We propose an end-to-end pipeline that generates a new whole-brain MRI volume in the preoperative imaging space by merging the preoperative MRI, a synthetic MRI generated from the ioUS, and a deformable registration anchored on that synthetic image. It integrates a 2.5D residual-transformer synthesis backbone (ResViT-2.5D) and a two-stage registration coupling NiftyReg with a synthesis-anchored SynthMorph stage, operating directly on raw scanner inputs. On a post-resection ReMIND cohort, ResViT-2.5D produced synthetic images closely matching the intraoperative T2 across structural, intensity, and perceptual metrics. In 14 subjects with 215 expert landmarks, the synthesis-anchored registration reduced the mean target registration error from 6.27 to 5.86 mm, matching a strong classical NiftyReg baseline (5.85 mm) while yielding a diffeomorphic deformation field in every subject. The contribution is not a gain in registration accuracy but the integrated volume itself, which, inside the ultrasound field of view, reflects the intraoperative post-resection state. This provides the surgeon with an MRI-like update of the operative field with potential for integration into surgical-navigation workflows.
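The target registration error (TRE) quoted above is conventionally the Euclidean distance between corresponding landmarks after registration, averaged over landmarks. A minimal sketch under that assumption follows; the coordinates are invented, and the paper's exact averaging (per landmark or per subject) is not stated in the abstract.

```python
import math


def mean_tre(fixed_landmarks, warped_landmarks):
    """Mean Euclidean distance (same units as the inputs, e.g. mm) between landmark pairs."""
    dists = [math.dist(p, q) for p, q in zip(fixed_landmarks, warped_landmarks)]
    return sum(dists) / len(dists)


# Two toy landmark pairs in mm: one is off by a 3-4-5 triangle, one matches exactly.
fixed = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
warped = [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]
tre = mean_tre(fixed, warped)
```

A lower mean TRE means the registered landmarks sit closer to their expert-annotated targets; the abstract's point is that the synthesis-anchored stage matches the classical baseline on this metric while adding the integrated volume.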

2606.07654 2026-06-09 cs.CV cs.AI New submission

MM-Matryoshka: Towards Budget-Elastic Visual Document Retrieval via a 2D Multimodal Matryoshka Training Framework

Haowen Xiang, Yibo Yan, Jiahao Huo, Yu Huang, Yi Cao, Mingdong Ou, Xuming Hu

Affiliations: Hong Kong University of Science and Technology (Guangzhou); Alibaba Cloud Computing; Hong Kong University of Science and Technology

AI summary: Proposes MM-Matryoshka, a 2D Matryoshka training framework that lets a visual document retriever choose its budget elastically along both vector dimension and encoder depth, without training separate models for different budgets.

Abstract

Multi-vector visual document retrievers achieve strong fine-grained matching by representing each page with multiple vectors from deep Vision-Language Models (VLMs), but this design makes deployment expensive in both storage and computation. Existing efficiency techniques usually optimize only part of this budget, leaving multimodal retrievers without a unified way to trade accuracy against both vector width and encoder depth. We therefore propose MM-Matryoshka, a 2D Matryoshka training framework for budget-elastic Visual Document Retrieval (VDR) that makes ColPali-style multi-vector retrieval elastic along both dimension and layer. At inference time, a single retriever can select a budget along both axes without training separate models for different budgets. Through comprehensive experiments across multiple representative backbones, we demonstrate that MM-Matryoshka retains significantly higher quality than direct truncation baselines while substantially reducing storage and computational overhead, offering robust budget elasticity for efficient VDR.
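The width axis of the budget can be illustrated with a small late-interaction sketch: each query vector is matched to its best page vector (MaxSim), and a smaller budget keeps only a prefix of every vector. This shows plain prefix truncation, the kind of baseline the abstract compares against; the depth axis and the training procedure are not modeled, and the toy vectors are invented.

```python
def truncate(vectors, dim):
    """Keep only the first `dim` coordinates of every vector (Matryoshka-style prefix)."""
    return [v[:dim] for v in vectors]


def maxsim(query_vecs, page_vecs):
    """Late-interaction score: sum over query vectors of the best dot product with any page vector."""
    return sum(
        max(sum(a * b for a, b in zip(q, p)) for p in page_vecs)
        for q in query_vecs
    )


query = [[1.0, 0.0, 0.5, 0.0], [0.0, 0.0, 1.0, 0.0]]
page = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0]]

full_score = maxsim(query, page)
half_score = maxsim(truncate(query, 2), truncate(page, 2))
```

In this toy case the second query vector carries all of its signal in the dropped coordinates, so the truncated score falls. Training the prefixes to remain useful on their own is what a Matryoshka-style objective is meant to address.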

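2606.07653 2026-06-09 cs.CV cs.AI New submission

A Dataset for Dynamic Human Preferences for Vision Language Models

Hannah Gao, Dylan Hadfield-Menell, Rachel Ma

Affiliations: Massachusetts Institute of Technology

AI summary: Proposes a benchmark for evaluating how well vision-language models understand dynamic human preferences, builds a dataset with image-dependent variations through an automated pipeline, and evaluates existing models.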

2606.07651 2026-06-09 cs.LG cs.CV New submission

KITE: A Tri-Modal Transformer Integrating Text, Images, and Knowledge Graphs for Fake News Detection

Kevin Patel, Shashi Bhushan Jha

Affiliations: Department of Computer Science, University of West Florida

AI summary: Proposes KITE, a tri-modal fake news detection framework that jointly models textual, visual, and knowledge representations and integrates features with cross-modal attention, significantly outperforming unimodal and bimodal baselines on benchmark datasets.

Abstract

Traditional fake news detection methods are falling behind as multimodal misinformation grows more advanced, seamlessly blending deceptive text, manipulated visuals, and factually incorrect claims. Most prior work focuses on text-image fusion or applies external knowledge only as a post-processing step, limiting its ability to detect deeper semantic inconsistencies. In this paper, we introduce KITE (Knowledge-Integrated Text-Image Encoder), a tri-modal fake news detection framework that jointly models textual, visual, and factual knowledge representations. KITE leverages RoBERTa [23,14] and CLIP [24] for linguistic and visual encoding, while a Graph Attention Network (GAT) processes structured facts retrieved from Wikidata. KITE uses cross-modal attention [9] within a multimodal transformer to integrate text, visual, and knowledge features, helping it understand how the modalities relate to one another. Modality-specific confidence scores are generated alongside the final prediction, offering interpretability by indicating which input type most influenced the decision. Evaluations on benchmark datasets demonstrate that KITE significantly outperforms unimodal and bimodal baselines, particularly in scenarios involving image-text mismatches or contradictions with external knowledge.

2606.07649 2026-06-09 cs.CV cs.AI New submission

ViMax: Agentic Video Generation

Lingxuan Huang, Sizhe He, Hengji Zhou, Liqiang Nie, Lianghao Xia, Chao Huang

Affiliations: The University of Hong Kong; South China University of Technology; Harbin Institute of Technology, Shenzhen

AI summary: Proposes ViMax, a framework that generates long-form video through multi-agent collaboration, using a hierarchical narrative engine and a visual consistency mechanism to maintain narrative coherence and visual consistency.

Comments: 20 pages, 13 figures

Abstract

Long-form video generation requires systematic narrative planning and visual consistency that current short-clip methods cannot provide. Existing methods generate isolated sequences without narrative structure and lack mechanisms for maintaining character and environmental consistency across scenes. We present ViMax, an agentic video generation framework that addresses video creation through coordinated multi-agent collaboration where specialized components negotiate narrative decisions, visual continuity, and production quality. Our framework employs a hierarchical narrative engine with retrieval-augmented generation for global story coherence and a dependency-aware visual consistency mechanism that tracks character and environmental states across temporal boundaries, while VLM-guided agents continuously monitor and refine both narrative coherence and visual fidelity. The framework enables coordinated agent collaboration to generate extended narrative content. This maintains both storytelling integrity and visual coherence across multi-scene timelines.

2606.07648 2026-06-09 cs.CV cs.AI New submission

AQIFormer: A Transformer-Based Multi-View Architecture for Cross-City Air Quality Classification

Om Kathalkar, Nitin Nilesh, Sachin Chaudhari, Anoop Namboodiri

Affiliations: IIIT Hyderabad

AI summary: Proposes AQIFormer, a transformer-based ensemble architecture that uses front-rear view fusion, weather-aware attention, and multi-task learning to reach 89.96% accuracy in cross-city air quality classification, a 14.96% improvement over existing methods.

Comments: Accepted at ICVGIP 2025 (Indian Conference on Computer Vision, Graphics and Image Processing), 9 pages, 4 figures

DOI: 10.1145/3774521.3774577

Abstract

Air pollution represents one of the most critical environmental and public health challenges globally, with traditional sensor-based monitoring systems facing significant scalability and economic constraints. Image-based air quality estimation has emerged as a promising alternative, leveraging the visual characteristics of atmospheric pollutants in traffic scenes. However, existing methods suffer from limited cross-city generalization and inadequate exploitation of multi-view perspectives. We present AQIFormer, a novel transformer-based ensemble architecture that addresses these fundamental limitations through innovative dual-view integration, weather-aware attention mechanisms, and comprehensive multi-task learning. Our approach uniquely combines front and rear traffic imagery with meteorological parameters to achieve robust air quality classification across diverse urban environments. Extensive evaluation on a comprehensive dataset of 26,678 synchronized front-rear image pairs demonstrates good performance with 89.96% accuracy, representing a 14.96% improvement over state-of-the-art methods. Most importantly, our model maintains exceptional cross-city generalization capabilities, achieving 81.67% accuracy on an independent dataset collected in Nagpur, India with only 8.29% performance degradation using few-shot adaptation with minimal training samples.

2606.07647 2026-06-09 cs.CV cs.CL cs.LG New submission

Steer Where It Matters: Token-Level Visual-Sensitivity Steering for LVLMs Hallucination Mitigation

Ruipeng Zhang, Zhihao Li, C. L. Philip Chen, Tong Zhang

Affiliations: University of Science and Technology of China

AI summary: Proposes Token-Level Visual-Sensitivity Steering (TLVS), which extracts token-level steering vectors and adaptively adjusts the steering strength to suppress hallucinations only at critical decoding steps, outperforming existing methods on multiple benchmarks.

Abstract

Large vision language models (LVLMs) have made rapid advancements and are deployed across various applications, yet hallucinations remain a major challenge. Activation steering is appealing due to its minimal training overhead and controllability at inference time. However, we found that during autoregressive decoding, visual conditioning affects token prediction sparsely and locally across decoding steps, and many existing methods that average image-versus-no-image differences over the entire sequence dilute these critical signals, yielding low signal-to-noise ratio steering directions. Additionally, many existing methods apply a fixed steering strength, which misallocates the intervention budget, over-perturbs non-critical tokens, and can cause instability. To address these limitations, we propose Token-Level Visual-Sensitivity Steering (TLVS) for hallucination mitigation. Our approach first extracts token-level steering vectors and refines them, and then applies fine-grained, visual-sensitivity-adaptive steering only where it matters. This lightweight, plug-and-play mechanism requires only minimal training for calibration and can be applied across diverse vision-language models. It modulates the steering strength at each decoding step, selectively suppressing hallucination-prone spans while preserving evidence-grounded content. We evaluate TLVS on several benchmarks, including POPE, AMBER, CHAIR (COCO), MMHal, and HallusionBench, demonstrating consistent improvements over previous steering methods.
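
The per-step modulation described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the sensitivity measure (KL divergence between the next-token distributions with and without the image), the threshold `tau`, the cap `alpha_max`, and every function name below are assumptions made only for this example.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def steering_strength(p_with_image, p_without_image, tau=0.1, alpha_max=1.0):
    """Hypothetical rule: skip steering when the image barely changes the
    prediction, otherwise scale the strength with the sensitivity, capped."""
    sensitivity = kl_divergence(p_with_image, p_without_image)
    if sensitivity < tau:
        return 0.0
    return min(alpha_max, sensitivity)


def steer(hidden, vector, alpha):
    """Add a scaled steering vector to one token's hidden state."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# One decoding step: the image changes the prediction, so steering is applied.
alpha = steering_strength([0.7, 0.2, 0.1], [0.2, 0.5, 0.3])
steered = steer([1.0, 2.0], [0.5, -0.5], alpha)
```

With a fixed strength, every token would receive the same perturbation; here a step whose prediction does not depend on the image receives a strength of zero and is left untouched.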

2606.07646 2026-06-09 cs.CV cs.AI New submission

DOME: Learning Transferable Domain Variables from Sparse Supervision for Test-Time Adaptation

Xiaoran Xu, Yifan Xu, Yupeng Wu, Xiaoshan Yang, Changsheng Xu

Affiliations: MAIS, IACAS (Multimodal Artificial Intelligence Systems Laboratory, Institute of Automation, Chinese Academy of Sciences)

AI summary: Proposes DOME, a domain encoder that uses vision-language pretraining to extract dense, continuous representations, parameterizes domains as distributional variables, and introduces a momentum-updated sparse domain bank, enabling zero-shot explicit domain modeling and outperforming complex TTA methods on multiple benchmarks.

Abstract

Test-time adaptation (TTA) aims to align a model to shifting test domains using only unlabeled streaming data. Most existing methods implicitly infer a single global domain distribution, ignoring the multidimensional and sample-specific nature of real-world domain shifts, leading to fragile adaptation. We propose DOME, an effective domain encoder that explicitly models each sample's domain in a zero-shot manner. DOME leverages vision-language pretraining to extract dense, continuous representations, parameterizes domains as distributional variables, and introduces a momentum-updated sparse domain bank for disentangled supervision. By injecting these explicit domain cues into downstream models, even a basic entropy-minimization TTA strategy achieves state-of-the-art performance across ImageNet-C, ImageNet-R, and ImageNet-Sketch, outperforming complex TTA approaches. Our results demonstrate that robust adaptation stems not from intricate adaptation algorithms, but from explicit, structured domain representation.
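
The "basic entropy-minimization TTA strategy" mentioned above can be sketched in a few lines. This is a generic illustration and not DOME itself: to stay self-contained it treats the logits of a single unlabeled sample as the adapted parameters, whereas real TTA methods update model weights; the learning rate and function names are assumptions.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_grad(logits):
    """Gradient of the softmax entropy H with respect to each logit:
    dH/dz_i = -p_i * (log p_i + H)."""
    probs = softmax(logits)
    h = entropy(probs)
    return [-p * (math.log(p) + h) if p > 0.0 else 0.0 for p in probs]


def entropy_min_step(logits, lr=0.5):
    """One gradient-descent step that makes the prediction more confident."""
    grad = entropy_grad(logits)
    return [z - lr * g for z, g in zip(logits, grad)]


logits = [2.0, 1.0, 0.5]
adapted = entropy_min_step(logits)
print(entropy(softmax(logits)), entropy(softmax(adapted)))
```

No labels are used anywhere in the update, which is what makes this objective usable on an unlabeled test stream.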

2606.07645 2026-06-09 cs.CV cs.AI New submission

FineGen: A VLM-based Multi-Agent Framework for Fine-Grained Image-Text Dataset Construction

Chang Kong, Yuebing Li, Peng Mo, Haigang Zhang, Qiuming Luo

Affiliations: Shenzhen Polytechnic University; Institute of Applied Artificial Intelligence of the Guangdong-Hong Kong Macao Greater Bay Area; Shenzhen University

AI summary: Proposes FineGen, a framework that automatically builds fine-grained datasets containing hard negatives through a generate-verify-correct pipeline with closed-loop feedback; it constructs FineGen-100K on ImageNet and improves hard-sample accuracy by 14.4%.

Comments 15 pages, 2 figures, conference

2606.07643 2026-06-09 cs.CV cs.AI cs.SD eess.AS New submission

AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

Yaoting Wang, Ziyi Zhang, Wenming Tu, Shaoxuan Xu, Wenjie Du, Cheng Liang, Weijun Wang, Yuanchao Li, Guangyao Li, Hao Fei, Yuanchun Li, Henghui Ding, Yunxin Liu

Affiliations: Nanyang Technological University

AI summary: Proposes AVI-Bench, a benchmark that evaluates the audio-visual intelligence of Omni-MLLMs through cross-modal tasks across three stages (perception, understanding, and reasoning), and introduces AVI-Bench-PriSe to test primitive audio-visual sensation; the results reveal limitations of current models and lead to a four-level AVI taxonomy.

Comments 31 pages, 8 figures, ICML 2026

Abstract

Recent advances in Omni-Multimodal Large Language Models (Omni-MLLMs) have enabled strong integration of vision, audio, and language. However, their audio-visual intelligence (AVI) remains insufficiently evaluated due to the lack of systematic and comprehensive benchmarks. We introduce AVI-Bench, a cognitively inspired benchmark that evaluates Omni-MLLMs across three stages, perception, understanding, and reasoning, through cross-modal tasks requiring joint audio-visual interpretation. This design enables fine-grained diagnosis of model capabilities and failure modes. To further assess robustness beyond familiar domains, we propose AVI-Bench-PriSe, an extension that probes models' primitive audio-visual sensation using unfamiliar, low-semantic stimuli, testing generalization beyond common training distributions. Extensive experiments on both open-source and closed-source models reveal substantial limitations in current Omni-MLLMs. Based on these findings, we present a four-level AVI taxonomy. Overall, AVI-Bench provides a principled evaluation framework to guide the development of more robust and generalizable AVI. Project website: https://fudancvl.github.io/AVI-Bench/

2606.07642 2026-06-09 cs.CV cs.CY New submission

Do VLMs See What Sensors Feel? A Scalable Expert-Guided Design for Wheelchair Accessibility Assessment from Street View

Dongdong Wang, Alina Hagen, Isabelle Gatmaitan, Hao Zhou, Yiwen Dong, Shabboo Valipoor, Vivian W. H. Wong, Lingyao Li

Affiliations: University of Florida; University of South Florida; University of Illinois Urbana-Champaign

AI summary: Proposes an expert-guided retrieval-augmented framework that uses vision-language models to identify wheelchair accessibility barriers from Google Street View imagery; validation against GPS-derived wheelchair dwell behavior shows that VLM ratings partially align with mobility friction, while fine-grained barrier recognition remains limited.

Abstract

Assessing built-environment interaction, such as wheelchair accessibility, is difficult because real-world mobility is shaped by distributed, context-dependent, and temporary barriers that are hard to capture at scale. To support scalable assessment, this paper examines whether vision-language models (VLMs) can identify accessibility barriers from Google Street View (GSV) imagery. We propose an expert-guided retrieval-augmented framework that combines GSV images, ADA-informed guidance, and expert-derived rubrics to evaluate accessibility dimensions. We collect a campus-scale dataset at the University of Florida, linking 407 unique GSV locations with GPS-derived wheelchair dwell behavior as a mobility-friction signal. Results show that VLM ratings are both negatively correlated and distributionally similar with dwell time, indicating partial but consistent alignment with a behavioral proxy for mobility friction. Visual cue analysis shows that certain environmental objects, such as curb ramps and crosswalks, are associated with higher VLM accessibility scores, while alignment remains limited for subtle surface conditions, transient obstructions, and viewpoint-dependent barriers. Overall, our findings show the potential of expert-guided VLMs for scalable accessibility assessment aligning with sensor-derived indicators of real-world wheelchair navigation.

2606.07641 2026-06-09 cs.CV New submission

Readable Yet Unpredictable: Rotated-Outcome Prediction in Vision-Language Models

Lexin Wang, Shenghua Liu, Yiwei Wang, Jiafeng Guo, Xueqi Cheng

Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; University of California, Merced

AI summary: Studies whether vision-language models can predict the content revealed by a 180° rotation from the original image alone; introduces the RotOutBench benchmark and finds that models can recognize rotated content but cannot predict it.

Abstract

Can vision-language models predict what a 180° rotation would reveal from the original image alone? We study this ability through Rotated-Outcome Prediction: given an original image, a model must answer what would be seen or read after a 180° in-plane rotation, without directly observing the rotated target. To isolate this gap, we introduce RotOutBench, a paired diagnostic benchmark spanning open visual cases and controlled text-image rotations. A sharp pattern emerges: many VLMs can recognize the relevant content when directly given either the original or rotated image, yet fail to infer the rotated result from the original image alone. On controlled text-image rotations, predicted-rotation accuracy collapses to near zero even for models with high direct-reading accuracy. A model-level case study further shows that the prediction state can approach a rotated-image reading state, while the final readout still shifts toward the original string. Current VLMs can recognize a transformed visual state when it is shown, but often fail to predict that state from the original view.
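
For readers who want the transformation itself spelled out, a 180° in-plane rotation of a pixel or character grid reverses the row order and then reverses each row. The sketch below is a generic illustration and is not taken from RotOutBench; the function names are assumptions.

```python
def rotate_180(grid):
    """Rotate a 2D grid by 180 degrees: reverse the rows, then each row."""
    return [list(reversed(row)) for row in reversed(grid)]


def rotate_text_layout_180(lines):
    """Positions of characters after a 180-degree rotation of a text block.
    This models layout only: each glyph would also appear upside down
    (for example, a '6' would look like a '9'), which is not modeled here."""
    return ["".join(reversed(line)) for line in reversed(lines)]


pixels = [[1, 2],
          [3, 4]]
print(rotate_180(pixels))                    # [[4, 3], [2, 1]]
print(rotate_text_layout_180(["ab", "cd"]))  # ['dc', 'ba']
```

Applying the rotation twice returns the original grid, so the mapping is deterministic and easy to state, which makes the reported failure to predict it from the original image notable.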

2606.07640 2026-06-09 cs.CV cs.AI cs.LG New submission

No Free Lunch for Synthetic Images under Data Scarcity Conditions

Borja Arroyo Galende, Alejandro Almodóvar, Patricia A. Apellániz, Juan Parras, Silvia Uribe, Santiago Zazo

Affiliations: Universidad Politécnica de Madrid; Universidad de Alcalá

AI summary: Studies the trade-offs among fidelity, privacy, and utility of synthetic data under data-scarce and privacy-sensitive conditions; proposes a joint evaluation framework, compares VAEs, GANs, and DDPMs on three image datasets, and finds that GANs and DDPMs are more robust under differential privacy.

2606.07639 2026-06-09 cs.CV cs.AI New submission

MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention

Pengyu Wang, Chenkun Tan, Shaojun Zhou, Wei Huang, Qirui Zhou, Zhan Huang, Zhen Ye, Jijun Cheng, Xiaomeng Qian, Yanxin Chen, Xingyang He, Huazheng Zeng, Chenghao Wang, Pengfei Wang, Hongkai Wang, Shanqing Gao, Yixian Tian, Chenghao Liu, Xinghao Wang, Botian Jiang, Xipeng Qiu

Affiliations: Fudan University; Shanghai Innovation Institute

AI summary: Proposes MOSS-Video-Preview, a two-channel cross-attention architecture that achieves real-time video understanding through non-blocking perception and generation, with a 5x speedup in time to first token and 2.7x higher decoding throughput on a single H200.

Abstract

Video understanding is shifting from the offline paradigm -- taking a fully recorded video as input and producing a single answer after it ends -- toward real-time interaction, in which the model perceives new frames while still replying, revises its answer as new evidence appears, and remains silent when there is nothing to say. We present MOSS-Video-Preview to validate this paradigm. Our central claim is that perception must not be blocked by generation; its natural realization is a two-channel architecture. We argue that a cross-attention backbone is better suited to real-time vision-language fusion than the prevailing decoder-only design: visual features enter through a side channel rather than joining the autoregressive sequence, so perception and generation run on separate, non-blocking pathways -- reducing the frequency of visual processing and exposing a clean channel-wise interface for independent compression. We complement this with a data synthesis pipeline that converts dense captions into real-time understanding QA whose answers are revised to match what the model has perceived so far, and we specialize an offline model on these data to elicit real-time behavior. Our model trails the strong Qwen2.5-VL-7B baseline overall -- a gap we attribute primarily to data and scale rather than the architecture -- yet attains competitive offline video and multimodal understanding, remains robust on the spatial and fine-grained temporal reasoning central to real-time use, and acquires behaviors that offline models lack: continuous perception, answer revision, and timely silence. On a single H200 with 256 frames per video, it achieves about a 5x speedup in time to first token and 2.7x higher decoding throughput, with negligible degradation in offline ability. Our study of paradigm, architecture, and data outlines a viable path toward real-time video understanding.

2606.07638 2026-06-09 cs.CV cs.AI New submission

Anchor-Conditioned Compositional Control for Landscape Image Generation

Gadha Lekshmi P, Govind Arun, Rohith Syam, Ahmed Elgammal

Affiliations: Rutgers University–New Brunswick; University of Maryland–College Park; University of Technology Sydney

AI summary: Proposes an anchor-conditioned fine-tuning framework that injects a four-dimensional compositional anchor vector through a decoupled cross-attention mechanism to control composition in landscape image generation, achieving the best performance on horizon detection and rule-of-thirds alignment.

Comments Accepted to the International Conference on Computational Creativity, ICCC 2026

2606.07636 2026-06-09 cs.CV cs.CL cs.MA New submission

Crayotter: Traceable Multi-Agent Workflows for Long-Form Video Editing

Lecheng Yan, Yichong Zhang, Ben Pan, Xiaoyu Zheng, Jiawei Qian, Anqi Wu, Wenxi Li, Chenyang Lyu

Affiliations: University of Science and Technology of China

AI summary: Proposes Crayotter, an open-source multimodal multi-agent system whose three-phase workflow (material preparation, artifact-based editing research, and tool-grounded execution) makes long-form video editing traceable and selectively revisable, outperforming baselines in human evaluation.

Comments 11 pages, 5 figures

Abstract

Editing a long-form video from heterogeneous footage requires more than selecting clips: an agent must preserve narrative intent across material preparation, timeline construction, post-production, and revision while leaving enough evidence to diagnose failures. We present Crayotter, an open-source multimodal multi-agent system for prompt-driven video editing. Crayotter organizes production into three phases: coverage-aware material preparation, artifact-based editing research, and tool-grounded timeline execution. Each phase externalizes inspectable artifacts, including coverage reports, multimodal analyses, editing blueprints, tool calls, and intermediate renders. These artifacts make an editing run traceable and allow failed segments to be diagnosed and selectively revised instead of requiring a full restart. We evaluate Crayotter on 23 editing themes against CapCut-Mate and CutClaw. Under human evaluation, Crayotter achieves an average score of 3.40/5, compared with 2.44 and 1.70 for the two baselines, with consistent gains in theme alignment, narrative coherence, and editing smoothness. We additionally describe a replayable trajectory schema and verifiable reward design that prepare these workflows for future policy optimization. Code, traces, and examples are publicly available at https://github.com/idwts/Crayotter.

2606.07635 2026-06-09 cs.CV cs.AI New submission

NeuroAlign: Hierarchical Multimodal Fusion of Dynamic and Structural Neuroimaging for MCI Analysis

Xiongri Shen, Zhenxi Song, Jiaqi Wang, Yi Zhong, Leilei Zhao, Chenqi Xu, Linling Li, Yichen Wei, Lingyan Liang, Demao Deng, Luping Song, Ping Luan, Ahmed M. Anter, Shuqiang Wang, Baiying Lei, Zhiguo Zhang

Affiliations: Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen); School of Intelligence Science and Engineering, College of Artificial Intelligence, Harbin Institute of Technology, Shenzhen; School of Artificial Intelligence, Beijing University of Posts and Telecommunications; Guangdong Key Laboratory of Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University; Department of Radiology, The People’s Hospital of Guangxi Zhuang Autonomous Region, Guangxi Academy of Medical Sciences; Shenzhen Sixth People’s Hospital (Nanshan Hospital), Huazhong University of Science and Technology Union Shenzhen Hospital; School of Basic Medical Sciences, Shenzhen University; Egypt-Japan University of Science and Technology (E-JUST); School of Biomedical Engineering, National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, Shenzhen University Medical School

AI summary: Proposes the NeuroAlign framework, which fuses fMRI and DTI features through Dual-Modal Hierarchical Alignment and Dual-Domain Hierarchical Interaction for MCI/SCD detection, and designs SAM, a gradient-free attribution method, for feature analysis.

Abstract

Multimodal neuroimaging fusion of functional MRI (fMRI) and diffusion tensor imaging (DTI) provides complementary information for cognitive impairment analysis, but remains challenged by heterogeneous feature spaces and misaligned representations. We propose NeuroAlign, a hierarchical framework for structured multimodal fusion. It introduces (1) Dual-Modal Hierarchical Alignment (DMHA), which models multi-scale dynamic connectivity and aligns dynamic-static and functional-structural embeddings; and (2) Dual-Domain Hierarchical Interaction (DDHI), which enables fine-grained modulation and global interaction between connectivity- and region-level features. To support feature-level inspection, we design Synergistic Activation Mapping (SAM), a gradient-free, marker-oriented attribution method for DFC, SFC, ALFF, and FA. Evaluated on GUTCM, ADNI, and OASIS under five-fold validation, NeuroAlign achieves competitive MCI/SCD detection and preliminary cross-dataset transferability. Attribution analyses reveal modality-specific and partially consistent brain patterns, providing model-derived evidence for multimodal representation analysis.

2606.07633 2026-06-09 cs.CV cs.AI New submission

AMN: An Adaptive Multi-Scale Fusion Network with Boundary and Uncertainty Modeling for Nuclei Segmentation

Spoorthi M, Suja Palaniswamy

Affiliations: Department of Computer Science & Engineering, Amrita School of Computing, Bengaluru, Amrita Vishwa Vidyapeetham, India

AI summary: Proposes AMN, a dual-encoder segmentation framework that fuses Swin Transformer and ResNet-50 feature pyramids with dynamic gated weighting and a multi-objective loss, achieving a mean Dice of 0.82 and mean F1 of 0.68 on the CoNIC benchmark and outperforming eight baseline models.

Abstract

Accurate classification of nuclei subtypes in histopathology images is critical for downstream tasks including tumor grading, immune infiltrate quantification, and prognosis prediction. Existing approaches rely on either convolutional or transformer-based encoders in isolation, limiting their ability to simultaneously capture fine-grained local texture and long-range spatial context. We present AMN (Adaptive Multi-Scale Nuclei Network), a dual-encoder segmentation framework that jointly leverages a Swin Transformer and a ResNet-50 feature pyramid, fused via a learned per-channel gating mechanism that dynamically weighs each encoder's contribution at every scale. AMN is trained with a multi-objective loss combining class-weighted focal loss, boundary-aware loss with positive-pixel emphasis, and a novel uncertainty-modulated classification term that suppresses overconfident erroneous predictions. Evaluated on the CoNIC benchmark across seven nuclei classes, AMN achieves a mean Dice of 0.82 and mean F1 of 0.68, with an F1 of 0.67 on the diagnostically challenging lymphocyte class. AMN outperforms eight baseline models spanning pure-CNN, pure-transformer, and recent hybrid architectures: U-Net, ResU-Net, DeepLabV3+, SegNet, ViT-Small, HmsU-Net, ConvFormer-UNet, and BEFUnet. Cross-dataset evaluation on MoNuSeg demonstrates strong generalization without retraining, validating the domain robustness of the learned representations.
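
A per-channel gate of the kind described above is commonly written as a convex combination of the two encoders' features, and the Dice score reported in the abstract has a standard definition. The sketch below illustrates both under stated assumptions: the gate logits are passed in directly (in a trained network they would be produced by learned layers), and the function names are hypothetical rather than taken from AMN.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(feat_a, feat_b, gate_logits):
    """Per-channel fusion: g * a + (1 - g) * b, with g = sigmoid(gate logit)."""
    fused = []
    for a, b, w in zip(feat_a, feat_b, gate_logits):
        g = sigmoid(w)
        fused.append(g * a + (1.0 - g) * b)
    return fused


def dice(pred, target):
    """Dice coefficient for two binary masks given as flat 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 if denom == 0 else 2.0 * intersection / denom


# Channel 0 is split evenly; channel 1 is taken almost entirely from feat_a.
print(gated_fusion([1.0, 1.0], [3.0, 3.0], [0.0, 50.0]))
print(dice([1, 1, 0, 0], [1, 0, 0, 0]))
```

Because each gate lies between 0 and 1, the fused value for a channel always stays between the two encoder outputs for that channel.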

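2606.07632 2026-06-09 cs.LG New submission

Evaluation of ML Resource Utilization Requires Model Life Cycle Assessment

Jared Fernandez, Clara Na, Yonatan Bisk, Constantine Samaras, Emma Strubell

Affiliations: GitHub; arXiv

AI summary: Proposes applying life cycle assessment to account comprehensively for the resource consumption and environmental impact of AI systems across the full chain from hardware manufacturing to training and inference, addressing the shortcomings of traditional evaluations that consider only training or inference cost.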

Comments ICML 2026: Position Paper Track

2606.07631 2026-06-09 cs.LG cs.AI cs.CY New submission

Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning

Huy Nghiem, Sy-Tuyen Ho, Sarah Wiegreffe, Hal Daumé

Affiliations: University of Maryland

AI summary: Proposes monitoring emergent misalignment during supervised finetuning using trait directions in activation space; low-dimensional geometric features enable efficient detection, reaching 0.990 AUROC on 7-9B models.

Comments First version. 45 pages

Abstract

Emergent misalignment (EM) occurs when narrow finetuning causes a model to behave dangerously outside the finetuning task. Standard training signals can miss this shift, making reliable detection costly if it depends on repeated behavioral evaluation. We ask whether emergent misalignment can instead be detected from internal representations during finetuning. Using seven alignment-relevant traits encoded as linear directions in activation space, we track representational drift across training checkpoints in four open-source 7-9B LLMs. EM-relevant drift concentrates on a low-dimensional axis that explains 65.5% of the variance, revealing a geometric signature in the studied regime. A low-overhead monitor built on this drift profile detects dangerous checkpoints with 2.2% false negative rate, 2.9% false positive rate, and 0.990 AUROC on held-out perturbation types, outperforming unsupervised PCA and SAE baselines. Stress tests on two 14B models, longer finetuning runs, and misaligned starting points identify key deployment boundaries. These results position trait-space monitoring as a practical complement to behavioral evaluation for EM detection during LoRA-based finetuning, while showing that deployment across substantially different regimes may require recalibration.
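
The core measurement, tracking drift along trait directions, can be sketched as a projection of the activation change between two checkpoints onto each direction. This is only an illustration of that one idea: the trait names, the two-dimensional activations, and the max-absolute-drift threshold rule are assumptions for this example, and the paper's monitor is built on the full drift profile in a way the abstract does not specify.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    return math.sqrt(dot(v, v))


def trait_drift(base_act, ckpt_act, trait_dirs):
    """Project the activation change between a base model and a checkpoint
    onto each trait direction (directions are normalized to unit length)."""
    delta = [c - b for b, c in zip(base_act, ckpt_act)]
    return {name: dot(delta, d) / norm(d) for name, d in trait_dirs.items()}


def flag_checkpoint(drift, threshold):
    """Hypothetical alarm: flag when any trait drifts beyond the threshold."""
    return max(abs(v) for v in drift.values()) > threshold


# Toy activations and two made-up trait directions.
directions = {"trait_a": [1.0, 0.0], "trait_b": [0.0, 2.0]}
drift = trait_drift([0.0, 0.0], [3.0, 4.0], directions)
print(drift, flag_checkpoint(drift, threshold=3.5))
```

A check like this needs only a forward pass and a few dot products per checkpoint, which is why it is cheaper than repeated behavioral evaluation.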

2606.07630 2026-06-09 cs.LG cs.AI stat.ML New submission

Active Learning with Foundation Model Priors: Efficient Learning under Class Imbalance

Jiancheng Zhang, Meiqing Li, Qi Zhang, Yinglun Zhu

Affiliations: University of California, Riverside; Carnegie Mellon University; Worcester Polytechnic Institute

AI summary: Targeting class imbalance and noisy annotations in real-world data, proposes an active learning framework that leverages foundation model priors and selects the most informative samples through imbalance-aware co-decisions, achieving over 50% annotation savings on image and text datasets.

Comments To appear at ICML 2026

Abstract

Real-world datasets across image and text domains are often characterized by skewed class distributions and noisy annotations, which jointly degrade model performance, particularly on minority classes. Among existing solutions, active learning offers an effective and efficient paradigm by selectively querying the most informative and balanced samples for annotation. We propose an innovative active learning framework that mitigates class imbalance and selects the most informative samples to annotate. Leveraging foundation model priors, our algorithm enables imbalance-aware co-decisions between a foundation model and a small model to tackle noisy and imbalanced labels across various domains. We introduce the first study to systematically explore active learning under the dual challenges of label noise and class imbalance across image and text domains. Extensive experiments on imbalanced datasets demonstrate that our method achieves substantial annotation savings (over 50% compared to the best active learning baseline) while preserving performance and robustness to label noise.
