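arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.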

2606.07128 2026-06-08 cs.LG New submission

A machine-learning-assisted progressive digit-randomness screening framework for detecting non-random patterns in raw numerical research data

Zhuphua Cao

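AI summary: Proposes the FDRS framework, which combines statistical and machine-learning methods to detect non-random digit patterns in numerical data. Validated on an enzymatic absorbance dataset and simulated irregular data, it stratifies risk into grades effectively.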

Abstract

Raw numerical datasets remain less systematically examined in integrity screening than images, plagiarism, or summary-statistic inconsistencies. We developed the Fabrication-risk Digit Randomness Screening model (FDRS), a statistical and machine-learning framework for detecting non-random digit-pattern irregularities in numerical research data. FDRS integrates single- and joint-decimal-digit tests, Cramer's V, entropy metrics, Kullback-Leibler divergence, digit-preference indices, progressive subsampling, and semi-supervised risk scoring. It was evaluated using an instrument-derived enzymatic absorbance dataset (RawData, n=253) and a blinded manually simulated irregular dataset (ErrData, n=255). RawData showed no significant deviation in single third-decimal-digit analysis, whereas ErrData showed a significant deviation. In joint third-fourth decimal digit analysis, ErrData showed higher Cramer's V, lower normalized entropy, higher KL divergence, and a more persistent progressive-subsampling deviation signal. In internal validation, Elastic-net Logistic Regression achieved the highest AUC (0.98395) and lowest Brier score (0.048439), while Random Forest achieved the highest accuracy (0.926667) and balanced accuracy (0.935). RawData received a low ensemble risk score of 0.124627 and was classified as Grade 0; ErrData received a score of 0.740760 and was classified as Grade 3. External real-world benchmarks supported graded risk stratification: three datasets without identified public post-publication concerns were classified as Grade 0 or 1, whereas two datasets from publicly questioned or institutionally handled articles were classified as Grade 2 or 3. FDRS can prioritize raw numerical datasets for further review by integrating interpretable statistical and machine-learning features. It is an auxiliary digit-structure screening tool, not standalone evidence of fabrication or misconduct.
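
To make the digit-level statistics named in the abstract concrete, here is a minimal sketch that computes a chi-square statistic against a uniform digit distribution, the normalized Shannon entropy, and the KL divergence for the third decimal digit. It is not the FDRS implementation: the function names, the uniform reference distribution, and the string-based digit extraction are assumptions made for illustration. The joint third-fourth digit tests, Cramér's V, progressive subsampling, and risk scoring are not shown.

```python
import math
from collections import Counter
from decimal import Decimal


def third_decimal_digit(value: str) -> int:
    """Return the third digit after the decimal point of a numeric string."""
    # Decimal avoids binary floating-point artifacts when shifting the digits.
    shifted = Decimal(value).copy_abs() * 1000
    return int(shifted) % 10


def digit_stats(values):
    """Compare observed third-decimal digits with a uniform reference (an assumption)."""
    digits = [third_decimal_digit(v) for v in values]
    n = len(digits)
    counts = Counter(digits)
    expected = n / 10
    # Pearson chi-square statistic against equal expected counts per digit.
    chi2 = sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(10))
    probs = [counts.get(d, 0) / n for d in range(10)]
    # Shannon entropy divided by log(10), so 1.0 means perfectly uniform digits.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    normalized_entropy = entropy / math.log(10)
    # KL divergence from the observed distribution to the uniform reference.
    kl = sum(p * math.log(p / 0.1) for p in probs if p > 0)
    return {"chi2": chi2, "normalized_entropy": normalized_entropy, "kl_divergence": kl}
```

Readings are passed as strings (for example `digit_stats(["0.1234", "0.5678"])`) so that trailing zeros recorded by the instrument are not lost in a float conversion.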

2606.07115 2026-06-08 cs.CV cs.GR New submission

3DMorph: Single-Image-Guided Local 3D Shape Editing and Morphing

Tobias Preintner, Yunfei Deng, Phillip Müller, Sebastian Illing, Adrian König, Thomas Bäck, Elena Raponi, Niki van Stein

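AI summary: Proposes 3DMorph, a training-free framework that uses a single edited image to automatically localize the affected region and transfer the 2D modification to the corresponding local 3D region, while also supporting intermediate shape generation. It outperforms existing methods on the Delta3D benchmark.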

Comments: Accepted to IJCNN 2026

Abstract

Despite recent progress in 3D generation, intuitive editing of existing shapes remains limited. Unlike images, which benefit from well-established inpainting tools, general 3D objects such as meshes still lack simple and effective methods for local shape editing. Existing approaches are often global, domain-specific, require complex user interaction, or focus on appearance (color and texture) rather than geometry. We introduce 3DMorph, a training-free framework for single-image-guided local 3D shape editing and morphing. Given an edited image showing a desired shape modification, our method automatically localizes the relevant 3D region and transfers 2D modifications to 3D while preserving unmodified areas. 3DMorph also enables intermediate shape generation between the original and edited objects, facilitating design exploration. To benchmark editing quality, we introduce Delta3D, an image-guided local 3D editing benchmark with paired ground-truth edits. Experimental results show that 3DMorph translates intuitive 2D edits into 3D, outperforming state-of-the-art generative and editing methods.

2606.07113 2026-06-08 cs.AI New submission

Beyond Post-hoc Explanation: Toward Glassbox AI via Probabilistic Mediation

Manuele Leonelli

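AI summary: To address the opacity of large language models in critical domains, proposes the Glassbox Framework, which uses Bayesian networks as an ante-hoc mediation layer to enable auditable reasoning, uncertainty quantification, and contestable outputs.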

Abstract

Large language models are rapidly becoming infrastructural components in high-stakes institutional settings, including public administration, legal reasoning, and healthcare, where opacity is not merely inconvenient but institutionally and legally untenable. Existing approaches to explainability are predominantly post-hoc, offering unstable, non-contestable accounts that have no formal relationship to the reasoning process that produced the output. We argue that the problem is not the absence of explanation but the absence of structured reasoning in the first place. This paper makes the case for a fundamentally different architecture, which we call the Glassbox Framework, in which Bayesian networks serve as transparent, ante-hoc mediation layers for generative models. Bayesian networks encode domain knowledge, causal assumptions, and probabilistic dependencies before inference occurs, enabling auditable reasoning traces, uncertainty quantification, and contestable outputs. We characterise the architecture of this framework and ground it in a benefit eligibility scenario, identifying the foundational challenges spanning semantic alignment, dynamic model construction, probabilistic grounding, and human governance that must be solved to realise it at scale. By shifting from post-hoc explanation to ante-hoc probabilistic mediation, this work outlines a principled path toward AI systems that are not only powerful but fundamentally accountable.
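
As a rough illustration of what an auditable probabilistic layer could look like in a benefit eligibility setting, the sketch below runs exact inference by enumeration on a three-node Bayesian network and returns the full list of weighted cases alongside the posterior. The variables, the network structure, and every probability are hypothetical and are not taken from the paper. The coupling to a generative model is not shown.

```python
from itertools import product

# Hypothetical priors and conditional probability tables (CPTs).
P_LOW_INCOME = 0.3
P_RESIDENT = 0.9
P_ELIGIBLE = {  # P(eligible | low_income, resident)
    (True, True): 0.95,
    (True, False): 0.05,
    (False, True): 0.10,
    (False, False): 0.01,
}
# P(submitted document indicates low income | low_income)
P_DOC_GIVEN_INCOME = {True: 0.8, False: 0.1}


def posterior_eligible(doc_flag: bool):
    """Return P(eligible | document evidence) and the enumerated reasoning trace."""
    joint_eligible = 0.0
    evidence = 0.0
    trace = []
    for low, resident in product([True, False], repeat=2):
        p_low = P_LOW_INCOME if low else 1 - P_LOW_INCOME
        p_res = P_RESIDENT if resident else 1 - P_RESIDENT
        p_doc = P_DOC_GIVEN_INCOME[low] if doc_flag else 1 - P_DOC_GIVEN_INCOME[low]
        weight = p_low * p_res * p_doc
        evidence += weight
        joint_eligible += weight * P_ELIGIBLE[(low, resident)]
        # Each trace row records one hidden-state case, its weight, and the CPT entry used.
        trace.append(((low, resident), weight, P_ELIGIBLE[(low, resident)]))
    return joint_eligible / evidence, trace
```

Because every case and CPT entry appears in the trace, a reviewer can contest a specific number (for instance the assumed reliability of the document) and recompute the posterior.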

2606.07100 2026-06-08 cs.CV cs.RO New submission

LARA: Latent Action Representation Alignment for Vision-Language-Action Models

Mengya Liu, Baoxiong Jia, Jiangyong Huang, Jingze Zhang, Siyuan Huang

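AI summary: Proposes the LARA framework, which jointly optimizes latent action models and vision-language-action models through representation alignment and uses human video data to improve robot manipulation, with average gains of about 10%, 5%, and 15% on simulation and real-world benchmarks.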

Abstract

Vision-language-action (VLA) models enable robots to predict actions directly from observations and language instructions, but their performance depends on large-scale, high-quality data and is limited by the scarcity of real-world robot action datasets. To facilitate VLA model learning with abundant unlabeled human videos, Latent Action Models (LAMs) learn latent action representations from visual dynamics to provide additional supervision for VLA learning. However, LAMs and VLAs are typically trained separately, leaving the LAM ungrounded during VLA training and the VLA model constrained by frozen LAM representations. To address these issues, we propose Latent Action Representation Alignment (LARA), a plug-and-play framework that jointly optimizes LAM and VLA via representation alignment. This yields reciprocal benefits: LAMs learn with action trajectories to avoid spurious visual changes, while VLAs are regularized by forward dynamics learned within LAMs to reduce hallucinations of functionally ineffective trajectories. We demonstrate LARA's versatility and effectiveness for pre-training, post-training enhancement of pre-trained VLA models, and LAM refinement, achieving average improvements of ~10%, ~5%, and ~15% across three simulation benchmarks and one meticulously designed real-world robotic manipulation benchmark.

2606.07090 2026-06-08 cs.CV New submission

Detecting Temporally Localized Manipulations in Authentic Video Streams

Okan Umur, Ali Emre Güşlü, Ibrahim Delibasoglu

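AI summary: Addresses the difficulty of detecting short, realistic manipulated segments inserted into authentic videos. Proposes a new dataset and evaluates two methods, a linear probe on DINOv3 features and a consecutive-frame similarity method, to establish an initial benchmark.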

Abstract

The rapid advancement of video editing and generative artificial intelligence technologies has made realistic video manipulation increasingly accessible. Although existing datasets have significantly advanced research in deepfake detection, object removal, and video inpainting, they do not adequately model scenarios in which a short manipulated segment is inserted into an otherwise authentic video and the original video continues afterward. In this study, we review representative datasets from the literature, analyze their characteristics, and discuss their limitations with respect to temporally localized realistic manipulation detection. Based on this analysis, we motivate the need for a new dataset specifically designed for authentic videos containing short and highly realistic manipulated intervals. Finally, we evaluate two complementary approaches on our custom-curated test set to establish an initial benchmark for this challenging scenario. The first employs a linear probe on DINOv3 features, assessed under three thresholding strategies. The second leverages DINOv3 features with a consecutive frame similarity-based method to detect temporal manipulation boundaries. Together, these experiments provide an initial benchmark for partially manipulated video detection and highlight the need for content-adaptive thresholding mechanisms. The dataset, code, and supplementary materials are publicly available at https://github.com/OkanUmur/temporally-localized-video-manipulation-detection.
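
The consecutive-frame similarity idea can be sketched as follows: compute the cosine similarity between the feature vectors of adjacent frames and flag a candidate boundary wherever it drops below a threshold. This is a simplified stand-in, not the authors' code. The per-frame features are assumed to be precomputed (for example by a DINOv3 backbone), and the fixed threshold of 0.8 is a hypothetical value.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_boundaries(features, threshold=0.8):
    """Return frame indices i where similarity(frame i-1, frame i) falls below threshold."""
    return [
        i
        for i in range(1, len(features))
        if cosine(features[i - 1], features[i]) < threshold
    ]
```

A single global threshold like this one will behave differently on static and fast-moving footage, which is consistent with the abstract's call for content-adaptive thresholding.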

2606.07086 2026-06-08 cs.CV cs.LG New submission

An Adaptive Data cleaning Framework for Noisy Label Detection

Chen-Hsuan Fang, Wei-Hsinag Chen, Pin-Hsuan Yu, Jung-Hua Wang, Tsung-Wei Pan

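AI summary: Proposes a self-adaptive data-cleaning framework that needs no manual thresholds. It fuses local, global, and learning-dynamics metrics and detects noisy labels through multi-metric clustering in feature space, markedly improving recall and model accuracy on CIFAR-10, MNIST, and ImageNet-100.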

Abstract

Deep neural networks (DNNs) excel in computer vision tasks given large annotated datasets. In real-world applications, however, labels are often corrupted by ambiguity, human error, or dynamic environments. Over-parameterized DNNs easily memorize these noisy labels during training, degrading model accuracy and generalization. Existing data-cleaning and sample-selection strategies often rely on manually specified thresholds, prior knowledge of the noise ratio, or a single metric (either learning dynamics or geometric structure), making them unstable in complex data regimes. This paper proposes a self-adaptive data-cleaning framework that integrates local, global, and learning dynamics cues for robust noisy-label detection. Samples are mapped into a unified low-dimensional feature space through a modular feature concatenation paradigm. We provide two instantiations: a 2D metric integrating class-adaptive KNN-based local disagreement with k-means-based global centroid distance, and a 3D multi-metric that additionally incorporates a z-normalized score. Unlike conventional 1D Gaussian Mixture Models applied to a single scalar metric, our framework performs multi-metric clustering on the feature space to adaptively partition samples into clean-dominant and noise-dominant components without requiring manual thresholds or noise priors. Experiments on CIFAR-10, MNIST, and ImageNet-100 with 5% to 40% symmetric label noise show high recall across settings, including near-perfect recall (>=98%) on ImageNet-100 at 40% noise. Subsequent training yields accuracy gains across evaluated settings, especially under severe corruption on ImageNet-100. These findings suggest that multi-metric integration provides a threshold-free, practical, and low-tuning strategy for noisy label detection.
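
The two ingredients of the 2D metric can be sketched in a few lines: a local score (the fraction of a sample's k nearest neighbors that carry a different label) and a global score (the distance to a centroid of the sample's labeled class). This is an illustrative simplification, not the paper's method. It uses a fixed k rather than a class-adaptive one, takes the plain class mean instead of a k-means centroid, and omits the clustering step that separates clean-dominant from noise-dominant samples.

```python
import math


def knn_disagreement(points, labels, k=3):
    """Fraction of each point's k nearest neighbors whose label differs from its own."""
    scores = []
    for i, p in enumerate(points):
        neighbors = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        scores.append(sum(labels[j] != labels[i] for j in neighbors) / k)
    return scores


def centroid_distance(points, labels):
    """Euclidean distance from each point to the mean of all points sharing its label."""
    centroids = {}
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        centroids[lab] = [sum(c) / len(members) for c in zip(*members)]
    return [math.dist(p, centroids[l]) for p, l in zip(points, labels)]
```

Pairing the two scores per sample, for example `list(zip(knn_disagreement(X, y), centroid_distance(X, y)))`, gives the kind of low-dimensional feature space on which a clustering model could then be fitted.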

2606.07067 2026-06-08 cs.RO New submission

Extending Responsibility-Sensitive Safety for the Assessment of Offloaded Autonomous Driving Services

Robin Dehler, Aryan Thakur, Michael Buchholz

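AI summary: Addresses the safety challenge that V2X communication makes response times variable when autonomous driving functions are offloaded. Extends the definition of Responsibility-Sensitive Safety, proposes offloading decisions and a fallback mechanism based on safety constraints, and introduces a warm-standby phase to make fallback safer.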

Comments: 8 pages; accepted for 2026 IEEE 29th International Conference on Intelligent Transportation Systems (ITSC), Naples, Italy, September 15-18, 2026 - DOI will be added after publication

Abstract

Safety is a fundamental requirement in the development of autonomous driving (AD) systems. While function offloading has demonstrated significant benefits in terms of computational efficiency and energy consumption, its application to safety-critical AD functionality introduces new challenges. In particular, offloaded service compositions incur increased and variable response times due to wireless vehicle-to-everything (V2X) communication, which directly affects the vehicle's reaction time and thus its safety guarantees. In this paper, we address this challenge by extending the definitions of Responsibility-Sensitive Safety (RSS) to explicitly account for different response times of local and offloaded AD service compositions. Based on this extension, we propose an integration into function offloading, using the RSS safety constraints for offloading decision-making and fallback mechanisms. Offloaded service compositions are only permitted if the current traffic situation remains safe under the corresponding end-to-end response time. If this condition is violated, the system performs a controlled fallback to local execution. Furthermore, we introduce an enhanced fallback strategy that includes a warm-standby phase for offloaded services, enabling faster and safer transitions from offloaded to local services. The proposed approach is integrated into our AD stack and evaluated in both simulation and the real world. Experimental results demonstrate that the proposed method improves safety compared to state-of-the-art function offloading and safety frameworks, while preserving the benefits of distributed computation when safety conditions allow.
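
To show why response time matters, the sketch below evaluates the standard RSS minimum longitudinal gap for two response times and permits offloading only when the current gap is still safe under the slower, offloaded response time. The formula is the original RSS same-direction rule. The decision function, its return labels, and all parameter values are hypothetical simplifications, and the paper's extended definitions and warm-standby phase are not reproduced.

```python
def rss_min_gap(v_rear, v_front, response_time,
                a_max_accel=3.0, b_min_brake=4.0, b_max_brake=8.0):
    """Standard RSS minimum safe longitudinal gap in metres (speeds in m/s, time in s)."""
    # Worst case: the rear car accelerates for the whole response time, then brakes gently,
    # while the front car brakes as hard as possible.
    v_after = v_rear + response_time * a_max_accel
    gap = (
        v_rear * response_time
        + 0.5 * a_max_accel * response_time ** 2
        + v_after ** 2 / (2 * b_min_brake)
        - v_front ** 2 / (2 * b_max_brake)
    )
    return max(gap, 0.0)


def choose_execution(gap, v_rear, v_front, rho_local, rho_offloaded):
    """Hypothetical rule: offload only if the gap is safe under the offloaded response time."""
    if gap >= rss_min_gap(v_rear, v_front, rho_offloaded):
        return "offloaded"
    if gap >= rss_min_gap(v_rear, v_front, rho_local):
        return "local"
    return "unsafe"
```

With both cars at 20 m/s, a 0.5 s local response time needs about 43.2 m of gap and a 1.0 s offloaded response time needs about 62.6 m, so a 50 m gap would trigger a fallback to local execution under this rule.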

2606.07034 2026-06-08 cs.CV New submission

ForensicConcept: Transferable Forensic Concepts for AIGI Detection

Menyanshu Zhou, Ziyin Zhou, Ke Sun, Yunpeng Luo, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

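AI summary: Proposes the ForensicConcept framework, which localizes key patches via Transformer attribution, builds a concept codebook, and aligns with diffusion features to extract transferable forensic concepts for AI-generated image detection and transfer them across backbones.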

Comments: Accepted by ICML 2026

Abstract

AI-generated image detectors achieve high accuracy on in-distribution data but often fail on unseen generators. A key obstacle to understanding this failure is the black-box nature of current detectors: they do not reveal which evidence drives their decisions. We propose ForensicConcept, a framework that extracts explicit forensic concepts from detectors and enables their transfer across backbones. Our method localizes decision-critical patches via Transformer attribution, clusters them into a compact concept codebook, and uses a concept-aligned projection to produce auditable evidence readouts. Motivated by prior studies showing that DINO representations can guide diffusion generation and exhibit concept-level correspondence with diffusion features, we introduce a generation-trace reference based on CleanDIFT diffusion features and quantify backbone-trace alignment via neighborhood-structure consistency (CKNNA). We further propose concept codebook injection to transfer diffusion-derived concepts into target backbones. Experiments on GenImage, GAN-family, and Chameleon benchmarks show consistent improvements over prior methods. We also find that CKNNA alignment predicts transfer effectiveness, providing a principled explanation for why some backbones yield more transferable forensic evidence than others.

2606.07032 2026-06-08 cs.CV cs.AI New submission

Never Seen Before: Benchmarking Genuine Zero-Shot Composed Image Retrieval with Consistent Video-Sourced Datasets

Zhenyu Yang, Zemin Du, Shengsheng Qian, Changsheng Xu

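AI summary: Existing zero-shot composed image retrieval datasets pair unrelated reference and target images and are not genuinely zero-shot. Proposes the ZeroSight benchmark, which comprises consistent reference-target pairs sourced from videos and SC4CIR, a training-free MLLM-driven method that identifies hard negatives through three symmetric consistency checks. Experiments show that the performance of existing methods is overestimated.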

Abstract

Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption without training samples. Existing ZS-CIR datasets often suffer from complete irrelevance between reference and target images due to noisy image sources, and do not achieve a true zero-shot scenario as they use public image datasets that models like CLIP have been trained on. To tackle these challenges, we introduce ZeroSight, a novel benchmark for ZS-CIR. It includes a dataset with consistent reference-target pairs sourced from videos, a data construction pipeline, and evaluation methods that consider the ranking of multiple positive and negative target images. We ensure visually and semantically consistent reference-target pairs by extracting frames from a single video and generating relative captions using LLM-assisted methods. To ensure a true zero-shot scenario, we use video data published after March 31, 2022, ensuring it was not included in CLIP's pre-training data. Additionally, we propose a training-free MLLM-driven method, SC4CIR (Symmetric Consistency for CIR), which can effectively identify hard negative targets through 3 symmetric consistency checks. This method is plug-and-play, seamlessly integrating with various CIR methods and significantly improving performance. Our experimental results from 27 methods reveal that current ZS-CIR datasets and evaluation metrics result in inflated retrieval performance, exaggerating the capabilities of CIR methods. Our benchmark and models can be accessed at https://github.com/sotayang/ZeroSight.

2606.07031 2026-06-08 cs.LG New submission

CF-JEPA: Mask-free forward prediction with asymmetric encoder utilization for time-series representation learning

Jaehoon Lee, Sunghyun Sim

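AI summary: Proposes CF-JEPA, a mask-free framework that replaces masking with multi-horizon forward prediction and uses the temporal ordering of time series as the learning signal. It also exploits the asymmetry between the online encoder and the EMA target encoder by routing each task to the better-suited encoder, achieving leading performance on multiple benchmarks.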

Abstract

Self-supervised learning (SSL) for time-series representation learning is dominated by two paradigms: contrastive methods, which face challenges in constructing positive or negative pairs, and masking-based methods, which disrupt the temporal continuity of time-series signals. Joint-Embedding Predictive Architecture (JEPA) offers a promising alternative by predicting in representation space rather than reconstructing raw inputs. However, existing time-series JEPA variants still rely on masking and therefore inherit its continuity problem. Crop-based Forward JEPA (CF-JEPA) is proposed as an innovative mask-free framework that replaces masking with multi-horizon forward prediction: random crops serve as context views, and short-, mid-, and long-horizon future representations are predicted in the forward temporal direction, directly leveraging the inherent temporal ordering of time-series data as a learning signal. A strong asymmetry is also identified between the online encoder and the exponential moving average (EMA) target encoder, both produced from a single training run: the online encoder develops higher-rank discriminative features, while the EMA target encoder develops smoother, lower-rank temporal features. Exploiting this asymmetry, classification is routed to the online encoder and forecasting or anomaly detection to the EMA target encoder, achieving a 27% reduction in multivariate forecasting mean squared error (MSE) at no additional training cost. Across 126 University of California, Riverside (UCR) and 26 University of East Anglia (UEA) classification datasets, eight electricity transformer temperature forecasting benchmarks, and Key Performance Indicator (KPI)/Yahoo anomaly detection, CF-JEPA achieves the highest average accuracy and rank on UCR and UEA among self-supervised baselines and ranks second on univariate forecasting and k-nearest-neighbors-scored anomaly detection.
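
Two mechanics mentioned in the abstract can be sketched without any deep-learning library: sampling a random context crop together with future windows at several horizons, and the exponential moving average update that produces the target encoder. This is a sketch under stated assumptions. The horizon lengths, the choice to start every target window at the end of the context crop, and the flat list of parameters are all hypothetical; the encoders and the prediction loss are not shown.

```python
import random


def sample_views(series_len, crop_len, horizons=(8, 32, 96), rng=random):
    """Pick a random context crop and one future window per horizon (index ranges)."""
    max_start = series_len - crop_len - max(horizons)
    if max_start < 0:
        raise ValueError("series too short for the requested crop and horizons")
    start = rng.randint(0, max_start)  # randint includes both endpoints
    context = (start, start + crop_len)
    # Assumption: each target window begins where the context ends and spans h steps.
    targets = {h: (context[1], context[1] + h) for h in horizons}
    return context, targets


def ema_update(target_params, online_params, momentum=0.99):
    """Standard EMA step: target <- momentum * target + (1 - momentum) * online."""
    return [
        momentum * t + (1.0 - momentum) * o
        for t, o in zip(target_params, online_params)
    ]
```

Because the EMA target is a running average of the online weights, both encoders come out of one training run, which is what makes the task routing described in the abstract free of extra training cost.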

2606.07007 2026-06-08 cs.LG cs.AI New submission

A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders

Chenhao Zhang, Chris Lin, Su-In Lee

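AI summary: Proposes a unified mathematical framework that formalizes concept learning as a set-alignment problem, distinguishes three strengths of learning (detection, separation, and approximation), gives geometric conditions and error bounds, and connects concept learning with neuron interpretation through formal concept analysis.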

Abstract

We propose a unified mathematical framework for a geometric understanding of concept learning and neuron interpretation in sparse autoencoders (SAEs). While SAEs improve interpretability of neural networks by learning sparse feature representations, a principled definition of "concept" and "learning" remains unclear. We formalize concepts as sets of data points and cast concept learning as a set-alignment problem between human-defined and model-induced concepts. This formulation distinguishes three increasingly strong notions of learning (detection, separation, and approximation) and yields geometric conditions, error bounds, and capacity constraints for when concepts can be represented by individual neurons or multi-neuron units. It also provides a set-theoretic account for common SAE phenomena, including feature splitting, feature absorption, feature families, and hierarchical concepts. Finally, we connect concept learning and neuron interpretation through formal concept analysis, showing that the two directions need not agree and that their many-to-many structure can be organized by concept lattices. Experiments on synthetic data with ReLU and Top-K SAEs illustrate the theory and reveal the effects of SAE size and sparsity on concept learning.
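
The set view of concepts can be illustrated with a small sketch: a Top-K activation keeps the K largest pre-activations, a neuron's induced concept is the set of samples on which it fires, and overlap with a human-defined set measures alignment. The Jaccard index used here is an assumed stand-in for illustration only. The paper's formal definitions of detection, separation, and approximation are not reproduced.

```python
def top_k_activation(pre_activations, k):
    """Keep the k largest pre-activations and set the rest to zero."""
    order = sorted(
        range(len(pre_activations)),
        key=lambda i: pre_activations[i],
        reverse=True,
    )
    kept = set(order[:k])
    return [v if i in kept else 0.0 for i, v in enumerate(pre_activations)]


def neuron_concept(codes, neuron, threshold=0.0):
    """Model-induced concept: indices of samples on which the neuron exceeds threshold."""
    return {s for s, code in enumerate(codes) if code[neuron] > threshold}


def jaccard(a, b):
    """Set overlap in [0, 1]; an assumed alignment measure, not the paper's criterion."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

For example, if neuron 1 fires on samples {0, 2} and the human-defined concept is also {0, 2}, the overlap is 1.0; a concept split across two neurons would show up as two partial overlaps instead.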

2606.06996 2026-06-08 cs.RO cs.DC New submission

Mission-Level Runtime Assurance Framework for Autonomous Driving

Chieh Tsai, Salim Hariri

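AI summary: Proposes a runtime assurance framework that evaluates both driving safety and mission-completion capability, using a monitoring system to reject infeasible commands. Experiments show it outperforms approaches that consider platform safety alone.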

Abstract

This paper studies runtime safety for autonomous driving when high-level driving commands become faulty or unreliable. Unlike conventional runtime-safety approaches that mainly focus on immediate vehicle safety, the proposed framework evaluates both driving safety and whether the vehicle can still successfully complete its mission before a command is executed. The framework extends highway-env with mission-level fault scenarios such as skipping required checkpoints, entering restricted areas, and generating future routes that can no longer complete the mission successfully. A runtime monitoring system is introduced to detect and reject unsafe or mission-infeasible commands before execution. For comparison, an adapted Simplex-Drive runtime-safety baseline with learning-based driving control, safety fallback control, and runtime controller switching is implemented using the public Simplex-Drive framework. Experimental results show that platform-level runtime safety alone cannot detect mission-level planning faults, while the proposed framework successfully rejects mission-infeasible commands and improves mission success under randomized fault conditions.
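
A mission-level check of the kind described can be sketched as a function that inspects a planned route before it is executed and rejects it if it enters a restricted area or can no longer visit the remaining checkpoints in order. The `Mission` type, the waypoint-name representation, and the rejection messages are hypothetical; the paper's monitor and its highway-env integration are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mission:
    checkpoints: tuple      # ordered waypoint names that must all be visited
    restricted: frozenset   # waypoint names that must never be entered


def check_command(mission, visited, planned_route):
    """Return (accepted, reason) for a planned route, given waypoints already visited."""
    for waypoint in planned_route:
        if waypoint in mission.restricted:
            return False, f"enters restricted area {waypoint}"
    # The remaining checkpoints must appear in the route in their required order.
    remaining = [c for c in mission.checkpoints if c not in visited]
    position = 0
    for waypoint in planned_route:
        if position < len(remaining) and waypoint == remaining[position]:
            position += 1
    if position < len(remaining):
        return False, f"skips required checkpoint {remaining[position]}"
    return True, "ok"
```

A route can be perfectly collision-free and still fail this check, which is the gap between platform-level safety and mission feasibility that the abstract points to.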

2606.06991 2026-06-08 cs.CV cs.AI New submission

Don't Pause: Streaming Video-Language Synchrony for Online Video Understanding

Zhenyu Yang, Kairui Zhang, Shengsheng Qian, Weiming Dong, Changsheng Xu

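AI summary: Proposes the Streaming Video-Language Synchrony (SVLS) paradigm, which uses a Frame-Driven Transition Controller and a Streaming Token Pacer to achieve fine-grained synchrony between video frames and language generation, enabling real-time interaction without interrupting perception.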

Abstract

Online Video Large Language Models (Video-LLMs) have advanced toward seamless human-AI interaction through frame-by-frame processing and proactive responding. However, a critical challenge remains in streaming scenarios: existing models typically pause video perception while generating responses, breaking real-time video-language synchrony and causing stutters. To address this, we introduce a novel paradigm for online video understanding, Streaming Video-Language Synchrony (SVLS), and present LyraV, a live streaming assistant built upon a hierarchical control framework with two core innovations. First, the Frame-Driven Transition Controller (FDTC), a training-free verification-based finite-state machine, makes high-level semantic decisions on when to continue speaking, start a new response, or stay silent. Second, the Streaming Token Pacer (SToP), a plug-and-play lightweight predictive module, dynamically adapts the language generation rate to match the pace of the visual content. Concretely, LyraV performs per-frame incremental, sub-budget decoding: within each frame interval it emits only a small chunk of tokens that fits the real-time budget, so perception is never blocked for a full sentence. Together, these components enable LyraV to seamlessly interleave incoming video frames with generated word tokens, achieving fine-grained synchrony. Extensive experiments conducted on five online and three offline benchmarks demonstrate that LyraV preserves the backbone's general understanding ability while substantially improving streaming synchrony and narrative fluency, delivering 98.29% synchrony with video playback and a real-time processing speed of 3.89 FPS. Interestingly, we observe an empirical capability in LyraV: dynamic reasoning over streaming tokens, enabling continuous interpretation and "thinking" alongside visual input.
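
Per-frame sub-budget decoding can be pictured with a toy scheduler: after each incoming frame, emit only as many pending tokens as fit into the frame interval, then return to perception. This is a sketch under simplifying assumptions. The fixed per-token latency and the constant budget are hypothetical stand-ins for the learned Streaming Token Pacer, and no model is actually called.

```python
def interleave(frames, pending_tokens, frame_interval_s, token_latency_s):
    """Interleave frames with token chunks that fit each frame interval."""
    # Number of whole tokens that can be decoded before the next frame arrives.
    budget = int(frame_interval_s // token_latency_s)
    timeline = []
    cursor = 0
    for frame in frames:
        timeline.append(("frame", frame))
        chunk = pending_tokens[cursor:cursor + budget]
        cursor += len(chunk)
        if chunk:
            timeline.append(("tokens", chunk))
    # Tokens that did not fit stay queued for later frames.
    return timeline, pending_tokens[cursor:]
```

With a 0.25 s frame interval and 0.1 s per token, the budget is two tokens per frame, so a five-token reply is spread over three frames instead of blocking perception until the sentence is complete.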

2606.06986 2026-06-08 cs.LG New submission

Heterogeneous Effects of Green Finance on Urban Decarbonization: Evidence from 285 Cities in China

Xueyang Li, Jinlei Ma

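AI summary: Using econometric models and machine-learning analysis, this study finds that green finance significantly reduces urban carbon intensity. Green bonds and green investment have the strongest effects, spatial spillovers are present, the impact varies with a city's development level, and the main channel is energy structure optimization.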

Abstract

While green finance has become a key instrument for low-carbon city transitions, its actual decarbonization effects and transmission mechanisms remain unclear. This study employs econometric models and machine learning-based analysis to examine whether and how green finance reduces city-level carbon intensity. Results show that green finance significantly lowers carbon intensity, with green bonds and green investment having the strongest impacts and evident spatial spillovers. The effects vary by development level, being most pronounced in fourth- and fifth-tier cities. Mediation analysis reveals that green finance operates mainly through energy structure optimization, followed by industrial upgrading, foreign direct investment, and technological innovation. SHAP analysis confirms substantial differences across financial instruments, with green bonds, funds, and credit contributing most to decarbonization. Moreover, the marginal impact is stronger in cities with low technological capacity, high industrial dependency, and coal-based energy mixes. These findings provide theoretical support and policy guidance for building a multi-level, regionally differentiated green finance system to promote inclusive low-carbon transitions.

Keywords: Green Finance; Carbon Intensity; Decarbonization Effect; Machine Learning; City


2606.06972 2026-06-08 cs.AI 新提交

Accounting for Context: Shaping Moral Credences for Value Alignment

考虑情境：塑造道德信念以实现价值对齐

Jazon Szabo, Sanjay Modgil

AI总结本文针对价值对齐中道德多元性问题，提出在聚合道德评估时必须考虑情境因素，并形式化道德不确定性下的决策，揭示弱帕累托原则的违反是辛普森悖论的一种变体。

详情

AI中文摘要

确保智能体行为与人类道德价值观对齐不可避免地引发一个问题：如何解释社会乃至个体通常采纳的多元道德视角。关于道德不确定性的工作提出了在不同道德理论之间公平且民主地聚合行动评估的机制。然而，本文认为在聚合道德评估时需要考虑情境因素。例如，后果主义视角假设能够准确确定智能体的行动如何改变世界；这一假设在现实世界中往往不成立。因此，我们在考虑这些情境因素的同时，形式化了道德不确定性下的智能体决策。我们由此表明，一个看似常识性的属性——弱帕累托原则——被违反了。我们认为，这个看似的问题实际上是辛普森悖论的一种变体，因此揭示了忽视情境因素影响的聚合机制的局限性。

英文摘要

Ensuring that agent behaviours are aligned with human moral values inevitably raises the problem of how to account for the plurality of moral perspectives that societies -- and even individuals -- typically adopt. Work on moral uncertainty proposes mechanisms to fairly and democratically aggregate evaluations of actions across different moral theories. However, this paper argues that one needs to account for contextual factors when aggregating moral evaluations. For example, consequentialist perspectives assume an ability to accurately determine how an agent's actions change the world; an assumption that often does not hold in real world settings. We, therefore, formalise agent decision making under moral uncertainty, while also accounting for these kinds of contextual factors. We thereby show that a seemingly commonsensical property -- the weak Pareto principle -- is violated. We argue that this apparent problem is, in fact, a variation of Simpson's paradox, and hence reveals the limitations of aggregation mechanisms that ignore the impact of contextual factors.
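The Simpson's-paradox pattern named in the abstract can be reproduced with a small numeric sketch. The counts below are classic textbook-style illustrative figures, not data from the paper, and mapping "contexts" to strata is an assumption made only for illustration: action A looks better than action B inside every context, yet B looks better once the contexts are pooled.

```python
from fractions import Fraction

# (favourable outcomes, trials) per context.
# Illustrative numbers only; they are NOT taken from the paper.
outcomes = {
    "A": {"context_1": (81, 87), "context_2": (192, 263)},
    "B": {"context_1": (234, 270), "context_2": (55, 80)},
}


def rate(successes, trials):
    """Exact success rate within a single context."""
    return Fraction(successes, trials)


def pooled_rate(per_context):
    """Success rate after ignoring context and pooling all trials."""
    total_successes = sum(s for s, _ in per_context.values())
    total_trials = sum(t for _, t in per_context.values())
    return Fraction(total_successes, total_trials)


# A is preferred in every individual context ...
a_wins_everywhere = all(
    rate(*outcomes["A"][c]) > rate(*outcomes["B"][c]) for c in outcomes["A"]
)
# ... but B is preferred once context is ignored.
b_wins_pooled = pooled_rate(outcomes["B"]) > pooled_rate(outcomes["A"])
```

The reversal arises because the two actions are evaluated on very different mixes of contexts, which is the kind of effect a context-blind aggregation rule cannot see.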


2606.06960 2026-06-08 cs.CL 新提交

Tree-of-Experience: A Structured Experience-Management Solution for Self-Evolving Agents under Low-Repetition and Implicit-Reward Environments

经验之树：低重复与隐式奖励环境下自演化智能体的结构化经验管理方案

Zihao Deng, Yining Zhu, Leiming Wang, Jingfei Lu, Junbo Wang, Chuncheng Ran, Yu Yang, Dixuan Yang, Jikun Shen

AI总结针对低重复任务与隐式奖励环境，提出结构化经验管理方法ToE，通过组织、检索、验证和更新经验，在金融情绪预测基准上优于无经验基线。

详情

AI中文摘要

基于经验的自我演化对于LLM智能体至关重要，但现有基准通常假设明确的目标、稳定的任务模式和清晰的反馈。我们研究了一个更具挑战性的场景：具有隐式奖励的低重复任务，其中过去的经验难以重用，且反馈是延迟的、有噪声的且是结果层面的。我们引入了FinEvolveBench，一个时间控制的金融情绪预测基准，将每日新闻驱动的预测与未来超额收益联系起来。我们进一步提出了经验之树（ToE），一种结构化的经验管理方法，用于组织、检索、验证和更新智能体的经验。实验表明，通用经验机制并不一致地优于无经验基线，而ToE实现了更强的整体性能。这些结果强调了在隐式奖励环境中，结构化经验管理对于自演化智能体的重要性。

英文摘要

Experience-based self-evolution is crucial for LLM agents, but existing benchmarks often assume explicit goals, stable task patterns, and clear feedback. We study a more challenging setting: low-repetition tasks with implicit rewards, where past experience is difficult to reuse and feedback is delayed, noisy, and outcome-level. We introduce FinEvolveBench, a temporally controlled benchmark for financial sentiment prediction that links daily news-driven predictions to future excess returns. We further propose Tree-of-Experience (ToE), a structured experience-management method that organizes, retrieves, validates, and updates agent experience. Experiments show that general-purpose experience mechanisms do not consistently outperform no-experience baselines, while ToE achieves stronger overall performance. These results highlight the importance of structured experience management for self-evolving agents in implicit-reward environments.


2606.06946 2026-06-08 cs.CL cs.AI 新提交

Auditing Training Data in Domain-adapted LLMs: LoRA-MINT

领域自适应大语言模型中的训练数据审计：LoRA-MINT

Gonzalo Mancera, Daniel DeAlcala, Aythami Morales, Julian Fierrez, Ruben Tolosana, Francisco Jurado

AI总结提出LoRA-MINT方法，通过成员推理测试审计LoRA微调的大语言模型训练数据，在四个模型和三个基准上达到0.77-0.92的精度，优于现有基线。

详情

Comments: IEEE Conf. on Computers, Software, and Applications (COMPSAC), 2026

AI中文摘要

我们提出了LoRA-MINT，一种应用于通过低秩适应（LoRA）针对特定自然语言处理（NLP）任务微调的最新大语言模型（LLMs）的成员推理测试（MINT）新方法。主要目标是评估个体样本是否属于这些适应模型的训练数据，为知识产权和敏感数据管理提供有用的审计工具。我们的分析探索了模型困惑度与成员状态之间的关系，提供了一个系统框架来估计微调LLMs中的数据暴露程度。我们在四个模型和三个基准数据集上进行了实验，在确定给定数据是否用于训练时获得的精度值在0.77到0.92之间，优于最先进的基线，并证明了所提出方法的鲁棒性和通用性。总的来说，我们的发现强调了LoRA-MINT作为审计LLMs的有效且可扩展框架的潜力，提高了透明度，并促进了AI和NLP技术的道德和负责任部署。为了具体性和当前相关性，我们的讨论和实验集中在LoRA调整的LLMs上，但请注意，所提出的大部分方法很容易适用于审计任何其他适应LLMs的技术或更一般地任何其他领域自适应AI模型的训练数据。

英文摘要

We present LoRA-MINT, a new methodology for Membership Inference Test (MINT) applied to recent Large Language Models (LLMs) fine-tuned for specific Natural Language Processing (NLP) tasks through Low-Rank Adaptation (LoRA). The primary goal is to assess whether individual samples were part of the training data of these adapted models, providing a useful auditing tool for the management of intellectual property and sensitive data. Our analysis explores the relationship between model perplexity and membership status, providing a systematic framework for estimating data exposure in fine-tuned LLMs. We conducted experiments on four models and three benchmark datasets, obtaining precision values between 0.77 and 0.92 when determining whether given data were used for training, which outperforms state-of-the-art baselines and demonstrates the robustness and generality of the proposed method. In general, our findings underscore the potential of LoRA-MINT as an effective and scalable framework for auditing LLMs, improving transparency, and fostering the ethical and responsible deployment of AI and NLP technologies. For the sake of concreteness and current relevance, our discussion and experiments are centered on LoRA-adjusted LLMs, but note that most of the presented methodology is readily applicable to auditing training data under any other technique for adapting LLMs or, more generally, for any other domain-adapted AI model.
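The link between perplexity and membership mentioned above can be illustrated with a minimal sketch. It assumes per-token log-probabilities are already available from the fine-tuned model; the fixed-threshold decision rule and the helper names (`predict_member`, `precision`) are hypothetical simplifications and are not the LoRA-MINT classifier itself.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability) over one sample's tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def predict_member(token_logprobs, threshold):
    """Hypothetical rule: low perplexity suggests the sample was seen in training."""
    return perplexity(token_logprobs) < threshold


def precision(predictions, labels):
    """Share of samples flagged as members that really are members."""
    true_pos = sum(1 for p, y in zip(predictions, labels) if p and y)
    false_pos = sum(1 for p, y in zip(predictions, labels) if p and not y)
    flagged = true_pos + false_pos
    return true_pos / flagged if flagged else 0.0
```

In practice the threshold would have to be calibrated on held-out data with known membership; a single global threshold is only the simplest possible baseline.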


2606.06943 2026-06-08 cs.CV cs.AI 新提交

SS-TPT: Stability and Suitability-Guided Test-Time Prompt Tuning for Adversarially Robust Vision-Language Models

SS-TPT：面向对抗鲁棒视觉语言模型的稳定性和适用性引导的测试时提示微调

Sunoh Kim, Daeho Um

AI总结提出SS-TPT方法，通过稳定性与适用性分数评估增强视图质量，引导测试时提示微调，在保持高吞吐量的同时显著提升对抗鲁棒性。

详情

Comments: Accepted in ICML2026

AI中文摘要

视觉语言模型（如CLIP）实现了强大的零样本识别，但在对抗扰动下仍然非常脆弱。最近的测试时自适应防御通过利用大量增强视图来提高鲁棒性，但这导致了不切实际的减速和明确的鲁棒性-吞吐量权衡。为了应对这一挑战，我们提出了稳定性和适用性引导的测试时提示微调（SS-TPT），通过两个互补分数评估每个增强视图的质量：（1）稳定性，衡量对弱增强的预测不变性，以及（2）适用性，衡量视图间的特征空间密度。这些稳定性和适用性（SS）分数通过SS引导的一致性损失和SS加权预测来指导自适应和推理，放大可信视图同时抑制受损视图。大量实验表明，SS-TPT显著优于先前最先进的方法，在不同数据集和不同视图数量下实现了卓越的鲁棒性-吞吐量权衡，从而展示了强大的实用性和泛化性。我们的代码可在以下网址获得：https://github.com/sunoh-kim/SS-TPT。

英文摘要

Vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition but remain highly fragile under adversarial perturbations. Recent test-time adaptation defenses improve robustness by leveraging many augmented views, but this leads to impractical slowdown and a clear robustness-throughput trade-off. To address this challenge, we present Stability and Suitability-guided Test-time Prompt Tuning (SS-TPT), evaluating the quality of each augmented view via two complementary scores: (1) stability, measuring prediction invariance to weak augmentations, and (2) suitability, measuring feature-space density among views. These stability and suitability (SS) scores guide both adaptation and inference through an SS-guided consistency loss and an SS-weighted prediction, amplifying trustworthy views while suppressing corrupted ones. Extensive experiments demonstrate that SS-TPT significantly outperforms prior state-of-the-art methods, achieving superior robustness-throughput trade-offs across diverse datasets and varying numbers of views, thereby demonstrating both strong practicality and generality. Our code is available at https://github.com/sunoh-kim/SS-TPT.


2606.06938 2026-06-08 cs.CV 新提交

When CLIP Sees More, It Fights Back Harder: Multi-View Guided Adaptive Counterattacks for Test-Time Adversarial Robustness

当CLIP看得更多，它反击得更猛烈：多视图引导的自适应反击用于测试时对抗鲁棒性

Sunoh Kim, Daeho Um

AI总结提出多视图引导的自适应反击（MAC），通过构建输入图像的增强视图、执行反击以精炼嵌入、自适应调整反击强度并聚合视图，显著提升CLIP在测试时的对抗鲁棒性。

详情

Comments: Accepted in CVPR2026

AI中文摘要

视觉-语言模型如CLIP在零样本识别方面取得了显著成就，但其对对抗扰动的鲁棒性仍然有限。最近提出的测试时反击（TTC）通过在推理过程中扰动输入图像使其远离受损状态来提高CLIP的鲁棒性。然而，TTC在强攻击下仍然脆弱，因为其反击依赖于直接受损的原始视图，并采用噪声驱动的硬门控方案，无法适应变化的损坏严重程度。为了解决这些限制，我们引入了多视图引导的自适应反击（MAC），它对多个视图执行具有损坏感知软加权的反击。具体来说，MAC首先构建输入图像的增强视图以获得多样化的嵌入。然后，它执行反击以精炼视图的受损嵌入。接下来，MAC根据每个视图的估计损坏程度自适应地缩放反击强度。最后，经过自适应反击的视图被聚合以产生鲁棒的最终预测。在20个数据集和多种攻击场景下的广泛实验表明，MAC显著提高了鲁棒性，同时凭借其无调优设计，保持了高推理速度和内存效率。我们的代码可在 https://github.com/sunoh-kim/MAC 获取。

英文摘要

Vision-language models such as CLIP have achieved remarkable zero-shot recognition capabilities, yet their robustness against adversarial perturbations remains limited. Test-time counterattack (TTC) was recently proposed to improve CLIP's robustness by perturbing an input image to steer it away from a corrupted state during inference. However, TTC remains fragile under strong attacks because its counterattack relies on a directly corrupted original view and employs a noise-driven hard-gating scheme that cannot adapt to varying corruption severity. To address these limitations, we introduce Multi-view guided Adaptive Counterattack (MAC), which performs counterattacks for multi-view with corruption-aware soft weighting. Specifically, MAC first constructs augmented views of an input image to obtain diverse embeddings. It then performs counterattacks to refine corrupted embeddings of views. Next, MAC adaptively scales the counterattack intensity for each view based on its estimated corruption degree. Finally, the adaptively counterattacked views are aggregated to yield a robust final prediction. Extensive experiments across 20 datasets and diverse attack scenarios demonstrate that MAC substantially improves robustness while preserving high inference speed and memory efficiency with its tuning-free design. Our code is available at https://github.com/sunoh-kim/MAC.


2606.06928 2026-06-08 cs.SD eess.AS 新提交

VoxCPM2 Technical Report

VoxCPM2 技术报告

Yixuan Zhou, Guoyang Zeng, Xin Liu, Xiang Li, Renjie Yu, Jiancheng Gui, Jiaheng Wu, Ziyang Wang, Xudong Shen, Runchuan Ye, Zhisheng Zhang, Jiuyang Zhou, Bingsong Bai, Weiyue Sun, Mengyuan Deng, Qundong Shi, Zhiyong Wu, Zhiyuan Liu

AI总结提出VoxCPM2，一种全开源多语言可控语音生成基础模型，通过层次化扩散自回归建模、非对称AudioVAE和2B参数/200万小时数据扩展，在零样本和指令跟随TTS基准上达到SOTA，平均WER为1.68%。

详情

Comments: The technical report of VoxCPM2, a TTS foundation model (GitHub: https://github.com/OpenBMB/VoxCPM)

AI中文摘要

我们提出VoxCPM2，一个完全开源的多语言可控语音生成基础模型，它扩展了VoxCPM的层次化扩散自回归建模范式。VoxCPM2在三个关键维度上推进了该框架：(i) 能力，通过统一30种语言、9种中文方言、自然语言语音设计、风格可控的语音克隆以及高保真延续克隆于单个骨干网络；(ii) 质量，通过非对称AudioVAE以16 kHz编码并以48 kHz重建，实现具有高编码效率的隐式超分辨率；(iii) 规模，通过将模型联合扩展到2B参数，训练数据超过200万小时的多语言语音。为了在单个模型中支持这些多样化的能力，我们引入了一种统一的序列组织方式，通过相同输入构建块的不同排列来表达所有生成模式，从而允许在单一参数集和目标下进行联合训练。VoxCPM2在公共零样本和指令跟随TTS基准上达到了最先进或具有竞争力的性能。在我们的内部30语言评估集上，它取得了平均1.68%的词错误率。这些结果表明，层次化连续潜在建模无需依赖任何外部离散语音分词器，为大规模多语言可控语音生成提供了可行且强大的基础。模型权重、微调代码和推理工具已在Apache 2.0许可下公开发布，以促进社区研究和开发。

英文摘要

We present VoxCPM2, a fully open-source multilingual and controllable speech generation foundation model that extends the hierarchical diffusion-autoregressive modeling paradigm of VoxCPM. VoxCPM2 advances the framework in three key dimensions: (i) capability, by unifying 30 languages, 9 Chinese dialects, natural-language voice design, style-controllable voice cloning, and high-fidelity continuation cloning within a single backbone; (ii) quality, through an asymmetric AudioVAE that encodes at 16 kHz and reconstructs at 48 kHz, enabling implicit super-resolution with high encoding efficiency; and (iii) scale, by jointly scaling the model to 2B parameters and the training data to over 2 million hours of multilingual speech. To support these diverse capabilities within one model, we introduce a unified sequence organization that expresses all generation modes through different arrangements of the same input building blocks, allowing joint training under a single set of parameters and objective. VoxCPM2 achieves state-of-the-art or competitive performance on public zero-shot and instruction-following TTS benchmarks. On our internal 30-language evaluation set, it attains an average WER of 1.68%. These results demonstrate that hierarchical continuous-latent modeling, without relying on any external discrete speech tokenizer, offers a viable and powerful foundation for large-scale multilingual and controllable speech generation. The model weights, fine-tuning code, and inference tools are publicly released under the Apache 2.0 license to foster community research and development.


2606.06920 2026-06-08 cs.LG cs.AI 新提交

The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning

微调陷阱：评估负迁移及PEFT在亚十亿参数数学推理中的作用

Rahul Nair, Chun Tao

AI总结本研究评估了五种亚十亿参数模型在数学推理任务中的微调策略，发现全量微调对小于3亿参数的模型造成负迁移，而参数高效微调（PEFT）是稳定性要求。

详情

Comments: 8 pages, 6 figures, 2 tables

AI中文摘要

在边缘设备上部署小型语言模型（SLM）需要高效的微调策略，使模型适应新任务而不降低其通用能力。在本研究中，我们对五种亚十亿参数模型（135M-1B）在数学推理任务上进行了基准测试，并发现了一个关键脆弱性：全量微调（Full FT）会主动损害300M以下参数模型的性能，通常将准确率降至零样本基线以下。这种“负迁移”使得参数高效微调（PEFT）不仅是效率上的偏好，更是稳定性上的要求。我们发现，虽然低秩适应（LoRA）和权重分解LoRA（DoRA）性能相当，但它们的优势因任务而异：DoRA在复杂推理（GSM8K）中表现出色，而LoRA在模式匹配（OrcaMath）中占主导地位。特别地，在对齐模型（Qwen2.5-0.5B）上，LoRA优于全量微调，甚至在最小架构（SmolLM2-135M）上，简单的5-shot上下文学习也优于全量微调。基于这些发现，我们建议对所有对齐的亚十亿参数模型默认使用PEFT，并警告不要对任何小于500M参数的架构使用全量微调，以防止灾难性遗忘。本工作的复现可在此网址找到：https://github.com/gulguluu/tiny-slm-finetune-compare。

英文摘要

Deploying Small Language Models (SLMs) on edge devices requires efficient fine-tuning strategies that adapt models to new tasks without degrading their general capabilities. In this study, we benchmark five sub-1B models (135M-1B) on mathematical reasoning tasks and uncover a critical vulnerability: Full Fine-Tuning (Full FT) actively harms performance in models under 300M parameters, often dropping accuracy below zero-shot baselines. This "negative transfer" makes Parameter-Efficient Fine-Tuning (PEFT) not just an efficiency preference, but a stability requirement. We find that while Low-Rank Adaptation (LoRA) and Weight-Decomposed LoRA (DoRA) perform comparably, their strengths vary by task; DoRA excels in complex reasoning (GSM8K), while LoRA dominates pattern matching (OrcaMath). In particular, Full FT is outperformed by LoRA on aligned models (Qwen2.5-0.5B) and even by simple 5-shot In-Context Learning on the smallest architectures (SmolLM2-135M). Based on these findings, we recommend defaulting to PEFT for all aligned sub-1B models and caution against Full FT for any architecture smaller than 500M parameters to prevent catastrophic forgetting. Reproduction of this work can be found at https://github.com/gulguluu/tiny-slm-finetune-compare.
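For readers unfamiliar with the LoRA update contrasted with Full FT above, the following is a minimal pure-Python sketch of the standard low-rank formulation, in which the frozen weight W is adjusted by a scaled product B·A of two small trainable matrices. The helper names and the toy shapes are illustrative assumptions and are unrelated to the paper's training code.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, b, a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank.

    w: frozen weight, shape (d_out, d_in)
    b: trainable, shape (d_out, r)
    a: trainable, shape (r, d_in)
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d_out, d_in, r):
    """Full fine-tuning updates d_out*d_in weights; LoRA trains r*(d_out+d_in)."""
    return {"full_ft": d_out * d_in, "lora": r * (d_out + d_in)}
```

For a 1024×1024 layer with rank 8, `trainable_params(1024, 1024, 8)` gives 1,048,576 weights for Full FT against 16,384 for LoRA, which shows how much smaller the set of updated parameters is for that layer.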


2606.06885 2026-06-08 cs.CV cs.AI 新提交

FreeAnimate: Training-Free Human Image Animation with Preview-Guided Denoising

FreeAnimate: 基于预览引导去噪的无训练人体图像动画

Yuan Zeng, Yujia Shi, Zongqing Lu, QingMin Liao

AI总结提出FreeAnimate框架，利用图像扩散模型内在能力实现无训练的人体图像动画，通过预览生成策略提供时序和结构先验，结合反演增强注意力和参考锚定自注意力模块，保证时序一致性和身份保持。

详情

DOI: 10.1109/ICASSP55912.2026.11462600
Comments: Accepted to IEEE ICASSP 2026

AI中文摘要

人体图像动画已经取得了显著进展，主要得益于扩散模型。然而，现有方法通常需要大量的训练数据和资源才能获得高质量结果，限制了泛化性和可访问性。在这项工作中，我们引入了FreeAnimate，一个无训练框架，利用图像扩散模型的内在能力来实现时序一致性、身份保持和背景稳定性。我们的方法包含一种新颖的预览生成策略，该策略从生成的预览帧中提供时序和结构先验，无需训练即可有效引导姿态对齐和背景一致性。此外，FreeAnimate引入了反演增强注意力和参考锚定自注意力模块，以保证时序一致性和身份保持。实验结果表明，FreeAnimate优于现有的无训练竞争方法和基于训练的基线方法，生成的图像质量可与最先进的方法相媲美，并在不同数据集上展现出强大的泛化能力。我们的项目页面位于 https://freeani.github.io/。

英文摘要

Human image animation has seen significant advancements, primarily driven by diffusion models. However, existing methods typically demand substantial training data and resources to achieve high-quality results, limiting generalization and accessibility. In this work, we introduce FreeAnimate, a training-free framework that leverages the inherent capabilities of image diffusion models to enable temporal consistency, identity preservation, and background stability. Our approach incorporates a novel preview generation strategy that provides temporal and structural priors from generated preview frames, effectively guiding pose alignment and background consistency without training. Additionally, FreeAnimate introduces Inversion-Boosted Attention and Reference-Anchored Self-Attention modules to guarantee temporal consistency and identity preservation. Experimental results demonstrate that FreeAnimate outperforms existing training-free competitors and training-based baseline methods, achieving generation quality comparable to state-of-the-art methods and offering robust generalization across diverse datasets. Our project page is at https://freeani.github.io/.


2606.06881 2026-06-08 cs.LG 新提交

GlucoFM-Bench: Benchmarking Time-Series Foundation Models for Blood Glucose Forecasting

GlucoFM-Bench：血糖预测的时间序列基础模型基准测试

Baiying Lu, Zhaohui Liang, Ryan Pontius, Shengpu Tang, Temiloluwa Prioleau

AI总结提出GlucoFM-Bench基准，评估8种时间序列基础模型与监督深度学习模型在15个糖尿病数据集上的血糖预测性能，发现预训练模型在零样本和少样本场景表现优异，但全样本下轻量LSTM仍最优。

详情

AI中文摘要

血糖预测模型是现代糖尿病管理系统的基石，可靠的短期预测能够实现主动干预、支持自动化胰岛素输送，并降低低血糖和高血糖事件的风险。从建模角度看，由于糖尿病群体中异质的生理动态，血糖预测面临独特挑战。传统机器学习和深度学习模型已被广泛评估用于血糖预测，但近期的时间序列基础模型（TSFMs）在此场景下的研究仍较少。为填补这一空白，我们提出GlucoFM-Bench，一个全面的基准测试，评估最先进的TSFMs与监督深度学习模型在血糖预测中的表现。我们评估了8种代表性架构，包括预训练TSFMs、时间序列大语言模型和特定任务深度学习模型，涵盖15个公开的糖尿病相关数据集，涉及1117名1型糖尿病、2型糖尿病、前驱糖尿病和非糖尿病个体。模型在零样本、少样本和全样本协议下进行评估，并系统变化上下文长度和预测范围。跨数据集，预训练TSFMs，尤其是Chronos-2和TimesFM，展现出强大的零样本和少样本迁移能力，最佳零样本模型性能在最佳全样本监督模型的5%以内。然而，当任务特定数据充足时，轻量级LSTM仍是最强的，在全样本训练下比TSFMs高出4-21%。分层分析揭示了T1D队列和低/高血糖范围内的持续挑战，强调了超越聚合误差指标进行评估的必要性。总之，GlucoFM-Bench为评估、比较和改进血糖预测基础模型提供了标准化和可重复的基础。

英文摘要

Blood glucose forecasting models are foundational for modern diabetes management systems, as reliable short-term predictions can enable proactive interventions, support automated insulin delivery, and reduce the risk of hypo- and hyperglycemic events. From a modeling perspective, glucose forecasting poses unique challenges due to heterogeneous physiological dynamics across diabetes populations. Traditional machine learning and deep learning models have been extensively evaluated for glucose prediction, yet recent time-series foundation models (TSFMs) remain much less studied in this setting. To bridge this gap, we present GlucoFM-Bench, a comprehensive benchmark evaluating state-of-the-art TSFMs alongside supervised deep learning models for blood glucose forecasting. We assess eight representative architectures, including pre-trained TSFMs, time-series large language models, and task-specific deep learning models, across 15 publicly available diabetes-relevant datasets comprising 1,117 individuals with type 1 diabetes, type 2 diabetes, prediabetes, and no diabetes. Models are evaluated under zero-shot, few-shot, and full-shot protocols, with systematic variation in context length and prediction horizon. Across datasets, pre-trained TSFMs, especially Chronos-2 and TimesFM, show strong zero-shot and few-shot transfer, with the best zero-shot model performing within 5% of the best full-shot supervised model. Yet, when task-specific data are abundant, a lightweight LSTM remains strongest, outperforming TSFMs by 4--21% under full-shot training. Stratified analyses reveal persistent challenges in T1D cohorts and hypo-/hyperglycemic ranges, highlighting the need for evaluation beyond aggregate error metrics. Together, GlucoFM-Bench provides a standardized and reproducible foundation for evaluating, comparing, and improving foundation models for blood glucose forecasting.
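The point about evaluating beyond aggregate error metrics can be made concrete with a small sketch of range-stratified mean absolute error. The cut-offs used here (below 70 mg/dL as hypoglycemic, above 180 mg/dL as hyperglycemic) are commonly used clinical thresholds and are an assumption of this example; the abstract does not state which thresholds or metrics the benchmark uses.

```python
def glycemic_range(mg_dl, hypo=70.0, hyper=180.0):
    """Bucket a true glucose value; thresholds are assumed, not from the paper."""
    if mg_dl < hypo:
        return "hypo"
    if mg_dl > hyper:
        return "hyper"
    return "in_range"


def stratified_mae(y_true, y_pred):
    """Mean absolute error computed separately for each glycemic range."""
    buckets = {}
    for truth, pred in zip(y_true, y_pred):
        buckets.setdefault(glycemic_range(truth), []).append(abs(truth - pred))
    return {name: sum(errs) / len(errs) for name, errs in buckets.items()}


def overall_mae(y_true, y_pred):
    """Aggregate error, which can hide poor performance in rare ranges."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

Because in-range readings usually dominate a glucose trace, a model can post a low overall MAE while still erring badly in the hypo- and hyperglycemic buckets that matter most clinically.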


2606.06879 2026-06-08 cs.CL cs.CR 新提交

An Expanded Synthetic Conversation Dataset for Multi-Turn Smishing Detection

用于多轮短信钓鱼检测的扩展合成对话数据集

Carl Lochstampfor, Ayan Roy

AI总结提出COVA-X扩展数据集（10,985条对话），改进生成管道解决标签污染等问题，实验表明Longformer超越XGBoost，验证了Transformer模型需要更大对话语料才能发挥上下文优势。

详情

AI中文摘要

我们之前的工作引入了COVA，一个合成生成的多轮对话短信钓鱼数据集，包含3,201条标记对话，建立了八个模型的基线检测基准。虽然使用TF-IDF特征的XGBoost表现最佳，准确率72.5%，宏F1为0.691，但Transformer模型表现不佳，归因于输入截断和训练数据不足。我们提出COVA-X，一个扩展数据集，包含10,985条对话，涵盖八种针对老年人的诈骗类别，由改进的生成管道生成，解决了第一次迭代中的污染、标签不匹配、舞台指示泄露和提示设计失败问题。在扩展数据集上重新训练所有分类器得到了本工作的核心发现：Longformer现在在所有评估指标上超越了XGBoost，准确率79.71%，宏F1 0.7786，而XGBoost分别为78.43%和0.7563。这直接证实了Transformer模型需要更大的对话语料库才能发挥其上下文优势。我们还记录了一个质量生命周期，包括标签修正率从49.8%降低到3.9%（12.7倍改进），一项架构干预将虚拟绑架伪影率从67.1%降低到46.5%，以及按诈骗类型的结果分析显示，诈骗类别以机制一致的方式调节结果。清理前后的敏感性分析证实，数据集精炼在所有三种分类器架构中恢复了真实的标签相关信号。

英文摘要

Our prior work introduced COVA, a synthetically generated multi-turn conversational smishing dataset of 3,201 labeled conversations, establishing baseline detection benchmarks across eight models. While XGBoost with TF-IDF features achieved the best performance, with 72.5% accuracy and 0.691 macro F1, transformer models underperformed, which was attributed to input truncation and insufficient training data. We present COVA-X, an expanded dataset of 10,985 conversations spanning eight elder-targeted scam categories, produced by an improved generation pipeline addressing contamination, label mismatch, stage-direction bleed, and prompt-design failures from the first iteration. Retraining all classifiers on the expanded dataset yields the central finding of this work: Longformer now surpasses XGBoost on all evaluation metrics, achieving 79.71% accuracy and 0.7786 macro F1 compared with 78.43% and 0.7563 for XGBoost. This directly confirms that transformer models require larger conversational corpora to realize their contextual advantages. We additionally document a quality life-cycle including a 12.7× improvement in label correction rate, from 49.8% to 3.9%, an architectural intervention reducing virtual-kidnapping artifact rates from 67.1% to 46.5%, and a per-scam-type outcome analysis showing that scam categories modulate results in mechanism-consistent ways. A pre/post-cleanup sensitivity analysis confirms that dataset refinement recovers genuine label-relevant signal across all three classifier architectures.
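Since the comparison above is reported in both accuracy and macro F1, a short sketch of the standard macro-F1 definition may help: per-class F1 scores are averaged with equal weight, so a minority class counts as much as a majority one. The labels in the usage note are made up for illustration and are not COVA-X data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Share of exactly matching predictions."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
```

With true labels `["scam", "scam", "benign", "benign"]` and predictions `["scam", "benign", "benign", "benign"]`, accuracy is 0.75 while macro F1 is 11/15, because the missed scam lowers the scam-class F1 to 2/3.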


2606.06875 2026-06-08 cs.CV cs.CR 新提交

Unified Safe In-context Image Generation in Multimodal Diffusion Transformers via Restricting Unsafe Information Flows

统一安全上下文图像生成：在多模态扩散变换器中通过限制不安全信息流

Xiang Yang, Feifei Li, Mi Zhang, Geng Hong, Xiaoyu You, Mi Wen, Min Yang

AI总结提出UVR框架，通过分析注意力动态中的不安全信息流，在无需训练的情况下对输出补丁进行注意力调制，实现图像生成和编辑任务的安全控制，达到91%和77%的擦除率。

详情

Comments: ICML26

AI中文摘要

配备多模态注意力（MM-Attn）的扩散变换器（DiTs）已成为图像生成的主导范式。然而，防止有害内容的生成仍然是一个关键挑战，特别是在图像到图像（I2I）编辑任务中。现有的安全机制主要针对文本到图像（T2I）合成或基于U-Net的架构设计，这限制了它们在基于DiT的框架中统一安全缓解的有效性。为弥补这一差距，我们提出了统一视觉安全调节器（UVR），一个无需训练的、在生成图像中调节不安全语义的安全生成框架。UVR基于从信息流角度对MM-Attn中注意力动态的分析。我们识别出一个与任务无关的启动阶段，在该阶段输出补丁中的不安全语义迅速出现并可以被精确定位，随后是特定任务的语义放大和干扰阶段，其中有害信号进一步传播并与良性内容纠缠。基于这些观察，UVR通过统一的、有针对性的注意力调制和对识别出的不安全输出补丁上有害信息流的显式限制来缓解不安全生成。跨多种概念的实验表明，UVR在图像合成和编辑任务中分别实现了91%和77%的擦除率，达到了最先进的安全性能，同时以最小的退化保持了视觉质量和保真度。代码可在以下网址获取：https://github.com/deng12yx/UVR。

英文摘要

Diffusion transformers (DiTs) equipped with multimodal attention (MM-Attn) have become a dominant paradigm for image generation. However, preventing the generation of harmful content remains a critical challenge, particularly in image-to-image (I2I) editing tasks. Existing safety mechanisms are primarily designed for text-to-image (T2I) synthesis or U-Net-based architectures, which limits their effectiveness for unified safety mitigation in DiT-based frameworks. To bridge this gap, we propose Unified Visual Safety Regulator (UVR), a training-free safe generation framework that regulates unsafe semantics in generated images. UVR is grounded in an analysis of attention dynamics from the perspective of information flow in MM-Attn. We identify a task-independent start-up stage, during which unsafe semantics in output patches rapidly emerge and can be accurately localized, followed by task-specific semantic amplification and interference stages, where harmful signals are further propagated and entangled with benign content. Based on these observations, UVR mitigates unsafe generation through unified, targeted attention modulation and explicit restriction of harmful information flow over the identified unsafe output patches. Experiments across various concepts show that UVR achieves state-of-the-art safety performance, with erase rates of 91% and 77% in image synthesis and editing tasks, respectively, while preserving visual quality and fidelity with minimal degradation. Code is available at https://github.com/deng12yx/UVR.


2606.06872 2026-06-08 cs.CV cs.AI 新提交

EgoPressDiff: Multimodal Video Diffusion for Egocentric UV-Domain Hand-Pressure Estimation

EgoPressDiff: 用于自我中心UV域手部压力估计的多模态视频扩散模型

Yuan Zeng, Zilue Gao, Yujia Shi, Zongqing Lu, Wenming Yang, QingMin Liao

AI总结提出EgoPressDiff，一种条件视频扩散框架，通过多模态条件策略（手部姿态、3D网格顶点和深度信息）从视觉输入生成UV压力图，解决了现有方法中的量化误差和时间不一致问题，在EgoPressure数据集上实现SOTA，Volumetric IoU相对提升34%以上。

详情

DOI: 10.1109/ICASSP55912.2026.11463813
Comments: Accepted to IEEE ICASSP 2026

AI中文摘要

从自我中心视角估计手部表面接触压力对于AR/VR设备、机器人模仿和人体工程学分析至关重要。现有方法通常对压力信号进行离散化并独立处理帧，导致量化误差和时间不一致性。我们提出EgoPressDiff，一种条件视频扩散框架，从视觉输入生成UV压力图。我们方法的核心是一种多模态条件策略，引入PoseNet和顶点编码器，从手部姿态和3D网格顶点中高效提取特征。这些信号与深度信息一起，指导生成过程以确保压力场在物理上是合理的。为了有效融合这些异构特征，我们进一步提出分布校准空间层，在组合前对齐其统计特性。在EgoPressure自我中心视图设置上的评估表明，EgoPressDiff实现了最先进的结果，Volumetric IoU相对先前基线提升超过34%，同时降低MAE并保持高时间精度。我们的项目页面位于 https://egopressdiff.github.io/。

英文摘要

Estimating hand-surface contact pressure from an egocentric view is crucial for AR/VR devices, robotic imitation, and ergonomic analysis. Existing methods often discretize the pressure signal and process frames independently, leading to quantization errors and temporal inconsistencies. We present EgoPressDiff, a conditional video diffusion framework that generates UV-pressure maps from visual input. The core of our approach is a multi-modal conditioning strategy, introducing a PoseNet and a Vertex Encoder to efficiently extract features from hand pose and 3D mesh vertices. These signals, along with depth information, guide the generative process to ensure the pressure fields are physically grounded. To effectively fuse these heterogeneous features, we further propose a Distribution-Calibrated Spatial Layer, which aligns their statistical properties before combination. Evaluated on the EgoPressure ego-view setting, EgoPressDiff achieves state-of-the-art results, improving Volumetric IoU by over 34% relative to the prior baseline, while reducing MAE and maintaining high temporal accuracy. Our project page is at https://egopressdiff.github.io/.


2606.06871 2026-06-08 cs.LG 新提交

Evidence-Grounded Ensemble Diagnosis of 802.11 Packet Captures: A Multi-Stage Pipeline with Deterministic Reliability Scoring

基于证据的802.11数据包捕获集成诊断：具有确定性可靠性评分的多阶段流水线

Jerome Henry, Swadhin Pradhan, Miroslav Popovic

AI总结提出PROBE多阶段流水线，通过确定性证据框架和集成方法解决LLM在802.11诊断中的幻觉、置信度偏差和评估偏见问题，在87个企业Wi-Fi捕获上实现0.957的加权证据F1分数和96%的自动接受率。

详情

Comments: 37 pages, 9 figures, 9 tables

AI中文摘要

诊断802.11数据包捕获需要专家协议知识，速度慢、工程师间不一致且不可扩展。基于LLM的方法听起来合理，但会编造捕获中不存在的协议事件（尤其是截断的跟踪），产生未校准的置信度分数，并且当黄金参考由被测模型共同生成时遭受评估偏差。我们引入PROBE（基于证据的协议推理集成），一个解决所有三个失败的多阶段流水线。它整合了(i)具有帧级可验证性的确定性PCAP到文本归一化，(ii)多运行、多候选集成，带有可选的跨模型第二意见和渐进混淆，(iii)一个判决感知的证据框架，将缺乏失败证据视为贡献证据，以及(iv)一个完全确定性的复合可靠性分数，来自证据有效性、运行间稳定性和跨模型一致性，无需LLM自我评估。在87个企业Wi-Fi捕获（104个捕获-审查者对）上，单次LLM分析将加权证据F1从0.871（专家基线）提升到0.912，但在35%的情况下遗漏了关键帧。朴素集成投票降至基线以下（0.842），因为多数投票放大了保守判决：50%的确认失败被误分类为“无问题”或“证据不足”。添加基于证据的协调达到0.957 F1，96%的自动接受率，以及最坏情况下的下限高于0.70。LLM自我报告的置信度聚集在0.95，无论难度如何（71%报告恰好0.95），证实其无信息量。我们还引入了一个使用逐字段断言匹配的模型无关评估框架，消除了来自模型共同生成的黄金参考的循环偏差。

英文摘要

Diagnosing 802.11 packet captures requires expert protocol knowledge, is slow, inconsistent across engineers, and unscalable. LLM-based approaches sound plausible but fabricate protocol events absent from captures (especially truncated traces), produce uncalibrated confidence scores, and suffer evaluation bias when golden references are co-produced by the model under test. We introduce PROBE (Protocol Reasoning Over evidence-Based Ensembles), a multi-stage pipeline addressing all three failures. It integrates (i) deterministic PCAP-to-text normalization with frame-level verifiability, (ii) multi-run, multi-candidate ensembles with optional cross-model second opinion and progressive obfuscation, (iii) a verdict-aware evidence framework treating absence of failure evidence as contributing evidence, and (iv) a fully deterministic composite reliability score from evidence validity, run-to-run stability, and cross-model agreement without LLM self-assessment. On 87 enterprise Wi-Fi captures (104 capture-reviewer pairs), single-pass LLM analysis raises weighted evidence F1 from 0.871 (expert baseline) to 0.912 but misses critical frames in 35% of cases. Naive ensemble voting drops below baseline (0.842) as majority voting amplifies conservative verdicts: 50% of confirmed failures are misclassified as 'no issue' or 'insufficient evidence.' Adding evidence-grounded reconciliation achieves 0.957 F1, a 96% auto-accept rate, and a worst-case floor above 0.70. LLM self-reported confidence clusters at 0.95 regardless of difficulty (71% report exactly 0.95), confirming it is uninformative. We also introduce a model-agnostic evaluation framework using per-field assertion matching, eliminating circular bias from model-co-produced golden references.
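The idea of a deterministic composite reliability score built from evidence validity, run-to-run stability, and cross-model agreement can be sketched as follows. The weighted-average form, the weights `(0.5, 0.3, 0.2)`, and the majority-share definition of stability are all hypothetical choices for illustration; the abstract does not specify how PROBE combines the three signals.

```python
from collections import Counter


def run_stability(verdicts):
    """Share of repeated runs that agree with the most common verdict."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    top_count = Counter(verdicts).most_common(1)[0][1]
    return top_count / len(verdicts)


def composite_reliability(evidence_validity, stability, cross_model_agreement,
                          weights=(0.5, 0.3, 0.2)):
    """Weighted mean of three inputs in [0, 1]; the weights are assumed.

    The score uses only checkable quantities and no LLM self-reported
    confidence, so identical inputs always give an identical score.
    """
    parts = (evidence_validity, stability, cross_model_agreement)
    if any(not 0.0 <= p <= 1.0 for p in parts):
        raise ValueError("all components must lie in [0, 1]")
    return sum(w * p for w, p in zip(weights, parts)) / sum(weights)
```

A score of this kind could then be compared against a fixed cut-off to decide whether a diagnosis is auto-accepted or routed to an engineer; where that cut-off sits is likewise a design decision outside this sketch.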


2606.06856 2026-06-08 cs.CV 新提交

FS-DVS: A Frequency-Selective Dynamic Visual Sensing Paradigm for Enhancing Information Completeness

FS-DVS：一种增强信息完整性的频率选择性动态视觉传感范式

Feiyu Ji, Xiaokang Yang, Xiaoyun Yuan

AI总结提出FS-DVS范式，通过在事件触发前集成可学习空间滤波器模拟视网膜神经节细胞聚合机制，自发学习中心-环绕模式以增强中频信息，在目标检测和动作识别中取得显著性能提升。

详情

AI中文摘要

动态视觉传感器（DVS）通过异步报告像素级强度变化，提供卓越的时间分辨率和动态范围。然而，传统DVS依赖每像素独立触发机制，忽略了生物视网膜神经节细胞（RGC）执行的空间整合。因此，它们缺乏对比度敏感函数（CSF）及其对中空间频率的固有敏感性，这不可避免地因亚阈值信号丢失而导致信息不完整。为弥补这一差距，我们提出FS-DVS（频率选择性动态视觉传感器），一种新颖范式，它在事件触发过程之前严格集成一个可学习空间滤波器，以模拟RGC聚合机制。通过开发可微分事件模拟框架，空间滤波器可以与下游任务进行端到端优化。我们的研究揭示，从δ函数开始，学习到的空间滤波器自发演变为强调中频分量的中心-环绕模式，与人类CSF一致。除了在目标检测和动作识别中实现显著的性能提升外，不同任务中向类人CSF特性的一致收敛强调了这种中频选择性机制的普遍性。与单纯提高传感器灵敏度或依赖后处理相比，我们的范式实现了具有高噪声鲁棒性的选择性信息增强，为下一代神经形态传感器提供了稳健且生物合理的蓝图。

英文摘要

Dynamic vision sensors (DVS) offer exceptional temporal resolution and dynamic range by asynchronously reporting pixel-level intensity changes. However, conventional DVS rely on a per-pixel independent triggering mechanism, ignoring the spatial integration performed by biological retinal ganglion cells (RGCs). Consequently, they lack the contrast sensitivity function (CSF) and its inherent sensitivity to mid-spatial frequencies, which inevitably leads to information incompleteness due to sub-threshold signal loss. To bridge this gap, we propose FS-DVS (Frequency-Selective Dynamic Vision Sensor), a novel paradigm that integrates a learnable spatial filter strictly preceding the event triggering process to mimic the RGC aggregation mechanism. By developing a differentiable event simulation framework, the spatial filter can be optimized end-to-end with downstream tasks. Our study reveals that starting from a delta function, the learned spatial filters spontaneously evolve into center-surround patterns that emphasize mid-frequency components, consistently aligning with human CSF. Beyond achieving substantial performance gains in object detection and action recognition, the consistent convergence to human-like CSF characteristics across different tasks underscores the universality of this mid-frequency selective mechanism. Compared to naively increasing sensor sensitivity or relying on post-processing, our paradigm achieves selective information enhancement with high noise resilience, providing a robust, biologically plausible blueprint for next-generation neuromorphic sensors.
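The effect of placing a spatial filter before event triggering can be illustrated on a 1-D row of pixels. This sketch assumes the usual simplified event model (an event fires when the log-intensity change reaches a contrast threshold) and uses a hand-picked 3-tap center-surround kernel; the kernel values and threshold are hypothetical and are not the filters learned in the paper.

```python
import math


def convolve_same(signal, kernel):
    """1-D correlation with zero padding; output length equals input length."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += weight * signal[j]
        out.append(acc)
    return out


def events(prev_frame, next_frame, kernel, threshold):
    """Per-pixel polarity (+1, -1, or 0) from the filtered log-intensity change."""
    prev_f = convolve_same([math.log(v) for v in prev_frame], kernel)
    next_f = convolve_same([math.log(v) for v in next_frame], kernel)
    out = []
    for before, after in zip(prev_f, next_f):
        diff = after - before
        out.append(1 if diff >= threshold else (-1 if diff <= -threshold else 0))
    return out


DELTA = [0.0, 1.0, 0.0]              # conventional per-pixel triggering
CENTER_SURROUND = [-0.5, 2.0, -0.5]  # hypothetical excitatory center, inhibitory surround
```

With `prev = [1, 1, 1, 1, 1]`, `next = [1, 1, 1.2, 1, 1]` and a threshold of 0.25, the delta kernel yields no events because the 0.18 log-intensity change stays below threshold, whereas the center-surround kernel fires a single positive event at the changed pixel. A fixed kernel like this also amplifies pixel noise, which a learned filter would need to balance.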


2606.06836 2026-06-08 cs.RO cs.AI cs.CV 新提交

Think Like a Pilot: Fine-Grained Long-Horizon UAV Navigation

像飞行员一样思考：细粒度长时程无人机导航

Xiangyi Zheng, Xiangyu Wang, Qinan Liao, Zimu Tang, Yue Liao, Dongyue Lyu, Guodong Wang, Junjie Liu, Si Liu

AI总结提出FLIGHT基准和FLIGHT VLA异步架构，通过低频飞行员推理VLM与高频扩散动作模型解耦，实现无人机长时程语义指令下的平滑连续飞行控制。

详情

AI中文摘要

语言引导的无人机代理必须执行长时程语义指令，同时产生平滑、物理可行的连续飞行命令，然而现有的视觉语言导航（VLN）基准通常使用离散或粗粒度的动作，而现有的无人机视觉-语言-动作（VLA）任务则专注于短时、原子化的机动。为了解决无人机任务设置中的这一空白，我们引入了FLIGHT，一个面向混合无人机导航与推理任务的细粒度、长时程、指令引导基准（Fine-grained Long-horizon Instruction-Guided benchmark for Hybrid UAV navigation and reasoning Tasks），该基准结合了多阶段指令与密集的6-DoF轨迹注释，分为两个数据集划分：细粒度VLN和长时程流。为了使无人机代理具备对任务执行状态和任务规划进行实时飞行推理的能力，同时适应高频、实时的精确控制，我们进一步提出了FLIGHT VLA，一种异步架构，将用于任务状态推理的低频流式飞行员视觉语言模型（VLM）与用于连续控制的高频扩散动作模型解耦，并由显式的飞行员推理文本进行监督，该文本总结了当前飞行状态并预测下一个子目标。在闭环评估中，FLIGHT VLA在我们的FLIGHT基准上持续优于代表性的VLN和VLA基线，实现了更强的多阶段完成、子目标遵循和终端控制。其训练的流式飞行员推理VLM进一步提升了无人机视频推理，验证了我们设计的有效性。

English abstract

Language-guided UAV agents must execute long-horizon semantic instructions while producing smooth, physically feasible continuous flight commands, yet existing Vision-Language Navigation (VLN) benchmarks typically use discrete or coarse actions and existing UAV Vision-Language-Action (VLA) tasks focus on short, atomic maneuvers. To address this gap in UAV task settings, we introduce \textbf{FLIGHT}, a \textbf{F}ine-grained \textbf{L}ong-horizon \textbf{I}nstruction-\textbf{G}uided benchmark for \textbf{H}ybrid UAV navigation and reasoning \textbf{T}asks, which combines multi-stage instructions with dense 6-DoF trajectory annotations across two dataset splits: Fine-grained VLN and Long-horizon Flow. To endow the UAV agent with the capability of real-time in-flight reasoning over task execution status and mission planning, while simultaneously accommodating high-frequency, real-time precise control, we further propose \textbf{FLIGHT VLA}, an asynchronous architecture that decouples a low-frequency Streaming Pilot Vision-Language Model (VLM) for task-state reasoning from a high-frequency diffusion action model for continuous control, supervised by explicit \textbf{Pilot Reasoning} texts that summarize the current flight state and anticipate the next subgoal. In closed-loop evaluation, FLIGHT VLA consistently surpasses representative VLN and VLA baselines on our FLIGHT benchmarks, achieving stronger multi-stage completion, subgoal adherence, and terminal control. Its trained Streaming Pilot Reasoning VLM further improves UAV video reasoning, validating the effectiveness of our design.

2606.06827 2026-06-08 cs.LG New submission

Architecture Shapes Transfer Specificity in Implicit Neural Representations

架构影响隐式神经表示中的迁移特异性

D Yang Eng

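AI summary: Studies the transfer behavior of three implicit neural representation architectures (SIREN, ReLU MLP, and Fourier-feature MLP) using controlled experiments and PDE benchmarks, finding that transfer magnitude and transfer specificity separate: ReLU is more source-selective, whereas SIREN reuses weights broadly.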

AI中文摘要

坐标网络中的迁移通常通过热启动增益来衡量，但这种增益反映的是源特定结构还是通用权重重用尚不明确。我们通过控制分析测试、二维顶盖驱动方腔纳维-斯托克斯基准以及一维热方程、粘性伯格斯方程和聚焦三次非线性薛定谔方程参考解套件，研究了三种隐式神经表示（INR）家族：SIREN、ReLU MLP和傅里叶特征MLP。分析测试使用独立种子随机控制，而PDE基准使用同族替代源控制和辅助消融。在各种设置下，迁移幅度和迁移特异性明显分离。在10种子控制的一维几何测试中，傅里叶特征显示出最大的结构化迁移（33.1倍），其次是SIREN（23.0倍）和ReLU（10.7倍），但ReLU的选择性更强：随机控制迁移为0.41倍，而SIREN为14.24倍。在受控的双参数一维族中，排名发生变化：在默认设置下，ReLU给出了最清晰的结构化与控制分离，而傅里叶特征仅在带宽重新调整后才有改进。在纳维-斯托克斯和更广泛的一维PDE套件中，没有单一架构主导所有方程，但相同的模式仍然存在：SIREN通常广泛重用权重，而ReLU以及在某些方程中的傅里叶特征更具源选择性。静态诊断仍然薄弱，启发式缩放律$A_{\text{transfer}} \propto 1/\Delta t^2$在所实施的一维审计中被拒绝。这些结果将迁移特异性定位为坐标网络的有用诊断，并表明科学机器学习中的架构选择应在明确控制条件下进行评估，而不仅仅依据迁移幅度。

English abstract

Transfer in coordinate networks is often measured by warm-start gain, but whether that gain reflects source-specific structure or generic weight reuse is less clear. We study this question across three implicit neural representation (INR) families, SIREN, ReLU MLPs, and Fourier-feature MLPs, using controlled analytic tests, a 2D lid-driven-cavity Navier--Stokes benchmark, and 1D PDE reference-solution suites for heat, viscous Burgers, and focusing cubic NLS. The analytic tests use independent-seed random controls, while the PDE benchmarks use alternate same-family source controls and auxiliary ablations. Across settings, transfer magnitude and transfer specificity separate clearly. In a 10-seed controlled 1D geometric test, Fourier Features show the largest structured transfer ($33.1\times$), followed by SIREN ($23.0\times$) and ReLU ($10.7\times$), but ReLU is far more selective: random-control transfer is $0.41\times$ for ReLU versus $14.24\times$ for SIREN. On a controlled two-parameter 1D family, the ranking changes: ReLU gives the clearest structured-versus-control separation at default settings, whereas Fourier Features improve only after bandwidth retuning. In Navier--Stokes and the broader 1D PDE suite, no single architecture dominates every equation, yet the same pattern remains: SIREN often reuses weights broadly, whereas ReLU and, in some equations, Fourier Features are more source-selective. Static diagnostics remain weak, and the heuristic scaling law $A_{\text{transfer}} \propto 1/\Delta t^2$ is rejected in the implemented 1D audit. These results position transfer specificity as a useful diagnostic for coordinate networks and suggest that architecture selection in scientific machine learning should be evaluated under explicit control conditions, not by transfer magnitude alone.
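
To make the magnitude-versus-specificity distinction concrete, the reported gains from the 10-seed 1D geometric test can be compared directly. The sketch below is illustrative only: the ratio of structured gain to random-control gain is an assumed selectivity index, not a metric defined in the abstract, and `specificity_ratio` is a hypothetical helper. Fourier Features are omitted because the abstract does not report their random-control gain.

```python
# Gains reported in the abstract for the 10-seed controlled 1D geometric test.
reported = {
    "ReLU": {"structured": 10.7, "random_control": 0.41},
    "SIREN": {"structured": 23.0, "random_control": 14.24},
}

def specificity_ratio(structured: float, control: float) -> float:
    """Assumed selectivity index: structured-source gain divided by random-control gain."""
    return structured / control

ratios = {
    name: specificity_ratio(gains["structured"], gains["random_control"])
    for name, gains in reported.items()
}

for name, ratio in ratios.items():
    gains = reported[name]
    print(f"{name}: structured {gains['structured']}x, "
          f"control {gains['random_control']}x, ratio {ratio:.1f}")
```

Under this assumed index, ReLU comes out at roughly 26 and SIREN at roughly 1.6: SIREN has the larger raw gain of the two, but most of that gain also appears with a random source, which is the separation the abstract describes.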
