arXivDaily: daily arXiv academic digest, updated Monday through Friday
2606.01753 2026-06-02 cs.CV

Quality-Guided Semi-Supervised Learning for Medical Image Segmentation

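Kumar Abhishek, Ghassan Hamarneh

AI summary: Proposes a quality-guided semi-supervised learning framework that uses a dedicated network to estimate segmentation quality and improves medical image segmentation through quality-aware regularization and pseudolabel reweighting.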

Comments
Early Accept at MICCAI 2026, 13 pages, 2 figures
Abstract

Training accurate medical image segmentation models requires large amounts of densely annotated data, which is costly and time-consuming to obtain. Semi-supervised learning (SSL) alleviates this by learning from both abundant unlabeled data and limited labeled data. However, most modern SSL methods rely on pseudolabels for unlabeled data, and typically assess their reliability through model confidence or uncertainty, measures that are self-referential and lack explicit grounding in segmentation quality. Instead, we propose a quality-guided SSL framework that trains a dedicated network to estimate segmentation quality from image-mask pairs. The predictor is trained on variable-quality masks generated through synthetic corruptions augmented with imperfect outputs from partially trained segmentation models, capturing realistic error patterns encountered during training. We integrate the quality predictor into SSL through two complementary mechanisms: a quality-aware regularization loss and a quality-based pseudolabel sample reweighting scheme. We show that our method serves as a drop-in enhancement to existing SSL frameworks. Extensive experiments across five datasets and multiple architectures demonstrate consistent improvements over competing SSL methods, advancing the state-of-the-art in semi-supervised medical image segmentation.
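The quality-based pseudolabel reweighting mentioned in this abstract can be pictured with a minimal sketch. It assumes the quality predictor emits one score in [0, 1] per image-mask pair and that the reweighted loss is a normalized weighted mean; both the function name and this exact weighting rule are illustrative assumptions, not the authors' formulation.

```python
from typing import Sequence


def reweighted_pseudolabel_loss(losses: Sequence[float],
                                quality: Sequence[float],
                                eps: float = 1e-8) -> float:
    """Quality-weighted mean of per-sample pseudolabel losses.

    `quality` stands in for the output of a separate quality predictor
    (hypothetical here); higher values mean a more trusted pseudolabel.
    """
    if len(losses) != len(quality):
        raise ValueError("losses and quality must have the same length")
    # Clamp scores to [0, 1] so they behave like an overlap-style quality score.
    weights = [min(max(q, 0.0), 1.0) for q in quality]
    total = sum(weights)
    if total < eps:
        return 0.0  # no pseudolabel in this batch is trusted
    return sum(w * l for w, l in zip(weights, losses)) / total
```

With this rule, a pseudolabel with predicted quality 0 contributes nothing, while equal scores reduce to the ordinary mean loss.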

2606.01747 2026-06-02 cs.CL cs.AI

Construction of Historical Knowledge Graphs Based on BERT and Graph Neural Networks

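Ping Li, Bartlomiej Brzozka

AI summary: Proposes a high-level architecture combining BERT and graph neural networks to extract entities and relations from historical texts and build knowledge graphs, outperforming traditional methods and deep-learning baselines in precision, recall, and F1-score.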

Comments
9 pages, 4 figures
Abstract

Through digital humanities research and scale-up historical data analysis, a significant amount of traditional historical text is converted into structured knowledge graphs. This paper provides a high-level architecture that combines bidirectional encoder representations of transformers (BERT) and graph neural networks (GNN) to extract the entities and relationships from various types of historical texts. The texts of traditional history resolve linguistic ambiguities, references limited by context, and a lack of established grammatical norms in a systematic way. This study develops a new image retrieval system based on FastRQNet and pre-trained vision-language model Vilt-qaformer+RoBInet in accordance with the aforementioned recommendations. The experiments make full use of a comprehensive collection of municipal records, parliamentary documents, and historical correspondence. When compared to conventional rule-based techniques and other popular deep-learning baselines, the joint BERT-GNN system obtains greater Precision, Recall, and F1-score (Table 2). Complex nested structures and implicit reference issues can be handled by this structure with sufficient accuracy and thoroughness when creating knowledge graphs. The aforementioned experiments show that combining relational graph learning algorithms with context-sensitive semantic representation techniques can automatically extract historical data to add accumulated wisdom to the knowledge repository.

2606.01746 2026-06-02 cs.CV cs.LG

Sensitivity as a Double-Edged Sword: A Trade-off Between Discriminability and Adversarial Robustness

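Kai Wang

AI summary: Finds that the high sensitivity of fully connected classifiers yields discriminability but also vulnerability, while the insensitivity of ℓ2-distance classifiers yields robustness but limits performance. Proposes an ℓ2-reclassifier based on the Hybrid Prototype Mixing framework that fuses stable and dynamic prototypes to balance discriminability and robustness, and designs the Mixed Surrogate Attack evaluation protocol.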

Comments
13 pages including reference, 4 figures
Abstract

Modern neural networks are highly susceptible to adversarial perturbations. In this work, we identify that part of this vulnerability stems from the sensitivity of the widely used fully connected (FC) classifiers to such perturbations. In contrast, simple $\ell_2$ distance-based classifiers exhibit significantly greater robustness. We provide thorough theoretical and empirical analysis showing that while FC classifiers' high sensitivity makes them discriminative, it also makes them vulnerable. Conversely, $\ell_2$-classifiers' insensitivity grants robustness but limits performance. Motivated by this trade-off, we propose a novel $\ell_2$-reclassifier based on a Hybrid Prototype Mixing (HPM) framework. This method retains the discriminative power of FC classifiers while leveraging the robustness of $\ell_2$ distance. It yields $\ell_2$-distance-based predictions by fusing two prototype types: (1) stable, dataset-level prototypes updated via EMA, and (2) dynamic, batch-level prototypes generated from the FC classifier's predictions using a Straight-Through Estimator (STE). However, this dynamic, STE-based architecture introduces significant challenges for evaluation, such as gradient obfuscation and forward discontinuity. To address this, we propose a new, rigorous evaluation protocol, the Mixed Surrogate Attack (MSA), which uses multiple surrogates along with powerful AutoAttack to ensure a fair and robust assessment. Extensive experiments demonstrate that our lightweight, plug-and-play module, with minimal fine-tuning, effectively enhances the adversarial robustness of various existing SOTA adversarially trained models.
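A rough sketch of the ingredients named in this abstract: an EMA-updated prototype, a mix of stable and dynamic prototypes, and an ℓ2 nearest-prototype prediction. The convex mixing rule, the `alpha` and `momentum` parameters, and all function names are assumptions for illustration. The Straight-Through Estimator that produces the dynamic prototypes is not modelled here.

```python
from typing import Dict, List

Vector = List[float]


def ema_update(prototype: Vector, batch_mean: Vector, momentum: float = 0.9) -> Vector:
    """Exponential moving average update of a dataset-level prototype."""
    return [momentum * p + (1.0 - momentum) * b for p, b in zip(prototype, batch_mean)]


def mix_prototypes(stable: Vector, dynamic: Vector, alpha: float = 0.5) -> Vector:
    """Convex combination of a stable and a dynamic prototype (assumed mixing rule)."""
    return [alpha * s + (1.0 - alpha) * d for s, d in zip(stable, dynamic)]


def l2_predict(feature: Vector, prototypes: Dict[int, Vector]) -> int:
    """Return the class whose prototype is closest to `feature` in squared l2 distance."""
    def sq_dist(u: Vector, v: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(prototypes, key=lambda c: sq_dist(feature, prototypes[c]))
```

Because the prediction depends only on distances to prototypes, a small perturbation of the feature changes the decision only when it crosses the midpoint between two prototypes, which is the intuition behind the robustness claim.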

2606.01741 2026-06-02 cs.CR cs.AI

SECUREVENT: Hybrid AI/ML Security Monitoring for Distributed Event-Based Systems

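Eric Liang

AI summary: Proposes the SECUREVENT architecture, which combines traditional security mechanisms with online anomaly detection, graph-aware behavioral features, complex-event policy rules, federated learning, and adversarial-ML governance. Hybrid AI/CEP monitoring improves recall while keeping the false-positive rate low.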

Abstract

Distributed event-based systems have become a common substrate for Internet-scale publish/subscribe services, IoT telemetry, cloud-native microservices, and security operations pipelines. Their loose coupling and asynchronous delivery improve scalability, but they also expand the attack surface: publishers, brokers, subscribers, topics, schemas, and temporal ordering can each be abused without a single component observing the whole behavior. This paper proposes SECUREVENT, a hybrid AI/ML security-monitoring architecture for distributed event-based systems. The architecture combines traditional protections such as authenticated transport, topic-level authorization, and signed events with online anomaly detection, graph-aware behavioral features, complex-event policy rules, federated learning, and adversarial-ML governance. A deterministic prototype study over synthetic event-stream attacks illustrates how a hybrid AI/CEP monitor can improve recall over static rules while retaining a low false-positive rate. The central claim is not that machine learning replaces cryptographic and access-control mechanisms, but that model-based security monitoring is necessary when event flows, identities, schemas, and timing relationships are too dynamic for static controls alone.

2606.01738 2026-06-02 cs.CL cs.AI

THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models

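Zhiqing Ma, Zhonghao Xu, Dong Yu, Chen Kang, Changliang Li, Pengyuan Liu

AI summary: Proposes THRD, a training-free framework that defends against multi-turn jailbreak attacks by explicitly modeling temporal risk accumulation (turn-level risk assessment, cross-turn intent detection, response evaluation, and a decision module), reducing the attack success rate to 0.2-4.0% with less than 1.5% loss in model utility.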

Abstract

Multi-turn jailbreak attacks pose a growing threat to LLMs by exploiting conversational dynamics such as gradual escalation and cross-turn coordination. Existing defenses either rely on costly retraining -- often degrading model utility -- or apply single-turn analysis independently at each turn, failing to capture how risk accumulates along interaction trajectories. We observe that safety behavior in multi-turn interaction is trajectory-dependent: dialogue history continuously reshapes the model's conditioning context, making it insufficient to evaluate each turn in isolation. Motivated by this insight, we present THRD, the first training-free framework that explicitly models temporal risk accumulation for multi-turn jailbreak defense. THRD integrates four modules: a Turn-level Risk Assessor (TRA) for instantaneous risk estimation, a Historical Context Analyzer (HCA) for cross-turn intent escalation detection, a Response Evaluator (RE) for identifying facilitative outputs, and a Decision Module that combines these signals through a time-evolving scoring mechanism with attenuation-based modulation and trend-aware adjustment. Experiments against state-of-the-art multi-turn attacks -- including tree-search-based and multi-agent collaborative methods -- across two target models show that THRD reduces ASR to 0.2--4.0% while preserving model utility within 1.5% degradation on MMLU and GSM8K. Ablation studies confirm non-redundant module contributions and stable cross-architecture generalization. Analysis of first rejection triggers reveals that over 70% of multi-turn attacks require Turn 2 or later to detect, validating the necessity of explicit temporal aggregation.
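To make the idea of a time-evolving score with attenuation and trend adjustment concrete, here is a toy sketch. The averaging of the three module signals, the `decay`, `trend_gain`, and `threshold` values, and the update formula are all hypothetical choices for illustration. The abstract does not specify the actual scoring rule.

```python
from typing import Optional, Sequence


def first_rejection_turn(turn_risk: Sequence[float],
                         history_risk: Sequence[float],
                         response_risk: Sequence[float],
                         decay: float = 0.7,
                         trend_gain: float = 0.5,
                         threshold: float = 1.0) -> Optional[int]:
    """Return the 1-based turn at which accumulated risk crosses `threshold`.

    Each sequence holds one score in [0, 1] per dialogue turn, standing in
    for the outputs of the TRA, HCA, and RE modules respectively.
    """
    score = 0.0
    prev_signal = None
    for turn, (t, h, r) in enumerate(zip(turn_risk, history_risk, response_risk), start=1):
        signal = (t + h + r) / 3.0  # assumed equal-weight combination
        # Trend term: only rising risk adds to the score.
        trend = max(0.0, signal - prev_signal) if prev_signal is not None else 0.0
        # Attenuate the old score, then add the new evidence.
        score = decay * score + signal + trend_gain * trend
        if score >= threshold:
            return turn
        prev_signal = signal
    return None  # the conversation was never rejected
```

In an escalating dialogue such as `first_rejection_turn([0.2, 0.5, 0.9], [0.1, 0.4, 0.8], [0.0, 0.3, 0.7])`, no single turn's combined signal reaches the threshold, yet the accumulated score crosses it at turn 3.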

2606.01737 2026-06-02 cs.AI

TrafficRAG: A Multimodal RAG Framework for Traffic Accident Liability Determination

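Xu Li, Zedong Fu, Xinyi Li, Xun Han

AI summary: Proposes the TrafficRAG framework, which generates structured descriptions with a vision-language model, retrieves regulations and cases through hybrid retrieval, and has a large language model reason over the fused multimodal evidence to produce automated traffic accident liability analysis reports.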

Comments
12 pages, 3 figures, accepted at ICANN 2026
Abstract

Traffic accident liability analysis is a critical yet challenging task in intelligent transportation and legal assistance. Existing methods often suffer from low efficiency, subjective judgment, and inconsistent analysis results. Meanwhile, large language models are constrained by noisy video inputs and insufficient legal domain knowledge. To address these issues, this work presents TrafficRAG, a multimodal retrieval-augmented framework for automated traffic accident analysis and report generation. Specifically, the proposed framework first adopts a vision-language model to produce structured textual descriptions of accident scenarios, which serve as accurate retrieval queries. Based on these textual queries, a hybrid retrieval strategy integrating BM25 sparse retrieval and dense embedding retrieval is employed to fetch relevant traffic regulations and similar historical cases. Finally, the large language model incorporates retrieved legal knowledge and multimodal accident evidence for comprehensive reasoning, and generates standardized, legally grounded liability analysis reports. Extensive experiments show that TrafficRAG consistently outperforms baseline methods, achieving 77.32% Legal Norm Adaptation Accuracy, 81.71% Factual Faithfulness, and a Liability Ratio MAE of 5.48%. The results validate that integrating multimodal factual evidence with legal clauses via retrieval augmentation can effectively improve the reliability and accuracy of traffic accident liability determination.
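The abstract does not say how the BM25 and dense rankings are combined. One common option is reciprocal rank fusion (RRF), sketched below as an assumption. The document identifiers are made up, and the two input rankings are taken as given rather than computed.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]],
                           k: int = 60,
                           top_n: int = 3) -> List[str]:
    """Fuse several ranked lists of document ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    Ties are broken alphabetically so the output is deterministic.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


# Hypothetical ranked ids from a sparse (BM25) and a dense retriever.
bm25_ranking = ["reg_12", "case_7", "reg_3"]
dense_ranking = ["reg_12", "reg_40", "case_7"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

RRF needs only ranks, not raw scores, so BM25 scores and embedding similarities never have to be put on a common scale.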

2606.01734 2026-06-02 cs.CV cs.LG cs.RO

FlatVPR: Plug-and-play Geo-linear Residual Adapter for Geometric Rectification of Foundation Model Feature Manifolds

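Rai Hisada, Kanji Tanaka

AI summary: Proposes the FlatVPR paradigm, which suppresses feature-manifold curvature with a learnable residual adapter and a Pullback Flatness Loss, enabling reconstruction by linear interpolation under sparse anchors and significantly improving visual place recognition accuracy on the NCLT dataset.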

Comments
5 pages, 1 figure, technical report
Abstract

This paper proposes "FlatVPR," a novel geometric rectification paradigm that effectively bridges the trade-off between map lightweightness and localization accuracy in visual place recognition (VPR) by enforcing a feature manifold structure where any descriptor between two adjacent anchors $\mathbf{z}_A$ and $\mathbf{z}_B$ can be accurately reconstructed via linear interpolation $\hat{\mathbf{z}}_{pseudo} = (1-t)\mathbf{z}_A + t\mathbf{z}_B$, where $t \in [0,1]$ denotes the relative position. While state-of-the-art foundation models such as DINOv2-ViT-S/14 provide robust semantic features, their latent manifolds exhibit prominent curvature, projecting uniform linear motion in physical space onto highly non-linear trajectories in the feature space, which hinders reliable reconstruction under sparse anchor conditions. To enable the aforementioned interpolation-based reconstruction, we introduce a residual transformation $\hat{\mathbf{z}} = \mathbf{z} + \text{Res}(\mathbf{z})$ to the raw foundation features $\mathbf{z}$, where $\text{Res}(\cdot)$ represents a learnable adapter. Our method explicitly suppresses manifold curvature using a mathematically grounded Pullback Flatness Loss that minimizes the deviation of intermediate features from the linear segment connecting adjacent anchors, thereby minimizing the intrinsic curvature of the manifold. Through this spatial flattening, map construction is formulated within an Expectation-Maximization (EM) framework, decoupled into a continuous M-step for manifold adaptation and a conceptual E-step for optimal anchor selection guidelines. Experiments on the NCLT dataset demonstrate that the application of our adapter leads to significant performance improvements even under extremely sparse anchor conditions with 100 m intervals and extreme seasonal changes.
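The interpolation formula in this abstract, and the notion of deviation from the segment between anchors, can be written out directly. The sketch below uses a squared ℓ2 deviation as a stand-in; the exact form of the Pullback Flatness Loss is not given in the abstract, so `flatness_deviation` is an assumed simplification.

```python
from typing import List

Vector = List[float]


def interpolate(z_a: Vector, z_b: Vector, t: float) -> Vector:
    """Pseudo-descriptor (1 - t) * z_a + t * z_b for relative position t in [0, 1]."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def flatness_deviation(z_mid: Vector, z_a: Vector, z_b: Vector, t: float) -> float:
    """Squared l2 distance between an intermediate feature and its interpolated target."""
    target = interpolate(z_a, z_b, t)
    return sum((m - p) ** 2 for m, p in zip(z_mid, target))
```

A feature that already lies on the segment gives zero deviation, so minimizing this quantity over adapted features pushes the trajectory between anchors toward a straight line.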

2606.01730 2026-06-02 cs.AI cs.LG

Evidence-Gated LLM Priors for Multi-Objective Bayesian Optimization

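Jiangyu Chen, Banyi

AI summary: To address potentially misleading LLM priors in multi-objective Bayesian optimization, proposes an objective-wise reputation-market mechanism that dynamically calibrates expert weights from online feedback, introduces a decoupled counterfactual gate, and validates the robustness of dynamic calibration on synthetic tests and molecule optimization benchmarks.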

Abstract

Large language models (LLMs) are increasingly used as heuristic advisors for black-box optimization, yet their suggestions and self-reported confidence are not necessarily calibrated to downstream objective values. This issue becomes more pronounced in multi-objective Bayesian optimization, where different objectives may require different expert knowledge and where an LLM expert can be useful for one objective but misleading for another. We study how to use LLM-generated expert priors in discrete multi-objective Bayesian optimization without blindly trusting them. We propose an objective-wise reputation-market mechanism that treats each expert-objective pair as a falsifiable prior source. Expert weights are updated online from observed objective feedback, discounted over time, and gated by market-level trust. We then introduce a decoupled counterfactual gate that can use the LLM prior without confidence, use it with confidence, or abstain from the LLM prior entirely. Across controlled synthetic stress tests and three molecule optimization benchmarks with \qwenflash{}-generated expert priors, we find that dynamic objective-wise calibration improves robustness over fixed LLM priors. However, raw LLM confidence is not reliably beneficial: on ESOL, confidence is positively correlated with prediction error; on FreeSolv, confidence can help; and on Lipophilicity, ignoring confidence remains strongest. Our fixed three-arm counterfactual gate improves over the first counterfactual variant on ESOL and FreeSolv, while an attempted margin portfolio exposes a useful negative result: margin selection should be acquisition-aware rather than based only on one-step prior error.

2606.01725 2026-06-02 cs.AI cs.LG

Characterization of Multi-Model Agentic AI Systems on General Tasks via Trace-Driven Simulation

Donghwan Kim, Prakhar Singh, Younghoon Min, Jongryool Kim, Jongse Park, Kiwan Maeng

AI summary: Presents the GAIATrace dataset and the Vidur-Agent simulator, and uses trace-driven simulation to characterize the behavior of multi-model agentic AI systems on general tasks.

Comments
13 pages, 18 figures, 2 tables
Abstract

Agentic AI completes tasks through iterative planning, tool use, and reasoning based on observed outcomes. Despite its popularity, its system-level behavior remains poorly understood, particularly for complex datasets and agent architectures, owing to highly non-deterministic execution, prohibitive evaluation costs, and limited visibility into proprietary models. This paper presents GAIATrace, the first token-level trace dataset of two state-of-the-art agentic systems (MiroThinker and OWL) running GAIA, a benchmark composed of a heterogeneous mix of general-purpose tasks. Unlike prior trace datasets, GAIATrace captures full reasoning tokens, task-level structures, and activities of every major participating LLM, enabling in-depth systems research. Complementing the dataset, we present Vidur-Agent, a trace-driven simulator that can replay GAIATrace to perform reproducible, low-cost system evaluation across diverse simulated environments. Using both artifacts, we characterize how modern agentic systems handle general tasks and how various system design choices shape their behavior, yielding several unique findings.

2606.01723 2026-06-02 cs.LG cs.AI

Shortcut to Nowhere: Demystifying Deep Spurious Regression

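Guanrong Xu, Jessica Li, Hao Wang, Yuzhe Yang

AI summary: Targeting spurious correlations in continuous prediction, proposes exploiting the similarity among spurious attributes in label and feature spaces to calibrate distributions, improving generalization under distribution shift.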

Abstract

Real-world regression often exhibits shortcuts: attributes that are spuriously correlated with continuous targets in training, yet unreliable under deployment shifts; regressing targets using such shortcuts may fail catastrophically at test time. Existing studies on spurious correlations focus primarily on classification, where labels are categorical and groups are naturally defined. However, many real-world tasks require continuous prediction, where hard label boundaries or discrete group-label pairs do not exist. We define Deep Spurious Regression (DSR) as learning from regression data with attribute-label confounding, addressing continuous spurious correlations, and generalizing to all attribute-label combinations at test time. Motivated by the intrinsic difference between classification and regression shortcuts, we propose to exploit the similarity among spurious attributes in both label and feature spaces, thereby accounting for nearby targets and related groups while calibrating both label and learned feature distributions across attributes. Extensive experiments on common real-world DSR datasets that span computer vision, environmental sensing, and large language model (LLM) regression verify the superior performance of our strategies. Our work fills the gap in benchmarks and techniques for studying spurious correlations in continuous prediction.

2606.01722 2026-06-02 cs.LG cs.AI cs.DC

Post-Deterministic Distributed Systems: A New Foundation for Trustworthy Autonomous Infrastructure

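Jun He, Deying Yu

AI summary: Proposes the Post-Deterministic Distributed Systems (PDDS) model for coordinating heterogeneous environments where deterministic code, stochastic models, and autonomous agents coexist, and defines five architectural pillars and a new failure taxonomy.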

Comments
8 pages, 1 table
Abstract

For decades, distributed systems have typically assumed that correct participants execute protocol-specified behavior with stable, externally defined, and deterministic semantics. Classical theory has extensively parameterized network timing, communication topologies, and failure domains, but this participant model has remained comparatively fixed. The integration of autonomous reasoning engines, stochastic model-driven agents, and policy-driven actors into cloud control planes, incident response systems, and financial infrastructure challenges the universality of this assumption. These agents often produce divergent reasoning paths, distinct operational traces, and heterogeneous internal representations while achieving semantically equivalent and correct outcomes. In this paper, we introduce Post-Deterministic Distributed Systems (PDDS) as a research and engineering model for coordinating heterogeneous environments where deterministic code, stochastic models, and autonomous agents coexist. We show that classical distributed computing models form a zero-ambiguity special case of this participant-general model. We do not argue that deterministic systems disappear; rather, deterministic execution can no longer serve as the universal participant assumption for autonomous infrastructure. Finally, we outline five architectural pillars of post-deterministic infrastructure: Protocol-Driven Development, Verifiable Agentic Infrastructure, Autonomous State Control Planes, Semantic Quorum Assurance, and Epistemic State Replication. Epistemic State Replication extends persistence and consistency models from data visibility to knowledge visibility, enabling agentic memory, Verifiable Semantic Rollback, and coherence across reasoning participants. We also define a taxonomy of failure classes that arise in this setting.

2606.01720 2026-06-02 cs.LG

A Note on Stability for Orthogonalized Matrix Momentum with Client Sampling

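Da Chang, Qiankun Shi, Lvgang Zhang, Yu Li, Ruijie Zhang

AI summary: Studies finite-sample generalization bounds for orthogonalized momentum updates in distributed matrix optimization with client sampling, deriving an upper-tail guarantee from a coupled-neighbor stability recursion and a weighted concentration step.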

Abstract

We study finite-sample generalization for a client-sampled distributed optimization scheme with matrix-valued parameters and orthogonalized momentum updates. The central quantity is the gap between the population and empirical objectives at the returned model when only a subset of clients participates in each round. Under independent heterogeneous client data, unequal local sample counts, and fixed aggregation weights, we derive a finite-round upper-tail guarantee from a coupled-neighbor stability recursion and a weighted concentration step. The bound keeps the client-selection counts through the amplification factor \(Y_i(\mathcal C)\); in the uniform full-participation full-batch regime, it yields \(\widetilde{\mathcal O}(n^{-1}+n^{-1/2})\) scaling whenever the horizon-dependent amplification terms are controlled. The matrix-orthogonalization rule is required to be Lipschitz along paired trajectories, a condition satisfied by regularized polar-type maps and normalized finite-step Newton--Schulz orthogonalizers. For the unregularized matrix sign, the same argument requires coupled spectral separation, whereas Gaussian smoothing gives a finite-round smoothed variant. A one-dimensional counterexample shows why a gap, smoothing, or regularity condition is necessary.
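For readers unfamiliar with the normalized finite-step Newton--Schulz orthogonalizer mentioned above, the following sketch shows the classical cubic iteration X ← 1.5 X − 0.5 X Xᵀ X after Frobenius normalization. Whether the note analyzes this cubic form or a variant with other coefficients is not stated in the abstract, so treat the choice as an assumption. The helper names are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(row) for row in zip(*a)]


def newton_schulz(m: Matrix, steps: int = 5) -> Matrix:
    """Approximate the orthogonal polar factor of `m` with a fixed number of steps."""
    fro = math.sqrt(sum(v * v for row in m for v in row))
    if fro == 0.0:
        raise ValueError("cannot orthogonalize the zero matrix")
    # Normalizing puts every singular value in (0, 1], inside the convergence region.
    x = [[v / fro for v in row] for row in m]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(row_x, row_c)]
             for row_x, row_c in zip(x, xxtx)]
    return x
```

Each step maps a singular value s to 1.5 s − 0.5 s³, which drives every nonzero singular value toward 1 while leaving the singular vectors unchanged. With a fixed step count the map is a polynomial in the input, which is what makes a Lipschitz-type condition plausible for the normalized finite-step version.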

2606.01719 2026-06-02 cs.LG cs.AI cs.CR

Fair Finetuning Mitigates Distribution Inference Attacks

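Rakshit Naidu

AI summary: Proposes Fair Fine-tuning (FFt), which fine-tunes on samples from the complementary distribution under an Equalized Odds constraint, links the model's fairness metric to adversarial advantage in distribution inference attacks, gives a theoretical bound, and shows experimentally that it reduces attack success.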

Comments
16 pages (11 main, 5 appendix)
Abstract

Machine learning models trained on sensitive data can inadvertently leak population-level information about their training distributions -- a threat known as distribution inference attack (DIA). An adversary with black-box access can infer sensitive demographic properties, such as subgroup proportions, without observing any training data directly. While defenses such as differential privacy and property unlearning have been proposed, the link between fairness constraints and distributional leakage remains unexplored. We propose Fair Fine-tuning (FFt): a trained model is fine-tuned on samples from the complementary distribution under an Equalized Odds (EO) constraint. We provide a complete theoretical characterization, proving the tight bound $\text{Adv}(\mathcal{A},M_f) \le \Delta_{\text{EO}} \cdot W$, where $W$ quantifies how distinguishable the two training distributions are by their sensitive-attribute composition. We also establish a necessary condition for FFt to reduce adversarial advantage and prove tightness of the bound. We evaluate across six datasets spanning tabular (ACS Income, COMPAS, German Credit), image (UTKFaces), and NLP (Bias in Bios) modalities. Rehearsal-based FFt consistently reduces the adversarial accuracy gap below the detection threshold $\tau = 0.1$ across all settings; on ACS Income, the gap falls from $\sim 15\%$ to under $4\%$. Our work provides the first formal bound connecting a model's measured EO disparity directly to its adversarial advantage in the DIA game, opening a new avenue for unified fairness-and-privacy defenses.
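The bound in this abstract multiplies a measured Equalized Odds disparity by a distinguishability term $W$. The sketch below computes a standard EO gap for binary labels and two groups (the larger of the true-positive-rate and false-positive-rate differences) and plugs it into the stated product. The paper's precise definition of $\Delta_{\text{EO}}$ and the way $W$ is obtained are not given here, so both are assumptions, and `w` is passed in as a plain number.

```python
from typing import Sequence


def positive_rate(preds: Sequence[int], labels: Sequence[int],
                  groups: Sequence[int], y: int, a: int) -> float:
    """P(pred = 1 | label = y, group = a), estimated from samples."""
    selected = [p for p, l, g in zip(preds, labels, groups) if l == y and g == a]
    if not selected:
        raise ValueError(f"no samples with label={y} and group={a}")
    return sum(selected) / len(selected)


def equalized_odds_gap(preds: Sequence[int], labels: Sequence[int],
                       groups: Sequence[int]) -> float:
    """Max over y in {0, 1} of the between-group gap in positive prediction rate."""
    return max(
        abs(positive_rate(preds, labels, groups, y, 0)
            - positive_rate(preds, labels, groups, y, 1))
        for y in (0, 1)
    )


def advantage_upper_bound(eo_gap: float, w: float) -> float:
    """Right-hand side of Adv <= Delta_EO * W as quoted in the abstract."""
    return eo_gap * w
```

Read this way, lowering the EO gap through fine-tuning shrinks the ceiling on the adversary's advantage for any fixed $W$.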

2606.01717 2026-06-02 cs.LG

Decentralized Instruction Tuning: Conflict-Aware Splitting and Weight Merging

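Minsik Choi, Geewook Kim

AI summary: Proposes MERIT, which performs decentralized instruction tuning through conflict-aware splitting and weight merging, raising the 8-benchmark average on Qwen2.5-VL-3B from 54.3 to 57.0.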

Comments
32 pages, 5 figures. Accepted for publication at ICML 2026
Abstract

Instruction tuning aligns large language models, including multimodal ones, with diverse user intents, but scaling to heterogeneous mixtures is hindered by gradient interference and bandwidth-heavy synchronization. We ask whether these two bottlenecks can be addressed jointly by training parts of the mixture independently and reconciling them once in parameter space. We develop a local quadratic theory inside a shared flat basin that yields three results: weight merging produces a curvature-weighted variance reduction; PCA-aligned conflict splitting maximizes this gain along high-curvature directions; and merging additionally acts as spectral filtering with implicit norm regularization. These results directly motivate MERIT, a decentralized merge-ready instruction-tuning pipeline that estimates dataset-level gradient conflicts, partitions the mixture along the top PCA conflict axes, fine-tunes each partition independently with no inter-partition communication, and merges once via token-weighted averaging. On Qwen2.5-VL-3B with 136 Vision-FLAN tasks, MERIT improves the 8-benchmark average from 54.3 (joint training) to 57.0. The same recipe scales to a 7B model on a 1.6M-example, 176-source mixture -- matching or exceeding centralized joint training with minimal cost overhead -- and transfers to text-only FLAN. Our code is available at https://github.com/naver-ai/merit.
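The final merge step described in this abstract, token-weighted averaging of independently fine-tuned partitions, can be sketched as follows. Checkpoints are represented as plain dictionaries of flat parameter lists, which is a simplification for illustration. This is not the code from the linked repository.

```python
from typing import Dict, List, Sequence

Checkpoint = Dict[str, List[float]]


def merge_token_weighted(checkpoints: Sequence[Checkpoint],
                         token_counts: Sequence[int]) -> Checkpoint:
    """Average parameters across checkpoints, weighting each by its training token count."""
    if not checkpoints or len(checkpoints) != len(token_counts):
        raise ValueError("need one token count per checkpoint")
    total = sum(token_counts)
    if total <= 0:
        raise ValueError("token counts must sum to a positive number")
    merged: Checkpoint = {}
    for name, reference in checkpoints[0].items():
        merged[name] = [
            sum(n * ckpt[name][i] for ckpt, n in zip(checkpoints, token_counts)) / total
            for i in range(len(reference))
        ]
    return merged
```

Because the merge happens once in parameter space, the partitions never need to exchange gradients during training, which is the bandwidth saving the abstract refers to.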

2606.01713 2026-06-02 cs.RO cs.SY eess.SY math.OC

FlipItRight: Stable Pose-Targeted Throw-Flip Across Diverse Objects

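Axel Dawne, Shinkyu Park

AI summary: Proposes the FlipItRight framework, which decomposes the task into an object-level planner and a robot-level planner with the release state as an explicit intermediate representation, achieving stable planar pose-targeted throw-flip with a high-DoF manipulator without prior data or learned models, and reaching a 90% success rate over 120 trials.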

详情
AI中文摘要

我们提出了FlipItRight,一个用于高自由度机械臂进行稳定平面姿态目标投掷翻转的框架。该任务被分解为一个物体级规划器,它生成满足期望着陆姿态的候选释放状态,以及一个机器人级规划器,它评估可执行性并构建可行的摆动运动。将释放状态视为显式中间表示,能够实现原则性的候选过滤、释放和预摆动配置的自适应选择,以及结构化的近释放运动设计——特别是在最终摆动阶段保持近似恒定的末端执行器速度,以提高对释放时间不确定性的鲁棒性。我们在一个真实平台上对不同形状、大小和质量的物体进行了验证,在120次试验中达到了90%的成功率。消融研究证实,每个设计选择都对投掷性能有所贡献,并且该框架不需要先验数据或学习模型,能够直接部署到新物体和目标上,无需特定环境的校准或数据收集。

英文摘要

We propose FlipItRight, a framework for stable planar pose-targeted throw-flip with a high-DoF manipulator. The task is decomposed into an object-level planner, which generates candidate release states satisfying the desired landing pose, and a robot-level planner, which evaluates executability and constructs a feasible swing motion. Treating the release state as an explicit intermediate representation enables principled candidate filtering, adaptive selection of release and pre-swing configurations, and structured near-release motion design -- in particular, approximately constant end-effector velocities during the final swing phase to improve robustness to release-timing uncertainty. We validate on a real platform across objects of varying shape, size, and mass, achieving a 90% success rate across 120 trials. Ablation studies confirm that each design choice contributes to throwing performance, and the framework requires no prior data or learned model, enabling direct deployment on new objects and targets without environment-specific calibration or data collection.

2606.01711 2026-06-02 cs.CV

Improving Visual Token Reduction via Rectifying Distortions for Efficient Multimodal LLM Inference

通过纠正失真改进视觉令牌减少以实现高效多模态大语言模型推理

Hyeonwoo Cho, DongHyeon Baek, Yewon Kim, Bumsub Ham

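AI summary: Proposes RESTORE, a framework that improves visual token reduction by calibrating positional and attentional distortions, boosting multimodal LLM performance while maintaining efficiency.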

详情
Comments
Accepted to ICML 2026
AI中文摘要

多模态大语言模型(MLLMs)在视觉-语言任务中取得了显著成功,但大量视觉令牌带来的二次计算复杂度导致了严重的内存和延迟瓶颈。虽然已经探索了视觉令牌减少(VTR)策略来缓解这一负担,但现有方法忽略了完整序列与减少序列之间的位置和注意力一致性,导致表示失真。为此,我们提出RESTORE,一种新颖的VTR框架,在保持效率的同时纠正位置和注意力失真。具体来说,我们提出一种简单而有效的校准方法,通过基于相对距离增强注意力权重来恢复丢失的视觉注意力。我们还引入了一种独特的锚点选择用于令牌合并,以减轻特征平均过程中的信息损失。在多个基准上的实验结果表明,我们的方法持续提高了各种减少方法的准确性,在保持计算效率的同时实现了最先进的性能。

英文摘要

Recent advancements in Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision-language tasks, yet the quadratic computational complexity arising from the vast number of visual tokens incurs significant memory and latency bottlenecks. While visual token reduction (VTR) strategies have been explored to mitigate this burden, existing methods overlook the positional and attentional consistency between the full and reduced sequences, resulting in a distorted representation. To this end, we propose RESTORE, a novel VTR framework that rectifies the positional and attentional distortions while maintaining efficiency. Specifically, we present a simple yet effective calibration method that restores lost visual attention by augmenting attention weights based on relative distances. We also introduce a distinctive anchor selection for token merging to mitigate information loss during feature averaging. Experimental results on multiple benchmarks demonstrate that our method consistently improves the accuracy of various reduction methods, achieving state-of-the-art performance while maintaining computational efficiency.

2606.01710 2026-06-02 cs.CV cs.LG

Density-Aware Translation of Spurious Correlations in Zero-Shot VLMs

零样本VLM中虚假相关性的密度感知转换

Afsaneh Hasanebrahimi, Hanxun Huang, Christopher Leckie, Sarah Erfani

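AI summary: Proposes Density-Aware Translation (DAT), which corrects image-text similarity with a local geometric density term to mitigate the performance degradation that spurious correlations cause in zero-shot classification with vision-language models such as CLIP.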

详情
Comments
ICML 2026
AI中文摘要

视觉语言模型(如CLIP)实现了强大的零样本分类。然而,它们的预测仍然对虚假相关性敏感,即上下文线索主导语义内容。早期的解决方案通常依赖于微调或提示工程,这要么削弱了预训练模型的优势,要么容易产生幻觉。在这项工作中,我们提出了密度感知转换(DAT),它使用从组参考集导出的局部几何密度项来细化图像-文本相似度分数。我们的方法受到以下现象的启发:CLIP嵌入表现出模态间隙,并位于特征空间中的各向异性壳上:常见模式聚集在均值附近,而罕见模式被推向外围。这种几何结构产生了不均匀的对齐,其中虚假相关性被放大,而语义上有意义但罕见的线索被边缘化。为了解决这个问题,我们采用相对度量根据嵌入密度重新缩放相似度,抑制扩散区域中过度自信的分数,同时保留密集、语义一致的匹配。在基准数据集上的实验结果表明,最差组和平均准确率持续提高,突出了密度感知转换作为一种简单有效的校准机制,用于使用多模态模型进行可靠的零样本分类。

英文摘要

Vision-Language models (VLMs), such as CLIP, achieve powerful zero-shot classification. However, their predictions remain sensitive to spurious correlations, where contextual cues dominate over semantic content. Earlier solutions typically rely on fine-tuning or prompt engineering, which either undermine the advantages of pre-trained models or are prone to hallucination. In this work, we propose Density-Aware Translation (DAT) that refines image-text similarity scores using a local geometric density term derived from group reference sets. Our approach is motivated by the phenomenon that CLIP embeddings exhibit a modality gap and lie on an anisotropic shell in the feature space: common patterns cluster near the mean, while rare patterns are pushed outward. This geometry creates uneven alignment, where spurious correlations are amplified while semantically meaningful but rare cues are marginalised. To address this, we employ a relative measure to rescale similarities based on embedding density, suppressing overconfident scores in diffuse regions while preserving dense, semantically consistent matches. Experimental results on benchmark datasets demonstrate consistent improvements in worst-group and average accuracy, highlighting density-aware translation as a simple and effective calibration mechanism for reliable zero-shot classification using multimodal models.

2606.01708 2026-06-02 cs.LG cs.AI

Two-Fidelity Best-Action Identification for Stochastic Minimax Tree

随机极小极大树的双保真度最优动作识别

Peter Chen, Xi Chen

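AI summary: For fixed-confidence best-action identification in stochastic minimax trees, proposes 2FFS, a two-fidelity tree-search algorithm that combines minimax-style fast expansion with MCTS-style stochastic sampling and adaptively chooses between cheap biased evaluations and expensive accurate ones. The paper proves fixed-confidence correctness, finite stopping, and a polynomial-depth cost upper bound, and experiments show substantially fewer samples and computations than existing BAI-MCTS baselines.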

详情
Comments
36 pages
AI中文摘要

我们研究随机极小极大树中的固定置信度最优动作识别(BAI)。该问题在现代AI规划中日益重要,其中深度极小极大搜索和带有语言模型长滚动的蒙特卡洛树搜索(MCTS)面临一个基本权衡:启发式评估廉价但有偏,而精确滚动可靠但代价高昂。我们提出2FFS,一种双保真度树搜索算法,将多保真度平面赌博机思想引入树中。该算法结合了极小极大风格的快速扩展和MCTS风格的随机采样,自适应地决定何时利用廉价有偏评估以及何时调用昂贵精确评估进行局部认证。我们证明了固定置信度正确性,建立了精确识别的有限停止性,并给出了通用深度树的多项式深度成本上界。在数值随机树实验中,与现有BAI-MCTS基线相比,2FFS使用的样本和计算操作显著减少。

英文摘要

We study fixed-confidence best-action identification (BAI) in stochastic minimax trees. This problem is increasingly relevant in modern AI planning, where deep minimax search and Monte Carlo Tree Search (MCTS) with long language-model rollouts face a fundamental tradeoff: heuristic evaluations are cheap but biased, while accurate rollouts are reliable but prohibitively expensive. We propose 2FFS, a two-fidelity tree-search algorithm that brings multi-fidelity flat bandit ideas into trees. The algorithm combines minimax-style fast expansion with MCTS-style stochastic sampling, adaptively deciding when to exploit cheap biased evaluations and when to invoke expensive accurate evaluations for local certification. We prove fixed-confidence correctness, establish finite stopping for exact identification, and give a polynomial-depth cost upper bound for general-depth trees. Across numerical stochastic-tree experiments, 2FFS uses substantially fewer samples and computational operations than existing BAI-MCTS baselines.

2606.01703 2026-06-02 cs.SD cs.AI cs.CV

JenBridge: Adaptive Long-Form Video Soundtracking across Scene Transitions

JenBridge: 跨场景转换的自适应长视频配乐

Jiashuo Yu, Yao Yao, Boyu Chen, Alex Wang

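AI summary: Proposes JenBridge, a framework that uses a Transformer-based generative model, dual text-visual conditioning, and an LLM-agent-driven adaptive transition mechanism to generate high-fidelity long-form video soundtracks with natural, coherent scene transitions.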

详情
AI中文摘要

我们解决了在场景转换中生成高保真、长格式配乐并保持连贯性的挑战。现有的AI音乐系统主要针对短片段设计,缺乏确保叙事连续性的机制。我们提出了JenBridge,一个模块化且可解释的自适应长视频配乐框架,确保高保真音频生成和转换自然性。核心架构是一个基于Transformer的生成模型,采用流匹配目标训练,遵循两阶段范式:在大规模文本-音频语料库上进行预训练以建立稳健的音乐先验,然后通过双文本-视觉条件适应视频领域以实现精确的跨模态对齐。关键的是,为了实现跨不同场景变化的长格式连贯性,JenBridge引入了一种新颖的自适应过渡机制。该系统具有一个多功能的过渡风格工具包,包括一种生成式过渡方法,并独特地采用了一个大型语言模型(LLM)代理,作为导演智能地为每个叙事转变选择最合适的过渡。为了严格评估这一任务,我们提出了LVS基准,这是一个新基准,包含一个精选数据集和新的评估指标,侧重于整体和过渡感知评估。在提出的基准上进行的大量实验表明,JenBridge在客观和主观指标上均显著优于现有方法,特别是在转换自然性和整体叙事连贯性方面。JenBridge代表了向全自动、专业质量的视频配乐迈出的重要一步。

英文摘要

We address the challenge of generating high-fidelity, long-form soundtracks that remain coherent across scene transitions. Existing AI music systems are mainly designed for short, isolated clips and lack mechanisms to ensure narrative continuity. We present JenBridge, a modular and interpretable framework for adaptive long-form video soundtracking that ensures both high-fidelity audio generation and transition naturalness. The core architecture is a Transformer-based generative model trained with a flow-matching objective, following a two-stage paradigm: pretraining on large-scale text-audio corpora to establish robust musical priors, then adapting to the video domain with dual text-visual conditioning for precise cross-modal alignment. Crucially, to achieve long-form coherence across diverse scene changes, JenBridge incorporates a novel adaptive transition mechanism. This system features a versatile toolkit of transition styles, including a generative transition method, and uniquely employs a Large Language Model (LLM) Agent that acts as a director to select the most appropriate transition for each narrative shift intelligently. To rigorously assess this task, we propose the LVS Benchmark, a new benchmark that includes a curated dataset and novel evaluation metrics focusing on holistic and transition-aware assessment. Extensive experiments on the proposed benchmark demonstrate that JenBridge significantly outperforms existing methods in both objective and subjective metrics, particularly in terms of transition naturalness and overall narrative coherence. JenBridge represents a significant step towards fully automated, professional-quality video soundtracking.

2606.01702 2026-06-02 cs.GR cs.LG

KDH-CAD: Knowledge-data hybrid CAD learning under data scarcity

KDH-CAD:数据稀缺下的知识-数据混合CAD学习

Ziqin Gao, Zhijie Yang, Qiang Zou

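AI summary: Proposes KDH-CAD, a framework that fuses pretrained foundation models, structured domain knowledge, and a small amount of labeled CAD data to classify mechanical parts efficiently under data scarcity, reaching 92.6% accuracy with 250 samples and 95.8% with 1,000 samples.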

详情
Comments
18 pages
AI中文摘要

计算机辅助设计(CAD)中的深度学习仍然受到数据稀缺挑战的根本制约:真实的CAD数据难以大规模收集,而合成数据可能无法真实反映实际设计实践。本文不追求更大的CAD数据集,而是将CAD学习视为知识补全和校准问题。它引入了KDH-CAD,一个知识-数据混合框架,该框架整合了基础模型中的预训练知识、教科书/教程中的结构化领域知识以及非常少量的标注CAD数据。领域知识用于引出和补全在预训练基础模型中表达较弱或代表性不足的CAD相关概念,而标注CAD数据则在潜在空间中校准这些概念,以考虑特定任务的几何变异性,而无需微调基础模型。在真实机械零件分类上的实验表明,KDH-CAD在低数据场景下取得了强劲性能,仅用250个训练样本就达到92.6%的准确率,用1000个样本达到95.8%,并且随着数据增加持续提升。这匹配或超过了通常需要多一个数量级数据的现有最优性能。这些结果表明,将预训练基础模型与结构化领域知识相结合可以大幅减少对大规模CAD数据集的依赖,为数据高效的CAD学习提供了原则性和实用性的方向。

英文摘要

Deep learning in computer-aided design (CAD) remains fundamentally constrained by the data scarcity challenge: authentic CAD data is difficult to collect at scale, while synthetic data may not faithfully reflect real design practice. Rather than pursuing ever-larger CAD datasets, this paper instead treats CAD learning as a knowledge completion and calibration problem. It introduces KDH-CAD, a knowledge-data hybrid framework that integrates pretrained knowledge in foundation models, structured domain knowledge from textbooks/tutorials, and a very small amount of labeled CAD data. Domain knowledge is used to elicit and complete CAD-relevant concepts that are weakly expressed or under-represented in pretrained foundation models, while labeled CAD data calibrates these concepts in the latent space to account for task-specific geometric variability, without fine-tuning the foundation model. Experiments on real-world mechanical part classification show that KDH-CAD achieves strong performance in low-data regimes, reaching 92.6% accuracy with only 250 training samples and 95.8% with 1,000 samples, and continuing to improve with additional data. This matches or exceeds state-of-the-art performance that typically requires an order of magnitude more data. These results suggest that combining pretrained foundation models with structured domain knowledge can substantially reduce reliance on large-scale CAD datasets, providing a principled and practical direction for data-efficient CAD learning.

2606.01701 2026-06-02 cs.CV

Spatio-Temporal Correlation Guided Geometric Partitioning for Versatile Video Coding

时空相关性引导的几何划分用于多功能视频编码

Xuewei Meng, Chuanmin Jia, Xinfeng Zhang, Shanshe Wang, Siwei Ma

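AI summary: To address the high overhead of geometric partitioning in VVC, proposes a spatio-temporal correlation guided geometric partitioning (STGEO) scheme that reduces side-information bits through mode prediction and motion candidate selection, improving coding efficiency.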

详情
Journal ref
IEEE Transactions on Image Processing, vol. 31, pp. 30-42, 2022
AI中文摘要

几何划分因其在混合视频编码框架中卓越的运动场描述能力而受到越来越多的关注。然而,多功能视频编码(VVC)中现有的几何划分(GEO)方案给边信息的信令带来了不可忽视的负担,从而限制了编码效率。鉴于此,我们提出了一种时空相关性引导的几何划分(STGEO)方案,以有效描述视频编码运动场中的物体信息。所提方法可以节省用于边信息信令的比特,包括划分模式和运动信息。我们首先以统计合理的方式分析了划分模式决策和运动矢量选择的特性。基于观察到的时空相关性,我们设计了一种模式预测和编码方法,以减少表示上述边信息的开销。主要思想是预测具有较高选择可能性的STGEO模式和运动候选,这可以指导熵编码,即用更少的比特表示预测的高概率模式和运动候选。特别地,高概率STGEO模式基于边缘信息和相邻STGEO编码块的历史模式进行预测。相应的运动信息由合并候选列表中的索引表示,该索引基于离线训练的合并候选选择概率自适应地推断。仿真结果表明,与未使用GEO的VTM-8.0相比,所提方法在随机接入和低延迟B配置下平均分别节省了0.95%和1.98%的比特率。

英文摘要

Geometric partitioning has attracted increasing attention for its remarkable motion field description capability in the hybrid video coding framework. However, the existing geometric partitioning (GEO) scheme in Versatile Video Coding (VVC) imposes a non-negligible burden for signaling the side information, which limits coding efficiency. In view of this, we propose a spatio-temporal correlation guided geometric partitioning (STGEO) scheme to efficiently describe the object information in the motion field of video coding. The proposed method saves the bits consumed for side information signaling, including the partitioning mode and motion information. We first analyze the characteristics of partitioning mode decision and motion vector selection in a statistically sound way. Based on the observed spatio-temporal correlation, we design a mode prediction and coding method to reduce the overhead of representing the above-mentioned side information. The main idea is to predict the STGEO modes and motion candidates that are more likely to be selected, which can guide the entropy coding, i.e., representing the predicted high-probability modes and motion candidates with fewer bits. In particular, the high-probability STGEO modes are predicted based on the edge information and history modes of adjacent STGEO-coded blocks. The corresponding motion information is represented by the index in a merge candidate list, which is adaptively inferred based on the off-line trained merge candidate selection probability. Simulation results show that the proposed approach achieves 0.95% and 1.98% bit-rate savings on average compared to VTM-8.0 without GEO for Random Access and Low-Delay B configurations, respectively.
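
The idea of spending fewer bits on high-probability modes can be made concrete with the ideal code length of entropy coding, which is -log2(p) bits for a symbol of probability p. The sketch below is only an illustration of that principle: the four-mode alphabet and its probabilities are made up, and nothing here reproduces the STGEO mode predictor or the VVC entropy coder.

```python
import math


def ideal_bits(p):
    """Ideal code length in bits for a symbol with probability p."""
    return -math.log2(p)


def expected_bits(probs):
    """Average bits per symbol when every symbol gets its ideal code length."""
    return sum(p * ideal_bits(p) for p in probs if p > 0)


# Hypothetical distributions over four partitioning modes.
no_prediction = [0.25, 0.25, 0.25, 0.25]    # every mode equally likely
with_prediction = [0.5, 0.25, 0.125, 0.125]  # predictor concentrates probability

print(expected_bits(no_prediction))    # average cost without prediction
print(expected_bits(with_prediction))  # lower average cost with a skewed model
```

The sharper the predicted distribution, the fewer bits the side information costs on average, which is the mechanism the abstract relies on.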

2606.01700 2026-06-02 cs.CV

MixerSENet: A Lightweight Framework for Efficient Hyperspectral Image Classification

MixerSENet: 一种用于高效高光谱图像分类的轻量级框架

Mohammed Q. Alkhatib, Swalpa Kumar Roy, Ali Jamali

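AI summary: Proposes MixerSENet, a lightweight framework that decouples spatial and channel mixing and adds a squeeze-and-excitation module, achieving accurate and efficient hyperspectral image classification with a low parameter count.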

详情
Comments
Accepted and Published in IEEE Geoscience and Remote Sensing Letters (GRSL)
AI中文摘要

本文提出了一种新颖的框架MixerSENet,用于高光谱图像(HSI)分类,旨在解决计算效率和有限标注数据带来的挑战。所提出的模型处理高光谱图像块,同时在整个网络中保持一致的尺寸和分辨率,有效解耦了空间和通道维度的混合。值得注意的是,MixerSENet轻量且计算高效,与传统模型相比所需参数更少,适用于资源受限环境。模型中嵌入了挤压激励块以细化特征提取,增强网络捕获更多信息特征的能力。在两个基准数据集上的实验结果表明,MixerSENet实现了优越的性能,在Houston13数据集上达到82.47%的总体精度(OA),在Qingyun数据集上达到96.70%,优于包括3D-CNN、HybridKAN、HSIFormer、SimPoolFormer和MorphMamba在内的最先进方法。此外,对计算效率的详细分析表明,MixerSENet在准确性和效率之间实现了良好的平衡,仅需53,146个参数和较低的推理时间,证实了其在实际应用中的实用性。发布时,源代码将在https://github.com/mqalkhatib/MixerSENet公开。

英文摘要

In this paper, a novel framework, MixerSENet, is introduced for hyperspectral image (HSI) classification, designed to address the challenges of computational efficiency and limited labeled data. The proposed model processes hyperspectral image patches while maintaining consistent size and resolution throughout the network, effectively decoupling the mixing of spatial and channel dimensions. Notably, MixerSENet is lightweight and computationally efficient, requiring fewer parameters than traditional models, making it suitable for resource-constrained environments. A squeeze-and-excitation block is incorporated into the model to refine feature extraction, enhancing the network's ability to capture more informative features. Experimental results on two benchmark datasets demonstrate that MixerSENet achieves superior performance, reaching an overall accuracy (OA) of 82.47% on the Houston13 dataset and 96.70% on the Qingyun dataset, outperforming state-of-the-art methods including 3D-CNN, HybridKAN, HSIFormer, SimPoolFormer, and MorphMamba. Furthermore, a detailed analysis of computational efficiency shows that MixerSENet achieves a favorable balance between accuracy and efficiency, with only 53,146 parameters and a low inference time, confirming its practicality for real-world applications. At publication, source code will be publicly available at https://github.com/mqalkhatib/MixerSENet.
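
For readers unfamiliar with the squeeze-and-excitation block mentioned above, here is a minimal pure-Python sketch of the generic mechanism: pool each channel to a scalar, pass the result through a small bottleneck, and rescale the channels with the resulting gates. It is not the MixerSENet implementation; the weight shapes, the omission of bias terms, and the name `squeeze_excite` are simplifying assumptions.

```python
import math


def squeeze_excite(feature_map, w_reduce, w_expand):
    """Generic squeeze-and-excitation on a C x H x W nested list.

    w_reduce: R x C weights of the bottleneck layer (biases omitted).
    w_expand: C x R weights of the gating layer (biases omitted).
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation: linear layer + ReLU, then linear layer + sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_expand
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    return [
        [[gate * value for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]


# Two channels of size 1 x 2; zero bottleneck weights give a gate of 0.5 everywhere.
x = [[[2.0, 4.0]], [[6.0, 8.0]]]
y = squeeze_excite(x, w_reduce=[[0.0, 0.0]], w_expand=[[1.0], [1.0]])
print(y)
```

The block adds only the two small weight matrices, which is why it fits a model that targets a low parameter count.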

2606.01698 2026-06-02 cs.CV

Learning Label-Efficient Interpretable Medical Image Diagnosis via Semi-supervised Hypergraph Concept Bottleneck Model

通过半监督超图概念瓶颈模型实现标签高效的医学图像可解释诊断

Yijun Yang, Ruiqiang Xiao, Lijie Hu, Angelica I Aviles-Rivero, Yunzhu Wu, Jing Qin, Lei Zhu

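AI summary: Proposes a semi-supervised hypergraph concept bottleneck model that uses dual-level hypergraph learning to model high-order concept dependencies and generate domain-adaptive pseudo-labels, achieving strong interpretability and performance in medical image diagnosis tasks such as Placenta Accreta Spectrum.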

详情
AI中文摘要

深度学习在医学图像分析中取得了革命性进展,在多种应用中提供了卓越的诊断准确性。然而,其决策缺乏可解释性阻碍了临床采纳,特别是在高风险医疗场景中,透明度对可信度至关重要。例如,在胎盘植入谱系(PAS)中,超声图像中的细微线索挑战了可靠诊断,使得黑盒模型难以获得准确的评分信任。为了解决这一问题,概念瓶颈模型(CBM)通过将临床上有意义的中间概念嵌入诊断流程,提供了一种有前景的途径,使临床医生能够审查和优化模型输出。然而,传统的CBM在捕捉复杂的概念间依赖关系方面表现不佳,并且需要昂贵、专家驱动的概念注释,限制了其可扩展性。本研究引入了一种新颖的半监督CBM框架,专为医学成像设计,利用双层超图学习来建模高阶概念依赖并生成领域自适应伪标签。我们的方法通过集成概念级超图以增强推理和图像级超图以生成鲁棒的伪标签,实现了卓越的可解释性和性能。在新标注的PAS超声数据集和乳腺超声公共数据集上的实验证明了所提出的概念标签高效可解释框架的有效性。其通用性在皮肤镜图像数据集SkinCon上得到了进一步验证。代码可在https://github.com/scott-yjyang/HyperCBM获取。

英文摘要

Deep learning has revolutionized medical image analysis, delivering exceptional diagnostic accuracy across diverse applications. Yet, the lack of interpretability in its decision-making hinders clinical adoption, particularly in high-stakes medical contexts where transparency is paramount for trustworthiness. For example, in Placenta Accreta Spectrum (PAS), subtle cues in ultrasound imaging challenge reliable diagnosis, rendering black-box models untrustworthy for accurate scoring. To address this, Concept Bottleneck Models (CBMs) offer a promising avenue by embedding clinically meaningful intermediate concepts into the diagnosis pipeline, enabling clinicians to scrutinize and refine model outputs. However, conventional CBMs falter in capturing complex inter-concept dependencies and demand costly, expert-driven concept annotations, limiting their scalability. This study introduces a novel semi-supervised CBM framework designed for medical imaging, which leverages dual-level hypergraph learning to model high-order concept dependencies and generate domain-adaptive pseudo-labels. Our approach achieves superior interpretability and performance by integrating a concept-level hypergraph for enhanced reasoning and an image-level hypergraph for robust pseudo-label generation. Experiments on a newly annotated PAS ultrasound dataset and a breast ultrasound public dataset demonstrate the effectiveness of the proposed concept label-efficient interpretable framework. Its universality is further validated on the dermoscopic image dataset SkinCon. The code is available at https://github.com/scott-yjyang/HyperCBM.

2606.01697 2026-06-02 cs.CL

RCEM: Embedder Equipped with Query Rewriting Skill for Robust Conversational Search in Distributional Shift

RCEM:配备查询重写技能的嵌入器,用于分布偏移下的鲁棒对话搜索

Kilho Son, Paul Hsu, Cha Zhang, Dinei Florencio

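AI summary: Proposes RCEM, which distills the query rewriting capability of LLMs into an embedding model, enabling context-aware retrieval without explicit rewriting and improving robustness under distributional shift.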

详情
AI中文摘要

对话搜索在检索增强生成(RAG)系统中变得越来越重要,用户通过包含上下文相关查询的多轮对话与AI助手交互。我们提出RCEM,一种对话式稠密检索模型,它将LLM的查询改写能力蒸馏到嵌入模型中,从而在推理时无需显式查询改写即可实现上下文感知检索。与先前学习直接对话到文档匹配的对话式稠密检索方法不同,RCEM将对话查询嵌入与改写后的查询嵌入对齐,提高了在分布偏移下的鲁棒性。RCEM不需要用于训练的对话查询到文档的相关性映射,这些映射通常昂贵且难以获得高质量。在QReCC、TopiOCQA和TREC CAsT上的大量实验表明,RCEM始终优于强对话检索基线,在分布偏移下取得了特别大的增益,包括Recall@10提升高达20%。RCEM进一步扩展了基础嵌入模型,使其具备对话查询改写能力,同时保留了原有的检索功能,允许单个模型对独立查询和对话查询进行编码,并针对现有文档索引进行搜索,而无需重建检索数据库。

英文摘要

Conversational search has become increasingly important in retrieval-augmented generation (RAG) systems, where users interact with AI assistants through multi-turn conversations containing context-dependent queries. We propose RCEM, a conversational dense retrieval model that distills the query reformulation capability of LLMs into the embedding model, enabling context-aware retrieval without explicit query rewriting during inference. Unlike prior conversational dense retrieval approaches that learn direct conversation-to-document matching, RCEM aligns conversational-query embeddings with rewritten-query embeddings, improving robustness under distributional shift. RCEM does not require conversational query-to-document relevance mappings for training, which are often expensive and difficult to obtain with high quality. Extensive experiments on QReCC, TopiOCQA, and TREC CAsT demonstrate that RCEM consistently outperforms strong conversational retrieval baselines, achieving particularly large gains under distributional shift, including up to 20% improvement in Recall@10. RCEM further extends the base embedding model with conversational query rewriting capability while preserving its original retrieval functionality, allowing both standalone and conversational queries to be encoded by a single model and searched against existing document indexes without rebuilding the retrieval database.
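
The alignment idea described above, pulling the embedding of a raw conversation toward the embedding of its LLM-rewritten standalone query, can be sketched as a simple training objective. The abstract does not state the loss, so the cosine-distance form, the function names, and the toy vectors below are all assumptions for illustration; inputs are assumed to be non-zero vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(conv_embeddings, rewrite_embeddings):
    """Mean cosine distance between paired embeddings (assumed objective).

    conv_embeddings: student embeddings of the full conversational queries.
    rewrite_embeddings: target embeddings of the rewritten standalone queries.
    """
    pairs = list(zip(conv_embeddings, rewrite_embeddings))
    return sum(1.0 - cosine_similarity(c, r) for c, r in pairs) / len(pairs)


# Toy 2-D embeddings: the first pair is aligned, the second is orthogonal.
conv = [[1.0, 0.0], [1.0, 0.0]]
rewritten = [[2.0, 0.0], [0.0, 3.0]]
print(alignment_loss(conv, rewritten))
```

Note that an objective of this form needs only conversation and rewrite pairs, with no query-to-document relevance labels, which matches the training requirement stated in the abstract.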

2606.01695 2026-06-02 cs.LG

CANARY: Zero-Label Detection of Fine-Tuning Contamination in Language Models

CANARY: 语言模型中微调污染的无标签检测

Swapnil Parekh

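AI summary: Proposes CANARY, which analyzes hidden-state differences with a sparse autoencoder to detect fine-tuning data contamination without labels, achieving AUROC = 1.000 at 1% contamination and supporting detection, verification, prioritization, and remediation.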

详情
AI中文摘要

攻击者可以通过污染仅1%的微调样本来植入潜在的有害行为。这种污染对所有的输出级防御都是不可见的:有害行为潜伏在模型的隐藏状态几何中,直到污染超过7.5%才会在生成的文本中出现。我们提出了CANARY(通过神经激活表示产出的污染审计器),这是一种无标签检查点审计器,可以直接通过对未标记提示集进行两次前向传递来检测这种隐藏的偏移。CANARY通过稀疏自编码器投影隐藏状态差异,过滤风格噪声以隔离有意义的语义漂移。它在四种模型架构和两种训练范式下,在1%污染率下实现了AUROC=1.000(95%置信区间=[0.997, 1.000];Cohen's d=3.28),比任何输出级方法触发点低7.5倍,并且在良性微调上零误报,对风格匹配和梯度噪声自适应攻击具有完全鲁棒性。相同的SAE特征基础驱动了一个完整的治理流程:SAE过滤放大以比标准生成高5倍的速率揭示潜在危害;得分排序的提示带来4.2倍的红队测试提升;在推理时抑制少数污染特定特征将危害从70%降低到10%,且无困惑度惩罚。CANARY是第一个仅从隐藏状态检测、验证、优先排序和修复供应链污染的无标签框架。

英文摘要

Adversaries can implant latent harmful behavior by poisoning as few as 1% of fine-tuning examples. The contamination is invisible to every output-level defense: harmful behavior lies dormant in the model's hidden-state geometry and does not appear in generated text until contamination exceeds 7.5%. We introduce CANARY (Contamination Auditor via Neural Activation Representation Yield), a zero-label checkpoint auditor that detects this hidden shift directly from two forward passes over an unlabeled prompt set. CANARY projects the hidden-state difference through a Sparse Autoencoder, filtering style noise to isolate meaningful semantic drift. It achieves AUROC = 1.000 at 1% contamination (95% CI = [0.997, 1.000]; Cohen's d = 3.28) across four model architectures and two training paradigms, 7.5x below where any output-level method fires, with zero false positives on benign fine-tuning and full robustness to style-matching and gradient-noise adaptive attacks. The same SAE feature basis drives a complete governance pipeline: SAE-filtered amplification surfaces latent harm at a 5x higher rate than standard generation; score-ranked prompts yield 4.2x red-teaming lift; and suppressing a handful of contamination-specific features at inference time reduces harm from 70% to 10% with no perplexity penalty. CANARY is the first zero-label framework to detect, verify, prioritize, and remediate supply-chain contamination from hidden states alone.

2606.01694 2026-06-02 cs.CV cs.AI cs.LG cs.MM

Understanding Identity Continuity in Thermal Video through Scene-Level Consistency

通过场景级一致性理解热视频中的身份连续性

Wei-Chieh Sun, Gyungmin Ko, Heejae Kwon, Hsiang-Wei Huang, Jenq-Neng Hwang

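AI summary: To address identity fragmentation in thermal pedestrian multi-object tracking, proposes a lightweight post-processing method that restores identity continuity through online short-gap remapping and offline tracklet relinking, improving IDF1 on the PBVS Thermal Pedestrian MOT benchmark.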

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 1411-1419
Comments
Accepted to CVPR 2026 Workshop on SVC. Published in CVPR Workshops proceedings
AI中文摘要

热行人多目标跟踪仍然具有挑战性,因为弱外观线索和频繁的检测中断导致严重的轨迹碎片化。我们研究轻量级后处理是否可以在不依赖重型重识别模型或复杂在线关联的情况下恢复身份连续性。从YOLOv8和SORT基线开始,我们添加了一个模块化的身份修复后端,包括基于时间、空间、运动和边界线索的在线短间隙重映射和离线轨迹重链接。在固定验证集上的受控消融实验和在官方PBVS热行人MOT基准上的评估表明,主要身份增益来自保守的重链接,将IDF1从82.25提升到84.93,同时保持MOTA,而许多启发式阈值在广泛的操作范围内保持稳定。这些结果表明,在低信息热图像中,通过高精度轨迹重链接比增加跟踪器复杂性更能有效地实现鲁棒的身份恢复。这些结果提供了对热视频中身份恢复的受控分析,表明与局部帧到帧关联相比,场景级时空一致性在身份连续性中起主导作用。

英文摘要

Thermal pedestrian MOT remains challenging because weak appearance cues and frequent detection interruptions cause severe trajectory fragmentation. We study whether lightweight post-processing can recover identity continuity without relying on heavy re-identification models or complex online association. Starting from a YOLOv8 and SORT baseline, we add a modular identity-repair backend consisting of online short-gap remapping and offline tracklet relinking based on temporal, spatial, motion, and border cues. Controlled ablations on a fixed validation split and evaluation on the official PBVS Thermal Pedestrian MOT benchmark show that the main identity gains arise from conservative relinking, improving IDF1 from 82.25 to 84.93 while preserving MOTA, whereas many heuristic thresholds remain stable across broad operating ranges. These results suggest that, in low-information thermal imagery, robust identity recovery can be achieved more effectively through high-precision trajectory relinking than through increasing tracker complexity. These results provide a controlled analysis of identity recovery in thermal video, showing that scene-level spatial-temporal consistency plays a dominant role in identity continuity compared to local frame-to-frame association.
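
A conservative offline relinking pass of the kind described above might look like the following sketch. It uses only the temporal and spatial cues (the motion and border cues are left out), links two tracklets only when the match is unambiguous, and relies on made-up thresholds; the data layout and the name `relink` are assumptions, not the paper's interface.

```python
import math


def relink(tracklets, max_gap=30, max_dist=50.0):
    """Return {later_id: earlier_id} for unambiguous tracklet merges.

    tracklets: {id: {"start": (frame, x, y), "end": (frame, x, y)}}.
    max_gap (frames) and max_dist (pixels) are hypothetical thresholds.
    """
    links = {}
    for later_id, later in tracklets.items():
        candidates = []
        for earlier_id, earlier in tracklets.items():
            if earlier_id == later_id:
                continue
            gap = later["start"][0] - earlier["end"][0]
            dist = math.hypot(
                later["start"][1] - earlier["end"][1],
                later["start"][2] - earlier["end"][2],
            )
            if 0 < gap <= max_gap and dist <= max_dist:
                candidates.append(earlier_id)
        # Conservative rule: accept only a single plausible predecessor.
        if len(candidates) == 1:
            links[later_id] = candidates[0]
    # Also drop links where one predecessor is claimed by several successors.
    claims = {}
    for later_id, earlier_id in links.items():
        claims.setdefault(earlier_id, []).append(later_id)
    return {b: a for b, a in links.items() if len(claims[a]) == 1}


tracks = {
    1: {"start": (0, 0.0, 0.0), "end": (100, 10.0, 10.0)},
    2: {"start": (110, 15.0, 12.0), "end": (200, 40.0, 30.0)},
    3: {"start": (500, 400.0, 400.0), "end": (600, 420.0, 410.0)},
}
print(relink(tracks))  # tracklet 2 continues tracklet 1; tracklet 3 stays separate
```

Rejecting ambiguous matches trades a few missed merges for fewer identity switches, in the spirit of the high-precision relinking the abstract argues for.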

2606.01691 2026-06-02 cs.CR cs.LG

IstGPT: LLM-based Anomaly Detection for Spatial-Temporal Graph in Industrial Systems

IstGPT:基于LLM的工业系统时空图异常检测

Yuchen Zhang, Ning Xi, Pengbin Feng, Shigang Liu, Jianfeng Ma, Yulong Shen, Yanan Sun, Xiaolin Zhou

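AI summary: Proposes IstGPT, the first industrial anomaly detection tool combining large language models with graph learning, which extracts sensor-actuator dependency graphs from multi-modal knowledge and uses improved graph neural networks for real-time anomaly detection, achieving the best F1-scores and eTaF1 on nine datasets.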

详情
AI中文摘要

工业互联网系统面临来自复杂工业控制系统(ICS)攻击的日益增长的威胁,导致严重的安全事件。然而,由于传感器和执行器之间的复杂依赖关系,现有工具在实时异常检测方面效果有限。为了解决这个问题,我们提出了IstGPT,这是首个基于大语言模型和图学习的工业异常检测工具,能够针对广泛的ICS攻击提供实时保护。IstGPT实现了对工业信息物理系统中时空依赖关系的细粒度精确建模。它首先利用工业多模态知识,包括操作数据、技术文档和系统图,通过多阶段提示工程提取传感器-执行器依赖图。然后,LLM-Optimation基于节点准确性、边缘一致性和逻辑连贯性迭代优化图。最后,IstGPT将改进的图神经网络与编码器-解码器架构相结合,通过重构误差检测异常。我们在9个数据集上评估了IstGPT与12个最先进基线模型的性能,包括2个公共数据集、6个模拟数据集和一个真实机器人手臂数据集。IstGPT在所有九个数据集上取得了最佳的F1分数和eTaF1(一种较新的时间感知指标)。我们进一步讨论了在真实工业场景中部署IstGPT的可行性。

英文摘要

Industrial Internet systems face increasing threats from sophisticated industrial control system (ICS) attacks, resulting in critical safety incidents. However, existing tools exhibit limited effectiveness in real-time anomaly detection due to the complex dependencies among sensors and actuators. To tackle this, we present IstGPT, the first industrial anomaly detection tool based on LLMs and graph learning to provide real-time protection against a wide range of ICS attacks. IstGPT achieves fine-grained and precise modeling of spatial-temporal dependencies in industrial cyber-physical systems. It first leverages industrial multi-modal knowledge, including operational data, technical documents, and system diagrams, to extract sensor-actuator dependency graphs via multi-stage prompt engineering. Then, LLM-Optimation iteratively refines the graph based on node accuracy, edge consistency, and logical coherence. Finally, IstGPT integrates improved graph neural networks with an encoder-decoder architecture to detect anomalies via reconstruction errors. We evaluate IstGPT against 12 state-of-the-art baselines on 9 datasets, including 2 public, 6 simulated, and a real-world robotic arm dataset. IstGPT achieves the best F1-scores and eTaF1 (a newer time-aware metric) across all nine datasets. We further discuss the feasibility of deploying IstGPT in real-world industrial scenarios.

2606.01689 2026-06-02 cs.CV cs.AI

RPCASSM: Robust PCA State Space Model For Infrared Small Target Detection

RPCASSM: 基于鲁棒主成分分析的状态空间模型用于红外小目标检测

Pingping Liu, Aohua Li, Yubing Lu, Jin Kuang, Tongshun Zhang, Qiuzhan Zhou

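AI summary: To address the difficulty mainstream state space models have in accurately modeling target edges in infrared small target detection, proposes RPCASSM, a network based on robust principal component analysis (RPCA). Its background state space module (BSSM) and target state space module (TSSM) perform state space modeling using the saliency of spatially heterogeneous signals and the sparsity and local highlights of targets, respectively, which effectively resolves the edge modeling problem.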

详情
Comments
12 pages, 8 figures, under review
AI中文摘要

红外小目标的检测与分割在监控安防、海上救援等领域具有重要的应用意义。由于这些目标在远距离成像中占据像素少,主流的视觉状态空间模型效率低下且难以准确建模目标边缘。现有的红外状态空间模型并未从红外小目标的结构特性出发偏离主流视觉状态空间结构框架。为了解决这一问题,本文基于鲁棒主成分分析(RPCA)的模型范式提出了RPCASSM网络,旨在通过红外小目标在空间域的性质设计背景状态空间模块(BSSM)和目标状态空间模块(TSSM)。BSSM旨在利用空间异质信号的显著性设计空间探测扫描机制(SPCM)来建模背景信息。TSSM利用目标的稀疏性和局部高亮特性设计可变形提示扫描机制(DPCM),聚焦于目标的可变形空间进行状态空间建模。通过上述设计,我们有效解决了现有主流视觉状态空间模型难以准确建模红外小目标边缘结构的问题。在现有基准数据集上的实验结果证明了RPCASSM设计的有效性。我们的代码将在https://github.com/PepperCS/RPCASSM公开。

英文摘要

The detection and segmentation of infrared small targets are important in applications such as surveillance, security, and maritime rescue. Because these targets occupy very few pixels in long-distance imaging, mainstream visual state space models are inefficient and struggle to model target edges accurately. Existing infrared state space models still follow the mainstream visual state space framework rather than departing from it on the basis of the structural properties of infrared small targets. To address this, we propose RPCASSM, a network built on the robust principal component analysis (RPCA) model paradigm, which designs a background state space module (BSSM) and a target state space module (TSSM) according to the spatial-domain properties of infrared small targets. The BSSM uses the saliency of spatially heterogeneous signals to design a spatial probe scanning mechanism (SPCM) that models background information. The TSSM exploits the sparsity and local highlights of targets to design a deformable prompt scanning mechanism (DPCM) that focuses state space modeling on the target's deformable space. With this design, we effectively address the difficulty that existing mainstream visual state space models have in accurately modeling the edge structure of infrared small targets. Experimental results on existing benchmark datasets demonstrate the effectiveness of the RPCASSM design. Our code will be made public at https://github.com/PepperCS/RPCASSM.

2606.01686 2026-06-02 cs.SD cs.AI

HAIM: Human-AI Music Datasets for AI Music Production Tracking Benchmark

HAIM: 用于AI音乐制作跟踪基准的人机音乐数据集

Seonghyeon Go, Yumin Kim

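AI summary: To address the limitation that current AI music detection is confined to binary classification, proposes the HAIM dataset, which defines the "AI Music Tracking" task through multi-stage labels, evaluates the shortcomings of existing detectors, and pushes the field toward fine-grained, structured evaluation.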

详情
AI中文摘要

随着Suno和Udio等生成平台达到人类级音频质量,AI的实用性已扩展到整个音乐制作流程。除了简单的音轨生成,这些进步催生了AI驱动方法在各种形式中的应用,包括声音合成、编曲和专业母带处理。然而,当前的检测研究仍主要局限于二元“AI或人类”范式,未能反映当代音乐制作流程的现实。在真实制作中,AI工具越来越多地被用于优化或母带处理人类制作的音轨,而人类工程师同样对AI生成的材料进行后处理以确保专业质量。此外,用户经常采用对抗策略绕过AI检测器,例如对AI生成的音轨应用人类母带处理。这创造了一个简单的二元分类无法捕捉的灰色地带。在本文中,我们定义并研究“AI音乐跟踪”:在音乐制作的多面光谱中识别特定AI集成的挑战。为此,我们引入HAIM,一个具有音乐制作阶段多样化标签的数据集。它旨在隔离AI干预的阶段,包括混合制作和代理级跟踪。我们对最先进检测器的评估揭示了系统性缺陷。通过发布HAIM,我们提出了一个新的基准,将领域从二元分类转向对AI音乐的细粒度结构化评估。

英文摘要

As generative platforms such as Suno and Udio reach human-grade audio quality, the scope of AI's utility has expanded across the entire music production workflow. Beyond simple track generation, these advancements have catalyzed the adoption of AI-driven methodologies in diverse forms. These include vocal synthesis, arrangement, and professional mastering. However, current detection research remains largely confined to a binary "AI-or-human" paradigm. It fails to reflect the realities of contemporary music production workflows. In real-world production, AI tools are increasingly used to refine or master human-produced tracks, and human engineers likewise post-process AI-generated material to ensure professional quality. Moreover, users often employ adversarial tactics to bypass AI detectors, such as applying human mastering to AI-generated tracks. This creates a grey area that a simple binary classification fails to capture. In this paper, we define and investigate "AI Music Tracking": the challenge of identifying specific AI integration across the multifaceted spectrum of music production. To this end, we introduce HAIM, a dataset with diverse labels for stages of music production. It is designed to isolate stages of AI intervention, including hybrid production and agent-level tracking. Our evaluation of state-of-the-art detectors reveals systemic flaws. By releasing HAIM, we propose a new benchmark that shifts the field beyond binary classification toward a granular, structured evaluation of AI music.

2606.01682 2026-06-02 cs.CL cs.AI cs.LG

Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

Off-the-Shelf Large Language Models as Process Scorers: A Training-Free Alternative to PRMs in Mathematical Reasoning

Atoosa Chegini, Soheil Feizi

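AI Summary: Proposes Chunk-Level Guided Generation, which uses an off-the-shelf large language model as a process scorer; with fixed-length chunk scoring and a contrastive selection rule, it matches or exceeds PRM-guided search on mathematical reasoning without any training.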

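Details
AI Chinese Abstract (English translation)

Using a stronger scorer to pick the best response from multiple small-model samples is a simple inference-time strategy, but it fails when the small model is already locked into an incorrect reasoning path. PRM-guided search avoids this by scoring candidate continuations during generation, but it requires a reward model trained with step-level labels. We propose Chunk-Level Guided Generation, a training-free alternative that uses an off-the-shelf large language model as a process scorer. At each step, the small model samples k fixed-length candidate chunks, and the large model scores them by likelihood without generating any text. The selected chunk is committed before the next step, steering generation before errors propagate. We instantiate the framework with two selection rules: Likelihood-Guided Selection (LGS), which picks the chunk with the highest length-normalized large-model log-probability, and Contrastive-Guided Selection (CGS), which subtracts the small model's log-probability to favor chunks where the large model's preference differs from the small model's. We show that scoring variable-length reasoning steps with large-model likelihoods is unreliable because of a systematic length bias that persists even after length normalization, and that fixed-length chunks avoid this confound. On GSM8K, MATH, Minerva Math, AMC23, and AIME24, with Qwen2.5-32B guiding Qwen2.5-1.5B and Llama-3.1-70B guiding Llama-3.2-1B, CGS improves on majority voting by up to 28 percentage points and, under matched guidance budgets, matches or exceeds Qwen2.5-Math-PRM-72B guided search on most benchmarks without reward-model training. With Qwen2.5-72B guiding Qwen2.5-7B, CGS reaches 81.8% on MATH and 63.6% on Minerva Math at k=16, beating majority voting by 4-6 percentage points. Finally, Chunk-Level Guided Generation produces much shorter reasoning traces than PRM-guided search.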

English Abstract

Selecting the best response from multiple small-model samples using a stronger scorer is a simple inference-time strategy, but fails when the small model has already committed to incorrect reasoning paths. PRM-guided search avoids this by scoring candidate continuations during generation, but requires a reward model trained with step-level labels. We propose Chunk-Level Guided Generation, a training-free alternative that uses an off-the-shelf large language model as a process scorer. At each step, a small model samples k fixed-length candidate chunks, while the larger model scores the candidates using likelihoods without generating any text. The selected chunk is committed before the next step, steering generation before errors can propagate. We instantiate this framework with two selection rules: Likelihood-Guided Selection (LGS), which selects the chunk with the highest length-normalized large-model log-probability, and Contrastive-Guided Selection (CGS), which subtracts the small model's log-probability to favor chunks where the large model's preference diverges from the small model's. We show that scoring variable-length reasoning steps with large-model likelihoods is unreliable due to a systematic length bias that persists even after length normalization, and that fixed-length chunks avoid this confound. On GSM8K, MATH, Minerva Math, AMC23, and AIME24 with Qwen2.5-1.5B guided by Qwen2.5-32B and Llama-3.2-1B guided by Llama-3.1-70B, CGS outperforms majority voting by up to 28 pp and, under matched guidance budgets, matches or outperforms Qwen2.5-Math-PRM-72B guided search on most benchmarks without reward-model training. With Qwen2.5-7B guided by Qwen2.5-72B, CGS reaches 81.8% on MATH and 63.6% on Minerva Math at k=16, surpassing majority voting by 4-6 pp. Finally, Chunk-Level Guided Generation produces substantially shorter reasoning traces than PRM-guided search.
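
The two selection rules described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names (`mean_logprob`, `select_lgs`, `select_cgs`) are hypothetical, the per-token log-probabilities are assumed to have been obtained already from the large and small models, and the CGS score is assumed to be an unweighted difference of length-normalized log-probabilities, which the abstract does not specify.

```python
from typing import List, Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log-probability of one candidate chunk."""
    return sum(token_logprobs) / len(token_logprobs)


def select_lgs(large_lp: List[Sequence[float]]) -> int:
    """LGS (sketch): index of the chunk with the highest
    length-normalized large-model log-probability."""
    scores = [mean_logprob(lp) for lp in large_lp]
    return max(range(len(scores)), key=scores.__getitem__)


def select_cgs(large_lp: List[Sequence[float]],
               small_lp: List[Sequence[float]]) -> int:
    """CGS (sketch): subtract the small model's score from the large
    model's score and pick the chunk with the largest difference.
    Assumption: unweighted difference of normalized scores."""
    scores = [mean_logprob(l) - mean_logprob(s)
              for l, s in zip(large_lp, small_lp)]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example: k = 3 candidate chunks of 4 tokens each (made-up numbers).
large = [[-0.5, -0.4, -0.6, -0.5],
         [-0.2, -0.3, -0.2, -0.3],
         [-1.0, -0.9, -1.1, -1.0]]
small = [[-0.9, -0.9, -0.9, -0.9],
         [-0.1, -0.1, -0.1, -0.1],
         [-1.0, -1.0, -1.0, -1.0]]

lgs_choice = select_lgs(large)         # chunk the large model likes most
cgs_choice = select_cgs(large, small)  # chunk where the large model likes it more than the small model does
```

In this toy data the two rules disagree: LGS picks the chunk both models already find likely, while CGS picks the chunk that the large model rates noticeably higher than the small model does. Because all chunks have the same length here, the normalization does not change which chunk wins.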