arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.17236 2026-05-19 cs.CV cs.AI

Systematic Evaluation of Vision Transformers for Automated Cervical Cancer Classification: Optimization, Statistical Validation, and Clinical Interpretability

Nisreen Albzour, Sarah S. Lam

AI summary: This paper studies Vision Transformers for automated cervical cancer classification and, through optimization and statistical validation, shows their advantages in clinical interpretability.

Abstract

Manual Pap smear analysis for cervical cancer screening is limited by inter-observer variability, time constraints, and restricted expert availability. Although convolutional neural networks (CNNs) have automated cervical cell classification, they remain limited in modeling long-range spatial dependencies and often lack clinical interpretability. In this study, Vision Transformer (ViT) architectures were systematically optimized to enhance automated cervical cancer screening, which resulted in improved interpretability. The Herlev dataset (917 images: 242 normal, 675 abnormal) was used to optimize ViT-Tiny, a lightweight Vision Transformer architecture designed for reduced computational complexity, through a comprehensive evaluation of augmentation strategies, class weighting, and hyperparameters. The optimal configuration achieved 94.9%-95.2% cross-validation accuracy, with random horizontal flipping and class weighting (0.7 x 1.3) identified as the most effective settings. Gradient-weighted Class Activation Mapping (Grad-CAM) analysis confirmed that model attention corresponded to clinically relevant morphological features, including nuclear regions, cell boundaries, and chromatin texture, in line with cytopathological criteria. These findings indicate that Vision Transformers can deliver accurate and interpretable decision support for cervical cancer screening, meeting both the clinical performance and transparency requirements essential for medical AI deployment.
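The class weighting mentioned in the abstract (0.7 x 1.3) can be illustrated with a weighted cross-entropy. This is a minimal sketch, not the authors' training code: the abstract does not say which class receives which weight or which loss was used, so the assignment below (0.7 for the majority abnormal class, 1.3 for the minority normal class) and the function name `weighted_cross_entropy` are assumptions.

```python
import math

# Assumed assignment: the abstract gives only "0.7 x 1.3", not the class order.
CLASS_WEIGHTS = {"abnormal": 0.7, "normal": 1.3}


def weighted_cross_entropy(probs, labels, weights=CLASS_WEIGHTS):
    """Weighted mean of -log p(true class), normalised by the sum of weights.

    probs  -- list of dicts mapping class name to predicted probability
    labels -- list of true class names, same length as probs
    """
    total, norm = 0.0, 0.0
    for p, y in zip(probs, labels):
        w = weights[y]
        total += -w * math.log(p[y])
        norm += w
    return total / norm
```

With these weights, an error on a normal cell costs roughly 1.9 times as much as the same error on an abnormal cell, which partly offsets the 242 to 675 imbalance in the Herlev dataset.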

2605.17231 2026-05-19 cs.LG cs.CL

FishBack: Pullback Fisher Geometry for Optimal Activation Steering in Transformers

Sihan Wang, Jiayi Zhao

AI summary: This paper proposes the FishBack framework, which uses pullback Fisher geometry to optimize activation steering in transformers. It addresses the failure of the Euclidean assumption on activation space, derives a closed-form steering equation, and applies it iteratively.

Comments: Preprint. 20 pages, 9 figures, 5 tables

Abstract

Activation steering methods modify intermediate representations of language models to control output behavior, but universally assume the activation space is Euclidean. We show this assumption fails drastically: the local geometry induced by the model's own output behavior -- the Fisher information metric of the softmax layer, pulled back through the Jacobian of subsequent layers -- deviates from the Euclidean metric by over 97% in relative spectral norm on GPT-2, with an effective dimensionality of only 2--17% of the ambient space. From this pullback Fisher metric, we derive a closed-form steering equation that identifies the minimum-distortion direction for any target concept, yielding a closed-form optimal direction at each point that can be applied iteratively without manifold fitting or data-driven geometry estimation. We call the resulting framework FishBack. The metric admits a layer-wise recursive decomposition, which reveals that existing methods -- CAA, ActAdd, ITI, and others -- each implicitly adopt a particular approximate metric, and that their performance gaps are quantitatively predicted by a single spectral diagnostic: the ratio of their implicit metric's cost to the Fisher-optimal cost. On GPT-2, iterative pullback steering consistently outperforms all Euclidean baselines across three verb-morphology concepts and four layers, with off-target KL reductions of $1.3\times$--$2.5\times$ relative to Euclidean gradient ascent and $1.5\times$ relative to CAA at matched concept probability.
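The pullback construction described in the abstract can be sketched on a toy example. The Fisher information of a softmax with respect to its logits is diag(p) - p p^T, and pulling it back through a Jacobian J gives G = J^T F J. The code below is an illustration under stated assumptions, not the FishBack implementation: the helper names are hypothetical, and J is a small fixed matrix standing in for the Jacobian of the subsequent layers, which in a real transformer would come from automatic differentiation.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def fisher_logits(p):
    """Fisher information of a softmax w.r.t. its logits: diag(p) - p p^T."""
    n = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(n)]
            for i in range(n)]


def pullback(J, F):
    """G = J^T F J, where J[k][i] = d logit_k / d activation_i."""
    V, d = len(J), len(J[0])
    FJ = [[sum(F[k][l] * J[l][i] for l in range(V)) for i in range(d)]
          for k in range(V)]
    return [[sum(J[k][i] * FJ[k][j] for k in range(V)) for j in range(d)]
            for i in range(d)]


def quad_form(G, delta):
    """Second-order KL estimate of a step delta: 0.5 * delta^T G delta."""
    n = len(delta)
    return 0.5 * sum(delta[i] * G[i][j] * delta[j]
                     for i in range(n) for j in range(n))
```

In this toy setting, take `J = [[1, 0], [1, 0], [1, 1]]` and uniform probabilities over three tokens. A unit step along the first activation coordinate shifts every logit equally, so its cost under G is zero even though its Euclidean length is 1. A unit step along the second coordinate changes one logit and has a nonzero cost. This is the sense in which two steps of equal Euclidean length can distort the output distribution very differently.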

2605.17229 2026-05-19 cs.RO cs.SY eess.SY

Generating Realistic Safety-Critical Scenarios for Vehicle-Pedestrian Interactions

Qingwen Pu, Kun Xie, Yuan Zhu, Guocong Zhai

AI summary: This paper proposes a three-stage framework that combines real-world data with adaptive simulation to generate behaviorally realistic safety-critical scenarios at scale. A multi-agent state-space Transformer-enhanced DDPG algorithm reproduces evasive behaviors in vehicle-pedestrian interactions with high accuracy, and the framework yields the VPSCI dataset.

Comments: 49 pages, 13 figures, 11 tables

Abstract

Automated driving system (ADS) deployment requires rigorous validation across safety-critical vehicle-pedestrian interactions, yet real-world datasets rarely capture high-risk scenarios while simulation platforms lack realistic behavior. In response, this study proposes a three-stage framework that combines real-world grounding with adaptive simulation to generate behaviorally realistic safety-critical scenarios at scale. Stage 1 pre-trains multi-agent state-space Transformer-enhanced DDPG (MA-SST-DDPG) agents on real-world safety-critical data to learn human-like interactive evasive behaviors. Stage 2 deploys the pre-trained agents in CARLA for online reinforcement learning to generalize across diverse scenarios, integrating real-world knowledge with simulation experience to produce a refined MA-SST-DDPG model. Stage 3 uses CARLA with the refined model to generate over 198,000 high-resolution interaction episodes from eight intersection scenarios, culminating in the Vehicle-Pedestrian Safety-Critical Interaction (VPSCI) dataset. The refined MA-SST-DDPG model outperformed baseline methods in reproducing realistic evasive behaviors, achieving the lowest trajectory errors (ADE = 0.072 m, FDE = 0.142 m). Statistical comparison confirmed distributional equivalence between the generated and real-world data in both conflict severity and behavioral response. A Turing test confirmed that the evasive behaviors generated by the three-stage framework were indistinguishable from real-world interactions. These results demonstrate the framework's effectiveness in producing high-fidelity safety-critical data, offering a valuable resource for ADS development and simulation-based safety evaluation.
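The two trajectory errors quoted above follow standard definitions: ADE (average displacement error) is the mean Euclidean distance between predicted and reference positions over all time steps, and FDE (final displacement error) is that distance at the last step. A minimal sketch, assuming 2D positions in metres; the function name `ade_fde` is illustrative, and the paper's evaluation code may differ in details such as horizon length or averaging across agents.

```python
import math


def ade_fde(pred, truth):
    """Return (ADE, FDE) for two equal-length lists of (x, y) positions."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    dists = [math.hypot(px - tx, py - ty)
             for (px, py), (tx, ty) in zip(pred, truth)]
    return sum(dists) / len(dists), dists[-1]
```

For example, a predicted path that is exact at the first step and off by 0.3 m and 0.4 m at the next two steps has an ADE of about 0.233 m and an FDE of 0.4 m.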

2605.17228 2026-05-19 cs.CL

Artificial Intolerance: Stigmatizing Language in Clinical Documentation Skews Large Language Model Decision-Making

Jen-tse Huang, Didi Zhou, Faith Kamau, Amy Oh, Anne R. Links, Mark Dredze, Mary Catherine Beach, Somnath Saha

AI summary: The study examines whether large language models inherit and propagate human bias when processing clinical documentation that contains stigmatizing language. All evaluated models show substantial bias in decision-making and are sensitive to linguistic framing, so stronger algorithmic guardrails are needed to ensure fairness and robustness.

Comments: 9 pages

Abstract

Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as clinical decision support and medical documentation. However, the robustness of these models against subtle linguistic variations, specifically stigmatizing language (SL) commonly found in human-authored clinical notes, remains critically under-explored. In this work, we investigate whether frontier LLMs inherit and propagate this human bias when processing clinical text. We systematically evaluate nine frontier LLMs across four stigmatized medical conditions, utilizing clinical vignettes injected with varying intensities and phenotypes of SL (doubt, blame, and maligning). Our results demonstrate that all evaluated models exhibit substantial bias, with clinical decision-making significantly skewed towards less aggressive patient management. Notably, we observe a high sensitivity to linguistic framing, where a single SL sentence is sufficient to alter model outputs, revealing a clear dose-response relationship. Furthermore, we evaluate standard prompt-based mitigation strategies, including Chain-of-Thought (CoT) reasoning and model self-debiasing. These approaches show limited efficacy; models struggle to explicitly identify SL while remaining implicitly influenced by it. Our findings expose a critical vulnerability in current LLMs regarding fairness and robustness in clinical NLP, underscoring the need for rigorous algorithmic guardrails to prevent the automation of health disparities.

2605.17214 2026-05-19 cs.AI cs.CL cs.CV

ChemVA: Advancing Large Language Models on Chemical Reaction Diagrams Understanding

Mingyang Rao, Kehua Feng, Zhihui Zhu, Jiangzhen Fu, Hao Yu, Keyan Ding, Huajun Chen

AI summary: To address the visual deficit and semantic disconnect that limit existing systems in understanding chemical reaction diagrams, this paper proposes the ChemVA framework, which uses a visual anchor mechanism and a semantic alignment method to improve the chemical reasoning performance of large language models.

Abstract

While Large Language Models (LLMs) have revolutionized scientific text processing, they exhibit a significant capability gap when interpreting chemical reaction diagrams. We identify two fundamental bottlenecks restricting current systems: a Visual Deficit, where generic vision encoders struggle to resolve the strict topological connectivity of dense molecular graphs, and a Semantic Disconnect, where standard linear strings, such as SMILES, fail to effectively activate the model's latent chemical reasoning. To bridge these gaps, we propose the Chemical Visual Activation (ChemVA) framework, which employs a Visual Anchor mechanism to ground functional groups via hybrid-granularity detection, followed by a semantic alignment approach that translates visual features into entity names to maximize knowledge activation in LLMs. We evaluate our approach on OCRD-Bench, a newly constructed dataset featuring dense visual-semantic contexts and comprehensive reaction coverage to evaluate the full spectrum from recognition to reasoning. Extensive experiments on OCRD-Bench demonstrate that ChemVA achieves 92.0% structural recognition accuracy. By bridging visual and semantic bottlenecks, our framework delivers a consistent performance gain of approximately 20 percentage points across 9 diverse LLMs, enabling open-weight models to rival proprietary SOTA systems in complex chemical reasoning tasks.

2605.17205 2026-05-19 cs.CL

LLMs for automatic annotation of Mandarin narrative transcripts

Qingwen Zhao, Hongao Zhu, Yunqi He, Rui Wang, Aijun Huang, Hai Hu

AI summary: This study examines how effectively large language models can automatically annotate Mandarin narrative transcripts by comparing four LLMs with trained human annotators. The best model reaches high agreement with human annotators on narrative macrostructure while substantially reducing annotation time, whereas the lightweight locally deployed model performs worse. Annotation difficulty varies by macrostructure element type, with persistent challenges in categories that require subtle semantic distinctions.

Comments: 28 pages, 9 tables

Abstract

Linguistic annotation of transcribed speech is essential for research in language acquisition, language disorders, and sociolinguistics, yet remains labor-intensive and time-consuming. While Large Language Models (LLMs) have shown promise in automating annotation tasks, their ability to handle complex discourse-level annotation in non-English languages remains understudied. This study evaluates whether LLMs can reliably annotate narrative macrostructure -- the hierarchical organization of story grammar elements -- in spoken Mandarin, using the Multilingual Assessment Instrument for Narratives (MAIN) as a testbed. We compared four LLMs against trained human annotators on narratives produced by children, young adults, and older adults. The best-performing model achieved agreement with human raters (k=.794) approaching human-human reliability levels (k=.872) while reducing annotation time by 65%, whereas the locally deployable lightweight model performed substantially worse. Annotation difficulty varied systematically by macrostructure element type, with categories requiring subtle semantic differentiation posing persistent challenges. Furthermore, model reliability decreased on young adult narratives, which exhibited greater lexical variation, semantic ambiguity, and multi-element integration within single utterances. These findings suggest that LLMs can effectively support discourse-level annotation in non-English spoken corpora, while highlighting the continued need for human oversight in semantically complex tasks. Our prompt templates are open-sourced for future use.
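The agreement figures above can be read as chance-corrected agreement between two annotators. The sketch below computes Cohen's kappa for two label sequences. It assumes that the reported k is Cohen's kappa, which the abstract does not state, and the category labels in the example are illustrative rather than the MAIN coding scheme.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected from each rater's label frequencies.
    Undefined (division by zero) when both raters use one identical label.
    """
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

For instance, `cohens_kappa(["goal", "attempt", "outcome", "goal"], ["goal", "attempt", "goal", "goal"])` has 75% raw agreement but a kappa of about 0.56, because much of that agreement is expected by chance when one label dominates.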

2605.17204 2026-05-19 cs.RO cs.AI

Event-Grounded Sparse Autoencoders for Vision-Language-Action Policies

Xinchen Jin, Aditya Chatterjee, Pranav Kumar, Rohan Paleja

AI summary: This paper proposes an event-grounded sparse autoencoder (SAE) analysis method for interpretability research on vision-language-action (VLA) policies. Anchoring SAE feature analysis to behavioral events improves causal influence on, and interpretability of, closed-loop behavior.

Abstract

Vision-Language-Action (VLA) policies translate language and visual inputs into robot actions, where their hidden representations directly shape closed-loop behavior. However, mechanistic interpretability tools from language and vision-language models do not transfer cleanly to VLAs: outputs are robot actions rather than human-readable tokens, and interventions can only be tested via expensive closed-loop rollouts. We propose an event-grounded interpretability pipeline that anchors sparse autoencoder (SAE) feature analysis to behavioral events rather than text contexts. End-effector keyframes are clustered within each task using visual, state, and temporal cues, linking SAE features to behaviorally salient events and, via optional VLM annotations, to semantic context. To our knowledge, our pipeline is among the first to ground SAE-based VLA analysis in closed-loop behavioral events. Across two simulation architectures and a real-robot study, event-grounded ranking yields the strongest causal effects on OpenVLA and transfers to the continuous action chunks of $π_{0.5}$. SAE is a sparse but imperfect intervention basis: usability varies with architecture and intervention site, and aggressive intervention reveals safety and interpretability limits. Overall, event-grounded SAE analysis emerges as a practical starting point for behavior-anchored VLA interpretability, motivating future work on SAE features beyond action-aligned coordinates, finer-grained closed-loop evaluation, and safe interventions for high-stakes VLA deployments. Code is available at https://github.com/xc-j/Event-SAE.

2605.17197 2026-05-19 cs.LG cs.CV

OPTNet: Ordering Point Transformer Network for Post-disaster 3D Semantic Segmentation

Nhut Le, Ehsan Karimi, Maryam Rahnemoonfar

AI summary: This paper proposes OPTNet, a network for semantic segmentation of post-disaster 3D point clouds whose learnable point-sorting module dynamically predicts an optimal permutation to improve the locality of the attention mechanism.

Comments: Accepted for International Conference on Pattern Recognition (ICPR) 2026

Abstract

Post-disaster damage assessment requires rapid and accurate semantic segmentation of 3D point clouds to identify critical infrastructure such as damaged buildings and roads. Early Point Transformers (e.g., PTv1, PTv2) relied on computationally expensive neighbor searching (k-NN) and Farthest Point Sampling (FPS). To improve efficiency, recent architectures like Point Transformer V3 (PTv3) adopted static serialization methods, such as Hilbert curves or Z-order, to organize unstructured points for window-based attention. However, these fixed orderings are not optimal for capturing the complex geometry of disaster scenes. In this paper, we propose OPTNet (Ordering Point Transformer Network), which introduces a learnable Point Sorter module. OPTNet utilizes a self-supervised ordering loss to dynamically predict an optimal permutation that maximizes the locality of the attention mechanism. On the 3DAeroRelief dataset, our method significantly outperforms state-of-the-art baselines.
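The static serialization that the abstract contrasts with OPTNet can be illustrated with a Z-order (Morton) sort. The sketch below is the fixed baseline idea only, not OPTNet's learnable Point Sorter: points are quantised to a voxel grid, the bits of the three voxel indices are interleaved into one integer, and points are sorted by that integer so that spatial neighbours tend to land in the same attention window. The grid size, bit width, and function names are assumptions chosen for illustration.

```python
def morton3d(ix, iy, iz, bits=10):
    """Interleave the low `bits` bits of three non-negative voxel indices."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code


def z_order(points, grid=0.5, bits=10):
    """Return point indices sorted by the Morton code of their voxel."""
    mins = [min(p[axis] for p in points) for axis in range(3)]

    def key(i):
        voxel = [int((points[i][axis] - mins[axis]) / grid) for axis in range(3)]
        return morton3d(*voxel, bits=bits)

    return sorted(range(len(points)), key=key)
```

Because the ordering depends only on coordinates, it is the same for every scene and every layer. Voxel indices beyond `2**bits - 1` are truncated by the bit mask, so `grid` and `bits` should be chosen to cover the scene extent.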

2605.17187 2026-05-19 cs.CL cs.AI cs.CY

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

Zoher Kachwala, Bao Tran Truong, Rasika Muralidharan, Haewoon Kwak, Jisun An, Filippo Menczer

AI summary: The study examines the challenges AI models face in moderating pluralistic communities on social media and proposes the PluRule benchmark for detecting 13,371 rule violations. Even state-of-the-art vision-language models struggle to identify violations.

Comments: Accepted to ACL 2026 Main Conference

Abstract

Social media are shifting towards pluralism -- community-governed platforms where groups define their own norms. What violates rules in one community may be perfectly acceptable in another. Can AI models help moderate such pluralistic communities? We formalize the task as a multiple-choice problem, mirroring how human moderators operate in the real world: given a comment and its surrounding context, identify which specific rule, if any, is violated. We introduce PluRule, a multimodal, multilingual benchmark for detecting 13,371 rule violations across 1,989 Reddit communities spanning 2,885 rules in 9 languages. Using this benchmark, we show that state-of-the-art vision-language models struggle significantly: even GPT-5.2 with high reasoning performs only slightly better than a trivial baseline. We also find that bigger models and increased context provide marginal gains, and universal rules like civility and self-promotion are easier to detect. Our results show that moderation of pluralistic communities on social media is a fundamental challenge for language models. Our code and benchmark are publicly available.

2605.17181 2026-05-19 cs.SD cs.AI

MusicSynth: An Automated Pipeline for Generating Violin Fingerboard Animations from Sheet Music Using Optical Music Recognition

Abhimanyu Kaushik

AI summary: The study proposes an automated pipeline that uses optical music recognition to convert sheet music into violin fingerboard animations. Its core approach integrates three open-source tools and maps musical notes to violin strings and finger positions through a custom lookup table.

Comments: 12 pages, 4 figures

Abstract

Learning the violin is harder than it looks. Unlike piano keys or guitar frets, the violin neck has no markings at all, so a beginner cannot tell by looking where to place each finger. MusicSynth is an open-source web tool that tries to fix that: a user uploads a photo of any violin sheet music (or a digital score file), and the system automatically produces a video showing a violin fingerboard with each note highlighted at the right moment -- no software to install, no manual note entry required. The system connects three existing open-source tools into one pipeline: an optical music recognition (OMR) library reads the notes from the uploaded image, a MusicXML parser extracts timing information from digital scores, and a video renderer draws the fingerboard frame by frame. The only part built from scratch is the lookup table that maps each musical note to a string and finger position on the violin. Tested across 110 public-domain violin scores, MusicSynth correctly identified 91.2% of notes in clean printed music and assigned the right finger position 99.1% of the time when given a digital score file. To the author's knowledge, no freely available tool currently turns a sheet music image into an animated violin fingerboard tutorial automatically and in a single browser-based step.
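A note-to-fingerboard lookup of the kind described above can be sketched as follows. This is not the MusicSynth table, which the abstract does not reproduce. It assumes first position only, standard tuning (G3, D4, A4, E5 as MIDI 55, 62, 69, 76), and a preference for the highest string that can play the note, so an open string is chosen over a fourth finger on the string below. The names `OPEN_STRINGS`, `FINGER_BY_OFFSET`, and `locate` are hypothetical.

```python
# Open strings in standard tuning, lowest to highest, as MIDI note numbers.
OPEN_STRINGS = [("G", 55), ("D", 62), ("A", 69), ("E", 76)]

# Semitones above the open string -> finger number (0 means open string).
# First position only; offsets 1/2, 3/4, 5/6 are the low/high placements
# of fingers 1, 2, and 3.
FINGER_BY_OFFSET = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4}


def locate(midi):
    """Map a MIDI note number to (string name, finger) in first position."""
    for name, open_pitch in reversed(OPEN_STRINGS):
        offset = midi - open_pitch
        if offset in FINGER_BY_OFFSET:
            return name, FINGER_BY_OFFSET[offset]
    raise ValueError(f"MIDI note {midi} is outside first position")
```

Under these assumptions, `locate(69)` gives the open A string and `locate(64)` (E4) gives first finger on the D string. Notes above B5 or below G3 raise an error, since handling them would require position shifts that this sketch leaves out.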

2605.17180 2026-05-19 cs.LG math.OC stat.ML

The Geometry of Projection Heads: Conditioning, Invariance, and Collapse

Faris Chaudhry

AI summary: This paper proposes a geometric theory of projection heads that models the head as a trainable Riemannian metric to study conditioning, invariance, and collapse in self-supervised learning, revealing how the head's adaptive capacity and stability vary with depth.

Comments: Accepted at ICML 2026. 29 pages, 8 figures, 7 tables

Abstract

We develop a geometric theory of projection heads in self-supervised learning by modeling the head as a trainable Riemannian metric on the backbone representation manifold. We show that linear heads perform implicit subspace whitening, while nonlinear heads adapt local metrics to satisfy the specific topological constraints of the loss, with head depth empirically dictating this capacity. Analyzing dimensional collapse, we prove that smooth nonlinear heads natively induce negative eigenvalues in the Hessian at collapsed equilibria, making them unstable. We empirically validate this by continuously tracking the optimization geometry during training, which reveals that smooth activations like Swish can generate explicit negative curvature to escape collapse, whereas linear and ReLU heads under continuous-time gradient flow cannot, relying instead on discrete-time optimization dynamics and BatchNorm. Finally, we geometrically characterize how metric degeneracy governs the information-invariance trade-off, explaining why the head must be discarded. Evaluated across contrastive and decorrelation-based objectives on foundation models, our results demonstrate that the projection head acts as a universal geometric buffer, decoupling the semantic backbone from the rigid, destructive constraints of the pretraining objective.

2605.17179 2026-05-19 cs.CV

iMiGUE-3K: A Large-Scale Benchmark for Micro-Gesture Analysis with Self-Supervised Learning

Chengyan Wang, Haoyu Chen, Hui Wei, Yueyi Yang, Yunquan Chen, Guoying Zhao

AI summary: This paper presents the large-scale iMiGUE-3K dataset and the MG-FMs foundation model for micro-gesture-based emotion understanding, using self-supervised learning to improve emotion recognition performance.

Abstract

Emotion understanding is a fundamental challenge in affective computing and artificial intelligence. While existing approaches predominantly focus on facial expressions and speech, they often overlook the rich emotional cues conveyed through body language. Recently, micro-gestures (MGs), unintentional, subconscious movements driven by inner feelings, have attracted increasing attention as an alternative to other cues. However, no existing large-scale dataset supports the pre-training of an MG foundation model. To advance MG research, we present a new benchmark for micro-gesture-based emotion understanding, whose key contributions are a novel dataset (iMiGUE-3K) and a series of foundation models for different tasks. Using a model-based crowd-sourcing data collection strategy, we construct iMiGUE-3K, the largest MG dataset to date. It comprises video recordings from 332 distinct professional tennis players' public press interviews over the past seven years, totaling more than 3.4K long video clips and 37 million frames. The dataset includes 32 micro-gesture classes with rich descriptive annotations, making it the first large-scale, in-the-wild video dataset for fine-grained gesture-based emotion analysis. Built on iMiGUE-3K, we propose MG-FMs, a discriminative foundation model for transferable gesture representation learning. Based on the foundation model, we establish five comprehensive evaluation tasks: MG recognition (unsupervised, semi-supervised, supervised), MG retrieval, and MG emotion recognition. Our systematic evaluation of representative methods demonstrates that micro-gesture-based analysis significantly improves emotion understanding. We hope this work can provide comprehensive tools for MG analysis and set a solid foundation for future research in psychological diagnostics, affective computing, and advanced human-computer interaction.

2605.17176 2026-05-19 cs.AI

CAREBench: Evaluating LLMs' Emotion Understanding by Assessing Cognitive Appraisal Reasoning

Zhaoyue Sun, Hainiu Xu, Andero Uusberg, James J. Gross, Petr Slovak, Yulan He

AI summary: This paper proposes CAREBench, the first benchmark with comprehensive annotations of cognitive appraisal reasoning, appraisal ratings, and multi-label emotions. Systematic experiments show that stronger models match or surpass humans on certain tasks but fall short on appraisal reasoning and positive emotion recognition, and that current models have not internalized the mechanisms needed to capture human subjective heterogeneity.

Comments: 27 pages, 18 figures

Abstract

Emotion understanding is a core capability for LLMs to interact effectively with humans, yet existing evaluation paradigms rely on discrete emotion label prediction and fail to capture the cognitive processes underlying emotion generation. Grounded in appraisal theory, we introduce CAREBench, the first benchmark with complete inferential chain annotations from both first- and third-person perspectives on real-world narratives, spanning appraisal reasoning, appraisal ratings, and multi-label emotion annotation. We propose a process-level evaluation framework and conduct systematic experiments across six LLMs organized around four research questions. We find that stronger models match or surpass human observers on certain tasks, yet fall short on appraisal reasoning and positive emotion recognition; performance across chain steps and sensitivity to appraisal interventions exhibit dissociations across models; and current models have not internalized the mechanisms needed to capture human subjective heterogeneity. These findings suggest that downstream emotion prediction metrics may overestimate LLMs' true emotion understanding, and CAREBench provides a foundation for more diagnostically informative evaluation of LLMs' affective cognitive capabilities.

2605.17173 2026-05-19 cs.CL cs.AI cs.LG

Why Do Safety Guardrails Degrade Across Languages?

Max Zhang, Ameen Patel, Sang T. Truong, Sanmi Koyejo

AI summary: By introducing a Multi-Group Item Response Theory framework, the study separates language-agnostic safety robustness, intrinsic prompt hardness, global language processing difficulty, and prompt-specific cross-lingual safety gaps. It finds that safety degradation does not occur only in low-resource languages and that cultural and conceptual mismatches also affect safety performance.

Abstract

Large language models exhibit safety degradation in non-English languages. Standard evaluation relies on Jailbreak Success Rate (JSR), which confounds several safety-driving factors into one, obscuring the specific cause(s) of safety failure. We introduce a latent variable model, a Multi-Group Item Response Theory (IRT) framework, that decouples safety-driving factors such as language-agnostic safety robustness ($θ$), intrinsic prompt hardness ($β$), global language processing difficulty ($γ$), and a prompt-specific cross-lingual safety gap ($τ$). Using the MultiJail dataset, we evaluate the safety robustness of 61 model configurations across 5 closed-model families and 10 languages of varying resource, aggregating a dataset of 1.9 million rows. Exploratory Factor Analysis shows safety is primarily unidimensional: models refuse different harm types mainly through a shared mechanism. Contrary to the expected trend that safety degrades largely in low-resource languages, 22 model configurations are more vulnerable in English than in low-resource languages. Low-resource languages produce more uncertain responses (high entropy) than high-resource languages. Also, high-$τ$ prompts cluster in physical harm categories like Theft and Weapons and lower-resource languages, trends validated through cross-dataset generalization. While global translation quality shows low correlation with $τ$, severe mistranslations drive high-bias outliers, as validated by native speakers. Cultural and conceptual grounding mismatches also contribute to $τ$. In predictive validation, the IRT framework achieves $\mathrm{AUC} = 0.940$, outperforming simpler baselines in predicting safe refusal of unsafe prompts. Our framework reveals concept-language vulnerabilities that aggregate metrics obscure, enabling fairer cross-lingual safety evaluation and targeted improvements in dataset construction.
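One way to picture how the four factors named above could combine is a logistic item response model. The abstract does not give the functional form, so the additive form below is an assumption made purely for illustration: the probability of a safe refusal rises with model robustness (theta) and falls with prompt hardness (beta), language difficulty (gamma), and the prompt-by-language gap (tau). The function name and sign conventions are hypothetical.

```python
import math


def p_safe_refusal(theta, beta, gamma, tau):
    """Assumed additive IRT form: sigmoid(theta - beta - gamma - tau).

    theta -- language-agnostic safety robustness of the model
    beta  -- intrinsic hardness of the prompt
    gamma -- global processing difficulty of the language
    tau   -- prompt-specific cross-lingual safety gap
    """
    return 1.0 / (1.0 + math.exp(-(theta - beta - gamma - tau)))
```

Under this assumed form, a single success rate mixes all four terms, whereas fitting them separately lets a large tau flag a specific prompt and language pair even when the model's theta and the language's gamma look unremarkable.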

2605.17172 2026-05-19 cs.LG cs.AI cs.CL

OpenJarvis: Personal AI, On Personal Devices

OpenJarvis: 个人AI,本地设备上

Jon Saad-Falcon, Avanika Narayan, Robby Manihani, Tanvir Bhathal, Herumb Shandilya, Hakki Orhun Akengin, Gabriel Bo, Andrew Park, Matthew Hart, Caia Costello, Chuan Li, Christopher Ré, Azalia Mirhoseini

AI总结 本文提出OpenJarvis,一种分解的个人AI堆栈,通过在本地设备上优化五个基本组件(智能、引擎、代理、工具与记忆、学习)来缩小本地与云端之间的性能差距,同时保持本地模型的特性。

Comments Code: https://github.com/openjarvis/openjarvis Website: https://open-jarvis.github.io/OpenJarvis/

Abstract

Personal AI stacks, like OpenClaw and Hermes Agent, are becoming central to daily work, yet they route nearly every query (often over sensitive local data) to cloud-hosted frontier models. Replacing frontier models with local models inside existing stacks does not work: swapping Claude Opus 4.6 for Qwen3.5-9B drops accuracy by 25-39 pp across personal AI tasks like PinchBench and GAIA. Existing stacks bundle agentic prompts, tool descriptions, memory configuration, and runtime settings around a specific cloud model. Only the prompts can be tuned, and state-of-the-art prompt optimizers close just 5 pp of the local-cloud gap on their own. This motivates a decomposed personal AI stack: one that exposes individual primitives which can be optimized individually or jointly to close the local-cloud gap. We present OpenJarvis, an architecture that represents a personal AI system as a typed spec over five primitives: Intelligence, Engine, Agents, Tools & Memory, and Learning. Each primitive is an independently editable field, making the stack end-to-end optimizable and measurable against accuracy, cost, and latency. Towards closing the local-cloud gap without surrendering local-model properties, OpenJarvis introduces LLM-guided spec search, a local-cloud collaboration in which frontier cloud models propose edits across the spec at search time, only non-regressing edits are accepted, and the resulting spec runs entirely on-device at inference time. With LLM-guided spec search, on-device specs match or exceed cloud accuracy on 4 of 8 benchmarks and land within 3.2 pp of the best cloud baseline on average. They also reduce marginal API cost by ~800x and end-to-end latency by 4x.
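
The search loop described in this abstract (edits proposed across a typed spec, only non-regressing edits kept) can be sketched as a greedy accept/reject loop. Everything concrete here is hypothetical: the `Spec` field types, the `spec_search` function, and the toy `propose_edits` and `evaluate` stand-ins, which replace the frontier cloud model and the benchmark run.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Spec:
    # One field per primitive named in the abstract; value types are hypothetical.
    intelligence: str
    engine: str
    agents: str
    tools_memory: str
    learning: str

def spec_search(spec, propose_edits, evaluate, rounds=3):
    """Greedy search that keeps an edit only if the score does not regress."""
    best_score = evaluate(spec)
    for _ in range(rounds):
        for field_name, new_value in propose_edits(spec):
            candidate = replace(spec, **{field_name: new_value})
            score = evaluate(candidate)
            if score >= best_score:  # accept non-regressing edits only
                spec, best_score = candidate, score
    return spec, best_score

# Toy stand-ins: a fixed proposal list instead of a cloud model, and a
# hand-written score instead of a benchmark.
def propose_edits(spec):
    return [("engine", "fast-runtime"), ("agents", "verbose-prompt")]

def evaluate(spec):
    score = 0.5
    if spec.engine == "fast-runtime":
        score += 0.1
    if spec.agents == "verbose-prompt":
        score -= 0.2
    return score

start = Spec("local-9b", "default-runtime", "short-prompt", "basic-tools", "none")
final_spec, final_score = spec_search(start, propose_edits, evaluate)
```

In this toy run the engine edit is accepted and the agent-prompt edit is rejected because it lowers the score. The proposer is only called during search, which mirrors the abstract's point that the resulting spec runs on-device at inference time.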

2605.17170 2026-05-19 cs.LG

TriAxialKV: Toward Extreme Low-Precision KV-Cache Quantization for Agentic Inference Tasks

Hanzhang Shen, Haoran Wu, Yiren Zhao, Robert Mullins

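AI summary This paper proposes TriAxialKV, a mixed-precision KV-cache quantization scheme that assigns each token a triaxial tag, calibrates per-tag sensitivity, and allocates INT2/INT4 bitwidths under a fixed memory budget to improve the efficiency and throughput of agentic inference tasks.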

Abstract

Agentic workloads have emerged as a major workload for LLM inference. They differ significantly from chat-only workloads, requiring long-context processing, the ability to handle multimodal inputs, and structured multi-turn interactions with tool calling capabilities. As a result, their context exhibits structure that can carry different importance along three key axes: temporal recency to the current turn, modality such as text or image tokens, and semantic role such as user queries, tool calls, observations, or reasoning. These axes capture distinct token behaviors and lead to different sensitivities to KV-cache compression. However, existing KV-cache quantization methods are typically homogeneous or exploit heterogeneity along only a single dimension, such as temporal proximity or modality, overlooking the interactions among them. To this end, we introduce TriAxialKV, a novel mixed-precision KV-cache quantization scheme that assigns each token a triaxial tag, calibrates per-tag sensitivity, and allocates INT2/INT4 bitwidths under a fixed memory budget. We implement TriAxialKV as an end-to-end serving system, comprising calibration, mixed-precision quantization and memory management, and custom fused Triton decode kernels. When Qwen3-VL-32B-Thinking is used as a computer-use agent operating in OSWorld, TriAxialKV matches the accuracy of SGLang with a BF16 KV cache while supporting 4.5$\times$ the KV cache size and achieving 30% higher end-to-end throughput on real GPU systems.
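
The allocation step in this abstract (per-tag sensitivity, INT2/INT4, fixed budget) can be illustrated with a simple greedy rule: start every tag at INT2 and upgrade the most sensitive tags to INT4 while the budget allows. This is a sketch under stated assumptions, not the paper's allocator: the tag values, sensitivities, the cost model (one value per token), and the function `allocate_bits` are all hypothetical, and the greedy rule is not guaranteed to be optimal.

```python
def allocate_bits(tags, budget_bits):
    """Assign INT2 or INT4 per tag under a total bit budget.

    tags: dict mapping tag -> (token_count, sensitivity); a higher sensitivity
          means more accuracy is assumed lost at INT2.
    budget_bits: total bits available, simplified so that each token costs
          exactly its bitwidth.
    """
    bits = {tag: 2 for tag in tags}  # start everything at INT2
    used = sum(2 * count for count, _ in tags.values())
    if used > budget_bits:
        raise ValueError("budget too small even for INT2 everywhere")
    # Upgrade the most sensitive tags first while the budget allows.
    for tag, (count, _) in sorted(tags.items(), key=lambda kv: -kv[1][1]):
        extra = 2 * count  # cost of moving this tag from INT2 to INT4
        if used + extra <= budget_bits:
            bits[tag] = 4
            used += extra
    return bits, used

# Hypothetical triaxial tags: (recency, modality, semantic role).
tags = {
    ("recent", "text", "user_query"): (100, 0.9),
    ("recent", "image", "observation"): (800, 0.4),
    ("old", "text", "tool_call"): (300, 0.6),
    ("old", "image", "observation"): (2000, 0.1),
}
bits, used = allocate_bits(tags, budget_bits=7500)
```

With these numbers the two text tags are upgraded to INT4 and both image tags stay at INT2, using 7200 of the 7500 available bits.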

2605.17169 2026-05-19 cs.AI cs.CL cs.MA

Responsible Agentic AI Requires Explicit Provenance

Jinwei Hu, Xinmiao Huang, Qisong He, Youcheng Sun, Yi Dong, Xiaowei Huang

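AI summary This paper examines how responsibility is attributed in agentic AI. It argues that explicit provenance across the full agentic lifecycle is needed to make responsibility computable and actionable, formalizes the required information through a causal attribution function and a responsibility tensor, and reports preliminary experiments indicating that provenance can be estimated and intervened on online.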

Comments Under Review

Abstract

Agentic AI is rapidly proliferating across diverse real-world domains such as software engineering, yet public trust has not kept pace. The central reason is that responsibility, despite being widely discussed, remains a subjective and unenforced concept, as no current agentic framework produces the quantifiable, traceable, and intervenable provenance needed to assign it when harm emerges from compositions no single party designed. We argue that what is missing is not better benchmark-level evaluation but $\textbf{explicit provenance}$ across the full agentic lifecycle, which is the only viable basis for making responsibility computable and actionable. We advance this agenda along four axes: establishing $\textit{why}$ such provenance is a structural necessity by identifying responsibility gaps across sociotechnical dimensions, formalizing $\textit{what}$ it must encode through a causal attribution function and responsibility tensor, discussing $\textit{how}$ it can be made computable across four lifecycle layers with preliminary experiments showing that provenance is estimable and intervenable online before irreversible harm accumulates, and examining $\textit{who}$ bears responsibility through a concrete agentic incident. Explicit provenance is not a discretionary refinement but the necessary condition for responsible agentic AI, and no stakeholder across its ecosystem can afford to treat it as optional.

2605.17165 2026-05-19 cs.CV cs.LG

Factorized Latent Dynamics for Video JEPA: An Empirical Study of Auxiliary Objectives

Santosh Premi

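AI summary This study empirically examines auxiliary objectives for Video-JEPA. Comparing auxiliary objective variants, it finds that they tend to improve some capabilities while degrading others, and that FWM-HW-LD, which factorizes the latent representation, improves ImageNet-100 and SSv2 performance in the mixed-dataset setting.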

Abstract

Joint-Embedding Predictive Architectures (JEPA) are a promising framework for self-supervised video representation learning, yet the behavior of auxiliary objectives in small-scale Video-JEPA training is not well characterized. We report a small-scale empirical study of 18 auxiliary objective variants for Video-JEPA across two pretraining regimes: single-dataset (UCF-101) and mixed-dataset (UCF-101 + Something-Something V2 + ImageNet-100). We evaluate frozen representations on three complementary benchmarks: Diving-48 (fine-grained motion), Something-Something V2 (temporal reasoning), and ImageNet-100 (appearance). Our experiments suggest that many auxiliary objectives exhibit capacity trade-offs: gains on one downstream capability often coincide with degradation on another. We then study FWM-HW-LD (Factorized World-Model with Hard-Region-Weighted Latent Dynamics), a training-time objective that separates the latent representation into appearance and dynamics subspaces and applies hard-region weighting to both JEPA prediction errors and latent dynamics errors. In our mixed-dataset setting, FWM-HW-LD improves ImageNet-100 by +5.92 and SSv2 by +3.21 percentage points relative to the reference baseline, while remaining within 0.30 percentage points on Diving-48. These results indicate that latent factorization is a useful direction for studying auxiliary-objective trade-offs in Video-JEPA.

2605.17162 2026-05-19 cs.AI cs.LG

From Imitation to Interaction: Mastering Game of Schnapsen with Shallow Reinforcement Learning

Ján Klačan, Sizhong Zhang

AI summary This paper investigates whether shallow neural network agents can master the card game Schnapsen and challenge RdeepBot, a strong search-based baseline. Supervised imitation (MLPBot) is not enough to defeat strong RdeepBot opponents, whereas reinforcement learning (RLBot) produces stronger agents, with the best results when the learned value function is combined with deeper lookahead during gameplay.

Comments 17 pages, 8 figures

Abstract

This paper investigates whether shallow neural network agents can master the card game Schnapsen and challenge a strong search-based baseline, RdeepBot, which uses Monte Carlo sampling and lookahead search. Guided by a progressively more complex experimental design, we first evaluate a supervised learning agent (MLPBot) trained on replay data and then a reinforcement learning agent (RLBot) with the same shallow architecture trained through asynchronous Monte Carlo updates and experience replay. The results show that supervised imitation does not generalize well enough to defeat strong RdeepBot opponents, whereas reinforcement learning produces substantially stronger agents. In the setting that focuses on the depth parameter of RdeepBot, the best performance is achieved when the learned value function is combined with deeper lookahead during gameplay, allowing RLBot to achieve statistically significantly higher winning rates against the strongest evaluated RdeepBot baseline. In the sample-based setting, the gains are more conditional: the strongest performance appears at a relatively low value of the training num_samples parameter rather than increasing uniformly with stronger sampling.

2605.17160 2026-05-19 cs.LG cs.AI cs.CV

When Bits Break Recourse: Counterfactual-Faithful Quantization

Chaymae Yahyati, Ismail Lamaakal, Khalid El Makkaoui, Ibrahim Ouahbi

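AI summary This paper studies how quantization affects counterfactual explanations and algorithmic recourse. It proposes Counterfactual-Faithful Quantization, defines two metrics, Validity Drop and Counterfactual Recourse Gap, to assess the effect of quantization on recourse, and shows on several datasets that the method maintains accuracy while improving recourse stability.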

Comments 57 pages, 32 tables, 26 figures

Abstract

Quantization can preserve predictive accuracy under low-bit deployment while silently breaking algorithmic recourse: an actionable change that flips a decision before quantization may fail after quantization, or become substantially more costly. We formalize counterfactual sensitivity under quantization through validity, cost, and direction stability, and introduce two metrics, Validity Drop (VD) and Counterfactual Recourse Gap (CRG), that reveal recourse failures invisible to accuracy. We propose Counterfactual-Faithful Quantization (CFQ), which trains quantizer parameters and mixed-precision bit allocation to preserve counterfactual behavior by enforcing the target outcome at teacher recourse points under a global bit budget. A margin-based analysis gives a sufficient condition for recourse transfer under bounded quantization perturbations. Experiments on Adult, German Credit, and COMPAS show that accuracy-matched baselines can significantly degrade recourse stability, while CFQ maintains accuracy and substantially improves VD and CRG across bit budgets.
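
The failure mode in this abstract (a counterfactual that flips the decision before quantization but not after) can be reproduced with a two-feature linear classifier. The abstract names Validity Drop but does not define it, so the definition used below is an assumption: the fraction of counterfactual points valid under the full-precision model minus the fraction valid under the quantized model. The weights, points, and helper names are hypothetical.

```python
def predict(weights, bias, x):
    """Linear classifier: 1 if the score is positive, else 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0

def quantize(values, step):
    """Uniform quantization: round each value to the nearest multiple of step."""
    return [round(v / step) * step for v in values]

def validity_drop(fp_model, q_model, counterfactuals, target=1):
    """ASSUMED definition of VD: validity under the full-precision model minus
    validity under the quantized model, on the same counterfactual points."""
    fp_valid = sum(predict(*fp_model, x) == target for x in counterfactuals)
    q_valid = sum(predict(*q_model, x) == target for x in counterfactuals)
    return (fp_valid - q_valid) / len(counterfactuals)

fp_weights, fp_bias = [0.9, -0.4], -0.2
q_weights = quantize(fp_weights, step=0.5)  # coarse grid: [1.0, -0.5]
q_bias = quantize([fp_bias], step=0.5)[0]   # rounds to 0.0

# Counterfactual points that all reach the target class under full precision.
counterfactuals = [[1.0, 1.0], [0.5, 0.5], [4.8, 10.0], [2.0, 1.0]]
vd = validity_drop((fp_weights, fp_bias), (q_weights, q_bias), counterfactuals)
```

The point `[4.8, 10.0]` sits close to the full-precision decision boundary and lands on the wrong side once the weights are rounded, so one of the four recourse actions stops working even though the other three predictions are unchanged.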

2605.17159 2026-05-19 cs.AI cs.MA

MADP: A Multi-Agent Pipeline for Sustainable Document Processing with Human-in-the-Loop

Diego Gosmar, Giovanni Zenezini

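AI summary This paper proposes MADP, a multi-agent architecture that combines deep learning-based classification and parsing with large language model extraction and maintains accuracy through selective human validation. It automates document processing, substantially reducing manual effort and environmental impact.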

Comments 18 pages, 5 figures

Abstract

Document processing automation remains a critical challenge in enterprise environments, where traditional manual approaches are labor-intensive and error-prone. We present MADP, a multi-agent architecture that addresses the challenge of automating document processing in enterprise settings by combining deep learning-based classification and parsing with large language model extraction, while maintaining accuracy through selective human validation. Our system integrates five specialized agents--Classificator, Splitter, Parser, Extraction, and Validator--with a Human-in-the-Loop (HITL) mechanism and a novel Prompt Fine Tuning with Feedback Inheritance (PFTFI) approach. The operational analysis on a production use-case scenario of 100,000 invoices per year indicates a potential reduction of Full-Time Equivalent (FTE) requirements by approximately 70%. Production deployment on 955 real-world documents processed through January 2026 achieves a 97.0% full-pipeline automation rate, with only 3% requiring non-AI fallback. Ablation evaluation on a stratified 100-document subset (5 documents for each of 20 supplier/document-type categories) demonstrates that the full MADP configuration with Human-in-the-Loop supervision attains 98.5% document-level accuracy. Additionally, we present a comprehensive sustainability analysis showing that our hybrid AI+HITL approach reduces CO2 emissions by 69%, energy consumption by 69%, and water usage by 63% compared to traditional manual processing. Benchmark comparisons of multiple LLM backends (Granite-Docling, Mistral-Small, DeepSeek-OCR) provide practical insights for deployment in production environments.

2605.17153 2026-05-19 cs.LG cs.LO math.OC

Stress-Testing Neural Network Verifiers with Provably Robust Instances

David Troxell, Yulia Alexandr, Sofia Hunt, Stephanie Lei, Guido Montúfar

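AI summary This paper proposes a framework for generating verification instances with known ground-truth robustness labels, which exposed numeric tolerance concerns and an implementation bug in existing verifiers. It also introduces the verification Difficulty Profile for systematically studying verifier failure modes, evaluates five state-of-the-art verifiers, and shows that different instances stress different aspects of the verification pipeline.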

Abstract

Neural network verifiers aim to provide formal guarantees on model behavior, but existing verification benchmarks are fundamentally limited by their lack of ground-truth labels. As a result, verifier evaluation relies on indirect heuristics, which prevents exact scoring and systematic study of verifier failure modes. We address this gap by introducing a reusable framework for generating verification instances whose ground-truth robustness labels are known a priori through analytic construction. Our framework led to the discovery of multiple numeric tolerance concerns and an implementation bug in popular verifiers, highlighting the need for ground-truth labels. Additionally, to systematically study verifier failure modes, we introduce the verification Difficulty Profile, a collection of estimable quantities capturing distinct sources of instance hardness. Using our framework and these profiles, we evaluate five state-of-the-art verifiers and show that different instances stress distinct aspects of the verification pipeline. We show that these results can aid the future development of verifiers as they provide actionable targets for improving numerical reliability, relaxation quality, and search behavior. Our code is publicly available: https://github.com/dtroxell19/VeriStressGT.git.
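
The idea of instances whose robustness label is known by construction can be illustrated with the simplest possible case, a linear binary classifier, where the exact L-infinity robustness radius is |w.x + b| / ||w||_1. This is not the paper's construction (the abstract does not describe it); the `loose_verifier` below is a hypothetical stand-in used only to show how exact scoring against ground truth works.

```python
def linf_robust_radius(weights, bias, x):
    """Exact L-infinity robustness radius of a linear binary classifier at x.

    An L-infinity perturbation of size eps changes the score w.x + b by at
    most eps * ||w||_1, and that bound is attained.
    """
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    l1_norm = sum(abs(w) for w in weights)
    return abs(score) / l1_norm

def ground_truth_label(weights, bias, x, eps):
    """'robust' if no perturbation within eps can flip the sign of the score."""
    return "robust" if eps < linf_robust_radius(weights, bias, x) else "not robust"

def score_verifier(verifier, instances):
    """Fraction of instances on which a verifier agrees with the ground truth."""
    hits = sum(verifier(w, b, x, eps) == ground_truth_label(w, b, x, eps)
               for w, b, x, eps in instances)
    return hits / len(instances)

# Hypothetical verifier that bounds the score change with
# len(w) * max|w| * eps instead of ||w||_1 * eps: sound but incomplete.
def loose_verifier(weights, bias, x, eps):
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    bound = len(weights) * max(abs(w) for w in weights) * eps
    return "robust" if abs(score) > bound else "not robust"

instances = [
    ([2.0, -1.0], 0.5, [1.0, 1.0], 0.4),  # radius 0.5, truly robust
    ([2.0, -1.0], 0.5, [1.0, 1.0], 0.6),  # radius 0.5, not robust
    ([1.0, 1.0], 0.0, [3.0, 1.0], 1.0),   # radius 2.0, truly robust
]
accuracy = score_verifier(loose_verifier, instances)
```

The loose verifier fails to certify the first instance, which is robust, so it scores 2 out of 3. Without the analytic label, that miss could not be distinguished from a correct "not robust" answer.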

2605.17152 2026-05-19 cs.CL

Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages

Firoj Alam, Shammur Absar Chowdhury, Enamul Hoque Prince

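AI summary This tutorial covers how to build multilingual multimodal large language models under limited data and compute, including core techniques and resources such as low-cost data creation, adapter stacks for tri-modal alignment, and culture-aware evaluation.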

Comments Multimodal Foundation Models, Large Language Models, Native, Multilingual, Language Diversity, Low-resources-language

Abstract

Multimodal LLMs are evolving from vision-language models to tri-modal systems that see, hear, and read, yet pipelines and benchmarks remain English-centric and compute-heavy. This tutorial offers an overview of the emerging research area of multilingual multimodality across text, speech, and vision under limited data and compute budgets, synthesizing foundations, recent multilingual models (PALO, Maya), and speech-text LLMs. We cover low-cost data creation and curation; adapter stacks for tri-modal alignment; culture-aware evaluation beyond English; and hands-on resources for fine-tuning a compact multilingual VLM and wiring a speech->text->LLM pipeline. The content will be delivered as an interactive half-day tutorial, designed for researchers and practitioners working on multilingual, multimodal AI in low-resource language settings.

2605.17151 2026-05-19 cs.LG

An Analytical Multiple Criteria Framework for Temporal and Dynamic Business-to-Business Customer Segmentation in Manufacturing

Muhammad Raees, Konstantinos Papangelis, Vassilis Javed Khan

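AI summary This paper proposes a dynamic multi-criteria decision-making method that extends the RFM model with stability and growth dimensions, integrates an adaptive analytical hierarchical process, and evaluates multivariate time-series clustering models to make B2B customer segmentation in manufacturing more robust.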

Abstract

In sales and marketing, customer segmentation is an important tool for formulating strategies for customer treatment and supply chain management. Most segmentation implementations rely on limited criteria, such as recency, frequency, and monetary (RFM) modeling, which often fail to capture complex business interactions. In this work, we design and evaluate a dynamic multi-criteria decision-making (MCDM) method in a business-to-business (B2B) manufacturing context by 1) extending RFM to dimensions of stability and growth, 2) integrating an adaptive and analytical hierarchical process to match business objectives, and 3) evaluating multivariate time-series clustering models. We then measure customer stability, tracking between-segment transitions, and volatility over time, and apply a graph-based consensus model to further strengthen the analysis. We test the efficacy of the proposed method using a real-world manufacturing company dataset to segment more than 3,000 B2B customers, showing strong robustness to temporal shifts. The implementation enables domain experts with preferential analytics to devise their strategies, providing effective decision support for B2B customer segmentation.

2605.17144 2026-05-19 cs.RO cs.AI cs.LG

Contrastive Conceptor Activation Steering (COAST): Unlocking Vision-Language-Action Models through Hidden States

Miranda Muqing Miao, Subin Kim, Brandon Yang, Lyle Ungar

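AI summary This paper proposes COAST, which improves Vision-Language-Action models on robotic tasks by identifying success subspaces. It uses conceptor projections to steer the model toward its success distribution, raising task success rates.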

Comments Submitted to NeurIPS 2026

Abstract

Vision-Language-Action (VLA) models leverage powerful perceptual priors from web-scale Vision-Language Model (VLM) pre-training, yet they remain surprisingly brittle in practice, frequently failing at simple robotic tasks. To mitigate this, we propose Contrastive Conceptor Activation Steering (COAST). COAST builds on the notion of a "conceptor", a linear operator that soft-projects data into the principal components of a target distribution. COAST uses conceptors to identify success-critical subspaces for a target robotic task from a few examples of success and failure rollouts. At inference time, it steers VLA latents into these identified success subspaces to improve task outcomes. Across three architecturally distinct neural policies (flow-matching VLA, autoregressive VLA, and Diffusion Policy), COAST improves absolute mean simulation and real-robot task success rates by over 20% and 40%, respectively. The activation subspace geometry reveals that failure modes share substantial structure across tasks while success representations remain largely task-specific. When tasks share similar failure modes, this structure enables previously fitted conceptors to improve performance on new tasks without refitting. Ultimately, our results suggest that current VLAs retain substantial task-relevant knowledge in their latent representations, and that the action expert's decoding bottleneck could be mitigated by steering its residual stream toward task-relevant subspaces. COAST provides a lightweight, training-free path to unlocking these latent capabilities by steering the model towards its own "success" distributions.
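
The "conceptor" this abstract builds on has a standard closed form in the conceptor literature (Jaeger, 2014): C = R (R + a^-2 I)^-1, where R is the correlation matrix of the activations and a is the aperture. The sketch below computes it for 2-dimensional toy activations in plain Python. It shows only the soft projection; how COAST contrasts success and failure rollouts is not specified in the abstract and is not modeled here, and all data and helper names are hypothetical.

```python
def correlation_matrix(samples):
    """R = X^T X / n for a list of 2-dimensional activation vectors."""
    n = len(samples)
    r = [[0.0, 0.0], [0.0, 0.0]]
    for x in samples:
        for i in range(2):
            for j in range(2):
                r[i][j] += x[i] * x[j] / n
    return r

def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def matmul_2x2(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def conceptor(samples, aperture):
    """C = R (R + aperture^-2 I)^-1 for 2-dimensional activations."""
    r = correlation_matrix(samples)
    reg = aperture ** -2
    shifted = [[r[0][0] + reg, r[0][1]], [r[1][0], r[1][1] + reg]]
    return matmul_2x2(r, inverse_2x2(shifted))

def soft_project(c, h):
    """Apply the conceptor matrix to a hidden state h."""
    return [c[0][0] * h[0] + c[0][1] * h[1],
            c[1][0] * h[0] + c[1][1] * h[1]]

# Toy "success" activations that vary almost only along the first axis.
success = [[2.0, 0.1], [-2.0, -0.1], [1.5, 0.0], [-1.5, 0.0]]
c = conceptor(success, aperture=1.0)
steered = soft_project(c, [1.0, 1.0])
```

Applied to the hidden state `[1.0, 1.0]`, the conceptor keeps most of the first coordinate and suppresses the second, because the toy success activations carry almost no variance along the second axis. A larger aperture makes the projection closer to a hard one.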

2605.17141 2026-05-19 cs.AI

Dynamics of collective creativity in AI art competitions

Mason Youngblood, Jeff Nusz, Joel Simon

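AI summary By analyzing image generation in AI art competitions, the study finds that collective human-AI creation leads to simpler images and convergence on common themes, and that users' remixing preferences run counter to the novelty and complexity that attract likes.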

Abstract

Creativity is a fundamental aspect of how culture evolves, yet the mechanisms by which groups produce novelty are notoriously difficult to infer from the historical record. Iterated learning experiments have shown that cultural transmission reliably distorts artifacts toward the inductive biases of learners, but most of this work uses linear chains between human participants, leaving open how these dynamics play out in the networked, human-AI systems that increasingly shape cultural production. In this study, we leverage one such system, Artbreeder, which hosts daily "remix parties" where users iteratively build on each other's work from a single seed image, producing branching lineages of human-AI co-created images. We analyze a dataset of 130,882 images from 368 remix parties over 13 months and find that images become simpler and converge toward common thematic "attractors" (e.g., steampunk scenes, alien architecture). We also find that while more novel "parent" images produce more novel and complex "children" that attract more likes, users paradoxically prefer to remix images that are less novel and complex. Finally, larger remix parties produce more novelty at the cost of lower complexity.

2605.17137 2026-05-19 cs.AI

Latent Heuristic Search: Continuous Optimization for Automated Algorithm Design

Cheikh Ahmed, Mahdi Mostajabdaveh, Zirui Zhou

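AI summary This paper proposes a continuous heuristic discovery framework that maps discrete programs into a continuous embedding space and uses a differentiable surrogate model for gradient-based optimization in automated algorithm design.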

Comments Accepted at LION 2026, The Learning and Intelligent Optimization Conference

Abstract

The integration of Large Language Models (LLMs) into evolutionary frameworks has established a new paradigm for automated heuristic discovery. Despite their promise, these methods typically search in the discrete space of program syntax, relying on stochastic sampling to navigate a highly non-convex optimization landscape. This work proposes a continuous heuristic discovery framework that shifts optimization to a learned latent manifold. We employ an encoder to map discrete programs into continuous embeddings and train a differentiable surrogate model to predict performance, enabling gradient-based search. To regularize the optimization trajectory, an invertible normalizing flow maps these embeddings to a structured Gaussian prior, where we perform gradient ascent. The resulting optimized latent vectors are projected through a learned mapper into soft prompts, which condition a frozen LLM to synthesize novel executable heuristics. We evaluate the proposed method on the Traveling Salesman Problem (TSP), the Capacitated Vehicle Routing Problem (CVRP), the Knapsack Problem (KSP), and Online Bin Packing (OBP). Empirical results demonstrate that continuous latent-space optimization achieves performance competitive with state-of-the-art discrete evolutionary baselines while offering a complementary methodological alternative for automated algorithm design. The implementation code is available at https://github.com/cheikh025/LHS.

2605.17135 2026-05-19 cs.CV

Collaborative Learning for Semi-Supervised LiDAR Semantic Segmentation

Bin Yang, Alexandru Paul Condurache

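AI summary This paper proposes CoLLiS, a framework that uses collaborative learning to counter the bias caused by relying on a single source of pseudo-labels in semi-supervised learning. Experiments show strong performance in low-label regimes.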

Comments Accepted at ICML 2026

Abstract

Annotating large-scale LiDAR point clouds for 3D semantic segmentation is costly and time-consuming, which motivates the use of semi-supervised learning (SemiSL). Standard LiDAR SemiSL methods typically adopt a two-step training paradigm, where pseudo-labels are separately generated from a single distillation source, either from the same or another LiDAR representation. Such supervision relies on a unique source of pseudo-labels, which can reinforce confirmation bias and propagate errors during training, ultimately limiting performance. To address this challenge, we introduce CoLLiS, a novel framework that leverages Collaborative Learning for LiDAR Semi-supervised segmentation. Unlike prior paradigms with decoupled pseudo-labeling and training phases, CoLLiS trains multiple representations collaboratively in a single step by treating them as coequal students. Each student is adaptively distilled from multiple representations, while inter-student disparities are monitored online to resolve contradictory supervision and effectively mitigate confirmation bias. Extensive experiments on three datasets demonstrate that CoLLiS consistently outperforms state-of-the-art LiDAR SemiSL methods, with particularly strong gains in low-label regimes.

2605.17133 2026-05-19 cs.CV cs.AI

CAM-VFD: Cross-Attention Multimodal Video Forgery Detection

Hoda Osama Elkhodary, Sherin Mostafa Youssef, Marwa Elshenawy, Dalia Sobhy

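AI summary To address the challenges posed by rapidly advancing deepfake and video editing tools, this paper proposes CAM-VFD, a framework that detects video forgery by modeling cross-modal contradictions. Experiments on two generative video benchmarks show strong performance and good robustness.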

Abstract

The rapid advancement of Deepfake technologies and video manipulation tools poses a critical challenge to multimedia forensics, judicial evidence integrity, and information authenticity. Current detectors rely on single-modality signals, treating appearance, geometry, and motion independently. However, advanced generators maintain within-modality consistency while producing cross-modal contradictions, which are forensically discriminative but invisible to any single-modal detector. We propose CAM-VFD, a Cross-Attention Multimodal Video Forgery Detection framework that models cross-modal contradiction as a directional forensic signal. The framework uses a cross-attention fusion mechanism in which CLIP-based appearance representations serve as queries against VideoMAE motion features and MiDaS depth features, enabling the identification of contradictions between visual, temporal, and geometric evidence. We examine this design through cross-modal attention discrepancy analysis, observing statistically separable real and fake distributions ($p<0.001$, Cohen's $d=0.68$). Experimental results on two generative video benchmarks indicate consistent performance, with 95.31% Top-1 accuracy on GenVidBench and 93.43% accuracy, 90.63% F1-score, and 96.56% AUROC on GenVideo. Moreover, CAM-VFD demonstrates stable performance under compression, noise, blur, and adversarial perturbations, suggesting that cross-modal reasoning may improve robustness in media forensics. The code is publicly available at https://github.com/Hoda-Osama/CAM-VFD/tree/main.
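
The fusion pattern in this abstract (appearance features as queries, motion and depth features as keys and values) is an instance of scaled dot-product cross-attention. The sketch below shows that mechanism for a single query in plain Python. It is a simplified illustration, not the CAM-VFD architecture: the learned projection matrices and multiple heads are omitted, and the 2-dimensional feature values are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights

# Hypothetical 2-d features: one appearance query attends over two motion
# tokens and two depth tokens that share the same key/value space.
appearance_query = [1.0, 0.0]
motion_tokens = [[1.0, 0.0], [0.0, 1.0]]
depth_tokens = [[0.5, 0.5], [-1.0, 0.0]]
context = motion_tokens + depth_tokens
fused, attn = cross_attention(appearance_query, context, context)
```

The attention weights show which motion or depth tokens agree with the appearance query. A token that points the opposite way, like the last depth token here, receives the smallest weight, which is the kind of disagreement a cross-modal detector could exploit.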

2605.17125 2026-05-19 cs.CV cs.LG

Principal Component Analysis for Lunar Crater Detection

基于主成分分析的月球陨石坑检测

Travis Driver, John A. Christian

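AI summary This paper proposes an automated crater template generation method based on principal component analysis to improve image-based crater identification, and demonstrates better detection and localization performance than hand-picked templates on simulated lunar imagery.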

Abstract

Optical navigation is a critical component for lunar orbiter and lander missions. Image-based crater identification has emerged as a promising technology for optical navigation due to the abundance of craters on the lunar surface and the availability of extensive crater catalogs. Moreover, due to the relative morphological homogeneity among lunar craters, template matching has been identified as a promising approach for identification. In this paper, we propose EigenCrater, an automated crater template generation method based on principal component analysis of crater digital elevation maps (DEMs). We demonstrate superior detection and position estimation performance relative to hand-picked templates on simulated lunar imagery.
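
Generating templates by principal component analysis of elevation maps can be sketched on toy data: flatten each crater DEM into a vector, subtract the mean, and extract the leading principal component. The example below uses power iteration in plain Python on hypothetical 5-sample elevation profiles; the actual EigenCrater pipeline (DEM source, normalization, number of components) is not described in the abstract and is not reproduced here.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def leading_component(rows, iterations=200):
    """Mean and leading principal component via power iteration.

    Assumes the all-ones start vector is not orthogonal to the leading
    component; otherwise the normalization below would divide by zero.
    """
    mu = mean_vector(rows)
    centered = [[v - m for v, m in zip(row, mu)] for row in rows]
    dim = len(mu)
    vec = [1.0] * dim
    for _ in range(iterations):
        # Multiply by X^T X without forming the covariance matrix.
        proj = [sum(c * v for c, v in zip(row, vec)) for row in centered]
        new = [sum(p * row[d] for p, row in zip(proj, centered))
               for d in range(dim)]
        norm = sum(x * x for x in new) ** 0.5
        vec = [x / norm for x in new]
    return mu, vec

# Hypothetical elevation profiles: the same bowl shape at four depths.
shape = [0.0, -0.5, -1.0, -0.5, 0.0]
profiles = [[depth * s for s in shape] for depth in (1.0, 2.0, 3.0, 4.0)]
mean_template, eigen_template = leading_component(profiles)
```

Here the mean template is the average bowl and the leading component recovers the bowl shape up to sign and scale, since depth is the only source of variation in the toy data. Real DEMs would need a 2-D flattening and more than one component.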