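arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.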

2605.16009 2026-05-18 cs.RO

Fast Expanding Safe Circular Regions for Efficient Local Path Planning

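Scott Fredriksson, Akshit Saradagi, George Nikolakopoulos

AI summary: This paper proposes a geometric algorithm that generates expanding safe circular regions from local LiDAR scans, giving faster computation and longer planning horizons for local navigation.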

Comments Accepted by the IFAC World Congress 2026

2605.16008 2026-05-18 cs.CV

End-to-end plaque counting and virus titration from laboratory plate images with deep learning

Eugenia Moris, Alicia Costábile, Sebastián Rey, Irene Ferreiro, Joaquín Hurtado, Lizandra Lissette Luciano, Matías Villagrán, Aisha Espino Vázquez, Jomari Ramos, Isadora Monteiro, María Victoria de Santiago, Pilar Moreno, Gonzalo Moratorio, José Ignacio Orlando

AI summary: This paper presents an end-to-end deep learning method that uses segmentation models to automatically count plaques and compute virus titers from laboratory plate images, improving the efficiency and accuracy of virus infectivity assays.

Abstract

Plaque assays remain the gold standard readout of virus infectivity; however, plaque counting from plate images is labor-intensive and prone to inter-operator variability. We present an end-to-end, computer-aided workflow for cytopathic effect-based virus titration directly from laboratory plaque assay images. The proposed approach combines two models derived from the Segment Anything Model (SAM): a SAM2-based well-segmentation module that localizes assay wells across heterogeneous imaging conditions, and a SAM-based plaque-segmentation model that detects and enumerates plaques within each well. The method was evaluated on a mixed dataset comprising private plaque assay images of Mayaro virus and Coxsackievirus B3, together with public Vaccinia virus images from the VACVPlaque dataset. The pipeline outputs per-well plaque counts, automatically computes plaque-forming units per milliliter (PFU/mL), and is integrated into a web-based platform that allows users to review results and organize experiments. On held-out plates (17 from MAYV/CVB3 and 22 from VACV), the workflow generalized across two plate formats (6-well and 12-well) and showed strong agreement with manual annotations (Pearson correlation coefficients of 0.92 for MAYV/CVB3 and 0.88 for VACV). Automated plaque counts were further compared with annotations from four independent experts, demonstrating high concordance. The proposed system will be open sourced and publicly released upon acceptance of this manuscript to enable reproducible, scalable, and audit-ready plaque assay analysis while substantially reducing manual annotation effort.
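
Note: the abstract does not spell out how PFU/mL is derived from the per-well counts. A minimal sketch of the standard titer calculation follows, assuming the usual definition (plaques divided by the product of dilution and inoculum volume). The function name `pfu_per_ml` and the example numbers are hypothetical and are not taken from the paper.

```python
def pfu_per_ml(plaque_count: int, dilution: float, inoculum_ml: float) -> float:
    """Titer in PFU/mL = plaques / (dilution * inoculum volume in mL)."""
    if dilution <= 0 or inoculum_ml <= 0:
        raise ValueError("dilution and inoculum volume must be positive")
    return plaque_count / (dilution * inoculum_ml)


# Hypothetical well: 42 plaques at a 1e-5 dilution with 0.2 mL of inoculum.
titer = pfu_per_ml(42, 1e-5, 0.2)
print(f"{titer:.2e} PFU/mL")
```

In a pipeline like the one described, the plaque count would come from the segmentation model, while the dilution and inoculum volume would have to be supplied by the user or the experiment metadata.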

2605.16003 2026-05-18 cs.CV

Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation

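Mingqiang Wu, Weilun Feng, Zhefeng Zhang, Haotong Qin, Yuqi Li, Guoxin Fan, Xiaokun Liu, Zhulin An, Libo Huang, Yongjun Xu, Chuanguang Yang

AI summary: This paper proposes Echo-Forcing, a framework that uses hierarchical temporal memory, scene recall frames, and difference-aware memory decay to resolve the functional entanglement of historical KV states in long video generation, enabling smooth transitions and long-range scene recall in interactive video generation.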

Abstract

Autoregressive video diffusion models enable open-ended generation through local attention and KV caching. However, existing training-free long-video optimization methods mainly focus on stable extension under a single prompt, making it difficult for them to handle interactive scenarios involving prompt switching, old scene forgetting, and historical scene recall. We identify the core bottleneck as the functional entanglement of historical KV states: stable anchors and recent dynamics are handled by the same cache policy, leading to outdated background contamination, delayed response to new prompts, and loss of long-range memory. To address this issue, we propose Echo-Forcing, a training-free scene memory framework specifically designed for interactive long video generation with three core mechanisms: (1) Hierarchical Temporal Memory, which decouples stable anchors, compressed history, and recent windows under relative RoPE; (2) Scene Recall Frames, which compresses historical scenes into spatially structured KV representations to support long-term recall; and (3) Difference-aware Memory Decay, which adaptively forgets conflicting tokens according to the discrepancy between old and new scenes. Based on these designs, Echo-Forcing uniformly supports smooth transitions, hard cuts, and long-range scene recall under a bounded cache budget. Extensive evaluations on VBench-Long further demonstrate that Echo-Forcing achieves the best overall performance in both long-video generation and interactive video generation settings. Our code is released at https://github.com/mingqiangWu/Echo-Forcing

2605.15999 2026-05-18 cs.RO cs.SY eess.SY

Constrained MPC-Based Motion Planning for Morphing Quadrotors in Ultra-Narrow Passages under Limited Perception

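Harsh Modi, Xiao Liang, Minghui Zheng

AI summary: This paper proposes a motion planning framework that plans morphology and trajectory for morphing quadrotors in extremely constrained environments. A novel obstacle avoidance cost function lets the quadrotor navigate through extremely narrow gaps under limited perception from a 2D LiDAR.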

Abstract

This paper introduces a motion planning framework to plan morphology and trajectory for morphing quadrotors under extremely constrained environments. We develop a novel obstacle avoidance cost function for nonlinear model predictive control (MPC) that enables navigation through extremely narrow gaps under limited perception from a 2D LiDAR. Classical artificial potential field-based costs typically have a high cost in narrow passages, artificially blocking the navigable path. In contrast, we propose a smooth exponential obstacle cost that preserves low traversal cost within narrow gaps while maintaining strong collision avoidance behavior. The formulation avoids hard activation thresholds and introduces a cost reduction factor to reduce the cost within narrow passages. Direct use of 2D LiDAR measurements in MPC allows navigation around arbitrarily shaped obstacles. The method is embedded within an acados-based nonlinear MPC framework. Simulation and experimental results demonstrate successful traversal of narrow corridors where typical repulsive cost functions would fail. The approach provides a computationally efficient and practical solution for navigating through tight spaces while maintaining safety from the obstacles. While we are implementing the framework on the morphing quadrotors, the cost function formulation is general-purpose for any mobile robot application, and is not limited to the morphing quadrotors. The implementation code is available at https://github.com/harshjmodi1996/morphocopter_mpc and a short video is available at https://zh.engr.tamu.edu/wp-content/uploads/sites/310/2026/03/MPC_MorphoCopter_video.mp4.
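
Note: the contrast the abstract draws between a thresholded potential-field cost and a smooth exponential cost can be illustrated with a small sketch. This is not the paper's formulation: the functional forms, the parameter names (`d0`, `k`, `weight`, `decay`), and the example distances are all assumptions, and the paper's cost reduction factor is not modelled here.

```python
import math


def apf_cost(d: float, d0: float = 1.0, k: float = 1.0) -> float:
    """Classical repulsive potential: active only inside the threshold d0 (d > 0)."""
    if d >= d0:
        return 0.0
    return 0.5 * k * (1.0 / d - 1.0 / d0) ** 2


def exp_cost(d: float, weight: float = 1.0, decay: float = 4.0) -> float:
    """Smooth exponential cost: no activation threshold, differentiable everywhere."""
    return weight * math.exp(-decay * d)


def scan_cost(ranges, cost_fn):
    """Sum a per-point cost over the finite distances in a 2D LiDAR scan."""
    return sum(cost_fn(r) for r in ranges if math.isfinite(r))


# Hypothetical narrow gap: walls 0.15 m away on both sides.
gap = [0.15, 0.15]
print(scan_cost(gap, apf_cost), scan_cost(gap, exp_cost))
```

With these assumed parameters the potential-field cost grows sharply inside the gap while the exponential cost stays bounded, which is the qualitative behaviour the abstract describes.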

2605.15997 2026-05-18 cs.CV

Segmentation, Detection and Explanation: A Unified Framework for CT Appearance Reasoning

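Yuyuan Liu, Can Peng, Yingyu Yang, Qianye Yang, Cheng Ouyang, J. Alison Noble

AI summary: This paper proposes a unified framework that integrates language-guided visual reasoning to improve the accuracy of CT image segmentation and detection while also producing appearance reasoning outputs.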

Comments 8 pages, 4 figures, submitted to IEEE Transactions on Medical Imaging (TMI)

Abstract

Recent progress in deep learning has significantly advanced CT image analysis, particularly for segmentation tasks. However, these advances are largely confined to image-level pattern recognition, with most methods lacking explicit anatomical or contextual reasoning. Large vision-language models introduce linguistic context into image analysis, yet most approaches typically focus on a single task, which is insufficient for clinical workflow analysis that requires multiple fine-grained types of analysis, such as anatomy detection and segmentation. In this paper, we propose a unified autoregressive framework that integrates language-guided visual reasoning into CT interpretation. Our method introduces task-routing tokens that trigger detection and segmentation heads conditioned on the hidden states of a large vision-language model, enabling coherent generation of visual outputs (e.g., masks and bounding boxes) and textual reasonings. To progressively enhance localisation accuracy and semantic clarity, we further design a "closer-look" mechanism that allows the model to perform progressive coarse-to-fine visits to regions of interest under refined fields of view. To support model training and evaluation, we curated a new multimodal CT dataset containing pixel-wise masks, bounding boxes, spatial prompts, and structured descriptions for visual objects constructed through an AI-assisted annotation process with human verification. Experiments on public benchmarks demonstrate consistent improvements over the SoTA, achieving up to 1.0% Dice on BTCV and 1.7% Dice on MosMed+, while additionally providing appearance reasoning outputs. The code and dataset will be available.

2605.15995 2026-05-18 cs.LG cs.AI

Constrained latent state modeling: A unifying perspective on representation learning under competing constraints

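Gwenolé Quellec

AI summary: This paper proposes constrained latent state modeling (CLSM), which unifies the core principles and methods of representation learning under competing constraints and exposes the intrinsic coupling and fundamental trade-offs among latent state properties.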

Comments Resources and model cards: https://github.com/gwenole-quellec/clsm

Abstract

Learning latent representations from complex data is central to modern machine learning, spanning temporal, multimodal, and partially observed systems. In such settings, representations are better understood as latent states capturing underlying system dynamics, rather than as mere compressed summaries of observations. Yet current approaches remain fragmented, relying on distinct -- and often implicit -- assumptions about what these states should represent. We argue that this fragmentation reflects a more fundamental limitation: latent representations are typically learned from underconstrained objectives that fail to specify the properties that meaningful latent states should satisfy. As a result, multiple representations can satisfy the same objective, leading to ambiguity in their structure and interpretation. While many of the underlying principles have been explored in isolation, their interactions have not been explicitly formalized. In this work, we propose constrained latent state modeling (CLSM) as a unifying perspective. We identify a set of core properties -- predictive sufficiency, minimality, temporal coherence, observation compatibility, invariance to nuisance factors, and structural constraints -- and show that they are intrinsically coupled through fundamental trade-offs. Revisiting major modeling families through this lens, we show that existing approaches can be interpreted as enforcing different subsets of constraints, thereby occupying distinct regions of a common design space. This perspective reframes persistent challenges such as lack of identifiability as consequences of underconstrained formulations, rather than isolated technical limitations. More broadly, CLSM provides a principled framework to make design choices explicit, to analyze trade-offs, and to guide the development of more interpretable, robust, and task-aligned latent state models.

2605.15990 2026-05-18 cs.CL

Defining Cultural Capabilities for AI Evaluation: A Taxonomy Grounded in Intercultural Communication Theory

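Isar Nejadgholi, Masoud Kianpour, Krishnapriya Vishnubhotla, Maryam Molamohamadi

AI summary: This paper proposes a three-level taxonomy of AI-relevant cultural capabilities grounded in intercultural communication theory, aiming to clarify the vague definition of cultural capability and improve the validity and interpretability of AI evaluation in multicultural settings.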

Abstract

Tremendous efforts have been put into evaluating the inclusivity and effectiveness of AI systems across cultures. However, the cultural capabilities considered in much of the literature remain vaguely defined, are referred to using interchangeable terminology, and are typically limited to recalling accurate information about various demographics, regions, and nationalities. To address this construct ambiguity, we draw from Intercultural Communication scholarship and propose a three-level taxonomy of AI-relevant cultural capabilities: Cultural Awareness answers "Does the model know?", Cultural Sensitivity answers "How does it frame its knowledge?", and Cultural Competence answers "Can it adapt as the interaction evolves?". Beyond conceptual clarification, we position this taxonomy as a practical tool for improving the validity and interpretability of AI evaluation in real-world, multicultural settings. Without such construct clarity, evaluation results risk overstating model capabilities and may lead to inappropriate deployment decisions in culturally sensitive contexts.

2605.15984 2026-05-18 cs.SD cs.AI cs.CR

Beyond Content: A Comprehensive Speech Toxicity Dataset and Detection Framework Incorporating Paralinguistic Cues

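Zhongjie Ba, Liang Yi, Peng Cheng, Qingcao Li, Qinglong Wang, Li Lu

AI summary: This paper presents the ToxiAlert-Bench dataset and a dual-head neural network framework that incorporate paralinguistic cues to improve toxic speech detection; experiments show the method outperforms existing baselines on multiple metrics.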

Abstract

Toxic speech detection has become a crucial challenge in maintaining safe online communication environments. However, existing approaches to toxic speech detection often neglect the contribution of paralinguistic cues, such as emotion, intonation, and speech rate, which are key to detecting speech toxicity. Moreover, current toxic speech datasets are predominantly text-based, limiting the development of models that can capture paralinguistic cues. To address these challenges, we present ToxiAlert-Bench, a large-scale audio dataset comprising over 30,000 audio clips annotated with seven major toxic categories and twenty fine-grained toxic labels. Uniquely, our dataset annotates toxicity sources -- distinguishing between textual content and paralinguistic origins -- for comprehensive toxic speech analysis. Furthermore, we propose a dual-head neural network with a multi-stage training strategy tailored for toxic speech detection. This architecture features two task-specific classification headers: one for identifying the source of sensitivity (textual or paralinguistic), and the other for categorizing the specific toxic type. The training process involves independent head training followed by joint fine-tuning to reduce task interference. To mitigate data class imbalance, we incorporate class-balanced sampling and weighted loss functions. Our experimental results show that leveraging paralinguistic features significantly improves detection performance. Our method consistently outperforms existing baselines across multiple evaluation metrics, with a 21.1% relative improvement in Macro-F1 score and a 13.0% relative gain in accuracy over the strongest baseline, highlighting its enhanced effectiveness and practical applicability.
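
Note: the abstract mentions class-balanced sampling and weighted loss functions without giving details. One common way to realise both is sketched below, assuming inverse-frequency weighting; the helper names and the toy label distribution are hypothetical, and the paper may use a different scheme.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by N / (K * n_c), so rare classes count more in the loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}


def sampling_probabilities(labels):
    """Per-example probabilities that make every class equally likely to be drawn."""
    counts = Counter(labels)
    k = len(counts)
    return [1.0 / (k * counts[y]) for y in labels]


# Hypothetical imbalanced label set with three classes (ids 0, 1, 2).
labels = [0] * 6 + [1] * 3 + [2] * 1
weights = inverse_frequency_weights(labels)
probs = sampling_probabilities(labels)
print(weights, sum(probs))
```

The weights would typically be passed to a weighted cross-entropy loss, and the probabilities to a weighted random sampler, so that minority categories are neither ignored during optimisation nor under-sampled in each batch.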

2605.15983 2026-05-18 cs.AI

Petri Net Induced Heuristic Search for Resource Constrained Scheduling

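Ido Lublin, Dor Atzmon, Izack Cohen

AI summary: This paper models the resource-constrained project scheduling problem as an optimal search over the reachability graph of a Timed Transition Petri net, using relative-delay tokens so that scheduling decisions correspond to state-space transitions. An A* search with a heuristic that combines critical-path and resource lower bounds is shown to be consistent and outperforms a MIP baseline on the PSPLIB benchmark.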

Comments Accepted at the International Symposium on Combinatorial Search (SoCS 2026)
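
Note: the summary mentions a heuristic built from critical-path and resource lower bounds. The sketch below shows one textbook way to combine two such bounds for a whole project; it is an assumption-laden illustration, not the paper's heuristic, and the function names, the single-resource model, and the toy project are hypothetical.

```python
import math
from functools import lru_cache


def critical_path_bound(durations, successors):
    """Longest duration-weighted path through the precedence DAG."""
    @lru_cache(maxsize=None)
    def tail(task):
        return durations[task] + max(
            (tail(s) for s in successors.get(task, ())), default=0
        )
    return max(tail(t) for t in durations)


def resource_bound(durations, demands, capacity):
    """Total work (duration * demand) divided by capacity, rounded up."""
    work = sum(durations[t] * demands[t] for t in durations)
    return math.ceil(work / capacity)


def heuristic(durations, successors, demands, capacity):
    """The max of two lower bounds is still a lower bound on the makespan."""
    return max(critical_path_bound(durations, successors),
               resource_bound(durations, demands, capacity))


# Hypothetical project: A precedes B and C; one resource with capacity 2.
durations = {"A": 3, "B": 2, "C": 4}
successors = {"A": ("B", "C")}
demands = {"A": 2, "B": 2, "C": 2}
print(heuristic(durations, successors, demands, capacity=2))
```

Inside an A* search, bounds of this kind would be evaluated on the unscheduled remainder of the project at each state rather than on the full project as shown here.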

2605.15978 2026-05-18 cs.CL cs.AI cs.LO

Ontology for Policing: Conceptual Knowledge Learning for Semantic Understanding and Reasoning in Law Enforcement Reports

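Anita Srbinovska, Jansen Orfan, Adrian Martin, Ernest Fokoué

AI summary: This paper uses symbolic methods to convert the narratives in law enforcement reports into evidence-linked facts. Redacting personal identifiers, semantic parsing, mapping predicates to an ontology, and reasoning improve the recovery of incident details and yield temporal graphs with time cues and domain axioms.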

Comments 13 pages, 8 figures, 9 tables

Abstract

Law enforcement reports contain structured fields and written narratives. However, many incident facts that are needed for review, police training, and investigations are in natural language and require manual reading. We propose a framework using symbolic methods for converting narratives into evidence-linked facts. Our objective is to measure the value of narratives to recover incident details only from the unstructured text and build temporal graphs with time cues and domain axioms. We achieve this by redacting personal identifiers, semantic parsing, predicate mapping to ontology, and reasoning. We evaluate the symbolic approach on 450 property crime reports and a short human review. Of the extracted events from the system, 54.1% had a confidence score of at least 0.80 and 93.7% were mapped through the PropBank-VerbNet-WordNet semantic path. Agreement reached 100% on incident initiation, stolen items, and temporal cues, and was lower for forced entry interpretation.

2605.15976 2026-05-18 cs.CL cs.AI

Reference-Free Reinforcement Learning Fine-Tuning for MT: A Seq2Seq Perspective

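Ernesto Garcia-Estrada, Carlos Escolano, José A. R. Fonallosa

AI summary: This paper applies reference-free reinforcement learning fine-tuning to sequence-to-sequence models, improving translation quality for 13 languages without parallel data, with particularly strong results on morphologically complex languages.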

Abstract

Production machine translation relies overwhelmingly on encoder-decoder Seq2Seq models, yet reinforcement learning approaches to MT fine-tuning have largely targeted decoder-only LLMs at ≥7B parameters, with limited systematic study of encoder-decoder architectures. We apply Group Relative Policy Optimization to NLLB-200 (600M and 1.3B) using a hybrid reference-free reward (LaBSE and COMET-Kiwi) that requires no parallel data at fine-tuning time, evaluating across 13 typologically diverse languages. GRPO yields consistent improvements on all 13 languages, up to +5.03 chrF++ for Traditional Chinese, and, without any target-language data, competes with 3-epoch supervised fine-tuning on morphologically complex languages. We identify a consistent empirical pattern in which gains are largest where baseline performance is weakest and reward discriminability is highest, making this approach most effective precisely where parallel data is scarcest, and replicate this pattern across English and Spanish source languages.
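
Note: the group-relative step in Group Relative Policy Optimization can be sketched in a few lines. The sketch assumes the commonly described form, in which each sampled output's reward is normalised by the mean and standard deviation of its own group. The `hybrid_reward` mixing weight `alpha`, the score ranges, and the example scores are hypothetical, and the paper's actual reward combination is not specified in the abstract.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each sampled output's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


def hybrid_reward(similarity, quality, alpha=0.5):
    """Hypothetical mix of two reference-free scores, both assumed in [0, 1]."""
    return alpha * similarity + (1.0 - alpha) * quality


# Four hypothetical translations of one source sentence: (similarity, quality).
scores = [(0.90, 0.80), (0.70, 0.60), (0.85, 0.75), (0.50, 0.40)]
rewards = [hybrid_reward(s, q) for s, q in scores]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because the advantage is computed relative to sibling samples, no reference translation or learned value function is needed, which fits the reference-free setting the abstract describes.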

2605.15967 2026-05-18 cs.AI cs.CV cs.LO

Deterministic Event-Graph Substrates as World Models for Counterfactual Reasoning

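Fabio Rovai

AI summary: This paper proposes event-graph substrates as world models that answer counterfactual queries by forking the log under a structured intervention vocabulary, proves a duality between explanatory and counterfactual queries, and evaluates a CLEVRER-DSL interpreter running on a domain-agnostic substrate runtime at full CLEVRER validation scale.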

Comments 10 pages, 3 figures, 2 tables

Abstract

We study event-graph substrates: a class of world models that represent agent state as an append-only log of typed RDF triples and answer counterfactual queries by forking the log under a structured intervention vocabulary. Substrates are inspectable at the triple level, support exact counterfactuals, and transfer across domains without learned components. We formalize the class, prove a duality between explanatory and counterfactual queries that reduces both to the same causal-ancestor traversal, and evaluate a 1,400-line CLEVRER-DSL interpreter atop a domain-agnostic substrate runtime at full CLEVRER validation scale (n=75,618). The substrate exceeds the NS-DR symbolic oracle on all four per-question categories (by 9.89, 20.26, 17.65, and 0.80 percentage points), and exceeds the parametric ALOE baseline on descriptive and explanatory while lagging on predictive and counterfactual. We also introduce twin-EventLog, a 500-specification Park-canonical Smallville counterfactual benchmark on which the substrate exceeds Llama-3.1-8B with full context by 18.80 points joint accuracy.
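
Note: the idea of an append-only triple log that is forked to answer a counterfactual can be sketched as follows. This is a toy illustration under stated assumptions: the `EventLog` class, its `fork` signature, and the example triples are hypothetical, and the paper's intervention vocabulary and causal-ancestor traversal are not reproduced.

```python
from dataclasses import dataclass, field

Triple = tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class EventLog:
    triples: list[Triple] = field(default_factory=list)

    def append(self, triple: Triple) -> None:
        """Append-only: existing entries are never modified."""
        self.triples.append(triple)

    def fork(self, at: int, intervention: Triple) -> "EventLog":
        """Copy the prefix before index `at`, then record the intervention."""
        branch = EventLog(list(self.triples[:at]))
        branch.append(intervention)
        return branch


log = EventLog()
log.append(("ball_1", "collides_with", "cube_2"))
log.append(("cube_2", "moves", "left"))

# Counterfactual branch: what if ball_1 had been removed before the collision?
what_if = log.fork(at=0, intervention=("ball_1", "is_removed", "true"))
print(log.triples, what_if.triples)
```

The original log is left untouched by the fork, so factual and counterfactual histories can be inspected side by side; re-deriving downstream events on the branch is omitted here.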

2605.15965 2026-05-18 cs.LG

Entropy-Based Characterisation of the Polarised Regime in Latent Variable Models

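Peter Clapham, Lisa Bonheme, Marek Grzes

AI summary: This paper proposes an entropy-based, information-theoretic method for identifying the polarised regime in latent variable models, links it to KL minimisation through entropy-variance bounds, and validates it on several autoencoder families.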

Comments 13 pages, 17 figures, under review at Neural Networks

Abstract

Variational Autoencoders (VAEs) often exhibit a polarised regime in which latent variables separate into active, passive, and mixed subsets. Existing criteria for identifying active dimensions depend on a Gaussian prior, limiting their applicability to variational models and specific priors. We propose a simple information-theoretic classification of the polarised regime based on the entropy of the mean representation. We show theoretically how this entropy couples to KL minimisation through entropy-variance bounds, and we relate the resulting criterion to Bonheme's active/passive conditions. We also clarify a key limitation: entropy of the mean alone cannot reliably distinguish active from mixed dimensions without additional signals from the variance representation. Empirically, we evaluate the entropy criterion on β-VAEs, identifiable VAEs, Least-Volume Autoencoders, and L2-regularised autoencoders, and find that it consistently recovers a polarised regime when such a regime is present across the model classes studied. Finally, we show that passive dimensions can yield small but consistent improvements on downstream tasks when latent codes are appropriately normalised, suggesting that collapse is often a matter of scale rather than absolute information removal.

2605.15964 2026-05-18 cs.RO cs.CV

WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation

Baining Zhao, Jiacheng Xu, Weicheng Feng, Xin Zhang, Zhaolu Wang, Haoyang Wang, Shilong Ji, Ziyou Wang, Jianjie Fang, Zhiheng Zheng, Weichen Zhang, Yu Shang, Wei Wu, Chen Gao, Xinlei Chen, Yong Li

AI summary: WorldVLN is an autoregressive world action model that predicts latent world evolution and generates executable waypoint actions, improving aerial vision-language navigation over existing baselines.

Abstract

Aerial vision-language navigation (VLN) requires agents to follow natural-language instructions through closed-loop perception and action in 3D environments. We argue that aerial VLN can be formulated as a prediction-driven world-action problem: the agent should anticipate latent world evolution and act according to the predicted consequences. To this end, we propose WorldVLN, the first autoregressive world action model for aerial VLN. Unlike full-sequence video-generation world models that generate an entire visual clip, WorldVLN adapts a latent autoregressive video backbone to predict short-horizon world-state transitions and directly decodes them into executable waypoint actions. After each action segment is executed, newly received observations are encoded back into the autoregressive context, enabling closed-loop world-action prediction. We further introduce a two-stage training framework that first grounds the video prior in instruction-conditioned navigation dynamics and then develops Action-aware GRPO, the first reinforcement learning method tailored to autoregressive WAMs, to optimize waypoint decisions through their downstream rollout consequences. On public outdoor and indoor benchmarks, WorldVLN consistently outperforms existing Vision-Language-Action baselines with 12%+ success-rate gains and larger advantages on challenging cases. It further transfers zero-shot to real drone deployment, suggesting that the proposed WorldVLN offers a promising route for spatial action tasks. Demos and code are available at https://embodiedcity.github.io/WorldVLN/.

2605.15963 2026-05-18 cs.AI

PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

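Jingxuan Wei, Xi Bai, Shan Liu, Caijun Jia, Zheng Sun, Xinglong Xu, Siyuan Li, Linzhuang Sun, Bihui Yu, Conghui He, Cheng Tan

AI summary: This paper proposes PAGER, which combines dependency-structured planning with pixel-level execution for point-precise geometric GUI tasks, raising step success rate to over 62% and closing the semantic-execution gap.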

Comments 27 pages, 11 figures, 3 tables

Abstract

Large vision-language models have significantly advanced GUI agents, enabling executable interaction across web, mobile, and desktop interfaces. Yet these gains largely rely on a forgiving region-tolerant paradigm, where many nearby pixels inside the same component remain valid. Precise geometric construction breaks this assumption: actions must land on points in continuous canvas space rather than tolerant regions. Because geometric primitives carry ontological dependencies, a local coordinate error can induce cascading topological failures that distort downstream objects and invalidate the final construction. We identify this regime as precision-sensitive GUI tasks, requiring point-level accuracy, geometry-aware verification, and robustness to dependency-driven error propagation. To benchmark it, we introduce PAGE Bench, with 4,906 problems and over 224K process-supervised, pixel-level GUI actions. We further propose PAGER, a topology-aware agent that decomposes construction into dependency-structured planning and pixel-level execution. Pixel-grounded supervised tuning establishes executable action grammar, while precision-aligned reinforcement learning mitigates rollout-induced exposure bias through state-conditioned geometric feedback. Experiments reveal a pronounced Semantic-Execution Gap: general multimodal models can exceed 88% action type accuracy yet remain below 6% task success. PAGER closes this gap, delivering 4.1x higher task success than the strongest evaluated general baseline and raising step success rate from below 9% for GUI-specialized agents to over 62%, establishing a new state of the art for point-precise GUI control.

2605.15961 2026-05-18 cs.CV

Sparse Autoencoders enable Robust and Interpretable Fine-tuning of CLIP models

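Fabian Morelli, Arnas Uselis, Ankit Sonthalia, Seong Joon Oh

AI summary: This paper proposes SAE-FT, which uses sparse autoencoders to constrain changes in visual representations, enabling robust and interpretable fine-tuning of CLIP models that improves downstream task performance while preserving robustness.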

2605.15959 2026-05-18 cs.LG cs.AI

When and Why Adversarial Training Improves PINNs: A Neural Tangent Kernel Perspective

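Yuan-dong Cao, Chi Chiu SO, Jun-Min Wang, He Wang

AI summary: This paper analyses, from a neural tangent kernel perspective, how adversarial training improves PINNs, proposing a theoretical framework and an efficient training algorithm; experiments show it markedly reduces PINN training pathologies and improves accuracy.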

Abstract

Physics-informed neural networks (PINNs) are powerful surrogates for differential equations but are notoriously difficult to train due to spectral bias, stiffness, and poor accuracy on high-frequency or multiscale solutions. Adversarial training based on generative adversarial networks (GANs) has recently shown surprisingly strong empirical results in improving training, but the underlying mechanisms remain elusive. To this end, we propose a new analysis framework for adversarially trained PINNs, based on the key observation of how the discriminator in GANs can influence the training dynamics of PINNs. The framework first provides a much-needed theoretical grounding for why and when adversarial training is effective in PINNs, then presents a unified analysis of GAN variants in such training, and finally leads to a new, practical, efficient training algorithm for PINNs. Empirical results demonstrate that our method can significantly reduce the pathology of PINNs training, thereby providing better models with superior performance, often several orders of magnitude more accurate than alternative methods.

2605.15951 2026-05-18 cs.CV

From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding

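Yuyuan Liu, Yiping Ji, Anjie Le, Jiayuan Zhu, Jiazhen Pan, Can Peng, Jiajun Deng, Fengbei Liu, Junde Wu

AI summary: This paper proposes a group-revision optimisation method that generates revised candidate responses to improve learning on hard cases and refines the reward and advantage to amplify the influence of high-quality revisions, outperforming existing GRPO-based methods.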

Comments 8 pages, 5 figures, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

Abstract

Finetuning Large Vision-Language Models with reinforcement learning has emerged as a promising approach to enhance their capability in object-level grounding. However, existing methods, mainly based on GRPO, assign rewards at the response level. Such sparse reward, often criterion-induced, leads to minimal learning signals when all candidate responses fail in challenging scenarios. In this work, we propose a group-revision optimisation paradigm that enhances learning on hard cases. It begins with a sampled initial response and generates a set of revised candidates to explore improved grounding outcomes. Inspired by reward shaping, we introduce a consolidation process that quantifies each candidate's improvement over the initial attempt and converts it into informative shaping signals. These signals are used to both refine the reward and modulate the advantage, amplifying the influence of high-quality revisions. Our method achieves consistent gains across referring and reasoning segmentation, REC, and counting benchmarks compared with prior GRPO-based models. Our code is available at https://github.com/yyliu01/GroupRevision.
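
The sparse-signal problem described above can be illustrated with a minimal sketch of group-relative advantage normalization as it is commonly described for GRPO (reward minus group mean, divided by group standard deviation). The helper name `group_relative_advantages` and the `eps` constant are assumptions for illustration and are not taken from the paper's repository.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (hypothetical helper)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every reward in the group is identical.
    return [(r - mu) / (sigma + eps) for r in rewards]

# All candidates fail: every advantage is zero, so the group gives no signal.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# One candidate succeeds: it gets a positive advantage, the others negative.
mixed = group_relative_advantages([0.0, 0.0, 1.0, 0.0])
```

When every response in a group receives the same reward, the normalized advantages vanish, which is the hard-case situation the group-revision approach targets.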

2605.15942 2026-05-18 cs.CV cs.AI

Decomposed Vision-Language Alignment for Fine-Grained Open-Vocabulary Segmentation

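Chenhao Wang, Yingrui Ji, Yu Meng, Yao Zhu

AI summary: This paper proposes a decomposed vision-language alignment framework that factorizes textual prompts into a concept token and multiple attribute tokens, improving generalization to unseen attribute-category compositions in fine-grained open-vocabulary segmentation.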

Abstract

Open-vocabulary segmentation models often struggle to generalize to unseen combinations of object categories and attributes, because fine-grained descriptions are typically encoded as holistic sentences that entangle multiple semantic units. We propose a Decomposed Vision-Language Alignment framework that explicitly factorizes textual prompts into a concept token and multiple attribute tokens, enabling separate cross-modal interactions for each semantic unit. At the feature level, we introduce a Feature-Gated Cross-Attention module that generates attribute-specific gating maps to fuse information in a multiplicative manner, effectively enforcing compositional semantics. At the scoring level, per-token similarities are aggregated in log-space, producing a stable and interpretable compositional matching. The method can be seamlessly integrated into existing transformer-based segmentation architectures and significantly improves generalization to unseen attribute-category compositions in fine-grained open-vocabulary segmentation benchmarks.
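
As a rough illustration of log-space aggregation, the sketch below sums the logarithms of per-token similarities, which equals the logarithm of their product. It assumes similarities already lie in (0, 1]; the function name, the clamp `eps`, and the example values are hypothetical, and the paper's exact scoring rule may differ.

```python
import math

def aggregate_log_space(similarities, eps=1e-6):
    """Sum log-similarities over the concept token and attribute tokens."""
    # Clamp to eps so that a zero similarity does not produce log(0).
    return sum(math.log(max(s, eps)) for s in similarities)

# Concept and both attributes match well.
full_match = aggregate_log_space([0.9, 0.8, 0.85])

# Concept matches, but one attribute is nearly absent.
one_attribute_missing = aggregate_log_space([0.9, 0.8, 0.05])
```

Because the aggregate behaves like a product, a single poorly matched attribute pulls the whole score down, which is one way to read "enforcing compositional semantics".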

2605.15937 2026-05-18 cs.LG

A Retrieval-Enhanced Transformer for Multi-Step Port-of-Call Sequence Prediction in Global Liner Shipping

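Yanzhao Su, Fang He, Yineng Wang

AI summary: This paper proposes the CCRE framework, which combines a retrieval-enhanced historical encoder with a Transformer trajectory encoder to improve the accuracy and stability of multi-step port-of-call sequence prediction; experiments show markedly higher first-destination and three-step accuracy than baseline models.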

Abstract

Accurate multi-step port-of-call sequence prediction is vital for tactical resource orchestration and logistical efficiency. However, existing methods struggle with unreliable voyage schedules and the inability of AIS data to provide visibility beyond the immediate next port. To address this, this study proposes a Connectivity-Constrained and Retrieval-Enhanced (CCRE) deep learning framework. Inspired by Retrieval-Augmented Generation, CCRE introduces a retrieval-enhanced historical encoder that queries a global maritime database for contextually similar navigational precedents. Transforming these scenarios into candidate-level semantic representations compensates for data sparsity in long-tail routes and resolves routing ambiguities. Integrating this with a Transformer-based trajectory encoder, the architecture executes adaptive "middle fusion" via cross-attention. This dynamically shifts predictive reliance from real-time kinematics for short-term accuracy to historical context for long-term strategic stability. To ensure sequence-level coherence, forecasting is formulated as a joint sequence generation problem using an autoregressive Transformer decoder enriched with Scheduled Sampling and Gumbel-Softmax relaxation. This mitigates error accumulation, while topology masks strictly enforce maritime network reachability to eliminate operationally infeasible routes. Evaluated on a global dataset, CCRE achieves a 72.3% first-destination accuracy and a 61.4% average three-step accuracy, outperforming baselines like CatBoost and LSTM by average margins of 12.6% and 11.3%, respectively. Case studies further corroborate the model's scalability and ability to capture complex routing patterns across diverse international trade lanes.
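
One common way to enforce reachability constraints of this kind is to mask decoder logits before the softmax. The sketch below is a generic illustration under that assumption, not the CCRE implementation; `masked_softmax` and the example logits are hypothetical, and at least one port is assumed to be reachable.

```python
import math

def masked_softmax(logits, reachable):
    """Softmax over ports, with unreachable ports forced to probability 0."""
    # Unreachable ports get a logit of -inf, so exp() maps them to exactly 0.
    masked = [l if ok else -math.inf for l, ok in zip(logits, reachable)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(m - peak) for m in masked]
    total = sum(exps)
    return [e / total for e in exps]

# The third port has the highest raw score but is not reachable
# from the current port, so it receives zero probability.
probs = masked_softmax([2.0, 1.0, 3.0], [True, True, False])
```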

2605.15935 2026-05-18 cs.RO cs.SY eess.SY physics.plasm-ph

Dynamic Plasma Shape Control with Arbitrary Sensor Subsets

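D. Sorokin, M. Stokolesov, A. Granovskiy, I. Prokofyev, E. Adishchev, M. Nurgaliev, E. Khayrutdinov, G. Subbotin, R. Clark, D. Orlov

AI summary: This paper presents a reinforcement learning agent trained in the NSFsim simulator on a dataset of 120 experimental plasma shapes; it tracks dynamically changing plasma shape targets in real time and remains robust under diagnostic failures.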

Abstract

Plasma shape control in tokamaks requires a real-time controller that tracks dynamically changing shape targets while tolerating diagnostic failures. Classical approaches decompose the problem into equilibrium reconstruction followed by a linear controller, and assume a fixed, fully operational sensor set. We present a reinforcement learning agent that addresses both limitations simultaneously. The agent is trained in NSFsim, a high-fidelity tokamak simulator configured for DIII-D, on a curated dataset of 120 experimental plasma shapes. The shape targets are resampled as random step changes every 0.25 s, exposing the agent to diverse transitions across the full shape envelope. At test time the agent zero-shot tracks dynamic shape sequences; on a held-out static configuration in simulation it achieves a mean shape error of 2.01 cm, and dynamic trajectory following is demonstrated qualitatively in simulation and on the physical device. Diagnostic dropout randomly masks 30% of magnetic sensors per episode, yielding a single policy robust to arbitrary sensor subsets without backup controllers or mode-switching logic. An asymmetric actor-critic architecture with privileged equilibrium information improves value estimation under partial observability; an auxiliary shape reconstruction head on the actor enables end-to-end shape reconstruction from raw diagnostics and serves as an interpretability tool for policy analysis. The policy transfers to experimental DIII-D shots, where it directly commands the coil actuators on two dynamic shape maneuvers, and to the independent GSevolve simulator.
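
The per-episode diagnostic dropout can be sketched as follows. This is a minimal illustration, assuming masked sensors are simply zeroed in the observation; whether the actual agent zeroes readings, also receives the mask as an input, or handles dropout differently is not stated in the abstract, and all names here are hypothetical.

```python
import random

def sample_sensor_mask(num_sensors, drop_fraction=0.3, rng=None):
    """Draw one mask per episode: 1.0 for a live sensor, 0.0 for a dropped one."""
    rng = rng or random.Random()
    num_dropped = round(num_sensors * drop_fraction)
    dropped = set(rng.sample(range(num_sensors), num_dropped))
    return [0.0 if i in dropped else 1.0 for i in range(num_sensors)]

def apply_mask(readings, mask):
    """Zero out the readings of dropped sensors (an assumed convention)."""
    return [r * m for r, m in zip(readings, mask)]

# One episode with 10 magnetic sensors: 3 are dropped, 7 remain.
mask = sample_sensor_mask(10, rng=random.Random(0))
observed = apply_mask([0.5] * 10, mask)
```

Resampling the mask at every episode exposes the policy to many different sensor subsets during training, which is the intuition behind a single policy that needs no mode-switching logic.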

2605.15923 2026-05-18 cs.CV

Invaria: Learning Scale and Density Invariance in Point Clouds via Next-Resolution Prediction

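Chun-Peng Chang, Shaoxiang Wang, Alain Pagani, Dariu Gavrila, Holger Caesar

AI summary: This paper proposes Invaria, an encoder that achieves scale and density invariance in point clouds through next-resolution prediction and receptive field calibration, improving generalization across resolutions.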

Abstract

Modern image encoders achieve high generalization by decoupling semantic meaning from resolution, an ability yet to be fully realized in the 3D domain. We investigate the failure of 3D point cloud encoders to achieve similar generalization and find that existing models are highly sensitive to sampling resolution and scale changes, leading to significant performance degradation. This sensitivity is a major bottleneck for real-world deployment in robotics, as it suggests models overfit to specific quantization densities and object scales rather than learning invariant semantic features. To mitigate this dependency, we propose Invaria, a point cloud encoder that achieves scale and density invariance through next-resolution prediction and receptive field calibration. While our objective is not the explicit generation of high-resolution point clouds, we find that this training objective encourages the model to learn robust, structural invariants. The resulting encoder achieves significant performance gains during resolution shifts while maintaining high efficiency through a compact model size and reduced token requirements. Specifically, on ScanNet, Invaria achieves a 56.0% higher mIoU at 3× lower resolution and a 20% improvement when the object scale is reduced by a factor of 3. These gains are achieved with a 45% smaller model size and an average reduction of 40% in input tokens.

2605.15921 2026-05-18 cs.CV

AdaEraser: Training-Free Object Removal via Adaptive Attention Suppression

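Dingming Liu

AI summary: This paper proposes the AdaEraser framework, which removes objects by dynamically modulating attention, addressing the loss of generation quality caused by indiscriminate attention suppression in training-free methods; experiments show strong performance on object removal.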

Comments Accepted by ICML 2026

Abstract

Object removal aims to eliminate specified objects from images while plausibly inpainting the affected regions with background content. Current training-free methods typically block attention to object regions within self-attention layers during the image generation process, leveraging surrounding background information to restore the image. However, indiscriminate suppression of self-attention in the vacated areas can degrade generation quality, as the model must simultaneously reconstruct background content in these regions. To solve this conflict, we propose AdaEraser, an adaptive framework that dynamically modulates attention based on the estimated presence of target object concepts. Through analysis of self-attention map evolution across denoising timesteps before and during removal, we develop a token-wise adaptive attention suppression strategy. This approach enables progressive perception of object removal throughout the denoising process, with the suppression strength in self-attention layers adjusted adaptively. Extensive experiments demonstrate that AdaEraser achieves superior performance in object removal, outperforming even training-based methods.

2605.15916 2026-05-18 cs.LG cs.AI cs.CV

LoCO: Low-rank Compositional Rotation Fine-tuning

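An Nguyen, Jaesik Choi, Anh Tong

AI summary: LoCO is a low-rank compositional rotation fine-tuning method that constructs orthogonal transformations from low-rank skew-symmetric matrices for parameter-efficient fine-tuning; it applies to model adaptation across multiple domains and shows superior or competitive performance compared with existing orthogonal and non-orthogonal methods.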

Comments IJCAI 2026

Abstract

Parameter-efficient fine-tuning (PEFT) has emerged as a critical technique for adapting large-scale foundation models across natural language processing and computer vision. While existing methods such as low-rank adaptations achieve parameter efficiency via low-rank weight updates, they are limited in their ability to preserve the geometric structure of pretrained representations. We introduce Low-rank Compositional Orthogonal fine-tuning (LoCO), a novel PEFT method that constructs orthogonal transformations through low-rank skew-symmetric matrices and compositional rotation chains. We propose an approximation scheme that enables fully parallel computation of compositional rotations, making the approach practical for high-dimensional feature spaces. Our method keeps computational complexity low while maintaining orthogonality with controlled approximation error. We validate LoCO across diverse domains, including diffusion transformer fine-tuning, vision transformer adaptation, and language model adaptation. Our method demonstrates superior or competitive performance compared to both existing orthogonal and non-orthogonal methods.

2605.15908 2026-05-18 cs.CV cs.AI

RaPD: Resolution-Agnostic Pixel Diffusion via Semantics-Enriched Implicit Representations

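Yanhao Ge, Shanyan Guan, Weihao Wang, Ying Tai, Mingyu You

AI summary: RaPD performs resolution-agnostic pixel diffusion in the latent space of continuous neural image fields, using semantic representation guidance and a coordinate-query attention renderer; it addresses the gap between reconstruction and generation and improves generation quality and resolution scalability.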

2605.15906 2026-05-18 cs.CV

A Causally Grounded Taxonomy for Image Degradation Robustness Evaluation

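Stefan Becker, Simon Weiss, Wolfgang Hübner, Michael Arens

AI summary: This paper proposes a causally grounded taxonomy of image degradations that uses a dual-axis abstraction to cover algorithmic corruptions, perceptual distortions, and physical degradations, and provides a unified severity measurement layer to improve comparability across datasets and tasks.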

Abstract

Image degradations can occur during acquisition, processing, and transmission, altering visual appearance and affecting downstream vision tasks. They are studied in several communities, including synthetic corruption benchmarks for robustness evaluation, perceptual image quality assessment, and physically grounded analyses of imaging systems or real camera failures. Although these areas address closely related phenomena, they often use incompatible grouping schemes and backend specific severity definitions, making results difficult to compare across datasets, degradation sources, and tasks. We propose a causally grounded framework for organizing and interpreting image degradations across these settings. Instead of introducing new degradations or redefining existing benchmarks, we provide an interpretive representation and measurement layer that makes implicit assumptions explicit. Each degradation is described along two orthogonal axes: its dominant causal source in the imaging pipeline (environment, sensor/optics, ISP/renderer/codec, or transfer/system), and its resulting perceptual effect. This dual axis abstraction yields a compact taxonomy spanning algorithmic corruptions, perceptual distortions, and physically motivated imaging artifacts. To address inconsistent severity semantics without changing existing implementations, we introduce a lightweight severity measurement layer. For every degradation and each native severity level of a given backend, we quantify degradation strength using full reference image quality metrics: PSNR, SSIM, and LPIPS. This makes severity observable and comparable across sources while preserving native parameterizations. We demonstrate the framework through COCO Degradation, a taxonomy aligned benchmark for evaluating object detector robustness under diverse imaging conditions.
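
To make the idea of a severity measurement layer concrete, the sketch below computes PSNR, one of the three full-reference metrics named above, for two degraded versions of the same reference. It uses the standard PSNR definition on flat lists of 8-bit pixel values; the pixel data and variable names are made up for illustration and do not come from COCO Degradation.

```python
import math

def psnr(reference, degraded, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - d) ** 2 for r, d in zip(reference, degraded)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

reference = [10, 50, 100, 200]
mild = [12, 48, 101, 199]     # e.g. a low native severity level
severe = [40, 20, 140, 150]   # e.g. a high native severity level

mild_db = psnr(reference, mild)
severe_db = psnr(reference, severe)
```

Reporting such a value for every native severity level of every backend puts otherwise incompatible severity scales on a shared, observable axis, with lower PSNR indicating stronger degradation.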

2605.15904 2026-05-18 cs.LG

Context-aware Entity-Relation Extraction for Threat Intelligence Knowledge Graphs

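Inoussa Mouiche, Sherif Saad

AI summary: This paper proposes the CTiKG framework, which combines SecureBERT with a domain ontology to improve the accuracy and robustness of entity-relation extraction from threat intelligence reports; experiments show it outperforms existing methods on NER and RE tasks.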

Comments 16 pages

Abstract

Cybersecurity Knowledge Graphs (CKGs) unify diverse Cyber Threat Intelligence (CTI) sources into structured, queryable formats, offering scalable solutions for automating proactive and real-time security responses. Their increasing adoption has significantly enhanced the workflow and decision-making efficiency of security professionals. However, constructing CKGs requires extracting entity-relation triples from unstructured CTI reports, a task hindered by complex report structure, domain-specific language, and semantic ambiguity. As a result, existing pipeline-based approaches often suffer from error propagation, reducing extraction accuracy and limiting generalizability. This paper introduces the Context-aware Threat Intelligence Knowledge Graph (CTiKG) framework, a pipeline architecture designed to accurately extract and classify threat entities and their relationships from CTI reports. CTiKG incorporates hybrid NLP models that leverage SecureBERT+ contextual embeddings and expert knowledge from a domain ontology to reduce misclassifications and mitigate cascading errors. Experiments on the DNRTI-AUG-STIX2 dataset, which comprises 21 entity types aligned with STIX 2.1, demonstrate significant improvements over state-of-the-art baselines, yielding 3-4% gains in NER and up to 8% in RE performance, based on precision, recall, and F1-score. Additional validation on DNRTI and STUCCO benchmarks confirms the framework's robustness and practical applicability. All datasets, including the curated DNRTI-AUG-STIX2, are released on GitHub to foster reproducibility and further research.

2605.15901 2026-05-18 cs.LG

From Layers to Networks: Comparing Neural Representations via Diffusion Geometry

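Atharva Khandait, Jan E. Gerken

AI summary: Using a diffusion geometry framework, this paper brings multi-view learning tools to neural representations for the first time, proposes multi-scale variants of Centered Kernel Alignment and Distance Correlation, and fuses the Markov matrices of multiple layers to compare similarity between networks, with strong results on language and vision tasks.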

Comments 11 pages + appendices

Abstract

Diffusion geometry is a manifold learning framework that uses random walks defined by Markov transition matrices to characterize the geometry of a dataset at multiple scales. We use diffusion geometry for neural representations, incorporating tools from multi-view learning into this field for the first time. Our key technical observation is that a broad class of similarity measures based on representational similarity matrices (RSMs) admits a closed-form equivalent formulation in terms of row-stochastic Markov matrices, opening the door to manipulations from diffusion geometry. As a first application, we develop multi-scale variants of Centered Kernel Alignment and Distance Correlation, which utilise the $t^{th}$ power of the underlying transition matrix to probe the data geometry at adjustable diffusion scales. Going further, we introduce variants of these measures which fuse the Markov matrices of several layers via alternating diffusion into a single operator that captures the network's joint sample geometry, allowing similarity to be computed across multiple layers and shifting the comparison from layer-to-layer to network-to-network. We perform extensive numerical experiments, evaluating our measures on the Representational Similarity (ReSi) benchmark comprising 14 architectures trained on 7 datasets across three different domains. Our methods achieve SoTA results in accuracy and output correlation for both language and vision tasks across different models. We furthermore show SoTA performance on an additional benchmark evaluating on out-of-distribution data.
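
The two basic ingredients mentioned above, a row-stochastic Markov matrix and its t-th power, can be sketched in a few lines. This is a generic illustration that assumes a nonnegative similarity matrix and simple row normalization; the paper's construction from RSMs and its multi-scale CKA variants are not reproduced here, and the example matrix is made up.

```python
def row_normalize(similarity):
    """Turn a nonnegative similarity matrix into a row-stochastic matrix."""
    return [[v / sum(row) for v in row] for row in similarity]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def diffusion_power(markov, t):
    """t-step transition matrix (t >= 1), i.e. the t-th power of markov."""
    result = markov
    for _ in range(t - 1):
        result = matmul(result, markov)
    return result

# Toy similarity matrix over three samples.
rsm = [[1.0, 0.8, 0.1],
       [0.8, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
P = row_normalize(rsm)
P4 = diffusion_power(P, 4)  # a larger t probes coarser geometry
```

Each row of `P4` is still a probability distribution, so increasing `t` changes the scale at which samples are compared without leaving the space of Markov matrices.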

2605.15894 2026-05-18 cs.CV cs.AI

Uncertainty-Aware Wildfire Smoke Density Classification from Satellite Imagery via CBAM-Augmented EfficientNet with Evidential Deep Learning

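Ranjith Chodavarapu

AI summary: This paper proposes a probabilistic framework that classifies smoke density in satellite imagery using a CBAM-augmented EfficientNet with evidential deep learning and provides decomposed epistemic and aleatoric uncertainty. The model reaches 93.8% weighted test accuracy on 16,298 real satellite patches.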

Abstract

Rapid and accurate wildfire smoke severity assessment from satellite images is essential for emergency response, air quality modeling, and human health risk management. Existing deep learning approaches treat smoke detection as a binary task, producing point estimates without any measure of prediction confidence. We propose a probabilistic framework to categorize a satellite patch into Light, Moderate, and Heavy severity classes and to provide decomposed epistemic and aleatoric uncertainty in a single forward pass. Our architecture uses the backbone of a pre-trained EfficientNet-B3 and a CBAM module with an evidential deep learning head that predicts Dirichlet concentration parameters, directly estimating vacuity (epistemic) and dissonance (aleatoric) without Monte Carlo sampling. Evaluated on 16,298 real satellite patches derived from the Wildfire Detection dataset, our model achieves 93.8% weighted test accuracy (91.1% unweighted) with ECE=0.0274. Selective prediction retaining the most certain 50% of patches achieves 96.7% accuracy. As image quality degrades, uncertainty increases monotonically, and vacuity is a practical scan quality measure. The Moderate class represents transitional smoke conditions that exhibit the highest epistemic uncertainty (mean vacuity = 0.187), confirming the model correctly identifies ambiguous smoke boundary regions. CBAM spatial attention maps localize to structurally distinctive scene regions, and t-SNE demonstrates the clear cluster separation of Light and Heavy smoke.
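
How a Dirichlet output yields an uncertainty estimate without sampling can be sketched as follows. The code uses the usual subjective-logic convention, in which each concentration is evidence plus one and vacuity is the number of classes divided by the total concentration; that convention and the example values are assumptions, since the abstract does not spell out the formulas, and dissonance is omitted.

```python
def dirichlet_summary(alpha):
    """Expected class probabilities and vacuity from Dirichlet concentrations."""
    strength = sum(alpha)                # total concentration
    probs = [a / strength for a in alpha]
    vacuity = len(alpha) / strength      # high when little evidence was collected
    return probs, vacuity

# Classes ordered as Light, Moderate, Heavy (illustrative values only).
# Strong evidence for Heavy: low vacuity.
confident_probs, confident_vacuity = dirichlet_summary([1.0, 1.0, 28.0])

# Almost no evidence for any class: high vacuity.
unsure_probs, unsure_vacuity = dirichlet_summary([1.2, 1.5, 1.3])
```

Both quantities come from a single forward pass, and a threshold on vacuity is one simple way to implement selective prediction of the kind reported above.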

2605.15892 2026-05-18 cs.RO cs.HC

Designing for Robot Wranglers: A Synthesis of Literature and Practice

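David Porfirio, Ian McDermott, Hsin-Mei Chen, Satoru Satake, Takayuki Kanda, Thomas D. LaToza

AI summary: Through a literature review and reflection on practice, this paper characterizes the needs of the robot wrangler role and proposes design recommendations for supporting wranglers as individuals and within the wider service ecology.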

Comments Accepted for publication in the Proceedings of ACM Designing Interactive Systems (2026)

Abstract

Robots are increasingly present in human spaces, such as for conducting deliveries in hospitals, interacting with visitors at museums, and stocking items in warehouses. To ensure the seamless integration of robots into these spaces, a new role in human-robot interaction is emerging: the robot wrangler, an individual who is responsible for setting up, overseeing, and troubleshooting the robot. To understand the needs of this stakeholder, we conducted a scoping review that uncovered a typology of robot wrangling across the research literature, and discovered that wrangling is an umbrella term that collapses a highly complex and heterogeneous space of activities, often rendering this labor difficult to characterize and support. To further clarify and understand robot wrangling, we then reflected on our own firsthand and imagined experiences as robot wranglers within our respective domains. Guided by the scoping review and our reflections, we devise a series of design implications for supporting wranglers directly as individuals and as members of a wider service ecology.
