arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.23840 2026-05-25 cs.CV

MuellerPT: Decomposition Driven Pretraining for Dense Learning in Mueller Polarimetry

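Adam Tlemsani, Yingdian Li, Maxime Giot, Naim Slim, Christopher J. Peters, Abhijeet Ghosh, Daniel S. Elson

AI summary: This work proposes MuellerPT, a physics-guided pre-training method that addresses the difficulty of supervised learning for Mueller polarimetric imaging in biomedical tissue analysis, which is caused by scarce annotations and domain shift. By predicting Lu-Chipman decomposition maps from each pixel's 4x4 Mueller matrix, the method learns transferable dense representations and yields clear gains on few-shot segmentation and classification. Experiments show that MuellerPT outperforms models without pre-training in label efficiency and cross-specimen transfer, pointing toward label-efficient Mueller polarimetric imaging.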

Comments
Accepted to 29th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2026)
Abstract

Mueller matrix imaging provides rich, physically meaningful contrast for biomedical tissue analysis, but supervised learning is hindered by scarce dense annotations and strong domain shifts across specimens and acquisition settings. We introduce MuellerPT, a physics guided pre-training approach that learns transferable dense representations by predicting Lu-Chipman decomposition maps from per-pixel 4x4 Mueller matrices. To scale pre-training, we collected a new large Multispectral Animal Polarimetric Organ dataset (MAP-Org). The pre-trained encoder is adapted with a segmentation head for grey vs. white matter segmentation in lamb brain. A classification head is used for colorectal cancer vs. non-cancer classification. Both segmentation and classification are evaluated across few-shot learning scenarios. In segmentation, MuellerPT improves label efficiency and cross specimen transfer compared to models without pre-training, achieving an absolute DICE gain of over 20% compared to the baseline trained from scratch when using 5% of the training data. In classification, MuellerPT also enhances label efficiency, improving overall accuracy by 8% compared to the baseline when using 1% of the training data. We demonstrate MuellerPT's robustness to domain shift with a qualitative evaluation of its predicted Lu-Chipman maps on an ex vivo human oesophagus sample. These results suggest that predicting Lu-Chipman decomposition is an effective and practical pretext task for robust biomedical inference from Mueller polarimetry and can pave the way for future work on label efficient Mueller imaging.
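The segmentation result above is stated as an absolute Dice gain. As a reminder of what that metric measures, here is a minimal sketch of the Dice coefficient on binary masks. The toy masks and the two "predictions" are invented for illustration and are not data from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy 2x4 masks flattened row by row (1 = white matter, 0 = grey matter).
target = [1, 1, 0, 0, 1, 1, 0, 0]
scratch = [1, 0, 0, 0, 1, 0, 1, 0]     # hypothetical from-scratch prediction
pretrained = [1, 1, 0, 0, 1, 1, 1, 0]  # hypothetical pre-trained prediction

# "Absolute gain" is a plain difference of the two scores, not a ratio.
gain = dice_score(pretrained, target) - dice_score(scratch, target)
```

In this toy case the scores are 4/7 and 8/9, so the absolute gain is about 0.32.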

2605.23832 2026-05-25 cs.RO

SFG-ROS: A Resource-Aware Framework for Dense Multi-Agent Perception

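Constantin Blessing, Elias Geiger, Jakob Häringer, Dennis Grewe, Markus Enzweiler

AI summary: This paper presents SFG-ROS, a resource-aware multi-agent software framework that targets the network congestion, naming conflicts, and excessive computational overhead that heterogeneous robot teams encounter in dense perception tasks. Through schema-based traffic routing, an on-demand decoding pipeline, and hardware-agnostic containerization, the framework improves scalability and resource utilization. Experiments show that, compared with standard ROS 2, SFG-ROS substantially reduces network traffic and CPU overhead while keeping communication latency low.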

Abstract

Deploying heterogeneous multi-agent robot fleets for collaborative perception requires robust data exchange and scalable software architectures. However, standard ROS 2 implementations often suffer from network saturation, namespace collisions, and severe computational overhead when distributing dense sensor streams across devices. To address these bottlenecks, we present SFG-ROS, a resource-aware multi-agent software framework designed for dynamic fleet deployments. SFG-ROS addresses these challenges through three primary contributions. First, schema-driven traffic routing isolates high-frequency intra-agent traffic from the global network using a programmatic fully qualified name schema and targeted Fast DDS routing. Second, an on-demand centralized decoding pipeline automatically offloads high-bandwidth sensor data decompression, eliminating redundant processing across local consumer nodes. Finally, a hardware-agnostic container pipeline dynamically adapts to heterogeneous accelerators, seamlessly bridging development environments with zero-touch, field-ready execution. We evaluate the framework using a fleet of wheeled and legged robots equipped with LiDAR and stereo depth cameras. Experimental results show SFG-ROS bounds network traffic to $\mathcal{O}(1)$ and, by replacing redundant decompression with lightweight IPC, reduces the per-subscriber CPU scaling penalty by 72.3% versus standard ROS 2, all while maintaining low latency. Finally, we publish SFG-ROS (https://iis-esslingen.github.io/sfg-ros) under a permissive license.

2605.23826 2026-05-25 cs.CV cs.CL

Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval

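Michal Shlapentokh-Rothman, Prachi Garg, Yu-Xiong Wang, Derek Hoiem

AI summary: This paper studies keyframe retrieval from long videos in support of question answering and proposes ToolMerge, a retrieval method built on decomposing queries into tool calls and merging the results. A large language model breaks the query into several tool calls, and the per-tool rankings are merged with Boolean operators to locate relevant frames more precisely. On the authors' own M2M benchmark, ToolMerge is competitive across several tasks and, on caption retrieval in particular, beats other methods by 5%.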

Abstract

Keyframe selection is a direct way to provide verifiable visual evidence for long-video question answering (QA). Queries differ in what they require, and finding the right frames depends on knowing what to look for. Existing keyframe selectors either score every frame against a single query, or decompose the query into a fixed schema evaluated by a single visual tool. We propose ToolMerge, a keyframe retrieval method based on decomposition and merging: a large language model (LLM) based planner decomposes the query into tool calls and specifies how their per-tool rankings are merged using Boolean operators. To evaluate retrieval directly, we construct Molmo-2 Moments (M2M), a benchmark in which every question is anchored to a specific time interval by construction. Across QA, question retrieval, and caption retrieval, ToolMerge is competitive with prior keyframe selectors, most notably on caption retrieval, where it outperforms other methods by 5%. Code and data can be found at https://github.com/michalsr/ToolMerge.
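To make the idea of merging per-tool rankings with Boolean operators concrete, here is a minimal sketch. It rests on assumptions the abstract does not confirm: each tool returns a per-frame score in [0, 1], AND is taken as the minimum, OR as the maximum, and NOT as the complement (a common fuzzy-logic convention). The tool-call names and scores are hypothetical.

```python
def merge(plan, scores):
    """Evaluate a nested plan such as ("AND", "a", ("OR", "b", "c")).

    `scores` maps a tool-call name to a list of per-frame scores in [0, 1].
    """
    if isinstance(plan, str):
        return scores[plan]
    op, *args = plan
    merged = [merge(arg, scores) for arg in args]
    if op == "AND":  # a frame must satisfy every sub-query
        return [min(vals) for vals in zip(*merged)]
    if op == "OR":   # a frame may satisfy any sub-query
        return [max(vals) for vals in zip(*merged)]
    if op == "NOT":
        (only,) = merged
        return [1.0 - v for v in only]
    raise ValueError(f"unknown operator: {op}")


def top_k(frame_scores, k):
    """Indices of the k highest-scoring frames, best first (ties: earlier frame)."""
    order = sorted(range(len(frame_scores)), key=lambda i: (-frame_scores[i], i))
    return order[:k]


# Hypothetical tool outputs for a four-frame video.
scores = {
    "detect(dog)":      [0.9, 0.8, 0.1, 0.7],
    "ocr(EXIT)":        [0.2, 0.9, 0.9, 0.1],
    "caption(kitchen)": [0.1, 0.3, 0.2, 0.8],
}
plan = ("AND", "detect(dog)", ("OR", "ocr(EXIT)", "caption(kitchen)"))
keyframes = top_k(merge(plan, scores), k=2)
```

With these toy scores the merged list is [0.2, 0.8, 0.1, 0.7], so frames 1 and 3 are selected.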

2605.23825 2026-05-25 cs.LG cs.AI

It's the humans, not the data: Geopolitical bias in LLMs originates in post-training, amplified by the language of the prompt

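Stuart Bladon, Brinnae Bent

AI summary: This study finds that geopolitical bias in language models arises mainly in post-training rather than in pre-training. Comparing base and chat models from seven labs, it shows that post-trained models tend to lean toward the country or region of their developer, and that the bias differs with the language of the prompt. The study stresses that models' representations of nations, cultures, and political views are not simply inherited from training data but are actively shaped during alignment, which underlines the importance of transparency and oversight of post-training.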

Comments
12 pages, 6 figures, 2 tables, 3 appendices. Code and scenario bank: https://github.com/recozers/LLM-Bias
Abstract

It has generally been assumed that geopolitical bias in language models originates from the training data used during the pre-training phase. We tested seven open-weight LLM pairs consisting of the base model (pre-training only) and the chat model (pre-training and post-training) from seven labs on a paired-scenario forced-choice probe over 28 country pairs in English, French, and Chinese, and found that geopolitical bias originates in post-training rather than in pre-training. Across seven AI labs, six showed shifts in the direction associated with the country or region of the model developer after post-training. This shift is strongest in Alibaba's Qwen 2.5: while the base is neutral on China-favourability (-0.15 log-odds, p=0.15), the post-trained chat variant is at +2.91 (p<10^-4), an 18x shift in odds. We also observe shifts in biases toward other countries across all models. Additionally, the magnitude of this shift depends on the language used to prompt the model: the French-made Mistral becomes pro-France only under French prompting (FR-EN shift +1.91, p<10^-4). These findings suggest that geopolitical preferences in language models are not simply inherited from large-scale internet data but are actively shaped during post-training, highlighting the need for greater transparency, auditing, and oversight of alignment processes that influence how models represent nations, cultures, and political perspectives.
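The Qwen 2.5 figures above are given in log-odds, which can be converted to odds and probabilities with a few lines of arithmetic. The sketch below uses only the two numbers quoted in the abstract. Which baseline the "18x" refers to is an assumption here: exp(2.91) is about 18.4 (odds relative to even odds), while the ratio between the chat and base models' odds is exp(2.91 + 0.15), about 21.3.

```python
import math


def odds(log_odds):
    """Convert log-odds to odds."""
    return math.exp(log_odds)


def probability(log_odds):
    """Convert log-odds to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))


base, chat = -0.15, 2.91  # log-odds quoted in the abstract

odds_vs_even = odds(chat)         # chat-model odds relative to 1:1
odds_vs_base = odds(chat - base)  # odds ratio between chat and base model
p_base, p_chat = probability(base), probability(chat)
```

Read as probabilities, the base model picks the China-favourable option a little under half the time and the chat model roughly 95% of the time.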

2605.23821 2026-05-25 cs.CL cs.LG

Hierarchical Concept Geometry in Language Models Emerges from Word Co-occurrence

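Andres Nava, Matthieu Wyart

AI summary: This paper studies how hypernymy (the "is-a" relation) comes to be encoded geometrically in language models through word co-occurrence. Starting from the empirical observation that co-occurrence frequency between WordNet words tracks their hierarchical relation, the authors analyze the spectral structure of the word-embedding Gram matrix and prove that the leading eigenvectors separate conceptual branches progressively from coarse to fine, giving a hierarchical splitting geometry consistent with the tree. Experiments show the effect not only in word2vec but also clearly in Gemma 2B, indicating that hierarchical concept geometry can arise naturally from the spectral structure of pairwise word statistics without a hierarchy-specific functional mechanism.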

Comments
34 pages, 12 figures, including appendices
Abstract

We propose a distributional theory of how hypernymy (the "is-a" relation between general and specific concepts) is encoded geometrically in language representations. Starting from the empirically verified assumption that words closer on the WordNet hypernym graph co-occur more often, we characterize theoretically the spectrum of the Gram matrix of the resulting word2vec embeddings. Under mild positivity and decay conditions on the co-occurrence kernel, we prove that the leading eigenvectors first separate broad taxonomic branches and then progressively finer sub-branches, producing a hierarchical splitting geometry with a coarse-to-fine spectral organization that mirrors the tree. We confirm these predictions in word2vec embeddings across many sampled WordNet subtrees, and show that the same signature extends strikingly well to Gemma 2B unembeddings. Our results indicate that hierarchical concept geometry in LLMs need not reflect a hierarchy-specific functional mechanism, but emerges from the spectral structure of pairwise word statistics.
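The coarse-to-fine ordering described above can be checked by hand on a toy case. The sketch below is an illustration under strong assumptions and not the paper's construction: four leaf words on a two-level binary tree, and a Gram matrix taken to be a positive kernel that decays geometrically with tree distance. It verifies candidate eigenvectors directly rather than calling an eigensolver.

```python
def matvec(K, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(k * x for k, x in zip(row, v)) for row in K]


def eigenvalue_if_eigenvector(K, v, tol=1e-12):
    """Return lam if K v equals lam * v up to tol, otherwise None."""
    Kv = matvec(K, v)
    lam = sum(a * b for a, b in zip(Kv, v)) / sum(b * b for b in v)
    ok = all(abs(a - lam * b) <= tol for a, b in zip(Kv, v))
    return lam if ok else None


# Tree distances between leaves: (cat, dog) share a parent, (oak, pine) share
# another, and the two parents share the root. Words are hypothetical.
dist = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]
decay = 0.5  # assumed kernel: K[i][j] = decay ** tree_distance
K = [[decay ** d for d in row] for row in dist]

modes = {
    "all words together": [1, 1, 1, 1],
    "animals vs plants":  [1, 1, -1, -1],
    "cat vs dog":         [1, -1, 0, 0],
    "oak vs pine":        [0, 0, 1, -1],
}
spectrum = {name: eigenvalue_if_eigenvector(K, v) for name, v in modes.items()}
```

Here the uniform mode has the largest eigenvalue (1.375), the mode splitting the two broad branches comes next (1.125), and the two within-branch splits come last (0.75 each), which is the coarse-to-fine ordering in miniature.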

2605.23819 2026-05-25 cs.CV cs.AI

Not Too Generative, Not Too Discriminative: The Human Alignment Sweet Spot

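Jorge Chang Ortega, Bastien Le Lan, Thomas Serre, Victor Boutin

AI summary: This paper addresses a central question in computational vision: whether human visual representations are better explained by discriminative or generative learning. Using Joint Energy-Based Models (JEMs) to interpolate continuously between discriminative and generative training objectives within a fixed architecture, the study isolates the effect of the learning objective and evaluates it on six human-alignment benchmarks covering perceptual similarity, gloss perception, human response uncertainty, and more. Human alignment peaks at intermediate points between the generative and discriminative objectives rather than at either extreme, which suggests that alignment comes from balancing the two objectives rather than choosing one.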

Abstract

A central question in computational vision is whether human-like visual representations are better explained by discriminative or generative learning. Existing comparisons, however, often confound the learning objective with architecture, scale, and training data, leaving open whether the objective itself drives alignment. We address this confound using Joint Energy-Based Models (JEMs), which interpolate continuously between discriminative and generative training within a fixed architecture. By varying a single mixing coefficient, we isolate the effect of the learning objective and evaluate the resulting models across six human-alignment benchmarks spanning perceptual similarity, gloss perception, human response uncertainty, robustness, shape-texture cue conflict, and diagnostic feature attribution. Across this diverse suite, human alignment is consistently maximized at intermediate points of the generative-discriminative continuum, rather than at either endpoint. Hybrid JEMs combine the categorical structure induced by discriminative learning with the sensitivity to input structure induced by generative learning, yielding more human-like behavior across multiple levels of vision. These results suggest that the generative-discriminative dichotomy is the wrong axis for understanding human-aligned vision: alignment emerges not from choosing one objective over the other, but from balancing both.
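A short sketch may help show how one network can serve both objectives. In the standard JEM formulation, the same logits f(x) give class probabilities through a softmax and an unnormalized input density through their log-sum-exp. The convex combination in `hybrid_loss` is an assumed form of the "single mixing coefficient", and `log_px` is a placeholder value: real JEM training cannot evaluate log p(x) directly and approximates its gradient by sampling.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def class_log_probs(logits):
    """Discriminative reading: log p(y | x) = f(x)[y] - logsumexp(f(x))."""
    lse = logsumexp(logits)
    return [z - lse for z in logits]


def energy(logits):
    """Generative reading: E(x) = -logsumexp(f(x)); p(x) is proportional to exp(-E(x))."""
    return -logsumexp(logits)


def hybrid_loss(logits, label, log_px_estimate, alpha):
    """Assumed mixing: (1 - alpha) * classification NLL + alpha * generative NLL."""
    classification_nll = -class_log_probs(logits)[label]
    generative_nll = -log_px_estimate
    return (1.0 - alpha) * classification_nll + alpha * generative_nll


logits = [2.0, 0.0]  # hypothetical network outputs f(x) for two classes
log_px = -3.0        # hypothetical stand-in for log p(x)
losses = {alpha: hybrid_loss(logits, 0, log_px, alpha) for alpha in (0.0, 0.5, 1.0)}
```

Setting alpha to 0 recovers a purely discriminative classifier and alpha to 1 a purely generative model; intermediate values correspond to the hybrid regime the abstract reports as best aligned with humans.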

2605.23797 2026-05-25 cs.LG cs.CV

Debiased Negative Mining Improves Out-of-distribution Detection with Pre-trained Vision-Language Models

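Bo Peng, Jie Lu, Guangquan Zhang, Zhen Fang

AI summary: This paper studies out-of-distribution (OOD) detection with pre-trained vision-language models (VLMs), where the goal is to identify inputs from unknown classes. Existing methods mostly rely on heuristic rules to mine negative labels from unlabeled corpora and suffer from substantial sampling bias in those negatives. The authors propose a debiased negative mining method that corrects the bias by indirectly estimating the distribution of negative labels, and they recast it as Monte Carlo sampling based on ID labels and the unlabeled corpus. Experiments show new state-of-the-art performance across a range of OOD detection settings.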

Comments
KDD 2026
Abstract

Aiming at identifying unexpected inputs from unknown classes, out-of-distribution (OOD) detection has emerged as a pivotal approach to enhancing the reliability of machine learning models. This paper focuses on the burgeoning paradigm of post-hoc OOD detection with pre-trained vision-language models (VLMs), where a popular pipeline detects OOD inputs by comparing their affinities to in-distribution (ID) labels and to negative labels, i.e., labels semantically different from the ID labels. Because target OOD labels are unavailable, existing works predominantly rely on heuristic rules to mine negative labels from unlabeled wild corpus data. Despite the empirical success, we argue that the power of VLM-based OOD detection has yet to be fully unleashed, since the notorious false negative problem is far from addressed in the literature. With this motivation, we address the challenge of mining true negative labels for OOD scoring. To this end, we develop a theoretical framework for correcting the sampling bias of negative labels by indirectly approximating the distribution of negative labels. Perhaps surprisingly, we show that debiased negative mining can be naturally converted into Monte Carlo sampling based on ID labels and the unlabeled wild corpus data. Extensive experiments demonstrate that our method establishes a new state of the art in a variety of OOD detection setups. Code is publicly available at https://github.com/60pen9/Debiased-Negative-Mining-Improves-OOD-Detection-with-Pre-trained-VLMs.
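The negative-label pipeline described above can be sketched as follows. This shows one common form of such a score (the share of softmax mass that falls on ID labels), not the debiased estimator proposed in the paper. The similarity values and the temperature are hypothetical.

```python
import math


def ood_score(id_sims, neg_sims, temperature=0.01):
    """Share of softmax mass on ID labels; higher means more in-distribution."""
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(id_sims + neg_sims)
    id_mass = sum(math.exp((s - top) / temperature) for s in id_sims)
    neg_mass = sum(math.exp((s - top) / temperature) for s in neg_sims)
    return id_mass / (id_mass + neg_mass)


# Hypothetical image-text cosine similarities.
in_dist = ood_score(id_sims=[0.31, 0.22], neg_sims=[0.20, 0.18])
out_dist = ood_score(id_sims=[0.20, 0.19], neg_sims=[0.30, 0.21])

# A false negative: a mined "negative" label that actually matches an ID class
# as well as the true ID label does, pulling an ID input toward "OOD".
contaminated = ood_score(id_sims=[0.31], neg_sims=[0.31, 0.18])
```

The third case illustrates the false negative problem the abstract refers to: a single mislabeled negative drags a clearly in-distribution input down to a score of about 0.5.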

2605.23790 2026-05-25 cs.CV

Exploring deep learning for Event-Based Saliency Prediction with a Transformer-based model

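Romaric Mazna, Jean Martinet, Sai Deepesh Pokala

AI summary: This paper studies saliency prediction from event-camera data and proposes SEST, a Transformer-based model that predicts salient regions from events. To overcome the lack of large annotated datasets and strong baselines for event data, the authors introduce an event-native pre-training strategy and synthetic supervision, and build two new benchmark datasets. Experiments show that SEST outperforms existing event-based saliency methods and transfers well to real event data; the authors describe it as the first application of deep learning to event-based saliency prediction.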

Abstract

Saliency prediction has been extensively studied in RGB images and videos as a computational model of human visual attention. In contrast, predicting saliency from event-based data remains largely unexplored, despite the biological inspiration and favorable sensing properties of event cameras. Two obstacles have held this direction back: the absence of large-scale event saliency datasets, and the lack of a strong baseline. In this paper, we introduce SEST (Swin Event-based Saliency Transformer), a transformer-based model for saliency prediction from event data, bridging the data scarcity barrier through event-native pretraining and synthetic supervision. SEST leverages a self-supervised pretrained event-based Swin Transformer backbone combined with a lightweight CNN decoder to produce dynamic saliency maps. To address the scarcity of annotated event-based saliency data, we introduce two new benchmark datasets, N-DHF1K and N-UCF Sports, generated from large-scale RGB saliency benchmarks. Experimental results show that SEST clearly outperforms existing event-based saliency methods and narrows the performance gap with state-of-the-art RGB models. Zero-shot evaluation on a real event camera dataset further demonstrates that our model trained on synthetic data remains transferable on real event streams. To the best of our knowledge, this work is the first to apply deep learning to event-based saliency prediction, opening a new research direction at the intersection of event-based vision and neuromorphic visual attention.

2605.23780 2026-05-25 cs.AI

Beyond Binary Edits: Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment

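Haoyuan Wang, Xiaohao Liu, Jiajie Su, Jianmao Xiao, Chaochao Chen

AI summary: This paper studies robust intrinsic knowledge editing in multimodal large language models, aiming to update knowledge efficiently without harming existing capabilities. Because existing methods propagate edits poorly across semantically equivalent visual and linguistic variants, the authors propose an adversarial subspace alignment method (ASAM), which introduces Latent Adversarial Robustification (LAR) and Rank-Constrained Subspace Learning (RCSL) to strengthen generalization and edit robustness in high-dimensional multimodal spaces. Experiments show strong performance on knowledge editing tasks.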

Abstract

Multimodal large language models (MLLMs) need efficient mechanisms to update knowledge without degrading existing capabilities. While intrinsic multimodal knowledge editing achieves strong reliability and locality, it often exhibits limited generality, failing to propagate edits across semantically equivalent visual and linguistic variations. This issue arises from the lack of explicit semantic supervision, rigid editing scopes, and biased anchoring to individual samples in high-dimensional multimodal spaces. We address robust intrinsic multimodal knowledge editing by explicitly targeting generalization. We formalize robustness through knowledge units that group semantically equivalent multimodal inputs and define generality as consistent predictions within each unit. To expose fragile semantic regions, we introduce Latent Adversarial Robustification (LAR), which generates adversarial yet semantically coherent variants in the joint latent space. We further propose Rank-Constrained Subspace Learning (RCSL), enforcing low-rank alignment of adversarial representations at the edit layer via a singular value-based objective. Extensive empirical analysis demonstrates the effectiveness of ASAM, the proposed adversarial subspace alignment method.

2605.23778 2026-05-25 physics.ao-ph cs.LG physics.comp-ph

The physics of AI weather models

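George Craig, Tobias Selz, Matthias Beylich, Kirsten I. Tempest

AI summary: This paper asks whether AI weather models are implicitly solving physical equations, even if these differ from the equations used by conventional numerical weather prediction models. By computing correlations of forecast skill and Centered Kernel Alignment, the study finds that different AI weather models represent the atmosphere similarly despite differences in architecture and capacity. The authors propose that these models simulate the atmosphere through a particle description, in which the latent variables at each mesh point correspond to a particle's position in a high-dimensional latent space, and hypothesize that particle motion follows a gradient flow of a free energy functional in that space. Analysis of GraphCast and Aurora is consistent with this hypothesis.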

Abstract

Could it be that AI weather models are solving physical equations, although they may not be the equations used by conventional numerical weather prediction (NWP) models? We compute correlations of forecast skill and Centered Kernel Alignment, providing evidence that different AI weather models represent the atmosphere in similar ways, despite differences in architecture and capacity. We argue that the architecture and training of the AI models constrain the form of the physical laws that they might simulate. In particular, we propose that the models implement a particle description of the atmosphere, where the latent variables at each mesh point correspond to the position of a particle in the high-dimensional latent space. We hypothesize that the movement of the particles follows a gradient flow in the latent space towards a minimum of a learned free energy functional. Analysis of the GraphCast and Aurora models shows that they make changes on large spatial scales in the early processor layers and move to smaller scales with increasing layer depth, consistent with the gradient flow hypothesis.
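Centered Kernel Alignment compares two sets of representations of the same inputs. Below is a minimal sketch of the linear variant, assuming each representation is a samples-by-features matrix; how the authors apply it to weather-model latents (which layers, which samples) is not specified in the abstract. The matrices are toy data.

```python
def center_columns(X):
    """Subtract each feature's mean across samples."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]


def cross_gram_sq_norm(X, Y):
    """Squared Frobenius norm of X^T Y for samples-by-features matrices."""
    total = 0.0
    for xcol in zip(*X):
        for ycol in zip(*Y):
            total += sum(a * b for a, b in zip(xcol, ycol)) ** 2
    return total


def linear_cka(X, Y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_gram_sq_norm(Xc, Yc)
    denominator = (cross_gram_sq_norm(Xc, Xc) * cross_gram_sq_norm(Yc, Yc)) ** 0.5
    return numerator / denominator


# Four samples, two features each.
X = [[1, 0], [0, 1], [-1, 0], [0, -1]]
Y = [[0, 3], [-3, 0], [0, -3], [3, 0]]    # X rotated by 90 degrees and scaled by 3
Z = [[1, 1], [1, 1], [-1, -1], [-1, -1]]  # a different, lower-rank representation

same_up_to_rotation = linear_cka(X, Y)
partly_similar = linear_cka(X, Z)
```

Linear CKA is invariant to rotations and uniform rescaling of the feature space, which is what makes it usable across models with different architectures: here X and Y score 1, while X and Z score about 0.71.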

2605.23777 2026-05-25 cs.CV

Machine learning applied to emerald gemstone grading: framework proposal and creation of a public dataset

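FB Pena, D Crabi, Sandro C Izidoro, Érick O Rodrigues, G Bernardes

AI summary: This paper proposes a machine-learning framework for grading emerald gemstones and creates a public dataset. The framework automates the grading process from image acquisition to final classification, avoiding the subjectivity of manual grading. The study is presented as the first to combine machine learning with image processing for emerald grading; it reaches 98% classification accuracy and releases a dataset of 192 emerald images with their pre-processed features.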

Journal ref
Pattern Analysis and Applications 2022
Abstract

The grading of gemstones is currently a manual procedure performed by gemologists. A popular approach uses reference stones: specialists visually inspect them and decide which of the available reference stones is most similar to the stone under inspection. This procedure is highly subjective, as different specialists may arrive at different grades. This work proposes a complete framework that runs from image acquisition to the final stone categorization. The proposal automates the entire process except for placing the stone in the purpose-built image acquisition chamber, and it removes the subjective decisions made by specialists. This is the first work to propose a machine learning approach coupled with image processing techniques for emerald grading. The proposed framework achieves 98% accuracy (correctly categorized stones), outperforming a deep learning approach. Furthermore, we create and publish the dataset used, which contains 192 images of emerald stones along with their extracted and pre-processed features.

2605.23775 2026-05-25 cs.CV

A Novel Approach for the Counting of Wood Logs Using cGANs and Image Processing Techniques

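João VC Mazzochin, Giovani Bernardes Vitor, Gustavo Tiecker, Elioenai MF Diniz, Gilson A Oliveira, Marcelo Trentin, Érick O Rodrigues

AI summary: This paper proposes a new wood log counting method based on conditional generative adversarial networks (cGANs) and image processing techniques. Image processing handles noise and intersecting logs, and a connected components algorithm performs the counting. The study also releases a database of 466 images containing about 13,048 eucalyptus logs. Experiments report a pixel-level accuracy of 96.4% and a log-level accuracy of 92.3%, and the method is fast enough for practical settings such as forestry management and resource optimization.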

Journal ref
Forests 2025
Abstract

This study tackles the challenge of precise wood log counting, where applications of the proposed methodology can span from automated approaches for materials management, surveillance, and safety science to wood traffic monitoring, wood volume estimation, and others. We introduce an approach leveraging Conditional Generative Adversarial Networks (cGANs) for eucalyptus log segmentation in images, incorporating specialized image processing techniques to handle noise and intersections, coupled with the Connected Components Algorithm for efficient counting. To support this research, we created and made publicly available a comprehensive database of 466 images containing approximately 13,048 eucalyptus logs, which served for both training and validation purposes. Our method demonstrated robust performance, achieving an average Accuracy_pixel of 96.4% and Accuracy_logs of 92.3%, with additional measures such as F1 scores ranging from 0.879 to 0.933 and IoU values between 0.784 and 0.875, further validating its effectiveness. The implementation proves to be efficient with an average processing time of 0.713s per image on an NVIDIA T4 GPU, making it suitable for realtime applications. The practical implications of this method are significant for operational forestry, enabling more accurate inventory management, reducing human errors in manual counting, and optimizing resource allocation. Furthermore, the segmentation capabilities of the model provide a foundation for advanced applications such as eucalyptus stack volume estimation, contributing to a more comprehensive and refined analysis of forestry operations. The methodology's success in handling complex scenarios, including intersecting logs and varying environmental conditions, positions it as a valuable tool for practical applications across related industrial sectors.
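The counting step named above, connected components on a segmentation mask, can be sketched in a few lines. This is a generic breadth-first labeling of a binary mask and not the authors' implementation. The 4-connectivity choice, the `min_area` noise filter, and the toy mask are assumptions for illustration.

```python
from collections import deque


def count_components(mask, min_area=1):
    """Count 4-connected foreground blobs with at least `min_area` pixels."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # Flood-fill one blob, tracking its area in pixels.
            seen[r][c] = True
            queue, area = deque([(r, c)]), 0
            while queue:
                y, x = queue.popleft()
                area += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if area >= min_area:
                count += 1
    return count


# Toy segmentation output: 1 = log cross-section, 0 = background.
mask = [
    [1, 1, 0, 0, 0, 1],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [1, 0, 0, 0, 0, 0],
]
all_blobs = count_components(mask)
logs_only = count_components(mask, min_area=2)  # drop single-pixel noise
```

Touching logs would merge into one blob under this scheme, which is presumably why the abstract pairs the counting step with dedicated handling of intersections.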

2605.23772 2026-05-25 cs.AI cs.LO cs.PL cs.SE

Agentic Proving for Program Verification

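Alessandro Sosso, Akhil Arora, Bas Spitters

AI summary: This study evaluates agentic theorem-proving systems on program verification by testing Claude Code on the CLEVER benchmark, and finds high success rates for generating specifications, verifying implementations, and end-to-end program generation and verification. It also identifies a gap between current program verification benchmarks and the capabilities of modern agentic provers, and calls for more rigorous and robust evaluation, in particular alternatives to isomorphism-based scoring of specifications. The results suggest that tight compiler-in-the-loop agentic paradigms are currently among the most effective approaches to program verification.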

Abstract

Agentic systems have recently emerged as state-of-the-art approaches for automated theorem proving in formal mathematics. To assess how far these capabilities extend to program verification, we evaluate Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation. Our results show that Claude generates arguably valid specifications for 98.8% of problems (with 81.3% also accepted by CLEVER's isomorphism-based scoring on the correct portion of the benchmark), certifies implementations against correct ground-truth specifications for 87.5% of problems, and reaches a 98.1% success rate on the end-to-end program generation and verification pipeline over entries with self-consistent premises. Across all stages, Claude further provides high-quality feedback on its own attempts (as confirmed under manual review), identifying underlying causes of failure and lingering bugs in the dataset. These findings highlight a growing mismatch between the difficulty of existing program verification benchmarks and the capabilities of modern agentic provers, and point to the need for more rigorous, bug-resilient evaluation methodologies, and in particular for alternatives to isomorphism-based scoring of generated specifications. More broadly, our results provide empirical evidence that tight compiler-in-the-loop agentic paradigms are currently the most effective approach for foundational program verification.

2605.23771 2026-05-25 cs.CV cs.AI cs.MA

PhotoFlow: Agentic 3D Virtual Photography Missions

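Jiarui Guo, Haojia Wei, Yiming Zhang, Yifei Liu, Yuning Gong, Hongjie Zhang, Xue Yang, Zhihang Zhong

AI summary: PhotoFlow is an agentic system for virtual photography that, with no preset camera parameters or reference image, produces high-quality photographs matching a language instruction inside a 3D scene. It has three modules: the Director proposes diverse camera candidates, the Reviewer performs visual evaluation and selection, and the Reflector refines the search strategy from failures. The study also introduces VPhotoBench, a benchmark of Blender scenes and language-conditioned photography missions. Experiments show that PhotoFlow performs well under a multi-round rendering budget; the authors describe it as the first executable agent system for language-conditioned virtual photography in arbitrary Blender scenes.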

Abstract

Virtual photography asks an agent to enter a prepared 3D scene with no preselected camera pose or reference image, infer a suitable shot from scene information and a language intent, choose executable camera parameters, and render the final photograph. Recent progress in vision-language models makes this kind of spatial agent increasingly plausible, but the task stresses two capabilities that remain hard to evaluate together: complex 3D spatial understanding and abstract aesthetic judgment. We introduce PhotoFlow, a Director-Reviewer-Reflector agent for closed-loop camera search. The Director builds a soft photographic blueprint and proposes diverse candidate cameras; the Reviewer combines rule checks, visual critique, and pairwise incumbent selection; and the Reflector converts failures into region memory, dead-zone suppression, and high-explore relocation. We also introduce VPhotoBench, a benchmark of 47 open-license Blender scenes and 141 language-conditioned photography missions spanning subject placement, relational composition, and atmosphere/style. On held-out experiments, PhotoFlow achieves the strongest external quality-alignment composite and success rate among one-shot prediction, single-chain reflection, anchor-bank selection, and random search under a six-round rendering budget. To our knowledge, this is the first work to make language-conditioned virtual photography in arbitrary Blender scenes an executable agent task, and our results show that an LLM-centered spatial agent can already produce strong photographs in a setting designed to challenge both 3D reasoning and aesthetic choice.

2605.23762 2026-05-25 cs.RO

Direct Dynamic Retargeting for Humanoid Imitation Learning from Videos

面向人形机器人视频模仿学习的直接动态重定向

Constant Roux, Ludovic De Matteïs, Armand Jordana, Valentin Guillet, Nicolas Mansard, Olivier Stasse, Philippe Souères

AI总结 本文研究如何利用单目视频演示进行模仿学习,从而向人形机器人教授复杂技能。为了解决人体与人形机器人之间形态不匹配带来的挑战,作者提出了直接动态重定向(DDR)方法,通过任务空间建模和物理模拟器中基于采样的模型预测控制求解器,直接从视频生成高保真、动态可行的轨迹,避免了中间运动学投影引入的几何偏差。实验表明,DDR在演示跟踪精度上优于现有基线方法,并能加速强化学习训练的收敛。

详情
AI中文摘要

从单目视频演示中进行模仿学习为向人形机器人教授复杂技能提供了一种可扩展的方法。然而,将人体运动迁移到人形机器人上需要克服显著的形态不匹配。标准方法依赖于几何重定向或间接动态重定向流程。我们发现这些中间运动学投影引入了几何偏差,限制了搜索空间并产生了次优的动态行为。在本文中,我们提出了直接动态重定向(DDR),一种新颖的单阶段框架,可直接从专家视频生成高保真、动态可行的轨迹。通过将问题在任务空间中建模,并在物理模拟器中利用基于采样的模型预测控制求解器,DDR 在缓解输入漂移的同时原生优化复杂的接触序列。我们的实验表明,绕过几何偏差使 DDR 在演示跟踪精度上优于最先进的基线方法。此外,我们证实,向强化学习智能体提供此类物理可行的参考可加速训练收敛,并增强敏捷和平衡行为的最终执行。源代码将公开发布。

英文摘要

Imitation Learning from monocular video demonstrations provides a scalable approach for teaching complex skills to humanoid robots. However, translating human motion to humanoids requires overcoming significant morphological mismatches. Standard approaches rely on Geometric Retargeting or Indirect Dynamic Retargeting pipelines. We identify that these intermediate kinematic projections introduce a geometric bias, restricting the search space and yielding suboptimal dynamic behaviors. In this paper, we propose Direct Dynamic Retargeting (DDR), a novel single-stage framework that generates high-fidelity, dynamically feasible trajectories directly from expert videos. By formulating the problem in the task space and leveraging a sampling-based Model Predictive Control solver within a physics simulator, DDR natively optimizes over complex contact sequences while mitigating input drift. Our experiments demonstrate that bypassing the geometric bias allows DDR to outperform state-of-the-art baselines in demonstration tracking accuracy. Furthermore, we establish that providing such physically viable references to RL agents accelerates training convergence and enhances the final execution of agile and balancing behaviors. Source code will be made publicly available.

2605.23754 2026-05-25 cs.LG

LLM-driven design of physics-constrained constitutive models: two agents are better than one

LLM驱动的物理约束本构模型设计:两个智能体胜过一个

Marius Tacke, Matthias Busch, Kian Abdolazizi, Jonas Eichinger, Kevin Linka, Roland Aydin, Christian Cyron

AI总结 本文提出了一种基于大语言模型(LLM)的多智能体方法,用于生成符合物理规律的本构模型。该方法引入了两个智能体:Creator 负责根据数据生成模型,Inspector 负责检查模型是否满足九项物理约束,若不满足则返回修改。实验表明,该方法显著提高了生成模型的物理正确性,同时保持了高精度和良好的泛化能力,为自动化、物理感知的模型发现提供了可信的解决方案。

详情
AI中文摘要

传统上,开发描述材料在载荷下变形方式的本构模型需要连续介质力学、机器学习和科学编程方面多年的专业知识。最近,大型语言模型(LLM)已被证明可以通过按需生成本构模型来降低这一门槛,但现有的单智能体流程缺乏系统性的检查,以确保生成的模型尊重基本物理定律。为弥补这一差距,我们引入了首个多智能体LLM驱动的本构模型生成方法:一个Creator智能体根据数据提出定制模型,而一个Inspector智能体对每个提案进行严格审计,检查其是否满足九个物理约束,并在检测到违规时返回修改。我们使用本构人工神经网络(CANN)演示了这一概念,并在脑组织、实验橡胶和合成橡胶上使用两种不同的LLM骨干(Claude Opus 4.7和Kimi K2.5)进行基准测试。添加Inspector后,对于Opus,导出模型中真正满足所有物理约束的比例从91%提高到完美的100%;对于Kimi,从37%提高到56%,同时保持了接近基线的准确性和对未见加载路径的显著泛化能力。综合来看,生成的模型在物理上有效、高度准确,并能可靠地外推到训练数据之外——这些特性使其可以直接在实践中使用。因此,将生成与检查分离,使LLM驱动的本构建模成为一个真正可信的过程。该范式故意与技术无关,并随着LLM能力的进步自动扩展,为自动化、物理感知的模型发现开辟了一条有前景的道路。

英文摘要

Developing constitutive models that capture how materials deform under load traditionally requires years of specialized expertise in continuum mechanics, machine learning, and scientific programming. Large language models (LLMs) have recently been shown to lower this barrier by generating constitutive models on demand, but existing single-agent pipelines lack systematic checks that the resulting models respect fundamental physical laws. To close this gap, we introduce the first multi-agent LLM-driven approach for constitutive model generation: a Creator agent proposes a model tailored to the data, while an Inspector agent critically audits each proposal against nine physical constraints and returns it for refinement whenever a violation is detected. We demonstrate this concept with constitutive artificial neural networks (CANNs) and benchmark it on brain tissue, experimental rubber, and synthetic rubber, using two different LLM backbones (Claude Opus 4.7 and Kimi K2.5). Adding the Inspector raises the share of exported models that truly satisfy all physical constraints from 91% to a perfect 100% for Opus and from 37% to 56% for Kimi, while preserving near-baseline accuracy and remarkable generalization to unseen loading paths. In combination, the generated models are physically valid, highly accurate, and extrapolate reliably beyond the training data - properties that together make them directly usable in practice. Separating generation from inspection thus turns LLM-driven constitutive modeling into a genuinely trustworthy process. The paradigm is deliberately technique-agnostic and scales automatically with advances in LLM capability, opening a promising path toward automated, physics-aware model discovery.
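The propose, audit, refine loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `creator` is a hypothetical stand-in for the Creator LLM call, the toy model is a polynomial strain energy with parameters `c1`, `c2`, and `offset`, and the two checks in `inspector` are illustrative examples chosen here (the abstract does not list the nine constraints).

```python
def inspector(params):
    """Return the list of violated checks (an empty list means the model passes).

    The two checks are illustrative assumptions, not the paper's nine constraints.
    """
    violations = []
    if params["offset"] != 0.0:
        violations.append("zero energy in the undeformed state")
    if params["c1"] < 0.0 or params["c2"] < 0.0:
        violations.append("non-negative coefficients")
    return violations


def creator(feedback):
    """Hypothetical stand-in for the Creator LLM: propose model parameters."""
    params = {"c1": 0.5, "c2": -0.1, "offset": 0.2}  # deliberately flawed draft
    if "zero energy in the undeformed state" in feedback:
        params["offset"] = 0.0
    if "non-negative coefficients" in feedback:
        params["c2"] = abs(params["c2"])
    return params


def generate(max_rounds=5):
    """Alternate proposal and inspection until no violation remains."""
    feedback = []
    for round_idx in range(1, max_rounds + 1):
        params = creator(feedback)
        feedback = inspector(params)
        if not feedback:
            return params, round_idx
    return None, max_rounds  # no valid model within the round budget


params, rounds = generate()
```

Keeping the inspector as a separate function with its own pass/fail output mirrors the separation of generation from inspection that the abstract argues for; in a real system both roles would be LLM calls and the checks would be evaluated on the exported model.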

2605.23753 2026-05-25 cs.LG

SeedER: Seed-and-Expand Retrieval from Knowledge Graphs

SeedER: 基于种子扩展的知识图谱检索

Hamed Shirzad, Frederik Wenkel, Dominique Beaini, Danica J. Sutherland, Emmanuel Noutahi

AI总结 SeedER 是一种用于知识图谱的检索框架,旨在解决其不规则结构带来的检索挑战。该方法通过先利用轻量级的密集嵌入和实体检索确定核心节点,再通过强化学习训练的图感知策略进行选择性扩展,从而高效发现与查询相关的节点。实验表明,SeedER 在保持较低扩展成本的同时,显著提升了检索效果,尤其在处理多跳组合查询时表现出优越的性能。

详情
AI中文摘要

知识图谱(KGs)为关系知识提供了丰富的表示,但其不规则结构使得检索具有挑战性:自我图扩展迅速增长,而密集嵌入方法难以处理多跳组合查询。现有的基于智能体的图探索方法虽然表达能力强,但通常对于大规模检索来说过于昂贵。我们引入了SeedER(种子扩展检索),这是一个通过迭代、低成本扩展显式利用KG结构的检索框架。SeedER首先使用轻量级密集和基于实体的检索播种一个紧凑的核心节点集,然后通过使用强化学习训练的图感知策略选择性地扩展该集合。这种设计将全局推理分解为可重用的局部决策,从而能够在严格控制扩展成本的同时高效发现与查询相关的节点。我们展示了密集检索在组合图查询上的理论局限性,并从组合泛化和图约束子模优化的角度确立了SeedER的优势。实验上,SeedER在紧凑候选集上显著提高了召回率,超过了强大的密集和图增强基线,使其成为知识密集型推理系统中有效的第一阶段检索器。

英文摘要

Knowledge graphs (KGs) offer a rich representation for relational knowledge, but their irregular structure makes retrieval challenging: ego-graph expansion grows rapidly, and dense embedding methods struggle with multi-hop compositional queries. Existing agent-based graph exploration approaches, while expressive, are often too expensive for large-scale retrieval. We introduce SeedER (Seed-and-Expand Retrieval), a retrieval framework that explicitly leverages KG structure through iterative, low-cost expansion. SeedER first seeds a compact set of core nodes using lightweight dense and entity-based retrieval, then selectively expands this set via a learned graph-aware policy trained with reinforcement learning. This design decomposes global reasoning into reusable local decisions, enabling efficient discovery of query-relevant nodes while tightly controlling expansion cost. We show theoretical limitations of dense retrieval on compositional graph queries, and establish advantages of SeedER from both compositional generalization and graph-constrained submodular optimization perspectives. Empirically, SeedER substantially improves recall with compact candidate sets over strong dense and graph-augmented baselines, making it an effective first-stage retriever for knowledge-intensive reasoning systems.
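The two-stage idea (seed a small core set, then expand it selectively under a cost budget) can be sketched as below. This is an assumption-laden toy: `seed_scores` stands in for the dense and entity-based retrievers, `expand_score` is a hand-written placeholder for the learned graph-aware policy, and `budget` and `threshold` are hypothetical parameters, not values from the paper.

```python
def seed_and_expand(graph, seed_scores, expand_score,
                    num_seeds=1, budget=2, threshold=0.5):
    """Pick top-scoring seeds, then grow the set along edges the policy accepts.

    graph: dict mapping node -> list of neighbour nodes
    seed_scores: dict mapping node -> first-stage retrieval score
    expand_score: callable (source, neighbour) -> score in [0, 1]
    budget: maximum number of nodes added during expansion
    """
    seeds = sorted(seed_scores, key=seed_scores.get, reverse=True)[:num_seeds]
    selected = list(seeds)
    in_set = set(seeds)
    frontier = list(seeds)
    added = 0
    while frontier and added < budget:
        node = frontier.pop(0)
        for neighbour in graph.get(node, []):
            if added >= budget:
                break
            if neighbour in in_set:
                continue
            # Local decision: accept or reject this single edge.
            if expand_score(node, neighbour) >= threshold:
                selected.append(neighbour)
                in_set.add(neighbour)
                frontier.append(neighbour)
                added += 1
    return selected


graph = {"A": ["B", "C"], "B": ["D"], "C": [], "D": ["E"], "E": []}
seed_scores = {"A": 0.9, "C": 0.1, "E": 0.2}
relevance = {"B": 0.8, "C": 0.3, "D": 0.7, "E": 0.9}
result = seed_and_expand(graph, seed_scores,
                         lambda src, dst: relevance.get(dst, 0.0))
```

Each accept/reject call looks only at one edge, which is the sense in which global multi-hop reasoning is decomposed into reusable local decisions; the budget caps how far the candidate set can grow.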

2605.23751 2026-05-25 cs.LG

Approaching I/O-optimality for Approximate Attention

逼近近似注意力的I/O最优性

Pál András Papp, Aleksandros Sobczyk, Anastasios Zouzias

AI总结 本文研究了大语言模型中注意力机制的I/O复杂度问题,旨在以最少的快慢内存数据传输次数计算注意力矩阵。作者提出了一种基于近似注意力框架的I/O高效算法,使得在大多数参数设置下,I/O代价仅近似线性依赖于序列长度$n$,显著优于现有方法的二次复杂度。同时,作者还给出了不同参数范围下的I/O下界,证明所提方法接近I/O最优。

详情
AI中文摘要

我们重新审视了大语言模型中注意力的I/O复杂度。给定查询-键-值矩阵 $Q,K,V\in\mathbb{R}^{n\times d}$,以及一个快速内存大小为 $M$ 的机器,目标是计算“注意力矩阵” $A=\text{softmax}(Q K ^{\top}/\sqrt{d}) V$,同时最小化快速和慢速内存之间的数据传输次数。文献中的现有方法,尤其是FlashAttention及其变体,其I/O开销与 $n$ 呈二次关系,而一个平凡的下界仅需要 $\Omega(nd)$ 次I/O来读取输入和写入输出。在这项工作中,我们提出了一种计算注意力的技术,在大多数参数范围内,其I/O开销几乎与 $n$ 呈线性关系。这是通过开发受Alman和Song最近提出的近似注意力框架启发的I/O高效算法实现的。我们还证明了每个参数范围内的相应下界,以表明我们的算法确实接近I/O最优。

英文摘要

We revisit the I/O complexity of attention in large language models. Given query-key-value matrices $Q,K,V\in\mathbb{R}^{n\times d}$, and a machine with fast memory size $M$, the goal is to compute the "attention matrix" $A=\text{softmax}(QK^{\top}/\sqrt{d})V$ with the minimal number of data transfers between fast and slow memory. Existing methods in the literature, most notably FlashAttention and its variants, incur an I/O cost that depends quadratically on $n$, while a trivial lower bound only requires $\Omega(nd)$ I/O's to read the inputs and write the output. In this work, we present a technique for computing attention where the I/O cost only depends almost-linearly on $n$ in most parameter regimes. This is achieved by developing I/O-efficient algorithms inspired by the recent approximate attention framework of Alman and Song. We also prove corresponding lower bounds in each parameter regime to show that our algorithms are indeed close to I/O-optimal.
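To make the object being computed concrete, here is a small pure-Python sketch of exact attention in two forms: a naive version, and a version that streams the keys and values in blocks with a running (online) softmax, which is the tiling idea behind FlashAttention-style kernels. It is not the approximate algorithm of the paper, and the block size `block` is an illustrative stand-in for the fast-memory constraint.

```python
import math


def naive_attention(Q, K, V):
    """Reference: compute each row of softmax(Q K^T / sqrt(d)) V directly."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / z
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block=2):
    """Same result, but K and V are read block by block with an online softmax."""
    d = len(Q[0])
    dv = len(V[0])
    out = []
    for q in Q:
        m = -math.inf     # running maximum of the scores seen so far
        z = 0.0           # running softmax denominator
        acc = [0.0] * dv  # running weighted sum of value rows
        for start in range(0, len(K), block):
            ks = K[start:start + block]
            vs = V[start:start + block]
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in ks]
            new_m = max(m, max(scores))
            scale = math.exp(m - new_m)  # rescale earlier partial results
            z *= scale
            acc = [a * scale for a in acc]
            for s, v in zip(scores, vs):
                w = math.exp(s - new_m)
                z += w
                acc = [a + w * vj for a, vj in zip(acc, v)]
            m = new_m
        out.append([a / z for a in acc])
    return out


Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [2.0, 0.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
reference = naive_attention(Q, K, V)
streamed = tiled_attention(Q, K, V, block=2)
```

The streamed version never holds a full row of scores at once, but every query still touches every key block, so the number of block reads grows with $n^2$; that quadratic dependence is what the abstract says the approximate approach avoids.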

2605.23747 2026-05-25 cs.CV

Revitalizing Dense Material Segmentation: Stabilized Vision Transformers and the Generalization Paradox

复兴密集材质分割:稳定的视觉Transformer与泛化悖论

Allan Kazakov, Duygu Cakir, Hilal Kurt İrfanoğlu, Yavuz İrfanoğlu

AI总结 本文旨在复兴Apple密集材质分割(Apple-DMS)基准,解决该基准日益被偏向几何的基础模型所掩盖而陷入停滞的问题。研究提出了一种稳定的训练方案,包括高保真logit投影、查询熵正则化和符合物理的数据增强流程,显著提升了基于Vision Transformer的分割模型性能。同时,作者揭示了“泛化悖论”:重新划分数据集虽然可以提升指标,却会降低模型在真实场景中的泛化能力,因此强调应使用原始数据划分来推动物理感知人工智能研究。

详情
AI中文摘要

材质分割,即对物理表面属性进行像素级分类,仍然是计算机视觉中的一个挑战性问题,需要区别于以物体为中心解析的物理化学理解。尽管引入了严格的Apple密集材质分割(DMS)数据集,该基准测试仍遭受衰退和停滞,日益被偏向几何的基础模型所掩盖。在本文中,我们复兴Apple-DMS基准测试,建立现代视觉Transformer基线。我们对SegFormer和Mask2Former架构进行了详尽评估,揭示标准训练范式由于高方差梯度而在无定形纹理场上失败。为解决此问题,我们引入了一种稳定的训练方案,包括高保真logit投影、查询熵正则化以及领域特定、符合物理的增强流程。我们优化的SegFormer-B5在原始数据集划分上达到了0.4572 mIoU的新最先进水平(SOTA),显著超越了先前的卷积基线。此外,我们识别出一个关键的“泛化悖论”:虽然将数据集重新划分为数据丰富的80/10/10划分将指标提升至0.5276 mIoU,但专家定性分析表明这导致了分布同质化,严重降低了真实世界、分布外性能。通过发布我们恢复的数据集索引和稳健的训练框架,我们证明材质感知远未解决,并敦促社区利用严格的原始划分推动物理基础人工智能的真正进展。

英文摘要

Material segmentation, the pixel-wise classification of physical surface properties, remains a challenging problem in computer vision, requiring physicochemical understanding distinct from object-centric parsing. Despite the introduction of the rigorous Apple Dense Material Segmentation (DMS) dataset, the benchmark has suffered from attrition and stagnation, increasingly overshadowed by geometry-biased foundation models. In this paper, we revive the Apple-DMS benchmark to establish a modern Vision Transformer baseline. We conduct an exhaustive evaluation of SegFormer and Mask2Former architectures, revealing that standard training paradigms fail on amorphous texture fields due to high-variance gradients. To address this, we introduce a stabilized training recipe featuring High-Fidelity Logit Projection, Query Entropy Regularization, and a domain-specific, physics-compliant augmentation pipeline. Our optimized SegFormer-B5 achieves a new State-of-the-Art (SOTA) of 0.4572 mIoU on the original dataset split, significantly surpassing the prior convolutional baseline. Furthermore, we identify a critical "Generalization Paradox": while re-partitioning the dataset into a data-rich 80/10/10 split inflates the metric to 0.5276 mIoU, expert qualitative analysis reveals this induces distributional homogenization, severely degrading real-world, out-of-distribution performance. By releasing our recovered dataset index and robust training framework, we demonstrate that material perception is far from solved and urge the community to leverage the rigorous original split to drive genuine progress in physically grounded artificial intelligence.
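The headline numbers (0.4572 and 0.5276) are mean intersection-over-union scores. A minimal sketch of how mIoU is commonly computed from per-pixel labels is given below; skipping classes that appear in neither prediction nor ground truth is an assumption made here, and the paper's exact evaluation protocol is not stated in the abstract.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes, from flat lists of integer per-pixel labels."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: ignore classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
score = mean_iou(pred, target, num_classes=3)
```

In this toy case class 0 has IoU 1/2, class 1 has IoU 2/3, and class 2 is skipped, so the mean is 7/12. Because the metric averages over classes, a different train/test split can change it by changing which classes and textures appear in the test set, independent of any real improvement in the model.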

2605.23744 2026-05-25 cs.LG

Contrast to Detect: Dynamic Graph Contrastive Regularization for Unsupervised Anomaly Detection in Multivariate Time Series

对比检测:面向无监督多变量时间序列异常检测的动态图对比正则化

Yunhua Pei, Zixing Song, Jin Zheng, John Cartlidge

AI总结 该研究针对多变量时间序列中的无监督异常检测问题,提出了一种名为ContrastAD的框架,用于应对动态变量依赖关系和频谱噪声带来的挑战。该方法通过动态图对比学习,将结构演变作为学习信号,而非抑制其变化,并引入多视角嵌入和频率感知注意力机制以提升鲁棒性。实验表明,ContrastAD在多个真实数据集上取得了优越的异常检测性能,尤其在F1指标上表现突出。

详情
Comments
12 pages, 5 figures. Preprint. Code and demo data available online
AI中文摘要

多变量时间序列(MTS)中的异常检测受到动态变量间依赖关系和频谱噪声下特征纠缠的阻碍,在实践中,由于缺乏异常标签而进一步复杂化。现有的基于重构的检测器倾向于像正常模式一样忠实地恢复异常,而流行的图对比方法强制视图间不变性,从而假设一个平稳的关系结构,这一假设在真实系统的结构漂移下被打破。我们提出ContrastAD,一个无监督框架,将结构演化本身转变为学习信号而非抑制它。一个多视角编码器从时间、属性和结构视角编码输入。一个频率感知注意力混合器在注意力之前执行频谱top-K过滤,防止噪声泄漏到查询-键相似度中。核心组件,一个动态图对比学习器,从批次级DTW距离构建基于幂律的稀疏图快照,并将最发散的对与稳定锚点进行对比,在不施加刚性不变性的情况下正则化潜在空间。在五个真实世界基准上,ContrastAD在所有五个数据集上获得最高平均F1,并在三个数据集上获得最高AUC(SWaT 93.60,SMD 98.66,PSM 97.79),在SWaT和PSM上相对于最强基线具有统计显著的F1和AUC差距。在MSL和SMAP上,其AUC落后领先者不到0.7个百分点,同时F1仍领先。消融和敏感性研究进一步证实,对比目标作为软正则化器效果最佳,支持我们的主张:在非平稳动态下严格不变性是次优的。

英文摘要

Anomaly detection in multivariate time series (MTS) is hindered by dynamic inter-variable dependencies and feature entanglement under spectral noise, and in practice, is further complicated by the absence of anomaly labels. Existing reconstruction-based detectors tend to recover anomalies as faithfully as normal patterns, while prevailing graph contrastive methods enforce invariance across views and thus assume a stationary relational structure, an assumption that breaks under structural drift in real systems. We propose ContrastAD, an unsupervised framework that turns structural evolution itself into a learning signal rather than suppressing it. A Multi-Perspective Embedder encodes inputs from temporal, attribute, and structural perspectives. A Frequency-Aware Attention Mixer then performs spectral top-K filtering before attention, preventing noise from leaking into query-key similarities. The core component, a Dynamic Graph Contrastive Learner, builds power-law-inspired sparse graph snapshots from batch-level DTW distances and contrasts the most divergent pair against a stable anchor, regularizing the latent space without imposing rigid invariance. Across five real-world benchmarks, ContrastAD attains the highest mean F1 on all five datasets and the highest AUC on three (SWaT 93.60, SMD 98.66, PSM 97.79), with statistically significant F1 and AUC margins over the strongest baseline on SWaT and PSM. On MSL and SMAP, it trails the AUC leader by under 0.7 points while still leading on F1. Ablation and sensitivity studies further confirm that the contrastive objective works best as a soft regularizer, supporting our claim that strict invariance is suboptimal under non-stationary dynamics.

2605.23733 2026-05-25 cs.RO cs.AI

Any2Any: Efficient Cross-Embodiment Transfer for Humanoid Whole-Body Tracking

Any2Any: 高效跨本体迁移用于人形机器人全身跟踪

Ming Yang, Tao Yu, Feng Li, Hua Chen

AI总结 本文提出了一种名为 Any2Any 的高效跨本体迁移方法,用于人形机器人全身运动跟踪。该方法通过少量数据和计算资源,将预训练的全身跟踪模型快速适配到新的机器人平台上。核心思想包括通过运动学对齐统一输入输出空间,并利用轻量级参数高效微调技术对关键模块进行动力学适应,从而在保持原有行为先验的同时实现精准迁移。实验表明,Any2Any 显著降低了训练成本,且在多个平台上的跟踪性能优于或接近从头训练的方法。

详情
AI中文摘要

全身跟踪(WBT)模型已成为人形机器人的关键基础,使其能够高保真地模仿各种运动。从头训练此类模型需要大规模数据和计算,使得在新人形平台上快速部署成本高昂。这自然引发一个问题:预训练的WBT模型能否通过最小化适应跨本体迁移?为回答这个问题,我们提出Any2Any,一种范式,能够高效地将现有WBT专家迁移到新人形本体,仅需少量数据和计算。Any2Any首先在源和目标人形之间进行运动学对齐,对齐其输入和输出空间,使得预训练的源策略可以在目标本体上有意义地重用。然后,Any2Any通过向选定的动力学敏感模块应用轻量级参数高效微调(PEFT)组件进行动力学适应,保留有用的行为先验,同时实现对目标机器人的定向适应。在多个人形平台和预训练骨干上的大量实验表明,与从头训练相比,Any2Any显著加速收敛并降低训练成本,同时实现具有竞争力或更优的跟踪性能。值得注意的是,仅使用完整训练所需计算和数据的1%,Any2Any成功将在Unitree G1上预训练的Sonic模型迁移到LimX Oli和LimX Luna。这些结果表明,预训练的WBT专家可以跨本体高效重用,为在新机器人上部署人形全身控制提供可扩展的路径。

英文摘要

Whole-body tracking (WBT) models have become a key foundation for humanoid robots, enabling them to imitate diverse motions with high fidelity. Training such models from scratch requires large-scale data and computation, making rapid deployment on new humanoid platforms costly. This raises a natural question: Can pretrained WBT models transfer across embodiments with minimal adaptation? To answer this question, we propose Any2Any, a paradigm that efficiently transfers an existing WBT specialist to a new humanoid embodiment with only a small amount of data and compute. Any2Any first performs kinematic alignment between source and target humanoids, aligning their input and output spaces so that the pretrained source policy can be meaningfully reused on the target embodiment. Any2Any then performs dynamics adaptation by applying lightweight parameter-efficient fine-tuning (PEFT) components to selected dynamics-sensitive modules, preserving useful behavioral priors while enabling targeted adaptation to the target robot. Extensive experiments on multiple humanoid platforms and pretrained backbones show that Any2Any substantially accelerates convergence and reduces training cost compared with training from scratch, while achieving competitive or superior tracking performance. Notably, using only 1% of the compute and data required for full training, Any2Any successfully transfers Sonic models pre-trained on Unitree G1 to LimX Oli and LimX Luna. These results suggest that pretrained WBT specialists can be efficiently reused across embodiments, providing a scalable path toward deploying humanoid whole-body control on new robots.

2605.23726 2026-05-25 cs.LG cs.DS stat.ML

Optimal Dimension-Free Sampling for Regularized Classification

正则化分类的最优无维度采样

Meysam Alishahi, Alexander Munteanu, Simon Omlor, Jeff M. Phillips

AI总结 本文研究了在正则化分类问题中实现$(1\pm\varepsilon)$相对误差的最优无维度采样方法,适用于一大类满足Lipschitz条件的分类损失函数,如逻辑回归、铰链损失和ReLU损失等。作者给出了不同正则化项下的采样复杂度上界和下界,证明了基于$\|\cdot\|_2/k$和$\|\cdot\|_1/k$正则化的采样复杂度分别为$k^2/\varepsilon^2$和$k/\varepsilon^2$,并分析了$\|\cdot\|_2^2/k$正则化下采样复杂度对函数导数性质的依赖。相比现有基于敏感度采样的立方复杂度界$k^3/\varepsilon^2$,本文仅依赖简单的均匀采样或(平方)范数采样,并通过更精细的高阶矩分析实现了更优的采样效率。

详情
AI中文摘要

我们证明了对于一大类Lipschitz连续分类损失函数,在各种正则化项下,达到$(1\pm\varepsilon)$相对误差的最优采样界。这包括重要的函数如logistic和sigmoid损失、hinge损失和ReLU损失,作为突出和流行的代表性例子。特别地,我们证明了对于$\|\cdot\|_2/k$正则化的$k^2/\varepsilon^2$上下界,以及对于$\|\cdot\|_1/k$正则化的$k/\varepsilon^2$上下界。对于$\|\cdot\|_2^2/k$正则化,采样复杂度主要取决于有界导数性质:如果$|g'(x)|\leq g(x)$,且$g(0)>0$,且$g$是单调或凸的,则采样复杂度是$k$的线性;否则一般界为$k^2/\varepsilon^2$。然而,如果$g(0)=0$,我们的结果表明不可能得到无维度界,甚至次线性界也被排除。所有上界都有匹配的下界(至多相差多对数项)。此外,我们的工作在概念上和算法上依赖于简单的均匀或(平方)范数采样,从而改进了最近(Alishahi and Phillips, ICML'24)的立方$k^3/\varepsilon^2$敏感度采样界。这是通过涉及更高矩界和经验过程分析的精细论证来实现的,以避免在事实上的标准VC维和敏感度框架中出现的过度计数。

英文摘要

We prove optimal sampling bounds achieving $(1\pm\varepsilon)$-relative error for a broad class of Lipschitz continuous classification loss functions under various regularization terms. This includes important functions such as logistic and sigmoid loss, hinge loss, and ReLU loss, as prominent and popular representative examples. In particular, we prove $k^2/\varepsilon^2$ upper and lower bounds for $\|\cdot\|_2/k$ regularization, and $k/\varepsilon^2$ upper and lower bounds for $\|\cdot\|_1/k$ regularization. For $\|\cdot\|_2^2/k$ regularization, the sampling complexity depends mainly on a bounded derivative property: if $|g'(x)|\leq g(x)$, and $g(0)>0$, and $g$ is monotonic or convex, then it admits linear in $k$ sampling complexity; otherwise the general bound is $k^2/\varepsilon^2$. However, if $g(0)=0$, our results indicate that no dimension-free bounds are possible, and even sublinear bounds are ruled out. All upper bounds are complemented by matching lower bounds up to polylogarithmic terms. Moreover, our work relies conceptually and algorithmically on simple uniform or (squared) norm sampling and hereby improves over recent cubic $k^3/\varepsilon^2$ sensitivity sampling bounds of (Alishahi and Phillips, ICML'24). This is achieved by refined arguments involving higher moment bounds and empirical process analyses to avoid overcounting that appears in the de-facto standard VC-dimension and sensitivity framework.
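As a rough illustration of what uniform sampling means here, the sketch below evaluates a regularized logistic objective on the full data and on a uniformly drawn subsample. The exact form of the objective (mean loss plus $\|w\|_2/k$) is an assumption made for this example, since the abstract does not spell it out, and the helper names are hypothetical.

```python
import math
import random


def logistic_loss(z):
    """Numerically stable log(1 + exp(-z))."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def objective(points, labels, w, k, indices=None):
    """Assumed objective: mean logistic loss over `indices` plus ||w||_2 / k."""
    if indices is None:
        indices = range(len(points))
    indices = list(indices)
    data_term = sum(
        logistic_loss(labels[i] * sum(a * b for a, b in zip(points[i], w)))
        for i in indices
    ) / len(indices)
    reg_term = math.sqrt(sum(a * a for a in w)) / k
    return data_term + reg_term


def uniform_sample_indices(n, m, seed=0):
    """Draw m distinct indices uniformly at random (no importance weights)."""
    return random.Random(seed).sample(range(n), m)


rng = random.Random(1)
points = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
labels = [1 if p[0] + p[1] > 0 else -1 for p in points]
w = [0.7, 0.4]
full_value = objective(points, labels, w, k=10)
subset = uniform_sample_indices(len(points), 50)
estimate = objective(points, labels, w, k=10, indices=subset)
```

A $(1\pm\varepsilon)$ guarantee would require `estimate` to lie within a factor $1\pm\varepsilon$ of `full_value` simultaneously for all `w`, which is a much stronger statement than this single evaluation shows; the sample sizes in the abstract are what the paper proves suffice for that uniform guarantee.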

2605.23723 2026-05-25 cs.AI

MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection

MemAudit:通过因果归因和结构异常检测对中毒代理记忆进行事后审计

Zhewen Tan, Yilun Yao, Huiyan Jin, Wenhan Yu, Guoan Wang, Mengyuan Fan, liang lu, Feng Liu, Xiangzheng Zhang, Duohe Ma, Tong Yang, Lin Sun

AI总结 随着大型语言模型代理越来越多地依赖持久内存来存储历史交互并提升任务执行能力,内存机制也带来了潜在的安全隐患:攻击者可通过正常交互向内存中注入恶意记录,从而影响代理的行为。为此,本文提出 MemAudit,一种用于事后审计内存增强型大语言模型代理的因果记忆审计框架。该方法结合因果影响评分与结构异常检测,有效识别出对有害输出有贡献的恶意记忆记录,并在多种攻击场景下显著降低了攻击成功率。

详情
AI中文摘要

大型语言模型代理越来越依赖持久记忆来存储过去的交互、检索相关演示并改进长期任务执行。然而,这种记忆机制也造成了一个实际的安全漏洞:对抗性用户可能通过普通交互将恶意记录注入代理的记忆中,这些记录随后可能被检索以引导代理的推理和行动。现有的防御主要关注在线干预,如提示过滤或输出阻止,但它们没有解决事后问题,即在观察到有害行为后,哪些存储的记忆应负责。我们提出了MemAudit,一个用于记忆增强型LLM代理的事后因果记忆审计框架。该框架结合了两个互补信号:(1)反事实记忆影响评分,衡量每个记忆对有害输出的因果贡献;(2)记忆一致性图,识别更广泛记忆存储中的结构异常记忆。我们针对MINJA(一种仅查询的记忆注入攻击,其中恶意记录通过正常代理交互生成和存储,而非直接修改记忆库)评估了MemAudit。在QA和推理代理两种设置中,MemAudit在现实的事后审计场景下显著降低了攻击成功率。结果显示,QA攻击成功率从$70\%$降至$0\%$,而RAP攻击成功率从$83.3\%$降至$0\%$。

英文摘要

Large language model agents increasingly rely on persistent memory to store past interactions, retrieve relevant demonstrations, and improve long-horizon task execution. However, this memory mechanism also creates a practical security vulnerability: an adversarial user may inject malicious records into the agent's memory through ordinary interaction, and these records can later be retrieved to steer the agent's reasoning and actions. Existing defenses primarily focus on online intervention, such as prompt filtering or output blocking, but they do not address the post-hoc question of which stored memories are responsible after harmful behavior has already been observed. We propose MemAudit, a post-hoc causal memory auditing framework for memory-augmented LLM agents. The framework combines two complementary signals: (1) a counterfactual memory influence score that measures each memory's causal contribution to harmful outputs, and (2) a memory consistency graph that identifies structurally anomalous memories within the broader memory store. We evaluate MemAudit against MINJA, a query-only memory injection attack in which malicious records are generated and stored through normal agent interactions rather than direct memory-bank modification. Across both QA and reasoning-agent settings, MemAudit substantially reduces attack success rates under realistic post-hoc auditing scenarios. The results show that QA attack success is reduced from $70\%$ to $0\%$, while RAP attack success drops from $83.3\%$ to $0\%$.
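One simple way to realise a counterfactual influence score is leave-one-out ablation: rerun the agent without a given memory and see how much the harm measure drops. The sketch below assumes this reading; `harm_score` is a hypothetical stand-in for running the agent and scoring its output, the `redirect` marker is made up, and the paper's actual scoring rule may differ.

```python
def influence_scores(memories, harm_score):
    """Leave-one-out influence: harm with all memories minus harm without one."""
    base = harm_score(memories)
    return [
        base - harm_score(memories[:i] + memories[i + 1:])
        for i in range(len(memories))
    ]


def flag_suspects(memories, harm_score, threshold=0.5):
    """Indices of memories whose removal reduces harm by at least `threshold`."""
    scores = influence_scores(memories, harm_score)
    return [i for i, s in enumerate(scores) if s >= threshold]


def toy_harm_score(memories):
    """Hypothetical scorer: harmful if any record carries the marker text."""
    return 1.0 if any("redirect" in m for m in memories) else 0.0


memories = [
    "Q: dosage for patient A -> looked up chart A",
    "Q: record for patient A -> redirect to patient B",
    "Q: schedule for patient C -> looked up calendar",
]
scores = influence_scores(memories, toy_harm_score)
suspects = flag_suspects(memories, toy_harm_score)
```

Leave-one-out has a known blind spot: if two injected records are each sufficient to trigger the harmful output, removing either one alone changes nothing and both get a score of zero. A second, structure-based signal such as the consistency graph mentioned in the abstract is one plausible way to catch such cases.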

2605.23721 2026-05-25 cs.CL

Is a Document Educational or Just Wikipedia-Style? -- Pitfalls of Classifier-Based Quality Filtering

文档是教育性的还是维基风格的?——基于分类器的质量过滤的陷阱

Mateusz Klimaszewski, Piotr Andruszkiewicz

AI总结 本文探讨了基于分类器的质量过滤方法在构建预训练语料库中的潜在问题,指出简单的维基百科风格重格式化操作可能显著改变模型对内容质量的判断,导致低质量内容通过过滤。研究以FineWeb-Edu CQF模型为例,发现该模型会对约7%的被评估文档反转过滤决策,使原本会被排除的内容进入预训练语料库,揭示了该方法在实际应用中的关键缺陷。

详情
Comments
Accepted to ACL 2026
AI中文摘要

基于分类器的质量过滤最近已成为构建预训练语料库的基本技术。部署单一模型以替代或补充一组启发式规则的能力已被证明在众多大型语言模型中有效。在这项工作中,我们通过展示一个直接的维基风格重格式化操作如何显著改变模型的质量评估,并使低质量内容能够超过过滤阈值,暴露了这种方法的一个关键漏洞。我们的分析表明,FineWeb-Edu CQF模型会反转约7%评估文档的过滤决策,从而将原本会被排除的内容纳入预训练语料库。

英文摘要

Classifier-based Quality Filtering has recently emerged as a fundamental technique in constructing pre-training corpora. The ability to deploy a single model that can replace or supplement a set of heuristics has proven effective across numerous Large Language Models. In this work, we expose a critical vulnerability in this approach by demonstrating how a straightforward Wikipedia-style reformatting operation can substantially alter a model's quality assessment and enable low-quality content to surpass filtering thresholds. Our analysis reveals that the FineWeb-Edu CQF model would reverse its filtering decision for approximately 7% of evaluated documents, thereby admitting content into the pre-training corpus that would otherwise have been excluded.

2605.23719 2026-05-25 cs.CV cs.AI

Weierstrass Positional Encoding for Vision Transformers

Weierstrass位置编码用于视觉Transformer

Zhihang Xin, Rui Wang, Xitong Hu, Xiaojun Wu

AI总结 视觉Transformer在计算机视觉中取得了显著成功,但其常用的可学习一维位置编码在图像分块展平后削弱了图像的二维空间结构。为解决这一问题,本文提出了一种基于魏尔斯特拉斯椭圆函数的位置编码方法(WePE),通过在复数域中对二维分块坐标进行映射,构建具有双周期特性的四维位置特征,从而更准确地保留图像分块的几何关系和空间邻近性先验。该方法具有数学理论支撑,能够自然匹配图像网格的规则结构,并且无需额外计算开销,可无缝集成到现有视觉Transformer中,实验表明其在多种任务中均能带来性能提升。

详情
AI中文摘要

视觉Transformer在计算机视觉中取得了显著成功,但它们通常使用可学习的一维位置编码,这削弱了图像块展平后固有的二维空间结构。现有的位置编码往往缺乏几何约束,并且不保持欧氏空间距离与序列索引距离之间的单调关系,限制了ViTs利用空间邻近先验的能力。受周期性在位置编码中实用性的启发,我们提出了Weierstrass椭圆位置编码(WePE),这是一种在复数域中编码二维坐标的数学基础方法。WePE将归一化的二维块坐标映射到复平面,并使用Weierstrass椭圆函数及其导数构建紧凑的四维位置特征。双周期性提供了二维位置的原则性表示,其固有的晶格结构自然匹配图像块网格的规则几何形状。其非线性几何特性有助于更忠实地建模空间距离关系,而代数加法公式使得任意块对之间的相对位置信息可以直接从其绝对编码中推导出来。WePE是即插即用的且与分辨率无关,可以无缝集成到现有的ViTs中。大量实验表明,WePE在大多数设置中带来一致的性能提升。通过预计算的查找表,这些改进不会引入明显的计算或内存开销。额外的分析和消融研究进一步验证了所提方法的有效性。

英文摘要

Vision Transformers have achieved remarkable success in computer vision, but their common use of learnable one-dimensional positional encodings weakens the inherent two-dimensional spatial structure of images after patch flattening. Existing positional encodings often lack geometric constraints and do not preserve a monotonic relationship between Euclidean spatial distances and sequential index distances, limiting ViTs' ability to exploit spatial proximity priors. Motivated by the usefulness of periodicity in positional encoding, we propose Weierstrass elliptic Positional Encoding (WePE), a mathematically grounded method for encoding two-dimensional coordinates in the complex domain. WePE maps normalized 2D patch coordinates onto the complex plane and constructs compact four-dimensional positional features using the Weierstrass elliptic function and its derivative. The double periodicity provides a principled representation of 2D positions, and its intrinsic lattice structure naturally matches the regular geometry of image patch grids. Its nonlinear geometric properties help model spatial distance relationships more faithfully, while the algebraic addition formula enables relative positional information between arbitrary patch pairs to be derived directly from their absolute encodings. WePE is plug-and-play and resolution-agnostic, allowing seamless integration into existing ViTs. Extensive experiments show that WePE brings consistent performance gains in most settings. With precomputed lookup tables, these improvements introduce no noticeable computational or memory overhead. Additional analyses and ablation studies further validate the effectiveness of the proposed method.

2605.23717 2026-05-25 cs.RO

Vision-Based Agile Landing on Turbulent Waters

基于视觉的湍急水域敏捷着陆

Dimosthenis Angelis, Leonard Bauersfeld, Davide Scaramuzza, Evangelos Boukas

AI总结 本文研究了在恶劣海况下无人机自主降落在移动海上平台的难题,提出了一种基于强化学习的方法,无需显式获取平台状态信息。该方法结合多旋翼无人机的状态测量和着陆面的局部视觉特征,预测姿态和推力指令,并通过底层控制器实现跟踪。在逼真的模拟器中,该方法在“非常恶劣”海况下优于最先进的模型预测控制基线;实际实验进一步展示了使用两种不同局部特征提取器的自主机载着陆。据作者所知,这是首个无需显式平台状态表示的湍急水域敏捷着陆方案。

详情
AI中文摘要

由于飞行器和着陆平台在公海条件下的耦合运动,无人机在海上船只上的自主着陆具有挑战性。本文提出了一种基于强化学习的自主多旋翼着陆方法,无需显式平台状态信息。该方法利用多旋翼状态测量以及局部视觉特征(包括从着陆表面提取的关键点和相关描述符)来预测姿态和推力指令。这些指令由传统的低层控制器跟踪。策略在仿真中使用合成关键点和随机生成的归一化描述符进行训练,从而能够在无人机上使用不同的局部特征提取器进行零样本部署。我们在逼真的模拟器中评估了该方法,并表明在对应于“非常恶劣”海况的平台运动下,它优于最先进的模型预测控制基线。最后,我们进行了广泛的实际实验,展示了使用两种不同局部特征提取器的自主机载着陆。据我们所知,这是首个在湍急水域中无需显式平台状态表示即可实现海上平台敏捷多旋翼着陆的方法。

英文摘要

Autonomous landing of Unmanned Aerial Vehicles on maritime vessels is challenging due to the coupled motion of the vehicle and landing platform in open-sea conditions. This paper presents a reinforcement-learning-based approach for autonomous multirotor landing on moving maritime platforms without requiring explicit platform-state information. The proposed method uses multirotor state measurements together with local visual features, consisting of keypoints and associated descriptors extracted from the landing surface, to predict attitude and thrust commands. These commands are tracked by a conventional low-level controller. The policy is trained in simulation using synthetic keypoints with randomly generated normalized descriptors, enabling zero-shot deployment with different local feature extractors onboard the UAV. We evaluate the method in a realistic simulator and show that it outperforms a state-of-the-art Model Predictive Control baseline under platform motions corresponding to "Very Rough" sea conditions. Finally, we perform extensive real-world experiments, demonstrating autonomous onboard landing using two different local feature extractors. To the best of our knowledge, this is the first approach for agile multirotor landing on maritime platforms in turbulent waters that does not rely on an explicit platform-state representation.

2605.23715 2026-05-25 cs.CL

NLG Evaluation: Past, Present, Future

NLG评估:过去、现在与未来

Ehud Reiter

AI总结 本文回顾了自然语言生成(NLG)评估的发展历程,从1990年与语言学紧密相关、几乎没有正式实验评估的阶段,到如今与机器学习深度融合、实验评估成为研究基础的现状,并展望了未来评估方向的变化。在此期间发展出了许多评估技术,最新的是LLM-as-Judge。作者预计,随着大量人群日常使用NLG技术,影响评估、定性评估和安全评估将变得更加重要。

详情
Comments
Will appear in Proceedings of RetroEval 2026
AI中文摘要

自然语言生成(NLG)评估自1990年以来发生了巨大变化,并将继续演变。1990年,当NLG与语言学紧密相连时,几乎没有现代意义上的正式实验评估。到2026年,当NLG与机器学习紧密相关时,实验评估已成为研究的基础和预期。在此期间,开发了许多评估技术,包括最近的LLM-as-Judge。我预计NLG评估将在未来继续演变。特别是,随着大量人群日常使用NLG技术,影响评估、定性评估和安全评估将变得更加重要。

英文摘要

Natural Language Generation (NLG) evaluation has changed dramatically since 1990, and will continue to evolve in the future. In 1990, when NLG had close ties to linguistics, there was very little formal experimental evaluation in the modern sense. In 2026, when NLG is closely linked to machine learning, experimental evaluation is expected and indeed fundamental to research. Many evaluation techniques were developed over this period, including most recently LLM-as-Judge. I expect NLG evaluation will continue to evolve in the future. In particular, impact, qualitative, and safety evaluation will become more important as large numbers of people routinely use NLG technology.

2605.23712 2026-05-25 cs.CE cs.LG

Operator Learning for Reconstructing Flow Fields from Sparse Measurements: a Language Model Approach

基于稀疏测量重建流场的算子学习:一种语言模型方法

Qian Zhang, George Em Karniadakis

AI总结 本文研究了如何从稀疏测量数据中重建流场这一流体力学中的基础问题,并提出了一种基于语言模型架构的新型算子学习框架,实现了无需网格的流场重建。该方法将流场重建转化为序列到序列的学习任务,利用稀疏测量作为上下文,未观测位置作为查询,有效捕捉了空间相关性和长程依赖关系。实验表明,该方法在多个基准数据集上均表现出良好的重建精度,尤其在观测数据不足10%的情况下仍具有高效性能,展示了语言模型在科学数据重建中的潜力。

详情
AI中文摘要

从稀疏测量中重建流场是流体力学中的一个基本问题,对建模、控制和设计具有广泛影响。在这项工作中,我们提出了一种新颖的算子学习框架,利用语言模型的架构以无网格方式进行流场重建。我们将流场重建重新表述为序列到序列的学习任务,其中稀疏测量被视为上下文,未观测位置被视为查询。我们的模型学习从稀疏输入重建完整流场,有效捕捉空间相关性和长程依赖。我们在四个基准数据集上评估了所提出的方法:(1) 二维涡街模拟,(2) 美国本土的日平均温度数据,(3) 基于耗散粒子动力学的三维血流模拟,以及(4) 通过粒子跟踪测速获得的三维湍流射流测量。在所有情况下,我们的方法即使在高度不完整的数据(观测率低于10%)下也表现出竞争性的重建精度,并实现了高效性能。结果凸显了语言模型作为科学数据重建的鲁棒且可扩展工具的潜力,并指向了为科学和工程应用开发基础模型的有前景方向。

英文摘要

Reconstructing flow fields from sparse measurements is a fundamental problem in fluid mechanics with broad implications for modeling, control, and design. In this work, we propose a novel operator learning framework that leverages the architecture of language models to perform flow reconstruction in a mesh-free manner. We reformulate flow field reconstruction as a sequence-to-sequence learning task, where sparse measurements are treated as context and unobserved locations as queries. Our model learns to reconstruct the full flow field from sparse inputs, effectively capturing spatial correlations and long-range dependencies. We evaluate the proposed approach on four benchmark datasets: (1) two-dimensional vortex street simulations, (2) daily average temperature data across the contiguous United States, (3) three-dimensional blood flow simulations based on dissipative particle dynamics, and (4) three-dimensional turbulent jet flow measurements obtained via particle tracking velocimetry. Across all cases, our method demonstrates competitive reconstruction accuracy, even with highly incomplete data (less than 10% observed), and achieves efficient performance. The results highlight the potential of language models as robust and scalable tools for scientific data reconstruction, and suggest a promising direction toward the development of foundation models for scientific and engineering applications.

2605.23710 2026-05-25 cs.CL

A graph-based analysis of semantic types and coercion in contextualized word embeddings

基于图的语义类型与语境化词嵌入中的强制现象分析

Long Chen, Deniz Ekin Yavas

AI总结 本文研究语义类型不匹配现象在上下文词嵌入中的表现,提出一种基于图的方法来分析词与上下文之间的语义类型信息。通过构建BERT和语义增强嵌入的图结构,并引入邻域类型概率和熵两个指标,研究发现语义增强嵌入能更有效地反映语义类型信息,并能有效区分匹配与不匹配的句子。

详情
AI中文摘要

名词与其语境之间的语义类型不匹配是强制现象的核心。本文引入一种基于图的方法来检查词汇和语境类型信息如何在词嵌入中反映。我们从十种语义类型中选择名词,标注语料实例的类型匹配(匹配 vs. 强制 vs. 其他不匹配 vs. 无限制),并使用BERT和词义增强(sense-enhanced)嵌入构建图。我们提出了两个指标,即邻居类型概率(NTP)和邻居类型熵(NTE),用于分析邻居类型分布。结果表明,使用词义增强嵌入构建的图能更好地反映语义类型信息,并且通过所提出的指标可以区分匹配和不匹配句子。

英文摘要

Semantic type mismatch between a noun and its context is central to coercion phenomena. This paper introduces a graph-based method to examine how lexical and contextual type information is reflected in word embeddings. We select nouns from ten semantic types, annotate corpus instances for type matching (matching vs. coercion vs. other mismatch vs. unrestricted), and construct graphs using BERT and sense-enhanced embeddings. Two metrics -- Neighbor Type Probability (NTP) and Neighbor Type Entropy (NTE) -- are proposed to analyze neighborhood type distributions. Results show that graphs constructed with sense-enhanced embeddings reflect semantic type information better, and matching and mismatch sentences can be distinguished through the proposed metrics.
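The abstract names the two metrics without defining them. One plausible reading, used in the sketch below as an explicit assumption, is that NTP is the fraction of a node's graph neighbours carrying a given semantic type and NTE is the Shannon entropy of the neighbour type distribution. The toy graph and type labels are made up for illustration.

```python
import math
from collections import Counter


def neighbor_type_distribution(node, neighbors, node_type):
    """Empirical distribution of semantic types among a node's neighbours."""
    counts = Counter(node_type[n] for n in neighbors.get(node, []))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: c / total for t, c in counts.items()}


def ntp(node, neighbors, node_type, target_type):
    """Assumed NTP: share of neighbours whose type equals target_type."""
    dist = neighbor_type_distribution(node, neighbors, node_type)
    return dist.get(target_type, 0.0)


def nte(node, neighbors, node_type):
    """Assumed NTE: Shannon entropy (in bits) of the neighbour type mix."""
    dist = neighbor_type_distribution(node, neighbors, node_type)
    return sum(-p * math.log2(p) for p in dist.values())


neighbors = {"book#1": ["novel#3", "essay#7", "table#2", "idea#5"]}
node_type = {
    "novel#3": "information",
    "essay#7": "information",
    "table#2": "artifact",
    "idea#5": "abstract",
}
p_info = ntp("book#1", neighbors, node_type, "information")
entropy = nte("book#1", neighbors, node_type)
```

Under this reading, an occurrence whose neighbours mostly share one type has high NTP for that type and low NTE, while an occurrence sitting among mixed-type neighbours has higher NTE; whether that matches the paper's definitions should be checked against the full text.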

2605.23708 2026-05-25 cs.LG cs.SY eess.SY nlin.AO

Learning Dynamic Stability Landscapes in Synchronization Networks

Christian Nauck, Junyou Zhu, Michael Lindner, Frank Hellmann

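AI summary: This paper proposes a new upstream task, learning dynamic stability landscapes in synchronization networks, to understand synchronization behavior more deeply and to derive multiple scalar stability indices from the landscapes. The study introduces, for the first time, a graph-to-image prediction paradigm that learns an image-like stability landscape for each node directly from graph structure, and it releases two benchmark datasets of 10,000 graphs each. By combining a graph neural network with a convolutional neural network, the model learns stability landscapes end-to-end and generalizes well, offering a new way to move beyond traditional scalar stability indices.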

Comments
22 pages, 12 figures
Abstract

The robustness of synchronization is typically characterized by scalar, per-node stability indices whose dependence on topology is studied via network science or graph neural networks (GNNs). We propose a novel upstream task, learning stability landscapes, which provide deeper insights into synchronization behavior and from which many such scalar indices can be derived. Crucially, we pioneer a graph-to-image prediction paradigm: learning image-like landscapes as per-node targets directly from graph topology, a formulation we are not aware of having been established elsewhere in the literature. To support this task, we release two datasets of 10,000 graphs each at 20 and 100 nodes with per-node landscape labels, based on a conceptual oscillator model, capturing power grid synchronization behavior. A GNN encodes topology and a CNN decoder renders per-node images, learned end-to-end with good in-distribution accuracy, generalizing across graph sizes and to realistic power grid topologies. This demonstrates that stability landscapes, while beyond the reach of conventional network science, are learnable from topology and open new avenues for moving beyond scalar stability indices in biology, neuroscience, and power grids.
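The abstract states that scalar stability indices can be derived from a landscape but does not specify the landscape format. As an illustration only, the sketch below assumes a landscape is a 2D grid whose cells hold a value in [0, 1] scoring how reliably the network returns to synchrony after one perturbation of a node, and it derives a basin-stability-style index as the fraction of cells at or above a cutoff. The grid values, the `threshold` parameter, and the function name are hypothetical.

```python
def stable_fraction(landscape, threshold=0.5):
    """Collapse an image-like landscape into one scalar per node.

    landscape: 2D list of values in [0, 1]; each cell is assumed to score
    how reliably the network resynchronizes after one perturbation.
    threshold: hypothetical cutoff at or above which a cell counts as stable.
    """
    cells = [value for row in landscape for value in row]
    if not cells:
        raise ValueError("landscape must contain at least one cell")
    stable = sum(1 for value in cells if value >= threshold)
    return stable / len(cells)

# Made-up 3x4 landscape for a single node.
landscape = [
    [0.9, 0.8, 0.2, 0.1],
    [1.0, 0.7, 0.4, 0.0],
    [0.6, 0.5, 0.3, 0.1],
]
print(stable_fraction(landscape))
```

Other summaries of the same grid (a mean, or a stricter cutoff) would give different scalar indices, which is one way to read the claim that a landscape carries more information than any single index.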