2605.22716 2026-05-22 cs.AI cs.LO

Parametric Modular Answer Set Programs Made Declarative

Jorge Fandinno, Yuliya Lierler, Torsten Schaub

Affiliations: University of Nebraska Omaha, USA; University of Potsdam, Germany

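AI summary: This paper examines modularity in first-order answer set programming. It introduces parametric modular logic programs, a new formalism in which subprograms can be defined with parameters and intensionality statements, and shows how the formalism captures the control semantics of clingo programs, connecting them to traditional non-modular answer set programming.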

Comments To appear in Theory and Practice of Logic Programming

2605.22711 2026-05-22 cs.LG cs.AI

Abstraction for Offline Goal-Conditioned Reinforcement Learning

Clarisse Wibault, Alexander Goldie, Antonio Villares, Maike Osborne, Jakob Foerster

Affiliations: FLAIR, MLRG University of Oxford

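AI summary: This paper proposes a way to exploit abstraction in offline goal-conditioned reinforcement learning. By introducing relativized options and representations at different levels, it improves the reuse of experience across contexts with similar state spaces, which in turn improves performance.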

2605.22707 2026-05-22 cs.AI cs.HC

Beyond the Org Chart: AI and the Transformation of Invisible Work

Stephanie Rosenthal, Shamsi Iqbal

Affiliations: Microsoft Corporation

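AI summary: This paper studies how AI is changing work processes, in particular invisible cultural practices such as professional mentoring. It proposes steps for making invisible work visible and ways in which individuals and leaders can support colleagues while maintaining a healthy company culture.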

Comments 10 pages

Abstract

An increasing number of news and research articles report that AI adoption is allowing professionals to blur and extend the boundaries of their corporate roles. With the goal of understanding how work processes might be changing in an AI-forward company, we interviewed 24 product-focused individuals at a large technology firm about how AI has impacted their own work, their work within their product team, and their professional interactions. Our conversations suggest that AI is not only changing formal role responsibilities and collaborations between those roles, but also changing informal cultural practices like professional mentoring that are key to helping professionals settle in their positions, stay engaged with their work, and grow their careers. Some of these changes are positive, such as smoother collaboration between peers, but other changes are more nuanced and put the typical career growth opportunities, like receiving feedback from professional networks and promoting leadership and mentorship, at risk. We propose steps that AI companies can take to make the invisible work more visible. Additionally, we propose efforts that individuals and leaders can take to support their colleagues through AI transformation while preserving healthy company cultures that support diverse thinking, collaboration, and informal interactions.

2605.22703 2026-05-22 cs.LG

Hoang-Dung Bui, Abhish Khanal, Raihan Islam Arnob, Gregory J. Stein

Affiliations: George Mason University

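AI summary: This paper proposes the Scout-Assisted Planning framework, in which UAVs proactively gather environmental information to improve ground-vehicle navigation. Information-gain-guided action pruning reduces backtracking costs, and experiments show that the approach significantly lowers ground-robot travel cost across different environments.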

Abstract

Autonomous robot teams navigating partially known environments face costly backtracking when ground robots encounter blocked roads that are only revealed upon physical traversal. We address this with Scout-Assisted Planning (SAP), a heterogeneous planning framework in which scouting Unmanned Aerial Vehicles proactively gather environmental information to improve Unmanned Ground Vehicle navigation. To focus scouting on the most consequential edges, we propose Information Gain-based Action Pruning, which scores candidate scouting actions by their expected impact on ground robot behavior. Since exact Information Gain-based Action Pruning computation is prohibitively expensive, we develop a Graph Neural Network-based model that predicts information gain values directly from graph structure and belief state, reducing planning time to real-time levels without sacrificing solution quality. Experiments across three environment types show that SAP with Information Gain-based Action Pruning reduces ground robot travel cost by 31.9-37.7% over the Canadian Traveler Problem baseline, and outperforms proximity-based scouting guidance by an additional 8-14%, confirming that principled information-gain-guided scouting is both more effective and computationally feasible for real-world deployment.

2605.22691 2026-05-22 cs.LG cond-mat.stat-mech

Posterior Collapse as Automatic Spectral Pruning

Johannes Hirn

Affiliations: Image Processing Laboratory (IPL), Universitat de València, Paterna, València 46980, Spain

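AI summary: This paper studies posterior collapse in β-VAEs and shows that it is, in essence, an automatic spectral pruning process. By analyzing the equilibrium solutions at different values of β, it shows that latent modes decouple and collapse progressively, from the least useful to the most useful.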

2605.22681 2026-05-22 cs.AI

Forecasting Scientific Progress with Artificial Intelligence

Sean Wu, Pan Lu, Yupeng Chen, Jonathan Bragg, Yutaro Yamada, Peter Clark, David Clifton, Philip Torr, James Zou, Junchi Yu

Affiliations: University of Oxford; Stanford University; Allen Institute for AI; Sakana AI

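AI summary: This paper studies how well AI can forecast scientific progress. It proposes a temporally grounded evaluation framework and introduces the CUSP benchmark, which assesses scientific forecasting through feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction. Current frontier models show systematic, domain-dependent limitations, and performance is largely insensitive to whether events fall before or after the training cutoff, indicating that AI still falls short as a tool for scientific forecasting.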

Comments 73 pages, 13 figures, 29 tables

Abstract

Artificial intelligence (AI) is increasingly embedded in scientific discovery, yet whether it can anticipate scientific progress remains unclear. To study this question, we introduce a temporally grounded evaluation framework for forecasting scientific progress under controlled knowledge constraints. We present CUSP (Cutoff-conditioned Unseen Scientific Progress), a multi-disciplinary and event-level benchmark that evaluates scientific forecasting in AI systems through feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction. Across 4,760 scientific events, we observe systematic and domain-dependent limitations in current frontier models. While models can identify plausible research directions from competing candidates, they fail to reliably predict whether scientific advances will be realized and systematically misestimate when they will occur. Performance is highly heterogeneous across domains, with the timing of AI progress more predictable than advances in biology, chemistry, and physics. Performance is largely insensitive to whether events occur before or after the training cutoff, suggesting these limitations cannot be explained solely by knowledge exposure in training data. Under controlled information access, additional pre-cutoff knowledge improves performance but does not close the gap to full-information settings, which becomes more pronounced for high-citation advances. Models also exhibit systematic overconfidence and strong response biases, indicating unreliable uncertainty estimation. Taken together, current AI systems fall short as predictive tools for scientific progress. Access to prior knowledge does not translate into reliable forecasting, and performance benefits more from post-event information than from forward-looking prediction.

2605.22679 2026-05-22 cs.CV cs.LG

Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models

Piotr Kubaty, Patryk Marszałek, Łukasz Struski, Adam Wróbel, Jacek Tabor, Marek Śmieja

Affiliations: Faculty of Mathematics and Computer Science, Jagiellonian University; Doctoral School of Exact and Natural Sciences, Jagiellonian University; Centre for Credible AI, Warsaw University of Technology

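AI summary: This paper proposes CEDAR, a sparse disentanglement method that reveals the compositional structure of pretrained embeddings without increasing dimensionality, improving the interpretability of vision-language models and their alignment with human perception.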

Abstract

Vision-language models learn powerful multimodal embeddings, yet their internal semantics remain opaque. While sparse autoencoders (SAEs) can extract interpretable features, they rely on expanding the representation dimension, which compromises the original geometry and introduces redundancy. We introduce CEDAR (Conceptual Embedding Disentanglement via Adaptive Rotation), a post-hoc method that reveals the compositional structure of pretrained embeddings without increasing dimensionality. By learning an invertible transformation with a top-k sparsity bottleneck, CEDAR concentrates semantic information into axis-aligned disentangled coordinates. In CLIP-like architectures, individual coordinates can be interpreted with textual concepts, while for generative models such as BLIP, they can be decoded into natural language descriptions. Experiments demonstrate that CEDAR achieves a competitive reconstruction-sparsity trade-off while producing explanations that are more interpretable and better aligned with human perception. Our results suggest that the apparent entanglement in vision-language representations can be resolved through a suitable change of basis, eliminating the need for overcomplete expansions.
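
As a rough illustration of the change-of-basis idea in this abstract, the sketch below applies a fixed rotation, keeps the top-k coordinates by magnitude, and inverts the rotation with its transpose. The 2-D rotation, the hand-picked 45-degree angle, and the helper names (`matvec`, `top_k`, `sparse_reconstruct`) are assumptions made for illustration; CEDAR learns its transformation, and nothing here reflects the authors' implementation.

```python
import math


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def top_k(vector, k):
    """Keep the k largest-magnitude entries and zero out the rest."""
    order = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(vector)]


def sparse_reconstruct(rotation, embedding, k):
    """Rotate, sparsify, and map back with the inverse (transpose) rotation."""
    codes = top_k(matvec(rotation, embedding), k)
    reconstruction = matvec(transpose(rotation), codes)
    return codes, reconstruction


if __name__ == "__main__":
    c = 1 / math.sqrt(2)
    rotation = [[c, c], [-c, c]]  # hypothetical basis, chosen by hand
    codes, reconstruction = sparse_reconstruct(rotation, [1.0, 1.0], k=1)
    print(codes, reconstruction)
```

In this toy case the embedding `[1.0, 1.0]` is dense in the original basis but has a single non-zero coordinate after rotation, so a top-1 bottleneck loses nothing. The dimensionality stays at 2 throughout, in contrast to an overcomplete SAE dictionary.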

2605.22678 2026-05-22 cs.CV cs.AI

Swift Sampling: Selecting Temporal Surprises via Taylor Series

Dahye Kim, Bhuvan Sachdeva, Karan Uppal, Naman Gupta, Vineeth N. Balasubramanian, Deepti Ghadiyaram

Affiliations: Boston University; Microsoft Research India

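AI summary: This study proposes Swift Sampling, a training-free frame selection algorithm. It models a video as a differentiable trajectory in the visual latent space and uses a Taylor expansion to predict the path of subsequent frames, automatically identifying highly informative, temporally surprising frames and improving performance on long-video question answering.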

Abstract

While most frames in long-form video are redundant, the critical information resides in temporal surprises: moments where the actual visual features deviate from their predicted evolution. Inspired by the human brain's predictive coding, we introduce Swift Sampling, an elegant, training-free frame selection algorithm that automatically identifies high-information moments in a video. Specifically, we model a video as a differentiable trajectory in the visual latent space and compute the velocity and acceleration of its features. Then, we apply a Taylor expansion to project the expected path of subsequent frames. Frames that diverge sharply from this predicted manifold are identified as temporally surprising frames and selected for sampling. Unlike prior training-free methods that rely on auxiliary networks or video-specific hyperparameter tuning, Swift Sampling is incredibly lightweight, adding only 0.02x additional computational cost over the baseline, which makes its overhead 30x cheaper than that of leading baselines. Across three long-video question answering benchmarks and 10 different downstream tasks, Swift Sampling outperforms uniform sampling and prior query-agnostic baselines. It is especially powerful for long videos with limited frame budgets, improving accuracy by up to 12.5 points.
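
The velocity, acceleration, and Taylor-projection steps described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: per-frame feature vectors are taken as given (for example from a frozen vision encoder), derivatives are approximated with backward finite differences at unit time spacing, and frames are picked by a simple top-`budget` rule. The function names and these choices are hypothetical and are not taken from the paper.

```python
import math


def surprise_scores(features):
    """Score frames 3..N-1 by their distance from a second-order Taylor prediction.

    features: list of equal-length lists of floats, one vector per frame.
    """
    scores = {}
    for t in range(3, len(features)):
        prev1, prev2, prev3 = features[t - 1], features[t - 2], features[t - 3]
        predicted = []
        for a, b, c in zip(prev1, prev2, prev3):
            velocity = a - b              # first backward difference
            acceleration = a - 2 * b + c  # second backward difference
            # Taylor step: x(t) ~ x(t-1) + v * dt + 0.5 * a * dt^2, with dt = 1.
            predicted.append(a + velocity + 0.5 * acceleration)
        scores[t] = math.dist(features[t], predicted)
    return scores


def select_frames(features, budget):
    """Return the indices of the `budget` most surprising frames, in time order."""
    scores = surprise_scores(features)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sorted(ranked[:budget])


if __name__ == "__main__":
    # A 1-D feature that moves at constant speed, then jumps at frame 4.
    track = [[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]]
    print(surprise_scores(track))
    print(select_frames(track, budget=2))
```

Frames on the constant-velocity stretch score zero because the prediction matches them exactly, while the jump and the frame right after it score highest. The frame after the jump is flagged because the finite differences still contain the discontinuity, which is a property of this simplified scheme.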

2605.22677 2026-05-22 cs.CV

Slimmable ConvNeXt: Width-Adaptive Inference for Efficient Multi-Device Deployment

Janek Haberer, Jon Eike Wilhelm, Olaf Landsiedel

Affiliations: Kiel University; Hamburg University of Technology (TUHH); UNU-INWEH

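AI summary: This paper proposes Slimmable ConvNeXt, which enables width-adaptive inference by training a single set of shared weights containing multiple nested subnetworks, so that a model can be deployed efficiently on devices with different resource constraints. The method exploits ConvNeXt's modern design, namely LayerNorm and inverted bottlenecks, for channel-width slimming, removing normalization overhead and simplifying the training pipeline.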

Comments Accepted at Mobile AI Workshop 2026 (CVPR'26 Workshop)

Abstract

Deploying vision models across devices with varying resource constraints, or even on a single device where available compute fluctuates due to battery state, thermal throttling, or latency deadlines, typically requires training and maintaining separate models. Width-adaptive inference addresses this by training a single set of shared weights containing multiple nested subnetworks of increasing capacity, but prior CNN-based approaches required switchable batch normalization, while recent scalable methods have focused on Vision Transformers. We present Slimmable ConvNeXt, which shows that ConvNeXt's modern design, specifically LayerNorm and inverted bottlenecks, makes it particularly suited for channel-width slimming, eliminating the normalization overhead of classical slimmable networks and producing a simpler training pipeline than both prior CNN and ViT approaches. On ImageNet-1k, Slimmable ConvNeXt-T with 3 subnetworks achieves 80.8% top-1 accuracy at 4.5 GMACs and 77.4% at 1.2 GMACs, trained from scratch for 600 epochs. At comparable compute, this exceeds HydraViT's 6-head subnetwork (78.4% at 4.6 GMACs) by 2.4 percentage points and its 3-head configuration (73.0% at 1.3 GMACs) by 4.4 percentage points, while also outperforming MatFormer-S (78.6%) and SortedNet-S (78.2%) at the same GMACs. Scaling to Slimmable ConvNeXt-B further improves maximum accuracy to 82.8% at 15.35 GMACs.

2605.22675 2026-05-22 cs.CL

Moral Semantics Are Preserved in Machine Translation: Cross-Lingual Evidence from a Moral Foundations Corpus

Maciej Skorski

Affiliations: University of Luxembourg

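AI summary: This study investigates whether LLM-based translation can compensate for the scarcity of language-specific annotated corpora for moral values classification. Using Polish as a test case, it shows that direct translation preserves subtle moral cues well, offering a practical route to moral values research in under-resourced languages.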

Abstract

Moral language is subtle and culturally variable, making it difficult to translate faithfully across languages. Idiomatic expressions, slang, and cultural references introduce hard-to-avoid translation artifacts. Yet automated moral values classification depends on language-specific annotated corpora that exist almost exclusively in English. We investigate whether LLM-based translation can bridge this gap, taking Polish as a test case. Using about 50k morally annotated social media posts from a diverse range of topics, we apply a principled four-method validation pipeline: LaBSE cross-lingual embedding similarity, Centered Kernel Alignment (CKA), LLM-as-judge evaluation, and deep learning classifier parity tests. We show that despite shortcomings in handling slang, vulgarity, and culturally loaded expressions, direct translation preserves subtle moral cues well enough to be harvested by cross-lingual machine learning: mean cosine similarity is 0.86 and AUC gaps are 0.01-0.02 across all foundations, closing further when language models are fine-tuned. These results demonstrate that machine translation is a practical and cost-effective path to moral values research in languages currently under-resourced in this domain. We demonstrate this for Polish as a representative Slavic language, with expected generalisation to related languages.
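
For readers unfamiliar with Centered Kernel Alignment, one of the four validation methods named in this abstract, the sketch below computes the standard linear variant, CKA(X, Y) = ||Yc^T Xc||_F^2 / (||Xc^T Xc||_F ||Yc^T Yc||_F), where Xc and Yc are the column-centered representation matrices. Whether the paper uses the linear or a kernel variant is not stated here, so the choice of linear CKA is an assumption, as are the function names and the tiny hand-made matrices.

```python
import math


def center_columns(rows):
    """Subtract the column mean from every entry of a row-major matrix."""
    n = len(rows)
    means = [sum(column) / n for column in zip(*rows)]
    return [[value - mean for value, mean in zip(row, means)] for row in rows]


def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices with equal row counts."""
    total = 0.0
    for column_a in zip(*a):
        for column_b in zip(*b):
            dot = sum(x * y for x, y in zip(column_a, column_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two representations of the same n examples."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(xc, yc)
    denominator = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    return numerator / denominator


if __name__ == "__main__":
    # Rows are examples; `translated` is `original` rotated by 90 degrees and scaled by 2.
    original = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
    translated = [[0.0, 2.0], [-2.0, 0.0], [0.0, -2.0], [2.0, 0.0]]
    print(linear_cka(original, translated))
```

Linear CKA is invariant to rotations and uniform scaling of either representation, which is why it suits comparing embeddings of source posts with embeddings of their translations: the two spaces need not share axes for the score to be high.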

2605.22658 2026-05-22 cs.CV cs.LG cs.MM eess.IV

SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation

Zhenyu Lu, Liupeng Li, Jinpeng Wang, Haoqian Kang, Yan Feng, Ke Chen, Yaowei Wang

Affiliations: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Peng Cheng Laboratory; Harbin Institute of Technology, Shenzhen; Meituan, Beijing; University of Chinese Academy of Sciences; College of Computer Science and Technology, Jilin University

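AI summary: This paper proposes SegCompass, an end-to-end model that uses a sparse autoencoder to build an interpretable alignment pathway, improving both the performance and the interpretability of reasoning segmentation.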

Comments Accepted by CVPR 2026. 15 pages, 9 figures, 6 tables

Abstract

While large language models provide strong compositional reasoning, existing reasoning segmentation pipelines fail to transparently connect this reasoning to visual perception. Current methods, such as latent query alignment, are end-to-end yet opaque "black boxes". Conversely, textual localization readout is merely readable, not truly interpretable, often functioning as an unconstrained post-hoc step. To bridge this interpretability gap, we propose SegCompass, an end-to-end model that leverages a Sparse Autoencoder (SAE) to forge an explicit, interpretable, and differentiable alignment pathway. Given an image-instruction pair, SegCompass first generates a chain-of-thought (CoT) trace. The core of our method is an SAE that maps both the CoT and visual tokens into a shared, high-dimensional sparse concept space. A query codebook selects salient concepts from this space, which are then spatially grounded by a slot mapper into a multi-slot heatmap that guides the final mask decoder. The entire model is trained jointly, unifying reinforcement learning for the reasoning path with standard segmentation supervision. This SAE-driven interface provides a "white-box" connection that is significantly more traceable than latent queries and more coherent than textual readouts. Extensive experiments on five challenging benchmarks demonstrate that SegCompass matches or surpasses state-of-the-art performance. Crucially, our visual and quantitative analyses show a strong correlation between the quality of the learned sparse concepts and final mask accuracy, confirming that SegCompass achieves superior results through its enhanced and inspectable alignment. Code is available at https://github.com/ZhenyuLU-Heliodore/SegCompass.

2605.22654 2026-05-22 cs.CL cs.CV

Seeing the Poem: Image-Semantic Detection of AI-Generated Modern Chinese Poetry with MLLMs

Shanshan Wang, Fengying Ye, Hanjia Lyu, Caiwen Gou, Junchao Wu, Jingming Yao, Chengzhong Xu, Jiebo Luo, Derek F. Wong

Affiliations: Department of Computer and Information Science, University of Macau; University of Rochester; Sichuan University; Department of Portuguese, Faculty of Arts and Humanities, University of Macau

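AI summary: This paper proposes an image-semantic guided poetry detection method that combines image content with the text of a poem to improve how well LLMs detect AI-generated modern Chinese poetry. Experiments show that it outperforms traditional methods on multiple datasets.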

Abstract

Previous detection studies have shown that LLMs cannot be effectively used as detectors, but these studies have not addressed modern Chinese poetry. Moreover, no relevant research has explored the performance of LLMs in detecting modern Chinese poetry. This paper evaluates and enhances the performance of LLMs as detectors for modern Chinese poetry, and proposes an image-semantic guided poetry detection method. Compared with traditional detection approaches, our method innovatively incorporates images that reflect the content of the poetry. Through example-driven approaches, our method effectively integrates information such as meaning, imagery, and feeling from the image, then forms a complementary judgment with the poem text. Experimental results demonstrate that the LLM detectors based on our method outperform baseline detectors based on plain text, and even surpass the best-performing traditional detector, RoBERTa. The Gemini detector using our method achieves a Macro-F1 score of 85.65%, reaching the state-of-the-art level. The performance improvements of different LLM detectors on data generated by multiple LLMs demonstrate the effectiveness of our method.

2605.22651 2026-05-22 cs.CV

What Does the Caption Really Say? Counterfactual Phrase Intervention for Compositional Data Selection in Vision-Language Pretraining

Hyejin Go, Semi Lee, Hyesong Choi

Affiliations: Soongsil University

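AI summary: This paper studies how counterfactual phrase intervention can improve compositional data selection in vision-language pretraining. It proposes CPI to address the saturation of global filtering signals in existing methods, improving model performance on relation recognition tasks.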

Comments 11 pages, 2 figures, 4 tables. Preprint

Abstract

CLIP-style contrastive pretraining typically curates web-scale image-text pairs using sample-level filtering signals, often based on pair-level alignment. We show that this signal saturates: once coarse mismatches are removed, stricter global filtering no longer tracks the compositional supervision provided by the retained captions. The reason is structural: a global score conflates whether a pair is broadly plausible with whether the individual object, attribute, and relation phrases inside the caption materially support the image-text match. The latter is what compositional generalization demands, yet pair-level filters are blind to it. We address this with Counterfactual Phrase Intervention (CPI), a phrase-level curation framework that converts controlled nonce-token substitutions into image-conditioned phrase-sensitivity scores. CPI uses global alignment only for coarse mismatch removal, then ranks the surviving pool by whether caption phrases measurably affect the image-text score under controlled substitution. We frame CPI as a first-order phrase-sensitivity signal rather than a grounding or identification result, and evaluate it at CC3M scale. Ranking by this signal yields a 50%-data subset that improves VL-CheckList-VG Relation by +1.91 over the full-data baseline and +1.00 over alignment-only filtering at matched budget, while improving SugarCrepe overall and preserving general transfer. CPI is loss-orthogonal: applied unchanged to NegCLIP, it further improves VL-CheckList-VG Relation by +3.84, with additional CE-CLIP gains in the main text.
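
The nonce-substitution idea in this abstract can be sketched as a score difference: replace one phrase with a meaningless token and measure how much the image-text score drops. Everything concrete below is a stand-in chosen for illustration, including the nonce word `blicket`, the function names, and `toy_score`, a word-overlap scorer that replaces a real CLIP-style image-text similarity. It is not the authors' scoring or ranking procedure.

```python
def phrase_sensitivity(score_fn, image, caption, phrases, nonce="blicket"):
    """Return, per phrase, the score drop caused by replacing it with a nonce token."""
    base = score_fn(image, caption)
    return {
        phrase: base - score_fn(image, caption.replace(phrase, nonce))
        for phrase in phrases
    }


def toy_score(image_tags, caption):
    """Stand-in for an image-text score: fraction of caption words found in the image tags."""
    words = caption.lower().split()
    return sum(word in image_tags for word in words) / len(words)


if __name__ == "__main__":
    image_tags = {"dog", "red", "ball"}  # hypothetical contents of one image
    caption = "a dog chasing a red ball"
    print(phrase_sensitivity(toy_score, image_tags, caption, ["dog", "chasing"]))
```

With this toy scorer, replacing "dog" lowers the score while replacing "chasing" changes nothing, so the relation phrase contributes no measurable signal for this pair. A pair-level alignment score would not expose that difference, which is the gap the abstract describes.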

2605.22650 2026-05-22 cs.CL

SE3Kit: A Lightweight Python Library for Specialized Geometric Primitives in Robotics

Daniyal Maroufi, Omid Rezayof, Farshid Alambeigi

Affiliations: Walker Department of Mechanical Engineering and Texas Robotics at The University of Texas at Austin

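AI summary: This paper presents SE3Kit, a lightweight Python library for efficient operations on the special Euclidean group SE(3) and the special orthogonal group SO(3). It provides mathematically rigorous implementations and is suited to embedded deployment, rapid prototyping, and education.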

2605.22631 2026-05-22 cs.CV

AtomicMotion: Learning Human Motion From Different Human Parts

Runzhen Liu, Chuhua Xian, Fa-Ting Hong

Affiliations: School of Computer Science and Engineering; South China University of Technology; Department of Computer Science and Engineering; The Hong Kong University of Science and Technology

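AI summary: This study proposes AtomicMotion, a framework that decouples and re-integrates body dynamics to address the challenge of accurately reconstructing full-body poses from sparse head and hand trajectories. Its core components are a logical body partitioning scheme, a masked full-body pre-conditioning strategy, and a kinematic attention mechanism. Experiments show that it significantly outperforms existing baselines on the AMASS dataset.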

Abstract

Accurately reconstructing full-body poses from sparse head and hand trajectories is a foundational challenge for immersive AR/VR telepresence. Current methods often struggle with error accumulation and unnatural joint coordination, primarily because they treat the human body as a monolithic entity, thereby failing to capture the fine-grained "atomic intents" embedded in subtle signal variations and overlooking the inherent structural topology. To bridge this gap, we present AtomicMotion, a framework designed to decouple and re-integrate body dynamics through three core innovations. First, we introduce a logical body partitioning scheme that decomposes the skeleton into five distinct clusters based on functional intent; this ensures that each partition preserves internal joint synergies while isolating local motion primitives. Second, to robustly map sparse inputs to high-dimensional poses, we employ a masked full-body pre-conditioning strategy during training, forcing the model to internalize global skeletal topology and latent kinematic constraints. Finally, addressing the limitations of vanilla spatial attention, which often ignores fixed physiological connectivity, we propose Kinematic Attention. By embedding the classical kinematic tree structure into the attention mechanism, we ensure biological plausibility in the synthesized motions. Extensive evaluations on the AMASS dataset demonstrate that AtomicMotion significantly outperforms existing baselines, yielding higher reconstruction fidelity and superior biomechanical realism.
