arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.20960 2026-05-21 cs.CL

JobArabi: An Arabic Corpus and Analysis of Job Announcements from Social Media

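Wajdi Zaghouani, Shimaa Amer Ibrahim, Mabrouka Bessghaier, Houda Bouamor

AI summary: This paper introduces JobArabi, a corpus of Arabic job announcements collected between January 2024 and October 2025. It contains 20,528 public posts from X and is intended to support analysis of employment-related discourse in Arabic-speaking online communities, showing the potential of social media for studying labor market communication and linguistic change.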

Comments Accepted at LREC 2026 Main Conference

Abstract

This paper introduces JobArabi, a large-scale corpus of Arabic job announcements collected from social media between January 2024 and October 2025. The dataset contains 20,528 public posts from X and captures more than two years of employment-related discourse across Arabic-speaking online communities. The corpus was compiled using a linguistically informed query framework covering 21 Arabic keyword families that reflect gendered, plural, formal, and dialectal expressions of recruitment language. The resulting dataset includes posts from institutional, commercial, and individual accounts and provides metadata such as timestamps, engagement indicators, and geolocation when available, enabling temporal and regional analysis of employment discourse. Quantitative analysis reveals several sociolinguistic patterns in online recruitment, including the persistence of gendered hiring language, regional variation in occupational demand, and the emotional framing of recruitment messages. These findings highlight the potential of Arabic social media as a resource for studying labor market communication and linguistic change. The JobArabi corpus, together with documentation and collection scripts, will be released to support research in Arabic NLP, computational social science, and digital labor studies.

2605.20956 2026-05-21 cs.LG cs.CY

A Deployment Audit of Release-Side Risk in Conformal Triage under Prevalence Shift

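Chengze Li, Xiao Liu, Hanrong Zhang, Haiyang Peng, Yanghao Ruan, Huanhuan Ma, Chunyu Miao, Qichao Zhou, Xiangrong Qi, Philip Yu

AI summary: This paper proposes a leakage-aware deployment audit for release-side conformal triage. Under prevalence shift, it assesses whether patients who truly experience the target event are released without review, assigning target subjects to three non-overlapping roles so that release safety can be evaluated directly.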

Comments 18 pages, 4 figures, 5 tables

Abstract

Conformal triage converts predictive scores into deployment actions that either release a case, flag it for urgent attention, or defer it to human review. Under prevalence shift, however, the usual summaries of marginal coverage and human-review rate can miss the safety-critical question of whether patients who truly experience the target event are released without review. To address this gap, we introduce a leakage-aware deployment audit for release-side conformal triage. It first assigns target subjects to three non-overlapping roles: prevalence correction, conformal calibration, and held-out release-safety evaluation. This separation then lets the audit evaluate release directly: how many event-positive patients are cleared without review, whether the pilot has enough event labels for calibration, and how the safety-review trade-off shifts. Applying this audit to a retrospective NSCLC pilot shows why lower review can be misleading: after prevalence correction, the pooled conformal branch lowers review by releasing more patients, some of whom are event-positive. Within the audit, the classwise branch acts as a scarcity diagnostic: the pilot has too few event labels to certify safe low-review release.
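
The release-side question in this abstract can be illustrated with a minimal sketch. It assumes two hypothetical score thresholds, `release_below` and `flag_above`, and a small made-up held-out set; it does not reproduce the paper's prevalence correction or conformal calibration.

```python
from dataclasses import dataclass

@dataclass
class Case:
    score: float   # predicted risk of the target event, in [0, 1]
    event: bool    # ground-truth label, known only on held-out data

def triage(score, release_below, flag_above):
    """Map a score to one of three deployment actions."""
    if score < release_below:
        return "release"
    if score > flag_above:
        return "flag"
    return "defer"

def release_audit(cases, release_below, flag_above):
    """Summarise review rate and event-positive releases on held-out cases."""
    actions = [triage(c.score, release_below, flag_above) for c in cases]
    released = [c for c, a in zip(cases, actions) if a == "release"]
    missed = sum(c.event for c in released)
    positives = sum(c.event for c in cases)
    return {
        "review_rate": actions.count("defer") / len(cases),  # deferred to humans
        "released": len(released),
        "missed_events": missed,
        "miss_rate_among_positives": missed / positives if positives else 0.0,
    }

# Hypothetical held-out cases (illustrative values only).
held_out = [Case(0.05, False), Case(0.10, True), Case(0.40, False),
            Case(0.55, True), Case(0.90, True), Case(0.15, False)]
strict = release_audit(held_out, release_below=0.08, flag_above=0.8)
loose = release_audit(held_out, release_below=0.20, flag_above=0.8)
print(strict)
print(loose)
```

In this toy data the looser threshold halves the review rate but releases one event-positive case, which is the pattern that a summary of review rate alone would hide.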

2605.20955 2026-05-21 cs.CV

DrawMotion: Generating 3D Human Motions by Freehand Drawing

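Tao Wang, Lei Jin, Zhihua Wu, Qiaozhi He, Jiaming Chu, Yu Cheng, Junliang Xing, Jian Zhao, Shuicheng Yan, Li Wang

AI summary: This work proposes DrawMotion, a diffusion-based framework that generates 3D human motions conditioned on freehand drawings and text, reducing user input time and improving generation accuracy.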

Abstract

Text-to-motion generation, which translates textual descriptions into human motions, faces the challenge that users often struggle to precisely convey their intended motions through text alone. To address this issue, this paper introduces DrawMotion, an efficient diffusion-based framework designed for multi-condition scenarios. DrawMotion generates motions based on both a conventional text condition and a novel hand-drawing condition, which provide semantic and spatial control over the generated motions, respectively. Specifically, we tackle the fine-grained motion generation task from three perspectives: 1) freehand drawing condition. To accurately capture users' intended motions without requiring tedious textual input, we develop an algorithm to automatically generate hand-drawn stickman sketches across different dataset formats; 2) multi-condition fusion. We propose a Multi-Condition Module (MCM) that is integrated into the diffusion process, enabling the model to exploit all possible condition combinations while reducing computational complexity compared to conventional approaches; and 3) training-free guidance. Notably, the MCM in DrawMotion ensures that its intermediate features lie in a continuous space, allowing classifier-guidance gradients to update the features and thereby aligning the generated motions with user intentions while preserving fidelity. Quantitative experiments and user studies demonstrate that the freehand drawing approach reduces user time by approximately 46.7% when generating motions aligned with their imagination. The code, demos, and relevant data are publicly available at https://github.com/InvertedForest/DrawMotion.

2605.20948 2026-05-21 cs.CL

Memory Grafting: Scaling Language Model Pre-training via Offline Conditional Memory

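Runxi Cheng, Yuchen Guan, Yongxian Wei, Qianpu Sun, Qixiu Li, Sinan Du, Feng Xiong, Chun Yuan, Yan Lu, Yeyun Gong

AI summary: This paper proposes Memory Grafting, which scales language-model pre-training by using frozen hidden states as conditional n-gram memory. Offline processing and an efficient retrieval mechanism increase model capacity, and experiments show gains over MoE and vanilla Engram baselines at different scales.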

Comments 25 pages, 12 figures, 5 tables

Abstract

Scaling conditional memory offers a promising way to increase language-model capacity, but existing methods such as Engram learn large memory tables from scratch during pre-training, making memory scaling expensive and sometimes ineffective. We propose Memory Grafting, a conditional memory scaling method that utilizes frozen hidden states from a grafting model as conditional n-gram memory. Given frequent local n-grams, we run the grafting model offline, store final-token hidden representations as memory values, and let the recipient model retrieve them through exact longest-match suffix lookup. Retrieved memories are adapted by lightweight projections and gates, while a hash-based Engram fallback preserves coverage for unmatched contexts. Since the grafting model is only run offline and exact lookup has expected O(1) complexity with respect to memory-bank size, Memory Grafting expands external latent capacity with limited training and inference overhead. Experiments under matched recipient architectures and pre-training budgets show that Memory Grafting improves over both MoE and vanilla Engram baselines. In the 2.8B-scale setting, it improves the average benchmark score from 51.95 for MoE and 52.43 for vanilla Engram to 53.86. In the 0.92B-scale setting, all grafting-model variants improve over the baselines, with Qwen3.5-35B-A3B giving the strongest gains. These results suggest that pretrained models can serve as reusable constructors of external latent memory, providing a practical step toward scaling future language models beyond trainable parameters alone.
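
The exact longest-match suffix lookup mentioned in this abstract can be sketched with an ordinary hash map. This is an illustration under assumptions: the function names are hypothetical, the stored values are placeholder strings standing in for hidden-state vectors, and the fallback is only a stand-in for a hash-based table, not the Engram design.

```python
def build_memory(ngram_values):
    """Index stored values by n-gram, represented as a tuple of token ids."""
    return {tuple(ngram): value for ngram, value in ngram_values}

def longest_suffix_lookup(memory, context, max_n):
    """Return (n-gram, value) for the longest stored suffix of context, else None."""
    for n in range(min(max_n, len(context)), 0, -1):
        key = tuple(context[-n:])
        if key in memory:          # dict probe: expected O(1) in the memory size
            return key, memory[key]
    return None

def hashed_fallback(table, context, n=2):
    """Stand-in fallback: bucket the last n tokens into a fixed-size table."""
    return table[hash(tuple(context[-n:])) % len(table)]

memory = build_memory([
    ((7,), "h_7"),
    ((5, 7), "h_57"),
    ((3, 5, 7), "h_357"),
])
hit = longest_suffix_lookup(memory, [1, 3, 5, 7], max_n=4)
miss = longest_suffix_lookup(memory, [1, 2], max_n=4)
fallback = hashed_fallback(["e0", "e1", "e2"], [1, 2])
print(hit, miss, fallback)
```

Each query makes at most `max_n` probes, so the lookup cost depends on the longest n-gram length rather than on how many entries the memory bank holds.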

2605.20946 2026-05-21 cs.CL

Thinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation

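Xuan Du, Qiangyu Yan, Wenshuo Li, Borui Jiang, Changming Xiao, Han Shu, Xinghao Chen

AI summary: This paper proposes InterRS, a controlled, interleaved reasoning method for real-time speech generation. By inserting reasoning steps during natural speech generation, it improves both speech fluency and reasoning depth; experiments show better results on mathematical and logic benchmarks and more natural, fluent answers.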

Abstract

The thinking-while-speaking paradigm aims to make AI communication more human. A key challenge is maintaining fluent speech while performing deep reasoning. Our method, InterRS, tackles this by inserting reasoning steps only during natural speech generation. This requires high-quality data in which reasoning and speech are precisely aligned and their length ratio is controlled. We introduce a novel pipeline to generate such seamlessly interleaved audio data. To train our model, we combine interleaved SFT on refined data with reinforcement learning using two new rewards: a TA-Balance Reward to manage timing and the thinking-answer ratio, and a Linguistic Quality Reward to refine expression. Experiments show that our approach achieves 13% better performance on mathematical and logic benchmarks while generating instant responses like those of a spoken-language instruct model that outputs a fast CoT response. Furthermore, our method generates more natural and fluent answers than prior methods.

2605.20942 2026-05-21 cs.CV

Bridging Structure and Language: Graph-Based Visual Reasoning for Autonomous Road Understanding

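Lena Wild, Katie Z Luo, Marco Pavone

AI summary: This paper proposes the Combined Road Substrate (CRS) framework, which makes graph structure and open-vocabulary semantics jointly executable, addressing the trade-off between structural precision and semantic flexibility in road understanding for autonomous driving.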

Abstract

Structured road understanding of lane geometry, topology, and traffic element relationships is foundational to safe autonomous driving. While vision-language models (VLMs) offer promising semantic flexibility, they lack the geometric and relational grounding required for precise road reasoning. Conversely, traditional modular systems, e.g., HD maps and topological road graphs, provide structural precision but remain semantically rigid. To bridge this gap, we introduce the Combined Road Substrate (CRS), a graph-grounded framework that makes geometric road structure and open-vocabulary semantics jointly executable in a single representation. CRS enables the automatic generation of compositionally complex and linguistically varied question-answer pairs via recursive graph queries, augmented with a "grounding for free" mechanism that ensures logical traceability to specific map elements, and procedurally extracted chain-of-thought supervision traces. We demonstrate that state-of-the-art VLMs - including large, closed-source models - struggle significantly with structured road reasoning, yet training a small 2- or 4-billion-parameter model with as few as 20 to 80 CRS-enriched scenes yields stable gains in compositional reasoning tasks of varying depth. Analysis of model behavior via verifiable reasoning traces reveals a systematic shift in failure modes: whereas baseline models fail at relational scene understanding, CRS-trained models reduce failures to attribute recognition, suggesting that the primary bottleneck in road understanding is not model scale, but the absence of structured supervision.

2605.20941 2026-05-21 cs.CV cs.GR cs.HC

PaintCopilot: Modeling Painting as Autonomous Artistic Continuation

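Yunge Wen, Yuancheng Shen, Paul Pu Liang

AI summary: This paper proposes PaintCopilot, a neural painting assistant that models painting as open-ended autoregressive artistic behavior. It predicts future strokes from the evolving canvas state and prior brushstroke history without a target image, unlike existing neural painting methods, which frame painting as pixel reconstruction toward a predefined reference image.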

Abstract

We present PaintCopilot, a co-creative neural painting assistant that models painting as an open-ended autoregressive artistic behavior conditioned on evolving canvas states and prior brushstroke history, without requiring a target image. Unlike existing neural painting methods that frame painting as pixel reconstruction toward a predefined reference, PaintCopilot predicts future strokes directly from learned artistic dynamics, analogous to how large language models continue text sequences from prior context. The framework proposes three complementary models: a ViT-based Target Predictor that infers artist intent from partial canvas observations, an autoregressive Next Stroke Predictor that generates temporally coherent brushstrokes via flow matching, and a VAE-based Region Sampler that synthesizes semantically localized stroke sequences on demand. Built on three differentiable brush representations (Hard Round, Brush Tip, and 2D Gaussian), the system supports four interactive workflows: Optimize History, Stroke Completion, Region Inpainting, and Dynamic Brush. Through case studies with professional artists, we demonstrate that PaintCopilot enables fluid co-creative painting workflows in which artists and AI continuously alternate control throughout the creative process.

2605.20940 2026-05-21 cs.CV

3D Reconstruction and Knowledge Distillation to Improve Multi-View Image Models to Explore Spike Volume Estimation in Wheat

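Olivia Zumsteg, Jannis Widmer, Yann Bourdé, Norbert Kirchgessner, Andreas Hund, Lukas Roth, Paraskevi Nousi

AI summary: This paper proposes a hybrid 2D-3D approach that uses knowledge distillation during training so that the model can run efficient image-only inference. It combines a rigid-invariant point cloud network based on distance-based histogram features with the proposed multi-view image-based regulated Transformer (RT) in an ensemble architecture, then transfers the ensemble's knowledge to a purely image-based model through feature-based or label-based distillation, improving the accuracy and efficiency of wheat spike volume estimation.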

Comments 8 pages, 6 figures (Appendix: 4 pages, 5 figures)

Abstract

Accurate estimation of wheat spike volume is important for yield component analysis and stress resilience assessment, yet field-based measurement remains challenging. Active 3D sensing methods such as Light Detection and Ranging (LiDAR) or time-of-flight (ToF) are sensitive to plant motion or poorly suited to outdoor conditions, while 3D reconstructions are computationally expensive. Direct 2D image processing would offer computational advantages, but image-based models lack explicit geometric information. We therefore propose a hybrid 2D-3D approach with knowledge distillation during training while enabling efficient image-only inference. First, we train a rigid-invariant point cloud network using distance-based histogram features to obtain pose-robust geometric representations. We then combine the 3D model with a proposed multi-view image-based regulated Transformer (RT) in an ensemble architecture. Finally, we distill the ensemble knowledge into a purely image-based student model using either feature-based or label-based distillation. The two distilled RTs reduce the mean absolute error (MAE) from 654.31 mm$^3$ of the non-distilled RT to 639.93 mm$^3$ and 644.62 mm$^3$, and increase correlation from 0.76 to 0.77 and 0.82, respectively. At the same time, inference time is reduced from 160 ms to 1.4 ms per spike. Distillation further mitigates volume-dependent bias and reshapes the latent representation of the image model toward a geometry-aware shape. Our results demonstrate that 3D-informed training of a 2D Transformer allows for scalable and efficient spike volume estimation for high-throughput field phenotyping.
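
The difference between label-based and feature-based distillation can be sketched with two toy loss functions. The weights `alpha` and `beta`, the choice of MAE and MSE terms, and all numbers are assumptions for illustration; the abstract does not specify the losses the authors use.

```python
def mae(pred, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def label_distillation_loss(student_pred, teacher_pred, truth, alpha=0.5):
    """Blend the supervised error with the error against the teacher's predictions."""
    return (1 - alpha) * mae(student_pred, truth) + alpha * mae(student_pred, teacher_pred)

def feature_distillation_loss(student_pred, truth, student_feat, teacher_feat, beta=1.0):
    """Supervised error plus a penalty pulling student features toward the teacher's."""
    return mae(student_pred, truth) + beta * mse(student_feat, teacher_feat)

# Hypothetical volumes (mm^3) and latent features for two spikes.
label_loss = label_distillation_loss([10.0, 20.0], [12.0, 18.0], [11.0, 21.0])
feature_loss = feature_distillation_loss([10.0, 20.0], [11.0, 21.0], [0.0, 1.0], [0.0, 3.0])
print(label_loss, feature_loss)  # 1.5 3.0
```

In both variants the teacher is needed only while training, which is why inference can run on images alone.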

2605.20936 2026-05-21 cs.LG cs.AI cs.CL

DASH: Fast Differentiable Architecture Search for Hybrid Attention in Minutes on a Single GPU

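Weizhe Chen, Miao Zhang, Junpeng Jiang, Yaping Li, Weili Guan, Liqiang Nie

AI summary: This work proposes DASH, a fast differentiable architecture search framework for hybrid attention design. It relaxes discrete layer-wise attention operator placement into continuous architecture logits, prepares reusable teacher-aligned linear candidates, and performs architecture-only search with model and operator weights frozen, which greatly improves search efficiency. On Qwen2.5-3B-Instruct, DASH outperforms existing selector-style hybrid attention design baselines, and it achieves stronger RULER performance than the released Jet-Nemotron models while remaining competitive on overlapping short-context and general benchmarks. Each search run uses only 12.3M tokens and takes about 20 minutes on a single RTX Pro 6000 GPU, about 0.006% of the PostNAS search tokens reported by Jet-Nemotron. These results suggest that high-quality hybrid attention architectures can be obtained through minutes-level differentiable search.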

Comments 19 pages, 7 figures

Abstract

Hybrid attention architectures are becoming an increasingly important paradigm for improving LLM inference efficiency while preserving model quality, making hybrid architecture design a central problem. Existing designs often rely on manual empirical rules or proxy-based selector signals for layer-wise operator allocation. Recent NAS-style systems such as Jet-Nemotron demonstrate the promise of automated hybrid architecture search. However, Jet-Nemotron's PostNAS search stages alone use 200B tokens, making such search pipelines difficult to use as routine methods for hybrid architecture design. We introduce DASH, a fast differentiable search framework for hybrid attention architecture design, which relaxes discrete layer-wise attention operator placement into continuous architecture logits, prepares reusable teacher-aligned linear candidates, and performs architecture-only search with model and operator weights frozen to significantly enhance search efficiency. On Qwen2.5-3B-Instruct, DASH consistently outperforms a comprehensive suite of existing selector-style hybrid attention design baselines, showing that direct differentiable search can discover stronger hybrid architectures. Moreover, DASH achieves stronger RULER performance than released Jet-Nemotron models while remaining competitive on overlapping short-context and general benchmarks. Notably, each DASH search run uses only 12.3M tokens and takes about 20 minutes on a single RTX Pro 6000 GPU, corresponding to merely 0.006% of the PostNAS search tokens reported by Jet-Nemotron. These results suggest that high-quality hybrid attention architectures can be obtained through minutes-level differentiable search, providing a promising direction for hybrid architecture design.
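
The continuous relaxation described in this abstract can be sketched as a softmax over per-layer architecture logits. Only the forward mixing and the final discretization are shown; the operator names, logits, and candidate outputs are hypothetical, and the gradient-based update of the logits is omitted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_output(logits, candidate_outputs):
    """Softmax-weighted mix of the candidate operators' outputs for one layer."""
    weights = softmax(logits)
    dim = len(candidate_outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, candidate_outputs))
            for i in range(dim)]

def discretize(arch_logits, names):
    """Pick the highest-logit operator per layer once the search is done."""
    return [names[max(range(len(row)), key=row.__getitem__)] for row in arch_logits]

names = ["full_attention", "linear_attention"]          # hypothetical operator set
arch_logits = [[0.0, 0.0], [2.0, -1.0], [-0.5, 1.5]]    # one row per layer
layer0 = mixed_output(arch_logits[0], [[1.0, 0.0], [0.0, 1.0]])
chosen = discretize(arch_logits, names)

# Token budget quoted in the abstract: 12.3M search tokens versus 200B.
token_fraction_percent = 12.3e6 / 200e9 * 100
print(layer0, chosen, token_fraction_percent)
```

The last line checks the quoted ratio: 12.3M out of 200B tokens is about 0.00615%, consistent with the 0.006% stated in the abstract.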

2605.20932 2026-05-21 cs.RO

WiXus: A Wheeled-Legged Robot with Wire-Driven Environmental Utilizing to Integrate Mobility and Manipulation

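Shintaro Inoue, Kento Kawaharazuka, Temma Suzuki, Sota Yuzaki, Kei Okada

AI summary: This paper proposes WiXus, a new wheeled-legged robot with a wire-driven mechanism that utilizes the external environment. The robot achieves both planar and three-dimensional locomotion and can repurpose its legs for object manipulation and tool use.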

Comments Accepted at ICRA2026, website - https://shin0805.github.io/wixus/, YouTube - https://youtu.be/32qhUslR0gM

Abstract

Wheeled-legged robots, which have wheels at their feet and achieve high mobility by coordinating wheel drive and leg drive, have so far been developed purely as platforms specialized for locomotion. As a result, they have no means of repurposing their legs for roles other than locomotion, such as object manipulation or tool utilization. In this paper, we address the problem of how to draw out the potential task-execution capability of the legs by freeing them from the role of locomotion through external body support. To this end, we propose and develop a new robot, WiXus, which fuses a wheeled-legged mechanism with a wire-driven mechanism that utilizes the external environment. The developed WiXus demonstrates not only planar locomotion with wheeled-legged drive, but also three-dimensional mobility such as cliff climbing by coordinating wire-driven and wheeled-legged actuation. Furthermore, by suspending the body with wire-driven actuation, WiXus successfully repurposes its legs as arms to perform object manipulation (e.g., rescuing a dog (stuffed animal)) and tool utilization (e.g., harvesting an apple (mockup) with loppers). This study demonstrates that the approach of utilizing the environment with wire-driven actuation is a new design principle that extends the operational domain of wheeled-legged robots.

2605.20929 2026-05-21 cs.RO

STEAM: A Training-Free Congestion-Aware Enhancement Framework for Decentralized Multi-Agent Path Finding

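Mingyang Feng, Mengnuo Zhang, Shaoyuan Li, Xiang Yin

AI summary: This paper proposes STEAM, a training-free framework that enhances learning-based decentralized Multi-Agent Path Finding (MAPF) in discrete environments by injecting lightweight congestion-aware guidance. Spatial avoidance, temporal correction, and density-aware correction improve success rate and efficiency.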

Abstract

We propose STEAM (Spatial, Temporal, and Emergent congestion Awareness for MAPF), a training-free test-time enhancement framework for learning-based decentralized Multi-Agent Path Finding (MAPF) in discrete environments. Given a pretrained decentralized policy, STEAM requires no retraining, architectural modification, or replacement by a centralized planner. Instead, it injects lightweight congestion-aware guidance into the original policy execution. STEAM first rolls out the shortest paths induced by the current cost-to-go maps to identify potential future congestion hotspots. Spatially avoidable congestion is mitigated by updating agent-specific cost-to-go information, while spatially unavoidable bottlenecks are handled through temporal logit correction. In addition, emergent local congestion is reduced by a density-aware logit correction based on neighboring agents' corrected cost-to-go maps. Extensive experiments on representative learning-based decentralized MAPF algorithms show that STEAM consistently improves success rate, makespan, and solution cost, with success-rate gains of up to 60% and only minor computational overhead. The implementation is available at https://anonymous.4open.science/r/STEAM-MAPF-7A62.
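
The first step in this abstract, rolling out shortest paths induced by cost-to-go maps to find likely congestion, can be sketched on a small grid. This is an assumption-laden illustration: the grid, the agents, and the rule "two or more agents in one cell at the same step" are hypothetical, and the paper's cost-to-go updates and logit corrections are not reproduced.

```python
from collections import Counter, deque

MOVES = [(0, 1), (1, 0), (0, -1), (-1, 0)]

def cost_to_go(grid, goal):
    """BFS distances to goal on a 4-connected grid; '#' cells are obstacles."""
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in MOVES:
            nxt = (r + dr, c + dc)
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]] != "#" and nxt not in dist):
                dist[nxt] = dist[(r, c)] + 1
                queue.append(nxt)
    return dist

def rollout(start, dist):
    """Greedy descent on the cost-to-go map; start must be able to reach the goal."""
    path = [start]
    while dist[path[-1]] > 0:
        r, c = path[-1]
        neighbours = [(r + dr, c + dc) for dr, dc in MOVES]
        path.append(min((n for n in neighbours if n in dist), key=dist.get))
    return path

def hotspots(paths, horizon):
    """(step, cell) pairs that two or more agents are predicted to share."""
    counts = Counter()
    for path in paths:
        for t in range(horizon):
            counts[(t, path[min(t, len(path) - 1)])] += 1   # agents wait at goal
    return sorted(key for key, n in counts.items() if n >= 2)

grid = [".#.",
        "...",
        ".#."]
agents = [((0, 0), (0, 2)), ((2, 0), (2, 2))]   # (start, goal) pairs
paths = [rollout(start, cost_to_go(grid, goal)) for start, goal in agents]
predicted = hotspots(paths, horizon=5)
print(predicted)  # [(1, (1, 0)), (2, (1, 1)), (3, (1, 2))]
```

Both agents must pass through the middle row, so the rollout predicts that they share its three cells on consecutive steps, which is the kind of bottleneck the abstract describes as spatially unavoidable.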

2605.20924 2026-05-21 cs.CL cs.AI

Strategy-Induct: Task-Level Strategy Induction for Instruction Generation

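Po-Chun Chen, Hen-Hsen Huang, Hsin-Hsi Chen

AI summary: This work proposes Strategy-Induct, a task-level strategy induction method that generates task instructions from only a few example questions, without labeled answers, and outperforms existing methods on instruction generation.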

Comments Accepted to Findings of ACL 2026

Abstract

Designing effective task-level prompts is crucial for improving the performance of Large Language Models (LLMs). While prior work on instruction induction demonstrates that LLMs can infer better instructions with limited examples, existing approaches often rely on input-output pairs, where obtaining labeled answers can be difficult or costly. To address this limitation, we propose Strategy-Induct, a framework that derives task-level instructions solely from a small set of example questions without requiring labeled answers. Our approach first prompts the model to generate explicit reasoning strategies for each question, forming (strategy, question) pairs. These pairs are then used to induce a task instruction that guides reasoning. Experiments across multiple tasks and model scales demonstrate that Strategy-Induct outperforms state-of-the-art methods in question-only settings. Furthermore, we observe that jointly utilizing LLMs and Large Reasoning Models across task instruction generation and inference may lead to further performance improvements.
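
The two-stage flow in this abstract (a strategy per question, then one instruction induced from the pairs) can be sketched as a small pipeline. The prompt wording, the function name `induce_instruction`, and the stub model are all hypothetical; a real run would replace `stub_model` with an actual LLM call.

```python
def induce_instruction(questions, model):
    """Two-stage induction from questions only; `model` maps a prompt to text."""
    pairs = []
    for q in questions:
        strategy = model(f"Describe a reasoning strategy for this question:\n{q}")
        pairs.append((strategy, q))
    listing = "\n\n".join(f"Strategy: {s}\nQuestion: {q}" for s, q in pairs)
    instruction = model(
        "Write one task-level instruction that covers these cases:\n\n" + listing
    )
    return instruction, pairs

calls = []

def stub_model(prompt):
    """Placeholder for an LLM call; it records prompts and returns a canned reply."""
    calls.append(prompt)
    return f"reply {len(calls)}"

instruction, pairs = induce_instruction(["What is 2 + 2?", "What is 3 * 5?"], stub_model)
print(instruction)
print(pairs)
```

No labeled answer appears in any prompt: the only inputs are the questions and the strategies the model wrote for them.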

2605.20922 2026-05-21 cs.LG cs.AI cs.CV

Winfree Oscillatory Neural Network

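Jiawen Dai, Yue Song

AI summary: This paper proposes WONN, an oscillatory neural network based on generalized Winfree dynamics. It evolves representations on the torus $(S^1)^d$ through structured oscillatory interactions, combining phase-based inductive biases with flexible, hierarchical interaction mechanisms, and achieves competitive, parameter-efficient results on image recognition and complex reasoning tasks.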

Comments Project page: https://jiawen-dai.github.io/WONN_Project_Page/

Abstract

Oscillations and synchronization are widely believed to play a fundamental role in representation and computation. However, existing machine learning approaches based on synchronization dynamics have largely been confined to specialized settings such as object discovery, with limited evidence of scalability to standard vision benchmarks or logic reasoning tasks. We propose the Winfree Oscillatory Neural Network (WONN), a dynamical neural architecture based on generalized Winfree dynamics. WONN evolves representations on the torus $(S^1)^d$ through structured oscillatory interactions, combining phase-based inductive biases with flexible and hierarchical interaction mechanisms instantiated as either fixed trigonometric mappings or learnable neural networks. We evaluate WONN on image recognition and complex reasoning tasks, including CIFAR, ImageNet, Maze-hard, and Sudoku. Across these domains, WONN achieves competitive or superior performance with strong parameter efficiency. In particular, WONN is, to our knowledge, the first synchronization-based oscillatory architecture to scale competitively to ImageNet-1K. Furthermore, on Maze-hard, WONN achieves 80.1% accuracy using only 1% of the parameters of prior state-of-the-art models. These results suggest that structured oscillatory dynamics provide a scalable and parameter-efficient alternative to conventional neural architectures.
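
For orientation, the classical Winfree model that this architecture generalizes can be written as one Euler step over a list of phases. The sketch assumes the common textbook choice of influence function 1 + cos(theta) and sensitivity function -sin(theta); the generalized, learnable interactions used in WONN are not shown, and the parameter values are arbitrary.

```python
import math

def winfree_step(theta, omega, kappa, dt):
    """One explicit Euler step of classical Winfree dynamics, taken modulo 2*pi."""
    pulse = sum(1.0 + math.cos(t) for t in theta) / len(theta)   # mean influence
    return [(t + dt * (w - kappa * math.sin(t) * pulse)) % (2.0 * math.pi)
            for t, w in zip(theta, omega)]

def order_parameter(theta):
    """Kuramoto order parameter: 1.0 means all phases coincide."""
    n = len(theta)
    return math.hypot(sum(math.cos(t) for t in theta) / n,
                      sum(math.sin(t) for t in theta) / n)

theta = [0.2, 1.5, 3.0, 4.4]       # phases on the torus (S^1)^4
omega = [1.0, 1.0, 1.0, 1.0]       # natural frequencies
for _ in range(100):
    theta = winfree_step(theta, omega, kappa=0.5, dt=0.05)
print(theta, order_parameter(theta))
```

Each state coordinate is an angle, so the representation stays on the torus by construction, and the order parameter gives a single number for how synchronized the phases are.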

2605.20920 2026-05-21 cs.CL cs.SD

Evaluating Speech Articulation Synthesis with Articulatory Phoneme Recognition

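Vinicius Ribeiro, Yves Laprie

AI summary: This paper evaluates the quality of speech articulation synthesis using articulatory phoneme recognition as a proxy. It argues that phoneme recognition on articulatory features captures details of phoneme production more accurately, improving how generative models are evaluated.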

Comments Accepted for publication at the European Signal Processing Conference (EUSIPCO), 2026

Abstract

Recent advances in machine learning and the availability of articulatory datasets allow vocal tract synthesis to be conditioned on phonetic sequences, a primary task of articulatory speech synthesis. However, quality assessment needs a better definition. Ranking generative models is generally tricky due to subjectivity, and articulatory synthesis has the additional difficulty of requiring specialized knowledge of vocal tract anatomy and acoustics. To address this problem, this paper proposes to evaluate speech articulation synthesis using phoneme recognition as a proxy. Our hypothesis is that phoneme recognition using articulatory features better captures nuances in phoneme production, such as correct places of articulation, which traditional metrics (e.g., point-wise distance metrics) do not. We train a neural network with acoustic and articulatory features extracted from a single-speaker RT-MRI dataset. Then, we compare the recognition performance when testing the model with different synthetic articulatory features. Our results show that our articulatory feature set is phonetically rich and helps explore additional dimensions of speech articulation synthesis.
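
One way to score recognition performance in such a proxy evaluation is a phoneme error rate based on edit distance. The abstract does not name the metric the authors use, so this is an assumed choice, and the phoneme sequences below are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # delete r
                           cur[j - 1] + 1,           # insert h
                           prev[j - 1] + (r != h)))  # substitute or match
        prev = cur
    return prev[-1]

def phoneme_error_rate(ref, hyp):
    """Edit distance normalised by the reference length."""
    return edit_distance(ref, hyp) / len(ref)

reference = "b o n j u r".split()
from_synthesis_a = "b o n j u r".split()   # recognised from one synthetic feature set
from_synthesis_b = "b o m j u".split()     # recognised from another
per_a = phoneme_error_rate(reference, from_synthesis_a)
per_b = phoneme_error_rate(reference, from_synthesis_b)
print(per_a, per_b)
```

A lower error rate for one synthesizer would indicate that its articulatory features preserve more phonetic detail, for example the place of articulation that separates "n" from "m" here.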

2605.20917 2026-05-21 cs.RO

SubTGraph: Large-Scale Subterranean Environment Synthesis with Controllable Topological Variability for Robotic Autonomy Validation

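F. Labra Caso, A. Saradagi, S. Fredriksson, S. Nordström, A. Koval, G. Nikolakopoulos

AI summary: This paper proposes SubTGraph, a framework for rapidly synthesizing highly variable multi-level subterranean environments. It generates different types of environments from user-specified parameters such as topology, dimensionality, and textures, supporting rigorous validation of each layer of the robotic autonomy stack.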

Comments 16 pages, 18 figures

Abstract

Subterranean (SubT) environments have been a frontier for autonomous robotics, driven by the push for automation of mining operations and the interest in planetary exploration (e.g., Martian lava tubes). Due to the challenges involved in accessing real SubT environments, rigorous hardening of autonomy stacks in realistic simulation environments is critical. This article fills a well-known gap: no large-scale simulation-based benchmarking infrastructure is available for rigorous statistical evaluation of robotic autonomy, so SubT research articles commonly present validation results in a few environments at best. This article presents SubTGraph, a novel framework for rapid synthesis of multi-level SubT environments with high variability, incorporating user specifications related to topology, dimensionality, textures, etc., to generate distinct environments such as operational mines, natural caves, and lava tubes. SubTGraph builds a cost matrix from user-specified structural constraints to guide the classical Dijkstra algorithm, which procedurally generates SubT worlds from topometric tiles of the DARPA World Generator. Three robotics case studies demonstrate the utility of SubTGraph for rigorous validation of different layers in the robotic autonomy stack: structural semantic segmentation is validated against topometric ground truths, multi-agent path planning is tested widely to identify patterns and trends in algorithm behavior, and LIO SLAM is stress-tested in challenging subterranean sections to identify failure cases. The SubTGraph world creation codebase is open-sourced (https://github.com/LTU-RAI/SubTGraph.git) along with a database consisting of 150 highly variable underground worlds.
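
How a cost matrix can steer Dijkstra's algorithm toward a desired layout can be sketched on a tiny grid. The cost values and the grid are hypothetical, and the mapping from path cells to DARPA World Generator tiles is not shown; this is not the SubTGraph implementation.

```python
import heapq

def dijkstra_path(cost, start, goal):
    """Cheapest 4-connected path through a cost matrix; entering a cell pays its cost."""
    rows, cols = len(cost), len(cost[0])
    best = {start: 0}
    parent = {}
    heap = [(0, start)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if d > best[cell]:
            continue                      # stale heap entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + cost[nr][nc]
                if nd < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = nd
                    parent[(nr, nc)] = cell
                    heapq.heappush(heap, (nd, (nr, nc)))
    path = [goal]
    while path[-1] != start:              # walk parents back to the start
        path.append(parent[path[-1]])
    return path[::-1], best[goal]

# High costs mark cells that the (hypothetical) user constraints discourage.
cost = [[1, 9, 1],
        [1, 9, 1],
        [1, 1, 1]]
tunnel, total = dijkstra_path(cost, start=(0, 0), goal=(0, 2))
print(tunnel, total)
```

Raising the cost of the middle column forces the tunnel to detour along the bottom row, so changing the cost matrix changes the generated topology without touching the search itself.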

2605.20916 2026-05-21 cs.CL

Task-Routed Mixture-of-Experts with Cognitive Appraisal for Implicit Sentiment Analysis

具有认知评估的任务路由混合专家模型用于隐含情感分析

Yaping Chai, Haoran Xie, Joe S. Qin

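AI summary This paper proposes a multi-task learning framework grounded in cognitive appraisal theory that improves implicit sentiment analysis through two auxiliary tasks, implicit sentiment detection and cognitive rationale generation, and uses a task-level mixture-of-experts model to reduce task interference; experiments show the method outperforms existing approaches on the implicit sentiment subset.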

Comments 8 pages, 4 figures, and 3 tables

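Details
AI abstract (translated from Chinese)

Implicit sentiment analysis is challenging because the attitude toward an aspect is usually inferred from events rather than expressed through explicit opinion words. Existing models typically learn only from the final polarity label, which limits their ability to infer sentiment from context. Inspired by cognitive appraisal theory, we propose an appraisal-aware multi-task learning (MTL) framework for implicit sentiment analysis that supplements polarity prediction with two complementary auxiliary tasks: implicit sentiment detection and cognitive rationale generation. However, training several tasks with different objectives on a single shared backbone limits flexibility and causes task interference. To reduce interference among these related but distinct objectives, we adopt a task-level mixture-of-experts model in which all tasks share one set of experts and task identity controls the sparse combination of those experts. Our method builds on an encoder-decoder architecture and replaces some encoder and decoder blocks with these sparse mixtures. A task-conditioned router selects a sparse expert mixture for each task, and a task-separated routing objective encourages different tasks to learn different expert-selection patterns. Experiments show that our model outperforms recently proposed methods, with strong gains on the implicit sentiment subset. Our code is available at https://github.com/yaping166/TRMoE-ISA.

English abstract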

Implicit sentiment analysis is challenging because sentiment toward an aspect is often inferred from events rather than expressed through explicit opinion words. Existing models typically learn from the final polarity label, which provides limited guidance for reasoning about sentiment from the context. Motivated by cognitive appraisal theory, we propose an appraisal-aware multi-task learning (MTL) framework for implicit sentiment analysis that provides polarity prediction with two complementary auxiliary tasks: implicit sentiment detection and cognitive rationale generation. However, training several objectives with different targets and sharing a single backbone across tasks in MTL limits flexibility and can lead to task interference. To reduce interference among these related but distinct objectives, we adopt task-level mixture-of-experts models in which all tasks share a common set of experts, and task identity controls the sparse combination of these experts. Our method builds on an encoder-decoder architecture and replaces a subset of encoder and decoder blocks with these sparse mixtures. We use a task-conditioned router to select sparse expert mixtures for each task, and a task-separated routing objective to encourage different tasks to learn distinct expert-selection patterns. Experimental results show that our model outperforms recently proposed approaches, with strong gains on the implicit sentiment subset. Our code is available at https://github.com/yaping166/TRMoE-ISA.
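
A minimal sketch of task-conditioned sparse routing may help make the mechanism concrete. It assumes per-task router logits and a top-k softmax gate; the names `top_k_softmax`, `task_routed_layer`, the toy experts, and the logit values are hypothetical and are not taken from the authors' code.

```python
import math

def top_k_softmax(logits, k):
    """Keep the k largest logits, apply softmax over them, and zero the rest."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in keep)
    exps = {i: math.exp(logits[i] - m) for i in keep}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(logits))]

def task_routed_layer(x, task, router_logits, experts, k=2):
    """Mix the outputs of shared experts using gates chosen by task identity."""
    gates = top_k_softmax(router_logits[task], k)
    out = [0.0] * len(x)
    for gate, expert in zip(gates, experts):
        if gate > 0.0:  # experts outside the top-k are never evaluated
            y = expert(x)
            out = [o + gate * v for o, v in zip(out, y)]
    return out, gates

# Three toy experts shared by every task (assumed for illustration).
experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [a + 1.0 for a in v],
    lambda v: [-a for a in v],
]
# Hypothetical learned router logits, one row per task.
router_logits = {
    "polarity": [2.0, 2.0, -5.0],
    "rationale": [-5.0, 1.0, 1.0],
}
out, gates = task_routed_layer([1.0, 3.0], "polarity", router_logits, experts)
```

Here the two tasks share expert 1 but differ in their second expert, which is one simple way distinct expert-selection patterns can coexist with a shared expert pool.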

2605.20915 2026-05-21 cs.CL cs.AI cs.LG

Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models

校准与决策制定:重新审视未学习语言模型中的可靠性悖论

Divyaksh Shukla, Ashutosh Modi

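AI summary This paper studies the gap between calibration and decision-rule reliability in generative language models. Using the multiple-choice question-answering evaluation protocol on the TOFU benchmark, it finds that fine-tuned models have low calibration error and that models after unlearning remain well calibrated while relying more on correlation-based tokens, extending the reliability paradox to machine unlearning.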

Comments Accepted at SRW, ACL 2026; 17 pages (9 + 2 + 6)

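Details
AI abstract (translated from Chinese)

Machine unlearning aims to remove the influence of specific training data from a model while preserving reliable behavior on the remaining data, which makes reliable prediction and uncertainty estimation central to evaluation. Calibration is often used as a proxy for reliability in language models, but low calibration error does not necessarily imply reliable decision rules, because a model can rely on spurious correlations and still be well calibrated. We study this gap in generative language models using the multiple-choice question-answering evaluation protocol on the TOFU benchmark, measuring probabilistic reliability with calibration metrics (ECE, MCE, Brier) and assessing decision-rule reliability through attribution-based shortcut detection with Integrated Gradients and Local Mutual Information. We find that fine-tuned models have lower calibration error (ECE ~ 0.04) than pretrained models (ECE > 0.5), and that models after unlearning keep similarly low calibration error despite reduced accuracy on the forget split, while attribution analysis shows increased reliance on correlation-based tokens. These results show that good calibration can coexist with shortcut-based decision rules after unlearning, extending the reliability paradox to machine unlearning.

English abstract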

Machine unlearning aims to remove the influence of specific training data from a model while preserving reliable behavior on the remaining data, making reliable prediction and uncertainty estimation essential for evaluation. Calibration is commonly used as a proxy for reliability in language models, but low calibration error does not necessarily imply reliable decision rules, as models may rely on spurious correlations while remaining well calibrated. We investigate this gap in generative language models using the multiple-choice question-answering evaluation protocol on the TOFU benchmark, measuring probabilistic reliability with calibration metrics (ECE, MCE, Brier) and decision-rule reliability via attribution-based shortcut detection with Integrated Gradients and Local Mutual Information. We find that fine-tuned models achieve low calibration error (ECE ~ 0.04) compared to pretrained models (ECE > 0.5), and models after unlearning retain similarly low calibration despite reduced accuracy on the forget split, while attribution analysis shows increased reliance on correlation-based tokens. These results demonstrate that good calibration can coexist with shortcut-based decision rules after unlearning, extending the reliability paradox to the machine unlearning setting.
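
For readers unfamiliar with the three calibration metrics named above, the following sketch computes them on toy data. It assumes equal-width confidence bins and uses the top-label (binary) form of the Brier score; the paper's exact binning and Brier variant are not stated in the abstract, and the function name `calibration_metrics` is hypothetical.

```python
def calibration_metrics(confidences, correct, n_bins=10):
    """Return (ECE, MCE, Brier) for top-label confidences and 0/1 correctness."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece, mce = 0.0, 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(o for _, o in b) / len(b)
        gap = abs(avg_conf - accuracy)
        ece += len(b) / n * gap   # gap weighted by the bin's share of samples
        mce = max(mce, gap)       # worst single-bin gap
    brier = sum((c - o) ** 2 for c, o in zip(confidences, correct)) / n
    return ece, mce, brier

# Toy predictions: two confident and correct, two less confident with one miss.
ece, mce, brier = calibration_metrics([0.9, 0.9, 0.6, 0.6], [1, 1, 1, 0])
```

None of these quantities looks at which input features drove the prediction, which is why a model can score well on them while still relying on shortcuts.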

2605.20912 2026-05-21 cs.CL

Enhancing Scientific Discourse: Machine Translation for the Scientific Domain

增强科学论述:面向科学领域的机器翻译

Dimitris Roussis, Sokratis Sofianopoulos, Stelios Piperidis

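AI summary To address the translation challenges that specialized terminology and complex sentence structures pose in the scientific domain, this paper builds multilingual parallel and monolingual corpora and evaluates their quality by fine-tuning general-purpose neural machine translation systems.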

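Details
AI abstract (translated from Chinese)

As the volume of scientific literature grows, the need for communication across languages becomes more pressing. Machine translation (MT) offers a promising way to access international publications. However, the scientific domain poses distinct challenges because of its specialized terminology and complex sentence structures. This paper presents a collection of parallel and monolingual corpora for the scientific domain, targeting the language pairs Spanish-English, French-English, and Portuguese-English. For each language pair, we create one large general scientific corpus and four smaller corpora focused on cancer research, energy research, neuroscience, and transportation research. To assess the quality of these corpora, we use them to fine-tune general-purpose neural machine translation (NMT) systems. We describe the corpus creation process and the fine-tuning strategies used, and close with the evaluation results.

English abstract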

The increasing volume of scientific research necessitates effective communication across language barriers. Machine translation (MT) offers a promising solution for accessing international publications. However, the scientific domain presents unique challenges due to its specialized vocabulary and complex sentence structures. In this paper, we present the development of a collection of parallel and monolingual corpora for the scientific domain. The corpora target the language pairs Spanish-English, French-English, and Portuguese-English. For each language pair, we create a large general scientific corpus as well as four smaller corpora focused on the domains of: Cancer Research, Energy Research, Neuroscience, and Transportation research. To evaluate the quality of these corpora, we utilize them for fine-tuning general-purpose neural machine translation (NMT) systems. We provide details regarding the corpus creation process, the fine-tuning strategies employed, and we conclude with the evaluation results.

2605.20911 2026-05-21 cs.AI cs.LG

For How Long Should We Be Punching? Learning Action Duration in Fighting Games

我们应该持续打击多久?在格斗游戏中学习动作持续时间

Hoang Hai Nguyen, Kurt Driessens, Dennis J. N. J. Soemers

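AI summary This paper studies how learning action durations can improve the decision-making of reinforcement learning agents in fighting games, examining ways to adjust reaction timing dynamically and their effect on performance and behavior patterns.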

Comments Accepted at Computers and Games 2026

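Details
AI abstract (translated from Chinese)

Fighting games such as Street Fighter II pose distinct challenges for reinforcement learning (RL) agents because of their fast, real-time nature. In most RL frameworks, agents are hard-coded to make decisions at a fixed interval, typically every frame or every N frames. This design ensures timely responses but limits the agent's ability to adjust its reaction timing. Acting every frame gives frame-perfect reflexes, which are unrealistic compared with human players, while longer fixed intervals lower computational cost but hurt responsiveness. We consider an alternative decision-making framework in which the agent learns not only which action to take but also how long to execute it. By predicting action and duration jointly, the agent can adapt its responsiveness to different situations in the game. We implement this approach in the open-source FightLadder environment, training agents against the built-in scripted bots and systematically testing different frame-skip configurations to analyze their effect on performance, responsiveness, and learned behavior. Experiments show that learned timing can match the performance of well-chosen fixed frame skips and encourages repeatable action patterns, but does not guarantee robustness on its own. In most cases, agents perform best with consistently high frame-skip values (i.e., low responsiveness). This makes it easier to learn exploitative strategies in which the same action is repeated over and over, to which the scripted bots appear susceptible.

English abstract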

Fighting games such as Street Fighter II present unique challenges to reinforcement learning (RL) agents due to their fast-paced, real-time nature. In most RL frameworks, agents are hard-coded to make decisions at a fixed interval, typically every frame or every N frames. Although this design ensures timely responses, it restricts the agent's ability to adjust its reaction timing. Acting every frame grants frame-perfect reflexes, which are unrealistic compared to human players, whereas longer fixed intervals reduce computational cost but hinder responsiveness. We consider an alternative decision-making framework in which the agent learns not only what action to take but also for how long to execute it. By jointly predicting both action and duration, the agent can dynamically adapt its responsiveness to different situations in the game. We implement this method using the open-source FightLadder environment with agents trained against scripted built-in bots, systematically testing different frame skip configurations to analyze their influence on performance, responsiveness, and learned behavior. Experiments show that learned timing can match the performance of well-chosen fixed frame skips and encourages repeatable action patterns, but does not ensure robustness on its own. In most cases, we see agents performing best with consistently high frame skip values (i.e., low responsiveness). This strategy makes it easier to learn exploitative strategies where the same action is repeated over and over, which the scripted bots appear to be susceptible to.
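
The difference between a fixed frame skip and a learned duration can be sketched with a toy environment. Everything below is an assumption for illustration: `CountingEnv` is a stand-in, not the FightLadder API, and `step_with_duration` simply repeats one action for the number of frames the policy chose.

```python
class CountingEnv:
    """Toy frame-based game: reward 1 for each frame spent on the 'punch' action."""

    def __init__(self, episode_frames=10):
        self.frame = 0
        self.episode_frames = episode_frames

    def step(self, action):
        self.frame += 1
        reward = 1.0 if action == "punch" else 0.0
        done = self.frame >= self.episode_frames
        return self.frame, reward, done

def step_with_duration(env, action, duration):
    """Hold one action for `duration` frames, summing reward and stopping at episode end."""
    total, obs, done = 0.0, None, False
    for _ in range(duration):
        obs, reward, done = env.step(action)
        total += reward
        if done:
            break
    return obs, total, done

env = CountingEnv(episode_frames=10)
# A policy that outputs (action, duration) pairs makes one decision per call.
obs, reward, done = step_with_duration(env, "punch", 4)     # frames 1 to 4
obs2, reward2, done2 = step_with_duration(env, "block", 8)  # cut short at frame 10
```

With a fixed frame skip the `duration` argument would be a constant; letting the policy output it is what allows responsiveness to vary by situation.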

2605.20910 2026-05-21 cs.CV

FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching

FlowLong: 通过流形约束的 Tweedie 匹配实现推理时的长视频生成

Jangho Park, Geon Yeong Park, Gihyun Kwon, Jong Chul Ye

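AI summary This paper proposes an inference-time method for long video generation that uses manifold-constrained Tweedie matching across overlapping sliding windows to generate long videos while maintaining temporal consistency and visual quality, with no additional training.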

Comments Project Page: https://flowlong-video.github.io/

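Details
AI abstract (translated from Chinese)

Extending the generation horizon of video diffusion models remains a long-standing and important challenge. Existing training-free approaches fall into two categories: extensions of bidirectional models, which are tightly coupled to specific architectures and degrade in quality over long horizons; and autoregressive models, which accumulate drift errors because of exposure bias and tend to produce repetitive motion patterns. To address these problems, we propose a novel yet simple inference-time approach to long video generation that is architecture-agnostic and needs no additional training. Our method generates long videos with overlapping sliding windows, in which the clean samples predicted by adjacent windows are blended through Tweedie matching to enforce both the manifold constraint and temporal consistency in the overlap regions. Stochastic early-phase sampling then synchronizes the per-window trajectories by injecting fresh noise after each Tweedie matching correction in the high-noise phase, before switching to deterministic ODE sampling to preserve fine-grained visual fidelity. Applied to a range of video generation models, our method produces videos several times longer than the native window length while outperforming training-free and autoregressive baselines in temporal consistency and visual quality, and it extends to joint audio-video generation and text-to-3DGS without fine-tuning.

English abstract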

Extending the generation horizon of video diffusion models to long sequences remains a long-standing and important challenge. Existing training-free approaches fall into two categories: extensions of bidirectional models, which are tightly coupled to specific architectures and suffer from quality degradation over long horizons, and autoregressive models, which accumulate drift errors due to exposure bias and tend to produce repetitive motion patterns. To address these issues, we propose a novel but simple inference-time approach for long video generation that is architecture-agnostic and requires no additional training. Our method generates long videos via overlapping sliding windows, where predicted clean samples from adjacent windows are blended via Tweedie matching to enforce both manifold constraint and temporal consistency across overlap regions. Stochastic early-phase sampling then synchronizes per-window trajectories by injecting fresh noise after each Tweedie matching correction in the high-noise phase, before transitioning to deterministic ODE sampling to preserve fine-grained visual fidelity. Applied to various video generation models, our method generates videos several times longer than the native window length while outperforming both training-free and autoregressive baselines in temporal consistency and visual quality, and further extends to audio-video joint generation and text-to-3DGS without any fine-tuning.

2605.20908 2026-05-21 cs.CV

SynCB: A Synergy Concept-Based Model with Dynamic Routing Between Concepts and Complementary Neural Branches

SynCB:一种基于协同概念的模型,具有概念与互补神经分支之间的动态路由

Tores Julie, Sun Rémy, Sassatelli Lucile, Ancarani Elisa, Wu Hui-Yin, Precioso Frédéric

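AI summary This work proposes SynCB, a synergy concept-based model in which a dynamic routing module chooses between a concept branch and a complementary neural branch, improving task accuracy and responsiveness to human intervention.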

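Details
AI abstract (translated from Chinese)

Concept-based (CB) models offer interpretability and support human intervention at test time, whereas standard neural networks (NN) offer strong task performance but little transparency. Prior work has explored hybrid formulations that combine concepts with additional representations to improve accuracy, but usually at the expense of human intervention. We introduce the Synergy Concept-Based Model (SynCB) framework, which combines a CB branch with a complementary neural branch and a trainable routing module that dynamically selects which branch to use for each input. Unlike prior models, which fuse residual and concept-based predictions, SynCB keeps the two branches separate and coordinates them through the routing module. In addition, both branches are learned jointly, which lets the complementary neural branch and the CB branch share information through their common backbone. To improve responsiveness to interventions, we further introduce a test-time intervention policy and a corresponding loss. Across five datasets and CB benchmarks, SynCB consistently achieves higher task accuracy and greater responsiveness to human intervention, surpassing the fully neural baseline by up to 3.9 percentage points and exceeding the strongest competitor in intervention performance by up to 6.43 percentage points.

English abstract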

Concept-based (CB) models provide interpretability and support test-time human intervention, while standard neural networks (NN) offer strong task performance but little transparency. Prior work has explored hybrid formulations that integrate concepts and additional representations to improve accuracy, often at the cost of human interventions. We introduce the Synergy Concept-Based Model (SynCB) framework, which combines a CB branch with a complementary neural branch and a trainable routing module that dynamically selects which branch to use for each input. Unlike prior models, which fuse residual and concept-based predictions, SynCB keeps the two branches distinct and coordinates them through the routing module. Moreover, both branches are learned jointly, allowing information sharing between the complementary neural branch and CB branches through their common backbone. To improve responsiveness to interventions, we further introduce a test-time intervention policy and a corresponding loss. Across five datasets and CB benchmarks, SynCB consistently achieves higher task accuracy while remaining more responsive to human interventions, surpassing the full neural baseline by up to 3.9 percentage points and exceeding the strongest competitor in intervention performance by up to 6.43 percentage points.

2605.20904 2026-05-21 cs.CV

JFAA: Technical Report for the EPIC-KITCHENS-100 Action Anticipation Challenge at EgoVis 2026

JFAA:EgoVis 2026 EPIC-KITCHENS-100 动作预见挑战的技术报告

Qiaohui Chu, Haoyu Zhang, Yisen Feng, Meng Liu, Weili Guan, Dongmei Jiang, Liqiang Nie

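AI summary This paper presents JFAA, a JEPA-based future action anticipation method for the EPIC-KITCHENS-100 action anticipation task. A frozen encoder and predictor extract observed-context features and near-future latent tokens, and a lightweight attentive probe is trained to predict verb, noun, and action logits. A field-aware ensemble improves robustness, and JFAA takes first place in the EgoVis 2026 EPIC-KITCHENS-100 Action Anticipation Challenge.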

Comments The champion solution for the EPIC-KITCHENS-100 Action Anticipation Challenge at the CVPR EgoVis Workshop 2026

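Details
AI abstract (translated from Chinese)

We propose JFAA, a JEPA-based future action anticipation method for the EPIC-KITCHENS-100 (EK-100) action anticipation task. Inspired by the representation learning and future prediction abilities of V-JEPA 2.1, JFAA uses a frozen encoder and predictor to extract observed-context features and near-future latent tokens. A lightweight attentive probe is then trained with separate task queries to predict verb, noun, and action logits. To improve robustness, we further build a field-aware ensemble over selected epoch-level predictions so that each output field benefits from its most reliable candidates. Results on the official challenge server show that JFAA takes first place in the EgoVis 2026 EK-100 Action Anticipation Challenge. Our code will be released at https://github.com/CorrineQiu/JFAA.

English abstract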

We propose JFAA, a JEPA-based Future Action Anticipation method for the EPIC-KITCHENS-100 (EK-100) Action Anticipation task. Inspired by the representation learning and future prediction ability of V-JEPA 2.1, JFAA uses a frozen encoder and predictor to extract observed context features and near-future latent tokens. A lightweight attentive probe is then trained to predict verb, noun, and action logits with separate task queries. To improve robustness, we further build a field-aware ensemble over selected epoch-level predictions, allowing each output field to benefit from its most reliable candidates. Experimental results on the official challenge server show that JFAA achieves first place in the EgoVis 2026 EK-100 Action Anticipation Challenge. Our code will be released at https://github.com/CorrineQiu/JFAA.

2605.20901 2026-05-21 cs.CV cs.AI

VISTA: Technical Report for the Ego4D Short-Term Object Interaction Anticipation at EgoVis 2026

VISTA:EgoVis 2026 ego4D 短期物体交互预测挑战的技术报告

Qiaohui Chu, Haoyu Zhang, Yisen Feng, Meng Liu, Weili Guan, Dongmei Jiang, Liqiang Nie

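AI summary This paper presents VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation Challenge at EgoVis 2026. The method combines object-centric spatial detection with short-horizon temporal context, injecting the temporal representation into the detection pathway through feature modulation and ROI-level context fusion.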

Comments The champion solution for the Ego4D Short-Term Object Interaction Anticipation Challenge at the CVPR EgoVis Workshop 2026

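Details
AI abstract (translated from Chinese)

We propose VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. Given a timestamp in an egocentric video, the task is to anticipate the next human-object interaction, including the future active object's bounding box, noun category, verb category, time to contact, and confidence score. VISTA follows a StillFast-style design that combines object-centric spatial detection with short-horizon temporal context. Specifically, a COCO-pretrained Faster R-CNN ResNet-50 FPN detector generates object proposals from the last observed high-resolution frame, while a frozen V-JEPA 2.1 temporal branch extracts clip-level egocentric context from the observed video. The temporal representation is injected into the detection pathway through feature modulation and ROI-level context fusion. The fused proposal features are then passed to multi-head STA predictors for box refinement, noun classification, verb classification, time-to-contact regression, and interaction confidence estimation. For the final submission, we additionally ensemble complementary predictions to improve robustness. Results on the official challenge server show that VISTA takes first place in the EgoVis 2026 Ego4D STA Challenge. Our code will be released at https://github.com/CorrineQiu/VISTA.

English abstract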

We propose VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. Given an egocentric video timestamp, the task requires anticipating the next human-object interaction, including the future active object's bounding box, noun category, verb category, time-to-contact, and confidence score. VISTA follows a StillFast-style design that combines object-centric spatial detection with short-horizon temporal context. Specifically, a COCO-pretrained Faster R-CNN ResNet-50 FPN detector generates object proposals from the last observed high-resolution frame, while a frozen V-JEPA 2.1 temporal branch extracts clip-level egocentric context from the observed video. The temporal representation is injected into the detection pathway through feature modulation and ROI-level context fusion. The fused proposal features are then passed to multi-head STA predictors for box refinement, noun classification, verb classification, time-to-contact regression, and interaction confidence estimation. For the final submission, we further ensemble complementary predictions to improve robustness. Experimental results on the official challenge server show that VISTA achieves first place in the EgoVis 2026 Ego4D STA Challenge. Our code will be released at https://github.com/CorrineQiu/VISTA.

2605.20894 2026-05-21 cs.RO

Mobile UMI: Cross-View Diffusion Policy with Decoupled Kinematics for Mobile Manipulation

Mobile UMI: 用于移动操作的跨视角扩散策略与解耦动力学

Haoran Huang, Haonan Dong, Huixu Dong

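AI summary This paper presents Mobile UMI, a hardware-free demonstration framework whose three components address two bottlenecks in mobile imitation learning: locomotion-contaminated action labels and inference-induced execution latency. A dual-camera setup captures global and local context, a spatial anchor unifies the visual-inertial frames so that decoupled manipulation and base trajectories can be extracted, and an asynchronous receding-horizon executor performs online state matching.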

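Details
AI abstract (translated from Chinese)

Mobile imitation learning on portable demonstration interfaces faces two coupled bottlenecks: action labels contaminated by locomotion and inference-induced execution latency on a continuously moving base. Recent wrist-mounted interfaces lower the cost of tabletop data collection, but a single wrist view cannot capture the global context needed for base navigation. Adding a body-mounted camera entangles human walking with hand motion. Meanwhile, generative policies introduce hundreds of milliseconds of inference latency, during which the base moves past the predicted waypoints, forcing backward corrections at action splices. This paper presents Mobile UMI, a hardware-free demonstration framework that addresses both gaps with three components. First, a dual-camera capture system records chest-centric global context and wrist-centric local interaction without any robot present. Second, a one-shot ChArUco-based spatial anchor unifies the chest and hand visual-inertial frames; the hand pose is then re-expressed relative to the chest to extract decoupled SE(3) manipulation and SE(2) base trajectories. Third, an asynchronous receding-horizon executor performs online state matching: each generated action chunk is realigned with the current physical pose so that expired waypoints are discarded before execution. The full system is evaluated on four long-horizon household tasks, reaching an average success rate of 83.8% over 100 trials per task. Controlled comparisons against ACT and Diffusion Policy show that the chest-relative label alone closes most of the gap, and online state matching closes the rest. These results indicate that, under the tested conditions, explicit kinematic factorization combined with state-level latency alignment is an effective solution for mobile imitation learning that requires no architectural changes to the underlying policy class.

English abstract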

Mobile imitation learning on portable demonstration interfaces faces two coupled bottlenecks: locomotion-contaminated action labels and inference-induced execution latency on a continuously moving base. Recent wrist-mounted interfaces lower the cost of tabletop data collection, yet a single wrist view does not capture the global context required for base navigation. Adding a body-mounted camera entangles human walking with hand motion. Meanwhile, generative policies introduce hundreds of milliseconds of inference latency, during which the base advances past predicted waypoints, forcing backward corrections at action splices. This paper presents Mobile UMI, a hardware-free demonstration framework that addresses both gaps through three components. First, a dual-camera capture system records chest-centric global context and wrist-centric local interaction without any robot present. Second, a one-shot ChArUco-based spatial anchor unifies the chest and hand visual-inertial frames; the hand pose is then re-expressed relative to the chest to extract decoupled SE(3) manipulation and SE(2) base trajectories. Third, an asynchronous receding-horizon executor performs online state matching: each generated action chunk is realigned with the current physical pose so that expired waypoints are discarded before execution. The full system is evaluated on four long-horizon household tasks, achieving an average success rate of 83.8% over 100 trials per task. Controlled comparisons against ACT and Diffusion Policy show that the chest-relative label alone closes much of the gap; online state matching closes the remainder. These results indicate that, for mobile imitation learning under the tested conditions, explicit kinematic factorization combined with state-level latency alignment provides an effective solution without requiring architectural changes to the underlying policy class.
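
The chest-relative relabeling step can be illustrated in the plane. This is a simplified sketch under stated assumptions: it works in SE(2) with poses written as (x, y, heading), whereas the paper extracts SE(3) manipulation trajectories, and the function name `relative_pose_se2` and the sample tracks are hypothetical.

```python
import math

def relative_pose_se2(chest, hand):
    """Express the hand pose in the chest frame: inverse(T_chest) composed with T_hand."""
    cx, cy, ct = chest
    hx, hy, ht = hand
    dx, dy = hx - cx, hy - cy
    c, s = math.cos(ct), math.sin(ct)
    # Rotate the world-frame offset by the negative chest heading.
    rel_x = c * dx + s * dy
    rel_y = -s * dx + c * dy
    dt = ht - ct
    rel_t = math.atan2(math.sin(dt), math.cos(dt))  # wrap the angle to (-pi, pi]
    return rel_x, rel_y, rel_t

# Hypothetical demonstration: the person walks forward with the hand held 0.5 m ahead.
chest_track = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
hand_track = [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0), (2.5, 0.0, 0.0)]
labels = [relative_pose_se2(c, h) for c, h in zip(chest_track, hand_track)]
```

In world coordinates the hand appears to travel two metres, but every chest-relative label is the same, so the walking motion no longer leaks into the manipulation label.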

2605.20892 2026-05-21 cs.CV

FruitEnsemble: MLLM-Guided Arbitration for Heterogeneous ensemble in Fine-Grained Fruit Recognition

Enhui Yu, Junhui Li, Ruitong Lu, Jialu Li, Youshan Zhang

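AI summary This paper proposes FruitEnsemble, a two-stage dynamic inference framework that addresses the generalization limits of fine-grained fruit classification, using MLLM expert arbitration to improve accuracy and reaching 70.49% classification accuracy.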

Comments 10 pages, 6 figures, submitted to CVPR 2026

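Details
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2026
AI abstract (translated from Chinese)

Fine-grained fruit classification is a critical yet challenging task in agricultural computer vision, held back mainly by a shortage of high-quality datasets and the high visual similarity between classes. To address these problems, we first build a comprehensive dataset of 306 fruit categories and 116,233 samples. We also propose FruitEnsemble, a practical two-stage dynamic inference framework designed to overcome the generalization limits of static single-model architectures. In the first stage, FruitEnsemble uses a validation-calibrated weighted ensemble of heterogeneous backbones to produce a robust Top-3 candidate pool. To handle difficult samples, we introduce an expert arbitration mechanism: when ensemble confidence falls below 0.6, a multimodal large language model (MLLM) is triggered to perform rigorous visual verification, integrating external botanical descriptions through Chain-of-Thought (CoT) reasoning. In addition, we optimize the training pipeline with a hard-sample-aware joint loss. Extensive experiments show that FruitEnsemble reaches a classification accuracy of 70.49% and outperforms existing state-of-the-art models. Our framework offers an efficient, deployment-oriented solution for real-world agricultural visual sorting and quality inspection.

English abstract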

Fine-grained fruit classification is a critical yet challenging task in agricultural computer vision, primarily hindered by a severe shortage of high-quality datasets and the high visual similarity between classes. To address these challenges, we first constructed a comprehensive dataset comprising 306 fruit categories with 116,233 samples. Moreover, we propose FruitEnsemble, a practical two-stage dynamic inference framework designed to overcome the generalization limitations of static single-model architectures. In the first stage, FruitEnsemble employs a validation-calibrated weighted ensemble of heterogeneous backbones to generate a robust Top-3 candidate pool. To tackle difficult samples, we introduce an expert arbitration mechanism: when ensemble confidence falls below 0.6, a multimodal large language model (MLLM) is triggered to perform rigorous visual verification by integrating external botanical descriptions using Chain-of-Thought (CoT) reasoning. Furthermore, we optimized the training pipeline with a hard sample-aware joint loss. Extensive experiments demonstrate that FruitEnsemble achieves a classification accuracy of 70.49% and outperforms existing state-of-the-art models. Our framework provides an efficient, deployment-oriented solution for real-world agricultural visual sorting and quality inspection tasks.
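
The two-stage control flow can be sketched as follows. Only the Top-3 pool and the 0.6 threshold come from the abstract; treating "ensemble confidence" as the top-1 ensemble probability is an assumption, and `arbiter` is a hypothetical callable standing in for the MLLM verification step.

```python
def weighted_ensemble(prob_lists, weights):
    """Weighted average of per-model class probabilities."""
    total = sum(weights)
    n_classes = len(prob_lists[0])
    return [
        sum(w * p[c] for w, p in zip(weights, prob_lists)) / total
        for c in range(n_classes)
    ]

def classify(prob_lists, weights, arbiter, threshold=0.6, top_k=3):
    """Return (label, source): the ensemble decides unless its confidence is low."""
    probs = weighted_ensemble(prob_lists, weights)
    ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
    candidates = ranked[:top_k]
    if probs[candidates[0]] >= threshold:
        return candidates[0], "ensemble"
    # Low confidence: hand the Top-k candidate pool to the arbiter.
    return arbiter(candidates), "arbiter"

# Hypothetical arbiter that happens to prefer the second-ranked candidate.
pick_second = lambda candidates: candidates[1]

easy = [[0.8, 0.1, 0.05, 0.05], [0.9, 0.05, 0.03, 0.02]]
hard = [[0.4, 0.35, 0.15, 0.1], [0.3, 0.45, 0.15, 0.1]]
easy_result = classify(easy, [1.0, 1.0], pick_second)
hard_result = classify(hard, [1.0, 1.0], pick_second)
```

Because the arbiter only runs on low-confidence samples, its cost is paid on a fraction of inputs rather than on every image.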

2605.20891 2026-05-21 cs.CV

HDMoE: A Hierarchical Decoupling-Fusion Mixture-of-Experts Framework for Multimodal Cancer Survival Prediction

HDMoE:一种用于多模态癌症生存预测的分层解耦-融合专家混合框架

Huayi Wang, Haochao Ying, Yuyang Xu, Qiyao Zheng, Jun Wang, Cheng Zhang, Ying Sun, Jian Wu

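AI summary This paper proposes HDMoE, a hierarchical decoupling-fusion mixture-of-experts framework that integrates multimodal medical data to improve cancer survival prediction, addressing the weak feature decoupling and fusion of earlier methods.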

Comments 12 pages, HDMoE has been accepted by KDD 2026 AI for Sciences Track

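Details
AI abstract (translated from Chinese)

Multimodal survival prediction is a crucial yet challenging task that requires integrating multimodal medical data (e.g., whole slide images (WSIs) and genomic profiles) for accurate prognostic modeling. Given the inherent heterogeneity across modalities, the feature decoupling-fusion paradigm has become the dominant approach. However, these methods have two shortcomings: (1) they do not reduce the redundant information in modality features before decoupling, which harms feature decoupling and fusion; and (2) they cannot model fine-grained relationships among features or capture local information interactions between intra- and inter-modality features. To address these issues, we propose the HDMoE framework, which has two levels of MoE and Random Feature Reorganization (RFR) modules. In the first-level MoE, shared experts and routed experts remove redundant information and extract fine-grained modality-specific features, while the second-level MoE performs fine-grained inter-modality feature decoupling. We also design two RFR modules, one after each MoE level, to finely fuse intra- and inter-modality features, which helps the model capture more fine-grained relationships between modalities. Extensive experiments on our private liver cancer (LC) dataset and three public TCGA datasets confirm the effectiveness of the proposed method. Code is available at https://github.com/ZJUMAI/HDMoE.

English abstract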

Multimodal survival prediction, a crucial yet challenging task, demands the integration of multimodal medical data (e.g., Whole Slide Images (WSIs) and Genomic Profiles) to achieve accurate prognostic modeling. Given the inherent heterogeneity across modalities, the feature decoupling-fusion paradigm has emerged as a dominant approach. However, these methods have the following shortcomings: (1) they fail to reduce the redundant information of modality features before decoupling, which negatively affects the feature decoupling and fusion effect; (2) they lack the ability to model the fine-grained relationships of the features and capture the local information interactions between intra- and inter-modality features. To address these issues, we propose a Hierarchical Decoupling-Fusion Mixture-of-Experts (HDMoE) framework with two levels of MoE and Random Feature Reorganization (RFR) modules. In the first-level MoE, shared experts and routed experts are employed to remove redundant information and extract fine-grained specific features within each modality, while the second-level MoE facilitates fine-grained inter-modality feature decoupling. Besides, we design two RFR modules following each level of MoE to finely fuse intra- and inter-modality features, which can help the model capture more fine-grained relationships between modalities. Extensive experimental results on our private Liver Cancer (LC) and three TCGA public datasets confirm the effectiveness of our proposed method. Codes are available at https://github.com/ZJUMAI/HDMoE.

2605.20889 2026-05-21 cs.CV

Map-Mono-Ego: Map-Grounded Global Human Pose Estimation from Monocular Egocentric Video

Map-Mono-Ego: 从单目第一视角视频实现基于地图的全局人体姿态估计

Hiroyuki Deguchi, Ryosuke Hori, Kotaro Amaya, Tsubasa Maruyama, Mitsunori Tada, Hideo Saito

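AI summary This paper proposes the MapMonoEgo framework, which uses a pre-scanned 3D point cloud to obtain globally consistent human pose estimates from a monocular camera, and introduces the AIST-Living dataset; the method is shown to be useful for practical monitoring tasks without specialized hardware.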

Comments Accepted at ICIP 2026, Project page: https://deguchihiroyuki.github.io/Map-Mono-Ego-Project/

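Details
AI abstract (translated from Chinese)

Monocular egocentric human pose estimation is essential for ubiquitous activity monitoring. However, determining the user's absolute location within the environment remains a challenge. Existing methods focus mainly on motion relative to an initial position and do not account for the wearer's absolute location in the environment. Moreover, the scale ambiguity inherent in monocular vision causes severe translational drift, which limits long-term tracking unless specialized multi-sensor hardware is used. To address this, we propose MapMonoEgo, a novel framework that achieves globally consistent human pose estimation from a monocular camera alone by leveraging a pre-scanned 3D point cloud. We also introduce the AIST-Living dataset, which pairs egocentric video with ground-truth motion in a scanned environment. Experiments show that our approach significantly outperforms the state-of-the-art baseline, demonstrating its usefulness for practical monitoring tasks without specialized hardware.

English abstract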

Monocular egocentric human pose estimation is essential for ubiquitous activity monitoring. However, understanding the user's absolute location within the environment remains a challenge. Existing methods primarily focus on relative motion from an initial position, and tend not to account for the wearer's absolute location within an environment. Furthermore, inherent scale ambiguity in monocular vision leads to severe translational drift, limiting long-term tracking without specialized multi-sensor hardware. To address this, we propose MapMonoEgo, a novel framework achieving globally consistent human pose estimation solely from a monocular camera by leveraging a pre-scanned 3D point cloud. We also introduce AIST-Living dataset, a new dataset pairing egocentric video with ground-truth motion in a scanned environment. Experiments demonstrate that our approach significantly outperforms the state-of-the-art baseline, proving its utility for practical monitoring tasks without specialized hardware.

2605.20885 2026-05-21 cs.LG q-bio.QM

Training distribution determines the ceiling of drug-blind cancer sensitivity prediction

训练分布决定了药物盲癌敏感性预测的上限

Taekyung Heo

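AI summary This paper studies how the training distribution affects drug-blind cancer sensitivity prediction, finds that the standard metric is biased, and recovers predictive gains through mechanism-stratified training and response matching.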

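Details
AI abstract (translated from Chinese)

Precision oncology requires predicting, from a tumor's molecular profile, which drugs will suppress it, yet drug-blind sensitivity prediction has plateaued despite increasingly complex drug representations. We show that this stagnation reflects a metric artifact rather than a representational bottleneck. The standard benchmark, global Pearson r, is dominated by between-drug potency differences that a trivial drug-mean predictor captures without any cell-specific learning. Per-drug Pearson r, which isolates the ranking of cells within each drug, reveals that no drug encoding improves over cell-only features across four independent datasets. A controlled experiment that supplies mechanism-of-action (MoA) identity either as a drug feature or as a training-distribution constraint identifies the cause. Supplying MoA as a feature yields negligible benefit, whereas using it to stratify training raises per-drug r substantially for targeted kinase inhibitors, because pan-cancer co-training suppresses pathway-specific sensitivity signals. Mechanism-stratified training and response matching from pilot observations provide two deployable strategies that together recover the main sources of predictive gain in drug-blind sensitivity prediction.

English abstract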

Precision oncology requires predicting which drugs will suppress a specific tumor from its molecular profile, but drug-blind sensitivity prediction has plateaued despite increasingly complex drug representations. Here we show that this stagnation reflects a metric artifact rather than a representational bottleneck. The standard benchmark, global Pearson r, is dominated by between-drug potency differences that a trivial drug-mean predictor captures without any cell-specific learning. Per-drug Pearson r, which isolates within-drug cell ranking, reveals that no drug encoding improves over cell-only features across four independent datasets. A controlled experiment channeling mechanism-of-action identity as either a drug feature or a training-distribution constraint identifies the cause. Supplying MoA as a feature yields negligible benefit, whereas using it to stratify training raises per-drug r substantially for targeted kinase inhibitors, because pan-cancer co-training suppresses pathway-specific sensitivity signals. Mechanism-stratified training and response matching from pilot observations provide two deployable strategies that together recover the principal sources of predictive gain in drug-blind sensitivity prediction.
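
The metric artifact described above is easy to reproduce on toy numbers. The response values below are invented for illustration, and scoring a constant prediction as r = 0 is a convention adopted here (Pearson r is undefined for a constant vector), not something stated in the abstract.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant (a convention)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

# Hypothetical sensitivities: one weak drug and one potent drug, four cell lines each.
observed = {
    "drugA": [1.0, 1.2, 0.8, 1.1],
    "drugB": [5.0, 5.3, 4.7, 5.1],
}
# Trivial predictor: every cell line gets its drug's mean response.
drug_mean_pred = {d: [mean(v)] * len(v) for d, v in observed.items()}

all_obs = [v for d in observed for v in observed[d]]
all_pred = [v for d in observed for v in drug_mean_pred[d]]
global_r = pearson(all_pred, all_obs)  # pooled over all drug-cell pairs
per_drug_r = mean(pearson(drug_mean_pred[d], observed[d]) for d in observed)
```

The pooled correlation is close to 1 because the predictor separates the two drugs, while the per-drug score is 0 because it says nothing about which cell line responds more within a drug.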

2605.20883 2026-05-21 cs.LG

Learning fMRI activations dictionaries across individual geometries via optimal transport

通过最优传输学习跨个体几何的fMRI激活字典

Sonia Mazelet, Rémi Flamary, Bertrand Thirion

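AI summary This paper proposes an optimal-transport-based dictionary learning method for fMRI that handles differences in individual brain geometry with the Fused Gromov-Wasserstein (FGW) distance, uses amortized optimization to cut computational cost, and learns dictionary atoms that depend on the FGW trade-off parameter balancing feature alignment and structural consistency.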

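Details
AI abstract (translated from Chinese)

Dictionary learning is a powerful tool for building interpretable representations. Applied to functional magnetic resonance imaging (fMRI) data, the resulting patterns of brain activity can be used for downstream tasks such as brain state classification or population-level analysis. A major challenge, however, is the variability in brain geometry across individuals. This is usually handled by projecting each individual's brain geometry onto a common template, which removes subject-specific information. In this work, we propose a new approach to dictionary learning on fMRI data that explicitly accounts for this variability. We use the optimal-transport-based Fused Gromov-Wasserstein (FGW) distance to compare graphs with different geometries and features. Because computing many FGW distances is costly for large graphs such as those derived from fMRI data, we rely on amortized optimization to learn a neural network that predicts an approximation of the optimal transport plans, which substantially reduces the computational cost. In addition, we learn dictionary atoms that depend on the FGW trade-off parameter, which controls the balance between feature alignment and structural consistency. Numerical experiments on the HCP dataset show that the proposed approach captures different levels of geometric variability in the data and yields representations that preserve essential information.

English abstract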

Dictionary learning is a powerful tool for creating interpretable representations. When applied to functional magnetic resonance imaging (fMRI) data, the resulting patterns of brain activity can be used for various downstream tasks, such as brain state classification or population-level analysis. However, a major challenge is the variability in brain geometry across individuals. This is usually addressed by projecting each individual brain geometry onto a common template, which removes subject-specific information. In this work, we introduce a novel approach to dictionary learning on fMRI data that explicitly accounts for this variability. We use the optimal transport-based Fused Gromov-Wasserstein (FGW) distance to compare graphs with different geometries and features. To address the challenge of computing multiple FGW distances for large graphs such as those arising from fMRI data, we rely on amortized optimization to learn a neural network that predicts an approximation of the optimal transport plans, which substantially reduces the computational cost. Additionally, we learn dictionary atoms that depend on the FGW trade-off parameter, which controls the balance between feature alignment and structural consistency. Numerical experiments on the HCP dataset demonstrate that the proposed approach captures different levels of geometric variability in the data and provides representations that preserve essential information.
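
For reference, the role of the trade-off parameter can be read off the standard definition of the FGW distance between two attributed graphs. The notation below is an assumption and may differ from the paper's: the graphs have node features $a_i$ and $b_j$, structure matrices $C_1$ and $C_2$, node weights $h$ and $g$, and $\Pi(h, g)$ is the set of couplings with those marginals.

```latex
\mathrm{FGW}_{\alpha}(G_1, G_2)
  = \min_{\pi \in \Pi(h, g)} \sum_{i,j,k,l}
    \Big[ (1-\alpha)\, d(a_i, b_j)^{q}
        + \alpha\, \big| C_1(i,k) - C_2(j,l) \big|^{q} \Big]
    \, \pi_{ij}\, \pi_{kl}
```

With $\alpha = 0$ only the feature cost $d(a_i, b_j)$ matters (a Wasserstein-type term), and with $\alpha = 1$ only the structural mismatch matters (a Gromov-Wasserstein-type term); intermediate values trade feature alignment against structural consistency.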

2605.20879 2026-05-21 cs.LG

NeighborDiv: Training-free Zero-shot Generalist Graph Anomaly Detection via Neighbor Diversity

NeighborDiv: 一种基于邻居多样性、无需训练的跨域通用图异常检测方法

Kaifeng Wei, Teng Liu, Liang Dong, Xiubo Liang, Yuke Li

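AI summary This paper proposes NeighborDiv, a training-free generalist graph anomaly detection method that detects anomalies through neighbor diversity, avoiding the training complexity, data dependence, and unstable cross-domain generalization of existing methods; experiments report state-of-the-art results under two evaluation paradigms.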

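Details
AI abstract (translated from Chinese)

Graph anomaly detection (GAD) is increasingly moving toward generalist GAD (GGAD) for cross-domain "one-for-all" detection, but existing GGAD methods rely mainly on the neighbor consistency principle and are confined to the Node-to-Neighbor Consistency Paradigm for quantifying anomalies. These methods suffer from complex training pipelines, heavy dependence on training data, high computational cost, and unstable cross-domain generalization. To address these limitations, we propose NeighborDiv, a training-free generalist graph anomaly detection framework based on neighbor diversity. Departing from the dominant Node-to-Neighbor Consistency Paradigm, we turn to the Neighbor-to-Neighbor Diversity Paradigm and find that the internal structural dispersion of a node's neighbor set is a powerful, independently discriminative anomaly signal. We quantify neighbor diversity as the variance of inter-neighbor feature similarities, which captures how a node organizes its local graph environment and is independent of conventional node-to-neighbor consistency frameworks. Extensive experiments under two standard GGAD evaluation paradigms show that NeighborDiv achieves state-of-the-art performance, with relative gains of 10.25% in average AUC and 17.78% in average AP over the second-best baseline under Single-Domain Independent Training (SDIT), and gains of 6.89% in AUC and 9.58% in AP under Unified Multi-Domain Training (UMDT). Notably, NeighborDiv shows zero performance volatility across all datasets, removing training-set dependence and establishing a lightweight and highly practical GGAD framework.

English abstract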

Graph Anomaly Detection (GAD) is increasingly shifting to Generalist GAD (GGAD) for cross-domain "one-for-all" detection, but existing GGAD methods predominantly rely on the neighbor consistency principle, falling into the Node-to-Neighbor Consistency Paradigm for anomaly quantification. These methods suffer from complex training pipelines, heavy training data dependency, high computational costs, and unstable cross-domain generalization. To address these limitations, we propose NeighborDiv, a training-free generalist graph anomaly detection framework based on neighbor diversity. Departing from the dominant Node-to-Neighbor Consistency Paradigm, we shift the focus to the Neighbor-to-Neighbor Diversity Paradigm, and uncover that the internal structural dispersion of a node's neighbor set is a powerful, independently discriminative anomaly signal. We quantify neighbor diversity via the variance of inter-neighbor feature similarities, which captures how a node organizes its local graph environment, and operates independently of conventional node-to-neighbor consistency frameworks. Extensive experiments under two standard GGAD evaluation paradigms show NeighborDiv achieves state-of-the-art performance, with relative gains of 10.25% in average AUC and 17.78% in average AP over the second-best baseline under Single-Domain Independent Training (SDIT), and 6.89%/9.58% in AUC/AP under Unified Multi-Domain Training (UMDT), respectively. Notably, NeighborDiv yields zero performance volatility across all datasets, eliminating training-set dependency and establishing a lightweight and highly practical GGAD framework.
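
The quantity "variance of inter-neighbor feature similarities" can be computed without any training, as the sketch below shows. Cosine similarity, population variance, the function names, and the toy graph are all assumptions for illustration; the abstract does not specify the similarity measure or how the score is turned into an anomaly ranking.

```python
from itertools import combinations
from statistics import pvariance

def cosine(u, v):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def neighbor_diversity(node, adjacency, features):
    """Variance of pairwise similarities among the node's neighbors."""
    neighbors = adjacency[node]
    sims = [cosine(features[a], features[b]) for a, b in combinations(neighbors, 2)]
    # At least two neighbor pairs are needed for a meaningful variance.
    return pvariance(sims) if len(sims) >= 2 else 0.0

# Toy graph: node 0 has homogeneous neighbors, node 4 has a mixed neighborhood.
adjacency = {0: [1, 2, 3], 4: [5, 6, 7]}
features = {
    1: [1.0, 0.0], 2: [2.0, 0.0], 3: [3.0, 0.0],
    5: [1.0, 0.0], 6: [1.0, 0.0], 7: [0.0, 1.0],
}
uniform_score = neighbor_diversity(0, adjacency, features)
mixed_score = neighbor_diversity(4, adjacency, features)
```

Note that the score never compares the node's own features with its neighbors, which is what distinguishes a neighbor-to-neighbor signal from a node-to-neighbor consistency score.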