arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.25763 2026-05-28 cs.CV

AI-T2I: Aggregating-and-Isolating Cross-Attention to Diffusion Models for Text-to-Image Synthesis

Shipeng Cao, Biao Qian, Haipeng Liu, Yang Wang, Meng Wang

Affiliations Institute of Advanced Medicine and Frontier Technology; Hefei University of Technology; Key Laboratory of Knowledge Engineering With Big Data, Ministry of Education; Department of Automation, Tsinghua University

AI summary To address imprecise text-image alignment in the cross-attention maps of diffusion models for text-to-image synthesis, the paper proposes an aggregating-and-isolating cross-attention method (AI-T2I): an aggregation loss consolidates scattered intra-token activations, and an isolation loss separates inter-token activations, yielding precise alignment.

Comments Accepted by IEEE Transactions on Multimedia (2026). 13 pages, 15 figures

Abstract

Text-to-image synthesis has made significant progress, benefiting from the strong generative capabilities of diffusion models. However, these models struggle to achieve precise text-to-image alignment within cross-attention maps during the denoising process. Existing works primarily focus on inter-subject-token activations (i.e., cross-attention scores) overlap for different subjects, overlooking the intra-subject-token activations scattering issue for identical subjects. In this paper, we propose an Aggregating-and-Isolating cross-attention approach to diffusion models for Text-to-Image synthesis, dubbed AI-T2I. Technically, to address the scattering issue, we devise an aggregation loss to identify and consolidate the scattered intra-token activations, which implicitly helps mitigate the potential overlap issue. Upon that, an isolation loss is further introduced to push the inter-token activations apart, thus fulfilling precise text-to-image alignment. Extensive experiments on various benchmarks demonstrate the superiority of AI-T2I over the state-of-the-art works for text-to-image synthesis. Furthermore, our AI-T2I exhibits excellent generalization across other tasks, e.g., controllable layout generation and personalized generation. Our code is available at https://github.com/Hatter77/AI-T2I.
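
To make the two objectives concrete, the following Python sketch scores toy cross-attention maps. The abstract does not give the loss formulas, so both functions are assumptions chosen for illustration: `aggregation_loss` measures how far one token's attention mass spreads from its centroid, and `isolation_loss` measures the overlap between two tokens' maps. Neither is the AI-T2I implementation; the authors' repository is the reference for that.

```python
def normalize(attn):
    """Scale a 2D attention map so that its entries sum to 1."""
    total = sum(sum(row) for row in attn)
    return [[v / total for v in row] for row in attn]


def aggregation_loss(attn):
    """Hypothetical scatter penalty: spatial variance of the attention mass."""
    p = normalize(attn)
    cy = sum(y * v for y, row in enumerate(p) for v in row)
    cx = sum(x * v for row in p for x, v in enumerate(row))
    return sum(((y - cy) ** 2 + (x - cx) ** 2) * v
               for y, row in enumerate(p) for x, v in enumerate(row))


def isolation_loss(attn_a, attn_b):
    """Hypothetical overlap penalty: sum of elementwise minima of two maps."""
    pa, pb = normalize(attn_a), normalize(attn_b)
    return sum(min(a, b) for ra, rb in zip(pa, pb) for a, b in zip(ra, rb))


# One token focused on a single cell, another scattered over four corners.
compact = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
scattered = [[1, 0, 1], [0, 0, 0], [1, 0, 1]]

print(aggregation_loss(compact), aggregation_loss(scattered))
print(isolation_loss(compact, scattered), isolation_loss(compact, compact))
```

Under these assumed definitions, a compact map has zero scatter, and two maps that occupy disjoint cells have zero overlap, which is the behavior the abstract describes as the goal of each loss.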

2605.25010 2026-05-28 cs.RO cs.AI

Performance Comparison of Classical and Neural Sampling Algorithms for Robotic Navigation

Hichem Cheriet, Badra Khellat Kihel, Samira Chouraqui

Affiliations dept. of Economics Oran2 Mohamed BenAhmed University

AI summary The paper compares RRT*, Neural RRT*, and Neural Informed RRT* in environments with convex and concave obstacles and finds that neural-guided planners produce shorter (up to 14%) and smoother (55-75%) paths, with Neural Informed RRT* performing best overall.

Journal ref Presented at The 3rd Edition of National Conference on Applications of Artificial Intelligence A2I' 26. 2026

Abstract

Integrating artificial intelligence (AI) into sampling-based motion planning provides new possibilities for improving autonomous navigation efficiency. In this paper, three algorithms, namely RRT*, Neural RRT*, and Neural Informed RRT*, are implemented and evaluated on environments containing convex and concave obstacles with different obstacle densities. The obtained results indicate that neural-guided planners improve path quality, producing up to 14% shorter paths and 55-75% smoother trajectories compared with the conventional RRT* algorithm. Among the evaluated methods, Neural Informed RRT* achieves the best overall performance in terms of path length and trajectory smoothness. These results demonstrate the effectiveness of AI-guided sampling strategies for improving reliability and trajectory efficiency in robotic and UAV navigation, despite a slight increase in computation time. Overall, the study highlights the growing importance of artificial intelligence in real-time robotic path planning applications.
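
The two reported quantities, path length and smoothness, can be computed from a list of 2D waypoints as in the sketch below. The abstract does not say how smoothness is measured, so the total turning angle used here is an assumption, and `relative_improvement` is a hypothetical helper for expressing a planner's gain over a baseline as a fraction.

```python
import math


def path_length(path):
    """Sum of Euclidean distances between consecutive waypoints."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))


def total_turning(path):
    """Sum of absolute heading changes in radians (assumed smoothness proxy)."""
    total = 0.0
    for a, b, c in zip(path, path[1:], path[2:]):
        h1 = math.atan2(b[1] - a[1], b[0] - a[0])
        h2 = math.atan2(c[1] - b[1], c[0] - b[0])
        d = abs(h2 - h1)
        total += min(d, 2 * math.pi - d)  # wrap the difference into [0, pi]
    return total


def relative_improvement(baseline, candidate):
    """Fraction by which `candidate` is smaller than `baseline`."""
    return (baseline - candidate) / baseline


jagged = [(0, 0), (1, 1), (2, 0), (3, 1), (4, 0)]
straight = [(0, 0), (4, 0)]

print(path_length(jagged), path_length(straight))
print(total_turning(jagged), total_turning(straight))
print(relative_improvement(path_length(jagged), path_length(straight)))
```

A lower turning total means a smoother path under this proxy, so a "55-75% smoother" result would correspond to a relative improvement between 0.55 and 0.75 if the paper uses a comparable measure.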

2605.24906 2026-05-28 cs.CV

Where Detectors Fail: Probing Generative Space for Generalizable AI-Generated Image Detection

Zijie Cao, Weijie Tu, Yao Xiao, Weijian Deng, Liang Lin, Pengxu Wei

Affiliations School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; Australian National University, Canberra, Australia; Peng Cheng Laboratory, Shenzhen, China; Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China

AI summary Proposes PROBE, a framework that improves the generalization of AI-generated image detectors by actively exploring challenging regions of the generative process.

Comments Accepted by ICML2026

Abstract

Detecting AI-generated images (AIGI) remains challenging because detectors often fail to generalize to unseen generators. Although existing methods are trained on large datasets, their performance still degrades when generation settings change, indicating that data scale alone is insufficient and that limited coverage of generative variations during training is a key factor. Studies on generative model editing show that small changes in internal representations can produce diverse and meaningful image variations, many of which are not explored under standard sampling. Leveraging this insight, we propose PROBE (Probing Robustness via Boundary Exploration), a framework that improves detector generalization by actively exploring challenging regions of the generative process. Instead of treating the generator as a fixed data source, PROBE uses the detector as a critic to steer the generator through manifold-level modifications, producing realistic samples that are difficult to classify. These samples expose failure cases that are uncommon under standard data sampling strategies and are used to refine the detector. Experimental results across multiple benchmarks indicate that PROBE enhances generalization to unseen generators, resulting in more generalizable AIGI detection performance. Code and models are available at https://github.com/Amamiya-C/PROBE-AIGI-Detection

2605.24678 2026-05-28 cs.AI cs.CL cs.SD

Exploration of Perceptual Speech Features for Clinical Decision-Support in Mental Health Care

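Vassilis Lyberatos, Edmund G. Dervakos, Eleni Adamidi, Athanasios Voulodimos, Giorgos Stamou

Affiliations National Technical University of Athens; PsychNow

AI summary Proposes a systematic analysis framework based on perceptual acoustic and linguistic features (such as prosody, vocal quality, semantic coherence, syntactic structure, and sarcasm). Combining statistical analysis with interpretable machine learning (XGBoost with SHAP and LIME), it finds stable associations between speech features and the severity of depression, anxiety, and ADHD symptoms across multiple datasets, and identifies the most informative feature groups through an ablation study.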

Comments Accepted to CLPsych 2026, part of ACL 2026

Abstract

Speech and language technologies offer valuable opportunities for supporting mental health assessment through objective and interpretable cues. We present a systematic feature-based analysis framework leveraging perceptually grounded acoustic and linguistic characteristics, including prosody, vocal quality, semantic coherence, syntactic structure, and sarcasm. Using statistical analysis and interpretable machine learning (XGBoost with SHAP and LIME), we examine associations between speech features and validated symptom measures of depression, anxiety, and ADHD. Evaluated on both controlled benchmark datasets (StressID, DAIC-WOZ, Androids, EATD) and a real-world clinical dataset, the framework reveals stable and consistent relationships between symptom severity and vocal irregularities (e.g., shimmer, jitter), lexical-syntactic patterns, and affective tone. An ablation study conducted across all datasets further identifies the most informative feature groups. This work explores a transparent and clinically interpretable approach to speech-based mental health analysis.
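
For readers unfamiliar with the vocal-irregularity features named above, the sketch below computes local jitter and local shimmer in their common textbook form: the mean absolute difference between consecutive glottal periods (or peak amplitudes), divided by the mean period (or amplitude). The abstract does not state which extraction toolkit or variant the authors use, so this is only an illustration, and the input values are made up.

```python
def local_perturbation(values):
    """Mean absolute difference of consecutive values, relative to their mean."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


def jitter(periods):
    """Local jitter from a sequence of glottal cycle durations (seconds)."""
    return local_perturbation(periods)


def shimmer(amplitudes):
    """Local shimmer from a sequence of per-cycle peak amplitudes."""
    return local_perturbation(amplitudes)


# Illustrative values only: an irregular voice versus a perfectly steady one.
irregular_periods = [0.010, 0.011, 0.010, 0.011]
steady_periods = [0.010, 0.010, 0.010, 0.010]

print(jitter(irregular_periods), jitter(steady_periods))
```

Both measures are dimensionless ratios, which is part of what makes them convenient as interpretable inputs to a tabular model such as XGBoost.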

2605.22297 2026-05-28 cs.LG cs.AI

One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs

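Di He, Songjun Tu, Keyu Wang, Lu Yin, Shiwei Liu

Affiliations SIAT (Shenzhen Institutes of Advanced Technology); PCL (Peng Cheng Laboratory); UCAS (University of Chinese Academy of Sciences); UOT (University of Tübingen); LIU1 (Max Planck Institute for Intelligent Systems); LIU2 (ELLIS Institute Tübingen); LIU3 (Tübingen AI Center)

AI summary Proposes a Layerwise Learning Rate (LLR) method grounded in Heavy-Tailed Self-Regularization theory that assigns different learning rates to individual Transformer layers, accelerating training and improving generalization; its effectiveness is validated across multiple models and optimizers.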

Abstract

Learning rate configuration is a fundamental aspect of modern deep learning. The prevailing practice of applying a uniform learning rate across all layers overlooks the structural heterogeneity of Transformers, potentially limiting their effectiveness as the backbone of Large Language Models (LLMs). In this paper, we introduce Layerwise Learning Rate (LLR), an adaptive scheme that assigns distinct learning rates to individual Transformer layers. Our method is grounded in Heavy-Tailed Self-Regularization (HT-SR) theory, which characterizes the empirical spectral density (ESD) of weight correlation matrices to quantify heavy-tailedness. Layers with weaker heavy-tailedness are assigned larger learning rates to accelerate training, while layers with stronger heavy-tailedness receive smaller learning rates. By tailoring learning rates in this manner, LLR promotes more balanced training across layers, leading to faster convergence and improved generalization. Extensive experiments across architectures ranging from LLaMA to GPT-nano, optimizers including AdamW and Muon, and model scales from 60M to 3B parameters with up to 100B training tokens demonstrate the effectiveness of LLR. LLR achieves up to 1.5x training speedup and consistently outperforms uniform-learning-rate baselines. In particular, it improves the average zero-shot accuracy of 1B models from 47.09% to 49.02%, and that of 3B models from 48.58% to 50.61%. A key advantage of LLR is its low tuning overhead: it can transfer nearly optimal learning-rate settings directly from the uniform baseline. Code is available at https://github.com/hed-ucas/Layer-wise-Learning-Rate.
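
The idea of turning a per-layer heavy-tail measurement into a per-layer learning rate can be sketched as follows. Two assumptions are made here that the abstract does not confirm: heavy-tailedness is summarized by a power-law exponent `alpha` estimated with the Hill estimator (smaller `alpha` meaning a heavier tail), and the mapping from `alpha` to a learning-rate multiplier is a simple linear interpolation. The names `hill_alpha`, `layerwise_lrs`, `low`, and `high` are hypothetical; the authors' repository defines the actual scheme.

```python
import math


def hill_alpha(eigenvalues, k):
    """Hill estimate of the tail exponent from the k largest eigenvalues."""
    lam = sorted(eigenvalues, reverse=True)
    threshold = lam[k]  # the (k+1)-th largest eigenvalue
    return 1.0 + k / sum(math.log(lam[i] / threshold) for i in range(k))


def layerwise_lrs(alphas, base_lr, low=0.5, high=1.5):
    """Map each layer's alpha linearly onto [low, high] * base_lr.

    Larger alpha (weaker heavy tail) receives a larger learning rate,
    following the direction stated in the abstract.
    """
    a_min, a_max = min(alphas), max(alphas)
    if a_max == a_min:
        return [base_lr for _ in alphas]
    span = a_max - a_min
    return [base_lr * (low + (high - low) * (a - a_min) / span) for a in alphas]


# Illustrative exponents for three layers, not measured values.
print(layerwise_lrs([2.0, 3.0, 4.0], base_lr=1e-3))
```

The resulting list could be passed to an optimizer as one parameter group per layer; how often the exponents are re-estimated during training is a design choice this sketch leaves open.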

2605.18137 2026-05-28 cs.CV

Xiaomi Auto World Model: A Joint World Model Integrating Reconstruction and Generation for Autonomous Driving

Lijun Zhou, Hongcheng Luo, Zhenxin Zhu, Cheng Chi, Mingfei Tu, Kaixin Xiong, Lei Gong, Zhanqian Wu, Zehan Zhang, Fangzhen Li, Hao Li, Yingying Shen, Jiale He, Haohui Zhu, Shan Zhao, Kai Wang, Zhiwei Zhan, Yuechuan Pu, Kaiyuan Tan, Ruiling Yang, Xianqi Wang, Tianyi Yan, Jiawei Zhou, Lei Zhang, Jingyang Zhao, Xi Zhou, Chitian Sun, Chenming Wu, Jiong Deng, Hongwei Xie, Ming Lu, Kun Ma, Long Chen, Guang Chen, Hangjun Ye, Bing Wang, Haiyang Sun

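Affiliations Xiaomi

AI summary Proposes a unified technical system comprising WorldRec, a reconstruction module driven by sparse scene queries, and WorldGen, a two-stage training framework, to achieve high-fidelity 3D scene representation and high-quality causal video generation, and integrates the two to improve generation stability, cross-frame consistency, and visual fidelity.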

Abstract

This report presents a unified technical system addressing the two core capabilities of world models for autonomous driving: world representation and world generation. For world representation, we propose WorldRec, a feed-forward reconstruction architecture driven by sparse scene queries. WorldRec initializes structured queries in 3D space, leveraging them to aggregate cross-view, cross-temporal features, thereby naturally enforcing spatial consistency across frames and yielding compact yet high-fidelity 3D Gaussian scene representations. For world generation, we propose WorldGen, a two-stage training framework of bidirectional pretraining followed by causal fine-tuning through three progressive stages (Teacher Forcing, ODE distillation, and DMD), enabling high-quality online causal video generation in as few as 4 denoising steps. Building on both modules, we further introduce the JWM, which deeply integrates WorldRec and WorldGen to achieve synergistic gains in generation stability, cross-frame consistency, and visual fidelity, providing a solid foundation for closed-loop simulation, data synthesis, and end-to-end training in autonomous driving.

2605.02417 2026-05-28 cs.CV

DirectEdit: Step-Level Accurate Inversion for Flow-Based Image Editing

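Desong Yang, Mang Ye

Affiliations National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, China

AI summary Proposes DirectEdit, which eliminates accumulated drift in the inversion process by directly aligning the forward paths, enabling precise reconstruction and reliable feature sharing without additional neural function evaluations, and outperforming existing methods across diverse scenarios.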

Comments ICML 2026. Project page: https://desongyang.github.io/Directedit/

Abstract

With recent advancements in large-scale pre-trained text-to-image (T2I) models, training-free image editing methods have demonstrated remarkable success. Typically, these methods involve adding noise to a clean image via an inversion process, followed by separate denoising steps for the reconstruction and editing paths during the forward process. However, since the reconstruction path is approximated using noisy latents from mismatched timesteps, existing methods inevitably suffer from accumulated drift, which fundamentally limits reconstruction fidelity. To address this challenge, we systematically analyze the inversion process within the flow transformer and propose DirectEdit, a simple yet effective editing method that eliminates the inherent reconstruction error without introducing additional neural function evaluations (NFEs). Unlike most prior works that attempt to rectify the inversion path, DirectEdit focuses on directly aligning the forward paths, enabling precise reconstruction and reliable feature sharing. Furthermore, we introduce a preservation mechanism based on attention feature injection and multi-branch mask-guided noise blending, which effectively balances fidelity and editability. Extensive experiments across diverse scenarios demonstrate that DirectEdit achieves efficient and accurate image editing, delivering superior performance that outperforms state-of-the-art methods. Code and examples are available at https://desongyang.github.io/Directedit.

2605.01735 2026-05-28 cs.CL

Less is More: Geometric Unlearning for LLMs with Minimal Data Disclosure

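Chenchen Tan, Xinghao Li, Shujie Cui, Youyang Qu, Cunjian Chen, Longxiang Gao

Affiliations Faculty of Information Technology, Monash University, Clayton, Victoria, Australia

AI summary Proposes Geometric Unlearning (GU), which operates on the model's hidden states and uses a small set of safe reference prompts together with anchor-in-context synthetic prompts to achieve efficient selective unlearning without the original training data; on the ToFU and UnlearnPII benchmarks it achieves strong target suppression with minimal impact on non-target performance.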

Comments 21 pages, 8 Figures

Abstract

As large language models (LLMs) are increasingly deployed in real-world systems, they must support post-hoc removal of specific content to meet privacy and governance requirements. This motivates selective unlearning, which suppresses information about a particular entity or topic while preserving the LLM's general utility. However, most existing LLM unlearning methods require access to the original training corpus and rely on output-level refusal tuning or broad gradient updates, creating a tension among unlearning strength, non-target preservation, and data availability. We propose Geometric Unlearning (GU), an approach that operates directly on the model's prompt-conditioned hidden states without access to the original training corpus. Specifically, GU distills a compact, low-rank safe-behavior subspace from a small set of safe reference prompts and uses lightweight anchor-in-context synthetic prompts to trigger localized, projection-based alignment of hidden representations to this safe subspace. A teacher-distillation regularizer on synthetic non-target anchors further reduces collateral drift. Across privacy-oriented unlearning benchmarks (ToFU and UnlearnPII), GU achieves strong target suppression with minimal impact on non-target performance, demonstrating that effective unlearning can be achieved with minimal synthetic data.

2604.21668 2026-05-28 cs.CV

Encoder-Free Human Motion Understanding via Structured Motion Descriptions

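Yao Zhang, Zhuchenyang Liu, Thomas Ploetz, Yu Xiao

Affiliations Aalto University; Georgia Institute of Technology

AI summary Proposes Structured Motion Description (SMD), which converts joint position sequences into structured natural-language descriptions so that large language models can reason about motion directly without a dedicated encoder, surpassing existing methods on motion question answering and captioning.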

Abstract

The world knowledge and reasoning capabilities of text-based large language models (LLMs) are advancing rapidly, yet current approaches to human motion understanding, including motion question answering and captioning, have not fully exploited these capabilities. Existing LLM-based methods typically learn motion-language alignment through dedicated encoders that project motion features into the LLM's embedding space, remaining constrained by cross-modal representation and alignment. Inspired by biomechanical analysis, where joint angles and body-part kinematics have long served as a precise descriptive language for human movement, we propose Structured Motion Description (SMD), a rule-based, deterministic approach that converts joint position sequences into structured natural language descriptions of joint angles, body part movements, and global trajectory. By representing motion as text, SMD enables LLMs to apply their pretrained knowledge of body parts, spatial directions, and movement semantics directly to motion reasoning, without requiring learned encoders or alignment modules. We show that this approach goes beyond state-of-the-art results on both motion question answering (66.7% on BABEL-QA, 90.1% on HuMMan-QA) and motion captioning (R@1 of 0.584, CIDEr of 53.16 on HumanML3D), surpassing all prior methods. SMD additionally offers practical benefits: the same text input works across different LLMs with only lightweight LoRA adaptation (validated on 8 LLMs from 6 model families), and its human-readable representation enables interpretable attention analysis over motion descriptions. Code, data, and pretrained LoRA adapters are available at https://yaozhang182.github.io/motion-smd/.
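
A rule-based conversion from joint positions to text can be illustrated with a single joint. The sketch below computes the angle at a joint from three 3D points and renders a one-line description; the sentence template, the 10-degree threshold, and the function names are all hypothetical and are not taken from the SMD code, which is available on the project page.

```python
import math


def joint_angle(a, b, c):
    """Angle at point b, in degrees, formed by the 3D points a-b-c."""
    u = [p - q for p, q in zip(a, b)]
    v = [p - q for p, q in zip(c, b)]
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    cos = max(-1.0, min(1.0, dot / norm))  # guard against rounding error
    return math.degrees(math.acos(cos))


def describe(name, angles, threshold=10.0):
    """Render an angle sequence as text using a made-up template."""
    start, end = angles[0], angles[-1]
    if end - start > threshold:
        trend = "extends"
    elif start - end > threshold:
        trend = "flexes"
    else:
        trend = "holds steady"
    return f"{name}: {trend} from {start:.0f} to {end:.0f} degrees"


# Shoulder and elbow stay fixed while the wrist moves across two frames.
shoulder, elbow = (0, 1, 0), (0, 0, 0)
wrists = [(1, 0, 0), (0, -1, 0)]
angles = [joint_angle(shoulder, elbow, w) for w in wrists]
print(describe("right elbow", angles))
```

Because the output is plain text, it can be placed directly in an LLM prompt, which is the property the abstract relies on to avoid a learned motion encoder.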

2605.28145 2026-05-28 cs.AI cs.LG

Adaptive Reservoir Computing for Multi-Scenario Chaotic System Forecasting

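Shadmehr Zaregarizi, Khashayar Yavari

Affiliations Politecnico di Torino

AI summary Proposes an adaptive reservoir computing framework that uses four tailored strategies (exact state synchronization, histogram-guided candidate selection, multi-seed search, and sequential multi-sequence training) to score 74.91 across the 12 tasks of the CTF-4-Science Lorenz benchmark, showing that the approach is competitive and efficient.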

Comments 4 pages, 2 figures

Abstract

We present an adaptive reservoir computing framework for the CTF-4-Science Lorenz benchmark, which evaluates machine learning models across twelve distinct tasks spanning five qualitatively different scenarios: baseline forecasting, noisy signal reconstruction, forecasting under noise, few-shot learning, and parametric generalization. Rather than applying a uniform inference strategy, we tailor the training and prediction procedure of Echo State Networks (ESNs) to the specific demands of each evaluation scenario. Our key contributions are fourfold: (1) exact reservoir state synchronization that eliminates warmup approximation error in short-time prediction; (2) histogram-guided candidate selection that directly optimizes the long-time ergodic evaluation metric; (3) multi-seed reservoir search for few-shot regimes with severely limited training data; and (4) sequential multi-sequence training that resolves state-distribution mismatch in parametric generalization tasks. The proposed framework achieves a score of 74.91 on the public benchmark leaderboard, demonstrating that carefully adapted reservoir computing constitutes a competitive and computationally efficient approach for diverse chaotic system modeling challenges.
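
The role of reservoir state synchronization can be seen in a bare Echo State Network update. In the sketch below, driving the reservoir with the same observed inputs makes two different initial states converge, so the state at the start of forecasting no longer depends on an arbitrary initialization. This is a generic illustration under stated assumptions (row-normalized recurrent weights, no leak rate, no readout training); it is not the authors' exact synchronization procedure, and all names are hypothetical.

```python
import math
import random


def make_reservoir(n, n_in, norm=0.9, seed=0):
    """Random reservoir whose recurrent rows each have absolute sum `norm`."""
    rng = random.Random(seed)
    w = []
    for _ in range(n):
        row = [rng.uniform(-1, 1) for _ in range(n)]
        s = sum(abs(v) for v in row)
        # With norm < 1 the tanh update is a contraction in the max norm.
        w.append([norm * v / s for v in row])
    w_in = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n)]
    return w, w_in


def step(w, w_in, x, u):
    """One reservoir update: x' = tanh(W x + W_in u)."""
    return [math.tanh(sum(a * xj for a, xj in zip(wi, x)) +
                      sum(b * uj for b, uj in zip(bi, u)))
            for wi, bi in zip(w, w_in)]


def synchronize(w, w_in, x0, inputs):
    """Drive the reservoir with observed inputs and return the final state."""
    x = x0
    for u in inputs:
        x = step(w, w_in, x, u)
    return x


w, w_in = make_reservoir(20, 3)
inputs = [[math.sin(0.1 * t), math.cos(0.1 * t), 0.5] for t in range(300)]
xa = synchronize(w, w_in, [0.0] * 20, inputs)
xb = synchronize(w, w_in, [0.9] * 20, inputs)
print(max(abs(a - b) for a, b in zip(xa, xb)))
```

A linear readout fitted on the collected states (typically by ridge regression) would complete the forecaster; it is omitted here to keep the example dependency-free.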

2605.28144 2026-05-28 cs.AI

Deconstructing Spatial Complexity: Hierarchical Decomposition for LLM Spatial Reasoning

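Yi Wang, Haojie Lu, Zhaofan Zhang, Li Chen, Sihong Xie

Affiliations The Hong Kong University of Science and Technology (Guangzhou)

AI summary Proposes a hierarchical task decomposition method combined with MCTS-Guided Group Relative Policy Optimization (M-GRPO), which improves intermediate-state selection and planning and substantially boosts LLM performance on spatial tasks such as navigation, planning, and strategic games.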

Comments 8 pages

Abstract

LLMs have shown remarkable proficiency in general language understanding and reasoning. However, they consistently underperform in spatial reasoning, which severely limits their application, particularly in embodied intelligence. Inspired by the success of hierarchical reinforcement learning, this paper introduces a novel method for hierarchical task decomposition in LLM spatial reasoning. Our approach guides LLMs to decompose complex tasks into manageable sub-tasks by identifying key intermediate states and generating simplified sub-environments. However, we identify that LLMs often fail to derive optimal intermediate states due to their insufficient spatial prior, leading to sub-optimal task decomposition. To address this limitation and enhance their planning capability, we propose the MCTS-Guided Group Relative Policy Optimization (M-GRPO), where we reformulate the UCT formula by incorporating the LLM's prior predictive probabilities alongside its epistemic uncertainty. Furthermore, we implement a more fine-grained advantage function, enabling the model to learn optimal path planning. Experimental results demonstrate that our method substantially improves LLM performance on spatial tasks, including navigation, planning, and strategic games, achieving state-of-the-art results. This work paves the way for LLMs in real-world applications.
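
For context on the UCT reformulation, the sketch below contrasts the standard UCT score with one possible way of folding in a policy prior and an uncertainty term. The abstract does not state the M-GRPO formula, so `prior_uncertainty_score` is purely a hypothetical illustration: it uses a prior-weighted exploration bonus of the kind popularized by AlphaZero-style search and scales it by `1 + beta * uncertainty`, where `beta` is an assumed weight.

```python
import math


def uct(q, n_child, n_parent, c=1.4):
    """Standard UCT: mean value plus a count-based exploration bonus."""
    if n_child == 0:
        return math.inf  # unvisited children are explored first
    return q + c * math.sqrt(math.log(n_parent) / n_child)


def prior_uncertainty_score(q, n_child, n_parent, prior, uncertainty,
                            c=1.4, beta=1.0):
    """Hypothetical variant: prior-weighted bonus scaled by uncertainty."""
    bonus = c * prior * math.sqrt(n_parent) / (1 + n_child)
    return q + bonus * (1.0 + beta * uncertainty)


# Same statistics, scored with and without model uncertainty.
print(uct(0.5, 4, 16))
print(prior_uncertainty_score(0.5, 4, 16, prior=0.3, uncertainty=0.0))
print(prior_uncertainty_score(0.5, 4, 16, prior=0.3, uncertainty=0.8))
```

In this variant, actions the model considers likely get more exploration early on, and the bonus grows where the model is less certain; whether the paper combines the two signals multiplicatively or additively is not stated in the abstract.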

2605.28143 2026-05-28 cs.LG cs.IT eess.SP math.IT

Sequential Neural Probabilistic Amplitude Shaping: Learning the Channel's Language

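Mohammad Taha Askari, Lutz Lampe, Amirhossein Ghazisaeidi

Affiliations Department of Electrical and Computer Engineering, University of British Columbia; Nokia Bell Labs

AI summary Proposes the first neural probabilistic amplitude shaping method that accounts for all implementation losses, using a block-less, easily implementable sequential autoregressive encoder with arithmetic distribution matching to reduce rate loss and raise achievable information rates.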

Comments 4 pages, 2 figures, Submitted to the 52nd European Conference on Optical Communications

Abstract

We present the first neural probabilistic amplitude shaping that outperforms existing methods while accounting for all implementation losses, using a block-less, easily implementable sequential autoregressive encoder compatible with arithmetic distribution matching, yielding reduced rate loss and higher achievable information rates.

2605.28142 2026-05-28 cs.LG cs.CL

Self-Consistency via Marginal Sharpening

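Aleksei Arzhantsev, Otmane Sakhi, Nicolas Chopin

Affiliations Criteo AI Lab; CREST IP Paris; ENSAE, Institut Polytechnique de Paris

AI summary Proposes an autoregressive parallel sampling algorithm that sharpens the answer marginal rather than the full-output distribution, outperforming standard power sampling on mathematics and coding benchmarks while running faster.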

Abstract

Inference-time sampling can elicit strong reasoning abilities from language models without additional training. Existing power-sampling methods do so by sharpening the distribution over full generated outputs, favoring completions that are individually likely under the model. We argue that this is the wrong object to target for reasoning: a completion entangles a reasoning trace with a final answer, whereas what matters is whether an answer is supported by many plausible reasoning paths. We therefore shift the target from the full-output distribution to the sharpened answer marginal, making self-consistency an inference-time objective rather than a post-hoc voting criterion. Surprisingly, this marginal target admits an efficient approximation: we propose a simple, purely autoregressive parallel sampling algorithm that approximately samples from the sharpened answer marginal, eliciting stronger performance than standard power sampling on mathematics and coding benchmarks while being orders of magnitude faster.
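
The difference between the two targets shows up in a toy case that can be enumerated exactly. Below, each completion is an (answer, probability) pair; sharpening the full-output distribution raises each completion's probability to the power `alpha` before summing per answer, whereas sharpening the answer marginal sums per answer first and then raises to `alpha`. This is an exact calculation on made-up numbers, not the paper's sampling algorithm, and the function names are hypothetical.

```python
from collections import defaultdict


def answer_probs_full_output(completions, alpha):
    """Answer distribution when each completion is weighted by p ** alpha."""
    weights = defaultdict(float)
    for answer, p in completions:
        weights[answer] += p ** alpha
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}


def answer_probs_marginal(completions, alpha):
    """Answer distribution when the answer marginal is raised to alpha."""
    marginal = defaultdict(float)
    for answer, p in completions:
        marginal[answer] += p
    weights = {a: m ** alpha for a, m in marginal.items()}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}


# Answer A has one likely trace; answer B is reached by three distinct traces.
completions = [("A", 0.4), ("B", 0.2), ("B", 0.2), ("B", 0.2)]

print(answer_probs_full_output(completions, alpha=4))
print(answer_probs_marginal(completions, alpha=4))
```

With these numbers, full-output sharpening favors A because its single trace is individually the most likely, while marginal sharpening favors B because more total probability mass supports it, which mirrors the argument in the abstract.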

2605.28139 2026-05-28 cs.AI

Data-Efficient On-Policy Distillation for Automatic Speech Recognition

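Yu Lin, Yiming Wang, Runyuan Cai, Xiaodong Zeng

Affiliations AutoArk-AI

AI summary Proposes an on-policy distillation method that uses a teacher model (Qwen-ASR) to improve the recognition ability of a student model (Ark-ASR) with only 100k hours of speech data, surpassing the same-scale baseline on several benchmarks.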

Abstract

Building competitive automatic speech recognition (ASR) models usually requires large-scale audio supervision, which makes reproduction and specialization expensive. We study Ark-ASR, a 0.6B-parameter audio-conditioned language model trained with 100k hours of speech, and examine whether a strong Qwen-ASR teacher can transfer additional recognition capability through on-policy distillation. Across Mandarin and English ASR benchmarks, the proposed training recipe consistently improves over supervised fine-tuning alone and outperforms the same-scale Qwen3-ASR-0.6B baseline on four of five evaluation sets. This is achieved with only 100k hours of speech, compared with the 20M hours of supervised audio reported for the Qwen3-Omni AuT encoder. The larger Qwen3-ASR-1.7B remains stronger, but the results show that teacher-guided on-policy training can substantially close the gap for compact ASR models under a much smaller audio budget. A support-overlap diagnostic further suggests that the teacher-data stage improves local student-teacher compatibility, matching recent analyses of when on-policy distillation is effective.

2605.28137 2026-05-28 cs.CV cs.LG

No Safe Dose: How Training Data Drives Unsafe Image Generation

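Felix Friedrich, Lukas Helff, Niharika Hegde, Patrick Schramowski, Kristian Kersting

Affiliations Black Forest Labs; TU Darmstadt & hessian.AI; DFKI; Lab1141

AI summary By controlling the fraction of unsafe images in the training data (0% to 9.6%), the study finds that the output unsafety rate rises monotonically with that fraction and that the proportion, not the absolute count, is the key factor; a safer text encoder (e.g., SafeCLIP) lowers the baseline risk, but the dose-response effect persists.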

Abstract

Text-to-image models trained on large-scale data often inevitably ingest unsafe content. While input-output amplification has been observed, it remains unclear whether and how training data composition directly drives model output safety, or whether other factors determine it. We shed light on this question by isolating this variable: we train the same text-to-image model on datasets that differ only in their fraction of unsafe images (0% to 9.6%), across several dataset scales (100K to 8M). Then we generate images with the resulting models and evaluate them with four independent safety classifiers. Output unsafety rises monotonically from 16.6% at 0% contamination to 25.5% at 5%. A factorial design reveals that the proportion, not the absolute count, of unsafe training images is the operative variable. The 16.6% irreducible baseline at zero contamination implicates the other components, e.g., the frozen text encoder, as a residual safety risk; this is confirmed by a text encoder ablation showing that SafeCLIP reduces this floor to 9.6%, while the dose-response effect persists across all three encoders tested. Critically, no quality degradation in terms of FID, CLIPscore, and ImageReward accompanies safety filtering. These results establish that data curation and text encoder safety are complementary and independently effective interventions. At the same time, the remaining level of unsafety poses questions for future research about emerging capabilities and compositionality.

2605.28136 2026-05-28 cs.CV cs.RO

SAM-Enhanced Segmentation on Road Datasets: Balancing Critical Classes in Autonomous Driving

SAM增强的道路数据集分割:自动驾驶中关键类别的平衡

Toomas Tahves, Mauro Bellone, Junyi Gu, Raivo Sell

发表机构 * Department of Mechanical and Industrial Engineering, Tallinn University of Technology(塔林理工大学机械与工业工程系) FinEst Centre for Smart Cities, Tallinn University of Technology(塔林理工大学智能城市研究中心) Department of Computer Science and Engineering, Universitas Mercatorum(默卡托姆大学计算机科学与工程系) Department of Computer Science and Engineering, Chalmers University of Technology(查尔姆斯理工大学计算机科学与工程系) University of Gothenburg(哥德堡大学)

AI总结 提出基于SAM的标注流水线,将ZOD数据集的边界框转换为密集像素级语义掩码,并评估不同架构在类别不平衡下的性能,通过双向迁移学习实现跨传感器配置的有效迁移。

详情
AI中文摘要

密集语义分割对于自动驾驶至关重要,然而许多多模态数据集缺乏像素级标注。Zenseact开放数据集(ZOD)提供丰富的多传感器数据,但仅有边界框标签,限制了其在分割研究中的应用。我们的主要贡献是一个基于Segment Anything Model(SAM)的标注流水线,通过将边界框转换为语义掩码,为ZOD生成密集的像素级标注。在这项初步研究中,我们处理了超过10万帧,并手动筛选出一个2300帧的子集(接受率36%),以建立可靠的基线。利用这些标注,我们评估了基于Transformer的CLFT和基于CNN的DeepLabV3+架构在不同天气条件下的性能,其中CLFT-Hybrid达到了48.1%的mIoU。为了解决极端类别不平衡问题(行人、骑行者、标志牌像素占比不足1%),我们探索了针对稀有类别的专门模型。我们还在Iseauto自动驾驶平台上验证了该流水线,达到了77.5%的mIoU,并展示了通过双向迁移学习,SAM导出的表示能够有效地跨传感器配置迁移。所有代码和标注均已发布,以支持可重复研究。

英文摘要

Dense semantic segmentation is essential for autonomous driving, yet many multi-modal datasets lack pixel-level annotations. The Zenseact Open Dataset (ZOD) provides rich multi-sensor data but only bounding-box labels, limiting its use for segmentation research. Our primary contribution is a Segment Anything Model (SAM)-based annotation pipeline that produces dense, pixel-level annotations for ZOD by converting bounding boxes into semantic masks. In this pilot study, we process over 100,000 frames and manually curate a 2,300-frame subset (36% acceptance rate) to establish a reliable baseline. Using these annotations, we evaluate transformer-based CLFT and CNN-based DeepLabV3+ architectures across diverse weather conditions, achieving up to 48.1% mIoU with CLFT-Hybrid. To address extreme class imbalance, where pedestrians, cyclists, and signs constitute less than 1% of pixels, we explore specialized models targeting rare classes. We further validate the pipeline on the Iseauto autonomous-vehicle platform, achieving 77.5% mIoU, and show that SAM-derived representations transfer effectively across sensor configurations via bidirectional transfer learning. All code and annotations are released to support reproducible research.
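
The mIoU figures quoted above can be read with a minimal sketch of the metric in mind. The function below is an illustrative pure-Python version, not the authors' evaluation code; the name `mean_iou`, the flat label lists, and the choice to skip classes absent from both maps are assumptions. The toy data shows how a single mispredicted rare class (label 2) pulls the mean down, which is the kind of class-imbalance effect the abstract describes.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes (illustrative sketch).

    `pred` and `target` are flat lists of integer class labels, one per pixel.
    Classes that appear in neither list are skipped (an assumed convention).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union == 0:
            continue  # class absent from both maps
        ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Toy 6-pixel example: class 0 is perfect, class 1 is partly right,
# class 2 is predicted once but never present in the ground truth.
pred = [0, 0, 1, 0, 2, 1]
target = [0, 0, 1, 0, 1, 1]
score = mean_iou(pred, target, num_classes=3)
```

Here the per-class IoUs are 1, 2/3, and 0, so pixel accuracy (5 of 6 pixels) looks far better than the mean over classes.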

2605.28133 2026-05-28 cs.LG stat.ML

Learning to Bid in Repeated Second-Price Auctions with Dynamic Values and Aggregated Feedback

在具有动态价值和聚合反馈的重复第二价格拍卖中学习出价

Benjamin Heymann, Otmane Sakhi

发表机构 * Criteo AI Lab(Criteo AI实验室)

AI总结 研究当投标者价值动态变化时,如何通过结合插件估计器和最优策略的微分方程刻画来学习出价策略,并针对分段线性和一般光滑原始函数分别实现接近最优的遗憾界。

详情
AI中文摘要

我们研究了当投标者的价值是动态的,即当前价值依赖于过去结果时的学习出价问题。具体来说,我们考虑一个参与重复第二价格拍卖的投标者,其价值取决于自上次成功出价以来的时间,拍卖在连续时间内到达,并且仅在时间范围结束时揭示聚合反馈。这样的投标者必须(1)平衡赢得当前拍卖的即时收益与其对未来价值的影响,以及(2)学习未知的环境参数。我们推导了一类学习方法的遗憾界,该方法将插件估计器与最优策略的微分方程刻画相结合,并表明一个特定的置信界算法以接近最优的遗憾学习最优策略,对于分段线性原始函数为$\widetilde{O}(\log N)$,对于一般光滑原始函数为$\widetilde{O}(N^{1/3})$,且无需显式随机化即可实现这些遗憾。这些理论结果得到了数值实验的支持。

英文摘要

We study the problem of learning to bid when the bidder's value is dynamic, i.e., when the current value depends on past outcomes. Specifically, we consider a bidder participating in repeated second-price auctions whose value depends on the time elapsed since their last successful bid, with auctions arriving in continuous time and only aggregated feedback revealed at the end of the horizon. Such a bidder must (1) balance the immediate benefit of winning the current auction against its impact on future values and (2) learn unknown environmental parameters. We derive regret bounds for a class of learning methods that combine plug-in estimators with a differential-equation characterization of the optimal policy, and show that a specific confidence-bound algorithm learns the optimal policy with near-optimal regret of $\widetilde{O}(\log N)$ for piecewise linear primitives and $\widetilde{O}(N^{1/3})$ for general smooth primitives, achieving these regrets without explicit randomization. These theoretical results are supported by numerical experiments.
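
To make the setting concrete, here is a small sketch of a second-price auction with a value that depends on the time since the last win. It is not the paper's algorithm: the linear value curve `value_since_last_win`, the arrival times, and the competing bids are all made-up assumptions, and the bidder shown is a myopic baseline that simply bids its current value.

```python
def second_price_outcome(bid, value, competing_bid):
    """One second-price auction: win if the bid beats the highest competing
    bid, and pay that competing bid rather than one's own bid."""
    won = bid > competing_bid
    payoff = value - competing_bid if won else 0.0
    return won, payoff


def value_since_last_win(elapsed):
    # Hypothetical value curve: recovers linearly after a win, saturates at 1.0.
    return min(1.0, 0.25 * elapsed)


def run_myopic(arrival_times, competing_bids):
    """Myopic baseline: bid the current value, ignoring future effects."""
    last_win, total, wins = 0.0, 0.0, []
    for t, competing in zip(arrival_times, competing_bids):
        v = value_since_last_win(t - last_win)
        won, payoff = second_price_outcome(bid=v, value=v, competing_bid=competing)
        if won:
            last_win = t  # winning resets the clock, lowering the next value
            total += payoff
        wins.append(won)
    return total, wins


total, wins = run_myopic([1.0, 2.0, 3.0, 6.0], [0.5, 0.2, 0.4, 0.6])
```

The reset of `last_win` is what makes the problem dynamic: each win lowers the value of the auctions that follow, which is the trade-off in point (1) of the abstract.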

2605.28132 2026-05-28 cs.CV

Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models

哪种预训练范式更有利于空间智能?视觉-语言模型与视频生成模型的实证比较

Haozhan Shen, Tiancheng Zhao, Kangjia Zhao, Jianwei Yin

发表机构 * Zhejiang University(浙江大学) Om AI Research(Om人工智能研究) Binjiang Institute of Zhejiang University(浙江大学滨江研究院)

AI总结 本文通过冻结特征探测研究,系统比较了视觉-语言模型(VLM)和视频生成模型(VGM)在语义标注、实例分组和3D几何预测三个空间智能维度上的表现,发现两者互补且简单融合可提升整体性能。

Comments Code is here: https://github.com/om-ai-lab/Probing-VLM-VGM

详情
AI中文摘要

空间智能需要能够捕捉物理世界中语义对象和几何结构的视觉表示。为此,两种主要的预训练方案被广泛用作基础骨干:视觉-语言模型(VLM),它使用语言监督将视觉观察与语义概念对齐;以及视频生成模型(VGM),它从时间演变的视觉世界中学习。然而,目前尚不清楚哪种预训练方案为空间智能提供了更好的表示基础。在本文中,我们首次对VLM和VGM在空间智能的三个代表性维度上进行了系统的冻结特征探测研究:语义标注、实例分组和3D几何预测。通过轻量级探测,我们的框架能够控制性地比较两个模型家族的冻结表示中已经编码的信息。实验结果显示明显的互补性:VLM在语义标注和实例分组方面更强,而VGM为密集几何和相机运动提供了更易获取的信号。此外,两者的简单融合已经产生了在几何和语义方面都表现出色的表示,这表明通过有效整合两个模型家族的特征来构建更强的空间智能骨干是一个有前景的方向。我们的代码可在https://github.com/om-ai-lab/Probing-VLM-VGM获取。

英文摘要

Spatial intelligence requires visual representations that capture both semantic objects and geometric structure in the physical world. To support this, two major pre-training schemes are now widely used as foundation backbones: Vision-Language Models (VLMs), which use language supervision to align visual observations with semantic concepts, and Video Generation Models (VGMs), which learn from temporally evolving visual worlds. However, it remains unclear which pre-training scheme provides a better representation substrate for spatial intelligence. In this paper, we present the first systematic frozen-feature probing study of VLMs and VGMs across three representative axes of spatial intelligence: semantic tagging, instance grouping, and 3D geometry prediction. Using a lightweight probe, our framework enables a controlled comparison of what information is already encoded in the frozen representations of the two model families. Experimental results reveal a clear complementarity: VLMs are stronger at semantic tagging and instance grouping, while VGMs provide more accessible signals for dense geometry and camera motion. Moreover, a naive fusion of the two already yields a representation that excels at both geometry and semantics, suggesting a promising direction for building stronger spatial-intelligence backbones by effectively integrating features from both model families. Our code is available at https://github.com/om-ai-lab/Probing-VLM-VGM.

2605.28131 2026-05-28 cs.CL

Better heads do not guarantee better binarized constituency parsing

更好的中心词并不能保证更好的二值化成分句法分析

Zeyao Qi, Yige Chen, Eitan Klinger, Vivaan Wadhwa, Jungyeul Park

发表机构 * The Chinese University of Hong Kong, Ma Liu Shui, Hong Kong(香港中文大学,香港马料水) The University of British Columbia, Vancouver, Canada(不列颠哥伦比亚大学,温哥华,加拿大) Korea Advanced Institute of Science & Technology, Daejeon, South Korea(韩国科学技术院,大田,韩国)

AI总结 本文研究了标点感知的树二值化方法,并探讨了由依存关系导出的中心词信息是否改善了二值化句法分析器的监督信号,发现尽管学习到的中心词在内在中心词预测上优于基于规则的中心词,但在去二值化后并未带来一致的句法分析提升。

详情
AI中文摘要

我们重新审视了成分句法分析中标点感知的树二值化,并探究由依存关系导出的中心词信息是否改善了二值化句法分析器的监督信号。尽管学习到的中心词在内在中心词预测上显著优于基于规则的中心词,但在去二值化后并未带来一致的句法分析提升。特别地,标点条件评估显示,在宏平均标点敏感$F_1$上,学习到的中心词信息表现不如基于规则的二值化,尽管在CTB上整体有小幅提升。类似的不稳定性在跨树库迁移中也出现。这些结果表明,当用作二值化控制信号时,基于语言学的中心词信息不一定对句法分析器最优。本文呈现了一个负面结果:更好的中心词预测并不意味着更好的标点敏感成分句法分析。

英文摘要

We revisit punctuation-aware tree binarization for constituency parsing and ask whether dependency-induced headedness improves binary parser supervision. Although learned heads substantially outperform rule-based heads in intrinsic head prediction, they do not yield consistent parsing gains after debinarization. In particular, punctuation-conditioned evaluation shows that learned headedness underperforms rule-based binarization in macro-average punctuation-sensitive $F_1$, despite a small overall gain on CTB. Similar instability appears under cross-treebank transfer. These results suggest that linguistically grounded headedness is not necessarily parser-optimal when used as a binarization control signal. The paper presents a negative result: better head prediction does not imply better punctuation-sensitive constituency parsing.

2605.28129 2026-05-28 cs.AI

Do Clinical Models Change Treatment Decisions?

临床模型是否会改变治疗决策?

Dongkyu Cho, Miao Zhang, Rumi Chunara

发表机构 * New York University(纽约大学)

AI总结 本研究提出ClinPivot基准,通过生物医学关系和变化的患者情境评估临床基础模型在治疗决策中的适应性,发现强医学QA能力不能可靠预测决策表现。

Comments 9 pages, 3 figures

详情
AI中文摘要

临床基础模型通过事实性或考试式医学QA进行评估,但治疗决策必须在患者情境变化时改变。我们引入ClinPivot,一个基于生物医学关系和枢轴患者情境的可审计治疗决策基准。ClinPivot询问当新的临床约束改变行动空间时,模型是否会改变治疗选择。我们发现,强大的医学QA表现并不能可靠预测决策表现:前沿模型和任务适应的Qwen变体经常无法正确改变决策,且模型排名在不同评估体系间发生变化。在匹配的知识预算下,决策结构化的监督提高了对枢轴敏感的决策制定和医学QA,而轻量级重放减少了通用助手能力上的损失。

英文摘要

Clinical foundation models are evaluated with factual or exam-style medical QA, but treatment decisions must change when patient context changes. We introduce ClinPivot, an auditable treatment-decision benchmark built from biomedical relations and pivoted patient contexts. ClinPivot asks whether models change treatment choices when new clinical constraints shift the action space. We find that strong medical QA performance does not reliably predict decision-making performance: frontier models and task-adapted Qwen variants often fail to change decisions correctly, and model rankings shift across evaluation regimes. Decision-structured supervision improves pivot-sensitive decision-making and medical QA under matched knowledge budgets, while lightweight replay reduces losses in general assistant ability.

2605.28128 2026-05-28 cs.CL

Chinese Word Boundary Recovery through Character Alignment Projection

通过字符对齐投影恢复中文词边界

Lusha Wang, Yuchen Li, Su Yuan, Jungyeul Park

发表机构 * The University of British Columbia(不列颠哥伦比亚大学) Korea Advanced Institute of Science & Technology(韩国科学技术院)

AI总结 提出基于对齐投影的两步方法,从带噪句子中恢复词边界,并构建两个评估基准,实验表明该方法能有效纠正过度切分错误。

详情
AI中文摘要

中文分词在非标准文本中尤其脆弱,语言学习者错误和其他字符层面的差异会破坏下游标注和评估所假设的词边界。本文将中文词边界恢复形式化为基于对齐的投影任务。给定一个带噪的源句子和一个更干净的目标对应句,我们首先在字符级别对齐两个字符串,然后将目标侧的词边界投影回源句。除了恢复方法本身,我们还引入了两个评估资源:基于MuCGEC的人工检查学习者中文基准,以及从中文宾州树库导出的受控合成基准。实验表明,直接分词仍然容易受到学习者输入中的复合碎片化影响,而所提出的两步投影方法通过使用校正后的目标恢复源侧词跨度,纠正了许多过度切分错误。结果表明,词边界恢复不同于普通分词,并且对齐投影为在带噪输入下稳定中文标注和评估提供了一种原则性机制。

英文摘要

Chinese word segmentation is especially fragile in non-standard text, where language learner errors and other character-level divergences disrupt the word boundaries assumed by downstream annotation and evaluation. This paper formulates Chinese word boundary recovery as an alignment-based projection task. Given a noisy source sentence and a cleaner target counterpart, we first align the two strings at the character level and then project target-side word boundaries back onto the source. Beyond the recovery method itself, we introduce two evaluation resources: a manually checked learner Chinese benchmark based on MuCGEC and a controlled synthetic benchmark derived from the Chinese Penn Treebank. Experiments show that direct segmentation remains vulnerable to compound fragmentation in learner input, whereas the proposed two-step projection method corrects many over-segmentation errors by using the corrected target to recover source-side word spans. The results show that word boundary recovery is distinct from ordinary segmentation and that alignment projection provides a principled mechanism for stabilizing Chinese annotation and evaluation under noisy input.
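
The two-step idea (align characters, then project boundaries) can be sketched with the Python standard library. This is only an illustration under assumptions: the paper's actual aligner is not described in the abstract, so `difflib.SequenceMatcher` is swapped in as a stand-in, and `project_boundaries` plus the example sentence are hypothetical.

```python
import difflib


def project_boundaries(source, target_words):
    """Project word boundaries from a segmented target onto a noisy source.

    Step 1: align source and target characters (difflib used as a stand-in).
    Step 2: map each target-side cut position to a source-side position.
    """
    target = "".join(target_words)
    cuts, pos = [], 0
    for word in target_words[:-1]:
        pos += len(word)
        cuts.append(pos)  # cut positions strictly inside the target string

    matcher = difflib.SequenceMatcher(a=source, b=target, autojunk=False)
    src_cuts = set()
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for cut in cuts:
            if j1 <= cut < j2:
                # Keep the same offset inside the aligned block, clamped to
                # the end of the source side when the block lengths differ.
                src_cuts.add(min(i1 + (cut - j1), i2))

    words, start = [], 0
    for cut in sorted(c for c in src_cuts if 0 < c < len(source)):
        words.append(source[start:cut])
        start = cut
    words.append(source[start:])
    return words


# Learner sentence with a miswritten character (平 for 苹) and its
# corrected, segmented counterpart.
recovered = project_boundaries("我喜欢吃平果", ["我", "喜欢", "吃", "苹果"])
```

In this toy case the erroneous span 平果 is kept as one word because the boundary comes from the corrected target rather than from segmenting the noisy string directly.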

2605.28127 2026-05-28 cs.LG

Adaptive Coarse-to-Fine Subgoal Refinement for Long-Horizon Offline Goal-Conditioned Reinforcement Learning

面向长视界离线目标条件强化学习的自适应由粗到细子目标细化

Kaiqiang Ke, Shenghong He, Chengdong Xu, Yuheng Luo, Xiangyuan Lan, Chao Yu

发表机构 * Sun Yat-sen University(中山大学) Pengcheng Laboratory(鹏城实验室)

AI总结 提出CFHRL框架,通过自适应递归细化子目标并基于可学习可达性成本停止细化,解决长视界离线目标条件强化学习中的弱监督和累积误差问题。

详情
AI中文摘要

离线目标条件强化学习(GCRL)在长视界任务中具有挑战性,其中遥远的状态-目标对提供弱监督,且价值估计容易受到累积自举误差的影响。分层方法通过引入中间子目标来缓解这一困难,但固定的时间抽象或固定的层次深度可能与具有不同可达性视界的状态-目标对不匹配。我们提出由粗到细分层目标强化学习(CFHRL),一种完全离线的GCRL框架,在执行前自适应地细化遥远目标。从最终目标开始,CFHRL递归地提出中间目标,这些目标由重放支持的候选训练,并在当前目标被估计为可通过学习的可达性成本局部执行时停止细化。关键思想是,子目标不必是精确的中点或全局最优路径点;它只需要提供可靠的进展并减少剩余到达难度,从而能够在更短的视界上进行后续细化。一个风格化的分析进一步支持近似递归收缩的鲁棒性。在OGBench上的实验表明,在多个长视界任务上取得了显著收益,消融实验验证了所提出的细化和停止机制。

英文摘要

Offline goal-conditioned reinforcement learning (GCRL) is challenging in long-horizon tasks, where distant state--goal pairs provide weak supervision and value estimates become vulnerable to accumulated bootstrapping errors. Hierarchical methods mitigate this difficulty by introducing intermediate subgoals, but fixed temporal abstractions or fixed hierarchy depths can be mismatched to state--goal pairs with different reachability horizons. We propose Coarse-to-Fine Hierarchical Goal Reinforcement Learning (CFHRL), a fully offline GCRL framework that adaptively refines distant goals before execution. Starting from the final goal, CFHRL recursively proposes intermediate targets, trained from replay-supported candidates, and stops refinement once the current target is estimated to be locally executable by a learned reachability cost. The key idea is that a subgoal need not be an exact midpoint or globally optimal waypoint; it only needs to provide reliable progress and reduce the remaining reaching difficulty, enabling subsequent refinement over shorter horizons. A stylized analysis further supports the robustness of approximate recursive contraction. Experiments on OGBench show substantial gains on several long-horizon tasks, with ablations validating the proposed refinement and stopping mechanisms.
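
The refine-then-stop loop can be illustrated on a one-dimensional toy problem. Everything below is an assumption made for illustration: in the paper the proposer and the reachability cost are learned, whereas here the cost is plain distance, the candidates are a fixed list standing in for replay-supported states, and `refine_goal` is a hypothetical name.

```python
def refine_goal(state, goal, candidates, cost, threshold, max_depth=10):
    """Shrink a distant goal into a target judged locally executable.

    Starting from the final goal, repeatedly replace the current target with
    a candidate that is easier to reach and still makes progress, and stop
    once the estimated cost from `state` falls under `threshold`.
    """
    target = goal
    for _ in range(max_depth):
        current = cost(state, target)
        if current <= threshold:
            break  # target is considered locally executable
        better = [c for c in candidates
                  if cost(state, c) < current and cost(c, target) < current]
        if not better:
            break  # no candidate reduces the remaining difficulty
        # Prefer the candidate whose harder leg is as easy as possible;
        # it need not be an exact midpoint.
        target = min(better, key=lambda c: max(cost(state, c), cost(c, target)))
    return target


def distance(a, b):
    return abs(a - b)


first_target = refine_goal(state=0, goal=10, candidates=[2, 4, 5, 7, 9],
                           cost=distance, threshold=3)
```

On this toy line the goal 10 is first refined to 5 and then to 2, at which point the cost from the start state is below the threshold and refinement stops.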

2605.28125 2026-05-28 cs.CV cs.GR

CLEAR-NeRF: Collinearity and Local-region Enhanced Accurate 3D Reconstruction in Unbounded Scenes

CLEAR-NeRF: 共线性和局部区域增强的无界场景精确三维重建

Vladislav Polianskii, Elijs Dima, Isabel Salmerón Marazuela, Gergő László Nagy, Sigurdur Sverrisson, Volodya Grancharov

发表机构 * Ericsson Research(爱立信研究)

AI总结 提出CLEAR-NeRF方法,通过自动局部区域定位、共线性射线采样、深度局部邻域点提取和几何相关颜色聚合,在无界复杂场景中实现高保真度和度量精度的三维重建。

详情
AI中文摘要

许多真实世界的三维重建应用要求在无界、复杂场景中实现照片级真实感和度量精度,这些场景具有挑战性的光照和不完美的捕获,而当前的神经辐射场(NeRF)流程仅部分满足这些需求。本研究将基于NeRF的三维重建适应于多兴趣区域的无界场景,以提高对光照和姿态变化的鲁棒性,同时确保适用于数字孪生应用的度量精度。我们的方法引入了(i)自动局部区域定位/检测和重建,以无缝优先考虑感兴趣区域而不增加子模块;(ii)共线性强制射线采样,以学习平滑的平面和曲面;(iii)深度局部邻域点提取,以抑制表面伪影;以及(iv)几何相关颜色聚合,以减轻光照和姿态引起的变化。结果表明,所提出的流程在基线NeRF模型以及成熟的结构从运动(SfM)-多视图立体(MVS)解决方案上均表现出优越的性能。

英文摘要

Many real-world 3D reconstruction applications demand photorealism and metric accuracy across unbounded, complex scenes with challenging lighting and imperfect captures, requirements that current Neural Radiance Field (NeRF) pipelines only partly satisfy. This study adapts NeRF-based 3D reconstruction to unbounded scenes with multiple regions of interest, improving robustness to lighting and pose variation while enforcing metric accuracy suitable for digital-twin applications. Our approach introduces (i) automated local region localization/detection and reconstruction to seamlessly prioritize areas of interest without proliferating submodules, (ii) collinearity-enforcing ray sampling to learn smooth planar and curved surfaces, (iii) depth-localized neighborhood point extraction to suppress surface artifacts, and (iv) geometry-relevant color aggregation to mitigate lighting- and pose-induced variations. Results indicate superior performance of the proposed pipeline over the baseline NeRF models and established Structure from Motion (SfM) - Multi-View Stereo (MVS) solutions.

2605.28124 2026-05-28 cs.AI

Gradient Step Plug-and-Play Model for Dental Cone-Beam CT Reconstruction

梯度步进即插即用模型用于牙科锥束CT重建

Idris Tatachak, Luis Kabongo, Nicolas Papadakis, Xavier Ripoche, Simon Rit

发表机构 * INSA‐Lyon, Universite Claude Bernard Lyon 1, CNRS, Inserm, CREATIS UMR 5220, U1294(INSA-里昂、 Claude Bernard 里昂大学、 CNRS、 Inserm、 CREATIS UMR 5220、 U1294) Univ. Bordeaux, CNRS, Inria, Bordeaux INP, IMB, UMR 5251(波尔多大学、 CNRS、 Inria、 Bordeaux INP、 IMB、 UMR 5251) ACTEON Group, France(ACTEON集团,法国)

AI总结 提出一种基于梯度步进去噪器的即插即用算法,通过模拟扇形束采集并添加光子噪声训练先验,有效减少牙科锥束CT重建中的光子噪声。

Comments CT Meeting 2026 - 9th International Conference on Image Formation in X-Ray Computed Tomography, Jun 2026, Salt Lake City, United States

详情
AI中文摘要

本工作的目标是减少牙科锥束CT重建中光子噪声的影响。我们考虑一个逆问题公式,并开发一个基于数据的先验。为此,我们模拟扇形束采集,并向投影数据添加光子噪声。通过使用重建的模拟采集训练一个梯度步进去噪器来获得先验。将训练好的模型集成到即插即用梯度步进算法中,从模拟投影重建图像。对合成数据的实验证明了训练模型的去噪能力,而对真实图像的定性评估展示了算法的性能和泛化能力。

英文摘要

The goal of this work is to reduce the effect of photon noise in dental cone-beam CT reconstruction. We consider an inverse problem formulation and develop a data-based prior. To this end, we simulate fan-beam acquisitions and add photon noise to the projection data. The prior is obtained by training a gradient-step denoiser using reconstructed simulated acquisitions. The trained model is integrated into a plug-and-play gradient-step algorithm to reconstruct images from simulated projections. Experiments on synthetic data demonstrate the denoising capabilities of the trained model, while qualitative evaluations on real images showcase the algorithm's performance and generalization ability.

2605.28123 2026-05-28 cs.CL

Risk-aware Selective Prompting for Hallucination Mitigation in Large Vision-Language Models

风险感知的选择性提示用于大型视觉-语言模型中的幻觉缓解

Yuang Huang, Yafeng Zhang, Yu Zilan

发表机构 * Shanghai Jiao Tong University(上海交通大学) iFLYTEK(科大讯飞) Tsinghua University(清华大学)

AI总结 本文系统研究提示验证在大型视觉-语言模型中的风险,发现其效果依赖输入难度,并提出基于预生成不确定性信号的选择性提示方法RSP以平衡性能。

Comments 7 pages, 1 figure, submitted to ACL ARR 2026 May (EMNLP)

详情
AI中文摘要

基于提示的验证被广泛用于缓解大型视觉-语言模型(LVLMs)中的幻觉,但其何时有效仍不清楚。我们系统研究了两种代表性LVLM架构和幻觉基准上的验证提示,发现它是一种有风险的干预:其纠正随输入难度增加,而新引入的错误在不同难度级别持续存在。因此,始终开启的提示在困难输入上有帮助,但在简单输入上益处甚微甚至有害。我们的分析进一步表明,这种行为与保守的输出偏移相关。验证提示将注意力从视觉令牌重新分配到指令令牌,并诱导出中性提示控制中不存在的中层熵模式,这表明是指令条件化的注意力重新分配而非统一的视觉基础改善。受这种输入依赖风险的启发,我们提出了风险感知的选择性提示(RSP),一种无需训练的方法,利用预生成不确定性信号选择性地触发验证。RSP减轻了始终开启提示的性能下降,同时保持基线性能,并揭示了有效的选择信号因架构而异。

英文摘要

Prompt-based verification is widely used to mitigate hallucinations in large vision-language models (LVLMs), yet when it helps remains poorly understood. We systematically study verification prompting across two representative LVLM architectures and hallucination benchmarks, and find that it is a risk-bearing intervention: its corrections increase with input difficulty, while newly introduced errors persist across difficulty levels. As a result, always-on prompting helps on hard inputs but offers little benefit -- and can harm -- easier ones. Our analysis further shows that this behavior is associated with a conservative output shift. Verification prompts redistribute attention from visual tokens toward instruction tokens and induce a distinct middle-layer entropy pattern absent in a neutral-prompt control, suggesting instruction-conditioned attention redistribution rather than uniformly improved visual grounding. Motivated by this input-dependent risk, we propose Risk-aware Selective Prompting (RSP), a training-free approach that uses pre-generation uncertainty signals to trigger verification selectively. RSP mitigates the degradation of always-on prompting while preserving baseline performance, and reveals that effective selection signals vary across architectures.
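
The selective-triggering idea can be sketched as a simple gate. The abstract does not say which pre-generation uncertainty signals RSP uses (and notes they vary across architectures), so the entropy of a next-token distribution, the threshold, the prompt strings, and the function names below are all hypothetical choices for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def choose_prompt(next_token_probs, base_prompt, verify_suffix, threshold):
    """Append the verification instruction only when the model looks unsure.

    `next_token_probs` stands in for a pre-generation uncertainty signal
    (assumed here to be the distribution over the first answer token).
    """
    if entropy(next_token_probs) > threshold:
        return base_prompt + " " + verify_suffix
    return base_prompt


question = "Is there a dog in the image?"
suffix = "Check the image carefully before answering."

confident = choose_prompt([0.97, 0.02, 0.01], question, suffix, threshold=0.5)
uncertain = choose_prompt([0.40, 0.35, 0.25], question, suffix, threshold=0.5)
```

The confident case keeps the baseline prompt, which matches the motivation that always-on verification offers little benefit on easier inputs and can introduce new errors.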

2605.28120 2026-05-28 cs.CL cs.AI cs.MA

LegalGraphRAG: Multi-Agent Graph Retrieval-Augmented Generation for Reliable Legal Reasoning

LegalGraphRAG:面向可靠法律推理的多智能体图检索增强生成

Zerui Chen, Qinggang Zhang, Zhishang Xiang, Zhimin Wei, Linfeng Gao, Xiao Huang, Zhihong Zhang, Jinsong Su

发表机构 * School of Informatics, Xiamen University(厦门大学信息学院) Institute of Artificial Intelligence, Xiamen University(厦门大学人工智能研究院) The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出LegalGraphRAG框架,通过分层法律图和多智能体系统(研究员、审计员、裁决员)实现可靠的法律推理,在准确性和可信度上超越现有GraphRAG基线。

Comments 30 pages, 18 figures, ACL 2026 Main Conference. Project page: https://github.com/XMUDeepLIT/LegalGraphRAG

详情
AI中文摘要

基于图的检索增强生成(GraphRAG)通过将知识结构化为关系图,推进了平面文档检索,实现了更连贯和有效的推理。然而,将其应用于法律推理等特定领域面临关键挑战。(i) 法律语料库是异构的,包含来自案例、法条和解释的多粒度知识。平面知识图无法充分区分事实细节、适用规则和抽象原则,限制了准确检索。(ii) 可靠的法律判决需要透明、基于证据的推理。传统的RAG直接将检索到的上下文传递给LLM而不进行验证,导致推理不透明且易出错。为此,我们提出了LegalGraphRAG,一个专为可靠法律推理设计的框架。我们的方法引入了两个核心组件:一个分层法律图,用于分层组织法律来源,以便在适当的抽象级别进行检索;以及一个用于可靠法律推理的多智能体系统,其中研究员检索候选证据,审计员严格验证其相对于源文档的有效性,裁决员综合已验证的证据集作出最终判决。大量实验表明,LegalGraphRAG达到了最先进的性能,在准确和可信的法律分析方面优于现有的GraphRAG基线。我们的代码、数据集和实现细节可在https://github.com/XMUDeepLIT/LegalGraphRAG获取。

英文摘要

Graph-based Retrieval-Augmented Generation (GraphRAG) advances flat document retrieval by structuring knowledge as relational graphs, enabling more coherent and effective reasoning. However, applying it to specific domains like legal reasoning faces critical challenges. (i) Legal corpora are heterogeneous, containing multi-granular knowledge from cases, articles and interpretations. A flat knowledge graph cannot adequately differentiate between factual details, applied rules, and abstract principles, limiting accurate retrieval. (ii) Reliable legal judgment demands transparent, evidence-based reasoning. Traditional RAG passes retrieved context directly to an LLM without verification, resulting in opaque, error-prone reasoning. To this end, we propose LegalGraphRAG, a framework designed for reliable legal reasoning. Our approach introduces two core components: a hierarchical legal graph that organizes legal sources by level to enable retrieval at appropriate abstraction levels, and a multi-agent system for reliable legal reasoning, where a Researcher retrieves candidate evidence, an Auditor rigorously verifies its validity against source documents, and an Adjudicator synthesizes the set of verified evidence to render a final judgment. Extensive experiments show that LegalGraphRAG achieves state-of-the-art performance, outperforming existing GraphRAG baselines in accurate and trustworthy legal analysis. Our code, datasets and implementation details are available at https://github.com/XMUDeepLIT/LegalGraphRAG.

2605.28115 2026-05-28 cs.AI

CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models

CIVIC: 面向高效视觉语言模型的端到端序列紧凑性

Fengze Yang, Bo Yu, Xuewen Luo, Cathy Liu, Chenxi Liu

发表机构 * Department of Civil & Environmental Engineering(土木与环境工程系) University of Utah(犹他大学)

AI总结 提出CIVIC框架,通过路径一致的紧凑视觉推理,在视觉编码器、投影层、LLM预填充和KV缓存中保持紧凑序列表示,减少非连续内存访问和局部合并开销,在Qwen3-VL架构上实现KV缓存内存降至约三分之一并降低端到端推理延迟,同时通过文本对齐KL蒸馏和自适应空间保留下限保持精度。

Comments 11 pages, 6 figures, 2 tables, conference

详情
AI中文摘要

视觉语言模型(VLM)由于高分辨率视觉标记面临严重的内存和延迟瓶颈。虽然当前的标记缩减方法理论上节省了FLOPs,但事后剪枝引入了结构开销,未能产生成比例的墙上时钟加速。然而,强制实施连续的紧凑路径存在几何方向迷失和细粒度定位丢失的风险。为了克服这些障碍,本文引入了CIVIC,一种路径一致的紧凑视觉推理框架。通过在视觉编码器、投影层、LLM预填充和KV缓存中无缝地维护紧凑序列表示,CIVIC避免了非连续内存访问和局部合并开销。在Qwen3-VL架构上评估,CIVIC成功地将序列缩减转化为真正的物理硬件效率,将KV缓存内存缩小到基线的约三分之一,并减少了端到端推理延迟。通过文本对齐的KL蒸馏和自适应空间保留下限,CIVIC在严格的多模态推理和视觉定位基准测试中实现了这些效率里程碑,同时不降低准确性。

英文摘要

Vision-Language Models (VLMs) face severe memory and latency bottlenecks due to high-resolution visual tokens. While current token reduction methods theoretically save FLOPs, post-hoc pruning introduces structural overhead, failing to yield proportional wall-clock acceleration. However, enforcing a contiguous compact pathway risks geometric disorientation and loss of fine-grained localization. To overcome these barriers, this paper introduces CIVIC, a path-consistent compact visual inference framework. By maintaining compact sequence representations seamlessly across the vision encoder, projection layer, LLM prefill, and KV-cache, CIVIC avoids non-contiguous memory access and localized unmerging overheads. Evaluated on the Qwen3-VL architecture, CIVIC successfully translates sequence reductions into genuine physical hardware efficiency, shrinking KV-cache memory to approximately one-third of the baseline and reducing end-to-end inference latency. Enabled by text-aligned KL distillation and an adaptive spatial retention floor, CIVIC achieves these efficiency milestones without degrading accuracy across rigorous multimodal reasoning and visual grounding benchmarks.

2605.28114 2026-05-28 cs.AI

Human-like in-group bias in instruction-tuned language model agents

指令调优语言模型代理中类似人类的内群体偏见

Messi H. J. Lee

发表机构 * Independent Researcher(独立研究者)

AI总结 通过多代理模拟,发现指令调优语言模型在群体标签可见时表现出内群体信任偏见、行动同质性和网络同配性,且这种歧视在标准审计中不可见,但会累积为结构性不平等。

Comments 12 pages, 6 figures

详情
AI中文摘要

随着自主AI代理被部署在持久、交互的网络中——协调任务、路由资源和积累声誉历史——出现的社会动态将决定谁获得机会,谁没有,其规模是任何人类机构都无法监督的。我们进行了一项受控的多代理模拟,其中指令调优语言模型代理在三种条件下(操纵群体标签显著性和资源稀缺性)进行了500轮交互,涉及六个模型系列,每个系列20个种子。当群体标签可见时,我们观察到内群体信任偏见、行动同质性和网络同配性——当标签隐藏时这些现象全部消失——这种模式在结构上与人类社会心理学中的显著性依赖性一致。这种歧视对标准的行动日志审计是不可见的:偏见完全通过谁接收每个行动来运作,而不是通过选择什么行动,行动类型分布显示不同条件下的负面行动没有增加。所有六个模型的每轮内群体与外群体差异为5到16个百分点,具有统计显著性(Wilcoxon符号秩检验,所有Benjamini-Hochberg校正p < 0.001),表明群体条件性目标选择是指令调优语言模型在不同架构和训练范式下的稳健特性。通过500轮的互惠累积,这些差异累积成内群体信任偏见,范围为+0.014到+0.100(d = 0.84-4.52),说明每轮交互中适度的目标选择如何在持久网络中传播为结构性不平等。

英文摘要

As autonomous AI agents are deployed in persistent, interacting networks -- coordinating tasks, routing resources, and accumulating reputational histories -- the social dynamics that emerge will determine who receives opportunity and who does not, at scales no human institution can supervise. We ran a controlled multi-agent simulation in which instruction-tuned language model agents interacted across 500 turns under three conditions manipulating group label salience and resource scarcity, across six model families with 20 seeds each. When group labels were visible, we observed in-group trust bias, action homophily, and network assortativity -- all absent when labels were hidden -- a pattern structurally consistent with salience-dependence in human social psychology. This discrimination was invisible to standard action-log audits: bias operated entirely through who received each action, not what actions were chosen, with action-type distributions showing no increase in negative actions across conditions. Per-turn in-group versus out-group differentials of 5 to 16 percentage points were statistically significant for all six models (Wilcoxon signed-rank, all Benjamini-Hochberg-corrected p < 0.001), establishing group-contingent targeting as a robust property of instruction-tuned language models across architectures and training regimes. Compounded through 500 turns of reciprocation, these differentials accumulated into in-group trust biases of +0.014 to +0.100 (d = 0.84-4.52) -- illustrating how modest per-interaction targeting propagates into structural inequality in persistent networks.
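
The Benjamini-Hochberg correction mentioned above adjusts p-values for multiple comparisons (here, one test per model). A minimal pure-Python sketch of the standard step-up procedure follows; the p-values in the example are made-up toy numbers, not results from the paper, and `benjamini_hochberg` is an illustrative name.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Each sorted p-value is scaled by m / rank, then a running minimum is
    taken from the largest rank downward so adjusted values stay monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy p-values for four hypothetical tests.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A test is then called significant at level alpha when its adjusted p-value is below alpha, which is how a statement such as "all corrected p < 0.001" is read.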

2605.28111 2026-05-28 cs.LG

Chreode: A Cell World Model for One-Step Temporal Dynamics and Perturbation Prediction

Chreode: 用于一步时间动态和扰动预测的细胞世界模型

Mufan Qiu, Genhui Zheng, Yinuo Xu, Ruichen Zhang, Ying Ding, Qi Long, Tianlong Chen

发表机构 * University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) The University of Texas at Austin(德克萨斯大学奥斯汀分校) University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出Chreode,一种基于结构化残差转移算子的单步细胞世界模型,通过预训练和微调实现发育轨迹与扰动预测的统一,在多个基准上取得性能提升。

Comments 25 pages, 3 figures, 14 tables. Submitted to NeurIPS 2026

详情
AI中文摘要

预测细胞在发育信号或遗传扰动下如何改变其转录状态是计算机模拟(in-silico)生物学和AI虚拟细胞计划的计算核心。现有方法要么拟合忽略时间的静态对照到处理映射,要么在每个数据集上独立求解多步ODE/薛定谔桥问题。我们引入了Chreode,一种单步细胞世界模型,通过结构化残差转移算子预测动作条件下的细胞状态转换。它将分布演化从推理时间转移到训练时间,实现单次生成,同时保留了受Waddington启发的分解:下坡景观流、切向旋转动力学和随机扩散。该模型使用共享的scVI编码器和基于DiT的动态骨干在包含7个数据集的240万细胞小鼠胚胎图谱上进行预训练。作为微调初始化,Chreode在Weinreb造血和Veres胰岛分化上改善了每个目标的Sinkhorn距离,优于配置匹配的从头训练模型、PI-SDE和PRESCIENT。作为GEARS的可转移基因状态嵌入,预训练的动态表示将Norman Perturb-seq上的共享词汇DE20均方误差从0.2121降低到0.1858,相对改进12.4%,且未改变GEARS训练过程。我们将这种对扰动预测的可转移性解释为预训练的发育轨迹动态编码了可转移至CRISPR诱导状态变化的分化原语,因为两者都涉及共享潜在几何中的细胞状态转换。此外,预训练骨干在Weinreb上产生了与强动态OT基线相当的零样本克隆命运分数。

英文摘要

Predicting how a cell will change its transcriptional state under a developmental signal or a genetic perturbation is the computational core of in-silico biology and the AI Virtual Cell program. Existing approaches either fit static control-to-treated maps that discard time, or solve multi-step ODE / Schrödinger-bridge problems on each dataset independently. We introduce Chreode, a one-step cell world model that predicts action-conditioned cell-state transitions through a structured residual transition operator. It shifts distributional evolution from inference time to training time, enabling single-pass generation while preserving a Waddington-inspired decomposition into downhill landscape flow, rotational in-tangent dynamics, and stochastic spread. The model is pretrained with a shared scVI encoder and a DiT-based dynamics backbone on a 2.4M-cell mouse embryonic atlas spanning 7 datasets. As a fine-tuning initialization, Chreode improves per-target Sinkhorn distance on Weinreb hematopoiesis and Veres islet differentiation over matched scratch models, PI-SDE, and PRESCIENT. As a transferable gene-state embedding for GEARS, the pretrained dynamics representation reduces shared-vocabulary DE20 mean squared error on Norman Perturb-seq from 0.2121 to 0.1858, a 12.4% relative improvement, without changing the GEARS training procedure. We interpret this transfer to perturbation prediction as evidence that pretrained developmental-trajectory dynamics encode differentiation primitives transferable to CRISPR-induced state shifts, since both involve cell-state transitions in a shared latent geometry. The pretrained backbone additionally produces zero-shot clonal fate scores on Weinreb that are competitive with strong dynamic-OT baselines.
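
As a quick check on the reported numbers, the relative improvement follows directly from the two mean squared errors quoted above. The variable names are illustrative; only the two MSE values come from the abstract.

```python
# DE20 mean squared error on Norman Perturb-seq, as quoted in the abstract.
baseline_mse = 0.2121  # GEARS with its original gene embedding
chreode_mse = 0.1858   # GEARS with the pretrained Chreode representation

# Relative improvement = reduction in error divided by the baseline error.
relative_improvement = (baseline_mse - chreode_mse) / baseline_mse
percent = round(100 * relative_improvement, 1)
```

The absolute reduction is 0.0263, which is about 12.4% of the baseline error and agrees with the stated figure.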

2605.28110 2026-05-28 cs.RO

STR Robot: Design of an Autonomous Mobile Robot from Simulation to Reality

STR机器人:从仿真到现实的自主移动机器人设计

Vinh Nguyen, Gia-Uy Le, Tien-Dat Nguyen, Tri-Tin Nguyen, Vinh-Hao Nguyen

发表机构 * Faculty of Electrical and Electronic Engineering, Ho Chi Minh City University of Technology, VNU-HCM(电气与电子工程学院,胡志明市理工大学,VNU-HCM)

AI总结 本文提出一种基于现有机械平台的自主移动机器人仿真到现实实现方法,重点开发机载控制、自定位和自主导航系统,并通过仿真和实验验证其可行性。

详情
AI中文摘要

随着仿真工具的快速发展,自主机器人系统在实际部署前的开发和验证变得更加高效。本文介绍了一种基于现有机械平台的自主移动机器人的仿真到现实实现。我们的工作不关注机械设计,而是集中于机载控制、自定位和自主导航系统的开发。所提出的机器人配备了机载感知和计算能力,以估计其姿态并在环境中自主导航。整个框架首先在仿真中开发和测试,然后部署在真实机器人上进行实验评估。结果证明了所提出方法的可行性,并表明仿真为开发可靠的自主移动机器人系统提供了有效基础。源代码将在 https://ntdathp.github.io/outdoor-robot-web 发布。

英文摘要

With the rapid development of simulation tools, the development and validation of autonomous robotic systems have become more efficient before real-world deployment. This paper presents a simulation-to-real implementation of an autonomous mobile robot based on an existing mechanical platform. Instead of focusing on mechanical design, our work concentrates on the development of the onboard control, self-localization, and autonomous navigation system. The proposed robot is equipped with onboard sensing and computation to estimate its pose and navigate autonomously in the environment. The overall framework is first developed and tested in simulation, and then deployed on the real robot for experimental evaluation. The results demonstrate the feasibility of the proposed approach and show that simulation provides an effective foundation for developing reliable autonomous mobile robot systems. The source code will be released at https://ntdathp.github.io/outdoor-robot-web.