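arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.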

2606.11190 2026-06-12 cs.LG New submission

When to Align, When to Predict: A Phase Diagram for Multimodal Learning

Ilay Kamai, Hugues Van Assel, Aviv Regev, Hagai B. Perets, Randall Balestriero

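Affiliations: Technion; Genentech; Brown University; Meta AI, FAIR

AI summary: Proposes a unified linear framework that uses a signal-to-noise model to expose the complementary failure modes of cross-modal alignment and prediction, builds a four-regime phase diagram to guide the choice of multimodal learning objective, and validates it in nonlinear experiments.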

Abstract

Cross-modal alignment (CA) and cross-modal prediction (CP) are the dominant paradigms for multimodal representation learning, yet there is no systematic understanding of when each succeeds, when each fails, and when cross-modal training helps at all -- a gap that leaves practitioners, especially in scientific domains like biomedicine or astrophysics, with heterogeneous instruments and multiple levels of organization and measurement, unable to diagnose why standard methods underperform the best single modality. We develop a unified linear framework that addresses both questions. Under a spiked signal-plus-noise model with structured cross-modal nuisance correlation, we derive separation ratios for both objectives that expose complementary failure modes: alignment whitens each modality and fails when nuisance is strongly correlated across views; prediction encodes whatever is cross-predictable through a one-sided whitening, with recovery governed by source-modality quality. The resulting phase diagram partitions multimodal problems into four regimes: Both, CA only, CP only, and Neither. We present a data-driven procedure to locate real-world datasets in this diagram using a small labeled subsample, identifying the preferred objective and prediction direction before any cross-modal training. Experiments on synthetic data, stereo-vision benchmarks, image-caption pairs, and real astrophysical data validate the predictions in the nonlinear regime, including the Neither regime where cross-modal training is actively harmful. Our framework lets practitioners diagnose their multimodal problem and choose the right objective before committing to training. Code to reproduce the results is available at this https URL.

2606.10716 2026-06-12 cs.CL cs.AI New submission

Attention Expansion: Enhancing Keyphrase Extraction from Long Documents with Attention-Augmented Contextualized Embeddings

Roberto Martínez-Cruz, Alvaro J. López-López, José Portela

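Affiliations: Institute for Research in Technology, ICAI School of Engineering, Comillas Pontifical University; DD-AIM, Senior Machine Learning Researcher

AI summary: Proposes an attention expansion mechanism that uses pre-trained word embeddings to augment PLM contextual representations, extending the effective context range without increasing computational cost and significantly improving long-document keyphrase extraction.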

Abstract

Pre-trained language models (PLMs) have achieved strong performance in keyphrase extraction (KPE), largely due to their ability to generate rich contextualized representations. However, long-document KPE remains challenging because salient keyphrase evidence may be scattered across distant document sections that cannot be jointly captured within the limited context window of most PLMs. Although long-context large language models (LLMs) can process broader textual contexts, their computational cost limits their practicality for efficient and high-throughput KPE. To overcome this limitation, we propose an attention expansion mechanism that augments PLM token representations with information from surrounding out-of-context chunks using pre-trained word embeddings. The proposed mechanism expands the effective contextual scope of PLM-based KPE models without requiring full-document attention or expensive LLM-based inference. We evaluate our approach across five PLM backbones, including general-purpose, scientific, task-specific, and long-context encoders, using two training regimes and five benchmark corpora from scientific and news domains. Experimental results demonstrate that attention expansion consistently enhances KPE performance across all evaluation settings, outperforming state-of-the-art models and yielding notable improvements in F1 score. The improvements extend to domain-specific, task-specialized, and native long-context models, showing that the proposed mechanism provides complementary information rather than merely compensating for limited input length. These results establish attention expansion as an efficient and effective strategy for long-document KPE.

2606.10683 2026-06-12 cs.RO cs.AI cs.CV New submission

UniDexTok: A Unified Dexterous Hand Tokenizer from Real Data

Dong Fang, Youjun Wu, Yuanxin Zhong, Rui Zhang, Yunlong Wang, Xiaosong Jia, Yu-Gang Jiang

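Affiliations: Fudan University; Hefei University of Technology; Rimbot; Beijing University of Posts and Telecommunications

AI summary: Proposes the Unified Dexterous Hand Model (UDHM), which maps human and robot hand states into a shared 22-DoF semantic interface, and builds on it UniDexTok, a retargeting-free state tokenizer that learns discrete tokens from real joint states, giving heterogeneous dexterous hands a unified representation and cutting error by more than 98%.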

Abstract

Dexterous hands are essential for fine-grained manipulation, but their hardware designs vary substantially across embodiments. Differences in kinematics, joint definitions, and degrees of freedom make it difficult to define a shared state representation compared with parallel grippers. As a result, dexterous-hand data remains fragmented and difficult to use for joint training. In this work, we propose the Unified Dexterous Hand Model (UDHM), which maps human and robot hand states into a shared 22-DoF semantic interface. Based on UDHM, we introduce UniDexTok, a retargeting-free state tokenizer that learns embodiment-conditioned discrete tokens from standardized real joint states. UniDexTok provides a unified representation for heterogeneous dexterous hands without relying on retargeting or simulation data. Compared with the recent baseline UniHM, UniDexTok reduces MPJAE from 15.63 degrees to 0.16 degrees and MPJPE from 18.51 mm to 0.18 mm, corresponding to error reductions of 98.98% and 99.03%, respectively. These results improve reconstruction from centimeter-scale to sub-millimeter accuracy. Experiments further show that data from other embodiments improves target-embodiment reconstruction accuracy, demonstrating the benefit of cross-embodiment tokenization. UniDexTok also shows strong zero-shot and few-shot reconstruction ability when new dexterous hands are introduced.
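The quoted error reductions follow directly from the reported before/after numbers. The short check below recomputes them; the helper name `relative_reduction` is ours, introduced only for this illustration, and the inputs are the figures stated in the abstract.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Percentage reduction of `new` relative to `baseline`."""
    return 100.0 * (baseline - new) / baseline

# Figures taken from the abstract (UniHM baseline vs. UniDexTok).
mpjae_reduction = relative_reduction(15.63, 0.16)  # joint angle error, degrees
mpjpe_reduction = relative_reduction(18.51, 0.18)  # joint position error, mm

print(f"MPJAE reduction: {mpjae_reduction:.2f}%")
print(f"MPJPE reduction: {mpjpe_reduction:.2f}%")
```

Both values round to the percentages given in the abstract, so the reported reductions are consistent with the raw errors.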

2606.10678 2026-06-12 cs.LG New submission

One Step Closer to Ground Truth: A Multi-Scale Residual-Aware Representation Learning Pipeline for Predicting Time Series Data

Amrijit Biswas, Mustafa Kamal, Robin Krambroeckers, M. M. Lutfe Elahi, Sifat Momen, Nabeel Mohammed, Shafin Rahman

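Affiliations: RobotBulls Labs; North South University

AI summary: Proposes a two-stage, model-agnostic framework that explicitly decouples forecasting from residual learning and uses a meta-corrector to dynamically model structured error patterns, improving Transformer forecasting accuracy.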

Comments: Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26)

Abstract

Transformer-based models have emerged as leading paradigms in time-series forecasting in recent years, employing self-attention mechanisms to capture long-range dependencies. Despite their success, these single-stage forecasting architectures exhibit persistent systematic residual biases arising from structural discrepancies, unmodeled stochastic components, or inadequate multi-scale temporal representations. This limitation persists when residuals are treated as irreducible noise, precluding adaptive correction of structured error patterns. To address this limitation, we introduce a two-stage, model-agnostic framework that explicitly decouples forecasting and residual learning into distinct stages of representation learning. A base transformer first generates the initial predictions. Subsequently, a dedicated meta-corrector dynamically models structured error patterns across multivariate channels, preserves cross-variable dependencies, and iteratively refines the residual bias of the base transformer. By formalizing this pipeline as a hypothesis space expansion, our framework addresses approximation limitations inherent in single-stage architectures, removes reliance on restrictive assumptions, and enables end-to-end learning of complex error dynamics. Evaluated on eight popular benchmark datasets using established protocols, our approach achieves state-of-the-art performance, with significant improvements in standard metrics (MSE, MAE). The results demonstrate the framework's ability to mitigate systematic biases and enhance robustness to complex temporal dynamics, advancing the practical applicability of transformer-based forecasting models.
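The two-stage idea (a base forecaster followed by a separate model of its residuals) can be illustrated with a toy sketch. This is not the authors' pipeline: the persistence forecaster stands in for the base transformer, a least-squares model stands in for the meta-corrector, and the synthetic series, window length, and function names are all assumptions made for the example.

```python
import numpy as np

def base_forecast(x):
    """Stage 1 stand-in: persistence forecast (predict the last observed value)."""
    return x[:, -1]

def _features(x):
    # Window values plus a constant bias column.
    return np.hstack([x, np.ones((len(x), 1))])

def fit_corrector(x, y):
    """Stage 2 stand-in: least-squares model of the base forecaster's residuals."""
    residual = y - base_forecast(x)
    weights, *_ = np.linalg.lstsq(_features(x), residual, rcond=None)
    return weights

def corrected_forecast(x, weights):
    """Base prediction plus the learned residual correction."""
    return base_forecast(x) + _features(x) @ weights

# Synthetic series with trend and seasonality, so persistence is systematically biased.
rng = np.random.default_rng(0)
t = np.arange(400)
series = 0.05 * t + np.sin(0.3 * t) + 0.05 * rng.standard_normal(400)

window = 8
x = np.stack([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
x_train, y_train, x_test, y_test = x[:300], y[:300], x[300:], y[300:]

weights = fit_corrector(x_train, y_train)
mse_base = float(np.mean((base_forecast(x_test) - y_test) ** 2))
mse_corrected = float(np.mean((corrected_forecast(x_test, weights) - y_test) ** 2))
print(mse_base, mse_corrected)
```

The point of the sketch is only the decoupling: the corrector never replaces the base model, it learns the structure left in its errors.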

2606.10616 2026-06-12 cs.AI New submission

Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents

Qingcan Kang, Liu Mingyang, Shixiong Kai, Kaichao Liang, Tao Zhong, Mingxuan Yuan

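Affiliations: Huawei Noah's Ark Lab; Department of Computer Science, City University of Hong Kong

AI summary: Targets the finite context windows of long-horizon language agents with the OSL-MR framework, which formulates memory retention as a constrained stochastic optimization problem and learns query-conditioned evidence value under a strict separation between online-observable features and offline supervision; experiments show it outperforms existing methods under tight budgets.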

Abstract

Long-horizon language agents accumulate observations, reasoning traces, and retrieved facts that exceed their finite context windows, making memory retention a fundamental resource-allocation problem. Existing memory systems improve management through heuristic scoring, retrieval optimization, or learned compression, but largely treat retention as a local decision problem and do not explicitly model its long-term consequences under realistic observability constraints. To fill this gap, we formulate memory retention as a constrained stochastic optimization problem with explicit budget feasibility, evidence utility, and delayed costs including miss penalties, reacquisition delays, and stale-information risk. We then propose OSL-MR (Observability-Safe Learning for Memory Retention), a novel framework that enforces a strict separation between online-observable features and offline-available supervision (OAS). OSL-MR combines an evidence learner trained from realized evidence supervision with a Mixed-Score heuristic that serves both as a deployable online-safe baseline and as a structured inductive prior for learning. The resulting policy learns query-conditioned evidence value directly from interaction data while remaining deployable under the same observability constraints. Experiments on LOCOMO and LongMemEval show that OSL-MR consistently outperforms recency-based methods, Generative Agents-style scoring, and other heuristic baselines, particularly under tight memory budgets. The Mixed-Score prior further improves precision while preserving recall, and sensitivity analysis demonstrates robustness across a wide range of cost configurations.
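To make the budget-feasibility part of the formulation concrete, here is a minimal sketch of retention under a token budget. It is a simple greedy heuristic, not OSL-MR: the `value` scores are hand-set placeholders for what a learned evidence model would produce, and `MemoryItem`, `retain`, and the example items are hypothetical names invented for the illustration.

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    name: str
    tokens: int    # cost of keeping the item in context
    value: float   # placeholder for a query-conditioned evidence score

def retain(items, budget):
    """Greedy retention: keep items by value per token while the budget allows."""
    kept, used = [], 0
    for item in sorted(items, key=lambda m: m.value / m.tokens, reverse=True):
        if used + item.tokens <= budget:
            kept.append(item)
            used += item.tokens
    return kept

items = [
    MemoryItem("user_allergy", 20, 0.9),
    MemoryItem("small_talk", 50, 0.1),
    MemoryItem("meeting_time", 30, 0.6),
    MemoryItem("old_search_result", 80, 0.4),
]
kept = retain(items, budget=60)
print([m.name for m in kept])
```

A greedy rule like this treats retention as a local decision, which is exactly the limitation the abstract attributes to existing systems; the delayed costs (miss penalties, reacquisition delays, staleness) are not modeled here.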

2606.10403 2026-06-12 cs.CL New submission

KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

Sanghee Park, Geewook Kim, Kee-Eung Kim

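Affiliations: NAVER Cloud AI; KAIST AI

AI summary: Introduces the KCSAT-ML benchmark (664 Korean CSAT mathematics problems, including a 339-item core set with official error rates) and the Difficulty-aligned Reasoning Gain (DRG) metric, revealing that vision-language models' accuracy collapses on items with high human error rates, that test-time scaling is non-monotonic, and that anti-scaling and overthinking coexist within a single model family.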

Comments: 18 pages, 14 figures, 8 tables

Abstract

Math reasoning benchmarks have proliferated, yet most lack a per-item difficulty signal grounded in actual human performance. We introduce KCSAT-ML, a decade (2014-2025) of Korean College Scholastic Ability Test (KCSAT; Suneung) mathematics: 664 problems with a 339-item core set carrying official per-item error rates from nationwide cohorts of hundreds of thousands of examinees. We pair the benchmark with Difficulty-aligned Reasoning Gain (DRG): a score-orthogonal metric that asks whether a model's mistakes concentrate on the items humans found hard, or on items humans found easy. Together they expose, across a wide range of VLMs (and LLMs via OCR), three patterns: (i) low-budget accuracy collapses on the high-human-error tail at every model size; (ii) test-time scaling (TTS) raises token use roughly linearly with cohort error rate, while accuracy gains follow a non-monotonic curve; (iii) within a single family, TTS flips between anti-scaling on the hardest items and overthinking on easier ones -- two faces of the same alignment failure. On DRG, models with near-identical accuracy can sit at near-opposite values: one model gets wrong what humans also find hard, while another solves the hardest items yet fails on items humans find easy -- a contrast that aggregate accuracy hides. Our code and dataset builder will be open-sourced at this https URL.

2606.09639 2026-06-12 cs.CV New submission

CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation

Yuheng Chen, Teng Hu, Yuji Wang, Qingdong He, Zhucun Xue, Qianyu Zhou, Jason Li, Lizhuang Ma, Jiangning Zhang, Dacheng Tao

Affiliations: Shanghai Jiao Tong University; University of Electronic Science and Technology of China; Zhejiang University; The University of Tokyo; Nanyang Technological University

AI summary: Introduces CineDance-1M, a large-scale multi-shot, long-form audio-video dataset built with a three-stage curation pipeline, together with the CineBench evaluation suite, enabling high-quality joint generation.

Abstract

The fidelity and structural diversity of training datasets fundamentally determine the capabilities of video generation models. While commercial systems show remarkable ability to generate cinematic narratives, the progress of open-source models remains limited by the scarcity of high-quality training data. To bridge this gap, we introduce CineDance-1M, a large-scale, open research Text-to-Audio-Video (T2AV) dataset designed specifically for multi-shot, long-form joint audio-video generation. Averaging 92.8 seconds and 24.2 continuous shots per video, it provides configurable, structured annotations for both audio and video modalities. This exceptional quality is achieved through a rigorous three-stage curation pipeline: i) diverse sourcing and comprehensive cleansing, ii) film-theory-inspired narrative parsing, and iii) hierarchical dual-modal captioning. For a comprehensive assessment, we propose CineBench, featuring a diverse prompt suite and a six-dimensional, human-aligned metric system tailored for complex narrative audio-video evaluation. Furthermore, we adapt LTX-2.3 into CineDance, which demonstrates exceptional single-modality quality alongside precise audio-video alignment and robust subject and environment consistency, effectively validating our curation strategy and the high quality of CineDance-1M. We anticipate that this work will serve as a solid foundation for accelerating future research in multi-shot, long-form joint audio-video generation. Our project page is available at https://aliothchen.github.io/projects/CineDance/.

2606.08765 2026-06-12 cs.RO cs.CV New submission

RGB-S: Image-Aligned Tactile Saliency for Robust Dexterous Manipulation

Shengcheng Luo, Kefei Wu, Xiaoying Zhou, Wanlin Li, Ziyuan Jiao, Chenxi Xiao

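Affiliations: ShanghaiTech University; Beijing Institute for General Artificial Intelligence

AI summary: Proposes the RGB-S framework, which uses forward kinematics and camera calibration to project tactile sensor locations onto the RGB image plane and renders force-modulated Gaussian saliency maps, explicitly aligning touch with vision and raising dexterous manipulation success under severe occlusion by 26.7 percentage points.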

Comments: 20 pages, 7 figures

Abstract

Effective visuo-tactile integration is critical for robotic dexterous manipulation, especially when visual observations are unreliable or occluded. However, robustly aligning sparse, heterogeneous tactile measurements with dense visual representations remains a fundamental challenge. Most existing approaches require policies to learn cross-modal correspondences implicitly from limited demonstrations, without leveraging geometric priors. As a result, they are often data-inefficient and generalize poorly when visual observations are degraded. To address this limitation, we propose a framework that explicitly grounds physical contacts in the image domain. Using robot forward kinematics and camera calibration, we project tactile sensor locations directly onto the RGB image plane. We then render force-modulated Gaussian saliency maps to model spatial uncertainty arising from kinematic and calibration errors. By integrating these 2D spatial anchors through a zero-initialized conditioning architecture, our method injects physical contact priors into standard visual backbones while preserving pre-trained visual representations. We evaluate our method on six dexterous manipulation tasks in both simulation and the real world under severe visual occlusions. Real-world experiments show that explicit RGB-S grounding in the image domain improves occluded manipulation success rates by 26.7 percentage points over the strongest implicit visuo-tactile baseline, suggesting improved spatial reasoning and robustness to occlusion. Project page: touch-as-saliency.github.io
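The projection-and-render step described in the abstract can be sketched with a standard pinhole camera model. This is a minimal illustration under stated assumptions, not the authors' implementation: contact points are assumed to be already expressed in the camera frame (the forward-kinematics and extrinsic-calibration steps are skipped), and the intrinsics `K`, the Gaussian width `sigma`, the force normalization, and the function names are all hypothetical.

```python
import numpy as np

def project_point(p_cam, K):
    """Pinhole projection of a 3D point (camera frame, metres) to pixel coordinates."""
    u, v, w = K @ p_cam
    return u / w, v / w

def saliency_map(contacts, forces, K, shape, sigma=6.0):
    """Sum of Gaussians centred on projected contacts, each scaled by normalised force."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    out = np.zeros(shape, dtype=float)
    f_max = max(max(forces), 1e-6)  # avoid division by zero when nothing is touched
    for p, f in zip(contacts, forces):
        u, v = project_point(np.asarray(p, dtype=float), K)
        out += (f / f_max) * np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2 * sigma ** 2))
    return np.clip(out, 0.0, 1.0)

# Assumed intrinsics for a 128x96 image: focal length 500 px, principal point (64, 48).
K = np.array([[500.0, 0.0, 64.0],
              [0.0, 500.0, 48.0],
              [0.0, 0.0, 1.0]])
contacts = [(0.0, 0.0, 0.5), (0.02, -0.01, 0.5)]  # two fingertip contacts, camera frame
forces = [2.0, 1.0]                               # arbitrary force readings

smap = saliency_map(contacts, forces, K, shape=(96, 128))
peak = np.unravel_index(smap.argmax(), smap.shape)
print(peak)
```

The resulting single-channel map is aligned pixel-for-pixel with the RGB frame, which is what allows it to be fed to a visual backbone as an extra conditioning input; a wider `sigma` would encode more uncertainty in kinematics and calibration.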

2606.08098 2026-06-12 cs.AI cs.LG New submission

When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference

Yasushi Sakai, Allen Song, Kent Larson

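Affiliations: MIT Media Lab

AI summary: Proposes a delegation-based aggregator, PPV, that uses each sample's letter entropy and reasoning-geometry signals to beat majority voting on MMLU-Pro by 1.5 percentage points, with no labels or training.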

Comments: Preprint. 16 pages, 5 figures, 4 tables

Abstract

Majority voting over sampled answers is the dominant unsupervised aggregator for multi-sample LLM inference. In this paper, we show a delegation-based aggregator (Propagational Proxy Voting, PPV; Sakai et al., 2025) yields an unsupervised consensus rule that beats majority on MMLU-Pro by +1.5 pp overall and +2.24 pp on the non-trivial subset (paired McNemar p ~ 1.0e-14, n = 8,099). Majority discards two signals that every sample carries: within-group letter entropy and between-group reasoning geometry. PPV exposes per-voter levers that consume exactly these two signals: When (how much weight a voter keeps on its own pick) and Whom (how it splits the remainder across peers). We drive When with letter entropy and Whom with per-question-centered embedding cosine. Our method needs no gold labels and no auxiliary training: per-question, we partition 128 sampled generations into 16 groups, compute each group's letter-level semantic entropy and reasoning embedding centroid, and feed both into a stochastic delegation matrix whose stationary distribution selects the consensus answer. We walk through an example in which PPV overturns a clear 10-6 majority for the wrong letter: the 10-voter majority cluster is geometrically incoherent (mean within-cluster cosine -0.02) while the 6-voter minority is tight (+0.26), so propagated delegation mass concentrates on the minority's answer even though entropy alone would keep the majority ahead. We further report delegation strategies with negative results that constrain the design space for unsupervised LLM aggregation. No within-question ensemble of confidence modes closes the oracle gap.
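The mechanism of selecting an answer from the stationary distribution of a stochastic delegation matrix can be shown on a toy case. The sketch below is an assumption-laden illustration, not the paper's PPV: the `keep` weights (the "When" lever) and the `affinity` matrix (the "Whom" lever) are set by hand rather than derived from letter entropy or embedding cosines, and the function names are hypothetical.

```python
import numpy as np

def delegation_matrix(keep, affinity):
    """Row-stochastic matrix: voter i keeps keep[i] on itself, splits the rest by affinity."""
    keep = np.asarray(keep, dtype=float)
    off = np.array(affinity, dtype=float)
    np.fill_diagonal(off, 0.0)                    # no self-delegation in the split
    off = off / off.sum(axis=1, keepdims=True)    # normalise each voter's split
    return np.diag(keep) + (1.0 - keep)[:, None] * off

def stationary(P, iters=1000):
    """Power iteration for the stationary distribution pi = pi @ P."""
    pi = np.full(len(P), 1.0 / len(P))
    for _ in range(iters):
        pi = pi @ P
    return pi / pi.sum()

answers = ["A", "A", "B"]        # a 2-1 majority for A
keep = [0.2, 0.2, 0.9]           # the two A voters are unsure, the B voter is confident
affinity = np.ones((3, 3))       # uniform split across peers, for simplicity

pi = stationary(delegation_matrix(keep, affinity))
mass = {}
for a, p in zip(answers, pi):
    mass[a] = mass.get(a, 0.0) + float(p)
winner = max(mass, key=mass.get)
print(mass, winner)
```

In this toy case the propagated mass settles at 0.2 on A and 0.8 on B, so the confident minority overturns the head-count majority. The example in the abstract differs in that geometry (the Whom lever), rather than self-weight alone, is what tips the outcome.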

2606.07515 2026-06-12 cs.CL cs.AI cs.HC math.PR New submission

How reliable are LLMs when it comes to playing dice?

Luca Avena, Gianmarco Bet, Bernardo Busoni

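Affiliations: Università degli Studi di Firenze

AI summary: Benchmarking on discrete probability problems finds that LLMs reach an accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones, and that they are vulnerable to token bias and misleading prompts.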

2606.07489 2026-06-12 cs.AI econ.GN New submission

How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, and Scope

Jeremy Yang, Kate Zyskowski, Noah Yonack, Jerry Ma

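Affiliations: Harvard Business School; Perplexity AI

AI summary: Using Perplexity product data, the study finds that AI agents executing tasks end to end raise autonomous work per session from 33 seconds to 26 minutes, cut completion time by 87% and cost by 94%, and expand the scope and cognitive level of work.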

Abstract

Frontier AI systems are bridging the gap between intelligence and utility by shifting from conversational assistants to autonomous agents that execute tasks end to end. Using production data from Perplexity's Search and Computer products, we study this transition by examining how AI agents accelerate and reshape knowledge work. Three key empirical findings emerge. First, using sessions with near-identical initial query pairs as natural experiments for the same underlying task attempted with both products, Computer performs 26 minutes of autonomous work per user session, versus 33 seconds for Search. Computer automates task decomposition and execution that Search users might otherwise manually orchestrate and implement. As a result, Computer shifts follow-up query distribution toward higher-order work such as verification and extension. Autonomy also increases execution quality, with per-query dissatisfaction rates 55% lower on Computer than on Search. Second, due to its autonomy advantage, Computer reduces completion time from 269 to 36 minutes on matched tasks, lowering estimated time and cost by 87% and 94%, respectively, compared to humans equipped with Search alone. Third, Computer changes the scope of work that users attempt: Computer queries more often cross occupational boundaries, require higher-order cognition, draw on broader expertise, take the form of composite tasks that bundle interdependent subtasks into a single query, and unlock work activities that are essentially absent from Search usage among the same users. Together, the evidence indicates that AI agents accelerate workflows, enhance output quality, reduce costs, and expand the breadth and depth of automated work.

2606.07436 2026-06-12 cs.CV New submission

Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning

Haoyuan Li, Zhengdong Hu, Jun Wang, Hehe Fan, Yi Yang

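Affiliations: Zhejiang University; University of Technology Sydney; OPPO Research Institute

AI summary: Proposes the Skill-3D framework, in which scene memory and a skill library co-evolve so that the agent adaptively selects tools per scene, substantially improving the correctness and sufficiency of tool use in 3D spatial reasoning.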

Abstract

This paper explores agentic 3D spatial understanding, i.e., MLLM agents performing 3D reasoning through tool use. Existing methods often misuse tools and exhibit biased tool preferences under 3D scenarios, leaving the agentic paradigm with only marginal gains over non-agentic strategies. We reveal that 3D spatial reasoning tasks are heterogeneous across scenes, while these agents apply a uniform tool-use strategy to all scenes rather than selecting tools according to the specific scene and task. To address this, we propose Skill-3D, a framework that learns self-evolving scene-aware skills. Specifically, Skill-3D identifies the task scene and records the agent's tool-use trajectory into a Scene Memory, where successful trajectories from similar scenes are aggregated and distilled into a reusable scene-aware skill, with failed ones attached to the skill as lessons. During training, once a similar scene recurs, the corresponding skill is injected to guide the agent, producing new trajectories whose successes and failures further refine the skill, forming a loop in which the memory and the skill library co-evolve. Experiments show that Skill-3D substantially improves tool utilization in 3D spatial reasoning (from 39% to 78% on VSI-Bench), driving the agent toward correct and sufficient tool use. For instance, it improves Gemini-3-Flash by 67% on MMSI-Bench. Furthermore, we conduct agentic post-training over skill-guided trajectories, which boosts Qwen3-VL-8B by 60% on VSI-Bench.

2606.07334 2026-06-12 cs.SD cs.LG New submission

How Far Can Chord-Symbol Time-Series Adaptation Carry Genre Identity? Capabilities and Boundaries in Multi-Genre Chord-Symbol Modeling

Jinju Lee

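Affiliations: PearlLeeStudio

AI summary: Evaluates five adaptation methods (LoRA, IA3, BitFit, prefix tuning, and full fine-tuning) for extending a pre-trained pop-jazz chord model to 11 target genres, finding that all of them improve chord prediction but that chord symbols alone cannot fully convey genre identity.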

Comments: v2: corrected frozen-base checkpoint description after weight-level verification (released F1 coincides with the pop-only Phase-0 baseline; selection artifact); added released-adapter rank-selection disclosure; all reported numbers unchanged

Abstract

This report treats chord-symbol sequences as an interpretable, controllable time series for genre-local harmonic modeling. The frozen Music Transformer base - released as a pop-jazz fine-tune endpoint but verified in this revision weight-identical to the pop-only Phase-0 baseline, so all gains are measured over a pure-pop prior (see Changes in v2) - is extended to eleven target genres: blues, bossa nova, Bach chorales, country, electronic, folk, funk, gospel, hip-hop, R&B/soul, and rock. The main evaluation compares LoRA, IA3, BitFit, prefix tuning, and full fine-tuning over 11 genres and 3 seeds, a complete 165-cell grid. All five methods improve over the frozen base on held-out chord prediction (macro gains +2.89 to +3.61 percentage points); LoRA and IA3 score highest, but pairwise Wilcoxon tests with Holm and Benjamini-Hochberg correction do not support a decisive winner. A matched-data-size control sharpens this: at a common corpus size IA3 stays on top while LoRA drops to last, so the small method gaps are partly data-driven rather than representational. A control-token baseline is also strong, and wrong-genre adapters often beat the frozen base, suggesting the adaptation effect is largely lightweight conditioning over a reusable harmonic base rather than genre-specific adapter memory. Further diagnostics (rank sweeps, wrong-genre rotation, a base-checkpoint ablation that v2 reinterprets as a same-weights control, chord-only genre classification, output-distribution statistics, real-song evaluation, duplicate analysis) support a bounded conclusion: chord-symbol adaptation reliably improves genre-local harmonic prediction, but chord symbols alone do not carry complete genre identity. Perceived genre authenticity and musical quality are left to controlled listener evaluation.

2606.11092 2026-06-12 cs.RO cs.AI Updated version

RoboNaldo: Accurate, Stable and Powerful Humanoid Soccer Shooting via Motion-Guided Curriculum Reinforcement Learning

Yichao Zhong, Yidan Lu, Yuhang Lu, Tianyang Tang, Haoguang Mai, Yixuan Pan, Tianyu Li, Li Chen, Jingbo Wang, Zhongyu Li, Peng Lu, Hongyang Li

发表机构 * The University of Hong Kong（香港大学）； The Chinese University of Hong Kong（香港中文大学）； Archon Robotics

AI总结提出三阶段运动引导课程强化学习框架RoboNaldo，从单一人踢参考逐步优化射门性能，在仿真中射门误差降低48.6%、速度提升2.96倍，真实机器人上3米外平均射门误差0.73-0.86米，触球后球速达13.10米/秒。

详情

AI中文摘要

精英级人形足球射门需要全身稳定性、高冲量全身交互以及目标精度。运动跟踪驱动的强化学习提供了全身运动协调的稳定性，但固定参考难以适应不同的球位和击球时机；相比之下，任务奖励驱动的强化学习难以从零开始探索和发现有效的踢球动作。因此，我们引入了RoboNaldo，一个用于高冲量人形交互的三阶段运动引导课程强化学习框架。使用单一人踢参考作为支架，并逐步将优化转向射门性能。课程首先学习稳定的全身踢球先验，然后使踢球适应任意静止球位的任意球场景，最后通过运动指令和踢球触发接口扩展到移动球射门。训练期间，一个高级启发式规划器控制该接口，而推理时其他高级控制器可驱动相同的低级策略。在仿真中，RoboNaldo的任意球射门误差比先前工作基线低48.6%，射门速度高2.96倍。在真实世界中，使用搭载机载感知的宇树G1，RoboNaldo在3米距离的任意球和移动球情况下，平均目标射门误差分别为0.73米和0.86米。触球后球速达到13.10米/秒，是职业比赛开放射门速度的59-71%。项目页面：$\href{ this https URL }{\text{ this http URL }}$。

英文摘要

Elite humanoid soccer shooting requires whole-body stability, high-impulse whole-body interactions, and accuracy to targets. Motion tracking-driven reinforcement learning (RL) provides stability in whole-body movement coordination, but a fixed reference makes it hard to adapt to varied ball positions and strike timings; in contrast, task reward-driven RL struggles to explore and discover valid kicks from scratch. We therefore introduce RoboNaldo, a three-stage motion-guided curriculum RL framework for high-impulse humanoid interaction. A single human-kick reference serves as a scaffold, and optimization is progressively shifted towards shooting performance. The curriculum first learns a stable whole-body kicking prior, then adapts the kick to free-kick settings where the ball is stationary at random positions, and finally extends it to moving-ball shooting through a locomotion-command and kick-trigger interface. A high-level heuristic planner controls this interface during training, while alternative high-level controllers can drive the same low-level policy at inference. In simulation, RoboNaldo achieves a free-kick shot error 48.6% lower and a shot velocity 2.96x that of prior-work baselines. In the real world, on a Unitree G1 with onboard perception, RoboNaldo attains 0.73 m and 0.86 m average target shooting error from 3 m away in the free-kick and moving-ball cases, respectively. The post-contact ball velocity reaches 13.10 m/s, which is 59-71% of reported professional open-play shot speed. Project page: this https URL.


2606.11042 2026-06-12 cs.AI Updated version

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Liya Zhu, Jingzhe Ding, Jian Zhang, Jianbo Xue, Shihao Liang, Ge Zhang, Yi Zhu, Duju Zeng, Xiang Gao, Qingshui Gu, Mailun Gao, Huimin Che, Yan Zhao, Peiheng Zhou, Haojun Wang, Chaobo Xian, Lili Le, Chi Wu, Yiwei Liu, Shengda Long, Jiale Yang, Fangzhi Xu, Sijin Wu, Haodong Duan, Chao He, Zhaojian Li, Minchao Wang, Huan Zhou, Jiani Hou, Chuqian Yu, Weiran Shi, Hongwan Gao, Jiamin Chen, Guanhong Chen, Tingqin Luo, Kaiyuan Zhang, Zhixin Yao, Qing Hua, Yuhao Jiang, Jin Chen, Pu Chen, Zhenyu Hu, Xingyu Li, Zhengxuan Jiang, Meng Cao, Tianfeng Long, Haozhe Wang, Mingzhang Wang, Yichen Zhang, Yiming Dai, Chenchen Zhang, Jiaying Wang, Xinying Liu, Xingzu Liu, Lingling Zhang, Xinjie Chen, Yujia Qin, Wangchunshu Zhou, Zhiyong Wu, Yang Liu, Jiaheng Liu, Lei Zhang, Shen Yan, Wenhao Huang, Zaiyuan Wang, Xiaolong Chang

Affiliations: ByteDance Seed; M-A-P; Humanlaya

AI summary: Introduces the Workflow-GYM benchmark, which evaluates whether AI agents can carry out long-horizon, high-value workflows in professional software. Even the strongest models reach a success rate only slightly above 30%, exposing a serious weakness of current agents in maintaining long-horizon workflow consistency.

English abstract

Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work in an end-to-end manner. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve only slightly above 30% success rates, highlighting that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.


2606.09500 2026-06-12 cs.AI cs.DL Updated version

Deterministic Integrity Gates for LLM-Assisted Clinical Manuscript Preparation: An Auditable Biomedical Informatics Architecture

Yoojin Nam, Jinhoon Jeong, Namkug Kim

Affiliations: University of Ulsan College of Medicine; Asan Medical Center; Aperivue; AMIST, Asan Medical Center

AI summary: Proposes a deterministic integrity-gate architecture that decomposes the workflow into independently verifiable skills and places deterministic checks at every stage, addressing fabricated citations, data drift, and missing reporting-guideline items in LLM-generated clinical manuscripts.

Comments: 28 pages, 3 figures, 4 tables; includes supplementary material (deterministic-detector inventory, per-class defect breakdown, worked example). Software (MIT): this https URL. Archived on Zenodo: concept DOI this https URL and version DOI (v3.8.0) this https URL

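AI abstract (translated from Chinese)

Objective. Large language models (LLMs) increasingly draft clinical research manuscripts, but their fluency can hide fabricated citations, numbers that drift from source tables, and unmet reporting-guideline items. Existing tools generate text without verifying it, and self-critique inherits the blind spots that produce confident fabrication. We describe an architecture that pairs generation with verification. Methods. The design rests on three principles: decompose the workflow into self-contained skills, gate every stage transition with halt-on-failure, and resolve each integrity question with the cheapest sufficient mechanism, a deterministic, re-executable check where one applies and a prose-level probe only where interpretation is required. This determinism-where-possible split, organized as an integrity-gate taxonomy, is the core contribution. It is implemented as MedSci Skills, an open-source toolkit of 43 skills coordinated by an orchestrator, whose deterministic tier comprises 21 standard-library detectors. We evaluate it on three reproducible public-dataset pipelines (STARD, PRISMA, STROBE) and a seeded-defect ablation. Results. Across the three pipelines every content-hash manifest verified clean and the gates surfaced real defects. On 27 identical injected defects, the deterministic gates detected all 27 with no false positives on matched clean fixtures, whereas a generic single-prompt LLM reviewer detected 11, with its misses concentrated in generated code, bibliography internals, and style defects that the prose does not expose. Conclusion. Determinism-where-possible verification yields an auditable, re-executable trail that exposes the evidence a human needs to check an LLM-assisted manuscript: feasibility and reproducibility evidence, not a claim of human-competitive quality, which a separate blinded study addresses. MedSci Skills is MIT-licensed and archived (v3.8.0).

English abstract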

As autonomous research agents and AI co-scientist systems push large language models (LLMs) from drafting toward end-to-end manuscript production, the bottleneck shifts from generation to verification. Fluent LLM output can hide fabricated citations, numbers that drift from source tables, and unmet reporting-guideline items; existing tools generate without verifying, and self-critique inherits the blind spots that produce confident fabrication. We describe an architecture pairing generation with verification, resting on three principles: decompose the workflow into self-contained skills, gate every stage transition with halt-on-failure, and resolve each integrity question with the cheapest sufficient mechanism, a deterministic, re-executable check where one suffices and a prose-level probe only where interpretation is unavoidable. This determinism-where-possible split, organized as an integrity-gate taxonomy, is the core contribution. It is realized as MedSci Skills, an open-source toolkit of 43 skills with a 21-detector deterministic tier, evaluated on three public-dataset pipelines (STARD, PRISMA, STROBE) and a seeded-defect ablation. Across the three pipelines every content-hash manifest verified clean and the gates surfaced real defects; on 27 identical injected defects the deterministic gates detected all 27 with no false positives on the matched clean fixtures, whereas a single-prompt LLM reviewer detected 11, its misses in code, bibliography, and style defects the prose hides. Determinism-where-possible verification yields an auditable, re-executable trail that exposes the evidence a human needs to check an LLM-assisted manuscript: feasibility and reproducibility evidence, not a claim of human-competitive quality, which a separate blinded study addresses. MedSci Skills is MIT-licensed and archived (v3.8.0).


2606.13629 2026-06-12 stat.ME cs.AI cs.LG stat.ML New submission

Valid Inference with Synthetic Data via Task Exchangeability

Lezhi Tan, Tijana Zrnic

AI summary: Proposes the task exchangeability condition, which guarantees valid statistical inference when synthetic data are used in scientific research, and demonstrates applications to public opinion surveys and AI evaluation.

English abstract

There is a proliferation of work arguing for the use of synthetic data in scientific research. For example, social scientists are arguing for the use of LLM-generated "silicon samples" in pilot studies; AI evaluations increasingly rely on "LLM-as-a-judge" outputs; and proteomics research is accelerated by generative models that produce synthetic protein structures. These developments raise an intriguing possibility: synthetic data may help researchers ask more questions, run more studies, and accelerate discovery. But they also raise a fundamental concern: synthetic data can be biased, noisy, and misspecified. In this work, we propose statistical principles for using synthetic data in scientific research with provable validity guarantees. The key insight is a new technical condition that we call task exchangeability. Informally, this is a requirement that the researcher can identify historical tasks, for which real data is available, such that their current task of interest is exchangeable with the historical tasks in an appropriate mathematical sense. We develop methods for valid inference under task exchangeability, together with extensions that provide guarantees even beyond exchangeability. We demonstrate the framework on public opinion surveys with silicon samples and AI evaluation with autoraters.


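2606.13544 2026-06-12 eess.AS cs.AI cs.CL New submission

Adaptive Turn-Taking for Real-time Multi-Party Voice Agents

Soumyajit Mitra, Prabhat Pandey, Abhinav Jain, Shanmukha Sahith, K V Vijay Girish

AI summary: Proposes ModeratorLM, a role-conditioned speech large language model that uses chunk-wise streaming and chain-of-thought reasoning to achieve adaptive turn-taking in multi-party conversations, substantially improving turn-taking precision and recall.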

Comments: Accepted for publication at Interspeech 2026

English abstract

Turn-taking in multi-party spoken conversations remains a fundamental challenge for voice-based agents, particularly under dynamic floor competition and varying user expectations. We propose ModeratorLM, a role-playing voice agent that conditions turn-taking behavior on an explicitly assigned role in multi-party settings. The system is built on a speech large language model operating in chunk-wise streaming manner. We further introduce a reasoning-augmented variant that incorporates chain-of-thought reasoning over conversational context and the assigned role. We construct RolePlayConv, a large-scale synthetic dataset of spoken multi-party conversations with diverse assistant roles. Experiments on real-world meeting data and RolePlayConv show improved turn-taking precision by over 40% and recall by more than 70%, while substantially reducing false-positive interruptions compared to non-role-conditioned baselines.


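2606.13450 2026-06-12 eess.AS cs.SD New submission

Endpoint Anticipation for Low-Latency Spoken Dialogue

Sathvik Udupa, Shinji Watanabe, Petr Schwarz, Jan Cernocky

AI summary: Proposes endpoint anticipation, which achieves low latency by forecasting end-of-turn signals in advance and speculatively executing the LLM and TTS pipelines on partial context, reducing average latency by 505 ms.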

Comments: Accepted at Interspeech 2026

English abstract

While low-latency interaction is critical for spoken dialogue, cascaded architectures are often bottlenecked by reactive turn-completion detection. We propose Endpoint Anticipation, shifting from reactive detection to proactive forecasting of end-of-turn signals. Our speech-based model anticipates endpoints up to 2.56 seconds in advance, enabling speculative execution of LLM and TTS pipelines on partial context. We introduce metrics to quantify the trade-off between realized latency reduction and computational redundancy. Evaluation across conversational and task-oriented datasets shows our model consistently outperforms competitive VAP-based baselines. Integration with the Unmute framework demonstrates a 505 ms average latency reduction with a 28.4% increase in speculative computation, effectively masking sequential bottlenecks to enable complex reasoning in real-time speech-to-speech interaction.


2606.13295 2026-06-12 stat.ML cs.LG stat.ME New submission

Simultaneous Latent Budget Trees for Stratified Classification

Cristian Buoncompagni, Stefano Pellegrino, Giulia Vannucci, Roberta Siciliano

AI summary: Proposes the Simultaneous Latent Budget Trees framework, which handles stratification factors through a model-based split rule to achieve interpretable classification, and applies it to the analysis of gender-related differences in amyotrophic lateral sclerosis.

English abstract

In the era of Explainable Artificial Intelligence, there is a renewed focus on single trees for their ease of interpretation. This paper introduces Simultaneous Latent Budget Trees, a probabilistic machine learning framework for classification trees in the presence of a stratification factor, such as a temporal, spatial, or demographic variable, acting as a control variable or potential confounder. Standard tree growth procedures are not designed to optimize a conditional split rule. A model-based split rule is proposed in which child nodes are interpreted as latent components of a simultaneous mixture model, such as the Simultaneous Latent Budget Model and its constrained versions, fitted to the parent node. Mixing parameters drive the observations, differently for each group, to the child nodes, whereas latent budget parameters update the response-class profile of each level of the control variable. Parameters are estimated by least squares, considering a neural network perspective of the model. An informative tree structure can be interactively visualized with interpretation aids on the nodes and paths, including visual pruning and a decision tree selection procedure. Suitable measures are proposed to handle an unbalanced response class distribution. The proposed methodology is applied to investigate gender-related differences in disease progression of Amyotrophic Lateral Sclerosis. The SLBT library with the various tree-based algorithms is available in the linked GitHub repository.


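2606.13277 2026-06-12 stat.ML cs.LG New submission

ProtoX-AD: Self-Explainable Time Series Anomaly Detection and Characterization

Aitor Sánchez-Ferrera, Elisabeth Wetzer, Kristoffer Wickstrøm, Michael Kampffmeyer, Robert Jenssen

AI summary: Proposes the ProtoX-AD framework, which brings explainability to self-supervised time series anomaly detection through prototype learning, providing semantically consistent explanations of anomaly characteristics while preserving detection performance.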

Comments: 26 pages, 8 figures

English abstract

Recent advances in time series anomaly detection (TSAD) have highlighted the effectiveness of self-supervised classification-based approaches. These methods apply transformations to normal training samples, training a classifier to recognize transformation-specific patterns that help identify anomalies through increased classification errors. Despite their strong performance, a significant challenge is their lack of explainability, as they provide limited insight into the characteristics of flagged anomalies. To address this limitation, we propose ProtoX-AD, a prototype-based self-explainable framework for self-supervised TSAD. ProtoX-AD learns transformation-aware latent representations alongside interpretable prototypes, enabling both accurate anomaly detection and the identification of distinct anomalous profiles through prototype-based explanations. Additionally, it allows for systematic analysis of how transformation design impacts detection performance and explainability. Experimental results on synthetic and real-world datasets demonstrate that ProtoX-AD achieves detection performance comparable to its black-box counterparts while offering more consistent and semantically meaningful explanations than existing explainable baselines. Our code is publicly available at this https URL.


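2606.13193 2026-06-12 eess.AS cs.PL cs.SD New submission

A Dual-Mode Faust-to-CLAP Compilation System

Facundo Franchino (1), Stéphane Letz (2), Jatin Chowdhury (3) ((1) University of York, (2) GRAME-CNCM, (3) Massachusetts Institute of Technology)

AI summary: Proposes the faust2clap framework, which supports two modes, static compilation and dynamic interpretation, and addresses the preservation of DSP parameter identity with an address-based identity matching algorithm and a stable slot allocation scheme, enabling efficient compilation and hot reloading.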

Comments: 4 pages, 4 figures, 1 algorithm. Presented at the International Faust Conference (IFC-26), Lyon, France, June 2026

English abstract

We describe faust2clap, a framework establishing the first officially maintained compilation pathway from Faust DSP specifications to the CLAP format. The system operates in two different modes. A static mode employs ahead-of-time compilation to yield native binaries of optimal efficiency, while a dynamic mode uses runtime interpretation to permit DSP code modification without interrupting the host application. This latter capability addresses a persistent friction in audio software development, namely the cumulative overhead of the edit, compile, and reload cycle. We detail the algorithmic machinery underlying both modes, focusing specifically on the problem of parameter identity. To preserve both parameter values and their bindings to host automation across structural DSP mutations, we introduce an address-based identity matching algorithm and a stable slot allocation scheme. The implementation, comprising approximately 2,400 lines of C++ architecture and Python tooling code, has been integrated into the main Faust distribution.
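The parameter-identity idea described above can be sketched in a few lines of Python. This is an illustration only, not the faust2clap implementation (which the abstract says is written in C++ and Python tooling): the function name, the data layout, the example addresses, and the choice to hand freed slots to new parameters are all assumptions made here.

```python
def remap_parameters(old_slots, old_values, new_params):
    """Match parameters by address across a DSP reload (hypothetical sketch).

    old_slots:  dict mapping address -> slot index before the reload
    old_values: dict mapping address -> current value before the reload
    new_params: list of (address, default_value) declared by the new DSP code
    Returns (new_slots, new_values) for the reloaded DSP.
    """
    new_slots, new_values = {}, {}
    used = set()

    # Pass 1: a parameter whose address survives keeps its slot and its value,
    # so anything the host bound to that slot still points at the same control.
    for address, default in new_params:
        if address in old_slots:
            slot = old_slots[address]
            new_slots[address] = slot
            new_values[address] = old_values.get(address, default)
            used.add(slot)

    # Pass 2: genuinely new addresses take the lowest slot not held by a
    # surviving parameter (an assumption; a real scheme might never reuse slots).
    next_free = 0
    for address, default in new_params:
        if address in new_slots:
            continue
        while next_free in used:
            next_free += 1
        new_slots[address] = next_free
        new_values[address] = default
        used.add(next_free)

    return new_slots, new_values
```

Keying on the address rather than on declaration order is what lets a parameter keep its value when controls are inserted or removed around it.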


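2606.13146 2026-06-12 stat.ML cs.LG stat.ME New submission

Robust State-Conditional Feature-Weighted Jump Models for Temporal Clustering

Federico P. Cortese, Alessio Farcomeni

AI summary: Proposes a robust feature-weighted jump model that achieves robustness through the Tukey biweight loss function and introduces state-specific feature weights; it outperforms competing methods in simulations and empirical applications.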
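For readers unfamiliar with the Tukey biweight loss named in the summary, here is a minimal Python sketch of the standard textbook form. It assumes residuals are already standardized, and it uses c = 4.685, a commonly used tuning constant; how the paper embeds this loss in the jump model is not stated in the summary and is not reproduced here.

```python
def tukey_biweight_loss(residual, c=4.685):
    """Tukey biweight (bisquare) loss: grows near zero, flat beyond the cutoff c."""
    cap = c * c / 6.0
    if abs(residual) >= c:
        # Outliers all contribute the same bounded amount.
        return cap
    u = residual / c
    return cap * (1.0 - (1.0 - u * u) ** 3)


def tukey_weight(residual, c=4.685):
    """Weight psi(r)/r implied by the loss, as used in iteratively reweighted fits."""
    if abs(residual) >= c:
        # Observations beyond the cutoff are ignored entirely.
        return 0.0
    u = residual / c
    return (1.0 - u * u) ** 2
```

Because the loss is capped, a single extreme observation cannot dominate the objective, which is the sense in which such a loss yields robustness.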

2606.13109 2026-06-12 eess.AS cs.SD New submission

Generating Training Targets for Real-World Speech Enhancement via Close-to-Distant Microphone Projection

Tomohiro Nakatani, Rintaro Ikeshita, Naoyuki Kamo, Marc Delcroix, Shoko Araki

AI summary: Proposes Close-to-Distant microphone projection (C2D projection), which generates paired data from real recordings and realizes the projection with a parametric multichannel Wiener filter; neural networks trained on the result outperform the existing GSS method on distant speech enhancement.

English abstract

Training neural networks (NNs) for speech enhancement (SE) in distant speech-capturing scenarios requires paired distorted and clean reference speech signals. While such data are often generated through simulation, the mismatch between simulated and real recordings significantly limits SE accuracy. To address this issue, we propose Close-to-Distant microphone Projection (C2D projection), a method that generates paired data from real recordings captured by close and distant microphones. C2D projection estimates an optimal projection matrix that transforms close-microphone inputs into clean reference signals aligned with distant-microphone recordings, while simultaneously performing denoising. We show this projection can be effectively realized using a variant of the Parametric Multichannel Wiener Filter (PMWF). Experimental results demonstrate that an NN trained with C2D-projected data outperforms the state-of-the-art Guided Source Separation (GSS) on the challenging CHiME6 dinner party ASR task under oracle diarization, when using the enhanced output from GSS as an auxiliary input to the NN.


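2606.13095 2026-06-12 eess.AS cs.SD New submission

Balancing ASR and diarization in end-to-end LLMs for multi-talker speech recognition

Naijun Zheng, Yuke Lin, Sanli Tian, Mengtian Li, Zhiwei Lin, Longshuai Xiao, Dandan Tu

AI summary: Proposes a dual-encoder architecture, a feature interleaving format, a length-aware speaker ID loss, and an adaptive-threshold ASR loss strategy to train an LLM-based system efficiently with limited real data while balancing the ASR and diarization tasks, achieving relative improvements of 18% and 24% on the AliMeeting and Aishell4 corpora, respectively.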

Comments: Accepted in Interspeech 2026

English abstract

Multi-talker speech recognition is often addressed by combining automatic speech recognition (ASR) and speaker diarization in a pipeline system. Recently, LLM-based approaches have shown promise by jointly modeling semantic and speaker information, but they typically require large-scale multi-talker corpora that are costly to annotate. In this paper, we investigate how to efficiently train an LLM-based system with limited real-recorded data while maintaining high accuracy in speaker attribution. We propose several strategies: (1) a dual-encoder architecture to extract semantic and speaker features, (2) a feature interleaving format to merge these features as the inputs to the LLM, (3) a length-aware speaker ID loss to enhance diarization capability, and (4) an adaptive threshold strategy for ASR loss computation to mitigate hallucinations caused by speech overlaps. These strategies balance training between ASR and diarization tasks. Our system outperforms open-source baseline approaches, achieving relative improvements of 18% on the AliMeeting corpus and 24% on the Aishell4 corpus.


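2606.13017 2026-06-12 q-bio.NC cs.LG New submission

Deep Sleep Classification via EEG Signal Criticality: A Passive BCI Approach for Sleep-Improvement Neurofeedback

Stanisław Narębski, Tomasz Komendziński, Tomasz M. Rutkowski

AI summary: Using criticality features extracted with Detrended Fluctuation Analysis (DFA), this study identifies deep sleep (N3) with high accuracy using a Naive Bayes classifier (balanced accuracy 87.17%), providing an efficient sensing mechanism for state-dependent neurofeedback in passive brain-computer interfaces.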

Comments: 7 pages, 3 figures, accepted for publication in the Proceedings of the 10th Graz Brain-Computer Interface Conference 2026, Graz, Austria, September 14-17, 2026

English abstract

Automated sleep staging is a fundamental application of passive Brain-Computer Interfaces (pBCI), decoding spontaneous neural states to enable closed-loop interventions independent of user intent. This study evaluates criticality features derived from Detrended Fluctuation Analysis (DFA) for the specific identification of deep sleep (N3). We analyzed $347,232$ EEG epochs from $290$ older women using UMAP manifold learning to visualize state transitions. Subsequently, six classifiers were benchmarked via 10-fold cross-validation, using balanced accuracy to determine the optimal "state-sensing" engine for this http URL. Naive Bayes achieved the highest mean balanced accuracy ($87.17\% \pm 0.24\%$), significantly outperforming a fully connected deep neural network (FNN: $81.58\%$) and Random Forest ($80.97\%$). Linear models (LDA: $57.21\%$; SVM: $51.01\%$) performed poorly, indicating that DFA-derived criticality features reside on a distinct, non-linear manifold. Probabilistic decoding of EEG criticality provides a high-accuracy sensing mechanism for pBCIs. This robust classification pipeline supports the development of state-dependent neurofeedback, such as targeted auditory stimulation, to enhance cognitive recovery.
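Since the classifiers above are ranked by balanced accuracy, a minimal Python sketch of that metric may help. It assumes the usual definition, the unweighted mean of per-class recall; the function name and the example labels are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally regardless of size."""
    hits = defaultdict(int)    # correct predictions per true class
    totals = defaultdict(int)  # number of samples per true class
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    recalls = [hits[label] / totals[label] for label in totals]
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced case: 2 deep-sleep epochs among 10.
y_true = ["N3", "N3"] + ["other"] * 8
y_pred = ["other"] * 10  # a classifier that never predicts N3
print(balanced_accuracy(y_true, y_pred))
```

In this toy case plain accuracy would be 8 out of 10, while balanced accuracy averages a recall of 0 on N3 with a recall of 1 on the other class. That is why the metric suits a setting where the target stage is a minority of epochs.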


2606.12838 2026-06-12 q-bio.QM cs.AI cs.LG q-bio.GN New submission

OCOO-T: A Simple and Scalable Virtual Cell Model for Transcriptional Perturbation Response Prediction

Danning Jiang, Zheming An, Yalong Zhao, Lipeng Lai

AI summary: Proposes OCOO-T, a minimalist flow-matching-based virtual cell model that uses continuous-time denoising and adaptive layer normalization to achieve state-of-the-art transcriptional perturbation prediction on multiple benchmarks.

Comments: 22 pages, 6 figures

English abstract

Predicting single-cell transcriptional responses to genetic, chemical and cytokine perturbations is a fundamental challenge in computational biology and AI Virtual Cell (AIVC) modeling, with direct implications for drug discovery and the elucidation of gene regulatory networks. Existing approaches often rely on auxiliary cell-state encoders, hierarchical variational autoencoders, dedicated Transformer encoder-decoder modules, or gene-interaction priors to compress high-dimensional expression profiles into latent representations. While effective, these designs increase architectural complexity and may limit scalability and generalizability. This paper introduces OCOO-T, a minimalist flow-matching-based AIVC model for transcriptional perturbation response prediction. OCOO-T utilizes a vanilla Transformer stack that operates directly on continuous gene expression profiles and formulates perturbation response prediction as a continuous-time denoising process. Perturbation embeddings, dosage information, and cell-line/cell-type specificity are integrated through adaptive layer normalization and in-context tokens. Comprehensive evaluations on Tahoe100M, Replogle, and PBMC benchmarks demonstrate that OCOO-T achieves state-of-the-art performance across diverse perturbations and cell types while effectively scaling to long transcriptional profiles through patching and depatching of cellular contexts. By leveraging the simplicity of Transformer-based denoising for single-cell omics, OCOO-T provides an effective and scalable framework for in-silico cellular simulation.


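2606.12654 2026-06-12 stat.ME cs.LG stat.ML New submission

Computationally tractable robust differentially private mean estimation

Kelly Ramsay

AI summary: Proposes a new differentially private mean estimator called the "balloon mean", which achieves computational tractability, robustness, and zero-concentrated differential privacy through iterative clipping on expanding Mahalanobis balls, with theoretical guarantees on statistical performance and robustness under heavy-tailed and contaminated elliptical models.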

2606.12623 2026-06-12 stat.AP cs.LG New submission

Estimating Individualized Treatment Effects in Acute Ischemic Stroke with Causal Transformation Models (TRAM-DAG): A Multi-Centre Observational Study with External RCT Validation

Oliver Dürr, Lisa Herzog, Pascal Bühler, Susanne Wegener, Beate Sick

AI summary: Proposes causal transformation models (TRAM-DAG) to estimate individualized treatment effects for patients with acute ischemic stroke; after fitting on observational data, validation in an RCT population shows that the average effect agrees with the ATE and that the estimates correctly rank patient outcomes.

English abstract

Personalized medicine in acute ischemic stroke requires moving beyond average treatment effects (ATE) to individualized treatment effect (ITE) estimates to support treatment decisions. In acute ischemic stroke, mechanical thrombectomy has been shown to be more effective on average than lysis in randomized controlled trials (RCTs), such as the MR CLEAN study. We aim to identify which individual patients benefit most from mechanical thrombectomy compared to lysis. The outcome of interest is the modified Rankin Scale (mRS) at three months, an ordinal measure of functional disability (0: no symptoms, 6: death). We demonstrate that causal transformation models on directed acyclic graphs (TRAM-DAG) can be used for ITE estimation after being fitted on observational MAGIC multi-center stroke patient data. To ensure comparability with the MR CLEAN population, which we use for validation, we train the TRAM-DAG on a MAGIC sub-population with NIHSS at admission >= 6, corresponding to one inclusion criterion of MR CLEAN. The fitted model is then used to estimate ITEs for stroke patients in the MR CLEAN population. While these ITE estimates cannot be confirmed experimentally, we show that their average is consistent with the trial's reported ATE. Furthermore, the ITE estimates correctly rank trial patients by their observed frequency of a good outcome (mRS at three months <= 2). These findings support the use of causal models like TRAM-DAG for personalized decision-making in stroke care and highlight their ability to bridge the gap between observational evidence and clinical trials.
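As a hedged sketch of the validation logic above, and assuming effects are measured on the probability scale of a good outcome (the abstract does not state the scale), the quantities can be written as follows. The symbols $T$ (1 for thrombectomy, 0 for lysis), $x_i$ (covariates of patient $i$), $Y$ (mRS at three months), and $n$ (number of trial patients) are notation introduced here, not taken from the paper.

$$\mathrm{ITE}(x_i) = P\big(Y \le 2 \mid \mathrm{do}(T=1),\, X = x_i\big) - P\big(Y \le 2 \mid \mathrm{do}(T=0),\, X = x_i\big), \qquad \widehat{\mathrm{ATE}} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{ITE}(x_i)$$

Under this reading, the consistency check compares $\widehat{\mathrm{ATE}}$, averaged over the trial population, with the ATE reported by the trial, and the ranking check asks whether patients with larger $\mathrm{ITE}(x_i)$ show a correspondingly larger observed difference in good outcomes.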


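2606.12471 2026-06-12 stat.ML cs.CL cs.ET cs.LG New submission

Identifiability Without Gaussianity: Symbolic World Models and Near-Infinite Temporal Consistency

Seth Dobrin, Łukasz Chmiel

AI summary: Proposes the Physics-Grounded Symbolic Architecture (PGSA) and proves that it achieves exact linear identifiability and near-infinite temporal consistency in non-Gaussian dynamical systems, overcoming the Gaussian boundary of statistical world models.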

Comments: Pre-print

English abstract

Klindt, LeCun, and Balestriero (arXiv:2605.26379) proved that Joint-Embedding Predictive Architectures (JEPAs) achieve linear identifiability, the linear recovery of the world's true latent variables, if and only if the world's latent dynamics follow a Gaussian, stationary process. This Gaussian boundary implies a fundamental limit on temporal consistency: for any non-Gaussian physical system, the representation error of a statistical World Model grows monotonically with time. We prove that this limit is an artifact of the statistical alignment mechanism, not a property of World Models in general. We introduce the Physics-Grounded Symbolic Architecture (PGSA) and prove three results: (1) a PGSA achieves exact linear identifiability for all physical regimes, regardless of the latent distribution; (2) the per-step error of a PGSA is bounded by numerical precision alone; and (3) as a direct consequence, a PGSA maintains temporal consistency for an unbounded number of transitions, a property we term near-infinite temporal consistency. We further prove that statistical World Models cannot achieve this property for any non-Gaussian system, regardless of model capacity or the volume of training data. The algebraic cores of four of the theorems are formalized in Lean 4 with Mathlib4 v4.31.0 (zero sorry placeholders); the Klindt et al. converse is taken as an external premise. The contrast establishes that symbolic grounding in the causal generator of the world's dynamics is the sufficient condition and, in non-Gaussian regimes, the only condition for near-infinite temporal consistency.
