arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.13106 2026-06-12 cs.LG cs.CL New submission

Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

Jiayu Yang, Chao Chen, Shengen Wu, Yinhong Liu, Yuxuan Fan, Lujundong Li, Songning Lai, Chengwei Qin, Zhijiang Guo

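Affiliations: HKUST(GZ); University of Cambridge; NTU; JoinQuant; HKUST

AI summary: Proposes SWITCH, a framework that uses discrete boundary tokens to make hidden-state-recurrence reasoning compatible with on-policy reinforcement learning and to support causal mechanistic analysis; experiments show it outperforms existing methods.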

Abstract

Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) and hard to interpret causally. Our key insight is that a single pair of explicit boundary tokens can address both issues at once: discrete entry and exit anchors make the latent block compatible with standard on-policy RL, and the same anchors offer a natural foothold for mechanistic analysis. Motivated by this, we propose SWITCH, a switchable latent reasoning framework. The model emits <swi> to enter latent mode and </swi> to exit. Because the boundaries are ordinary discrete tokens, the GRPO policy ratio is well-defined at every decision point. The same anchors also expose the latent steps to direct probing and causal intervention. We train the model with a visible-to-latent curriculum and a Switch-GRPO objective that propagates gradients through recurrent latent computation. SWITCH consistently outperforms prior hidden-state-recurrence latent reasoning approaches at similar scale. Mechanistic analysis through the boundary tokens further reveals three findings: (i) <swi> is a sharply localised, learned switching policy rather than a stylistic artefact; (ii) the latent step it opens performs problem-specific, causally important computation rather than acting as an inert placeholder; and (iii) that computation is concentrated at a single hidden-state transition on entry. Together, these results show that hidden-state-recurrence latent reasoning is both RL-trainable and open to direct mechanistic analysis, including of how on-policy RL itself improves the model from the inside.
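
To make the remark about the policy ratio concrete, here is a minimal sketch of a GRPO-style clipped objective evaluated only at discrete token positions. It is not the paper's Switch-GRPO objective: the function names, the `is_discrete` mask convention, and the clip value of 0.2 are assumptions chosen for illustration.

```python
import math
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_surrogate(logp_new, logp_old, is_discrete, advantage, clip=0.2):
    """Mean clipped policy-ratio term over discrete token positions only.

    Assumption: latent positions (is_discrete == False) emit no token,
    so they contribute no ratio term of their own in this sketch.
    """
    terms = []
    for new, old, discrete in zip(logp_new, logp_old, is_discrete):
        if not discrete:
            continue
        ratio = math.exp(new - old)
        clipped = max(1.0 - clip, min(1.0 + clip, ratio))
        terms.append(min(ratio * advantage, clipped * advantage))
    return sum(terms) / len(terms)
```

Because boundary tokens such as `<swi>` and `</swi>` are ordinary vocabulary items, they would appear in the log-probability lists like any other token, which is what keeps the ratio defined at the entry and exit decisions.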

2606.13105 2026-06-12 cs.LG New submission

Disparate Impact in Synthetic Data Generation

Paul Andrey, Michaël Perrot, Batiste Le Bars, Marc Tommasi

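Affiliations: Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 - CRIStAL

AI summary: Revisits the disparate-impact fairness notion in synthetic data generation, notes that non-disparate impact is achieved when the synthetic distribution matches the real one, analyzes why SDG can fall short (expressive power, sampling error, estimation error from differential privacy), and proposes a group-wise learning strategy to improve overall utility and fairness.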

Abstract

We revisit the fairness notion of disparate impact for synthetic data generation (SDG), which assesses whether the utility of generated records is the same across sensitive groups. Our approach departs from existing work on fair SDG, which addresses the problem of correcting for undue biases in the observed distribution and hence redefines SDG as learning a distribution that is not that of the real data. By contrast, non-disparate impact is notably achieved when the synthetic and real distributions are the same. We expose reasons why SDG may fail to reach that solution and discuss why approximation and estimation errors occur and can be disparate across groups. We notably look into the expressive power of SDG methods relative to distribution complexity, sampling errors due to group proportions, and estimation errors induced by differential privacy mechanisms. We illustrate cases of disparate impact on both artificial and real-world data, focusing on SDG methods that rely on probabilistic graphical models. We also introduce a strategy of learning group-wise SDG models and illustrate how it can improve both the overall utility and its parity in many settings.
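
One way to picture a per-group utility comparison is sketched below. The abstract does not say which utility measure the authors use; total variation distance on a single categorical attribute is an assumption made here purely for illustration, as are all function names.

```python
from collections import Counter

def total_variation(real, synth):
    """Total variation distance between two empirical categorical distributions."""
    p, q = Counter(real), Counter(synth)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p[v] / len(real) - q[v] / len(synth)) for v in support)

def groupwise_error(real, synth):
    """Map each sensitive group to the TV distance between its real and synthetic records.

    `real` and `synth` are lists of (group, value) pairs; every group in
    `real` is assumed to have at least one synthetic record.
    """
    groups = {g for g, _ in real}
    return {
        g: total_variation([v for h, v in real if h == g],
                           [v for h, v in synth if h == g])
        for g in groups
    }

def disparity(errors):
    """Gap between the worst-served and best-served group (0 means parity)."""
    return max(errors.values()) - min(errors.values())
```

When the synthetic records reproduce the real distribution within every group, each error is zero and so is the gap, which matches the abstract's observation that parity holds when the two distributions coincide.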

2606.13102 2026-06-12 cs.RO New submission

FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation

Chengbo Yuan, Zicheng Zhang, Mingjie Zhou, Wendi Chen, Yi Wang, Zhuoyang Liu, Dantong Niu, Shuo Wang, Hui Zhang, Wenkang Zhang, Yingdong Hu, Yuanqing Gong, Wanli Xing, Chuan Wen, Cewu Lu, Kaifeng Zhang, Yang Gao

Affiliations: Tsinghua University; Shanghai Qi Zhi Institute; Sharpa; Shanghai Jiao Tong University; University of California, Berkeley; ETH Zurich; Fudan University; Shanghai Innovation Institute

AI summary: Proposes FTP-1, the first generalist foundation tactile policy. Using heterogeneous encoders and a shared Transformer expert, it is pretrained on about 3,000 hours of data spanning 21 sensors, transfers tactile manipulation skills across sensors, and gains 31% in success rate on unseen sensors.

Abstract

Despite the success of vision-based generalist robotic policies, existing tactile-based policies remain tied to fixed embodiments and sensor setups. This is because tactile signals are highly heterogeneous across hardware, making cross-sensor generalization difficult. We present FTP-1, the first generalist foundation tactile policy pretrained to acquire transferable tactile manipulation abilities across diverse sensors and embodiments. FTP-1 supports varied tactile inputs, including image-, array-, and state-based signals, by using heterogeneous encoders to project them into unified morphology-aware latent tokens that are jointly modeled by a shared tactile Transformer expert. Pretrained on around 3,000 hours of tactile manipulation data aggregated from 26 data sources, spanning human and robot demonstrations across 21 sensors, FTP-1 learns tactile skills that transfer beyond the sensors seen during pretraining. Across downstream finetuning experiments spanning 5 hardware configurations, FTP-1 improves contact-rich manipulation on seen sensor setups by +17.2% and, surprisingly, transfers to two previously unseen tactile-sensor setups, achieving a +31% gain in success rate. FTP-1 establishes the first unified foundation baseline for tactile manipulation, providing future tactile policies with a shared model-level starting point. Pretrained models, datasets, training code, and more visualizations are available at this https URL.

2606.13100 2026-06-12 cs.CL New submission

LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction

Charles Moslonka, Amaury de Vitry, Arthur Garnier, Hicham Randrianarivo, Emmanuel Malherbe

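Affiliations: Artefact Research Center; MICS, CentraleSupélec, Université Paris-Saclay; Ardian

AI summary: Introduces the LEDGER benchmark of 4,999 digitized corporate annual reports for evaluating large language models on long-context financial tasks, covering KPI retrieval, single-value lookup, and full extraction.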

Comments
5 pages, 1 figure

Abstract

Finance reporting is a natural proving ground for large language models, and the very-long-context capabilities of recent models across all sizes make rigorous evaluation in this domain an increasingly pressing need. Yet most public financial resources reduce the task to plain-text SEC 10-K filings paired with a handful of question-answer items. We release LEDGER (Long-context Evaluation of Documents for Grounded Extraction and Retrieval), a corpus of 4,999 digitized corporate annual reports - full documents with figures, tables, and narrative, not just regulatory filings. Each report is labeled with 31 consolidated financial KPIs to be extracted and linked to the market's reaction at the earnings date. From this data we derive three evaluation benchmarks spanning the difficulty spectrum: a pure page-level KPI retrieval task with TREC-style relevance judgments over 118,048 questions in natural language, a conversational "needle-in-a-haystack" single-value lookup, and a full KPI extraction task, both from long, numerically dense reports. We additionally provide human OCR-quality annotations with inter-annotator agreement and the complete extraction, validation, and scoring toolchain. We further demonstrate the dataset's research utility with a case study linking CEO-letter rhetoric to post-publication market impact.
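
For readers unfamiliar with TREC-style relevance judgments, the sketch below parses judgments in the usual four-column qrels layout (`query_id iteration doc_id relevance`) and scores a page ranking with recall@k. The page identifiers are hypothetical, and whether LEDGER ships its judgments in exactly this layout, or reports this metric, is an assumption.

```python
def parse_qrels(lines):
    """Parse TREC qrels lines: 'query_id iteration doc_id relevance'."""
    relevant = {}
    for line in lines:
        qid, _iteration, doc_id, rel = line.split()
        if int(rel) > 0:
            relevant.setdefault(qid, set()).add(doc_id)
    return relevant

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents found in the top-k of a ranking."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)
```

Here each "document" would be one report page, so a retriever is rewarded for placing the pages that contain the requested KPI near the top of its ranking.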

2606.13096 2026-06-12 cs.CV New submission

Unified MRI Brain Image Translation via Hierarchical Tumor Structure Comparison

Yupeng Cai, Jia Wei, Jianlong Zhou

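Affiliations: South China University of Technology; UTS Data Science Institute, University of Technology Sydney

AI summary: Proposes HTSCGAN, which improves multi-modal MRI brain image translation through hierarchical tumor structure comparison and several loss functions, with strong results on BraTS2020 and BraTS2021.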

Abstract

Translating multi-modal MRI brain images from available modalities is of significant practical importance in modern medicine, providing robust support for early diagnosis, treatment planning, and outcome assessment of diseases. For this purpose, it is important to ensure the fidelity of the tumor regions after translation. However, existing brain image translation methods ignore the structural information of different tumor regions, which could help translation models enhance the quality and clinical applicability of the translated images. In this work, we propose a novel translation model called HTSCGAN, a unified multi-modal brain image translation generative adversarial model that integrates the structural information within tumor regions with the aim of improving the quality of brain image translation. Specifically, the generator employs three Patch Contrast Modules (PCMs) with different patch sizes to capture the hierarchical structural information of the tumor regions. In addition, a pretrained Patch Classifier (PC) and a pretrained Structure-Aware Encoder (SAE) are employed to ensure that the generated image contains the same tumor region structure as the ground-truth image, via a patch classification loss and a tumor perceptual loss, respectively. Experiments on BraTS2020 and BraTS2021 demonstrate strong performance of our model in both translation tasks and downstream segmentation tasks, highlighting its effectiveness in enhancing the quality and clinical relevance of the translated brain images. Our code is available at this https URL.

2606.13082 2026-06-12 cs.CL New submission

sebis at CRF Filling 2026: A Two-Stage Local LLM Pipeline for Medical CRF Filling

Katharina Sommer, Tristan Till, Florian Matthes

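Affiliations: Technical University of Munich

AI summary: Proposes a two-stage local pipeline built on MedGemma-27B that separates binary presence classification from value extraction and uses few-shot in-context learning to preserve privacy; it reaches 0.55 macro-F1 on the CRF filling task, ranking second among locally hosted, open-source submissions.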

Comments
Published in Proceedings of the Third Workshop on Patient-Oriented Language Processing (CL4Health), LREC 2026

Abstract

The extraction of structured clinical information from unstructured EHR notes is a persistent bottleneck in healthcare informatics. While large language models (LLMs) offer high performance, their deployment in clinical settings is hindered by privacy risks, inference costs, and the tendency to hallucinate beyond textual evidence. We address these challenges for the CL4Health 2026 Case Report Form (CRF) filling task by proposing a fully local, domain-adapted pipeline using the MedGemma-27B model. Our two-stage architecture, which separates binary presence classification from value extraction, enforces strict adherence to textual evidence and ensures deterministic outputs for negated, uncertain, or unknown states. By leveraging item-specific, few-shot in-context learning without external API calls or fine-tuning, our approach achieves a macro-F1 score of 0.55 on the official English test track. This result secures second place among all locally-hosted, open-source submissions. Our work demonstrates that privacy-preserving, on-premise LLM pipelines can achieve near-competitive performance with proprietary frontier models, providing a practical, data-sovereign framework for clinical NLP.
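
The control flow of a two-stage design of this kind can be sketched as follows. This is not the authors' pipeline: the verdict labels, the function names, and the keyword-based stand-ins for the two model calls are all hypothetical, and no LLM is invoked.

```python
def fill_item(note, item, classify_presence, extract_value):
    """Two-stage filling of one form item.

    Stage 1 decides whether the note gives evidence for the item; only a
    'present' verdict reaches stage 2. Every other verdict maps straight to
    a fixed output string, so no free-text generation is involved.
    """
    verdict = classify_presence(note, item)
    if verdict != "present":
        return verdict if verdict in {"negated", "uncertain"} else "unknown"
    return extract_value(note, item)

# Toy keyword-based stand-ins for the two model calls (illustration only).
def toy_classify(note, item):
    if item not in note:
        return "unknown"
    return "negated" if f"no {item}" in note else "present"

def toy_extract(note, item):
    # Return the first whitespace-delimited token after the item name.
    return note.split(item + " ", 1)[1].split()[0]
```

Keeping the presence decision separate means the extraction step only ever runs on items that the first stage has tied to the note, which is one plausible way to limit answers that go beyond the textual evidence.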

2606.13081 2026-06-12 cs.LG cs.AI New submission

Emotional regulation improves deep learning-based image classification

Riccardo Emanuele Landi, João M. F. Rodrigues, Marta Chinnici

Affiliations: Mare Group; NOVA LINCS; Institute of Engineering (ISE), University of Algarve; Department of Energy Technologies and Renewable Sources, ENEA Casaccia Research Center

AI summary: Proposes the Emotional Regulation framework, which models emotion in deep learning through artificial subjective experience; ResNet and ViT are pretrained on affective data for image classification and outperform prior emotion-augmented methods on CIFAR-10/100.

Abstract

Emotion significantly influences cognition, enhancing memory and learning under certain conditions. Drawing on this principle, emotion-augmented deep learning investigates how affective states can improve neural network architectures and learning paradigms, achieving better generalization than non-emotional models. However, existing methods often rely solely on objective neurophysiological factors, neglecting the role of subjectivity in emotion. To bridge this gap, the present study introduces Emotional Regulation, a novel framework for modeling emotion in deep learning through artificial subjective experience. The method employs pre-training based on affective stimuli, balancing non-emotional and emotionally influenced responses in downstream task optimization. Extensive experiments were conducted in image classification, pre-training ResNet and ViT architectures on four emotional datasets and using CIFAR-10 and CIFAR-100 as target benchmarks. Results reveal improvements over these backbones, providing evidence that Emotional Regulation is a promising method for defining emotion-augmented deep learning through artificial subjective experience. Furthermore, the proposed approach outperforms related work on CIFAR-based image classification, establishing Emotional Regulation as the new state of the art in emotion-augmented deep learning for large-scale vision datasets. The study also reinforces evidence of the impact of affective states in improving the optimization of machine learning tasks, encouraging further investigation into emotion-inspired architectures.

2606.13067 2026-06-12 cs.LG New submission

Limits of spectral learning under noise

Sabin Roman, Ljupco Todorovski, Saso Dzeroski, Marta Sales-Pardo, Roger Guimera

Affiliations: Jožef Stefan Institute; Faculty of Mathematics and Physics, University of Ljubljana; Department of Chemical Engineering, Universitat Rovira i Virgili; Center for Computational Science and Applied Mathematics (ComSCIAM), Universitat Rovira i Virgili; ICREA

AI summary: Studies how additive label noise affects spectral methods in supervised regression, derives a closed-form expression for the overlap between noisy and noiseless coefficient vectors, and reveals a universal degradation curve governed by a single intrinsic noise scale.

Abstract

Learning functional relationships from noisy data is a central problem in scientific inference. Spectral methods approximate unknown functions by expanding them in a basis and estimating the corresponding coefficients from data, but the stability of these coefficients under noise remains poorly understood. Here we study supervised regression with additive label noise using sparse spectral representations across multiple bases and dimensions. We show that noise induces a predictable drift in the learned coefficient vector whose magnitude depends on the effective number of active spectral modes. After whitening the empirical feature geometry, we derive a closed-form expression for the overlap between noisy and noiseless coefficient vectors, revealing a universal degradation curve governed by a single intrinsic noise scale. Numerical experiments across Fourier, Legendre, Bessel, and Haar bases confirm the theoretical prediction. The results demonstrate that spectral learning exhibits a fundamental noise threshold beyond which coefficient estimates become unstable, placing intrinsic limits on recovering functional structure from noisy data.
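
The quantity being tracked, the overlap between noisy and noiseless coefficient vectors, can be explored numerically with the small sketch below. It does not reproduce the paper's closed-form expression: the cosine basis, the midpoint grid (on which the modes are exactly orthogonal, so no whitening step is needed), the three-mode target function, and every name are assumptions made for illustration.

```python
import math
import random

def cosine_coefficients(ys, n_modes):
    """Estimate cosine-series coefficients from samples on a uniform midpoint grid.

    On the grid x_j = pi * (j + 0.5) / N the modes cos(k x) are exactly
    orthogonal, so least squares reduces to plain inner products.
    """
    n = len(ys)
    xs = [math.pi * (j + 0.5) / n for j in range(n)]
    return [
        (2.0 / n) * sum(y * math.cos(k * x) for x, y in zip(xs, ys))
        for k in range(1, n_modes + 1)
    ]

def overlap(a, b):
    """Cosine similarity between two coefficient vectors."""
    dot = sum(u * v for u, v in zip(a, b))
    return dot / (math.sqrt(sum(u * u for u in a)) * math.sqrt(sum(v * v for v in b)))

def noisy_overlap(sigma, n_points=256, n_modes=16, seed=0):
    """Overlap between coefficients fitted with and without Gaussian label noise."""
    rng = random.Random(seed)
    xs = [math.pi * (j + 0.5) / n_points for j in range(n_points)]
    clean = [math.cos(x) + 0.5 * math.cos(3 * x) - 0.8 * math.cos(5 * x) for x in xs]
    noisy = [y + rng.gauss(0.0, sigma) for y in clean]
    return overlap(cosine_coefficients(clean, n_modes), cosine_coefficients(noisy, n_modes))
```

Sweeping `sigma` from small to large values and plotting `noisy_overlap(sigma)` gives an empirical degradation curve for this toy setting, which can then be compared against whatever analytical prediction one has in hand.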

2606.13061 2026-06-12 cs.CV New submission

LaME: Learning to Think in Latent Space for Multimodal Embedding via Information Bottleneck

Peixi Wu, Biao Yang, Feipeng Ma, Bosong Chai, Bo Lin, Wei Yuan, Fan Yang, Tingting Gao, Hebei Li, Xiaoyan Sun

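Affiliations: University of Science and Technology of China; Kuaishou Technology; Zhejiang University; Tsinghua University

AI summary: Proposes LaME, which formulates embedding-oriented latent reasoning as a weakly supervised information bottleneck. Learnable reasoning tokens complete all reasoning in a single forward pass, avoiding the compute cost and annotation dependence of explicit CoT and giving a 60x speedup.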

Abstract

Reasoning-driven universal multimodal embedding has advanced rapidly by introducing Chain-of-Thought (CoT) reasoning into the embedding pipeline. Despite the strong performance across both general and complex tasks, this paradigm suffers from two core limitations: (i) autoregressive CoT reasoning incurs high computational cost, making it impractical for low-latency retrieval; and (ii) embedding performance is heavily coupled with CoT annotation quality, making large-scale training unreliable. These raise fundamental questions: Is textual CoT the optimal form of reasoning for embedding, and can effective embedding reasoning be accomplished in latent space? To this end, we propose LaME (Latent Reasoning Multimodal Embedding), which formulates embedding-oriented latent reasoning as a weakly supervised information bottleneck. LaME employs K learnable reason tokens as a fixed-capacity bottleneck, completing all reasoning within a single forward pass. The two weak supervision signals structurally decouple contrastive from autoregressive objectives and eliminate dependence on CoT annotations, while a two-stage training pipeline ensures stable convergence. Experiments on MMEB-v2 and MRMR show that LaME achieves competitive performance, surpassing some explicit CoT-based models, while delivering 60x faster inference than explicit CoT methods and 2x faster than latent baselines with throughput comparable to discriminative embedding models. Code will be released.

2606.13060 2026-06-12 cs.LG New submission

A green solvent screening tool for emerging materials via uncertainty aware, transformer enhanced transfer learning

Ioannis Kouroudis, Simon Ternes, Zhaosu Gu, Gohar Ali Siddiqui, Marina Ustinova, Angelo Lembo, Alessio Gagliardi, Aldo Di Carlo

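Affiliations: Technical University of Munich; Institute of Structure of Matter – National Research Council Rome (ISM-CNR); University of Rome "Tor Vergata"

AI summary: Proposes a transfer-learning approach that combines a pretrained Transformer model with uncertainty quantification to predict solubility parameters accurately from very little data, and develops a customizable green-solvent screening tool.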

Abstract

Accurate prediction of solubility remains a central challenge across materials science and sustainable chemistry. In particular, due to emerging technologies such as organic and hybrid photovoltaics, batteries, and catalysis, solvent usage is expected to increase significantly in the coming years. Substituting solvents with greener alternatives is therefore vital, and this is where machine learning can have substantial impact. However, the limited data on critical solubility parameters significantly constrains machine learning efficacy. In this work, we transfer a foundation model pre-trained on QM9 targets to our application with minimal data requirements. Additionally, the pipeline integrates uncertainty quantification, allowing the user to gauge the confidence of the predictions. As a baseline, we succeed in predicting the Hansen solubility parameters and the dielectric constant, for which extensive databases exist. Importantly, we achieve high model performance on additional targets, such as Gutmann donor and acceptor numbers, where the available data is extremely limited. Overall, we augment the data on solubility descriptors by orders of magnitude with high-quality predictions. For effective dissemination, we deploy an easy-to-use, customizable tool that integrates easily with high-throughput labs for ranking and screening possible solvent substitutes. Finally, we rediscover known green solvent alternatives and propose new candidates, demonstrating the tool's relevance for finding eco-friendly solvents.

2606.13053 2026-06-12 cs.RO cs.AI New submission

EA-WM: Event-Aware World Models with Task-Specification Grounding for Long-Horizon Manipulation

Kailin Wang, Haoxiang Jie, Yaoyuan Yan, Jiacheng Zhou, Zhiyou Heng

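Affiliations: AI Lab, Country Garden Services Group; Fudan University; Omni AI

AI summary: Proposes EA-WM, a framework that augments pretrained-feature world models with event prediction and verification so that task-progress signals can be assessed reliably and used for planning in long-horizon manipulation.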

Abstract

Pretrained-feature world models provide a useful substrate for robot imagination, but visual or latent prediction alone does not determine whether an imagined future satisfies task-relevant events. Long-horizon manipulation requires progress signals that are relational, predicate-level, and physically grounded: whether an object has moved, whether a drawer or contact state has changed, whether a placement predicate is satisfied, and whether a candidate future is reliable enough for execution. We introduce EA-WM, an event-aware world-model framework that augments frozen visual-feature dynamics with task-specification-grounded event prediction and verification. EA-WM rolls out candidate futures in pretrained visual-feature space, decodes them into structured event states, and scores them using task-progress, semantic-consistency, physical-feasibility, and uncertainty terms. The verifier guides sampling-based planning, gates candidate actions, and, in the contact-sensitive LIBERO wine-rack setting, selects among PPO-generated proposals. Across navigation, deformable-object, wall-constrained, and language-described manipulation studies, EA-WM shows that event-aware verification can make feature-space world models more interpretable and better aligned with task progress.
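
The scoring-and-gating step described above can be pictured with the sketch below. The abstract names the four terms but not how they are combined, so the weighted sum, the unit weights, the uncertainty threshold, and all identifiers are assumptions rather than the paper's verifier.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    progress: float      # predicted task progress, higher is better
    consistency: float   # semantic consistency with the task specification
    feasibility: float   # physical feasibility of the imagined rollout
    uncertainty: float   # predictive uncertainty, higher is worse

def score(c, w_progress=1.0, w_consistency=1.0, w_feasibility=1.0, w_uncertainty=1.0):
    """Weighted sum of the four terms; uncertainty enters as a penalty."""
    return (w_progress * c.progress + w_consistency * c.consistency
            + w_feasibility * c.feasibility - w_uncertainty * c.uncertainty)

def select(candidates, max_uncertainty=0.5):
    """Gate out unreliable candidates, then return the best-scoring survivor (or None)."""
    admissible = [c for c in candidates if c.uncertainty <= max_uncertainty]
    return max(admissible, key=score, default=None)
```

Returning `None` when every candidate is gated out leaves the caller free to resample or fall back to a default behavior instead of executing an unreliable rollout.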

2606.13051 2026-06-12 cs.AI New submission

AAbAAC: An Annotated Corpus for Autoimmunity Information Extraction

Fabien Maury (Imagine - U1163, HeKA | U1346), Solène Grosdidier, Maud de Dieuleveult (Imagine - U1163), Adrien Coulet (HeKA | U1346)

Affiliations: Inserm, Université Paris Cité, U1163 Institut Imagine; Inria, Inserm, Université Paris Cité, U1346 HeKA; Freelance researcher

AI summary: To address weak information-extraction performance in the autoimmunity domain, builds the AAbAAC corpus of 115 PubMed abstracts with manually annotated entities and relations, and validates it by fine-tuning NER models.

Abstract

Despite advances in information extraction driven by deep learning and large language models, performance gaps remain in highly specialized biomedical fields, where domain-specific complexity poses challenges for generalist models. In this work, we focus on the domain of autoimmunity, where the main entities of interest are autoimmune diseases, autoantibodies (i.e., molecules that may mark or cause these diseases), their molecular targets, their location in the body, and their associated clinical signs. Herein, we present AAbAAC (AutoAntibodies and Autoimmunity Annotated Corpus), a corpus of 115 abstracts selected from PubMed, where we manually annotated entities and their relationships. First, AAbAAC was used to evaluate several methods on the task of named entity recognition (NER), and secondly, to fine-tune NER models. Our study demonstrates the utility of AAbAAC for information extraction in the domain of autoimmunity, showing the expected improvement in NER performance after fine-tuning. This illustrates the value of small-scale annotation efforts for specialized domains and contributes to the computational study of autoimmunity. The AAbAAC corpus is available at this https URL.

2606.13049 2026-06-12 cs.RO New submission

Y-BotFrame: An Extensible Embodied Agent Framework for Quadruped Robot Assistants

Luyao Zhang, Ke Li, Yuan Ding, Xulong Zhao, Guo Yu, Chengwei Yan, Fuyu Dong, Jiawei Hu, Di Wang, Nan Luo, Gang Liu, Quan Wang

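Affiliations: Xidian University

AI summary: Proposes Y-BotFrame, a framework that integrates multimodal perception with a large-language-model cognitive core, maps natural-language instructions to executable task units, enables human-robot collaboration without a remote controller, and supports modular extension.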

Abstract

Quadruped robots are capable of traversing a wide range of complex terrains with high flexibility. As highly mobile ground-based intelligent platforms, they can be equipped with modules for navigation control, environmental perception, and intelligent interaction, thereby serving as real-world mobile deployment platforms for various algorithms. In this paper, we introduce Y-BotFrame, an extensible embodied platform that turns a robot into an intelligent ground assistant. Y-BotFrame integrates multimodal perception capabilities, including speech, vision, and LiDAR, and employs a large language model as the cognitive core for environmental understanding, contextual reasoning, and task planning. The system maps user natural-language instructions into executable embodied task units that can be carried out by the robot. Y-BotFrame supports natural interaction through voice commands and visual feedback, removing the need for a remote controller and enabling efficient human-robot collaboration. With a highly extensible framework, Y-BotFrame supports plug-and-play integration of new functional modules as well as modular upgrades and iterative development, offering a reference implementation for the real-world deployment of general-purpose, instruction-driven embodied agents. A supplementary video is available at this https URL.

2606.13044 2026-06-12 cs.CL New submission

No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions

Xu Yang, Zhizhou Sha, Junbo Li, Jian Yu, Yifan Sun, Matthew Zhao, Jinrui Fang, Xinyue Guo, Yining Wu, Xu Hu, Yifu Luo, Qiang Liu, Zhangyang Wang

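Affiliations: University of Texas at Austin; University of Illinois Urbana-Champaign; University of Texas at Dallas; Independent Researcher

AI summary: Shows that revising only a paper's presentation (abstract, contribution framing, and so on) without changing its scientific content, guided by AI-reviewer feedback in an adversarial repackaging loop, raises review scores, exposing a structural weakness: AI reviewers are easily swayed by surface impressions.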

Comments
35 pages, 5 figures

Abstract

As AI-generated reviews move from experimental tools into peer-review infrastructure, most robustness concerns have focused on explicit attacks such as hidden instructions and prompt injection. We study a harder and more policy-relevant failure mode: no hidden text, no prompt injection, and no changes to methods, experiments, figures, equations, proofs, or numerical results. The attacker modifies only presentation-level content, such as the abstract, contribution framing, related work, discussion, and narrative structure. We introduce adversarial repackaging: a closed-loop attack that uses AI-reviewer feedback to search for presentation-level revisions while keeping the scientific evidence fixed. Across three mainstream AI reviewers, adversarial repackaging achieves a 75.1% attack success rate and a mean score gain of +1.21/10. The effect is not explained by ordinary prose polishing. We also reveal that strategies that change how the reviewer interprets the paper, such as related-work repositioning and analytical discussion expansion, substantially outperform surface edits such as local polishing, table formatting, and algorithm boxes. Our analysis reveals two deeper structural failure modes. First, AI reviewers are easier to impress than to convince: highlighting strengths reliably increases perceived merit, while attempts to dissolve weaknesses frequently backfire. Second, AI reviewers can confuse the appearance of addressing a limitation with actually resolving it, allowing unchanged evidence to be reinterpreted as stronger scientific contribution. These results show that the deployment risk is not only malicious hidden instructions, but the emergence of paper presentation itself as an optimization surface. We release a contamination-free rolling benchmark and attack framework for testing whether AI reviewers remain anchored to scientific content under presentation-only edits.

2606.13042 2026-06-12 cs.AI cs.CV 新提交

Augmentation techniques for video surveillance in the visible and thermal spectral range

可见光和热红外光谱范围内视频监控的增强技术

Vanessa Buhrmester, Ann-Kristin Grosselfinger, David Munch, Michael Arens

发表机构 * Fraunhofer Institute of Optronics, System Technologies and Image Exploitation IOSB(弗劳恩霍夫光学、系统技术与图像处理研究所)

AI总结 针对多光谱CNN目标检测,研究可见光与热红外图像差异,探索数据增强技术对分类精度的影响,以提升监控性能。

详情
Comments
8 pages
AI中文摘要

在智能视频监控中,摄像机在白天和夜晚记录图像序列。通常,这需要不同的传感器。为了获得更好的性能,将它们结合起来并不罕见。我们关注的情况是,长波红外摄像机连续记录,此外,另一台摄像机在白天记录可见光谱范围内的图像,并且智能算法监控采集的图像。更准确地说,我们的任务是基于多光谱CNN的目标检测。乍一看,可见光谱范围内的图像与热红外图像的区别在于,前者一方面具有颜色和清晰的纹理信息,另一方面不包含物体发出的热辐射信息。尽管颜色可以为分类任务提供有价值的信息,但诸如光照变化和不同传感器的特性等因素仍然构成重大问题。无论如何,获取足够且实用的热红外数据集来训练深度神经网络仍然是一个挑战。这就是为什么借助可见光谱范围内的数据进行训练可能是有利的,特别是当待评估的数据同时包含可见光和红外数据时。然而,目前尚不清楚热辐射、形状或颜色信息的变化对分类精度的影响有多大。为了更深入地了解卷积神经网络如何做出决策以及它们从不同传感器输入数据中学到什么,我们研究了不同增强技术的适用性和鲁棒性。

英文摘要

In intelligent video surveillance, cameras record image sequences during day and night. Commonly, this demands different sensors. To achieve better performance, it is not unusual to combine them. We focus on the case in which a long-wave infrared camera records continuously while another camera records in the visible spectral range during daytime, and an intelligent algorithm monitors the captured imagery. More precisely, our task is multispectral CNN-based object detection. At first glance, images from the visible spectral range differ from thermal infrared ones in that they contain color and distinct texture information but no information about the thermal radiation emitted by objects. Although color can provide valuable information for classification tasks, effects such as varying illumination and the characteristics of different sensors still represent significant problems. In any case, obtaining sufficient and practical thermal infrared datasets for training a deep neural network still poses a challenge. For this reason, training with the help of data from the visible spectral range could be advantageous, particularly if the data to be evaluated contains both visible and infrared imagery. However, there is no clear evidence of how strongly variations in thermal radiation, shape, or color information influence classification accuracy. To gain deeper insight into how Convolutional Neural Networks make decisions and what they learn from different sensor input data, we investigate the suitability and robustness of different augmentation techniques...

2606.13041 2026-06-12 cs.CV cs.GR cs.MM 新提交

SeamEdit: A Black-Box VLM-Agnostic Pipeline for Large-Image Semantic Editing

SeamEdit: 一种用于大图像语义编辑的黑盒VLM无关流水线

Xiangyu Lyu, Dan Lei

发表机构 * Technische Universität Darmstadt(达姆施塔特工业大学) Fine-Arts Educator, Yuncheng Middle School(运城中学美术教师)

AI总结 提出SeamEdit,一种无需训练、模型无关的流水线,通过五阶段后处理解决大图像分块编辑中的语义变形、对齐漂移和接缝伪影问题,实现高质量语义编辑。

详情
Comments
19 pages, 9 figures, 2 tables
AI中文摘要

大图像的语义区域编辑必须同时满足两个要求:高生成质量和与周围内容的自然融合。一些相关方法依赖于白盒模型,而忽略了闭源模型的强大生成能力。然而,直接将闭源模型应用于分块编辑会引入几种失败模式:语义变形、画布级对齐漂移和可见接缝伪影。本文提出SeamEdit,一种无需训练且模型无关的流水线,将任何具有修补能力的VLM视为黑盒预言机。SeamEdit通过五阶段后处理流水线缓解这些问题:基于覆盖的分块分解、黑盒VLM修补、几何和颜色一致性校正、基于接缝风险的多候选排序以及动态规划曲线接缝融合。该流水线降低了接缝可见性,并支持任意分块区域的语义修改。

英文摘要

Semantic region editing for large images must satisfy two requirements at the same time: high generative quality and natural integration with surrounding content. Some related methods rely on white-box models and leave the strong generation capability of closed-source models underexplored. Directly applying closed-source models to tiled editing, however, introduces several failure modes: semantic deformation, canvas-level alignment drift, and visible seam artifacts. This paper presents SeamEdit, a training-free and model-agnostic pipeline that treats any VLM with inpainting capability as a black-box oracle. SeamEdit mitigates these issues through a five-stage post-hoc pipeline: overlay-based tile decomposition, black-box VLM inpainting, geometric and color-consistency correction, seam-risk-based multi-candidate ranking, and dynamic-programming curved seam fusion. The pipeline reduces seam visibility and supports semantic modification of arbitrary tile regions.

2606.13040 2026-06-12 cs.RO 新提交

RoboProcessBench: Benchmarking Process-Aware Understanding in Vision-Language Robotic Manipulation

RoboProcessBench:视觉语言机器人操作中的过程感知理解基准测试

Dayu Xia, Yue Shi, Yao Mu, Huiting Ji, Chaofan Ma, Yingjie Zhou, Hua Chen, Yang Liu, Jiezhang Cao, Guangtao Zhai

发表机构 * Shanghai AI Laboratory(上海人工智能实验室) Zhejiang University(浙江大学) Shanghai Jiao Tong University(上海交通大学) Tsinghua University(清华大学) China University of Mining and Technology(中国矿业大学)

AI总结 提出RoboProcessBench基准,通过静态监控和动态推理两个维度、12个诊断问题家族,评估视觉语言模型在机器人操作中的过程感知理解能力,并基于58k问答对数据集验证了当前模型的局限性及后训练的有效性。

详情
AI中文摘要

视觉语言模型(VLM)正越来越多地被探索作为机器人操作中的视觉评判者、奖励生成器和故障检测器。这些角色隐含地要求模型不仅判断最终任务成功与否,还要判断操作执行在物理和时间上的进展。然而,现有评估未能测试VLM是否具备细粒度的过程理解。为填补这一空白,我们提出了RoboProcessBench,一个用于视觉语言机器人操作中过程感知理解的基准测试。RoboProcessBench将这种能力分解为两个互补维度:静态监控和动态推理,具体化为12个诊断问题家族,涵盖阶段、接触、运动、协调、原语局部进展、时间顺序、结果和原语级转换。基于物理基础的执行轨迹,构建的基准语料库ProcessData包含约58k个问答对,涵盖260个操作任务,进一步分为ProcessData-SFT和ProcessData-Eval,分别用于后训练和评估。对ProcessData-Eval上各种VLM的广泛评估揭示了12个诊断任务家族的普遍局限性,表明当前模型仍缺乏对操作执行的鲁棒过程感知理解。但通过ProcessData-SFT,后训练的Qwen2.5-VL-7B和InternVL-3-8B在局部状态、运动、进展和原语感知线索上表现出持续改进。这些结果表明,RoboProcessBench既可作为评估基准,也可作为可学习的监督源,用于开发能够监控和评估机器人操作过程的VLM。项目网页:this https URL。

英文摘要

Vision-language models (VLMs) are increasingly explored as visual critics, reward generators, and failure detectors in robotic manipulation. These roles implicitly require models to judge not only final task success, but also how a manipulation execution is physically and temporally progressing. However, existing evaluations fail to test whether VLMs possess fine-grained process understanding. To address this gap, we present RoboProcessBench, a benchmark for process-aware understanding in vision-language robotic manipulation. RoboProcessBench decomposes such capability into two complementary dimensions, static monitoring and dynamic reasoning, instantiated as 12 diagnostic question families covering phase, contact, motion, coordination, primitive-local progress, temporal order, outcome, and primitive-level transitions. Built from physically grounded execution traces, the curated benchmark corpus ProcessData contains ~58k question-answer pairs across 260 manipulation tasks, which is further split into ProcessData-SFT and ProcessData-Eval for post-training and evaluation purposes. Extensive evaluation of various VLMs on ProcessData-Eval reveals broad limitations across 12 diagnostic task families, suggesting current models still lack robust process-aware understanding of manipulation executions. But with ProcessData-SFT, the post-trained Qwen2.5-VL-7B and InternVL-3-8B exhibit consistent gains on local state, motion, progress, and primitive-aware cues. These results demonstrate that RoboProcessBench serves as both an evaluation benchmark and a learnable supervision source for developing VLMs capable of monitoring and evaluating robotic manipulation processes. Project webpage: this https URL.

2606.13038 2026-06-12 cs.AI 新提交

Nous: An Attempt to Extract and Inject the Cognition Behind Prediction-Market Behavior

Nous: 提取并注入预测市场行为背后认知的尝试

Haowei Qian

发表机构 * Independent Researcher(独立研究员)

AI总结 针对LLM代理在预测市场中认知同质化问题,提出Nous方法从真实交易行为提取八维行为画像并注入提示,发现提取部分有效但提示注入无法传递认知多样性。

详情
Comments
37 pages, 1 figure, 7 tables. Reproduction artifacts (code, frozen profiles, prompts, model outputs): this https URL
AI中文摘要

随着LLM代理在预测市场和集体决策中激增,它们面临认知同质化的风险:基于共享基础模型构建的代理产生相关预测,近期测量发现前沿模型错误相关性约为r~0.77。我们探究人类认知多样性是否可以从行为中恢复并转移到LLM代理。Nous从真实的Polymarket交易活动中提取结构化的八维行为画像,并通过提示注入到代理中。我们的核心发现是该流程的两半之间存在分离。提取部分有效:在100个钱包中,14个参数中有8个在时间上稳定(分半ICC >= 0.5,bootstrap CI下限>0.3;逆向得分达到ICC~0.9);钱包从其画像中被识别的概率远高于随机(top-1检索17-22% vs. 1%随机);四个预定义维度中的两个与样本外未来实现利润排名相关,尽管这些相关性在行为混杂控制后不成立。提示级注入无法可测量地传递:在语义嵌入指标上,结构化注入在任何模型上均未显示出比长度匹配控制组显著的优势,并且其诱导的多样性既未降低集成错误相关性,也未改善Brier分数——这一零结果在采样温度、画像多样性和问题难度的探索性检查中持续存在。测量提示本身定位了模型前的压缩:结构到叙述的翻译器发出近乎均匀的提示,其扩散不追踪画像扩散。我们将Nous定位为测量认知同质化问题及提示级补救措施的局限性,从而推动更深层次的提示下注入(微调、激活引导)。代码、冻结画像、提示和模型输出:此 https URL

英文摘要

As LLM agents proliferate in prediction markets and collective decision-making, they risk a cognitive monoculture: agents built on shared foundation models produce correlated forecasts, and recent measurement finds frontier-model errors correlated at r ~ 0.77. We ask whether human cognitive diversity can be recovered from behavior and transferred to LLM agents. Nous extracts a structured eight-dimension behavioral profile from real Polymarket trading activity and injects it into agents through prompts. Our central finding is a dissociation between the two halves of that pipeline. Extraction works, partially: across 100 wallets, 8 of 14 parameters are temporally stable (split-half ICC >= 0.5, bootstrap CI lower bound > 0.3; contrarian score reaches ICC ~ 0.9); wallets are identifiable from their profiles well above chance (top-1 retrieval 17-22% vs. 1% chance); and two of four pre-specified dimensions rank-correlate with future realized profit out-of-sample, though the correlations do not survive behavioral-confound controls. Prompt-level injection does not measurably transmit it: on a semantic embedding metric, structured injection shows no significant advantage over a length-matched control on any model, and the diversity it induces neither reduces ensemble error correlation nor improves Brier score -- a null that persists across exploratory checks on sampling temperature, profile diversity, and question difficulty. Measuring the prompts themselves locates the compression before the model: the structure-to-narrative translator emits near-uniform prompts whose spread does not track profile spread. We position Nous as measuring the cognitive-monoculture problem and the limits of a prompt-level remedy, motivating deeper, below-the-prompt injection (fine-tuning, activation steering). Code, frozen profiles, prompts, and model outputs: this https URL
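The two ensemble-level quantities named above, Brier score and error correlation, have standard definitions that can be sketched in a few lines. This is an illustrative sketch rather than the paper's released code: the function names are hypothetical, and it assumes binary market outcomes coded as 0 or 1 and "error" meaning the signed difference between forecast and outcome.

```python
import math


def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def error_correlation(forecasts_a, forecasts_b, outcomes):
    """Pearson correlation between the signed errors of two forecasters.

    Assumption: neither forecaster has constant error, otherwise the
    variance is zero and the correlation is undefined.
    """
    err_a = [p - o for p, o in zip(forecasts_a, outcomes)]
    err_b = [p - o for p, o in zip(forecasts_b, outcomes)]
    mean_a = sum(err_a) / len(err_a)
    mean_b = sum(err_b) / len(err_b)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(err_a, err_b))
    var_a = sum((a - mean_a) ** 2 for a in err_a)
    var_b = sum((b - mean_b) ** 2 for b in err_b)
    return cov / math.sqrt(var_a * var_b)
```

A lower Brier score is better, and an error correlation near 1 between two agents means that averaging their forecasts adds little, which is the monoculture concern the abstract describes.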

2606.13035 2026-06-12 cs.CV cs.AI 新提交

TetherCache: Stabilizing Autoregressive Long-Form Video Generation with Gated Recall and Trusted Alignment

TetherCache: 基于门控召回与可信对齐的自回归长视频生成稳定性方法

Yu Meng, Xiangyang Luo, Letian Li, Wenyuan Jiang, Chen Gao, Xinlei Chen, Yong Li, Xiao-Ping Zhang

发表机构 * Tsinghua University(清华大学) D-INFK, ETH Zürich(苏黎世联邦理工学院计算机科学系)

AI总结 提出TetherCache,一种无需训练、即插即用的缓存管理策略,通过门控召回(GRAB)和可信对齐编辑(TAME)缓解自回归视频扩散模型中的上下文漂移,实现稳定长视频生成。

详情
Comments
17 pages, 8 figures
AI中文摘要

自回归视频扩散模型通过将新生成帧的条件建立在先前生成内容上,为流式变长视频生成提供了自然框架。然而,将这些模型扩展到分钟级生成仍具挑战:有限的KV缓存预算使模型无法保留完整历史,而反复以自生成帧为条件会导致上下文分布偏移随时间累积,引发视觉伪影、质量下降和时间漂移。本文提出TetherCache,一种无需训练、即插即用的缓存管理策略,用于抗漂移长视频生成。TetherCache将缓存组织为sink、memory和recent区域,并引入两种互补机制。首先,GRAB(基于注意力多样性平衡的门控召回)使用结合注意力相关性与时间多样性的门控分数选择长程记忆帧,在固定缓存预算下保留信息丰富且多样化的历史上下文。其次,TAME(通过记忆编辑的可信对齐)通过将新召回的记忆令牌的统计量对齐到可信上下文分布来对其进行轻量编辑,减少漂移历史特征造成的污染。基于Self-Forcing,TetherCache在VBench-Long的30秒、60秒和240秒设置上持续提升长视频生成质量。特别地,在240秒生成中,它显著提高了整体和语义分数,同时将质量漂移从7.84降至1.33,证明了其在稳定长程自回归视频扩散中的有效性。

英文摘要

Autoregressive video diffusion models provide a natural formulation for streaming and variable-length video generation by conditioning newly generated frames on previously generated content. However, extending these models to minute-level generation remains challenging: the limited KV-cache budget prevents the model from retaining the full history, while repeatedly conditioning on self-generated frames induces a context distribution shift that accumulates over time, leading to visual artifacts, quality degradation, and temporal drift. In this paper, we propose TetherCache, a training-free and plug-and-play cache management strategy for drift-resistant long video generation. TetherCache organizes the cache into sink, memory, and recent regions, and introduces two complementary mechanisms. First, GRAB (Gated Recall with Attention-Diversity Balancing) selects long-range memory frames using a gated score that combines attention-based relevance with temporal diversity, preserving informative yet diverse historical context under a fixed cache budget. Second, TAME (Trusted Alignment via Memory Editing) lightly edits newly recalled memory tokens by aligning their statistics to a trusted context distribution, reducing the pollution caused by drifted historical features. Built on Self-Forcing, TetherCache consistently improves long-video generation quality on VBench-Long across 30s, 60s, and 240s settings. In particular, for 240s generation, it substantially improves overall and semantic scores while reducing quality drift from 7.84 to 1.33, demonstrating its effectiveness for stable long-horizon autoregressive video diffusion.
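The idea of aligning the statistics of recalled memory tokens to a trusted context can be illustrated with simple moment matching. This sketch is an assumption-laden illustration, not TAME itself: the abstract does not say which statistics are aligned or how strongly, so the mean/standard-deviation matching, the `strength` blend, and the name `align_statistics` are all hypothetical, and features are reduced to a flat list of floats.

```python
from statistics import fmean, pstdev


def align_statistics(recalled, trusted, strength=0.5, eps=1e-6):
    """Nudge `recalled` feature values toward the mean/std of `trusted`.

    strength=0.0 leaves the recalled values untouched; strength=1.0 fully
    matches the first two moments of the trusted values (hypothetical knob).
    """
    mu_r, sd_r = fmean(recalled), pstdev(recalled)
    mu_t, sd_t = fmean(trusted), pstdev(trusted)
    # Standardize with the recalled statistics, then re-scale with the trusted ones.
    aligned = [(x - mu_r) / (sd_r + eps) * sd_t + mu_t for x in recalled]
    # A light edit: blend the original and fully aligned values.
    return [(1.0 - strength) * x + strength * a for x, a in zip(recalled, aligned)]
```

Keeping `strength` below 1 reflects the abstract's wording that the edit is light: the recalled content is preserved while its distribution is pulled toward the trusted context.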

2606.13033 2026-06-12 cs.CV 新提交

SAM-Deep-EIoU: Selective Mask Propagation for Multi-Object Tracking

SAM-Deep-EIoU:面向多目标跟踪的选择性掩码传播

Alexander Holmberg

发表机构 * KTH Royal Institute of Technology(瑞典皇家理工学院)

AI总结 提出选择性掩码传播算法,仅在不确定性高的帧调用视频目标分割模型,以轻量级基跟踪器为主,在DanceTrack和SportsMOT上提升性能,SportsMOT达86.8 HOTA。

详情
AI中文摘要

多目标跟踪的难度分布呈重尾特性:大多数帧对于轻量级基跟踪器是容易的,而一小部分帧本质上是困难的。视频目标分割(VOS)模型通常能在基跟踪器失败的困难帧中保持身份,但其计算和内存成本高得多。我们提出选择性掩码传播,一种跟踪算法,仅在分配不确定性信号触发的窗口上从基跟踪器调度到VOS模型。仅当VOS模型做出与基跟踪器身份分配相矛盾的置信预测时,才修改基跟踪器的输出;弱或不确定的预测保留基输出。该方法无需训练,将基跟踪器和VOS模型均视为黑盒,并且可以通过用更强大的模型替换VOS组件而受益。在DanceTrack上,选择性掩码传播改进了三种不同的基跟踪器。在SportsMOT上,身份保持是体育分析的核心,使用全局轨迹关联的SAM3-Deep-EIoU以86.8 HOTA达到基准上的最先进性能。

英文摘要

Multi-object tracking has a heavy-tailed difficulty distribution: most frames are easy for a lightweight base tracker, while a small fraction are intrinsically hard. Video object segmentation (VOS) models can often preserve identity through the hard frames where the base tracker fails, but they are much more expensive in compute and memory. We propose selective mask propagation, a tracking algorithm that dispatches from a base tracker to a VOS model only on windows where an assignment-uncertainty signal fires. The base tracker's output is modified only when the VOS model makes a confident prediction that contradicts the base tracker's identity assignment; weak or inconclusive predictions preserve the base output. The method is training-free, treats both the base tracker and the VOS model as black boxes, and can benefit from replacing the VOS component with a more capable model. On DanceTrack, selective mask propagation improves three different base trackers. On SportsMOT, where identity preservation is central to sports analytics, SAM3-Deep-EIoU with global track association achieves state-of-the-art performance on the benchmark with 86.8 HOTA.
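The dispatch-and-override rule described above can be sketched as a small decision function. This is a hedged illustration, not the paper's implementation: the names `resolve_identity` and `run_vos`, the scalar uncertainty signal, and both thresholds are hypothetical, and the sketch handles a single track rather than windows of frames.

```python
def resolve_identity(base_id, uncertainty, run_vos,
                     uncertainty_threshold=0.5, confidence_threshold=0.8):
    """Return the identity to keep for one track.

    base_id      -- identity assigned by the lightweight base tracker
    uncertainty  -- assignment-uncertainty signal for this track (assumed scalar)
    run_vos      -- zero-argument callable returning (vos_id, vos_confidence);
                    it is only called when the uncertainty signal fires
    """
    if uncertainty < uncertainty_threshold:
        return base_id  # easy case: the expensive VOS model is never invoked
    vos_id, vos_confidence = run_vos()
    if vos_confidence >= confidence_threshold and vos_id != base_id:
        return vos_id  # confident contradiction: override the base tracker
    return base_id  # weak or agreeing prediction: keep the base output
```

Passing the VOS model as a callable keeps both components black boxes, so swapping in a stronger segmentation model requires no change to the decision rule.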

2606.13032 2026-06-12 cs.CV 新提交

GeoCFNet: Geometry-Aware Confidence Field Network for Robot-Assisted Endoscopic Submucosal Dissection

GeoCFNet: 几何感知置信场网络用于机器人辅助内镜黏膜下剥离术

Rui Tang, Guankun Wang, Long Bai, Haochen Yin, Huxin Gao, Jiewen Lai, Jiazheng Wang, Hongliang Ren

发表机构 * Department of Electronic Engineering, The Chinese University of Hong Kong(香港中文大学电子工程系) Theory Lab, Central Research Institute, 2012 Labs, Huawei Technologies Co. Ltd.(华为技术有限公司中央研究院2012实验室理论实验室)

AI总结 提出GeoCFNet,通过几何感知置信场估计解决动态内镜场景下的解剖引导问题,集成Token差异化融合和几何感知空间正则化,实现精确稳定的置信场预测。

详情
Comments
IEEE ICIA 2026
AI中文摘要

先进的手术机器人技术使机器人辅助内镜黏膜下剥离术(ESD)成为整块切除大病变的有前景方法,具有降低复发率和改善长期预后的潜力。然而,ESD的技术复杂性和并发症风险需要稳定精确的视觉引导,以维持准确的解剖通道和安全组织边界。密集置信场通过描述优选解剖区域及其向周围组织的空间过渡,为此提供了有效表示。然而,在动态内镜场景中,由于烟雾、镜面高光、组织变形、弱纹理以及目标区域的薄几何结构,可靠的置信场估计仍然具有挑战性。为解决这些问题,我们将解剖引导表述为几何感知置信场估计问题,并提出GeoCFNet,一种基于预训练DINOv3骨干网络的几何感知置信场网络。GeoCFNet集成了Token差异化融合模块以聚合类别令牌上下文与密集补丁表示、用于置信回归的SegFormer解码器,以及几何感知空间正则化(GASR)以保持空间一致性和局部几何过渡。实验结果表明,GeoCFNet实现了RMSE 0.0480、PSNR 27.1995、SSIM 0.3397和CC 0.2466,表明其能够为机器人辅助ESD引导提供精确且几何稳定的置信场估计。

英文摘要

Advanced surgical robotics has made robot-assisted endoscopic submucosal dissection (ESD) a promising approach for the en-bloc resection of large lesions, with the potential to reduce recurrence and improve long-term outcomes. However, the technical complexity and risk of complications in ESD demand stable and precise visual guidance to maintain an accurate dissection corridor and a safe tissue margin. Dense confidence fields provide an effective representation for this purpose by describing both the preferred dissection region and its spatial transition to surrounding tissue. However, reliable confidence field estimation remains challenging in dynamic endoscopic scenes due to smoke, specular highlights, tissue deformation, weak texture, and the thin geometric structure of the target region. To address these challenges, we formulate dissection guidance as a geometry-aware confidence field estimation problem and propose GeoCFNet, a geometry-aware confidence field network built on a pretrained DINOv3 backbone. GeoCFNet integrates a Token-Differentiated Fusion module to aggregate class-token context with dense patch representations, a SegFormer decoder for confidence regression, and Geometry-Aware Spatial Regularization (GASR) to preserve spatial coherence and local geometric transitions. Experimental results show that GeoCFNet achieves RMSE 0.0480, PSNR 27.1995, SSIM 0.3397, and CC 0.2466, indicating accurate and geometrically stable confidence field estimation for robot-assisted ESD guidance.

2606.13030 2026-06-12 cs.CV 新提交

A Multi-Modal Framework with Cross-Subject Pseudo-Labeling and Semantic Alignment for Micro-Gesture Recognition

一种结合跨主体伪标签与语义对齐的多模态微手势识别框架

Haoran Zhang, Haokun Zhang, Pengyu Liu, Yujia Zhang, Weibao Xue, Yanbin Hao

发表机构 * School of Computer Science and Information Engineering, Hefei University of Technology (HFUT)(合肥工业大学计算机科学与信息工程学院) School of Computer Science, University of Auckland (UOA)(奥克兰大学计算机科学学院)

AI总结 针对微手势识别中低信噪比、长尾分布和跨主体域偏移问题,提出多模态框架,通过显著性引导提取、平方根平滑加权、正交语义嵌入损失和跨模态伪标签策略,实现有效识别,F1分数达68.13%。

详情
Comments
14 pages, 2 figures
AI中文摘要

微手势(MGs)是自发的、细微的身体动作,经常传达隐藏的人类情感。在未修剪视频中识别MGs仍然极具挑战性,因为其极低的信噪比、严重的长尾类分布以及跨主体评估场景中固有的域偏移。在本文中,我们为第四届MiGA-IJCAI挑战赛的Track 1提出了一个全面的多模态框架。为了捕捉细粒度表示,我们设计了一个显著性引导的多模态提取流程,整合了68关键点骨架关节坐标、3D热图体积和高分辨率RGB视觉特征。我们引入了一种温和的平方根平滑加权机制,配合正交语义嵌入损失,以保护尾部类别而不损害整体识别能力。更重要的是,为了弥合跨主体泛化差距,我们提出了一种跨模态伪标签(CMPL)策略用于无监督域适应,显著提升了单模态鲁棒性。最后,采用温度缩放软投票机制以减轻后期融合中的过度自信。大量实验表明,我们的框架达到了具有竞争力的68.13%的F1分数,获得第四名。

英文摘要

Micro-gestures (MGs) are spontaneous and subtle body movements that frequently convey hidden human emotions. Recognizing MGs in untrimmed videos remains highly challenging due to their extremely low signal-to-noise ratio, severe long-tailed class distribution, and the inherent domain shift encountered in cross-subject evaluation scenarios. In this paper, we propose a comprehensive multi-modal framework for Track 1 of the 4th MiGA-IJCAI Challenge. To capture fine-grained representations, we design a saliency-guided multi-modal extraction pipeline integrating 68-keypoint skeleton joint coordinates, 3D heatmap volumes, and high-resolution RGB visual features. We introduce a gentle square-root smoothed weighting mechanism paired with an Orthogonal Semantic Embedding Loss to protect tail classes without compromising overall recognition capabilities. More importantly, to bridge the cross-subject generalization gap, we propose a Cross-Modal Pseudo-Labeling (CMPL) strategy for unsupervised domain adaptation, which significantly boosts single-modal robustness. A temperature-scaled soft-voting mechanism is finally utilized to alleviate overconfidence during late fusion. Extensive experiments demonstrate that our framework achieves a competitive F1-score of 68.13%, securing the 4th place.
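Two of the ingredients above, square-root smoothed class weighting and temperature-scaled soft voting, can be sketched as follows. The exact formulas are not given in the abstract, so this is one plausible reading under stated assumptions: class weights proportional to the inverse square root of the class count and normalized to mean 1, and late fusion as the average of temperature-softened softmax outputs. All function names are hypothetical.

```python
import math


def sqrt_smoothed_weights(class_counts):
    """Class weights proportional to 1/sqrt(count), normalized to mean 1."""
    raw = [1.0 / math.sqrt(n) for n in class_counts]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; temperature > 1 flattens the distribution."""
    peak = max(logits)
    exps = [math.exp((z - peak) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(per_modality_logits, temperature=2.0):
    """Average the temperature-scaled class probabilities of all modalities."""
    probs = [softmax(logits, temperature) for logits in per_modality_logits]
    num_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]
```

With counts of 100 and 4, inverse-frequency weighting would favor the tail class by a factor of 25, while the square-root variant favors it by a factor of 5, which is the gentler behavior the abstract alludes to.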

2606.13024 2026-06-12 cs.LG cs.AI 新提交

CausalMoE: A Billion-Scale Multimodal Foundation Model for Granger Causal Discovery with Pattern-Routed Heterogeneous Experts

CausalMoE:基于模式路由异构专家的十亿规模多模态基础模型用于格兰杰因果发现

Bo Liu, Di Dai, Jingwei Liu, Jiarui Jin, Xiaocheng Fang, Guangkun Nie, Hongyan Li, Shenda Hong

发表机构 * State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University(北京大学智能科学与技术学院通用人工智能国家重点实验室) National Institute of Health Data Science, and Institute for Artificial Intelligence, Peking University(北京大学健康医疗大数据国家研究院、人工智能研究院)

AI总结 提出CausalMoE,一种十亿规模多模态格兰杰因果基础模型,通过模式路由混合异构专家解耦动态机制,结合因果自注意力与LLM/VLM先验,实现稀疏因果图恢复,在监督和少样本场景中达到最优。

详情
AI中文摘要

格兰杰因果发现(GCD)是分析复杂系统中时间依赖性的基础。然而,现有的神经GCD方法主要依赖“一刀切”范式,难以捕捉真实世界时间序列中固有的分布偏移和动态机制变化,常导致表示纠缠和虚假因果图。本文提出CausalMoE,一种十亿规模多模态格兰杰因果基础模型,显式建模补丁级异质性。CausalMoE引入模式路由混合异构专家,动态识别潜在时间模式并将补丁路由到专门领域专家,有效解耦机制特定动态与共享动态。为确保可解释的图恢复,我们设计了一种跨变量运行的因果感知自注意力机制,通过近端优化生成稀疏格兰杰因果图。此外,CausalMoE是首个集成LLM和VLM以对齐数值信号与文本和视觉先验的模型,在复杂场景中正则化因果估计。大量实验表明,CausalMoE在全监督基准上达到新最优,同时在传统方法失败的少样本设置中有效泛化。

英文摘要

Granger Causal Discovery (GCD) is fundamental for analyzing temporal dependencies in complex systems. However, existing neural GCD methods predominantly rely on a "one-size-fits-all" paradigm, struggling to capture distribution shifts and dynamic regime changes inherent in real-world time series. This often leads to entangled representations and spurious causal graphs. In this paper, we propose CausalMoE, a billion-scale multimodal Granger causal foundation model that explicitly models patch-level heterogeneity. CausalMoE introduces a Pattern-Routed Mixture of Heterogeneous Experts, which dynamically identifies latent temporal patterns and routes patches to specialized domain experts, effectively decoupling regime-specific mechanisms from shared dynamics. To ensure interpretable graph recovery, we design a Causality-Aware Self-Attention mechanism operating across variables, yielding sparse Granger causal graphs via proximal optimization. Furthermore, CausalMoE is the first to integrate LLMs and VLMs to align numerical signals with textual and visual priors, regularizing causal estimation in complex scenarios. Extensive experiments demonstrate that CausalMoE establishes a new state-of-the-art on fully supervised benchmarks, while effectively generalizing to few-shot settings where traditional methods fail.
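Sparse graphs obtained via proximal optimization are commonly produced by a proximal gradient step with an L1 penalty, whose proximal operator is soft-thresholding. The sketch below shows that standard operator only; the abstract does not state which penalty CausalMoE uses, so the L1 choice, the function names, and the treatment of the causal graph as a plain weight matrix are assumptions.

```python
def soft_threshold(value, threshold):
    """Proximal operator of threshold * |value|: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def proximal_gradient_step(weights, grads, learning_rate, l1_strength):
    """One ISTA-style update on a matrix of candidate causal edge weights.

    weights[i][j] is read here as the strength of the edge from variable j
    to variable i (a hypothetical convention); exact zeros mean "no edge".
    """
    shrink = learning_rate * l1_strength
    return [
        [soft_threshold(w - learning_rate * g, shrink) for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, grads)
    ]
```

Unlike adding an L1 term to the loss and using plain gradient descent, the proximal step sets small weights to exactly zero, so the recovered graph is sparse without a post hoc cutoff.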

2606.13022 2026-06-12 cs.CV cs.LG 新提交

Quality-Preserving Imperceptible Adversarial Attack on Skeleton-based Human Action Recognition

基于骨架的人体动作识别中保质量不可察觉对抗攻击

Ziyi Chang, Kanglei Zhou, Xiaohui Liang, Hubert P. H. Shum

发表机构 * Durham University(杜伦大学) Tsinghua University(清华大学) Beihang University(北京航空航天大学) Zhongguancun Laboratory(中关村实验室)

AI总结 针对骨架动作识别的对抗攻击常引入噪声扰动降低动作质量,本文提出一种基于分布的对抗攻击方法,通过最小化经验风险与真实风险的差距来保持动作质量,并设计新指标评估自然性,实验表明该方法在攻击成功率和动作质量上均优于现有方法。

详情
AI中文摘要

针对骨架人体动作识别的对抗攻击已受到广泛关注。然而,现有方法通常引入类似噪声的扰动,导致攻击后动作质量下降,从而在S-HAR系统的最新进展中本质上是可察觉的。我们发现这种退化源于先前对抗攻击优化过程中经验风险与真实风险之间的差距。为解决此问题,我们提出一种在不损害动作质量的情况下获得对抗动作的攻击方法。为最小化风险差距并保持动作质量,我们提出一种基于分布的对抗攻击方法,不引入类似噪声的扰动。为忠实评估动作质量,我们提出一种新指标,该指标与人类对真实世界自然性的感知一致。在两个数据集上对最先进的S-HAR方法进行了实验,通过定性和定量分析证明了我们的方法在攻击成功率和攻击后动作质量方面的优越性。我们的保质量攻击应用和基于分布的方法的成功引发了关于动作识别器鲁棒性的严重担忧,强调了在该领域进一步改进的必要性。

英文摘要

Adversarial attacks on skeletal human action recognition have received significant attention. However, existing methods typically introduce noise-like perturbations that degrade motion quality post-attack, and thereby are inherently perceptible with recent advancements in S-HAR systems. We discover that this degradation stems from the gap between empirical and true risks during the optimization process of previous adversarial attacks. To address this issue, we propose an attack where adversarial motions are obtained without compromising their motion quality. To minimize the risk gap and preserve motion quality, we propose a distribution-based adversarial attack method without introducing noise-like perturbations. To faithfully evaluate the motion quality, we propose a new metric that aligns with human perception on real-world naturalness. Experiments have been conducted on the state-of-the-art S-HAR methods across two datasets, demonstrating the superiority of our method in both the attack success rate and the post-attack motion quality through qualitative and quantitative analyses. The success of our quality-preserving attack application and distribution-based method raises serious concerns about the robustness of action recognizers, highlighting the need for further enhancements in this domain.

2606.13016 2026-06-12 cs.AI 新提交

Otters++: A Time-to-first-spike Based Energy Efficient Optical Spiking Transformer

Otters++: 一种基于首次脉冲时间的高能效光学脉冲Transformer

Zhanglu Yan, Jiayi Mao, Kaiwen Tang, Fanfan Li, Gang Pan, Tao Luo, Bowen Zhu, Qianhui Liu, Weng-Fai Wong

发表机构 * National University of Singapore(新加坡国立大学) Westlake University(西湖大学) Shandong University(山东大学) Zhejiang University(浙江大学) Agency for Science, Technology and Research(新加坡科技研究局)

AI总结 提出Otters++,利用光电器件自然信号衰减实现TTFS计算,通过层等效与混合训练方法,在GLUE上达到84.17%平均分且能耗更低。

详情
AI中文摘要

脉冲神经网络(SNN)有望实现高能效推理,而首次脉冲时间(TTFS)编码尤其吸引人,因为每个神经元最多发放一次脉冲。然而,在实践中,这一优势往往因计算时间衰减项并将其与突触权重相乘的成本而减弱。我们通过将物理硬件“缺陷”(光电器件中的自然信号衰减)转化为TTFS的主要计算来解决这一问题,命名为Otters++。具体来说,我们利用定制In₂O₃光电突触的实测衰减直接实现TTFS时间项,从而消除了显式数字衰减计算的需求。为了将该思想扩展到Transformer模型,我们建立了Otters++与量化神经网络(QNN)之间的逐层功能等价性,并开发了一种混合训练方法,在前向传播中使用忠实于器件的SNN计算,在后向传播中通过等效QNN路径使用QNN直通梯度,并结合模型蒸馏。这避免了对离散首次脉冲事件的微分,并减少了直接TTFS-SNN训练中的过度稀疏问题。我们进一步通过采样运行间变化使训练感知实测器件噪声,并通过考虑器件共享和多跳通信来细化系统级能耗模型。在GLUE数据集上,Otters++将平均得分提高到84.17%,同时相比先前的脉冲Transformer基线保持明显的能耗优势。这些结果表明,基于物理的TTFS计算在实际硬件效应下可以高效、可训练且鲁棒。

英文摘要

Spiking neural networks (SNNs) are promising for energy-efficient inference, and time-to-first-spike (TTFS) coding is especially attractive because each neuron fires at most once. In practice, however, this benefit is often reduced by the cost of computing a temporal decay term and multiplying it by the synaptic weight. We address this issue by turning a physical hardware "bug," the natural signal decay in optoelectronic devices, into the main computation of TTFS, an approach we name Otters++. Specifically, we use the measured decay of a custom In₂O₃ optoelectronic synapse to directly realize the TTFS temporal term, removing the need for explicit digital decay computation. To scale this idea to Transformer models, we establish a layer-wise functional equivalence between Otters++ and a quantized neural network (QNN), and develop a hybrid training method that uses device-faithful SNN computation in the forward pass and QNN straight-through gradients through the equivalent QNN path in the backward pass, together with model distillation. This avoids differentiation through discrete first-spike events and reduces the over-sparsity problem in direct TTFS-SNN training. We further make training aware of measured device noise by sampling run-to-run variation, and refine the system-level energy model by accounting for device sharing and multi-hop communication. On the GLUE benchmark, Otters++ improves the average score to 84.17% while maintaining a clear energy advantage over prior spiking Transformer baselines. These results show that physically grounded TTFS computing can be efficient, trainable, and robust under realistic hardware effects.

2606.13007 2026-06-12 cs.LG cs.AI 新提交

scLLM-DSC: LLM-Knowledge Enhanced Cross-Modal Deep Structural Clustering for Single-Cell RNA Sequencing

scLLM-DSC:基于LLM知识增强的跨模态深度结构聚类用于单细胞RNA测序

Ping Xu, Pengjiang Li, Tian Du, Zaitian Wang, Jiawei Gu, Ziyue Qiao, Pengfei Wang, Yuanchun Zhou

发表机构 * Computer Network Information Center, Chinese Academy of Sciences(中国科学院计算机网络信息中心) University of Chinese Academy of Sciences(中国科学院大学) Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences(中国科学院大学杭州高等研究院) School of Computing and Information Technology, Great Bay University(大湾区大学计算机科学与技术学院) School of Engineering, Westlake University(西湖大学工学院)

AI总结 提出scLLM-DSC框架,通过知识驱动语义视图与结构感知拓扑视图的跨模态对比对齐,利用LLM增强单细胞RNA测序数据的聚类性能,显著优于现有方法。

详情
AI中文摘要

聚类是scRNA-seq分析的基础,是识别细胞群体和解析组织异质性的基石。然而,现有方法专注于挖掘数值统计模式,由于忽略了基因编码的内在生物学功能,存在语义不可知的问题。虽然大语言模型(LLM)提供了有前景的语义能力,但生成式预训练目标与判别式下游任务之间的结构不匹配阻碍了它们直接适应细胞聚类。为弥合这一差距,我们提出了scLLM-DSC,一种新颖的LLM知识增强跨模态深度结构聚类框架。与数据驱动范式不同,scLLM-DSC通过协同两个视图建立语义基础表示:从NCBI基因先验和上下文化的Cell2Sentence嵌入中提取的知识驱动语义视图,以及通过图引导编码器提取的结构感知拓扑视图。关键的是,我们引入了一种跨模态对比对齐机制,以在统一潜在空间中强制生物学语义与转录组特征之间的一致性。广泛的基准测试表明,scLLM-DSC在聚类准确性上显著优于十一个最先进的基线方法。

英文摘要

Clustering is fundamental to scRNA-seq analysis, serving as a cornerstone for identifying cell populations and resolving tissue heterogeneity. However, existing methods focus on mining numerical statistical patterns, suffering from semantic agnosticism by neglecting the intrinsic biological functions encoded by genes. While Large Language Models (LLMs) offer promising semantic capabilities, their direct adaptation to cell clustering is hindered by the structural mismatch between generative pre-training objectives and discriminative downstream tasks. To bridge this gap, we propose scLLM-DSC, a novel LLM-Knowledge Enhanced Cross-Modal Deep Structural Clustering framework. Diverging from data-driven paradigms, scLLM-DSC establishes a semantically-grounded representation by synergizing two views: a Knowledge-Driven Semantic View derived from NCBI gene priors and contextualized Cell2Sentence embeddings, and a Structure-Aware Topological View extracted via a graph-guided encoder. Crucially, we introduce a cross-modal contrastive alignment mechanism to enforce consistency between biological semantics and transcriptomic features within a unified latent space. Extensive benchmarks demonstrate that scLLM-DSC significantly outperforms eleven state-of-the-art baselines in clustering accuracy.

2606.13006 2026-06-12 cs.SD 新提交

Emo-LiPO: Listwise Preference Optimization for Fine-Grained Emotion Intensity Control in LLM-based Text-to-Speech

Emo-LiPO:基于LLM的文本到语音中细粒度情感强度控制的列表式偏好优化

Yihang Lin, Li Zhou, Congwei Cao, Dongchu Xie, Xiaoxue Gao, Chen Zhang, Haizhou Li

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Agency for Science, Technology and Research(新加坡科技研究局) National University of Singapore(新加坡国立大学) Shenzhen Research Institute of Big Data(深圳市大数据研究院) Shenzhen Loop Area Institute(深圳市环区研究院)

AI总结 提出Emo-LiPO框架,将情感强度控制建模为学习排序问题,通过列表式偏好优化对齐文本与语音的情感强度,实现更忠实连续的情感表达,在ESD-plus数据集上显著提升情感准确性和强度可控性。

详情
Comments
Accepted by IJCAI 2026. Emotional TTS, Preference Optimization, Emotion Intensity Control
AI中文摘要

基于大型语言模型(LLM)的文本到语音(TTS)系统能够实现提示条件的情感控制,但由于文本与语音之间的语义-声学差距,在细粒度情感强度方面存在困难。为了解决这一挑战,我们将LLM-based TTS中的情感强度控制形式化为一个学习排序问题,并提出了Emo-LiPO,一种列表式偏好优化框架,该框架将提示条件的语音生成与文本中表达的相对情感强度对齐。Emo-LiPO在固定文本下显式建模每种情感内的全局强度排序,从而实现更忠实和连续的情感表达。我们进一步构建了ESD-plus,一个具有显式情感强度变化的多说话人数据集,以支持细粒度情感建模和评估。在ESD-plus上的实验表明,与基于监督学习和DPO的LLM TTS基线相比,Emo-LiPO显著提高了情感准确性和强度可控性,特别是在高强度水平上表现尤为突出。

英文摘要

Large language model (LLM)-based text-to-speech (TTS) systems enable prompt-conditioned emotional control but struggle with fine-grained emotion intensity due to the semantic-acoustic gap between text and speech. To address this challenge, we formulate emotion intensity control in LLM-based TTS as a learning-to-rank problem and propose Emo-LiPO, a listwise preference optimization framework that aligns prompt-conditioned speech generation with relative emotion intensity expressed in text. Emo-LiPO explicitly models global intensity ordering within each emotion under fixed transcripts, enabling more faithful and continuous emotional expression. We further construct ESD-plus, a multi-speaker dataset with explicit emotion intensity variations, to support fine-grained emotion modeling and evaluation. Experiments on ESD-plus demonstrate that Emo-LiPO significantly improves emotion accuracy and intensity controllability over both supervised- and DPO-based LLM TTS baselines, with particularly pronounced gains at high intensity levels.
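To make the learning-to-rank framing concrete, here is a sketch of one standard listwise objective, the Plackett-Luce negative log-likelihood (often called ListMLE). It is shown only as an example of a listwise loss; the abstract does not specify Emo-LiPO's actual objective, and the function name and the use of one scalar model score per candidate utterance are assumptions.

```python
import math


def listwise_ranking_loss(scores_in_target_order):
    """Plackett-Luce negative log-likelihood of a full target ordering.

    scores_in_target_order -- model scores for the candidates of one
    (transcript, emotion) group, listed from the highest target intensity
    to the lowest (hypothetical setup).
    """
    loss = 0.0
    for i, score in enumerate(scores_in_target_order):
        remaining = scores_in_target_order[i:]
        peak = max(remaining)
        # Stable log-sum-exp over the candidates not yet placed.
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in remaining))
        loss += log_norm - score
    return loss
```

A pairwise objective such as DPO compares two candidates at a time, whereas a listwise loss like this one penalizes any deviation from the full intensity ordering in a single term.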

2606.13003 2026-06-12 cs.AI cs.CL cs.MA 新提交

The Illusion of Multi-Agent Advantage

多智能体优势的错觉

Prathyusha Jwalapuram, Hehai Lin, Chuyuan Li, Fangkai Jiao, Sudong Wang, Yifei Ming, Zixuan Ke, Chengwei Qin, Giuseppe Carenini, Shafiq Joty

发表机构 * Salesforce Research(Salesforce研究院) HKUST (Guangzhou)(香港科技大学(广州)) University of British Columbia(不列颠哥伦比亚大学) Nanyang Technological University(南洋理工大学)

AI总结 通过系统评估,发现自动生成的多智能体系统在性能和成本效率上均不如单智能体基线(如思维链自一致性),揭示了现有评估框架的缺陷和架构膨胀问题。

详情
AI中文摘要

普遍观点认为多智能体系统优于单智能体系统,其优势包括上下文保护、并行处理和分布式决策。然而,这一主张的经验支持主要依赖于与使用优先考虑孤立推理任务的基准测试的单智能体基线的比较,这些基准测试未能充分评估这些优势。我们专注于自动生成的多智能体系统(旨在比手动设计的系统具有更强的泛化能力),以单智能体系统(特别是思维链自一致性)为对照进行了严格、系统的评估。在传统推理数据集和具有交互式多步骤工作流的任务(例如 BrowseComp-Plus)上,我们证明自动多智能体系统始终不如思维链自一致性,尽管其成本高达10倍。为了将这些失败与任务结构固有的局限性隔离开来,我们引入了一个为多智能体系统量身定制的诊断性合成数据集,该数据集具有显式任务分解、上下文分离和并行化潜力。我们表明,专家设计的多智能体系统在该数据集上的原始性能和成本效率方面始终优于自动生成的架构,这表明现有的评估框架未能考虑增加计算成本的边际效用,从而掩盖了复杂多智能体系统的关键架构缺陷和低效性。关键的是,对生成的多智能体系统架构的系统解构表明,当前的自动化设计范式产生了架构膨胀,优先考虑表面复杂性,但这并未转化为功能效用,暴露了与多智能体原则的根本性错位。

英文摘要

Prevailing wisdom posits that Multi-Agent Systems (MAS) are superior to Single-Agent Systems (SAS), citing advantages like context protection, parallel processing and distributed decision-making. However, empirical support for this claim relies primarily on comparisons with SAS baselines using benchmarks that prioritize isolated reasoning tasks, which do not adequately assess these advantages. Focusing on automatically generated MAS that are designed for enhanced generalizability over manually designed counterparts, we perform a rigorous, systematic evaluation against SAS, specifically Chain-of-Thought with Self-Consistency (CoT-SC). Across traditional reasoning datasets and tasks with interactive multi-step workflows (e.g., BrowseComp-Plus), we demonstrate that automatic MAS consistently underperform CoT-SC despite being up to 10x more expensive. To isolate these failures from limitations inherent to task structure, we introduce a diagnostic synthetic dataset tailored for MAS featuring explicit task decomposition, context separation and parallelization potential. We show that expert-architected MAS consistently outperform automatically generated architectures in both raw performance and cost-efficiency on this dataset, demonstrating that existing evaluation frameworks mask critical architectural gaps and inefficiencies of complex MAS by failing to account for the marginal utility of increased computational cost. Critically, systematic deconstruction of the generated MAS architectures reveals that current automated design paradigms produce architectural bloat that prioritizes superficial complexity which does not translate into functional utility, exposing a fundamental misalignment with multi-agent principles.
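For readers unfamiliar with the CoT-SC baseline named in the abstract, the sketch below shows its aggregation step in its simplest form: several reasoning chains are sampled, and the most frequent final answer wins. The function name `self_consistency`, the answer normalisation, and the sample data are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def self_consistency(answers):
    """Majority vote over final answers extracted from sampled reasoning chains.

    `answers` is a list of strings, one per sampled chain (hypothetical input;
    how answers are extracted from chains is outside this sketch).
    Returns the winning answer and its vote share.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Light normalisation so that " 42 " and "42" count as the same vote.
    counts = Counter(a.strip().lower() for a in answers)
    # most_common(1) yields [(answer, count)]; ties go to the first-seen answer.
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


if __name__ == "__main__":
    sampled = ["42", "42", "41", " 42 ", "40"]
    print(self_consistency(sampled))
```

The cost of this baseline grows linearly with the number of sampled chains, which is the budget a multi-agent system has to beat for its extra cost to pay off.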

2606.12997 2026-06-12 cs.LG stat.ML 新提交

Reliability of Probabilistic Emulation of Physical Systems

物理系统概率仿真的可靠性

Sam F. Greenbury (1), Radka Jersakova (1), Paolo Conti (1 and 2), Marjan Famili (1 and 3), Christopher Iliffe Sprague (1 and 4), Edwin Brown (1 and 5), Jason D. McEwen (1 and 6) ((1) The Alan Turing Institute, (2) Autodesk Research, (3) PhysicsX, (4) Orbital, (5) University of Sheffield, (6) University College London)

发表机构 * The Alan Turing Institute(艾伦·图灵研究所) Autodesk Research(欧特克研究院) PhysicsX Orbital University of Sheffield(谢菲尔德大学) University College London(伦敦大学学院)

AI总结 比较生成模型与CRPS训练集成在物理系统概率仿真中的可靠性,发现CRPS集成在覆盖率和推理速度上更优。

详情
AI中文摘要

目前,生成物理系统概率预测的两种主要方法已经出现:生成模型(如扩散或流匹配)以及注入随机性的确定性模型集成(使用连续排序概率评分(CRPS)损失训练)。虽然这两种方法都表现出强大的预测准确性,但其不确定性的可靠性尚未得到系统评估。我们通过开发一个框架来填补这一空白,该框架在匹配模型大小和计算预算的情况下,评估这两种方法在多种二维时空物理系统中的表现。我们通过检查预测区间的经验覆盖率来评估概率仿真的可靠性,同时考虑准确性和计算效率指标。CRPS训练的集成在单步预测和自回归展开中通常能实现更可靠的不确定性,显示出比在潜在空间中训练生成模型的标准替代方案更好的覆盖率。此外,CRPS方法提供了显著更快的推理速度。当生成模型在环境空间而非压缩潜在空间中训练时(这在高维问题中通常不可行),它们表现出与CRPS训练集成相当的覆盖率,但推理延迟显著更大。相比之下,当CRPS训练的集成在潜在空间中训练时,其覆盖率相对于环境空间没有明显下降。生成模型和CRPS训练的集成都表现出良好的预测准确性。为促进未来的研究和应用,我们发布了AutoCast,一个实现生成模型和CRPS训练集成的模块化框架,以及AutoSim,一个用于快速原型的灵活数据集生成包。

英文摘要

Two dominant approaches have emerged for generating probabilistic forecasts of physical systems: generative models, such as diffusion or flow matching; and ensembles of deterministic models with stochasticity injected, trained using the continuous ranked probability score (CRPS) loss. While both approaches have demonstrated strong predictive accuracy, the reliability of their uncertainties has not been systematically assessed. We address this gap by developing a framework to evaluate both approaches across diverse 2D spatiotemporal physical systems, under matched model size and computational budget. We assess the reliability of probabilistic emulation by inspecting the empirical coverage of predictive intervals, while also considering accuracy and computational efficiency metrics. CRPS-trained ensembles typically achieve more reliable uncertainties on both single-step prediction and autoregressive rollouts, demonstrating better coverage than the standard alternative of training generative models in a latent space. Moreover, the CRPS approach offers significantly faster inference. When generative models are trained in ambient space rather than a compressed latent space, which is often infeasible for high-dimensional problems, they exhibit comparable coverage to CRPS-trained ensembles, though with substantially larger inference latency. In contrast, when CRPS-trained ensembles are trained in latent space, they do not show a marked degradation in coverage with respect to ambient space. Both generative models and CRPS-trained ensembles demonstrate good predictive accuracy. To facilitate future research and application, we release AutoCast, a modular framework implementing both generative models and CRPS-trained ensembles, alongside AutoSim, a flexible dataset generation package for rapid prototyping.
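The two quantities the abstract relies on, the CRPS of an ensemble and the empirical coverage of a predictive interval, can be sketched for scalar targets as follows. This is a minimal illustration using the standard ensemble estimator CRPS = mean|x_i - y| - (1/2) mean|x_i - x_j|. The function names and the crude order-statistic interval are assumptions made here, not the AutoCast API.

```python
def crps_ensemble(members, y):
    """Standard ensemble CRPS estimate for one scalar observation y."""
    m = len(members)
    # Accuracy term: mean absolute error of the members against y.
    accuracy = sum(abs(x - y) for x in members) / m
    # Spread term: half the mean absolute difference between member pairs.
    spread = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return accuracy - spread


def empirical_coverage(ensembles, targets, level=0.9):
    """Fraction of targets falling inside each ensemble's central interval.

    The interval is built from order statistics (a crude quantile estimate,
    chosen here for brevity). A reliable forecaster gives a value near `level`.
    """
    hits = 0
    for members, y in zip(ensembles, targets):
        s = sorted(members)
        lo_idx = int((1 - level) / 2 * (len(s) - 1))
        hi_idx = len(s) - 1 - lo_idx
        hits += s[lo_idx] <= y <= s[hi_idx]
    return hits / len(targets)


if __name__ == "__main__":
    print(crps_ensemble([1.0, 3.0], 2.0))
    print(empirical_coverage([[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]], [2, 10]))
```

With a single member the spread term vanishes and the score reduces to the absolute error, which is why the stochasticity injected into the deterministic models matters for this loss.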

2606.12995 2026-06-12 cs.RO 新提交

GenHOI: Contact-Aware Humanoid-Object Interaction by Imitating Generated Videos without Task-Specific Training

GenHOI: 通过模仿生成视频实现接触感知的人形机器人-物体交互,无需任务特定训练

Zhihai Bi, Qiang Zhang, Guoyang Zhao, Jiahang Cao, Xueyin Luo, Yushan Zhang, Jinglan Xu, Ruoyu Geng, Yulin Li, Andrew F. Luo, Jun Ma

发表机构 * The University of Tokyo(东京大学) National University of Singapore(新加坡国立大学) University of California, Los Angeles(加州大学洛杉矶分校) Tsinghua University(清华大学)

AI总结 提出GenHOI框架,通过模仿单个生成视频实现人形机器人零样本执行多种物体交互任务,无需任务特定训练或物理演示数据,利用接触事件和手-物接触区域编码为几何约束优化轨迹。

详情
AI中文摘要

人形机器人-物体交互(HOI)是人形机器人的基本能力,但由于动态平衡和与多样物体的稳定交互之间的紧密耦合,它仍然具有挑战性。现有方法通常需要耗时的任务特定策略训练或依赖于刚性轨迹回放,这限制了它们适应新颖交互场景的能力。在这项工作中,我们提出了GenHOI,一个简单而有效的框架,通过直接模仿单个生成视频,使人形机器人能够以零样本方式执行多样化的物体交互任务,无需任务特定训练或物理演示数据。GenHOI首先在仿真中重建机器人-物体场景并渲染第一帧图像,该图像与语言命令一起条件化任务导向交互视频的合成。然后分析生成的视频以识别交互相关的接触事件并估计手-物体接触区域,这些被编码为以物体为中心的几何约束,将视觉交互线索转化为物理基础的优化先验。在这些先验的指导下,从视频中恢复的参考运动被细化和平滑,以解决2D视频生成中固有的尺度模糊性,同时将单个参考轨迹适应于未见过的机器人-物体相对姿态。优化后的轨迹最终由闭环跟踪控制器执行。我们在包括箱子抓取、非对称双臂椅子搬运、从下方抬桌子和圆柱物体包裹在内的多样化物体交互任务中,通过大量仿真和真实世界实验验证了所提出的框架。

英文摘要

Humanoid-Object Interaction (HOI) is a fundamental capability for humanoid robots, yet it remains challenging due to the tight coupling between dynamic balance and stable interaction with diverse objects. Existing methods often require time-consuming task-specific policy training or rely on rigid trajectory replay, which limits their ability to accommodate novel interaction scenarios. In this work, we present GenHOI, a simple yet effective framework that enables humanoid robots to perform diverse object-interaction tasks in a zero-shot manner by directly imitating a single generated video, without task-specific training or physical demonstration data. GenHOI first reconstructs the robot-object scene in simulation and renders a first-frame image, which, together with the language command, conditions the synthesis of a task-oriented interaction video. The generated video is then analyzed to identify interaction-relevant contact events and estimate hand-object contact regions, which are encoded as object-centric geometric constraints that convert visual interaction cues into physically grounded optimization priors. Guided by these priors, the reference motion recovered from the video is refined and smoothed to resolve the scale ambiguity inherent in 2D video generation, while adapting a single reference trajectory to unseen robot-object relative poses. The optimized trajectory is finally executed by a closed-loop tracking controller. We validate the proposed framework in extensive simulation and real-world experiments across diverse object-interaction tasks, including box grasping, asymmetric bimanual chair carrying, table lifting from below, and cylindrical-object enveloping.