arXivDaily arXiv Daily Academic Digest (updated Monday through Friday)
2605.04948 2026-06-02 cs.CL

Adapting Large Language Models to a Low-Resource Agglutinative Language: A Comparative Study of LoRA and QLoRA for Bashkir

将大型语言模型适配到低资源黏着语:LoRA与QLoRA在巴什基尔语上的比较研究

Mullosharaf K. Arabov, Svetlana S. Khaybullina

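Affiliations * Institute of Computational Mathematics and Information Technologies, Kazan Federal University

AI summary This paper compares two parameter-efficient fine-tuning methods, LoRA and QLoRA, for adapting models to Bashkir, a low-resource agglutinative language, and finds that QLoRA on 7B-scale models strikes an effective balance between quality and computational cost.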

Comments Accepted to CLIB 2026

AI-generated Chinese abstract

本文对参数高效微调方法(包括LoRA和QLoRA)在将大型语言模型适配到巴什基尔语(突厥语族的一种低资源黏着语)任务上进行了比较研究。实验在包含71k文档(4690万个token)的巴什基尔语文本语料库上进行,使用了多种架构的模型:DistilGPT2、GPT-2(base、medium)、Phi-2、Qwen2.5-7B、DeepSeek-7B和Mistral-7B。为提高结果可靠性,每种配置使用三种不同的随机种子进行训练。测试集上最低困惑度由全微调的GPT-2 medium获得(3.34)。同时,应用于Mistral-7B(3.79)和Phi-2(3.81)的QLoRA在可训练参数减少40倍以上的情况下达到了相当的质量。然而,我们也观察到某些架构使用PEFT时质量显著下降的情况(例如,DeepSeek-7B,秩为8,困惑度=129.55),这表明结果关键取决于基础模型及其分词器的选择。此外,基于巴什基尔语提示的生成文本定性分析显示,具有最佳困惑度的模型不一定产生最连贯的输出:QLoRA微调的模型生成了单语巴什基尔语续写,而具有最低困惑度的全微调模型则频繁切换到英语。结果表明,对于巴什基尔语,7B规模模型上的QLoRA在质量和计算成本之间提供了有效的折中。为确保可重复性,开放数据、代码和训练好的适配器将在论文被接收后发布。

English abstract

This paper presents a comparative study of parameter-efficient fine-tuning (PEFT) methods, including LoRA and QLoRA, applied to the task of adapting large language models to the Bashkir language, a low-resource agglutinative language of the Turkic family. Experimental evaluation is conducted on a Bashkir text corpus of 71k documents (46.9M tokens) using models of various architectures: DistilGPT2, GPT-2 (base, medium), Phi-2, Qwen2.5-7B, DeepSeek-7B, and Mistral-7B. To improve the reliability of results, each configuration was trained with three different random seeds. The lowest perplexity on the test set was obtained for GPT-2 medium with full fine-tuning (3.34). Meanwhile, QLoRA applied to Mistral-7B (3.79) and Phi-2 (3.81) achieved comparable quality with over 40 times fewer trainable parameters. However, we also observed cases of significant quality degradation when using PEFT for certain architectures (e.g., DeepSeek-7B with rank 8, perplexity = 129.55), indicating that the outcome depends critically on the choice of the base model and its tokenizer. Additionally, a qualitative analysis of generated texts based on Bashkir prompts revealed that models with the best perplexity do not necessarily produce the most coherent outputs: QLoRA-tuned models generated monolingual Bashkir continuations, whereas the fully fine-tuned model with the lowest perplexity frequently switched to English. The results suggest that QLoRA on 7B-scale models offers an effective compromise between quality and computational cost for Bashkir. To ensure reproducibility, open data, code, and trained adapters will be released upon acceptance.
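As a rough illustration of the two quantities this abstract compares, the sketch below computes perplexity from per-token negative log-likelihoods and counts the parameters a LoRA adapter trains for a single weight matrix. The function names, the 4096 x 4096 layer shape, and the rank are assumptions chosen for illustration; they are not values taken from the paper.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def lora_trainable_params(d_out, d_in, rank):
    """LoRA trains B (d_out x rank) and A (rank x d_in) instead of the full matrix."""
    return rank * (d_out + d_in)

# Hypothetical 4096 x 4096 projection adapted with rank 8 (assumed shape)
full_params = 4096 * 4096
lora_params = lora_trainable_params(4096, 4096, 8)
reduction = full_params / lora_params  # how many times fewer trainable weights
```

The reduction for a whole model depends on which layers receive adapters, so the per-matrix ratio here is only an upper-level intuition for the "over 40 times fewer trainable parameters" figure quoted above.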

2605.02640 2026-06-02 cs.AI

Trustworthy AI Suffers from Invariance Conflicts and Causality is The Solution

可信人工智能面临不变性冲突,因果性是解决方案

Ruta Binkyte, Ivaxi Sheth, Zhijing Jin, Mohammad Havaei, Bernhard Schölkopf, Mario Fritz

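Affiliations * Max Planck Institute for Informatics

AI summary By reinterpreting trustworthy-AI objectives as incompatible invariance requirements under changes to the data-generating process, this paper argues that causality is a necessary framework for understanding and balancing the trade-offs between performance and multiple trustworthiness objectives.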

Journal ref Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026

AI-generated Chinese abstract

随着人工智能(包括机器学习模型和基础模型)在高风险领域的部署日益增多,确保其可信度已成为一个核心挑战。然而,可信人工智能的核心目标,如公平性、鲁棒性、隐私性和可解释性,很难同时实现,尤其是在保持效用的同时。这篇立场论文认为,因果性对于理解和平衡性能与可信人工智能多个目标之间的权衡是必要的。我们将可信人工智能的权衡重新解释为数据生成过程不同变化下的不相容不变性要求,从而为我们的论点奠定基础。然后,我们通过文献中的案例研究和风格化的合成数据模拟来说明这一论点,表明因果性提供了一个统一的框架,用于理解可信人工智能中的权衡如何产生,以及如何通过选择性不变性来缓解或解决这些权衡。这一视角既适用于经典机器学习模型,也适用于大规模基础模型。最后,我们概述了利用因果性构建既可信又高性能的人工智能所面临的开放挑战和机遇。

English abstract

As artificial intelligence (AI), including machine learning (ML) models and foundation models (FMs), is increasingly deployed in high-stakes domains, ensuring its trustworthiness has become a central challenge. However, the core trustworthy AI objectives, such as fairness, robustness, privacy, and explainability, are hard to achieve simultaneously, especially while preserving utility. This position paper argues that causality is necessary to understand and balance trade-offs between performance and the multiple objectives of trustworthy AI. We ground our arguments in re-interpreting trustworthy AI trade-offs as incompatible invariance requirements under different changes to the data-generating process. We then illustrate this argument through case-study analyses from the literature and a stylized synthetic-data simulation, showing that causality provides a unifying framework for understanding how trade-offs in trustworthy AI arise and how they can be softened or resolved through selective invariance. This perspective applies to both classical ML models and large-scale FMs. Finally, we outline open challenges and opportunities for using causality to build both trustworthy and high-performing AI.

2605.02270 2026-06-02 cs.CL

A Systematic Benchmark of Machine Transliteration Models for the Tajik-Farsi Language Pair: A Comparative Study from Rule-Based to Transformer Architectures

塔吉克-波斯语对机器音译模型的系统基准测试:从基于规则到Transformer架构的比较研究

Mullosharaf K. Arabov

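Affiliations * Institute of Computational Mathematics and Information Technologies, Kazan Federal University

AI summary Using a newly built multi-source parallel corpus, this paper systematically compares six classes of models, from rule-based to Transformer, and finds that byte-level ByT5 performs best on Tajik-Farsi transliteration (chrF++ 87.4/80.1), while subword-tokenized models fail completely.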

Comments Accepted to CLIB 2026

AI-generated Chinese abstract

本文首次对塔吉克语(西里尔字母)和波斯语(阿拉伯字母)之间音译的现代机器学习架构进行了全面比较分析。一个关键贡献是创建并验证了一个独特的平行语料库,该语料库汇集了多个异构来源,包括众包项目、词典对、《列王纪》平行文本、外交文章、《玛斯纳维》文本、官方术语列表和音译对应关系。初始数据集包含328,253个句子对;通过分层随机抽样形成了40,000个句对的代表性子集。实验比较了六类模型:基于规则的基线、带注意力的LSTM、字符级Transformer、G2P Transformer(从头训练)、预训练多语言模型(mBART、带LoRA的mT5)以及字节级ByT5。结果表明ByT5具有压倒性优势(塔吉克语到波斯语chrF++ 87.4,反向80.1)。尽管数据有限,G2P Transformer显著优于mBART(72.3 vs. 62.2 chrF++)。使用子词分词(mT5)的模型完全失败(chrF++低于18.5)。研究结果表明,对于塔吉克-波斯语对的准确音译,在字节或字符级别操作的架构明确优于依赖子词分词的传统多语言Seq2Seq模型。

English abstract

This paper presents the first comprehensive comparative analysis of modern machine learning architectures for transliteration between Tajik (Cyrillic script) and Persian (Arabic script). A key contribution is the creation and validation of a unique parallel corpus aggregated from multiple heterogeneous sources, including crowdsourced projects, lexicographic pairs, parallel texts of "Shahnameh", diplomatic articles, texts of "Masnavi-i Ma'navi", official terminology lists, and transliterated correspondences. The initial dataset comprised 328,253 sentence pairs; a representative subset of 40,000 pairs was formed using stratified random sampling. The experiment compared six classes of models: rule-based baseline, LSTM with attention, character-level Transformer, G2P Transformer (trained from scratch), pre-trained multilingual models (mBART, mT5 with LoRA), and byte-level ByT5. Results demonstrate the overwhelming superiority of ByT5 (chrF++ 87.4 for Tajik to Farsi, 80.1 for reverse). The G2P Transformer significantly outperformed mBART (72.3 vs. 62.2 chrF++) despite limited data. Models using subword tokenization (mT5) failed completely (chrF++ less than 18.5). The findings demonstrate that for accurate transliteration of the Tajik-Farsi pair, architectures operating at the byte or character level are unequivocally more effective than traditional multilingual Seq2Seq models relying on subword tokenization.
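The chrF++ scores quoted above are built on character n-gram overlap. The following is a simplified sketch of the character-level part only, assuming n-gram orders 1 to 6 and beta = 2; it is not the reference implementation, and chrF++ additionally mixes in word unigrams and bigrams, which are omitted here. The function names are hypothetical.

```python
from collections import Counter

def char_ngrams(text, n):
    """Multiset of character n-grams, with spaces removed."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Average n-gram precision and recall over orders 1..max_n, combined as F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

The function returns a value in [0, 1]; published scores such as 87.4 are conventionally reported on a 0 to 100 scale. Because the metric works on characters, it rewards near-miss transliterations that a word-level metric would score as complete failures.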

2605.02122 2026-06-02 cs.LG cs.AI

STABLEVAL: Disagreement-Aware and Stable Evaluation of AI Systems

STABLEVAL: 面向AI系统的分歧感知与稳定评估

Akash Bonagiri, Gerard Janno Anderias, Saee Patil, Angelina Lai, Devang Borkar, Gezheng Kang, Ishant Gandhi, Setareh Rafatirad, Houman Homayoun

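Affiliations * University of California, Berkeley

AI summary To address the unstable rankings that majority voting produces under annotator disagreement, the paper proposes the STABLEVAL framework, which models latent correctness and annotator confusion patterns to deliver stable, uncertainty-aware system evaluation.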

AI-generated Chinese abstract

人类评估仍然是评估现代AI系统的主要标准,然而标注者的分歧、偏见和变异性使得在标准多数投票聚合下系统排名变得脆弱。多数投票忽略了标注者可靠性和项目级别的模糊性,往往在标注者子集之间产生不稳定的比较。我们引入了STABLEVAL,一个分歧感知的评估框架,该框架对潜在项目正确性和标注者特定的混淆模式进行建模,以产生后验期望项目得分和校准的智能体级别分数。与Dawid-Skene等标签去噪方法不同,STABLEVAL明确设计用于稳定和不确定性感知的系统评估,而不是硬标签恢复。我们将排名稳定性形式化为首要评估目标,并分析聚合方法如何保留或扭曲底层标注者行为。在受控的合成实验和多个真实世界人工标注基准上,多数投票在标注者异质性和对抗性噪声下表现出增加的得分误差和排名不稳定性,而STABLEVAL产生了更稳定和统计上更合理的系统排名。这些结果表明,对分歧进行建模对于稳健和可复现的AI评估至关重要。

English abstract

Human evaluation remains the primary standard for assessing modern AI systems, yet annotator disagreement, bias, and variability make system rankings fragile under standard majority vote aggregation. Majority vote discards annotator reliability and item-level ambiguity, often yielding unstable comparisons across annotator subsets. We introduce STABLEVAL, a disagreement-aware evaluation framework that models latent item correctness and annotator-specific confusion patterns to produce posterior expected item credit and calibrated agent-level scores. Unlike label-denoising approaches such as Dawid-Skene, STABLEVAL is explicitly designed for stable and uncertainty-aware system evaluation rather than hard label recovery. We formalize ranking stability as a first-class evaluation objective and analyze how aggregation methods preserve or distort underlying annotator behavior. Across controlled synthetic experiments and multiple real-world human-annotated benchmarks, majority vote exhibits increasing score error and ranking instability under annotator heterogeneity and adversarial noise, while STABLEVAL yields more stable and statistically grounded system rankings. These results demonstrate that modeling disagreement is essential for robust and reproducible AI evaluation.
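To make the contrast with majority vote concrete, here is a minimal sketch under strong simplifying assumptions: binary correctness labels, a single known accuracy per annotator, and a uniform prior. This is not the STABLEVAL model, which by the abstract's description models annotator-specific confusion patterns; the function names and numbers are hypothetical.

```python
def majority_vote(votes):
    """Return 1 if strictly more than half of the binary votes are 1, else 0."""
    return 1 if 2 * sum(votes) > len(votes) else 0

def posterior_correct(votes, accuracies, prior=0.5):
    """Posterior probability that the item is correct, given per-annotator accuracies."""
    like_correct, like_wrong = prior, 1.0 - prior
    for vote, acc in zip(votes, accuracies):
        # An annotator with accuracy `acc` reports the true label with probability `acc`
        like_correct *= acc if vote == 1 else (1.0 - acc)
        like_wrong *= (1.0 - acc) if vote == 1 else acc
    return like_correct / (like_correct + like_wrong)

votes = [1, 0, 0]                 # one reliable annotator disagrees with two noisy ones
accuracies = [0.95, 0.55, 0.55]   # assumed known here; in practice they must be inferred
hard_label = majority_vote(votes)
soft_credit = posterior_correct(votes, accuracies)
```

In this toy case majority vote assigns the item a hard 0, while the reliability-weighted posterior gives it a credit above 0.9, which is the kind of divergence that can flip a system ranking.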

2606.01964 2026-06-02 cs.CL

Eyettention II: A Dual-Sequence Architecture for Modeling Fixation Location, Within-Word Landing Position, and Fixation Duration in Reading

Eyettention II: 一种用于建模阅读中注视位置、词内着陆位置和注视持续时间的双序列架构

Shuwen Deng, Cui Ding, David R. Reich, Paul Prasse, Lena A. Jäger

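Affiliations * Department of Computer Science, University of Potsdam; Department of Computational Linguistics, University of Zurich; Department of Informatics, University of Zurich

AI summary Proposes Eyettention II, a lightweight, end-to-end trained deep-learning model whose dual-sequence architecture generates complete scanpaths comprising fixation location, within-word landing position, and fixation duration; it outperforms existing models in prediction and captures key psycholinguistic phenomena.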

AI-generated Chinese abstract

阅读时眼睛的运动方式为理解读者的认知过程和文本属性提供了宝贵信息。特别是,阅读过程中的眼动追踪数据在多种技术应用中显示出高度价值,例如增强和解释语言模型以及推断读者特征。然而,这些应用通常依赖于大规模数据驱动模型,需要大量的眼动追踪数据集,而由于数据收集的资源密集型特性,这些数据集难以获取。为了解决数据稀缺的挑战,我们开发了Eyettention II,一个端到端训练的深度学习模型,能够生成由按时间顺序排列的完整注视属性组成的真实扫描路径,包括注视位置、词内着陆位置和注视持续时间。我们的模型轻量级,可在有限的GPU资源上高效训练,并与认知理论紧密对齐。我们证明,Eyettention II在扫描路径预测方面超越了最先进的模型,并通过捕捉关键心理语言学现象模拟了类似人类的注视行为。凭借其稳健的性能,Eyettention II有潜力推动自然语言处理的发展,促进心理语言学实验材料的预测试,并揭示超出理论认知模型明确编码的新见解。

English abstract

The way our eyes move while reading provides valuable insights into both the reader's cognitive processes and the properties of the text. In particular, eye-tracking-while-reading data has been shown to be highly beneficial in various technological applications, such as enhancing and interpreting language models and inferring a reader's characteristics. However, these applications often rely on large-scale, data-driven models, which demand extensive eye-tracking datasets that are challenging to obtain due to the resource-intensive nature of data collection. To address the challenge of data scarcity, we develop Eyettention II, an end-to-end trained deep-learning model capable of generating realistic scanpaths consisting of a complete set of fixation attributes in chronological order, including fixation location, within-word landing position, and fixation duration. Our model is lightweight, efficiently trainable on limited GPU resources, and closely aligned with cognitive theories. We demonstrate that Eyettention II surpasses state-of-the-art models in scanpath prediction and mirrors human-like gaze behavior by capturing key psycholinguistic phenomena. With its robust performance, Eyettention II holds the potential to drive advancements in natural language processing, facilitate piloting the materials of psycholinguistic experiments, and uncover new insights beyond what is explicitly encoded in theoretical cognitive models.

2606.01955 2026-06-02 cs.RO cs.CV

WALL-WM: Carving World Action Modeling at the Event Joints

WALL-WM:在事件关节处雕刻世界动作建模

Shalfun Li, Victor Yao, Charles Yang, Truth Qu, Regis Cheng, Ryan Yu, Howard Lu, Newton Von, Vincent Chen, Yohann Tang, Maeve Zhang, Ellie Ma, Gody Li, Sage Yang, Lorien Shu, J. W. Gao, Ethan Chen, Colin Ye, Yu Sun, Elise Mon, PS Zhang, Neo Li, Lily Li, James Wang, Ping Yang, Chris Pan, Lucy Liang, Hang Su, Roy Gan, Hao Wang, Qian Wang

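Affiliations * X Square Robot Team

AI summary Proposes WALL-WM, a World Action Model that uses event-level vision-language-action pretraining to resolve the granularity mismatch that fixed-length action chunks create among language, vision, and action; it generalizes across language, scenes, and tasks and reaches state-of-the-art performance in large-scale real-world evaluation.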

AI-generated Chinese abstract

WALL-WM是一种世界动作模型,它将视频-动作学习从以块为中心的优化转变为以事件为基础的视觉-语言-动作预训练,使用语义连贯的动作事件作为学习的基本单元。现有的WAM通常从多模态或视频基础模型初始化,然后直接基于当前观测和指令优化固定长度的动作块。尽管方便,但这种以块为中心的公式造成了基本的粒度不匹配。语言描述语义目标和事件,视觉通过连续场景动态演变,动作在控制级时间尺度上运行;将三者强制纳入相同的固定长度预测窗口,使得VLA训练变成短视的相关性拟合。WALL-WM通过围绕语义事件组织监督和数据来解决这种不匹配。具体来说,它将基于事件的VLA预训练与由事件级标题和聚类平衡采样构建的数据生态系统配对,从而实现对多样化行为、场景和任务结构的可扩展学习。从相同的事件预训练骨干出发,WALL-WM支持两种互补的推理模式。事件模式消耗下一事件描述并实现可变长度的执行块,而统一模式使用带有阶梯式解码的VLM来调节传统的固定长度块推理,同时保留梯度连续的VLA路径。结合基于Muon优化器的大规模预训练基础设施,WALL-WM为通用WAM提供了实用的规模化方案。实验表明,WALL-WM在语言、场景和任务上广泛泛化,在大规模真实世界泛化评估中达到了最先进的性能。

English abstract

WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch. Language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales; forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks, while the unified mode uses a VLM with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with Muon-optimizer-based large-scale pretraining infrastructure, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.

2606.01954 2026-06-02 cs.LG stat.ML

Flow-Transformed Implicit Processes for Function-Space Variational Inference

流变换隐式过程用于函数空间变分推断

Luis A. Ortega, Andrés R. Masegosa, Thomas D. Nielsen

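Affiliations * Aalborg University

AI summary Proposes Flow-Transformed Implicit Processes (FTIP), which enrich the variational distribution over combination weights with a normalizing flow to capture asymmetric, heavy-tailed, and multimodal posterior structure in function space, optimized with a Black-Box α objective.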

Comments 24 pages, 4 figures, 10 tables. Pre-print submitted for revision

AI-generated Chinese abstract

隐式过程先验通过灵活的生成机制定义函数上的分布,使其对贝叶斯函数空间建模具有吸引力。然而,使用此类先验进行后验推断具有挑战性,因为其诱导的函数空间分布通常没有闭式解。一种实用策略是使用有限个采样函数的集合来近似先验,然后将后验函数表示为这些样本的学习组合。现有方法通常对组合权重施加高斯变分分布。虽然易于处理,但这种选择限制了可表示的后验不确定性形状,特别是当真实后验是非对称、重尾或多模态时。我们提出流变换隐式过程(FTIP),一种变分推断方法,使这种有限维函数空间近似更具表达力。FTIP不使用高斯分布,而是使用归一化流来定义更丰富的变分分布,从而在保持可处理优化的同时诱导灵活的后验函数分布。我们使用黑盒α目标训练模型,从而能够比较质量覆盖和模式寻找的变分行为。实验表明,FTIP捕获了函数空间中的非对称和多模态后验结构,而高斯系数近似往往会平滑或崩溃这些结构。

English abstract

Implicit-process priors define distributions over functions through flexible generative mechanisms, making them attractive for Bayesian function-space modelling. However, performing posterior inference with such priors is challenging because their induced function-space distributions are typically not available in closed form. One practical strategy is to approximate the prior using a finite collection of sampled functions, and then represent posterior functions as learned combinations of these samples. Existing approaches commonly place a Gaussian variational distribution over the combination weights. While tractable, this choice limits the shapes of posterior uncertainty that can be represented, especially when the true posterior is asymmetric, heavy-tailed, or multimodal. We propose Flow-Transformed Implicit Processes (FTIP), a variational inference method that makes this finite-dimensional function-space approximation more expressive. Instead of using a Gaussian distribution over the combination weights, FTIP uses a normalizing flow to define a richer variational distribution. This induces a flexible posterior distribution over functions while preserving tractable optimization. We train the model using a Black-Box α objective, allowing us to compare mass-covering and mode-seeking variational behaviour. Experiments show that FTIP captures asymmetric and multimodal posterior structure in function space that Gaussian coefficient approximations tend to smooth or collapse.
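The construction described above can be written out in a few lines. The notation below is assumed for illustration ($f_s$ for the sampled prior functions, $w$ for the combination weights, $T_\phi$ for the flow) and the linear form of the combination is itself an assumption, since the abstract only says "learned combinations"; the density formula is the standard change-of-variables identity for normalizing flows.

```latex
\begin{align}
  % Posterior functions as combinations of S sampled prior functions (assumed linear)
  f(x) &= \sum_{s=1}^{S} w_s \, f_s(x) \\
  % Gaussian baseline over the combination weights
  q_{\mathrm{G}}(w) &= \mathcal{N}(w \mid m, \Sigma) \\
  % Flow-based alternative: push a base Gaussian through an invertible map
  w &= T_\phi(z), \qquad z \sim \mathcal{N}(0, I) \\
  \log q_\phi(w) &= \log \mathcal{N}(z \mid 0, I)
      - \log \left| \det \frac{\partial T_\phi(z)}{\partial z} \right|
\end{align}
```

Because $T_\phi$ is invertible with a tractable Jacobian, $q_\phi$ can be sampled and evaluated exactly, yet it is no longer restricted to the symmetric, unimodal shape of $q_{\mathrm{G}}$.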

2606.01952 2026-06-02 cs.LG

Randomized Least Squares Value Iteration itself is Joint Differentially Private

随机最小二乘值迭代本身是联合差分隐私的

Haiyang Lu, Pratik Gajane, Shaojie Bai, Mohammad Sadegh Talebi

Affiliations * Laboratoire d’Informatique Fondamentale d’Orléans (LIFO), Université d’Orléans; College of Control Science and Engineering, Zhejiang University; Department of Computer Science, University of Copenhagen

AI summary Studies privacy protection for the randomized-exploration algorithm RLSVI in tabular MDPs and proves that its intrinsic exploration noise simultaneously provides a joint differential privacy guarantee.

Comments 12 pages, 0 figures

AI-generated Chinese abstract

随着强化学习越来越多地应用于医疗和推荐系统等敏感领域,隐私保护技术对于保护用户的敏感信息变得至关重要。我们研究在情节设置下的隐私保护强化学习,重点关注基于随机探索的算法,如随机最小二乘值迭代(RLSVI)。总体目标是研究随机探索如何与隐私机制所需的注入噪声相互作用。在这项工作中,我们展示了一种新的隐私分析,该分析描述了RLSVI中为探索设置的噪声如何同时提供隐私保护。具体来说,我们证明RLSVI在表格MDP中是$(\varepsilon(\delta),\delta)$-联合差分隐私的,其中$\varepsilon(\delta) = \frac{2AK}{H^2\log(2HSA)} + 2\sqrt{\frac{2AK\log(1/\delta)}{H^2\log(2HSA)}}$,$S$和$A$分别是状态和动作的数量,$H$是情节的长度,$K$是情节的数量。

English abstract

As reinforcement learning (RL) is increasingly applied to sensitive domains, such as health care and recommendation systems, privacy-preserving techniques have become essential to protect users' sensitive information. We investigate privacy-preserving RL under an episodic setting, focusing on algorithms based on randomized exploration, such as Randomized Least Squares Value Iteration (RLSVI). The overall goal is to study how randomized exploration interacts with the injected noise required by privacy mechanisms. In this work, we present a new privacy analysis that characterizes how the noise set for exploration in RLSVI simultaneously provides privacy protection. Specifically, we prove that RLSVI, as is, is $(\varepsilon(\delta),\delta)$-joint differentially private in tabular MDPs with $\varepsilon(\delta) = \frac{2AK}{H^2\log(2HSA)} + 2\sqrt{\frac{2AK\log(1/\delta)}{H^2\log(2HSA)}}$, where $S$ and $A$ are the numbers of states and actions respectively, $H$ is the length of an episode, and $K$ is the number of episodes.
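For a feel of how the bound scales, the stated expression for $\varepsilon(\delta)$ can be evaluated directly. The sketch below assumes the logarithms are natural logarithms, which the abstract does not specify; the function name and the example problem sizes are hypothetical.

```python
import math

def rlsvi_jdp_epsilon(S, A, H, K, delta):
    """Evaluate eps(delta) = c + 2*sqrt(c*log(1/delta)), with c = 2AK / (H^2 log(2HSA))."""
    c = 2 * A * K / (H**2 * math.log(2 * H * S * A))
    return c + 2 * math.sqrt(c * math.log(1 / delta))

# Hypothetical problem sizes, chosen only to show the trend
eps_short = rlsvi_jdp_epsilon(S=10, A=5, H=20, K=1_000, delta=1e-5)
eps_long = rlsvi_jdp_epsilon(S=10, A=5, H=20, K=10_000, delta=1e-5)
```

Under this formula the privacy parameter grows with the number of episodes $K$ and shrinks as the horizon $H$ grows, so `eps_long` exceeds `eps_short`.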

2606.01951 2026-06-02 cs.RO

Co-training with Ego-centric Video and Demonstration for Robot Navigation Task

基于自我中心视频与示范的机器人导航任务协同训练

Shoya Kuno, Yumo Ouchi, Kanata Suzuki

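Affiliations * Department of Informatics, Graduate School of Informatics, Kyoto University; Spatial Robotics Research Center, Fujitsu Limited

AI summary Proposes a framework that converts egocentric walking videos into imitation-learning datasets for mobile robots; jointly training a VLA model on human-derived and robot-collected data improves language understanding and action generation.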

AI-generated Chinese abstract

视觉-语言-动作(VLA)模型在多种机器人任务中展现出潜力,但其性能严重依赖于大规模高质量训练数据,而在真实机器人上收集这些数据成本高昂且耗时。虽然先前的工作已经探索了利用自我中心人类视频来增强操作数据集,但由于运动过程中的视角变化,将此类方法应用于移动机器人导航仍然具有挑战性。在本文中,我们提出了一个框架,将自我中心行走视频转化为移动机器人模仿学习的数据集。该方法从人类视频中估计相机运动,并将其转换为与地面移动机器人兼容的动作表示。通过联合训练基于人类数据和机器人收集数据的VLA模型,该模型在语言理解和鲁棒动作生成方面比单独使用任一数据源训练取得了更好的性能。在水果搜索导航任务上的实验表明,人类自我中心视频为移动机器人学习提供了有效且可扩展的数据源。

English abstract

Vision-language-action (VLA) models are promising for diverse robotic tasks, but their performance heavily depends on large-scale high-quality training data, whose collection on real robots is costly and time-consuming. While prior work has explored augmenting manipulation datasets with egocentric human videos, applying such approaches to mobile robot navigation remains challenging due to viewpoint changes during locomotion. In this paper, we propose a framework that converts egocentric walking videos into datasets for mobile robot imitation learning. The proposed method estimates camera motion from human videos and transforms it into action representations compatible with ground mobile robots. By jointly training a VLA model on human-derived and robot-collected datasets, the model achieves improved language understanding and more robust action generation than training with either data source alone. Experiments on a fruit-search navigation task demonstrate that human egocentric videos provide an effective and scalable data source for mobile robot learning.
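The abstract does not say which action representation the method targets. As one plausible reading, the sketch below converts a sequence of estimated planar camera poses into differential-drive style commands (forward velocity and yaw rate). The pose format, the function name, and the choice to discard lateral motion are all assumptions for illustration.

```python
import math

def poses_to_actions(poses, dt):
    """Convert planar poses (x, y, yaw) sampled every dt seconds into (v, omega) pairs."""
    actions = []
    for (x0, y0, yaw0), (x1, y1, yaw1) in zip(poses, poses[1:]):
        dx, dy = x1 - x0, y1 - y0
        # Project the displacement onto the current heading; lateral motion is dropped
        forward = dx * math.cos(yaw0) + dy * math.sin(yaw0)
        # Wrap the heading change into [-pi, pi)
        dyaw = (yaw1 - yaw0 + math.pi) % (2 * math.pi) - math.pi
        actions.append((forward / dt, dyaw / dt))
    return actions

# Walk 0.5 m straight ahead, then turn 90 degrees in place (hypothetical trajectory)
trajectory = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.5, 0.0, math.pi / 2)]
commands = poses_to_actions(trajectory, dt=0.5)
```

A human walker can sidestep while a typical ground robot cannot, which is why the projection onto the heading is one simple way to make the two embodiments compatible.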

2606.01950 2026-06-02 cs.RO cs.CV cs.LG

Learning Action-Conditional and Object-Centric Gaussian Splatting World Models for Rigid Objects

面向刚性物体的学习动作条件与对象中心高斯溅射世界模型

Jens U. Kreber, Lukas Mack, Joerg Stueckler

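Affiliations * Intelligent Perception in Technical Systems Group

AI summary Proposes MRO-GWM, which uses object-centric Gaussian representations and a spatio-temporal transformer architecture to learn action-conditional dynamics of rigid objects in 3D, supporting future-motion prediction in multi-object scenes and under partial observations.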

AI-generated Chinese abstract

世界模型使智能体能够预测其动作对环境的影响。在本文中,我们提出了多刚性物体高斯世界模型(MRO-GWM),一种学习刚性物体在3D中动作条件动力学的新模型。通过用对象中心高斯表示场景,我们可以表示任意物体形状和多物体场景。我们开发了一种新颖的时空变换器架构,该架构根据物体高斯的历史和未来动作预测未来的刚体运动。物体通过其在规范坐标系中的高斯表示,从而可以将物体运动描述为刚体变换。我们的模型在多视角重建上进行训练,这要求模型处理因遮挡导致的物体部分观测。我们分析了该方法在由典型家庭物体组成的合成数据集上的预测性能,这些数据集包含多物体动力学和机器人末端执行器的交互。我们还在模拟中评估了模型在非抓取操作中的模型预测控制性能。

English abstract

World models enable intelligent agents to predict the consequences of their actions on the environment. In this paper, we propose Multi Rigid Object Gaussian World Model (MRO-GWM), a novel model that learns action-conditional dynamics of rigid objects in 3D. By representing the scene by object-centric Gaussians, we can represent arbitrary object shapes and multi-object scenes. We develop a novel spatio-temporal transformer architecture that predicts future rigid body motion from a history of object Gaussians and future actions. Objects are represented by their Gaussians in a canonical frame, which allows for describing object motion as rigid body transformation. Our model is trained on reconstructions from multiple viewpoints, which requires the model to handle partial observations of objects due to occlusions. We analyze prediction performance of our approach on synthetic datasets composed of typical household objects with multi-object dynamics and interactions by a robot end effector. We also evaluate our model in model-predictive control for non-prehensile manipulation in simulation.
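Describing object motion as a rigid body transformation of canonical-frame Gaussians comes down to a standard identity: under $x \mapsto Rx + t$, a Gaussian's mean becomes $R\mu + t$ and its covariance becomes $R \Sigma R^\top$. The sketch below applies that identity in plain Python; the helper names are hypothetical and this is not the paper's code.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def rotation_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def transform_gaussian(mean, cov, R, t):
    """Apply x -> R x + t: the mean moves rigidly, the covariance only rotates."""
    new_mean = [sum(R[i][k] * mean[k] for k in range(3)) + t[i] for i in range(3)]
    new_cov = matmul(matmul(R, cov), transpose(R))
    return new_mean, new_cov

# A Gaussian elongated along x, rotated 90 degrees about z and lifted by 1 along z
mean, cov = [1.0, 0.0, 0.0], [[4.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
new_mean, new_cov = transform_gaussian(mean, cov, rotation_z(math.pi / 2), [0.0, 0.0, 1.0])
```

After the transform the Gaussian sits near (0, 1, 1) and is elongated along y, so a single rotation and translation per object moves every Gaussian of that object consistently.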

2606.01947 2026-06-02 cs.CV cs.AI

Parameter-Efficient Fine-Tuning of Large Pretrained Models for Instance Segmentation Tasks

大型预训练模型在实例分割任务中的参数高效微调

Nermeen Abou Baker, David Rohrschneider, Uwe Handmann

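Affiliations * University of Freiburg

AI summary For instance segmentation, this study explores two parameter-efficient fine-tuning methods, adapters and Low-Rank Adaptation (LoRA), achieving competitive performance while fine-tuning only about 1-6% of parameters, and finds that 2-3 adapters per Transformer block give the best balance of performance and efficiency.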

Comments Published by the Machine Learning and Knowledge Extraction Journal

Journal ref Abou Baker N, Rohrschneider D, Handmann U. Parameter-Efficient Fine-Tuning of Large Pretrained Models for Instance Segmentation Tasks. Machine Learning and Knowledge Extraction. 2024; 6(4):2783-2807

AI-generated Chinese abstract

近年来,随着大型预训练模型的兴起,人工智能的研究和应用发生了转变,这些模型在众多任务中取得了最先进的结果。然而,参数的大量增加引入了对参数高效训练策略的需求。尽管取得了显著进展,但针对基于Transformer的模型在实例分割任务中的参数高效微调(PEFT)方法的研究仍然有限。为填补这一空白,本研究调查了PEFT方法的有效性,特别是适配器和低秩适应(LoRA),并将其应用于两个模型和四个基准数据集。通过集成顺序排列的适配器模块并将LoRA应用于可变形注意力(本文首次探索),在仅微调约1-6%模型参数的情况下取得了竞争性能,相比传统微调所需的40-55%有显著改进。关键发现表明,每个Transformer块使用2-3个适配器可实现性能与效率的最佳平衡。此外,LoRA在应用于可变形注意力时表现出强大的参数效率,并在某些情况下超越了适配器配置。这些结果表明,PEFT技术的影响因数据集复杂性和模型架构而异,强调了上下文特定调优的重要性。总体而言,这项工作展示了PEFT在实例分割任务中实现可扩展、可定制且计算高效的迁移学习的潜力。

English abstract

Research and applications in artificial intelligence have recently shifted with the rise of large pretrained models, which deliver state-of-the-art results across numerous tasks. However, the substantial increase in parameters introduces a need for parameter-efficient training strategies. Despite significant advancements, limited research has explored parameter-efficient fine-tuning (PEFT) methods in the context of transformer-based models for instance segmentation. Addressing this gap, this study investigates the effectiveness of PEFT methods, specifically adapters and Low-Rank Adaptation (LoRA), applied to two models across four benchmark datasets. Integrating sequentially arranged adapter modules and applying LoRA to deformable attention (explored here for the first time) achieves competitive performance while fine-tuning only about 1-6% of model parameters, a marked improvement over the 40-55% required in traditional fine-tuning. Key findings indicate that using 2-3 adapters per transformer block offers an optimal balance of performance and efficiency. Furthermore, LoRA exhibits strong parameter efficiency when applied to deformable attention, and in certain cases surpasses adapter configurations. These results show that the impact of PEFT techniques varies based on dataset complexity and model architecture, underscoring the importance of context-specific tuning. Overall, this work demonstrates the potential of PEFT to enable scalable, customizable, and computationally efficient transfer learning for instance segmentation tasks.

2606.01945 2026-06-02 cs.CV

Beyond Low-Rank: Low-Rank Sparse Prompting via Spiking Neural Network and Prompt Factorization

超越低秩:通过脉冲神经网络和提示分解实现低秩稀疏提示

Yumiao Zhao, Bo Jiang, Beibei Wang, Xixi Wan, Xiao Wang, Jin Tang

Affiliations * Information Materials and Intelligent Sensing Laboratory of Anhui Province; Anhui Provincial Key Laboratory of Multimodal Cognitive Computation; School of Computer Science and Technology, Anhui University

AI summary Proposes the LoRSP framework, which combines the sparse firing mechanism of spiking neurons with low-rank factorization to generate instance-specific sparse visual prompts for efficient and robust visual prompt learning.

AI-generated Chinese abstract

视觉提示(VP)已成为一种高效范式,通过在输入层引入可学习提示来适应大规模预训练视觉模型到下游任务。然而,现有的VP方法通常采用密集的像素级提示,往往存在冗余扰动、泛化能力有限和能效低的问题。为克服这些限制,我们提出将脑启发脉冲学习融入视觉提示学习任务。我们知道,脉冲神经元可以通过将输入数据转换为离散脉冲序列并返回稀疏输出来进行低成本信息处理。受此启发,我们提出低秩视觉脉冲提示(LoRSP),一种新颖框架,通过脉冲神经元学习机制自然地学习动态低秩稀疏视觉提示。LoRSP的核心思想是利用脉冲神经元的脑启发稀疏发放机制为每个实例生成像素级稀疏提示。具体而言,我们首先通过低秩分解构建一系列提示因子以捕获不同的提示子空间。然后将这些提示因子输入SNN架构,执行整合-发放过程以发射脉冲。因此,我们的LoRSP在保持低秩约束的同时生成稀疏视觉提示。这种设计实现了实例特定的选择性提示,从而在多样化的下游任务中实现更紧凑和鲁棒的适应。在五个异构视觉骨干网络和多个基准上的大量实验表明,与现有VP方法相比,LoRSP在需要更少可调参数的情况下实现了具有竞争力的性能。

English abstract

Visual Prompting (VP) has emerged as an efficient paradigm for adapting large-scale pre-trained vision models to downstream tasks by incorporating learnable prompts at the input level. However, existing VP methods typically employ dense pixel-level prompts, which often suffer from redundant perturbations, limited generalization, and energy inefficiency. To overcome these limitations, we propose integrating brain-inspired spiking learning into visual prompt learning. Spiking neurons can process information inexpensively by converting input data into discrete spike trains and returning sparse outputs. Inspired by this, we propose Low-Rank visual Spike Prompting (LoRSP), a novel framework that learns dynamic low-rank sparse visual prompts naturally via a spiking neuron learning mechanism. The core idea of LoRSP is to exploit the brain-inspired sparse firing mechanism of spiking neurons to generate a pixel-level sparse prompt for each instance. Specifically, we first construct a series of prompt factors via low-rank factorization to capture distinct prompt subspaces. These prompt factors are then fed into an SNN architecture, which performs the integrate-and-fire process to emit spikes. As a result, LoRSP generates a sparse visual prompt while maintaining the low-rank constraint. This design enables instance-specific selective prompting, leading to more compact and robust adaptation across diverse downstream tasks. Extensive experiments on five heterogeneous vision backbones and multiple benchmarks demonstrate that LoRSP achieves competitive performance while requiring fewer tunable parameters than existing VP methods.
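The integrate-and-fire process mentioned above is what makes the output sparse. A minimal sketch of a single non-leaky integrate-and-fire neuron with a hard reset follows; the threshold, the reset rule, and the input values are assumptions for illustration, and the paper's SNN architecture may differ.

```python
def integrate_and_fire(inputs, threshold=1.0):
    """Accumulate inputs into a membrane potential; emit 1 and reset when it reaches the threshold."""
    potential, spikes = 0.0, []
    for current in inputs:
        potential += current
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0  # hard reset after firing
        else:
            spikes.append(0)
    return spikes

# Hypothetical input currents over six time steps
spike_train = integrate_and_fire([0.5, 0.25, 0.5, 0.25, 1.0, 0.25])
```

Here only two of the six time steps produce a spike, so a prompt gated by such outputs is mostly zeros, in contrast to a dense pixel-level prompt.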

2606.01940 2026-06-02 cs.CV

SCAPO: Self-Supervised Category-Level Articulated Pose Estimation from a Single 3D Observation

SCAPO: 从单次3D观测中自监督学习类别级关节物体姿态估计

Can Zhang, Gim Hee Lee

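Affiliations * Department of Computer Science, National University of Singapore

AI summary Proposes SCAPO, a self-supervised framework that estimates the canonical geometry, rigid part segmentation, and joint parameters of articulated objects from a single RGB-D image, without ground-truth labels or category-specific models.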

AI-generated Chinese abstract

现有的从单次3D观测中估计类别级物体关节的方法通常依赖密集监督、多帧输入或CAD模板,并且仍然难以从关节中解耦几何或恢复显式关节参数。我们提出SCAPO,一个自监督框架,从单张RGB-D观测中估计规范几何、刚性部件分割以及关节枢轴、轴和关节状态,无需真实标签或类别特定模型。我们的SCAPO首先使用SE(3)-等变向量神经元自编码器来分解全局姿态并将不同实例对齐到共享规范空间。在此对齐形状上,设计了一个关节感知的混合蒙皮模块来建模部件运动。我们通过观测形状和规范形状之间的循环重建以及可学习规范模板的跨空间对齐来学习这种表示,该模板将共享类别几何与实例特定残差形状解耦。在合成和真实关节物体数据集上的实验表明,我们的SCAPO恢复了一致的部件结构和准确的关节参数,并优于所有自监督基线。

English abstract

Existing methods for category-level object articulation from a single 3D observation often rely on dense supervision, multi-frame inputs, or CAD templates, and still struggle to disentangle geometry from articulation or to recover explicit joint parameters. We propose SCAPO, a self-supervised framework that estimates canonical geometry, rigid part segmentation, and joint pivots, axes, and articulation states from a single RGB-D observation without ground-truth labels or category-specific models. Our SCAPO first uses an SE(3)-equivariant vector-neuron autoencoder to factor out global pose and align diverse instances into a shared canonical space. On this aligned shape, a joint-aware blend-skinning module is then designed to model part motion. We learn this representation through cycle reconstruction between observed and canonical shapes and cross-space alignment with a learnable canonical template that decouples shared category geometry from instance-specific residual shape. Experiments on synthetic and real articulated-object datasets show that our SCAPO recovers consistent part structure and accurate articulation parameters and outperforms all self-supervised baselines.

2606.01939 2026-06-02 cs.CV

SAVMap: Structure-Aided Visual Mapping of Large-Scale 2.5D Manhattan Wireframes from Panoramic Video

SAVMap: 基于结构辅助的全景视频大规模2.5D曼哈顿线框视觉映射

Howard Huang, Bharath Surianarayanan, Keifer Lee, Chenyu Wang, Chen Feng

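Affiliations * Nokia Bell Labs; NYU

AI summary Proposes SAVMap, which uses panoramic video and a semantic segmentation network together with Manhattan-grid geometric constraints to generate semantic wireframe maps of warehouse scenes, achieving accurate large-scale 3D reconstruction.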

Comments IEEE ICRA 2026

AI-generated Chinese abstract

工业环境的精确3D表示能够支持机器人定位和数字孪生生成等任务。我们提出SAVMap,一种仅使用全景视频相机作为传感器输入,生成仓库货架和灯光结构语义线框地图的方法。从沿仓库通道拍摄的全景视频中提取一系列带有货架和天花板视角的校正图像。通过语义分割网络前端,从每张图像中提取一组稀疏的语义结构特征点(例如货架结构的角点、灯光的中心),并在序列中跟踪这些点。通过考虑点之间的真实世界几何关系(如曼哈顿网格),一种受约束的运动恢复结构算法生成构成线框地图的3D点。我们在一个拥有46排货架的仓库中展示了我们方法的可扩展性和准确性,每排货架的面尺寸为55米×7米。从一小时的视频内容中,我们为超过5000个货架元素创建了线框地图,与真实值相比,总体平均绝对误差为4.8厘米。

English abstract

Precise 3D representations of industrial environments enable tasks such as robot localization and digital twin generation. We propose SAVMap, a method for generating a semantic wireframe map of warehouse shelf and light structures using only a panoramic video camera as the sensor input. Sequences of rectified images with shelf and ceiling-facing views are extracted from a panoramic video captured along the warehouse aisles. Using a semantic segmentation network front end, a set of sparse, semantic structure feature points (e.g., corners of shelf structures, centers of lights) are extracted from each image and tracked across the sequences. By accounting for real-world geometric relationships among the points such as Manhattan grids, a constrained structure-from-motion algorithm yields the 3D points that form a wireframe map. We demonstrate the scalability and accuracy of our proposal in a warehouse with 46 shelving rows, each with faces spanning 55 m by 7 m. From an hour of panoramic video content, we create wireframe maps for over 5000 shelf elements across the rows, achieving an aggregate mean absolute error of 4.8 cm with respect to ground truth.

2606.01936 2026-06-02 cs.CL

What to Format and How: A Benchmark and Workflow Approach for Document Formatting

Shihao Rao, Liang Li, Jiapeng Liu, Tong Lin, Bing Li, Xiyan Gao, Peng Fu, Jing Huang, Can Ma

Affiliations Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences

AI summary For content-aware document formatting, proposes the benchmark DocFormBench and the workflow method DocFormFlow, which decouples target localization from modification execution, improving accuracy while reducing token consumption.

Abstract

Recent advances in large language models (LLMs) have opened up new possibilities for automated document formatting. However, real-world formatting often requires identifying targets based on document content. This content-aware setting remains challenging and underexplored, primarily due to the lack of dedicated evaluation datasets. To enable evaluation in realistic content-aware scenarios, we introduce DocFormBench, a benchmark that extends Text-to-Format evaluation to diverse formatting requirements, along with metrics for both accuracy and efficiency. To mitigate redundant document reading in existing methods during formatting, we propose DocFormFlow, a workflow formatting method that decouples target localization ("what to format") from modification execution ("how to format"). Extensive experiments across multiple LLMs and multimodal models show that DocFormFlow consistently improves formatting accuracy while reducing token consumption compared to representative baselines. Further analysis reveals that precise target localization is the primary factor influencing formatting performance. We hope DocFormBench and DocFormFlow will facilitate future research toward more intelligent and reliable document formatting.

2606.01934 2026-06-02 cs.LG cs.CL

HMPO: Hybrid Median-length Policy Optimization for Chain-of-Thought Compression

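Minghui Zheng, Hongxu Chen, Huimin Ren, Hongsheng Xin, Xiaoyang Qu, Ze Wang, Shuling Yang, Ziyu Peng, Kaike Zhang, Pan Zhou, Kun Zhan

Affiliations Li Auto Inc.

AI summary Proposes HMPO, a single-stage reinforcement learning framework that uses an adaptive median-based budget, a cosine-decay token reward, and a multiplicative reward formulation; trained on mathematical data, it achieves 19% to 46% token compression with minimal accuracy loss and generalizes to a range of tasks.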

Abstract

Large language models achieve remarkable performance via extended chain-of-thought (CoT) reasoning, yet this lengthy process incurs substantial inference overhead. Existing CoT compression methods struggle with inflexible manual length budgets, computationally expensive multi-stage training pipelines, and fragile scalability restricted to small models. We propose HMPO (Hybrid Median-length Policy Optimization), a cost-effective, single-stage reinforcement learning framework. HMPO efficiently compresses CoT via three synergistic components: an adaptive median-based budget derived from successful rollouts to eliminate manual tuning, a cosine-decay token reward for smooth length penalization, and a multiplicative reward formulation that substantially mitigates trivial reward hacking by strictly prioritizing answer correctness. Trained exclusively on mathematical data, HMPO generalizes seamlessly across math, code, science, and instruction-following tasks. Extensive experiments scaling from 9B to 122B parameters across dense and Mixture-of-Experts (MoE) architectures demonstrate that HMPO achieves 19% to 46% token compression with negligible accuracy degradation, all while drastically reducing training costs compared to existing multi-stage baselines.
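
The three components named in the abstract can be sketched as follows. The abstract gives no formulas, so the budget rule, the exact cosine shape (full reward up to the budget, zero at twice the budget), and every function name here are assumptions for illustration, not the authors' definitions.

```python
import math
from statistics import median

def median_budget(lengths, correct):
    """Assumed rule: budget = median token length of the correct rollouts."""
    ok = [n for n, c in zip(lengths, correct) if c]
    return median(ok) if ok else None

def length_reward(n_tokens, budget):
    """Assumed cosine decay: 1.0 up to the budget, falling smoothly to 0.0 at 2x budget."""
    if n_tokens <= budget:
        return 1.0
    if n_tokens >= 2 * budget:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * (n_tokens - budget) / budget))

def reward(n_tokens, is_correct, budget):
    """Multiplicative form: a wrong answer earns 0 no matter how short it is."""
    return float(is_correct) * length_reward(n_tokens, budget)

lengths = [120, 300, 200, 80]          # hypothetical rollout lengths in tokens
correct = [True, True, True, False]    # hypothetical correctness flags
budget = median_budget(lengths, correct)
rewards = [reward(n, c, budget) for n, c in zip(lengths, correct)]
```

Because correctness multiplies the length term, the shortest rollout in this toy batch (80 tokens, wrong answer) receives zero reward, which is the property the abstract credits with limiting trivial reward hacking.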

2606.01933 2026-06-02 cs.CV

3rd Place at CVPR 2026 CASTLE Challenge: Agentic Multi-View Long-Context Video Understanding via Hierarchical Knowledge Graph Retrieval

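Raghad Albusayes, Munirah Alyahya

Affiliations TAHAKOM

AI summary Proposes a training-free agentic framework that uses a video knowledge graph and a hierarchical retrieval index to handle complex spatiotemporal reasoning over large-scale multi-view video, placing third in the CASTLE Challenge.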

Abstract

This paper presents our winning methodology for the CASTLE 2026 Challenge at the CVPR 2026 EgoVis Workshop, where our team secured third place globally. The challenge tasks participants with answering highly complex visual, spatiotemporal, and verbal questions, including visual counting, action localization, multi-view tracking, and speaker temporal reasoning, within massive, multimodal video streams. The underlying dataset consists of over 600 hours of synchronized footage captured by 15 ego and exo camera sources. To tackle the extreme scale and long-context demands of this environment, we introduce a training-free agentic framework optimized for long-form video understanding. Our framework introduces two core architectural components: i) a Video Knowledge Graph that maps static and dynamic entities, their temporal relationships, and intersecting events to enable multi-hop relational reasoning, and ii) an adaptive agentic workflow that resolves complex queries through hierarchical retrieval and indexing. Empirical results demonstrate that our framework achieves high zero-shot reasoning accuracy on long-context multi-view streams. Our code will be released at https://github.com/RaghadKhaled/CASTLE-Challenge-Framework.

2606.01926 2026-06-02 cs.CL

Mitigating Bias in Locally Constrained Decoding via Tractable Proposals

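Meihua Dang, Linxin Song, Honghua Zhang, Jieyu Zhao, Guy Van den Broeck, Stefano Ermon

Affiliations Stanford University; University of California, Berkeley; Massachusetts Institute of Technology

AI summary To address the sampling bias caused by myopic masking in locally constrained decoding, proposes globally constrained decoding (GCD) proposals based on tensorized finite automata and probabilistic GCD proposals, combined with sequential Monte Carlo for unbiased sampling; on function calling, keyword-based generation, and SQL generation, they significantly reduce the number of particles needed and speed up convergence.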

Comments 13 pages, 5 figures

Abstract

Generations from large language models often fail to conform to desired constraints such as JSON schema. Existing locally constrained decoding (LCD) approaches enforce constraints by myopically masking out next tokens, resulting in biased sampling and degradation in performance. Recent work uses sequential Monte Carlo (SMC) methods to mitigate such biases, but designing effective proposal distributions or potential functions remains a key challenge. In this work, we propose a generic approach to construct proposals and potentials for SMC sampling from $p_{\mathrm{lm}}( \cdot \mid \mathrm{constraint})$. First, we show that constraints specified as finite automata can be tensorized for efficient execution on GPUs, which we use to construct globally constrained decoding (GCD) proposals. In addition, leveraging the fact that tensorized finite automata share the same circuit structure as hidden Markov models, we circuit-multiply them to obtain the probabilistic GCD (P-GCD) proposals encoding both logical and probabilistic information about the target distributions. We evaluate (P-)GCD on the tasks of function calling, keyword-based generation, and SQL generation. Experiments show that under the same SMC sampling setup, compared to LCD proposals, (P-)GCD converges faster to the target distribution with significantly fewer particles.
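
The bias that the abstract attributes to myopic masking can be seen on a toy model small enough to enumerate exactly. The two-token vocabulary, the probabilities, and the "must end in b" constraint below are all made up for illustration; the last step uses standard importance weighting, which is the idea SMC builds on, and is not the paper's GCD or P-GCD proposal.

```python
from itertools import product

VOCAB = "ab"
LENGTH = 2
first = {"a": 0.5, "b": 0.5}                       # toy p(x1)
second = {"a": {"a": 0.9, "b": 0.1},               # toy p(x2 | x1)
          "b": {"a": 0.5, "b": 0.5}}

def valid(seq):
    return seq[-1] == "b"                          # toy constraint: sequence ends in "b"

def lm_prob(seq):
    return first[seq[0]] * second[seq[0]][seq[1]]

def allowed(prefix):
    """Tokens from which at least one valid completion is still reachable."""
    remaining = LENGTH - len(prefix) - 1
    return [t for t in VOCAB
            if any(valid(prefix + t + "".join(rest))
                   for rest in product(VOCAB, repeat=remaining))]

# Target distribution: the LM conditioned on the constraint.
seqs = ["".join(s) for s in product(VOCAB, repeat=LENGTH)]
z = sum(lm_prob(s) for s in seqs if valid(s))
target = {s: lm_prob(s) / z for s in seqs if valid(s)}

# Locally constrained decoding: mask invalid next tokens, renormalize, continue.
lcd, weight = {}, {}
for t1 in allowed(""):
    m1 = sum(first[t] for t in allowed(""))
    for t2 in allowed(t1):
        m2 = sum(second[t1][t] for t in allowed(t1))
        lcd[t1 + t2] = (first[t1] / m1) * (second[t1][t2] / m2)
        weight[t1 + t2] = m1 * m2                  # probability mass the masks kept

# Importance weighting by the kept mass recovers the target.
zw = sum(lcd[s] * weight[s] for s in lcd)
corrected = {s: lcd[s] * weight[s] / zw for s in lcd}
```

Here local masking samples "ab" and "bb" with probability 0.5 each, while the constrained LM assigns them 1/6 and 5/6. The first token is chosen without knowing that "a" makes the required ending unlikely, which is the myopia the abstract describes.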

2606.01923 2026-06-02 cs.CL cs.LG

Resonant Context Anchoring: Decoupling Attention Routing and Signal Gain at Inference Time

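Mingkuan Zhao, Yide Gao, Wentao Hu, Suquan Chen, Tianchen Huang, Zhenhua An, Zetao Chang, Xiayu Sun, Yuheng Min

Affiliations Xi’an Jiaotong University; University of Science and Technology of China; Tongji University; Tsinghua University

AI summary Proposes Resonant Context Anchoring (RCA), which decouples routing logic from information magnitude in self-attention and dynamically amplifies the signal of context tokens at inference time, effectively suppressing parametric hallucinations in large language models and improving factual consistency.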

Abstract

Large Language Models (LLMs) frequently exhibit "contextual disregard" when faced with input evidence that conflicts with their internal parametric memory, leading to persistent factual hallucinations. Existing mitigation strategies primarily rely on suppressing specific neuron activations or employing computationally expensive contrastive decoding mechanisms, which often result in increased perplexity or significantly elevated inference latency. To address these limitations, we propose Resonant Context Anchoring (RCA), a lightweight inference-time intervention method grounded in the perspective of residual stream signal dynamics. RCA aims to resolve the signal attenuation of external evidence during its propagation through deep networks. The core mechanism involves the orthogonal decoupling of routing logic and information magnitude within the self-attention module. By utilizing raw pre-softmax attention scores as an instantaneous metric of semantic alignment, we construct a dynamic gain field via non-linear rectification to selectively amplify the norms of value vectors corresponding to context tokens, without altering the attention probability distribution. This mechanism effectively elevates the signal-to-noise ratio (SNR) of input evidence within the residual stream mixture, thereby robustly anchoring the generation trajectory to the truthful context during inference. Extensive experiments on the Llama-3 model series demonstrate that RCA significantly improves contextual faithfulness across multiple factual consistency and strong knowledge-conflict tasks, effectively suppressing parametric hallucinations. Furthermore, results confirm that as a training-free and computationally negligible plug-and-play module, RCA achieves a Pareto improvement in faithfulness and fluency while maintaining the model's general language understanding capabilities.
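
A minimal single-head sketch of the decoupling described above: the softmax probabilities (routing) are computed as usual and left untouched, while a separate gain derived from the raw pre-softmax scores scales the contribution of context-token values. The ReLU rectifier, the `1 + alpha * relu(score)` gain form, the `alpha` parameter, and the function name are assumptions; the abstract does not specify them.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_with_gain(Q, K, V, context_mask, alpha=0.5):
    """Return (attention probabilities, output) for one head, without causal masking."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])      # raw pre-softmax scores
    probs = softmax(scores)                      # routing: identical to plain attention
    # Assumed gain field: rectified score, applied only to context positions.
    gain = 1.0 + alpha * np.maximum(scores, 0.0) * context_mask
    return probs, (probs * gain) @ V             # each value's contribution is rescaled

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))                      # 2 queries, head dimension 4
K = rng.normal(size=(5, 4))                      # 5 keys
V = rng.normal(size=(5, 3))
context_mask = np.array([1.0, 1.0, 1.0, 0.0, 0.0])  # first 3 tokens are "context"
probs, out = attention_with_gain(Q, K, V, context_mask)
```

Setting `alpha=0.0` reduces this to ordinary attention, which makes the intervention easy to switch off when comparing outputs.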

2606.01920 2026-06-02 cs.CV

Pool-Select-Refine: Allocation-Aware Generative Dataset Distillation with Soft-Label-Guided Latent Refinement

Wenmin Li, Shunsuke Sakai, Zhongkai Zhao, Tatsuhito Hasegawa

Affiliations Graduate School of Engineering, University of Fukui; College of Computer Science and Artificial Intelligence, Southwest Minzu University

AI summary Proposes Pool-Select-Refine, a two-stage framework that decouples generation, selection, and refinement through selection from an over-complete candidate pool and soft-label-guided latent refinement, improving budget utilization in diffusion-based dataset distillation.

Abstract

Diffusion-based dataset distillation has recently emerged as a promising paradigm for condensing large-scale datasets into compact synthetic sets. By leveraging pretrained generative priors, these methods can produce realistic class-conditional samples more efficiently than traditional matching-based approaches. However, most existing diffusion-based methods still adopt a rigid "Generate-and-Use" strategy, where the generated samples are directly treated as the final distilled set under a fixed images-per-class budget. Such a design tightly couples candidate generation with final budget allocation, which may waste the limited budget on redundant or insufficiently informative samples. In this paper, we propose "Pool-Select-Refine", a two-stage framework for allocation-aware generative dataset distillation. First, instead of directly using a fixed number of generated samples, we construct an over-complete candidate pool and select a compact subset under the target budget. Second, we refine the selected samples in latent space using soft-label supervision derived from the teacher model, improving semantic alignment while preserving the generative prior. This design explicitly decouples generation, selection, and refinement, enabling more effective use of the distillation budget. Experiments on large-scale and fine-grained image classification benchmarks show that the proposed framework delivers consistent gains over diffusion-based baselines. The results suggest that introducing a curation stage before refinement is a simple yet effective way to improve diffusion-based dataset distillation.

2606.01914 2026-06-02 cs.CL cs.CV

Mechanistic Diagnostics of Spatial Lexical Bias in Multimodal Large Language Model Spatial Reasoning

Chuang Ma, Qianying Liu, Tomoyuki Obuchi, Fei Cheng, Wang Yang, Sudong Cai, Shuyuan Zheng, Akiko Aizawa, Sadao Kurohashi

Affiliations Kyoto University; NII LLMC; RIKEN AIP; Case Western Reserve University; The Hong Kong Polytechnic University; The University of Osaka; University of Tokyo

AI summary Finds that multimodal large language models exhibit spatial lexical bias, where adding a spatial relation word draws the model toward that option; mechanistic interpretability tools show the bias originates mainly on the language side rather than the visual side, and a lightweight LLM-only DPO update effectively mitigates it.

Abstract

Multimodal large language models (MLLMs) remain unreliable on spatial multiple-choice questions, and their failures are often attributed to poorly attended visual information. In this work, we identify a complementary failure mode, spatial lexical bias: adding a spatial relation word to the answer options can attract the model's decision and make the newly added option likely to be selected. Using nine open-weight MLLMs, we show that this phenomenon is widely observed. In particular, models can answer a binary spatial question correctly, yet consistently select an incorrect third spatial option once it is added to the answer set. We isolate such binary-stable but ternary-fragile cases as diagnostic examples and apply mechanistic interpretability tools, revealing that a substantial part of the failure originates on the language side rather than the visual side: visual attention analyses and residual-stream probes show that the correct spatial relation remains internally available in these failures, while irrelevant-option controls, activation patching, and sparse component interventions trace the bias to specific LLM-side channels and neurons. Based on this finding, we show that a lightweight LLM-only DPO update on tiny single-object-pair synthetic data mitigates the bias, lifting four-way robust accuracy by up to 100 points on synthetic data, and by 68.0, 32.6, and 20.1 points on the broader evaluation datasets WhatsUp, SpatialMQA-Direct, and VSR, respectively.

2606.01912 2026-06-02 cs.AI

SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes

Kuan Li, Shuo Zhang, Huacan Wang, Fangzhou Yu, Zecheng Sheng, Yi Gu, Weipeng Ming, Lei Xue, Chen Liu, Sen Hu, Ronghao Chen, Siyue Lin, Yuqing Hou, Xiaofeng Mou, Yi Xu

Affiliations Midea Group; Beijing University of Posts and Telecommunications; Donghua University; The University of Sydney; Peking University

AI summary Proposes the SMH-Bench benchmark, built on the executable simulator HomeEnv, which evaluates LLM reasoning and action in smart homes across 1,100 tasks and finds that frontier models fall short in automation scheduling, ambiguity handling, and personalized reasoning.

Abstract

Smart homes are evolving toward complex state-dependent living environments, requiring Large Language Models (LLMs) to reason over user intent, preferences, and multi-device interactions. However, existing smart-home benchmarks often focus on static instruction-to-API mapping or limited simulations, failing to evaluate whether LLMs can reason, interact, and act reliably in realistic household scenarios. To address these limitations, we introduce SMH-Bench, a comprehensive benchmark for evaluating LLMs in smart-home environments. Built upon HomeEnv, an executable and verifiable smart-home simulator, SMH-Bench contains 1,100 high-quality tasks spanning 7 categories and 22 fine-grained subcategories. It further stratifies tasks across simple, medium and complex homes, ranging from small apartments to dense multi-room environments with 135 devices. Experiments show that although frontier LLMs achieve strong performance on explicit control and query tasks, they still exhibit significant weaknesses in automation task scheduling, ambiguity handling and personalized reasoning, especially as home complexity increases. We hope SMH-Bench will facilitate the development of more reliable, context-aware, and practically deployable smart-home agents.

2606.01911 2026-06-02 cs.CV

Residual Decoder Adapter: ID-Preserving Tokenizer Adaption for Autoregressive Text Rendering

Dongxing Mao, Jinpeng Wang, Jiahao Tang, Kevin Qinghong Lin, Linjie Li, Zhengyuan Yang, Lijuan Wang, Min Li, Jingru Tan

Affiliations Central South University; University of Oxford; Microsoft Research

AI summary Proposes the Residual Decoder Adapter (RDA), which introduces a paired codebook and a parallel branch that learns pixel-space residuals, significantly improving text rendering without retraining the tokenizer or the autoregressive model.

Comments CVPR 2026 poster

Abstract

Visual Autoregressive (AR) models generate images by predicting discrete tokens that are decoded by a visual tokenizer. Despite strong overall image generation ability, they still underperform on text rendering, producing blurred strokes and disrupted letter shapes. In this work, we trace this limitation to the visual tokenizer, which struggles to reconstruct fine-grained detail. Improving the tokenizer is straightforward but expensive, as it necessitates retraining both the tokenizer and the AR model. Can we improve the text rendering performance of AR models without retraining the existing tokenizer and AR model? To achieve this, we propose the Residual Decoder Adapter (RDA), which upgrades an existing tokenizer post hoc without changing its token space. Specifically, it refines the decoder output of the visual tokenizer by introducing two novel components: (i) a paired codebook that shares the token distribution with the original one; (ii) a parallel branch that learns the small differences (residual) between the reconstructed image and the ground-truth image in pixel space. This residual design allows us to enhance the tokenizer non-invasively while preserving compatibility with prior AR models. RDA improves text rendering by a large margin. For instance, on the competitive TextAtlas benchmark, the OCR accuracy of finetuned Janus-Pro rises from 24.52% to 58.26% (TextVisionBlend) and from 12.75% to 36.81% (StyledTextSynth). The code is available at https://github.com/CSU-JPG/RDA.

2606.01909 2026-06-02 cs.SD cs.AI eess.AS

Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space

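Louis Mouchon

Affiliations Louis Mouchon

AI summary Proposes Echo, a system built on a single 25M-parameter ViT encoder that, through JEPA pretraining and staged specialization, jointly performs speaker diarization, speech separation, and speech recognition in a 512-dimensional latent space, with no fine-tuning at deployment.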

Comments 18 pages, 17 tables, 1 figure. Proof-of-concept, independent research

Abstract

We present Echo, a proof-of-concept audio system built around a single 25 M-parameter ViT encoder. The encoder is pretrained with a JEPA objective and then specialised by stages to carry speaker identity, phonetic content, and dynamic source routing in the same 512-dimensional latent space, with no per-task fine-tuning at deployment. Light heads handle diarization (ArcFace + VBx) and dynamic source separation (null-target K-set prediction). On synthetic VoxCeleb2 mixtures with unknown K, the canonical stack reaches 15.00% blind DER, 97.80% PIT separation accuracy with +9.52 dB latent SI-SDR, and a +53.50-point speaker/content factorisation gap on a held-out k-NN probe. The point of Echo is not a new SOTA on any single task but the joint coexistence of three tasks on one encoder at this footprint. We document the design stage by stage, report the dead-ends, and identify the structural wall on end-to-end ASR through the VQ bottleneck that still bounds the PoC.
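
For readers unfamiliar with the SI-SDR figure quoted above, the following is a sketch of the standard scale-invariant signal-to-distortion ratio on plain vectors. The abstract reports a "latent" SI-SDR, and how Echo computes it in its 512-dimensional space is not stated, so applying this textbook definition to latent vectors is an assumption; the sample signals are made up.

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB (signals are usually mean-centred beforehand)."""
    estimate = np.asarray(estimate, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Project the estimate onto the reference to get the scaled target component.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10(np.dot(target, target) / np.dot(noise, noise))

reference = np.array([1.0, 0.0, -1.0, 0.0])   # made-up reference signal
estimate = reference + 0.1                    # reference plus a constant offset
score = si_sdr(estimate, reference)
```

Rescaling the estimate leaves the score unchanged, which is what distinguishes SI-SDR from plain SDR.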

2606.01908 2026-06-02 cs.LG cs.CV

Private and Stable Test-Time Adaptation with Differential Privacy

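Zefeng Li, Qiaoyue Tang, Mathias Lecuyer, Evan Shelhamer

Affiliations University of California, Berkeley

AI summary Casts several test-time adaptation methods into differentially private forms, protecting test-data privacy with per-sample gradient clipping and Gaussian noise; on ImageNet-C this balances privacy and accuracy, and the clipping mechanism is found to improve the accuracy and stability of continual adaptation.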

Comments ICML 2026

Abstract

Test-time adaptation (TTA) can reduce error on new and different data by updating the model on these inputs during inference. However, these updates raise the issue of privacy w.r.t. the testing data, because the model parameters now depend on all past inputs. To control this privacy risk, we cast multiple popular TTA methods (Tent, EATA, SAR, DeYO, and COME) into differential privacy (DP) forms that apply per-sample gradient clipping and Gaussian noise for all updates. On ImageNet-C, our DP-TTA methods provide adequate privacy at small cost to accuracy, and in the low-privacy regime the clipping mechanism of DP can even improve the accuracy and stability of adaptation in the continual setting. These improvements to privacy and accuracy come at only modest computational overhead. These first results on private TTA raise awareness of the issue, inform the development of more private test-time updates, and identify per-sample clipping as an effective technique for improving the accuracy and stability of adaptation.
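
The per-sample clipping and Gaussian noise mentioned above follow the usual DP-SGD recipe, sketched here on a batch of already-computed per-sample gradients. This is the generic mechanism, not the paper's code; the function name, the parameter names, and the toy gradients are assumptions, and privacy accounting is left out.

```python
import numpy as np

def dp_update(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Clip each sample's gradient to clip_norm, sum, add Gaussian noise, average."""
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    # Scale down only the gradients whose norm exceeds clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_sample_grads * scale
    # Noise standard deviation is proportional to the clipping bound.
    noise = rng.normal(0.0, noise_multiplier * clip_norm,
                       size=per_sample_grads.shape[1])
    return (clipped.sum(axis=0) + noise) / len(per_sample_grads)

grads = np.array([[3.0, 4.0],     # norm 5.0, will be clipped to norm 1.0
                  [0.3, 0.4]])    # norm 0.5, left unchanged
# noise_multiplier=0.0 only makes this example deterministic; it gives no privacy.
update = dp_update(grads, clip_norm=1.0, noise_multiplier=0.0,
                   rng=np.random.default_rng(0))
```

Clipping bounds how far any single test input can move the model, which is one plausible reading of why the abstract reports that it also stabilizes continual adaptation.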

2606.01906 2026-06-02 cs.AI

Bayesian Spectral Emotion Transition Discovery from Multi-Annotator Disagreement

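Keito Inoshita, Takato Ueno

Affiliations Keio University; National Institute of Advanced Industrial Science and Technology

AI summary Proposes Bayesian Spectral Emotion Transition Discovery (BSETD), a two-stage framework that mines emotion-transition structure from multi-annotator soft labels and separates inertia and contagion components by spectral decomposition; results on the EmotionLines dataset are consistent with psychological theory.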

Abstract

Emotions evolve through the dynamics of conversation, and understanding their transition structure is foundational to applications ranging from mental-health screening to dialogue systems. However, existing studies typically compress multi-rater judgments into a single hard label by majority voting, discarding the uncertainty signal needed to understand turn-to-turn transitions. In this article, we propose Bayesian Spectral Emotion Transition Discovery (BSETD), a two-stage framework that discovers emotion-transition structure from multi-rater soft labels. In the first stage, a hierarchical Dirichlet-Multinomial posterior is constructed through the outer product of soft labels, equipping each cell of the K x K transition matrix with a credible interval and Benjamini-Hochberg (BH) false discovery rate (FDR)-controlled significance. In the second stage, the symmetrized graph Laplacian is spectrally decomposed to separate a low-frequency (inertia) component from a high-frequency (contagion) component. On EmotionLines, BSETD simultaneously recovers the signatures of two distinct affective spaces: the Plutchik-adjacent transitions disgust to anger (log2 lift +0.94) and anger to disgust (+0.86) are over-represented, while the Russell-valence-reversed transitions joy to anger (-0.90) and anger to joy (-0.89) are under-represented. A five-source cross-corpus validation yields pairwise Pearson correlations in 0.91-0.98 within English, 0.79-0.85 against Chinese M3ED, and 0.979 between the human hard labels and the LLM virtual soft labels on the same utterance set, demonstrating that a pipeline preserving annotator uncertainty bridges the computational study of emotion dynamics with established psychological theory.
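
The two stages can be illustrated with a simplified, non-Bayesian sketch: soft transition counts from outer products of consecutive soft labels, a log2 lift against the independence baseline, and an eigendecomposition of the symmetrized graph Laplacian. The paper's hierarchical Dirichlet-Multinomial posterior, credible intervals, and FDR control are omitted, and the function names and toy labels are assumptions.

```python
import numpy as np

def soft_transition_counts(soft_labels):
    """Sum of outer products p_t (x) p_{t+1} over consecutive utterances."""
    P = np.asarray(soft_labels, dtype=float)
    return sum(np.outer(P[t], P[t + 1]) for t in range(len(P) - 1))

def log2_lift(counts):
    """log2 of observed joint probability over the product of its marginals."""
    joint = counts / counts.sum()
    expected = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    return np.log2(joint / expected)

def laplacian_spectrum(counts):
    """Eigenvalues and eigenvectors of L = D - W for the symmetrized weights W."""
    W = (counts + counts.T) / 2.0
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.eigh(L)            # eigenvalues in ascending order

# Toy dialogue: 3 utterances, K = 2 emotions, each row an annotator distribution.
soft = [[0.8, 0.2], [0.5, 0.5], [0.3, 0.7]]
counts = soft_transition_counts(soft)
lift = log2_lift(counts)
eigenvalues, eigenvectors = laplacian_spectrum(counts)
```

Small eigenvalues correspond to smooth, slowly varying modes on the emotion graph and large ones to rapidly varying modes; mapping these to the abstract's "inertia" and "contagion" components is the paper's interpretation and is not reproduced here.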

2606.01901 2026-06-02 cs.CV cs.AI cs.CL

The Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue

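Sherzod Hakimov, Mattia D'Agostini, Ivan Samodelkin, David Schlangen

Affiliations Computational Linguistics, Department of Linguistics, University of Potsdam; German Research Center for Artificial Intelligence (DFKI), Berlin

AI summary Proposes the Image Reconstruction Game benchmark, in which a vision-language model issues corrective instructions to an image generator over multiple turns so that accumulated common ground is directly visible as the reconstructed image; finds that the describer is the dominant factor in reconstruction quality, while the generator determines the effect of iterative refinement.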

Abstract

We introduce the Image Reconstruction Game, a fully automated benchmark in which a vision-language model issues corrective instructions to an image generator across multiple turns, making accumulated common ground directly observable as a rendered image. Benchmarking two Describer models crossed with two Generator models across seven image categories, we find that the describer is the dominant factor in reconstruction quality, while the generator determines whether iterative refinement helps or hurts. Mathematical and geometric images pose the greatest challenge. The describer's token budget strongly affects convergence: shorter budgets yield sparser first renderings with more room for visible improvement, while longer budgets raise absolute quality but leave less to fix. Stronger describers use a richer correction vocabulary spanning spatial, numeric, and structural categories, while weaker describers concentrate on surface properties and tend to stop after a few turns. Human validation shows that the best automated judge reaches only slight-to-fair agreement with human preferences, and automated scores require human recalibration to be used reliably.

2606.01896 2026-06-02 cs.CV cs.AI

Train, Test, Re-evaluate: Schedule-Sensitive Evaluation of Generative Data for Hand Detection

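Atmika Bhardwaj, Silvia Vock, Nico Steckhan

Affiliations Federal Institute for Occupational Safety and Health

AI summary Evaluates, through multi-stage training-schedule experiments, how generative inpainting data affects hand detection performance in safety-critical settings, finding that an appropriate training procedure significantly improves real-world deployment performance.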

Comments 16 pages, 4 figures

Abstract

Generated (or synthetic) image data is increasingly used to augment or replace real training datasets when target imagery is scarce, expensive, or biased. For hand detection, particularly in occupational safety settings, public datasets mostly contain bare hands. This under-represents the variation in hand appearance introduced by gloves, tattoos, jewelry, and other personal protective equipment, creating a distribution shift that safety-critical applications encounter at deployment. We test whether generative inpainting, editing only the hand region of a real photograph to introduce accessories, can close this shift gap. On a paired dataset of real images and their synthetic counterparts, we train YOLOv8n hand detectors under six training-and-scheduling regimes (Experiments A-F, three random seeds each), evaluate every detector on a real test set and on a real-gloves-only test split, and report the mean average precision (mAP) at two overlap thresholds (mAP@0.5 and mAP@0.5:0.95) along with paired statistical tests. A two-stage experiment, which trains on the union of real and synthetic data and then fine-tunes the resulting weights on real-only data at a lower learning rate, increases mAP@0.5 over the real-only baseline model on the standard real test set and narrows the real-gloves out-of-distribution gap. A three-stage experiment preserves box tightness best, reaching the highest mAP@0.5:0.95 of any experiment in the study. The utility of synthetic data for safety-critical hand detection is determined by the training procedure, and simple multi-stage experiments extract substantial real-deployment benefit from inpainted accessory data.

2606.01895 2026-06-02 cs.CV cs.AI

Collaborative Space Object Detection with Multi-Satellite Viewpoints in LEO Constellations

Xingyu Qu, Wenxuan Zhang, Peng Hu

Affiliations * Government of Canada; Natural Sciences and Engineering Research Council of Canada

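AI summary: To address the challenges of space object detection in LEO constellations, the paper proposes fusing multi-viewpoint observations within a deep learning framework and feeding the multi-view data to YOLO detectors; experiments show that multi-view fusion significantly improves detection accuracy.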

Details
AI Chinese abstract (translated)

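As the number of satellites in low Earth orbit (LEO) constellations grows, the near-Earth space environment is becoming increasingly crowded, making space object detection (SOD) a pressing challenge for space safety and sustainability. To reduce collision risk and keep space operations running, SOD systems must deliver fast and accurate detection under strict onboard constraints. In this paper, we study the potential of fusing multi-viewpoint observations within a deep learning (DL) framework to improve SOD performance. We design a practical multi-view pipeline and several input representations for feeding multi-view data into YOLO-based detectors. Our experiments show that using multi-view inputs is feasible in most cases and usually yields better mAP50 and mAP50-95. For example, with YOLOv9-m, moving from the single-view setting to the three-view fused RGB setting raises mAP50 from 0.638 to 0.732 and mAP50-95 from 0.227 to 0.276. Compared with the single-view setting, the best three-view grayscale configuration improves mAP50 by 36.3% and mAP50-95 by 46.5%. These findings establish multi-view fusion as a viable and effective strategy for SOD, with broad implications for space situational awareness in LEO constellation deployments.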

English abstract

With the growing number of satellites in low Earth orbit (LEO) constellations, the near-Earth space environment has become increasingly congested, making space object detection (SOD) a pressing challenge for space safety and sustainability. To mitigate collision risks and ensure the continuity of space operations, SOD systems must deliver fast and accurate detection under stringent onboard constraints. In this paper, we investigate the potential of multi-viewpoint observation fusion within a deep learning (DL) framework to enhance SOD performance. We design a practical multi-view pipeline and several input representations for feeding multi-view data into YOLO-based detectors. Our experiments show that using multi-view inputs is feasible in most cases and typically produces better results for mAP50 and mAP50-95. For example, with YOLOv9-m, moving from the single-view setting to the three-view fused RGB setting increases mAP50 from 0.638 to 0.732 and mAP50-95 from 0.227 to 0.276. Compared with the single-view setting, the best three-view grayscale configuration improves mAP50 by 36.3% and mAP50-95 by 46.5%. These findings establish multi-view fusion as a viable and effective strategy for SOD, with broad implications for space situational awareness in LEO constellation deployments.
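
The abstract mixes absolute mAP values (the YOLOv9-m RGB example) with relative percentages (the best grayscale configuration). As a small worked example, the relative gains implied by the RGB figures can be computed directly. The helper name `relative_gain` is hypothetical, and the grayscale percentages cannot be reproduced this way because the abstract does not give their absolute values.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# YOLOv9-m, single-view vs. three-view fused RGB (values from the abstract).
print(f"mAP50:    {relative_gain(0.638, 0.732):.1f}%")  # prints 14.7%
print(f"mAP50-95: {relative_gain(0.227, 0.276):.1f}%")  # prints 21.6%
```

Under this calculation the RGB example corresponds to relative gains of roughly 14.7% and 21.6%, so the 36.3% and 46.5% figures describe a different, stronger configuration and are not a restatement of the RGB result.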

2606.01894 2026-06-02 cs.AI

Physically-Constrained Mamba-SDE for Remaining Useful Life Prediction under Irregular Observations

Deyu Zhuang, Peiliang Gong, Yang Shao, Liyuan Shu, Qi Zhu, Xiaoli Li, Daoqiang Zhang

Affiliations * Nanjing University of Aeronautics and Astronautics; Nanyang Technological University; Singapore University of Technology and Design

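AI summary: Proposes the PC-MambaSDE framework, which uses a mask-aware continuous Mamba encoder and a physics-guided latent SDE to address physically infeasible remaining useful life predictions under irregular observations.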

Details
AI Chinese abstract (translated)

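Accurate remaining useful life (RUL) prediction is essential for industrial predictive maintenance. Real-world deployment is difficult, however, because sensor observations are irregular, showing asynchronous sampling, burst missingness, and temporal jitter. Worse, purely data-driven models often generate physically implausible degradation trajectories that violate the irreversibility of damage accumulation. To address this, we propose PC-MambaSDE, a unified continuous-time framework for robust RUL prediction under irregular observations. Specifically, we design a mask-aware continuous Mamba encoder that explicitly uses observation masks to extract context-rich control signals. We also introduce a physics-guided latent SDE with a parametrically rectified hybrid drift, which superimposes a global physical bias to enforce monotonic degradation even across severe observation gaps. In addition, we formulate RUL prediction as a boundary value problem through a terminal degradation penalty, which decouples a health index dimension and applies a penalty loss that steers trajectories toward the failure state. Theoretically, we prove via Girsanov's theorem that our variational objective is mathematically equivalent to minimizing the KL divergence, and we guarantee global asymptotic stability of the learned dynamics through Lyapunov analysis. For rigorous evaluation, we develop a hybrid irregularity generation scheme that simulates realistic industrial imperfections. Extensive experiments on public benchmarks show that PC-MambaSDE significantly outperforms state-of-the-art methods, especially under extreme observation scarcity, validating the effectiveness of embedding physical priors into continuous-time latent dynamics.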

English abstract

Accurate remaining useful life (RUL) prediction is critical for industrial predictive maintenance. However, real-world deployment is challenging due to the irregular nature of sensor observations, characterized by asynchronous sampling, burst missingness, and temporal jitter. Compounding this issue, purely data-driven models often generate physically implausible degradation trajectories that violate the irreversible nature of damage accumulation. To address this, we propose PC-MambaSDE, a unified continuous-time framework for robust RUL prediction under irregular observations. Specifically, we design a Mask-Aware Continuous Mamba Encoder that explicitly leverages observation masks to extract context-rich control signals. Furthermore, we introduce a Physics-Guided Latent SDE with parametrically rectified hybrid drift, superimposing a global physical bias to enforce monotonic degradation even amid severe observation gaps. Additionally, we formulate RUL prediction as a boundary value problem via a Terminal Degradation Penalty, which decouples a Health Index dimension and applies a penalty loss to guide trajectories toward the failure state. Theoretically, we prove that our variational objective is mathematically equivalent to minimizing the KL divergence via Girsanov's theorem, and we guarantee the global asymptotic stability of the learned dynamics through Lyapunov analysis. To enable rigorous evaluation, we develop a Hybrid Irregularity Generation Scheme that simulates realistic industrial imperfections. Extensive experiments on public benchmarks demonstrate that PC-MambaSDE significantly outperforms state-of-the-art methods, particularly under extreme observation scarcity, validating the efficacy of embedding physical priors into continuous-time latent dynamics.
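
To give a feel for how a rectified drift can bias a latent SDE toward monotonic degradation, the following is a minimal Euler-Maruyama sketch. It is not the PC-MambaSDE model: the softplus rectifier, the scalar health index, the function names, and all parameter values are assumptions chosen for illustration, and the abstract does not specify the form of the paper's drift.

```python
import math
import random


def softplus(x):
    """Numerically stable softplus; always strictly positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def simulate_degradation(raw_drift, sigma, h0=0.0, t_end=1.0, steps=100, seed=0):
    """Euler-Maruyama simulation of dh = softplus(f(t, h)) dt + sigma dW.

    raw_drift stands in for a learned, unconstrained drift (hypothetical);
    passing it through softplus makes the drift term non-negative.
    """
    rng = random.Random(seed)
    dt = t_end / steps
    h = h0
    path = [h]
    for k in range(steps):
        t = k * dt
        drift = softplus(raw_drift(t, h))
        noise = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        h = h + drift * dt + noise
        path.append(h)
    return path


# With sigma = 0 the path is non-decreasing even though the raw drift is
# negative over much of the interval.
path = simulate_degradation(lambda t, h: -1.0 + math.sin(5.0 * t), sigma=0.0)
print(path[0], path[-1])
```

Note that rectifying the drift only constrains the deterministic part of the dynamics: with a nonzero diffusion term, individual sample paths can still dip locally, so any strict monotonicity guarantee has to come from additional structure beyond this sketch.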