arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.06047 2026-05-12 cs.LG cs.AI

TFM-Retouche: A Lightweight Input-Space Adapter for Tabular Foundation Models

Duong Nguyen, Mohammed Jawhar, Nicolas Chesneau

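AI summary Tabular foundation models (TFMs) perform well on zero-shot learning tasks, but their inductive biases are fixed at inference time, which makes them hard to adapt to a specific task or dataset. This paper proposes TFM-Retouche, a lightweight input-space adapter that can be tuned on top of a frozen TFM backbone without modifying the model architecture; it learns a small residual correction in the input space to align the data with the pretrained model's inductive biases. Experiments show that the method significantly improves model performance across multiple tasks and strikes a good balance between computational efficiency and predictive quality.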

Abstract

Tabular foundation models (TFMs), such as TabPFN-2.6, TabICLv2, ConTextTab, Mitra, LimiX, and TabDPT, achieve strong zero-shot performance through in-context learning, but their inductive biases remain fixed at inference time. Adapting a pretrained TFM to a specific dataset or task typically requires either full fine-tuning, which is computationally expensive, or parameter-efficient tuning methods (PEFT) such as LoRA, which must be tailored to the internal architecture of each TFM. Furthermore, the evidence on whether weight-space fine-tuning improves accuracy or calibration is mixed \citep{tanna_exploring_2026,rubachev_finetuning_2025}. We introduce TFM-Retouche, a lightweight input-space residual adapter that is architecture-agnostic by design with respect to the frozen TFM backbone. TFM-Retouche learns a small residual correction in the input space to align the input data with the inductive biases of the pretrained model. The adapter is trained end-to-end through the frozen TFM, with a post-training identity guard that falls back to the unmodified TFM whenever adaptation does not help on held-out validation. On TabArena-Lite (51 datasets spanning binary classification, multiclass classification, and regression), TabICLv2-Retouche -- the framework instantiated on TabICLv2 -- is the top-ranked method on the leaderboard with light per-task tuning and ensembling, lifting aggregate Elo by +56 over the frozen TabICLv2 base and sitting on the Pareto front of predictive quality versus both training and inference time.
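
The post-training identity guard described in this abstract can be illustrated as a small model-selection step. This is a minimal sketch, not the authors' implementation: `frozen_predict`, `adapter`, the toy backbone, and the accuracy-based comparison are hypothetical stand-ins. The abstract only states that the guard falls back to the unmodified TFM when adaptation does not help on held-out validation.

```python
from typing import Callable, List, Sequence

Row = List[float]


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def identity_guard(
    frozen_predict: Callable[[Row], int],  # hypothetical frozen backbone
    adapter: Callable[[Row], Row],         # hypothetical learned residual: x -> delta
    val_x: Sequence[Row],
    val_y: Sequence[int],
) -> Callable[[Row], int]:
    """Return the adapted predictor only if it beats the frozen one on validation."""

    def adapted_predict(x: Row) -> int:
        # Residual correction in input space: x' = x + delta(x).
        delta = adapter(x)
        return frozen_predict([xi + di for xi, di in zip(x, delta)])

    base_score = accuracy([frozen_predict(x) for x in val_x], val_y)
    adapted_score = accuracy([adapted_predict(x) for x in val_x], val_y)
    # Fall back to the unmodified model unless adaptation strictly helps.
    return adapted_predict if adapted_score > base_score else frozen_predict


# Toy stand-ins (assumptions for illustration only).
def toy_backbone(x: Row) -> int:
    return int(x[0] > 0.0)


def helpful_adapter(x: Row) -> Row:
    return [-1.0]  # shifts inputs so the fixed threshold lines up with the data


def harmful_adapter(x: Row) -> Row:
    return [5.0]  # pushes every input above the threshold


VAL_X = [[0.5], [0.8], [1.5], [2.0]]
VAL_Y = [0, 0, 1, 1]

guarded_good = identity_guard(toy_backbone, helpful_adapter, VAL_X, VAL_Y)
guarded_bad = identity_guard(toy_backbone, harmful_adapter, VAL_X, VAL_Y)
```

With the helpful adapter the guard keeps the adapted predictor; with the harmful one it returns the frozen backbone unchanged, so the guarded model is never worse than the base on the validation split.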

2605.05611 2026-05-12 cs.SD cs.AI eess.AS

X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

Rixi Xu, Qingyu Liu, Haitao Li, Yushen Chen, Zhikang Niu, Yunting Yang, Jian Zhao, Ke Li, Berrak Sisman, Qinyuan Cheng, Xipeng Qiu, Kai Yu, Xie Chen

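AI summary This paper presents X-Voice, a 0.4B-parameter multilingual zero-shot voice cloning model that lets users clone any voice and speak in 30 languages. The model is trained on a 420,000-hour multilingual corpus, uses the International Phonetic Alphabet (IPA) as a unified representation, and adopts a two-stage training framework that achieves zero-shot cloning without complex preprocessing. By extending the F5-TTS architecture with dual-level injection of language identifiers and a decoupled scheduling mechanism for classifier-free guidance, X-Voice outperforms existing flow-matching-based multilingual systems in both subjective and objective evaluations and achieves cross-lingual cloning capability comparable to billion-scale models.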

Comments 16 pages, 4 figures, 9 tables

Abstract

In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the International Phonetic Alphabet (IPA) as a unified representation. To eliminate the reliance on prompt text without complex preprocessing like forced alignment, we design a two-stage training paradigm. In Stage 1, we establish X-Voice$_{\text{s1}}$ through standard conditional flow-matching training and use it to synthesize 10K hours of speaker-consistent segments as audio prompts. In Stage 2, we fine-tune on these audio pairs with prompt text masked to derive X-Voice$_{\text{s2}}$, which enables zero-shot voice cloning without requiring transcripts of audio prompts. Architecturally, we extend F5-TTS by implementing a dual-level injection of language identifiers and decoupling and scheduling of Classifier-Free Guidance to facilitate multilingual speech synthesis. Subjective and objective evaluation results demonstrate that X-Voice outperforms existing flow-matching based multilingual systems like LEMAS-TTS and achieves zero-shot cross-lingual cloning capabilities comparable to billion-scale models such as Qwen3-TTS. To facilitate research transparency and community advancement, we open-source all related resources.

2605.05103 2026-05-12 cs.CL cs.AI cs.CY

Text Corpora as Concept Fields: Black-Box Hallucination and Novelty Measurement

Nicholas S. Kersting, Vittorio Castelli, Chieh Ting Yeh, Xinzhu Wang, Saad Taame

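AI summary This paper proposes the "Concept Field", a new way to measure semantic change between sentences in a text corpus and to detect hallucination and novelty in generated text. Based on the differences between consecutive sentences in sentence-embedding space, the method builds a local drift field with pointwise uncertainty and evaluates generated content by scoring how well a candidate sentence transition agrees with that field. The work introduces a Vector Sequence Database (VSDB) to support efficient computation and validates the method on domains as different as federal regulations and literary works, showing stability and interpretability across domains.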

Comments 25 pages, 8 figures

Abstract

We introduce the \textbf{Concept Field} of a text corpus: a local drift field with pointwise uncertainty, estimated in sentence-embedding space from the deltas between consecutive sentences. Given a candidate sentence transition, we score its agreement with the field by $ζ$, the mean absolute z-distance between the observed delta and the field's local Gaussian estimate. The score is black-box (no model internals), corpus-attributable (every score traces to nearby corpus sentences), and admits a probabilistically motivated interpretation under a local Gaussian approximation. We support the computation with the introduction of a \textbf{Vector Sequence Database (VSDB)} that stores embeddings together with sequence-position and next-delta metadata. We evaluate this approach on two large-scale settings: hallucination-style groundedness detection over the U.S. Code of Federal Regulations, and novelty detection over Project Gutenberg. On controlled LLM-generated rewrites, Concept Fields achieve strong selective classification performance under a grounded / ungrounded / unsure triage policy. Unlike retrieval-centric baselines, the resulting coverage-risk behavior is similar across both domains, supporting a degree of cross-domain stability for the standardized deviation score. We also sketch how divergence and curl of the Concept Field, computed on dense clusters, surface qualitatively meaningful semantic patterns (logic sources, sinks, and implicit topics), which we offer as hypothesis-generating rather than as a quantitative result. Concept Fields provide a fast, lightweight, and interpretable signal for groundedness and novelty, complementary to LLM-as-judge and white-box detectors.
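
The score described in this abstract, a mean absolute z-distance between an observed delta and a local Gaussian estimate, can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the Gaussian is taken to be diagonal (one mean and standard deviation per embedding dimension), the retrieval of nearby corpus deltas is omitted, and `zeta_score`, `neighbor_deltas`, and `eps` are hypothetical names rather than the authors' API.

```python
import math
from typing import Sequence


def zeta_score(observed_delta: Sequence[float],
               neighbor_deltas: Sequence[Sequence[float]],
               eps: float = 1e-8) -> float:
    """Mean absolute z-distance between a delta and a per-dimension Gaussian fit."""
    dims = len(observed_delta)
    n = len(neighbor_deltas)
    total = 0.0
    for d in range(dims):
        column = [nd[d] for nd in neighbor_deltas]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n  # population variance
        std = math.sqrt(var) + eps  # eps guards against zero spread
        total += abs(observed_delta[d] - mean) / std
    return total / dims


# Toy 2-D "embedding deltas" of nearby corpus sentences (made-up numbers).
neighbors = [[1.0, 0.0], [3.0, 0.0], [1.0, 2.0], [3.0, 2.0]]
typical = zeta_score([2.0, 1.0], neighbors)   # matches the local mean
unusual = zeta_score([4.0, 1.0], neighbors)   # two std devs off in one dimension
```

A transition that matches the local mean scores near zero, while one that sits two standard deviations away in one of two dimensions scores about one.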

2605.04899 2026-05-12 cs.LG

A geometric relation of the error introduced by sampling a language model's output distribution to its internal state

Albert F. Modenbach

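AI summary This paper studies the geometric relation between the error introduced by sampling a language model's output distribution during generation and the model's internal state. By analyzing the geometry of the token embeddings, the author derives a 1-form valued in the Lie algebra $\mathfrak{so}(n)$ and finds that its curvature is semantically meaningful. On chess reasoning tasks, the curvature couples to the model's world model, revealing how the model internally represents problems by board region and piece importance.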

Comments 12 Pages, 10 Figures, 2 Appendices. To appear in Proceedings of ICML 2026

Abstract

GPT-style language models are sensitive to single-token changes at generation points where the predicted probability distribution is spread across multiple tokens. Viewing this sensitivity as a geometric property, we derive an $\mathfrak{so}(n)$-valued 1-form that depends only on the geometry of the token embeddings. Despite this purely geometric origin, we show that its curvature is semantically meaningful: On chess reasoning tasks, the curvature couples to the world model of an off-the-shelf instruction-tuned model, with transformations clustering by board region and respecting piece importance. Our findings suggest that token space geometry directly reflects how models internally represent problems.

2605.04827 2026-05-12 cs.LG

Trustworthy Federated Label Distribution Learning under Annotation Quality Disparity

Junxiang Wu, Zhiqiang Kou, Hongwei Zeng, Wenke Huang, Biao Liu, Hanlin Gu, Yuheng Jia, Di Jiang, Yang Liu, Xin Geng

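AI summary This paper studies trustworthy federated label distribution learning (Fed-LDL) under uneven annotation quality and proposes FedQual, a quality-aware framework that uses a global semantic anchor to guide adaptive client training and reweights aggregation on the server by reliability, addressing the challenges posed by differing annotation quality across clients. To validate the method, the authors construct four new Fed-LDL benchmark datasets and prove theoretically that client-specific calibration is better than uniform calibration; experiments further confirm FedQual's effectiveness.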

Abstract

Label Distribution Learning (LDL) models supervision as an instance-wise probability distribution, enabling fine-grained learning under inherent ambiguity, but its success relies on high-fidelity label distributions that are costly to obtain and thus often noisy. Motivated by privacy-sensitive applications, we study Federated Label Distribution Learning (Fed-LDL), where data isolation further induces heterogeneous annotation quality across clients, making local updates unevenly reliable and breaking sample-size-based aggregation (e.g., FedAvg). To address this trust dilemma, we propose FedQual, a quality-aware Fed-LDL framework with two coupled mechanisms: (i) quality-adaptive client training guided by a global semantic anchor that calibrates low-quality clients while preserving high-quality autonomy, and (ii) reliability-aware server aggregation that reweights client contributions by effective reliable information rather than raw sample size. To enable rigorous evaluation, we construct four new Fed-LDL benchmarks (FER-LDL, FI-LDL, PIPAL-LDL, and KADID-LDL) with controlled annotation quality disparity. We further provide a theoretical guarantee showing that under heterogeneous supervision quality, client-specific calibration is strictly better than any uniform calibration. Extensive experiments on the proposed benchmarks demonstrate the effectiveness of FedQual.
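
The contrast between sample-size-based aggregation and reliability-aware aggregation can be made concrete with a toy example. This is a sketch under assumptions, not FedQual itself: the abstract does not say how "effective reliable information" is computed, so the product of sample size and a per-client `reliability` score is a hypothetical stand-in, and the one-parameter client models are made up.

```python
from typing import List, Sequence


def weighted_average(client_params: Sequence[Sequence[float]],
                     weights: Sequence[float]) -> List[float]:
    """Coordinate-wise weighted mean of client parameter vectors."""
    total = sum(weights)
    dim = len(client_params[0])
    return [
        sum(w * params[i] for w, params in zip(weights, client_params)) / total
        for i in range(dim)
    ]


client_params = [[1.0], [5.0]]   # toy one-parameter models
sample_sizes = [100, 300]
reliability = [1.0, 0.2]         # hypothetical per-client quality scores in [0, 1]

# FedAvg-style aggregation: weights are raw sample sizes.
fedavg = weighted_average(client_params, sample_sizes)

# Reliability-aware aggregation: discount each client's size by its reliability.
effective = [n * r for n, r in zip(sample_sizes, reliability)]
quality_aware = weighted_average(client_params, effective)
```

In this toy setting the large but noisy client dominates the FedAvg result (4.0), while the reliability-discounted average (2.5) stays closer to the clean client.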

2605.04738 2026-05-12 cs.LG

OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization

Zhikai Li, Zhen Dong, Xuewen Liu, Jing Zhang, Qingyi Gu

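AI summary Large language models (LLMs) have huge parameter counts, which leads to heavy resource consumption and high latency at inference. To address this, the paper proposes OSAQ, which exploits the low-rank property of the Hessian to construct an additive weight transformation that suppresses systematic outliers in the weights, improving model performance under low-bit quantization. The method requires no inter-layer transformations and adds no inference overhead, and it can be computed efficiently via a closed-form solution; experiments show that it significantly improves model quality under 2-bit quantization.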

Comments ICML 2026

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities. However, their massive parameter scale leads to significant resource consumption and latency during inference. Post-training weight-only quantization offers a promising solution by reducing model size and accelerating token generation through alleviating the memory-bound issue. Nevertheless, the presence of inherent systematic outliers in weights continues to be a major obstacle. While existing methods, such as scaling and rotation, attempt to address this issue, the performance remains unsatisfactory. In this paper, we propose Outlier Self-Absorption Quantization (OSAQ), which performs additive weight suppression guided by the second-order low-rank property for low-bit weight-only quantization of LLMs. Specifically, we observe that the Hessian exhibits low-rank consistency across different inputs, with certain directions consistently showing vanishing curvature. Leveraging this property, we identify a stable null space of the Hessian and then construct an additive weight transformation by linearly combining the vectors within this null space, thereby suppressing weight outliers without affecting the task loss. This additive transformation can be absorbed into the weights offline, requiring no inter-layer transformations and introducing no inference overhead. Moreover, the construction is efficiently achieved by a closed-form solution, without resource-intensive training or iterative procedures. Extensive experiments demonstrate that OSAQ effectively suppresses outliers and enhances low-bit quantization performance. For instance, in 2-bit quantization, OSAQ, when integrated with GPTQ, achieves over 40% lower perplexity compared to vanilla GPTQ.
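
The core idea, adding a null-space direction of the Hessian to the weights so that outliers shrink while the quadratic loss term is unchanged, can be shown on a two-dimensional toy problem. This is a sketch under assumptions: the rank-1 matrix `H`, the weight vector, and the least-squares choice of coefficient are all illustrative, and the least-squares step is a stand-in for the paper's closed-form solution, which the abstract does not spell out.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


def absorb_outlier(w: Vector, null_vec: Vector) -> Vector:
    """Add the multiple of null_vec that minimizes the squared norm of w."""
    coeff = -dot(null_vec, w) / dot(null_vec, null_vec)
    return [wi + coeff * ni for wi, ni in zip(w, null_vec)]


H = [[1.0, 1.0], [1.0, 1.0]]   # toy rank-1 "Hessian"
NULL_VEC = [1.0, -1.0]         # satisfies H @ NULL_VEC == 0
w = [3.0, 0.2]                 # 3.0 plays the role of an outlier
w_new = absorb_outlier(w, NULL_VEC)

shift = [a - b for a, b in zip(w_new, w)]
# Quadratic form shift^T H shift, which vanishes for null-space directions.
loss_change = dot(shift, matvec(H, shift))
```

Here the weights move from (3.0, 0.2) to roughly (1.6, 1.6): the largest magnitude drops, and `H @ w` is unchanged because the shift lies in the null space of `H`.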

2605.04671 2026-05-12 cs.LG

ITBoost: Information-Theoretic Trust for Robust Boosting

Ye Su, Longlong Zhao, Diego Garcia-Gil, Jipeng Guo, Gangchun Zhang, Jinxin Chen, Jinsong Chen

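AI summary Gradient boosting performs well on tabular data, but its performance degrades under label noise, mainly because it focuses on samples with large gradients without considering whether those errors come from hard-to-learn samples or from unreliable labels. To address this, the paper proposes Information-Theoretic Trust Boosting (ITBoost), which analyzes how each sample's residuals evolve across iterations, uses the Minimum Description Length principle to assess sample reliability, and down-weights unreliable samples whose residuals fluctuate sharply. Theoretical analysis shows that ITBoost has a tighter generalization bound under label noise, and experiments show that it is more robust than leading boosting and deep models on multiple tabular datasets while retaining strong performance on clean data.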

Abstract

Gradient boosting remains a strong and widely used method for tabular data learning, but its performance often degrades when training labels are noisy. This behavior is largely related to the way boosting algorithms emphasize samples with large gradients, without explicitly accounting for whether such errors originate from informative hard cases or from unreliable labels. We address this issue by reconsidering how sample reliability is evaluated during boosting. Instead of relying on instantaneous error, we examine the evolution of each sample's residuals across iterations. Based on this insight, we propose Information-Theoretic Trust Boosting (ITBoost), which uses the Minimum Description Length principle to measure the complexity of residual trajectories. Samples whose residual patterns fluctuate in an irregular manner are treated as less trustworthy and are down-weighted during learning. Theoretically, we derive a tighter generalization bound for ITBoost under label noise. Empirical results on various tabular benchmarks indicate that ITBoost provides improved robustness in noisy environments over leading boosting and deep tabular models, while retaining best average performance on clean data.

2605.04665 2026-05-12 cs.CL

Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs

Aofan Liu, Jingxiang Meng

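AI summary When the content of a request is rewritten in a semantically equivalent way, do large language models still answer in the format the original task required? The study finds that models often fail to keep the format consistent, even at temperature zero. The paper identifies "prompt-variant output-mode collapse": under closed-form prompts, semantically equivalent prompt variants can shift the model's output from a concise format to verbose conversational text, causing evaluation pipelines to misjudge the result. The authors build the PARACONSIST benchmark to measure output consistency and semantic stability across prompt variants, and find that task structure, rather than model identity, is the main predictor of collapse.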

Comments Added a footnote; author order is alphabetical by last name

Abstract

When the substantive content of a request is rewritten, do large language models still answer in the format the original task asked for? We find that they often do not, even at temperature zero. On a 150-query evaluation over five compact 2025-era LLMs and four task types, we observe a systematic failure mode we call prompt-variant output-mode collapse: when a closed-form prompt asks for a bare label or a single choice token, content-preserving prompt variants can push the model into conversational prose, the requested format dissolves, and exact-match evaluation pipelines silently misjudge the result. To make this measurable, we release PARACONSIST, a 900-prompt benchmark of 150 base queries with five lexical, syntactic, and semantic-expansion prompt variants each, and a Semantic Consistency Score that decomposes prompt-variant robustness into answer consistency, sentence-BERT semantic similarity, and length stability. Under a whole-word answer-set match, only ~22% of closed-form variant responses preserve the ground-truth label inside their output, while ~78% drift away from the answer space entirely. In our pool, the dominant predictor of collapse is task structure rather than model identity, with model differentiation jointly carried by answer consistency and length stability. Robustness audits should therefore track response-mode preservation as a first-class reliability target alongside answer accuracy.
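
The difference between strict exact-match scoring and a whole-word label match can be illustrated with a short check. This is a sketch under assumptions, not the PARACONSIST scoring code: the function names, the case-insensitive comparison, and the example responses are hypothetical. Note that the `\b` boundary used here assumes labels begin and end with word characters.

```python
import re


def exact_match(response: str, gold: str) -> bool:
    """Strict match after trimming whitespace and ignoring case."""
    return response.strip().lower() == gold.lower()


def preserves_label(response: str, gold: str) -> bool:
    """True if the gold label appears somewhere in the response as a whole word."""
    pattern = r"\b" + re.escape(gold) + r"\b"
    return re.search(pattern, response, flags=re.IGNORECASE) is not None


bare = "positive"
chatty = "Sure! I would say the sentiment here is positive overall."
drifted = "The tone is nonpositive, though it depends on context."
```

A bare label passes both checks. A conversational answer that still contains the label fails exact match but passes the whole-word check, and a response that only contains the label as part of another word fails both.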

2605.04274 2026-05-12 cs.LG cs.AI stat.ML

A Mean Curvature Approach to Boundary Detection: Geometric Insights for Unsupervised Learning

Alexandre L. M. Levada

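AI summary This paper proposes Mean Curvature Boundary Points (MCBP), a mean-curvature-based boundary detection method for unsupervised learning on high-dimensional data. The method estimates a discrete approximation of the shape operator from local k-nearest-neighbor neighborhoods and directly models the intrinsic curvature of the data manifold, so that pointwise mean curvature can be computed without explicit parametrization and used as a principled descriptor of boundary structure. The work relates high-curvature regions to cluster transitions, geometric irregularities, and low-density interfaces, introduces an adaptive percentile thresholding strategy for multiscale boundary extraction, and proposes a curvature-based data decomposition that improves cluster separability and the robustness of downstream algorithms. Experiments show that MCBP consistently improves clustering performance on synthetic and real-world datasets, especially in complex high-dimensional scenarios.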

Comments 30 pages, 6 tables, 8 figures

Abstract

Accurate boundary detection in high-dimensional data remains a central challenge in unsupervised learning, particularly in the presence of non-linear structures and heterogeneous densities. In this work, we introduce Mean Curvature Boundary Points (MCBP), a novel geometric framework grounded in Geometric Machine Learning that departs from traditional density-based approaches by explicitly modeling the intrinsic curvature of the data manifold. The method relies on a discrete approximation of the shape operator, estimated from local k-nearest neighbor patches, to compute pointwise mean curvature without requiring explicit manifold parametrization. The key insight of MCBP is to use mean curvature as a principled descriptor of boundary structure: high-curvature regions naturally correspond to transitions between clusters, geometric irregularities, and low-density interfaces. This yields a unified geometric interpretation of boundary, outlier, and transition points. We further introduce an adaptive percentile-based thresholding scheme that enables multiscale boundary extraction without relying on ad hoc density parameters. Beyond detection, we propose a curvature-driven data decomposition that separates samples into smooth (low-curvature) and boundary (high-curvature) subsets, effectively acting as a non-linear geometric filtering mechanism. This representation enhances cluster separability and improves the robustness of downstream unsupervised algorithms. Extensive experiments on synthetic and real-world datasets demonstrate that MCBP consistently improves clustering performance, particularly in complex and high-dimensional scenarios. These results position MCBP as a concrete contribution to Geometric Machine Learning, highlighting the potential of curvature-aware analysis as a unifying paradigm bridging differential geometry and data-driven modeling.
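
The percentile-based split into smooth and boundary subsets can be sketched as follows. This is a minimal illustration under assumptions: the curvature values are made up, the curvature estimation from k-nearest-neighbor patches is omitted, and the nearest-rank percentile and the 80% cutoff are hypothetical choices rather than the paper's settings.

```python
import math
from typing import List, Sequence, Tuple


def percentile_threshold(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def split_by_curvature(curvatures: Sequence[float],
                       pct: float = 80.0) -> Tuple[List[int], List[int]]:
    """Return (smooth_indices, boundary_indices) for a percentile cutoff."""
    thr = percentile_threshold(curvatures, pct)
    smooth = [i for i, c in enumerate(curvatures) if c <= thr]
    boundary = [i for i, c in enumerate(curvatures) if c > thr]
    return smooth, boundary


# Made-up pointwise mean-curvature magnitudes for ten samples.
curvatures = [0.1, 0.2, 0.15, 0.9, 0.12, 1.4, 0.18, 0.11, 0.16, 0.13]
smooth_idx, boundary_idx = split_by_curvature(curvatures, pct=80.0)
```

Sweeping `pct` over several values gives nested boundary sets at different scales, which is one way to read the multiscale extraction mentioned in the abstract.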

2605.04078 2026-05-12 cs.LG cs.AI

Validity-Calibrated Reasoning Distillation

Khouloud Saadi, Di Wang

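AI summary This study proposes validity-calibrated reasoning distillation, a method for effectively transferring the multi-step reasoning ability of large language models to smaller models. Unlike conventional methods that rely on a fixed teacher-student hierarchy and trajectory imitation, it treats reasoning distillation as a problem of local learning-signal allocation: it compares the next-step actions proposed by the student and the teacher under the same prefix and uses their relative validity to dynamically adjust the strength of the distillation update, yielding a more flexible supervision mechanism that better fits the actual reasoning process. Experiments show that the method outperforms existing distillation baselines on mathematical reasoning, code generation, and instruction-following tasks.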

Abstract

Reasoning distillation aims to transfer multi-step reasoning capabilities from large language models to smaller, more efficient ones. While recent methods have shown promising gains, they typically rely on static teacher-student hierarchies and frame distillation as trajectory imitation. This is misaligned with the structure of reasoning, where intermediate steps are often locally under-specified: global correctness constrains the final answer, but does not uniquely determine each intermediate move. We propose validity-calibrated reasoning distillation, a framework that treats reasoning distillation as a problem of local learning-signal allocation rather than path alignment. Instead of enforcing token-level imitation, we compare the student's and teacher's proposed next-step actions under the same prefix and use their relative local validity to modulate the strength of the distillation update. This yields a dynamic, context-dependent supervision mechanism that preserves the teacher's structural guidance while adapting update strength to local reasoning quality. Across mathematical reasoning, code generation, and instruction-following benchmarks, our method consistently outperforms strong distillation baselines. These results indicate that effective LLM reasoning distillation is governed not by rigid trajectory imitation, but by principled, locally calibrated allocation of learning signal.

2605.03799 2026-05-12 cs.CL

Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF

Mullosharaf K. Arabov

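AI summary This work is a systematic, research-oriented practical guide that covers the full modern NLP pipeline, including tokenisation, vectorisation, fine-tuning of large language models, retrieval-augmented generation, and reinforcement learning from human feedback. It pays particular attention to low-resource and morphologically rich languages such as Tajik and Tatar, and provides original research contributions including subword tokenisers, word embeddings, lexical databases, and transliteration benchmarks, showing how rigorous NLP can be done in data-scarce settings. The book combines theory with detailed implementation plans, emphasises reproducibility by requiring that each session's code, models, and reports be published openly, and advocates open-weight models over commercial APIs. It is aimed at senior undergraduates, graduate students, and developers who want to move from classical machine-learning methods to state-of-the-art LLM systems.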

Comments 136 pages, 12 practical works, preprint. Textbook for senior undergraduates and graduate students. Original contributions on low-resource languages (Tajik, Tatar, and others). Companion repository available

Abstract

This preprint presents a systematic, research-oriented practicum that guides the reader through the entire modern NLP pipeline: from tokenisation and vectorisation to fine-tuning of large language models, retrieval-augmented generation, and reinforcement learning from human feedback. A distinctive feature of the work is its consistent attention to low-resource and morphologically rich languages -- original contributions on Tajik and Tatar, including subword tokenisers, word embeddings, lexical databases, and transliteration benchmarks, are woven throughout the twelve sessions, demonstrating how modern NLP can be adapted to data-scarce environments without sacrificing rigour. Each session combines concise theory with detailed implementation plans, formalised evaluation metrics, and transparent assessment criteria. The work is not a conventional textbook: it is designed as a reproducible research artefact where every session requires publishing code, models, and reports in public repositories. All experiments are conducted on a single evolving corpus, and the work advocates open-weight models over commercial APIs, with special attention to the Hugging Face ecosystem. Designed for senior undergraduates, graduate students, and practising developers seeking to implement, compare, and deploy methods from classical ML to state-of-the-art LLM-based systems.

2605.03456 2026-05-12 cs.CV

VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection

Chih-Chung Liu, Zhiwei Lin, Yongtao Wang

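AI summary This study proposes VL-SAM-v3, a unified framework that aims to improve open-world object detection, particularly under fine-grained appearance variation, rare categories, and cluttered scenes. The method introduces a retrieval-based external visual memory and derives two complementary visual priors from it, one for instance-level spatial anchoring and one for class-aware local context modeling; combined with a memory-guided prompt refinement mechanism, it supports both open-vocabulary and open-ended detection. Experiments show that VL-SAM-v3 significantly improves zero-shot detection performance on LVIS, with especially strong gains on rare categories, and that the method generalizes well.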

Abstract

Open-world object detection aims to localize and recognize objects beyond a fixed closed-set label space. It is commonly divided into two categories, i.e., open-vocabulary detection, which assumes a predefined category list at test time, and open-ended detection, which requires generating candidate categories during the inference. Existing methods rely primarily on coarse textual semantics and parametric knowledge, which often provide insufficient visual evidence for fine-grained appearance variation, rare categories, and cluttered scenes. In this paper, we propose VL-SAM-v3, a unified framework that augments open-world detection with retrieval-grounded external visual memory. Specifically, once candidate categories are available, VL-SAM-v3 retrieves relevant visual prototypes from a non-parametric memory bank and transforms them into two complementary visual priors, i.e., sparse priors for instance-level spatial anchoring and dense priors for class-aware local context. These priors are integrated with the original detection prompts via Memory-Guided Prompt Refinement, enabling a shared retrieval-and-refinement mechanism that supports open-vocabulary and open-ended inference. Extensive zero-shot experiments on LVIS show that VL-SAM-v3 consistently improves detection performance under both open-vocabulary and open-ended inference, with particularly strong gains on rare categories. Moreover, experiments with a stronger open-vocabulary detector (i.e., SAM3) validate the generality of the proposed retrieval-and-refinement mechanism.

2605.03276 2026-05-12 cs.CV

VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing

Andong Deng, Dawei Du, Zhenfang Chen, Wen Zhong, Fan Chen, Guang Chen, Chia-Wen Kuo, Longyin Wen, Chen Chen, Sijie Zhu

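AI summary VEBench is a comprehensive benchmark for evaluating large multimodal models on real-world video editing tasks. It contains a large number of high-quality edited videos and human-verified question-answer pairs, and is designed to test models' ability to recognize video editing techniques and to simulate editing workflows. The study reveals that current models still fall well short of human-level video editing cognition, underscoring the urgent need to combine video understanding with creative operational reasoning.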

Comments CVPR Findings 2026

Abstract

Real-world video editing demands not only expert knowledge of cinematic techniques but also multimodal reasoning to select, align, and combine footage into coherent narratives. While recent Large Multimodal Models (LMMs) have shown remarkable progress in general video understanding, their abilities in multi-video reasoning and operational editing workflows remain largely unexplored. We introduce VEBENCH, the first comprehensive benchmark designed to evaluate both editing knowledge understanding and operational reasoning in realistic video editing scenarios. VEBENCH contains 3.9K high-quality edited videos (over 257 hours) and 3,080 human-verified QA pairs, built through a three-round human-AI collaborative annotation pipeline that ensures precise temporal labeling and semantic consistency. It features two complementary QA tasks: 1) Video Editing Technique Recognition, assessing models' ability to identify 7 editing techniques using multimodal cues; and 2) Video Editing Operation Simulation, modeling real-world editing workflows by requiring the selection and temporal localization of relevant clips from multiple candidates. Extensive experiments across proprietary (e.g., Gemini-2.5-Pro) and open-source LMMs reveal a large gap between current model performance and human-level editing cognition. These results highlight the urgent need for bridging video understanding with creative operational reasoning. We envision VEBENCH as a foundation for advancing intelligent video editing systems and driving future research on complex reasoning.

2605.02743 2026-05-12 cs.AI cs.CV cs.HC

Triple Spectral Fusion for Sensor-based Human Activity Recognition

Ye Zhang, Longguang Wang, Qing Gao, Chaocan Xiang, Mohammed Bennamoun, Yulan Guo

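AI summary This paper proposes a triple spectral fusion framework for sensor-based human activity recognition (HAR), aimed at the difficulty of fusing multi-source heterogeneous sensor data along the temporal dimension. The method combines adaptive complementary filtering, adaptive filtering in the graph Fourier domain, and wavelet frequency selection to fuse sensor data and compress features along the temporal, spatial, and frequency dimensions respectively, improving the accuracy and robustness of activity recognition. Experiments show that the framework achieves superior performance on multiple benchmark datasets.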

Abstract

The field of sensor-based human activity recognition (HAR) mainly uses posture, motion and context data of Inertial Measurement Units (IMUs) to identify daily activities. Despite the advancements in learning-based methods, it is challenging to perform information fusion from the temporal perspective due to the complexities in fusing heterogeneous sensor data and establishing long-term context correlations. This paper proposes a novel triple spectral fusion framework tailored for HAR. First, we develop an adaptive complementary filtering technique for noise suppression and organize each IMU's sensors into posture and motion modality nodes. Given that IMU nodes form a dynamic heterogeneous graph, we then apply adaptive filtering within the graph Fourier domain to merge both homogeneous and heterogeneous node information. Furthermore, an adaptive wavelet frequency selection approach is implemented to suppress context redundancy and shorten the length of features. This approach enhances both timestamp-based graph aggregation and the correlation of long-term contexts. Our framework uses adaptive filtering in the Fourier, graph Fourier, and wavelet domains, enabling effective multi-sensor fusion and context correlation. Extensive experiments on ten benchmark datasets demonstrate the superior performance of our framework. Project page: https://github.com/crocodilegogogo/TSF-TPAMI2026.
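
For readers unfamiliar with complementary filtering on IMU data, the classic fixed-coefficient form is sketched here. This is not the paper's adaptive filter: the constant blend factor `alpha`, the single-axis angle model, and the synthetic inputs are simplifying assumptions used only to show how a gyroscope integral and an accelerometer angle are blended.

```python
from typing import List, Sequence


def complementary_filter(gyro_rates: Sequence[float],
                         accel_angles: Sequence[float],
                         dt: float,
                         alpha: float = 0.98,
                         initial_angle: float = 0.0) -> List[float]:
    """Blend integrated gyro rate (short-term) with accelerometer angle (long-term)."""
    angle = initial_angle
    estimates = []
    for rate, acc_angle in zip(gyro_rates, accel_angles):
        # High-pass the gyro integral, low-pass the accelerometer reading.
        angle = alpha * (angle + rate * dt) + (1.0 - alpha) * acc_angle
        estimates.append(angle)
    return estimates


# Synthetic input: no rotation reported by the gyro, constant 10-degree tilt
# reported by the accelerometer.
steps = 500
estimates = complementary_filter([0.0] * steps, [10.0] * steps, dt=0.01)
```

With a stationary sensor the estimate starts near zero and converges toward the accelerometer angle, while short-lived gyro motion would dominate over brief windows. An adaptive variant would change `alpha` over time instead of fixing it.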

2605.02175 2026-05-12 cs.AI

Intervention Complexity as a Canonical Reward and a Measure of Intelligence

Brendan McCane

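AI summary This paper proposes "intervention complexity", a new measure of general intelligence that serves as a canonical reward in place of a conventional externally specified reward function. The measure is defined with respect to a resource function (such as program length or running time), has five natural properties including environment-derivedness, universality, and minimality, and provides a theoretical completion of the Legg--Hutter intelligence framework without requiring external normative input. The study also introduces a two-dimensional characterisation of intelligence, agent competence and learning efficiency, and proves that the choice of resource bias determines the computability of the measure; it discusses implications for superintelligence and for pre-training universal agents.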

Comments 23 pages

Abstract

The Legg--Hutter universal intelligence measure provides a rigorous scalar assessment of general intelligence as expected reward across all computable environments, weighted by simplicity. However, the measure presupposes an externally specified reward function, raising the question of whether the reward primitive is inherently arbitrary or whether a canonical choice exists. We propose a new measure, called intervention complexity, that has five natural properties: environment-derivedness, universality, minimality, sensitivity, and achievement preference. Given a resource function rho encoding an inductive bias (such as program length, execution time, or energy), rho-intervention complexity is a universal reward. The result yields a family of canonical rewards indexed by resource bias, providing a principled completion of the Legg--Hutter framework that does not require external normative input. We further propose a two-dimensional characterisation of intelligence: agent competence (how well the agent performs relative to the oracle optimum) and learning efficiency (how quickly this competence improves with experience). A separation theorem establishes that the choice of resource bias determines the computability of the resulting measure: action-count IC is computable in polynomial time, while program-length IC without oracle access is uncomputable, with the gap between oracle and bare IC precisely quantifying the information-theoretic content of learning. We discuss implications for superintelligence and for pre-training universal agents.
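
For reference, the Legg--Hutter measure that this abstract builds on is usually written as a simplicity-weighted sum of expected rewards. The notation in this block follows the standard formulation rather than anything stated in the abstract: $E$ is the class of computable environments, $K(\mu)$ is the Kolmogorov complexity of environment $\mu$, and $V_{\mu}^{\pi}$ is the expected total reward of agent $\pi$ in $\mu$.

```latex
\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V_{\mu}^{\pi}
```

The reward that defines $V_{\mu}^{\pi}$ is the externally specified ingredient the abstract refers to, and it is the piece that intervention complexity is proposed to replace.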

2605.02169 2026-05-12 cs.CV cs.DC cs.LG

Heterogeneous Model Fusion for Privacy-Aware Multi-Camera Surveillance via Synthetic Domain Adaptation

Peggy Joy Lu, Wei-Yu Chen, Yao-Tsung Huang, Vincent Shin-Mu Tseng

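AI summary This paper proposes HeroCrystal, a privacy-preserving framework for multi-camera domain-adaptive object detection that addresses challenges such as data privacy, class imbalance, and heterogeneous architectures. Through synthetic domain adaptation, the framework combines generative models, federated learning, and knowledge distillation to improve detection performance without exposing raw data. Experiments show that HeroCrystal performs strongly on multiple cross-domain detection benchmarks, significantly improving detection accuracy and reaching a new state-of-the-art mAP of 33.4%.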

Comments 42 pages, 13 figures. Published in Information Fusion (Elsevier). DOI: 10.1016/j.inffus.2026.104413

Journal ref Information Fusion, 2026
Abstract

We propose HeroCrystal, a novel privacy-preserving framework for multi-camera domain-adaptive object detection, addressing challenges such as data privacy, class imbalance, and heterogeneous architectures. Our framework consists of three key stages. In the Generated Stage, we introduce a one-shot, target-aware diffusion-based generation module that learns visual style from a single target-domain image while leveraging prompt-based control to synthesize specific object instances. Unlike conventional style transfer-based methods that require large target datasets and ignore semantic-level discrepancies, our approach enables privacy-preserving augmentation to reduce ethical concerns, and introduces controllable rare object generation to mitigate long-tailed category degradation. In the Federated Stage, we employ probabilistic Faster R-CNN on the client side to improve localization accuracy, and a dynamic model contrastive strategy to suppress domain-specific bias. The server side performs model fusion across heterogeneous architectures without accessing raw data. Finally, in the Distilled Stage, we propose an inconsistent categories integration algorithm to resolve label inconsistency and architecture heterogeneity across clients. Extensive experiments on multiple cross-domain detection benchmarks demonstrate that our method outperforms existing multi-source domain adaptation and federated learning baselines under multi-class, privacy-preserving settings. Our method improves mAP by +2.1% over prior privacy-preserving approaches and achieves a new state-of-the-art mAP of 33.4%, highlighting the effectiveness of HeroCrystal in enabling practical multi-camera AI surveillance systems. The source code is publicly available at https://github.com/ccuvislab/HeroCrystal.

2605.01529 2026-05-12 cs.RO

Good in Bad (GiB): Sifting Through End-user Demonstrations for Learning a Better Policy

Noushad Sojib, Ola Ghattas, Momotaz Begum

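AI summary This paper proposes GiB (Good-in-Bad), an algorithm for learning more robust robot policies from demonstrations by non-expert users. The method automatically identifies and removes erroneous subtasks within demonstrations while keeping the high-quality parts, which improves policy learning. GiB learns latent features with a self-supervised model and uses the Mahalanobis distance to detect low-quality segments; in both simulated and real-world settings, it improves policy performance when learning from mixed-quality demonstrations.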

Abstract

Imitation learning offers a promising framework for enabling robots to acquire diverse skills from human users. However, most imitation learning algorithms assume access to high-quality demonstrations, an unrealistic expectation when collecting data from non-expert users, whose demonstrations often contain inadvertent errors. Naively learning from such demonstrations can result in unsafe policy behavior, while discarding entire demonstrations due to occasional mistakes wastes valuable data, especially in low-data settings. In this work, we introduce GiB (Good-in-Bad), an algorithm that automatically identifies and discards erroneous subtasks within demonstrations while preserving high-quality subtasks. The filtered data can then be used by any policy learning algorithm to train more robust policies. GiB first trains a self-supervised model to learn latent features and assigns binary weights to label each demonstration as good or bad. It then models the latent feature distribution of high-quality segments and uses the Mahalanobis distance to detect and evaluate poor-quality subtasks. We validate GiB on the Franka robot in both simulated and real-world multi-step tasks, demonstrating improved policy performance when learning from mixed-quality human demonstrations.
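
The Mahalanobis-distance check on latent features can be illustrated in two dimensions, where the covariance inverse has a simple closed form. This is a sketch under assumptions: the 2-D latent points, the sample-covariance fit, and the cutoff of 3.0 are made up for illustration, and the self-supervised feature model from the abstract is not reproduced.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
Cov = Tuple[float, float, float]  # (sxx, sxy, syy)


def fit_gaussian_2d(points: Sequence[Point]) -> Tuple[Point, Cov]:
    """Sample mean and sample covariance of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_2d(x: Point, mean: Point, cov: Cov) -> float:
    """Mahalanobis distance using the closed-form inverse of a 2x2 covariance."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form d^T S^{-1} d with S^{-1} = [[syy, -sxy], [-sxy, sxx]] / det.
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)


# Made-up latent features of segments assumed to be high quality.
good_latents = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
mean, cov = fit_gaussian_2d(good_latents)

THRESHOLD = 3.0  # hypothetical cutoff
near = mahalanobis_2d((3.0, 1.0), mean, cov)
far = mahalanobis_2d((9.0, 1.0), mean, cov)
flag_near = near > THRESHOLD
flag_far = far > THRESHOLD
```

Segments whose latent features fall far from the high-quality distribution exceed the cutoff and would be discarded, while nearby segments are kept.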

2605.01507 2026-05-12 cs.AI

MILD: Mediator Agent System with Bidirectional Perception and Multi-Layered Alignment for Human-Vehicle Collaboration

Jiyao Wang, Yunbiao Wang, Yubo Jiao, Xiao Yang, Dengbo He, Sasan Jafarnejad, Luis Miranda-Moreno, Raphael Frank, Jiangbo Yu

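AI summary: This work targets the cognitive burden and limited intent understanding that hamper human-vehicle collaboration in partially automated driving, and proposes MILD, a mediator agent system that improves collaboration efficiency and safety through bidirectional perception and multi-layered alignment. MILD introduces a perception agent and a strategy agent, combined with Evidence- and Constraint-weighted Policy Optimization (ECPO), so that decisions both comply with safety regulations and meet user preferences. Experiments show that MILD outperforms existing methods in perception accuracy and strategy quality and significantly improves user trust and driving experience.

Details
English abstract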

Prior studies report that partial driving automation can increase the cognitive demands on human drivers. This effect largely arises from human drivers' lack of transparent insight into the vehicle's intentions and decision logic, as well as from automated systems' limited awareness of the driver's dynamic state and preferences. This bidirectional misalignment undermines shared situational awareness and exacerbates coordination failures in human-vehicle interaction. To address these limitations, we argue for a paradigm shift that elevates the human role from passive supervisor to active manager. We introduce the Mediator-in-the-Loop-Driving (MILD) system, based on an agentic system architecture to facilitate synergistic human-vehicle collaboration. MILD integrates a perception agent for joint in-cabin and out-of-cabin understanding with a lightweight strategy agent that generates compliant and explainable action suggestions. To ensure these strategies are strictly aligned with safety regulations and human values, we develop Evidence- and Constraint-weighted Policy Optimization (ECPO). ECPO leverages automatic validators to steer the agent toward behaviors that are not only accurate but also structurally complete, substantiated by evidence, and free from constraint violations. Furthermore, a retrieval-augmented generation module dynamically incorporates constraints from traffic regulations, speed recommendations, and driver preferences into the decision loop. Field experiments across three open datasets demonstrate that MILD consistently outperforms baselines in both perception accuracy and strategy quality under auditable offline metrics, and yields higher human-rated policy adequacy, comfort, and explanation than baselines. This work offers a practical pathway for building auditable and aligned agents for human-vehicle collaborative driving.

2605.01345 2026-05-12 cs.CV cs.AI cs.LG

The Perceptual Bandwidth Bottleneck in Vision-Language Models: Active Visual Reasoning via Sequential Experimental Design

Anjie Liu, Ziqin Gong, Yan Song, Yuxiang Chen, Xiaolong Liu, Hengtong Lu, Kaike Zhang, Chen Wei, Jun Wang

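AI summary: Modern vision-language models (VLMs) face a perceptual bandwidth bottleneck in visual perception: a wide field of view preserves global context but sacrifices the fine-grained detail needed for complex reasoning. This paper proposes active visual reasoning with sequential experimental design, optimizing visual information acquisition with task-relevant evidence acquisition as the objective. It designs FOVEA, a training-free framework that improves reasoning in high-resolution scenes through an evidence-oriented probing strategy. Experiments show that the method outperforms existing methods on several high-resolution benchmarks, with especially strong results on search-dominated tasks such as remote sensing.

Comments 27 pages, 5 figures, accepted at ICML 2026

Details
English abstract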

Visual perception in modern Vision-Language Models (VLMs) is constrained by a perceptual bandwidth bottleneck: a broad field of view preserves global context but sacrifices the fine-grained details required for complex reasoning. We argue that high-resolution visual reasoning is therefore not only semantic reasoning but also task-relevant evidence acquisition under limited perceptual bandwidth. Inspired by active vision and information foraging, we formalise this process as sequential Bayesian optimal experimental design (S-BOED), where an agent decides which visual evidence to acquire before answering. Since exact Bayesian inference is intractable in continuous gigapixel spaces, we derive a tractable coverage-resolution objective as a proxy for task-relevant information gain. We instantiate this framework with FOVEA, a training-free procedure that refines VLM crop proposals through evidence-oriented probing. Experiments on high-resolution benchmarks show consistent gains over direct and ReAct-style baselines, with particularly strong improvements in search-dominated remote-sensing settings.

2605.01323 2026-05-12 cs.CL cs.AI

SiNFluD: Creating and Evaluating Figurative Language Dataset for Sindhi

Wazir Ali, Adeeb Noor, Saifullah Tumrani

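AI summary: This paper introduces SiNFluD, a new benchmark dataset for figurative language classification in Sindhi. The authors collect raw text from blogs, social media, and literary works and annotate it with the Doccano tool, reaching an inter-annotator agreement of 0.81. Experiments use cross-validation and several pretrained models; XLM-RoBERTa-XL performs best under few-shot fine-tuning, and the dataset provides an important resource for research on Sindhi figurative language.

Details
English abstract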

In this article, we introduce SiNFluD, a novel benchmark dataset for Sindhi figurative language classification. We first collect raw text from various blogs, social media platforms, and literary sources, and subsequently prepare the corpus for annotation. Two native annotators label the data using the Doccano text annotation tool, achieving an inter-annotator agreement of 0.81. We then establish baseline results using 5-fold and 10-fold cross-validation. Finally, we evaluate mBERT, XLM-RoBERTa, and XLM-RoBERTa-XL models, along with SetFit for few-shot fine-tuning of sentence transformers. Among these, the pretrained XLM-RoBERTa-XL achieves the best performance.

2605.01011 2026-05-12 cs.CL cs.AI cs.LG

CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine

Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley A. Malin

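AI summary: This study proposes the CLEAR framework to evaluate the reliability of medical large language models in the face of noise and ambiguity. By systematically varying the number of answer options, the presence of a ground-truth or abstention option, and the semantic framing of the options, CLEAR reveals key flaws in current medical benchmarks. The study finds that as answer options increase, or as the framing of abstention shifts from assertive rejection to uncertainty admission, models' ability to identify the correct answer declines, and that this reliability problem becomes more pronounced as model scale grows.

Details
English abstract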

Medical large language model (LLM) evaluations rely on simplified, exam-style benchmarks that rarely reflect the ambiguity of real-world medical inquiries. We introduce the CLinical Evaluation of Ambiguity and Reliability (CLEAR) framework, which assesses how decision-space presentation, ambiguity, and uncertainty affect LLMs' reasoning on medical benchmarks. CLEAR systematically perturbs (1) the number of plausible answer options, (2) the presence of a ground truth or abstention option, and (3) the semantic framing of answer options. Applying CLEAR on three benchmarks evaluated across 17 LLMs reveals three notable limitations of existing evaluation methods. First, increasing the number of plausible answers degrades a model's ability to identify the correct answer and abstain against incorrect ones. Second, this lack of caution intensifies as the framing of abstention shifts from assertive rejection like "None of the Above" to uncertainty admission like "I don't know" (IDK). Notably, just including IDK in the answer space increases incorrect answer selections. Lastly, we formalize the performance gap between identifying the correct answer and abstaining from incorrect ones as the humility deficit, which worsens with model scale. Our findings reveal limitations in standard medical benchmarks and underscore that scaling alone does not resolve LLM reliability issues.

2605.00884 2026-05-12 cs.CV

LiteVLA-H: Dual-Rate Vision-Language-Action Inference for Onboard Aerial Guidance and Semantic Perception

Justin Williams, Kishor Datta Gupta, Roy George, Mrinmoy Sarkar

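AI summary: This paper proposes LiteVLA-H, a compact vision-language-action (VLA) system for low-latency closed-loop guidance and semantic perception on drones under strict compute and communication constraints. Through dual-rate operation, the system runs fast action generation alongside slower semantic understanding on an NVIDIA Jetson AGX Orin. The study finds that on edge devices multimodal pre-fill is the main contributor to end-to-end latency, designs an efficient scheduling strategy on that basis, and uses knowledge-preserving fine-tuning to improve performance on flight control and semantic description tasks.

Details
English abstract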

Vision-language-action (VLA) models have shown strong semantic grounding and task generalization in manipulation, but aerial deployment remains difficult because drones require low-latency closed-loop guidance under strict onboard compute and communication constraints. We present LiteVLA-H, a compact 256M-parameter VLA system designed for dual-rate operation on an NVIDIA Jetson AGX Orin: a fast outer-loop guidance mode for short action-token outputs and a slower semantic mode for scene understanding, hazard description, and operator-facing narration. The central empirical observation is that, in this compact edge regime, end-to-end latency is dominated by multimodal pre-fill rather than by the marginal cost of decoding a few extra tokens. This motivates a scheduler that issues reactive action tokens at 50.65 ms (19.74 Hz) while still supporting sentence-level semantic outputs at 149.90-164.57 ms (6.08-6.67 Hz) on the same embedded platform. To specialize the model without collapsing its descriptive competence, we use a knowledge-preserving fine-tuning recipe that mixes reactive flight data, aerial semantic data, and generic caption/VQA supervision. Beyond reporting current latency measurements, we position the system against recent state-of-the-art architectures, including AnywhereVLA, FutureVLA, and ReMem-VLA, showing that the measured action branch reaches a higher edge inference rate under our deployment conditions while retaining periodic semantic awareness.
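
The latency and rate figures quoted in the abstract are related by rate = 1000 / latency in milliseconds. The short sketch below only reproduces that arithmetic; the helper name `rate_hz` is an assumption and nothing here reflects the authors' scheduler.

```python
def rate_hz(latency_ms: float) -> float:
    """Convert a per-output latency in milliseconds to an output rate in hertz."""
    return 1000.0 / latency_ms

# Latencies quoted in the abstract: action tokens, then the semantic-output range.
for latency in (50.65, 149.90, 164.57):
    print(f"{latency:.2f} ms -> {rate_hz(latency):.2f} Hz")
```

The computed rates, rounded to two decimals, match the 19.74 Hz and 6.08 to 6.67 Hz values reported in the abstract.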

2605.00445 2026-05-12 cs.LG

The Power of Order: Fooling LLMs with Adversarial Table Permutations

Xinshuai Dong, Haifeng Chen, Xuyuan Liu, Shengyu Chen, Haoyu Wang, Shaoan Xie, Kun Zhang, Zhengzhang Chen

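AI summary: This paper studies the robustness of large language models (LLMs) to input structure when processing tabular data, and finds that even semantically invariant permutations of a table's rows and columns can cause incorrect or inconsistent outputs. The authors propose a gradient-based adversarial table permutation attack that efficiently finds the permutations most damaging to model performance. Experiments show that the method significantly degrades the performance of a range of LLMs, revealing a widespread vulnerability of current models in handling structured data.

Details
English abstract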

Large Language Models have achieved remarkable success and are increasingly deployed in critical applications involving tabular data, such as Table Question Answering. However, their robustness to the structure of this input remains a critical, unaddressed question. This paper demonstrates that modern LLMs exhibit a significant vulnerability to the layout of tabular data. Specifically, we show that semantically-invariant permutations of rows and columns - rearrangements that do not alter the table's underlying information - are sometimes sufficient to cause incorrect or inconsistent model outputs. To systematically probe this vulnerability, we introduce Adversarial Table Permutation, a novel, gradient-based attack that efficiently identifies worst-case permutations designed to maximally disrupt model performance. Our extensive experiments demonstrate that ATP significantly degrades the performance of a wide range of LLMs. This reveals a pervasive vulnerability across different model sizes and architectures, including the most recent and popular models. Our findings expose a fundamental weakness in how current LLMs process structured data, underscoring the urgent need to develop permutation-robust models for reliable, real-world applications.
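
To make "semantically-invariant permutation" concrete, the sketch below shuffles the rows and columns of a toy table and checks that every record, and the answer to a simple lookup, is unchanged. It uses a random shuffle, not the gradient-based ATP attack described in the abstract; the table contents and the helper names `permute_table`, `to_records`, `lookup`, and `serialize` are hypothetical.

```python
import random

def permute_table(header, rows, seed):
    """Return a copy of the table with columns and rows shuffled consistently."""
    rng = random.Random(seed)
    col_order = list(range(len(header)))
    rng.shuffle(col_order)
    new_header = [header[i] for i in col_order]
    new_rows = [[row[i] for i in col_order] for row in rows]
    rng.shuffle(new_rows)
    return new_header, new_rows

def to_records(header, rows):
    """Order-free view of the table: a set of {(column, value)} records."""
    return {frozenset(zip(header, row)) for row in rows}

def lookup(header, rows, key_col, key, target_col):
    """Answer 'what is target_col where key_col == key?' by column name."""
    k, t = header.index(key_col), header.index(target_col)
    return next(row[t] for row in rows if row[k] == key)

def serialize(header, rows):
    """Flatten the table to the kind of text an LLM prompt would contain."""
    return "\n".join(" | ".join(line) for line in [header] + rows)

# Toy table with made-up values.
header = ["item", "colour", "stock"]
rows = [["widget", "red", "12"], ["gadget", "blue", "7"], ["gizmo", "green", "30"]]

new_header, new_rows = permute_table(header, rows, seed=7)
print(serialize(new_header, new_rows))
print(to_records(header, rows) == to_records(new_header, new_rows))
```

The serialized text may differ between the two layouts even though the set of records is identical, which is the gap the abstract says an attacker can exploit.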

2604.26412 2026-05-12 cs.CL

When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?

Tianyu Liu, Yuhao Shen, Xinyi Hu, Baolin Zhang, Hengxin Zhang, Jun Dai, Jun Zhang, Shuang Ge, Lei Chen, Yue Li, MingCheng Wan

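AI summary: This study examines hidden-state drift in long-range speculative decoding, noting that current hidden-state-based drafters lose accuracy in long-range prediction. It proposes that reusing the target model's key-value (KV) cache can provide richer context and thereby improve long-range speculation. The study introduces the KVShot framework to test this, and identifies key training and structural bottlenecks of current methods, offering guidance for designing future efficient inference architectures.

Details
English abstract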

Speculative decoding accelerates LLM inference, but SOTA hidden-state-based drafters suffer from long-range decay: draft accuracy degrades as the speculative step increases. Existing work attributes this decay to train-inference mismatch and proposes test-time training (TTT) as a remedy, yet we observe that long-range decay persists even in TTT-trained drafters. We revisit long-range decay from the perspective of context information preservation. In hidden-state reuse, we argue the target hidden state acts as a biased context compression: it aggregates historical token information according to the attention query at the current position, yielding a compact representation optimized for immediate next-token prediction. This compression can suppress information less relevant to the current query but important for later speculative steps. In contrast, the target model's KV cache serves as an explicit context, retaining the complete set of token-wise KV representations. We therefore posit the KV-Reuse Hypothesis: allowing the draft model to reuse the target KV cache can provide richer signals for long-horizon drafting. To test this hypothesis, we introduce KVShot, a diagnostic framework that compares three reuse paradigms: hidden-only, KV-only, and hybrid. Extensive evaluations on Qwen3-8B show that KV-Reuse improves long-range acceptance, although end-to-end speedups remain marginal under current training pipelines. Our analysis identifies two key structural bottlenecks: shallow drafters struggle to estimate target queries accurately, and draft-side KV projections receive sparse gradient signals. These findings suggest that realizing the full potential of KV-aware decoding requires moving beyond TTT toward block-wise training paradigms. By exposing these bottlenecks, KVShot provides a foundational diagnostic testbed and a clear roadmap for designing next-generation inference architectures.

2604.26326 2026-05-12 cs.LG cs.CL stat.ML

Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control

Bolian Li, Yifan Wang, Yi Ding, Anamika Lochab, Ananth Grama, Ruqi Zhang

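AI summary: This paper studies the performance saturation that large language models (LLMs) encounter in reinforcement learning (RL) and proposes Entrocraft, a new method that addresses it by precisely controlling the entropy curve. The method is based on rejection sampling that biases the advantage distribution; it requires no regularization and works with any advantage estimator. Theoretical analysis shows that it can explain the behavior of existing RL and entropy-preserving methods and reveals the superiority of linear annealing for entropy scheduling. Experiments show that Entrocraft effectively mitigates performance saturation and significantly improves generalization, output diversity, and long-term training.

Details
English abstract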

Reinforcement learning (RL) has enabled complex reasoning abilities in large language models (LLMs). However, most RL algorithms suffer from performance saturation, preventing continued gains as RL training scales. This problem can be characterized by the collapse of entropy, a key diagnostic for exploration in RL. Existing attempts focus on preventing entropy collapse through regularization or clipping. However, their resulting entropy curves often exhibit instability in the long term, which hinders performance gains. In this paper, we introduce Entrocraft, a simple rejection-sampling approach that realizes a user-customized entropy schedule by biasing the advantage distributions. Entrocraft requires no objective regularization and is advantage-estimator-agnostic. Theoretically, we relate per-step entropy change to the advantage distribution under minimal assumptions. This explains the behavior of existing RL and entropy-preserving methods. Entrocraft also enables a systematic study of entropy schedules, which reveals that linear annealing, which starts high and decays to a slightly lower target, performs best. Empirically, Entrocraft addresses performance saturation, significantly improving generalization, output diversity, and long-term training. It enables a 4B model to outperform an 8B baseline, sustains improvement for up to 4x longer before plateauing, and raises pass@K by 50% over the baseline.
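
A linearly annealed entropy target of the kind the abstract describes can be sketched as below. Only the schedule shape comes from the abstract; the start and end values, the step counts, and the helper names `linear_entropy_target` and `shannon_entropy` are hypothetical, and the rejection-sampling mechanism that steers training toward the target is not reproduced.

```python
import math

def linear_entropy_target(step, total_steps, start, end):
    """Target entropy that decays linearly from `start` to `end` over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return start + frac * (end - start)

def shannon_entropy(probs):
    """Entropy (in nats) of one next-token distribution, the quantity being scheduled."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical schedule: start at 0.6 nats and anneal to 0.4 nats over 1000 steps.
for step in (0, 500, 1000):
    print(step, round(linear_entropy_target(step, 1000, start=0.6, end=0.4), 3))

# A peaked distribution has lower entropy than a uniform one over the same support.
print(shannon_entropy([0.9, 0.05, 0.05]) < shannon_entropy([1 / 3, 1 / 3, 1 / 3]))
```

During training, the measured policy entropy at each step would be compared against this target to decide which way the advantage distribution should be biased.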

2604.25674 2026-05-12 cs.CL

Modeling Human-Like Color Naming Behavior in Context

Yuqing Zhang, Ecesu Ürker, Tessa Verhoef, Gemma Boleda, Arianna Bisazza

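AI summary: This study aims to model human-like color naming behavior, with neural agents producing human-like lexicons under supervised learning and reinforcement learning. To address the geometric differences between the lexicons produced by existing models and human color categories, the study introduces upsampling of rare color terms and multi-listener interaction, and adopts a convexity measure to assess the geometric coherence of the lexicon. Experiments show that these methods improve lexical diversity and informativeness, bringing the generated lexicons closer to human color naming systems.

Comments Cognitive Science Society Annual Conference 2026

Details
English abstract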

Modeling the emergence of human-like lexicons in computational systems has advanced through the use of interacting neural agents, which simulate both learning and communicative pressures. The NeLLCom-Lex framework (Zhang et al., 2025) allows neural agents to develop pragmatic color naming behavior and human-like lexicons through supervised learning (SL) from human data and reinforcement learning (RL) in referential games. Despite these successes, the lexicons that emerge diverge systematically from human color categories, producing highly non-convex regions in color space, which contrast with the convexity typical of human categories. To address this, we introduce two factors, upsampling rare color terms during SL and multi-listener RL interactions, and adopt a convexity measure to quantify geometric coherence. We find that upsampling improves lexical diversity and system-level informativeness of the color lexicon, while many-listener setups promote more convex color categories. The combination of moderate upsampling and multiple listeners produces lexicons most similar to human systems.

2604.25031 2026-05-12 cs.CL cs.AI

Faithful Autoformalization via Roundtrip Verification and Repair

Daneshvar Amrollahi, Jerry Lopez, Clark Barrett

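AI summary: This paper studies how to verify the reliability of large language models when they formalize natural language. It proposes a roundtrip verification method that needs no ground-truth annotations: formalize, translate back, re-formalize, and use a formal tool to check logical equivalence, thereby judging whether the formalization is faithful. When the two formalizations disagree, the system localizes the faulty step and applies a targeted repair. Experiments show that diagnosis-guided repair is the most effective method in the two statutory domains, and that rules whose formalizations pass the equivalence check show less natural language inference drift.

Details
English abstract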

When an LLM formalizes natural language, how do we know the output is faithful? We propose a roundtrip verification approach which does not require ground-truth annotations: formalize a statement, translate the result back to natural language, re-formalize, and use a formal tool to check logical equivalence. When the two formalizations agree, this provides evidence of a faithful formalization. When they disagree, a stage-level diagnosis localizes the error to a specific translation step, and a scoped repair operator attempts to correct that step. We evaluate the framework on two statutory domains (the Texas Transportation Code and the Texas Parks and Wildlife Code) using two LLMs (Claude Opus 4.6 and GPT-5.2) with three repair baselines. Diagnosis-guided scoped repair is the most effective method, with effectiveness contingent on the reliability of the diagnosis function. Across both domains and both models, under our full repair system, rules that fail the equivalence check show 1.4x-2.5x more NLI drift than rules that pass it.
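
The final equivalence check of the roundtrip can be illustrated on propositional formulas. The sketch below substitutes a brute-force truth-table comparison for the formal tool, which the abstract does not name; the three example formulas and the helper `equivalent` are hypothetical, and the LLM translation steps are not modelled.

```python
from itertools import product

def equivalent(f, g, num_vars):
    """True if two Boolean functions agree on every assignment of their variables."""
    return all(
        f(*values) == g(*values)
        for values in product([False, True], repeat=num_vars)
    )

# First formalization of a rule of the form "if p then q".
first = lambda p, q: (not p) or q

# Re-formalization after back-translation, phrased differently but equivalent.
second = lambda p, q: not (p and not q)

# A drifted re-formalization that turned the conditional into a conjunction.
drifted = lambda p, q: p and q

print(equivalent(first, second, 2))   # the two formalizations agree
print(equivalent(first, drifted, 2))  # disagreement would trigger diagnosis and repair
```

Enumerating assignments is exponential in the number of variables, so it only suits small examples like this one.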

2604.23789 2026-05-12 cs.CV

MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation

Haojie Zhang, Di Wu, Bingyan Liu, Linjie Zhong, Yuancheng Wei, Xingsong Ye, Nanqing Liu, Yaling Liang

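AI summary: Targeting the complex multi-shot generation required by cinematic storytelling, this study proposes MuSS, a large-scale dataset, and a cinematic narrative benchmark to advance multi-shot subject-to-video (S2V) generation. To address core challenges including authentic narrative logic, spatiotemporal alignment conflicts, and the "copy-paste" dilemma, MuSS is built with progressive captioning and a cross-shot matching mechanism that ensure local accuracy and global narrative coherence. The study also introduces a new evaluation metric, ACP-Var, which measures a model's ability in continuous storytelling and 3D structural consistency. Experiments show that the dataset significantly improves narrative effectiveness and cross-shot identity preservation.

Comments 17 pages, 9 figures

Details
English abstract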

While video foundation models excel at single-shot generation, real-world cinematic storytelling inherently relies on complex multi-shot sequencing. Further progress is constrained by the absence of datasets that address three core challenges: authentic narrative logic, spatiotemporal text-video alignment conflicts, and the "copy-paste" dilemma prevalent in Subject-to-Video (S2V) generation. To bridge this gap, we introduce MuSS, a large-scale, dual-track dataset tailored for multi-shot video and S2V generation. Sourced from over 3,000 movies, MuSS explicitly supports both complex montage transitions and subject-centric narratives. To construct this dataset, we pioneer a progressive captioning pipeline that eliminates contextual conflicts by ensuring local shot-level accuracy before enforcing global narrative coherence. Crucially, we implement a cross-shot matching mechanism to fundamentally eradicate the S2V copy-paste shortcut. Alongside the dataset, we propose the Cinematic Narrative Benchmark, featuring a visual-logic-driven paradigm and a novel Anti-Copy-Paste Variance (ACP-Var) metric to rigorously assess continuous storytelling and 3D structural consistency. Extensive experiments demonstrate that while current baselines struggle with continuous narrative logic or degenerate into trivial 2D sticker generators, our MuSS-augmented model achieves state-of-the-art narrative effectiveness and cross-shot identity preservation.

2604.23693 2026-05-12 cs.RO

Decentralized Heterogeneous Multi-Robot Collaborative Exploration for Indoor and Outdoor 3D Environments

Yuxiang Li, Kun Chen, Jiancheng Wang, Shihao Fang, Haoyao Chen, Yunhui Liu

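AI summary: This paper proposes a decentralized framework for heterogeneous multi-robot collaborative exploration of indoor and outdoor 3D environments, aiming to exploit the characteristics of different robots to improve exploration efficiency. The method designs a basic perception map that fuses terrain and observation metrics, and uses improved supervoxel segmentation to simplify the map structure and support lightweight communication. By modeling the traversal and observation capabilities of heterogeneous robots, it casts task-view assignment as a heterogeneous multi-depot traveling salesman problem with robot capability constraints, solves it with an improved genetic algorithm, and finally refines exploration routes and resolves path conflicts between robots. Experiments show that the method achieves more efficient exploration and communication savings in complex environments.

Details
English abstract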

Heterogeneous multi-robot systems feature significant adaptability for complex environments. However, effective collaboration that fully exploits the robots' potential remains a core challenge. This paper proposes a decentralized collaborative framework for heterogeneous multi-robot systems to autonomously explore indoor and outdoor 3D environments. First, a basic perception map that integrates terrain and observation metrics is designed. Improved supervoxel segmentation is developed to simplify the map structure and form a high-level representation that supports lightweight communication. Second, the traversal and observation capabilities of heterogeneous robots are modeled to evaluate the requirements of task views derived from incomplete supervoxels. These task views are grouped by requirements and clustered to streamline assignment. Subsequently, the view-cluster assignment is formulated as a heterogeneous multi-depot multi-traveling salesman problem (HMDMTSP) that incorporates constraints between view-cluster requirements and robot capabilities. An improved genetic algorithm is developed to efficiently solve this problem while ensuring global consistency. Based on the assignments, redundant views within clusters are eliminated to refine exploration routes. Finally, conflicts between robots' motion paths are resolved. Simulations and field experiments in cluttered indoor and outdoor environments demonstrate that our approach effectively coordinates exploration tasks among heterogeneous robots, achieving superior exploration efficiency and communication savings compared to state-of-the-art approaches.

2604.22942 2026-05-12 cs.CV cs.AI cs.LG

VS-DDPM: Efficient Low-Cost Diffusion Model for Medical Modality Translation

Nikoo Moradi, Gijs Luijten, Behrus Hinrichs-Puladi, Jens Kleesiek, Victor Alves, Jan Egger, André Ferreira

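AI summary: This study proposes VS-DDPM, a 3D variable-step denoising diffusion probabilistic model that aims to substantially speed up inference for medical image synthesis while maintaining generative quality. The model performs well on several medical modality translation tasks, such as missing MRI reconstruction, tumor removal, and MRI-to-synthetic-CT conversion, reaching state-of-the-art levels on several metrics. Although it does not achieve the best performance on some tasks, VS-DDPM demonstrates robustness and tunability for high-fidelity medical image generation.

Details
English abstract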

Diffusion models produce high-quality synthetic data but suffer from slow inference. We propose the 3D Variable-Step Denoising Diffusion Probabilistic Model (VS-DDPM), a framework engineered to maintain generative quality while accelerating inference by several factors. We tested our approach on four tasks (missing MRI, tumor removal, MRI-to-sCT, and CBCT-to-sCT) within the BraTS2025 and SynthRAD2025 challenges. Designed for high efficiency under the hardware and time constraints imposed by both challenges, VS-DDPM achieved state-of-the-art (SOTA) performance in missing MRI synthesis, yielding Dice scores of 0.80, 0.83, and 0.88 for the enhancing tumor, tumor core, and whole tumor regions, respectively, alongside a structural similarity index (SSIM) of 0.95. For MRI tumor removal, the model attained a root mean squared error (RMSE) of 0.053, a peak signal-to-noise ratio (PSNR) of 26.77, and an SSIM of 0.918. While the framework demonstrated competitive performance in MRI-to-sCT and CBCT-to-sCT tasks, it did not reach SOTA benchmarks, potentially due to sensitivities in data pre- and post-processing pipelines or specific loss function configurations. These results demonstrate that VS-DDPM provides a robust and tunable solution for high-fidelity 3D medical image synthesis. The code is available at https://github.com/andre-fs-ferreira/SynthRAD_by_Faking_it.