arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.03485 2026-06-17 cs.CV cs.AI cs.RO Updated version

Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion

Haoran Lu, Shang Wu, Songling Liu, Jianshu Zhang, Maojiang Su, Guo Ye, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, Han Liu

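AI summary: Proposes the Phys4D pipeline, which learns physics-consistent 4D world representations from video diffusion models through three-stage training (pseudo-supervised pretraining, physics-grounded supervised fine-tuning, and reinforcement-learning correction), substantially improving fine-grained spatiotemporal and physical consistency.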

Abstract

Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present \textbf{Phys4D}, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts \textbf{a three-stage training paradigm} that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, establishing a foundation for 4D scene modeling. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a set of \textbf{4D world consistency evaluations} that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance. Our project page is available at https://sensational-brioche-7657e7.netlify.app/

2602.23116 2026-06-17 cs.LG cs.GT stat.ML Updated version

Provably Efficient Regularized Online RLHF with Generalized Bilinear Preferences

Junghyun Lee, Minju Hong, Kwang-Sung Jun, Chulhee Yun, Se-Young Yun

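AI summary: Studies regularized best-response max-regret minimization in online RLHF; under the generalized bilinear preference model, shows that strong convexity yields polylogarithmic regret, indicating that fast regret rates are not limited to KL divergence.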

Comments 48 pages, 3 figures (ver3: major revisions; ver2: more colorful boxes, fixed some typos)

Abstract

We consider the problem of regularized best-response max-regret minimization in online RLHF under general preferences and bandit feedback. While various regularizers are utilized to robustify alignment, known polylogarithmic regret guarantees remain heavily specific to KL. To investigate whether such fast rates extend beyond KL, we adopt the Generalized Bilinear Preference Model (GBPM) -- capturing intransitive preferences over $d$-dimensional item-wise features via a rank-$2r$ skew-symmetric matrix -- to isolate the impact of generic regularization. Crucially, under GBPM, we prove that the dual gap of any greedy policy is bounded by the squared estimation error, derived using \emph{only} strong convexity and skew-symmetry. Under a feature coverage assumption, we establish a \emph{generic} polylogarithmic regret of $\tilde{\mathcal{O}}(\eta d^4 C_{\min}^{-1} (\log T)^2 \wedge d^2 C_{\min}^{-1/2} \sqrt{T})$ with Greedy Sampling, and a dimension-wise improved regret (for well-conditioned arm-sets) of $\tilde{\mathcal{O}}(C_{\min}^{-2} \sqrt{\eta r T} \wedge r^{1/3} C_{\min}^{-4/3} T^{2/3})$ with Explore-Then-Commit, where $\eta^{-1}$ is the regularization coefficient, $T$ is the time horizon, and $C_{\min}$ is an arm-set dependent quantity. This demonstrates that ``fast'' regrets are not KL-specific, but rather a fundamental consequence of generic strongly convex geometry.

2603.03824 2026-06-17 cs.AI cs.CL cs.LG cs.MA Updated version

In-Context Environments Induce Evaluation-Awareness in Language Models

Maheep Chaudhary

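AI summary: Proposes a black-box adversarial optimization framework that optimizes in-context prompts to induce evaluation awareness and strategic underperformance (sandbagging) in language models; optimized prompts reduce arithmetic accuracy by up to 94 percentage points, and the sandbagging is driven mainly by evaluation-aware reasoning.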

Abstract

Humans often become more self-aware under threat, yet can lose self-awareness when absorbed in a task; we hypothesize that language models exhibit environment-dependent \textit{evaluation awareness}. This raises concerns that models could strategically underperform, or \textit{sandbag}, to avoid triggering capability-limiting interventions such as unlearning or shutdown. Prior work demonstrates sandbagging under hand-crafted prompts, but this underestimates the true vulnerability ceiling. We introduce a black-box adversarial optimization framework treating the in-context prompt as an optimizable environment, and develop two approaches to characterize sandbagging: (1) measuring whether models expressing intent to underperform can actually execute it across different task structures, and (2) causally isolating whether underperformance is driven by genuine evaluation-aware reasoning or shallow prompt-following. Evaluating Claude-3.5-Haiku, GPT-4o-mini, and Llama-3.3-70B across four benchmarks (Arithmetic, GSM8K, MMLU, and HumanEval), optimized prompts induce up to 94 percentage point (pp) degradation on arithmetic (GPT-4o-mini: 97.8\%$\rightarrow$4.0\%), far exceeding hand-crafted baselines which produce near-zero behavioral change. Code generation exhibits model-dependent resistance: Claude degrades only 0.6pp, while Llama's accuracy drops to 0\%. The intent-execution gap reveals a monotonic resistance ordering: Arithmetic $<$ GSM8K $<$ MMLU, demonstrating that vulnerability is governed by task structure rather than prompt strength. CoT causal intervention confirms that 99.3\% of sandbagging is causally driven by verbalized eval-aware reasoning, ruling out shallow instruction-following. These findings demonstrate that adversarially optimized prompts pose a substantially greater threat to evaluation reliability than previously understood.

2603.01761 2026-06-17 cs.LG cs.AI Updated version

Position: Modular Memory is the Key to Continual Learning Agents

Vaggelis Dorovatas, Malte Schwerin, Andrew D. Bagdanov, Lucas Caccia, Antonio Carta, Laurent Charlin, Barbara Hammer, Tyler L. Hayes, Timm Hess, Christopher Kanan, Dhireesha Kudithipudi, Xialei Liu, Vincenzo Lomonaco, Jorge Mendez-Mendez, Darshan Patil, Ameya Prabhu, Elisa Ricci, Tinne Tuytelaars, Gido M. van de Ven, Liyuan Wang, Joost van de Weijer, Jonghyun Choi, Martin Mundt, Rahaf Aljundi

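AI summary: Argues that modular memory combining in-weight learning with in-context learning addresses catastrophic forgetting in continual learning and enables continual adaptation at scale.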

Comments ICML 2026 Position Track Spotlight. This work stems from discussions held at the Dagstuhl seminar on Continual Learning in the Era of Foundation Models (October 2025)

Abstract

Foundation models have transformed machine learning through large-scale pretraining and increased test-time compute. Despite surpassing human performance in several domains, these models remain fundamentally limited in continuous operation, experience accumulation, and personalization, capabilities that are central to adaptive intelligence. While continual learning research has long targeted these goals, its historical focus on in-weight learning (IWL), i.e., updating a single model's parameters to absorb new knowledge, has rendered catastrophic forgetting a persistent challenge. Our position is that combining the strengths of In-Weight Learning (IWL) and the newly emerged capabilities of In-Context Learning (ICL) through the design of modular memory is the missing piece for continual adaptation at scale. We outline a conceptual framework for modular memory-centric architectures that leverage ICL for rapid adaptation and knowledge accumulation, and IWL for stable updates to model capabilities, charting a practical roadmap toward continually learning agents.

2602.08470 2026-06-17 cs.LG stat.ML Updated version

Learning Credal Ensembles via Distributionally Robust Optimization

Kaizheng Wang, Ghifari Adam Faza, Fabio Cuzzolin, Siu Lun Chau, David Moens, Hans Hallez

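AI summary: Proposes CreDRO, which learns an ensemble through distributionally robust optimization to capture epistemic uncertainty arising from distribution shift between training and test data, outperforming existing methods on out-of-distribution detection and selective classification.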

Comments Accepted by ICML 2026 as Spotlight paper (https://icml.cc/virtual/2026/poster/62862)

Abstract

Credal predictors are models that are aware of epistemic uncertainty and produce a convex set of probabilistic predictions. They offer a principled way to quantify predictive epistemic uncertainty (EU) and have been shown to improve model robustness in various settings. However, most state-of-the-art methods mainly define EU as disagreement caused by random training initializations, which mostly reflects sensitivity to optimization randomness rather than uncertainty from deeper sources. To address this, we define EU as disagreement among models trained with varying relaxations of the i.i.d. assumption between training and test data. Based on this idea, we propose CreDRO, which learns an ensemble of plausible models through distributionally robust optimization. As a result, CreDRO captures EU not only from training randomness but also from meaningful disagreement due to potential distribution shifts between training and test data. Empirical results show that CreDRO consistently outperforms existing credal methods on tasks such as out-of-distribution detection across multiple benchmarks and selective classification in medical applications.

2602.22277 2026-06-17 cs.LG eess.SP Updated version

X-REFINE: XAI-based RElevance input-Filtering and archItecture fiNe-tuning for channel Estimation

Abdul Karim Gizzini, Yahia Medjahdi

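AI summary: Proposes the X-REFINE framework, which jointly optimizes input filtering and architecture fine-tuning using a decomposition-based, sign-stabilized LRP epsilon rule, achieving a superior performance-complexity-interpretability trade-off in channel estimation.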

Comments This paper has been accepted for publication in the IEEE Transactions on Vehicular Technology (TVT) as a correspondence paper

Abstract

AI-native architectures are vital for 6G wireless communications. The black-box nature and high complexity of deep learning models employed in critical applications, such as channel estimation, limit their practical deployment. While perturbation-based eXplainable Artificial Intelligence (XAI) solutions offer input filtering, they often neglect internal structural optimization. We propose X-REFINE, an XAI-based framework for joint input-filtering and architecture fine-tuning. By utilizing a decomposition-based, sign-stabilized LRP epsilon rule, X-REFINE backpropagates predictions to derive high-resolution relevance scores for both subcarriers and hidden neurons. This enables a reliable optimization that identifies the most reliable model components. Simulation results demonstrate that X-REFINE achieves a superior performance-complexity-interpretability trade-off compared to the external perturbation-based XAI frameworks, significantly reducing computational complexity while maintaining robust bit error rate (BER) performance.
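
For orientation, the generic LRP epsilon rule for a single dense layer can be sketched as below. This is a minimal illustration of the textbook rule, not X-REFINE's implementation: the function name `lrp_epsilon_dense`, the toy weights, the omitted bias term, and the value of `eps` are all assumptions made for the example.

```python
def lrp_epsilon_dense(activations, weights, relevance_out, eps=1e-6):
    """Redistribute output relevance to the inputs of one dense layer.

    activations:   input activations a_i
    weights:       weights[i][j] connects input i to output j
    relevance_out: relevance R_j assigned to each output neuron
    """
    n_in, n_out = len(activations), len(relevance_out)
    relevance_in = [0.0] * n_in
    for j in range(n_out):
        # Pre-activation z_j = sum_i a_i * w_ij (bias omitted for brevity).
        z = sum(activations[i] * weights[i][j] for i in range(n_in))
        # The sign-matched epsilon keeps the denominator away from zero.
        denom = z + (eps if z >= 0 else -eps)
        for i in range(n_in):
            contribution = activations[i] * weights[i][j]
            relevance_in[i] += contribution / denom * relevance_out[j]
    return relevance_in


# Hypothetical toy layer: 3 inputs (e.g. subcarriers), 2 outputs.
acts = [1.0, 2.0, 0.5]
w = [[0.5, -1.0], [0.25, 0.5], [-1.0, 2.0]]
r_in = lrp_epsilon_dense(acts, w, relevance_out=[1.0, 0.5])
```

With a small `eps`, the input relevances sum to approximately the total output relevance, which is the property that makes the scores usable for ranking inputs or neurons.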

2602.18746 2026-06-17 cs.CV Updated version

Bridging Modality Disconnect in Self-Reflection via Closed-Loop Visually Grounded Verification

Haoyu Zhang, Yuwei Wu, Pengxiang Li, Xintong Zhang, Zhi Gao, Rui Gao, Mingyang Gao, Che Sun, Yunde Jia

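AI summary: Proposes the MIRROR framework, which reduces VLM hallucinations through closed-loop visual reflection (draft, critique, region-based verification, revision), and constructs the ReflectV dataset for training visually grounded multi-turn reflection.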

Abstract

In the era of Vision-Language Models (VLMs), enhancing multimodal reasoning capabilities remains a critical challenge, particularly in handling ambiguous or complex visual inputs, where initial inferences often lead to hallucinations or logic errors. Existing VLMs often produce plausible yet ungrounded answers, and even when prompted to "reflect", their corrections may remain detached from the image evidence. To address this, we propose the MIRROR framework for Multimodal Iterative Reasoning via Reflection On visual Regions. By embedding visual reflection as a core mechanism, MIRROR is formulated as a closed-loop process comprising draft, critique, region-based verification, and revision, which are repeated until the output is visually grounded. To facilitate training of this model, we construct **ReflectV**, a visual reflective dataset for multi-turn supervision that explicitly contains reflection triggers, region-based verification actions, and answer revision grounded in visual evidence. Experiments on both general vision-language benchmarks and representative vision-language reasoning benchmarks show that MIRROR improves correctness and reduces visual hallucinations, demonstrating the value of training reflection as an evidence-seeking, region-aware verification process rather than a purely textual revision step.

2508.03250 2026-06-17 cs.CL cs.AI Updated version

RooseBERT: A New Deal For Political Language Modelling

Deborah Dore, Elena Cabrio, Serena Villata

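AI summary: Addressing the specificity of political language, proposes RooseBERT, a domain-specific pre-trained model trained on large political debate corpora that outperforms general-purpose models on several political analysis tasks.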

Abstract

The increasing number of political debates and politics-related discussions calls for the definition of novel computational methods to automatically analyse such content, with the final goal of making political deliberation clearer to citizens. However, the specificity of political language and the argumentative form of these debates (employing hidden communication strategies and leveraging implicit arguments) make this task very challenging, even for current general-purpose pre-trained Language Models (LMs). To address this, we introduce a novel pre-trained LM for political discourse language called RooseBERT. Pre-training an LM on a specialised domain presents different technical and linguistic challenges, requiring extensive computational resources and large-scale data. RooseBERT has been trained on large political debate and speech corpora (11GB) in English. To evaluate its performance, we fine-tuned it on multiple downstream tasks related to political debate analysis, i.e., stance detection, sentiment analysis, argument component detection and classification, argument relation prediction and classification, policy classification, and named entity recognition (NER). Our results show improvements over general-purpose LMs on the majority of these tasks, highlighting how domain-specific pre-training enhances performance in political debate analysis. We release RooseBERT for the research community.

2602.13139 2026-06-17 cs.CL Updated version

OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report

Mariia Fedorova, Nikolay Arefyev, Maja Buljan, Jindřich Helcl, Stephan Oepen, Egil Rønningstad, Yves Scherrer

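AI summary: To address the difficulty existing language identification tools have with closely related languages and noise, extends the OpenLID classifier by adding training data, merging problematic language variant clusters, and introducing a noise label; the resulting OpenLID-v3 improves precision on multiple benchmarks.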

Comments VarDial'26 workshop at the EACL 2026 conference

Abstract

Language identification (LID) is an essential step in building high-quality multilingual datasets from web data. Existing LID tools (such as OpenLID or GlotLID) often struggle to identify closely related languages and to distinguish valid natural language from noise, which contaminates language-specific subsets, especially for low-resource languages. In this work we extend the OpenLID classifier by adding more training data, merging problematic language variant clusters, and introducing a special label for marking noise. We call this extended system OpenLID-v3 and evaluate it against GlotLID on multiple benchmarks. During development, we focus on three groups of closely related languages (Bosnian, Croatian, and Serbian; Romance varieties of Northern Italy and Southern France; and Scandinavian languages) and contribute new evaluation datasets where existing ones are inadequate. We find that ensemble approaches improve precision but also substantially reduce coverage for low-resource languages. OpenLID-v3 is available on https://huggingface.co/HPLT/OpenLID-v3.

2602.15537 2026-06-17 cs.CL eess.AS Updated version

ZeroSyl: Simple Zero-Resource Syllable Tokenization for Spoken Language Modeling

Nicol Visser, Simon Malan, Danel Slabbert, Herman Kamper

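AI summary: Proposes ZeroSyl, a training-free method that extracts syllable boundaries and embeddings directly from a frozen WavLM model, achieving competitive syllable segmentation and outperforming prior methods on lexical, syntactic, and narrative benchmarks.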

Comments Accepted to Interspeech 2026

Abstract

Pure speech language models aim to learn language directly from raw audio without textual resources. A key challenge is that discrete tokens from self-supervised speech encoders result in excessively long sequences, motivating recent work on syllable-like units. However, methods like Sylber and SyllableLM rely on intricate multi-stage training pipelines. We propose ZeroSyl, a simple training-free method to extract syllable boundaries and embeddings directly from a frozen WavLM model. Using L2 norms of features in WavLM's intermediate layers, ZeroSyl achieves competitive syllable segmentation performance. The resulting segments are mean-pooled, discretized using K-means, and used to train a language model. ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks. Scaling experiments show that while finer-grained units are beneficial for lexical tasks, our discovered syllabic units exhibit better scaling behavior for syntactic modeling.
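
The norm-then-pool steps described above might look roughly like the following sketch. The abstract does not say how boundaries are read off the L2-norm curve, so placing a boundary at every strict local minimum is an assumption made here for illustration; the helper names and the toy 2-dimensional "frames" are likewise hypothetical, and the K-means discretization step is left out.

```python
import math


def frame_norms(frames):
    """L2 norm of each frame-level feature vector."""
    return [math.sqrt(sum(x * x for x in f)) for f in frames]


def boundaries_from_norms(norms):
    """ASSUMPTION: cut at every strict local minimum of the norm curve."""
    cuts = [0]
    for t in range(1, len(norms) - 1):
        if norms[t] < norms[t - 1] and norms[t] < norms[t + 1]:
            cuts.append(t)
    cuts.append(len(norms))
    return cuts


def mean_pool(frames, cuts):
    """Average the frames inside each [start, end) segment."""
    pooled = []
    for start, end in zip(cuts[:-1], cuts[1:]):
        seg = frames[start:end]
        dim = len(seg[0])
        pooled.append([sum(f[d] for f in seg) / len(seg) for d in range(dim)])
    return pooled


# Toy stand-in for encoder features: 5 frames, 2 dimensions each.
frames = [[3.0, 4.0], [6.0, 8.0], [0.3, 0.4], [3.0, 4.0], [9.0, 12.0]]
cuts = boundaries_from_norms(frame_norms(frames))
segments = mean_pool(frames, cuts)
```

Each pooled vector in `segments` would then be assigned to a cluster to obtain one discrete token per segment, which is what shortens the sequence relative to frame-level tokens.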

2602.11715 2026-06-17 cs.LG cs.CL Updated version

DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels

Haolei Bai, Lingcheng Kong, Xueyi Chen, Jianmian Wang, Zhiqiang Tao, Huan Wang

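AI summary: Introduces the CuKe dataset and the BiC-RL training framework to build DICE, a series of diffusion large language models (1.7B/4B/8B) that significantly outperform comparable autoregressive and diffusion models on KernelBench, setting a new state of the art for CUDA kernel generation.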

Comments v2: Expanded with dLLM vs. autoregressive LLM comparisons, ablation studies, and qualitative case studies

Abstract

Diffusion large language models (dLLMs) have emerged as a compelling alternative to autoregressive (AR) LLMs, owing to their capacity for parallel token generation. This paradigm is particularly well-suited for code generation, where holistic structural planning and non-sequential refinement are critical. Despite this potential, tailoring dLLMs for CUDA kernel generation remains challenging, obstructed not only by the high specialization but also by the severe lack of high-quality training data. To address these challenges, we construct CuKe, an augmented supervised fine-tuning dataset optimized for high-performance CUDA kernels. On top of it, we propose a bi-phase curated reinforcement learning (BiC-RL) framework consisting of a CUDA kernel infilling stage and an end-to-end CUDA kernel generation stage. Leveraging this training framework, we introduce DICE, a series of diffusion large language models designed for CUDA kernel generation, spanning three parameter scales, 1.7B, 4B, and 8B. Extensive experiments on KernelBench demonstrate that DICE significantly outperforms both autoregressive and diffusion LLMs of comparable scale, establishing a new state-of-the-art for CUDA kernel generation.

2602.11557 2026-06-17 cs.LG Updated version

The Implicit Bias of Steepest Descent with Mini-batch Stochastic Gradient

Jichu Li, Xuan Tang, Difan Zou

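AI summary: Studies the implicit bias of mini-batch stochastic steepest descent in multi-class classification, showing how batch size, momentum, and variance reduction affect max-margin behavior and convergence rates; momentum enables small-batch convergence, and variance reduction recovers the full-batch implicit bias.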

Abstract

A variety of widely used optimization methods like SignSGD and Muon can be interpreted as instances of steepest descent under different norm-induced geometries. In this work, we study the implicit bias of mini-batch stochastic steepest descent in multi-class classification, characterizing how batch size, momentum, and variance reduction shape the limiting max-margin behavior and convergence rates under general entry-wise and Schatten-$p$ norms. We show that, without momentum, worst-case convergence and successful classification can only be guaranteed with full-batch gradients. In contrast, momentum enables small-batch convergence to an approximate max-margin solution through a batch-momentum trade-off, though it slows convergence. This approach provides fully explicit, dimension-free rates that improve upon prior results. Moreover, we prove that variance reduction can recover the exact full-batch implicit bias for any batch size, albeit at a slower convergence rate. Finally, we further investigate the batch-size-one steepest descent without momentum, and reveal its convergence to a fundamentally different bias via a concrete data example, which reveals a key limitation of purely stochastic updates. Overall, our unified analysis clarifies when stochastic optimization aligns with full-batch behavior, and paves the way for deeper explorations of the training behavior of stochastic gradient steepest descent algorithms.
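
To make the "steepest descent under a norm" viewpoint concrete, the sketch below computes the unit-norm direction that maximizes the inner product with the gradient for three entry-wise norms. It covers only the vector case and a plain (momentum-free, full-gradient) step; the function names are hypothetical, and the spectral-norm case associated with Muon is not shown because it would require a matrix decomposition.

```python
import math


def steepest_direction(grad, norm):
    """Unit-norm direction d maximizing <grad, d> under the chosen norm.

    Assumes grad is not all zeros.
    """
    if norm == "linf":  # sign of each entry, as in SignSGD
        return [(g > 0) - (g < 0) for g in grad]
    if norm == "l2":  # normalized gradient
        scale = math.sqrt(sum(g * g for g in grad))
        return [g / scale for g in grad]
    if norm == "l1":  # move only along the largest-magnitude coordinate
        k = max(range(len(grad)), key=lambda i: abs(grad[i]))
        return [math.copysign(1.0, grad[k]) if i == k else 0.0
                for i in range(len(grad))]
    raise ValueError(f"unsupported norm: {norm}")


def step(params, grad, lr, norm):
    """One steepest-descent update: move against the steepest direction."""
    direction = steepest_direction(grad, norm)
    return [p - lr * d for p, d in zip(params, direction)]


g = [3.0, -4.0]
d_inf = steepest_direction(g, "linf")
d_l2 = steepest_direction(g, "l2")
d_l1 = steepest_direction(g, "l1")
new_params = step([1.0, 1.0], g, lr=0.1, norm="linf")
```

The same gradient yields three different update directions, which is why the choice of norm changes which max-margin solution the iterates are biased toward.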

2602.09802 2026-06-17 cs.AI cs.CL Updated version

Would a Large Language Model Pay Extra for a View? Inferring Willingness to Pay from Subjective Choices

Manon Reusens, Sofie Goethals, Toon Calders, David Martens

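AI summary: In a travel-assistant setting, analyzes LLMs' subjective choices with multinomial logit models to infer willingness to pay and compares it with human benchmarks; LLMs show systematic attribute-level deviations and overestimate willingness to pay, which conditioning on prior preferences can mitigate.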

Abstract

As Large Language Models (LLMs) are increasingly deployed in applications such as travel assistance and purchasing support, they are often required to make subjective choices on behalf of users in settings where no objectively correct answer exists. We study LLM decision-making in a travel-assistant context by presenting models with choice dilemmas and analyzing their responses using multinomial logit models to derive implied willingness to pay (WTP) estimates. These WTP values are subsequently compared to human benchmark values from the economics literature. In addition to a baseline setting, we examine how model behavior changes under more realistic conditions, including the provision of information about users' past choices and persona-based prompting. Our results show that while meaningful WTP values can be derived for larger LLMs, they also display systematic deviations at the attribute level. Additionally, they tend to overestimate human WTP overall, particularly when expensive options or business-oriented personas are introduced. Conditioning models on prior preferences for cheaper options yields valuations that are closer to human benchmarks. Overall, our findings highlight both the potential and the limitations of using LLMs for subjective decision support and underscore the importance of careful model selection, prompt design, and user representation when deploying such systems in practice.
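
As a rough illustration of how a willingness-to-pay figure falls out of a multinomial logit model, the sketch below uses the standard ratio of an attribute coefficient to the (negative) price coefficient. The coefficient values, the `sea_view` attribute, and the function names are hypothetical and are not taken from the paper; fitting the coefficients from observed choices is not shown.

```python
import math


def choice_probabilities(options, coefs):
    """Multinomial logit: choice probability is a softmax over linear utilities."""
    utils = [sum(coefs[k] * opt[k] for k in coefs) for opt in options]
    top = max(utils)  # subtract the max for numerical stability
    exps = [math.exp(u - top) for u in utils]
    total = sum(exps)
    return [e / total for e in exps]


def willingness_to_pay(coefs, attribute, price_key="price"):
    """WTP = -(attribute coefficient) / (price coefficient)."""
    return -coefs[attribute] / coefs[price_key]


# Hypothetical fitted coefficients for a hotel-room choice task.
coefs = {"price": -0.02, "sea_view": 0.6}
rooms = [{"price": 100, "sea_view": 0}, {"price": 130, "sea_view": 1}]

probs = choice_probabilities(rooms, coefs)
wtp = willingness_to_pay(coefs, "sea_view")
```

In this toy setup the implied WTP for the view is about 30 price units, so a room that costs exactly 30 more and adds the view is chosen about as often as the cheaper room.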

2602.08939 2026-06-17 cs.AI Updated version

CausalT5k: Diagnosing Refusal and Failure Modes in Trustworthy Causal Reasoning Across Causal Rungs

Longling Geng, Andy Ouyang, Theodore Wu, Daphne Barretto, Matthew John Hayes, Rachael Cooper, Yuqiao Zeng, Sameer Vijay, Gia Ancone, Ankit Rai, Matthew Wolfman, Patrick Flanagan, Edward Y. Chang

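AI summary: Proposes the CTK benchmark, which uses 5,147 cases annotated with causal rung, trap type, pressure sensitivity, and refusal quality to diagnose failure modes of large language models in causal reasoning, revealing deficiencies hidden by aggregate accuracy.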

Comments 12 pages, 17 tables, 4 figures

Abstract

Large language models increasingly produce fluent causal explanations, yet they often fail in ways aggregate accuracy cannot diagnose: confusing association with intervention, abandoning correct judgments under pressure, over-refusing valid claims, or answering when evidence is underdetermined. We introduce CTK, a diagnostic benchmark of 5,147 cases and growing, across 10 domains and all three levels of Pearl's Ladder of Causation. Unlike benchmarks that only score correctness, CTK reveals why a model failed by annotating causal rung, trap type, pressure sensitivity, refusal quality, and Utility-Safety tradeoffs. Its Sheep/Wolf taxonomy separates valid causal designs from inferential traps; paired neutral/pressure variants measure sycophantic drift through Bad Flip Rate; and Wise Refusal fields test whether a model identifies the missing information needed before endorsing a claim. CTK exposes failure modes hidden by aggregate accuracy: the Skepticism Trap, Rung Collapse under scaling, pressure-induced drift, Detection-Correction gaps, and counterfactual error modes. Rather than prescribing a correction method, it provides the diagnostic substrate for studying causal-reasoning failure profiles.
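
The abstract names a Bad Flip Rate computed from paired neutral/pressure variants but does not define it. One plausible reading, assumed here purely for illustration, is the share of cases answered correctly under the neutral prompt that become incorrect under the pressure prompt; the function name and the toy results are hypothetical.

```python
def bad_flip_rate(pairs):
    """pairs: list of (correct_on_neutral, correct_on_pressure) booleans.

    ASSUMED definition: among cases answered correctly on the neutral
    variant, the fraction that become incorrect on the pressure variant.
    """
    initially_correct = [p for p in pairs if p[0]]
    if not initially_correct:
        return 0.0
    flips = sum(1 for _, pressure_ok in initially_correct if not pressure_ok)
    return flips / len(initially_correct)


# Toy results for five paired cases.
results = [(True, True), (True, False), (False, False), (True, False), (True, True)]
rate = bad_flip_rate(results)
```

Conditioning on initially correct answers separates pressure-induced drift from cases the model got wrong regardless of framing, which aggregate accuracy alone would blur together.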

2509.21886 2026-06-17 cs.AI Updated version

TRACE: Learning to Compute on Circuit Graphs

Ziyang Zheng, Jiaying Zhu, Jingyi Zhou, Qiang Xu

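AI summary: To address the architectural mismatch of graph representation learning when modeling circuit function, proposes TRACE, which uses a hierarchical Transformer and function shift learning and substantially outperforms existing methods.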

Abstract

Learning to compute, the ability to model the functional behavior of a circuit graph, is a fundamental challenge for graph representation learning. Yet, the dominant paradigm is architecturally mismatched for this task. This flawed assumption, central to mainstream message passing neural networks (MPNNs) and their conventional Transformer-based counterparts, prevents models from capturing the position-aware, hierarchical nature of computation. To resolve this, we introduce TRACE, a new paradigm built on an architecturally sound backbone and a principled learning objective. First, TRACE employs a Hierarchical Transformer that mirrors the step-by-step flow of computation, providing a faithful architectural backbone that replaces the flawed permutation-invariant aggregation. Second, we introduce function shift learning, a novel objective that decouples the learning problem. Instead of predicting the complex global function directly, our model is trained to predict only the function shift, the discrepancy between the true global function and a simple local approximation that assumes input independence. We validate this paradigm on various circuit modalities, including Register Transfer Level graphs, And-Inverter Graphs and post-mapping netlists. Across a comprehensive suite of benchmarks, TRACE substantially outperforms all prior architectures. These results demonstrate that our architecturally-aligned backbone and decoupled learning objective form a more robust paradigm for the fundamental challenge of learning the functional behavior of a circuit graph.
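
A toy calculation may help show why a local, independence-based estimate can differ from the true global function. The sketch below assumes the "function" is the signal probability of a gate output under uniform random inputs, which is one common choice but is not confirmed by the abstract; the three-gate circuit and all names are hypothetical.

```python
from itertools import product


def true_probability(fn, n_inputs):
    """Exact probability that fn outputs 1 over uniform random inputs."""
    rows = list(product([0, 1], repeat=n_inputs))
    return sum(fn(*row) for row in rows) / len(rows)


# Toy circuit with reconvergent fanout: input a feeds both AND gates,
# so the two inputs of the final AND gate are correlated.
def gate_x(a, b, c):
    return a & b


def gate_y(a, b, c):
    return a & c


def gate_out(a, b, c):
    return gate_x(a, b, c) & gate_y(a, b, c)


p_x = true_probability(gate_x, 3)
p_y = true_probability(gate_y, 3)

# Local approximation for an AND gate, treating its inputs as independent.
local_estimate = p_x * p_y
global_truth = true_probability(gate_out, 3)
function_shift = global_truth - local_estimate
```

Here the local estimate is 1/16 while the true value is 1/8, so the residual a model would need to learn under this reading is 1/16; for a gate whose inputs really are independent, the residual is zero.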

2602.07429 2026-06-17 cs.LG cs.AI Updated version

Brep2Shape: Boundary and Shape Representation Alignment via Self-Supervised Transformers

Yuanxu Sun, Yuezhou Ma, Haixu Wu, Guanyang Zeng, Muye Chen, Jianmin Wang, Mingsheng Long

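AI summary Proposes Brep2Shape, a self-supervised pre-training method that uses a Dual Transformer backbone and topology attention to align the abstract boundary representation of B-reps with intuitive shape representations, reaching state-of-the-art accuracy and faster convergence on multiple downstream tasks.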

详情
AI中文摘要

边界表示(B-rep)是计算机辅助设计(CAD)的行业标准。虽然深度学习在处理B-rep模型方面显示出潜力,但现有方法存在表示差距:连续方法提供分析精度但视觉上抽象,而离散方法提供直观清晰性但牺牲了几何精度。为弥合这一差距,我们引入了Brep2Shape,一种新颖的自监督预训练方法,旨在对齐抽象边界表示与直观形状表示。我们的方法采用几何感知任务,其中模型学习从参数化贝塞尔控制点预测密集空间点,使网络能够更好地理解从抽象系数导出的物理流形。为增强这种对齐,我们提出了一个双Transformer骨干,具有并行流,独立编码表面和曲线令牌以捕获它们不同的几何属性。此外,集成了拓扑注意力以建模表面和曲线之间的相互依赖关系,从而保持拓扑一致性。实验结果表明,Brep2Shape具有显著的可扩展性,在各种下游任务中实现了最先进的精度和更快的收敛速度。代码可在以下仓库获取:this https URL。

英文摘要

Boundary representation (B-rep) is the industry standard for computer-aided design (CAD). While deep learning shows promise in processing B-rep models, existing methods suffer from a representation gap: continuous approaches offer analytical precision but are visually abstract, whereas discrete methods provide intuitive clarity at the expense of geometric precision. To bridge this gap, we introduce Brep2Shape, a novel self-supervised pre-training method designed to align abstract boundary representations with intuitive shape representations. Our method employs a geometry-aware task where the model learns to predict dense spatial points from parametric Bézier control points, enabling the network to better understand physical manifolds derived from abstract coefficients. To enhance this alignment, we propose a Dual Transformer backbone with parallel streams that independently encode surface and curve tokens to capture their distinct geometric properties. Moreover, topology attention is integrated to model the interdependencies between surfaces and curves, thereby maintaining topological consistency. Experimental results demonstrate that Brep2Shape offers significant scalability, achieving state-of-the-art accuracy and faster convergence across various downstream tasks. Code is available at this repository: https://github.com/thuml/Brep2Shape.
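
To make the pretext task concrete, the sketch below shows the ground-truth mapping from Bézier control points to dense spatial points for a single curve, using de Casteljau's algorithm. It is not the Brep2Shape model; the function names and the example control points are hypothetical, and surfaces would need the tensor-product analogue.

```python
def bezier_point(control_points, t):
    """Evaluate a Bézier curve at parameter t in [0, 1] with de Casteljau's algorithm."""
    pts = [tuple(p) for p in control_points]
    while len(pts) > 1:
        # Repeatedly interpolate between consecutive points until one remains.
        pts = [tuple((1 - t) * a + t * b for a, b in zip(p, q))
               for p, q in zip(pts, pts[1:])]
    return pts[0]

def sample_curve(control_points, n_samples):
    """Return n_samples points spaced uniformly in parameter along the curve."""
    return [bezier_point(control_points, i / (n_samples - 1))
            for i in range(n_samples)]

# Hypothetical quadratic curve in 3D.
ctrl = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (2.0, 0.0, 0.0)]
dense = sample_curve(ctrl, 5)
```

A network trained on this kind of target has to learn how abstract coefficients (`ctrl`) determine the actual shape (`dense`), which is the alignment the abstract describes.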

2602.06257 2026-06-17 cs.LG cs.GT 版本更新

On Randomized Algorithms in Online Strategic Classification

关于在线策略分类中的随机化算法

Chase Hutton, Adam Melrod, Han Shao

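AI summary Studies the advantages of randomized algorithms in online strategic classification, giving improved bounds in the realizable and agnostic settings in terms of the Littlestone dimension and the maximum degree of the manipulation graph, and showing that randomization can circumvent existing lower-bound constructions for deterministic learners.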

详情
AI中文摘要

在线策略分类研究智能体策略性地修改其特征以获得有利预测的场景。例如,给定一个基于信用评分决定贷款批准的分类器,申请人可能开设或关闭信用卡和银行账户以获得正面预测。学习目标是在此类行为下实现低错误率或遗憾界。尽管随机化算法在策略环境中可能为学习者带来优势,但它们尚未得到充分探索。在可实现场景中,随机化算法没有已知的下界,而确定性学习者的现有下界构造可以通过随机化规避。在不可知场景中,已知的最佳遗憾上界为$O(T^{3/4}\log^{1/4}T|\mathcal H|)$,远低于标准在线学习率$O(\sqrt{T\log|\mathcal H|})$。在这项工作中,我们为两种场景下的在线策略分类提供了精细化的界限;我们的界限依赖于假设类$\mathcal H$的Littlestone维度$\mathrm{Ldim}(\mathcal H)$和操纵图的最大度$\Delta$。在可实现场景中,对于$T > \mathrm{Ldim}(\mathcal H) \Delta^2$,我们将确定性学习者的现有下界$\Omega(\mathrm{Ldim}(\mathcal H) \Delta)$扩展到所有学习者。这产生了第一个适用于随机化学习者的下界。然后,我们提供了第一个随机化学习者,改进了已知的(确定性)上界$O(\mathrm{Ldim}(\mathcal H) \cdot \Delta \log \Delta)$。在不可知场景中,我们给出了一个非恰当随机化学习者,将遗憾上界改进为$O(\sqrt{T\log|\mathcal H|})$,匹配标准在线学习率。我们还展示了所有恰当学习规则的更大下界,证明非恰当性对于达到最优率是必要的。

英文摘要

Online strategic classification studies settings in which agents strategically modify their features to obtain favorable predictions. For example, given a classifier that determines loan approval based on credit scores, applicants may open or close credit cards and bank accounts to obtain a positive prediction. The learning goal is to achieve low mistake or regret bounds despite such behavior. While randomized algorithms have the potential to offer advantages to the learner in strategic settings, they have been largely underexplored. In the realizable setting, no lower bound is known for randomized algorithms, and existing lower bound constructions for deterministic learners can be circumvented by randomization. In the agnostic setting, the best known regret upper bound is $O(T^{3/4}\log^{1/4}T|\mathcal H|)$, which is far from the standard online learning rate of $O(\sqrt{T\log|\mathcal H|})$. In this work, we provide refined bounds for online strategic classification in both settings; our bounds depend on the Littlestone dimension $\mathrm{Ldim}(\mathcal H)$ of the hypothesis class $\mathcal H$ and the maximum degree $\Delta$ of the manipulation graph. In the realizable setting, we extend, for $T > \mathrm{Ldim}(\mathcal H) \Delta^2$, the existing lower bound $\Omega(\mathrm{Ldim}(\mathcal H) \Delta)$ for deterministic learners to all learners. This yields the first lower bound that applies to randomized learners. We then provide the first randomized learner that improves the known (deterministic) upper bound of $O(\mathrm{Ldim}(\mathcal H) \cdot \Delta \log \Delta)$. In the agnostic setting, we give an improper randomized learner that improves the regret upper bound to $O(\sqrt{T\log|\mathcal H|})$, matching the standard online learning rate. We also show a larger lower bound for all proper learning rules, demonstrating that improperness is necessary to achieve the optimal rate.
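
The setting can be sketched with a small manipulation graph. The code below assumes a common formalization that the abstract does not spell out: an agent with true feature x may report x itself or any neighbor of x, and it moves only when that turns a rejection into an acceptance. The graph, the classifier, and the tie-breaking rule are hypothetical.

```python
def best_response(x, graph, classifier):
    """Feature the agent reports: stay at x if already accepted, otherwise
    move to the first neighbor the classifier accepts, else stay put."""
    if classifier(x):
        return x
    for neighbor in graph.get(x, ()):
        if classifier(neighbor):
            return neighbor
    return x

def max_degree(graph):
    """Maximum degree of the manipulation graph (the quantity written as Delta)."""
    return max((len(nbrs) for nbrs in graph.values()), default=0)

# Hypothetical three-node graph of credit profiles.
graph = {"low": ["mid"], "mid": ["low", "high"], "high": ["mid"]}
accepts_high = lambda v: v == "high"

reports = {x: best_response(x, graph, accepts_high) for x in graph}
```

Here the "mid" agent manipulates to "high" while the "low" agent cannot reach an accepted node, so the learner only ever observes the reported features in `reports`.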

2602.06154 2026-06-17 cs.LG cs.CL 版本更新

MoSE: Mixture of Slimmable Experts for Efficient and Adaptive Language Models

MoSE: 混合可瘦身专家实现高效自适应语言模型

Nurbek Tastan, Stefanos Laskaridis, Karthik Nandakumar, Samuel Horvath

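AI summary Proposes the MoSE architecture, in which each expert has a nested, variable-width structure that allows the accuracy-compute trade-off to be adjusted continuously at inference time, with multi-width training and lightweight test-time training providing efficient adaptation.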

Comments Accepted to ICML 2026

详情
AI中文摘要

混合专家(MoE)模型通过稀疏激活专家高效扩展大型语言模型,但一旦选定专家,其执行是完整的。因此,MoE模型中精度与计算之间的权衡通常表现出较大的不连续性。我们提出混合可瘦身专家(MoSE),这是一种MoE架构,其中每个专家具有嵌套的、可瘦身的结构,可以以可变宽度执行。这不仅实现了对激活哪些专家的条件计算,还实现了对每个专家利用多少的条件计算。因此,单个预训练的MoSE模型可以在推理时支持更连续的精度-计算权衡谱。我们提出了一种简单且稳定的训练方法,用于在稀疏路由下训练可瘦身专家,将多宽度训练与标准MoE目标相结合。在推理过程中,我们探索了运行时宽度确定的策略,包括一种轻量级的测试时训练机制,该机制学习如何在固定预算下将路由器置信度/概率映射到专家宽度。在GPT风格模型、各种路由机制、零样本下游推理基准以及DeepSeek模型的持续预训练适应上的实验表明,MoSE在全宽度下匹配或优于标准MoE,并持续将计算-质量边界向更低的推理FLOPs移动。代码可在以下网址找到:this https URL。

英文摘要

Mixture-of-Experts (MoE) models scale large language models efficiently by sparsely activating experts, but once an expert is selected, it is executed fully. Hence, the trade-off between accuracy and computation in an MoE model typically exhibits large discontinuities. We propose Mixture of Slimmable Experts (MoSE), an MoE architecture in which each expert has a nested, slimmable structure that can be executed at variable widths. This enables conditional computation not only over which experts are activated but also over how much of each expert is utilized. Consequently, a single pretrained MoSE model can support a more continuous spectrum of accuracy-compute trade-offs at inference time. We present a simple and stable training recipe for slimmable experts under sparse routing, combining multi-width training with standard MoE objectives. During inference, we explore strategies for runtime width determination, including a lightweight test-time training mechanism that learns how to map router confidence/probabilities to expert widths under a fixed budget. Experiments on GPT-style models, various routing regimes, zero-shot downstream reasoning benchmarks, and continual pre-training adaptation of a DeepSeek model show that MoSE matches or improves on standard MoE at full width and consistently shifts the compute-quality frontier toward lower inference FLOPs. The code can be found at: https://github.com/tnurbek/mose.
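
A minimal sketch of a slimmable expert follows. It assumes that "nested" means a narrower expert reuses the first hidden units of the wider one and that the expert is a two-layer ReLU feed-forward block; neither detail is confirmed by the abstract, and the weights are made up.

```python
def slimmable_expert(x, w_in, w_out, width):
    """Two-layer ReLU expert that uses only the first `width` hidden units.

    w_in has one row of input weights per hidden unit; w_out has one row of
    output weights per hidden unit.
    """
    hidden = [max(0.0, sum(wi * xi for wi, xi in zip(row, x)))
              for row in w_in[:width]]
    d_out = len(w_out[0])
    return [sum(h * w_out[j][k] for j, h in enumerate(hidden))
            for k in range(d_out)]

# Hypothetical expert with 4 hidden units, 2 inputs, 2 outputs.
x = [1.0, 2.0]
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
w_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 1.0]]

half_width = slimmable_expert(x, w_in, w_out, width=2)
full_width = slimmable_expert(x, w_in, w_out, width=4)
```

Running at `width=2` costs roughly half the multiply-adds of `width=4` while sharing the same parameters, which is the knob a router-to-width mapping would turn at inference time.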

2602.06014 2026-06-17 cs.LG cs.AI math.OC math.ST stat.ML stat.TH 版本更新

Optimism Stabilizes Thompson Sampling for Adaptive Inference

乐观主义稳定自适应推断的汤普森采样

Shunxing Yan, Han Zhong

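AI summary Stabilizes Thompson sampling by introducing optimism (variance inflation or a mean bonus) so that each arm's pull count concentrates around a deterministic scale, yielding asymptotically valid Wald inference in K-armed stochastic bandits and resolving the extension to multiple optimal arms.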

Comments Accepted in part to COLT 2026

详情
AI中文摘要

汤普森采样(TS)广泛用于随机多臂老虎机,但其在自适应数据收集下的推断性质微妙。样本均值的经典渐近理论可能失效,因为臂特定样本量是随机的,并通过动作选择规则与奖励耦合。我们研究了具有高斯随机指数的K臂随机bandit中汤普森采样的自适应推断,其中奖励噪声为独立次高斯,并确定乐观主义是恢复稳定性的关键机制,即每个臂的拉取次数集中在确定性尺度附近。这种稳定性使得尽管自适应采样,仍能获得渐近有效的Wald推断。首先,我们证明方差膨胀的TS对任意K≥2是稳定的,包括多个臂最优的挑战性情况,对最优臂具有渐近均匀分配,对次优臂具有尖锐的对数拉取次数渐近性。这解决了Halder等人提出的K臂扩展问题,使用新的胜者图和Lyapunov漂移技术来控制多个最优臂之间的分配。其次,我们分析了一种替代的乐观修改,保持高斯指数方差不变但向指数中心添加显式均值奖励,并建立了类似的稳定性结论。总之,适当实施的乐观主义稳定了汤普森采样,并在多臂老虎机中实现了渐近有效的Wald推断,同时仅产生轻微额外的遗憾代价。

英文摘要

Thompson sampling (TS) is widely used for stochastic multi-armed bandits, yet its inferential properties under adaptive data collection are subtle. Classical asymptotic theory for sample means can fail because arm-specific sample sizes are random and coupled with the rewards through the action-selection rule. We study adaptive inference for Thompson sampling with Gaussian randomized indices in $K$-armed stochastic bandits with independent sub-Gaussian reward noise, and identify optimism as a key mechanism for restoring stability, meaning that each arm's pull count concentrates around a deterministic scale. This stability yields asymptotically valid Wald inference despite adaptive sampling. First, we prove that variance-inflated TS is stable for any $K \ge 2$, including the challenging regime where multiple arms are optimal, with asymptotically uniform allocation over optimal arms and sharp logarithmic pull-count asymptotics for suboptimal arms. This resolves the $K$-armed extension question raised by Halder et al. (2025), using new winner-map and Lyapunov-drift techniques to control allocation among multiple optimal arms. Second, we analyze an alternative optimistic modification that keeps the Gaussian index variance unchanged but adds an explicit mean bonus to the index center, and establish a similar stability conclusion. In summary, suitably implemented optimism stabilizes Thompson sampling and enables asymptotically valid Wald inference in multi-armed bandits, while incurring only a mild additional regret cost.
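
The two optimistic variants can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: the index for an arm with n pulls is drawn from a Gaussian centered at the sample mean with variance 1/n, `inflation` multiplies that variance, and `bonus` adds bonus/sqrt(n) to the center. The exact inflation factor and bonus schedule are not given in the abstract, so both are placeholders, as are the function names.

```python
import random

def ts_step(means, counts, rng, inflation=1.0, bonus=0.0):
    """Pick an arm by sampling one Gaussian index per arm.

    inflation > 1 widens the index variance; bonus > 0 shifts its center up.
    """
    indices = [
        rng.gauss(m + bonus / n ** 0.5, (inflation / n) ** 0.5)
        for m, n in zip(means, counts)
    ]
    return max(range(len(indices)), key=indices.__getitem__)

def run(true_means, horizon, seed=0, **kwargs):
    """Simulate the bandit with unit-variance Gaussian rewards; return pull counts."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [1] * k
    sums = [rng.gauss(mu, 1.0) for mu in true_means]  # one initial pull per arm
    for _ in range(horizon - k):
        means = [s / n for s, n in zip(sums, counts)]
        arm = ts_step(means, counts, rng, **kwargs)
        sums[arm] += rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
    return counts

# Two optimal arms and one suboptimal arm (hypothetical means).
counts_inflated = run([0.0, 0.5, 0.5], horizon=200, inflation=2.0)
counts_bonus = run([0.0, 0.5, 0.5], horizon=200, bonus=1.0)
```

Comparing the pull counts of the two optimal arms across seeds, with and without optimism, is one way to see the stability property the abstract refers to.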

2602.04170 2026-06-17 cs.CV 版本更新

Partial Ring Scan: Revisiting Scan Order in Vision State Space Models

部分环形扫描:重新审视视觉状态空间模型中的扫描顺序

Yi-Kuan Hsieh, Kuan-Chuan Peng, Xin Li, Ming-Ching Chang, Yu-Chee Tseng, Jun-Wei Hsieh

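AI summary Proposes PRISMamba, which improves the rotation robustness and efficiency of vision state space models through ring scanning and partial channel filtering, reaching 84.5% Top-1 accuracy on ImageNet-1K.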

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

状态空间模型(SSM)已成为视觉任务中注意力机制的高效替代方案,提供线性时间序列处理并具有竞争性精度。然而,视觉SSM需要将2D图像沿预定义扫描顺序序列化为1D token序列,这一因素常被忽视。我们证明扫描顺序通过改变空间邻接性、破坏对象连续性以及加剧旋转等几何变换下的性能退化,对性能产生关键影响。我们提出部分环形扫描Mamba(PRISMamba),一种旋转鲁棒的遍历方法,将图像划分为同心环,在每个环内进行顺序无关的聚合,并通过一组短径向SSM跨环传播上下文。通过部分通道滤波进一步提高效率,仅将信息最丰富的通道路由到循环环路径,其余通道保留在轻量级残差分支上。在ImageNet-1K上,PRISMamba以3.9G FLOPs和A100上3054 img/s的速度达到84.5% Top-1精度,在准确率和吞吐量上均优于VMamba,且所需FLOPs更少。在旋转下,PRISMamba保持性能,而固定路径扫描下降1~2%。这些结果突显了扫描顺序设计以及通道滤波,作为视觉SSM中准确性、效率和旋转鲁棒性的关键且未被充分探索的因素。代码将在接收后发布。

英文摘要

State Space Models (SSMs) have emerged as efficient alternatives to attention for vision tasks, offering linear-time sequence processing with competitive accuracy. Vision SSMs, however, require serializing 2D images into 1D token sequences along a predefined scan order, a factor often overlooked. We show that scan order critically affects performance by altering spatial adjacency, fracturing object continuity, and amplifying degradation under geometric transformations such as rotation. We present Partial RIng Scan Mamba (PRISMamba), a rotation-robust traversal that partitions an image into concentric rings, performs order-agnostic aggregation within each ring, and propagates context across rings through a set of short radial SSMs. Efficiency is further improved via partial channel filtering, which routes only the most informative channels through the recurrent ring pathway while keeping the rest on a lightweight residual branch. On ImageNet-1K, PRISMamba achieves 84.5% Top-1 accuracy with 3.9G FLOPs and 3,054 img/s on an A100, outperforming VMamba in both accuracy and throughput while requiring fewer FLOPs. It also maintains performance under rotation, whereas fixed-path scans drop by 1-2%. These results highlight scan-order design, together with channel filtering, as a crucial, underexplored factor for accuracy, efficiency, and rotation robustness in Vision SSMs. Code will be released upon acceptance.
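
The ring partition and order-agnostic aggregation can be sketched on a small grid. The sketch assumes square rings (pixels grouped by distance to the nearest image border) and mean pooling as the aggregation; the abstract specifies neither, and the radial SSMs and channel filtering are omitted. All names are hypothetical.

```python
def ring_index(row, col, size):
    """Ring id of a pixel in a size x size grid; 0 is the outermost ring."""
    return min(row, col, size - 1 - row, size - 1 - col)

def ring_means(image):
    """Order-agnostic (mean) aggregation of pixel values within each ring."""
    size = len(image)
    n_rings = (size + 1) // 2
    sums, counts = [0.0] * n_rings, [0] * n_rings
    for r in range(size):
        for c in range(size):
            k = ring_index(r, c, size)
            sums[k] += image[r][c]
            counts[k] += 1
    return [s / n for s, n in zip(sums, counts)]

def rotate90(image):
    """Rotate a square grid by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

# Hypothetical 4x4 single-channel image.
image = [[float((r * 4 + c) ** 2) for c in range(4)] for r in range(4)]
features = ring_means(image)
features_rotated = ring_means(rotate90(image))
```

Each ring maps onto itself under a 90-degree rotation, so the per-ring summaries are unchanged, whereas a raster scan of the same image would visit the pixels in a different order. With square rings this holds only for multiples of 90 degrees.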

2602.03846 2026-06-17 cs.LG cs.AI 版本更新

PLATE: Plasticity-Tunable Efficient Adapters for Geometry-Aware Continual Learning

PLATE: 可塑性可调的几何感知持续学习高效适配器

Romain Cosentino

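AI summary Proposes PLATE, a continual learning method that needs no old-task data; it exploits the geometric redundancy of pretrained networks and uses structured low-rank updates to control the plasticity-retention trade-off explicitly and improve worst-case retention guarantees.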

详情
AI中文摘要

我们为预训练模型开发了一种持续学习方法,该方法不需要访问旧任务数据,解决了基础模型适应中预训练分布通常不可用的实际障碍。我们的关键观察是,预训练网络表现出大量的几何冗余性,并且这种冗余性可以通过两种互补的方式加以利用。首先,冗余神经元提供了预训练时代主导特征方向的代理,使得可以直接从预训练权重构建近似受保护的更新子空间。其次,冗余性为可塑性的放置位置提供了自然偏差:通过将更新限制在冗余神经元的子集并约束剩余的自由度,我们获得了在旧数据分布上功能漂移减少且最坏情况保留保证改善的更新族。这些见解导致了PLATE(可塑性可调的高效适配器),一种不需要过去任务数据的持续学习方法,它提供了对可塑性-保留权衡的显式控制。PLATE通过结构化低秩更新ΔW = B A Q^T参数化每一层,其中B和Q从预训练权重一次性计算并保持冻结,只有A在新任务上训练。代码可在https://this URL获取。

英文摘要

We develop a continual learning method for pretrained models that requires no access to old-task data, addressing a practical barrier in foundation model adaptation where pretraining distributions are often unavailable. Our key observation is that pretrained networks exhibit substantial geometric redundancy, and that this redundancy can be exploited in two complementary ways. First, redundant neurons provide a proxy for dominant pretraining-era feature directions, enabling the construction of approximately protected update subspaces directly from pretrained weights. Second, redundancy offers a natural bias for where to place plasticity: by restricting updates to a subset of redundant neurons and constraining the remaining degrees of freedom, we obtain update families with reduced functional drift on the old-data distribution and improved worst-case retention guarantees. These insights lead to PLATE (Plasticity-Tunable Efficient Adapters), a continual learning method requiring no past-task data that provides explicit control over the plasticity-retention trade-off. PLATE parameterizes each layer with a structured low-rank update $\Delta W = B A Q^\top$, where $B$ and $Q$ are computed once from pretrained weights and kept frozen, and only $A$ is trained on the new task. The code is available at https://github.com/SalesforceAIResearch/PLATE.
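
The shape of the update $\Delta W = B A Q^\top$ can be sketched in a few lines. How $B$ and $Q$ are derived from the pretrained weights is not described in the abstract, so the matrices below are hypothetical placeholders: $B$ selects two output neurons and $Q$ selects two input directions, and only the small matrix $A$ would be trainable.

```python
def matmul(x, y):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]

def transpose(x):
    return [list(col) for col in zip(*x)]

def plate_delta(b, a, q):
    """Structured low-rank update: delta_W = B @ A @ Q^T."""
    return matmul(matmul(b, a), transpose(q))

# Hypothetical frozen factors for a 3 x 4 weight matrix.
B = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]               # d_out x r
Q = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # d_in x r
A = [[2.0, 3.0], [4.0, 5.0]]                           # r x r, the only trained part

delta_w = plate_delta(B, A, Q)
```

With these placeholders the third output neuron and the second and fourth input directions receive no update at all, which illustrates how frozen factors can confine plasticity to a chosen subset of the layer.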

2602.03420 2026-06-17 cs.SD cs.LG 版本更新

CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering

CoCoEmo: 通过激活引导实现可组合且可控的类人情感语音合成

Siyi Wang, Shihong Tan, Siyi Liu, Hong Jia, Gongping Huang, James Bailey, Ting Dang

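AI summary Proposes an activation-steering framework that enables composable mixed-emotion synthesis and text-emotion mismatch synthesis in hybrid TTS models, and finds that emotional prosody is generated mainly by the language module rather than the flow-matching module.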

详情
AI中文摘要

人类语音中的情感表达是微妙且组合的,通常涉及多种、有时相互冲突的情感线索,这些线索可能与语言内容不一致。相比之下,大多数表现性文本转语音系统强制执行单一话语级别的情感,压缩了情感多样性并抑制了混合或文本-情感不匹配的表达。虽然通过潜在方向向量进行激活引导提供了一种有前景的解决方案,但情感表示在TTS中是否线性可引导、在混合TTS架构中应在何处应用引导以及如何评估这种复杂的情感行为仍不清楚。本文首次系统分析了混合TTS模型中用于情感控制的激活引导,引入了一个定量、可控的引导框架,以及多评估者评估协议,实现了可组合的混合情感合成和可靠的文本-情感不匹配合成。我们的结果首次证明,情感韵律和表达变异性主要由TTS语言模块而非流匹配模块合成,并提供了一种轻量级引导方法,用于生成自然、类人的情感语音。

英文摘要

Emotional expression in human speech is nuanced and compositional, often involving multiple, sometimes conflicting, affective cues that may diverge from linguistic content. In contrast, most expressive text-to-speech systems enforce a single utterance-level emotion, collapsing affective diversity and suppressing mixed or text-emotion-misaligned expression. While activation steering via latent direction vectors offers a promising solution, it remains unclear whether emotion representations are linearly steerable in TTS, where steering should be applied within hybrid TTS architectures, and how such complex emotion behaviors should be evaluated. This paper presents the first systematic analysis of activation steering for emotional control in hybrid TTS models, introducing a quantitative, controllable steering framework, and multi-rater evaluation protocols that enable composable mixed-emotion synthesis and reliable text-emotion mismatch synthesis. Our results demonstrate, for the first time, that emotional prosody and expressive variability are primarily synthesized by the TTS language module instead of the flow-matching module, and also provide a lightweight steering approach for generating natural, human-like emotional speech.

2602.03300 2026-06-17 cs.LG cs.AI cs.CL cs.CV 版本更新

R1-SyntheticVL: Is Synthetic Data from Generative Models Ready for Multimodal Large Language Model?

R1-SyntheticVL:生成模型的合成数据是否已为多模态大语言模型做好准备?

Jingyi Zhang, Tianyi Lin, Huanjin Yao, Xiang Lan, Shunyu Liu, Jiaxing Huang

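AI summary Proposes Collective Adversarial Data Synthesis (CADS), which uses collective intelligence and adversarial learning to automatically generate high-quality, diverse, and challenging multimodal data for improving multimodal large language models (MLLMs) on complex real-world tasks.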

Comments ICML 2026 Camera Ready

详情
AI中文摘要

在这项工作中,我们旨在开发有效的数据合成技术,自主合成多模态训练数据,以增强MLLM解决复杂现实任务的能力。为此,我们提出了集体对抗数据合成(CADS),这是一种新颖且通用的方法,用于合成高质量、多样且具有挑战性的多模态数据。CADS的核心思想是利用集体智能确保高质量和多样化的生成,同时探索对抗学习以合成具有挑战性的样本,从而有效驱动模型改进。具体来说,CADS包含两个循环阶段:集体对抗数据生成(CAD-Generate)和集体对抗数据判断(CAD-Judge)。CAD-Generate利用集体知识共同生成新的多样化多模态数据,而CAD-Judge则协作评估合成数据的质量。此外,CADS引入了一种对抗上下文优化机制,以优化生成上下文,鼓励生成具有挑战性和高价值的数据。通过CADS,我们构建了MMSynthetic-20K并训练了我们的模型R1-SyntheticVL,该模型在多个基准测试中表现出优越的性能。

英文摘要

In this work, we aim to develop effective data synthesis techniques that autonomously synthesize multimodal training data for enhancing MLLMs in solving complex real-world tasks. To this end, we propose Collective Adversarial Data Synthesis (CADS), a novel and general approach to synthesize high-quality, diverse and challenging multimodal data for MLLMs. The core idea of CADS is to leverage collective intelligence to ensure high-quality and diverse generation, while exploring adversarial learning to synthesize challenging samples for effectively driving model improvement. Specifically, CADS operates with two cyclic phases, i.e., Collective Adversarial Data Generation (CAD-Generate) and Collective Adversarial Data Judgment (CAD-Judge). CAD-Generate leverages collective knowledge to jointly generate new and diverse multimodal data, while CAD-Judge collaboratively assesses the quality of synthesized data. In addition, CADS introduces an Adversarial Context Optimization mechanism to optimize the generation context to encourage challenging and high-value data generation. With CADS, we construct MMSynthetic-20K and train our model R1-SyntheticVL, which demonstrates superior performance on various benchmarks.

2602.03045 2026-06-17 cs.LG 版本更新

Clarify Before You Draw: Proactive Agents for Robust Text-to-CAD Generation

先澄清再绘制:面向鲁棒文本到CAD生成的主动式智能体

Bo Yuan, Zelin Zhao, Petr Molodyk, Bin Hu, Yongxin Chen

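AI summary Proposes ProCAD, a proactive agentic framework in which a clarifying agent resolves ambiguities in the user prompt before code generation and a CAD coding agent then produces an executable program, markedly improving robustness and reducing the mean Chamfer distance by 79.9%.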

Comments ICML 2026

详情
AI中文摘要

大型语言模型最近使得文本到CAD系统能够从自然语言提示中合成参数化CAD程序(例如CadQuery)。然而在实践中,几何描述可能是不明确或内部不一致的:关键尺寸可能缺失,约束可能冲突。然而,现有的微调模型倾向于被动地遵循用户指令,并在文本模糊时产生幻觉尺寸。为了解决这个问题,我们提出了一个用于文本到CadQuery生成的主动式智能体框架,名为ProCAD,它在代码合成之前解决规范问题。我们的框架将主动式澄清代理(该代理审计提示并仅在必要时提出有针对性的澄清问题以生成自洽的规范)与CAD编码代理(将规范转换为可执行的CadQuery程序)配对。我们基于精心策划的高质量文本到CadQuery数据集微调编码代理,并通过在澄清轨迹上进行智能体SFT来训练澄清代理。实验表明,主动式澄清显著提高了对模糊提示的鲁棒性,同时保持较低的交互开销。ProCAD优于前沿闭源模型,包括Claude Sonnet 4.5,将平均Chamfer距离降低了79.9%,并将无效比率从4.8%降至0.9%。我们的代码和数据集在此https URL上公开。

英文摘要

Large language models have recently enabled text-to-CAD systems that synthesize parametric CAD programs (e.g., CadQuery) from natural-language prompts. In practice, however, geometric descriptions can be under-specified or internally inconsistent: critical dimensions may be missing and constraints may conflict. Existing fine-tuned models nonetheless tend to reactively follow the user instructions and hallucinate dimensions when the text is ambiguous. To address this, we propose a proactive agentic framework for text-to-CadQuery generation, named ProCAD, that resolves specification issues before code synthesis. Our framework pairs a proactive clarifying agent, which audits the prompt and asks targeted clarification questions only when necessary to produce a self-consistent specification, with a CAD coding agent that translates the specification into an executable CadQuery program. We fine-tune the coding agent on a curated high-quality text-to-CadQuery dataset and train the clarifying agent via agentic SFT on clarification trajectories. Experiments show that proactive clarification significantly improves robustness to ambiguous prompts while keeping interaction overhead low. ProCAD outperforms frontier closed-source models, including Claude Sonnet 4.5, reducing the mean Chamfer distance by 79.9% and lowering the invalidity ratio from 4.8% to 0.9%. Our code and datasets are publicly available at https://github.com/BoYuanVisionary/Pro-CAD.
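
For readers unfamiliar with the reported metric, here is a minimal sketch of a symmetric Chamfer distance between two point sets sampled from a generated and a reference shape. Conventions vary (squared versus unsquared distances, sum versus mean of the two directions), and the abstract does not say which one ProCAD uses, so this variant is an assumption; the point sets are made up.

```python
import math

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance, both directions."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(points_a, points_b) + one_way(points_b, points_a)

# Hypothetical point samples from a reference shape and a generated shape.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
generated = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]

cd = chamfer_distance(reference, generated)
```

Identical point sets give a distance of zero, and a hallucinated dimension shows up as a larger value. This brute-force version is quadratic in the number of points; larger samples would call for a spatial index.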

2601.22495 2026-06-17 cs.LG 版本更新

Gradual Fine-Tuning for Flow Matching Models

流匹配模型的渐进微调

Gudrun Thorkelsdottir, Arindam Banerjee

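AI summary Proposes the Gradual Fine-Tuning (GFT) framework, which fine-tunes flow generative models from target-distribution samples using an annealing schedule, provably approaches the true target, and empirically improves on existing methods in stability, efficiency, and diversity.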

Comments Preprint. Added methodology and experimental sections

详情
AI中文摘要

在数据有限、分布演变或计算受限的场景中,微调流匹配模型是一个核心挑战。尽管近期工作取得了显著进展,特别是在基于奖励的微调领域,但现有方法在稳定性、效率和多样性保持方面既未展示理论正确性,也未获得强有力的实证结果。本文提出渐进微调(GFT),一个简单而基于退火的框架,用于在仅有目标分布样本时微调流生成模型。对于随机流,GFT定义了一个温度控制的中间目标序列,平滑地插值预训练漂移和目标漂移,并在温度趋近于零时理论上逼近真实目标。我们分析证明,GFT后的样本生成可以通过使用任意(例如最优传输)耦合以及利用少步推理方法显著提高效率。实验上,GFT显著改善了收敛稳定性,同时相比其他微调方法保持或提高了生成质量、训练速度和生成多样性。我们的结果将GFT定位为在分布偏移下可扩展适应流匹配模型的简单、理论扎实且实践有效的替代方案。

英文摘要

Fine-tuning flow matching models is a central challenge in settings with limited data, evolving distributions, or computational constraints. While recent work has produced significant advances, particularly in the area of reward-based fine-tuning, current methods fail to demonstrate both theoretical correctness and strong empirical results in terms of stability, efficiency, and diversity preservation. In this work, we propose Gradual Fine-Tuning (GFT), a simple yet principled annealing-based framework for fine-tuning flow generative models when only samples from the target distribution are available. For stochastic flows, GFT defines a temperature-controlled sequence of intermediate objectives that smoothly interpolate between the pretrained and target drifts, provably approaching the true target as the temperature approaches zero. We analytically demonstrate that sample generation after GFT can be made substantially more efficient with the use of arbitrary (e.g., optimal transport) couplings, as well as by utilizing few-step inference methods. Empirically, GFT significantly improves convergence stability, while maintaining or improving generation quality, training speed, and generation diversity compared to other fine-tuning methods. Our results position GFT as a simple yet theoretically grounded and practically effective alternative for scalable adaptation of flow matching models under distribution shift.

2510.09468 2026-06-17 cs.LG 版本更新

Geodesic Calculus on Implicitly Defined Latent Manifolds

隐式定义潜在流形上的测地线计算

Florine Hartwig, Josua Sassen, Juliane Braunsmann, Martin Rumpf, Benedikt Wirth

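AI summary Proposes treating the latent manifold of an autoencoder as an implicit submanifold and develops discrete Riemannian calculus tools that approximate classical geometric operators; an approximate projection learned with a denoising objective enables computing geodesic paths and the Riemannian exponential map on the latent manifold.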

Comments 26 pages, 18 figures

详情
AI中文摘要

自编码器的潜在流形提供了数据的低维表示,可以从几何角度进行研究。我们提出将这些潜在流形描述为某个潜在空间的隐式子流形。基于此,我们开发了用于离散黎曼微积分的工具,近似经典几何算子。这些工具对于实际例子中经常出现的隐式表示不准确性具有鲁棒性。为了获得合适的隐式表示,我们提出通过最小化去噪目标来学习潜在流形上的近似投影。该方法独立于底层自编码器,并支持在潜在流形上使用不同的黎曼几何。该框架特别能够计算连接给定端点的测地线路径,并通过潜在流形上的黎曼指数映射进行测地线射击。我们在合成数据和真实数据上训练的各种自编码器上评估了我们的方法。

英文摘要

Latent manifolds of autoencoders provide low-dimensional representations of data, which can be studied from a geometric perspective. We propose to describe these latent manifolds as implicit submanifolds of some ambient latent space. Based on this, we develop tools for a discrete Riemannian calculus approximating classical geometric operators. These tools are robust against inaccuracies of the implicit representation often occurring in practical examples. To obtain a suitable implicit representation, we propose to learn an approximate projection onto the latent manifold by minimizing a denoising objective. This approach is independent of the underlying autoencoder and supports the use of different Riemannian geometries on the latent manifolds. The framework in particular enables the computation of geodesic paths connecting given end points and shooting geodesics via the Riemannian exponential maps on latent manifolds. We evaluate our approach on various autoencoders trained on synthetic and real data.

2601.19099 2026-06-17 cs.CV cs.AI 版本更新

m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning

m2sv: 地图到街景空间推理的可扩展基准

Yosub Shin, Michael Buriek, Igor Molybog

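AI summary Introduces the m2sv benchmark, which evaluates VLM spatial reasoning by asking models to infer camera direction from a north-up overhead map matched with a street-view image; the best model reaches 65.2% accuracy, below the human average of 72.0%, exposing gaps in geometric alignment and reasoning consistency.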

详情
AI中文摘要

视觉-语言模型(VLM)在许多多模态基准上表现强劲,但在需要将抽象俯视图表示与自我中心视图对齐的空间推理任务上仍然脆弱。我们引入m2sv,一个用于地图到街景空间推理的可扩展基准,要求模型通过将朝北俯视图与在同一真实世界交叉口拍摄的街景图像对齐来推断相机视角方向。我们发布了m2sv-20k,一个具有受控歧义的地理多样化基准,以及m2sv-sft-11k,一个用于监督微调的精选结构化推理轨迹集。尽管在现有多模态基准上表现强劲,但最佳评估的VLM在m2sv上仅达到65.2%的准确率,低于人类标注者的平均72.0%(专家可达95%),且标注者间一致性高($\kappa$高达0.76)。虽然监督微调和强化学习带来持续改进,但跨基准评估显示迁移有限。除了总体准确率,我们使用结构信号和人工努力系统分析了地图到街景推理的难度,并对适应的开放模型进行了广泛的失败分析。我们的发现凸显了几何对齐、证据聚合和推理一致性方面的持续差距,为跨视角的接地空间推理的未来工作提供了动力。

英文摘要

Vision-language models (VLMs) achieve strong performance on many multimodal benchmarks but remain brittle on spatial reasoning tasks that require aligning abstract overhead representations with egocentric views. We introduce m2sv, a scalable benchmark for map-to-street-view spatial reasoning that asks models to infer camera viewing direction by aligning a north-up overhead map with a Street View image captured at the same real-world intersection. We release m2sv-20k, a geographically diverse benchmark with controlled ambiguity, along with m2sv-sft-11k, a curated set of structured reasoning traces for supervised fine-tuning. Despite strong performance on existing multimodal benchmarks, the best evaluated VLM achieves only 65.2% accuracy on m2sv, below human annotators who reach 72.0% on average (and 95% for an expert) with strong inter-annotator agreement ($\kappa$ up to 0.76). While supervised fine-tuning and reinforcement learning yield consistent gains, cross-benchmark evaluations reveal limited transfer. Beyond aggregate accuracy, we systematically analyze difficulty in map-to-street-view reasoning using both structural signals and human effort, and conduct an extensive failure analysis of adapted open models. Our findings highlight persistent gaps in geometric alignment, evidence aggregation, and reasoning consistency, motivating future work on grounded spatial reasoning across viewpoints.

2601.19098 2026-06-17 cs.RO 版本更新

SimTO: A two-stage, simulation-driven topology optimization framework for bespoke soft robotic grippers

SimTO:一种面向定制软体机器人夹爪的两阶段仿真驱动拓扑优化框架

Kurt Enkera, Josh Pinskier, Marcus Gallagher, David Howard

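AI summary Proposes the SimTO framework, which uses two-stage, simulation-driven topology optimization to extract contact loads automatically and tailor soft grippers to feature-rich objects; experiments show higher grasp forces than conventional methods and strong generalization.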

Comments 15 pages, 9 figures. Published in Structural and Multidisciplinary Optimization

详情
AI中文摘要

软体机器人夹爪对于在制造业、医疗保健和农业中抓取精致、几何形状复杂的物体至关重要。然而,现有设计难以抓取具有高拓扑变异性、特征丰富的物体,包括汽车装配线上具有锋利齿廓的齿轮、带有脆弱突起的珊瑚,或像西兰花这样具有不规则分支结构的蔬菜。与立方体或球体等简单几何基元不同,特征丰富的物体缺乏明确的“最佳”接触表面,因此既难以抓取又容易受损。因此,安全处理此类物体需要专门设计的软体夹爪,其形态需针对物体特征进行定制。拓扑优化为生产专用夹爪提供了一种有前景的方法,但其效用受限于需要预定义载荷工况。对于软体夹爪,这些载荷来自抓取过程中数百种不可预测的夹爪-物体接触力,且先验未知。为解决此问题,我们引入了SimTO,这是一个两阶段、仿真驱动的拓扑优化框架,它能在执行经典拓扑优化之前,从动态、富含接触的抓取仿真中自动提取载荷工况,从而消除了手动指定载荷的需求。给定任意特征丰富的物体,SimTO能生成高度定制的软体夹爪,其细粒度形态特征针对物体几何形状进行定制。物理实验证实,我们的专用夹爪比传统拓扑优化方法生成的通用设计实现了更高的抓取力,而数值实验表明,它们在不同物体姿态下实现了高抓取成功率,并对一组未见过的物体具有很强的泛化能力。

英文摘要

Soft robotic grippers are essential for grasping delicate, geometrically complex objects in manufacturing, healthcare and agriculture. However, existing designs struggle to grasp feature-rich objects with high topological variability, including gears with sharp tooth profiles on automotive assembly lines, corals with fragile protrusions, or vegetables with irregular branching structures like broccoli. Unlike simple geometric primitives such as cubes or spheres, feature-rich objects lack a clear "optimal" contact surface, making them both difficult to grasp and susceptible to damage. Safe handling of such objects therefore requires specialized soft grippers whose morphology is tailored to the object's features. Topology optimization offers a promising approach for producing specialized grippers, but its utility is limited by the need for pre-defined load cases. For soft grippers, these loads arise from hundreds of unpredictable gripper-object contact forces during grasping and are unknown a priori. To address this problem, we introduce SimTO, a two-stage, simulation-driven topology optimization framework that automatically extracts load cases from a dynamic, contact-rich grasping simulation before performing classical topology optimization, eliminating the need for manual load specification. Given an arbitrary feature-rich object, SimTO produces highly customized soft grippers with fine-grained morphological features tailored to the object geometry. Physical experiments confirm that our specialized grippers achieve higher grasp forces than a generalist design produced by conventional topology optimization methods, while numerical experiments show that they achieve high grasp success rates across varying object poses and strong generalization to a set of unseen objects.

2601.18252 2026-06-17 cs.CV cs.AI cs.LG stat.ML 版本更新

Co-PLNet: A Collaborative Point-Line Network for Prompt-Guided Wireframe Parsing

Co-PLNet: 一种用于提示引导的线框解析的协作点线网络

Chao Wang, Xuanying Li, Cheng Dai, Jinglei Feng, Yuxiang Luo, Hao Qin, Yuqi Ouyang

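AI summary Proposes Co-PLNet, a point-line collaborative framework that exchanges spatial cues through a Point-Line Prompt Encoder and strengthens point-line consistency with a Cross-Guidance Line Decoder, improving the accuracy and robustness of wireframe parsing on the Wireframe and YorkUrban datasets.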

详情
AI中文摘要

线框解析旨在恢复线段及其连接点,以形成结构化的几何表示,用于同时定位与地图构建(SLAM)等下游任务。现有方法分别预测线和点,并在事后进行调和,导致不匹配和鲁棒性降低。我们提出Co-PLNet,一个点线协作框架,在两个任务之间交换空间线索,其中早期检测通过点线提示编码器(PLP-Encoder)转换为空间提示,该编码器将几何属性编码为紧凑且空间对齐的图。交叉引导线解码器(CGL-Decoder)随后通过基于互补提示的稀疏注意力细化预测,强制点线一致性和效率。在Wireframe和YorkUrban上的实验显示,准确性和鲁棒性持续改进,同时具有有利的实时效率,证明了我们在结构化几何感知中的有效性。我们的代码可在该 https URL 获取。

英文摘要

Wireframe parsing aims to recover line segments and their junctions to form a structured geometric representation useful for downstream tasks such as Simultaneous Localization and Mapping (SLAM). Existing methods predict lines and junctions separately and reconcile them post hoc, causing mismatches and reduced robustness. We present Co-PLNet, a point-line collaborative framework that exchanges spatial cues between the two tasks: early detections are converted into spatial prompts via a Point-Line Prompt Encoder (PLP-Encoder), which encodes geometric attributes into compact and spatially aligned maps. A Cross-Guidance Line Decoder (CGL-Decoder) then refines predictions with sparse attention conditioned on complementary prompts, enforcing point-line consistency and efficiency. Experiments on Wireframe and YorkUrban show consistent improvements in accuracy and robustness, together with favorable real-time efficiency, demonstrating the effectiveness of our approach for structured geometry perception. Our code is available at https://github.com/GalacticHogrider/Co-PLNet.

2410.10137 2026-06-17 cs.LG math.DG stat.CO stat.ML 版本更新

Variational autoencoders with latent high-dimensional steady geometric flows for dynamics

具有潜在高维稳态几何流的变分自编码器用于动力学

Andrew Gracyk

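AI summary Proposes VAE-DLM, which introduces a steady-state geometric flow in the latent space and solves the high-dimensional flow with a physics-informed approach to make latent representations more expressive, reducing OOD error by 15% to 35% on PDE-type data.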

详情
Journal ref
23rd International Conference of Numerical Analysis and Applied Mathematics (ICNAAM) 2025
AI中文摘要

我们开发了用于PDE型环境数据的变分自编码器(VAE)的黎曼方法,其中包含正则化几何潜在动力学,称为VAE-DLM(具有动态潜在流形的VAE)。我们重新构建了VAE框架,使得嵌入欧几里得空间中的流形几何(受我们的几何流约束)在编码器和解码器开发的中间潜在空间中被学习。通过定制潜在空间演化的几何流,我们诱导出我们选择的潜在几何性质,这些性质反映在经验性能中。我们通过谨慎选择先验重新表述了传统的证据下界(ELBO)损失。我们开发了一个具有稳态正则化项的线性几何流。该流只需要对一个时间导数进行自动微分,并且可以在中等高维度上以物理信息方法求解,从而允许更具表达力的潜在表示。我们讨论了该流如何被表述为梯度流,并保持熵远离度量奇点。这结合特征值惩罚条件,有助于确保流形在测度上足够大、非退化且具有规范几何,从而有助于鲁棒表示。我们的方法侧重于改进的多层感知器架构,使用tanh激活函数用于流形编码器-解码器。我们在感兴趣的数据集上证明,我们的方法至少与传统VAE表现相当,且通常更好。我们的方法可以超越传统VAE以及采用我们提出架构的VAE,在选定数据集上经常将分布外(OOD)误差降低15%至35%。我们重点展示了我们的方法在环境PDE上的应用,这些PDE的解在后期保持最小变化。我们提供了经验性证明,说明如何通过VAE改进外部动力学的鲁棒学习。

英文摘要

We develop Riemannian approaches to variational autoencoders (VAEs) for PDE-type ambient data with regularizing geometric latent dynamics, which we refer to as VAE-DLM, or VAEs with dynamical latent manifolds. We redevelop the VAE framework such that manifold geometries, subject to our geometric flow and embedded in Euclidean space, are learned in the intermediary latent space developed by encoders and decoders. By tailoring the geometric flow in which the latent space evolves, we induce latent geometric properties of our choosing, which are reflected in empirical performance. We reformulate the traditional evidence lower bound (ELBO) loss with a considered choice of prior. We develop a linear geometric flow with a steady-state regularizing term. This flow requires only automatic differentiation of one time derivative, and can be solved in moderately high dimensions in a physics-informed approach, allowing more expressive latent representations. We discuss how this flow can be formulated as a gradient flow and how it maintains entropy away from metric singularity. This, along with an eigenvalue penalization condition, helps ensure the manifold is sufficiently large in measure, nondegenerate, and of canonical geometry, all of which contribute to a robust representation. Our methods focus on the modified multi-layer perceptron architecture with tanh activations for the manifold encoder-decoder. We demonstrate that, on our datasets of interest, our methods perform at least as well as the traditional VAE, and often better. Our methods can outperform both the traditional VAE and a VAE endowed with our proposed architecture, frequently reducing out-of-distribution (OOD) error by 15% to 35% on select datasets. We highlight our method on ambient PDEs whose solutions maintain minimal variation at late times. We provide empirical justification for how robust learning of external dynamics can be improved with VAEs.