arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.26977 2026-05-27 cs.LG math.OC

Convergence of Spectral Descent for Non-smooth Optimization

非光滑优化的谱下降收敛性

Yixuan Yang, Yuqing He, Song Li

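AI summary: Studies Spectral Descent (SD), a simplified variant of the Muon optimizer, and its truncated version (TSD), establishing global linear convergence in non-smooth convex optimization, with an application to robust low-rank matrix recovery.
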
AI中文摘要

Muon优化器最近在训练大型语言模型方面展示了显著的经验成功。然而,对其机制的理论理解仍然有限。目前Muon的收敛保证严重依赖于光滑性假设,其非光滑收敛行为在很大程度上未被探索。在这项工作中,我们通过研究谱下降(SD)(Muon的简化变体)及其截断版本截断谱下降(TSD),朝着弥合这一差距迈出了一步。在凸性、Lipschitz连续性和尖锐性条件下,我们建立了SD和TSD在非光滑凸公式中的全局线性收敛性。我们还研究了配备解耦权重衰减的正则化变体,并通过它们与Frank-Wolfe方法的联系推导出次线性收敛保证。最后,我们将我们的理论框架应用于混合稀疏和密集噪声下的鲁棒低秩矩阵恢复,并提供了严格的恢复保证。数值实验支持理论发现,并展示了Muon类型方法在非光滑优化中的有效性。

英文摘要

The Muon optimizer has recently demonstrated remarkable empirical success in training large language models. However, the theoretical understanding of its mechanisms remains limited. Current convergence guarantees for Muon rely heavily on smoothness assumptions, leaving its non-smooth convergence behavior largely unexplored. In this work, we take a step toward bridging this gap by investigating Spectral Descent (SD), a simplified variant of Muon, together with its truncated counterpart, Truncated Spectral Descent (TSD). Under convexity, Lipschitz continuity, and sharpness conditions, we establish global linear convergence for both SD and TSD in non-smooth convex formulations. We also study regularized variants equipped with decoupled weight decay and derive sublinear convergence guarantees through their connection with Frank-Wolfe methods. Finally, we apply our theoretical framework to robust low-rank matrix recovery under mixed sparse and dense noise regimes and provide rigorous recovery guarantees. Numerical experiments support the theoretical findings and demonstrate the effectiveness of Muon-type methods for non-smooth optimization.
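The abstract does not spell out the SD update, so the following is only a minimal sketch under stated assumptions: it uses the textbook spectral-descent step, in which the (sub)gradient is replaced by its orthogonal polar factor U V^T from the SVD, and it pairs that step with a geometrically decaying step size, a schedule often used for sharp non-smooth problems. The function names, the default `step0` and `decay` values, and the schedule itself are illustrative and are not taken from the paper.

```python
import numpy as np


def spectral_direction(grad: np.ndarray) -> np.ndarray:
    """Return U @ Vt from the thin SVD of grad (its orthogonal polar factor).

    All singular values of the result equal 1, so the step has unit
    spectral norm regardless of the scale of the (sub)gradient.
    """
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt


def spectral_descent(subgrad, x0, step0=1.0, decay=0.9, iters=100):
    """Run X <- X - step * polar(subgrad(X)) with a geometric step decay.

    `subgrad` is any callable returning a subgradient matrix at X.
    The geometric schedule is an assumption made for this sketch.
    """
    x = np.array(x0, dtype=float)
    step = step0
    for _ in range(iters):
        x = x - step * spectral_direction(subgrad(x))
        step *= decay
    return x


if __name__ == "__main__":
    # Toy non-smooth convex objective: f(X) = sum |X - M| (entrywise l1).
    rng = np.random.default_rng(0)
    target = rng.standard_normal((4, 3))
    result = spectral_descent(lambda z: np.sign(z - target), np.zeros((4, 3)))
    print("final l1 error:", np.abs(result - target).sum())
```

Because the direction always has unit spectral norm, the step size alone controls how far each iterate moves, which is why the schedule matters more here than in plain subgradient descent.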

2605.26973 2026-05-27 stat.ML cond-mat.dis-nn cs.LG cs.NE q-bio.NC

Signal-to-Noise Ratio and Sample Size Govern Representational Alignment in Neural Networks

信噪比与样本量控制神经网络中的表征对齐

Ali Hussaini Umar, Alessandro Laio

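AI summary: Shows through theory and experiments that the signal-to-noise ratio and the training sample size affect representational alignment in neural networks monotonically and non-monotonically, respectively; alignment is minimized near the interpolation threshold and is decoupled from generalization error.
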
AI中文摘要

已知神经网络会发展出潜在表征,这些表征是$对齐$的,即在不同架构、训练协议或训练数据集训练的网络之间结构相似。我们在一个受控环境中研究这一现象,使用被噪声过程的独立实现扰动的训练集,训练一组网络执行回归和分类任务。我们表明,信噪比(SNR)和训练样本量以定性相似的方式影响对齐,无论是在真实世界数据集上训练的网络,还是在极其简单的具有单个隐藏层的$线性$网络中(其对齐可以解析估计)。在线性和非线性网络、回归和分类任务以及合成和真实数据中,我们一致观察到,对齐随SNR单调变化,但随训练样本量非单调变化。特别地,对齐在插值阈值附近最小,且更强的对齐不一定对应更好的泛化误差。这些发现揭示了数据质量和数量对对齐的非平凡依赖关系,且与泛化性能解耦。

英文摘要

Neural networks are known to develop latent representations that are $aligned$, namely structurally similar across networks trained with different architectures, training protocols, or training datasets. We study this phenomenon in a controlled setting, where we train an ensemble of networks on regression and classification tasks using training sets perturbed by independent realizations of a noise process. We show that the signal-to-noise ratio (SNR) and the training sample size influence the alignment in qualitatively similar ways in networks trained on real-world datasets and in an extremely simple $linear$ network with a single hidden layer, for which the alignment can be estimated analytically. Across linear and nonlinear networks, regression and classification tasks, and both synthetic and real-world data, we consistently observe that alignment varies monotonically with SNR but non-monotonically with training sample size. In particular, the alignment is minimized near the interpolation threshold, and a stronger alignment does not necessarily correspond to better generalization error. These findings reveal a non-trivial dependence of alignment on data quality and quantity, decoupled from generalization performance.

2605.26971 2026-05-27 cs.LG

RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data

RLVR 数据集及其查找方法:通过数据溯源寻找更好的训练数据

Hsiu-Yuan Huang, Weijie Liu, Chenming Tang, Sanwoo Lee, Kai Yang, Yangkun Chen, Saiyong Yang, Yunfang Wu

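AI summary: To address the unclear provenance of Reinforcement Learning from Verifiable Rewards (RLVR) datasets, proposes Atomic-source Tracing via Lineage-Aware Search (ATLAS), which traces over 99.7% of instances to 20 atomic sources, and builds the decontaminated dataset DAPO++ guided by Source-level Counterfactual Attribution (SCA); the quality score Q correlates strongly with downstream RLVR performance.
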
Comments
7 figures, 12 tables
AI中文摘要

可验证奖励强化学习(RLVR)数据集的激增加剧了来源崩溃问题,原因是现有数据集之间的谱系不明确。为弥合这一碎片化的RLVR数据格局,我们提出了基于谱系感知搜索的原子源追踪(ATLAS),这是一个系统框架,用于将RLVR数据集追溯至其原子源,将145万个实例中的超过99.7%归因于20个原子源。我们的分析表明,大多数RLVR数据集是一小组共享上游源的变体,很少有引入真正新数据的,许多面临数据污染风险。这些发现自然促使我们策划一个新的RLVR数据集DAPO++,并从谱系感知的角度对现有数据集进行基准测试。为此,我们提出源级反事实归因(SCA)作为指导原则,以策划一个具有集中学习信号的去污染训练数据集。本质上,SCA通过比较每个原子源的RL检查点与共享基模型来测量样本的边际效用。基于这些归因信号,我们进一步设计了一个复合数据集质量分数Q,该分数与下游RLVR性能强相关。在Qwen3系列模型上的实验验证了DAPO++在保留基准上持续提升性能,而Q可靠地预测了下游RLVR训练效果。我们的代码和数据可在https://github.com/Celine-hxy/ATLAS获取。

英文摘要

The proliferation of Reinforcement Learning from Verifiable Rewards (RLVR) datasets has exacerbated provenance collapse due to unclear lineage among existing datasets. To bridge this fragmented RLVR data landscape, we propose Atomic-source Tracing via Lineage-Aware Search (ATLAS), a systematic framework for tracing RLVR datasets back to their atomic sources, attributing over 99.7% of 1.45M instances to 20 atomic sources. Our analysis reveals that most RLVR datasets are variants of a small set of shared upstream sources, with few introducing genuinely new data, and many facing data contamination risks. These findings naturally motivate us to curate a new RLVR dataset, DAPO++, and to benchmark existing datasets from a lineage-aware perspective. To this end, we propose Source-level Counterfactual Attribution (SCA) as a guiding principle to curate a decontaminated training dataset with concentrated learning signals. Essentially, SCA measures a sample's marginal utility by comparing per-atomic-source RL checkpoints against a shared base model. Building upon these attribution signals, we further design a composite dataset quality score Q that strongly correlates with downstream RLVR performance. Experiments on Qwen3 series models verify that DAPO++ consistently improves performance on held-out benchmarks, while Q reliably predicts downstream RLVR training effectiveness. Our code and data are available at https://github.com/Celine-hxy/ATLAS.

2605.26969 2026-05-27 cs.CL cs.AI

Recon: Reconstruction-Guided Reasoning Synthesis for User Modeling

Recon:基于重建指导的推理合成用于用户建模

Alan Zhu, Mihran Miroyan, Carolyn Wang, Andrew Zhou, Lisa Dunlap, Narges Norouzi, Joseph E. Gonzalez

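AI summary: Proposes Recon, which scores reasoning traces by their predictive power via action reconstruction to improve reasoning synthesis for user modeling, outperforming a post-hoc rationalization baseline across multiple domains.
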
AI中文摘要

用户建模旨在使用语言模型(LM)从过去的上下文-动作对(例如对话轮次)语料库中模拟个体的行为,从而在行为科学、人机协作和市场研究等环境中模拟用户。最近的方法通过合成推理轨迹来扩充这些语料库,通常通过同时以上下文和动作为条件生成。然而,这种条件构成事后合理化而非推理:轨迹保证证明动作的合理性,但可能不编码潜在的潜在因果决策路径。我们提出Recon,它使用动作重建通过预测能力对推理轨迹进行评分:给定上下文和候选推理,重建模型预测动作,重建保真度决定推理质量。在四个领域,Recon相对于标准事后合理化基线Backward Synthesis实现了54.7%的胜率。此外,我们发现使用来自Recon的奖励训练推理合成模型可提高下游用户建模性能,相对于基线实现了高达70.0%的胜率。我们进一步表明,Recon合成的推理可跨模型迁移,并改善重建模型之外的用户建模。我们的工作表明,事后合理化对于推理合成是不够的,有用且可解释的推理应自然地从上下文中引出动作。

英文摘要

User modeling aims to use language models (LMs) to mimic an individual's behavior from a corpus of past context-action pairs (e.g., conversation turns), enabling the simulation of users in settings like behavioral science, human-AI collaboration, and market research. Recent approaches augment these corpora with synthesized reasoning traces, typically generated by conditioning on both context and action. However, such conditioning constitutes post-hoc rationalization rather than reasoning: the trace is guaranteed to justify the action, but may not encode the underlying latent causal decision paths. We propose Recon, which uses action reconstruction to score reasoning traces by their predictive power: given a context and candidate reasoning, a reconstruction model predicts the action, and reconstruction fidelity determines reasoning quality. Across four domains, Recon achieves a 54.7% win rate over Backward Synthesis, a standard post-hoc rationalization baseline. Further, we find that training a reasoning synthesis model with rewards derived from Recon improves downstream user modeling performance, achieving a win rate of up to 70.0% over baselines. We further show that Recon-synthesized reasoning transfers across models, and improves user modeling beyond the reconstruction model. Our work demonstrates that post-hoc rationalization is insufficient for reasoning synthesis, and that useful and interpretable reasoning should naturally elicit the action from the context.

2605.26967 2026-05-27 cs.CV

CodecCap: High-Fidelity Codec-Inspired Residual Modeling for Dense Video Captioning

CodecCap: 高保真度编解码器启发的残差建模用于密集视频字幕生成

Zihan Lin, Songhe Deng, Shuwei He, Danxiang Zhu, Dan Zhang, Yishu Lei, Xianlong Luo, Shikun Feng, Rui Liu

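AI summary: Proposes the CodecCap framework, which mimics a video codec with keyframe and residual captions to preserve fine-grained visual evidence while reducing redundancy, and introduces the VidCapQA benchmark to verify its fidelity.
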
Comments
11 pages, 4 figures
AI中文摘要

现有的视频字幕方法难以平衡视觉保真度和冗余:整体字幕紧凑但丢失细粒度证据,而分段字幕改善覆盖但引入大量冗余。我们提出CodecCap,一种受编解码器启发的高保真度密集视频字幕框架。类似于视频编解码器,CodecCap使用关键帧和残差字幕表示视频。关键帧字幕详尽编码稳定的视觉上下文,而残差字幕仅捕获时间上局部的动作、运动和变化。这有效保留了细粒度视觉证据,同时减少冗余描述。为了量化字幕的保真度,我们引入VidCapQA,一个包含14个能力维度1000个问题的字幕-问答基准。VidCapQA上的结果表明,强VLM直接生成的字幕仍然遗漏许多视觉细节,突显字幕表示是关键瓶颈。实验表明,CodecCap显著超越使用相同底层VLM的直接字幕生成,表明关键帧-残差字幕是一种高保真度视频-语言监督的方式。我们进一步使用CodecCap构建CodecVDC-100K,一个包含锚点、残差、场景级和视频级监督的大规模密集字幕数据集。

英文摘要

Existing video captioning methods struggle to balance visual fidelity and redundancy: holistic captions are compact but lose fine-grained evidence, whereas segment-wise captions improve coverage but introduce heavy redundancy. We propose CodecCap, a codec-inspired framework for high-fidelity dense video captioning. Analogous to video codecs, CodecCap represents videos using keyframe and residual captions. Keyframe captions exhaustively encode stable visual context, while residual captions capture only temporally localized actions, motions, and changes. This effectively preserves fine-grained visual evidence while reducing redundant descriptions. To quantify the fidelity of captions, we introduce VidCapQA, a caption-then-QA benchmark with 1,000 questions across 14 capability dimensions. Results on VidCapQA show that captions directly generated by strong VLMs still miss many visual details, highlighting caption representation as a critical bottleneck. Experiments show that CodecCap significantly surpasses direct captioning with the same underlying VLMs, suggesting that keyframe-residual captioning is a way to achieve high-fidelity video-language supervision. We further use CodecCap to construct CodecVDC-100K, a large-scale dense captioning dataset with anchor, residual, scene-level, and video-level supervision.

2605.26958 2026-05-27 cs.CL cs.AI

Tournament-GRPO: Group-Wise Tournament Rewards for Reinforcement Learning in Open-Ended Long-Form Generation

Tournament-GRPO:面向开放式长文本生成强化学习的群组锦标赛奖励

Zixuan Yang, Yiqun Chen, Wei Yang, Erhan Zhang, Zihan Shen, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao

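AI summary: To address the lack of reliable reference answers and automatic metrics in open-ended long-form generation, proposes Tournament-GRPO, which converts rubric-guided LLM judgments into relative rewards through multi-round tournaments among same-query rollouts, achieving a 4.52-point improvement on Deep Research Bench.
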
AI中文摘要

开放式长文本生成中的强化学习具有挑战性,因为可靠的参考答案和自动评估指标通常不可用。现有的基于规则的方法通常依赖于逐点的LLM作为评判的评分,但绝对分数难以在复杂响应间校准,可能对同一查询的生成结果提供弱区分度,并在优化过程中饱和。我们提出Tournament-GRPO,一种群组奖励框架,通过同一查询生成结果间的重复多轮锦标赛将基于规则的LLM评判转化为相对奖励。Tournament-GRPO在群组内比较候选结果,累积锦标赛结果,并将其归一化为用于GRPO训练的群组奖励。在Deep Research Bench上的实验表明,Tournament-GRPO持续优于现有的奖励设计基线,在最强基线上实现了4.52分的整体分数提升。进一步分析表明,锦标赛奖励提供了有利的有效性-效率权衡,并且锦标赛设计影响训练动态。这些结果表明,基于规则的锦标赛比较为开放式长文本生成中的强化学习提供了有效的奖励信号。

英文摘要

Reinforcement learning in open-ended long-form generation is challenging because reliable reference answers and automatic metrics are often unavailable. Existing rubric-based methods typically rely on pointwise LLM-as-a-judge scoring, but absolute scores are difficult to calibrate across complex responses, may provide weak discrimination among same-query rollouts, and can become saturated during optimization. We propose Tournament-GRPO, a group-wise reward framework that converts rubric-guided LLM judgments into relative rewards through repeated multi-round tournaments among same-query rollouts. Tournament-GRPO compares candidates within groups, accumulates tournament outcomes, and normalizes them into group-wise rewards for GRPO training. Experiments on Deep Research Bench show that Tournament-GRPO consistently outperforms existing reward-design baselines, achieving a 4.52-point overall-score improvement over the strongest baseline. Further analyses show that tournament rewards provide a favorable effectiveness-efficiency trade-off and that tournament design affects training dynamics. These results suggest that rubric-guided tournament comparison provides an effective reward signal for reinforcement learning in open-ended long-form generation.
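The compare, accumulate, and normalize steps described above can be illustrated with a rough sketch. The abstract does not specify the tournament format, so this example assumes a simple round-robin in which every pair of same-query rollouts is compared once per round, and it assumes a mean/standard-deviation normalization of the win counts. The `judge` callable stands in for a rubric-guided LLM comparison; it and the function name `tournament_rewards` are hypothetical.

```python
import itertools
import statistics
from typing import Callable, List, Sequence


def tournament_rewards(
    rollouts: Sequence[str],
    judge: Callable[[str, str], bool],
    rounds: int = 1,
) -> List[float]:
    """Turn pairwise judgments into normalized group-wise rewards.

    judge(a, b) returns True when rollout a beats rollout b. Here it is a
    placeholder for an LLM judge; the round-robin format is an assumption.
    """
    wins = [0.0] * len(rollouts)
    for _ in range(rounds):
        for i, j in itertools.combinations(range(len(rollouts)), 2):
            if judge(rollouts[i], rollouts[j]):
                wins[i] += 1.0
            else:
                wins[j] += 1.0
    mean = statistics.fmean(wins)
    std = statistics.pstdev(wins)
    if std == 0.0:
        # Every rollout won equally often: no relative signal in this group.
        return [0.0] * len(rollouts)
    return [(w - mean) / std for w in wins]


if __name__ == "__main__":
    # Toy judge that prefers the longer response, purely for demonstration.
    group = ["short", "a much longer answer", "medium one"]
    print(tournament_rewards(group, lambda a, b: len(a) > len(b)))
```

Because the rewards are relative within a group, they stay informative even when every rollout would receive a similar absolute score from a pointwise judge.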

2605.26956 2026-05-27 cs.AI cs.CL

LELA: An End-to-end LLM-based Entity Linking Framework with Zero-shot Domain Adaptation

LELA: 一种基于LLM的端到端实体链接框架,支持零样本领域自适应

Samy Haffoudhi, Nikola Dobričić, Fabian Suchanek, Nils Holzenberger

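AI summary: Presents LELA, a modular, domain-agnostic LLM-based entity disambiguation method, extended into a practical Python library that integrates zero-shot named entity recognition for end-to-end entity linking; experiments validate its performance and robustness across domains.
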
Journal ref
35th International Joint Conference on Artificial Intelligence (IJCAI-ECAI 2026), IJCAI (International Joint Conferences on Artificial Intelligence), Aug 2026, Bremen (DE), Germany
AI中文摘要

实体链接是许多下游NLP系统的关键组件,但现有方法通常依赖于特定的目标知识库和领域,限制了其实际应用。在本文中,我们将LELA(一种模块化且领域无关的基于LLM的实体消歧方法)扩展为一个实用的Python库,该库集成了零样本命名实体识别(NER),从而为实际使用中的实体链接提供了完整的端到端流水线。我们提供了实验结果,验证了LELA在不同实体链接设置下的性能和鲁棒性。在我们的演示中,用户可以在自己的输入文本上试用该系统。

英文摘要

Entity linking is a key component of many downstream NLP systems, yet existing approaches are often tied to specific target knowledge bases and domains, limiting their real-world application. In this paper, we extend LELA, a modular and domain-agnostic LLM-based entity disambiguation method, into a practical Python library that integrates zero-shot Named Entity Recognition (NER), thereby providing a complete end-to-end pipeline for entity linking in real-world usage. We provide experimental results validating LELA's performance and robustness across diverse entity linking settings. In our demo, users can try the system on their own input texts.

2605.26955 2026-05-27 cs.CL cs.AI

JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors

JuICE:评估LLM裁判识别文化错误的基准

Jiho Jin, Junho Myung, Juhyun Oh, Junyeong Park, Rifki Afina Putri, Sunipa Dev, Vinodkumar Prabhakaran, Alice Oh

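AI summary: Presents the JuICE benchmark, a multilingual dataset of 7,470 annotations of cultural and linguistic errors, for evaluating how well LLM judges identify thick cultural errors in long-form text; the strongest model reaches an F1 of only 0.52 and often misses errors that locals readily identify.
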
AI中文摘要

随着大型语言模型(LLM)越来越多地部署给全球用户,它们被整合到不同文化背景下的日常任务中,从起草个人通信到头脑风暴创意想法。这些任务本质上是文化性的:它们需要语境适当性、象征共鸣以及母语者本能依赖的隐性文化期望,这意味着一个回答可能在事实上合理,但对本地读者来说却明显错误。现有的文化基准通过事实验证或规范蕴含方法将文化视为一组扁平的事实,并采用LLM作为裁判,而未检查它们是否能捕捉到这种深层的文化错误。为填补这一空白,我们提出了JuICE(LLM裁判识别文化错误基准),这是一个多语言数据集,包含7470个跨度级别的文化语言错误标注,涵盖来自四个国家(美国、韩国、印度尼西亚和孟加拉国)的1050个查询-响应对,使用英语和这些国家的主要语言。利用JuICE,我们发现即使是最强的LLM裁判在错误跨度检测任务中也仅达到0.52的F1分数。此外,LLM裁判始终会遗漏本地居民容易识别的深层文化错误。我们的研究结果表明,稳健的文化评估必须超越表面级别的检测,转向考虑文化意义的深度和情境性的框架。

英文摘要

As large language models (LLMs) are increasingly deployed to users around the world, they are integrated into everyday tasks across diverse cultural contexts, from drafting personal communications to brainstorming creative ideas. These tasks are inherently cultural: they require contextual appropriateness, symbolic resonance, and tacit cultural expectations that native speakers draw on instinctively, meaning that a response can be factually plausible yet unmistakably wrong to a local reader. Existing cultural benchmarks have treated culture as a flat set of facts via fact verification or norm entailment methods, and have adopted LLM-as-a-Judge without examining whether they can capture such thick cultural errors. To address this gap, we present JuICE (Benchmark for LLM-Judge in Identifying Cultural Errors), a multilingual dataset of 7,470 span-level annotations of cultural and linguistic errors in long-form LLM responses. It covers 1,050 query-response pairs from four countries (the United States, South Korea, Indonesia, and Bangladesh), in both English and their countries' main languages. Using JuICE, we find that even the strongest LLM-judge achieves only an F1 of 0.52 in the erroneous span detection task. Furthermore, LLM-judges consistently miss thick cultural errors that local residents readily identify. Our findings suggest that robust cultural evaluation must move beyond surface-level detection toward frameworks that account for the depth and situatedness of cultural meaning.

2605.26952 2026-05-27 cs.CL

Efficient Agentic Reinforcement Learning with On-Policy Intrinsic Knowledge Boundary Enhancement

基于策略内在知识边界增强的高效智能体强化学习

Dingwei Chen, Zefang Zong, Zhipeng Ma, Leo Luo, Yang Li, Chengming Li, Peng Chen, Jie Jiang

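AI summary: Proposes AKBE, which dynamically probes the model's intrinsic knowledge boundary through dual-path (with-tool and no-tool) on-policy training and constructs targeted supervisory signals, reducing tool calls while maintaining accuracy.
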
AI中文摘要

智能体强化学习已被证明对于训练具有外部工具使用能力的基于LLM的智能体是有效的。然而,我们发现智能体强化学习训练会导致冗余工具调用增加,并模糊模型的内在知识边界,即模型无法区分何时需要工具以及何时参数化知识足够。现有的基于奖励塑形的解决方案创建了粗粒度的优化目标,倾向于激励不加区分的工具调用抑制,导致奖励黑客行为。在本文中,我们提出AKBE(智能体知识边界增强),一种在线策略方法,通过在训练期间进行双路径(有工具和无工具)展开来动态探测模型的内在知识边界。我们将知识边界定义为每个实例是否需要工具以及所需的最小工具调用次数。通过比较各路径的正确性,AKBE对轨迹进行分类并构建针对性的监督信号,为每个问题引导高效的工具使用模式。这些信号无缝集成到智能体强化学习训练循环中。在七个QA基准上的实验表明,与标准智能体强化学习相比,AKBE平均任务准确率提高+1.85,工具调用减少18%,工具生产率提高25%,且没有任何准确率-效率权衡。进一步分析表明其在不同RL算法上的即插即用兼容性以及每个信号类别的机制。我们的代码可在https://github.com/CuSO4-Chen/AKBE获取。

英文摘要

Agentic reinforcement learning (RL) has proven effective for training LLM-based agents with external tool-use capabilities. However, we identify that agentic RL training induces a growing number of redundant tool calls and blurs the model's intrinsic knowledge boundary, where the model fails to distinguish when tools are needed versus when parametric knowledge suffices. Existing solutions based on reward shaping create coarse-grained optimization targets that tend to incentivize indiscriminate tool-call suppression, leading to reward hacking. In this paper, we propose AKBE (Agentic Knowledge Boundary Enhancement), an on-policy method that dynamically probes the model's intrinsic knowledge boundary through dual-path (with-tool and no-tool) rollouts during training. We define the knowledge boundary as the per-instance determination of whether tools are required and the minimum number of tool calls necessary. By comparing correctness across paths, AKBE categorizes trajectories and constructs targeted supervisory signals that guide efficient tool-use patterns for each question. These signals are integrated seamlessly into the agentic RL training loop. Experiments on seven QA benchmarks demonstrate that AKBE improves task accuracy by +1.85 on average and reduces tool calls by 18% over standard agentic RL, yielding 25% higher tool productivity without any accuracy-efficiency trade-off. Further analysis suggests its plug-and-play compatibility across different RL algorithms and the mechanism of each signal category. Our code is available at https://github.com/CuSO4-Chen/AKBE.
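As a loose illustration of the dual-path comparison described above, the sketch below decides for one question whether tools appear to be required and, if so, the smallest number of tool calls seen in a correct with-tool rollout. The category labels, the `Rollout` type, and the decision rule are assumptions made for this example; the abstract does not list AKBE's actual trajectory categories or how its supervisory signals are built.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Rollout:
    correct: bool     # whether the final answer was judged correct
    tool_calls: int   # number of tool calls made (0 on the no-tool path)


def knowledge_boundary(
    no_tool: Sequence[Rollout],
    with_tool: Sequence[Rollout],
) -> Tuple[str, Optional[int]]:
    """Return a (hypothetical) category and the minimum tool calls needed."""
    if any(r.correct for r in no_tool):
        # Parametric knowledge already answers the question.
        return ("parametric_knowledge_suffices", 0)
    counts = [r.tool_calls for r in with_tool if r.correct]
    if counts:
        # Only the with-tool path succeeds; keep the cheapest success.
        return ("tools_required", min(counts))
    # Neither path succeeded, so the boundary cannot be determined.
    return ("unsolved", None)


if __name__ == "__main__":
    no_tool_path = [Rollout(False, 0), Rollout(False, 0)]
    with_tool_path = [Rollout(True, 3), Rollout(True, 1), Rollout(False, 2)]
    print(knowledge_boundary(no_tool_path, with_tool_path))
```

A per-question label of this kind could then be used to favor trajectories that match the estimated minimum, rather than penalizing tool calls uniformly.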

2605.26949 2026-05-27 cs.CV cs.GR

DinoComplete: 3D Shape Completion with Distilled Semantic Priors and State Space Models

DinoComplete: 利用蒸馏语义先验和状态空间模型进行3D形状补全

Furkan Mert Algan, Eckehard Steinbach

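AI summary: Proposes the DinoComplete framework, which distills semantic priors from DINO features and combines them with a multi-scale voxel Mamba module for efficient, robust 3D shape completion, outperforming existing methods on unseen categories and noisy real-world scans.
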
AI中文摘要

从部分扫描进行3D形状补全对于未见类别和嘈杂的真实世界观测仍然具有挑战性,因为仅凭几何信息往往不足以推断缺失结构。我们提出了DinoComplete,一个确定且高效的形状补全框架,通过从DINO特征中蒸馏的体素对齐语义先验来增强几何重建。首先,我们构建与ShapeNet数据对齐的多视图DINO特征体积,并训练一个学生网络直接从不完整形状预测密集语义特征。这些预测特征捕获全局结构和部分感知的语义上下文,同时与底层几何保持对齐。然后,我们将这些蒸馏特征集成到一个补全网络中,其中几何和语义体素表示通过体素状态空间建模进行融合。为了在不牺牲分辨率的情况下实现高效的长距离推理,我们引入了一个多尺度体素Mamba模块,通过结合全网格和分块序列建模来细化融合特征。在未见过的ShapeNet类别和ScanNet物体上的实验表明,DinoComplete在使用更少参数、更低内存和更快推理速度的同时,实现了比先前确定性和基于生成的方法更强的补全质量。我们的结果表明,从视觉基础模型中蒸馏语义先验提高了3D形状补全的泛化能力和鲁棒性。

英文摘要

3D shape completion from partial scans remains challenging for unseen categories and noisy real-world observations, where geometry alone is often insufficient for inferring missing structure. We present DinoComplete, a deterministic and efficient shape completion framework that augments geometric reconstruction with voxel-aligned semantic priors distilled from DINO features. First, we construct multi-view DINO feature volumes aligned with ShapeNet data and train a student network to predict dense semantic features directly from incomplete shapes. These predicted features capture global structure and part-aware semantic context while remaining aligned with the underlying geometry. We then integrate these distilled features into a completion network, where geometric and semantic voxel representations are fused through voxel state-space modeling. To enable efficient long-range reasoning without sacrificing resolution, we introduce a multi-scale voxel Mamba module that refines the fused features by combining full-grid and chunk-wise sequence modeling. Experiments on unseen ShapeNet categories and ScanNet objects show that DinoComplete achieves stronger completion quality than prior deterministic and generative completion methods while using fewer parameters, requiring less memory, and achieving faster inference. Our results demonstrate that distilling semantic priors from visual foundation models improves generalization and robustness in 3D shape completion.

2605.26944 2026-05-27 cs.RO cs.CV

Object Pose and Shape Estimation for Grasping: Does it Work?

用于抓取的目标姿态与形状估计:有效吗?

Pavan Karke, Kushal Shah, Gaurav Singh, Md Faizal Karim, K Madhava Krishna, Rajat Talak

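AI summary: Evaluates how well existing pose and shape estimation methods serve grasping by comparing an end-to-end grasp synthesis method with modular methods that first estimate object pose and shape and then sample grasps.
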
Comments
9 pages, 8 figures
AI中文摘要

目标姿态和形状估计问题近年来取得了关键进展。编码器-解码器(如SAM3D、LRM、CRISP)和基于扩散的模型(如InstantMesh、Zero123、SceneComplete)展示了类别无关的形状编码能力和开放集泛化性。在这项工作中,我们提出一个问题:当与对极抓取采样结合使用时,目标姿态和形状估计方法是否足够成熟,以至于能够超越端到端抓取合成方法?我们通过将研究范围限定在平行颚夹爪、7自由度抓取和单视图RGB(-D)图像输入,详细探讨了这个问题。我们实现并比较了一种最先进的端到端抓取合成方法和三种模块化方法,这些方法首先估计场景中所有目标的姿态和形状,然后使用对极采样生成抓取。我们观察到,在所有实验中,模块化方法均优于端到端方法。模块化方法能够合成大量抓取,即使是对于端到端方法失败的小目标也是如此。模块化方法的有效性取决于姿态和形状估计的准确性,并且在杂乱场景中会部分退化——这是现有姿态和形状估计方法的局限性。我们还分析了三种模块化方法的失败模式和运行时间,这些方法使用了两种不同的目标姿态和形状估计方式:一种基于编码器-解码器模型,另一种基于扩散模型。最后,我们证明单视图目标姿态和形状估计方法可以与视觉语言模型结合,仅从单视图RGB-D图像输入即可产生语言条件抓取。我们注意到其性能与最先进的LERF-TOGO基线相当。

英文摘要

The problem of object pose and shape estimation has seen key advancements lately. Encoder-decoder (e.g., SAM3D, LRM, CRISP) and diffusion-based models (e.g., InstantMesh, Zero123, SceneComplete) have shown category-agnostic shape encoding capacity and open-set generalizability. In this work, we ask the question: Are object pose and shape estimation methods mature enough that, when used with antipodal grasp sampling, they can outperform end-to-end grasp synthesis methods? We explore this question in detail by scoping our study to parallel jaw grippers, 7-DoF grasps, and single-view RGB(-D) images as input. We implement and compare a state-of-the-art end-to-end grasp synthesis method and three modular methods, which first estimate the object pose and shape for all objects in the scene and then generate grasps using antipodal sampling. We observe that the modular methods outperform the end-to-end method in all our experiments. The modular methods are able to synthesize plenty of grasps, even for small objects, where the end-to-end method fails. The effectiveness of the modular methods is contingent on the accuracy of the pose and shape estimation, and suffers partial degradation in cluttered scenes, a limitation of the existing pose and shape estimation methods. We also analyze the failure modes and run-times for the three modular methods, which use two different ways of object pose and shape estimation: one based on an encoder-decoder model and the other on a diffusion model. Finally, we demonstrate that the single-view object pose and shape estimation methods can be augmented with vision-language models to yield language-conditioned grasps from just a single-view RGB-D image as input. We observe performance comparable to the state-of-the-art LERF-TOGO baseline.

2605.26940 2026-05-27 cs.CL

Accountable Human-AI Deliberation with LLMs: Scaling Collective Intelligence through Symbiotic Scaffolding

负责任的基于LLM的人机协商:通过共生脚手架扩展集体智能

Wajdi Zaghouani

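AI summary: Proposes a three-layer symbiotic human-AI framework that uses diversity amplification, clause-level provenance, and human-led ratification to scale collective intelligence while preserving agency and legitimacy.
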
Comments
Accepted at the LREC 2026 / 2nd Workshop on Language-driven Deliberation Technology
AI中文摘要

大型语言模型(LLM)可以在以前受轮流发言和引导带宽限制的规模上支持民主协商。最近的研究表明,LLM生成的群体陈述通常比人类中介的输出更受欢迎,而理论分析认为LLM放松了限制集体智能的同时性约束。然而,纯LLM中介存在使多元性崩溃、过度优化一致性以及当参与者无法质疑其如何被代表时损害合法性的风险。我们提出了一个共生的人机框架,分为三个层次:观察与多样性放大、具有条款级溯源的引导、以及人类优先批准。我们的贡献包括:具有显著性加权分级覆盖、多样性和擦除度量;结合交叉编码器相似性与因果剔除诊断的溯源管道;偏好条件权衡控制;公平感知的可争议工作流;对抗性鲁棒性测试;以及基于LLM作为评判者局限性证据的消融设计评估协议。结果是一个可测试的协商技术蓝图,能够在扩展集体智能的同时保持主体性和合法性。

英文摘要

Large language models (LLMs) can support democratic deliberation at scales previously constrained by turn-taking and facilitation bandwidth. Recent work shows that LLM-generated group statements are often preferred over human-mediated outputs, while theoretical analyses argue that LLMs relax the simultaneity constraints limiting collective intelligence. Yet pure LLM mediation risks collapsing pluralism, over-optimizing for agreement, and undermining legitimacy when participants cannot contest how they are represented. We propose a symbiotic human-AI framework organized into three layers: observation and diversity amplification, facilitation with clause-level provenance, and human primacy for ratification. Our contributions include graded coverage, diversity, and erasure metrics with salience-aware weighting; a provenance pipeline combining cross-encoder similarity with causal knockout diagnostics; preference-conditioned trade-off control; equity-aware contestability workflows; adversarial robustness tests; and an evaluation protocol with ablation designs informed by evidence of LLM-as-judge limitations. The result is a testable blueprint for deliberation technology that scales collective intelligence while preserving agency and legitimacy.

2605.26937 2026-05-27 cs.CL cs.AI

Beyond Questions: Evaluating What Large Language Models (Actually) Know

超越问题:评估大型语言模型(实际)知道什么

Luca Giordano, Simon Razniewski

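AI summary: Proposes open knowledge evaluation, a new paradigm that assesses the knowledge models naturally express in response to open-ended prompts (e.g., "Tell me everything you know about M.L. King"), and builds the BeQu benchmark covering 10,000 entities.
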
AI中文摘要

大型语言模型(LLM)中的参数化知识是其成功的基石,但仍未被充分理解。现有的知识基准通常依赖于预定义的问题(例如,“M.L. King的出生日期是什么?”),仅评估基准设计者明确选择查询的知识,这是一种有问题的可用性偏差。在本文中,我们引入了开放知识评估,这是一种用于LLM知识基准测试的新范式。它不提出狭隘的问题,而是评估模型在响应开放式引发提示(例如,“告诉我关于M.L. King的一切”)时选择呈现的知识。这将焦点从预定义的答案检索转向表征模型自然表达的知识。我们用BeQu(超越问题)实例化这一范式,这是一个包含10,000个实体并配有用于陈述验证的参考语料库的基准。使用BeQu,我们评估了广泛的语言模型,并分析了推理努力、模型规模、提示格式和知识领域的影响。数据和排行榜可在此工作的GitHub仓库和基准网站上获取。

英文摘要

Parametric knowledge in large language models (LLMs) is a cornerstone of their success, yet remains poorly understood. Existing knowledge benchmarks typically rely on predefined questions (e.g., "What is the birth date of M.L. King?"), evaluating only knowledge that benchmark designers explicitly choose to query, a problematic availability bias. In this paper, we introduce open knowledge evaluation, a new paradigm for LLM knowledge benchmarking. Instead of asking narrow questions, it evaluates models on the knowledge they choose to surface in response to open-ended elicitation prompts (e.g., "Tell me everything you know about M.L. King"). This shifts the focus from predefined answer retrieval toward characterizing the knowledge models naturally express. We instantiate this paradigm with BeQu (Beyond Questions), a benchmark of 10,000 entities paired with reference corpora for statement verification. Using BeQu, we evaluate a broad range of language models and analyze the effects of reasoning effort, model scale, prompt format, and knowledge domain. Data and leaderboard are available on this work's GitHub repository and at the benchmark's website.

2605.26936 2026-05-27 cs.RO

A Bioinspired Underwater Robot with a Latch-Mediated Soft Bistable Mechanism

一种具有闩锁介导的软体双稳态机构的仿生水下机器人

Chongze Bi, Wenjie Wu, Zonghao Zuo, Li Wen

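AI summary: Proposes a bioinspired soft bistable actuator with an integrated latch mechanism that enables asymmetric energy input and release with a single motor; combined with fin structures, it provides efficient underwater propulsion and maneuvering, and experiments verify stable flapping, precise steering, and multi-mode motion.
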
Comments
6 pages, 6 figures
AI中文摘要

近年来,水下机器人技术取得了显著进展。然而,微型水下机器人的发展仍受限于传统能源的低能量密度。自然界提供了引人注目的解决方案——像螳螂虾和跳蚤这样的生物利用闩锁介导的弹簧驱动(LaMSA)系统,通过解耦的能量存储和释放机制实现快速运动。尽管对LaMSA进行了广泛研究,但在简单紧凑的结构中复制这种快速、非对称驱动仍然具有挑战性。在这项工作中,我们介绍了一种受生物启发的软体双稳态执行器,它集成了闩锁机制,能够使用单个电机实现非对称的能量输入和释放。结合鳍结构,这种设计促进了高效的水下推进和机动性。实验结果表明,该机器人实现了稳定的周期性拍动、精确的转向,以及最大推力0.528 N、冲量0.147 Ns和垂直位移30 mm。通过调节鳍角,机器人实现了多种运动,包括垂直上升、斜向前进和横向平移。这项研究为控制紧凑型水下机器人的运动提供了一种新颖、节能的方法,为先进仿生设计在探索、环境监测和检查中的潜在应用铺平了道路。

英文摘要

Underwater robotics has advanced significantly over recent decades. However, the development of miniaturized underwater robots remains limited by the low energy densities of traditional power sources. Nature offers compelling solutions: organisms like mantis shrimps and fleas utilize latch-mediated spring actuation (LaMSA) systems that achieve rapid movements through a decoupled energy storage and release mechanism. Despite extensive studies of LaMSA, replicating such rapid, asymmetric actuation within simple, compact structures remains challenging. In this work, we introduce a bioinspired, soft bistable actuator with an integrated latch mechanism that enables asymmetric energy input and release using a single motor. Coupled with fin structures, this design facilitates efficient underwater propulsion and maneuverability. Experimental results demonstrate stable periodic flapping, precise steering, and a maximum thrust of 0.528 N, impulse of 0.147 Ns, and vertical displacement of 30 mm. By modulating fin angles, the robot achieves versatile motions, including vertical ascent, diagonal forward movement, and lateral translation. This study presents a novel, energy-efficient approach for controlling motion in compact underwater robots, paving the way for advanced biomimetic designs with potential applications in exploration, environmental monitoring, and inspection.

2605.26935 2026-05-27 cs.CL

DunbaaBERT: From Sacrifice to Semantics

DunbaaBERT: 从牺牲到语义

Iffat Maab, Waleed Jamil, Raphael Schmitt

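AI summary: Proposes DunbaaBERT, a family of Urdu RoBERTa-base models trained from scratch with different vocabulary sizes on a 17GB corpus; it performs on par with strong multilingual baselines on several downstream tasks, and larger vocabularies do not consistently improve results.
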
AI中文摘要

大型语言模型在许多自然语言处理任务中取得了强劲性能,但由于资源有限和评估设置碎片化,乌尔都语仍相对未被充分探索。为填补这一空白,我们引入了DunbaaBERT,一个乌尔都语RoBERTa-base模型族,在去重后的17GB乌尔都语语料库上使用32k、52k和96k token的Byte-BPE词汇表从头训练。我们在内在和下游乌尔都语自然语言处理基准上评估DunbaaBERT,涵盖语言可接受性、新闻分类、攻击性语言检测和情感分析,同时分析词汇表大小对性能和效率权衡的影响。在各项基准中,DunbaaBERT变体与强多语言基线相比取得了有竞争力的性能,同时始终保持有利的效率权衡。有趣的是,较大的词汇表并不持续提升下游效果,DunbaaBERT$_{\text{32k}}$反复提供最强的整体效率概况。总体而言,我们的结果表明,尽管模型和训练规模相对紧凑,精心策划的乌尔都语特定编码器模型仍能保持高度竞争力。所有模型均在MIT许可下发布。

英文摘要

Large language models have achieved strong performance across many NLP tasks, yet Urdu remains comparatively underexplored due to limited resources and fragmented evaluation settings. To address this gap, we introduce DunbaaBERT, a family of Urdu RoBERTa-base models trained from scratch with Byte-BPE vocabularies of 32k, 52k, and 96k tokens on a deduplicated 17GB Urdu corpus. We evaluate DunbaaBERT across intrinsic and downstream Urdu NLP benchmarks covering linguistic acceptability, news classification, offensive language detection, and sentiment analysis while analyzing vocabulary-size effects on performance and efficiency trade-offs. Across benchmarks, the DunbaaBERT variants achieve competitive performance against strong multilingual baselines while consistently maintaining favorable efficiency trade-offs. Interestingly, larger vocabularies do not consistently improve downstream effectiveness, with DunbaaBERT$_{\text{32k}}$ repeatedly providing the strongest overall efficiency profile. Overall, our results demonstrate that carefully curated Urdu-specific encoder models can remain highly competitive despite comparatively compact model and training scales. All models are released under the MIT license.

2605.26934 2026-05-27 cs.CL cs.AI

Reasoning Depth and Environment Complexity: A Controlled Study of RLVR Data Allocation across Logical Reasoning Tasks

推理深度与环境复杂度:逻辑推理任务中RLVR数据分配的受控研究

Yihua Zhu, Qianying Liu, Fei Cheng, Jiaxin Wang, Akiko Aizawa, Sadao Kurohashi, Hidetoshi Shimodaira

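AI summary: Characterizes the reasoning space along depth and complexity and considers four reasoning forms; controlled experiments in a synthetic knowledge-graph environment find that joint depth-complexity coverage outperforms single-axis recipes, that reasoning families respond non-uniformly to RLVR coverage, and that uniform mixing outperforms staged curricula.
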
Comments
Pre-print
AI中文摘要

基于可验证奖励的强化学习(RLVR)已成为后训练推理模型的核心,但现有研究的一个关键局限在于对推理空间的狭隘视角:难度仅被视为推理深度,奖励集中在正向演绎状态追踪。相反,我们沿两个维度刻画推理空间。难度:除了推理深度,我们研究环境复杂度,即模型必须在干扰项和交互结构中识别正确路径。奖励推理形式:我们考虑现实世界推理核心的四种能力:演绎状态追踪、对隐藏事件或事实的溯因恢复、归纳规则归纳以及类比迁移。为解耦这些因素,我们构建了一个合成知识图谱环境,具有受控的预训练和后训练分布,其中每个实例在深度、复杂度和任务家族上变化。三个发现:联合深度-复杂度覆盖优于单轴策略;推理家族反应非均匀,溯因推理在RL覆盖区域外退化,任务相关性聚类为演绎-溯因对和归纳-类比对;在固定预算下,均匀混合优于分阶段课程。我们还发现,最近的现成模型表现出相同的演绎-溯因不对称性,表明这一差距并非我们受控设置的假象。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become central to post-training reasoning models, yet a key limitation of existing studies is their narrow view of the reasoning space: difficulty is treated as reasoning depth alone, and reward is concentrated on forward deductive state tracking. We instead characterize the reasoning space along two dimensions. Difficulty. Beyond reasoning depth, we study environment complexity, where models must identify the correct path amid distractors and interacting structures. Rewarded reasoning form. We consider four abilities core to real-world reasoning: deductive state tracking, abductive recovery of hidden events or facts, inductive rule induction, and analogical transfer. To disentangle these factors, we construct a synthetic knowledge-graph environment with controlled pre- and post-training distributions, where each instance varies along depth, complexity, and task family. Three findings emerge: joint depth-complexity coverage outperforms single-axis recipes; reasoning families respond non-uniformly, with abductive reasoning degrading outside the RL-covered region and task correlations clustering into deductive-abductive and inductive-analogy pairs; and uniform mixing outperforms staged curricula under a fixed budget. We also find that recent off-the-shelf models exhibit the same deductive-over-abductive asymmetry, suggesting that this gap is not merely an artifact of our controlled setup.

2605.26933 2026-05-27 cs.CV

Leveraging Text-to-Image Diffusion Models for Unsupervised Visual Object Tracking

利用文本到图像扩散模型进行无监督视觉目标跟踪

Zhengbo Zhang, Zhigang Tu, Junsong Yuan, De Wen Soh, Bo Du

AI总结 提出Diff-Tracking方法,利用预训练文本到图像扩散模型的跨注意力机制,通过初始提示学习器和在线提示更新器实现无监督目标跟踪。

详情
Comments
Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2026
AI中文摘要

无监督视觉目标跟踪是一项具有挑战性的任务,需要在没有真实标注训练的情况下跟踪视频中的任意目标。尽管取得了显著进展,现有的最先进无监督跟踪器在处理需要细粒度理解视频帧内语义和视觉结构信息的场景时仍常遇到困难。文本到图像扩散模型以其生成准确反映输入提示中描述的语义和结构的图像的能力而闻名,展现出对视觉语义和结构的强大把握。基于这一能力,我们从新的角度处理无监督跟踪,利用预训练文本到图像扩散模型中编码的丰富语义知识。为了将原本用于图像生成的扩散模型适应到跟踪任务,我们将其重新解释为文本和图像模态之间的桥梁。这种连接通过跨注意力机制实现:当文本和图像同时输入模型时,模型会突出显示与文本语义对齐的图像区域(在跨注意力图中)。因此,我们学习一个表示跟踪目标的提示,并在每一帧中激活其在跨注意力图中的对应区域,从而利用扩散模型实现目标跟踪。具体来说,我们的方法Diff-Tracking由两个主要部分组成:初始提示学习器和在线提示更新器。初始提示学习器生成一个捕获第一帧中目标对象的提示,使扩散模型能够识别目标。在线提示更新器基于运动信息优化提示,实现跨视频帧的一致跟踪。我们在六个具有挑战性的跟踪数据集上评估了我们的方法,证明了其有效性。

英文摘要

Unsupervised visual object tracking is a challenging task that requires following arbitrary targets in videos without training on ground-truth annotations. Despite considerable progress, existing state-of-the-art unsupervised trackers often struggle in scenarios that demand fine-grained understanding of semantic and visual structural information within video frames. Text-to-image diffusion models are well known for their ability to generate images that accurately reflect the semantics and structures described in the input prompt, demonstrating a strong grasp of visual semantics and structures. Building on this capability, we approach unsupervised tracking from a new perspective by exploiting the rich semantic knowledge encoded in pretrained text-to-image diffusion models. To adapt the diffusion models, which were originally developed for image generation, to the tracking task, we reinterpret the models as a bridge between text and image modalities. This connection is realized through the cross-attention mechanism: when both text and an image are input into the models, they highlight the regions of the image that are semantically aligned with the text in the cross-attention maps. We therefore learn a prompt that represents the tracking target and activates its corresponding region in the cross-attention map for each frame, which enables object tracking with the diffusion model. Specifically, our method, Diff-Tracking, is composed of two main components: an initial prompt learner and an online prompt updater. The initial prompt learner generates a prompt that captures the target object in the first frame, allowing the diffusion model to identify the target. The online prompt updater refines the prompt based on motion information, enabling consistent tracking across video frames. We evaluate our approach on six challenging tracking datasets and demonstrate its effectiveness.

2605.26926 2026-05-27 cs.AI

From Norms to Indicators (N2I-RAG): An Agentic Retrieval-Augmented Generation Framework for Legal Indicator Computation

从规范到指标 (N2I-RAG): 一种用于法律指标计算的智能检索增强生成框架

Youssef Al Mouatamid, Marie Bonnin, Jihad Zahir

AI总结 提出N2I-RAG框架,通过自适应检索、基于LLM的智能体和验证机制,实现从法律文本到指标的透明、可追溯的自动计算,在法国海洋环境法语料库上优于基线方法。

详情
AI中文摘要

从规范文本计算法律指标是法律监测和政策评估中的关键任务,但由于法律语言的复杂性、规模、解释性以及可用文档质量的差异,这一任务面临重大挑战。现有的自然语言处理技术和生成模型可以辅助法律分析,但往往存在较高的幻觉风险,且缺乏可靠指标计算所需的可解释性和证据基础。本文提出N2I-RAG(从规范到指标),一种智能检索增强生成框架,旨在以透明且可追溯的方式自动化法律指标的计算。我们将自适应检索、基于LLM的智能体和验证机制集成到一个模块化流水线中,其中每个组件在过滤、检索和评估证据,以及生成与可识别法律条款相关的二元法律结果方面执行定义明确的角色。该框架通过要求对中间决策和最终指标分配进行明确解释来强调可追溯性。我们使用内部构建的包含扫描和数字两种来源的法国海洋环境法律语料库评估N2I-RAG。与多个语言模型家族的对比实验表明,所提出的方法始终优于基线系统,并且在两种不同禁令的测试中具有良好的泛化能力。结果表明,智能检索增强生成可以桥接开放文本法律语言和标准化指标计算,为透明且可扩展的法律观测站奠定基础。

英文摘要

Computing legal indicators from normative texts is a key task in legal monitoring and policy evaluation, but presents significant challenges due to the complexity, scale, and interpretive nature of legal language, as well as the variability in available document quality. Existing natural language processing techniques and generative models can assist in legal analysis, but often suffer from a high risk of hallucination and lack the interpretability and evidence grounding required for reliable indicator computation. This paper presents N2I-RAG (From Norms to Indicators), an agentic retrieval-augmented generation framework designed to automate the computation of legal indicators in a transparent and traceable way. We integrate adaptive retrieval, LLM-based agents, and validation mechanisms in a modular pipeline, where each component performs a defined role in filtering, retrieving, and assessing evidence, and in producing binary legal outcomes linked to identifiable legal provisions. The framework emphasizes traceability by requiring explicit explanations of intermediate decisions and final indicator assignments. We evaluate N2I-RAG using an in-house constructed French marine environmental law corpus that includes both scanned and digital sources. Comparative experiments with multiple language model families demonstrate that the proposed approach consistently outperforms baseline systems and generalizes well when tested on two different bans. The results indicate that agentic retrieval-augmented generation can bridge open-text legal language and standardized indicator computation, offering a foundation for transparent and scalable legal observatories.

2605.26925 2026-05-27 quant-ph cs.LG

Adaptive Reinforcement Learning for Robust Open Quantum System Control: A Multi-Task Framework with Temporal Optimization

自适应强化学习用于鲁棒开放量子系统控制:一种带有时间优化的多任务框架

Haftu W. Fentaw, Steve Campbell, Simon Caton

AI总结 提出一种多任务软演员-评论家(SAC)强化学习框架,用于开放量子系统控制,同时学习最优脉冲序列并发现特定问题的演化时间T和控制脉冲段数N,在51种哈密顿量变化下实现高保真度状态转移,并展现出优于GRAPE的鲁棒性。

详情
AI中文摘要

我们提出了一种多任务软演员-评论家(SAC)强化学习框架,用于跨不同哈密顿量的开放系统量子控制,该框架学习最优脉冲序列,同时发现特定问题的演化时间T和控制脉冲段数N。在51种哈密顿量变化上的实验结果表明,多任务SAC模型能够生成控制脉冲,在环境噪声下将系统从初始状态驱动到目标状态,并具有高保真度,为适用于实际噪声量子器件的通用量子控制奠定了必要基础。通过逐步扩展训练哈密顿量集,我们研究了使用给定数量样本哈密顿量训练的单个多任务模型是否能够成功完成来自同一哈密顿量空间但训练中未遇到的哈密顿量的状态转移任务。此外,我们的鲁棒性不保真度度量(RIM)分析表明,与GRAPE优化的控制相比,SAC训练的策略对脉冲幅度扰动和退相干率变化表现出更优越的鲁棒性。

英文摘要

We present a Multi-task Soft Actor-Critic (SAC) Reinforcement Learning framework designed for open-system quantum control across diverse Hamiltonians, which learns optimal pulse sequences while simultaneously discovering problem-specific evolution time T and number of control pulse segments N. Experimental results across 51 Hamiltonian variations demonstrate that the multi-task SAC model is able to generate control pulses that can drive a system, under environment noise, from its initial state to its target state with high fidelities, establishing essential foundations for universal quantum control applicable to realistic noisy quantum devices. Through progressive expansion of the training Hamiltonian set, we investigate whether a single multi-task model trained using a given number of sample Hamiltonians can successfully accomplish state-transfer tasks for Hamiltonians drawn from the same Hamiltonian space but not encountered during training. In addition, our Robustness Infidelity Measure (RIM) analysis reveals that SAC-trained policies exhibit superior robustness to pulse amplitude perturbations and decoherence rate variations compared to GRAPE-optimized controls.

2605.26924 2026-05-27 cs.CL

Learning to Adapt SFT Data for Better Reasoning Generalization

学习适应SFT数据以实现更好的推理泛化

Lisong Sun, Li Wang, Chen Zhang, Jinyang Wu, Kui Zhang, Tianhao Peng, Wenjun Wu

AI总结 提出DART方法,通过强化学习训练映射器将分布不匹配的SFT数据转化为模型自适应的监督,提升推理泛化能力。

详情
AI中文摘要

大型语言模型(LLMs)取得了显著进展,其中后训练在增强其推理能力方面起着关键作用。在后训练范式中,监督微调(SFT)被广泛使用:它利用外部数据提供密集监督并实现高效训练。然而,当数据分布与目标模型自身分布不匹配时,直接在专家数据上微调可能会损害泛化能力。在这项工作中,我们提出了推理调优的数据适应(DART),它将使用固定且可能分布不匹配的SFT数据集表述为对演示转换的优化问题。DART使用强化学习训练一个映射器模型,将原始SFT数据转换为与目标模型分布和学习偏好更匹配的模型自适应监督。转换后的数据随后用于SFT,使目标模型能够更好地利用外部监督。在多个模型和数据集上的实验表明,DART提高了泛化能力,实现了比直接RL更高的训练效率,并帮助模型超越标准SFT。我们的代码可在https://anonymous.4open.science/r/DART525E50D获取。

英文摘要

Large language models (LLMs) have achieved remarkable progress, with post-training playing a crucial role in enhancing their reasoning capabilities. Among post-training paradigms, supervised fine-tuning (SFT) is widely used: it leverages external data to provide dense supervision and enables efficient training. However, directly fine-tuning on expert data can hurt generalization when the data distribution is mismatched with the target model's own distribution. In this work, we propose Data Adaptation for Reasoning Tuning (DART), which formulates the use of a fixed, potentially distributionally misaligned SFT dataset as an optimization problem over demonstration transformations. DART trains a mapper model with reinforcement learning to convert original SFT data into model-adapted supervision that better matches the target model's distribution and learning preferences. The transformed data are then used for SFT, allowing the target model to better exploit external supervision. Experiments across multiple models and datasets show that DART improves generalization, achieves higher training efficiency than direct RL, and helps models surpass standard SFT. Our code is available at https://anonymous.4open.science/r/DART525E50D.

2605.26918 2026-05-27 cs.CL

Are Video Models Zero-Shot Learners and Reasoners in Education? EduVideoBench, A Knowledge-Skills-Attitude Benchmark for Educational Video Generation

视频模型是教育领域的零样本学习者和推理者吗?EduVideoBench:面向教育视频生成的知识-技能-态度基准

Unggi Lee, Hoyoung Ahn, Yoon Choi, Seonmin Eun, Jahyun Jeong, Seonmin Jin, Harmony Jung, Hye Jin Kim, Chaerin Lee, Hyunji Lee, Jeongjin Lee, Soohwan Lee, Young-Seok Oh, Jaehyeon Park, Sun-ok Ryu, Sunyoung Shin, Yoorim Son, Haeun Park, Yeil Jeong

AI总结 提出基于知识-技能-态度框架的教育视频生成基准EduVideoBench,评估五个前沿视频生成模型在教育有效性上的不足,并发现教育有效性是多维度的,单一元素不匹配即可使视频失效。

详情
AI中文摘要

视频生成模型(VGMs)正迅速进入课堂,然而现有基准仅评估感知质量、内在忠实性、通用安全性或将视频作为推理媒介,没有评估输出是否具有教育有效性。在这项工作中,我们提出了EduVideoBench,这是教育领域第一个平衡的基准,基于知识-技能-态度(KSA)框架,使得教学充分性和教育安全性被联合评估,而非作为临时的质量维度。在五个前沿VGMs上,我们的结果显示,在知识、技能和态度方面,它们距离课堂准备就绪还有很大的改进空间。我们辅以专家评论的定性分析,发现教育有效性是多维度的,单个不匹配的元素(如节奏、可读性或符号)可能使原本正确的视频失效。我们希望EduVideoBench能够指导开发教学上合理且课堂安全的VGMs。

英文摘要

Video generation models (VGMs) are rapidly entering classrooms, yet existing benchmarks evaluate only perceptual quality, intrinsic faithfulness, generic safety, or video as a reasoning medium, and none assesses whether the outputs are educationally valid. In this work, we present EduVideoBench, the first balanced benchmark in the education domain, grounded in the Knowledge-Skills-Attitude (KSA) framework so that pedagogical adequacy and educational safety are evaluated jointly rather than as ad-hoc quality dimensions. Across five frontier VGMs, our results show substantial room for improvement across knowledge, skills, and attitude before they are classroom-ready. We complement this with a qualitative analysis of expert comments, finding that educational validity is multi-component, where a single misaligned element such as pacing, legibility, or notation can invalidate an otherwise correct video. We hope EduVideoBench will guide the development of VGMs that are pedagogically grounded and safe for the classroom.

2605.26911 2026-05-27 cs.AI

TADDLE: A Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews

TADDLE: 一种用于检测有缺陷的LLM生成同行评审的工具增强型代理

Hanqi Duan, Xiang Li

AI总结 针对LLM生成的同行评审难以检测缺陷的问题,提出TADDLE工具增强型代理,通过四个专用分析工具和两阶段半监督学习,在二元检测和多标签分类任务上表现优异。

详情
AI中文摘要

LLM生成的同行评审在主要会议中越来越常见,但由于它们语言流畅、结构良好,其缺陷难以检测。现有工作要么仅分类作者身份而不评判质量,要么使用为人类撰写的评审设计的特征来评分质量;没有先前系统能在单个缺陷类型级别检测LLM生成评审中的缺陷。为弥补这一空白,我们引入了TADDLE,一种用于检测有缺陷的LLM生成同行评审的工具增强型代理,以及首个针对此任务的专家标注基准。我们的基准包含对50篇ICLR 2025论文的1800条评审,由18位领域专家根据六个缺陷类别(加上一个无缺陷标签)的分类法进行多标签标注。TADDLE将检测分解为四个专用分析工具——验证、纠正、完善和转换——由一个代理协调;一个集成器通过两阶段半监督学习将其输出综合为二元和多标签分类。大量实验表明,TADDLE在二元检测和多标签分类任务上均表现强劲。我们在https://github.com/AquariusAQ/TADDLE发布基准和代码。

英文摘要

LLM-generated peer reviews are increasingly common at major venues, yet their deficiencies are hard to detect because they are uniformly fluent and well-structured. Existing work either classifies authorship without judging quality, or scores quality with features designed for human-written reviews; no prior system detects deficiencies in LLM-generated reviews at the level of individual defect types. To bridge the gap, we introduce TADDLE, a Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews, together with the first expert-annotated benchmark for this task. Our benchmark comprises 1,800 reviews on 50 ICLR 2025 papers, multi-label-annotated by 18 domain experts against a taxonomy of six defect categories (plus a non-deficient label). TADDLE decomposes detection into four specialized analysis tools -- Verify, Correct, Complete, and Transform -- orchestrated by an agent; an integrator synthesizes their outputs into binary and multi-label classifications via two-stage semi-supervised learning. Extensive experiments show that TADDLE performs strongly on both binary detection and the multi-label classification task. We release the benchmark and code at https://github.com/AquariusAQ/TADDLE.

2605.26908 2026-05-27 cs.AI cs.DS cs.LG

On the Detection of Commutative Factors in Factor Graphs: Necessary and Sufficient Conditions

关于因子图中可交换因子检测的充要条件

Malte Luttermann, Ralf Möller, Marcel Gehrke

AI总结 本文重新审视了因子图中可交换因子检测的理论基础,指出现有算法依赖的定理仅为必要条件而非充分条件,并提出了修正算法以保证正确性和效率。

详情
AI中文摘要

利用概率图模型(如因子图)中对象的不可区分性是提升概率推理算法的关键,并允许对领域规模进行可处理的概率推理问题。在因子图中利用不可区分对象的核心是识别可交换因子,即其输出值在分配给其部分参数的输入值的排列下保持不变的因子。本文重新审视了检测可交换因子的最先进算法的理论基础。具体而言,我们表明,在其当前形式下,最先进算法依赖于一个中心定理,该定理被错误地视为识别可交换因子的充分条件,而实际上它仅意味着必要条件。因此,正如我们在本文中所展示的,最先进算法可能会产生错误结果。为了修复当前最先进算法中存在的缺陷,我们证明了上述定理的一个略微修改版本,该版本作为识别可交换因子的必要条件。此外,我们提出了最先进算法的修正版本,在保持其效率的同时确保正确性,并引入了一种具有更严格最坏情况边界的补充算法。

英文摘要

Exploiting the indistinguishability of objects in a probabilistic graphical model such as a factor graph is key to lifted probabilistic inference algorithms and allows for tractable probabilistic inference problems with respect to domain sizes. A central building block for the exploitation of indistinguishable objects in factor graphs is the identification of commutative factors, i.e., factors whose output values are invariant under permutations of input values assigned to a subset of their arguments. In this paper, we revisit the theoretical foundations underlying the state-of-the-art algorithm to detect commutative factors. Specifically, we show that in its current form, the state-of-the-art algorithm relies on a central theorem that is mistakenly regarded as a sufficient condition to identify commutative factors, while it actually only implies a necessary condition. Consequently, the state of the art might, as we show in this paper, deliver incorrect results. To fix the flaws currently present in the state of the art, we prove a slightly modified version of the aforementioned theorem, which serves as a necessary condition to identify commutative factors. Moreover, we present a corrected version of the state-of-the-art algorithm, which keeps its efficiency while ensuring correctness, and introduce a complementary algorithm with tighter worst-case bounds.
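To make the definition of a commutative factor concrete, here is a minimal brute-force sketch in Python. It is not the algorithm studied in the paper: it enumerates every assignment and every permutation, so it is exponential and only suitable for tiny factors. The function name `is_commutative` and the two example factors are hypothetical, and the arguments are assumed to be Boolean.

```python
from itertools import permutations, product


def is_commutative(factor, n_args, subset, domain=(0, 1)):
    """Brute-force check: is `factor` invariant under every permutation of
    the values assigned to the argument positions listed in `subset`?"""
    for assignment in product(domain, repeat=n_args):
        base = factor(*assignment)
        values = [assignment[i] for i in subset]
        for perm in permutations(values):
            permuted = list(assignment)
            for pos, val in zip(subset, perm):
                permuted[pos] = val
            if factor(*permuted) != base:
                return False
    return True


# Hypothetical factors over three Boolean arguments (illustration only).
def phi_sym(a, b, c):
    # Depends on a and b only through a + b, so swapping them changes nothing.
    return 1.0 + (a + b) + 10.0 * c


def phi_asym(a, b, c):
    # a and b carry different weights, so swapping them changes the output.
    return 1.0 + a + 2.0 * b + 10.0 * c


print(is_commutative(phi_sym, 3, [0, 1]))
print(is_commutative(phi_asym, 3, [0, 1]))
```

Because the check compares outputs for all permutations directly, it tests the definition itself rather than any shortcut criterion; the efficient detection algorithms discussed in the abstract exist precisely to avoid this enumeration.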

2605.26903 2026-05-27 cs.CR cs.AI

Practical Anonymous Two-Party Gradient Boosting Decision Tree

实用的匿名两方梯度提升决策树

Huang Chenyu, Zhang Fan, Du Minxin, Chow Sherman SM, Chen Huangxun, Rao Huaming, Huang Danqing, Qian Bo, Chen Peng

AI总结 针对两方垂直分割数据上的梯度提升决策树训练,提出一种基于双电路隐私集合求交和遗忘可编程伪随机函数的匿名协议,在隐藏记录标识符的同时保持效率。

详情
Journal ref
2026 IEEE Symposium on Security and Privacy (SP)
Comments
19 pages; 2026 IEEE Symposium on Security and Privacy (SP)
AI中文摘要

梯度提升决策树(GBDT)擅长处理结构化数据,通常用于在互不信任的各方之间垂直分割的特征上进行训练。高速和可解释性使得GBDT在金融和医疗领域广受欢迎,而神经网络在这些领域可能表现不佳。为GBDT启用安全计算带来了独特的挑战,需要安全的记录对齐以进行比较。依赖隐私集合求交(PSI)是一种事实上的方法。将PSI误认为是安全措施实际上会暴露数据集中哪些记录标识符(ID)是共享的。尽管电路PSI可以提供帮助,但对于通用用途来说成本高昂。需要新的思路来在“黑暗森林”中高效训练。为了隐藏ID,我们启动了对两方持有的分割数据上的匿名GBDT训练的研究。我们设计中的双电路PSI让双方交替作为接收者,对本地特征执行“选取后求和”。通过遗忘可编程伪随机函数,我们将电路PSI的输出作为共享状态在运行之间传播。避免通用对齐,我们解决了被忽视的困境:隐藏ID会带来与域大小成比例的成本。接下来,我们将用于将单指令多数据同态加密从(环)学习误差转换的密文打包成本减半,相比之前的安全GBDT(Usenix Security' 23)和相关安全机器学习计算。对比实验表明,我们的协议在效率上与有泄漏的方法相比仍具有竞争力。通过启用隐藏ID的聚合,我们的技术可以扩展到其他垂直分割的分析场景。

英文摘要

Structured data is well handled by gradient-boosted decision trees (GBDT), which are usually trained on vertically partitioned features across mutually distrustful parties. High speed and interpretability make GBDTs popular in finance and healthcare, where neural networks may fall short. Enabling secure computation for GBDTs poses unique challenges, requiring secure record alignment for comparison. Relying on private set intersection (PSI) is a de facto approach. Mistaking PSI for a safety measure actually exposes which record identifiers (IDs) are shared between the datasets. Although circuit-PSI could help, it is costly for generic uses. New ideas are needed to efficiently train in a "dark forest". Aiming to hide the IDs, we initiate the study of anonymous GBDT training on split data held by two parties. Dual circuit-PSI in our design lets the parties alternate as receiver to run pick-then-sum over local features. Via oblivious programmable pseudorandom functions, we propagate circuit-PSI outputs as shared state across runs. Avoiding universal alignment, we resolve the neglected dilemma that ID hiding incurs a cost that scales with domain size. Next, we halve the cost of ciphertext packing used to convert single-instruction multiple-data homomorphic encryption from (ring) learning with errors in prior secure GBDT (USENIX Security '23) and related secure machine-learning computations. Comparative experiments show our protocol remains competitive with leaky approaches in efficiency. Enabling ID-hiding aggregation, our techniques can extend to other vertically partitioned analytics.

2605.26900 2026-05-27 cs.LG

SPHERE-JEPA: Spherical Prediction with Homogeneous Embeddings

SPHERE-JEPA: 均匀嵌入的球面预测

Léo Nicollier, Max Dunitz, Marc Pic, Pablo Musé, Enric Meinhardt-Llopis, Gabriele Facciolo

AI总结 本文提出SPHERE-JEPA框架,通过将Cramér-Wold投影机制调整为强制超球面均匀性而非高斯先验,解决了自监督学习中高斯嵌入导致各向异性k-NN邻域的问题,在纹理检索和ImageNet-1K线性探测上取得显著提升。

详情
AI中文摘要

自监督学习中的一个基本开放问题是明确表征学习表示的最优几何。最近,LeJEPA将各向同性高斯嵌入确定为在欧几里得空间中最小化下游预测风险的最优解。然而,对于支撑在低维流形(如超球面)上的分布,相应问题仍未探索。在这项工作中,我们证明将这种极小极大分析扩展到黎曼流形上的光滑分布会根本性地改变最优解。我们表明,在最坏情况公式下,k近邻和核岭回归都诱导超球面均匀性。更精确地说,我们证明流形上的均匀分布对于k近邻是最优的,而球面上的均匀分布对于使用指数点积核和线性核的核岭回归是最优的。这一理论见解揭示了高斯嵌入的一个根本局限:其非均匀密度导致各向异性的k-NN邻域,严重偏置估计器。为纠正这一点,我们引入了SPHERE-JEPA,一个理论基础的SSL框架。我们调整LeJEPA的Cramér-Wold投影机制以强制超球面均匀性而非高斯先验。实验上,SPHERE-JEPA取得了显著改进,将纹理检索mAP提升了超过6%,同时在标准基准上持续匹配或超越LeJEPA——包括在ImageNet-1K(ViT-B/14)上+1.8%的线性探测增益。

英文摘要

A fundamental open question in self-supervised learning (SSL) is the explicit characterization of the optimal geometry of the learned representations. Recently, LeJEPA identified isotropic Gaussian embeddings as optimal for minimizing downstream prediction risk in Euclidean spaces. However, the corresponding problem for distributions supported on lower-dimensional manifolds, such as the hypersphere, remains unexplored. In this work, we demonstrate that extending this minimax analysis to smooth distributions on Riemannian manifolds fundamentally changes the optimal solution. We show that, under a worst-case formulation, both k-nearest neighbors and kernel ridge regression induce hyperspherical uniformity. More precisely, we show that uniform distributions on manifolds are optimal for k-nearest neighbors, and that the uniform distribution on the sphere is optimal for kernel ridge regression with both the exponential dot-product kernel and the linear kernel. This theoretical insight reveals a fundamental limitation of Gaussian embeddings: their non-uniform density induces anisotropic k-NN neighborhoods, severely biasing the estimator. To correct this, we introduce SPHERE-JEPA, a theoretically grounded SSL framework. We adapt LeJEPA's Cramér-Wold projection mechanism to enforce hyperspherical uniformity rather than a Gaussian prior. Empirically, SPHERE-JEPA yields significant improvements, boosting texture retrieval mAP by over 6%, while consistently matching or outperforming LeJEPA on standard benchmarks, including a +1.8% linear probing gain on ImageNet-1K (ViT-B/14).

2605.26898 2026-05-27 cs.SE cs.AI

Strategies for Guiding LLMs to Use Software Design Patterns: A Case of Singleton

引导LLM使用软件设计模式的策略:以单例模式为例

Viktor Kjellberg, Farnaz Fotrousi, Miroslaw Staron

AI总结 通过实验比较四种提示策略(指令、二元自动反馈、详细自动反馈、少样本详细反馈),评估13个LLM在164个Java编码挑战中生成遵循单例模式的代码的能力,发现迭代二元反馈在保持或提升功能性的同时最佳地实现了单例模式对齐。

详情
Comments
Accepted at PROMISE 2026
AI中文摘要

大型语言模型(LLM)可以从自然语言提示生成功能性源代码,但往往无法一致地遵循更高级别的架构结构或设计模式。由于LLM在软件工程中的应用日益增多,它们将既定设计原则应用于生成代码的能力对于软件产品的长期成功至关重要。因此,本文的目标是确定引导LLM将设计模式融入生成源代码的策略。我们设计了一个计算实验,评估13个LLM生成遵循单例设计模式的代码的能力,使用了四种提示策略:指令、二元自动反馈、详细自动反馈以及带少样本提示的详细反馈,在HumanEval-X的164个Java编码挑战中进行。我们的结果表明,引导LLM包含设计模式的最佳策略在很大程度上取决于模型类型。尽管如此,总体而言,迭代二元反馈在保持或改善代码功能性的同时,提供了与单例模式的最佳对齐。通过指令引导,Llama 3.3在100%的情况下生成了单例类,并改善了代码功能性,使通过的测试数量增加了34.1个百分点。通过指令和二元反馈引导,它取得了类似的结果。Qwen 3(8B)使用二元反馈将单例模式对齐度提高到99.2%,功能性提高到58.6%。我们的结果表明,即使是简单的策略也可以用于引导LLM使用设计模式。

英文摘要

Large Language Models (LLMs) can generate functional source code from natural-language prompts, but often fail to consistently follow higher-level architectural structures or design patterns. Since LLMs are increasingly used in software engineering, their ability to apply established design principles to generated code is crucial to the long-term success of software products. Therefore, the goal of this paper is to identify strategies for guiding LLMs to incorporate design patterns into the generated source code. We designed a computational experiment to evaluate the ability of 13 LLMs to generate code that follows the Singleton design pattern, using four prompting strategies: instructions, binary automated feedback, extensive automated feedback, and extensive feedback with few-shot prompts, in 164 Java coding challenges from HumanEval-X. Our results show that the optimal strategy to guide LLMs to include design patterns depends heavily on the type of model. Still, overall, iterative binary feedback provides the best alignment with Singleton while preserving or improving the code's functionality. When guided with instructions, Llama 3.3 generated Singleton classes in 100% of cases and improved code functionality, increasing the number of tests passed by 34.1 percentage points. It achieved a similar result with guidance through instructions and binary feedback. Qwen 3 (8B) increased the alignment with Singleton to 99.2% and the functionality to 58.6% using binary feedback. Our results suggest that even simple strategies can be used to guide LLMs to use design patterns.
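As a rough illustration of what a binary automated feedback signal for Singleton alignment could look like, the Python sketch below applies a regex heuristic to a Java source string and returns only PASS or FAIL. The abstract does not describe how the authors' checker works, so everything here is an assumption: the function names `looks_like_singleton` and `binary_feedback`, the three structural criteria (private constructor, private static self-typed field, public static accessor), and the sample Java snippets are all hypothetical.

```python
import re


def looks_like_singleton(java_source, class_name):
    """Heuristic structural check (an assumption, not the paper's checker)."""
    name = re.escape(class_name)
    # A private constructor, e.g. `private Solution() {`
    private_ctor = re.search(r"\bprivate\s+" + name + r"\s*\(", java_source)
    # A private static field of the class's own type.
    static_field = re.search(
        r"\bprivate\s+static\s+(?:volatile\s+|final\s+)*" + name + r"\s+\w+",
        java_source,
    )
    # A public static accessor returning the class's own type.
    accessor = re.search(
        r"\bpublic\s+static\s+(?:synchronized\s+)?" + name + r"\s+\w+\s*\(",
        java_source,
    )
    return all((private_ctor, static_field, accessor))


def binary_feedback(java_source, class_name):
    """Collapse the check into the single bit a feedback prompt would carry."""
    return "PASS" if looks_like_singleton(java_source, class_name) else "FAIL"


# Hypothetical model outputs used only to exercise the check.
SINGLETON_SRC = """
public class Solution {
    private static Solution instance;
    private Solution() {}
    public static Solution getInstance() {
        if (instance == null) { instance = new Solution(); }
        return instance;
    }
    public int add(int a, int b) { return a + b; }
}
"""

PLAIN_SRC = """
public class Solution {
    public int add(int a, int b) { return a + b; }
}
"""

print(binary_feedback(SINGLETON_SRC, "Solution"))
print(binary_feedback(PLAIN_SRC, "Solution"))
```

A regex check like this can be fooled by comments or unusual formatting; a parser-based check would be more robust, at the cost of an extra dependency.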

2605.26895 2026-05-27 cs.LG cs.AI stat.ML

Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models

微不足道的大小,显著的效果:大型语言模型中的尺度向量

Mingze Wang, Shuchen Zhu, Yuxin Fang, Binghui Li, Kai Shen, Shu Zhong

AI总结 本文系统研究了大型语言模型中的尺度向量,发现其虽参数占比极小但对预训练至关重要,通过自放大预条件效应优化优化过程,并提出了三种轻量级改进策略,在多种模型规模上一致提升性能。

详情
Comments
36 pages
AI中文摘要

现代大型语言模型(LLM)中的归一化层由确定性归一化操作和可学习的尺度向量组成。尽管归一化操作已被广泛研究,但尺度向量尽管被普遍使用,其作用仍未被充分理解。在这项工作中,我们从表达能力、优化和架构结构的角度对LLM中的尺度向量进行了系统研究。首先,我们通过实验表明,虽然尺度向量仅占模型参数的极小部分,但移除它们会显著降低LLM的预训练效果。我们的理论进一步表明,在Pre-Norm架构中,尺度向量并不增加表达能力;相反,它们通过对后续线性映射产生自放大预条件效应来改善优化。其次,我们研究了权重衰减对尺度向量的作用。通过区分Input-Norm和Output-Norm层,我们从理论上证明,由于它们在优化和表达能力中的不同作用,权重衰减对前者有益但对后者有害。第三,受此理解的启发,我们提出了三种轻量级且互补的尺度向量改进方法:分支特异性异质性、线性映射周围的改进放置以及幅度-方向重参数化。理论和实验均表明,每种改进都能带来一致的收益。最后,我们将这些改进整合为一个统一的尺度向量策略,并通过在0.12B到2B参数的密集和混合专家模型上进行大规模LLM预训练实验,使用多种优化器和学习率调度,在工业级token预算下进行评估。该统一策略始终比精心调整的基线获得更低的终端损失,并展现出更有利的扩展行为,同时增加可忽略的参数和计算开销。

英文摘要

Normalization layers in modern large language models (LLMs) consist of a deterministic normalization operation and a learnable scale vector. While the normalization operation has been extensively studied, the scale vector remains poorly understood despite its ubiquitous use. In this work, we present a systematic study of scale vectors in LLMs from the perspectives of expressivity, optimization, and architectural structure. First, we show empirically that although scale vectors constitute only a negligible fraction of model parameters, removing them substantially degrades LLM pre-training. Our theory further shows that, in Pre-Norm architectures, scale vectors do not increase expressivity; instead, they improve optimization through a self-amplifying preconditioning effect on subsequent linear mappings. Second, we investigate the role of weight decay for scale vectors. By distinguishing Input-Norm and Output-Norm layers, we theoretically show that weight decay is beneficial for the former but harmful for the latter, due to their distinct roles in optimization and expressivity. Third, motivated by this understanding, we propose three lightweight and complementary improvements to scale vectors: branch-specific heterogeneity, improved placement around linear mappings, and magnitude-direction reparameterization. Both theory and experiments show that each improvement yields consistent gains. Finally, we combine these improvements into a unified scale-vector strategy and evaluate it through extensive LLM pre-training experiments on dense and mixture-of-experts models ranging from 0.12B to 2B parameters, across multiple optimizers and learning rate schedules, under industrial-scale token budgets. The unified strategy consistently achieves lower terminal loss than well-tuned baselines and exhibits more favorable scaling behavior, while adding negligible parameter and computational overhead.
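One way to see why a scale vector need not add expressivity is that, when the normalized output feeds directly into a linear map, the scale can be folded into that map's weights. The pure-Python sketch below checks this identity numerically for an RMSNorm-style layer. It is an illustration under stated assumptions, not the paper's analysis: the helper names, the use of RMS normalization, and the toy values are all chosen here for the example.

```python
import math


def rms_norm(x, eps=1e-6):
    """Deterministic normalization step: divide by the root mean square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def matvec(W, x):
    """Return W @ x, where W is given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def scaled_norm_then_linear(x, g, W):
    """Normalization, elementwise scale vector g, then the linear map W."""
    h = [gi * hi for gi, hi in zip(g, rms_norm(x))]
    return matvec(W, h)


def absorb_scale(W, g):
    """Fold the scale into the weights: W' = W diag(g) (scale column j by g[j])."""
    return [[w * gj for w, gj in zip(row, g)] for row in W]


# Toy values chosen for illustration only.
x = [0.5, -1.0, 2.0]
g = [1.5, 0.5, 2.0]
W = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]

y_with_scale = scaled_norm_then_linear(x, g, W)
y_absorbed = matvec(absorb_scale(W, g), rms_norm(x))
print(max(abs(a - b) for a, b in zip(y_with_scale, y_absorbed)))
```

The two computations represent the same function, yet gradient descent on `(g, W)` and on the folded weights follows different trajectories, which is consistent with the abstract's framing of scale vectors as an optimization aid rather than a source of extra capacity.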

2605.26894 2026-05-27 cs.CV

SIMPC: Learning Self-Induced Mirror-Point Consistency for Unsupervised Point Cloud Denoising

SIMPC: 学习自诱导镜像点一致性用于无监督点云去噪

Chengwei Zhang, Xueyi Zhang, Tao Jiang, Xinhao Xu, Wenjie Li, Fubo Zhang, Longyong Chen

AI总结 提出自诱导镜像点一致性(SIMPC)方法,通过几何先验生成镜像点并约束去噪目标一致性,实现无监督点云去噪,在合成和真实数据集上超越现有无监督及部分有监督方法。

详情
Comments
Accepted by ICML 2026. 17 pages, 8 figures, 8 tables
AI中文摘要

在点云中,噪声直接扰动编码空间位置和几何形状的点坐标,使得构建一一对应关系比图像更具挑战性。现有方法通过噪声或最优传输在噪声变体之间施加统计映射,但存在对应歧义。本文提出自诱导镜像点一致性(SIMPC),以无监督方式学习点与潜在表面之间的确定性对应关系。对于每个噪声点,SIMPC在去噪过程中根据几何先验在潜在表面的另一侧生成一个镜像点。通过鼓励原始点与其镜像点的去噪目标之间的一致性,SIMPC有效定位潜在表面的位置。在合成和真实数据集上的大量实验表明,SIMPC显著优于最先进的无监督方法,并超越了几种强监督方法。

英文摘要

In point clouds, noise directly perturbs point coordinates that encode both spatial location and geometry, making one-to-one correspondence construction more challenging than in images. Existing methods impose statistical mappings across noisy variants via noise or optimal transport, but suffer from correspondence ambiguity. In this work, we propose Self-Induced Mirror-Point Consistency (SIMPC) to learn deterministic correspondences between points and the underlying surface in an unsupervised manner. For each noisy point, SIMPC generates a mirror-point on the opposite side of the underlying surface, guided by geometric priors during the denoising process. By encouraging consistency between the denoising targets of the original point and its mirror counterpart, SIMPC effectively localizes the position of the underlying surface. Extensive experiments on synthetic and real-world datasets demonstrate that SIMPC significantly outperforms state-of-the-art unsupervised methods and surpasses several strong supervised counterparts.
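The geometric idea of a mirror-point can be sketched with a plain reflection across a local tangent plane. The Python snippet below assumes the plane is already known through a surface point `q` and a normal `n`, which is a strong simplification: the abstract says SIMPC derives its mirror-points from geometric priors during denoising, and that construction is not reproduced here. The function name `mirror_point` and the sample coordinates are hypothetical.

```python
import math


def mirror_point(p, q, n):
    """Reflect point p across the plane through q with normal n.

    Assumes the local surface is approximated by that plane (illustrative only).
    """
    norm = math.sqrt(sum(c * c for c in n))
    u = [c / norm for c in n]  # unit normal
    # Signed distance from p to the plane along the unit normal.
    d = sum((pi - qi) * ui for pi, qi, ui in zip(p, q, u))
    return [pi - 2.0 * d * ui for pi, ui in zip(p, u)]


# A noisy point 0.3 above the plane z = 0 (toy values).
p = [1.0, 2.0, 0.3]
q = [0.0, 0.0, 0.0]
n = [0.0, 0.0, 2.0]  # the normal does not need to be unit length

m = mirror_point(p, q, n)
midpoint = [(a + b) / 2.0 for a, b in zip(p, m)]
print(m, midpoint)
```

In this simplified setting the midpoint of a point and its mirror lies on the plane, which gives some intuition for why agreement between the two denoising targets can pin down the surface location.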

2605.26893 2026-05-27 cs.CL cs.AI

GeoFaith: A Spatio-Temporal Dual View of Faithful Chain-of-Thought

GeoFaith: 时空双视角下的忠实思维链

Weijiang Lv, Wentong Zhao, Jiayu Wang, Yuhao Wu, Jiaheng Wei, Xiaobo Xia

AI总结 针对思维链推理中的事后合理化问题,提出基于潜在几何结构和熵动力学的时空框架GeoFaith,通过可扩展的引导流水线构建忠实性检测器并联合优化结果正确性、过程忠实性和轨迹一致性。

详情
AI中文摘要

思维链推理推动了大型语言模型的发展,但基于结果的监督导致了普遍的事后合理化,产生了看似合理但不忠实的推理链。大多数先前的忠实性评估方法要么不可扩展、昂贵,要么不可靠。我们提出GeoFaith,一个利用潜在几何结构和熵动力学来诊断和强制忠实推理的时空框架。我们开发了一个可扩展的引导流水线,将四个领域的步骤级标注从1k扩展到20k样本,训练了一个在标准基准上优于GPT-5的8B忠实性检测器,并设计了一个忠实性感知的强化学习框架,联合优化结果正确性、过程忠实性和轨迹一致性。实验表明,所提出的方法在忠实性检测和下游推理上均取得了优越性能,生成了更短、更可解释的链,且不牺牲准确性。我们的代码将公开提供。

英文摘要

Chain-of-Thought (CoT) reasoning has advanced large language models (LLMs), but outcome-based supervision leads to pervasive post-hoc rationalization, producing plausible yet unfaithful reasoning chains. Most prior faithfulness assessment methods are either unscalable, expensive, or unreliable. We propose GeoFaith, a spatio-temporal framework that leverages latent geometric structure and entropy dynamics to diagnose and enforce faithful reasoning. We develop a scalable bootstrapping pipeline expanding step-level annotations from 1k to 20k samples across four domains, train an 8B faithfulness detector outperforming GPT-5 on standard benchmarks, and design a faithfulness-aware reinforcement learning framework jointly optimizing outcome correctness, process faithfulness, and trajectory consistency. Experiments show the proposed method achieves superior performance on both faithfulness detection and downstream reasoning, producing shorter, more interpretable chains without sacrificing accuracy. Our code will be made available publicly.

2605.26891 2026-05-27 cs.CL

Telenor Nordics Customer Service self-help corpus

Telenor Nordics 客户服务自助语料库

Mike Riess

AI总结 本文构建了一个包含芬兰语、丹麦语、挪威语和瑞典语的多语言客户服务自助语料库,共1122篇文档,用于支持北欧NLP和信息检索研究。

详情
Comments
8 pages, 2 figures, 5 tables. Submitted to Nordic Machine Intelligence. Dataset: https://zenodo.org/records/19493152
AI中文摘要

本文介绍了一个多语言客户服务自助语料库,包含1122篇经过人工验证的芬兰语、丹麦语、挪威语和瑞典语文档,总词数超过一百万。这些文档来自四家北欧电信运营商的公共自助页面,随后通过结合LLM和人工标注的流程过滤了个人身份信息和相关性。北欧语言的领域特定数据集仍然稀缺,尤其是在客户服务领域——这一领域对于检索增强生成、跨语言迁移学习和新兴的基于代理的服务架构日益重要。对语料库的分析显示,不同运营商的文档长度和结构存在显著差异,反映了不同的编辑策略,以及涵盖网络硬件、移动服务、电视和流媒体、计费和账户管理的广泛主题覆盖。该数据集在CC-BY-NC-SA-4.0许可下公开提供,网址为https://zenodo.org/records/19493152,旨在支持北欧NLP和信息检索的可重复研究。

英文摘要

This paper presents a multilingual customer service self-help corpus comprising 1,122 manually validated documents in Finnish, Danish, Norwegian, and Swedish, totaling over one million tokens. The documents have been sourced from the public self-help pages of four Nordic telecommunications operators and subsequently filtered for person-identifiable information and relevance through a combined LLM and human annotation pipeline. Domain-specific datasets for Nordic languages remain scarce, particularly in customer service: a domain of growing importance for retrieval-augmented generation, cross-lingual transfer learning, and emerging agent-based service architectures. An analysis of the corpus reveals substantial variation in document length and structure across operators, reflecting distinct editorial strategies, as well as broad topical coverage spanning network hardware, mobile services, TV and streaming, billing, and account management. The dataset is publicly available under a CC-BY-NC-SA-4.0 license at https://zenodo.org/records/19493152, intended to support reproducible research in Nordic NLP and information retrieval.