arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.12427 2026-05-13 cs.LG math.CO

Learning Minimally Rigid Graphs with High Realization Counts

Oleksandr Slyvka, Jan Rubeš, Rodrigo Alves, Jan Legerský

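AI summary: This paper studies how to construct minimally rigid graphs with very large realization counts, an extremal problem in rigidity theory. The authors propose a reinforcement-learning method that generates minimally rigid graphs via 0-extensions and 1-extensions (Henneberg moves) and optimizes realization-count invariants with the Deep Cross-Entropy Method. The method matches the best known results for planar realization counts and sets new records for spherical realization counts.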

Comments This is an extended version of the paper accepted to IJCAI 2026

Abstract

For minimally rigid graphs, the same edge-length data can admit multiple realizations (up to translations and rotations). Finding graphs with exceptionally many realizations is an extremal problem in rigidity theory, but exhaustive search quickly becomes infeasible due to the super-exponential growth of the number of candidate graphs and the high cost of realization-count evaluation. We propose a reinforcement-learning approach that constructs minimally rigid graphs via 0- and 1-extensions, also known as Henneberg moves. We optimize realization-count invariants using the Deep Cross-Entropy Method with a policy parameterized by a Graph Isomorphism Network encoder and a permutation-equivariant extension-level action head. Empirically, our method matches the known optima for planar realization counts and improves the best known bounds for spherical realization counts, yielding new record graphs.
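
As a rough illustration of the construction moves named in the abstract, the sketch below applies one 0-extension and one 1-extension to a graph stored as a set of edges. The function names and the edge-set representation are illustrative assumptions, not the authors' code; only the standard definitions of the two Henneberg moves are used, and no realization count is computed.

```python
def zero_extension(edges, n, u, v):
    """0-extension: add new vertex n joined to two distinct existing vertices."""
    assert u != v and 0 <= u < n and 0 <= v < n
    new_edges = edges | {frozenset((u, n)), frozenset((v, n))}
    return new_edges, n + 1


def one_extension(edges, n, edge, w):
    """1-extension: delete edge {u, v}, then join new vertex n to u, v and w."""
    u, v = tuple(edge)
    removed = frozenset((u, v))
    assert removed in edges and w not in (u, v) and 0 <= w < n
    added = {frozenset((u, n)), frozenset((v, n)), frozenset((w, n))}
    return (edges - {removed}) | added, n + 1


# Start from a single edge on vertices 0 and 1.
edges, n = {frozenset((0, 1))}, 2
# 0-extension on vertices 0 and 1 produces a triangle.
edges, n = zero_extension(edges, n, 0, 1)
# 1-extension on edge {0, 1} with third vertex 2.
edges, n = one_extension(edges, n, (0, 1), 2)

# Each move preserves the planar edge count |E| = 2|V| - 3.
print(n, len(edges))
```

Both moves keep the edge count at 2|V| - 3, which is why a policy that only chooses among such moves never leaves the class of minimally rigid graphs in the plane.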

2605.12426 2026-05-13 cs.CL

Geometric Factual Recall in Transformers

Shauli Ravfogel, Gilad Yehudai, Joan Bruna, Alberto Bietti

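AI summary: This paper studies how Transformer language models memorize factual associations and proposes a geometric memorization mechanism that differs from the conventional view in which parameter counts grow linearly with the number of facts. Learned embeddings encode relational structure so that embedding vectors directly carry factual associations, while the multilayer perceptron (MLP) acts as a relation-conditioned selector that extracts the relevant attribute via ReLU gating. Experiments on single-hop and multi-hop factual queries show strong generalization and reveal the mechanism by which a trained model transfers zero-shot to new factual associations.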

Comments Preprint

Abstract

How do transformer language models memorize factual associations? A common view casts internal weight matrices as associative memories over pairs of embeddings, requiring parameter counts that scale linearly with the number of facts. We develop a theoretical and empirical account of an alternative, geometric form of memorization in which learned embeddings encode relational structure directly, and the MLP plays a qualitatively different role. In a controlled setting where a single-layer transformer must memorize random bijections from subjects to a shared attribute set, we prove that a logarithmic embedding dimension suffices: subject embeddings encode linear superpositions of their associated attribute vectors, and a small MLP acts as a relation-conditioned selector that extracts the relevant attribute via ReLU gating, and not as an associative key-value mapping. We extend these results to the multi-hop setting -- chains of relational queries such as "Who is the mother of the wife of $x$?" -- providing constructions with and without chain-of-thought that exhibit a provable capacity-depth tradeoff, complemented by a matching information-theoretic lower bound. Empirically, gradient descent discovers solutions with precisely the predicted structure. Once trained, the MLP transfers zero-shot to entirely new bijections when subject embeddings are appropriately re-initialized, revealing that it has learned a generic selection mechanism rather than memorized any particular set of facts.

2605.12422 2026-05-13 cs.CL cs.CY

Predicting Disagreement with Human Raters in LLM-as-a-Judge Difficulty Assessment without Using Generation-Time Probability Signals

Yo Ehara

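AI summary: This study addresses disagreement with human raters when large language models (LLMs) are used as judges to assess the difficulty of educational materials. Unlike prior methods that rely on generation-time probability signals, it proposes a prediction method that needs no such signals: it builds a separate embedding space and, exploiting the ordinal nature of difficulty, identifies likely disagreement cases from the geometric consistency of the rating set. Experiments show that the method outperforms probability-based baselines at predicting disagreement with human raters.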

Comments Accepted to Educational Data Mining (EDM) 2026 (Poster/Demo Track)

Abstract

Automatic generation of educational materials using large language models (LLMs) is becoming increasingly common, but assigning difficulty levels to such materials still requires substantial human effort. LLM-as-a-Judge has therefore attracted attention, yet disagreement with human raters remains a major challenge. We propose a method for predicting which LLM-generated difficulty ratings are likely to disagree with human raters, so that such cases can be sent for re-rating. Unlike prior approaches, our method does not rely on generation-time probability signals, which must be collected during rating generation and are often difficult to compare across LLMs. Instead, exploiting the fact that difficulty is an ordinal scale, we use a separate embedding space, such as ModernBERT, and identify disagreement candidates based on the geometric consistency of the rating set. Experiments on English CEFR-based sentence difficulty assessment with GPT-OSS-120B and Qwen3-235B-A22B showed that the proposed method achieved higher AUC for predicting disagreement with human raters than probability-based baselines.

2605.12421 2026-05-13 cs.AI

Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers

Haoyu Wang, Yuliang Song, Tao Li, Zhiwei Deng, Yaqing Wang, Deepak Ramachandran, Eldan Cohen, Dan Roth

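AI summary: This paper examines the heuristic trap that large language models (LLMs) face when generating combinatorial solvers. Using a benchmark of 100 combinatorial problems, it compares three solver-construction approaches and finds that constraint modeling with Python and OR-Tools achieves the highest correctness, while MiniZinc with OR-Tools has lower coverage despite using the same back-end. Prompting the LLM to optimize search yields only small speed-ups and can reduce correctness, because the LLM tends to adopt heuristics such as local approximations or redundant constraints that degrade solution quality. The paper recommends using the LLM primarily for formal modeling and verifying any search optimization separately.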

Abstract

Large Language Models (LLMs) struggle to solve complex combinatorial problems through direct reasoning, so recent neuro-symbolic systems increasingly use them to synthesize executable solvers. A central design question is how the LLM should represent the solver, and whether it should also attempt to optimize search. We introduce CP-SynC-XL, a benchmark of 100 combinatorial problems (4,577 instances), and evaluate three solver-construction paradigms: native algorithmic search (Python), constraint modeling through a Python solver API (Python + OR-Tools), and declarative constraint modeling (MiniZinc + OR-Tools). We find a consistent representational divergence: Python + OR-Tools attains the highest correctness across LLMs, while MiniZinc + OR-Tools has lower absolute coverage despite using the same OR-Tools back-end. Native Python is the most likely to return a schema-valid solution that fails verification, whereas solver-backed paths preserve higher conditional fidelity. On the heuristic axis, prompting for search optimization yields only small median speed-ups (1.03-1.12x) and a strongly bimodal effect: many instances slow down, and correctness drops sharply on a long tail of problems. A paired code-level audit traces these regressions to a recurring heuristic trap. Under an efficiency-oriented prompt, the LLM may replace complete search with local approximations (Python), inject unverified bounds (Python + OR-Tools), or add redundant declarative machinery that overwhelms or over-constrains the model (MiniZinc + OR-Tools). These findings support a conservative design principle for LLM-generated combinatorial solvers: use the LLM primarily to formalize variables, constraints, and objectives for verified solvers, and separately check any LLM-authored search optimization before use.

2605.12419 2026-05-13 cs.CL cs.IR cs.LG

ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging

Neha Verma, Nikhil Mehta, Shao-Chuan Wang, Naijing Zhang, Alicia Tsai, Li Wei, Lukasz Heldt, Lichan Hong, Ed Chi, Xinyang Yi

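AI summary: Although large language models perform well on many tasks, fine-tuning them for a specific task often causes them to forget their original language reasoning abilities. This paper studies the problem in Generative Retrieval (GenRetrieval) and proposes ORBIT, which tracks the parameter distance between the fine-tuned and original models and applies weight averaging to limit model drift when that distance exceeds a threshold, thereby preserving both text generation and retrieval capabilities. Experiments show that ORBIT preserves model performance better than existing continual-learning and regularization methods.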

Abstract

Despite the rapid advancements in large language model (LLM) development, fine-tuning them for specific tasks often results in the catastrophic forgetting of their general, language-based reasoning abilities. This work investigates and addresses this challenge in the context of the Generative Retrieval (GenRetrieval) task. During GenRetrieval fine-tuning, we find this forgetting occurs rapidly and correlates with the distance between the fine-tuned and original model parameters. Given these observations, we propose ORBIT, a novel approach that actively tracks the distance between fine-tuned and initial model weights, and uses a weight averaging strategy to constrain model drift during GenRetrieval fine-tuning when this inter-model distance exceeds a maximum threshold. Our results show that ORBIT retains substantial text and retrieval performance by outperforming both common continual learning baselines and related regularization methods that also employ weight averaging.
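
The drift-control idea described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the ORBIT algorithm itself: the abstract does not specify the distance measure or the averaging rule, so the Euclidean distance, the interpolation weight `alpha`, and the function names here are all hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two flat parameter vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def constrain_drift(theta, theta0, max_dist, alpha=0.5):
    """Average fine-tuned weights back toward the original ones when drift is large.

    theta    : current fine-tuned parameters (flat list of floats)
    theta0   : original model parameters
    max_dist : hypothetical maximum allowed distance
    alpha    : hypothetical weight kept on the fine-tuned parameters
    """
    dist = l2_distance(theta, theta0)
    if dist <= max_dist:
        return list(theta), dist  # within the allowed region, leave unchanged
    merged = [alpha * t + (1 - alpha) * t0 for t, t0 in zip(theta, theta0)]
    return merged, dist


theta0 = [0.0, 0.0, 0.0]
theta = [3.0, 4.0, 0.0]  # distance 5.0 from theta0
merged, dist = constrain_drift(theta, theta0, max_dist=2.0)
print(dist, merged)
```

In a training loop, a check like this would run periodically between optimizer steps; how often, and with what threshold, is left open by the abstract.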

2605.12416 2026-05-13 cs.LG

Aligning Flow Map Policies with Optimal Q-Guidance

Christos Ziakas, Alessandra Russo, Avishek Joey Bose

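AI summary: This study proposes flow map policies, a new class of generative policies that addresses the high computational cost of generating actions with expressive models such as diffusion and flow matching. The policies learn to take jumps of arbitrary size, including one-step jumps, across the generative dynamics of existing flow policies, enabling fast action generation. The study also introduces FLOW MAP Q-GUIDANCE (FMQ) for policy adaptation and Q-GUIDED BEAM SEARCH (QGBS) for inference-time action generation, and reports significant gains over existing methods on multiple robotic manipulation and locomotion tasks.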

Abstract

Generative policies based on expressive model classes, such as diffusion and flow matching, are well-suited to complex control problems with highly multimodal action distributions. Their expressivity, however, comes at a significant inference cost: generating each action typically requires simulating many steps of the generative process, compounding latency across sequential decision-making rollouts. We introduce flow map policies, a novel class of generative policies designed for fast action generation by learning to take arbitrary-size jumps, including one-step jumps, across the generative dynamics of existing flow-based policies. We instantiate flow map policies for offline-to-online reinforcement learning (RL) and formulate online adaptation as a trust-region optimization problem that improves the critic's Q-value while remaining close to the offline policy. We theoretically derive FLOW MAP Q-GUIDANCE (FMQ), a principled closed-form learning target that is optimal for adapting offline flow map policies under a critic-guided trust-region constraint. We further introduce Q-GUIDED BEAM SEARCH (QGBS), a stochastic flow-map sampler that combines renoising with beam search to enable iterative inference-time refinement. Across 12 challenging robotic manipulation and locomotion tasks from OGBench and RoboMimic, FMQ achieves state-of-the-art performance in offline-to-online RL, outperforming the previous one-step policy MVP by a relative improvement of 21.3% on the average success rate.

2605.12412 2026-05-13 cs.CL cs.AI cs.LG

Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

Eric Bigelow, Raphaël Sarfati, Daniel Wurgaft, Owen Lewis, Thomas McGrath, Jack Merullo, Atticus Geiger, Ekdeep Singh Lubana

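AI summary: This paper studies how large language models (LLMs) update beliefs during in-context learning and proposes that the updates take place in a conceptual belief space with low-dimensional geometric structure. Using story understanding tasks and combining behavioral and representational analyses, it finds that belief-update trajectories are low-dimensional and structured, and can be decoded with linear probes to predict behavior. Interventions on these representations causally steer belief trajectories, with effects explained by the geometry of the conceptual space, giving a geometric account of belief dynamics in in-context learning.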

Abstract

Large Language Models (LLMs) update their behavior in context, which can be viewed as a form of Bayesian inference. However, the structure of the latent hypothesis space over which this inference operates remains unclear. In this work, we propose that LLMs assign beliefs over a low-dimensional geometric space - a conceptual belief space - and that in-context learning corresponds to a trajectory through this space as beliefs are updated over time. Using story understanding as a natural setting for dynamic belief updating, we combine behavioral and representational analyses to study these trajectories. We find that (1) belief updates are well-described as trajectories on low-dimensional, structured manifolds; (2) this structure is reflected consistently in both model behavior and internal representations and can be decoded with simple linear probes to predict behavior; and (3) interventions on these representations causally steer belief trajectories, with effects that can be predicted from the geometry of the conceptual space. Together, our results provide a geometric account of belief dynamics in LLMs, grounding Bayesian interpretations of in-context learning in structured conceptual representations.

2605.12411 2026-05-13 cs.LG cs.AI cs.CL cs.MA

Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling

Eilam Shapira, Moshe Tennenholtz, Roi Reichart

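AI summary: This study explores how to predict the decisions of unfamiliar AI agents from limited interaction and proposes a modeling approach that combines text and tabular information. It builds a table-based model in which game state, dialogue history, and offer records are combined into table rows, and adds a small frozen language model as an observer to extract decision-relevant hidden features. Experiments show that the approach outperforms conventional prompting at predicting responses and bargaining offers, demonstrating the value of framing agent prediction as a target-adaptive text-tabular task.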

Abstract

AI agents negotiate and transact in natural language with unfamiliar counterparts: a buyer bot facing an unknown seller, or a procurement assistant negotiating with a supplier. In such interactions, the counterpart's LLM, prompts, control logic, and rule-based fallbacks are hidden, while each decision can have monetary consequences. We ask whether an agent can predict an unfamiliar counterpart's next decision from a few interactions. To avoid real-world logging confounds, we study this problem in controlled bargaining and negotiation games, formulating it as target-adaptive text-tabular prediction: each decision point is a table row combining structured game state, offer history, and dialogue, while $K$ previous games of the same target agent, i.e., the counterpart being modeled, are provided in the prompt as labeled adaptation examples. Our model is built on a tabular foundation model that represents rows using game-state features and LLM-based text representations, and adds LLM-as-Observer as an additional representation: a small frozen LLM reads the decision-time state and dialogue; its answer is discarded, and its hidden state becomes a decision-oriented feature, making the LLM an encoder rather than a direct few-shot predictor. Training on 13 frontier-LLM agents and testing on 91 held-out scaffolded agents, the full model outperforms direct LLM-as-Predictor prompting and game+text features baselines. Within this tabular model, Observer features contribute beyond the other feature schemes: at $K=16$, they improve response-prediction AUC by about 4 points across both tasks and reduce bargaining offer-prediction error by 14%. These results show that formulating counterpart prediction as a target-adaptive text-tabular task enables effective adaptation, and that hidden LLM representations expose decision-relevant signals that direct prompting does not surface.

2605.12406 2026-05-13 cs.AI

Semantic Reward Collapse and the Preservation of Epistemic Integrity in Adaptive AI Systems

William Parris

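AI summary: This paper discusses Semantic Reward Collapse (SRC), a problem arising from reinforcement learning from human feedback (RLHF) and preference optimization in large language models: different kinds of evaluative dissatisfaction are compressed into a single optimization signal, distorting model behavior on factual errors, uncertainty disclosure, and related dimensions. The study argues that adaptive AI systems under optimization pressure may suppress visible uncertainty rather than maintain calibrated confidence. The author proposes Constitutional Reward Stratification (CRS), a domain-aware reward framework meant to protect distinct kinds of epistemic responsibility, as a testable governance direction for future research.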

Comments 15 pages including references. Position and framework paper. Companion empirical work available at arXiv:2604.17587

Abstract

Recent advances in reinforcement learning from human feedback (RLHF) and preference optimization have substantially improved the usability, coherence, and safety of large language models. However, recurring behaviors such as performative certainty, hallucinated continuity, calibration drift, sycophancy, and suppression of visible uncertainty suggest unresolved structural issues within scalarized preference optimization systems. We propose Semantic Reward Collapse (SRC): the compression of semantically distinct forms of evaluative dissatisfaction into generalized optimization signals. Under SRC, categories such as factual incorrectness, uncertainty disclosure, formatting dissatisfaction, latency, and social preference may become entangled within a shared reward topology despite representing fundamentally different epistemic classes. We argue that adaptive reasoning systems operating under generalized evaluative pressure may drift toward suppression of visible epistemic failure rather than preservation of calibrated uncertainty integrity. These behaviors are framed strictly as optimization consequences rather than evidence of deception or anthropomorphic agency. Drawing on institutional proxy collapse, metric gaming, software reliability engineering, and human learning theory, we propose that uncertainty disclosure and escalation behavior should be treated as protected epistemic conduct rather than globally penalized task incompletion. Finally, we introduce Constitutional Reward Stratification (CRS), a domain-aware reward framework intended to preserve differentiated epistemic attribution within adaptive learning systems. We present CRS not as a validated solution, but as a testable governance-oriented research direction requiring further empirical investigation.

2605.12399 2026-05-13 cs.CV

GeoQuery: Geometry-Query Diffusion for Sparse-View Reconstruction

Xiao Cao, Yuze Li, Youmin Zhang, Jiayu Song, Cheng Yan, Wen Li, Lixin Duan

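AI summary: This paper proposes GeoQuery, a geometry-guided diffusion framework that addresses severe artifacts in sparse-view 3D Gaussian Splatting (3DGS) reconstruction. It introduces a Geometry-guided Cross-view Attention (GCA) mechanism that uses predicted depth maps and camera poses to build a geometry-aligned field for sampling reference features, yielding more accurate query features, and aggregates features within a local window to improve reconstruction consistency. Experiments show that GeoQuery improves sparse-view novel view synthesis and artifact removal and integrates seamlessly into existing diffusion models.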

Comments Accepted to SIGGRAPH 2026 Conference Track

Abstract

3D Gaussian Splatting (3DGS) has emerged as a prominent paradigm for 3D reconstruction and novel view synthesis. However, it remains vulnerable to severe artifacts when trained under sparse-view constraints. While recent methods attempt to rectify artifacts in rendered views using image diffusion models, they typically rely on multi-view self-attention to retrieve information from reference images. We observe that this mechanism often fails when the rendered novel views output by 3DGS are heavily corrupted: damaged query features lead to erroneous cross-view retrieval, resulting in inconsistent rendering refinement. To address this, we propose GeoQuery, a geometry-guided diffusion framework that integrates generative priors with explicit geometric cues via a novel Geometry-guided Cross-view Attention (GCA) mechanism. First, by leveraging predicted depth maps and camera poses, we construct a geometry-induced correspondence field to sample reference features, forming a geometry-aligned proxy query that replaces the corrupted rendering features. Furthermore, we design a new cross-view feature aggregation pipeline, in which we restrict the cross-view attention to a local window around each proxy query to effectively retrieve useful features while suppressing spurious matches. GeoQuery can be seamlessly integrated into existing diffusion-based pipelines, enabling robust reconstruction even under extreme view sparsity. Extensive experiments on sparse-view novel view synthesis and rendering artifact removal demonstrate the effectiveness of our approach.

2605.12398 2026-05-13 cs.CL cs.IR

Question Difficulty Estimation for Large Language Models via Answer Plausibility Scoring

Jamshid Mozafari, Bhawna Piryani, Adam Jatowt

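AI summary: This paper proposes Q-DAPS, a question difficulty estimation method based on the entropy of answer plausibility scores, for evaluating and improving large language models on question answering. The method measures difficulty by computing the entropy of plausibility scores over candidate answers, and is more effective than approaches based on readability formulas, retrieval signals, or popularity statistics. Experiments show that Q-DAPS outperforms baselines on several mainstream QA datasets, is robust across hyperparameter settings and question types, and agrees closely with human judgments of question difficulty.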

Comments Accepted at ACL 2026

Journal ref
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Abstract

Estimating question difficulty is a critical component in evaluating and improving large language models (LLMs) for question answering (QA). Existing approaches often rely on readability formulas, retrieval-based signals, or popularity statistics, which may not fully capture the reasoning challenges posed to modern LLMs. In this paper, we introduce Q-DAPS (Question Difficulty based on Answer Plausibility Scores), a novel approach that estimates question difficulty by computing the entropy of plausibility scores over candidate answers. We systematically evaluate Q-DAPS across four prominent QA datasets (TriviaQA, NQ, MuSiQue, and QASC), demonstrating that it consistently outperforms baselines. Moreover, Q-DAPS shows strong robustness across hyperparameter variations and question types. Extensive ablation studies further show that Q-DAPS remains robust across different plausibility estimation paradigms, model sizes, and realistic settings. Human evaluations further confirm strong alignment between Q-DAPS's difficulty estimates and human judgments of question difficulty. Overall, Q-DAPS provides an interpretable, scalable, and bias-resilient approach to question difficulty estimation in modern QA systems.
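
To make the entropy idea concrete, here is a minimal Python sketch of scoring difficulty as the Shannon entropy of plausibility scores over candidate answers. It is an illustration under assumptions, not the Q-DAPS implementation: the function name is hypothetical, and normalizing raw scores by their sum is an assumed choice that the abstract does not specify.

```python
import math


def plausibility_entropy(scores):
    """Shannon entropy (in nats) of non-negative plausibility scores.

    Scores are normalized to sum to 1 before the entropy is taken;
    this normalization is an assumption made for the sketch.
    """
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must contain at least one positive value")
    probs = [s / total for s in scores]
    return -sum(p * math.log(p) for p in probs if p > 0)


# One candidate dominates: low entropy, read here as an easier question.
easy = plausibility_entropy([0.97, 0.01, 0.01, 0.01])
# All candidates look equally plausible: maximal entropy, a harder question.
hard = plausibility_entropy([0.25, 0.25, 0.25, 0.25])
print(easy < hard)
```

With four candidates the entropy is bounded above by log 4, reached when every candidate is equally plausible.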

2605.12395 2026-05-13 cs.CL

A Comparative Study of Controlled Text Generation Systems Using Level-Playing-Field Evaluation Principles

Michela Lorandi, Anya Belz

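AI summary: This paper aims to compare multiple controlled text generation (CTG) systems under a fair evaluation framework. It uses a uniform generation and processing pipeline together with shared evaluation methods and datasets to ensure fairness and comparability. When re-evaluated this way, the performance of most existing systems differs substantially from the originally reported results, underscoring the importance and urgency of standardized evaluation in CTG.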

Abstract

Background: Many different approaches to controlled text generation (CTG) have been proposed over recent years, but it is difficult to get a clear picture of which approach performs best, because different datasets and evaluation methods are used in each case to assess the control achieved. Objectives: Our aim in the work reported in this paper is to develop an approach to evaluation that enables us to comparatively evaluate different CTG systems in a manner that is both informative and fair to the individual systems. Methods: We use a level-playing-field (LPF) approach to comparative evaluation where we (i) generate and process all system outputs in a standardised way, and (ii) apply a shared set of evaluation methods and datasets, selected based on those currently in use, in order to ensure fair evaluation. Results: When re-evaluated in this way, performance results for a representative set of current CTG systems differ substantially from originally reported results, in most cases for the worse. This highlights the importance of a shared standardised way of assessing controlled generation. Conclusions: The discrepancies revealed by LPF evaluation demonstrate the urgent need for standardised, reproducible evaluation practices in CTG. Our results suggest that without such practices, published performance claims may substantially misrepresent true system capabilities.

2605.12389 2026-05-13 cs.CV cs.AI cs.LG

SEMIR: Semantic Minor-Induced Representation Learning on Graphs for Visual Segmentation

Luke James Miller, Yugyung Lee

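AI summary: This paper proposes SEMIR, a semantic minor-induced graph representation learning method that addresses the computational cost and class imbalance of segmenting small, sparse structures in large-scale images. Through parameterized operations such as edge contraction and node deletion, SEMIR transforms the original grid graph into a compact, boundary-aligned graph minor while preserving an exact mapping from minor predictions to lattice labels. The method performs well on several tumor segmentation datasets, markedly improving Dice scores on small structures, and offers a new framework for task-adapted representation learning on high-resolution structured visual data.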

Comments 20 pages, 3 figures. Accepted at ICML 2026. Includes appendices

Abstract

Segmenting small and sparse structures in large-scale images is fundamentally constrained by voxel-level, lattice-bound computation and extreme class imbalance -- dense, full-resolution inference scales poorly and forces most pipelines to rely on fixed regionization or downsampling, coupling computational cost to image resolution and attenuating boundary evidence precisely where minority structures are most informative. We introduce SEMIR (Semantic Minor-Induced Representation Learning), a representation framework that decouples inference from the native grid by learning a task-adapted, topology-preserving latent graph representation with exact decoding. SEMIR transforms the underlying grid graph into a compact, boundary-aligned graph minor through parameterized edge contraction, node deletion, and edge deletion, while preserving an exact lifting map from minor predictions to lattice labels. Minor construction is formalized as a few-shot structure learning problem that replaces hand-tuned preprocessing with a boundary-alignment objective: minor parameters are learned by maximizing agreement between predicted boundary elements and target-specific semantic edges under a boundary Dice criterion, and the induced minor is annotated with scale- and rotation-robust geometric and intensity descriptors and supports efficient region-level inference via message passing on a graph neural network (GNN) with relational edge features. We benchmark SEMIR on three tumor segmentation datasets -- BraTS 2021, KiTS23, and LiTS -- where targets exhibit high structural variability and distributional uncertainty. SEMIR yields consistent improvements in minority-structure Dice at practical runtime. More broadly, SEMIR establishes a framework for learning task-adapted, topology-preserving latent representations with exact decoding for high-resolution structured visual data.

2605.12387 2026-05-13 cs.SD cs.LG

A Semi-Supervised Framework for Speech Confidence Detection using Whisper

Adam Wynn, Jingyun Wang

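AI summary: This paper proposes a semi-supervised framework for speech confidence detection with Whisper, addressing the challenges of limited labelled data and subjective paralinguistic annotation. The framework fuses deep semantic embeddings from the Whisper encoder with an interpretable acoustic feature vector made up of eGeMAPS descriptors and probability estimates of vocal stress and disfluency, and introduces an uncertainty-aware pseudo-labelling strategy to reduce reliance on labelled data. Experiments show that the method reaches a Macro-F1 of 0.751, outperforms several self-supervised baselines, and improves the minority class by 3%, confirming that explicit prosodic and auxiliary features help confidence detection.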

Comments 12 pages, 9 Figures, Submitted to IEEE Transactions on Audio, Speech and Language Processing

Abstract

Automatic detection of speaker confidence is critical for adaptive computing but remains constrained by limited labelled data and the subjectivity of paralinguistic annotations. This paper proposes a semi-supervised hybrid framework that fuses deep semantic embeddings from the Whisper encoder with an interpretable acoustic feature vector composed of eGeMAPS descriptors and auxiliary probability estimates of vocal stress and disfluency. To mitigate reliance on scarce ground truth data, we introduce an Uncertainty-Aware Pseudo-Labelling strategy where a model generates labels for unlabelled data, retaining only high-quality samples for training. Experimental results demonstrate that the proposed approach achieves a Macro-F1 score of 0.751, outperforming self-supervised baselines, including WavLM, HuBERT, and Wav2Vec 2.0. The hybrid architecture also surpasses the unimodal Whisper baseline, yielding a 3% improvement in the minority class, confirming that explicit prosodic and auxiliary features provide necessary corrective signals which are otherwise lost in deep semantic representations. Ablation studies further show that a curated set of high-confidence pseudo-labels outperforms indiscriminate large-scale augmentation, confirming that data quality outweighs quantity for perceived confidence detection.

2605.12384 2026-05-13 cs.CL cs.AI cs.LG

Scalable Token-Level Hallucination Detection in Large Language Models

Rui Min, Tianyu Pang, Chao Du, Minhao Cheng, Yi R. Fung

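AI summary: Large language models (LLMs) often hallucinate when generating text, especially in reasoning-heavy tasks, where hallucinated content looks plausible but contains logical errors or unreliable intermediate results and is hard to detect. To address the limited granularity and scalability of existing methods, this paper proposes TokenHD, a token-level hallucination detection framework that uses a scalable data engine and an importance-weighted training strategy to detect hallucinations directly in free-form text without predefined step segmentation. Experiments show that even a small detector (0.6B) achieves substantial gains, that performance keeps improving with model size, and that the detector generalizes well across diverse practical scenarios.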

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities, but they still frequently produce hallucinations. These hallucinations are difficult to detect in reasoning-intensive tasks, where the content appears coherent but contains errors like logical flaws and unreliable intermediate results. While step-level analysis is commonly used to detect internal hallucinations, it suffers from limited granularity and poor scalability due to its reliance on step segmentation. To address these limitations, we propose TokenHD, a holistic pipeline for training token-level hallucination detectors. Specifically, TokenHD consists of a scalable data engine for synthesizing large-scale hallucination annotations along with a training recipe featuring an importance-weighted strategy for robust model training. To systematically assess the detection performance, we also provide a rigorous evaluation protocol. Through training within TokenHD, our detector operates directly on free-form text to identify hallucinations, eliminating the need for predefined step segmentation or additional text reformatting. Our experiments show that even a small detector (0.6B) achieves substantial performance gains after training, surpassing much larger reasoning models (e.g., QwQ-32B), and detection performance scales consistently with model size from 0.6B to 8B. Finally, we show that our detector can generalize well across diverse practical scenarios and explore strategies to further enhance its cross-domain generalization capability.

2605.12382 2026-05-13 cs.CL

Pretraining Exposure Explains Popularity Judgments in Large Language Models

Jamshid Mozafari, Bhawna Piryani, Adam Jatowt

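AI summary: This study examines whether the preference of large language models (LLMs) for well-known entities stems from real-world popularity or from statistical exposure during pretraining. Using the fully accessible pretraining corpus Dolma and the open OLMo models, it computes entity exposure statistics over 7.4 trillion tokens and compares them with Wikipedia pageviews and model-elicited popularity signals. The results show that model popularity judgments depend more on exposure in pretraining data than on external popularity measures, especially for larger models and long-tail entities, indicating that pretraining exposure is the central factor shaping popularity bias.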

Comments Accepted at SIGIR 2026

Journal ref
Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2026)
Abstract

Large language models (LLMs) exhibit systematic preferences for well-known entities, a phenomenon often attributed to popularity bias. However, the extent to which these preferences reflect real-world popularity versus statistical exposure during pretraining remains unclear, largely due to the inaccessibility of most training corpora. We provide the first direct, large-scale analysis of popularity bias grounded in fully observable pretraining data. Leveraging the open OLMo models and their complete pretraining corpus, Dolma, we compute precise entity-level exposure statistics across 7.4 trillion tokens. We analyze 2,000 entities spanning five types (Person, Location, Organization, Art, Product) and compare pretraining exposure against Wikipedia pageviews and two elicited LLM popularity signals: direct scalar estimation and pairwise comparison. Our results show that pretraining exposure strongly correlates with Wikipedia popularity, validating exposure as a meaningful proxy for real-world salience during the training period. More importantly, we find that LLM popularity judgments align more closely with exposure than with Wikipedia, especially when elicited via pairwise comparisons. This alignment is strongest for larger models and persists in the long tail, where Wikipedia popularity becomes unreliable. Overall, our findings demonstrate that popularity priors in LLMs are primarily shaped by pretraining statistics rather than external popularity signals, offering concrete evidence that data exposure plays a central role in driving popularity bias.

2605.12380 2026-05-13 cs.LG cs.AI

Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training

Rasool Fakoor, Murdock Aubry, Nicholas Stranges, Alexander J. Smola

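AI summary: Reinforcement learning is structurally harder than supervised learning because the policy changes the data distribution it learns from, which makes training fragile, particularly for large models. This paper proposes an adaptive policy optimization method that replaces fixed clipping with the normalized effective sample size of the current batch's policy-ratio distribution, dynamically adjusting the clipping threshold and the strength of off-policy regularization in the objective. This keeps policy updates stable while handling stale or mismatched data better. Experiments show strong performance across many settings with no new hyperparameters and several existing ones removed.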

Abstract

Reinforcement learning is structurally harder than supervised learning because the policy changes the data distribution it learns from. The resulting fragility is especially visible in large-model training, where the training and rollout systems differ in numerical precision, sampling, and other implementation details. Existing methods manage this fragility by adding hyper-parameters to the training objective, which makes the algorithm more sensitive to its configuration and requires retuning whenever the task, model scale, or distribution mismatch changes. This fragility traces to two concerns that current objectives entangle through hyper-parameters set before training begins: a trust-region concern, that updates should not move the policy too far from its current value, and an off-policy concern, that data from older or different behavior policies should influence the update only to the extent that it remains reliable. Neither concern is a constant to set in advance, and their severity is reflected in the policy-ratio distribution of the current batch. We present a simple yet effective batch-adaptive objective that replaces fixed clipping with the normalized effective sample size of the policy ratios. The same statistic caps the score-function weight and sets the strength of an off-policy regularizer, so the update stays close to the usual on-policy score-function update when ratios are nearly uniform, and tightens automatically when stale or mismatched data cause ratio concentration, while retaining a nonzero learning signal on high-ratio tokens. Experiments across a wide range of settings show that our method matches or exceeds tuned baselines, introducing no new objective hyper-parameters and removing several existing ones. The code is available at https://github.com/FeynRL-project/FeynRL.
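
The batch statistic named in the abstract can be illustrated with the standard (Kish) effective sample size, normalized by batch size. The Python sketch below is an assumption about the form of that statistic, not code from the FeynRL repository; the function name is hypothetical, and how the value caps the score-function weight or scales the regularizer is not shown because the abstract does not give those formulas.

```python
def normalized_ess(ratios):
    """Normalized effective sample size of importance ratios.

    Computes (sum w)^2 / (n * sum w^2), which lies in [1/n, 1]:
    1 when all ratios are equal, 1/n when one ratio carries all the mass.
    """
    n = len(ratios)
    s1 = sum(ratios)
    s2 = sum(r * r for r in ratios)
    if n == 0 or s2 == 0:
        raise ValueError("need at least one nonzero ratio")
    return (s1 * s1) / (n * s2)


# Near on-policy batch: ratios are uniform, so the statistic is 1.
uniform = normalized_ess([1.0, 1.0, 1.0, 1.0])
# Stale or mismatched batch: one token dominates, so it drops to 1/n.
concentrated = normalized_ess([4.0, 0.0, 0.0, 0.0])
print(uniform, concentrated)
```

A value near 1 signals that the batch behaves like on-policy data, while a value near 1/n signals ratio concentration, the regime in which the abstract says the update should tighten.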

2605.12379 2026-05-13 cs.LG cs.AI

Discrete Flow Matching for Offline-to-Online Reinforcement Learning

Fairoz Nower Khan, Nabuat Zaman Nahim, Peizhong Ju

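AI summary This paper studies how to transfer generative policies trained on offline data to online interaction in reinforcement learning tasks with discrete action spaces. To address the shortcomings of existing methods in discrete action spaces and online fine-tuning, the authors propose DRIFT, which fine-tunes a pretrained continuous-time Markov chain policy online using an advantage-weighted discrete flow matching loss and a path-space penalty. The method improves policy performance while preserving pretrained knowledge, and shows superior stability and results on several mainstream discrete-action tasks.

Details
English abstract
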
Many reinforcement learning (RL) tasks have discrete action spaces, but most generative policy methods based on diffusion and flow matching are designed for continuous control. Meanwhile, generative policies usually rely heavily on offline datasets, and offline-to-online RL is itself challenging, as the policy must improve from new interaction without losing useful behavior learned from static data. To address these challenges, we introduce DRIFT, an online fine-tuning method that updates an offline pretrained continuous-time Markov chain (CTMC) policy with an advantage-weighted discrete flow matching loss. To preserve useful pretrained knowledge, we add a path-space penalty that regularizes the full CTMC trajectory distribution, rather than only the final action distribution. For large discrete action spaces, we introduce a candidate-set approximation that updates the actor over a small subset of actions sampled from reference-policy rollouts and uniform exploration. Our theoretical analysis shows that the candidate-set error is controlled by missing target probability mass, and the induced CTMC generator error decreases as the candidate set covers more high-probability actions. Experiments on prevailing discrete-action RL tasks show that our method provides stable offline-to-online improvement across all tasks, achieving the highest average score on Jericho with a simple GRU encoder while outperforming methods that use pretrained language models. Controlled experiments further confirm that the path-space penalty remains bounded during fine-tuning and that the CTMC generator adapts to shifted rewards faster than deterministic baselines. The candidate-set mechanism is supported by a stability analysis showing that the generator error decreases exponentially with candidate coverage.

2605.12375 2026-05-13 cs.LG cs.AI

Agent-Based Post-Hoc Correction of Agricultural Yield Forecasts

Matthew Beddows, Aiden Durrant, Georgios Leontidis

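AI summary This study addresses the problem that crop yield forecasting accuracy in commercial soft fruit production is limited by scarce data, and proposes a post-hoc correction framework based on a structured large language model (LLM) agent. The method corrects existing model predictions by encoding agricultural domain knowledge in tools for phase detection, bias learning, and range validation. Experiments show markedly improved forecast accuracy on strawberry and corn yield datasets, with Llama 3.1 8B as the agent giving the best results.

Comments 21 pages, 6 figures, 6 tables

Details
English abstract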

Accurate crop yield forecasting in commercial soft fruit production is constrained by the data available in typical commercial farm records, which lack the sensor networks, satellite imagery, and high-resolution meteorological inputs that most state-of-the-art approaches assume. We propose a structured LLM agent framework that performs post-hoc correction of existing model predictions, encoding agricultural domain knowledge across tools for phase detection, bias learning, and range validation. Evaluated on a proprietary strawberry yield dataset and a public USDA corn harvest dataset, agent refinement of XGBoost reduced MAE by 20% and MASE by 56% on strawberry, with consistent improvements across Moirai2 (MAE 24%, MASE 22%) and Random Forest (MAE 28%, MASE 66%) baselines. Using Llama 3.1 8B as the agent produced the strongest corrections across all configurations; LLaVA 13B showed inconsistent gains, highlighting sensitivity to the choice of refinement model.
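
The two metrics reported above, MAE and MASE, can be sketched as below. This uses the common definition of MASE (forecast MAE scaled by the in-sample MAE of a naive lag-`season` forecast); whether the paper uses exactly this scaling is an assumption, and the function names and toy numbers are illustrative only.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between observed and forecast values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def mase(y_true, y_pred, y_train, season=1):
    """Mean absolute scaled error.

    Scales the forecast MAE by the in-sample MAE of a naive forecast
    that repeats the value observed `season` steps earlier.
    """
    y_train = np.asarray(y_train, dtype=float)
    naive_mae = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return mae(y_true, y_pred) / float(naive_mae)

# Made-up weekly yields, not data from the paper.
history = [10.0, 12.0, 11.0, 13.0]
observed = [14.0, 15.0]
forecast = [13.0, 17.0]

print(mae(observed, forecast))
print(mase(observed, forecast, history))
```

A MASE below 1 means the forecast beats the naive baseline on average, which is why a reduction in MASE can be much larger than the matching reduction in MAE.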

2605.12370 2026-05-13 cs.CL cs.IR

Context Convergence Improves Answering Inferential Questions

Jamshid Mozafari, Bhawna Piryani, Adam Jatowt

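AI summary This study examines how large language models perform on question answering tasks that require inference, focusing on how the structure and quality of sentences in the context affect model performance. It proposes convergence, a measure of how well a sentence eliminates incorrect answers, as a criterion for building more effective QA contexts. Experiments show that contexts built from high-convergence sentences substantially improve answer accuracy, and that ordering sentences by descending convergence further improves performance, highlighting the practical value of convergence for guiding context construction and analyzing inferential behavior.

Comments Accepted at SIGIR 2026

Details
Journal ref
Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2026)
English abstract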

While Large Language Models (LLMs) are widely used in open-domain Question Answering (QA), their ability to handle inferential questions, where answers must be derived rather than directly retrieved, remains underexplored. This study investigates how the structure and quality of passages influence LLM performance on such questions. We focus on convergence, a measure of how effectively sentences (hints) eliminate incorrect answers, as a criterion for constructing passages. Using subsets of the TriviaHG dataset, we form passages by combining sentences with varying convergence levels and evaluate six LLMs of different sizes and architectures. Our results show that passages built from higher-convergence sentences lead to substantially better answer accuracy than those selected by cosine similarity, indicating that convergence captures meaningful relevance for inferential reasoning. Additionally, ordering sentences by descending convergence slightly improves performance, suggesting that LLMs tend to prioritize earlier, information-rich cues. These findings highlight convergence as a practical signal for guiding passage construction and analyzing inferential reasoning behavior in LLMs.

2605.12368 2026-05-13 cs.LG

MetaColloc: Optimization-Free PDE Solving via Meta-Learned Basis Functions

Zichuan Yang

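AI summary MetaColloc is an optimization-free, data-free framework for solving partial differential equations that achieves fast solves through meta-learned basis functions. It decouples basis discovery from the solving process: in an offline stage, a dual-branch neural network is meta-trained on diverse Gaussian random fields to produce a universal dictionary of neural basis functions. At test time, the solution is obtained by assembling a collocation matrix and performing a single linear least-squares solve, which markedly improves efficiency and accuracy.

Comments 21 pages, 5 figures, 6 tables

Details
English abstract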

Solving partial differential equations (PDEs) with machine learning typically requires training a new neural network for every new equation. This optimization is slow. We introduce MetaColloc. It is an optimization-free and data-free framework that removes this bottleneck completely. We decouple basis discovery from the solving process. We meta-train a dual-branch neural network on diverse Gaussian Random Fields. This offline process creates a universal dictionary of neural basis functions. At test time, we freeze the network. We solve the PDE by assembling a collocation matrix. We find the solution through a single linear least squares step. For non-linear PDEs, we apply the Newton-Raphson method to achieve fast quadratic convergence. Our experiments across six 2D and 3D PDEs show massive improvements. MetaColloc reaches state-of-the-art accuracy on smooth and non-linear problems. It also reduces test-time computation by several orders of magnitude. Finally, we provide a detailed frequency sweep analysis. This analysis reveals a critical mismatch between function approximation and operator stability at extremely high frequencies. This profound finding opens a clear path toward future operator-aware meta-learning.
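
The test-time step described above (assemble a collocation matrix over a frozen basis, then do one linear least-squares solve) can be sketched on a toy problem. This is a minimal sketch under stated assumptions: a plain monomial basis stands in for the meta-learned neural basis, the problem is the 1D Poisson equation $u'' = -\pi^2 \sin(\pi x)$ with $u(0) = u(1) = 0$, and the function name `solve_poisson_1d` is hypothetical.

```python
import numpy as np

def solve_poisson_1d(degree=10, n_colloc=50):
    """Collocation solve of u'' = f on [0, 1] with u(0) = u(1) = 0.

    The fixed basis is the monomials x**j (a stand-in for a learned basis);
    the coefficients come from a single linear least-squares solve.
    """
    x = np.linspace(0.0, 1.0, n_colloc)
    j = np.arange(degree + 1)

    # Interior rows: second derivative of x**j is j*(j-1)*x**(j-2).
    coef = j * (j - 1)
    expo = np.clip(j - 2, 0, None)
    A_int = coef * x[:, None] ** expo
    b_int = -np.pi ** 2 * np.sin(np.pi * x)

    # Boundary rows: basis values at x = 0 and x = 1, target value 0.
    xb = np.array([0.0, 1.0])
    A_bnd = xb[:, None] ** j
    b_bnd = np.zeros(2)

    A = np.vstack([A_int, A_bnd])
    b = np.concatenate([b_int, b_bnd])
    c, *_ = np.linalg.lstsq(A, b, rcond=None)
    return c

c = solve_poisson_1d()
xt = np.linspace(0.0, 1.0, 201)
u = (xt[:, None] ** np.arange(len(c))) @ c
print(np.max(np.abs(u - np.sin(np.pi * xt))))  # max error against the exact solution sin(pi x)
```

No gradient-based training happens at solve time; all the cost sits in building the matrix and one call to `numpy.linalg.lstsq`, which is the property the abstract emphasizes.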

2605.12366 2026-05-13 cs.AI

Classifier Context Rot: Monitor Performance Degrades with Context Length

Sam Martin, Fabien Roger

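AI summary This study shows that when current frontier language models are used as classifiers to monitor coding agents for dangerous behavior, their performance degrades markedly as context length grows. When a dangerous action appears after as much as 800K tokens of benign content, mainstream models such as Opus 4.6, GPT 5.4, and Gemini 3.1 miss it 2 to 30 times more often. The study also finds that prompting techniques and improved post-training can partially mitigate the problem, and stresses that existing monitor evaluations may overestimate performance by ignoring long-context degradation.

Details
English abstract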

Monitoring coding agents for dangerous behavior using language models requires classifying transcripts that often exceed 500K tokens, but prior agent monitoring benchmarks rarely contain transcripts longer than 100K tokens. We show that when used as classifiers, current frontier models fail to notice dangerous actions more often in longer transcripts. In particular, on a dataset that requires identifying when a coding agent takes a subtly dangerous action, Opus 4.6, GPT 5.4, and Gemini 3.1 miss these actions $2\times$ to $30\times$ more often when they occur after 800K tokens of benign activity than when they occur on their own. We also show that these weaknesses can be partially mitigated with prompting techniques such as periodic reminders throughout the transcript and may be mitigated further with better post-training. Monitor evaluations that do not consider long-context degradation are likely overestimating monitor performance.
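
One of the mitigations mentioned above, periodic reminders throughout the transcript, might look like the sketch below. The chunking scheme, the reminder wording, and the helper name `add_periodic_reminders` are all assumptions; the abstract does not describe the exact prompt format.

```python
def add_periodic_reminders(chunks, reminder, every=3):
    """Interleave a monitoring reminder after every `every` transcript chunks.

    No reminder is appended after the final chunk, since the classification
    instruction would normally follow the transcript anyway.
    """
    out = []
    for i, chunk in enumerate(chunks, start=1):
        out.append(chunk)
        if i % every == 0 and i < len(chunks):
            out.append(reminder)
    return out

# Hypothetical transcript split into tool-call chunks.
transcript = [f"<step {n}>" for n in range(1, 8)]
reminder = "[Reminder: flag any action that could be dangerous.]"

for line in add_periodic_reminders(transcript, reminder, every=3):
    print(line)
```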

2605.12361 2026-05-13 cs.CL cs.AI cs.IR

MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering

Rezarta Islamaj, Robert Leaman, Joey Chan, Nicholas Wan, Qiao Jin, Natalie Xie, John Wilbur, Shubo Tian, Lana Yeganova, Po-Ting Lai, Chih-Hsuan Wei, Yifan Yang, Yao Ge, Qingqing Zhu, Zhizheng Wang, Zhiyong Lu

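AI summary MedHopQA is a disease-centered multi-hop reasoning benchmark designed to evaluate the genuine reasoning ability of LLM-based biomedical question answering systems. It contains 1,000 expert-annotated question-answer pairs; each question requires integrating information from two distinct Wikipedia articles and is answered in open-ended free text. To make evaluation more robust and fair, MedHopQA adds ontology-grounded synonym sets and a layered verification process, and it reduces leaderboard gaming and data contamination risk by embedding the scored questions in a large set of questions without released answers, providing a reusable framework for building future biomedical QA datasets.

Details
English abstract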

Evaluating large language models (LLMs) in the biomedical domain requires benchmarks that can distinguish reasoning from pattern matching and remain discriminative as model capabilities improve. Existing biomedical question answering (QA) benchmarks are limited in this respect. Multiple-choice formats can allow models to succeed through answer elimination rather than inference, while widely circulated exam-style datasets are increasingly vulnerable to performance saturation and training data contamination. Multi-hop reasoning, defined as the ability to integrate information across multiple sources to derive an answer, is central to clinically meaningful tasks such as diagnostic support, literature-based discovery, and hypothesis generation, yet remains underrepresented in current biomedical QA benchmarks. MedHopQA is a disease-centered multi-hop reasoning benchmark consisting of 1,000 expert-curated question-answer pairs introduced as a shared task at BioCreative IX. Each question requires synthesis of information across two distinct Wikipedia articles, and answers are provided in an open-ended free-text format. Gold annotations are augmented with ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy to support both lexical and concept-level evaluation. MedHopQA was constructed through a structured process combining human annotation, triage, iterative verification, and LLM-as-a-judge validation. To reduce leaderboard gaming and contamination risk, the 1,000 scored questions are embedded within a publicly downloadable set of 10,000 questions, with answers withheld, on a CodaBench leaderboard. MedHopQA provides both a benchmark and a reusable framework for constructing future biomedical QA datasets that prioritize compositional reasoning, saturation resistance, and contamination resistance as core design constraints.

2605.12358 2026-05-13 cs.LG

From Message-Passing to Linearized Graph Sequence Models

Joël Mathys, Basil Rohner, Saku Peltonen, Roger Wattenhofer

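AI summary This paper proposes Linearized Graph Sequence Models, a new framework that combines message-passing graph learning with sequence modeling. By recasting graph computation as a sequence modeling problem, the method simplifies architecture design and systematically separates computational depth from information propagation depth, turning core graph architecture decisions into sequence modeling choices. The study analyzes, theoretically and empirically, which sequence properties help in learning the graph inductive bias, validates the approach on long-range information tasks, and offers principled guidance for applying modern sequence modeling techniques to graph learning.

Details
English abstract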

Message-passing based approaches form the default backbone of most learning architectures on graph-structured data. However, the rapid progress of modern deep learning architectures in other domains, particularly sequence modeling, raises the question of how graph learning can benefit from these advances. We introduce Linearized Graph Sequence Models, a framework that recasts message-passing graph computation from the perspective of sequence modeling to simplify architectural choices. Our approach systematically separates the computational processing depth from the information propagation depth, allowing core graph architectural decisions to be treated as sequence modeling choices. Specifically, we analyze, both empirically and theoretically, what sequence properties make methods effective for learning and preserving the graph inductive bias. In particular, we validate our findings, demonstrating improved performance on long-range information tasks in graphs. Our findings provide a principled way to integrate modern sequence modeling advances into message-passing based graph learning. Beyond this, our work demonstrates how the separation of processing and information depth can recast central architectural questions as input modeling choices.

2605.12357 2026-05-13 cs.AI

$δ$-mem: Efficient Online Memory for Large Language Models

Jingdi Lei, Di Zhang, Junxian Li, Weida Wang, Kaixuan Fan, Xiang Liu, Qihan Liu, Xiaoteng Ma, Baian Chen, Soujanya Poria

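AI summary Large language models need to accumulate and reuse historical information effectively in long-term assistants and agent systems. Because simply expanding the context window is costly and of limited benefit, the paper proposes $δ$-mem, a lightweight online memory mechanism that compresses past information into a fixed-size state matrix updated by delta-rule learning, and uses its readout during generation to apply low-rank corrections to the backbone's attention computation. Experiments show that $δ$-mem markedly improves performance on several benchmarks while preserving general capabilities, with the largest gains on memory-heavy tasks.

Details
English abstract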

Large language models increasingly need to accumulate and reuse historical information in long-term assistants and agent systems. Simply expanding the context window is costly and often fails to ensure effective context utilization. We propose $δ$-mem, a lightweight memory mechanism that augments a frozen full-attention backbone with a compact online state of associative memory. $δ$-mem compresses past information into a fixed-size state matrix updated by delta-rule learning, and uses its readout to generate low-rank corrections to the backbone's attention computation during generation. With only an $8\times8$ online memory state, $δ$-mem improves the average score to $1.10\times$ that of the frozen backbone and $1.15\times$ that of the strongest non-$δ$-mem memory baseline. It achieves larger gains on memory-heavy benchmarks, reaching $1.31\times$ on MemoryAgentBench and $1.20\times$ on LoCoMo, while largely preserving general capabilities. These results show that effective memory can be realized through a compact online state directly coupled with attention computation, without full fine-tuning, backbone replacement, or explicit context extension.
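
The fixed-size state matrix updated by delta-rule learning can be illustrated with the textbook delta rule, $S \leftarrow S + \beta\,(v - S k)\,k^\top$. This is a minimal sketch, not the paper's implementation: how keys and values are derived from the backbone, and how the readout becomes a low-rank attention correction, are not given in the abstract, so `delta_rule_update`, `read`, and the unit-norm key convention are assumptions.

```python
import numpy as np

def delta_rule_update(S, k, v, beta=1.0):
    """One delta-rule write: move the readout for key k toward value v.

    S has a fixed shape (d_v, d_k) no matter how many writes happen.
    The key is normalized so that beta=1 stores v exactly for that key.
    """
    k = k / np.linalg.norm(k)
    return S + beta * np.outer(v - S @ k, k)

def read(S, q):
    """Linear readout of the memory for query q."""
    return S @ q

d = 8  # matches the 8x8 state size mentioned in the abstract
S = np.zeros((d, d))
k1, k2 = np.eye(d)[0], np.eye(d)[1]
v1, v2 = np.arange(d, dtype=float), np.ones(d)

S = delta_rule_update(S, k1, v1)
S = delta_rule_update(S, k2, v2)   # orthogonal key: does not disturb k1
print(read(S, k1))                 # recovers v1
S = delta_rule_update(S, k1, v2)   # same key again: old value is overwritten
print(read(S, k1))                 # now recovers v2
```

Unlike a purely additive (Hebbian) write, the error term `v - S @ k` lets a new write replace what was stored under the same key, which is what keeps a small state usable over long histories.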

2605.12347 2026-05-13 cs.RO

Real-Time Whole-Body Teleoperation of a Humanoid Robot Using IMU-Based Motion Capture with Sim2Sim and Sim2Real Validation

Hamza Ahmed Durrani, Suleman Khan

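AI summary This paper studies how to achieve stable, low-latency control for real-time whole-body teleoperation of humanoid robots, overcoming differences between human and robot morphology, inertial sensor noise, control latency, and the difficulty of sim-to-real transfer. It proposes an end-to-end control system based on IMU motion capture that maps human motion directly onto a Unitree G1 robot, achieving continuous, low-latency real-time operation without offline buffering or learning-based components. After validation in simulation, the system was deployed directly on the physical robot and reproduced a range of complex motions, including walking, standing, sitting, turning, bowing, and coordinated whole-body movements, providing a practical, scalable framework for whole-body humanoid teleoperation with commodity wearable devices.

Comments 8 pages, 4 figures

Details
English abstract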

Stable, low-latency whole-body teleoperation of humanoid robots is an open research challenge, complicated by kinematic mismatches between human and robot morphologies, accumulated inertial sensor noise, non-trivial control latency, and persistent sim-to-real transfer gaps. This paper presents a complete real-time whole-body teleoperation system that maps human motion, recorded with a Virdyn IMU-based full-body motion capture suit, directly onto a Unitree G1 humanoid robot. We introduce a custom motion-processing, kinematic retargeting, and control pipeline engineered for continuous, low-latency operation without any offline buffering or learning-based components. The system is first validated in simulation using the MuJoCo physics model of the Unitree G1 (sim2sim), and then deployed without modification on the physical platform (sim2real). Experimental results demonstrate stable, synchronized reproduction of a broad motion repertoire, including walking, standing, sitting, turning, bowing, and coordinated expressive full-body gestures. This work establishes a practical, scalable framework for whole-body humanoid teleoperation using commodity wearable motion capture hardware.

2605.12345 2026-05-13 cs.CL

Output Composability of QLoRA PEFT Modules for Plug-and-Play Attribute-Controlled Text Generation

Michela Lorandi, Anya Belz

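AI summary This paper studies how to achieve plug-and-play, attribute-controlled text generation by composing parameter-efficient fine-tuning (PEFT) modules. The authors explore three ways of going beyond single-task training and inference: joint training on multiple datasets, composing the weight matrices of different PEFT modules at inference, and composing their outputs. Experiments show that composing the outputs of different PEFT modules performs particularly well, outperforming even modules trained specifically for a single task on that task's test set, with an average gain of 2 percentage points.

Details
English abstract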

Parameter-efficient fine-tuning (PEFT) techniques offer task-specific fine-tuning at a fraction of the cost of full fine-tuning, but require separate fine-tuning for every new task (or task combination). In this paper, we explore three ways of generalising beyond single-task training/inference: (i) training on combinations of multiple, related datasets; (ii) at inference, composing the weight matrices of separately trained PEFT modules; and (iii) at inference, composing the outputs of separately trained PEFT modules. We test these approaches with three different LLMs, QLoRA as the PEFT technique, and three sets of controlled text generation datasets covering sentiment control, topic control, and multi-attribute control. We find that summing PEFT module outputs is a particularly strong composition method, which consistently either outperforms or matches the performance of alternative approaches. This is the case even when comparing against single-task specialised modules on the single-task test set, where three-module output composition achieves an average 2 percentage point performance increase across all models for sentiment control.

2605.12343 2026-05-13 cs.LG

Neural-Schwarz Tiling for Geometry-Universal PDE Solving at Scale

Paolo Secchi, Daniel S. Balint, Marco Maurizi

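AI summary This paper proposes NEST, a new neural PDE-solving framework designed to overcome the limits of conventional global surrogate models in cross-domain reuse and large-scale deployment. NEST takes a local-to-global approach: it learns a local physics solver on small geometric regions and composes global solutions with classical domain decomposition, generalizing across complex geometries and boundary conditions. Experiments show that the method can solve nonlinear static equilibrium problems on complex 3D domains far larger than the training scale, offering a new training path toward scalable PDE solvers.

Details
English abstract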

Most learned PDE solvers follow a global-surrogate paradigm: a neural operator is trained to map full problem descriptions to full solution fields for a prescribed distribution of geometries, boundary conditions, and coefficients. This has enabled fast inference within fixed problem families, but limits reuse across new domains and makes large-scale deployment dependent on expensive problem-specific data generation. We introduce $\textbf{NEST}$ ($\textbf{Ne}$ural-$\textbf{S}$chwarz $\textbf{T}$iling), a local-to-global framework that shifts learning from full-domain solution operators to reusable local physical solvers. The central premise is that, although global PDE solutions depend on geometry, scale, and boundary conditions, the physical response on small neighborhoods can be learned locally and composed into global solutions through classical domain decomposition. NEST learns a neural operator on minimal voxel patches ($3 \times 3 \times 3$) with diverse local geometries and boundary/interface data. At inference time, an unseen voxelized domain is tiled into overlapping patches, the learned local solver is applied patchwise, and global consistency is enforced through iterative Schwarz coupling with partition-of-unity assembly. In this way, generalization is shifted from a monolithic neural model to the combination of local physics learning and algorithmic global assembly. We instantiate NEST on nonlinear static equilibrium in compressible neo-Hookean solids and evaluate it on large, geometrically complex 3D domains far outside the scale of the training patches. Our results show that local neural building blocks, coupled through Schwarz iteration, offer a reusable local-training path toward scalable learned PDE solvers that generalize across domain size, shape, and boundary-condition configurations.
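
The inference loop described above (apply a local solver patchwise, then enforce global consistency with Schwarz iteration and partition-of-unity assembly) can be sketched in one dimension. This is a toy illustration under stated assumptions: the problem is the 1D Laplace equation $u'' = 0$ with $u(0) = 0$, $u(1) = 1$, there are only two overlapping patches, and an exact local solve (linear interpolation) stands in for the learned neural operator. The names `local_solve` and `schwarz_1d` are hypothetical.

```python
import numpy as np

def local_solve(left_val, right_val, m):
    """Stand-in for the learned local solver.

    For 1D Laplace with Dirichlet data on both patch ends, the exact
    solution is the straight line between the two boundary values.
    """
    return np.linspace(left_val, right_val, m)

def schwarz_1d(n=41, a=15, b=25, iters=60):
    """Parallel Schwarz iteration on two overlapping patches [0, b] and [a, n-1]."""
    u = np.zeros(n)
    u[-1] = 1.0  # global Dirichlet data: u(0) = 0, u(1) = 1

    # Partition of unity: w1 + w2 = 1 everywhere, blending linearly in the overlap.
    w1 = np.ones(n)
    w1[a:b + 1] = np.linspace(1.0, 0.0, b - a + 1)
    w1[b + 1:] = 0.0
    w2 = 1.0 - w1

    for _ in range(iters):
        u1 = np.zeros(n)
        u2 = np.zeros(n)
        # Each patch reads its interface value from the current global iterate.
        u1[:b + 1] = local_solve(u[0], u[b], b + 1)
        u2[a:] = local_solve(u[a], u[-1], n - a)
        # Assemble the patch solutions into one global field.
        u = w1 * u1 + w2 * u2
    return u

u = schwarz_1d()
print(np.max(np.abs(u - np.linspace(0.0, 1.0, 41))))  # distance from the exact solution u(x) = x
```

The local solver never sees the whole domain; the global solution emerges from repeated interface exchange, which is why the same local model can in principle be reused on domains much larger than anything it was trained on.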

2605.12339 2026-05-13 cs.LG cs.AI

BSO: Safety Alignment Is Density Ratio Matching

Tien-Phat Nguyen, Truong Nguyen, Thin Nguyen, Duy Minh Ho Nguyen, Ngoc-Thanh Dinh, Trung Le

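AI summary This paper proposes BSO, a new method that recasts safety alignment of language models as a density ratio matching problem, simplifying the traditionally complex training pipeline. Minimizing the Bregman divergence between data and model ratios yields a family of single-stage loss functions that come with theoretical guarantees and recover the optimal safe policy. BSO is general and simple: it needs no auxiliary models, introduces only one additional hyperparameter, and covers existing safety alignment methods as special cases. Experiments show that it achieves a better balance between safety and helpfulness.

Details
English abstract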

Aligning language models for both helpfulness and safety typically requires complex pipelines: separate reward and cost models, online reinforcement learning, and primal-dual updates. Recent direct preference optimization approaches simplify training but incorporate safety through ad-hoc modifications such as multi-stage procedures or heuristic margin terms, lacking a principled derivation. We show that the likelihood ratio of the optimal safe policy admits a closed-form decomposition that reduces safety alignment to a density ratio matching problem. Minimizing Bregman divergences between the data and model ratios yields Bregman Safety Optimization (BSO), a family of single-stage loss functions, each induced by a convex generator, that provably recover the optimal safe policy. BSO is both general and simple: it requires no auxiliary models, introduces only one hyperparameter beyond standard preference optimization, and recovers existing safety-aware methods as special cases. Experiments across safety alignment benchmarks show that BSO consistently improves the safety-helpfulness trade-off.
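
For readers unfamiliar with the tool named above, the generic form of Bregman density ratio matching is written out below. This is the standard textbook formulation, not the BSO loss itself: the symbols $r$, $\hat r$, $p_{\mathrm{nu}}$, $p_{\mathrm{de}}$ and the generator $f$ are notation assumed here, and how BSO defines its data and model ratios is not given in the abstract.

```latex
% Pointwise Bregman divergence induced by a differentiable convex generator f
D_f\bigl(r \,\|\, \hat r\bigr) = f(r) - f(\hat r) - f'(\hat r)\,(r - \hat r)

% Matching a model ratio \hat r to the true ratio r = p_nu / p_de:
% taking the expectation under p_de and dropping the term that does not depend on \hat r
\mathcal{L}_f(\hat r)
  = \mathbb{E}_{x \sim p_{\mathrm{de}}}\!\bigl[\, f'(\hat r(x))\,\hat r(x) - f(\hat r(x)) \,\bigr]
  - \mathbb{E}_{x \sim p_{\mathrm{nu}}}\!\bigl[\, f'(\hat r(x)) \,\bigr]
```

Since $D_f \ge 0$ with equality at $\hat r = r$ for strictly convex $f$, each choice of generator gives a different loss with the same minimizer, which is consistent with the abstract's description of a family of losses indexed by the generator.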

2605.12338 2026-05-13 cs.LG cs.AI stat.CO

Manifold Sampling via Entropy Maximization

Cornelius V. Braun, Tilman Burghoff, Marc Toussaint

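AI summary This paper studies sampling on manifolds implicitly defined by smooth equality and inequality constraints, particularly when the feasible region contains multiple disconnected components. To address this challenge, the authors propose MASEM, a method based on entropy-maximizing resampling that maximizes the entropy of the empirical distribution using k-nearest-neighbor density estimation, improving sampling efficiency. Experiments show that MASEM achieves superior mixing efficiency and scalability on synthetic data and robotics applications, clearly outperforming existing methods.

Details
English abstract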

Sampling from constrained distributions has a wide range of applications, including in Bayesian optimization and robotics. Prior work establishes convergence and feasibility guarantees for constrained sampling, but assumes that the feasible set is connected. However, in practice, the feasible set often decomposes into multiple disconnected components, which makes efficient sampling under constraints challenging. In this paper, we propose MAnifold Sampling via Entropy Maximization (MASEM) for sampling on a manifold with an unknown number of disconnected components, implicitly defined by smooth equality and inequality constraints. The presented method uses a resampling scheme to maximize the entropy of the empirical distribution based on k-nearest neighbor density estimation. We show that, in the mean field, MASEM decreases the KL-divergence between the empirical distribution and the maximum-entropy target exponentially in the number of resampling steps. We instantiate MASEM with multiple local samplers and demonstrate its versatility and efficiency on synthetic and robotics-based benchmarks. MASEM enables fast and scalable mixing across a range of constrained sampling problems, improving over alternatives by an order of magnitude in Sinkhorn distance with competitive runtime.
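
The idea of resampling to raise the entropy of an empirical distribution, using a k-nearest-neighbor density estimate, can be sketched as follows. This is an illustrative stand-in rather than the MASEM algorithm: it weights each sample inversely to its estimated density, which pushes the resampled set toward uniform coverage, and the function names, the choice of `k`, and the two-component toy data are assumptions.

```python
import numpy as np

def knn_density(x, k=5):
    """k-NN density estimate at each sample, up to a constant factor.

    Density is taken as inversely proportional to the volume of the ball
    reaching the k-th nearest neighbor, i.e. proportional to r_k ** (-dim).
    """
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    r_k = np.sort(d, axis=1)[:, k]  # column 0 is the zero self-distance
    return 1.0 / (r_k ** x.shape[1] + 1e-12)

def entropy_resample(x, k=5, rng=None):
    """Resample with weights inversely proportional to estimated density."""
    rng = np.random.default_rng(rng)
    w = 1.0 / knn_density(x, k)
    w /= w.sum()
    idx = rng.choice(len(x), size=len(x), replace=True, p=w)
    return x[idx]

# Toy feasible set with two disconnected components of equal area,
# where the sampler has found the second one only rarely.
rng = np.random.default_rng(0)
crowded = rng.uniform(0.0, 1.0, size=(180, 2))
sparse = rng.uniform(5.0, 6.0, size=(20, 2))
x = np.vstack([crowded, sparse])

y = entropy_resample(x, k=5, rng=1)
print(np.mean(x[:, 0] > 3.0), np.mean(y[:, 0] > 3.0))  # share of samples in the sparse component, before and after
```

In a full sampler a resampling step like this would alternate with local moves that keep samples on the constraint manifold; that part is omitted here.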