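arXivDaily: a daily academic digest that syncs the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.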

2605.12427 2026-05-13 cs.LG math.CO

Learning Minimally Rigid Graphs with High Realization Counts

Oleksandr Slyvka, Jan Rubeš, Rodrigo Alves, Jan Legerský

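AI summary: This paper studies how to construct minimally rigid graphs with very large realization counts, an extremal problem in rigidity theory. The authors propose a reinforcement learning approach that generates minimally rigid graphs through 0-extensions and 1-extensions (the Henneberg moves) and optimizes realization-count invariants with the deep cross-entropy method. The approach matches the best known results for both planar and spherical realization counts and sets new records for spherical realization counts.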

Comments This is an extended version of the paper accepted to IJCAI 2026
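
The 0-extension and 1-extension moves named in the summary can be sketched as follows for the planar case. This is only an illustration: the function names are hypothetical, moves are chosen at random rather than by the paper's learned policy, and no realization count is computed.

```python
import random

def zero_extension(n, edges, u, v):
    """Add new vertex n and join it to two distinct existing vertices u and v."""
    assert u != v
    return n + 1, edges | {frozenset((u, n)), frozenset((v, n))}

def one_extension(n, edges, u, v, w):
    """Remove edge {u, v}, then join new vertex n to u, v and a third vertex w."""
    e = frozenset((u, v))
    assert e in edges and w not in (u, v)
    return n + 1, (edges - {e}) | {frozenset((x, n)) for x in (u, v, w)}

def random_minimally_rigid_graph(target_n, seed=0):
    """Grow a graph from a single edge by random Henneberg moves (hypothetical helper)."""
    rng = random.Random(seed)
    n, edges = 2, {frozenset((0, 1))}
    while n < target_n:
        if n >= 3 and rng.random() < 0.5:
            # 1-extension: pick an existing edge and a third vertex
            u, v = rng.choice(sorted(tuple(sorted(e)) for e in edges))
            w = rng.choice([x for x in range(n) if x not in (u, v)])
            n, edges = one_extension(n, edges, u, v, w)
        else:
            # 0-extension: pick any two distinct vertices
            u, v = rng.sample(range(n), 2)
            n, edges = zero_extension(n, edges, u, v)
    return n, edges

n, edges = random_minimally_rigid_graph(8)
print(n, len(edges))
```

Each move adds one vertex and a net two edges, so a graph on n vertices built this way always has 2n - 3 edges, the edge count of a minimally rigid graph in the plane.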

2605.12426 2026-05-13 cs.CL

Geometric Factual Recall in Transformers

Shauli Ravfogel, Gilad Yehudai, Joan Bruna, Alberto Bietti

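AI summary: This paper studies how Transformer language models memorize factual associations and proposes a geometric memorization mechanism that differs from the conventional picture of linear parameter growth. The model learns relational structure in embedding space so that embedding vectors directly encode factual associations, while the multilayer perceptron (MLP) acts as a relation-conditioned selector that extracts the relevant attributes through ReLU gating. Experiments show superior generalization on single-hop and multi-hop factual queries and reveal a mechanism by which the trained model transfers zero-shot to new factual associations.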

Comments Preprint

2605.12422 2026-05-13 cs.CL cs.CY

Predicting Disagreement with Human Raters in LLM-as-a-Judge Difficulty Assessment without Using Generation-Time Probability Signals

Yo Ehara

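AI summary: This study addresses disagreement with human raters when a large language model (LLM) is used as a judge to assess the difficulty of educational materials. Unlike earlier approaches that rely on generation-time probability signals, the paper proposes a prediction method that does not need them: it builds an independent embedding space, exploits the ordinal nature of difficulty, and flags cases likely to produce disagreement from the geometric consistency of the rating set. Experiments show that the method outperforms probability-based baselines at predicting disagreement with human raters.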

Comments Accepted to Educational Data Mining (EDM) 2026 (Poster/Demo Track)

2605.12421 2026-05-13 cs.AI

Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers

Haoyu Wang, Yuliang Song, Tao Li, Zhiwei Deng, Yaqing Wang, Deepak Ramachandran, Eldan Cohen, Dan Roth

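AI summary: This paper examines the heuristic trap that large language models (LLMs) face when generating combinatorial optimization solvers. Using a benchmark of 100 combinatorial problems, the study compares three ways of building solvers and finds that constraint modeling in Python with OR-Tools is the most correct, whereas MiniZinc with OR-Tools, despite using the same backend, achieves lower coverage. Prompting the LLM to optimize search yields only marginal speedups and can reduce correctness, because the LLM tends to adopt heuristics such as local approximations or redundant constraints that degrade solution quality. The paper recommends using LLMs primarily for formal modeling when generating combinatorial solvers and verifying any search optimization separately.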

2605.12419 2026-05-13 cs.CL cs.IR cs.LG

ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging

Neha Verma, Nikhil Mehta, Shao-Chuan Wang, Naijing Zhang, Alicia Tsai, Li Wei, Lukasz Heldt, Lichan Hong, Ed Chi, Xinyang Yi

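AI summary: Although large language models perform well on many tasks, fine-tuning them for a specific task often makes them forget their original language reasoning abilities. This paper studies the problem in generative retrieval (GenRetrieval) and proposes ORBIT, a method that tracks the parameter distance between the fine-tuned model and the original model and, when the distance exceeds a threshold, applies weight averaging to limit model drift, thereby preserving both text generation and retrieval capabilities. Experiments show that ORBIT retains model performance better than existing continual learning and regularization methods.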
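
The distance-triggered averaging described in the summary might look roughly like the sketch below. The L2 distance, the averaging coefficient `alpha`, and the function names are assumptions made for illustration; the summary does not state which metric or schedule ORBIT actually uses.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two flat parameter vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def regulate(theta, theta_origin, threshold, alpha=0.5):
    """Pull theta back toward the origin weights once it drifts past threshold.

    alpha is a hypothetical averaging weight on the origin model.
    """
    if l2_distance(theta, theta_origin) <= threshold:
        return list(theta)  # within the allowed radius: leave unchanged
    return [alpha * o + (1 - alpha) * t for t, o in zip(theta, theta_origin)]

origin = [0.0, 0.0]
print(regulate([0.3, 0.4], origin, threshold=2.0))  # distance 0.5, unchanged
print(regulate([3.0, 4.0], origin, threshold=2.0))  # distance 5.0, averaged
```

In a training loop this check would run periodically on the model's real weights; a single averaging step halves the distance here but does not by itself guarantee the result lands inside the threshold.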

2605.12416 2026-05-13 cs.LG

Aligning Flow Map Policies with Optimal Q-Guidance

Christos Ziakas, Alessandra Russo, Avishek Joey Bose

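AI summary: This study proposes flow map policies, a new class of generative policies that address the high computational cost of generating actions with expressive models such as diffusion and flow matching. The policies learn to take jumps of arbitrary size, including a single-step jump, along the generative dynamics of an existing flow policy, which enables fast action generation. The study also introduces FLOW MAP Q-GUIDANCE (FMQ) for policy adaptation and Q-GUIDED BEAM SEARCH (QGBS) for action generation at inference time, and reports significant gains over existing methods on multiple robot manipulation and locomotion tasks.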

2605.12412 2026-05-13 cs.CL cs.AI cs.LG

Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

Eric Bigelow, Raphaël Sarfati, Daniel Wurgaft, Owen Lewis, Thomas McGrath, Jack Merullo, Atticus Geiger, Ekdeep Singh Lubana

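AI summary: This paper studies how large language models (LLMs) update beliefs during in-context learning and proposes that the updates take place in a conceptual belief space with low-dimensional geometric structure. Using a story comprehension task and combining behavioral and representational analyses, the study finds that belief-update trajectories are low-dimensional and structured, and that linear probes can decode them to predict behavior. Interventions on these representations causally steer belief trajectories, with effects explained by the geometry of the conceptual space, giving a geometric account of belief dynamics in in-context learning.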

2605.12411 2026-05-13 cs.LG cs.AI cs.CL cs.MA

Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling

Eilam Shapira, Moshe Tennenholtz, Roi Reichart

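AI summary: This study asks how to predict the decisions of unfamiliar AI agents from limited interaction and proposes a modeling approach that combines text and tabular information. The model builds on a tabular structure in which game state, dialogue history, and offer records are combined into table rows, and adds a small frozen language model as an observer whose hidden state supplies decision-relevant features. Experiments show that the approach outperforms conventional prompting at predicting responses and bargaining offers, demonstrating the value of framing agent prediction as a target-adaptive text-tabular task.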

Abstract

AI agents negotiate and transact in natural language with unfamiliar counterparts: a buyer bot facing an unknown seller, or a procurement assistant negotiating with a supplier. In such interactions, the counterpart's LLM, prompts, control logic, and rule-based fallbacks are hidden, while each decision can have monetary consequences. We ask whether an agent can predict an unfamiliar counterpart's next decision from a few interactions. To avoid real-world logging confounds, we study this problem in controlled bargaining and negotiation games, formulating it as target-adaptive text-tabular prediction: each decision point is a table row combining structured game state, offer history, and dialogue, while $K$ previous games of the same target agent, i.e., the counterpart being modeled, are provided in the prompt as labeled adaptation examples. Our model is built on a tabular foundation model that represents rows using game-state features and LLM-based text representations, and adds LLM-as-Observer as an additional representation: a small frozen LLM reads the decision-time state and dialogue; its answer is discarded, and its hidden state becomes a decision-oriented feature, making the LLM an encoder rather than a direct few-shot predictor. Training on 13 frontier-LLM agents and testing on 91 held-out scaffolded agents, the full model outperforms direct LLM-as-Predictor prompting and game+text features baselines. Within this tabular model, Observer features contribute beyond the other feature schemes: at $K=16$, they improve response-prediction AUC by about 4 points across both tasks and reduce bargaining offer-prediction error by 14%. These results show that formulating counterpart prediction as a target-adaptive text-tabular task enables effective adaptation, and that hidden LLM representations expose decision-relevant signals that direct prompting does not surface.

2605.12406 2026-05-13 cs.AI

Semantic Reward Collapse and the Preservation of Epistemic Integrity in Adaptive AI Systems

William Parris

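AI summary: This paper discusses semantic reward collapse (SRC), a problem arising from reinforcement learning from human feedback (RLHF) and preference optimization in large language models: different kinds of evaluative dissatisfaction are compressed into a single optimization signal, which distorts model behavior on factual errors, uncertainty disclosure, and related dimensions. The paper argues that adaptive AI systems may, under optimization pressure, suppress visible uncertainty rather than maintain calibrated confidence. The author proposes the constitutional reward stratification (CRS) framework, which uses domain-aware reward structures to protect distinct kinds of epistemic responsibility and offers a testable governance direction for future research.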

Comments 15 pages including references. Position and framework paper. Companion empirical work available at arXiv:2604.17587

2605.12399 2026-05-13 cs.CV

GeoQuery: Geometry-Query Diffusion for Sparse-View Reconstruction

Xiao Cao, Yuze Li, Youmin Zhang, Jiayu Song, Cheng Yan, Wen Li, Lixin Duan

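AI summary: This paper proposes GeoQuery, a geometry-guided diffusion framework that addresses the severe artifacts in sparse-view 3D Gaussian Splatting (3DGS) reconstruction. The method introduces a geometry-guided cross-view attention (GCA) mechanism that combines predicted depth maps and camera poses to build a geometry-aligned sampling field over reference features, which yields more accurate query features, and it aggregates features within local windows to improve reconstruction consistency. Experiments show that GeoQuery improves view synthesis and artifact removal under sparse views and integrates seamlessly into existing diffusion models.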

Comments Accepted to SIGGRAPH 2026 Conference Track

2605.12398 2026-05-13 cs.CL cs.IR

Question Difficulty Estimation for Large Language Models via Answer Plausibility Scoring

Jamshid Mozafari, Bhawna Piryani, Adam Jatowt

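AI summary: This paper proposes Q-DAPS, a question difficulty estimation method based on the entropy of answer plausibility scores, for evaluating and improving large language models on question answering. The method measures question difficulty by computing the entropy of plausibility scores over candidate answers, and it is more effective than traditional approaches such as readability formulas, retrieval signals, or popularity statistics. Experiments show that Q-DAPS outperforms baselines on several mainstream QA datasets, is robust across parameter settings and question types, and agrees closely with human judgments of question difficulty.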

Comments Accepted at ACL 2026
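
The core quantity in the Q-DAPS summary, the entropy of plausibility scores over candidate answers, can be illustrated with a few lines of Python. This sketch assumes non-negative scores that are normalized by their sum into a distribution; how Q-DAPS obtains and normalizes its scores is not stated in the summary, and the function name is hypothetical.

```python
import math

def plausibility_entropy(scores):
    """Shannon entropy (in nats) of plausibility scores normalized to sum to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must contain at least one positive value")
    probs = [s / total for s in scores]
    # terms with p == 0 contribute nothing to the entropy
    return -sum(p * math.log(p) for p in probs if p > 0)

# One clearly plausible candidate: low entropy, read as an easy question.
print(plausibility_entropy([1.0, 0.0, 0.0, 0.0]))
# Four equally plausible candidates: maximal entropy, read as a hard question.
print(plausibility_entropy([1.0, 1.0, 1.0, 1.0]))
```

With four candidates the value ranges from 0 (all plausibility on one answer) to log 4 (plausibility spread evenly), so a higher value signals a harder question under this reading.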

2605.12395 2026-05-13 cs.CL

A Comparative Study of Controlled Text Generation Systems Using Level-Playing-Field Evaluation Principles

Michela Lorandi, Anya Belz

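AI summary: This paper compares multiple controlled text generation (CTG) systems under a fair evaluation framework. The study uses a unified generation and processing pipeline together with shared evaluation methods and datasets to ensure fairness and comparability. The results show that, for most existing systems, performance under re-evaluation differs substantially from the originally reported figures, underscoring the importance and urgency of standardized evaluation in controlled text generation.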

2605.12389 2026-05-13 cs.CV cs.AI cs.LG

SEMIR: Semantic Minor-Induced Representation Learning on Graphs for Visual Segmentation

Luke James Miller, Yugyung Lee

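AI summary: This paper proposes SEMIR, a semantic minor-induced graph representation learning method that addresses the computational cost and class imbalance of segmenting small, sparse structures in large-scale images. Through parameterized edge contraction, node deletion, and edge deletion, SEMIR transforms the underlying grid graph into a compact, boundary-aligned graph minor while preserving an exact mapping from minor predictions to lattice labels. On several tumor segmentation datasets the method yields consistent improvements in minority-structure Dice, and it provides a new framework for learning task-adapted representations of high-resolution structured visual data.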

Comments 20 pages, 3 figures. Accepted at ICML 2026. Includes appendices

Abstract

Segmenting small and sparse structures in large-scale images is fundamentally constrained by voxel-level, lattice-bound computation and extreme class imbalance -- dense, full-resolution inference scales poorly and forces most pipelines to rely on fixed regionization or downsampling, coupling computational cost to image resolution and attenuating boundary evidence precisely where minority structures are most informative. We introduce SEMIR (Semantic Minor-Induced Representation Learning), a representation framework that decouples inference from the native grid by learning a task-adapted, topology-preserving latent graph representation with exact decoding. SEMIR transforms the underlying grid graph into a compact, boundary-aligned graph minor through parameterized edge contraction, node deletion, and edge deletion, while preserving an exact lifting map from minor predictions to lattice labels. Minor construction is formalized as a few-shot structure learning problem that replaces hand-tuned preprocessing with a boundary-alignment objective: minor parameters are learned by maximizing agreement between predicted boundary elements and target-specific semantic edges under a boundary Dice criterion, and the induced minor is annotated with scale- and rotation-robust geometric and intensity descriptors and supports efficient region-level inference via message passing on a graph neural network (GNN) with relational edge features. We benchmark SEMIR on three tumor segmentation datasets -- BraTS 2021, KiTS23, and LiTS -- where targets exhibit high structural variability and distributional uncertainty. SEMIR yields consistent improvements in minority-structure Dice at practical runtime. More broadly, SEMIR establishes a framework for learning task-adapted, topology-preserving latent representations with exact decoding for high-resolution structured visual data.
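
To make the idea of edge contraction with an exact lifting map concrete, here is a small sketch on a 2D grid. It is not SEMIR's learned construction: the contraction rule (merge 4-neighbors whose intensity differs by at most a hypothetical threshold `tau`) stands in for the learned parameters, and node and edge deletion are omitted.

```python
def contract_grid(image, tau):
    """Contract grid edges between similar neighbors; return (lifting map, node count)."""
    h, w = len(image), len(image[0])
    parent = list(range(h * w))  # union-find over pixels

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for r in range(h):
        for c in range(w):
            for dr, dc in ((0, 1), (1, 0)):  # right and down neighbors
                rr, cc = r + dr, c + dc
                if rr < h and cc < w and abs(image[r][c] - image[rr][cc]) <= tau:
                    a, b = find(r * w + c), find(rr * w + cc)
                    if a != b:
                        parent[b] = a  # contract the edge
    ids = {}
    lift = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            lift[r][c] = ids.setdefault(find(r * w + c), len(ids))
    return lift, len(ids)

def lift_labels(lift, node_labels):
    """Decode per-node predictions back to one label per pixel."""
    return [[node_labels[n] for n in row] for row in lift]

image = [[0, 0, 9],
         [0, 0, 9],
         [5, 5, 9]]
lift, num_nodes = contract_grid(image, tau=1)
print(num_nodes, lift)
print(lift_labels(lift, ["background", "target", "background"]))
```

Because every pixel maps to exactly one node of the contracted graph, a prediction made on the three nodes decodes to a full-resolution label grid without interpolation, which is the sense of exact decoding this sketch tries to convey.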

2605.12387 2026-05-13 cs.SD cs.LG

A Semi-Supervised Framework for Speech Confidence Detection using Whisper

Adam Wynn, Jingyun Wang

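AI summary: This paper proposes a semi-supervised framework for speech confidence detection using the Whisper model, addressing the challenges posed by limited labeled data and the subjectivity of paralinguistic annotation. The framework fuses deep semantic embeddings from the Whisper encoder with an interpretable acoustic feature vector built from eGeMAPS descriptors and probability estimates of vocal stress and disfluency, and it introduces an uncertainty-aware pseudo-labeling strategy to reduce reliance on labeled data. Experiments show a Macro-F1 of 0.751, better than several self-supervised baselines, and a 3% gain on the minority class, confirming the role of explicit prosodic and auxiliary features in confidence detection.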

Comments 12 pages, 9 Figures, Submitted to IEEE Transactions on Audio, Speech and Language Processing

2605.12384 2026-05-13 cs.CL cs.AI cs.LG

Scalable Token-Level Hallucination Detection in Large Language Models

Rui Min, Tianyu Pang, Chao Du, Minhao Cheng, Yi R. Fung

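AI summary: Large language models (LLMs) often hallucinate when generating text, especially in reasoning-heavy tasks, where hallucinated content looks plausible yet contains logical errors or unreliable intermediate results and is hard to detect. To address the limits of existing methods in granularity and scalability, this paper proposes TokenHD, a token-level hallucination detection framework that uses a scalable data generation engine and an importance-weighted training strategy to detect hallucinations directly in free-form text, without relying on predefined step segmentation. Experiments show that even a small detector (0.6B) substantially improves detection, that performance keeps improving with model scale, and that the approach generalizes well across a range of practical scenarios.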

2605.12382 2026-05-13 cs.CL

Pretraining Exposure Explains Popularity Judgments in Large Language Models

Jamshid Mozafari, Bhawna Piryani, Adam Jatowt

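AI summary: This study asks whether the preference of large language models (LLMs) for well-known entities reflects real-world popularity or statistical exposure during pretraining. Using the fully accessible pretraining corpus Dolma and the open OLMo models, the study computes entity-level exposure statistics over 7.4 trillion tokens and compares them with Wikipedia pageviews and model-elicited popularity signals. The results show that the models' popularity judgments align more closely with pretraining exposure than with external popularity measures, most strongly for larger models and also for long-tail entities, which indicates that pretraining exposure is a central factor shaping popularity bias.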

Comments Accepted at SIGIR 2026

DOI: 10.1145/3805712.3809958
Journal ref: Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2026)

Abstract

Large language models (LLMs) exhibit systematic preferences for well-known entities, a phenomenon often attributed to popularity bias. However, the extent to which these preferences reflect real-world popularity versus statistical exposure during pretraining remains unclear, largely due to the inaccessibility of most training corpora. We provide the first direct, large-scale analysis of popularity bias grounded in fully observable pretraining data. Leveraging the open OLMo models and their complete pretraining corpus, Dolma, we compute precise entity-level exposure statistics across 7.4 trillion tokens. We analyze 2,000 entities spanning five types (Person, Location, Organization, Art, Product) and compare pretraining exposure against Wikipedia pageviews and two elicited LLM popularity signals: direct scalar estimation and pairwise comparison. Our results show that pretraining exposure strongly correlates with Wikipedia popularity, validating exposure as a meaningful proxy for real-world salience during the training period. More importantly, we find that LLM popularity judgments align more closely with exposure than with Wikipedia, especially when elicited via pairwise comparisons. This alignment is strongest for larger models and persists in the long tail, where Wikipedia popularity becomes unreliable. Overall, our findings demonstrate that popularity priors in LLMs are primarily shaped by pretraining statistics rather than external popularity signals, offering concrete evidence that data exposure plays a central role in driving popularity bias.

2605.12380 2026-05-13 cs.LG cs.AI

Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training

Rasool Fakoor, Murdock Aubry, Nicholas Stranges, Alexander J. Smola

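AI summary: Reinforcement learning is structurally harder than supervised learning because the policy changes the distribution of the data it learns from, which makes training brittle, especially for large models. This paper proposes an adaptive policy optimization method that replaces conventional fixed clipping with a normalized effective sample size computed from the policy-ratio distribution of the current batch, and uses it to adjust both the clipping threshold and the strength of off-policy regularization in the objective dynamically. This keeps policy updates stable while improving adaptation to stale or distribution-mismatched data. Experiments show strong performance across a range of settings, with no new hyperparameters and with some of the original method's hyperparameters removed.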
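
A normalized effective sample size over a batch of policy ratios can be computed as in the sketch below. It assumes the standard Kish form, (sum of w)^2 / (n * sum of w^2); the summary does not confirm that the paper uses exactly this estimator, and the mapping from this value to a clipping threshold is not described, so it is left out.

```python
import math

def normalized_ess(log_ratios):
    """Normalized effective sample size in (0, 1] from per-sample log policy ratios.

    log_ratios[i] = log pi_new(a_i | s_i) - log pi_old(a_i | s_i).
    """
    m = max(log_ratios)
    # The statistic is scale-invariant, so subtracting the max only adds stability.
    w = [math.exp(x - m) for x in log_ratios]
    return sum(w) ** 2 / (len(w) * sum(x * x for x in w))

# On-policy batch: all ratios equal, every sample counts fully.
print(normalized_ess([0.0, 0.0, 0.0, 0.0]))
# Badly off-policy batch: one sample dominates, so the value approaches 1/n.
print(normalized_ess([0.0, -100.0, -100.0, -100.0]))
```

A value near 1 indicates the batch is close to on-policy, and a value near 1/n indicates that a single sample carries almost all the weight, which is the kind of signal an adaptive method could use to tighten or relax its updates.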

2605.12379 2026-05-13 cs.LG cs.AI

Discrete Flow Matching for Offline-to-Online Reinforcement Learning

Fairoz Nower Khan, Nabuat Zaman Nahim, Peizhong Ju

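AI summary: This paper studies how to transfer a generative policy trained on offline data to online interaction in reinforcement learning tasks with discrete action spaces. To address the shortcomings of existing methods for discrete action spaces and online fine-tuning, the authors propose DRIFT, which fine-tunes a pretrained continuous-time Markov chain policy online using an advantage-weighted discrete flow matching loss and a path-space penalty. The method improves policy performance while retaining pretrained knowledge and shows superior stability and results on several mainstream discrete-action tasks.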

2605.12375 2026-05-13 cs.LG cs.AI

Agent-Based Post-Hoc Correction of Agricultural Yield Forecasts

Matthew Beddows, Aiden Durrant, Georgios Leontidis

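AI summary: This study targets the limited accuracy of crop yield forecasting in commercial soft-fruit growing, where data are scarce, and proposes a post-hoc correction framework built on a structured large language model (LLM) agent. The method incorporates agricultural domain knowledge through tools for phase detection, bias learning, and range validation to correct the forecasts of existing models. Experiments on strawberry and corn yield datasets show significantly improved forecast accuracy, with Llama 3.1 8B as the agent giving the best results.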

Comments 21 pages, 6 figures, 6 tables

2605.12370 2026-05-13 cs.CL cs.IR

Context Convergence Improves Answering Inferential Questions

Jamshid Mozafari, Bhawna Piryani, Adam Jatowt

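AI summary: This study examines how large language models perform on question answering that requires inference, focusing on how the structure and quality of the sentences in the context affect performance. It proposes convergence, a measure of a sentence's ability to eliminate incorrect answers, as a criterion for building more effective QA contexts. Experiments show that contexts built from high-convergence sentences significantly improve answer accuracy and that ordering sentences by descending convergence improves performance further, highlighting the practical value of convergence for guiding context construction and analyzing inferential behavior.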

Comments Accepted at SIGIR 2026

2605.12368 2026-05-13 cs.LG

MetaColloc: Optimization-Free PDE Solving via Meta-Learned Basis Functions

Zichuan Yang

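AI summary: MetaColloc is an optimization-free, data-free framework for solving partial differential equations (PDEs) quickly using meta-learned basis functions. The method decouples basis discovery from the solve: in an offline phase, a dual-branch neural network is meta-trained on a variety of Gaussian random fields to produce a general dictionary of neural basis functions. At test time, the solution is obtained by assembling a collocation matrix and performing a single linear least-squares solve, which significantly improves efficiency and accuracy.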

Comments 21 pages, 5 figures, 6 tables
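
The test-time step in the MetaColloc summary, assembling a collocation matrix and doing one linear least-squares solve, can be illustrated on a toy problem. Everything specific here is an assumption: the equation is the 1D Poisson problem u''(x) = f(x) with u(0) = u(1) = 0, and a fixed sine basis stands in for the meta-learned neural basis dictionary, which this sketch does not attempt to reproduce.

```python
import math

def solve_linear(M, y):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(y)
    aug = [row[:] + [y[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def collocation_solve(f, num_basis=3, num_points=20):
    """Fit coefficients c_k of u(x) = sum_k c_k sin(k pi x) so that u'' matches f."""
    xs = [(i + 1) / (num_points + 1) for i in range(num_points)]
    # Collocation matrix: second derivative of each basis function at each point.
    A = [[-(k * math.pi) ** 2 * math.sin(k * math.pi * x)
          for k in range(1, num_basis + 1)] for x in xs]
    b = [f(x) for x in xs]
    # Least squares via the normal equations (A^T A) c = A^T b.
    AtA = [[sum(A[i][p] * A[i][q] for i in range(num_points))
            for q in range(num_basis)] for p in range(num_basis)]
    Atb = [sum(A[i][p] * b[i] for i in range(num_points)) for p in range(num_basis)]
    return solve_linear(AtA, Atb)

# For f(x) = -pi^2 sin(pi x) the exact solution is u(x) = sin(pi x).
coeffs = collocation_solve(lambda x: -(math.pi ** 2) * math.sin(math.pi * x))
print(coeffs)
```

The normal equations are used only to keep the example dependency-free; a QR or SVD based least-squares routine would be the more robust choice for larger or less well-conditioned bases.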

2605.12366 2026-05-13 cs.AI

Classifier Context Rot: Monitor Performance Degrades with Context Length

Sam Martin, Fabien Roger

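AI summary: This study shows that current frontier language models, when used as classifiers to monitor coding agents for dangerous behavior, degrade markedly as context length grows. In experiments, when the dangerous behavior appears after as much as 800K tokens of benign content, the detection error rate of several mainstream models, such as Opus 4.6, GPT 5.4, and Gemini 3.1, increases by a factor of 2 to 30. The study also reports that prompting techniques and improved post-training can partly mitigate the problem, and it stresses that existing monitor evaluations may overestimate performance by ignoring long-context degradation.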

2605.12361 2026-05-13 cs.CL cs.AI cs.IR

MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering

Rezarta Islamaj, Robert Leaman, Joey Chan, Nicholas Wan, Qiao Jin, Natalie Xie, John Wilbur, Shubo Tian, Lana Yeganova, Po-Ting Lai, Chih-Hsuan Wei, Yifan Yang, Yao Ge, Qingqing Zhu, Zhizheng Wang, Zhiyong Lu

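AI summary: MedHopQA is a disease-centered multi-hop reasoning benchmark designed to evaluate the genuine reasoning ability of LLM-based biomedical question answering systems. It contains 1,000 expert-curated question-answer pairs; each question requires integrating information from two distinct Wikipedia articles and is answered in open-ended free text. To make evaluation more robust and fair, MedHopQA adds ontology-grounded synonym sets and uses a layered verification process, and it reduces leaderboard gaming and data contamination risk by embedding the scored questions in a much larger question set with answers withheld, providing a reusable framework for building future biomedical QA datasets.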

Abstract

Evaluating large language models (LLMs) in the biomedical domain requires benchmarks that can distinguish reasoning from pattern matching and remain discriminative as model capabilities improve. Existing biomedical question answering (QA) benchmarks are limited in this respect. Multiple-choice formats can allow models to succeed through answer elimination rather than inference, while widely circulated exam-style datasets are increasingly vulnerable to performance saturation and training data contamination. Multi-hop reasoning, defined as the ability to integrate information across multiple sources to derive an answer, is central to clinically meaningful tasks such as diagnostic support, literature-based discovery, and hypothesis generation, yet remains underrepresented in current biomedical QA benchmarks. MedHopQA is a disease-centered multi-hop reasoning benchmark consisting of 1,000 expert-curated question-answer pairs introduced as a shared task at BioCreative IX. Each question requires synthesis of information across two distinct Wikipedia articles, and answers are provided in an open-ended free-text format. Gold annotations are augmented with ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy to support both lexical and concept-level evaluation. MedHopQA was constructed through a structured process combining human annotation, triage, iterative verification, and LLM-as-a-judge validation. To reduce leaderboard gaming and contamination risk, the 1,000 scored questions are embedded within a publicly downloadable set of 10,000 questions, with answers withheld, on a CodaBench leaderboard. MedHopQA provides both a benchmark and a reusable framework for constructing future biomedical QA datasets that prioritize compositional reasoning, saturation resistance, and contamination resistance as core design constraints.
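
Lexical evaluation against a gold answer augmented with a synonym set, as the abstract describes, could be sketched as below. The normalization rule, the function names, and the example synonym entries are all illustrative assumptions; the benchmark's actual scoring script and synonym data are not shown in the abstract.

```python
import re

def normalize(text):
    """Lowercase and collapse every run of non-alphanumeric characters to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def is_correct(prediction, gold, synonyms=()):
    """Accept a free-text prediction if it matches the gold answer or any synonym."""
    accepted = {normalize(gold)} | {normalize(s) for s in synonyms}
    return normalize(prediction) in accepted

# Hypothetical gold entry with an illustrative synonym set.
gold = "Parkinson disease"
synonyms = ["Parkinson's disease", "paralysis agitans"]

print(is_correct("parkinson's disease", gold, synonyms))
print(is_correct("Paralysis Agitans", gold, synonyms))
print(is_correct("Alzheimer disease", gold, synonyms))
```

Exact matching after normalization is deliberately strict; concept-level evaluation, which the abstract also mentions, would instead map the prediction to an ontology identifier before comparing.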

2605.12358 2026-05-13 cs.LG

From Message-Passing to Linearized Graph Sequence Models

Joël Mathys, Basil Rohner, Saku Peltonen, Roger Wattenhofer

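AI summary: This paper proposes Linearized Graph Sequence Models, a framework that combines message-passing graph learning with sequence modeling. By recasting graph computation as a sequence modeling problem, the approach simplifies architecture design and systematically separates computational depth from information propagation depth, turning core graph architecture decisions into sequence modeling choices. The study analyzes, theoretically and empirically, which sequence properties help learn inductive biases for graph structure, validates the approach on long-range tasks, and offers principled guidance for applying modern sequence modeling techniques to graph learning.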

2605.12357 2026-05-13 cs.AI

$δ$-mem: Efficient Online Memory for Large Language Models

Jingdi Lei, Di Zhang, Junxian Li, Weida Wang, Kaixuan Fan, Xiang Liu, Qihan Liu, Xiaoteng Ma, Baian Chen, Soujanya Poria

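AI summary: In long-running assistant and agent systems, large language models need to accumulate and reuse historical information effectively. Because simply extending the context window is costly and of limited benefit, this paper proposes $δ$-mem, a lightweight online memory mechanism that compresses history into a fixed-size state matrix updated by an incremental learning rule and, during generation, uses readouts from that state to apply low-rank corrections to the backbone model's attention computation. Experiments show that $δ$-mem preserves the model's general capabilities while significantly improving results on several benchmarks, with the largest gains on memory-intensive tasks.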
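
A fixed-size state matrix with an incremental update can be illustrated with the classical delta rule, S <- S + beta * (v - S k) k^T. That this is the rule $δ$-mem uses is an assumption suggested by the name, not something the summary confirms; the function names and `beta` are hypothetical, and the low-rank correction to attention is not sketched.

```python
def read(S, q):
    """Read from memory: the matrix-vector product S q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]

def delta_update(S, k, v, beta=0.5):
    """Delta-rule write: move the value stored under key k toward v.

    S has shape (len(v), len(k)) and keeps that shape however many writes occur.
    """
    pred = read(S, k)  # what the memory currently returns for this key
    return [[S[i][j] + beta * (v[i] - pred[i]) * k[j] for j in range(len(k))]
            for i in range(len(v))]

S = [[0.0, 0.0], [0.0, 0.0]]
S = delta_update(S, k=[1.0, 0.0], v=[2.0, 4.0], beta=1.0)
print(read(S, [1.0, 0.0]))  # value stored under the first key
S = delta_update(S, k=[1.0, 0.0], v=[0.0, 0.0], beta=1.0)
print(read(S, [1.0, 0.0]))  # same key overwritten, state size unchanged
```

The point of the example is the memory footprint: writes correct the existing association for a key instead of appending new entries, so the state stays the same size as history grows.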

2605.12347 2026-05-13 cs.RO

Real-Time Whole-Body Teleoperation of a Humanoid Robot Using IMU-Based Motion Capture with Sim2Sim and Sim2Real Validation

Hamza Ahmed Durrani, Suleman Khan

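AI summary: This paper studies how to achieve stable, low-latency control for real-time whole-body teleoperation of a humanoid robot, overcoming the morphological mismatch between human and robot, inertial sensor noise, control latency, and the sim-to-real gap. It presents an end-to-end control system based on IMU motion capture that maps human motion directly onto the Unitree G1 robot with no offline buffering or learned components, achieving continuous, low-latency real-time operation. After validation in simulation, the system is deployed directly on the physical robot, where it reproduces a range of complex motions including walking, standing, sitting, turning, bowing, and coordinated whole-body movements, offering a practical and scalable framework for whole-body humanoid teleoperation with commercial wearable devices.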

Comments 8 pages, 4 figures

2605.12345 2026-05-13 cs.CL

Output Composability of QLoRA PEFT Modules for Plug-and-Play Attribute-Controlled Text Generation

Michela Lorandi, Anya Belz

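AI summary: This paper studies how to combine parameter-efficient fine-tuning (PEFT) modules for plug-and-play, attribute-controlled text generation. The authors examine three approaches that go beyond single-task training and inference: joint training on multiple datasets, combining the weight matrices of different PEFT modules at inference time, and combining their outputs. Experiments show that combining the outputs of different PEFT modules performs particularly well, outperforming even modules trained specifically for a single task on single-task test sets, by 2% on average.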

2605.12343 2026-05-13 cs.LG

Neural-Schwarz Tiling for Geometry-Universal PDE Solving at Scale

Paolo Secchi, Daniel S. Balint, Marco Maurizi

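AI summary: This paper proposes NEST, a neural PDE solving framework that addresses the limits of conventional global surrogate models in cross-domain reuse and large-scale deployment. NEST takes a local-to-global approach: it learns a local physics solver on small geometric patches and composes it globally through classical domain decomposition, generalizing to complex geometries and boundary conditions. Experiments show that the method solves nonlinear static equilibrium problems on complex 3D domains far larger than the training scale, opening a new training route toward scalable PDE solvers.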

2605.12339 2026-05-13 cs.LG cs.AI

BSO: Safety Alignment Is Density Ratio Matching

Tien-Phat Nguyen, Truong Nguyen, Thin Nguyen, Duy Minh Ho Nguyen, Ngoc-Thanh Dinh, Trung Le

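AI summary: This paper proposes BSO, a method that recasts safety alignment of language models as a density ratio matching problem, simplifying the conventionally complex training pipeline. Minimizing a Bregman divergence between the data and the model yields a family of single-stage loss functions that come with theoretical guarantees and recover the optimal safe policy. BSO is general and simple: it needs no auxiliary models, introduces only one additional hyperparameter, and includes existing safety alignment methods as special cases. Experiments show a better trade-off between safety and helpfulness.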

2605.12338 2026-05-13 cs.LG cs.AI stat.CO

Manifold Sampling via Entropy Maximization

Cornelius V. Braun, Tilman Burghoff, Marc Toussaint

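AI summary: This paper studies sampling on manifolds defined implicitly by smooth equality and inequality constraints, particularly when the feasible set has multiple disconnected components. To address this challenge, the authors propose MASEM, a method based on entropy-maximizing resampling that maximizes the entropy of the empirical distribution using k-nearest-neighbor density estimates, improving sampling efficiency. Experiments on synthetic data and robotics applications show superior mixing efficiency and scalability, significantly outperforming existing methods.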