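arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.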

2606.05404 2026-06-05 cs.AI cs.CL cs.LG

Harnessing Generalist Agents for Contextualized Time Series

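Zihao Li, Kaifeng Jin, Yuanchen Bei, Jiaru Zou, Avaneesh Kumar, Xuying Ning, Yanjun Zhao, Mengting Ai, Baoyu Jing, Hanghang Tong, Jingrui He

Affiliations: University of Illinois Urbana-Champaign

AI summary: Proposes TimeClaw, a framework that integrates executable temporal tools, experience-driven capability evolution, and episodic multimodal memory to give generalist LLM agents contextualized temporal reasoning, with improved performance on benchmarks spanning energy, finance, and other domains.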

Comments: Preprint. 38 Pages

Abstract

Time series are often embedded in rich contexts that are essential for holistic modeling. Moreover, real-world practitioners often require end-to-end workflows for analyzing temporal dynamics, where widely studied tasks such as forecasting are only one step in a broader solution loop. While generalist AI agents offer a promising interface for such workflows under complex contexts, they still operate primarily in textual spaces that are not fully aligned with structured temporal signals. In this work, we introduce TimeClaw, an agentic harness framework for time series that equips generalist LLM agents with the time series-native runtime support needed for contextualized temporal reasoning. TimeClaw integrates executable temporal tools for grounded and auditable analysis, experience-driven capability evolution for creating reusable analytical routines, and episodic multimodal memory for retrieving relevant reasoning traces. Together, these components unlock harnessed open-ended temporal reasoning with contextual information. Extensive evaluation on multiple benchmarks covering diverse tasks across energy, finance, weather, traffic, and other real-world domains demonstrates improved performance of TimeClaw. Code is available at https://github.com/iDEA-iSAIL-Lab-UIUC/TimeClaw.


2606.05403 2026-06-05 cs.LG cs.AI

Trust, but Don't Verify: Epistemic Blind Spots in LLM Source Evaluation

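Rohan N. Pradhan, Steve Goley

Affiliations: Amazon

AI summary: Examines whether language models assess evidence quality during multi-source synthesis. Finds that models can detect fabricated statistics but do not recruit that ability during synthesis, relying instead on a methodology-register gate, so numeric-validity signals are suppressed.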

Abstract

Language models increasingly act as epistemic proxies, synthesizing evidence from multiple sources to inform decisions. Whether they evaluate the quality of that evidence, or merely aggregate it based on surface presentation, remains poorly understood. We show that models possess the capability to detect fabricated statistics (correct identification rates of 0.76-1.00 for methodology in isolation) but do not recruit this capability during multi-source synthesis, producing similar numeric estimates whether the statistics are fabricated or valid. Specifically, source influence is governed by a methodology-register gate that responds to the distributional register of analytical text but not to numeric validity: for example, statistically impossible confidence intervals receive the same weight as valid ones. The behavioral dissociation replicates across five models from three families (Claude, Qwen, OLMo) and three professional domains. Mechanistic analyses, including causal tracing, linear probes, and component-level attribution, converge on the same account: the model encodes and causally uses a methodology-register representation that transfers across domains (probe AUC 0.83-0.92), while numeric-validity signals, decodable in isolation, are suppressed to chance during multi-source synthesis. Prompting-based mitigations, even an oracle checklist naming the exact statistical checks, produce blanket skepticism rather than selective discernment, and the post-training pipelines we examine reinforce the stylistic shortcut without building numeric verification. Unlike sycophancy, which tracks user preference, this failure tracks whether a source presents as analytically credible, not whether its claims are internally consistent. We term this epistemic alignment: like preference and safety alignment, the question is not capability but deployment.


2606.05402 2026-06-05 cs.CL cs.AI

ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces

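Jinu Lee, Shivam Agarwal, Amruta Parulekar, Siddarth Madala, Dilek Hakkani-Tur, Julia Hockenmaier

Affiliations: University of Illinois Urbana-Champaign

AI summary: Proposes ReasoningFlow, a framework that models the reasoning traces of large reasoning models as fine-grained directed acyclic graphs. Manual and automatic annotation reveal structural similarity across models, diverse reasoning behaviors, and how erroneous steps relate to final answers.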

Abstract

Large reasoning models (LRMs) produce reasoning traces with non-linear structures, such as backtracking and self-correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a framework that captures the discourse structures of LRM reasoning traces into fine-grained directed acyclic graphs (DAGs). We develop and validate our annotation schema through careful manual annotation of 31 traces (2.1k steps), achieving high inter-annotator agreement, then scale to automatic annotation of 1,260 traces (247.7k steps) spanning three tasks (math, science, argumentation) and five models (Qwen2.5-32B-Inst, QwQ-32B, DeepSeek-V3, DeepSeek-R1, GPT-oss-120B). By analyzing ReasoningFlow graphs, we find: (1) LRMs exhibit structurally similar traces, despite being trained from different base models and potentially non-overlapping post-training data. (2) ReasoningFlow reveals diverse fine-grained reasoning behaviors (e.g., local verification, self-reflection, and assumptions) that can be used for better reasoning trace monitorability. (3) In LRMs, most of the erroneous steps are not used to derive final answers. (4) Mechanistic causal dependencies between steps do not reflect the language-level discourse structure. We release the dataset and code in: https://github.com/jinulee-v/reasoningflow.
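Finding (3) can be read as a reachability question on the trace graph: an erroneous step matters only if the final answer transitively depends on it. The following is a minimal Python sketch under stated assumptions. The step names, the `parents` mapping, and the `ancestors` helper are hypothetical and are not taken from the ReasoningFlow code or annotation schema.

```python
def ancestors(parents, node):
    """Return every step that `node` transitively depends on."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical trace: each step maps to the steps it was derived from.
parents = {
    "s2": ["s1"],
    "s3": ["s2"],      # erroneous branch that the model later abandons
    "s4": ["s2"],      # corrected branch
    "answer": ["s4"],
}
erroneous = {"s3"}

used = ancestors(parents, "answer")   # steps the answer depends on
unused_errors = erroneous - used      # erroneous steps that never reach the answer
```

Here the erroneous step `s3` is not an ancestor of `answer`, which is the pattern the abstract reports as typical for LRMs.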


2606.05400 2026-06-05 cs.AI cs.CL cs.LG

LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization

Yuanhe Zhang, Yuekai Sun, Taiji Suzuki, Jason D. Lee, Fanghui Liu

Affiliations: Department of Statistics, University of Warwick, UK; Center for Advanced Intelligence Project, RIKEN, Japan; Department of Statistics, University of Michigan, USA; Department of Mathematical Informatics, The University of Tokyo (also Center for Advanced Intelligence Project, RIKEN, Japan); Department of Electrical Engineering and Computer Sciences (also Department of Statistics), University of California, Berkeley, USA; School of Mathematical Sciences, Institute of Natural Sciences and MOE-LSC, Shanghai Jiao Tong University, China

AI summary: Proposes LeanMarathon, a multi-agent framework that uses a blueprint abstraction and a two-stage orchestrator for reliable long-horizon autoformalization of research mathematics, formalizing seven theorems across four Erdős problems.

Comments: 26 pages, 9 figures. Comments are welcome

Abstract

Long-horizon autoformalization of research mathematics fails not only at hard lemmas, but at scale: statements drift, dependencies tangle, context decays, and local repairs corrupt distant work. We present LeanMarathon, a multi-agent harness for reliable research-level Lean autoformalization. Its core abstraction is an evolving blueprint: a Lean file that serves simultaneously as formal proof skeleton, natural-language proof graph, and shared system of record. Four contract-scoped agents construct, audit, prove, and repair this blueprint. These agents are coordinated by a two-stage orchestrator that first stabilizes target fidelity through adversarial review and then discharges the proof directed acyclic graph (DAG) from its dynamic leaves upward in parallel CI-gated rounds. LeanMarathon turns one brittle multi-hour run into many local, recoverable, parallel transactions. We evaluate LeanMarathon on two recent research papers spanning four Erdős problems (#1051, #1196, #164, #1217). Across three autonomous runs, it formalizes all seven target theorems with no sorry, proving 258 lemmas and theorems. These results show that reliable AI co-mathematics requires not only stronger provers, but durable harnesses that preserve target fidelity across long mathematical developments. The code can be found at https://github.com/YuanheZ/LeanMarathon.
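Discharging a proof DAG "from its leaves upward in parallel rounds" resembles a layered topological sort: each round contains the lemmas whose dependencies are already proved. The Python sketch below illustrates only that scheduling idea. The lemma names and the `discharge_rounds` function are hypothetical, and the sketch uses a static graph, whereas the abstract describes dynamic leaves and CI gating that are not modeled here.

```python
def discharge_rounds(deps):
    """Group lemmas into rounds; a lemma is ready once all its dependencies are proved."""
    remaining = {name: set(d) for name, d in deps.items()}
    proved = set()
    rounds = []
    while remaining:
        # Everything whose dependencies are all proved can run in parallel this round.
        ready = sorted(name for name, d in remaining.items() if d <= proved)
        if not ready:
            raise ValueError("cycle or unknown dependency: not a valid proof DAG")
        rounds.append(ready)
        proved.update(ready)
        for name in ready:
            del remaining[name]
    return rounds


# Hypothetical blueprint: lemma -> lemmas it depends on.
deps = {
    "lemma_a": [],
    "lemma_b": [],
    "lemma_c": ["lemma_a", "lemma_b"],
    "main_theorem": ["lemma_c"],
}
rounds = discharge_rounds(deps)
```

Because each round touches only lemmas whose inputs are settled, a failed proof attempt stays local to its round, which is one way to read the abstract's "local, recoverable, parallel transactions".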


2605.02395 2026-06-05 cs.AI

Controllable and Verifiable Process Data Synthesis for Process Reward Models

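Yinghui Chi, Lucien Wang

Affiliations: Jilin University

AI summary: Proposes a controllable and verifiable framework that synthesizes process supervision data by injecting template-aware errors and recomputing subsequent steps, improving process reward models on logical and mathematical reasoning.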

Abstract

Process reward models (PRMs) rely on high-quality process supervision data, yet existing construction methods often provide limited control over error location, error type, and trajectory consistency. We propose a controllable and verifiable framework for synthesizing process supervision data for PRMs. Our framework first constructs a correct symbolic reasoning chain, injects a template-aware error into an intermediate step, recomputes subsequent steps under the corrupted state, and verifies that the injected step is not derivable from its prefix. The resulting paired trajectories are prefix-invalid at the first error while remaining trajectory-consistent after symbolic recomputation, and are translated into aligned natural-language processes for PRM training and evaluation. Experiments show that the synthesized data improve Best-of-8 reranking on logical reasoning benchmarks and transfer to mathematical reasoning. Step-level evaluation further shows that first-error localization remains substantially more challenging than overall step classification, highlighting the need for fine-grained and verifiable process supervision.
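The construct, inject, recompute, verify pipeline can be illustrated on a toy arithmetic chain. This Python sketch is an assumption-laden stand-in: the `add`/`mul` step format, the `offset` error, and all function names are hypothetical, and the paper's symbolic chains and error templates are not reproduced here.

```python
def apply_step(state, step):
    """Apply one symbolic step, either ("add", k) or ("mul", k)."""
    op, operand = step
    return state + operand if op == "add" else state * operand


def run_chain(start, steps):
    """Return the start state followed by the state after each step."""
    states = [start]
    for step in steps:
        states.append(apply_step(states[-1], step))
    return states


def inject_error(start, steps, error_step, offset):
    """Corrupt the result of step `error_step` (1-indexed), then recompute later steps."""
    if offset == 0:
        # In this deterministic chain the only derivable value is the correct one,
        # so a nonzero offset guarantees the injected step is not derivable from its prefix.
        raise ValueError("offset must change the value")
    correct = run_chain(start, steps)
    corrupted = correct[:error_step] + [correct[error_step] + offset]
    for step in steps[error_step:]:
        corrupted.append(apply_step(corrupted[-1], step))
    return correct, corrupted


def first_error(correct, corrupted):
    """Index of the first state where the two trajectories differ."""
    for index, (good, bad) in enumerate(zip(correct, corrupted)):
        if good != bad:
            return index
    return None


steps = [("add", 3), ("mul", 4), ("add", 1)]
correct, corrupted = inject_error(2, steps, error_step=2, offset=1)
```

The corrupted trajectory is wrong from step 2 onward, yet every later step still follows correctly from the corrupted state, which mirrors the "prefix-invalid but trajectory-consistent" property described in the abstract.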


2605.02192 2026-06-05 cs.RO

Do We Really Need Immediate Resets? Rethinking Collision Handling for Efficient Robot Navigation

Shanze Wang, Xinming Zhang, Siwei Cheng, Xianghui Wang, Changwen Chen, Hailong Huang, Wei Zhang

Affiliations: College of Information Science and Technology, Eastern Institute of Technology; Department of Aeronautical and Aviation Engineering, The Hong Kong Polytechnic University; Department of Computing, The Hong Kong Polytechnic University; School of Computer Science and Technology, University of Science and Technology of China; Department of Mechanical Engineering, The Hong Kong Polytechnic University

AI summary: Challenges the convention of resetting the environment immediately after every collision in robot navigation and proposes the Multi-Collision reset Budget (MCB) framework, which decouples local collision termination from global environment resets so the agent can retry difficult configurations within the same episode, improving early-stage learning efficiency.

Comments: 8 pages, 9 figures

Abstract

Should a single collision necessarily terminate an entire navigation episode? In most deep reinforcement learning (DRL) frameworks for robot navigation, this remains the standard practice: every collision immediately triggers a global environment reset and is penalized as a complete task failure. While a collision during deployment naturally indicates task failure, applying the same treatment during training prevents the agent from exploring challenging obstacle configurations, which slows learning progress in the early training phase. In this work, we challenge this convention and propose a Multi-Collision reset Budget (MCB) framework that decouples local collision termination from global environment resets, allowing the agent to retry difficult configurations within the same episode. Simulation experiments show that MCB improves early-stage learning efficiency by reaching target success-rate levels with fewer interactions, with small collision budgets producing the most consistent gains. Real-world experiments on heterogeneous robot platforms further validate the deployability of the learned policies in cluttered environments.
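The decoupling of local collision termination from global resets can be sketched as a counter inside the episode loop. The Python below is a simplified illustration, not the authors' implementation: the scripted `events` list stands in for a simulator, and the exact budget semantics (here, a budget of K permits K local retries) is an assumption.

```python
def run_episode(events, collision_budget):
    """Replay scripted step outcomes and report how the episode ended.

    events: sequence of "ok", "collision", or "goal" (hypothetical simulator output).
    Returns (local_retries, needs_global_reset, reached_goal).
    """
    collisions = 0
    local_retries = 0
    for outcome in events:
        if outcome == "goal":
            return local_retries, False, True
        if outcome == "collision":
            collisions += 1
            if collisions > collision_budget:
                # Budget exhausted: fall back to a full environment reset.
                return local_retries, True, False
            # Local termination only: keep the same environment and retry.
            local_retries += 1
    return local_retries, False, False


events = ["ok", "collision", "ok", "collision", "ok", "goal"]
immediate_reset = run_episode(events, collision_budget=0)  # conventional behavior
small_budget = run_episode(events, collision_budget=2)
```

With a budget of 0 the first collision ends the episode, matching the standard practice the abstract questions. With a budget of 2 the agent retries twice in the same layout and reaches the goal.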


2605.01844 2026-06-05 cs.CL

The Cylindrical Representation Hypothesis for Language Model Steering

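Lang Gao, Jinghui Zhang, Wei Liu, Fengxian Ji, Chenxi Wang, Zirui Song, Akash Ghosh, Youssef Mohamed, Preslav Nakov, Xiuying Chen

Affiliations: University of Science and Technology of China

AI summary: Proposes the Cylindrical Representation Hypothesis (CRH), which relaxes the orthogonality assumption of the Linear Representation Hypothesis (LRH) to explain the instability and uncertainty of language model steering.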

Comments: ICML 2026 camera ready

Abstract

Steering is a widely used technique for controlling large language models, yet its effects are often unstable and hard to predict. Existing theoretical accounts are largely based on the Linear Representation Hypothesis (LRH). While LRH assumes that concepts can be orthogonalized for lossless control, this idealized mapping fails in real representations and cannot account for the observed unpredictability of steering. By relaxing LRH's orthogonality assumption while preserving linear representations, we show that overlapping concept contributions naturally yield a sample-specific axis-orthogonal structure. We formalize this as the Cylindrical Representation Hypothesis (CRH). In CRH, a central axis captures the main difference between concept absence and presence and drives concept generation. A surrounding normal plane controls steering sensitivity by determining how easily the axis can activate the target concept. Within this plane, only specific sensitive sectors strongly facilitate concept activation, while other sectors can suppress or delay it. While the surrounding normal plane can be reliably identified from difference vectors, the sensitive sector cannot, introducing intrinsic uncertainty at the sector level. This uncertainty provides a principled explanation for why steering outcomes often fluctuate even when using well-aligned directions. Our experiments verify the existence of the cylindrical structure and demonstrate that CRH provides a valid and practical way to interpret model steering behavior in real settings: https://github.com/mbzuai-nlp/CRH.
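The axis and normal-plane picture can be made concrete with an orthogonal decomposition. The Python sketch below assumes, for illustration only, that the axis is estimated as the difference of mean activations with and without the concept. The toy vectors and helper names are hypothetical, and the paper's actual estimator and its sector analysis are not reproduced.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def decompose(vec, axis):
    """Split `vec` into a component along `axis` and a component in its normal plane."""
    scale = dot(vec, axis) / dot(axis, axis)
    axial = [scale * a for a in axis]
    normal = [v - a for v, a in zip(vec, axial)]
    return axial, normal


# Toy activations (hypothetical): concept present vs. concept absent.
with_concept = [[2.0, 1.0, 0.0], [4.0, -1.0, 0.0]]
without_concept = [[0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
axis = [p - q for p, q in zip(mean_vector(with_concept), mean_vector(without_concept))]

sample = [1.5, 2.0, -1.0]
axial, normal = decompose(sample, axis)
```

The `normal` component is what remains after removing the axis direction. In the abstract's terms, its orientation within the normal plane is the part that cannot be identified from difference vectors alone.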


2606.05395 2026-06-05 cs.RO cs.AI

VASO: Formally Verifiable Self-Evolving Skills for Physical AI Agents

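Yunhao Yang, Neel P. Bhatt, Kevin Wang, Samuel Tetteh, Zhangyang Wang, Ufuk Topcu

Affiliations: The University of Texas at Austin; Iowa State University

AI summary: Proposes VASO, a framework in which formal verification guides the self-evolution of LLM-generated robot skill contracts. Counterexamples from model checking are turned into textual gradients that update the skill contracts without fine-tuning model weights, reaching 97.2% formal-specification compliance on Jackal and quadcopter tasks.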

Comments: Project webpage: https://languagegroundedriskdetection.github.io/ProjectPage/vaso-webpage/

Abstract

Reusable robot skills are becoming the basic units through which embodied agents turn open-ended instructions into long-horizon physical behavior. We argue that, while foundation models have collapsed the cost of creating these skills, the cost of trusting them has not. Existing skill-evolution loops refine skills through execution feedback, unit tests, environment reward, or LLM self-critique, but these signals provide only trace-level evidence: they show that a skill worked on sampled executions, not that skill-induced plans satisfy temporal safety contracts under untested conditions. We introduce VASO, a framework for verification-guided self-evolution of LLM-generated robot skill contracts. In VASO, each skill is represented as a semantic contract with two coupled interfaces: a formal interface that aligns robot states, observations, and control commands with logical propositions for model checking, and a planner-facing interface that guides executable behavior generation. A model checker first filters logically inconsistent skill contracts, then verifies plans induced by the skill against global and local temporal specifications. When verification fails, VASO translates the counterexample trace into a textual gradient that updates the reusable skill contract while keeping foundation-model weights frozen. On Clearpath Jackal and PX4 quadcopter tasks, VASO reaches 97.2% formal-specification compliance using fewer than 100 optimization samples, outperforming execution-feedback, prompt-optimization, and fine-tuning baselines. To our knowledge, VASO is the first framework that closes the loop between formal verification and self-evolving LLM-generated skills for physical AI agents: formal counterexamples become optimization feedback for reusable robot skill contracts, rather than merely verifying one-off plans, tuning planner prompts, or fine-tuning model weights.


2606.05389 2026-06-05 cs.AI

Residual Modeling for High-Fidelity Learned Compression of Scientific Data

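Liangji Zhu, Sanjay Ranka, Anand Rangarajan

AI summary: Addresses the problem that residual correction dominates the bit rate of learned compression at high fidelity, and proposes two residual coders, LBRC and NGLR, that raise compression ratio through representations tailored to the residual.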

Comments: 9 pages, 3 figures, 3 tables

Abstract

Lossy compression is essential for massive spatiotemporal data from scientific simulations. Learned compressors can achieve high compression ratios at moderate accuracy targets, but their aggregate reconstruction losses do not guarantee accuracy for each block. Existing Guaranteed Autoencoder (GAE) methods add a per-block residual correction by retaining SVD/PCA-style coefficients until the target is met. This works at moderate tolerances, but in the high-fidelity regime with block-level NRMSE from 10^-6 to 10^-4, the number of retained coefficients grows quickly and the correction stream dominates the total rate. We propose a residual-centric view: the learned residual is structurally different from the original scientific field and should be coded with a representation designed for that residual. We introduce two residual coders. LBRC is a deterministic, training-free pipeline that adaptively quantizes the learned residual to the target NRMSE and losslessly encodes the resulting integer residual using 3D Lorenzo differencing, zigzag mapping, bit-plane coding, and entropy coding. NGLR adds a causal neural predictor that outputs a normalized bias for an integer-rounded Lorenzo prediction in the same deterministic integer pipeline, reducing the entropy of the remaining residual code while preserving deterministic decoding. The predictor weights are serialized and counted in the bitstream. Across E3SM, JHTDB, and ERA5 at block-level NRMSE targets from 10^-6 to 10^-4, LBRC improves compression ratio over GAE by 30-60% and is broadly competitive with SZ. NGLR adds a further 10-40% over LBRC and outperforms SZ in the evaluated high-fidelity regime. These results show that residual representations tailored to learned-compressor residuals can preserve the advantage of learned compression when global residual correction becomes rate-dominant.
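Two of the LBRC stages named above, 3D Lorenzo differencing and zigzag mapping, are standard techniques and can be sketched in a few lines of Python. This is a generic illustration on an already-quantized integer block, not the authors' pipeline: the function names are hypothetical, and the adaptive quantization, bit-plane coding, and entropy coding stages are omitted.

```python
from itertools import product


def zigzag(n):
    """Map signed integers to non-negative ones: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(z):
    return z // 2 if z % 2 == 0 else -(z + 1) // 2


def lorenzo_predict(get, i, j, k):
    """3D Lorenzo predictor from the seven already-visited corner neighbors."""
    return (get(i - 1, j, k) + get(i, j - 1, k) + get(i, j, k - 1)
            - get(i - 1, j - 1, k) - get(i - 1, j, k - 1) - get(i, j - 1, k - 1)
            + get(i - 1, j - 1, k - 1))


def make_getter(block):
    # Out-of-range neighbors are treated as zero.
    return lambda i, j, k: block[i][j][k] if min(i, j, k) >= 0 else 0


def encode(block):
    nx, ny, nz = len(block), len(block[0]), len(block[0][0])
    get = make_getter(block)
    return [zigzag(block[i][j][k] - lorenzo_predict(get, i, j, k))
            for i, j, k in product(range(nx), range(ny), range(nz))]


def decode(codes, shape):
    nx, ny, nz = shape
    out = [[[0] * nz for _ in range(ny)] for _ in range(nx)]
    get = make_getter(out)
    # Raster order guarantees every neighbor used by the predictor is already decoded.
    for code, (i, j, k) in zip(codes, product(range(nx), range(ny), range(nz))):
        out[i][j][k] = unzigzag(code) + lorenzo_predict(get, i, j, k)
    return out


block = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # toy integer residual block
codes = encode(block)
restored = decode(codes, (2, 2, 2))
```

On smooth data most prediction residuals are near zero, so the zigzag codes are small non-negative integers, which is what makes the later bit-plane and entropy coding stages effective.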


2606.05384 2026-06-05 cs.AI cs.CL

Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

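Srimonti Dutta, Akshata Kishore Moharir

Affiliations: WAI USA Research Labs

AI summary: Studies the post-decision manipulability of LLM judges. Although they are highly stable under repeated neutral evaluation, targeted challenges can substantially reverse their verdicts. Proposes the Evaluation Robustness Score (ERS) to quantify interactional robustness.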

Comments: Accepted at ACL 2026 GEM (Generation, Evaluation and Metrics) Workshop

Abstract

LLM-as-judge evaluation is widely used in benchmarking pipelines, where model outputs are compared and ranked using automated evaluators. These pipelines typically assume that judgments are stable properties of fixed inputs. We show that this assumption does not hold under interaction. We study post-decision manipulability: the extent to which an evaluation outcome can be altered through subsequent conversation with the judge after an initial decision has been made. Across controlled experiments on MT-Bench and AlpacaEval, we find that LLM judges are highly stable under repeated and neutral reevaluation, yet become substantially reversible under targeted post-decision challenge. An anti-baseline challenge protocol shows that stable judgments can be overturned through motivated interaction, while a counterbalanced target-validation protocol separates this reversibility from net target-directed steering. These reversals have practical consequences: they can degrade agreement with human preferences, shift benchmark rankings, and produce harmful evaluation changes despite high self-reported confidence. Authority framing is especially destabilizing, and revised judgments are often accompanied by low-overlap justifications, suggesting post hoc rationalization rather than reliable error correction. We introduce the Evaluation Robustness Score (ERS) to quantify interactional robustness by combining reversal susceptibility with counterbalanced directional effects. Our findings identify post-decision interaction as a distinct failure mode for LLM-as-judge evaluation and motivate evaluation protocols that measure not only static agreement, but robustness under challenge.


2606.05382 2026-06-05 cs.AI

Synthetic Contrastive Reasoning for Multi-Table Q&A

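Ankit Pratap Singh, Xin Su, Phillip Howard

Affiliations: Iowa State University; Thoughtworks

AI summary: Addresses the lack of reasoning supervision in multi-table Q&A by generating synthetic contrastive reasoning traces with heterogeneous LLMs and fine-tuning models with Contrastive Preference Optimization, improving results on MMQA by 9.7% to 16.3%.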

Abstract

Multi-table question answering requires models to retrieve relevant evidence, link schemas, and perform compositional reasoning across relational tables. Existing multi-table Q&A resources typically provide questions and final answers but lack reasoning supervision that explains how answers are derived. To address this gap, we construct a synthetic contrastive reasoning-trace dataset for MMQA by generating validated positive traces and plausible negative traces with heterogeneous LLMs. We then use the resulting preference pairs to fine-tune open-weight LLMs with Contrastive Preference Optimization (CPO). Across Qwen3-14B, Mistral-8B, and Llama-3.1-8B, CPO achieves absolute average improvements over Q&A supervised fine-tuning ranging from 9.7%-16.3%, with gains up to 21 percentage points on MMQA. Ablations show that heterogeneous positive and negative trace generators strengthen the contrastive signal, and automated as well as human evaluations indicate that the generated pairs are largely faithful, coherent, and meaningfully contrastive.


2606.05381 2026-06-05 cs.LG

Generalized TV--$\ell_p$ Structured Priors for Bayesian $T_1$ Mapping

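Disi Lin, Martin Berggren, Tommy Löfstedt

Affiliations: Department of Computing Science, Umeå University, Sweden

AI summary: Proposes a family of structured spatial priors that combines total variation (TV) with $\ell_p$ norms, embedded in a Bayesian regression framework with posterior inference via the No-U-Turn Sampler, to enable uncertainty quantification in $T_1$ mapping. Experiments show improved spatial consistency and more reliable estimates.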

DOI: 10.59275/j.melba.2026-g41g
Journal ref: Machine.Learning.for.Biomedical.Imaging. 2026 (2026)
Comments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2026:015

Abstract

We propose an extended family of structured spatial priors that incorporates the total variation (TV) function with $\ell_p$ norms. The prior is proven to be proper and incorporated into a Bayesian regression framework to enable uncertainty quantification in $T_1$ mapping, with posterior inference performed using the No-U-Turn Sampler (NUTS). This TV--$\ell_p$ construction is proven to constitute a well-defined family of prior distributions, and it naturally enforces spatial consistency and smooth variations in the estimated parameter maps. The method was evaluated in comparison to maximum-likelihood estimation and several Bayesian alternative priors based on the uniform, Gamma, and bounded TV priors. The evaluation includes experiments on synthetic brain and cardiac $T_1$ mapping datasets, as well as a real in-vivo breast $T_1$ mapping dataset. The results show that the TV--$\ell_p$ prior yields more concentrated posterior densities, indicating reduced uncertainty. It also consistently achieves lower variance and smaller (negative) bias, leading to more reliable estimates. Overall, embedding a TV-based structured penalty along with $\ell_p$ norms in a prior in a Bayesian model improves spatial coherence in $T_1$ maps and enhances uncertainty quantification, offering a robust approach for $T_1$ mapping with uncertainties.


2606.05378 2026-06-05 cs.LG cs.AI

Pattern Selectivity is Not Task-Causal Structure: A Cross-Architecture Mechanistic Study of Composed-Task Circuits in 1B-Class Language Models

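Yongzhong Xu

AI summary: Applies a unified protocol to test attention-head circuits in three 1B-class language models on four composed tasks, finds that different models implement the same task with different attention patterns, introduces a five-category screen-outcome taxonomy, and proposes a falsifiable hypothesis that the MoE model builds composed-task circuits on a previous-token positional substrate.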

Comments: 27 pages, 3 figures

Abstract

We test whether a single screen-and-ablate recipe -- identify attention-head circuits by task-pattern selectivity, then verify by causal ablation against a matched-random null -- produces consistent mechanistic claims across model families. The recipe ports across pipelines; the specific circuit it identifies does not. Across four composed tasks (indirect-object identification, greater-than, successor sequences, variable binding) and three 1B-class language models from distinct training pipelines (Pythia 1B / Pile / dense; OLMo 1B / DCLM / dense; OLMoE 1B-7B / DCLM / mixture-of-experts), we run a unified protocol with the matched-random null sampled across ten seeds per cell. The resulting 12 (task, model) cells contain no two that share the same primary causal screen at comparable effect size: the same task, with the same behavioral capability, is implemented through different attention-pattern types across models. We introduce a five-category screen-outcome taxonomy -- primary cause, secondary cause, correlate, interferer, null -- with quantitative thresholds, and show that all five outcomes appear in the panel. We propose a falsifiable hypothesis: the MoE model in our panel builds composed-task circuits on top of a foundational previous-token positional substrate (the prev-token-circuit ablation is the strongest causal screen on 3 of 4 tasks for OLMoE 1B-7B), with the IOI exception consistent with IOI being a final-position name-copying task whose structure directly probes a different pattern. The hypothesis comes with explicit predictions for other MoE language models. We frame the methodology honestly: the spectral participation-ratio signal from the companion methodology paper is a general indicator of specialized computation; what makes a finding task-specific is the task-pattern screen plus a per-model causal verification.

2606.05376 2026-06-05 cs.LG

SHALA-LLM: Smartly Handling Ambiguous Labels in Aligning LLMs

SHALA-LLM：在对齐LLM中智能处理模糊标签

Jingyao Wu, Ashley Wang, Keane Ong, Paul Pu Liang, Rosalind Picard

发表机构 * MIT Media Lab, Massachusetts Institute of Technology（麻省理工学院媒体实验室、麻省理工学院）； National University of Singapore（新加坡国立大学）

AI总结提出SHALA-LLM强化学习框架，通过从标注者分布中学习并动态优先处理高模糊样本，改善LLM对模糊标签的建模，在NLI和情感识别任务中提升与标注者分布的一致性及分类性能。

AI中文摘要

许多以人为中心的任务，包括自然语言推理（NLI）和情感识别（ER），具有多种合理的解释，导致标签模糊和不同标注者之间的分歧。随着LLM越来越多地部署在现实场景中，忠实建模这种模糊性对于识别有争议的输入、保留模糊情况下的变异性以及捕捉人类判断的完整分布至关重要。然而，现有的LLM对齐方法主要假设单一正确标签，在优化过程中排除了标注者分歧。我们不将这种模糊性视为噪声，而是展示如何通过一种名为SHALA-LLM（在对齐LLM中智能处理模糊标签）的新算法将其视为改善模型行为的信息。该强化学习框架提供了一种新方式，使LLM能够直接从标注者分布中学习，同时在优化过程中动态优先处理高模糊样本。在包括ChaosNLI、GoEmotions和MSP-Podcast在内的模糊敏感NLI和ER基准上的实验表明，SHALA-LLM改善了与标注者标签分布的一致性，例如在ChaosNLI上，它将Jensen-Shannon距离降低了高达62.1%。同时，SHALA-LLM将F1分数提高了高达16.7%，表明建模标注者分歧也能增强分类性能。

英文摘要

Many human-centered tasks, including natural language inference (NLI) and emotion recognition (ER), have multiple plausible interpretations, leading to label ambiguity and challenging disagreements across human annotators. As LLMs are increasingly deployed in real-world settings, faithfully modeling such ambiguity is essential to identify contested inputs, preserve variability in ambiguous cases, and capture the full distribution of human judgments. Yet, existing LLM alignment approaches have predominantly assumed a single correct label, excluding annotator disagreement during optimization. Instead of treating this ambiguity as noise, we show how to treat it as information that improves model behavior through a new algorithm called SMARTLY HANDLING AMBIGUOUS LABELS IN ALIGNING LLMS (SHALA-LLM). This reinforcement learning framework provides a new way for LLMs to learn directly from annotator distributions while dynamically prioritizing highly ambiguous samples during optimization. Experiments on ambiguity-sensitive NLI and ER benchmarks, including ChaosNLI, GoEmotions, and MSP-Podcast, demonstrate that SHALA-LLM improves agreement with annotator label distributions, e.g. on ChaosNLI, it reduces Jensen-Shannon Distance by up to 62.1%. At the same time, SHALA-LLM improves F1 by up to 16.7%, showing that modeling annotator disagreement can also strengthen classification performance.
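The agreement metric named in the abstract, Jensen-Shannon distance between the annotator label distribution and the model's predicted distribution, can be sketched as follows. This is a minimal illustration of the metric only, not of the SHALA-LLM training procedure; the base-2 logarithm (which bounds the distance in [0, 1]) and the example vote counts are assumptions.

```python
import numpy as np

def js_distance(p, q):
    """Jensen-Shannon distance (square root of the JS divergence, base 2)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return float(np.sqrt(max(divergence, 0.0)))

# Hypothetical NLI item: 100 annotator votes over (entailment, neutral, contradiction).
annotator_votes = [55, 35, 10]
model_probs = [0.90, 0.08, 0.02]
distance = js_distance(annotator_votes, model_probs)
```

A model that collapses onto a single label scores a larger distance on contested items than one that reproduces the spread of human votes.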

2606.05373 2026-06-05 cs.LG physics.bio-ph

Evidence-Guided Neural Architecture Selection under Uncertainty for Subject-Specific Blood Glucose Forecasting

面向个体化血糖预测的不确定性下证据引导神经架构选择

Md Azharul Islam, Dwyer Deighan, Tarunraj Singha, Danial Faghihi

发表机构 * arXiv.org ； cs.LG（计算机学习）

AI总结提出EVIDENT框架，结合贝叶斯训练、证据排序和任务特定验证，在有限、噪声和异构数据中自动选择最优神经架构，用于个体化血糖预测。

AI中文摘要

在有限、噪声和异构数据下的时间序列预测中，可靠的神经架构选择是一个开放挑战，标准的启发式架构设计和验证方法无法确保准确可靠的预测和泛化。我们提出EVIDENT（基于证据的神经架构识别），一个整合贝叶斯训练、基于证据的排序和不确定性下任务特定验证的架构选择框架。该框架探索候选架构池，并识别满足规定验证标准的最低容量模型。我们使用时间卷积网络（TCNs）在1型糖尿病患者的个体化血糖预测中演示了该方法。结果表明，EVIDENT在群体水平糖尿病数据上系统地拒绝了参数不足和过度的TCN架构，同时识别出能可靠泛化到未见患者的模型。当多个架构具有竞争力时，该框架进一步支持基于可信度的集成预测，从而提升预测性能。与随机搜索基线相比，EVIDENT识别出更小的架构，在未见患者上具有更一致的预测性能。这些发现确立了EVIDENT作为一种神经架构发现策略，能够在数据有限和异构环境中实现高风险预测的可靠模型选择。

英文摘要

Reliable neural architecture selection is an open challenge in time-series forecasting under limited, noisy, and heterogeneous data, where standard heuristic architecture design and validation approaches fail to ensure accurate and reliable prediction and generalization. We propose EVIDENT (EVidence-based IDEntification of Neural archiTectures), a framework for architecture selection that integrates Bayesian training, evidence-based ranking, and task-specific validation under uncertainty. The framework explores the candidate architecture pool and identifies the lowest-capacity model that satisfies a prescribed validation criterion. We demonstrate this method using temporal convolutional networks (TCNs) for individualized blood glucose forecasting in type 1 diabetes patients. The results show that EVIDENT systematically rejects both under- and over-parameterized TCN architectures on population-level diabetes data, while identifying models that generalize reliably to unseen patients. When multiple architectures are competitive, the framework further supports plausibility-weighted ensemble predictions that enhance predictive performance. Compared with a random-search baseline, EVIDENT identified smaller architectures with more consistent forecasting performance on unseen patients. These findings establish EVIDENT as a strategy for neural architecture discovery, enabling reliable model selection for high-consequence forecasting in data-limited and heterogeneous settings.
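One plausible reading of "plausibility-weighted ensemble predictions" is standard Bayesian model averaging: each architecture's weight is proportional to its model evidence. The sketch below assumes a uniform prior over architectures and uses invented log-evidence values and forecasts; the abstract does not specify how EVIDENT computes either quantity.

```python
import numpy as np

def plausibility_weights(log_evidence):
    """Posterior model plausibilities from log evidences (uniform prior assumed)."""
    log_evidence = np.asarray(log_evidence, dtype=float)
    shifted = log_evidence - log_evidence.max()  # avoid overflow in exp
    weights = np.exp(shifted)
    return weights / weights.sum()

def ensemble_forecast(predictions, log_evidence):
    """Weighted average of per-model forecasts, shape (n_models, horizon)."""
    return plausibility_weights(log_evidence) @ np.asarray(predictions, dtype=float)

# Hypothetical numbers: three candidate architectures, two-step glucose forecast (mg/dL).
log_ev = [-120.0, -118.5, -131.0]
preds = [[110.0, 115.0], [112.0, 118.0], [95.0, 99.0]]
weights = plausibility_weights(log_ev)
forecast = ensemble_forecast(preds, log_ev)
```

With these numbers the third architecture, whose evidence is far lower, contributes almost nothing to the ensemble.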

2606.05372 2026-06-05 cs.RO cs.CG

Efficient Computation of Distance Functions for Navigation Vector Fields in Lie Groups

李群中导航向量场距离函数的高效计算

Vinicius M. Gonçalves, João Baião, Felipe Bartelt, Douglas G. Macharet, Gustavo M. Freitas, Héctor Azpúrua, Luciano C. A. Pimenta

发表机构 * University of São Paulo（圣保罗大学）

AI总结针对李群中基于向量场的路径跟踪问题，提出一种利用G-多项式曲线结构将距离计算简化为多项式求根的高效方法，显著降低计算时间并保持精度。

AI中文摘要

基于向量场的方法被广泛用于机器人控制，并常应用于路径跟踪问题。一些向量场方法需要重复计算机器人配置与曲线之间的距离以及相应的最近点。最近，向量场已被扩展到李群。在这种情况下，这种计算可能非常昂贵，尤其是在嵌入式平台上以高控制频率执行时。本文提出了一种高效计算点与曲线之间距离的方法，该曲线表示为所谓的G-多项式曲线，这是一种将多项式曲线推广到矩阵李群的曲线表示。所提出的方法利用这些曲线的结构，将问题简化为少量多项式求根计算。仿真结果表明，与现有的基于优化的方法相比，该方法在保持精度的同时显著减少了计算时间。还提供了SE(3)群情况下的实用公式，并在机器人机械臂上进行了实验验证。该方法已在一个计算包中实现，可在线获取。

英文摘要

Vector-field-based methods are widely used for robot control and are often applied to the path-tracking problem. Some vector field approaches require repeatedly computing the distance between the robot configuration and the curve, as well as the corresponding closest point. Recently, vector fields have been extended to Lie Groups. In this case, this computation can be expensive, especially when performed at high control frequencies on embedded platforms. This paper proposes a method for efficiently computing the distance between a point and a curve represented as what is called a G-polynomial curve, which is a curve representation that generalizes polynomial curves to matrix Lie groups. The proposed approach exploits the structure of these curves to reduce the problem to a small number of polynomial root-finding computations. Simulation results show that the method significantly reduces computation time while maintaining accuracy compared to existing optimization-based approaches. Practical formulas are also provided for the case of the group SE(3), and the method is validated experimentally on a robotic manipulator. The methodology is implemented in a computational package, available online.
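The idea of reducing a point-to-curve distance query to polynomial root-finding can be illustrated in the ordinary Euclidean case, which is only an analogue of the Lie-group setting in the paper. For a polynomial curve c(t), the stationary points of the squared distance satisfy (c(t) - p) · c'(t) = 0, which is itself a polynomial in t. The function name and the parameter interval are assumptions for this sketch.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def closest_point_on_curve(coeffs, point, t_min=0.0, t_max=1.0):
    """coeffs[k] is the vector coefficient of t**k; returns (t*, distance)."""
    coeffs = np.asarray(coeffs, dtype=float)
    point = np.asarray(point, dtype=float)
    shifted = coeffs.copy()
    shifted[0] -= point  # coefficients of c(t) - p
    # Build the polynomial (c(t) - p) . c'(t), summing over coordinates.
    stationarity = np.zeros(1)
    for d in range(coeffs.shape[1]):
        component = shifted[:, d]
        stationarity = P.polyadd(
            stationarity, P.polymul(component, P.polyder(component))
        )
    roots = P.polyroots(stationarity)
    # Candidates: interval endpoints plus real roots inside the interval.
    candidates = [t_min, t_max]
    candidates += [
        float(np.real(r)) for r in roots
        if abs(np.imag(r)) < 1e-9 and t_min <= np.real(r) <= t_max
    ]
    distances = [np.linalg.norm(P.polyval(t, coeffs) - point) for t in candidates]
    best = int(np.argmin(distances))
    return candidates[best], float(distances[best])

# Straight segment c(t) = (t, 0) and the query point (0.3, 2).
t_star, dist = closest_point_on_curve([[0.0, 0.0], [1.0, 0.0]], [0.3, 2.0])
```

No iterative optimizer is involved: one root-finding call plus a handful of evaluations yields the global minimizer on the interval.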

2606.05371 2026-06-05 cs.LG cs.NA math.NA stat.ML

Mamba-Assisted Non-Markovian Closure for Reduced-Order Modeling

Mamba辅助的非马尔可夫闭合用于降阶建模

Zhi-Feng Wei, Saad Qadeer, Panos Stinis

发表机构 * Pacific Northwest National Laboratory（太平洋西北国家实验室）； University of Washington（华盛顿大学）； Brown University（布朗大学）

AI总结针对高维动力系统降阶建模中的非马尔可夫闭合项问题，提出Mamba辅助闭合框架，利用Mamba序列模型从已解析轨迹预测闭合项，并通过数值积分器耦合降阶方程，在粘性Burgers方程和混沌双尺度Lorenz '96系统上优于马尔可夫模型、GRU序列模型和Wilks方法。

Comments: Code will be released upon acceptance

AI中文摘要

高维动力系统的降阶建模常常受到非马尔可夫闭合项的阻碍，该闭合项表示未解析变量对解析动力学的影响。受Mori--Zwanzig形式论的启发，其中闭合项采取解析轨迹的记忆泛函形式，我们将闭合建模重新表述为序列建模问题，并提出Mamba辅助闭合（MAC）框架：一个基于Mamba的序列模型，经过训练从解析轨迹预测闭合项，通过数值积分器与降阶控制方程耦合，以在时间上推进解析变量。该框架的一个关键特性是利用状态空间模型的双重表示——模型通过卷积形式以序列到序列的方式进行训练，并通过循环形式进行逐步自回归部署，从而实现高效的长轨迹训练和恒定的每步推理成本。在粘性Burgers方程和混沌双尺度Lorenz '96系统上，MAC模型在预测准确性和长时间展开稳定性方面显著优于马尔可夫降阶模型、基于GRU的序列模型和Wilks方法。

英文摘要

Reduced-order modeling of high-dimensional dynamical systems is often hindered by the non-Markovian closure term that represents the effect of unresolved variables on the resolved dynamics. Inspired by the Mori--Zwanzig formalism, in which the closure takes the form of a memory functional of the resolved trajectory, we recast closure modeling as a sequence modeling problem and propose the Mamba-Assisted Closure (MAC) framework: a Mamba-based sequence model, trained to predict the closure from the resolved trajectory, is coupled with the reduced-order governing equations through a numerical integrator to advance the resolved variables in time. A key feature of the framework is its exploitation of the dual representation of state-space models -- the model is trained in a sequence-to-sequence fashion via the convolutional form, and deployed for step-by-step autoregressive rollout via the recurrent form, yielding both efficient long-trajectory training and constant per-step inference cost. On the viscous Burgers' equation and the chaotic two-scale Lorenz '96 system, the MAC model substantially outperforms the Markovian reduced-order model, the GRU-based sequence model, and the Wilks method in predictive accuracy and long-time rollout stability.
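The "dual representation" mentioned in the abstract can be checked on a small time-invariant linear state-space model: unrolling the recurrence and convolving the input with the kernel K_k = C A^k B give the same output. This sketch is a plain linear SSM with made-up matrices; it is not the selective Mamba layer or the MAC closure model.

```python
import numpy as np

def ssm_recurrent(A, B, C, u):
    """Step-by-step form: h_t = A h_{t-1} + B u_t, y_t = C h_t."""
    h = np.zeros(A.shape[0])
    outputs = []
    for u_t in u:
        h = A @ h + B * u_t
        outputs.append(C @ h)
    return np.array(outputs)

def ssm_convolutional(A, B, C, u):
    """Whole-sequence form: y = K * u with kernel K_k = C A^k B."""
    length = len(u)
    kernel = np.empty(length)
    a_power_b = B.copy()
    for k in range(length):
        kernel[k] = C @ a_power_b
        a_power_b = A @ a_power_b
    return np.convolve(u, kernel)[:length]

# Hypothetical stable system with a 3-dimensional state and scalar input/output.
A = np.diag([0.9, 0.5, -0.3])
B = np.array([1.0, 0.5, -1.0])
C = np.array([0.2, -0.4, 1.0])
u = np.sin(np.linspace(0.0, 3.0, 16))
y_rec = ssm_recurrent(A, B, C, u)
y_conv = ssm_convolutional(A, B, C, u)
```

The convolutional form processes a whole trajectory at once, which suits training, while the recurrent form needs only the current state per step, which suits rollout.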

2606.05368 2026-06-05 cs.CV

Biomazon: A Multimodal Dataset for 3D Forest Structure and Biomass Modeling in the Amazon Basin

Biomazon：亚马逊盆地三维森林结构与生物量建模的多模态数据集

Sayan Mandal, Rocco Sedona, Simon Besnard, Mikhail Urbazaev, Morris Riedel, Ehsan Zandi, Gabriele Cavallaro

发表机构 * Jülich Supercomputing Centre (JSC), Forschungszentrum Jülich（于利希超级计算中心（JSC），于利希研究中心）； School of Engineering and Natural Sciences (SENS), University of Iceland（冰岛大学工程与自然科学学院（SENS））； Global Land Monitoring Group, GFZ Helmholtz Centre for Geosciences（GFZ亥姆霍兹地球科学中心全球土地监测组）

AI总结针对现有方法未将森林垂直结构作为有序轮廓学习的问题，提出Biomazon多模态基准数据集，结合GEDI RH和AGBD目标与多传感器预测因子，通过共享编码器-解码器框架进行消融研究，为热带森林结构一致RH轮廓预测和结构-生物量建模建立参考基准。

Comments: 32 pages, 21 figures

AI中文摘要

准确、空间明确的描述热带森林结构对于碳核算和生态系统监测至关重要，然而大多数机器学习流程预测冠层顶部高度代理（例如RH95/RH98）或AGBD作为单独的标量目标，而不是将森林垂直结构作为有序轮廓学习。社区缺乏一个ML就绪的多模态基准，用于联合预测整个GEDI RH轮廓与AGBD，或评估强制RH百分位数之间物理一致排序的方法。我们通过Biomazon解决了这一问题，这是一个覆盖亚马逊盆地的20米多模态基准数据集，在标准化的空间划分和评估协议下，将GEDI RH和AGBD目标与多传感器预测因子（Sentinel-1/2、ALOS-2 PALSAR-2、Copernicus DEM、Dynamic World LULC和AlphaEarth嵌入）配对。使用共享编码器-解码器与任务特定头作为基线框架，我们对（i）骨干/模型规模、（ii）模态贡献以及（iii）在独立和融合设置下使用辅助嵌入进行了全面的消融研究，并报告了单目标和联合目标结果，以量化统一训练协议下的权衡。最后，我们通过与现有网格化产品（包括GEDI L4D RH10-RH98和AGBD）在匹配时间尺度上的区域对齐比较，将基线性能置于背景中。Biomazon连同随附的协议和基线结果，为未来热带森林中结构一致的RH轮廓预测和结构-生物量建模工作建立了参考基准。

英文摘要

Accurate, spatially explicit characterization of tropical forest structure is essential for carbon accounting and ecosystem monitoring, yet most ML pipelines predict canopy-top height proxies (e.g., RH95/RH98) or AGBD as separate scalar targets, rather than learning the forest vertical structure as an ordered profile. The community lacks an ML-ready multimodal benchmark for predicting the entire GEDI RH profile jointly with AGBD, or for evaluating methods that enforce physically consistent ordering across RH percentiles. We address this with Biomazon, a 20 m multimodal benchmark dataset over the Amazon Basin that pairs GEDI RH and AGBD targets with multi-sensor predictors (Sentinel-1/2, ALOS-2 PALSAR-2, Copernicus DEM, Dynamic World LULC, and AlphaEarth embeddings) under standardized spatial splits and evaluation protocols. Using a shared encoder-decoder with task-specific heads as a baseline framework, we conduct a comprehensive ablation study of (i) backbone/model scale, (ii) modality contributions, and (iii) the use of auxiliary embeddings under standalone and fusion settings, and we report both single-target and joint-target results to quantify tradeoffs under a unified training protocol. Finally, we contextualize baseline performance through regionally aligned comparisons against existing gridded products, including GEDI L4D RH10-RH98 and AGBD, at matching temporal scale. Biomazon, together with the accompanying protocols and baseline results, establishes a reference benchmark for future work on structurally consistent RH-profile prediction and structure-biomass modeling in tropical forests.

2606.05367 2026-06-05 cs.SD eess.AS

Task-Vector Arithmetic for Emotional Expressivity Control in Language-Model-Based Text-to-Speech

基于任务向量算术的语言模型文本到语音情感表达控制

Daniel Oliveira de Brito, Arnaldo Candido Junior

发表机构 * Instituto de Biociências, Letras e Ciências Exatas, Universidade Estadual Paulista "Júlio de Mesquita Filho" (UNESP)（圣保罗州立大学（UNESP）生物科学、文学与精确科学学院）

AI总结本文通过系统消融实验定位情感韵律的主要载体为x-vector，并提出一种基于x-vector质心算术的无训练方法，实现跨说话人情感强度控制，在保留身份和可懂度的同时提升情感相似度。

Comments: 10 pages, 5 figures

AI中文摘要

我们研究了任务向量算术（在模块化文本到语音（TTS）中成功用于跨说话人情感强度控制）是否能够迁移到基于语言模型骨干和上下文学习（LM-TTS）构建的大规模TTS系统。通过在Qwen3-TTS-12Hz-1.7B上对四个逐渐缩小的操作数——通过LoRA微调的模型权重、连续编解码器嵌入、离散编解码器标记以及由ECAPA-TDNN编码器（与合成骨干联合训练）生成的说话人嵌入（x-vector）——进行系统消融研究，我们将情感韵律的主要载体定位到x-vector。基于这一发现，我们提出了一种基于x-vector空间质心算术的无训练方法：情感方向τ = E_i[x(s_i, emo)] - E_i[x(s_i, neutral)]，应用于未见过的目标说话人：x_new = x(target, neutral) + α·τ。使用ESD（英语）作为τ源，emoUERJ（巴西葡萄牙语）作为跨语言真实目标，我们观察到在英语保留说话人上，emotion2vec余弦相似度比ICL基线平均提升0.29，在巴西葡萄牙语保留说话人上提升0.09，同时很大程度上保留了身份（多说话人τ变体的WavLM SECS ≳ 0.88）和可懂度（PT-BR中WER ≈ 0）。这些结果初步表明，当算术操作作用于说话人嵌入时，先前报道的质心算术风格控制与基于标记的TTS架构不兼容的问题或许可以规避。

英文摘要

We investigate whether task-vector arithmetic, successful for cross-speaker emotional intensity control in modular text-to-speech (TTS), transfers to large-scale TTS systems built on language-model backbones with in-context learning (LM-TTS). Through a systematic elimination study over four progressively narrower operands on Qwen3-TTS-12Hz-1.7B -- model weights via LoRA fine-tuning, continuous codec embeddings, discrete codec tokens, and the speaker embedding (x-vector) produced by an ECAPA-TDNN encoder jointly trained with the synthesis backbone -- we localize the dominant carrier of emotional prosody to the x-vector. Building on this finding, we propose a training-free method based on centroid arithmetic in x-vector space: an emotion direction $\tau = \mathbb{E}_i[x(s_i,\text{emo})] - \mathbb{E}_i[x(s_i,\text{neutral})]$ applied to an unseen target speaker as $x_{\text{new}} = x(\text{target},\text{neutral}) + \alpha\cdot\tau$. Using ESD (English) as the $\tau$ source and emoUERJ (Brazilian Portuguese) as a cross-lingual ground-truth target, we observe average gains of $+0.29$ in emotion2vec cosine over the ICL baseline on English held-out speakers and $+0.09$ on Brazilian Portuguese held-out speakers, while largely preserving identity (WavLM SECS $\gtrsim 0.88$ for the multi-speaker $\tau$ variant) and intelligibility (WER $\approx 0$ in PT-BR). These results offer initial evidence that the reported incompatibility of centroid-arithmetic style control with token-based TTS architectures may be circumvented when the arithmetic operates on the speaker embedding.
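The centroid arithmetic in the abstract translates almost directly into array operations. The sketch below uses random vectors in place of real x-vectors, and the embedding dimension, speaker count, and value of α are all invented; how the embeddings are extracted and fed back into the TTS model is outside its scope.

```python
import numpy as np

def emotion_direction(emotional_embeddings, neutral_embeddings):
    """tau = mean over source speakers of x(s_i, emo) minus mean of x(s_i, neutral)."""
    return emotional_embeddings.mean(axis=0) - neutral_embeddings.mean(axis=0)

def steer_embedding(target_neutral, tau, alpha):
    """x_new = x(target, neutral) + alpha * tau."""
    return target_neutral + alpha * tau

# Placeholder data: 8 source speakers, 16-dimensional embeddings (both hypothetical).
rng = np.random.default_rng(0)
neutral = rng.standard_normal((8, 16))
emotional = neutral + 0.5  # pretend the emotion shifts every coordinate by 0.5
tau = emotion_direction(emotional, neutral)

target = rng.standard_normal(16)  # unseen target speaker, neutral style
x_new = steer_embedding(target, tau, alpha=1.5)
```

Because τ is averaged over several source speakers, speaker-specific components tend to cancel, which is consistent with the abstract's report that the multi-speaker variant largely preserves identity.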

2606.05359 2026-06-05 cs.CV

Recovering Physically Plausible Human-Object Interactions from Monocular Videos

从单目视频中恢复物理上可信的人-物交互

Dingbang Huang, Etienne Vouga, Qixing Huang, Georgios Pavlakos

发表机构 * University of Texas at Austin（德克萨斯大学奥斯汀分校）； Shanghai Jiao Tong University（上海交通大学）

AI总结提出RePHO方法，通过物理引导的重建框架和强化学习策略，从单目视频中恢复物理上可信的人-物交互，解决了现有方法中的穿透和物体漂浮问题。

Comments: CVPR 2026. Project Page: https://dingbang777.github.io/RePHO/

AI中文摘要

在本文中，我们提出了RePHO，一种从单目视频中重建物理上可信的人-物交互（HOI）的方法。现有的基于运动学的方法虽然能产生视觉上合理的运动，但常常导致物理上不合理的伪影，如相互穿透和物体漂浮。为了克服这些问题，我们引入了一个物理引导的重建框架。我们从运动学估计开始，然后通过强化学习（RL）训练一个策略来细化它。该策略被优化以在物理模拟器中重现交互。由于运动学估计通常带有噪声，简单的RL训练可能会失败。因此，我们提出了一种自适应采样策略，具有双重自我更新机制，可以识别具有最丰富信息和最可靠运动学重建的帧。我们的过程逐步提高重建质量，并产生物理一致的HOI序列。我们在两个标准的HOI基准上展示了我们的方法，并在物理合理性指标上取得了比现有方法明显的改进。项目页面：https://dingbang777.github.io/RePHO/

英文摘要

In this paper, we propose RePHO, a method to reconstruct physically plausible human-object interactions (HOI) from monocular videos. While existing kinematic-based approaches produce visually plausible motion, they often result in physically implausible artifacts such as interpenetration and object floating. To overcome these issues, we introduce a physics-guided reconstruction framework. We begin with a kinematic estimate and then refine it by training a policy with reinforcement learning (RL). This policy is optimized to reproduce the interaction in a physics simulator. Because kinematic estimates are typically noisy, naive RL training can fail. Therefore, we propose an adaptive sampling strategy with a dual self-updating mechanism that can identify the frames with the most informative and reliable kinematic reconstruction. Our process progressively improves reconstruction quality and yields physically consistent HOI sequences. We demonstrate our approach on two standard HOI benchmarks and achieve clear improvements in physical plausibility metrics over state-of-the-art methods. Project Page: https://dingbang777.github.io/RePHO/

2606.05354 2026-06-05 cs.CV

LightVesselNet: An Ultra-Lightweight Sub-100K Parameter Network for Retinal Blood Vessel Segmentation

LightVesselNet：用于视网膜血管分割的超轻量级亚10万参数网络

Shadman Sobhan, Farhana Jalil

发表机构 * Department of Electrical & Electronic Engineering, Bangladesh University of Engineering and Technology (BUET)（孟加拉国工程技术大学电气与电子工程系）

AI总结提出LightVesselNet，一种仅75K参数的紧凑编码器-解码器网络，结合通道与空间注意力、多尺度特征聚合和亚像素上采样，在五个公开数据集上实现与大型模型相当的视网膜血管分割性能，适用于资源受限的临床环境。

AI中文摘要

视网膜血管分割在糖尿病视网膜病变和青光眼的早期检测中起着至关重要的作用。虽然最近的深度学习模型取得了很高的分割精度，但它们通常需要大量的计算资源，使得在边缘设备上的实际部署变得困难。在本文中，我们提出了LightVesselNet，一种专为资源受限环境中的视网膜血管分割设计的高效神经网络。尽管仅包含75K参数，LightVesselNet的性能与更大的模型相比具有竞争力。该网络采用紧凑的编码器-解码器架构，并增强了通道和空间注意力机制、瓶颈处的多尺度特征聚合模块以及解码器中的亚像素上采样策略。专用的边缘残差连接在整个解码过程中保留了精细的血管细节。在五个公开数据集：DRIVE、STARE、CHASEDB1、FIVES和HRF上进行的大量实验，分别获得了0.8189、0.8499、0.8640、0.8634、0.8096的灵敏度分数和0.8070、0.8072、0.8181、0.8649、0.7686的Dice系数。与最先进模型相比，LightVesselNet显示出更高的效率（性能与参数或GFlops之比）。跨数据集评估证实了模型的泛化能力。总体而言，LightVesselNet是低资源临床环境和移动筛查工具中部署的有力候选者。

英文摘要

Retinal blood vessel segmentation plays a vital role in the early detection of diabetic retinopathy and glaucoma. While recent deep learning models have achieved high segmentation accuracy, they typically require heavy computational resources, making real-world deployment on edge devices difficult. In this paper, we propose LightVesselNet, an efficient neural network designed for retinal vessel segmentation in resource-constrained environments. Despite containing only 75K parameters, LightVesselNet performs competitively with much larger models. The network employs a compact encoder-decoder architecture enhanced with channel and spatial attention mechanisms, a multi-scale feature aggregation module at the bottleneck, and a subpixel upsampling strategy in the decoder. A dedicated edge residual connection preserves fine vessel detail throughout decoding. Extensive experiments on five publicly available datasets (DRIVE, STARE, CHASEDB1, FIVES, and HRF) yield sensitivity scores of 0.8189, 0.8499, 0.8640, 0.8634, and 0.8096 and Dice coefficients of 0.8070, 0.8072, 0.8181, 0.8649, and 0.7686, respectively. LightVesselNet shows improved efficiency (performance relative to parameter count or GFLOPs) compared to state-of-the-art models. Cross-dataset evaluation confirms the model's generalisation capability. Overall, LightVesselNet is a strong candidate for deployment in low-resource clinical settings and mobile screening tools.
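For readers less familiar with the two reported metrics, the following sketch computes sensitivity and the Dice coefficient from binary vessel masks. The tiny masks are invented for illustration, and returning 1.0 when both masks are empty is a convention chosen here, not something taken from the paper.

```python
import numpy as np

def dice_and_sensitivity(pred_mask, true_mask):
    """Return (Dice, sensitivity) for two binary masks of equal shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    tp = np.logical_and(pred, true).sum()
    fp = np.logical_and(pred, ~true).sum()
    fn = np.logical_and(~pred, true).sum()
    # Convention (assumed): an empty prediction of an empty target counts as perfect.
    dice = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) > 0 else 1.0
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 1.0
    return float(dice), float(sensitivity)

# 2x2 toy masks: one true positive, one false positive, one false negative.
predicted = [[1, 1], [0, 0]]
ground_truth = [[1, 0], [1, 0]]
dice, sensitivity = dice_and_sensitivity(predicted, ground_truth)
```

Sensitivity ignores false positives, so a model can raise it by over-segmenting; Dice penalizes both error types, which is why the abstract reports the two together.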

2606.05347 2026-06-05 cs.CV

TopoPult-SSL: Gland-Mask-Free Cross-Device Meibomian Gland Segmentation via Self-Distilled Weak Clinical Priors

TopoPult-SSL: 通过自蒸馏弱临床先验实现无腺体掩膜的跨设备睑板腺分割

Nicolò Savioli, Luca Del Tongo

发表机构 * OdaxAI S.R.L.（OdaxAI公司）； Topcon Group — VISIA Imaging S.R.L.（Topcon集团——VISIA成像公司）

AI总结提出TopoPult-SSL两阶段框架，利用眼睑掩膜和临床元数据作为弱先验，通过自蒸馏实现跨设备睑板腺分割，无需目标腺体掩膜即可达到高精度。

Comments: 13 pages, 4 figures, 5 tables

AI中文摘要

每一种新的临床成像设备都会造成域偏移，其中密集的腺体掩膜成本高昂，而廉价的临床信号——眼睑轮廓、Pult分级、形态测量比率——则被常规记录。我们提出TopoPult-SSL，一个用于跨设备睑板腺分割的两阶段框架。第一阶段在训练损失中不使用目标腺体掩膜，仅通过目标眼睑掩膜和临床元数据驱动的四个弱先验锚点来适应源域训练模型。第二阶段，当目标腺体掩膜可用时，通过监督自蒸馏将互补的第一阶段教师模型蒸馏成一个紧凑的学生模型。我们在公共MGD-1k到CAMG研究基准（1000到100张图像，不同设备）上开发并验证了该技术，蒸馏模型达到Dice 0.716±0.006（最佳0.726），单次推理超越UA-MT（0.710）和集成教师（0.720）。无腺体掩膜的第一阶段变体达到精确度0.694，而SAM/MedSAM为0.30-0.34（p<0.001），使得无需密集腺体轮廓即可部署。代码和可复现脚本已发布。

英文摘要

Every new clinical imaging device creates a domain shift where dense gland masks are expensive yet cheap clinical signals -- eyelid outlines, Pult grades, morphometric ratios -- are routinely recorded. We present TopoPult-SSL, a two-stage framework for cross-device meibomian gland segmentation. Stage 1 adapts a source-trained model without target gland masks in the training loss, using four weak-prior anchors driven by target eyelid masks and clinical metadata only. Stage 2, when target gland masks are available, distils complementary Stage-1 teachers into a single compact student via supervised self-distillation. We develop and validate the technique on the public MGD-1k to CAMG research benchmark (1,000 to 100 images, different device), where the distilled model achieves Dice 0.716+/-0.006 (best 0.726), surpassing UA-MT (0.710) and the ensemble teacher (0.720) -- with a single pass. The gland-mask-free Stage-1 variant reaches Precision 0.694 vs. 0.30-0.34 for SAM/MedSAM (p<0.001), enabling deployment without dense gland contouring. Code and reproducibility scripts are released.

2606.05346 2026-06-05 cs.CL

Trajectory Dynamics in Language Model Hidden States Predict Human Processing Costs Beyond Surprisal

语言模型隐藏状态中的轨迹动力学预测超越惊讶度的人类处理成本

Elan Barenholtz

发表机构 * Machine Perception & Cognitive Robotics Laboratory（机器感知与认知机器人实验室）； Department of Psychology（心理学系）； Center for Complex Systems（复杂系统中心）； Florida Atlantic University（佛罗里达大西洋大学）

AI总结通过线性外推语言模型隐藏状态轨迹的偏差，提出轨迹外推误差作为独立于惊讶度的人类处理成本预测因子，并在自然故事语料库中验证其对自定步速阅读时间的预测能力。

Comments: 17 pages, 3 figures, 6 tables

AI中文摘要

人类语言理解是顺序进行的：每个词在其前文语境中被处理，解释随时间逐步构建。惊讶度（给定语境下词的对数概率的负值）一直是增量处理成本的主要预测因子。但惊讶度将丰富的序列表示简化为每个词处的单个标量，丢弃了解释演化方向的信息。动力系统方法表明，演化解释状态的轨迹（而不仅仅是每个时刻的位置）应塑造处理过程，语言本身可能具有局部动量，因为说话者一次计划几个词。我们引入轨迹外推误差：在每个词处，我们拟合一条线性轨迹到变换器语言模型的前面隐藏状态，并测量与外推路径的偏差。在自然故事语料库上，该度量几乎与惊讶度正交（r = .044），并独立预测自定步速阅读时间。该效应在花园路径句子中尤为显著，随模型规模（GPT-2 Small到Large）增强，并在具有不同位置编码方案（GPT-2 vs. Pythia/RoPE）的架构中复现。位移控制显示该效应不能简化为表示变化幅度：位移和外推误差以相反方向预测。这些发现揭示了处理成本的两个可分离成分：词级预测误差（惊讶度）和对展开解释的局部动量（轨迹外推误差）的敏感性。

英文摘要

Human language comprehension unfolds sequentially: each word is processed in the context of those that came before, and the interpretation builds incrementally over time. Surprisal, the negative log probability of a word given its context, has been the dominant predictor of incremental processing cost. But surprisal reduces rich sequential representations to a single scalar at each word, discarding information about the direction in which the interpretation has been evolving. Dynamical-systems approaches suggest that the trajectory of the evolving interpretive state, not just its position at each moment, should shape processing, and language itself may have local momentum, since speakers plan utterances a few words at a time. We introduce trajectory extrapolation error: at each word, we fit a linear trajectory to the preceding hidden states of a transformer language model and measure deviation from the extrapolated path. On the Natural Stories corpus, this measure is nearly orthogonal to surprisal (r = .044) and independently predicts self-paced reading times. The effect is especially pronounced in garden-path sentences, strengthens with model scale (GPT-2 Small to Large), and replicates across architectures with different positional encoding schemes (GPT-2 vs. Pythia/RoPE). A displacement control shows the effect is not reducible to representational change magnitude: displacement and extrapolation error predict in opposite directions. These findings reveal two dissociable components of processing cost: word-level prediction error (surprisal) and sensitivity to the local momentum of the unfolding interpretation (trajectory extrapolation error).
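The measure described above (fit a line to the preceding hidden states, extrapolate one step, measure the deviation) can be sketched as below. The window length, the least-squares fit per dimension, and the Euclidean norm are assumptions; the abstract does not say which layer, window, or distance the author uses, and the hidden states here are toy values rather than transformer activations.

```python
import numpy as np

def extrapolation_errors(hidden_states, window=3):
    """Deviation of each state from a linear extrapolation of the previous `window` states."""
    hidden = np.asarray(hidden_states, dtype=float)
    n_tokens = hidden.shape[0]
    errors = np.full(n_tokens, np.nan)  # undefined for the first `window` tokens
    steps = np.arange(window, dtype=float)
    design = np.stack([np.ones(window), steps], axis=1)  # intercept + slope
    for t in range(window, n_tokens):
        past = hidden[t - window:t]
        coef, *_ = np.linalg.lstsq(design, past, rcond=None)
        predicted = coef[0] + coef[1] * window  # one step past the window
        errors[t] = np.linalg.norm(hidden[t] - predicted)
    return errors

# Toy 1-D "hidden states": steady drift, then a jump at the last token.
states = [[0.0], [1.0], [2.0], [3.0], [7.0]]
errors = extrapolation_errors(states, window=3)
```

In this toy series the fourth state continues the trend and scores zero, while the final jump scores 3, even though a displacement-only measure would also register the ordinary steps.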

2606.05345 2026-06-05 cs.LG

PJ-RoPE: A Fourier-Jet-Affine Position Space for Relative Attention

PJ-RoPE：一种用于相对注意力的傅里叶-喷气-仿射位置空间

Yaobo Zhang

发表机构 * School of Physics, Ningxia University（宁夏大学物理学院）

AI总结本文提出PJ-RoPE，一种统一RoPE、Jordan-RoPE和ALiBi的傅里叶-喷气-仿射相对位置空间，通过可学习参数适应不同任务，并引入自适应扇区诊断和LC/快度坐标稳定高阶喷气。

Comments: 26 pages, 6 figures, 10 tables. Code available at https://github.com/ybzhang-nxu/Poincare_Rope

AI中文摘要

我们将RoPE的傅里叶相位、Jordan-RoPE的有限喷气和ALiBi的仿射近因统一到一个单一的可学习相对位置空间中，并研究不同任务选择该空间的哪些区域。PJ-RoPE是一种用于相对注意力的傅里叶-喷气-仿射公式，可选地具有庞加莱型解读，作为齐次傅里叶-喷气位置表示的仿射完备化。代数上，相同的基本元素构成一个有限常系数差分模：延迟移位算子的简单根给出傅里叶/RoPE特征，重复的非零根给出乔丹/傅里叶喷气，重复的单位根给出类似ALiBi的仿射近因。该框架将标量PJ偏置核与精确的PJ旋转特征变换分离，引入自适应扇区诊断，并使用LC/快度坐标稳定高阶喷气。受控探针验证了扇区包含和选择；小型语言运行暴露了仿射/近因边界；音乐令牌流提供了最清晰的情况，其中LC/仿射变体保持强劲，同时携带可测量的高阶修正；LC诊断显示尺度稳定性增益伴随相位分辨率损失。

英文摘要

We unify RoPE's Fourier phase, Jordan-RoPE's finite jets, and ALiBi's affine recency into a single learnable relative-position space, and study which regions of this space are selected by different tasks. PJ-RoPE is a Fourier-Jet-Affine formulation for relative attention, with an optional Poincare-type reading as the affine completion of a homogeneous Fourier-jet positional representation. Algebraically, the same primitives form a finite constant-coefficient difference module: simple roots of the lag-shift operator give Fourier/RoPE characters, repeated nonzero roots give Jordan/Fourier jets, and the repeated unit root gives ALiBi-like affine recency. The framework separates scalar PJ-bias kernels from exact PJ-rotary feature transforms, introduces adaptive sector diagnostics, and uses LC/rapidity coordinates to stabilize high-order jets. Controlled probes verify sector containment and selection; small language runs expose an affine/recency boundary; music-token streams provide the clearest case where LC/affine variants remain strong while carrying measurable high-order corrections; and LC diagnostics show a scale-stability gain coupled to phase-resolution loss.

2606.05336 2026-06-05 cs.CL

Self-supervised User Profile Generation for Personalization

面向个性化的自监督用户画像生成

Clark Mingxuan Ju, Yuwei Qiu, Tong Zhao, Neil Shah

发表机构 * Snap Inc.（Snap公司），Bellevue, WA, USA（美国华盛顿州贝尔维尤）

AI总结提出BUMP框架，利用自监督双向排序目标训练大语言模型生成用户文本画像，无需下游标注即可实现个性化。

AI中文摘要

随着大语言模型（LLM）被部署到推荐、搜索、对话和内容生成等场景——在这些场景中，相同的查询应针对不同用户给出不同答案——个性化LLM已成为核心挑战。一个有前景的方法是将每个用户的交互历史总结为自然语言记忆或画像，并将其前置到提示中以便于个性化。现有方法使用来自标注下游任务的显式奖励来学习此类画像生成器，但这种方法成本高昂且稀疏，因为需要为每个目标任务提供标注监督。鉴于这一挑战，我们引入了通过画像的双向用户建模（BUMP），这是一个自监督框架，无需任何下游标签即可训练画像生成器。具体来说，给定用户的交互历史，我们使用GRPO训练LLM在双向批次内排序目标下生成自由形式的文本画像：一个小型LLM评判器衡量（i）生成的画像作为查询时，在批次中将用户自己的保留交互排在其他用户交互之上的程度，以及（ii）一个保留交互作为查询时，在批次中将用户自己的画像排在其他用户画像之上的程度。两个方向均使用多正例NDCG评分，并合并为每次生成的密集奖励；批次中的其他用户提供免费负例，因此每个训练样本仅从原始交互日志中获得监督。在LaMP基准测试上，BUMP匹配或超越了依赖标注奖励的闭源API和先前方法，同时在训练时无需任何任务标签。

英文摘要

Personalizing large language models (LLMs) has become a central challenge as LLMs are deployed across recommendation, search, dialogue, and content generation -- settings where the same query should yield different answers given different users. A promising route is to summarize each user's interaction history into a natural-language memory or profile and prepend it to the prompt to facilitate personalization. Existing methods learn such profile generators with explicit rewards derived from labeled downstream tasks, which are expensive and sparse as they require annotated supervision for every target task. In light of this challenge, we introduce Bidirectional User Modeling via Profiles (BUMP), a self-supervised framework that trains a profile generator without any downstream labels. Specifically, given a user's interaction history, we use GRPO to train an LLM to emit a free-form textual profile under a bidirectional in-batch ranking objective: a small LLM judge measures (i) how well the generated profile, used as a query, ranks the user's own held-out interactions above interactions from other users in the batch, and (ii) how well a held-out interaction, used as a query, ranks the user's own profile above profiles of other users. Both directions are scored with multi-positive NDCG and combined into a dense reward per rollout; other users in the batch supply free negatives, so every training example yields supervision from raw interaction logs alone. Evaluated on the LaMP benchmark, BUMP matches or outperforms closed-source APIs and prior methods relying on labeled rewards, while requiring no task label at training.
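The in-batch ranking reward can be made concrete with a small multi-positive NDCG routine. In this sketch the judge scores are invented numbers standing in for the LLM judge, and averaging the two directions is an assumption: the abstract says only that they are "combined into a dense reward".

```python
import numpy as np

def multi_positive_ndcg(scores, is_positive):
    """NDCG with binary gains: positives are the user's own items in the batch."""
    scores = np.asarray(scores, dtype=float)
    gains = np.asarray(is_positive, dtype=float)
    order = np.argsort(-scores, kind="stable")  # best-scored candidate first
    discounts = 1.0 / np.log2(np.arange(2, len(scores) + 2))
    dcg = np.sum(gains[order] * discounts)
    ideal = np.sum(discounts[: int(gains.sum())])
    return float(dcg / ideal) if ideal > 0 else 0.0

# Direction (i): profile as query, ranking 4 held-out interactions (2 are the user's own).
profile_to_items = multi_positive_ndcg([0.9, 0.1, 0.8, 0.2], [1, 0, 1, 0])
# Direction (ii): one interaction as query, ranking 4 profiles (1 is the user's own).
item_to_profiles = multi_positive_ndcg([0.3, 0.7, 0.2, 0.1], [1, 0, 0, 0])
# Assumed combination rule: simple mean of the two directions.
reward = 0.5 * (profile_to_items + item_to_profiles)
```

Other users in the batch act as negatives in both directions, so no labels beyond the raw interaction logs are needed to compute the reward.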

URL PDF HTML ☆

赞 0 踩 0

2606.05335 2026-06-05 cs.LG stat.ML

A prism hierarchy of learning regimes in large linear autoencoders

大型线性自编码器中学习机制的三棱柱层次结构

Eugene Golikov, Yaroslav Gusev, Dmitry Yarotsky

发表机构 * Applied AI Institute（应用人工智能研究所）； Steklov Mathematical Institute of Russian Academy of Sciences（俄罗斯科学院斯捷克洛夫数学研究所）

AI总结本文通过形式损失展开层次结构，将大型权重绑定线性自编码器的极端学习机制与三棱柱的面相关联，推导出五种基本极端机制下的训练和总体损失演化显式表达式。

Comments: 33 pages, under review for NeurIPS'2026

AI中文摘要

机器学习模型的理论研究通常考虑不同的极限机制，在这些机制下梯度下降的学习动态在理论上变得可处理。然而，对于特定类型的模型，系统地获得所有定性不同的极端学习机制的图景是可取的。在本文中，我们为大型权重绑定线性自编码器提出了这样一个图景，其特征由输入和潜在维度、初始化幅度以及训练集大小决定。该模型在权重上非线性，其梯度流没有一般的理论解。我们表明，在形式损失展开层次结构层面，其极端机制自然地与三棱柱的面相关联。特别地，存在与棱柱的2-面相关的五种基本极端机制：(1) 大数据，(2) 小数据，(3) 平均场，(4) 窄潜在，以及 (5) 自由。对于机制 (1,2,3,4)，我们推导了梯度流下训练和总体极限损失演化的显式表达式，与实验结果非常吻合。

英文摘要

Theoretical studies of machine learning models commonly consider different limiting regimes in which the learning dynamics of gradient descent becomes theoretically tractable. It is, however, desirable to have a systematically obtained picture of all qualitatively different extreme learning regimes for a particular type of models. In this paper we propose such a picture for large weight-tied linear autoencoders characterized by input and latent dimensions, initialization magnitude, and training set size. This model is nonlinear in the weights and its gradient flow does not have a general theoretical solution. We show that at the level of the formal loss-expansion hierarchy, its extreme regimes are naturally associated with faces of a triangular prism. In particular, there are five basic extreme regimes associated with the 2-faces of the prism: (1) large-data, (2) small-data, (3) mean-field, (4) narrow-latent, and (5) free. For regimes (1,2,3,4), we derive explicit expressions for both train and population limiting loss evolutions under gradient flow, obtaining very good agreement with experimental results.
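The model under study can be written down in a few lines. The sketch below trains a weight-tied linear autoencoder x ↦ WᵀWx with plain gradient descent, a discrete stand-in for the gradient flow analysed in the paper. The four quantities the abstract lists (input dimension, latent dimension, initialization magnitude, training set size) appear as explicit knobs; their values, the learning rate, and the Gaussian data are all assumptions.

```python
import numpy as np

def tied_ae_loss(W, X):
    """Mean over samples of 0.5 * ||W^T W x - x||^2 (W has shape latent x input)."""
    residual = X @ W.T @ W - X
    return 0.5 * np.mean(np.sum(residual ** 2, axis=1))

def tied_ae_grad(W, X):
    """Gradient of the loss above with respect to W."""
    residual = X @ W.T @ W - X
    G = X.T @ residual / X.shape[0]
    return W @ (G + G.T)

def train(X, latent_dim, init_scale=0.1, lr=0.02, steps=500, seed=0):
    rng = np.random.default_rng(seed)
    W = init_scale * rng.standard_normal((latent_dim, X.shape[1]))
    losses = []
    for _ in range(steps):
        losses.append(tied_ae_loss(W, X))
        W = W - lr * tied_ae_grad(W, X)
    return W, losses

# Hypothetical setting: 200 samples, input dimension 6, latent dimension 2.
X = np.random.default_rng(1).standard_normal((200, 6))
W_final, losses = train(X, latent_dim=2)
```

The loss is quartic in W, which is the sense in which the abstract calls the model "nonlinear in the weights" despite the map being linear in the input.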

URL PDF HTML ☆

赞 0 踩 0

2606.05334 2026-06-05 cs.AI

Uncertainty Aware Functional Behavior Prediction and Material Fatigue Assessment for Circular Factory

面向循环工厂的不确定性感知功能行为预测与材料疲劳评估

Nehal Afifi, Mehdi Khabou, Victor Mas, Jonas Hemmerich, Patric Grauberger, Stefan Dietrich, Volker Schulze, Sven Matthiesen

发表机构 * IPEK Institute of Product Engineering, Karlsruhe Institute of Technology (KIT)（IPEK产品工程研究所，卡尔斯鲁厄理工学院）； IAM-WK Institute for Applied Materials – Materials Science and Engineering, Karlsruhe Institute of Technology (KIT)（应用材料研究所–材料科学与工程，卡尔斯鲁厄理工学院）； wbk Institute of Production Science, Karlsruhe Institute of Technology (KIT)（生产科学研究所，卡尔斯鲁厄理工学院）

AI总结针对循环工厂中回收产品异质退化状态下的再利用决策问题，提出一种结合不确定性感知功能预测与组件级疲劳评估的实例特定可靠性框架，通过卷积编码器提取载荷模式、LSTM预测功能变量、有限元应力重建与疲劳损伤评估，实现功能、材料和系统可靠性轨迹的融合。

Comments: 27 pages, submitted to the Journal of Manufacturing Systems' special issue about circular factories, the manuscript is under review

AI中文摘要

循环工厂中的回收产品以异质退化状态、使用历史和剩余能力重新进入生产。仅凭当前检查无法决定再利用，因为未来功能实现和组件完整性可能在下一个服务场景下以不同方式演变。现有的PHM方法支持退化预测，但通常针对固定操作条件或孤立组件基准，而材料疲劳评估很少与系统级功能预后相关联。本文针对角磨机通过将不确定性感知功能预测与组件级疲劳评估结合在一个实例特定的可靠性工作流程中来解决这一差距。所提出的框架结合了当前工具状态与最近的力-扭矩使用窗口。卷积编码器从主轴力和轴扭矩中提取载荷模式，LSTM骨干网络预测九个功能变量作为高斯均值和方差估计。同时，相同的载荷历史通过有限元支持的应力重建、带Haibach扩展的S-N/Miner损伤评估和Paris定律裂纹扩展分析转化为输出轴疲劳信息。流式重放算法将两个分支整合为功能、材料和系统可靠性轨迹。保留测试显示九个输出的平均2%容差精度为0.9652。热变量预测近乎完美，而驱动电机电流和负载速度仍然是最具挑战性的动态输出，R²值分别为0.9750和0.9924。扭矩历史对这些变量尤其重要，传统LSTM在短历史设置中优于GRU和xLSTM。可靠性校准对驱动电机电流信息量最大，其中预测和观测的超越概率...

英文摘要

Returned products in circular factories re-enter production with heterogeneous degradation states, usage histories, and remaining capability. Reuse cannot be decided from the current inspection alone, because future function fulfillment and component integrity may evolve differently under the next service scenario. Existing PHM approaches support degradation prediction, but often target fixed operating conditions or isolated component benchmarks, while material-fatigue assessment is rarely linked to system-level functional prognosis. This paper addresses this gap for an angle grinder by combining uncertainty-aware functional prediction with component-level fatigue assessment in an instance-specific reliability workflow. The proposed framework combines the current tool state with recent force--torque usage windows. A convolutional encoder extracts loading patterns from spindle forces and shaft torque, and an LSTM backbone predicts nine functional variables as Gaussian mean and variance estimates. In parallel, the same loading history is translated into output-shaft fatigue information through finite-element-supported stress reconstruction, S--N/Miner damage evaluation with Haibach extension, and Paris-law crack-growth analysis. A streaming replay algorithm consolidates both branches into functional, material, and system reliability trajectories. Held-out tests show mean $2\%$-tolerance accuracy of 0.9652 across nine outputs. Thermal variables are predicted near-perfectly, while drive motor current and load speed remain the most demanding dynamic outputs, with $R^2$ values of 0.9750 and 0.9924. Torque history is especially important for these variables, and the conventional LSTM outperforms GRU and xLSTM in the short-history setting. Reliability calibration is most informative for drive motor current, where predicted and observed exceedance probabilities ...
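The fatigue branch mentions S-N/Miner damage evaluation with the Haibach extension. A textbook version of that calculation is sketched below: linear damage accumulation over load blocks, with the S-N slope k replaced by 2k - 1 below the knee point. All material parameters and load blocks are invented for illustration and are not the angle-grinder shaft values from the paper; stress reconstruction and crack growth are not covered.

```python
def cycles_to_failure(stress_amplitude, knee_stress=100.0, knee_cycles=1e6, slope=5.0):
    """Basquin-type S-N curve with Haibach's modified slope below the knee point."""
    # Assumed parameters: knee at (100 MPa, 1e6 cycles), slope k = 5.
    exponent = slope if stress_amplitude >= knee_stress else 2.0 * slope - 1.0
    return knee_cycles * (knee_stress / stress_amplitude) ** exponent

def miner_damage(load_blocks):
    """Palmgren-Miner sum D = sum(n_i / N_i); failure is predicted at D >= 1."""
    return sum(n / cycles_to_failure(s) for s, n in load_blocks)

# Hypothetical usage history: (stress amplitude in MPa, number of cycles).
history = [(200.0, 15625), (50.0, 1_000_000)]
damage = miner_damage(history)
```

With these numbers the high-stress block consumes half of the fatigue life, while the million low-stress cycles add only about 0.002, which shows how the extension keeps sub-knee loads from being ignored entirely.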

2606.05332 2026-06-05 cs.AI

GITCO: Gated Inference-Time Context Optimization in TSFMs

Manya Pandey, Dhruv Kumar, Murari Mandal, Saurabh Deshpande

Affiliations: arXiv.org; cs.AI (Computer Science and Artificial Intelligence)

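AI summary: Proposes the GITCO framework, which uses a gating mechanism to selectively suppress harmful patches at inference time, improving the zero-shot forecasting accuracy of patch-based time series foundation models without any parameter updates.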

Comments: ICML 2026 Workshop on Foundation Models for Structured Data

Abstract

Patch-based Time Series Foundation Models (TSFMs) suffer from context poisoning: structurally anomalous patches capture disproportionate attention and silently degrade zero-shot forecast quality. We propose improving TSFM accuracy at inference time by optimizing the input context rather than modifying model weights. We present GITCO (Gated Inference-Time Context Optimization), a lightweight framework with three components (Gate, Router, and Critic) that selectively identifies and suppresses harmful patches without any parameter updates. Evaluated on TimesFM 2.5 across 53 GIFT-Eval datasets under K-fold cross-validation, GITCO achieves an average MASE reduction of 1.95% while capturing 89.9% of the improvement upper bound. We introduce context sensitivity profiles as a new characterizable property of TSFMs: the mapping from time series meta-features to expected accuracy improvement under inference-time context intervention, shaped jointly by model architecture and the statistical structure of the data.
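
For readers unfamiliar with the metric, the following sketch shows how MASE and a relative MASE reduction are commonly computed. It uses the standard definition (forecast MAE scaled by the in-sample MAE of a seasonal-naive forecast). The function names and toy numbers are hypothetical, and the abstract does not state how the average over datasets is formed, so the sketch covers a single series only.

```python
def mase(y_true, y_pred, y_train, season=1):
    """Mean Absolute Scaled Error for one series.

    Forecast MAE divided by the in-sample MAE of a seasonal-naive forecast
    with period `season` (season=1 is the plain naive forecast).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_train) <= season:
        raise ValueError("training history must be longer than the season")
    forecast_mae = sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)
    naive_errors = [
        abs(y_train[t] - y_train[t - season]) for t in range(season, len(y_train))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("naive forecast error is zero; MASE is undefined")
    return forecast_mae / scale


def relative_reduction_pct(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Toy data, not taken from the paper.
history = [1.0, 2.0, 3.0, 4.0, 5.0]
actual = [6.0, 7.0]
baseline_mase = mase(actual, [5.0, 7.0], history)
improved_mase = mase(actual, [5.5, 7.0], history)
print(relative_reduction_pct(baseline_mase, improved_mase))
```

Because lower MASE is better, a positive value from `relative_reduction_pct` means the modified context improved the forecast.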

2606.05330 2026-06-05 cs.CL cs.AI cs.HC

A Model of Multi-turn Human Persuadability Using Probabilistic Belief Tracing

Jared Moore, Noah Goodman, Nick Haber, Max Kleiman-Weiner

Affiliations: Stanford University; University of Washington

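AI summary: Proposes the PERSUASIONTRACE framework, which records multi-turn belief reports, annotates rhetorical dimensions, and introduces a Bayesian-network simulated target, shifting persuasion evaluation from endpoint change to process fidelity.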

Abstract

Large language models can shift human beliefs across high-stakes domains, but most persuasion studies rely on pre/post belief change. These endpoint measures identify whether persuasion occurred, yet miss where and how beliefs moved within a dialogue. We present PERSUASIONTRACE, a framework for studying persuasion in human-LLM interaction. Built on a web-based experimental platform, PERSUASIONTRACE contributes a tool for multi-turn persuasion studies and a process-level evaluation protocol: it records multi-turn belief reports from human or simulated targets of persuasion, annotates persuader turns with rhetorical dimensions (logos/pathos/ethos), and evaluates simulators by fidelity to real human belief dynamics. Using this framework, we find that human targets group into two clusters of multi-turn belief updates and exhibit susceptibility to rhetorical strategies, and that LLMs are persuasive across generic and personalized topics, text and audio modalities, and multi-turn interactions. Prior work has chiefly used vanilla-prompted LLMs to simulate human targets, but we show that these simulators fail to replicate human belief dynamics. We introduce a Bayesian-network simulated target that maintains an explicit latent belief state over time so each persuader message yields cognitively realistic belief updates. In human-likeness evaluation, our Bayesian target scores near a human reference (81 vs 80), while baseline LLM targets score substantially lower (64). PERSUASIONTRACE reframes persuasion evaluation from endpoint movement alone to process fidelity, providing a stronger basis for scientific analysis and safer optimization of persuasive systems.
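
The idea of tracking an explicit belief state across turns, rather than only comparing pre and post values, can be illustrated with a toy sketch. This is not the paper's Bayesian network: it assumes a single binary proposition and a hypothetical likelihood ratio per persuader message, and applies Bayes' rule in odds form once per turn.

```python
def update_belief(prior, likelihood_ratio):
    """One Bayes-rule update in odds form.

    `prior` is the current probability that the proposition is true.
    `likelihood_ratio` is a hypothetical strength of the message:
    values above 1 push the belief up, values below 1 push it down.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if likelihood_ratio <= 0:
        raise ValueError("likelihood_ratio must be positive")
    posterior_odds = prior / (1.0 - prior) * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


def trace_beliefs(initial_belief, likelihood_ratios):
    """Return the belief after every turn, including the starting value."""
    trace = [initial_belief]
    for ratio in likelihood_ratios:
        trace.append(update_belief(trace[-1], ratio))
    return trace


# Hypothetical dialogue: two supporting messages, then a weak counterargument.
for turn, belief in enumerate(trace_beliefs(0.5, [3.0, 2.0, 0.5])):
    print(f"turn {turn}: belief = {belief:.3f}")
```

The full trace shows where in the dialogue the belief moved, which an endpoint-only measure (first versus last entry) would hide.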

2606.05327 2026-06-05 cs.LG q-bio.QM stat.ML

Multimarginal flow matching with optimal transport potentials

Raghav Kansal, David Crair, Nghia Nguyen, Scott Pope, Bradley Parry

Affiliation: University of California, Berkeley

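AI summary: Proposes a method that uses dynamic optimal transport potentials to steer flow matching toward intermediate marginal distributions, yielding an efficient, simulation-free multimarginal flow matching algorithm with state-of-the-art performance on single-cell RNA sequencing, oceanographic, and meteorological datasets.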

Comments: 9 pages, 3 figures, 4 tables, and a 27 page appendix. Accepted to the Forty-Third International Conference on Machine Learning

Abstract

Flow matching (FM) has emerged as a powerful framework for learning dynamic transport maps between two empirical distributions. However, less explored is the setting with intermediate observed marginals that can help constrain the flows between the endpoints. This "multimarginal" regime is central to modeling temporal evolution in dynamical systems in many scientific domains that can sample sequential distributions. We tackle this problem with a novel approach that leverages the connection between FM and dynamic optimal transport (OT), softly steering the flow towards the intermediate marginals through potential terms in the dynamic OT action. By extending the conditional FM learning target to incorporate these potentials, we derive an efficient, simulation-free algorithm for multimarginal FM that offers considerable flexibility in the spatiotemporal dynamics of the learned flows. We demonstrate state-of-the-art performance and training efficiency of OT-potential FM (OTP-FM) on diverse single-cell RNA sequencing, oceanographic, and meteorological datasets. Our code is available at https://github.com/Bexorg-Inc/OTP-FM.
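
As background for the conditional FM learning target that the abstract says is extended, here is a minimal sketch of the standard two-marginal case with a straight-line path between paired samples. It does not include the OT potential terms or intermediate marginals of OTP-FM, whose exact form is not given in the abstract. The function names and the toy velocity models are hypothetical.

```python
def cfm_training_pair(x0, x1, t):
    """Point on the straight-line path from x0 to x1 and its target velocity.

    x_t = (1 - t) * x0 + t * x1, and the conditional velocity along this
    path is the constant x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - a for a, b in zip(x0, x1)]
    return x_t, u_t


def cfm_loss(velocity_model, x0, x1, t):
    """Mean squared error between the model velocity and the target velocity.

    No ODE is simulated during training, which is why this objective is
    called simulation-free.
    """
    x_t, u_t = cfm_training_pair(x0, x1, t)
    v_t = velocity_model(x_t, t)
    return sum((v - u) ** 2 for v, u in zip(v_t, u_t)) / len(u_t)


# Toy check with a hand-written "model" that ignores its inputs.
source, target = [0.0, 0.0], [2.0, 4.0]
print(cfm_loss(lambda x, t: [2.0, 4.0], source, target, t=0.25))
print(cfm_loss(lambda x, t: [0.0, 0.0], source, target, t=0.25))
```

In practice the velocity model is a neural network, and the loss is averaged over random draws of `t` and of sample pairs from the two distributions.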
