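arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.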

2606.11092 2026-06-12 cs.RO cs.AI version update

RoboNaldo: Accurate, Stable and Powerful Humanoid Soccer Shooting via Motion-Guided Curriculum Reinforcement Learning

Yichao Zhong, Yidan Lu, Yuhang Lu, Tianyang Tang, Haoguang Mai, Yixuan Pan, Tianyu Li, Li Chen, Jingbo Wang, Zhongyu Li, Peng Lu, Hongyang Li

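Affiliations: The University of Hong Kong; The Chinese University of Hong Kong; Archon Robotics

AI summary: Proposes RoboNaldo, a three-stage motion-guided curriculum reinforcement learning framework that starts from a single human-kick reference and progressively optimizes shooting performance. In simulation, shot error drops by 48.6% and shot velocity is 2.96x that of the baselines; on a real robot, average shot error from 3 m away is 0.73-0.86 m, and post-contact ball speed reaches 13.10 m/s.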

English abstract

Elite humanoid soccer shooting requires whole-body stability, high-impulse whole-body interactions, and accuracy to targets. Motion tracking-driven reinforcement learning (RL) provides stability in whole-body movement coordination, but a fixed reference makes it hard to adapt to varied ball positions and strike timings; in contrast, task reward-driven RL struggles to explore and discover valid kicks from scratch. We therefore introduce RoboNaldo, a three-stage motion-guided curriculum RL framework for high-impulse humanoid interaction. A single human-kick reference is used as a scaffold, and optimization progressively shifts toward shooting performance. The curriculum first learns a stable whole-body kicking prior, then adapts the kick to free-kick settings where the ball is stationary at random positions, and finally extends it to moving-ball shooting through a locomotion-command and kick-trigger interface. A high-level heuristic planner controls this interface during training, while alternative high-level controllers can drive the same low-level policy at inference. In simulation, RoboNaldo achieves 48.6% lower free-kick shot error and 2.96x the shot velocity of prior-work baselines. In the real world, on a Unitree G1 with onboard perception, RoboNaldo attains 0.73 m and 0.86 m average target shooting error from 3 m away in the free-kick and moving-ball cases, respectively. The post-contact ball velocity reaches 13.10 m/s, which is 59-71% of reported professional open-play shot speed. Project page: this https URL.

2606.11042 2026-06-12 cs.AI version update

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Liya Zhu, Jingzhe Ding, Jian Zhang, Jianbo Xue, Shihao Liang, Ge Zhang, Yi Zhu, Duju Zeng, Xiang Gao, Qingshui Gu, Mailun Gao, Huimin Che, Yan Zhao, Peiheng Zhou, Haojun Wang, Chaobo Xian, Lili Le, Chi Wu, Yiwei Liu, Shengda Long, Jiale Yang, Fangzhi Xu, Sijin Wu, Haodong Duan, Chao He, Zhaojian Li, Minchao Wang, Huan Zhou, Jiani Hou, Chuqian Yu, Weiran Shi, Hongwan Gao, Jiamin Chen, Guanhong Chen, Tingqin Luo, Kaiyuan Zhang, Zhixin Yao, Qing Hua, Yuhao Jiang, Jin Chen, Pu Chen, Zhenyu Hu, Xingyu Li, Zhengxuan Jiang, Meng Cao, Tianfeng Long, Haozhe Wang, Mingzhang Wang, Yichen Zhang, Yiming Dai, Chenchen Zhang, Jiaying Wang, Xinying Liu, Xingzu Liu, Lingling Zhang, Xinjie Chen, Yujia Qin, Wangchunshu Zhou, Zhiyong Wu, Yang Liu, Jiaheng Liu, Lei Zhang, Shen Yan, Wenhao Huang, Zaiyuan Wang, Xiaolong Chang

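Affiliations: ByteDance Seed; M-A-P; Humanlaya

AI summary: Proposes the Workflow-GYM benchmark, which evaluates the ability of AI agents to carry out long-horizon, high-value workflows in professional software. Even the strongest model reaches a success rate only slightly above 30%, exposing a serious weakness of current agents in maintaining long-horizon workflow consistency.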

English abstract

Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work in an end-to-end manner. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve only slightly above 30% success rates, highlighting that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.

2606.09500 2026-06-12 cs.AI cs.DL version update

Deterministic Integrity Gates for LLM-Assisted Clinical Manuscript Preparation: An Auditable Biomedical Informatics Architecture

Yoojin Nam, Jinhoon Jeong, Namkug Kim

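Affiliations: University of Ulsan College of Medicine; Asan Medical Center; Aperivue; AMIST, Asan Medical Center

AI summary: Proposes a deterministic integrity-gate architecture that decomposes the workflow into independently verifiable skills and places deterministic checks at every stage, addressing fabricated citations, data drift, and missing reporting-guideline items in LLM-generated clinical manuscripts.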

Comments: 28 pages, 3 figures, 4 tables; includes supplementary material (deterministic-detector inventory, per-class defect breakdown, worked example). Software (MIT): this https URL. Archived on Zenodo: concept DOI this https URL and version DOI (v3.8.0) this https URL

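AI abstract (translated from Chinese)

Objective. Large language models (LLMs) increasingly draft clinical research manuscripts, but their fluency can hide fabricated citations, numbers that drift from source tables, and unmet reporting-guideline items. Existing tools generate text without verifying it, and self-critique inherits the blind spots that produce confident fabrication. We describe an architecture that pairs generation with verification. Methods. The design rests on three principles: decompose the workflow into self-contained skills, gate every stage transition with halt-on-failure, and resolve each integrity question with the cheapest sufficient mechanism: a deterministic, re-executable check where one applies, and a prose-level probe only where interpretation is required. This determinism-where-possible split, organized as an integrity-gate taxonomy, is the core contribution. It is implemented as MedSci Skills, an open-source toolkit of 43 skills coordinated by an orchestrator, whose deterministic tier comprises 21 standard-library detectors. We evaluate it on three reproducible public-dataset pipelines (STARD, PRISMA, STROBE) and a seeded-defect ablation. Results. Across the three pipelines, every content-hash manifest verified clean and the gates surfaced real defects. On 27 identical injected defects, the deterministic gates detected all 27 with no false positives on the matched clean fixtures, whereas a generic single-prompt LLM reviewer detected 11, with its misses concentrated in generated code, bibliography internals, and style defects that the prose does not expose. Conclusion. Determinism-where-possible verification yields an auditable, re-executable trail that exposes the evidence a human needs to check an LLM-assisted manuscript: evidence of feasibility and reproducibility, not a claim of human-competitive quality, which a separate blinded study addresses. MedSci Skills is MIT-licensed and archived (v3.8.0).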

English abstract

As autonomous research agents and AI co-scientist systems push large language models (LLMs) from drafting toward end-to-end manuscript production, the bottleneck shifts from generation to verification. Fluent LLM output can hide fabricated citations, numbers that drift from source tables, and unmet reporting-guideline items; existing tools generate without verifying, and self-critique inherits the blind spots that produce confident fabrication. We describe an architecture pairing generation with verification, resting on three principles: decompose the workflow into self-contained skills, gate every stage transition with halt-on-failure, and resolve each integrity question with the cheapest sufficient mechanism, a deterministic, re-executable check where one suffices and a prose-level probe only where interpretation is unavoidable. This determinism-where-possible split, organized as an integrity-gate taxonomy, is the core contribution. It is realized as MedSci Skills, an open-source toolkit of 43 skills with a 21-detector deterministic tier, evaluated on three public-dataset pipelines (STARD, PRISMA, STROBE) and a seeded-defect ablation. Across the three pipelines every content-hash manifest verified clean and the gates surfaced real defects; on 27 identical injected defects the deterministic gates detected all 27 with no false positives on the matched clean fixtures, whereas a single-prompt LLM reviewer detected 11, its misses in code, bibliography, and style defects the prose hides. Determinism-where-possible verification yields an auditable, re-executable trail that exposes the evidence a human needs to check an LLM-assisted manuscript: feasibility and reproducibility evidence, not a claim of human-competitive quality, which a separate blinded study addresses. MedSci Skills is MIT-licensed and archived (v3.8.0).
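The deterministic, halt-on-failure checks described in this abstract can be illustrated with a content-hash manifest verification. The following is a minimal sketch, not the MedSci Skills implementation: the function names `build_manifest`, `verify_manifest`, and `gate`, the in-memory artifact dictionary, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


def build_manifest(artifacts: dict[str, bytes]) -> dict[str, str]:
    """Map each artifact name to the SHA-256 hex digest of its content."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in artifacts.items()}


def verify_manifest(artifacts: dict[str, bytes], manifest: dict[str, str]) -> list[str]:
    """Return the sorted names whose content is missing or no longer matches."""
    failures = []
    for name, expected in manifest.items():
        data = artifacts.get(name)
        if data is None or hashlib.sha256(data).hexdigest() != expected:
            failures.append(name)
    return sorted(failures)


def gate(artifacts: dict[str, bytes], manifest: dict[str, str]) -> bool:
    """Hypothetical halt-on-failure gate: raise instead of letting the stage proceed."""
    failures = verify_manifest(artifacts, manifest)
    if failures:
        raise ValueError(f"integrity gate failed for: {', '.join(failures)}")
    return True
```

Because the check only recomputes hashes, rerunning it on the same inputs always gives the same verdict, which is the property that makes such a gate re-executable and auditable.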

2605.03847 2026-06-12 cs.AI version update

Mechanical Conscience: A Mathematical Framework for Dependability of Machine Intelligence

Munkhdegerekh Batzorig, Purevbaatar Ganbold, Kyungbin Park, Pilkong Jeong, Kangbin Yim

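AI summary: Proposes the concept of mechanical conscience (MC), which applies trajectory-level normative filtering to minimally correct a baseline policy, reducing cumulative deviation while accounting for epistemic uncertainty, in order to make single-agent and distributed intelligent systems dependable.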

Comments: 9 pages, 2 figures. Preprint

English abstract

Distributed collaborative intelligence (DCI), encompassing edge-to-edge architectures, federated learning, transfer learning, and swarm systems, creates environments in which emergent risk is structurally unavoidable: locally correct decisions by individual agents compose into globally unacceptable behavioral trajectories under uncertainty. Existing approaches such as constrained optimization, safe reinforcement learning, and runtime assurance evaluate acceptability at the level of individual actions rather than across behavioral trajectories, and none addresses the multi-participant, uncertainty-laden nature of DCI deployments. This paper introduces mechanical conscience (MC), a novel concept and simplified mathematical framework that operationalizes trajectory-level normative regulation for both single-agent and distributed intelligent systems. Mechanical conscience is defined as a supervisory filter that minimally corrects a baseline policy's actions to reduce cumulative deviation from a normatively admissible region, while accounting for epistemic uncertainty. We introduce associated constructs, conscience score, mechanical guilt, and resonant dependability, that provide an interpretable vocabulary and computable governance signals for this emerging field. Core theoretical properties are established: admissibility equivalence, existence of optimal regulation, and monotonic deviation reduction. Illustrative results demonstrate that MC-regulated agents maintain trajectory-level normative acceptability where conventional controllers drift outside admissible bounds, and that the framework naturally extends to suppress interaction-induced emergent risk in multi-agent DCI settings.

2605.02249 2026-06-12 cs.AI version update

A Study of Belief Revision Postulates in Multi-Agent Systems (Extended Version)

Michael Thielscher, Tran Cao Son

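AI summary: Studies belief revision in epistemic planning, generalizes the classical AGM belief revision postulates to the multi-agent setting, proposes a generalized full-meet multi-agent belief revision operator, and discusses generalized postulates for iterated revision together with an event-model-based revision operator.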

English abstract

We investigate the belief revision problem in epistemic planning, i.e., what will be the beliefs of all agents in a multi-agent system after an agent gains the belief in some state property. Based on the standard representation in epistemic planning of agents' beliefs via a single multi-agent Kripke model, we generalize the classical AGM belief revision postulates to the multi-agent setting, with the aim to provide a formal framework for evaluating dynamic epistemic reasoning frameworks in which the beliefs of all agents as the result of actions are computed. As an example of a simple operator that satisfies all of the generalized AGM postulates, we present generalized full-meet multi-agent belief revision. We moreover define a generalization of the standard postulates for iterated revision, present a more sophisticated, event model based revision operator, and discuss the potential issues in defining an epistemic operator on Kripke models that can satisfy all of the generalized postulates for iterated multi-agent belief revision.
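For readers unfamiliar with full-meet revision, the classical single-agent version can be sketched on sets of possible worlds. This is only the textbook operator, not the multi-agent generalization on Kripke models that the paper proposes; the helper names `worlds` and `full_meet_revise` and the frozenset encoding of worlds are assumptions made for illustration.

```python
from itertools import product


def worlds(atoms):
    """Enumerate every truth assignment as the frozenset of atoms that are true."""
    return {
        frozenset(a for a, v in zip(atoms, bits) if v)
        for bits in product([False, True], repeat=len(atoms))
    }


def full_meet_revise(belief_worlds, phi_worlds):
    """Classical full-meet revision on sets of possible worlds.

    If the new information phi is consistent with the current beliefs, the
    beliefs are simply narrowed to the worlds satisfying phi. Otherwise every
    old belief is given up and only the worlds satisfying phi remain.
    """
    consistent = belief_worlds & phi_worlds
    return consistent if consistent else set(phi_worlds)
```

For example, an agent believing both p and q that is revised by "not p" ends up considering every not-p world possible, so the unrelated belief in q is lost as well. This coarseness is why full-meet revision is usually presented as a simple baseline operator.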

2606.06113 2026-06-12 cs.CV version update

Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback

Huaisong Zhang, Hao Yu, Yuxuan Zhang, Jiahe Wang, Xinrui Chen, Haoxiang Cao, Feng Lu, Wendong Zhang, Changqian Yu, Chun Yuan

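AI summary: Proposes Structured Defect Grounding (SDG), which models defect diagnosis in text-to-image generation as structured set prediction. The work builds the SDG-30K dataset and the SDG-Eval evaluation protocol, uses a vision-language model as the detector, and applies BoxFlow-GRPO to turn predicted defect sets into spatial rewards that improve diffusion model alignment.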

Comments: 25 pages, 9 figures

English abstract

Despite generating increasingly photorealistic images, text-to-image (T2I) models still exhibit localized, subtle, and structurally complex failures. Diagnosing these failures requires instance-level feedback that answers where a defect occurs, what type it is, why it is defective, and its importance to overall image quality. While recent dense-feedback methods move beyond scalar supervision, their heatmap-centric representations still formulate diagnosis as pixel-field regression, making it difficult to localize variable-cardinality defects and bind semantic reasons to individual failures. To address this representation bottleneck, we propose Structured Defect Grounding (SDG), which casts T2I diagnosis as structured set prediction by modeling each defect as a (location, type, reason, importance) tuple. To make this formulation trainable and measurable, we introduce SDG-30K, a 30K-image dataset with box-grounded annotations across four modern T2I generators, together with a dedicated evaluation protocol, SDG-Eval. Building on this structured representation, we further present a diagnosis-to-alignment framework in which a Vision-Language Model (VLM) serves as the SDG detector, and BoxFlow-GRPO converts predicted defect sets into box-derived, importance-weighted spatial rewards for diffusion model alignment. Extensive experiments show that our SDG detector outperforms leading proprietary VLMs on structured defect grounding, while SDG-guided rewards consistently improve T2I alignment and support localized image refinement. These results establish SDG as a unified, instance-level interface for diagnosing, evaluating, and enhancing modern generative models.

2606.05860 2026-06-12 cs.LG version update

GenAutoML: An Agentic Framework for Dynamic Architecture Generation and Optimization in Time-Series Analysis

Oleeviya Babu Poikarayil, Cédric Schockaert, Abdulrahman Nahhas, Christian Daase, Mursal Dawodi, Jawid Ahmad Baktash

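AI summary: Proposes the GenAutoML framework, which uses large language models as neural architects and, through a sandboxed reflection loop and a signature-aware runtime, automatically generates and refines neural architectures for time-series forecasting and anomaly detection. It introduces dynamic reversible instance normalization to improve robustness under non-stationary conditions.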

Comments: 26 pages, 17 figures, 12 tables. Under review

English abstract

Designing neural architectures for time-series forecasting and anomaly detection remains a resource-intensive task that often requires substantial domain expertise. Traditional Automated Machine Learning (AutoML) systems typically rely on static, predefined search spaces, limiting their ability to adapt to diverse data characteristics. We present GenAutoML, an agentic framework that leverages Large Language Models (LLMs) as neural architects to bridge natural-language requirements and executable PyTorch implementations. The framework incorporates a Sandboxed Reflection Loop for autonomous code refinement and a Signature-Aware Runtime that enforces architectural consistency and execution safety. To improve robustness under non-stationary conditions, we further introduce a Dynamic Reversible Instance Normalization (Dyn-RevIN) wrapper. Experiments on the ETTh1, ETTm1, and Weather benchmarks demonstrate that GenAutoML can dynamically generate task-specific neural architectures tailored to dataset characteristics. Among the generated models, WaveInterferenceNet achieves inference latency below 0.01 ms per sample while maintaining competitive predictive performance. By emphasizing computational efficiency, architectural adaptability, and stable optimization behavior, GenAutoML enables the creation of ultra-lightweight neural networks suitable for resource-constrained and latency-sensitive Edge AI deployments.
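The normalization wrapper mentioned in this abstract builds on reversible instance normalization. A minimal sketch of the plain, non-learnable form is given below; it is not the paper's Dyn-RevIN, and the function names, the `eps` default, and the callable `model` argument are assumptions made for illustration.

```python
import math


def revin_normalize(window, eps=1e-5):
    """Normalize one input window by its own mean and standard deviation."""
    mean = sum(window) / len(window)
    var = sum((x - mean) ** 2 for x in window) / len(window)
    std = math.sqrt(var + eps)
    return [(x - mean) / std for x in window], (mean, std)


def revin_denormalize(values, stats):
    """Map model outputs back to the original scale of that window."""
    mean, std = stats
    return [v * std + mean for v in values]


def forecast_with_revin(window, model, eps=1e-5):
    """Wrap any forecaster: normalize the input, predict, then reverse the scaling."""
    normed, stats = revin_normalize(window, eps)
    return revin_denormalize(model(normed), stats)
```

Because the statistics are recomputed for every window, a level shift in the input shifts the forecast by the same amount, which is the behavior that helps under non-stationary conditions.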

2606.05692 2026-06-12 cs.LG cs.AI version update

Benchmarking Counterfactual Prediction in Epidemic Time Series with Time-Varying Interventions

Wenhao Mu, Facundo Yan, Anik Mumssen, Marisa Eisenberg, Alexander Rodríguez

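AI summary: To address the lack of realistic benchmarks with observable counterfactual outcomes, builds a large-scale benchmark for counterfactual prediction in epidemic time series from a calibrated agent-based model. The benchmark supports static and time-varying treatments as well as single-policy and multi-policy interventions, and is used to evaluate a range of causal inference methods.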

Comments: To appear in Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

English abstract

Deep learning has enabled significant advances in time-series causal inference, yet progress remains constrained by the lack of realistic benchmarks with observable counterfactual outcomes. Existing datasets either rely on real-world observations without ground-truth counterfactuals or on simplified simulations that fail to capture complex causal dynamics. To address this gap, we develop a large-scale benchmark for counterfactual prediction in epidemic time series under dynamic interventions. Unlike existing benchmarks, it supports static and time-varying treatments, as well as both single-policy and multi-policy intervention settings, enabling evaluation of causal inference methods across a broad range of causal inference scenarios. Leveraging a calibrated agent-based model grounded in real-world demographic, mobility, epidemiological, and policy data, we generate realistic counterfactual trajectories across more than 150 U.S. counties. Using this benchmark, we evaluate widely used and state-of-the-art causal inference methods, revealing substantial performance differences and highlighting the challenges of realistic time-series causal reasoning.

2606.05405 2026-06-12 cs.AI cs.CL cs.LG version update

Agents' Last Exam

Yiyou Sun, Xinyang Han, Weichen Zhang, Yuanbo Pang, Tianyu Wang, Yuhan Cao, Yixiao Huang, Chris Duroiu, Haoyun Zhang, Jeffrey Lin, Weishu Zhang, Tyler Zeng, Ying Yan, Bo Liu, Hanson Wen, Mingyang Xu, Xiaoyuan Liu, Zimeng Chen, Weiyan Shi, Amanda Dsouza, Vincent Sunn Chen, Patrick Bryant, Carl Boettiger, Yamini Rangan, Bradley Rothenberg, Kyle Steinfeld, Arvind Rao, Tapio Schneider, Georgios Yannakakis, Laure Zanna, Kaan Ozbay, Ida Sim, Tarek Zohdi, George Em Karniadakis, Jack Gallant, Teresa Head-Gordon, Yushan Li, Wenxi Deng, Tao Sun, Huiqi Wang, Zhun Wang, Justin Xu, Chris Yuhao Liu, Yafei Cheng, Rongwang Hu, Aras Bacho, Shengcao Cao, Zengyi Qin, Yixiong Chen, Hengduan Fan, Hao Liu, Lin Zeng, Shashank Muralidhar Bharadwaj, Litian Gong, Yingxuan Yang, Maojia Song, Ruheng Wang, Zongzheng Zhang, Honglin Bao, Shuo Lu, Jianhong Tu, Zhonghua Wang, Zheng Zhang, Zijiao Chen, Yanqiong Jiang, Zhendong Li, Bohan Lyu, Chang Ma, Peiran Xu, Benran Zhang, Shangding Gu, Haoyue Hua, Haoyang Li, Wanzhe Liao, Chengzhi Liu, Junbo Peng, Haoran Sun, Zechen Xu, Bo Chen, Jiayi Cheng, Yi Jiang, Keying Kuang, Yuan Li, Youbang Pan, Ziyan Rao, Alexander Schubert, Yifan Shen, Vincent Siu, Xiatao Sun, Kangqi Zhang, Xiaopan Zhang, Yuchen Zhu, Ishaan Singh Chandok, Lei Ding, Jingxuan Fan, Andrew Glover, Jiaming Hu, Yiran Hu, Wenbo Huang, Zixin Jiang

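AI summary: To address the lack of economically meaningful deployment of AI systems in professional domains, proposes the Agents' Last Exam (ALE) benchmark, built with 250+ experts and covering 1000+ long-horizon, real-world economic tasks across 55 subfields in 13 industry clusters. The average pass rate on the current hardest tier is only 2.6%.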

Comments: Project website: this https URL Code: this https URL

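AI abstract (translated from Chinese)

Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment in many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with more than 250 industry experts, ALE covers non-physical industries defined with reference to O*NET/SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters, covering more than 1,000 tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.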

English abstract

Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long horizon, economically valuable, real world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 sub fields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is below 1%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP relevant impact.

2606.04935 2026-06-12 cs.AI version update

What Type of Inference is Active Inference?

Wouter W. L. Nuijten, Mykola Lukashchuk, Thijs van de Laar, Bert de Vries

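AI summary: Using the variational free energy framework, decomposes expected free energy minimization in active inference into entropy-correction and planning-correction terms, revealing its nature as inference, and verifies the role of the different corrections in grid-world experiments.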

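AI abstract (translated from Chinese)

Active inference casts decision-making as inference, with the Expected Free Energy (EFE) unifying goal-directed and information-seeking behavior. Recent work showed that EFE minimization can be written as Variational Free Energy (VFE) minimization on a generative model with epistemic priors. We prove that the VFE of the augmented model can be rewritten as the VFE of the predictive model plus explicit entropy-correction terms, making the EFE contribution transparent. We then show that proper EFE-based planning requires combining these epistemic corrections with a planning correction that turns marginal inference into policy optimization, yielding a full variational characterization of EFE-based planning. This clarifies which corrections are needed for cross-entropy planning and for full EFE-based planning. The same entropy-corrected formulation leads to a detailed message-passing scheme for EFE-based planning together with simpler ablations. Experiments on three grid-world environments show that the planning correction already helps when observations are decisive, whereas the additional observation-side epistemic correction matters most when observations are only suggestive.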

English abstract

Active inference casts decision-making as inference, with the Expected Free Energy (EFE) unifying goal-directed and information-seeking behavior. Recent work showed that EFE minimization can be written as Variational Free Energy (VFE) minimization on a generative model augmented with epistemic priors. We prove that the VFE of the augmented model can be rewritten as the VFE of the predictive model plus explicit entropy-correction terms, making the EFE contribution transparent. We then show that proper EFE-based planning requires combining these epistemic corrections with a planning correction that turns marginal inference into policy optimization, yielding a full variational characterization of EFE-based planning. This clarifies which corrections are needed for cross-entropy planning and for full EFE-based planning. The same entropy-corrected formulation leads to a detailed message-passing scheme for EFE-based planning together with simpler ablations. Experiments on three grid-world environments show that full EFE-based planning outperforms ablations that omit either the planning correction or the epistemic corrections.

2606.04602 2026-06-12 cs.AI version update

Parthenon Law: A Self-Evolving Legal-Agent Framework

Hejia Geng, Leo Liu

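AI summary: Proposes the Parthenon framework, which factors out components such as models, tools, and knowledge and introduces an anti-leakage learning loop, enabling legal-domain LLM agents to self-evolve from experience and substantially improving performance on legal-matter handling.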

English abstract

As agents grow more capable, legal-domain LLM agents promise to turn document-heavy matters into reviewable work products, yet reliable deployment faces three obstacles: no large-scale evidence on how today's strongest model-and-harness combinations behave on end-to-end legal matters; no agent architecture adapted to the legal vertical, only general-purpose harnesses; and, in a setting that keeps shifting with new facts, authorities, and deadlines, no mechanism for systems to learn from their own outcomes. We address each. A large-scale empirical study on Harvey LAB (12,510 agent trajectories) shows that even frontier agents remain far from completing matters in a single pass: per-criterion accuracy climbs with stronger models while strict matter completion stalls. We then introduce Parthenon, a self-evolving legal-agent framework that factors Model, Harness, Agent roles, legal Knowledge, deterministic Tools, and procedural Skills into auditable surfaces for source traceability, date and number grounding, deliverable compliance, and issue closure. Finally, an anti-leakage learning loop converts scored failures into task-agnostic edits to skills, tools, and knowledge, letting the system improve with experience, as a firm refines its checklists and playbooks after each matter, without touching model weights. Across our large-scale empirical analysis, Parthenon substantially improves the performance of state-of-the-art models and harnesses on legal-matter tasks.

2606.04525 2026-06-12 cs.CL cs.LG q-bio.GN version update

GENEB: Why Genomic Models Are Hard to Compare

Daria Ledneva, Mikhail Nuridinov, Denis Kuznetsov

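AI summary: To address fragmented evaluation of genomic foundation models, proposes the GENEB benchmark, which compares 40 models on 100 tasks under a unified probing protocol and yields key findings such as unstable model rankings and limited gains from scale.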

Comments: change first page figure, fix model sizes, add more consistency

English abstract

Progress in genomic foundation models is difficult to assess due to fragmented benchmarks, incompatible evaluation protocols, and task-specific reporting. As a result, claims of superiority or generality across models are often not directly comparable. We introduce GENEB, a large-scale diagnostic benchmark that evaluates frozen representations from 40 genomic foundation models across 100 tasks spanning 13 functional categories under a unified probing-based protocol, including few-shot regimes. GENEB enables controlled comparison across model scale, architecture, tokenization, and pretraining data while explicitly exposing task-level trade-offs. Our analysis shows that aggregate leaderboards are unstable: model rankings vary sharply across task categories, scale provides only modest and inconsistent gains, and architectural and pretraining alignment frequently outweigh parameter count. These results highlight limitations of current evaluation practices and position GENEB as a reference framework for principled comparison and category-aware model selection in genomic machine learning.

2606.04474 2026-06-12 cs.CL eess.AS version update

Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention

Ming-Hao Hsu, Xiaohai Tian, Jun Zhang, Zhizheng Wu

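AI summary: Diagnoses entity binding failures in the logical reasoning of speech LLMs and proposes an entity-aware chain-of-thought method that substantially improves reasoning accuracy.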

Comments: INTERSPEECH 2026

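AI abstract (translated from Chinese)

Speech large language models underperform text models on complex reasoning tasks. We reveal that this modality gap is not a uniform cognitive deficit. Evaluating three different speech LLMs, we find that speech-to-text (S2T) matches or exceeds text-to-text (T2T) on spatial, syntactic, and factual tasks. However, on logical tasks that require entity tracking, S2T accuracy drops to chance level. We diagnose this localized degradation as an entity binding failure: continuous speech features cause the model to lose precise entity-property associations during implicit reasoning. To address this, we propose Entity-Aware Chain-of-Thought (EA-CoT), which forces speech LLMs to explicitly enumerate entities and bind them to claims before reasoning. Strikingly, EA-CoT bridges the gap even when spoken names are misrecognized, yielding an absolute accuracy gain of up to 24.4%. Ablations confirm that these gains stem entirely from explicit semantic binding, reframing the modality gap as a solvable bottleneck.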

English abstract

Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this gap is not a uniform cognitive deficit. Evaluating two architecturally diverse SLLMs, we show speech-to-text (S2T) matches or exceeds text-to-text (T2T) on spatial, syntactic, and factual tasks. Yet on logical tasks requiring entity tracking, S2T accuracy collapses to chance. We diagnose this as an entity binding failure: continuous speech features blur precise entity-property associations during implicit reasoning. To validate this diagnosis, we introduce Entity-Aware Chain-of-Thought (EA-CoT), a lightweight inference-time intervention forcing SLLMs to enumerate entities and bind them to claims before reasoning. EA-CoT bridges the gap, even when spoken names are misrecognized, yielding up to a 24.4 percentage-point accuracy gain. Ablations confirm the gains stem from explicit semantic binding, reframing the gap as an elicitation failure rather than a missing capability.

2606.04364 2026-06-12 cs.CV cs.LG version update

Spatially Grounded Concept Bottleneck Models via Part-Factorized Attention

Dhanesh Ramachandram

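AI summary: Proposes a part-factorized concept bottleneck model that constrains attention with spatial priors, achieving interpretability and improved localization accuracy in fine-grained recognition.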

Comments: Updated results with GlobalAttention Tokens

English abstract

Concept bottleneck models (CBMs) predict a layer of human-named attributes before predicting a class, which makes their decisions auditable. On fine-grained recognition tasks the concept heads are usually free to attend anywhere in the image, so a head named for one body region can be satisfied by evidence on another. This work studies a part-factorized CBM that removes that freedom by construction. The method has three components built on a frozen DINOv3 vision transformer. A learned foreground gate, trained on DINOv3 patch features, suppresses background patches inside the part attention. A set of part queries cross-attends to patch features and each of the 312 CUB attributes is routed, through a fixed concept-to-part map, to read only from the part token its name implies. A learnable two-dimensional Gaussian prior, injected additively in log space into the attention logits, breaks the permutation symmetry among part queries; its means are initialized from the dataset-average keypoint location of each part, which requires no per-image keypoint supervision at training or test time. On CUB-200-2011 the spatial-prior model matches a fully supervised baseline (88.85% versus 88.95% top-1) while raising pointing accuracy by 16 points (52.6% versus 36.4%). Replacing bounding-box supervision with a PCA foreground target and combining it with the Gaussian prior removes all per-image supervision and reaches 88.6% top-1 at about 70% pointing accuracy. A keypoint-fraction sweep shows that 0.5% of the training set (about 27 images) suffices to initialize the prior with no measurable loss. Removing part identity entirely is the harder case: without any spatial prior, pointing accuracy collapses to 2.9%.
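The idea of injecting a Gaussian prior additively in log space into attention logits can be sketched as follows. This is an illustrative toy, not the paper's model: the 3x3 patch grid, the axis-aligned Gaussian, and the names `log_gaussian_prior` and `part_attention` are assumptions, and the normalizing constant of the Gaussian is dropped because softmax is invariant to a constant shift of the logits.

```python
import math


def log_gaussian_prior(coords, mean, sigma):
    """Unnormalized log-density of an axis-aligned 2D Gaussian at each patch center."""
    (mx, my), (sx, sy) = mean, sigma
    return [
        -((x - mx) ** 2) / (2 * sx ** 2) - ((y - my) ** 2) / (2 * sy ** 2)
        for x, y in coords
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def part_attention(content_logits, coords, mean, sigma):
    """Attention over patches with the spatial log-prior added to the content logits."""
    prior = log_gaussian_prior(coords, mean, sigma)
    return softmax([c + p for c, p in zip(content_logits, prior)])
```

With identical content logits, two part queries whose priors have different means attend to different patches, which is how such a prior can break the permutation symmetry among queries.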

2606.04009 2026-06-12 stat.ML cs.AI cs.LG 版本更新

Counterfactual Explanations for Deep Two-Sample Testing

深度双样本检验的反事实解释

Wei-Cheng Lai, Marco Simnacher, Christoph Lippert

AI总结针对深度双样本检验，提出基于扩散自编码器和MMD优化的反事实解释框架，生成样本级编辑以揭示驱动假设拒绝的特征。

Comments: 17 pages

AI中文摘要

双样本检验是检测科学领域中分布差异的基本工具，但经典检验（包括基于核的检验）在高维结构化数据（如图像）上可能效果不佳。最近的深度双样本检验通过学习信息表示提高了这些场景下的灵敏度，但它们对哪些数据特征驱动拒绝原假设 $H_0$ 提供的洞察有限。为解决此问题，我们提出了一种用于深度双样本检验的反事实解释框架，该框架生成样本级编辑，将观测值从源组移向目标组，同时明确减少检验所测量的差异。我们的方法将扩散自编码器与预训练的深度双样本检验模型相结合，并在检验模型的表示空间中优化最大均值差异（MMD）目标，以生成合理的反事实。我们通过检验统计量和由此产生的双样本p值的变化来量化分布级效应。我们在合成2D形状数据集和两个MRI队列上评估了该方法。在这两种设置下，反事实变换相对于原始样本持续增加p值，表明编辑后的源集在检验下在统计上更接近目标分布。我们使用LPIPS测量最小性，以确保反事实保持接近原始样本。由此产生的编辑提供了与检测到的组差异相关的特征的可解释证据。在MRI上，局部变化与队列之间已知的解剖学差异一致。

英文摘要

Two-sample testing is a fundamental tool for detecting distributional differences across scientific domains, but classical tests (including kernel-based tests) can be ineffective on high-dimensional structured data such as images. Recent deep two-sample tests improve sensitivity in these settings by learning informative representations, yet they provide limited insight into which data features drive rejection of the null hypothesis $H_0$. To address this issue, we propose a counterfactual explanation framework for deep two-sample testing that generates sample-level edits moving observations from a source group toward a target group while explicitly reducing the discrepancy measured by the test. Our method combines a diffusion autoencoder with a pretrained deep two-sample test model and optimizes a maximum mean discrepancy (MMD) objective in the test model's representation space to produce plausible counterfactuals. We quantify distribution-level effects through changes in the test statistic and the resulting two-sample p-values. We evaluate the method on synthetic 2D shape datasets and two MRI cohorts. Across both settings, the counterfactual transformations consistently increase p-values relative to the original samples, indicating that the edited source set becomes statistically closer to the target distribution under the test. We measure minimality using LPIPS to ensure the counterfactuals remain close to the original samples. The resulting edits provide interpretable evidence of the features associated with the detected group differences. On MRI, the localized changes are consistent with known anatomical differences between cohorts.
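
To make the reported quantities concrete, here is a minimal sketch of a squared-MMD estimate and a permutation p-value. It is only an illustration under strong assumptions: it works on raw 2D points with a fixed RBF kernel rather than in a learned representation space, and the "edit" is a plain translation standing in for the diffusion-autoencoder counterfactuals. All function names are hypothetical.

```python
import math
import random

def rbf(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd2(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def mean_kernel(a, b):
        return sum(rbf(u, v, bandwidth) for u in a for v in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

def permutation_p_value(xs, ys, n_perm=100, seed=0):
    """Permutation p-value with the usual add-one correction."""
    rng = random.Random(seed)
    observed = mmd2(xs, ys)
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mmd2(pooled[:len(xs)], pooled[len(xs):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy data: the target group is drawn around (3, 3), the source around (0, 0).
rng = random.Random(1)
source = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(15)]
target = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(15)]
# Stand-in "counterfactual edit": translate every source point toward the target.
edited = [(x + 3.0, y + 3.0) for x, y in source]

mmd_before, mmd_after = mmd2(source, target), mmd2(edited, target)
p_before = permutation_p_value(source, target)
p_after = permutation_p_value(edited, target)
```

In this toy setting the edited set has a smaller MMD to the target than the original source set, which mirrors the direction of the effect the abstract reports for p-values.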

2606.03096 2026-06-12 cs.CL 版本更新

Can Factual Opinions Be Edited (Manipulated) in Large Language Models?

大型语言模型中的事实性观点能否被编辑（操纵）？

Yuanpu Cao, Ziyi Yin, Fenglong Ma, Jinghui Chen

AI总结提出FOE基准测试，评估当前知识编辑技术对事实性观点（如公众人物立场）的操纵能力，并发现其仅能实现表面修改，无法保持观点与证据的一致性；进而提出自生成证据对齐方法实现观点-证据对齐。

Comments: Accepted to the ACL 2026 Main Conference

AI中文摘要

大型语言模型（LLMs）正日益融入各个领域，这使得知识编辑技术变得至关重要，但也存在潜在危险。当前的编辑方法主要针对原子事实，忽视了操纵事实性观点（例如，公众人物在社会问题上的有记录的立场）所带来的重大风险。这种操纵可能重塑公众形象、影响选举并改变社会观点。为了系统评估这一威胁，我们引入了事实性观点编辑与证据（FOE）基准，涵盖261位公众人物、19个问题类别和2,178条完整的观点记录。我们的评估表明，当前的编辑技术在处理事实性观点时面临显著困难，通常仅能实现表面修改，而无法保持编辑后的观点与模型生成的支撑证据之间的一致性。为解决这一局限，我们进一步提出了一种简单而有效的自生成证据对齐方法，无需依赖显式指令即可实现观点-证据对齐。我们的基准和方法共同为理解LLMs中事实性观点编辑的新兴安全影响奠定了基础。

英文摘要

Large Language Models (LLMs) are increasingly integrated into various domains, making knowledge editing techniques crucial yet potentially hazardous. Current editing methods primarily target atomic facts, overlooking the significant risks associated with manipulating factual opinions, e.g., documented stances of public figures on societal issues. Such manipulation could reshape public images, influence elections, and alter societal views. To systematically assess this threat, we introduce the Factual Opinion Editing with Evidence (FOE) benchmark, which encompasses 261 public figures, 19 issue categories, and 2,178 complete opinion records. Our evaluations demonstrate that current editing techniques struggle significantly with factual opinions, often achieving only superficial changes while failing to preserve consistency between the edited opinion and the supporting evidence generated by the model. To address this limitation, we further propose a simple yet effective Self-Generated Evidence-Aligned method that achieves opinion-evidence alignment without relying on explicit instructions. Together, our benchmark and method provide a foundation for understanding the emerging security implications of factual opinion editing in LLMs.

2606.02778 2026-06-12 astro-ph.EP astro-ph.IM cs.LG 版本更新

One Transit Is All You Need: Detecting Exoplanets Through Learned Stellar Behaviour with EXOVEIL

一次凌星足矣：通过EXOVEIL学习恒星行为检测系外行星

Pratik Priyanshu

AI总结提出EXOVEIL系统，利用Transformer世界模型和自监督学习从原始光变曲线中检测单次凌星事件，在Kepler数据上实现高召回率，并零样本迁移至TESS和PLATO任务。

Comments: v3: appendix gallery of confirmed-planet recoveries added; Section 6 candidate catalogue reframed as transit-like anomalies for follow-up; TLS comparison table expanded

AI中文摘要

我提出EXOVEIL，一个凌星检测系统，它学习恒星亮度应有的样子，并在实际观测与之不符时发出标记。与需要相位折叠输入的现有系统不同，EXOVEIL直接在原始通量时间序列上运行，可以检测仅凌星一次的行星。一个Transformer世界模型在16,499条Kepler光变曲线上通过凌星掩蔽自监督学习训练，用于预测预期的恒星通量。一个带方差加权的匹配滤波检测器从预测残差中提取凌星信号。一个学习得到的分类器（XGBoost）将行星与假阳性区分开，在Kepler DR25上达到AUC 0.938。在单次凌星注入-恢复实验中，EXOVEIL在1000 ppm深度下恢复了32%的凌星，而所有基于分类的系统在该任务上因设计所限得分为0%。对3,737颗Kepler恒星进行盲搜索，得到179个不在DR25 TCE目录中的新的类凌星信号，其中包括46个单次凌星候选体。在不重新训练的情况下应用于PLATO LOPS2视场中的47颗已确认TESS行星，EXOVEIL实现了100%的恢复，展示了零样本跨探测任务迁移。在PLATO的25秒采样间隔下，检测深度可达100 ppm，接近类地行星范围。我提供了共形预测在凌星检测中的首次应用（95.9%经验覆盖率），并将该系统发布为可通过pip install exoveil安装的软件包，附带预训练权重和候选目录。

英文摘要

I present EXOVEIL, a transit detection system that learns what a star's brightness should look like and flags when reality disagrees. Unlike existing systems that require phase-folded input, EXOVEIL operates on raw flux time series and can detect planets that transit only once. A Transformer world model, trained on 16,499 Kepler light curves with transit-masked self-supervised learning, predicts expected stellar flux. A matched-filter detector with variance weighting extracts transit signals from the prediction residuals. A learned classifier (XGBoost) separates planets from false positives, achieving AUC 0.938 on Kepler DR25. Applied to single-transit injection-recovery, EXOVEIL recovers 32% of transits at 1000 ppm depth, a task where all classification-based systems score 0% by construction. A blind search of 3,737 Kepler stars yields 179 new transit-like signals not present in the DR25 TCE catalogue, including 46 monotransit candidates. Applied without retraining to 47 confirmed TESS planets in the PLATO LOPS2 field, EXOVEIL achieves 100% recovery, demonstrating zero-shot cross-mission transfer. At PLATO's 25-second cadence, detection reaches 100 ppm -- approaching the Earth-analog regime. I provide the first application of conformal prediction to transit detection (95.9% empirical coverage) and release the system as pip install exoveil with pretrained weights and a candidate catalogue.
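
A variance-weighted matched filter on prediction residuals can be sketched as follows. This is a generic textbook form with a box-shaped template, not the EXOVEIL code: the template shape, the function name `matched_filter_snr`, and the toy noise levels are assumptions.

```python
import math
import random

def matched_filter_snr(residuals, variances, width):
    """Slide a unit-depth box over flux residuals.

    For each start index, returns the inverse-variance-weighted depth
    estimate divided by its standard error. Positive values indicate dips.
    """
    scores = []
    for start in range(len(residuals) - width + 1):
        num = 0.0
        den = 0.0
        for i in range(start, start + width):
            weight = 1.0 / variances[i]
            num += -residuals[i] * weight  # template is -1 inside the box
            den += weight
        scores.append(num / math.sqrt(den))
    return scores

rng = random.Random(0)
n, width, depth = 100, 6, 1e-3        # 1e-3 relative flux = 1000 ppm
sigma = [1e-4] * n
for i in range(10, 16):
    sigma[i] = 1e-3                   # a noisy stretch that gets down-weighted
residuals = [rng.gauss(0.0, s) for s in sigma]
for i in range(40, 40 + width):
    residuals[i] -= depth             # injected box-shaped dip
scores = matched_filter_snr(residuals, [s * s for s in sigma], width)
best_start = max(range(len(scores)), key=scores.__getitem__)
```

The weighting matters in the noisy stretch: without dividing by the per-point variance, large residuals there could outscore a genuine but shallow dip elsewhere.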

2606.02133 2026-06-12 cs.LG cs.AI 版本更新

Variational Learning for Insertion-based Generation

基于插入生成的变分学习

Yangtian Zhang, Zhe Wang, Arthur Gretton, Rex Ying, David van Dijk, Michalis K. Titsias, Jiaxin Shi

AI总结提出插入过程（IP）模型，通过排列变分推断联合学习插入位置、内容和终止条件，支持变长生成并提升非自回归序列建模质量。

AI中文摘要

非单调序列生成方法，如掩码扩散模型，通过允许以非固定和预设的顺序生成token，为从左到右的自回归建模提供了一种灵活的替代方案。尽管具有实际优势，但大多数现有的非单调模型是顺序无关的，并依赖于固定长度的网格，限制了它们支持变长生成和自适应插入顺序的能力。在这项工作中，我们引入了一个概率框架，用于在变长插入模型中学习插入顺序。我们形式化了插入轨迹与排列之间的双射对应关系，这使得数据似然能够精确重参数化为排列上的和。基于这一结果，我们提出了插入过程（IP），这是一种随机生成模型，它联合学习在哪里插入、插入什么以及何时终止，并通过基于排列的变分推断进行训练。与先前的固定画布方法不同，IP原生支持变长生成，并学习数据驱动的插入顺序偏好。在目标条件规划和分子字符串生成上的实验表明，在缺乏规范从左到右结构的领域中，学习插入顺序提高了建模质量和泛化能力。

英文摘要

Non-monotonic sequence generation methods, such as masked diffusion models, provide a flexible alternative to left-to-right autoregressive modeling by allowing tokens to be generated in non-fixed and prescribed orders. Despite their practical advantages, most existing non-monotonic models are order-agnostic and rely on a fixed-length grid, limiting their ability to support variable-length generation and adaptive insertion order. In this work, we introduce a probabilistic framework for learning insertion order in variable-length insertion models. We formalize a bijective correspondence between insertion trajectories and permutations, which enables an exact reparameterization of the data likelihood as a sum over permutations. Building on this result, we propose the Insertion Process (IP), a stochastic generative model that jointly learns where to insert, what to insert, and when to terminate, trained via permutation-based variational inference. Unlike prior fixed-canvas approaches, IP natively supports variable-length generation and learns data-driven preferences over insertion orders. Experiments on goal-conditioned planning and molecular string generation demonstrate that learning insertion order improves both modeling quality and generalization in domains without a canonical left-to-right structure.
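
The correspondence between insertion trajectories and permutations can be made concrete with a small sketch. It shows one natural construction, assumed here for illustration; the paper's formal definition may differ in details, and the function names are hypothetical.

```python
import bisect

def permutation_to_insertions(tokens, order):
    """order[k] is the final index of the token generated at step k.

    Returns a list of (slot, token) pairs, where slot is the position
    in the current partial sequence at which the token is inserted.
    """
    placed = []  # sorted final indices already on the canvas
    steps = []
    for idx in order:
        slot = bisect.bisect_left(placed, idx)
        steps.append((slot, tokens[idx]))
        placed.insert(slot, idx)
    return steps

def replay(steps):
    """Apply the insertions to an empty canvas and return the sequence."""
    canvas = []
    for slot, token in steps:
        canvas.insert(slot, token)
    return canvas

def insertions_to_permutation(steps):
    """Inverse map: recover order[k] from the insertion slots alone."""
    canvas = []
    for k, (slot, _) in enumerate(steps):
        canvas.insert(slot, k)  # canvas[i] = step that produced final index i
    order = [0] * len(canvas)
    for final_idx, k in enumerate(canvas):
        order[k] = final_idx
    return order

tokens = list("CCO")
order = [2, 0, 1]  # generate the last token first, then the first, then the middle
steps = permutation_to_insertions(tokens, order)
```

Since each trajectory maps back to exactly one permutation, summing over permutations is the same as summing over all insertion trajectories that produce a given sequence, which is the reparameterization the abstract refers to.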

2606.02044 2026-06-12 cs.LG physics.med-ph 版本更新

Realistic noise synthesis reduces bias and improves tissue microstructure estimation with supervised machine learning

真实噪声合成减少偏差并改善有监督机器学习的组织微结构估计

Bradley G. Karat, Maëliss Jallais, Ali R. Khan, Santiago Aja-Fernández, Jelle Veraart, Marco Palombo

AI总结针对扩散MRI中模拟与实测信号噪声不匹配导致的协变量偏移问题，提出真实噪声合成框架，通过引入Rician期望和有效后处理噪声方差，显著降低参数估计偏差并提高精度。

Comments: * Shared first author

AI中文摘要

扩散MRI能够无创探测组织微结构，但准确的参数估计受到噪声相关效应的挑战。在基于模拟数据训练的有监督机器学习框架中，模拟信号与采集信号的噪声特性差异引入了一种协变量偏移，导致训练和推理时的输入信号分布不同。我们研究了这种不匹配对微结构参数估计的影响，并提出了一种真实噪声合成（RNS）框架来缓解该问题。RNS将Rician期望和有效后处理噪声方差同时纳入模拟训练信号。Rician期望使用MPPCA估计的噪声标准差建模，而有效标准差则从预处理数据的球谐残差中导出。该方法使用cylinder-zeppelin和SANDI模型在多个SNR水平的模拟数据集以及具有重复采集的体内扩散数据上进行了评估。还评估了对噪声误估计的敏感性。训练过程中忽略幅度诱导的噪声效应会产生系统性的、依赖于SNR的参数偏差，尤其是在低SNR下。引入Rician期望显著降低了偏差，使其达到噪声感知的非线性最小二乘拟合的水平。对有效标准差进行建模进一步提高了精度。性能在很大程度上独立于回归架构，但对准确的噪声估计敏感。这些发现表明，在模拟训练数据中进行真实噪声建模可以减轻信号域的协变量偏移，并且对于无偏的监督微结构估计至关重要，特别是在与高b值或高空间分辨率相关的低SNR区域。

英文摘要

Diffusion MRI enables non-invasive probing of tissue microstructure, but accurate parameter estimation is challenged by noise-related effects. In supervised machine learning frameworks trained on simulated data, discrepancies between the noise characteristics of simulated and acquired signals introduce a form of covariate shift, whereby the input signal distribution differs between training and inference. We investigated the impact of this mismatch on microstructure parameter estimation and propose a realistic noise synthesis (RNS) framework to mitigate it. RNS incorporates both the Rician expectation and the effective post-processing noise variance into simulated training signals. The Rician expectation was modelled using a noise standard deviation estimated with MPPCA, while the effective standard deviation was derived from spherical harmonic residuals of preprocessed data. The method was evaluated using the cylinder-zeppelin and the SANDI models on simulated datasets across multiple SNR levels and on in vivo diffusion data with repeated acquisitions. Sensitivity to noise misestimation was also assessed. Ignoring magnitude-induced noise effects during training produced systematic, SNR-dependent parameter bias, particularly at low SNR. Incorporating the Rician expectation substantially reduced bias to the level of noise-aware nonlinear least-squares fitting. Modelling the effective standard deviation further improved precision. Performance was largely independent of regression architecture but sensitive to accurate noise estimation. These findings demonstrate that realistic noise modelling in simulated training data mitigates signal-domain covariate shift and is essential for unbiased supervised microstructure estimation, particularly in low-SNR regimes associated with high b-values or high spatial resolution.
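
The magnitude-induced bias that motivates the Rician expectation can be seen in a short Monte Carlo sketch. It only illustrates the effect; it is not the RNS pipeline, and the signal and noise values are arbitrary choices for the demonstration.

```python
import math
import random

def mean_magnitude(signal, sigma, n_samples=20000, seed=0):
    """Monte Carlo mean of |signal + complex Gaussian noise| (Rician)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        real = signal + rng.gauss(0.0, sigma)
        imag = rng.gauss(0.0, sigma)
        total += math.hypot(real, imag)
    return total / n_samples

low_snr = mean_magnitude(signal=1.0, sigma=1.0)    # SNR = 1
high_snr = mean_magnitude(signal=20.0, sigma=1.0)  # SNR = 20
```

At SNR 1 the mean magnitude sits well above the true signal, while at SNR 20 the two nearly coincide. Training signals that add zero-mean Gaussian noise without this upward shift would therefore differ from measured magnitude data mainly in the low-SNR regime.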

2606.01621 2026-06-12 cs.CV cs.RO 版本更新

Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation

Goal2Pixel: 将目标锚定到像素以实现视觉语言导航

Muyi Bao, Yuxin Cai, Hang Xu, Zongtai Li, Jinxi He, Jingfan Tang, Chen Lv, Ji Zhang, Yaqi Xie, Wenshan Wang

AI总结提出Goal2Pixel范式，通过将连续环境中的视觉语言导航（VLN-CE）重新定义为可导航像素锚定，利用图像平面作为统一空间接口，预测可见导航像素并反投影为3D航点，结合可见性感知关键帧记忆和坐标感知辅助损失，在减少VLM调用次数的同时实现竞争性性能。

Comments: 8 pages

AI中文摘要

视觉语言模型（VLM）已成为连续环境中视觉语言导航（VLN-CE）的常见基础。然而，大多数基于VLM的方法将导航视为低级动作预测，这种接口模糊、受限于短视运动基元，且由于重复的VLM查询而效率低下。我们提出Goal2Pixel，一种纯基于像素的范式，将VLN-CE重新定义为可导航像素锚定。Goal2Pixel不预测动作，而是使用图像平面作为VLM推理与机器人运动之间的统一空间接口：模型预测一个对智能体可见的可导航像素，该像素被反投影为3D航点以进行前向导航。对于非前向动作，我们在图像平面上附加辅助指令区域，其中左/右/下区域分别解释为左转、右转和停止。为了实现长程导航，我们提出了一种可见性感知的关键帧记忆，用于紧凑且信息丰富的历史表示。为了将预训练的VLM适应于可导航像素锚定，我们引入了语义嵌入和坐标感知辅助损失。Goal2Pixel在需要比先前方法更少的VLM推理调用的情况下，实现了具有竞争力的最新性能。在R2R-CE Val-Unseen上，它以每集仅7.75次VLM调用达到54.1%的SR和52.5%的SPL，而直接动作预测在32.9%的SR下需要46.62次调用，减少了6倍。同样的趋势在RxR-CE上也成立。项目页面：https://baobao0926.github.io/Goal2Pixel/。

英文摘要

Vision-language models (VLMs) have become a common foundation for vision-and-language navigation in continuous environments (VLN-CE). Yet most VLM-based methods cast navigation as low-level action prediction, an interface that is ambiguous, tied to short-horizon motion primitives, and inefficient due to repeated VLM querying. We propose Goal2Pixel, a pure pixel-based paradigm that reformulates VLN-CE as navigable pixel grounding. Rather than predicting actions, Goal2Pixel uses the image plane as a unified spatial interface between VLM reasoning and robot motion: the model predicts a navigable pixel visible to the agent, which is back-projected into a 3D waypoint for forward navigation. For non-forward actions, we append auxiliary directive regions to the image plane, where the left/right/bottom regions are interpreted as turning left, turning right, and stopping, respectively. To enable long-horizon navigation, we propose a visibility-aware keyframe memory for compact and informative history representation. To adapt pretrained VLMs to navigable pixel grounding, we introduce semantic embeddings and coordinate-aware auxiliary losses. Goal2Pixel achieves competitive state-of-the-art performance while requiring fewer VLM inference calls than prior methods. On R2R-CE Val-Unseen it achieves 54.1% SR and 52.5% SPL with just 7.75 VLM calls per episode, 6x fewer than the 46.62 required by direct action prediction at 32.9% SR. The same trend holds on RxR-CE. Project page: https://baobao0926.github.io/Goal2Pixel/.
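
The pixel-to-waypoint step and the directive regions can be sketched as below. The abstract does not give the camera model or the region layout, so the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the `margin` width, the tie-breaking order between regions, and both function names are assumptions for illustration only.

```python
def back_project(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with metric depth.

    Returns (x, y, z) in the camera frame, with z along the optical axis.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def decode_prediction(px, py, width, height, margin):
    """Interpret a pixel on a canvas padded on the left, right and bottom.

    Hypothetical layout: the image occupies columns [margin, margin + width)
    and rows [0, height); anything outside is a directive region.
    """
    if py >= height:
        return ("stop", None)
    if px < margin:
        return ("turn_left", None)
    if px >= margin + width:
        return ("turn_right", None)
    return ("waypoint", (px - margin, py))

action, pixel = decode_prediction(px=484, py=240, width=640, height=480, margin=64)
waypoint = back_project(pixel[0], pixel[1], depth=2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

One prediction thus yields either a discrete directive or a metric waypoint several steps ahead, which is consistent with the reduced number of VLM calls per episode that the abstract reports.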

2606.01538 2026-06-12 cs.GR cs.CV cs.LG 版本更新

MPMWorlds: Material-Point-Method Simulations for Inferring and Extrapolating Physical Dynamics

MPMWorlds: 用于推断和外推物理动力学的物质点法模拟

Žiga Kovačič, Kevin Ellis

AI总结通过构建2D物质点法（MPM）模拟数据集，研究从视频推断物理动力学并外推时间演化的能力，比较代码生成与视频扩散方法的优劣。

Comments: 16 pages, 13 figures. Project page: this https URL

AI中文摘要

为了研究从视频推断物理动力学并将其向前外推的能力，我们组装了一个包含丰富物理现象（如可变形物体、流体、运动物体和发射器）的2D物质点法（MPM）物理模拟数据集。我们在此数据集上研究了代码生成和视频扩散方法，通过改变物理相关辅助信息的数量来识别它们的优缺点。代码生成模型除了提供自动合成MPM模拟的工作演示外，还揭示了这种方法在从视觉输入推断物理参数方面存在困难，但相对于视频扩散，它能产生物理和时间上稳定的向前外推结果，而视频扩散模型能更强烈地从视觉输入中识别几何属性，但会产生物理上不可信的外推结果。

英文摘要

To study the ability to infer physical dynamics from videos and extrapolate them forward in time, we assemble a dataset of 2D Material Point Method (MPM) physical simulations covering rich physical phenomena such as deformable objects, fluids, kinetic objects, and emitters. We study code generation and video diffusion approaches on this dataset, identifying their strengths and weaknesses by varying the amount of physically relevant side information. The code generation model, beyond giving a working demonstration of automatic synthesis of MPM simulations, reveals that such an approach struggles with inferring physical parameters from visual input, but relative to video diffusion, produces physically and temporally stable extrapolations forward in time, while the video diffusion model more strongly identifies geometric properties from visual input but produces physically implausible extrapolations.

2606.01172 2026-06-12 cs.LG stat.ME stat.ML 版本更新

Revisiting Neural Processes via Fourier Transform and Volterra Series

通过傅里叶变换和Volterra级数重新审视神经过程

Peiman Mohseni, Nick Duffield, Raymond K. W. Wong

AI总结本文利用Volterra展开和集合傅里叶卷积，提出了两种新的条件神经过程模型，解决了现有平移等变神经过程在可解释性和计算效率上的局限性。

AI中文摘要

从有限的、不规则采样的测量中建模未知的潜在函数是科学和工程中的一个反复出现的挑战。神经过程（NPs）是一类概率函数模型，是有前景的解决方案——尤其是当赋予领域特定的对称性（如平移等变性）时，这提高了样本效率和泛化能力。然而，现有的平移等变NPs面临两个局限性：（i）它们堆叠带有非线性的通用组件，模糊了诱导的函数类并限制了可解释性；（ii）卷积设计依赖于具有局部感受野的核，并需要密集的均匀输入网格，而基于注意力的方法避免了这些问题，但随观测数量呈二次方缩放。我们通过两个贡献解决了这两个问题。首先，利用Volterra展开，我们将连续平移等变算子表征为高阶卷积的和，实现了分析透明性，同时允许通过一阶卷积进行高效近似。其次，我们引入了集合傅里叶卷积（SFConvs），这是一种频域参数化方法，直接在不规则采样点上操作，实现近似全局感受野，并在观测数量上线性缩放。基于这些思想，我们提出了两种条件神经过程（CNPs）：SFConvCNPs，它堆叠带有非线性的SFConv块，以及SFVConvCNPs，它整合了Volterra公式。在合成和真实世界数据集上的实验证明了我们的方法相对于最先进基线的有效性。

英文摘要

Modeling unknown latent functions from finite, irregularly sampled measurements is a recurring challenge across science and engineering. Neural processes (NPs), a family of probabilistic functional models, are promising solutions -- especially when endowed with domain-specific symmetries like translation equivariance, which improve sample efficiency and generalization. Yet existing translation-equivariant NPs face two limitations: (i) they stack generic components with non-linearities, obscuring the induced function class and limiting interpretability; and (ii) convolutional designs rely on kernels with local receptive fields and require dense uniform input grids, while attention-based methods avoid these issues but scale quadratically with the number of observations. We address both with two contributions. First, using the Volterra expansion, we characterize continuous translation-equivariant operators as sums of higher-order convolutions, yielding analytical transparency while admitting efficient approximation by first-order convolutions. Second, we introduce set Fourier convolutions (SFConvs), a frequency-domain parameterization that operates directly on irregularly sampled points, achieves approximately global receptive fields, and scales linearly in the number of observations. Building on these ideas, we propose two conditional NPs (CNPs): SFConvCNPs, which stack SFConv blocks with non-linearities, and SFVConvCNPs, which integrate the Volterra formulation. Experiments on synthetic and real-world datasets demonstrate our methods' efficacy against state-of-the-art baselines.
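
For reference, the classical Volterra series for a time-invariant operator acting on a scalar signal $x$ is written below. This is the standard textbook form, given here as an assumption about what the abstract's "sums of higher-order convolutions" refers to; the paper's own formulation for sets of irregularly sampled observations may differ.

$$
y(t) = h_0 + \sum_{n=1}^{\infty} \int \cdots \int h_n(\tau_1, \ldots, \tau_n) \prod_{i=1}^{n} x(t - \tau_i) \, d\tau_1 \cdots d\tau_n
$$

The $n = 1$ term is an ordinary convolution of $x$ with the kernel $h_1$, and every term commutes with shifts of $x$, which is why the expansion is a natural fit for translation-equivariant operators.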

2606.00807 2026-06-12 cs.AI cs.HC 版本更新

Interaction-Centered Intelligence: Toward an Interaction-Based Theory of Human-AI Co-Creation

以交互为中心的智能：迈向基于交互的人机共创理论

Nicholas Davis

AI总结本文提出以交互作为主要分析单元，通过分布式认知、具身认知等理论，论证智能涌现于交互动态而非孤立计算，并引入交互中心智能框架。

AI中文摘要

传统人工智能很大程度上将智能概念化为发生在有界代理内的孤立计算。在经典AI、机器学习以及许多生成系统中，主要的分析单元仍然是单个模型或自主系统，通过输出、基准、预测准确性或优化性能进行评估。尽管这些方法取得了重大进展，但它们往往低估了交互在智能、创造力、意义和适应性行为涌现中的作用。本文提出将交互作为共创AI和更广泛的以交互为中心的智能的主要分析单元。借鉴分布式认知、具身认知、生成认知（enaction）、参与式意义建构、人机交互和计算创造力，本文追溯了向越来越关系性智能观的历史进程。基于先前在创造性意义建构、量化共创以及诸如绘图学徒和AI绘图伙伴等共创系统上的工作，本文论证了智能通过代理、环境和社会技术系统之间不断演化的交互动态涌现，而非仅仅通过内部计算。本文引入了以交互为中心的智能作为理解人机共创、协作涌现、适应性参与和交互动态的框架。该框架不单凭生成的输出评估智能，而是强调随时间展开的交互轨迹、协调模式、参与式投入、适应性调节和交互漂移。文中还讨论了其对可解释的共创AI、混合智能、生成认知式AI（enactive AI）和未来人机系统的启示。

英文摘要

Traditional artificial intelligence has largely conceptualized intelligence as isolated computation occurring within bounded agents. Across classical AI, machine learning, and many generative systems, the dominant unit of analysis remains the individual model or autonomous system evaluated through outputs, benchmarks, prediction accuracy, or optimization performance. While these approaches have produced major advances, they often under-theorize the role of interaction in the emergence of intelligence, creativity, meaning, and adaptive behavior. This paper proposes interaction as the primary unit of analysis for co-creative AI and interaction-centered intelligence more broadly. Drawing from distributed cognition, embodied cognition, enaction, participatory sense-making, human-computer interaction, and computational creativity, the paper traces a historical progression toward increasingly relational accounts of intelligence. Building upon prior work in Creative Sense-Making, quantified co-creation, and co-creative systems such as the Drawing Apprentice and AI Drawing Partner, it argues that intelligence emerges through evolving interaction dynamics among agents, environments, and socio-technical systems rather than solely through internal computation. The paper introduces Interaction-Centered Intelligence as a framework for understanding human-AI co-creation, collaborative emergence, adaptive participation, and interactional dynamics. Rather than evaluating intelligence solely through generated outputs, the framework emphasizes interaction trajectories, coordination patterns, participatory engagement, adaptive regulation, and interactional drift unfolding through time. Implications for explainable co-creative AI, hybrid intelligence, enactive AI, and future human-AI systems are discussed.

2606.00193 2026-06-12 cs.CL 版本更新

BOUTEF: A Multilingual Corpus for FakeNews in North Africa -- Language as a Weapon

BOUTEF：北非假新闻的多语种语料库——语言作为武器

Kamel Smaili, Yassine Toughrai, Amina Laggoun, David Langlois

AI总结本文构建了包含阿尔及利亚和突尼斯多语种（MSA、方言、Arabizi、法语、英语等）的假新闻语料库BOUTEF，通过定量与定性分析揭示了假新闻依赖情感化叙事、耸人听闻框架和混合语言实践来增强传播力，而辟谣内容则更注重事实和验证。

AI中文摘要

社交媒体上假新闻的快速传播已成为一个重大挑战，尤其是在北非等多语言和资源匮乏的环境中。本文介绍了BOUTEF，这是一个大规模多语言语料库，旨在研究阿尔及利亚和突尼斯假新闻的传播、特征和影响。该语料库整合了三个互补部分：虚假叙述、真实叙述以及相关的用户生成评论，并附有经过验证的辟谣信息。它涵盖了广泛的语言和语言变体，包括现代标准阿拉伯语、阿尔及利亚和突尼斯方言、阿拉伯语拉丁化拼写、法语、英语以及代码转换语言。基于这一资源，我们进行了结合定量和定性方法的全面实证分析。我们考察了主题分布、语言和修辞策略、情感模式以及社交参与动态。统计分析揭示了主题类别与信息真实性之间的显著关联，以及用户参与度与虚假内容可见性之间的强相关性。我们的发现表明，假新闻严重依赖情感化的叙述、耸人听闻的框架以及增强病毒式传播和受众参与的混合语言实践。相比之下，辟谣内容采用更注重事实和验证的风格。此外，阿尔及利亚和突尼斯之间的比较分析揭示了由社会政治背景塑造的共享动态和国家特定特征。结果强调了非正式语言实践在错误信息扩散和接收中的作用。通过提供丰富、带注释且公开可用的数据集，这项工作有助于推进假新闻检测、低资源语言处理以及理解复杂语言环境中的信息紊乱的研究。

英文摘要

The rapid spread of fake news on social media has become a major challenge, particularly in multilingual and under-resourced contexts such as North Africa. In this paper, we introduce BOUTEF, a large-scale multilingual corpus designed to study the propagation, characteristics, and impact of fake news in Algeria and Tunisia. The corpus integrates three complementary components: fake narratives, genuine narratives, and associated user-generated comments, along with verified debunking information. It covers a wide range of languages and linguistic varieties, including MSA, Algerian and Tunisian dialects, Arabizi, French, English, and code-switched language. Building on this resource, we conduct a comprehensive empirical analysis combining quantitative and qualitative approaches. We examine thematic distributions, linguistic and rhetorical strategies, sentiment patterns, and social engagement dynamics. Statistical analyses reveal significant associations between thematic categories and message veracity, as well as strong correlations between user engagement and the visibility of fake content. Our findings show that fake news relies heavily on emotionally charged narratives, sensational framing, and hybrid linguistic practices that enhance virality and audience engagement. In contrast, debunking content adopts a more factual and verification-oriented style. Furthermore, a comparative analysis between Algeria and Tunisia highlights both shared dynamics and country-specific characteristics shaped by sociopolitical contexts. The results emphasize the role of informal language practices in the diffusion and reception of misinformation. By providing a rich, annotated, and publicly available dataset, this work contributes to advancing research on fake news detection, low-resource language processing, and the understanding of information disorders in complex linguistic environments.

2605.31514 2026-06-12 cs.CL cs.AI cs.CY 版本更新

If LLMs Have Human-Like Attributes, Then So Does Age of Empires II

如果LLM具有类人属性，那么《帝国时代II》也具有

Adrian de Wynter

AI总结通过训练简单神经网络于《帝国时代II》，论证LLM的拟人属性在经验上非唯一，提出应假设LLM非独特性而非拟人属性来设计实验。

Comments: Fixed corollary 1, added stat sig

AI中文摘要

关于大型语言模型（LLM）和基于LLM的智能体工作流已有大量研究。然而，该领域的许多工作声称、赋予或假设它们具有普遍化的拟人属性（例如道德或对自然语言的理解）。我们的目标不是支持或反对这些属性的存在，而是指出这些结论可能不正确。为此，我们在电子游戏《帝国时代II》上构建并训练了一个简单的神经网络，并注意到任何处于足够强大基底（如乐高或大波士顿地区）中的实体也可能呈现此类属性。因此，LLM声称的拟人属性在经验上非唯一：尽管某些属性（例如对提示的响应）可能保持不变，但其他属性（如对其感知行为的解释）可能随基底改变。因此，任何基于经验的讨论都需要明确的测量标准；否则解释就留给了表征。然后我们表明，假设这些属性在系统中存在或不存在，独立于基底并以普遍化方式，会导致循环或无信息的结论，无论实验者对该主题的观点如何。最后，我们提出一个“零”假设，即假设LLM非独特性而非拟人属性来设置实验，并给出示例。我们还讨论了对我们工作的潜在反对意见，简要调查了该领域，并证明了《帝国时代II》是功能完备和图灵完备的。

英文摘要

Much research has been carried out on large language models (LLMs) and LLM-powered agentic workflows. However, many works within the field state emergence of, ascribe to, or assume, generalised anthropomorphic attributes to them (e.g., morality or understanding of natural language). Our goal is not to argue in favour or against the existence of these attributes, but to point out that these conclusions could be incorrect. For this we build and train a simple neural network on the videogame Age of Empires II, and note that any entity in a sufficiently-powerful substrate, such as LEGO or the Greater Boston Area, could also present such attributes. Hence, the purported anthropomorphic attributes of LLMs are empirically non-unique: although some properties (e.g., responses to prompts) could remain invariant, others, such as the interpretation of their perceived behaviour, might change with the substrate. Thus, any empirically-grounded discussion on these attributes requires explicit measurement criteria; otherwise the interpretation is left to the representation. We then show that assuming that these attributes exist or not in a system, independent of the substrate and in a generalised way, leads to either circular or uninformative conclusions. This is regardless of the experimenter's viewpoint on the subject, or whether the outcome shows existence or non-existence. Finally we propose a 'null' assumption, where one assumes LLM non-uniqueness instead of assuming anthropomorphic attributes to set up an experiment, along with examples of it. We also discuss potential objections to our work, briefly survey the field, and prove that Age of Empires II is functionally- and Turing-complete.

2605.31419 2026-06-12 cs.CV cs.RO 版本更新

Triangle Splatting SLAM

三角形泼溅SLAM

Nicholas Fry, Eric Dexheimer, Kirill Mazur, Paul H. J. Kelly, Andrew J. Davison

AI总结提出首个使用可微三角形作为3D地图表示的密集RGB-D SLAM系统，通过在线可微渲染实现跟踪与建图，并支持实时网格转换与编辑。

Comments: 26 pages, 11 figures

AI中文摘要

我们提出了一种密集RGB-D SLAM系统，使用可微三角形作为3D地图表示。虽然3D高斯泼溅已成为新颖视角合成的主要方法，但三角形仍然是传统渲染硬件、游戏引擎以及需要显式几何的下游任务（如模拟、碰撞和编辑）的标准图元。最近的离线方法表明，通过在一组带姿态的图像上进行Delaunay三角剖分，可以将非结构化的“三角形汤”优化为照片级逼真的网格。基于这一见解，我们提出了第一个密集SLAM系统，通过在线可微渲染三角形汤来执行跟踪和建图。地图可以通过受限Delaunay三角剖分实时转换为连通网格，从而实现网格变形和碰撞检测等新的在线功能。在Replica和TUM-RGBD数据集上，我们的系统在3D几何方面优于基线，匹配相机跟踪精度，并支持基于网格的在线场景编辑。

英文摘要

We present a dense RGB-D SLAM system using differentiable triangles as the 3D map representation. While 3D Gaussian Splatting has emerged as the leading method for novel-view synthesis, triangles remain the standard primitive for traditional rendering hardware, game engines, and downstream tasks requiring explicit geometry such as simulation, collision, and editing. Recent offline methods have demonstrated that an unstructured 'triangle soup' can be optimised into a photorealistic mesh via Delaunay triangulation across a set of posed images. Building upon this insight, we present the first dense SLAM system to employ Triangle Splatting to perform both tracking and mapping through online differentiable rendering of a triangle soup. The map can be converted into a connected mesh on-the-fly via restricted Delaunay triangulation, enabling new online capabilities such as mesh deformation and collision checking. On Replica and TUM-RGBD, our system outperforms baselines on 3D geometry, matches the camera-tracking accuracy, and enables online mesh-based scene editing.

2605.28507 2026-06-12 cs.LG 版本更新

Universal Time Series Generation with Neural Controlled Differential Equations

基于神经受控微分方程的通用时间序列生成

Torben Berndt, Elyes Farjallah, Leif Seute, Raeid Saqur, Benjamin Walker, Jan Stühmer

AI总结本文证明结构化线性受控微分方程（SLiCEs）是通用时间序列生成器，并提出生成式SLiCEs（G-SLiCEs）用于路径空间上的流匹配，在概率预测和下游任务中表现优异，尤其适用于不规则网格。

详情

AI中文摘要

最近关于状态空间模型（SSMs）序列通用性的工作引入了高效、最大表达性的连续时间方法用于时间序列建模。虽然这些工作侧重于判别设置，我们将这一视角扩展到生成式时间序列建模，通过证明最大表达性的结构化线性受控微分方程（SLiCEs）是通用时间序列生成器，即它们可以在$W_\infty$中逼近紧致潜在集上连续因果推前映射的诱导路径律。基于这些理论结果，我们提出了生成式SLiCEs（G-SLiCEs），一种用于路径空间上流匹配的最大表达性连续时间模型。实验上，我们表明表达性提高了概率预测和下游任务的性能，同时保留了连续时间模型的优势，例如泛化到任意观测网格。这对于不规则网格尤其有利，而固定网格模型通常难以处理此类网格。

英文摘要

Recent work on the sequence universality of State Space Models (SSMs) has introduced efficient, maximally expressive continuous-time approaches for time-series modelling. While these works focus on discriminative settings, we extend this perspective to generative time-series modelling by proving that maximally expressive Structured Linear Controlled Differential Equations (SLiCEs) are universal time-series generators, in the sense that they can approximate the induced path laws of continuous causal pushforwards on compact latent sets in $W_\infty$. Building on these theoretical results, we propose Generative SLiCEs (G-SLiCEs), a maximally expressive continuous-time model for flow matching on path-space. Empirically, we show that expressivity improves performance in probabilistic forecasting and downstream tasks, while retaining the advantages of continuous-time models such as generalising to arbitrary observation grids. This is particularly beneficial for irregular grids, where fixed-grid models often struggle.

2605.27628 2026-06-12 cs.AI cs.CY cs.ET cs.MA eess.SY 版本更新

Intelligence as Managed Autonomy: Failure, Escalation, and Governance for Agentic AI Systems

智能作为受管自主：代理型AI系统的失败、升级与治理

Srini Ramaswamy

AI总结本文提出SMARt模型，通过形式化能力检测认知漂移、暂停推理、尝试恢复并在可靠性下降时放弃控制，以解决自主AI系统中的幻觉和持续不合理行为问题。

详情

Comments: This peer-reviewed paper is to appear in the Journal of Intelligent and Robotic Systems

AI中文摘要

随着自主和代理型AI系统在机器人和人机环境中的规模扩大，管理幻觉和持续但不合理的行动仍然是一个开放挑战。本文并未将这些失败仅仅归因于模型或对齐限制，而是探讨了无界自主性的架构脆弱性——即假设代理应在不确定性上升时继续运行的预设。本文引入了一种受管自主理论，通过形式化能力来定义智能行为：检测认知漂移、暂停推理、尝试恢复，并在可靠性下降时最终放弃控制。我们通过SMARt（具有受管/撤销转换的自管理多层自主推理）模型实例化该理论，该模型是一个四层框架，包含稳定、元认知、辅助和受管状态。通过开发定时、受保护的Petri网形式化，我们建立了系统的理论有界属性，展示了架构如何形式化地强制升级、约束无效输出，并确保在指定条件下的治理可达性。我们进一步分析了如何在不同的操作环境（例如医疗、机器人等）中结合特定领域的触发集，在满足完备性和健全性标准的前提下系统地维护安全性。由于这些触发被设计为自适应的，SMARt模型允许代理操作范围随时间安全、受控地扩展。我们得出结论，在自主生命周期内形式化失败管理是实现可靠且受治理人工智能的关键一步。

英文摘要

As autonomous and agentic AI systems scale in robotic and human-machine environments, managing hallucination and persistent but unjustified action remains an open challenge. Rather than attributing these failures solely to model or alignment limitations, this paper explores the architectural vulnerability of unbounded autonomy - the presumption that an agent should continue operating regardless of rising uncertainty. It introduces a theory of managed autonomy that defines intelligent behavior through the formal capacity to detect epistemic drift, suspend reasoning, attempt recovery, and ultimately surrender control when reliability diminishes. We instantiate this theory via the SMARt (Self-Managing Multi-tier Autonomous Reasoning with Regulated/Revoked transitions) model, a four-layer framework featuring Stable, Meta-cognitive, Assisted, and Regulated states. By developing a timed, guarded Petri net formulation, we establish theoretically bounded properties for the system, demonstrating how architecture can formally mandate escalation, constrain invalid outputs, and ensure governance reachability under specified conditions. We further analyze how incorporating domain-specific trigger sets across varied operational settings (e.g., healthcare, robotics, etc.) can systematically preserve safety, assuming completeness and soundness criteria are met. Because these triggers are designed to be adaptive, the SMARt model accommodates the safe, controlled expansion of an agent's operational scope over time. We conclude that formalizing failure management within the autonomy lifecycle is a crucial step toward realizing reliable and governed artificial intelligence.


2605.26358 2026-06-12 physics.flu-dyn cs.LG Updated version

Deep Learning-based Algebraic Reynolds Stress Closures for RANS Simulations of Turbulent Flows


Daniel Dehtyriov, Jonathan F. MacArt, Justin Sirignano

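AI Summary: Proposes DARSM, a physics-driven deep learning closure model in which a neural network maps flow invariants to the empirical parameters of an implicit algebraic Reynolds stress equation, with adjoint equations enabling end-to-end optimization. On square-duct and periodic-hill benchmarks it reduces average velocity error by a factor of 2-4.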


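AI Abstract (translated from Chinese)

Turbulence is ubiquitous in engineering and science, but direct simulation is prohibitively expensive. The Reynolds-averaged Navier-Stokes (RANS) equations save more than ten orders of magnitude in computation but introduce unclosed terms (the closure problem). Offline-trained machine-learning (ML) closure models suffer distribution shift in predictive simulations, while ML methods that bypass the governing equations struggle to generalize from scarce high-fidelity data. We develop a physics-based deep learning closure model for RANS, the Deep Algebraic Reynolds Stress Model (DARSM), which can be trained on small datasets and generalizes accurately across Reynolds numbers, to unseen geometries, and to different flow regimes. A neural network maps flow invariants to the empirical parameters of an implicit algebraic Reynolds stress equation, which is derived from the Reynolds stress transport equations under the weak-equilibrium assumption and imposes physics-based structure on the ML closure. End-to-end optimization through the governing PDEs and the coupled implicit closure eliminates distribution shift, but both unrolled and implicit automatic differentiation fail on the stiff coupled solver. We derive adjoint equations that exploit the solver's implicit-explicit structure to make optimization efficient. On the canonical square-duct and periodic-hill benchmarks, DARSM reduces the average test velocity error relative to baseline RANS by a factor of 2-4 across Reynolds numbers, geometries, and flow regimes, with peak case-level reductions of 12x. A model trained on attached, anisotropy-dominated flow (square duct) generalizes accurately, without retraining, to separated flow (periodic hills), which is a change in the underlying physical regime. DARSM also outperforms five established ML methods: offline training, tensor-basis neural networks, field-inversion machine learning, DeepONets, and physics-informed neural networks.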

English Abstract

Turbulence is ubiquitous in engineering and science, yet direct simulation is prohibitively expensive. The Reynolds-averaged Navier-Stokes (RANS) equations provide savings exceeding ten orders of magnitude but introduce unclosed terms (the closure problem). Offline-trained machine-learning (ML) closures suffer distribution shift in predictive simulations, while ML methods that bypass the governing equations struggle to generalise from scarce high-fidelity data. We develop a physics-derived deep learning closure model for RANS, the Deep Algebraic Reynolds Stress Model (DARSM), which can be trained on small datasets and accurately generalise across Reynolds numbers, to unseen geometries, and to different flow regimes. A neural network maps flow invariants to empirical parameters in an implicit algebraic Reynolds stress equation, derived from the Reynolds stress transport equations under the weak-equilibrium assumption, imposing physics-based structure on the ML closure. End-to-end optimisation through the governing PDEs and the coupled implicit closure eliminates distribution shift, but both unrolled and implicit automatic differentiation fail on the stiff coupled solver. We derive adjoint equations that exploit the solver's implicit-explicit structure for efficient optimisation. On canonical square-duct and periodic-hill benchmarks, DARSM reduces average test velocity error over baseline RANS by $2$-$4\times$ across Reynolds number, geometries, and flow regimes, with peak case-level reductions of $12\times$. The model trained on attached, anisotropy-dominated flows (square duct) accurately generalises without retraining to separated flows (periodic hills), a regime change in the underlying physics. DARSM also outperforms five established ML methods: offline training, tensor-basis neural networks, field-inversion machine learning, DeepONets, and physics-informed neural networks.
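For readers unfamiliar with the closure problem mentioned above, the textbook incompressible form may help. The equations below are the standard Reynolds decomposition and RANS momentum equation, written under the assumption of constant density $\rho$ and kinematic viscosity $\nu$; they are background only and are not DARSM's implicit algebraic closure, which the abstract does not spell out.

```latex
% Reynolds decomposition of velocity into a mean and a fluctuation
u_i = \bar{u}_i + u_i'

% Incompressible RANS momentum equation
\frac{\partial \bar{u}_i}{\partial t}
  + \bar{u}_j \frac{\partial \bar{u}_i}{\partial x_j}
  = -\frac{1}{\rho} \frac{\partial \bar{p}}{\partial x_i}
  + \nu \frac{\partial^2 \bar{u}_i}{\partial x_j \partial x_j}
  - \frac{\partial \overline{u_i' u_j'}}{\partial x_j}
```

The last term contains the Reynolds stress $\overline{u_i' u_j'}$, a symmetric tensor with six independent components that the mean-flow equations do not determine. A closure model supplies this term.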


2605.26144 2026-06-12 cs.SE cs.AI cs.CV Updated version

VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents


JunJia Guo, Yuhang Yao, Jiawei (Joe) Zhou, Jingdi Chen

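AI Summary: Proposes the VISTA benchmark, which uses multiple input conditions and evaluation metrics to measure how well LLM-based agents generate functional, visually consistent web apps from visual specs.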


Comments: Project page: this https URL Code: this https URL Dataset: this https URL

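AI Abstract (translated from Chinese)

We present VISTA (VIsual Spec-To-App Benchmark), a benchmark for evaluating the end-to-end web-app generation capabilities of LLM-based agents. Unlike earlier code generation benchmarks that focus on algorithmic tasks, VISTA targets realistic UI-centric development, in which agents must produce functional, visually coherent applications from underspecified inputs. We define five prompt-information conditions that vary along two axes, visual/structural fidelity and stack constraint: (1) text only, with free stack choice; (2) text plus reference screenshots, under three specified stacks; (3) text plus reference screenshots, with free stack choice; (4) text plus screenshots and pruned Figma structure, under a single specified stack; and (5) text plus screenshots and pruned Figma structure, with free stack choice. For robust evaluation, each page in the benchmark is manually annotated with interactive UI components and around three visual anchor points, which addresses the known limitations of script-based testing tools such as Playwright in open-ended code generation settings. Evaluation combines DOM-grounded reference matching, behavior-specific browser tests, and CLIP-based visual similarity, which together measure structural alignment, behavioral completeness, and overall visual fidelity. Using VISTA, we evaluate four agent systems drawn from two model families and two harnesses, and find that visual fidelity and functional correctness are partially decoupled across both input conditions and agents, and that agents' editing styles vary sharply but are largely orthogonal to task quality. VISTA establishes a rigorous and reproducible foundation for advancing research on agent-based software engineering.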

English Abstract

We present VISTA (VIsual Spec-To-App Benchmark), a benchmark for evaluating the end-to-end web-app generation capabilities of LLM-based agents. Unlike prior code generation benchmarks that focus on algorithmic tasks, VISTA targets realistic UI-centric development, where agents must produce functional, visually coherent applications from underspecified inputs. We define five prompt-information conditions that vary along two axes, visual/structural fidelity and stack constraint: (1) text only with free stack choice, (2) text with reference screenshots under three specified stacks, (3) text with reference screenshots under free stack choice, (4) text with screenshots and pruned Figma structure under a single specified stack, and (5) text with screenshots and pruned Figma structure under free stack choice. To enable robust evaluation, each page in the benchmark is manually annotated with interactive UI components and around three visual anchor points, addressing the well-known limitations of script-based testing tools such as Playwright in open-ended code generation settings. Evaluation combines DOM-grounded reference matching, behavior-specific browser tests, and CLIP-based visual similarity, jointly measuring structural alignment, behavioral completeness, and overall visual fidelity. We use VISTA to assess four agent systems drawn from two model families and two harnesses, finding that visual fidelity and functional correctness are partially decoupled across both input conditions and agents, and that agent editing style varies sharply but is largely orthogonal to task quality. VISTA establishes a rigorous and reproducible foundation for advancing agent-based software engineering research.
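Embedding-based visual similarity of the kind mentioned above is commonly scored as the cosine similarity between two image embedding vectors. The sketch below assumes the embeddings have already been computed by some image encoder. The `cosine_similarity` helper and its plain-list inputs are illustrative, and the abstract does not say how VISTA computes or aggregates its CLIP-based score.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

Because the score depends only on direction, scaling either embedding leaves it unchanged, which is why encoders' outputs are often normalized before comparison.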
