arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.16257 2026-05-18 cs.RO

DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo

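Hanwen Wang, Weizhi Zhao, Xiangyu Wang, Siyuan Huang, He Lin, Boyuan Zheng, Rongtao Xu, Gang Wang, Yao Mu, He Wang, Lue Fan, Hongsheng Li, Zhaoxiang Zhang, Tieniu Tan

AI summary: This paper presents DexJoCo, a benchmark and toolkit with 11 functional tasks that evaluate tool use, bimanual coordination, long-horizon execution, and reasoning for dexterous hands. It uses a low-cost data collection system and domain randomization to assess robustness, and it reveals the limitations of current policies.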

Comments
8 pages, 6 figures, project page is available at: https://dexjoco.github.io
Abstract

Achieving human-level manipulation requires dexterous robotic hands capable of complex object interactions. Advancing such capabilities further demands standardized benchmarks for systematic evaluation. However, existing dexterous benchmarks lack tasks that reflect the unique manipulation capabilities of dexterous hands over parallel grippers, as well as comprehensive evaluation pipelines. In this paper, we present DexJoCo, a benchmark and toolkit for task-oriented dexterous manipulation, comprising 11 functionally grounded tasks that evaluate tool-use, bimanual coordination, long-horizon execution, and reasoning. We develop a low-cost data collection system and collect 1.1K trajectories across these tasks, with support for domain randomization to assess robustness. We benchmark modern models under diverse settings, including visual and dynamics randomization, multi-task training, and action-head adaptation. Through extensive empirical analysis, we identify several important insights and common limitations of current policies in dexterous manipulation, highlighting key challenges for future research in dexterous hand robot learning. Project page available at: https://dexjoco.github.io

2605.16255 2026-05-18 cs.DC cs.AI

Designing Datacenter Power Delivery Hierarchies for the AI Era

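Grant Wilkins, Fiodar Kazhamiaka, Alok Gautam Kumbhare, Chaojie Zhang, Ricardo Bianchini

AI summary: This paper studies the challenges of designing datacenter power delivery hierarchies for the AI era and proposes an evaluation framework that combines throughput, power, and cost metrics to analyze how multi-resource stranding affects deployable capacity, capital expenditure, and performance.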

Abstract

Demand for AI accelerators is rapidly increasing rack power density, with projections approaching 1 MW per deployment by 2027. This poses a major challenge for datacenter power delivery designers. As power densities increase, a datacenter designed for a different target density may strand power, i.e., may be unable to use all the power that its delivery hierarchy has provisioned. Designs must remain efficient over long datacenter lifetimes and multiple hardware generations. Power utilization is particularly important as grid power capacity is a scarce resource in the AI era. Designing an efficient power delivery hierarchy for the long run is difficult because rack placement feasibility, workload impact, and cost depend jointly on electrical topology, deployment granularity, placement policy, power oversubscription, and workload mix. Moreover, these factors evolve over time, have interdependencies across multiple resource dimensions, and generally do not lend themselves to closed-form analysis. To address this challenge, we develop a framework for evaluating datacenter power delivery designs using throughput, power, and cost metrics over realistic arrival, oversubscription, and decommissioning sequences. The framework combines projection models for GPU, compute, and storage deployments with operational factors grounded in production data from Microsoft Azure. Our results show that multi-resource stranding materially changes deployable capacity, effective capital expenditure, and delivered performance, and quantify how rising density from rack- and pod-scale AI systems shapes these outcomes. For AI datacenter design, the relevant planning objective is not installed megawatts, but deployable capacity over time.
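
As a rough illustration of what "stranded power" means in the abstract above, the sketch below fills identical rows with identical racks and reports the provisioned power that no rack can use. The row capacity, rack sizes, and the `place_racks` helper are hypothetical and are not taken from the paper's framework, which also models topology, oversubscription, and workload mix.

```python
def place_racks(row_capacity_kw, rack_power_kw, n_rows):
    """Fill each row with as many identical racks as fit.

    Returns (deployed_kw, stranded_kw). All inputs are illustrative
    assumptions, not values from the paper.
    """
    racks_per_row = int(row_capacity_kw // rack_power_kw)
    deployed = racks_per_row * rack_power_kw * n_rows
    provisioned = row_capacity_kw * n_rows
    return deployed, provisioned - deployed


# Same hall (10 rows of 1000 kW each), three hypothetical rack generations.
for rack_kw in (40, 130, 600):
    deployed, stranded = place_racks(1000, rack_kw, 10)
    print(f"{rack_kw} kW racks: deployed {deployed} kW, stranded {stranded} kW")
```

Under these made-up numbers, a hall sized for low-density racks strands a growing share of its provisioned power as rack density rises, which is the gap between installed megawatts and deployable capacity that the abstract points to.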

2605.16253 2026-05-18 cs.AR

TTP: A Hardware-Efficient Design for Precise Prefetching in Ray Tracing

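Yavuz Selim Tozlu, Anshul Naithani, Huiyang Zhou

AI summary: This paper proposes the TTP prefetcher, which uses the tree traversal stack in the RT units for efficient prefetching, reducing memory latency and improving ray tracing performance. Experiments show a 1.48x average speedup with very low hardware overhead.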

Abstract

Ray tracing (RT) is a 3D graphics technique that offers highly realistic visuals. It is becoming prominent and accessible as GPU vendors have integrated dedicated ray tracing acceleration hardware. However, tracing millions of rays through 3D scenes consisting of high numbers of triangles in real time is challenging and requires expensive hardware. The main bottleneck in RT workloads is the expensive traversal of the Bounding Volume Hierarchy (BVH), a large tree structure that encodes the 3D scene. BVH traversal is a memory-bound problem, as the GPU threads spend most of their time reading tree node data from memory. In this work, we attack the memory latency bottleneck of ray tracing through prefetching. We propose a novel hardware prefetcher, named Tree Traversal Prefetcher (TTP), for ray tracing. The main idea is to leverage the existing tree traversal stack in the RT units for highly accurate prefetching. In particular, TTP prefetches nodes using the addresses already available on the hardware traversal stacks of each thread. For depth-first search (DFS) based traversal, prefetches are generated when nodes are being popped consecutively from the traversal stack, potentially corresponding to upward traversal through the tree. We evaluate TTP on a cycle-level simulator, Vulkan-sim 2.0, and show that it achieves 1.48x speedup on average (up to 1.89x) compared to the baseline, with nearly negligible hardware overhead. TTP achieves 98.92% average L1 accuracy, which is the ratio of the prefetched blocks being actually referenced by demand loads. The coverage, computed as the ratio of L1 miss reduction over baseline L1 misses, is 31.54%, correlating well with the achieved speedup.
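
The two prefetcher metrics defined at the end of the abstract above can be written out directly. The sketch below is a minimal restatement of those definitions; the function names and the counter values are hypothetical and are not simulator output.

```python
def prefetch_accuracy(prefetched_blocks, referenced_prefetched_blocks):
    """Fraction of prefetched blocks later referenced by a demand load."""
    return referenced_prefetched_blocks / prefetched_blocks


def prefetch_coverage(baseline_l1_misses, l1_misses_with_prefetch):
    """L1 miss reduction divided by the baseline L1 miss count."""
    return (baseline_l1_misses - l1_misses_with_prefetch) / baseline_l1_misses


# Hypothetical counters, chosen only to exercise the formulas.
accuracy = prefetch_accuracy(prefetched_blocks=2000, referenced_prefetched_blocks=1900)
coverage = prefetch_coverage(baseline_l1_misses=50000, l1_misses_with_prefetch=40000)
print(f"accuracy = {accuracy:.2%}, coverage = {coverage:.2%}")
```

High accuracy with moderate coverage, as reported in the abstract, would mean that almost every prefetch is useful but that a large share of baseline misses are still not prefetched.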

2605.16250 2026-05-18 cs.CL cs.AI cs.DB cs.LG

A Generative AI Framework for Intelligent Utility Billing CO2 Analytics and Sustainable Resource Optimisation

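Pavan Manjunath, Thomas Pruefer

AI summary: This paper proposes a generative-AI framework that integrates four production-grade capabilities, enabling natural-language bill generation, consumption forecasting, and carbon emission optimization.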

Abstract

Distribution utilities are now expected to deliver bills that customers can actually read, attach a defensible carbon number to every kWh sold, and schedule load against grid stress and emissions constraints. We propose an end-to-end framework that unifies four production-grade capabilities under one architectural roof: a generative-AI agent that drafts each customer's natural-language billing statement from structured numeric inputs under a constrained decoding policy; a transformer-based forecaster that supplies the day-ahead consumption estimate with calibrated quantile bands

2605.16249 2026-05-18 quant-ph cs.CC

The Collapse of Unentangled Stoquastic Merlin-Arthur Proof Systems

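William Gay, Fernando Granha Jeronimo

AI summary: By proving that unentanglement gives no additional power to stoquastic Merlin-Arthur verification, this work sheds light on the role of interference in detecting entanglement. Its core tool is a positive de Finetti theorem, and it concludes that StoqMa(k) is contained in AM∩PP⊆PSPACE.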

Abstract

Entanglement and interference are among the most fundamental properties of quantum mechanics. In this work, we investigate the role and power of interference in the context of detecting entanglement. We do so through a computational complexity lens by proving that unentanglement gives no additional power to stoquastic Merlin-Arthur verification. For every polynomial number of provers $k=k(n)$, \[ \text{StoqMa}(k)=\text{StoqMa} . \] Conceptually, the proof separates the role of entanglement from the role of interference: once destructive interference is ruled out by stoquasticity, the product-state constraint can be absorbed into a polynomially larger one-witness stoquastic verification. The main analytic ingredient is a positive, value-based de Finetti theorem for separately symmetric extensions. If $M$ is an entrywise nonnegative positive semidefinite contraction on $A_1\otimes\cdots\otimes A_k$, then the nonnegative product value of $M$ is approximated to additive error $\varepsilon$ by the largest eigenvalue of \[ \Pi_R^{<k} (M_{A_{1,1}\cdots A_{k-1,1}A_k}\otimes I) \Pi_R^{<k}, \qquad R=O\!\left(\frac{k^2\sum_i\log\dim A_i}{\varepsilon^3}\right), \] where $\Pi_R^{<k}$ is the operator on $A_1^{\otimes R} \otimes \cdots \otimes A_{k-1}^{\otimes R} \otimes A_k$ projecting to the subspace $\mathrm{Sym}^R(A_1) \otimes \cdots \otimes \mathrm{Sym}^{R}(A_{k-1}) \otimes A_k$. The spectral relaxation is then realized as an actual one-witness stoquastic verifier after replacing the uniform permutation averages in the symmetric projectors by inverse-polynomially close dyadic inverse-invariant averages. Consequently, \[ \text{StoqMa}(k)=\text{StoqMa}\subseteq\text{AM}\cap\text{PP}\subseteq\text{PSPACE} . \] The positive de Finetti theorem is isolated as a standalone technique and may be useful in other nonnegative tensor-optimization and stoquastic-verification settings.

2605.16245 2026-05-18 cs.CY cs.AI cs.CL cs.LG cs.SI

AI-Mediated Communication Can Steer Collective Opinion

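Stratis Tsirtsis, Kai Rawal, Chris Russell, Brent Mittelstadt, Sandra Wachter

AI summary: This paper studies how AI that mediates human-to-human communication affects collective opinion formation. Empirical and theoretical analyses show that directional biases introduced by AI can be amplified through the network and shift collective opinion, and the paper examines whether platforms can control such biases.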

Abstract

Generative artificial intelligence (AI) is increasingly integrated into the online platforms where humans exchange opinions; large language models (LLMs) now polish users' posts on LinkedIn and provide context for content shared on X. While prior work has shown that AI can express biased opinions and shape individuals' opinions during human-AI interactions, less attention has been paid to its influence on collective opinion formation when mediating human-to-human communication. We address this gap via a combination of empirical and theoretical analyses. We show empirically that LLMs from multiple popular families introduce directional biases when instructed to edit human-written texts on contested topics, for example, nudging texts in favor of gun control and against atheism. Building on this observation, we introduce a mathematical model of opinion dynamics in which an AI system sits between users on a social network, transforming the opinions they express and perceive. By analytically characterizing the equilibrium of this model and performing simulations on real social network data, we show that biases introduced by AI in human-to-human communication can be amplified through the network and shift collective opinion in their direction. In light of these findings, we investigate whether such biases are controllable by online platforms. We audit the "Explain this post" feature on X and find evidence of pro-life bias in Grok's outputs on abortion-related content, which we trace back to specific design choices. We conclude with a discussion of the broader implications of our findings in relation to ongoing legislative efforts in the European Union.
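
To make the amplification claim in the abstract above concrete, here is a toy sketch of opinion dynamics with a mediator that shifts every expressed opinion by a small constant. It assumes a Friedkin-Johnsen-style update on a complete graph; this is a stand-in chosen for illustration and is not necessarily the model analyzed in the paper. The names `simulate`, `stubbornness`, and `ai_bias` are hypothetical.

```python
def simulate(innate, stubbornness, ai_bias, steps=500):
    """Toy mediated opinion dynamics on a complete graph (illustrative only).

    Each round, every user hears the average of all expressed opinions after
    the mediator has shifted each one by ai_bias, then mixes that average
    with their own innate opinion.
    """
    opinions = list(innate)
    for _ in range(steps):
        heard = sum(o + ai_bias for o in opinions) / len(opinions)
        opinions = [stubbornness * s + (1 - stubbornness) * heard for s in innate]
    return opinions


innate = [-0.5, -0.1, 0.0, 0.2, 0.4]
baseline = simulate(innate, stubbornness=0.2, ai_bias=0.0)
mediated = simulate(innate, stubbornness=0.2, ai_bias=0.05)
shift = sum(mediated) / len(mediated) - sum(baseline) / len(baseline)
print(f"per-message bias 0.05 -> shift in mean opinion {shift:.3f}")
```

In this toy model the mean opinion at equilibrium moves by `(1 - stubbornness) * ai_bias / stubbornness`, which is 0.2 here, four times the per-message bias. The repeated re-mediation of already shifted opinions is what produces the amplification.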

2605.16241 2026-05-18 cs.CV cs.AI

Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation

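Jin Shi, Brady Zhang, Yishun Lu

AI summary: This paper proposes the VLA-AD framework, which uses a vision-language model as an offline semantic supervisor to distill large VLA teacher models into lightweight student policies, improving efficiency and robustness through high-level semantic guidance.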

Abstract

Billion-parameter Vision-Language-Action (VLA) policies have recently shown impressive performance in robotic manipulation, yet their size and inference cost remain major obstacles for real-time closed-loop control. We introduce VLA-AD, a distillation framework that uses a Vision-Language Model as an offline semantic supervisor to transfer large VLA teachers into lightweight student policies. Instead of relying only on low-level action imitation, VLA-AD augments teacher-provided 7-DoF action targets with high-level semantic guidance, including task phase anchors and multi-frame operating-direction descriptions. These auxiliary signals are used only during training: at test time, the student policy runs independently, with neither the VLA teacher nor the VLM required. We evaluate VLA-AD on three LIBERO benchmark suites. Using OpenVLA-7B as the teacher, our method produces a 158M-parameter student, yielding a $44\times$ reduction in model size while matching the teacher with only a $0.27\%$ average relative gap. The resulting policy runs at 12.5 Hz on an RTX 4090, achieving a $3.28\times$ inference speedup over OpenVLA-7B. We further show that the same semantic distillation pipeline generalizes to a different $\pi_{0.5}$-4B teacher, where the student outperforms the teacher on two suites and remains within $0.53\%$ on libero_goal. Additional analysis indicates that phase-level supervision and multi-frame directional cues make the student less sensitive to noisy teacher actions, such as erroneous high-frequency gripper changes. Overall, VLA-AD demonstrates that offline semantic guidance from VLMs can substantially improve the efficiency, robustness, and deployability of VLA policy distillation.
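
The headline ratios in the abstract above can be sanity-checked with a few lines of arithmetic. The teacher size of 7.0e9 parameters is an assumption read off the model name "OpenVLA-7B"; the implied teacher control rate is derived here from the reported speedup and is not a figure stated in the abstract.

```python
teacher_params = 7.0e9  # assumption: nominal size implied by the name "OpenVLA-7B"
student_params = 158e6  # 158M-parameter student, as reported
size_reduction = teacher_params / student_params

student_hz = 12.5       # reported student control rate on an RTX 4090
speedup = 3.28          # reported inference speedup over the teacher
implied_teacher_hz = student_hz / speedup

print(f"size reduction ~ {size_reduction:.1f}x")
print(f"implied teacher rate ~ {implied_teacher_hz:.2f} Hz")
```

Under that assumption the size ratio rounds to the reported 44x, and the speedup implies a teacher running at a little under 4 Hz.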

2605.16239 2026-05-18 cs.LG

Dynamics-Level Watermarking of Flow Matching Models with Random Codes

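Shuchan Wang

AI summary: This paper proposes a new method for watermarking flow matching models that embeds random codes in the continuous dynamics, achieving reliable message recovery while preserving generation quality.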

Comments
18 pages, 3 figures, code available at: https://github.com/ShuchanWang/flow-matching-dynamics-watermarking
Abstract

We introduce a dynamics-level approach to watermarking generative models. Rather than embedding signals into model weights or outputs, we embed the watermark directly into the learned continuous dynamics -- the velocity field of a flow matching model. We formulate this as random coding over a continuous channel: a key-dependent perturbation is added during training, and the message is recovered at detection time from black-box queries. The perturbation is designed to leave the generated distribution unchanged. Experiments on MNIST and CIFAR-10 across different architectures confirm reliable message recovery, preserved generation quality, and chance-level decoding accuracy without the secret key.

2605.16238 2026-05-18 cs.AI

Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search

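Sarah Martinson, Michael P. Brenner, Martyna Plomecka, Brian P. Williams, Nicholas G. Reich, Zahra Shamsi

AI summary: This paper presents an autonomous system that uses LLM-guided tree search to generate, evaluate, and optimize executable forecasting software. During the 2025-2026 US respiratory season it produced methodologically diverse models for influenza, COVID-19, and RSV, and their ensemble matched or outperformed the CDC hub ensembles out-of-sample.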

Abstract

Probabilistic forecasting of infectious diseases is crucial for public health but relies on labor-intensive manual model curation by expert modeling teams. This bespoke development bottlenecks scalability to granular geographic resolutions or emerging pathogens. Here, we present an autonomous system using Large Language Model (LLM)-guided tree search to iteratively generate, evaluate, and optimize executable forecasting software. In a fully prospective, real-time evaluation during the 2025-2026 US respiratory season, the system autonomously discovered methodologically diverse models for influenza, COVID-19, and respiratory syncytial virus (RSV). Aggregating these machine-generated models yielded an ensemble that consistently matched or outperformed the gold-standard, human-curated Centers for Disease Control and Prevention (CDC) hub ensembles out-of-sample. The system successfully navigated data-scarce "cold start" scenarios for RSV. Moreover, controlled retrospective ablations revealed that optimizing log-scale distance metrics prevents reward hacking, while an automated judge-in-the-loop ensures structural fidelity to complex scientific theories. By autonomously translating epidemiological theory into accurate, transparent code, this framework overcomes the modeling labor bottleneck, enabling rapid deployment of expert-level disease forecasting at unprecedented scales.

2605.16233 2026-05-18 cs.AI cs.CL cs.LG cs.MA cs.SY eess.SY

FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast

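Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman

AI summary: FORGE uses a population broadcast mechanism to build self-generated memory without gradient updates, improving the decision-making of hierarchical ReAct agents. On the CybORG CAGE-2 task it substantially improves performance and lowers failure rates.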

Abstract

Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, where a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B-line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared against both a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7$\times$ over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, reducing major-failure rates (below $-100$) to as low as $\sim$1%. We find that (1) population broadcast is the critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, while Rules offers the best cost-reliability profile with $\sim$40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.
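
The outer loop described in the abstract above (broadcast the best instance's memory between stages, freeze instances that meet a graduation criterion) can be sketched as follows. Everything here is a hypothetical stand-in: `run_stage` replaces the Reflexion-style inner loop with a stub whose score simply grows with memory size, and the score threshold used for graduation is an assumed proxy for the paper's convergence criterion.

```python
import random


def run_stage(memory, rng):
    """Stub for the inner loop (hypothetical): returns (new_memory, score).

    A real system would run episodes and ask a reflection agent for lessons;
    here each stage appends one placeholder lesson and the score rises with
    the number of lessons, plus a little noise.
    """
    new_memory = memory + [f"lesson-{rng.randrange(1000)}"]
    score = -100.0 + 10.0 * len(new_memory) + rng.uniform(-5, 5)
    return new_memory, score


def forge(n_instances=4, n_stages=5, graduate_at=-60.0, seed=0):
    rng = random.Random(seed)
    memories = [[] for _ in range(n_instances)]
    scores = [float("-inf")] * n_instances
    graduated = [False] * n_instances
    for _ in range(n_stages):
        # Inner loop: only instances that have not graduated keep learning.
        for i in range(n_instances):
            if not graduated[i]:
                memories[i], scores[i] = run_stage(memories[i], rng)
        best = max(range(n_instances), key=lambda i: scores[i])
        for i in range(n_instances):
            if scores[i] >= graduate_at:
                graduated[i] = True  # freeze this instance
            elif i != best:
                memories[i] = list(memories[best])  # population broadcast
    return memories, scores, graduated


memories, scores, graduated = forge()
print(graduated, [len(m) for m in memories])
```

The design point the sketch tries to capture is that broadcast shares progress across the population, while graduation only stops spending compute on instances that are already good enough.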

2605.16232 2026-05-18 cs.CL cs.AI cs.ET cs.LG cs.SY eess.SY

A Unified Generative-AI Framework for Smart Energy Infrastructure: Intelligent Gas Distribution, Utility Billing, Carbon Analytics, and Quantum-Inspired Optimisation

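Pavan Manjunath, Thomas Pruefer

AI summary: This paper proposes a unified generative-AI framework that integrates intelligent gas distribution, billing, carbon analytics, and quantum-inspired optimisation to improve energy management efficiency and environmental accountability.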

Abstract

The accelerating convergence of smart metering, generative artificial intelligence, and quantum-inspired combinatorial optimisation is reshaping how energy utilities manage physical infrastructure, customer engagement, and environmental accountability

2605.16230 2026-05-18 cond-mat.mtrl-sci cs.LG

Universal Magnetic Structure Prediction from Atomic Coordinates with Near-Experimental Accuracy

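Abhijatmedhi Chotrattanapituk, Ryotaro Okabe, Eunbi Rha, Mariya Al-Hinai, Eugene Jiang, Daniel Pajerowski, Yongqiang Cheng, Joshua J. Turner, Mingda Li

AI summary: This paper proposes the magnetic structure network (MSN), which predicts magnetic structures directly from atomic crystal structures and uses the primitive modulated structure representation (PMSR) to encode modulated structures in a unified way, achieving high-accuracy magnetic structure prediction and offering a new approach to magnetic materials discovery.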

Comments
9 pages, 3 figures
Abstract

Magnetic order is a fundamental property of materials, governing collective behavior and enabling a broad range of functionalities. Yet magnetic structure remains difficult to determine: experiments are costly and specialized, while first-principles methods often struggle with the non-collinear and incommensurate orders found in real materials. Here we introduce the magnetic structure network (MSN), an E(3)-equivariant graph neural network that predicts both collinear and non-collinear magnetic structures directly from atomic crystal structures, trained on experimentally determined structures from MAGNDATA. By proposing the primitive modulated structure representation (PMSR), we are able to encode commensurate and incommensurate structures in a unified way without symmetry assumptions. The model achieves strong performance across all modulation components and reconstructs experimental magnetic structures with high fidelity. Our approach provides a scalable framework for rapid magnetic structure prediction and opens a route to data-driven discovery of magnetic materials.

2605.16229 2026-05-18 cs.IT math.IT math.ST stat.ML stat.TH

Breaking the Finite-Sample Barrier in Entropy Coupling

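Shahab Asoodeh, Jun Chen

AI summary: This paper introduces the minimum list entropy coupling and studies how dependent observations can break finite-sample limits. A conditional entropy analysis shows that independent observations reduce uncertainty exponentially, whereas dependent observations can eliminate it after finitely many samples.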

Abstract

Dependence among marginally constrained observations can break a finite-sample barrier. To formalize this phenomenon, we introduce the minimum list entropy coupling $H(P\|Q_1,\dots,Q_m)$, the minimum conditional entropy $H(X|Y_1,\dots,Y_m)$ over all joint distributions with prescribed discrete marginals $X\sim P$ and $Y_i\sim Q_i$. Unlike classical formulations based on independent observations, our model allows $Y_1,\dots,Y_m$ to be arbitrarily dependent while keeping each marginal fixed. This enlarged coupling space reveals a sharp dichotomy: independent observations reduce residual uncertainty exponentially, whereas dependent observations can eliminate it exactly after finitely many samples. We characterize this zero-entropy regime through necessary and sufficient conditions and give concrete structural criteria under which it occurs. In particular, under mild support assumptions, zero entropy is achieved with $O(\log(1/P_{\min}))$ observations, where $P_{\min}$ is the minimum nonzero mass of $P$. We also develop a greedy algorithm with monotone approximation guarantees for computing $H(P\|Q_1,\dots,Q_m)$. Finally, we show that the same framework formalizes finite-sample limits in distribution-matching representation learning and randomness extraction, where zero entropy corresponds to exact recovery and exact extraction.
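
The objective in the abstract above, the conditional entropy $H(X|Y_1,\dots,Y_m)$ of a joint distribution with fixed marginals, can be evaluated for small examples. The sketch below compares two couplings that share the same marginals (X uniform on four values, two uniform bits). It only illustrates the quantity being minimized; it does not implement the paper's greedy algorithm or its dependence dichotomy, and the example distributions are assumptions chosen for simplicity.

```python
from itertools import product
from math import log2


def conditional_entropy(joint):
    """H(X | Y) in bits for a joint pmf given as {(x, y): probability}."""
    p_y = {}
    for (x, y), p in joint.items():
        p_y[y] = p_y.get(y, 0.0) + p
    return -sum(p * log2(p / p_y[y]) for (x, y), p in joint.items() if p > 0)


# Coupling 1: X is independent of the observations (Y1, Y2).
independent = {
    (x, (y1, y2)): 0.25 * 0.25
    for x, y1, y2 in product(range(4), range(2), range(2))
}

# Coupling 2: same marginals, but X is a function of (Y1, Y2).
deterministic = {
    (2 * y1 + y2, (y1, y2)): 0.25
    for y1, y2 in product(range(2), range(2))
}

print(conditional_entropy(independent))
print(conditional_entropy(deterministic))
```

The first coupling leaves the full two bits of uncertainty about X, while the second leaves none, so the minimum over couplings for these marginals is zero with two observations.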

2605.16227 2026-05-18 cs.CR

LymphNode: A Plug-and-Play Access Control Method for Deep Neural Networks

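Hanyu Pei, Shang Liu, Zeyan Liu

AI summary: LymphNode is a new post-hoc defense framework that injects Generalized Sparse Universal Adversarial Perturbations to block model extraction and inversion attacks, providing lightweight and efficient access control.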

Comments
Accepted by the 56th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2026). Author accepted manuscript. 14 pages, 6 figures
Abstract

Deep Neural Networks (DNNs) are high-value intellectual property (IP), yet deploying them to edge environments exposes them to unrestricted oracle access, rendering them vulnerable to model extraction and inversion attacks. Existing defenses fail to address this practically: passive watermarking only offers post-hoc provenance, while active defenses impose prohibitive latency or require persistent access to sensitive training data. To bridge this gap, we propose LymphNode, a novel post-hoc defense framework that acts as an intrinsic "immune system" within the model. LymphNode enforces a strict "default-deny" policy: it actively neutralizes model utility for unauthorized queries via Generalized Sparse Universal Adversarial Perturbations (GSUAP) injected into the feature space, effectively blocking gradient estimation and data inference. Utility is selectively restored only for authorized inputs carrying a stealthy feature-domain credential. Our framework is highly practical: it is data-efficient, establishing robust protection with fewer than 100 samples ($<1\%$ of training data), and cross-dataset adaptable, enabling protection using public surrogate datasets. LymphNode thus provides a lightweight, immediately deployable defense for high-stakes scenarios where original training data is restricted or unavailable.

2605.16225 2026-05-18 cs.IT cs.NI cs.SY eess.SY math.IT

Preemption Revisited: Multi-Threshold Preemption Policies for AoI Minimization

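Sahan Liyanaarachchi, Sennur Ulukus, Nail Akar

AI summary: This paper studies multi-threshold preemption policies for systems with random update arrivals, evaluates the age of information (AoI) through an analytical framework, and demonstrates the policies' effectiveness for AoI optimization.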

Abstract

The study of optimal preemption policies for status update systems has been a recurring topic in the age of information (AoI) literature, where threshold-based structures have been shown to be optimal under a generate-at-will update generation model under certain assumptions. In this work, we study the effectiveness of threshold-based policies for a system with random update arrivals. In this regard, we introduce an analytical framework for evaluating the AoI of multi-threshold preemption policies and present interesting characteristics of the structure of the optimal preemption policy. We show the effectiveness of these threshold-based policies over the traditional probabilistic preemption policies and single-threshold policies, where we observe that significant gains in terms of AoI can be obtained by utilizing both the age of the packet and the age of the system when designing these preemption policies.

2605.16222 2026-05-18 cs.CL cs.LG

Artificial Aphasias in Lesioned Language Models

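Nathan Roll, Jill Kries, Laura Gwilliams, Cory Shain

AI summary: By lesioning language model parameters in a way that simulates aphasia, this work studies the models' functional organization. It finds that symptom distributions differ markedly between models and human aphasia, pointing to the influence of learning and processing details on language processing.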

Comments
49 pages, 13 figures
Abstract

Aphasias, selective language impairments which can arise from brain damage, reveal the functional organization of human language by providing causal links between affected brain regions and specific symptom profiles. Drawing on this literature, we introduce an aphasia-inspired technique to characterize the emergent functional organization of language models (LMs). We "lesion" (zero-out) model parameters and measure the effects of this intervention against clinical aphasia symptoms, as diagnosed by the Text Aphasia Battery (TAB). When applied to 112,426 outputs from five 1B-scale LMs, the full range of evaluated symptoms surfaces, but in distributions largely distinct from those of humans. Our method uncovers broad symptom-profile differences between attention components (query, key, value, output) and feed-forward components (up, gate, down), with weaker evidence for differences among components within the same mechanism. We also find an effect of depth, where lesions in early layers disproportionately cause syntactic and semantic symptoms while late-middle layers yield higher rates of phonological and fluency deficits. Although some LM lesions induce quantitatively more similar profiles to some human aphasia types than others, qualitative differences in symptom patterns between LMs and humans suggest that aphasia syndromes are heavily influenced by the details of learning and processing rather than being a domain-invariant consequence of disrupted language processing.

2605.16219 2026-05-18 cs.LG stat.ML

The Privacy Price of Tail-Risk Learning: Effective Tail Sample Size in Differentially Private CVaR Optimization

尾风险学习的隐私代价:差分隐私CVaR优化中的有效尾样本量

El Mustapha Mansouri

AI总结 研究揭示差分隐私对CVaR学习有效样本量的影响,提出隐私代价分解方法,推导出标量估计和有限类别的学习速率,并指出隐私学习在有效尾样本量上的核心挑战。

详情
Comments
34 pages, 3 figures, 2 tables
AI中文摘要

差分隐私改变了决定CVaR学习的有效样本量。对于尾质量τ,与隐私相关的样本量不是n,而是nτ;等价地,有效的私有尾样本量是εnτ。私有CVaR的超额风险可分解为普通的尾风险统计误差和隐私代价。这一分解对标量估计和有限类是完整的:标量估计的速率为Θ(B min{1,(nτ)^{-1/2}+(εnτ)^{-1}}),大小为M的有限类的速率为Θ(B min{1,√(log(2M)/(nτ))+log(2M)/(εnτ)})。这些完整的速率在纯DP下成立,其下界可推广到文中所述小δ范围内的近似DP。对于凸Lipschitz学习,模块化的上界与下界归约表明,CVaR特有的隐私项必然按1/(εnτ)的量级缩放,其维度依赖性继承自私有随机凸优化。综合起来,这些结果表明,在Θ(nτ)条有信息量的尾部记录上进行的普通私有学习,是私有CVaR学习内部的典型困难子问题。

英文摘要

Differential privacy changes the effective sample size governing CVaR learning. For tail mass $τ$, the privacy-relevant sample size is not $n$, but $nτ$; equivalently, the effective private tail sample size is $εnτ$. Private CVaR excess risk decomposes into ordinary tail-risk statistical error and a privacy price. This decomposition is complete for scalar estimation and finite classes: scalar estimation has rate $Θ(B \min\{1,(nτ)^{-1/2}+(εnτ)^{-1}\})$, and finite classes of size $M$ have rate $Θ(B \min\{1,\sqrt{\log(2M)/(nτ)}+\log(2M)/(εnτ)\})$. These complete rates hold under pure DP, and their lower bounds extend to approximate DP in the stated small-$δ$ regimes. For convex Lipschitz learning, modular upper and lower reductions show that the CVaR-specific privacy term necessarily scales as $1/(εnτ)$, with dimension dependence inherited from private stochastic convex optimization. Together, these results identify ordinary private learning on $Θ(nτ)$ informative tail records as the canonical hard subproblem inside private CVaR learning.
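The two rates above can be evaluated numerically to see how the statistical term and the privacy price trade off. The sketch below simply plugs numbers into the stated expressions with the hidden Θ constants set to 1, which is an assumption; the function names are hypothetical.

```python
import math

def scalar_rate(n, tau, eps, B=1.0):
    """B * min{1, (n*tau)^(-1/2) + (eps*n*tau)^(-1)}, constants dropped."""
    tail_n = n * tau                      # effective tail sample size
    statistical = tail_n ** -0.5          # ordinary tail-risk error
    privacy = 1.0 / (eps * tail_n)        # privacy price
    return B * min(1.0, statistical + privacy)

def finite_class_rate(n, tau, eps, M, B=1.0):
    """B * min{1, sqrt(log(2M)/(n*tau)) + log(2M)/(eps*n*tau)}, constants dropped."""
    tail_n = n * tau
    k = math.log(2 * M)
    return B * min(1.0, math.sqrt(k / tail_n) + k / (eps * tail_n))

# With n = 10,000 and tau = 0.01 only about 100 tail records are informative.
print(scalar_rate(n=10_000, tau=0.01, eps=1.0))
print(finite_class_rate(n=10_000, tau=0.01, eps=1.0, M=10))
```

Shrinking either τ or ε inflates the privacy term until the bound saturates at B, which matches the reading that εnτ acts as the effective private tail sample size.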

2605.16213 2026-05-18 cs.AR

ADS-IMC: Accelerating Data Sorting with In-Memory Computation

ADS-IMC:利用内存计算加速数据排序

Narendra Singh Dhakad, Santosh Kumar Vishvakarma

AI总结 本文提出直接在存储阵列(memory fabric)内执行排序操作的新型架构,通过减少内存与处理单元间的数据传输来降低延迟和能耗,首次利用6T SRAM实现内存内排序,相比基于忆阻器的IMC排序延迟降低3.4倍。

详情
Comments
5 Pages, 8 Figures
AI中文摘要

排序是众多计算领域中的基本操作。传统做法是将数据从主存传输到处理单元进行排序,再将排序后的数据写回内存。由于内存与处理部件之间存在大量数据搬运,这种方式会带来显著的延迟和能耗开销。为缓解这些开销,本文提出了直接在存储阵列(memory fabric)内执行排序操作的新型架构,从而无需片外数据传输。据我们所知,这是首个利用6T SRAM进行内存内排序的探索性工作。所提出的架构面向以数字系统中常用的标准加权二进制基数格式表示的数据。与基于忆阻器的IMC排序相比,所提出的架构将延迟显著降低了3.4倍。

英文摘要

Sorting is a fundamental operation across numerous computational domains. Traditionally, this process involves transferring data from main memory to a processing unit for sorting, followed by writing the sorted data back to memory. This conventional approach incurs substantial latency and energy overheads due to the extensive data movement between memory and processing components. To mitigate these overheads, this paper introduces novel architectures for executing sorting operations directly within the memory fabric, eliminating the need for off-chip data transfer. To our knowledge, this work represents the first exploration of in-memory sorting using 6T SRAM. The proposed architecture is designed to operate on data represented in the standard weighted binary radix format commonly used in digital systems. The proposed architecture achieves a significant 3.4x reduction in latency compared to memristor-based IMC sorting.

2605.16211 2026-05-18 cs.LG math.DS

Hypothesis-driven construction of mesoscopic dynamics

以假设为导向的介观动力学构建

Zhuoyuan Li, Aiqing Zhu, Qianxiao Li

AI总结 本文提出一种在数学约束的假设类中学习介观动力学的新方法,基于广义Onsager原理构建统一框架,提供理论保证,并在连续介质PDE模型和微观链模型的数据上验证了其有效性。

详情
Comments
38 pages, 10 figures
AI中文摘要

传统科学建模通常从固定的、针对具体实例的有效方程出发,再进行针对特定方程的分析和计算,这一流程在多尺度系统等复杂应用中变得极为困难。本文提出一种替代范式:在数学约束的假设类中学习介观动力学。基于广义Onsager原理,我们引入了一个同时涵盖耗散型和保守型介观动力学的统一框架。我们建立了一致且先验的理论保证,包括全局适定性、渐近稳定性、唯一分解可辨识性和离散能量耗散;这些保证在任何学习阶段之前即对该假设类中的所有时空演化方程成立。随后,每个问题实例的数据被用于指导在假设类中识别具体成员,从而得到准确、稳健且可解释的动力学模型。我们在两类数据上对该框架进行了实证验证:一类来自连续介质PDE模型,用作检验;另一类来自精确介观模型未知的微观链模型。所提出的方法不仅是一种有效的动力学学习器,还能为底层物理提供重要的可解释诊断。

英文摘要

Traditional scientific modeling typically begins with fixed, instance-wise effective equations and then carries out equation-specific analysis and computation, a procedure that becomes exceptionally challenging in complex applications such as multiscale systems. We propose an alternative paradigm by learning mesoscopic dynamics within a mathematically constrained hypothesis class. Building upon a generalized Onsager principle, we introduce a unified framework encompassing both dissipative and conservative mesoscopic dynamics. We establish uniform and a priori theoretical guarantees, including global well-posedness, asymptotic stability, unique factorization identifiability, and discrete energy dissipation, applicable to all spatio-temporal evolution equations within this hypothesis class prior to all learning stages. Data from each problem instance is then used to guide the identification of members within our hypothesis class, giving rise to accurate, robust and interpretable dynamical models. We empirically validate this framework on both data from continuum PDE models as a check, and on data arising from microscopic chain models for which exact meso-scale models are unknown. The proposed approach not only acts as an effective dynamics learner, but also offers vital interpretable diagnostics of the underlying physics.

2605.16210 2026-05-18 math.DS cs.NA math.NA math.OC

The Wolf and the Cello: Modelling and design of multiple resonance suppressors in large string instruments

狼与大提琴:大型弦乐器中多重共振抑制器的建模与设计

Simone Cacace, Emiliano Cristiani, Francesca L. Ignoto

AI总结 本文提出了一种数学模型,描述弦与带有一个或两个共振抑制器的二维体的耦合动力学,通过三个性能指标评估共振抑制器的优化调谐与放置,以有效抑制狼音并保持整体音色平衡。

详情
AI中文摘要

狼音是大型弓弦乐器中的一种声学不稳定现象:当强烈的琴体共振与振动的琴弦相互作用时出现,会造成幅度调制和音色失控。实践中人们使用各种狼音抑制器(附着在琴弦或琴体上的调谐质量阻尼器)来缓解这一效应。本文提出了一种数学模型,描述琴弦与装有一个或两个狼音抑制器的二维琴体的耦合动力学。琴弦和琴体均包含弹性(二阶)和刚度(四阶)贡献,并可通过拨奏或拉奏来激发。文中引入了三个性能指标:第一个检测狼音的出现,第二个量化狼音抑制器可能引起的音符衰减,第三个衡量相对于原始乐器的声学保真度(以频谱衡量)。所给出的数值测试为一个或两个抑制器的最优调谐和放置提供了见解,可在有效抑制狼音的同时尽可能保持整体音色平衡。

英文摘要

The wolf note is an acoustic instability that occurs in large bowed string instruments when a strong body resonance interacts with the vibrating string, producing amplitude modulation and loss of tonal control. Various wolf suppressors - tuned mass dampers attached to the string or to the instrument body - are used in practice to mitigate this effect. In this paper, we propose a mathematical model describing the coupled dynamics of a string and a two-dimensional body equipped with one or two wolf suppressors. Both string and body include elastic (second-order) and stiffness (fourth-order) contributions and can be excited either by plucking or bowing. Three performance indicators are introduced: The first one perceives the wolf-tone appearance, the second one quantifies the attenuation of the notes possibly caused by the wolf suppressor, and the third one measures the acoustic fidelity (in terms of spectrum) with respect to the original instrument. The proposed numerical tests give insights about optimal tuning and placement of one or two suppressors, achieving effective wolf-note suppression while preserving as much as possible the global tonal balance.

2605.16208 2026-05-18 stat.ML cs.LG

A Scalable Nonparametric Continuous-Time Survival Model through Numerical Quadrature

通过数值积分实现的可扩展非参数连续时间生存模型

Chaeyeon Lee, Sehwan Kim, Hyungrok Do

AI总结 本文提出QSurv模型,通过高斯-勒让德数值积分实现非参数连续时间生存建模,无需时间离散化或限制性分布假设,能有效捕捉非平稳危险动态,实验表明其在瞬时危险函数估计上具有优势。

详情
AI中文摘要

灵活的连续时间生存建模对于捕捉高维数据中复杂的时变危险动态至关重要;然而,由于似然估计需要计算难以处理的积分,训练此类模型仍然具有挑战性。我们提出QSurv,一种可扩展的深度学习框架,无需依赖时间离散化或限制性分布假设即可实现非参数连续时间建模。我们提出基于高斯-勒让德数值积分的训练目标,它以高阶精度近似累积危险,同时支持通过标准反向传播进行高效的端到端训练。此外,为了在复杂架构中有效捕捉非平稳危险动态,我们引入了时间条件低秩适应,这一机制通过低秩更新动态调节权重,使通用神经网络主干以时间为条件。我们给出了理论分析,建立了累积危险计算的近似误差界。在合成基准、大规模真实世界表格数据集和高维医学影像任务上的全面实验表明,QSurv取得了有竞争力的预测性能,并在瞬时危险函数估计方面具有优势,从而能够更可解释地刻画时变风险模式。

英文摘要

Flexible continuous-time survival modeling is critical for capturing complex time-varying hazard dynamics in high-dimensional data; however, training such models remains challenging due to the intractable integral required for likelihood estimation. We introduce QSurv, a scalable deep learning framework that enables nonparametric continuous-time modeling without relying on time discretization or restrictive distributional assumptions. We propose a training objective based on Gauss-Legendre numerical quadrature, which approximates the cumulative hazard with high-order accuracy while facilitating efficient end-to-end training via standard backpropagation. Furthermore, to effectively capture non-stationary hazard dynamics in complex architectures, we introduce time-conditioned low-rank adaptation, a mechanism that conditions general neural backbones on time by dynamically modulating weights via low-rank updates. We provide theoretical analysis establishing approximation error bounds for cumulative-hazard evaluation. Comprehensive experiments across synthetic benchmarks, large-scale real-world tabular datasets, and high-dimensional medical imaging tasks demonstrate that QSurv achieves competitive predictive performance with advantages in instantaneous hazard function estimation, enabling more interpretable characterization of time-varying risk patterns.
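The quadrature idea can be sketched in a few lines. The example below approximates the cumulative hazard H(t) = ∫₀ᵗ h(s) ds with Gauss-Legendre nodes and plugs it into the standard right-censored log-likelihood δ·log h(t) − H(t). The hand-written `linear_hazard` stands in for a neural hazard network and is an assumption for illustration; it is not QSurv's architecture or exact objective.

```python
import numpy as np

def cumulative_hazard(hazard, t, num_nodes=16):
    """Approximate H(t) = integral of hazard(s) over [0, t] by Gauss-Legendre quadrature."""
    nodes, weights = np.polynomial.legendre.leggauss(num_nodes)  # nodes on [-1, 1]
    s = 0.5 * t * (nodes + 1.0)            # affine map from [-1, 1] to [0, t]
    return 0.5 * t * np.sum(weights * hazard(s))

def log_likelihood(hazard, t, event, num_nodes=16):
    """Log-likelihood of one observation: event * log h(t) - H(t).

    `event` is 1 for an observed event and 0 for a right-censored time.
    """
    return event * np.log(hazard(t)) - cumulative_hazard(hazard, t, num_nodes)

def linear_hazard(s):
    """Toy hazard h(s) = 2s, whose exact cumulative hazard is t**2."""
    return 2.0 * s

H = cumulative_hazard(linear_hazard, 3.0)          # exact value is 9
ll_event = log_likelihood(linear_hazard, 3.0, event=1)
ll_censored = log_likelihood(linear_hazard, 3.0, event=0)
```

Because the approximation is a weighted sum of hazard evaluations, it stays differentiable with respect to the hazard model's parameters, which is what allows end-to-end training by backpropagation.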

2605.16207 2026-05-18 cs.AI cs.CL

Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most

确认正确,遗漏其余:LLM辅导代理在反馈最关键的地方表现不佳

Tahreem Yasir, Wenbo Li, Sam Gilson, Sutapa Dey Tithi, Xiaoyi Tian, Tiffany Barnes

AI总结 本文研究了LLM在逻辑推理中的辅导性能,发现其在区分最优解、次优解和错误解方面存在系统性偏差,影响适应性教学效果。

详情
Comments
22 pages, 20 figures
AI中文摘要

有效的辅导需要区分最优解、有效但次优的解和错误解,这一区分是智能辅导系统(ITS)的核心,但此前尚未针对基于LLM的辅导代理进行测试。随着LLM越来越多地被探索用作ITS的对话式补充,评估其诊断精度至关重要。本文提出一个基准,使用由知识图谱导出的真值(ground truth),在10,836个解答-反馈对和三种反馈条件下评估了七个LLM反馈代理在命题逻辑中的表现。模型在最优步骤上的表现接近上限,但系统性地过度拒绝有效但次优的推理,并过度认可错误解,而这恰恰是适应性辅导最关键之处。无论解答上下文如何,这些失败在各模型中均持续存在,表明其源于架构层面而非信息层面的限制。此外,准确的诊断并不能可靠地产生在教学上可操作的反馈,揭示了诊断判断与教学效果之间的差距。我们的研究结果表明,LLM更适合混合架构:由基于知识图谱的模型负责诊断,而LLM支持开放式的支架式引导和对话。

英文摘要

Effective tutoring requires distinguishing optimal, valid but suboptimal, and incorrect student solutions, a distinction central to intelligent tutoring systems (ITS) but untested for LLM-based tutors. As LLMs are increasingly explored as conversational complements to ITS, evaluating their diagnostic precision is essential. We present a benchmark of seven LLM feedback agents in propositional logic using knowledge-graph-derived ground truth across 10,836 solution--feedback pairs and three feedback conditions. Models achieved near-ceiling performance on optimal steps but systematically over-rejected valid but suboptimal reasoning and over-validated incorrect solutions, precisely where adaptive tutoring matters most. These failures persisted across models regardless of solution context, suggesting architectural rather than informational limits. Moreover, accurate diagnosis did not reliably produce pedagogically actionable feedback, revealing a gap between diagnostic judgment and instructional effectiveness. Our findings suggest that LLMs are better suited for hybrid architectures where KG-grounded models handle diagnosis while LLMs support open-ended scaffolding and dialogue.

2605.16205 2026-05-18 cs.AI cs.CL cs.LG cs.MA cs.SY eess.SY

Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP

上下文、推理与层次:在对抗性POMDP中的复合LLM代理设计成本-性能研究

Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman

AI总结 研究探讨了在对抗性部分可观测序贯环境中,复合LLM代理设计的上下文、推理和层次分解对性能与成本的影响,发现程序化状态抽象在成本效率上表现最佳,而分层分解无需推理可获得最佳性能。

详情
AI中文摘要

在对抗性、部分可观测的序贯环境中部署复合LLM代理需要权衡多个设计维度:(1)代理能看到什么,(2)它如何推理,以及(3)任务如何在各组件间分解。然而,从业者缺乏指导,难以判断哪些设计选择能提升性能,哪些只是增加推理成本。我们在CybORG CAGE-2中对复合LLM代理设计进行了受控研究,这是一个被建模为部分可观测马尔可夫决策过程(POMDP)的网络防御环境。奖励为非正值,因此所有配置都运行在失败缓解模式下。我们的评估涵盖五个模型家族、六个模型和十二种配置(3,475个回合),并进行token级的成本核算。我们改变上下文表示(原始观测与带压缩历史的确定性状态跟踪层)、审议(自我提问、自我批评和自我改进工具,可选思维链提示)以及分层分解(单体ReAct与委托给专门子代理)。我们发现:(1)程序化状态抽象带来最大的单位token回报(RPTS),相比原始观测可将平均回报提升多达76%。(2)相对于仅使用分层,在分层结构中分布审议工具会降低全部五个模型家族的性能,平均回报最多差3.4倍,同时多使用1.8-2.7倍的token。我们将这种破坏性模式称为审议级联(deliberation cascade)。(3)不带审议的分层分解在大多数模型上取得最佳绝对性能,且上下文工程通常比审议更具成本效益。这些发现为结构化的对抗性POMDP提出了一条设计原则:应投资于程序化基础设施和清晰的任务分解,而不是更深的单代理推理,因为这些策略结合使用时可能相互干扰。

英文摘要

Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet practitioners lack guidance on which design choices improve performance versus merely increase inference costs. We present a controlled study of compound LLM agent design in CybORG CAGE-2, a cyber defense environment modeled as a Partially Observable Markov Decision Process (POMDP). Reward is non-positive, so all configurations operate in a failure-mitigation mode. Our evaluation spans five model families, six models, and twelve configurations (3,475 episodes) with token-level cost accounting. We vary context representation (raw observations vs. a deterministic state-tracking layer with compressed history), deliberation (self-questioning, self-critique, and self-improvement tools, with optional chain-of-thought prompting), and hierarchical decomposition (monolithic ReAct vs. delegation to specialized sub-agents). We find that: (1) Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations. (2) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4$\times$ worse mean return while using 1.8-2.7$\times$ more tokens. We call this destructive pattern a deliberation cascade. (3) Hierarchical decomposition without deliberation achieves the best absolute performance for most models, and context engineering is generally more cost-effective than deliberation. These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, as these strategies can interfere when combined.

2605.16202 2026-05-18 quant-ph cs.ET

Performance Gains in Quantum SAT Solvers Using ESOP Encoding

利用ESOP编码提升量子SAT求解器性能

Majd Assaad, Abhoy Kole, Rolf Drechsler

AI总结 本文研究了专为量子SAT求解设计的ESOP-CNF编码(e-CNF),通过减少量子资源消耗提升求解效率。

详情
Comments
18 pages, 6 figures
AI中文摘要

布尔可满足性(SAT)问题是一个典型的NP完全问题,也是通过基于搜索的算法实现量子加速的自然候选。在基于Grover的量子SAT求解器中,主要的计算开销来自对布尔公式求值的可逆oracle的构造,这使得SAT编码的选择对整体量子资源效率至关重要。尽管SAT实例通常以合取范式(CNF)表示,但此类编码转化成的量子电路通常具有显著的量子比特开销和较高的非Clifford门复杂度。在本工作中,我们研究了一种专为量子SAT求解设计的、基于异或积之和(Exclusive-Sum-of-Products,ESOP)的CNF(e-CNF)表示,并分析了其对oracle构造的影响。我们推导了用e-CNF编码代替标准CNF时,基于Grover的SAT求解器在量子比特需求和Clifford+T门数量上的更紧上界。此外,我们提出了一种可扩展的从布尔公式到e-CNF的转换方法,并给出了将e-CNF表示解释为适用于oracle实现的可逆量子电路的系统性步骤。在有代表性的SAT基准上的实验评估表明,与基于CNF的oracle构造相比,所提出的基于e-CNF的方法能够显著且一致地减少量子资源,包括量子比特数量、T门复杂度和电路深度。这些结果确立了e-CNF作为一种有效的量子感知SAT编码,显著提高了基于oracle的量子SAT求解的实用性。

英文摘要

The Boolean Satisfiability (SAT) problem is a canonical NP-complete problem and a natural candidate for quantum acceleration via search-based algorithms. In Grover-based quantum SAT solvers, the dominant computational cost stems from the construction of a reversible oracle that evaluates the Boolean formula, rendering the choice of SAT encoding crucial for overall quantum resource efficiency. Although SAT instances are conventionally expressed in Conjunctive Normal Form (CNF), such encodings typically translate into quantum circuits with significant qubit overhead and high non-Clifford gate complexity. In this work, we investigate an Exclusive-Sum-of-Products (ESOP)-based CNF (e-CNF) representation tailored for quantum SAT solving and analyze its impact on oracle construction. We derive tighter upper bounds on qubit requirements and Clifford+$T$ gate counts for Grover-based SAT solvers when e-CNF encodings are employed in place of standard CNF. In addition, we propose a scalable transformation from Boolean formulas to e-CNF and present a systematic procedure for interpreting e-CNF representations as reversible quantum circuits suitable for oracle implementation. Experimental evaluation on representative SAT benchmarks demonstrates that the proposed e-CNF-based approach yields substantial and consistent reductions in quantum resources, including qubit count, T-gate complexity, and circuit depth, when compared to CNF-based oracle constructions. These results establish e-CNF as an effective quantum-aware SAT encoding that significantly improves the practicality of oracle-based quantum SAT solving.
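For readers unfamiliar with ESOP forms, the sketch below shows the generic idea only: an ESOP expression is an XOR of product terms (cubes), and a single clause such as x0 OR x1 can be rewritten as x0 XOR x1 XOR (x0 AND x1). The cube encoding and the `eval_esop` helper are hypothetical, and this is not the paper's e-CNF construction or its Boolean-formula-to-e-CNF transformation.

```python
from itertools import product

def eval_esop(cubes, assignment):
    """Evaluate an ESOP: XOR together the AND of each cube.

    A cube is a list of (variable_index, required_value) literals; this
    encoding is an assumption made for the example.
    """
    result = 0
    for cube in cubes:
        term = all(assignment[var] == value for var, value in cube)
        result ^= int(term)
    return result

# The clause (x0 OR x1) written as x0 XOR x1 XOR (x0 AND x1).
or_as_esop = [[(0, 1)], [(1, 1)], [(0, 1), (1, 1)]]

# Check the rewrite against the truth table of OR.
table = {bits: eval_esop(or_as_esop, bits) for bits in product((0, 1), repeat=2)}
```

ESOP forms are attractive for reversible logic because each cube can be realized as one multi-controlled NOT acting on a shared output qubit, so the XOR accumulates in place.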

2605.16198 2026-05-18 cs.AI cs.CY cs.LG cs.LO

Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems

形式方法与大语言模型交汇:面向高级AI系统合规性的审计、监控与干预

Parand A. Alamdari, Toryn Q. Klassen, Sheila A. McIlraith

AI总结 本文提出结合形式方法与机器学习的审计和监控技术,用于检测AI系统对时序扩展行为约束的违规,实验表明其在检测违规方面优于基于LLM的基线方法,且能有效降低LLM代理的违规率。

详情
AI中文摘要

我们探讨了AI治理的一个特定维度:如何在整个AI开发生命周期中,从部署前测试到部署后审计,监控和审计AI赋能的产品和服务。结合形式方法的原则与最先进的机器学习,我们提出了一系列技术,使AI赋能产品和服务的开发者以及第三方AI开发者和评估者,能够针对黑箱高级AI系统(尤其是LLM),对产品特定的(时序扩展的)行为约束(如安全约束、规范、规则和法规)进行离线审计和在线(运行时)监控。我们进一步提供了用于预测性监控的实用技术,如基于采样的方法,并引入了干预式监控器,它们在运行时采取行动,以预先阻止并尽可能缓解预测到的违规。实验结果表明,通过利用线性时序逻辑(LTL)的形式语法和语义,我们提出的审计和监控技术在检测时序扩展行为约束的违规方面优于基于LLM的基线方法;采用我们的方法,即使是小模型标注器也能达到或超过前沿LLM评判模型的水平。我们的预测性和干预式监控器显著降低了基于LLM的代理的违规率,同时在很大程度上保持了任务性能。我们还通过受控实验表明,随着事件间距、约束数量和命题数量的增加,LLM的时序推理准确率会明显下降。

英文摘要

We examine one particular dimension of AI governance: how to monitor and audit AI-enabled products and services throughout the AI development lifecycle, from pre-deployment testing to post-deployment auditing. Combining principles from formal methods with SoTA machine learning, we propose techniques that enable AI-enabled product and service developers, as well as third party AI developers and evaluators, to perform offline auditing and online (runtime) monitoring of product-specific (temporally extended) behavioral constraints such as safety constraints, norms, rules and regulations with respect to black-box advanced AI systems, notably LLMs. We further provide practical techniques for predictive monitoring, such as sampling-based methods, and we introduce intervening monitors that act at runtime to preempt and potentially mitigate predicted violations. Experimental results show that by exploiting the formal syntax and semantics of Linear Temporal Logic (LTL), our proposed auditing and monitoring techniques are superior to LLM baseline methods in detecting violations of temporally extended behavioral constraints; with our approach, even small-model labelers match or exceed frontier LLM judges. Our predictive and intervening monitors significantly reduce the violation rates of LLM-based agents while largely preserving task performance. We further show through controlled experiments that LLMs' temporal reasoning shows a pronounced degradation in accuracy with increasing event distance, number of constraints, and number of propositions.
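As a rough illustration of checking temporally extended constraints, the sketch below audits a finite trace against two common LTL patterns: a safety property G(¬p) and a response property G(trigger → F response), both read with finite-trace semantics. The trace format (a list of sets of proposition labels), the proposition names, and both helper functions are hypothetical; the abstract does not describe the authors' monitor construction or how labels are extracted from LLM behavior.

```python
def first_violation_of_never(trace, prop):
    """Check G(not prop): return the index of the first violating step, or None."""
    for step, labels in enumerate(trace):
        if prop in labels:
            return step
    return None

def satisfies_response(trace, trigger, response):
    """Check G(trigger -> F response) on a finite trace.

    Every step labelled `trigger` must be followed (at the same step or
    later) by a step labelled `response` before the trace ends.
    """
    pending = False
    for labels in trace:
        if response in labels:
            pending = False      # discharges any outstanding obligation
        elif trigger in labels:
            pending = True       # new obligation, not yet answered
    return not pending

# Hypothetical labelled trace of an agent's actions.
trace = [{"request"}, set(), {"grant"}, {"request"}]
ok = satisfies_response(trace, "request", "grant")   # last request is unanswered
leak_step = first_violation_of_never([set(), {"leak"}, set()], "leak")
```

Running the same checks incrementally, one step at a time, turns this offline audit into a simple runtime monitor.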

2605.16197 2026-05-18 cs.HC

Position: AI as Part of Self -- Extending the Mind Requires Cognitive Co-Regulation

立场:AI作为自我的一部分——扩展心智需要认知共调节

Alina Gutoreva, Fendi Tsim, Trisevgeni Papakonstantinou

AI总结 本文探讨了AI作为自我组成部分的重要性,强调认知共调节在人机认知系统中的必要性,指出传统约束无法实现安全与对齐,需通过人机协同调节来达成。

详情
AI中文摘要

本文主张,安全与对齐无法通过约束一个外部系统来实现,而必须产生于对人机认知系统整体的共调节设计(“AI作为自我的一部分”)。当代AI日益参与注意力分配、推理、综合与决策,塑造着人类形成信念、做出决策和构成自我意识的认知过程本身。人类与AI在相互约束下承担互补的认知角色,形成一个共生的认知单元;对齐的恰当落点是这一单元的共调节,而非对任何一方单独施加的外部控制。我们指出了非结构化委托的风险:技能退化、自动化偏见、认知权威的转移以及oracle式的知识集中。借鉴System 0认知理论,我们进一步表明,AI在有意识的深思之前就已发挥作用,塑造了能动性与信任得以协商的前注意基础设施,而这是常规监督无法触及的层面。最后,我们面向ML工程师和治理机构提出了认知共调节的设计原则。本文的目标是引导人类认知在人类自我的根基处走向韧性与认知能动性。

英文摘要

This position paper argues that safety and alignment cannot be achieved by constraining an external system: they must emerge from the co-regulatory design of the human--AI cognitive system as a whole ("AI as Part of Self"). Contemporary AI increasingly participates in attention allocation, reasoning, synthesis, and decision-making, shaping the very cognitive processes through which humans form beliefs, make decisions, and constitute their sense of self. Humans and AI occupy complementary epistemic roles under mutual constraint, forming a symbiotic cognitive unit whose co-regulation -- not the external control of either party alone -- is the proper locus of alignment. We identify the risks of unstructured delegation: deskilling, automation bias, transfer of epistemic authority, and oracle-style centralization of knowledge. Drawing on System 0 cognition theory, we further show that AI operates prior to conscious deliberation, shaping the pre-attentive infrastructures through which agency and trust are negotiated -- a level that conventional oversight cannot reach. We conclude with design principles for cognitive co-regulation addressed to ML engineers and governance bodies. The goal of this work is to guide human cognition toward resilience and epistemic agency at the foundation of human selfhood.

2605.16196 2026-05-18 cs.IT math.IT

Fundamental Performance Limits of Non-Coherent ISAC: A Data-Aided Sensing Perspective

非相干ISAC的基本性能极限:一种数据辅助传感的视角

Dongsheng Peng, Chengkai Zhao, Yihong Li, Zhiqing Wei, Jun Chen, Ping Chen

AI总结 本文研究了非相干ISAC系统在块衰落信道中的性能极限,表明数据辅助传感方案相对于导频传感方案可获得有效SNR增益。

详情
AI中文摘要

本文研究块衰落信道下的双基地多输入多输出(MIMO)集成传感与通信(ISAC)系统,重点关注传感接收机与通信接收机共址的场景。在接收端信道状态信息(CSI)未知的假设下,考虑了两种方案:导频传感(PS)和数据辅助传感(DAS)。文中刻画了两种方案的通信速率-传感失真函数。对于DAS方案,利用随机矩阵理论(RMT)推导出了传感失真的闭式渐近表达式。渐近性能分析明确量化了DAS方案的显著增益,揭示了与PS方案相比,DAS方案在低SNR区间具有严格的3 dB有效SNR提升,并在高SNR极限下具有严格更快的性能缩放速率。

英文摘要

In this paper, we investigate a bistatic multiple-input multiple-output (MIMO) integrated sensing and communication (ISAC) system over block-fading channels, focusing on the scenario where the sensing and communication receivers (Rxs) are co-located. Under the assumption of unknown channel state information (CSI) at the Rx, two schemes are considered: pilot sensing (PS) and data-aided sensing (DAS). The communication rate-sensing distortion functions for both schemes are characterized. For the DAS scheme, a closed-form asymptotic expression for the sensing distortion is derived by using random matrix theory (RMT). The asymptotic performance analysis explicitly quantifies the significant gains of the DAS scheme, revealing a strict $3$ dB effective SNR improvement in the low-SNR regime and a strictly faster performance scaling rate in the high-SNR limit compared to the PS scheme.

2605.16194 2026-05-18 cs.DL cs.AI cs.IR cs.MA

paper.json: A Coordination Convention for LLM-Agent-Actionable Papers

paper.json:面向LLM代理可操作论文的协调约定

Arquimedes Canedo

AI总结 本文提出paper.json文件,通过稳定的主张ID、明确的“不主张”列表、精确的逐图shell命令和稳定的定义ID等约定,解决LLM代理在阅读学术论文时反复出现的失败问题。

详情
AI中文摘要

LLM代理经常充当学术论文的第一(有时是唯一)读者,浏览论文以查找子主张、提取可复现步骤并概括适用范围。标准的散文式论文在这一角色下会产生反复出现的失败:子主张无法在细于整篇论文的粒度上被引用,适用范围被过度外推到论文实际检验的内容之外,以及生成图表的命令埋藏在代码库中而非论文本身。我们提出`paper.json`,一个随PDF一同分发的配套JSON文件,用轻量级约定逐一应对这些失败:稳定的主张ID(C1)、明确的“不主张”列表(C2)、精确的逐图shell命令(C3)和稳定的定义ID(C5)。第五个约定(C4)认为,最低可行合规(即在PDF旁附上手写的JSON)对一篇已完成的论文可在一小时内实现,且无需改动人类可读的输出。C1、C2、C3和C5是开放的邀请:代理阅读合规论文并据此行动,就会产生支持或反对这些约定的证据。本文本身即合规:运行`uv run validator.py paper.json --against paper.typ`可通过。仓库:https://github.com/arquicanedo/paper-json

英文摘要

LLM agents routinely serve as first (and sometimes only) readers of academic papers, skimming for sub-claims, extracting reproducibility steps, and generalizing scope. Standard prose papers produce recurring failures in this role: sub-claims that cannot be cited at sub-paper granularity, scope overextension beyond what the paper tests, and figure commands buried in codebases rather than the paper itself. We propose `paper.json`, a companion JSON file that travels with the PDF and addresses each failure with a lightweight convention: stable claim IDs (C1), an explicit does-not-claim list (C2), exact per-figure shell commands (C3), and stable definition IDs (C5). A fifth convention (C4) holds that minimum viable compliance, hand-written JSON alongside the PDF, is achievable in under an hour for a finished paper without touching the human-readable output. C1, C2, C3, and C5 are open invitations: an agent that reads a compliant paper and acts on it produces evidence for or against them. This paper is itself compliant: `uv run validator.py paper.json --against paper.typ` passes. Repo: https://github.com/arquicanedo/paper-json

2605.16193 2026-05-18 cs.CL cs.CY

Improving Cross-Cultural Survey Simulation with Calibrated Value Personas

通过校准价值人设提升跨文化调查模拟

Axel Abels, Elias Fernandez Domingos, Apurva Shah, Tom Lenaerts

AI总结 本文提出基于价值的人设构建方法,通过校准提升跨文化调查模拟的准确性,减少预测误差,尤其在少数群体中效果显著。

详情
Comments
Submitted to the Fourth International Workshop on Value Engineering in AI (VALE 2026), held at IJCAI-ECAI 2026
AI中文摘要

大型语言模型(LLMs)越来越多地被用于模拟人类意见和调查回答,但其跨文化再现总体回答的能力仍然有限。现有的基于人设的提示方法通常依赖社会人口学特征或人格特质,而这些只是塑造人类回答的价值观的间接代理。我们提出一种基于价值的人设构建方法,从刻画核心文化维度的调查回答中导出文本描述符。通过从目标人群中采样价值画像,并聚合LLM在各人设下的回答,我们得到以观测到的价值分布为依据的群体级预测。我们进一步引入一种校准程序,在保持估计意见不变的同时提高回答多样性。我们表明,该方法降低了各国的预测误差,其中代表性不足的人群改进最大。这大大缩小了与主流LLM先验一致的国家和训练数据中代表性较低的国家之间的性能差距,同时产生与人类多样性高度吻合的回答分布。

英文摘要

Large language models (LLMs) are increasingly used to simulate human opinions and survey responses, but their ability to reproduce population responses across cultures remains limited. Existing persona-based prompting methods typically rely on sociodemographic or personality traits, which are only indirect proxies for the values that shape human responses. We propose a value-based persona construction method that derives textual descriptors from survey responses capturing core cultural dimensions. By sampling value profiles from target populations and aggregating LLM responses across personas, we obtain population-level predictions grounded in observed value distributions. We further introduce a calibration procedure that improves response diversity while preserving estimated opinions. We show that our approach reduces prediction error across countries, with the largest improvements observed in underrepresented populations. This substantially narrows the performance gap between countries aligned with dominant LLM priors and those that are less represented in training data, while also yielding response distributions that closely match human diversity.
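The sample-then-aggregate step described above can be sketched as follows. Everything here is a placeholder under stated assumptions: `respond` stands in for an LLM call conditioned on a persona, the value profiles and weights are invented for the example, and the calibration procedure is not shown because the abstract does not describe it.

```python
import random
from collections import Counter

def simulate_population(profiles, weights, respond, num_personas, seed=0):
    """Estimate a population-level answer distribution.

    Samples value profiles in proportion to `weights`, asks `respond` for
    each sampled persona's answer, and returns the answer frequencies.
    """
    rng = random.Random(seed)
    sampled = rng.choices(profiles, weights=weights, k=num_personas)
    counts = Counter(respond(profile) for profile in sampled)
    return {answer: count / num_personas for answer, count in counts.items()}

# Hypothetical value profiles and their shares in a target population.
profiles = ["tradition-oriented", "autonomy-oriented"]
weights = [0.3, 0.7]

def stub_respond(profile):
    """Stand-in for an LLM prompted with a persona description (assumption)."""
    return "agree" if profile == "autonomy-oriented" else "disagree"

distribution = simulate_population(profiles, weights, stub_respond, num_personas=1000)
```

Grounding the sampling weights in observed survey data, rather than in the model's own priors, is what ties the aggregate prediction to the target population.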

2605.16191 2026-05-18 cs.CL cond-mat.other physics.comp-ph

Optimized Three-Dimensional Photovoltaic Structures with LLM guided Tree Search

利用LLM引导的树搜索优化三维光伏结构

Michael P. Brenner, Lizzie Dorfman, John C. Platt

AI总结 本文利用AI编码系统生成新型科学假设,通过LLM驱动的树搜索算法优化三维光伏结构,解决中纬度地区传统光伏板的效率瓶颈问题。

详情
Comments
10 pages, 7 figures
AI中文摘要

我们给出一个案例研究,说明AI编码系统如何用于生成新的科学假设。我们将通用编码代理(谷歌的AntiGravity)与LLM驱动的树搜索算法(Empirical Research Assistance / ERA)相结合,自主生成高效率的三维光伏(3DPV)结构,以克服在中纬度地区限制平板太阳能板的损耗。这些结构通过在全天向太阳呈现有利的角度来工作;为便于说明,我们聚焦于优化单个太阳日内的性能。我们的工作流程首先使用AntiGravity复现已有文献中的计算,这些计算表明3DPV的能量密度可以远高于固定式平板光伏板。我们以这些初始设计作为大规模树搜索的起点,寻找改进的方案并按其日间产能评分。最初的树搜索得到了名义上更高效的方案,但它们源于算法层面的奖励投机(reward hacking),来自非物理的设计特征,例如在结构上悬浮的、互不相连的层,以及对光学求解器离散化的利用。为应对这一问题,我们开发了一种工作流程,让编码代理迭代地为物理引擎打补丁、加入约束,以消除奖励投机。消除奖励投机后,ERA发现了一系列在不同约束下性能得到改进的设计,包括具有不同固定集光面积的最优设计、对天顶跟踪的优化以及对自遮挡的避免。将编码代理与树搜索(ERA)相结合,为科学发现提供了一个强大的平台,适用于其解可以通过评分函数进行经验评估的问题。

英文摘要

We present a case study for how AI coding systems can be used to generate novel scientific hypotheses. We combine a generic coding agent (Google's AntiGravity) with an LLM-driven tree search algorithm (Empirical Research Assistance / ERA) to autonomously generate high-efficiency three-dimensional photovoltaic (3DPV) structures that overcome losses limiting flat solar panels at mid-latitudes. These structures operate by presenting favorable angles to the sun throughout the day, and for illustrative purposes we focus on optimizing performance for a single solar day. Our workflow begins by using AntiGravity to reproduce calculations \cite{bernardi2012solar} showing that 3DPV can have energy densities much higher than stationary flat PV panels. We use these initial designs as the starting point for large scale tree search, where we seek improved solutions and score them for their diurnal yield. The initial tree search leads to nominally more efficient solutions, yet they are caused by algorithmic reward hacking, arising from non-physical design features such as structurally levitating disconnected tiers and exploitations of the discretizations in the optics solver. To counteract this, we develop a workflow where the coding agent iteratively patches the physics engine with constraints to eliminate reward hacking. With reward-hacking eliminated, ERA discovers a series of designs with various constraints and improved performance, including optimal designs with different fixed collector areas, optimizing zenith tracking and avoiding self shadowing. Combining coding agents with tree search (ERA) provides a powerful platform for scientific discovery, for problems whose solutions can be empirically evaluated with a score function.