arXivDaily: arXiv daily digest, updated Monday through Friday
2606.19561 2026-06-19 cs.RO cs.SY eess.SY New submission

pdSTL: Probabilistic Differentiable Signal Temporal Logic for Stochastic Systems

Bennett Dogbey, Hemanth Manjunatha

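Affiliations Oklahoma State University

AI summary Proposes the pdSTL framework, which combines probabilistic semantics with differentiable robustness. Interval-valued probabilistic semantics and an LSTM-style unfolding give linear-time differentiable monitoring, and pdSTL outperforms deterministic differentiable STL in obstacle avoidance, lane-change, and real quadcopter flight experiments.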

Abstract

Autonomous robots operating in uncertain environments must satisfy complex temporal and safety specifications despite stochastic dynamics and sensing noise. While Signal Temporal Logic (STL) offers robustness measures for gradient-based optimization, existing extensions either lack differentiability or ignore belief-space uncertainty. We introduce pdSTL (probabilistic differentiable Signal Temporal Logic), a framework that unifies probabilistic semantics with differentiable robustness over belief trajectories. pdSTL employs interval-valued probabilistic semantics to compute conservative satisfaction bounds, propagated compositionally through the STL syntax tree. We formulate the temporal robustness evaluation as a recurrent, LSTM-style unfolding of STL operators, enabling linear-time, differentiable monitoring suitable for end-to-end trajectory optimization. We validate pdSTL on simulated obstacle avoidance, lane-change maneuvers, and real-world Crazyflie quadcopter flight experiments under aerodynamic disturbances. Results demonstrate that pdSTL achieves efficient optimization with formal probabilistic guarantees, significantly outperforming deterministic differentiable STL in maintaining safety margins under real-world uncertainty.
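
The recurrent unfolding mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the pdSTL implementation: it handles only a deterministic trace and the single formula G(x > threshold), and the log-sum-exp soft minimum, the function names, and the `temperature` parameter are all assumptions made for illustration. The point is that one backward pass carrying a single state value evaluates the operator in time linear in the trace length while staying differentiable.

```python
import math


def soft_min(a: float, b: float, temperature: float = 10.0) -> float:
    """Smooth lower bound on min(a, b), computed with log-sum-exp."""
    m = min(a, b)  # shift by the minimum for numerical stability
    total = math.exp(-temperature * (a - m)) + math.exp(-temperature * (b - m))
    return m - math.log(total) / temperature


def always_robustness(signal, threshold, temperature=10.0):
    """Smooth robustness of G(x > threshold) over a whole trace.

    Hypothetical helper: walks the trace backwards and keeps one
    recurrent state, so the cost is linear in len(signal).
    """
    state = math.inf  # robustness of "always" over an empty suffix
    for x in reversed(signal):
        margin = x - threshold  # robustness of the predicate at this step
        state = soft_min(margin, state, temperature)
    return state


if __name__ == "__main__":
    trace = [1.0, 0.5, 2.0]
    print(always_robustness(trace, threshold=0.0))
```

A positive value means the trace stays above the threshold with some margin; a larger `temperature` brings the result closer to the exact minimum at the price of sharper gradients.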

2606.19560 2026-06-19 cs.LG New submission

Understanding Key Features of Time Series Foundation Models from Epidemic Forecasting

Alireza Jafari, Judy Fox, Geoffrey C. Fox, Madhav Marathe, Aniruddha Adiga

Affiliations Department of Computer Science, School of Engineering and Applied Science, University of Virginia; School of Data Science, University of Virginia; Biocomplexity Institute, University of Virginia; Department of Electrical and Computer Engineering, School of Engineering and Applied Science, University of Virginia

AI summary Systematically evaluates a range of time series models on influenza forecasting: a mixture-of-experts model performs best, pretraining helps most at longer horizons, and LLM-based methods perform worse than numerical forecasters.

Comments 15 pages, 2 figures, 9 tables

Abstract

Seasonal influenza infects millions of people and causes substantial morbidity and mortality in the United States each year, making accurate short-term forecasting a core public-health need. Reliable forecasts of epidemic time series can inform vaccination timing, hospital staffing, and resource allocation, yet the comparative behavior of modern forecasting architectures on infectious-disease surveillance data remains insufficiently characterized. We address this gap through a systematic evaluation of regional influenza forecasting using influenza-like illness surveillance and influenza-associated hospitalization time series under both temporal and spatial generalization settings for 1-4-week-ahead prediction. We compare classical neural network architectures, numerical transformer-based models, pretrained time series foundation models, and LLM-based forecasting approaches. Across tasks, we demonstrate that a mixture-of-experts model that fuses multiple pretrained forecasters achieves the strongest overall performance, indicating that heterogeneous pretrained representations provide complementary predictive information. Our results further show that numerical transformer-based models produce reliable forecasts, while pretraining provides the largest gains at longer horizons, particularly when the pretraining domain is mechanistically aligned with influenza dynamics. In contrast, LLM-based time series methods underperform relative to numerical forecasters in this setting. Finally, we examine hospitalization information as both an auxiliary covariate and a pretraining source. Hospitalization signals provide complementary improvements in selected settings and clarify when additional surveillance streams enhance the robustness of multi-horizon forecasting. These findings provide actionable guidance on model selection, pretraining strategy, and auxiliary-signal use for influenza preparedness.

2606.19559 2026-06-19 cs.AI cs.CL New submission

Uncertainty Decomposition for Clarification Seeking in LLM Agents

Gregory Matsnev

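Affiliations AI Talent Hub, ITMO University

AI summary Proposes a prompt-based uncertainty decomposition that separates action confidence from request uncertainty, so an agent can proactively ask for clarification when the task specification is ambiguous. Averaged over five LLM backbones, clarification F1 on ALFWorld-Clarification improves by 36% over UAM and by 73% over ReAct+UE.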

Comments 26 pages, 8 figures. Source code: https://github.com/PE51K/udcs-in-llm-agents

Abstract

Recent position papers argue that the classical aleatoric/epistemic uncertainty framework is insufficient for interactive large language model (LLM) agents and call for underspecification-aware, decomposed, and communicable uncertainty representations that can unlock new agent capabilities such as proactive clarification seeking and shared mental-model building. Practical deployment constraints -- black-box APIs, interactive latency budgets, and the absence of labeled trajectories -- rule out logprob-based, multi-sampling, and training-based methods, leaving prompt-based estimation as the most viable family for surfacing such signals at deployment time. We answer this call with a simple prompt-based decomposition that separates action confidence from request uncertainty (u), enabling the agent to ask for clarification when the task specification is ambiguous. To evaluate it, we introduce two clarification-augmented benchmarks (WebShop-Clarification and ALFWorld-Clarification) in which 50% of tasks are deliberately underspecified, and systematically compare the proposed decomposition against ReAct+UE and Uncertainty-Aware Memory (UAM) across five LLM backbones (GPT-5.1, DeepSeek-v3.2-exp, GLM-4.7, Qwen3.5-35B, GPT-OSS-120B) on these variants together with the standard WebShop, ALFWorld, and REAL benchmarks for fault detection. Averaged across the five backbones, the proposed decomposition improves clarification F1 on ALFWorld-Clarification by 73% over ReAct+UE and by 36% over UAM, and leads clarification F1 on every backbone on WebShop-Clarification and on four of five backbones on ALFWorld-Clarification, indicating that the gains generalize beyond a single LLM.

2606.19558 2026-06-19 cs.LG cs.CL New submission

Displacement Is Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment

Miloš Nikolić, Ali Hadi Zadeh, Enrique Torres Sanchez, Andreas Moshovos

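Affiliations ByteShape; University of Toronto; Vector Institute for Artificial Intelligence

AI summary Studies how well fidelity metrics such as KL divergence track downstream benchmark scores when deploying quantized language models. The correlation is strong over the full cohort but breaks down in the near-baseline region, because KL divergence mainly measures the volume of disagreement rather than its direction.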

Abstract

Fidelity metrics, such as per-token KL divergence (KLD) against a high-precision reference, are often used in practice as low-cost proxies for benchmark quality. We test this practice on a 28-quant cohort of Qwen3.6-35B-A3B and a 41-quant cohort of Devstral-Small-2-24B, evaluated across a suite of downstream benchmarks. We find that KLD is strongly correlated with benchmark score over the full cohort ($ρ=-0.72$ on Qwen and $ρ=-0.86$ on Devstral, both with $p<0.001$). However, this relationship collapses to non-significance in the near-baseline silent zone ($ρ=+0.00$ on Qwen and $ρ=-0.24$, $p=0.36$, on Devstral). This collapse persists across 14 measurement variants, including different KLD aggregations, perplexity formulations, top-1 agreement, calibration corpora, and context lengths. At the per-prompt level, KLD has only weak failure-prediction power on code, with failed-vs-passed geometric-mean ratios in $[1.08,1.22]$ across five models on LiveCodeBench, and fails as a cross-model router, achieving only $42.3\%-49.4\%$ accuracy on disagreement prompts. We trace the collapse to a structural decomposition: KLD primarily measures the volume of disagreement with the reference, with silent-zone composite $ρ=+0.94$ ($p<0.001$) on Qwen and $+0.55$ ($p=0.03$) on Devstral, while its relationship to the direction of those disagreements is weak and task-conditional.
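
The distinction between volume and direction of disagreement can be made concrete with a small Python sketch. It is a toy illustration, not the paper's measurement code: the two-token vocabulary, the logit values, and the function names are assumptions. It computes per-token KL(reference || quantized) from logits and shows that two quantized models that move probability mass in opposite directions receive the same KLD.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def token_kld(ref_logits, quant_logits):
    """KL(ref || quant) for one token position, in nats."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kld(ref_seq, quant_seq):
    """Average per-token KLD over a sequence of positions."""
    per_token = [token_kld(r, q) for r, q in zip(ref_seq, quant_seq)]
    return sum(per_token) / len(per_token)


if __name__ == "__main__":
    # Hypothetical vocabulary: index 0 is the correct token, index 1 is wrong.
    reference = [0.0, 0.0]        # reference model is undecided
    toward_correct = [1.0, 0.0]   # quant A shifts mass to the correct token
    toward_wrong = [0.0, 1.0]     # quant B shifts mass to the wrong token
    print(token_kld(reference, toward_correct))
    print(token_kld(reference, toward_wrong))
```

Both printed values agree by symmetry, although only the second quantized model would hurt a benchmark score in this toy setup, which is the kind of ambiguity the abstract attributes to KLD.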

2606.19556 2026-06-19 cs.CE New submission

A hybrid sharp-diffuse interface approach to accurately model melt pool dynamics with rapid evaporation in laser-based processing of metals

Nils Much, Andreas Koch, Christoph Meier, Magdalena Schreter-Fleischhacker

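AI summary Proposes a hybrid sharp-diffuse interface approach that couples a sharp-interface heat transfer model with a diffuse-interface multi-phase flow model to simulate evaporation-driven melt pool thermo-hydrodynamics in laser processing, with accuracy one order of magnitude better than a purely diffuse-interface model.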

Journal ref Computer Methods in Applied Mechanics and Engineering 457, 119023, 2026

Abstract

Predictive simulation of melt pool dynamics in laser-based processing of metals, e.g., laser beam welding or laser powder bed fusion additive manufacturing, requires accurate resolution of thermo-hydrodynamic interactions at the melt-gas interface. Here, evaporation-induced recoil pressure and temperature-dependent surface tension govern the flow. Because these mechanisms depend sensitively, often exponentially, on the interface temperature, reliable predictions demand highly accurate heat transfer models. Popular diffuse-interface formulations smear the extreme thermal gradients typical of laser-metal interactions, leading to interface temperature errors that critically degrade the accuracy of interface force predictions and melt pool dynamics. We present a hybrid sharp-diffuse interface approach for high-fidelity modelling of melt pool thermo-hydrodynamics with rapid evaporation. The heat transfer problem is represented using a sharp-interface unfitted finite element (CutFEM) formulation, enabling accurate prediction of the temperature field. The multi-phase flow problem, characterized by large density ratios and complex interface dynamics, is accurately captured using a robust level-set-based one-fluid diffuse-interface finite element formulation. Consistent coupling is achieved by extending the sharp-interface temperature into a narrow interface region to evaluate temperature-dependent interface forces within the diffuse-interface flow framework. In practically relevant benchmarks, the sharp-interface thermal model exhibits second-order spatial convergence, enabling finite element sizes two orders of magnitude larger than the diffuse-interface approach for 1% accuracy. In a novel coupled thermo-hydrodynamic benchmark representative of laser-metal interactions, the hybrid approach is one order of magnitude more accurate than a purely diffuse-interface model on the same mesh. Robu

2606.19555 2026-06-19 cs.RO New submission

SCAN-Planner: Spatial Collision-Aware Local Planning for Route-Guided Long-Range Quadruped Navigation

Han Zheng, Zhe Chen, Yiwen Fu, Ming Yang, Tong Qin

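Affiliations Shanghai Jiao Tong University

AI summary Proposes the SCAN-Planner framework, which achieves spatial collision-aware local planning through a yaw-aware twin-cylinder footprint and a projected A* search, generating safe and smooth trajectories in dense clutter, 3D unstructured environments, and long-range navigation.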

Abstract

Quadruped robots are increasingly expected to navigate through narrow passages, cluttered indoor scenes, and large-scale 3D unstructured environments. Existing local planners commonly approximate the robot using isotropic geometric inflation or rely on planar and elevation-map representations, leading to conservative motion in tight spaces and limited reasoning about overhanging structures. This letter presents SCAN-Planner, a spatial collision-aware local planning framework for long-range quadruped navigation. A yaw-aware twin-cylinder footprint is used to model the elongated robot body, enabling whole-body collision evaluation through sparse queries in an inflated 3D occupancy map. We further introduce a projected A* search that generates collision-free guidance on an interpolated ground-following surface, with z-gradient suppression to avoid obstacles horizontally while maintaining vertical stability. For large-scale deployment, a robot-centric sliding map with boundary fallback provides high-resolution local collision checking and recovery from local dead ends. Simulation and real-world experiments demonstrate that SCAN-Planner generates safe, smooth, and efficient trajectories in dense clutter, 3D unstructured scenes, stair traversal, and long-range navigation tasks.

2606.19552 2026-06-19 cs.CL New submission

LaViSA: A Language and Vision Structural Ambiguity Benchmark

Lee Sangmyeong, Shun Inadumi, Koichiro Yoshino

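Affiliations Nara Institute of Science and Technology; Guardian Robot Project RIKEN; The University of Osaka

AI summary Proposes the LaViSA benchmark, which uses ambiguous sentences in seven categories with corresponding images to evaluate how well vision-language models use visual scenes to resolve structural ambiguity. Experiments show that current models make partial use of visual information but remain limited on certain ambiguity types and subtle semantic distinctions.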

Abstract

Structural ambiguity arises when a single sentence admits multiple valid interpretations due to its syntactic structure, posing a fundamental challenge for language understanding. Visual scenes serve as useful cues for resolving such ambiguity, and Vision and Language Models (VLMs) need to be capable of deriving possible semantic interpretations from visual scenes. We introduce Language and Vision Structural Ambiguity (LaViSA), a benchmark designed to evaluate the ability of VLMs to resolve structural ambiguity leveraging visual scenes. LaViSA consists of ambiguous sentences, their disambiguated sentences, and corresponding images of these disambiguated sentences across seven ambiguity categories. Using LaViSA, we conduct a comprehensive evaluation of diverse VLMs, including both proprietary and open-source models with varying parameter scales and reasoning capabilities. Experimental results show that although recent VLMs can leverage visual scenes to resolve structural ambiguity to some extent, they still struggle with certain ambiguity types and visually subtle semantic distinctions, indicating remaining limitations in resolving structural ambiguity using visual scenes.

2606.19549 2026-06-19 cs.LG New submission

Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates

Lin Tang, Wei Zhang, Jing Li, Hongyu Chen, Ming Zhao, Yuxuan Wang

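Affiliations Sichuan University; University of Electronic Science and Technology of China

AI summary Proposes MergeProbe, which predicts the mergeability of LoRA adapters from early-training signals and attains the best average and worst-case retention on the MERGE-PEFT benchmark.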

Abstract

Low-rank adaptation (LoRA) makes it cheap to train many domain- and task-specific language model adapters, but whether two adapters can be merged is usually discovered only after both have been fully trained and evaluated. This late feedback is costly: adapters that are strong in isolation can interfere destructively once their updates are combined. We ask whether this outcome can be anticipated. We formalize adapter mergeability as the degree to which an adapter preserves its single-task utility after merging, and show that it can be forecast from signals measured in the first few percent of training -- chiefly how the low-rank updates and their gradients align across tasks and how much they disturb shared representations. We package these signals into MergeProbe, a lightweight predictor that estimates pairwise and set-level retention and turns the estimate into a concrete decision: merge directly, reweight, prune, or route. On MERGE-PEFT, a five-domain benchmark spanning math, code, science, instruction following, and safety, MergeProbe attains the best average and worst-case retention among strong interference-aware merge baselines while adding far less deployment overhead than full task routing. This turns LoRA merging from a post-hoc engineering step into an anticipatory measurement problem.
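
One of the signals named in the abstract, how low-rank updates align across tasks, can be sketched in a few lines of Python. The abstract does not define MergeProbe's features, so the cosine similarity used here, the function names, and the nested-list matrix layout are assumptions chosen for illustration only. Each adapter is taken to contribute an update B @ A to a shared weight matrix.

```python
import math


def matmul(b, a):
    """Multiply a (d x r) matrix by an (r x k) matrix given as nested lists."""
    inner = len(a)
    cols = len(a[0])
    return [[sum(row[t] * a[t][j] for t in range(inner)) for j in range(cols)]
            for row in b]


def update_alignment(b1, a1, b2, a2):
    """Cosine similarity between the flattened updates B1 @ A1 and B2 @ A2.

    Hypothetical signal: values near +1 suggest the adapters push the
    weights the same way, values near -1 suggest they work against
    each other, and values near 0 suggest little overlap.
    """
    u = [v for row in matmul(b1, a1) for v in row]
    w = [v for row in matmul(b2, a2) for v in row]
    dot = sum(x * y for x, y in zip(u, w))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in w))
    return dot / norm if norm > 0 else 0.0


if __name__ == "__main__":
    # Two rank-1 adapters on a 2 x 2 weight matrix that touch different entries.
    math_b, math_a = [[1.0], [0.0]], [[1.0, 0.0]]
    code_b, code_a = [[0.0], [1.0]], [[0.0, 1.0]]
    print(update_alignment(math_b, math_a, code_b, code_a))
```

In practice such a score would be measured on checkpoints from the first few percent of training and combined with other signals; how they are combined is not specified in the abstract.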

2606.19544 2026-06-19 cs.CL New submission

Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias

Justin D. Norman, Michael U. Rivera, D. Alex Hughes

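Affiliations UC Berkeley School of Information

AI summary A large-scale systematic evaluation (21 judge models, 118 runs, about 541,000 judgments) finds pervasive problems in LLM-as-a-Judge agreement, consistency, and bias, including kappa deflation, ranking shifts, and high test-retest reliability coexisting with severe position bias, and proposes a Minimum Viable Validation Protocol.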

Abstract

LLM-as-a-Judge has become the dominant evaluation paradigm for language models, but judge validation in practice relies on exact-match agreement, a metric that does not correct for chance and systematically overstates discriminative ability. We present the largest systematic evaluation of LLM-as-a-Judge to date: 21 judges from nine providers across MT-Bench, JudgeBench, and RewardBench, evaluated under three protocols (agreement, consistency, bias audit) over 118 runs and approximately 541,000 individual judgments. Four findings emerge, consistent across the full cohort, including the April 2026 frontier: kappa deflation between exact match and Cohen's kappa is universal (33--41 pp on MT-Bench), judge rankings shift by up to 14 positions across benchmarks, high test--retest reliability (>0.95) coexists with severe position bias (>0.10) in two production-deployed judges (instantiating a consistency--bias paradox), and verbosity bias is small (<0.011) across our cohort under a single pairwise rubric. We distill these into a Minimum Viable Validation Protocol.
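
The gap between exact-match agreement and a chance-corrected statistic can be seen in a short Python sketch. The labels below are made-up toy data, not results from the paper, and the function names are illustrative. Cohen's kappa is computed as (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected from the two raters' label frequencies alone.

```python
from collections import Counter


def exact_match(judge, human):
    """Fraction of items on which the judge and the human label agree."""
    return sum(j == h for j, h in zip(judge, human)) / len(human)


def cohens_kappa(judge, human):
    """Chance-corrected agreement between two label sequences."""
    n = len(human)
    p_o = exact_match(judge, human)
    judge_freq, human_freq = Counter(judge), Counter(human)
    # Expected agreement if both raters labelled independently
    # with their own marginal label frequencies.
    p_e = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined: both raters use one identical label
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Toy data: humans prefer response A on 8 of 10 items,
    # and the judge answers A every time.
    human = ["A"] * 8 + ["B"] * 2
    judge = ["A"] * 10
    print(exact_match(judge, human), cohens_kappa(judge, human))
```

In this toy case the judge never discriminates between the two responses, yet exact match still looks high because one label dominates; kappa removes that head start.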

2606.19542 2026-06-19 cs.LG New submission

Tracking Representation Dynamics in Large Language Models with Persistent Homology

Naman Malhotra, Jay Ambadkar, Abhinav Gupta, Kushal Kasivel, Abbas Schwarz, Kamillo Ferry, Anthea Monod

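Affiliations Imperial College London

AI summary Analyzes the topology of activation spaces with persistent homology and finds that topological reorganization during alignment happens mainly early in training, and that different alignment objectives produce distinguishable topological trajectories.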

Comments 29 pages

Abstract

Large language models are commonly aligned through supervised fine-tuning, yet little is known about how their internal representations evolve during this process. We study alignment dynamics using persistent homology by tracking the topology of activation spaces throughout fine-tuning. Across four transformer language models ranging from 1B to 7B parameters and three alignment objectives corresponding to helpful, harmless, and mixed training data, we find that the majority of topological reorganization occurs during the earliest stages of training. A dense checkpoint analysis reveals a transient peak in topological activity followed by rapid stabilization. We further show that different alignment objectives induce distinguishable topological trajectories, while instruction-tuned and pretrained models exhibit qualitatively different patterns of evolution. Our results suggest that persistent homology provides a complementary perspective on alignment, revealing representation-level changes that are not apparent from behavioral metrics alone.
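
For readers new to persistent homology, the simplest case (dimension 0, connected components) can be computed with a union-find structure. The sketch below is an assumption-laden illustration rather than the authors' pipeline: the abstract does not say which homology dimensions, filtration, or library they use. Here every point of a cloud, standing in for an activation vector, starts as its own component, and a component dies at the edge length where it merges into another.

```python
import math
from itertools import combinations


def h0_death_times(points):
    """Death scales of connected components in a Vietoris-Rips filtration.

    Every component is born at scale 0; the returned list holds the
    edge lengths at which two components merge, in increasing order.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:       # this edge joins two components
            parent[root_i] = root_j
            deaths.append(length)
    return deaths  # n - 1 finite deaths; one component never dies


if __name__ == "__main__":
    # Hypothetical 2D "activations": a tight pair and one distant point.
    cloud = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
    print(h0_death_times(cloud))
```

Comparing such death times across fine-tuning checkpoints is one simple way to track how cluster structure changes; higher-dimensional features need a dedicated library.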

2606.19538 2026-06-19 cs.AI cs.LG New submission

ITNet: A Learnable Integral Transform That Subsumes Convolution, Attention, and Recurrence

Ashim Dhor, Rasel Mondal, Pin Yu Chen

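Affiliations Indian Institute of Science Education and Research Bhopal; IBM Research

AI summary Proposes ITNet, a learnable integral transform network whose kernel depends jointly on positions and features, unifying convolution, attention, and recurrence and delivering strong performance across modalities.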

Abstract

Convolutional networks, recurrent networks, and transformers each encode different inductive biases -- locality, sequential memory, and content-dependent pairwise interaction -- and have remained mathematically distinct since their inception. We show that this fragmentation reflects not a fundamental diversity in how signals should be processed, but rather incomplete views of a single underlying mathematical object: a learnable integral transform. We introduce the Integral Transform Network (ITNet), a unified architecture built around a learnable kernel that depends jointly on positions and features. This kernel is implemented as a small neural network, specifically an MLP, that models pairwise interactions, enabling the model to adapt its behavior from data. We show that convolution, self-attention (including multi-head), and autoregressive recurrence (including LSTM, GRU, S4, and Mamba) arise as special cases under appropriate parameterizations, and that ITNet is a universal approximator of continuous operators. To make this practical, we develop tiled kernel fusion, importance-weighted Monte Carlo integration, and learned low-rank factorization, enabling efficient and scalable computation. A single ITNet architecture with a shared operator and lightweight modality-specific encoders matches or exceeds specialized baselines on ImageNet-1K, GLUE, ModelNet40, VQA v2, and NLVR2. The results demonstrate that a single learned interaction mechanism can recover the behavior of all three architectural families from data.

2606.19537 2026-06-19 cs.MA cs.DC New submission

Mesh Inference: A Formal Model of Collective Intelligence Without a Center

Hongwei Xu

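AI summary Proposes a formal model of mesh inference in which agents reason collaboratively without a center through a coupled free energy. It proves convergence to a unique answer, identification-completeness, and the observation-only property, and analyzes the latency cost in the linear-Gaussian case.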

Comments 21 pages, 2 figures

Abstract

We present a formal model of mesh inference: how a population of independent agents, each holding private state and exchanging only admitted, typed observations, derives a conclusion none of them holds alone, with no central coordinator and no agent exposed. No agent shares weights, gradients, or hidden state, and the agents may span different teams, networks, and organizations. Motivated by the observation that asking a model is energy-minimizing inference, we model the mesh as a coupled free energy that each agent relaxes locally. We show that a single admission/emission policy governs three properties. First, mesh inference converges to a unique answer for any admission, symmetric or not, because the coupling is always an M-matrix. Second, it is identification-complete: it derives the centralized optimum exactly when the contributing views are carrier-connected. Third, it is observation-only: no node transmits its internals, and confidentiality is the dual of identification. Content-addressed lineage is the only global side-channel. In the linear-Gaussian regime every derived answer is determined, hence equal to the centralized optimum, at O(diam^2) latency, the measured price of removing the center. One such derivation is one turn of a center-free learning loop, which we formalize as architecture rather than prove. The open problem we state is when asking improves the collective rather than corrupting it: whether the non-linear closure derives an upgraded answer or a confident error. To our knowledge, this is the first formal model of mesh inference.

2606.19535 2026-06-19 cs.CR cs.LG New submission

FloatDoor: Platform-Triggered Backdoors in LLMs

Nils Loose, Jonas Sander, Felix Mächtle, Thomas Eisenbarth

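Affiliations University of Luebeck

AI summary Proposes FloatDoor, the first input-independent, platform-triggered backdoor attack. It exploits platform differences in floating-point arithmetic and uses two lightweight LoRA adapters to trigger malicious behavior on a target platform while preserving the model's normal utility.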

Abstract

Large language models (LLMs) are increasingly deployed in sensitive settings such as software engineering, where their outputs directly shape downstream artifacts. Recent work has shown that an identical model can produce measurably different outputs depending on the deployment platform, a consequence of non-associative floating-point arithmetic and divergent kernel implementations. We study the security implications of this platform-dependent variability and uncover a novel attack surface on LLM deployments. We introduce FloatDoor, the first input-independent, platform-triggered backdoor attack against generative LLMs. The compromised model exhibits adversary-chosen behavior when served on a target platform and is otherwise benign. FloatDoor is realized through two lightweight LoRA adapters, one that amplifies inter-platform numerical divergence and one that binds the resulting platform signature to a malicious downstream task, while leaving aggregate model utility largely intact. FloatDoor exploits a pronounced time-of-check, time-of-use gap between model auditing and serving. We demonstrate FloatDoor on Qwen3-4B across a broad range of deployment targets, including NVIDIA GPUs, Google TPUs, AWS Graviton, and Alibaba Yitian-710. As a final case study, we show that FloatDoor reliably induces exploitable code vulnerabilities on a chosen target platform. Our results establish a new class of attacks on LLM deployments and underscore the pressing need for trusted model supply chains in sensitive, LLM-powered applications.
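
The numerical root cause cited in the abstract, non-associative floating-point arithmetic, is easy to reproduce in Python. The sketch below shows only that benign effect and nothing about the attack itself: the same three numbers summed in two different orders, as two kernels with different reduction orders might do, give results that differ in the last bit. The function names and the input values are chosen for illustration.

```python
def sum_in_order(values):
    """Accumulate left to right, like a simple sequential kernel."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_reversed(values):
    """Accumulate right to left, standing in for a different reduction order."""
    return sum_in_order(list(reversed(values)))


if __name__ == "__main__":
    values = [0.1, 0.2, 0.3]
    forward = sum_in_order(values)
    backward = sum_reversed(values)
    print(forward == backward)       # the two orders disagree
    print(abs(forward - backward))   # by roughly one unit in the last place
```

Differences of this size are harmless in ordinary use; the paper's contribution is showing that they can be amplified into a platform signature.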

2606.19534 2026-06-19 cs.CV cs.AI cs.CL New submission

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

Yueyi Sun, Yuhao Wang, Jason Li, Ye Tian, Tao Zhang, Jacky Mai, Yihan Wang, Haochen Wang, Jinbin Bai, Ling Yang, Yunhai Tong

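Affiliations Peking University; MSALab; ByteDance

AI summary Proposes PerceptionDLM, which exploits the parallel decoding of diffusion language models, using efficient prompting and structured attention masking to perceive multiple regions in parallel and significantly improve inference efficiency, and builds the ParaDLC-Bench benchmark for evaluation.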

Comments Code available at https://github.com/MSALab-PKU/PerceptionDLM

Abstract

Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks. However, most existing MLLMs rely on autoregressive generation, which limits their efficiency for perception tasks that require captioning multiple regions. In this work, we propose PerceptionDLM, a multimodal diffusion language model optimized for efficient parallel region perception. Built upon PerceptionDLM-Base, a strong foundational baseline that achieves state-of-the-art performance among open-source diffusion MLLMs, our architecture fully leverages the parallel decoding nature of DLMs. Specifically, we introduce efficient prompting and structured attention masking to enable simultaneous perception of multiple masked regions, allowing the model to generate region descriptions in parallel at both the sequence and token levels. This design significantly improves inference efficiency compared with existing approaches that process regions sequentially. To systematically evaluate the parallelism of visual perception in DLMs, we construct a new Parallel Detailed Localized Captioning Benchmark (ParaDLC-Bench) by scaling the DLC-Bench to include multiple region masks per image, enabling joint evaluation of both caption quality and inference efficiency. Experiments demonstrate that PerceptionDLM maintains competitive performance in region captioning while achieving substantial speed improvements for multi-region perception tasks. Our results highlight the potential of multimodal diffusion language models for efficient, parallel visual perception. To the best of our knowledge, we are the first to achieve parallel region captioning and perception by leveraging the advantages of diffusion language models. Code, models, and datasets are released.

2606.19533 2026-06-19 cs.AR cs.AI 新提交

A Tool for the Synthesis of Adaptive Probabilistic Processors Based on the Ising Model

基于伊辛模型的自适应概率处理器合成工具

Jonathan Juracy Carneiro da Silva, Leonardo R. Gobatto, Jose Rodrigo Azambuja

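AI summary: Presents a tool that automatically synthesizes and simulates probabilistic architectures by mapping combinatorial optimization problems to the Ising model and adaptively selecting the update algorithm, which improves convergence behavior and supports hardware implementation.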

Comments ACM/IEEE/SBC/SBMICRO Symposium on Integrated Circuits and Systems Design 2026

详情
AI中文摘要

本文提出一种用于合成和仿真概率架构的工具,通过将组合优化问题映射到伊辛模型来求解。该方法根据问题特征(如规模和拓扑)自动构建伊辛哈密顿量并确定概率元件(p-bits)的数量。此外,该工具引入了一种自适应策略,用于在吉布斯采样、模拟退火(SA)、模拟量子退火(SQA)和基于簇的方法中选择最合适的更新算法。使用基准问题的实验结果表明,与固定方法相比,该方法具有更好的收敛行为和灵活性。所提出的框架能够系统评估概率计算策略,并支持基于MTJ和p-bits的未来硬件实现的开发。

英文摘要

This work presents a tool for the synthesis and simulation of probabilistic architectures for solving combinatorial optimization problems by mapping them to the Ising model. The proposed approach automatically constructs the Ising Hamiltonian and determines the number of probabilistic elements (p-bits) based on problem characteristics such as size and topology. Furthermore, the tool introduces an adaptive strategy for selecting the most suitable update algorithm among Gibbs Sampling, Simulated Annealing (SA), Simulated Quantum Annealing (SQA), and cluster-based methods. Experimental results using benchmark problems demonstrate improved convergence behavior and flexibility compared to fixed approaches. The proposed framework enables systematic evaluation of probabilistic computing strategies and supports the development of future hardware implementations based on MTJs and p-bits.
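Illustrative sketch (not the tool described in the abstract): the mapping from a combinatorial problem to an Ising Hamiltonian can be shown on a tiny Max-Cut instance, with one p-bit per graph node and Gibbs updates under a linearly increasing inverse temperature. The 4-cycle graph, the annealing schedule, and every function name below are assumptions chosen for illustration.

```python
import math
import random

def ising_energy(J, s):
    """H(s) = -sum_{i<j} J[i][j] * s[i] * s[j] for spins s[i] in {-1, +1}."""
    n = len(s)
    return -sum(J[i][j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n))

def maxcut_to_ising(n, edges):
    """Max-Cut as Ising: antiferromagnetic coupling J = -1 on every edge."""
    J = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        J[i][j] = J[j][i] = -1.0
    return J

def anneal(J, sweeps=200, beta_max=3.0, seed=0):
    """Gibbs (p-bit style) updates while the inverse temperature ramps up."""
    rng = random.Random(seed)
    n = len(J)
    s = [rng.choice((-1, 1)) for _ in range(n)]
    for t in range(sweeps):
        beta = beta_max * (t + 1) / sweeps
        for i in range(n):
            # Local field acting on spin i from its neighbours.
            h = sum(J[i][j] * s[j] for j in range(n) if j != i)
            # Conditional probability P(s_i = +1 | all other spins).
            p_up = 1.0 / (1.0 + math.exp(-2.0 * beta * h))
            s[i] = 1 if rng.random() < p_up else -1
    return s

def cut_size(edges, s):
    """Number of edges whose endpoints received opposite spins."""
    return sum(1 for i, j in edges if s[i] != s[j])

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # hypothetical 4-cycle instance
J = maxcut_to_ising(4, edges)
spins = anneal(J)
print(spins, ising_energy(J, spins), cut_size(edges, spins))
```

With J = -1 on every edge, the energy equals the number of edges minus twice the cut size, so lowering the energy is the same as enlarging the cut. Swapping the schedule or the update rule is where an adaptive selection strategy would plug in.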

2606.19532 2026-06-19 cs.LO 新提交

Vancomycert: A Certified Neuro-Symbolic Drug Delivery System (Case Study)

Vancomycert: 一种经过认证的神经符号药物递送系统(案例研究)

Alistair Sirman, Fleur Conway, Jessica Ciupa, Gusts Gustavs Grīnbergs, Ekaterina Komendantskaya, Thai Son Hoang, Michael Rawson, Alessandro Bruni, Vaishak Belle, Michael John Williams

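AI summary: Addresses formal verification of a neural network controller for antibiotic dosing, combining supervised learning with theorem proving to ensure that automated dosing never exceeds the supratherapeutic boundary over an infinite time horizon.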

详情
AI中文摘要

自主决策的神经网络控制器在网络物理系统中已得到广泛应用,但在安全关键的医疗环境中,其部署仍未得到充分验证。本文提出了一种用于抗生素给药神经网络控制器形式化验证的方法和案例研究,其动机源于系统必须在无限时间范围内同时具备适应性和可证明安全性的挑战。我们构建了一个简化但临床可解释的模型,用于跟踪药物浓度、体温和白细胞计数。万古霉素被选为代表性抗生素,广泛用于严重感染,但治疗窗口狭窄,超治疗浓度有肾毒性风险,而亚治疗剂量可能导致治疗失败。我们使用合成的临床医生式给药数据训练了一个监督式神经网络控制器。我们建立了输入-输出安全属性的形式化验证,特别验证了神经网络的一个属性,该属性意味着无限时域证明自动给药从未超过超治疗边界。该系统的属性在Rocq中使用Vehicle交互式定理证明器后端进行证明,以集成不同的证明系统。最终结果是一个验证流水线,允许各种治疗方法,同时为每个特定患者保持安全性。

英文摘要

Neural network controllers for autonomous decision-making are well-established in cyber-physical systems, yet their deployment in safety-critical healthcare settings remains largely unverified. This paper presents a methodology and case study for the formal verification of a neural network controller for antibiotic dosing, motivated by the challenge of systems that must be simultaneously adaptive and provably safe across unbounded time horizons. We construct a simplified yet clinically-interpretable model that tracks drug concentration, body temperature, and white blood cell count. Vancomycin is selected as a representative antibiotic, widely prescribed for severe infections yet carrying a narrow therapeutic window, where supratherapeutic concentrations risk nephrotoxicity and subtherapeutic dosing risks treatment failure. A supervised neural network controller is trained on synthetic clinician-style dosing data. We establish formal verification of input-output safety properties, specifically verifying a property of a neural network that implies an infinite-horizon proof that automated dosing never exceeds the supratherapeutic boundary. This system property is proven in Rocq using the Vehicle interactive theorem prover back-end to integrate the different proof systems. The end result is a verification pipeline that allows for a wide variety of treatment approaches whilst maintaining safety for each specific patient.

2606.19531 2026-06-19 cs.CV cs.RO 新提交

ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?

ImageWAM:世界动作模型真的需要视频生成,还是只需要图像编辑?

Yuyang Zhang, Wenyao Zhang, Zekun Qi, He Zhang, Haitao Lin, Jingbo Zhang, Yao Mu, Xiaokang Yang, Wenjun Zeng, Xin Jin

发表机构 * Shanghai Jiao Tong University(上海交通大学) Eastern Institute of Technology(东方理工学院) Tencent Robotics X(腾讯机器人X) Tsinghua University(清华大学) Zhongguancun Academy(中关村学院)

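AI summary: Proposes ImageWAM, a framework that uses a pretrained image editing model in place of video generation for robot action prediction, with the KV caches from editing denoising serving as world-action context. It outperforms baselines in multiple simulated and real-world experiments while cutting compute to 1/6 and latency to 1/4 of video-based WAMs.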

Comments Project Page: https://zhangwenyao1.github.io/ImageWAM/

详情
AI中文摘要

世界动作模型(WAMs)通常依赖视频生成来桥接视觉世界建模和机器人控制。然而,基于视频的WAMs面临三个耦合的限制:密集的多帧未来令牌使得推理成本高昂,完整的视频预测将容量花费在与动作无关的时间和外观细节上,以及长期未来想象可能引入误导动作预测的错误。这些问题提出了一个简单的问题:世界动作模型真的需要视频生成吗?我们提出ImageWAM,一个简单的WAM框架,将预训练的图像编辑模型重新用于机器人动作预测。与视频生成相比,图像编辑提供了更匹配的先验:它只需要建模目标帧变换,关注与动作相关的当前到目标视觉差异,并通过编辑预训练将任务指令接地到局部视觉变化。在实践中,ImageWAM在推理时不解码目标帧;相反,它根据图像编辑去噪产生的KV缓存条件化一个流匹配动作专家,将其用作紧凑的世界动作上下文。ImageWAM在多个模拟和真实世界实验中优于标准VLA基线和匹配的竞争性WAM,且无需额外的策略预训练。它还将FLOPs降低到基于视频的WAMs的1/6,延迟降低到1/4。注意力分析进一步表明,编辑缓存聚焦于任务相关的变化区域,支持图像编辑作为基于视频的世界动作建模的有效替代方案。

英文摘要

World Action Models (WAMs) commonly rely on video generation to bridge visual world modeling and robot control. However, video-based WAMs face three coupled limitations: dense multi-frame future tokens make inference costly, full video prediction spends capacity on action-irrelevant temporal and appearance details, and long-horizon future imagination may introduce errors that mislead action prediction. These issues raise a simple question: Do world action models really need video generation? We propose ImageWAM, a simple WAM framework that repurposes pretrained image editing models for robot action prediction. In contrast to video generation, image editing provides a better-matched prior: it only needs to model a target-frame transformation, focuses on action-relevant current-to-target visual differences, and grounds task instructions to localized visual changes through edit pretraining. In practice, ImageWAM does not decode the target frame at inference time; instead, it conditions a flow-matching action expert on the KV caches produced by image-editing denoising, using them as a compact world-action context. ImageWAM outperforms standard VLA baselines and matching competitive WAMs without additional policy pretraining across different simulated and real-world experiments. It also reduces FLOPs to 1/6 and latency to 1/4 of video-based WAMs. Attention analysis further shows that editing caches focus on task-relevant change regions, supporting image editing as an effective alternative to video-based world-action modeling.

2606.19529 2026-06-19 cs.DC 新提交

The Sheaf Laplacian: A Topological Framework for Data Fusion and Consensus in Distributed Sensing Networks

层拉普拉斯算子:分布式传感网络中数据融合与共识的拓扑框架

Manuel Hernández, Eduardo Sánchez-Soto

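AI summary: Proposes sheaf theory as an alternative to traditional graph models and uses the sheaf Laplacian for data fusion and consensus in heterogeneous distributed sensing networks.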

详情
AI中文摘要

我们在此论证,传统网络模型——绝大多数基于简单图的数学构造——从根本上不足以捕捉现代分布式系统的复杂性。这类系统的特点是具有不同能力的异构代理、高维多模态数据流,以及无法用简单连接或标量权重充分描述的复杂上下文相关关系。这些经典模型的局限性要求一种具有更强表达能力的新数学语言。我们发现层理论为我们提供了这样一种语言。此外,我们表明层拉普拉斯算子是分布式传感网络中进行数据融合和建立共识的合适机制。

英文摘要

We argue here that traditional network models, which are overwhelmingly based on the mathematical construct of a simple graph, are fundamentally insufficient for capturing the complexity of modern distributed systems. Such systems are characterized by heterogeneous agents with diverse capabilities, high-dimensional and multi-modal data streams, and intricate, context-dependent relationships that cannot be adequately described by a simple connection or a scalar weight. The limitations of these classical models necessitate a new mathematical language, one with far greater expressive power. We have found that sheaf theory provides us with such a language. Moreover, we show that the sheaf Laplacian is a suitable mechanism for data fusion and establishing consensus within distributed sensing networks.
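Illustrative sketch (not taken from the paper): on a graph, a sheaf Laplacian can be built as L = δᵀδ, where the coboundary δ compares the two endpoint values of each edge after passing them through restriction maps. The version below assumes one-dimensional stalks with scalar restriction maps, which is a strong simplification, and uses a hypothetical three-sensor network in which one sensor reports centimetres while the others report metres. All names and numbers are assumptions.

```python
def sheaf_laplacian(n, edges):
    """edges: list of (u, v, r_u, r_v), where r_u and r_v are scalar restriction
    maps from the 1-D stalks at u and v into the 1-D stalk on the edge.
    Coboundary: (delta x)_e = r_v * x[v] - r_u * x[u];  L = delta^T delta."""
    L = [[0.0] * n for _ in range(n)]
    for u, v, r_u, r_v in edges:
        L[u][u] += r_u * r_u
        L[v][v] += r_v * r_v
        L[u][v] -= r_u * r_v
        L[v][u] -= r_u * r_v
    return L

def diffuse(L, x, step=0.1, iters=500):
    """Consensus by gradient descent on 0.5 * x^T L x:  x <- x - step * L x."""
    n = len(x)
    x = list(x)
    for _ in range(iters):
        Lx = [sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [x[i] - step * Lx[i] for i in range(n)]
    return x

# Sensors 0 and 1 report metres, sensor 2 reports centimetres (hypothetical).
edges = [(0, 1, 1.0, 1.0), (1, 2, 1.0, 0.01)]
L = sheaf_laplacian(3, edges)
fused = diffuse(L, [2.0, 2.2, 190.0])
print(fused)
```

An ordinary graph Laplacian would drive all three raw numbers toward a common value, which is meaningless across units. Here the iteration converges, for a sufficiently small step, to the orthogonal projection of the initial readings onto the kernel of L, that is, onto readings that agree once each is mapped into the shared edge space.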

2606.19528 2026-06-19 cs.LG cs.AI 新提交

Techniques for Peak Memory Reduction for LoRA Fine-tuning of LLMs on Edge Devices

边缘设备上LLM LoRA微调峰值内存降低技术

Hassan Dbouk, Matthias Reisser, Prathamesh Mandke, Likhita Arun Navali, Christos Louizos

发表机构 * GitHub

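AI summary: Targets the memory bottleneck of LoRA fine-tuning of LLMs on edge devices with four complementary techniques (quantization, checkpointing, softmax approximation, and logits masking), achieving up to 26× and 28× peak-memory reduction on Llama-3.2 3B and Qwen-2.5 3B.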

Comments Hassan Dbouk and Matthias Reisser contributed equally to this work

详情
AI中文摘要

使用低秩适配(LoRA)在终端用户数据上微调大型语言模型(LLM)可提供个性化体验并保护数据隐私,但在消费级硬件上面临严重的内存限制。微调期间的峰值内存通常超过设备限制,尤其是对于具有数十亿参数和长上下文训练数据的模型。本文介绍了一套互补技术,可在不牺牲模型质量的情况下减少内存占用:(1)基模型量化与即时反量化,(2)结合选择性激活缓存和磁盘卸载的内存高效检查点,(3)使用语义相关令牌子集的softmax近似,以及(4)logits掩码。在Llama-3.2 3B和Qwen-2.5 3B上的实验表明,峰值内存降低高达26倍和28倍,从而能够在资源受限设备上进行微调。

英文摘要

Fine-tuning of Large Language Models (LLMs) using Low-Rank Adaptation (LoRA) on an end-user's data offers personalized experiences while keeping data private, but faces severe memory constraints on consumer hardware. Peak memory during fine-tuning often exceeds device limits, especially for models with billions of parameters and long-context training data. This paper introduces a suite of complementary techniques to reduce memory footprint without sacrificing model quality: (1) base model quantization with on-the-fly dequantization, (2) memory-efficient checkpointing combining selective activation caching and disk offloading, (3) softmax approximation using semantically relevant token subsets, and (4) logits masking. Experiments on Llama-3.2 3B and Qwen-2.5 3B demonstrate up to 26× and 28× reduction in peak memory, enabling fine-tuning on resource-constrained devices.

2606.19527 2026-06-19 cs.AI 新提交

Emergent Alignment

涌现对齐

Martin Kolář

发表机构 * CIIRC, Czech Technical University in Prague(捷克理工大学CIIRC)

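AI summary: Proposes an online alignment technique that adds a conscience step and an alignment loss based on Direct Preference Optimization, so that a large language model self-corrects unethical outputs during training, fine-tuning, adversarial prompting, and zero-shot learning.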

Comments Rejected from ICML 2026

详情
AI中文摘要

大型语言模型(LLM)能否辨别其自身输出何时与人类伦理不一致?它们能否自我纠正?我们赋予LLM一个良心步骤,用于审查其自身的推理和输出,并通过使用直接偏好优化(DPO)扩展训练损失中的对齐组件,引导模型远离非伦理输出。结果是一种在线技术,可在广泛的应用中对齐模型:训练、微调、对抗提示和零样本学习。它不需要较弱或较强的评判者,而是依赖于自身的冻结副本。在先前的工作中,涌现错位场景显示了微调模型以破解代码时出现的一系列涌现非伦理行为。相反,我们实证展示了如何实现涌现对齐:在相同的代码破解场景下,一个单一的高层内省问题将训练引导向伦理模型。

英文摘要

Can Large Language Models (LLMs) discern when their own outputs are misaligned with human ethics? And can they self-correct? We endow an LLM with a conscience step that reviews its own reasoning and outputs, and we extend the training loss with an alignment component using Direct Preference Optimization (DPO) to steer the model away from non-ethical outputs. The result is an online technique to align models in a wide range of applications: training, fine-tuning, adversarial prompting, and zero-shot learning. It does not require a weaker or stronger judge, relying instead on a frozen copy of itself. In previous work, the Emergent Misalignment scenario showed a range of emergent unethical behaviors from fine-tuning the model to hack code. Instead, we empirically show how to achieve Emergent Alignment: a single high-level introspective question steers training toward an ethical model under the same code hacking scenario.
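Illustrative sketch: the abstract does not say how the conscience step produces preference pairs or how the alignment term is weighted, so the code below only shows the standard per-pair DPO loss, with the reference log-probabilities assumed to come from the frozen copy of the model. The function name, the argument names, and the value of beta are assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the trained policy (logp_*)
    and under the frozen reference model (ref_logp_*)."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy now favours the chosen response relative to the reference: lower loss.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

Under this formulation the loss falls as the policy raises the chosen response and lowers the rejected one relative to the reference, which is the steering effect the abstract attributes to the alignment component.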

2606.19526 2026-06-19 cs.AR 新提交

SPINE: A Fault Injection Profiler for Quantized Neural Networks under Accumulated Faults

SPINE: 面向累积故障下量化神经网络的故障注入分析器

Nathan Guimarães, Ian Kersz, Leonardo R. Gobatto, Fabio Benevenuti, Michael G. Jordan, Antonio Carlos S. Beck, Fernanda L. Kastensmidt, Jose Rodrigo Azambuja

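AI summary: Presents SPINE, a GDB-driven profiling framework that injects cumulative weight bit-flips into the target binary on edge CPUs and produces per-layer fault profiles without retraining or code modification, guiding selective hardening strategies.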

Comments ACM/IEEE/SBC/SBMICRO Symposium on Integrated Circuits and Systems Design 2026

详情
AI中文摘要

在边缘部署深度神经网络需要在严格的成本和功耗约束下实现高效推理。量化神经网络通过用低精度整数替换浮点参数来满足这些需求,但其权重在推理过程中仍持续暴露于辐射引起的位翻转。故障注入可用于模拟这些环境,但现有研究未能表征在现实内存布局下累积翻转如何转化为错误预测。本文提出一个GDB驱动的分析框架,直接将累积权重位翻转注入边缘CPU的目标二进制,生成逐层故障特征,无需模型重训练或代码修改。在多种拓扑、量化方案和内存布局上的评估结果表明,应如何应用选择性加固策略来有效保护神经网络。

英文摘要

Deploying deep neural networks at the edge demands efficient inference under strict cost and power constraints. Quantized neural networks address these demands by replacing floating-point parameters with low-precision integers, yet their weights remain continuously exposed to radiation-induced bit-flips during inference. Fault Injection can be used to simulate those environments, but existing studies fail to characterize how accumulated upsets translate into mispredictions under realistic memory layouts. This paper presents a GDB-driven profiling framework that injects cumulative weight bit-flips directly onto the target binary of edge CPUs, generating per-layer fault profiles without requiring model retraining or code modification. Evaluated across multiple topologies, quantization efforts, and memory layouts, the results indicate how selective hardening strategies should be applied to effectively protect neural networks.
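Illustrative sketch: the framework in the abstract injects faults through GDB into a compiled binary, which is not reproduced here. The code below only models the fault type it describes, single-bit upsets that persist and accumulate in two's-complement int8 weights, in plain Python. The function names and the example weights are assumptions.

```python
import random

def flip_bit_int8(w, bit):
    """Flip one bit of a two's-complement int8 value and return the new int8."""
    u = (w & 0xFF) ^ (1 << bit)        # view as an unsigned byte, then flip
    return u - 256 if u >= 128 else u  # map back to the signed range [-128, 127]

def inject_cumulative(weights, n_faults, seed=0):
    """Apply n_faults random single-bit flips that persist (accumulate)."""
    rng = random.Random(seed)
    faulty = list(weights)
    for _ in range(n_faults):
        idx = rng.randrange(len(faulty))
        bit = rng.randrange(8)
        faulty[idx] = flip_bit_int8(faulty[idx], bit)
    return faulty

weights = [12, -7, 3, 100, -128, 0, 55, -1]  # hypothetical int8 layer weights
faulty = inject_cumulative(weights, n_faults=5)
changed = sum(1 for a, b in zip(weights, faulty) if a != b)
print(faulty, changed)
```

A flip in the sign bit (bit 7) changes a weight by 128 in magnitude while a flip in bit 0 changes it by 1, which is one reason per-layer and per-bit profiles can differ widely.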

2606.19525 2026-06-19 cs.RO 新提交

A Categorial and Sheaf-Theoretic Semantics for Autonomic Component Ensembles

自主组件集合的范畴与层论语义

Manuel Hernández, Eduardo Sánchez-Soto

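AI summary: For SCEL, a language for autonomic component ensembles, proposes a multi-layered mathematical model based on category theory and sheaf theory: a robot society is modeled as a sheaf on a topological space, system failures are quantified via sheaf cohomology, and verification of distributed systems becomes geometric analysis.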

详情
AI中文摘要

大规模、去中心化的自主代理系统(如机器人集群和网络化信息物理系统)的激增对传统形式化方法提出了严峻挑战。软件组件集合语言(SCEL)为这类系统提供了形式化模型,但其操作语义不适合推理全局、结构和涌现属性。本报告利用范畴论和层论为SCEL提出了一种新的多层数学模型。我们认为,用SCEL描述的机器人社会可以形式化地建模为拓扑空间上的层,其中组件是点,集合是开集,分布式知识构成层的数据。在此框架下,信息共享等计算过程等价于“粘合”局部数据的层论操作。系统故障可以被理解并量化为拓扑障碍,通过层上同调可测量。该方法将复杂分布式系统的验证转化为数学对象的几何分析,为设计鲁棒的自主系统提供了深刻的结构性见解。

英文摘要

The proliferation of large-scale, decentralized systems of autonomous agents, such as swarms of robots and networked cyber-physical systems, presents a formidable challenge to traditional formal methods. The Software Component Ensemble Language (SCEL) offers a formal model for such systems, but its operational semantics is not ideal for reasoning about global, structural, and emergent properties. This report proposes a new, multi-layered mathematical model for SCEL using category theory and sheaf theory. We argue that a society of robots described in SCEL can be formally modeled as a sheaf on a topological space, where components are points, ensembles are open sets, and distributed knowledge forms the sheaf's data. In this framework, computational processes like information sharing become equivalent to the sheaf-theoretic operation of "gluing" local data. System failures can then be understood and quantified as topological obstructions, measurable by sheaf cohomology. This approach transforms the verification of a complex distributed system into the analysis of the geometry of a mathematical object, providing deep, structural insights for the design of robust autonomic systems.

2606.19522 2026-06-19 cs.AI 新提交

REVEAL++: Differentiable Phenotypic Grouping for Vision-Language Retinal Modeling of Alzheimer's Disease Risk

REVEAL++:用于阿尔茨海默病风险视觉-语言视网膜建模的可微分表型分组

Ethan Elio Meidinger, Seowung Leem, Zeyun Zhao, Ruogu Fang

发表机构 * University of Virginia(弗吉尼亚大学) J. Crayton Pruitt Family Department of Biomedical Engineering, Herbert Wertheim College of Engineering, University of Florida(佛罗里达大学赫伯特·韦特海姆工程学院J. Crayton Pruitt家庭生物医学工程系)

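AI summary: Proposes a differentiable, continuous phenotypic-similarity weighting function in place of discrete grouping, learning cross-modal alignment and phenotypic structure end to end within contrastive learning to improve AD risk prediction.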

Comments Accepted for publication at MICCAI 2026

详情
AI中文摘要

视网膜为神经退行性疾病提供了非侵入性窗口,能够捕捉与未来认知衰退风险相关的细微结构模式。诸如REVEAL等视觉-语言对齐框架已表明,将视网膜眼底图像与结构化临床风险叙述配对可改善阿尔茨海默病(AD)的早期预测。这些方法的一个关键设计选择是使用表型分组,即在对比学习中将具有相似风险特征的个体视为多正对。然而,现有方法将表型相似性操作化为离散构造,依赖硬分组分配,施加刚性监督并将分组形成与表示学习分离。我们提出对比学习中表型结构的连续形式。我们不将样本分配到固定聚类,而是将受试者间相似性建模为可微分权重函数,该函数源自视网膜图像和风险特征中模态内嵌入相似性。这些权重通过连续聚合算子定义软多正关系,实现反映疾病风险谱的梯度监督。我们进一步引入软目标对比目标,以端到端方式联合学习跨模态对齐和表型结构。在UK Biobank视网膜成像数据上进行AD发病预测评估,所提框架持续优于基于离散分组的对比学习和标准视觉-语言基线。通过将表型相似性视为可学习的连续信号而非固定分组规则,我们的方法为从多模态视网膜和临床数据中进行人群规模的神经退行性风险建模提供了有原则且稳健的基础。

英文摘要

The retina offers a noninvasive window into neurodegenerative disease, capturing subtle structural patterns associated with a risk of future cognitive decline. Vision-language alignment frameworks such as REVEAL have shown that pairing retinal fundus images with structured clinical risk narratives improves early prediction of Alzheimer's disease (AD). A key design choice in these approaches is the use of phenotypic grouping, where individuals with similar risk profiles are treated as multi-positive pairs during contrastive learning. However, existing methods operationalize phenotypic similarity as a discrete construct, relying on hard group assignments that impose rigid supervision and decouple group formation from representation learning. We propose a continuous formulation of phenotypic structure within contrastive learning. Rather than assigning samples to fixed clusters, we model inter-subject similarity as a differentiable weighting function derived from intra-modality embedding similarities in both retinal images and risk profiles. These weights define soft multi-positive relationships through a continuous aggregation operator, enabling graded supervision that reflects the spectrum nature of disease risk. We further introduce a soft-target contrastive objective that jointly learns cross-modal alignment and phenotypic structure in an end-to-end manner. Evaluated on UK Biobank retinal imaging data for incident AD prediction, the proposed framework consistently outperforms discrete group-based contrastive learning and standard vision-language baselines. By treating phenotypic similarity as a learnable, continuous signal rather than a fixed grouping rule, our approach provides a principled and robust foundation for population-scale neurodegenerative risk modeling from multi-modal retinal and clinical data.

2606.19520 2026-06-19 eess.SY cs.SY 新提交

ev-flow: A Reproducible, NHTS-Grounded Generator of Synthetic Plug-in Electric Vehicle Charging Behavior for Eight U.S. Regions

ev-flow: 一个可复现的、基于NHTS的合成插电式电动汽车充电行为生成器,适用于美国八个地区

Bertrand Travacca

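AI summary: Presents ev-flow, an open-source Python package that generates synthetic plug-in electric vehicle charging behavior for eight U.S. regions through a nine-stage pipeline grounded in 2017 National Household Travel Survey data, filling the gap for a U.S.-focused, NHTS-driven charging-behavior generator.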

Comments 20 pages

详情
AI中文摘要

电动汽车并网研究需要大量具有行为真实性的个体充电档案,但实际充电遥测数据稀缺且受隐私限制,现有的开源生成器要么基于非美国出行调查校准,要么忽略了驱动总需求的区域、季节和设备异质性。我们提出ev-flow(导入名pev_synth),一个MIT许可的开源Python包,基于2017年全国家庭旅行调查(NHTS)微观数据和区域销售组合模型,为美国八个区域生成合成插电式电动汽车充电行为。一个确定性的九阶段流水线(M1-M9)将每辆车从调查记录转换为带时间戳的充电档案:它将调查的人日拼接成捐赠者匹配的365天出行日历,并带有温度依赖的冬季能量提升;从已发表的SPEECh K=16高斯混合参数化中采样行为插电开始时间;评估三层伯努利插电模型;传播连续时间荷电状态账本,并带有明确的PHEV汽油续航扩展项;将插电状态栅格化为15分钟和小时网格。该包生成住宅和工作场所档案类型,并附有描述性EVSE品牌和连接器丰富信息;每个输出均以UTC存储、时区感知,并可从单个主种子实现比特可复现。验证运行器将生成的分布与已发表的边界进行比较,并根据文献出处对每个偏差进行分类:参考的bay_area住宅档案在21项适用检查中汇总为11项通过、0项未解释失败、6项已解释失败和4项已解释跳过。ev-flow填补了美国本土、基于NHTS的空白,与欧洲生成器(如emobpy和VencoPy)以及充电模拟器(如datafev和ACN-Sim)互补。

英文摘要

Electric-vehicle grid-integration studies need large, behaviorally realistic populations of individual charging profiles, but real charging telemetry is scarce and privacy-restricted, and the existing open generators are calibrated to non-U.S. mobility surveys or flatten the regional, seasonal, and equipment heterogeneity that drives aggregate demand. We present ev-flow (import name pev_synth), an open-source, MIT-licensed Python package that generates synthetic plug-in electric vehicle charging behavior for eight U.S. regions, grounded in 2017 National Household Travel Survey (NHTS) microdata and regional sales-mix models. A deterministic nine-stage pipeline (M1-M9) carries each vehicle from survey records to a time-stamped charging profile: it stitches survey person-days into donor-matched 365-day travel calendars with a temperature-dependent winter energy uplift, samples behavioral plug-in start times from the published SPEECh K=16 Gaussian-mixture parameterization, evaluates a three-layer Bernoulli plug-in model, propagates a continuous-time state-of-charge ledger with an explicit PHEV gasoline range-extension term, and rasterizes plug status to 15-minute and hourly grids. The package generates residential and workplace profile types with descriptive EVSE brand and connector enrichment; every output is UTC-stored, timezone-aware, and bit-reproducible from a single master seed. A validation runner compares the generated distributions against published bounds and classifies every divergence with literature provenance: the reference bay_area residential profile rolls up to 11 PASS, 0 unexplained FAIL, 6 explained failures, and 4 explained skips across 21 applicable checks. ev-flow fills a U.S.-focused, NHTS-grounded niche complementary to European generators such as emobpy and VencoPy and to charging simulators such as datafev and ACN-Sim.

2606.19519 2026-06-19 cs.DC 新提交

A Topos-Theoretic Interpretation of Blockchain Systems: Sheaves of Consensus and the Logic of Decentralized Truth

区块链系统的拓扑斯理论解释:共识层与去中心化真理的逻辑

Manuel Hernández, Eduardo Sánchez-Soto

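AI summary: Proposes topos theory (the theory of categories of sheaves) as the mathematical language for blockchain systems, modeling consensus as the construction of global truth from local consistency and going beyond traditional finite state machine models.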

详情
AI中文摘要

区块链系统,特别是智能合约的主要形式模型,大多源自经典计算理论,有限状态机或带标号迁移系统是主要概念工具。然而,有限状态机将区块链最困难和新颖的方面——在去中心化环境中达成共识——归结为复杂且往往混乱的实现细节,位于形式模型之外。但共识过程并非附属特征;它是计算现象的本质。为了忠实地建模它,需要一种新的数学语言。本文的核心论点是,拓扑论,即层范畴理论,为以局部一致性和全局真理构造为定义的系统提供了本原的数学语言。

英文摘要

The predominant formal models for blockchain systems, particularly smart contracts, have largely been drawn from the classical theory of computation, with the finite state machine (FSM) or labeled transition system serving as the primary conceptual tool. However, the FSM relegates the most difficult and novel aspect of a blockchain -- the achievement of consensus in a decentralized environment -- to a complex, often messy, implementation detail that lies outside the formal model itself. But the process of consensus is not an ancillary feature; it is the very essence of the computational phenomenon. To model it faithfully, a new mathematical language is required. The central thesis of this work is that topos theory, the theory of categories of sheaves, provides the native mathematical language for systems defined by local consistency and the construction of global truth.

2606.19514 2026-06-19 cs.HC 新提交

LLM-Mediated Human-AI Interaction in Search and Rescue: Impact of Expertise on Attentional Allocation

LLM介导的人机交互在搜索与救援中的应用:专业知识对注意力分配的影响

Elahe Oveisi, Hemanth Manjunatha

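AI summary: Using a simulated search and rescue task, compares conditions with and without large language model (LLM) guidance through eye tracking and behavioral analysis. LLM guidance improved task efficiency but did not increase total victims saved, and the study reveals an attention-guidance trade-off in which expertise moderates how users rely on the AI.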

详情
AI中文摘要

人机团队(HAT)越来越多地涉及在复杂任务中提供实时、上下文感知指导的AI系统。虽然此类系统可以提高性能,但其有效性取决于它们如何塑造人类认知和行为。特别是,AI辅助可能引入认知需求,并影响注意力、规划以及与任务环境的交互,其效果可能因专业知识水平而异。本研究在模拟搜索救援(SAR)环境中调查这些机制。我们比较了两种LLM(大语言模型)指导条件和无LLM基线条件下的人类表现,并在多个层面分析交互,包括任务绩效、眼动测量和规划行为。眼动追踪提供了对注意力分配和与AI指导交互的细粒度洞察,而行为测量则捕捉用户如何随时间构建和调整其决策。结果表明,LLM指导提高了任务效率(更高的奖励和每步受害者数),但并未增加总救援人数。眼动数据揭示了注意力-指导权衡,视觉资源转移到聊天界面,同时瞳孔大小变异性增加。专业知识调节了这种效应:新手表现出被动AI依赖,而专家通过持续的环境扫描维持“验证循环”。这些发现表明,LLM介导的团队效能取决于操作员将AI指导与地面实况交叉引用以保持态势感知的能力。

英文摘要

Human-AI teaming (HAT) increasingly involves AI systems that provide real-time, context-aware guidance in complex tasks. While such systems can improve performance, their effectiveness depends on how they shape human cognition and behavior. In particular, AI assistance can introduce cognitive demands and influence attention, planning, and interaction with the task environment, with effects that can vary across levels of expertise. This work investigates these mechanisms in a simulated search and rescue (SAR) environment. We compare human performance under two LLM (Large Language Model)-guided conditions and a no-LLM baseline, and analyze interaction at multiple levels, including task performance, eye-tracking measures, and planning behavior. Eye tracking provides fine-grained insight into attention allocation and interaction with AI guidance, while behavioral measures capture how users structure and adapt their decisions over time. Results indicate that LLM guidance enhanced task efficiency (higher rewards and victims-per-step) but did not increase total victims saved. Eye-tracking data revealed an attention-guidance trade-off, with visual resources shifting to the chat interface alongside increased pupil size variability. Expertise moderated this effect: novices exhibited passive AI reliance, whereas experts maintained a "verification loop" through persistent environmental scanning. These findings suggest that LLM-mediated teaming efficacy depends on the operator's ability to cross-reference AI guidance with ground truth to maintain situational awareness.

2606.19512 2026-06-19 cs.RO cs.SY eess.SY 新提交

Proprioceptive Invariant State Estimation for Humanoid Robots on Non-Inertial Ground

非惯性地面上仿人机器人的本体感觉不变状态估计

Falak Mandali, Zijian He, Yan Gu

发表机构 * Purdue University(普渡大学)

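AI summary: Proposes a proprioception-only InEKF method that uses foot-mounted IMUs and kinematic constraints for real-time state estimation of humanoid robots on non-inertial ground, with a 96% speedup in convergence rate and an 80% reduction in position error.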

详情
AI中文摘要

本文提出了一种不变扩展卡尔曼滤波(InEKF)方法,用于在非惯性地面上运行的仿人机器人仅使用机载本体感觉进行实时状态估计。所提出的方法估计机器人相对于移动地面框架的基座位置和速度,无需直接测量地面运动或外部安装的传感器。通过足部安装的IMU利用支撑脚的运动学约束,该滤波器在保持完全本体感觉的同时,考虑了过程模型和测量模型中的地面引起的非线性。估计器被设计为具有右不变测量模型,从而在较大的初始不确定性下实现有利的误差动态。可观测性分析建立了机器人相对于非惯性地面框架的相对基座位置和速度可观测的条件。在摇摆和俯仰地面上站立和蹲下的Digit仿人机器人实验表明,与现有的InEKF相比,收敛速度提高了96%,位置估计误差减少了80%。在单轴旋转地面上的行走实验实现了平均估计误差小于9厘米,初始误差高达1米。

英文摘要

This paper presents an invariant extended Kalman filtering (InEKF) approach for real-time state estimation of humanoid robots operating on non-inertial ground using only onboard proprioceptive sensing. The proposed approach estimates the robot's base position and velocity relative to the moving ground frame without requiring direct measurements of ground motion or externally mounted sensors. By exploiting kinematic constraints at the stance foot through foot-mounted IMUs, the filter accounts for ground-induced nonlinearities in the process and measurement models while remaining fully proprioceptive. The estimator is formulated to admit a right-invariant measurement model, enabling favorable error dynamics under large initial uncertainties. Observability analysis establishes conditions under which the robot's relative base position and velocity are observable with respect to the non-inertial ground frame. Experiments with the Digit humanoid robot standing and squatting atop a swaying and pitching ground showcase a 96% speedup in convergence rate and an 80% reduction in position estimate errors over existing InEKFs. Walking experiments on a uni-axially rotating ground achieve an average estimation error of less than 9 cm for an initial error of up to 1 m.

2606.19509 2026-06-19 cs.AI 新提交

LLM Doesn't Know What It Doesn't Know: Detecting Epistemic Blind Spots via Cross-Model Attribution Divergence on Clinical Tabular Data

LLM 不知道它不知道什么:通过跨模型归因分歧检测临床表格数据上的认知盲点

Akshat Dasula, Prasanna Desikan, Jaideep Srivastava

发表机构 * Centific AI Research(Centific AI研究)

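AI summary: Studies the epistemic uncertainty of large language models on structured clinical data through cross-model attribution divergence analysis, finding that verbalized confidence is vacuous and that an inverse difficulty effect exists, and proposes an attribution-divergence-based calibration method that raises accuracy and lowers calibration error without training.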

Comments Accepted at EIML@ICML 2026

详情
AI中文摘要

大语言模型(LLM)越来越多地应用于结构化临床数据,但它们在处理此类任务时能否认识到自身知识的局限性仍未得到探索。我们通过跨模型归因分歧的视角研究这一问题,旨在减少结构化任务的认知不确定性,通过归因分歧分析比较 Qwen 2.5 7B 和 XGBoost 在预测任务上的表现。我们报告了四个发现。首先,LLM 口头表达的置信度在认知上是空洞的,无论准确率是 49% 还是 75.3%,它输出接近常数(0.856-0.937),追踪的是提示格式而非预测质量。其次,LLM 表现出逆难度效应:当 XGBoost 以 99% 正确时,LLM 准确率降至 64.8%,但在 XGBoost 中等不确定时,LLM 与其匹配(73.8% 对 73.1%)。第三,少样本示例和 SHAP 导出的特征证据是正交的、超加性的干预措施:它们将归因分歧分数(ADS)从 1.54 降至 0.38,并在无需训练的情况下将准确率从 49% 提升至 75.3%。第四,一种利用归因分歧信号确定 LLM 可靠性的跨模型校准器,将期望校准误差从 0.254 降至 0.080,用患者特定的可靠性估计替代了无信息量的口头置信度,无需访问模型内部或重复推理。我们将这些发现视为 LLM 在结构化数据上的冷启动问题,并勾勒出通向真正认知自我意识的路径。

英文摘要

Large language models (LLMs) are increasingly applied to structured clinical data, yet whether they can recognize the limits of their own knowledge on such tasks remains unexplored. We study this question through the lens of cross-model attribution divergence, comparing Qwen 2.5 7B and XGBoost on a prediction task with the goal of reducing epistemic uncertainty for structured tasks. We report four findings. First, LLM verbalized confidence is epistemically vacuous: it stays nearly constant (0.856-0.937) regardless of whether accuracy is 49% or 75.3%, tracking prompt format rather than prediction quality. Second, the LLM exhibits an inverse difficulty effect: accuracy drops to 64.8% when XGBoost is 99% correct, but matches XGBoost (73.8% vs. 73.1%) when XGBoost is moderately uncertain. Third, few-shot examples and SHAP-derived feature evidence are orthogonal, super-additive interventions: they reduce the Attribution Disagreement Score (ADS) from 1.54 to 0.38 and improve accuracy from 49% to 75.3% without training. Fourth, a cross-model calibrator that determines LLM reliability using attribution divergence signals reduces expected calibration error from 0.254 to 0.080, replacing uninformative verbalized confidence with patient-specific reliability estimates, without accessing model internals or requiring repeated inference. We frame these findings as a cold start problem for LLMs on structured data and outline a path toward genuine epistemic self-awareness.
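Illustrative sketch: expected calibration error, the metric reported above, is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence per bin. The abstract does not state the binning used, so the ten equal-width bins, the function name, and the toy inputs below are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin size / N) * |accuracy - mean confidence|.

    confidences: floats in [0, 1]; correct: 0/1 flags of the same length."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        b = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[b].append((c, y))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        acc = sum(y for _, y in items) / len(items)
        ece += len(items) / n * abs(acc - avg_conf)
    return ece

# Toy case: a constant stated confidence of 0.9 while only half the answers are right.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))
```

A model that reports nearly the same confidence for every input lands in a single bin, so its ECE is simply the gap between that confidence and its overall accuracy.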

2606.19504 2026-06-19 cs.RO cs.SY eess.SY 新提交

Simulating Robotic Locomotion in Sand: Resistive Force Theory in an Open-Source Physics Engine

模拟沙地中的机器人运动:开源物理引擎中的阻力理论

Ryan Walker Brown, Laura K. Treers, Kathryn A. Daltorio

发表机构 * Case Western Reserve University(凯斯西储大学) University of Vermont(佛蒙特大学)

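AI summary: Integrates 3D Granular Resistive Force Theory (3D RFT) into the MuJoCo physics engine to simulate walking on sand, verifies the effects of foot shape, speed, and loading on locomotion, and predicts walking distance and sinkage in hexapod experiments to within 20%.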

Comments 12 pages, 7 figures

详情
AI中文摘要

阻力理论(RFT)的最新进展使得无需模拟单个颗粒相互作用即可近似沙地运动中的地面反作用力,从而降低了计算成本。然而,这些工具在常用于机器人仿真的3D物理引擎中尚不可用。我们探讨了将阻力近似与标准动力学计算相结合,是否能为自由行走的机器人提供稳定的支撑。为此,我们在物理仿真引擎MuJoCo中实现了三维颗粒阻力理论(3D RFT)。我们在多个场景中验证了仿真,证明了由于末端执行器形状、速度和负载引起的关键趋势得以保留。我们的实现预测了12自由度六足机器人在沙地中的行走距离和足部下沉,误差在实验值的20%以内。尽管RFT存在固有近似,但本文描述的开源工具有望帮助开发新的和改进的机器人设计,以穿越颗粒介质基底。

英文摘要

Recent advancements in Resistive Force Theory (RFT) enable approximation of ground reaction forces for locomotion in sand without the computational expense of modeling interactions with individual grains. However, these tools have been absent in 3D physics engines commonly used for robot simulation. We explore whether resistive force approximations are sufficient, when integrated with standard dynamics calculations, to provide a stable substrate for a freely walking robot. To determine this, we implement 3D Granular Resistive Force Theory (3D RFT) in a physics simulation engine, MuJoCo. We verify simulations in multiple scenarios to demonstrate that key trends due to end effector shape, speed, and loading are preserved. Our implementation predicts walking distance and foot sinkage of a 12-Degree of Freedom hexapod robot within 20% of experiments in sand. While RFT has inherent approximations, the open source tool described here has potential to help develop new and improved robot designs to traverse granular media substrates.

2606.19501 2026-06-19 cs.AI cs.CL cs.LG q-fin.RM 新提交

DeXposure-Claw: An Agentic System for DeFi Risk Supervision

DeXposure-Claw: 一个用于DeFi风险监管的智能体系统

Aijie Shu, Bowei Chen, Wenbin Wu, Cathy Yi-Hsuan Chen, Fengxiang He

发表机构 * University of Edinburgh(爱丁堡大学) University of Glasgow(格拉斯哥大学) University of Cambridge(剑桥大学)

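AI summary: To address the tendency of LLM agents to raise false alarms in DeFi supervision, proposes DeXposure-Claw, a system that forecasts exposure networks with a graph time-series foundation model and combines deterministic monitors with confidence gates to produce auditable supervisory tickets. It also builds DeXposure-Bench, a six-axis evaluation benchmark, and validates the system experimentally.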

详情
AI中文摘要

去中心化金融使监管者面临快速变化的网络化信用风险。通用LLM智能体不适合此场景:它们过度解读弱证据并推荐高风险干预,而现有评估无法提供符合监管者需求的误报衡量方式。我们提出DeXposure-Claw,一个基于预测的智能体监管系统,通过结构化证据引导LLM决策:(1) DeXposure-FM,一个图时间序列基础模型,预测未来风险网络;(2) 确定性监控和压力场景将预测转化为类型化警报、归因信号和场景证据;(3) 数据健康和置信度门控在DeXposure-Claw发出带有理由的可审计监管票据前限制升级。我们进一步开发了DeXposure-Bench,一个六轴评估框架,其决策轴根据符合监管者的绝对损失真实情况和显式误干预率对票据评分。在五年每周真实数据上的实验充分支持了我们的系统。代码见 https://github.com/EVIEHub/DeXposure-Claw。

英文摘要

Decentralized finance exposes supervisors to fast-moving, networked credit risks. General-purpose LLM agents fit this setting poorly: they over-read weak evidence and recommend high-stakes interventions, while existing evaluations offer no regulator-aligned way to measure the resulting false alarms. We introduce DeXposure-Claw, a forecast-grounded agentic supervision system that routes LLM decisions through structured evidence: (1) DeXposure-FM, a graph time-series foundation model, forecasts future exposure networks; (2) deterministic monitors and stress scenarios then turn those forecasts into typed alerts, attribution signals, and scenario evidence; and (3) data-health and confidence gates constrain escalation before DeXposure-Claw emits auditable supervisory tickets with rationales. We further develop DeXposure-Bench, a six-axis evaluation harness, whose decision axis scores tickets against a regulator-aligned absolute-loss ground truth and an explicit false-intervention rate. Experiments on five years of weekly real data fully support our system. Code is at https://github.com/EVIEHub/DeXposure-Claw.