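arXivDaily daily academic digest: synchronized with the full arXiv feed, with AI summaries and translation, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.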

2606.19624 2026-06-19 cs.LG New submission

MassSpecGym in the Wild: Uncovering and Correcting Evaluation Pitfalls in AI-Driven Molecule Discovery

Hongxuan Liu, Roman Bushuiev, Ivy Lightheart, Mrunali Manjrekar, Anton Bushuiev, Magdalena Lederbauer, Filip Jozefov, Yinkai Wang, Soha Hassoun, Josef Sivic, James Taylor, Runzhong Wang, David Healey, Tomáš Pluskal, Connor W. Coley

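Affiliations: Massachusetts Institute of Technology; Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague; Enveda Biosciences; Tufts University

AI summary: This paper systematically audits how machine learning models for tandem mass spectrometry based molecule discovery are evaluated. Using the MassSpecGym benchmark as a case study, it finds that at least 17 of 26 papers have issues in three classes (data leakage, shortcut learning, and implementation bugs), quantifies their impact experimentally, offers recommendations, and releases MassSpecGym v1.5.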

Abstract

Reliable benchmarking is critical for developing machine learning models for tandem mass spectrometry (MS/MS) based molecule discovery. Subtle issues in experimental design and model evaluation procedures can degrade the trustworthiness of such benchmarks and lead to erroneous conclusions. We conduct a thorough review of model evaluation issues in the recent MS/MS machine learning literature, using the standard MassSpecGym benchmark suite as a case study to illustrate the impact of these issues. We find evaluation issues in at least 17 of 26 papers reporting MassSpecGym benchmark results in the first year of its adoption. We isolate three classes of failures: (i) data leakage, (ii) shortcut learning, and (iii) implementation bugs and metric divergence. Through extensive experimentation and code replication, we quantify the impact of these issues and show how they corrupt the evaluation standards MassSpecGym was designed to enforce. We distill our findings into recommendations generalizable to MS/MS challenges, benchmarks, and custom evaluation setups. We also release MassSpecGym v1.5, an implementation of our recommendations in the MassSpecGym benchmarking suite which addresses the failure modes identified in this audit. MassSpecGym v1.5 is publicly available at https://github.com/pluskal-lab/MassSpecGym.
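
Data leakage, the first failure class named above, can be screened for with a simple overlap check between training and test molecules. The sketch below is illustrative only and is not taken from MassSpecGym: it assumes each molecule is identified by an InChIKey and treats a shared 14-character first block (the connectivity layer) as a leak. The function name and the sample keys are hypothetical.

```python
def leaked_molecules(train_keys, test_keys):
    """Return test InChIKeys whose 14-character first block also occurs in training."""
    train_blocks = {key.split("-")[0] for key in train_keys}
    return sorted(key for key in set(test_keys) if key.split("-")[0] in train_blocks)


# Hypothetical placeholder keys, for illustration only
train = ["AAAAAAAAAAAAAA-UHFFFAOYSA-N", "BBBBBBBBBBBBBB-UHFFFAOYSA-N"]
test = ["AAAAAAAAAAAAAA-ZZZZZZZZSA-N", "CCCCCCCCCCCCCC-UHFFFAOYSA-N"]
leaks = leaked_molecules(train, test)
```

An exact-identifier check of this kind only catches the crudest overlap; near-duplicate structures would need a structural similarity measure instead.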

2606.19623 2026-06-19 cs.LG New submission

SEAGAN: domain-Specific and Edge-Aware Graph Attention Network for Dynamic Plant Processes

Antriksh Srivastava, Soumyashree Kar

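Affiliations: Center of Studies in Resources Engineering; Indian Institute of Technology Bombay

AI summary: Proposes SEAGAN, which casts identification of biochemical limitation states along plant A-Ci curves as a graph node classification problem. Graphs are built with distance-based kNN and auxiliary-signal-guided connectivity, and an edge-aware graph attention network improves classification, reaching an F1-score of 0.857.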

Abstract

Graph neural networks (GNNs) provide a flexible framework for learning from scientific data linked through physical, biological, or functional relationships. One promising domain is plant physiology, where measured responses often arise from multiple interacting processes whose exact separation remains difficult even with manual intervention. In plant physiology, a key example is the A-Ci curve, which relates net CO2 assimilation rate (Anet) to leaf intercellular CO2 concentration (Ci) and is used to estimate photosynthetic parameters in leaf and crop-canopy models. However, reliable estimation requires identifying the active biochemical limitation state at each curve point, which remains a major source of uncertainty. Here, we formulate limitation-state identification along A-Ci curves as a graph-based node classification problem, with curve points as nodes. Domain-specific graph representations are created using distance-based k-nearest-neighbor (kNN) and auxiliary-signal-guided (ASG) connectivity, with edge attributes encoding pairwise relations. The framework was evaluated against conventional learning baselines, graph-based architectures, and an automated fitting-based benchmark. Results on a large synthetic dataset with known ground-truth limitation states show that graph-based models improve classification, particularly near biochemical transition regions. The best-performing configuration, SEAGAN (domain-Specific and Edge-Aware Graph Attention Network for Dynamic Plant Processes), integrates process-aware node features, edge attributes, kNN connectivity, and graph attention with weighted cross-entropy loss, achieving an F1-score of 0.857 and an accuracy of 0.882. The results show that representing A-Ci curves as graphs improves biochemical limitation-state analysis, with edge-aware attention over local kNN neighborhoods providing the most effective strategy.
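
The graph construction described above (curve points as nodes, distance-based kNN connectivity, edge attributes encoding pairwise relations) might look roughly like the following. This is a minimal sketch under stated assumptions: the function name, the choice of (Ci, Anet) as raw coordinates, and the specific edge attributes are illustrative and are not taken from the paper.

```python
import math


def knn_graph(points, k):
    """Build a directed kNN graph over curve points.

    points: list of (ci, anet) tuples.
    Returns a list of (source, target, (d_ci, d_anet, distance)) edges.
    """
    edges = []
    for i, (ci_i, a_i) in enumerate(points):
        candidates = []
        for j, (ci_j, a_j) in enumerate(points):
            if i == j:
                continue
            d_ci, d_a = ci_j - ci_i, a_j - a_i
            candidates.append((math.hypot(d_ci, d_a), j, d_ci, d_a))
        # Keep the k nearest neighbours and store pairwise relations as edge attributes
        for dist, j, d_ci, d_a in sorted(candidates)[:k]:
            edges.append((i, j, (d_ci, d_a, dist)))
    return edges


curve = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.5), (5.0, 2.0)]  # toy values
edges = knn_graph(curve, k=2)
```

In practice the two axes would likely need rescaling before distances are computed, since Ci and Anet are measured in different units.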

2606.19620 2026-06-19 cs.CR New submission

G-Lox: Group-Adaptive, Privacy-Preserving Bridge Distribution with Two-Party Computation

Baigang Chen, Nicholas Hopper

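AI summary: Proposes G-Lox, a bridge-distribution system that uses secure two-party computation to provide hidden group-level adaptive assignment while preserving distributor blindness, with support for blockage reporting, transport-aware reassignment, and privacy-preserving group splitting.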

Abstract

We present G-Lox (group-adaptive Lox), a bridge-distribution system that preserves Lox-style distributor blindness while enabling hidden, stateful group-level adaptation. G-Lox places adaptive assignment logic behind a two-server privacy wall, so no single server learns group identifiers or group-to-bridge assignments. Private state access and state-dependent updates use two-server DPF/FSS protocols and secure two-party computation, supporting blockage reporting, transport-aware reassignment, and privacy-preserving group splitting. We evaluate G-Lox through system measurements and policy simulation. In our C++/EMP implementation over real TCP sockets, private state access has low client-visible overhead: across state sizes up to 2^16, communication remains in the low-KiB range per iteration. At M=1024, the client sends 1,968 bytes, receives 1,280 bytes, and completes an iteration in about 0.25 s. Simulations with group-specific blocking and Sybil enumeration show that G-Lox improves robustness over Lox- and rBridge-like baselines among systems that maintain broad issuance.
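
The idea of private state access across two non-colluding servers can be illustrated with the classic XOR-based two-server private read. This is not the DPF/FSS protocol used by G-Lox: it sends the uncompressed secret-shared selection vector that a DPF is designed to compress into short keys. All names and the toy table are hypothetical.

```python
import secrets


def make_queries(index, size):
    """Split a one-hot selector for `index` into two XOR shares."""
    share_a = [secrets.randbelow(2) for _ in range(size)]
    share_b = [bit ^ (1 if j == index else 0) for j, bit in enumerate(share_a)]
    return share_a, share_b


def server_answer(database, share):
    """XOR together the records selected by this server's share."""
    answer = 0
    for record, bit in zip(database, share):
        if bit:
            answer ^= record
    return answer


database = [0x11, 0x22, 0x33, 0x44]  # toy state table, replicated on both servers
share_a, share_b = make_queries(2, len(database))
record = server_answer(database, share_a) ^ server_answer(database, share_b)
```

Each share on its own is a uniformly random bit vector, so neither server learns which index was read; only the XOR of the two answers reveals the record to the client.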

2606.19618 2026-06-19 cs.GT New submission

Joint-task truthfulness of the DMI mechanism

Rafael Frongillo

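AI summary: Studies the truthfulness of the DMI mechanism under joint-task strategies. It proves that truthful reporting remains a Bayes-Nash equilibrium when the other agents use consistent strategies, but that both dominant truthfulness and informed truthfulness fail without this restriction.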

2606.19617 2026-06-19 cs.CV cs.GR cs.LG New submission

GB-LSR: A Fast Local Spectral Image Representation with a Single Global Bandwidth for Continuous Reconstruction and Super-Resolution

Max Shad, Naeem Khoshnevis

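Affiliations: Harvard University

AI summary: Proposes GB-LSR, a local spectral representation with a single global bandwidth. A shared convolutional encoder predicts coefficients for a truncated Fourier basis, enabling continuous image reconstruction. On Kodak and other benchmarks it improves PSNR by 2.8-3.6 dB and runs roughly four times faster than the slowest baseline.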

Abstract

We present GB-LSR (Global-Bandwidth Local Spectral Representation), a fixed-grid local spectral representation for continuous image reconstruction. The image domain is partitioned into non-overlapping square patches, each carrying coefficients for a truncated Fourier basis predicted from shared convolutional-encoder features. A single trainable scalar bandwidth is shared globally across all patches and images, and reconstruction at any continuous coordinate is a fixed-size basis contraction whose cost is independent of image size. We study three bandwidth-handling variants: a trainable global scalar (main), a fixed global scalar, and a per-patch bandwidth field. On a standardized native-reconstruction benchmark across Kodak, Set14, and Urban100, the main variant outperforms matched-budget amortized LIIF / LTE / WIRE re-implementations by 2.8-3.6 dB PSNR and 0.11-0.15 LPIPS, while running at roughly one-quarter of the slowest baseline's inference cost. The single global scalar suffices empirically: per-patch adaptive-bandwidth alternatives do not improve over it on either a closed-form locality diagnostic or an end-to-end ablation. In a separate arbitrary-scale super-resolution (ASR) extension, GB-LSR achieves competitive PSNR-Y under a canonical-style SR protocol and runs 1.44x faster than LIIF-RDN and 3.25x faster than LTE-SwinIR at x4; within the same extension, a variant trained and evaluated without 4-corner local-ensemble averaging gives a 1.77x speedup with 35% lower peak memory and negligible PSNR change, while additionally widening the RDN encoder from 64 to 96 channels gives a small positive PSNR shift with a 1.58x speedup and 31% lower peak memory. Native-reconstruction claims are scoped to the matched-budget amortized protocol, and ASR claims are scoped to a separate canonical-style SR protocol.
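
The "fixed-size basis contraction" at a continuous coordinate can be sketched in one dimension. The abstract does not give the exact basis, so the cosine/sine parameterization, the way the bandwidth scales the frequencies, and all names below are assumptions made for illustration.

```python
import math


def basis(u, bandwidth, num_freqs):
    """Truncated Fourier features at local coordinate u in [0, 1)."""
    feats = [1.0]
    for k in range(1, num_freqs + 1):
        phase = 2.0 * math.pi * bandwidth * k * u
        feats += [math.cos(phase), math.sin(phase)]
    return feats


def reconstruct(x, coeffs, patch_size, bandwidth):
    """Evaluate the signal at continuous coordinate x from per-patch coefficients."""
    patch = min(int(x // patch_size), len(coeffs) - 1)  # which non-overlapping patch
    u = (x - patch * patch_size) / patch_size           # local coordinate in the patch
    num_freqs = (len(coeffs[patch]) - 1) // 2
    feats = basis(u, bandwidth, num_freqs)
    return sum(c * f for c, f in zip(coeffs[patch], feats))


# Two patches, one frequency each: [constant, cosine, sine] coefficients (toy values)
coeffs = [[1.0, 0.5, 0.0], [2.0, 0.0, 1.0]]
value = reconstruct(6.0, coeffs, patch_size=4.0, bandwidth=1.0)
```

Each query touches one patch and a fixed number of basis terms, which is why the per-coordinate cost does not grow with the size of the signal.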

2606.19616 2026-06-19 cs.SE cs.AI cs.MA New submission

Before the Pull Request: Mining Multi-Agent Coordination

Dipankar Sarkar

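Affiliations: Arizona State University

AI summary: To address weak coordination among autonomous coding agents before the pull request, proposes grite, a git-based coordination substrate whose event log reduces duplicate and conflicting work, raises throughput, and lets several failure modes be mined automatically from the log.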

Comments 9 pages, 2 tables. LNCS format. Code, dataset, and mining toolkit: https://github.com/neul-labs/grite

Abstract

Autonomous coding agents now open millions of pull requests, yet large-scale studies find their PRs are produced faster but accepted less often - a coordination and trust gap that pull-request-level telemetry cannot explain. We argue the missing signal lives before the PR, in how concurrent agents claim, divide, and collide over shared work. We study this process through grite, our open-source coordination substrate that needs no central server and stores its records inside git itself, so its append-only, signed event log captures the coordination process directly. We show that (i) this shared substrate reduces duplicate and conflicting work at bounded overhead - the share of work that merely re-does a teammate's task falls from 78% to 0% while useful throughput more than triples; (ii) every agent's copy of the log converges to the same state with no write silently dropped, where a file-based tracker loses concurrent writes; and (iii) the log is a mineable artefact from which concrete failure modes - conflicting edits, lock starvation, redundant rediscovery, race-to-close - are automatically recoverable with provenance, several invisible in pull-request history. We release the dataset, harness, and mining toolkit.
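
The convergence property in (ii) can be illustrated with a generic append-only log merge: replicas are combined by set union and ordered deterministically, so every agent ends up with the same view regardless of merge order. This is a sketch of the general technique, not grite's actual record format or merge logic; the event fields are hypothetical.

```python
def merge_logs(*replicas):
    """Merge append-only event logs by union, ordered by (timestamp, event id)."""
    merged = {}
    for replica in replicas:
        for event in replica:
            merged[event["id"]] = event  # identical ids denote the same immutable event
    return sorted(merged.values(), key=lambda e: (e["ts"], e["id"]))


agent_a = [{"id": "a1", "ts": 1, "op": "claim", "task": "T1"}]
agent_b = [{"id": "b1", "ts": 1, "op": "claim", "task": "T1"},
           {"id": "b2", "ts": 2, "op": "close", "task": "T1"}]
view_a = merge_logs(agent_a, agent_b)
view_b = merge_logs(agent_b, agent_a)
```

In the toy data both agents claim task T1 at the same timestamp; because no write is dropped, that collision stays visible in the merged log and can be mined afterwards.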

2606.19613 2026-06-19 cs.SE cs.AI New submission

StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns

Vlad Sobal, Shuo Yang, Yuting Zhang, Wei Xia, Stefano Soatto

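Affiliations: AWS Agentic AI

AI summary: Proposes the StaminaBench benchmark, which tests the stamina of coding agents over 100 consecutive change requests. All models fail within 5-6 turns, while test feedback with retries raises the number of passed turns by up to 12x.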

Abstract

We introduce StaminaBench, a benchmark that measures the stamina of coding agents: how many consecutive interaction turns (change requests) they can handle before failing. Unlike the prevailing fraction-of-tasks-solved metric, this matches real vibe-coding where sessions run dozens or hundreds of turns. In StaminaBench, agents implement a REST API server and modify it across a tunable number of procedurally generated follow-up change requests - 100 in our experiments, resulting in codebases of up to 6,000 lines. Tests are generated fully programmatically without LLM involvement, ensuring reproducibility and reliability; change sequences are drawn from either a hardcoded or LLM-driven sampler, both constrained to a structured action space to ensure changes are valid. The agent and the server run in an isolated environment and communicate with the benchmark through HTTP, making testing fully black-box and language-agnostic. We evaluate six agent harnesses paired with seven open-source LLMs across 20 scenarios of 100 turns each and find that: (1) all the tested models fail within 5-6 turns, confirming that vibe-coding-style programming without thorough testing produces bugs; (2) passing test feedback back to the agent and allowing it to retry improves passed turn count by up to 12x; and (3) a good harness is required for strong performance: stronger models exhibit up to a 6x gap between their best and worst harness, while weaker models fail with any harness. We release the benchmark and the generated tasks to enable further research into multi-turn coding agent behavior. Benchmark code and data: github.com/amazon-science/StaminaBench.
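
Read literally, the stamina metric counts consecutive passed turns before the first failure. A minimal sketch of that reading follows; the benchmark's actual scoring code may differ, and the function name is hypothetical.

```python
def stamina(turn_results):
    """Count consecutive passed turns before the first failure."""
    passed = 0
    for ok in turn_results:
        if not ok:
            break
        passed += 1
    return passed


# Three passes, then a failure; the later pass does not count
score = stamina([True, True, True, False, True])
```

This differs from a fraction-of-tasks-solved metric, which would credit the final passing turn in the example.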

2606.19610 2026-06-19 cs.LG cs.AI New submission

Latent Confounded Causal Discovery via Lie Bracket Geometry

Sridhar Mahadevan

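Affiliations: Adobe Research; University of Massachusetts, Amherst

AI summary: Building on information geometry and category theory, proposes two algorithms (BRIDGE and SKFM) that detect latent confounding through the failure of intervention-induced flows to close under Lie brackets, greatly shrinking the search space of causal graphs.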

Comments 39 pages

Abstract

Recent work on Kan-Do-Calculus (KDC) has established that the boundary between passive observation and active intervention in causal inference is a category-theoretic bi-adjunction, with interventions modeled by left Kan extensions and conditioning by right Kan extensions. This paper introduces two causal discovery algorithms under latent confounding, building on the information-geometric and categorical consequences of KDC. In smooth statistical settings, Radon-Nikodym derivatives between observational and interventional measures induce local causal vector fields; failures of these fields to close under Lie brackets become computable Frobenius residuals, which we interpret as witnesses of failed visible integrability and possible latent or unmodeled structure. Our first algorithm, BRIDGE (Bracket Residuals for Interventional Discovery and Geometric Estimation), combines an interventional density or Radon-Nikodym-ratio engine with a geometric screen that proposes a high-recall family of admissible arrows, identifies non-closing visible pairs as latent-obstruction candidates, and passes the reduced family to downstream score-based or differentiable discovery routines. The second algorithmic contribution, Spectral Kan-Do Flow Matching (SKFM), learns amortized intervention fields and factors latent curvature spectrally, exposing the direct Lie-space endpoint toward which BRIDGE points. A detailed set of experiments shows that both algorithms are capable of discovering causal models with latent confounders while collapsing the super-exponential space of possible DAGs by many orders of magnitude. This paper introduces a new paradigm in causal discovery, where latent structure is inferred directly from the geometry of intervention-induced flows.
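
The notion of vector fields failing to close under Lie brackets can be made concrete with a small numerical check. The sketch below computes [X, Y](p) = DY(p) X(p) - DX(p) Y(p) by central differences and measures the part of the bracket lying outside span{X(p), Y(p)}. Treating that norm as the "Frobenius residual" is an assumption for illustration; the paper's definition and the BRIDGE algorithm are not reproduced here, and the example fields are the standard non-integrable pair on R^3.

```python
import math


def directional_derivative(field, point, direction, eps=1e-5):
    """Central-difference derivative of `field` at `point` along `direction`."""
    plus = field([p + eps * d for p, d in zip(point, direction)])
    minus = field([p - eps * d for p, d in zip(point, direction)])
    return [(a - b) / (2.0 * eps) for a, b in zip(plus, minus)]


def lie_bracket(x_field, y_field, point):
    """[X, Y](p) = DY(p) X(p) - DX(p) Y(p)."""
    dy_along_x = directional_derivative(y_field, point, x_field(point))
    dx_along_y = directional_derivative(x_field, point, y_field(point))
    return [a - b for a, b in zip(dy_along_x, dx_along_y)]


def residual_norm(vector, spanning):
    """Norm of `vector` after removing its component in span(spanning)."""
    basis = []
    for v in spanning:  # Gram-Schmidt on the spanning vectors
        for b in basis:
            dot = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - dot * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > 1e-12:
            basis.append([vi / norm for vi in v])
    for b in basis:
        dot = sum(vi * bi for vi, bi in zip(vector, b))
        vector = [vi - dot * bi for vi, bi in zip(vector, b)]
    return math.sqrt(sum(vi * vi for vi in vector))


x_field = lambda p: [1.0, 0.0, 0.0]
y_field = lambda p: [0.0, 1.0, p[0]]
origin = [0.0, 0.0, 0.0]
bracket = lie_bracket(x_field, y_field, origin)
residual = residual_norm(bracket, [x_field(origin), y_field(origin)])
```

Here the bracket points along the third axis, outside the plane spanned by the two fields, so the residual is nonzero; two fields that close under the bracket would give a residual near zero.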

2606.19609 2026-06-19 cs.HC cs.GR New submission

Building Drift: Documenting On-Site Construction Adaptations Across Material Lifecycles

Ritik Batra, Martin Tamke, Tom Svilans, Jan Hüls, Amritansh Kwatra, Steven J. Jackson, Thijs Roumen, Mette Ramsgaard Thomsen

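AI summary: Introduces the concept of "building drift", builds a taxonomy from a case study, and develops the Pentimento tool, which documents on-site adaptations using video and 3D Gaussian Splatting to support circular reuse of reclaimed materials.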

Comments In submission

Abstract

In a circular economy for construction, reclaimed materials carry prior lives of use and go on to have post-lives in future buildings. Yet working with such materials introduces unpredictability that requires on-site improvisation, making their reuse challenging to document and scale across building lifetimes. Without documentation, the on-site adaptations that make construction with reclaimed materials possible leave collaborators, evaluators, and inheritors without the information they need to continue, assess, and reuse materials. We call the collective deviation of the physical state from the digital model through these adaptations "building drift." Through a case study, ReShelter, a reclaimed timber pavilion constructed in the forest, we develop a taxonomy for building drift that characterizes the collective deviation across building lifetimes: Tending the Site, Foraging for Fit, Interpreting the Material, Marking Measurements, and Coordinating Across Communities. To put our taxonomy for building drift into practice, we present Pentimento, a documentation tool that leverages video documentation and 3D Gaussian Splatting to spatially, temporally, and semantically represent on-site adaptations in relation to the designed model. Pentimento enables each stakeholder to navigate material histories in ways that reduce barriers to material reuse. Together, these contributions open pathways towards computational tools that support the on-site improvisation essential to construction with reclaimed materials, enabling more sustainable cycles of recovery, repair, and reuse.

2606.19607 2026-06-19 cs.AI stat.AP New submission

Which Pairs to Compare for LLM Post-Training?

Jiangze Han, Vineet Goyal, Will Ma

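Affiliations: Columbia University

AI summary: Studies how to choose the most informative comparison pairs in preference-based post-training. It frames comparison curation as a sampling-design problem, derives an optimization criterion from a theoretical analysis of DPO training, and shows experimentally that the resulting designs improve sample efficiency.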

Abstract

Preference-based post-training has become a central paradigm for aligning language models. A common data-collection strategy is to generate a small set of completions for each prompt and label the resulting comparison pairs. However, human preference labels are often much more expensive than generating additional completions, suggesting a different use of the same labeling budget: generate a larger pool of completions, but label only the most informative comparison pairs. This paper studies which pairs should be compared in preference-based post-training. We formulate comparison curation as a sampling-design problem and evaluate designs by the quality of the final policy under the preference-based post-training objective. We instantiate this framework for Direct Preference Optimization (DPO), analyzing how the choice of labeled pairs propagates through DPO training to downstream policy performance. Our main results provide matching upper and lower bounds on the post-training optimality gap of the DPO-trained policy. The bounds show that comparison selection affects downstream performance through a single design-dependent information matrix, which links label allocation to parameter estimation error and policy suboptimality. This yields an explicit optimization criterion for budgeted comparison curation and motivates practical sampling designs for selecting informative pairs from large generated completion pools. Experiments on synthetic settings and language-model post-training benchmarks show that the proposed designs consistently improve sample efficiency over common comparison-selection heuristics.
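
For context, the per-pair objective that each labeled comparison feeds into is the standard DPO loss, -log sigmoid(beta * [(log pi(y_w|x) - log pi_ref(y_w|x)) - (log pi(y_l|x) - log pi_ref(y_l|x))]). The sketch below evaluates it for a single pair; it illustrates the loss only, not the paper's pair-selection designs, and the numeric inputs are made up.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one labeled pair (w = preferred, l = rejected completion)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


# Policy identical to the reference: margin is 0 and the loss is log(2)
loss_at_reference = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy that raises the preferred completion and lowers the rejected one
loss_improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

Every labeled pair contributes one such term, which is why the choice of which pairs to label shapes the trained policy.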

2606.19605 2026-06-19 cs.SE cs.AI New submission

FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines

Paul Kassianik, Baturay Saglam, Huaibo Zhao, Blaine Nelson, Supriti Vijay, Aman Priyanshu, Amin Karbasi

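Affiliations: Foundation AI–Cisco Systems Inc.; Yale University

AI summary: Proposes the FAPO framework, which automatically diagnoses pipeline bottlenecks and iteratively optimizes prompts or chain structure. It beats the GEPA baseline in 15 of 18 model-benchmark comparisons, with a mean gain of 14.1 percentage points.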

Abstract

Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks in the chain. We present FAPO (Fully Autonomous Prompt Optimization), a framework that lets Claude Code optimize an LLM pipeline inside a standardized codebase. FAPO evaluates a pipeline, inspects intermediate steps, diagnoses failures, proposes scoped changes, and validates variants repeatedly to optimize against a score function. It first tries prompt edits and, only when prompt optimization appears insufficient, changes chain structure within the permitted scope when attribution identifies a structural bottleneck. Across six benchmarks and three task models, FAPO beats the baseline GEPA in 15 of 18 model-benchmark comparisons. In 11 model-benchmark comparisons, FAPO wins with non-overlapping mean $\pm$ trial-standard-deviation ranges, and the mean FAPO-GEPA gain is +14.1 pp. In the six HoVer and IFBench comparisons where prompt-first search escalated to structural changes, FAPO wins all six with a mean gain of +33.8 pp. FAPO also improves performance on security tasks: on CTIBench-RCM, a security CVE-to-CWE task, prompt-only FAPO lifts test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning. These results position FAPO as a state-of-the-art pipeline optimization technique for both general-purpose and security-focused tasks.
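
The "non-overlapping mean ± trial-standard-deviation ranges" criterion can be checked as follows. This is a sketch of one plausible reading: it assumes the sample standard deviation across trials, which the abstract does not specify, and the scores are made up.

```python
from statistics import mean, stdev


def non_overlapping_win(scores_a, scores_b):
    """True if mean(a) - std(a) lies strictly above mean(b) + std(b)."""
    return mean(scores_a) - stdev(scores_a) > mean(scores_b) + stdev(scores_b)


# Hypothetical per-trial accuracies (percent) for two optimizers
clear_win = non_overlapping_win([70.0, 72.0, 74.0], [55.0, 57.0, 59.0])
marginal = non_overlapping_win([70.0, 72.0, 74.0], [68.0, 70.0, 72.0])
```

The second comparison has a higher mean for the first system but overlapping ranges, so it would count as a win without counting among the non-overlapping wins.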

2606.19603 2026-06-19 cs.LG New submission

Comparing Linear Probes with Mahalanobis Cosine Similarity

Zhuofan Josh Ying, Peter Hase, Nikolaus Kriegeskorte

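Affiliations: Columbia University; Stanford University; Schmidt Sciences

AI summary: Proves a linear relationship between Mahalanobis cosine similarity and OOD AUROC, gives a theoretical explanation, and validates it as a metric for comparing linear probes.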

Comments 16 pages, 10 figures

Abstract

Linear probes are widely used in interpretability research and often compared by cosine similarity. The Mahalanobis cosine similarity (MCS) between two directions, which reweights the inner product by test data covariance, is a natural task-aware refinement. Ying et al. (2026) report that a probe's MCS to a reference probe trained on the out-of-distribution (OOD) data near-perfectly linearly predicts the probe's OOD AUROC (R^2 = 0.98). Here, we extend this empirical finding across models, layers, and concept domains, and prove this general phenomenon in closed form: For balanced classes whose projections are Gaussian, OOD AUROC and MCS to the reference probe are linear because both are sigmoid-shaped functions of the probe's signal-to-noise ratio (SNR) on the test data. The theory also predicts when this linearity fails, which we verify empirically. MCS offers a theoretically grounded and empirically effective alternative to Euclidean cosine similarity for comparing linear probes.
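
One natural reading of "reweights the inner product by test data covariance" is the cosine under the inner product u^T Σ v, which equals the correlation between the two probes' outputs on the test data. The sketch below implements that reading; whether the paper's MCS uses Σ in exactly this way is an assumption, and the covariance and directions are toy values.

```python
import math


def mahalanobis_cosine(u, v, cov):
    """Cosine similarity under the inner product <a, b> = a^T cov b."""
    def inner(a, b):
        return sum(a[i] * cov[i][j] * b[j]
                   for i in range(len(a)) for j in range(len(b)))
    return inner(u, v) / math.sqrt(inner(u, u) * inner(v, v))


cov = [[4.0, 0.0], [0.0, 1.0]]       # test data varies more along the first axis
identity = [[1.0, 0.0], [0.0, 1.0]]  # recovers the ordinary Euclidean cosine
mcs = mahalanobis_cosine([1.0, 0.0], [1.0, 1.0], cov)
euclid = mahalanobis_cosine([1.0, 0.0], [1.0, 1.0], identity)
```

With the toy covariance the two directions look more alike than their Euclidean cosine suggests, because the axis they share carries most of the data variance.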

2606.19602 2026-06-19 cs.AI New submission

Configurable Clinical Information Extraction with Agentic RAG: What Works, What Breaks, and Why

Osman Alperen Çinar-Koraş, Marie Bauer, Sameh Khattab, Merlin Engelke, Moon Kim, Stephan Settelmeier, Shigeyasu Sugawara, Fabian Freisleben, Felix Nensa, Jens Kleesiek

Affiliations: Institute for Artificial Intelligence in Medicine (IKIM), University Medicine Essen; Faculty of Computer Science, University of Duisburg-Essen; Department of Physics, TU Dortmund University; Lamarr Institute for Machine Learning and Artificial Intelligence, TU Dortmund University; Advanced Clinical Research Center, Fukushima Medical University; Department of Cardiology and Vascular Medicine, University Hospital Essen

AI summary: To address missing metadata on clinical documents, proposes ACIE, an agentic RAG system deployed at University Medicine Essen. By reasoning over complete patient contexts and verifying against cited sources, it reaches a 96.5% extraction acceptance rate across 7,326 clinical judgments.

Abstract

Patient contexts span hundreds of heterogeneous documents and thousands of structured data points, yet the document-level metadata that AI systems need for retrieval and triage is absent or incomplete. Standard retrieval-augmented generation fails on this data, mishandling temporal reasoning, cross-document dependencies, and missing metadata. We deploy ACIE (Agentic Clinical Information Extraction) at University Medicine Essen: an on-premise agentic RAG pipeline that reasons over complete patient contexts and grounds every answer in source passages for clinician verification. We quantify the metadata gap, trace the architectural decisions it shaped, and evaluate extraction alongside an independent retrospective lymphoma registry study, in which nuclear-medicine physicians verify every extracted value against its cited sources. Across 7,326 judgments, clinicians accepted 96.5% of extractions, with per-type acceptance ranging from 80% to 99%.

2606.19599 2026-06-19 eess.SY cs.SY econ.EM New submission

Ramping Procurement and Bid-Cost Recovery in Real-Time Market

Cong Chen, Valentina Norambuena, Lang Tong

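AI summary: Studies ramping procurement co-optimized with economic dispatch under net-demand uncertainty. It analyzes single-interval and multi-interval co-optimization designs, develops analytical frameworks for evaluating generator profits, consumer payments, bid cost recovery, and operational efficiency, and compares three pricing mechanisms.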

Comments 4 figures

Abstract

We study ramping procurement co-optimized with economic dispatch under net-demand uncertainty. We examine two flexible ramp product designs implemented by grid operators: single-interval and multi-interval co-optimization. Both rely on rolling-window stochastic optimization with binding and advisory interval decisions. We develop analytical frameworks to evaluate generator profits, consumer payments, bid cost recovery (BCR), and operational efficiency. In particular, net-demand uncertainty may lead to generator under-compensation, requiring discriminatory BCR. While operational efficiency is invariant to energy and ramp prices, producer profits and consumer payments depend critically on pricing. We examine locational marginal pricing (LMP) and two uniform pricing schemes: maximum dispatch cost pricing (MDCP) and maximum temporal locational marginal pricing (MTLMP). With out-of-market BCR, LMP yields discriminatory energy prices, whereas MDCP eliminates BCR and MTLMP does so in most cases. This property enables us to establish truthful bidding incentives for price-taking generators under MDCP. Our analysis highlights trade-offs between single- and multi-interval co-optimization and pricing designs: single-interval energy-ramp co-optimization is advantageous under high forecast uncertainty and moderate ramping requirements, whereas multi-interval co-optimization is superior when net-demand forecasts are relatively accurate and ramp needs are challenging. Empirical results on CAISO and ERCOT data show that MDCP and MTLMP increase producer profits with negligible BCR, albeit at the expense of higher consumer payments relative to LMP.

2606.19598 2026-06-19 cs.RO New submission

Fail-RAG: A Retrieval Augmented Generation Informed Framework for Robot Failure Identification

Ameya Salvi, Jie Hu

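Affiliations: Hitachi America, Ltd.

AI summary: Proposes the Fail-RAG framework, which combines retrieval-augmented generation with vision-language models. Failure images and context information are embedded and queried against a database to detect robot operation failures, raising average detection accuracy by 25 percentage points on warehouse automation tasks.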

英文摘要

Industry automation is witnessing an evolution in robotics driven by both technological breakthroughs and societal changes: progress towards generalist robots, embodied and physical artificial intelligence (AI), and an increasing labor shortage in manufacturing. An intelligent autonomous robot needs not only to act according to planned motions but also to react to any unexpected events. In this study, we focus on such unexpected events in warehouses where robots are used for material handling. Specifically, we refer to any unexpected events as failures and develop methods to detect failures related to robot operations. Rule-based detection methods may break since the form of failures could change due to the dynamic nature of both environments and tasks. We propose 'Fail-RAG', a Retrieval Augmented Generation (RAG)-based failure detection framework where failure images and context information are embedded and queried against a failure database by calculating their similarities. Vision-Language Models (VLMs) are further used to analyze failures and provide details by following our instruction template. We evaluated the performance of Fail-RAG by conducting both simulation and physical experiments using fixed robot arms and a mobile manipulator for multiple tasks that are common in warehouse automation. Fail-RAG achieved 25 percentage points higher failure detection accuracy on average across five types of robot operations compared to using off-the-shelf VLMs, indicating its effectiveness for real-world failure detection.
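
The retrieval step described in the abstract (embed a query, then rank database entries by similarity) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which similarity measure or embedding model is used, so cosine similarity, the three-dimensional toy vectors, the failure labels, and the function names `cosine_similarity` and `retrieve_failures` are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_failures(query_embedding, failure_db, top_k=2):
    """Return the top_k (label, similarity) pairs, most similar first."""
    scored = [
        (label, cosine_similarity(query_embedding, embedding))
        for label, embedding in failure_db
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical failure database: the labels and embeddings are made up.
# In a real system the vectors would come from an image/text embedding model.
FAILURE_DB = [
    ("dropped_item", [0.9, 0.1, 0.0]),
    ("missed_grasp", [0.1, 0.9, 0.1]),
    ("collision", [0.0, 0.2, 0.9]),
]

query = [0.8, 0.2, 0.1]  # hypothetical embedding of a new observation
matches = retrieve_failures(query, FAILURE_DB)
```

In a pipeline like the one the abstract outlines, the retrieved entries would then be handed to a VLM together with an instruction template; that stage is omitted here because the abstract gives no detail about it.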

2606.19597 2026-06-19 cs.SD cs.AI cs.LG 新提交

PrefSQA: Pairwise Preference Prediction for Speech Quality Assessment and the Critical Role of High Quality Datasets

PrefSQA: 用于语音质量评估的成对偏好预测及高质量数据集的关键作用

Junyi Fan, Donald S. Williamson

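Institutions: Department of Computer Science and Engineering, The Ohio State University, USA

AI summary: Proposes PrefSQA, a model that uses uncertainty-aware logits, an impairment attention head, and a non-matching-reference comparison module, and draws on high-quality preference datasets to improve the accuracy of speech quality assessment.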

Comments Accepted to INTERSPEECH 2026

AI中文摘要

平均意见得分（MOS）广泛用于语音质量评估，但标量标签对评估者变异性和听力测试差异敏感，这引入了标签噪声，限制了MOS预测的可靠性。偏好预测通过让听者直接比较信号来减少这种变异性，产生更干净的标签。我们研究了无MOS的偏好预测，并提出了PrefSQA，它结合了不确定性感知logits、损伤注意力头以及基于非匹配参考比较的模块。我们使用并精炼了五个数据集，包括MOS衍生和低噪声模拟集（包含匹配和非匹配内容），在人类偏好集上进行实验，并在未见数据上测试。实验表明，在MOS衍生数据上改进较小，而其他数据集显示出相对于基线的明显改进，突显了高质量偏好数据的价值，并证明了所提出方法的有效性。

英文摘要

Mean opinion scores (MOS) are widely used for speech quality assessment, yet scalar labels are sensitive to rater variability and listening test differences. This introduces labeling noise, which limits the reliability of MOS prediction. Preference prediction reduces this variability as listeners compare signals directly, producing cleaner labels. We study MOS-free preference prediction and propose PrefSQA, which incorporates uncertainty-aware logits, an impairment attention head, and a module based on non-matching-reference comparisons. We use and refine five datasets, including MOS-derived and low-noise simulated sets with matching and non-matching content, experiment with human preference sets, and test on unseen data. Experiments show small improvements on MOS-derived data, while other sets reveal clear improvement over the baselines, highlighting the value of high-quality preference data and demonstrating the effectiveness of the proposed method.

2606.19595 2026-06-19 cs.LG cs.AI 新提交

IHBench: Evaluating Post-Interruption Recovery in Voice Agents with Structured Workflows

IHBench：评估语音代理在结构化工作流中的中断后恢复能力

Ahmad Salimi, Wentao Ma, Yuzhi Tang, Dongming Shen, Mu Li, Alex Smola

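Institutions: Boson AI

AI summary: Introduces IHBench, a benchmark that evaluates how voice agents in structured workflows recover after interruptions along two axes, task fulfillment and recovery quality; experiments show that closed-weight models are more robust than open-weight models.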

AI中文摘要

部署在结构化工作流（客户服务、医疗调度、账户管理）中的语音代理必须处理频繁的用户中断，同时保持多步骤程序的进度。现有的语音能力模型基准侧重于中断的时机：闯入检测、端点检测和轮流对话动态。它们忽略了中断后发生的情况：代理是否在正确的步骤恢复工作流？是否处理了用户的插话？是否避免重复用户已经听过的内容？我们引入了IHBench（中断处理基准），这是一个评估语音代理在10个企业领域中执行状态机驱动工作流时的中断后恢复能力的基准。六种中断类型在话语中间的控制点注入，并随数据生成每个中断的评估标准。每个中断在两个轴上评分：任务完成和恢复质量。我们评估了来自OpenAI、Google和开源社区的27个音频-语言模型配置。模型差异很大，恢复质量强烈依赖于中断类型。在我们的实验中，闭源模型比开源模型对中断更鲁棒：它们在任务完成上获胜的频率更高，随着对话变长，性能下降速度慢约3.3倍，并且没有音频与文本模态差距，而开源模型在这三个方面都处于劣势。一项人类研究验证了LLM评判员与人类标注者的一致性，与AudioMultiChallenge的跨基准分析表明，恢复质量在很大程度上是一个独立的能力轴。

英文摘要

Voice agents deployed in structured workflows (customer service, healthcare scheduling, account management) must handle frequent user interruptions while maintaining progress through multi-step procedures. Existing benchmarks for speech-capable models focus on the timing of interruptions: barge-in detection, endpointing, and turn-taking dynamics. They leave unmeasured what happens after the interruption: does the agent resume the workflow at the correct step? Does it address the user's interjection? Does it avoid re-delivering content the user already heard? We introduce IHBench (Interruption Handling Benchmark), a benchmark that evaluates post-interruption recovery in voice agents executing state-machine-driven workflows across 10 enterprise domains. Six interruption types are injected at controlled points mid-utterance, with per-interruption evaluation rubrics generated alongside the data. Each interruption is scored on two axes: task fulfillment and recovery quality. We evaluate 27 audio-language model configurations from OpenAI, Google, and the open-weight community. Models vary widely, and recovery quality depends strongly on the interruption type. Across our experiments, closed-weight models are consistently more robust to interruptions than open-weight ones: they win far more often on task fulfillment, degrade roughly 3.3x more slowly as conversations grow longer, and show no audio-versus-text modality gap, whereas the open-weight models lose ground on all three. A human study validates the LLM judge against human annotators, and a cross-benchmark analysis against AudioMultiChallenge indicates that recovery quality is a largely distinct capability axis.

2606.19594 2026-06-19 cs.LG 新提交

Unsupervised Causal Abstractions Discovery

无监督因果抽象发现

Théo Saulus, Simon Lacoste-Julien, Dhanya Sridhar

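Institutions: Mila - Quebec AI Institute; Université de Montréal; Canada CIFAR AI Chair

AI summary: Proposes a method for learning high-level structural causal models directly from low-level measurement data; under a low-rank causal discovery assumption, it proves that the latent variables induced by observations of a low-rank graph form a causal abstraction, and gives identifiability results and a practical learning objective.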

2606.19591 2026-06-19 cs.CL cs.AI 新提交

A BART-based approach with hierarchical strategy for Vietnamese abstractive multi-document summarization

基于BART的分层策略用于越南语抽象式多文档摘要

Vu Nguyen Nguyen Xuan, Huy Ngo Quang

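Institutions: Aimesoft JSC

AI summary: Proposes a novel, simple hierarchical strategy that shortens documents based on gold summaries and combines it with a BART model for Vietnamese abstractive multi-document summarization, reaching a ROUGE2-F1 of 0.2468 on the VLSP 2022 test set, with external data used to augment training.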

Comments originally written in 2022

2606.19590 2026-06-19 cs.RO cs.SY eess.SY 新提交

Safe, Real-Time Active Model Discrimination and Fault Diagnosis for Nonlinear Systems via Differentiable Reachability

通过可微可达性实现非线性系统的安全、实时主动模型辨识与故障诊断

Xinpei Ni, Melkior Ornik, Glen Chou, Samuel Coogan

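Institutions: Institute of Robotics and Intelligent Machines (IRIM), Georgia Institute of Technology; Department of Aerospace Engineering, University of Illinois Urbana-Champaign

AI summary: For uncertain nonlinear systems, proposes a real-time active fault diagnosis algorithm based on a differentiable reachability approximation; it optimizes control inputs to separate the output sets, achieving fast model discrimination while guaranteeing safety.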

AI中文摘要

我们提出了一种安全、实时的算法，用于对具有过程和测量扰动的连续时间不确定非线性系统进行主动故障诊断和模型辨识。给定一组表示正常和故障模式（包括执行器和传感器故障）的候选模型，我们制定了一个输出反馈、时变策略优化问题，该问题（i）在有限时域内鲁棒地强制执行状态输入安全约束，并且（ii）驱动系统产生与至多一个模型一致的采样测量，从而实现确定性诊断。为了实时解决这个问题，我们使用可达状态和输出集的区间过近似开发了一个可处理的近似，并通过一个可微目标函数对诊断能力进行编码，该函数惩罚可能模型的可达输出集之间的重叠。由此产生的优化使用基于梯度的JAX和可微可达性原语在线高效求解。我们在几个高维非线性机器人系统（包括模拟四旋翼和战斗机模型、硬件差速驱动机器人和四足导航）上评估了我们的方法，用于传感器和执行器故障诊断（最多11种故障模式）。在这些案例研究中，我们的方法在50毫秒内实现了可靠的模型辨识，在辨识成功率和速度上优于基线方法，同时提供了形式化的安全保证。

英文摘要

We present a safe, real-time algorithm for active fault diagnosis and model discrimination for uncertain continuous-time nonlinear systems with process and measurement disturbances. Given a finite set of candidate models representing nominal and faulty modes, including actuator and sensor faults, we formulate an output-feedback, time-varying policy optimization problem that (i) robustly enforces state-input safety constraints over a finite horizon and (ii) drives the system to produce sampled measurements consistent with at most one model, enabling deterministic diagnosis. To solve this problem in real time, we develop a tractable approximation using interval over-approximations of reachable state and output sets, and encode diagnosability via a differentiable objective that penalizes overlap between the reachable output sets of possible models. The resulting optimization is solved efficiently online with gradient-based methods using JAX and differentiable reachability primitives. We evaluate our method on sensor and actuator fault diagnosis (up to 11 fault modes) in several high-dimensional nonlinear robotic systems, including a simulated quadrotor and fighter-jet model, a hardware differential-drive robot, and quadrupedal navigation. Across these case studies, our approach achieves reliable model discrimination in under 50 ms, outperforming baselines in discrimination success rate and speed while providing formal safety guarantees.
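
The idea of penalizing overlap between interval over-approximations of reachable output sets can be illustrated with axis-aligned boxes. The sketch below is an assumption-laden toy, not the paper's objective: it uses plain Python rather than JAX, the overlap-volume penalty and the function names are hypothetical, and the example boxes are made up. It only shows why a zero penalty means a measurement can be consistent with at most one model.

```python
def interval_overlap(lo_a, hi_a, lo_b, hi_b):
    """Length of the intersection of [lo_a, hi_a] and [lo_b, hi_b] (0.0 if disjoint)."""
    return max(0.0, min(hi_a, hi_b) - max(lo_a, lo_b))


def box_overlap_volume(box_a, box_b):
    """Volume of the intersection of two boxes given as lists of (lo, hi) per dimension."""
    volume = 1.0
    for (lo_a, hi_a), (lo_b, hi_b) in zip(box_a, box_b):
        volume *= interval_overlap(lo_a, hi_a, lo_b, hi_b)
    return volume


def total_overlap_penalty(boxes):
    """Sum of pairwise intersection volumes over all candidate models."""
    penalty = 0.0
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            penalty += box_overlap_volume(boxes[i], boxes[j])
    return penalty


# Hypothetical two-dimensional output sets for a nominal and a faulty model.
nominal = [(0.0, 2.0), (0.0, 1.0)]
fault_overlapping = [(1.0, 3.0), (0.5, 2.0)]  # shares a region with `nominal`
fault_separated = [(2.5, 3.0), (0.0, 1.0)]    # disjoint from `nominal` in dimension 0

penalty_before = total_overlap_penalty([nominal, fault_overlapping])
penalty_after = total_overlap_penalty([nominal, fault_separated])
```

Two boxes are disjoint as soon as a single dimension has zero overlap, which is why the product form works as a separation test. A plain volume penalty has zero gradient once the sets are separated, so an objective used for gradient-based optimization would likely need a smoother surrogate; the abstract does not specify the form the authors use.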

2606.19588 2026-06-19 cs.AI cs.CR cs.LO 新提交

Analyzing the Narration Gap in LLM-Solver Loops

分析大语言模型-求解器循环中的叙述差距

Zunchen Huang, Songgaojun Deng

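Institutions: Eindhoven University of Technology

AI summary: Studies the security weakness in the narration step of hybrid LLM and SAT/SMT solver reasoning, where solver output is turned into the user-facing answer; formal modeling and empirical evaluation show that certificate gating keeps the solver verdict sound, yet adversarial attacks can still invert the conclusion.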

AI中文摘要

诸如SAT和SMT求解器之类的形式化工具，当安全或安保关键问题可以用逻辑表述时，越来越多地被嵌入到语言模型推理流程中。与思维链不同（其步骤从模型分布中采样，没有形式化保证），求解器产生可靠且可独立验证的答案。然而，这种可靠性保证可能在求解器与模型之间的交互中丢失。混合流程包含三个组成部分：形式化问题、求解问题以及叙述结果。先前的工作研究了形式化和求解，但未涉及叙述——即将形式化工具的输出转化为用户答案的步骤。为了填补叙述差距，我们首先将LLM-求解器循环建模为经过验证的决策过程。我们进一步在提示注入下评估了五个开源模型，发现证书门控使求解器判定可靠，而攻击者可以通过不同措辞和渠道反转已验证的结论。我们研究了通过强化提示进行缓解的方法，该方法显著减少了注入但无法完全消除，并且在自适应攻击下仍然存在问题。结合形式化分析和实证研究，我们表明在LLM-求解器循环中，鲁棒性无法延伸到用户最终读取的答案。

英文摘要

Formal tools such as SAT and SMT solvers are increasingly embedded in language model reasoning pipelines when a safety- or security-critical question can be formulated in logic. Unlike chain of thought, whose steps are sampled from the model distribution without formal guarantees, a solver produces a sound and independently verifiable answer. However, the soundness guarantee can be lost in the interaction between the solver and the model. The hybrid pipeline has three components: formalizing the question, deciding it, and narrating the result. Prior work has studied formalization and decision, but not narration, which is the step that turns a formal tool's output into the user answer. To fill the narration gap, we first model the LLM-solver loop as a verified decision procedure. We further evaluate five open-source models under prompt injection, and we find that certificate gating makes the solver verdict sound, while an adversary can invert a verified conclusion across phrasings and channels. We study mitigation through a hardened prompt, which reduces injection significantly but cannot eliminate it and still suffers under adaptive attacks. Combining the formal analysis and empirical studies, we show that, in the LLM-solver loop, robustness does not reach the answer that the user finally reads.

2606.19586 2026-06-19 cs.RO 新提交

One Demo is Worth a Thousand Trajectories: Action-View Augmentation for Visuomotor Policies

一个演示胜过千条轨迹：用于视觉运动策略的动作-视角增强

Chuer Pan, Litian Liang, Dominik Bauer, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Shuran Song

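Institutions: Stanford University; Columbia University; Toyota Research Institute

AI summary: Proposes a data augmentation framework that uses Gaussian splatting and trajectory optimization to generate realistic fisheye image sequences and physically feasible action trajectories, improving the success rate of manipulation policies under scene changes and obstacles.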

Comments Project website: https://chuerpan.com/1001-demos.github.io/. Published at CoRL 2025

Journal ref Proceedings of The 9th Conference on Robot Learning, PMLR 305:3902-3914, 2025

AI中文摘要

用于操作的视觉运动策略在建模复杂机器人行为方面展现出显著潜力，但机器人初始配置的微小变化和未见障碍物容易导致分布外观测。在没有大量数据收集工作的情况下，这些会导致灾难性的执行失败。在这项工作中，我们引入了一个有效的数据增强框架，该框架从真实世界的眼在手演示中生成视觉上逼真的鱼眼图像序列和相应的物理上可行的动作轨迹，这些演示使用带有单个鱼眼摄像头的便携式平行夹爪捕获。我们引入了一种新颖的高斯泼溅公式，适用于广角鱼眼摄像头，以重建和编辑带有未见物体的3D场景。我们利用轨迹优化生成平滑、无碰撞、视图渲染友好的动作轨迹，并从相应新视角渲染视觉观测。在仿真和现实世界中的综合实验表明，我们的增强框架提高了各种操作任务在相同场景和需要避障的增强场景中的成功率。

英文摘要

Visuomotor policies for manipulation have demonstrated remarkable potential in modeling complex robotic behaviors, yet minor alterations in the robot's initial configuration and unseen obstacles easily lead to out-of-distribution observations. Without extensive data collection effort, these result in catastrophic execution failures. In this work, we introduce an effective data augmentation framework that generates visually realistic fisheye image sequences and corresponding physically feasible action trajectories from real-world eye-in-hand demonstrations, captured with a portable parallel gripper with a single fisheye camera. We introduce a novel Gaussian Splatting formulation, adapted to wide FoV fisheye cameras, to reconstruct and edit the 3D scene with unseen objects. We utilize trajectory optimization to generate smooth, collision-free, view-rendering-friendly action trajectories and render visual observations from corresponding novel views. Comprehensive experiments in simulation and the real world show that our augmentation framework improves the success rate for various manipulation tasks in both the same scene and the augmented scene with obstacles requiring collision avoidance.

2606.19584 2026-06-19 cs.CV 新提交

Language-Instructed Vision Embeddings for Controllable and Generalizable Perception

语言引导的视觉嵌入用于可控且可泛化的感知

Chengzhi Mao, Xudong Lin, Wen-Sheng Chu

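Institutions: Google

AI summary: Proposes Language-Instructed Vision Embeddings (LIVE), which uses language to dynamically steer a vision encoder toward task-centric embeddings without task-specific retraining, reducing visual hallucinations and improving generalization.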

Journal ref Published as a conference paper at ICLR 2026

2606.19579 2026-06-19 cs.SD cs.AI 新提交

FlowFake: Liquid Networks for Audio Deepfake Detection

FlowFake: 用于音频深度伪造检测的液态网络

Shivaay Dhondiyal, Divyansh Sharma, Dinesh Kumar Vishwakarma

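Institutions: Delhi Technological University

AI summary: To address the failure of audio deepfake detectors to generalize across datasets, proposes FlowFake, a model built on the Liquid Time-Constant (LTC) architecture whose hidden state evolves via a learned ODE with adaptive time constants; with 34K parameters it outperforms existing methods on a cross-domain benchmark.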

Comments Accepted at the Workshop on Learning to Listen: Machine Learning for Audio at ICML 2026

AI中文摘要

由神经文本转语音和语音克隆系统生成的音频深度伪造对说话人验证和公共话语构成大规模威胁。核心挑战是跨数据集泛化：在一种合成流水线上训练的检测器在面对未见过的伪造时性能崩溃。我们认为这种失败主要是由于结构性合成语音伪影，这些伪影是多时间尺度的轨迹异常。尽管每个现有检测器都聚合固定窗口的帧统计量，但这使得架构与信号不对齐。我们提出FlowFake，一种液态时间常数（LTC）架构，其隐藏状态通过学习ODE演化，每个神经元具有自适应时间常数，同时解析频谱（10ms）和韵律（2s）线索。仅34K参数，FlowFake实现了正式的BIBO稳定性和O(dt^4)积分误差。在四个数据集的跨域基准（ASVspoof2019-LA、FakeOrReal、InTheWild、MLAAD）上，FlowFake在仅用FakeOrReal训练时在ASVspoof2019上达到75.29%，仅用MLAAD训练时达到79.97%。它在每个评估对上优于RawGAT-ST和Whisper-DF，并以0.01%的参数数量匹配SSL Wav2vec2（大300倍）。源代码可在以下网址获取：this https URL

英文摘要

Audio deepfakes generated by neural text-to-speech and voice-cloning systems threaten speaker verification and public discourse at scale. The core challenge is cross-dataset generalization: detectors trained on one synthesis pipeline collapse on unseen forgeries. We argue that this failure arises primarily from structural synthetic-speech artifacts, which are multi-timescale trajectory anomalies. Every existing detector, however, aggregates fixed-window frame statistics, which misaligns the architecture with the signal. We propose FlowFake, a Liquid Time-Constant (LTC) architecture whose hidden state evolves via a learned ODE, with per-neuron adaptive time constants simultaneously resolving spectral (10ms) and prosodic (2s) cues. At only 34K parameters, FlowFake achieves formal BIBO stability and O(dt^4) integration error. On a four-dataset cross-domain benchmark (ASVspoof2019-LA, FakeOrReal, InTheWild, MLAAD), FlowFake reaches 75.29% on ASVspoof2019 trained only on FakeOrReal and 79.97% trained only on MLAAD. It outperforms RawGAT-ST and Whisper-DF on every evaluated pair and matches SSL Wav2vec2 (300x larger) at 0.01% of its parameter count. The source code is available at https://github.com/GhostRider2023/FlowFake

2606.19576 2026-06-19 cs.DB cs.DC 新提交

REMOP: REmote-Memory-aware OPerator Optimization

REMOP: 远程内存感知的算子优化

Shiquan Zhang, Yunhao Mao, Yuqiu Zhang, Gengrui Zhang, Jeyhun Karimov, Hans-Arno Jacobsen

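AI summary: To address the excessive number of data-transfer rounds in query processing over remote memory, proposes REMOP, a framework that optimizes out-of-memory execution with transfer-round-aware intra-operator memory policies; implemented for three operators in DuckDB, it cuts transfer rounds by up to 97% and operator runtime by up to 48%.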

Comments 14 pages, 13 figures, 9 tables. Preprint, under review

AI中文摘要

远程和分离内存层扩展了分析数据库引擎的有效内存容量，但也重塑了内存溢出查询处理的成本结构。当算子溢出到本地DRAM之外时，将页面移动到远程内存既会产生数据传输时间，也会产生每次传输的固定往返延迟。经典的算子分析和缓冲区分配启发式方法主要通过最小化总I/O量来针对磁盘溢出。在远程内存下，这些策略可能不是最优的，因为它们可能触发过多的传输轮次。我们提出了REMOP，一个远程内存感知的算子优化框架，它使用传输轮次感知的算子内内存策略来改善内存预算紧张下的内存溢出执行。REMOP将传输轮次数引入延迟成本模型，并推导出算子特定的缓冲区划分策略，在DuckDB中为阻塞嵌套循环连接、外部归并排序和外部哈希连接实例化了该方法。我们在双节点计算-内存测试平台上的评估表明，在溢出密集的微基准测试中，REMOP减少了高达97%的传输轮次和高达48%的算子运行时间，并将溢出TPC-H和TPC-DS查询的平均运行时间分别降低了22.7%和26.4%。

英文摘要

Remote and disaggregated memory tiers expand the effective memory capacity of analytical database engines, but they also reshape the cost structure of out-of-memory query processing. When an operator spills beyond local DRAM, moving pages to remote memory incurs both data-transfer time and a fixed round-trip latency per transfer. Classical operator analyses and buffer-allocation heuristics primarily target disk spilling by minimizing total I/O volume. Under remote memory, these strategies can be suboptimal because they may trigger excessive transfer rounds. We present REMOP, a remote-memory-aware operator optimization framework that uses transfer-round-aware intra-operator memory policies to improve out-of-memory execution under tight memory budgets. REMOP introduces the number of transfer rounds into the latency cost model and derives operator-specific buffer-partitioning strategies, instantiating the approach for blocked nested-loop join, external merge sort, and external hash join in DuckDB. Our evaluation on a two-node compute-memory testbed shows that REMOP reduces transfer rounds by up to 97% and operator runtime by up to 48% on spill-heavy microbenchmarks, and lowers the average runtime of spilling TPC-H and TPC-DS queries by 22.7% and 26.4% end-to-end.
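
The tension the abstract describes, where minimizing transferred volume can trigger many transfer rounds, can be seen in a toy cost model for a blocked nested-loop join. Everything in this sketch is an assumption made for illustration: the cost formula (rounds times round-trip latency plus pages times per-page transfer time), the rule that each buffer fill is one round, the page counts, and the function names. It is not REMOP's actual cost model or buffer-partitioning strategy.

```python
def ceil_div(a, b):
    """Integer ceiling division for positive integers."""
    return -(-a // b)


def bnlj_cost(outer_pages, inner_pages, outer_buf, inner_buf, rtt_us, page_us):
    """Return (rounds, pages, latency_us) for a blocked nested-loop join.

    Assumes every buffer fill is one transfer round from remote memory.
    """
    outer_chunks = ceil_div(outer_pages, outer_buf)
    inner_rounds_per_chunk = ceil_div(inner_pages, inner_buf)
    rounds = outer_chunks + outer_chunks * inner_rounds_per_chunk
    pages = outer_pages + outer_chunks * inner_pages
    latency_us = rounds * rtt_us + pages * page_us
    return rounds, pages, latency_us


def best_outer_buffer(outer_pages, inner_pages, budget_pages, rtt_us, page_us):
    """Pick the outer-buffer size (inner gets the rest) with the lowest modeled latency."""
    return min(
        range(1, budget_pages),
        key=lambda b: bnlj_cost(
            outer_pages, inner_pages, b, budget_pages - b, rtt_us, page_us
        )[2],
    )


# Hypothetical workload: 100-page outer and inner relations, 11-page budget.
volume_only = best_outer_buffer(100, 100, 11, rtt_us=0, page_us=1)
round_aware = best_outer_buffer(100, 100, 11, rtt_us=10, page_us=1)
```

With the round-trip term set to zero, the model reproduces the classical choice of giving almost the whole budget to the outer relation (10 of 11 pages), which needs 1010 rounds in this toy setup. Once each round costs 10 time units, the best split moves to 6 outer pages and 5 inner pages, which transfers more pages (1800 instead of 1100) but needs only 357 rounds.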

2606.19570 2026-06-19 cs.HC 新提交

Code as Anchor, Memory and Metaphor as Support: Learner Experiences with Multi-View Visualizations

代码作为锚点，记忆与隐喻作为支持：学习者对多视图可视化的体验

Naaz Sibia, Jessica Wen, Amber Richardson, Yashika Jain, Khushi Malik, Bogdan Simion, Carolina Nobre, Angela Zavaleta Bernuy, Andrew Petersen, Michael Liut

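AI summary: Uses gaze tracking and interviews to study how 19 students who had completed CS1 and CS2 engage with multi-view visualization tools; students focused mainly on code and largely ignored metaphor views, with engagement shaped by agency, representational fit, and legitimacy.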

Comments Pre-Print of a paper to be published at the International Computing Education Research (ICER) conference 2026

DOI: 10.1145/3765964.3811662

AI中文摘要

程序可视化被广泛用于支持新手程序员，但学生经常忽视或抵制精心设计的视觉支架。关于多重外部表征（MERs）的研究提供了协调视图的认知设计原则，但对于什么因素影响学习者对可用表征的参与度知之甚少。我们对19名已完成CS1和CS2的本科生进行了一项被试内研究。学生使用一个多表征探针（包含同步的代码、记忆和隐喻视图）和Python Tutor，在作用域、while循环和链表任务中完成出声思考任务、反思性访谈和基于摄像头的视线追踪。视线分析显示，尽管有可用的视觉支架，学生将近一半的时间专注于代码。没有先前经验的学生更强烈地以代码为锚点，并且很少参与隐喻视图。访谈确定了影响选择性参与的三个因素：能动性（学生寻求控制认知努力而非简单减少）、表征适配（相同设计在不同情境下感觉有帮助或令人不知所措）以及合法性（一些学生避免他们认为幼稚或不够严谨的隐喻支架）。这些发现表明，计算教育中的多表征工具需要关注情感和社会因素以及认知设计。实际考虑包括将可视化定位为验证工具、提供可切换的抽象级别以及通过框架设计传达学科合法性。更广泛地说，这些主题有助于解释为什么认知上合理的可视化工具可能无法吸引它们旨在帮助的学生。

英文摘要

Program visualizations are widely used to support novice programmers, yet students often ignore or resist well-designed visual scaffolds. Research on multiple external representations (MERs) offers cognitive design principles for coordinating views, but less is known about what shapes learners' engagement with available representations. We conducted a within-subjects study with 19 undergraduates who had completed CS1 and CS2. Students completed think-aloud tasks, reflective interviews, and webcam-based gaze tracking while using a multi-representational probe with synchronized code, memory, and metaphor views, and Python Tutor, across scope, while loops, and linked lists. Gaze analysis showed that students spent nearly half their time focused on code despite available visual scaffolds. Students without prior experience anchored even more heavily in code and engaged minimally with metaphor views. Interviews identified three factors shaping selective engagement: agency, as students sought control over cognitive effort rather than simply having it reduced; representational fit, as identical designs differed in whether they felt helpful or overwhelming; and legitimacy, as some students avoided metaphorical scaffolds they perceived as childish or insufficiently rigorous for university-level work. These findings suggest that multi-representational tools in computing education require attention to affective and social factors alongside cognitive design. Practical considerations include positioning visualizations as verification instruments, offering toggleable abstraction levels, and framing tools to signal disciplinary legitimacy. More broadly, the themes help explain why cognitively sound visualization tools may fail to engage the students they are designed to help.

2606.19569 2026-06-19 cs.LG 新提交

On the QUEST for Uncertainty Quantification via Highest Density Regions

论基于最高密度区域的量化不确定性探索

Sam Goring, Tom Kuipers, Nicola Paoletti, David S. Watson

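Institutions: Northeastern University London

AI summary: For uncertainty quantification in regression problems in probabilistic machine learning, proposes QUEST, a framework based on the volume of highest density regions that satisfies monotonicity and location-shift invariance axioms and compares favourably with variance and differential entropy on selective prediction benchmarks.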

Comments 27 pages, of which 10 are main text. Contains 7 figures, 4 tables, 1 algorithm in total

AI中文摘要

不确定性量化对于概率机器学习中安全关键应用的可靠决策至关重要。对于回归问题，主流的标量不确定性量化方法——特别是基于适当评分规则的方法——通过逐点预测风险来衡量不确定性。当目标统计量不是条件期望时，这可能导致反直觉的结果。我们提出了一种替代框架，其中不确定性通过分布支持的最可能子集的体积来表征。QUEST（通过最高密度区域量化不确定性）是一种基于勒贝格测度在分布峰值处集中程度的新颖不确定性量化方法，在鲁棒性参数$\alpha$的一个或多个值处进行评估。我们建立了我们的度量与信息论和经济学中经典统计量之间的联系。我们表明，与基于适当评分规则的流行替代方案不同，QUEST的认知不确定性和偶然不确定性度量满足从不确定性量化文献中改编的一组公理，包括在分布扩散下的单调性和位置偏移的不变性。选择性预测基准证实，QUEST在方差和微分熵等标准度量上表现良好。

英文摘要

Uncertainty quantification (UQ) is essential for reliable decision-making in safety-critical applications in probabilistic machine learning. For regression problems, dominant scalar UQ approaches, notably those based on proper scoring rules, measure uncertainty via pointwise predictive risk. This can lead to counterintuitive results when the target statistic is not the conditional expectation. We propose an alternative framework, in which uncertainty is characterised by the volume of the most probable subset of a distribution's support. QUEST (Quantifying Uncertainty via highest dEnSiTy regions) is a novel approach to UQ based on the concentration of Lebesgue measure at a distribution's peak(s), evaluated at one or more values of a robustness parameter $\alpha$. We establish connections between our measures and classical statistics from information theory and economics. We show that, unlike popular alternatives based on proper scoring rules, QUEST measures of epistemic and aleatoric uncertainty satisfy a set of axioms adapted from the UQ literature, including monotonicity under distributional spread and invariance to location shifts. Selective prediction benchmarks confirm that QUEST performs favourably against standard measures such as variance and differential entropy.
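
The notion of "the volume of the most probable subset of a distribution's support" can be made concrete with a grid approximation of a highest density region (HDR). The sketch below is only an illustration under stated assumptions: it treats the HDR at level $\alpha$ as the smallest set holding probability mass $1 - \alpha$, uses a one-dimensional Gaussian as the example, and the grid resolution and function names are hypothetical. The abstract does not give QUEST's exact epistemic and aleatoric definitions, so none are reproduced here.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def hdr_volume(pdf, lo, hi, alpha, n_cells=20000):
    """Approximate the Lebesgue measure of the (1 - alpha) highest density region.

    The interval [lo, hi] is split into equal cells; cells are added in order of
    decreasing density until their total mass reaches 1 - alpha.
    """
    dx = (hi - lo) / n_cells
    densities = sorted(
        (pdf(lo + (i + 0.5) * dx) for i in range(n_cells)), reverse=True
    )
    mass = 0.0
    for n_used, density in enumerate(densities, start=1):
        mass += density * dx
        if mass >= 1.0 - alpha:
            return n_used * dx
    return hi - lo  # the grid does not hold enough mass; fall back to its full width


narrow = hdr_volume(lambda x: normal_pdf(x, 0.0, 1.0), -10.0, 10.0, alpha=0.05)
shifted = hdr_volume(lambda x: normal_pdf(x, 3.0, 1.0), -10.0, 10.0, alpha=0.05)
wide = hdr_volume(lambda x: normal_pdf(x, 0.0, 2.0), -20.0, 20.0, alpha=0.05)
```

For a standard normal the 95% HDR is the central interval of width about 3.92, so `narrow` lands near that value. Shifting the mean leaves the volume essentially unchanged and doubling the standard deviation roughly doubles it, which mirrors the location-invariance and spread-monotonicity properties named in the abstract. Because cells are ranked by density rather than position, the same routine also handles multimodal densities, where the HDR is a union of disjoint intervals.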

2606.19568 2026-06-19 cs.SD cs.AI 新提交

Exploring Feature Extraction Technique Parameters for Acoustic Gunshot Classification

声学枪声分类的特征提取技术参数探索

Sinclair Gurny, Ryan Quinn

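AI summary: Systematically studies how feature extraction techniques and their parameters affect acoustic gunshot classification; evaluating ResNet-18 on a dataset of 23,000 gunshot recordings, it finds that choosing the right technique can raise top-1 accuracy by 20% and that parameter tuning adds a further 4.7%.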

2606.19566 2026-06-19 eess.SY cs.AI cs.SY 新提交

GDGU: A Gradient Difference-based Graph Unlearning Method for Cyberattack Localization in Electric Vehicle Charging Networks

GDGU：基于梯度差异的图遗忘方法用于电动汽车充电网络中的网络攻击定位

Nanhong Liu, Mucun Sun, Jie Zhang

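AI summary: To meet data-deletion requests from electric vehicle charging stations, proposes gradient difference-based graph unlearning (GDGU), which achieves efficient unlearning through a first-order parameter correction and substantially reduces computational overhead while preserving localization performance.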

AI中文摘要

电动汽车充电站（EVCS）可能使配电馈线暴露于网络攻击。尽管包括图神经网络在内的机器学习方法可以定位哪个母线被攻破，但在数据共享和模型训练方面仍存在重大挑战。例如，隐私法规允许EVCS所有者从已部署的模型中删除其训练数据，但每次请求都从头重新训练在计算上不可行。为了解决这个问题，我们研究了用于EVCS网络攻击定位的图遗忘（GU），将其形式化为图级多标签分类任务上的特征级遗忘问题。具体来说，我们提出了基于梯度差异的图遗忘（GDGU），通过一阶参数校正消除请求删除数据的影响。该校正基于原始训练数据与修改后数据集之间的梯度差异计算，其中仅遗忘请求的EVCS母线的充电功率特征。然后，应用批归一化重新校准和简短的恢复微调步骤以恢复定位效用。我们在IEEE 34母线、123母线和8500节点配电网络上，使用三种图神经网络骨干网络和累积遗忘场景，将GDGU与两种二阶GU基线进行比较。GDGU在定位效用上与最强基线相当，遗忘保真度接近完全重新训练，同时遗忘速度比从头重新训练快10到12倍，且内存使用远少于二阶GU基线。

英文摘要

Electric vehicle charging stations (EVCSs) can expose distribution feeders to cyberattacks. While machine learning methods, including graph neural networks, can localize which bus is compromised, significant challenges remain in data sharing and model training. For example, privacy regulations grant EVCS owners the right to delete their training data from a deployed model, yet retraining from scratch on every request is computationally prohibitive. To address this, we study graph unlearning (GU) for EVCS cyberattack localization, formulated as a feature-level unlearning problem on a graph-level multi-label classification task. Specifically, we propose gradient difference-based graph unlearning (GDGU), which removes the influence of the requested deletion data through a first-order parameter correction. The correction is computed from the gradient difference between the original training data and a modified dataset in which only the charging power features at the requested EVCS buses are unlearned. Then, a batch-normalization recalibration and a brief recovery fine-tuning step are applied to restore localization utility. We benchmark GDGU against two second-order GU baselines on the IEEE 34-bus, 123-bus, and 8500-node distribution networks across three graph neural network backbones and cumulative unlearning scenarios. GDGU matches the strongest baseline on localization utility and reaches forgetting fidelity close to full-retraining, while unlearning 10 to 12 times faster than retraining from scratch and using far less memory than the second-order GU baselines.

2606.19565 2026-06-19 cs.CV 新提交

Mix-QVLA: Task-Evidence-Aware Mixed-Precision Quantization of Vision-Language-Action Models

Mix-QVLA：任务证据感知的视觉-语言-动作模型混合精度量化

Navin Ranjan, Andreas Savakis

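Institutions: Rochester Institute of Technology

AI summary: Proposes Mix-QVLA, a task-evidence-aware mixed-precision post-training quantization framework that sharply reduces the memory and compute cost of VLA models while preserving task performance, reaching a 4.1 GB memory footprint and a 1.52x speedup on LIBERO.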

AI中文摘要

我们提出Mix-QVLA，一种针对VLA模型的任务证据感知混合精度PTQ框架。Mix-QVLA将每个量化变体锚定到全精度动作令牌参考决策，并评估量化是否在关键VLA功能边界上保留了任务相关证据。它从边界激活计算归一化的梯度加权任务证据图，并使用证据质量和归因分布失真比较全精度和量化图，捕捉决策支持证据的强度和分配变化。一个软瓶颈目标将边界级退化聚合为层敏感度分数。Mix-QVLA进一步在整个任务执行过程中建模敏感度，捕捉层重要性的阶段依赖变化，而不是假设固定的敏感度分布。由此产生的证据和时间感知分数指导在模型大小和BitOps预算下的混合精度位分配。在OpenVLA风格策略上的广泛评估表明，Mix-QVLA改善了低比特VLA部署的精度-效率权衡。在LIBERO上，Mix-QVLA将OpenVLA-OFT内存从15.4 GB减少到4.1 GB，保留了96.3的平均成功率（BF16模型为97.1），并实现了1.52倍的推理加速。

英文摘要

We propose Mix-QVLA, a task-evidence-aware mixed-precision PTQ framework for VLA models. Mix-QVLA anchors each quantized variant to the full-precision action-token reference decision and evaluates whether quantization preserves task-relevant evidence across key VLA functional boundaries. It computes normalized gradient-weighted task-evidence maps from boundary activations and compares full-precision and quantized maps using evidence-mass and attribution-distribution distortion, capturing changes in both the strength and allocation of decision-supporting evidence. A soft-bottleneck objective aggregates boundary-level degradation into layer-wise sensitivity scores. Mix-QVLA further models sensitivity throughout task execution, capturing phase-dependent shifts in layer importance rather than assuming a fixed sensitivity profile. The resulting evidence- and time-aware scores guide mixed-precision bit allocation under model-size and BitOps budgets. Extensive evaluations on OpenVLA-style policies show that Mix-QVLA improves the accuracy-efficiency trade-off of low-bit VLA deployment. On LIBERO, Mix-QVLA reduces OpenVLA-OFT memory from 15.4 GB to 4.1 GB, retains 96.3 average success compared with 97.1 for the BF16 model, and achieves a 1.52x inference speedup.
