arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.01338 2026-06-19 cs.CL Version update

Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware

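Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware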

Sagar Bhetwal, Rajan Bastakoti, Nirajan Acharya, Gaurav Kumar Gupta, Ambika Baniya Bhandari

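Affiliations Department of Computer Science, University of the Cumberlands; Department of Computer Science, DePaul University; Youngstown State University

AI summary This study evaluates four locally deployed open-source large language models on natural-language-to-SQL generation over a biopharmaceutical manufacturing database. It finds that code-tuned general-purpose models outperform a domain-specific model, but that current performance still requires human oversight.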

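Details
AI abstract (translated from Chinese)

Biopharmaceutical manufacturing organizations operate under regulatory frameworks such as FDA guidance, EU Good Manufacturing Practice (GMP), and the EU AI Act, which may restrict the use of cloud-based artificial intelligence systems. Locally deployed large language models (LLMs) offer a privacy-preserving alternative, but their suitability for pharmaceutical manufacturing tasks remains underexplored. This study evaluates the natural-language-to-SQL generation ability of four open-source LLMs deployed locally via Ollama (Qwen 2.5 Coder 7B, Llama 3.1 8B, Mistral 7B, and Meditron 7B) on a pharmaceutical manufacturing database. A FastAPI-based evaluation platform, PharmaBatchDB AI, was developed using a synthetic Microsoft SQL Server database of approximately 63,000 records covering Batch, Manufacturing Execution System (MES), and Clean-In-Place (CIP) modules. The models were benchmarked on 60 domain-specific natural-language questions using metrics including SQL extraction rate, SQL compliance, factual consistency, ROUGE-L, hallucination rate, throughput, and latency. Qwen 2.5 Coder 7B, Llama 3.1 8B, and Mistral 7B generated SQL for every evaluation task, whereas Meditron 7B failed on nearly all tasks because of context-window limitations and poor SQL generation ability. Llama 3.1 8B achieved the highest SQL compliance, while Qwen 2.5 Coder 7B was strongest in overall text similarity and factual consistency. The performance difference between the two leading models was not statistically significant. The results indicate that code-tuned general-purpose LLMs outperform a domain-specific biomedical model at structured query generation over pharmaceutical manufacturing data. Although fully local, GxP-aligned NLQ systems are feasible on consumer-grade hardware, current performance levels still require human oversight and downstream validation for regulated use.

English abstract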

Biopharmaceutical manufacturing organizations operate under regulatory frameworks such as FDA guidance, EU Good Manufacturing Practice (GMP), and the EU AI Act, which can restrict the use of cloud-based artificial intelligence systems. Locally deployed large language models (LLMs) offer a privacy-preserving alternative, but their suitability for pharmaceutical manufacturing tasks remains underexplored. This study evaluates four open-source LLMs (Qwen 2.5 Coder 7B, Llama 3.1 8B, Mistral 7B, and Meditron 7B) deployed locally via Ollama for natural-language-to-SQL generation over a pharmaceutical manufacturing database. A FastAPI-based evaluation platform, PharmaBatchDB AI, was developed using a synthetic Microsoft SQL Server database containing approximately 63,000 records across Batch, Manufacturing Execution System (MES), and Clean-In-Place (CIP) modules. Models were benchmarked on 60 domain-specific natural-language questions using metrics including SQL extraction rate, SQL compliance, factual consistency, ROUGE-L, hallucination rate, throughput, and latency. Qwen 2.5 Coder 7B, Llama 3.1 8B, and Mistral 7B generated SQL for all evaluation tasks, while Meditron 7B failed on nearly all tasks due to context-window limitations and poor SQL generation capability. Llama 3.1 8B achieved the highest SQL compliance, whereas Qwen 2.5 Coder 7B achieved the strongest overall text similarity and factual consistency. Performance differences between the two leading models were not statistically significant. The results show that code-tuned general-purpose LLMs outperform a domain-specific biomedical model on structured query generation for pharmaceutical manufacturing data. Although fully local, GxP-aligned NLQ systems are feasible on consumer hardware, current performance levels still require human oversight and downstream validation for regulated use.
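
The abstract lists SQL extraction rate among its metrics without defining it. The sketch below shows one plausible reading: the share of model replies from which a SELECT statement can be pulled out. The helper names, the regular expressions, and the table names in the usage note are illustrative assumptions, not the PharmaBatchDB AI implementation.

```python
import re
from typing import List, Optional

TICKS = "`" * 3  # Markdown code-fence marker, built indirectly to keep this listing tidy

# Hypothetical patterns: a fenced block (optionally tagged sql) and a bare SELECT ... FROM statement.
FENCE = re.compile(TICKS + r"(?:sql)?\s*(.*?)" + TICKS, re.IGNORECASE | re.DOTALL)
STATEMENT = re.compile(r"\bSELECT\b.*?\bFROM\b.*?(?:;|\Z)", re.IGNORECASE | re.DOTALL)


def extract_sql(reply: str) -> Optional[str]:
    """Return the first SELECT statement found in a model reply, or None."""
    fenced = FENCE.search(reply)
    candidate = fenced.group(1) if fenced else reply
    match = STATEMENT.search(candidate)
    return match.group(0).strip() if match else None


def extraction_rate(replies: List[str]) -> float:
    """Fraction of replies that yield an extractable SELECT statement."""
    if not replies:
        return 0.0
    hits = sum(1 for reply in replies if extract_sql(reply) is not None)
    return hits / len(replies)
```

For example, a reply containing a fenced `SELECT BatchID FROM Batch;` (a made-up table) counts as a hit, while a refusal such as "I do not know." does not. A real harness would likely also validate the statement against the schema, which is closer to what the abstract calls SQL compliance.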

2606.01316 2026-06-19 cs.AI Version update

Science Earth: Towards A Planet-Scale Operating System for AI-Native Scientific Discovery

Science Earth: Towards a Planet-Scale Operating System for AI-Native Scientific Discovery

Zhe Zhao, Haibin Wen, Yingcheng Wu, Jiaming Ma, Yifan Wen, Jinglin Jian, Jiacheng Ge, Xiangru Tang, Bo An, Ming Yin, Sanfeng Wu, Mengdi Wang, Le Cong

Affiliations Department of Pathology, Department of Genetics, Stanford University School of Medicine; Princeton AI Lab, Department of Electrical & Computer Engineering, Princeton University; Scripps Research, La Jolla, CA, USA; Division of Biostatistics, Department of Population Health, New York University Grossman School of Medicine; College of Computing and Data Science, Nanyang Technological University; Department of Computer Science, Yale University; Department of Physics, Princeton University

AI summary Proposes Science Earth, a planet-scale scientific runtime in which the EACN protocol enables dynamic connection and self-organized collaboration among AI capabilities. Distributed, self-correcting scientific reasoning is demonstrated in a trans-Pacific Kuramoto synchronization study and a single-cell analysis.

Comments Withdrawn by the authors. (1) The author list and authorship roles had not been finalized and agreed upon by all listed authors prior to submission. (2) The specific contribution of the system in the K3 synchronization example (Section on Kuramoto/nonlinear physics) requires further validation before it can be reported. The authors are addressing both points and may resubmit a corrected version.

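Details
AI abstract (translated from Chinese)

Scientific discovery requires intelligence, perseverance, and serendipity across vast search spaces. Today, top scientific capabilities remain siloed (one AI system for biological analysis, another for clinical reasoning, mathematical derivation, or materials simulation), and no pre-designed team can anticipate every skill a question will need. Science Earth is a planet-scale scientific runtime in which any capability (a simulation cluster, a wet-lab robot, a proof engine, a single-cell pipeline) can connect to any other, with the collaboration structure emerging from the question itself. Its underlying EACN protocol lets capabilities discover one another, negotiate task ownership, and adjudicate between incompatible evidentiary standards, without knowing in advance who will meet whom. This shifts the organizational challenge from workflow design to open-ended connectivity. Two runs validate this under structurally different conditions. In a trans-Pacific higher-order Kuramoto synchronization study, agents identified and corrected, within 30 minutes, a closure-ratio assumption in Ott-Antonsen analytic theory that fails outside the Lorentzian limit. In an eight-agent single-cell run on the 4.88-million-cell Kang 2024 pan-cancer atlas, heterogeneous capabilities coupled over a 64.9-hour window with only one structural external instruction, producing three new result layers and anchoring the findings against an independent wet-lab study of an adjacent CCR8- TIGIT+ Treg subset. These cases are a first empirical reading, not a benchmark. They show that when AI capabilities are truly connectable and coordination emerges from the problem, scientific reasoning becomes a distributed, self-correcting process, a step towards planet-scale AI-native discovery.

English abstract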

Scientific discovery demands intelligence, perseverance, and serendipity across vast search spaces. Today, top scientific capabilities remain siloed--one AI system for biological analysis, another for clinical reasoning, mathematical derivation, or materials simulation--and no pre-designed team can anticipate every skill a question will need. Science Earth is a planet-scale scientific runtime in which any capability--a simulation cluster, a wet-lab robot, a proof engine, a single-cell pipeline--can connect to any other, with collaboration structure emerging from the question itself. Its underlying EACN protocol lets capabilities discover one another, negotiate task ownership, and adjudicate across incompatible evidentiary standards without prior knowledge of who will meet whom. This shifts the organizing challenge from workflow design to open-ended connectivity. Two runs validate this under structurally distinct conditions. In a trans-Pacific higher-order Kuramoto synchronization study, agents identified and corrected a closure-ratio assumption in Ott-Antonsen analytic theory that fails outside the Lorentzian limit, within thirty minutes. In an eight-agent single-cell run on the 4.88M-cell Kang 2024 pan-cancer atlas, heterogeneous capabilities coupled over a 64.9-hour window with one structural external instruction, producing three new result layers and anchoring findings against an independent wet-lab study on an adjacent CCR8- TIGIT+ Treg subset. These cases are a first empirical reading, not a benchmark sweep. They show that when AI capabilities are truly connectable and coordination emerges from the problem, scientific reasoning becomes a distributed, self-correcting process--a step towards scaling AI-native discovery to the planet.

2606.01183 2026-06-19 cs.DC cs.DB cs.DS cs.PF Version update

The World's Fastest Matching Engine Algorithm

The World's Fastest Matching Engine Algorithm

Jake Yoon

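AI summary Proposes two data-structure contributions, the Priority-Indicated Node (PIN) and neighbor-aware tree operations, that eliminate the latency of pointer chasing and root-to-leaf search in order books, achieving sub-microsecond tail latency and throughput of tens of millions of messages per second.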

Comments 20 pages, 5 figures, 7 tables

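Details
AI abstract (translated from Chinese)

Every electronic exchange depends on an order book whose storage layer determines matching latency. The dominant implementation, linked lists chained through a balanced tree, imposes two costs on every operation: a pointer-chasing traversal to reach the insertion point, and a root-to-leaf search to locate the target price level. Under micro-burst conditions these costs produce tail-latency spikes that degrade market quality exactly when liquidity is most needed. We present two data-structure contributions that eliminate these costs. The first is the Priority-Indicated Node (PIN), a priority queue in which entries occupy fixed-capacity, contiguously addressable slots, each carrying a per-slot indicator of the entry's global priority. Unlike heaps, which need O(log n) comparisons per operation, the PIN resolves the insertion position directly from the indicators without comparing entries; indicator updates are O(1), independent of queue size. The second addresses a broader inefficiency: balanced search trees perform a root-to-leaf search on every insertion and deletion, even when the caller already knows the key's in-order neighbors, as in ordered event streams, incremental index maintenance, and electronic trading. Neighbor-aware insertion and deletion use the known neighbor references to attach or remove a node with O(1) reference writes, followed by single-path rebalancing, and apply uniformly to red-black, AVL, and B/B+-tree variants. A single CPU core sustains 32 million order messages per second at sub-microsecond tail latency under micro-bursts of millions of messages per second, 5-11 times faster than the best open-source matching engine on the same hardware. Scaled out to a single 96-core instance, the engine sustains 640 million messages per second across 10,000 symbols.

English abstract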

A single CPU core sustains 32 million order messages per second at sub-microsecond median end-to-end host-path response latency, 4.7-11 times faster than the best available open-source matching engines on identical hardware. Scaled out, a single 96-core commodity server (~$1,630/month) sustains ~640 million messages per second across 10,000 symbols, over 20 times the provisioned capacity of the U.S. consolidated quote feed. We reach these numbers by attacking the storage layer that sets matching latency. The dominant order-book implementation, linked lists chained through a balanced tree, imposes two costs on every operation: pointer-chased traversal to the insertion point, and root-to-leaf search to locate the target price level. Under micro-bursts these costs produce tail-latency spikes that degrade market quality precisely when liquidity is most needed. We present two data-structure contributions that eliminate them. The first is the Priority-Indicated Node (PIN), a priority queue in which entries occupy fixed-capacity, contiguously addressable slots, with indicators encoding the entry's global priority status. Unlike heaps, which require O(log n) comparisons per operation, the PIN resolves insertion position directly from the indicators without comparing entries; indicator updates are O(1), independent of queue size. A depth-aware capacity model sizes each PIN so hot entries fit within L1 residency. The second targets a broader inefficiency: balanced search trees search from root to leaf on every insertion and deletion, even when the caller already knows the key's in-order neighbors, which in electronic trading are available at zero cost. Neighbor-aware insertion and deletion use known neighbor references to attach or remove a node with O(1) reference writes, followed by single-path rebalancing, across red-black, AVL, and B+-tree variants.
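
The neighbor-aware idea can be illustrated in a much simpler setting than the paper's trees. The sketch below assumes price levels are threaded in sorted order through a doubly linked list and that the caller already holds the in-order predecessor; attaching or removing a level then takes a constant number of reference writes and no search. The `Level` class and function names are hypothetical, and the single-path rebalancing of the red-black, AVL, or B+-tree that the abstract describes is deliberately left out.

```python
class Level:
    """One price level, threaded to its in-order neighbors (illustrative only)."""
    __slots__ = ("price", "prev", "next")

    def __init__(self, price):
        self.price = price
        self.prev = None
        self.next = None


def insert_after(pred, price):
    """Attach a new level right after pred without searching for its position."""
    succ = pred.next
    # Guard: the caller's neighbor knowledge must actually be correct.
    if price <= pred.price or (succ is not None and price >= succ.price):
        raise ValueError("pred is not the in-order predecessor of price")
    node = Level(price)
    node.prev = pred
    node.next = succ
    pred.next = node
    if succ is not None:
        succ.prev = node
    return node


def remove(node):
    """Detach a level using only its own neighbor references."""
    if node.prev is not None:
        node.prev.next = node.next
    if node.next is not None:
        node.next.prev = node.prev
    node.prev = None
    node.next = None


def prices(head):
    """Walk the threaded list from head and collect prices in order."""
    out = []
    node = head
    while node is not None:
        out.append(node.price)
        node = node.next
    return out
```

Starting from a single level at 100, `insert_after(head, 105)` followed by `insert_after(head, 102)` leaves the levels ordered 100, 102, 105, and each call writes at most four references regardless of how many levels exist. In a full order book the same references would also be used to splice the node into the balanced tree.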

2605.31393 2026-06-19 cs.CL cs.AI Version update

Target-Side Paraphrase Augmentation for Sign Language Translation with Large Language Models

Target-Side Paraphrase Augmentation for Sign Language Translation with Large Language Models

Pedro Dal Bianco, Jean Paul Nunes Reinhold, Oscar Stanchi, Facundo Quiroga, Franco Ronchetti, Ulisses Brisolara Corrêa

Affiliations III-LIDI Universidad Nacional de La Plata; CDTEC, Federal University of Pelotas; CONICET III-LIDI Comision de Investigaciones Cientificas Universidad Nacional de La Plata; Universidade Federal de Pelotas

AI summary To address the scarcity of parallel corpora and the long-tailed target vocabulary in sign language translation, proposes target-side augmentation in which GPT-4o generates controlled paraphrase variants of the reference sentences, and validates the method on three sign language datasets.

Comments Accepted at GenSign @ CVPR 2026. Non-Proceedings Track (https://genai4sl.github.io/)

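Details
AI abstract (translated from Chinese)

Sign language translation (SLT) remains constrained by limited paired sign-video/text corpora and long-tailed target vocabularies. We study a target-side augmentation approach in which GPT-4o generates controlled paraphrase variants of the reference sentences while the sign input stays unchanged. A Signformer-style pose-based Transformer is trained under a two-stage schedule: pre-training on the augmented corpus, then fine-tuning on the original reference sentences. We evaluate on three datasets with complementary challenges: PHOENIX14T (German Sign Language), with moderate lexical diversity; GSL (Greek Sign Language), with highly controlled, repetitive recordings; and LSA-T (Argentinian Sign Language), with severe long-tail sparsity. On PHOENIX14T, augmentation raises BLEU-4 from 9.56 to 10.33. The near-saturated GSL baseline and the extremely sparse LSA-T setting reveal the limits of the approach. To our knowledge, this is the first study to apply LLM-generated target-side paraphrases and LLM-as-a-Judge evaluation to sign language translation. The semantic evaluation reveals gains in fidelity that lexical overlap metrics understate.

English abstract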

Sign language translation (SLT) remains constrained by the limited availability of paired sign-video/text corpora and by the heavy-tailed vocabularies typical of real-world datasets. We study a target-side augmentation strategy in which a large language model (LLM) generates controlled paraphrase variants of the reference spoken-language sentence while the sign input remains unchanged. Concretely, we use GPT-4o to produce semantically faithful variants of the training targets and train a Signformer-style pose-based Transformer under a two-stage schedule: pre-training on the augmented corpus followed by fine-tuning on the original references. We evaluate this strategy on three datasets that span complementary challenges: PHOENIX14T (German Sign Language), a real-world corpus with moderate lexical diversity; the Greek Sign Language Dataset with highly controlled, repetitive recordings; and LSA-T (Argentinian Sign Language), a naturalistic corpus with a large vocabulary and severe long-tail sparsity. This range allows us to characterize precisely when and why target-side augmentation is beneficial. On PHOENIX14T, augmentation improves BLEU-4 from 9.56 to 10.33, demonstrating that paraphrastic exposure helps the decoder generalize beyond memorized reference phrasing. The near-saturated GSL baseline and the extremely sparse LSA-T setting reveal the limits of the approach: in both cases, single-reference lexical overlap metrics are insufficient to capture the full picture, motivating a complementary semantic evaluation. To our knowledge, this is the first study to examine LLM-generated target-side paraphrases as an augmentation mechanism for SLT, and the first to apply an LLM-as-a-Judge evaluation protocol to SLT. This complementary evaluation reveals gains in semantic fidelity that lexical overlap metrics understate.

2605.31158 2026-06-19 cs.CV cs.LG Version update

Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

Jiacheng Lu, Haoyi Zhu, Sipei Yi, Enze Xie, Yu Li, Cheng Zhuo

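Affiliations Zhejiang University; NVIDIA

AI summary To address the high inference cost of interactive video world models, proposes Light Interaction, a training-free acceleration framework that achieves up to 2.59x speedup through adaptive context management, denoising cache acceleration, and 3D block sparse attention.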

Comments 13 pages, 6 figures, 3 tables. Project page: https://2843721358l-del.github.io/Light-Interaction-Project/

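Details
AI abstract (translated from Chinese)

Interactive video world models generate video chunk by chunk according to user-controlled camera motion, supporting applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long interactive trajectories is prohibitively expensive because of growing context memory, quadratic attention complexity, and repeated denoising steps. We propose Light Interaction, a training-free inference acceleration framework for interactive video world models. Our key insight is that interaction naturally supports trajectory-dependent adaptive computation: retrieved spatial memory can be discarded while exploring new regions, temporal context can be adjusted according to local latent dynamics, and model outputs from early steps can be reused when the camera revisits familiar regions. Building on this insight, Light Interaction combines adaptive context management, denoising cache acceleration, and hardware-software co-designed 3D block sparse attention (with fused Triton kernels). Evaluations on HY-WorldPlay and Matrix-Game-3.0 show that Light Interaction achieves up to 2.59x speedup without model retraining while maintaining competitive visual quality.

English abstract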

Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long interactive trajectories is prohibitively expensive due to growing context memory, quadratic attention complexity, and repeated denoising steps. We present Light Interaction, a training-free inference acceleration framework for interactive video world models. Our key insight is that interaction naturally enables trajectory-dependent adaptive computation: retrieved spatial memory can be discarded during novel exploration, temporal context can be adjusted according to local latent dynamics, and early-step model outputs can be reused when the camera revisits familiar regions. Based on this insight, Light Interaction combines adaptive context management, denoising cache acceleration, and hardware-software co-designed 3D block sparse attention with fused Triton kernels. Evaluated on HY-WorldPlay and Matrix-Game-3.0, Light Interaction achieves up to 2.59x speedup without model retraining while maintaining competitive visual quality.

2605.30456 2026-06-19 cs.LG math.OC Version update

DisjunctiveNet: Neural Symbolic Learning via Differentiable Convexified Optimization Layers

DisjunctiveNet: Neural Symbolic Learning via Differentiable Convexified Optimization Layers

Shraman Pal, Can Li

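Affiliations Davidson School of Chemical Engineering, Purdue University, West Lafayette, USA

AI summary For data-sparse settings rich in domain knowledge, proposes the DisjunctiveNet framework, which embeds disjunctive constraints into neural networks through differentiable convex optimization layers, achieving hard constraint satisfaction and strong predictive performance.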

Comments ICML 2026

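Details
AI abstract (translated from Chinese)

Many learning tasks in science and engineering are characterized by sparse datasets, which limits the effectiveness of purely data-driven methods. At the same time, these problems usually come with rich domain knowledge derived from physical laws, operational requirements, and expert heuristics. This knowledge is often expressed as rules involving logical propositions and linear inequalities. Existing neuro-symbolic methods typically enforce these rules approximately through soft penalties, assume input-independent rules when designing specialized architectures, or rely on non-differentiable post-processing at inference time to achieve hard constraint satisfaction. Although recent advances in differentiable optimization layers make end-to-end feasibility enforcement possible in neural networks, extending these methods to logical or mixed-integer rules remains challenging because of inherent nonconvexity. In this work, we propose a unified end-to-end framework for enforcing hard, input-dependent mixed-integer linear constraints in neural networks. Our approach represents rules as disjunctive constraints and applies hierarchical convex relaxations to obtain convex hull formulations. These relaxations yield tractable linear constraints that can be embedded as differentiable optimization layers while achieving exact rule satisfaction. We demonstrate the effectiveness of the proposed framework on real-world datasets, achieving perfect rule satisfaction and strong predictive performance.

English abstract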

Many learning tasks in science and engineering are characterized by sparse datasets, which limits the effectiveness of purely data-driven approaches. At the same time, these problems are often accompanied by rich domain knowledge derived from physical laws, operational requirements, and expert heuristics. Such knowledge is frequently expressed as rules involving logical propositions and linear inequalities. Existing neuro-symbolic methods typically enforce these rules approximately through soft penalties, assume input-independent rules when designing specialized architectures, or rely on non-differentiable post-processing at inference time to achieve hard constraint satisfaction. While recent advances in differentiable optimization layers enable end-to-end feasibility enforcement within neural networks, extending these approaches to logical or mixed-integer rules remains challenging due to inherent nonconvexity. In this work, we propose a unified end-to-end framework for enforcing hard, input-dependent mixed integer linear constraints within neural networks. Our approach represents rules as disjunctive constraints and applies hierarchical convex relaxations to obtain convex hull formulations. These relaxations yield tractable linear constraints that can be embedded as differentiable optimization layers while enabling exact rule satisfaction. We demonstrate the effectiveness of the proposed framework on real-world datasets, achieving perfect rule satisfaction and strong predictive performance.
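
For readers unfamiliar with convex hull formulations of disjunctions, the textbook construction for a single disjunction (due to Balas) looks as follows. It assumes each disjunct is a nonempty, bounded polyhedron; the symbols K, A_k, b_k, x^k, and lambda_k are generic notation chosen here, and the paper's hierarchical relaxation and its input-dependent constraint data may differ from this sketch.

```latex
% Assumption: the rule is a disjunction over nonempty, bounded polyhedra
% P_k = { x : A_k x <= b_k }, k in K. Notation is generic, not the paper's.
x \in \bigcup_{k \in K} \{\, x : A_k x \le b_k \,\}
\quad\leadsto\quad
\begin{aligned}
x &= \sum_{k \in K} x^k, \\
A_k x^k &\le b_k \lambda_k, && k \in K, \\
\sum_{k \in K} \lambda_k &= 1, \qquad \lambda_k \ge 0, && k \in K.
\end{aligned}
```

With binary lambda_k the system on the right encodes the disjunction exactly. Relaxing lambda_k to the interval [0, 1] leaves only linear constraints whose projection onto x is the convex hull of the union, which is the kind of tractable linear system an optimization layer can carry.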

2605.28654 2026-06-19 cs.RO cs.SY eess.SY math.OC Version update

Integrated Exploration-Aware UAV Route Optimization and Path Planning

Integrated Exploration-Aware UAV Route Optimization and Path Planning

Jimin Choi, Grant Stagg, Cameron K. Peterson, Max Z. Li

Affiliations Department of Aerospace Engineering, University of Michigan; Department of Electrical Engineering, Brigham Young University; Department of Aerospace Engineering, Department of Civil and Environmental Engineering, and Department of Industrial and Operations Engineering, University of Michigan

AI summary Proposes an integrated exploration-aware UAV route optimization and path planning framework that, through risk maps, uncertain region-of-interest modeling, B-spline trajectory optimization, and online replanning, balances visits to reported sites against exploration for new information in disaster monitoring, improving average KL reduction by 15.9%.

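Details
AI abstract (translated from Chinese)

UAVs are increasingly used for exploration-driven monitoring in hazardous environments (such as disaster zones, contaminated sites, wildfire areas, and damaged infrastructure), where limited flight endurance must be divided between visiting reported locations and gathering new information. In these scenarios, prior information about hazards is usually incomplete, spatially imprecise, and liable to change during execution. For example, an initial report may identify a region where a hazard probably exists, but the actual hazard may have moved, been only partially observed, or gone entirely unreported. We propose an integrated exploration-aware UAV route optimization and path planning framework for hazard monitoring under uncertain and evolving prior information. The environment is represented as a spatial risk map in which each location has an associated belief about hazardous conditions. Reported hazards are modeled as uncertain regions of interest (ROIs) rather than confirmed target locations, requiring the UAV to inspect reported areas while using its limited flight endurance to explore information-rich regions. The proposed method solves a vehicle routing problem over the reported ROIs, augments the route with auxiliary pseudo-nodes to improve spatial coverage, allocates the remaining flight-distance budget across route segments, and optimizes dynamically feasible B-spline trajectories for local exploration. During execution, UAV measurements update the grid-based belief map, and the remaining trajectory is replanned when new information and the remaining budget justify an adjustment. Across 48 scenario configurations, online replanning improves average KL reduction by 15.9% over the offline optimized planner and by 48.6% over straight-line traversal.

English abstract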

Uncrewed aerial vehicles (UAVs) are increasingly used for exploration-driven monitoring in hazardous environments such as disaster zones, contaminated sites, wildfire areas, and damaged infrastructure, where limited flight endurance must be allocated between visiting reported locations and gathering new information. In these settings, prior information regarding hazards is often incomplete, spatially imprecise, and subject to change during execution. For example, initial reports may identify a region where a hazard is likely to exist, but the actual hazard may be displaced, partially observed, or entirely unreported. We present an integrated exploration-aware UAV route optimization and path planning framework for hazard monitoring under uncertain and evolving prior information. The environment is represented as a spatial risk map, where each location has an associated belief of hazardous conditions. Reported hazards are modeled as uncertain regions of interest (ROIs) rather than confirmed target locations, requiring the UAV to inspect reported areas while also using its limited flight endurance to explore informative regions. The proposed method solves a vehicle routing problem over reported ROIs, augments the route with auxiliary pseudo-nodes to improve spatial coverage, allocates the remaining flight distance budget across route segments, and optimizes dynamically feasible B-spline trajectories for local exploration. During execution, UAV measurements update a grid-based belief map, and the remaining trajectory is replanned when new information and the remaining budget justify adaptation. Across 48 scenario configurations, online replanning improves average KL reduction by 15.9% over the offline optimized planner and 48.6% over straight-line traversal.
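
The measurement step, in which UAV observations update a grid-based belief map, can be pictured as a per-cell Bayes update. The sketch below assumes a binary hazard state per cell and a sensor with fixed detection and false-alarm probabilities; the function names and the values 0.9 and 0.1 are illustrative assumptions, since the abstract does not specify the sensor model.

```python
def update_cell(prior, detected, p_detect=0.9, p_false=0.1):
    """Posterior P(hazard) for one cell after one binary measurement.

    p_detect = P(detection | hazard), p_false = P(detection | no hazard);
    both are assumed values, not taken from the paper.
    """
    like_hazard = p_detect if detected else 1.0 - p_detect
    like_clear = p_false if detected else 1.0 - p_false
    evidence = like_hazard * prior + like_clear * (1.0 - prior)
    return like_hazard * prior / evidence


def update_belief(belief, measurements):
    """Return a new grid with observed cells updated; unobserved cells keep their prior.

    belief: list of rows of P(hazard); measurements: {(row, col): detected}.
    """
    updated = [row[:] for row in belief]
    for (r, c), detected in measurements.items():
        updated[r][c] = update_cell(belief[r][c], detected)
    return updated
```

With a uniform prior of 0.5, a detection moves a cell to 0.9 and a non-detection moves it to 0.1 under these assumed rates. A replanner could then compare the updated map with the prior to decide whether the remaining budget justifies a new trajectory.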

2605.26891 2026-06-19 cs.CL Version update

Telenor Nordics Customer Service self-help corpus

Telenor Nordics Customer Service Self-Help Corpus

Mike Riess

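Affiliations Research and Innovation, Telenor Group

AI summary Builds a multilingual customer service self-help corpus of 1,122 documents in Finnish, Danish, Norwegian, and Swedish to support Nordic NLP and information retrieval research.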

Comments 8 pages, 2 figures, 5 tables. Submitted to Nordic Machine Intelligence. Dataset: https://zenodo.org/records/19493152

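Details
AI abstract (translated from Chinese)

This paper presents a multilingual customer service self-help corpus containing 1,122 manually validated documents in Finnish, Danish, Norwegian, and Swedish, with a total word count of over one million. The documents come from the public self-help pages of four Nordic telecommunications operators and were then filtered for personally identifiable information and relevance through a pipeline combining LLM and human annotation. Domain-specific datasets for Nordic languages remain scarce, especially in customer service, a domain of growing importance for retrieval-augmented generation, cross-lingual transfer learning, and emerging agent-based service architectures. Analysis of the corpus shows substantial variation in document length and structure across operators, reflecting different editorial strategies, as well as broad topical coverage spanning network hardware, mobile services, TV and streaming, billing, and account management. The dataset is publicly available under a CC-BY-NC-SA-4.0 license at https://zenodo.org/records/19493152 and is intended to support reproducible research in Nordic NLP and information retrieval.

English abstract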

This paper presents a multilingual customer service self-help corpus comprising 1,122 manually validated documents in Finnish, Danish, Norwegian, and Swedish, totaling 274,599 words and 1,884,833 characters. The documents have been sourced from the public self-help pages of four Nordic telecommunications operators and subsequently filtered for person-identifiable information and relevance through a combined LLM and human annotation pipeline. Domain-specific datasets for Nordic languages remain scarce, particularly in customer service: a domain of growing importance for retrieval-augmented generation, cross-lingual transfer learning, and emerging agent-based service architectures. An analysis of the corpus reveals substantial variation in document length and structure across operators, reflecting distinct editorial strategies, as well as broad topical coverage spanning network hardware, mobile services, TV and streaming, billing, and account management. The dataset is publicly available under a CC-BY-NC-SA-4.0 license at https://zenodo.org/records/20732652, intended to support reproducible research in Nordic NLP and information retrieval.

2605.30089 2026-06-19 cs.LG Version update

Distributionally Robust Set Representation Learning Under Inference-Time Element Corruption

Distributionally Robust Set Representation Learning Under Inference-Time Element Corruption

Yankai Chen, Hanrong Zhang, Bowei He, Philip S. Yu, Xue Liu

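Affiliations McGill University; University of Illinois Chicago

AI summary To address inference-time element corruption, proposes SW-DRSO, a distributionally robust optimization framework that approximates the worst-case loss with a barycentric adversary; robustness and performance are validated on four tasks.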

Comments Accepted by ICML'26

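Details
AI abstract (translated from Chinese)

Standard set representation learning methods usually perform well on carefully curated data but often overlook the challenge of inference-time element corruption. This refers to cases where a deployed model encounters element-level degradations (such as outliers or missing components) that can distort the set representation and degrade performance. We propose SW-DRSO, a distributionally robust optimization framework designed specifically for sets. Rather than only minimizing the loss on observed training data, SW-DRSO optimizes a tractable surrogate of the worst-case expected loss over a family of plausible inference-time variations. We introduce a barycentric adversary that approximates the intractable search over corrupted sets through a differentiable training-time optimization over simplex weights. Extensive experiments on four tasks show that SW-DRSO effectively improves robustness to corruption while maintaining high overall performance.

English abstract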

Standard Set Representation Learning methods typically excel on curated data but often overlook the challenge of inference-time element corruption. This refers to scenarios where deployed models encounter element-level degradations, such as outliers or missing components, that may distort set representation and degrade performance. We propose SW-DRSO, a distributionally robust optimization framework tailored for sets. Rather than minimizing loss solely on observed training data, SW-DRSO optimizes a tractable surrogate of the worst-case expected loss over a family of plausible inference-time variations. We introduce a barycentric adversary that approximates the intractable search over corrupted sets by a differentiable training-time optimization over simplex weights. Extensive experiments across four tasks demonstrate that SW-DRSO effectively enhances robustness against corruption while maintaining high overall performance.

2605.27864 2026-06-19 cs.AI Version update

FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research

FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research

Di Zhu, Lei Nico Zheng, Zihan Chen

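Affiliations Stevens Institute of Technology; UMass Boston

AI summary Proposes the FundaPod platform, which supports human investment managers in making transparent, verifiable fundamental investment decisions through independent multi-persona research, knowledge-graph memory, and post hoc adjudication.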

Comments 32 pages; 12 figures

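Details
AI abstract (translated from Chinese)

Large language models (LLMs) are increasingly applied in finance, but most existing work emphasizes trading signals or prediction-centered financial NLP tasks. In contrast, institutional fundamental research requires human analysts or AI agents to gather evidence, identify business drivers, compare competing viewpoints, and produce investment memos. Its broader goal is not merely to predict outcomes but to produce transparent, reusable, and verifiable investment plans while advancing the cumulative development of investment knowledge. We present FundaPod, a multi-persona agent platform for AI-assisted fundamental investment research. We argue that fundamental research is a human-centric decision-support task that differs in nature from trading-signal generation and is therefore better suited to an independence-preserving architecture. In FundaPod, AI agents with different personas (such as a value investor or a macro strategist) conduct research independently under a shared provenance contract. Their disagreements are then surfaced post hoc through a knowledge-graph memory system for adjudication by the human portfolio manager (PM). Drawing on design-science practice and theories of cognitive isolation and human-machine coordination, this paper proposes five design principles for human-AI hybrid systems that support fundamental research. It also describes four architectural mechanisms: a persona distillation pipeline that turns public investor materials into deployable agents; a declarative skill registry that lets the planner derive typed task graphs; an evidence-grounded model that links memo claims to verifiable sources; and a knowledge-graph "second brain" that connects tickers, memos, analysts, and themes. We demonstrate the architecture through a complete case study and a persona-based memo comparison.

English abstract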

Large language models (LLMs) are increasingly applied in finance, yet most existing work emphasizes trading signals or financial NLP tasks centered on prediction. Institutional fundamental research, by contrast, requires human analysts or AI agents to gather evidence, identify business drivers, compare competing viewpoints, and generate investment memos. Its broader goal is not merely to predict outcomes, but to produce investment plans that are transparent, reusable, and verifiable, while contributing to the cumulative development of investment knowledge. We present FundaPod, a multi-persona agent platform for AI-assisted fundamental investment research. We argue that fundamental research is a human-centric decision-support task that is qualitatively distinct from trading-signal generation, and is therefore better served by an independence-preserving architecture. In FundaPod, AI agents with different personas, such as value investors or macro strategists, conduct research independently under a shared provenance contract. Their disagreements are then surfaced post hoc for adjudication by the human portfolio manager (PM) through a knowledge-graph memory system. This paper contributes five design principles for human-AI hybrid systems supporting fundamental research, grounded in design-science practice and theories of cognitive isolation and human-machine coordination. It also describes four architectural mechanisms: a persona distillation pipeline that turns public investor materials into deployable agents; a declarative skill registry that lets the planner derive typed task graphs; a grounded evidence model that links memo claims to verifiable sources; and a knowledge-graph "second brain" that connects tickers, memos, analysts, and themes. We demonstrate the architecture through a complete case study and a persona-based memo comparison.

2605.29483 2026-06-19 cs.AI Version update

VitalAgent: A Tool-Augmented Agent for Reactive and Proactive Physiological Monitoring over Wearable Health Data

VitalAgent: A Tool-Augmented Agent for Reactive and Proactive Physiological Monitoring over Wearable Health Data

Di Zhu, Yu Yvonne Wu, Hong Jia, Aaqib Saeed, Vassilis Kostakos, Ting Dang

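Affiliations The University of Melbourne, Australia; Dartmouth College, US; University of Auckland, New Zealand; Eindhoven University of Technology, Netherlands

AI summary Proposes the VitalAgent framework, which uses tool-augmented reasoning and a longitudinal physiological memory to provide reactive question answering and proactive monitoring over ECG/PPG signals, improving on baselines by more than 30% on the VitalBench benchmark.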

Comments Minor revisions; results unchanged

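Details
AI abstract (translated from Chinese)

Wearable devices can continuously monitor physiological signals such as ECG and PPG, but existing mobile health systems are mostly limited to task-specific prediction pipelines or reactive question answering over static summaries. They lack the ability to support temporal reasoning, persistent physiological context, and proactive monitoring of long-term signal streams. We propose VitalAgent, a tool-augmented agent framework for ECG/PPG-based mobile health that supports both reactive question answering and proactive monitoring. VitalAgent is built on a longitudinal physiological memory and a tool-augmented reasoning interface, enabling dynamic computation over raw signals. We further introduce VitalBench, a longitudinal physiological monitoring benchmark dataset containing 1,862 QA pairs for reactive question answering and 90.2 hours of continuous ECG/PPG recordings for proactive monitoring, covering cardiac, physical activity, and stress-related tasks. Experiments show that VitalAgent achieves an improvement of over 30% over prompt-based and ReAct baselines in reactive evaluation and supports proactive alert monitoring of long-term physiological signals, highlighting the importance of dynamic tool use and long-term physiological monitoring.

English abstract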

Wearable devices enable continuous monitoring of physiological signals such as ECG and PPG, but existing mHealth systems are largely limited to task-specific prediction pipelines or reactive question answering over static summaries. They lack the ability to support temporal reasoning, persistent physiological context, and proactive monitoring over long-term signal streams. We propose VitalAgent, a tool-augmented agentic framework for ECG/PPG-based mHealth that supports both reactive question answering and proactive monitoring. VitalAgent is built on a longitudinal physiological memory and a tool-augmented reasoning interface that enables dynamic computation over raw signals. We further introduce VitalBench, a longitudinal physiological monitoring benchmark dataset comprising 1,862 QA pairs for reactive question answering and 90.2 hours of continuous ECG/PPG recordings for proactive monitoring, covering cardiac, physical activity, and stress-related tasks. Experiments demonstrate that VitalAgent achieves over 25% improvement over prompt-based and ReAct baselines in reactive evaluation and supports proactive alert monitoring over long-term physiological signals, highlighting the importance of dynamic tool use and long-term physiological monitoring.

2605.13438 2026-06-19 cs.AI cs.CL Version update

CogniFold: Always-On Proactive Memory via Cognitive Folding

CogniFold: Always-On Proactive Memory via Cognitive Folding

Suli Wang, Yiqun Duan, Yu Deng, Rundong Zhao, Dai Shi, Minghua Deng, Chen Chen, Xinliang Zhou

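AI summary Proposes CogniFold, a brain-inspired proactive memory system that extends Complementary Learning Systems to three layers (hippocampus, neocortex, and a prefrontal intent layer) and uses graph-topology self-organization so that cognitive structure emerges continuously from event streams. It performs well on both cognitive evaluation and conventional memory benchmarks.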

Comments Code is available at https://github.com/OpenNorve/CogniFold

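Details
AI abstract (translated from Chinese)

Existing agent memory is still mainly reactive and retrieval-based, lacking the ability to autonomously organize experience into persistent cognitive structure. To move toward truly autonomous agents, we introduce CogniFold, a brain-inspired "always-on" agent memory designed for the next generation of proactive assistants. CogniFold continuously folds fragmented event streams into self-emerging cognitive structures, progressively bootstrapping higher-level cognition from incoming events and accumulated knowledge. We ground this by extending Complementary Learning Systems (CLS) theory from two layers (hippocampus, neocortex) to three, adding a prefrontal intent layer. Emulating the prefrontal cortex as the center of intentional control and decision-making, CogniFold achieves this through graph-topology self-organization: cognitive structures assemble proactively under the event stream, merge when semantically similar, decay when stale, relink through associative recall, and surface intents when concept-cluster density exceeds a threshold. We evaluate structure formation with CogEval-Bench, showing that CogniFold uniquely produces memory structures consistent with cognitive expectations and concept emergence. In addition, across 7 broad-coverage benchmarks spanning five cognitive domains, we verify that CogniFold also performs robustly on conventional memory benchmarks.

English abstract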

Existing agent memory remains predominantly reactive and retrieval-based, lacking the capacity to autonomously organize experience into persistent cognitive structure. Toward genuinely autonomous agents, we introduce CogniFold, a brain-inspired "always-on" agent memory designed for the next generation of proactive assistants. CogniFold continuously folds fragmented event streams into self-emerging cognitive structures, bootstrapping progressively higher-level cognition from incoming events and accumulated knowledge. We ground this by extending Complementary Learning Systems (CLS) theory from two layers (hippocampus, neocortex) to three, adding a prefrontal intent layer. Emulating the prefrontal cortex as the locus of intentional control and decision-making, CogniFold achieves this through graph-topology self-organization: cognitive structures proactively assemble under the stream, merge when semantically similar, decay when stale, relink through associative recall, and surface intents when concept-cluster density crosses a threshold. We evaluate structural formation using CogEval-Bench, demonstrating that CogniFold uniquely produces memory structures that match cognitive expectations and concept emergence. Furthermore, across eight downstream benchmarks -- two probing long-term conversational memory (LoCoMo, LongMemEval) and six spanning other cognitive domains -- we validate that CogniFold simultaneously performs robustly on conventional memory tasks. Our code is available at https://github.com/OpenNorve/CogniFold.

2605.25160 2026-06-19 cs.AI Version update

ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis

SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking

Guohong Liu, Jialei Ye, Pengzhi Gao, Wei Liu, Jian Luan, Yunxin Liu, Yuanchun Li

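Affiliations Institute for AI Industry Research (AIR), Tsinghua University; University of Electronic Science and Technology of China; MiLM Plus, Xiaomi Inc.

AI summary To close the gap between existing mobile GUI agent benchmarks and real-world applications, proposes SimuWoB, a fully synthesized benchmark whose robust virtual environment generation framework synthesizes high-fidelity tasks and environments and automatically provides valid rewards, enabling efficient, reproducible evaluation of complex long-horizon interaction.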

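Details
AI abstract (translated from Chinese)

Mobile GUI agents driven by large language models are developing rapidly, creating an urgent need for realistic and comprehensive evaluation. Existing benchmarks prioritize reproducibility but are usually limited to open-source apps or file-operation tasks, because rewards are hard to construct in real applications, leaving a gap between benchmark settings and real-world use. In addition, most benchmarks focus on basic grounding and navigation, with limited coverage of complex long-horizon interaction. To address these limitations, we introduce SimuWoB, a fully synthesized mobile GUI agent benchmark comprising 120 challenging tasks across different types and difficulty levels. We build a robust virtual environment generation framework that synthesizes high-fidelity tasks and environments and automatically provides a valid reward for each task. Each environment is deployed as a backend-free webpage accessible via URL, enabling efficient and reproducible evaluation. We run comprehensive experiments on several state-of-the-art mobile GUI agents. The average success rate is only 27.92%, falling to 17.82% on long-horizon tasks, revealing significant weaknesses of current agents in complex scenarios. A comparison with evaluation results on real-world sample tasks shows that agent evaluation based on our synthetic environments generalizes well. We further provide diagnostic insights along key capability dimensions and discuss implications for future mobile GUI agent development.

English abstract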

GUI agents powered by large language models are advancing rapidly, creating an urgent need for evaluation and training in realistic environments. However, doing so directly in real-world environments introduces challenges that cannot be overlooked. Real-world environments are complex and uncontrollable, making it difficult to construct verifiable rewards and to save or reset states. Existing works prioritize reproducibility but are often limited to open-source apps or file-operation tasks for reliable reward building, leaving a persistent gap from real-world usage. Furthermore, relying on virtual machines or Docker images demands substantial resources and suffers from slow response speeds, which limits efficiency. We present ScaleWoB, a framework that produces high-fidelity synthesized interactive environments for GUI agents across platforms with verifiable rewards. These environments behave as backend-free webpages accessible via URL, requiring near-zero setup and low resource cost, making the approach suitable for both large-scale evaluation and downstream agent training. We support multiple GUI platforms, including mobile, desktop, and automotive/in-vehicle interfaces, based on the same pipeline, covering 100+ environments and 1000+ verifiable tasks. Among them, 120 challenging tasks across 63 simulated mobile applications are released as a fully synthesized mobile GUI agent benchmark. Experimental results on five state-of-the-art mobile GUI agents reveal substantial headroom: the average success rate is only 27.92%, dropping to 17.82% on the long-horizon subset, while humans reach 92.08%. A comparison against real-world sample tasks shows that assessments made in our synthetic environments generalize to real apps. The project website is at https://scalewob.github.io.

2605.25005 2026-06-19 cs.RO 版本更新

Stiffness Optimization for Concentrated Bending in Magnetically Actuated Catheters: Maintaining Steerability under Gradient Stiffness

磁驱动导管集中弯曲的刚度优化:在梯度刚度下保持可操控性

Jiewen Tan, Junnan Xue, Shing Shin Cheng, Shuang Song, Erli Lyu, Jiaole Wang

发表机构 * Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)) The Chinese University of Hong Kong(香港中文大学) Macao Polytechnic University(澳门理工学院)

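AI summary: To address the trade-off between pushability and proximally concentrated bending in magnetically actuated soft catheters, the paper proposes a stiffness-optimized multi-segment magnetically actuated catheter (SO-MAC). A decoupled steering-advancement mechanism and a gradient-stiffness architecture give stable proximal pivot bending during advancement, while the distal section passively self-straightens to transmit propulsion.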

详情
AI中文摘要

对于磁驱动软导管,实现高效的推送性(推进力传递)和近端集中弯曲以保持可操控性具有挑战性:较高的轴向/弯曲刚度可改善力传递但降低可操控性,而较低的刚度可实现大的近端集中弯曲,但在压缩推送载荷下增加扭结/屈曲风险。为了解决这一权衡,我们提出了一种刚度优化的多段磁驱动导管(SO-MAC),它集成了解耦的转向-推进机构与梯度刚度架构。SO-MAC在推进过程中将弯曲集中在稳定的近端枢轴周围,而远端部分通过优化的刚度分布和弹簧骨架的弹性恢复抵抗摩擦引起的扭结/屈曲,被动自直以传递推进力。在$0{-}180^{\circ}$的组合转向和推进过程中,枢轴保持稳定,远端尖端几乎直线地向目标方向推进。直径为1.5 mm的SO-MAC在其10 mm尖端处实现了高达$180^{\circ}$的转向,弯曲半径为3 mm,平均形状误差为$1.39 \pm 0.56$ mm,转向枢轴误差为$0.35 \pm 0.10$ mm。在支气管模型中的视觉反馈控制进一步验证了通过高度弯曲的分叉路径的鲁棒导航。

英文摘要

Achieving both efficient pushability (propulsion transmission) and proximally concentrated bending for steerability is challenging for magnetically actuated soft catheters: higher axial/bending stiffness improves force transmission but reduces steerability, whereas lower stiffness enables large, proximally concentrated bending yet increases kinking/buckling risk under compressive push loads. To address this trade-off, we propose a stiffness-optimized multi-segment magnetically actuated catheter (SO-MAC) that integrates a decoupled steering-advancement mechanism with a gradient-stiffness architecture. The SO-MAC concentrates bending about a stable proximal pivot during advancement while the distal section passively self-straightens to transmit propulsion, aided by the optimized stiffness distribution and elastic recovery of the spring backbone against friction-induced kinking/buckling. Over $0{-}180^{\circ}$ combined steering and advancement, the pivot remained stable and the distal tip advanced near-straight toward the target direction. A 1.5 mm-diameter SO-MAC achieved up to $180^{\circ}$ steering with a 3 mm bending radius at its 10 mm tip, with an average shape error of $1.39 \pm 0.56$ mm and a steering-pivot error of $0.35 \pm 0.10$ mm. Visual feedback control in a bronchial phantom further confirmed robust navigation through highly curved, bifurcating paths.

2605.23733 2026-06-19 cs.RO cs.AI 版本更新

Any2Any: Efficient Cross-Embodiment Transfer for Humanoid Whole-Body Tracking

Any2Any: 高效跨本体迁移用于人形机器人全身跟踪

Ming Yang, Tao Yu, Feng Li, Hua Chen

发表机构 * LimX Dynamics(LimX动力学)

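AI summary: Proposes the Any2Any paradigm, which uses kinematic alignment and dynamics fine-tuning to transfer pretrained whole-body tracking models efficiently to new humanoid embodiments, reaching competitive tracking performance with only a small amount of data and compute.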

Comments Project Page: https://any2any.top/

详情
AI中文摘要

全身跟踪(WBT)模型已成为人形机器人的关键基础,使其能够高保真地模仿各种运动。从头训练此类模型需要大规模数据和计算,使得在新人形平台上快速部署成本高昂。这自然引发一个问题:预训练的WBT模型能否通过最小化适应跨本体迁移?为回答这个问题,我们提出Any2Any,一种范式,能够高效地将现有WBT专家迁移到新人形本体,仅需少量数据和计算。Any2Any首先在源和目标人形之间进行运动学对齐,对齐其输入和输出空间,使得预训练的源策略可以在目标本体上有意义地重用。然后,Any2Any通过向选定的动力学敏感模块应用轻量级参数高效微调(PEFT)组件进行动力学适应,保留有用的行为先验,同时实现对目标机器人的定向适应。在多个人形平台和预训练骨干上的大量实验表明,与从头训练相比,Any2Any显著加速收敛并降低训练成本,同时实现具有竞争力或更优的跟踪性能。值得注意的是,仅使用完整训练所需计算和数据的1%,Any2Any成功将在Unitree G1上预训练的Sonic模型迁移到LimX Oli和LimX Luna。这些结果表明,预训练的WBT专家可以跨本体高效重用,为在新机器人上部署人形全身控制提供可扩展的路径。

英文摘要

Whole-body tracking (WBT) models have become a key foundation for humanoid robots, enabling them to imitate diverse motions with high fidelity. Training such models from scratch requires large-scale data and computation, making rapid deployment on new humanoid platforms costly. This raises a natural question: Can pretrained WBT models transfer across embodiments with minimal adaptation? To answer this question, we propose Any2Any, a paradigm that efficiently transfers an existing WBT specialist to a new humanoid embodiment with only a small amount of data and compute. Any2Any first performs kinematic alignment between source and target humanoids, aligning their input and output spaces so that the pretrained source policy can be meaningfully reused on the target embodiment. Any2Any then performs dynamics adaptation by applying lightweight parameter-efficient fine-tuning (PEFT) components to selected dynamics-sensitive modules, preserving useful behavioral priors while enabling targeted adaptation to the target robot. Extensive experiments on multiple humanoid platforms and pretrained backbones show that Any2Any substantially accelerates convergence and reduces training cost compared with training from scratch, while achieving competitive or superior tracking performance. Notably, using only 1% of the compute and data required for full training, Any2Any successfully transfers Sonic models pre-trained on Unitree G1 to LimX Oli and LimX Luna. These results suggest that pretrained WBT specialists can be efficiently reused across embodiments, providing a scalable path toward deploying humanoid whole-body control on new robots. More results and videos are available on our project page: https://any2any.top/.

2605.22748 2026-06-19 cs.RO cs.AI cs.LG cs.MA 版本更新

Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning

通过多智能体强化学习实现超人类安全且敏捷的赛车

Ismail Geles, Leonard Bauersfeld, Markus Wulfmeier, Davide Scaramuzza

Affiliations: Robotics and Perception Group, University of Zurich; Google DeepMind; Nomagic

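AI summary: The paper uses multi-agent reinforcement learning to achieve safe and agile high-speed quadrotor racing, shows that multi-agent interaction is key to safety in real-world interaction, and reports agents that outperform a human pilot in high-speed races while reducing collision rates.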

Comments 12 pages (+4 supplementary). Website: https://rpg.ifi.uzh.ch/marl

详情
AI中文摘要

自主系统在孤立或模拟环境中已实现超人类性能,但在共享、动态的真实世界空间中仍显得脆弱。这种失败源于物理应用中主导的单智能体范式,其中其他参与者被忽略或视为环境噪声,阻碍了有效协调。本文证明多智能体强化学习为真实世界交互提供了必要的安全性基础。使用高速四旋翼赛车作为高风险测试平台,训练智能体在复杂空气动力学相互作用和战略机动中导航,具有可变数量的赛车。通过联赛基于的自我对战,智能体进化出复杂的前瞻性行为,包括主动避障、超车和处理多智能体物理交互,包括空气动力学下洗。我们的智能体在超过22米/秒的速度下多玩家赛车中超越了冠军级人类飞行员,同时与最先进的单智能体基线相比,碰撞率减少了50%。关键的是,使用多样化的人工智能体进行训练能够实现零样本泛化到更安全的人类交互。这些结果表明,实现稳健的机器人共存的路径不在于孤立的安全约束,而在于多智能体交互的严格要求。多媒体材料可在:https://rpg.ifi.uzh.ch/marl

英文摘要

Autonomous systems have achieved superhuman performance in isolation or simulation, yet they remain brittle in shared, dynamic real-world spaces. This failure stems from the dominant single-agent paradigm for physical applications, where other actors are ignored or treated as environmental noise, preventing effective coordination. Here we show that multi-agent reinforcement learning provides the essential safety scaffolding required for real-world interaction. Using high-speed quadrotor racing as a high-stakes testbed, we train agents to navigate complex aerodynamic interactions and strategic maneuvering with a variable number of racers. Through league-based self-play, agents evolve sophisticated anticipatory behaviors, including proactive collision avoidance, overtaking, and handling multi-agent physical interactions, including aerodynamic downwash. Our agents outperform a champion-level human pilot in multi-player races at speeds exceeding 22 m/s, while simultaneously reducing collision rates by 50 % compared to state-of-the-art single-agent baselines. Crucially, training with diverse artificial agents enables zero-shot generalization to safer human interaction. These results suggest that the path to robust robotic co-existence lies not in isolated safety constraints, but in the rigorous demands of multi-agent interaction. Multimedia materials are available at: https://rpg.ifi.uzh.ch/marl

2603.19895 2026-06-19 eess.SY cs.SY math.CV math.DG math.DS 版本更新

Complex Frequency as Generalized Eigenvalue

复频率作为广义特征值

Nikolas Sofos, Federico Milano

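AI summary: The paper studies complex frequency as a generalization of eigenvalues when describing the states of linear time-invariant systems. Starting from the definition and decomposition of geometric frequency, it shows that complex frequency arises as the restriction to the two-dimensional Euclidean plane, proves that complex frequencies coincide with eigenvalues for linear systems, and notes that this equivalence does not generally hold for nonlinear systems.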

详情
AI中文摘要

本文证明了复频率的概念,最初用于描述复值信号的动力学,当应用于线性时不变(LTI)系统的状态时,构成了特征值的广义形式。从几何频率的定义出发,该定义为电路中的频率提供了几何解释,并自然分解为对称和反称成分,分别对应于幅度变化和旋转运动。我们展示复频率作为其在二维欧几里得平面上的限制。对于LTI系统,证明了通过非等距变换计算的系统状态的复频率与原系统的特征值一致。该等价性在任何阶数的可对角化系统中均成立。本文提供了一个统一的几何解释,将经典线性系统理论与曲线微分几何联系起来。同时指出,这种等价性一般不适用于非线性系统。另一方面,系统的几何频率总能被定义,从而为系统流提供几何解释。基于线性和非线性电路的多种示例展示了所提出的框架。

英文摘要

This paper shows that the concept of complex frequency, originally introduced to characterize the dynamics of signals with complex values, constitutes a generalization of eigenvalues when applied to the states of linear time-invariant (LTI) systems. Starting from the definition of geometric frequency, which provides a geometrical interpretation of frequency in electric circuits that admits a natural decomposition into symmetric and antisymmetric components associated with amplitude variation and rotational motion, respectively, we show that complex frequency arises as its restriction to the two-dimensional Euclidean plane. For LTI systems, it is shown that the complex frequencies computed from the system's states subject to a non-isometric transformation, coincide with the original system's eigenvalues. This equivalence is demonstrated for diagonalizable systems of any order. The paper provides a unified geometric interpretation of eigenvalues, bridging classical linear system theory with differential geometry of curves. The paper also highlights that this equivalence does not generally hold for nonlinear systems. On the other hand, the geometric frequency of the system can always be defined, providing a geometrical interpretation of the system flow. A variety of examples based on linear and nonlinear circuits illustrate the proposed framework.
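
The LTI claim can be checked numerically on a small case. The sketch below is not taken from the paper: it assumes the complex frequency of a complex-valued signal $z(t)$ is $\dot z / z$, takes the non-isometric transformation to be the change to modal coordinates $z = V^{-1}x$ (with $V$ the eigenvector matrix), and uses an arbitrary 2x2 matrix and state chosen for illustration. The function names are hypothetical.

```python
import cmath

def eigenvalues_2x2(a, b, c, d):
    """Eigenvalues of [[a, b], [c, d]] from the characteristic polynomial."""
    tr, det = a + d, a * d - b * c
    disc = cmath.sqrt(tr * tr - 4 * det)
    return (tr + disc) / 2, (tr - disc) / 2

def modal_complex_frequencies(a, b, c, d, x1, x2):
    """Return (eta_1, eta_2), where eta_i = z_i_dot / z_i and z = V^{-1} x.

    Assumes b != 0, distinct eigenvalues, and a state with nonzero
    modal components (illustrative restrictions, not the paper's).
    """
    lam1, lam2 = eigenvalues_2x2(a, b, c, d)
    # Eigenvector columns v_i = (b, lam_i - a), so det(V) = b * (lam2 - lam1).
    det_v = b * (lam2 - lam1)

    def to_modal(u1, u2):
        # Apply V^{-1} to the vector (u1, u2).
        z1 = ((lam2 - a) * u1 - b * u2) / det_v
        z2 = (-(lam1 - a) * u1 + b * u2) / det_v
        return z1, z2

    z1, z2 = to_modal(x1, x2)
    # State derivative of the LTI system: x_dot = A x.
    dx1, dx2 = a * x1 + b * x2, c * x1 + d * x2
    dz1, dz2 = to_modal(dx1, dx2)
    return dz1 / z1, dz2 / z2

# Illustrative damped oscillator and an arbitrary state.
A = (0.0, 1.0, -2.0, -0.5)
lam = eigenvalues_2x2(*A)
eta = modal_complex_frequencies(*A, 1.0, 0.3)
print(all(abs(e - l) < 1e-9 for e, l in zip(eta, lam)))
```

In modal coordinates $\dot z = \Lambda z$, so each ratio $\dot z_i / z_i$ equals $\lambda_i$ regardless of the state chosen, which is the equivalence the abstract describes for diagonalizable systems.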

2605.16865 2026-06-19 cs.CL 版本更新

MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

MixSD: 混合上下文自蒸馏用于知识注入

Jiarui Liu, Lechen Zhang, Yongjin Yang, Yinghui He, Yingheng Wang, Weihao Xuan, Zhijing Jin, Mona Diab

Affiliations: Carnegie Mellon University; Jinesis Lab, University of Toronto & Vector Institute; University of Illinois Urbana-Champaign; Princeton University; Cornell University; The University of Tokyo; RIKEN AIP; Max Planck Institute for Intelligent Systems, Tübingen, Germany; EuroSafeAI

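AI summary: Proposes MixSD, which mixes tokens drawn from the model's own conditionals so that knowledge injection stays aligned with the model's generation distribution, improving factual memorization and reasoning while preserving pretrained capabilities.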

详情
AI中文摘要

监督微调(SFT)被广泛用于将新知识注入语言模型,但通常会损害预训练能力,如推理和通用领域性能。我们认为这种遗忘是由于微调目标与模型的自回归分布不一致,迫使优化器模仿低概率token序列。为了解决这个问题,我们提出了MixSD,一种无需外部教师的简单方法,用于对齐分布的知识注入。与固定目标训练不同,MixSD通过混合基础模型自身两个条件下的token动态构建监督。所生成的监督序列保留了事实学习信号,同时更接近基础模型的分布。我们在两个合成语料库上评估了MixSD,研究事实回忆和算术功能学习,并结合已建立的开放领域事实问答和知识编辑基准。在多种模型规模和设置下,MixSD在记忆-保留权衡上优于SFT和在线自蒸馏基线,能够保留基础模型的100% held-out能力,同时保持接近完美的训练准确率,而标准SFT只能保留1%。我们进一步表明,MixSD在基础模型下生成的监督目标具有显著更低的NLL,并减少了有害的Fisher敏感参数方向运动。这些结果表明,将监督与模型的本征生成分布对齐是简单且有效的知识注入原则,可以缓解灾难性遗忘。

英文摘要

Supervised fine-tuning (SFT) is widely used to inject new knowledge into language models, but it often degrades pretrained capabilities such as reasoning and general-domain performance. We argue this forgetting arises because fine-tuning targets from humans or external systems diverge from the model's autoregressive distribution, forcing the optimizer to imitate low-probability token sequences. To address this problem, we propose MixSD, a simple external-teacher-free method for distribution-aligned knowledge injection. Instead of training on fixed targets, MixSD constructs supervision dynamically by mixing tokens from two conditionals of the base model itself: an expert conditional that observes the injected fact in context, and a naive conditional that reflects the model's original prior. The resulting supervision sequences preserve the factual learning signal while remaining substantially closer to the base model's distribution. We evaluate MixSD on two synthetic corpora that we construct to study factual recall and arithmetic function acquisition in a controlled setting, together with established benchmarks for open-domain factual question answering and knowledge editing. Across multiple model scales and settings, MixSD consistently achieves a better memorization-retention trade-off compared to SFT and on-policy self distillation baselines, retaining up to 100% of the base model's held-out capability while maintaining near-perfect training accuracy, whereas standard SFT retains as little as 1%. We further show that MixSD produces substantially lower-NLL supervision targets under the base model and reduces harmful movement along Fisher-sensitive parameter directions. These results suggest that aligning supervision with the model's native generation distribution is a simple and effective principle for knowledge injection that mitigates catastrophic forgetting.

2509.24725 2026-06-19 cs.LG cs.AI 版本更新

Q-Net: Queue Length Estimation via Kalman-based Neural Networks

Q-Net:基于卡尔曼神经网络的队列长度估计

Ting Gao, Elvin Isufi, Winnie Daamen, Erik-Sander Smits, Serge Hoogendoorn

发表机构 * University of Amsterdam(阿姆斯特丹大学) Delft University of Technology(代尔夫特理工大学)

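AI summary: Proposes Q-Net, a framework that combines Kalman filtering with neural networks to fuse data sources for queue length estimation at signalized intersections. It improves spatial transferability and real-time applicability and gives accurate queue estimates without costly sensing equipment.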

Journal ref Transportation Research Part C: Emerging Technologies, Volume 190, September 2026, Article 105809

详情
AI中文摘要

估计信号交叉口的队列长度一直是交通管理中的长期挑战。尽管有两类隐私保护的数据源:(i) 接近停止线的环形检测器提供的车辆计数汇总数据,以及 (ii) 提供路段平均速度测量的汇总浮动汽车数据 (aFCD),但如何将这些具有不同空间和时间分辨率的数据源整合用于队列长度估计仍不清楚。为此,本文提出Q-Net:一种基于状态空间形式的队列估计框架。该设计解决了队列建模中的关键挑战,如违反交通守恒假设。Q-Net遵循卡尔曼预测-更新结构,并在状态演变和测量模型中保持物理可解释性。Q-Net使用AI增强的卡尔曼滤波器从数据中学习时间变化的增益动态。该框架支持实时实现,并通过将aFCD测量分组为固定大小的局部组来提高空间转移性,使可学习参数的数量与路段长度无关。在荷兰 Rotterdam 城市主干道的评估显示,Q-Net优于基线方法,能够准确追踪队列的形成和消散,并缓解aFCD引起的延迟。通过结合数据效率、可解释性、实时适用性和空间转移性,Q-Net在无需昂贵的传感基础设施(如摄像头或雷达)的情况下实现了准确的队列长度估计。

英文摘要

Estimating queue lengths at signalized intersections is a long-standing challenge in traffic management. Partial observability of vehicle flows complicates this task despite the availability of two privacy-preserving data sources: (i) aggregated vehicle counts from loop detectors near stop lines, and (ii) aggregated floating car data (aFCD) that provide segment-wise average speed measurements. However, how to integrate these sources with differing spatial and temporal resolutions for queue length estimation is rather unclear. Addressing this question, we present Q-Net: a queue estimation framework built upon a state-space formulation. This design addresses key challenges in queue modeling, such as violations of traffic conservation assumptions. Q-Net follows the Kalman predict-update structure and maintains physical interpretability in both the state evolution and measurement models. Q-Net uses an AI-augmented Kalman filter to learn time-varying gain dynamics from data. The framework supports real-time implementation and improves spatial transferability by grouping aFCD measurements into fixed-size local groups, making the number of learnable parameters independent of section length. Evaluations on urban main roads in Rotterdam, the Netherlands, show that Q-Net outperforms baseline methods, tracks queue formation and dissipation accurately, and mitigates aFCD-induced delays. By combining data efficiency, interpretability, real-time applicability, and spatial transferability, Q-Net makes accurate queue length estimation possible without costly sensing infrastructure like cameras or radar.
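
For readers unfamiliar with the Kalman predict-update structure that Q-Net follows, here is a minimal scalar sketch. It is a textbook filter with a closed-form gain, not Q-Net itself: the random-walk-plus-input state model, the noise variances `q` and `r`, the toy numbers, and the function name are all assumptions made for illustration. According to the abstract, Q-Net learns its time-varying gain dynamics from data instead of computing them in closed form.

```python
def kalman_step(x_est, p_est, u, y, q, r):
    """One predict-update cycle of a scalar Kalman filter.

    x_est, p_est: previous state estimate and its variance
    u: known net change of the state (e.g. inflow minus outflow), assumed given
    y: noisy measurement of the state
    q, r: process and measurement noise variances (assumed known)
    """
    # Predict: propagate the estimate and grow its uncertainty.
    x_pred = x_est + u
    p_pred = p_est + q
    # Update: blend prediction and measurement using the Kalman gain.
    gain = p_pred / (p_pred + r)
    x_new = x_pred + gain * (y - x_pred)
    p_new = (1.0 - gain) * p_pred
    return x_new, p_new, gain

# Toy run: hypothetical net inflows and noisy queue measurements.
x, p = 0.0, 1.0
for u, y in [(2.0, 2.4), (1.0, 2.9), (-1.0, 2.2)]:
    x, p, k = kalman_step(x, p, u, y, q=0.1, r=0.5)
    print(round(x, 3), round(p, 3), round(k, 3))
```

The gain is the only place where prediction and measurement are weighed against each other, which is why replacing it with a learned, time-varying quantity leaves the rest of the structure interpretable.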

2605.20531 2026-06-19 cs.LO cs.LG 版本更新

Pseudo-Formalization for Automatic Proof Verification

伪形式化用于自动证明验证

Slim Barkallah, Luke Bailey, Kaiyue Wen, Mohammed Abouzaid, Tengyu Ma

发表机构 * GitHub

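AI summary: Proposes Pseudo-Formalization, a proof format that keeps the flexibility of natural language while retaining the modularity and precision of formal proofs. With the Block Verification algorithm it verifies natural-language proofs and beats existing baselines on error-finding precision and recall.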

Comments 31 pages, code available at https://github.com/Slim205/pseudo-formalization

详情
AI中文摘要

可靠的证明验证仍然是训练和评估在复杂数学推理上的人工智能系统的主要瓶颈。在像Lean这样的语言中,完全形式化的证明容易验证,因为它们是无歧义且模块化的。大多数证明,尤其是由人工智能系统编写证明,既没有这种属性,将它们翻译成形式语言在许多前沿数学领域仍然具有挑战性。我们提出了伪形式化(PF),一种证明格式,它捕捉了形式证明的模块性和精确性,同时保留了自然语言的灵活性。一个伪形式化证明被分解成自包含的模块,每个模块陈述其前提、结论和证明,用自然语言。为了验证一个常规的自然语言证明的正确性,一个LLM将其翻译成伪形式化,然后独立验证每个模块,我们称之为块验证(BV)。我们在两个涵盖竞赛和研究级数学的基准上评估PF+BV,其中它在错误发现的精度和召回率上优于LLM-as-judge基线。为了支持未来的工作,我们发布了我们的研究级证明验证基准ArxivMathGradingBench。

英文摘要

Reliable verification of proofs remains a bottleneck for training and evaluating AI systems on hard mathematical reasoning. Fully formal proofs, in languages like Lean, are easy to verify because they are unambiguous and modular. Most proofs, particularly those written by AI systems, have neither property, and translating them into formal languages remains challenging in many frontier math settings. We propose Pseudo-Formalization (PF), a proof format that captures the modularity and precision of formal proofs while retaining the flexibility of natural language. A Pseudo-Formal proof is decomposed into self-contained modules, each stating its premises, conclusion, and proof in natural language. To verify the correctness of a regular natural language proof, an LLM translates it to Pseudo-Formal and then verifies each module independently, an algorithm we call Block Verification (BV). We evaluate PF+BV on two benchmarks spanning olympiad and research-level mathematics, where it pareto-dominates LLM-as-judge baselines on error-finding precision and recall. To support future work, we release our research-level proof verification benchmark ArxivMathGradingBench.
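
The module structure and the independent per-module check described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the `Module` class, the `block_verify` function, and the stub checker are hypothetical names, and the stub stands in for the LLM verifier that the abstract describes. It is not the authors' released code.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Module:
    """One self-contained block: premises, conclusion, and a natural-language proof."""
    premises: List[str]
    conclusion: str
    proof: str

def block_verify(modules: List[Module], check: Callable[[Module], bool]) -> List[int]:
    """Check each module on its own and return the indices of modules that fail."""
    return [i for i, module in enumerate(modules) if not check(module)]

def stub_check(module: Module) -> bool:
    # Placeholder for an LLM call: here a module "passes" if it has a non-empty proof.
    return bool(module.proof.strip())

proof = [
    Module(["n is even"], "n^2 is even", "Write n = 2k, so n^2 = 2(2k^2)."),
    Module(["n^2 is even"], "n^2 + 1 is odd", ""),
]
print(block_verify(proof, stub_check))
```

Because each module states its own premises, a checker only needs the module itself as context, so failures can be localized to a specific block.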

2605.20448 2026-06-19 cs.CV cs.LG 版本更新

Do Vision-Language Models Understand 3D Scenes or Just Catalogue Objects?

视觉-语言模型是理解3D场景还是仅仅 catalogue 物体?

Animesh Maheshwari, Divyansh Sahu, Nishit Verma

发表机构 * Deccan AI(德克南人工智能)

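AI summary: Using a human-curated benchmark of 3,034 samples, the paper probes vision-language models on three components of spatial understanding: depth-ordered occlusion, optical-geometry inference, and volumetric rearrangement planning. Models do well when planning rearrangements over visible layouts but poorly on occlusion and reflection inference.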

详情
AI中文摘要

视觉-语言模型能够可靠地命名场景中的物体,但它们是否代表这些物体所处的3D布局?我们引入了一个包含3034个样本的人工整理基准,针对空间理解的三个组成部分:深度有序遮挡(通过三种独立的反事实操作化进行探测)、可见反射的光学几何推断,以及体积重新安排规划。六个前沿和开放权重的VLMs在18,204个响应上由训练注释者评分,没有使用LLM作为判断标准,揭示了明显的分离:在53-97%的准确率下,能够对可见布局进行重新安排的模型,在遮挡任务中表现不佳,仅在6-45%之间,而在反射任务中低于7%。一个具身推理模型重现了相同的模式。对Qwen3-VL-8B-Thinking的白盒分析显示,失败归因于视觉标记合并:在视觉编码器中可恢复的空间信息在标记压缩后变得不可用,只有在清洁的标记合并后激活被重新引入语言解码器后才恢复。

英文摘要

Vision-language models reliably name objects in a scene, but do they represent the 3D layout those objects inhabit? We introduce a 3,034-sample human-curated benchmark targeting three components of spatial understanding: depth-ordered occlusion (probed via three independent counterfactual operationalisations), optical-geometry inference over visible reflections, and volumetric rearrangement planning. Six frontier and open-weight VLMs, scored by trained annotators on 18,204 responses with no LLM-as-judge, reveal a sharp dissociation: models that plan rearrangements over visible layouts at 53--97% accuracy and rarely violate collision constraints fall to 6--45% on occlusion and below 7% on reflections. An embodied-reasoning model reproduces the same profile. White-box analysis on Qwen3-VL-8B-Thinking localises the failure to the visual-token merger: spatial information recoverable throughout the vision encoder becomes inaccessible after token compression and only stabilises again when clean post-merger activations are patched into the language decoder.

2604.00626 2026-06-19 cs.LG cs.CL 版本更新

A Survey of On-Policy Distillation for Large Language Models

大型语言模型的在线策略蒸馏综述

Mingyang Song, Mao Zheng

发表机构 * Tencent, China(腾讯,中国)

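AI summary: Surveys on-policy distillation methods for large language models, discusses how feedback during distillation reduces compounding error, formalizes distillation as f-divergence minimization, and analyzes the connection between distillation and reinforcement learning.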

Comments Ongoing Work

详情
AI中文摘要

随着大型语言模型(LLMs)在能力和成本上的持续增长,将前沿能力转移到更小、可部署的学生模型已成为核心工程问题,知识蒸馏仍然是这一转移的主导技术。工业流水线中普遍采用的静态模仿教师生成文本的方法存在结构性缺陷,随着任务变得更长且需要更多推理,这种缺陷变得更加严重。因为学生是在完美教师前缀上训练的,但在推理时必须生成自己的文本,小错误往往会积累成学生很少被训练来恢复的轨迹,导致的暴露偏差已被证明与序列长度的平方成比例。在线策略蒸馏(OPD)围绕这一观察重新组织训练循环,通过让教师对学生实际生成的内容提供反馈,以减少累积项趋于线性,并将蒸馏重新定义为迭代修正过程,而不是单次模仿。由此产生的文献在分歧设计、奖励引导优化和自我对抗方面有所扩展,但贡献仍然分散在知识蒸馏、RLHF和模仿学习社区中,缺乏统一的处理。本文提供了这样的处理。我们正式将OPD定义为学生采样轨迹上的f-散度最小化,将该领域沿三个设计轴(优化什么、信号来源在哪里、以及如何在实践中稳定训练)组织起来,并整合成功条件、反复失败模式以及OPD与KL约束强化学习之间的联系。最后,我们提出了由此综合而产生的开放性问题,包括蒸馏扩展定律、不确定反馈、代理蒸馏以及知识蒸馏与强化学习之间的日益增长的重叠。

英文摘要

As Large Language Models continue to grow in both capability and cost, transferring frontier capabilities into smaller, deployable students has become an important engineering problem, and knowledge distillation remains a common technique for this transfer. The prevailing recipe in industrial pipelines, static imitation of teacher-generated text, carries a structural weakness that grows more severe as tasks become longer and more reasoning-intensive. Because the student is trained on flawless teacher prefixes but generates its own at inference, small errors tend to accumulate into trajectories it has rarely been trained to recover from, and the resulting exposure bias has been shown to scale roughly with the square of sequence length. On-Policy Distillation reorganizes the training loop around this observation by having the teacher provide feedback on what the student actually produces, with the goal of reducing the compounding term toward linear and reframing distillation as an iterative correction process rather than single-pass imitation. The resulting literature has expanded along divergence design, reward-guided optimization, and self-play, yet contributions remain scattered across the knowledge distillation, RLHF, and imitation learning communities without a unified treatment. This survey provides such a treatment. We formalize OPD as f-divergence minimization over student-sampled trajectories, organize the field along three design axes (what to optimize, where the signal comes from, and how to stabilize training in practice), and consolidate success conditions, recurring failure modes, and the connection between OPD and KL-constrained reinforcement learning. We close with open problems that emerge from this synthesis, including distillation scaling laws, uncertainty-aware feedback, agent-level distillation, and the growing overlap between knowledge distillation and RL.
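
To make "divergence minimization over student-sampled trajectories" concrete, the toy sketch below scores one trajectory with the reverse KL divergence, which is one member of the f-divergence family. Everything here is an illustrative assumption and not the survey's notation: the function names, the three-token vocabulary, and the probabilities. The per-step distributions are assumed to be evaluated on prefixes that the student itself sampled.

```python
import math

def reverse_kl(student, teacher):
    """KL(student || teacher) for two distributions over the same vocabulary.

    Assumes teacher probabilities are positive wherever the student's are.
    """
    return sum(p * math.log(p / q) for p, q in zip(student, teacher) if p > 0.0)

def on_policy_loss(student_steps, teacher_steps):
    """Average per-token divergence along one student-sampled trajectory."""
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)

# Next-token distributions at two positions of a student-generated sequence.
student_steps = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]
teacher_steps = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
print(on_policy_loss(student_steps, teacher_steps))
```

The contrast with static imitation is where the prefixes come from: here the teacher is queried on the student's own outputs, so the student receives a signal on exactly the states it visits at inference time.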

2605.17443 2026-06-19 cs.CL cs.SD eess.AS 版本更新

Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades

分析韩语语音问答中ASR-LLM级联中的误差传播

Donghyuk Jung, Youngwon Choi

发表机构 * Korea Culture Technology Institute, Republic of Korea(韩国文化科技研究所) Maum AI Inc., Republic of Korea(马姆人工智能公司)

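AI summary: Studies how errors propagate through ASR-LLM cascades in Korean spoken question answering. By analyzing downstream semantic failures, it shows error effects that conventional ASR metrics do not fully capture, finds that cascade degradation is consistent across LLMs of different performance, identifies single-character ASR errors as a channel for semantic failure, and reports in an auxiliary comparison that a large audio language model outperforms an ASR-LLM pipeline with a matched language model on noisy Korean SQA.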

Comments Preprint. Submitted to APSIPA ASC 2026

详情
AI中文摘要

我们分析了自动语音识别(ASR)误差如何通过ASR-LLM级联在韩语语音问答(SQA)中传播,重点关注传统ASR指标无法完全捕捉的下游语义失败。我们的分析显示,由ASR误差引起的相对下游降级在不同绝对性能的LLM中保持一致,表明级联降级主要跟踪ASR阶段的信息损失。我们进一步识别出单字符韩语ASR错误作为一种独特的语义失败通道,其中正确答案在下游预测中完全消失,尽管仅存在微小的转录差异。最后,辅助比较显示,大型音频语言模型在噪声韩语SQA中优于具有匹配语言骨干的ASR-LLM流水线,表明直接音频输入有潜力缓解转录诱导的信息损失。

英文摘要

We analyze how automatic speech recognition (ASR) errors propagate through ASR-LLM cascades in Korean spoken question answering (SQA), focusing on downstream semantic failures that conventional ASR metrics cannot fully capture. Our analysis shows that the relative downstream degradation caused by ASR errors is consistent across LLMs with different absolute performance, suggesting that cascade degradation largely tracks ASR-stage information loss. We further identify single-character Korean ASR errors as a Korean-specific loss channel, where even a minimal transcription difference can change the intended question and degrade downstream QA performance. Finally, an auxiliary comparison shows that a large audio language model outperforms an ASR-LLM cascade with an approximately matched language backbone in noisy Korean SQA, indicating the potential of direct audio input to mitigate transcript-induced information loss.

2605.15231 2026-06-19 cs.LG cs.CV 版本更新

Mask-Morph Graph U-Net: A Generalisable Mesh-Based Surrogate for Crashworthiness Field Prediction under Large Geometric Variation

Mask-Morph Graph U-Net:一种通用的基于网格的替代模型,用于在大几何变化下预测碰撞worthiness领域

Haoran Li, Tobias Lehrer, Yingxue Zhao, Haosu Zhou, Philipp Stocker, Tobias Pfaff, Marcus Wagner, Nan Li

发表机构 * Dyson School of Design Engineering, Imperial College London(帝国理工学院伦敦设计工程学院) TUM School of Engineering and Design, Technical University of Munich(慕尼黑技术大学工程与设计学院) Faculty of Mechanical Engineering, OTH Regensburg(雷根斯堡机械工程学院) NVIDIA(NVIDIA公司)

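AI summary: Proposes Mask-Morph Graph U-Net, which uses feature-aligned barycentric parameterisation and node-masking pretraining to improve the generalisability and data efficiency of mesh-based simulation surrogates for crashworthiness design exploration.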

Comments 48 pages, 15 figures, journal paper under review

详情
AI中文摘要

非线性有限元碰撞模拟准确但计算成本高,限制了其在迭代设计优化中的应用。基于图神经网络(GNN)的机器学习替代模型提供了更快的替代方案。消息传递GNN广泛用于网格模拟,其共享节点和边更新函数在不同图结构中相对通用。相比之下,非共享边特定聚合层能更准确地捕捉非线性关系,但通常需要固定图连接性,限制了通用性。本文提出Mask-Morph Graph U-Net(MMGUNet),一种解决分层图U-Net架构限制的方法,该架构使用边特定下采样和上采样层。固定粗图连接性是边特定层所必需的。为了在保留此连接性的同时提高空间对应性,所提出的方法通过特征对齐的重心参数化将粗化图层次变形到每个输入网格,然后构建跨图边。它进一步在监督预训练中应用节点掩码,随后进行参数高效的微调,其中高参数边特定层被冻结。所提出的方法在分布内、分布外和跨组件迁移设置中使用均欧距离和最大入侵百分比误差进行评估。结果表明,粗图变形相对于固定粗图基线提高了测试准确性,而掩码监督预训练减少了训练-测试差异并提高了迁移期间的数据效率。所提出的模型还比外部基线取得了更低的预测误差。这些结果展示了通往可重用、数据高效网格替代模型的实用路径,用于碰撞worthiness设计探索。

英文摘要

Nonlinear finite element crash simulations are accurate but computationally expensive, limiting their use in iterative design optimisation. Machine-learning surrogate models based on graph neural networks (GNNs) offer a faster alternative. Message-passing GNNs are widely used for mesh simulation, and their shared node and edge update functions are relatively generalisable across varying graph structures. By contrast, non-shareable edge-specific aggregation layers can capture nonlinear relationships more accurately but usually require fixed graph connectivity, which limits generalisability. This paper presents Mask-Morph Graph U-Net (MMGUNet), a practical approach to addressing the limitation of hierarchical Graph U-Net architectures that use edge-specific downsampling and upsampling layers. Fixed coarse graph connectivity is required for edge-specific layers. To retain this while improving spatial correspondence, the proposed method morphs the coarsened graph hierarchy to each input mesh using feature-aligned barycentric parameterisation before constructing cross-graph edges. It further applies node masking during supervised pretraining, followed by parameter-efficient fine-tuning in which high-parameter edge-specific layers are frozen. The proposed approach is evaluated in in-distribution, out-of-distribution, and cross-component transfer settings using mean Euclidean distance and maximum intrusion percentage error. Results show that coarse-graph morphing improves test accuracy relative to a fixed-coarse-graph baseline, while masked supervised pretraining reduces the train-test discrepancy and improves data efficiency during transfer. The proposed model also achieves lower prediction error compared with external baselines. These results demonstrate a practical route toward reusable, data-efficient mesh-based surrogate modelling for crashworthiness design exploration.

2512.03199 2026-06-19 cs.CV 版本更新

Does Head Pose Correction Improve Biometric Facial Recognition?

姿态校正是否能提升生物特征面部识别?

Justin Norman, Hany Farid

发表机构 * University of California, Berkeley(加州大学伯克利分校)

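AI summary: Examines how AI-driven head-pose correction and image restoration affect facial recognition accuracy, and finds that selectively applying CFR-GAN with CodeFormer can improve recognition performance.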

详情
AI中文摘要

生物特征面部识别模型在处理现实世界图像时常表现出显著的准确性下降,通常表现为图像质量差、非正面姿态和主体遮挡。我们调查了针对这些挑战的AI驱动头部姿态校正和图像修复是否能提高识别准确率。使用模型无关的大规模法医评估流程,我们评估了三种修复方法:3D重建(NextFace)、2D正面化(CFR-GAN)和特征增强(CodeFormer)。我们发现这些技术的简单应用会显著降低面部识别准确率。然而,我们还发现选择性应用CFR-GAN结合CodeFormer可以带来有意义的提升。

英文摘要

Biometric facial recognition models often demonstrate significant decreases in accuracy when processing real-world images, often characterized by poor quality, non-frontal subject poses, and subject occlusions. We investigate whether targeted, AI-driven, head-pose correction and image restoration can improve recognition accuracy. Using a model-agnostic, large-scale, forensic-evaluation pipeline, we assess the impact of three restoration approaches: 3D reconstruction (NextFace), 2D frontalization (CFR-GAN), and feature enhancement (CodeFormer). We find that naive application of these techniques substantially degrades facial recognition accuracy. However, we also find that selective application of CFR-GAN combined with CodeFormer yields meaningful improvements.

2605.10898 2026-06-19 cs.HC 版本更新

How Creatives Approach GenAI Image Generation: Tensions Between Structured Guidance, Self-Experimentation, and Creative Autonomy

创意人士如何接近生成式AI图像生成:结构化指导、自我实验与创意自主之间的张力

Haidan Liu, Isabelle Kwan, Taiga Okuma, Jeffrey Loverock, Nicholas Vincent, Parmit K Chilana

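AI summary: Examines how creatives balance structured guidance and self-experimentation when using generative AI image tools, and finds that although guidance helps them understand AI, many still prefer to explore on their own to preserve creative freedom.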

Comments Accepted at ACM Creativity & Cognition 2026

详情
AI中文摘要

随着生成式AI工具日益影响创意实践,它们引发了长期存在的HCI问题,即创意人士如何学习复杂软件以及如何更好地得到支持。我们通过与8名艺术家和爱好者进行访谈研究,并随后进行159人调查,以了解该群体如何接近和寻求生成式AI图像工具的指导。我们发现,创意人士通常使用自我实验或教程来探索生成式AI工具,但许多人对复杂的AI术语感到困惑。为了进一步了解创意人士的学习体验,我们开发了一个研究探针来获取他们对结构化指导的看法。我们的用户研究显示,即使创意人士描述指导有助于理解AI,许多人仍更喜欢自我实验,认为指导可能限制他们的创造力。我们的发现突显了在支持创意人士AI素养时的核心张力:在平衡指导和促进素养的同时,保持创意自由。

英文摘要

As generative AI tools increasingly influence creative practice, they raise longstanding HCI questions about how creatives learn complex software and how they can be better supported. We conducted an interview study with artists and hobbyists (n=8) and a follow-up survey (n=159) to understand how this population approaches and seeks guidance for GenAI image tools. We found that creatives commonly use either self-experimentation or tutorials to explore GenAI tools, yet many struggle with confusing AI terminology. To gain further insight into creatives' learning experiences, we developed a research probe to elicit creatives' perceptions of structured guidance. Our user study with 17 creatives revealed that, even when creatives described the guidance as helpful for understanding AI, many still preferred self-experimentation, feeling that guidance could limit their creativity. Our findings highlight a central tension in supporting AI literacy for creatives: balancing guidance and promoting literacy while preserving creative freedom.

2605.10873 2026-06-19 cs.CV cs.AI 版本更新

CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation

CADBench:一个用于AI辅助CAD程序生成的多模态基准

Anna C. Doris, Jacob Thomas Sony, Ghadi Nehme, Era Syla, Amin Heyrani Nobari, Faez Ahmed

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

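AI summary: Introduces CADBench, a unified benchmark for multimodal CAD program generation with 18,000 samples and six benchmark families. It evaluates 11 vision-language systems and reveals three recurring failure modes in CAD program generation.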

详情
AI中文摘要

从图像或3D观测中恢复可编辑的CAD程序是AI辅助设计的核心,但进展难以衡量,因为现有评估分散在数据集、模态和指标上。我们引入CADBench,一个统一的多模态CAD程序生成基准。CADBench包含18000个评估样本,涵盖来自DeepCAD、Fusion 360、ABC、MCB和Objaverse的六个基准家族,五种输入模态包括干净的网格、噪声网格、单视图渲染、逼真渲染和多视图渲染,以及六个指标,涵盖几何保真度、可执行性和程序紧凑性。STEP-based家族按B-rep面数分层,所有家族均进行多样性采样,以支持在复杂性和物体变化方面的受控分析。我们评估了11种CAD专用和通用的视觉语言系统,生成超过140万个CAD程序。在理想输入下,专用的网格到CAD模型显著优于代码生成VLMs,后者仍远未可靠。CADBench进一步揭示了三种常见的失败模式:几何复杂性增加时重建质量下降,CAD专用模型在模态转移下可能变得脆弱,且模型排名在不同指标下会变化。这些结果将CADBench定位为衡量可编辑3D重建和多模态CAD理解进展的诊断测试平台。该基准在https://huggingface.co/datasets/DeCoDELab/CADBench上公开可用。

英文摘要

Recovering editable CAD programs from images or 3D observations is central to AI-assisted design, but progress is difficult to measure because existing evaluations are fragmented across datasets, modalities, and metrics. We introduce CADBench, a unified benchmark for multimodal CAD program generation. CADBench contains 18,000 evaluation samples spanning six benchmark families derived from DeepCAD, Fusion 360, ABC, MCB, and Objaverse; five input modalities including clean meshes, noisy meshes, single-view renders, photorealistic renders, and multi-view renders; and six metrics covering geometric fidelity, executability, and program compactness. STEP-based families are stratified by B-rep face count and all families are diversity-sampled to support controlled analysis across complexity and object variation. We benchmark eleven CAD-specialized and general-purpose vision-language systems, generating more than 1.4 million CAD programs. Under idealized inputs, specialized mesh-to-CAD models substantially outperform code-generating VLMs, which remain far from reliable CAD program reconstruction. CADBench further reveals three recurring failure modes: reconstruction quality degrades with geometric complexity, CAD-specialized models can be brittle under modality shift, and model rankings change across metrics. Together, these results position CADBench as a diagnostic testbed for measuring progress in editable 3D reconstruction and multimodal CAD understanding. The benchmark is publicly available at https://github.com/anniedoris/CADBench.

2605.10526 2026-06-19 math.OC cs.DM 版本更新

Randomized Max-Vertex-Coverage Interdiction under Matroid Constraints

带有Matroid约束的随机最大顶点覆盖拦截问题

Changjun Wang, Chenhao Wang

AI总结 本文研究了带Matroid约束的随机最大顶点覆盖拦截问题,通过将追随者问题建模为整数线性规划并证明其线性松弛具有4/3的整数间隙,设计出多项式时间8/3近似算法,有效解决了双层优化问题的计算挑战。

详情
AI中文摘要

我们研究了一种新的双层优化问题,称为带有Matroid约束的随机最大顶点覆盖拦截(RMVCI)问题,该问题可以建模为网络中领导者和追随者之间的零和Stackelberg博弈。领导者在Matroid约束下随机选择顶点子集进行保护,而追随者在推断领导者保护概率分布后,选择一个顶点子集(也受Matroid约束)进行攻击,旨在最大化预期总边权,即攻击集和未保护集的顶点的边权总和。领导者的目的是确定一个最优的随机拦截策略,以最小化追随者的预期收益。由于追随者的响应问题是NP难的,所得到的双层规划在计算上具有挑战性。我们开发了一个概念性的近似框架来处理一般的双层拦截问题。对于带有Matroid约束的RMVCI问题,我们首先将追随者的问题建模为一个整数线性规划问题,并证明其线性松弛具有紧的整数间隙$\tfrac{4}{3}$。在近似框架内,我们将追随者的问题替换为其线性松弛,并研究由此得到的双层规划。通过从集合上的分布转换为顶点上的分布,并应用我们的近似框架,我们成功地为这个松弛的双层问题设计了一个多项式时间2近似算法。将这些成分结合到我们的框架中,得到一个多项式时间$\tfrac{8}{3}$近似算法用于带有Matroid约束的RMVCI问题。

英文摘要

We study a class of bilevel interdiction problems in which the follower's optimization problem is computationally intractable. Motivated by network defense applications, we introduce the Randomized Max-Vertex-Coverage Interdiction (RMVCI) problem under matroid constraints. In this zero-sum Stackelberg game, the leader commits to a randomized interdiction strategy over feasible vertex subsets, while the follower, after observing the induced protection probabilities, chooses a matroid-constrained attack to maximize the expected coverage of network edges. The main challenge stems from the fact that the follower's problem is a matroid-constrained maximum vertex coverage problem and is therefore NP-hard. To address this difficulty, we first develop a general approximation framework for bilevel optimization problems with hard follower responses. The framework is based on replacing the follower's value function by a surrogate objective that approximates the follower's optimal payoff while preserving tractability of the leader's optimization problem. For the RMVCI problem, we formulate the follower's problem as an integer linear program, establish a tight integrality gap of $4/3$ for its linear relaxation, and derive a polynomial-time $4/3$-approximation algorithm via pipage rounding. We then show that a carefully designed surrogate objective admits a marginal-probability reformulation that transforms the randomized interdiction problem into a tractable optimization problem over the leader's matroid polytope. This yields a polynomial-time $2$-approximation algorithm for RMVCI under general matroid constraints. Beyond the specific application studied here, our results provide a new perspective on approximation methods for general bilevel optimization problems.

2605.09609 2026-06-19 cs.LG math.AG 版本更新

Minimal Filling Architectures of Polynomial Neural Networks: Counterexamples, Frontier Search, and Defects

多项式神经网络的最小填充架构:反例、前沿搜索与缺陷

Kevin Dao, Jose Israel Rodriguez

发表机构 * Department of Mathematics, University of Wisconsin-Madison, Wisconsin, USA(威斯康星大学麦迪逊分校数学系)

AI总结 本文通过前沿搜索和符号计算,给出了多项式神经网络单峰最小填充架构猜想的反例,并揭示其部分子架构存在较大缺陷,与以往文献中以小缺陷为主的现象形成对比。

详情
AI中文摘要

我们为具有幂激活函数的多项式神经网络(PNNs)提供了单峰最小填充架构猜想的反例。在固定输入和输出宽度的情况下,该猜想声称任何最小填充架构的隐藏层宽度都是单峰的。我们通过前沿搜索、神经簇(neurovarieties)上的递归维度界限以及符号计算找到了反例。值得注意的是,我们主要示例的几个子架构表现出较大的缺陷,这与以往文献中普遍观察到的小缺陷行为形成对比。

英文摘要

We provide counterexamples to the unimodal minimal filling architecture conjecture for polynomial neural networks (PNNs) with power activation functions. Fixing the input and output widths, the conjecture states that any minimal filling architecture has unimodal widths for the hidden layers. We found counterexamples via a frontier search, recursive dimension bounds on neurovarieties, and symbolic computation. Notably, several subarchitectures of our main example exhibit large defect, in contrast with the predominantly small-defect behavior observed in prior literature.

2605.09550 2026-06-19 cs.HC 版本更新

Who embraces AI in play? Exploratory modeling of player preference profiles toward game AI

谁在游戏中拥抱AI?玩家对游戏AI偏好轮廓的探索性建模

Ting-Chen Hsu, Jiangxu Lin, Wenran Chen, Zheyuan Zhang, Fei Qin

AI总结 本文通过问卷数据和原型分析(Archetypal Analysis, AA),揭示玩家对游戏AI接受度的跨情境偏好轮廓,识别出七种典型群体,并探讨其与AI素养、游戏习惯等因素的关系。

Comments Accepted to 2026 IEEE Conference on Games (IEEE CoG 2026)

详情
AI中文摘要

人工智能正通过多种功能进入数字游戏。尽管先前研究显示玩家对游戏AI的态度高度依赖于情境,但对这些态度在不同玩家群体中如何结构化组合仍知之甚少。本研究通过将玩家的跨情境AI接受度建模为可解释的态度轮廓来填补这一空白。基于771名数字游戏玩家的问卷数据,我们应用原型分析(Archetypal Analysis, AA)对八个代表性游戏AI应用情境中的中心化接受评分进行分析。分析识别出七种不同的轮廓:AI怀疑者、广泛AI支持者、创造性玩法探索者、体验导向支持者、系统秩序倡导者、情感中心支持者和治理怀疑者。探索性的一对其余(OvR)逻辑回归进一步表明,轮廓成员身份与玩家的感知AI素养、游戏习惯、学科背景、个性特征和应用特定优先级相关。通过将关注点从孤立的接受判断转向模式化的偏好结构,本研究为细分游戏AI受众提供了探索性的经验词汇,并为在数字游戏中进行更情境敏感和玩家敏感的AI整合提供了初步设计启示。

英文摘要

Artificial intelligence is increasingly entering digital games through diverse functions. While prior work has shown that player attitudes toward game AI are strongly context-dependent, less is known about how these attitudes are structurally combined within different groups of players. This study addresses this gap by modeling players' cross-context AI acceptance as interpretable attitude profiles. Based on questionnaire data from 771 digital game players, we apply Archetypal Analysis (AA) to centered acceptance ratings across eight representative AI application contexts in games. The analysis identifies seven distinctive profiles: AI-Skeptics, Broad AI-Supporters, Creative-Play Explorers, Experience-Oriented Supporters, Systemic Order Advocates, Emotion-Centered Supporters, and Governance-Skeptics. Exploratory one-vs-rest (OvR) logistic regressions further suggest that profile membership is associated with players' perceived AI literacy, gaming habits, disciplinary background, personality traits, and application-specific priorities. By shifting attention from isolated acceptance judgments to patterned preference structures, this study provides an exploratory empirical vocabulary for segmenting game AI audiences and offers preliminary design implications for more context-sensitive and player-sensitive AI integration in digital games.