arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.12225 2026-06-11 cs.CR New submission

Bridging the Smart City Cybersecurity Data Gap Through AI-Driven Synthetic Dataset Generation

Stephanie Polczynski, John D. Hastings, Varghese Vaidyan, Kyle Korman

AI summary: Proposes an AI-based synthetic data generation framework that uses generative models to produce high-fidelity cybersecurity datasets, addressing the scarcity of real-world data and supporting the development and evaluation of smart city security tools.

Comments
10 pages, 1 figure, 2 tables

Abstract

Smart cities rely on interconnected cyber-physical systems that integrate sensors, IoT devices, cloud platforms, and AI-driven services and decision-making. While these systems enhance city services, they also introduce complex cybersecurity challenges due to their large attack surfaces, heterogeneous data flows, and evolving threat vectors. Developing and validating cybersecurity tools for smart cities requires high-quality datasets that accurately represent real operational conditions. However, real-world datasets are often incomplete, contain privacy-sensitive data, are difficult to access, or lack sufficient malicious activity to support tool development. This research addresses this critical gap by proposing an AI-based synthetic data generation (SDG) framework designed specifically for smart city cybersecurity research. The proposed framework leverages generative artificial intelligence models to produce high-fidelity synthetic cybersecurity datasets that replicate realistic device behaviors, network interactions, and cyber-attack scenarios. The synthetic datasets are evaluated for conformity to protocol standards, statistical similarity to original datasets, and utility in common security tools. The resulting synthetic data generation framework and evaluation metrics are expected to advance smart city cybersecurity by enabling researchers to model threats more effectively and evaluate defensive techniques more comprehensively to better protect critical smart city infrastructures.

2606.12218 2026-06-11 cs.CV cs.AI New submission

Adapting Prithvi-EO for Fallow Detection for Food-Water Nexus: ViT-Adapter Necks and Parameter-Efficient Backbone tuning of Geospatial Foundation Model

Sk Muhammad Asif, Orhun Aydin

Affiliations: Earth, Atmospheric and Geospatial Science, Saint Louis University

AI summary: To address the mismatch between the multi-scale features needed for fallow detection and the single-scale ViT backbone of the foundation model, the paper combines two parameter-efficient fine-tuning schemes (LoRA and a hybrid PEFT) with three neck designs. The Lite ViT-Adapter with a one-stage detection head reaches an mAP@50 of 0.9479, 25.70% better than the adapter-free approach.

Comments
10 pages, 6 figures. Preprint. Submitted to ACM SIGSPATIAL 2026

Abstract

Understanding the spatial distribution of fallow land is important for optimizing the food-water (FW) nexus, given fallowing's role in crop rotation and water conservation. Fallow is a low-accuracy class in the USDA Cropland Data Layer (CDL). The geospatial foundation model (GFM) Prithvi-EO has shown strong transferability across computer vision tasks. However, its Vision Transformer (ViT) backbone produces features at a single spatial scale, which is ill-suited to the multi-scale features required by object detection heads. Existing approaches synthesise multi-scale pyramids through scaling of single-stride tokens, sacrificing spatial heterogeneity, and full backbone fine-tuning is computationally prohibitive for GFMs. We evaluate a fallow detection pipeline combining two parameter-efficient fine-tuning (PEFT) schemes, Low-Rank Adaptation (LoRA) and a hybrid PEFT, with three neck designs: pseudo multi-scale, Lite ViT-Adapter, and Full ViT-Adapter. Our best configuration, Lite ViT-Adapter with a one-stage head, achieves an mAP@50 of 0.9479 with the DIoU loss, suggesting the effectiveness of center-aware localization for irregular fallow field detection. ViT-Adapter-free one-stage detection under LoRA improves on the adapter-free anchor-based approach by 6.42%, and the best configuration improves on the baseline adapter-free anchor-based approach by 25.70%. These results demonstrate that lightweight spatial prior fusion and selective backbone unfreezing enable Prithvi-EO to capture local fallow patterns more effectively, outperforming approaches that rely on reshaped single-stride ViT tokens.
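The DIoU (Distance-IoU) loss named in the abstract above is commonly defined as 1 - IoU plus the squared distance between box centers divided by the squared diagonal of the smallest enclosing box. The sketch below is a minimal plain-Python version of that standard definition for a single pair of axis-aligned boxes. It is not taken from the paper's code, and the function name and the `(x1, y1, x2, y2)` box format are assumptions made for illustration.

```python
def diou_loss(box_a, box_b):
    """DIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2).

    Illustrative sketch of the standard definition; not the paper's code.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union area and IoU.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distance between the two box centers.
    center_dist_sq = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
                   + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both boxes.
    enclose_w = max(ax2, bx2) - min(ax1, bx1)
    enclose_h = max(ay2, by2) - min(ay1, by1)
    diag_sq = enclose_w ** 2 + enclose_h ** 2

    penalty = center_dist_sq / diag_sq if diag_sq > 0 else 0.0
    return 1.0 - iou + penalty
```

Unlike a plain IoU loss, the center-distance term still gives a useful signal when the predicted and target boxes do not overlap, which is one plausible reading of the "center-aware localization" remark in the abstract.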

2606.12217 2026-06-11 cs.CV cs.AI cs.RO New submission

Making Foresight Actionable: Repurposing Representation Alignment in World Action Models

Lu Qiu, Yizhuo Li, Yi Chen, Yuying Ge, Yixiao Ge, Xihui Liu

Affiliations: The University of Hong Kong; XPENG Robotics

AI summary: Addresses the mismatch between visual prediction and action extraction in world action models with AGRA, which aligns video diffusion features with semantic representations so that the action decoder attends to task-relevant regions, improving manipulation performance and generalization.

Abstract

World Action Models (WAMs) offer a promising route for robot manipulation by using video generation models to model future scene evolution before producing control actions. However, our empirical observations reveal a phenomenon: generating plausible visual futures does not always guarantee the extraction of accurate actions. To diagnose this failure, we conduct action-head attention analysis and causal interventions. We find that the action decoder fails to focus on task-relevant interaction regions and remains sensitive to perturbations in task-irrelevant areas. This reveals a representation mismatch: hidden states optimized for visual reconstruction are not inherently organized in a form useful for low-level action control. In this paper, we propose AGRA, an Action-Grounded Representation Alignment objective that regularizes the world-action interface by aligning intermediate video diffusion features with spatially coherent semantic representations from a foundation visual encoder. We evaluate AGRA on real-world manipulation tasks. Experiments show that AGRA makes world model representations more action-grounded: by focusing the action decoder on the correct interaction regions, it improves object localization accuracy and affordance understanding, and makes the policy more robust to perturbations in task-irrelevant regions. As a result, AGRA consistently improves both in-distribution performance and out-of-distribution generalization over the baseline world action model.

2606.12215 2026-06-11 cs.CV cs.IR cs.LG New submission

MLT-Dedup: Efficient Large-Scale Online Video Deduplication via Multi-Level Representations and Spatial-Temporal Matching

David Yuchen Wang, Haoying Li, Hailun Xu, Wei Chee Yew, Zirui Zhu, Sanjay Saha, Hao Hei, Kanchan Sarkar, Kun Xu

Affiliations: TikTok Singapore; School of Computing, National University of Singapore; TikTok San Jose

AI summary: Proposes the MLT-Dedup framework, which uses a multi-level video encoder to extract fine-grained frame-level and sparse clip-level embeddings and a differential feature-enhanced similarity module for spatial-temporal matching, reducing the online repetition rate by 91% at 90% precision and increasing indexing capacity 5x.

Comments
Accepted by KDD-2026 ADS track

Abstract

The explosive growth of user-generated video content on online platforms is accompanied by the emergence of numerous near-duplicate videos--videos that are identical or highly similar but differ by partial edits. These duplicates degrade user experience and increase storage and bandwidth costs, making large-scale video deduplication a critical task. Existing video deduplication frameworks face a fundamental challenge in retrieving sufficient high-quality candidates under a limited index budget, as well as trade-offs between efficiency and precision. To address these issues, we propose MLT-Dedup, an efficient large-scale online video deduplication framework with Multi-Level representations and spatial-Temporal matching. Our approach employs a Multi-Level Video Encoder (ML-VE) to extract both fine-grained frame-level and sparse clip-level embeddings: sparse embeddings support efficient candidate retrieval, while fine-grained embeddings are loaded for precise pairwise matching. During matching, we introduce DiF-SiM, a Differential Feature-enhanced Similarity Module capable of locating duplicated temporal segments and providing reliable similarity evidence to support policy-driven deduplication decisions. Extensive experiments on a real-world large-scale platform demonstrate that MLT-Dedup reduces online repetition rates by 91% at 90% precision. Furthermore, our sparse retrieval design achieves a 5x increase in indexing capacity, enabling broader candidate coverage in real-world deployment.
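The retrieve-then-match split described in the abstract above (sparse clip-level embeddings for candidate retrieval, fine-grained frame-level embeddings for pairwise matching) can be sketched roughly as follows. This is a generic two-stage pattern under stated assumptions, not the paper's ML-VE or DiF-SiM: the mean-of-best-frame-match score, the function names, and the `top_k` and `threshold` parameters are all hypothetical.

```python
import numpy as np


def cosine(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T


def find_duplicates(query_clip, query_frames, index_clips, index_frames,
                    top_k=2, threshold=0.9):
    """Two-stage lookup: cheap clip-level retrieval, then frame-level matching.

    The frame-level score here is a simple stand-in (hypothetical), not DiF-SiM.
    """
    # Stage 1: rank indexed videos by clip-level similarity and keep the top_k.
    clip_scores = cosine(query_clip[None, :], index_clips)[0]
    candidates = np.argsort(-clip_scores)[:top_k]

    # Stage 2: load frame embeddings only for the candidates and score each one
    # by the average best match of every query frame.
    matches = []
    for i in candidates:
        frame_sim = cosine(query_frames, index_frames[i])
        score = frame_sim.max(axis=1).mean()
        if score >= threshold:
            matches.append((int(i), float(score)))
    return matches
```

The design point is that only the compact clip-level vectors need to live in the index, which is how a fixed index budget can cover more videos, while the heavier frame-level data is touched only for a handful of candidates.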

2606.12214 2026-06-11 cs.HC cs.GR New submission

Identifying cybersickness causes in virtual reality games using symbolic machine learning algorithms

Thiago Porcino, Erick Oliveira Rodrigues, Flavia Bernardini, Daniela Trevisan, Esteban Clua

AI summary: Uses symbolic machine learning to rank the causes of cybersickness in VR games. Experiments with two games and 37 valid samples find that rotation and acceleration trigger cybersickness more often in the flight game and that users with less VR experience are more prone to discomfort.

Abstract

Virtual reality (VR) and head-mounted displays are constantly gaining popularity in various fields such as education, military, entertainment, and health. Although such technologies provide a high sense of immersion, they can also trigger symptoms of discomfort. This condition is called cybersickness (CS) and is widely discussed in recent virtual reality publications. This work proposes a novel experimental analysis using symbolic machine learning to rank potential causes of CS in VR games. We estimate CS causes and rank them according to their impact using classical machine learning. Experiments are performed using two virtual reality games and 6 experimental protocols, along with 37 valid samples from a total of 88 volunteers. Our results show that rotation and acceleration triggered cybersickness more frequently in a flight game than in a race game. We also observe that subjects who are less experienced with VR are more prone to feel discomfort. Former experience plays a more important role in the race game, as this game gives the user more freedom in terms of controllers, more displacement alternatives, and more user-controlled acceleration. Furthermore, different causes of discomfort arise depending on whether VR exposure is short or long term. We suggest strategies for mitigating CS in both short- and long-term exposure experiences and compare the two highlighted scenarios (race and flight).

2606.12213 2026-06-11 cs.CV New submission

SHERPA: Seam-aware Harmonized ERP Adaptation for Open-Domain 360$^\circ$ Panorama Generation

Jungwoon Kang, Jaehun Kim, Yiwon Yu, Hyungyum Jang, Sanghoon Lee, Jongyoo Kim

Affiliations: Yonsei University

AI summary: Proposes the SHERPA framework, which adapts planar diffusion models to 360° panoramas in a lightweight way through frequency-selective Circular RoPE, circular latent encoding/decoding, image-side FFN adapters, and a dual-path training scheme, supporting both photorealistic and stylized panorama generation.

Comments
29 pages, 23 figures, 5 tables. Preprint version

Abstract

Panoramic imagery is increasingly used in world-generation, games, and simulation, where users may need not only photorealistic scenes but also stylized and non-photorealistic environments. Large-scale text-to-image diffusion and flow models provide broad style and semantic priors for this goal, but planar image training misaligns them with the wrap-around topology and polar regions of $360^\circ$ panoramas represented in equirectangular projection (ERP). We present SHERPA, a lightweight adaptation framework that combines frequency-selective Circular RoPE, Circular Latent Encoding/Decoding, image-side FFN adapters, and a Dual-Path Training Scheme. Circular RoPE replaces only the seam-sensitive high-frequency horizontal RoPE band with integer-periodic harmonics while preserving the pretrained lower-frequency spectrum. The Paired Panorama Path supervises geometry, while the Unpaired Style Path uses self-supervised yaw consistency for target-free stylized prompts. As a result, SHERPA generates $360^\circ$ panoramas across both photorealistic panorama domains and open-domain stylized prompts.
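One way to read "integer-periodic harmonics" in the abstract above: a rotary frequency is seam-consistent on a panorama of width W tokens when its rotation angle over W positions is a whole number of turns, so column 0 and column W receive the same rotation. The sketch below snaps frequencies above a cutoff to the nearest such harmonic and leaves lower frequencies untouched. The cutoff value, the function name, and the nearest-harmonic rule are assumptions for illustration; the abstract does not say how the paper selects the band or the harmonics.

```python
import numpy as np


def circular_rope_freqs(freqs, width, cutoff):
    """Snap high rotary frequencies to integer harmonics of 2*pi/width.

    freqs  : angular frequencies in radians per token (1-D array)
    width  : number of tokens around the panorama's horizontal axis
    cutoff : frequencies >= cutoff are treated as seam-sensitive (hypothetical rule)
    """
    base = 2 * np.pi / width
    # Nearest nonzero integer multiple of the base harmonic.
    harmonics = np.maximum(1, np.round(freqs / base)) * base
    # Replace only the high-frequency band; keep the lower frequencies as they are.
    return np.where(freqs >= cutoff, harmonics, freqs)
```

With the snapped frequencies, `freq * width` is a multiple of 2π, so the sine and cosine used by the rotary embedding take the same values at both ends of the horizontal axis.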

2606.12212 2026-06-11 cs.SE cs.CR New submission

Mind your key: An Empirical Study of LLM API Credential Leakage in iOS Apps

Pinran Gao, Lingxiang Wang, Ying Zhang, Fan Yang

AI summary: The first systematic study of LLM API key leakage in iOS apps. Using the dynamic analysis framework LLMKeyLens on 444 apps, it finds 282 with exploitable credential leaks, identifies three leakage patterns, and reports that only 28% had been fixed three months later.

Comments
12 pages, 4 figures, 4 tables

Abstract

The rapid integration of large language models (LLMs) into mobile applications has introduced a new class of credential security risk: leaked credentials that grant unauthorized access to LLM inference services, causing financial damage to developers. Prior work on credential leakage has focused primarily on Android apps; to date, no empirical study has systematically investigated LLM API key leakage in iOS applications. We present the first in-depth empirical study of API key leakage in LLM-integrated apps. We construct a high-quality dataset of 444 iOS applications, filtered from 1092 candidates through a standardized process, and develop LLMKeyLens, a dynamic analysis framework that detects LLM API key leakage via traffic interception, provider-specific key extraction, and active validity confirmation, requiring neither source code access nor binary decryption. Our analysis reveals that 282 applications expose exploitable LLM API credentials in network traffic, spanning at least ten providers. We identify three leakage patterns: JWT-based token leakage (48%), unauthenticated backend proxy access (33%), and plaintext API key transmission (19%). To assess remediation, we re-analyzed the same 282 vulnerable applications three months after responsible disclosure; only 28% had remediated the reported vulnerability, while 72% remained exploitable, with persistent issues stemming from unauthenticated backends and broken JWT implementations. Our findings show that LLM API key leakage is both prevalent and persistent in the iOS ecosystem, exposing a systemic gap between developer practice and secure integration principles, and suggest that secure LLM integration requires not only developer awareness but also explicit security guidance from providers and platform-level enforcement.
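The percentages in the abstract above can be turned into approximate app counts with a few lines of arithmetic. The input figures come from the abstract; the derived counts are rounded estimates, since the abstract reports rounded percentages, and the variable names are purely illustrative.

```python
# Figures quoted in the abstract.
CANDIDATES = 1092   # apps considered before filtering
DATASET = 444       # apps in the final dataset
VULNERABLE = 282    # apps exposing exploitable credentials

PATTERN_SHARES = {
    "JWT-based token leakage": 0.48,
    "unauthenticated backend proxy access": 0.33,
    "plaintext API key transmission": 0.19,
}
REMEDIATED_SHARE = 0.28


def approx_count(share, total):
    """Approximate count implied by a rounded percentage."""
    return round(share * total)


# Share of the dataset found vulnerable (about 63.5%).
vulnerable_rate = VULNERABLE / DATASET

# Approximate number of vulnerable apps per leakage pattern.
pattern_counts = {name: approx_count(share, VULNERABLE)
                  for name, share in PATTERN_SHARES.items()}

# Approximate remediation outcome three months after disclosure.
remediated = approx_count(REMEDIATED_SHARE, VULNERABLE)
still_exploitable = VULNERABLE - remediated
```

Under these roundings, roughly 135, 93, and 54 apps fall into the three patterns, and about 79 of the 282 vulnerable apps were fixed while about 203 remained exploitable.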

2606.12211 2026-06-11 quant-ph cs.LG New submission

Quantum Occam Learning: Sample-Supported Expressibility for Circuit-Based Quantum Learning

Jeongho Bang, Kyoungho Cho, Jeongwoo Jae

AI summary: For data generated by finite-size quantum circuits, develops an information-theoretic Occam theory and proves a sample-supported expressibility law: at trace-distance accuracy ε, M samples support at most about Mε² gates, turning circuit complexity into an adaptive statistical resource.

Comments
22 pages (main text + appendix), 2 figures

Abstract

A central principle in quantum machine learning is that an ansatz should be expressive enough to represent the quantum data of interest. Yet, the expressibility is statistically meaningful only insofar as it can be learned from finitely many copies of an unknown quantum state. In this work, we develop an information-theoretic Occam theory for quantum data generated by finite-size quantum circuits. For the class $S_{n,G}$ of $n$-qubit pure states preparable with at most $G$ two-qubit gates, a metric-entropy argument gives the realizable sample law $\widetilde{\Theta}(G/\epsilon^2)$ in the circuit-limited regime. For an arbitrary source $\hat{\rho}$, we introduce the best $G$-gate approximation error $d_G(\hat{\rho})$ and the approximate circuit complexity $C_\eta(\hat{\rho})$. We prove an agnostic quantum Occam theorem: with $M$ copies, one can learn up to the best $G$-gate approximation error plus a statistical penalty $\widetilde{O}(\sqrt{G/M})$. We then remove the need to know $G$ in advance through an adaptive model-selection theorem whose oracle inequality selects the circuit complexity justified by the data. Matching lower bounds yield a sample-supported expressibility law: at trace-distance accuracy $\epsilon$, $M$ samples can support only $G_{\rm supported} \simeq M\epsilon^2$ gates, up to logarithmic factors and tomography saturation at $2^n$. Thus, the circuit complexity becomes an adaptive statistical resource rather than a static promise. Our framework turns bounded circuit complexity into a model-selection principle for quantum machine learning.
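To get a feel for the scaling stated in the abstract above, here is a short worked example. The numbers are illustrative and not from the paper; logarithmic factors and the saturation at $2^n$ mentioned in the abstract are ignored.

```latex
% Sample law and its inversion (logarithmic factors dropped).
\[
  M \simeq \frac{G}{\epsilon^{2}}
  \quad\Longleftrightarrow\quad
  G_{\rm supported} \simeq M\epsilon^{2}.
\]
% Illustrative numbers: one million copies at accuracy 0.1.
\[
  M = 10^{6},\ \epsilon = 0.1
  \;\Longrightarrow\;
  G_{\rm supported} \simeq 10^{6}\times 10^{-2} = 10^{4},
  \qquad
  \sqrt{G/M} = \sqrt{10^{4}/10^{6}} = 0.1 .
\]
% Halving the target error cuts the supported gate count by a factor of four.
\[
  M = 10^{6},\ \epsilon = 0.05
  \;\Longrightarrow\;
  G_{\rm supported} \simeq 10^{6}\times 2.5\times 10^{-3} = 2.5\times 10^{3}.
\]
```

The second line also shows that the statistical penalty $\sqrt{G/M}$ from the agnostic theorem matches the target accuracy at the supported gate count, which is consistent with the two statements in the abstract describing the same trade-off.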

2606.12210 2026-06-11 cs.CL New submission

Can News Predict the Market? Limits of Zero-Shot Financial NLP and the Role of Explainable AI

Ali M Karaoglu, Shreyank N Gowda

Affiliations: University of Nottingham

AI summary: Using a zero-shot NLP framework with temporal aggregation and multi-layered explainability, the study finds that zero-shot methods fail to outperform simple baselines, but that explainability signals can separate reliable from unreliable predictions, underscoring the value of transparency and uncertainty awareness in decision support.

Abstract

Can financial news reliably predict short-term stock movements? Despite advances in large language models, this question remains unresolved. We revisit this problem using a zero-shot natural language processing framework, investigating whether models can extract actionable signals from financial news without domain-specific training. We design a structured pipeline that combines zero-shot natural language inference with temporal aggregation, explicitly modelling recency and event-dependent impact horizons when integrating information across articles. To address the need for transparency in high-stakes settings, we introduce a multi-layered explainability framework that links predictions to token-level, article-level, and aggregate evidence, and produces grounded natural language rationales. Across multiple models and prediction horizons, we find that zero-shot approaches consistently fail to outperform simple baselines, with particularly weak performance on negative movements, suggesting deeper structural limitations in mapping news sentiment to short-term price dynamics. However, explainability signals reliably distinguish between trustworthy and unreliable predictions, offering practical value even when accuracy is limited. These findings highlight the limits of zero-shot financial NLP and motivate a shift toward decision-support systems that prioritise transparency and uncertainty awareness. Code: this https URL
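The temporal aggregation step in the abstract above (weighting articles by recency and by an event-dependent impact horizon) could look something like the following. This is a guess at the general shape rather than the paper's method: the exponential half-life decay, the tuple layout, and the function name are all assumptions.

```python
def aggregate_signals(articles):
    """Recency-weighted average of per-article scores.

    articles: list of (age_days, score, half_life_days) tuples, where
      age_days       - how long ago the article was published
      score          - per-article signal in [-1, 1] (e.g. from zero-shot NLI)
      half_life_days - event-dependent horizon; the weight halves every half-life
    All three fields are hypothetical stand-ins for the paper's inputs.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for age_days, score, half_life_days in articles:
        weight = 0.5 ** (age_days / half_life_days)
        weighted_sum += weight * score
        weight_total += weight
    # With no articles there is no signal.
    return weighted_sum / weight_total if weight_total > 0 else 0.0
```

For instance, a positive article from three days ago with a three-day half-life carries half the weight of a negative article published today, so the aggregate leans negative.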

2606.12207 2026-06-11 cs.RO cs.AI New submission

Intelligent Automation for Embodied Benchmark Construction: Pipelines, Embodiments, Simulators, and Trends

Jinshan Lai, Jianwei Hu, Baoyang Jiang, Fengchun Zhang, Leyuan Wang, Haotian Li, Yida Wang, Tingxuan Huang, Xi Ren, Qiang Ma

Affiliations: University of Electronic Science and Technology of China; Qiyuan Lab; Beijing University of Posts and Telecommunications; Tsinghua University; Beihang University

AI summary: Surveys benchmark construction for embodied intelligence through a five-stage pipeline, analyzes the shift from manual curation to automation and agentic closed-loop workflows, and argues that automation shifts cost toward validation and governance.

Abstract

Embodied intelligence now spans navigation, household assistance, manipulation, autonomous driving, aerial agents, and multimodal large-model control. This expansion has made benchmark construction a central bottleneck for reliable evaluation. Unlike static datasets, embodied benchmarks combine task specifications, environments, robot data, demonstrations, annotations, metrics, evaluation scripts, and release policies into a single evaluation system. This survey reviews the literature through a five-stage construction pipeline: requirement and task construction, data acquisition, data cleaning and annotation, benchmark suite generation and metric definition, and evaluation execution with diagnostic feedback. For each stage, the survey analyzes the transition from manual curation to traditional automation, foundation-model assistance, and agentic closed-loop workflows. It also compares qualitative construction costs across human labor, data and asset acquisition, compute and simulation, validation and debugging, governance and maintenance, and rework risk. The main conclusion is that automation does not simply reduce benchmark cost. Instead, it often shifts cost toward validation, auditability, version control, and long-term governance. Progress in embodied evaluation will therefore depend not only on larger benchmark suites, but also on construction pipelines that are diagnosable, auditable, and responsibly refreshable.

2606.12203 2026-06-11 cs.CL New submission

Adaptive Multi-Resolution Procedural Knowledge Compression for Large Language Models

Changyue Wang, Weihang Su, Qingyao Ai, Yichen Tang, Runzhong Qiao, Xuancheng Li, Min Zhang, Yiqun Liu

Affiliations: Department of Computer Science and Technology, Tsinghua University

AI summary: Proposes the SKIM framework, which compresses procedural skills into adaptive multi-resolution soft tokens, reducing skill token length to 30%-60% of the original while preserving task performance.

Abstract

Large language models (LLMs) are widely used to tackle complex tasks with autonomous workflows. Recently, reusable natural language skills have emerged as a popular paradigm to inject procedural knowledge into LLM applications. Since popular skills are often invoked repeatedly, placing their full text in every context significantly increases prefill cost and latency. While text compression techniques have the potential to solve this problem, most existing methods are designed to compress factual knowledge in documents instead of procedural knowledge, making them insufficient for skill compression. In this paper, we argue that an effective skill compression method should: 1) preserve logical dependencies among workflows and tool protocols, 2) enable lightweight, offline compression for frequently updated community skills, and 3) be adaptable to varying complexities across skills. To address this, we present SKIM (SKIll coMpression), an adaptive multi-resolution soft token compression framework for procedural skills. Depending on the complexity of each skill, SKIM creates different numbers of soft tokens that not only improve the efficiency of LLM inference, but also preserve the effectiveness of skill usage. Experiments indicate that SKIM compresses skills to 30 to 60 percent of their original token length while preserving task performance better than existing compression methods. We have released our code at this https URL.

2606.12200 2026-06-11 cs.LG cs.AI New submission

Implicit Neural Representations of Individual Behavior

Andrew Kang, Priya Narasimhan

AI summary: Proposes Behavioral INR, which uses implicit neural representations to learn policy representations from unlabeled multi-policy behavioral data. FiLM layers modulate the policy function, enabling unsupervised policy identification and improving policy identifiability in continuous state-action spaces.

Comments
ICML 2026, Structured Probabilistic Inference & Generative Modeling Workshop

Abstract

We study policy representation learning from unlabeled multi-policy behavioral data. Each episode is generated by a fixed policy, but policy labels are unavailable. This setting appears in robotics play, demonstrations, games, racing, and other datasets where heterogeneous behaviors are mixed without annotations. We introduce Behavioral INR, a self-supervised generative model that adapts implicit neural representations (INRs) from vision to behavior. Instead of mapping coordinates to RGB values, Behavioral INR represents a policy as a state-action function mapping states to subsequent actions. An episode-level latent modulates this function through FiLM layers, yielding a generative prior over policies and allowing policy identity to be inferred without supervision. Because INRs treat each datapoint as a sample from an underlying function, the same model naturally accommodates variable episode lengths and different sampling granularities, as in vision INRs with different image resolutions. We also define policy-level out-of-distribution (OOD) shifts along state-distribution and action-distribution axes, which arise when policies overlap in states or actions but are not captured by standard behavioral OOD settings based only on new agents or environments. We evaluate on synthetic Gaussian random field data, MuJoCo demonstrations with controlled OOD splits, and real-world chess, Formula 1 racing, robotics, and Seek-Avoid datasets. Behavioral INR most consistently improves policy identifiability in the hardest continuous state-action settings, especially when longer episodes, more policies, and OOD splits reduce the usefulness of marginal shortcuts; amortized history encoders remain competitive when policy identity can be recovered from symbolic repetition or low-dimensional action statistics. We release code and checkpoints.
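FiLM (feature-wise linear modulation), which the abstract above uses to condition the policy function on an episode-level latent, scales and shifts each hidden feature with values computed from the conditioning vector. The NumPy sketch below shows that mechanism on a tiny one-hidden-layer network. The layer sizes, parameter names, and the tanh nonlinearity are assumptions for illustration and do not reflect the released Behavioral INR code.

```python
import numpy as np


def init_params(rng, state_dim, latent_dim, hidden_dim, action_dim):
    """Random weights for a tiny policy network (hypothetical sizes and names)."""
    return {
        "W_in": rng.normal(size=(state_dim, hidden_dim)),
        "W_gamma": rng.normal(size=(latent_dim, hidden_dim)),
        "W_beta": rng.normal(size=(latent_dim, hidden_dim)),
        "W_out": rng.normal(size=(hidden_dim, action_dim)),
    }


def film(h, z, W_gamma, W_beta):
    """Scale and shift each hidden feature using the latent z."""
    gamma = z @ W_gamma   # per-feature scale
    beta = z @ W_beta     # per-feature shift
    return gamma * h + beta


def policy(state, z, params):
    """Map a state to an action, modulated by the episode-level latent z."""
    h = np.tanh(state @ params["W_in"])
    h = film(h, z, params["W_gamma"], params["W_beta"])
    return np.tanh(h) @ params["W_out"]
```

The same weights then represent a family of state-to-action functions indexed by z, so two episodes with different latents yield different actions for the same state.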

2606.12199 2026-06-11 eess.AS cs.CL cs.SD New submission

Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation

Zhen Ye, Xu Tan, Yiming Li, Guangyan Zhang, Chimin Chan, Haohe Liu, Zhengxi Liu, Hongzhan Lin, Zheqi Dai, Xinshen Zhang, Peiwen Sun, Qiuqiang Kong, Wei Xue

AI summary: Studies the temporal-granularity mismatch behind the speech-text modality gap, introduces factorized FSQ and a lightweight non-autoregressive audio LM head to make low frame rates feasible, and finds that a 4.17 Hz frame rate with intermediate-layer representation alignment works best for speech QA.

Comments
Accepted by Interspeech 2026 long paper

Abstract

Spoken dialogue models typically start from text LLM backbones, yet reasoning often degrades when conditioning on speech instead of text. We attribute part of this modality gap to a temporal-granularity mismatch: speech tokens are temporally redundant and far longer than text under matched semantics, diluting per-token semantic density and weakening text-native reasoning dynamics. We study speech token design as a representation selection problem and sweep frame rates under a frozen LLM backbone with a fixed information rate. To make low frame rates feasible, we introduce factorized FSQ and a lightweight non-autoregressive audio LM head, scaling capacity to nearly 300\,bits/frame without sacrificing efficient prediction. With the bottleneck removed, we sweep frame rates (50$\rightarrow$2.08\,Hz) and alignment depth, and observe a consistent best regime for speech QA at 4.17\,Hz with intermediate-layer representation alignment.
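The trade-off in the abstract above follows from simple arithmetic: at a fixed information rate, lowering the frame rate shortens the token sequence but forces each frame to carry more bits. The sketch below works through this under one loud assumption, namely that the "nearly 300 bits/frame" figure corresponds to the lowest frame rate of about 2.08 Hz (taken here as 50/24), which would imply roughly 625 bits per second. That rate is inferred for illustration and is not stated in the abstract.

```python
def bits_per_frame(info_rate_bps, frame_rate_hz):
    """Bits each frame must carry to sustain a fixed information rate."""
    return info_rate_bps / frame_rate_hz


def num_frames(duration_s, frame_rate_hz):
    """Number of speech frames (tokens) for an utterance of the given length."""
    return duration_s * frame_rate_hz


# Hypothetical fixed information rate, inferred as described in the lead-in.
INFO_RATE_BPS = 625.0

# Frame rates from the sweep; 4.17 Hz and 2.08 Hz are taken as 50/12 and 50/24.
FRAME_RATES_HZ = [50.0, 50.0 / 12, 50.0 / 24]

# (frame rate, bits per frame, frames for a 10-second utterance)
table = [(rate, bits_per_frame(INFO_RATE_BPS, rate), num_frames(10.0, rate))
         for rate in FRAME_RATES_HZ]
```

Under this assumption, going from 50 Hz to about 2.08 Hz shrinks a 10-second utterance from 500 frames to about 21 while raising the per-frame payload from 12.5 to about 300 bits, which is why a higher-capacity quantizer and prediction head are needed at low frame rates.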

2606.12198 2026-06-11 cs.IR New submission

LLM-Based User Personas for Recommendations at Scale

Haoting Wang, Haokai Lu, Zheyun Feng, Jenny Huang, Yifat Amir, Gregory Hinkson, Ben Most, Zelong Zhao, Yixin Kelly Cui, Rein Zhang, Fabio Soldo, Yu Xia, Nihar Bhupalam, Minmin Chen, Konstantina Christakopoulou, Lichan Hong, Ed H. Chi

AI summary: Proposes a framework for generating LLM-based user interest personas in real time, using knowledge distillation, asynchronous inference, and semantic clustering to balance the exploitation-exploration trade-off and improve large-scale video recommendation.

Abstract

Large Language Models (LLMs) offer unprecedented potential for enhancing recommendation systems through their world knowledge and reasoning capabilities. However, existing approaches often rely on structured IDs or offline processing, limiting semantic richness, real-time adaptability, and user-facing interpretability. In this paper, we introduce a novel framework that enables real-time generation of LLM-based user interest personas for a large-scale commercial video recommendation platform. Our method generates natural-language user interest personas that address the exploitation-exploration trade-off by combining the summarization of existing interests with novel topics, directly during serving. To overcome the computational challenges of online LLM inference at a billion-user scale, we design a cost-efficient architecture leveraging knowledge distillation, asynchronous inference, and input optimization via semantically clustered video representations. Extensive offline evaluations, user studies, and live A/B tests demonstrate significant improvements in viewer value. This work bridges the gap between high-level semantic understanding and industrial-scale recommendation, paving the way for more dynamic, explainable, and satisfying personalized experiences.

2606.12195 2026-06-11 cs.CV new submission

InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

Ziang Yan, Sheng Xia, Jiashuo Yu, Yue Wu, Tianxiang Jiang, Songze Li, Kanghui Tian, Yicheng Xu, Yinan He, Kai Chen, Limin Wang, Yu Qiao, Yi Wang

Affiliations: Shanghai Innovation Institute; Shanghai AI Laboratory; Nanjing University

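AI summary: Proposes InternVideo3, a framework that strengthens long-video understanding and iterative interaction through Multimodal Contextual Reasoning (MCR) and M^2LA, an efficient KV-cache compression method, achieving strong performance on several benchmarks.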

Abstract

Recent progress in foundation models has shifted toward agentic behavior involving multi-step reasoning and tool use. However, open-source efforts largely focus on text-dominant settings, leaving long-horizon multimodal tasks underexplored. This gap is evident in video tasks requiring sustained temporal understanding and iterative interaction. We present InternVideo3, a framework enhancing these capabilities via Multimodal Contextual Reasoning (MCR). MCR treats understanding as a closed-loop process over a shared, evolving context containing observations, instructions, reasoning, tool actions, and memory. This frames long-video understanding as evidence accumulation and verification. To ensure efficiency, we introduce Multimodal Multi-head Latent Attention (M^2LA), a token-preserving reparameterization compressing KV-cache states while retaining the full token stream. Our staged training includes continued pretraining, short-to-long supervised fine-tuning, rule-based reinforcement learning, and on-policy distillation. Experiments show InternVideo3 achieves strong performance on benchmarks like Video-MME, MLVU, and EgoSchema. We further instantiate the model as a video agent with retrieval tools, demonstrating robust evidence-grounded behavior. Our results suggest that efficient context handling and closed-loop reasoning are vital for adapting open multimodal models toward long-horizon visually grounded agency.

2606.12191 2026-06-11 cs.CL cs.AI new submission

Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

Jiachun Li, Zhuoran Jin, Tianyi Men, Yupu Hao, Kejian Zhu, Lingshuai Wang, Dongqi Huang, Longxiang Wang, Shengjia Hua, Lu Wang, Jinshan Gao, Hongbang Yuan, Ruilin Xu, Kang Liu, Jun Zhao

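AI summary: Starting from the environment engineering lifecycle, this survey systematically reviews the modeling, synthesis, evaluation, and application of agentic environments, covering eight attributes and eight domains, two synthesis paradigms, four agent-evolution pathways, and three environment-evolution paradigms.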

Comments
63 pages, 10 figures

Abstract

Environments serve as interactive systems for large language model (LLM) based agents across diverse scenarios and play a crucial role in driving the continual evolution of model capabilities. Despite this importance, existing work lacks a systematic categorization and in-depth analysis. This paper systematically studies current research on agentic environments from the perspective of the environment engineering lifecycle, covering their modeling, synthesis, evaluation, and application. First, the paper introduces representative environments from the perspectives of eight attributes and eight domains, providing detailed analyses of their development paths and highlighting their core capabilities. Second, for automated environment synthesis, it introduces two paradigms, symbolic synthesis and neural synthesis, and presents the environment evaluation methods used in each. Third, it discusses the corresponding environment applications from the perspective of agent-environment co-evolution. Specifically, the paper characterizes the primary pathways for agent evolution in dynamic environments from four complementary perspectives: memory-centric experience evolution, orchestration-centric workflow evolution, trajectory-centric offline evolution, and exploration-centric online evolution. It also identifies three paradigms of environment evolution: neural-driven, difficulty-driven, and scaling-driven approaches. Finally, several promising future directions are discussed, including Environment-as-a-Service, Multi-agent Environments, and Neural-Symbolic Environments.

2606.12189 2026-06-11 cs.CV new submission

DynaTok: Token-Based 4D Reconstruction from Partial Point Clouds

Weirong Chen, Keisuke Tateno, Hidenobu Matsuki, Michael Niemeyer, Daniel Cremers, Federico Tombari

Affiliations: Technical University of Munich; Google; Imperial College London; University of Bonn

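AI summary: Proposes DynaTok, a framework that uses a Transformer spatiotemporal encoder and a flow-matching decoder to reconstruct complete, temporally consistent 4D point clouds from partial point cloud sequences, without correspondences or images.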

Comments
ICML 2026. Project page: this https URL

Abstract

We address 4D reconstruction from partial point cloud sequences, where depth-sensor observations are incomplete, unordered, and lack explicit temporal correspondences. This geometry-only setting is challenging due to missing observations and ambiguous dynamics. While recent progress has largely relied on image-based methods, existing point-based approaches typically focus on single objects, assume relatively complete inputs, or require explicit correspondences. To address these limitations, we propose DynaTok, a point-based framework for correspondence-free 4D reconstruction from partial point cloud sequences without images. DynaTok encodes frames into compact latent tokens, aggregates incomplete observations over time with a Transformer-based spatiotemporal encoder, and decouples geometry and motion through residual tokens in a unified model. A flow-matching decoder then reconstructs complete, temporally consistent 4D point-cloud sequences conditioned on the latent tokens. Experiments on object- and scene-level benchmarks demonstrate improved reconstruction quality and temporal coherence from partial point cloud observations. Project page: this https URL.

2606.12187 2026-06-11 cs.GT new submission

Strategic Facility Location with $p$-Norm Social Costs

Jabari Hastings

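AI summary: Studies strategic facility location in $\ell_q(\mathbb R^d)$ spaces with social cost defined by an arbitrary $p$-norm, analyzes the approximation ratio of the strategyproof coordinate-wise median mechanism, and gives tight bounds for $d = 2$ and upper bounds for $d \geq 3$.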

Abstract

We consider the strategic facility location problem in $\ell_q(\mathbb R^d)$ spaces where the social cost is defined by an arbitrary $p$-norm of the individual costs. While the optimal approximation ratios for deterministic strategyproof mechanisms are well established in the $d = 1$ setting, the guarantees for multi-dimensional spaces under an arbitrary $p$-norm are less understood. In this work, we analyze the well-studied, strategyproof coordinate-wise median (CM) mechanism and provide approximation guarantees for these generalized social costs. For $d = 2$, we establish tight approximation ratios for all $p, q \geq 1$. In particular, we show that the CM mechanism is a $2^{1 - 1/ \max(p, q)}$-approximation, resolving a conjecture of Goel and Hann-Caruthers (Social Choice and Welfare, 2023). Furthermore, for $d\geq 3$, we give upper bounds on the approximation ratio of the CM mechanism for arbitrary $p$-norm social costs, generalizing the recent result of Gravin and Jia (STOC, 2025) for the utilitarian social cost. Remarkably, we show that this approximation ratio never exceeds 3, regardless of the dimension.
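
As a rough illustration of the objects named in this abstract, the sketch below computes a coordinate-wise median, the $p$-norm social cost over $\ell_q$ distances, and the $d = 2$ ratio $2^{1 - 1/\max(p, q)}$ stated above. The function names and the three-agent instance are illustrative assumptions and are not taken from the paper.

```python
from statistics import median_low


def lq_distance(x, y, q):
    """Distance between points x and y in the l_q norm (q >= 1)."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def coordinate_median(points):
    """Coordinate-wise median: take the median of each coordinate separately."""
    dims = len(points[0])
    return tuple(median_low([pt[i] for pt in points]) for i in range(dims))


def social_cost(facility, points, p, q):
    """p-norm of the vector of individual l_q distances to the facility."""
    costs = [lq_distance(facility, pt, q) for pt in points]
    return sum(c ** p for c in costs) ** (1.0 / p)


def cm_bound_2d(p, q):
    """Approximation ratio quoted in the abstract for d = 2."""
    return 2 ** (1 - 1 / max(p, q))


# Hypothetical instance: three agents in the plane.
agents = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
facility = coordinate_median(agents)
cost = social_cost(facility, agents, p=1, q=2)
```

With an odd number of agents the median of each coordinate is one of the reported values, which is part of why the mechanism is easy to reason about; `median_low` is used here only to keep that property for even counts as well.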

2606.12186 2026-06-11 cs.CL new submission

A Resource for Enthymeme Detection in Controversial Political Discourse

Martial Pastor, Nelleke Oostdijk

Affiliations: Centre for Language Studies, Radboud University Nijmegen

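AI summary: Presents a resource of tweets annotated for enthymemes and their structure, with guidelines based on Walton's argumentation schemes; a complexity analysis reveals sources of annotation inconsistency, and experiments show that models trained on annotator disagreement outperform those trained on majority-vote labels.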

Comments
43 pages, to be submitted to the Language Resource and Evaluation Journal

Abstract

Enthymemes, arguments with unstated premises or conclusions, are pervasive in persuasive discourse, yet their annotation remains notoriously subjective. We present a resource of 1,482 tweets from politically controversial discourse, annotated by five annotators for the presence of enthymemes and their argument structure, designed to study label variation. We first revisit the definition of enthymemes and propose annotation guidelines anchored in Walton's argumentation schemes, offering a structured and constrained approach that nonetheless preserves room for the interpretive nature of the task. This contrasts with past resources, which tend to eliminate disagreement, obscuring its sources and preventing investigation of its potential benefits for model performance. We further propose a complexity analysis of the task, identifying where annotation imposes high cognitive load and may give rise to inconsistent annotation. Our preliminary experiments show that models trained on annotator disagreement outperform models trained on hard majority-vote labels. We close by reflecting on how structural openness in enthymeme definitions and guidelines enables the study of variation in subjective inferential processes for future resources and downstream NLP applications concerned with human inference.
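
One common way to train on annotator disagreement is to replace the majority-vote label with the empirical distribution of votes. The abstract does not say which scheme the authors use, so the sketch below is only an assumed illustration; the class names and the five votes are made up.

```python
from collections import Counter


def majority_label(votes):
    """Hard label: the most frequent vote among the annotators."""
    return Counter(votes).most_common(1)[0][0]


def soft_label(votes, classes):
    """Soft label: fraction of annotators who chose each class."""
    counts = Counter(votes)
    return [counts[c] / len(votes) for c in classes]


# Hypothetical tweet labeled by five annotators.
classes = ["enthymeme", "no_enthymeme"]
votes = ["enthymeme", "enthymeme", "enthymeme", "no_enthymeme", "no_enthymeme"]

hard = majority_label(votes)       # keeps only the winning class
soft = soft_label(votes, classes)  # keeps the 3-to-2 split as a distribution
```

The hard label discards the two dissenting votes, while the soft label keeps them as a training signal, which is the kind of information the abstract argues should not be eliminated.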

2606.12182 2026-06-11 cs.LG math.DS math.OC new submission

How Low Can You Go? Active Learning for Sparse Model Discovery in the Ultra-Low-Data Limit

Ana Larrañaga, Urban Fasel, Steven L. Brunton

Affiliations: Department of Mechanical Engineering, University of Washington; NSF AI Institute in Dynamic Systems, University of Washington; Department of Aeronautics, Imperial College London

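AI summary: To address data scarcity in equation discovery for dynamical systems in the ultra-low-data limit, proposes an E-SINDy-based active learning strategy that iteratively prioritizes sampling of informative regions; on the Lorenz, Burgers, and Kuramoto-Sivashinsky systems it identifies the dynamics accurately with less data than random sampling.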

Comments
20 pages, 10 figures

Abstract

Identifying the governing equations of complex dynamical systems remains a fundamental challenge across science and engineering. While early approaches relied on empirical data and heuristics, modern data-driven methods offer greater flexibility and fewer assumptions. However, data acquisition in real-world settings is often expensive. This work addresses this challenge by introducing an active learning strategy for dynamics discovery in the ultra-low-data limit. Rather than sampling randomly, our method iteratively prioritizes regions that are most informative for model identification. This approach builds on Sparse Identification of Nonlinear Dynamics (SINDy) and uses an ensemble extension, E-SINDy, to estimate epistemic uncertainty and guide the sampling for both ordinary and partial differential equations (ODEs/PDEs). For ODEs, an exhaustive analysis is conducted on the Lorenz system across varying data budgets and noise levels. For PDEs, two systems with contrasting dynamical characteristics are examined: Burgers' equation, where a sharp shock front creates a distinction between informative and uninformative regions, and the Kuramoto-Sivashinsky equation, which presents a more spatially complex sampling landscape. Across all scenarios, the proposed method accurately identifies the governing dynamics with significantly fewer data samples than random sampling.

2606.12179 2026-06-11 cs.DS math.NA new submission

Nearly Instance Optimal Sparse Matrix Approximation from Matrix-Vector Products

Christopher Musco, Indu Ramesh

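AI summary: Studies learning sparse approximations of an implicit matrix using only matrix-vector product queries, proposes a unified degeneracy-based framework, proves nearly tight bounds on the query complexity, and gives polynomial-time algorithms.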

Abstract

A large body of work studies the problem of learning an approximation to an implicit matrix $A\in \mathbb{R}^{m\times n}$ that is accessible only via matrix-vector product queries (matvec queries) of the form ${x} \rightarrow {A}{x}$ or ${x} \rightarrow {A}^T{x}$. Of particular interest are methods that learn a near-optimal approximation with a fixed sparsity pattern. For example, we might want to learn a near-optimal diagonal, banded, or arrow-head approximation to an implicit matrix $A$. Naturally, the number of matvec queries required to solve this problem depends on the sparsity pattern, which can be encoded as a binary matrix ${S}\in \{0,1\}^{m\times n}$. The query complexity of previous algorithms scales with quantities like the total number of ones in ${S}$, its maximum column/row sparsity, or the chromatic number of its "conflict graph". These quantities are incomparable: for a given ${S}$, parameterizing by one might yield lower query complexity than another. In this work, we unify and tighten these prior results by providing a nearly sharp characterization of the matvec query complexity of sparse matrix approximation. Generalizing a definition from graph algorithms, let the degeneracy, ${degen}({S})$, denote the smallest number $k$ so that, if we iteratively delete all rows and columns of ${S}$ with $\leq k$ ones, we are left with an empty matrix. We show that, for any sparsity pattern ${S}$, a near-optimal approximation to $A$ with sparsity pattern $S$ can be learned with $\tilde{O}({degen}({S}))$ matrix-vector product queries, and that $\Omega({degen}({S}))$ queries are necessary. Moreover, unlike prior work based on graph coloring, all of our methods run in polynomial time.
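
The degeneracy definition in this abstract can be checked directly by peeling rows and columns. The following is a naive sketch written from the abstract's wording, not the authors' algorithm; `peels_to_empty` and `degeneracy` are hypothetical names, and ones are counted within the rows and columns that still remain.

```python
def peels_to_empty(S, k):
    """Repeatedly delete rows/columns with <= k ones; report whether nothing is left."""
    rows = set(range(len(S)))
    cols = set(range(len(S[0]))) if S else set()
    changed = True
    while changed:
        changed = False
        for i in list(rows):
            if sum(S[i][j] for j in cols) <= k:
                rows.discard(i)
                changed = True
        for j in list(cols):
            if sum(S[i][j] for i in rows) <= k:
                cols.discard(j)
                changed = True
    return not rows and not cols


def degeneracy(S):
    """Smallest k for which the peeling process empties the pattern S."""
    k = 0
    while not peels_to_empty(S, k):
        k += 1
    return k


diagonal = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
arrow_head = [[1, 1, 1, 1], [1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1]]
dense = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
```

On these toy patterns the arrow-head matrix has a row and a column with four ones but still peels away at $k = 2$, which illustrates how degeneracy can be much smaller than the maximum row or column sparsity.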

2606.12171 2026-06-11 cs.CV cs.LG new submission

Beyond Dark Knowledge: Mixup-Based Distillation for Reliable Predictions

José Medina, Paul Honeine, Abdelaziz Bensrhair, Amnir Hadachi

Affiliations: ITS Lab, Institute of Computer Science, University of Tartu; LITIS, Université de Rouen; LITIS, INSA de Rouen

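AI summary: Studies the effect of teacher-student mismatch when knowledge distillation is combined with mixup training, finds that the student independently acquires linear structure while improving accuracy and calibration, and presents mixup distillation as a richer knowledge-transfer channel.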

Abstract

Knowledge Distillation (KD) and mixup have proven effective at inducing smoothness in class boundaries; KD captures inherent class relationships in probability distributions, and mixup enforces them through convex combinations of inputs. Their interaction, however, remains poorly understood, particularly when mixup is applied only during student training. In this setting, the teacher is queried on inputs drawn from a vicinal distribution it never saw during training, a controlled mismatch whose effect on knowledge transfer has not been characterised. We show that this mismatch causes the teacher's supervisory signal to be dominated by distributional confusion rather than inter-class structure. Despite this, the student does not merely imitate the teacher: it independently acquires greater linearity in the vicinal region, a structural property that the teacher lacks, and goes beyond dark-knowledge transfer. KD with mixup consistently improves student accuracy and reduces overconfidence by an order of magnitude relative to the baseline, across CIFAR and ImageNet with varying-capacity teachers. Crucially, calibration propagates from teacher to student independently of accuracy transfer, and temperature scaling governs a measurable accuracy-calibration trade-off that becomes more pronounced under vicinal training. These results reframe mixup distillation not as a degraded version of standard KD, but as a richer transfer channel that simultaneously shapes discriminative performance, uncertainty estimation, and representational geometry.
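
To make the two ingredients concrete, here is a minimal sketch of a mixup input and a temperature-scaled distillation loss. It uses the widely used formulation (convex combination of inputs; KL divergence between softened teacher and student distributions, scaled by $T^2$) as an assumption; the abstract does not give the authors' exact loss, and the logits below are made up.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mixup(x_a, x_b, lam):
    """Convex combination of two inputs with mixing weight lam in [0, 1]."""
    return [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]


def kd_loss(teacher_logits, student_logits, temperature):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Hypothetical example: a mixed input and one teacher/student logit pair.
mixed = mixup([0.0, 2.0], [2.0, 0.0], lam=0.5)
loss_same = kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], temperature=4.0)
loss_diff = kd_loss([3.0, 0.0, 0.0], [0.0, 3.0, 0.0], temperature=2.0)
```

In the setting the abstract describes, the teacher would be evaluated on `mixed`-style inputs that it never saw during its own training, and its softened output would feed the loss above.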

2606.12169 2026-06-11 cs.CV cs.AI cs.CL cs.LG new submission

OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models

Negin Baghbanzadeh, Pritam Sarkar, Michael Colacci, Abeer Badawi, Adibvafa Fallahpour, Arash Afkanpour, Leonid Sigal, Ali Etemad, Elham Dolatabadi

Affiliations: York University; Vector Institute; University of British Columbia; University of Toronto; Unity Health Toronto / St. Michael’s Hospital; University Health Network; Arc Institute; Queen's University

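AI summary: Proposes OpenMedReason, a large-scale open medical reasoning corpus of about 450K image-question-answer instances whose reasoning traces come mainly from biomedical scientific articles, together with the OpenMedReason-Bench benchmark for fine-grained evaluation; it improves model performance in both supervised fine-tuning and reinforcement-based alignment.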

Comments
42 pages, 9 figures, 24 tables. Dataset and code: this https URL

Abstract

High-stakes clinical use of large vision-language models (LVLMs) requires reasoning that is grounded in visual evidence and clinical knowledge, not just correct final answers. We introduce OpenMedReason, a large-scale, open multimodal medical reasoning corpus comprising approximately 450K image-question-answer instances whose reasoning traces are primarily derived from curated, human-authored biomedical scientific articles. OpenMedReason provides high-fidelity supervision beyond synthetic chains of thought, covering diverse medical visual modalities such as radiological scans, microscopic images, visible-light photographs, charts, and others. We complement it with OpenMedReason-Bench, a held-out benchmark that allows fine-grained evaluation of LVLMs along three complementary axes of capability (perception, medical knowledge, and rationale), enabling diagnostic evaluation beyond final-answer accuracy. OpenMedReason is a rich training resource that is effective in both supervised fine-tuning (SFT) and reinforcement-based alignment. Training with OpenMedReason yields a 20% average improvement in VQA accuracy over the base model and achieves performance within 4.2% of the strongest comparable-scale medical LVLMs. Fine-grained performance analysis confirms that the gains are not concentrated in any single axis: OpenMedReason improves perception, medical knowledge, and rationale jointly, and its reasoning traces are preferred over those of the base model in 86.1% of pairwise comparisons. We release the code and dataset at this http URL.

2606.12167 2026-06-11 cs.GT new submission

Shared Infrastructure Investment and Pricing: Stackelberg Equilibria in Risk-Aware Take-or-Pay Contracts

Amal Sakr, Andrea Araldo, Tamer Başar, Tijani Chahed

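AI summary: Studies shared infrastructure investment and pricing between an infrastructure provider and multiple risk-averse firms under uncertain revenues, using a Stackelberg game and a Conditional Value-at-Risk model; proves equilibrium existence and gives a polynomial-time approximation algorithm.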

Abstract

We study a shared infrastructure deployed by an Infrastructure Provider (InP) and used by multiple firms that generate revenues through resource usage. We focus on a challenging setting where: (i) infrastructure deployment requires substantial upfront investment, which the InP must recover via payments by firms that depend on their uncertain future revenues; (ii) firms' resource usage is jointly influenced by exogenous factors, infrastructure pricing, operational costs, and resource congestion; and (iii) firms exhibit heterogeneous risk aversion. This setting is typical in emerging technologies, e.g., Mobile Edge Computing (MEC). We formalize this setting as a novel Stackelberg game with risk-aware take-or-pay contracting and firm-side operational and congestion costs, in which the InP acts as the leader and jointly optimizes capacity dimensioning and access pricing, while firms act as followers that share the infrastructure and commit upfront to future resource usage under uncertain revenues. Followers' heterogeneous risk aversion is modeled through Conditional Value-at-Risk (CVaR). We prove the existence of a Stackelberg equilibrium (SE), in which the followers' decisions constitute a generalized Nash equilibrium, and develop a polynomial-time algorithm that computes an approximate SE with a bounded optimality gap. We also derive a lower bound on the followers' Probability of Profit (PoP). Monte Carlo simulations for a MEC case study show that higher followers' risk aversion reduces infrastructure capacity, pricing, and leader profit, while increasing followers' PoP.
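
For readers unfamiliar with Conditional Value-at-Risk, the sketch below shows a simple empirical estimator: the average of the worst $(1 - \alpha)$ fraction of sampled losses. This is a textbook-style illustration under assumed sample data, not the formulation used inside the paper's game; `cvar` and the loss samples are hypothetical.

```python
import math


def cvar(losses, alpha):
    """Empirical CVaR at level alpha: mean of the worst (1 - alpha) share of losses."""
    ordered = sorted(losses, reverse=True)
    # Small tolerance guards against floating-point noise in (1 - alpha) * n.
    tail = max(1, math.ceil((1 - alpha) * len(ordered) - 1e-9))
    return sum(ordered[:tail]) / tail


# Hypothetical firm: ten sampled losses (negative profit) under uncertain revenue.
losses = list(range(1, 11))

risk_neutral = cvar(losses, alpha=0.0)  # plain average over all samples
risk_averse = cvar(losses, alpha=0.8)   # average of the two worst samples
```

A larger `alpha` focuses the firm on its worst outcomes, which is one way to encode the heterogeneous risk aversion mentioned in the abstract: each firm would simply use its own `alpha`.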

2606.12160 2026-06-11 cs.CL new submission

A Controlled Study of Decoding-Time Truthfulness Methods on Instruction-Tuned LLMs

Ao Sun

Affiliations: Independent Researcher

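AI summary: Proposes the CHAIR framework, which detects hallucinations by analyzing features of per-layer token logits, and significantly improves zero-shot detection accuracy on TruthfulQA and MMLU.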

Abstract

In this work, we introduce CHAIR (Classifier of Hallucination As ImproveR), a supervised framework for detecting hallucinations by analyzing the internal logits from each layer for every token. Our method extracts a compact set of features (such as maximum, minimum, mean, standard deviation, and slope) from the token logits across all layers, enabling effective hallucination detection without overfitting. Experiments on the TruthfulQA and MMLU datasets demonstrate that CHAIR significantly improves detection accuracy, particularly in zero-shot scenarios, showcasing its robustness and generalizability. Beyond hallucination detection, CHAIR highlights the potential of using internal representations for designing advanced decoding strategies. By leveraging patterns in logits, we suggest that more sophisticated models and adaptive decoding methods could further reduce hallucinations and enhance text completion quality. CHAIR not only offers a practical solution for detecting hallucinations but also lays the groundwork for exploring richer representations in LLMs to improve their factuality and coherence.
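
The feature set listed in this abstract can be sketched for a single token as follows. The input is assumed to be that token's logit read out at each layer; how CHAIR obtains those per-layer logits, and how it defines "slope", are not stated in the abstract, so the least-squares slope and the name `layer_features` are assumptions.

```python
from statistics import mean, pstdev


def layer_features(logits_per_layer):
    """Summarize one token's logit trajectory across layers (needs >= 2 layers)."""
    n = len(logits_per_layer)
    x_bar = (n - 1) / 2               # mean of layer indices 0..n-1
    y_bar = mean(logits_per_layer)
    # Least-squares slope of logit versus layer index.
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(logits_per_layer))
    den = sum((x - x_bar) ** 2 for x in range(n))
    return {
        "max": max(logits_per_layer),
        "min": min(logits_per_layer),
        "mean": y_bar,
        "std": pstdev(logits_per_layer),
        "slope": num / den,
    }


# Hypothetical token whose logit rises steadily over four layers.
features = layer_features([0.0, 1.0, 2.0, 3.0])
```

Five numbers per token give a small, fixed-size input for a downstream classifier regardless of model depth, which fits the abstract's emphasis on a compact feature set.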

2606.12154 2026-06-11 cs.PF new submission

The Brain That Goes Quiet: Serving a Large Model's Knowledge at 131 Tokens per Second on an 8 GB Laptop by Removing the Large Model from the Runtime Path

Myeong Jun Jo

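AI summary: Proposes an offline knowledge-store approach in which a large model (35B MoE) builds a structured knowledge base and only a lightweight router and a 1B small model run at query time; on an 8 GB laptop, end-to-end response time drops from 4.4 s to 0.5 s and throughput rises to 131 tokens/s.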

Comments
17 pages, 5 figures

Abstract

In earlier work I showed that a 35B-class Mixture-of-Experts model can be loaded and executed on a consumer laptop with 8 GB of GPU memory. That result solved a placement problem and immediately exposed a different one: even correctly placed, the large model needed roughly four seconds to answer, because it was still being invoked at every query. This paper documents what happened when I stopped invoking it. During an offline phase, the large model reads source documents and writes verified answer entries into a structured knowledge store; at runtime, only a lightweight router, a deterministic renderer, and a 1B-class model are active. On the same 8 GB laptop, end-to-end response time fell from approximately 4,465 ms to 518 ms, effective end-to-end throughput rose from 15.7 to 131 tokens per second, and the small model's streaming decode rate held at 226-237 tokens per second with a time-to-first-token of 29-62 ms. The bottleneck is structural: three different large models (Qwen, Gemma, and GLM class) all showed the same multi-second runtime cost, and all three produced usable knowledge stores offline. On a 563-entry store built from seventeen real documents, keyword routing collapsed to 1.5% top-1 accuracy while BM25-based routing reached 92.8% (99.4% top-3), and a confidence gate raised effective top-1 to 98.0% by escalating 12.3% of queries. Exact-match fidelity of the small model ranged from 9/9 to 0/9 across envelope formats carrying identical content. A 16-case verification gate blocked all ten corrupted entries while admitting all six supported ones.
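
The routing step described in this abstract can be sketched with a plain BM25 scorer plus a confidence gate. The BM25 formula below is the standard Okapi variant; the margin-based gate, the whitespace tokenizer, and the three store entries are assumptions made for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document for a whitespace-tokenized query."""
    tokenized = [d.lower().split() for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            n_t = doc_freq[term]
            idf = math.log((len(docs) - n_t + 0.5) / (n_t + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def route(query, docs, min_margin=0.1):
    """Return (best entry index, confident?); low-confidence queries would be escalated."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=scores.__getitem__, reverse=True)
    top, runner_up = scores[ranked[0]], scores[ranked[1]]
    confident = top > 0 and (top - runner_up) >= min_margin * top
    return ranked[0], confident


# Hypothetical three-entry knowledge store.
store = ["reset the router password", "battery replacement steps", "warranty claim form"]
hit = route("how to reset password", store)
miss = route("quantum flux", store)
```

A query that matches nothing, or that matches two entries almost equally, comes back with `confident` set to `False`; in the paper's design such queries are the ones escalated instead of being answered from the store.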

2606.12153 2026-06-11 cs.CV cs.GR new submission

TopoCap: Learning Topology-Agnostic Motion Priors for Monocular Video-to-Animation

Cheng-Feng Pu, Jia-Peng Zhang, Meng-Hao Guo, Yan-Pei Cao, Shi-Min Hu

Affiliations: Zhili College, Tsinghua University; BNRist, Department of Computer Science and Technology, Tsinghua University; VAST

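AI summary: Proposes TopoCap, the first unified framework that extracts motion from monocular video and retargets it to characters with arbitrary unseen skeletal topologies without test-time optimization, by learning a universal motion manifold with a Graph CVAE and using conditional flow matching.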

Abstract

The explosion of generative 3D assets has created a massive demand for animation, yet current motion capture methods remain brittle, restricted to species-specific templates (e.g., SMPL) or requiring labor-intensive manual rigging. We introduce TopoCap, the first unified framework capable of extracting motion from monocular video and retargeting it onto characters with arbitrary, unseen skeletal topologies, from bipeds to hexapods and inanimate objects, without test-time optimization. Our key insight is that while skeletal structures are combinatorial and discrete, the underlying physics of motion occupies a continuous, low-dimensional manifold. We materialize this insight via a two-stage generative pipeline. First, we learn a Universal Motion Manifold using a Graph CVAE that compresses heterogeneous kinematic chains into a shared, fixed-length latent code. By explicitly conditioning the decoder on a structural embedding of the target rig, we disentangle motion dynamics from skeletal topology. Second, we treat video-to-animation as a conditional flow matching problem, predicting these topology-agnostic codes from visual features. To learn this generalized prior, we introduce Mobjaverse, a massive-scale dataset curated from Objaverse-XL. Comprising over 5,000 unique skeletal topologies and 2 million frames, it exceeds the structural diversity of existing datasets by two orders of magnitude. Extensive experiments demonstrate that TopoCap outperforms specialist models on human and quadruped benchmarks while enabling zero-shot retargeting for the long tail of 3D creatures. The dataset is publicly available at this https URL.

2606.12149 2026-06-11 cs.GT new submission

Do Not Discretize, Optimize: Almost Greedy Fictitious Play

Evangelos Markakis, Christodoulos Santorinaios

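AI summary: Proposes Almost Greedy Fictitious Play, a variant that greedily optimizes the step size over a constrained search space and achieves an instance-dependent O(1/T) convergence rate for the duality gap in zero-sum games.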

Comments
18 pages, 7 figures

Abstract

Our work revolves around Fictitious Play, one of the first iterative methods known to converge to a Nash equilibrium in zero-sum games. In recent years, applications in various machine learning problems have revived interest in the method, motivating a line of work on its convergence properties and on new variants of the original algorithm. Our paper follows this direction and introduces a new variant, which we refer to as Almost Greedy Fictitious Play. The proposed algorithm greedily attempts to find the optimal step size at each iteration, but its search space is constrained and includes almost all of the line between the cumulative mixed strategy and the current best response. Our main result is that the method achieves an instance-dependent convergence rate of $\mathcal{O}(1/T)$ with respect to the duality gap. This matches the rate of Continuous Fictitious Play and offers an alternative to discretization. We complement our theoretical findings with experiments that demonstrate the effectiveness of the method.
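
As background for this abstract, the sketch below implements classic Fictitious Play (uniform averaging of past best responses) on a zero-sum matrix game and measures the duality gap of the averaged strategies. It is the baseline the paper builds on, not the Almost Greedy variant, whose step-size search is not specified in the abstract; the function names and the matching-pennies instance are illustrative assumptions.

```python
def fictitious_play(A, iterations):
    """Classic Fictitious Play; the row player maximizes x^T A y, the column player minimizes."""
    m, n = len(A), len(A[0])
    row_counts, col_counts = [0] * m, [0] * n
    row_counts[0] += 1  # arbitrary first moves
    col_counts[0] += 1
    for _ in range(iterations - 1):
        # Each player best-responds to the opponent's empirical play so far.
        row_payoffs = [sum(A[i][j] * col_counts[j] for j in range(n)) for i in range(m)]
        col_payoffs = [sum(A[i][j] * row_counts[i] for i in range(m)) for j in range(n)]
        row_counts[max(range(m), key=row_payoffs.__getitem__)] += 1
        col_counts[min(range(n), key=col_payoffs.__getitem__)] += 1
    x = [c / iterations for c in row_counts]
    y = [c / iterations for c in col_counts]
    return x, y


def duality_gap(A, x, y):
    """Best-response value against y minus best-response value against x (zero at equilibrium)."""
    m, n = len(A), len(A[0])
    best_row = max(sum(A[i][j] * y[j] for j in range(n)) for i in range(m))
    best_col = min(sum(A[i][j] * x[i] for i in range(m)) for j in range(n))
    return best_row - best_col


# Matching pennies: the unique equilibrium mixes both actions equally.
A = [[1, -1], [-1, 1]]
x, y = fictitious_play(A, 2000)
gap = duality_gap(A, x, y)
```

Here every iterate moves toward the current best response with the fixed step size $1/(t+1)$; the variant described in the abstract instead chooses that step size by optimization over a constrained segment.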

2606.12147 2026-06-11 cs.AI New submission

Towards Responsibly Non-Compliant Machines

迈向负责任的不合规机器

Marija Slavkovik, Marie Farrell, Louise Dennis, Michael Fisher, Simon Kolker, Emily C. Collins (University of Manchester, Manchester, United Kingdom)

Affiliations: University of Bergen; University of Manchester

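AI summary: Studies how to engineer autonomous agents that can responsibly refuse user requests, proposing a compliance framework based on justifications, override mechanisms, and tracking of risk and liability.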

Details
Comments
Presented at AAMAS-26 Workshop on Rebellion and Disobedience in AI this https URL
AI Chinese abstract

我们考虑工程化能够负责任地不遵守用户请求的自主智能体的问题。我们认为机器不合规有多种不同形式,并勾勒出在实现负责任不合规智能机器的道路上应追求的问题。我们将负责任的不合规锚定在任务拒绝的理由、覆盖不合规的途径,以及安全风险和责任转移的仔细追踪上。

English abstract

We consider the problem of engineering autonomous intelligent agents that are capable of responsibly not complying with user requests. We argue that machine non-compliance comes in many different forms, and sketch the issues we should pursue on the road to achieving responsibly non-compliant intelligent machines. We anchor responsible non-compliance in justifications for task refusal, pathways to override the non-compliance, and careful tracking of security risks and liability transfers.

2606.12146 2026-06-11 cs.LG cs.AI New submission

nD-RoPE: A Generalized RoPE for n-Dimensional Position Embedding

nD-RoPE:一种用于n维位置嵌入的广义RoPE

Boyang Li, Yulin Wu, Sizhe Xu, Nuoxian Huang, Zhonghang Yuan, Shangyi Guo, Shu Yang, Takahiro Yabe

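AI summary: Proposes nD-RoPE, which generalizes rotary position embedding to arbitrary dimensions, achieves isotropy through a multi-scale regular-simplex wave-vector design, and improves performance on image, video, and point-cloud tasks.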

Details
Comments
Accepted to the 43rd International Conference on Machine Learning (ICML 2026)
AI Chinese abstract

旋转位置嵌入(RoPE)在Transformer模型中被广泛采用,但其向高维域的扩展缺乏统一的理论表述。大多数现有方法要么沿每个轴独立应用旋转,要么经验性地混合频率,这限制了跨维交互并产生方向相关的表示。为了解决这些限制,我们提出了nD-RoPE,一种将RoPE推广到任意维度的无分解泛化。从连续希尔伯特空间中的平移不变表述出发,我们推导出各向同性的谱条件,要求将位置和频率视为耦合的\(n\)维向量。我们通过多尺度正则单纯形波矢设计实例化该表述,提供了非退化的空间覆盖和对称、方向平衡的二阶响应。在图像、视频和点云上的实验表明,在高维设置中性能持续提升且泛化能力增强。

English abstract

Rotary Position Embedding (RoPE) is widely adopted in Transformer models, yet its extension to high-dimensional domains lacks a unified theoretical formulation. Most existing approaches either apply rotations independently along each axis or empirically mix frequencies, which limits cross-dimensional interactions and yields direction-dependent representations. To address these limitations, we propose nD-RoPE, a decomposition-free generalization of RoPE to arbitrary dimensions. From a translation-invariant formulation in continuous Hilbert space, we derive a spectral condition for isotropy that requires treating positions and frequencies as coupled \(n\)-dimensional vectors. We instantiate this formulation with a multi-scale regular-simplex wave-vector design, which provides non-degenerate spatial coverage and a symmetric, directionally balanced second-order response. Experiments across images, videos, and point clouds demonstrate consistent performance gains and improved generalization in high-dimensional settings.
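The idea of treating positions and frequencies as coupled n-dimensional vectors can be illustrated with a short Python sketch in which each feature pair is rotated by the angle given by the inner product of a wave vector and the position. This is not the paper's regular-simplex construction: `rope_nd` and `axis_wise_wave_vectors` are hypothetical names, and the wave vectors shown are the axis-aligned baseline that the abstract describes as the existing approach, using the conventional RoPE frequency base of 10000.

```python
import numpy as np

def rotate(v, angles):
    """Rotate consecutive feature pairs (v[2k], v[2k+1]) by angles[k]."""
    pairs = v.reshape(-1, 2)
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.stack(
        [pairs[:, 0] * cos - pairs[:, 1] * sin,
         pairs[:, 0] * sin + pairs[:, 1] * cos],
        axis=1,
    )
    return out.reshape(-1)

def rope_nd(v, position, wave_vectors):
    """Rotary embedding with angle_k = <wave_vectors[k], position>.

    wave_vectors has shape (len(v) // 2, n_dims); position has shape (n_dims,).
    """
    angles = wave_vectors @ np.asarray(position, dtype=float)
    return rotate(v, angles)

def axis_wise_wave_vectors(n_dims, pairs_per_axis, base=10000.0):
    """Axis-aligned wave vectors: each feature pair depends on one coordinate."""
    freqs = base ** (-np.arange(pairs_per_axis) / pairs_per_axis)
    W = np.zeros((n_dims * pairs_per_axis, n_dims))
    for axis in range(n_dims):
        W[axis * pairs_per_axis:(axis + 1) * pairs_per_axis, axis] = freqs
    return W
```

For any choice of wave vectors, the attention score between a rotated query and key depends only on the difference of their positions, so shifting both positions by the same offset leaves the score unchanged. Replacing the axis-aligned matrix with wave vectors that point in mixed directions is where a coupled design such as the one the abstract describes would plug in.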