2606.11552 2026-06-11 cs.CL cs.LG New submission

Teaching Diffusion to Speculate Left-to-Right

Lexington Whalen, Yuki Ito, Ryo Sakamoto

AI summary: To address the inference bottleneck of autoregressive decoding, the paper proposes three training-time interventions (positional weighting, a first-error focal loss, and a chain loss) that bridge the asymmetry between the bidirectional generation of block-diffusion draft models and the left-to-right verification of autoregressive target models, significantly increasing accepted draft length.

Comments: 13 pages, technical report

Abstract

Large language models (LLMs) achieve remarkable performance across a wide range of tasks, but their autoregressive decoding process incurs substantial inference costs due to inherently sequential token generation. Speculative decoding addresses this bottleneck by employing a lightweight draft model to propose multiple future tokens that are subsequently verified in parallel by a larger target model. Recent work has demonstrated that diffusion language models are well suited for this setting, as they can generate entire blocks of draft tokens in parallel and thereby alleviate the sequential constraints of autoregressive drafting. A subtlety of this regime is that block-diffusion drafters generate tokens bidirectionally within a block, whereas verification is performed by an autoregressive target model that evaluates tokens in a strictly left-to-right manner, leaving a gap between the symmetric training-time objective and the asymmetric verification-time reward. In this work, we offer an empirical analysis of three training-time interventions that narrow this gap: token positional weighting, a first-error focal loss that targets the position that breaks the accepted prefix within each block, and a chain loss term that substitutes a differentiable surrogate for the expected accepted length. The three interventions act along orthogonal axes (position, block-conditional first error, joint prefix) and compose additively; they are likewise orthogonal to test-time alignment mechanisms such as multi-draft self-selection, with which they can in principle be combined. Across four target models and six reasoning, code, and dialogue benchmarks, the three interventions raise accepted draft length by 21-76% per benchmark over a position-uniform baseline, without adding additional forward passes and without changing the inference pipeline or the rejection-sampling exactness contract.
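
The left-to-right asymmetry described in this abstract can be illustrated with a small sketch. This is not the paper's loss; it only shows how accepted draft length depends on position, under the simplifying assumption that each position is accepted independently with a known probability. The function names are hypothetical.

```python
from itertools import accumulate
from operator import mul


def accepted_prefix_length(accepted):
    """Count draft tokens kept: verification stops at the first rejection."""
    length = 0
    for ok in accepted:
        if not ok:
            break
        length += 1
    return length


def expected_accepted_length(accept_probs):
    """Sum over k of the probability that positions 1..k are all accepted.

    Assumes independent per-position acceptance (an illustrative
    simplification, not a claim about the paper's chain loss).
    """
    return sum(accumulate(accept_probs, mul))


# A rejection at position 3 discards the correct token at position 4.
print(accepted_prefix_length([True, True, False, True]))

# The same per-position quality is worth more early in the block.
print(expected_accepted_length([0.9, 0.5, 0.5]))
print(expected_accepted_length([0.5, 0.5, 0.9]))
```

Under this assumption an error at an early position costs every later token in the block, which is the motivation the abstract gives for weighting positions non-uniformly.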

2606.11534 2026-06-11 physics.ao-ph cs.LG New submission

Urban Heat MiniCubes: An AI-Ready dataset for urban heat research

Jonathan Starfeldt, Maria J. Molina, Alexander Kerr, Adam Yang, Thomas R.H. Holmes, Christopher R. Hain

AI summary: Introduces the Urban Heat MiniCubes dataset, which integrates multi-source satellite data (Landsat 8/9, Sentinel-1, GOES-R, and others) into 90 x 90 km gridded data cubes for 48 cities, supporting machine learning applications in urban heat research.

Comments: 53 pages, 26 figures, Submitted to Nature Scientific Data

Abstract

Urban heat is amplified by impermeable surfaces and heterogeneous built environments, yet street-level variability remains difficult to quantify because multi-sensor observations are rarely available in consistent, analysis-ready form at the necessary spatiotemporal scales. We present "Urban Heat MiniCubes," a publicly available, FAIR-oriented dataset designed for machine learning applications in urban heat research. The dataset provides harmonized 90 x 90 km gridded data cubes for 48 cities in the Western Hemisphere spanning 2022-2023, with variables reprojected and collocated to a common grid to reduce preprocessing (e.g., reprojection, resampling, and spatiotemporal alignment). Urban Heat MiniCubes includes two complementary modalities: (i) higher-spatial-resolution, lower-frequency observations from Landsat 8/9 (e.g., surface reflectances) and Sentinel-1 (e.g., synthetic aperture radar backscatter), and (ii) higher-temporal-frequency, coarser observations from GOES-R (e.g., longwave infrared brightness temperatures) and a microwave land surface temperature product. We document variables and metadata and provide technical assessment using inter-variable analyses and autoencoder-based reconstruction-error summaries across pixel classes (e.g., water and cloud). Potential use cases and limitations are also discussed.

2606.11533 2026-06-11 cs.CY cs.AI cs.ET cs.LG New submission

AI Researchers Must Help Lead Arms Control to Mitigate Military AI Risks

Ted Fujimoto, Jacob Benz

AI summary: Argues that AI researchers should take a leading role in arms control research, drawing on lessons from nuclear deterrence to drive technical innovation in verification and diplomacy and to reduce the pressing risks of military AI applications.

Comments: 9 pages, 1 figure, ICML 2026 Position Paper

Abstract

The advancement of AI capabilities compels researchers and the public to be more aware of its potential worldwide impact. A pressing near-term concern is the regulation of military AI applications. Armament manufacturers and defense contractors are increasingly investing in AI capabilities and forging partnerships with AI companies, creating a burgeoning coalition that demands military leaders, arms control diplomacy experts, and AI researchers collaborate to ensure a safer future. While AI researchers often focus on the long-term implications of superintelligent AI, this approach may not adequately address the immediate challenges posed by AI in military applications. Success requires acknowledging and mitigating the emerging risks of frontier AI models that plan to be integrated into defense applications, like military AI systems. Arms control has reduced past catastrophic risks, so lessons learned from nuclear deterrence can guide AI safety and security research towards innovations in verification and diplomacy. AI researchers, however, must assist in leading the technical research that clearly defines and alleviates instability in military settings. Given these new responsibilities and the lack of sufficiently reliable solutions, we argue that AI researchers must take a leading role in advancing arms control research to minimize risk in military AI applications.

2606.11529 2026-06-11 cs.GR cs.CV cs.PF New submission

XPR: An Extensible Cross-Platform Point-Based Differentiable Renderer

Steve Rhyner, Sankeerth Durvasula, Aleksandr Kovalev, Hansel Jia, Adrian Zhao, Mrutunjayya Mrutunjayya, Nilesh Ahuja, Selvakumar Panneer, Christina Giannoula, Nandita Vijaykumar

AI summary: Introduces the XPR framework, whose high-level programming interface and modular rendering pipeline let new methods such as 3DGS be implemented in little code and run across platforms via the XLA compiler.

Abstract

Point-based differentiable rendering underpins modern 3D reconstruction, novel-view synthesis, and learning-based graphics pipelines, but developing new rendering methods often requires extensive low-level implementation, hardware-specific kernels, and manually written backward passes. This limits rapid prototyping, reproducibility, exploration, and deployment, especially across diverse hardware platforms. This paper presents XPR, an extensible cross-platform framework for point-based differentiable rendering. XPR introduces a high-level programming interface that separates method-specific logic from the shared rendering pipeline, allowing users to implement new methods in a few lines of code. Its pipeline decomposes rendering into modular, statically shaped parallel operations that can be lowered by a cross-platform compiler to GPUs, TPUs, CPUs, and other ML accelerators. We demonstrate implementations of 3DGS, 3DGUT, and LinPrim in only a few hundred lines of Python code, each of which can be compiled to a range of hardware platforms with the XLA compiler. These results show that XPR enables fast experimentation and portable execution for emerging point-based differentiable rendering systems.

2606.11520 2026-06-11 cs.CL cs.AI cs.LG New submission

ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

Siyuan Luo, Nairong Zheng, Lin Zhou, Tiankuo Yao, Shengyou Yuan, Haojia Yu, Cong Pang, Jiapeng Luo, Lewei Lu

AI summary: Proposes ISE, a three-stage paradigm that generates multi-turn agent trajectories through structured intent construction, role-locked user simulation, and a real execution environment; fine-tuning on the resulting data significantly improves agent tool-use performance.

Comments: 13 pages, 6 figures. Dataset and code: this https URL

Abstract

Training capable OS agents requires data that simultaneously captures structured user intents, multi-turn task delegation, and grounded tool execution--properties absent from existing datasets. We propose ISE (Intent -> Simulate -> Execute), a three-stage synthesis paradigm that addresses these gaps jointly. Stage 1 constructs roughly 50000 structured intents via a 4D framework (Persona x Domain x Task x Complexity); after deduplication the pool contains 43956 unique intents and attains a Vendi Score of 61.57 over the entire pool on mpnet-base-v2 embeddings (cosine kernel, q=1). Stage 2 drives multi-turn user-agent interaction through a role-locked user simulator that grounds each user turn in actual execution outcomes, producing 23132 complete trajectories averaging 8.12 user turns and 68.24 total dialogue turns. Stage 3 runs every tool call inside a live, isolated OS workspace, generating authentic failure-recovery dynamics instead of simulated responses. Fine-tuning on ISETrace improves ClawEval pass@1 from 19.3 to 37.7 using Qwen3-8B on agent tool-use tasks with a standard protocol. This result outperforms zero-shot GPT-4o and the Qwen3-32B base model, which is four times larger. An ablation on Stage 2 shows that multi-turn simulation accounts for a large portion of the performance gain. We release all source code and the dataset at this https URL.

2606.11505 2026-06-11 cs.CV cs.AI cs.CR New submission

On the Study of Biometric Spoofing Detection using Deep Learning

Kumar Kartikey, Nikos Komninos

AI summary: Evaluates MobileNetV2, DenseNet-121, Inception-v3, and STD models for spoofing detection in facial recognition systems; MobileNetV2 performs best with 92% accuracy and is suited to practical applications.

Abstract

Biometric systems are increasingly deployed in security applications; however, they remain vulnerable to spoofing attacks, in which attackers exploit counterfeit biometric data to gain unauthorized access. This research evaluates the effectiveness of state-of-the-art machine learning models, MobileNetV2, DenseNet-121, Inception-v3, and Spoof Trace Disentanglement (STD) in detecting spoofing attacks within facial recognition systems. Using the CelebA-Spoof dataset, the study evaluates model effectiveness using metrics such as accuracy, precision, recall, and F1 Score. Cross-dataset validation is carried out on the MSU-MFSD dataset to assess generalizability. The results show MobileNetV2 as the most efficient model, achieving 92% accuracy while balancing computational effectiveness, making it appropriate for real-life applications. Inception-v3 shows moderate robustness, while DenseNet-121 and STD struggle with generalization. The findings highlight the need for advances in domain adaptation and hybrid architectures to enhance biometric security systems.
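
The four metrics named in this abstract can all be computed from binary confusion counts. The sketch below treats "spoof" as the positive class, which is an assumption; the counts in the usage line are made up for illustration and are not results from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from binary confusion counts.

    tp/fp/fn/tn are true/false positives/negatives; empty denominators
    yield 0.0 rather than raising.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only.
print(classification_metrics(tp=8, fp=2, fn=2, tn=8))
```

Accuracy alone can hide a weak spoof detector when live samples dominate a dataset, which is one reason to report precision, recall, and F1 alongside it.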

2606.11500 2026-06-11 eess.IV cs.CE cs.IT cs.LG q-bio.NC New submission

FlexiBrain: Resolution-Agnostic Voxel-Level Encoding for Native fMRI

Mo Wang, Wenhao Ye, Junfeng Xia, Minghao Xu, Hongkai Wen, Quanying Liu

AI summary: Proposes FlexiBrain, a resolution-agnostic voxel-level encoding framework based on Mamba-JEPA that processes native fMRI data directly through dynamic patch resizing, avoiding destructive spatial standardization; it improves performance by up to 12 percentage points across five downstream tasks and substantially reduces preprocessing cost.

Abstract

The success of large-scale deep learning models in neuroscience is fundamentally constrained by severe data heterogeneity. Native fMRI data aggregated from diverse sources exhibit substantial variation in both spatial and temporal resolutions. Consequently, most existing frameworks rely on lengthy, rigid preprocessing pipelines that enforce uniformity across datasets. This practice introduces two critical limitations: (1) potential degradation of subject-specific anatomical information; (2) significant computational overhead, often requiring hours of processing per subject. Here, we propose FlexiBrain, a resolution-agnostic voxel-level encoding framework for native fMRI based on Mamba-JEPA. FlexiBrain defines patch sizes in real-world physical units and employs dynamic patch resizing, thereby bypassing destructive spatial standardization while enabling direct ingestion of data in native space. We instantiate the framework using an efficient Mamba-JEPA backbone to model high-dimensional 4D fMRI signals. Across five diverse downstream neuroscience tasks, FlexiBrain consistently outperforms recent state-of-the-art methods, achieving gains of up to 12 percentage points without external data augmentation. Importantly, FlexiBrain functions as a seamless plug-in module, substantially reducing preprocessing costs and accelerating the development of robust voxel-level fMRI foundation models. Code is available at this https URL.

2606.11490 2026-06-11 cs.LG eess.SY New submission

OmniLoc: A Geometry-Aware Foundation Model for Anchor-Free UE Localization Across Diverse Indoor Environments

Lei Chu, Yuning Zhang, Omer Gokalp Serbetci, Anushka Katiyar, Bassel Abou Ali Modad, Andreas F. Molisch

AI summary: Proposes OmniLoc, the first foundation model built on wireless measurements, which combines unified input tokenization, a geometry-aware Transformer, and a geometry-aware location estimation module to achieve robust anchor-free localization across indoor environments, significantly outperforming existing methods.

Abstract

Indoor localization from wireless measurements remains challenging in large-scale deployments due to substantial variation in building geometry, the set of detectable access points (APs), and the heterogeneity of received signals. Existing learning-based methods often perform well only in limited settings and degrade under environmental shifts, making robust anchor-free localization across diverse indoor environments notoriously difficult. In this paper, we present OmniLoc, an environment-interactive foundation model for anchor-free user equipment localization across diverse indoor environments. To the best of our knowledge, OmniLoc is the first foundation-model-based approach built directly on wireless measurements for this task. OmniLoc is built on three key designs. First, a unified input tokenization module converts heterogeneous wireless measurements into a common representation that is more amenable to learning. Second, a geometry-aware Transformer performs AP-aware feature extraction by emphasizing dominant APs while aggregating complementary evidence from supporting APs. Third, a geometry-aware location estimation module conditions regression on geometric embeddings to produce geometrically consistent location predictions. We evaluate OmniLoc on both a large-scale in-house dataset and a public benchmark dataset. Results show that OmniLoc significantly outperforms existing methods, consistently improves existing backbones when its design components are integrated, and demonstrates strong generalization in cross-environment evaluations.

2606.11482 2026-06-11 cs.SI cs.CL New submission

Building Social World Models with Large Language Models

Haofei Yu, Yining Zhao, Guanyu Lin, Jiaxuan You

AI summary: Proposes the Social World Model (SWM) framework, which uses LLMs to mine temporal patterns in social data and learn state-transition functions for social beliefs without human annotations or census data, outperforming time-series foundation models on a prediction-market benchmark.

Comments: 9 pages. ICML 2026

Abstract

Understanding and predicting how social beliefs evolve in response to events -- from policy changes to scientific breakthroughs -- remains a fundamental challenge in social science. Given LLMs' commonsense knowledge and social intelligence, we ask: Can LLMs model the dynamics of social beliefs following social events? In this work, we introduce the concept of the Social World Model (SWM), a general framework designed to capture how social beliefs evolve in response to major events. SWM learns state-transition functions for social beliefs by mining temporal patterns in social data and optimizing the evidence lower bound, without the need for explicit human annotations linking events to belief shifts, or for expensive census data. To evaluate SWM, we introduce a benchmark, SWM-bench, derived from real-world prediction markets, specifically Kalshi and Polymarket. SWM-bench includes over 12k data points for social belief prediction tasks spanning diverse domains such as politics, finance, and cryptocurrency. Our experimental results show that SWM significantly outperforms time-series foundation models, achieving state-of-the-art results on Kalshi data and demonstrating competitive performance on Polymarket data, while offering interpretable insights into the underlying mechanisms of social belief dynamics.

2606.11474 2026-06-11 cs.LG eess.SY physics.acc-ph New submission

Mahalanobis-Guided Latent OOD Detection for Hybrid ES-DRL Control in Time-Varying Systems

Shaifalee Saxena, Alexander Scheinker

AI summary: To address the performance degradation of reinforcement learning controllers in time-varying systems, proposes out-of-distribution detection based on Mahalanobis distance in a variational autoencoder latent space, enabling adaptive switching to an extremum seeking controller; the approach is validated on particle accelerator control.

Abstract

In this paper, we study Mahalanobis-guided latent out-of-distribution (OOD) detection for test-time RL controller switching in nonlinear time-varying systems. RL controllers can quickly control high-dimensional systems within the training distribution, but their performance can degrade when time-varying dynamics produce unseen observations. We consider a combined ES--DRL controller, where RL provides fast in-distribution actions and bounded extremum seeking (ES) provides robust model-independent control under OOD operation. The key challenge is deciding when to switch. We train a variational autoencoder (VAE) on in-distribution beam-profile observations and use Mahalanobis distance in the VAE latent space to detect OOD beam profiles at test time. This OOD decision sets a binary switch that selects either the RL controller or the ES controller. We evaluate the approach in safety-critical particle accelerator control. In this setting, spatial magnet motion creates OOD beam profiles that were not seen during RL training. Visualization of the VAE latent space shows that the proposed method identifies this OOD scenario and provides an interpretable signal for switching between RL and ES in the combined controller.
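
The switching rule described in this abstract can be sketched as follows. This is an illustration rather than the authors' implementation: it assumes the VAE latent vectors are already available as plain tuples, it uses a diagonal covariance estimate to stay within the standard library (a full covariance matrix is the general form of Mahalanobis distance), and the threshold value and function names are hypothetical.

```python
import math
from statistics import fmean, pvariance


def fit_diagonal_gaussian(latents):
    """Per-dimension mean and variance of in-distribution latent vectors."""
    dims = list(zip(*latents))
    means = [fmean(d) for d in dims]
    variances = [pvariance(d) for d in dims]
    return means, variances


def mahalanobis(z, means, variances, eps=1e-8):
    """Mahalanobis distance under a diagonal covariance (simplifying assumption)."""
    return math.sqrt(sum((x - m) ** 2 / (v + eps)
                         for x, m, v in zip(z, means, variances)))


def select_controller(z, means, variances, threshold):
    """Binary switch: RL when in-distribution, ES when flagged as OOD."""
    return "ES" if mahalanobis(z, means, variances) > threshold else "RL"


# Toy 2-D latents standing in for encoded in-distribution beam profiles.
train = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
means, variances = fit_diagonal_gaussian(train)
print(select_controller((1.2, 0.8), means, variances, threshold=3.0))
print(select_controller((4.0, 5.0), means, variances, threshold=3.0))
```

In practice the threshold would be calibrated on held-out in-distribution data, for example as a high quantile of the training distances; the abstract does not say how it is chosen.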

2606.11471 2026-06-11 cs.CR cs.LG New submission

Evaluating and Combating the Impact of Concept Drift on the Performance of Machine Learning-Based Phishing Detection Systems

Warren Fernando, Nikos Komninos

AI summary: Studies the impact of concept drift on the performance of machine learning-based phishing email detection systems and proposes strategies to mitigate the resulting performance degradation.

Abstract

The expansion of the digital domain has resulted in a substantial increase in digital communication, with email emerging as one of the most prominent channels. The proliferation of email communication is apparent in both professional and personal contexts, thereby creating numerous vulnerabilities for malicious actors to exploit. Spam emails, a form of unsolicited correspondence often bearing malicious intent towards recipients, have been an ongoing challenge for email users since the inception of email technology, and this problem has been exacerbated by the growth of the digital landscape. Email spam filters are integral components of email clients, engineered to identify potentially harmful messages and alert users to their malicious content. Phishing, frequently the initial phase of malware-based attacks, is evolving rapidly, with malware becoming increasingly sophisticated over time. A widely adopted approach for detecting malicious activity within malware and spam domains is the application of machine learning. Our aim is to assess the impact of the evolution within the spam email domain on these machine learning-based detection systems and to explore strategies for mitigating associated performance degradation.

2606.11469 2026-06-11 cs.DS cs.LG math.ST New submission

Density estimation for Hellinger via minimum-distance estimators: mixtures of Gaussians, log-concave, and more

Spencer Compton, Jerry Li

AI summary: Extends minimum-distance estimation from total variation distance to Hellinger distance via reverse data processing inequalities, yielding near-linear time learning of mixtures of log-concave densities and mixtures of Gaussians (with arbitrary variances) with near-optimal sample complexity.

Abstract

We study the task of density estimation, where we hope to accurately estimate a probability density from $n$ samples. A textbook method for density estimation in total variation distance is the minimum-distance estimator approach, where both the algorithm and the analysis follow merely from bounding the VC dimension of a particular concept class (the so-called Yatracos class). While this technique originally yielded sharp guarantees primarily for total variation distance, in this work we extend the minimum-distance estimator approach to learning in Hellinger distance. Our main observation is that we may produce an analogous recipe for Hellinger (where we only require bounding the VC dimension of a related concept class) by drawing connections to recent results yielding reverse data processing inequalities. This recipe is flexible enough to accommodate fast algorithms originally designed for total variation distance; by modifying the approach of Acharya et al. (2017) we obtain the first near-linear time algorithm for learning classes including univariate mixtures of log-concave densities and mixtures of Gaussians (with arbitrary variances), with near-optimal sample complexity.
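
To make the two distances in this abstract concrete, the sketch below computes them for discrete distributions, which is a simplification of the density setting the paper studies. It uses the normalization $H^2(p,q) = \tfrac{1}{2}\sum_i (\sqrt{p_i} - \sqrt{q_i})^2$; conventions differ by constant factors across the literature. Under this convention the standard relation $H^2 \le \mathrm{TV} \le \sqrt{2}\,H$ holds.

```python
import math


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def hellinger(p, q):
    """Hellinger distance, normalized so that 0 <= H <= 1."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))


p = [0.5, 0.5]
q = [0.9, 0.1]
tv = total_variation(p, q)
h = hellinger(p, q)
print(tv, h)
# Standard relation between the two distances under this normalization.
print(h ** 2 <= tv <= math.sqrt(2) * h)
```

The relation is one-sided in strength: a small Hellinger distance forces a small total variation distance at the same order, while a total variation bound of $\varepsilon$ only gives $H \le \sqrt{\varepsilon}$, so guarantees stated in Hellinger distance are the more demanding ones.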

2606.11437 2026-06-11 cs.DS cs.AI cs.LG stat.ML New submission

The Power of Test-Time Training for Approximate Sampling

Noah Golowich, Ankur Moitra, Dhruv Rohatgi

AI summary: Formalizes test-time training (TTT) as the problem of sampling from a known class of distributions, proves a quadratic lower bound on query complexity, and shows that the bound can be circumvented when the size of the class is suitably bounded, providing a theoretical framework for TTT.

Abstract

Efficiently sampling from a complex probability distribution is a fundamental problem which has become increasingly pertinent in recent years with the rise of generative AI, as sophisticated sampling procedures from LLMs have been proposed to solve challenging reasoning problems. The efficacy of such sampling algorithms is limited, however, by the relationship between the LLM and the particular sampling task at hand, which has motivated the framework of test-time training (TTT). TTT works by updating a model's weights in response to partial generations and reward feedback received at inference time, thus adapting to the particular problem. In this work, we propose a formalization for TTT as the problem of producing a sample from a given probability measure $\mu^\star$ belonging to a known class ${F}$ of distributions, given an oracle $\hat \mu$ which yields approximate density estimates for $\mu^\star$. This is closely related to the problem of reducing sampling to approximate counting studied in seminal works of Jerrum, Valiant & Vazirani (1986) and Jerrum & Sinclair (1989): namely, when ${F}$ is the class of all distributions, it coincides exactly with the aforementioned counting-to-sampling reduction. In this paper, we first show a quadratic lower bound on the query complexity of sampling from $\mu^\star$ given query access to $\hat \mu$ (for sufficiently large classes ${F}$), thus showing that the random walk approach proposed by Jerrum & Sinclair (1989) and refined by Hayes & Sinclair (2010), is optimal. This answers an open question posed by Hayes & Sinclair. We then show that this lower bound can be circumvented if the size of ${F}$ is bounded appropriately. As we discuss, this latter result can be viewed as an abstraction of TTT, and thus represents a starting point for the development of a principled theoretical framework for TTT.

2606.11430 2026-06-11 cs.DL cs.AI cs.LO New submission

Towards a Bridge Layer Between Bibliographic and Formalized Mathematical Knowledge

A. Mayeux

AI summary: Proposes a relational bridge database that aligns publication metadata with formal artifacts and introduces a paper-level formalization score, estimated via cross-document alignment, to integrate the bibliographic and formal mathematical ecosystems.

Abstract

Mathematical knowledge is split between bibliographic databases (e.g., MathSciNet, zbMATH Open) and formal proof libraries (e.g., Lean mathlib), preventing unified access between published results and their formalizations. We propose a relational bridge-database that aligns publication metadata with formal artifacts, providing an interoperability layer between mathematical literature and machine-verifiable proofs. We introduce a paper-level formalization score that measures how much of a publication is covered in formal systems. As a feasibility study, we show how such scores can be estimated via cross-document alignment between informal texts and Lean formalizations, enabling large-scale analysis of formalization coverage. This framework is a first step toward integrating bibliographic and formal mathematical ecosystems into scalable, machine-actionable knowledge graphs linking publications to formal proof objects.

2606.11429 2026-06-11 eess.AS cs.CL cs.SD New submission

Gumbel-BEARD: Automatic Layer Selection for Self-Supervised Adaptation of Whisper in Low-Resource Domains

Zilai Wang, Natarajan Balaji Shankar, Mohan Shi, Kaiyuan Zhang, Abeer Alwan

AI summary: Proposes the Gumbel-BEARD framework, which automatically selects Whisper encoder layers with a trainable Gumbel-Softmax selector and uses a BEST-RQ self-supervised objective for low-resource domain adaptation, achieving state-of-the-art word error rates on child speech datasets and reducing WER on dialectal speech.

Comments: Accepted by Interspeech 2026

Abstract

Speech foundation models often struggle in low-resource domains due to domain mismatch and data scarcity. We propose Gumbel-BEARD, a domain adaptation framework that automates Whisper encoder layer selection via an end-to-end trainable hard Gumbel-Softmax selector. It enables self-supervised adaptation with a BEST-RQ objective that dynamically adapts to target acoustic characteristics without manual tuning. Experiments on the MyST child speech corpus demonstrate efficiency and scalability: with 10 h of labeled data for fine-tuning, our method matches a fully supervised baseline trained on the complete 133 h labeled set. We establish new state-of-the-art word error rates (WERs) of 8.21% using Whisper-medium on MyST and 11.06% using Whisper-small on the OGI Spontaneous dataset. Evaluation on CORAAL further confirms robustness to adult dialectal domain shifts, with up to 6% relative WER reduction, highlighting the generalizability of our approach to diverse low-resource conditions.
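
The "hard Gumbel-Softmax selector" mentioned in this abstract can be illustrated with a forward-pass sketch. It is a generic Gumbel-Softmax draw, not the authors' code: the logits stand in for hypothetical learnable per-layer scores, and the straight-through gradient that makes the hard choice trainable requires an autograd framework and is not shown.

```python
import math
import random


def gumbel_softmax_select(logits, tau=1.0, rng=random):
    """Return (soft_weights, hard_one_hot) for one Gumbel-Softmax draw."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) for U uniform on (0, 1);
        # U is clamped away from 0 and 1 to keep the logs finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    # Hard selection: one-hot at the largest soft weight.
    winner = soft.index(max(soft))
    hard = [1.0 if i == winner else 0.0 for i in range(len(logits))]
    return soft, hard


# Hypothetical scores for three candidate encoder layers.
soft, hard = gumbel_softmax_select([0.1, 2.0, -1.0], tau=0.5,
                                   rng=random.Random(0))
print(soft, hard)
```

Lower temperatures `tau` push the soft weights toward the one-hot choice, while the injected noise keeps low-scoring layers occasionally selected during training.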

2606.11425 2026-06-11 cs.CR cs.AI New submission

JailbreakOPT: Tool-Assisted Iterative Jailbreak Prompt Optimization

Ge Shi, Jun Yin, Donglin Xie, Fangyi Liu, Yucan Li, Menglin Liu

AI总结提出JailbreakOPT框架，通过工具库和上下文Thompson采样优化单轮越狱提示，在多个LLM上提高攻击成功率并减少攻击次数。

详情

Abstract

Jailbreak attacks expose persistent safety weaknesses in large language models (LLMs), but existing stateless single-turn methods face a trade-off: hand-crafted prompts are expressive but static, while iterative prompt optimization can adapt but often relies on low-level mutations that require many target queries. We propose JailbreakOPT, a tool-assisted framework for improving iterative single-turn jailbreak prompt optimization. JailbreakOPT organizes diverse atomic jailbreak prompts into an attack tool library and composes them through a unified intra-episode optimization abstraction to generate stronger standalone attack prompts. To reuse experience across attack episodes, JailbreakOPT further frames tool selection as a contextual bandit problem and applies contextual Thompson sampling to guide exploration and exploitation based on past outcomes. Experiments across multiple target LLMs and attack goals show that JailbreakOPT improves attack success rate (ASR) while reducing the number of attacks until success (No.A) compared with atomic single-turn attacks and existing iterative optimization baselines. This paper may contain offensive or harmful content.

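2606.11417 2026-06-11 cs.LG cs.AI stat.ML New submission

Signed Compression Progress on a Sealed Audit is Goodhart-Resistant

Ayush Mittal, Dhruv Gupta

AI summary: Proposes signed compression progress as an intrinsic motivation signal and proves that its cumulative reward equals the audit improvement, with a false-positive budget for finite audit panels, making it resistant to Goodhart's law.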

Comments: 16 pages, 7 figures. Lean 4 (Mathlib) mechanized core and ARC-TGI experiment code: this https URL

Abstract

Compression progress is a long-standing proposal for intrinsic motivation: reward an agent when its world model becomes better at predicting or compressing experience. The folk claim is that this reward is "credible" because it is paid only for learning. We make this precise and prove it. If intrinsic reward is the signed decrease of a fixed sealed-audit loss, r_t = E(theta_{t-1}) - E(theta_t), then cumulative reward telescopes exactly to endpoint audit improvement, so no policy can push reward up indefinitely while true audit performance stagnates or degrades. For finite audit panels the same result holds with a sharp false-positive budget: cumulative empirical reward is at most true audit improvement plus 2 Delta_n(F, delta), the uniform audit deviation of the model class. This is horizon-free: adaptivity over time costs nothing once the sealed panel uniformly controls the class. The theorem also identifies the failure modes: the guarantee disappears if progress is clipped, scored on the agent's own stream, exposed to a high-capacity model on a reusable panel, or applied to a neural class that makes Delta_n vacuous. We give a Lean 4 mechanization of the structural core (telescoping, the finite-audit bound, finite Gibbs, and the entropy floor) and an experiment suite on ARC-TGI grid-transformation generators with adaptive holdout attacks. Experiments confirm the theory: finite-audit deviation scales as n^{-0.527}; signed progress resists clip-farming, stream leakage, and noisy-TV curiosity; naive reusable audits are exploitable by black-box scalar feedback, while standard release defenses keep the attack below the 2 Delta_n threshold. Signed compression progress on a sealed audit is an accounting signal of genuine improvement.
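
The telescoping argument, and the clipping failure mode the abstract names, can be checked on a toy trajectory. This is a minimal sketch: the loss values and function names are hypothetical, and it covers only the exact (infinite-panel) identity, not the finite-audit bound.

```python
def signed_rewards(audit_losses):
    """r_t = E(theta_{t-1}) - E(theta_t) for consecutive audit losses."""
    return [prev - cur for prev, cur in zip(audit_losses, audit_losses[1:])]


def clipped_rewards(audit_losses):
    """Same rewards with negative steps clipped to zero."""
    return [max(r, 0.0) for r in signed_rewards(audit_losses)]


# Hypothetical audit-loss trajectory that oscillates and ends where it started.
losses = [1.0, 0.5, 1.0, 0.5, 1.0]
signed_total = sum(signed_rewards(losses))    # telescopes to losses[0] - losses[-1]
clipped_total = sum(clipped_rewards(losses))  # grows with every oscillation
```

With signed rewards the oscillating policy earns nothing because the endpoints coincide, whereas the clipped variant pays out on every downswing even though the audit loss never improves overall.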

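2606.11416 2026-06-11 cs.CR cs.AI New submission

MPC-Patch-Bench: Security-Aware LLM Code Patch for Multi-Party Computation

Yukuan Zhang, Mengxin Zheng, Qian Lou

AI summary: To address the lack of repository-level code-repair benchmarks for multi-party computation (MPC) software, proposes MPC-Patch-Bench, which comprises a data curation framework and an MPC verifier and evaluates the security and numerical fidelity of LLM repairs at repository level.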

Comments: preprint

Abstract

Repository-level benchmarks for evaluating Large Language Model (LLM) code repair on Secure Multi-Party Computation (MPC) software do not yet exist, and directly transplanting general-purpose benchmarks such as SWE-bench fails on three structural fronts: (i) MPC repositories are dominated by generic Python infrastructure rather than cryptographic logic; (ii) high-value MPC fixes lack the standardized tests that rigid extraction pipelines require; and (iii) standard fail-to-pass evaluation is insufficient for code that must also be cryptographically safe. MPC is increasingly deployed for privacy-preserving machine learning, biomedical collaboration, and secure analytics. Existing MPC-specific code-synthesis efforts cover only operator-level or single-framework tasks; evaluating LLM agents on real repository-level MPC repair instead demands MPC-aware data curation and a verifier matched to the security and numerical-fidelity guarantees MPC programs must obey, neither of which existing benchmarks provide. We introduce MPC-Patch-Bench, a repository-level benchmark organized around two frameworks. (1) The Data Curation Framework combines a domain-specific curation agent that filters raw pull requests through three cryptographic layers with a human-AI completion engine that synthesizes missing problem statements and Fail-to-Pass/Pass-to-Pass tests, yielding 205 fully verified instances. (2) The MPC Verifier provides dedicated security and numerical-fidelity checks via dynamic differential testing against plaintext oracles and MPC-specific static analysis rules that flag unsafe reveals, insecure arithmetic, and illegal public/private casts. The strongest evaluated LLM functionally resolves only 22.9% of MPC-Patch-Bench tasks; the MPC Verifier further reduces verified resolution to 17.1%, with up to 40% of functionally passing patches rejected for cryptographic or numerical-fidelity violations.

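2606.11415 2026-06-11 q-bio.NC cs.LG physics.data-an q-bio.QM New submission

Spatially Masked Regression Reveals Local and Distributed Predictability in Electrophysiological Recordings

Maryam Ostadsharif Memar, Nima Dehghani

AI summary: Proposes the Spatially Masked Regression (SMR) framework, which quantifies the contributions of local and distributed information to electrode signals by progressively enlarging the masked region; applied to intracranial and scalp EEG data, it finds that neighboring electrodes contribute strongly but not exclusively, indicating that signals contain both local redundancy and global structure.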

Abstract

Neural recordings are often interpreted as local measurements, yet the signal at any one sensor can also reflect structured activity distributed across the broader network. This raises a basic question: to what extent does an electrode's signal reflect local versus distributed information in the underlying system? More specifically, how much of an electrode's activity is carried by its immediate neighborhood, and how much is embedded more broadly across the array? We address this with a Spatially Masked Regression (SMR) framework that reconstructs each electrode's timeseries from the remaining electrodes while excluding a configurable neighborhood around the target. By progressively increasing this mask, spatial locality becomes an experimental control for quantifying how much predictive information survives after nearby channels are withheld. We apply SMR to intracranial EEG with heterogeneous electrode coverage and to scalp EEG with standardized montages over sensorimotor cortex. Using distance correlation between original and reconstructed signals, we find strong within-subject reconstruction in both modalities, substantial residual predictability even when local neighbors are excluded, and markedly stronger cross-subject transfer in EEG than in iEEG. Masking shows that nearby electrodes contribute strongly to reconstruction but do not account for all of it, indicating that individual channels reflect both local redundancy and broader distributed structure. Surrogates that preserve selected marginal or spectral properties while disrupting phase structure or temporal ordering substantially reduce performance, supporting the conclusion that SMR depends on structured temporal and cross-channel organization rather than on marginal statistics alone. These results position SMR as an interpretable framework for quantifying the balance between local and distributed information in recordings.
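
The masking step can be sketched as a choice of predictor channels: every electrode farther than a given radius from the target is kept, and the rest are withheld. The coordinates, the Euclidean distance criterion, and the function name are assumptions for illustration; the regression model itself is not shown.

```python
import math


def predictor_channels(positions, target, mask_radius):
    """Indices of electrodes farther than mask_radius from the target electrode."""
    target_pos = positions[target]
    return [
        i
        for i, pos in enumerate(positions)
        if i != target and math.dist(pos, target_pos) > mask_radius
    ]


# Hypothetical 2-D electrode coordinates (arbitrary units).
positions = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (5.0, 0.0)]
no_mask = predictor_channels(positions, target=0, mask_radius=0.0)
wide_mask = predictor_channels(positions, target=0, mask_radius=2.5)
```

Sweeping `mask_radius` upward and refitting the reconstruction at each step gives the kind of locality curve the abstract describes.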

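2606.11400 2026-06-11 cs.SD cs.AI eess.AS New submission

Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models

Tsung-En Lin, Hung-Yi Lee

AI summary: Proposes instruction-based vector steering, which contrasts activations under different instructions to redirect temporal attention over audio tokens, enabling training-free sound event localization that substantially outperforms direct prompting and random baselines.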

Abstract

Large Audio-Language Models (LALMs) excel at audio understanding but expose little about where in an audio signal they attend. We introduce instruction-based vector steering, which constructs a steering vector by contrasting activations from differently instructed prompts while keeping the audio fixed. Through a systematic probe of LALM attention, we find that, unlike standard prompting or audio-based steering, this intervention significantly redistributes the temporal attention allocated to audio tokens, concentrating it on acoustically relevant regions. We then show that this attention shift is behaviorally meaningful: in a controlled three-event setting, reading out the temporal position of maximal steering-induced attention change recovers the location of a queried sound event without any training, attaining 60.87% and 68.72% overlap with ground-truth intervals on Qwen2-Audio and Audio Flamingo 3, far above direct prompting (31.84%, 46.75%) and random baselines (27.74%). Our results characterize a mechanistic property of instruction-based steering in LALMs and provide a training-free probe for the latent temporal structure these models encode.
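
The readout step can be sketched as an argmax over the per-frame attention change. The attention values, the frame duration, the use of the signed increase (rather than the absolute change), and the function name are all assumptions; how the paper aggregates attention across heads and layers is not stated in the abstract.

```python
def localize_by_attention_shift(base_attn, steered_attn, frame_sec):
    """Return the (start, end) time of the frame with the largest attention gain."""
    deltas = [s - b for b, s in zip(base_attn, steered_attn)]
    peak = max(range(len(deltas)), key=lambda i: deltas[i])
    return peak * frame_sec, (peak + 1) * frame_sec


# Hypothetical attention mass over four audio frames, before and after steering.
base = [0.25, 0.25, 0.25, 0.25]
steered = [0.10, 0.10, 0.70, 0.10]
start, end = localize_by_attention_shift(base, steered, frame_sec=0.5)
```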

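2606.11399 2026-06-11 cs.CL New submission

Scenario-based Probing and Steering Cultural Values in Large Language Models--Extended Version

Trung Duc Anh Dang, Tung Kieu, Sarah Masud

AI summary: Proposes a method built on scenario-based behavioral dilemmas that uses token-level probabilities and activation steering to probe and shift the latent values of LLMs along the Inglehart-Welzel cultural axes, finding that steering along different cultural dimensions is coupled.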

Comments: 18 pages

Abstract

Large Language Models (LLMs) are deployed across cultural contexts but often reflect homogenized values inherited from training data. Evaluations of cultural alignment typically rely on direct prompting with survey-style questions, which frequently elicit neutral or safety-aligned responses and fail to capture underlying model preferences. We propose a framework for probing and steering latent cultural representations in LLMs along the two Inglehart--Welzel axes of the World Values Survey (WVS). By translating social value questions into scenario-based behavioral dilemmas, we extract token-level probabilities to measure implicit values and apply activation steering, optionally combined with country-conditioned prompting, to shift model behavior without retraining. Across three open-source LLMs and four target cultures, we find substantial variation in steerability and identify latent entanglement, where interventions along one cultural dimension induce shifts along another. This coupling mirrors correlations in human WVS data and persists across activation, prompt, and hybrid steering. It constrains axis-independent alignment, though general task performance is largely preserved.

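2606.11379 2026-06-11 cs.AI New submission

Automated Mediator for Human Negotiation: Pre-Mediation via a Structured LLM Pipeline

Jamie Bergen, Sarit Kraus

AI summary: Proposes a structured LLM pipeline as an automated mediator that supports pre-mediation in integrative negotiation by decomposing preparation into specialized modules; it is comparable to human mediators on short-term self-reported outcomes and achieves 36% lower error on the preference-inference task.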

Comments: 12 pages, 7 figures

Abstract

Pre-mediation, the preparatory phase preceding direct human negotiation, plays a critical role in achieving mutually beneficial agreements, yet is often omitted due to cost, time, and limited access to trained mediators. We introduce an automated mediator for human negotiation, implemented as a structured pipeline of LLM modules, that supports pre-mediation in integrative negotiation settings. The pipeline decomposes preparation into specialized modules for dialogue, preference prediction, response-level critique, and structured summarization, separating inference, generation, and evaluation to address limitations of monolithic single-prompt approaches. We use the term "agent" for each module following common LLM-systems terminology, but the components are not autonomous and do not interact peer-to-peer; outputs are passed forward in a fixed sequence. We evaluate the system in two controlled human-subject experiments comparing AI-based pre-mediation with professional human mediators in a multi-issue negotiation scenario. On short-term self-reported measures, the automated mediator achieves preparation outcomes broadly comparable to human mediators, including trust in the mediator and confidence in reaching mutually beneficial agreements, while achieving substantially lower error on the preference-inference task under our scenario and prompts (36% lower RMSE). A second study shows that targeted prompt refinements reduce excessive affirmation patterns from 36.6% to 16.8%, matching human mediator baselines. Our findings suggest that structured LLM pipelines can provide scalable, low-effort pre-mediation support broadly comparable to human mediators on short-term self-reported preparation outcomes. The pipeline's single-party design mirrors how human mediators run pre-mediation today and enables parallel deployment across all parties to a dispute, supporting scalability.

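2606.11371 2026-06-11 cs.CL cs.AI eess.AS eess.SP New submission

The Dynamics of Human and AI-Generated Language: How Semantics Fluctuates across Different Timescales

Han-Jen Chang, Yasir Çatal, Angelika Wolman, Agustín Ibáñez, David Smith, I-Wen Su, Kai-Yuan Cheng, Georg Northoff

AI summary: Proposes a semantic-timescale analysis pipeline that uses autocorrelation-window measures (ACW-0) to quantify the temporal organization of semantic specificity and contextual similarity in human and AI-generated speech, finding that ACW-0 length is associated with lexical genericness and that the association weakens after shuffling.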

Comments: 45 pages, 4 figures, 4 tables. Accepted manuscript; published in Computer Speech & Language

Abstract

Spoken language, whether produced by humans or large language models (LLMs), unfolds over time with varying semantic content. However, we still lack simple, interpretable time-series features that capture how generic versus specific content is distributed over time, and that can be used to compare human and AI-generated speech. We introduce a semantic-timescale analysis pipeline that turns word-level transcripts with timestamps into semantic time-series. For each spoken narrative, we compute (i) semantic specificity using WordNet-based word depth and (ii) contextual similarity using SBERT embeddings, and quantify their temporal dependence using autocorrelation-window measures (ACW-0 and related metrics). We then compare original speech to multiple shuffled controls that selectively disrupt lexical identity, temporal order, and word duration. Across human-read autobiographical narratives, TTS readings, and LLM-generated texts rendered with TTS, we find that segments with longer ACW-0 in the semantic time-series tend to contain more generic vocabulary, whereas segments with shorter ACW-0 are enriched in more specific words. These associations are strongly attenuated or abolished when word order and timing are randomized, indicating that ACW-based measures capture non-trivial temporal organization of semantic content beyond static lexical distributions. Our results suggest that ACW-based semantic timescales are a useful family of features for analyzing and comparing the temporal structure of human and AI-generated speech.
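
As a rough illustration of an autocorrelation-window measure, the sketch below takes ACW-0 to be the first lag at which the sample autocorrelation reaches or drops below zero. That definition, the biased autocorrelation estimator, and the toy series are assumptions; the paper's exact estimator and windowing are not given in the abstract.

```python
def autocorrelation(x, lag):
    """Biased sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    var = sum(d * d for d in dev)
    if var == 0:
        raise ValueError("constant series has no defined autocorrelation")
    return sum(dev[t] * dev[t + lag] for t in range(n - lag)) / var


def acw0(x):
    """First lag at which the autocorrelation is <= 0, or None if it never is."""
    for lag in range(1, len(x)):
        if autocorrelation(x, lag) <= 0:
            return lag
    return None


# Toy series: a slow ramp versus a rapidly alternating signal.
slow = acw0([float(v) for v in range(10)])
fast = acw0([1.0, -1.0] * 10)
```

The slowly varying series yields a longer window than the alternating one, which is the contrast the measure is meant to capture.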

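2606.11361 2026-06-11 cs.IR cs.CL New submission

A PubMed-Scale Dataset of Structured Biomedical Abstracts

Chia-Hsuan Chang, Haerin Song, Brian Ondov, Hua Xu

AI summary: To address the large number of unstructured abstracts in PubMed that hinder downstream text processing, builds a corpus of over 23.2 million structured abstracts, of which 5.9 million come from official XML and 17.2 million are labeled automatically by a large language model, all unified under a five-section format.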

Comments: Data and code for this work are available at this https URL and this https URL, respectively

Abstract

Structured abstracts are important for biomedical literature processing because they facilitate information retrieval, text mining, and knowledge synthesis. However, a large share of the abstracts indexed in PubMed remains unstructured, presenting a significant bottleneck for downstream text-processing workflows and applications. To resolve this limitation, we introduce Structured PubMed, a comprehensive corpus of section-labeled biomedical abstracts compiled from the complete PubMed database, encompassing over 23.2 million research-article records. The corpus is divided into two distinct subsets: a collection of 5.9 million author-structured abstracts parsed from official XML files, and an automatically labeled collection of 17.2 million originally unstructured abstracts structured via a verbatim-extraction Large Language Model pipeline. Every record is harmonized under a unified five-section schema and mapped to its original PubMed identifier, publication type, and publication date. This dataset can be used to train sentence-classification models, benchmark text-segmentation architectures, and perform large-scale, section-specific information extraction at PubMed-wide scale.

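2606.11357 2026-06-11 cs.DC cs.AI cs.AR cs.PF New submission

TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs

Wesley Pang, Gregory Hyegang Jun, Feiyang Liu, Deming Chen

AI summary: To address the difficulty of deploying quantized LLMs on edge NPUs, proposes the TileFuse library, which fuses unpacking, dequantization, and GEMM/GEMV kernels and designs an interleaved pre-tiling layout and dataflow to support AWQ formats natively on XDNA2, with performance gains of up to 281% and energy reductions of more than 64.6%.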

Comments: 13 pages excluding reference, 11 figures

Abstract

With the growing demand for on-device LLM inference, edge SoCs increasingly integrate NPUs to improve performance and energy efficiency under tight power and thermal budgets. However, practical LLM deployment on current client NPUs remains difficult: widely used quantization formats such as AWQ do not map cleanly onto many existing NPU software stacks, which are often proprietary and expose limited low-level control. In this work, we present TileFuse, a close-to-metal mixed-precision kernel library for AMD XDNA2 NPUs that targets transformer linear layers in quantized LLM inference. TileFuse brings practical low-bit formats such as AWQ-style W4A16 and W8A16 directly onto XDNA2, rather than forcing the model to be reshaped around an NPU-specific quantization scheme. TileFuse co-designs weight layout, metadata placement, mixed-precision microkernels, and array-level dataflow. Specifically, it fuses unpacking, dequantization, and GEMM/GEMV execution into a single kernel flow, introduces an interleaved pre-tiling layout that supports GEMM dimensions up to 32K, and redesigns GEMV dataflow to utilize the full 4x8 AIE array. Across kernel-level evaluations, TileFuse improves performance by up to 121.6% for GEMM and 281% for GEMV over full-precision baselines, while delivering more than 2x performance and energy-efficiency gains over strong iGPU baselines on GEMM. In end-to-end LLM experiments on Ryzen AI laptops, TileFuse achieves up to 2.0x lower prefilling latency with more than 64.6% lower energy consumption. Together, these results show that XDNA2 is a practical target for AWQ-style edge LLM inference and that native NPU support for off-the-shelf quantization can make NPUs substantially more usable in real client deployments.

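2606.11348 2026-06-11 cs.LG New submission

SwiftCTS: Fast Cross-Design Prediction and Pareto Optimization of Clock Tree Metrics via Few-Shot Calibration

Barsat Khadka, Kawsher Roxy, Md Rubel Ahmed

AI summary: Proposes SwiftCTS, a framework that uses a physics-informed surrogate model and a K-shot multiplicative calibration mechanism to train in seconds and infer in under a millisecond, enabling accurate cross-design prediction and Pareto optimization of clock tree metrics.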

Abstract

Clock Tree Synthesis (CTS) is a computationally expensive stage in the physical design flow, requiring iterative EDA tool invocations to navigate a vast configuration space for optimal power, wirelength, and timing skew. Existing machine learning approaches require computationally expensive retraining or fine-tuning cycles to adapt to unseen macro architectures and are architecturally mismatched to the millions of evaluations demanded by exhaustive combinatorial search. We present SwiftCTS, a physics-informed surrogate framework that addresses both limitations simultaneously. By coupling lightweight, physics-grounded statistical features with gradient-boosted ensembles, SwiftCTS trains in under five seconds on a CPU and delivers sub-millisecond inference without GPU support. To handle out-of-distribution (OOD) designs without retraining or fine-tuning, we introduce a K-shot multiplicative calibration mechanism that anchors predictions to just one or two physical reference runs, reducing power prediction error from 24.5% to 3.3% and wirelength error from 56.6% to under 1% on unseen macros. Integrating this engine with an evolutionary optimizer, SwiftCTS evaluates 100,000 CTS configurations in under ten seconds, yielding Pareto-optimal frontiers that are physically validated within the OpenROAD flow. Closed-loop validation confirms prediction errors below 0.5% for power and wirelength, and timing skew predictions within five picoseconds on an OOD benchmark, consistently outperforming default tool heuristics across all target metrics. Code publicly available at: this https URL
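
One plausible reading of K-shot multiplicative calibration is a single scale factor fitted on the K reference runs and applied to every later surrogate prediction. The sketch below assumes the factor is the mean ratio of measured to predicted values; the abstract does not give the paper's formula, and all names and numbers here are hypothetical.

```python
def k_shot_scale(predicted_refs, measured_refs):
    """Mean ratio of measured to predicted values over the K reference runs."""
    ratios = [m / p for p, m in zip(predicted_refs, measured_refs)]
    return sum(ratios) / len(ratios)


def calibrate(prediction, scale):
    """Apply the multiplicative correction to a raw surrogate prediction."""
    return scale * prediction


# One reference run (K = 1): the surrogate predicted 80.0, the tool measured 100.0.
scale = k_shot_scale(predicted_refs=[80.0], measured_refs=[100.0])
calibrated_power = calibrate(40.0, scale)
```

Because the correction is a scalar, it adds no measurable cost to inference, which is consistent with the sub-millisecond budget the abstract reports.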

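2606.11347 2026-06-11 stat.ML cs.LG math.OC New submission

Annealed Entropic Allocation for Ranking and Selection

Xin Fei, Juergen Branke

AI summary: Proposes the Annealed Entropic Allocation framework, which replaces the non-smooth maximin large-deviation rate objective with a weighted log-sum-exp surrogate and adds a saddlepoint approximation to improve finite-budget discrimination; numerical experiments show strong performance when several challengers are nearly tied.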

Abstract

We propose Annealed Entropic Allocation, an annealed weighted soft-min framework for sequential budget allocation in ranking and selection. The central idea is to replace the non-smooth maximin large-deviation rate objective with a weighted log-sum-exp surrogate that aggregates challenger-specific pairwise scores through soft-min weights, mitigating hard switching when several challengers are nearly active. To improve finite-budget discrimination, we incorporate the saddlepoint approximation -- a sub-exponential correction derived from refined pairwise tail asymptotics. Because these corrections are sub-exponential and the smoothing parameter is annealed to zero, the surrogate preserves the same first-order large-deviation target as the classical maximin formulation. We show that the surrogate converges uniformly to the hard minimum, that the soft-min weights concentrate on the active challengers, and that, under fixed weights, the induced target allocation map is continuous on the simplex interior. Numerical experiments on Gaussian and exponential instances demonstrate competitive performance, especially when multiple challengers are nearly tied.
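
The smoothing idea can be sketched with a generic weighted log-sum-exp soft-min, which approaches the hard minimum as the temperature is annealed toward zero. The scores, weights, and temperatures below are made up, and the paper's surrogate (including its saddlepoint correction) may differ in form.

```python
import math


def soft_min(scores, weights, tau):
    """Weighted soft-min: -tau * log(sum_i w_i * exp(-s_i / tau)), plus soft-min weights."""
    m = min(scores)  # shift by the minimum for numerical stability
    terms = [w * math.exp(-(s - m) / tau) for s, w in zip(scores, weights)]
    total = sum(terms)
    value = m - tau * math.log(total)
    soft_weights = [t / total for t in terms]
    return value, soft_weights


# Hypothetical pairwise scores for three challengers, with uniform weights.
scores = [1.0, 1.5, 3.0]
weights = [1 / 3, 1 / 3, 1 / 3]
smooth_value, smooth_w = soft_min(scores, weights, tau=1.0)
sharp_value, sharp_w = soft_min(scores, weights, tau=1e-3)
```

At the larger temperature the weight is shared among near-tied challengers; at the smaller one it concentrates on the active challenger and the value is close to the hard minimum.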

2606.11339 2026-06-11 math.OC cs.AI cs.LG eess.SY stat.ML New submission

Quantized Stochastic Primal-Dual Methods for Distributed Optimization under Relaxed Global Geometry

Susmit Sarkar, Abhinav Raghuvanshi, Kushal Chakrabarti, Mayank Baranwal

AI summary: Proposes q-PDGD, a quantized stochastic primal-dual method, and proves linear convergence to a neighborhood or O(1/k) convergence under relaxed global geometry, matching the best-known centralized stochastic complexity.

Comments: Accepted to UAI

Abstract

We study distributed optimization with stochastic gradients and finite-bit communication modeled by random (unbiased) quantization. We propose q-PDGD, a quantized stochastic primal-dual method, and analyze it under relaxed global geometry. Under restricted secant inequality (RSI), a constant step-size yields linear contraction to an explicit neighborhood determined by gradient noise, quantization distortion, and network connectivity, while a diminishing step-size achieves O(1/k) convergence without shared-minimizer assumptions. Under Polyak-Lojasiewicz (PL) inequality, we obtain linear-to-neighborhood convergence in the same stochastic quantized setting. Our results match the best-known centralized stochastic rates in oracle complexity, and are supported by experiments demonstrating the predicted tradeoffs between quantization level, step-size choice, and graph structure.
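
Random unbiased quantization, the communication model named in the abstract, is commonly realized by stochastic rounding to a uniform grid. The sketch below shows that standard construction for a scalar; the grid step and sample count are arbitrary, and the paper's quantizer may be defined differently.

```python
import math
import random


def stochastic_quantize(x, step, rng):
    """Round x to a grid of spacing `step`, rounding up with probability
    proportional to the distance from the lower grid point, so E[Q(x)] = x."""
    lower = math.floor(x / step) * step
    p_up = (x - lower) / step
    return lower + step if rng.random() < p_up else lower


rng = random.Random(0)
samples = [stochastic_quantize(0.3, 1.0, rng) for _ in range(20000)]
mean_estimate = sum(samples) / len(samples)
```

Each transmitted value lies on the grid, yet the average over many draws stays close to the original value, which is the unbiasedness property the analysis relies on.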

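2606.11316 2026-06-11 cs.CL New submission

Schützen: Evaluating LLM Safety in Bulgarian and German Contexts

Kiril Georgiev, Yuxia Wang, Dimitar Iliyanov Dimitrov, Preslav Nakov, Ivan Koychev

AI summary: To address the English- and Chinese-centric focus of existing safety evaluation datasets, builds the Schützen safety dataset covering Bulgarian (low-resource) and German (high-resource); experiments reveal pronounced cross-language differences in the safety behavior of multilingual LLMs and underline the need for region-specific evaluation resources.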

Comments: 19 pages, 13 tables, 12 figures

Abstract

Large language models are increasingly deployed across professional domains, bringing hard-to-predict risks, including the generation of harmful or disrespectful content. Although substantial progress has been made in developing safety evaluation datasets, existing resources remain overwhelmingly English- and Chinese-centric. This limitation is particularly pronounced when evaluating languages that operate within shared sociocultural, legal, and ethical contexts. To address this gap, we introduce Schützen: a German--Bulgarian safety dataset designed to assess model answerability under risk, covering both a low-resource language (Bulgarian) and a high-resource language (German). Experiments with multilingual and language-specific LLMs reveal pronounced cross-language differences in safety behavior, highlighting the necessity of tailored, region-specific evaluation resources to support the responsible deployment of LLMs in Germany and Bulgaria. Datasets and code are available at this https URL. Warning: this paper contains examples that may be offensive, harmful, or biased.

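2606.11304 2026-06-11 physics.ins-det cs.LG hep-ex hep-ph New submission

SPADE: Split-and-Delay Embeddings for Autoregressive High-Granularity Calorimeter Simulation

Joschka Birk, Frank Gaede, Anna Hallin, Gregor Kasieczka, Martina Mozzanica, Henning Rose

AI summary: Proposes SPADE, an autoregressive transformer that embeds the features of multi-feature tokens independently and delays the feature streams so that standard self-attention learns intra-token correlations, outperforming existing models at generating point-cloud showers for the ILD detector.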