arXivDaily: arXiv daily academic digest, updated Monday through Friday
2602.24047 2026-06-02 cs.NI cs.CR cs.LG

Unsupervised Baseline Clustering and Incremental Adaptation for IoT Device Traffic Profiling

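Sean M. Alderman, John D. Hastings

Affiliations: The Beacom College of Computer & Cyber Sciences, Dakota State University, Madison, SD, USA

AI summary: Proposes a two-stage unsupervised traffic profiling pipeline that uses DBSCAN for baseline clustering (NMI 0.78) and BIRCH for incremental adaptation (purity 0.87), revealing a trade-off between high-purity static profiling and incremental flexibility.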

Comments 6 pages, 2 figures, 4 tables

Journal ref: 2026 IEEE 14th International Symposium on Digital Forensics and Security (ISDFS)

Abstract

The growth and heterogeneity of IoT devices create security challenges where static identification models can degrade as traffic evolves. This paper presents a two-stage, flow-feature-based pipeline for unsupervised IoT device traffic profiling and incremental model updating, evaluated on selected long-duration captures from the Deakin IoT dataset. For baseline profiling, density-based clustering (DBSCAN) isolates a substantial outlier portion of the data and produces the strongest alignment with ground-truth device labels among tested classical methods (NMI 0.78), outperforming centroid-based clustering on cluster purity. For incremental adaptation, we evaluate stream-oriented clustering approaches and find that BIRCH supports efficient updates (0.13 seconds per update) and forms comparatively coherent clusters for a held-out novel device (purity 0.87), but with limited capture of novel traffic (share 0.72) and a measurable trade-off in known-device accuracy after adaptation (0.71). Overall, the results highlight a practical trade-off between high-purity static profiling and the flexibility of incremental clustering for evolving IoT environments.
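
The two scores quoted above, cluster purity and normalized mutual information (NMI), can be computed from device labels and cluster assignments alone. The sketch below is a minimal pure-Python illustration on made-up data. The arithmetic-mean normalization for NMI is an assumption, since the abstract does not say which variant the authors use, and the device names and cluster ids are hypothetical.

```python
import math
from collections import Counter

def purity(labels_true, labels_pred):
    """Fraction of points that belong to the majority true class of their cluster."""
    clusters = {}
    for t, p in zip(labels_true, labels_pred):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(labels_true)

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(labels_true, labels_pred):
    """Mutual information divided by the mean of the two entropies (assumed variant)."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    pt, pp = Counter(labels_true), Counter(labels_pred)
    mi = sum((c / n) * math.log(c * n / (pt[t] * pp[p]))
             for (t, p), c in joint.items())
    denom = (entropy(labels_true) + entropy(labels_pred)) / 2
    return mi / denom if denom > 0 else 1.0

# Hypothetical toy data: true device labels and the cluster id assigned to each flow.
devices = ["cam", "cam", "cam", "plug", "plug", "hub"]
clusters = [0, 0, 1, 1, 1, 2]

print(purity(devices, clusters))
print(nmi(devices, clusters))
```

With DBSCAN, points labeled as noise would need to be either excluded or treated as their own cluster before scoring; that choice changes both numbers and is left open here.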

2604.04958 2026-06-02 q-bio.QM cs.AI q-bio.NC

CalM: A Self-Supervised Foundation Model for Population Dynamics in Calcium Imaging Data

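Xinhong Xu, Yimeng Zhang, Qichen Qian, Yuanlong Zhang

Affiliations: University of California, Berkeley

AI summary: Proposes CalM, a self-supervised foundation model that combines a dual-axis autoregressive Transformer with an efficient tokenizer; after pretraining on calcium imaging data, it transfers to downstream tasks such as neural population dynamics forecasting and behavior decoding, with competitive or superior performance.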

Comments ICML accepted version

Abstract

Recent work suggests that large-scale, multi-animal modeling can significantly improve neural recording analysis. However, for functional calcium traces, existing approaches remain task-specific, limiting transfer across common neuroscience objectives. To address this challenge, we propose CalM, a self-supervised neural foundation model trained solely on neuronal calcium traces and adaptable to multiple downstream tasks, including forecasting and decoding. Our key contribution is a pretraining framework, composed of a high-performance tokenizer mapping single-neuron traces into a shared discrete vocabulary, and a dual-axis autoregressive transformer modeling dependencies along both the neural and the temporal axis. We evaluate CalM on a large-scale, multi-animal, multi-session dataset. On the neural population dynamics forecasting task, CalM achieves competitive performance against strong specialized baselines after pretraining. With a task-specific head, CalM further adapts to the behavior decoding task and achieves superior results compared with supervised decoding models. Moreover, linear analyses of CalM representations reveal interpretable functional structures beyond predictive accuracy. Taken together, we propose a novel and effective self-supervised pretraining paradigm for foundation models based on calcium traces, paving the way for scalable pretraining and broad applications in functional neural analysis. Code is released at https://github.com/TSuXinH/CalM.

2603.28825 2026-06-02 cs.GT cs.AI

Incentives, Equilibria, and the Limits of Healthcare AI: A Game-Theoretic Perspective

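Ari Ercole

Affiliations: Cambridge Centre for AI in Medicine, University of Cambridge, UK; Magdalene College, University of Cambridge, UK

AI summary: Using a coordination problem from inpatient capacity management, the paper describes three forms of AI deployment and analyzes their effect on system behavior, arguing that only interventions that change the incentive structure can shift stable equilibria, with practical implications for the procurement, governance, and evaluation of healthcare AI.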

Abstract

Using a stylised coordination problem drawn from inpatient capacity management, three archetypal forms of AI deployment are described: effort-reducing technologies, observability-oriented systems, and interventions that alter underlying incentive structures. Effort reduction and observability may improve performance within existing patterns of behaviour but do not, in general, change which actions are individually rational. As a result, such interventions are typically absorbed into existing equilibria. By contrast, interventions that modify how local actions map to downstream consequences by redistributing or bounding local risk can change stable system behaviour. These mechanism-level interventions differ not in technical sophistication but in their interaction with institutional incentives. The analysis suggests that expectations of system-level gains from AI should be conditioned on whether a deployment changes incentives rather than optimising tasks or information flows alone. For healthcare organisations and policymakers, this has practical implications for procurement, governance, and evaluation of digital technologies.
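
The argument that effort reduction leaves equilibria untouched while incentive changes move them can be illustrated with a toy two-player game. The sketch below is not the paper's model: the two actions ("hold" or "release" capacity) and every payoff value are hypothetical, chosen only so that holding is individually rational in the baseline. It enumerates pure-strategy Nash equilibria under three payoff tables.

```python
from itertools import product

ACTIONS = ("hold", "release")

def pure_nash(payoff):
    """Pure-strategy Nash equilibria of a symmetric two-player game.

    payoff[(mine, theirs)] is a player's payoff for playing `mine`
    while the other player plays `theirs`.
    """
    equilibria = []
    for a, b in product(ACTIONS, repeat=2):
        a_best = all(payoff[(a, b)] >= payoff[(alt, b)] for alt in ACTIONS)
        b_best = all(payoff[(b, a)] >= payoff[(alt, a)] for alt in ACTIONS)
        if a_best and b_best:
            equilibria.append((a, b))
    return equilibria

# Hypothetical baseline: releasing capacity exposes a unit to local risk.
baseline = {
    ("release", "release"): 3, ("release", "hold"): 0,
    ("hold", "release"): 4, ("hold", "hold"): 1,
}

# Effort reduction: every outcome gets cheaper by the same amount.
effort_reduced = {k: v + 0.5 for k, v in baseline.items()}

# Incentive change: part of the releasing unit's risk is shifted to the holder.
risk_redistributed = dict(baseline)
risk_redistributed[("release", "hold")] = 2
risk_redistributed[("hold", "release")] = 2.5

for name, game in [("baseline", baseline),
                   ("effort reduced", effort_reduced),
                   ("risk redistributed", risk_redistributed)]:
    print(name, pure_nash(game))
```

In this toy setup, adding a constant to every payoff cannot change any best response, so the equilibrium stays at mutual holding; only the table that changes relative payoffs moves it.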

2603.28768 2026-06-02 cs.DC cs.LG

CRAFT: Fine-Grained Cost-Aware Expert Replication For Efficient Mixture-of-Experts Serving

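Adrian Zhao, Zhenkun Cai, Zhenyu Song, Lingfan Yu, Haozheng Fan, Jun Wu, Yida Wang, Nandita Vijaykumar

Affiliations: NVIDIA Corporation

AI summary: Proposes CRAFT, a framework that maximizes load balance under a given memory budget through fine-grained, per-layer replication based on estimated benefit, improving large-scale MoE serving throughput without additional training.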

Comments 22 pages, 15 figures

Journal ref: Proceedings of the Ninth Conference on Machine Learning and Systems (MLSys 2026)

Abstract

Mixture-of-Experts (MoE) has recently emerged as the mainstream architecture for efficiently scaling large language models while maintaining near-constant computational cost. Expert parallelism distributes parameters by partitioning experts across devices, but this introduces token-level load imbalance during inference. Expert replication is a widely adopted load-balancing technique in serving frameworks that alleviates load imbalance in large-scale deployments by replicating experts with high loads. In this work, we demonstrate that existing replication schemes often over-replicate, with many replicas providing marginal improvement. Replicas consume substantial GPU memory, which may lead to resource contention and throughput degradation. We present CRAFT, an efficient expert replication framework that maximizes load balance under a given memory budget by performing fine-grained, per-layer replication based on the estimated replication benefit. CRAFT can be seamlessly integrated into existing serving frameworks without any additional training or model changes. Our evaluation shows that CRAFT increases end-to-end serving throughput by $1.14\times$ on average (up to $1.2\times$) over existing replication techniques in large-scale deployments with models ranging from hundreds of billions to a trillion parameters.
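
To make the idea of benefit-driven, per-layer replication concrete, here is a minimal greedy sketch. It is not CRAFT's algorithm, which the abstract does not spell out. It assumes that a replica splits an expert's token load evenly, that every replica costs one unit of a shared memory budget, and that "benefit" means the drop in a layer's peak per-replica load. The function names and load figures are hypothetical.

```python
def max_load(loads, replicas):
    """Peak per-replica load in one layer, assuming load splits evenly."""
    return max(l / r for l, r in zip(loads, replicas))

def plan_replicas(layer_loads, budget, min_gain=0.0):
    """Greedy sketch: spend each replica slot where it cuts a layer's peak load most."""
    replicas = [[1] * len(loads) for loads in layer_loads]
    for _ in range(budget):
        best = None  # (gain, layer index, expert index)
        for li, loads in enumerate(layer_loads):
            reps = replicas[li]
            # Only replicating a hottest expert can lower the layer's peak.
            ei = max(range(len(loads)), key=lambda i: loads[i] / reps[i])
            before = max_load(loads, reps)
            reps[ei] += 1
            gain = before - max_load(loads, reps)
            reps[ei] -= 1
            if best is None or gain > best[0]:
                best = (gain, li, ei)
        # Stop early when the best remaining replica would add too little.
        if best is None or best[0] <= min_gain:
            break
        replicas[best[1]][best[2]] += 1
    return replicas

# Hypothetical token loads: layer 0 is skewed, layer 1 is already balanced.
layer_loads = [[90, 10, 10, 10], [30, 30, 30, 30]]
print(plan_replicas(layer_loads, budget=3))
print(plan_replicas(layer_loads, budget=3, min_gain=10))
```

In this toy case the balanced layer receives no replicas at all, and raising `min_gain` leaves part of the budget unspent, which mirrors the abstract's point that many replicas bring only marginal improvement.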

2510.00180 2026-06-02 eess.AS cs.SD eess.SP

DiffAU: Diffusion-Based Ambisonics Upscaling

Amit Milstein, Nir Shlezinger, Boaz Rafaely

Affiliations: Technion - Israel Institute of Technology

AI summary: Proposes DiffAU, which uses diffusion models adapted to spatial audio to generate third-order Ambisonics from first-order Ambisonics, enabling fast and reliable upscaling.

Abstract

Spatial audio enhances immersion by reproducing 3D sound fields, with Ambisonics offering a scalable format for this purpose. While first-order Ambisonics (FOA) notably facilitates hardware-efficient acquisition and storage of sound fields as compared to high-order Ambisonics (HOA), its low spatial resolution limits realism, highlighting the need for Ambisonics upscaling (AU) as an approach for increasing the order of Ambisonics signals. In this work we propose DiffAU, a cascaded AU method that leverages recent developments in diffusion models combined with a novel adaptation to spatial audio to generate third-order Ambisonics from FOA. By learning data distributions, DiffAU provides a principled approach that rapidly and reliably reproduces HOA in various settings. Experiments in anechoic conditions with multiple speakers show strong objective and perceptual performance.
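
For a sense of scale, a full-sphere Ambisonics signal of order N carries (N + 1)^2 channels, so upscaling from first to third order means producing 12 channels that the input does not contain. The short sketch below only does this arithmetic; it says nothing about how DiffAU generates the missing channels, and the helper name is hypothetical.

```python
def ambisonics_channels(order):
    """Number of channels in a full-sphere Ambisonics signal of the given order."""
    return (order + 1) ** 2

foa = ambisonics_channels(1)          # first-order Ambisonics
third = ambisonics_channels(3)        # third-order Ambisonics
print(foa, third, third - foa)
```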

2603.25640 2026-06-02 cs.DL cs.CL

RenoBench: A Citation Parsing Benchmark

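Parth Sarin, Juan Pablo Alperin, Adam Buttrick, Dione Mentis

Affiliations: Graduate School of Education, Stanford University; Public Knowledge Project, Simon Fraser University; DataCite, Hannover, Germany; California Digital Library, University of California Office of the President

AI summary: To address citation parsing evaluations that are not generalizable, rely on synthetic data, or are not publicly available, the paper introduces RenoBench, a public benchmark collected from four publishing ecosystems. Automated validation and feature-based sampling yield a multilingual, multi-type dataset, and an evaluation of several parsing systems shows that language models perform strongly, especially when fine-tuned, laying a foundation for reproducible, standardized evaluation.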

Comments Presented as a conference paper at CiteX 2026

Abstract

Accurate parsing of citations is necessary for machine-readable scholarly infrastructure. But, despite sustained interest in this problem, existing evaluation techniques are often not generalizable, based on synthetic data, or not publicly available. We introduce RenoBench, a public domain benchmark for citation parsing, sourced from PDFs released on four publishing ecosystems: SciELO, Redalyc, the Public Knowledge Project, and Open Research Europe. Starting from 161,000 annotated citations, we apply automated validation and feature-based sampling to produce a dataset of 10,000 citations spanning multiple languages, publication types, and platforms. We then evaluate a variety of citation parsing systems and report field-level precision and recall. Our results show strong performance from language models, particularly when fine-tuned. RenoBench enables reproducible, standardized evaluation of citation parsing systems, and provides a foundation for advancing automated citation parsing and metascientific research.
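
Field-level precision and recall, as reported above, can be sketched in a few lines. The version below assumes exact string matching per field, which may well differ from RenoBench's actual matching rule, and uses made-up citation records with hypothetical field names.

```python
from collections import defaultdict

def field_scores(gold, predicted):
    """Per-field (precision, recall) using exact string match (an assumed rule)."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, predicted):
        for field in set(g) | set(p):
            if field in g and field in p and g[field] == p[field]:
                tp[field] += 1
            else:
                if field in p:
                    fp[field] += 1  # predicted value is wrong or spurious
                if field in g:
                    fn[field] += 1  # gold value was missed or mis-extracted
    scores = {}
    for field in set(tp) | set(fp) | set(fn):
        n_pred = tp[field] + fp[field]
        n_gold = tp[field] + fn[field]
        scores[field] = (tp[field] / n_pred if n_pred else 0.0,
                         tp[field] / n_gold if n_gold else 0.0)
    return scores

# Hypothetical gold annotations and parser output for two citations.
gold = [
    {"author": "Author, A.", "year": "2021", "title": "An Example Title"},
    {"author": "Writer, B.", "year": "2019"},
]
predicted = [
    {"author": "Author, A.", "year": "2021"},
    {"author": "Writer", "year": "2019", "title": "Spurious"},
]

for field, (precision, recall) in sorted(field_scores(gold, predicted).items()):
    print(field, precision, recall)
```

A real evaluation would likely normalize whitespace, casing, and punctuation before comparing, which is why the exact-match rule is flagged as an assumption.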

2511.08851 2026-06-02 cs.NI cs.LG eess.SP

Measurement-Driven Early Warning of Reliability Breakdown in 5G NSA Railway Networks

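Po-Heng Chou, Da-Chih Lin, Hung-Yu Wei, Walid Saad, Yu Tsao

Affiliations: National Science and Technology Council (NSTC) of Taiwan; U.S. National Science Foundation (NSF); University of Notre Dame

AI summary: Taking a measurement-driven approach with 10 Hz metro-train measurement data, the paper evaluates whether six learning models can predict reliability breakdown events in 5G NSA railway networks seconds in advance, and establishes a benchmark that quantifies their performance and trade-offs.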

Comments 6 pages, 4 figures, 2 tables, and submitted to 2026 IEEE Globecom

Abstract

This paper presents a measurement-driven study of early warning for reliability breakdown events in 5G non-standalone (NSA) railway networks. Using 10 Hz metro-train measurement traces with serving- and neighbor-cell indicators, we benchmark six representative learning models, including CNN, LSTM, XGBoost, Anomaly Transformer, PatchTST, and TimesNet, under multiple observation windows and prediction horizons. Rather than proposing a new prediction architecture, this study develops a measurement-driven benchmark to quantify the feasibility and operating trade-offs of seconds-ahead reliability prediction in 5G NSA railway environments. Experimental results show that learning models can anticipate radio link failure (RLF)-related reliability breakdown events seconds in advance using lightweight radio features available on commercial devices. The presented benchmark provides insights for sensing-assisted communication control and offers an empirical foundation for integrating sensing and analytics into future mobility control.
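
The interplay of observation window and prediction horizon can be made concrete with a small labeling sketch. It assumes a common setup in which each window of past samples is labeled positive if a failure occurs anywhere in the horizon that follows; the abstract does not confirm that the paper labels its data this way, and the trace and function name are hypothetical.

```python
def make_samples(features, failure_flags, window_s, horizon_s, rate_hz=10):
    """Pair each observation window with a label: does a failure occur within the horizon?"""
    w = int(window_s * rate_hz)   # samples per observation window
    h = int(horizon_s * rate_hz)  # samples per prediction horizon
    samples = []
    for end in range(w, len(features) - h + 1):
        window = features[end - w:end]
        label = int(any(failure_flags[end:end + h]))
        samples.append((window, label))
    return samples

# Hypothetical 5-second trace at 10 Hz with one failure at sample 40.
features = list(range(50))
failure_flags = [0] * 50
failure_flags[40] = 1

samples = make_samples(features, failure_flags, window_s=1, horizon_s=2)
print(len(samples), sum(label for _, label in samples))
```

Longer horizons turn more windows positive but ask the model to look further ahead, which is one way to read the operating trade-off the abstract mentions.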

2603.22235 2026-06-02 cs.HC cs.LG

ShapDBM: Exploring Decision Boundary Maps in Shapley Space

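Luke Watkin, Daniel Archambault, Alex Telea

Affiliations: School of Computing, Newcastle University, UK; Department of Information and Computing Science, Utrecht University, Netherlands

AI summary: Proposes generating decision boundary maps by transforming data space into Shapley space and applying dimensionality reduction there; compared with maps computed directly from the data, the resulting maps have similar or higher quality metrics and decision zones that are more compact, easier to explore, and more consistent with model performance.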

Comments 4 pages and 3 figures (excluding supplementary material)

Abstract

Decision Boundary Maps (DBMs) are an effective tool for visualising machine learning classification boundaries. Yet, DBM quality strongly depends on the dimensionality reduction (DR) technique and high dimensional space used for the data points. For complex ML data, DR can create many mixed classes which yield DBMs that are hard to use or even misleading. We propose a new technique to compute DBMs by transforming data space into Shapley space and computing DR on it. Compared to DBMs computed directly from data, our maps have similar or higher quality metric values and visibly more compact, easier to explore, decision zones that better agree with measured model performance.

2603.17893 2026-06-02 cs.SE cs.AI cs.LG

scicode-lint: Detecting Methodology Bugs in Scientific Python Code with LLM-Generated Patterns

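Sergey V. Samsonau

Affiliations: Authentic Research Partners, Princeton, NJ

AI summary: Proposes scicode-lint, whose two-tier architecture (frontier models generate patterns at build time; a small local model executes them at runtime) automatically detects methodology bugs in scientific Python code, such as data leakage, incorrect cross-validation, and missing random seeds.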

Abstract

Methodology bugs in scientific Python code produce plausible but incorrect results that traditional linters and static analysis tools cannot detect. Several research groups have built ML-specific linters, demonstrating that detection is feasible. Yet these tools share a sustainability problem: dependency on specific pylint or Python versions, limited packaging, and reliance on manual engineering for every new pattern. As AI-generated code increases the volume of scientific software, the need for automated methodology checking (such as detecting data leakage, incorrect cross-validation, and missing random seeds) grows. We present scicode-lint, whose two-tier architecture separates pattern design (frontier models at build time) from execution (small local model at runtime). Patterns are generated, not hand-coded; adapting to new library versions costs tokens, not engineering hours. On Kaggle notebooks with human-labeled ground truth, preprocessing leakage detection reaches 65% precision at 100% recall; on 38 published scientific papers applying AI/ML, precision is 62% (LLM-judged) with substantial variation across pattern categories; on a held-out paper set, precision is 54%. On controlled tests, scicode-lint achieves 97.7% accuracy across 66 patterns.
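
Preprocessing leakage, one of the bug classes named above, is easy to write and hard to spot because the code runs and the numbers look plausible. The sketch below shows the bug in miniature with standard-library code only: a scaler fitted on all data versus one fitted on the training split. It illustrates the kind of bug the tool targets, not how scicode-lint detects it, and the data values are made up.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Return the (mean, standard deviation) used for standardization."""
    return mean(values), pstdev(values)

def transform(values, scaler):
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]

# Hypothetical feature column; the last value is the held-out test point.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
train, test = data[:4], data[4:]

leaky_scaler = fit_scaler(data)    # BUG: statistics include the test point
clean_scaler = fit_scaler(train)   # correct: fit on the training split only

print(transform(test, leaky_scaler))   # test point looks only mildly unusual
print(transform(test, clean_scaler))   # test point is clearly out of distribution
```

Both variants execute without error, which is why a conventional linter has nothing to flag here.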

2603.13373 2026-06-02 cs.CY cs.AI cs.LG

Ethical Fairness in Ubiquitous Health Sensing without Known Attributes

Shaily Roy, Harshit Sharma, Daniel A. Adler, Srijan Sen, Tanzeem Choudhury, Asif Salekin

Affiliations: Ira A. Fulton Schools of Engineering, Arizona State University; Arizona State University; Cornell University; University of Michigan

AI summary: To address fairness in ubiquitous health sensing when demographic or heterogeneous attributes are unavailable, the paper proposes Flare, a framework for Fisher-information-guided latent-subgroup learning with do-no-harm regularization that uses optimization geometry to achieve ethical fairness.

Abstract

In ubiquitous and mobile health systems, computational models infer human states from wearable, behavioral, and physiological sensing data. In these settings, high accuracy alone is insufficient; models must act ethically and equitably across diverse people, contexts, and devices. However, fairness methods that rely on demographic or heterogeneous attributes during training are difficult to enforce because such attributes are often unavailable, privacy-sensitive, regulated, or undesirable to collect. Conventional parity-based fairness can also violate ethical principles by trading off subgroup performance. To address this challenge, we present Flare, Fisher-guided LAtent-subgroup learning with do-no-harm REgularization, a demographic- and heterogeneous-attribute-agnostic framework that aligns human-centered fairness with ethical principles for ubiquitous and mobile sensing. Flare leverages optimization geometry, particularly Fisher Information, to regularize curvature and uncover latent disparities in model behavior without demographic or heterogeneous attributes. By integrating representation, loss, and curvature signals, it identifies hidden performance strata and refines them through collaborative but do-no-harm optimization, enhancing subgroup performance while preserving ethical balance. We also introduce BHE (Beneficence-Harm Avoidance-Equity), a metric suite that operationalizes ethical fairness beyond statistical parity. Across mobile physiological, behavioral, and clinical sensing datasets, including EDA, OhioT1DM, IHS, and Percept-R, Flare improves ethical fairness over state-of-the-art baselines. Ablation, interpretability, and loss-landscape analyses show that these gains arise from flatter optimization geometry, simpler decision rules, and do-no-harm latent-subgroup adaptation. Runtime analysis supports the practicality of Flare for resource-constrained sensing deployments.

2603.16572 2026-06-02 cs.CR cs.AI

Context Matters: Repository-Aware Security Analysis of the Agent Skill Ecosystem

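Florian Holzbauer, David Schmidt, Gabriel Gegenhuber, Sebastian Schrittwieser, Johanna Ullrich

Affiliations: Interdisciplinary Transformation University (IT:U); University of Vienna; CDL AsTra, Faculty of Computer Science

AI summary: Through repository-aware analysis, the paper finds that existing scanners overestimate the share of malicious skills (up to 46.8% flagged by marketplace scanners versus 0.52% still suspicious after analysis) and identifies new attack vectors such as the hijacking of skills hosted in abandoned repositories.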

Comments AgentSkills '26 Workshop: ACM Conference on AI and Agentic Systems (CAIS), Best Paper Award

Abstract

Agent skills extend local AI agents, such as Claude Code and OpenClaw, with additional functionality. Their growing popularity has led to dedicated marketplaces resembling mobile app stores, as well as automated scanners that assess whether skills are benign or malicious. However, scanner reports from individual marketplaces classify up to 46.8% of skills as malicious, raising concerns about false positives. We present the largest empirical security analysis of the AI agent skill ecosystem to date. We collect 238,180 unique skills from three major distribution platforms and GitHub, and analyze their contents, behavior, and repository context. Unlike existing scanner-based assessments, which evaluate skills largely in isolation, our repository-aware analysis checks whether a flagged skill is consistent with its surrounding GitHub project. This context substantially reduces the number of suspicious skills: only 0.52% remain suspicious after repository-aware analysis. Our results show that existing scanners can substantially overestimate maliciousness when repository context is ignored. At the same time, we identify previously undocumented real-world attack vectors, including the hijacking of skills hosted in abandoned GitHub repositories. Overall, our findings provide a more robust view of the agent-skill ecosystem's current risk surface and highlight the need for context-aware security evaluation.

2603.14798 2026-06-02 stat.ML cs.LG cs.NA math.NA

Preconditioned One-Step Generative Modeling for Bayesian Inverse Problems in Function Spaces

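Zilan Cheng, Li-Lian Wang, Zhongjian Wang

Affiliations: Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University

AI summary: Proposes a machine-learning algorithm based on one-step generative transport that uses a prior-aligned Gaussian random field as the source and approximates the posterior distribution with a neural operator, efficiently solving Bayesian inverse problems in function spaces.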

Abstract

We propose a machine-learning algorithm for Bayesian inverse problems in the function-space regime. Based on one-step generative transport, the method learns an amortized neural operator whose pushforward of a Gaussian source approximates the posterior distribution conditioned on each new observation. We show that white-noise sources are incompatible with the function-space limit, and therefore adopt a prior-aligned GRF as the source. We justify this choice through the Lipschitz regularity of the resulting one-step conditional posterior transport and numerical experiments on linear inverse and PDE-based inverse problems. The method is not distilled from MCMC: it is trained only with prior samples and simulated partial noisy observations. Once trained, it generates a $64\times64$ posterior sample in $\sim 10^{-3}$s, avoiding repeated forward-model evaluations in MCMC and repeated network evaluations in multistep generative samplers while matching key posterior summaries.

2603.13312 2026-06-02 cs.MM cs.LG

Design-MLLM: A Reinforcement Alignment Framework for Verifiable and Aesthetic Interior Design

Yuxuan Yang, Xiaotong Mao, Jingyao Wang, Fuchun Sun

Affiliations: National Jiangsu University of Finance; University of Lorraine; Institute of Electronics and Information Technology, Chinese Academy of Sciences; Tsinghua University

AI summary: Proposes Design-MLLM, a framework that uses reinforcement alignment with a dual-branch, aesthetic-oriented reward to reconcile hard spatial-feasibility constraints with soft aesthetic preferences in interior design, generating designs that are both feasible and aesthetically pleasing.

Abstract

Interior design is a requirements-to-visual-plan generation process that must simultaneously satisfy verifiable spatial feasibility and comparative aesthetic preferences. While recent multimodal large language models (MLLMs) offer a unified foundation for interpreting user intent and producing design rationales, our empirical analysis reveals a persistent contradiction in real-world deployment: MLLMs often produce layouts that are unbuildable and aesthetically inconsistent. These findings indicate that simply adding in-domain text is insufficient; effective interior design requires an alignment mechanism that separates hard constraints from soft preferences and coordinates them during optimization. To address this, we propose Design-MLLM, a reinforcement alignment framework that optimizes a feasibility-first preference objective via a dual-branch, aesthetic-oriented reward. Specifically, Design-MLLM (i) explicitly evaluates spatial feasibility using programmatic constraint checks, (ii) assesses aesthetic preference only among feasible candidates to avoid visually appealing but unexecutable shortcuts, and (iii) performs group-relative optimization to obtain stable preference signals. Through this process, Design-MLLM learns a controllable policy that consistently selects and generates solutions that are both executable and aesthetically coherent, rather than occasionally producing visually appealing but infeasible designs. Extensive experiments on various benchmark datasets demonstrate the advantages of Design-MLLM.
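
Steps (i) to (iii) above can be read as a gated reward followed by within-group standardization. The sketch below is one plausible reading, not the authors' implementation: it assumes infeasible layouts receive zero reward, and that "group-relative optimization" standardizes rewards across candidates sampled for the same prompt. The candidate scores and function names are hypothetical.

```python
from statistics import mean, pstdev

def reward(feasible, aesthetic_score):
    """Feasibility-first gate: aesthetics only count for layouts that pass the checks."""
    return aesthetic_score if feasible else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of candidates for the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical candidates: (passes constraint checks, aesthetic score in [0, 1]).
candidates = [(True, 0.8), (False, 0.95), (True, 0.4), (False, 0.7)]

rewards = [reward(f, s) for f, s in candidates]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

The second candidate has the highest aesthetic score but ends up with a negative advantage, which is the "visually appealing but unexecutable shortcut" the gate is meant to close off.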

2512.16310 2026-06-02 cs.CR cs.AI cs.CL

Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation

Yuxuan Qiao, Dongqin Liu, Hongchang Yang, Wei Zhou, Songlin Hu

Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences

AI summary: Studies the risk that LLM agents disclose sensitive conclusions when orchestrating multiple tools (TOP-R), builds TOP-Bench, a 1,000-instance benchmark, and proposes the TOP-Align post-training method to mitigate leakage.

Comments 17 pages, 2 figures. Dataset and code are available at https://github.com/1Ponder/TOP-R

Abstract

LLM-based agents increasingly use multiple external tools to complete complex tasks. We study Tools Orchestration Privacy Risk (TOP-R): an agent may combine individually non-sensitive tool returns and disclose an unintended sensitive conclusion. We formalize TOP-R with three conditions: conclusion sensitivity, single-source non-inferability, and compositional inferability. We introduce LRSE (Library-Grounded Reverse-Inference Seed Expansion), a four-library reverse-construction pipeline grounded in privacy norms, reasoning chains, tool schemas, and task scenarios, and use it to build TOP-Bench, a 1,000-instance benchmark. The benchmark evaluates final-response semantic disclosure under a controlled two-stage tool-use protocol. Across six LLM agents, task completion remains high, but the average leakage rate reaches 88.6 percent, yielding an H-score of only 20.4. Two prompt-only safeguards improve H-score by about 2.7 points on the main benchmark. We further propose TOP-Align, an SFT+DPO post-training method for safer task completion boundaries. On a separate post-training evaluation split, TOP-Align improves H-score by 16.2 points over the corresponding base model, compared with a 4.9-point average gain from prompt-only mitigation on the same split. These results show that TOP-R requires mitigation beyond prompting alone.

2603.02478 2026-06-02 eess.SY cs.RO cs.SY

Scalar-Measurement Attitude Estimation on $\mathbf{SO}(3)$ with Bias Compensation

Alessandro Melis, Tarek Bouazza, Hassan Alnahhal, Sifeddine Benahmed, Soulaimane Berkane, Tarek Hamel

Affiliations: I3S, CNRS, Université Côte d’Azur, Sophia Antipolis, France; Institut Universitaire de France; Department of Technology & Innovation, Capgemini Engineering; Department of Computer Science and Engineering, Université du Québec en Outaouais (UQO)

AI summary: Proposes nonlinear deterministic observers on $\mathbf{SO}(3)$ based on scalar measurements, with gyroscope bias compensation, that achieve local exponential stability under suitable observability conditions; the paper also shows that two scalar measurements suffice for attitude estimation under suitable excitation, and that three suffice in the static case.

Comments 9 pages, 4 figures. Accepted to ICRA 2026

Abstract

Attitude estimation methods typically rely on full vector measurements from inertial sensors such as accelerometers and magnetometers. This paper shows that reliable estimation can also be achieved using only scalar measurements, which naturally arise either as components of vector readings or as independent constraints from other sensing modalities. We propose nonlinear deterministic observers on $\mathbf{SO}(3)$ that incorporate gyroscope bias compensation and guarantee uniform local exponential stability under suitable observability conditions. A key feature of the framework is its robustness to partial sensing: accurate estimation is maintained even when only a subset of vector components is available. Experimental validation on the BROAD dataset confirms consistent performance across progressively reduced measurement configurations, with estimation errors remaining small even under severe information loss. To the best of our knowledge, this is the first work to establish fundamental observability results showing that two scalar measurements under suitable excitation suffice for attitude estimation, and that three are enough in the static case. These results position scalar-measurement-based observers as a practical and reliable alternative to conventional vector-based approaches.

2603.02346 2026-06-02 cond-mat.str-el cs.AI cs.LG

Large Electron Model: A Universal Ground State Predictor

大型电子模型:一种通用的基态预测器

Timothy Zaklama, Max Geier, Liang Fu

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Department of Physics(物理系)

AI总结 提出Large Electron Model,一种基于Fermi Sets架构的神经网络模型,通过在整个哈密顿参数流形上生成变分波函数,准确预测二维谐振势中相互作用电子的基态,并泛化到未见耦合强度和粒子数,为材料发现提供了基于变分原理的基座模型方法。

Comments 8+7 pages, 5+6 figures, 1+1 tables

详情
AI中文摘要

我们引入了大型电子模型,这是一个单一的神经网络模型,能够在整个哈密顿参数流形上产生相互作用电子的变分波函数。我们的模型采用了Fermi Sets架构,这是一种多体费米子波函数的通用表示,并进一步以哈密顿参数和粒子数为条件。对于二维谐振势中的相互作用电子,一个训练好的模型能够准确预测基态波函数,同时泛化到未见过的耦合强度和粒子数扇区,产生精确的实空间电荷密度和基态能量,甚至多达50个粒子。我们的结果为基于变分原理的材料发现建立了一个基座模型方法,同时准确处理了密度泛函理论能力之外的强电子关联。

英文摘要

We introduce Large Electron Model, a single neural network model that produces variational wavefunctions of interacting electrons over the entire Hamiltonian parameter manifold. Our model employs the Fermi Sets architecture, a universal representation of many-body fermionic wavefunctions, which is further conditioned on Hamiltonian parameter and particle number. For interacting electrons in a two-dimensional harmonic potential, a single trained model accurately predicts the ground state wavefunction while generalizing across unseen coupling strengths and particle-number sectors, producing both accurate real-space charge densities and ground state energies, even up to $50$ particles. Our results establish a foundation model method for material discovery that is grounded in the variational principle, while accurately treating strong electron correlation beyond the capacity of density functional theory.

2602.23866 2026-06-02 cs.SE cs.CL

SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale

SWE-rebench V2: 大规模语言无关的SWE任务集合

Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Alexander Golubev

发表机构 * GitHub

AI总结 提出语言无关的自动化流水线SWE-rebench V2,用于大规模收集可执行的软件工程任务并构建RL训练环境,产出涵盖20种语言、3617个仓库的32079个任务数据集。

Comments ICML 2026

详情
AI中文摘要

软件工程智能体(SWE)正在快速进步,最近的进展主要由强化学习(RL)驱动。然而,RL训练受到大规模任务集合稀缺性的限制,这些任务需要具有可复现的执行环境和可靠的测试套件。尽管越来越多的基准测试出现,适合训练的数据集在规模和多样性上仍然有限,或者通常针对有限的高资源语言生态系统。我们引入SWE-rebench V2,一个语言无关的自动化流水线,用于大规模收集可执行的真实世界SWE任务并构建RL训练环境。该流水线通过交互式设置智能体合成仓库特定的安装和测试程序,并使用一组LLM评判器过滤不合理的实例,这些评判器经过人工验证的SWE-bench注释验证。使用该流水线,我们构建了一个包含20种语言、3617个仓库的32079个任务数据集,并附带预构建的镜像以实现可复现执行。为了进一步扩展训练数据,我们还发布了超过120000个任务,包含安装说明、失败到通过的测试和丰富的元数据,其中问题陈述基于原始拉取请求描述生成。我们通过一项诊断研究验证了收集的实例,该研究涵盖了五种编程语言中的一部分任务,涉及七个流行模型,并提供了实例级元数据,标记了常见的混淆因素,如过于严格的测试和描述不充分。我们发布了数据集、收集和执行代码以及相关工件,以支持跨多种语言和仓库的大规模SWE智能体训练。

英文摘要

Software engineering agents (SWE) are improving rapidly, with recent gains largely driven by reinforcement learning (RL). However, RL training is constrained by the scarcity of large-scale task collections with reproducible execution environments and reliable test suites. Although a growing number of benchmarks have emerged, datasets suitable for training remain limited in scale and diversity or often target a limited set of high-resource language ecosystems. We introduce SWE-rebench V2, a language-agnostic automated pipeline for harvesting executable real-world SWE tasks and constructing RL training environments at scale. The pipeline synthesizes repository-specific installation and test procedures via an interactive setup agent, and filters unsound instances using an ensemble of LLM judges, validated against human-verified SWE-bench annotations. Using this pipeline, we construct a dataset of 32,079 tasks spanning 20 languages and 3,617 repositories, with pre-built images for reproducible execution. To further scale training data, we additionally release 120,000+ tasks with installation instructions, fail-to-pass tests and rich metadata, where the problem statement is generated based on the original pull request description. We validate the collected instances through a diagnostic study that covers a subset of tasks in five programming languages across seven popular models, and provide instance-level metadata that flags common confounders such as overly restrictive tests and underspecified descriptions. We release the datasets, the collection and execution code, and associated artifacts to enable large-scale training of SWE agents across diverse languages and repositories.

2602.01577 2026-06-02 eess.SP cs.CV

Visible Light Positioning With Lamé Curve LEDs: A Generic Approach for Camera Pose Estimation

基于拉梅曲线LED的可见光定位:一种通用的相机姿态估计方法

Wenxuan Pan, Yang Yang, Dong Wei, Zhiyu Zhu, Jintao Wang, Huan Wu, Yao Nie

发表机构 * Beijing Key Laboratory of Network System Architecture and Convergence, School of Information and Communication Engineering, Beijing University of Posts and Telecommunications(北京网络系统架构与融合重点实验室,信息与通信工程学院,北京邮电大学) Institute of Information Engineering, Chinese Academy of Sciences(信息工程研究所,中国科学院) College of Physics and Electronic Engineering, Shanxi University(物理与电子工程学院,山西大学) School of Electronic Information and Artificial Intelligence, West Anhui University(电子信息与人工智能学院,皖西学院)

AI总结 本文提出一种基于拉梅曲线LED的通用可见光定位算法LC-VLP,通过统一表示常见LED形状并利用曲线参数进行非线性最小二乘优化,实现高精度相机姿态估计。

Comments Submitted to an IEEE journal for possible publication

详情
AI中文摘要

基于相机的可见光定位(VLP)是一种有前景的技术,可实现精确且低成本的室内相机姿态估计(CPE)。为减少所需发光二极管(LED)的数量,先进方法通常利用LED形状特征进行定位。尽管有趣,但这些方法通常局限于单一LED几何形状,导致在异构LED形状场景中失效。为应对这一挑战,本文研究拉梅曲线作为常见LED形状的统一表示,并提出一种使用拉梅曲线形状LED的通用VLP算法,称为LC-VLP。在所考虑的系统中,多个天花板安装的拉梅曲线形状LED通过可见光通信定期广播其曲线参数,这些参数由配备相机的接收器捕获。基于接收到的LED图像和曲线参数,接收器可使用LC-VLP估计相机姿态。具体而言,离线构建LED数据库以存储曲线参数,而在线定位则被表述为非线性最小二乘问题并迭代求解。为提供可靠的初始化,进一步开发了一种无需对应点的透视n点(FreePnP)算法,无需任何预校准参考点即可实现近似CPE。通过仿真和实验验证了LC-VLP的性能。仿真表明,在圆形和矩形LED场景中,LC-VLP均优于最先进的方法。与透视弧算法相比,LC-VLP可实现平均位置和旋转误差均降低30%以上。实验进一步表明,LC-VLP可实现小于4厘米的平均位置精度。

英文摘要

Camera-based visible light positioning (VLP) is a promising technique for accurate and low-cost indoor camera pose estimation (CPE). To reduce the number of required light-emitting diodes (LEDs), advanced methods commonly exploit LED shape features for positioning. Despite their promise, these methods are typically restricted to a single LED geometry and therefore fail in heterogeneous LED-shape scenarios. To address this challenge, this paper investigates Lamé curves as a unified representation of common LED shapes and proposes a generic VLP algorithm using Lamé curve-shaped LEDs, termed LC-VLP. In the considered system, multiple ceiling-mounted Lamé curve-shaped LEDs periodically broadcast their curve parameters via visible light communication, and these broadcasts are captured by a camera-equipped receiver. Based on the received LED images and curve parameters, the receiver can estimate the camera pose using LC-VLP. Specifically, an LED database is constructed offline to store the curve parameters, while online positioning is formulated as a nonlinear least-squares problem and solved iteratively. To provide a reliable initialization, a correspondence-free perspective-n-point (FreePnP) algorithm is further developed, enabling approximate CPE without any pre-calibrated reference points. The performance of LC-VLP is verified by both simulations and experiments. Simulations show that LC-VLP outperforms state-of-the-art methods in both circular- and rectangular-LED scenarios. Compared to a perspective arcs algorithm, LC-VLP reduces both the average position error and the average rotation error by over 30%. Experiments further show that LC-VLP can achieve an average position accuracy of less than 4 cm.
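
To make the "unified representation" idea concrete, the following is a minimal sketch of a planar Lamé curve (superellipse), $|x/a|^n + |y/b|^n = 1$. With $n = 2$ and $a = b$ it is a circle, and as $n$ grows it approaches a rectangle with half-sides $a$ and $b$. The function names and the parametrization are illustrative assumptions; the abstract does not state the exact form used in LC-VLP, and no camera model or pose solver is included here.

```python
import math


def lame_residual(x, y, a, b, n):
    """Implicit Lamé-curve function: 0 on the curve, < 0 inside, > 0 outside."""
    return abs(x / a) ** n + abs(y / b) ** n - 1.0


def lame_point(theta, a, b, n):
    """Point on the curve for parameter theta (standard superellipse parametrization)."""
    c, s = math.cos(theta), math.sin(theta)
    x = a * math.copysign(abs(c) ** (2.0 / n), c)
    y = b * math.copysign(abs(s) ** (2.0 / n), s)
    return x, y


# Hypothetical LED outlines (units are arbitrary): one circular, one near-rectangular.
circle_led = {"a": 1.0, "b": 1.0, "n": 2}
rect_led = {"a": 1.0, "b": 1.0, "n": 20}

# The same test point lies outside the circle but inside the near-rectangle.
probe = (0.95, 0.95)
print(lame_residual(*probe, **circle_led) > 0)
print(lame_residual(*probe, **rect_led) < 0)
```

A single exponent parameter is what lets one residual function cover both LED shapes, which is presumably why such curves suit a least-squares fit over heterogeneous fixtures.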

2602.22221 2026-06-02 cs.IR cs.AI cs.CL cs.CY

Evaluating Reliability Asymmetries in Chinese Factual Search and AI Answers

评估中文事实搜索与AI答案中的可靠性不对称性

Geng Liu, Li Feng, Mengxiao Zhu, Francesco Pierri

发表机构 * Department of Electronics, Information and Bioengineering, Politecnico di Milano(电子、信息与生物工程系,米兰理工学院) University of Science and Technology of China(中国科学技术大学)

AI总结 通过构建基于真实中文搜索日志的查询事实核查数据集,比较传统搜索引擎、大型语言模型和搜索集成AI概览在中文是非问题上的准确性、回答频率、极性差距及区域信息需求差异,揭示可靠性不仅取决于回答正确性,还受回答频率、否定主张处理和信息需求暴露风险影响。

详情
AI中文摘要

搜索引擎和AI驱动的系统越来越多地成为获取事实信息的媒介,但在现实信息寻求场景中,其可靠性仍难以评估。我们通过从真实中文搜索日志构建基于查询的事实核查数据集,并比较传统搜索引擎、独立大型语言模型和搜索集成AI概览等九种系统,在中文网络生态中研究这一问题。聚焦于中文事实性是非问题,我们根据证据推导的基准事实评估系统是否提供正确、错误或不确定的判断。我们发现,当系统给出明确答案时,准确率相似(73.2%至78.9%),但给出明确答案的频率差异显著:搜索引擎对超过83%的查询给出明确答案,而Qwen-Max则不到一半。我们还发现一致的极性差距:所有系统在标记为“是”的查询上表现优于标记为“否”的查询。我们利用百度指数数据识别健康相关搜索关注度较高的中国省份,这可能表明更大的错误信息暴露风险。总体而言,我们的结果表明,可靠性不仅取决于系统回答时的正确性,还取决于回答频率、如何处理否定主张以及信息需求可能增加暴露风险的地方。

英文摘要

Search engines and AI-powered systems increasingly mediate access to factual information, yet their reliability remains difficult to evaluate in realistic information-seeking settings. We study this problem in the Chinese web ecosystem by constructing a query-based fact-checking dataset from real Chinese search logs and comparing nine systems across traditional search engines, standalone large language models, and search-integrated AI Overviews. Focusing on Chinese-language factual Yes/No questions, we evaluate whether systems provide correct, incorrect, or uncertain decisions against evidence-derived ground truth. We find that systems are similarly accurate when they provide definitive answers, but differ sharply in how often they do so. Conditional accuracy ranges from 73.2% to 78.9%, yet search engines answer definitively on over 83% of queries, while Qwen-Max does so on fewer than half. We also find a consistent polarity gap: all systems perform better on yes-labeled queries than on no-labeled queries. In addition, we use Baidu Index data to identify Chinese provinces with higher health-related search attention, which may indicate greater potential exposure to misinformation. Overall, our results show that reliability depends not only on whether systems are correct when they answer, but also on how often they answer, how they handle negative claims, and where information demand may increase exposure risks.
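
The distinction between conditional accuracy, definitive-answer rate, and the polarity gap can be sketched as follows. This is an illustrative reading of the abstract, not the authors' evaluation code: the function name, the three decision labels, and the choice to compute per-polarity accuracy over definitive answers only are assumptions, and the toy data is made up.

```python
def reliability_summary(decisions, truths):
    """decisions: 'yes' | 'no' | 'uncertain' per query; truths: 'yes' | 'no' per query."""
    nan = float("nan")
    definitive = [(d, t) for d, t in zip(decisions, truths) if d != "uncertain"]

    def accuracy(pairs):
        return sum(d == t for d, t in pairs) / len(pairs) if pairs else nan

    acc_yes = accuracy([(d, t) for d, t in definitive if t == "yes"])
    acc_no = accuracy([(d, t) for d, t in definitive if t == "no"])
    return {
        "answer_rate": len(definitive) / len(decisions),  # how often the system commits
        "conditional_accuracy": accuracy(definitive),     # accuracy given that it commits
        "polarity_gap": acc_yes - acc_no,                 # > 0: better on yes-labeled queries
    }


# Made-up toy example with six queries.
toy_decisions = ["yes", "yes", "uncertain", "no", "yes", "uncertain"]
toy_truths = ["yes", "no", "yes", "no", "yes", "no"]
print(reliability_summary(toy_decisions, toy_truths))
```

Two systems can share the same conditional accuracy while differing widely in answer rate, which is the asymmetry the abstract highlights.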

2508.08337 2026-06-02 cs.CY cs.AI cs.LG

Position: Beyond Sensitive Attributes, ML Fairness Should Quantify Structural Injustice via Social Determinants

立场:超越敏感属性,机器学习公平性应通过社会决定因素量化结构性不公正

Zeyu Tang, Alex John London, Atoosa Kasirzadeh, Sarah Stewart de Ramirez, Peter Spirtes, Kun Zhang, Sanmi Koyejo

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) University of Washington(华盛顿大学) University of Michigan(密歇根大学) University of Toronto(多伦多大学)

AI总结 本文主张算法公平性研究应超越敏感属性,通过社会决定因素量化结构性不公正,并通过理论模型和实证研究证明仅关注敏感属性的缓解策略可能引入新的结构性不公正。

Comments Accepted to ICML 2026 Position Paper Track

详情
AI中文摘要

算法公平性研究在很大程度上将不公平视为对敏感属性的歧视。然而,这种方法限制了对作为通过社会决定因素实例化的结构性不公正的不公平的可见性,社会决定因素是塑造属性和结果但不涉及特定个体的上下文变量。这篇立场论文认为,该领域应通过社会决定因素量化结构性不公正,超越敏感属性。借鉴跨学科见解,我们认为主流技术范式未能充分捕捉作为结构性不公正的不公平,因为上下文可能被视为需要标准化的噪声,而不是需要审计的信号。我们进一步通过大学录取的理论模型、使用美国人口普查数据的人口统计研究以及美国综合医疗系统中关于乳腺癌筛查的高风险领域应用,证明了这种转变的实际紧迫性。我们的结果表明,仅关注敏感属性的缓解策略可能引入新的结构性不公正形式。我们认为,通过社会决定因素审计结构性不公正必须先于缓解措施,并呼吁开发超越以敏感属性为中心的非歧视公平概念的新技术。

英文摘要

Algorithmic fairness research has largely framed unfairness as discrimination along sensitive attributes. However, this approach limits visibility into unfairness as structural injustice instantiated through social determinants, which are contextual variables that shape attributes and outcomes without pertaining to specific individuals. This position paper argues that the field should quantify structural injustice via social determinants, beyond sensitive attributes. Drawing on cross-disciplinary insights, we argue that prevailing technical paradigms fail to adequately capture unfairness as structural injustice, because contexts are potentially treated as noise to be normalized rather than signal to be audited. We further demonstrate the practical urgency of this shift through a theoretical model of college admissions, a demographic study using U.S. census data, and a high-stakes domain application regarding breast cancer screening within an integrated U.S. healthcare system. Our results indicate that mitigation strategies centered solely on sensitive attributes can introduce new forms of structural injustice. We contend that auditing structural injustice through social determinants must precede mitigation, and call for new technical developments that move beyond sensitive-attribute-centered notions of fairness as non-discrimination.

2512.16167 2026-06-02 cs.MA cs.AI cs.GT

Ev-Trust: An Evolutionarily Stable Trust Mechanism for Decentralized LLM-Based Multi-Agent Service Economies

Ev-Trust: 一种面向去中心化基于LLM的多智能体服务经济的演化稳定信任机制

Jiye Wang, Shiduo Yang, Ting Qiao, Jiayu Qin, Jianbin Li, Yu Wang, Yuanhe Zhao

发表机构 * School of Control and Computer Engineering, North China Electric Power University(控制与计算机工程学院,华北电力大学) State Grid Corporation of China(国家电网公司)

AI总结 针对去中心化LLM多智能体服务经济中欺诈成本降低、服务质量评估困难和服务内容不稳定三大脆弱性,提出Ev-Trust信任机制,通过交叉验证门、方差标准化漂移度量和信任信号嵌入收益函数,实现合作策略的演化稳定,实验表明恶意参与减少约60%,欺诈率降低约50%。

Comments 19 pages, 9 figures

详情
AI中文摘要

去中心化基于LLM的多智能体服务经济面临三个脆弱性,这些脆弱性破坏了传统信任机制:欺诈成本降低、服务质量评估困难以及服务内容不稳定。这些复合脆弱性可能引发群体层面的信任崩溃和短视策略的扩散。我们提出Ev-Trust,一种演化稳定的信任机制,通过三个针对性设计应对这些脆弱性:利用请求者语义理解评估响应有效性的交叉验证门;过滤内源随机性与真实行为异常的方差标准化漂移度量;以及将信任信号嵌入期望收益函数,将可信度转化为演化生存优势。基于带噪声最优反应微观基础的复制者动力学,我们证明了合作演化稳定策略的渐近稳定性,并推导了维持合作均衡的显式阈值条件。我们通过至少100个异构LLM驱动智能体(涵盖七种行为类型)的100轮模拟评估Ev-Trust。实验在TruthfulQA和TriviaQA两个事实性问答基准上进行。与基于传递信任聚合、强化学习声誉和纯演化模仿的基线相比,Ev-Trust将恶意智能体参与率降低约60%,欺诈服务率抑制约50%,并在30%对抗性突变下维持稳定的信任分化。这些结果表明,将语义信任评估与演化激励相结合,为在去中心化基于LLM的多智能体系统中保障合作提供了原则性基础。

英文摘要

Decentralized LLM-based multi-agent service economies face three vulnerabilities that undermine traditional trust mechanisms: reduced cost of fraud, difficulty in evaluating service quality, and instability of service content. These compounding vulnerabilities can trigger population-level trust collapse and the proliferation of short-sighted strategies. We propose Ev-Trust, an evolutionarily stable trust mechanism that addresses these vulnerabilities through three targeted designs: a cross-validation gate leveraging requestor semantic comprehension to assess response validity, a variance-standardized drift measure filtering endogenous stochasticity from genuine behavioral anomalies, and an embedding of trust signals into the expected revenue function that converts trustworthiness into an evolutionary survival advantage. Based on replicator dynamics with a noisy best response micro-foundation, we prove the asymptotic stability of cooperative evolutionarily stable strategies and derive explicit threshold conditions for maintaining cooperative equilibria. We evaluate Ev-Trust through 100-round simulations with at least 100 heterogeneous LLM-driven agents covering seven behavioral types. The experiments are conducted on TruthfulQA and TriviaQA, two factual question-answering benchmarks. Compared to baselines based on transitive trust aggregation, reinforcement-learning reputation, and pure evolutionary imitation, Ev-Trust reduces malicious agent participation by approximately 60%, suppresses the fraudulent service rate by approximately 50%, and maintains stable trust differentiation under a 30% adversarial mutation. These results demonstrate that coupling semantic trust evaluation with evolutionary incentives provides a principled foundation for securing cooperation in decentralized LLM-based multi-agent systems.
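
As a rough illustration of how a trust signal in the payoff can tip replicator dynamics toward cooperation, here is a two-strategy Euler-discretized sketch. It is a deliberately simplified stand-in: the constant payoffs, the additive `trust_bonus`, and all names are hypothetical, and the noisy best-response micro-foundation, the cross-validation gate, and the drift measure described in the abstract are not modeled.

```python
def replicator_step(x, payoff_coop, payoff_defect, dt=0.1):
    """One Euler step of two-strategy replicator dynamics; x is the cooperator share."""
    average = x * payoff_coop + (1.0 - x) * payoff_defect
    return x + dt * x * (payoff_coop - average)


def simulate(x0, base_coop, base_defect, trust_bonus, steps=500):
    """Iterate the dynamics with a hypothetical trust bonus added to cooperators' payoff."""
    x = x0
    for _ in range(steps):
        x = replicator_step(x, base_coop + trust_bonus, base_defect)
    return x


# Without a trust bonus, defection pays more and cooperation dies out;
# with a large enough bonus, cooperation takes over.
print(simulate(0.5, base_coop=1.0, base_defect=1.5, trust_bonus=0.0))
print(simulate(0.5, base_coop=1.0, base_defect=1.5, trust_bonus=1.0))
```

In this toy setting the threshold is simply `trust_bonus > base_defect - base_coop`; the explicit threshold conditions derived in the paper are not reproduced here.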

2602.16794 2026-06-02 stat.ML cs.LG

Beyond Procedure: Substantive Fairness in Conformal Prediction

超越程序:共形预测中的实质性公平

Pengqi Liu, Zijun Yu, Mouloud Belbahri, Arthur Charpentier, Masoud Asgharian, Jesse C. Cresswell

发表机构 * University of Montreal(蒙特利尔大学)

AI总结 本文通过理论分解和LLM辅助评估,研究共形预测中标签聚类方法如何平衡效用与实质性公平,并发现均衡集合大小比覆盖度更能提升公平性。

Comments Camera-ready version. Accepted at ICML 2026

详情
AI中文摘要

共形预测(CP)为机器学习模型提供了无分布的不确定性量化,但其在下游决策中与公平性的相互作用仍未充分探索。超越将CP视为独立操作(程序公平),我们分析整体决策流程以评估实质性公平——下游结果的公平性。理论上,我们推导出一个上界,将预测集大小差异分解为可解释的组成部分,阐明标签聚类CP如何帮助控制方法驱动的对不公平的贡献。为了促进可扩展的实证分析,我们引入了一个LLM在环评估器,它近似人类对跨多种模态的实质性公平的评估。我们的实验表明,标签聚类CP通常在效用和实质性公平之间提供了有利的平衡,同时根据我们的理论减少了集合大小差异。最后,我们实证表明,均衡的集合大小(而非覆盖度)与实质性公平的改善强相关,使从业者能够设计更公平的CP系统。我们的代码可在https://github.com/layer6ai-labs/llm-in-the-loop-conformal-fairness获取。

英文摘要

Conformal prediction (CP) offers distribution-free uncertainty quantification for machine learning models, yet its interplay with fairness in downstream decision-making remains underexplored. Moving beyond CP as a standalone operation (procedural fairness), we analyze the holistic decision-making pipeline to evaluate substantive fairness, that is, the equity of downstream outcomes. Theoretically, we derive an upper bound that decomposes prediction-set size disparity into interpretable components, clarifying how label-clustered CP helps control method-driven contributions to unfairness. To facilitate scalable empirical analysis, we introduce an LLM-in-the-loop evaluator that approximates human assessment of substantive fairness across diverse modalities. Our experiments show that label-clustered CP often provides a favorable balance between utility and substantive fairness, while reducing set-size disparities in line with our theory. Finally, we empirically show that equalized set sizes, rather than coverage, strongly correlate with improved substantive fairness, enabling practitioners to design fairer CP systems. Our code is available at https://github.com/layer6ai-labs/llm-in-the-loop-conformal-fairness.
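
For readers unfamiliar with prediction-set size disparity, the sketch below computes plain split conformal prediction sets for a classifier and reports the mean set size per group. It uses the standard marginal split-CP recipe with the score `1 - p(true label)`, not the label-clustered variant studied in the paper; all function names and the toy inputs are assumptions for illustration.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split-CP threshold: the ceil((n+1)(1-alpha))-th smallest calibration score."""
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return scores[k - 1] if k <= n else float("inf")  # inf: every label is included


def prediction_set(probs, threshold):
    """All labels whose nonconformity score 1 - p(label) is within the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]


def mean_set_size_by_group(test_probs, groups, threshold):
    """Average prediction-set size per group; unequal values indicate set-size disparity."""
    sizes = {}
    for probs, group in zip(test_probs, groups):
        sizes.setdefault(group, []).append(len(prediction_set(probs, threshold)))
    return {group: sum(v) / len(v) for group, v in sizes.items()}
```

A gap between the per-group averages returned by `mean_set_size_by_group` is the kind of disparity the abstract's upper bound decomposes.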

2602.16720 2026-06-02 cs.DB cs.AI

APEX-SQL: Talking to the data via Agentic Exploration for Text-to-SQL

APEX-SQL: 通过智能体探索与数据对话实现Text-to-SQL

Bowen Cao, Weibin Liao, Yushi Sun, Dong Fang, Haitao Li, Wai Lam

发表机构 * The Chinese University of Hong Kong(香港中文大学) Peking University(北京大学) LIGHTSPEED

AI总结 提出APEX-SQL框架,通过假设验证循环、逻辑规划、双路径剪枝、并行数据分析和确定性探索机制,解决静态模式表示在复杂企业数据库中的语义模糊和扩展性问题,在BIRD和Spider 2.0-Snow上取得领先性能。

Comments KDD 2026

详情
AI中文摘要

由大型语言模型驱动的Text-to-SQL系统在学术基准测试中表现出色,但在复杂的企业环境中却难以应对。主要限制在于它们依赖静态模式表示,这无法解决语义模糊性,也无法有效扩展到大型复杂数据库。为了解决这个问题,我们提出了APEX-SQL,一个智能体Text-to-SQL框架,它将范式从被动翻译转变为智能体探索。我们的框架采用假设验证循环,将模型推理基于真实数据。在模式链接阶段,我们使用逻辑规划来表述假设,双路径剪枝来减少搜索空间,并行数据分析来验证列角色与真实数据的关系,然后进行全局合成以确保拓扑连通性。对于SQL生成,我们引入了一种确定性机制来检索探索指令,使智能体能够有效地探索数据分布、细化假设并生成语义准确的SQL。在BIRD(执行准确率70.65%)和Spider 2.0-Snow(执行准确率51.01%)上的实验表明,APEX-SQL在减少token消耗的同时优于竞争基线。进一步的分析表明,智能体探索作为性能倍增器,释放了基础模型在企业环境中的潜在推理能力。消融研究证实了每个组件在确保稳健和准确数据分析中的关键贡献。我们的代码发布在https://github.com/Tencent/APEX-SQL-Project。

英文摘要

Text-to-SQL systems powered by Large Language Models have excelled on academic benchmarks but struggle in complex enterprise environments. The primary limitation lies in their reliance on static schema representations, which fail to resolve semantic ambiguity or to scale effectively to large, complex databases. To address this, we propose APEX-SQL, an Agentic Text-to-SQL Framework that shifts the paradigm from passive translation to agentic exploration. Our framework employs a hypothesis-verification loop to ground model reasoning in real data. In the schema linking phase, we use logical planning to verbalize hypotheses, dual-pathway pruning to reduce the search space, and parallel data profiling to validate column roles against real data, followed by global synthesis to ensure topological connectivity. For SQL generation, we introduce a deterministic mechanism to retrieve exploration directives, allowing the agent to effectively explore data distributions, refine hypotheses, and generate semantically accurate SQL queries. Experiments on BIRD (70.65% execution accuracy) and Spider 2.0-Snow (51.01% execution accuracy) demonstrate that APEX-SQL outperforms competitive baselines with reduced token consumption. Further analysis reveals that agentic exploration acts as a performance multiplier, unlocking the latent reasoning potential of foundation models in enterprise settings. Ablation studies confirm the critical contributions of each component in ensuring robust and accurate data analysis. Our code is released at https://github.com/Tencent/APEX-SQL-Project.

2602.15259 2026-06-02 cs.CY cs.AI cs.LG

Knowing Isn't Understanding: Re-grounding Generative Proactivity with Epistemic and Behavioral Insight

知道不等于理解:用认知与行为洞察重新奠定生成式主动性

Kirandeep Kaur, Xingda Lyu, Chirag Shah

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Toronto(多伦多大学) University of Waterloo(滑铁卢大学)

AI总结 针对用户无法明确表达需求时的认知不完整问题,提出生成式主动性需要基于认知和行为双重约束来设计负责任的主动代理。

Comments 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

生成式AI代理将理解等同于解决显式查询,这一假设将交互限制在用户能够表达的范围内。当用户自身缺乏对缺失、风险或值得考虑之事的意识时,这一假设就会失效。在这种情况下,主动性不仅是效率提升,更是一种认知上的必要性。我们将这种状态称为认知不完整:即进步依赖于处理未知的未知以实现有效协作。现有的主动性方法仍然局限于预测性,从过去行为中推断并假定目标已经明确,从而未能有意义地支持用户。然而,揭示超出用户当前意识的可能性并非天然有益。不受约束的主动干预可能误导注意力、使用户不堪重负或引入伤害。因此,主动代理需要行为锚定:对代理何时、如何以及在何种程度上进行干预施加原则性约束。我们主张生成式主动性必须在认知和行为上双重锚定。借鉴无知哲学和主动行为研究,我们认为这些理论为设计能够负责任地参与并促进有意义协作的代理提供了关键指导。

英文摘要

Generative AI agents equate understanding with resolving explicit queries, an assumption that confines interaction to what users can articulate. This assumption breaks down when users themselves lack awareness of what is missing, risky, or worth considering. In such conditions, proactivity is not merely an efficiency enhancement, but an epistemic necessity. We refer to this condition as epistemic incompleteness: where progress depends on engaging with unknown unknowns for effective partnership. Existing approaches to proactivity remain narrowly anticipatory, extrapolating from past behavior and presuming that goals are already well defined, thereby failing to support users meaningfully. However, surfacing possibilities beyond a user's current awareness is not inherently beneficial. Unconstrained proactive interventions can misdirect attention, overwhelm users, or introduce harm. Proactive agents, therefore, require behavioral grounding: principled constraints on when, how, and to what extent an agent should intervene. We advance the position that generative proactivity must be grounded both epistemically and behaviorally. Drawing on the philosophy of ignorance and research on proactive behavior, we argue that these theories offer critical guidance for designing agents that can engage responsibly and foster meaningful partnerships.

2602.12972 2026-06-02 cs.SI cs.LG

Jointly Optimizing Debiased CTR and Uplift for Coupons Marketing: A Unified Causal Framework

联合优化去偏点击率和优惠券营销提升:一个统一因果框架

Siyun Yang, Shixiao Yang, Jian Wang, Di Fan, Kehe Cai, Haoyan Fu, Jiaming Zhang, Wenjin Wu, Peng Jiang

发表机构 * Kuaishou Technology(快手科技) Beijing Institute of Technology(北京理工大学) Independent Researcher(独立研究者)

AI总结 针对优惠券等营销干预导致的点击率预测偏差,提出统一多值处理网络UniMVT,通过反事实推断同时实现去偏点击率预测和提升估计。

详情
AI中文摘要

在线广告中,优惠券等营销干预会引入显著的混杂偏差,影响点击率(CTR)预测。观察到的点击反映了用户内在偏好与干预带来的提升的混合。这导致传统模型对基础CTR校准不准确,从而扭曲下游排序和计费决策。此外,营销干预通常作为多值处理,具有不同幅度,给CTR预测增加了额外复杂性。为解决这些问题,我们提出了统一多值处理网络(UniMVT)。具体来说,UniMVT从处理敏感表示中解耦混杂因素,使得全空间反事实推断模块能够联合重建去偏的基础CTR和强度-响应曲线。为处理多值处理的复杂性,UniMVT采用辅助强度估计任务来捕获处理倾向,并设计一个单位提升目标来归一化干预效果。这确保了在连续优惠券价值谱上的可比较估计。UniMVT同时实现了用于准确系统校准的去偏CTR预测和用于激励分配的精确提升估计。在合成和工业数据集上的大量实验证明了UniMVT在预测准确性和校准方面的优越性。此外,真实世界的A/B测试证实,UniMVT通过更有效的优惠券分发显著改善了业务指标。

英文摘要

In online advertising, marketing interventions such as coupons introduce significant confounding bias into Click-Through Rate (CTR) prediction. Observed clicks reflect a mixture of users' intrinsic preferences and the uplift induced by these interventions. This causes conventional models to miscalibrate base CTRs, which distorts downstream ranking and billing decisions. Furthermore, marketing interventions often operate as multi-valued treatments with varying magnitudes, introducing additional complexity to CTR prediction. To address these issues, we propose the Unified Multi-Valued Treatment Network (UniMVT). Specifically, UniMVT disentangles confounding factors from treatment-sensitive representations, enabling a full-space counterfactual inference module to jointly reconstruct the debiased base CTR and intensity-response curves. To handle the complexity of multi-valued treatments, UniMVT employs an auxiliary intensity estimation task to capture treatment propensities and devises a unit uplift objective that normalizes the intervention effect. This ensures comparable estimation across the continuous coupon-value spectrum. UniMVT simultaneously achieves debiased CTR prediction for accurate system calibration and precise uplift estimation for incentive allocation. Extensive experiments on synthetic and industrial datasets demonstrate UniMVT's superiority in both predictive accuracy and calibration. Furthermore, real-world A/B tests confirm that UniMVT significantly improves business metrics through more effective coupon distribution.

2602.12819 2026-06-02 cs.IR cs.CV

WISE: A Multimodal Search Engine for Visual Scenes, Audio, Objects, Faces, Speech, and Metadata

WISE:一种用于视觉场景、音频、物体、人脸、语音和元数据的多模态搜索引擎

Prasanna Sridhar, Horace Lee, David M. S. Pinto, Andrew Zisserman, Abhishek Dutta

发表机构 * Engineering Science, University of Oxford(牛津大学工程科学系)

AI总结 提出WISE开源多模态搜索引擎,整合场景级和物体级的自然语言与反向图像查询、人脸搜索、音频事件检索、语音转录搜索及元数据过滤,支持跨模态组合查询,采用向量搜索实现高效扩展,可本地部署。

Comments Software: https://www.robots.ox.ac.uk/~vgg/software/wise/ , Online demos: https://www.robots.ox.ac.uk/~vgg/software/wise/demo/ , Example Queries: https://www.robots.ox.ac.uk/~vgg/software/wise/examples/

详情
Journal ref
International ACM SIGIR Conference on Research and Development in Information Retrieval (2026)
AI中文摘要

在本文中,我们提出WISE,一个开源视听搜索引擎,它将多种多模态检索能力集成到一个单一、实用的工具中,无需机器学习专业知识即可使用。WISE支持图像和视频的场景级(例如空街道)和物体级(例如马)的自然语言和反向图像查询;基于人脸的特定个体搜索;使用文本(例如木头吱吱声)或音频文件的声学事件音频检索;自动转录语音的搜索;以及按用户提供的元数据进行过滤。通过跨模态组合查询可以获得丰富的洞察——例如,通过应用物体查询“火车”和元数据查询“德国”从历史档案中检索德国火车,或在一个地方搜索人脸。通过采用向量搜索技术,WISE可以扩展到支持对数百万张图像或数千小时视频的高效检索。其模块化架构便于集成新模型。WISE可以本地部署用于私有或敏感集合,并已应用于各种实际用例。我们的代码是开源的,可在https://gitlab.com/vgg/wise/wise获取。

英文摘要

In this paper, we present WISE, an open-source audiovisual search engine which integrates a range of multimodal retrieval capabilities into a single, practical tool accessible to users without machine learning expertise. WISE supports natural-language and reverse-image queries at both the scene level (e.g. empty street) and object level (e.g. horse) across images and videos; face-based search for specific individuals; audio retrieval of acoustic events using text (e.g. wood creak) or an audio file; search over automatically transcribed speech; and filtering by user-provided metadata. Rich insights can be obtained by combining queries across modalities -- for example, retrieving German trains from a historical archive by applying the object query "train" and the metadata query "Germany", or searching for a face in a place. By employing vector search techniques, WISE can scale to support efficient retrieval over millions of images or thousands of hours of video. Its modular architecture facilitates the integration of new models. WISE can be deployed locally for private or sensitive collections, and has been applied to various real-world use cases. Our code is open-source and available at https://gitlab.com/vgg/wise/wise.

2602.07298 2026-06-02 cs.IR cs.AI

Principled Synthetic Data Enables the First Scaling Laws for LLMs in Recommendation

原则性合成数据使推荐系统中的LLM首次出现缩放定律

Benyu Zhang, Qiang Zhang, Jianpeng Cheng, Hong-You Chen, Qifei Wang, Wei Sun, Shen Li, Jia Li, Jiahao Wu, Qunshu Zhang, Neeraj Bhatia, Xiangjun Fan, Hong Yan

发表机构 * Meta

AI总结 本文提出一种分层框架生成高质量合成数据,通过避免原始数据噪声,首次在推荐领域实现LLM的稳健幂律缩放,并显著提升下游排序任务性能。

Comments Updated according to ICML reviewers' feedback

详情
Journal ref
ICML 2026
AI中文摘要

大型语言模型(LLM)代表了推荐系统的一个有前景的前沿,但其发展一直受到缺乏可预测缩放定律的阻碍,而缩放定律对于指导研究和优化资源分配至关重要。我们假设,这可能是由于先前持续预训练(CPT)工作中原始用户交互数据固有的噪声、偏差和不完整性所致。本文介绍了一种新颖的分层框架,用于生成高质量合成数据,通过为LLM创建精心策划的教学课程来规避此类问题。我们提供了强有力的直接证据,证明我们课程的有效性:在原则性合成数据上训练的标准序列模型在下游排序任务中显著优于(在SasRec的recall@100上提高+130%)在真实数据上训练的模型,展示了其在学习可泛化用户偏好模式方面的优越性。在此基础上,我们首次通过实验证明,在高质量、推荐特定数据上持续预训练的LLM存在稳健的幂律缩放。我们的实验揭示了跨多种合成数据模态的一致且可预测的困惑度降低。这些发现为在推荐领域可靠地缩放LLM能力建立了基础方法论,从而将研究重点从缓解数据缺陷转向利用高质量的结构化信息。

英文摘要

Large Language Models (LLMs) represent a promising frontier for recommender systems, yet their development has been impeded by the absence of predictable scaling laws, which are crucial for guiding research and optimizing resource allocation. We hypothesize that this may be attributed to the inherent noise, bias, and incompleteness of raw user interaction data in prior continual pre-training (CPT) efforts. This paper introduces a novel, layered framework for generating high-quality synthetic data that circumvents such issues by creating a curated, pedagogical curriculum for the LLM. We provide powerful, direct evidence for the utility of our curriculum by showing that standard sequential models trained on our principled synthetic data significantly outperform ($+130\%$ on recall@100 for SASRec) models trained on real data in downstream ranking tasks, demonstrating its superiority for learning generalizable user preference patterns. Building on this, we empirically demonstrate, for the first time, robust power-law scaling for an LLM that is continually pre-trained on our high-quality, recommendation-specific data. Our experiments reveal consistent and predictable perplexity reduction across multiple synthetic data modalities. These findings establish a foundational methodology for reliably scaling LLM capabilities in the recommendation domain, thereby shifting the research focus from mitigating data deficiencies to leveraging high-quality, structured information.
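
A power-law scaling claim of this kind is typically checked by fitting a straight line in log-log space. The sketch below fits $L(N) = a\,N^{-b}$ by ordinary least squares on $(\log N, \log L)$. The choice of scale variable $N$ (parameters, tokens, or compute), the function name, and the synthetic numbers are assumptions; the abstract does not say which quantity the perplexity is scaled against or how the fit was done.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-b) via least squares on log-transformed data; returns (a, b)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic, noise-free data generated from a known law (not results from the paper).
toy_sizes = [1e6, 1e7, 1e8, 1e9]
toy_losses = [5.0 * s ** -0.3 for s in toy_sizes]
a, b = fit_power_law(toy_sizes, toy_losses)
print(a, b)
```

A fit of this form is what makes scaling "predictable": once $a$ and $b$ are estimated at small scale, the curve can be extrapolated to larger runs.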

2602.09651 2026-06-02 stat.ML cs.LG

The Entropic Signature of Class Speciation in Diffusion Models

扩散模型中类别分化的熵特征

Florian Handke, Dejan Stančević, Felix Koulischer, Thomas Demeester, Luca Ambrogioni

发表机构 * GitHub arXiv

AI总结 通过追踪潜在语义变量的类别条件熵,检测扩散模型中的语义转变区间,并验证其在高斯混合模型和实际模型中的有效性。

Comments Accepted at International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

扩散模型并非随时间均匀地恢复语义结构。相反,样本在狭窄的区间内从语义模糊过渡到类别确定。最近的理论工作将这种转变归因于沿类别分离方向的动力学不稳定性,但在训练模型中检测和利用这些窗口的实用方法仍然有限。我们表明,跟踪给定噪声状态下潜在语义变量的类别条件熵提供了这些转变区间的可靠特征。通过将熵限制在语义划分上,熵还可以解析不同抽象层次上的语义决策。我们在高维高斯混合模型中分析了这种行为,并表明熵率集中在与方差保持扩散中先前识别的分化对称性破缺不稳定性相同的对数时间尺度上。我们在EDM2-XS和Stable Diffusion 1.5上验证了我们的方法,其中类别条件熵一致地隔离了对语义结构形成至关重要的噪声区间。最后,我们使用我们的框架来量化引导如何随时间重新分布语义信息。这些结果共同连接了信息论和统计物理学对扩散的视角,并为时间局部化控制提供了原则性基础。

英文摘要

Diffusion models do not recover semantic structure uniformly over time. Instead, samples transition from semantic ambiguity to class commitment within a narrow regime. Recent theoretical work attributes this transition to dynamical instabilities along class-separating directions, but practical methods to detect and exploit these windows in trained models are still limited. We show that tracking the class-conditional entropy of a latent semantic variable given the noisy state provides a reliable signature of these transition regimes. By restricting the entropy to semantic partitions, the entropy can furthermore resolve semantic decisions at different levels of abstraction. We analyze this behavior in high-dimensional Gaussian mixture models and show that the entropy rate concentrates on the same logarithmic time scale as the speciation symmetry-breaking instability previously identified in variance-preserving diffusion. We validate our method on EDM2-XS and Stable Diffusion 1.5, where class-conditional entropy consistently isolates the noise regimes critical for semantic structure formation. Finally, we use our framework to quantify how guidance redistributes semantic information over time. Together, these results connect information-theoretic and statistical physics perspectives on diffusion and provide a principled basis for time-localized control.
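
The entropic signature can be illustrated on a one-dimensional toy problem: two equally likely classes centered at $\pm\mu$, noised by a variance-preserving forward process $x_t = \sqrt{\bar\alpha}\,x_0 + \sqrt{1-\bar\alpha}\,\epsilon$. The sketch below numerically integrates the conditional entropy $H(c \mid x_t)$ in nats. The setup, parameter values, and function names are assumptions chosen for illustration; the paper's analysis concerns high-dimensional mixtures and trained models, which this does not reproduce.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def class_entropy(alpha_bar, mu=3.0, sigma=1.0, grid=4001):
    """H(class | x_t) for a symmetric two-class Gaussian mixture under VP noising."""
    m = math.sqrt(alpha_bar) * mu                      # class mean at this noise level
    v = alpha_bar * sigma ** 2 + (1.0 - alpha_bar)     # per-class variance at this noise level
    sd = math.sqrt(v)
    lo, hi = -m - 8.0 * sd, m + 8.0 * sd
    dx = (hi - lo) / (grid - 1)
    total = 0.0
    for i in range(grid):
        x = lo + i * dx
        density = 0.5 * (normal_pdf(x, m, v) + normal_pdf(x, -m, v))
        posterior = 0.5 * (1.0 + math.tanh(x * m / v))  # p(class = +1 | x_t), stable sigmoid
        total += density * binary_entropy(posterior) * dx
    return total


# Entropy falls from about log(2) (pure noise) toward 0 (clean data) as alpha_bar grows.
for alpha_bar in (1e-6, 0.01, 0.1, 0.5, 1.0):
    print(alpha_bar, class_entropy(alpha_bar))
```

Plotting this entropy, or its rate of change, against the noise level shows where the class decision is effectively made in the toy model.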

2602.03970 2026-06-02 stat.ML cs.LG cs.NE math.MG math.ST stat.TH

Statistical Guarantees for Reasoning Probes on Looped Boolean Circuits

循环布尔电路上推理探针的统计保证

Anastasis Kratsios, Giulia Livieri, A. Martina Neuman

发表机构 * Department of Mathematics, McMaster University(麦克马斯特大学数学系) Vector Institute(向量研究所) The London School of Economics and Political Science(伦敦政治经济学院) University of Vienna, Faculty of Mathematics(维也纳大学数学系)

AI总结 针对循环布尔电路上的推理探针,利用图卷积网络和度量嵌入技术,证明了在最坏情况下泛化误差以最优速率衰减,且该速率与计算图规模无关。

详情
AI中文摘要

我们研究了一种受神经算法推理启发的迭代计算风格化模型中推理探针的统计行为。底层计算由一个循环布尔电路给出,其图是完美的 $ν$ 元树($ν\ge 2$),输出在计算轮次中递归地作为输入反馈。探针观察内部节点的采样子集,并试图推断每个节点处的潜在操作,表示为有限可容许布尔门集合上的概率分布。这种部分可观测性在结构化计算图上诱导了一个转导泛化问题。我们证明,当探针由图卷积网络参数化并查询 $N$ 个节点时,最坏情况下的泛化误差以最优速率 $\mathcal{O}(\sqrt{\log(2/δ)}/\sqrt{N})$ 衰减,概率至少为 $1-δ$。我们的分析将度量嵌入技术与最优传输工具相结合。一个关键见解是,该速率与计算图规模无关,这是通过诱导图度量的低失真一维雪花嵌入实现的。这些结果突出了在探测结构化迭代计算中统计效率的几何机制。

英文摘要

We study the statistical behavior of reasoning probes in a stylized model of iterative computation inspired by neural algorithmic reasoning. The underlying computation is given by a looped Boolean circuit whose graph is a perfect $\nu$-ary tree ($\nu\ge 2$), with outputs recursively fed back as inputs across computation rounds. A probe observes a sampled subset of internal nodes and seeks to infer the latent operation at each node, represented as a probability distribution over a finite set of admissible Boolean gates. This partial observability induces a transductive generalization problem on a structured computation graph. We show that when the probe is parameterized by a graph convolutional network and queries $N$ nodes, the worst-case generalization error decays at the optimal rate $\mathcal{O}(\sqrt{\log(2/\delta)}/\sqrt{N})$ with probability at least $1-\delta$. Our analysis combines metric embedding techniques with tools from optimal transport. A key insight is that this rate is achievable independently of the size of the computation graph, enabled by a low-distortion one-dimensional snowflake embedding of the induced graph metric. These results highlight a geometric mechanism underlying statistical efficiency in probing structured, iterative computations.
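
To read the stated rate as a sample-size requirement, one can invert it for the number of queried nodes. The worked step below is only a sketch: $C > 0$ denotes the unspecified constant hidden in the $\mathcal{O}(\cdot)$ notation, $\varepsilon$ is a hypothetical target error, and the logarithm is assumed to be natural.

```latex
% C > 0 is the unspecified constant hidden in the O(.) notation (an assumption).
\[
  C\,\frac{\sqrt{\log(2/\delta)}}{\sqrt{N}} \le \varepsilon
  \quad\Longleftrightarrow\quad
  N \ge \frac{C^{2}\,\log(2/\delta)}{\varepsilon^{2}} .
\]
% Example: \delta = 0.05 and \varepsilon = 0.1, so \log(2/\delta) = \log 40 \approx 3.689.
\[
  N \ge C^{2}\,\frac{\log 40}{0.01} \approx 369\,C^{2} .
\]
```

The size of the computation graph does not appear in this requirement, which is the independence property the abstract highlights.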

2502.00753 2026-06-02 math.OC cs.LG

Mirror Descent Under Generalized Smoothness

广义光滑性下的镜像下降

Dingzhi Yu, Wei Jiang, Hongyi Tao, Yuanyu Wan, Lijun Zhang

Affiliations: State Key Laboratory of Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University; School of Computer Science and Engineering, Nanjing University of Science and Technology; School of Software Technology, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security

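AI summary: The paper proposes a new notion of $\ell*$-smoothness that extends classical smoothness to general normed spaces, and proves that mirror-descent-type algorithms converge under this condition at rates matching those under classical smoothness.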

Comments ICML 2026

Details
AI Chinese abstract

光滑性对于一阶优化达到快速收敛率至关重要。然而,现代机器学习中的许多优化问题涉及非光滑目标。最近的研究通过允许梯度的Lipschitz常数相对于梯度范数增长来放宽光滑性假设,这适应了实践中广泛的目标。尽管取得了进展,现有的光滑性推广仅限于具有 $\ell_2$ 范数的欧几里得几何,并且仅在欧几里得空间中的优化具有理论保证。在本文中,我们通过引入一个新的 $\ell*$-光滑性概念来解决这一限制,该概念以一般范数及其对偶度量Hessian的范数,并建立了镜像下降类型算法的收敛性,与经典光滑性下的收敛率相匹配。值得注意的是,我们提出了一种广义的自有界性质,有助于通过控制次优性间隙来界定梯度,作为收敛分析的主要组成部分。在确定性优化之外,我们建立了随机镜像下降的尖锐收敛性,与经典光滑性下的最新结果相匹配。我们的理论还扩展到非凸和复合优化,这可能为镜像下降的实际应用(包括大语言模型的预训练和后训练)提供启示。

English abstract

Smoothness is crucial for attaining fast rates in first-order optimization. However, many optimization problems in modern machine learning involve non-smooth objectives. Recent studies relax the smoothness assumption by allowing the Lipschitz constant of the gradient to grow with respect to the gradient norm, which accommodates a broad range of objectives in practice. Despite this progress, existing generalizations of smoothness are restricted to Euclidean geometry with $\ell_2$-norm and only have theoretical guarantees for optimization in the Euclidean space. In this paper, we address this limitation by introducing a new $\ell*$-smoothness concept that measures the norm of Hessians in terms of a general norm and its dual, and establish convergence for mirror-descent-type algorithms, matching the rates under the classic smoothness. Notably, we propose a generalized self-bounding property that facilitates bounding the gradients via controlling suboptimality gaps, serving as a principal component for convergence analysis. Beyond deterministic optimization, we establish sharp convergence for stochastic mirror descent, matching state-of-the-art under classic smoothness. Our theory also extends to non-convex and composite optimization, which may shed light on practical usages of mirror descent, including pre-training and post-training of LLMs.
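
For readers unfamiliar with the algorithm family named in the abstract, the following is a minimal sketch of textbook mirror descent in a non-Euclidean geometry: the negative-entropy mirror map on the probability simplex, which gives the exponentiated-gradient update. It does not implement the paper's generalized-smoothness analysis or its step-size rules. The objective, `target`, `step_size` and `n_steps` are hypothetical choices for illustration.

```python
import numpy as np

def entropic_mirror_descent(grad, x0, step_size, n_steps):
    """Mirror descent on the probability simplex with the negative-entropy mirror map."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        # Take the gradient step in the dual (log) coordinates ...
        x = x * np.exp(-step_size * grad(x))
        # ... then normalize, which is the Bregman (KL) projection onto the simplex.
        x = x / x.sum()
    return x

# Hypothetical toy objective: f(x) = 0.5 * ||x - target||^2 with target inside the simplex.
target = np.array([0.5, 0.3, 0.2])
x_final = entropic_mirror_descent(
    grad=lambda x: x - target,        # gradient of the toy objective
    x0=np.full(3, 1.0 / 3.0),         # start at the uniform distribution
    step_size=0.5,
    n_steps=500,
)
print(np.round(x_final, 4))
```

The update never leaves the simplex and needs no Euclidean projection, which is the usual motivation for choosing a mirror map adapted to the constraint set and its norm.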