arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.29635 2026-05-29 math.OC cs.LG

MoSSP: A Momentum-Based Single-Loop Stochastic Penalty Method for Nonconvex Constrained DC-Regularized Optimization

Luxuan Li, Chunfeng Cui, Xiao Wang

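AI summary: Proposes MoSSP, a momentum-based single-loop stochastic penalty method for stochastic optimization problems with nonconvex constraints and DC regularization, achieving oracle complexities of O(ε^{-4}) and O(ε^{-3}).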

Comments 35 pages, 3 figures

Abstract

In this paper, we study a structured class of nonconvex constrained stochastic problems with difference-of-convex (DC) regularization, where the feasible set is possibly nonconvex and the concave part of the DC regularizer is allowed to be nonsmooth. The fundamental challenge lies in maintaining feasibility for nonconvex constraints while achieving favorable oracle complexity. Although single-loop algorithms efficiently solve unconstrained DC optimization problems, their potential for constrained optimization with DC structure remains largely unexplored. To address this gap, we develop MoSSP, a Momentum-based Single-loop Stochastic Penalty method for such problems with provable complexity guarantees. The key idea is to apply a single stochastic proximal-gradient step to the Moreau envelope of the penalty plus the convex DC part, with the concave part's proximal mapping computed in parallel. We derive two algorithm variants: a Polyak-momentum version with $O(\varepsilon^{-4})$ oracle complexity for finding stochastic $\varepsilon$-KKT points, and an improved $O(\varepsilon^{-3})$ version incorporating recursive momentum. Experimental results demonstrate the effectiveness of the proposed algorithms.
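
The two building blocks named in the abstract, the Moreau envelope and a Polyak-style momentum average, can be illustrated on a toy case. This is a minimal sketch, assuming the penalty is the scalar function h(u) = |u|; the function names are hypothetical and this is not the MoSSP algorithm itself.

```python
import numpy as np

def prox_abs(x, mu):
    """Proximal map of h(u) = |u| with parameter mu (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

def moreau_env_abs(x, mu):
    """Moreau envelope of |u|: min over u of |u| + (u - x)^2 / (2 * mu)."""
    p = prox_abs(x, mu)
    return np.abs(p) + (p - x) ** 2 / (2.0 * mu)

def moreau_grad_abs(x, mu):
    """Gradient of the envelope, given by (x - prox(x)) / mu."""
    return (x - prox_abs(x, mu)) / mu

def polyak_momentum(m_prev, g, beta):
    """Moving-average gradient estimate (assumed form): (1 - beta) * m_prev + beta * g."""
    return (1.0 - beta) * m_prev + beta * g
```

The envelope is smooth even though |u| is not, which is what makes a single gradient-type step on the penalized problem possible.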

2605.29613 2026-05-29 eess.AS cs.SD

Decoding Strategies for Diffusion-Based ASR: A Systematic Evaluation of Confidence-Based Thresholding

Jeong Hun Yeo, Minsu Kim, Hyeongseop Rha, Yong Man Ro

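AI summary: A systematic evaluation of three decoding strategies for ASR based on diffusion language models. The paper proposes monitoring decoding progress with a negative-log-likelihood-based uncertainty measure and finds that threshold-based strategies beat fixed-number strategies in both accuracy and speed, with the static-threshold strategy matching autoregressive decoding accuracy at higher efficiency.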

Abstract

While LLM-based Automatic Speech Recognition (ASR) achieves high accuracy, its speed is limited by sequential autoregressive decoding. Diffusion Language Models (DLMs) offer a parallel alternative, yet their decoding strategies remain under-explored in ASR contexts. This paper analyzes three decoding schemes for DLM-based ASR: fixed-number, static confidence threshold, and dynamic confidence threshold. We propose measuring round-wise accuracy using Negative Log-Likelihood-based uncertainty as a proxy for decoding progress. Our results show that both threshold-based strategies significantly outperform fixed-number schemes in accuracy and speed. We attribute this to a property unique to ASR: most tokens reach high confidence early, allowing reliable ones to be harvested aggressively while leaving only difficult tokens for later rounds. Notably, the static-threshold strategy matches the accuracy of autoregressive decoding while offering superior efficiency.
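
The difference between fixed-number and static-threshold decoding can be sketched as one unmasking round over per-token confidences. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the fallback that commits the single most confident token when nothing clears the threshold is an assumption, not something the abstract specifies.

```python
import numpy as np

def static_threshold_round(confidence, decoded, threshold):
    """Commit every still-masked position whose confidence meets the threshold.
    Assumed fallback: if none does, commit the most confident masked position
    so that decoding always makes progress."""
    masked = ~decoded
    commit = masked & (confidence >= threshold)
    if masked.any() and not commit.any():
        idx = np.argmax(np.where(masked, confidence, -np.inf))
        commit[idx] = True
    return decoded | commit

def fixed_number_round(confidence, decoded, k):
    """Commit the k most confident still-masked positions, whatever their confidence."""
    masked_idx = np.flatnonzero(~decoded)
    top = masked_idx[np.argsort(-confidence[masked_idx])[:k]]
    out = decoded.copy()
    out[top] = True
    return out
```

When most tokens are confident early, the threshold rule commits many of them in one round, whereas the fixed-number rule commits the same count regardless.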

2605.29612 2026-05-29 cs.MA cs.CL

CONCAT: Consensus- and Confidence-Driven Ad Hoc Teaming for Efficient LLM-Based Multi-Agent Systems

Ziyang Ma, Dingyi Zhang, Sichu Liang, Jiajia Chu, Pengfei Xia, Hui Zang, Deyu Zhou

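AI summary: Proposes CONCAT, a training-free ad hoc teaming framework driven by consensus and confidence. It clusters agents' initial answers, selects high-confidence leaders, and predicts collaboration benefits with a Theory-of-Mind heuristic to organize multi-agent interaction dynamically, substantially improving efficiency and reducing latency.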

Abstract

Although large language model (LLM) based multi-agent systems (MAS) show their capability to solve complex tasks and achieve higher performance over single agent systems, they lead to huge computational overheads because of heavy communication between agents. Previous research has made efforts to train a sparse multi-agent graph or fine-tune a planner to orchestrate the workflow better. However, such extra training processes introduce computational costs and limit MAS to specific domains, therefore compromising their generalizability. In this paper, we propose CONCAT, a training-free multi-agent collaboration framework based on CONsensus and Confidence-driven Ad hoc Teaming to efficiently organize agent interactions. Specifically, agents are clustered based on their initial answers, and leaders of each cluster are selected based on the agents' confidence. Then, a heuristic function based on the Theory of Mind is designed to predict the collaboration benefits between every two leaders according to their answers and confidence. Finally, an ad hoc multi-agent network is organized after evicting a percentage of communications based on the predicted benefits. Experiments across three LLMs and three benchmarks show that CONCAT achieves up to 2.02x higher efficiency (accuracy/latency ratio) than LLM-Debate and outperforms training-aware methods such as AgentDropout, while reducing average latency by 50.1% on Qwen2.5-14B-Instruct, without any task-specific training.
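
The first two steps described above, clustering agents by initial answer and picking a leader per cluster by confidence, might look like the following. This is a sketch only: it assumes answers are clustered by exact string match and uses a hypothetical function name; the Theory-of-Mind benefit heuristic is not specified in the abstract and is not reproduced here.

```python
from collections import defaultdict

def cluster_and_pick_leaders(agents):
    """agents: list of (name, answer, confidence) tuples.
    Returns {answer: leader_name}, where the leader is the most confident
    agent in the cluster of agents that gave that answer."""
    clusters = defaultdict(list)
    for name, answer, confidence in agents:
        clusters[answer].append((confidence, name))
    # max() compares by confidence first; ties fall back to the name.
    return {answer: max(members)[1] for answer, members in clusters.items()}
```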

2605.29587 2026-05-29 q-bio.QM cs.LG

FPLIER: Federated Pathway-Level Information Extractor

Daniele Malpetti, Christian Berchtold, Francesco Gualdi, Marco Scutari, Laura Azzimonti, Francesca Mangili

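AI summary: Proposes FPLIER, a federated learning framework that uses secure aggregation to perform pathway-level factorization over distributed gene-expression data, and shows that privacy risk is governed by the rank of the training expression matrix.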

Comments Accepted for publication at the ACM BCB '26 conference

Abstract

In transcriptomics, gene-set-aware factorization methods such as the Pathway Level Information Extractor (PLIER) are most effective when trained on large, heterogeneous expression compendia. Yet, many clinically relevant cohorts cannot be pooled into a single dataset due to privacy and governance constraints. We present FPLIER, a federated extension of PLIER that enables distributed training across multiple data holders while incorporating publicly available datasets. Through secure aggregation, FPLIER produces training updates algebraically equivalent to those of a centralized pooled-data approach while keeping expression data local. We evaluate FPLIER across multiple scenarios in two simulated consortia (from the K-CLIER and MultiPLIER studies) and demonstrate stable convergence. We further conduct a systematic analysis of membership inference attacks targeting both intermediate training statistics and the released model. Our results show that privacy risk is governed by the rank of the training expression matrix. Incorporating public data or reducing data dimensionality increases this rank, moving the system toward a full-rank regime in which training and non-training samples become indistinguishable to the attacker, and membership-inference performance approaches random guessing.
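
The claim that aggregated updates are algebraically equivalent to pooled-data updates can be illustrated with a generic additive-masking scheme. This is a sketch under assumptions: the local statistic is taken to be a Gram matrix X^T X, and the pairwise-mask construction is a textbook form of secure aggregation, not necessarily the protocol FPLIER uses.

```python
import numpy as np

def masked_updates(stats, rng):
    """Add pairwise random masks to each site's statistic.
    Every mask is added at one site and subtracted at another, so the masks
    cancel in the sum while hiding each individual contribution."""
    masked = [s.copy() for s in stats]
    for i in range(len(stats)):
        for j in range(i + 1, len(stats)):
            mask = rng.normal(size=stats[0].shape)
            masked[i] += mask
            masked[j] -= mask
    return masked

def aggregate(masked):
    """Server-side sum of the masked statistics."""
    return sum(masked)
```

Summing the masked per-site Gram matrices recovers the Gram matrix of the row-stacked data up to floating-point error.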

2605.28327 2026-05-29 stat.ML cs.LG q-fin.RM stat.AP

Insurance Pricing Optimization via Off-Policy Evaluation

Sascha Günther, Dimitri Semenovich, Mario V. Wüthrich

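AI summary: Proposes an insurance pricing approach based on off-policy evaluation and stochastic control, using a kernelized inverse propensity score estimator to reduce variance and two policy-optimization methods, a data-shared Lasso and neural networks, to compute optimal prices.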

Abstract

Traditional insurance pricing relies on risk-based principles that ensure actuarial fairness and solvency but do not explicitly account for policyholders' price sensitivity. We formulate insurance pricing as a decision-making problem and study it using tools from off-policy evaluation and stochastic control. We propose a kernelized inverse propensity score estimator that exploits local structure in the action space and yields variance reduction compared to the classical inverse propensity score estimator. Building on these value estimates, we investigate policy optimization and present two practical approaches for computing optimal pricing rules: an interpretable data-shared Lasso formulation and a flexible policy parameterization based on neural networks. Using a controlled synthetic travel insurance environment, we empirically confirm the theoretical results and show that neural networks outperform existing techniques for policy optimization.
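
The contrast between classical and kernelized inverse propensity scoring can be sketched as follows. This assumes a deterministic target policy and a Gaussian kernel with bandwidth h, which is one standard way to smooth over nearby actions; the exact estimator in the paper may differ, and the function names are hypothetical.

```python
import numpy as np

def ips_value(rewards, actions, target_actions, propensities):
    """Classical IPS: only samples whose logged action equals the target
    action contribute, each weighted by 1 / propensity."""
    match = (actions == target_actions).astype(float)
    return np.mean(match * rewards / propensities)

def kernel_ips_value(rewards, actions, target_actions, propensities, h):
    """Kernel-smoothed IPS: the exact-match indicator is replaced by a
    Gaussian kernel weight, so logged actions near the target also count.
    Here the propensities are densities over a continuous action."""
    u = (actions - target_actions) / h
    weights = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi) / h
    return np.mean(weights * rewards / propensities)
```

Using neighbouring prices trades a little bias for lower variance, which is the kind of variance reduction the abstract refers to.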

2605.26156 2026-05-29 cs.CR cs.AI cs.LG

Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges

Xianglin Yang, Bryan Hooi, Gelei Deng, Tianwei Zhang, Jin Song Dong

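AI summary: Proposes BITE, a black-box adversarial framework that models the choice of stylistic edits as a contextual bandit problem and uses a LinUCB policy to pick edits adaptively, misleading LLM judges and artificially inflating scores with an attack success rate above 65%.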

Comments Accepted to the Forty-Third International Conference on Machine Learning (ICML 2026)

Abstract

The known stylistic biases in LLM judges, such as a preference for verbosity or specific sentence structures, present an underexplored security vulnerability. In this work, we introduce BITE (BIas exploraTion and Exploitation), a black-box adversarial framework that learns semantics-preserving edits to mislead an LLM judge and artificially inflate the scores it assigns. We cast the selection of stylistic edits as a contextual bandit problem and use a LinUCB policy to adaptively choose edits that maximize the judge's score without access to model parameters or gradients. Empirically, we test BITE across a diverse range of LLM judges and tasks, including both pointwise and pairwise comparisons on chatbot leaderboards and AI-reviewer benchmarks. BITE achieves an attack success rate exceeding 65% and raises scores by 1-2 points on a 9-point scale, all while preserving semantic equivalence. We further assess the attack's stealthiness, showing that BITE evades standard style-control methods and several detection baselines. Our findings expose a fundamental weakness in the LLM-as-a-judge paradigm and motivate robust, attack-aware evaluation. Our code is available at https://github.com/xianglinyang/llm-as-a-judge-attack.
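
For readers unfamiliar with LinUCB, a minimal version of the disjoint-arm variant is sketched below. Treating each arm as one stylistic edit and the reward as the judge's score is an assumed mapping for illustration; this is generic LinUCB, not code from the BITE repository.

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: one ridge-regression model per arm."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha  # width of the exploration bonus
        self.A = [np.eye(dim) for _ in range(n_arms)]
        self.b = [np.zeros(dim) for _ in range(n_arms)]

    def select(self, x):
        """Pick the arm with the highest upper confidence bound for context x."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        """Fold the observed reward into the chosen arm's statistics."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```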

2605.25975 2026-05-29 cs.GR cs.CV

F-RNG: Feed-Forward Relightable Neural Gaussians

Guangming Fu, Jiahui Fan, Jian Yang, Miloš Hašan, Beibei Wang

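AI summary: Proposes F-RNG, a feed-forward framework that uses priors from existing large reconstruction models and intrinsic decomposition models to generate relightable 3D Gaussian assets directly from sparse views, enabling fast, high-quality relighting.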

Abstract

Capturing relightable 3D assets from real-world objects is a widely researched problem. Several per-scene optimization-based methods, based on 3D Gaussian splatting (3DGS), support relighting; however, they usually require dense input views, and their overfitting nature makes it difficult to generalize across scenes. Unlike per-scene optimization methods, generalized feed-forward models can directly reconstruct Gaussians from sparse input views. However, the resulting assets have baked-in illumination and cannot be easily used for relighting. In this paper, we present F-RNG, a feed-forward framework that directly generates relightable 3DGS assets from sparse-view inputs. Training such a model from scratch can require massive data and computing resources, and it is especially challenging to generate relightable assets in a feed-forward manner with acceptable cost. We develop F-RNG upon an existing large reconstruction model (LRM) to extract relightable representations, while also utilizing priors from an intrinsic decomposition model (IDM). Specifically, we first introduce a latent-interpolated fine-grained geometry synthesis to enhance the LRM's geometry representation. Second, we propose a prior-guided relightable appearance distillation to extract relightable neural representations by incorporating IDM priors. Finally, a universal neural renderer enables flexible and high-fidelity relighting. F-RNG requires neither re-training nor fine-tuning of the underlying LRMs, thus can automatically benefit from better LRMs and IDMs in the future. With only small networks that can be trained with affordable data and computational resources, F-RNG avoids the repetitive inference of large models under different light conditions. By comparison to the state-of-the-art LRM-based relighting method, F-RNG achieves ~25x faster relighting, as well as superior quality (~+2.0 dB).

2605.25376 2026-05-29 cs.CR cs.AI cs.CY cs.MA cs.SE

KYA: A Framework-Agnostic Trust Layer for Autonomous Systems with Verifiable Provenance and Hierarchical Policy Composition

Kolawole Quadri

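AI summary: Proposes KYA, a framework-agnostic trust and governance layer whose five primitives make an autonomous system's actions authorized, policy-conforming, and post-hoc verifiable; every cell of the cross-backend matrix passes, and 89% of adversarial probes are detected.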

Comments 26 pages including appendix. Code available under Apache 2.0 at https://github.com/veldtlabs/veldt-kya (pip install veldt-kya). Two-domain worked examples (loan decisioning under NYDFS/ECOA/CFPB; clinical triage under HIPAA/21 CFR Part 11/FDA SaMD). Reproducibility artifacts in-tree.

Abstract

KYA (Know Your Agents) is an open-source, framework-agnostic trust and governance layer for autonomous systems, composed of five primitives: (1) a four-gate inbound apply pipeline; (2) an only-tighten composition algebra over a three-channel multi-tenant hierarchy; (3) KYP (Know Your Principal), a schema-level unification of trust scoring across human users, AI agents, and service accounts; (4) auditable interaction-multiplier amplification over an AIVSS-shaped additive baseline; and (5) two-axis delegation attribution: a static premium for risky delegates and a runtime debit for actual delegate misbehavior in multi-agent fan-out. Together these span three pillars (trust, governance, and evidentiary assurance), making an autonomous system's actions authorized, policy-conforming, and post-hoc verifiable: where observability answers how long, how much, and what path, KYA answers was it authorized, did it conform, and can it be verified; it composes with observability rather than replacing it. It ships native adapters for 15+ agent frameworks. On a 4 by 9 cross-backend matrix all 36 cells pass; the pure-function scorer runs sub-millisecond at p99 and the system sustains ~ 1,800 ops/sec at 20 concurrent workers with HMAC chain integrity preserved end-to-end. KYA detects 89% of 1,200 adversarial probes from PyRIT and Garak, including the recently-published topology-guided multi-agent attack. The system is available under Apache 2.0 as the veldt-kya package on PyPI.
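
The idea behind HMAC chain integrity can be shown with the Python standard library. This is a generic sketch with hypothetical function names and record layout; it is not the veldt-kya API, whose actual interface is not described in the abstract.

```python
import hashlib
import hmac

GENESIS = b"\x00" * 32  # assumed placeholder tag that precedes the first record

def append_record(chain, key, payload):
    """Append (payload, tag), where the tag authenticates the payload
    together with the previous record's tag."""
    prev_tag = chain[-1][1] if chain else GENESIS
    tag = hmac.new(key, prev_tag + payload, hashlib.sha256).digest()
    chain.append((payload, tag))

def verify_chain(chain, key):
    """Recompute every tag in order; any edited, dropped, or reordered
    record makes a comparison fail."""
    prev_tag = GENESIS
    for payload, tag in chain:
        expected = hmac.new(key, prev_tag + payload, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            return False
        prev_tag = tag
    return True
```

Because each tag depends on the one before it, tampering with an early record invalidates every later tag as well.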

2605.25303 2026-05-29 cs.DS cs.LG math.ST stat.ML stat.TH

Algorithms with Polynomially-Improved Approximation Factors for the $2 \rightarrow q$ Norm, and Applications

Samuel B. Hopkins, Stefan Tiegel

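AI summary: For the $2 \rightarrow q$ norm with $q>2$, gives the first polynomial-time approximation algorithm whose approximation factor improves on the $d^{1/4}$ baseline by polynomial factors (for example, $d^{1/8}$ when $q=4$), and constructs sum-of-squares certificates that yield improved algorithms for robust mean estimation, covariance estimation, regression, and clustering.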

Comments v2 corrected minor typos

Abstract

The $2 \rightarrow q$ norm of a matrix $X \in \mathbb{R}^{n \times d}$ is defined as $\lVert X \rVert_{2 \rightarrow q} = \sup_{\lVert v \rVert_2 = 1} \lVert Xv \rVert_q$. We give polynomial-time multiplicative approximation algorithms for this norm when $q > 2$ (i.e. in the hypercontractive setting). This problem either directly captures or is closely related to long-standing open problems in combinatorial optimization and hardness of approximation (e.g. Small Set Expansion), quantum information (e.g. Best Separable State), and algorithmic statistics. Very little is known about what approximation factors we can achieve for this problem in polynomial time, even though such approximations have significant downstream consequences. Barak, Brandão, Harrow, Kelner, Steurer, and Zhou showed that no polynomial-time algorithm can achieve an approximation factor better than $2^{\sqrt{\log n}}$, assuming the Exponential Time Hypothesis (FOCS'12). On the other hand, a simple spectral algorithm gives a $d^{1/4}$-approximation as a baseline. We give, to the best of our knowledge, the first polynomial-time approximation algorithm beating this baseline by polynomial factors. For the important special case of $q = 4$ it achieves a $d^{1/8}$-approximation. All previous algorithms required additional assumptions on $X$, or only surpassed the baseline for small values of $n$. Moreover, we construct sum-of-squares certificates for the $2 \rightarrow q$ norm. This directly implies improved algorithms for robust mean and covariance estimation, robust regression, and clustering, when the data only satisfies a bound on its $q$-th moment.
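
To make the definition concrete, the norm can be bracketed numerically. This sketch is not the paper's algorithm and not the $d^{1/4}$ spectral baseline: it only uses the elementary facts that any unit vector gives a lower bound and that $\lVert Xv \rVert_q \le \lVert Xv \rVert_2 \le \sigma_{\max}(X)$ for $q \ge 2$. The function name and the random-search heuristic are assumptions.

```python
import numpy as np

def two_to_q_bounds(X, q, n_random=100, seed=0):
    """Return (lower, upper) bounds on the 2->q norm of X for q >= 2.
    lower: best value of ||Xv||_q over the top right singular vector and
           a few random unit vectors.
    upper: the largest singular value of X."""
    rng = np.random.default_rng(seed)
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    candidates = [Vt[0]]
    for _ in range(n_random):
        v = rng.normal(size=X.shape[1])
        candidates.append(v / np.linalg.norm(v))
    lower = max(np.linalg.norm(X @ v, ord=q) for v in candidates)
    return lower, s[0]
```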

2605.20460 2026-05-29 cs.GR cs.CV

HyperBones: Realtime Bone-driven Neural Garment Simulation with Hypernetwork Conditioning

Astitva Srivastava, Hsiao-Yu Chen, Ryan Goldade, Philipp Herholz, Zhongshi Jiang, Gene Wei-Chin Lin, Lingchen Yang, Nikolaos Sarafianos, Tuur Stuyck, Doug Roble, Avinash Sharma, Egor Larionov

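AI summary: Proposes a real-time neural garment simulation method that combines coarse simulation driven by virtual bones with a convolutional neural map that recovers fine wrinkles, using hypernetwork conditioning to achieve efficient physics supervision without an external simulator.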

Abstract

Recent advances in garment simulation have brought high-quality results closer to real-time performance. Physics-based simulators can produce accurate motion, but remain too computationally expensive for interactive applications. In contrast, linear blend skinning is efficient, but cannot capture the complex dynamics of loose-fitting garments, often leading to unrealistic motion and visual artifacts. Neural methods offer a promising alternative, yet they still struggle to animate loose clothing plausibly under strict runtime constraints. We present a fast and physically plausible approach for dynamic garment simulation. Our method trains a reduced-space neural dynamics simulator composed of independent coarse- and fine-level components. At the coarse level, the garment is driven by a set of virtual bones integrated with a lightweight neural network. Fine-scale wrinkle details are then recovered using a trained convolutional neural map. By decoupling identity-specific computation from real-time neural integration, our architecture maintains high performance while supporting diverse body shapes and motions. We further introduce an effective physics-supervision scheme that enables accurate results without relying on an external simulator. Experiments show that our method produces physically plausible garment dynamics, generalizes across a range of motions and body shapes, and supports a fixed set of garments. Our simulator runs at 300+ FPS on a commodity GPU, making it suitable for real-time applications.
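
Linear blend skinning, the efficient baseline the abstract contrasts against, can be written in a few lines. This is a generic sketch of the standard formula (each vertex is a weighted blend of per-bone rigid transforms), with hypothetical argument names; it is not the HyperBones model.

```python
import numpy as np

def linear_blend_skinning(vertices, weights, rotations, translations):
    """vertices: (V, 3); weights: (V, B), rows summing to 1;
    rotations: (B, 3, 3); translations: (B, 3).
    Returns the (V, 3) skinned vertex positions."""
    # per_bone[b, v] = R_b @ vertex_v + t_b
    per_bone = np.einsum("bij,vj->bvi", rotations, vertices) + translations[:, None, :]
    # Blend the per-bone positions with each vertex's skinning weights.
    return np.einsum("vb,bvi->vi", weights, per_bone)
```

Because the output is a fixed linear blend of bone transforms, it has no notion of inertia or secondary motion, which is why loose garments look unrealistic under it.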

2605.16825 2026-05-29 cs.IR cs.AI

Echoes in Filter Bubble: Diagnosing and Curing Popularity Bias in Generative Recommenders

Jun Yin, Bangguo Zhu, Peng Huo, Ruochen Liu, Hao Chen, Senzhang Wang, Shirui Pan, Chengqi Zhang

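AI summary: Through theoretical analysis, finds that popularity bias in generative recommenders stems from a token-level optimization flaw and the undifferentiated nature of item tokenization, and designs asymmetric unlikelihood optimization and skeleton-founded tokenization (the Ghost system) to mitigate it.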

Abstract

Recently, Generative Recommenders (GRs), characterized by a unified end-to-end framework, have exhibited astonishing potential in transforming the recommendation paradigm. Despite their effectiveness, we recognize that GRs are still susceptible to the long-standing issue of popularity bias that has pervaded the recommendation community. Although a few studies have attempted to extend traditional debiasing methods to GRs, their effectiveness is marginal, and the fundamental reason why GRs suffer from popularity bias remains under-explored. To bridge this gap, this study focuses on two core aspects in GRs: the optimization of generative framework and the item tokenization based on semantic index. Based on theoretical analyses, we identify that the severe popularity bias emerges from the confluence of a token-level optimization flaw and the undifferentiated property of item tokenization. Accordingly, this study develops a novel generative recommender system, called Ghost, by designing the asymmetric unlikelihood optimization and the skeleton-founded tokenization. Extensive empirical evaluations across three datasets, alongside multiple SOTA baselines, reveal that Ghost substantially alleviates popularity bias and promotes fairer recommendations, while incurring slight degradation to the overall recommendation utility.

2605.07596 2026-05-29 stat.ML cs.LG

A Refined Generalization Analysis for Extreme Multi-class Supervised Contrastive Representation Learning

Nong Minh Hieu, Antoine Ledent

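AI summary: Addresses the dependence induced when contrastive tuples are built from a finite pool of labeled data. A refined U-statistics analysis gives a sample complexity of the same order as the number of classes R, and a new estimator achieves O(k) sample complexity under long-tailed class distributions.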

Comments Accepted at ICML 2026

Abstract

Contrastive Representation Learning (CRL) has achieved strong empirical success in multiple machine learning disciplines, yet its theoretical sample complexity remains poorly understood. Existing analyses usually assume that input tuples are independent and identically distributed, an assumption violated in most practical settings where contrastive tuples are constructed from a finite pool of labeled data, inducing dependencies among tuples. While one recent work analyzed this learning setting using U-statistics to estimate the population risk, the techniques used therein require the risk of each class to concentrate uniformly, making excess risk bounds scale in the order of $\rho_{\min}^{-1/2}$, where $\rho_{\min}$ denotes the probability of the rarest class. Such a dependency can be overly pessimistic in extreme multi-class settings, where many tail classes contribute minimally to the overall population risk. Our contributions are two-fold. First, we improve upon the previous work and prove a bound with a sample complexity of the same order as the number of classes $R$, regardless of the distribution over classes. Furthermore, we formulate a different estimator that captures the concentration of the risk \textit{across classes}, enabling sharper bounds in extreme multi-class learning scenarios, especially where class distributions are long-tailed. Under mild assumptions on the class distributions, the resulting sample complexity is $\mathcal{O}(k)$, where $k$ is the number of samples per tuple.

2605.01395 2026-05-29 eess.SY cs.RO cs.SY

Quasi-Static Control of Discrete Cosserat Rod

Srishti Siddharth

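AI summary: For soft robots modelled as Cosserat rods under the piecewise constant strain spatial discretisation, uses external wrenches (forces and moments) as control inputs and designs state-feedback linearisation control laws in strain space and task space for end-effector trajectory tracking and shape control.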

Comments Submitted to 17th APCA International Conference on Automatic Control and Soft Computing (CONTROLO 2026)

Abstract

In this paper, we design feedback control laws for soft robots modelled using the Cosserat rod, which is spatially discretised using the Piecewise Constant Strain (PCS) approach. The PCS approach transforms the nonlinear PDEs describing the Cosserat rod to a system of nonlinear ODEs. This simplification results in a model describing soft robots which is similar to the serial rigid-link manipulators. We design feedback control laws for the quasi-static PCS model by using the external wrenches as control input. The control laws are designed based on state-feedback linearisation in strain and task spaces. An extensive set of numerical results demonstrates the performance of the control laws for end-effector trajectory tracking and shape control of soft robots.

2605.00898 2026-05-29 eess.SP cs.LG

A Deep Learning Model for Battery State Prediction towards Intelligent Energy Management

Athanasios Koukosias, Vasileios Tzanidakis, Sotiris Athanasiou, Kostas Kolomvatsos

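AI summary: Proposes a deep learning model that combines advanced neural network architectures with large-scale training data to predict the future state and performance of industrial electrochemical energy storage systems, supporting predictive maintenance and efficient allocation of energy resources.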

Comments 11 pages, 11 figures, Journal

Abstract

Accurate forecasting of battery health indicators, including remaining capacity and lifetime, is of paramount importance for ensuring the reliability, safety, and operational efficiency of applications such as electric vehicles and large-scale energy storage infrastructures. The forecasts can be used to build an advanced monitoring mechanism that continuously checks battery health status and assists in the efficient real-time management of numerous applications. This research investigates the development and implementation of a Deep Learning (DL) model for the prediction of the future state and performance of industrial electrochemical energy storage systems. To address this challenge, we propose a dedicated computational framework that integrates advanced neural network architectures with large-scale training datasets, enabling precise modeling of battery degradation dynamics and operational trends. The proposed approach provides a decision support mechanism for the optimal management of batteries, facilitating both predictive maintenance and the efficient allocation of energy resources. Our findings highlight the potential of DL-based predictive modeling to significantly contribute to the advancement of sustainable and intelligent energy management systems.

2604.11665 2026-05-29 cs.NE cs.AI

Beyond LLMs, Sparse Distributed Memory, and Neuromorphics <A Hyper-Dimensional SRAM-CAM "VaCoAl" for Ultra-High Speed, Ultra-Low Power, and Low Cost>

Hiroyuki Chuma, Kanji Otsuka, Yoichi Sato

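AI summary: Proposes the VaCoAl algorithm, which uses hyperdimensional computing and Galois-field diffusion on SRAM/DRAM-CAM for reversible, auditable multi-hop reasoning, addresses catastrophic forgetting and the binding problem, and validates its path-dependent STDP-like selection mechanism on Wikidata.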

Comments 57 pages, 4 figures, 18 tables

Abstract

This paper reports an unexpected finding: in a deterministic hyperdimensional computing (HDC) architecture that inverts the conventional role of Galois-field algebra -- employing it not for error correction toward a unique answer but as an engine for relative similarity and path-quality ranking -- a path-dependent semantic selection mechanism emerges, equivalent to spike-timing-dependent plasticity (STDP), with magnitude predictable a priori from a closed-form expression matching measured values. Addressing catastrophic forgetting, learning stagnation, and the Binding Problem at an algebraic level, we propose VaCoAl (Vague Coincident Algorithm) and its Python implementation PyVaCoAl on ultra-high-dimensional SRAM/DRAM-CAM. Rooted in Sparse Distributed Memory, it resolves orthogonalisation and retrieval in high-dimensional binary spaces via Galois-field diffusion, enabling low-load deployment. Crucially, VaCoAl embeds a cognitive bound -- the Frontier Size -- into its architecture, ranking candidates by path-integral confidence (CR2) to achieve compositional generalisation; this bounded-rationality design produces STDP-like selection that error-correction paradigms structurally cannot attain. We evaluated multi-hop reasoning on about 470k mentor-student relations from Wikidata, tracing up to 57 generations (over 25.5M paths). HDC bundling and unbinding with CR-based denoising quantify concept propagation over DAGs. Results show a reinterpretation of the Newton-Leibniz dispute and a phase transition from sparse convergence to a post-Leibniz "superhighway", with structural indicators supporting a Kuhnian paradigm shift. VaCoAl thus defines a third paradigm, HDC-AI, complementing LLMs with reversible, auditable multi-hop reasoning.
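
The HDC operations mentioned above (binding, bundling, unbinding) can be illustrated on random binary hypervectors. This is a textbook sketch using XOR binding and majority-vote bundling with hypothetical function names; it does not implement VaCoAl's Galois-field diffusion or its CR2 ranking, and it is not taken from PyVaCoAl.

```python
import numpy as np

def random_hv(dim, rng):
    """Draw a random dense binary hypervector."""
    return rng.integers(0, 2, size=dim, dtype=np.uint8)

def bind(a, b):
    """XOR binding; applying the same key twice undoes it (unbinding)."""
    return np.bitwise_xor(a, b)

def bundle(vectors):
    """Bitwise majority vote; the result stays similar to every input."""
    stacked = np.stack(vectors)
    return (stacked.sum(axis=0) * 2 > len(vectors)).astype(np.uint8)

def similarity(a, b):
    """Fraction of matching bits; about 0.5 for unrelated hypervectors."""
    return 1.0 - np.mean(a != b)
```

In high dimensions, unrelated vectors are nearly orthogonal (similarity close to 0.5), so a bundle of three vectors remains recognisably close to each member.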

2604.09557 2026-05-29 cs.DC cs.AI

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

SPEED-Bench:一个统一且多样化的推测解码基准

Talor Abramovich, Maor Ashkenazi, Izzy Putterman, Benjamin Chislett, Tiyasa Mitra, Bita Darvish Rouhani, Ran Zilberstein, Yonatan Geifman

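AI summary To address limited task diversity, insufficient support for throughput evaluation, and implementations far from production in speculative decoding (SD) evaluation, proposes the SPEED-Bench benchmark, which includes datasets covering diverse semantic domains and realistic serving scenarios and integrates the vLLM and TensorRT-LLM engines, to standardize SD evaluation and reveal system behaviors.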

Comments ICML 2026; Our data is available on https://huggingface.co/datasets/nvidia/SPEED-Bench

Details
AI Chinese abstract

推测解码(SD)已成为加速大型语言模型(LLM)推理的关键技术。与确定性系统优化不同,SD性能本质上依赖于数据,因此多样且具有代表性的工作负载对于准确衡量其有效性至关重要。现有基准存在任务多样性有限、对面向吞吐量的评估支持不足,以及依赖无法反映生产环境的高级实现等问题。为解决这些问题,我们引入了SPEED-Bench,这是一个全面的套件,旨在跨不同语义领域和真实服务场景标准化SD评估。SPEED-Bench提供了精心策划的定性数据划分,通过优先考虑数据样本之间的语义多样性来选择。此外,它还包括一个吞吐量数据划分,允许在从延迟敏感的低批量设置到面向吞吐量的高负载场景的一系列并发性下进行加速评估。通过与vLLM和TensorRT-LLM等生产引擎集成,SPEED-Bench使从业者能够分析其他基准常常掩盖的系统行为。我们通过量化合成输入如何高估实际吞吐量、识别依赖于批量大小的最优草稿长度和低多样性数据中的偏差,以及分析最先进起草器中词汇剪枝的注意事项来突出这一点。我们发布SPEED-Bench,以建立用于SD算法实际比较的统一评估标准。

English abstract

Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimizations, SD performance is inherently data-dependent, meaning that diverse and representative workloads are essential for accurately measuring its effectiveness. Existing benchmarks suffer from limited task diversity, inadequate support for throughput-oriented evaluation, and a reliance on high-level implementations that fail to reflect production environments. To address this, we introduce SPEED-Bench, a comprehensive suite designed to standardize SD evaluation across diverse semantic domains and realistic serving regimes. SPEED-Bench offers a carefully curated Qualitative data split, selected by prioritizing semantic diversity across the data samples. Additionally, it includes a Throughput data split, allowing speedup evaluation across a range of concurrencies, from latency-sensitive low-batch settings to throughput-oriented high-load scenarios. By integrating with production engines like vLLM and TensorRT-LLM, SPEED-Bench allows practitioners to analyze system behaviors often masked by other benchmarks. We highlight this by quantifying how synthetic inputs overestimate real-world throughput, identifying batch-size dependent optimal draft lengths and biases in low-diversity data, and analyzing the caveats of vocabulary pruning in state-of-the-art drafters. We release SPEED-Bench to establish a unified evaluation standard for practical comparisons of SD algorithms.
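The data dependence described above can be illustrated with a simplified model that is standard in the speculative decoding literature rather than taken from SPEED-Bench itself. It assumes each drafted token is accepted independently with a fixed probability, so the expected number of tokens emitted per target-model pass is $(1-\alpha^{\gamma+1})/(1-\alpha)$ for acceptance rate $\alpha$ and draft length $\gamma$. The function name and the sample acceptance rates below are hypothetical.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_length: int) -> float:
    """Expected tokens emitted per target-model pass.

    Assumption (not from the SPEED-Bench abstract): every drafted token is
    accepted independently with probability `acceptance_rate`, and one extra
    token always comes from the target model itself.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must lie in [0, 1]")
    if draft_length < 0:
        raise ValueError("draft_length must be non-negative")
    if acceptance_rate == 1.0:
        # Every drafted token is accepted, plus the target model's own token.
        return float(draft_length + 1)
    return (1.0 - acceptance_rate ** (draft_length + 1)) / (1.0 - acceptance_rate)


if __name__ == "__main__":
    # Hypothetical acceptance rates standing in for "easy" and "hard" workloads.
    for rate in (0.5, 0.8):
        for draft_length in (1, 3, 5):
            tokens = expected_tokens_per_step(rate, draft_length)
            print(f"rate={rate} draft_length={draft_length} tokens/step={tokens:.2f}")
```

Under this model, a workload with a lower acceptance rate gains little from longer drafts, which is one reason a benchmark with low-diversity data can give a misleading picture of speedup.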

2604.07789 2026-05-29 cs.MA cs.CL cs.SE

ORACLE-SWE: Quantifying the Contribution of Oracle Information Signals on SWE Agents

ORACLE-SWE:量化Oracle信息信号对SWE代理的贡献

Kenan Li, Qirui Jin, Liao Zhu, Xiaosong Huang, Yijia Wu, Yikai Zhang, Xin Zhang, Zijian Jin, Yufan Huang, Elsie Nallipogu, Chaoyun Zhang, Yu Kang, Saravan Rajmohan, Qingwei Lin, Wenke Lee, Dongmei Zhang

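AI summary Proposes Oracle-SWE, a method that isolates and extracts oracle information signals from SWE benchmarks, quantifies the contribution of each signal to agent performance, and evaluates the performance gain a base agent obtains from signals extracted by strong language models.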

Comments Under peer review; 37 pages, 10 figures, 5 tables

Details
AI Chinese abstract

语言模型代理的最新进展显著提升了自动化软件工程(SWE)的能力。先前的工作提出了各种代理工作流和训练策略,并分析了代理系统在SWE任务上的失败模式,重点关注几种上下文信息信号:复现测试、回归测试、编辑位置、执行上下文和API使用。然而,每种信号对整体成功的个体贡献仍未得到充分探索,特别是在中间信息完美获取时的理想贡献。为解决这一问题,我们引入了Oracle-SWE,一种统一的方法,用于从SWE基准测试中隔离和提取Oracle信息信号,并量化每种信号对代理性能的影响。为进一步验证模式,我们评估了由强语言模型提取的信号在提供给基础代理时的性能增益,近似于现实世界的任务解决设置。这些评估旨在指导自主编码系统的研究优先级。

English abstract

Recent advances in language model (LM) agents have significantly improved automated software engineering (SWE). Prior work has proposed various agentic workflows and training strategies as well as analyzed failure modes of agentic systems on SWE tasks, focusing on several contextual information signals: Reproduction Test, Regression Test, Edit Location, Execution Context, and API Usage. However, the individual contribution of each signal to overall success remains underexplored, particularly their ideal contribution when intermediate information is perfectly obtained. To address this gap, we introduce Oracle-SWE, a unified method to isolate and extract oracle information signals from SWE benchmarks and quantify the impact of each signal on agent performance. To further validate the pattern, we evaluate the performance gain of signals extracted by strong LMs when provided to a base agent, approximating real-world task-resolution settings. These evaluations aim to guide research prioritization for autonomous coding systems.

2603.26668 2026-05-29 cs.IR cs.AI cs.CL

Bridge-RAG: An Abstract Bridge Tree Based Retrieval Augmented Generation Algorithm

Bridge-RAG:一种基于抽象桥树的检索增强生成算法

Zihang Li, Wenjun Liu, Yikun Zong, Jiawen Tao, Siying Dai, Songcheng Ren, Zirui Liu, Yuhang Wang, Yanbing Jiang, Tong Yang

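AI summary To address the accuracy and efficiency challenges in retrieval-augmented generation, proposes the Bridge-RAG framework, which uses an abstract bridge tree for multi-level retrieval and integrates a Cuckoo Filter for O(1) entity lookup, achieving up to 1.9x faster retrieval while maintaining high accuracy.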

Details
AI Chinese abstract

作为增强大型语言模型(LLMs)生成质量的重要范式,检索增强生成(RAG)面临着检索准确性和计算效率两方面的挑战。本文提出了一种名为Bridge-RAG的新型RAG框架。为了克服准确性挑战,我们引入了抽象概念来桥接查询实体和文档块,提供了稳健的语义理解。我们将抽象组织成树结构,并设计了多级检索策略以确保包含足够的上下文信息。虽然这种层次化组织显著提高了答案质量,但遍历树以定位包含查询实体的抽象不可避免地引入了额外的检索开销。为了恢复检索效率,我们进一步在CFT-RAG中集成了布谷鸟过滤器,该过滤器提供O(1)实体查找,并且自然适配了我们框架中实体到抽象的路径。大量实验表明,与结构化RAG基线相比,Bridge-RAG在所有指标上均实现了持续的准确性提升,并且检索速度最高提升了1.9倍。

English abstract

As an important paradigm for enhancing the generation quality of Large Language Models (LLMs), retrieval-augmented generation (RAG) faces two challenges: retrieval accuracy and computational efficiency. This paper presents a novel RAG framework called Bridge-RAG. To overcome the accuracy challenge, we introduce the concept of an abstract to bridge query entities and document chunks, providing robust semantic understanding. We organize the abstracts into a tree structure and design a multi-level retrieval strategy to ensure the inclusion of sufficient contextual information. While this hierarchical organization substantially improves answer quality, traversing the tree to locate the abstracts that contain a query entity inevitably introduces additional retrieval overhead. To restore retrieval efficiency, we further integrate the Cuckoo Filter in CFT-RAG, which provides O(1) entity lookup and naturally fits the entity-to-abstract pathway of our framework. Extensive experiments show that Bridge-RAG achieves consistent accuracy improvements across all metrics and up to $1.9\times$ faster retrieval compared to structured RAG baselines.
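The O(1) entity lookup mentioned above can be illustrated with a generic cuckoo filter. This is a minimal sketch of the textbook data structure, not the Bridge-RAG or CFT-RAG implementation; the class name, parameters, and entity strings are hypothetical, and the filter here only answers membership queries rather than returning the abstracts linked to an entity.

```python
import hashlib
import random


class CuckooFilter:
    """Illustrative cuckoo filter: approximate set membership in O(1) time."""

    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500, seed=0):
        # num_buckets must be a power of two so the XOR step below maps each
        # bucket to a partner bucket and back again.
        if num_buckets <= 0 or num_buckets & (num_buckets - 1):
            raise ValueError("num_buckets must be a power of two")
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self._rng = random.Random(seed)

    @staticmethod
    def _hash(data: bytes) -> int:
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, item: str) -> int:
        # 16-bit fingerprint stored in place of the full item.
        return self._hash(b"fp:" + item.encode("utf-8")) & 0xFFFF

    def _index(self, item: str) -> int:
        return self._hash(item.encode("utf-8")) % self.num_buckets

    def _alt_index(self, index: int, fingerprint: int) -> int:
        # Partial-key cuckoo hashing: the partner bucket depends only on the
        # current bucket and the fingerprint, so it can be found on eviction.
        return (index ^ self._hash(fingerprint.to_bytes(2, "big"))) % self.num_buckets

    def insert(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both candidate buckets are full: evict fingerprints until one fits.
        i = self._rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self._rng.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # The filter is too full; the last evicted fingerprint is dropped.
        return False

    def contains(self, item: str) -> bool:
        # At most two buckets of fixed size are inspected, hence O(1).
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]


if __name__ == "__main__":
    entity_filter = CuckooFilter(num_buckets=64)
    for entity in ("retrieval", "cuckoo filter", "abstract tree"):
        entity_filter.insert(entity)
    print(entity_filter.contains("cuckoo filter"))
```

A lookup touches two fixed-size buckets regardless of how many entities are stored, which is where the constant-time behaviour comes from. The price is a small false-positive rate, since only fingerprints are stored.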

2603.20329 2026-05-29 stat.ML cs.LG math.PR

Measure flow path recovery in Bayes Hilbert spaces

贝叶斯希尔伯特空间中的测度流路径恢复

S. David Mis, Maarten V. de Hoop

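AI summary For the ill-posed problem of recovering a probability measure flow from finitely many moving localized sensors, proposes a variational theory based on the Bayes Hilbert framework, analyzes recoverability conditions by constructing a minimum-energy transport realization and linearizing the observation operator, and develops finite-dimensional reductions that enable stable reconstruction.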

Details
AI Chinese abstract

我们研究使用贝叶斯希尔伯特框架从有限个移动局部传感器恢复概率测度流的不适定问题。相对于固定的参考概率测度,概率律由其中心化对数比坐标表示,因此演化律成为希尔伯特函数空间中的一条路径。对于足够正则的贝叶斯希尔伯特路径,我们通过在每个时间点求解加权纽曼问题,构造路径的规范最小能量传输实现,得到切方向上的内在传输形式。然后,我们直接在贝叶斯希尔伯特路径空间上制定逆问题。观测算子的线性化产生可观测性形式,可恢复性由其与传输几何通过联合传输-可观测性形式的相互作用决定。在无穷维环境中,我们发展了正则化变分理论,并识别了局部传感器的局限性:移动传感器可以使联合形式单射,但通常不能在整个状态空间上产生强制稳定性估计。这一障碍自然导致有限维贝叶斯希尔伯特约化。在那里,传输形式成为动能张量,线性化观测成为约化感知矩阵,因此可恢复性可以通过显式的格拉姆条件表达。我们证明局部凸起传感器检测每个固定的约化方向,有限个适当放置的静态传感器产生均匀的约化可观测性,并且存在依赖于路径的传感器轨迹,使得即使单个移动传感器也能恢复约化路径。最后,我们证明这些约化恢复结果可以提升到对由所选有限维子空间良好近似的路径的近似环境恢复,从而实现稳定重建至投影误差。

English abstract

We study the ill-posed problem of recovering a probability measure flow from finitely many moving localized sensors using a Bayes Hilbert framework. Relative to a fixed reference probability measure, a probability law is represented by its centered log-ratio coordinates, so that an evolving law becomes a path in a Hilbert space of functions. For sufficiently regular Bayes Hilbert paths, we construct a canonical minimum-energy transport realization of the path by solving a weighted Neumann problem at each time, yielding an intrinsic transport form on tangent directions. We then formulate an inverse problem directly on Bayes Hilbert path space. Linearization of an observation operator yields an observability form, and recoverability is governed by its interaction with the transport geometry through a joint transport--observability form. In the ambient infinite-dimensional setting, we develop a regularized variational theory and identify limitations of localized sensing: mobile sensors can make the joint form injective, but they do not in general yield a coercive stability estimate on the full state space. This obstruction leads naturally to finite-dimensional Bayes Hilbert reductions. There the transport form becomes a kinetic tensor and the linearized observations become reduced sensing matrices, so recoverability can be expressed through explicit Gramian conditions. We show that localized bump sensors detect every fixed reduced direction, that finitely many suitably placed static sensors yield uniform reduced observability, and there exist path-dependent sensor trajectories such that even a single moving sensor can recover the reduced path. Finally, we show that these reduced recovery results lift to approximate ambient recovery for paths that are well approximated by the chosen finite-dimensional subspaces, yielding stable reconstruction up to projection error.

2602.20316 2026-05-29 astro-ph.SR cs.CV

Inspectorch: Efficient rare event exploration in solar observations

Inspectorch: 太阳观测中稀有事件的高效探索

C. J. Díaz Baso, I. J. Soler Poquet, C. Kuckein, M. van Noort, N. Poirier

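AI summary Proposes Inspectorch, a flow-based density estimation model that efficiently identifies rare events in high-dimensional solar observations and focuses computational resources on extreme phenomena.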

Comments 12+1 pages, 11+2 figures, submitted to A&A

Details
AI Chinese abstract

太阳正以前所未有的细节被观测,使得我们能够研究其非常小时空尺度上的活动。然而,望远镜收集的大量数据无法用传统方法完全分析。流行的机器学习方法从观测中识别一般趋势,但由于罕见事件发生频率低,往往忽略它们。我们研究无监督概率方法在多维太阳观测中高效识别罕见事件的适用性,并优化计算资源以研究这些极端现象。我们介绍了Inspectorch,一个开源框架,利用基于流的模型:灵活的概率密度估计器,能够学习太阳观测的多维分布。一旦优化,它为每个样本分配概率,使我们能够识别异常事件。我们通过将其应用于Hinode光谱偏振仪、界面区域成像光谱仪、瑞典1米太阳望远镜上的微透镜高光谱成像仪、太阳动力学观测站上的大气成像组件以及太阳轨道器上的极紫外成像仪的观测来应用该方法。我们发现该算法始终为表现出异常特征的光谱分配较低的概率。例如,它识别出具有非常强多普勒频移、不常见展宽以及与小型重联事件相关的时间动态的谱线等。因此,Inspectorch证明了使用基于流的模型进行密度估计为在大型太阳数据集中识别罕见事件提供了一种强大的方法。由此产生的概率异常分数允许将计算资源集中在最具信息量和物理相关的事件上。我们公开提供Python包,网址为https://github.com/cdiazbas/inspectorch。

English abstract

The Sun is observed in unprecedented detail, enabling studies of its activity on very small spatiotemporal scales. However, the large volume of data collected by our telescopes cannot be fully analyzed with conventional methods. Popular machine learning methods identify general trends from observations, but tend to overlook unusual events due to their low frequency of occurrence. We study the applicability of unsupervised probabilistic methods to efficiently identify rare events in multidimensional solar observations and to focus our computational resources on the study of these extreme phenomena. We introduce Inspectorch, an open-source framework that utilizes flow-based models: flexible density estimators capable of learning the multidimensional distribution of solar observations. Once optimized, it assigns a probability to each sample, allowing us to identify unusual events. We apply this approach to observations from the Hinode Spectro-Polarimeter, the Interface Region Imaging Spectrograph, the Microlensed Hyperspectral Imager at the Swedish 1-m Solar Telescope, the Atmospheric Imaging Assembly on board the Solar Dynamics Observatory, and the Extreme Ultraviolet Imager on board Solar Orbiter. We find that the algorithm assigns consistently lower probabilities to spectra that exhibit unusual features. For example, it identifies profiles with very strong Doppler shifts, uncommon broadening, and temporal dynamics associated with small-scale reconnection events, among others. As a result, Inspectorch demonstrates that density estimation using flow-based models offers a powerful approach to identifying rare events in large solar datasets. The resulting probabilistic anomaly scores allow computational resources to be focused on the most informative and physically relevant events. We make our Python package publicly available at https://github.com/cdiazbas/inspectorch.

2602.06361 2026-05-29 cs.GT cs.IT cs.LG math.IT stat.ML

Envy-Free Allocation of Indivisible Goods via Noisy Queries

通过噪声查询实现不可分割物品的无嫉妒分配

Zihan Li, Yan Hao Ling, Jonathan Scarlett, Warut Suksompong

AI总结 针对不可直接观测估值、仅能通过噪声查询获取信息的不可分割物品分配问题,在双智能体高斯噪声和有界估值设定下,推导了实现无嫉妒分配所需查询次数的上下界,并证明了当最优分配负嫉妒值Δ不太小时最优查询次数与m^{2.5}/Δ^2成比例。

Comments ICML 2026

Details
AI Chinese abstract

我们引入了一个公平分配不可分割物品(物品)的问题,其中智能体的估值无法直接观测,而只能通过噪声查询访问。在双智能体设定中,考虑高斯噪声和有界估值,我们推导了根据物品数量$m$和最优分配的负嫉妒值$Δ$,找到无嫉妒分配所需查询次数的上下界。特别地,当$Δ$不太小(即$Δ\gg m^{1/4}$)时,我们证明最优查询次数在忽略对数因子下为$\frac{\sqrt m }{(Δ/ m)^2} = \frac{m^{2.5}}{Δ^2}$。我们的上界基于非自适应查询和一个简单的基于阈值的分配算法,该算法在多项式时间内运行,而下界即使在自适应查询和任意计算时间下也成立。

English abstract

We introduce a problem of fairly allocating indivisible goods (items) in which the agents' valuations cannot be observed directly, but instead can only be accessed via noisy queries. In the two-agent setting with Gaussian noise and bounded valuations, we derive upper and lower bounds on the required number of queries for finding an envy-free allocation in terms of the number of items, $m$, and the negative-envy of the optimal allocation, $Δ$. In particular, when $Δ$ is not too small (namely, $Δ\gg m^{1/4}$), we establish that the optimal number of queries scales as $\frac{\sqrt m }{(Δ/ m)^2} = \frac{m^{2.5}}{Δ^2}$ up to logarithmic factors. Our upper bound is based on non-adaptive queries and a simple thresholding-based allocation algorithm that runs in polynomial time, while our lower bound holds even under adaptive queries and arbitrary computation time.
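The equality between the two forms of the query bound is plain algebra, written out here for convenience. The second line is an illustrative special case that assumes $Δ$ grows linearly in $m$ with a hypothetical constant $c$; it is not a claim made in the abstract.

```latex
\[
\frac{\sqrt{m}}{(\Delta/m)^{2}}
  = \sqrt{m}\cdot\frac{m^{2}}{\Delta^{2}}
  = \frac{m^{1/2+2}}{\Delta^{2}}
  = \frac{m^{2.5}}{\Delta^{2}}.
\]
% Illustrative special case (assumption): \Delta = c\,m for a constant c > 0.
\[
\Delta = c\,m
\quad\Longrightarrow\quad
\frac{m^{2.5}}{\Delta^{2}} = \frac{m^{2.5}}{c^{2}m^{2}} = \frac{\sqrt{m}}{c^{2}}.
\]
```

Read this way, $Δ/m$ acts as a per-item margin, and the number of queries grows with the inverse square of that margin.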

2602.02751 2026-05-29 cs.MA cs.AI cs.CL

Scaling Small Agents Through Strategy Auctions

通过策略拍卖扩展小型智能体

Lisa Alazraki, William F. Shen, Yoram Bachrach, Akhil Mathur

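AI summary To address the insufficient performance of small language models on complex tasks, proposes SALE, a framework inspired by freelancer marketplaces that uses strategy auctions for task allocation and test-time self-improvement, improving performance while reducing cost and reliance on large models.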

Comments ICML 2026

Details
AI Chinese abstract

小型语言模型越来越被视为一种有前景、成本效益高的智能体AI方法,支持者声称它们对于智能体工作流已经足够有能力。然而,尽管较小的智能体在简单任务上能与较大的智能体紧密匹配,但它们的性能如何随任务复杂性扩展、何时需要大型模型以及如何更好地利用小型智能体处理长期工作负载仍不清楚。在这项工作中,我们通过实验表明,小型智能体的性能在深度搜索和编码任务上无法随任务复杂性扩展,并引入了受自由职业市场启发的SALE(Strategy Auctions for Workload Efficiency)智能体框架。在SALE中,智能体用简短的战略计划进行投标,这些计划通过系统性的成本-价值机制评分,并通过共享的拍卖记忆进行优化,从而无需训练单独的路由器或运行所有模型至完成即可实现每任务路由和持续自我改进。在复杂度不同的深度搜索和编码任务中,SALE将最大智能体的依赖度降低了52%,总成本降低了35%,并且始终优于最大智能体的pass@1,仅增加了可忽略的额外开销(超出执行最终轨迹的部分)。相比之下,依赖任务描述的现有路由器要么表现不如最大智能体,要么未能降低成本,通常两者兼有,凸显了它们对智能体工作流的不适用性。这些结果表明,尽管小型智能体可能不足以处理复杂工作负载,但通过协调的任务分配和测试时自我改进,它们可以有效地“扩展”。更广泛地说,它们激发了对智能体AI的系统级观点,即性能提升更多来自市场启发的协调机制(将异构智能体组织成高效、自适应的生态系统),而非日益庞大的单个模型。

English abstract

Small language models are increasingly viewed as a promising, cost-effective approach to agentic AI, with proponents claiming they are sufficiently capable for agentic workflows. However, while smaller agents can closely match larger ones on simple tasks, it remains unclear how their performance scales with task complexity, when large models become necessary, and how to better leverage small agents for long-horizon workloads. In this work, we empirically show that small agents' performance fails to scale with task complexity on deep search and coding tasks, and we introduce Strategy Auctions for Workload Efficiency (SALE), an agent framework inspired by freelancer marketplaces. In SALE, agents bid with short strategic plans, which are scored by a systematic cost-value mechanism and refined via a shared auction memory, enabling per-task routing and continual self-improvement without training a separate router or running all models to completion. Across deep search and coding tasks of varying complexity, SALE reduces reliance on the largest agent by 52%, lowers overall cost by 35%, and consistently improves upon the largest agent's pass@1 with only a negligible overhead beyond executing the final trace. In contrast, established routers that rely on task descriptions either underperform the largest agent or fail to reduce cost, often both, underscoring their poor fit for agentic workflows. These results suggest that while small agents may be insufficient for complex workloads, they can be effectively "scaled up" through coordinated task allocation and test-time self-improvement. More broadly, they motivate a systems-level view of agentic AI in which performance gains come less from ever-larger individual models and more from market-inspired coordination mechanisms that organize heterogeneous agents into efficient, adaptive ecosystems.

2512.14754 2026-05-29 cs.SE cs.AI cs.CL

Revisiting the Reliability of Language Models in Instruction-Following

重新审视指令跟随中语言模型的可靠性

Jianshuo Dong, Yutong Zhang, Yan Liu, Zhenyu Zhong, Tao Wei, Chao Zhang, Han Qiu

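AI summary Proposes the reliable@k metric and a pipeline that automatically generates cousin prompts, constructs the IFEval++ benchmark, finds that the performance of current models drops by up to 61.8% on prompts with subtle differences, and explores three improvement methods.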

Comments ACL 2026 main oral

Details
AI Chinese abstract

先进的LLM在IFEval等基准测试中已达到接近上限的指令跟随准确率。然而,这些令人印象深刻的分数并不一定能转化为实际使用中的可靠服务,因为用户经常改变他们的措辞、上下文框架和任务表述。在本文中,我们研究面向细微差异的可靠性:模型是否在传达类似用户意图但具有细微差异的相似提示中表现出一致的能力。为了量化这一点,我们引入了一个新的指标,可靠@k,并开发了一个自动化流水线,通过数据增强生成高质量的相似提示。在此基础上,我们构建了IFEval++用于系统评估。在20个专有和26个开源LLM中,我们发现当前模型在面向细微差异的可靠性方面存在显著不足——它们的性能在细微提示修改下可能下降高达61.8%。此外,我们对其进行了表征,并探索了三种潜在的改进方法。我们的发现强调了面向细微差异的可靠性是朝着更可靠和可信的LLM行为迈出的关键但尚未充分探索的下一步。我们的代码和基准可访问:https://github.com/jianshuod/IFEval-pp。

English abstract

Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate to reliable services in real-world use, where users often vary their phrasing, contextual framing, and task formulations. In this paper, we study nuance-oriented reliability: whether models exhibit consistent competence across cousin prompts that convey analogous user intents but with subtle nuances. To quantify this, we introduce a new metric, reliable@k, and develop an automated pipeline that generates high-quality cousin prompts via data augmentation. Building upon this, we construct IFEval++ for systematic evaluation. Across 20 proprietary and 26 open-source LLMs, we find that current models fall substantially short in nuance-oriented reliability -- their performance can drop by up to 61.8% with nuanced prompt modifications. We further characterize this shortfall and explore three potential improvement recipes. Our findings highlight nuance-oriented reliability as a crucial yet underexplored next step toward more dependable and trustworthy LLM behavior. Our code and benchmark are available at https://github.com/jianshuod/IFEval-pp.

2512.13517 2026-05-29 q-bio.NC cs.LG

A Deep Learning Model of Mental Rotation Informed by Interactive VR Experiments

基于交互式VR实验的心理旋转深度学习模型

Raymond Khazoum, Daniela Fernandes, Aleksandr Krylov, Qin Li, Stephane Deny

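AI summary Proposes a deep learning model composed of an equivariant encoder, a neuro-symbolic object encoder, and a neural decision agent, validated with VR experiments, that accurately models the performance, response times, and behavior of human mental rotation.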

Comments Version accepted at ICML 2026

Details
AI Chinese abstract

心理旋转——比较从不同视角观察到的物体的能力——是人类心理模拟和空间世界建模的基本示例。在这里,我们利用深度、等变和神经符号学习的最新进展,提出了一个人类心理旋转的机制模型。我们的模型由三个堆叠的组件组成:(1) 等变神经编码器,从图像中生成物体的3D空间表示;(2) 神经符号对象编码器,从这些空间表示中推导出符号对象描述;(3) 神经决策代理,通过循环路径比较这些符号描述,以在3D潜在空间中规定旋转模拟。我们的模型设计受到现有心理旋转实验文献的指导,并辅以VR实验,其中参与者有时可以操作物体进行比较。我们的模型很好地捕捉了参与者在我们和其他人的实验中的表现、反应时间和行为,并通过消融研究证明了每个组件的必要性。我们的工作为最近一系列人类空间推理的深度神经模型增添了新的内容,进一步证明了整合深度、等变和符号表示来模拟人类思维的效力。

English abstract

Mental rotation -- the ability to compare objects seen from different viewpoints -- is a fundamental example of mental simulation and spatial world modeling in humans. Here we propose a mechanistic model of human mental rotation, leveraging recent advances in deep, equivariant, and neuro-symbolic learning. Our model consists of three stacked components: (1) an equivariant neural encoder, producing 3D spatial representations of objects from images, (2) a neuro-symbolic object encoder, deriving symbolic object descriptions from these spatial representations, and (3) a neural decision agent, comparing these symbolic descriptions to prescribe rotation simulations in 3D latent space via a recurrent pathway. Our model design is guided by the existing experimental literature on mental rotation, which we complemented with experiments in VR where participants could at times manipulate the objects to compare. Our model closely captures the performance, response times, and behavior of participants in our and others' experiments, and through ablation studies we demonstrate the necessity of each component. Our work adds to a recent collection of deep neural models of human spatial reasoning, further demonstrating the potency of integrating deep, equivariant, and symbolic representations to model the human mind.

2512.10401 2026-05-29 stat.ML cs.LG math.ST stat.TH

Diffusion differentiable resampling

扩散可微重采样

Jennifer Rosina Andersson, Zheng Zhao

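AI summary For differentiable resampling in sequential Monte Carlo, proposes an informative and instantly differentiable resampling method based on a training-free diffusion model surrogate, proves its consistency theoretically, and outperforms existing methods on multiple filtering and parameter estimation benchmarks.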

Comments In ICML 2026

Details
AI Chinese abstract

本文关注序贯蒙特卡洛(例如粒子滤波)中的可微重采样问题。借鉴重参数化,我们提出了一种新的重采样方法,该方法基于无训练扩散模型代理,具有信息性且即时可微。我们从理论上证明了我们的扩散重采样方法提供了一致的重采样分布,并通过实验表明,在多个滤波和参数估计基准上,它优于最先进的可微重采样方法。最后,我们展示了当用于学习具有高维图像观测的复杂动力学-解码器模型时,它实现了具有竞争力的端到端性能。

English abstract

This paper is concerned with differentiable resampling in the context of sequential Monte Carlo (e.g., particle filtering). Drawing on reparametrisation, we propose a new resampling method that is informative and instantly differentiable, based on a training-free diffusion model surrogate. We theoretically prove that our diffusion resampling method provides a consistent resampling distribution, and we show empirically that it outperforms the state-of-the-art differentiable resampling methods on multiple filtering and parameter estimation benchmarks. Finally, we show that it achieves competitive end-to-end performance when used in learning a complex dynamics-decoder model with high-dimensional image observations.

2510.27663 2026-05-29 eess.IV cs.LG stat.ME stat.ML

Bayesian model selection and misspecification testing in imaging inverse problems only from noisy and partial measurements

仅从噪声和部分测量中进行成像逆问题的贝叶斯模型选择与误设定检验

Tom Sprunck, Marcelo Pereyra, Tobias Liaudat

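AI summary Proposes a general methodology that combines Bayesian cross-validation with data fission for model selection and misspecification detection in imaging inverse problems without ground truth; it is compatible with Bayesian imaging samplers such as diffusion samplers and achieves high accuracy at low computational cost.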

Details
AI Chinese abstract

现代成像技术严重依赖贝叶斯统计模型来解决困难的图像重建和恢复任务。本文针对无真实数据的情况,研究此类模型的客观评估,重点关注模型选择和误设定诊断。现有的无监督模型评估方法通常因计算成本高且与通过机器学习模型隐式定义的现代图像先验不兼容,而不适用于计算成像。本文提出一种基于贝叶斯交叉验证与数据分裂(一种随机测量分裂技术)的新型组合方法,用于贝叶斯成像科学中的无监督模型选择和误设定检测。该方法与任何贝叶斯成像采样器兼容,包括扩散采样器和即插即用采样器。我们通过涉及多种评分规则和模型误设定类型的实验证明了该方法的有效性,在低计算成本下实现了出色的选择和检测精度。

English abstract

Modern imaging techniques heavily rely on Bayesian statistical models to address difficult image reconstruction and restoration tasks. This paper addresses the objective evaluation of such models in settings where ground truth is unavailable, with a focus on model selection and misspecification diagnosis. Existing unsupervised model evaluation methods are often unsuitable for computational imaging due to their high computational cost and incompatibility with modern image priors defined implicitly via machine learning models. We herein propose a general methodology for unsupervised model selection and misspecification detection in Bayesian imaging sciences, based on a novel combination of Bayesian cross-validation and data fission, a randomized measurement splitting technique. The approach is compatible with any Bayesian imaging sampler, including diffusion and plug-and-play samplers. We demonstrate the methodology through experiments involving various scoring rules and types of model misspecification, where we achieve excellent selection and detection accuracy with a low computational cost.

2510.07355 2026-05-29 cs.MM cs.SD

AV-EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Omni-modal LLMS with Audio-visual Cues

AV-EMO-Reasoning: 在具有视听线索的全模态大语言模型中基准测试情感推理能力

Dingkun Zhou, Krish Patel, Ajay Kankipati, Akshaj Gupta, Zeyi Austin Li, Mohul Shukla, Vibhor Narang, Sara Kofman, Zongli Ye, Grace Wang, Xiaoyu Shi, Tingle Li, Guan-Ting Lin, Kan Jen Cheng, Huang-Cheng Chou, Jiachen Lian, Gopala Anumanchipalli

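AI summary Proposes the AV-EMO-Reasoning benchmark, which systematically evaluates the emotional reasoning abilities of omni-modal large language models using synthetic and real-world audiovisual dialogue datasets together with emotion perception and interaction reasoning metrics.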

Details
AI Chinese abstract

通过声音和面部表情传达的情感塑造了人机交互中的参与度和情境。尽管全模态大语言模型取得了快速进展,但利用视听线索进行情感推理的整体评估仍然有限。为解决这一差距,我们引入了AV-EMO-Reasoning,一个旨在系统评估大语言模型情感推理能力的基准。该框架使用一个精心策划的视听语料库,包括合成的单轮和多轮对话以及一个真实世界子集,结合情感感知和交互推理指标,评估模型是否能理解用户情感并产生适当响应。通过发布一个系统评估基准,AV-EMO-Reasoning为评估情感感知对话提供了一个可重复的标准,并推动更自然、自适应的人机交互发展。

English abstract

Emotions conveyed through voice and face shape engagement and context in human-AI interaction. Despite rapid progress in omni-modal large language models, the holistic evaluation of emotional reasoning with audiovisual cues remains limited. To address this gap, we introduce AV-EMO-Reasoning, a benchmark designed to systematically assess emotional reasoning abilities in large language models. The framework uses a curated audiovisual corpus comprising synthetic single-turn and multi-turn dialogues and a real-world subset, together with emotion perception and interaction reasoning metrics, to evaluate whether models can understand user emotions and produce appropriate responses. By releasing a systematic evaluation benchmark, AV-EMO-Reasoning offers a reproducible standard for evaluating emotion-aware dialogue and advances toward more natural, adaptive human-AI interaction.

2510.04704 2026-05-29 cond-mat.mtrl-sci cs.AI cs.CL

AtomWorld: A Benchmark for Evaluating Spatial Reasoning in Large Language Models on Crystalline Materials

AtomWorld: 评估大型语言模型在晶体材料空间推理能力的基准

Taoyuze Lv, Alexander Chen, Fengyu Xie, Chu Wu, Jeffrey Meng, Dongzhan Zhou, Yingheng Wang, Bram Hoex, Zhicheng Zhong, Tong Xie

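AI summary Proposes the AtomWorld benchmark, which evaluates the spatial reasoning of LLMs in materials science through ten fundamental atomic-structure operations; finds that Claude Opus 4.6 performs best but that success rates are low for operations involving complex spatial relations, suggesting that LLMs are better suited as assistive tools than as fully autonomous research agents.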

Details
AI Chinese abstract

大型语言模型(LLMs)在科学研究中展现出巨大潜力,能够执行从知识检索到属性预测等任务。现有的科学基准主要关注感知或基于知识的任务,在很大程度上忽略了建模任务,而建模是任何真实科学研究的基本起点。对于材料科学而言,构建和操作原子结构是最具创造性和自动化程度最低的步骤之一。在这项工作中,我们引入了AtomWorld,这是一个旨在评估LLMs在结构修改方面能力的基准。该基准包括四种广泛使用的建模类别下的十种基本操作,并提供了可验证的评估指标。我们发现Claude Opus 4.6总体上表现最佳。随着建模复杂性的增加,成功率显著下降,特别是涉及复杂空间关系的操作(旋转成功率低于12%)。我们的结果表明,当代LLMs更适合作为材料结构建模的副驾驶,而非完全无监督的自主科学代理。除了评估之外,AtomWorld还作为未来开发结构感知模型(包括强化学习和基于代理的方法)的测试平台和实验场。

English abstract

Large language models (LLMs) have shown promising potential in scientific research, enabling tasks ranging from knowledge retrieval to property prediction. Existing science benchmarks mainly focus on perceptual or knowledge-based tasks, largely ignoring modelling tasks, a fundamental starting point for any real scientific research. For materials science, constructing and manipulating atomic structures is one of the most creative and least automated steps. In this work, we introduce AtomWorld, a benchmark designed to evaluate the abilities of LLMs on structure modifications. The benchmark includes ten fundamental actions under four widely used modelling categories, enabling verifiable evaluation metrics. We find that Claude Opus 4.6 generally performs the best. However, the success rate decreases markedly with increasing modelling complexity, with particularly low success rates (below 12% for rotation) for operations involving complex spatial relations. Our results suggest that contemporary LLMs are better suited as copilots for materials structure modelling rather than fully unsupervised autonomous scientific agents. Beyond evaluation, AtomWorld also serves as a testbed and playground for developing future structure-aware models, including reinforcement learning and agentic approaches.

2509.24100 2026-05-29 stat.ME cs.LG

SpeedCP: Fast Kernel-based Conditional Conformal Prediction

SpeedCP: 基于核的快速条件共形预测

Yating Liu, Yeo Jin Jung, Zixuan Wu, So Won Jeong, Claire Donnat

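AI summary Proposes an efficient path-tracing algorithm that retains the theoretical advantages of the RKHS conditional conformal prediction framework while delivering a 40-fold speedup and 30% shorter intervals.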

Details
AI Chinese abstract

共形预测提供了具有有限样本条件保证的分布自由预测集。我们基于Gibbs等人(2023)的RKHS框架,该框架利用协变量偏移族来提供近似条件共形预测区间,具有强大的理论前景,但计算成本过高。为弥补这一差距,我们开发了一种稳定高效的算法,该算法以与单次核分位数拟合基本相同的成本计算正则化RKHS共形优化问题的完整解路径。我们的路径追踪框架同时调整超参数,提供平滑控制和数据自适应校准。为了将方法扩展到高维设置,我们进一步将我们的方法与低秩潜在嵌入相结合,在数据驱动的潜在空间中捕获条件有效性。实验上,我们的方法在各种现代黑盒预测器上提供了可靠的条件覆盖,将Gibbs等人(2023)的区间长度改善了30%,同时实现了40倍的加速。

English abstract

Conformal prediction provides distribution-free prediction sets with finite-sample conditional guarantees. We build upon the RKHS-based framework of Gibbs et al. (2023), which leverages families of covariate shifts to provide approximate conditional conformal prediction intervals, an approach with strong theoretical promise, but with prohibitive computational cost. To bridge this gap, we develop a stable and efficient algorithm that computes the full solution path of the regularized RKHS conformal optimization problem, at essentially the same cost as a single kernel quantile fit. Our path-tracing framework simultaneously tunes hyperparameters, providing smoothness control and data-adaptive calibration. To extend the method to high-dimensional settings, we further integrate our approach with low-rank latent embeddings that capture conditional validity in a data-driven latent space. Empirically, our method provides reliable conditional coverage across a variety of modern black-box predictors, improving the interval length of Gibbs et al. (2023) by 30%, while achieving a 40-fold speedup.
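For orientation, the basic split conformal procedure that conditional methods refine can be sketched in a few lines. This is the textbook baseline with absolute-residual scores, which targets marginal rather than conditional coverage; it is not the SpeedCP path-tracing algorithm or the RKHS method of Gibbs et al. (2023). The function name and the toy numbers are hypothetical.

```python
import math


def split_conformal_interval(cal_preds, cal_targets, test_pred, alpha=0.1):
    """Prediction interval from held-out calibration residuals.

    Textbook split conformal with absolute-residual scores (an assumption
    made for illustration; the paper's method is more elaborate).
    """
    if len(cal_preds) != len(cal_targets) or not cal_preds:
        raise ValueError("calibration predictions and targets must match and be non-empty")
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_targets))
    n = len(scores)
    # Finite-sample corrected rank: the ceil((n + 1)(1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: only the trivial interval is valid.
        return (-math.inf, math.inf)
    q = scores[k - 1]
    return (test_pred - q, test_pred + q)


if __name__ == "__main__":
    # Toy calibration set: a predictor that always outputs 0 against targets 1..9.
    preds = [0.0] * 9
    targets = [float(t) for t in range(1, 10)]
    print(split_conformal_interval(preds, targets, test_pred=10.0, alpha=0.5))
```

The interval width here is the same for every test point, which is the limitation that conditional approaches such as the one described above aim to remove.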

2508.03253 2026-05-29 cs.GT cs.AI cs.MA

Approximate Proportionality in Online Fair Division

在线公平分配中的近似比例性

Davin Choo, Winston Fu, Derek Khu, Tzeh Yuan Neoh, Tze-Yang Poon, Nicholas Teh

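AI summary Studies the approximability of proportionality up to one good (PROP1) in online fair division and designs online algorithms with robust guarantees under two relaxations: a non-adaptive adversary and maximum item value predictions.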

Comments Appears in the 43rd International Conference on Machine Learning (ICML), 2026

Details
AI Chinese abstract

我们研究在线公平分配问题,其中不可分割的商品按顺序到达,必须立即且不可撤销地分配。先前的工作为近似经典概念(如至多一个商品的嫉妒无妒(EF1)和最大最小份额(MMS))建立了强不可能性结果,但至多一个商品的比例性(PROP1)的可近似性仍未解决。我们分两步解决这一差距。首先,我们展示了三种自然的贪婪分配规则(公平分配中的标准基线)无法保证对自适应对手的任何乘法近似到PROP1。这些局限性激发了两种松弛:(i)将注意力限制在非自适应对手上,以及(ii)在学习增强算法的精神下纳入粗略预测。在非自适应对手下,我们展示了均匀随机分配以高概率实现了有意义的PROP1近似,并且这一保证对于这种方法本质上是紧的;此外,当物品值足够小时,分配以高概率接近PROP1。最后,给定最大物品值(MIV)预测,我们设计了一种在线算法,该算法实现了PROP1的鲁棒近似保证,并在单边预测误差下优雅地退化。相比之下,我们展示了即使有完美的MIV预测,EF1、MMS和PROPX仍然不可近似。

English abstract

We study the online fair division problem, where indivisible goods arrive sequentially and must be allocated immediately and irrevocably. Prior work establishes strong impossibility results for approximating classic notions such as envy-freeness up to one good (EF1) and maximin share (MMS) in this setting, but the approximability of proportionality up to one good (PROP1) has remained unresolved. We resolve this gap in two steps. First, we show that three natural greedy allocation rules (standard baselines in fair division) fail to guarantee any multiplicative approximation to PROP1 against an adaptive adversary. These limitations motivate two relaxations: (i) restricting attention to a non-adaptive adversary, and (ii) incorporating coarse predictions in the spirit of learning-augmented algorithms. Under a non-adaptive adversary, we show that the uniform random allocation achieves a meaningful PROP1 approximation with high probability, and this guarantee is essentially tight for this approach; moreover, when item values are sufficiently small, the allocation is near-PROP1 with high probability. Finally, given maximum item value (MIV) predictions, we design an online algorithm that achieves robust approximation guarantees for PROP1, and degrades gracefully under one-sided prediction error. In contrast, we show that EF1, MMS, and PROPX remain inapproximable even with perfect MIV predictions.
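To make the PROP1 notion concrete, the sketch below checks the exact (non-approximate) condition for additive valuations and draws a uniform random allocation. It assumes the standard definition: each agent's bundle value, plus the value of their most-valued good outside the bundle, must reach a 1/n share of their total value. The function names and the toy instance are hypothetical, and the check is offline; it does not reproduce the paper's online algorithms or approximation guarantees.

```python
import random


def is_prop1(valuations, allocation):
    """Check PROP1 for additive valuations.

    valuations[i][g] is agent i's value for good g; allocation[g] is the
    index of the agent who receives good g.
    """
    n = len(valuations)
    for agent, values in enumerate(valuations):
        own = sum(v for g, v in enumerate(values) if allocation[g] == agent)
        # Most valuable good the agent did not receive (0 if they hold everything).
        best_other = max(
            (v for g, v in enumerate(values) if allocation[g] != agent),
            default=0,
        )
        # Compare n * (own + best_other) with the total to avoid dividing by n.
        if n * (own + best_other) < sum(values):
            return False
    return True


def uniform_random_allocation(num_goods, num_agents, rng):
    """Assign each good independently and uniformly at random to an agent."""
    return [rng.randrange(num_agents) for _ in range(num_goods)]


if __name__ == "__main__":
    rng = random.Random(0)
    values = [[rng.randint(1, 10) for _ in range(20)] for _ in range(3)]
    allocation = uniform_random_allocation(num_goods=20, num_agents=3, rng=rng)
    print(is_prop1(values, allocation))
```

Because each good is assigned independently of everything seen so far, this rule needs no lookahead, which is what makes it a natural candidate in the online setting with a non-adaptive adversary.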