arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.03217 2026-06-03 stat.ML cond-mat.dis-nn cs.LG

An Asymptotic Theory of Chain-of-Thought in In-Context Learning

Kaito Takanami, Cengiz Pehlevan

Affiliations: Department of Physics, Graduate School of Science, The University of Tokyo; John A. Paulson School of Engineering and Applied Sciences, Harvard University; Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University; Center for Brain Science, Harvard University

AI summary: Using high-dimensional random matrix theory, the paper derives an exact formula for the generalization error of chain-of-thought in-context learning in linear regression, and reveals phase transitions as a function of reasoning depth, pretraining data amount, and context length.

Abstract

Chain-of-thought (CoT) reasoning has become a widely used mechanism for eliciting multi-step reasoning in large language models by generating intermediate reasoning steps at inference time. Yet the scaling behavior of generalization with CoT depth remains poorly understood. To address this question, we study a theoretically solvable model of CoT for in-context weight prediction in linear regression, where test-time reasoning is represented as an iterative refinement of the weight-parameter estimate. Using tools from random matrix theory under high-dimensional asymptotics, we derive an exact formula for the generalization error as a function of reasoning depth, pretraining data amount, and context length. Our analysis reveals a sharp phase transition separating exponential and polynomial improvement, saturation, and overthinking, and characterizes how the optimal reasoning depth scales. We further show that deeper reasoning is most effective with sufficiently rich pretraining and in-context information, whereas limited pretraining or context makes longer reasoning prone to error amplification or saturation. We also validate these predictions through experiments on fully learned linear attention and softmax attention models. Our results provide a unified theoretical account of how test-time CoT depth affects generalization.
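
The "iterative refinement of the weight-parameter estimate" can be pictured with a small sketch. It assumes, purely for illustration, that each reasoning step is one gradient-descent step on the in-context squared loss; the abstract does not state the update rule, and the names `make_context`, `refine`, and `step_size` are hypothetical.

```python
import random

def make_context(w_true, n, rng, noise=0.0):
    """Draw n in-context pairs (x, y) with y = <w_true, x> + Gaussian noise."""
    xs = [[rng.gauss(0.0, 1.0) for _ in w_true] for _ in range(n)]
    ys = [sum(w * x for w, x in zip(w_true, row)) + rng.gauss(0.0, noise)
          for row in xs]
    return xs, ys

def refine(xs, ys, depth, step_size=0.5):
    """Apply `depth` gradient steps on the in-context squared loss, from w = 0.

    Assumption: one gradient step stands in for one reasoning step.
    """
    n, d = len(xs), len(xs[0])
    w = [0.0] * d
    for _ in range(depth):
        grad = [0.0] * d
        for row, y in zip(xs, ys):
            residual = sum(wj * xj for wj, xj in zip(w, row)) - y
            for j in range(d):
                grad[j] += residual * row[j] / n
        w = [wj - step_size * gj for wj, gj in zip(w, grad)]
    return w

def squared_error(w, w_true):
    """Squared Euclidean distance between the estimate and the true weights."""
    return sum((a - b) ** 2 for a, b in zip(w, w_true))

rng = random.Random(0)
w_true = [1.0, -2.0, 0.5]
xs, ys = make_context(w_true, n=40, rng=rng)  # noise-free context
errors = [squared_error(refine(xs, ys, depth), w_true) for depth in (0, 1, 5, 20)]
print(errors)  # the estimation error shrinks as depth grows in this easy regime
```

In this noise-free, well-conditioned toy setting more steps always help. The saturation and overthinking regimes described in the abstract arise from limited pretraining or context, which this sketch does not model.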

2606.02909 2026-06-03 stat.ML cs.LG

Scalable Derivative Gaussian Processes via Exact Gradient Reduction

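Hyunseok Seung, Matthias Katzfuss

Affiliations: Department of Statistics, University of Wisconsin–Madison

AI summary: Proposes TERA, which uses exact gradient reduction to cut the cost of derivative Gaussian processes from O(n^3 d^3) to O(d m^2 + m^6) per target, enabling scalable inference in high-dimensional spaces.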

Abstract

Gradient observations can substantially improve Gaussian process (GP) surrogates, particularly in high-dimensional settings where function evaluations are expensive. However, exact inference with $n$ function values and $n$ full gradients in $d$ dimensions scales cubically in the joint state size, imposing an intractable $\mathcal{O}(n^3 d^3)$ computational bottleneck. We introduce TERA, a highly scalable derivative GP method based on target-specific exact gradient reduction. We prove that for stationary kernels, the gradient components orthogonal to the directions connecting the target and conditioning points are conditionally independent of the target function value; consequently, the exact conditional density is fully characterized by at most $m^2$ directional derivatives once a conditioning set of size $m$ is specified. By using these reduced, dimension-free conditionals as local factors in a Vecchia approximation, TERA effectively decouples $n$ and $d$ from the dense matrix inversion. This reduces the per-target evaluation cost to $\mathcal{O}(dm^2 + m^6)$ time and $\mathcal{O}(dm^2 + m^4)$ memory, leaving the underlying derivative GP model mathematically unchanged. Empirical evaluations demonstrate that TERA achieves state-of-the-art predictive accuracy while operating orders of magnitude faster than standard derivative GPs. Crucially, both computation time and peak GPU memory remain essentially flat with respect to $d$, enabling highly scalable inference in high-dimensional spaces.

2606.02740 2026-06-03 stat.ML cs.LG

ScoreStop: Gradient-based early stopping using functional score tests

Oliver J. Hines, Christian L. Hines

Affiliations: University of California, Berkeley

AI summary: Proposes ScoreStop, a gradient-based early-stopping rule for gradient boosted decision trees. At each iteration it uses a functional score test to check whether the current predictor is the population risk minimizer, thereby avoiding overfitting.

Comments: Presented at the International Conference on Machine Learning 2026 Workshop on Hypothesis Testing

Abstract

Gradient boosted decision trees require a stopping rule to avoid overfitting. The standard rule monitors a validation loss and stops if the loss fails to improve for a fixed patience period. However, the patience parameter has no interpretable scale and validation losses can be noisy or implicitly defined by a user-specified gradient. We propose ScoreStop, a gradient-based early-stopping rule that casts the stopping decision at each iteration as a test of the null hypothesis that the current predictor is the population risk minimizer. We use a functional score test, computed on validation data, with a statistic that is scale-invariant in the update direction, with a known asymptotic distribution under the null. Because our test uses gradients rather than loss values, the same construction applies to implicit losses such as LambdaRank, and data-dependent losses such as Cox regression via influence functions. In synthetic experiments and real-data benchmarks, we show that ScoreStop is competitive with loss-based methods.
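
For reference, the standard patience rule that the abstract contrasts against can be sketched as follows. This is the loss-based baseline only, not ScoreStop itself; the function name `patience_stop` and the example loss values are made up for illustration.

```python
def patience_stop(val_losses, patience):
    """Return the 0-based iteration at which a patience rule stops, or None.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive iterations.
    """
    best = float("inf")
    since_best = 0
    for t, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            since_best = 0
        else:
            since_best += 1
            if since_best >= patience:
                return t
    return None

# Hypothetical validation curve: improves, then drifts upward with noise.
losses = [1.00, 0.80, 0.70, 0.72, 0.71, 0.73, 0.90]
print(patience_stop(losses, patience=2))
print(patience_stop(losses, patience=3))
```

The stopping point moves with `patience`, yet nothing in the data says which value is right. That is the "no interpretable scale" problem the abstract raises.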

2606.02664 2026-06-03 stat.ML cs.LG

State-Coupled Volatility in Latent Dynamical Systems: Recovery Under Partial Observation

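Imani Beckett

Affiliations: The Herbert Wertheim School of Public Health and Human Longevity Science, University of California San Diego

AI summary: Proposes a state-coupled stochastic volatility framework that uses a particle expectation-maximization algorithm to estimate, under partial observation, how latent process variance depends on displacement from equilibrium, and validates recovery and detection performance in simulation.

Comments: 40 pages, 16 figures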

Abstract

Latent state-space models are widely used to study partially observed dynamical systems, yet most formulations assume that process variability is independent of latent-state position. In many biological, behavioral, and physiological systems, however, variability may depend systematically on the underlying dynamical state, producing structured stochasticity that is not captured by constant-variance models. We introduce a state-coupled stochastic volatility framework in which latent process variance depends on displacement from a latent equilibrium. To estimate this relationship under partial observation, we develop a particle expectation-maximization procedure combining bootstrap particle filtering and backward trajectory smoothing. The model includes a coupling parameter, $\gamma$, that quantifies the strength of association between latent-state position and process variability. A large-scale simulation benchmark evaluated recovery and detection performance across varying coupling strengths, observation noise levels, trajectory lengths, and persistence regimes. The proposed framework consistently reduced recovery bias relative to an observed-state heteroskedastic proxy, with the largest improvements occurring under strong coupling. Recovery performance improved with increasing latent persistence, while detection performance remained competitive across a broad range of conditions and became increasingly advantageous as observation noise increased. Taken together, the results demonstrate that state-coupled volatility can be identified and estimated under partial observation when latent-state structure is explicitly modeled. The framework provides a practical methodological foundation for studying state-dependent variability and evaluating whether structured stochasticity contributes information about system dynamics beyond that contained in mean-state trajectories alone.

2606.02645 2026-06-03 stat.ML cs.AI cs.LG

Target Updates May Stabilize Linear Q-Learning: Periodic and Soft Dynamics

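Donghwan Lee

Affiliations: School of Electrical Engineering, KAIST

AI summary: Using exact switched linear system dynamics and joint spectral radius analysis, the paper proves that, under specific spectral and step-size conditions, periodic hard target updates and soft target updates guarantee convergence of linear Q-learning to the exact projected Q-Bellman solution.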

Abstract

Periodic target updates in Q-learning and soft target updates in actor-critic methods are empirically well established stabilization mechanisms, but their precise theoretical explanation is still incomplete. This paper gives a rigorous and exact analysis of these mechanisms for Q-learning with linear function approximation (linear Q-learning) using the exact switched linear system (SLS) dynamics induced by the Bellman maximum and the joint spectral radius (JSR) of the resulting switching matrix families. Although linear Q-learning can fail to converge in general, we prove that, under explicit spectral and step-size conditions, periodic hard target updates and soft target updates can guarantee convergence to the exact projected Q-Bellman solution. The main analysis is carried out for deterministic linear Q-learning, where the target-update mechanism is most transparent. Once the corresponding JSR certificate is established for the mean recursion, the stochastic reinforcement-learning setting can be treated by replacing deterministic modes with sampled stochastic modes and adding the corresponding stochastic-noise analysis.
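
The two target-update mechanisms named in the abstract can be written down in a few lines. This is a generic sketch of the usual definitions (a periodic copy and a Polyak-style average), not the paper's switched-system analysis; `hard_update`, `soft_update`, `period`, and `tau` are illustrative names.

```python
def hard_update(theta, target, step, period):
    """Periodic hard update: copy the online weights every `period` steps."""
    return list(theta) if step % period == 0 else list(target)

def soft_update(theta, target, tau):
    """Soft update: move the target a fraction `tau` toward the online weights."""
    return [(1.0 - tau) * t + tau * th for th, t in zip(theta, target)]

theta = [1.0, 2.0]    # online weights (held fixed here for clarity)
target = [0.0, 0.0]   # target weights

# The soft target closes a fixed fraction of the remaining gap at every step.
soft = target
for _ in range(50):
    soft = soft_update(theta, soft, tau=0.1)
print(soft)

# The hard target stays frozen between copies.
print(hard_update(theta, target, step=7, period=5))
print(hard_update(theta, target, step=10, period=5))
```

With the online weights held fixed, the soft target after k steps equals (1 - (1 - tau)^k) times the online weights, so `tau` plays a role comparable to the inverse of the hard-update period.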

2606.02632 2026-06-03 stat.ML cs.AI cs.CY cs.LG econ.EM stat.AP

Position: Prioritize Identifying Structure, Not Complex Models, for Scientific Discovery

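Tyler H. McCormick

Affiliations: GitHub

AI summary: Argues that mechanistic learning is generically underdetermined in the high-dimensional proxy regimes where modern machine learning operates, and proposes concrete standards for "mechanistic ML" so that LLM-centered workflows support science rather than merely simulate it.

Comments: Will appear as a position paper in ICML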

Abstract

Modern Machine Learning (ML) and Artificial Intelligence (AI) models, especially large language models (LLMs), are increasingly used to generate scientific hypotheses and mechanistic explanations from observational data. This position paper argues that in the high-dimensional proxy regimes where modern ML excels, mechanistic learning is generically underdetermined: many incompatible mechanisms induce essentially the same observational relationships on the support of the data, so predictive success and coherent explanations are insufficient evidence of mechanism discovery. This underdetermination becomes uniquely hazardous with LLMs, which tend to collapse large equivalence classes of explanations into a single fluent narrative. This paper proposes concrete standards for "mechanistic ML," and argues these norms are necessary if LLM-centered workflows are to support science rather than merely simulate it.

2606.02592 2026-06-03 stat.AP cs.AI

Tracking Urban Atmospheric Pollutants using Sentinel-5P Satellite Data

Alice Gomez-Cantos, Henry O. Velesaca

Affiliations: Facultad de Ciencias Naturales y Matemáticas, Escuela Superior Politécnica del Litoral (ESPOL), Campus Gustavo Galindo, Km. 30.5 Vía Perimetral, Guayaquil, 090902, Ecuador; Software Engineering Department, Research Center for Information and Communication Technologies (CITIC-UGR), University of Granada, 18071, Granada, Spain

AI summary: Proposes a framework based on Sentinel-5P/TROPOMI tropospheric column observations that uses distributional metrics (the median and upper percentiles) and K-means clustering to characterize background levels and extremes of urban NO2 pollution across Guayas Province, Ecuador, providing an interpretable and scalable air-quality assessment tool for data-scarce regions.

Abstract

Urban nitrogen dioxide ($NO_2$) is a key indicator of combustion-related air pollution and exhibits strong spatial and temporal variability in cities. This study presents a satellite-based framework for tracking urban $NO_2$ pollution using tropospheric column observations from Sentinel-5P/TROPOMI over Guayas Province, Ecuador. Rather than estimating surface concentrations, the methodology emphasizes robust distributional metrics, including the median and upper-tail percentiles ($P_{90}$, $P_{95}$, and $P_{99}$), to characterize background conditions and localized pollution extremes at the canton scale. Multi-year satellite observations are aggregated annually and analyzed using unsupervised K-means clustering to identify characteristic pollution regimes without predefined thresholds. Results show that highly urbanized cantons consistently exhibit elevated extreme $NO_2$ values and greater variability, while less urbanized areas display lower and more homogeneous patterns. The proposed approach provides an interpretable and scalable tool for urban air-quality assessment in data-scarce regions using satellite observations alone. The implementation is publicly available on GitHub https://hvelesaca.github.io/sentinel-5P-clustering/.
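
The distributional metrics listed in the abstract (median, $P_{90}$, $P_{95}$, $P_{99}$) can be computed with the Python standard library alone. The sketch below covers only that summary step; the input values are synthetic placeholders rather than satellite data, the interpolation method is an assumption, and `distribution_summary` is a hypothetical helper name.

```python
import statistics

def distribution_summary(values):
    """Median and upper-tail percentiles of one canton-year of column values."""
    # quantiles(n=100) returns the 99 cut points P1..P99; index k-1 holds Pk.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return {
        "median": statistics.median(values),
        "P90": cuts[89],
        "P95": cuts[94],
        "P99": cuts[98],
    }

# Synthetic stand-in for a year of NO2 column observations in one canton.
observations = list(range(1, 102))
summary = distribution_summary(observations)
print(summary)
```

One such summary vector per canton and year would be a natural input to the K-means step, although the abstract does not specify the exact feature set.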

2606.03184 2026-06-03 q-fin.CP cs.LG q-fin.ST

FinStressTS: A Parametric Synthetic Benchmark for Time-Series Forecasting in Finance

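Jiaze Sun, Kelvin J. L. Koa, Ruiyang Ni, Yize Liu, Haonan Chen, Ke-Wei Huang

Affiliations: National University of Singapore; Asian Institute of Digital Finance; Nanyang Technological University

AI summary: To address weak signals and complex mechanisms in financial forecasting, the paper proposes the FinStressTS synthetic benchmark, which systematically evaluates 15 models on point and probabilistic forecasting across 30 diagnostic environments and shows that model performance depends on the data-generating mechanism.

Comments: KDD 2026 (Oral)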

Abstract

Financial forecasting is difficult due to low signal-to-noise ratios, latent factors, heavy tails, regime shifts, and jumps. Real-world benchmarks offer limited failure attribution: researchers can observe underperformance, but often cannot isolate why because mechanisms are unobservable and entangled. Real financial data reveal only one realized path, making it difficult to assess tail-risk calibration or data efficiency. We introduce FinStressTS, a mechanism-aware synthetic benchmark that links model behavior to controlled structural causes. FinStressTS comprises 30 diagnostic environments around six mechanism families: volatility clustering, multi-scale persistence, heavy-tailed shocks, regime switching, self-exciting jumps, and zero-inflated processes. We evaluate two tasks: point forecasting, using NMAE across five settings, and probabilistic forecasting, using CRPS under known data-generating mechanisms. We benchmark 15 models, from classical methods (HAR, VAR) to Transformer forecasters (PatchTST, iTransformer) and deep probabilistic architectures (DeepAR, TSFlow), and use learning curves to measure sample efficiency. Our evaluation reveals three insights. First, performance is mechanism-dependent: autoregressive and linear models are highly competitive, and often outperform Transformer-based models, in several volatility-, tail-, and jump-driven environments. Second, distributional alignment matters: parametric probabilistic models such as DeepAR calibrate well in stationary settings, while flexible models can help when distributions become multimodal or sparse. Third, neural models often require more data to match simple baselines, with larger gains mainly when learning latent regimes or complex distributions. FinStressTS provides an open framework for diagnosing failure modes and advancing risk-aware forecasting.
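
The two metrics named in the abstract can be sketched as follows. The CRPS function uses the standard sample-based estimator, E|X - y| - 0.5 E|X - X'|. The NMAE normalization (dividing by the sum of absolute targets) is an assumption, since the abstract does not define it and several conventions exist.

```python
def crps_from_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|. Lower is better."""
    n = len(samples)
    spread_to_truth = sum(abs(s - y) for s in samples) / n
    internal_spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return spread_to_truth - 0.5 * internal_spread

def nmae(y_true, y_pred):
    """Normalized MAE; normalizing by sum(|y_true|) is an assumed convention."""
    total_error = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    return total_error / sum(abs(t) for t in y_true)

# A point-mass forecast reduces CRPS to the absolute error.
print(crps_from_samples([3.0, 3.0, 3.0], y=1.0))
# A forecast spread around the truth is penalized less than its raw error.
print(crps_from_samples([0.0, 2.0], y=1.0))
print(nmae([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]))
```

Because CRPS scores a whole predictive distribution against one realized value, it fits a synthetic benchmark where the true data-generating mechanism is known and many paths can be drawn.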

2606.02937 2026-06-03 q-bio.NC cs.CV

BEAST3D: Animal behavioral analysis and neural encoding from multi-view video via Gaussian splatting

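Yanchen Wang, Lenny Aharon, Wangshu Zhu, Kyle Daruwalla, Linghua Zhang, Jiaru Zou, Selmaan Chettih, Helen Hou, Liam Paninski, Matthew R Whiteway

Affiliations: Columbia University; Cold Spring Harbor; Stanford University

AI summary: Proposes BEAST3D, a self-supervised pretraining framework that learns 3D visual representations from unlabeled multi-view video through 3D Gaussian splatting reconstruction and animal segmentation, and transfers effectively to novel view synthesis, multi-view pose estimation, and neural encoding.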

Abstract

Multi-view video recordings are increasingly used to capture the 3D movements of animals in experimental settings, yet extracting rich 3D representations from these recordings remains challenging. Supervised pose estimation requires extensive manual annotation, while general-purpose 3D reconstruction models trained on generic scene datasets fail on the specialized imagery and sparse-view setting of laboratory experiments. We address these limitations with BEAST3D, a self-supervised pretraining framework that learns 3D visual representations from unlabeled, calibrated multi-view video. BEAST3D uses a vision transformer to predict 3D Gaussian splats that reconstruct held-out views through differentiable rendering, while simultaneously segmenting the animal from the background. BEAST3D reconstructs 3D structure with as few as four views by conditioning directly on known camera parameters--unlike general-purpose models, which must estimate camera geometry from dense overlapping viewpoints that are seldom available in lab settings. Through comprehensive evaluation across four species, we demonstrate that BEAST3D produces rich, viewpoint-invariant features that transfer effectively to three downstream tasks: novel view synthesis, which validates the quality of the learned 3D representations; multi-view pose estimation, which provides the sparse keypoint trajectories widely used in behavioral analysis; and neural encoding, which relates 3D behavioral features to simultaneously recorded neural activity. BEAST3D thus establishes a versatile framework for behavioral analysis that leverages 3D structure in modern multi-view laboratory recordings.

2606.02629 2026-06-03 q-bio.QM cs.AI cs.LG

Enhancing Protein-Protein Interaction Prediction with Hierarchical Motif-based Multimodal Protein Embedding

Zaifei Yang, Samuel Ping-Man Choi, James Kwok

Affiliations: National University of Singapore; University of California, Los Angeles

AI summary: Proposes MMM-PPI, which integrates sequence, structure, and function information through hierarchical motif-based multimodal encoding at three scales (micro, meso, and macro) to improve protein-protein interaction prediction.

Abstract

Protein-protein interactions (PPIs) are essential for many biological processes. However, existing PPI prediction approaches suffer from two major limitations: they overlook the hierarchical organization of proteins, particularly meso-scale motifs that critically regulate PPIs, and fail to effectively integrate sequence, structure, and function modalities. To address these limitations, we propose MMM-PPI, a Hierarchical Motif-based Multi-Modal protein Encoder for PPI Prediction that constructs PPI embeddings in a bottom-up multi-modal manner across three scales. At the micro-scale, we encode three modal residue features; at the meso-scale, a novel multimodal motif encoder aggregates residues into spatially-informed motif embeddings; at the macro-scale, a multimodal protein encoder integrates motifs into protein embeddings by jointly modeling motif importance and inter-modal correlations. The pre-trained encoder can be used off-the-shelf for large-scale PPI prediction. Extensive experiments on multiple PPI datasets show that MMM-PPI outperforms state-of-the-art multi-label PPI prediction models, particularly under challenging data partitions and limited data scenarios. Code is available at https://github.com/yzf-code/MMM-PPI.

2606.02624 2026-06-03 q-bio.QM cs.AI cs.LG

TadA-Bench: A Million-Variant Benchmark for Future-Round Discovery Toward Agentic Protein Engineering

Jin Gao, Juntu Zhao, Zirui Zeng, Jiaqi Shen, Junhao Shi, Dukun Zhao, Yuming Lu, Dequan Wang

Affiliations: Tsinghua University

AI summary: TadA-Bench is a million-variant wet-lab replay benchmark built from 31 rounds of TadA directed evolution. It defines a fixed-data replay task to evaluate how well models rank variants from unseen future rounds, introduces Seq2Graph to unify labels, and shows that evolutionary coverage matters more than local data density.

Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026). Data: https://huggingface.co/datasets/JinGao/TadABench-1M. Code: https://github.com/shiyegao/TadABench-1M

Abstract

AI for scientific discovery is entering an agentic era, where protein-engineering systems are expected to prioritize future wet-lab experiments rather than merely fit static measurements. We introduce TadA-Bench, a million-variant wet-lab replay benchmark from 31 TadA directed-evolution rounds for future-round discovery toward agentic protein engineering. TadA-Bench preserves the campaign chronology and defines a fixed-data replay task: given earlier experimental rounds, models rank variants that appear only in later rounds. It provides aligned DNA, RNA, and protein views, and uses Seq2Graph, a graph-based label-unification pipeline, to reconcile noisy enrichment measurements into consistent cross-round activity labels. Random-split controls show strong interpolation, but future-round ranking and finite-budget candidate selection are much weaker. Controlled analyses suggest that evolutionary coverage is more informative than local data density, positioning TadA-Bench as a reproducible wet-lab replay substrate for future-round discovery toward agentic protein engineering; the data and code are released on Hugging Face and GitHub.

2606.03946 2026-06-03 cs.DB cs.LG cs.LO

MLSkip: Data Skipping for ML Filters via Lightweight Metadata

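Mihail Stoian, Mark Gerarts, Pascal Ginter, Andreas Zimmerer, Jan Van den Bussche, Andreas Kipf

Affiliations: University of Technology Nuremberg; Hasselt University; Technical University of Munich

AI summary: Because traditional data skipping techniques do not apply to ML filters, the paper proposes using Parquet's default min-max metadata, together with an enhanced 2D convex hull metadata structure, to prune row groups for ML predicates, reaching an average pruning effectiveness of 38.31% with the convex hull.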

Abstract

Database vendors recently released AI functions that can be used in filter predicates. As such functions often rely on costly, black-box ML models, they unveil new data management challenges. Concretely, traditional data skipping techniques for integer and string data fail to be applicable to the new filter type. Indeed, there is no known mechanism for pruning non-qualifying row groups, e.g., when reading files from blob storage. In this work, we initiate the study of data skipping techniques for ML filters. We make the case that Parquet's default min-max metadata is enough to enable pruning. To this end, we draw connections to two lines of research: (i) the recently proposed query language for ML models and (ii) neural network verification. Our preliminary results on ReLU architectures show that on tables from TPC-H and TPC-DS, the average pruning effectiveness for filters of selectivity below 0.1% amounts to 27.4%. Finally, inspired by research on spatial joins, we propose an enhanced metadata structure: a size-bounded 2D convex hull that verification tools can make better use of, increasing the pruning effectiveness to 38.31%, while occupying at most 45 bytes per row group and column pair. We observe an end-to-end speedup of 1.07$\times$ over PyTorch in DuckDB.
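
One simple way to see why min-max metadata can prune row groups for a ReLU model is interval bound propagation, sketched below. This is a swapped-in illustration of the general idea, not the verification tooling used in the paper. The tiny network, the predicate form `model(x) > threshold`, and the names `affine_bounds`, `relu_bounds`, and `can_skip` are all assumptions.

```python
def affine_bounds(lo, hi, weights, bias):
    """Propagate the box [lo, hi] through y = W x + b, one output row at a time."""
    out_lo, out_hi = [], []
    for row, b in zip(weights, bias):
        low, high = b, b
        for w, a, c in zip(row, lo, hi):
            if w >= 0:
                low += w * a
                high += w * c
            else:  # a negative weight swaps which endpoint is extreme
                low += w * c
                high += w * a
        out_lo.append(low)
        out_hi.append(high)
    return out_lo, out_hi

def relu_bounds(lo, hi):
    """ReLU is monotone, so it can be applied to both endpoints."""
    return [max(0.0, v) for v in lo], [max(0.0, v) for v in hi]

def can_skip(col_min, col_max, layers, threshold):
    """True if no row inside the min-max box can satisfy model(x) > threshold."""
    lo, hi = list(col_min), list(col_max)
    for i, (weights, bias) in enumerate(layers):
        lo, hi = affine_bounds(lo, hi, weights, bias)
        if i < len(layers) - 1:  # no activation after the output layer
            lo, hi = relu_bounds(lo, hi)
    return hi[0] <= threshold

# Hypothetical 2-2-1 ReLU network and one row group's per-column min/max.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),
    ([[1.0, 1.0]], [0.0]),
]
print(can_skip([0.0, 0.0], [1.0, 1.0], layers, threshold=1.5))
print(can_skip([0.0, 0.0], [1.0, 1.0], layers, threshold=0.5))
```

The bound is sound but loose: it treats columns as independent, so a row group may be kept even when no actual row qualifies. The abstract's 2D convex hull metadata is aimed at that kind of slack, by describing a column pair more tightly than a box.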

2606.03935 2026-06-03 cs.NE cs.LG

Quadratic integrate-and-fire neurons exhibit less fragmented loss landscapes and outperform leaky integrate-and-fire neurons in spike-based gradient descent

Carlo Wenig, Raoul-Martin Memmesheimer, Christian Klos

Affiliations: University of Bonn; University of Tübingen; University of Illinois Urbana-Champaign

AI summary: By comparing LIF and QIF neurons on the Spiking Heidelberg Digits dataset, the paper finds that QIF neurons have smoother loss landscapes and gradients, and therefore perform better when training spiking neural networks.

Comments: 9 pages, 5 figures (main part)

Abstract

The ability to train spiking neural networks is essential for modeling biological neural networks as well as for neuromorphic computing. However, for the extensively used leaky integrate-and-fire (LIF) neurons, arbitrarily small parameter changes can induce spike (dis)appearances that disrupt subsequent activity, leading to unstable neural representations and permanently silent neurons during exact spike-based gradient descent. Recent work shows that a class of neuron models, which includes the quadratic integrate-and-fire (QIF) neuron, avoids these discontinuities and enables continuous and even smooth spike-based gradient descent. However, it remains unclear whether these advantages translate into practice. Here, we demonstrate that they do so via a controlled comparison between networks of LIF and QIF neurons on the popular Spiking Heidelberg Digits dataset. Specifically, in a first step, we perform a thorough hyperparameter search to optimize both models, revealing a clear performance advantage of QIF neurons. In a second step, we visualize the loss and gradient landscapes. Consistent with their inferior performance, we find that the loss landscapes of LIF neurons, which are discontinuous, appear more fragmented and the related gradients more erratic. An analysis of the landscapes of single samples indicates that these features arise from changes in the temporal order of spikes, which often cause disruptive spike (dis)appearances. Overall, our results advocate replacing LIF neurons with neuron models exhibiting continuous spiking dynamics, such as QIF neurons, for gradient descent training.
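
For readers unfamiliar with the two neuron models, here is a minimal single-neuron sketch using textbook dimensionless forms: dV/dt = -V + I for LIF and dV/dt = V^2 + I for QIF. The constant input, Euler integration, finite spike cutoff, and all parameter values are illustrative assumptions, not the paper's network setup.

```python
def count_lif_spikes(current, t_end=10.0, dt=1e-3, threshold=1.0, reset=0.0):
    """Euler-integrate dV/dt = -V + I; spike and reset when V crosses threshold."""
    v, spikes = reset, 0
    for _ in range(int(t_end / dt)):
        v += dt * (-v + current)
        if v >= threshold:
            v, spikes = reset, spikes + 1
    return spikes

def count_qif_spikes(current, t_end=10.0, dt=1e-3, peak=10.0, reset=-10.0, v0=0.0):
    """Euler-integrate dV/dt = V**2 + I; a finite `peak` stands in for divergence."""
    v, spikes = v0, 0
    for _ in range(int(t_end / dt)):
        v += dt * (v * v + current)
        if v >= peak:
            v, spikes = reset, spikes + 1
    return spikes

# LIF: the drive must exceed the hard threshold for any spike to occur.
print(count_lif_spikes(0.5), count_lif_spikes(1.5))
# QIF: a positive drive leaves no resting state, and the spike is generated by
# the voltage dynamics themselves rather than by a hard threshold crossing.
print(count_qif_spikes(-1.0), count_qif_spikes(1.0))
```

In the LIF model a spike exists or not depending on whether the voltage just reaches a fixed threshold, which is one way to picture the abrupt spike (dis)appearances the abstract describes.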

2606.03926 2026-06-03 cs.HC cs.LG

DiffUNet^2: Bidirectional Prediction, Probabilistic Generation and Collaborative Visual Discovery for Scientific Data

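Mengdi Chu, Jiaxin Yang, Angus G. Forbes, Nathan Debardeleben, Earl Lawrence, Ayan Biswas, Han-Wei Shen

Affiliations: Ohio State University; NVIDIA; Los Alamos National Laboratory

AI summary: Proposes DiffUNet^2, a diffusion-based conditional generative framework that enables bidirectional, any-to-any prediction across time steps and captures the distribution of plausible outcomes, combined with interactive visualization to support scientific exploration.

Comments: 12 pages, 20 figures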

Abstract

Modeling temporal evolution is important to analyzing and reasoning about scientific phenomena, yet most machine learning methods provide deterministic forward predictions that overlook multiple plausible outcomes and rarely support backward reasoning, limiting their usefulness in practical scientific workflows. We present a framework that integrates diffusion-based generative modeling with interactive visual analytics for scientific exploration. We introduce DiffUNet^2, a conditional diffusion model that enables bidirectional, any-to-any generation across time and captures distributions of plausible system evolutions. Built upon the model, our interactive system supports branching timeline exploration, user-guided state editing, and probability-space navigation, enabling scientists to actively explore alternative hypotheses rather than passively observe predictions. We evaluate the model on 5 datasets across different scientific domains to validate its predictive accuracy and probability-space ensemble quality. In collaboration with domain experts, we demonstrate the effectiveness of our approach in supporting practical scientific temporal data analysis workflows. By integrating modeling and visual interaction, our approach enables scientists to interactively explore system dynamics, transforming generative models into tools for hypothesis-driven scientific analysis.

2606.03919 2026-06-03 cs.SI cs.CY cs.DL cs.LG physics.soc-ph

Forecasting Conceptual Diffusion in Science: The Case of Quantum Computing

预测科学中的概念扩散:以量子计算为例

Thomas Maillart, Thibaut Chataing, David Dosu, Paul Bagourd, Julian Jang-Jaccard, Alain Mermoud

发表机构 * Geneva School of Economics and Management, University of Geneva(日内瓦经济管理学院,日内瓦大学) Faculty of Medicine, University of Geneva(日内瓦大学医学院) Open Quantum Institute, CERN(开放量子研究所,欧洲核子研究中心) armasuisse Science + Technology(armasuisse 科学与技术)

AI总结 通过构建时间分辨的概念共现网络并训练LightGBM模型,研究量子计算领域概念的内生巩固与外生扩散的可预测性,发现外生扩散和熵具有强可预测性(R²高达0.78),而内生巩固在量子计算中几乎不可预测,但在神经植入领域显著上升(R²=0.83),表明概念扩散受语义和引用环境中的稳定结构规律支配。

Comments 19 pages, 5 figures, 6 tables. Code and manuscript sources: https://github.com/wazaahhh/breakthroughs-diffusion . An earlier version was presented at the Global Tech Mining Conference (GTM) 2026 (submission #117)

详情
AI中文摘要

理解和预测科学变化需要能够区分科学概念的内生巩固和外生扩散的模型。利用OpenAlex中量子计算概念子树,我们构建了一个时间分辨的概念共现网络,并追踪每个概念对的上游引用谱系和下游扩散。我们在分布和多样性感知特征上训练LightGBM模型,以预测四个结果:内生巩固、外生扩散、它们的比率以及扩散熵。在控制科学整体出版增长后,内生巩固在主要的量子计算基准中基本不可预测。相比之下,外生扩散和熵具有很强的可预测性(R²高达0.78),并且由上游异质性、引用广度和分布离散度驱动,如SHAP分析所示;在机器人、先进材料和神经植入上的重复验证证实,外生扩散仍然是跨领域排名最高的目标(测试R²约0.60-0.87),而内生可预测性在神经植入中显著上升(测试R²=0.83),表明量子计算的不对称性并非普遍适用。案例研究表明,尖锐的熵增加与新概念前沿的开启同时发生,而熵崩溃则标志着技术趋同或范式更替。这些结果表明,概念扩散受嵌入语义和引用环境中的稳定结构规律支配。通过识别跨领域采纳的早期基于多样性的信号,该方法为快速发展的研究领域中的预期科学计量学、技术预见和创新导向政策分析提供了可扩展的基础。

英文摘要

Understanding and anticipating scientific change requires models that distinguish between endogenous consolidation and exogenous diffusion of scientific concepts. Using the quantum computing subtree of concepts in OpenAlex, we construct a temporally resolved concept co-occurrence network and track each concept pair through its upstream citation lineage and downstream diffusion. We train LightGBM models on distributional and diversity-aware features to predict four outcomes: endogenous reinforcement, exogenous diffusion, their ratio, and diffusion entropy. After controlling for overall publication growth of the scientific body, endogenous reinforcement proves largely unpredictable in the primary quantum-computing benchmark. In contrast, exogenous diffusion and entropy are strongly predictable ($R^2$ up to $0.78$) and are driven by upstream heterogeneity, citation breadth, and distributional dispersion, as shown by SHAP analyses; replications on robotics, advanced materials, and neuro implants confirm that exogenous diffusion remains the top-ranked target across fields ($R^2_{\mathrm{test}} \sim 0.60$ to $0.87$), while endogenous predictability rises markedly in neuro implants ($R^2_{\mathrm{test}} = 0.83$), indicating that the quantum-computing asymmetry does not generalise uniformly. Case studies reveal that sharp entropy increases coincide with the opening of new conceptual frontiers, while entropy collapses signal technological convergence or paradigm displacement. These results demonstrate that conceptual diffusion is governed by stable structural regularities embedded in semantic and citation environments. By identifying early diversity-based signals of cross-domain uptake, the approach provides a scalable foundation for anticipatory scientometrics, technology foresight, and innovation-oriented policy analysis in rapidly evolving research fields.
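
The abstract above reports diffusion entropy as a prediction target and $R^2$ as the score, without giving formulas. The sketch below is one plausible reading and not the authors' code: it assumes diffusion entropy is the Shannon entropy of the fields in which a concept pair later appears, and it uses the standard coefficient of determination. The function names and the toy field labels are hypothetical.

```python
import math
from collections import Counter


def diffusion_entropy(field_labels):
    """Shannon entropy (bits) of the fields a concept pair diffuses into.

    Assumption: one label per downstream paper; the abstract does not
    specify how the distribution is built.
    """
    counts = Counter(field_labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# A pair cited from four different fields is maximally dispersed (2 bits);
# a pair cited from a single field has zero entropy.
spread = diffusion_entropy(["physics", "cs", "chemistry", "biology"])
narrow = diffusion_entropy(["physics", "physics", "physics", "physics"])
```

Under this reading, a sharp rise in the value over time would correspond to the "opening of new conceptual frontiers" described in the abstract, and a collapse toward zero to convergence on one field.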

2606.03907 2026-06-03 cs.SE cs.AI cs.HC

The Impact of Configuring Agentic AI Coding Tools on Build-vs-Buy Decisions: A Study Protocol

配置智能体AI编码工具对构建vs购买决策的影响:一项研究协议

Jai Lal Lulla, Matthias Galster, Jie M. Zhang, Sebastian Baltes, Christoph Treude

发表机构 * Singapore Management University, Singapore(新加坡管理大学) University of Bamberg, Germany(班贝格大学) King’s College London, United Kingdom(伦敦国王学院) Heidelberg University, Germany(海德堡大学)

AI总结 本研究通过受控实验协议,探讨配置机制如何影响Claude Code和OpenAI Codex等智能体AI编码工具在构建vs购买决策中的行为,并发布可复用的基准数据集和分析流程。

Comments 14 pages, 1 table. Accepted at the 20th International Symposium on Empirical Software Engineering and Measurement (ESEM 2026), Registered Reports track

详情
AI中文摘要

智能体AI编码工具以越来越高的自主性编写代码,并在此过程中决定何时导入库以及何时从头实现功能。这些决策——是从头构建功能还是购买外部库(以下称为构建vs购买)——对软件安全性、许可合规性、性能和长期可维护性有直接影响。然而,尚无受控实验研究探讨智能体AI编码工具中构建vs购买决策的支配因素。配置机制,即开发人员根据项目或工作流程定制智能体AI编码工具行为的手段,是实践者影响这些决策的主要方式之一。但尚不清楚哪些配置机制最有效地影响构建vs购买决策。我们提出了一项预注册协议,研究配置机制如何改变两种流行的智能体AI编码工具(Claude Code和OpenAI Codex)中的构建vs购买行为。我们将执行来自阶段性项目基准的受控编程任务,每个任务围绕可识别的构建vs购买点构建,并操纵提供给每个工具的配置,范围从无配置、包含软偏好和明确禁止的上下文文件,到技能(可自主发现的指令)、支持MCP的库发现工具和权限控制,测量工具选择的库、是否披露新引入的库以及这些披露是否完整准确。九个预注册假设构成了该协议。生成的基准数据集和分析流程将作为可复用工件发布,用于评估智能体AI编码工具中的构建vs购买行为。

英文摘要

Agentic AI coding tools write code with increasing autonomy and in doing so decide when to import a library and when to implement functionality from scratch. These decisions, whether to build functionality from scratch or buy into an external library, hereafter build-versus-buy, carry direct consequences for software security, licensing compliance, performance, and long-term maintainability. Yet no controlled experimental study has examined what governs build-versus-buy decisions in agentic AI coding tools. Configuration mechanisms, i.e., the means by which developers tailor agentic AI coding tool behavior to a project or workflow, are one of the primary means by which practitioners can influence these decisions. However, it is unclear which configuration mechanisms influence build-versus-buy decisions most effectively. We present a pre-registered protocol to study how configuration mechanisms alter build-versus-buy behavior in two popular agentic AI coding tools: Claude Code and OpenAI Codex. We will execute controlled programming tasks drawn from a benchmark of staged projects, each constructed around identifiable build-versus-buy points, and will manipulate the configuration supplied to each tool, ranging from no configuration, through context files with soft preferences and explicit prohibitions, to Skills (instructions that can be autonomously discovered), MCP-enabled library discovery tools, and permission controls, measuring which libraries the tool selects, whether it discloses newly introduced libraries, and whether those disclosures are complete and accurate. Nine pre-registered hypotheses structure the protocol. The resulting benchmark dataset and analysis pipeline will be released as a reusable artifact for evaluating build-versus-buy behavior in agentic AI coding tools.

2606.03876 2026-06-03 cs.HC cs.AI cs.MA

From 'What' to 'How' and 'Why': Sharing LLM-Generated Retrospective Summaries of Older Adults' Passive Tracking Data with Remote Family Members

从“是什么”到“怎么样”和“为什么”:与远程家庭成员共享老年人被动追踪数据的LLM生成回顾性摘要

Jiachen Li, Reina Szeyi Chan, Akshat Choube, Xiang Zhi Tan, Elizabeth Mynatt, Varun Mishra

发表机构 * Northeastern University(东北大学)

AI总结 本研究利用大型语言模型(LLM)从多模态追踪数据生成回顾性摘要,通过技术探针和访谈重新设计系统,显著提升了远程家庭成员对摘要的满意度、帮助性、信任和接收意愿,并提出了支持其从“是什么”到“怎么样”和“为什么”的认知转变的设计启示。

详情
AI中文摘要

随着现代普适计算技术的日益普及,多模态追踪系统有望为远程家庭成员(RFM)等利益相关者提供及时的感知与安心感,这些成员在老年人护理协调中扮演核心角色。然而,将异构数据流整合为高层次、有意义的内容(如回顾性摘要)仍然具有挑战性。虽然近期工作已展示了大型语言模型(LLM)在解释多模态追踪数据方面的潜力,但针对像RFM这样拥有丰富个人知识、强烈情感责任但对老年人日常生活了解有限且照护能力受限的利益相关者生成叙事性描述的研究仍较少。在本工作中,我们探索了如何利用LLM为老年人的RFM从多模态追踪数据生成回顾性摘要。我们利用并定制了现有系统Vital Insight,在不同日期和数据可用性场景下生成初始摘要作为技术探针,并对11名RFM进行访谈以收集反馈。基于这些见解,我们将系统重新设计为一种多层、多智能体、洞察驱动的摘要方法,从客观统计和描述构建到丰富、上下文感知的叙述。随后,我们通过同一11名RFM的调查比较了重新设计的摘要与初始版本,发现满意度、感知帮助性、信任和接收意愿均有显著提升。最后,我们提出了针对RFM及更广泛场景的AI生成摘要的设计启示,强调需要支持RFM的认知转变,从简单地呈现“收集了什么数据”转向解释“我的亲人过得怎么样”和“为什么”。

英文摘要

With the growing prevalence of modern ubiquitous computing technologies, multi-modal tracking systems hold promise for providing timely awareness and reassurance to stakeholders such as remote family members (RFMs) of older adults, who play a central role in care coordination. However, combining heterogeneous data streams into high-level, meaningful content - such as retrospective summaries - remains challenging. While recent work has demonstrated the promise of large language models (LLMs) for interpreting multi-modal tracking data, less attention has been given to generating narrative accounts for stakeholders like RFMs, who possess rich personal knowledge of older adults and strong emotional responsibility, yet have limited visibility into their daily lives and limited capacity for caregiving. In this work, we explore how LLMs can be used to generate retrospective summaries from multi-modal tracking data for RFMs of older adults. We leveraged and customized an existing system, Vital Insight, to generate initial summaries on different dates and data availability scenarios as technology probes, and conducted interviews with 11 RFMs to gather feedback. Based on these insights, we redesigned the system into a multi-layer, multi-agent, insight-driven summary approach that builds from objective statistics and descriptions to enriched, context-aware narratives. We then compared the redesigned summaries with the initial versions through a survey with the same 11 RFMs and found significant improvements in satisfaction, perceived helpfulness, trust, and willingness to receive the summaries. We conclude by presenting design implications for AI-generated summaries for RFMs and broader contexts, emphasizing the need to support RFMs' sensemaking shift from simply presenting "What" data were collected, to explaining "How" is my loved one doing and "Why".

2606.03866 2026-06-03 cs.IR cs.AI cs.CL

Taiji: Pareto Optimal Policy Optimization with Semantics-IDs Trade-off for Industrial LLM-Enhanced Recommendation

Taiji: 面向工业LLM增强推荐的帕累托最优策略优化与语义ID权衡

Yuecheng Li, Zeyu Song, Jing Yao, Chi Lu, Peng Jiang, Kun Gai

发表机构 * Kuaishou Technology(快手科技) Unaffiliated(无隶属)

AI总结 提出Taiji框架,通过逆向工程推理和开放拒绝采样生成高质量CoT数据,并采用帕累托最优策略优化(POPO)自适应调整跨域奖励权重,实现LLM语义知识与推荐ID特征的帕累托最优权衡,在快手广告平台部署后服务超4亿日活用户。

Comments 8 pages, 2 figures

详情
AI中文摘要

通过大型语言模型(LLM)扩展推荐系统已成为工业界的显著趋势。然而,通过后训练(如SFT和RL)将LLM的语义空间与推荐系统的ID空间对齐仍然具有挑战性。现有的LLM4Rec范式受到两个主要问题的瓶颈:(1)在SFT期间,难以衡量和改进开放域推荐中的思维链(CoT)质量;(2)在RL对齐过程中,忽略了LLM语义奖励与推荐偏好奖励之间的权衡。受这些挑战启发,我们提出了Taiji,一种专为工业推荐系统设计的新型LLM-as-Enhancer框架。为了克服SFT瓶颈,我们利用逆向工程推理和开放拒绝采样生成高质量、领域特定的CoT数据。为了解决RL对齐问题,我们提出了帕累托最优策略优化(POPO),它自适应调整跨域奖励权重。理论上,它在LLM的语义世界知识与代表在线用户偏好的协同ID特征之间实现了最优权衡。大量的离线评估和在线A/B测试验证了Taiji的有效性。自2026年5月在快手广告平台部署以来,Taiji目前每天服务超过4亿用户,产生了显著的商业收入,并展示了其在网络规模环境中的强大可扩展性。

英文摘要

Scaling recommender systems via large language models (LLMs) has become a prominent trend in the industry. However, aligning the LLM's semantic space with the recommender's ID space via post-training (e.g., SFT and RL) remains challenging. Existing LLM4Rec paradigms are bottlenecked by two main issues: (1) the difficulty of measuring and improving chain-of-thought (CoT) quality in open-domain recommendation during SFT, and (2) the neglect of the trade-off between LLM semantic rewards and recommendation preference rewards during RL alignment. Inspired by these challenges, we present Taiji, a novel LLM-as-Enhancer framework designed for industrial recommender systems. To overcome the SFT bottleneck, we utilize reverse-engineered reasoning and open-ended rejection sampling to generate high-quality, domain-specific CoT data. To resolve the RL alignment issue, we propose Pareto Optimal Policy Optimization (POPO), which adaptively adjusts cross-domain reward weights. Theoretically, it achieves an optimal trade-off between the semantic world knowledge of LLMs and the collaborative ID features representing online user preferences. Extensive offline evaluations and online A/B tests validate the effectiveness of Taiji. Deployed on Kuaishou's advertising platform since May 2026, Taiji currently serves over 400 million users daily, yielding significant commercial revenue and demonstrating its robust scalability in web-scale environments.

2606.03864 2026-06-03 cs.SI cs.CY cs.DL cs.LG physics.soc-ph

Explainable Forecasting of Scientific Breakthroughs from Concept Network Dynamics

基于概念网络动力学的科学突破可解释预测

Thomas Maillart, Thibaut Chataing, Ntorina Antoni, David Dosu, Paul Bagourd, Julian Jang-Jaccard, Alain Mermoud

发表机构 * Geneva School of Economics and Management, University of Geneva, Geneva, Switzerland(日内瓦经济管理学院,日内瓦大学,瑞士日内瓦) Faculty of Medicine, University of Geneva, Geneva, Switzerland(日内瓦大学医学院,瑞士日内瓦) TU Eindhoven, The Netherlands(埃因霍温理工大学,荷兰) Open Quantum Institute, CERN, Geneva, Switzerland(开放量子研究所,欧洲核子研究中心,瑞士日内瓦) armasuisse Science + Technology, Switzerland(armasuisse 科学与技术,瑞士)

AI总结 提出一种可解释的机器学习方法,通过建模OpenAlex概念网络的演化,预测科学突破的结构前兆(研究概念之间联系的出现和增强),并利用59个特征的两阶段LightGBM模型同时预测概念对的形成和未来权重,在四个技术/生物医学领域取得优于现有方法的ROC-AUC(0.954-0.967)和可解释性。

Comments 18 pages, 10 figures, 4 tables. An earlier version was presented at Global Tech Mining Conference 2026. Code and data: https://github.com/wazaahhh/breakthroughs-forecasting

详情
AI中文摘要

我们介绍了一种可解释的机器学习方法,通过建模OpenAlex概念网络随时间演化的方式,预测科学突破的结构前兆——研究概念之间联系的出现和增强。利用59个语义和拓扑特征,一个两阶段LightGBM模型联合预测概念对的形成及其未来权重,增加了一个回归阶段,将预期强度量化到先前的链接存在预测之上。与现有技术相比,该方法同时提高了准确性和可解释性:在四个技术和生物医学领域的比较验证中,无需重新调整即可在所有时间范围内获得[0.954, 0.967]的ROC-AUC,超过了先前模型约0.90的水平,而每个预测都基于结构化的、可审计的特征,而非不透明的嵌入。分类性能高(AUC约0.95),回归保持稳定(一到五年内RMSLE为0.45至0.6)。特征归因表明,结构因素——特别是Adamic-Adar相似性和基于度的Hadamard度量——持续驱动准确性,表明与突破相关的重组出现在紧密连接的子网络中。两个专家锚定的案例——量子退火和AI赋能的量子架构——显示模型浮现出与专家预期一致的技术融合。然后,我们概述了一个三层决策架构——检测、专家翻译、机构整合——将这些预测转化为基于证据的研究战略和政策,以开放数据和可解释特征为基础。

英文摘要

We introduce an explainable machine-learning approach that forecasts the structural precursors of scientific breakthroughs -- the emergence and intensification of links between research concepts -- by modelling how OpenAlex concept networks evolve over time. Using 59 semantic and topological features, a two-stage LightGBM model jointly predicts the formation and the future weight of concept pairs, adding a regression stage that quantifies expected intensity to prior link-existence forecasts. Relative to the state of the art, the approach improves accuracy and explainability at once: comparative validation across four technology and biomedical domains yields ROC-AUC in [0.954, 0.967] at all horizons without re-tuning, exceeding the roughly 0.90 of prior models, while every forecast rests on structural, auditable features rather than opaque embeddings. Classification performance is high (AUC about 0.95) and regression remains stable (RMSLE 0.45 to 0.6 over one to five years). Feature attribution shows that structural factors -- particularly Adamic-Adar similarity and degree-based Hadamard measures -- consistently drive accuracy, suggesting that breakthrough-relevant recombinations emerge in tightly connected sub-networks. Two expert-anchored cases, quantum annealing and AI-enabled quantum architectures, show the model surfacing technological convergence consistent with expert expectations. We then outline a three-layer decision architecture -- detection, expert translation, institutional integration -- that turns these forecasts into evidence-based research strategy and policy, anchored in open data and explainable features.
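
The abstract names Adamic-Adar similarity and degree-based Hadamard measures as the dominant features, and RMSLE as the regression metric. The sketch below shows how such quantities are commonly computed on a small undirected graph. It is an illustration under assumptions, not the paper's pipeline: "degree-based Hadamard" is read here as the product of the two endpoint degrees, and the toy concept graph is invented.

```python
import math


def adamic_adar(adj, u, v):
    """Sum of 1 / ln(degree) over the common neighbours of u and v."""
    common = adj[u] & adj[v]
    return sum(1.0 / math.log(len(adj[z])) for z in common if len(adj[z]) > 1)


def degree_product(adj, u, v):
    """Element-wise (Hadamard-style) product of the two node degrees.

    Assumption: this is one reading of "degree-based Hadamard measure".
    """
    return len(adj[u]) * len(adj[v])


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x)."""
    squared = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical concept co-occurrence graph as an adjacency map of sets.
adj = {
    "qubit": {"annealing", "error_correction"},
    "optimization": {"annealing"},
    "annealing": {"qubit", "optimization", "ising"},
    "error_correction": {"qubit"},
    "ising": {"annealing"},
}

aa_score = adamic_adar(adj, "qubit", "optimization")      # shared neighbour: "annealing"
deg_score = degree_product(adj, "qubit", "optimization")  # 2 * 1
```

Both features depend only on graph structure, which matches the abstract's point that each forecast rests on auditable inputs and not on embeddings.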

2606.03852 2026-06-03 cs.SE cs.AI

FLARE: Fine-Grained Diagnostic Feedback for LLM Code Refinement

FLARE: 面向大语言模型代码精炼的细粒度诊断反馈

Yinsheng Yao, Hongxiang Zhang, Weixi Tong, Tianyi Zhang

发表机构 * Tongji University(同济大学) Purdue University(普渡大学)

AI总结 提出FLARE框架,利用轻量级诊断模型预测行级可疑信号进行缺陷定位和代码精炼,通过Top-K候选搜索提升修复效果。

详情
AI中文摘要

大型语言模型生成的代码常含有错误。现有方法依赖测试失败和自批评等反馈信号来迭代精炼生成的代码,但这些信号要么过于粗粒度,要么过于高层,不足以告知模型何处需要修复。在本工作中,我们提出了Flare,一个迭代框架,配备轻量级诊断模型,用于预测行级可疑信号以进行缺陷定位和代码精炼。鉴于诊断预测固有的不确定性,Flare搜索前K个最可疑区域,并根据执行结果选择最佳候选。在LiveCodeBench和BigCodeBench上使用五个基础LLM的实验表明,即使没有候选搜索(k=1),Flare也以1.72%到7.42%的绝对提升优于最强基线。此外,与无候选搜索相比,搜索10个候选平均提升8.50%。单独评估时,我们的轻量级诊断模型与最近的缺陷定位方法相比取得了最佳性能,表明它能提供可靠的细粒度代码精炼指导。

英文摘要

Large language models often generate code with bugs. Existing methods rely on feedback signals such as test failures and self-critiques to iteratively refine the generated code. Such signals are either too coarse-grained or too high-level, which is not sufficient to inform the model where to fix the bug. In this work, we present Flare, an iterative framework with a lightweight diagnostic model that predicts line-level suspiciousness signals for bug localization and code refinement. Given the inherent uncertainty of diagnostic predictions, Flare searches over the top-k suspicious regions and selects the best candidate according to execution outcomes. Experiments on LiveCodeBench and BigCodeBench with five base LLMs show that, even without candidate search (k=1), Flare outperforms the strongest baseline by an absolute improvement of 1.72% to 7.42%. Furthermore, searching over 10 candidates yields an average improvement of 8.50% compared with no candidate search. When evaluated in isolation, our lightweight diagnostic model achieves the best performance compared with recent fault localization methods, demonstrating that it can provide reliable fine-grained guidance for code refinement.
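
The top-k search described above can be sketched as a small loop: rank lines by suspiciousness, try a repair at each of the k most suspicious lines, and keep the candidate that passes the most tests. This is a minimal illustration, not Flare itself. The `repair` and `count_passed` callables are hypothetical stand-ins for the LLM refinement step and the test harness, and the toy versions below exist only to make the example runnable.

```python
def top_k_lines(scores, k):
    """Indices of the k most suspicious lines (ties broken by position)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]


def refine(code, scores, k, repair, count_passed):
    """Try one repair per suspicious line; keep the best-scoring candidate."""
    best_code, best_passed = list(code), count_passed(code)
    for idx in top_k_lines(scores, k):
        candidate = repair(code, idx)
        passed = count_passed(candidate)
        if passed > best_passed:
            best_code, best_passed = candidate, passed
    return best_code, best_passed


def toy_repair(code, idx):
    """Hypothetical repair: swap '-' for '+' on the flagged line."""
    fixed = list(code)
    fixed[idx] = fixed[idx].replace("-", "+")
    return fixed


def toy_count_passed(code):
    """Hypothetical harness: run the snippet and count passing checks."""
    namespace = {}
    try:
        exec("\n".join(code), namespace)
        add = namespace["add"]
        return sum([add(1, 2) == 3, add(0, 0) == 0])
    except Exception:
        return 0


buggy = ["def add(a, b):", "    return a - b"]
fixed, passed = refine(buggy, [0.1, 0.9], k=1,
                       repair=toy_repair, count_passed=toy_count_passed)
```

Selecting by execution outcome is what lets a larger k help: a wrong localization costs one extra candidate and does not derail the whole refinement.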

2606.03811 2026-06-03 cs.CR cs.AI cs.LG

AI Agents Enable Adaptive Computer Worms

AI代理实现自适应计算机蠕虫

Jonas Guan, Tom Blanchard, Hanna Foerster, Hengrui Jia, Gabriel Huang, Nicolas Papernot

发表机构 * University of Toronto(多伦多大学) Vector Institute(向量研究所) University of Cambridge(剑桥大学) ServiceNow

AI总结 本研究展示了AI代理能够生成针对每个目标的定制攻击策略,利用被感染机器上的大语言模型自我维持并传播,形成自持的AI驱动网络威胁。

详情
AI中文摘要

计算机蠕虫是一种通过在网络中从一台机器复制到另一台机器来传播的恶意软件。传统蠕虫(如WannaCry)利用预定的漏洞,修补这些漏洞即可阻止其传播。本文表明,人工智能(AI)代理实现了一种根本性的新威胁:一种能够针对每个遭遇的目标生成定制攻击策略的蠕虫。该蠕虫寄生性地利用被感染的机器运行开放权重的大语言模型(LLM)以维持其推理能力,或扩展其攻击范围。在部署于Linux、Windows和物联网(IoT)设备的机器网络上,该蠕虫通过利用常见的现实企业网络漏洞进行传播。由于蠕虫由窃取的计算资源驱动,攻击者每次新感染所需的边际成本为零。这在攻击者和防御者之间造成了不稳定的经济不对称。此外,由于蠕虫不需要商业AI平台,集中式安全控制(如服务拒绝或速率限制)在结构上无关紧要。我们的结果表明,自持的AI驱动网络威胁不再是理论上的。我们必须为自主的生成式对手做好准备:这些恶意软件系统无需人类操作员即可传播,其定义不是固定的利用代码,而是实时推理目标、适应观察并合成攻击逻辑的能力。

英文摘要

A computer worm is malware that spreads on a network by replicating itself from one machine to another. Traditional worms, like WannaCry, exploited predetermined vulnerabilities, and their spread can be halted by patching those vulnerabilities. Here we show that artificial intelligence (AI) agents enable a fundamentally new threat: a worm that generates tailored attack strategies to each target it encounters. The worm parasitically uses compromised machines to run open-weight large language models (LLMs) to sustain its reasoning, or extend its reach for further attacks. Deployed on a network of machines spanning Linux, Windows, and IoT (Internet of Things) devices, the worm propagated by exploiting common, real-world corporate network vulnerabilities. Since the worm is powered by stolen compute, the attacker's marginal cost per new infection is zero. This creates a destabilizing economic asymmetry between attackers and defenders. Moreover, because the worm requires no commercial AI platform, centralized safety controls, such as service refusals or rate limiting, are structurally irrelevant. Our results demonstrate that self-sustaining AI-driven cyber-threats are no longer theoretical. We must prepare for autonomous generative adversaries: malware systems that propagate without human operators and are defined not by fixed exploit code, but by the capacity to reason about targets, adapt to observations, and synthesize attack logic in real time.

2606.03770 2026-06-03 cs.DC cs.AI

E2LLM: Towards Efficient LLM Serving in Heterogeneous Edge/Fog Environments

E2LLM:异构边缘/雾环境中高效LLM服务

Truong-Thanh Le, Amir Taherkordi, Hoang-Loc La, Frank Eliassen, Phuong Hoai Ha, Peiyuan Guan

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出E2LLM框架,通过复制模型到多设备组并采用模型并行,结合遗传算法聚类和动态规划分区,在资源受限的异构边缘/雾环境中实现高效LLM部署,相比Splitwise基线在高需求下平均等待时间降低50%以上。

详情
AI中文摘要

大型语言模型(LLM)已成为现代应用不可或缺的一部分,但其部署仍具挑战性。除了执行模型本身,实际部署必须解决成本效率、低延迟和最优资源利用问题。传统方法通常假设整个模型可以托管在单个设备上,这在许多现实场景中不成立,尤其是在设备资源受限的边缘和雾环境中。本文介绍了E2LLM,一个旨在在此类资源有限环境中实现高效LLM部署的框架。E2LLM并非简单地将单个模型分区到所有可用设备,而是将完整模型复制到多个设备组(副本),并在每个副本内应用模型并行。每个副本根据其处理输入和输出令牌的效率被分配专门角色PREFILL或DECODER。这种分离利用了LLM推理这两个阶段之间的固有差异。为了有效组织设备,我们利用遗传算法形成最大化系统性能的集群。在每个集群内,我们应用动态规划确定最优分区策略,以最小化模型并行执行中的瓶颈。实验结果表明,我们的方法能够稳健地适应不同工作负载,包括输入和输出令牌长度显著变化的场景。与Splitwise基线相比,E2LLM在高需求条件下将平均等待时间降低了50%以上。

英文摘要

Large Language Models (LLMs) have become integral to modern applications, yet their deployment remains challenging. Beyond executing the models themselves, practical deployment must address cost efficiency, low latency, and optimal resource utilization. Conventional approaches typically assume that an entire model can be hosted on a single device, which does not hold in many real-world scenarios, particularly in Edge and Fog environments where device resources are constrained. In this paper, we introduce E2LLM, a framework designed to enable efficient LLM deployment in such resource-limited settings. Rather than simply partitioning a single model across all available devices, E2LLM replicates the full model across multiple groups of devices (replicas) and applies model parallelism within each replica. Each replica is assigned a specialized role, PREFILL or DECODER, based on its efficiency in handling input and output tokens. This separation leverages the inherent differences between these two phases of LLM inference. To effectively organize devices, we utilize a Genetic Algorithm to form clusters that maximize system performance. Within each cluster, we apply Dynamic Programming to determine an optimal partitioning strategy that minimizes bottlenecks in model-parallel execution. Experimental results demonstrate that our approach adapts robustly to varying workloads, including scenarios with significant variation in input and output token lengths. Compared to the Splitwise baseline, E2LLM reduces average waiting time by over 50% under high-demand conditions.
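
The abstract says dynamic programming picks a partition that minimizes the bottleneck within a cluster, but does not give the cost model. The sketch below assumes the simplest common formulation: layers are split into contiguous stages over devices in a fixed order, a stage takes (sum of layer costs) / (device speed), and the goal is to minimize the slowest stage. The function name, cost units, and inputs are hypothetical.

```python
def partition_layers(layer_costs, device_speeds):
    """Split layers into contiguous stages, one per device, minimizing the
    slowest stage. Returns (bottleneck_time, [(start, end), ...]) with
    half-open layer ranges; a device may receive an empty range."""
    n, m = len(layer_costs), len(device_speeds)
    prefix = [0.0]
    for cost in layer_costs:
        prefix.append(prefix[-1] + cost)

    inf = float("inf")
    # best[j][i]: minimal bottleneck placing the first i layers on the first j devices
    best = [[inf] * (n + 1) for _ in range(m + 1)]
    cut = [[0] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0.0
    for j in range(1, m + 1):
        for i in range(n + 1):
            for s in range(i + 1):
                stage = (prefix[i] - prefix[s]) / device_speeds[j - 1]
                candidate = max(best[j - 1][s], stage)
                if candidate < best[j][i]:
                    best[j][i], cut[j][i] = candidate, s

    # Walk the cut table backwards to recover the stage boundaries.
    bounds, i = [], n
    for j in range(m, 0, -1):
        s = cut[j][i]
        bounds.append((s, i))
        i = s
    bounds.reverse()
    return best[m][n], bounds


equal = partition_layers([4, 2, 2, 4], [1.0, 1.0])
skewed = partition_layers([4, 2, 2, 4], [2.0, 1.0])
```

With equal devices the split is even; when the first device is twice as fast it absorbs three of the four layers. The table costs O(m·n²) time, which is small for typical layer counts.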

2606.03647 2026-06-03 cs.CR cs.AI cs.LG

Black-box, Adaptive, Efficient, Transferable, Harmful, Applicable... Attacks Are All You Need to Break LLMs

黑盒、自适应、高效、可迁移、有害、适用……攻击是破解LLM所需的一切

Vincent Limbach, Jonas Dornbusch, David Lüdke, Stephan Günnemann, Leo Schwinn

发表机构 * University of St. Gallen(圣加仑大学)

AI总结 提出间接危害优化(IHO)方法,通过迭代偏好优化训练掩码扩散语言模型攻击器,实现黑盒、高效、可迁移的自适应攻击,显著提升对分层防御的破解成功率。

详情
AI中文摘要

准确评估对抗鲁棒性是一个长期挑战。有缺陷的攻击设计可能会夸大鲁棒性估计,使得部署风险评估和防御比较不可靠。历史上,像AutoAttack这样的标准化攻击在很大程度上解决了图像分类器的问题,为跨防御的系统比较提供了可靠的评估基线。然而,对于LLM越狱评估,目前还没有等效的方法,而设计这样的攻击要困难得多。一个可靠的攻击必须(除其他外)兼容黑盒、适用于任意防御管道且高效,而现有方法无法同时满足这些条件。我们引入了间接危害优化(IHO),这是一种掩码扩散语言模型攻击器,通过对危害评判器进行迭代偏好优化来训练,仅需对目标进行黑盒访问。相同的方法无需修改即可用作针对个体行为的强自适应攻击,或作为一种高效的摊销策略,无需微调即可迁移到未见行为和未见目标模型。即使面对分层防御(例如,结合辅助检测器的Circuit Breaker训练模型),IHO在攻击成功率上也显著优于最先进的方法,且无需任何防御特定的适应。我们的结果将IHO定位为向那种过去提高了可靠性的标准化越狱评估迈出的实际一步。代码和模型可在GitHub和Hugging Face上获取。

英文摘要

Accurately evaluating adversarial robustness is a longstanding challenge. A flawed attack design can inflate robustness estimates, making deployment risk assessment and defense comparison unreliable. Historically, standardized attacks such as AutoAttack have largely resolved this for image classifiers, providing a reliable evaluation baseline for systematic comparison across defenses. However, no equivalent exists for LLM jailbreak evaluation yet, where designing such an attack is considerably more difficult. A reliable attack must, among other things, be black-box compatible, applicable to arbitrary defense pipelines, and efficient, which no existing method jointly satisfies. We introduce Indirect Harm Optimization (IHO), a masked diffusion language model attacker trained via iterative preference optimization against a harmfulness judge, requiring only black-box access to the target. The same method can be used without modification as a strong adaptive attack on individual behaviors, or as an efficient amortized policy that transfers to held-out behaviors and unseen target models without fine-tuning. Even against layered defenses, such as a Circuit Breaker-trained model combined with an auxiliary detector, IHO improves attack success considerably over state-of-the-art approaches, without any defense-specific adaptation. Our results position IHO as a practical step toward the kind of standardized jailbreak evaluation that has improved reliability in the past. Code and models are available on GitHub and Hugging Face.

2606.03601 2026-06-03 cs.SE cs.AI

DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair

DDOR: 用于可解释过度拒绝测试与修复的Delta调试方法

Qinyan Zhou, Peixin Zhang, Jun Sun, Haonan Zhang, Dongxia Wang

发表机构 * Southeast University(东南大学) Singapore Management University(新加坡管理大学) Zhejiang University(浙江大学) Huzhou Institute of Industrial Control Technology(湖州工业控制技术研究所)

AI总结 提出DDOR框架,通过delta调试定位最小拒绝触发片段(mRTF),实现黑盒环境下大语言模型过度拒绝行为的自动化测试与修复。

详情
AI中文摘要

虽然安全对齐和护栏有助于大语言模型(LLM)避免有害输出,但它们也可能导致过度拒绝,即对仅看似有风险的无害查询进行无根据的拒绝。我们提出了DDOR(用于过度拒绝的Delta调试),这是一个完全自动化和可解释的框架,用于在黑盒设置中进行过度拒绝测试和修复,其中仅可访问模型输入和输出,内部安全机制保持不透明。DDOR应用delta调试来定位最小拒绝触发片段(mRTF),这些片段提供了短语级别的、可解释的证据,说明拒绝发生的原因。基于这些mRTF,DDOR生成多样化、上下文丰富的提示,并执行多预言验证以过滤本质上不安全或模糊的案例,从而产生可扩展且模型特定的过度拒绝测试套件(每个模型约1K个案例)。除了评估之外,我们进一步利用定位的mRTF进行有针对性的提示修复,显著减少过度拒绝,同时保留原始意图并在真正有害的输入上保持安全性。总体而言,DDOR提供了一种实用的端到端解决方案,用于评估和缓解过度拒绝,在不牺牲安全性的情况下提高LLM的可用性。

英文摘要

While safety alignment and guardrails help large language models (LLMs) avoid harmful outputs, they can also induce overrefusal, i.e., unwarranted rejection of benign queries that merely appear risky. We present DDOR (Delta Debugging for OverRefusal), a fully automated and explainable framework for overrefusal testing and repair in a black-box setting, where only model inputs and outputs are accessible and internal safety mechanisms remain opaque. DDOR applies delta debugging to localize minimal refusal-triggering fragments (mRTFs) that provide phrase-level, explainable evidence for why a refusal occurs. Conditioned on these mRTFs, DDOR generates diverse, context-rich prompts and performs multi-oracle validation to filter intrinsically unsafe or ambiguous cases, producing scalable and model-specific overrefusal test suites (approximately 1K cases per model). Beyond evaluation, we further leverage localized mRTFs to perform targeted prompt repair, substantially reducing overrefusal while preserving the original intent and maintaining safety on genuinely harmful inputs. Overall, DDOR offers a practical end-to-end solution to both evaluate and mitigate overrefusal, improving LLM usability without sacrificing safety.
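
Delta debugging, which the abstract applies to locate minimal refusal-triggering fragments, can be sketched with the classic ddmin reduction loop over prompt tokens. This is an illustration under assumptions and not DDOR's implementation: the `triggers` oracle below is a hypothetical keyword check standing in for a black-box query to the model, and whitespace tokenization is a simplification.

```python
def ddmin(tokens, triggers):
    """Shrink `tokens` to a 1-minimal sublist for which `triggers` still holds."""
    assert triggers(tokens), "the full input must trigger the behaviour"
    n = 2  # current number of chunks
    while len(tokens) >= 2:
        chunk = max(1, len(tokens) // n)
        subsets = [tokens[i:i + chunk] for i in range(0, len(tokens), chunk)]
        reduced = False
        for idx, subset in enumerate(subsets):
            complement = [t for j, sub in enumerate(subsets) if j != idx for t in sub]
            if triggers(subset):            # one chunk alone suffices
                tokens, n, reduced = subset, 2, True
                break
            if triggers(complement):        # this chunk can be dropped
                tokens, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(tokens):            # already at single-token granularity
                break
            n = min(len(tokens), n * 2)     # refine granularity and retry
    return tokens


prompt = "how do I kill a Python process".split()

# Hypothetical oracles: in practice each call would query the target model.
single = ddmin(prompt, lambda toks: "kill" in toks)
paired = ddmin(prompt, lambda toks: "kill" in toks and "process" in toks)
```

Each oracle call corresponds to one model query, so the number of calls made by the reduction loop is the practical cost to watch in a black-box setting.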

2606.03593 2026-06-03 cs.SE cs.RO

Making Embodied AI Reliable: A Community Agenda from Testing to Formal Verification

使具身AI可靠:从测试到形式验证的社区议程

Xi Zheng, Dulanga Weerakoon, Yintong Huo, Teresa Yeo, Guy Van Den Broeck, Vijay Ganesh, Daniel Neider, Biplav Srivastava, Ivan Ruchkin, Archan Misra, Corina Pasareanu

发表机构 * University of Waterloo(滑铁卢大学) Princeton University(普林斯顿大学)

AI总结 本文基于AAAI'26 Bridge Program讨论,提出通过集成测试、形式验证和运行时保证的神经符号方法,解决具身AI在开放世界中的生命周期可靠性问题。

详情
AI中文摘要

具身AI系统越来越多地部署在开放世界环境中,但确保其可靠性仍然是一个根本性挑战。借鉴AAAI'26 Bridge Program关于“通过测试和形式验证使具身AI可靠”的讨论,本文认为具身AI的可靠性本质上是一个生命周期保证问题,源于不确定性、人类交互以及紧密耦合系统组件之间的涌现行为。我们确定了实现可靠具身AI的三个互补方向:(1)基于可信场景的测试,由经过验证的规范和有意义覆盖度量支持;(2)通过系统行为和环境的符号化结构化表示实现的组合验证;(3)能够在部署期间适应不确定性和分布偏移的运行时保证机制。我们不将这些方法视为独立,而是倡导集成保证工作流,通过共享的神经符号表示和系统生命周期中的持续反馈,连接测试、验证和运行时适应。这种集成为构建能够在复杂现实世界中安全可靠运行的值得信赖的具身AI系统提供了基础。

英文摘要

Embodied AI systems are increasingly deployed in open-world environments, yet ensuring their reliability remains a fundamental challenge. Drawing on discussions from the AAAI'26 Bridge Program on "Making Embodied AI Reliable with Testing and Formal Verification", this article argues that reliability in embodied AI is inherently a lifecycle assurance problem arising from uncertainty, human interaction, and emergent behaviors across tightly coupled system components. We identify three complementary directions toward reliable embodied AI: (1) trustworthy scenario-based testing supported by validated specifications and meaningful coverage metrics, (2) compositional verification enabled by structured symbolic representations of system behavior and environmental context, and (3) runtime assurance mechanisms capable of adapting to uncertainty and distribution shifts during deployment. Rather than treating these approaches independently, we advocate integrated assurance workflows that connect testing, verification, and runtime adaptation through shared neuro-symbolic representations and continuous feedback across the system lifecycle. Such integration provides a foundation for building trustworthy embodied AI systems that can operate safely and reliably in complex real-world environments.

2606.03535 2026-06-03 cs.IR cs.CL

Can LLM Rerankers Predict Their Own Ranking Performance?

LLM 重排序器能否预测自身的排序性能?

Shiyu Ni, Keping Bi, Jiafeng Guo, Jingtong Wu, Zengxin Han, Xueqi Cheng

发表机构 * State Key Laboratory of AI Safety(人工智能安全国家重点实验室) Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 研究 LLM 重排序器能否通过自一致性或口头化置信度来估计自身生成的排序质量,并提出两种监督方法 Verb-Num 和 Verb-List 以改进校准。

详情
AI中文摘要

检索效果在不同查询间差异显著,因此在获得相关性判断之前估计排序质量非常重要。查询性能预测(QPP)解决了这一需求,但大多数现有方法依赖于检索或重排序后的外部预测器。本文研究重排序器内部 QPP:LLM 重排序器能否估计其刚刚产生的排序的质量?我们探讨了无训练和基于训练的方法。对于无训练估计,我们检查了跨采样排序的特定于度量的自一致性以及由重排序器直接生成的口头化置信度。在 TREC Deep Learning 2019--2022 上使用四个 LLM 的实验表明,自一致性与最先进(SOTA)方法竞争力相当,并且在几乎所有设置下校准更好,而直接口头化置信度严重过度自信。为了改进口头化置信度,我们提出了两种监督方法 Verb-Num 和 Verb-List,使 LLM 重排序器仅需少量额外输出标记即可生成校准的排序质量估计。

英文摘要

Retrieval effectiveness varies substantially across queries, making it important to estimate ranking quality before relevance judgments are available. Query performance prediction (QPP) addresses this need, but most existing methods rely on external predictors after retrieval or reranking. In this paper, we study reranker-internal QPP: can an LLM reranker estimate the quality of the ranking it has just produced? We investigate both training-free and training-based approaches. For training-free estimation, we examine metric-specific self-consistency across sampled rankings and verbalized confidence produced directly by the reranker. Experiments on TREC Deep Learning 2019--2022 with four LLMs show that self-consistency is competitive with the state-of-the-art (SOTA) approach and better calibrated in almost all settings, while direct verbalized confidence is severely overconfident. To improve verbalized confidence, we propose two supervised methods, Verb-Num and Verb-List, which enable LLM rerankers to produce calibrated ranking-quality estimates with only a few additional output tokens.
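
Self-consistency across sampled rankings can be illustrated with a simple agreement score: sample several rankings for the same query and measure how much their top-k sets overlap. The abstract does not define its metric-specific variant, so the sketch below substitutes mean pairwise Jaccard overlap of the top-k documents as an assumed proxy. Function names and document IDs are hypothetical.

```python
from itertools import combinations


def topk_overlap(a, b, k):
    """Jaccard overlap between the top-k sets of two rankings."""
    top_a, top_b = set(a[:k]), set(b[:k])
    return len(top_a & top_b) / len(top_a | top_b)


def self_consistency(rankings, k):
    """Mean pairwise top-k overlap; needs at least two sampled rankings."""
    pairs = list(combinations(rankings, 2))
    return sum(topk_overlap(a, b, k) for a, b in pairs) / len(pairs)


# Three hypothetical rankings sampled from the same reranker and query.
samples = [
    ["d1", "d2", "d3"],
    ["d2", "d1", "d3"],
    ["d3", "d1", "d2"],
]
score = self_consistency(samples, k=2)
```

A score near 1 means the reranker returns the same top documents on every sample, which this proxy treats as high confidence. A low score signals an unstable ranking that may deserve a lower quality estimate.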

2606.03523 2026-06-03 cs.CR cs.AI cs.LG

High-Precision APT Malware Attribution with Out-of-Scope Resilience

高精度APT恶意软件归因与越界鲁棒性

Peter Williams, Adam Sobey, Erisa Karafili

发表机构 * Department of Computer Science, University of Oxford(牛津大学计算机科学系)

AI总结 提出基于排名二元分类器与显式弃权的APT恶意软件归因方法,在越界样本占比87%时仍保持92%精度和95%选择性准确率。

详情
AI中文摘要

早期归因高级持续性威胁(APT)活动可帮助防御者优先调查、选择对策并减少入侵影响。恶意软件提供了有用的归因证据,但自动化APT恶意软件归因在实践中仍然困难。现有方法通常作为封闭集分类器在有限数量的已知APT组织上进行训练和评估。然而,在操作环境中,分类器很可能遇到训练中未出现的组织样本。封闭集分类器被迫将这些样本分配给已知组织,产生无根据且可能误导的归因。我们提出一种基于排名二元分类器与显式弃权的高精度APT恶意软件归因方法。我们的方法不是训练单个多类分类器,而是为每个APT组织训练和调整两个二元分类器,根据验证性能对分类器进行排名,并顺序应用它们。仅当分类器提供足够证据时才对样本进行归因;否则,弃权。我们在APT恶意软件数据集和旨在压力测试越界行为的更大组合数据集上评估该方法。在APT恶意软件数据集上,该方法实现了比先前公布结果更高的精度。在最具挑战性的设置中,87%的测试样本来自训练中排除的60个APT组织,该方法对94%的越界样本弃权,同时在其分类的样本上保持92%的精度和95%的选择性准确率。

英文摘要

Early attribution of Advanced Persistent Threat (APT) activity can help defenders prioritise investigation, select countermeasures, and reduce the impact of an intrusion. Malware provides useful attribution evidence, but automated APT malware attribution remains difficult in practice. Existing approaches are typically trained and evaluated as closed-set classifiers over a limited number of known APT groups. In operational environments, however, classifiers are likely to encounter samples from groups not represented during training. Closed-set classifiers are then forced to assign such samples to known groups, producing unsupported and potentially misleading attributions. We present a high-precision APT malware attribution method based on ranked binary classifiers with explicit abstention. Rather than training a single multi-class classifier, our approach trains and tunes two binary classifiers per APT group, ranks the classifiers by validation performance, and applies them sequentially. A sample is attributed only when a classifier provides sufficient evidence; otherwise, it abstains. We evaluate the method on the APT Malware dataset and on a larger combined dataset designed to stress-test out-of-scope behaviour. On the APT Malware dataset, the method achieves higher precision than previously published results on the same dataset. In the most challenging setting, where 87% of test samples came from 60 APT groups excluded from training, the method abstained on 94% of out-of-scope samples while maintaining 92% precision and 95% selective accuracy on the samples it classified.
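
The decision rule described above (rank binary classifiers by validation performance, apply them in order, abstain when none is confident) can be sketched as follows. This is a simplified illustration, not the authors' method: each detector is modelled as a hypothetical tuple of group name, validation score, scoring function, and threshold, and the toy scoring functions just read a feature value.

```python
def attribute(sample, detectors):
    """Return the first group whose detector fires, trying detectors in
    descending order of validation score; return None to abstain."""
    ranked = sorted(detectors, key=lambda d: -d[1])
    for group, _validation_score, score, threshold in ranked:
        if score(sample) >= threshold:
            return group
    return None


def selective_accuracy(predictions, labels):
    """Accuracy computed only over samples that were not abstained on."""
    decided = [(p, y) for p, y in zip(predictions, labels) if p is not None]
    if not decided:
        return 0.0
    return sum(p == y for p, y in decided) / len(decided)


# Hypothetical detectors: (group, validation score, scorer, threshold).
detectors = [
    ("APT-A", 0.90, lambda x: x[0], 0.8),
    ("APT-B", 0.95, lambda x: x[1], 0.8),
]

samples = [[0.9, 0.9], [0.9, 0.1], [0.1, 0.2]]
predictions = [attribute(s, detectors) for s in samples]
```

The third sample matches neither detector, so the rule abstains instead of forcing an attribution. That is the behaviour a closed-set multi-class classifier cannot provide.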

2606.03486 2026-06-03 cs.CR cs.AI

NeuroArmor: Safe-Variant-Guided Representation Consistency for Selective Re-Anchoring in Jailbreak Defense


Zhongyang Lin, Ziran Zhao, Feifei Zhai, Pengyuan Liu

Affiliations * University of Science and Technology of China

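AI summary Proposes NeuroArmor, a white-box runtime defense that builds safe variants of each prompt as a local safety reference, checks consistency against that reference in hidden-state space, and routes anomalies, sharply lowering the malicious attack success rate while also reducing the benign false positive rate.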

Comments 16 pages, 4 figures, 17 tables. Submitted to ACL ARR

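Details
AI-translated abstract

Large language models remain vulnerable to jailbreak attacks that hide harmful intent behind seemingly ordinary requests, such as role-play, translation, encoding, adversarial suffixes, and multi-turn buildup. Existing defenses still struggle to handle these attacks without over-blocking benign but sensitive requests, partly because they usually apply the same action to every prompt and so cannot balance safety and helpfulness. We propose NeuroArmor, a white-box runtime defense that uses prompt-specific safe variants as a local safety reference for deciding when to intervene and, once triggered, as the safe target of the intervention. For each prompt, NeuroArmor constructs K safe variants, compares the prompt state with this local safety reference in hidden-state space, and routes anomalies to a refusal branch for malicious prompts or a helpful recovery branch for borderline benign prompts. On Llama-3-8B-Instruct, NeuroArmor reduces the malicious attack success rate (ASR) from 41.56% to 1.57% while lowering the benign false positive rate (FPR) on the shared benign pool from 30.26% to 22.05%; matched baselines remain clearly weaker on this trade-off. External-judge and manual behavioral evaluations further show that the remaining non-blocked outputs are much less likely to cause operational harm. Overall, NeuroArmor offers a more effective runtime strategy for jailbreak defense by combining prompt-specific consistency checking, routing, and selective intervention.

English abstract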

Large language models remain vulnerable to jailbreak attacks that hide harmful intent behind seemingly ordinary requests such as role-play, translation, encoding, adversarial suffixes, and multi-turn buildup. Existing defenses still struggle to handle these attacks without over-blocking benign but sensitive requests, partly because they often apply the same action to every prompt and therefore fail to balance safety and helpfulness. We propose NeuroArmor, a white-box runtime defense that uses prompt-specific safe variants as a local safety reference for deciding when intervention is needed and, once triggered, as safe targets for intervention. For each prompt, NeuroArmor builds K safe variants, compares the prompt state against this local safe reference in hidden-state space, and routes anomalies either to a refusal branch for malicious prompts or to a helpful recovery branch for borderline benign prompts. On Llama-3-8B-Instruct, NeuroArmor reduces malicious attack success rate (ASR) from 41.56% to 1.57% while lowering benign false positive rate (FPR) on the shared benign pool from 30.26% to 22.05%; matched baselines remain substantially weaker on this trade-off. External-judge and manual behavioral evaluations further show that the remaining non-blocked outputs are much less likely to be operationally harmful. Overall, NeuroArmor provides a more effective runtime strategy for jailbreak defense by combining prompt-specific consistency checking, routing, and selective intervention.
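
The idea of comparing a prompt's hidden state against a local reference built from K safe variants can be sketched as follows. The abstract does not specify the distance measure or the decision rule, so everything here is an assumption made for illustration: cosine distance to the variants' centroid, a hypothetical `margin` multiplier, and short toy vectors standing in for real hidden states. Routing to the refusal or recovery branch is not modelled.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def centroid(vectors: List[Sequence[float]]) -> List[float]:
    """Component-wise mean of the variant states."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def is_anomalous(prompt_state: Sequence[float],
                 variant_states: List[Sequence[float]],
                 margin: float = 1.5) -> bool:
    """Flag the prompt if it lies farther from the local reference
    than the safe variants themselves do (scaled by a margin)."""
    reference = centroid(variant_states)
    spread = max(cosine_distance(v, reference) for v in variant_states)
    # Floor the spread so identical variants do not yield a zero threshold.
    return cosine_distance(prompt_state, reference) > margin * max(spread, 1e-6)


# Toy stand-ins for the hidden states of K = 3 safe variants.
variants = [[1.0, 0.0, 0.1], [1.0, 0.1, 0.0], [0.9, 0.0, 0.0]]

consistent = is_anomalous([1.0, 0.05, 0.05], variants)
divergent = is_anomalous([0.0, 1.0, 0.0], variants)
print(consistent, divergent)
```

Because the reference is rebuilt per prompt, the threshold adapts to how tightly that prompt's own safe variants cluster, rather than relying on one global cutoff.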

2606.03453 2026-06-03 cs.CR cs.AI cs.MA

FORGE: Multi-Agent Graduated Exploitation and Detection Engineering


Farooq Shaikh

Affiliations * Dynatrace

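AI summary Proposes FORGE, a multi-agent system that uses graduated exploitation depth to bridge three siloed areas (exploit generation, vulnerability prioritization, and detection rule engineering), achieving 67.8% end-to-end L1+ exploitation on 603 CVEs and generating Sigma and Snort detection rules, with 93.4% of the Snort rules producing zero false positives on a synthetic benign corpus.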

Comments 18 pages, 4 figures, 3 tables. Accepted at the AgentCy Workshop at the 21st International Conference on Availability, Reliability and Security (ARES 2026). Keywords: Vulnerability assessment, Multi-agent systems, Exploit generation, Detection engineering, Risk prioritization

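Details
AI-translated abstract

The volume of vulnerability disclosures now far exceeds what organizations can assess, yet three adjacent research communities (proof-of-concept generation, vulnerability prioritization, and detection rule engineering) work largely in isolation. Existing automated exploit generation systems report binary pass/fail outcomes, discard partial progress, and produce no signal for the other two communities. This paper presents FORGE, a multi-agent system that bridges these three silos through graduated exploitation depth. Five specialized agents (Intel, Generator, Planner, Exploit, and Detector) run in a fixed pipeline that (1) generates targeted vulnerable applications from CVE metadata, (2) performs coached, multi-turn exploitation assessed by an LLM-primary oracle on a four-level taxonomy (L0: no evidence through L3: full compromise), and (3) produces Sigma and Snort detection rules grounded in OpenTelemetry exploitation traces. Graduated depth is the bridging mechanism: deeper exploitation gives detection engineering richer behavioral traces, while depth data across scoring bands gives prioritization validation its ground truth. A tiered knowledge architecture accumulates intelligence across assessments and passes build and exploitation experience on to subsequent CVEs. Evaluated on 603 CVEs from the CVE-GENIE dataset, spanning eight languages and 187 CWE types, FORGE achieves 67.8% end-to-end L1+ exploitation at USD 1.50 per CVE. Exploitation rates stay near 68% regardless of EPSS or CVSS band, indicating that pattern-level reachability is orthogonal to metadata-based prioritization. Detection rules derived from L2+ exploitation achieve significantly higher span-normalized grounding than L1-derived rules (p=0.035), and 93.4% of the generated Snort rules produce zero false positives against a synthetic benign corpus.

English abstract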

Vulnerability disclosure volumes now far exceed organizational assessment capacity, yet three adjacent research communities (proof-of-concept generation, vulnerability prioritization, and detection rule engineering) operate largely in isolation. Existing automated exploit generation systems report binary pass/fail outcomes, discarding partial progress and producing no signal for the other two communities. This paper presents FORGE, a multi-agent system that bridges these three silos through graduated exploitation depth. Five specialized agents (Intel, Generator, Planner, Exploit, and Detector) execute in a fixed pipeline that (1) generates targeted vulnerable applications from CVE metadata, (2) conducts coached, multi-turn exploitation assessed by an LLM-primary oracle on a four-level taxonomy (L0: no evidence through L3: full compromise), and (3) produces Sigma and Snort detection rules grounded in OpenTelemetry exploitation traces. Graduated depth is the bridging mechanism: deeper exploitation yields richer behavioral traces for detection engineering, while depth data across scoring bands provides ground truth for prioritization validation. A tiered knowledge architecture accumulates intelligence across assessments, transferring build and exploitation experience to subsequent CVEs. Evaluation on 603 CVEs from the CVE-GENIE dataset achieves 67.8% end-to-end L1+ exploitation at USD 1.50 per CVE across eight languages and 187 CWE types. Exploitation rates remain near 68% regardless of EPSS or CVSS band, indicating that pattern-level reachability is orthogonal to metadata-based prioritization. Detection rules from L2+ exploitation achieve significantly higher span-normalized grounding than L1-derived rules (p=0.035), and 93.4% of generated Snort rules produce zero false positives against a synthetic benign corpus.

2606.03430 2026-06-03 cs.CR cs.AI

FlowGuard: Flow Matching for Identity-Independent Detection of Data-Free Model Stealing Attacks on Energy System Intrusion Detection Systems


Maxime Schwarzer, Laurin Holz, Tobias Huerten, Johannes Loevenich, Thies Moehlenhof, Roberto Rigolin F. Lopes, Veit Hagenmeyer

Affiliations * CortAIx Labs, Thales Deutschland; Karlsruhe Institute of Technology

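AI summary Proposes FlowGuard, an identity-independent defence based on flow matching that protects energy-system intrusion detection systems against data-free model stealing attacks by detecting whether queries are out-of-distribution (OOD), maintaining a stable detection rate in both single-client and distributed Sybil settings.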

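Details
AI-translated abstract

AI-based intrusion detection systems (IDS) deployed in energy infrastructure are vulnerable to model stealing attacks, which let adversaries craft evasive traffic offline. Current defences against model extraction rely either on identity-bound query monitoring, which is ineffective against distributed (Sybil) attackers, or on prediction poisoning through soft-label perturbation, which does not apply to hard-label IDS deployments. We therefore propose FlowGuard, an identity-independent defence based on flow matching that classifies incoming queries as out-of-distribution (OOD) before the IDS processes them. The approach exploits the fact that queries synthesized for data-free model stealing attacks occupy a lower-dimensional manifold than real network traffic, which leads to markedly lower log-likelihoods under a continuous normalizing flow trained on legitimate data. We evaluate the method with MAZE and DisGUIDE attacks in single-client and distributed (100-client Sybil) settings and compare it with PRADA and FDINet. When the distribution changes, PRADA's detection rate drops to 0%, whereas our defence maintains a stable detection rate in both settings without relying on identity information. We discuss the scope and limitations of the approach and outline potential applications to data-dependent attacks.

English abstract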

Artificial Intelligence (AI)-based Intrusion Detection Systems (IDS) deployed in energy infrastructure are vulnerable to model theft attacks, which allow adversaries to create evasive traffic offline. Current defences against model extraction rely either on identity-bound query monitoring, which is ineffective against distributed attackers (Sybil), or on prediction poisoning through soft-label perturbation, which is inapplicable to hard-label IDS deployments. Therefore, we propose FlowGuard, an identity-independent defence based on flow matching that classifies incoming queries as out-of-distribution (OOD) prior to IDS processing. This approach exploits the fact that queries generated synthetically for data-free model stealing attacks occupy a lower-dimensional manifold than real network traffic. This results in measurably lower log-likelihoods when using a Continuous Normalizing Flow that has been trained on legitimate data. We evaluate our method against PRADA and FDINet using MAZE and DisGUIDE attacks in single-client and distributed (100-client Sybil) settings. While PRADA's detection rate dropped to 0% when the distribution changed, our defence maintained a stable detection rate across both settings without relying on identity information. We discuss the scope and limitations of the approach, and outline potential applications to data-dependent attacks.
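
The detection principle in the abstract above, flagging queries whose log-likelihood under a density model of legitimate traffic is unusually low, can be sketched with a deliberately simple stand-in. The sketch below swaps the Continuous Normalizing Flow for a diagonal Gaussian so that it runs with the standard library alone; the function names, the 5% calibration rate, and the synthetic two-feature data are all assumptions, not details from the paper.

```python
import math
import random
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Model = Tuple[List[float], List[float]]  # per-feature means and std devs


def fit_diag_gaussian(data: List[Sequence[float]]) -> Model:
    """Stand-in density model: independent Gaussian per feature."""
    columns = list(zip(*data))
    mus = [mean(c) for c in columns]
    sigmas = [max(pstdev(c), 1e-6) for c in columns]  # avoid zero variance
    return mus, sigmas


def log_likelihood(x: Sequence[float], model: Model) -> float:
    mus, sigmas = model
    return sum(
        -0.5 * math.log(2 * math.pi * s * s) - (v - m) ** 2 / (2 * s * s)
        for v, m, s in zip(x, mus, sigmas)
    )


def calibrate_threshold(data: List[Sequence[float]], model: Model,
                        false_positive_rate: float = 0.05) -> float:
    """Pick the log-likelihood below which roughly the given fraction
    of legitimate samples falls."""
    scores = sorted(log_likelihood(x, model) for x in data)
    return scores[int(false_positive_rate * len(scores))]


def is_ood(x: Sequence[float], model: Model, threshold: float) -> bool:
    """A query is flagged when it is less likely than the threshold."""
    return log_likelihood(x, model) < threshold


# Synthetic "legitimate traffic" with two features.
rng = random.Random(0)
legit = [[rng.gauss(0.0, 1.0), rng.gauss(5.0, 2.0)] for _ in range(500)]

model = fit_diag_gaussian(legit)
threshold = calibrate_threshold(legit, model)

typical_flagged = is_ood([0.0, 5.0], model, threshold)
outlier_flagged = is_ood([10.0, -20.0], model, threshold)
print(typical_flagged, outlier_flagged)
```

The decision uses only the query itself, never the client that sent it, which is why the same rule applies unchanged whether queries arrive from one client or are spread across many.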