Bandit Convex Optimization with Gradient Prediction Adaptivity
Shuche Wang, Adarsh Barik, Vincent Y. F. Tan
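AI summary: This paper studies whether optimistic gradient predictions can improve worst-case regret guarantees in a prediction-adaptive manner. It proposes Two-Point Variance-Reduced Optimistic Gradient Descent (TP-VR-OPT) for the two-point feedback setting. The variance of its gradient estimator scales with the prediction error, which yields an $O(\sqrt{d\,\mathbb{E}[S_T]})$ regret bound. An information-theoretic lower bound shows that the algorithm is optimal for prediction-adaptive regret up to a factor of $\sqrt{d}$.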
Details
Bandit convex optimization (BCO) is a fundamental online learning framework with partial feedback, where the learner observes only the loss incurred at the chosen decision point in each round. In this work, we investigate whether optimistic gradient predictions can improve worst-case regret guarantees in a prediction-adaptive manner. Specifically, given gradient predictions $m_t$, we seek regret bounds that scale with the cumulative prediction error $S_T=\sum_{t=1}^T \|\nabla f_t(x_t)-m_t\|^2$. We first establish a negative result: under the single-point feedback protocol, an unavoidable $\Omega(\sqrt{T})$ regret lower bound persists even when $S_T=o(T)$, showing that the variance of gradient estimation fundamentally obscures the benefit of accurate predictions. To overcome this barrier, we propose Two-Point Variance-Reduced Optimistic Gradient Descent (TP-VR-OPT) for the two-point feedback setting. The key idea is a novel variance-reduced gradient estimator whose variance scales with the prediction error rather than the gradient norm. This yields a regret bound of $O\big(\sqrt{d\,\mathbb{E}[S_T]}\big)$, where $d$ is the decision dimension. Complementing this result, we establish an information-theoretic lower bound that scales as $\Omega\big(\sqrt{\mathbb{E}[S_T]}\big)$, providing a fundamental characterization of the best achievable prediction-adaptive regret and showing that TP-VR-OPT is optimal up to a factor of $\sqrt{d}$. We further develop adaptive variants that eliminate the need for prior knowledge of $\mathbb{E}[S_T]$ or the horizon $T$, and extend our framework to non-stationary environments, establishing dynamic regret guarantees that adapt simultaneously to the cumulative prediction error and the comparator path length.
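The abstract does not give the estimator's formula, so the following is only a minimal sketch of one way a two-point estimator can have variance tied to the prediction error. It assumes a control-variate form: the classical two-point estimate $\frac{d}{2\delta}\big(f(x+\delta u)-f(x-\delta u)\big)u$ with $u$ uniform on the unit sphere, with the change predicted by $m$ subtracted inside and $m$ added back outside. This form, the function names, the smoothing radius `delta`, and the linear toy loss are all assumptions for illustration and may differ from the construction used in TP-VR-OPT.

```python
import numpy as np


def sample_unit_sphere(rng, d):
    """Draw a direction uniformly from the unit sphere in R^d."""
    u = rng.standard_normal(d)
    return u / np.linalg.norm(u)


def two_point_estimate(f, x, u, delta):
    """Classical two-point estimate: (d / (2*delta)) * (f(x+delta*u) - f(x-delta*u)) * u."""
    d = x.shape[0]
    return d / (2.0 * delta) * (f(x + delta * u) - f(x - delta * u)) * u


def variance_reduced_estimate(f, x, u, delta, m):
    """Assumed control-variate form: remove the change predicted by m, then add m back.

    Since E[d * <m, u> * u] = m for u uniform on the sphere, the correction
    leaves the expectation of the classical estimate unchanged.
    """
    d = x.shape[0]
    diff = f(x + delta * u) - f(x - delta * u)
    residual = diff - 2.0 * delta * np.dot(m, u)  # part of the change not explained by m
    return m + d / (2.0 * delta) * residual * u


def mean_squared_error(estimator, g, rng, d, n_samples=2000):
    """Monte Carlo estimate of E||estimate - g||^2 over random directions."""
    errs = [
        np.sum((estimator(sample_unit_sphere(rng, d)) - g) ** 2)
        for _ in range(n_samples)
    ]
    return float(np.mean(errs))


rng = np.random.default_rng(0)
d, delta = 10, 0.1
g = np.ones(d)                              # true gradient of the toy loss
f = lambda x: float(np.dot(g, x))           # hypothetical linear loss f(x) = <g, x>
x = np.zeros(d)
m = g + 0.1 * sample_unit_sphere(rng, d)    # prediction with error norm 0.1

mse_plain = mean_squared_error(lambda u: two_point_estimate(f, x, u, delta), g, rng, d)
mse_vr = mean_squared_error(lambda u: variance_reduced_estimate(f, x, u, delta, m), g, rng, d)
print(f"classical two-point MSE: {mse_plain:.3f}")
print(f"variance-reduced MSE:    {mse_vr:.5f}")
```

For this linear toy loss the two errors can be computed in closed form: the classical estimate has expected squared error $(d-1)\|g\|^2$, while the assumed variance-reduced form has $(d-1)\|g-m\|^2$, so the second shrinks to zero as the prediction becomes exact. The factor of $d$ that remains is consistent with the dimension dependence in the stated $O\big(\sqrt{d\,\mathbb{E}[S_T]}\big)$ bound, though this sketch makes no claim about the paper's actual analysis.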