Autonomous Driving - arXivDaily Topic

2604.25848 2026-06-18 cs.AI Version update 75%

A Distributionally Robust Reinforcement Learning Framework for Constrained Urban EV Dispatch

An Nguyen, Hoang Nguyen, Phuong Le, Hung Pham, Cuong Do, Laurent El Ghaoui

Affiliations: College of Engineering and Computer Science, VinUniversity, Hanoi, Vietnam; Center for Environmental Intelligence, VinUniversity, Hanoi, Vietnam

Topic match: planning and control. Urban EV dispatch, involving charging and route planning.

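AI summary: To handle charging-station and feeder capacity constraints and uncertain demand in urban EV dispatch, the paper proposes a distributionally robust soft actor-critic method built on a semi-Markov decision process, with a graph convolutional encoder and a rolling mixed-integer linear program that enforces feasibility. In simulations based on NYC taxi data, it achieves the highest net profit with zero violations.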

Abstract

We study city-scale control of electric-vehicle (EV) ride-hailing fleets where dispatch, repositioning, and charging decisions must respect charger and feeder limits under uncertain, spatially correlated demand and travel times. We formulate the problem as a hex-grid semi-Markov decision process (semi-MDP) with variable action durations and mixed actions: discrete actions for serving, repositioning, and charging, together with continuous charging power. To guarantee physical feasibility during both training and deployment, the policy learns over high-level intentions produced by a masked, temperature-annealed actor. These intentions are projected at every decision step through a time-limited rolling mixed-integer linear program (MILP) that strictly enforces state-of-charge, port, and feeder constraints. To mitigate distributional shifts, we optimize a Soft Actor-Critic (SAC) agent against a Wasserstein-1 ambiguity set with a graph-aligned Mahalanobis ground metric that captures spatial correlations. The robust backup uses the Kantorovich-Rubinstein dual, a projected subgradient inner loop, and a primal-dual risk-budget update. Our architecture combines a two-layer Graph Convolutional Network (GCN) encoder, twin critics, and a value network that drives the adversary. Experiments on a large-scale EV fleet simulator built from NYC taxi data show that PD-RSAC achieves the highest net profit, reaching $1.22M, compared with $0.58M-$0.70M for strong heuristic, single-agent RL, and multi-agent RL baselines, including Greedy, SAC, MAPPO, and MADDPG, while maintaining zero feeder-limit violations.
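
The Wasserstein-1 robust backup mentioned in the abstract can be illustrated on a toy case. The sketch below is not the paper's implementation: it assumes a linear value function V(x) = w . x, for which the Kantorovich-Rubinstein bound is tight and the worst case has a closed form, so no subgradient inner loop is needed. The 3-zone graph, the metric M = I + L (L being the graph Laplacian), the samples, and all function names are hypothetical choices made only for illustration. Under the ground metric d(x, y) = sqrt((x - y)^T M (x - y)), the worst-case mean over a ball of radius rho is the nominal mean minus rho * sqrt(w^T M^{-1} w).

```python
import math

def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(m, v):
    return [dot(row, v) for row in m]

def robust_linear_value(samples, w, metric, radius):
    """Worst-case mean of V(x) = w . x over a Wasserstein-1 ball.

    Ground metric: d(x, y) = sqrt((x - y)^T metric (x - y)).
    Returns (nominal mean, robust mean, worst-case shift of every sample).
    """
    nominal = sum(dot(w, x) for x in samples) / len(samples)
    m_inv_w = solve_linear(metric, w)
    # Lipschitz constant of V under d: the dual Mahalanobis norm of w.
    lipschitz = math.sqrt(dot(w, m_inv_w))
    # The adversary translates all mass by this vector (transport cost = radius).
    shift = [-radius * z / lipschitz for z in m_inv_w]
    return nominal, nominal - radius * lipschitz, shift

# Hypothetical path graph of three zones: 0 - 1 - 2.
laplacian = [[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]]
metric = [[laplacian[i][j] + (1.0 if i == j else 0.0) for j in range(3)]
          for i in range(3)]
samples = [[4.0, 2.0, 1.0], [6.0, 3.0, 2.0], [5.0, 1.0, 3.0]]  # made-up demand
w = [1.0, 0.5, 0.25]  # made-up value weights per zone

nominal, robust, shift = robust_linear_value(samples, w, metric, radius=0.5)
shift_length = math.sqrt(dot(shift, matvec(metric, shift)))
print(nominal, robust, shift_length)
```

Because the Laplacian term makes perturbations that differ sharply between neighboring zones expensive, the adversary in this toy setup prefers spatially smooth shifts, which is one plausible reading of "graph-aligned". For a nonlinear critic the closed form no longer applies, and an iterative inner loop such as the projected subgradient method named in the abstract would be needed.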
