Journal of South China University of Technology (Natural Science Edition) ›› 2025, Vol. 53 ›› Issue (12): 1-. doi: 10.12141/j.issn.1000-565X.240549

• Smart Transportation Systems •

Joint Optimization of Traffic Signal Timing and Vehicle Trajectories Using Hierarchical Soft Actor-Critic Reinforcement Learning

MA Yingying1  LI Teng1  LIANG Yunyi2  TANG Meng1

1. School of Civil Engineering and Transportation, South China University of Technology, Guangzhou 510640, Guangdong, China;

    2. Business School, University of Shanghai for Science and Technology, Shanghai 200093, China

• Online: 2025-12-25  Published: 2025-07-04

Abstract:

This study proposes a joint optimization method for intersection signal timing and vehicle trajectories based on hierarchical Soft Actor-Critic (SAC) reinforcement learning. The model consists of two layers: a signal timing optimization layer and a vehicle trajectory optimization layer. The state space of both layers includes vehicle positions, speeds, and the signal timing status, and the reward function of both is a weighted sum of traffic efficiency, safety, and fuel consumption. The action of the signal timing optimization layer is the duration of the signal phase; the action of the vehicle trajectory optimization layer is vehicle acceleration. Each layer has its own independent value network and policy network. The value network outputs the state-action value for the current state and action, which is used to assess the policy network's performance. The policy network generates the mean and standard deviation of a Gaussian distribution from the current state and samples actions from this parameterized Gaussian distribution. Entropy and temperature coefficients are introduced into the policy network's loss function to automatically adjust the breadth and depth of policy exploration, reducing the sensitivity of training performance to hyperparameter changes. To handle the mismatch between the decision intervals of signal timing optimization and vehicle trajectory optimization, an asynchronous training algorithm for the signal timing layer and the vehicle trajectory optimization layer is designed. Within each layer, the value network and the policy network are trained simultaneously via backpropagation. The model is trained and evaluated in SUMO. Experimental results show that, compared with a mathematical programming method, signal-timing-only optimization, and trajectory-only optimization, the proposed method reduces vehicle fuel consumption by an average of 24.24%, 5.39%, and 22.23%, respectively.
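For readers less familiar with SAC, the sketch below illustrates the actor-critic pair and the automatic entropy-temperature adjustment that the abstract refers to. It is a minimal illustration only, assuming a PyTorch setting; the network sizes, dimensions, and all names are hypothetical and are not taken from the paper.

```python
# Minimal SAC-style actor/critic sketch (assumption: PyTorch; all names,
# sizes, and dimensions are illustrative, not the paper's implementation).
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Value (critic) network: scores a state-action pair with a scalar Q-value."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class GaussianPolicy(nn.Module):
    """Policy (actor) network: maps a state to the mean and log-std of a
    Gaussian, then samples a squashed action via the reparameterization trick."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, state):
        h = self.body(state)
        mean, log_std = self.mean(h), self.log_std(h).clamp(-20, 2)
        dist = torch.distributions.Normal(mean, log_std.exp())
        u = dist.rsample()                      # reparameterized sample
        action = torch.tanh(u)                  # squash into (-1, 1)
        # Log-probability with the tanh change-of-variables correction.
        log_prob = (dist.log_prob(u) - torch.log(1 - action.pow(2) + 1e-6)).sum(-1)
        return action, log_prob

# Automatic entropy-temperature tuning, as in standard SAC: alpha is adjusted
# so the policy's entropy tracks a target (a common heuristic target is -|A|).
state_dim, action_dim = 8, 1                    # illustrative dimensions
policy = GaussianPolicy(state_dim, action_dim)
critic = QNetwork(state_dim, action_dim)
log_alpha = torch.zeros(1, requires_grad=True)
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)
target_entropy = -float(action_dim)

state = torch.randn(32, state_dim)              # dummy batch of states
action, log_prob = policy(state)
q_value = critic(state, action)                 # critic evaluates the actor's action
alpha_loss = -(log_alpha.exp() * (log_prob + target_entropy).detach()).mean()
alpha_opt.zero_grad()
alpha_loss.backward()
alpha_opt.step()
```

Sampling with `rsample` keeps the action differentiable with respect to the policy parameters, which is what allows the policy loss to backpropagate through the sampled action, consistent with the reparameterized Gaussian policy the abstract describes.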
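The asynchronous training cadence of the two layers can likewise be sketched. The toy loop below only mimics the timing structure the abstract describes (one phase-duration decision per signal interval versus one acceleration decision per simulation step); it does not interact with SUMO, and every class name, interval, and bound in it is hypothetical.

```python
# Schematic sketch of the asynchronous two-layer cadence (all names, intervals,
# and action bounds are illustrative; the real method would drive SUMO).
import random

class ToyAgent:
    """Stand-in for a SAC agent with its own replay buffer and networks."""
    def __init__(self, name):
        self.name, self.buffer = name, []
    def act(self, state):
        return random.uniform(0.0, 1.0)         # placeholder policy output in [0, 1]
    def store(self, transition):
        self.buffer.append(transition)
    def update(self):
        pass                                     # a gradient step would go here

signal_agent, vehicle_agent = ToyAgent("signal"), ToyAgent("vehicle")
VEH_DT = 0.1                                     # s: trajectory layer acts every step
t, next_signal_t, horizon = 0.0, 0.0, 60.0

while t < horizon:
    # Signal layer acts only when the current phase ends (the longer interval):
    # its action is the next phase duration, here scaled into [5 s, 30 s].
    if t >= next_signal_t:
        phase_duration = 5.0 + 25.0 * signal_agent.act(state=None)
        signal_agent.store((t, phase_duration))
        next_signal_t = t + phase_duration
    # Vehicle layer acts every simulation step (the shorter interval):
    # its action is an acceleration, here scaled into [-3, +3] m/s^2.
    accel = -3.0 + 6.0 * vehicle_agent.act(state=None)
    vehicle_agent.store((t, accel))
    # Each layer trains its own critic and actor from its own buffer,
    # so updates happen asynchronously at the two timescales.
    signal_agent.update()
    vehicle_agent.update()
    t += VEH_DT
```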

Key words: connected and autonomous vehicles, signal-controlled intersections, joint optimization of signal timing and vehicle trajectories, hierarchical reinforcement learning, soft actor-critic reinforcement learning