Journal of South China University of Technology(Natural Science Edition) ›› 2025, Vol. 53 ›› Issue (12): 1-16.doi: 10.12141/j.issn.1000-565X.240549

• Intelligent Transportation System •

A Method for Joint Optimization of Signal Timing and Vehicle Trajectories at Intersections Based on Hierarchical Soft Actor-Critic Reinforcement Learning

MA Yingying1, LI Teng1, LIANG Yunyi2, TANG Meng1   

  1. School of Civil Engineering and Transportation, South China University of Technology, Guangzhou 510640, Guangdong, China
    2. Business School, University of Shanghai for Science and Technology, Shanghai 200093, China
  • Received:2024-11-18 Online:2025-12-25 Published:2025-07-04
  • Contact: LIANG Yunyi (b. 1991), male, Ph.D., associate professor; research interests: perception, optimization, and control of vehicle-infrastructure cooperative systems, reinforcement learning, and deep learning. E-mail: liangyunyilyy@126.com
  • About author:马莹莹(1983—),女,博士,教授,主要从事智能交通分析与管理、交通组织与设计、交通行为与绿色交通研究。E-mail: mayingying@scut.edu.cn
  • Supported by:
    the National Natural Science Foundation of China(52072129);the Hunan Provincial Natural Science Foundation of China(2023JJ40731)

Abstract:

This study proposes a joint optimization method for intersection signal timing and vehicle trajectories based on the Soft Actor-Critic (SAC) reinforcement learning framework. The model consists of two layers: signal timing optimization and vehicle trajectory optimization. The state space for both layers includes vehicle positions, speeds, and traffic signal status, while the reward function is a weighted sum of traffic efficiency, safety, and fuel consumption. In the signal timing optimization layer, the action is the duration of the signal phase; in the vehicle trajectory optimization layer, the action is the vehicle acceleration. Each optimization layer has its own value network and policy network. The value network outputs the state-action value and assesses the policy network's performance for the current state and action. The policy network generates the mean and standard deviation of a Gaussian distribution from the current state and samples actions from this parameterized distribution. The loss function of the policy network includes entropy and temperature coefficients that automatically adjust the breadth and depth of policy exploration, reducing the model's sensitivity to hyperparameter variations. To address the mismatch between the decision intervals of signal timing optimization and vehicle trajectory optimization, this study designs an asynchronous training algorithm for the two layers; within each layer, the value network and the policy network are trained simultaneously via backpropagation. The model was trained and evaluated in SUMO, and experimental results indicate that the proposed method reduces vehicle fuel consumption by an average of 24.24%, 5.39%, and 22.23% compared with mathematical programming methods, signal-timing-only optimization methods, and trajectory-only optimization methods, respectively.
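The SAC actor described above can be illustrated with a minimal sketch. All names and values here are illustrative assumptions, not the paper's implementation: a policy head outputs a Gaussian mean and log-standard-deviation, an action is drawn via the reparameterization trick and squashed with tanh to a bounded range (e.g. an acceleration limit), and the actor objective trades off the critic's Q-value against the temperature-weighted log-probability (the entropy term).

```python
import numpy as np

def sample_squashed_gaussian(mean, log_std, rng):
    """Reparameterized sample from a tanh-squashed Gaussian policy.

    Returns the bounded action and its log-probability, including the
    tanh change-of-variables correction used in SAC-style actors.
    """
    std = np.exp(log_std)
    eps = rng.standard_normal(mean.shape)
    u = mean + std * eps                  # reparameterization trick
    a = np.tanh(u)                        # squash to (-1, 1)
    # log N(u; mean, std), per action dimension
    log_prob = -0.5 * (((u - mean) / std) ** 2 + 2.0 * log_std + np.log(2.0 * np.pi))
    # tanh correction: subtract log|da/du| = log(1 - tanh(u)^2)
    log_prob -= np.log(1.0 - a ** 2 + 1e-6)
    return a, log_prob.sum()

rng = np.random.default_rng(0)
# hypothetical 2-D action head output (e.g. accelerations for two vehicles)
mean = np.array([0.3, -0.1])
log_std = np.array([-1.0, -0.5])
action, logp = sample_squashed_gaussian(mean, log_std, rng)

# SAC-style actor objective for one sample: maximize Q(s, a) - alpha * log pi(a|s);
# the temperature alpha weights the entropy bonus and is tuned automatically in SAC.
alpha = 0.2
q_value = 1.5                             # placeholder critic output
actor_objective = q_value - alpha * logp
```

In a full implementation the temperature alpha would itself be learned against a target entropy, which is what reduces sensitivity to hand-tuned hyperparameters.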
The method achieves these energy savings without significantly reducing average speed, and it keeps performance deviations within 5% under state observation disturbances, demonstrating good robustness.
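The asynchronous interaction between the two layers can be sketched as follows. This is a toy loop under assumed names and timings, not the paper's algorithm: the slow signal layer chooses a phase duration once per phase, while the fast trajectory layer chooses an acceleration at every simulation step, and each layer collects its own transitions for training.

```python
def run_episode(signal_policy, trajectory_policy, step_len=1.0, horizon=60.0):
    """Toy asynchronous two-layer control loop.

    signal_policy(state) -> phase duration in seconds (slow layer)
    trajectory_policy(state) -> acceleration command (fast layer)
    """
    t = 0.0
    phase_end = 0.0
    signal_transitions = []      # (state, action) pairs for the slow layer
    trajectory_transitions = []  # (state, action) pairs for the fast layer
    state = {"t": 0.0}           # placeholder for positions, speeds, signal status

    while t < horizon:
        if t >= phase_end:                     # slow layer: start a new phase
            duration = signal_policy(state)
            signal_transitions.append((dict(state), duration))
            phase_end = t + duration
        accel = trajectory_policy(state)       # fast layer: acts every step
        trajectory_transitions.append((dict(state), accel))
        t += step_len
        state = {"t": t}
    return signal_transitions, trajectory_transitions

# toy constant policies standing in for the two trained SAC actors
sig, traj = run_episode(lambda s: 10.0, lambda s: 0.5)
```

With a 10 s phase duration and a 1 s step over a 60 s horizon, the slow layer acts 6 times while the fast layer acts 60 times, which is the interval mismatch the asynchronous training scheme accounts for.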

Key words: connected and autonomous vehicle, signal-controlled intersection, joint optimization of signal timing and vehicle trajectory, hierarchical reinforcement learning, soft actor-critic reinforcement learning
