Authors: Avinash Mudireddy, Nathan Shaffer, Joel J. St-Aubin
Affiliation: University of Iowa
Purpose: This work presents preliminary results of training a reinforcement learning (RL) network to perform volumetric modulated arc therapy (VMAT) machine parameter optimization.
Methods: We implemented a policy gradient RL algorithm to predict 56 multileaf collimator (MLC) leaf positions and one monitor unit (MU) value per VMAT control point for 32 prostate cancer patients prescribed 36.25 Gy in five fractions on the Elekta Unity MRI-Linac. The RL network accepts (1) two 3D volumes representing the current dose and the contour of the planning target volume (PTV), and (2) the predicted machine parameters from the previous control point. These inputs are used to predict the parameters for the next control point, which are then used to calculate a reward and update the input state for the next iteration. This process repeats until all control points have been predicted for a batch of patients, at which point training occurs using the loss function of the REINFORCE policy gradient algorithm. Initial network weights for the RL algorithm were set from a supervised learning (SL) version of the network trained on conformal MLC parameters and constant MUs. The RL reward was constructed to promote dose homogeneity of the target relative to the conformal MLC delivery, quantified using two homogeneity indices, H1 = (D2 - D98)/D50 and H2 = D5/D95. The plan with the median reward of the test set was selected and compared to the corresponding SL plan.
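The following is a minimal sketch of the per-arc REINFORCE update described above, not the authors' implementation. The names (ControlPointPolicy, step_state, compute_reward), the number of control points, the flat state encoding of the 3D dose/PTV volumes, and the Gaussian action distribution are illustrative assumptions; the reward in the study is derived from the homogeneity indices H1 and H2.

```python
import torch
import torch.nn as nn

N_LEAVES = 56              # MLC leaf positions predicted per control point (from the abstract)
N_PARAMS = N_LEAVES + 1    # plus one MU value per control point
N_CONTROL_POINTS = 90      # assumed number of VMAT control points per arc
STATE_DIM = 128            # assumed flattened encoding of dose/PTV volumes + previous parameters

class ControlPointPolicy(nn.Module):
    """Gaussian policy: predicts a mean for each machine parameter; a learned
    log standard deviation defines the exploration noise during RL training."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(STATE_DIM, 256), nn.ReLU(),
            nn.Linear(256, N_PARAMS),
        )
        self.log_std = nn.Parameter(torch.zeros(N_PARAMS))

    def forward(self, state):
        mean = self.backbone(state)
        return torch.distributions.Normal(mean, self.log_std.exp())

def rollout_and_update(policy, optimizer, initial_state, step_state, compute_reward):
    """Roll out one full arc (all control points), then apply the REINFORCE update:
    loss = -(sum of log-probabilities of the sampled actions) * reward."""
    state = initial_state
    log_probs, actions = [], []
    for _ in range(N_CONTROL_POINTS):
        dist = policy(state)
        action = dist.sample()                       # MLC positions + MU for this control point
        log_probs.append(dist.log_prob(action).sum())
        actions.append(action)
        state = step_state(state, action)            # hypothetical: update dose/state with new control point
    reward = compute_reward(actions)                 # hypothetical: e.g., based on H1 and H2
    loss = -torch.stack(log_probs).sum() * reward    # REINFORCE policy gradient loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```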
Results: Training was conducted for 500 epochs. After normalizing for target coverage, the SL plan resulted in homogeneity indices of H1 = 0.83 and H2 = 3.72; after RL training, these improved to H1 = 0.61 and H2 = 1.65.
Conclusion: The proposed RL network showed the ability to modulate MLC positions and MU values to improve target homogeneity compared to a conformal arc delivery. This network shows promise for future RL-based VMAT planning.