In deep reinforcement learning, experience replay buffers mitigate the correlations inherent in sequential data and allow past experiences to be reused efficiently. Prioritized Experience Replay (PER) improves on uniform random sampling by selecting transitions according to their temporal difference (TD) error. However, the TD error alone does not capture other indicators of a transition's usefulness, such as the reward it yields or how many times it has already been replayed during training. In this paper, we propose an adaptive prioritization method that incorporates three additional transition-level factors: reward, usage count (counter), and policy probability, collectively referred to as RCP values. These values are normalized and combined with the TD error to compute the probability of sampling each transition from the replay buffer. We evaluate the method on several Atari environments and show that each of the RCP values used individually can improve performance over standard PER. To combine all three RCP components, we examine three aggregation functions: minimum, maximum, and mean. Experimental results show that the best aggregation function depends on the environment; however, the mean generally yields stable improvements across tasks, as it balances the three RCP signals and avoids over-relying on any single factor.
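To make the sampling rule concrete, the sketch below shows one plausible way RCP-augmented priorities could be computed, assuming min-max normalization of each signal, an inverted usage count (so rarely replayed transitions score higher), and a simple additive combination of the aggregated RCP term with the absolute TD error. The weighting coefficient `lam`, the normalization, and the combination rule are illustrative assumptions, not necessarily the exact formulation used in the paper.

```python
import numpy as np

def rcp_priorities(td_errors, rewards, usage_counts, policy_probs,
                   agg="mean", lam=0.5, alpha=0.6, eps=1e-6):
    """Illustrative sketch of RCP-augmented sampling probabilities.

    td_errors    : absolute TD errors of the buffered transitions
    rewards      : rewards stored with the transitions
    usage_counts : how many times each transition has been replayed
    policy_probs : current policy's probability of the stored action
    agg          : aggregation of the three RCP signals ('min', 'max', 'mean')
    lam          : assumed weight of the RCP term relative to the TD error
    """
    def norm(x):
        x = np.asarray(x, dtype=np.float64)
        return (x - x.min()) / (x.max() - x.min() + eps)

    # Normalize each RCP signal to [0, 1]; the usage count is inverted so that
    # less frequently used transitions receive a higher "counter" score
    # (an assumption about the intended direction of this signal).
    r = norm(rewards)
    c = 1.0 - norm(usage_counts)
    p = norm(policy_probs)

    rcp = {"min": np.minimum(np.minimum(r, c), p),
           "max": np.maximum(np.maximum(r, c), p),
           "mean": (r + c + p) / 3.0}[agg]

    # Assumed additive combination of TD error and aggregated RCP signal,
    # followed by the usual PER exponentiation and normalization.
    priorities = (np.abs(td_errors) + lam * rcp + eps) ** alpha
    return priorities / priorities.sum()
```

Sampling would then draw minibatch indices from the buffer according to these probabilities, e.g. `np.random.choice(len(buffer), batch_size, p=probs)`, with the standard PER importance-sampling correction applied to the resulting updates.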
Adibian, M. and Ebadzadeh, M. M. (2025). Double Deep Q Network with Adaptive Prioritized Experience Replay. AUT Journal of Modeling and Simulation, 57(1), 53-62. doi: 10.22060/miscj.2025.23426.5373