Double Deep Q Network with Adaptive Prioritized Experience Replay

Document Type: Research Article

Authors

1 Amirkabir University of Technology

2 Amirkabir University of Technology

Abstract

In deep reinforcement learning, experience replay buffers help break the correlation of sequential data and improve the efficiency of learning from past experiences. Prioritized Experience Replay (PER) enhances this process by selecting transitions based on their temporal difference (TD) error. However, PER does not account for how often a transition has already been used or for aspects of its importance beyond the TD error. To address this, we introduce an adaptive prioritization method that incorporates three additional transition-level factors: reward, usage count (counter), and policy probability, collectively termed RCP values. Each RCP value is normalized and combined with the TD error to determine the probability of selecting a transition from the replay buffer.
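As a rough illustration of this idea, the sketch below blends a single normalized RCP component with the absolute TD error to obtain sampling probabilities over a small buffer. The min-max normalization, the blending weight `lam`, and the exponent `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def normalize(values, eps=1e-6):
    """Min-max normalize a batch of RCP values to [0, 1] (assumed scheme)."""
    values = np.asarray(values, dtype=np.float64)
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo + eps)

def rcp_priority(td_errors, rcp_values, alpha=0.6, lam=0.5):
    """Blend |TD error| with a normalized RCP component (reward, usage
    counter, or policy probability) into a sampling priority.
    alpha and lam are placeholder hyperparameters for illustration."""
    td_term = np.abs(td_errors)
    blended = (1.0 - lam) * td_term + lam * rcp_values
    return blended ** alpha  # proportional prioritization, as in standard PER

# Example: selection probabilities over a 4-transition buffer,
# using the reward component ("R") as the RCP value.
td_errors = np.array([0.8, 0.1, 0.5, 2.0])
rewards_n = normalize([1.0, 0.0, 0.0, 5.0])
priorities = rcp_priority(td_errors, rewards_n)
probs = priorities / priorities.sum()
```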



We test our approach on several Atari game environments and find that using any of the RCP values individually improves performance over standard PER. To leverage all three RCP components together, we evaluate three aggregation strategies: taking the minimum, maximum, or mean of the RCP values. Results show that although the best aggregation method varies by environment, the mean function consistently delivers stable performance gains, likely because it balances the influence of all three factors and prevents over-reliance on any single one. Overall, our findings indicate that incorporating RCP values is a straightforward and effective improvement over conventional prioritization in experience replay.
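The aggregation step can be sketched as follows, again as an illustration rather than the paper's exact procedure; the variable names and the assumption that all three RCP components are already normalized to [0, 1] are ours.

```python
import numpy as np

def aggregate_rcp(reward_n, counter_n, policy_n, mode="mean"):
    """Combine the three normalized RCP components into one value per
    transition using one of the strategies compared in the abstract."""
    stacked = np.stack([reward_n, counter_n, policy_n], axis=0)
    if mode == "min":
        return stacked.min(axis=0)
    if mode == "max":
        return stacked.max(axis=0)
    if mode == "mean":
        return stacked.mean(axis=0)  # balances the three factors
    raise ValueError(f"unknown aggregation mode: {mode}")

# Example over a 3-transition buffer (values assumed normalized to [0, 1]).
reward_n  = np.array([0.2, 1.0, 0.5])
counter_n = np.array([0.9, 0.1, 0.4])
policy_n  = np.array([0.3, 0.6, 0.8])
rcp_mean = aggregate_rcp(reward_n, counter_n, policy_n, mode="mean")
```

The aggregated value would then replace the single RCP component in the priority computation sketched above.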
