Natural Language Policy Optimization
Making decisions in a large action space is a central challenge for natural language generation using RL because that action space consists of all the tokens in the vocabulary. For example, in LM-based generation, the discrete action space is on the order of 50k tokens --- orders of magnitude larger than what most discrete action space RL algorithms are designed for. We introduce NLPO (Natural Language Policy Optimization) to address this issue.
NLPO is a parameterized-masked extension of PPO that learns to mask out less relevant tokens in-context as it trains. NLPO assumes access to a masking policy in addition to the learner policy; the masking policy is initialized as a copy of the learner policy. During the PPO rollout phase, the masking policy is used to mask out actions that fall outside its top-p token set, restricting the learner to sampling from the smallest set of tokens whose cumulative probability is greater than the probability parameter p. The masking policy is periodically updated by copying the current learner policy parameters. The pseudocode for NLPO is given in Algorithm 1, with the differences from PPO highlighted in green.
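To make the masking step concrete, the following is a minimal sketch of top-p action masking during rollouts, assuming PyTorch and per-step logits over the vocabulary from both the learner and masking policies; the function names (top_p_mask, masked_rollout_step) and hyperparameter values are illustrative, not taken from the paper's implementation.

```python
# A minimal sketch of NLPO-style top-p action masking during rollouts.
# Assumes PyTorch; names and defaults here are illustrative only.
import torch
import torch.nn.functional as F


def top_p_mask(mask_logits: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    """Boolean mask keeping the smallest set of tokens whose cumulative
    probability under the masking policy exceeds p (per batch element)."""
    probs = F.softmax(mask_logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep a token if the cumulative mass *before* it is still below p,
    # so the kept set is the minimal one whose total mass exceeds p.
    keep_sorted = ((cumulative - sorted_probs) < p).float()
    keep = torch.zeros_like(probs).scatter(-1, sorted_idx, keep_sorted)
    return keep.bool()


def masked_rollout_step(learner_logits: torch.Tensor,
                        mask_logits: torch.Tensor,
                        p: float = 0.9) -> torch.Tensor:
    """Sample the next token from the learner policy, restricted to the
    masking policy's top-p set; masked actions receive -inf logits."""
    keep = top_p_mask(mask_logits, p)
    restricted = learner_logits.masked_fill(~keep, float("-inf"))
    return torch.multinomial(F.softmax(restricted, dim=-1), num_samples=1)


# Periodically (every few PPO iterations) the masking policy is refreshed
# with a copy of the current learner parameters, e.g.:
#   masking_model.load_state_dict(learner_model.state_dict())
```

The PPO policy and value updates themselves are unchanged; only the rollout distribution is restricted to the masking policy's top-p set.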