Is Reinforcement Learning (Not) for Natural Language Processing?: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization


Rajkumar Ramamurthy*, Prithviraj Ammanabrolu*, Kianté Brantley, Jack Hessel
Rafet Sifa, Christian Bauckhage, Hannaneh Hajishirzi, Yejin Choi


We tackle the problem of aligning large language models with measures of human preference. We note that many Natural Language Processing (NLP) tasks can be framed as sequence learning problems, with a multitude of non-differentiable automated metrics designed to grade performance by mimicking human judgements of the task.
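As a concrete illustration (a minimal sketch, not taken from the paper or the RL4LMs codebase), the snippet below uses NLTK's sentence-level BLEU as such a metric: it maps a generated string to a scalar score with no gradient to backpropagate, which is exactly the kind of feedback RL can consume as a reward. The example strings are illustrative assumptions.

```python
# Sketch: a non-differentiable automated metric used as scalar feedback.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def metric_reward(generated: str, reference: str) -> float:
    """Score a generated sentence against a reference with sentence-level BLEU.

    BLEU is computed from discrete token overlaps, so it yields only a scalar
    score with respect to the model -- there is nothing to differentiate.
    """
    hypothesis = generated.split()
    references = [reference.split()]
    return sentence_bleu(
        references,
        hypothesis,
        smoothing_function=SmoothingFunction().method1,  # avoid zero scores on short outputs
    )

reward = metric_reward("the cat sat on the mat", "a cat sat on the mat")
print(f"scalar reward from BLEU: {reward:.3f}")
```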

Reinforcement Learning (RL) is a powerful paradigm for learning to solve sequential tasks from exactly this kind of scalar feedback, yet the use of RL for NLP tasks is severely hindered by a series of important but often undocumented pitfalls: training instability of RL algorithms in combinatorially large language action spaces; high variance in the automated NLP metrics used as rewards; and reward hacking, where a model achieves state-of-the-art scores on a metric while the underlying spirit of the task remains unsolved.

The RL4LMs project attempts to alleviate these pitfalls by:

(1) providing guidelines for when RL should be used and which current NLP tasks and metrics are best suited to it, in the form of a new, ever-evolving benchmark dubbed GRUE (General Reinforced-language Understanding Evaluation);

(2) demonstrating how to use RL for language via a novel RL algorithm, NLPO (Natural Language Policy Optimization), designed to be more stable and less susceptible to both large language action spaces and high variance in rewards;

(3) providing a practical, day-to-day guide, including high-quality implementations and hyperparameters of NLPO along with multiple existing online RL algorithms such as PPO and A2C, for training any causal or seq2seq transformer from the popular HuggingFace library.
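To make the intended workflow concrete, here is a heavily simplified sketch of the kind of loop such a library automates: fine-tuning a HuggingFace causal transformer against a scalar metric reward with an on-policy policy-gradient update. This is not the RL4LMs API; a bare REINFORCE step stands in for PPO/A2C/NLPO, and the model name, prompt, and reference string are illustrative assumptions.

```python
# Simplified sketch: on-policy policy-gradient fine-tuning of a causal LM
# against a non-differentiable metric reward (REINFORCE stand-in).
import torch
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from transformers import AutoModelForCausalLM, AutoTokenizer

def metric_reward(generated: str, reference: str) -> float:
    # Non-differentiable automated metric used as the scalar reward (see the BLEU sketch above).
    return sentence_bleu([reference.split()], generated.split(),
                         smoothing_function=SmoothingFunction().method1)

model_name = "gpt2"  # any causal LM on the Hub; chosen here only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

prompt = "Summarize: The quick brown fox jumps over the lazy dog."
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids

# 1) Sample an action sequence (the generated continuation) from the current policy.
generated = model.generate(
    prompt_ids, do_sample=True, max_new_tokens=20,
    pad_token_id=tokenizer.eos_token_id,
)
continuation_ids = generated[:, prompt_ids.shape[1]:]

# 2) Score it with the automated metric to obtain a scalar reward.
text = tokenizer.decode(continuation_ids[0], skip_special_tokens=True)
reward = metric_reward(text, reference="A fox jumps over a dog.")

# 3) Policy-gradient step: scale the log-likelihood of the sampled tokens by the reward.
logits = model(generated).logits[:, :-1, :]                      # predictions for tokens 1..T
log_probs = torch.log_softmax(logits, dim=-1)
token_log_probs = log_probs.gather(-1, generated[:, 1:].unsqueeze(-1)).squeeze(-1)
gen_log_prob = token_log_probs[:, prompt_ids.shape[1] - 1:].sum()  # only the generated span
loss = -reward * gen_log_prob

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In practice, this bare update is wrapped with batching, clipped PPO-style objectives (or NLPO's masked variant), and a KL penalty toward the initial supervised model to stabilize training and curb reward hacking.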