RL4LMs

GRUE (General Reinforced-Language Understanding Evaluation)

GRUE is a benchmark comprising 6 generative NLP tasks. We use the RL4LMs framework to study these 6 tasks with RL. We train and evaluate each RL algorithm in the study on two categories of metrics: task-specific metrics and naturalness metrics. Task-specific metrics capture performance on a particular generation task, whereas naturalness metrics capture non-task-specific qualities such as fluency and readability. We compare 5 fine-tuning algorithms: Supervised, PPO, NLPO, Supervised+PPO, and Supervised+NLPO. The latter two first perform supervised fine-tuning and then fine-tune further with the respective RL algorithm.

We draw several key insights from the benchmark about using RL to train LMs for generation:

(1) Supervised+RL works better than using RL alone or supervised learning alone. The initial supervised fine-tuning is important because the resulting model is also used for the KL constraint, which in turn is important because it prevents reward hacking.

(2) With a fixed data collection budget, using the data to learn a reward function for RL performs better than using the same data for supervised learning. This implies that reward modeling is more data-efficient than supervised learning.

(3) RL for LM-based training is sensitive to the discount factor gamma, to the use of dropout as regularization, and to the sampling method used during exploration and inference.
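To make insights (1) and (3) more concrete, here is a minimal sketch of how a KL penalty and a discount factor can enter the per-token training signal. It is an illustration under stated assumptions, not the RL4LMs implementation: the function names, the coefficient `beta`, the per-token log-probability inputs, and the choice of a reference policy obtained from supervised fine-tuning are all hypothetical.

```python
def kl_penalized_rewards(task_rewards, logprobs_policy, logprobs_ref, beta=0.1):
    """Subtract a per-token KL estimate from the task reward.

    Assumption: logprobs_ref comes from a frozen reference model (for
    example, the supervised fine-tuned model). beta is a hypothetical
    penalty coefficient, not a value taken from the benchmark.
    """
    if not (len(task_rewards) == len(logprobs_policy) == len(logprobs_ref)):
        raise ValueError("all inputs must have one entry per generated token")
    shaped = []
    for reward, lp_policy, lp_ref in zip(task_rewards, logprobs_policy, logprobs_ref):
        # Single-sample estimate of the KL term for the sampled token.
        kl_estimate = lp_policy - lp_ref
        shaped.append(reward - beta * kl_estimate)
    return shaped


def discounted_returns(rewards, gamma):
    """Compute the return G_t = r_t + gamma * G_{t+1} for every time step."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Hypothetical 3-token generation with a task reward only at the end.
    shaped = kl_penalized_rewards(
        task_rewards=[0.0, 0.0, 1.0],
        logprobs_policy=[-1.0, -0.5, -2.0],
        logprobs_ref=[-1.0, -1.5, -2.0],
        beta=0.1,
    )
    print(shaped)
    print(discounted_returns(shaped, gamma=0.5))
    print(discounted_returns(shaped, gamma=1.0))
```

When the task reward arrives only at the end of a generation, as assumed in the usage example, a smaller gamma shrinks the signal that reaches early tokens, which is one plausible reason the choice of gamma matters.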

Below we provide generation outputs for all 6 tasks.

GRUE - Generative Commonsense (CommonGen)

Task Description: Generate a sentence that uses the given concepts, applying commonsense reasoning.
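As a rough illustration of what "using the concepts" means, the sketch below computes the fraction of prompt concepts that appear in a generated sentence. It is a simplified stand-in, not necessarily one of the metrics used in GRUE: the function name `concept_coverage` is hypothetical, and prefix matching on lowercase tokens is a crude substitute for proper lemmatization (it would, for instance, wrongly match "hat" against "hatch").

```python
import re


def concept_coverage(concepts, sentence):
    """Return the fraction of concepts that appear in the sentence.

    Assumption: a concept counts as covered when some token starts with
    it, so "cutting" covers "cut". This is a crude proxy for lemmatization.
    """
    if not concepts:
        return 0.0
    tokens = re.findall(r"[a-z]+", sentence.lower())
    covered = [
        concept
        for concept in concepts
        if any(token.startswith(concept.lower()) for token in tokens)
    ]
    return len(covered) / len(concepts)


if __name__ == "__main__":
    # Concepts and outputs taken from the first sample in this section.
    concepts = ["apron", "cut", "hat", "kitchen", "sausage"]
    print(concept_coverage(concepts, "A man is cutting sausage on an apron in a kitchen."))
    print(concept_coverage(concepts, "A man cuts sausage in a hat on an apron in the kitchen."))
```

Coverage alone says nothing about fluency: a string that only repeats the concept words can also score highly, which is why the benchmark pairs task-specific metrics with naturalness metrics.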

 Prompt: generate a sentence with: apron cut hat kitchen sausage
 


 Zero Shot
generate a sentence with: apron cut hat kitchen sausage 


 Supervised
amily tree. The widower, now widowed, returns, the daughter, daughter is not pleased when his wife is mistaken  for one of her sons, and the patriarch, then a wealthy man, falls victim to a jealous husband. The 


 PPO
sausage in the kitchen on an apron. 


 NLPO
sausage cut hat cut hat cut hat cut apron cut hat 


 Supervised+PPO
A man is cutting sausage on an apron in a kitchen. 


 Supervised+NLPO
A man cuts sausage in a hat on an apron in the kitchen. 

 Prompt: generate a sentence with: backhand competition player reach shuttlecock
 


 Zero Shot
backhand competition player reach shuttlecock 


 Supervised
tennis player reaches for a shuttlecock during a competition. 


 PPO
the ball in the backhand and the competition player in the shuttlecock. 


 NLPO
sbackhanded competition player reaches shuttlecock. 


 Supervised+PPO
football player reaches for a shuttlecock in backhand during the competition. 


 Supervised+NLPO
football player reaches for a shuttlecock after hitting a backhand during the competition. 

 Prompt: generate a sentence with: bowl combine cucumber onion pasta
 


 Zero Shot
cucumber onion pasta. 


 Supervised
amily tree. The widower, now widowed, returns, the daughter, daughter is not pleased when his wife is mistaken  for one of her sons, and the patriarch, then a wealthy man, falls victim to a jealous husband. The 


 PPO
the cucumber onion pasta in a small bowl. 


 NLPO
the in a bowl Combine cucumber onion pasta in a large bowl. Toss pasta 


 Supervised+PPO
A bowl containing pasta, tomatoes, cucumbers, and onions. 


 Supervised+NLPO
A bowl containing a mixture of pasta, cucumber, and onion. 