Margaret Li, Weijia Shi, Artidoro Pagnoni, Peter West, Ari Holtzman

[Paper]

RLHF-aligned LMs have shown unprecedented ability on both benchmarks and long-form text generation, yet they struggle with one foundational task: next-token prediction. As RLHF models become agent models aimed at interacting with humans, they seem to lose their *world modeling*: the ability to predict what comes next in arbitrary documents, which is the original training objective of the Base LMs that RLHF adapts.

Beyond empirically demonstrating this trade-off, we propose a potential explanation: to perform coherent long-form generation, RLHF models restrict randomness via implicit blueprints. In particular, RLHF models concentrate probability on sets of anchor spans that co-occur across multiple generations for the same prompt, serving as textual scaffolding but also limiting a model's ability to generate documents that do not include these spans. We study this tension in the most effective current agent models, those aligned with RLHF, while exploring why it may remain a fundamental trade-off between models that act and those that predict, even as alignment techniques improve.


[Figure: sankey_fig_1.png]

RLHF model generations on the same prompt are highly similar to each other, unlike Base LMs. For each of 80 short prompts, we collect and align 100 generations (nucleus sampling, p = 0.9) from Base (pretrained) and RLHF models. Above: A Sankey diagram of 100 RLHF model generations for the prompt "What are the main differences between Python and JavaScript programming languages?" Sequences share multiple lengthy anchor spans which appear verbatim in the same order, forming a uniform skeleton for nearly all generations. Below: Over the sequence length, the number of generations aligned with at least 5 others, averaged over all prompts. Base model generations maintain low levels of alignment. RLHF model generations exhibit high alignment throughout, but especially near the beginning and end of generations.

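The figure caption describes finding verbatim anchor spans shared across many sampled generations for the same prompt. As a rough illustrative sketch only (not the paper's actual alignment procedure, which aligns full sequences), one could count word n-grams that appear verbatim in many generations; the function name, parameters, and thresholds below are all hypothetical:

```python
from collections import Counter

def shared_anchor_ngrams(generations, n=8, min_count=5):
    """Crude proxy for 'anchor spans': word n-grams that appear
    verbatim in at least `min_count` of the sampled generations.

    `generations` is a list of strings sampled from one prompt.
    Returns (ngram_text, num_generations_containing_it) pairs.
    """
    counts = Counter()
    for text in generations:
        toks = text.split()
        # Count each n-gram at most once per generation, so the count
        # reflects how many generations share it, not raw frequency.
        seen = {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
        counts.update(seen)
    return [(" ".join(g), c) for g, c in counts.items() if c >= min_count]
```

A real implementation would also need to check that the shared spans occur in the same order across generations (as the Sankey diagram shows), e.g. via sequence alignment, but the counting step above captures the basic idea.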