Language Models for Predicting Life Outcomes

Varun Satish, Princeton University
Flavio Hafner, Erasmus
Sayash Kapoor, Princeton
Malte Lüken, Erasmus
Lydia Liu, Princeton
Tiffany Liu, Princeton
Juan Perdomo, NYU
Benedikt Stroebl, Princeton
Keyon Vafa, Harvard
Mark Verhagen, University of Oxford
Matthew Salganik, Princeton

The conventional approach to predicting life outcomes is to train linear or tree-based models on life course data represented in tabular format. In this paper, we develop a new approach: training a large language model (LLM) on life course data represented as text. We constructed text summaries of the lives of 6 million people using complex, multi-domain data from the Dutch Population Registry. We used these "books of life" to fine-tune an open-weight LLM, named Cruijff, to predict an important life outcome: fertility. Using books of life as inputs, Cruijff generated predictions that were more accurate than a simple demographic benchmark. When the books of life were enriched with predictions from a tree-based model trained on tabular data, Cruijff's predictions improved. These findings demonstrate that our approach is viable and flexible for predicting life outcomes. This work establishes a foundation for future research to explore how LLM-based approaches might outperform conventional approaches or unlock different possibilities for life course analysis.
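As a rough illustration of the idea of representing tabular life course data as text, the sketch below renders a few yearly records into a short "book of life" and pairs it with a yes/no fertility label for fine-tuning. It is a minimal sketch under stated assumptions, not the method used in the paper: the field names (`municipality`, `household_size`, `employed`), the sentence templates, the prompt wording, and the helper names `book_of_life` and `to_training_example` are all hypothetical, and the place names are made-up sample values.

```python
from dataclasses import dataclass, field


@dataclass
class YearRecord:
    # Hypothetical yearly registry fields, chosen only for illustration.
    year: int
    municipality: str
    household_size: int
    employed: bool


@dataclass
class Person:
    birth_year: int
    sex: str
    records: list = field(default_factory=list)


def book_of_life(person: Person) -> str:
    """Render one person's tabular records as a plain-text summary."""
    lines = [
        f"This person was born in {person.birth_year} "
        f"and is registered as {person.sex}."
    ]
    # Write one sentence per year, in chronological order.
    for rec in sorted(person.records, key=lambda r: r.year):
        work = "was employed" if rec.employed else "was not employed"
        lines.append(
            f"In {rec.year}, they lived in {rec.municipality} "
            f"in a household of {rec.household_size} and {work}."
        )
    return "\n".join(lines)


def to_training_example(person: Person, had_child: bool) -> dict:
    """Pair a book of life with a yes/no label (assumed prompt format)."""
    prompt = (
        book_of_life(person)
        + "\nDid this person have a child in the outcome window?"
    )
    return {"prompt": prompt, "completion": "yes" if had_child else "no"}


# Made-up sample person with two yearly records, given out of order.
sample = Person(
    birth_year=1990,
    sex="female",
    records=[
        YearRecord(2016, "Utrecht", 2, True),
        YearRecord(2015, "Amsterdam", 1, False),
    ],
)
example = to_training_example(sample, had_child=True)
```

Records like `example` could then be passed to any supervised fine-tuning pipeline. Enriching a book of life with the output of a tabular model, as the abstract describes, would amount to adding one more sentence to the text under this sketch.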

See extended abstract

Presented in Session 26: Flash Session: Emerging Data Sources in Demography: Digital Traces, AI and Mobile Phone Data