From Census Records to People: Reconstructing the Population of France in the SocFace Project

Jérôme Bourdieu, PSE and EHESS
Yannick Dupraz, Université Paris Dauphine-PSL, PSE & INED
Christopher Kermorvant, Teklia
Lionel Kesztenbaum, INED

The SocFace project aims to make all French censuses from 1836 to 1936 accessible at the individual level (listes nominatives) by developing AI-based tools that automatically transcribe handwritten entries. With around 500 million individuals recorded across 20 censuses, this initiative brings together archivists, historians, demographers, and computer scientists to enable both scientific research and public engagement with these records. Despite the tremendous improvement in the quality of automated text recognition for historical documents, challenges remain in processing very large collections, here millions of images written by hundreds of thousands of different individuals: some individuals may be hallucinated by the model, most text fields are very noisy, some pages may be omitted during processing, and so on. We quantify these issues in the SocFace project and describe the tools we have developed to assess data quality and implement corrections in post-processing. One important feature is that these tools need to be designed with the intended use of the dataset in mind: they will not be the same for linking individuals across censuses, studying social mobility, or assessing fertility by cohort. Our tests are therefore end-user oriented and provide different ways to evaluate the quality of the dataset. Finally, by analyzing the structural transformation of the labor market over one hundred years, we demonstrate the value of granular data for exploring social and economic phenomena in the long run.

Presented in Session 70. Flash Session New Data, Methods and Comparative Perspectives in Historical Demography