Millions of people around the world query large language models for information. While several studies have compellingly documented the persuasive potential of these models, there is limited evidence of who or what influences the models themselves, leading to a flurry of concerns about which companies and governments build and regulate the models. We show through six studies that government control of the media across the world already influences the output of large language models (LLMs) via their training data.
We use a cross-national audit to show that LLMs exhibit a stronger pro-government valence in the languages of countries with lower media freedom than in those with higher media freedom. Because this result is correlational, we triangulate the specific mechanism by which state media control can influence LLMs through a multi-part case study of China's media. We demonstrate that media scripted and curated by the Chinese state appears in large language model training datasets. To evaluate the plausible effect of this inclusion, we use an open-weight model to show that additional pretraining on Chinese state-coordinated media generates more positive answers to prompts about Chinese political institutions and leaders.
We link this phenomenon to commercial models through two audit studies demonstrating that prompting models in Chinese generates more positive responses about China's institutions and leaders than do the same queries in English. The combination of influence and persuasive potential across languages suggests the troubling conclusion that states and powerful institutions have increased strategic incentives to leverage media control in the hopes of shaping large language model output.
A dominant approach in quantitative social research is to represent data as a rectangle of numbers, with one row for each person and one column for each variable about those people. This data representation is natural for survey data and fits well with existing methods, such as generalized linear models (GLMs) and supervised machine learning. In this talk, we begin to explore an alternative pipeline that represents data about people as a text "book of life" and then analyzes the data using LLMs. This approach creates affordances that have no obvious analogue in existing approaches, and it may be especially valuable for life course data with complex temporal, network, and hierarchical structure.
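To make the contrast concrete, here is a minimal sketch of the two representations. The record, field names, and narrative wording are invented for illustration; they are not the actual schema or serialization used in the pipeline described in the talk.

```python
# Rectangular representation: one row per person, one column per variable.
row = {"person_id": 17, "birth_year": 1985, "n_jobs": 3, "income_2020": 41000}

def book_of_life(r: dict) -> str:
    """Serialize one person's row as a short text narrative (hypothetical format)."""
    return (
        f"Person {r['person_id']} was born in {r['birth_year']}. "
        f"They have held {r['n_jobs']} jobs and earned "
        f"{r['income_2020']} euros in 2020."
    )

print(book_of_life(row))
```

The resulting text can then be passed to an LLM as a prompt or fine-tuning example, whereas the rectangular row would feed directly into a regression or supervised learner.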
We study the book of life + LLM pipeline in two distinct settings focused on predicting life outcomes: one complex (the Dutch population registry) and one simple (the US American Community Survey). We compare approaches empirically and try to isolate the source of performance differences. The talk concludes with a brief discussion of the shared infrastructure—community-determined "model organisms" and open-source software—that might help us discover the value (or lack thereof) of the new approach more quickly and reliably.
Joint work with Sarah Pedersen, Mattie Niznik, Stephan Rabanser, Varun Satish, Flavio Hafner, Sayash Kapoor, Malte Luken, Lydia T. Liu, Tiffany Liu, Juan C. Perdomo, Benedikt Stroebl, Keyon Vafa, and Mark Verhagen.