Big data in social sciences: a promise betrayed?

In just five years, the mood at conferences on social science and big data has shifted, at least in France. Back in the early 2010s, these venues were buzzing with exchanges about the characteristics of the “revolution” (the 4 Vs), with participants marveling at the research insights afforded by the use of tweets, website ratings, Facebook likes, eBay prices or online medical records. It was a time when, in spite of warnings about the challenges and perils ahead, grant applications, graduate courses and publications were suddenly invaded by new tools to extract, analyze and visualize data. These discussions are neither over nor even mature yet, but their tone has changed. The enthusiasm, tinged with arrogance, has given way to a cautious reflexivity wrapped up in a general zeitgeist of uncertainty and angst, even anger. Or so was the feeling I took away from the ScienceXXL conference I attended last week. Organized by demographer Arnaud Bringé and sociologists Anne Lambert and Etienne Ollion at the French National Institute for Demographic Studies, it was conceived as an interdisciplinary practitioners’ forum. Debates on sources, access, tools and uses were channeled through a series of feedback sessions offered by computer scientists, software engineers, demographers, statisticians, sociologists, economists, political scientists and historians. This format made especially salient the underlying need to yoke new practices to an epistemological re-evaluation of the nature and uses of data, of the purpose of social science, and of the relationships between researchers and governments, independent agencies, businesses and citizens.

Lucidity: big data is neither easier nor faster nor cheaper

Interdisciplinarity

The most promising trend I saw during the workshop is a better integration of users, disciplines and workflows. “Building a database doesn’t create its own uses” was a much-reiterated warning, but responses were offered. One is the interdisciplinary construction of a datascape, that is, a tool that integrates the data corpus and the visualization instrument. Paul Girard introduced RICardo, which allows the exploration of 19th- and 20th-century trade data. Eglantine Schmitt likewise explained that developing a text-mining software required “choosing an epistemological heritage” on how words are defined and how the interpretative work is performed, and “tooling it up” for current and future uses, subject to technical constraints. What surprised me, I confess, was the willingness of research engineers and data and computer scientists to incorporate the epistemological foundations of the social sciences into their work and to draw on lessons learned from centuries of qualitative research. Several solutions to further improve collaboration between social and computer scientists were discussed. The hackathon/sprint model prevents teams from dividing up tasks, and the forced interaction yields an understanding of one another’s ways of thinking and practices. The downside is that it promotes “fast science,” whereas data need time to be understood and digested. Data dumps and associated contests on websites such as Kaggle, by contrast, allow longer-term projects.

Perceived future challenges included a better integration of 1) qualitative and quantitative methods (cases of fruitful interbreeding mentioned were the Venice Time Machine project and Moretti’s Distant Reading; evaluations of culturomics were more mixed) and 2) old and new research (to determine whether the behavioral patterns observed are genuinely new phenomena produced by social networks and digitalized markets, or whether they are consistent with traditional behaviors identified with older techniques). Also pointed out was the need to identify and study social phenomena that are impossible to capture through quantification and datafication. This suggests that a paradoxical consequence of the massive and constant data dump allowed by the real-time recording of online behavior could be a rise in the prestige of the most qualitative branches of analysis, such as ethnography.

[Screenshot: the RICardo datascape]

Methodenstreit

Unsurprisingly, debates on quantitative tools, in particular on the benefits and limits of traditional regression methods vs machine learning, quickly escalated. Conference exchanges echoed larger debates on the black-box character of algorithms, the lack of guarantee that their result is optimal, and the difficulty of interpreting results, three shortcomings that some researchers believe make machine learning incompatible with the DNA of social science. Etienne Ollion & Julien Boelaert pictured random forests as epistemologically consistent with the great sociological tradition of “quantitative depiction” pioneered by Durkheim or Park & Burgess. They explained that ML techniques allow more iterative, exploratory approaches and make it possible to map heterogeneous variable effects across the data space. Arthur Charpentier rejected attempts to conceal the automated character of ML: these techniques are essentially built to outsource the task of getting a good fit to machines, he insisted. My impression was that there is a sense in which ML is to statistics what robotization is to society: a job threat demanding a compelling re-examination of what is left for human statisticians to do, of what is impossible to automate.
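To make the idea of “mapping heterogeneous variable effects” more concrete, here is a minimal sketch of the kind of exploratory workflow such arguments point to, not the presenters’ actual code: a random forest is fitted on simulated survey-like data (all variable names and values are invented for illustration), and predicted outcomes are traced along one variable separately for two subgroups, exposing an effect that a single regression coefficient would average away.

```python
# Illustrative sketch only: simulated data, not material from the conference.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "education_years": rng.integers(6, 22, n),
    "urban": rng.integers(0, 2, n),  # 0 = rural, 1 = urban
})
# Simulated outcome with a heterogeneous effect: education pays more in cities.
df["income"] = (
    1000
    + 40 * df["age"]
    + (200 + 300 * df["urban"]) * df["education_years"]
    + rng.normal(0, 500, n)
)

X, y = df[["age", "education_years", "urban"]], df["income"]
forest = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)

# Trace predicted income over an education grid, separately for rural and
# urban observations (a hand-rolled partial-dependence curve per subgroup).
grid = np.arange(6, 22)
for urban_value in (0, 1):
    subset = X[X["urban"] == urban_value].copy()
    preds = []
    for e in grid:
        subset["education_years"] = e
        preds.append(forest.predict(subset).mean())
    slope = (preds[-1] - preds[0]) / (grid[-1] - grid[0])
    print(f"urban={urban_value}: predicted return to an extra year of education ≈ {slope:.0f}")
```

The point of the loop is descriptive rather than inferential: the fitted model is interrogated to show how an effect varies across the data space, instead of being summarized by a single parameter.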

Tool debates fed into soul-searching on the nature and goals of the social sciences. The focus was on prediction vs explanation. How well can we hope to predict with ML, some asked? Prediction is not the purpose of social sciences, others retorted, echoing Jake Hofman, Amit Sharma and Duncan Watts’s remark that “social scientists have generally deemphasized the importance of prediction relative to explanation, which is often understood to mean the identification of interpretable causal mechanisms.” These were odd statements for a historian of economics working on macroeconometrics. The 1960s/1970s debates around the making and uses of Keynesian macroeconometric models I have excavated highlight the tensions between alternative purposes: academics primarily wanted to understand the relationships between growth, inflation and unemployment, and to make conditional predictions of the impact of shifts in taxes, expenditures or the money supply on GDP. Beyond policy evaluation, central bankers also wanted their models to forecast well. Most macroeconometricians also commercialized their models, and what sold best were predictive scenarios. My conclusion is that prediction has been as important as, if not more important than, explanation in economics (and I don’t even discuss how Friedman’s predictive criterion got under economists’ skin in the postwar period). If, as Hofman, Sharma and Watts argue, “the increasingly computational nature of social science is beginning to reverse this traditional bias against prediction,” then the post-2008 crash crisis in economics should serve as a warning against such crystal-ball hubris.

Access (denied)

Uncertainty, angst and a hefty dose of frustration dominated discussions on access to data. Participants documented access denials from a growing number of commercial websites after they used data-scraping bots, Twitter APIs that are getting increasingly restrictive, and administrations and firms routinely refusing to share their data and, absent adequate storage and retrieval routines, data-mining and computational expertise, and a stable and intelligible legal framework, even destroying large batches of archives. Existing infrastructures designed to give researchers access to public and administrative data are sometimes ridiculously inadequate. In some cases, researchers cannot access data firsthand and have to send their algorithms to intermediary operators who run them, meaning no research topic or hypothesis can emerge from observing and playing with the data. Accessing microdata through the Secure Data Access Center means you might have to take pictures of your screen, as regression output, tables and figures are not always exportable. Researchers also feel their research designs are not understood by policy- and law-makers. On the one hand, data sets need to be anonymized to preserve citizens’ privacy, but on the other, only identified data allow dynamic analyses of social behaviors. Finally, as Danah Boyd and Kate Crawford had predicted in 2011, access inequalities are growing, with the prospect of greater concentration of money, prestige, power and visibility in the hands of a few elite research centers. Not so much because access to data is being monetized (at least so far), but because privileged access to data increasingly depends on networks and reputation and creates a Matthew effect.

Referring to Boyd and Crawford, one participant sadly concluded that he felt the promises of big data that had drawn him to the field were being betrayed.

Harnessing the promises of big data: from history to current debates

The social scientists in the room shared a growing awareness that working with big data is neither easier nor faster nor cheaper. What they were looking for, it appeared, was not merely feedback, but frameworks to harness the promises of big data and guidelines for public advocacy. Yet crafting such guidelines requires some understanding of the historical, epistemological and political dimensions of big data. This involves reflecting on the changing (or enduring) definitions of “big” and of “data” across time and interest groups, including scientists, citizens, businesses and governments.

When data gets big

Historians usually define “bigness” not in terms of terabytes, but as a “gap” between the amount and diversity of data produced and the intellectual and technical infrastructure available to process them. Data get big when they become impossible to analyze, creating an information overload. And this has happened several times in history: the advent of the printing press, the growth of population, the industrial revolution, the accumulation of knowledge, the quantification that came along with scientists’ participation in World War II. A gap appeared when the 1890 census data could not be tabulated within ten years, and it was subsequently reduced by the development of punch-card tabulating machines. By the 1940s, libraries were doubling in size every 16 years, so that classification systems needed to be rethought. In 1964, the New Statesman declared the age of the “information explosion.” Though its history is not yet settled, the term “big data” appeared in NASA documents at the end of the 1990s and was then used by statistician Francis Diebold in the early 2000s. Are we in the middle of the next gap? Or have we entered an era in which technology permanently lags behind the amount of information produced?

Because they make “bigness” historically contingent, histories of big data tend to de-emphasize the distinctiveness of the new data-driven science and to temper claims that some epistemological shift in how scientific knowledge is produced is under way. But they illuminate characteristics of past information overloads, which help make sense of contemporary challenges. Some participants, for instance, underlined the need to localize the gap (down to the size and capacities of their PCs and servers) so as to understand how to reduce it, and who should pay for it. This way of thinking is reminiscent of the material cultures of big data studied by historians of science. They show that bigness is a notion primarily shaped by technology and materiality, whether paper, punch cards, microfilms, or the hardware, software and infrastructures that scientific theories were built into after the war. But there is more to big data than technology. Scientists have also actively sought to build large-scale databases, and a “rhetoric of big” has sometimes been engineered by scientists, governments and firms alike for prestige, power and control. Historians’ narratives also elucidate how closely intertwined with politics the material and technological cultures shaping big data are. For instance, the reason why Austria-Hungary adopted punch-card machinery to handle censuses earlier than Prussia, Christine von Oertzen explains, was determined by labor politics (Prussia rejected mechanized work so as to provide disabled veterans with jobs).

Defining data through ownership

The notion of “data” is no less social and political than that of “big.” In spite of the term’s etymology (data means “given”), the data social scientists covet, and access to it, are largely determined by questions of use and ownership. Disagreement over who owns what, and for what purpose, is what generates instability in epistemological, ethical and legal frameworks, and what creates this ubiquitous angst. For firms, data is a strategic asset and/or a commodity protected by property rights. For them, data are not to be accessed or circulated, but to be commodified, contractualized and traded in monetized or non-monetized ways (and, some would argue, stolen). For citizens and the French independent regulatory body in charge of defending their interests, the CNIL, data is viewed through the prism of privacy. Access to citizens’ data is something to be safeguarded, secured and restricted. For researchers, finally, data is a research input on the basis of which they seek to establish causalities, make predictions and produce knowledge. And because they usually see their agenda as pure and scientific knowledge as a public good, they often think the data they need should also be considered a public good, free and open to them.

In France, recent attempts to accommodate these contradictory views have created a mess. Legislators have striven to strengthen citizens’ privacy and their right to be forgotten against Digital Predators Inc. But article 19 of the resulting Digital Republic Bill, passed in 2016, states that, under specific conditions, the government can order private businesses to transfer survey data for public statistics and research purposes. The specifics will be determined by “application decrees,” not yet written and of paramount importance to researchers. At the same time, French legislators have also increased governmental power to snoop on (and control) the private lives of citizens in the wake of terror attacks, and rights over business, administrative and private data are also regulated by a wide array of health, insurance and environmental bills, case law, trade agreements and international treaties.

As a consequence, firms are caught between contradictory requirements: preserving data to honor long-term contracts vs deleting data to guarantee their clients’ “right to be forgotten.” Public organizations navigate between the need to protect citizens, their exceptional rights to require data from citizens, and incentives to misuse them (for surveillance and policing purposes). And researchers are sandwiched between their desire to produce knowledge, describe social behaviors and test new hypotheses, and their duty to respect firms’ property rights and citizens’ privacy rights. The latter requirement raises fundamental ethical questions, also debated during the ScienceXXL conference. One is how to define consent, given that digital awareness is not distributed equally across society. Some participants argued that consent should be explicit (for instance, to scrape data from Facebook or dating websites). Others asked why digital scraping should be regulated while ethnographic field observation isn’t, the two being equivalent research designs. Here too, these debates would gain from a historical perspective, one offered in histories of consent in medical ethics (see Joanna Radin and Cathy Gere on the use of indigenous health and genetic data).

All in all, scientific, commercial and political definitions of what is “big” and what counts as “data” are interrelated. As Bruno Strasser illustrates with the example of crystallography, “labeling something ‘data’ produces a number of obligations” and prompts a shift from privacy to publicity. Conversely, Elena Aronova’s research highlights that postwar attempts to gather geophysical data on oceanography, seismology, solar activity or nuclear radiation were shaped by the context of research militarization. These data were considered a “currency” to be accumulated in large volumes, and their circulation was characterized more by Cold War secrecy than by international openness. The uncertain French technico-legal framework can also be compared to that of Denmark, whose government has lawfully architected big data monitoring without citizens’ opposition: each citizen has a unique ID carried through medical, police, financial and even phone records, an “epidemiologist’s dream” come true.

Social scientists in search of a common epistemology

If they want to harness the promises of big data, then, social scientists cannot avoid entering the political arena. A prerequisite, however, is to forge a common understanding of what data are and what they are for. And conference exchanges suggest we are not there yet. At the end of the day, what participants agreed on is that the main characteristic of these new data isn’t just their size, but the fact that they are produced for purposes other than research. But that’s about all they agreed on. For some, it means that data like RSS feeds, tweets, Facebook likes or Amazon prices are not as clean as data produced through sampling or experiments, and that more effort and creativity should be put into cleaning datasets. For others, cleaning is distorting: gaps and inconsistencies (like multiple birth dates or odd occupations in demographic databases) provide useful information on the phenomena under study.
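A toy sketch, with invented records and column names rather than any dataset discussed at the conference, of what the two stances imply in practice: the “cleaning” stance deduplicates conflicting entries, while the alternative keeps the inconsistency and turns it into a variable of interest.

```python
# Illustrative sketch only: fictitious demographic records.
import pandas as pd

records = pd.DataFrame({
    "person_id": [1, 1, 2, 3, 3],
    "birth_date": ["1950-03-01", "1951-03-01", "1962-07-14", "1980-01-05", "1980-01-05"],
    "occupation": ["farmer", "farmer", "teacher", "clerk", "dragon tamer"],
})

# "Cleaning" stance: keep one arbitrary record per person; the conflict disappears.
cleaned = records.drop_duplicates(subset="person_id", keep="first")

# "Keep the mess" stance: count distinct birth dates per person and carry the
# inconsistency along as information about how the data were produced.
n_birth_dates = (
    records.groupby("person_id")["birth_date"]
    .nunique()
    .rename("n_birth_dates")
    .reset_index()
)
flagged = records.merge(n_birth_dates, on="person_id")
print(flagged[flagged["n_birth_dates"] > 1])
```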

That scraped data are not representative also commanded wide agreement, but while some saw this as a limitation, others considered it an opportunity to develop alternative quality criteria. Nor are data taken from websites objective. The audience was again divided on what conclusion to draw. Are these data “biased”? Does their subjective character make them more interesting? Rebecca Lemov’s history of how mid-twentieth-century American psycho-anthropologists tried to set up a “database of dreams” reminds us that capturing and cataloguing the subjective part of human experience is a persistent scientific dream. In an ironic twist, the historians and statisticians in the room ultimately agreed that what a machine cannot (yet) be taught is how the data are made, and that this matters more than how data are analyzed. The solution to harnessing the promise of big data, in the end, is to consider data not as a research input, but as the center of scientific investigation.

Relevant links on big data and social science (in progress)

2010ish “promises and challenges of big data” articles: [Bollier], [Manovich], [Boyd and Crawford]

“Who coined the term big data?” (NYT), a short history of big data, a timeline

Science special issue on Prediction (2017)

Max Planck Institute project on historicizing data and 2013 conference report

Elena Aronova on historicizing big data ([VIDEO], [BLOG POST], [PAPER])

2014 STS conference on collecting, organizing, trading big data, with podcast

Some links on big data and the social sciences (in French)

“Au delà des Big Data,” by Etienne Ollion & Julien Boelaert

À quoi rêvent les algorithmes, by Dominique Cardon

Special issue of the journal Statistiques et Sociétés (2014)

Forthcoming special issue of the journal Economie et Statistiques

On the RICardo datascape, by Paul Girard
