When most people think of big data and its possible perils, the kinds of stories that come to mind include those of the father who found out his teenage daughter was pregnant from Target advertisements, the couple whose private conversation was recorded by an Alexa and sent to one of the husband’s employees, or Cambridge Analytica’s use of data from over 50 million Facebook profiles to aid Donald Trump’s 2016 presidential campaign. These breaches of privacy, these threats to civil liberties, center on the consequences of being included in a dataset. This shouldn’t be surprising: if you’re reading this blog post, you are likely one of billions of people whose social media postings, credit card purchases, and browsing history fill the very databases that big data draws on. The horror stories of big data’s privacy-invading potential may make you wish that you could go “off the grid” altogether. However, it is worth considering: what risks might befall those excluded from the big data revolution? To dissect the ethical considerations of data exclusion, we can look to two principles that have emerged from legal theory: anticlassification and antisubordination. We will then consider how these time-tested principles apply to today’s data-driven context. Lastly, we will examine the upcoming 2020 U.S. Census as a case study to understand the potentially profound impact of being excluded from government datasets.
Anticlassification vs. antisubordination
The discussion around these principles has been in large part driven by legal scholars from our very own Yale Law School. In 1976, Owen Fiss published a landmark paper that drew upon historical precedent and social theory to explain how the anticlassification principle, which prohibits practices that classify people on the basis of a “forbidden” category like race, came to be the predominant interpretation of the 14th Amendment’s Equal Protection Clause:
- Blind justice. The notion of “color blind” justice, Fiss notes, appeals to judicial norms of protecting persons from being judged on “irrelevant” characteristics; Lady Justice (below), after all, is blindfolded.
- Value neutrality. The prohibition of classification on protected categories is attractive to judges, who are charged with the task of neutral enforcement of laws. As such, the anticlassification principle is seen as a way to prevent judges from substituting their values for those of the public. This safeguard, however, may be an illusion: courts still must decide which classifications are protected—the poor are not a legally recognized protected class, for example.
- Objectivity. The anticlassification principle may aid in crafting legal rules that are universally applicable and not subject to ambiguities or changes over time. Fiss disputes this widely held notion, pointing out that most rulings made under the premise of the anticlassification principle are based not on facts, but on speculation about how a ruling could disproportionately affect protected groups.
- Individualism. Because anticlassification rejects the idea of a “natural class,” there is no recognition of social groups. Although perhaps a lofty ideal, Fiss believes this individualistic perspective, by leaving little room for recognizing systemic disadvantage, does not allow for true “protection” of marginalized groups.
Fiss’ paper culminates in a description of, and a call for, a new interpretation of what it means to offer “equal protection under the law.” His formulation of what he calls the “group-disadvantaging principle” laid the groundwork for what we today call the antisubordination principle. This viewpoint takes a decidedly different approach to providing equal protection, condoning and even encouraging the use of protected classifiers to challenge or subvert the subordination of a “specially disadvantaged group.”
Applied to the big data revolution, and specifically to the use of sensitive data like race or ethnicity, the prevailing doctrine of anticlassification has meant that both public and private entities have often excluded this information from their datasets and models. This kind of exclusion raises two key questions:
- Is it enough? If we can completely remove protected classifiers from our data, does that ensure that different groups are equally protected?
- Is it possible? The emergence of big data has provided us with an unparalleled level of information about people. With countless proxies for protected information readily available, can we ever truly remove something like race from our data?
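The second question can be made concrete with a small sketch. The records below are entirely made up (hypothetical zip codes and two hypothetical groups, "A" and "B"), and the "model" is nothing more than a majority vote per zip code. It is only an illustration of how a proxy can stand in for a removed attribute, not a claim about any real dataset.

```python
from collections import Counter, defaultdict

# Hypothetical, made-up records: (zip_code, group). Not real data.
records = [
    ("00001", "A"), ("00001", "A"), ("00001", "A"), ("00001", "B"),
    ("00002", "B"), ("00002", "B"), ("00002", "B"), ("00002", "A"),
]

def proxy_accuracy(records):
    """Share of records whose group is guessed correctly from zip code alone."""
    by_zip = defaultdict(Counter)
    for zip_code, group in records:
        by_zip[zip_code][group] += 1
    # For each zip code, guess its most common group and count the hits.
    correct = sum(counts.most_common(1)[0][1] for counts in by_zip.values())
    return correct / len(records)

accuracy = proxy_accuracy(records)
print(f"group recovered from zip code alone: {accuracy:.0%}")
```

In this toy example the group column could be deleted and zip code would still recover it for three out of four records, which is the sense in which removing a protected classifier may not remove the information it carries.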
Antisubordination and the “big data” revolution
I don’t know the answers to these questions, but I do believe that Jonas Lerman’s call for a doctrine of “data antisubordination” could help protect those left behind by big data’s reach. A doctrine of data antisubordination that applies to both the public and private sector would mean that companies could no longer claim plausible deniability; they could no longer say “but we didn’t use race in our model, how were we supposed to know?” Amazon executive Craig Berman would no longer be able to evade claims that the exclusion of Boston’s majority-black Roxbury neighborhood from same-day service (see map below, from previously linked article) was racially motivated simply by asserting that “demographics played no role” in the company’s data-driven decisions.
Being excluded from services is far from the only economic harm suffered by those left out of datasets. Businesses rely on big data to learn about consumers’ preferences and behaviors. Those not represented in these data may not get stores in their neighborhoods—nor the employment opportunities that follow—or products suited to their needs and budget.
More profound, perhaps, are the threats that data exclusion poses for civic life. Political campaigns and federal agencies alike rely on big data to shape their messaging and allocate their services. The very bedrock of our democracy rests on the ability of elections to fairly represent the American public. Election districts therefore, at least in theory, count on the accuracy of Census data. Aside from the important issues of gerrymandering and voter suppression, the systemic underrepresentation of marginalized communities in government datasets could not only lead to exclusion from important public goods and services, but also deprive those communities of their fundamental right to participate in American democracy.
The decennial United States Census, which attempts to count every single person living in the country, is perhaps the most important government dataset of all. There has been much discussion around the upcoming 2020 Census, with many experts concerned about undercounting of marginalized populations. In the next section, I use the 2020 Census as a case study of the possible impacts of data exclusion.
Case study: The 2020 Census
Why is it so important that the Census include everyone? The Census is not just any survey—it is the pillar upholding our democracy. Census data is used in countless ways, but its most important uses include the determination of how seats in the House of Representatives are allocated and where public funds for Medicare, Pell Grants, infrastructure, and other government services are directed. Therefore, the accuracy of the Census is vital to ensuring that all Americans can participate fully in our democracy. Below, I highlight some of the key threats to the accuracy of the upcoming Census.
The citizenship question. The Trump administration is planning to add a new question to the Census form asking respondents whether they are U.S. citizens. Although officials claim that the question is necessary to better enforce the Voting Rights Act (a claim that has been repeatedly disputed), research from the Census Bureau itself shows that “adding a citizenship question to the 2020 Census would lead to lower self-response rates in households potentially containing noncitizens, resulting in higher fieldwork costs and a lower-quality population count.” In a time of heightened immigration enforcement and pervasive anti-immigrant rhetoric, it is likely that this question will deter noncitizens (who may fear their responses will be used against them) from responding. Recent field tests align with this research finding: a presentation by a Census researcher contains several quotes from interviewers and respondents about the citizenship question. One interviewer recalled,
“There was a cluster of mobile homes, all Hispanic. I went to one and I left the information on the door. I could hear them inside. I did two more interviews, and when I came back, they were moving…. It’s because they were afraid of being deported.”
One respondent shared their fears, noting that,
“The possibility that the Census could give my information to internal security and immigration could come and arrest me for not having documents terrifies me.”
It is important to note that current regulations would not allow Census-obtained data to be used for deportation. The citizenship question does not distinguish between documented and undocumented immigrants, and either way, the Census Bureau is strictly prohibited by federal law from sharing its data with other agencies such as Immigration and Customs Enforcement. However, this doesn’t change the fact that many immigrants, particularly in light of the Trump administration’s “America First” rhetoric, have come to harbor a mistrust of the federal government. Interviewers on the ground have seen this in field testing, with one observing,
“Three years ago was so much easier to get respondents compared to now because of the government changes… and trust factors.… Three years ago I didn’t have problems with the immigration questions.”
The political implications of undercounting noncitizens, a disproportionately marginalized population (and also generally a Democratic-leaning one), are troubling. The loss of representation and vital public resources could have a potentially profound and long-lasting impact on these communities, simply because they were excluded from the Census dataset.
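To see how an undercount can translate into lost representation, here is a minimal sketch of the method of equal proportions (Huntington-Hill), the formula used to apportion House seats among the states. The two "states," their populations, the six-seat chamber, and the 6% undercount are all hypothetical numbers chosen for illustration; the function name `apportion` is my own.

```python
import heapq
from math import sqrt

def apportion(populations, seats):
    """Apportion seats by the method of equal proportions (Huntington-Hill).

    Every state starts with one seat; each remaining seat goes to the state
    with the highest priority value population / sqrt(n * (n + 1)), where n
    is the number of seats that state currently holds.
    """
    alloc = {state: 1 for state in populations}
    # heapq is a min-heap, so priorities are stored negated.
    heap = [(-pop / sqrt(1 * 2), state) for state, pop in populations.items()]
    heapq.heapify(heap)
    for _ in range(seats - len(populations)):
        _, state = heapq.heappop(heap)
        alloc[state] += 1
        n = alloc[state]
        heapq.heappush(heap, (-populations[state] / sqrt(n * (n + 1)), state))
    return alloc

# Hypothetical populations: state B counted fully, then undercounted by 6%.
full_count = apportion({"A": 1_000_000, "B": 1_500_000}, seats=6)
undercount = apportion({"A": 1_000_000, "B": 1_410_000}, seats=6)
print(full_count)  # {'A': 2, 'B': 4}
print(undercount)  # {'A': 3, 'B': 3}
```

In this toy case, missing 90,000 residents of state B moves a seat from B to A, even though no one actually moved.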
Untested technology. For the first time, the Census Bureau will ask most respondents to fill out the form through an online survey. Those who live in areas with many older adults or with low internet access, and those who don’t respond to the survey, will be sent a paper form. Additionally, people will have the option of calling in their answers and having them recorded by a language recognition technology. While these measures are intended to improve response rates and reduce costs, they have the potential to introduce a whole host of complications. Access to technology differs substantially by race, as shown by the infographic below; this may make the survey harder to fill out for poor and minority communities. Furthermore, some may be scared off by the idea of divulging personal information online, especially in an era of foreign election interference and high-profile data breaches at companies like Marriott and Equifax. The Census Bureau had hoped to field test its new technologies, but due to budget constraints had to cancel two out of three “dress rehearsals,” which were intended to help Census takers access hard-to-reach areas. It is unclear how these changes will affect the accuracy of the Census. Even if the response rate does improve, it may not improve equitably across different groups, posing a threat to the fundamental goal of the Census: to represent every American.
Underfunding = Undercounting. The Census has always struggled with undercounting certain segments of the population, such as minorities and poor people, who tend to live more transient lives and often distrust the government. The figure below from Vox’s recent article on the Census shows that whites have historically been overcounted while Hispanics and blacks have been undercounted. Because efforts to allow the use of sampling to obtain more accurate counts have been consistently opposed by Republican lawmakers who believe it will artificially inflate the numbers of minorities and urban residents, much of the legwork of the Census falls on the enumerators, who traverse the country in an attempt to count every nonrespondent. For the 2020 Census, there will be 300,000 of these enumerators, down from 500,000 in 2010. Compounding this shortage of workers is the fact that they will only try to visit nonresponding homes three times, instead of the typical six. This animation describes the process by which the Census Bureau attempts to capture the entire population. The success of this head count, particularly in areas where people live in less stable housing arrangements (e.g., the homeless, renters, or victims of natural disasters), often depends on the local knowledge and community ties of the enumerators. With fewer of these people, some of our nation’s most vulnerable voices may be ignored.
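A rough back-of-the-envelope calculation shows why cutting visits from six to three matters. The sketch below assumes, purely for illustration, that each visit independently reaches a nonresponding household with the same probability; the 30% figure is a made-up placeholder, not a Census Bureau estimate.

```python
def share_never_reached(p_contact: float, visits: int) -> float:
    """Probability that every one of `visits` independent attempts fails."""
    return (1 - p_contact) ** visits

P_CONTACT = 0.3  # hypothetical per-visit chance of reaching a household

for visits in (6, 3):
    share = share_never_reached(P_CONTACT, visits)
    print(f"{visits} visits: {share:.1%} of households never reached")
```

Under these assumed numbers, the share of nonresponding households that are never reached rises from about 12% with six visits to about 34% with three. Real contact rates vary by neighborhood and are unlikely to be independent across visits, so this is a simplification.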
This blog post began by laying out two frameworks from legal philosophy related to the provision of “equal protection under the law”: anticlassification and antisubordination. How are these related to the 2020 Census? Recall that while the prevailing doctrine of anticlassification aims to achieve equal protection through prohibiting explicitly discriminatory practices, antisubordination encourages the use of sensitive information, like race and immigration status, to subvert systems of oppression that relegate disadvantaged groups to a subordinate status. While discussions of antisubordination in the context of data exclusion tend to revolve around data analysis (say, including race variables in a model), little attention has been given to data collection. And for a task as enormously complex, and as fundamentally important, as the U.S. Census, this is perhaps where the task of challenging subordination begins.
Take, for instance, Cindy Quezada and Jorge Sanjuan of the Central Valley Immigrant Integration Collaborative (CVIIC). As part of a pilot program to identify hard-to-find homes in advance of the 2020 Census, Quezada and Sanjuan, both immigrants themselves, canvass the neighborhoods of Fresno, California. They give presentations about why the Census is important. And perhaps most importantly, they allay the fears and misconceptions of the many undocumented immigrants in their community, assuring them that participating in the Census does not put them in jeopardy. Fighting against data exclusion is no small feat: often, the very reason people are excluded from datasets is that their sociopolitical power has been diminished to the point of near invisibility. As such, the best way to democratize big data may be to look to the margins, to those unseen by its reach.