Last year, officials from Washington state and across the country started sounding alarm bells: the 2020 census, which profoundly shapes political and economic power in the country, was fundamentally flawed in a way no other census had been.
It wasn’t the pandemic, or the administration in the White House, or any of the historic shortcomings of the census that raised passions and got news coverage. It was a highly technical set of equations laden with clunky jargon: differential privacy, part of the Census Bureau’s Disclosure Avoidance System.
“The majority of the data output from the (Disclosure Avoidance System) appears to be unfit for most uses,” Mike Mohrman, Washington state’s demographer at the Office of Financial Management, wrote to the Census Bureau in August 2020.
The census is used to determine the shapes of electoral districts, how taxes are distributed to communities, and where developers decide to build new homes and businesses. Researchers and policy advocates use demographic data, in particular, to test if everything from infrastructure to property values is being skewed for, or against, a racial or ethnic group.
If the census can’t be reliably used to determine how many people live in a small town, or which neighborhoods in a major city have concentrated minority populations, these projects can flounder. And with the 2020 census, many worry that a line has been crossed.
“We believe that the equitable distribution of funds based on population will be harmed if the accuracy of the data is not markedly improved,” Mohrman wrote. “In addition, it is difficult to see how racial/ethnic minorities can be accurately represented if they are not accurately portrayed in the census data at the geographic levels needed for apportionment.”
Along with its constitutional mandate to count every person in the United States, the Census Bureau is also required to protect the privacy of those who respond — it legally should not be possible, therefore, to trace census data back to an individual respondent.
But advances in computing power and the proliferation of massive data sets on Americans have made it increasingly plausible that a person could be tied back to their census response.
With the help of external data sets already linked to certain individuals, whether they are from advertisers or data leaks, it may be possible to piece together that information, Mohrman said during a telephone interview.
But some policy experts worry that differential privacy, in an attempt to eliminate the possibility that a person’s identity could be disclosed through the census, has added an unacceptable level of noise into the data.
“Privacy protection is important to everyone. I appreciate that,” Mohrman said. “I don’t know that what (differential privacy) protects us from gets at that.”
In 2020, the Census Bureau released demonstration data files that took the raw data from 2010 — before it had gone through the traditional data scrambling still in effect that year — and ran it through the differential privacy algorithm, to show how differential privacy would have changed the last census.
The effect was dramatic, Mohrman wrote in his August letter. He pointed to a census block that is entirely made up of inmates at the Washington Corrections Center for Women. In the data released in 2010, 99% of the population of the census block were women.
An initial attempt in 2020 to test differential privacy on the 2010 data reported that only 25% of the population of that census block were women. By August 2020, the Census Bureau had released a new version, meant to improve accuracy: it stated that only 12% of the population of the women’s prison were, in fact, women.
“A lot of men in the women’s prison, which was disturbing,” Mohrman said, referring to the blatant inaccuracies that showed up in the data.
If a person were to try to use the census data, for instance, to look at the demographics of their immediate neighborhood, they might have difficulty. To grossly oversimplify: 1,000 people living in a neighborhood may in reality all be older Hispanic women, but their census block could say that the majority of the people living there in 2020 were Black men in their 20s.
Meanwhile, a nearby block of a thousand Black men in their 20s might see the opposite in their 2020 data: the Census Bureau has declared that most of their immediate neighbors are older Hispanic women. The respondents all exist, but their characteristics have been swapped between geographic areas.
In reality, the noise added to census data is more complicated and spread out over much larger areas, but the end result could be the same: it would not be possible for either group to accurately understand the demographics of their census block. These issues appear to be most pronounced in racially homogeneous communities, meaning that researchers and policymakers will have a harder time identifying communities that skew older or younger, communities of color, or communities made up mostly of one gender.
The data becomes significantly more accurate at the next level of geographic size: the “block group.” While individual blocks may have their demographic data radically scrambled, they add up to roughly the correct totals when combined into these larger block groups. There is still some noise, but not as dramatic as at the block level.
“I expect most of the time when you get into larger populations the data will be reasonably accurate, accurate enough to do somebody’s work,” Mohrman said. “A lot of it depends on the size of the population of the area.”
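Why block-group totals hold up even when individual blocks are scrambled can be sketched with a toy simulation. The block populations, the uniform noise, and its magnitude below are all invented for illustration — the bureau’s actual mechanism draws noise differently and is far more complex — but the cancellation effect is the same: errors that are enormous relative to a single small block largely offset one another when the blocks are summed.

```python
import random

random.seed(0)

# Hypothetical true populations for the blocks inside one block group.
blocks = [120, 35, 480, 60, 210, 15, 90, 300]
scale = 50       # illustrative noise magnitude, not the bureau's
trials = 5_000

avg_block_error = 0.0
avg_group_error = 0.0
for _ in range(trials):
    noise = [random.uniform(-scale, scale) for _ in blocks]
    # Average relative error of each individual block.
    avg_block_error += sum(abs(n) / b for n, b in zip(noise, blocks)) / len(blocks)
    # Relative error of the block-group total: positive and negative
    # noise largely cancel when the blocks are summed.
    avg_group_error += abs(sum(noise)) / sum(blocks)

print(f"block-level relative error:  {avg_block_error / trials:.1%}")
print(f"block-group relative error: {avg_group_error / trials:.1%}")
```

In this sketch the per-block error runs to roughly 40% on average (and far worse for the block of 15 people), while the block-group total stays within a few percent of the truth.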
Along with adding noise to demographic data, the Census Bureau’s differential privacy algorithms mess around with the population numbers themselves. The total population at the state level is the only statistic fully immune to this, and those numbers are exactly the same as what the Census Bureau collected. But at the block level, Block A with 5 residents may show up in census data as having 10, while Block B with 20 residents may show up as having 15.
The effect of this small tweaking can be radically inaccurate distributions of population, with rural areas reported as having significantly higher populations than in reality, and urban areas as having significantly lower ones.
One of the causes of this, Mohrman said: the Census Bureau doesn’t like negative numbers.
If Block A has 1,000 people, and the Census Bureau’s algorithm wants to add or subtract 500 to introduce noise, there’s no issue: the census could report that Block A has either 500 or 1,500 people. Across enough blocks, that noise averages out to roughly zero.
But if Block B has only 200 people, that system breaks down. The algorithm can add 500 residents just fine, but it cannot subtract more than 200. That means, on average across these mostly-empty blocks, the population is going to increase.
Since the state population is fixed, the population added to relatively empty areas has to come from somewhere, and, on average, it comes from the most populated areas.
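The asymmetry Mohrman describes can be made concrete with a toy simulation. The uniform noise and the simple clip-at-zero step below are illustrative stand-ins — the bureau’s actual post-processing is considerably more elaborate — but the direction of the bias is the same: small blocks drift upward, and because state totals are held fixed, that gain has to be taken from larger blocks.

```python
import random

random.seed(42)

def noisy_count(true_count, scale=500):
    """Add zero-mean noise, then clip at zero: a block cannot be
    published with negative residents."""
    return max(0, true_count + random.uniform(-scale, scale))

trials = 10_000

# A large block can absorb the full negative swing, so errors cancel.
big = sum(noisy_count(1_000) for _ in range(trials)) / trials

# A small block cannot: noise that would push it below zero is
# clipped, so its average published count drifts upward.
small = sum(noisy_count(200) for _ in range(trials)) / trials

print(f"block with 1,000 people averages about {big:.0f} after noise")
print(f"block with 200 people averages about {small:.0f} after noise")
```

In this sketch the 1,000-person block averages out near its true count, while the 200-person block averages meaningfully above 200 — the systematic inflation of sparsely populated areas that Mohrman flagged.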
“A lot of it depends on the size of the population of the area,” Mohrman said. “The devil’s in the details.”
More demographic data is expected out later this year, including age, gender, housing and other information. With each new layer of granularity, Mohrman worries that the problems with the differential privacy system will compound.
“With the privacy protections, every time you start getting into more demographic detail, there’s more likelihood you’re going to see some impacts,” Mohrman said.