Assessing Polling Data
What to look for
I had an interesting experience yesterday while preparing my post on recent survey data. A poll was circulating on Bluesky with shocking results suggesting that a majority of Americans support impeachment. But the graphics attached to the post had no source, the person who posted them had an ambiguous biography (they previously ran a podcast that hadn't posted in years), and the poll was not linked anywhere. I could not find the original survey even through multiple searches.
Americans (likely voters, to be precise) support impeachment by 52% in our most recent survey. This includes 55% of independents. And the plot thickens when you look at how they responded to other questions...đź§µ
— Anat Shenker-Osorio (@anatosaurus.bsky.social) April 23, 2025 at 7:29 AM
I asked the person who they were and where they got the results, but they did not respond. Other people took sides, either liking my posts or telling me how rude it was to ask these questions. When I woke up this morning, I thought to myself: I still don't know if that poll was real! And I still don't know.
What happened?
I probably won't find out what happened. Was the person irritated by all of the chatter the post created? Did they get too many notifications? Or do they not check their account very often, and so didn't know that people were asking for more details? When I looked this morning, many others had asked about sourcing without being answered.
The post has been reposted 130 times. And when I dug deeper into the author's bio, it seems possible that she does have access to exclusive polling data (she is a Democratic strategist). But that still doesn't excuse posting data without any information for confirming its validity, especially on a subject as impactful as impeachment.
So, when someone reports polling data to you, what do you want to know?
Random sampling - The most important thing to know is whether there was an attempt at random sampling. In a random sample, respondents are chosen by chance from the larger population so that the sample can stand in for that population. In this case, polling is an attempt to understand the views of the American people. You can look at census data to get a rough sense of the basic characteristics of the US population. For example, the US population is 50.5% female. Ideally, a representative poll would reflect this, and 50.5% of respondents would be female. If the poll is related to an election, you would want to work from the universe of likely voters (rather than Americans in general, since many Americans don't vote).
If a poll used random sampling, we can extrapolate to the broader population. In that case, you might say something like "polling data suggests that X% of likely voters favor [insert the results]..." If the poll was not randomly sampled, we discuss the results in a different way. Instead of saying "50% of Americans or likely voters report..." we would say "50% of respondents say..." to indicate that we can't make broader generalizations from the data. Whenever someone is making generalizations from a survey, look at the methodology to see whether it was randomly sampled. You can only generalize from a randomly sampled survey.
Here is a real survey from Data for Progress. Look at the methodological description in the first few paragraphs. You will see that they explicitly say "The sample was weighted to be representative of likely voters by age, gender, education, race, geography, and recalled presidential vote." That means you could generalize from this data to likely voters.
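If you want to see what weighting looks like in practice, here is a minimal sketch in Python with made-up numbers (the respondents, the policy question, and the single gender variable are all hypothetical; real pollsters such as Data for Progress weight on several variables at once). The idea is simply that respondents from under-represented groups count for a bit more and those from over-represented groups count for a bit less:

```python
import pandas as pd

# Hypothetical respondents; "F" makes up 40% of this raw sample,
# but the target population (per census data) is 50.5% female.
sample = pd.DataFrame({
    "respondent_id": range(1, 11),
    "gender": ["F", "F", "F", "F", "M", "M", "M", "M", "M", "M"],
    "supports_policy": [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
})

population_targets = {"F": 0.505, "M": 0.495}  # assumed census shares
sample_shares = sample["gender"].value_counts(normalize=True)

# Each respondent's weight = target share / sample share for their group,
# so under-represented groups count for more and over-represented ones for less.
sample["weight"] = sample["gender"].map(
    lambda g: population_targets[g] / sample_shares[g]
)

unweighted = sample["supports_policy"].mean()
weighted = (sample["supports_policy"] * sample["weight"]).sum() / sample["weight"].sum()
print(f"Unweighted support: {unweighted:.1%}, weighted support: {weighted:.1%}")
```

In this toy sample, women are under-represented at 40% of respondents, so weighting them up to the 50.5% share nudges the headline number from 50% to about 54%.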
Sample size - The sample size is the number of people who were surveyed, often abbreviated "N." You need a large enough N for your survey to be a reasonable representation of the group you are attempting to describe. I'm not the right person to explain exactly how big a sample needs to be to represent a given population, but I can tell you that the larger your sample size, the smaller your error.
In the survey cited above, they clarify "N=555 unless otherwise specified." That means the tables that follow reflect results from 555 Democratic-leaning respondents unless a specific table indicates otherwise. The bottom row of each table provides the N broken down by category.
Sampling Error - The Australian Bureau of Statistics has a great description of error in survey data. It explains, "Where there is a discrepancy between the value of the survey estimate and true population value, the difference between the two is referred to as the error of the survey estimate." So if a survey reports that 54% of people hold a particular view, but the real number is actually 56%, then that survey has an error of 2 percentage points. Of course, there is no way to know the true value of what people feel! That's why we use surveys in the first place. So the error that is reported is estimated based on the sample size and its relationship to the population of interest. The larger the sample size, the smaller the error is likely to be (among other factors).
In the survey cited above, the error is estimated at plus or minus four percentage points. That means the true value could fall anywhere in an eight-point range around the reported figure. Since the cited survey had only 555 respondents drawn from the tens of millions of registered Democrats nationwide, you would expect a larger error value than for a survey with a higher number of respondents.
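To make the sample-size point concrete, here is a back-of-the-envelope sketch in Python using the textbook formula for a simple random sample (real pollsters also account for weighting and design effects, so their reported margins can differ a little). For a result near 50%, the 95% margin of error shrinks roughly with the square root of the sample size, and plugging in N=555 lands close to the plus-or-minus four points reported above:

```python
import math

def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (250, 555, 1000, 2000):
    print(f"N = {n:>4}: +/- {margin_of_error(n):.1%}")
```

Quadrupling the sample size roughly halves the error, which is why bigger polls can report tighter margins.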
Quality of Data - Error can be introduced into surveys in a variety of other ways. This was discussed in great detail after polling for the 2016 election suggested that Hillary Clinton would win, but she didn't. One source of error discussed at the time was the quality of the data: it was widely believed that people lied to pollsters in the lead-up to the election. This can occur when the person being surveyed suspects that the person asking the questions has a particular viewpoint. I once read a fascinating study which found that even the color of the pens carried by pollsters conducting in-person surveys in Nicaragua could skew responses: people interpreted the pen colors as evidence of support for one party or another and misreported their true feelings (to appear to agree with the presumed views of the pollster).
How can you assess the quality of data? If the survey was conducted in an atmosphere where people would not feel safe responding, the data is likely of lower quality (people may not have accurately reported their views). Some survey data from authoritarian contexts is generally regarded as less reliable than survey data from democratic contexts (though I know good scholars who disagree).
There have also been findings that in some contexts people do not understand Likert scales. If they are asked to rank something from one to ten, they will pick either one or ten; very few will choose a number in between. People who are more accustomed to taking surveys are more likely to give a nuanced answer (a 7 or a 4). So the type of questioning can affect how reliable the results are, though this is more of an issue in countries where people are less accustomed to being asked survey questions.
Length of the Poll - Sometimes people responding to polls just get tired and start answering questions rapidly to get a survey finished. They stop saying what they really believe and just try to get off the phone with the pollster. This is common when the poll is quite lengthy. We call this "satisficing." This is one reason why it can be good to look at how long the questionnaire was. The results from lengthier polls are less likely to be accurate than the results from shorter polls.
Date of the Polling - When was the survey conducted? Sometimes a major event happens in the middle of a survey being fielded, so part of the data is from before the event and part is from after it. That is what we would call a natural survey experiment. If there are enough responses from before the event, we can compare the data from before and after to see the event's effect. That's why the graphic I shared in yesterday's post, in which the NYT synthesized a variety of polls, also flagged when major events had taken place that might have affected the polling data being reported. The only way to know whether an outside event is shaping results is to know when a survey was conducted. When I tried to find the impeachment data cited above, everything I found was from 2020 surveys about impeachment, and I knew that data was not relevant.
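As an illustration of that before/after comparison, here is a small Python sketch with invented data (the dates, the event, and the "approves" question are all hypothetical). You simply split respondents by whether they were interviewed before or after the event and compare the two groups:

```python
import pandas as pd

# Hypothetical poll responses collected across a week, with a major
# event occurring mid-fieldwork on an assumed date.
responses = pd.DataFrame({
    "interview_date": pd.to_datetime(
        ["2025-04-20", "2025-04-20", "2025-04-21", "2025-04-21",
         "2025-04-23", "2025-04-23", "2025-04-24", "2025-04-24"]
    ),
    "approves": [1, 0, 1, 0, 0, 0, 1, 0],
})
event_date = pd.Timestamp("2025-04-22")  # assumed mid-survey event

before = responses[responses["interview_date"] < event_date]
after = responses[responses["interview_date"] >= event_date]

# Compare responses among people interviewed before vs. after the event.
print(f"Before the event (n={len(before)}): {before['approves'].mean():.0%} approve")
print(f"After the event  (n={len(after)}): {after['approves'].mean():.0%} approve")
```

With only a handful of respondents on each side, a gap like this would not mean much, which is why you need enough responses from before the event for the comparison to be useful.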
Other interesting news:
As I have been saying since I started this blog, the erratic nature of the DOGE effort will likely cost US taxpayers a significant amount of money. A new estimate by the Partnership for Public Service suggests that the DOGE effort will cost $135 billion in hiring, firing, rehiring, lost productivity, and paid leave, nearly wiping out the estimated $150 billion in savings to taxpayers, while dramatically reducing the services the federal government offers to taxpayers, underfunding scientific research, and bankrupting pro-democracy partners around the world. I assume that when the costs of the legal battles are figured in, the DOGE effort did not save any money and has actually cost money.
The journalist covering the estimate, Elizabeth Williamson, explained that that amount ($135 billion) "is about 15 percent of the $1 trillion he [Musk] pledged to save, less than 8 percent of the $2 trillion in savings he had originally promised and a fraction of the nearly $7 trillion the federal government spent in the 2024 fiscal year."
Post of the day:
Notice how Fox News clarifies that it is reporting results from "registered voters," gives a margin of error of plus or minus three points, and includes the dates of the survey. Good survey hygiene, Fox News!!
Trump’s hits new lows: Fox polls show Trump tanking, China calls out Trump’s tariff lies, Musk and Trump’s Treasury Sec almost brawl, and Santos faces prison time with no Trump pardon in sight. Total meltdown. Catch up now!
— MeidasTouch (@meidastouch.com) April 24, 2025 at 6:43 PM