How to Defend Yourself From Statistical Lies: Part 5
What is a p-value and why should you care?
--
COMING UPDATES: If you’d like to support my continuing to do this, you can send an Amazon giftcard to hollymathnerd at gmail or CashApp me at $HollyMathNerd. Thank you!
My statistical self-defense series has many other topics: go here for the list.
November 2019 update: Bret Weinstein recently recorded a podcast with Katie Herzog in which p-values come up (the whole thing is worth listening to, but the p-value discussion happens sometime in the first twenty or thirty minutes). Excellent discussion.
P-values! P stands for probability. P-values are used all over the place, but the arena where they make the most difference to a non-math-nerd’s life is the evaluation of experimental results. Experiments are the basis for scientific conclusions and thus for the policies implemented in all of our institutions, including government: policies that shape medicine, education, psychology, regulation, and more.
Here’s a fabulous comic about p-values, which I hope you will fully understand by the time you finish this update.

Experiments always have two hypotheses (explanations). The first, the null hypothesis, is always that the thing being tested (the drug, the educational technique, etc.) makes no difference. The second, the alternative hypothesis, is that it does.
The null can be compared to “innocent until proven guilty.” If you have strong evidence, you can reject the null and say “This (drug/educational technique/whatever the experiment is about) makes a difference!” If the evidence is weak, or just not convincing enough, the benefit of the doubt goes to the null and we fail to reject it. It doesn’t mean that the experiment definitely showed no difference, just as defendants found “not guilty” are surely guilty sometimes. It just means that we didn’t have evidence sufficient to make the claim, at least not yet.
Depending on the type of experiment, one of many statistical tests is used. What these statistical tests have in common is that they return a p-value at the end.
For example, there is a statistical test for matched pairs, often used in a before/after scenario to determine whether, say, a weight loss method worked. You may be thinking, “Why do you need a test? Weigh them. If they lost weight, it worked.” Ah, but some of them would have lost weight during that period anyway, for various reasons. At least some of the women would have had PMS at the first weigh-in, and at least some participants, men and women alike, would have celebrated a birthday, a religious holiday, or a child’s or partner’s birthday with a feast in the week prior. If we want to be sure that the weight loss method being tested is the actual reason for the weight loss, we have to allow for the possibility that some of the weight differential is due to these chance/random factors. We need evidence that the change in weight is really due to the weight loss method being tested. The p-value is supposed to tell us how likely it is that we would see a change at least this large by chance alone, if the method actually made no difference.
When the probability of getting these results by chance is very low, we attribute our results to the weight loss method at the center of our experiment.
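For the code-curious, here is a minimal sketch of what such a matched-pairs test can look like in Python, using SciPy’s paired t-test (one common choice for before/after data, though not the only one). The weights below are made-up numbers purely for illustration, not data from any real study.

```python
# Minimal sketch of a matched-pairs (paired) t-test, using made-up numbers.
# Requires numpy and scipy (pip install numpy scipy).
import numpy as np
from scipy import stats

# Hypothetical before/after weights (in pounds) for ten participants.
before = np.array([210, 185, 198, 240, 172, 165, 190, 225, 201, 178])
after  = np.array([204, 183, 191, 231, 170, 166, 184, 219, 196, 175])

# Null hypothesis: the method makes no difference (the average change is zero).
# The paired test looks at each person's own change, not at two separate groups.
t_stat, p_value = stats.ttest_rel(before, after)

print(f"average change: {np.mean(before - after):.1f} lbs")
print(f"p-value: {p_value:.4f}")
# If p_value is below the threshold chosen in advance (say .05), we reject the null.
```

The only point here is that the test hands you back a p-value, which you then compare to the threshold you chose before looking at the data.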
Another example: in an ideal experiment for a drug treatment, randomly selected participants are split into two large groups (the larger, the better) with roughly equal numbers of men and women; one group gets the real drug and the other gets a placebo. (In some cases there are four groups: men and women in the placebo group, and men and women in the treatment group.) Neither the people taking the drug nor the experimenters who work directly with them know who is getting the real drug and who is getting the placebo. (This prevents the experimenters from inadvertently treating the placebo and treatment groups differently and thus unconsciously biasing the results.) If our test shows a big difference between the group(s) that got the real drug and the group(s) that got the placebo, and that difference comes with a p-value lower than the threshold set for our experiment, we say the drug worked, and the regulatory body moves it along the process that gets it to market.
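Again purely as illustration, here is a rough sketch of the kind of comparison that might be run on the two groups, using a two-sample (Welch’s) t-test in Python. Real trials follow whatever analysis their protocol specifies; the group sizes and outcome numbers below are invented.

```python
# Minimal sketch: comparing a treatment group to a placebo group.
# The outcome numbers below are randomly generated for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Hypothetical outcome measurements (e.g., symptom scores) for the two groups.
placebo   = rng.normal(loc=50, scale=10, size=100)  # no drug effect
treatment = rng.normal(loc=46, scale=10, size=100)  # modest improvement

# Null hypothesis: the drug makes no difference between the group means.
# Welch's t-test (equal_var=False) does not assume the groups have equal variance.
t_stat, p_value = stats.ttest_ind(treatment, placebo, equal_var=False)

alpha = 0.05  # threshold chosen before seeing the data
print(f"p-value: {p_value:.4f}, significant at {alpha}: {p_value < alpha}")
```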
IMPORTANT QUESTION: WHAT COUNTS AS A LOW P-VALUE?
The most common threshold used in these experiments is p=.05, or 5%. If a result would have happened by chance alone less than 5% of the time, we call it “statistically significant.”
Where did 5% come from?
It’s entirely arbitrary. Yes, I’m serious.
Is 5% a good number? Many people think so and many others disagree.
Here are a few ways to think about 5%:
A) Flip a coin. There’s a 50% chance of getting tails once, a 25% chance of getting two tails in a row, a 12.5% chance of getting three tails in a row, a 6.25% chance of getting four in a row, and about a 3% chance (3.125%, to be exact) of getting five. Getting five tails in a row may seem unusual, but get a coin and try it, or run the little simulation sketch after this list. See how long it takes you to get five heads or five tails in a row. (It’s not hard.)
B) Something that happens about 1 time out of every 20 happens 5% of the time. As a bit of a data nerd, I keep records of many small things in my life. Here are a few things that happen to me roughly 5% of the time, 1 out of every 20:
— The weather forecast regarding rain is wrong and I don’t have an umbrella when I need one OR I carry one unnecessarily.
— Walmart has an open check lane with no waiting.
— I hit all green lights on the way to my therapist’s office.
Anything that happens more rarely than these three events in my life has a probability of less than 5%.
C) 5% is slightly more than 18 out of 365. Something that happens on 18 days out of the year happens about 5% of the time.
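If you’d rather let a computer do the coin flipping from (A), here’s a quick simulation sketch in Python. The helper function is just something I made up for illustration; there is nothing standard about it.

```python
# Minimal sketch: how long does it take to flip five identical results in a row?
# Just an illustration of how "unlikely" a roughly 3% event really is.
import random

def flips_until_streak(streak_len=5):
    """Flip a fair coin until the last `streak_len` flips are all identical."""
    last = None
    run = 0
    flips = 0
    while run < streak_len:
        flip = random.choice("HT")
        flips += 1
        run = run + 1 if flip == last else 1
        last = flip
    return flips

trials = [flips_until_streak() for _ in range(10_000)]
print(f"average flips needed: {sum(trials) / len(trials):.1f}")
# Typically only around thirty flips on average; five in a row is not a rare sight.
```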
Many people, especially those dedicated to greater transparency and reform in psychology, strongly believe that p=.05 is bullshit. They insist on p=.005. This is a much tougher standard. It means that for them to reject the null and say that their experiment had a significant result, their results have to be so unlikely that they would only happen 5 times out of 1,000 by chance.
WHAT IS P-HACKING?
P-hacking is a term for manipulating the way you conduct or analyze your experiment in order to get a p-value under .05. This increases your chances of getting published, getting grants, and having your method (your drug, whatever it is that’s being tested) recognized as a big deal. It happens when the analysis is chosen for whatever is most likely to make the experiment show significance, not for what is the best analysis for that type of experiment.
Here are some of the many ways to p-hack:
1.) Choose your sample in a non-random way. With psychological experiments in particular, it’s important to get truly random samples. If you just recruit subjects in one particular way, your sample can get badly skewed. If you are doing a test where introverts or extroverts would score differently, is a flier on a bulletin board at a bookstore more likely to get you introverts? A flier on a bulletin board at a comedy club? If you are testing a drug, should your recruiting be done strictly through the offices of doctors who stand to make more money (sometimes, lots more money) if the drug is approved?
2.) Do too many tests. In the comic about jelly beans, they tested 20 types of jelly beans. At p=.05, each individual test has a 1-in-20 chance of coming up significant by chance alone, so across 20 tests we should expect about one false “discovery.” Which is just what they got! P-hackers may be so desperate to prove their pet theory has significance that they over-test in order to find something that will get them published. There are statistical methods to correct for this. For example, a Bonferroni correction divides the significance threshold by the number of tests, which guards against the problem: in the comic strip example, .05 divided by 20 is .0025, and with .0025 as the new threshold, green jelly beans would not have been significant. Asking whether or not a Bonferroni correction was used is one of the best ways to determine how seriously to take a result. (There’s a small simulation sketch after this list.)
3.) Have a vague, poorly defined outcome. What is your definition of a change? In the weight loss example, suppose you went into it saying that a 10-pound weight loss would be your definition of victory. At a 10-pound loss, your results come out at p=.07, but at an 8-pound loss, p=.03! Weight loss is a bajillion-dollar industry in the US. Might you not be tempted to re-define success as 8 pounds?
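Here is the small simulation sketch promised in point 2, showing why too many tests is a problem and how a Bonferroni-style correction helps. Everything in it is pure random noise (there is genuinely nothing to find), so every “significant” result it counts is a false positive.

```python
# Minimal sketch: twenty tests of pure noise, with and without a Bonferroni correction.
# All data are random noise, so there is no real effect to discover.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n_tests, alpha = 20, 0.05
bonferroni_alpha = alpha / n_tests  # .05 / 20 = .0025

false_positive_runs = 0
corrected_runs = 0
n_experiments = 1_000

for _ in range(n_experiments):
    # Twenty "jelly bean colors," each compared against a control; no true difference.
    p_values = [
        stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
        for _ in range(n_tests)
    ]
    if min(p_values) < alpha:
        false_positive_runs += 1  # at least one bogus "discovery" at .05
    if min(p_values) < bonferroni_alpha:
        corrected_runs += 1       # bogus "discovery" even after the correction

print(f"runs with a false positive at .05:   {false_positive_runs / n_experiments:.0%}")
print(f"runs with a false positive at .0025: {corrected_runs / n_experiments:.0%}")
# Expect roughly 64% without the correction and about 5% with it.
```

The uncorrected number lands near 64% because the chance of at least one fluke among 20 independent tests at .05 is 1 minus .95 to the 20th power; the corrected number stays close to 5%.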
HOW CAN YOU TELL WHETHER TO TAKE AN EXPERIMENTAL RESULT SERIOUSLY?
- Ask what p-value threshold was used. A threshold of p=.2 or p=.1 is worth little compared to p=.05, and p=.005 is better still.
- Ask if the result has been replicated. If a p-value is below .05 (and especially if it’s below .005) and the experiment has gotten that result more than once, now we’re starting to look at something you can have real confidence in.
- Ask if the result was pre-registered. Pre-registration means that the null and alternative hypotheses, the definition of success (10-pound loss vs. 8-pound loss in our weight loss example), and the p-value threshold that will be used have all been publicly committed to in advance. This is an important safeguard against p-hacking, and pre-registered results show a commitment to transparency and honesty that deserves more confidence than results that were not pre-registered.
SHOULD WE USE P-VALUES AT ALL?
I wrote a long section about this but the non-math-people I ran it by found it both hard to understand and irrelevant, since almost everyone uses them anyway. My personal opinion is that they are rather like opioid drugs: used in many circumstances when other methods would be better, but they do have their uses and should not be tossed out entirely.