frequentism

Book review: Error and the Growth of Experimental Knowledge by Deborah Mayo.

This book provides a fairly thoughtful theory of how scientists work, drawing on Popper and Kuhn while improving on them. It also tries to describe a quasi-frequentist philosophy (called Error Statistics, abbreviated as ES) which poses a more serious challenge to the Bayesian Way than I’d seen before.

Mayo’s attacks on Bayesians focus more on subjective Bayesians than on objective Bayesians, and they expose some real problems with the subjectivists’ willingness to treat arbitrary priors as valid. The criticisms that apply to objective Bayesians (such as E.T. Jaynes) helped me understand why frequentism is taken seriously, but didn’t convince me to change my view that the Bayesian interpretation is more rigorous than the alternatives.

Mayo shows that much of the disagreement stems from differing goals. ES is designed for scientists whose main job is generating better evidence via new experiments. ES uses statistics for generating severe tests of hypotheses. Bayesians take evidence as a given and don’t think experiments deserve special status within probability theory.

The most important difference between these two philosophies is how they treat experiments with “stopping rules” (e.g. tossing a coin until it produces a pre-specified pattern instead of doing a pre-specified number of tosses). Each philosophy tells us to analyze the results in ways that seem bizarre to people who only understand the other philosophy. This subject is sufficiently confusing that I’ll write a separate post about it after reading other discussions.
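A toy calculation makes the conflict concrete. Here’s a minimal sketch (my example, not one from the book) in which the same data, 3 heads in 12 tosses, yield different p-values against a fair coin depending on the stopping rule, while the likelihood that drives a Bayesian update is identical either way:

```python
from scipy import stats

heads, tosses = 3, 12

# Stopping rule A: toss exactly 12 times (binomial sampling).
# One-sided p-value against a fair coin: probability of 3 or fewer heads.
p_fixed_n = stats.binom.cdf(heads, tosses, 0.5)

# Stopping rule B: toss until the 3rd head appears (negative binomial
# sampling). p-value: probability of needing 12 or more tosses, i.e.
# 9 or more tails before the 3rd head.
p_stop_rule = stats.nbinom.sf(tosses - heads - 1, heads, 0.5)

print(f"p-value, fixed 12 tosses: {p_fixed_n:.4f}")    # ~0.073, not "significant"
print(f"p-value, stop at 3 heads: {p_stop_rule:.4f}")  # ~0.033, "significant"

# The likelihood of any coin bias p is proportional to p**3 * (1-p)**9
# under BOTH stopping rules, so a Bayesian update is unaffected.
def likelihood(p):
    return p**heads * (1 - p)**(tosses - heads)

print(f"likelihood ratio, p=0.5 vs p=0.25: {likelihood(0.5) / likelihood(0.25):.3f}")
```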

She constructs a superficially serious disagreement where Bayesians say that evidence increases the probability of a hypothesis while ES says the evidence provides no support for the (Gellerized) hypothesis, i.e. one constructed after the fact to fit the data. Objective Bayesians seem to handle this via priors which reflect the use of old evidence. Marcus Hutter describes a general solution in his paper On Universal Prediction and Bayesian Confirmation, but I’m concerned that Bayesians may be more prone to mistakes in implementing such an approach than people who use ES.
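Here’s a minimal sketch (my illustration, using a crude description-length prior, not Hutter’s actual formalism) of how such a prior neutralizes Gellerizing: a hypothesis rigged to fit the data gains in likelihood exactly what it loses in prior, so it earns no net support:

```python
# Give each hypothesis a prior weight of 2**(-description_length), in the
# spirit of universal priors. All lengths here are made up for illustration.
data_bits = 20                       # we observed a 20-bit sequence of tosses
simple_len = 5                       # short description, e.g. "fair coin"
rigged_len = simple_len + data_bits  # a rigged hypothesis must encode the data

def log2_score(description_length, log2_likelihood):
    """log2 of (prior * likelihood), with prior = 2**-description_length."""
    return -description_length + log2_likelihood

simple = log2_score(simple_len, -data_bits)  # fair coin: P(data) = 2**-20
rigged = log2_score(rigged_len, 0)           # rigged model: P(data) = 1

print(simple, rigged)  # -25 -25: the Gellerized hypothesis gets no net boost
```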

Mayo occasionally dismisses the Bayesian Way as wrong due to what look to me like differing uses of concepts such as evidence. The Bayesian notion of very weak evidence seems wrong given her assumption that her concept of scientific evidence is the “right” one. This kind of confusion makes me wish Bayesians had invented a different word for the non-prior information that gets fed into Bayes’ Theorem.

One interesting and apparently valid criticism Mayo makes is that Bayesians treat the evidence they feed into Bayes’ Theorem as if it had a probability of one, contrary to the usual Bayesian mantra that all data have a probability and that using zero or one as a probability is suspect. This is clearly just an approximation for ease of use. Does it cause problems in practice? I haven’t seen a good answer to this.
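There is at least one standard way to relax the approximation: Jeffrey conditionalization, which updates on evidence that is itself uncertain. A minimal sketch with made-up numbers:

```python
# Compare a standard Bayesian update, which treats the evidence E as
# certain, with Jeffrey conditionalization, which only moves P(E) to 0.95.
p_h = 0.30              # prior probability of hypothesis H
p_e_given_h = 0.90      # P(E | H)
p_e_given_not_h = 0.20  # P(E | not H)

p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
p_h_given_e = p_e_given_h * p_h / p_e
p_h_given_not_e = (1 - p_e_given_h) * p_h / (1 - p_e)

print(f"P(H | E), evidence certain:      {p_h_given_e:.3f}")

# Jeffrey's rule: weight the two conditional posteriors by the new P(E).
q_e = 0.95
p_h_jeffrey = p_h_given_e * q_e + p_h_given_not_e * (1 - q_e)
print(f"P(H), evidence only 95% certain: {p_h_jeffrey:.3f}")
```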

Mayo claims that ES can apportion blame for an anomalous test result (does it disprove the hypothesis? or did an instrument malfunction?) without dealing with prior probabilities. For example, in the classic 1919 eclipse test of relativity, supporters of Newton’s theory agreed with supporters of relativity about which data to accept and which to reject, whereas Bayesians would have disagreed about the probabilities to assign to the evidence. If I understand her correctly, this also means that if the data had shown light being deflected at a 90-degree angle to what both theories predict, ES scientists wouldn’t have looked any harder for instrument malfunctions.

Mayo complains that when different experimenters reach different conclusions (due to differing experimental results), “Lindley says all the information resides in an agent’s posterior probability”. This may be true in the unrealistic case where each experimenter perfectly incorporates all relevant evidence into their priors. But a much better Bayesian way to handle differing experimental results is to locate the information created by experiments in the likelihood ratios they produce.
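A minimal sketch of that point (my construction, not from the book): agents with very different priors extract the same information, the likelihood ratio P(data|H)/P(data|not H), from each experiment, and combining experiments just means multiplying those ratios:

```python
import math

# Hypothetical likelihood ratios reported by three labs for the same H.
lab_results = [3.0, 0.8, 5.0]
combined_lr = math.prod(lab_results)

# Two agents with very different prior odds on H still agree about what the
# experiments contributed; only their starting points differ.
for prior_odds in (0.25, 4.0):
    posterior_odds = prior_odds * combined_lr
    print(f"prior odds {prior_odds}: posterior P(H) = "
          f"{posterior_odds / (1 + posterior_odds):.3f}")
```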

Many of the disagreements could be resolved by observing which approach to statistics produces better results. The best Mayo can do along these lines is mention an obscure claim by Peirce that Bayesian methods had a consistently poor track record in (19th century?) archaeology. I’m disappointed that I haven’t seen a good comparison of more recent uses of the competing approaches.

Book review: The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives by Stephen Ziliak and Deirdre McCloskey.
This book provides strong arguments that scientists often use tests of statistical significance as a ritual that substitutes for thought about how hypotheses should be tested.
Some of the practices they criticize are clearly foolish, such as treating data which fall slightly short of providing statistically significant evidence for a hypothesis as reason for concluding the hypothesis is false. But for other practices they attack, it’s unclear whether we can expect scientists to be reasonable enough to do better.
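A made-up example of why the first practice is foolish: a study can fall just short of significance while its confidence interval still covers effects large enough to matter, so “not significant” is a long way from “no effect”:

```python
from scipy import stats

# Hypothetical study: estimated effect 1.9 with standard error 1.0.
effect, se = 1.9, 1.0
p_value = 2 * stats.norm.sf(effect / se)        # two-sided test of "no effect"
ci_low, ci_high = effect - 1.96 * se, effect + 1.96 * se

print(f"p-value: {p_value:.3f}")                  # ~0.057, just misses 0.05
print(f"95% CI:  ({ci_low:.2f}, {ci_high:.2f})")  # (-0.06, 3.86): zero is barely
                                                  # inside, but so are effects
                                                  # twice the estimate
```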
Much of the book is a history of how this situation arose. That might be valuable if it provided insights into what rules could have prevented the problems, but it is mainly devoted to identifying heroes and villains. It seems strange that economists would pay so little attention to incentives that might be responsible.
Instead of blaming the problems primarily on one influential man (R.A. Fisher), I’d suggest asking what distinguishes the areas of science where the problems are common from those where they are largely absent. It appears that the problems are worst in areas where acquiring additional data is hard and where powerful interest groups might benefit from false conclusions. That leads me to wonder whether scientists are reacting to a risk that they’ll be perceived as agents of drug companies, political parties, etc.
The book sometimes mentions anti-commercial attitudes among the villains, but fails to ask whether that might be a symptom of a desire for “pure” science that is divorced from real world interests. Such a desire might cause many of the beliefs that the authors are fighting.
The book does not adequately address the concern that scientists in those fields are sufficiently vulnerable to corruption that abandoning easily applied rules would leave us with less accurate conclusions.
The authors claim the problems have been getting worse, and show some measures by which that seems true. But I suspect their measures fail to capture some improvement that has been happening as the increasing pressure to follow the ritual has caused papers that would previously have been purely qualitative to use quantitative tests that reject the worst ideas.
The book seems somewhat sloppy in its analysis of specific examples. When interpreting data from a study where scientists decided there was no effect because the evidence fell somewhat short of statistical significance, it claims the data show “St. John’s-wort is on average twice as helpful as the placebo”. But the data would support that claim only if the remission rate with no treatment at all were known to be zero, and it’s likely that some or all of the alleged placebo effect was due to effects unrelated to treatment. Their use of the word “show” also suggests stronger evidence than the data provide.
I’ll close with two quotes that I liked from the book:

The goal of an empirical economist should not be to determine the truthfulness of a model but rather the domain of its usefulness – Edward E. Leamer

The probability that an experimental design will be replicated becomes very small once such an experiment appears in print. – Thomas D. Sterling

Book Review: The Last Well Person: How to Stay Well Despite the Health-care System by Nortin M. Hadler
There appears to be a large discrepancy between how effective most people think modern medical practices are and the evidence experts have presented suggesting that medicine does very little to extend life. This book gives the impression of describing a pattern of ineffective or harmful practices that might be offsetting the benefits of the practices that are known to work. But there are enough flaws in his arguments that I can’t decide how many of his conclusions I should accept.
He starts by saying he’s a Popperian, but often acts like he’s following some other, more dogmatic, philosophy. I’m particularly annoyed by his confident feeling that death by about age 85 is inevitable:

I am aware of no data to support the premise that we can alter the date of death. … When high-functioning octogenarians decline, it is because their time is approaching.

He begins the main argument with a plausible claim that many people get cardiovascular surgery when there’s no evidence that it will benefit them (and when it’s likely to create some risks).
But starting in the next chapter it becomes easy to find flaws in his arguments. He raises some plausible doubts about the evidence for statins, but then tries to imply that because the available (imperfect) evidence shows that fewer than 2% of people who are prescribed statins will benefit, we should doubt that those people ought to take them.
He presents evidence that prostate cancer treatments save fewer lives than is commonly thought; it appears that the treatment sometimes merely changes the cause of death to something else. Yet he concludes that the treatment is useless, even though the data he presents indicate nontrivial benefits. He hints that the evidence doesn’t meet the usual standard of statistical significance, but feels comfortable concluding (without even saying how close it comes to significance) that the lack of proof is strong evidence of ineffectiveness.
He has a somewhat interesting proposal that the final phase of drug testing be done by the FDA rather than by drug companies. If the FDA were run by angels, that would solve a number of problems with the existing regulatory incentives, but with an FDA run by humans it would replace them with new problems. For instance, the choice of which drugs to test is something that only a few special-interest voters (mainly those working for large drug companies) would understand, so their interests would be likely to influence those choices to the companies’ benefit.