Book review: Notes on a New Philosophy of Empirical Science (Draft Version), by Daniel Burfoot.
Standard views of science focus on comparing theories by finding examples where they make differing predictions, and rejecting the theory that made worse predictions.
Burfoot describes a better view of science, called the Compression Rate Method (CRM), which replaces the “make prediction” step with “make a compression program”, and compares theories by how much they compress a standard (large) database.
These views of science produce mostly equivalent results(!), but CRM provides a better perspective.
Machine Learning (ML) is potentially science, and this book focuses on how ML will be improved by viewing its problems through the lens of CRM. Burfoot complains about the toolkit mentality of traditional ML research, arguing that the CRM approach will turn ML into an empirical science.
This should generate a Kuhnian paradigm shift in ML, with more objective measures of research quality than any branch of science has achieved so far.
Burfoot focuses on compression as encoding empirical knowledge of specific databases / domains. He rejects the standard goal of a general-purpose compression tool. Instead, he proposes creating compression algorithms that are specialized for each type of database, to reflect what we know about topics (such as images of cars) that are important to us.
Benefits
- Unambiguous evaluation: Hypotheses are evaluated by quantifying how much compression they achieve on a standard database. (The size of the software needed for decompression is included in the compression measure; see the sketch after this list.)
- Designed to work on unlabeled data: ML research is constrained, in part, by the cost of producing labeled training data. CRM emphasizes the benefits of unsupervised learning on unlabeled data.
- Scientific fraud becomes much harder.
- Avoids overfitting.
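To make the evaluation rule concrete, here is a minimal sketch (mine, not Burfoot's) of how a CRM-style score might be computed. It uses zlib as a stand-in for a domain-specific compressor, and treats the decompression program as a file whose size on disk is simply added to the compressed size; a real CRM evaluation would fix the benchmark database and the rules for counting program size in advance.

```python
import os
import zlib  # stand-in for a hypothetical domain-specific compressor


def crm_score(decompressor_path: str, benchmark: bytes) -> int:
    """Total code length in bytes: decompressor program size + compressed data size."""
    program_size = os.path.getsize(decompressor_path)      # size of the theory's software
    compressed_size = len(zlib.compress(benchmark, 9))     # size of the compressed database
    return program_size + compressed_size


# The hypothesis with the smaller total wins, regardless of who proposed it.
```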
These benefits seem real, but Burfoot exaggerates them. He claims that fraud and manual overfitting “cannot occur” with CRM.
Yet I’m sure there will still be some fraud with CRM. For example, people will try to cheat by hiding code in their software that connects to an external database.
But when I tried to produce examples of overfitting while using the CRM approach, I discovered that I kept drifting back into methods that fell somewhere in between CRM and traditional science.
That convinced me to replace my initial reaction of “CRM is good, but not very novel” with a “that’s harder and stranger than I expected” reaction.
Lossless versus lossy compression
Burfoot focuses on lossless compression. Yet it seems much more natural to me to use lossy compression.
Lossy compression discards noise and low-value information from datasets, in order to focus on the most valuable information. Traditional science does that, to produce insights that are simple enough for humans to understand. Human brains use lossy compression, due both to the resource costs of more accurate compression and to the difficulty of evolving more accurate neurons. Machine learning research produces compression that keeps more information than brains or traditional science keep, but the most valuable approaches still use the equivalent of lossy compression.
Lossless compression seems harder to implement, but it offers a valuable improvement in how objectively we can measure the quality of our hypotheses.
I was puzzled, when I reached the end of the book, by Burfoot’s failure to comment on this tradeoff. Then it occurred to me that I could convert any good lossy compression algorithm into a good lossless compression algorithm. Now that I look, I see that the book contains hints to this effect, but it somehow diverted my attention away from this possibility.
My intuition tells me that such a conversion might be trivial to a theorist. I can imagine that it will someday be sufficiently automated that it is trivial in practice. But it seems sufficiently alien to mainstream compression goals that it deserves some comment in a book such as this.
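The conversion I had in mind is the standard residual trick: store the lossy model's output plus a losslessly compressed residual (the difference between the original data and the lossy reconstruction). Below is a minimal sketch, assuming a hypothetical `lossy_model` object; it is an illustration of the idea, not anything from the book.

```python
import zlib

import numpy as np


def lossless_from_lossy(data: np.ndarray, lossy_model) -> bytes:
    """Turn a lossy model into a lossless code by compressing its residual.

    `lossy_model` is a hypothetical object with `reconstruct` and `serialize`
    methods; a real implementation would also need to record array shapes and
    dtypes so a decoder could invert these steps exactly.
    """
    approx = lossy_model.reconstruct(data)         # lossy approximation of the data
    residual = (data - approx).astype(np.int32)    # what the lossy model missed
    # If the model captures real structure, the residual is mostly small
    # values and compresses far better than the raw data does.
    return lossy_model.serialize() + zlib.compress(residual.tobytes(), 9)
```

The better the lossy model, the smaller the residual, so a good lossy compressor yields a good lossless one with little extra machinery.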
Overfitting
That doesn’t fully answer my doubts about overfitting. Burfoot’s main argument seems to be that the CRM approach uses large databases, whereas traditional ML approaches use only labeled data, which can only be produced in much smaller quantities.
I’m pretty concerned about overfitting to stock market data. That’s somewhat atypical, in that there’s an obvious way to automatically label the most important data (stock prices can be labeled by how much they rise over some arbitrary time period). The main problem is that the data provide only a few independent pieces of evidence, generated by a somewhat malicious sampling process, for the features I care about.
For example, the dot-com bubble shows up in my database as if such bubbles happened once every two decades. But based on hard-to-quantify history books, I’m guessing that similar [1] bubbles happen more like once a century.
I’m confused about how the CRM approach is supposed to help me avoid overfitting on that data. My guess is that if I’m careful, I’ll find multiple hypotheses that provide indistinguishable (and unimpressive) amounts of compression. Or maybe I will fail to find hypotheses that produce any compression. Maybe a good hypothesis would require using a much larger database of human behavior, which would produce a very general model of human minds, and enlighten us about market bubbles as a minor side-benefit.
In spite of these doubts, I expect the CRM approach will help me if I can find a practical way to combine my intuitions with some sort of unsupervised learning.
Related ideas from other authors
This book builds on ideas that have been floating around for a while, and in many cases it isn’t obvious where the ideas originated. Here are two sources that Burfoot points to, plus two that I’ve happened to notice:
- Hinton’s generative approach has moved ML research in the direction of CRM, but lacks the focus on objective measurement. Hinton’s role in catalyzing AI progress is moderate evidence in favor of Burfoot’s thesis.
- The Hutter Prize is almost a CRM approach, but only for one medium-quality database.
- Eric Baum emphasizes compact representations of reality as the main ingredient of intelligence, but focuses on understanding evolution and intelligence, not on using compression to improve ML and science.
- Max Tegmark makes a brief comment in his book that endorses defining science as compression.
Burfoot is more ambitious than any of those authors, aiming to make ML into a rigorous science, and to make science in general more objective.
There’s a slight resemblance to MIRI’s goal of making AI research more rigorous, but a large difference in what the two approaches imply for the speed of AGI takeoff. Burfoot implies that intelligence is mostly empirical knowledge, while MIRI focuses on something closer to a general-purpose compression tool.
In sum, this is a pretty good book. It helped clarify my understanding of science, and of recent trends in ML. It is almost polished enough to be publishable. It seems a shame that it has apparently been abandoned so close to completion.
[1] – Yes, I’m being vague about “similar”. I have a clearer meaning in mind, but I’m too busy to turn this post into a theory of market bubbles. Yes, I’m concerned that my meaning of “similar” is the result of overfitting.
[Replying to Dan Burfoot’s comments, posted as “Response to Review of ‘Notes’ by Peter McCluskey” at Ozora Research.]
On lossless versus lossy compression, the book is clear about the advantage of lossless compression for objectively measuring a theory. I agree that lossless compression is the right method for comparing theories.
My comment was an attempt to understand the practical problems of implementing CRM. For purposes other than measuring how good a theory is, I want to express the theory in a form that looks more like lossy compression than lossless compression. I’m unsure whether lossy compression is the right way to describe what I want here. Maybe I’m too confused to articulate what I do want. It’s something along the lines of expressing theories using traditional methods when we’re aiming for human comprehension of the theories, and having some standard toolkit to convert the theory into lossless compression when we want to measure the quality of the theory.
No, I didn’t conclude that CRM was inapplicable to stock market data. I believe that CRM offers some benefits for my work, but I’ve been procrastinating due to a combination of being busy at other tasks and being confused about how to apply CRM to my work.
Some of that confusion is due to CRM feeling sufficiently strange that it takes a good deal of thought to reframe my thoughts around it. But I suspect most of the difficulty is due to factors related to markets.
Burfoot’s comments about the stock market mostly imply that it’s futile to do scientific study of how to beat the market, not that CRM is the wrong approach.
Yet if everyone gave up on finding inefficiencies in the stock market, then the market would become inefficient. The only equilibrium that is close to being stable is for a fair number of people to think they can find inefficiencies, while actual inefficiencies are hard enough to find that many people fail to do so.
I’ve got gigabytes of data that includes stock prices, earnings / balance sheet numbers, descriptions of each company’s business, etc.
I’ve also got evidence of a more anecdotal nature concerning financial fluctuations over much longer time periods, and covering a variety of countries.
I’ve got hints about human nature (e.g. from the heuristics and biases literature) that guide my intuitions about which patterns are due to mistakes that are persistent and widespread.
My databases are hardly pure noise. They contain lots of patterns about which companies share which similarities to other companies.
If I simply focus on compressing my database via automated techniques, I’ll get lots of nearly useless knowledge: some of it due to overfitting, and lots of it due to obviousness (e.g. similarities between Wells Fargo and Bank of America). The results will also include some valuable ideas (e.g. companies grouped by similarities that I’ve overlooked).
Those underappreciated similarities may help me create new abstractions. That won’t directly lead me to better predictions. But at the very least it will focus more of my attention on patterns I find interesting. I’m currently spending too much time manually looking for needles in haystacks; with sufficiently good abstractions, I could automate some of that via software that makes educated guesses about which sections of the haystack are most promising.
CRM has focused my attention more clearly on that goal, but still leaves some important domain-specific challenges. I’m still unclear on how much of that I’ll need to do manually, and how much can be handled by standard feature extraction tools.
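As an example of the kind of automated grouping I have in mind, a standard clustering pass over company fundamentals might look like the sketch below. The file name, column names, and cluster count are all made up for illustration; this is one off-the-shelf approach, not a recommendation of any particular tool.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical table of per-company features; the columns are placeholders.
fundamentals = pd.read_csv("company_fundamentals.csv", index_col="ticker")
features = StandardScaler().fit_transform(
    fundamentals[["pe_ratio", "debt_to_equity", "revenue_growth"]]
)
# Group companies by similarity of their (standardized) fundamentals.
fundamentals["cluster"] = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(features)

# Clusters that mix companies from different official sectors are candidates
# for the kind of similarities I suspect I've been overlooking.
```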
That’s likely to be a much smaller paradigm shift for me than the difference between math and physics, but still a real shift in my focus.