Everything is Predictable
Introduction: A Theory of Not Quite Everything
- All that we do all the time is predict the future. We couldn't function if we couldn't.
- As dictated by Bayes' theorem, your response to new information is influenced by the beliefs you already hold.
1. From the Book of Common Prayer to the Full Monty Carlo
- "An Essay towards solving a Problem in the Doctrine of Chances" - published posthumously.
- Derivative - the slope of a curve on a graph, i.e. its rate of change - lets you work out the speed at any given moment from a distance-time graph.
- Second derivative - the rate of change of the rate of change: differentiate speed with respect to time and you get acceleration.
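In standard calculus notation (a restatement for clarity, not from the book), with s for distance, v for speed, and a for acceleration:

```latex
v(t) = \frac{ds}{dt}, \qquad a(t) = \frac{dv}{dt} = \frac{d^2 s}{dt^2}
```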
- The study of probability starts in the 16th century with Gerolamo Cardano thinking about dice rolls:
- Antoine Gombaud, the Chevalier de Méré, posed gambling problems to Pascal in 1654.
- Blaise Pascal - The idea is not to look at the chances that something would happen, but to look at the chances it wouldn't happen.
- Pierre de Fermat - worked through those problems with Pascal by correspondence, laying the foundations of probability theory.
- The great insight of probability theory: that we should look at the possible outcomes from a given situation, not what has gone before.
- Pascal's Triangle - Simplifies working out the probabilities of outcomes with a binomial (50/50) distribution.
- When you're trying to work out how likely something is, count the outcomes that produce the event and the total number of possible outcomes. The probability of an event is the number of ways it can occur, divided by the total number of outcomes that can occur (see the sketch below).
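A minimal Python sketch of both ideas (illustrative, not from the book): row n of Pascal's triangle counts the ways to get k heads in n fair coin flips, and dividing by the 2^n total outcomes gives the probability.

```python
def pascal_row(n):
    """Row n of Pascal's triangle: C(n, 0) ... C(n, n)."""
    row = [1]
    for k in range(n):
        row.append(row[-1] * (n - k) // (k + 1))
    return row

def prob_heads(n, k):
    """P(exactly k heads in n fair flips) = C(n, k) / 2^n."""
    return pascal_row(n)[k] / 2**n

print(prob_heads(4, 2))  # 6/16 = 0.375: six of the sixteen outcomes give 2 heads
```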
- Outside of games, what is the probability of the world being a certain way, given the results that we're seeing?:
- Sampling probabilities - What can we predict about a sample of something, given what we know about the whole?
- Inferential probabilities - What can we know about the whole, given a sample we've taken?
- Jacob Bernoulli - You can never be truly certain of things. There are three components to the question and you can optimize two:
- How big a sample do you take?
- How close to the true answer do you need to be?
- How confident in your answer do you need to be?
- Abraham de Moivre - described the "normal distribution" or "bell curve". The accuracy of your estimate grows in proportion to the square root of your sample size.
- Standard deviation - A measure of how spread out your data is around the mean (average). Calculated by taking each value's deviation from the mean (positive or negative), squaring those deviations (to make them all positive), averaging the squares (that average is the variance), and taking the square root (a code sketch follows the examples).
- Example 1: Three children are 157, 160, and 163cm tall. The squared deviations are 9, 0, and 9; their average is 6; and the square root of 6 is approx 2.4. The SD is 2.4, so the shortest child is 3/2.4 = 1.25 SD below the mean and the tallest is 1.25 SD above it.
- Example 2: Three children are 220, 130, and 130cm tall. The squared deviations are 3600, 900, and 900; their average is 1800; and the square root of 1800 is approx 42.4. The SD is 42.4, so the two shorter children are 30/42.4 = 0.7 SD below the mean and the tallest is 60/42.4 = 1.4 SD above it.
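A quick sketch reproducing Example 1 in Python (illustrative; this is the population SD, dividing by n, as the examples above do - the notes round 2.45 down to 2.4):

```python
from math import sqrt

def population_sd(values):
    """Population standard deviation: square root of the mean squared deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return sqrt(variance)

heights = [157, 160, 163]
sd = population_sd(heights)
print(round(sd, 2))                 # ~2.45
print(round((160 - 157) / sd, 2))   # shortest child is ~1.22 SD below the mean
```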
- With normally distributed data and a sufficiently large sample, you can reliably predict what percentage of results fall within a given distance of the mean. In general:
- 68.27% will be within 1 SD
- 95.45% will be within 2 SD
- 99.73% will be within 3 SD
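A hedged check of those percentages by simulation (assumes numpy is available):

```python
import numpy as np

# Draw a large sample from a standard normal and check the 68-95-99.7 rule.
rng = np.random.default_rng(0)
samples = rng.standard_normal(1_000_000)
for k in (1, 2, 3):
    within = np.mean(np.abs(samples) <= k)
    print(f"within {k} SD: {within:.4f}")  # ~0.6827, ~0.9545, ~0.9973
```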
- Inverse probability - Given the results I've seen, what can I say about my hypothesis?
- For Bayes, probability is an expression of our lack of knowledge about the world. Probability is subjective. It's a statement about our ignorance and our best guesses of the truth. It's not a property of the world around us, but of our understanding of the world. You must take into account how likely you thought the hypothesis was in the first place, i.e. take your subjective beliefs into account (a worked example follows).
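A worked illustration of inverse probability with made-up numbers (not from the book): a diagnostic test that is 99% sensitive and 95% specific, for a condition with a 1% base rate.

```python
def posterior(prior, sensitivity, specificity):
    """P(condition | positive test) via Bayes' theorem."""
    p_pos_given_cond = sensitivity
    p_pos_given_no_cond = 1 - specificity
    p_pos = prior * p_pos_given_cond + (1 - prior) * p_pos_given_no_cond
    return prior * p_pos_given_cond / p_pos

# With a 1% prior, even a strong positive result leaves you far from certain.
print(round(posterior(0.01, 0.99, 0.95), 3))  # ~0.167
```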
- Pierre-Simon Laplace independently arrived at a similar conclusion in 1774.
- Adolphe Quetelet (1796-1874) was interested in the "average man", producing normal distributions for all kinds of human attributes. But his findings seemed to conflict with free will, casting our behaviours and choices as the products of our attributes.
- Bayes vs Frequentism:
- Bayes asks - How likely is the hypothesis to be true, given the data I've seen?
- Treats probability as subjective, a statement about our ignorance of the world.
- Frequentism asks - How likely am I to see this data, assuming a given hypothesis is true?
- Treats probability as objective, a statement about how often some outcome will happen, if you did it a huge number of times.
- The problem of Bayesian priors is that they're subjective - a statement not about the world, but about our own knowledge and ignorance. Frequentists argue that if you don't know which outcome is the most likely, then you should treat them as equally likely.
- Francis Galton - First to explain regression to the mean. Coined the phrase "nature and nurture".
- Statistical significance:
- A p-value is the probability of seeing results at least as extreme as those you've seen, given the null hypothesis. P-values below 0.05 are conventionally treated as "statistically significant" (a sketch follows this block).
- The null hypothesis is the hypothesis that whatever effect you're looking for isn't real. Eg hair color has no effect on soup eating.
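A minimal p-value sketch with invented numbers: the probability of seeing at least 16 heads in 20 flips of a fair coin (the null hypothesis).

```python
from math import comb

def p_value_heads(n, k):
    """One-sided p-value: P(at least k heads in n fair flips)."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2**n

# Under the null hypothesis (fair coin), 16+ heads out of 20 is rare:
print(round(p_value_heads(20, 16), 4))  # ~0.0059, "significant" at p < 0.05
```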
- Frank Ramsey: "All our lives, we are in a sense betting. Whenever we go to the station, we are betting that a train will really run, and if we had not a sufficient degree of belief in this, we should decline the bet and stay at home."
2. Bayes in Science
- Around 2011, science entered the replication crisis.
- Hypothesizing After the Results are Known (HARKing), and p-hacking.
- The scientific literature prioritizes novel findings, which incentivizes (even if unconsciously) p-hacking.
- Popper's philosophy of science said that you never prove a scientific hypothesis - you only disprove it or fail to do so. The Bayesian idea that you can build up evidence for or against is very much opposed to it.
- Hume: "All our experimental conclusions proceed upon the supposition that the future will be conformable to the past."
- Popper: "We choose the theory which best holds its own in competition with other theories; the one which, by natural selection, proves itself the fittest to survive. This will be the one which not only has hitherto stood up to the severest tests, but the one which is also testable in the most rigorous way." He called such a theory "corroborated".
- Instinctively, at least, scientists think like Bayesians.
- In frequentist experiments, the p-value can rise and fall suddenly as the sample size gets bigger. For Bayesians, though, there is always prior data, so each new data point moves your opinion much less, and forms part of the new prior for your next piece of information.
- Bayesian techniques don't just reject or accept the null hypothesis - they don't just say yes or no to a hypothesis, but give degrees of belief to a range of possible realities. In reality, there is no such thing as a null hypothesis, or rather the null hypothesis is always, ultimately false.
- Bayesians make an estimate of the size of an effect and give a probability distribution (a numerical sketch follows this list):
- If your prior is very strong - your curve is really tall and narrow - and the new data is fairly weak, giving a likelihood curve that is low and wide, then the resulting curve will look more like the prior.
- If your prior is weak but you've got really good data, so your likelihood curve is tall and pointy, the new data will wash out the prior, and the posterior will be more like the likelihood.
- If the posterior curve is tall and pointy compared to the prior, then you have something noteworthy and worth following up.
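A hedged numerical sketch of those cases, multiplying a normal prior by a normal likelihood on a grid (assumes numpy; the taller, narrower curve dominates the posterior):

```python
import numpy as np

grid = np.linspace(-5, 5, 1001)

def normal(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def posterior(prior_mu, prior_sd, like_mu, like_sd):
    """Posterior is proportional to prior x likelihood; normalize on the grid."""
    post = normal(grid, prior_mu, prior_sd) * normal(grid, like_mu, like_sd)
    return post / post.sum()

# Strong prior (tall, narrow) vs weak data (low, wide): posterior hugs the prior.
post = posterior(prior_mu=0.0, prior_sd=0.5, like_mu=2.0, like_sd=3.0)
print(round(grid[np.argmax(post)], 2))  # ~0.05, barely moved from the prior's 0
```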
3. Bayesian Decision Theory
- George Boole described the operations of propositional logic as "the laws of thought".
- In our reasoning we depend very much on prior information to help us in evaluating the degree of plausibility in a new problem. This reasoning process goes on unconsciously, almost instantaneously, and we conceal how complicated it really is by calling it common sense.
- Cromwell's rule: never give a hypothesis a prior probability of exactly 0 or 1 - you should never be completely certain.
- Probability and utility together make expected value.
- If humans were perfect reasoning machines with full access to our own underlying preferences, we could work out the maths, using a combination of Bayes' theorem and utility theory. That is, roughly speaking, what modern AI does, in a much more explicit way.
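A minimal expected-value sketch with invented numbers: choose the action whose probability-weighted utility is highest.

```python
# Each action maps to (probability, utility) pairs over possible outcomes.
actions = {
    "take umbrella": [(0.3, 5), (0.7, 3)],    # rain vs no rain: dry either way
    "leave it":      [(0.3, -10), (0.7, 6)],  # soaked if it rains, freer if not
}

def expected_value(outcomes):
    return sum(p * u for p, u in outcomes)

for action, outcomes in actions.items():
    print(action, expected_value(outcomes))   # take umbrella: 3.6, leave it: 1.2
print("choose:", max(actions, key=lambda a: expected_value(actions[a])))
```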
- Hyperpriors deal with uncertainty about which world you're in. A hyperprior is a distribution over hyperparameters, and those hyperparameters in turn restrict the shape of the priors you choose (a toy sketch below).
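A toy generative sketch of that hierarchy (the structure and numbers are assumptions for illustration): draw a hyperparameter from the hyperprior, let it fix the prior, then let the prior generate the quantity of interest.

```python
import random

random.seed(0)

# Hyperprior: which "world" are we in? A hyperparameter sets the prior's scale.
prior_scale = random.choice([0.5, 1.0, 2.0])   # drawn from the hyperprior

# Prior: given the hyperparameter, our belief about the unknown quantity.
theta = random.gauss(0.0, prior_scale)         # drawn from the prior

# Data: the world generates noisy observations around theta.
observation = random.gauss(theta, 0.3)
print(prior_scale, round(theta, 2), round(observation, 2))
```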
4. Bayes in the World
- People are confused by:
- The availability heuristic - judging how common something is by how easily an example comes to mind.
- Conjunction fallacy - judging two things happening together as more likely than one of them happening alone, which is logically impossible.
- Framing effects - where the wording of a statement can dramatically change how it is interpreted.
- Humans are amazingly rational if information is presented to us in ways that we are designed to process it.
- The gaze heuristic - a shortcut for calculating the path of a ball. We do it automatically, fixing our gaze on the ball and adjusting our running speed so that the angle of gaze remains constant.
- We use other heuristics as shortcuts:
- Recency bias - Where we overweight more recent evidence
- Anchoring - Where the first thing we see tends to set our expectations
- Frequency bias - Where we shortcut to the thing we've seen most often.
- Making decisions under uncertainty is hard. We don't have access to all the information, and even if we did, integrating it all into the Bayes equation would be computationally impossible. So instead we use shortcuts and heuristics, and our instinctive decision-making, from a Bayesian perspective, isn't that bad. By and large, we have grey areas and weak spots, but we're kind of OK.
- Tetlock suggested that there are two distinct groups of experts:
- Hedgehogs - Who think the world is simple and can be explained and predicted simply using "one big idea". Hedgehogs tell nice, straightforward stories, which are easy to package for the media
- Foxes - Who think that the world is complicated - that the specifics and details of each situation matter, and that predictions are difficult and uncertain. Foxes do somewhat better at predicting things, with the top 2% called by Tetlock "Superforecasters".
- Good forecasters often use the "Fermi estimate": make several estimates of small things instead of one big estimate; if there's no reason for those errors to be systematically high or low, they will tend to cancel each other out. Enrico Fermi's famous calculation of the number of piano tuners in Chicago broke the problem down into the population of Chicago, the fraction of people who own pianos, and the time it takes to tune each piano once a year. He got 62.5 against the real answer, about 80 (a sketch follows).
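A hedged Fermi sketch with round illustrative inputs (the exact figures Fermi used vary by telling; each input only needs to be ballpark-right):

```python
population = 3_000_000            # people in Chicago (assumed)
people_per_household = 4
piano_ownership = 1 / 12          # fraction of households with a piano (assumed)
tunings_per_piano_per_year = 1
tunings_per_tuner_per_day = 4
working_days_per_year = 250

pianos = population / people_per_household * piano_ownership
tuner_capacity = tunings_per_tuner_per_day * working_days_per_year
print(pianos * tunings_per_piano_per_year / tuner_capacity)  # 62.5 tuners
```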
- A huge amount of our public discourse comes down to our efforts to place things, groups, people, and concepts into categories. Are they fascist, a cult, a racist? But the categories are all fuzzy.
- Take the category "games". You learned what defines "games" in a Bayesian way, starting with one data point and then refining your category with further data points. As you learn to attach the label "game" to more and more concepts, you get more accurate estimates of the probability of seeing certain characteristics in those concepts. Your prior probability that games involve balls was low; then someone pointed out that hockey, football, tennis, cricket and ping-pong are all games, so you updated (sketched after this list).
- This method works better than the idealist philosophical notion that categories have fixed, essential definitions.
- See also the paradox of the heap. When does a heap of sand from which you are removing grains turn into "not a heap"?
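A hedged sketch of that kind of category learning as beta-binomial updating: start with a weak prior on "games involve balls" and update as labelled examples arrive (all numbers invented).

```python
# Beta(a, b) prior over P(a game involves a ball); mean belief = a / (a + b).
a, b = 1, 3  # weak prior: guessing games probably don't involve balls

examples = ["hockey", "football", "tennis", "cricket", "ping-pong", "chess"]
involves_ball = {"hockey": True, "football": True, "tennis": True,
                 "cricket": True, "ping-pong": True, "chess": False}

for game in examples:
    if involves_ball[game]:
        a += 1   # one more confirming observation
    else:
        b += 1   # one more disconfirming observation
    print(f"after {game}: P(ball) ~ {a / (a + b):.2f}")
```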
5. The Bayesian Brain
- Perception and consciousness itself is - in quite a direct sense - Bayesian. You're faced with ambiguous sensory information and are going from observations to what caused those observations - that's inverse reasoning.
- Various scientists argue that the central thing the brain does is build predictions of the world, which it then integrates with information coming in via the senses. Perception is a two-way street: information travels up from our senses, but it also travels down from our internal model of the universe. Our perception is the commingling of that bottom-up stream with the top-down one. The two constrain each other - if the top-down priors are strong, then it requires precise, strong evidence from the senses to overturn them (a sketch follows).
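A minimal closed-form sketch of that constraint, treating prior and sensory evidence as Gaussians and weighting each by its precision (1/variance); the numbers are invented.

```python
def fuse(prior_mu, prior_sd, sense_mu, sense_sd):
    """Precision-weighted combination of a prior and a sensory estimate."""
    wp, ws = 1 / prior_sd**2, 1 / sense_sd**2
    mu = (wp * prior_mu + ws * sense_mu) / (wp + ws)
    sd = (wp + ws) ** -0.5
    return mu, sd

# Strong prior (sd 0.2) vs noisy sense data (sd 2.0): the percept barely moves.
print(fuse(0.0, 0.2, 5.0, 2.0))   # mean ~0.05
# Weak prior vs sharp sense data: the percept follows the senses.
print(fuse(0.0, 2.0, 5.0, 0.2))   # mean ~4.95
```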
- Neuroscientist Chris Frith says that our perception of reality is a controlled hallucination.
- When everything is going as expected, we are not even aware of it. But when something unexpected happens, the low-level processor bumps the problem one place higher up the chain. If the slightly-higher-level processor can explain it, then it will do so, and will send new signals down the chain. If it can't, the new information keeps going up the chain.
- Your brain is taking information from various modalities - vision, hearing, touch, smell, and taste, but also your internal sense of where your body is and how it's arranged, and whether you're hungry or thirsty or horny or whatever - and combining them.
- Crucially, what you experience is not the data from your senses, but your predictions - predictions constantly updated by information from the senses, yes, but the world you live in is the prediction, not the data.
- Anil Seth: "What we experience is best described as a Bayesian inference about the causes of sensory data."
- The greater the mismatch, the more the prior is shifted, and the louder the gain, or volume, of the signal sent upward.
- The brain hates prediction error. It wants to minimize the difference between its predictions and its sense data. It really wants its predictions to be right. So it calls attention to mismatches so they can be sorted out. Under this model, attention is literally just when the higher-level processors and high-precision sense data are focused on some aspect of your environment. Something grabs your attention when the bottom-up data about it coming from your senses doesn't match the top-down prediction coming from your brain - sending a loud, urgent signal right up to the top.
- Our brains not only need to predict the signals coming from the world, they also need to predict how the signals coming from the world would change, if we performed some action. Then they need to subtract those predictions from the predictions of how the world itself is changing, to give the impression of a stable reality.
- The two models - the model of movement and the model of the goal - work in parallel: your brain runs both simulations simultaneously and checks them against each other. So if you had some goal (pick up a coffee cup), your brain predicts what sequence of nerve firings would do best at that, and at the same time it takes the predicted sequence, predicts what would happen if you do it, and sees if the two models match: does this inverse model actually result in the goal I'm aiming for?
- We also go out into the world to gather information. If I had, now, to choose which data point to gather next, where to look next, what would be the best query or question?
- Our eyes saccade to where we expect the action to be. Not where the most interesting things are, but where we expect them to be.
- Novices aren't as good at predicting where the action will be, so they have to make very imprecise predictions, while experts have a well-constructed model that allows them to gather highly precise information about the world.
- When you move, the perceived movements that you yourself cause are suppressed, leaving the movements that you have not caused, which are usually more important.
- Bayesian theories of:
- Schizophrenia - Schizophrenics have weaker priors than the rest of us. Their predictions of the world are less precise, so they can - for instance - correctly perceive the reversed, hollow side of a face mask as hollow, because the sense data fits that hypothesis (the hollow-mask illusion).
- Predictions of their movements are less precise, so they can experience the movements of their limbs as if they are controlled by another.
- Voices in the head may be just the inner monologue, which they are unable to predict and suppress, so that it's as shocking and as loud as if someone else spoke inside their mind.
- Because the errors are random - generated by noise in the sense data or by unpredicted movements - the brain has to come up with bizarre hypotheses to explain them. They may experience the pulsing of blood in veins as the walls breathing.
- They may find coincidences like their first name in the paper or the number 13 in a numberplate unduly surprising and have to explain them away with hypotheses such as the TV or newspapers giving them secret messages.
- Depression - May be caused by inappropriately strong priors on some negative belief: that you are a bad person, that you are powerless, or that everything is bad.
- You can't get out of the belief-valley and into a more accurate valley.
- Evidence that could prove otherwise gets discounted because your priors are so strong. You're too confident in some pathological belief or bias.
- There is evidence that psilocybin reduces depression. Psychedelics make you really interested in things. They can flatten your priors, make them less precise, and upweight the data coming from your senses. For a depressed person, this flattens the belief landscape - weakening the inappropriately strong priors.
- Another theory is that depression may be due to underconfidence in neural predictions.
- For Karl Friston, minimizing prediction error isn't just sense-making, but is our fundamental motivation. Hunger, sexual desire, boredom - all our wants and needs - can be described in terms of a struggle to reduce the difference between top-down prediction and bottom-up sense data, between your prior and your posterior distributions:
- Being hungry is the same as confidently predicting that you are currently eating a sandwich but that prediction being wrong.
- The driver of all life, from bacteria on up, is to reduce the difference between what they predict and what they experience. You cannot change your model of what your body temperature is or what your glucose levels are, outside very narrow windows. So the only way of minimizing prediction error is by changing the world, or your position in it, so that your predictions are true.
- For Friston, that's what's going on in all self-organizing systems. We need to maintain homeostasis. But more sophisticated animals can plan ahead, managing their surroundings to minimize expected prediction error, or expected surprise.
- Very fundamental priors are wired into us by evolution - blood-sugar levels, body temperature, oxygen levels, bodily integrity and, to a lesser extent, perhaps, social and sexual desires. As young children, the hardwired priors are all we have, but we progressively learn preferred states of being to work towards, constrained by those innate priors underneath, which keep us alive.
- Babies do motor babbling, trying out random nerve signals and seeing what happens. They're learning that they have a body, and that there are certain things they can control and certain things they can't. At first they have so little information that they act at random, but their movements become increasingly sophisticated as they update their priors: learning to grasp food, feeding themselves, choosing which food to eat, to buy...
- The difference between a virus and you and me is how far into the future you can look. We have hierarchically deeper generative models, and what accompanies that is the ability to roll out further into the future.
- Some people find it weird to suggest that hunger is the same as wrongly predicting that you've eaten. But it's an elegant theory.
Conclusion: Bayesian Life
- You can see the genome of an organism as a prediction about the world. Evolution is slow, blind, and inefficient - it might take hundreds of generations to solve a problem a human designer could solve in an hour - but it's an approximation of a Bayesian process: it works to minimize prediction error.
- When we're young we have very little data about the world, so our priors are weak and new information can shift them easily. We can learn quickly, because we don't yet have a very precise model of the world that makes good predictions. As we get older, we gain more information and a richer, more precise model of the world, and new information must logically shift our priors less. So older people are wise but inflexible (a closing sketch below).
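A closing sketch of that asymmetry with invented numbers: the same new evidence moves a weak (young) prior far more than a strong (old) one.

```python
def update(a, b, confirming, disconfirming):
    """Beta-binomial update: add counts of confirming/disconfirming evidence."""
    return a + confirming, b + disconfirming

def mean(a, b):
    return a / (a + b)

evidence = (8, 2)  # ten new observations, mostly confirming

young = update(1, 1, *evidence)     # weak prior: 1 vs 1 pseudo-observations
old = update(100, 100, *evidence)   # strong prior: 100 vs 100

print(f"young: 0.50 -> {mean(*young):.2f}")  # 0.50 -> 0.75
print(f"old:   0.50 -> {mean(*old):.2f}")    # 0.50 -> 0.51
```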