I had a similar experience when I finally found a copy of Barbour’s _The End of Time_ and discovered, much to my chagrin, that it wasn’t nearly as mystical or complicated as EY makes it seem in the Timeless Physics “sequence”. Barbour’s account was much more readable and much easier to understand.
Yudkowsky just isn’t that great of a popular science writer. It’s not his specialty, so this shouldn’t be surprising.
Jaynes' book is a game changer, but I particularly love that you mentioned Barbour and his work.
On Barbour's work: apart from writing an incredibly interesting book, I was amazed that he was a sort of "outsider", writing papers and books on his own (or at least outside of academia) while making money through technical translations. That's a really clever way to stay free to explore any interesting avenue one might find. Einstein had the right idea too...
(Sadly, it's also something that wouldn't be as feasible nowadays, but who knows...)
Here's a list of errata and commentary, collected by a fan: https://ksvanhorn.com/bayes/jaynes/index.html.
I had spotted some errors here and there, but it's always good to have them in one place.
I think we're all in the same boat here: even with those rough edges, Jaynes' book is a kind of transformative experience for anyone who has already been "conditioned" by other probability texts.
For example, for me Feller is a great intro to "start working with Probability," but Jaynes is where one starts actually "thinking in Probability."
The whole Maximum Entropy thing was mind blowing for me.
And if you want to read what he has to say on the optional stopping problem, scroll down to page 196 (page 166 by the book's own numbering), to the heading "6.9.1 Digression on optional stopping".
I don't personally think Jaynes is much easier to read than Yudkowsky, but he's definitely more rigorous.
How you infer the shape of that distribution based on the experiment is a function of the distribution of all courses your experiment could have taken. This set of paths is different in each case, which means the inference we make must also be different.
There is no inconsistency. The confusion seems to be in assuming that the experimental result was a true statement about the nature of the world rather than a true statement about simply what happened.
edit: This seems to me to be a specific case of a general class of difficult thinking where you ask yourself: "what are all the worlds that I might be in that are consistent with what I'm presently observing".
Do you not notice that your inference is less accurate using this line of reasoning? Does that not suggest that it's simply wrong?
We don't actually care at all about what happened in the two experiments per se, we care about the information provided by the experiments about future or other events.
If somehow we learned that both experiments were totally unreplicable and a product purely of that time and location with no implications for anything else ever before or since we wouldn't care about them except maybe as a historical curiosity.
Intention is a red herring; what matters is our expectation about what might be observed if we were to repeat the experiments again.
In that sense, there's variability in the second experiment's results due to sample size being random. So we interpret and infer based on that potential experiment we could do, not what happened to be observed at a particular moment.
I'm also confused about what this has to do with Bayesian versus non-Bayesian inference as you could approach either experiment from either paradigm, and there are different forms of Bayesianism, including nonsubjective Bayesianism.
How can the experiments provide relevant information other than through what happened?
If what happened is exactly the same (first patient with such and such characteristics had this outcome, etc.) what information can be provided by the things that didn’t happen in either?
How could it matter that the things that didn’t happen in one experiment are different from the things that didn’t happen in the other when we are interested in the information provided by what did happen?
We don't actually care at all about the distribution of things that could have happened per se, we care about the information provided by the experiments about future or other events.
But what distribution? What is this "distribution" that we are taking a sample from?
The frequentist says: because the two experimenters have different intentions, the experiments they ran are samples from different distributions.
But the Bayesian says: the experimenter's intentions can't affect things like how dice rolls come out or how well a given treatment works on a given patient. The actual "distribution" is the set of all factors that do affect how the dice rolls come out or how well the treatment works on each patient. And those factors are the same for both experimenters; their different intentions don't affect that. So both sets of data are samples from the same distribution, not different ones.
> How you infer the shape of that distribution based on the experiment is a function of the distribution of all courses your experiment could have taken.
If you're going to state it this way, then the Bayesian response is: "all courses your experiment could have taken" has nothing to do with the experimenter's intentions. The experimenters can't magically make the physical world and the biology of humans work differently depending on what stopping criterion they choose. And the physical world and the biology of humans is what determines "the courses your experiment could have taken".
In other words, when the frequentist makes up "distributions" based on the experimenter's stopping criterion, they are, whether they admit it (or even realize it) or not, making a claim about how the physical world and the biology of humans works that is obviously false.
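To make this concrete, here is a minimal sketch (mine, with made-up numbers): the fixed-n experimenter has a binomial likelihood and the fixed-r experimenter a negative binomial one, but the two differ only by a constant factor in theta, so any Bayesian posterior comes out the same.

```python
# Minimal sketch (not from the thread; numbers are hypothetical).
# Fixed-n likelihood:  C(n, r)     * theta^r * (1 - theta)^(n - r)
# Fixed-r likelihood:  C(n-1, r-1) * theta^r * (1 - theta)^(n - r)
# The theta-dependent factor is identical, so with any given prior the
# posterior is the same under both stopping rules.
from scipy.stats import beta

r, n = 7, 12                        # say, 7 successes in 12 trials
posterior = beta(1 + r, 1 + n - r)  # Beta(1, 1) prior; same for both designs
print(posterior.mean())             # identical for both experimenters
print(posterior.interval(0.95))     # identical 95% credible interval
```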
Yeah, and then you stack some beliefs on top of that.
And then you discover the evidence wasn’t actually true. Remind me again what the normative Bayesian update looks like in that instance.
Unfortunately it’s turtles all the way down.
P(B|I saw E, P) = P(I saw E|B,P) * P(B|P) / P(I saw E|P)
P(B|E was false, I saw E, P) = P(E was false|B,I saw E,P) * P(B|P,I saw E) / P(E was false|P, I saw E)
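Plugging made-up numbers into those two formulas (everything below is hypothetical, just to exercise the algebra):

```python
# Hypothetical numbers only: B = hypothesis, E = the reported evidence.
p_B = 0.5                # prior P(B|P)
p_sawE_B = 0.9           # P(I saw E | B, P)
p_sawE_notB = 0.3        # P(I saw E | not B, P)

# First update: condition on "I saw E".
p_sawE = p_sawE_B * p_B + p_sawE_notB * (1 - p_B)
p_B_sawE = p_sawE_B * p_B / p_sawE

# Second update: we then learn "E was false". Assume (hypothetically) that
# false sightings are likelier when B is false than when B is true.
p_false_B = 0.1          # P(E was false | B, I saw E, P)
p_false_notB = 0.6       # P(E was false | not B, I saw E, P)
p_false = p_false_B * p_B_sawE + p_false_notB * (1 - p_B_sawE)
p_B_final = p_false_B * p_B_sawE / p_false

print(p_B_sawE)   # 0.75: belief rises on the sighting
print(p_B_final)  # ~0.33: and falls below the prior on the retraction
```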
This is a pretty basic application of Bayes' theorem. But just move the argument one level down: "I saw E" turns out to be false, and so does "E was false". So then what? Condition on ""E was false" was false"?
Turtles all the way down.
At some point something has to be “true” in order to conditionalise on it.
Alternatively, brains ARE Bayesian networks with hard coded priors that cannot be changed without CRISPR.
Not really going to vouch for the normative Bayesian approach, but you might just consider this new (strong) evidence for applying an update.
That is, you say, for the update, "the probability that this trial came out with X successes given everything else that I take for granted, and also that the hypothesis is true" vs. "the probability that this trial came out with X successes given everything else that I take for granted, and also that the hypothesis is false." So you actually say in both cases the fragment, "this trial came out with X successes."
What happens if it didn't really? Well, the proper Bayesian approach is to state that you phrased this fragment wrong. You actually needed to qualify "the probability that I saw this trial come out with X successes given ...", and those probabilities might have been different than the trial actually coming out with X successes.
OK but what happens if that didn't really, either. Well, the proper Bayesian approach is to state that you phrased the fragment doubly wrong. You actually needed to qualify it as "the probability that I thought I saw this trial come out with X successes given...". So now you are properly guarded, like a good Bayesian, against the possibility that maybe you sneezed while you were reading the experiment results and even though you saw 51, it got scrambled in your head and you thought you saw 15.
OK but what happens if that didn't really, either either. You thought that you thought that you saw something, but actually you didn't think you saw anything, because you were in The Matrix or had dementia or any number of other things that mess with our perceptions of ourselves. So you, good Bayesian that you wish to be, needed to qualify this thing extra!
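For what it's worth, the "I saw X" correction is not just philosophy; it changes the arithmetic. Here is a toy sketch, with entirely made-up numbers, folding a misreading rate into the likelihood (the 51/15 transposition from above):

```python
# Hypothetical: the trial's true count is either 51 (more likely under H)
# or 15 (more likely under not-H), and with probability eps the reader
# transposes the digits. "Seeing 51" is then weaker evidence than the
# trial actually producing 51. All numbers are illustrative.
eps = 0.05                                  # assumed misreading rate

p_51_H, p_51_notH = 0.8, 0.1                # likelihood of the raw outcome
p_15_H, p_15_notH = 0.2, 0.9

# P(saw 51 | .) = P(outcome 51)*(1 - eps) + P(outcome 15)*eps
p_saw51_H = p_51_H * (1 - eps) + p_15_H * eps
p_saw51_notH = p_51_notH * (1 - eps) + p_15_notH * eps

print(p_51_H / p_51_notH)        # 8.0: likelihood ratio with perfect reading
print(p_saw51_H / p_saw51_notH)  # ~5.5: attenuated once misreads are allowed
```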
The idea is that Bayesianism is one of those "if all you have is a hammer you see everything as a nail" type of things. It's not that you can't see a screw as a really inefficient nail, that is totally one valid perspective on screwness. It's also not that the hammer doesn't have any valid uses. It does, it's very useful, but when you start trying to chase all of human rationality with it, you start to run into some really weird issues.
For instance, the proper Bayesian view of intuitions is that they are a form of evidence (because what else would they be?). They are extremely reliable when they point to lawlike metaphysical statements (otherwise we have trouble with "1 + 1 = 2" and "reality is not self-contradictory" and other metaphysical laws we take for granted), but correspondingly unreliable when we intuit things other than metaphysical laws, such as the existence of a monster in the closet, a murderer hiding under the bed, or that the only explanation for our missing (actually misplaced) laptop is that someone must have stolen it in the middle of the night. You need to do this to build up the "ground truth" that allows you to get to the vanilla epistemology stuff that you then take for granted, like "okay, we can run experiments to try to figure out stuff about the world, and those experiments say that the monster in the closet isn't actually there."
> Then the possibility seems open that, for different priors, different functions r(x1,..., xn) of the data may take on the role of sufficient statistics. This means that use of a particular prior may make certain particular aspects of the data irrelevant. Then a different prior may make different aspects of the data irrelevant. One who is not prepared for this may think that a contradiction or paradox has been found.
I think this explains one of the confusions many commenters have: for an experimenter who repeats observations until they reach their desired ratio r/(n-r), the ratio r/(n-r) is not a sufficient statistic! But when the experimenter has pre-registered n, the ratio r/(n-r) is a sufficient statistic. However, in either case:
> We did not include n in the conditioning statements in p(D|θ I) because, in the problem as defined, it is from the data D that we learn both n and r. But nothing prevents us from considering a different problem in which we decide in advance how many trials we shall make; then it is proper to add n to the prior information and write the sampling probability as p(D|nθ I). Or, we might decide in advance to continue the Bernoulli trials until we have achieved a certain number r of successes, or a certain log-odds u = log[r/(n − r)]; then it would be proper to write the sampling probability as p(D|rθ I) or p(D|uθ I), and so on. Does this matter for our conclusions about θ?
> In deductive logic (Boolean algebra) it is a triviality that AA = A; if you say: ‘A is true’ twice, this is logically no different from saying it once. This property is retained in probability theory as logic, since it was one of our basic desiderata that, in the context of a given problem, propositions with the same truth value are always assigned the same probability. In practice this means that there is no need to ensure that the different pieces of information given to the robot are independent; our formalism has automatically the property that redundant information is not counted twice.
Bayes' theorem holds because it can be proven. Precisely because it holds, situations can be constructed where considering identical data without considering priors gives nonsense conclusions. For example, if we happen to know a priori that P(outcome of the experiment is a certain ratio) = P(experiment is completed), then that must be considered when interpreting the results.
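A hypothetical simulation of that kind of situation: if the experiment only "completes" (and gets reported) when a target ratio is reached, then completion itself carries information, and a naive estimate that ignores it is biased.

```python
import random

# Hypothetical design: stop as soon as successes - failures >= 3;
# give up (and never report) after 100 trials. True theta is 0.5.
def run(theta, rng):
    s = f = 0
    for _ in range(100):
        if rng.random() < theta:
            s += 1
        else:
            f += 1
        if s - f >= 3:
            return s, s + f        # "completed" and reported
    return None                    # abandoned, never reported

rng = random.Random(0)
reported = [r for r in (run(0.5, rng) for _ in range(20_000)) if r]
naive = sum(s / n for s, n in reported) / len(reported)
print(naive)   # well above the true 0.5: completion selected for lucky runs
```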
But laws are tools, and their aesthetic and intellectual elegance is an epiphenomenal bonus, or a means of keeping the human psyche motivated and focused despite all the other attention sinks that life throws at it.
And that applies to "law" in both the judicial and the scientific senses.
There is nothing unusual about different mathematical methods/models producing different results. E.g., the number of roots of the very same quadratic equation may depend on "private" thoughts, such as whether complex roots are of interest (sometimes they are, sometimes they aren't). All models are wrong; some are useful.
You are confusing ambiguity in a problem statement due to imprecise human language with two identical, well-specified experimental results yielding different conclusions due to the intentions of the human carrying them out.
Is arithmetic a religion because there's "one true way" of adding integers?
The Map is not the Territory.
Different maps can be useful. No true map.
Only about the things that can be mathematically proven. Which is just like any other branch of math.
It is true that some Bayesians (and EY can be argued to be among them) like to talk as though Bayesian computation is a drop-in replacement for your brain. Of course it isn't, and Bayesianism, like any mathematical approach, should be taken with a good-sized dose of humility. As Bertrand Russell said, to the extent that mathematical propositions refer to reality, they are not certain, and to the extent that they are certain, they do not refer to reality.
No. The number of roots that you care about might depend on your private thoughts; but the number of roots itself does not. It's a mathematical fact. It just might not be a mathematical fact that you actually care about. But what you care about is not part of math.
There are no laws for applying probability to the real world. To think so puts too much faith in your models. Remember, all models are wrong. Applying probability to the real world requires a host of assumptions, regardless of the methods you use.
Frequentist and Bayesian methods have different goals; both have their place.
For a counterweight to the strong likelihood principle, see Larry Wasserman's discussions: https://youtu.be/Z-YvWyM6dRQ?si=qwzRiaPbj9ruiUEv
And for a balanced discussion of why both are great, see Michael Jordan: https://youtu.be/HUAE26lNDuE?si=cwg6wpRS1gXL6r1Y
But so then the data _are_ different between the two experiments, because they were observing different random variables -- so why is it concerning if they arrive at different conclusions? In fact, the _fact that the 2nd experiment finished_ is also an observation on its own (e.g. if the treatment was in fact a dangerous poison, perhaps it would have been infeasible for the 2nd researcher to reach their stopping criteria).
It's illogical to deride one of those two result-sets as telling us less about the objective universe just because the researcher had a different private intent (e.g. "p-hacking") for stopping at n=100.
_________________
> According to old-fashioned statistical procedure [...] It’s quite possible that the first experiment will be “statistically significant,” the second not. [...]
> But the likelihood of a given state of Nature producing the data we have seen, has nothing to do with the researcher’s private intentions. So whatever our hypotheses about Nature, the likelihood ratio is the same, and the evidential impact is the same, and the posterior belief should be the same, between the two experiments. At least one of the two Old Style methods must discard relevant information—or simply do the wrong calculation—for the two methods to arrive at different answers.
However, if you know that the first researcher just happened to get a positive result on their first try (and therefore didn't actually have to modify parameters), Bayesian math says that their intentions didn't matter, only their result. If, however, they did 100 experiments and chose the best one, then their intentions... still don't matter! but their behavior does matter, and so we can discount their paper.
Now, if you _only_ know their intentions but not their final behavior (because they didn't say how many experiments they did before publishing), then their intentions matter because we can predict their behavior based on their intentions. But once you know their behavior (how many experiments they attempted), you no longer care about their intentions; the data speaks for itself.
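A quick back-of-the-envelope sketch of why the "best of 100" behavior discounts the result (assuming, hypothetically, a 5% false-positive rate per experiment under the null):

```python
# If each experiment has a 5% chance of a spurious positive, a researcher
# who runs 100 and publishes the best one almost surely has a "positive".
p_false_positive = 0.05
n_experiments = 100
p_at_least_one = 1 - (1 - p_false_positive) ** n_experiments
print(p_at_least_one)   # ~0.994: a best-of-100 positive is weak evidence
```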
I think the author means that two methods which happen to be equivalent in the data they collect may draw different conclusions based on their initial assumptions. The question is how to make coherent sense of that.
At level 1 depth it’s insightful.
At level 2 depth it’s a straw man.
At level 3 depth, just keep drinking until you’re back at level 1 depth.
I believe that "definitely greater than 60%" is supposed to imply that the researcher is stopping when the p-value of their HA (theta>=60%) is below alpha, so an optional stopping (ie. "p-hacking") situation.
There are so many cracks in the Bayesian edifice promoted in TFA!
These problems are well-known in the Theories of Probability community [1] (which is only a subset of the larger set of theorists recognizing the limits of mechanical Bayesian reasoning in decision problems).
Here are a couple.
(1)
Bayesian approaches force you to assign a sharp probability to every event. How do we map any event to a sharp probability? E.g., I need to give a number for the probability of rain tomorrow, a non-repeating event. How do I map that to a number? Not through relative frequencies- it’s non-repeating. If two people give different numbers, how do we decide who is right?
This problem is what Peter Walley has called the “Bayesian dogma of precision.” [2]
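One standard move in that literature is to replace the single sharp prior with a _set_ of priors and report the resulting range of posteriors. A toy robust-Bayes sweep (my numbers, purely illustrative):

```python
from scipy.stats import beta

# Hypothetical data: 7 successes in 12 trials. Instead of committing to one
# Beta(a, b) prior, sweep a small set of priors and report the spread of
# posterior means: an interval, not a single sharp number.
r, n = 7, 12
means = [beta(a + r, b + n - r).mean()
         for a in (0.5, 1, 2, 5)
         for b in (0.5, 1, 2, 5)]
print(min(means), max(means))   # ~0.43 to ~0.69, depending on the prior
```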
(2)
As noted above in an aside, we have a hard time computing probabilities. This is a practical problem that we all are aware of, but often discount.
In what we could call CMP (Conventional Mathematical Probability - Kolmogorov’s axioms) we typically can’t even correctly enumerate the sample space. We’re always forgetting something, so our models are too confident. (In the “Dutch book” analogy alluded to in TFA, we are following the axioms but are somehow always losing money, in a very real sense.)
Related to this problem of computing probabilities, we don't have a rigorous way to determine when two real-world events are independent. Yet we constantly invoke independence to construct models. Kolmogorov's 1933 manuscript was clear on this problem. [3]
Not satisfied with this, we go on to hypothesize conditional independence relationships in order to feed our complex “rational” Bayesian machine. It’s thirsty for numbers, and we just make them up!
*
This all sounds somewhat hypothetical. It’s not. In my day job, I compute supposed Bayesian credible intervals for various physical variables.
The people downstream who use those variables to assimilate into physical models typically multiply our credible intervals by 2. My friend across lab has it even worse, they multiply his Bayesian intervals by 3.
This is not a well-functioning machine.
[1] E.g., https://isipta23.sipta.org/, or https://plato.stanford.edu/entries/imprecise-probabilities/#...
[2] https://issuu.com/impreciseprobabilities/docs/imprecise_prob..., first paragraph, although the whole short article is on-point
[3] from memory, the quote is something like, “determining the conditions under which events may be judged independent is one of the major outstanding problems in theory of probability“