Aww, Shit. Here We Go Again.
It took me like, two months to make the last book notes post so that meme will probably be dead by the time this is published. But anyway, here’s Probability Theory by E.T. Jaynes. I actually picked this one up a number of years ago when I was a freshman and vastly overestimating the amount of spare time I’d have on the way to getting my undergrad degree. I remember it being exceptionally conversational and explanatory which I much enjoyed, so hopefully this will be a fun one. I doubt I’ll be taking detailed notes on the whole book - it’s quite a lot longer than Halmos’ book on Set Theory was and the resulting blog post was ~50 pages of text. By the end I was seriously thinking of splitting even that up into multiple posts, so this one will probably address that issue by skimming over the esoterica that doesn’t lead into the applications I’m interested in.
N.
Here Beginneth The Lesson
The preface of the book is mostly a history lesson on the development of probability theory and Jaynes’ particular interest in it. He lists Bernoulli (one of them, anyway) and Laplace as the ones who originally developed the rules of probability theory, and R.T. Cox and George Polya as the ones who showed that there is a unique set of rules for conducting inference with plausibility represented by real numbers. Since Cox and Polya were able to develop their results as principles of logic without reference to chance, they were able to see that the rules of probability theory could be applied much more broadly than first supposed.
He then contrasts the system presented here with those proposed by Kolmogorov and de Finetti, finding technical agreement but philosophical disagreement with Kolmogorov and the opposite with de Finetti, owing to de Finetti's treatment of infinite sets. Since I've never heard of de Finetti, I'm gonna assume Jaynes basically won that fight and hope that his treatment of infinite sets jibes with what I picked up from Halmos.
He then fires a broadside at frequentist inference but posits that neither “frequentist” nor “Bayesian” inference are universally applicable, and that the probability theory presented here should be viewed as a system of extended logic which includes all frequentist and Bayesian methods as special applications. I reaaaaallly hope he lays out exactly what those special conditions are later - if I have to sit through another lecture where someone simply says “this works because it’s Bayesian” without being able to tell whether either of us has understood things correctly I’m just gonna quit statistics forever.
At this juncture he states only that frequentist methods use strong assumptions and ad hoc techniques which break down outside of those assumptions, while Bayesian methods require a lot of prior knowledge. In the early stage of tackling a problem, the frequentist assumptions don't apply and the Bayesian requirements aren't met yet, so he suggests the principle of maximum entropy as a starting point. Looking at the table of contents, though, we don't get to that until pretty deep in the book, and it appears that approach was a particular hobby-horse of Jaynes so I might do some extra poking around.
The last technical note in the preface is that since the theory of infinite sets isn't super intuitive and is rigorous only relative to whichever axioms one prefers to base it on, he'll be doing his best to stick to elementary arithmetic on finite sets of finite integers, and then extend into infinite sets as the limit of the behavior derived for the finite ones. I'm pretty fine with this since the weird things that happen with infinite sets were my biggest headaches in set theory. If we can get to all the points of interest we need without having to invoke these things then I'm all for it.
I.
A Doctor of Letters
True to form, Jaynes opens the discussion without using much math, instead discussing logic and the formation of belief. In the ideal case we'd be able to use strong, deductive reasoning of the type "If A is true then B is true, and we know A is true, so therefore we also know B is true." or the inverse, "If A is true then B is true, and we know B is false, so we know A is false." Buuuuuut, mightn't B be true even if A is false? In that case, knowing B is true doesn't guarantee that A is true, but we have a sense that knowing B is true makes it more plausible that A is true. By example: "If it starts to rain before 10 AM, then the sky will become cloudy before 10 AM, and the sky is cloudy at 9:45." It'd be nice if observing the clouds led us to certainty that it's going to rain, but at best we know that observing the clouds makes it more plausible that it will rain than if the sky were still clear.
Even though in most cases knowledge doesn’t lead to purely deductive consequences, we can still build up strong beliefs based on what we know. Eventually these beliefs become prior knowledge or models that continue to inform future belief formation. Probability theory is the underpinnings of how strongly we should update or change these beliefs as new information comes in. Jaynes proposes to think of the rules of probability theory as those which would allow a robot to usefully carry out reasoning over plausible but uncertain propositions.
For our purposes we'll use Boolean algebra - propositions are represented by capital letters ($A, B, C$, etc.) and combinations like "$A$ and $B$" can be represented as the product $AB$, and "$A$ or $B$" can be represented as the sum, $A + B$. The truth of one proposition given another, "$A$ given $B$", is represented as $(A \mid B)$.
If $A$ is true if and only if $B$ is true, they're said to have the same "truth value", and be logically equivalent. $A = B$ implies logical equivalency, not equal numerical value. Our first axiom of plausible reasoning is therefore that two propositions with the same truth value must be equally plausible. Although they don't mean exactly the same thing, we'll use the usual Parentheses-Products-Sums order of operations to stack complicated propositions together, so that $AB + C = (AB) + C$ but not $A(B + C)$. We'll use a tick mark, like $A'$, to denote the denial of a proposition. $A'$ implies that $A$ is false. $(AB)'$ implies that $AB$ is false, but not necessarily that either $A$ or $B$ is false. $A'B'$ implies that both are false. From the rules above, we can see that $(AB)' = A' + B'$.
I’ve been abusing the word implies a bit, since formally “$A$ implies $B$” doesn’t suggest that $B$ is logically deducible from $A$, just that $A = AB$. $A$ has the same truth value as $AB$. Note that the only arrangement in which this is not the case is where $A$ is true and $B$ is false. If $A$ is false and $B$ is true, it can still be the case that $A$ implies $B$. Technically, all true propositions imply each other. A false proposition implies all propositions, regardless of whether they’re true or false. Essentially, implication isn’t really a great word for the actual relationship of logical implication, and we’ll have to tread lightly around it.
Our next task is to figure out how to build functions which depend on propositions. Since we're dealing with strictly true or false propositions, any proposition which is a function of others can only have $2^n$ possible combinations of inputs, where $n$ is the number of propositions which go into the function. Since there are two possible outputs to any such function, there can only be $2^{(2^n)}$ functions over those input propositions. If we consider the case where $n=2$ and there are 16 possible functions over the two input propositions, then there are four functions which are true at only one of the four input combinations. That is, one function that is true only when both inputs are true, one that's true only when both inputs are false, one that's true only when the first input is true and the second false, and one that's true only when the second input is true and the first false. If we call these inputs $A$ and $B$, we can see that these four are equivalent to the products $AB$, $A'B'$, $AB'$, and $A'B$.
We can treat these four conjunctions as kind of like basis or unit vectors for all the others. Any function over those two variables can be expressed as the sum of the products which are true where it's true. If, for instance, a function is true when $A$ and $B$ are both true, and also true when $A$ is true but $B$ is not, it can be expressed as $AB + AB'$. (Incidentally, this function is also equivalent to just $A$: $AB + AB' = A(B + B') = A$). The point is that we can see from this that product, sum and denial (AND, OR, NOT) are sufficient to reconstruct any possible function over input propositions and therefore form an adequate set of operations for logical functions. In fact, OR isn't strictly necessary as it can be reconstructed from AND and NOT ($A + B = (A'B')'$).
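This is easy enough to check by brute force. Here's a quick sanity check in Python (my own illustration, not from the book) that every one of the 16 functions of two propositions really is just the OR of the basis conjunctions that are true where it's true:

```python
from itertools import product

# The four (A, B) truth-value combinations, and the four "basis" conjunctions
# AB, AB', A'B, A'B' -- each one is true at exactly one of those combinations.
inputs = list(product([True, False], repeat=2))
basis = [
    lambda A, B: A and B,              # AB
    lambda A, B: A and not B,          # AB'
    lambda A, B: (not A) and B,        # A'B
    lambda A, B: (not A) and (not B),  # A'B'
]

# Every one of the 2^(2^2) = 16 functions of (A, B) is an assignment of
# True/False to the four input combinations. Rebuild each as the OR of the
# basis conjunctions that are true where the function is true.
for outputs in product([True, False], repeat=4):
    f = dict(zip(inputs, outputs))
    for x in inputs:
        assert any(basis[k](*x) for k in range(4) if f[inputs[k]]) == f[x]

# The specific identity from the text: AB + AB' = A.
for A, B in inputs:
    assert ((A and B) or (A and not B)) == A

print("all 16 functions rebuilt from AND, OR, NOT")
```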
Stretching things to the extreme, it's actually possible to reconstruct all functions out of only one basic operation - either Not-AND (NAND) or Not-OR (NOR).
$A$ NAND $B$ = $(AB)’$. Thus $A’ = A$ NAND $A$, and $AB$= ($A$ NAND $B$) NAND ($A$ NAND $B$).
$A$ NOR $B$ = $(A + B)’$, Thus $A’$ = $A$ NOR $A$, and $AB$ = ($A$ NOR $A$) NOR ($B$ NOR $B$).
This is great for designing computers since you can therefore construct every possible combination out of many, many NAND (or NOR) gates, so you only need to get good at building NAND gates to have a logically complete computer.
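As a sketch (my own, not from the book), here's the NAND-only construction checked over a truth table:

```python
# NAND alone is enough: build NOT, AND, and OR out of it and check the
# truth tables (a quick illustration of the identities above).
def nand(a, b):
    return not (a and b)

def not_(a):
    return nand(a, a)                    # A' = A NAND A

def and_(a, b):
    return nand(nand(a, b), nand(a, b))  # AB = (A NAND B) NAND (A NAND B)

def or_(a, b):
    return nand(not_(a), not_(b))        # A + B = (A'B')' = A' NAND B'

for a in (True, False):
    assert not_(a) == (not a)
    for b in (True, False):
        assert and_(a, b) == (a and b)
        assert or_(a, b) == (a or b)

print("NOT, AND, OR all recovered from NAND alone")
```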
Having laid out that we have a sound basis for complete reasoning on strictly true or false propositions, Jaynes starts to lay out the properties we'd want from a system of reasoning over uncertain propositions - properties that are desirable even if not provably or axiomatically true (yet). (I also learn at this point that the fancy word for these is "desiderata").
Our wishlist is as follows:
Degrees of plausibility are represented by real numbers
Conclusions will qualitatively correspond to common sense
Conclusions will be consistent, meaning:
If a conclusion can be reached in more than one way, every way will reach the same conclusion
Conclusions will take into account all the evidence we have
Equivalent states of knowledge result in equivalent plausibility assessments.
Surprisingly short, but with some less obvious implications. By convention we'll work so that greater plausibility leads to a greater real number (though that's not strictly necessary) and also that our representation will be continuous, so that infinitesimally greater plausibility leads to an infinitesimally greater real number.
Apparently these requirements are enough to uniquely identify a system of rules for manipulating plausibilities.
Jaynes closes out the chapter, as he will most chapters, with a few informal comments. This set is mostly dedicated to reinforcing that all the rules laid out here deal strictly with the plausibility of propositions, not with any of the other myriad emotions and reactions that a proposition might draw. Also, these rules are not necessarily the ones adhered to by the human mind - there's nothing inside of probability theory to suggest that human minds use real numbers to reason about plausibility (Some recent reading of mine suggests that really we tend to operate mostly with only True/False/Maybe assessments of statements). I'm probably going to omit any notes on these comments unless I think they're germane to understanding the concepts laid out in the main text, which Jaynes says they won't be, but you never know.
II.
Hello Numbers My Old Friend
So, let's get cracking. This section covers the deduction of the quantitative rules which fall out of our wish list. It's interesting that Jaynes refers to it as a "deduction" of these rules, rather than a formulation. I don't doubt that he's right on the uniqueness of these rules for quantitative inference using real numbers, but it seems to be a meta-textual point he tries to reinforce through language choice as much as formal proof. He states that the rules have a "long, complicated, and astonishing history, full of lessons for the scientific methodology in general," which I will be skipping over here.
So, how to quantitatively evaluate plausibilities constructed out of combinations and logical operators? We've already seen a couple of operations or combinations of operations that are sufficient to replicate any logical function, so let's try and work those out. In particular, we're gonna try and work out rules for using AND and NOT, since we saw earlier that OR is simple to reconstruct out of them ($A + B = (A'B')'$) and NAND and NOR are kinda weird and clunky.
Starting with AND, we get the Product Rule. What’s the plausibility of $AB$ knowing the plausibilities of $A$ and $B$ separately? We’ll also include $C$ as a proposition representing everything else we know to be true so that we don’t end up trying to take into account $AB\mid$”The World is Flat” as well as $AB\mid$”The World Is Round”.
Procedurally, a robot could determine whether $AB$ is true by first deciding whether or not $B$ is true, and then decide whether $A$ is true knowing $B$ is true, or vice versa. Thus, $(AB\mid C)$ is a function of $(B\mid C)$ and $(A\mid BC)$. It might seem that you could evaluate $(AB\mid C)$ by evaluating each $(A\mid C)$ and $(B\mid C)$ independently, but as a counter example, if you know there’s a reasonable plausibility the next person you see will have a blue eye, and also a reasonable plausibility that they will have a brown eye, can you then judge the plausibility that the next person you see will have both a blue and a brown eye? We have no way of taking into account the interaction between these two propositions without taking into account one of them when evaluating the other. Jaynes cites Tribus (1969) for a proof by exhaustion showing that any function which attempts to formulate $(AB\mid C)$ based on $(A\mid C)$ and $(B\mid C)$ breaks down our consistency conditions. Indeed, only functions which use either $(B\mid C)$ and $(A\mid BC)$ or $(A\mid C)$ and $(B\mid AC)$ do not.
If we call the plausibilities of whichever pair of conditional propositions we choose $x$ and $y$, then we know that the product rule must be a continuous, monotonically increasing function of both $x$ and $y$. In order to satisfy our requirement that different approaches to the same proposition must yield the same plausibility, we must also have that the function be associative. That is, if we tried to apply this to a product of three propositions, $F(ABC\mid D)$, then we should get the same result whether we try to evaluate the product as $F((AB)C\mid D)$, $F(A(BC)\mid D)$ or $F((AC)B\mid D)$. Fundamentally: $F[ F(x,y), z ] = F[x, F(y,z) ]$.
We'll call our "plausibility evaluator" function which must have all these properties "$w$" for now. After some calculus, Jaynes arrives at the form:
$$w(AB\mid C) = w(A\mid BC)w(B\mid C) = w(B\mid AC)w(A\mid C)$$
Where $w$ is some positive, continuous, monotonic function, but which we can already get some pointers from. In the case that $C$ implies $A$, that is we know $A$ to be certain given our background information, $C$, we know that $(A\mid BC) = (A\mid C)$ since $B$ being true or not doesn’t give us any more information once we know $C$. We also know that $(AB\mid C) = (B\mid C)$ for similar reasons. Thus, we can replace those two in the case that $C$ implies $A$ and arrive at:
$$w(B\mid C) = w(A\mid C)w(B\mid C)$$
Dividing both sides by $w(B\mid C)$:
$$1 = w(A\mid C)$$
Thus, though we started out simply trying to figure out what form our “plausibility evaluator” function must take when evaluating conjunctions, we have incidentally derived the property that certain truth for a proposition is represented by the number 1.
Similar algebra leads us to the conclusion that certain falsehood for a proposition must be represented either by 0 or infinity. Together with our requirement that our plausibility evaluator function be continuous and monotonic, we know that if it is monotonically increasing, it must range from 0 for impossibility to 1 for certainty, or if it is monotonically decreasing, then it must range from infinity for impossibility down to 1 for certainty. By convention, we use the monotonically increasing case, where 0 represents impossibility and 1 represents certainty. These two conventions are mathematically equivalent in content, since one is just the inverse of the other, so nothing is lost by the choice.
Note that at this point, we’ve only fixed the behavior of the evaluator at its extremes and said that it will be continuous and monotonically increasing with plausibility between them, but thus far set no further limits on how it must vary.
Moving on to determining how to evaluate NOT statements, we arrive at the Sum Rule, so called because we start from the fact that $A + A'$ is always true. We can thus see that the plausibility of a statement being false depends in some way on the plausibility of it being true. If we call the relation between the plausibility of a statement being true and that same statement being false $S$, we can write out that:
$$w(A’\mid C) = S[ w(A\mid C) ]$$
We know that $S(0) = 1$ and $S(1) = 0$, but we have to get more specific, and using a lot of algebra and calculus, Jaynes arrives at the conclusion that our plausibility evaluator must obey the rule:
$$[w(A\mid C)]^m + [w(A’\mid C)]^m = 1$$
For some positive $m$. But, the value of $m$ is irrelevant, since for any $m$ we could define:
$$p(x) = w(x)^m$$
since our only requirements on $w$ are that it's a continuous, monotonically increasing function ranging from 0 to 1. If $w$ satisfies this, then so does $w^m$ for any positive $m$, so $p$ would also.
Thus, we’re free to write the product and sum rules as follows:
$$p(AB\mid C) = p(A\mid C)p(B\mid AC) = p(B\mid C)p(A\mid BC)$$
$$p(A\mid C) + p(A’\mid C) = 1$$
We know that AND and NOT are sufficient to reconstruct any logical function, and we derived these two as quantitative analogies to those operators, but are we sure they’re sufficient? Well, let’s first work out whether it’s possible to reconstruct OR from these two. Remembering that $A + B = (A’B’)’$:
$p(A + B\mid C) = p((A’B’)’\mid C)$
$p(A + B\mid C) = 1 - p(A’B’\mid C)$
$p(A + B\mid C) = 1 - p(A’\mid C)p(B’\mid A’C)$
$p(A + B\mid C) = 1 - p(A’\mid C)[1 - p(B\mid A’C)]$
$p(A + B\mid C) = 1 - p(A’\mid C) + p(A’\mid C)p(B\mid A’C)$
$p(A + B\mid C) = p(A\mid C) + p(A’B\mid C)$
$p(A + B\mid C) = p(A\mid C) + p(B\mid C)p(A’\mid BC)$
$p(A + B\mid C) = p(A\mid C) + p(B\mid C)[1 - p(A\mid BC)]$
$p(A + B\mid C) = p(A\mid C) + p(B\mid C) - p(B\mid C)p(A\mid BC)$
$p(A + B\mid C) = p(A\mid C) + p(B\mid C) - p(AB\mid C)$
This last form of OR is often presented as the sum rule, rather than the quantitative form of NOT we started with, and it's the more generally useful one, since our first "sum rule" was just $p(A+B\mid C)$ with $B = A'$. So, from here on out, I'll be referring to this form as the sum rule.
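Here's a quick numeric sanity check of that generalized sum rule (my own, not from the book): assign exact probabilities to the four basis conjunctions and confirm the identity holds.

```python
from fractions import Fraction as F

# Assign probabilities to the four mutually exclusive conjunctions
# AB, AB', A'B, A'B' (all conditional on some background C) and confirm
# that p(A+B|C) = p(A|C) + p(B|C) - p(AB|C).
p_AB, p_AB_, p_A_B, p_A_B_ = F(1, 8), F(1, 4), F(1, 2), F(1, 8)
assert p_AB + p_AB_ + p_A_B + p_A_B_ == 1      # the four cases are exhaustive

p_A = p_AB + p_AB_                # A  = AB + AB'
p_B = p_AB + p_A_B                # B  = AB + A'B
p_A_or_B = p_AB + p_AB_ + p_A_B   # A + B is true everywhere except A'B'

assert p_A_or_B == p_A + p_B - p_AB
print(p_A_or_B)                   # 7/8
```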
To extend this - in the same way that we noted any function could be constructed out of a logical sum of products of its primitive propositions, you can evaluate any function by repeated application of the product and sum rules, so we'll take these as sufficient for conducting plausible reasoning.
Qualitatively, we see that these rules preserve the common sense of logical reasoning. If $A$ is true, its negation must be false, and if $D$ is the premise that $A$ implies $B$, then:
$\begin{equation}p(B\mid AD) = \frac{p(AB\mid D)}{p(A\mid D)}\tag{1}\end{equation}$
$\begin{equation}p(A\mid B’D) = \frac{p(AB’\mid D)}{p(B’\mid D)}\tag{2}\end{equation}$
If $A$ implies $B$, then $p(AB\mid D) = p(A\mid D)$, so the first equation above simplifies to $p(B\mid AD) = 1$. In the second equation above, $p(AB’\mid D) = 0$, so $p(A\mid B’D) = 0$. Thus, our common-sense notions hold up: if $A$ implies $B$, then $A$ being true implies $B$ being true, and $B$ being false implies $A$ being false.
In other words - deductive logic is just the special case of plausible reasoning where we’re certain of our premises. Even stronger - it’s the limiting case we approach as we become more and more certain of our premises. Even better - our rules capture our desired sense that if $A$ implies $B$, then knowing $B$ is true makes $A$ more plausible, since we can write out the product rule:
$$p(AB\mid C) = p(A\mid C)p(B\mid AC) = p(B\mid C)p(A\mid BC)$$
So, to evaluate $p(A\mid BC)$, we divide the middle statement by $p(B\mid C)$:
$$p(A\mid BC) = p(A\mid C)\frac{p(B\mid AC)}{p(B\mid C)}$$
If we know that $A$ implies $B$, then $p(B\mid AC) = 1$. Knowing nothing else about $B$, we still know that $p(B\mid C) \leq 1$. Thus, $\frac{p(B\mid AC)}{p(B\mid C)} \geq 1$, and so $p(A\mid BC) \geq p(A\mid C)$. Thus, we have a quantitative, definitive answer to the question "How much more likely is $A$, knowing $B$?" In fact, we see that the degree of increase in $A$'s plausibility is inverse to $B$'s plausibility. The less likely $B$ is, a priori, the more likely $A$ becomes once we observe $B$. If we live out in the desert where it's almost never cloudy unless it's about to rain, then seeing clouds makes us very confident it's about to rain. If we live in a foggy valley where clouds roll in all the time whether or not it's about to rain, then seeing clouds doesn't make us much more confident that it's about to rain. (Psst. This form of the product rule is a famous formula that some people have gone so far as to get tattoo'd on themselves. We'll talk more about what drives this kind of behavior later.)
To close out this section, let’s finally talk about numbers. After all, based only on what we’ve figured out so far, suppose that we think $p(A\mid C) = 0.6$, and $p(B\mid C) = 0.01$. From the above, if $A$ implies $B$, then $p(A\mid BC) = 0.6 * \frac{1}{0.01} = 60$. That doesn’t quite square with our requirement that p(x) range between 0 and 1.
Let’s start by thinking about the plausibility of any one of a number of mutually exclusive statements being true. By repeated application of the Sum Rule, and noting that mutual exclusivity requires that terms with multiple propositions being true drop out, we find that this relation results purely in the sum of the plausibilities of the individual propositions. Let’s say $X$ and $Y$ are mutually exclusive.
$$p(X + Y\mid C) = p(X\mid C) + p(Y\mid C) - p(XY\mid C)$$
Since $X$ and $Y$ are mutually exclusive, $p(XY\mid C) = 0$, and so:
$$p(X + Y\mid C) = p(X\mid C) + p(Y\mid C)$$
(You can extend this to 3, 4, or any number of mutually exclusive propositions. Jaynes makes a particular point that this is a consequence of the consistency desiderata and not a fundamental axiomatic portion of the theory, and that intuitive reasoning after this point tends to lead to confusion, so we’re gonna be very careful to stick to the logic and not intuition to go any further).
If it is the further case that the propositions are not only mutually exclusive but exhaustive, then that sum must equal 1. If we have no further information to distinguish the plausibility of any one of these propositions from the others, then the desideratum that equivalent states of knowledge result in equivalent plausibility assessments requires that we assign them all the same plausibility. The only value that can therefore be assigned to the plausibility of each of $n$ different mutually exclusive and exhaustive propositions is $1/n$ - thus we've arrived at our first derived numerical plausibility value for an uncertain proposition!
(I’ve skipped a longer explanation here about the equality condition holding no matter what we label the propositions as, but I think I’ve captured the gist of it. If I’ve committed a grave sin in some form, please send me a politely worded letter explaining my folly.)
The reason our earlier numerical example broke down is that since $A$ implies $B$, $p(B\mid C) \geq p(A\mid C)$, which our arbitrarily chosen numbers violated.
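For the sake of closure, here's a version of that update with numbers that do respect the constraint (my own illustration; it also puts numbers on the desert-vs-foggy-valley point from earlier):

```python
# If A implies B we need p(B|C) >= p(A|C); with that respected, the update
# p(A|BC) = p(A|C) * p(B|AC) / p(B|C) stays inside [0, 1].
p_A_given_C = 0.006    # prior plausibility of rain before 10 AM
p_B_given_AC = 1.0     # rain implies clouds

p_B_given_C = 0.01     # desert: clouds are rare
print(p_A_given_C * p_B_given_AC / p_B_given_C)   # 0.6  -- a rare B boosts A a lot

p_B_given_C = 0.9      # foggy valley: clouds all the time
print(p_A_given_C * p_B_given_AC / p_B_given_C)   # ~0.0067 -- barely moves
```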
The deep takeaway is not in the particular value we've arrived at, but in the fact that it was our state of knowledge - that the propositions were mutually exclusive, exhaustive, and with no other plausibility-distinguishing characteristics - which led us to this evaluation. It did not depend on a more particular form of $p(x)$, even though we previously said that any $m$ for $w(x)^m$ satisfied our conditions for a continuous, monotonically increasing function. Any $p(x)$ we choose must arrive at this same result, which adds a rigid, uniquely identifying constraint on our choice for $p(x)$. Thus, we can now drop the pretense of calling $p(x)$ a plausibility evaluator function, and simply refer to it as the probability of the proposition. To commemorate this momentous occasion, we'll now start using $P(x)$ when evaluating symbolic propositions.
III.
I’LL TAKE YOUR ENTIRE STOCK!
On to jars of red and white balls. By applying the generalized sum rule to our problem of mutually exclusive and exhaustive propositions, we can arrive at the Bernoulli Urn Rule that was, for a time, the definition of a probability. If some property is true of $M$ propositions within the total set of $N$ equally likely propositions, then the probability of that property being true is $M/N$. This combined with our product, sum, and extended sum rules as well as the prior result that the probability of any one of those $N$ propositions being true is $1/N$ forms the basis for quite a bit of probability theory.
Jaynes proceeds to build up sampling theory from the Bernoulli urn problem - if we have an urn that contains $N$ balls, $M$ of which are colored red, and we draw one at a time, blindfolded, and repeat, we can figure out the probability of drawing a red ball at any particular point in the sequence. This process is called Sampling Without Replacement, and is analogous to drawing samples from a population in a scientific experiment.
Since drawing a white ball is equivalent to NOT drawing a red ball, they're related by negation, and thus the sum rule. At first draw, the probability of drawing a red ball, $P(R_1\mid B)$ ($B$ representing our background knowledge of the problem), is $M/N$, and the probability of drawing a white ball, $P(W_1\mid B)$, is $1-P(R_1\mid B) = 1-M/N$.
It should be stressed that these probabilities are not physical properties of the urn or its contents, but results of our state of knowledge prior to drawing the first ball. If, for instance, we weren't sure how many red and white balls were in the urn, we'd arrive at different results, even if the urn in fact contained exactly the same numbers of each as in the case where we know the amounts. As our state of knowledge changes, so too do our probability assessments.
If we ask what is the probability of drawing a red ball on the first two draws, $P(R_1,R_2\mid B)$ (using the comma to separate them for clarity - this notation is equivalent to the product of the propositions), we can evaluate it using the product rule:
$$P(R_1,R_2\mid B) = P(R_1\mid B)P(R_2 \mid R_1,B)$$
The first factor we already know is $M/N$. The second factor is in effect asking what's the probability that we draw a red ball knowing that there is one less red ball and one less ball overall: $(M-1)/(N-1)$. The final result is therefore:

$$P(R_1,R_2\mid B) = \frac{M}{N}\cdot\frac{M-1}{N-1}$$
Continuing in that way, we find that the probability of drawing $r$ red balls in a row from the first draw is:

$$P(R_1, R_2, \ldots, R_r\mid B) = \frac{M!\,(N-r)!}{(M-r)!\,N!}$$
Similarly examining the probability of drawing a number of white balls in order, or a number of red balls and then a number of white balls in order, we find that the probability of drawing $r$ red balls in order, and then $w = n-r$ white balls, in $n$ draws is:

$$P(R_1 \cdots R_r, W_{r+1} \cdots W_n\mid B) = \frac{M!\,(N-M)!\,(N-n)!}{(M-r)!\,(N-M-w)!\,N!}$$
Though this result was derived for drawing only red balls and then only white balls, in fact, the same result is arrived at for drawing r red balls in any particular order out of n draws, since the same set of factors ends up on the right hand side.
To find the overall probability of drawing $r$ red balls in $n$ draws, we need to sum over all the particular orders that might happen. The number of different ways of drawing $r$ red balls in $n$ draws is given by the binomial coefficient:

$$\binom{n}{r} = \frac{n!}{r!\,(n-r)!}$$
Multiplying the binomial coefficient with the above probability of particular orders and rearranging, we actually arrive at a ratio of three binomial coefficients. If $A$ is the proposition that we draw exactly $r$ red balls out of $n$ draws in any order, then:

$$h(r\mid N, M, n) = P(A\mid B) = \frac{\binom{M}{r}\binom{N-M}{n-r}}{\binom{N}{n}}$$

(At this point it's good to note that you can either approach these formulas by imposing the restriction that the second argument of the binomial coefficient must be less than or equal to the first argument, or else evaluate factorials with the Gamma function, $x! = \Gamma(x+1)$, which blows up when $x$ is a negative integer; a factorial like that in the denominator sends the binomial coefficient, and hence the probability, to zero.)
The result above is called the Hypergeometric Distribution because of its relationship to coefficients of a hypergeometric series. (I started down that rabbit hole, but I don’t think it goes anywhere particularly related to the subject at hand.) We call it h to reflect that at this point it’s just a function of real numbers, rather than a propositional calculation.
One of $h$'s useful properties is that you can find $r^*$, the most probable value of $r$ (the number of red balls you're most likely to draw in $n$ draws), by setting the probability of drawing $r'$ red balls, $h(r')$, equal to the probability of drawing $r'-1$, $h(r'-1)$, solving for $r'$, and flooring it. The result is:

$$r^* = \left\lfloor \frac{(n+1)(M+1)}{N+2} \right\rfloor$$
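A small sketch in Python of the distribution and that mode formula (the function and variable names are mine):

```python
from math import comb

def h(r, N, M, n):
    """Hypergeometric pmf: probability of exactly r red balls in n draws,
    without replacement, from an urn of N balls with M red."""
    if r < 0 or r > n or r > M or n - r > N - M:
        return 0.0
    return comb(M, r) * comb(N - M, n - r) / comb(N, n)

N, M, n = 100, 30, 10
pmf = [h(r, N, M, n) for r in range(n + 1)]
assert abs(sum(pmf) - 1.0) < 1e-12              # possible r values are exhaustive

r_star = (n + 1) * (M + 1) // (N + 2)           # floor((n+1)(M+1)/(N+2))
assert r_star == max(range(n + 1), key=lambda r: h(r, N, M, n))
print(r_star, round(h(r_star, N, M, n), 4))     # 3 is the most probable count here
```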
Up until this point, everything we’ve discussed has been all about orienting our state of knowledge with respect to the problem, and this fact stays firmly within those bounds, but it’s a little bit closer to something like a physical prediction.
$h$ is what's known as a Probability Mass Function (for a discrete variable like this; the continuous analogue is a density), since it gives the probability of drawing each specific number of red balls. To figure out the probability of drawing, say, $R$ or fewer red balls, we have to sum over those probabilities and generate what's known as a Cumulative Distribution Function. For this case, dealing with integer values, Jaynes defines the hypergeometric cumulative distribution as a staircase function, so that non-integer values resolve to their integer floor. That is, we could evaluate, say, the probability of drawing 6.5 or fewer red balls out of 10 draws, and it'd be the same as drawing 6 or fewer balls out of 10 draws.
The Median of a probability distribution is the value such that equal probability is assigned to outcomes above and below it. Strictly speaking, discrete distributions have no median unless there happens to be an integer which exactly satisfies the condition. It's typical in that case to take the integer value at which the cumulative probability comes closest to 1/2, particularly if $n$ is large.
One of the surprising facts about the hypergeometric distribution is that it makes no difference if you swap the number of red balls and the number of draws if the total number of balls remains the same. That is, if you draw ten balls from an urn with 50 red balls or make 50 draws from an urn with 10 red balls, the probability of drawing a particular number of red balls remains the same. (Up until you run out of red balls). Another symmetry arises around the peak of the distribution - the probability of drawing one more red ball than r* is the same as drawing one less, two more is the same as two less, etc. The distribution has a familiar, symmetric bell curve shape.
We’ve been talking a lot about sequences of draws here, but what about our probabilities of drawing red balls at specific points? What’s the probability we should assign to drawing a red ball on the second draw before we’ve even drawn the first ball?
Well, even though we don’t know which color ball will be drawn on the first draw, we know it’ll be either red OR white, so:
$$P(R_2\mid B) = P((R_1 + W_1)R_2\mid B) = P(R_1,R_2 + W_1,R_2\mid B)$$
Using the sum rule:
$$P(R_2\mid B) = P(R_1,R_2\mid B) + P(W_1,R_2\mid B) - P(R_1,W_1,R_2\mid B)$$
Since the last one involves two mutually exclusive propositions, it drops out and we apply the product rule to the first two terms:
$$P(R_2\mid B) = P(R_2\mid R_1,B)P(R_1\mid B) + P(R_2\mid W_1,B)P(W_1\mid B)$$

$$P(R_2\mid B) = \frac{M-1}{N-1}\cdot\frac{M}{N} + \frac{M}{N-1}\cdot\frac{N-M}{N}$$

$$P(R_2\mid B) = \frac{M}{N}$$
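This is easy to confirm by brute force (my own check, not from the book): enumerate every equally plausible ordering of the balls and count how often each draw position comes up red.

```python
from fractions import Fraction as F
from itertools import permutations

# Enumerate every ordering of the balls; with no other information each
# ordering is equally plausible, so the probability of red on the k-th draw
# is just the fraction of orderings with red in position k.
N, M = 5, 2
balls = ['R'] * M + ['W'] * (N - M)
orders = list(permutations(balls))

for k in range(N):
    p_red_at_k = F(sum(o[k] == 'R' for o in orders), len(orders))
    assert p_red_at_k == F(M, N)

print("P(red on draw k | B) = M/N for every k")
```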
Kinda surprising, but if you do the same thing for R3, R4, etc, you’ll find that you always end up with M/N. Our probability of red at any draw, if we do not know the result of any other draw, is always M/N. This particular result is maybe part of the reason people tend to see probabilities as somehow an inherent part of the physical system since it seems invariant, but it is in fact simply a result of our state of knowledge prior to drawing any balls.
One (arguably) weird quirk that we encounter at this point is that while it might seem logically, or at least causally, necessary that our probabilities for a draw at any particular point should only be influenced by the draws that came before that point, this is not the case. To emphasize that probabilities are directly a result of our state of knowledge and not the physics or causality of the system, imagine an urn with only one red ball in it. If we know that the second draw was a red ball, our probability that the first draw is red must immediately go to zero. In fact, if you expand this out to the arbitrary case of M red balls out of N total, knowing the result of a later draw has exactly the same effect on our probability assessments as knowing the results of an earlier draw, since it has the same effect of removing one red ball out of the total.
(Jaynes goes on a long tangent about historical originations of other theories of probability and their relation to causality that I’ll pretty much omit. I do think he trips over himself a little bit here, since in order for you to have information about the second draw it has to have already happened, so it’s not really “future” information that has somehow made it to you in the “present”. He compares it to how hearing the notes of a band playing some distance away actually influences your perception of what they were playing in the past, but again, it only influences your present and future states of knowledge with information from the past - your past state of knowledge can’t be changed by new information, so this stuff isn’t as spooky and counter-intuitive as he makes it out to be.)
Anyway, we move by inches again toward turning our current state of knowledge into predictions about the future. Now that we know something about assigning probabilities to individual, mutually exclusive, and exhaustive propositions, we can begin to talk about our Expectation of the scenario. That is, by taking the probabilistically weighted sum over the outcomes, we can come up with a single value to represent the expected outcome. (Jaynes doesn’t say much about forming expectations over propositions that can’t themselves be represented by real numbers, and I’m not sure how to interpret them. I mean, what’s the “expectation” in the urn problem? If there’s three red balls out of ten, it doesn’t mean that I expect a ball to come out light pink. The numerical outcome of a six-sided die has an expected value of 3.5, but I don’t expect there to appear a previously unseen side of the die with three and a half dots on it. He does comment that expectation isn’t really a great word for it because of issues like this, but…)
Jaynes spends the next few sections showing different forms of the hypergeometric distribution and some purely mathematical results that can be proven with it, and then considers the case that the total number of balls in the urn is much larger than the number of draws. Eventually in the limit that the total number of balls approaches infinity, we have essentially the case that each draw doesn’t affect the contents of the urn. This is the case of Sampling with Replacement, and we achieve much the same result if every time you take a ball out of the urn you place it back in. Of course, when the results derived here are applied to scientific experiments, experimenters usually avoid drawing the exact same sample twice, leaning instead on the assumption that the population they’re drawing on is large enough to approximate the infinite limit of the hypergeometric distribution, which happens to be the much simpler Binomial Distribution:
$$b(r\mid N, M, n) = \binom{n}{r}\left(\frac{M}{N}\right)^r\left(1 - \frac{M}{N}\right)^{n-r}$$
In most (undergraduate) courses on probability theory, the binomial distribution is taught first and the hypergeometric distribution later, if at all, because of its greater mathematical complexity. I do appreciate that Jaynes sticks to his guns about approaching things from the logically simplest case rather than the mathematically simplest case, even if it meant I ended up glossing over most of the mathematical properties of the hypergeometric distribution. Practically speaking, the binomial distribution approximates the hypergeometric pretty well in the case that there are at least ten times as many total balls as there are draws.
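A quick look at that rule of thumb (the numbers here are my own), comparing the two distributions as the urn grows relative to the number of draws:

```python
from math import comb

def hypergeom(r, N, M, n):
    # Sampling without replacement: N balls, M red, n draws.
    return comb(M, r) * comb(N - M, n - r) / comb(N, n)

def binomial(r, n, p):
    # Sampling with replacement (the large-urn limit), success probability p.
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, frac_red = 10, 0.3
for N in (100, 1_000, 10_000):
    M = int(frac_red * N)
    worst_gap = max(abs(hypergeom(r, N, M, n) - binomial(r, n, M / N))
                    for r in range(n + 1))
    print(N, worst_gap)   # the largest disagreement shrinks as N grows
```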
I glossed over a lot when I said that extending sampling without replacement to the case where sampling doesn't affect the contents of the urn is equivalent to sampling with replacement. Jaynes has a long section about how sampling with replacement isn't strictly random (if I put the ball I just drew back on top of the urn, am I more or less likely to draw it again?), but this section is mostly just a rant about how nothing is actually fully random - everything is governed by precise, deterministic physical processes, and it's just that we can't be bothered to keep track of everything or solve the actually hard problem of modelling these processes, so we just call their results random.
(I’m being overly dismissive - this section is probably one of the most important in the book, and the one I most vividly recall from reading it years ago. It’s Section 3.8.1 in the book and is worth excerpting and reading on its own if you’ve got the inclination)
The next section is a precise detailing of exactly how complicated it gets to account for influences of one draw on another, even if you just take into account that drawing a red or white ball on one occasion slightly nudges only the next draw to be more likely to repeat it. For small numbers of draws the difference from the binomial distribution is negligible, but when the magnitude of the nudges multiplied by the number of draws approaches 1, the difference can become arbitrarily large. That is, if drawing a red or white ball makes it 1% more likely that the next draw will be the same, then after about 100 draws, the binomial distribution will completely fail to give decent predictions.
To get toward an analytical description of the behavior, you have to define a Markov Chain which relates the probability at one step to the probabilities at the previous step, which can then be extended out to any particular step in the process. A Markov chain can be defined by a Transition Matrix that when multiplied by the vector of probabilities for particular outcomes at the previous step, produces the probabilities for the next step. Using some matrix math we can find analytic expressions for the probabilities at any point in the sequence. (I’m glossing over the particulars because trying to express the math on this blog would probably result in some seriously ugly spaghetti, and cause linear algebra is next on my list after this, so I don’t wanna make claims of understanding this better than I do at this point.)
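As a minimal sketch of that bookkeeping (the nudge size and base probability here are my own toy numbers, not Jaynes'), the one-step-influence urn can be written as a two-state Markov chain and rolled forward by matrix multiplication:

```python
import numpy as np

p_red = 0.3   # base probability of red
eps = 0.01    # one-step nudge: the next draw is more likely to repeat the last color

# Transition matrix: columns are the last draw's color (red, white),
# rows are the next draw's color (red, white).
T = np.array([[p_red + eps,       p_red - eps],
              [1 - (p_red + eps), 1 - (p_red - eps)]])

state = np.array([p_red, 1 - p_red])   # probabilities for the first draw
for k in range(2, 8):
    state = T @ state                  # probabilities for draw k, knowing no outcomes
    print(k, state)
# With no outcomes known, the marginals settle toward the chain's stationary
# distribution; once we learn a draw's outcome, we would instead restart from
# that certain state and propagate forward from there.
```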
The takeaway is that Markov chains are great ways to model processes where the probabilities change as the process proceeds and you gain new information. The method applied in this section only applies to one-step influences, and only really helps in updating our probabilities for the event right after one whose outcome we know. Jaynes notes that techniques for multi-step influences exist as well, but is less sanguine on the prospect of backward propagation. He notes it's possible, but also that most authors do it completely wrong. As with all truly "fun" problems, Jaynes leaves working out the answer as an exercise for the reader.
Closing out this section I think some of Jaynes’ commentary bears repeating. We derived a lot of this theory from the concept of drawing balls from an urn, and while that holds up decently well for taking samples from a population, it strains credulity a bit to apply the same analogy to things like flipping a coin, measuring wind velocity, or instantaneous stock prices. Comparing the two often leads people to think of these things as being drawn from an urn and while there’s a direct correlation between our state of knowledge and the physical presence of balls in an urn, that correspondence breaks down for the latter examples. There’s nothing at all inherently random about coin flips, stock prices aren’t actually normally or otherwise probabilistically distributed - our probability assessments are strictly, strictly statements about our current state of knowledge about the system. He terms treating our state of knowledge as an inherent property of the system the Mind Projection Fallacy, and suggests that it diverts our attention away from things that really matter but aren’t easily fit into models. I can think of half a dozen examples of this line of thinking in courses I’ve taken and books I’ve read recently. The most glaring one is Peter Thiel’s Zero-to-One which makes the argument that startup investment returns and a whole lot of other things just are power-law distributed. This is exactly the fallacy Jaynes is talking about - the distribution isn’t etched into the fabric of reality, it’s just the best prediction we can make a priori. This section, too (3.11 in the book) is worth excerpting and reading on its own.
Up next, we deal with the fact that this entire section is about trying to make predictions about data based on hypotheses, when really what we usually want to do is draw conclusions about hypotheses based on data.
IV.
Does Data Have A Soul?
Jaynes makes the point that all inference is the practice of calculating the probability that a proposition is true, conditional on all the evidence at hand. In the previous section we conducted inference by calculating the probability of obtaining a particular dataset given a hypothesis. In this section, we’ll go about calculating the probability of a hypothesis given a dataset.
The second form, though in principle no different from the first, has historically been misunderstood or misapplied, if applied at all. Indeed, even today most hypotheses are tested not by calculating their probability given the dataset, but by calculating the probability of drawing a dataset at least that extreme under a null hypothesis, and rejecting the null only if that probability is lower than some threshold (usually 5%). If this seems ass-backward, well, it kinda is, but it doesn't require anything except a hypothesis and a dataset, while direct hypothesis testing requires a bit more than that.
To start with, the requirement that we take into account all evidence at hand means you need to establish a Prior Probability that the hypothesis is true. That is, you need to determine the probability that the hypothesis is true given all the information you had before conducting an experiment or feeding it some new dataset of interest. Strictly speaking, this means you have to include every scrap of experience you’ve had since you were born. In our previous section, we conditioned all our probabilities on background information about the problem, and usually defined our problems so that anything outside the scope of the problem statement was irrelevant. Nevertheless the point stands that there’s no such thing as an absolute probability - only the probability derived from the background information relevant to its calculation, available at the time of the calculation.
That said, strict adherence to this policy is pretty much impossible. Are you really going to recalculate the probability of, say, war breaking out somewhere today based on what you had for breakfast? Practically speaking, we have to make choices about what information is really relevant to the calculation at hand - we have to choose a way to construct a prior, and unfortunately there's no universal rule for doing so. Jaynes suggests four general principles:
Group Invariance
Maximum Entropy
Marginalization
Coding Theory
But for now he doesn't go into much detail on them.
We'll use the notation that $X$ stands for whatever information we use to calculate our prior probabilities, $D$ stands for new data under consideration, and $H$ stands for a hypothesis under consideration. In this sense, $P(D\mid HX)$ is the kind of sampling probability distribution we worked on calculating in the last section. If instead we want to calculate probabilities for hypotheses given data, we can use the product rule to write:

$$P(DH\mid X) = P(D\mid X)P(H\mid DX) = P(H\mid X)P(D\mid HX)$$
And rearrange to arrive at:
$$P(H\mid DX) = P(H\mid X)\frac{P(D\mid HX)}{P(D\mid X)}$$
In other words, the probability of a hypothesis given new data is the probability of that hypothesis based on background information, multiplied by the ratio of how much more likely the data is if the hypothesis is true than it is based only on the background information. The probability of the hypothesis thus calculated is called its Posterior Probability, in the sense that it follows after incorporation of the new data, and that ratio is called the Likelihood Ratio. Simply stated: the posterior probability of a hypothesis is the prior probability of that hypothesis multiplied by the likelihood ratio of the data given that hypothesis.
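A minimal sketch of that update in Python (the numbers are made up), expanding $P(D\mid X)$ over the hypothesis and its denial:

```python
# Posterior = prior * likelihood ratio of the data, with P(D|X) expanded
# over H and its denial: P(D|X) = P(D|HX)P(H|X) + P(D|H'X)P(H'|X).
p_H = 0.01                # prior P(H | X)
p_D_given_H = 0.8         # P(D | H X)
p_D_given_notH = 0.1      # P(D | H' X)

p_D = p_D_given_H * p_H + p_D_given_notH * (1 - p_H)   # P(D | X)
p_H_given_D = p_H * p_D_given_H / p_D                  # P(H | D X)
print(round(p_H_given_D, 4))   # ~0.0748 -- the data lifted a 1% prior to ~7.5%
```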
A lot of important problems can be formulated to hinge around the question of whether a given hypothesis is simply true or not true, so let's work out some math on how to test binary hypotheses. If we want to calculate the Odds that a hypothesis is true, we want to calculate the ratio of the probability that it's true against the probability that it isn't:

$$O(H\mid DX) = \frac{P(H\mid DX)}{P(H'\mid DX)}$$
Writing out the product rule for both of the factors on the right hand side, we find that $P(D\mid X)$, the prior probability of the data, is irrelevant, and we need only calculate the data's probability based on our hypothesis and its denial:

$$O(H\mid DX) = O(H\mid X)\frac{P(D\mid HX)}{P(D\mid H'X)}$$
Often for simplicity and computational reasons we take the logarithm of the odds so that we can simply add prior log-odds and log-likelihoods rather than multiply them. (Computers really don’t like multiplying small numbers together - they often just throw it into the garbage and call it zero, which can get problematic when your results depend on your probabilities being exhaustive). Jaynes notes that the calculations are neater if we use the natural logarithm, but more intuitive if done in base-10, so we proceed by defining Evidence for a hypothesis as 10 times the base-10 logarithm of its odds. This means we measure evidence in decibels (db), and update evidence as follows:
$$e(H\mid DX) = e(H\mid X) + 10\log_{10}\left[\frac{P(D\mid HX)}{P(D\mid H'X)}\right]$$

Even if $D$ is composed of many different datapoints ($D_1, D_2, \ldots$), we can just add up the evidence that each one brings to the table to update our estimates. (Point of order - you have to then calculate the probability of successive datapoints conditional on all the prior datapoints in addition to the hypothesis and the prior. It's helpful if your datapoints are Independent, so that prior datapoints have no effect on the probability of later ones.)
The reason we go to the trouble of doing all this in base-10 logarithms is to obtain the result that adding 10 db of evidence is equivalent to raising the odds by a factor of 10. Near the extremes, the difference between a 99.9% chance of something being true and a 99.99% chance of something being true is hard to judge. It's equivalent to the odds going from 1,000:1 to 10,000:1, but even that can kinda feel like a stretch. The difference between 30 and 40 db of evidence, though, has a pretty clear meaning. Toward the middle, saying something has a 50% chance of happening or has 1:1 odds doesn't quite have the same feeling of conveying total ignorance as saying you have 0 db of evidence for it. I've often heard people proclaim 2:1 odds as if it were near certainty - it sounds a lot more like the mildly informative state it is when you call it 3 db of evidence.
I’m gonna modify an example from Jaynes here. Let’s say you’re ordering pizza from a local shop that has 11 pizza makers. 10 of these guys are pretty decent and screw up about 1/6 of the pizzas they make. One of them is much worse and screws up about 1/3 of the pizzas he makes. You’re ordering pizzas for a party and you call ahead and tell them you don’t want the guy who screws up 1/3 of the time making your pizzas. The owner promises he’ll get one of only the mildly screwy guys to make your order, and promises your money back if you think otherwise. Will you call his bluff if you open up the first pizza in the batch and it’s bad?
Well - before opening the pizza, if you don’t want to incorporate the reputation of the owner into your calculation, there’s a 1/11 chance you got the screwup making your batch and a 10/11 chance that he didn’t. So:
$$e(\text{Screwup}\mid X) = 10\log_{10}\left[\frac{1/11}{10/11}\right] = -10 \text{ db}$$
The prior improbability of it being the screwup (1-in-11 or 1:10 odds) means that we start at -10 db of evidence that the batch was made by the screwup.
$$10\log_{10}\left[\frac{P(\text{Bad Pizza}\mid \text{Screwup}, X)}{P(\text{Bad Pizza}\mid \text{Not Screwup}, X)}\right] = 10\log_{10}\left[\frac{1/3}{1/6}\right] = 3 \text{ db}$$
So, we’re now at -7 db that the screwup made the batch. Each successive bad pizza adds 3 db of evidence that this batch was made by the screwup. (Assuming the batch is large enough to treat this as sampling with replacement. If it isn’t then you have to adjust the probabilities of drawing another bad pizza conditional on having removed one from the batch).
What about opening a good pizza?
$$10\log_{10}\left[\frac{P(\text{Good Pizza}\mid \text{Screwup}, X)}{P(\text{Good Pizza}\mid \text{Not Screwup}, X)}\right] = 10\log_{10}\left[\frac{2/3}{5/6}\right] \approx -1 \text{ db}$$
(It's actually -0.97 db). Each successive good pizza drops us down 1 db in evidence of believing the screwup made the batch. If the first one we open is good, we're at -11 db. If we open a bad one, and then a good one, we're at -8 db. If we open a bad one and then three good ones, we're at -10 db again, back where we started. (Actually slightly above).
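Here's that bookkeeping in a few lines of Python (a sketch reproducing the numbers above; the helper names are mine):

```python
from math import log10

def db(odds):
    """Evidence in decibels: 10 * log10(odds)."""
    return 10 * log10(odds)

e = db((1 / 11) / (10 / 11))        # prior evidence for "Screwup": -10 db
bad_step = db((1 / 3) / (1 / 6))    # each bad pizza:  about +3 db
good_step = db((2 / 3) / (5 / 6))   # each good pizza: about -0.97 db

for pizza in ["bad", "good", "good", "good"]:
    e += bad_step if pizza == "bad" else good_step
    print(f"{pizza}: e = {e:.2f} db")
# ends at about -9.9 db: one bad pizza is almost exactly cancelled by three good ones
```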
In order to accumulate evidence that the screwup made the batch, you need to get more than one bad pizza for every three good ones you open. There's nothing in probability theory to tell you at what point you should or should not accept a hypothesis - it only provides the precise machinery for accumulating evidence for and against. Add some simple rules about evidence thresholds for either accepting or rejecting a hypothesis and any other conditions under which you should or should not continue testing, and you're well on your way. Since there's money on the line we might decide to accuse as soon as we're above 0 db of evidence and only accept the batch if we get below -13 db of evidence (95% probability it was not the screwup) - setting these bounds is the domain of decision theory, which Jaynes devotes some time to later on in the book.
This process of simply adding up bite sized pieces of evidence as they come in is elegant, beautiful, easy, and … does not generalize beyond the binary case. You can do something like it with multiple hypotheses by choosing one of them to be the null hypothesis and then conducting binary testing against each of your other hypotheses, but if you try and do this additive evidence procedure directly over multiple hypotheses, then at most one of the data points in your dataset can inform any of the hypotheses relative to the others.
It’s possible to rank and compare all your hypotheses by individually comparing them all to the null hypothesis and comparing their evidence relative to the null hypothesis with each other, but that’s a big, unwieldy calculation - at least as big as the method Jaynes introduces to compare multiple hypotheses against each other directly.
To start on that method - imagine that we open 50 pizzas and every one of them turns out to be bad. At 3 db each from a starting point of -10 db, we're now at +140 db of evidence that the screwup made the pizzas. By direct comparison to sound, we passed an air raid siren somewhere around 115 db, and our alarm bells are now at jet-engine-during-take-off volume. In fact, this is such a bad run it's tough to believe that these were made by a guy that only screws up one in three times. If this actually happened to you, your thought probably wouldn't be "I'm really sure these were made by the screwup." It'd probably be "Something has gone terribly wrong here - I'm really sure these were not made just by the regular process of a guy that screws up 1/3 of the time."
Let's say that instead of the assumption that the pizzas were made either by a guy that screws up 1/6 of the time or one that screws up 1/3 of the time, there's a chance that something has gone wrong with the oven, in which case 99% of the pizzas come out bad. Let's say the oven itself is pretty reliable and has a one-in-a-million chance of going wrong ($P(\text{Bad Oven}\mid X) = 10^{-6}$, $e(\text{Bad Oven}\mid X) = -60$ db). Strictly speaking we now need to multiply our other prior probabilities by $(1 - 10^{-6})$ to account for this, but this is pretty negligible. We'll just say that we start with -10 db that the batch was made by the screwup, +10 db it was made by one of the regular guys, and -60 db the oven went wrong.
Now, to evaluate the evidence that the oven has gone wrong:
$$e(\text{Bad Oven}\mid DX) = e(\text{Bad Oven}\mid X) + 10\log_{10}\left[\frac{P(D\mid \text{Bad Oven}, X)}{P(D\mid \text{Good Oven}, X)}\right]$$
For our purposes, we'll say our data, $D$, represents getting a string of $m$ bad pizzas. Thus, $P(D\mid \text{Bad Oven}, X) = (99/100)^m$. To figure out the probability of getting a string of bad pizzas even if the oven hasn't gone wrong, we apply the product rule:
$$P(D\mid \text{Good Oven}, X) = P(D\mid X)\frac{P(\text{Good Oven}\mid D, X)}{P(\text{Good Oven}\mid X)}$$
The probability of the oven being good given a string of bad pizzas is the probability that either the screwup or the regular guys delivered that string of bad pizzas:
$$P(\text{Good Oven}\mid D, X) = P(\text{Screwup} + \text{Not Screwup}\mid D, X) = P(\text{Screwup}\mid D, X) + P(\text{Not Screwup}\mid D, X)$$
Substituting, applying the product rule again and rearranging:
$$P(D\mid \text{Good Oven}, X) = \frac{P(D\mid \text{Screwup}, X)P(\text{Screwup}\mid X) + P(D\mid \text{Not Screwup}, X)P(\text{Not Screwup}\mid X)}{P(\text{Good Oven}\mid X)}$$

Plugging in numbers from the problem statement (and ignoring the divisions and multiplications by $(1-10^{-6})$):

$$P(D\mid \text{Good Oven}, X) = \frac{1}{11}\left(\frac{1}{3}\right)^m + \frac{10}{11}\left(\frac{1}{6}\right)^m$$
Whew. We’ve now got everything we need to calculate our evidence for a Bad Oven:
$$e(\text{Bad Oven}\mid D, X) = -60 + 10\log_{10}\left[\frac{(99/100)^m}{\frac{1}{11}(1/3)^m + \frac{10}{11}(1/6)^m}\right]$$
Approximately speaking, for the first two bad pizzas, we gain 7.73 db of evidence for each bad pizza. Once we get past 5, a better approximation is $-49.6 + 4.73m$ db. We kind of re-baseline and gain only 4.73 db with each additional bad pizza. (Just in the approximation, though.) After a string of 10 bad pizzas, we're now at near 50-50 odds, despite having started at one-in-a-million.
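Computing the exact curve from the formula above (a sketch; the function name is mine):

```python
from math import log10

def e_bad_oven(m):
    """Evidence (in db) for Bad Oven after a string of m bad pizzas."""
    likelihood_ratio = (99 / 100) ** m / (
        (1 / 11) * (1 / 3) ** m + (10 / 11) * (1 / 6) ** m)
    return -60 + 10 * log10(likelihood_ratio)

for m in range(11):
    print(m, round(e_bad_oven(m), 1))
# climbs by a bit under 8 db per bad pizza at first, then bends over to
# about 4.7 db per pizza, ending up around -2 db after ten bad pizzas
```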
By comparison, our belief that the screwup made the pizzas increases as before by about 3 db with every bad pizza at first, but above 7 bad pizzas (when Bad Oven becomes more likely than Not Screwup) it actually starts to become less and less likely. Not Screwup loses confidence by 3 db per bad pizza until 10 bad pizzas (when Bad Oven becomes more likely than Screwup), at which point it loses confidence even faster.
This happens because in this procedure, at first Bad Oven is so unlikely that Screwup and Not Screwup are mostly trading evidence for and against each other as new datapoints come in. It's only once Bad Oven becomes as likely as Not Screwup that Screwup is being tested against Bad Oven. Once Bad Oven is the most likely scenario it's essentially just Bad Oven being tested against Good Oven, and the other two fall away accordingly. This trade-off is what gets lost if you try to do linear adding of evidence as we did in the binary case - new datapoints mostly just allow you to test your leading hypothesis against its negation if it's more than about 10 db above any of its competitors.
(Jaynes draws plots of how these three trade off, with linear approximations and breakpoints in a way that very much reminds me of Bode plots from controls engineering, and I expect that’s not a coincidence, especially since he also draws block diagrams of plausibility flowing between hypotheses. Unfortunately, I don’t remember the mathematical underpinnings of a Bode plot well enough to draw the direct comparison).
All in all, this works like we want it to - the outside unlikely scenario doesn’t weigh in and begin seriously affecting our plausibility until and unless we get data that’s in far better accordance with it than with the other possibilities. Caution is warranted, though - if we had set our threshold for accusing the pizza shop owner of lying to us and saddling us with the Screwup at 0 db, or even as high as 6 db, then we would not have continued opening pizzas past that point, and never seen our confidence begin to downturn. The safety rails do work - they stop us from ever developing high confidence that we got the Screwup, just not from getting to a point of low confidence in any of our hypotheses. If we set our standard for accusation to +20 db, it’s exceedingly unlikely that we would reach that point in the event of a bad oven. A perhaps more worrying situation is if we set our threshold relatively instead of absolutely. If we only wait until our confidence that we got the Screwup is 20 db higher than any alternative - that is, until it’s 100 times more likely than any other hypothesis - we would indeed hit that point, even if our absolute confidence that it was the Screwup isn’t very high.
I’m gonna take a break here just to let the basic ideas and methods of hypothesis testing sink in, though this is in the middle of Jaynes’ chapter on the subject.
V.
When You’re Ready, Neo, You Won’t Have To
What if we could test “every” possibility all at once? In our example, we were trying to judge whether our pizzas were made by a process that gets it wrong 1/3 of the time, 1/6 of the time, or 99% of the time. What if we wanted to figure out our confidence in every possible fraction of errors in the process? What’s our confidence that the process goes wrong 1% of the time, 2% of the time, 2.5% of the time, etc.?
Up until now we’ve stuck strictly to discrete probabilities - using staircase functions to introduce the concept of probability density and cumulative probability functions. Now we’ll extend that into continuous probability distribution functions.
Let’s say we want to know the probability that our pizzas were made by a process that gets more than a fraction a wrong, but less than a fraction b wrong. That is, if f is the fraction of pizzas that whoever made our pizzas gets wrong, we want to know P(a <= f <= b | Y), where Y is whatever information we have available, be that background info, data, whatever.
If we define A as the proposition that f <= a, and B as the proposition that f <= b, then our proposition of interest, which we’ll call W, is that a < f <= b (strictly greater than a, so that A and W are mutually exclusive). In Boolean algebra, B = A + W, thus:
P(B | Y) = P(A | Y) + P(W | Y)
If we define a function G(q) = P(f <= q | Y), and have that G is continuous and differentiable:
P(W | Y) = G(b) - G(a)
We can rewrite the left hand side as the integral from a to b of G’s derivative with respect to f, g(f) = G’(f). That derivative, g(f), is the probability density function for f given Y, and G is the cumulative distribution function. Jaynes is insistent that these are functions for f, not of f, since f isn’t distributed, only our probabilities for its value are. He further notes that in most cases, since the phenomena we study are not themselves infinite, the theory of finite sets is actually the applicable one. It might be a huge number of permutations, but either of our pizza boys or the oven will only ever produce a finite number of pizzas in their lifetime, so the possible different ways they’ll produce those pizzas is finite, and therefore so are the possible fractions they’ll produce wrong. If they only ever produce 10 pizzas each, then 37.453412% (or 1/3 or 1/6) can’t really be an accurate description of what fraction they produced wrong. Continuous variables are therefore usually approximations, but ones we won’t get bogged down about using.
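Just to make the density/cumulative relationship concrete, here’s a tiny numerical sketch of my own (the particular G is made up purely for illustration):

```python
# A made-up cumulative distribution on [0, 1]: G(q) = 3q^2 - 2q^3.
# Its derivative g(f) = 6f(1 - f) is the corresponding probability density,
# and P(a < f <= b | Y) = G(b) - G(a) matches the integral of g from a to b.
def G(q):
    return 3 * q ** 2 - 2 * q ** 3

def g(f):
    return 6 * f * (1 - f)

a, b, steps = 0.2, 0.7, 100_000
width = (b - a) / steps
integral = sum(g(a + (i + 0.5) * width) * width for i in range(steps))
print(G(b) - G(a), integral)   # both come out to about 0.680
```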
Using logarithms gets clumsy, though, so we’ll go back to dealing with straight probabilities rather than odds and evidence. So:
P(A | DX) = P(A | X) * [P(D | AX)/P(D | X)]
Where A is the proposition that our pizzas were made by something that produces a bad pizza between f and f+df fraction of the time. Our prior is therefore:
P(A | X) = g(f | X)*df
And our posterior is:
P(A | DX) = g(f | DX) * df
And they are related as:
g(f | DX) = g(f | X) * [P(D | AX)/P(D | X)]
We can calculate all these directly, but P(D | X) is often tricky, and it’s just a normalizing constant, so we usually drop it: we calculate the unnormalized g(f | DX) ∝ g(f | X) * P(D | AX), and then divide by whatever constant makes the integral of g(f | DX) over all possible f (0 to 1 in this case) equal 1.
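In code, that “drop the constant and renormalize at the end” move looks something like this minimal sketch (mine; the grid-integration approach and the function name are just illustration, not anything Jaynes prescribes):

```python
# Minimal sketch of computing a posterior density on a grid: multiply the
# prior density by the likelihood at each grid point, then divide by the
# grid approximation of the integral (which stands in for P(D | X)).
def posterior_on_grid(prior, likelihood, steps=1000):
    """Return grid points f and the normalized posterior density g(f | DX)."""
    width = 1.0 / steps
    fs = [(i + 0.5) * width for i in range(steps)]
    unnormalized = [prior(f) * likelihood(f) for f in fs]
    norm = sum(unnormalized) * width
    return fs, [u / norm for u in unnormalized]

# Example: uniform prior, and the likelihood of seeing 3 bad pizzas out of 10
# from a process with bad-pizza fraction f (binomial, constant factor dropped).
fs, post = posterior_on_grid(lambda f: 1.0, lambda f: f ** 3 * (1 - f) ** 7)
```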
Point of order - this procedure is defined for the proposition that our fraction lies between f and f + df. We could maybe take the limit as df approaches 0, and calculate P(D | AX) that way, but what if there are multiple parameters involved? If f is a function of another parameter, does f trend to the same limit at the same rate for all values of the parameter? If you don’t deal with that, you can end up seeming to get different correct answers to the same problem, but for simple problems like this we won’t have to worry about that.
If you do need to get strict about it, you do so by dogmatic application of the product and sum rules - treat your continuous space as the sum of mutually exclusive hypotheses discretized by df or whatever other parameters you’re dealing with, and then take the limits of the forms that arise from the product and sum rules, rather than trying to take the limits directly from the first step.
For our present case, dealing with the pizzas, for any given f the probability of getting a bad pizza is f and the probability of getting a good pizza is (1 - f). If all the pizzas are independent in their quality, then the probability of getting n bad pizzas in a batch of N is given by the binomial distribution, and our posterior probability density function is therefore the binomial distribution multiplied by the prior distribution, divided by the integral over f of that quantity.
(Really need to get MathJax working to make these clear, but the takeaway is there, I hope.)
Jaynes dips into the theory of setting priors at this point and says that fully justifying any particular prior takes more theory than we’ll go over at this point, but if you’ve got no knowledge going into a problem, and therefore no reason to assign any particular value of a parameter a higher probability than any other, the only “honest” way to assign your prior is uniformity - letting each point of g(f | X) have the same, constant value. If the problem is the kind we’re dealing with here, where your parameter is between 0 and 1, then the normalization constant for the posterior is an Eulerian integral of the first kind, a.k.a. a complete beta function, and our posterior distribution takes on the nice form:
g(f | DX) = [(N + 1)!/(n!(N - n)!)] * f^n * (1 - f)^(N - n)
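That factorial out front is just the normalization constant from above; here’s a quick check of my own (the numbers N = 12, n = 4 are arbitrary) that it really does make the density integrate to 1:

```python
import math

# Numerically integrate f^n * (1 - f)^(N - n) over [0, 1] and compare with
# the complete beta function n!(N - n)!/(N + 1)!, whose reciprocal is the
# normalization constant in the posterior above.
N, n = 12, 4
steps = 200_000
width = 1.0 / steps
integral = sum(((i + 0.5) * width) ** n * (1 - (i + 0.5) * width) ** (N - n) * width
               for i in range(steps))
closed_form = math.factorial(n) * math.factorial(N - n) / math.factorial(N + 1)
print(integral, closed_form)   # both ~1.554e-4
```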
Curiously, it’s this result that was first discovered by Thomas Bayes, and it’s the reason we call these sorts of calculations “Bayesian”. Though we also call the form the product rule takes when we write it for hypothesis testing “Bayes’ Rule” or “Bayes’ Theorem”, he never actually wrote it, and it was Laplace who first used it for problems of inference.
Jaynes then goes about showing that this distribution has a peak at n/N, and that if you take a power series expansion of its logarithm, it’s a normal distribution to a second-order approximation, with variance (n/N)*(1 - n/N)/N. As we test more pizzas, the width of the distribution shrinks proportional to 1/sqrt(N). Jaynes again takes the time to point out that all these facts are properties of our distribution, of our state of knowledge, not of f. Particularly since these results are only valid if f doesn’t vary with time, f doesn’t have a variance - only our distribution for its value does.
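Here’s one more sketch of mine comparing the exact posterior to that normal approximation (N = 50, n = 10 are numbers I picked; near the peak they agree closely, further out less so, as you’d expect from a second-order approximation):

```python
import math

# Compare the beta-form posterior (uniform prior, n bad pizzas out of N)
# to the normal approximation with peak n/N and variance (n/N)(1 - n/N)/N.
N, n = 50, 10
peak = n / N
var = peak * (1 - peak) / N

def posterior(f):
    norm = math.factorial(N + 1) / (math.factorial(n) * math.factorial(N - n))
    return norm * f ** n * (1 - f) ** (N - n)

def normal_approx(f):
    return math.exp(-(f - peak) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

for f in (0.15, 0.20, 0.25):
    print(f"f={f:.2f}  exact={posterior(f):6.3f}  normal approx={normal_approx(f):6.3f}")
```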
Jaynes closes out the chapter by noting that if we’re dealing with cases where we only care about whether our parameters are within a given range or not, it’s actually wasteful to calculate the full probability distribution, and we can just integrate over such “nuisance parameters” in order to eliminate them.
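As a final sketch (again my own illustration, with a made-up two-parameter density tabulated on a grid), integrating out a nuisance parameter is nothing more exotic than summing the joint density over the values you don’t care about:

```python
# Marginalize a joint posterior density g(f, s | DX), tabulated on a grid,
# over the nuisance parameter s to get the density for f alone.
def marginalize_over_s(joint, s_width):
    """joint[i][j] ~ g(f_i, s_j | DX); returns the marginal density for each f_i."""
    return [sum(row) * s_width for row in joint]

# Toy usage: a 3x4 grid of made-up joint density values, s spaced 0.25 apart.
toy_joint = [[1.0, 2.0, 2.0, 1.0],
             [0.5, 1.0, 1.0, 0.5],
             [0.1, 0.2, 0.2, 0.1]]
print(marginalize_over_s(toy_joint, 0.25))   # [1.5, 0.75, 0.15]
```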
In the comments for the chapter, Jaynes notes that the use of evidence and decibels to describe logarithms of odds in probability theory hasn’t actually caught on much and that it’s more common to refer to them as “logits”, since the form of the function is logistic (U = log[x/(a - x)]). Jaynes recommends sticking with evidence as the name of the quantity itself, and then decibels, logits, or whatever else you please for the units as appropriate.
VI.
Here Endeth The Lesson
I’m gonna close out this post here cause it’s approaching 50 pages in length and from here the book deals a lot more with special applications and esoterica. The big takeaways from the early part of the book are:
Probabilities are always, always, always a statement of your current state of knowledge, not a statement of properties about the system they describe
The product and sum rules are the basis for all probability theory - if you’re getting screwy results or something doesn’t make sense, take it back to those rules and go step by step
Inference is the same whether you’re using a hypothesis to quantify your state of info about data or using data to quantify your state of knowledge about a hypothesis
I’ll write up a secondary post as I go through the rest of the book that will mostly be even more commentary-ish than this one was, but next up in the “main” series here will be “Linear Algebra Done Right” by Sheldon Axler because of reasoning having to do with why I skimmed over the section on Markov Chains earlier in this post.