The disagreement between Frequentism and Bayesianism concerns the definition of probability.
Frequentism interprets probability as the limiting case of repeated measurements.
probability = fundamentally related to the frequency with which any given value occurs
Bayesianism interprets probability as degrees of certainty about statements.
probability = fundamentally related to our knowledge about an event
probability = statement of my knowledge of what the measurement result will be
Example: given a set of measurements of the photon flux \(F\) from a star, estimate the true flux of the star.
| Different views | |
|---|---|
| Frequentist view | The true flux of a star is by definition a single fixed value, so it is meaningless for a frequentist to talk about its probability: a frequency distribution for a fixed value is nonsense. |
| Bayesian view | We can meaningfully talk about measuring the flux \(F\) of a star with some probability \(P(F)\), and about the probability that the true flux lies in a given range. Although probabilities can be estimated from frequencies of repeated experiments, frequencies are not fundamental: probability codifies our knowledge of the value based on prior information and/or available data. |

This difference in philosophy leads to different approaches to the statistical analysis of data.
An example illustrating the different approaches:
Notation:
\(N\) = number of measurements
\(i\) = index of a single observation, \(i = 1,\dots,N\)
\(F_{i}\) = photon flux of the \(i^{th}\) observation
\(e_{i}\) = error associated with \(F_{i}\)
Assumption regarding \(e_{i}\): the measurement errors are Gaussian, i.e. each measurement is drawn from \(\cal{N}(F_{true}, e_{i}^{2})\).
Frequentist:
\(e_{i}\) = the standard deviation (\(\sigma\)) of the results of a single measurement event in the limit of repetitions of that event.
Bayesian:
\(e_{i}\) = the standard deviation (\(\sigma\)) of the (Gaussian) probability distribution describing our knowledge of that particular measurement given its observed value.
Given \(D = \{F_{i}, e_{i}\}\), what is the best estimate of the true flux \(F_{true}\)?
Since the measurements are integer photon counts, a Poisson distribution is a good approximation to the measurement process:
\(F_{i} \;\sim\; \text{Pois}(F_{true}),\; i = 1,\dots,N\)
Each error \(e_{i}\) is estimated from Poisson statistics using the standard square-root rule: \(e_{i} = \sqrt{F_{i}}\).
In the simulated example the true flux is known in advance: \(F_{true} = 1000\) photons measured in 1 second.
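A minimal sketch of how such a data set might be simulated in Python with NumPy (the sample size \(N = 50\) and the random seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed for reproducibility

F_true = 1000                     # true flux: photons per second (ground truth of the simulation)
N = 50                            # number of measurements (assumed value)

F = rng.poisson(F_true, N)        # observed counts, drawn from Pois(F_true)
e = np.sqrt(F)                    # per-measurement errors via the square-root rule
```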
Frequentist approach:
Start with classic maximum likelihood: given a single observation \(D_{i} = \{F_{i}, e_{i}\}\), compute the probability of that measurement given the true flux \(F_{true}\), under the assumption of Gaussian errors.
\[P(D_{i} | F_{true}) = \frac{1}{\sqrt{2\pi e^{2}_{i}}}\text{exp}[-\frac{(F_{i} - F_{true})^{2}}{2e^{2}_{i}}]\]
Construct likelihood function = product of probabilities for each data point.
\[\cal{L}(D | F_{true}) = \prod_{i=1}^{N} P(D_{i} | F_{true})\]
Since \(\cal{L}\) becomes very small, we compute the log-likelihood instead to avoid numerical underflow errors.
\[\text{log}(\cal{L}) = -\frac{1}{2}\sum^{N}_{i=1}[\text{log}(2\pi e^{2}_{i}) + \frac{(F_{i} - F_{true})^2}{e^{2}_{i}}]\]
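A direct translation of this log-likelihood into a small Python function (the function name and signature are our own, reused in the sketches below):

```python
def log_likelihood(F_true_val, F, e):
    """Gaussian log-likelihood log L(D | F_true) from the formula above."""
    return -0.5 * np.sum(np.log(2 * np.pi * e ** 2)
                         + (F - F_true_val) ** 2 / e ** 2)
```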
Determine \(F_{true}\) such that the likelihood is maximized.
This can be computed analytically by setting \(\frac{d\,\text{log}\cal{L}}{dF_{true}} = 0\):
\(F_{est.} = \frac{\sum w_{i}F_{i}}{\sum w_{i}},\;\; w_{i} = \frac{1}{e^{2}_{i}}\)
If all \(e_{i}\) are equal, this reduces to \(F_{est.} = \frac{1}{N}\sum_{i=1}^{N} F_{i}\) \(\rightarrow\) simply the mean of the observed data.
What is the error on \(F_{est.}\)? The two paradigms identify this error differently. In the frequentist approach it is obtained by fitting a Gaussian approximation to the likelihood curve at its maximum; for this simple case the fit can be solved analytically:
\(\sigma_{est.} = (\sum_{i=1}^{N} w_{i})^{-\frac{1}{2}}\rightarrow\; \text{std. of Gaussian approximation}\)
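Using the simulated data from the earlier sketch, the analytic estimate and its error take only a few lines (variable names are our own):

```python
w = 1.0 / e ** 2                          # weights w_i = 1 / e_i^2
F_est = np.sum(w * F) / np.sum(w)         # maximum-likelihood (weighted-mean) estimate
sigma_est = np.sum(w) ** -0.5             # std. of the Gaussian approximation

print(f"F_true = {F_true}")
print(f"F_est  = {F_est:.0f} +/- {sigma_est:.0f} (based on {N} measurements)")
```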
Bayesian approach:
The Bayesian approach begins and ends with probabilities. We want to compute our degree of knowledge of the parameter in question, i.e. \(P(F_{true} | D)\) (Bayesian), which is not the same quantity as \(P(D | F_{true})\) (frequentist).
This formulation of the problem is fundamentally contrary to the frequentist philosophy, which holds that probabilities have no meaning for fixed model parameters like \(F_{true}\).
To compute \(P(F_{true} | D)\), Bayesians apply Bayes' rule:
\[P(F_{true} | D) = \frac{P(D | F_{true}) P(F_{true})}{P(D)}\]
What is controversial is not Bayes' rule itself but the Bayesian interpretation of probability implied by the term \(P(F_{true} | D)\).
- \(P(F_{true} | D):\) posterior/probability of the model parameters given data. Result we want to compute.
- \(P(D | F_{true}):\) likelihood, proportional to \(\cal{L}(D | F_{true})\) in the frequentist approach above.
- \(P(F_{true}):\) model prior, encodes what we knew about the model prior to the application of the data \(D\).
- \(P(D):\) data probability, in practice amounts to a normalization term
Setting \(P(F_{true})\;\propto\) 1 (a flat prior), we find \(P(F_{true} | D)\propto\;\cal{L}(D | F_{true})\). With a flat prior the Bayesian and frequentist approaches give similar results.
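In code, under the flat-prior assumption the (unnormalized) log-posterior is just the log-likelihood plus a constant; a minimal sketch reusing the log_likelihood function above (the positivity cutoff in the prior is our own assumption):

```python
def log_prior(F_true_val):
    # flat (improper) prior: constant for positive flux, -inf otherwise (assumed range)
    return 0.0 if F_true_val > 0 else -np.inf

def log_posterior(F_true_val, F, e):
    lp = log_prior(F_true_val)
    return lp + log_likelihood(F_true_val, F, e) if np.isfinite(lp) else -np.inf
```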
The prior \(P(F_{true})\) allows inclusion of other information into the computation, useful in cases of combining multiple measurement strategies.
One of the most controversial aspects of Bayesian analysis is the necessity of specifying a prior.
Frequentists will point out that the prior is problematic when no true prior information is available; in many situations a truly non-informative prior does not exist.
Frequentists say the choice of prior necessarily biases your results and therefore has no place in statistical data analysis.
Bayesians would say that frequentism can be viewed as a special case of the Bayesian approach for some (implicit) choice of the prior.
Bayesians argue that it would be better to make this implicit choice explicit, even if the choice might include some subjectivity.
How are Bayesian results computed in practice?
For a one-parameter problem, compute the posterior probability \(P(F_{true} | D)\) as a function of \(F_{true}\).
In other words, compute the distribution reflecting our knowledge of the parameter \(F_{true}\).
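A minimal sketch of this direct, grid-based evaluation, reusing the log_posterior function above (the grid range and resolution are arbitrary choices):

```python
F_grid = np.linspace(900, 1100, 1001)                            # candidate values of F_true
log_post = np.array([log_posterior(f, F, e) for f in F_grid])    # flat prior: posterior ∝ likelihood
post = np.exp(log_post - log_post.max())                         # subtract the max for numerical stability
post /= np.trapz(post, F_grid)                                   # normalize to unit area
```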
The direct approach becomes increasingly intractable as the dimension of the model grows.
Bayesian calculations often depend on sampling methods such as Markov Chain Monte Carlo (MCMC).
The goal is to generate a set of points drawn from the posterior probability distribution and to use them to determine the answer we seek.
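A minimal sketch of such sampling with the emcee package, assuming it is installed (walker count, step count, and burn-in length are arbitrary illustrative choices):

```python
import emcee   # assumes the emcee package is available

ndim, nwalkers, nsteps = 1, 32, 2000                     # sampler settings chosen for illustration
start = F.mean() + 10 * np.random.randn(nwalkers, ndim)  # walkers scattered around the sample mean

sampler = emcee.EnsembleSampler(
    nwalkers, ndim, lambda theta: log_posterior(theta[0], F, e))
sampler.run_mcmc(start, nsteps)
samples = sampler.get_chain(discard=500, flat=True)      # drop burn-in, flatten the walkers

print(f"posterior mean = {samples.mean():.0f}, std = {samples.std():.0f}")
```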
In pure Bayesianism the answer to a question is not a single number with error bars; the answer is the posterior distribution over the model parameters.
Exploring a more sophisticated model: Adding a Dimension
Assume the object we are observing has some intrinsic stochastic variation, i.e. its flux varies with time.
Propose a simple 2-parameter Gaussian model \(\cal{N}(\mu, \sigma^{2})\), \(\theta = [\mu, \sigma]\), for the variability intrinsic to the object.
Model: \(F_{true}\;\sim\;\frac{1}{\sqrt{2\pi \sigma^{2}}}\,\text{exp}\!\left[-\frac{(F - \mu)^{2}}{2\sigma^{2}}\right]\)
Frequentist approach:
\[\cal{L}(D | \theta) = \prod_{i=1}^{N}\frac{1}{\sqrt{2\pi(\sigma^{2} + e^{2}_{i})}}\text{exp}\left[-\frac{(F_{i} - \mu)^{2}}{2(\sigma^{2} + e^{2}_{i})}\right]\]
Likelihood is the convolution of the intrinsic distribution with the error distribution.
Analytically maximize the above likelihood to find the best estimate for \(\mu\):
\(\mu_{est.} = \frac{\sum w_{i}F_{i}}{\sum w_{i}}\;; w_{i} = \frac{1}{\sigma^{2} + e^{2}_{i}}\)
Here we have a problem: The optimal value of \(\mu\) depends on the optimal value of \(\sigma\).
Results are correlated \(\rightarrow\) no longer possible to use analytic methods to arrive at the Frequentist result.
But we can use numerical optimization techniques to determine the maximum likelihood value.
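A minimal sketch of such a numerical maximization with SciPy (optimizing \(\log\sigma\) to keep \(\sigma\) positive is our own choice, as are the starting values):

```python
from scipy import optimize

def neg_log_likelihood(theta, F, e):
    # theta = [mu, log_sigma]; we optimize log(sigma) so that sigma stays positive
    mu, log_sigma = theta
    var = np.exp(2 * log_sigma) + e ** 2                     # sigma^2 + e_i^2 from the formula above
    return 0.5 * np.sum(np.log(2 * np.pi * var) + (F - mu) ** 2 / var)

theta0 = [F.mean(), np.log(10.0)]                            # rough starting guess
result = optimize.minimize(neg_log_likelihood, theta0, args=(F, e), method="Nelder-Mead")
mu_ml, sigma_ml = result.x[0], np.exp(result.x[1])
print(f"mu_ML = {mu_ml:.1f}, sigma_ML = {sigma_ml:.1f}")
```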
Maximum likelihood gives the best estimate of the parameters \(\mu\) and \(\sigma\) governing our model, but this is only half the answer.
We also need to compute error bars on \(\mu\) and \(\sigma\).
There are several approaches to determining errors within the frequentist paradigm:
- Fit a normal approximation to the maximum likelihood and report the covariance matrix (done numerically rather than analytically in this case).
- Alternatively, compute the \(\chi^{2}\) and \(\chi^{2}_{\text{dof}}\) statistics and use standard tests to determine confidence limits; this also depends on strong assumptions about the Gaussianity of the likelihood.
- Alternatively, use randomized sampling approaches such as the Jackknife or Bootstrap, which maximize the likelihood over randomized samples of the input data in order to explore the degree of certainty in the result.
With bootstrapping or other sampling techniques there is a potential for errors to be correlated or even non-Gaussian, neither of which is reflected by simply reporting the mean and standard deviation of each model parameter.
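A rough bootstrap sketch along these lines, reusing the neg_log_likelihood function above (the number of resamples and the seed are arbitrary choices):

```python
n_boot = 1000
boot = np.empty((n_boot, 2))
boot_rng = np.random.default_rng(0)                          # arbitrary seed

for b in range(n_boot):
    idx = boot_rng.integers(0, N, N)                         # resample indices with replacement
    res = optimize.minimize(neg_log_likelihood, theta0,
                            args=(F[idx], e[idx]), method="Nelder-Mead")
    boot[b] = res.x[0], np.exp(res.x[1])                     # store (mu, sigma) for this resample

mu_err, sigma_err = boot.std(axis=0)                         # spread of the estimates -> error bars
```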
Bayesian approach:
Almost exactly the same as it was in the previous problem.
The vast majority of commonly applied frequentist techniques make the explicit or implicit assumption of Gaussianity of the distribution.
Bayesian approaches generally do not require such assumptions.
There are good arguments that a flat prior on \(\sigma\) subtly biases the calculation in this case.
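A minimal sketch of the corresponding Bayesian computation, again with emcee and with flat priors on \(\mu\) and \(\sigma\) inside an arbitrary box (sampler settings are illustrative; as noted above, a flat prior on \(\sigma\) is itself debatable):

```python
def log_posterior_2d(theta, F, e):
    mu, sigma = theta
    if not (0 < mu < 10000 and 0 < sigma < 1000):            # flat prior inside an arbitrary box
        return -np.inf
    var = sigma ** 2 + e ** 2
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (F - mu) ** 2 / var)

ndim2, nwalkers2 = 2, 32
start2 = np.column_stack([F.mean() + 10 * np.random.randn(nwalkers2),
                          10 * np.abs(np.random.randn(nwalkers2)) + 1])

sampler2 = emcee.EnsembleSampler(nwalkers2, ndim2, log_posterior_2d, args=(F, e))
sampler2.run_mcmc(start2, 2000)
mu_samp, sigma_samp = sampler2.get_chain(discard=500, flat=True).T   # posterior samples of mu, sigma
```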
Conclusion:
Bayesianism and Frequentism are fundamentally different approaches that, for simple problems, can yield similar or even identical results.
Differences:
Frequentism considers probabilities related to frequencies of real or hypothetical events.
Bayesianism considers probabilities as measures of degrees of knowledge.
Frequentist analyses generally proceed through the use of point estimates and maximum likelihood.
Bayesian analyses generally compute the posterior either directly or through some version of MCMC sampling.
In simple problems, the two approaches yield similar results. As data and models grow in complexity, the two approaches can diverge greatly. This is clear in two situations:
- The handling of nuisance parameters
- The subtle (often overlooked) difference between frequentist confidence intervals and Bayesian credible regions
Credits to Jake VanderPlas