Statistical & Financial Consulting by Stanford PhD

Bayesian statistics solves the problem of parameter estimation by assuming that the parameters are random and that their joint distribution with the data is known. This allows the researcher to replace the somewhat open-ended task of approximating unknown fixed numbers with a relatively straightforward calculation of a conditional expectation. The catch is that the researcher needs to make assumptions regarding the unconditional distribution of the parameters, referred to as the prior distribution. Different choices of prior distribution lead to different answers. This fact makes Bayesian statistics distinct from the traditional frequentist approach, where the inference about parameters is based on the data only.

Bayesian statistics has a somewhat abstract philosophy behind it, à la Isaac Asimov. Let θ = (θ1, ..., θm) be the parameters of interest. In each universe, including ours, people live with only one realization of θ. However, over those universes θ is distributed according to a known prior distribution. Also, in each universe the link between θ and the observable data X is the same: the conditional distribution of X given θ is the same and known. So after collecting all the information on X in our universe, we can calculate the conditional expectation of θ given the observed X. We do that using Bayes' rule (which explains the name "Bayesian statistics").
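In symbols, the calculation just described can be sketched as follows. The notation is introduced here for illustration and is not used elsewhere on this page: π(θ) stands for the prior density and f(x | θ) for the conditional density of the data given the parameters.

```latex
\pi(\theta \mid x)
  = \frac{f(x \mid \theta)\,\pi(\theta)}{\int f(x \mid \theta')\,\pi(\theta')\,d\theta'},
\qquad
\mathbb{E}[\theta \mid X = x] = \int \theta\,\pi(\theta \mid x)\,d\theta .
```

The first identity is Bayes' rule for densities; the second is the conditional expectation that serves as the estimate.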

An example. Suppose we are interested in the percentage of the residents of a particular state supporting the Republicans. On the one hand, we may want to pay attention to the percentage of Republicans in the collected sample. On the other hand, from previous sociological studies we may know that the true percentage cannot get too extreme: it cannot get close to 0% or 100%. We solve the problem by modeling the true percentage P as a random variable with Beta(d,d) distribution, where d > 1. This distribution is hump-shaped, having more mass in the middle of the interval [0%, 100%] than at its ends. Let X be the fraction of Republicans in the sample and N the size of the sample. Conditional on the true percentage P, the number XN of Republicans in the sample is distributed according to the binomial law with parameters N and P. We can write down the joint density function of P and X as

$$\mathrm{Prob}(P = p,\, X = x) = \frac{p^{d-1}(1-p)^{d-1}}{\mathrm{B}(d,d)} \, \binom{N}{xN} \, p^{xN} (1-p)^{N-xN}$$

This allows us to calculate the conditional expected value of P given X as

$$\mathbb{E}[P \mid X = x] = \frac{d + xN}{2d + N}$$
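The step from the joint density to this expectation is the standard Beta-binomial conjugacy argument. A sketch, writing k = xN for the number of Republicans in the sample (k is shorthand introduced only for this derivation) and dropping factors that do not depend on p:

```latex
f(p \mid X = x) \;\propto\; p^{d-1}(1-p)^{d-1} \cdot p^{k}(1-p)^{N-k}
  \;=\; p^{d+k-1}(1-p)^{d+N-k-1},
\\[4pt]
\text{so}\quad P \mid X = x \;\sim\; \mathrm{Beta}(d+k,\; d+N-k),
\qquad
\mathbb{E}[P \mid X = x] = \frac{d+k}{(d+k)+(d+N-k)} = \frac{d+xN}{2d+N}.
```

The last equality uses the fact that a Beta(a, b) distribution has mean a / (a + b).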

This is our estimate of the true percentage P. Notice that it depends on the macro-assumption d. We had better have a good idea of what a suitable prior distribution can be, as different subjective views lead to quite different estimates of P. However, one thing has been achieved: an extreme estimate of P is less likely than an estimate closer to the middle of the [0%, 100%] interval. In fact, the estimate of P will always be in the interval [d / (2d + N), (d + N) / (2d + N)]. It will never equal 0% or 100%.
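To see how strongly the estimate depends on d, here is a minimal Python sketch. The sample (62 Republicans out of 100) and the values of d are hypothetical, and the function names are chosen for this illustration only.

```python
def posterior_mean(d, successes, n):
    """Posterior mean of P under a Beta(d, d) prior, given `successes` out of `n`."""
    return (d + successes) / (2 * d + n)


def estimate_bounds(d, n):
    """Smallest and largest estimates attainable for a given d and sample size n."""
    return d / (2 * d + n), (d + n) / (2 * d + n)


# Hypothetical sample: 62 Republicans among 100 respondents.
n, successes = 100, 62

for d in (1.5, 5, 20, 100):  # hypothetical prior choices, all with d > 1
    low, high = estimate_bounds(d, n)
    estimate = posterior_mean(d, successes, n)
    print(f"d = {d:>5}: estimate = {estimate:.3f}, attainable range = [{low:.3f}, {high:.3f}]")
```

The estimate can also be read as a weighted average of the prior mean 1/2 and the sample fraction x, with weights 2d / (2d + N) and N / (2d + N). A larger d pulls the estimate toward 50%, while a larger sample pulls it toward x.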

The example above is simplistic. It only illustrates Bayesian ideas, which are typically put to use in more complex contexts and formats. Essentially, there are two uses of Bayesian statistics:

1] We have views about the parameters, which may be subjective or based on previous research. We want to incorporate these views into estimation.

2] We do not know much about the parameters but a Bayesian calculation seems easier than the frequentist alternative. So we choose a high-variance, spread-out prior distribution, which puts almost equal weight on a wide range of possible parameter values. Assuming the parameters have some distribution allows us to write relatively simple formulas for their expectations conditional on all or some parts of the data. These formulas are then evaluated using numerical methods, such as the EM algorithm or Markov chain Monte Carlo.
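As an illustration of the second use, the sketch below applies a random-walk Metropolis sampler, one simple form of Markov chain Monte Carlo, to the polling example with a nearly flat prior (d = 1.01). The data, the step size, the chain length and the function names are all assumptions made for this illustration. In this toy problem the posterior mean is known in closed form, so the sampler is needed only as a demonstration and its output can be checked against (d + xN) / (2d + N).

```python
import math
import random


def log_posterior(p, d, k, n):
    """Unnormalized log posterior of P: Beta(d, d) prior times binomial likelihood."""
    if not 0.0 < p < 1.0:
        return -math.inf
    return (d - 1 + k) * math.log(p) + (d - 1 + n - k) * math.log(1.0 - p)


def metropolis_mean(d, k, n, steps=50_000, burn_in=5_000, step_size=0.1, seed=0):
    """Estimate E[P | data] by averaging a random-walk Metropolis chain."""
    rng = random.Random(seed)
    p = 0.5
    current = log_posterior(p, d, k, n)
    total, kept = 0.0, 0
    for i in range(steps):
        proposal = p + rng.gauss(0.0, step_size)
        candidate = log_posterior(proposal, d, k, n)
        # Accept with probability min(1, posterior ratio); proposals outside (0, 1)
        # have log posterior -inf and are therefore always rejected.
        if rng.random() < math.exp(min(0.0, candidate - current)):
            p, current = proposal, candidate
        if i >= burn_in:
            total += p
            kept += 1
    return total / kept


# Hypothetical data: 62 Republicans among 100 respondents, nearly flat prior.
d, k, n = 1.01, 62, 100
print("MCMC estimate:     ", round(metropolis_mean(d, k, n), 4))
print("Closed-form answer:", round((d + k) / (2 * d + n), 4))
```

In realistic models no closed form is available and the sampler output is the only estimate, which is why convergence checks and longer chains matter there.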

In every Bayesian calculation the prior distribution serves as an input. The researcher is free to supply whatever they deem appropriate. This gives rise to empirical Bayes, where the prior distribution is inspired by the data, and hierarchical Bayes, where the parameters of the prior are themselves assigned a prior distribution (a hyperprior), so that the prior is in effect the output of another Bayesian calculation.
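One common way to let the data inspire the prior is to pick the prior parameter that maximizes the marginal likelihood of the data (sometimes called type II maximum likelihood). The sketch below does this for the Beta(d, d) prior of the polling example, assuming polls from several regions whose true percentages are drawn independently from the same prior. The counts, the grid of candidate values and the function names are hypothetical; this is one possible empirical Bayes recipe, not the only one.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_marginal(d, counts):
    """Log marginal likelihood of the counts when each P_i ~ Beta(d, d) independently."""
    total = 0.0
    for k, n in counts:
        log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        total += log_binom + log_beta(d + k, d + n - k) - log_beta(d, d)
    return total


def empirical_bayes_d(counts, grid):
    """Grid value of d that maximizes the marginal likelihood."""
    return max(grid, key=lambda d: log_marginal(d, counts))


# Hypothetical polls: (Republicans in sample, sample size) for six regions.
counts = [(48, 100), (55, 100), (41, 100), (60, 100), (52, 100), (45, 100)]
grid = [1.0 + 0.5 * i for i in range(1, 600)]  # candidate values 1.5, 2.0, ..., 300.5

d_hat = empirical_bayes_d(counts, grid)
print("data-driven d:", d_hat)
for k, n in counts:
    shrunk = (d_hat + k) / (2 * d_hat + n)
    print(f"sample fraction {k / n:.2f} -> estimate {shrunk:.3f}")
```

Each regional estimate is pulled from its sample fraction toward 50%, by an amount that the data themselves determine through the fitted d.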


Lehmann, E.L., & Casella, G. (1998). Theory of Point Estimation (2nd ed). Springer.

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian Data Analysis (2nd ed). Chapman and Hall / CRC.

McElreath, R. (2015). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Chapman & Hall / CRC.

Efron, B., & Hastie, T. (2017). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge University Press.

Bishop, C.M. (2007). Pattern Recognition and Machine Learning (corr. 2nd printing ed). Springer.

Greene, W. H. (2011). Econometric Analysis (7th ed). Prentice Hall.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern Classification (2nd ed). New York: Wiley-Interscience.

Bayes, T. (1763). An Essay towards solving a Problem in the Doctrine of Chances. Phil. Trans., 53, pp. 370–418.

Ghosal, S., & van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference. Cambridge University Press.