Prior probability

A prior probability is a marginal probability, interpreted as a description of what is known about a variable in the absence of some evidence. The posterior probability is then the conditional probability of the variable taking the evidence into account. The posterior probability is computed from the prior and the likelihood function via Bayes' theorem.

As prior and posterior are not terms used in frequentist analyses, this article uses the vocabulary of Bayesian probability and Bayesian inference.

Throughout this article, for the sake of brevity the term variable encompasses observable variables, latent (unobserved) variables, parameters, and hypotheses.

Prior probability distribution

In Bayesian statistical inference, a prior probability distribution, often called simply the prior, of an uncertain quantity p (for example, suppose p is the proportion of voters who will vote for the politician named Smith in a future election) is the probability distribution that would express one's uncertainty about p before the "data" (for example, an opinion poll) are taken into account. It is meant to attribute uncertainty rather than randomness to the uncertain quantity.

One applies Bayes' theorem, multiplying the prior by the likelihood function and then normalizing, to get the posterior probability distribution, which is the conditional distribution of the uncertain quantity given the data.
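The multiply-then-normalize step can be sketched numerically on a grid of candidate values for p. This is only an illustration: the poll counts (6 of 10 for Smith), the flat prior, and the helper name `grid_posterior` are assumptions made for the example, not part of the article.

```python
def grid_posterior(prior, likelihood, grid):
    """Multiply prior by likelihood at each grid point, then normalize to sum to 1."""
    unnormalized = [prior(p) * likelihood(p) for p in grid]
    total = sum(unnormalized)
    return [w / total for w in unnormalized]


# Hypothetical poll: 6 of 10 respondents say they will vote for Smith.
n_grid = 101
grid = [i / (n_grid - 1) for i in range(n_grid)]


def flat_prior(p):
    # Assumed prior for the example: every proportion equally plausible.
    return 1.0


def poll_likelihood(p):
    # Binomial likelihood up to a constant; the constant cancels on normalizing.
    return p**6 * (1 - p) ** 4


posterior = grid_posterior(flat_prior, poll_likelihood, grid)
mode = grid[max(range(n_grid), key=lambda i: posterior[i])]
print("posterior mode:", mode)
```

A finer grid gives a closer approximation to the continuous posterior; the normalizing sum plays the role of the denominator in Bayes' theorem.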

A prior is often the purely subjective assessment of an experienced expert. Some will choose a conjugate prior when they can, to make calculation of the posterior distribution easier.
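To see why a conjugate prior eases the calculation, consider a Beta prior with a binomial likelihood: the posterior is again a Beta distribution, and the update reduces to adding counts. The sketch below assumes a Beta(2, 2) prior and hypothetical poll counts; the function names are illustrative.

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Beta(alpha, beta) prior + binomial data -> Beta posterior parameters."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Assumed prior and hypothetical data: 6 votes for Smith, 4 against.
prior_params = (2.0, 2.0)
posterior_params = beta_binomial_update(*prior_params, successes=6, failures=4)
print(posterior_params, beta_mean(*posterior_params))
```

No integration is needed: the normalizing constant of the posterior is known in closed form because the posterior stays in the Beta family.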

Informative priors

An informative prior expresses specific, definite information about a variable. An example is a prior distribution for the temperature at noon tomorrow. A reasonable approach is to make the prior a normal distribution with expected value equal to today's noontime temperature, with variance equal to the day-to-day variance of atmospheric temperature.

This example has a property in common with many priors: the posterior from one problem (today's temperature) becomes the prior for another problem (tomorrow's temperature). Pre-existing evidence that has already been taken into account is part of the prior, and as more evidence accumulates the prior is determined largely by the evidence rather than by any original assumption, provided that the original assumption admitted the possibility of what the evidence suggests. The terms "prior" and "posterior" are generally relative to a specific datum or observation.
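The way a posterior turns into the next prior can be sketched with a normal prior and normally distributed readings of known variance. The numbers (a prior of 20 degrees with variance 9, two readings with variance 4) and the function name `normal_update` are assumptions for illustration only.

```python
def normal_update(prior_mean, prior_var, obs, obs_var):
    """Posterior mean and variance of a normal mean after one observation
    with known observation variance (normal-normal conjugate update)."""
    precision = 1.0 / prior_var + 1.0 / obs_var
    post_var = 1.0 / precision
    post_mean = post_var * (prior_mean / prior_var + obs / obs_var)
    return post_mean, post_var


# Sequential use: each posterior becomes the prior for the next reading.
readings = (23.0, 22.0)
obs_var = 4.0
seq_mean, seq_var = 20.0, 9.0
for reading in readings:
    seq_mean, seq_var = normal_update(seq_mean, seq_var, reading, obs_var)

# Batch use: treat the average of the readings as one observation
# whose variance is obs_var / n.
n = len(readings)
batch_mean, batch_var = normal_update(20.0, 9.0, sum(readings) / n, obs_var / n)

print(seq_mean, seq_var)
print(batch_mean, batch_var)
```

Both routes give the same posterior, which is why "prior" and "posterior" only have meaning relative to a particular observation.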

Uninformative priors

An uninformative prior expresses vague or general information about a variable. The term "uninformative prior" is a misnomer; such a prior might be called a not very informative prior. Uninformative priors can express information such as "the variable is positive" or "the variable is less than some limit". Some authorities prefer the term objective prior.

In parameter estimation problems, the use of an uninformative prior typically yields results which are not too different from conventional statistical analysis, as the likelihood function often yields more information than the uninformative prior.

Some attempts have been made at finding probability distributions in some sense logically required by the nature of one's state of uncertainty; these are a subject of philosophical controversy. For example, Edwin T. Jaynes has published an argument (Jaynes 1968) based on Lie groups that suggests that the prior for the proportion p of voters voting for a candidate, given no other information, should be the Haldane prior $p^{-1}(1-p)^{-1}$. If one is so uncertain about the value of the aforementioned proportion p that one knows only that at least one voter will vote for Smith and at least one will not, then the conditional probability distribution of p given this information alone is the uniform distribution on the interval [0, 1], which is obtained by applying Bayes' theorem to the data set consisting of one vote for Smith and one vote against, using the above prior. The Haldane prior has been criticized on the grounds that it yields an improper posterior distribution that puts 100% of the probability content at either p = 0 or at p = 1 if a finite sample of voters all favor the same candidate. The Jeffreys prior $p^{-1/2}(1-p)^{-1/2}$ is therefore preferred (see below).
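These claims can be checked with Beta bookkeeping, since a Beta(a, b) density is proportional to p^(a-1) (1 - p)^(b-1): the Haldane prior corresponds to a = b = 0 and the Jeffreys prior to a = b = 1/2. The sketch below is illustrative; the helper names are not from the article.

```python
def beta_posterior(a, b, votes_for, votes_against):
    """Posterior Beta parameters after observing the given vote counts."""
    return a + votes_for, b + votes_against


def is_proper_beta(a, b):
    """A Beta(a, b) density can be normalized only if both parameters are positive."""
    return a > 0 and b > 0


HALDANE = (0.0, 0.0)   # density p^-1 (1 - p)^-1, itself improper
JEFFREYS = (0.5, 0.5)  # density p^-1/2 (1 - p)^-1/2

# One vote for and one against under the Haldane prior gives Beta(1, 1), i.e. uniform.
print(beta_posterior(*HALDANE, 1, 1))

# A unanimous sample of 5 leaves the Haldane posterior improper,
# while the Jeffreys posterior stays proper.
print(is_proper_beta(*beta_posterior(*HALDANE, 5, 0)))
print(is_proper_beta(*beta_posterior(*JEFFREYS, 5, 0)))
```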

Priors can be constructed which are proportional to the Haar measure if the parameter space X carries a natural group structure. For example, in physics we might expect that an experiment will give the same results regardless of our choice of the origin of a coordinate system. This induces the group structure of the translation group on X, and the resulting prior is a constant improper prior. Similarly, some measurements are naturally invariant to the choice of an arbitrary scale (i.e., it doesn't matter if we use centimeters or inches, we should get results that are physically the same). In such a case, the scale group is the natural group structure, and the corresponding prior on X is proportional to 1/x. It sometimes matters whether we use the left-invariant or right-invariant Haar measure. For example, the left and right invariant Haar measures on the affine group are not equal. Berger (1985, p. 413) argues that the right-invariant Haar measure is the correct choice.
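The scale-invariance property of the 1/x prior can be checked directly: the (unnormalized) mass it assigns to an interval depends only on the ratio of the endpoints, so converting inches to centimeters leaves it unchanged, whereas a constant prior does not have this property. The interval and function names below are assumptions for illustration.

```python
import math


def reciprocal_prior_mass(lo, hi):
    """Unnormalized mass of the density 1/x on [lo, hi], i.e. ln(hi / lo)."""
    return math.log(hi / lo)


def constant_prior_mass(lo, hi):
    """Unnormalized mass of a constant density on [lo, hi]."""
    return hi - lo


CM_PER_INCH = 2.54
lo_in, hi_in = 1.0, 10.0
lo_cm, hi_cm = lo_in * CM_PER_INCH, hi_in * CM_PER_INCH

# Same physical interval, two units: the 1/x mass agrees, the constant mass does not.
print(reciprocal_prior_mass(lo_in, hi_in), reciprocal_prior_mass(lo_cm, hi_cm))
print(constant_prior_mass(lo_in, hi_in), constant_prior_mass(lo_cm, hi_cm))
```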

Another idea, championed by Edwin T. Jaynes, is to use the principle of maximum entropy. The motivation is that the Shannon entropy of a probability distribution measures the amount of information contained in the distribution. The larger the entropy, the less information is provided by the distribution. Thus, by maximizing the entropy over a suitable set of probability distributions on X, one finds the distribution that is least informative in the sense that it contains the least amount of information consistent with the constraints that define the set. For example, the maximum entropy prior on a discrete space, given only that the probability is normalized to 1, is the prior that assigns equal probability to each state. And in the continuous case, the maximum entropy prior given that the density is normalized with mean zero and variance unity is the standard normal distribution.
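The discrete case is easy to check numerically. The sketch below computes the Shannon entropy (in nats) of a uniform distribution and of an arbitrarily chosen non-uniform one on four states; the example distributions are assumptions made for illustration.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in nats; terms with zero probability contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]  # hypothetical alternative on the same four states

print(shannon_entropy(uniform))  # equals ln(4), the maximum for four states
print(shannon_entropy(skewed))   # strictly smaller
```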

A related idea, reference priors, was introduced by Jose M. Bernardo. Here, the idea is to maximize the expected Kullback-Leibler divergence of the posterior distribution relative to the prior. This maximizes the expected posterior information about X when the prior density is p(x). The reference prior is defined in the asymptotic limit, i.e., one considers the limit of the priors so obtained as the number of data points goes to infinity. Reference priors are often the objective prior of choice in multivariate problems, since other rules (e.g., Jeffreys' rule) may result in priors with problematic behavior.

Philosophical problems with uninformative priors stem from the choice of an appropriate metric, or measurement scale. Suppose we want a prior for the running speed of a runner who is unknown to us. We could specify, say, a normal distribution as the prior for his speed, but alternatively we could specify a normal prior for the time he takes to complete 100 metres, which is proportional to the reciprocal of his speed. These are very different priors, but it is not clear which is to be preferred. Similarly, if asked to estimate an unknown proportion between 0 and 1, we might say that all proportions are equally likely and use a uniform prior. Alternatively, we might say that all orders of magnitude for the proportion are equally likely, which gives the logarithmic prior, that is, a uniform prior on the logarithm of the proportion. The Jeffreys prior attempts to solve this problem by computing a prior which expresses the same belief no matter which metric is used. The Jeffreys prior for an unknown proportion p is $p^{-1/2}(1-p)^{-1/2}$, which differs from Jaynes' recommendation.
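As a worked sketch of where the Jeffreys prior for a proportion comes from, take a single Bernoulli observation x in {0, 1} with success probability p (the symbols f, I and \pi_J are notation introduced here for the example). The prior is proportional to the square root of the Fisher information:

```latex
\begin{aligned}
f(x \mid p) &= p^{x}(1-p)^{1-x}, \\
I(p) &= -\operatorname{E}\!\left[\frac{\partial^{2}}{\partial p^{2}} \log f(x \mid p)\right]
      = \operatorname{E}\!\left[\frac{x}{p^{2}} + \frac{1-x}{(1-p)^{2}}\right]
      = \frac{1}{p} + \frac{1}{1-p}
      = \frac{1}{p(1-p)}, \\
\pi_J(p) &\propto \sqrt{I(p)} = p^{-1/2}(1-p)^{-1/2}.
\end{aligned}
```

The invariance follows because, for a smooth one-to-one reparameterization \varphi = g(p), the Fisher information transforms as I(\varphi) = I(p)\,(dp/d\varphi)^{2}, so \sqrt{I(\varphi)} = \sqrt{I(p)}\,\lvert dp/d\varphi \rvert, which is exactly the change-of-variables rule for a density.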

Practical problems associated with uninformative priors include the requirement that the posterior distribution be proper. The usual uninformative priors on continuous, unbounded variables are improper. This need not be a problem if the posterior distribution is proper. Another issue of importance is that if an uninformative prior is to be used routinely, i.e., with many different data sets, it should have good frequentist properties. Normally a Bayesian would not be concerned with such issues, but it can be important in this situation. For example, one would want any decision rule based on the posterior distribution to be admissible under the adopted loss function. Unfortunately, admissibility is often difficult to check, although some results are known (e.g., Berger and Strawderman 1996). The issue is particularly acute with hierarchical Bayes models; the usual priors (e.g., Jeffreys' prior) may give badly inadmissible decision rules if employed at the higher levels of the hierarchy.
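A standard illustration of an improper prior leading to a proper posterior, sketched here under the assumption of n independent normal observations x_1, ..., x_n with unknown mean \mu and known variance \sigma^2, uses the constant prior on \mu:

```latex
\begin{aligned}
p(\mu) &\propto 1, \\
p(\mu \mid x_1,\dots,x_n) &\propto \exp\!\left(-\frac{1}{2\sigma^{2}} \sum_{i=1}^{n} (x_i - \mu)^{2}\right)
  \propto \exp\!\left(-\frac{n}{2\sigma^{2}} (\mu - \bar{x})^{2}\right),
  \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i .
\end{aligned}
```

The prior does not integrate to a finite value, yet for any n \ge 1 the posterior is the proper normal distribution with mean \bar{x} and variance \sigma^{2}/n.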

Improper priors

If Bayes' theorem is written as

$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_j P(B \mid A_j)\,P(A_j)}$$

then it is clear that it would remain true if all the prior probabilities $P(A_i)$ and $P(A_j)$ were multiplied by a given constant; the same would be true for a continuous random variable. The posterior probabilities will still sum (or integrate) to 1 even if the prior values do not, and so the priors only need be specified in the correct proportion.
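The cancellation of a constant factor in the prior is easy to verify for a discrete set of hypotheses. The prior and likelihood values below are made up for the example, and `discrete_posterior` is an illustrative helper name.

```python
def discrete_posterior(priors, likelihoods):
    """Bayes' theorem over a finite set of hypotheses: weight, then normalize."""
    weights = [pr * lk for pr, lk in zip(priors, likelihoods)]
    total = sum(weights)
    return [w / total for w in weights]


priors = [0.5, 0.3, 0.2]       # hypothetical prior probabilities P(A_i)
likelihoods = [0.1, 0.4, 0.7]  # hypothetical likelihoods P(B | A_i)

# Multiplying every prior value by the same constant leaves the posterior unchanged.
scaled_priors = [10.0 * p for p in priors]

original = discrete_posterior(priors, likelihoods)
rescaled = discrete_posterior(scaled_priors, likelihoods)
print(original)
print(rescaled)
```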

Taking this idea further, in many cases the sum or integral of the prior values may not even need to be finite to get sensible answers for the posterior probabilities. When this is the case, the prior is called an improper prior. Some statisticians use improper priors as uninformative priors. For example, if they need a prior distribution for the mean m and variance v of a random variable, they may assume $p(m, v) \propto 1/v$ (for $v > 0$), which would suggest that any value for the mean is equally likely and that a value for the positive variance becomes less likely in inverse proportion to its value. Since

$$\int_{-\infty}^{\infty} dm = \int_{0}^{\infty} \frac{1}{v}\,dv = \infty,$$

this would be an improper prior both for the mean and for the variance.
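The divergence of the variance integral can be seen by truncating it to [lower, upper], where it equals ln(upper / lower), and letting the bounds spread out. The truncation points below are arbitrary choices for the illustration.

```python
import math


def truncated_mass(lower, upper):
    """Integral of 1/v over [lower, upper], computed in closed form."""
    return math.log(upper) - math.log(lower)


# Widen the interval symmetrically on a log scale: [10^-k, 10^k].
exponents = (1, 2, 4, 8)
masses = [truncated_mass(10.0**-k, 10.0**k) for k in exponents]
for k, mass in zip(exponents, masses):
    print(k, mass)
```

The mass grows in proportion to k and has no finite limit, so no constant can normalize 1/v over all positive v.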

References

  • Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian Data Analysis, 2nd edition. CRC Press, 2003. ISBN 1-58488-388-X
  • James O. Berger, Statistical Decision Theory and Bayesian Analysis, Second Edition. Springer-Verlag, 1985. ISBN 0-387-96098-8
  • James O. Berger and William E. Strawderman, "Choice of hierarchical priors: admissibility in estimation of normal means," Annals of Statistics, 24, pp. 931-95, 1996.
  • Jose M. Bernardo, "Reference Posterior Distributions for Bayesian Inference," Journal of the Royal Statistical Society, Series B, 41, 113-147, 1979.
  • Edwin T. Jaynes, "Prior Probabilities," IEEE Transactions on Systems Science and Cybernetics, SSC-4, 227-241, Sept. 1968. Reprinted in Roger D. Rosenkrantz, Compiler, E. T. Jaynes: Papers on Probability, Statistics and Statistical Physics. Dordrecht, Holland: Reidel Publishing Company, pp. 116-130, 1983. ISBN 90-277-1448-7


This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.