The Structure of Inference

I previously described the ecological niche of statistical inference. Now let’s study the beast itself.

Kiefer, in Introduction to Statistical Inference, dissects a statistical inference as clearly as I have seen. Start with a set S, the range of a random variable representing our measurements. We augment this with a class \Omega of possible distributions for the random variable, one of which (we believe) is the true distribution that produced our measured values.

Add a set D of decisions. In the end, our inference will single out some element of D. To measure the damage if we choose wrong, we introduce a loss function W : \Omega \times D \rightarrow \mathbb{R}. W is dictated by our circumstances. In an economic problem, it may be obvious: the amount of money lost if we make a mistake. Generally it is not so clear cut. The best approach is to construct inferences which depend only crudely on the form of the loss function. If we are estimating the mean of a distribution, then our answer shouldn’t depend on whether we took as the loss the absolute value of the difference between our guess and the true value, or the square of that difference.

Our actual inference is carried out by a function t : S \rightarrow D, called a statistical procedure. To make an inference, we plug our measured values into t and take the decision which comes out.

The problem is to construct a logically defensible t. We can say we want the t which makes the loss as small as possible. If the loss from t_1 is always less than the loss from t_2, no matter which distribution in \Omega is the true one, we would not dream of using t_2. But what of procedures which minimize loss over different parts of \Omega? How are we to choose among them? Bayesian analysis, maximum-likelihood methods, minimax techniques, and all the rest are attempts to choose a t which is optimal in some sense.

To make the idea of the loss incurred by some t precise, we define the risk function r : (S \rightarrow D) \times \Omega \rightarrow \mathbb{R} as

r(t,\omega) = E[W(\omega,t(X)) | \omega, t ]

where X is the random variable representing our measurement. This tells us on average how well a given t will do when faced with an underlying distribution \omega \in \Omega.
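Since the risk is just an expectation, it can be approximated by simulation: draw many measurements from \omega, feed each one to t, and average the losses. The sketch below does this for a setting that is my own assumption and not part of the discussion above: five draws from a normal distribution with unknown mean and unit variance, t the sample mean, and squared error as W. The names estimate_risk, sample_normal, sample_mean and squared_loss are all hypothetical.

```python
import random


def estimate_risk(t, loss, sample, omega, n_trials=20000, seed=0):
    """Monte Carlo estimate of r(t, omega) = E[W(omega, t(X))]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        x = sample(omega, rng)      # one measurement X drawn from omega
        total += loss(omega, t(x))  # loss of the decision t(X) if omega is the truth
    return total / n_trials


# Assumed toy setting: omega is identified with the mean mu of a
# unit-variance normal distribution, and a measurement is five draws.
N_DRAWS = 5


def sample_normal(mu, rng):
    return [rng.gauss(mu, 1.0) for _ in range(N_DRAWS)]


def sample_mean(x):
    """The statistical procedure t: S -> D."""
    return sum(x) / len(x)


def squared_loss(mu, guess):
    """The loss function W: Omega x D -> R."""
    return (mu - guess) ** 2


if __name__ == "__main__":
    for mu in (-1.0, 0.0, 2.0):
        print(mu, estimate_risk(sample_mean, squared_loss, sample_normal, mu))
```

For this particular toy setting the exact risk is the variance of the sample mean, 1/5, whatever the true mean is, so the estimates should all land near 0.2.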

Consider an example. We count the number of cosmic rays above some energy passing through a detector in a fixed time. Our random variable has range \mathbb{N} (we can only count nonnegative integer numbers of cosmic rays), and we take the underlying distribution to be Poisson. Our class of distributions \Omega is the set of all Poisson distributions. Since the Poisson distribution is defined by one parameter \lambda \in \mathbb{R}^+, \Omega is isomorphic to \mathbb{R}^+.

Our decision space D will be various guesses for \lambda, so it is also \mathbb{R}^+. We could use a very different D. For instance, we might try to decide whether or not there are cosmic rays of energy greater than our detector’s threshold, in which case D is just \{yes, no\}. We’ll take the loss function W to be W(\lambda_\Omega, \lambda_D) = |1 - \frac{\lambda_D}{\lambda_\Omega}|, where \lambda_\Omega is the parameter of the underlying distribution and \lambda_D is our guess. Then we have to construct a t such that the risk r(t, \lambda_\Omega) is minimal in some sense for our problem at hand.
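For the cosmic ray example the risk can be computed directly, because the expectation is a sum over counts weighted by the Poisson probabilities. The sketch below compares two candidate procedures that I have picked purely for illustration: t_count, which reports the observed count as the guess for \lambda, and t_const, which ignores the data and always guesses 3. The function names and the truncation point of the sum are my assumptions.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with parameter lam > 0."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def loss(lam_true, lam_guess):
    """W(lambda_Omega, lambda_D) = |1 - lambda_D / lambda_Omega|."""
    return abs(1.0 - lam_guess / lam_true)


def risk(t, lam_true):
    """r(t, lambda_Omega): the expected loss of t, summed over counts.

    The sum is truncated far out in the tail (an assumed cutoff), where
    the remaining Poisson probability is negligible.
    """
    k_max = int(lam_true + 20.0 * math.sqrt(lam_true) + 50.0)
    return sum(poisson_pmf(k, lam_true) * loss(lam_true, t(k))
               for k in range(k_max + 1))


def t_count(x):
    """Guess that lambda equals the observed count.

    Note: a count of zero yields the guess 0, on the boundary of D.
    """
    return float(x)


def t_const(x):
    """Ignore the measurement and always guess 3."""
    return 3.0


if __name__ == "__main__":
    for lam in (1.0, 3.0, 10.0):
        print(lam, risk(t_count, lam), risk(t_const, lam))
```

Neither procedure beats the other everywhere. At \lambda_\Omega = 3 the constant guess has zero risk and cannot be improved on, while at \lambda_\Omega = 1 its risk is 2 and the observed count does better. This is the situation in which some further criterion is needed to pick a t.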
