Fitting the Pearson type 3

Also see Pearson type 3 in Excel, Pearson type 4 in Excel

You’ve got some data and think a Pearson type III distribution might fit it nicely, but how do you go about choosing the parameters? As mentioned in the post about implementing Pearson III in Excel, the obvious way — using the mean, standard deviation and skewness of the sample — is often frowned upon. That’s said to be because it gives a biased fit, although in the real world it often performs well, as we’ll see.

Example

It helps to work from an example. Let’s invent one by taking the first 50 randomly generated Pearson type III variates from the test at the bottom of the first sheet in PearsonIII.xlsm. To make things a little more realistic, I’ve introduced a bit of “noise” by rounding them off to three significant figures, so they look like this:

1820
1740
1440
2340
2160
1930
1540
2330
2090
2550
2890
… etc

It’s useful to find a way to visualise that. A popular approach is to sort the sample and assign rough probability estimates to each point according to its rank¹. The simple Weibull formula [P=i/(n+1)] achieves that, but many prefer Cunnane [P=(i-0.4)/(n+0.2)], which can be shown to do slightly better near the tails. Sorting in descending order and applying the Cunnane formula gives the probability that each sample value is exceeded (a short sketch of the calculation follows the table):

Sorted sample   Probability of exceedance
3390            1.2%
3360            3.2%
2890            5.2%
2580            7.2%
2570            9.2%
2550            11.2%
… etc
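
As a cross-check on that table, here’s a minimal sketch of the plotting-position calculation. It isn’t part of the workbook (which does the same job with Excel formulas); it assumes Python with numpy, and only the sample values quoted above are listed:

```python
import numpy as np

# The sample from above (only the values quoted in the post are listed here;
# paste in the remaining ones from PearsonIII.xlsm to reproduce the table exactly).
sample = np.array([1820, 1740, 1440, 2340, 2160, 1930, 1540, 2330, 2090, 2550, 2890])

n = len(sample)
sorted_desc = np.sort(sample)[::-1]      # largest value first
rank = np.arange(1, n + 1)               # rank 1 = largest

p_weibull = rank / (n + 1)               # simple Weibull plotting position
p_cunnane = (rank - 0.4) / (n + 0.2)     # Cunnane plotting position (used in the table)

for x, p in zip(sorted_desc, p_cunnane):
    print(f"{x:6d}  {p:5.1%}")           # probability that x is exceeded
```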

Then if we plot those probabilities for all 50 sample points we get this:

[Figure: PIII_sample — the sorted sample plotted against its exceedance probabilities]
The S-shape of course reflects the cumulative form of the normal distribution’s bell curve, from which our sample departs a little because it was generated with non-zero skew. I’ve added a line showing where a normal distribution with our sample’s mean and standard deviation should plot. To make things easier still to interpret, it’s common to transform the vertical scale so that the normal distribution’s S-curve becomes a straight line. We do that by expressing the probabilities as standard deviations, the so-called “z-scores”:

[Figure: PIII_z_scale — the sample plotted against the z-scores of its exceedance probabilities]
For the single-tailed cumulative normal, one standard deviation corresponds to the familiar 84%, two to about 97.7% and three to about 99.9%; so we can simply replace the linear z scale with suitably distorted probability ticks. That’s the so-called “probability scale”, the scale that linearises a normal distribution:

[Figure: PIII_probability_scale — the sample plotted on a normal probability scale]
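
For reference, the transformation behind those last two charts is just the inverse of the standard normal cumulative distribution; here’s a minimal sketch, assuming scipy and the p_cunnane array from the earlier snippet:

```python
from scipy.stats import norm

# Exceedance probability -> z-score via the inverse survival function of the
# standard normal: a 1.2% chance of exceedance maps to about +2.26 standard
# deviations, 50% maps to 0, and so on.
z = norm.isf(p_cunnane)

# Plotting sorted_desc against z gives the z-scale chart; relabelling the z axis
# with the corresponding probabilities, norm.sf(z), gives the probability scale.
```
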
Again, our sample doesn’t plot as a straight line on that scale because it is skewed: positive skewness makes for downwards curvature, negative upwards. So what does a fit look like using the sample “moments” — the mean, standard deviation and skewness estimated from the sample with the likes of Excel’s AVERAGE(), STDEV() and SKEW() functions? We get this:

[Figure: PIII_moment_fit — the moment-based Pearson type III fit plotted over the sample]
Hey, by eye that fit doesn’t look too bad. The sample “moments” are mean, standard deviation and skewness (μ, σ, γ₁) = 1918, 505, 0.927; not that far from those used to generate it (which we happen to know were 2100, 650, 0.9).
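
Here’s a minimal sketch of that moment fit, assuming scipy (whose pearson3 distribution happens to be parameterised directly by skew, with loc and scale playing the roles of mean and standard deviation):

```python
import numpy as np
from scipy.stats import pearson3, skew

# sample: the numpy array of the 50 values from the first snippet
mu    = np.mean(sample)                  # Excel AVERAGE()
sigma = np.std(sample, ddof=1)           # Excel STDEV()
gamma = skew(sample, bias=False)         # Excel SKEW()

# Method-of-moments fit: scipy's pearson3 takes skew as its shape parameter,
# with loc and scale being the mean and standard deviation.
moment_fit = pearson3(gamma, loc=mu, scale=sigma)

# In shifted-gamma terms this is shape = 4/gamma**2, scale = sigma*gamma/2 and
# location = mu - 2*sigma/gamma (for positive skew).

print(moment_fit.isf(0.01))              # e.g. the value exceeded 1% of the time
```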

How can we do better? Well, we could try a least-squares fit to the Cunnane z-scores, minimising the sum of the squares of the deviations (like in ordinary linear regression). Here that fits a distribution with μ, σ, γ₁ = 1922, 514, 0.856 — slightly further away, at least in terms of skewness. The problem is that the plotting probabilities were only ever meant as crude estimates for visualisation; they were never intended for fitting. There are other ways to apply least squares estimation, but these days the preferred fitting method is usually a thing called maximum likelihood estimation.
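
Before moving on to that, here’s one way the least-squares version might be set up; it’s a sketch only, just one reading of “least squares against the plotting positions”, and it assumes scipy plus the variables from the earlier snippets:

```python
import numpy as np
from scipy.stats import pearson3
from scipy.optimize import minimize

def sum_sq(params, sorted_desc, p_exceed):
    """Sum of squared differences between the sorted sample and the quantiles a
    candidate Pearson III predicts at the Cunnane exceedance probabilities."""
    mu, sigma, gamma = params
    if sigma <= 0:
        return np.inf                    # keep the search away from invalid scales
    fitted = pearson3.isf(p_exceed, gamma, loc=mu, scale=sigma)
    return np.sum((sorted_desc - fitted) ** 2)

# Start from the moment estimates and search numerically (much as Solver would).
res = minimize(sum_sq, x0=[mu, sigma, gamma],
               args=(sorted_desc, p_cunnane), method="Nelder-Mead")
ls_mu, ls_sigma, ls_gamma = res.x
```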

Maximum likelihood estimation

Maximum likelihood estimation (MLE) was apparently known to Herr Gauss (circa 1800), but was widely popularised around 1920 by one RA Fisher (who doesn’t seem to have much liked his countryman and contemporary, Karl Pearson). The argument’s pretty simple (as is the method) and goes like this. Assume that a sample does fit some known distribution with known parameters. If so, the unit likelihood² of each sample point is given by the ordinate of the distribution’s probability density function at that value. And provided the sample points are independent, the likelihood of any two points together is just the product of their individual likelihoods (by the product rule). It follows that the “likelihood of the sample” taken as a whole is the product of the likelihoods of all the individual sample points [L = ƒ(x₁)·ƒ(x₂)·…·ƒ(xₙ)]. Now intuitively (and demonstrably), if we instead want to choose between a range of candidate distribution parameters, the best fit ought to be the set of parameters that gives the highest aggregate sample likelihood, that is, the parameters that produce the highest product of all the individual sample point likelihoods.

Voilà … we can do that. For some distribution functions MLE can be computed analytically³, but in Excel we aren’t restricted to those. We can readily do it numerically with the Solver add-in, using the sample moments as the starting point, both for efficiency and to avoid the pitfalls of local maxima in non-linear optimisation⁴. It’s usual to work with the sum of the log-likelihoods rather than the product itself, to avoid rounding problems with very small numbers. (Maximising the product is equivalent to maximising the sum of the logs … that’s what logarithms were invented for.)
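
Outside Excel, that Solver-style calculation might be sketched like this (assuming scipy, and reusing the sample array and the moment estimates mu, sigma and gamma from the earlier snippets as the starting point):

```python
import numpy as np
from scipy.stats import pearson3
from scipy.optimize import minimize

def neg_log_likelihood(params, data):
    """Negative sum of log-likelihoods: minimising this maximises the product
    of the individual sample-point likelihoods."""
    mu, sigma, gamma = params
    if sigma <= 0:
        return np.inf
    return -np.sum(pearson3.logpdf(data, gamma, loc=mu, scale=sigma))

# Start from the moment estimates, as with Solver in the workbook.
res = minimize(neg_log_likelihood, x0=[mu, sigma, gamma],
               args=(sample,), method="Nelder-Mead")
mle_mu, mle_sigma, mle_gamma = res.x

# scipy can also do this in one call: pearson3.fit(sample) runs its own
# numerical maximum likelihood fit (returning skew, loc, scale).
```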

Applied to our problem, the result looks like this:

[Figure: PIII_fit — the maximum likelihood fit plotted over the sample]

Well I’ll be … the result is almost the same. The MLE fit is μ, σ, γ₁ = 1918, 501, 0.878; compared to 1918, 505, 0.927 from the sample moments (and 2100, 650, 0.9 as the “right” answer). I find that for lots of samples of moderate size, MLE fits and moment fits are near identical … so you could fairly ask, why bother? Well, when a demonstrably better method is easy to implement, ignoring it may be difficult to defend. But MLE has other advantages. It makes it easy to fit to a censored (e.g. truncated) sample, for example if you wanted to fit to the larger river floods in an annual series, while ignoring the less important small ones. It also has benefits in estimating the confidence limits of the fit, but that’s for another time…
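
For what it’s worth, the censored-sample idea mentioned above needs only a small change to the log-likelihood: values above a chosen threshold contribute their density as before, while each value below it contributes only the probability of falling below the threshold. A rough sketch, again assuming scipy and the earlier variables (the 1800 threshold is just an arbitrary example, not anything from the workbook):

```python
import numpy as np
from scipy.stats import pearson3
from scipy.optimize import minimize

def neg_log_likelihood_censored(params, data, threshold):
    """Censored log-likelihood: points below the threshold are known only to be
    'somewhere below it', so they contribute the CDF there, not the density."""
    mu, sigma, gamma = params
    if sigma <= 0:
        return np.inf
    above = data[data >= threshold]
    n_below = np.sum(data < threshold)
    ll = np.sum(pearson3.logpdf(above, gamma, loc=mu, scale=sigma))
    ll += n_below * pearson3.logcdf(threshold, gamma, loc=mu, scale=sigma)
    return -ll

res = minimize(neg_log_likelihood_censored, x0=[mu, sigma, gamma],
               args=(sample, 1800.0), method="Nelder-Mead")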

Sample implementation

I’ve put a sample implementation of each method in sheet Fitting of PearsonIII.xlsm, available here. You’ll need Excel 2010 or later, and you’ll need to activate the Solver add-in (File | Options | Add-ins | Go… at the bottom of the dialog, then tick Solver Add-in).

Notes

  1. In Excel we don’t of course need to sort; we can just use the RANK() function. I’ve sorted the sample here for clarity.
     
  2. Unit likelihood in a continuous distribution — that is, the probability per unit x, from the definition of probability density (ƒ(x) = dP(x)/dx, where P(x) is the cumulative probability at x). That doesn’t really have a simple intuitive meaning, hence the choice of the term “likelihood”, but it scales like a probability and, for independent sample points, obeys the same product rule.

    (For our example, the unit likelihood of a sample point at say x = 2000 is approximately the probability that a sample choice will fall between 1999.5 and 2000.5. In a continuous distribution the probability that a sample falls exactly at any given value is of course vanishingly small … it’s zero.) 

  3. That’s not possible for the Pearson type III (i.e. gamma) distribution. It is for the normal distribution, which interestingly gives a maximum likelihood fit using the “population” standard deviation formula (divide by n, as in Excel 2013’s STDEV.P()) rather than the usual bias-corrected sample standard deviation (divide by n-1 as in STDEV.S() and STDEV()). So MLE does not provide an unbiased fit to the normal distribution. (A short numerical check of this follows these notes.)

    Even more interestingly, the “population” moment estimates (STDEV.P() and SKEW.P()) are generally closer to the Pearson type III MLE estimates than the (bias-corrected) sample estimates (STDEV.S() and SKEW.S()). Again, MLE seems to be slightly more biased, so perhaps we should drop that rhetoric about MLE providing a less biased fit.

  4. That will work provided the moment estimate is sufficiently close to the global maximum, as it should be if the moments themselves provide a sufficiently good fit.
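
As the quick numerical check promised in note 3 (a sketch assuming scipy and numpy; the seed and parameters are arbitrary):

```python
import numpy as np
from scipy.stats import norm

# Note 3 in practice: the maximum likelihood fit of a normal distribution
# recovers the n-denominator ("population") standard deviation.
rng = np.random.default_rng(1)
x = rng.normal(loc=2100, scale=650, size=50)    # arbitrary example parameters

mle_mu, mle_sigma = norm.fit(x)    # scipy's norm.fit is the maximum likelihood fit
print(mle_sigma)                   # matches np.std(x), i.e. divide by n (STDEV.P)
print(np.std(x))                   # same value
print(np.std(x, ddof=1))           # bias-corrected estimate (STDEV.S) is slightly larger
```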
     

Reference

Many treatments of maximum likelihood estimation are denser than dense. I recommend trying these PennState lecture notes: https://onlinecourses.science.psu.edu/stat414/node/191