The Normal Distribution and its origin

Quite a few statistics and probability textbooks with an introduction to the normal distribution have passed through my hands. They almost always started with the mathematical definition and continued with the properties and uses of this distribution. For a long time I had not found any book that explained how this formula was created, until at some point a book named Lady Luck fell into my hands which, although old, is a very good introduction to the subject of probability, and in it I found the first clear description of how the normal distribution arises.

Since then other books have passed through my hands too, and with the spread of the internet I have also found quite a few online references on this topic. Nevertheless, I still consider the relevant bibliography to be limited, and if we restrict it to Greek, probably nonexistent. The purpose of this introduction is to familiarize the reader with the normal distribution, but also to make understandable how this strange formula arises, as well as the course of its discovery.

The normal distribution is arguably the most important statistical distribution; it has many different uses and is encountered in many different cases. Examples of variables that follow the normal distribution are human height, blood pressure, precision errors in measurements, the number of successes in dice games, and many others. It is remarkable how one formula applies to so many seemingly different cases. What relation could human height have with measurement precision errors and games of chance?
The normal distribution connects all these seemingly unrelated phenomena, and in what follows I will try to show how.

Use of the Normal distribution

[Diagrams (h1) to (h6): the height sample plotted as ordered values (h1), as a frequency bar chart (h2), as a probability bar chart (h3), and as histograms with 23, 6 and 47 bins (h4, h5, h6).]

Standard deviation

The standard deviation is a measure of dispersion: it shows how far the values of a population lie from its mean.

Roughly speaking, it is the amount by which the values differ, on average, from the population mean. It derives from the mean absolute deviation (which computes exactly that average distance) by replacing the mean of the absolute differences with the square root of the mean of the squared differences, which gives a similar, though not identical, result.

The standard deviation was created out of the need to eliminate the absolute value from the formula of the absolute deviation, because in many cases the absolute value makes calculations more difficult.
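The two measures described in this box can be compared with a few lines of R. This is only a sketch: the data vector is made up for illustration and is not the article's sample. Note that R's built-in sd() divides by n - 1, so the population form is written out explicitly here.

```r
# Hypothetical data, chosen only to illustrate the two measures of dispersion.
x <- c(172, 175, 179, 180, 181, 184, 190)
m <- mean(x)

# Mean absolute deviation: the average distance of the values from the mean.
mean_abs_dev <- mean(abs(x - m))

# Population standard deviation: square root of the mean squared distance.
pop_sd <- sqrt(mean((x - m)^2))

cat("mean absolute deviation:", mean_abs_dev, "\n")
cat("standard deviation:     ", pop_sd, "\n")
```

The standard deviation comes out somewhat larger than the mean absolute deviation, because squaring gives more weight to the values that lie far from the mean.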

To approach the issue of the normal distribution we will start with a short introduction on how we use it.
Let us suppose that we are conducting a survey on the height distribution of the people of a region. We collect height measurements from 800 individuals and want to study this sample. In the end we have a set of measurements which, let us suppose, were made to the nearest centimeter (cm).
We start by computing the simple descriptive statistics and find the following:

Measure                    Value
count                      800
unique values [n1]         46
mean value                 179.7
median value               180
standard deviation [n2]    8.2
minimum                    157
maximum                    203

[n1] Unique values: the set of distinct values that the population takes.
[n2] For the standard deviation, refer to the box with the explanation at the side of the article.

If we sort the values and plot them on a simple x,y diagram, we get diagram (h1). In this diagram we can see the height values that have been recorded, but not how many individuals have each height. We continue by building a frequency table for each distinct height value we have collected, and from it we construct the bar chart (h2). In it we see that the values are concentrated around the mean and that, as we move away from it, they decrease rapidly. Next we divide each frequency in the table (the vertical axis in (h2)) by the total number of individuals (800), which gives the probability of occurrence of each height, and we create the bar chart (h3), which shows probabilities in place of frequencies.

So far we have done the basics in order to be able to get an idea of how height is distributed in the population. Let us try to answer a few questions.

If we are interested in the probability that a person chosen at random from this population has a height of 1.80 m, we can simply read it off the bar chart (h3): it is 0.05, or 5%. This is a rather small number, since the total probability is spread over 46 distinct height values (all the occurrence probabilities sum to 1).

Let us ask a more interesting question. What is the probability that a random choice gives us a person of medium height?

If we define a person of medium height as one whose height is between 175 cm and 185 cm, then to answer the question we have to sum all the probabilities from 175 cm to 185 cm. If we carry out this procedure we find that the probability is about 0.5, or 50%.
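The summation can be sketched in R. The article's data set is not reproduced here, so this example simulates 800 heights from a normal distribution with the mean and standard deviation reported in the descriptive statistics (an assumption made purely for illustration) and rounds them to the centimeter. The result will therefore be close to, but not identical with, the article's figure.

```r
set.seed(1)  # arbitrary seed so that the simulated sample is reproducible

# Stand-in for the survey: 800 heights in cm, rounded to the nearest cm.
heights <- round(rnorm(800, mean = 179.7, sd = 8.2))

# Frequency table: how many individuals have each distinct height.
freq <- table(heights)

# Relative frequencies, used as the probability of each height value.
prob <- freq / length(heights)

# Probability of "medium height": sum over the values 175, 176, ..., 185.
values <- as.numeric(names(prob))
p_medium <- sum(prob[values >= 175 & values <= 185])
cat("P(175 <= height <= 185) =", p_medium, "\n")
```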

Let us dwell for a moment on this result. To answer this question we relied on the convention that the individuals were measured to centimeter accuracy, and we needed to compute and sum the probabilities of 11 different height values.
Might there be some way to avoid computing and summing all these probabilities?
What would we do if the individuals had been measured to millimeter accuracy?
If, instead of height, we had a variable that can take any value in ℝ (the real numbers), how would we group the values?

At this point we can see the use of probability distributions and, in our case, of the normal distribution. Knowing that height follows the normal distribution, we can use it to compute the probability for any interval of values, without having to group the heights and carry out time-consuming, repetitive calculations.

Let us see how this can be done.

First we have to introduce the concept of the histogram. The histogram is a diagram that visually resembles the bar chart quite a lot, but has fundamental differences. In the bar chart, what we measure is represented by the height of the bar, whereas in the histogram it is represented by the area. To create the histogram we group the values into intervals that we call bins, and the number of values that fall in each interval is represented by the area of the corresponding rectangle. For example, we convert (h3) into (h4) and (h5), with 23 and 6 bins respectively. Now the probability no longer corresponds to the value on the vertical axis but to the area of the bar whose base is the interval of values it depicts.
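The "area, not height" property can be checked in R with hist(). This is a sketch on simulated data (the same illustrative assumption as a normal sample with mean 179.7 and standard deviation 8.2, not the article's data set), using 5 cm bins chosen only for the example.

```r
set.seed(1)  # arbitrary seed; simulated stand-in for the height sample
heights <- round(rnorm(800, mean = 179.7, sd = 8.2))

# Bin edges every 5 cm, wide enough to cover every simulated value.
lo <- 5 * floor(min(heights) / 5)
hi <- 5 * ceiling(max(heights) / 5)
breaks <- seq(lo, hi, by = 5)

# plot = FALSE returns the histogram object without drawing it.
h <- hist(heights, breaks = breaks, plot = FALSE)

# h$density is the height of each bar; area = height * bin width.
areas <- h$density * diff(h$breaks)
cat("total area of the histogram:", sum(areas), "\n")
```

Each area equals the fraction of the sample that falls in the bin, so the areas add up to 1 whatever bin width is chosen.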

To make it clearer how we use the normal distribution, we will build a histogram (h6) with 47 bins, one bin for each centimeter [n3: In general there is no need to build a histogram with such narrow bins; in this article we do it to see how the normal distribution behaves.], so that the grouping practically coincides with the bar chart and we get a better overview. To answer our question we need to sum the areas of the histogram bars between the values in question. So far practically nothing has changed compared to before, except that in the new diagram the probability is represented by area. If we could find a curve that is easily integrable [n4: In simple terms, one that allows us to calculate the area between the curve and the x-axis.] and that encloses the same area as our histogram, then we could easily calculate the probabilities in question. This is exactly what probability density functions (PDFs) do. In our case we can use the appropriate normal distribution, which has this characteristic.

Onto this histogram we place the curve of the appropriate normal distribution which, as we see, lies very close to the values of our sample. In practice, the area under the curve of the normal distribution (pink hatching) is very close to the area of the corresponding rectangles of the histogram. If we do the calculations, we find that the area of the histogram is 0.5013, versus 0.496 under the curve. The difference is 0.0052, a fairly small number. If you want to verify or study the example, you can download the R code from the link at the end of the article (article-data.R for the diagrams of the example, normal.R as a separate code example for study).
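The area under the normal curve can be computed with R's pnorm(). The sketch below uses the rounded mean and standard deviation from the descriptive statistics; the article's own script may use unrounded values, so the result need not agree to the last digit. Treating each one-centimeter bin as extending half a centimeter on either side of its label is an assumption of this sketch. The helper normal_area() is a name introduced here for illustration.

```r
mu    <- 179.7  # sample mean from the descriptive statistics (cm)
sigma <- 8.2    # sample standard deviation (cm)

# Area under the normal curve between a and b.
normal_area <- function(a, b, m, s) {
  pnorm(b, mean = m, sd = s) - pnorm(a, mean = m, sd = s)
}

# The bins for 175..185 cm are assumed to span 174.5 to 185.5.
p_medium <- normal_area(174.5, 185.5, mu, sigma)
cat("P(medium height) under the normal curve:", round(p_medium, 3), "\n")

# The same area by numerical integration of the density, as a cross-check.
p_check <- integrate(dnorm, 174.5, 185.5, mean = mu, sd = sigma)$value
cat("numerical integration gives:", round(p_check, 3), "\n")
```

No grouping or summation over individual height values is needed: two evaluations of the cumulative distribution function are enough for any interval.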

What does not appear at all in our example is how we find this appropriate curve and why this curve succeeds so well in calculating the probabilities of the occurrence of height.

The answer, although it may sound a bit strange, lies in games of chance. All phenomena that follow the normal distribution can be modeled with chance experiments such as, for example, rolling dice or tossing coins. Or, to put it more correctly, if a phenomenon is modeled by a specific category of chance experiments (which we will describe below) then that phenomenon follows the normal distribution.

But what exactly do we mean when we say that they are modeled with chance experiments? And what are these experiments?

When we say modeling we refer to the creation of a controlled process whose behavior we can map to the events we encounter in the real phenomenon.
For example, let us take the phenomenon of the births of boys or girls. Based on our knowledge about the mechanism of sex selection, we can model each birth as the equivalent of tossing a fair coin, where to heads we map the birth of a girl and to tails the birth of a boy (or vice versa). If the way we assume sex selection occurs is wrong, then the modeling of this phenomenon as a coin toss will also be wrong.
In the ideal case the model is based on our knowledge about the nature of the phenomenon and not on the observation of the results; in reality it usually arises from both, which is why quite often the models we build are not accurate or, even worse, may be wrong.

Statistics as a field of knowledge is usually placed under mathematics, since we practice it mainly with mathematical tools. However, there is a fundamental difference between statistics and other branches of mathematics, which brings statistics closer to other fields of knowledge such as physics. Statistics is an empirical science [n5: Empirical is the name given to a science that is based on empirical data and on theories that can be tested empirically, in contrast to mathematics, which in general is based on logical rules and axioms that are not necessarily connected to experience.]. When we analyze a problem with statistical tools, we necessarily make assumptions that are based on empirical data (exactly as in physics). In our example we assumed that height follows the normal distribution; this assumption is based on empirical data and not on some kind of mathematical proof. Through empirical data and knowledge of the genetic and environmental mechanism that determines height, which is also based on empirical data, we decide that this model of how height is produced (which results in a normal distribution) is the most appropriate one.

To better understand the normal distribution we must keep in mind that it constitutes a tool of an empirical science and not a mathematical theorem such as, for example, Euclid's theorem. As an empirical tool it was founded on observations of the natural world, and to understand its usefulness we must study the phenomena that it describes and formalizes.

Within the framework of statistics and probability, quite a few chance experiments have been studied which can function as models for various natural phenomena. The normal distribution can be modeled with a fairly simple chance experiment, tossing a coin several times. Through the normal distribution we can calculate the probability of the occurrence of a specific outcome of this experiment (e.g. 10 tails in 20 tosses) with fairly good accuracy. This is historically the first path that led to the normal distribution (there is also the path of its relationship with measurement errors, to which we will refer later).

To take this path we will start from the study of this simple chance experiment which is described by the Binomial distribution.

Binomial distribution

The Binomial distribution studies the number of occurrences of a result in successive tosses of a coin. The toss of a coin has two possible outcomes: heads (H) or tails (T). Within the framework of the binomial distribution we are interested in the number of occurrences of one of these, whose occurrence we call a success. For example, if we toss a coin once we can get either no heads or one head; if we toss it twice: no, one or two heads; three times: no, one, two or three heads; and so on.

In more technical language, the binomial distribution is a discrete distribution that describes the probability of a given number of successes in a game of chance with two outcomes. (The term discrete refers to the fact that the random variable takes its values from a finite set, and not from an infinite continuum like ℝ: in the case of the coin each toss has only the two possible outcomes (H) and (T), in contrast, for example, to height, which can potentially take any value in ℝ.)

Incorporating the concept of probability into what we have mentioned: if we toss a fair coin once, we have 2 possible outcomes (heads (H) or tails (T)), each with probability ½ = 0.5 (the coin is fair).
If we toss the coin 2 times, then the possible outcomes are:

outcomes   H   T   probability
H H        2   0   1/4
H T        1   1   1/4
T H        1   1   1/4
T T        0   2   1/4

If we are interested only in the number of heads then we have the following result:

heads   occurrences   probability
0       1             1/4
1       2             2/4
2       1             1/4

If we toss the coin 3 times, the possible outcomes are:

outcomes   H   T   probability
H H H      3   0   1/8
H H T      2   1   1/8
H T H      2   1   1/8
H T T      1   2   1/8
T H H      2   1   1/8
T H T      1   2   1/8
T T H      1   2   1/8
T T T      0   3   1/8

If we record the number of heads in the possible outcomes then the table will be as follows:

heads   occurrences   probability
0       1             1/8
1       3             3/8
2       3             3/8
3       1             1/8

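Binomial distribution in R

> # possible numbers of heads in three tosses
> x <- 0:3
> # probability of each number of heads for a fair coin
> y <- dbinom(x, size = 3, prob = 0.5)
> x
[1] 0 1 2 3
> y
[1] 0.125 0.375 0.375 0.125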
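The counting done by hand in the tables can also be reproduced by brute force. The following sketch lists all sequences of three tosses with expand.grid() and counts the heads in each; the column names t1, t2, t3 are chosen here only for illustration.

```r
# All 2^3 = 8 equally likely sequences of three tosses.
tosses <- expand.grid(t1 = c("H", "T"), t2 = c("H", "T"), t3 = c("H", "T"),
                      stringsAsFactors = FALSE)

# Number of heads in each sequence.
heads <- rowSums(tosses == "H")

# Occurrences of each head count, and the resulting probabilities.
occurrences <- table(heads)
prob <- as.numeric(occurrences) / nrow(tosses)

print(occurrences)
print(prob)
```

The probabilities obtained by enumeration agree with the values that dbinom() returns for size = 3 and prob = 0.5.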

In general, the probability of a given number of successes (in the example, the number of heads) in n repetitions of a random experiment with two outcomes is given by the binomial distribution. Its formula is:

$$f(k; n, p) = \Pr(K = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$

where
k is the number of successes,
n is the number of repetitions,
p is the probability of success.
If, for example, we want to calculate the probability of getting 3 tails in 8 coin tosses (counting tails as the success), then for k=3, n=8, p=0.5 we have:

$$f(3; 8, 0.5) = \Pr(K = 3) = \binom{8}{3} 0.5^3 (1 - 0.5)^{8 - 3} = 0.21875$$
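The formula translates directly into R. In this sketch binom_prob() is a helper name introduced for illustration; choose() and dbinom() are base R functions.

```r
# Binomial probability written out from the formula:
# f(k; n, p) = choose(n, k) * p^k * (1 - p)^(n - k)
binom_prob <- function(k, n, p) {
  choose(n, k) * p^k * (1 - p)^(n - k)
}

by_formula <- binom_prob(3, 8, 0.5)            # 56 / 256
built_in   <- dbinom(3, size = 8, prob = 0.5)  # R's built-in binomial

cat("formula:", by_formula, " dbinom:", built_in, "\n")
```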

Below I present diagrams of the probability of k successes for various numbers of repetitions, among them n = 2, 8, 32, 64 (p=0.5):

[Diagrams (1) to (6): binomial probabilities of the number of successes k for increasing numbers of repetitions n, with p = 0.5.]

On the horizontal axis (X) is the number of successes that occurred, on the vertical axis is the probability of occurrence of the respective successes.

It is evident that, despite all the changes in the axes of the graphs, there is a pattern that repeats itself in all of them. There is a dominant value (the one with the greatest probability) that lies at the center of each diagram, and on either side the remaining values decrease gradually and symmetrically with respect to it. The further we move away from the center (toward very small or very large values), the more dramatically the probability of occurrence shrinks. In the example of the 64 repetitions, the probability of getting 0 successes (or, equally, 64 successes) is extremely small: $(1/2)^{64} \approx 5.4 \cdot 10^{-20}$, or otherwise 0.000000000000000000054.

Approximating the binomial

The first historical route to the equation of the normal distribution came from the search for an alternative way of calculating the values of the binomial distribution. In 1738 Abraham de Moivre [c1: Walker, Helen M. (1985). De Moivre on the Law of Normal Probability. In Smith, David Eugene. A Source Book in Mathematics. Dover. ISBN 0486646904.] arrived at what is today called the normal approximation of the binomial distribution, which is nowadays expressed as:

$$\sum_{k=i}^{j} \binom{n}{k} p^k (1 - p)^{n - k} \approx N\left(\frac{j - np}{\sqrt{np(1 - p)}}\right) - N\left(\frac{i - np}{\sqrt{np(1 - p)}}\right) \quad \text{where} \quad N(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-\frac{x^2}{2}} \, dx$$

The above formula is built on a continuous function that gives us a good approximation of the probability of an interval of outcomes for any number of repetitions n, and the approximation improves as n grows. The exact method by which we arrive at this formula is quite complex; for this reason I will only sketch the basic steps of one method by which we can arrive at the function in question.
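How good the approximation is can be checked numerically. The sketch below compares the exact binomial sum with the approximation for n = 64 and an interval of successes chosen only for illustration; pnorm() plays the role of N(z). The second variant, with a continuity correction of 0.5 at each end, is not part of the formula above; it is a common refinement, added here for comparison.

```r
n <- 64; p <- 0.5
i <- 28; j <- 36   # interval of successes, chosen only for illustration

mu    <- n * p
sigma <- sqrt(n * p * (1 - p))

# Exact probability: sum of the binomial terms for k = i, ..., j.
exact <- sum(dbinom(i:j, size = n, prob = p))

# Approximation exactly as written in the formula; pnorm() is N(z).
approx_plain <- pnorm((j - mu) / sigma) - pnorm((i - mu) / sigma)

# Variant with a continuity correction of 0.5 at each end
# (an addition of this sketch, not part of the formula above).
approx_cc <- pnorm((j + 0.5 - mu) / sigma) - pnorm((i - 0.5 - mu) / sigma)

cat("exact:", exact, " plain:", approx_plain, " corrected:", approx_cc, "\n")
```

The corrected variant treats each integer k as the interval from k - 0.5 to k + 0.5, which is the same convention as a histogram with one bin per value.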

row   coefficients
0     1
1     1 1
2     1 2 1
3     1 3 3 1
4     1 4 6 4 1
5     1 5 10 10 5 1
6     1 6 15 20 15 6 1
7     1 7 21 35 35 21 7 1
8     1 8 28 56 70 56 28 8 1
Pascal's Triangle

We will start with Pascal's triangle. This triangle has the following property: each number is the sum of the two numbers located diagonally to its left and right in the previous row. For example, in the last row the second number, 8, is the sum 1+7 of the two numbers found in the previous (second-to-last) row on either side of the 8. Pascal's triangle offers a different procedure for calculating the results of the binomial distribution.

To see this procedure more concretely we will take the case of 8 repetitions, diagram (3). The results depicted are the following:

successes   probability
0           0.00390625
1           0.03125
2           0.109375
3           0.21875
4           0.2734375
5           0.21875
6           0.109375
7           0.03125
8           0.00390625

Abraham de Moivre

If we take row 8 (the last row) of Pascal's triangle and divide each number by the total number of possible outcomes (for 8 tosses this number is $2^8 = 256$), we get the results of the binomial distribution:

  •  1/256 = 0.00390625
  •  8/256 = 0.03125
  • 28/256 = 0.109375
  • 56/256 = 0.21875
  • 70/256 = 0.2734375
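The construction rule of the triangle is easy to turn into code. In this sketch pascal_row() is a helper name introduced for illustration: it builds a row by repeatedly adding each row to a shifted copy of itself, which is the "sum of the two numbers above" rule.

```r
# Build row n of Pascal's triangle by repeated addition of neighbours:
# each new row is the previous row shifted right plus itself shifted left.
pascal_row <- function(n) {
  row <- 1
  for (step in seq_len(n)) {
    row <- c(0, row) + c(row, 0)
  }
  row
}

row8 <- pascal_row(8)
print(row8)

# Dividing by the number of possible outcomes gives the probabilities
# of 0, 1, ..., 8 heads in 8 tosses of a fair coin.
print(row8 / 2^8)
```

The resulting probabilities coincide with dbinom(0:8, size = 8, prob = 0.5).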

So with Pascal's triangle we have an alternative way to calculate the results of the binomial distribution. The coefficients that we calculate with this procedure can also be calculated by means of the binomial theorem:

$$(a + b)^N = a^N + \binom{N}{1} a^{N-1} b + \binom{N}{2} a^{N-2} b^2 + \dots + \binom{N}{N} b^N$$

e.g. for 8 repetitions we have:

$$(a + b)^8 = 1a^8 + 8a^7 b + 28a^6 b^2 + 56a^5 b^3 + 70a^4 b^4 + \dots + 1b^8$$

If we assign to a and b the values p and 1-p, the probabilities of the random experiment of the binomial distribution (for the coin in our example, a = b = 0.5), then each monomial of the binomial expansion is a calculated probability (the y axis in diagrams 1-6) and the position of the monomial in the expansion corresponds to the number of successes (the x axis in diagrams 1-6): the exponent of a counts the successes and the exponent of b the failures.

If we add all the monomials we get a sum of 1 (since they are the probabilities of all possible outcomes), and the sum of some consecutive monomials gives us the probability of getting a result in the corresponding interval of successes. At this point we can make use of the idea of the Riemann integral [n6: The sum of consecutive rectangles gives us an area.] and approximate the value of these sums by means of the Riemann integration of the binomial distribution. The binomial coefficient (Newton's binomial) $\binom{m}{n} = \frac{m!}{n!\,(m - n)!}$ contained in the formula of the binomial distribution is a significant obstacle to the integration we want to achieve, so instead we can use Stirling's approximation formula for $n!$: $n! \approx n^n e^{-n} \sqrt{2\pi n}$.
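A quick numerical look at Stirling's formula shows why it is a usable replacement for the factorial. The function name stirling() and the chosen values of n are introduced here only for illustration.

```r
# Stirling's approximation: n! is roughly n^n * exp(-n) * sqrt(2 * pi * n).
stirling <- function(n) n^n * exp(-n) * sqrt(2 * pi * n)

n <- c(1, 5, 10, 20, 50)
comparison <- data.frame(
  n        = n,
  exact    = factorial(n),
  stirling = stirling(n),
  ratio    = stirling(n) / factorial(n)  # approaches 1 as n grows
)
print(comparison)
```

The ratio column stays slightly below 1 and moves toward 1 as n increases, so the relative error of the approximation shrinks for larger n.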

Starting from the binomial theorem, making use of the approximation formula for $n!$ and integrating in the sense of Riemann, we can arrive at the normal distribution:

$$f(x, \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

The purpose of this crude description of the procedure is to show how the basic elements of the equation of the normal distribution arise: the integral, the $e$ and the $\pi$. Personally, the first time I saw the equation, their presence in it seemed extremely strange to me.
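The density can be written out in R and placed next to the binomial probabilities it approximates. In this sketch normal_pdf() is a helper name introduced for illustration, and the matching curve is assumed to have mean np and standard deviation sqrt(np(1-p)), the same quantities that appear in the approximation formula earlier in the article.

```r
# Normal density written out from the formula.
normal_pdf <- function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

n <- 64; p <- 0.5
k <- 0:n

# Exact binomial probabilities of k successes.
binom <- dbinom(k, size = n, prob = p)

# Matching normal curve: mean n*p, standard deviation sqrt(n*p*(1-p)).
normal <- normal_pdf(k, mu = n * p, sigma = sqrt(n * p * (1 - p)))

max_gap <- max(abs(binom - normal))
cat("largest gap between binomial and normal values:", max_gap, "\n")
```

For 64 tosses the two sets of values already differ by very little at every k, which is the pattern seen in the diagrams as n increases.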

At this point we have completed de Moivre's path toward the formulation of the normal distribution. Next we will see how another researcher arrived at the same formula in a completely different way.

Error curve

Carl Friedrich Gauss

In 1809 Gauss published a monograph titled Theoria motus corporum coelestium in sectionibus conicis solem ambientium in which, among other things, he introduced some important concepts of statistics such as the method of least squares, the maximum likelihood method and the normal distribution. The problem that Gauss wanted to solve, as well as the method he followed to arrive at the normal distribution, is entirely different from that of De Moivre, and for this reason it is worth examining.

Astronomy is one of the first sciences in which accuracy of measurement was required. In it, taking repeated measurements is a standard procedure, so from early on the problem arose of how to obtain the final value of a measurement. The first to mention the issue was Galileo in Dialogue Concerning the Two Chief Systems of the World, Ptolemaic and Copernican [c2: G. Galilei, Dialogue Concerning the Two Chief World Systems, Ptolemaic & Copernican (S. Drake translator), 2nd ed., Berkeley, Univ. California Press, 1967.], which was published in 1632. Galileo studied the properties of the random errors that occur in astronomical observations, and his conclusions have been summarized [c3: A. Hald, A History of Probability and Statistics and Their Applications before 1750. New York, John Wiley & Sons, 1990.] in 5 propositions:

  1. There is only one number that gives the distance of a star from the center of the earth, the true distance.
  2. All observations are burdened with errors, which are due to the observer, the measuring instruments, the earth, the actual distance.
  3. The observations are distributed symmetrically around the true value, which means that the errors too are distributed symmetrically around zero.
  4. Small measurement errors occur more often than large errors.
  5. The calculated distance is a function of the direct angular observations such that small deviations of the observations can lead to a large deviation of the distance.

Galileo did not describe any function defining how the errors of the observations are distributed, so the issue remained open. Early attempts to describe such functions were made in the 18th century by Thomas Simpson [c4: T. Simpson, A Letter to the Right Honourable George Macclesfield, President of the Royal Society, on the Advantage of taking the Mean, of a Number of Observations, in practical Astronomy, Phil. Trans. 49 (1756), 82-93.] and Laplace [c5: P. S. Laplace, Mémoire sur la probabilité des causes par les événements, Mémoires de l'Académie Royale des Sciences présentés par divers savan 6 (1774), 621-656. Reprinted in Laplace, 1878-1912, Vol. 8, pp 27-65. Translated in S. M. Stigler, Memoir on the probability of the causes of events, Statistical Science 1 (1986) 364-378.]. Ultimately, as we mentioned earlier, the solution was given by Gauss in 1809.

In 1801 the Italian astronomer Giuseppe Piazzi spotted a celestial object which he had a well-founded suspicion was a planet. He announced the discovery and named it Ceres (Demeter). Unfortunately, six weeks later, before astronomers could collect enough observations to compute its orbit and be sure that it was indeed a new planet, Ceres hid behind the sun and was not expected to reappear for about a year. The interest surrounding this new planet grew, and astronomers from all over Europe undertook to calculate the probable position where the planet would reappear. The then-young Gauss, who was already known as an exceptional mathematician, proposed a region of the sky that was quite different from those proposed by the other astronomers, and in the end he was right.

It is remarkable that in his attempt to solve this astronomical problem Gauss introduced a set of very important concepts, among them the normal distribution. To locate the orbit, Gauss used the least-squares criterion to find the one that best fit the observations. To do this he needed to compute the distribution function of the measurement errors, which was based on a theory of errors that in turn rests on three assumptions about the nature of the distribution of the errors:

  1. The probability of small deviations (measurement errors) occurring is greater than that of large deviations.
  2. For every real number (ε), measurement errors of magnitude (ε) and (-ε) are equally probable.
  3. If we take many measurements, the most probable value of the measured quantity is the average of the measurements.

To follow Gauss's method we need some definitions.
Let p be the correct (but unknown) value of a measured quantity.
Suppose we take ν independent measurements M₁, M₂, M₃ … Mᵥ.
Let φ(x) be the probability density function of the random error, and let it be differentiable.
Gauss's first assumption leads to the conclusion that φ(x) has a maximum at x = 0, and the second assumption to φ(-x) = φ(x).
The expression Mᵢ - p denotes the error of the i-th measurement, and since the measurements and the corresponding errors are stochastically independent, the expression
Ω = φ(M₁ − p)⋅φ(M₂ − p)⋅φ(M₃ − p) … φ(Mᵥ − p)
is the joint probability density of the ν errors. Making use of the third assumption, Gauss took the average (μ̄) of the measurements to be the maximum likelihood estimator of p. In other words, given the measurements M₁, M₂, M₃ … Mᵥ, the choice p = μ̄ maximizes Ω.

With these assumptions, Gauss arrived at the following form for the function φ:

$$\varphi(x) = \frac{h}{\sqrt{\pi}} e^{-h^2 x^2}$$

where h is the measure of the precision of the observations.
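Two properties of Gauss's curve can be checked numerically. First, with h = 1/(σ√2) it coincides with the normal density of mean 0 and standard deviation σ. Second, under this φ the product Ω is largest when p equals the average of the measurements. The sketch below is an illustration, not Gauss's derivation: the measurement values and σ = 2 are made up, and the maximum is found numerically with R's optimize() on the logarithm of Ω.

```r
# Gauss's error curve with precision h.
phi <- function(x, h) h / sqrt(pi) * exp(-h^2 * x^2)

# With h = 1 / (sigma * sqrt(2)) this is the normal density with mean 0.
sigma <- 2
h <- 1 / (sigma * sqrt(2))
x <- seq(-6, 6, by = 0.5)
same_curve <- isTRUE(all.equal(phi(x, h), dnorm(x, mean = 0, sd = sigma)))

# Hypothetical repeated measurements of one quantity.
M <- c(10.2, 9.8, 10.1, 10.4, 9.9)

# log(Omega) as a function of the candidate true value p.
log_omega <- function(p) sum(log(phi(M - p, h)))

# Numerical maximisation over a range that brackets the measurements.
p_hat <- optimize(log_omega, interval = range(M), maximum = TRUE)$maximum
cat("curves agree:", same_curve, "\n")
cat("maximiser of Omega:", p_hat, " mean of M:", mean(M), "\n")
```

Maximizing log Ω here amounts to minimizing the sum of the squared errors (Mᵢ − p)², which is where the connection with the method of least squares comes from.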

Epilogue

It is remarkable that two mathematicians independently arrived at the formulation of the same tool to solve seemingly different problems. This was of course not the only time it happened; something similar happened with infinitesimal calculus, where again two researchers independently, Newton and Leibniz, arrived at the invention of the same tool but for entirely different purposes. This too is part of the magic of mathematics.

You can download the article's code from here.