# Statistics: Central Limit Theorem

By Brett Taylor | April 30, 2019

# Overview

The Central Limit Theorem states that when samples drawn from a population are large, the sampling distribution of the sample mean approximates a normal distribution, regardless of the shape of the population from which the samples were drawn. The simulation below demonstrates this by comparing the theoretical mean of the exponential distribution with the mean of the simulated sample means. The difference between the theoretical mean and the sample mean is about 0.03, which is consistent with the Central Limit Theorem.

# Analysis

## Simulation Methods

To ensure that this report is reproducible, we set the random number generator seed and define the overall parameters, including the number of simulations and the rate lambda. The sample size is set above 30 so that the normal approximation can be used.

set.seed(4957)
# Total number of simulated sample sets that will be run.
sim.count <- 1000
lambda <- 0.2       # Rate
sample.size <- 40   # Number of observations per sample

This simulation draws random values from the exponential distribution using the R function rexp(). The initial simulation generates 1000 observations with the rate set to the recommended $$\lambda = 0.2$$.

exp.sim <- rexp(sim.count,rate = lambda)
sim.mean <- mean(exp.sim)
sim.sd <- sd(exp.sim)

This simulation is displayed in Figure 1 in the Appendix. It shows how the density of the exponential distribution decays exponentially from x=0 toward $$x=\infty$$. The figure also marks the mean of the distribution and has an overlay of the exponential density function.
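The overlay in Figure 1 has to be rescaled from a density to histogram counts. The following is a minimal sketch of that scaling, assuming the same rate, number of observations, and bin width as this report. The `demo.*` and `expected.*` names are hypothetical and used only for illustration. It uses no random numbers, so it does not disturb the seeded results.

```r
# Hypothetical illustration values that mirror the parameters of this report
demo.lambda   <- 0.2    # rate of the exponential distribution
demo.n        <- 1000   # number of simulated observations
demo.binwidth <- 0.5    # histogram bin width

# Left edges and midpoints of the first ten bins
bin.left <- seq(0, 4.5, by = demo.binwidth)
bin.mid  <- bin.left + demo.binwidth / 2

# Exact expected count per bin: n * P(left <= X < left + binwidth)
expected.exact <- demo.n * (pexp(bin.left + demo.binwidth, rate = demo.lambda) -
                            pexp(bin.left, rate = demo.lambda))

# Approximation behind a density overlay: n * binwidth * f(midpoint)
expected.overlay <- demo.n * demo.binwidth * dexp(bin.mid, rate = demo.lambda)

print(round(cbind(bin.left, expected.exact, expected.overlay), 2))
```

The two columns agree closely, which is why multiplying the density by the number of observations times the bin width puts the curve on the same scale as the histogram.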

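By the Central Limit Theorem, the standard error of the sample mean is $$\sigma _{\overline{x}} =\frac{s}{\sqrt{N}}$$, where s is the sample standard deviation and N is the sample size. As a rule of thumb, the sample should be larger than 30.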
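To make the standard error formula concrete, here is a small sketch that evaluates it for a few sample sizes. It assumes the population standard deviation of the exponential distribution, 1/lambda, in place of the sample estimate s. The sample sizes other than 40 and the `demo.*` names are hypothetical.

```r
demo.lambda <- 0.2
demo.sigma  <- 1 / demo.lambda     # sd of an exponential distribution is 1/lambda
demo.sizes  <- c(10, 40, 160)      # illustrative sample sizes; 40 is the one used here

# Standard error of the mean for each sample size
demo.se <- demo.sigma / sqrt(demo.sizes)

print(data.frame(N = demo.sizes, standard.error = round(demo.se, 3)))
```

For N = 40 the standard error is about 0.791, and each fourfold increase in the sample size halves it.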

## How close is the Sample Mean to the Theoretical Mean?

The theoretical mean of the exponential distribution is:
$$\mu=\frac{1}{\lambda}=\frac{1}{0.2}=5$$
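
For readers who want to see where this value comes from, a short derivation follows. It assumes the standard exponential density $$f(x)=\lambda e^{-\lambda x}$$ for $$x \ge 0$$ and uses integration by parts:

$$\mu=\int_{0}^{\infty} x\,\lambda e^{-\lambda x}\,dx=\left[-x e^{-\lambda x}\right]_{0}^{\infty}+\int_{0}^{\infty} e^{-\lambda x}\,dx=0+\frac{1}{\lambda}=\frac{1}{0.2}=5$$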

### Sample Mean Simulation

There are several ways to generate the sample means. The method chosen here fills a matrix with random values from the rexp() function. With 1000 simulations and a sample size of 40, the matrix has dimensions 1000 x 40, holds 40000 observations, and stores one sample per row. The apply() function then iterates over the rows of the matrix and computes the mean of each one. The result is a vector of length 1000, which is assigned to the variable sample.means.

library(knitr)
# Generate a matrix with 1000 rows (one sample per row) and 40 columns
sample.matrix <- matrix(rexp(sim.count * sample.size, rate = lambda), nrow = sim.count)
sample.means <- apply(sample.matrix, 1, mean)  # MARGIN = 1 applies mean() to each row
sample.mean <- mean(sample.means)
summary(sample.means)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   2.815   4.460   4.989   5.026   5.544   7.821
kable(head(sample.matrix), digits = 2)
 2.5 4.35 25.02 0.49 5.41 6.71 9.99 1.06 5.91 16.74 1.84 6.46 17.07 0.47 1.63 0.29 0.51 3.27 1.7 2.72 1.12 2.33 9.17 7.58 18.97 11.21 6.11 5.78 6.64 1.01 4.99 0.48 8.1 6.41 44.25 4.8 10.09 9.59 5.56 6.33
 1.19 6.95 10.98 0.24 0.42 8.97 2.32 2.91 10.17 14.32 0.93 7.31 5.99 0.76 5.57 3.06 0.37 7.36 1.17 1.21 1.12 2.18 2.63 5.5 2.3 1.05 1.76 2.54 2.74 1.72 4.66 3.75 1.48 1.57 2.86 3.74 4.56 8.04 10.58 1.83
 4.27 1.04 8.08 3.49 3.01 0.66 12.92 5.98 4.07 11.37 1.2 12.79 2.12 1.24 6.54 11.14 0.32 5.42 0.34 8.42 14.14 6.05 5.32 1.96 3.48 2.7 5.12 1.57 2.11 1.9 1.15 1.86 6.47 1.79 3.02 2.31 2.99 0.81 0.94 2.49
 0.46 8.25 6.33 2.81 5.89 5.11 6.26 8.15 6.82 1.54 0.43 0.52 0.85 6.78 18.15 1.08 7.67 0.05 2.56 0.41 4.46 1.99 8.12 1.81 7.34 1.09 2.75 13.74 0.57 5.82 1.47 0.66 6.78 2.91 1.08 10.49 2.2 0.09 2.69 5.14
 0.58 1 0.49 15.75 3.16 0.28 2.72 0.64 3.75 0.37 1.1 2.63 2.46 4.95 1.6 0.8 0.8 1.95 1.22 0.66 10.36 0.46 1.93 3.44 0.27 0.24 3.61 1.35 6.39 0.09 6.45 4.81 0.9 4.21 7.23 5.03 0.69 3.87 3.62 6.9
 2.18 3.82 3.17 2.02 2.86 7.67 5.94 12.58 5.01 1.4 7.94 2.88 5.77 8.8 1.97 8.65 14.66 1.58 9.85 0.15 6.14 6.9 15.65 4.25 2.71 1.91 4.18 9.38 4.33 2.49 5.3 7.42 13.48 1.16 4.77 10.04 1.1 9.59 11.56 4.66

The distribution of the sample means is displayed in Figure 2 in the Appendix. The mean of all sample means is 5.026, compared with the theoretical mean of 5. The difference between the two is 0.026.
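
As a cross-check of the row-wise approach, the sketch below computes the sample means two ways: with apply() and with base R's rowMeans(). It is self-contained and uses its own arbitrary seed and hypothetical `demo.*` names, so its numbers are not expected to match the values reported above.

```r
set.seed(101)  # arbitrary seed chosen for this illustration only
demo.lambda <- 0.2
demo.sims   <- 1000
demo.size   <- 40

# One simulated sample per row
demo.matrix <- matrix(rexp(demo.sims * demo.size, rate = demo.lambda),
                      nrow = demo.sims)

means.apply    <- apply(demo.matrix, 1, mean)  # row-wise mean via apply()
means.rowmeans <- rowMeans(demo.matrix)        # same quantity via rowMeans()

# Both approaches should agree up to floating-point tolerance
print(all.equal(means.apply, means.rowmeans))

# Distance between the mean of the sample means and the theoretical mean
print(abs(mean(means.rowmeans) - 1 / demo.lambda))
```

Either function works here. rowMeans() is usually the shorter and faster choice, while apply() generalizes to statistics other than the mean.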

## What is the difference between the Sample Variance and the Theoretical Variance?

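The exponential distribution and the distribution of its sample means have different variances, so the two need to be kept apart.
The variance of a population is defined as $$\sigma^2 =\frac{\sum(x_{i}-\mu)^2}{N}$$ For the exponential distribution this is $$\sigma^2=\frac{1}{\lambda^2}=25$$. The theoretical variance of the mean of a sample of size N is given by the following formula:
$$Var(\overline{x})=\frac{\sigma^2}{N}=\frac{(\frac{1}{\lambda})^2}{N}=\frac{25}{40}=0.625$$

The sample variance is computed as $$s^2 =\frac{\sum(x_{i}-\overline{x})^2}{N-1}$$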

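sample.var <- apply(sample.matrix, 1, var)   # variance within each sample (row)
sample.mean.var <- mean(sample.var)          # average within-sample variance, estimates 1/lambda^2
sample.mean.sd <- sqrt(sample.mean.var)      # estimates the population standard deviation 1/lambda
# Estimated variance of the sample mean (this line and its variable name are added for clarity)
sample.var.of.mean <- sample.mean.var / sample.size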

Theoretical variance of the sample mean = 0.625
Sample estimate of that variance = 0.635

The theoretical variance of the sample mean is approximately equal to its sample estimate.
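
The comparison can also be sketched with two different estimates of the variance of the sample mean: the variance of the simulated means themselves, and the average within-sample variance divided by the sample size. This is an illustration under its own arbitrary seed with hypothetical `demo.*` names, not a reproduction of the figures reported above.

```r
set.seed(202)  # arbitrary seed chosen for this illustration only
demo.lambda <- 0.2
demo.sims   <- 1000
demo.size   <- 40

demo.matrix <- matrix(rexp(demo.sims * demo.size, rate = demo.lambda),
                      nrow = demo.sims)

# Theoretical variance of the mean of 40 exponential observations: 25 / 40 = 0.625
theoretical.var.of.mean <- (1 / demo.lambda)^2 / demo.size

# Estimate 1: variance of the simulated sample means
var.of.means <- var(rowMeans(demo.matrix))

# Estimate 2: average within-sample variance divided by the sample size
within.var.scaled <- mean(apply(demo.matrix, 1, var)) / demo.size

print(c(theoretical = theoretical.var.of.mean,
        direct      = var.of.means,
        scaled      = within.var.scaled))
```

Both estimates should land near 0.625, with the second one typically varying less from run to run because it pools all 40000 observations.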

## Is the sample mean simulation a normal Distribution?

Each sample uses 40 observations. This relies on the Central Limit Theorem, which states that with large samples (greater than 30) the sample means will be approximately normally distributed and their mean will approximate the population mean. Figure 2 shows a histogram of the sample means of the exponential distribution. A normal density curve is overlaid on the plot, and the histogram follows it closely. In addition, the sample standard deviation is 5.04, which is close to the theoretical standard deviation of the exponential distribution, $$\frac{1}{\lambda}=5$$. Dividing it by $$\sqrt{N}$$ gives a standard error of the mean of about 0.80, close to the theoretical value $$\frac{5}{\sqrt{40}}\approx 0.79$$.

The distribution of the sample means is approximately normal, judging by how closely the histogram follows the overlaid normal curve.
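
One rough numerical companion to the visual check is to count how many sample means fall within 1.96 standard errors of the theoretical mean. Under a normal distribution that share would be about 95%. The sketch below is a hedged illustration with its own arbitrary seed and hypothetical `demo.*` names. It is not part of the original analysis.

```r
set.seed(303)  # arbitrary seed chosen for this illustration only
demo.lambda <- 0.2
demo.sims   <- 1000
demo.size   <- 40

demo.means <- rowMeans(matrix(rexp(demo.sims * demo.size, rate = demo.lambda),
                              nrow = demo.sims))

demo.mu <- 1 / demo.lambda                # theoretical mean
demo.se <- demo.mu / sqrt(demo.size)      # sd of an exponential equals its mean

# Share of sample means inside the central 95% band of the normal approximation
inside   <- abs(demo.means - demo.mu) <= qnorm(0.975) * demo.se
coverage <- mean(inside)
print(coverage)
```

A share close to 0.95 supports the normal approximation. A small deviation is expected because the sample means retain some of the skew of the exponential distribution.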

# Appendix

## Figure 1 - Exponential Distribution

library(ggplot2)
theoretical.mean <- 1 / lambda  # theoretical mean of the exponential distribution
ggplot(data.frame(x = exp.sim), aes(x = x)) +
  geom_histogram(fill = "red", color = "black", binwidth = .5) +
  # Exponential density scaled to counts: n = number of observations * binwidth
  stat_function(fun = function(x, rate, n) {n * dexp(x = x, rate = rate)},
                args = list(rate = lambda, n = sim.count * .5),
                geom = "line", color = "green") +
  geom_vline(aes(xintercept = sim.mean), color = "blue", size = 1) +
  annotate("text", x = sim.mean, y = 50, vjust = -1, hjust = -1,
           label = sprintf("Sample mean = %03.2f", sim.mean),
           colour = "Dark Red", angle = 0) +
  annotate("text", x = theoretical.mean, y = 50, hjust = -1,
           label = sprintf("Theoretical mean = %03.2f", theoretical.mean),
           colour = "Dark Red", angle = 0) +
  annotate("text", x = 30, y = 40, parse = TRUE,
           label = "mu==frac(1,lambda)") +
  theme_bw()

## Figure 2 - Normal Distribution of Sample Means

sample.mean <- mean(sample.means)
ggplot(data.frame(x = sample.means), aes(x = x)) +
  geom_histogram(fill = "red", color = "black", binwidth = .1) +
  # Kernel density rescaled to counts (count * binwidth) so it shares the histogram's axis
  geom_density(aes(y = after_stat(count) * .1)) +
  # Normal density scaled to counts: n = number of sample means * binwidth
  stat_function(
    fun = function(x, mean, sd, n) {
      n * dnorm(x = x, mean = mean, sd = sd)},
    args = list(mean = sample.mean, sd = sd(sample.means),
                n = sim.count / 5 / 2)
  ) +
  geom_vline(aes(xintercept = sample.mean), color = "blue", size = 1) +
  annotate("text", x = sample.mean, y = ((sim.count / 10) / 4), vjust = -1,
           label = sprintf("Sample mean = %03.2f", sample.mean), colour = "white", angle = 270) +
  annotate("text", x = 7, y = 30, parse = TRUE,
           label = "f(x)==frac( 1, sqrt( 2 * pi)) * e ^ {-x ^ 2 / 2}") +
  theme_bw()

## Normality of the Sample Mean distribution

### Figure 3 - QQPlot of Sample Means

One way to check the normality of the sample means is to draw a normal Q-Q plot of them.

qqnorm(sample.means)
qqline(sample.means)

The results show that the quantiles of the sample means follow the line drawn by qqline(), which demonstrates that their distribution is approximately normal, as expected from the Central Limit Theorem.
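
The visual impression from a Q-Q plot can be summarized by the correlation between the theoretical and observed quantiles, where values near 1 indicate points lying close to a straight line. The sketch below is one possible way to do this, using qqnorm() with plot.it = FALSE to obtain the plotted coordinates. The helper `qq.cor`, the seed, and the `demo.*` names are hypothetical and not part of the original analysis.

```r
set.seed(404)  # arbitrary seed chosen for this illustration only
demo.lambda <- 0.2

# Raw exponential observations versus means of samples of size 40
demo.raw   <- rexp(1000, rate = demo.lambda)
demo.means <- rowMeans(matrix(rexp(1000 * 40, rate = demo.lambda), nrow = 1000))

# Hypothetical helper: correlation between theoretical and sample quantiles
qq.cor <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)  # returns the Q-Q coordinates without plotting
  cor(qq$x, qq$y)
}

print(c(raw = qq.cor(demo.raw), means = qq.cor(demo.means)))
```

The correlation for the sample means should be noticeably closer to 1 than the one for the raw exponential draws, which mirrors what the Q-Q plot shows.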