Let's continue our Probability and Statistics journey.
Symmetric distribution, Skewness, and Kurtosis:
Symmetric distribution:
If the shape of the probability density function to the left of the mean is a mirror image of the shape of the probability density function to the right of the mean, then the distribution is called a symmetric distribution.
In terms of the density, we can state it as below.
The density satisfies f(μ - a) = f(μ + a), where μ is the mean and a is any offset. If this property holds for every a, the distribution is called a symmetric distribution.
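As a quick numeric check, here is a minimal sketch (assuming a Gaussian with mean 2 and standard deviation 7, values reused later in this post) that verifies f(μ - a) = f(μ + a) at a few offsets:
import numpy as np
from scipy.stats import norm

mu, sigma = 2, 7                      # assumed example parameters
a = np.array([1.0, 3.0, 5.0, 10.0])   # a few offsets from the mean

left = norm.pdf(mu - a, loc=mu, scale=sigma)   # f(mu - a)
right = norm.pdf(mu + a, loc=mu, scale=sigma)  # f(mu + a)
print(np.allclose(left, right))  # True: the Gaussian pdf is symmetric about its mean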
Skewness:
A probability density function can exhibit two types of skewness:
- Left/Negatively skewed
- Right/Positively skewed
We can measure the amount of skewness by using the following formula:
skewness = [ (1/n) Σ (xi - x̄)³ ] / [ (1/n) Σ (xi - x̄)² ]^(3/2)
where x̄ is the mean of the sample, n is the sample size, and xi is an individual value. If this value is positive, the distribution is right-skewed; if it is negative, it is left-skewed. For a perfect Gaussian distribution, the skewness is 0.
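Here is a small sketch (assuming a Gaussian sample, for which skewness should be near 0) that evaluates the formula above and cross-checks it against scipy.stats.skew, which computes the same biased estimator by default:
import numpy as np
from scipy.stats import skew

x = np.random.normal(2, 7, 100000)  # sample whose skewness we measure
xbar = np.mean(x)
m2 = np.mean((x - xbar) ** 2)       # second central moment
m3 = np.mean((x - xbar) ** 3)       # third central moment
print(m3 / m2 ** 1.5)               # skewness from the formula above
print(skew(x))                      # scipy's value; should match, close to 0 here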
Left/Negatively skewed:
Here the probability density function has a long tail towards the left side, i.e., the negative X-axis. This indicates that a few values are very small and occur with low frequency. Let's understand this visually.
From the plot, we can see the long tail on the left side, caused by those few small, rarely occurring values.
Right/Positively skewed:
Here the probability density function has a long tail towards the right side, i.e., the positive X-axis. This indicates that a few values are very large and occur with low frequency. Let's understand this visually.
From the plot, we can see the long tail on the right side, caused by those few large, rarely occurring values.
Kurtosis:
Kurtosis is a measure of the tailedness of a distribution. If you know the kurtosis value of a probability density function, you can get a good idea of its shape. This is mainly useful for finding whether the given data has outliers or not.
Kurtosis is usually reported as excess kurtosis. For a Gaussian distributed variable, the kurtosis value is 3, and the excess kurtosis indicates how much the kurtosis of a given pdf differs from that of a Gaussian. The excess kurtosis is measured by the formula below:
excess kurtosis = [ (1/n) Σ (xi - x̄)⁴ ] / [ (1/n) Σ (xi - x̄)² ]² - 3
where n is the size of the sample, x̄ is the mean of the sample, and xi is an individual value.
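As with skewness, we can sketch the computation (assuming a Gaussian sample, whose excess kurtosis should be near 0) and cross-check it against scipy.stats.kurtosis, which returns excess kurtosis by default:
import numpy as np
from scipy.stats import kurtosis

x = np.random.normal(2, 7, 100000)  # sample whose tailedness we measure
xbar = np.mean(x)
m2 = np.mean((x - xbar) ** 2)       # second central moment
m4 = np.mean((x - xbar) ** 4)       # fourth central moment
print(m4 / m2 ** 2 - 3)             # excess kurtosis from the formula above
print(kurtosis(x))                  # scipy returns excess kurtosis by default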
Now let's see this visually for different excess kurtosis values.
image reference: https://upload.wikimedia.org/wikipedia/commons/thumb/3/33/Standard_symmetric_pdfs.svg/800px-Standard_symmetric_pdfs.svg.png
Here we can see that when the excess kurtosis is 3, the tails decay very slowly. When the excess kurtosis is 0, the pdf looks like a Gaussian, whereas for a negative excess kurtosis the tails fall off very sharply. From this, we can conclude that the larger the excess kurtosis, the more slowly the tails decay.
Outliers:
Outliers are data values that do not follow the general pattern of the data. Geometrically, they lie far away from most of the data because they are either much smaller or much larger than the typical values. Let's try to understand this with an example.
Let X be a variable containing the heights of people, with a sample size of 10:
X=[132cm,134cm,165cm,124cm,145cm,170cm,350cm,156cm,3cm,99cm]
In the above data, the two points 350cm and 3cm are clearly abnormal for human heights; they may have arisen for many reasons, such as incorrect data entry or a system malfunction. Such data points are known as outliers: they are either much larger than all the other data, such as 350cm, or much smaller, such as 3cm.
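One rough way to flag such points is the z-score; below is a sketch (the 1.5-sigma cut-off here is an arbitrary illustrative threshold, not a standard rule):
import numpy as np

heights = np.array([132, 134, 165, 124, 145, 170, 350, 156, 3, 99])
z = (heights - heights.mean()) / heights.std()  # z-score of each point
print(np.round(z, 2))
print(heights[np.abs(z) > 1.5])  # flags 350 and 3, the two abnormal points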
Let's try to see a graphical example.
When we represent the same data graphically, as in the above image, we can see that the two points 350 and 3 lie far away from the rest of the data; hence such points are called outliers.
If we represent them in the form of a pdf, we can see that the pdf has a long tail, as below, which means a high excess kurtosis. That is how kurtosis helps us understand whether the data has outliers.
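A quick sketch of that idea: the excess kurtosis of the heights sample drops sharply once the two outliers are removed (the exact numbers depend on the data, but the direction of the change is the point):
import numpy as np
from scipy.stats import kurtosis

heights = np.array([132, 134, 165, 124, 145, 170, 350, 156, 3, 99])
clean = np.array([132, 134, 165, 124, 145, 170, 156, 99])  # outliers removed
print(kurtosis(heights))  # clearly positive: heavy tails from 350 and 3
print(kurtosis(clean))    # much lower once the heavy tails are gone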
Sampling Distribution and Central Limit Theorem:
Let X be a random variable that follows some distribution, not necessarily Gaussian, and let X represent the population. To estimate the mean and standard deviation of the population distribution X, we employ the CLT (central limit theorem).
Sampling Distribution:
In the CLT, the first step is to build a sampling distribution. We pick m samples, each of size n, where n >= 30; m is typically chosen to be large, some power of 10.
In the second step, we calculate the mean of each of the m samples individually. If we treat those m means as a distribution, we obtain the sampling distribution of sample means.
The central limit theorem states that, as the sample size n tends to infinity, the sampling distribution of sample means approaches a Gaussian distribution; its mean equals the population mean, and its standard deviation equals the population standard deviation divided by √n (equivalently, its variance is σ²/n). In practice, if n is 30 or more, the sampling distribution of sample means is already well approximated by a Gaussian.
Let's now see this in code.
import numpy as np
import seaborn as sns

g = np.random.normal(2, 7, 100000)  # population: mean 2, standard deviation 7
l = []
for i in range(1000):  # 1000 samples, each of size 40
    choices = np.random.choice(g, size=40, replace=True)
    l.append(np.mean(choices))
print(np.mean(l))  # close to the population mean, 2
print(np.var(l))   # close to sigma^2 / n
print(49 / 40)     # sigma^2 / n = 7^2 / 40 = 1.225
sns.histplot(l, kde=True)  # plot the sampling distribution (distplot is deprecated)
Here in the above code, we are simply demonstrating the central limit theorem.
- In the first line after the imports, we generate Gaussian distributed data with a mean of 2 and a standard deviation of 7. These 100000 data points act as the population.
- In the next line, we create an empty list to store the sampling distribution of sample means.
- In the loop, we generate 1000 samples, each of size 40, calculate the mean of each sample, and store it in "l".
- Then we print the mean and variance of the sampling distribution of sample means.
- We print the value 49/40 (7²/40, i.e., σ²/n) to compare against that variance.
- Finally, we plot the sampling distribution of sample means.
Output for the code:
Q-Q (Quantile-Quantile) Plot:
This is used to test whether given data follows a particular distribution or not. Suppose X is a continuous random variable; to know whether it follows a Gaussian distribution or not, we can use a Q-Q plot.
1. First, we arrange the data in ascending order and calculate the percentile values from 1 to 100.
2. Next, we create another continuous random variable that follows the Gaussian distribution, sort it as well, and calculate its percentile values.
3. Finally, we draw the Q-Q plot. In simple words, we plot both sets of percentile values against each other: the percentiles of the continuous random variable X whose distribution we want to determine (Gaussian or not), and the percentiles of the reference variable Y, a standard normal (Gaussian) variable. If the points fall roughly on a straight line, X follows the reference distribution.
Let's see the code part and visual representation of the Q-Q plot
import numpy as np
import scipy.stats as stats
import pylab

g = np.random.normal(2, 7, 500)
stats.probplot(g, dist="norm", plot=pylab)
pylab.show()
Here we generate random Gaussian distributed data with a mean of 2 and a standard deviation of 7, and then we draw the Q-Q plot (probplot) to see whether the data is Gaussian distributed or not.
The output is given below.
Here we can see that the percentile values of both distributions almost overlap (the points lie along the straight line), which means the given data follows the Gaussian distribution.
Now let's see the failing case.
import numpy as np
import scipy.stats as stats
import pylab

f = np.arange(0.1, 100, 0.2)  # uniformly spaced data: clearly not Gaussian
stats.probplot(f, dist="norm", plot=pylab)
pylab.show()
Let's see the output
Here we can see that the percentile values of the two distributions do not overlap well, meaning the given data does not follow a Gaussian distribution.
We can also draw these Q-Q plots manually, as below.
import numpy as np
import matplotlib.pyplot as plt

g = np.random.normal(2, 7, 10000)    # data whose distribution we want to test
l = []
for i in range(1, 101):
    l.append(np.percentile(g, i))    # percentiles of the data

g2 = np.random.normal(0, 1, 10000)   # standard normal reference
l2 = []
for i in range(1, 101):
    l2.append(np.percentile(g2, i))  # percentiles of the reference

plt.scatter(l, l2)
plt.show()
The output for the above code is:
Chebyshev's Inequality:
In simple words, it states that if we have a random variable X for which we only know the mean μ and the standard deviation σ, and we do not know the distribution, then
P(|X - μ| < k·σ) >= 1 - 1/k², for any k > 1.
This is similar to the 68-95-99.7 rule of the Gaussian distribution, but the difference is that here we do not know the distribution; the bound holds for any distribution. Let's take a code example and understand it.
import numpy as np

f = np.arange(0.1, 100, 0.2)
m = np.mean(f)
s = np.std(f)
print("1.5 sigma " + str(m - 1.5 * s) + " " + str(m + 1.5 * s) + " >= " + str(1 - (1 / (1.5 ** 2))))
print("2 sigma " + str(m - 2 * s) + " " + str(m + 2 * s) + " >= " + str(1 - (1 / (2 ** 2))))
print("3 sigma " + str(m - 3 * s) + " " + str(m + 3 * s) + " >= " + str(1 - (1 / (3 ** 2))))
The output for the above code is:
Uniform Distribution:
This distribution exists for both discrete and continuous random variables. For a discrete random variable it is described by a probability mass function, and for a continuous random variable by a probability density function.
Probability mass function:
The best example of this is rolling a die. There are six outcomes, and each outcome is equiprobable, so the probability is uniformly distributed. Because it describes a discrete random variable, this is a probability mass function.
The probability mass function looks like below.
We can see that the probability is uniformly distributed: for all six outcomes, the probability is the same.
The uniform distribution for a discrete random variable is denoted by U{a, b}, where a is the starting value and b is the ending value; each of the b - a + 1 outcomes has probability 1/(b - a + 1).
Probability density function:
We see this in the case of a continuous random variable. In a uniform distribution, even for a continuous random variable, the probability is spread evenly across the range. The parameters are the same as for the discrete case, but the probability works differently: the density is f(x) = 1/(b - a) for a <= x <= b, and 0 elsewhere. So let's try to see the probability density function graphically.
As in the above image, the probability is uniformly distributed over the range 0 to 6. That is, any value between 0 and 6 has the same probability as all the other values.
There are a few parameters that we need to know: the mean is (a + b)/2 and the variance is (b - a)²/12.
An application of the uniform distribution is as follows: suppose we have a list or a dataset with 45 values, and we want to pick a random sample of 3 data points such that every value in the list has the same chance of being chosen. Then we use the uniform distribution for the desired task; look at the code below.
import numpy as np

l = [1, 32, 4, 53, 64, 64, 34, 53, 5, 4, 54, 5, 45, 44, 64, 6, 56, 5, 5, 5, 65, 6, 56, 5, 5, 6, 5, 6, 45, 77, 5, 7, 77, 7, 55, 3, 44, 66, 73, 49, 99, 33, 64, 22, 85]
print(len(l))
p = 3 / len(l)  # probability of keeping each value
print(p)
ans = []
for i in l:
    a = np.random.uniform(0, 1)  # one draw from U(0, 1)
    if a <= p:
        ans.append(i)
print(ans)
OUTPUT:
Here we can see that the uniform probability of picking 3 values out of 45 is 3/45 ≈ 0.066. In the for loop, we generate, for every value in the list, a random number from the uniform distribution U(0, 1), and we add the corresponding value from the list "l" to "ans" if that random number is less than or equal to 0.066. If we observe carefully, the chance that a U(0, 1) draw is less than or equal to 0.066 is exactly 0.066, which is the probability each of the 45 values needs so that about 3 of them survive. So this lets us pick values with uniform probability.
Bernoulli Distribution:
Here we consider only two outcomes, just like a coin toss: one outcome (success) occurs with probability p, and the other (failure) with probability 1 - p. The probability mass function of the Bernoulli distribution is
P(X = 1) = p, P(X = 0) = 1 - p.
The mean of the distribution is p and the variance is p(1 - p).
Binomial distribution:
This is an extension of the Bernoulli distribution: the Bernoulli trial is performed n times independently, and we count the number of successes. The probability mass function can be expressed as
P(X = k) = C(n, k) · p^k · (1 - p)^(n - k), for k = 0, 1, ..., n.
A quick simulation sketch of both distributions is shown below.
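This is a minimal simulation sketch (assuming p = 0.5 and n = 10, illustrative values only), using the fact that a Bernoulli draw is just a binomial draw with n = 1:
import numpy as np

p, n = 0.5, 10  # assumed illustrative values

bern = np.random.binomial(1, p, size=10000)   # Bernoulli = binomial with n = 1
print(bern.mean(), bern.var())                # close to p and p*(1 - p)

binom = np.random.binomial(n, p, size=10000)  # number of successes in n trials
print(binom.mean(), binom.var())              # close to n*p and n*p*(1 - p)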
Log-normal Distribution:
A random variable X is said to be log-normally distributed if and only if log(X) is normally distributed.
The log-normal distribution is often found on the internet:
1. The length of comments on a discussion platform.
2. The length of user reviews for a product.
3. The user dwell time on online articles, etc.
The log-normal distribution has a broad tail. The tail indicates that some values are huge in magnitude but occur rarely, whereas the peak indicates values that occur frequently but are small in magnitude.
This is how a log-normal distribution looks.
We can check for log-normality by converting the values to a log scale and then comparing the resulting distribution with the Gaussian distribution using a Q-Q plot. If the log-scaled values follow a Gaussian distribution, then the original data is log-normally distributed; this follows directly from the definition. Let's see the code implementation.
import numpy as np
import seaborn as sns
import scipy.stats as stats
import pylab

s = np.random.lognormal(2, 1, 1000)  # creating a random log-normal sample
s = np.log(s)                        # applying the log transform using numpy
sns.histplot(s, kde=True)            # distribution plot (distplot is deprecated)
stats.probplot(s, dist="norm", plot=pylab)  # Q-Q plot against the normal
pylab.show()                         # plotting the Q-Q plot
Output:
From the above image, we can see that, despite some small deviations, the log-transformed data follows a Gaussian distribution, so the original data is log-normally distributed.
By knowing the distribution, we can extract many insights from the data that can help us in our business decisions, including the parameters of the fitted distribution (such as its mean and variance).
There are many other distributions that we could study, but at the current stage they are not important.
Box-Cox Transform:
We can convert non-Gaussian distributed data to Gaussian distributed data just by applying the Box-Cox transform. Note that not all data can be converted to a Gaussian distribution, but the Box-Cox transform works well for most data.
Why convert?
Converting data from some other distribution to a Gaussian distribution lets us use all the properties of the Gaussian distribution to draw insights that we could not draw from the original distribution.
Procedure:
1. When we apply the Box-Cox function to the data, it returns a transformation parameter called lambda (λ).
2. The data is then transformed as y = (x^λ - 1)/λ for λ ≠ 0, and y = log(x) for λ = 0; scipy's boxcox performs both steps at once, returning the transformed data together with the fitted λ.
Before the Box-Cox transform:
import numpy as np
import seaborn as sns

s = np.random.lognormal(2, 1, 1000)  # creating a random log-normal sample
sns.histplot(s, kde=True)            # plotting the distribution
Output:
import scipy.stats as stats
import pylab

stats.probplot(s, dist="norm", plot=pylab)  # plotting the Q-Q plot
pylab.show()
Output:
After the Box-Cox transform:
from scipy import stats

x, lam = stats.boxcox(s)  # transformed data and the fitted lambda
sns.histplot(x, kde=True)                   # plotting the distribution
stats.probplot(x, dist="norm", plot=pylab)  # plotting the Q-Q plot
pylab.show()
Output:
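As a sanity check (a sketch beyond the original code), we can apply the Box-Cox formula from the procedure above manually, using the λ that scipy returns, and confirm it reproduces scipy's transformed data:
import numpy as np
from scipy import stats

s = np.random.lognormal(2, 1, 1000)
x, lam = stats.boxcox(s)        # scipy fits lambda and transforms in one call
manual = (s ** lam - 1) / lam   # Box-Cox formula for lambda != 0
print(lam)
print(np.allclose(x, manual))   # True whenever the fitted lambda is nonzero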
You can download the code ipynb file from here.
The remaining topics will be discussed in the next blog.