
Exploratory Data Analysis in ML



WHAT IS EXPLORATORY DATA ANALYSIS?

Exploratory data analysis (EDA) is the process of using statistical and probability tools to draw conclusions and insights from data. We use modules such as NumPy and pandas to access and modify the data, and visualization tools such as Matplotlib and seaborn to understand it in the context of the assigned task, so that we can produce the best output within the domain.

IMPORT SECTION
Here we import the necessary libraries for our exploratory data analysis



READING A DATASET:

This is a classification dataset taken from the Kaggle website.
The link for downloading the dataset is here.



The next important task is to understand the dataset. Knowing its shape and columns is part of that understanding.
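The loading and shape check can be sketched roughly as follows. Only the column names come from this article; the rows are made up for illustration, and the category spellings (F/M, HIGH/NORMAL, the drug names) are assumptions. With the real Kaggle file you would pass its path to pd.read_csv instead of the in-memory text.

```python
import io
import pandas as pd

# Hypothetical sample rows, NOT taken from the Kaggle file.
csv_text = """Age,Sex,BP,Cholesterol,Na_to_K,Drug
23,F,HIGH,HIGH,25.3,drugY
47,M,LOW,HIGH,13.1,drugC
28,F,NORMAL,HIGH,7.8,drugX
61,M,HIGH,NORMAL,9.4,drugB
"""

# For the real dataset: df = pd.read_csv("path/to/downloaded.csv")
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # column names
print(df.dtypes)            # which columns are numeric and which are text
```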



conclusions:

1) From this, we can see that the 'Drug' column is the output variable for our dataset and the 5 other features are the independent variables.
2) Most of the variables here are categorical, so we need to know how many different categories are present in each feature in order to encode them.
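Counting the categories in each feature might look like the sketch below. The data frame is a small hypothetical sample (the category spellings are assumptions), so the counts it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical sample of the categorical features.
df = pd.DataFrame({
    "Sex": ["F", "M", "F", "M", "F", "M"],
    "BP": ["HIGH", "LOW", "NORMAL", "HIGH", "LOW", "NORMAL"],
    "Cholesterol": ["HIGH", "HIGH", "NORMAL", "NORMAL", "HIGH", "NORMAL"],
})

# Number of distinct categories per feature.
n_categories = {col: df[col].nunique() for col in df.columns}
print(n_categories)

# How many rows fall into each category (also shows whether a feature is balanced).
for col in df.columns:
    print(df[col].value_counts())
```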


conclusions:

1) Here we can see that the Sex column has two categories. Since Sex is a nominal categorical feature, we can perform label encoding on it.
2) The given data is balanced for each feature.

NOTE:

Nominal Data:
In statistics, nominal data (also known as nominal scale) is a type of data that is used to label variables without providing any quantitative value. It is the simplest scale of measurement. Unlike ordinal data, nominal data cannot be ordered and cannot be measured.

1) For the BP column I am performing one-hot encoding. Since it has 3 categories, a single encoded column would make the bar plots for univariate analysis hard to interpret.

2) For the Cholesterol column we are performing label encoding.



NOTE:
Label encoding is not available in the libraries imported so far, so we import it separately.
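The article does not show which encoder was imported, so the sketch below uses pandas only as a stand-in: category codes for label encoding and pd.get_dummies for one-hot encoding. The sample rows and category spellings are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "Sex": ["F", "M", "M", "F"],
    "BP": ["HIGH", "LOW", "NORMAL", "HIGH"],
    "Cholesterol": ["HIGH", "NORMAL", "HIGH", "NORMAL"],
})

encoded = df.copy()

# Label encoding: categories are sorted alphabetically, then numbered 0, 1, ...
for col in ["Sex", "Cholesterol"]:
    encoded[col] = encoded[col].astype("category").cat.codes

# One-hot encoding: one 0/1 column per BP category (BP_HIGH, BP_LOW, BP_NORMAL).
encoded = pd.get_dummies(encoded, columns=["BP"], dtype=int)
print(encoded)
```

With alphabetical ordering, "F" becomes 0 and "M" becomes 1 in this sample, which is why it is worth checking the mapping before reading any plot drawn from the encoded columns.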

UNIVARIATE ANALYSIS:





Here we are considering the distplot (distribution plot) of the Age column for each drug type.
The x-axis shows Age and the y-axis shows the density for each drug.

conclusions:
main conclusion:
We can see that the Age column on its own cannot distinguish the drug type, because the distplots drawn using Age have many overlaps between drugs.

sub conclusions:
1) People in the age group 60 to 75 are mostly given the type2 drug.
2) People in the age group 20 to 50 are mostly given the type1 drug.
3) The most frequently given drug is type2.

note: In this way, one can draw many conclusions from the given data, but please stick to the required task.
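The overlap read off the distplot can also be checked numerically. The sketch below is one possible way to do it, using hypothetical ages and integer drug labels: it summarizes Age per drug type and tests whether two age ranges intersect. The helper ranges_overlap is introduced here for illustration only.

```python
import pandas as pd

# Hypothetical ages; drug types are integer labels as in the encoded data.
df = pd.DataFrame({
    "Age":  [22, 35, 48, 61, 68, 74, 30, 55, 70],
    "Drug": [1, 1, 1, 2, 2, 2, 0, 0, 0],
})

# Age range and median for each drug type.
summary = df.groupby("Drug")["Age"].agg(["min", "max", "median"])
print(summary)

def ranges_overlap(a, b):
    """Return True when the age ranges of drug types a and b intersect."""
    return bool(
        summary.loc[a, "min"] <= summary.loc[b, "max"]
        and summary.loc[b, "min"] <= summary.loc[a, "max"]
    )

print(ranges_overlap(1, 2))  # do type1 and type2 share any ages?
print(ranges_overlap(0, 1))  # do type0 and type1 share any ages?
```

A range check is much cruder than a density plot, since a single unusual age widens the range, so it is best treated as a quick sanity check on what the plot suggests.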



conclusions:

main conclusion:
We can see that the Na_to_K column on its own can distinguish the type0 drug, but we still cannot separate the remaining drugs, since the overlaps among the remaining drug types are quite large.

sub conclusions:
1) The densities of drug type1 and type2 are highest for people with a Na_to_K level near 10.

note: In this way, one can draw many conclusions from the given data, but please stick to the required task.


NOTE:

For discrete random variables we cannot draw a distribution (density) plot, so we use bar plots to do this job.
conclusion:

main conclusion:
1) We can see that the Sex feature is spread across every drug type, so it is not a useful feature for classification.

sub conclusions:
1) Drug type0 has been given to approximately 50% of women.
2) Similarly, drug type1 is given to approximately 61% of women (please check the other types as well).
3) Drug type2 is given more frequently to women, since approximately 63% of women received it.

NOTE:

In the sub conclusions above, I read the bars as percentages of women because of the following verification: Sex = 0 means women in the encoded data frame.
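A percentage of this kind can be computed directly rather than read off the bars. The sketch below assumes the encoding stated above (Sex = 0 for women) and uses hypothetical rows, so the numbers it prints are not the ones quoted in the sub conclusions.

```python
import pandas as pd

# Hypothetical encoded rows: Sex 0 = women, 1 = men; Drug is the integer drug type.
df = pd.DataFrame({
    "Sex":  [0, 0, 1, 1, 0, 1, 0, 0],
    "Drug": [0, 0, 0, 0, 1, 1, 2, 2],
})

# (Sex == 0) is True for women; the mean of a boolean column is a fraction.
pct_women = (df["Sex"] == 0).groupby(df["Drug"]).mean() * 100
print(pct_women)  # percentage of women among the people given each drug type
```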
conclusion:

main conclusion:
1) We can see that the Cholesterol feature is able to separate drug type3 from the others, since people with high cholesterol are not given drug type3.
sub conclusions:
1) Drug type0 has been given to approximately 49% of high-cholesterol people.
2) Similarly, drug type1 is given to approximately 47% of high-cholesterol people (please check the other types as well).
3) Drug type4 is given more frequently to people with high cholesterol, since approximately 63% of high-cholesterol people received it.

main conclusion:
1) People with normal BP are not given drug types 1, 2 and 3, so BP_NORMAL can separate drug types 0 and 4 from types 1, 2 and 3.

sub conclusions:
1) Drug type0 has been given to approximately 25% of normal-BP people.
2) Drug type4 is given more frequently to people with normal BP, since approximately 66% of normal-BP people received it.



main conclusion:

1) People with high BP are not given drug types 3 and 4, so BP_HIGH can separate drug types 3 and 4 from the rest.
2) Drug types 1 and 2 are only given to high-BP people, which helps in separating drug types 1 and 2 from the others.
sub conclusions:
  1) Drug type0 has been given to approximately 41% of high-BP people.


main conclusion:
1) People with low BP are not given drug types 1 and 2, so BP_LOW can separate drug types 1 and 2 from the rest.
2) Drug type3 is only given to low-BP people, which helps in separating drug type3 from the others.

sub conclusions:
1) Drug type0 has been given to approximately 32% of low-BP people (please check the other types as well).
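Statements of the form "people with this BP are never given that drug" can be checked with a contingency table instead of three separate bar plots. The sketch below uses pd.crosstab on a hypothetical sample that was arranged to mirror the BP conclusions above; it is not the real data.

```python
import pandas as pd

# Hypothetical sample arranged to mirror the BP conclusions above.
df = pd.DataFrame({
    "BP":   ["HIGH", "HIGH", "HIGH", "LOW", "LOW", "LOW", "NORMAL", "NORMAL"],
    "Drug": [0, 1, 2, 0, 3, 4, 0, 4],
})

# Rows: BP category, columns: drug type, cells: number of people.
table = pd.crosstab(df["BP"], df["Drug"])
print(table)

# A zero cell means that BP category never occurs together with that drug type.
never_given = {
    bp: [int(d) for d in table.columns if table.loc[bp, d] == 0]
    for bp in table.index
}
print(never_given)
```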

CONCLUSIONS(from all above plots):

The BP features (LOW, HIGH, NORMAL) are among the main features for classification. Cholesterol and Na_to_K are also useful features for classification.

Univariate analysis using PDF and CDF:

Here we consider the PDF and CDF of the Age column for different drug types. Since there are 5 drug types, it is hard to represent all of them at the same time, so let's see what type0 tells us when compared with types 1, 2, 3 and 4 in turn.
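The notebook's plotting code is not shown here, so the following is only a sketch of one common way to get an empirical PDF and CDF: bin the values with np.histogram, normalize the counts so they sum to 1, and take the cumulative sum. The ages are hypothetical and the choice of 5 bins is an assumption.

```python
import numpy as np

# Hypothetical ages of the people given one drug type.
ages = np.array([22, 25, 31, 38, 44, 47, 52, 58, 63, 71])

counts, edges = np.histogram(ages, bins=5)

pdf = counts / counts.sum()  # fraction of people falling in each age bin
cdf = np.cumsum(pdf)         # fraction of people at or below each bin's right edge

for right_edge, p, c in zip(edges[1:], pdf, cdf):
    print(f"age <= {right_edge:.1f}: pdf={p:.2f} cdf={c:.2f}")
```

Repeating this for a second drug type and drawing both CDFs against edges[1:] on the same axes gives the kind of pairwise comparison described above.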

conclusions:
main conclusion:
Age cannot clearly separate the type0 drug from any other type.

sub conclusions:
1) When type0 is plotted along with type1, we see that type1 is not given to people after the age of 50.
2) Whereas when type0 is plotted with type2, we can see that type2 is only given to people whose age is more than 54.
3) From the remaining plots not much info can be drawn.



Similarly to the above method, we now compare type1 with the remaining types.
conclusion:
main conclusion:
From the first comparison, we can see that there is a clear separation between type1 and type2 based on age.
sub conclusions:
Please draw your own sub conclusions, following the examples above.



Similarly, we now compare the type2 drug with both type3 and type4.

conclusions:
main conclusion:
Age cannot separate the type2 drug from type3, type4 or type0, but it can separate it from type1.

sub conclusions:
Draw your own conclusions, using the earlier sub conclusions as a guide.
NOTE:
Before drawing a conclusion, please check the earlier plots as well.

Example:
In the above plots we have compared type2 with type3 and type4 only, but we also have to check the earlier plots where type2 was compared with type1 and type0.


similarly, this is the plot for drug type4 compared with type5

conclusions:

main conclusion:
Age cannot clearly separate the type4 drug from any other drug type.

sub conclusions:
Draw your own conclusions, using the earlier sub conclusions as a guide.



conclusion:
Here we can see that the feature Na_to_K does a very good job of separating the type0 drug from all other drugs.



conclusion:
Here we can see that the feature Na_to_K can separate the type2 drug from type0 only, not from the other drugs.



conclusion:
Here we can see that the feature Na_to_K can separate the type3 drug from type0 only, not from the other drugs.

NOTE:
There is no use in plotting the PDF and CDF of other features like Sex, BP and Cholesterol, since they are categorical.
The main conclusions from all the above plots are:
1) The Age feature can only separate the type1 and type2 drugs, but fails in all the remaining cases.
2) Na_to_K can only separate the type0 drug from all other drugs, but fails in the remaining cases.



CHECKING OUTLIERS:
We can check for outliers using 2 methods: first the mean-median method, and second the box-plot method.

1)MEAN-MEDIAN METHOD:
In this method, we take the continuous random variables and calculate their mean and median. The mean is pulled towards extreme values while the median is not, so if both are approximately the same we take it as a sign that there are no significant outliers.



conclusion:
from this, we can say that there are no outliers in age

NOTE:
 mad: median absolute deviation, a measure of spread that plays the role of the standard deviation but is computed around the median. For further details, click here.
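The mean-median comparison and the MAD can be sketched with pandas as below. The helper name center_and_spread and both sample lists are hypothetical; the second list contains one deliberately extreme value to show how the mean reacts while the median and MAD do not.

```python
import pandas as pd

def center_and_spread(values):
    """Return the mean, median and median absolute deviation of the values."""
    s = pd.Series(values, dtype=float)
    median = s.median()
    # MAD: median of the absolute distances from the median.
    mad = (s - median).abs().median()
    return {"mean": s.mean(), "median": median, "mad": mad}

clean = center_and_spread([20, 30, 40, 50, 60])    # no extreme values
skewed = center_and_spread([20, 30, 40, 50, 600])  # one extreme value

print(clean)   # mean and median agree
print(skewed)  # the mean is pulled far from the median; the MAD is unchanged
```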



conclusion:
from this, we can say that there are no outliers in the Na_to_k column

2)BOXPLOT METHOD:
A box plot has 2 benefits here. Firstly, we can find whether there are outliers or not. Secondly, we can read off the 25th, 50th and 75th percentiles.
To learn more about box plots, please click here.
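What a box plot shows can also be computed by hand. The sketch below assumes the usual convention that whiskers extend 1.5 times the interquartile range beyond the quartiles and that points outside them are drawn as outliers; the values are hypothetical and include one deliberately extreme point.

```python
import pandas as pd

# Hypothetical Na_to_K-like values with one extreme point.
values = pd.Series([8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 60.0])

q1, q2, q3 = values.quantile([0.25, 0.50, 0.75])  # 25th, 50th, 75th percentiles
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, q2, q3)
print(outliers.tolist())
```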



conclusion:
From this, we can say that there are no outliers in either the Na_to_K or the Age column. We can also read off the percentile values, which are important in many other use cases.

BIVARIATE ANALYSIS:
We will do this using a pair plot. Since we cannot visualize more than 3 dimensions at once, we opt for the pair plot, which helps to draw insights similar to multivariate analysis.



For clarity please download pic from here

Conclusions:
1) Using the Age and Na_to_K columns, we can separate the type0 drug from all others with some errors.

2) Using the Age and BP_HIGH columns, we can separate drug type1 and type2 with approximately 15% errors.

3) Normal BP can separate drug type4 from the others with approximately 60-70% accuracy.

4) The Sex column, when compared with Na_to_K, can separate the type0 and type4 drugs with some error.

5) The Cholesterol column, when compared with Na_to_K, can separate the type0 and type4 drugs with some error.

6) BP_HIGH and Na_to_K can separate type0 from type1 and type2.

7) When cholesterol is high and BP is low, it helps in the separation of type3 from the others.
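An error percentage for a two-feature separation can be estimated by writing the rule down and counting its mistakes. The sketch below does this for Age and BP_HIGH on hypothetical rows. The age threshold of 50 is an assumption made for illustration, not a value fitted by the author, and the resulting error rate describes only this toy sample.

```python
import numpy as np
import pandas as pd

# Hypothetical encoded rows.
df = pd.DataFrame({
    "Age":     [25, 38, 47, 55, 52, 61, 70, 45, 66],
    "BP_HIGH": [1, 1, 1, 1, 1, 1, 1, 0, 0],
    "Drug":    [1, 1, 1, 1, 2, 2, 2, 4, 4],
})

AGE_THRESHOLD = 50  # hypothetical cut-off between type1 and type2

# Keep only high-BP people who received type1 or type2.
subset = df[(df["BP_HIGH"] == 1) & df["Drug"].isin([1, 2])]

# Rule: older than the threshold -> type2, otherwise -> type1.
predicted = np.where(subset["Age"] > AGE_THRESHOLD, 2, 1)
error_rate = (predicted != subset["Drug"].to_numpy()).mean()

print(f"{error_rate:.1%} of the {len(subset)} rows are misclassified")
```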


GRAPH USING PLOTLY:


FINAL CONCLUSION:

After 3D plotting using Plotly, we can conclude that BP_HIGH, Age and Na_to_K play important roles in classification.

Linear classification techniques can't be used, since the data is not easily separable.

PLEASE DOWNLOAD IPYTHON NOTEBOOK FROM HERE


     
