Exploratory Data Analysis in Data Science

Tuesday, August 09, 2016 Data Science 5 Comments

Get Started with Exploratory Data Analysis (EDA) using Python & R

1.Understand the defined Business Objective

2.Research the domain or consult a subject matter expert (SME)

3.Collect the metadata of the given data with the help of an SME, or explore other research avenues

4.Collect the data for the variables that are relevant to the project, based on domain expertise

5.Perform data cleansing & wrangling to bring the data into a structured form

Dummy variable creation

a.If a categorical factor has exactly two levels, create a single dummy variable in binary format (1 or 0)
b.If a factor has more than two levels, create one dummy column for each level
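
The two cases above can be sketched in plain Python. This is only an illustration: the `smoker` and `city` columns and the `one_hot` helper are made-up names, not part of the original post. In practice a library routine such as `pandas.get_dummies` is commonly used for the same job.

```python
def one_hot(values):
    """Return {level: [0/1, ...]} with one dummy column per distinct level."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}

# Hypothetical two-level factor: a single 0/1 column is enough.
smoker = ["yes", "no", "yes"]
smoker_dummy = [1 if v == "yes" else 0 for v in smoker]

# Hypothetical factor with more than two levels: one column per level.
city = ["Pune", "Delhi", "Pune", "Mumbai"]
city_dummies = one_hot(city)

print(smoker_dummy)
for level, column in city_dummies.items():
    print(level, column)
```

For regression-type models, one of the level columns is often dropped to avoid perfectly collinear predictors.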

Imputation for missing data

There are several ways to handle missing (N/A) values, either by deleting them or by imputing a replacement
a.Listwise deletion (Complete Case Analysis)-Deletes the whole row if any N/A is found in it
b.Pairwise deletion (Available Case Analysis)-Skips only the missing cell, so the row is still used in every analysis that does not need that value
c.Mean imputation-Replaces the N/A value with the mean of the variable
d.Mode imputation-Replaces the N/A value with the mode of the variable
e.Hot deck imputation-Replaces the N/A value with the observed value from a similar row (a donor)
f.Regression imputation-Treats the variable with the N/A as an output and replaces the N/A with the value predicted from the other variables
g.KNN imputation-Calculates the distance between data points and replaces the N/A with a value taken from the nearest neighbours
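
A minimal sketch of listwise deletion, mean imputation and mode imputation, using only the standard library. The `rows` records are invented for illustration, and `None` stands in for N/A. With pandas, `DataFrame.dropna` and `DataFrame.fillna` cover the same ground.

```python
from statistics import mean, mode

# Hypothetical records; None marks a missing (N/A) value.
rows = [
    {"age": 25, "city": "Pune"},
    {"age": None, "city": "Delhi"},
    {"age": 35, "city": None},
    {"age": 30, "city": "Pune"},
]

# Listwise deletion: keep only the rows that have no missing value.
complete_rows = [r for r in rows if all(v is not None for v in r.values())]

# Mean imputation for the numeric variable, mode imputation for the categorical one.
age_mean = mean(r["age"] for r in rows if r["age"] is not None)
city_mode = mode(r["city"] for r in rows if r["city"] is not None)
imputed = [
    {
        "age": r["age"] if r["age"] is not None else age_mean,
        "city": r["city"] if r["city"] is not None else city_mode,
    }
    for r in rows
]

print(len(complete_rows), "complete rows")
print(imputed)
```

Listwise deletion shrinks the data set, while mean and mode imputation keep every row but reduce the variability of the imputed variable.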

6.Find out the data types (Continuous, Discrete, Nominal, Ordinal, Interval, Ratio)

7.Find the probability of the events of interest
Probability = No. of events of interest / Total no. of events (when all events are equally likely)
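
As a small worked example of the ratio above, assuming a fair six-sided die (an example chosen here, not taken from the post):

```python
from fractions import Fraction

# All equally likely outcomes of one roll of a fair die.
outcomes = [1, 2, 3, 4, 5, 6]

# Event of interest: the roll is an even number.
favourable = [o for o in outcomes if o % 2 == 0]

p_even = Fraction(len(favourable), len(outcomes))
print(p_even)  # 1/2
```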

8.Find which probability distribution the data belongs to

A probability distribution is plotted with the values of the random variable on the X-axis & the probabilities (or probability densities) associated with those values on the Y-axis
a.Continuous Probability Distribution
b.Discrete Probability distribution

9.Find whether the data follows a normal distribution
a.Symmetrical about the mean
b.Bell shaped curve
c.Mean = Median = Mode, area under the curve = 1

10.If data is not following normal distribution, then transform the data.

11.Various types of transformations:

a.Log transformation.
b.1/log
c.square
d.Square root
e.1/ square root
f.Exponential
g.Cube
h.Cube root
i.1/ cube root
j.Box-Cox transformation
k.Johnson transformation, and many more transformations
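
A few of the transformations listed above, sketched with the standard `math` module. The sample values and the fixed Box-Cox parameter `lam = 0.5` are assumptions for illustration. In practice the parameter is usually estimated from the data, for example by `scipy.stats.boxcox`.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of one positive value for a given lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

# Hypothetical right-skewed, strictly positive data.
data = [1.0, 10.0, 100.0, 1000.0]

log_t = [math.log(x) for x in data]             # log transformation
sqrt_t = [math.sqrt(x) for x in data]           # square root
recip_sqrt_t = [1 / math.sqrt(x) for x in data] # 1 / square root
cube_root_t = [x ** (1 / 3) for x in data]      # cube root
box_cox_t = [box_cox(x, 0.5) for x in data]     # Box-Cox with lambda = 0.5

print(log_t)
print(box_cox_t)
```

Log, square root and Box-Cox need positive (or at least non-negative) inputs, so data containing zeros or negative values has to be shifted first or handled with a different transformation.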

12.If the data still does not follow a normal distribution despite transformation, then perform analysis suited to non-normal data

13.Standard normal distribution (Z Distribution)
Z = (X - µ) / σ
Mean = 0, Standard Deviation = 1
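
Standardising a variable with Z = (X - µ) / σ can be sketched as follows. The sample values are made up, and the population standard deviation (`pstdev`) is used as an assumption. Use `stdev` instead if the data is a sample.

```python
from statistics import mean, pstdev

# Hypothetical observations of one variable.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mu = mean(data)       # 5
sigma = pstdev(data)  # 2.0

# Z-score of every observation.
z_scores = [(x - mu) / sigma for x in data]
print(z_scores)

# After standardising, the mean is 0 and the standard deviation is 1.
print(mean(z_scores), pstdev(z_scores))
```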

14.Measures of Central Tendency (or) 1st Moment Business Decision
a.Mean
Average of the particular variable (ΣXi / n)
b.Median
Middle value of the sorted data
c.Mode
Most frequently occurring value
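
The three measures can be computed with Python's built-in `statistics` module. The sample values below are invented for illustration.

```python
from statistics import mean, median, mode

# Hypothetical observations.
data = [2, 3, 3, 5, 7, 10]

data_mean = mean(data)      # sum of the values / number of values
data_median = median(data)  # average of the two middle values here (n is even)
data_mode = mode(data)      # most frequent value

print(data_mean, data_median, data_mode)  # 5 4.0 3
```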

15.Measures of Dispersion (or) 2nd Moment Business Decision
a.Variance
Var(X) = E[(X - µ)^2]
Average squared distance of each point from the mean, so the units get squared
b.Standard Deviation
Square root of the variance, which brings the units back to those of the data (σ = √Var(X))
c.Range: Max(Xi) - Min(Xi)
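
A short sketch of the three dispersion measures on made-up data. `pvariance` and `pstdev` treat the data as a whole population, matching Var(X) = E[(X - µ)^2]. `variance` and `stdev` apply the n - 1 sample correction instead.

```python
from statistics import pstdev, pvariance, variance

# Hypothetical observations.
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_variance = pvariance(data)      # mean of squared deviations from the mean
pop_std = pstdev(data)              # square root of the population variance
sample_variance = variance(data)    # divides by n - 1 instead of n
data_range = max(data) - min(data)  # Max(Xi) - Min(Xi)

print(pop_variance, pop_std, data_range)  # 4 2.0 7
```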

16.Measures of Skewness (or) 3rd Moment Business decision
a.Positive Skewed (or) Right Skewed: the longer tail is on the right
b.Negative Skewed (or) Left Skewed: the longer tail is on the left
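
One common way to put a number on skewness is the moment coefficient, the third central moment divided by the 1.5th power of the second central moment. The helper name `skewness` and the sample data are assumptions for this sketch. Statistical packages may apply an additional sample-size correction.

```python
def skewness(xs):
    """Moment coefficient of skewness: m3 / m2 ** 1.5 (population form)."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 10]       # long tail on the right
left_skewed = [-x for x in right_skewed]    # mirror image, long tail on the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed))  # positive
print(skewness(left_skewed))   # negative
print(skewness(symmetric))     # 0.0
```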

17.Measures of Kurtosis (or) 4th Moment Business Decision
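a.Positive (excess) Kurtosis: heavier tails than the normal distribution, more prone to outliers
b.Negative (excess) Kurtosis: lighter tails than the normal distribution, less prone to outliers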
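
Excess kurtosis can be sketched as the fourth central moment divided by the squared second central moment, minus 3 (the value for a normal distribution). The helper name `excess_kurtosis` and both data sets are made up for illustration.

```python
def excess_kurtosis(xs):
    """Population excess kurtosis: m4 / m2 ** 2 - 3."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3

# Two extreme values far from the rest give heavy tails.
heavy_tailed = [-10, -1, -1, 0, 0, 0, 0, 1, 1, 10]
# Evenly spread values with no extremes give light tails.
light_tailed = [1, 2, 3, 4, 5]

print(excess_kurtosis(heavy_tailed))  # positive
print(excess_kurtosis(light_tailed))  # negative
```

The two extreme values in `heavy_tailed` account for almost all of the fourth moment, which is why kurtosis is read as a measure of tails and outliers.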

18.Graphical Representation
a.Histogram
Shows the shape of the distribution of a continuous variable, e.g. whether it looks normal or skewed
b.Boxplot
Shows the outliers, median, Q1, Q3.
c.Bar plot
Shows the count or value for each category of a categorical variable
d.Stem and leaf plot
A Stem and Leaf Plot is a special table where each data value is split into a “stem” (the first digit or digits) and a “leaf” (usually the last digit)
E.g.: 32 -> Stem ‘3’, Leaf ‘2’
e.Dot plot
Shows each observation as a dot, so the shape of the distribution and its skewness are visible for small data sets
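
The stem and leaf split described above can be sketched in a few lines. This assumes non-negative whole numbers, with the last digit as the leaf. The helper name `stem_and_leaf` and the data are invented for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative integers by stem (all but the last digit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 32 -> stem 3, leaf 2
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical observations.
data = [32, 35, 41, 47, 47, 53, 38]

for stem, leaves in stem_and_leaf(data).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Each printed row is one stem followed by its leaves in ascending order, so the row lengths give a rough picture of the distribution.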


5 comments:

BigBendRegion, August 09, 2016 4:34 pm
Kurtosis tells you nothing about the "peak" of the distribution. It measures tails (outliers) only. See http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4321753/