Author: Adam G. Dobrakowski
Editing: Zuzanna Kwiatkowska
There is a simple and efficient way to analyse (almost) any tabular dataset in less than 2 minutes. I will show you how to do it using only 2 Python tools: Jupyter Notebook and pandas-profiling.
No matter what our problem is, if the solution requires implementation of some machine learning model, we always have to start with data analysis. Let’s say we have a single table, where rows represent observations (for example people we analyse), and the columns represent their features (for example age, height, gender, etc.).
Based on my experience, I would argue that there are 8 things that you should always check in your analysis:
- Check the number of observations and features (i.e. the dimensionality of the data)
- Manually inspect a subset of observations from your table
- Determine the type of each feature
- Verify the distribution of each feature
- Check if there are outliers in data
- Check if there are any missing data (for example NULL values)
- Find correlations and dependencies between features
- Verify the quality and correctness of the data
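Without a dedicated tool, most of the checklist above translates into a handful of pandas calls. A minimal sketch on a made-up table (the column names are purely illustrative):

```python
import pandas as pd

# Toy table: rows are people (observations), columns are their features
df = pd.DataFrame({
    "age": [25, 31, None, 45],
    "height": [170.0, 182.5, 165.0, 158.0],
    "gender": ["F", "M", "F", "F"],
})

print(df.shape)                      # 1. number of observations and features
print(df.head())                     # 2. manual look at a subset of rows
print(df.dtypes)                     # 3. type of each feature
print(df.describe())                 # 4./5. distributions and candidate outliers
print(df.isnull().sum())             # 6. missing values per column
print(df[["age", "height"]].corr())  # 7. correlations between numeric features
```

Doing this by hand for every feature of every new dataset quickly becomes tedious, which is exactly the pain point the report solves.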
Sounds like a lot of work. The good news is that we can do all of that with a single command!
To do that we will need the Pandas profiling library in Python. I will show you how it works using the House Rent Prediction Dataset as an example.
After you download and unpack your data, you can use the following piece of code in your notebook to generate the report:
!pip install pandas_profiling
import pandas as pd
import pandas_profiling

df = pd.read_csv("House_Rent_Dataset.csv")
profile = pandas_profiling.ProfileReport(df)
profile  # in a notebook this renders the report; profile.to_file("report.html") saves it
And that’s all! That’s the only thing required to generate a complex report which addresses all of the points I mentioned. Let’s analyse it step by step.
In the first section, called Overview, you can find some basic information about your dataset:
You can see the number of features (variables), number of observations and types of features. All of this addresses points 1 and 3 from our checklist. Additionally, we can already see if there are duplicates amongst the observations, which is a first step to verify data quality and correctness.
The next section, Alerts, lists all the problems that are probably present in your data, such as, but not limited to: high correlation between features, skewness, uniformity, zero values or nulls, low variability (such as one value dominating a feature), or a large number of unique values within a feature.
In our example, we are shown 24 alerts:
You have to remember that alerts are just giving you tips on what to further verify manually or take a closer look at. They don’t necessarily mean there are actual problems in your data. Nevertheless, we can tick point 8 from our checklist.
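Many of these alerts can be cross-checked directly in pandas. A small sketch on a toy table (the data is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 2, 4], "b": [1, 0, 0, 1]})

# A "duplicates" alert corresponds to exact duplicate rows:
print(df.duplicated().sum())   # number of fully duplicated rows

# A "constant" or low-variability alert corresponds to few unique values:
print(df.nunique())            # distinct values per column
```

This is how you would verify manually whether an alert points at a real problem or is just an artefact of your data.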
Another section of the report gives us a closer look at the variables, separated into types such as numerical or categorical.
Let’s analyse the numerical features section using an example of `Size` feature. It represents the size of a house or apartment and is measured in square feet.
On the right, we can see the distribution of the feature on a plot, along with important parameters of this distribution such as its mean, minimum, and maximum, and the counts and percentages of specific values (for example zero, negative, missing, infinite, and distinct values).
When we open a section with details, there is even more information available for the analysis.
We could use the 5th and 95th percentiles as cut-offs for outliers (meaning that any value below the 5th percentile or above the 95th is treated as an outlier). This is not a bad approach, but the report also has a dedicated tool for outlier analysis called `Extreme values`. There you can find the 5 highest and 5 lowest values of each feature.
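The percentile cut-off is easy to reproduce by hand. A sketch on an invented `Size`-like series with one obvious extreme value:

```python
import pandas as pd

# Invented apartment sizes in square feet, with one extreme value
size = pd.Series([300, 550, 600, 650, 700, 750, 800, 850, 900, 8000])

# Treat anything below the 5th or above the 95th percentile as an outlier
low, high = size.quantile(0.05), size.quantile(0.95)
outliers = size[(size < low) | (size > high)]
print(outliers.tolist())
```

Note that with this rule roughly 10% of the data is always flagged, so it only marks candidates; the report's `Extreme values` tab plays the same role of pointing you at values worth a manual look.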
The analysis of categorical features is only slightly different from that of the numerical features shown above. An example of a categorical feature is `City`, indicating where exactly the house or apartment is located.
Again, in this section, we get all the information required to tick off numbers 4-6 from our checklist.
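For a categorical feature, the core of the report's summary is a frequency table, which in plain pandas is a one-liner. A sketch with invented city values:

```python
import pandas as pd

# Invented sample of a categorical feature like "City"
city = pd.Series(["Mumbai", "Delhi", "Mumbai", "Chennai", "Delhi", "Mumbai"])

print(city.value_counts())  # frequency of each category
print(city.nunique())       # number of distinct categories
```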
Interactions & Correlations
Another interesting piece of information we get from the report is the list of dependencies between features, in both visual and numerical form.
We can quickly generate a plot to see how 2 features interact with each other, for example the number of bathrooms with the size of the house:
Or if you prefer a numerical form, you can see the correlation coefficients here:
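The numerical form is just a correlation matrix, which you can also compute directly. A sketch on made-up data mimicking the two features above:

```python
import pandas as pd

# Invented data imitating two related features from the dataset
df = pd.DataFrame({
    "Size": [500, 750, 1000, 1250, 1500],
    "Bathroom": [1, 1, 2, 2, 3],
})

# Pearson correlation matrix: the numerical form of the dependency
print(df.corr())

# The visual form would be a scatter plot of one feature against another:
# df.plot.scatter(x="Size", y="Bathroom")
```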
Last but not least, we can also manually analyse some automatically chosen example rows. They are typically taken from the beginning and the end of your table. This allows us to tick off point no. 2 from our analysis checklist.
As you can see, we actually used a single Python command (apart from some obvious introductory calls like imports or data loading) to generate a pretty complex report and perform a detailed analysis of our data.
This shows how important it is to learn about new tools. Before learning pandas-profiling, I lost countless hours creating all of the histograms myself, calculating correlations, and verifying feature distributions.
If you like tips like this, I kindly invite you to follow my LinkedIn profile: Adam Dobrakowski, and stay tuned for more posts like this in the future!