Although it takes the most time and effort, data preprocessing is also one of the most crucial phases in data science. Neglecting to clean and prepare the data can jeopardize the model.
Data scientists working with real-world data will always need to apply certain preprocessing techniques to make the data more usable. These techniques facilitate its use in machine learning (ML) algorithms, reduce complexity to prevent overfitting, and result in a better model.
With that said, let's briefly review what data preprocessing is, why it's crucial, and the key techniques to apply in this pivotal stage of data science. That is what this tutorial will cover.
What is Data Preprocessing?
Data preprocessing, that is, getting your dataset ready for use in a model, comes into play once you understand the nuances of your dataset and the main data issues uncovered during exploratory data analysis.
In a perfect world, your dataset would be flawless and free of any issues. Unfortunately, real-world data will always present problems that you need to address. Think about the data your own company holds. Can you spot any inconsistencies, such as typos, missing data, different scales, and so on? These situations happen all the time in the real world, and the data has to be adjusted to make it more useful and understandable.
We call this process, in which most of the issues in the data are resolved, the data preprocessing stage.
Why is Data Preprocessing Important?
Skipping the data preprocessing phase will affect your work later on, when you feed the dataset to a machine learning model. Most models cannot handle missing values, and some are affected by outliers, high dimensionality, and noisy data, so preprocessing makes the dataset more complete and accurate. This phase is essential for making the necessary changes to the data before it goes into your machine learning model.
Key Data Preprocessing Techniques
Now that we know more about the data preprocessing phase and why it matters, let's look at the main techniques to apply to the data to make it more usable for the next steps. We will explore the following techniques:
Data cleaning
Dimensionality reduction
Feature engineering
Data sampling
Data transformation
Handling inaccurate data
Data Cleaning
Data cleaning is the part of the preprocessing phase in which you find and fix erroneous and low-quality observations in your dataset, raising the overall quality of your data. The goal is to spot incomplete, inaccurate, duplicated, irrelevant, or null values in the data. Once you have identified these issues, you need to either modify or remove them; the approach you take depends on the goal of your project and the problem domain. Let's look at some typical issues we come across during data analysis and how to handle them.
Noisy data
Noisy data usually refers to meaningless records, duplicate observations, or values that make no sense in your dataset. Consider, for instance, an "age" column in your database that contains negative values. In this case the observation makes no sense, so you could remove it or replace it with a null value (we will cover how to handle nulls in the "Missing Data" section).
Another case is when you need to remove unwanted or irrelevant information. Say, for instance, that you need to predict whether a woman is pregnant. Attributes such as height, marital status, or hair color are not relevant to the model, so you do not need them.
An outlier can also be regarded as noise, depending on the anomaly, even though it may be a legitimate record. You will have to decide whether a given outlier counts as noisy data and whether you can remove it from your dataset.
Solution:
A typical technique for noisy data is the binning method: the values are first sorted, then split into "bins" (buckets of equal size), and each value is then replaced by the mean or median of its bin, effectively smoothing the data. There is a good paper on handling noisy data if you want more detail.
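As a rough sketch of that idea with pandas (the "age" column and the number of bins here are purely illustrative assumptions, not part of any particular dataset):

```python
import pandas as pd

# Hypothetical "age" column with a couple of rough values
df = pd.DataFrame({"age": [18, 19, 21, 22, 25, 26, 30, 31, 35, 80]})

# Sort, split into equal-frequency bins, then smooth each value with its bin mean
df = df.sort_values("age")
df["age_bin"] = pd.qcut(df["age"], q=3, labels=False)
df["age_smoothed"] = df.groupby("age_bin")["age"].transform("mean")
print(df)
```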
Missing Data
Missing data points are another common issue we face in real-world data. Most machine learning models cannot handle missing values, so you have to step in and adjust the data so the model can use it properly. Handling missing values is usually called imputation, and there are several ways to approach it:
Solution 1:
The easiest fix is simply dropping the observation. However, this is only advisable when:
The dataset is large and removing a few missing entries will not affect its distribution.
Most of the attributes of that observation are null, so the observation itself is not useful.
Solution 2:
Another way to close the gap is to use a global constant, such as "NA" or 0, but only when the missing value is hard to estimate. You can also fill the gap with the mean or median of that attribute.
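A quick pandas sketch of both options (the column names here are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"height": [1.70, np.nan, 1.65, 1.80],
                   "city": ["Lisbon", None, "Porto", "Lisbon"]})

df["city"] = df["city"].fillna("NA")                     # global constant
df["height"] = df["height"].fillna(df["height"].mean())  # mean of that attribute
```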
Solution 3:
Another technique is the backward/forward fill method, in which the missing value is filled in from the previous or the next value.
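In pandas this is a one-liner; a small sketch with an assumed, time-ordered series:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 13.0, np.nan])

forward = s.ffill()   # propagate the previous valid value forward
backward = s.bfill()  # use the next valid value instead
```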
Solution 4:
A more robust approach is to use machine learning techniques to fill these gaps in the data. For example:
With KNN, first find the k instances closest to the instance with the missing value, then fill it in with the mean of that attribute across those k-nearest neighbors (KNN).
With regression, train a regressor for each attribute with missing values that predicts the missing value based on the other attributes.
Choosing a specific method to fill the missing values in your dataset is not straightforward; the method you apply depends heavily on the type of missing value you have and the problem you are solving.
Although it is outside the scope of this article, keep in mind that there are three different kinds of missing data, and each calls for different treatment:
Type 1: Missing Completely at Random (MCAR)
Type 2: Missing at Random (MAR)
Type 3: Missing Not at Random (MNAR)
If you use Python, the KNNImputer mentioned above is one of the useful tools the sklearn library provides for this data preprocessing stage.
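A minimal sketch of sklearn's KNNImputer on a toy numeric matrix (the data and the value of k are arbitrary):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)  # each NaN replaced by the mean of its 2 nearest neighbors
print(X_filled)
```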
Structural Errors
Structural errors are, generally speaking, typos and inconsistencies in the data values.
Say, for instance, that we run a marketplace and sell shoes on our website. Different sellers of the same shoes may enter the information about the same product in different ways. Imagine that one of the attributes is the shoe brand, and for the same shoes we end up with "Nike", "nike", and "NIKE". We have to fix this before feeding the data to the model; otherwise, the model may treat them as different entities. In this case, simply converting every value to lowercase is enough, although other situations may require more elaborate adjustments to fix errors and inconsistencies.
This issue usually calls for manual inspection rather than a fully automated solution.
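For a simple case like the brand example above, though, normalizing the text is often enough; a minimal pandas sketch (the "brand" column is an assumption for illustration):

```python
import pandas as pd

df = pd.DataFrame({"brand": ["Nike", "nike", "NIKE", "Adidas"]})

# Normalize casing (and stray whitespace) so identical brands match
df["brand"] = df["brand"].str.strip().str.lower()
print(df["brand"].unique())  # ['nike' 'adidas']
```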
Dimensionality reduction
Dimensionality reduction is about reducing the number of input features in the training data.
The Curse of Dimensionality in Your Data
A real-world dataset usually has a large number of attributes, and if we do not reduce that number, the model's performance may suffer later on when it is fed this dataset. Reducing the number of features while preserving as much of the variability in the data as possible will often benefit you in several ways, including:
Requiring fewer computational resources
Improving overall model performance
Preventing overfitting, where an overly complex model memorizes the training data instead of learning from it, so its performance on the test data degrades sharply
Avoiding multicollinearity, i.e., high correlation between independent variables. Applying this technique also reduces the noise in the data.
Let's explore the main types of dimensionality reduction we can apply to our data to improve it for later use.
Linear approaches
Linear approaches, as the name implies, reduce the dimensionality of the data through linear transformations.
The most commonly used method is Principal Component Analysis (PCA), a technique that projects the original features into a different dimensional space, capturing most of the original data's variability with far fewer variables (for memory efficiency or sparse data you could use IncrementalPCA or SparsePCA). However, it only works with quantitative variables, and the new, transformed features lose the interpretability of the original data.
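A minimal sklearn sketch, assuming a purely numeric matrix and keeping enough components to explain 95% of the variance (the random data stands in for your features):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))                # stand-in for your numeric features

X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scales
pca = PCA(n_components=0.95)                  # keep 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_)
```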
Factor analysis and linear discriminant analysis (LDA) are two other linear approaches.
Non-linear approaches
Non-linear approaches, also known as manifold learning, are used when the data does not fit in a linear space. The underlying assumption is that most of the relevant structure in a high-dimensional space lies in a small number of low-dimensional manifolds. Many algorithms build on this idea.
One of them is Multi-Dimensional Scaling (MDS), which computes the distance between every pair of objects in a geometric space. It reduces the data to a lower dimension so that pairs that are close in the higher dimension remain close in the reduced dimension.
Isometric Feature Mapping (Isomap), an extension of MDS, replaces the Euclidean distance with the geodesic distance.
Locally linear embedding (LLE), spectral embedding, and t-distributed stochastic neighbor embedding (t-SNE) are other examples of non-linear approaches. You can learn more about this approach and see all the algorithms implemented in sklearn on its manifold learning page.
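As one example, here is a t-SNE sketch with sklearn on a toy dataset (the parameters are illustrative, not tuned):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)         # 64-dimensional toy data

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_embedded = tsne.fit_transform(X)          # project to 2 dimensions, e.g. for visualization
print(X_embedded.shape)                     # (1797, 2)
```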
Feature Engineering: Creating Features Using Domain Knowledge
Feature engineering is about creating better features for your dataset, which will boost the model's performance. We mostly rely on domain knowledge to create those features, deriving them manually from the existing features through some transformation. Here are some simple ideas you can apply to your dataset to potentially improve your model's performance:
Decompose Categorical Attributes
The first example is decomposing the categorical features in your dataset. Suppose your data contains a hair color feature with the values brown, blonde, and unknown. In this case, you can create a new column called "has color" and assign 1 when a color is present and 0 when the value is unknown.
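A small pandas sketch of that idea (the column name and values follow the example above):

```python
import pandas as pd

df = pd.DataFrame({"hair_color": ["brown", "blonde", "unknown", "brown"]})

# 1 when a color was recorded, 0 when the value is unknown
df["has_color"] = (df["hair_color"] != "unknown").astype(int)
```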
Decompose a Datetime
Another example is decomposing a datetime feature. A datetime carries valuable information, but a model cannot take advantage of it in its original format. So if you believe your problem has time dependencies, and you might find a relationship between the datetime and the output variable, spend some time converting that datetime column into features your model can understand better, such as "period of day", "day of the week", and so on.
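A hedged sketch of that decomposition with pandas (the "purchase_date" column and the period boundaries are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({"purchase_date": ["2023-01-15 09:30", "2023-06-02 20:10"]})
df["purchase_date"] = pd.to_datetime(df["purchase_date"])

df["day_of_week"] = df["purchase_date"].dt.day_name()
df["hour"] = df["purchase_date"].dt.hour
df["period_of_day"] = pd.cut(df["hour"], bins=[0, 6, 12, 18, 24], right=False,
                             labels=["night", "morning", "afternoon", "evening"])
```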
Reframe Numerical Quantities
This last example deals with numerical data in a more practical way. Suppose you have data about clothing purchases from a particular store. Beyond the total number of transactions, you might be interested in creating new features about the seasonality of each purchase. You could end up adding four new columns to your dataset for summer, winter, fall, and spring purchases. Depending on the problem you are trying to solve, this could help you and raise the quality of your dataset.
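A rough sketch of that seasonality idea (northern-hemisphere month-to-season mapping; the column names are assumptions):

```python
import pandas as pd

df = pd.DataFrame({"purchase_date": pd.to_datetime(["2023-01-10", "2023-07-22", "2023-10-05"]),
                   "amount": [59.9, 19.9, 89.0]})

season_map = {12: "winter", 1: "winter", 2: "winter",
              3: "spring", 4: "spring", 5: "spring",
              6: "summer", 7: "summer", 8: "summer",
              9: "fall", 10: "fall", 11: "fall"}

df["season"] = df["purchase_date"].dt.month.map(season_map)
seasonal = pd.get_dummies(df["season"])   # one indicator column per season present in the data
df = pd.concat([df, seasonal], axis=1)
```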
So this part is mostly about applying your domain knowledge of the problem to create highly predictive features. There is a great blog post about feature engineering if you want to dig deeper into this.
Handling Large Amounts of Data (Sampling)
Even though more data usually means better model accuracy, some machine learning algorithms struggle with very large amounts of data and run into issues such as memory saturation and the computational cost of updating the model parameters. To deal with this, we can use sampling techniques such as the following (a short code sketch follows the list):
Sampling without replacement. This approach avoids having the same data point repeated in the sample: once a record is picked, it is removed from the population.
Sampling with replacement. With this approach an object can be picked more than once: it is not removed from the population, so it may appear in the sample several times.
Stratified sampling. More involved than the other approaches, this one splits the data into several partitions and draws random samples from each of them. In cases where the classes are imbalanced, this preserves the class proportions of the original data.
Progressive sampling. This last technique starts small and keeps growing the dataset until a sufficient sample size is obtained.
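Here is a sketch of the first three approaches with pandas and sklearn (the DataFrame, the "label" column, and the sample sizes are assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"feature": range(100), "label": [0] * 80 + [1] * 20})

# Sampling without replacement: each row can appear at most once
sample_no_repl = df.sample(n=30, replace=False, random_state=0)

# Sampling with replacement: the same row may be drawn several times
sample_repl = df.sample(n=30, replace=True, random_state=0)

# Stratified sampling: keep the 80/20 class proportion in the sample
stratified, _ = train_test_split(df, train_size=30, stratify=df["label"], random_state=0)
```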
Data Preprocessing Pipelines
There is no fixed rule for ordering the data preprocessing stages of a machine learning pipeline, but generally speaking, what I use and have most often seen is a sequence that chains the preprocessing steps covered above.
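As a minimal sketch (assuming purely numeric features and sklearn, not a prescription), such a pipeline could chain imputation, scaling, and dimensionality reduction before the model:

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

preprocessing_pipeline = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),  # handle missing values
    ("scale", StandardScaler()),                   # put attributes on the same scale
    ("reduce", PCA(n_components=0.95)),            # drop redundant dimensions
    ("model", KNeighborsClassifier()),             # scale-sensitive model from the conclusion
])
# preprocessing_pipeline.fit(X_train, y_train)  # X_train / y_train are your prepared data
```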
Final Thoughts
The data preprocessing phase is crucial for getting the right input data into machine learning algorithms. As discussed above, failing to apply the proper techniques can lead to a worse model. For instance, the k-nearest neighbors algorithm is affected by noisy and redundant data, is sensitive to different scales, and does not handle a large number of attributes well. For that algorithm, you should reduce high dimensionality and normalize the attributes to the same scale while cleaning the data.
With a decision tree approach, on the other hand, you do not need to worry about normalizing the attributes to the same scale. Every model has its own characteristics, and you must keep them in mind to provide the right data input to the model. Knowing those particularities of the algorithms, you can now move on to the modeling phase.