In college, I studied various data techniques and methods, and I found it hard to grasp data standardization and transformation when every concept was explained in formal terminology. I prefer easy, direct, plain-language explanations, but discovering articles that cover all these topics took a long time. If you are looking for an article that gathers the basic information about these methods along with their implementations, you are in the right place!
What is data standardization?
Basically, data standardization is the process of converting data into a common format. When the features of a dataset have different ranges and units, standardization brings all of them onto one scale. The z-score is one of the most common standardization methods: for every value of each feature, subtract the feature's mean and divide by its standard deviation. After standardization, every feature has a mean of zero and a standard deviation of one, which means all features share the same scale.
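As a quick sketch of the z-score formula above (the feature values here are made up):

```python
import numpy as np

# Hypothetical feature values with an arbitrary scale
ages = np.array([22.0, 35.0, 58.0, 41.0, 29.0])

# Z-score: subtract the mean, divide by the standard deviation
z = (ages - ages.mean()) / ages.std()

# After standardization the feature has mean ~0 and standard deviation ~1
print(z.mean(), z.std())
```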
What is data transformation?
According to Wikipedia, data transformation is the process of converting data from one format or structure into another. It is a fundamental part of most data integration and data management tasks, such as data wrangling, data warehousing, data integration, and application integration. During data preprocessing, if a feature is not normally distributed, a transformation can often bring its distribution closer to a Gaussian.
Why standardization & transformation?
All these feature scaling techniques help convert the whole dataset into a consistent form, so that every feature sits on a comparable scale. They also help when a machine learning model assumes a Gaussian distribution.
The disadvantage of feature scaling: we lose the original values of the features, so some interpretability of the raw data is lost.
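That loss is usually reversible, though: sklearn scalers keep the fitted statistics, so the original values can be recovered. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature data on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # original values replaced by z-scores

# The fitted scaler stores the mean and scale of each feature,
# so inverse_transform recovers the original representation
X_back = scaler.inverse_transform(X_scaled)
print(np.allclose(X, X_back))  # True
```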
Implementation
Here we can see that all the features have different ranges and scales.
Data standardization and Normalization
Standardization means centering the variable at zero: z = (x - x_mean) / std. The standardization method scales data to have a mean of 0 and a standard deviation of 1, while normalization typically means rescaling the values into a fixed range such as [0, 1].
- StandardScaler
StandardScaler removes the mean and scales each feature/variable to unit variance. This method is not robust to outliers.
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
X_train_ss=scaler.fit_transform(X_train)
Here we transform X_test with the scaler that was fitted on X_train only, so no test-set statistics leak into preprocessing (this avoids data leakage).
X_test_ss=scaler.transform(X_test)
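Since X_train above comes from the article's own dataset, here is a self-contained sketch with synthetic data that checks what fit_transform and transform actually produce:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the dataset (any numeric matrix behaves the same)
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 3))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_ss = scaler.fit_transform(X_train)  # statistics come from train only
X_test_ss = scaler.transform(X_test)        # reuse the same statistics

# Train features end up with mean ~0, std ~1; test statistics are close but not exact
print(X_train_ss.mean(axis=0), X_train_ss.std(axis=0))
```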
- Min Max Scaling
Min-max scaling rescales the values into the range 0 to 1. This method is not robust to outliers.
X_scaled = (X - X_min) / (X_max - X_min)
from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler(feature_range=(0, 1))
dataset_mm=pd.DataFrame(mm.fit_transform(X_train))
dataset_mm.head()
We cannot use this method when outliers matter, because extreme values get pushed to the edges of the range and the remaining values are compressed together.
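A small sketch (made-up values) showing why a single outlier breaks min-max scaling:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One feature with a single large outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = MinMaxScaler().fit_transform(x)
print(scaled.ravel())
# The outlier maps to 1.0 and squeezes every other value close to 0,
# so the inliers become nearly indistinguishable
```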
- MaxAbsScaler
The MaxAbsScaler takes the maximum absolute value of each column and divides every value in the column by it: x_scaled = x / max(|x|).
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
dataset_mx = scaler.fit_transform(X_train)
dataset_mx
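A tiny sketch (hypothetical values) of the max-abs formula; note that signs are preserved and zeros stay zero, which makes this scaler a good fit for sparse data:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Feature with mixed signs; the maximum absolute value is 8
x = np.array([[-4.0], [2.0], [8.0]])

scaled = MaxAbsScaler().fit_transform(x)
print(scaled.ravel())  # -0.5, 0.25, 1.0
```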
- Unit Vector Scaler/Normalizer
Unlike the scalers above, the Normalizer works on rows instead of columns. Each sample (row) is rescaled independently of the others so that its norm (l1, l2, or max) equals one. Like the other scalers, the Normalizer leaves the values between [0, 1], or [-1, 1] when the data contains negative values.
l1 = each value in a row is divided by the sum of the row's absolute values, so the absolute values along the row sum to 1
l2 = each value in a row is divided by the row's Euclidean length, so the squares of the values along the row sum to 1
from sklearn.preprocessing import Normalizer
scaler = Normalizer(norm = 'l2')
dataset_no = scaler.fit_transform(X_train)
dataset_no
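A quick check (made-up rows) that the two norms behave as described:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [1.0, 1.0]])

# l2: each row is divided by its Euclidean length
X_l2 = Normalizer(norm='l2').fit_transform(X)
print(np.linalg.norm(X_l2, axis=1))  # each row norm is ~1

# l1: each row is divided by the sum of its absolute values
X_l1 = Normalizer(norm='l1').fit_transform(X)
print(np.abs(X_l1).sum(axis=1))  # each row sums to ~1
```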
- RobustScaler
This scaler removes the median and scales the data according to a quantile range, by default the IQR: the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile). Because the median and IQR are barely affected by extreme values, this method is robust to outliers.
from sklearn.preprocessing import RobustScaler
rs=RobustScaler(quantile_range=(25.0, 75.0))
dataset_r=pd.DataFrame(rs.fit_transform(X_train))
dataset_r.head()
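A sketch (hypothetical values) contrasting RobustScaler with StandardScaler on data containing one extreme outlier:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Five inliers plus one extreme outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])

rs = RobustScaler().fit_transform(x)
ss = StandardScaler().fit_transform(x)

# Median and IQR barely move, so RobustScaler keeps the inliers spread out;
# the outlier inflates the mean and std, so StandardScaler crushes them together
print(rs[:5].ravel())
print(ss[:5].ravel())
```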
Transformation
import pylab
import matplotlib.pyplot as plt
import scipy.stats as stat
def QQ(dataset, fe):
    # Histogram and Q-Q plot of one feature, side by side
    plt.figure(figsize=(10, 6))
    plt.subplot(1, 2, 1)
    dataset[fe].hist()
    plt.subplot(1, 2, 2)
    stat.probplot(dataset[fe], dist='norm', plot=pylab)
    plt.show()
QQ(dataset, 'DiabetesPedigreeFunction')
Here, we can see that the feature named “DiabetesPedigreeFunction” has a right-skewed distribution, with a long tail toward large values.
- Logarithmic Transformation
This method replaces each value x with log(x). It only applies to strictly positive values and works best on right-skewed features.
dataset['DiabetesPedigreeFunction_l']=np.log(dataset['DiabetesPedigreeFunction'])
QQ(dataset,'DiabetesPedigreeFunction_l')
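One caveat worth sketching: log(x) is undefined for zeros and negatives, so np.log1p (log(1 + x)) is a common fallback when a feature contains zeros. The values below are made up:

```python
import numpy as np

# Right-skewed values: the log compresses the long right tail
x = np.array([0.1, 0.5, 1.0, 5.0, 50.0])
print(np.log(x))

# np.log would return -inf for 0; log1p(x) = log(1 + x) handles zeros safely
x_with_zero = np.array([0.0, 1.0, 9.0])
print(np.log1p(x_with_zero))
```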
- Reciprocal Transformation
This method replaces each value x with its reciprocal, 1/x. It reverses the order of the values and is only defined for non-zero values.
dataset['DiabetesPedigreeFunction_r']=1/dataset.DiabetesPedigreeFunction
QQ(dataset,'DiabetesPedigreeFunction_r')
- Square root transformation
This method replaces each value x with its square root, √x; it applies to non-negative values.
dataset['DiabetesPedigreeFunction_s']=dataset.DiabetesPedigreeFunction**(1/2)
QQ(dataset,'DiabetesPedigreeFunction_s')
- Exponential transformation
This method replaces each value x with x raised to some power; the exponent here (1/1.2) is an arbitrary choice that can be tuned per feature.
dataset['DiabetesPedigreeFunction_e']=dataset.DiabetesPedigreeFunction**(1/1.2)
QQ(dataset,'DiabetesPedigreeFunction_e')
- Boxcox transformation
Box-Cox is a family of power transformations that includes the log transform (lambda = 0) and the square-root transform (lambda = 0.5) as special cases; the best lambda is estimated from the data. It requires strictly positive values.
dataset['DiabetesPedigreeFunction_b'],parameters=stat.boxcox(dataset['DiabetesPedigreeFunction'])
QQ(dataset,'DiabetesPedigreeFunction_b')
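scipy's boxcox also returns the fitted lambda, which is worth inspecting. A sketch on synthetic, strictly positive, right-skewed data:

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed, strictly positive sample
x = np.random.default_rng(0).lognormal(size=200)

x_bc, lam = stats.boxcox(x)
# lam is chosen by maximum likelihood; lam = 0 corresponds to log(x),
# lam = 0.5 to a square-root-like transform
print(lam, stats.skew(x), stats.skew(x_bc))
```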
Methods that perform standardization and transformation at the same time
- Quantile Transformer
The QuantileTransformer is a very interesting method that transforms and scales in one step: using the quantiles of each feature, it maps the values onto a target distribution (uniform by default, or normal as below). Because it operates on ranks rather than raw values, it also deals well with outliers.
from sklearn.preprocessing import QuantileTransformer
quantile = QuantileTransformer(output_distribution='normal')
dataset_qt = quantile.fit_transform(X_train)
dataset_qt
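A self-contained sketch on synthetic skewed data (n_quantiles is set below the sample size here to avoid a warning):

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import QuantileTransformer

# Synthetic right-skewed feature
rng = np.random.default_rng(0)
X = rng.exponential(size=(300, 1))

qt = QuantileTransformer(output_distribution='normal', n_quantiles=100)
X_qt = qt.fit_transform(X)

# The rank-based mapping leaves the result far less skewed than the input
print(stats.skew(X.ravel()), stats.skew(X_qt.ravel()))
```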
- Power Transformer
When it comes to standardization and normalization, normally we have to study and understand each method before applying it. The power transformation does much of that job for us: for the chosen method, either “box-cox” or “yeo-johnson”, it estimates the optimal lambda parameter by maximum likelihood, and with standardize=True it also scales the result to zero mean and unit variance. Box-Cox works only with positive values, while Yeo-Johnson works with both positive and negative values.
from sklearn.preprocessing import PowerTransformer
power = PowerTransformer(standardize=True)
dataset_pt = power.fit_transform(X_train)
dataset_pt
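A synthetic sketch showing the per-feature lambdas that PowerTransformer estimates, plus the effect of standardize=True:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# One skewed and one roughly normal feature; Yeo-Johnson accepts negatives too
rng = np.random.default_rng(0)
X = np.column_stack([rng.exponential(size=200), rng.normal(size=200)])

power = PowerTransformer(method='yeo-johnson', standardize=True)
X_pt = power.fit_transform(X)

# One lambda per feature, estimated by maximum likelihood
print(power.lambdas_)
# standardize=True also leaves each column with mean ~0 and std ~1
print(X_pt.mean(axis=0), X_pt.std(axis=0))
```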
Overall, we saw what data standardization and transformation are, why it is important to apply these techniques before model building, and how we can implement all of these methods in a few steps.