A Guide To Data Standardization & Transformation

Complete guide for every data standardization and transformation technique with implementation.

Soni Heet
6 min read · May 4, 2021

Back in college, I studied various data techniques and methods, and I found it hard to understand all the concepts around data standardization and transformation in their formal terminology. I prefer plain, direct explanations. On top of that, hunting down separate articles for each of these topics took a lot of time. If you are looking for one article that covers the basics of these methods together with their implementations, you are in the right place!

What is data standardization?

Data standardization is the process of converting data into a common format and scale. When the features of a dataset have different ranges and units, standardization brings all of them onto one scale. The z-score is one of the most common standardization methods: for every value of each feature, subtract the feature's mean and divide by its standard deviation. After standardization, every feature has a mean of zero and a standard deviation of one, which means all features share the same scale.
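For example, here is a minimal sketch of the z-score computed by hand with NumPy (the feature values are made up purely for illustration):

import numpy as np

x = np.array([22.0, 25.0, 30.0, 41.0, 57.0])   # e.g. ages in years
z = (x - x.mean()) / x.std()                   # subtract the mean, divide by the standard deviation
print(z.mean())   # approximately 0
print(z.std())    # 1.0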

What is data transformation?

According to Wikipedia, data transformation is the process of converting data from one format or structure into another. It is a fundamental aspect of most data integration and data management tasks such as data wrangling, data warehousing, data integration, and application integration. During data preprocessing, if a feature is not normally distributed, we can usually apply a transformation to reshape its distribution into an approximately Gaussian (normal) one.

Why standardization & transformation?

All of these feature scaling techniques bring the data into a consistent form, so that every feature is expressed on a comparable scale. This matters because many machine learning models perform better when features share a scale, and some assume a Gaussian distribution.

The disadvantage of feature scaling: we lose the original values of the features, so some interpretability of the raw data is lost (scikit-learn scalers do provide an inverse_transform method to recover it).

Implementation

Here we can see that the features have different ranges, distributions, and units.

X_train — Training dataset
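For reference, here is a minimal sketch of how X_train and X_test might be prepared. It assumes the Pima Indians Diabetes dataset saved as diabetes.csv with an Outcome target column (the file name and column name are assumptions; the DiabetesPedigreeFunction feature used later in this article comes from this dataset):

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

dataset = pd.read_csv('diabetes.csv')     # assumed file path
X = dataset.drop('Outcome', axis=1)       # features
y = dataset['Outcome']                    # target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train.head()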

Data standardization and Normalization

Standardization means centering a variable at zero and scaling it to unit variance: z = (x - mean) / std. Standardization scales data to have a mean of 0 and a standard deviation of 1, while normalization typically means rescaling the values into a fixed range such as [0, 1].

- StandardScaler

StandardScaler removes the mean and scales each feature/variable to unit variance. This method is not robust to outliers.

from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
X_train_ss=scaler.fit_transform(X_train)
After standardization

Here we transform X_test with the scaler that was already fitted on X_train; fitting the scaler on the test set would leak information from it into the preprocessing step (data leakage).

X_test_ss=scaler.transform(X_test)
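As a quick sanity check (a sketch using the arrays from above), every feature of the standardized training set should now have a mean close to 0 and a standard deviation close to 1:

print(X_train_ss.mean(axis=0).round(2))   # roughly 0 for every feature
print(X_train_ss.std(axis=0).round(2))    # roughly 1 for every feature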

- Min Max Scaling

Min-max scaling rescales the values into the range 0 to 1. This method is not robust to outliers.

X_scaled = (X - X_min) / (X_max - X_min)

from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler(feature_range=(0, 1))
dataset_mm=pd.DataFrame(mm.fit_transform(X_train))
dataset_mm.head()
After Min-Max Scaling

We should avoid this method when outliers are important, because the minimum and maximum themselves are set by the outliers.
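To see why, here is a small illustration with made-up numbers: a single extreme value stretches the range, so all the ordinary values get squashed towards 0.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

col = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])   # one outlier at 1000
print(MinMaxScaler().fit_transform(col).ravel())
# approximately [0, 0.001, 0.002, 0.003, 1] - the normal values end up almost indistinguishable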

- MaxAbsScaler

MaxAbsScaler takes the maximum absolute value of each column and divides every value in that column by it: x_scaled = x / max(abs(x)). Because it does not shift or center the data, it is a good choice for sparse data.

from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
dataset_mx = scaler.fit_transform(X_train)
dataset_mx
After Max-abs

- Unit Vector Scaler/Normalizer

Instead of columns, the Normalizer works on rows. Every row of the data is rescaled independently of the other samples so that its norm (l1, l2, or max) equals one. After normalization, the values in each row lie between -1 and 1 (or between 0 and 1 when the row contains no negative values).

L1 = the values in each row are scaled so that the sum of their absolute values along the row equals 1

L2 = the values in each row are scaled so that the sum of their squares along the row equals 1 (i.e. the Euclidean length of the row is 1)

from sklearn.preprocessing import Normalizer
scaler = Normalizer(norm = 'l2')
dataset_no = scaler.fit_transform(X_train)
dataset_no
After Unit Vector Scaler/Normalizer
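As a quick check (a sketch using the array produced above), every row now has an L2 norm of 1:

import numpy as np
print(np.linalg.norm(dataset_no, axis=1))   # each row has norm 1.0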

- RobustScaler

This scaler removes the median and scales the data according to a quantile range, by default the interquartile range (IQR): x_scaled = (x - median) / IQR. The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile), which makes this method robust to outliers.

from sklearn.preprocessing import RobustScaler
rs=RobustScaler(quantile_range=(25.0, 75.0))
dataset_r=pd.DataFrame(rs.fit_transform(X_train))
dataset_r.head()
Before RobustScaler
After RobustScaler

Transformation

import matplotlib.pyplot as plt
import pylab
import scipy.stats as stat

def QQ(dataset, fe):
    # Histogram (left) and Q-Q plot against a normal distribution (right)
    plt.figure(figsize=(10, 6))
    plt.subplot(1, 2, 1)
    dataset[fe].hist()
    plt.subplot(1, 2, 2)
    stat.probplot(dataset[fe], dist='norm', plot=pylab)
    plt.show()

QQ(dataset, 'DiabetesPedigreeFunction')
Right-skewed feature

Here, we can see that the feature named “DiabetesPedigreeFunction” has a right-skewed (positively skewed) distribution: most values are small, with a long tail of larger values.
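One way to confirm the direction of the skew (assuming the DataFrame from the setup above) is to compute the sample skewness; a positive value indicates a right-skewed distribution:

print(dataset['DiabetesPedigreeFunction'].skew())   # positive, i.e. right-skewed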

- Logarithmic Transformation

This method replaces each value x with log(x). It only works for strictly positive values and is useful for reducing right skew.

dataset['DiabetesPedigreeFunction_l']=np.log(dataset['DiabetesPedigreeFunction'])
QQ(dataset,'DiabetesPedigreeFunction_l')
After Logarithmic Transformation

- Reciprocal Transformation

This method replaces each value x with its reciprocal, 1/x. It is undefined for zero values.

dataset['DiabetesPedigreeFunction_r']=1/dataset.DiabetesPedigreeFunction
QQ(dataset,'DiabetesPedigreeFunction_r')
After Reciprocal Transformation

- Square root transformation

This method replaces each value x with its square root, x**(1/2).

dataset['DiabetesPedigreeFunction_s']=dataset.DiabetesPedigreeFunction**(1/2)
QQ(dataset,'DiabetesPedigreeFunction_s')
After Square root transformation

- Exponential transformation

Here each value x is raised to a fractional power, x**(1/1.2) in this example, which compresses large values more gently than the square root does.

dataset['DiabetesPedigreeFunction_e']=dataset.DiabetesPedigreeFunction**(1/1.2)
QQ(dataset,'DiabetesPedigreeFunction_e')
After Exponential transformation

- Box-Cox transformation

The Box-Cox transformation generalizes the log and power transforms: it computes y = (x**lambda - 1) / lambda for lambda != 0 and y = log(x) for lambda = 0, where lambda is estimated from the data so that the result is as close to normal as possible. Like the log transform, it requires strictly positive values.

dataset['DiabetesPedigreeFunction_b'],parameters=stat.boxcox(dataset['DiabetesPedigreeFunction'])
QQ(dataset,'DiabetesPedigreeFunction_b')
After Boxcox transformation

Methods that perform standardization and transformation at the same time

- Quantile Transformer

The QuantileTransformer is a very interesting method that performs standardization and transformation together. Based on the quantiles of each feature, it maps the values to a target distribution (uniform by default, or normal as used here), which also makes it robust to outliers.

from sklearn.preprocessing import QuantileTransformer
quantile = QuantileTransformer(output_distribution='normal')
dataset_qt = quantile.fit_transform(X_train)
dataset_qt
After Quantile Transformer
Before & After Quantile Transformer

- Power Transformer

When it comes to standardization and normalization, we usually have to study the data before choosing a transformation. The PowerTransformer does much of that work for us: it estimates a parameter lambda for each feature (by maximum likelihood) so that the transformed feature is as close to Gaussian as possible, and with standardize=True it also scales the result to zero mean and unit variance. Two methods are available: 'box-cox', which works only with strictly positive values, and 'yeo-johnson' (the default), which works with both positive and negative values.

from sklearn.preprocessing import PowerTransformer
power = PowerTransformer(standardize=True)
dataset_pt = power.fit_transform(X_train)
dataset_pt
After Power Transformer
Before & After Power Transformer
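After fitting, the lambda estimated for each feature can be inspected through the transformer's lambdas_ attribute (a quick sketch using the fitted transformer above):

print(power.lambdas_)   # one fitted lambda per feature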

Overall, we saw what data standardization and transformation are, why it is important to apply these techniques before model building, and how to implement each of these methods in a few steps.
