Data Preprocessing


Missing Values

  • Reasons:

    • Missing completely at random (MCAR)

    • Missing at random (MAR)

    • Not missing at random (NMAR)

  • How to handle the missing values:

    • Do Nothing:

      • Models like XGBoost can deal with missing values by deciding for each sample which is the best way to impute them and learns the best values

    • Imputation:

      • Using (Mean/Median) Value

      • Using (Most Frequent) Value

      • Using k-NN

      • Interpolation (Linear/Nearest Neighbors)

Outlier Detection

  • Cook’s Distance:

    • Measures the effect of deleting a given observation. It represents the sum of all the changes in the regression model when observation “i” is removed from it.

  • Interquartile Range Method (IQR):

    • Is a good statistic for summarizing a non-Gaussian distribution sample of data.

    • IQR is calculated as the difference between the 75th and the 25th percentiles of the data and defines the box in a box and whisker plot.

    • The IQR defines the middle 50% of the data, or the body of the data

    • Can be used to identify outliers by defining limits on the sample values that are below the 25th percentile or above the 75th percentile.

  • Linear Models: Projection methods that model the data into lower dimensions using linear correlations.

    • For example, PCA and data with large residual errors may be outliers.

    • Proximity-based Models: Data instances that are isolated from the mass of the data as determined by cluster, density or KNN analysis.

Handling Outliers

  • Log-Scale Transformation: This method is often used to reduce the variability of data including outlying observation.

  • Model Selection: Tree based models are less impacted by outliers compared to linear models.

    • XGBoost and boosting in general are very sensitive to outliers.

    • This is because boosting builds each tree on previous trees’ residuals/errors.

  • Outliers will have much larger residuals than non-outliers, so boosting will focus a disproportionate amount of its attention on those points

Categorical Encoding

One Hot Encoding:

  • Maps each category to a vector that contains 1 and 0 denoting the presence or absence of the feature.

  • The number of vectors depends on the number of categories for features.

  • This method produces a lot of columns that slows down the learning significantly if the number of the category is very high for the feature.

Label Encoding:

  • Each category is assigned a value from 1 through N (here N is the number of categories for the feature.

  • One major issue with this approach is there is no relation or order between these classes, but the algorithm might consider them as some order, or there is some relationship.

Ordinal Encoding:

  • To ensure the encoding of variables retains the ordinal nature of the variable.

  • This is reasonable only for ordinal variables.

  • The transformation looks almost similar to Label Encoding but slightly different as Label coding would not consider whether a variable is ordinal or not and it will assign a sequence of integers.

Binary Encoding:

  • Converts a category into binary digits.

  • Each binary digit creates one feature column.

  • If there are n unique categories, then binary encoding results in the only log(base 2)ⁿ features.

  • Compared to One Hot Encoding, this will require fewer feature columns.

    • Explain: for 100 categories One Hot Encoding will have 100 features while forBinary encoding, we will need just seven features.

Data Normalization

  • Standardize: scaling features by removing the mean and scaling to unit variance

  • MinMax: Transform features by scaling each feature to a given range [-1,1].