# Data Preprocessing
__________

## Missing Values

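- **Reasons (missingness mechanisms):**
	- Missing completely at random (MCAR): the chance of a value being missing is unrelated to any data, observed or not.
	- Missing at random (MAR): the chance of a value being missing depends only on other observed variables.
	- Not missing at random (NMAR, also written MNAR): the chance of a value being missing depends on the unobserved value itself.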
- **How to handle missing values:**
	- **Do Nothing:**
		- Some models, such as XGBoost, handle missing values natively: at each tree split, the model learns a default direction for samples whose value is missing, choosing the direction that most reduces the training loss.
	- **Imputation:**
		- Using the mean or median value of the feature
		- Using the most frequent value of the feature
		- Using k-NN (fill from the most similar samples)
		- Interpolation (linear or nearest-neighbor)
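
The imputation options above can be sketched as follows. This is a minimal illustration assuming scikit-learn and pandas are available; the toy matrix `X` and the series `s` are made-up values, not data from these notes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Toy feature matrix with missing entries (hypothetical values).
X = np.array([
    [1.0, 10.0, 100.0],
    [2.0, np.nan, 200.0],
    [np.nan, 30.0, 300.0],
    [4.0, 40.0, 400.0],
])

# Mean / median / most-frequent imputation: each NaN is replaced
# by the corresponding statistic of its column.
mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)
median_imputed = SimpleImputer(strategy="median").fit_transform(X)
mode_imputed = SimpleImputer(strategy="most_frequent").fit_transform(X)

# k-NN imputation: each NaN is filled with the average of that feature
# over the k nearest rows, with distance measured on the observed features.
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X)

# Linear interpolation along an ordered series (e.g. a time series).
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
interpolated = s.interpolate(method="linear")
```

In practice the imputer would be fit on the training split only and then applied to validation and test data, so that no statistics leak from held-out samples.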

## Outlier Detection

- **Cook’s Distance:**
	- Measures the effect of deleting a given observation from a fitted regression. For observation “i”, it aggregates the changes in all fitted values when that observation is removed, so a large value marks an influential point.
- **Interquartile Range Method (IQR):**
	- A robust statistic for summarizing a sample of data that is not Gaussian-distributed.
	- The IQR is calculated as the difference between the 75th and the 25th percentiles of the data and defines the box in a box-and-whisker plot.
	- The IQR covers the middle 50% of the data, or the body of the data.
	- Can be used to identify outliers by setting limits a factor k times the IQR below the 25th percentile or above the 75th percentile; k = 1.5 is a common choice.
- **Linear Models:** Projection methods that model the data in lower dimensions using linear correlations.
	- For example, after projecting with PCA, data points with large residual (reconstruction) errors may be outliers.
- **Proximity-based Models:** Data instances that are isolated from the mass of the data, as determined by cluster, density, or k-NN analysis.
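
As a rough sketch of the IQR rule and Cook's distance described above, using only NumPy. The helper names `iqr_outlier_mask` and `cooks_distance`, the factor `k=1.5`, and the sample data are assumptions made for illustration; the Cook's distance helper covers only a straight-line least-squares fit.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Flag values lying more than k * IQR outside the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

def cooks_distance(x, y):
    """Cook's distance of each point for a simple least-squares line fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones_like(x), x])  # intercept + slope columns
    n, p = design.shape
    hat = design @ np.linalg.pinv(design)           # projection ("hat") matrix
    leverage = np.diag(hat)
    residuals = y - hat @ y
    mse = residuals @ residuals / (n - p)
    return residuals**2 / (p * mse) * leverage / (1.0 - leverage) ** 2

# IQR rule on a 1-D sample: only the last value (95) is flagged.
data = [10, 12, 11, 13, 12, 11, 14, 95]
iqr_mask = iqr_outlier_mask(data)

# Cook's distance on a line with one corrupted point at index 9.
x = np.arange(10.0)
y = 2.0 * x + 1.0
y[9] += 30.0
influence = cooks_distance(x, y)
most_influential = int(np.argmax(influence))
```

The IQR rule looks at one variable at a time, while Cook's distance judges a point by how much it moves a fitted regression, so the two can disagree on the same dataset.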

**Handling Outliers**

- **Log-Scale Transformation:** Often used to reduce the variability of data that includes outlying observations, since the logarithm compresses large values. It applies only to positive values.
- **Model Selection:** Tree-based models are generally less affected by outliers in the input features than linear models, because splits depend on the ordering of values rather than their magnitude.
	- XGBoost and boosting in general can still be very sensitive to outliers in the target variable.
	- This is because boosting builds each tree on the previous trees' residuals/errors.
	- Outliers have much larger residuals than non-outliers, so boosting focuses a disproportionate amount of its attention on those points.
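
A small illustration of the log-scale transformation mentioned above. The sample values are made up, and `np.log1p` (which computes log(1 + x)) is one possible choice that also tolerates zeros; a plain `np.log` would work equally well for strictly positive data.

```python
import numpy as np

# Hypothetical skewed feature with one extreme observation.
values = np.array([1.0, 10.0, 100.0, 1000.0, 1_000_000.0])

# log1p compresses large values far more than small ones
# while preserving their order.
log_values = np.log1p(values)

# Spread before and after the transformation.
raw_range = values.max() - values.min()
log_range = log_values.max() - log_values.min()

# The transformation is invertible, so predictions can be mapped back.
restored = np.expm1(log_values)
```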


## Categorical Encoding

**One Hot Encoding:**
- Maps each category to a vector of 1s and 0s denoting the presence or absence of that category.
- The number of new columns equals the number of categories of the feature.
- For a feature with a very high number of categories, this produces a lot of columns, which can slow down learning significantly.
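
A minimal sketch of one-hot encoding, assuming pandas is available. The `color` column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical categorical feature with three distinct categories.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One new 0/1 column per category; each row has exactly one 1.
one_hot = pd.get_dummies(df, columns=["color"], dtype=int)

# Resulting columns: color_blue, color_green, color_red
column_names = list(one_hot.columns)
```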

**Label Encoding:**
- Each category is assigned an integer value from 1 through N, where N is the number of categories for the feature (some implementations use 0 through N-1 instead).
- One major issue with this approach: even when there is no order or relationship between the classes, the algorithm might treat the integers as ordered or related.

**Ordinal Encoding:** 
- Assigns integers so that the encoding retains the natural order of the categories (e.g. low < medium < high).
- This is reasonable only for ordinal variables.
- The result looks similar to Label Encoding, but Label Encoding assigns a sequence of integers without considering whether the variable is ordinal or what its order is.
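
The difference between label and ordinal encoding can be sketched with scikit-learn, assuming it is installed. The `sizes` feature and its order (small < medium < large) are made up for illustration. Note that scikit-learn's `LabelEncoder` numbers classes from 0 in sorted order and is documented as a tool for target labels; it is used here only to show the arbitrary ordering.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal feature.
sizes = ["medium", "small", "large", "small"]

# Label encoding: integers follow alphabetical order
# (large=0, medium=1, small=2), which ignores the real ordering.
label_codes = LabelEncoder().fit_transform(sizes)

# Ordinal encoding: the category order is stated explicitly,
# so small=0 < medium=1 < large=2.
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordinal_codes = encoder.fit_transform(np.array(sizes).reshape(-1, 1)).ravel()
```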

**Binary Encoding:**
- Converts a category into binary digits. 
- Each binary digit creates one feature column. 
- If there are n unique categories, binary encoding needs only about ⌈log₂(n)⌉ feature columns.
- Compared to One Hot Encoding, this requires far fewer feature columns.
	- **Example:** for 100 categories, One Hot Encoding produces 100 features, while Binary Encoding needs just seven, since ⌈log₂(100)⌉ = 7.
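
The column count above can be checked with a small pure-Python sketch. The helper `binary_encode` is hypothetical and numbers categories from 0 in sorted order; library implementations may number categories differently and so may produce a slightly different column count.

```python
def binary_encode(categories):
    """Map each unique category to a fixed-width list of 0/1 bits."""
    unique = sorted(set(categories))
    # Bits needed to represent indices 0 .. n-1; equals ceil(log2(n)) for n > 1.
    n_bits = max(1, (len(unique) - 1).bit_length())
    mapping = {
        cat: [(idx >> shift) & 1 for shift in reversed(range(n_bits))]
        for idx, cat in enumerate(unique)
    }
    return mapping, n_bits

# Three categories fit in two bit columns.
small_mapping, small_bits = binary_encode(["a", "b", "c", "a"])

# One hundred categories fit in seven bit columns.
big_mapping, big_bits = binary_encode([f"cat_{i}" for i in range(100)])
```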
  
## Data Normalization

- **Standardize:** Scale each feature by removing its mean and scaling to unit variance, i.e. z = (x - mean) / standard deviation.
- **MinMax:** Transform each feature by linearly rescaling it to a given range, such as [0, 1] or [-1, 1].
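
Both scalers can be sketched with scikit-learn, assuming it is installed. The matrix `X` is a made-up example, and the range (-1, 1) is passed explicitly because `MinMaxScaler` defaults to (0, 1).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical features on very different scales.
X = np.array([
    [1.0, 200.0],
    [2.0, 300.0],
    [3.0, 400.0],
])

# Standardization: each column ends up with mean 0 and unit variance.
standardized = StandardScaler().fit_transform(X)

# Min-max scaling: each column is linearly mapped onto [-1, 1].
minmax = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)
```

As with imputation, the scaler would normally be fit on training data only and then reused to transform validation and test data.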