Data cleansing and preprocessing can occupy 40-70% of the workflow. We should start any analysis, modelling, or visualisation by getting to know the data and prepare the dataset for the next steps. Basic steps of data cleansing:
- Dropping useless or multiple columns.
- Coding categorical variables:
- if necessary: in case of visualisation, categorical form can be easier to understand, but some methods manage numerical data only
- Observing missing data:
- Look at the missing values by each variable and think about if further steps needed.
- Check if the missing is random or there could be relation or pattern between them.
- Observing text data:
- Remove text errors.
- Remove whitespaces.
- Date time formatting
Some ideas on missing data handling:
- Using median for numerical variables.
- Use median as it is not sensitive to outliers opposed to the mean. Of course, in some justified cases, mean also can be used.
- If we have categorical variables too, we can compute the mean indicators in the subgroups, so we can have a less biased solution.
- Using regression for numerical variables:
- Estimate the missing values with other variables.
- We can use either mean or median regression.
- Mode for categorical data.
- Using any classification or categorization method for categorical data:
- Binary or multinominal logistic regression.
- k-nearest neighbour method.
These methods can be used to manage random data missing. Before we pick which method to use, think about the analysis we are going to undertake, and choose the right method accordingly. (E.g., if we want to analyse with regression, our results would be biased if we use regression for data imputation too.)
The datacleaning part of the article based on Admond Lee’s article titled „The Simple Yet Practical Data Cleaning Codes”.