Data Processing & Cleaning

In this note, df denotes a DataFrame and s denotes a Series.

Libraries

import pandas as pd # import pandas package
import numpy as np

csv file:
1. Values are separated by , or ;?
2. Encoding.
3. Timestamp type.
Are the indexes sorted?
Are the indexes continuous with step 1 (especially after using .dropna() or .drop_duplicates())?
Are there NaN values? Drop them?
Are there duplicates? Drop them?
How many unique values?
Do 0/1 features have only 2 unique values (0 and 1)?
Use a KDE plot to check the distribution of values.
The number of columns?
Unique labels?
Time series:
1. Time range.
2. Time step.
3. Timestamp's type.
4. Timezone.
5. Are the timestamps monotonic?
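
Most of the checks above map onto one or two pandas calls. The following is a minimal sketch, not a definitive recipe: the CSV content, the ; separator, and the column names time, flag, value and label are all made up for illustration.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content; a real file would be read with something like
# pd.read_csv(path, sep=';', encoding='utf-8', parse_dates=['time']).
raw = io.StringIO(
    "time;flag;value;label\n"
    "2021-01-01 00:00:00;0;1.5;a\n"
    "2021-01-01 00:10:00;1;;b\n"
    "2021-01-01 00:20:00;1;2.5;a\n"
    "2021-01-01 00:20:00;1;2.5;a\n"
)
df = pd.read_csv(raw, sep=';', parse_dates=['time'])

n_cols = df.shape[1]               # number of columns
n_nan = df.isna().sum()            # NaN count per column
n_dup = df.duplicated().sum()      # number of fully duplicated rows
n_unique = df.nunique()            # unique values per column (NaN excluded)
labels = df['label'].unique()      # unique labels
flag_is_binary = set(df['flag'].dropna().unique()) <= {0, 1}

# Dropping rows leaves gaps in the index, so check it before resetting.
df = df.dropna().drop_duplicates()
index_sorted = df.index.is_monotonic_increasing
index_continuous = (np.diff(df.index.to_numpy()) == 1).all()
df = df.reset_index(drop=True)     # back to 0, 1, 2, ... with step 1

# Time series checks on the (assumed) 'time' column.
ts = df['time']
time_range = (ts.min(), ts.max())
time_steps = ts.diff().dropna().unique()   # distinct gaps between timestamps
ts_is_datetime = pd.api.types.is_datetime64_any_dtype(ts)
ts_tz = ts.dt.tz                           # None means timezone-naive
ts_monotonic = ts.is_monotonic_increasing

# Distribution check (needs matplotlib and scipy installed):
# df['value'].plot.kde()
```

If time_steps holds more than one value, the time step is irregular and the series may need resampling before further processing.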

# REMOVING COLUMNS
df.drop('New', axis=1, inplace=True) # drop column 'New'
df.drop(['col1', 'col2'], axis=1, inplace=True)

# ONLY KEEP SOME
kept_cols = ['col1', 'col2', ...]
df = df[kept_cols]

# ALL EXCEPT SOME
df[df.columns.difference(['b'])] # all columns except 'b' (note: the remaining columns come back sorted)
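
One thing to watch with the "all except some" pattern: Index.difference sorts its result by default, so the column order can change. The sketch below compares it with two order-preserving alternatives. The DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical DataFrame whose columns are deliberately not in alphabetical order.
df = pd.DataFrame({'c': [1, 2], 'b': [3, 4], 'a': [5, 6]})

# Index.difference sorts by default, so the column order changes.
by_difference = df[df.columns.difference(['b'])]

# drop(columns=...) returns a new DataFrame and keeps the original order.
by_drop = df.drop(columns=['b'])

# A boolean mask with .loc also keeps the original order.
by_mask = df.loc[:, ~df.columns.isin(['b'])]

print(list(by_difference.columns))  # ['a', 'c']
print(list(by_drop.columns))        # ['c', 'a']
print(list(by_mask.columns))        # ['c', 'a']
```

Neither alternative modifies df, which avoids inplace=True when the original frame is still needed.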