Use Logic to Handle Missing Data

8 min read · Feb 6, 2022

Introduction

The COVID-19 pandemic has afflicted humanity for nearly 3 years since late 2019, and we still don't seem to see the light at the end of the tunnel. COVID is not a pleasant subject to touch.

But as data scientists, the COVID dataset is a good subject to touch: it has abundant data from around the world, and it is updated every day.

Today, I would like to go back to 2019 and analyze the initial COVID dataset. Credit to Kaggle for sharing the data:

The link shares the following datasets:

  • COVID19_line_list_data.csv (358.85 KB) → a report for each confirmed case
  • COVID19_open_line_list.csv (2.93 MB) → more details about each confirmed case
  • covid_19_data.csv (1.53 MB) → confirmed case counts for each country
  • time_series_covid_19_confirmed.csv (100.3 KB) → confirmed counts for each country, with one column per timestamp
  • time_series_covid_19_confirmed_US.csv (1.11 MB) → confirmed data in the US
  • time_series_covid_19_deaths_US.csv (1.04 MB) → death data in the US
  • time_series_covid_19_deaths.csv (76.09 KB) → death data for each country
  • time_series_covid_19_recovered.csv (84.62 KB) → recovered data

We would like to do an initial EDA on one of the datasets (COVID19_line_list_data) and demonstrate how to handle its missing data.

Get Hands Dirty

As always, we import the modules and read the data from CSV.

All the code in this post can be found on my GitHub:

import plotly.graph_objects as go
from collections import Counter
import missingno as msno
import pandas as pd
line_list_data_file = '../data/COVID19_line_list_data.csv'
line_list_data_raw_df = pd.read_csv(line_list_data_file)
line_list_data_raw_df.info()

Drop NaN Columns

Obviously, columns 21–26 have 0 non-null values, which means these columns are non-informative and can be dropped safely.

line_list_data_raw_df.dropna(axis=1, how='all', inplace=True)
line_list_data_raw_df.info()
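To see why how='all' is the safe choice here, a minimal sketch on toy data (the frame and column names are invented for illustration): a fully empty column is dropped, while a partially missing one survives.

```python
import numpy as np
import pandas as pd

# Toy frame: column 'b' is entirely NaN, like columns 21-26 in the raw CSV.
df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [np.nan, np.nan, np.nan],
    'c': ['x', np.nan, 'z'],
})

# how='all' drops only columns whose every value is missing.
cleaned = df.dropna(axis=1, how='all')
print(list(cleaned.columns))  # ['a', 'c']
```

With how='any' instead, column 'c' would also be dropped and real data would be lost.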

Missing Data Exploration

This dataset is not very large and has missing data in certain columns such as age, country, etc. Such a small-scale dataset gives us the chance to visualize all data samples.

The missingno package is used to visualize missing-data information for small datasets. For example, the nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion visually. Besides the matrix, you can also choose a bar chart, dendrogram, etc. (https://github.com/ResidentMario/missingno)

So, we can display all samples in a matrix below.

msno.matrix(df=line_list_data_raw_df, fontsize=16)

The results are as follows:

  • The Y-axis indicates the dataset size; in our case, the total is 1085 samples.
  • The X-axis shows each feature name; because we have a limited number of features, they can all be clearly displayed.
  • Black indicates that a feature's samples are complete (not NaN), while white gaps indicate that the feature has missing data.
  • As you can see, case_in_country has missing samples, concentrated at the beginning.
  • The curve (sparkline) on the right side of the plot shows the number of valid features per row (sample). For example, the number 10 means the sparsest row has only 10 valid features, and the number 20 means the most complete rows have all 20.
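The same completeness information can also be computed numerically with plain pandas; a small sketch on an invented stand-in frame:

```python
import numpy as np
import pandas as pd

# Invented stand-in for the line-list data, for illustration only.
df = pd.DataFrame({
    'case_in_country': [np.nan, np.nan, 5.0, 6.0],
    'age': [34.0, np.nan, 61.0, 28.0],
    'country': ['China', 'Japan', 'China', 'USA'],
})

# Per-column missing counts: the numbers behind the white gaps in the matrix.
missing = {col: int(df[col].isna().sum()) for col in df.columns}
print(missing)  # {'case_in_country': 2, 'age': 1, 'country': 0}

# Valid features per row: the numbers behind the sparkline on the right.
row_completeness = df.notna().sum(axis=1)
print(row_completeness.tolist())  # [2, 1, 3, 3]
```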

Handle Messy Datetime Format

One of the most important steps in data cleaning is imputing missing data. For this dataset, we can try plenty of filling strategies. The ideas shared in this post may not be the best, but the purpose is to show the range of possible filling methods.

We don't want to limit ourselves to filling missing fields with constants, averages, or other simple statistical indicators.

First, let's plot the trends.

date_cols = [
    'reporting date',
    'symptom_onset',
    'hosp_visit_date',
    'exposure_start',
    'exposure_end']
import plotly.offline as pyo
pyo.init_notebook_mode()
fig = go.Figure()
for col in date_cols:
    fig.add_trace(go.Scatter(y=line_list_data_raw_df[col], name=col))
fig.show()
# fig.write_image("img/messy_datetime.png")

As we can see from the y-axis, the labels are a mess, because these values are recorded as strings rather than datetimes.

A good practice is to convert these datetime strings to datetime format. Since we are using a pandas DataFrame to process the data, it is handy to use pandas to_datetime.

Let's take a closer look at these datetime string formats before converting.

print(line_list_data_raw_df[date_cols].head(5))
print(line_list_data_raw_df[date_cols].tail(5))

# check the length of each date string
for col in date_cols:
    date_len = line_list_data_raw_df[col].astype(str).apply(len)
    date_len_ct = Counter(date_len)
    print(f'{col} datetime length distributes as {date_len_ct}')

From the result above, we can see that the datetime format is not uniform: some months are recorded as 01 while others use 1. Since the formats are mixed, we have to handle these exceptions carefully.

Here is a tip: since we know there are two formats, we can convert the whole column once per format. Mismatched datetime strings are set to NaT and dropped. Afterwards, we can concatenate the converted subsets back into one complete column.
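Before applying this to the real columns, the coerce-and-recombine idea can be checked on a toy series (values invented for illustration). Rows that don't match a given format simply become NaT, and combine_first is one way to stitch the two passes back together:

```python
import pandas as pd

# Two formats mixed in one column: 4-digit years and 2-digit years.
s = pd.Series(['1/20/2020', '2/5/20', '1/31/2020', 'not a date'])

# Each pass parses only the rows matching its format; the rest become NaT.
parsed_4y = pd.to_datetime(s, format='%m/%d/%Y', errors='coerce')
parsed_2y = pd.to_datetime(s, format='%m/%d/%y', errors='coerce')

# Take the 4-digit parse and fill its gaps from the 2-digit parse.
combined = parsed_4y.combine_first(parsed_2y)
print(combined.tolist())
```

Rows that match neither format stay NaT, so genuinely bad values are easy to spot afterwards. Recent pandas releases (2.0+) also accept format='mixed' in to_datetime, which can replace this dance when the formats are unambiguous.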

def mixed_dt_format_to_datetime(series, format_list):
    temp_series_list = []
    for fmt in format_list:
        temp_series = pd.to_datetime(series, format=fmt, errors='coerce')
        temp_series_list.append(temp_series)
    out = pd.concat([temp_series.dropna(how='any')
                     for temp_series in temp_series_list])
    return out


for col in date_cols:
    line_list_data_raw_df[col] = mixed_dt_format_to_datetime(
        line_list_data_raw_df[col], ['%m/%d/%Y', '%m/%d/%y'])
print(line_list_data_raw_df[date_cols].info())

# check the length of each date string again
for col in date_cols:
    date_len = line_list_data_raw_df[col].astype(str).apply(len)
    date_len_ct = Counter(date_len)
    print(f'{col} datetime length distributes as {date_len_ct}')

fig = go.Figure()
for col in date_cols:
    fig.add_trace(go.Scatter(y=line_list_data_raw_df[col], name=col))
fig.show()

Now we can plot the trends again, and the Y-axis has become a well-organized timeline.

  • Observing the curves, we can see that the reporting date curve is at the top, i.e. the latest time. This is logical and practical.
  • If hosp_visit_date is missing, we can directly use the reporting date.
  • According to common sense, patients generally go to the hospital some time after, or immediately after, they develop symptoms. Therefore hosp_visit_date must be later than symptom_onset.
  • We can compute statistics on how long patients take to go to the hospital after symptoms appear, and use this to shift back from hosp_visit_date to estimate symptom_onset.
  • Similarly, we can compute the gap between the start of the exposure history and symptom onset, and use it to fill in exposure_start.
  • As for missing exposure_end values, we have reason to believe the exposure history ended when the patient was admitted to hospital.

With this, we have sorted out a way to fill the missing dates in each datetime column.

Fill Missing Data with Constant Value

# fill missing report_date
line_list_data_raw_df.loc[261, 'reporting date'] = pd.Timestamp('2020-02-11')

Fill Missing Data with Other Column

As the histogram below shows, the gap between reporting date and hosp_visit_date is close to 1 day. For simplicity, we just use the reporting date to fill hosp_visit_date.

time_delta = line_list_data_raw_df['reporting date'] - \
    line_list_data_raw_df['hosp_visit_date']
time_delta.dt.days.hist(bins=20)
line_list_data_raw_df['hosp_visit_date'].fillna(
    line_list_data_raw_df['reporting date'], inplace=True)
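The pattern here is just Series.fillna with another Series; a tiny sketch with invented dates:

```python
import pandas as pd

report = pd.to_datetime(pd.Series(['2020-01-21', '2020-01-25', '2020-02-01']))
hosp = pd.to_datetime(pd.Series(['2020-01-20', None, '2020-01-31']))

# Where the hospital-visit date is missing, borrow the reporting date.
hosp_filled = hosp.fillna(report)
print(hosp_filled.tolist())
```

fillna aligns on the index, so each missing row is filled with that same row's reporting date, not with a global constant.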

Fill Missing Data with Other Column + Bias

Similarly, we can look at the distribution of the time difference between hospital visit and symptom onset. This time the peak of the distribution is no longer around 1 day, but 3 days. That is, most people go to the hospital about 3 days after symptoms appear, and some take close to 25 days.

Therefore, we compute the mean difference and then back-calculate symptom_onset from hosp_visit_date.

# fill missing symptom_onset
time_delta = line_list_data_raw_df['hosp_visit_date'] - \
    line_list_data_raw_df['symptom_onset']
time_delta.dt.days.hist(bins=20)
average_time_delta = pd.Timedelta(days=round(time_delta.dt.days.mean()))
symptom_onset_calc = line_list_data_raw_df['hosp_visit_date'] - \
    average_time_delta
line_list_data_raw_df['symptom_onset'].fillna(symptom_onset_calc, inplace=True)
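The same shift-by-average idea on toy data (dates invented), to make the arithmetic concrete:

```python
import pandas as pd

hosp = pd.to_datetime(pd.Series(['2020-01-23', '2020-01-28', '2020-02-02']))
onset = pd.to_datetime(pd.Series(['2020-01-20', None, '2020-01-30']))

# Average onset-to-hospital delay, computed on the rows where both are known.
delta = (hosp - onset).dt.days                 # [3.0, NaN, 3.0]
avg = pd.Timedelta(days=round(delta.mean()))   # 3 days

# Back-calculate the missing onset from the hospital-visit date.
onset_filled = onset.fillna(hosp - avg)
print(onset_filled[1])  # 2020-01-25 00:00:00
```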

For most people, symptoms develop within 4 to 10 days of exposure, the so-called incubation period. In the same way, we can deduce the exposure (infection) date.

# fill missing exposure_start
time_delta = line_list_data_raw_df['symptom_onset'] - \
    line_list_data_raw_df['exposure_start']
time_delta.dt.days.hist(bins=20)
average_time_delta = pd.Timedelta(days=round(time_delta.dt.days.mean()))
exposure_start_calc = line_list_data_raw_df['symptom_onset'] - \
    average_time_delta
line_list_data_raw_df['exposure_start'].fillna(exposure_start_calc, inplace=True)

# fill missing exposure_end
line_list_data_raw_df['exposure_end'].fillna(
    line_list_data_raw_df['hosp_visit_date'], inplace=True)

As you can see in the new plot, all timelines are consistent. Perfect.

fig = go.Figure()
for col in date_cols:
    fig.add_trace(go.Scatter(y=line_list_data_raw_df[col], name=col))
fig.show()

Fill Other Columns

We can impute the other columns in the same way, e.g.:

  • For gender, you could use the majority gender, but here I suggest marking it unknown.
  • Age can be filled with the mean value.
  • For symptoms, we count the value frequencies and pick the most common symptom to replace the missing values. You can see that the most common symptom is fever.
# we replace missing data with -1 as these are not important.
line_list_data_raw_df['case_in_country'].fillna(-1, inplace=True)
# for summary, we just replace NaN with an empty string
line_list_data_raw_df['summary'].fillna('', inplace=True)
# for gender, you could use the majority gender, but here I suggest marking it unknown.
line_list_data_raw_df['gender'].fillna('unknown', inplace=True)
# age can be filled with the mean value
line_list_data_raw_df['age'].hist(bins=10)
line_list_data_raw_df['age'].fillna(
    line_list_data_raw_df['age'].mean(), inplace=True)
line_list_data_raw_df['age'].hist(bins=10)
line_list_data_raw_df['If_onset_approximated'].fillna(1, inplace=True)
line_list_data_raw_df['from Wuhan'].fillna(1.0, inplace=True)
# count value frequencies and pick the most common symptom to replace missing values.
# note: NaN itself is the most frequent entry, so we take the second item.
# you can see that the most common symptom is fever
symptom = Counter(line_list_data_raw_df['symptom'])
line_list_data_raw_df['symptom'].fillna(symptom.most_common(2)[1][0], inplace=True)
# missing data visualization
msno.matrix(df=line_list_data_raw_df, fontsize=16)
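One detail worth calling out: in the symptom column the missing values themselves are the most frequent "value", which is why the code takes most_common(2)[1][0] (the second entry) rather than the top one. A toy check with invented data:

```python
from collections import Counter

import numpy as np
import pandas as pd

# np.nan is a single shared object, so Counter groups all missing cells together.
symptom = pd.Series(['fever', np.nan, 'fever', np.nan, np.nan, 'cough'])
counts = Counter(symptom)

top_two = counts.most_common(2)  # NaN first, then the real mode
print(top_two[1][0])  # fever

# pandas' own mode() skips NaN by default and gives the same answer:
print(symptom.mode()[0])  # fever
```

Series.mode() is the more robust spelling, since it never returns NaN accidentally.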

Check out the matrix again: bingo! Although the matrix is no longer fancy (black and white), this is the PERFECT black.

line_list_data_raw_df.to_csv('../data/COVID19_line_list_data_cleaned.csv')

Summary

This post mainly introduced data cleaning, especially imputing missing data using domain logic.

You can fill NaN with a specific value, an empty value, a statistic, or inferences from other columns.

It also provides some tips, such as how to convert mixed time formats into datetime, and how to visualize missing data.

We deliberately did not perform any deeper EDA yet, but during the process of data cleaning we still learned a bit more about COVID:

  • For example, the incubation period is typically 4 to 10 days.
  • Patients usually go to the hospital within 3 days after symptoms appear.
  • The most common symptom is fever, and so on.
