Six Pandas Functions Solve All Data Transformation in Real Project
If we join Kaggle competition or real projects, we will realize that before going to modeling, the most difficult thing is transforming data.
Because raw data are always dirty and not ready for modeling, but the convenience of data storage.
Long Format Data
For example, the most common transaction records are generally archived as long format, even there are tens of thousands of commodities. Because long format is easy to write into database. The same scenario is existing for other fields like IoT. Thousands of time series are saved in long format in Relational Database.
Here we mockup a simple long format table, the sales transaction records.
This dataset consists of 5-day sales records with two categories of products (electric appliances 📺 and fruits 🍌🍏). The fruit category includes bananas 🍌 and apples 🍏.
Quick simple, right?
Each row of the long format table represents a feature of an object. Typically, for long-format data, there are at least three parts:
- ID/index: These column(s) should be the major index
- Variable/Key: These column(s) should contain the names of features
- Value: These column(s) contain the value corresponding to the feature
Going back to the case above, the situation is more subtle. Which are IDs and Which are Variables?
Depending on your needs, you may treat Date as ID and Prod as Variables, vice versa.
Pivot —
long to wide for data analysis
On the contrary, a wide format data has more information horizontally. Which means, multiple features/variables are displayed in the same row.
Back to this example, if ID is Date
, then the feature is Prod
.
We furtherly illustrate by converting the long table format to the wide table format.
date_wide_df = df.pivot(index='Date',columns='Prod',values='Qty')
date_wide_df
Or you would rather think that ID is Prod and feature is Date.
prod_wide_df = df.pivot(index='Prod',columns='Date',values='Qty')
prod_wide_df
As you see, choosing different Indexes and Columns will lead to different results. The impact/difference of these two options is whether the data statistics are sorted by row or column. They can transpose to each other.
If Date
is used as an index, we can "horizontally" compare sales records for all products
.
If Product
is used as an index, we can "horizontally" see sales records for different dates
.
Here we also notice some details, that is, wide-format data is prone to have missing values NaN
.
So far you can see the corresponding values of Long and Wide formats data:
- The long format table is simple and easy to record raw.
- The wide table is multi-dimensional, suitable for further analysis
Multilevel Index — Pivot
We can add multiple indexes to create a multi-level index, but you need to ensure that the last level has at maximum one value per feature.
For example, we can combine Cate
and Prod
, but we can't just use Cate
for Pivot operation.
Similarly, for features/variables multi-level indexes can also be used.
cate_prod_wide_df = df.pivot(index=['Cate','Prod'],columns='Date',values=['Qty','Price'])
cate_prod_wide_df
df.pivot(index='Date',columns=['Cate','Prod'],values=['Qty','Price'])
Melt —
wide table to long table for writing back data
Frequently, in Kaggle competitions or actual projects, we are required to convert data into a long format table after analysis or prediction. In the actual project, all data need to be written back to the database.
Since a long format table can be converted to a wide format table, how can a wide format table be converted to a long format table? Melt
it.
In the arguments of the Melt
function, id_vars
indicates that the column does not change, and other columns will be melted into two columns (name
, value
) in pairing.
cate_prod_wide_df.reset_index().melt(id_vars=['Cate','Prod'],var_name=['Qty/Price','Date']).dropna()
We can here that we did transform the data back to the long format. Unlike the original one, this time the table is longer and it also melts Qty and Price into one column.
Advanced Function — wide_to_long
Is it possible to melt back the original data format? Yes, but it takes a little effort.
There are two reasons for this effort:
- The original long format table was not trimmed before the wide table
- This example is more complicated because there are multiple levels of index
We can combine the multi-level indexes of Qty and Price into a single level first, and then call the wide_to_long
function.
The arguments in wide_to_long
are very complicated. Basically, the official documents are still ambiguous. Here is a more straightforward explanation.
In our case, the column name with value is prefixed (Qty or Price), which are stubnames
(root name). We have to parse out this "root".
suffix
is telling function how to parse columns by using regular expressions. The timestamp format is d-m-y
, so the corresponding regular is '\d+-\d+-\d+'
i
indicates which columns are indexes.j
represents the column name corresponding to the remaining characters after removing the root variable. In this case it isDate
cate_prod_wide_df.columns = [str(i)+str(j.replace('/','-')) for (i,j) in cate_prod_wide_df.columns]
cate_prod_wide_df.reset_index()
pd.wide_to_long(df=cate_prod_wide_df.reset_index(),stubnames=['Qty','Price'],i=['Cate','Prod'],j='Date',suffix='\d+-\d+-\d+')
More User-friendly Function — Stack
It should be emphasized here that the above long format table looks very similar to the original long format table for humans, but for pandas, it is different.
The row index of the above table is multi-level (Cat
first, then Prod
), which makes indexing and subsequent processing more convenient.
Melt
function for sure is powerful, but not friendly enough for our case.
For our above requirements, the Stack
function can handle it with one click. Stack
is used to rotate the column by 90 degrees and turn it into a row.
cate_prod_wide_df = df.pivot(index=['Cate','Prod'],columns='Date',values=['Qty','Price'])
cate_prod_wide_df.stack()
For the above data, we can also choose a certain level to rotate.
cate_prod_wide_df.stack(level=0)
Unstack —
revert back
Since stack
is to transform columns to rows, then unstack
is to transform rows to columns. Isn't this similar to pivot?
df.unstack()
However, it is not!
For the reason I mentioned above, the original long format table was not trimmed before transforming.
Although the original long format table has hierarchical information, it is not sorted out.
So for the original long format table, we recommend constructing a multi-level index first, and then unstacking it.
df.set_index(['Cate','Prod','Date']).unstack()
We can also rotate for a certain level of index. For example, the following code is consistent with the pivot
function we started with.
df.set_index(['Date','Cate','Prod']).unstack(level=-3)
Pivot_table —
feel lonely
Why should Pivot_table be taken out separately? Because it involves aggregate functions, that is, it is not “transformation
", but "aggregation
". That is, after using it, you may not be able to go back, because it has made statistics.
Going back to our beginning case, we mentioned that using Pivot
shall ensure that the value of each variable with the last level is unique.
What if there are multiple values? For example, what if we only use Cate
to convert wide format to long format? Then you need an upgraded version of Pivot - Pivot_table
it is.
It allows the aggregation of multiple values, and the aggregation method choosing is up to you: mean, sum, etc.
pt_agg = df.pivot_table(index=['Cate'],columns='Date',values='Qty',aggfunc='mean')
pt_agg
Recap
This post introduces several most commonly used data transformation functions in practical projects, and it is recommended to use them in combination.
- For the original long format table, it is recommended to first establish a row index (multi-level is better), and then you can use
unstack
andstack
at will. - For the original long format table, if you don’t want to create a row index, you can directly
pivot
orpivot_table
, which helps not only set index ready, but also convert table into a wide format for analysis melt
is a lightweight version ofwide_to_long
wide_to_long
is used to convert data that appears to be a single-level index, but is actually a multi-level index. The column name is generallyAAAxxx_BBBxxx_CCCxxx