Sometimes, we wanna couple multiple dataframes together. In this note, I use df
as DataFrame
, s
as Series
.
Libraries
import pandas as pd
import numpy as np
Coupling dfs with merge()
There are 4 types of merging, like in SQL.
- Inner: only includes elements that appear in both dataframes with a common key.
- Outer: includes all data from both dataframes.
- Left: includes all of the rows from the "left" dataframe along with any rows from the "right" dataframe with a common key; the result retains all columns from both of the original dataframes.
- Right: includes all of the rows from the "right" dataframe along with any rows from the "left" dataframe with a common key; the result retains all columns from both of the original dataframes.
![Untitled](https://prod-files-secure.s3.us-west-2.amazonaws.com/70a67195-bc38-429a-9695-1ad1b42ccec8/66151d2c-1094-485c-96f7-91bd4736e278/Untitled.png)
On the same column name,
# left
df_left = pd.merge(left=df1, right=df2, how='left', on='Col_1', suffixes=('_df1', '_df2'))
# right
df_right = pd.merge(left=df1, right=df2, how='right', on='Col_1', suffixes=('_df1', '_df2'))
|
Col_1 |
Col_2 |
0 |
A |
1 |
1 |
E |
3 |
2 |
C |
NaN |
3 |
D |
NaN |
4 |
B |
2 |
|
Col_1 |
Col_2 |
0 |
A |
1 |
1 |
B |
2 |
2 |
C |
-3 |
3 |
F |
-4 |
4 |
E |
NaN |
|
Col_1 |
Col_2_df1 |
Col_2_df2 |
0 |
A |
1 |
1 |
1 |
E |
3 |
NaN |
2 |
C |
NaN |
-3 |
3 |
D |
NaN |
NaN |
4 |
B |
2 |
2 |
|
Col_1 |
Col_2_df1 |
Col_2_df2 |
0 |
A |
1 |
1 |
1 |
E |
3 |
NaN |
2 |
C |
NaN |
-3 |
3 |
B |
2 |
2 |
4 |
F |
NaN |
-4 |
# inner (defaut)
df_inner = pd.merge(left=df1, right=df2, on='Col_1', suffixes=('_df1', '_df2'))
# outer
df_outer = pd.merge(left=df1, right=df2, how='outer', on='Col_1', suffixes=('_df1', '_df2'))
display_side_by_side(df1, df2, df_inner, df_outer)
|
Col_1 |
Col_2 |
0 |
A |
1 |
1 |
E |
3 |
2 |
C |
NaN |
3 |
D |
NaN |
4 |
B |
2 |
|
Col_1 |
Col_2 |
0 |
A |
1 |
1 |
B |
2 |
2 |
C |
-3 |
3 |
F |
-4 |
4 |
E |
NaN |
|
Col_1 |
Col_2_df1 |
Col_2_df2 |
0 |
A |
1 |
1 |
1 |
E |
3 |
NaN |
2 |
C |
NaN |
-3 |
3 |
B |
2 |
2 |
|
Col_1 |
Col_2_df1 |
Col_2_df2 |
0 |
A |
1 |
1 |
1 |
E |
3 |
NaN |
2 |
C |
NaN |
-3 |
3 |
D |
NaN |
NaN |
4 |
B |
2 |
2 |
5 |
F |
NaN |
-4 |
On the different column names,
# left
df_left = pd.merge(left=df1, right=df2, how='left', left_on='Col_1', right_on='Col_X', suffixes=('_df1', '_df2'))
display_side_by_side(df1, df2, df_left)
|
Col_1 |
Col_2 |
0 |
A |
1 |
1 |
E |
3 |
2 |
C |
NaN |
3 |
D |
NaN |
4 |
B |
2 |
|
Col_X |
Col_2 |
0 |
A |
1 |
1 |
B |
2 |
2 |
C |
-3 |
3 |
F |
-4 |
4 |
E |
NaN |