Pandas : How to Merge Dataframes using Dataframe.merge() in Python – Part 1

In this article we will discuss how to merge different Dataframes into a single Dataframe using Pandas Dataframe.merge() function. Merging is a big topic, so in this part we will focus on merging dataframes using common columns as Join Key and joining using Inner Join, Right Join, Left Join and Outer Join.

Dataframe.merge()

In Python’s Pandas Library Dataframe class provides a function to merge Dataframes i.e.

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

It accepts a hell lot of arguments. Let’s discuss some of them,
Imp Arguments :

  • right : A dataframe or series to be merged with calling dataframe
  • how : Merge type, values are : left, right, outer, inner. Default is ‘inner’. If both dataframes has some different columns, then based on this value, it will be decided which columns will be in the merged dataframe.
  • on : Column name on which merge will be done. If not provided then merged on indexes.
  • left_on : Specific column names in left dataframe, on which merge will be done.
  • right_on : Specific column names in right dataframe, on which merge will be done.
  • left_index : bool (default False)
    • If True will choose index from left dataframe as join key.
  • right_index : bool (default False)
    • If True will choose index from right dataframe as join key.
  • suffixes : tuple of (str, str), default (‘_x’, ‘_y’)
    • Suffex to be applied on overlapping columns in left & right dataframes respectively.

Well these are a lot of arguments and things seems over engineered here. So, let’s discuss each details be small examples one by one.

First of all, let’s create two dataframes to be merged.

Dataframe 1:
This dataframe contains the details of the employees like, ID, name, city, experience & Age i.e.

# List of Tuples
empoyees = [ (11, 'jack', 34, 'Sydney', 5) ,
           (12, 'Riti', 31, 'Delhi' , 7) ,
           (13, 'Aadi', 16, 'New York', 11) ,
           (14, 'Mohit', 32,'Delhi' , 15) ,
           (15, 'Veena', 33, 'Delhi' , 4) ,
           (16, 'Shaunak', 35, 'Mumbai', 5 ),
            (17, 'Shaun', 35, 'Colombo', 11)
            ]

# Create a DataFrame object
empDfObj = pd.DataFrame(empoyees, columns=['ID', 'Name', 'Age', 'City', 'Experience'], index=['a', 'b', 'c', 'd', 'e', 'f', 'h'])

Contents of the first dataframe empDfObj created are,

   ID     Name  Age      City  Experience
a  11     jack   34    Sydney           5
b  12     Riti   31     Delhi           7
c  13     Aadi   16  New York          11
d  14    Mohit   32     Delhi          15
e  15    Veena   33     Delhi           4
f  16  Shaunak   35    Mumbai           5
h  17    Shaun   35   Colombo          11

Dataframe 2:
This dataframe contains the details of the employees like, ID, salary, bonus and experience i.e.

# List of Tuples
salaries = [(11, 5, 70000, 1000) ,
           (12, 7, 72200, 1100) ,
           (13, 11, 84999, 1000) ,
           (14, 15, 90000, 2000) ,
           (15, 4, 61000, 1500) ,
           (16, 5, 71000, 1000),
           (21, 10,81000, 2000)
            ]

# Create a DataFrame object
salaryDfObj = pd.DataFrame(salaries, columns=['ID', 'Experience' , 'Salary', 'Bonus'], index=['a', 'b', 'c', 'd', 'e', 'f', 'h'])

Contents of the second dataframe created are,

   ID  Experience  Salary  Bonus
a  11           5   70000   1000
b  12           7   72200   1100
c  13          11   84999   1000
d  14          15   90000   2000
e  15           4   61000   1500
f  16           5   71000   1000
h  21          10   81000   2000

Now let’s see different ways to merge these two dataframes,

Merge DataFrames on common columns (Default Inner Join)

In both the Dataframes we have 2 common column names i.e. ‘ID’ & ‘Experience’. If we directly call Dataframe.merge() on these two Dataframes, without any additional arguments, then it will merge the columns of the both the dataframes  by considering common columns as Join Keys i.e. ‘ID’ & ‘Experience’ in our case. So, basically columns from both the dataframes will be merged for the rows in which values of ‘ID’ & ‘Experience’ are same i.e.

# Merge two Dataframes on common columns using default inner join
mergedDf = empDfObj.merge(salaryDfObj)

Merged Dataframe mergedDf  contents are:

   ID     Name  Age      City  Experience  Salary  Bonus
0  11     jack   34    Sydney           5   70000   1000
1  12     Riti   31     Delhi           7   72200   1100
2  13     Aadi   16  New York          11   84999   1000
3  14    Mohit   32     Delhi          15   90000   2000
4  15    Veena   33     Delhi           4   61000   1500
5  16  Shaunak   35    Mumbai           5   71000   1000

It merged the contents of the unique columns (salary & bonus) from dataframe 2 with the columns of dataframe 1 based on ‘ID’ & ‘Experience’ columns. Because if we don’t provide the column names on which we want to merge the two dataframes then it by defaults merge on columns with common names. Like, in our case it was ‘ID’ & ‘Experience’.

Also, we didn’t provided the ‘how’ argument in merge() call. Default value of ‘how’ is ‘inner’. It means dataframes are merged like INNER JOIN in Databases.

What is Inner Join ?

While Merging or Joining on columns (keys) in two Dataframes. Include only rows from Left & Right dataframes which have same values in key columns.

In above example key columns on which inner join happened were ‘ID’ & ‘Experience’ columns. So, during inner join only those rows are picked in merged dataframe for which values of ‘ID’ & ‘Experience’ columns are same in 2 dataframes. So basically by default Inner Join was done by using intersection of keys in both the dataframes.

Results will be same if we explicitly pass ‘how’ argument with value ‘inner’ i.e.

# Merge two Dataframes on common columns using inner join
mergedDf = empDfObj.merge(salaryDfObj, how='inner')

Merge Dataframes using Left Join

What is left join ?

While Merging or Joining on columns (keys) in two Dataframes. Include all rows from Left dataframe and add NaN for values which are
missing in right dataframe for those keys.

In above example if we will pass how argument with value ‘left’ then it will merge two dataframes using left join i.e.

# Merge two Dataframes  on common columns  using left join
mergedDf = empDfObj.merge(salaryDfObj, how='left')

Contents of the merged dataframe :

Contents of the Merged Dataframe :
   ID     Name  Age      City  Experience   Salary   Bonus
0  11     jack   34    Sydney           5  70000.0  1000.0
1  12     Riti   31     Delhi           7  72200.0  1100.0
2  13     Aadi   16  New York          11  84999.0  1000.0
3  14    Mohit   32     Delhi          15  90000.0  2000.0
4  15    Veena   33     Delhi           4  61000.0  1500.0
5  16  Shaunak   35    Mumbai           5  71000.0  1000.0
6  17    Shaun   35   Colombo          11      NaN     NaN

We can see that it picked all rows from left dataframe and there is no row with ‘ID’ 17 and ‘Experience’ 11 in right dataframe. Therefore for that row values of unique Columns from right dataframe (Salary and Bonus) are NaN in merged dataframe.

Merge DataFrames using Right Join

What is Right join ?

While Merging or Joining on columns (keys) in two Dataframes. Include all rows from Right dataframe and add NaN for values which are
missing in Left dataframe for those keys.

In above example if we will pass how argument with value ‘right’ then it will merge two dataframes using Right Join i.e.

# Merge two Dataframes  on common columns  using right join
mergedDf = empDfObj.merge(salaryDfObj, how='right')

Contents of the merged dataframe :

   ID     Name   Age      City  Experience  Salary  Bonus
0  11     jack  34.0    Sydney           5   70000   1000
1  12     Riti  31.0     Delhi           7   72200   1100
2  13     Aadi  16.0  New York          11   84999   1000
3  14    Mohit  32.0     Delhi          15   90000   2000
4  15    Veena  33.0     Delhi           4   61000   1500
5  16  Shaunak  35.0    Mumbai           5   71000   1000
6  21      NaN   NaN       NaN          10   81000   2000

We can see that it picked all rows from right dataframe and there is no row with ID 21 and Experience 10 in left dataframe. Therefore for that row values of unique Columns from left dataframe (i.e. Name, Age, City) are NaN in merged dataframe.

Merge DataFrames using Outer Join

What is Outer join ?

While Merging or Joining on columns (keys) in two Dataframes. Include all rows from Right and Left dataframes and add NaN for values which are missing in either Left or Right dataframe for any key.

In above example if we will pass how argument with value ‘outer’ then it will merge two dataframes using Outer Join i.e.

# Merge two Dataframes  on common columns  using outer join
mergedDf = empDfObj.merge(salaryDfObj, how='outer')

Contents of the merged dataframe :

   ID     Name   Age      City  Experience   Salary   Bonus
0  11     jack  34.0    Sydney           5  70000.0  1000.0
1  12     Riti  31.0     Delhi           7  72200.0  1100.0
2  13     Aadi  16.0  New York          11  84999.0  1000.0
3  14    Mohit  32.0     Delhi          15  90000.0  2000.0
4  15    Veena  33.0     Delhi           4  61000.0  1500.0
5  16  Shaunak  35.0    Mumbai           5  71000.0  1000.0
6  17    Shaun  35.0   Colombo          11      NaN     NaN
7  21      NaN   NaN       NaN          10  81000.0  2000.0

We can see that it picked all rows from right & left dataframes and there is no row with,

  • ID 21 and Experience 10 in left dataframe
  • ID 17 and Experience 11 in right dataframe

Therefore for that row NaN is added for missing values in merged dataframe.

Complete example is as follows,

import pandas as pd

def main():

    print('*** Creating Dataframe 1 ***')
    # List of Tuples
    empoyees = [ (11, 'jack', 34, 'Sydney', 5) ,
               (12, 'Riti', 31, 'Delhi' , 7) ,
               (13, 'Aadi', 16, 'New York', 11) ,
               (14, 'Mohit', 32,'Delhi' , 15) ,
               (15, 'Veena', 33, 'Delhi' , 4) ,
               (16, 'Shaunak', 35, 'Mumbai', 5 ),
                (17, 'Shaun', 35, 'Colombo', 11)
                ]

    # Create a DataFrame object
    empDfObj = pd.DataFrame(empoyees, columns=['ID', 'Name', 'Age', 'City', 'Experience'], index=['a', 'b', 'c', 'd', 'e', 'f', 'h'])
    print("Dataframe 1 : ")
    print(empDfObj)

    print('*** Creating Dataframe 2 ***')

    # List of Tuples
    salaries = [(11, 5, 70000, 1000) ,
               (12, 7, 72200, 1100) ,
               (13, 11, 84999, 1000) ,
               (14, 15, 90000, 2000) ,
               (15, 4, 61000, 1500) ,
               (16, 5, 71000, 1000),
               (21, 10,81000, 2000)
                ]

    # Create a DataFrame object
    salaryDfObj = pd.DataFrame(salaries, columns=['ID', 'Experience' , 'Salary', 'Bonus'], index=['a', 'b', 'c', 'd', 'e', 'f', 'h'])
    print("Dataframe 2 : ")
    print(salaryDfObj)


    print('**** Merge two Dataframes on Common Columns using Default Inner Join ****')

    # Merge two Dataframes on common columns using default inner join
    mergedDf = empDfObj.merge(salaryDfObj)
    print('Contents of the Merged Dataframe :')
    print(mergedDf)

    print('**** Merge two Dataframes using Left Join ****')

    # Merge two Dataframes  on common columns  using left join
    mergedDf = empDfObj.merge(salaryDfObj, how='left')
    print('Contents of the Merged Dataframe :')
    print(mergedDf)

    print('**** Merge two Dataframes using Right Join ****')

    # Merge two Dataframes  on common columns  using right join
    mergedDf = empDfObj.merge(salaryDfObj, how='right')
    print('Contents of the Merged Dataframe :')
    print(mergedDf)

    print('**** Merge two Dataframes using Outer Join ****')

    # Merge two Dataframes  on common columns  using outer join
    mergedDf = empDfObj.merge(salaryDfObj, how='outer')
    print('Contents of the Merged Dataframe :')
    print(mergedDf)

if __name__ == '__main__':
  main()

Output:

*** Creating Dataframe 1 ***
Dataframe 1 : 
   ID     Name  Age      City  Experience
a  11     jack   34    Sydney           5
b  12     Riti   31     Delhi           7
c  13     Aadi   16  New York          11
d  14    Mohit   32     Delhi          15
e  15    Veena   33     Delhi           4
f  16  Shaunak   35    Mumbai           5
h  17    Shaun   35   Colombo          11
*** Creating Dataframe 2 ***
Dataframe 2 : 
   ID  Experience  Salary  Bonus
a  11           5   70000   1000
b  12           7   72200   1100
c  13          11   84999   1000
d  14          15   90000   2000
e  15           4   61000   1500
f  16           5   71000   1000
h  21          10   81000   2000
**** Merge two Dataframes on Common Columns using Default Inner Join ****
Contents of the Merged Dataframe :
   ID     Name  Age      City  Experience  Salary  Bonus
0  11     jack   34    Sydney           5   70000   1000
1  12     Riti   31     Delhi           7   72200   1100
2  13     Aadi   16  New York          11   84999   1000
3  14    Mohit   32     Delhi          15   90000   2000
4  15    Veena   33     Delhi           4   61000   1500
5  16  Shaunak   35    Mumbai           5   71000   1000
**** Merge two Dataframes using Left Join ****
Contents of the Merged Dataframe :
   ID     Name  Age      City  Experience   Salary   Bonus
0  11     jack   34    Sydney           5  70000.0  1000.0
1  12     Riti   31     Delhi           7  72200.0  1100.0
2  13     Aadi   16  New York          11  84999.0  1000.0
3  14    Mohit   32     Delhi          15  90000.0  2000.0
4  15    Veena   33     Delhi           4  61000.0  1500.0
5  16  Shaunak   35    Mumbai           5  71000.0  1000.0
6  17    Shaun   35   Colombo          11      NaN     NaN
**** Merge two Dataframes using Right Join ****
Contents of the Merged Dataframe :
   ID     Name   Age      City  Experience  Salary  Bonus
0  11     jack  34.0    Sydney           5   70000   1000
1  12     Riti  31.0     Delhi           7   72200   1100
2  13     Aadi  16.0  New York          11   84999   1000
3  14    Mohit  32.0     Delhi          15   90000   2000
4  15    Veena  33.0     Delhi           4   61000   1500
5  16  Shaunak  35.0    Mumbai           5   71000   1000
6  21      NaN   NaN       NaN          10   81000   2000
**** Merge two Dataframes using Outer Join ****
Contents of the Merged Dataframe :
   ID     Name   Age      City  Experience   Salary   Bonus
0  11     jack  34.0    Sydney           5  70000.0  1000.0
1  12     Riti  31.0     Delhi           7  72200.0  1100.0
2  13     Aadi  16.0  New York          11  84999.0  1000.0
3  14    Mohit  32.0     Delhi          15  90000.0  2000.0
4  15    Veena  33.0     Delhi           4  61000.0  1500.0
5  16  Shaunak  35.0    Mumbai           5  71000.0  1000.0
6  17    Shaun  35.0   Colombo          11      NaN     NaN
7  21      NaN   NaN       NaN          10  81000.0  2000.0

 

 

1 thought on “Pandas : How to Merge Dataframes using Dataframe.merge() in Python – Part 1”

  1. Pingback: Pandas : Merge Dataframes on specific columns or on index in Python – Part 2 – thispointer.com

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Scroll to Top