bloomtech.instructure.com/courses/2357/pages/objective-02-merge-data-using-pandas-merge?module_item_id=660942
1 Users
0 Comments
1 Highlights
0 Notes
Tags
Top Highlights
Objective 02 - Merge Data Using Pandas Merge Overview There are other ways to combine DataFrames, especially when there are overlapping rows. For example, the pandas pd.merge() method uses database-style joins, similar to what you might see in SQL later in this course. Introduction to pd.merge() To become more familiar with this method, let's look at the available arguments and defaults for this function. # Import libraries import pandas as pd import numpy as np # Display the arguments for pd.merge() pd.merge <function pandas.core.reshape.merge.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)> Some of the arguments are similar to what we have seen already with concatenation. But with pd.merge() we can do a lot of different types of joins. Let's summarize some of the different parameters. left - one of the DataFrames to merge right - the other DataFrame to merge how - specifies the kind of merge to make on - the column on which to merge; needs to be in both DataFrames left_on - the column to join on in the left DataFrame right_on - the column to join on in the right DataFrame left_index or right_index - use the index from the left/right DataFrame as the join key(s) Let's start with one of the more basic joins, the same as the pd.concat() column-wise example we completed previously. Follow Along We'll create some more DataFrames for practice. # Create DataFrames df1 = pd.DataFrame({'day': ['Mon', 'Tues', 'Wed', 'Thurs', 'Fri'], 'weather': ['sunny', 'windy', 'rainy', 'cloudy', 'cold']}) df2 = pd.DataFrame({'day': ['Wed', 'Fri', 'Mon', 'Thurs', 'Tues'], 'temp': [88, 75, 62, 65, 53]}) # Display the DataFrames display(df1) display(df2) # Merge them with the default values df3 = pd.merge(df1, df2) display(df3) day weather 0 Mon sunny 1 Tues windy 2 Wed rainy 3 Thurs cloudy 4 Fri cold day temp 0 Wed 88 1 Fri 75 2 Mon 62 3 Thurs 65 4 Tues 53 day weather temp 0 Mon sunny 62 1 Tues windy 53 2 Wed rainy 88 3 Thurs cloudy 65 4 Fri cold 75 Our new DataFrame df3 only has one column for the 'day'; pd.merge() recognized the shared column and merged them on this column. It's also important to note that the order of the days (rows) in each DataFrame doesn't have to be the same for them to merge. There will be cases where the column names don't match, which is where the left_on and right_on parameters are useful. Let's use the same DataFrames as above but instead change the column name on one of them. # Change the column name day -> day_of_week df2_new = pd.DataFrame({'day_of_week': ['Wed', 'Fri', 'Mon', 'Thurs', 'Tues'], 'temp': [88, 75, 62, 65, 53]}) # Merge with left and right columns df4 = pd.merge(df1, df2_new, left_on='day', right_on='day_of_week') display(df4) day weather day_of_week temp 0 Mon sunny Mon 62 1 Tues windy Tues 53 2 Wed rainy Wed 88 3 Thurs cloudy Thurs 65 4 Fri cold Fri 75 There is a redundant column where 'day' and 'day_of_week' both have the same values. We can use the .drop() method to remove right after we join the DataFrames. # Drop the day_of_week column df5 = pd.merge(df1, df2_new, left_on='day', right_on='day_of_week').drop('day_of_week', axis=1) display(df5) day weather temp 0 Mon sunny 62 1 Tues windy 53 2 Wed rainy 88 3 Thurs cloudy 65 4 Fri cold 75 Joins So far, we have not considered a scenario where there are missing values in the columns of data sets we want to merge or join. Let's look at one such example, by creating two new DataFrames containing sample information about pets in an animal rescue site. Let knowingly not include information on some of the animals, and demonstrate what happens with an 'inner' and 'outer' join on those two DataFrames . We'll use a sample of information about pets in an animal rescue, where we won't have all of the data on all of the animals. It will be easier to demonstrate both an 'inner' and 'outer' join with the same DataFrames. # Create the DataFrames df6 = pd.DataFrame({'name': ['Snowball', 'Bob', 'Tweetie', 'Fluffy'], 'type': ['cat', 'rat', 'bird', 'dog']}) df7 = pd.DataFrame({'name': ['Snowball', 'Bob'], 'weight': [400, 749]}) # Display before merging display(df6); display(df7) # Merge in two different ways df8_outer = pd.merge(df6, df7, how='outer') df8_inner = pd.merge(df6, df7, how='inner') # Display after merging display(df8_outer); display(df8_inner) name type 0 Snowball cat 1 Bob rat 2 Tweetie bird 3 Fluffy dog name weight 0 Snowball 400 1 Bob 749 name type weight 0 Snowball cat 400.0 1 Bob rat 749.0 2 Tweetie bird NaN 3 Fluffy dog NaN name type weight 0 Snowball cat 400 1 Bob rat 749 We started with animal names, types, and weights but didn't have weight data for all animals. So the 'outer' join used all of the animals and filled in 'NaN' for the missing weights. This type of join is called a union. The 'inner' join only includes rows with data in all three columns; this type is an intersection. Using the .join method There is also a .join() method that is available with DataFrame objects. This method is a convenient way to join DataFrames without making a full merge() call. Some of the parameters we can use here are: other - the other DataFrame you are joining on - the column or Index on which to join the other DataFrame (default is the Index) how - similar to merge the choices are left, right, outer, inner We need to use set_index on our DataFrames so that they can be joined on those indexes. # Reset indices df1a = df1.set_index('day') df2a = df2.set_index('day') # Join on the new index='day' df1a.join(df2a, on='day') weather temp day Mon sunny 62 Tues windy 53 Wed rainy 88 Thurs cloudy 65 Fri cold 75 Challenge There were a few types of merges that we didn't demonstrate above, so this is a good time to practice the other methods. Using the examples above, you'll create some DataFrames and then merge() using both'inner' and 'outer' joins. Also try out the .join() method and see how it compares to the results from your merges. Additional Resources
Glasp is a social web highlighter that people can highlight and organize quotes and thoughts from the web, and access other like-minded people’s learning.