I have been trying to wrap my head around merge for a while:
I have the following dataframes:
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR', 'Location': 'State Street'}, {'Name': 'Sally', 'Role': 'Course liasion', 'Location': 'Washington Avenue'}, {'Name': 'James', 'Role': 'Grader', 'Location': 'Washington Avenue'}])
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business', 'Location': '1024 Billiard Avenue'}, {'Name': 'Mike', 'School': 'Law', 'Location': 'Fraternity House #22'}, {'Name': 'Sally', 'School': 'Engineering', 'Location': '512 Wilson Crescent'}])I understand that I can merge them in more ways than one:
pd.merge(staff_df, student_df, how='left', left_on='Name', right_on='Name')
pd.merge(student_df, staff_df, how='left', left_on='Name', right_on='Name')
pd.merge(staff_df, student_df, how='right', left_on='Name', right_on='Name')
pd.merge(student_df, staff_df, how='right', left_on='Name', right_on='Name')Each produces a slightly different output. Can someone guide me on the proper way to understand how each output is constructed?
Specifically,
- Why are the role and school columns always between location_y?
- When is the role column beside the name column and when is the school column beside the name column?
I would hold off asking about using left_index and right_on in the same merge statement.
Thanks.
21 Answer
I suggest you to go through the documentation to understand the merging operation properly. It is well documented with examples. Counldn't think of much simpler explanation. Documentation for merging
From documentation
left_on: Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame
right_on: Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame
Why are the role and school columns always between location_y?
After merging columns will be sorted. To check that change a column name that starts with letter earlier than L in the second df parameter of pd.merge.
pd.merge(student_df, staff_df, how='left', left_on='Name', right_on='Name')Location_x Name School Location_y Role 0 1024 Billiard Avenue James Business Washington Avenue Grader 1 Fraternity House #22 Mike Law NaN NaN 2 512 Wilson Crescent Sally Engineering Washington Avenue Course liasion
Example if Role is Bole
Location_x Name School Bole Location_y 0 1024 Billiard Avenue James Business Grader Washington Avenue 1 Fraternity House #22 Mike Law NaN NaN 2 512 Wilson Crescent Sally Engineering Course liasion Washington Avenue
Instead of left_on and right_on two parameters you can use on which will match the keys from both the dataframe. i.e
pd.merge(student_df, staff_df, how='left', on='Name') When is the role column beside the name column and when is the school column beside the name column?
It depends on the priority of df you give. If your specify staff_df first then the columns will be concatenated column wise after the staff_df. So Role will be beside Name column. Similary if you specify student_df Student will be beside Name column.