Chapter 3 — Merge
When analyzing data, it’s often necessary to combine information from multiple sources into a single dataset. The primary operation for doing this is called a merge (or in other languages, like SQL, a join) and is a core part of a data programmer’s toolkit.
For example, in match_player.csv, champions are represented by ID numbers instead of names, like in Table 1 below. For the final product of a data analysis, we’d prefer to display champion names instead of ID numbers, so we need a way to add the champion names to our original table.
Specifically, for each observation in Table 1, we need to use championId to look up the corresponding champion name in Table 2 and add it as a new variable in Table 1.
In general, a merge is an operation that combines two tables together based on keys, where keys are the values of a variable (or variables) existing in both tables. The result of the merge operation will include variables from both tables, combined on observations that share the same key values.
More precisely, the merge operation is as follows: for each observation in one table, find all the observations in the other table that have the same key and concatenate their variables. This is demonstrated in the example below.
If an observation from one table shares the same key with multiple observations in the other table, this will result in multiple observations in the output.
With Python’s merge function, it only takes one line of code to combine Table 1 and Table 2 with variable championId as the key.
The resulting output is shown below.
In the example above, the merge method called on df1 takes three parameters.
- df2: This specifies that DataFrame df1 will be merged with DataFrame df2. By convention, we refer to df1 as the “left” table and df2 as the “right” table based on their positions in the code.
- how = 'left': This determines which rows will be kept in the case of unmatched keys. (We’ll explain how this works in the next section.)
- on = 'championId': This specifies that championid is our key variable, meaning that observations from the two tables will be matched together if they have the same value of championId.
We can define key values based on multiple variables, in which case observations will be matched if all key variables share the same value. For example, suppose we want to merge two tables on two variables, matchId and teamId. The following code will do the trick.
If the variables on which we want to merge have different variable names in their respective tables, we will use two alternative parameters, left_on and right_on, instead of on. For example, this might look like:
This syntax will be useful for working with the match_kills.csv table, where championIds of killing champions are stored as killerId. Of course, you can still use the left_on and right_on parameters to specify the join variables even if they have the same name in each table.
Deciding what observations to keep
When performing a merge, it’s often the case that there are observations in one table whose key values don’t match any of the key values for observations in the other table. In the original merge example above, the last observation in Table 2 is for Nidalee with championId 76, but Table 1 doesn’t have an observation with championId 76. In other words, the observation for Nidalee in Table 2 has an unmatched key.
For this reason, we need to specify what to do with observations that have unmatched keys: we can either keep them, or get rid of them. In Pandas, this is done with the how argument of merge, which can be specified as either 'left', 'right', 'inner', or 'outer'. Each of these options decides which tables keep their observations with unmatched keys. The four options are illustrated on a merge of the two tables below.
- : Keeps all observations in the left table, discarding observations in the right table that have unmatched keys
- : Keeps all observations in the right table, discarding observations in the left table that have unmatched keys
- : Keeps only observations with matching keys
- : Keeps all observations
When an observation with an unmatched key is kept, the variables from the other table will have missing values. In Pandas, these missing values are represented by an object called “Not a Number” or NaN object.
If you find missing values in your dataset, make sure you understand why they are there, and consider taking a look at the Pandas documentation for working with missing data.
Data levels after merge
When performing a merge, the data level of the resulting table will depend on the data level of the two input tables, as well as the merge key. Because of this, you will sometimes hear people describe three kinds of merges:
- One-to-one merge: Each table has observations uniquely identified by the merge keys (as shown in the example above). In this case, the data level of the resulting table will match that of the original tables.
- One-to-many (or many-to-one) merge: One table has observations uniquely identified by merge keys while another table has cases of multiple observations with the same key values. In this case, the data level of the resulting table will match that of the table with duplicate keys.
- Many-to-many merge: Both tables have multiple observations with the same keys. In this case, the resulting table will have a new data level. An example is included below. In this example, after merging a (player, champion)-level table with a (player, item)-level using player as the key, we get a (player, champion, item)-level table.
|Doublelift||Blade of the Ruined King|
|Doublelift||Lucian||Blade of the Ruined King|
|Doublelift||Vayne||Blade of the Ruined King|
|Uzi||Miss Fortune||Infinity Edge|
|Uzi||Miss Fortune||Rapid Firecannon|
|Uzi||Miss Fortune||Statikk Shiv|
In this way, we can see all the possible champion-item combinations for Doublelift and Uzi.
Starting in this chapter, we will use the full version of match_player.csv. Three differences between match_player.csv and match_player_basic.csv are worth noting:
- The champion variable in match_player_basic.csv is now replaced with championId, which gives a unique ID number for each champion.
- There are a few more variables in match_player.csv.
- There are 28,000 matches in match_player.csv, instead of 200 matches as you have used in match_player_basic.csv.
In addition, for the last task of this chapter, you will need to use match_kills.csv. This dataset shows each kill within a match and gives information about the kill. Five rows are shown below:
- The tEvent variable records the time at which the kills occur.
- The killerId and victimId variables correspond to the participantId in match_player.csv and are unrelated to championId.
- The positionX and positionY variables record the locations of the kills.
Therefore, the first observation means that player 9 kills player 1 at location (1982, 9837) at 2.45 minutes.
- Merge properties. You may find these questions helpful for improving your understanding of merge.
- If you had access to the usual Pandas operations but were limited to only using outer merges (not left, right, or inner), would you be able to perform all the data operations from this chapter? If so, how? If not, why not?
- If you merge the (person, champion, item)-level table above to itself on person, how many rows will there be in the resulting table?
- Can you change this code so it uses a left merge instead of a right, but returns the same table? It’s okay if the columns are in a different order.
df1.merge(df2, how="left, on="key")
- Early game kills. Create a champion-level table that lists the average number of pre-20 minute kills for each champion. Hint: Start by merging match_kills with match_player to get championId on the kill data. Note that killerId and participantId should be matched.
- Champion nemeses. We are curious about which champions tend to kill other specific champions.
- Using both match_player.csv and match_kills.csv, create a table where each row corresponds to a pair of champions, showing
how many times the first champion killed the second champion. The first two rows of the resulting table should look like this:
killer victim killCount Ezreal Caitlyn 502 Jhin Caitlyn 478 ... ... ...
- If your goal was to understand which champions have the greatest tendency to kill other individual champions, how might this table be misleading? What information could you add to the table to make it more helpful?
- Optional: Make a new version of the table in part (a) that includes the information from part (b).
- Using both match_player.csv and match_kills.csv, create a table where each row corresponds to a pair of champions, showing how many times the first champion killed the second champion. The first two rows of the resulting table should look like this: