Chapter 3: Merge

When analyzing data, it’s often necessary to combine information from multiple sources into a single dataset. The primary operation for doing this is called a merge (or in other languages, like SQL, a join) and is a core part of a data programmer’s toolkit.

For example, in match_player.csv, champions are represented by ID numbers instead of names, like in Table 1 below. For the final product of a data analysis, we’d prefer to display champion names instead of ID numbers, so we need a way to add the champion names to our original table.

Specifically, for each observation in Table 1, we need to use championId to look up the corresponding champion name in Table 2 and add it as a new variable in Table 1.

championId	kills	deaths	goldEarned
117	4	8	10728
266	1	6	10492
84	7	3	12242
40	4	4	9630
92	9	5	14570

Table 1 (df1)

championId	champion
40	Janna
76	Nidalee
84	Akali
92	Riven
117	Lulu
266	Aatrox

Table 2 (df2)

Merge

In general, a merge is an operation that combines two tables together based on keys, where keys are the values of a variable (or variables) existing in both tables. The result of the merge operation will include variables from both tables, combined on observations that share the same key values.

More precisely, the merge operation is as follows: for each observation in one table, find all the observations in the other table that have the same key and concatenate their variables. This is demonstrated in the example below.

championId	kills	deaths	goldEarned	champion
117	4	8	10728	Lulu
266	1	6	10492	Aatrox
84	7	3	12242	Akali
40	4	4	9630	Janna
92	9	5	14570	Riven

Result of merging Table 1 (df1) and Table 2 (df2)

championId	champion
40	Janna
76	Nidalee
84	Akali
92	Riven
117	Lulu
266	Aatrox

Table 2 (df2)

If an observation from one table shares the same key with multiple observations in the other table, this will result in multiple observations in the output.

With Python’s merge function, it only takes one line of code to combine Table 1 and Table 2 with variable championId as the key.

df = df1.merge(df2, how='left', on='championId')

The resulting output is shown below.

championId	kills	deaths	goldEarned	champion
117	4	8	10728	Lulu
266	1	6	10492	Aatrox
84	7	3	12242	Akali
40	4	4	9630	Janna
92	9	5	14570	Riven

Table 3: The result of merging Table 1 and Table 2 on variable championId

In the example above, the merge method called on df1 takes three parameters.

df2: This specifies that DataFrame df1 will be merged with DataFrame df2. By convention, we refer to df1 as the “left” table and df2 as the “right” table based on their positions in the code.
how = 'left': This determines which rows will be kept in the case of unmatched keys. (We’ll explain how this works in the next section.)
on = 'championId': This specifies that championid is our key variable, meaning that observations from the two tables will be matched together if they have the same value of championId.

We can define key values based on multiple variables, in which case observations will be matched if all key variables share the same value. For example, suppose we want to merge two tables on two variables, matchId and teamId. The following code will do the trick.

df = df1.merge(df2, how='left', on=['matchId', 'teamId'])

If the variables on which we want to merge have different variable names in their respective tables, we will use two alternative parameters, left_on and right_on, instead of on. For example, this might look like:

df = df1.merge(df2, how='left', left_on='championId',

							right_on='killerId')

This syntax will be useful for working with the match_kills.csv table, where championIds of killing champions are stored as killerId. Of course, you can still use the left_on and right_on parameters to specify the join variables even if they have the same name in each table.

Deciding what observations to keep

When performing a merge, it’s often the case that there are observations in one table whose key values don’t match any of the key values for observations in the other table. In the original merge example above, the last observation in Table 2 is for Nidalee with championId 76, but Table 1 doesn’t have an observation with championId 76. In other words, the observation for Nidalee in Table 2 has an unmatched key.

For this reason, we need to specify what to do with observations that have unmatched keys: we can either keep them, or get rid of them. In Pandas, this is done with the how argument of merge, which can be specified as either 'left', 'right', 'inner', or 'outer'. Each of these options decides which tables keep their observations with unmatched keys. The four options are illustrated on a merge of the two tables below.

Left merge: Keeps all observations in the left table, discarding observations in the right table that have unmatched keys
Right merge: Keeps all observations in the right table, discarding observations in the left table that have unmatched keys
Inner merge: Keeps only observations with matching keys
Outer merge: Keeps all observations

championId	kills	deaths	goldEarned
5	10	5	14217
7	5	0	11247
40	4	4	9630
84	7	3	12242
131	14	6	15922
222	6	7	12193

Left table

championId	name
2	Olaf
7	LeBlanc
21	Miss Fortune
40	Janna
131	Diana
412	Thresh

Right table

championId	kills	deaths	goldEarned	name
5	10	5	14217	NaN
7	5	0	11247	LeBlanc
40	4	4	9630	Janna
84	7	3	12242	NaN
131	14	6	15922	NaN
222	6	7	12193	NaN

championId	kills	deaths	goldEarned	name
7	5	0	11247	LeBlanc
40	4	4	9630	Janna
2	NaN	NaN	NaN	Olaf
21	NaN	NaN	NaN	Miss Fortune
154	NaN	NaN	NaN	Zac
412	NaN	NaN	NaN	Thresh

championId	kills	deaths	goldEarned	name
7	5	0	11247	LeBlanc
40	4	4	9630	Janna

championId	kills	deaths	goldEarned	name
5	10	5	14217	NaN
7	5	0	11247	LeBlanc
40	4	4	9630	Janna
84	7	3	12242	NaN
131	14	6	15922	NaN
222	6	7	12193	NaN
2	NaN	NaN	NaN	Olaf
21	NaN	NaN	NaN	Miss Fortune
154	NaN	NaN	NaN	Zac
412	NaN	NaN	NaN	Thresh

The result of a left merge

When an observation with an unmatched key is kept, the variables from the other table will have missing values. In Pandas, these missing values are represented by an object called “Not a Number” or NaN object.

If you find missing values in your dataset, make sure you understand why they are there, and consider taking a look at the Pandas documentation for working with missing data.

Data levels after merge

When performing a merge, the data level of the resulting table will depend on the data level of the two input tables, as well as the merge key. Because of this, you will sometimes hear people describe three kinds of merges:

One-to-one merge: Each table has observations uniquely identified by the merge keys (as shown in the example above). In this case, the data level of the resulting table will match that of the original tables.
One-to-many (or many-to-one) merge: One table has observations uniquely identified by merge keys while another table has cases of multiple observations with the same key values. In this case, the data level of the resulting table will match that of the table with duplicate keys.
Many-to-many merge: Both tables have multiple observations with the same keys. In this case, the resulting table will have a new data level. An example is included below. In this example, after merging a (player, champion)-level table with a (player, item)-level using player as the key, we get a (player, champion, item)-level table.

player	champion
Doublelift	Lucian
Doublelift	Vayne
Uzi	Ezreal
Uzi	Miss Fortune

(player, champion)-level table

player	firstItem
Doublelift	Blade of the Ruined King
Uzi	Infinity Edge
Uzi	Rapid Firecannon
Uzi	Statikk Shiv

(player, item)-level table

player	champion	firstItem
Doublelift	Lucian	Blade of the Ruined King
Doublelift	Vayne	Blade of the Ruined King
Uzi	Ezreal	Infinity Edge
Uzi	Ezreal	Rapid Firecannon
Uzi	Ezreal	Statikk Shiv
Uzi	Miss Fortune	Infinity Edge
Uzi	Miss Fortune	Rapid Firecannon
Uzi	Miss Fortune	Statikk Shiv

(player, champion, item)-level table

In this way, we can see all the possible champion-item combinations for Doublelift and Uzi.

Data

Starting in this chapter, we will use the full version of match_player.csv. Three differences between match_player.csv and match_player_basic.csv are worth noting:

The champion variable in match_player_basic.csv is now replaced with championId, which gives a unique ID number for each champion.
There are a few more variables in match_player.csv.
There are 28,000 matches in match_player.csv, instead of 200 matches as you have used in match_player_basic.csv.

In addition, for the last task of this chapter, you will need to use match_kills.csv. This dataset shows each kill within a match and gives information about the kill. Five rows are shown below:

matchId	tEvent	killerId	victimId	positionX	positionY
3321872453	4.00	8	2	6828	9837
3321872453	6.74	8	2	5108	4612
3321872453	7.26	4	7	4288	4116
3321872453	10.00	4	7	3275	1680
3321872453	12.33	4	6	1631	1271

Note:

The tEvent variable records the time at which the kills occur.
The killerId and victimId variables correspond to the participantId in match_player.csv and are unrelated to championId.
The positionX and positionY variables record the locations of the kills.

Therefore, the first observation means that player 9 kills player 1 at location (1982, 9837) at 2.45 minutes.

Tasks

Merge properties. You may find these questions helpful for improving your understanding of merge.
1. If you had access to the usual Pandas operations but were limited to only using outer merges (not left, right, or inner), would you be able to perform all the data operations from this chapter? If so, how? If not, why not?
2. If you merge the (person, champion, item)-level table above to itself on person, how many rows will there be in the resulting table?
3. Can you change this code so it uses a left merge instead of a right, but returns the same table? It’s okay if the columns are in a different order.
  df1.merge(df2, how="left, on="key")
Early game kills. Create a champion-level table that lists the average number of pre-20 minute kills for each champion. Hint: Start by merging match_kills with match_player to get championId on the kill data. Note that killerId and participantId should be matched.

Champion nemeses. We are curious about which champions tend to kill other specific champions.

Using both match_player.csv and match_kills.csv, create a table where each row corresponds to a pair of champions, showing how many times the first champion killed the second champion. The first two rows of the resulting table should look like this:

killer	victim	killCount
Ezreal	Caitlyn	502
Jhin	Caitlyn	478
	...
	...
	...

The resulting table should have 21,021 rows.

If your goal was to understand which champions have the greatest tendency to kill other individual champions, how might this table be misleading? What information could you add to the table to make it more helpful?
Optional: Make a new version of the table in part (a) that includes the information from part (b).

Chapter 3 — Merge

Merge

Deciding what observations to keep

Data levels after merge

Data

Tasks