Chapter 1 — DataFrames
A convention for tabular data
The purpose of this course is to impart the core skills you need to manipulate data stored in tables. Although “manipulating data stored in tables” may seem limited in scope, you will soon discover how flexible and potent this paradigm is.
We start with a convention for how to represent data stored in a table. The idea is that a table stores information about a collection of objects with attributes: a row in a table corresponds to an object, and a column corresponds to an attribute of the objects in the table. We will refer to these objects (rows) as observations and their attributes (columns) as variables. This convention is so useful and common in data analytics that we will take it for granted here and only work with data in this format. If you are interested in reading more, see Wickham on Tidy data.
In Pandas, a table of data is called a DataFrame. Technically, a DataFrame is a dictionary of Pandas Series objects, each corresponding to a column in the table. A Series is similar to a list or a Numpy array, but with row names and a label indicating the name of the Series. Because of this, commands you learn to manipulate arrays can be applied to Series and DataFrames as well. (In fact, Pandas Series, Numpy arrays, and regular Python lists are sometimes referred to generally as array-likes, and can be used interchangeably in Pandas.)
When you load data into a Pandas DataFrame, it lives in Python’s memory, so it is important to have a way to actually look at that data. If you’re using Anaconda Spyder, the Variable Explorer lets you do this directly. For the following examples, we’ll assume a DataFrame has been loaded into memory and stored in a variable called df, containing rows from the match_player_basic table. Each row contains end-of-match statistics for one of the participants in a match.
Here are a few useful Python commands that will help you when you’re starting to get familiar with a new dataset.
- df.head() — returns the first 5 rows of df
- df.tail() — returns the last 5 rows of df
- df.describe() — gives summary statistics on each variable in df
- df['var_name'] — accesses the column with name var_name in df. The example below uses the variable teamId as var_name.
- df['var_name'].value_counts() — returns a frequency table for var_name in df. The example below uses the variable teamId as var_name.
- df.sort_values(by='var_name') — sorts by var_name, returns a new DataFrame. The example below uses the variable kills as var_name.
Each of these commands returns a DataFrame or Series based on df. So, if you want to store a new DataFrame in memory that consists of the first 5 rows of df, you could use new_df = df.head().
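To make these commands concrete, here is a minimal, runnable sketch. The DataFrame below is a small, made-up stand-in for match_player_basic (the real table has many more rows and columns), so the values are illustrative only:

```python
import pandas as pd

# A small, made-up stand-in for the match_player_basic table
df = pd.DataFrame({
    'champion': ['Teemo', 'Zed', 'Teemo', 'Zed', 'Ashe', 'Teemo'],
    'kills':    [2, 7, 4, 1, 5, 3],
    'deaths':   [5, 2, 8, 6, 3, 4],
})

first_three = df.head(3)                 # first 3 rows (default is 5)
counts = df['champion'].value_counts()   # frequency table for 'champion'
by_kills = df.sort_values(by='kills')    # new DataFrame sorted by kills

print(first_three)
print(counts)
print(by_kills)
```

Note that .head() accepts an optional row count, and each command returns a new object rather than changing df in place.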
On the other hand, there are times when you’ll only want to take a look at the data and not need to store it in memory. In that case, all you have to do is enter your command directly into the IPython console. When you do this, Python will print a human-readable string representation of the returned object. So, if you aren’t interested in storing the top 5 rows of your DataFrame but you do want to take a quick look at it, just type df.head() into the console. In fact, you can type almost any object into the console (including simply df) and Python will print a string representation of the object for you! With that in mind, here are a few more commands you may find useful.
- list(df) — returns the column names of the dataframe as a list
- df.T — returns a transposed version of df (rows become columns, columns become rows)
- df.dtypes — gives the data type of each variable in the dataframe
- df.shape — returns a tuple with (# of observations, # of variables)
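These inspection commands can be tried the same way on a small made-up table:

```python
import pandas as pd

# A small, made-up stand-in for the match data
df = pd.DataFrame({
    'champion': ['Teemo', 'Zed', 'Ashe'],
    'kills':    [2, 7, 5],
})

col_names = list(df)   # column names as a list
transposed = df.T      # rows become columns, columns become rows
types = df.dtypes      # data type of each column
dims = df.shape        # tuple: (# of observations, # of variables)

print(col_names, dims)
```
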
Operations on variables
In keeping with our new paradigm for representing tabular data, we will typically think about operations on our table in terms of operations on variables. Pandas (and Numpy) makes it easy and efficient to do these kinds of operations. For example, suppose we have a DataFrame df that stores kill and assist information for players in a game, and we want to calculate each player’s involvement in kills in that game. To do this, we create a new variable called 'kill_participation', which is equal to the number of kills plus the number of assists a player got in a game. Pandas makes this kind of calculation convenient: instead of having to add the kill and assist values for each game individually, we can do it all at once with one command.
The syntax to create this new column and include it in df is simply:
df['kill_participation'] = df['kills'] + df['assists']
If you don’t want to store the Series as a new column in the DataFrame df, you could instead write
my_series = df['kills'] + df['assists']
which would store the resulting Series in a new Series object called my_series.
Other arithmetic operations like multiplication and division work similarly. In fact, you can apply all kinds of functions to DataFrame columns, taking advantage of Numpy’s speed with vectorized operations: operations that act on all elements of an array simultaneously.
Suppose we wanted to create a new variable that measures if a player died a lot in a game. Using df['deaths'].describe(), we find that the median number of deaths in our dataset is 5. Let’s create a variable that is 1 if a player dies more than 5 times, and is 0 otherwise.
The syntax for this transformation is
df['over_5_deaths'] = (df['deaths'] > 5).astype(int)
and it works in two steps. First, df['deaths'] > 5 creates a Series of Boolean (True/False) values corresponding to the rows where the number of deaths was greater than 5; then .astype(int) says to interpret these Booleans as integers, which recasts True to 1 and False to 0. An equivalent way to do this would be to multiply by 1:
df['over_5_deaths'] = 1*(df['deaths'] > 5)
Even though 1 is a single integer value and (df['deaths']>5) is a Series with many values, Pandas and Numpy interpret this command as multiplying 1 with every entry of (df['deaths']>5), just like we want.
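Both recipes can be checked side by side on a made-up deaths column; they produce identical results:

```python
import pandas as pd

df = pd.DataFrame({'deaths': [3, 8, 5, 10, 0]})

via_astype = (df['deaths'] > 5).astype(int)  # True/False recast to 1/0
via_multiply = 1 * (df['deaths'] > 5)        # 1 multiplied with every Boolean entry

print(list(via_astype))  # [0, 1, 0, 1, 0]
```
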
Of course, if we wanted to do this in two steps we could, like so:
tmp = df['deaths'] > 5 # a Series of Booleans
df['over_5_deaths'] = tmp.astype(int) # convert to int and store in df
Finally, there is another way: we can write our own function and then apply it to each element of the Series with .apply(). It’s possible to do this with a regular Python function, as shown here:
def check_if_over_five(x):
    return int(x > 5)

df['over_5_deaths'] = df['deaths'].apply(check_if_over_five)
Notice that none of these operations involve writing a for loop that iterates over rows in the DataFrame. If you ever find yourself writing code that iterates through individual entries of a DataFrame, stop and think carefully about whether there is a vectorized way to do it instead!
However, there’s a more succinct way to do this, using Python’s lambda functions. This is a way for us to define a function in line, without having to give it a name. The syntax for this is simply:
df['over_5_deaths'] = df['deaths'].apply(lambda x: int(x > 5))
Note: You are welcome to store a lambda function if you like. The syntax would look like check_if_over_five = lambda x: int(x>5). However, the point is that you don’t have to.
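Putting the named-function and lambda versions side by side on a made-up deaths column shows they behave identically:

```python
import pandas as pd

df = pd.DataFrame({'deaths': [3, 8, 5, 10, 0]})

# Named-function version
def check_if_over_five(x):
    return int(x > 5)

named_version = df['deaths'].apply(check_if_over_five)

# Equivalent in-line lambda version
lambda_version = df['deaths'].apply(lambda x: int(x > 5))

print(list(named_version))  # [0, 1, 0, 1, 0]
```
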
Often, we will want to work with only part of our data, such as a specific column or collection of rows. In Pandas, subsetting (or “slicing”) DataFrames is done using the .loc and .iloc commands, which provide flexible ways to specify which rows and columns to keep.
- df.loc[row_labels, col_labels] — returns a subset of df with specified row and column names given by row_labels and col_labels
- df.iloc[row_indices, col_indices] — returns a subset of df with specified row and column indices given by row_indices and col_indices
To apply no restriction, use the : character. So df is the same as df.loc[:,:], which is the same as df.iloc[:,:]. Furthermore, .loc is flexible in that it lets us enter a Series of True/False values indicating which rows to keep, rather than row names themselves.
To put these all together, suppose we wanted to see the kill, death, and assist values for all Teemo games.
To subset our DataFrame accordingly, we would use:
rows_to_keep = df['champion'] == 'Teemo' # a Boolean Series
cols_to_keep = ['kills', 'deaths', 'assists'] # a list
my_subset = df.loc[rows_to_keep, cols_to_keep]
STEP 1: Determine rows to keep
This line creates a Series of values that are True when Teemo is the champion played, and False otherwise.
df['champion'] == 'Teemo'
STEP 2: Determine columns to keep
The second line creates a list of strings corresponding to column names to keep.
STEP 3: Return only those rows and columns
Then, .loc is used to return only those rows and columns.
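The three steps above can be sketched end to end on a made-up table (the real data has many more rows and columns, so the values here are illustrative only):

```python
import pandas as pd

# Made-up stand-in rows for the match data
df = pd.DataFrame({
    'champion': ['Teemo', 'Zed', 'Teemo', 'Ashe'],
    'kills':    [2, 7, 4, 5],
    'deaths':   [5, 2, 8, 3],
    'assists':  [9, 4, 6, 11],
})

rows_to_keep = df['champion'] == 'Teemo'       # Step 1: Boolean Series
cols_to_keep = ['kills', 'deaths', 'assists']  # Step 2: list of column names
my_subset = df.loc[rows_to_keep, cols_to_keep] # Step 3: subset with .loc

print(my_subset)
```
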
This last step could be done in one line, like so:
my_subset = df.loc[df['champion'] == 'Teemo', ['kills', 'deaths', 'assists']]
Alternatively, this could be done with .iloc by specifying the indices of the desired columns and rows instead. Remembering that Python starts counting at 0, we specify:
rows_to_keep = [0, 2] # example: the integer positions of the Teemo rows (these depend on your data)
cols_to_keep = [4, 5, 6]
my_subset = df.iloc[rows_to_keep, cols_to_keep]
Or, more succinctly:
my_subset = df.iloc[[0, 2], [4, 5, 6]] # example row and column positions written in line
Pandas is flexible in how it lets us filter our data. If we are only filtering by rows or columns (not both), we can omit the .loc part. Here are two examples of equivalent commands:
- df.loc[:, 'my_col'] and df['my_col'] — returns the entire column 'my_col' from df
- df.loc[df.champion=='Zed', :] and df[df.champion=='Zed'] — returns rows where the champion is Zed
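You can verify both equivalences on a small made-up table using the .equals() method, which checks whether two Pandas objects contain the same values:

```python
import pandas as pd

df = pd.DataFrame({
    'champion': ['Teemo', 'Zed', 'Zed'],
    'my_col':   [1, 2, 3],
})

# Column selection: these two forms return the same Series
same_col = df.loc[:, 'my_col'].equals(df['my_col'])

# Row filtering: these two forms return the same DataFrame
same_rows = df.loc[df.champion == 'Zed', :].equals(df[df.champion == 'Zed'])

print(same_col, same_rows)  # True True
```
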
Technical note on subsetting data
The command my_subset = df.loc[df.champion=='Zed', :] creates a new DataFrame my_subset, but the data stored in that DataFrame is actually not distinct from the data in df. Instead, my_subset is merely a reference to some portion of the data located in df. If you modify my_subset, Pandas will warn you: “A value is trying to be set on a copy of a slice from a DataFrame”, because the values in my_subset are actually the same values as in df (they occupy the same location in your computer’s memory).
The appropriate way to deal with this warning depends on your goal. If you want to modify a subset of the values in df, you can do so directly by using:
df.loc[df.champion == 'Zed', 'kills'] = 0 # for example, assigning new values through .loc in a single step
If you want to work with a new table that’s not linked to df, then create a copy like so:
my_subset = df.loc[df.champion == 'Zed', :].copy()
For more on this, see the Pandas documentation for returning a view versus a copy.
The Pandas documentation for indexing and selecting data provides more details.
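A small sketch of the difference, using made-up data: modifying a .copy() leaves df alone, while assigning through df.loc in a single step changes df itself:

```python
import pandas as pd

df = pd.DataFrame({
    'champion': ['Teemo', 'Zed', 'Zed'],
    'kills':    [2, 7, 4],
})

# .copy() gives an independent table, safe to modify without warnings
my_subset = df.loc[df.champion == 'Zed', :].copy()
my_subset['kills'] = 0   # df is untouched by this assignment

# Modifying df directly goes through .loc in a single step
df.loc[df.champion == 'Zed', 'kills'] = 99

print(list(df['kills']))  # [2, 99, 99]
```
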
The accompanying dataset for the first two chapters is match_player_basic.csv, a modified subset of match_player.csv to get you started. It stores end-of-game statistics on each player for the first 200 matches in match_player.csv. This is the same dataset that we have been using so far in this chapter.
Five rows of the dataset are shown below.
In this table, the observations are players-in-games because each row corresponds to a particular player in a particular game. The variables are different properties of those players in games, like which champion they were playing, whether or not they won, etc.
Starting with the provided sample code, load the dataset in match_player_basic.csv into a DataFrame called df. Use the commands given earlier in this document to get an idea of what data is stored in the table, then try the following exercises.
Play rates: Create a table of the top 10 most played champions using the following commands:
- .value_counts() — to count the number of times each champion was played
- .sort_values(ascending = False) — to sort the result; note that you don’t need to specify the by= argument because .value_counts() returns a Series (one column), not a DataFrame
- .head(10) — to get the top 10 rows
You will need to select the column corresponding to champions played in order to apply the methods.
Kill-death differentials: Add a new column to df called 'kd_diff' which contains the kill-death differential for a player in a game, defined as kills – deaths, then:
- Subset df using .loc to return only rows corresponding to your favorite champion (e.g. Cassiopeia), then use .describe() to get summary statistics on their kill differential.
- Use .describe() to find the maximum kill differential in the dataset, then use .loc to subset the DataFrame to only those games in which the maximum kill differential occurred. Take a look at the data for these games. Can you get a sense of what happened?