I want to become familiar with the tools in both the tidyverse and Python. This blog post will highlight some of the typical things people do in R or Python when working with data. It’s not meant as a ‘from scratch’ tutorial, so there are things I won’t explain thoroughly or at all; it’s mainly so I can remember pandas functionality. For now, we will focus on wrangling the data. The next post will cover more data wrangling as well as visualization. To give credit, most of these exercises come from Hadley Wickham’s R for Data Science textbook, which can be found on Amazon or online.
Note: Judging from who follows me on Twitter, I assume the typical reader of this post will be more familiar with R than pandas, so I’ll write this blog post under that assumption. This just means I’ll explain more of the basics in pandas than in R.
The data we will be using is the nycflights13::flights dataset. It is stored as flights.
# flights.info() or flights.head() to see actual values
glimpse(flights) # R
## Observations: 336,776
## Variables: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
A bit about Python DataFrames
Pandas dataframes are similar to R dataframes except for a few things that I will touch upon. Each column is considered something called a Series, which is a one-dimensional array with axis labels (that need not be unique). The axis labels are collectively referred to as the index. If you don’t specify labels, they are just the row numbers. Series and DataFrames can also have hierarchical labels, which is a big difference from R dataframes.
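To make that concrete, here is a minimal sketch (a toy example of mine, not the flights data) of a Series with repeated labels and of a hierarchical index:
import pandas as pd
# a Series is a one-dimensional array with axis labels; labels can repeat
s = pd.Series([10, 20, 30], index=['a', 'a', 'b'])
# a hierarchical (MultiIndex) Series has multiple levels of labels
h = pd.Series([1, 2, 3, 4], index=pd.MultiIndex.from_product([['x', 'y'], [2013, 2014]]))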
Usually, if you have a dataframe called df, you’d do something like this: df.function_name() vs function_name(df). This is because the dataframe has a number of functions specific to it called methods, and we call methods with the above syntax. You’d also see something like pd.function_name(), where pd is short for pandas.
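A toy illustration of the two calling styles (my example, assuming the usual import pandas as pd convention):
import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3]})
df.head()            # a method, called on the dataframe itself
pd.concat([df, df])  # a module-level function, called through pd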
To get certain columns in Pandas, you use the following: df['col_name'], or df[['col_name', 'other_col_name', 'another_col_name']] if you want to get more than one column.
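One detail worth knowing (my aside, not from the original post): single brackets return a Series, while double brackets return a DataFrame.
type(flights['year'])             # pandas.core.series.Series
type(flights[['year', 'month']])  # pandas.core.frame.DataFrame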
Next, we should talk about using iloc or loc. iloc is for integer positioning (focusing on the order of rows or columns), while loc is for label positioning (focusing on the labels/names of rows or columns).
import feather
path = "dat/flights.feather"
flights = feather.read_dataframe(path)
# to make things more apparent, I will change the index to be the time_hour column
flights = flights.set_index(flights['time_hour'])
# selecting the first five rows using iloc
print(flights.iloc[0:5])
## year month day dep_time sched_dep_time \
## time_hour
## 2013-01-01 05:00:00+00:00 2013 1 1 517.0 515
## 2013-01-01 05:00:00+00:00 2013 1 1 533.0 529
## 2013-01-01 05:00:00+00:00 2013 1 1 542.0 540
## 2013-01-01 05:00:00+00:00 2013 1 1 544.0 545
## 2013-01-01 06:00:00+00:00 2013 1 1 554.0 600
##
## dep_delay arr_time sched_arr_time arr_delay \
## time_hour
## 2013-01-01 05:00:00+00:00 2.0 830.0 819 11.0
## 2013-01-01 05:00:00+00:00 4.0 850.0 830 20.0
## 2013-01-01 05:00:00+00:00 2.0 923.0 850 33.0
## 2013-01-01 05:00:00+00:00 -1.0 1004.0 1022 -18.0
## 2013-01-01 06:00:00+00:00 -6.0 812.0 837 -25.0
##
## carrier flight tailnum origin dest air_time \
## time_hour
## 2013-01-01 05:00:00+00:00 UA 1545 N14228 EWR IAH 227.0
## 2013-01-01 05:00:00+00:00 UA 1714 N24211 LGA IAH 227.0
## 2013-01-01 05:00:00+00:00 AA 1141 N619AA JFK MIA 160.0
## 2013-01-01 05:00:00+00:00 B6 725 N804JB JFK BQN 183.0
## 2013-01-01 06:00:00+00:00 DL 461 N668DN LGA ATL 116.0
##
## distance hour minute time_hour
## time_hour
## 2013-01-01 05:00:00+00:00 1400.0 5.0 15.0 2013-01-01 05:00:00+00:00
## 2013-01-01 05:00:00+00:00 1416.0 5.0 29.0 2013-01-01 05:00:00+00:00
## 2013-01-01 05:00:00+00:00 1089.0 5.0 40.0 2013-01-01 05:00:00+00:00
## 2013-01-01 05:00:00+00:00 1576.0 5.0 45.0 2013-01-01 05:00:00+00:00
## 2013-01-01 06:00:00+00:00 762.0 6.0 0.0 2013-01-01 06:00:00+00:00
flights.loc[0:5] would raise an error because the index labels aren’t 0, 1, 2, 3, 4, and 5. In this case, they actually repeat! Instead, we index using loc by giving the name of the label:
import feather
path = "dat/flights.feather"
flights = feather.read_dataframe(path)
flights = flights.set_index(flights['time_hour'])
# use loc and the label value for the index
flights_sub = flights.loc['2013-01-01T06:00:00Z']
# get the unique index values for the subset dataframe
print(flights_sub.index.unique())
## DatetimeIndex(['2013-01-01 06:00:00+00:00'], dtype='datetime64[ns, UTC]', name='time_hour', freq=None)
The same works for indexing into columns as well. We’d just specify the labels or integer positions of the columns as the second argument, e.g. flights.loc[:, ['year', 'month']] or flights.iloc[:, [0, 1]], since year and month are the 0th and 1st columns in the dataframe.
What if we wanted the first 5 rows of the columns ‘year’ and ‘month’? So far, we either need to use iloc and memorize the numerical index of the columns (flights.iloc[0:5, [0, 1]]), or use loc and somehow subset to just the first 5 elements. We can use helper functions to combine both label indexing and positional indexing:
flights.iloc[0:5, ['year', 'month']] fails because ['year', 'month'] is for label indexing
flights.loc[0:5, ['year', 'month']] fails because 0:5 does not match the index labels
import feather
path = "dat/flights.feather"
flights = feather.read_dataframe(path)
flights = flights.set_index(flights['time_hour'])
# if you want to use iloc but specify column names, you can use df.columns.get_loc('col_names')
# or for multiple columns, use df.columns.get_indexer([list of columns]) like below
print(flights.iloc[0:5, flights.columns.get_indexer(['year', 'month'])])
## year month
## time_hour
## 2013-01-01 05:00:00+00:00 2013 1
## 2013-01-01 05:00:00+00:00 2013 1
## 2013-01-01 05:00:00+00:00 2013 1
## 2013-01-01 05:00:00+00:00 2013 1
## 2013-01-01 06:00:00+00:00 2013 1
import feather
path = "dat/flights.feather"
flights = feather.read_dataframe(path)
flights = flights.set_index(flights['time_hour'])
# if you want to use loc without memorizing the index names, you can slice df.index
# here, I use the unique method because the index is repeated multiple times
index_vals = flights.index.unique()[0:2]
print(index_vals)
## DatetimeIndex(['2013-01-01 05:00:00+00:00', '2013-01-01 06:00:00+00:00'], dtype='datetime64[ns, UTC]', name='time_hour', freq=None)
print(flights.loc[index_vals, ['year', 'month']].head())
## year month
## time_hour
## 2013-01-01 05:00:00+00:00 2013 1
## 2013-01-01 05:00:00+00:00 2013 1
## 2013-01-01 05:00:00+00:00 2013 1
## 2013-01-01 05:00:00+00:00 2013 1
## 2013-01-01 05:00:00+00:00 2013 1
Transforming data
Filter rows
Using dplyr, we have the trusty filter function that works like this:
filter(flights, month == 1, day == 1, origin == 'JFK')
The Python solution is below, which uses the query method.
import feather
path = "dat/flights.feather"
flights = feather.read_dataframe(path)
print(flights.query('month == 1 & day == 1 & origin == "JFK"')[['month', 'day', 'time_hour']].tail())
## month day time_hour
## 833 1 1 2013-01-01 22:00:00+00:00
## 835 1 1 2013-01-01 23:00:00+00:00
## 836 1 1 2013-01-01 23:00:00+00:00
## 837 1 1 2013-01-01 23:00:00+00:00
## 841 1 1 2013-01-01 06:00:00+00:00
In Python, you can also do the following:
flights[(flights['month'] == 1) & (flights['day'] == 1) & (flights['origin'] == 'JFK')].tail()
isin()
There might be times when you want to filter based on several possibilities. In Python, if I wanted any flights from September through December, I’d have to write the following:
flights.query('month == 12 | month == 11 | month == 10 | month == 9')
In R, we can use %in% inside filter to see if the value of the variable is contained in some vector: filter(flights, month %in% c(9, 10, 11, 12)). In Pandas, we use the isin method. With this method, we have no need to use query.
import feather
path = "dat/flights.feather"
flights = feather.read_dataframe(path)
# winter_filter is a column of True and False
winter_filter = flights['month'].isin([9,10,11,12])
flights_winter = flights[winter_filter]
# print unique values of the 'month' column
print(flights_winter['month'].unique())
## [10 11 12 9]
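As a side note (my addition, not in the original post), query also understands an in operator that accomplishes the same filter:
# equivalent filter written with query's in operator
flights_winter = flights.query('month in [9, 10, 11, 12]')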
Arrange rows
In R, we can use arrange(flights, month, desc(day)), while in Python you can use sort_values to sort columns. The columns should be strings (or a list of strings for multiple columns). The ascending parameter should be True or False (or a list of booleans for multiple columns).
import feather
path = "dat/flights.feather"
flights = feather.read_dataframe(path)
print(flights.sort_values(['month', 'day'], ascending=[True, False])[['month', 'day']].head())
## month day
## 26076 1 31
## 26077 1 31
## 26078 1 31
## 26079 1 31
## 26080 1 31
Notice how the month column starts with January, the smallest value (ascending=True), but the day column now starts with the biggest number (ascending=False). You can use the na_position parameter to say whether you want NaN values first or last: flights.sort_values('dep_time', na_position='first', ascending=False)
Selecting (and helper functions)
In R, you can do the following: select(flights, day, month, year) to select the columns year, month, and day. You can also do select(flights, year:day) to select all columns between year and day (inclusive). If you want all columns except the ones between year and day, you can write select(flights, -c(year:day)).
As described in r4ds, there are a number of helper functions you can use within select():
starts_with("abc"): matches names that begin with “abc”.
ends_with("xyz"): matches names that end with “xyz”.
contains("ijk"): matches names that contain “ijk”.
You can thankfully do something similar in Pandas. Each dataframe has a columns attribute that you can access via df_name.columns. It’s just another index, but for the columns instead of the rows. The cool thing is that Series and Index objects are equipped with a set of string processing methods that make it easy to operate on each element of the array. These are accessed via the str attribute. Here is the pandas equivalent of the helper functions in select():
import feather
path = "dat/flights.feather"
flights = feather.read_dataframe(path)
print(flights.loc[:, flights.columns.str.endswith('delay')].head())
## dep_delay arr_delay
## 0 2.0 11.0
## 1 4.0 20.0
## 2 2.0 33.0
## 3 -1.0 -18.0
## 4 -6.0 -25.0
print(flights.iloc[:5, flights.columns.str.contains('dep|arr')]) # this allows you to get all columns that have the string 'dep' or 'arr' in them
## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay \
## 0 517.0 515 2.0 830.0 819 11.0
## 1 533.0 529 4.0 850.0 830 20.0
## 2 542.0 540 2.0 923.0 850 33.0
## 3 544.0 545 -1.0 1004.0 1022 -18.0
## 4 554.0 600 -6.0 812.0 837 -25.0
##
## carrier
## 0 UA
## 1 UA
## 2 AA
## 3 B6
## 4 DL
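For completeness (my addition), the remaining helper, starts_with, maps to str.startswith in exactly the same way:
# pandas analogue of starts_with("sched"): the two scheduled-time columns
print(flights.loc[:, flights.columns.str.startswith('sched')].head())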
You can also drop certain columns (analogous to select(flights, -(year:day)))! It’s best if you have a list of columns though.
import feather
path = "dat/flights.feather"
flights = feather.read_dataframe(path)
cols_to_drop = list(flights.columns[flights.columns.str.contains('dep|arr')])
print(cols_to_drop)
## ['dep_time', 'sched_dep_time', 'dep_delay', 'arr_time', 'sched_arr_time', 'arr_delay', 'carrier']
print(flights.drop(cols_to_drop, axis=1).head()) # axis=1 means to look across the columns
## year month day flight tailnum origin dest air_time distance hour \
## 0 2013 1 1 1545 N14228 EWR IAH 227.0 1400.0 5.0
## 1 2013 1 1 1714 N24211 LGA IAH 227.0 1416.0 5.0
## 2 2013 1 1 1141 N619AA JFK MIA 160.0 1089.0 5.0
## 3 2013 1 1 725 N804JB JFK BQN 183.0 1576.0 5.0
## 4 2013 1 1 461 N668DN LGA ATL 116.0 762.0 6.0
##
## minute time_hour
## 0 15.0 2013-01-01 05:00:00+00:00
## 1 29.0 2013-01-01 05:00:00+00:00
## 2 40.0 2013-01-01 05:00:00+00:00
## 3 45.0 2013-01-01 05:00:00+00:00
## 4 0.0 2013-01-01 06:00:00+00:00
Adding new variables
R’s dplyr has a nice mutate function that allows you to create new variables, and even refer to that variable later in the same mutate statement to create another new variable!
e.g. mutate(flights, mph = distance / (air_time / 60), new_var = function_name(mph))
You can also discretize certain variables with case_when:
mutate(flights, lateness = case_when(arr_delay > 60 ~ 'very_late',
arr_delay <= 60 ~ 'kinda_late'))
For pandas, we have the assign method. The drawback is that you can’t refer to a newly created variable in the same assign statement (so you can’t do the first example above using assign). You can also just say df['new_col'] = value. Here is an example of creating a new column where dep_time is converted to minutes since midnight:
flights = flights.assign(dep_time_mins = flights.dep_time.apply(lambda x: (x // 100) * 60 + (x % 100)))
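As an aside (mine, not from the original post), the same conversion works without apply, since arithmetic operators act elementwise on a whole Series:
# vectorized alternative: // and % broadcast over the column at once
flights['dep_time_mins'] = (flights['dep_time'] // 100) * 60 + flights['dep_time'] % 100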
As far as I know, Pandas also does not have its own version of case_when. You’d have to simply create a new column, then fill it in with boolean indexing (note that loc needs the column name as the second argument, or you’d overwrite every column in the matching rows). For example, the Pandas version of the case_when example would be
flights['lateness'] = "very_late"
flights.loc[flights['arr_delay'] <= 60, 'lateness'] = "kinda_late"
This is definitely not as nice, as it makes it harder to chain many actions together like in R (df %>% filter(blah) %>% mutate(yay = case_when(mmhmm)) %>% group_by(yay) %>% summarise(oh_yeah)).
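One workaround worth knowing (my addition, not in the original post): numpy’s np.select takes a list of boolean conditions and a matching list of choices, which gets you most of the way to case_when:
import numpy as np
# each condition lines up with the choice in the same position
conditions = [flights['arr_delay'] > 60, flights['arr_delay'] <= 60]
choices = ['very_late', 'kinda_late']
flights['lateness'] = np.select(conditions, choices, default='unknown')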
Groupby
Quite frequently, one might want to apply some function to the data after grouping it by some sort of category. For example, we can look at the average arrival delay for each month (the function is the mean, the data is the arr_delay column, and the grouping is the month column); a quick sketch of that idea is below.
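In pandas, that one-liner would look like this (my sketch, assuming flights is loaded as above):
# group rows by month, then take the mean of arr_delay within each group
print(flights.groupby('month')['arr_delay'].mean())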
For a fuller example, the following R code finds all destinations that were flown to more than 365 times.
popular_dests <- flights %>%
group_by(dest) %>%
count() %>%
filter(n > 365) %>%
arrange(desc(n))
popular_dests
## # A tibble: 77 x 2
## # Groups: dest [77]
## dest n
## <chr> <int>
## 1 ORD 17283
## 2 ATL 17215
## 3 LAX 16174
## 4 BOS 15508
## 5 MCO 14082
## 6 CLT 14064
## 7 SFO 13331
## 8 FLL 12055
## 9 MIA 11728
## 10 DCA 9705
## # ... with 67 more rows
Pandas has a groupby method that you can use. Instead of passing results to the next function using %>% like in R, we chain methods together in Pandas. To make sure you don’t have super long lines of code like df[[list_of_cols]].groupby().function_name.col_name.another_method, you can surround your code in parentheses and put each function on a new line. You’ll see an example below.
So, let’s see how you can do the same thing as above in Pandas.
import feather
path = "dat/flights.feather"
flights = feather.read_dataframe(path)
grouped = (flights
.groupby('dest')
.aggregate({'dest': 'count'})
.rename(columns={'dest': 'n'})
.reset_index())
print((grouped[grouped['n'] > 365]
.sort_values(by='n', ascending=False).head(10)))
## dest n
## 69 ORD 17283
## 4 ATL 17215
## 49 LAX 16174
## 11 BOS 15508
## 54 MCO 14082
## 23 CLT 14064
## 90 SFO 13331
## 35 FLL 12055
## 58 MIA 11728
## 28 DCA 9705
So what happened? We got the data, used the groupby method and told it how we wanted to group, then used the aggregate method, which takes either just a function (and applies it to every column) or a dictionary with the column as the key and the function to apply as the value. We then rename that newly created column to ‘n’. Finally, we use reset_index() to turn the index, which was each dest value, back into a column.
Now, we need to figure out which ones were visited more than 365 times. So I use boolean indexing, sort by the number of visits, then look at the first 10 (i.e. the top 10 destinations).
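(An aside of my own: value_counts is a shortcut for exactly this count-and-sort pattern, though you lose the chained-verbs style.)
# value_counts counts unique values and sorts descending by default
counts = flights['dest'].value_counts()
print(counts[counts > 365].head(10))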
One thing that I find useful is that aggregate can take a list of functions as values, so I could do something like
flights.groupby('dest').aggregate({'dest': 'count', 'arr_delay': ['mean', 'std']})
You can do something similar in dplyr using summarise_at(vars(variable_name), funs(mean, sd)).
Thoughts
I like both pandas and R. They are similar in many ways when working with data, which is nice. I will say that reading nice dplyr code is, imo, much easier than reading nice pandas code. The function names are so nice and embody the whole “functions should be verbs and should represent what they do” idea I heard when I first started programming. filter, select, mutate_at, mutate_if, summarise, etc. are all great names. You can do the same in Pandas, but it feels a bit clunky. But I do like using Python a lot! I’ve used it to develop OSS in Statsmodels, build Flask apps that fetch and cache Spotify data, etc. I think it just feels a bit harder to get used to Pandas, and especially matplotlib, than dplyr, ggplot, etc. But then again, I learned Python less than a year ago and only started using Pandas consistently a month or so ago. I hope that after making myself use Python for the remainder of this semester, I will be much more comfortable with it. I hope this blog post helped you as well!
, etc are all great names. You can do the same in Pandas, but it feels a bit clunky. But I do like using Python a lot! I’ve used it to develop OSS in Statsmodels, build Flask apps that fetches and caches spotify data, etc. I think it just feels a bit harder to get used to Pandas and especially matplotlib than dplyr, ggplot, etc. But then again, I learned python less than a year ago and just started using Pandas consistently for a month or so. I hope after making myself use Python for the remainder of this semester, I will be much more confortable with it. I hope this blog post helped you as well!