I want to become familiar with the tools in both the tidyverse and Python. This blog post will highlight some of the typical things people do in R or Python when working with data. It’s not meant as a ‘from scratch’ tutorial, so there are things I won’t explain thoroughly or at all; it’s mainly so I can remember pandas functionality. For now, we will focus on wrangling the data. The next post will cover more data wrangling as well as visualization. To give credit, most of these exercises come from Hadley Wickham’s R for Data Science textbook, which can be found on Amazon or online.
Note: Judging from who follows me on Twitter, I assume the typical reader of this post will be more familiar with R than pandas, so I’ll write this blog post under that assumption. This just means I’ll explain more of the basics in pandas than in R.
The data we will be using is the nycflights13::flights dataset. It is stored as flights.
# flights.info() or flights.head() to see actual values
glimpse(flights) # R
## Observations: 336,776
## Variables: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
A bit about Python DataFrames
Pandas dataframes are similar to R dataframes except for a few things that I will touch upon. Each column is considered something called a Series, which is a one-dimensional array with axis labels (that need not be unique). The axis labels are collectively referred to as the index. If you don’t specify labels, they are just the row numbers. Series and DataFrames can also have hierarchical labels, which is a big difference from R dataframes.
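To make that concrete, here is a minimal sketch (a toy example of mine, not the flights data) of a Series with repeated labels and of a hierarchical index:
import pandas as pd
# a Series is a one-dimensional array with axis labels; labels can repeat
s = pd.Series([10, 20, 30], index=['a', 'a', 'b'])
# a hierarchical (MultiIndex) Series has multiple levels of labels
h = pd.Series([1, 2, 3, 4], index=pd.MultiIndex.from_product([['x', 'y'], [2013, 2014]]))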
Usually, if you have a dataframe called df, you’d do something like this: df.function_name() vs function_name(df). This is because the dataframe has a number of functions specific to it called methods, and we call methods with the above syntax. You’d also see something like pd.function_name(), where pd is short for pandas.
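A toy illustration of the two calling styles (my example, assuming the usual import pandas as pd convention):
import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3]})
df.head()            # a method, called on the dataframe itself
pd.concat([df, df])  # a module-level function, called through pd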
To get certain columns in Pandas, you use the following: df['col_name'], or df[['col_name', 'other_col_name', 'another_col_name']] if you want to get more than one column.
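One detail worth knowing (my aside, not from the original post): single brackets return a Series, while double brackets return a DataFrame.
type(flights['year'])             # pandas.core.series.Series
type(flights[['year', 'month']])  # pandas.core.frame.DataFrame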
Next, we should talk about using iloc or loc. iloc is for integer positioning (focusing on the order of rows or columns), while loc is for label positioning (focusing on the labels/names of rows or columns).
import feather
path = "dat/flights.feather"
flights = feather.read_dataframe(path)
# to make things more apparent, I will change the index to be the time_hour column
flights = flights.set_index(flights['time_hour'])
# selecting the first five rows using iloc
print(flights.iloc[0:5])
## year month day dep_time sched_dep_time \
## time_hour
## 2013-01-01 05:00:00+00:00 2013 1 1 517.0 515
## 2013-01-01 05:00:00+00:00 2013 1 1 533.0 529
## 2013-01-01 05:00:00+00:00 2013 1 1 542.0 540
## 2013-01-01 05:00:00+00:00 2013 1 1 544.0 545
## 2013-01-01 06:00:00+00:00 2013 1 1 554.0 600
##
## dep_delay arr_time sched_arr_time arr_delay \
## time_hour
## 2013-01-01 05:00:00+00:00 2.0 830.0 819 11.0
## 2013-01-01 05:00:00+00:00 4.0 850.0 830 20.0
## 2013-01-01 05:00:00+00:00 2.0 923.0 850 33.0
## 2013-01-01 05:00:00+00:00 -1.0 1004.0 1022 -18.0
## 2013-01-01 06:00:00+00:00 -6.0 812.0 837 -25.0
##
## carrier flight tailnum origin dest air_time \
## time_hour
## 2013-01-01 05:00:00+00:00 UA 1545 N14228 EWR IAH 227.0
## 2013-01-01 05:00:00+00:00 UA 1714 N24211 LGA IAH 227.0
## 2013-01-01 05:00:00+00:00 AA 1141 N619AA JFK MIA 160.0
## 2013-01-01 05:00:00+00:00 B6 725 N804JB JFK BQN 183.0
## 2013-01-01 06:00:00+00:00 DL 461 N668DN LGA ATL 116.0
##
## distance hour minute time_hour
## time_hour
## 2013-01-01 05:00:00+00:00 1400.0 5.0 15.0 2013-01-01 05:00:00+00:00
## 2013-01-01 05:00:00+00:00 1416.0 5.0 29.0 2013-01-01 05:00:00+00:00
## 2013-01-01 05:00:00+00:00 1089.0 5.0 40.0 2013-01-01 05:00:00+00:00
## 2013-01-01 05:00:00+00:00 1576.0 5.0 45.0 2013-01-01 05:00:00+00:00
## 2013-01-01 06:00:00+00:00 762.0 6.0 0.0 2013-01-01 06:00:00+00:00
flights.loc[0:5] would raise an error because the index labels aren’t 0, 1, 2, 3, 4, and 5. In this case, they actually repeat! Instead, we index using loc by giving the name of the label:
import feather
path = "dat/flights.feather"
flights = feather.read_dataframe(path)
flights = flights.set_index(flights['time_hour'])
# use loc and the label value for the index
flights_sub = flights.loc['2013-01-01T06:00:00Z']
# get the unique index values for the subset dataframe
print(flights_sub.index.unique())
## DatetimeIndex(['2013-01-01 06:00:00+00:00'], dtype='datetime64[ns, UTC]', name='time_hour', freq=None)
The same works for indexing into columns as well. We’d just specify the labels or integer positions of the columns as the second argument, e.g. flights.loc[:, ['year', 'month']] or flights.iloc[:, [0, 1]], since year and month are the 0th and 1st columns in the dataframe.
What if we wanted the first 5 rows of the columns ‘year’ and ‘month’? So far, we either need to use iloc and memorize the numerical index of the columns (flights.iloc[0:5, [0, 1]]), or use loc and somehow subset to just the first 5 elements. We can use helper functions to combine both label indexing and positional indexing:
flights.iloc[0:5, ['year', 'month']] fails because ['year', 'month'] is for label indexing
flights.loc[0:5, ['year', 'month']] fails because 0:5 does not match the index labels
import feather
path = "dat/flights.feather"
flights = feather.read_dataframe(path)
flights = flights.set_index(flights['time_hour'])
# if you want to use iloc but specify column names, you can use df.columns.get_loc('col_names')
# or for multiple columns, use df.columns.get_indexer([list of columns]) like below
print(flights.iloc[0:5, flights.columns.get_indexer(['year', 'month'])])
## year month
## time_hour
## 2013-01-01 05:00:00+00:00 2013 1
## 2013-01-01 05:00:00+00:00 2013 1
## 2013-01-01 05:00:00+00:00 2013 1
## 2013-01-01 05:00:00+00:00 2013 1
## 2013-01-01 06:00:00+00:00 2013 1
import feather
path = "dat/flights.feather"
flights = feather.read_dataframe(path)
flights = flights.set_index(flights['time_hour'])
# if you want to use loc without memorizing the index names, you can slice df.index
# here, I use the unique method because the index is repeated multiple times
index_vals = flights.index.unique()[0:2]
print(index_vals)
## DatetimeIndex(['2013-01-01 05:00:00+00:00', '2013-01-01 06:00:00+00:00'], dtype='datetime64[ns, UTC]', name='time_hour', freq=None)
print(flights.loc[index_vals, ['year', 'month']].head())
## year month
## time_hour
## 2013-01-01 05:00:00+00:00 2013 1
## 2013-01-01 05:00:00+00:00 2013 1
## 2013-01-01 05:00:00+00:00 2013 1
## 2013-01-01 05:00:00+00:00 2013 1
## 2013-01-01 05:00:00+00:00 2013 1
Transforming data
Filter rows
Using dplyr, we have the trusty filter function that works like this:
filter(flights, month == 1, day == 1, origin == 'JFK')
The Python solution is below, which uses the query method.
import feather
path = "dat/flights.feather"
flights = feather.read_dataframe(path)
print(flights.query('month == 1 & day == 1 & origin == "JFK"')[['month', 'day', 'time_hour']].tail())
## month day time_hour
## 833 1 1 2013-01-01 22:00:00+00:00
## 835 1 1 2013-01-01 23:00:00+00:00
## 836 1 1 2013-01-01 23:00:00+00:00
## 837 1 1 2013-01-01 23:00:00+00:00
## 841 1 1 2013-01-01 06:00:00+00:00
In Python, you can also do the following:
flights[(flights['month'] == 1) & (flights['day'] == 1) & (flights['origin'] == 'JFK')].tail()
isin()
There might be times when you want to filter based on several possibilities. In Python, if I wanted any flights from September through December, I’d have to write the following:
flights.query('month == 12 | month == 11 | month == 10 | month == 9')
In R, we can use %in% inside filter to see if the value of the variable is contained in some vector: filter(flights, month %in% c(9, 10, 11, 12)). In Pandas, we use the isin method. With this method, we have no need to use query.
import feather
path = "dat/flights.feather"
flights = feather.read_dataframe(path)
# winter_filter is a column of True and False
winter_filter = flights['month'].isin([9,10,11,12])
flights_winter = flights[winter_filter]
# print unique values of the 'month' column
print(flights_winter['month'].unique())
## [10 11 12 9]
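As a side note (my addition, not in the original post), query also understands an in operator that accomplishes the same filter:
# equivalent filter written with query's in operator
flights_winter = flights.query('month in [9, 10, 11, 12]')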
Arrange rows
In R, we can use arrange(flights, month, desc(day)), while in Python you can use sort_values to sort columns. The columns should be strings (or a list of strings for multiple columns). The ascending parameter should be True or False (or a list of booleans for multiple columns).
import feather
path = "dat/flights.feather"
flights = feather.read_dataframe(path)
print(flights.sort_values(['month', 'day'], ascending=[True, False])[['month', 'day']].head())
## month day
## 26076 1 31
## 26077 1 31
## 26078 1 31
## 26079 1 31
## 26080 1 31
Notice how the month column starts with January, the smallest value (ascending=True), but the day column now starts with the biggest number (ascending=False). You can use the na_position parameter to say whether you want NaN values first or last: flights.sort_values('dep_time', na_position='first', ascending=False)
Selecting (and helper functions)
In R, you can do the following: select(flights, day, month, year) to select the columns year, month, and day. You can also do select(flights, year:day) to select all columns between year and day (inclusive). If you want all columns except the ones between year and day, you can write select(flights, -c(year:day)).
As described in r4ds, there are a number of helper functions you can use within select():
starts_with("abc"): matches names that begin with “abc”.
ends_with("xyz"): matches names that end with “xyz”.
contains("ijk"): matches names that contain “ijk”.
You can thankfully do something similar in Pandas. Each dataframe has a columns attribute that you can access via df_name.columns. It’s just another index, but for the columns instead of the rows. The cool thing is that Series and Index objects are equipped with a set of string processing methods that make it easy to operate on each element of the array. These are accessed via the str attribute. Here is the pandas equivalent of the helper functions in select():
import feather
path = "dat/flights.feather"
flights = feather.read_dataframe(path)
print(flights.loc[:, flights.columns.str.endswith('delay')].head())
## dep_delay arr_delay
## 0 2.0 11.0
## 1 4.0 20.0
## 2 2.0 33.0
## 3 -1.0 -18.0
## 4 -6.0 -25.0
print(flights.iloc[:5, flights.columns.str.contains('dep|arr')]) # this allows you to get all columns that have the string 'dep' or 'arr' in them
## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay \
## 0 517.0 515 2.0 830.0 819 11.0
## 1 533.0 529 4.0 850.0 830 20.0
## 2 542.0 540 2.0 923.0 850 33.0
## 3 544.0 545 -1.0 1004.0 1022 -18.0
## 4 554.0 600 -6.0 812.0 837 -25.0
##
## carrier
## 0 UA
## 1 UA
## 2 AA
## 3 B6
## 4 DL
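For completeness (my addition), the remaining helper, starts_with, maps to str.startswith in exactly the same way:
# pandas analogue of starts_with("sched"): the two scheduled-time columns
print(flights.loc[:, flights.columns.str.startswith('sched')].head())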
You can also drop certain columns (analogous to select(flights, -(year:day)))! It’s best if you have a list of columns though.
import feather
path = "dat/flights.feather"
flights = feather.read_dataframe(path)
cols_to_drop = list(flights.columns[flights.columns.str.contains('dep|arr')])
print(cols_to_drop)
## ['dep_time', 'sched_dep_time', 'dep_delay', 'arr_time', 'sched_arr_time', 'arr_delay', 'carrier']
print(flights.drop(cols_to_drop, axis=1).head()) # axis=1 means to look across the columns
## year month day flight tailnum origin dest air_time distance hour \
## 0 2013 1 1 1545 N14228 EWR IAH 227.0 1400.0 5.0
## 1 2013 1 1 1714 N24211 LGA IAH 227.0 1416.0 5.0
## 2 2013 1 1 1141 N619AA JFK MIA 160.0 1089.0 5.0
## 3 2013 1 1 725 N804JB JFK BQN 183.0 1576.0 5.0
## 4 2013 1 1 461 N668DN LGA ATL 116.0 762.0 6.0
##
## minute time_hour
## 0 15.0 2013-01-01 05:00:00+00:00
## 1 29.0 2013-01-01 05:00:00+00:00
## 2 40.0 2013-01-01 05:00:00+00:00
## 3 45.0 2013-01-01 05:00:00+00:00
## 4 0.0 2013-01-01 06:00:00+00:00
Adding new variables
R’s dplyr has a nice mutate function that allows you to create new variables, and even refer to that variable later in the same mutate statement to create another new variable!
e.g. mutate(flights, mph = distance / (air_time / 60), new_var = function_name(mph))
You can also discretize certain variables with case_when:
mutate(flights, lateness = case_when(arr_delay > 60 ~ 'very_late',
arr_delay <= 60 ~ 'kinda_late'))
For pandas, we have the assign method. The drawback is that you can’t refer to a newly created variable in the same assign statement (so you can’t do the first example above using assign). You can also just say df['new_col'] = value. Here is an example of creating a new column where dep_time is converted to minutes since midnight:
flights = flights.assign(dep_time_mins = flights.dep_time.apply(lambda x: (x // 100) * 60 + (x % 100)))
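As an aside (mine, not from the original post), the same conversion works without apply, since arithmetic operators act elementwise on a whole Series:
# vectorized alternative: // and % broadcast over the column at once
flights['dep_time_mins'] = (flights['dep_time'] // 100) * 60 + flights['dep_time'] % 100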
As far as I know, Pandas also does not have its own version of case_when. You’d have to simply create a new column, then fill it in with boolean indexing (note that loc needs the column name as the second argument, or you’d overwrite every column in the matching rows). For example, the Pandas version of the case_when example would be
flights['lateness'] = "very_late"
flights.loc[flights['arr_delay'] <= 60, 'lateness'] = "kinda_late"
This is definitely not as nice, as it makes it harder to chain many actions together like in R (df %>% filter(blah) %>% mutate(yay = case_when(mmhmm)) %>% group_by(yay) %>% summarise(oh_yeah)).
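One workaround worth knowing (my addition, not in the original post): numpy’s np.select takes a list of boolean conditions and a matching list of choices, which gets you most of the way to case_when:
import numpy as np
# each condition lines up with the choice in the same position
conditions = [flights['arr_delay'] > 60, flights['arr_delay'] <= 60]
choices = ['very_late', 'kinda_late']
flights['lateness'] = np.select(conditions, choices, default='unknown')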
Groupby
Quite frequently, one might want to apply some function to the data after grouping it by some sort of category. For example, we can look at the average arrival delay for each month (the function is the mean, the data is the arr_delay column, and the grouping is the month column); a quick sketch of that idea is below.
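In pandas, that one-liner would look like this (my sketch, assuming flights is loaded as above):
# group rows by month, then take the mean of arr_delay within each group
print(flights.groupby('month')['arr_delay'].mean())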
For a fuller example, the following R code finds all destinations that were flown to more than 365 times.
popular_dests <- flights %>%
group_by(dest) %>%
count() %>%
filter(n > 365) %>%
arrange(desc(n))
popular_dests
## # A tibble: 77 x 2
## # Groups: dest [77]
## dest n
## <chr> <int>
## 1 ORD 17283
## 2 ATL 17215
## 3 LAX 16174
## 4 BOS 15508
## 5 MCO 14082
## 6 CLT 14064
## 7 SFO 13331
## 8 FLL 12055
## 9 MIA 11728
## 10 DCA 9705
## # ... with 67 more rows
Pandas has a groupby method that you can use. Instead of passing results to the next function using %>% like in R, we chain methods together in Pandas. To make sure you don’t have super long lines of code like df[[list_of_cols]].groupby().function_name.col_name.another_method, you can surround your code in parentheses and put each function on a new line. You’ll see an example below.
So, let’s see how you can do the same thing as above in Pandas.
import feather
path = "dat/flights.feather"
flights = feather.read_dataframe(path)
grouped = (flights
.groupby('dest')
.aggregate({'dest': 'count'})
.rename(columns={'dest': 'n'})
.reset_index())
print((grouped[grouped['n'] > 365]
.sort_values(by='n', ascending=False).head(10)))
## dest n
## 69 ORD 17283
## 4 ATL 17215
## 49 LAX 16174
## 11 BOS 15508
## 54 MCO 14082
## 23 CLT 14064
## 90 SFO 13331
## 35 FLL 12055
## 58 MIA 11728
## 28 DCA 9705
So what happened? We got the data, used the groupby method and told it how we wanted to group, then used the aggregate method, which takes either just a function (and applies it to every column) or a dictionary with the column as the key and the function to apply as the value. We then rename that newly created column to ‘n’. Finally, we use reset_index() to turn the index, which was each dest value, back into a column.
Now, we need to figure out which ones were visited more than 365 times. So I use boolean indexing, sort by the number of visits, then look at the first 10 (i.e. the top 10 destinations).
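(An aside of my own: value_counts is a shortcut for exactly this count-and-sort pattern, though you lose the chained-verbs style.)
# value_counts counts unique values and sorts descending by default
counts = flights['dest'].value_counts()
print(counts[counts > 365].head(10))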
One thing that I find useful is that aggregate can take a list of functions as values, so I could do something like
flights.groupby('dest').aggregate({'dest': 'count', 'arr_delay': ['mean', 'std']})
You can do something similar in dplyr using summarise_at(vars(variable_name), funs(mean, sd)).
Thoughts
I like both pandas and R. They are similar in many ways when working with data, which is nice. I will say that reading nice dplyr code is, imo, much easier than reading nice pandas code. The function names are so nice and embody the whole “functions should be verbs and should represent what they do” idea I heard when I first started programming. filter, select, mutate_at, mutate_if, summarise, etc. are all great names. You can do the same in Pandas, but it feels a bit clunky. But I do like using Python a lot! I’ve used it to develop OSS in Statsmodels, build Flask apps that fetch and cache Spotify data, etc. I think it just feels a bit harder to get used to Pandas, and especially matplotlib, than dplyr, ggplot, etc. But then again, I learned Python less than a year ago and only started using Pandas consistently a month or so ago. I hope that after making myself use Python for the remainder of this semester, I will be much more comfortable with it. I hope this blog post helped you as well!
, etc are all great names. You can do the same in Pandas, but it feels a bit clunky. But I do like using Python a lot! I’ve used it to develop OSS in Statsmodels, build Flask apps that fetches and caches spotify data, etc. I think it just feels a bit harder to get used to Pandas and especially matplotlib than dplyr, ggplot, etc. But then again, I learned python less than a year ago and just started using Pandas consistently for a month or so. I hope after making myself use Python for the remainder of this semester, I will be much more confortable with it. I hope this blog post helped you as well!