DataCamp claims to be “the easiest way to learn Data Science Online” and offers courses at different levels taught in R or Python. I wanted to see how popular some of the courses were and which technology they used, so a quick bit of rvest was in order. rvest is a package developed by Hadley Wickham that makes it easy to scrape web pages. Below is the function I used to get the relative links for each course, given a technology.
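library(rvest)

# Return the relative course links listed on the page for one technology.
get_course_links <- function(technology) {
  url <- paste0('https://www.datacamp.com/courses/tech:', technology)
  html <- read_html(url)
  # Select the course link elements, then pull out their href attribute.
  course_links <- html_nodes(html, css = '.course-block__link') %>%
    html_attr('href')
  return(course_links)
}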
In short, the function loads the page, selects the elements that match the given CSS selector, and then extracts each element's ‘href’ attribute, i.e. the link target.
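To make that concrete, here is a minimal sketch that runs the same two steps on a small hand-written HTML snippet instead of the live site. The markup and the `/courses/...` paths are made up for illustration; only the `.course-block__link` selector comes from the function above.

```r
library(rvest)

# Hypothetical markup standing in for a DataCamp listing page.
page <- read_html('<div>
  <a class="course-block__link" href="/courses/intro-to-r">Intro</a>
  <a class="other-link" href="/about">About</a>
  <a class="course-block__link" href="/courses/intermediate-r">Intermediate</a>
</div>')

# Step 1: keep only the elements matching the CSS selector.
course_nodes <- html_nodes(page, css = ".course-block__link")

# Step 2: pull the href attribute out of each matching element.
links <- html_attr(course_nodes, "href")
links
```

The link with the `other-link` class is skipped because it does not match the selector, so only the two course paths come back.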
I then wrote another function, get_course_data, that uses rvest, dplyr, some stringr with regex, etc. to get information for each course, such as the number of participants, the course title, the number of prerequisites, and so on. Below are the five most popular courses for each of R and Python, by number of participants. As you can see, there are only two courses under ‘SQL’, as the other database courses seem to focus on dplyr (R) or pysqlite (Python) and so fall under those categories.
library(dplyr)

courses_all <- read.csv("dat/all_courses_scrape_df.csv")
# Sort courses by participants, then split into one data frame per technology.
arranged_courses <- courses_all %>% group_by(tech) %>%
  select(title, num_participants, num_prereqs, tech) %>%
  arrange(desc(num_participants)) %>% split(., .$tech, drop = TRUE)
arranged_courses$python %>% top_n(5, num_participants)
## # A tibble: 5 x 4
## # Groups:   tech [1]
##   title                                 num_participants num_prereqs tech
##   <fct>                                            <int>       <int> <fct>
## 1 Intro to Python for Data Science                452593           0 python
## 2 Intermediate Python for Data Science            127312           1 python
## 3 Deep Learning in Python                          31527           3 python
## 4 Python Data Science Toolbox (Part 1)             29021           2 python
## 5 Importing Data in Python (Part 1)                26001           2 python
arranged_courses$R %>% top_n(5, num_participants)
## # A tibble: 5 x 4
## # Groups:   tech [1]
##   title                                 num_participants num_prereqs tech
##   <fct>                                            <int>       <int> <fct>
## 1 Introduction to R                               558270           0 R
## 2 Intermediate R                                  160942           1 R
## 3 Data Manipulation in R with dplyr                55803           2 R
## 4 Introduction to Machine Learning                 53502           3 R
## 5 Data Visualization with ggplot2 (Par…            51080           2 R
arranged_courses$sql
## # A tibble: 2 x 4
## # Groups:   tech [1]
##   title                         num_participants num_prereqs tech
##   <fct>                                    <int>       <int> <fct>
## 1 Intro to SQL for Data Science            54973           0 sql
## 2 Joining Data in PostgreSQL                6413           0 sql
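The get_course_data function mentioned above isn't reproduced in this post. As a rough idea of what the "stringr with regex" step might look like, here is a sketch that turns a scraped text fragment into an integer count. The helper name `parse_count` and the sample strings are hypothetical; the text on the real pages may be formatted differently.

```r
library(stringr)

# Hypothetical helper: extract the first number (with optional thousands
# separators) from a piece of scraped text and return it as an integer.
parse_count <- function(text) {
  digits <- str_extract(text, "[0-9][0-9,]*")  # e.g. "452,593"
  as.integer(str_remove_all(digits, ","))      # drop commas, then convert
}

# Made-up inputs; text without any digits yields NA.
parse_count(c("452,593 Participants", "4 hours", "no number here"))
```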
Below is the number of courses at each prerequisite count, along with how many people took those courses. It’s nice to see that DataCamp has a number of courses with 3+ prerequisites, evidence that their courses build on one another instead of stopping at the basics.
courses_all %>% group_by(num_prereqs) %>%
  summarise(avg_num_participants = mean(num_participants),
            tot_num_participants = sum(num_participants),
            n = n())
## # A tibble: 5 x 4
##   num_prereqs avg_num_participants tot_num_participants     n
##         <int>                <dbl>                <int> <int>
## 1           0               49132.              1179172    24
## 2           1               40244.               402440    10
## 3           2               15128.               332814    22
## 4           3               14062.               393731    28
## 5           4                5374.                16123     3
I was curious about the purpose of the XP awarded per course. My initial guess was that more difficult courses would give more XP per exercise, so I plotted the following:
library(ggplot2)
ggplot(courses_all) + geom_point(aes(x=num_exercises, y=num_xp)) + facet_grid(~num_prereqs) +
  labs(x="Number of exercises", y='Experience', title='Split by number of prerequisites') +
  theme_bw() + theme(axis.ticks.x=element_blank())  # theme_bw() first, or it resets the tick setting
I assumed that more difficult courses tend to have more prerequisites, which may not be true: for many users, the first course might have the steepest learning curve and thus be the most difficult. But anyway, under my assumption I expected to see more experience given for the same number of exercises as the number of prerequisites increases. Not the case. Still not sure what the point of the XP given per course is, haha.
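One way to check that impression numerically, rather than by eye, is to compute XP per exercise for each course and average it within each prerequisite count. The sketch below uses a tiny made-up data frame so that it runs on its own; the column names match the scraped data, but the values are invented.

```r
library(dplyr)

# Made-up stand-in for courses_all; only the column names are real.
toy_courses <- data.frame(
  num_prereqs   = c(0, 0, 1, 1, 3),
  num_exercises = c(50, 60, 40, 80, 50),
  num_xp        = c(4000, 5400, 3600, 6400, 4500)
)

# XP per exercise for each course, averaged within each prerequisite count.
xp_by_prereqs <- toy_courses %>%
  mutate(xp_per_exercise = num_xp / num_exercises) %>%
  group_by(num_prereqs) %>%
  summarise(avg_xp_per_exercise = mean(xp_per_exercise), n = n())

xp_by_prereqs
```

Running the same pipeline on courses_all would show whether the ratio actually rises with num_prereqs.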
This is the part where I really wish DataCamp gave the age of each course or the day it was released. I wanted to see which technology was ‘more popular’, but of course I can’t answer that with the current data. Anyway, first let’s just look at the average number of participants by technology.
courses_all %>% group_by(tech) %>%
  summarise(avg_num_participants = mean(num_participants),
            avg_num_hours = mean(num_hours),
            most_popular_course = max(num_participants),
            total_participants = sum(num_participants),
            n = n())
## # A tibble: 3 x 6
##   tech   avg_num_particip… avg_num_hours most_popular_co… total_participa…
##   <fct>              <dbl>         <dbl>            <dbl>            <int>
## 1 python            32249.          3.85          452593.           838467
## 2 R                 24143.          4.17          558270.          1424427
## 3 sql               30693.          4.50           54973.            61386
## # ... with 1 more variable: n <int>
There are many more R courses, but Python courses seem to have the higher average number of participants, even though R has the most popular course. Let’s look at a boxplot. The y-axis is zoomed in to the range between the 10th and 95th percentiles of participants, so the most extreme courses are cut off.
ggplot(courses_all, aes(x=tech, y=num_participants)) + geom_boxplot() +
  geom_jitter(width=.2) +
  # Zoom (without dropping data) to the 10th-95th percentile range of participants.
  coord_cartesian(ylim = quantile(courses_all$num_participants, c(0.1, .95))) +
  theme_bw()
Yes, there are a number of R courses that are more popular than Python courses, but there are also many R courses that have very few participants. This is why I wish I could know when each course started. I’m not sure whether these courses were unpopular or whether they’re just new.
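Because a handful of very popular courses dominate the totals, the mean on its own can be a misleading summary here. A minimal sketch of putting the median next to the mean per technology, again on invented numbers rather than the scraped data:

```r
library(dplyr)

# Made-up stand-in for courses_all: one blockbuster R course, two small ones.
toy_courses <- data.frame(
  tech             = c("R", "R", "R", "python", "python"),
  num_participants = c(500000, 4000, 3000, 40000, 30000)
)

popularity <- toy_courses %>%
  group_by(tech) %>%
  summarise(
    mean_participants   = mean(num_participants),
    median_participants = median(num_participants)
  )

popularity
```

In this toy example the single huge R course pulls the R mean far above its median, which is the same pattern the boxplot hints at.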
There are plenty of other things I could look at, but I’ll stop for now as the post is getting plenty long. If you want the scraped data, you can find the code on my GitHub!