Amen, Brother

I recently listened to a podcast that talked about a 1960s track titled “Amen, Brother” by The Winstons. According to whosampled.com, it is the most-sampled track ever. Here is a link to a particular part of the song, the six-second drum solo known as the Amen Break, which is the reason for the 2800+ samples heard in Hip-Hop, EDM, the Futurama intro, etc. I wanted to see what types of artists sampled this song and how they used the sample. [Read More]

Dplyr <-> Pandas

I want to become familiar with the tools in both the tidyverse and Python. This blog post will highlight some of the typical things people might do in R or Python when working with data. It’s not meant as a ‘from scratch’ tutorial, so there are things I won’t explain thoroughly or at all. It’s mainly so I can remember pandas functionality. For now, we will focus on wrangling the data. [Read More]

Dragging and clicking

I forced myself to get rid of Windows after starting grad school. I read plenty of blogs and whatnot explaining why not relying on your mouse (point, click, drag) could make you much more productive while working as a data scientist. I thought only Software Engineers (SWEs) did stuff like that and coding in whatever the heck emacs or vim is. Regardless, I decided to give it a try. Switched to Ubuntu. [Read More]

Building a crappy personalized song recommender

Note: if you wanna skip straight to the repository, it’s here. This semester I learned about utilizing APIs and OAuth2 to get data from web services. Time to put this to the test and do something cool - build a crappy song recommender. Here are the steps I followed after some brainstorming. Shoutout to Spotify for having nice data about each song.
1. Ask for several artists that you vibe to.
2. Via Spotify, get a few related artists for each artist that the user input.
3. Get all albums for each artist as well as their songs.
4. Get song features for each song (valence/musical positiveness, tempo, danceability, speechiness, energy).
5. Via Genius’s API, search for that song and artist combo for each song and get the song’s lyrics.
6. Calculate the sentiment of those lyrics.
7. Store it all in two database tables, one for songs and one for artists.
Now that I have the data, how should I go about recommending songs? [Read More]

Neural Nets - Backpropagation

Back to talk about NNs a bit more! We talked about linear and nonlinear decision boundaries when we last left off. This is why multi-layered NNs have at least one nonlinear function applied to the input, if not many (even many different nonlinear functions). So how do we even start using a Neural Network? It’s beneficial to talk about the input before we talk about all the things we do to that input. [Read More]

Understanding OAuth

I should definitely finish the other parts of the NN posts, but here is a brief detour into OAuth (Open Authorization). Note that I am still a novice and so there might be explanations that are missing information or lacking depth. But this is a way for me to synthesize what I know, so please message me if there are any mistakes! So what is the point of OAuth? As I’ve seen it, it’s a way for web services or apps to exchange information without compromising the people using apps. [Read More]

Neural Nets - units and decision boundaries

I have an assignment that involves building a language identifier (given text, predict which language the text is from) using Neural Nets. I wanted to use this opportunity to make a few posts to cement the idea in my head. I hope you find this intuitive and helpful. So let’s talk about NNs in terms of classification. When I think of neural networks, I think of applying a multi-layered system to make a decision. [Read More]

Exploring flight data

Background: The given dataset comes from the Office of Airline Information, Bureau of Transportation Statistics (BTS). It contains on-time arrival data for non-stop domestic flights by major air carriers, monthly from 1987 to 2017, and provides information such as departure and arrival delays, origin and destination airports, flight numbers, scheduled and actual departure and arrival times, canceled or diverted flights, taxi-out and taxi-in times, air time, and non-stop distance. [Read More]

Data from DataCamp

DataCamp claims to be “the easiest way to learn Data Science Online” and has courses of different levels taught using R or Python. I wanted to see how popular some of the courses were and which technology they used, so a quick use of rvest was required. rvest is a package developed by Hadley Wickham that allows one to easily scrape web pages. Below is the function used to get the relative links for each course based on the technology used. [Read More]

GSOC III

This week was a busy one. Implemented most of the survey summary statistics and the jackknife to estimate their standard error. This was my first time learning about the jackknife and I thought it was an interesting topic to tell you all about. Within survey data, you tend to have strata and PSUs within the strata that make up subgroups. Now, let’s assume that we want to calculate the mean for your survey data - this entails something along the lines of [Read More]
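
As a rough illustration of the jackknife idea mentioned in the GSOC III excerpt, here is a minimal Python sketch of the plain delete-one jackknife for the standard error of a mean. It is a simplification and not the post's implementation: it ignores strata and PSUs entirely, and the function name `jackknife_se_mean` is made up for this example.

```python
import math


def jackknife_se_mean(values):
    """Delete-one jackknife standard error of the sample mean.

    Assumption: a simple unstratified sample. Survey designs with
    strata and PSUs would drop whole PSUs within strata instead.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")

    total = sum(values)
    # Mean of the sample with each observation left out in turn.
    leave_one_out = [(total - v) / (n - 1) for v in values]
    center = sum(leave_one_out) / n

    # Jackknife variance: (n - 1) / n times the sum of squared deviations
    # of the leave-one-out estimates from their average.
    variance = (n - 1) / n * sum((m - center) ** 2 for m in leave_one_out)
    return math.sqrt(variance)


sample = [2.0, 4.0, 4.0, 5.0, 7.0]
se = jackknife_se_mean(sample)
```

For the mean specifically, this reduces to the familiar s / sqrt(n), which makes it a handy sanity check before moving on to statistics that have no closed-form standard error.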