This week was a busy one. I implemented most of the survey summary statistics and the jackknife to estimate their standard errors. This was my first time learning about the jackknife, and I thought it was an interesting topic to tell you all about. Survey data tends to come with strata, and within each stratum, PSUs (primary sampling units) that make up subgroups. Now, let’s say we want to calculate the mean of our survey data - this entails something along the lines of
np.dot(weights, data) / np.sum(weights)
where weights are the inverse probabilities of the observations being selected. Easy enough. Now, if we want to estimate its SE (standard error) via the jackknife, we do something along the lines of
for each stratum:
    for each cluster within that stratum:
        delete that cluster
        re-weight the remaining clusters in the stratum
        np.dot(new_weights, data) / np.sum(new_weights)
center the collection of 'minus one cluster' statistics
do a bit of subtraction, summing, squaring, etc.
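In plain numpy, that recipe might look something like the following - a rough sketch with made-up argument names, not my actual implementation:

```python
import numpy as np

def jackknife_se(data, weights, strata, clusters):
    """Delete-one-cluster jackknife SE of the weighted mean (a sketch).

    data, weights, strata, clusters are equal-length 1-D arrays;
    strata[i] / clusters[i] identify observation i's stratum and PSU.
    """
    var = 0.0
    for h in np.unique(strata):
        in_h = strata == h
        psus = np.unique(clusters[in_h])
        n_h = len(psus)
        if n_h < 2:
            continue  # a lone-PSU stratum contributes no jackknife term
        reps = []
        for j in psus:
            dropped = in_h & (clusters == j)
            new_w = weights.copy()
            new_w[dropped] = 0.0
            # re-weight the surviving clusters in this stratum
            new_w[in_h & ~dropped] *= n_h / (n_h - 1)
            reps.append(np.dot(new_w, data) / np.sum(new_w))
        reps = np.asarray(reps)
        # the 'subtraction, summing, squaring' step: center the replicates
        # and accumulate this stratum's contribution to the variance
        var += (n_h - 1) / n_h * np.sum((reps - reps.mean()) ** 2)
    return np.sqrt(var)
```

The (n_h - 1) / n_h factor is the usual stratified jackknife scaling, and I’m centering each stratum’s replicates at their own mean, matching the recipe above; some references center at the full-sample estimate instead.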
As you can imagine, this is pretty computationally heavy. But it lets us estimate the variability of our estimator and gives us confidence in using it or not. I’m used to the bootstrap, which is even more computationally intensive but more popular in research and applications. Thankfully, my advisor knows many numpy tricks to speed up the computation. I’ll be doing that next week, along with fixing up some design issues. I had most things in one class, but it makes sense to break them into separate classes and pass them between one another.
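To give a flavor of the kind of speed-up I mean (a sketch with a hypothetical helper, not my advisor’s actual tricks): within a stratum, you can pre-aggregate each cluster into a pair of totals and then form every leave-one-cluster-out estimate with vectorized arithmetic, instead of copying and re-summing the full weight vector in an inner Python loop.

```python
import numpy as np

def stratum_replicates(data, weights, strata, clusters, h):
    """All delete-one-cluster mean estimates for stratum h, no inner loop.

    Hypothetical helper; assumes at least two PSUs in stratum h.
    """
    total_w = weights.sum()
    total_wx = np.dot(weights, data)
    in_h = strata == h
    # per-cluster totals within the stratum
    psus, idx = np.unique(clusters[in_h], return_inverse=True)
    sw = np.bincount(idx, weights=weights[in_h])
    swx = np.bincount(idx, weights=(weights * data)[in_h])
    n_h = len(psus)
    c = n_h / (n_h - 1)  # re-weighting factor for the surviving clusters
    # deleting cluster j leaves the other strata untouched and rescales
    # the rest of stratum h, so each replicate total is simple arithmetic
    rep_w = total_w - sw.sum() + c * (sw.sum() - sw)
    rep_wx = total_wx - swx.sum() + c * (swx.sum() - swx)
    return rep_wx / rep_w  # one replicate estimate per deleted cluster
```

This gives the same replicate estimates as the loop above, just without rebuilding the weight array for every deleted cluster.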