Array Indexing / Slicing vs. Loops

A Simple, Straightforward Example of a Powerful Concept

Just a quick write-up. (You’ll be relieved to learn there are no jokes in this one. Except this one.) In the Cornell data science courses I have taken so far, the instructors have heavily emphasized using indexing and slicing instead of loops, especially when working with large data sets. The reason is speed: looping over individual elements to perform some kind of matrix math is dramatically slower in Python than operating on whole arrays at once. Indexing and slicing also turn out to be easier to code and understand, once you get a grasp on the concepts.
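To make the speed claim concrete, here is a rough timing sketch. The array size, the random data, and the function names `row_sums_loop` and `row_sums_vectorized` are all my own assumptions for illustration; the exact timings will depend on your machine, so treat the printout as a ballpark rather than a benchmark.

```python
import timeit

import numpy as np

# Hypothetical data: 20,000 rows by 10 columns of random numbers.
rng = np.random.default_rng(0)
data = rng.random((20_000, 10))


def row_sums_loop(matrix):
    """Sum each row with explicit Python loops."""
    totals = np.zeros(matrix.shape[0])
    for i in range(matrix.shape[0]):
        for j in range(matrix.shape[1]):
            totals[i] += matrix[i, j]
    return totals


def row_sums_vectorized(matrix):
    """Sum each row with a single NumPy call."""
    return matrix.sum(axis=1)


# Time one run of each approach.
loop_seconds = timeit.timeit(lambda: row_sums_loop(data), number=1)
vector_seconds = timeit.timeit(lambda: row_sums_vectorized(data), number=1)
print(f"loop: {loop_seconds:.4f} s, vectorized: {vector_seconds:.4f} s")
```

Both functions compute the same row sums; the only difference is whether Python or NumPy does the iterating.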

The problem is that most of us were taught that loops are the be-all and end-all of application development. We know how to write them, and they work, so it’s easy to want to use them. Let’s take a look at an example:

A really simple example to start with

Let’s say we want to find the sum of each row that has a label of 1, and the product of each row that has a label of 0. You could write a loop to accomplish this, but that would turn out to be a fair amount of code for a pretty easy operation, and on a really large data set it would take a fair amount of time. What if, instead, we started by subsetting our data by label, and then applied the operations we want only to the appropriate subsets?
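For contrast, here is roughly what the loop-based version might look like. The values in `data` and `labels` are made up for illustration (four observations, three features each), and I am assuming "sum" and "product" mean one result per row.

```python
import numpy as np

# Hypothetical data: four observations, each with one label.
data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9],
                 [10, 11, 12]])
labels = np.array([1, 0, 1, 0])

loop_result = []
for row, label in zip(data, labels):
    if label == 1:
        # Label 1: add up the values in the row.
        total = 0
        for value in row:
            total += int(value)
        loop_result.append(total)
    else:
        # Label 0: multiply the values in the row.
        product = 1
        for value in row:
            product *= int(value)
        loop_result.append(product)

print(loop_result)  # [6, 120, 24, 1320]
```

It works, but that is a lot of bookkeeping for what is really two operations.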

First, let’s look at how we would create those subsets:
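A minimal sketch of the subsetting step, assuming `data` and `labels` are NumPy arrays (the values here are made up, and `mask` and `label_one_rows` are names I chose for illustration):

```python
import numpy as np

# Hypothetical data: four observations, each with one label.
data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9],
                 [10, 11, 12]])
labels = np.array([1, 0, 1, 0])

# Comparing the array to 1 gives one True/False per observation.
mask = labels == 1
print(mask.tolist())  # [True, False, True, False]

# Indexing with the mask keeps only the rows where it is True.
label_one_rows = data[mask]
print(label_one_rows.tolist())  # [[1, 2, 3], [7, 8, 9]]
```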

Proof of concept

We created a separate matrix that contains only the data that had a label of 1. Note that any valid comparison operator would work here: “>”, “<”, and so forth.
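Here is a quick sketch of the same idea with other operators, again on made-up arrays. The variable names are my own, and the last line shows that the condition can test the data itself, not just the labels.

```python
import numpy as np

# Hypothetical data: four observations, each with one label.
data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9],
                 [10, 11, 12]])
labels = np.array([1, 0, 1, 0])

# With 0/1 labels, these select the same rows as == 1 and == 0.
positive_rows = data[labels > 0]
below_one_rows = data[labels < 1]

# "Not equal" works the same way.
not_one_rows = data[labels != 1]

# Rows whose first column is at least 7.
large_first_value = data[data[:, 0] >= 7]
print(large_first_value.tolist())  # [[7, 8, 9], [10, 11, 12]]
```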

Now that we know how to conceptually create the subset, start the operation by creating the variable where you want to store your output. We have four observations/labels, so our output should be shaped accordingly:
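One way to set that up, as a sketch: I am assuming a one-dimensional output with one slot per observation, created with `np.zeros`. The original may have used a different shape, and the `labels` values are made up.

```python
import numpy as np

# Hypothetical labels: one per observation.
labels = np.array([1, 0, 1, 0])

# One output slot per observation, filled with zeros for now.
datastuff = np.zeros(labels.shape[0])
print(datastuff.shape)     # (4,)
print(datastuff.tolist())  # [0.0, 0.0, 0.0, 0.0]
```

Note that `np.zeros` creates floats by default, so the results stored here will be floats too.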

A place to write values

Now, let’s put together everything we’ve learned so far and populate our datastuff variable with the sums from our data variable where the label is 1.
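A sketch of that step on made-up data. I am assuming the sum is taken across each row (`axis=1`), so every label-1 observation gets its own total.

```python
import numpy as np

# Hypothetical data: four observations, each with one label.
data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9],
                 [10, 11, 12]])
labels = np.array([1, 0, 1, 0])

# One output slot per observation.
datastuff = np.zeros(labels.shape[0])

# Sum each label-1 row and write the totals into the matching slots.
datastuff[labels == 1] = np.sum(data[labels == 1], axis=1)
print(datastuff.tolist())  # [6.0, 0.0, 24.0, 0.0]
```

The same mask appears on both sides of the assignment: once to pick the rows to sum, and once to pick where the results go.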

No loops required!

You would then repeat the operation, replacing labels==1 with labels==0 and np.sum with np.prod. If you want to take a look at the code from beginning to end, as well as the final outputs, you can check it out here.
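Put end to end, the whole thing might look like the sketch below. This is my own reconstruction on made-up data, not the linked code, and it again assumes row-wise operations (`axis=1`).

```python
import numpy as np

# Hypothetical data: four observations, each with one label.
data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9],
                 [10, 11, 12]])
labels = np.array([1, 0, 1, 0])

# One output slot per observation.
datastuff = np.zeros(labels.shape[0])

# Label 1: sum of each row.
datastuff[labels == 1] = np.sum(data[labels == 1], axis=1)

# Label 0: product of each row.
datastuff[labels == 0] = np.prod(data[labels == 0], axis=1)

print(datastuff.tolist())  # [6.0, 120.0, 24.0, 1320.0]
```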
