Why Speed Matters

The vast bulk of data science work is done in Python and R, and that’s fine. Both languages are well suited to analytics and offer a rich infrastructure of libraries and documentation.

Looking at just Python (R is similar), there is, however, a problem: Python is slow, owing to its interpreted execution model and dynamic typing. As a Python program runs, it constantly checks the types of variables and data, and whether operations such as converting data types or expanding lists are feasible. While this makes for fast development and prototyping, execution can be very slow for some types of analysis.
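To make the dynamic typing point concrete, here is a small sketch (the function `add` is a made-up illustration, not code from the project). The same `+` means something different depending on what arrives at run time, so the interpreter has to inspect the operands on every single call before it can do any work.

```python
def add(a, b):
    # No types are declared here, so the interpreter looks up what "+"
    # means for these particular operands each time the line executes.
    return a + b

print(add(1, 2))        # integer addition
print(add("be", "er"))  # string concatenation
print(add([1], [2]))    # list concatenation

# Whether an operation is feasible is also only discovered at run time.
try:
    add(1, "2")
except TypeError as exc:
    print("TypeError:", exc)
```

That flexibility is convenient when prototyping, but inside a tight numerical loop the repeated lookups add up.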

The penny dropped for me after I worked on a price optimization project for a global beer company. The project optimized wholesale and retail prices across four countries, partly using complex procedural logic to calculate the impact of price changes on volume. The Python optimization used simulated annealing from the standard scikit-learn library. It took twelve minutes to run, which defeated our hopes of running it in real time behind an interactive user interface.

The problem was that the objective function (which needs to run hundreds or thousands of times as the optimization explores the solution space) consisted of about 200 lines of Python. While the simulated annealing was presumably efficient, this complex objective function code made the optimization a slow process.
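A rough sketch shows why the objective function dominates. This is not the project code or any particular library's implementation: `anneal`, `CountingObjective`, and `toy_objective` are hypothetical names, the cooling schedule is an arbitrary choice, and a simple quadratic stands in for the roughly 200 lines of pricing logic. It only counts how often the objective gets called.

```python
import math
import random


def anneal(objective, x0, steps=1000, temp0=1.0, seed=0):
    """Minimal simulated annealing over a list of floats (illustrative only)."""
    rng = random.Random(seed)
    x = list(x0)
    fx = objective(x)
    best_x, best_fx = list(x), fx
    for step in range(steps):
        # Linear cooling schedule (an assumption made for this sketch).
        temp = temp0 * (1.0 - step / steps) + 1e-9
        candidate = [xi + rng.gauss(0.0, 0.1) for xi in x]
        fc = objective(candidate)  # one objective evaluation per step
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature falls.
        if fc < fx or rng.random() < math.exp((fx - fc) / temp):
            x, fx = candidate, fc
            if fx < best_fx:
                best_x, best_fx = list(x), fx
    return best_x, best_fx


class CountingObjective:
    """Wraps an objective function and counts its evaluations."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return self.fn(x)


def toy_objective(prices):
    # Hypothetical stand-in for the real pricing logic.
    return sum((p - 3.0) ** 2 for p in prices)


counted = CountingObjective(toy_objective)
best_x, best_fx = anneal(counted, [0.0, 0.0, 0.0, 0.0], steps=1000)
print(counted.calls)  # 1001: one initial evaluation plus one per step
```

The annealing loop itself does very little per step, so whatever each objective call costs is multiplied by the number of steps.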

Discovering Alternatives

During some vacation after the project, I took advantage of the downtime to learn Julia. As a specific project, I rewrote the optimization in Julia, using its optional type annotations and an open-source simulated annealing library, and the execution time went from twelve minutes down to six seconds. This massive speed improvement (a factor of 120) brought the idea of a user interface running the simulation in the background into the realm of possibility.

A few months later, I decided to learn Go. Again, I rewrote the optimization, using an open-source simulated annealing implementation around the recoded objective function. This time, execution was even faster, reaching 0.6 seconds. This was now performant enough to enable the interactivity we had been hoping for, and Go’s suitability for microservices was another strong enabler of this vision.

It should be noted that neither Julia nor Go required a massive rewrite of the original Python objective function. Both languages allow for a procedural style and syntax that is not very far from Python’s, so the translations were reasonably straightforward and took about half a day in each language.

It’s not just about saving time

What excited me was not the speed per se, since we routinely tolerate code that takes a long time to run and adapt our workflows accordingly. It was the new set of possibilities: calculating many more parameters or scenarios quickly, or running calculations fast enough for users to explore the problem space interactively, which is not possible when it takes twelve minutes or longer to recalculate.

On this blog, then, I’d like to share my continuing journey around fast data science, using different languages, architectures, and algorithms to enable new explorations in data science.

2022-05-08

https://fastdatascience.io/post/2022-05-08-why_speed_matters/ Andreas Kaempf