Using Julia for safe Data Science
Recently at Jampp, I had the chance to switch some of our data science environment from Python to Julia. Julia’s type system is, in my opinion, one of the language’s best features, for several reasons. The most obvious is the performance gain it allows. I will not, however, address that point here: it has been benchmarked very well in several places. Instead, I will briefly show a safety advantage of this type system that is really handy for data science.
To be fair, this feature is handy in a production-oriented data science team. Someone more oriented towards analytics might not find it so helpful. Quoting Robert Chang, I perform Type B data science: the models I build are meant to go into production. Safety issues are therefore a big concern when implementing them. If we don’t catch a misbehaviour in our Machine Learning systems, we will be running our business on mistaken metrics. For example, bidding high prices for bad banners millions of times a day…
So, let’s get to the point. When you deal with hundreds of thousands of messages per second coming from a number of sources, you learn one thing: every field will eventually receive at least one ill-formed message. It is true, and not only for data science, that a smart use of types in a language helps you find mismatches early in the data handling process. Take this toy example and look at when Python realizes there is something wrong with the message, and at what Julia does:
Here is Python’s version:
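The behaviour can be sketched without any third-party packages. In the minimal illustration below, plain Python lists stand in for data-frame columns; the field names (price, clicks) and the append_message helper are hypothetical, so treat it as a sketch under those assumptions rather than production code. A pandas column that has fallen back to the object dtype holds arbitrary Python objects in much the same way.

```python
# Toy "data-frame": each column is a plain Python list (hypothetical field names).
table = {"price": [], "clicks": []}


def append_message(table, message):
    """Append one message to the table; nothing checks the field types."""
    for field, column in table.items():
        column.append(message[field])


append_message(table, {"price": 1.0, "clicks": 3})
append_message(table, {"price": 2, "clicks": 5})      # int after a float: fine
append_message(table, {"price": "3.5", "clicks": 1})  # ill-formed: price is a string

# The insertion succeeded, and element-wise processing "works" too:
# str * int repeats the string instead of failing.
doubled = [p * 2 for p in table["price"]]

# Only an aggregation that mixes numbers and strings finally fails.
try:
    total = sum(table["price"])
    failure = None
except TypeError as exc:
    failure = exc
```

Note that the string survives both the insertion and the doubling step; the TypeError only appears once the values are summed, far from the place where the bad message entered.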
And here is Julia’s:
In Python, I added the message to my data-frame and even processed the data without noticing the mismatch. Only later (and by chance) did I get an exception. In Julia, by contrast, the very attempt to insert this message into the data-frame throws an exception. Julia’s type system and the promotions implemented for it are flexible enough to allow for easy data handling (I could receive those values either as Floats or Ints). At the same time, they are not so liberal as to cast everything to a highly abstract class (such as object), so some checks are enforced (I cannot receive a String where a Number is expected). It may be argued that, with proper checking, this safety can be achieved in any language. Fair enough, I’ll concede that. But one would still have to implement it. In Julia, it comes for free.
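For comparison, a hand-rolled check in Python might look like the following sketch. The TypedColumn class and its method names are hypothetical, and it only loosely imitates the behaviour described above: Ints are promoted to Floats, and anything that is not a number is rejected at insertion time. It is one possible approach, not an equivalent of Julia’s promotion rules.

```python
import numbers


class TypedColumn:
    """Hypothetical column that accepts real numbers only, stored as floats."""

    def __init__(self, name):
        self.name = name
        self.values = []

    def push(self, value):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, numbers.Real):
            raise TypeError(
                f"column {self.name!r}: expected a number, "
                f"got {type(value).__name__}"
            )
        self.values.append(float(value))  # promote ints to floats


price = TypedColumn("price")
rejected = []  # ill-formed values set aside for inspection

for raw in [1.0, 2, "3.5"]:
    try:
        price.push(raw)
    except TypeError:
        rejected.append(raw)
```

With this check in place the string never reaches the column: the two numeric values are stored as floats and the ill-formed one ends up in rejected, where it can be logged or repaired before any processing happens.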
At a purely analytical or exploratory level, handling these issues might be considered annoying at best, and they are often corrected by hand. But if a model is running in production, anything unexpected in a message should be caught as soon as possible, dealt with and, only then, processed. If ill-formed messages get through to your models, you could be training on spurious data. As in many other aspects, when dealing with promotions Julia achieves a nice balance between flexibility and safety.