The Hardest Problem in Data Science

What’s the hardest problem in all of data science right now?

Is it AGI, ethics, the gender gap, reproducibility, or opaque algorithms?

Depending on your core values and whether you're personally affected by a particular issue (e.g. the gender gap), you may rank certain issues as more or less important than someone else would. That's completely understandable. However, there is, at least in my opinion, a more universal problem, one that creeps, overtly or implicitly, into the vast majority of conversations and articles about data science. It's an issue we dance around, often without knowing it. And in some senses it's the easiest to solve.

Let’s start with an example. Take three well-known domains: law, medicine, and physics. I imagine few people would struggle to tell a lawyer, a doctor, and a physicist apart.

Now let’s try something a little more challenging. What’s the difference between a statistician, a computer scientist, and a data scientist?

No doubt, it’s easy to separate the statistician from the computer scientist. But the statistician or the computer scientist from the data scientist? Not such an easy feat. Why is that, and why is it so easy to separate the other professions?

Data science is clearly an interdisciplinary profession, one that borrows ideas from other domains. But that’s certainly not unique to data science. Domains steal ideas from one another all the time.

So what’s the real problem?

Think about what law, medicine, and physics have in common. Each has its own explicitly defined vernacular. Each is built on unique, unambiguous terminology. Two lawyers can discuss torts or mens rea without so much as a second thought. One doctor can use the word sternocleidomastoid during a presentation and every other doctor will know exactly what they mean. A group of physicists can discuss force, torque, and angular momentum effortlessly. In every single case, the terms are unique and unambiguous, and they allow for efficient communication.

Sadly, this is far from the case in data science. We can’t even agree on what a data scientist is, let alone simplify the dizzying language we use. Is the proper term column, feature, independent variable, or field? Is it a model or an algorithm? A row, a sample, an observation, or a record?

No wonder outsiders look at us as rebranded statisticians, computer scientists, analysts, etc.

Sure, we borrowed language from more mature disciplines, but if data science is to finally become a domain all its own, it desperately needs to standardize its language. Let’s avoid the countless hours wasted in debate just to find out we actually agree, and instead focus our efforts on the problems that really matter: AGI, ethics, the gender gap, reproducibility, and opaque algorithms.

It’s time for data science to grow up. It’s time to standardize our language.

Until then, the scuffles with statisticians, computer scientists, and all the others we borrow from will rage on.