What is a Data Scientist?
Five simple words that, when uttered in sequence, conjure fierce and ceaseless debate. You’re likely to hear opinions like:
- “A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.”
- “A data scientist is someone with math and stats knowledge, domain expertise, and hacking skills.”
- “A data scientist is a statistician who lives in San Francisco.”
Run a Google search. You’ll find innumerable opinions on the matter. In fact, you can spend an hour, an afternoon, or probably even a week engrossed in this mind-numbing task.
And it never ends. It seems every week there’s a new post delineating what a data scientist is and what a data scientist is not. Some weeks you have to be an expert in Statistics and others you have to know Scala. Some weeks you have to be an expert in software development, machine learning, big data technologies, and visualization tools. And some weeks you have to actually know how to talk to people and clearly articulate your ideas, in addition to all the other technical skills. Every week I read these posts, and every week I cringe.
The Myth of Boxes
Maybe it’s human nature or maybe it’s elitism, but these posts revolve around the idea that you can place people into metaphorical boxes. One is labeled Data Scientist and the other Not Data Scientist. Where and how you decide to draw the line determines which people go into which boxes.
But why the discrepancies?
One possible explanation is that one’s experiences bias one’s worldview. Let me clarify with an example. I have a Master’s degree from a well-known university, have to build everything from scratch to truly understand it, and prefer an even mix of working alone and collaborating with others. Therefore, it’s easy for me to assume every data scientist should have a Master’s or PhD from a reputable university. It’s easy for me to assume every data scientist should build everything from scratch. And it’s easy for me to assume every data scientist should work in exactly the same way as I do.
I mean, I’m a data scientist. I know what it takes. Right?
This is lazy thinking, a mental shortcut. To assume everyone must share my experiences is myopic. Sure, it worked for me, but other data scientists have very different experiences. That’s fine. That’s normal. In fact, that’s ideal because the world is chock full of difficult problems. Solutions aren’t going to come from a homogeneous group. We need fresh ideas, open lines of communication, and inclusion. We need to shift our thinking.
A Shift in Thinking
Rather than focusing on who we should admit into our special little club and who we should exclude, let’s focus on bringing more people into the fold. Instead of arguing about which algorithms, which tools, and which programming languages a real data scientist should know, let’s focus our energy on real problems.
Because people are not boxes. People don’t magically morph from Not Data Scientist to Data Scientist. It’s not quantum; it’s spectral.
Let me say that again: data science is a spectrum.
Let that sink in. Seriously.
Back To The Question: What is a Data Scientist?
Ever look at a data science pipeline? It can take many fanciful forms but it usually breaks down into something like this:
- Ask a question
- Generate some hypotheses
- Collect data
- See if any of your hypotheses have merit
- Make refinements
Hmm, sounds an awful lot like the Scientific Method. Maybe this term data scientist is really just another name for someone who practices these ideas - a rebranding, if you will. Sure, we use fancy new tools and bandy about buzzwords like machine learning and big data, but let’s not fool ourselves. At the core we’re just doing math and science.
In fact, if you leverage the Scientific Method to quantitatively drive your decisions, then I have news for you: you’re absolutely doing some level of data science. Doesn’t matter if you’re generating a report of descriptive statistics for your boss, predicting the next trend on Twitter, or developing a bleeding-edge machine learning algorithm in the lab.
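To give a sense of how modest the entry point can be, here is a minimal Python sketch of the pipeline above, using only the standard library. Everything in it is an assumption made up for illustration: the question (did a page redesign change session length?), the numbers, and the helper names `describe` and `permutation_p_value`. A permutation test is just one simple way to check whether a hypothesis has merit, not the only one.

```python
import random
import statistics


def describe(values):
    """Return a small report of descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


def permutation_p_value(a, b, n_shuffles=2000, seed=0):
    """Estimate how often randomly relabeling the data produces a gap in
    means at least as large as the one actually observed."""
    rng = random.Random(seed)  # seeded so the estimate is repeatable
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        gap = abs(
            statistics.fmean(pooled[: len(a)]) - statistics.fmean(pooled[len(a):])
        )
        if gap >= observed:
            hits += 1
    return hits / n_shuffles


# 1. Ask a question: did the redesign change session length?
# 2. Generate a hypothesis: sessions on the new design are longer.
# 3. Collect data (made-up minutes per session, purely illustrative).
old_design = [4.1, 3.8, 5.0, 4.4, 3.9, 4.7, 4.2, 4.0]
new_design = [4.9, 5.3, 4.6, 5.8, 5.1, 4.8, 5.5, 5.0]

# 4. See if the hypothesis has merit.
print("old:", describe(old_design))
print("new:", describe(new_design))
print("p-value estimate:", permutation_p_value(old_design, new_design))

# 5. Make refinements: collect more data, control for other factors, repeat.
```

The first two print calls are the "report of descriptive statistics for your boss." The third is a small step further along the spectrum. Both count.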
If you’re new to data science, don’t fret. Figuring out where to even start can be daunting. But the dirty little secret that no one ever tells you is that there is no “right” place to start. Honestly, the trick is just getting started. Period. It doesn’t even really matter where. Follow your interests.
Want to learn Python? Dip your toe in by taking that introductory class. Curious about Statistics? Check out Khan Academy videos. Want to learn from those in the know? Read a blog. Go to a Meetup. Attend a conference. Get involved.
And if you’re a grizzled veteran, share your expertise by blogging, creating tutorials, giving talks, mentoring newcomers, or contributing in whatever way makes sense for you.
The one thing I want you to take away from this post is that regardless of your current skill set, regardless of your gender or race or anything else for that matter, you can learn, share, and contribute to data science. The field is sprawling and there’s room for everyone.