Now Is the Right Time to Prepare for Your First Data Science Job

And why you might not be ready yet!

Back in 2008, the financial crisis was in full swing, and I was desperately in need of a job. I had just finished graduate school in applied mathematics, and I had no idea what I was going to do for work. In 2009, I landed my first job as a data scientist. 

I showed up on the first day feeling confident about the techniques and algorithms I had learned in grad school, but I quickly discovered that my schooling hadn’t taught me many of the skills I needed on the job.

Over a decade has passed, and I’m still running into new graduates who are similarly unprepared. But why is this the case?

 

The Data Science Lifecycle

Figure 1: The Data Science Lifecycle.

To answer this question, I found myself reviewing the data science lifecycle (Figure 1). You’ll notice that the lifecycle is not perfectly linear. After being in the data science industry for several years, I’ve found you may bounce back and forth from one phase to another as more information is revealed or accuracy thresholds are not quite met. However, in school, I was taught a simple set of linear steps:

Feature Engineering -> Modeling -> Done!

A linear lifecycle doesn’t account for the information you will uncover along the way, which may lead you back to previous steps in the lifecycle. In fact, you may even find that your problem cannot be solved with the data available to you. In industry, a new machine learning product begins as a proof of concept because you have no idea if it will work. Unfortunately, in academia, professors have to grade your work, so they ensure the problem has a solution. While this makes grading simpler, it does not reflect industry practice. You have no way of knowing if there is a solution to your problem before you have begun a proof of concept or even seen the data.

This contrast can lead a new data scientist to believe every problem has a solution. In industry, your data may not be able to solve your problem, the technology to solve it may not exist yet, or the project may fall short for a whole host of other reasons. When your machine learning project fails, you and your manager might begin to view machine learning projects as risky, and dollars from the budget may be allocated to other, less risky efforts. This can be extremely demoralizing if you’re used to 100% success rates in school. As such, data scientists and managers should treat failures as learning opportunities, and therefore as a kind of success.

“I have not failed. I’ve just found 10,000 ways that won’t work.” — Thomas A. Edison

 

Business Understanding

Looking back at the lifecycle in Figure 1, you’ll notice that the first step is Business Understanding. You should begin every proof of concept by getting to know the business area, its processes, and its data. Having this foundational knowledge can even help you identify new problems to solve. Unfortunately, in school I learned only how to solve a problem, not how to identify one.

I often hear the same story from new data scientists: their boss wants them to identify machine learning use cases, but they don’t know how. If they’re lucky, they’ll have direct access to customers or a really great product owner. But these individuals will only be able to tell them about their business processes. They likely won’t have the knowledge of machine learning, so they might not be able to identify appropriate opportunities.

I’ve also seen new data scientists attempt to complete a project without talking to a customer. A data scientist who does this might choose to remove nulls from their data set. Had they talked to their customer, they might have found that users deliberately leave certain fields blank in certain circumstances. Dropping those rows leaves an incomplete picture and can introduce bias when it’s time to build a model. Access to the customer can help you quickly understand how various customer behaviors or even business rules are influencing the data set.
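Before dropping nulls, it can help to check whether the missing values cluster in one group of users. The sketch below is only an illustration in plain Python: the records and the field names "channel" and "phone" are made up, and in practice you would likely run the same check with a data frame library.

```python
from collections import defaultdict


def missing_rate_by_group(rows, group_key, value_key):
    """Return {group: fraction of rows in that group where value_key is None}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if row.get(value_key) is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


# Hypothetical records; "channel" and "phone" are invented field names.
records = [
    {"channel": "web", "phone": "555-0100"},
    {"channel": "web", "phone": "555-0101"},
    {"channel": "web", "phone": None},
    {"channel": "web", "phone": "555-0102"},
    {"channel": "in_store", "phone": None},
    {"channel": "in_store", "phone": None},
    {"channel": "in_store", "phone": None},
    {"channel": "in_store", "phone": "555-0103"},
]

rates = missing_rate_by_group(records, "channel", "phone")
for channel, rate in sorted(rates.items()):
    print(f"{channel}: {rate:.0%} missing")
```

In this toy data, three of the four in-store records have no phone number, so dropping nulls would remove most of that group. A gap like that is a good reason to ask the customer why the field is left blank.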

In many ways, business understanding is the most critical part of the lifecycle, yet it is often lacking in today’s data science curriculum.

Data Acquisition & Understanding

One of the next steps after your first round of business understanding will likely be acquiring the data set. In school, I was handed multiple data sets instantly, but I’ve found throughout my career that acquiring data can take days, months, or years! In fact, the data to solve your problem might not even exist yet. I understand why this is the case in academia: you only have a finite amount of time for homework assignments, so you can’t have your students spend the entire semester acquiring their data as they might in industry. However, this leaves students unprepared for the real world.

Another area that is often lacking in data science programs is data understanding. When I graduated, I had little experience performing exploratory data analysis on a data set. In school, I mostly learned about models and algorithms, but once I started my first job, I focused almost all of my time on the data and the biases in it. When you don’t take time to explore your data, you increase the likelihood of amplifying bias in your machine learning model. Exploring the data properly and using appropriate data sampling techniques are key.
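As one small example of what exploring and sampling can look like, the sketch below checks how classes are distributed and then draws a stratified sample that keeps those proportions. It is only an illustration: the "label" field, the fraud/ok classes, and the 10% fraction are assumptions, and stratified sampling is just one of several sampling techniques you might choose.

```python
import random
from collections import Counter, defaultdict


def class_shares(rows, key):
    """Return {label: share of rows carrying that label}."""
    counts = Counter(row[key] for row in rows)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every label so rare classes survive."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[key]].append(row)
    sample = []
    for group in by_label.values():
        k = max(1, round(len(group) * fraction))  # keep at least one row per label
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical imbalanced data set: 50 "fraud" rows out of 1,000.
rows = [{"id": i, "label": "fraud" if i < 50 else "ok"} for i in range(1000)]
sample = stratified_sample(rows, "label", fraction=0.1)

print("full data:", class_shares(rows, "label"))
print("sample:   ", class_shares(sample, "label"))
```

A plain random sample of 100 rows could easily end up with very few fraud cases. Sampling within each label keeps the rare class at its original share.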

Modeling

Modeling is the one phase I think my education did a good job of preparing me for. Unfortunately, I only spend about 15% of my time focused on modeling. As I mentioned in the previous section, the other 85% of my time is spent acquiring and understanding data.

Figure 2: Time Spent by Data Scientists in Academia vs. Industry.

But why does academia focus on models and algorithms while industry focuses on the data? It may be that cheap, high-quality, realistic data sets are hard to come by. While it’s important that you understand the theory behind different algorithms and how to use them, it’s arguably more important to understand your data and the biases in it.

Take for example a psychologist who would like to perform a study but doesn’t have much funding. To cut costs, the psychologist might use college students 18–24 years old in their study. While the psychologist might find significant results, the individuals in their sample likely aren’t representative of the entire population.

The same problem exists in the computer science industry. Large tech companies spend a lot of time collecting and labeling data properly, which comes with a large price tag. Ultimately, classroom data sets do not prepare you for all the annotation, cleaning, and preprocessing you will need to do in the industry.

Deployment

The last phase in the data science lifecycle that we haven’t discussed yet is deployment. You might be lucky enough at your first job to have a machine learning engineer handle this phase for you. In my experience, however, I was responsible for deploying the models I created at my first job, which was something I never learned in school. School often teaches you how to make predictions with your model, but it rarely covers the best architecture for deploying that model to the cloud.

Another thing I learned on the job is that when you deploy a model to real users, your data will change over time, causing your model to degrade or make entirely incorrect predictions. This happened to me once when a business owner didn’t tell me that they renamed and reorganized the categories on their documents. Suddenly, my model was predicting the wrong categories, and the updated data points took so long to come in that the model was sitting in production making incorrect predictions for months.
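A lightweight check can surface this kind of change much sooner. The sketch below compares the categories arriving in production against the ones the model was trained on and reports the share that is unfamiliar. It is an illustration under assumptions: the category names and the 5% alert threshold are made up, and a real monitoring setup would track more than labels.

```python
from collections import Counter


def unseen_category_share(training_categories, production_values):
    """Return (share of production values not seen in training, their counts)."""
    known = set(training_categories)
    if not production_values:
        return 0.0, Counter()
    unseen = Counter(v for v in production_values if v not in known)
    return sum(unseen.values()) / len(production_values), unseen


# Hypothetical categories; the renamed ones stand in for a reorganized scheme.
training_categories = ["invoice", "contract", "memo"]
recent_values = [
    "invoice", "agreement", "agreement", "memo", "billing",
    "agreement", "invoice", "memo", "agreement", "billing",
]

ALERT_THRESHOLD = 0.05  # assumed tolerance; tune it for your own data

share, unseen = unseen_category_share(training_categories, recent_values)
if share > ALERT_THRESHOLD:
    print(f"{share:.0%} of recent values use unknown categories: {dict(unseen)}")
```

Run on a schedule against recent records, a check like this would flag renamed categories as soon as they appear, rather than months later.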

Model degradation and retraining are integral parts of the deployment process that should be covered in data science curriculums.

Conclusion

If you start to doubt yourself at your first job, just know you’re not alone! Every recent grad I’ve talked to has similar feelings, and the field has a long way to go to bridge the gap between academia and industry.

If you’re a recent grad, I encourage you to dive deeper into the areas we covered in this article to help bridge the likely large gap in your education. I also recommend that your first job as a data scientist be on a larger team of more experienced data scientists. This will allow you to learn from the mistakes the senior data scientists on your team have made instead of making those mistakes yourself.