

Recently, there have been many heated discussions on what the job of a data scientist should entail. Many companies expect data scientists to be full-stack, which includes knowing lower-level infrastructure tools such as Kubernetes (K8s) and resource management.

This post argues that while it's good for data scientists to own the entire stack, they can do so without having to know K8s, as long as they leverage a good infrastructure abstraction tool that lets them focus on actual data science instead of getting YAML files to work.

The post starts with the hypothesis that the expectation for full-stack data scientists stems from the fact that their development and production environments are vastly different. For some teams, production means generating nice plots from notebook results to show to the business team. For other teams, production means keeping models up and running for millions of users per day.

It then discusses a two-step solution for bridging the gap between these two environments: the first step is containerization, and the second is infrastructure abstraction. While containerization is more or less well understood, infrastructure abstraction is a relatively new category of tools, and many people still confuse it with workflow orchestration. The last part of the post compares various workflow orchestration and infrastructure tools, including Airflow, Argo, Prefect, Kubeflow, and Metaflow.
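To make the contrast concrete, here is a minimal sketch (not taken from the post) of what an infrastructure abstraction can look like: compute resources are declared with Python decorators in the same file as the data science code, so no K8s YAML is involved. The flow below uses Metaflow's public API; the flow name, step logic, and resource numbers are illustrative assumptions.

```python
from metaflow import FlowSpec, step, resources


class TrainFlow(FlowSpec):
    """An illustrative flow: each @step is a unit of work that the
    underlying platform can schedule on whatever infrastructure the
    team uses, without the author writing any YAML."""

    @step
    def start(self):
        # Stand-in for loading real training data.
        self.data = list(range(10))
        self.next(self.train)

    @resources(cpu=4, memory=16000)  # resources declared in Python, not YAML
    @step
    def train(self):
        # Stand-in for actual model training.
        self.model = sum(self.data)
        self.next(self.end)

    @step
    def end(self):
        print("trained model:", self.model)


if __name__ == "__main__":
    TrainFlow()
```

Saved as `train_flow.py` (a hypothetical filename), this runs locally with `python train_flow.py run`; the same unchanged file can then be sent to remote compute, for example with `python train_flow.py run --with batch` on AWS Batch. That separation, where the data scientist writes Python and the platform decides where it runs, is the kind of abstraction the post advocates.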
