Overview of Docker in ML Data Platform Deployment

  • Docker is a powerful tool for the deployment of machine learning data platforms due to its ability to package and distribute software in a consistent and efficient manner. Using Docker, you can create reproducible environments that encapsulate all dependencies, making it easier to manage, scale, and deploy ML models and their associated data processing workflows.
  • A container is a lightweight, standalone, and executable software package that includes everything needed to run a piece of software: the code, runtime, libraries, and system tools. Containers isolate software from its environment and ensure that it runs uniformly despite differences between, for instance, development and staging environments.
  • Containers are similar to virtual machines but are more resource-efficient because they share the host system’s kernel and do not require a full operating system for each instance. This makes them fast to start, highly portable, and efficient in terms of system resource usage. They are commonly used to ensure consistent operation across different computing environments, streamline development, and simplify deployment and scaling.

Benefits of Using Docker for ML

  • Consistency: Docker containers ensure that your ML application runs the same way in different environments, from development through to production.
  • Scalability: Easily scale up or down by simply spinning up more or fewer containers without needing to reconfigure the underlying infrastructure.
  • Isolation: Containers are isolated from one another and the host system, reducing conflicts between running applications.
  • Portability: Containers can run on any system that supports Docker, from local machines to cloud-based environments.

Key Docker Concepts

  • Images: The blueprint of the application, including the system libraries, application libraries, and code.
  • Containers: Runtime instances of Docker images—what the images become in memory when executed.
  • Dockerfile: A script containing a series of commands and instructions to build a Docker image.
  • Docker Hub: A registry to store Docker images, both public and private.
  • Volumes: Used to persist data generated by and used by Docker containers (see the short example after this list).
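
For example, the following commands (assuming an image named my-ml-app has already been built and a ./data directory exists locally) show how these pieces fit together:

# Pull a public base image from Docker Hub
docker pull python:3.8-slim

# List the images available locally
docker images

# Start a container from the my-ml-app image, mounting a host directory for data
docker run -d --name my-ml-container -v "$(pwd)/data:/app/data" my-ml-app

# List running containers
docker ps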

Docker Workflow for ML Deployment

Step 1: Create a Dockerfile

A Dockerfile defines the environment inside your container. It lets you install software, copy files, and run commands as the image is built.

# Use an official Python runtime as a parent image
FROM python:3.8-slim

# Set the work directory
WORKDIR /app

# Install Python dependencies
COPY requirements.txt /app/
RUN pip install --no-cache-dir -r requirements.txt

# Copy the current directory contents into the container at /app
COPY . /app

# Make port 80 available to the world outside this container
EXPOSE 80

# Define environment variable
ENV NAME=World

# Run app.py when the container launches
CMD ["python", "app.py"]

Step 2: Build the Docker Image

From the directory containing the Dockerfile:

docker build -t my-ml-app .
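
If the build succeeds, the new image appears in your local image list:

# Verify that the image was created
docker images my-ml-app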

Step 3: Run the Container

Start a container using the image you just built:

docker run -p 4000:80 my-ml-app
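
Assuming app.py serves HTTP on port 80 (as the EXPOSE line above suggests), you can check that the container is up and responding:

# Confirm the container is running
docker ps

# Send a test request to the published host port
curl http://localhost:4000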

Step 4: Manage Data with Docker Volumes

Since containers are ephemeral, use Docker volumes to manage and persist data:

docker run -d -p 4000:80 -v "$(pwd)/data:/app/data" my-ml-app

This command mounts the data directory from your current directory to the /app/data directory in the container.
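
This is a bind mount, which maps a host directory into the container. Docker also supports named volumes, which Docker manages itself and which are convenient when you do not need to browse the data on the host (the volume name ml-data below is arbitrary):

# Create a named volume and mount it at /app/data
docker volume create ml-data
docker run -d -p 4000:80 -v ml-data:/app/data my-ml-app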

Advanced Usage

Docker Compose

For complex applications with multiple containers, such as ML model services, databases, and web servers, Docker Compose can manage the lifecycle of the whole application with a single command.

version: '3'
services:
  web:
    build: .
    ports:
     - "5000:5000"
  redis:
    image: "redis:alpine"

Networking

Docker can manage networking between containers: on a user-defined network, containers can reach one another by name, allowing data to flow between your ML models, databases, and applications.
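
A minimal sketch, assuming a hypothetical setup with a model container and a database container that need to talk to each other over a user-defined bridge network called ml-net:

# Create a user-defined bridge network
docker network create ml-net

# Attach containers to it; they can then reach each other by name (e.g. "db")
docker run -d --name db --network ml-net -e POSTGRES_PASSWORD=example postgres:15
docker run -d --name model --network ml-net my-ml-app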

Security

Ensure your Docker images and containers are secure by managing container privileges and access controls.
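
A minimal sketch of tightening a container at run time with standard docker run options (the values are illustrative and should be adapted to your application):

# Drop all Linux capabilities, run as an unprivileged user, and block privilege escalation
docker run -d -p 4000:80 \
  --cap-drop ALL \
  --user 1000:1000 \
  --security-opt no-new-privileges \
  my-ml-app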

Deploy Jupyter Notebook on Docker

  • Deploying Jupyter Notebook or JupyterLab within a Docker container is a great way to ensure that you have a consistent and isolated environment for development, which can be easily shared with others. Here’s a step-by-step guide to help you set up Jupyter in Docker:

Step 1: Create a Dockerfile

First, you’ll need to create a Dockerfile that defines the environment for running Jupyter. This will include the base image, installation of Jupyter, and any additional dependencies or libraries you need for your work.

Example Dockerfile for Jupyter

# Use an official Python runtime as a parent image
FROM python:3.8-slim

# Set the working directory in the container
WORKDIR /usr/src/app

# Install Jupyter
RUN pip install jupyterlab

# Expose the port Jupyter will run on
EXPOSE 8888

# Run JupyterLab
# Note: `--ip=0.0.0.0` makes JupyterLab listen on all network interfaces inside the container,
# which is required so you can reach it from your browser on the host.
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--no-browser", "--allow-root"]

Step 2: Build the Docker Image

Once you have your Dockerfile, you can build the Docker image. Navigate to the directory containing your Dockerfile and run:

docker build -t jupyterlab-docker .

This command builds a new Docker image named jupyterlab-docker based on the instructions in your Dockerfile.

Step 3: Run the Docker Container

After the image is built, you can run it:

docker run -p 8888:8888 jupyterlab-docker

This command starts a container from your jupyterlab-docker image. It maps port 8888 on your local machine to port 8888 on the container, allowing you to access JupyterLab from your browser.

Step 4: Access JupyterLab

When you run the container, JupyterLab starts and logs a URL with a token in the console. You will need to copy this URL into your browser to access JupyterLab. The URL will look something like this:

http://127.0.0.1:8888/?token=<some_long_token>
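
If the console output has scrolled away, or you started the container in the background with -d, the same URL (including the token) can be recovered from the container logs:

# Replace <container-id> with the ID or name shown by docker ps
docker logs <container-id>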

Optional: Adding Persistent Storage

If you want your notebooks and data to persist after the container is stopped, you can mount a volume to the container:

docker run -p 8888:8888 -v "$PWD":/usr/src/app jupyterlab-docker

This command mounts the current directory ($PWD) on your host to /usr/src/app in the container, making it the working directory where your notebooks and files are stored.

Optional: Adding More Libraries

If your work requires additional libraries or tools, you can add more RUN pip install lines in the Dockerfile or create a requirements.txt file and copy it into the image:

COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

This setup provides a robust, reproducible, and portable environment for working with Jupyter notebooks, ensuring that your data and code can be shared and accessed seamlessly across different machines and platforms.

Conclusion

Docker is an essential tool for deploying ML data platforms due to its ability to create consistent environments and handle dependencies efficiently. By containerizing ML applications, teams can focus more on development and less on environment management, which accelerates the deployment process and reduces overhead in operations.