YOLOv8 Deployment on Vertex AI Endpoints

Deploying machine learning models efficiently is critical for making real-time predictions and ensuring scalable operations. In this article, we will guide you through the process of deploying a YOLOv8 model on Google Cloud Platform (GCP) using Vertex AI. This deployment enables the model to be served online, offering robust infrastructure and management capabilities. We will cover each step in detail, from preparing your YOLOv8 model and creating a Flask application, to containerizing your app and deploying it on Vertex AI Endpoints. By following these instructions, you can leverage GCP’s powerful tools to operationalize your machine learning workflows seamlessly.


1- The whole process, from training the YOLOv8 model to containerizing the Flask application, was carried out on a Vertex AI Workbench instance running Ubuntu.

2- With this approach, the deployed model expects a base64-encoded image as input.
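As a rough illustration of that input format, the sketch below encodes raw image bytes and wraps them in the JSON layout read by the Flask route shown in step 2 (`request.json['instances'][0]['image'][0]`). The function name `build_request_body` and the file name `sample.jpg` are hypothetical, chosen only for this example.

```python
import base64
import json


def build_request_body(image_bytes: bytes) -> str:
    """Wrap raw image bytes in the JSON layout the prediction route reads."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    # The route reads instances[0]["image"][0], so the base64 string
    # is placed as the first element of a list under the "image" key.
    return json.dumps({"instances": [{"image": [encoded]}]})


# Usage with a hypothetical local file:
# with open("sample.jpg", "rb") as f:
#     body = build_request_body(f.read())
```

The same body can be sent to the local Flask app during testing and, later, to the deployed endpoint.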

1. Prepare Your YOLOv8 Model

After training your YOLOv8 model, ensure you have the trained weights file ready. This file is essential for making predictions with your model.

A pre-trained model file can be found in the official YOLOv8 GitHub repository.

2. Create a Flask Application

  • Set Up Your Flask App: Your Flask application will serve as the interface for processing incoming requests to your model. It should accept a base64-encoded image as input, decode it, and prepare it for inference with your YOLOv8 model.
  • Model Inference: Implement a function in your Flask app that takes the processed image, runs it through your YOLOv8 model using the trained weights, and returns the detection results.
  • Return Results: Format the results from your model (e.g., bounding boxes, class labels, confidence scores) as a JSON response.

import base64
from io import BytesIO

import cv2
import numpy as np
from flask import Flask, Response, jsonify, request
from PIL import Image
from ultralytics import YOLO

from predict_utils import detect

app = Flask(__name__)

## make sure you have the right path to your model file.
model = YOLO("model.pt")

# Health check route. The path "/isalive" is taken from the log message below;
# it must match the health route you set in Vertex AI.
@app.route('/isalive', methods=['GET'])
def is_alive():
    print("/isalive request")
    return Response(status=200)

# image detection route
@app.route('/predict', methods=['POST'])
def image_process_flow():
    # the base64 string is the first element of the "image" list of the first instance
    base64_string = request.json['instances'][0]['image'][0]
    # decode back to an image; convert to RGB so PNGs with alpha or grayscale images also work
    img = Image.open(BytesIO(base64.b64decode(base64_string))).convert("RGB")
    img = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)  ## make it a cv2 object
    inputs = [img]
    results = model(inputs)  # List of Results objects
    labels, coordinates, confidence = detect(results)
    ## output format is important to successfully deploy it on GCP Vertex AI endpoint
    return jsonify({
        "predictions": [{'coordinates': coordinates, 'label': labels, 'confidence': confidence}]
    })

## make sure to have those settings for your flask-app
## (host '0.0.0.0' makes the app reachable from outside the container)
if __name__ == '__main__':
    app.run(port=8080, host='0.0.0.0')

In your predict_utils.py file, you should have the following code:

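def detect(results):
    cords = []
    confs = []
    labels = []
    for result in results:
        boxes = result.boxes  # Boxes object for bbox outputs
        class_map = result.names  # maps class id -> class name
        for b in boxes:
            ## single object info below
            conf = float(b.conf[0])
            class_id = int(b.cls[0])
            xyxy = b.xyxy.tolist()[0]  # [x1, y1, x2, y2]
            # collect the values so they are actually returned
            # (xyxy boxes are used here; use b.xywh.tolist()[0] instead
            # if you prefer center/width/height boxes)
            labels.append(class_map[class_id])
            cords.append(xyxy)
            confs.append(conf)
    return [labels, cords, confs]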

3. Containerize Your Flask Application

  • Create a Dockerfile: Write a Dockerfile for your Flask application. This file should include instructions to set up the Python environment, install dependencies (including Flask and any libraries needed for YOLOv8), copy your application code and model weights into the container, and specify the command to run your Flask app.
  • Build Your Docker Image: Use Docker to build an image from your Dockerfile. This process packages your Flask app and all its dependencies into a container that can be deployed anywhere Docker is supported.
  • Push to Container Registry: Once your Docker image is built, push it to Google Container Registry (GCR). This makes the image available for deployment on GCP.

Our Dockerfile should look like this:

FROM pytorch/pytorch
WORKDIR /workspace
ENV HOME=/workspace
ADD . /workspace
RUN pip install -r requirements.txt
RUN apt-get update && apt-get install -y ffmpeg libsm6 libxext6
RUN chown -R 8080:8080 /workspace
CMD [ "python" , "/workspace/app.py" ]

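Next, make sure your deployment folder contains the files referenced above:

  • app.py (the Flask application)
  • predict_utils.py
  • model.pt (the trained weights)
  • requirements.txt
  • Dockerfile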

To build the image, run the following command in the terminal from the deployment folder:

docker build -t gcr.io/{PROJECT ID}/{IMAGE NAME} .

After the build is done, push the image to Container Registry with this command:

docker push gcr.io/{PROJECT ID}/{IMAGE NAME}

4. Upload the Model to Vertex AI Model Registry

In this step you register the custom container in the Vertex AI Model Registry. No separate model artifact is uploaded: the weights are already packaged inside the Docker image, so the registry entry simply points to the container that will serve predictions.

Open Vertex AI in the Google Cloud console and select Model Registry.

Click the Import button.

In the form that opens, choose a model name (a description is optional), then click Continue.

Make sure to choose “Import an existing custom container”.

Then click Browse to choose a container image. The dialog lists all the images in your Container Registry; choose the one you built for the Flask app.

Scroll down to the environment variables section. The only fields to change are the prediction route and the health route, which must match the routes defined in the Flask app.

Port 8080 is already filled in, so there is no need to change it. Only enter your health and prediction routes.

Then click Continue, and on the next screen click Import without changing anything. The model should appear in Model Registry shortly.
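The same import can also be scripted instead of done in the console. The sketch below is one possible way to do it with the `google-cloud-aiplatform` Python SDK, assuming that package is installed and you are authenticated. The display name, the default region, and the helper names are hypothetical, and the health route `/isalive` is an assumption based on the log message in the Flask app.

```python
def build_upload_args(project_id: str, image_name: str) -> dict:
    """Collect the arguments that mirror the console import form."""
    return {
        "display_name": "yolov8-flask",  # hypothetical model name
        "serving_container_image_uri": f"gcr.io/{project_id}/{image_name}",
        "serving_container_predict_route": "/predict",
        "serving_container_health_route": "/isalive",  # assumed health route
        "serving_container_ports": [8080],
    }


def upload_model(project_id: str, image_name: str, region: str = "us-central1"):
    """Register the custom container in Model Registry (needs GCP credentials)."""
    # Imported here so the helper above works without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project_id, location=region)
    return aiplatform.Model.upload(**build_upload_args(project_id, image_name))
```

Calling `upload_model("my-project", "my-image")` with your own project ID and image name should produce the same registry entry as the console steps.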

5. Deploy the Model on Vertex AI for Online Predictions

To deploy your YOLOv8 model for online predictions using Vertex AI, follow these detailed steps. This process involves creating an endpoint on Vertex AI, selecting your model for deployment, configuring compute resources, and eventually sending prediction requests to the deployed model. Here’s how to do it:

  • Create a Vertex AI Endpoint
  • Deploy Your Container to the Endpoint: Deploy your Docker image (from GCR) to the Vertex AI endpoint.
  • Invoke Your Model: Once deployed, you can make prediction requests to your Vertex AI endpoint. You’ll need to send a base64-encoded image in the request body, which your Flask app will process and respond with the detection results.

Open Online prediction in Vertex AI and click Create.

Enter an endpoint name, then click Continue.

You will be prompted to select a model from the model registry. Pick the model name that corresponds to your YOLOv8 deployment. If there are multiple versions of your model, select the appropriate version to deploy. Since this deployment focuses on a single model, set the traffic split to 100% to direct all inference requests to your YOLOv8 model.

Scroll to the “Compute resources” section to configure the scaling settings.

  • Minimum and Maximum Compute: Choose the minimum and maximum number of compute instances for auto-scaling. This allows your endpoint to scale based on traffic.
  • Select Machine Type: For the machine type, options like n1-standard-8 or n1-standard-4 are generally sufficient for most use cases, offering a balance between cost and performance.

After configuring these settings, click “Continue” and then “Create” to deploy your endpoint.
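For readers who prefer scripting, the console steps above can be approximated with the `google-cloud-aiplatform` SDK. This is a minimal sketch, assuming the SDK is installed and the model is already in Model Registry. The function names, the default region, and the replica counts are illustrative choices, not values from the original setup.

```python
def build_deploy_args(machine_type: str = "n1-standard-4",
                      min_replicas: int = 1,
                      max_replicas: int = 2) -> dict:
    """Validate the scaling settings and return them as deploy arguments."""
    if min_replicas < 1 or max_replicas < min_replicas:
        raise ValueError("need 1 <= min_replicas <= max_replicas")
    return {
        "machine_type": machine_type,
        "min_replica_count": min_replicas,
        "max_replica_count": max_replicas,
        "traffic_percentage": 100,  # single model, so it receives all traffic
    }


def deploy_model(model_resource_name: str, endpoint_display_name: str,
                 project_id: str, region: str = "us-central1"):
    """Create an endpoint and deploy the registered model to it."""
    # Imported here so the helper above works without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project_id, location=region)
    model = aiplatform.Model(model_resource_name)
    endpoint = aiplatform.Endpoint.create(display_name=endpoint_display_name)
    model.deploy(endpoint=endpoint, **build_deploy_args())
    return endpoint
```

Keeping the minimum replica count low limits cost, while the maximum caps how far the endpoint can scale under load.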

Once the deployment is complete, you will receive an email notification confirming that your endpoint is ready for use.

To use the deployed endpoint, return to the online prediction page on Vertex AI. Locate your endpoint and click on the “Sample request” option next to it. This section will provide REST and Python instructions for sending a sample prediction request to your endpoint.
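As a complement to the sample request shown in the console, here is a small sketch of calling the endpoint from Python and reshaping the response. It assumes the `google-cloud-aiplatform` SDK is installed and that the container returns the `coordinates`, `label`, and `confidence` lists produced by the Flask app. The function names and the default region are hypothetical.

```python
def flatten_predictions(predictions: list) -> list:
    """Turn the per-image lists into one record per detected object."""
    detections = []
    for image_result in predictions:
        rows = zip(image_result["label"],
                   image_result["coordinates"],
                   image_result["confidence"])
        for label, box, conf in rows:
            detections.append({"label": label, "box": box, "confidence": conf})
    return detections


def predict(endpoint_id: str, image_b64: str,
            project_id: str, region: str = "us-central1") -> list:
    """Send one base64-encoded image to the deployed endpoint."""
    # Imported here so the helper above works without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project_id, location=region)
    endpoint = aiplatform.Endpoint(endpoint_id)
    response = endpoint.predict(instances=[{"image": [image_b64]}])
    return flatten_predictions(response.predictions)
```

Each returned record pairs a class label with its bounding box and confidence score, which is usually easier to consume than three parallel lists.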

Additional Considerations

  • Security and Permissions: Ensure your GCP services and resources have the correct IAM permissions for interacting with each other.
  • Testing and Validation: Before deploying, thoroughly test your Flask application locally to ensure it correctly processes requests and returns the expected results.
  • Monitoring and Logging: If needed, set up monitoring and logging for your deployed application in GCP to keep track of its performance and troubleshoot any issues.
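For the local testing mentioned above, one option is a small client built on the standard library. The sketch below assumes the Flask app is running locally on port 8080 with a `/predict` route, as in the application code. The helper names and the timeout value are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # matches app.run(port=8080)


def build_predict_request(image_b64: str,
                          base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request the /predict route expects."""
    body = json.dumps({"instances": [{"image": [image_b64]}]}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response (app must be running)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Running `send(build_predict_request(encoded_image))` against the local app before building the image helps catch request-format problems early.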

Following these steps will help you deploy your YOLOv8 model as a custom model on Vertex AI, leveraging GCP’s infrastructure for scalable, online predictions.

Deploying a YOLOv8 model on Google Cloud Platform using Vertex AI Endpoints provides a powerful and scalable solution for real-time predictions. By following the detailed steps outlined in this guide, you can prepare your model, create a Flask application, containerize it, and deploy it on Vertex AI. This setup not only ensures efficient handling of prediction requests but also leverages GCP’s infrastructure for optimal performance and scalability. With these tools and techniques, you can enhance your machine learning applications and deliver robust, real-time services to your users.

Author: Taj Saleh, Senior Data Scientist, Oredata
