How to Deploy an LLM for Production Use-Cases
How to deploy fine-tuned LLMs as chatbots using Flask and Hugging Face, containerize them with Docker, scale on AWS ECS, serve them with TensorFlow Serving, and optimize performance with pruning and distillation
Oct 7, 2024 • 9 Minute Read
After fine-tuning your Large Language Model (LLM), the next critical step is to deploy it in a real-world environment where it can serve user queries at scale. In this blog, we’ll cover multiple deployment methods, focusing on:
- Deploying your fine-tuned model as a chatbot using Flask and the Hugging Face Transformers pipeline.
- Containerizing the application with Docker and deploying it to AWS ECS, along with a mention of parallels in Azure and GCP.
- Using TensorFlow Serving to serve your model, and deploying TensorFlow Serving to AWS ECS.
- Integrating performance optimizations like pruning and distillation to make the model more efficient for production.
Deploying Your Fine-Tuned Model as a Chatbot with Flask
What is Flask?
Flask is a lightweight Python web framework that is perfect for building small applications, such as an API to host your model. It’s widely used in the machine learning community due to its simplicity and minimal setup requirements. Flask allows you to expose your model via HTTP endpoints, enabling users to send requests and receive responses in real-time.
Setting Up Flask with the Transformers Pipeline
The Transformers pipeline from Hugging Face is a high-level abstraction that simplifies using pre-trained models for various NLP tasks, such as question-answering, text generation, and translation. In this case, we’ll use the pipeline API to load our fine-tuned model and create a chatbot.
Let’s go through the steps:
Step 1: Install Flask and Transformers
First, we need to install Flask and Transformers, along with a model backend such as PyTorch (the Transformers pipeline needs PyTorch or TensorFlow installed to actually run the model). These libraries let us create the API and load our model:
pip install flask transformers torch
Step 2: Writing the Flask API
We will now write a Flask API that allows users to send a question and context (in JSON format) and get the model’s prediction as a response.
from flask import Flask, request, jsonify
from transformers import pipeline

# Initialize the Flask application
app = Flask(__name__)

# Load the fine-tuned model using the Hugging Face pipeline
# This pipeline automatically loads the model and tokenizer for question-answering
qa_pipeline = pipeline("question-answering", model="path_to_your_model")

# Define an endpoint for the chatbot
@app.route("/ask", methods=["POST"])
def ask_question():
    data = request.json  # Get the JSON data from the request
    question = data.get("question")  # Extract the 'question' field
    context = data.get("context")  # Extract the 'context' field

    # Pass the question and context to the pipeline
    result = qa_pipeline(question=question, context=context)

    # Return the result as a JSON response
    return jsonify(result)

# Run the Flask app, making it available on port 5000
if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
Detailed Explanation of the Code
- Loading the model: The pipeline automatically loads both the model and the tokenizer based on the task (question-answering in this case). This greatly simplifies model serving.
- Creating the /ask endpoint: The @app.route decorator defines the /ask endpoint, where the chatbot will accept POST requests containing a question and context. The function ask_question handles these requests.
- Processing the input: The request.json line extracts the data from the incoming request. The question and context are parsed and then passed to the Hugging Face pipeline.
- Getting the result: The qa_pipeline(question=question, context=context) line sends the input to the model for prediction, and the result (typically an answer and its confidence score) is returned as a JSON response.
Step 3: Testing the API
Once the Flask app is running, you can use curl to send POST requests and receive responses from the model:
curl -X POST "http://127.0.0.1:5000/ask" -H "Content-Type: application/json" -d '{
"question": "What is Flask?",
"context": "Flask is a lightweight WSGI web application framework in Python."
}'
This command sends a request to the /ask endpoint with a question and context. You will get a response like this:
{
  "answer": "Flask is a lightweight WSGI web application framework in Python.",
  "score": 0.98,
  "start": 0,
  "end": 58
}
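If you prefer testing from Python instead of curl, a minimal client for the same endpoint might look like this (the URL and payload simply mirror the curl example above):
import requests

# Send the same question/context payload to the /ask endpoint
payload = {
    "question": "What is Flask?",
    "context": "Flask is a lightweight WSGI web application framework in Python."
}
response = requests.post("http://127.0.0.1:5000/ask", json=payload)
print(response.json())  # e.g. {"answer": "...", "score": 0.98, ...}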
This basic chatbot is now ready to handle user queries locally. But we need to deploy it in a scalable production environment.
Dockerizing and Deploying the Model to AWS ECS
Why Use Docker?
Docker allows you to package your application (code, model, dependencies) into a lightweight, self-contained container. This makes it easy to run your app consistently across different environments. Docker ensures that your application behaves the same way in development, testing, and production.
Step 1: Dockerizing the Flask Application
We’ll create a Dockerfile to package the Flask API along with the fine-tuned model.
# Use the official Python image as a base
FROM python:3.9-slim
# Set the working directory
WORKDIR /app
# Copy the current directory contents into the container
COPY . /app
# Install required dependencies
RUN pip install flask transformers torch
# Expose port 5000 for the Flask app
EXPOSE 5000
# Define the command to run the Flask app
CMD ["python", "app.py"]
Explanation of the Dockerfile
- Base image: The FROM python:3.9-slim line uses a minimal Python environment to keep the container lightweight.
- Working directory: The WORKDIR /app sets the working directory inside the container, ensuring that all subsequent commands run inside this folder.
- Copy files: The COPY . /app line copies the contents of your current directory (code, model, etc.) into the /app directory inside the container.
- Install dependencies: The RUN pip install flask transformers torch command installs Flask, Hugging Face Transformers, and the PyTorch backend inside the container.
- Expose port: The EXPOSE 5000 line exposes port 5000, which is the port Flask listens to by default.
- Run the application: The CMD ["python", "app.py"] command runs the Flask application when the container starts.
Step 2: Build and Test the Docker Container
Build the Docker image with:
docker build -t my-llm-chatbot .
Run the container locally to ensure everything works as expected:
docker run -p 5000:5000 my-llm-chatbot
Send a POST request to http://localhost:5000/ask (for example, with the curl command from Step 3) to test the containerized Flask API; the endpoint only accepts POST requests, so it won't respond to a plain browser visit.
Step 3: Deploying to AWS ECS
What is AWS ECS?
Amazon Elastic Container Service (ECS) is a fully managed container orchestration service. It simplifies running Docker containers at scale in the cloud. With ECS, you can define services, automatically scale them, and deploy updates.
Step 3.1: Create an ECS Cluster
- Create an ECS Cluster: Go to the AWS Management Console and search for ECS. Create a new cluster and select EC2 Linux + Networking to run EC2 instances that will host your Docker containers.
- Launch Instances: Set up your cluster by specifying the number of EC2 instances, the instance type, and the networking details (VPC, subnets).
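If you prefer the command line, the cluster itself can also be created with the AWS CLI (the cluster name below is just a placeholder; the EC2 instances still need to be provisioned as described above):
aws ecs create-cluster --cluster-name llm-chatbot-cluster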
Step 3.2: Push the Docker Image to Amazon ECR
Amazon Elastic Container Registry (ECR) is a fully managed Docker container registry. Before deploying the container to ECS, we need to store the Docker image in ECR.
- Create an ECR repository: In the AWS Management Console, go to ECR and create a new repository.
- Tag your image: Tag your Docker image with the repository URL:
docker tag my-llm-chatbot:latest <your-ecr-repo-url>
- Push the image to ECR:
docker push <your-ecr-repo-url>
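Note that the push only succeeds once Docker is authenticated against your ECR registry. A typical login command looks like the following (the region and registry URL are placeholders you would substitute):
aws ecr get-login-password --region <your-region> | docker login --username AWS --password-stdin <your-ecr-registry-url>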
Step 3.3: Define the ECS Task and Service
- Create a Task Definition: In the ECS console, create a new Task Definition. Add your ECR image and allocate CPU and memory resources.
- Create a Service: Create a service to run your task on the cluster. This service will ensure that your chatbot is always running, and you can scale it by adding more containers if needed.
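For reference, a minimal task definition for this service might look roughly like the following sketch (the family name, CPU, and memory values are illustrative placeholders; size them to your model's footprint):
{
  "family": "llm-chatbot-task",
  "containerDefinitions": [
    {
      "name": "llm-chatbot",
      "image": "<your-ecr-repo-url>:latest",
      "cpu": 1024,
      "memory": 4096,
      "portMappings": [
        { "containerPort": 5000, "hostPort": 5000 }
      ],
      "essential": true
    }
  ]
}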
Deploying TensorFlow Serving to AWS ECS
What is TensorFlow Serving?
TensorFlow Serving is a specialized service designed for serving machine learning models in production environments. It supports REST and gRPC APIs, allowing you to integrate your model with various applications. TensorFlow Serving is ideal for models that are saved in SavedModel format.
Step 1: Export the Fine-Tuned Model
Before serving the model, export it in SavedModel format:
model.save("path_to_saved_model", save_format="tf")
This saves the model in a format that TensorFlow Serving can read.
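If your fine-tuned model is a Hugging Face TensorFlow model (for example, a TFAutoModelForQuestionAnswering), recent versions of transformers can also export a SavedModel directly; a sketch, assuming the TF model class:
# Writes a SavedModel under path_to_saved_model/saved_model/1
model.save_pretrained("path_to_saved_model", saved_model=True)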
Step 2: Run TensorFlow Serving Locally
You can test the model locally using TensorFlow Serving with Docker. TensorFlow Serving expects the SavedModel to sit inside a numbered version subdirectory under the model's base path, so we mount the exported model at /models/my_model/1:
docker pull tensorflow/serving
docker run -p 8501:8501 --name=tf_serving \
  --mount type=bind,source=/path_to_saved_model,target=/models/my_model/1 \
  -e MODEL_NAME=my_model -t tensorflow/serving
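Once the container is up, a quick sanity check is a REST call against the predict endpoint (the input below is only illustrative; it has to match your model's serving signature):
curl -X POST "http://localhost:8501/v1/models/my_model:predict" -H "Content-Type: application/json" -d '{
  "instances": [{"input_ids": [101, 2023, 2003, 1037, 3160, 102]}]
}'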
Step 3: Deploy TensorFlow Serving to AWS ECS
We’ll now package TensorFlow Serving into a Docker container and deploy it on ECS, similar to what we did for the Flask chatbot.
Dockerfile for TensorFlow Serving
# Use the official TensorFlow Serving image
FROM tensorflow/serving
# Copy the SavedModel into a numbered version directory (TensorFlow Serving requires one)
COPY /path_to_saved_model /models/my_model/1
# Set the environment variable for the model name
ENV MODEL_NAME=my_model
Build and push this image to ECR, just like the previous Flask image, and then deploy it on ECS using a Task Definition and Service.
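For example, the commands mirror the earlier Flask image (the image name is arbitrary and the ECR URL is a placeholder):
docker build -t my-llm-tf-serving .
docker tag my-llm-tf-serving:latest <your-ecr-repo-url>
docker push <your-ecr-repo-url>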
Querying TensorFlow Serving with Python
After deploying the TensorFlow Serving model, you can query it with Python using REST or gRPC.
REST API Query
import json
import requests
# Define the URL of the TensorFlow Serving API
url = "http://your-ecs-service-url/v1/models/my_model:predict"
# Define the input payload (adjust based on your model input)
data = {
    "signature_name": "serving_default",
    "instances": [{"input_ids": [101, 2023, 2003, 1037, 3160, 102]}]  # Example input tokens
}
# Send the POST request and get the response
response = requests.post(url, data=json.dumps(data))
print(response.json())
gRPC Query
gRPC typically has lower overhead than REST because it is a binary protocol (Protocol Buffers over HTTP/2), making it more efficient for high-throughput production systems.
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc
from google.protobuf import json_format

# Set up the gRPC channel (TensorFlow Serving exposes gRPC on port 8500 by default)
channel = grpc.insecure_channel('your-ecs-service-url:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Create the request object
request = predict_pb2.PredictRequest()
request.model_spec.name = 'my_model'
request.model_spec.signature_name = 'serving_default'

# Add input features
request.inputs['input_ids'].CopyFrom(
    tf.make_tensor_proto([101, 2023, 2003, 1037, 3160, 102])
)

# Send the request (with a 10-second timeout) and get the response
response = stub.Predict(request, 10.0)
print(json_format.MessageToJson(response))
Optimizing Your Model with Pruning and Distillation
While deploying models at scale, optimizing for performance can significantly reduce costs and improve response times. Two common techniques are pruning and knowledge distillation.
Pruning
Pruning removes unnecessary weights from the model without significantly affecting its accuracy. This reduces the model size and speeds up inference.
import tensorflow_model_optimization as tfmot

# Define a pruning schedule
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.2, final_sparsity=0.8, begin_step=0, end_step=1000
)

# Apply pruning to the model
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule=pruning_schedule)
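Pruning only takes effect once the wrapped model is fine-tuned with the pruning callback and then stripped of its wrappers before export. A minimal sketch, assuming you still have a training dataset and that the optimizer and loss below are placeholders for your actual setup:
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Fine-tune briefly so the sparsity schedule is actually applied
pruned_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
pruned_model.fit(
    train_dataset,  # placeholder: your training data
    epochs=1,
    callbacks=[tfmot.sparsity.keras.UpdatePruningStep()],  # required while pruning
)

# Strip the pruning wrappers before saving the model for serving
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
final_model.save("path_to_saved_model", save_format="tf")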
Knowledge Distillation
In distillation, a smaller model (student) is trained to replicate the behavior of a larger model (teacher), making it more efficient while maintaining performance.
from transformers import DistilBertForQuestionAnswering
# Load a smaller model for distillation
student_model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")
# Train the student model using the outputs from the teacher model
# This can be done using a custom loss function to minimize the difference between the teacher and student outputs
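As a rough sketch of such a custom loss (this assumes PyTorch teacher and student models that both return logits; the temperature and weighting are hyperparameters you would tune):
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions and minimize the KL divergence between them
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (temperature ** 2)

# In a training loop, this is typically combined with the student's regular task loss:
# loss = alpha * task_loss + (1 - alpha) * distillation_loss(student_logits, teacher_logits)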
Conclusion
In this guide, we walked through deploying a fine-tuned LLM using Flask and the Hugging Face Transformers pipeline, Dockerizing it, and deploying it to AWS ECS. We also explored using TensorFlow Serving for more robust model serving and touched on optimizations like pruning and distillation to improve performance in production environments.
With these tools, you can deploy your model in scalable, efficient production environments to serve real-time user queries.