FastAPI & Docker Deployment
Overview
In this session, we take our agents from Jupyter notebooks to production-ready REST APIs using FastAPI and Docker.
Why FastAPI?
| Feature | Benefit |
|---|---|
| Async native | Handle many concurrent requests |
| Auto documentation | Swagger UI out of the box |
| Type safety | Pydantic validation |
| Fast | One of the fastest Python frameworks |
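For example, an endpoint declared with `async def` runs on FastAPI's event loop, so slow I/O (such as an LLM call) does not block other requests. A minimal standalone sketch to illustrate the idea (not part of the `main.py` built below):

```python
# Standalone illustration of an async endpoint; not part of main.py below
import asyncio

from fastapi import FastAPI

app = FastAPI()


@app.get("/ping")
async def ping():
    # await yields control to the event loop, so other requests keep being served
    await asyncio.sleep(0.1)  # stand-in for slow I/O such as an LLM call
    return {"pong": True}
```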
Project Structure
- main.py: FastAPI app, agent logic, and endpoints
- Dockerfile: container build instructions
- requirements.txt: Python dependencies
- .env: environment variables (the OpenAI API key), kept out of version control
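The `.env` file only needs the OpenAI key used throughout this session (placeholder value shown):

```text
# .env -- never commit this file to version control
OPENAI_API_KEY=sk-your-key-here
```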
Building the API
Step 1: Define Data Models
```python
from pydantic import BaseModel


class AgentRequest(BaseModel):
    query: str
    model: str = "gpt-4o-mini"


class AgentResponse(BaseModel):
    response: str
    tool_calls: int = 0
```
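FastAPI uses these models to validate incoming JSON and rejects invalid payloads with a 422 response. A quick sanity check (Pydantic v2 syntax, matching the `pydantic>=2.0.0` pin below):

```python
# Valid payload: "model" falls back to its default
req = AgentRequest(query="What is 2+2?")
print(req.model_dump())  # {'query': 'What is 2+2?', 'model': 'gpt-4o-mini'}

# Invalid payload: the required "query" field is missing
try:
    AgentRequest(model="gpt-4o-mini")
except Exception as err:
    print(err)  # Pydantic ValidationError; FastAPI would return HTTP 422
```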
Step 2: Create the Agent Function

```python
from fastapi import HTTPException
from openai import OpenAI

client = OpenAI()


def run_simple_agent(query: str, model: str) -> str:
    """
    Simple agent that processes queries.
    In production, import your full ReActAgent here.
    """
    try:
        completion = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful API agent."},
                {"role": "user", "content": query},
            ],
        )
        return completion.choices[0].message.content
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
```
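Assuming `OPENAI_API_KEY` is set in your environment, you can smoke-test the function before wiring it into FastAPI:

```python
# Quick local check of the agent function (requires OPENAI_API_KEY in the environment)
if __name__ == "__main__":
    print(run_simple_agent("Say hello in five words.", "gpt-4o-mini"))
```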
Step 3: Define Endpoints

```python
from fastapi import FastAPI, HTTPException

app = FastAPI(
    title="LLM Agent API",
    description="Production-ready Agent API",
    version="1.0.0",
)


@app.get("/health")
def health_check():
    """Health check endpoint for load balancers"""
    return {"status": "ok", "service": "llm-agent-api"}


@app.post("/v1/agent/chat", response_model=AgentResponse)
def chat_endpoint(request: AgentRequest):
    """Main chat endpoint"""
    answer = run_simple_agent(request.query, request.model)
    return AgentResponse(response=answer)
```
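You can also exercise the app in-process with FastAPI's `TestClient` (it requires `httpx`), for example from a separate test script:

```python
# In-process test of the endpoints; no running server needed
from fastapi.testclient import TestClient

from main import app

test_client = TestClient(app)
print(test_client.get("/health").json())
print(test_client.post("/v1/agent/chat", json={"query": "What is 2+2?"}).json())
```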
Step 4: Run Locally

```bash
# Install dependencies
pip install fastapi uvicorn openai python-dotenv

# Run the server
uvicorn main:app --reload --port 8000
```

Visit http://localhost:8000/docs for interactive API documentation.
Dockerizing the API
Dockerfile
```dockerfile
# Use official lightweight Python image
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY main.py .

# Expose port
EXPOSE 8000

# Run the application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Secrets such as the OpenAI key are supplied at run time with `--env-file` (see Build and Run below) rather than baked into the image.
requirements.txt

```text
fastapi>=0.104.0
uvicorn>=0.24.0
openai>=1.0.0
python-dotenv>=1.0.0
pydantic>=2.0.0
```
Build and Run

```bash
# Build the image
docker build -t llm-agent-api .

# Run the container
docker run -p 8000:8000 --env-file .env llm-agent-api
```
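Once the container is up, a quick check from the host confirms the API is reachable (this sketch assumes `httpx` is installed on the host; plain `curl` works too, see Testing the API below):

```python
# Verify the running container responds on the mapped port
import httpx

print(httpx.get("http://localhost:8000/health", timeout=5).json())
# Expected: {'status': 'ok', 'service': 'llm-agent-api'}
```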
Production Considerations

1. Environment Variables

Never commit API keys. Use environment variables:

```python
import os
from dotenv import load_dotenv

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
```
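Optionally, you can fail fast at startup instead of discovering a missing key on the first request; a small sketch, assuming a hard failure at boot is what you want:

```python
# Refuse to start if the key is missing
import os

if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to .env or the container environment")
```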
2. Error Handling

```python
from fastapi import HTTPException
from tenacity import retry, stop_after_attempt, wait_exponential


# reraise=True surfaces the final HTTPException instead of tenacity's RetryError
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10),
    reraise=True,
)
def call_llm_with_retry(messages):
    try:
        return client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
        )
    except Exception as e:
        raise HTTPException(status_code=503, detail="LLM service unavailable") from e
```
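One way to wire the retry helper into the chat endpoint (a sketch that replaces the earlier call to `run_simple_agent`):

```python
@app.post("/v1/agent/chat", response_model=AgentResponse)
def chat_endpoint(request: AgentRequest):
    """Chat endpoint backed by the retrying LLM call"""
    completion = call_llm_with_retry([
        {"role": "system", "content": "You are a helpful API agent."},
        {"role": "user", "content": request.query},
    ])
    return AgentResponse(response=completion.choices[0].message.content)
```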
3. Rate Limiting

```python
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)  # respond with HTTP 429


@app.post("/v1/agent/chat")
@limiter.limit("10/minute")  # per client IP
def chat_endpoint(request: Request, agent_request: AgentRequest):
    ...
```
4. Logging & Monitoring

```python
import logging
from datetime import datetime

from fastapi import Request

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@app.middleware("http")
async def log_requests(request: Request, call_next):
    start_time = datetime.now()
    response = await call_next(request)
    duration = (datetime.now() - start_time).total_seconds()
    logger.info(f"{request.method} {request.url.path} - {response.status_code} - {duration:.3f}s")
    return response
```

Production Architecture
Deployment Options
Docker Compose Example
```yaml
version: '3.8'
services:
  agent-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    restart: unless-stopped
    healthcheck:
      # python:3.11-slim does not ship curl, so probe /health with the stdlib instead
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
      interval: 30s
      timeout: 10s
      retries: 3
```
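Start the stack with `docker compose up -d --build`; Compose restarts the container automatically (`restart: unless-stopped`) and marks it unhealthy after three consecutive failed health checks.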
Using curl
# Health check
curl http://localhost:8000/health
# Chat request
curl -X POST http://localhost:8000/v1/agent/chat \
-H "Content-Type: application/json" \
-d '{"query": "What is 2+2?", "model": "gpt-4o-mini"}'Using Python
Using Python

```python
import httpx

response = httpx.post(
    "http://localhost:8000/v1/agent/chat",
    json={"query": "What is the capital of France?"},
    timeout=30,  # allow time for the LLM call (httpx defaults to 5 seconds)
)
print(response.json())
```

Best Practices
Do:
- Use health checks for container orchestration
- Implement graceful shutdown (see the lifespan sketch after these lists)
- Cache frequent queries
- Use connection pooling
⚠️ Don't:
- Hardcode API keys
- Skip error handling
- Ignore rate limits from LLM providers
- Deploy without logging
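One way to cover both graceful shutdown and connection pooling is FastAPI's lifespan hook; a minimal sketch (the pooled `httpx.AsyncClient` here is an illustrative shared resource, not something the earlier code requires):

```python
from contextlib import asynccontextmanager

import httpx
from fastapi import FastAPI


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: create one pooled HTTP client shared by all requests
    app.state.http = httpx.AsyncClient(timeout=30)
    yield
    # Shutdown: close the pool so in-flight connections finish cleanly
    await app.state.http.aclose()


app = FastAPI(title="LLM Agent API", lifespan=lifespan)
```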
Next Steps
You've learned to deploy agents as APIs! Now head to the Capstone Project to build a complete production system combining everything from this course.
Run the Code
```bash
cd week4_production/serving_api
docker build -t llm-agent-api .
docker run -p 8000:8000 --env-file .env llm-agent-api
```

Then visit http://localhost:8000/docs to test your API.