FastAPI & Docker Deployment
Overview
In this session, we take our agents from Jupyter notebooks to production-ready REST APIs using FastAPI and Docker.
Why FastAPI?
| Feature | Benefit |
|---|---|
| Async native | Handle many concurrent requests |
| Auto documentation | Swagger UI out of the box |
| Type safety | Pydantic validation |
| Fast | One of the fastest Python frameworks |
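For example, an endpoint declared with `async def` runs on FastAPI's event loop, so slow I/O (such as an LLM call) does not block other requests. A minimal standalone sketch to illustrate the idea (not part of the `main.py` built below):

```python
# Standalone illustration of an async endpoint; not part of main.py below
import asyncio

from fastapi import FastAPI

app = FastAPI()


@app.get("/ping")
async def ping():
    # await yields control to the event loop, so other requests keep being served
    await asyncio.sleep(0.1)  # stand-in for slow I/O such as an LLM call
    return {"pong": True}
```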
Project Structure
- main.py: FastAPI app, agent logic, and endpoints
- Dockerfile: container build instructions
- requirements.txt: Python dependencies
- .env: environment variables (the OpenAI API key), kept out of version control
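The `.env` file only needs the OpenAI key used throughout this session (placeholder value shown):

```text
# .env -- never commit this file to version control
OPENAI_API_KEY=sk-your-key-here
```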
Building the API
Step 1: Define Data Models
```python
from pydantic import BaseModel


class AgentRequest(BaseModel):
    query: str
    model: str = "gpt-4o-mini"


class AgentResponse(BaseModel):
    response: str
    tool_calls: int = 0
```
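FastAPI uses these models to validate incoming JSON and rejects invalid payloads with a 422 response. A quick sanity check (Pydantic v2 syntax, matching the `pydantic>=2.0.0` pin below):

```python
# Valid payload: "model" falls back to its default
req = AgentRequest(query="What is 2+2?")
print(req.model_dump())  # {'query': 'What is 2+2?', 'model': 'gpt-4o-mini'}

# Invalid payload: the required "query" field is missing
try:
    AgentRequest(model="gpt-4o-mini")
except Exception as err:
    print(err)  # Pydantic ValidationError; FastAPI would return HTTP 422
```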
Step 2: Create the Agent Function

```python
from fastapi import HTTPException
from openai import OpenAI

client = OpenAI()


def run_simple_agent(query: str, model: str) -> str:
    """
    Simple agent that processes queries.
    In production, import your full ReActAgent here.
    """
    try:
        completion = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful API agent."},
                {"role": "user", "content": query},
            ],
        )
        return completion.choices[0].message.content
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
```
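Assuming `OPENAI_API_KEY` is set in your environment, you can smoke-test the function before wiring it into FastAPI:

```python
# Quick local check of the agent function (requires OPENAI_API_KEY in the environment)
if __name__ == "__main__":
    print(run_simple_agent("Say hello in five words.", "gpt-4o-mini"))
```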
Step 3: Define Endpoints

```python
from fastapi import FastAPI, HTTPException

app = FastAPI(
    title="LLM Agent API",
    description="Production-ready Agent API",
    version="1.0.0",
)


@app.get("/health")
def health_check():
    """Health check endpoint for load balancers"""
    return {"status": "ok", "service": "llm-agent-api"}


@app.post("/v1/agent/chat", response_model=AgentResponse)
def chat_endpoint(request: AgentRequest):
    """Main chat endpoint"""
    answer = run_simple_agent(request.query, request.model)
    return AgentResponse(response=answer)
```
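You can also exercise the app in-process with FastAPI's `TestClient` (it requires `httpx`), for example from a separate test script:

```python
# In-process test of the endpoints; no running server needed
from fastapi.testclient import TestClient

from main import app

test_client = TestClient(app)
print(test_client.get("/health").json())
print(test_client.post("/v1/agent/chat", json={"query": "What is 2+2?"}).json())
```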
Step 4: Run Locally

```bash
# Install dependencies
pip install fastapi uvicorn openai python-dotenv

# Run the server
uvicorn main:app --reload --port 8000
```

Visit http://localhost:8000/docs for interactive API documentation.
Dockerizing the API
Dockerfile
```dockerfile
# Use official lightweight Python image
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY main.py .

# Expose port
EXPOSE 8000

# Run the application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Secrets such as the OpenAI key are supplied at run time with `--env-file` (see Build and Run below) rather than baked into the image.
requirements.txt

```text
fastapi>=0.104.0
uvicorn>=0.24.0
openai>=1.0.0
python-dotenv>=1.0.0
pydantic>=2.0.0
```
Build and Run

```bash
# Build the image
docker build -t llm-agent-api .

# Run the container
docker run -p 8000:8000 --env-file .env llm-agent-api
```
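Once the container is up, a quick check from the host confirms the API is reachable (this sketch assumes `httpx` is installed on the host; plain `curl` works too, see Testing the API below):

```python
# Verify the running container responds on the mapped port
import httpx

print(httpx.get("http://localhost:8000/health", timeout=5).json())
# Expected: {'status': 'ok', 'service': 'llm-agent-api'}
```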
Production Considerations

1. Environment Variables

Never commit API keys. Use environment variables:

```python
import os
from dotenv import load_dotenv

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
```
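Optionally, you can fail fast at startup instead of discovering a missing key on the first request; a small sketch, assuming a hard failure at boot is what you want:

```python
# Refuse to start if the key is missing
import os

if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to .env or the container environment")
```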
2. Error Handling

```python
from fastapi import HTTPException
from tenacity import retry, stop_after_attempt, wait_exponential


# reraise=True surfaces the final HTTPException instead of tenacity's RetryError
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10),
    reraise=True,
)
def call_llm_with_retry(messages):
    try:
        return client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
        )
    except Exception as e:
        raise HTTPException(status_code=503, detail="LLM service unavailable") from e
```
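One way to wire the retry helper into the chat endpoint (a sketch that replaces the earlier call to `run_simple_agent`):

```python
@app.post("/v1/agent/chat", response_model=AgentResponse)
def chat_endpoint(request: AgentRequest):
    """Chat endpoint backed by the retrying LLM call"""
    completion = call_llm_with_retry([
        {"role": "system", "content": "You are a helpful API agent."},
        {"role": "user", "content": request.query},
    ])
    return AgentResponse(response=completion.choices[0].message.content)
```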
3. Rate Limiting

```python
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)  # respond with HTTP 429


@app.post("/v1/agent/chat")
@limiter.limit("10/minute")  # per client IP
def chat_endpoint(request: Request, agent_request: AgentRequest):
    ...
```
4. Logging & Monitoring

```python
import logging
from datetime import datetime

from fastapi import Request

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@app.middleware("http")
async def log_requests(request: Request, call_next):
    start_time = datetime.now()
    response = await call_next(request)
    duration = (datetime.now() - start_time).total_seconds()
    logger.info(f"{request.method} {request.url.path} - {response.status_code} - {duration:.3f}s")
    return response
```

Production Architecture
Deployment Options
Docker Compose Example
```yaml
version: '3.8'
services:
  agent-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    restart: unless-stopped
    healthcheck:
      # python:3.11-slim does not ship curl, so probe /health with the stdlib instead
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
      interval: 30s
      timeout: 10s
      retries: 3
```
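Start the stack with `docker compose up -d --build`; Compose restarts the container automatically (`restart: unless-stopped`) and marks it unhealthy after three consecutive failed health checks.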
Using curl
# Health check
curl http://localhost:8000/health
# Chat request
curl -X POST http://localhost:8000/v1/agent/chat \
-H "Content-Type: application/json" \
-d '{"query": "What is 2+2?", "model": "gpt-4o-mini"}'Using Python
Using Python

```python
import httpx

response = httpx.post(
    "http://localhost:8000/v1/agent/chat",
    json={"query": "What is the capital of France?"},
    timeout=30,  # allow time for the LLM call (httpx defaults to 5 seconds)
)
print(response.json())
```

Best Practices
Do:
- Use health checks for container orchestration
- Implement graceful shutdown (see the lifespan sketch after these lists)
- Cache frequent queries
- Use connection pooling
⚠️ Don't:
- Hardcode API keys
- Skip error handling
- Ignore rate limits from LLM providers
- Deploy without logging
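One way to cover both graceful shutdown and connection pooling is FastAPI's lifespan hook; a minimal sketch (the pooled `httpx.AsyncClient` here is an illustrative shared resource, not something the earlier code requires):

```python
from contextlib import asynccontextmanager

import httpx
from fastapi import FastAPI


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: create one pooled HTTP client shared by all requests
    app.state.http = httpx.AsyncClient(timeout=30)
    yield
    # Shutdown: close the pool so in-flight connections finish cleanly
    await app.state.http.aclose()


app = FastAPI(title="LLM Agent API", lifespan=lifespan)
```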
Next Steps
You've learned to deploy agents as APIs! Now head to the Capstone Project to build a complete production system combining everything from this course.
Run the Code
```bash
cd week4_production/serving_api
docker build -t llm-agent-api .
docker run -p 8000:8000 --env-file .env llm-agent-api
```

Then visit http://localhost:8000/docs to test your API.