Week 4: Production
FastAPI & Docker Deployment

Overview

In this session, we take our agents from Jupyter notebooks to production-ready REST APIs using FastAPI and Docker.

Why FastAPI?

• Async native: handle many concurrent requests
• Auto documentation: Swagger UI out of the box
• Type safety: Pydantic validation
• Fast: one of the fastest Python frameworks
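
For example, this tiny standalone app (a hypothetical echo service, separate from the project built below) already gets request validation, async handling, and an interactive /docs page for free:

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class EchoRequest(BaseModel):
        text: str

    @app.post("/echo")
    async def echo(req: EchoRequest):
        # Pydantic has already validated the body (bad input returns a 422),
        # and the endpoint appears automatically in the Swagger UI at /docs.
        return {"echo": req.text}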

Project Structure

    • main.py
    • Dockerfile
    • requirements.txt
    • .env

    Building the API

    Step 1: Define Data Models

    from pydantic import BaseModel
     
    class AgentRequest(BaseModel):
        query: str
        model: str = "gpt-4o-mini"
     
    class AgentResponse(BaseModel):
        response: str
        tool_calls: int = 0
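
    As a quick check, Pydantic validates input and fills in defaults; for example (a throwaway snippet, not part of main.py):

    req = AgentRequest(query="What is 2+2?")
    print(req.model_dump())  # {'query': 'What is 2+2?', 'model': 'gpt-4o-mini'}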

    Step 2: Create the Agent Function

    from fastapi import HTTPException
    from openai import OpenAI
     
    client = OpenAI()
     
    def run_simple_agent(query: str, model: str) -> str:
        """
        Simple agent that processes queries.
        In production, import your full ReActAgent here.
        """
        try:
            completion = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": "You are a helpful API agent."},
                    {"role": "user", "content": query}
                ]
            )
            return completion.choices[0].message.content
        except Exception as e:
            raise HTTPException(status_code=500, detail=str(e))

    Step 3: Define Endpoints

    from fastapi import FastAPI, HTTPException
     
    app = FastAPI(
        title="LLM Agent API",
        description="Production-ready Agent API",
        version="1.0.0"
    )
     
    @app.get("/health")
    def health_check():
        """Health check endpoint for load balancers"""
        return {"status": "ok", "service": "llm-agent-api"}
     
    @app.post("/v1/agent/chat", response_model=AgentResponse)
    def chat_endpoint(request: AgentRequest):
        """Main chat endpoint"""
        answer = run_simple_agent(request.query, request.model)
        return AgentResponse(response=answer)

    Step 4: Run Locally

    # Install dependencies
    pip install fastapi uvicorn openai python-dotenv
     
    # Run the server
    uvicorn main:app --reload --port 8000

    Visit http://localhost:8000/docs for interactive API documentation.

    Dockerizing the API

    Dockerfile

    # Use official lightweight Python image
    FROM python:3.11-slim
     
    # Set working directory
    WORKDIR /app
     
    # Install dependencies
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
     
    # Copy application code (secrets are supplied at runtime via --env-file, not baked into the image)
    COPY main.py .
     
    # Expose port
    EXPOSE 8000
     
    # Run the application
    CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

    requirements.txt

    fastapi>=0.104.0
    uvicorn>=0.24.0
    openai>=1.0.0
    python-dotenv>=1.0.0
    pydantic>=2.0.0

    Build and Run

    # Build the image
    docker build -t llm-agent-api .
     
    # Run the container
    docker run -p 8000:8000 --env-file .env llm-agent-api

    Production Considerations

    1. Environment Variables

    Never commit API keys. Use environment variables:

    import os
    from dotenv import load_dotenv
     
    load_dotenv()
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
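
    The .env file referenced by --env-file simply holds this key locally (the value below is a placeholder) and should never be committed:

    # .env  (keep out of version control)
    OPENAI_API_KEY=sk-your-key-here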

    2. Error Handling

    from fastapi import HTTPException
    from tenacity import retry, stop_after_attempt, wait_exponential
     
    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10), reraise=True)
    def _call_llm(messages):
        # Retry transient failures up to 3 times with exponential backoff,
        # then re-raise the original exception.
        return client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages
        )

    def call_llm_with_retry(messages):
        try:
            return _call_llm(messages)
        except Exception:
            # All retries exhausted: surface a clean 503 to the caller.
            raise HTTPException(status_code=503, detail="LLM service unavailable")

    3. Rate Limiting

    from fastapi import Request
    from slowapi import Limiter, _rate_limit_exceeded_handler
    from slowapi.errors import RateLimitExceeded
    from slowapi.util import get_remote_address

    limiter = Limiter(key_func=get_remote_address)
    app.state.limiter = limiter
    app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

    @app.post("/v1/agent/chat")
    @limiter.limit("10/minute")
    def chat_endpoint(request: Request, agent_request: AgentRequest):
        ...

    4. Logging & Monitoring

    import logging
    from datetime import datetime
    from fastapi import Request
     
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)
     
    @app.middleware("http")
    async def log_requests(request: Request, call_next):
        start_time = datetime.now()
        response = await call_next(request)
        duration = (datetime.now() - start_time).total_seconds()
     
        logger.info(f"{request.method} {request.url.path} - {response.status_code} - {duration:.3f}s")
        return response

    Production Architecture

    Deployment Options

    Docker Compose Example

    version: '3.8'
    services:
      agent-api:
        build: .
        ports:
          - "8000:8000"
        environment:
          - OPENAI_API_KEY=${OPENAI_API_KEY}
        restart: unless-stopped
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
          interval: 30s
          timeout: 10s
          retries: 3
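
    With this saved as docker-compose.yml next to the Dockerfile, the stack builds and starts in one step:

    docker compose up -d --build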

    Testing the API

    Using curl

    # Health check
    curl http://localhost:8000/health
     
    # Chat request
    curl -X POST http://localhost:8000/v1/agent/chat \
      -H "Content-Type: application/json" \
      -d '{"query": "What is 2+2?", "model": "gpt-4o-mini"}'

    Using Python

    import httpx
     
    response = httpx.post(
        "http://localhost:8000/v1/agent/chat",
        json={"query": "What is the capital of France?"}
    )
    print(response.json())

    Best Practices

    Do:

    • Use health checks for container orchestration
    • Implement graceful shutdown (see the sketch below)
    • Cache frequent queries (see the sketch below)
    • Use connection pooling
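
    A minimal sketch of the graceful-shutdown and caching items, assuming the client and run_simple_agent from Step 2 (the lifespan hook would replace the bare FastAPI(...) call from Step 3, and the cache size is an illustrative choice):

    from contextlib import asynccontextmanager
    from functools import lru_cache
    from fastapi import FastAPI

    @asynccontextmanager
    async def lifespan(app: FastAPI):
        # Startup work (if any) goes before the yield.
        yield
        # Shutdown runs when the container receives SIGTERM, after in-flight
        # requests finish; close pooled connections here.
        client.close()

    app = FastAPI(lifespan=lifespan)

    @lru_cache(maxsize=256)
    def cached_agent_answer(query: str, model: str) -> str:
        # Identical (query, model) pairs are answered from memory instead of
        # triggering another LLM call.
        return run_simple_agent(query, model)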

    Don't:

    • Hardcode API keys
    • Skip error handling
    • Ignore rate limits from LLM providers
    • Deploy without logging

    Next Steps

    You've learned to deploy agents as APIs! Now head to the Capstone Project to build a complete production system combining everything from this course.

    Run the Code

    cd week4_production/serving_api
    docker build -t llm-agent-api .
    docker run -p 8000:8000 --env-file .env llm-agent-api

    Then visit http://localhost:8000/docs to test your API.