Building Scalable Microservices with FastAPI and Kubernetes

Learn how we built a fraud detection system serving 200k+ daily active users at peak loads of 80 requests/second using FastAPI, Kubernetes, and GCP.

Introduction

Building systems that handle large-scale traffic while maintaining low latency is a challenge every backend engineer faces. In this post, I'll share how we architected and deployed a fraud detection engine for the New York Department of Labor that serves 200,000+ daily active users with peak loads of 80 requests per second.

The Challenge

The primary requirements were:

Handle high concurrent traffic without degradation
Process fraud detection algorithms in real-time
Maintain low latency (< 200ms response time)
Scale horizontally based on load
Ensure high availability (99.9% uptime)
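
To make the availability target concrete, the arithmetic below converts an uptime percentage into a downtime budget. The function name is illustrative and not part of the original system.

```python
def downtime_budget_minutes(uptime_percent: float, period_days: float) -> float:
    """Return the minutes of downtime allowed for a given uptime target."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


# 99.9% uptime over a 30-day month, in minutes
monthly_minutes = downtime_budget_minutes(99.9, 30)
# 99.9% uptime over a 365-day year, in hours
yearly_hours = downtime_budget_minutes(99.9, 365) / 60

print(round(monthly_minutes, 1))  # 43.2
print(round(yearly_hours, 2))     # 8.76
```

A 99.9% target therefore leaves about 43 minutes per month for deployments, failovers, and incidents combined.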

Tech Stack

We chose a modern, battle-tested stack:

FastAPI: One of the fastest Python ASGI frameworks for building APIs
Kubernetes: Container orchestration for scaling and self-healing
Google Cloud Platform: Managed infrastructure
BigQuery: Data warehouse for analytics
Firestore: Real-time database for hot data
Neo4j: Graph database for relationship detection
Terraform: Infrastructure as Code

Architecture Overview

Our microservices architecture consists of:

┌─────────────┐
│  Load       │
│  Balancer   │
└──────┬──────┘
       │
       ├──────┐
       │      │
   ┌───▼──┐ ┌─▼────┐
   │ API  │ │ API  │  FastAPI Services
   │ Pod  │ │ Pod  │  (Auto-scaled)
   └───┬──┘ └─┬────┘
       │      │
   ┌───▼──────▼───┐
   │  Service     │  Background Jobs
   │  Workers     │  (Fraud Detection)
   └───┬──────┬───┘
       │      │
   ┌───▼──┐ ┌─▼────┐
   │BigQ. │ │Neo4j │  Data Layer
   └──────┘ └──────┘

Why FastAPI?

FastAPI provides several advantages:

Performance: Built on Starlette and Pydantic, it's one of the fastest Python frameworks
Type Safety: Automatic validation using Python type hints
Auto-documentation: OpenAPI (Swagger) docs generated automatically
Async Support: Native async/await for concurrent operations
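
The async point is easiest to see with a small standard-library sketch. Here `asyncio.sleep` stands in for non-blocking I/O such as a database or HTTP call, and the signal names are made up for illustration.

```python
import asyncio
import time


async def fetch_signal(name: str, delay_seconds: float) -> str:
    # asyncio.sleep stands in for a non-blocking network or database call.
    await asyncio.sleep(delay_seconds)
    return name


async def gather_signals() -> list[str]:
    # The three lookups run concurrently, so the total wait is close to the
    # slowest one rather than the sum of all three.
    return await asyncio.gather(
        fetch_signal("device", 0.1),
        fetch_signal("ip", 0.1),
        fetch_signal("history", 0.1),
    )


start = time.perf_counter()
signals = asyncio.run(gather_signals())
elapsed = time.perf_counter() - start

print(signals)
print(f"elapsed: {elapsed:.2f}s (sequential would take about 0.30s)")
```

The same pattern applies inside an `async def` endpoint: independent lookups can be awaited together instead of one after another.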

Here's a sample endpoint:

from fastapi import FastAPI
from pydantic import BaseModel
 
app = FastAPI()
 
class FraudCheckRequest(BaseModel):
    user_id: str
    transaction_amount: float
    device_fingerprint: str
    ip_address: str
 
class FraudCheckResponse(BaseModel):
    is_fraudulent: bool
    confidence_score: float
    risk_factors: list[str]
 
@app.post("/api/v1/fraud-check", response_model=FraudCheckResponse)
async def check_fraud(request: FraudCheckRequest):
    # fraud_detection_service and alert_service are application-level
    # services defined elsewhere in the codebase (not shown here).
    result = await fraud_detection_service.analyze(request)
    
    if result.is_fraud and result.confidence_score > 0.95:
        # Trigger alert for high-confidence fraud
        await alert_service.notify(request.user_id)
    
    return FraudCheckResponse(
        is_fraudulent=result.is_fraud,
        confidence_score=result.confidence_score,
        risk_factors=result.factors
    )

Kubernetes Deployment Strategy

We use a HorizontalPodAutoscaler (HPA) to scale the API Deployment between 3 and 20 replicas based on CPU and memory utilization:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fraud-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
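
The autoscaler's core rule, as described in the Kubernetes documentation, is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), evaluated per metric with the largest proposal winning. The sketch below applies that rule to the targets in the manifest above. It is a simplification: the real controller also accounts for stabilization windows, pods that are not ready, and missing metrics, and the function names here are illustrative.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float,
    tolerance: float = 0.1,
) -> int:
    """Apply the HPA ratio rule for a single metric."""
    ratio = current_utilization / target_utilization
    # Within the tolerance band (10% by default) the HPA leaves the count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def hpa_recommendation(
    current_replicas: int,
    cpu_utilization: float,
    memory_utilization: float,
    min_replicas: int = 3,
    max_replicas: int = 20,
    cpu_target: float = 70,
    memory_target: float = 80,
) -> int:
    """Combine both metrics, then clamp to the configured replica bounds."""
    proposals = [
        desired_replicas(current_replicas, cpu_utilization, cpu_target),
        desired_replicas(current_replicas, memory_utilization, memory_target),
    ]
    # With several metrics, the largest proposed replica count is used.
    return max(min_replicas, min(max_replicas, max(proposals)))


print(hpa_recommendation(3, cpu_utilization=140, memory_utilization=40))   # 6
print(hpa_recommendation(10, cpu_utilization=210, memory_utilization=80))  # 20
print(hpa_recommendation(5, cpu_utilization=72, memory_utilization=80))    # 5
```

Because the largest proposal wins, a memory target that is set too low can keep replicas high even when CPU is idle, which is worth checking during load tests.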

Key Configurations

1. Resource Limits

Always set resource requests and limits:

resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"
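
Requests matter for autoscaling as well as scheduling: with `type: Utilization`, the HPA measures usage as a percentage of each container's resource request, not its limit. The sketch below works out what the 70% CPU and 80% memory targets mean in absolute terms for the requests shown above. The parsing helpers are illustrative and handle only the quantity suffixes used in this post.

```python
def parse_cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "250m" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


# Only the binary suffixes needed here; Kubernetes accepts more.
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "256Mi" to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


cpu_request_m = parse_cpu_millicores("250m")
memory_request_bytes = parse_memory_bytes("256Mi")

# Average per-pod usage at which each HPA target is reached
cpu_scale_point_m = cpu_request_m * 70 / 100
memory_scale_point_mib = memory_request_bytes * 80 / 100 / 1024**2

print(cpu_scale_point_m)                 # 175.0 millicores
print(round(memory_scale_point_mib, 1))  # 204.8 MiB
```

In other words, scale-out starts when pods average about 175m of CPU or about 205Mi of memory, well below the 500m and 512Mi limits.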

2. Readiness & Liveness Probes

Ensure only healthy pods receive traffic, and that stuck pods are restarted:

livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
 
readinessProbe:
  httpGet:
    path: /ready
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 5
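
The two probes answer different questions: liveness asks whether the process should be restarted, while readiness asks whether the pod should receive traffic right now. A minimal, framework-independent sketch of the logic behind a `/ready` handler follows. The dependency checks are hypothetical stand-ins, and the 0.5 second timeout is an assumption chosen to stay under the Kubernetes default probe timeout of 1 second, which applies because the manifest above does not set `timeoutSeconds`.

```python
import asyncio
from typing import Awaitable, Callable

DependencyCheck = Callable[[], Awaitable[None]]


async def run_check(check: DependencyCheck, timeout_seconds: float) -> bool:
    """Return True if the check finishes without raising before the timeout."""
    try:
        await asyncio.wait_for(check(), timeout=timeout_seconds)
        return True
    except Exception:
        # Timeouts and connection errors both count as "not ready".
        return False


async def readiness_status(
    checks: dict[str, DependencyCheck],
    timeout_seconds: float = 0.5,
) -> tuple[int, dict[str, bool]]:
    """Run all checks concurrently and map the outcome to an HTTP status."""
    names = list(checks)
    outcomes = await asyncio.gather(
        *(run_check(checks[name], timeout_seconds) for name in names)
    )
    results = dict(zip(names, outcomes))
    status_code = 200 if all(results.values()) else 503
    return status_code, results


# Hypothetical dependency checks for illustration only.
async def cache_ok() -> None:
    return None


async def database_down() -> None:
    raise ConnectionError("database unreachable")


print(asyncio.run(readiness_status({"cache": cache_ok})))
print(asyncio.run(readiness_status({"cache": cache_ok, "db": database_down})))
```

A `/health` handler for the liveness probe would normally skip dependency checks altogether, so that a database outage removes pods from rotation without also restarting them.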

3. Pod Disruption Budget

Maintain availability during voluntary disruptions such as node drains and cluster upgrades:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: fraud-api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: fraud-api
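
With `minAvailable: 2`, the number of pods that may be evicted at once is the healthy pod count minus two. The short sketch below shows how that interacts with the HPA floor of 3 replicas; the function name is illustrative.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Pods that can be voluntarily evicted without violating the budget."""
    return max(0, healthy_pods - min_available)


print(allowed_disruptions(healthy_pods=3, min_available=2))   # 1
print(allowed_disruptions(healthy_pods=2, min_available=2))   # 0
print(allowed_disruptions(healthy_pods=18, min_available=2))  # 16
```

At the minimum replica count, a node drain can evict only one pod at a time, and if one pod is already unhealthy the drain waits until it recovers.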

Performance Optimizations

1. Connection Pooling

Reuse database connections:

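import os

from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession
from sqlalchemy.orm import sessionmaker

# Assumption: the connection string is supplied via the environment,
# e.g. "postgresql+asyncpg://user:password@host/dbname".
DATABASE_URL = os.environ["DATABASE_URL"]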
 
engine = create_async_engine(
    DATABASE_URL,
    pool_size=20,
    max_overflow=10,
    pool_pre_ping=True
)
 
async_session = sessionmaker(
    engine, class_=AsyncSession, expire_on_commit=False
)
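
Pool settings multiply across pods, so it helps to check them against the database's connection limit. Each engine can open up to `pool_size + max_overflow` connections. The sketch below does that arithmetic for the settings above and the HPA range of 3 to 20 replicas. The database limit of 500, the 20 reserved connections, and the one-worker-per-pod default are assumptions for illustration.

```python
def connections_per_pod(
    pool_size: int, max_overflow: int, workers_per_pod: int = 1
) -> int:
    """Upper bound on connections one pod can open.

    Each worker process builds its own engine, so pools are not shared
    between workers.
    """
    return (pool_size + max_overflow) * workers_per_pod


def max_safe_replicas(
    database_max_connections: int, per_pod: int, reserved: int = 0
) -> int:
    """Largest replica count whose combined pools fit under the database limit."""
    return (database_max_connections - reserved) // per_pod


per_pod = connections_per_pod(pool_size=20, max_overflow=10)

print(per_pod)       # 30 connections per pod
print(per_pod * 3)   # 90 at the minimum of 3 replicas
print(per_pod * 20)  # 600 at the maximum of 20 replicas

# Hypothetical limit: 500 connections, 20 of them reserved for admin use
print(max_safe_replicas(500, per_pod, reserved=20))  # 16
```

Under those assumed numbers, the HPA maximum of 20 replicas could exhaust the database before the pods run out of CPU, so either the pool or `maxReplicas` would need adjusting.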

2. Caching Strategy

Implement Redis caching for hot data:

from redis.asyncio import Redis
import json
 
redis = Redis(host='redis', port=6379, decode_responses=True)
 
async def get_user_risk_profile(user_id: str):
    # Try cache first
    cached = await redis.get(f"risk_profile:{user_id}")
    if cached:
        return json.loads(cached)
    
    # Cache miss - query the database (db is the application's
    # data-access layer, defined elsewhere)
    profile = await db.fetch_risk_profile(user_id)
    
    # Cache for 5 minutes
    await redis.setex(
        f"risk_profile:{user_id}",
        300,
        json.dumps(profile)
    )
    
    return profile
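
The cache-aside flow above can be exercised without a running Redis by passing the cache and data store in as parameters. The sketch below does this with an in-memory stand-in that exposes only the two async methods used above, `get` and `setex`. `InMemoryTTLCache` and `CountingProfileStore` are test doubles invented for this example, not part of the original system.

```python
import asyncio
import json
import time
from typing import Optional


class InMemoryTTLCache:
    """Test double with the same get/setex shape as the async Redis client."""

    def __init__(self) -> None:
        self._entries: dict[str, tuple[float, str]] = {}

    async def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            # Expired entries behave like a miss.
            del self._entries[key]
            return None
        return value

    async def setex(self, key: str, ttl_seconds: int, value: str) -> None:
        self._entries[key] = (time.monotonic() + ttl_seconds, value)


class CountingProfileStore:
    """Fake database that counts how often it is queried."""

    def __init__(self) -> None:
        self.calls = 0

    async def fetch_risk_profile(self, user_id: str) -> dict:
        self.calls += 1
        return {"user_id": user_id, "risk": "low"}


async def get_user_risk_profile(user_id, cache, store, ttl_seconds=300):
    key = f"risk_profile:{user_id}"
    cached = await cache.get(key)
    if cached is not None:
        return json.loads(cached)
    profile = await store.fetch_risk_profile(user_id)
    await cache.setex(key, ttl_seconds, json.dumps(profile))
    return profile


async def demo():
    cache, store = InMemoryTTLCache(), CountingProfileStore()
    first = await get_user_risk_profile("u-1", cache, store)
    second = await get_user_risk_profile("u-1", cache, store)
    return first, second, store.calls


first, second, db_calls = asyncio.run(demo())
print(first == second, db_calls)  # True 1
```

The second lookup is served from the cache, so the store is queried only once. The trade-off of a 5-minute TTL is that a changed risk profile can be stale for up to 5 minutes unless the key is deleted on write.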

3. Background Jobs

Offload heavy processing to workers:

from celery import Celery
 
celery_app = Celery('fraud_detection', broker='redis://redis:6379/0')
 
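@celery_app.task
def analyze_transaction_patterns(user_id: str):
    # Runs in a Celery worker process, outside the request path.
    # run_ml_model and store_patterns_in_bigquery are project helpers
    # defined elsewhere (not shown here).
    patterns = run_ml_model(user_id)
    store_patterns_in_bigquery(patterns)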

Monitoring & Observability

We use a comprehensive monitoring stack:

Prometheus: Metrics collection
Grafana: Visualization dashboards
Cloud Logging: Centralized logs
Cloud Trace: Distributed tracing

Key metrics we track:

import time

from fastapi import Request
from prometheus_client import Counter, Histogram
 
request_count = Counter(
    'fraud_api_requests_total',
    'Total API requests',
    ['endpoint', 'status']
)
 
request_duration = Histogram(
    'fraud_api_request_duration_seconds',
    'Request duration in seconds',
    ['endpoint']
)
 
@app.middleware("http")
async def monitor_requests(request: Request, call_next):
    start_time = time.time()
    
    response = await call_next(request)
    
    duration = time.time() - start_time
    request_count.labels(
        endpoint=request.url.path,
        status=response.status_code
    ).inc()
    request_duration.labels(
        endpoint=request.url.path
    ).observe(duration)
    
    return response

Results

After 2 years in production:

99.95% uptime across all services
Response time: 85ms median (p50), 180ms at p99
Peak load handled: 120 requests/second without degradation
Auto-scaling: Successfully scales from 3 to 18 pods during peak hours
Cost optimization: 40% cost reduction through efficient resource allocation
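
The p50 and p99 figures above are percentiles rather than means, and the difference matters when a few requests are slow. The sketch below uses the nearest-rank method on made-up latency samples (not our production data) to show how a small tail moves the mean and p99 while leaving the median untouched.

```python
import math
from statistics import mean


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Illustrative data: 98 fast requests and 2 slow ones, in milliseconds
latencies_ms = [80.0] * 98 + [500.0] * 2

print(mean(latencies_ms))            # 88.4
print(percentile(latencies_ms, 50))  # 80.0
print(percentile(latencies_ms, 99))  # 500.0
```

This is why latency targets such as "under 200ms" are usually checked against a high percentile instead of the average.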

Lessons Learned

Start with monitoring: Implement observability from day one
Test autoscaling: Load test your HPA configurations before production
Connection management: Properly configure pool sizes for databases
Graceful shutdowns: Handle SIGTERM signals properly in Kubernetes
Circuit breakers: Implement fallback mechanisms for external dependencies
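
As an illustration of the circuit breaker lesson, here is a minimal synchronous sketch. It is not the implementation we ran in production: the class, thresholds, and fallback payload are all assumptions, and an async service would need a variant that awaits the wrapped call.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_seconds = reset_timeout_seconds
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_seconds:
                raise CircuitOpenError("circuit open; skipping call")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def score_with_fallback(breaker, remote_score, user_id):
    """Use the remote scorer when possible, else return a neutral default."""
    try:
        return breaker.call(remote_score, user_id)
    except Exception:
        return {"user_id": user_id, "risk": "unknown", "source": "fallback"}


# Demo with a fake clock and a dependency that always fails
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout_seconds=10,
                         clock=lambda: now[0])
attempts = []


def failing_scorer(user_id):
    attempts.append(user_id)
    raise ConnectionError("scorer unavailable")


for _ in range(3):
    fallback = score_with_fallback(breaker, failing_scorer, "u-1")

print(len(attempts))  # 2: the third call was short-circuited
```

Once the circuit opens, callers get the fallback immediately instead of waiting on a dependency that is already failing, which protects the latency budget during an outage.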

Conclusion

Building scalable microservices requires careful attention to architecture, infrastructure, and monitoring. FastAPI combined with Kubernetes provides a powerful foundation for building systems that can grow with your needs.

The key is starting with solid fundamentals: proper resource management, autoscaling configurations, and comprehensive monitoring. The rest is iterative improvement based on real-world usage patterns.

Resources

Have questions about scaling microservices? Feel free to reach out on LinkedIn or GitHub.