Building Scalable Microservices with FastAPI and Kubernetes
Learn how we built a fraud detection system handling 200k+ daily active users with 80 requests/second using FastAPI, Kubernetes, and GCP.
Introduction
Building systems that handle massive scale while maintaining low latency is a challenge every backend engineer faces. In this post, I'll share how we architected and deployed a fraud detection engine for the New York Department of Labor that serves 200,000+ daily active users with peak loads of 80 requests per second.
The Challenge
The primary requirements were:
- Handle high concurrent traffic without degradation
- Process fraud detection algorithms in real-time
- Maintain low latency (< 200ms response time)
- Scale horizontally based on load
- Ensure high availability (99.9% uptime)
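These targets translate into rough capacity numbers. The following is a back-of-the-envelope sketch, with illustrative function names: it applies Little's law (average requests in flight = arrival rate × latency) to the 80 requests/second and 200 ms figures, and converts the 99.9% uptime target into a downtime budget.

```python
def in_flight_requests(requests_per_second: float, latency_ms: float) -> float:
    """Little's law: average concurrency = arrival rate x time in system."""
    return requests_per_second * latency_ms / 1000


def downtime_budget_minutes(availability_pct: float, period_days: int) -> float:
    """Minutes of downtime allowed in a period at a given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_pct) / 100


if __name__ == "__main__":
    # Peak load from the requirements: 80 req/s, each taking up to 200 ms.
    print("requests in flight:", in_flight_requests(80, 200))
    # 99.9% availability over a 30-day month and over a 365-day year.
    print("minutes per 30 days:", round(downtime_budget_minutes(99.9, 30), 1))
    print("hours per year:", round(downtime_budget_minutes(99.9, 365) / 60, 2))
```

At the stated peak this works out to about 16 requests in flight at any moment, and the uptime target leaves roughly 43 minutes of downtime per 30 days, or just under 9 hours per year.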
Tech Stack
We chose a modern, battle-tested stack:
- FastAPI: A high-performance Python ASGI framework for building APIs
- Kubernetes: Container orchestration for scaling and self-healing
- Google Cloud Platform: Managed infrastructure
- BigQuery: Data warehouse for analytics
- Firestore: Real-time database for hot data
- Neo4j: Graph database for relationship detection
- Terraform: Infrastructure as Code
Architecture Overview
Our microservices architecture consists of:
┌─────────────┐
│    Load     │
│  Balancer   │
└──────┬──────┘
       │
       ├──────┐
       │      │
   ┌───▼──┐ ┌─▼────┐
   │ API  │ │ API  │   FastAPI Services
   │ Pod  │ │ Pod  │   (Auto-scaled)
   └───┬──┘ └─┬────┘
       │      │
   ┌───▼──────▼───┐
   │   Service    │   Background Jobs
   │   Workers    │   (Fraud Detection)
   └───┬──────┬───┘
       │      │
   ┌───▼──┐ ┌─▼────┐
   │BigQ. │ │Neo4j │   Data Layer
   └──────┘ └──────┘
Why FastAPI?
FastAPI provides several advantages:
- Performance: Built on Starlette and Pydantic, it's one of the fastest Python frameworks
- Type Safety: Automatic validation using Python type hints
- Auto-documentation: OpenAPI (Swagger) docs generated automatically
- Async Support: Native async/await for concurrent operations
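To make the async point concrete, here is a small standard-library sketch, independent of FastAPI, of what async/await allows inside a handler: several independent I/O-bound lookups awaited concurrently rather than one after another. The lookup names and delays are made up for illustration, and `asyncio.sleep` stands in for real network calls.

```python
import asyncio


class InFlightTracker:
    """Records how many lookups were running at the same time."""

    def __init__(self) -> None:
        self.current = 0
        self.peak = 0

    def __enter__(self) -> "InFlightTracker":
        self.current += 1
        self.peak = max(self.peak, self.current)
        return self

    def __exit__(self, *exc_info) -> None:
        self.current -= 1


async def fake_lookup(name: str, delay_s: float, tracker: InFlightTracker) -> str:
    # asyncio.sleep stands in for a network call (database, cache, HTTP).
    with tracker:
        await asyncio.sleep(delay_s)
        return f"{name}:ok"


async def run_checks(tracker: InFlightTracker) -> list[str]:
    # gather() schedules all three coroutines at once and returns their
    # results in the order they were passed in.
    return await asyncio.gather(
        fake_lookup("device", 0.03, tracker),
        fake_lookup("ip_reputation", 0.02, tracker),
        fake_lookup("account_history", 0.01, tracker),
    )


if __name__ == "__main__":
    tracker = InFlightTracker()
    print(asyncio.run(run_checks(tracker)), "peak concurrency:", tracker.peak)
```

Because the three lookups overlap, the total wait is close to the slowest one rather than the sum of all three.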
Here's a sample endpoint:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Optional

# fraud_detection_service and alert_service are application-level objects
# defined elsewhere in the codebase.

app = FastAPI()


class FraudCheckRequest(BaseModel):
    user_id: str
    transaction_amount: float
    device_fingerprint: str
    ip_address: str


class FraudCheckResponse(BaseModel):
    is_fraudulent: bool
    confidence_score: float
    risk_factors: list[str]


@app.post("/api/v1/fraud-check", response_model=FraudCheckResponse)
async def check_fraud(request: FraudCheckRequest):
    # Run fraud detection algorithms
    result = await fraud_detection_service.analyze(request)
    if result.confidence_score > 0.95:
        # Trigger alert for high-confidence fraud
        await alert_service.notify(request.user_id)
    return FraudCheckResponse(
        is_fraudulent=result.is_fraud,
        confidence_score=result.confidence_score,
        risk_factors=result.factors
    )

Kubernetes Deployment Strategy
We use Horizontal Pod Autoscaling (HPA) to automatically scale based on CPU and memory:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fraud-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

Key Configurations
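Before the individual settings, it may help to see roughly how the HPA turns the utilization targets above into a replica count. The sketch below follows the formula in the Kubernetes documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), takes the largest proposal when several metrics are configured, and clamps the result to minReplicas and maxReplicas. It is a simplification: the 10% tolerance is the documented default, while stabilization windows, scaling policies, and not-yet-ready pods are ignored. The function names are illustrative.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float, tolerance: float = 0.1) -> int:
    """Replica count proposed by a single metric."""
    ratio = current_utilization / target_utilization
    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def hpa_replicas(current_replicas: int, metrics: list[tuple[float, float]],
                 min_replicas: int = 3, max_replicas: int = 20) -> int:
    """Combine (current, target) utilization pairs the way a multi-metric HPA does."""
    proposals = [desired_replicas(current_replicas, cur, tgt) for cur, tgt in metrics]
    # The largest proposal wins, then the result is clamped to the configured bounds.
    return max(min_replicas, min(max_replicas, max(proposals)))


if __name__ == "__main__":
    # 3 pods, CPU at 140% of requests (target 70), memory at 60% (target 80).
    print(hpa_replicas(3, [(140, 70), (60, 80)]))
```

Utilization here is measured against each container's resource requests, which is one reason the requests in the next subsection matter for autoscaling as well as scheduling.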
1. Resource Limits
Always set resource requests and limits:
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"

2. Readiness & Liveness Probes
Ensure healthy pods handle traffic:
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 5

3. Pod Disruption Budget
Maintain availability during updates:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: fraud-api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: fraud-api

Performance Optimizations
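One check ties the settings in this section back to autoscaling: every replica opens its own connection pool, so the total number of database connections grows with the pod count. The sketch below multiplies the pool settings used in this post (pool_size=20, max_overflow=10) by the replica counts from the HPA. The database limit of 500 and the one-process-per-pod default are assumptions made purely for illustration, and the function names are hypothetical.

```python
HYPOTHETICAL_DB_CONNECTION_LIMIT = 500  # assumption, not a figure from this system


def max_db_connections(pool_size: int, max_overflow: int, replicas: int,
                       processes_per_pod: int = 1) -> int:
    """Worst-case connections if every pool is fully used, overflow included."""
    per_process = pool_size + max_overflow
    return per_process * processes_per_pod * replicas


def fits_limit(connections: int, limit: int = HYPOTHETICAL_DB_CONNECTION_LIMIT) -> bool:
    return connections <= limit


if __name__ == "__main__":
    # Replica counts: HPA minimum, observed peak, HPA maximum.
    for replicas in (3, 18, 20):
        total = max_db_connections(pool_size=20, max_overflow=10, replicas=replicas)
        print(replicas, "pods ->", total, "connections, fits:", fits_limit(total))
```

If the worst case exceeds what the database accepts, the usual options are a smaller pool per pod, a lower maxReplicas, or a connection proxy in front of the database.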
1. Connection Pooling
Reuse database connections:
from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession
from sqlalchemy.orm import sessionmaker

engine = create_async_engine(
    DATABASE_URL,
    pool_size=20,
    max_overflow=10,
    pool_pre_ping=True
)
async_session = sessionmaker(
    engine, class_=AsyncSession, expire_on_commit=False
)

2. Caching Strategy
Implement Redis caching for hot data:
from redis.asyncio import Redis
import json

redis = Redis(host='redis', port=6379, decode_responses=True)


async def get_user_risk_profile(user_id: str):
    # Try cache first
    cached = await redis.get(f"risk_profile:{user_id}")
    if cached:
        return json.loads(cached)
    # Cache miss - query database
    profile = await db.fetch_risk_profile(user_id)
    # Cache for 5 minutes
    await redis.setex(
        f"risk_profile:{user_id}",
        300,
        json.dumps(profile)
    )
    return profile

3. Background Jobs
Offload heavy processing to workers:
from celery import Celery

celery_app = Celery('fraud_detection', broker='redis://redis:6379/0')


@celery_app.task
def analyze_transaction_patterns(user_id: str):
    # Heavy computation done asynchronously
    patterns = run_ml_model(user_id)
    store_patterns_in_bigquery(patterns)

Monitoring & Observability
We use a comprehensive monitoring stack:
- Prometheus: Metrics collection
- Grafana: Visualization dashboards
- Cloud Logging: Centralized logs
- Cloud Trace: Distributed tracing
Key metrics we track:
import time

from fastapi import Request
from prometheus_client import Counter, Histogram

request_count = Counter(
    'fraud_api_requests_total',
    'Total API requests',
    ['endpoint', 'status']
)
request_duration = Histogram(
    'fraud_api_request_duration_seconds',
    'Request duration in seconds',
    ['endpoint']
)


@app.middleware("http")
async def monitor_requests(request: Request, call_next):
    # perf_counter() is monotonic, so it is safer than time.time() for durations
    start_time = time.perf_counter()
    response = await call_next(request)
    duration = time.perf_counter() - start_time
    # Note: paths that embed IDs produce one label value per distinct path
    request_count.labels(
        endpoint=request.url.path,
        status=response.status_code
    ).inc()
    request_duration.labels(
        endpoint=request.url.path
    ).observe(duration)
    return response

Results
After 2 years in production:
- 99.95% uptime across all services
- Response time: 85ms median (p50), 180ms at p99
- Peak load handled: 120 requests/second without degradation
- Auto-scaling: Successfully scales from 3 to 18 pods during peak hours
- Cost optimization: 40% reduction through efficient resource allocation
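Since p50 and p99 are percentiles rather than averages, here is a short sketch of how such figures can be derived from raw latency samples. It uses the nearest-rank method on synthetic data, not measurements from the system described here. Monitoring systems such as Prometheus estimate percentiles from histogram buckets instead, so treat this only as an illustration of the concept.

```python
import math
from statistics import mean


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples or not 0 < p <= 100:
        raise ValueError("need at least one sample and 0 < p <= 100")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


if __name__ == "__main__":
    # Synthetic latencies in ms: mostly fast, a few slow, one outlier.
    latencies_ms = [80] * 95 + [150] * 4 + [900]
    print("mean:", mean(latencies_ms))
    print("p50:", percentile(latencies_ms, 50))
    print("p99:", percentile(latencies_ms, 99))
```

A single outlier pulls the mean well above the median, which is why latency targets are usually stated as percentiles.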
Lessons Learned
- Start with monitoring: Implement observability from day one
- Test autoscaling: Load test your HPA configurations before production
- Connection management: Properly configure pool sizes for databases
- Graceful shutdowns: Handle SIGTERM signals properly in Kubernetes
- Circuit breakers: Implement fallback mechanisms for external dependencies
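The circuit-breaker lesson can be sketched in a few lines. This is a generic, single-threaded illustration of the pattern, not the implementation used in the system described above; the class name, thresholds, and injectable clock are assumptions made for the example. After a number of consecutive failures the breaker opens and rejects calls immediately, then allows one trial call once the reset timeout has passed.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout_s = reset_timeout_s
        self._clock = clock  # injectable so the timeout can be tested without sleeping
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        return self._opened_at is not None

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout_s:
                raise CircuitOpenError("circuit open; dependency call skipped")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would typically catch `CircuitOpenError` and fall back to a cached or default answer instead of waiting on a dependency that is known to be failing.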
Conclusion
Building scalable microservices requires careful attention to architecture, infrastructure, and monitoring. FastAPI combined with Kubernetes provides a powerful foundation for building systems that can grow with your needs.
The key is starting with solid fundamentals: proper resource management, autoscaling configurations, and comprehensive monitoring. The rest is iterative improvement based on real-world usage patterns.
Resources
Have questions about scaling microservices? Feel free to reach out on LinkedIn or GitHub.