How We Reduced ML Inference Latency by 73% for a FinTech Fraud Detection System
From 450ms p99 to 120ms: A Model Serving Optimisation Story
The Challenge
A US-based Series B FinTech was running a gradient-boosting fraud detection model on SageMaker real-time endpoints. The model scored every card transaction, but at 450ms p99 latency it held payments 300–400ms longer than competitors did, causing measurable checkout abandonment and, more critically, occasional SLA breaches with banking partners who required sub-200ms responses. The team had already tried vertical scaling (larger instance types) with diminishing returns. They needed a fundamentally different approach to model serving.
Our Solution
We conducted a model-serving audit and identified four compounding inefficiencies:

1. **Model format**: The XGBoost model was loaded via the default SageMaker SKLearn container, which is not optimised for inference.
2. **Batching**: Every prediction was a single-record request with no micro-batching, so per-request overhead dominated.
3. **Feature computation**: Three features were recomputed on every inference call despite being static for a given merchant.
4. **Instance type**: CPU-based c5.2xlarge instances for a workload that parallelises well.

Our solution:

- Converted the XGBoost model to ONNX format and served it via NVIDIA Triton Inference Server with dynamic micro-batching (max batch size 16, max latency 5ms)
- Pre-computed and cached static merchant features in Redis with a 1-hour TTL (see the caching sketch after this list)
- Migrated serving to inf1.xlarge (AWS Inferentia) instances, 40% cheaper than c5.2xlarge and 3× faster for ONNX inference
- Deployed Triton on EKS with Karpenter autoscaling, enabling 0→1000 RPS scaling in under 90 seconds
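To make the feature-caching step concrete, here is a minimal read-through cache sketch in Python using redis-py. The key schema, feature names, and the `compute_merchant_features()` helper are hypothetical placeholders, not the client's actual code:

```python
# Read-through cache for static per-merchant features in Redis, 1-hour TTL.
# Key schema, feature names, and compute_merchant_features() are hypothetical.
import json

import redis

r = redis.Redis(host="localhost", port=6379)
TTL_SECONDS = 3600  # features are static enough to tolerate a 1-hour TTL


def compute_merchant_features(merchant_id: str) -> dict:
    # Stand-in for the three expensive features that rarely change per merchant.
    return {"avg_ticket": 42.0, "chargeback_rate": 0.003, "account_age_days": 812}


def get_merchant_features(merchant_id: str) -> dict:
    key = f"merchant:{merchant_id}:features"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: skip recomputation entirely
    features = compute_merchant_features(merchant_id)
    r.setex(key, TTL_SECONDS, json.dumps(features))  # store with expiry
    return features
```

With a cache like this in place, only the truly dynamic transaction features are computed on the inference path; everything static for a merchant is a single Redis GET.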
Results & Impact

- p99 latency down from 450ms to 120ms (a 73% reduction), comfortably inside the sub-200ms SLA required by banking partners
- ~40% lower instance cost after the move from c5.2xlarge to inf1.xlarge
- Scaling from 0 to 1,000 RPS in under 90 seconds via Karpenter autoscaling
Client
Confidential FinTech (Series B, US)
Frequently Asked Questions
Why ONNX instead of keeping XGBoost native?
ONNX Runtime applies operator-level optimisations during inference (e.g. fusing consecutive operations, SIMD vectorisation) that the native XGBoost predictor does not. For tabular models, we consistently see 2–4× speedup from ONNX conversion.
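A minimal conversion sketch with onnxmltools and ONNX Runtime follows. The synthetic training data, feature count, and file name are illustrative only; in practice the production model would come from the existing training pipeline:

```python
# Sketch: convert a trained XGBoost classifier to ONNX, then score it with
# ONNX Runtime. Data, feature count, and file names are illustrative.
import numpy as np
import onnxruntime as ort
from onnxmltools.convert import convert_xgboost
from onnxmltools.convert.common.data_types import FloatTensorType
from xgboost import XGBClassifier

N_FEATURES = 30  # hypothetical feature count

# Stand-in model; in practice this is the production fraud model.
X = np.random.rand(500, N_FEATURES).astype(np.float32)
y = (np.random.rand(500) > 0.5).astype(np.int64)
model = XGBClassifier(n_estimators=50, max_depth=4).fit(X, y)

# Declare the input signature: a float32 tensor with a dynamic batch dimension.
onnx_model = convert_xgboost(
    model, initial_types=[("input", FloatTensorType([None, N_FEATURES]))]
)
with open("fraud_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

# ONNX Runtime applies its graph optimisations (operator fusion, vectorised
# kernels) when the session is created.
sess = ort.InferenceSession("fraud_model.onnx", providers=["CPUExecutionProvider"])
labels, probabilities = sess.run(None, {"input": X[:1]})
```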
What is dynamic micro-batching in Triton?
Instead of processing each request individually, Triton waits up to a configured window (5ms in this case) to accumulate requests, then processes them as a batch. Batching amortises the fixed overhead of a forward pass, dramatically improving throughput and reducing per-request latency under load.
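In Triton, this behaviour lives in the model's config.pbtxt. Below is a minimal sketch matching the settings described above (max batch size 16, 5ms queue delay); the model name and tensor shapes are hypothetical:

```protobuf
# Hypothetical config.pbtxt for the ONNX fraud model.
name: "fraud_model"
platform: "onnxruntime_onnx"
max_batch_size: 16

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 30 ]   # per-request feature vector; batch dim is implicit
  }
]
output [
  {
    name: "probabilities"
    data_type: TYPE_FP32
    dims: [ 2 ]
  }
]

# Hold requests up to 5ms (5000 microseconds) to accumulate a batch.
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 5000
}
```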
Want similar results for your business?
Let's discuss your project. Free consultation, no obligation.
Start a Conversation