How We Reduced ML Inference Latency by 73% for a FinTech Fraud Detection System
From 450ms p99 to 120ms: A Model Serving Optimisation Story
The Challenge
A US-based Series B FinTech was running a gradient-boosting fraud detection model on SageMaker real-time endpoints. The model scored every card transaction, but at 450ms p99 latency it held payments 300–400ms longer than competitors did, causing measurable checkout abandonment and, more critically, occasional SLA breaches with banking partners who required sub-200ms responses. The team had already tried vertical scaling (larger instance types) with diminishing returns. They needed a fundamentally different approach to model serving.
Our Solution
We conducted a model-serving audit and identified four compounding inefficiencies:

1. **Model format**: The XGBoost model was loaded via the default SageMaker SKLearn container, which is not optimised for inference.
2. **Batching**: Every prediction was a single-record request with no micro-batching, so per-request overhead dominated.
3. **Feature computation**: Three features were recomputed on every inference call despite being static for a given merchant.
4. **Instance type**: CPU-based c5.2xlarge instances for a workload that parallelises well.

Our solution:

- Converted the XGBoost model to ONNX format and served it via NVIDIA Triton Inference Server with dynamic micro-batching (max batch size 16, max latency 5ms)
- Pre-computed and cached static merchant features in Redis with a 1-hour TTL (see the caching sketch after this list)
- Migrated serving to inf1.xlarge (AWS Inferentia) instances, 40% cheaper than c5.2xlarge and 3× faster for ONNX inference
- Deployed Triton on EKS with Karpenter autoscaling, enabling 0→1000 RPS scaling in under 90 seconds
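To make the feature-caching step concrete, here is a minimal read-through cache sketch in Python using redis-py. The key schema, feature names, and the `compute_merchant_features()` helper are hypothetical placeholders, not the client's actual code:

```python
# Read-through cache for static per-merchant features in Redis, 1-hour TTL.
# Key schema, feature names, and compute_merchant_features() are hypothetical.
import json

import redis

r = redis.Redis(host="localhost", port=6379)
TTL_SECONDS = 3600  # features are static enough to tolerate a 1-hour TTL


def compute_merchant_features(merchant_id: str) -> dict:
    # Stand-in for the three expensive features that rarely change per merchant.
    return {"avg_ticket": 42.0, "chargeback_rate": 0.003, "account_age_days": 812}


def get_merchant_features(merchant_id: str) -> dict:
    key = f"merchant:{merchant_id}:features"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: skip recomputation entirely
    features = compute_merchant_features(merchant_id)
    r.setex(key, TTL_SECONDS, json.dumps(features))  # store with expiry
    return features
```

With a cache like this in place, only the truly dynamic transaction features are computed on the inference path; everything static for a merchant is a single Redis GET.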
Results & Impact

- p99 latency down from 450ms to 120ms (a 73% reduction), comfortably inside the sub-200ms SLA required by banking partners
- ~40% lower instance cost after the move from c5.2xlarge to inf1.xlarge
- Scaling from 0 to 1,000 RPS in under 90 seconds via Karpenter autoscaling
Client
Confidential FinTech (Series B, US)
Frequently Asked Questions
Why ONNX instead of keeping XGBoost native?
ONNX Runtime applies operator-level optimisations during inference (e.g. fusing consecutive operations, SIMD vectorisation) that the native XGBoost predictor does not. For tabular models, we consistently see 2–4× speedup from ONNX conversion.
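A minimal conversion sketch with onnxmltools and ONNX Runtime follows. The synthetic training data, feature count, and file name are illustrative only; in practice the production model would come from the existing training pipeline:

```python
# Sketch: convert a trained XGBoost classifier to ONNX, then score it with
# ONNX Runtime. Data, feature count, and file names are illustrative.
import numpy as np
import onnxruntime as ort
from onnxmltools.convert import convert_xgboost
from onnxmltools.convert.common.data_types import FloatTensorType
from xgboost import XGBClassifier

N_FEATURES = 30  # hypothetical feature count

# Stand-in model; in practice this is the production fraud model.
X = np.random.rand(500, N_FEATURES).astype(np.float32)
y = (np.random.rand(500) > 0.5).astype(np.int64)
model = XGBClassifier(n_estimators=50, max_depth=4).fit(X, y)

# Declare the input signature: a float32 tensor with a dynamic batch dimension.
onnx_model = convert_xgboost(
    model, initial_types=[("input", FloatTensorType([None, N_FEATURES]))]
)
with open("fraud_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

# ONNX Runtime applies its graph optimisations (operator fusion, vectorised
# kernels) when the session is created.
sess = ort.InferenceSession("fraud_model.onnx", providers=["CPUExecutionProvider"])
labels, probabilities = sess.run(None, {"input": X[:1]})
```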
What is dynamic micro-batching in Triton?
Instead of processing each request individually, Triton waits up to a configured window (5ms in this case) to accumulate requests, then processes them as a batch. Batching amortises the fixed overhead of a forward pass, dramatically improving throughput and reducing per-request latency under load.
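In Triton, this behaviour lives in the model's config.pbtxt. Below is a minimal sketch matching the settings described above (max batch size 16, 5ms queue delay); the model name and tensor shapes are hypothetical:

```protobuf
# Hypothetical config.pbtxt for the ONNX fraud model.
name: "fraud_model"
platform: "onnxruntime_onnx"
max_batch_size: 16

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 30 ]   # per-request feature vector; batch dim is implicit
  }
]
output [
  {
    name: "probabilities"
    data_type: TYPE_FP32
    dims: [ 2 ]
  }
]

# Hold requests up to 5ms (5000 microseconds) to accumulate a batch.
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 5000
}
```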
Want similar results for your business?
Let's discuss your project. Free consultation, no obligation.
Start a Conversation