MLOps: Taking Models from Notebook to Production
The engineering practices, tooling, and workflows that bridge the gap between data science experiments and reliable production ML systems.
By some industry estimates, 87% of ML models never reach production. Not because the models are bad, but because the engineering around them is missing. MLOps is the discipline that bridges this gap: version control for data and models, automated training pipelines, deployment strategies, monitoring for drift, and feedback loops for continuous improvement. Think of it as DevOps for machine learning.
The MLOps Maturity Model
- Level 0 — Manual: Data scientists train models in notebooks, hand off pickle files to engineers, manual deployment. No versioning, no reproducibility.
- Level 1 — Pipeline Automation: Automated training pipelines (Airflow, Kubeflow). Models are versioned and tracked (MLflow, Weights & Biases). Deployment is scripted but manual.
- Level 2 — CI/CD for ML: Code changes trigger automated retraining, evaluation, and deployment. Model performance is monitored. Data and feature pipelines are version-controlled.
- Level 3 — Continuous Training: Production data automatically triggers retraining when drift is detected. A/B testing compares model versions. Feedback loops improve data quality continuously.
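The jump from Level 1 to Level 2 hinges on an automated evaluation gate: the pipeline promotes a candidate model only if it beats the current production model. A minimal sketch of such a gate is below; the metric names, thresholds, and `passes_gate` function are illustrative assumptions, not the API of any particular tool.

```python
def passes_gate(candidate, production, min_gain=0.002, guardrails=("recall",)):
    """Decide whether a candidate model should replace the production model.

    candidate / production: dicts of metric name -> value, e.g. produced by
    the evaluation step of an automated training pipeline. (Hypothetical
    schema for illustration.)
    """
    # Primary metric must improve by a meaningful margin, not just noise.
    if candidate["auc_roc"] < production["auc_roc"] + min_gain:
        return False
    # Guardrail metrics may not regress beyond a small tolerance, so a
    # model can't buy AUC by sacrificing, say, recall on fraud cases.
    for metric in guardrails:
        if candidate[metric] < production[metric] - 0.01:
            return False
    return True
```

In a Level 2 setup this check runs in CI after every retraining run; in Level 3 it also runs after drift-triggered retraining, so promotion never requires a human in the loop.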
Experiment Tracking and Reproducibility
Every experiment should be reproducible: same data, same code, same hyperparameters → same model. We use MLflow for experiment tracking, logging every parameter, metric, artifact, and the Git commit hash of the training code. When a model performs well in production, we need to know exactly how it was trained.
import mlflow

mlflow.set_experiment("fraud-detection-v2")

with mlflow.start_run():
    # Log parameters, including the version of the training data
    mlflow.log_params({
        "model_type": "xgboost",
        "n_estimators": 500,
        "max_depth": 8,
        "learning_rate": 0.05,
        "data_version": "v2.3.1",
    })

    model = train_model(X_train, y_train)

    # Log evaluation metrics
    metrics = evaluate_model(model, X_test, y_test)
    mlflow.log_metrics({
        "accuracy": metrics.accuracy,
        "precision": metrics.precision,
        "recall": metrics.recall,
        "f1": metrics.f1,
        "auc_roc": metrics.auc_roc,
    })

    # Log the model artifact and register it in the model registry
    mlflow.xgboost.log_model(model, "model", registered_model_name="fraud-detector")

Model Monitoring and Drift Detection
Models degrade over time as the real world changes. Customer behavior shifts, new products are introduced, market conditions evolve. Without monitoring, a model can silently become inaccurate. We monitor three types of drift: data drift (input distribution changes), concept drift (the relationship between inputs and outputs changes), and prediction drift (model outputs shift).
Don't just monitor model accuracy — by the time accuracy drops, users have already been affected. Monitor input data distributions with statistical tests (Kolmogorov–Smirnov, Population Stability Index) as a leading indicator. Data drift almost always precedes performance drift.
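The Population Stability Index mentioned above can be computed per feature by binning the training distribution and comparing bin frequencies against production data. The sketch below assumes a common rule of thumb (PSI above ~0.25 signals significant drift); the function name and binning choices are ours, not a library's.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a training sample ("expected")
    and a production sample ("actual") of one numeric feature.
    Bin edges are derived from the training distribution's quantiles."""
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch production outliers
    expected_frac = np.histogram(expected, edges)[0] / len(expected)
    actual_frac = np.histogram(actual, edges)[0] / len(actual)
    # Floor the fractions so empty bins don't produce log(0)
    expected_frac = np.clip(expected_frac, 1e-6, None)
    actual_frac = np.clip(actual_frac, 1e-6, None)
    return float(np.sum((actual_frac - expected_frac)
                        * np.log(actual_frac / expected_frac)))
```

Running this daily per feature and alerting when PSI crosses the drift threshold gives the leading indicator described above, well before labeled outcomes are available to measure accuracy.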
Feature Stores: The Missing Infrastructure
A feature store is a centralized repository for ML features that serves both training (batch) and inference (real-time). Without a feature store, you end up with training-serving skew: features computed differently in the training pipeline and the inference service, causing subtle accuracy drops that are nearly impossible to debug.
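The core idea behind avoiding training-serving skew is a single registered definition per feature, executed by both the batch training pipeline and the online inference service. The toy registry below illustrates that idea in plain Python; the decorator, feature names, and row schema are hypothetical, standing in for what stores like Feast or Tecton provide.

```python
import math

# Single source of truth: each feature's transformation is registered once.
FEATURES = {}

def feature(name):
    """Register a feature computation under a stable name (toy sketch)."""
    def register(fn):
        FEATURES[name] = fn
        return fn
    return register

@feature("txn_amount_log")
def txn_amount_log(row):
    return math.log1p(row["amount"])

@feature("is_foreign")
def is_foreign(row):
    return int(row["country"] != row["home_country"])

def compute_features(row):
    """Called by BOTH the training pipeline (per historical row) and the
    inference service (per live request), so the logic cannot diverge."""
    return {name: fn(row) for name, fn in FEATURES.items()}
```

Because both paths call `compute_features`, a change to any feature definition reaches training and serving together — exactly the property that is lost when the two pipelines reimplement features independently.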
“The difference between a demo ML model and a production ML system is not the model architecture — it's the engineering. Experiment tracking, reproducible pipelines, monitoring, and feature stores are the infrastructure that makes ML reliable.”
— David Kim, Vaarak AI/ML