Staff ML Ops Engineer - Recommendations
[Employer hidden] is seeking a Staff ML Ops Engineer to own the operational lifecycle of ML systems for recommendations. This is a remote position (Americas) with a focus on deployment pipelines, evaluation frameworks, data preprocessing, and production monitoring.
About the role
[Employer hidden] is the commerce platform that powers millions of merchants worldwide. Behind the product experience are ML systems that drive recommendations, search, and personalization at massive scale.
We build and maintain the operational backbone behind these systems: deployment pipelines, evaluation frameworks, data preprocessing, and the monitoring that keeps models fresh and reliable in production. Our models serve hundreds of millions of buyers, and the pipelines we build directly impact how quickly and safely we can improve merchant outcomes.
The Role
You will own the operational lifecycle of our ML systems: deployment pipelines, evaluation frameworks, data pipelines, and the monitoring and reliability layer that keeps everything running in production. You'll ensure models go from training to production safely, that we can evaluate changes rigorously, and that the data feeding our models is fresh and correct.
This role is the connective tissue between research and production. You'll build the systems that let engineers ship model improvements with confidence and speed, while maintaining the reliability standards required to serve hundreds of millions of buyers - including during peak events like Black Friday/Cyber Monday.
This role carries real technical authority. You'll set the standards for how models get deployed and evaluated, mentor engineers on operational best practices, and drive alignment on reliability and pipeline strategy across the team. You'll influence technical direction beyond your immediate team and raise the engineering bar through hiring and technical reviews.
What You'll Do
- Deployment & Rollout
  - Own the model deployment pipeline end to end: export, validation, canary rollout, rollback, and A/B integration
  - Build and maintain CI/CD for ML: automated testing, model validation gates, and progressive delivery
  - Ensure safe, repeatable deployments with clear rollback paths and minimal manual intervention
- Evaluation & Experimentation
  - Build automated offline evaluation pipelines against production baselines
  - Extend our experimentation framework so ML Engineers can launch and evaluate model changes with minimal friction
  - Maintain evaluation datasets and ensure data freshness and correctness
  - Integrate offline metrics with online A/B testing to close the feedback loop
- Data Pipelines
  - Own data preprocessing for training: interaction histories, feature stores, and embedding tables
  - Manage workflow orchestration (Airflow or equivalent) for scheduled retraining and data refresh. You trigger and monitor training runs; the underlying GPU compute layer is owned by the infrastructure side of the team.
  - Ensure data quality, lineage tracking, and pipeline idempotency
  - Own data correctness and freshness; partner with infrastructure engineers on data loading throughput and efficiency
- Monitoring & Reliability
  - Build monitoring and alerting across training jobs, serving endpoints, and data pipelines
  - Define and maintain SLOs for model freshness, serving latency, and training throughput
  - Participate in incident response and drive post-mortems for ML system failures
  - Identify and eliminate toil through automation
- Technical Leadership
  - Drive cross-team technical strategy for ML operations - identify systemic reliability risks and pipeline bottlenecks before they become incidents
  - Mentor and up-level engineers on the team through pairing, design reviews, and setting operational standards
  - Contribute to hiring: screen candidates, conduct technical interviews, and calibrate the engineering bar
  - Write technical proposals and RFCs that shape operational direction across the organization
What We're Looking For
Required
- 7+ years in software engineering, with 5+ years focused on MLOps, data engineering, or production ML systems
- Strong experience with ML deployment pipelines: model export, validation, canary releases, and rollback strategies
- Experience with workflow orchestration for ML (Airflow, Dagster, Prefect, or similar)
- Solid Python fundamentals; comfortable working with PyTorch model artifacts and training configurations
- Production monitoring experience: you've built or operated alerting, dashboards, and SLO frameworks for ML systems
- Experience with data pipelines at scale: batch processing, feature engineering, and data quality validation
- Working proficiency with Kubernetes: able to debug pod failures, understand resource scheduling, and navigate GPU workloads
- Demonstrated technical leadership: you've driven operational strategy, written technical proposals, and influenced engineering direction beyond your immediate team
- Track record of mentoring engineers and raising the reliability bar on a team
Preferred
- Experience with large-scale data warehouses (BigQuery or equivalent) for offline evaluation and metrics
- Hands-on with experiment tracking and A/B testing frameworks
- Experience operating recommendation or retrieval systems at scale
- Familiarity with model compression workflows in production (quantization, pruning, distillation)
- Experience with cloud-native ML orchestration (SkyPilot, Ray, or similar)
About [Employer hidden]
Opportunity is not evenly distributed. [Employer hidden] puts independence within reach for anyone with a dream to start a business. We propel entrepreneurs and enterprises to scale the heights of their potential. Since 2006, we’ve grown to over 8,300 employees and generated over $1 trillion in sales for millions of merchants in 175 countries.
This is life-defining work that directly impacts people’s lives as much as it transforms your own. It puts the power of the few in the hands of the many, builds a future with more voices rather than fewer, and creates more choices instead of an elite option.
About you
Moving at our pace brings a lot of change, complexity, and ambiguity—and a little bit of chaos. Shopifolk thrive on that and are comfortable being uncomfortable. That means [Employer hidden] is not the right place for everyone.
Before you apply, consider if you can:
- Care deeply about what you do and about making commerce better for everyone
- Excel by seeking professional and personal hypergrowth
- Keep up with an unrelenting pace (the week, not the quarter)
- Be resilient and resourceful in the face of ambiguity and thrive on (rather than endure) change
- Bring critical thought and opinion
- Put AI agents and tools to work on the tasks they're built for, and focus on the work only humans can do
- Embrace differences and disagreement to get shit done and move forward
- Work digital-first for your daily work
We may use AI-enabled tools to screen, select, and assess applications. All AI outputs are reviewed and validated by our recruitment team.