Machine Learning Infrastructure Engineers

🔒 Confidential Employer

Posted 3 May 2026

LOCATION

Remote

TYPE

Full-time

LEVEL

Mid-Senior level

SKILLS

Kubernetes, GPU infrastructure, Python, Go, Terraform, Distributed systems, MLOps, Model serving

FULL DESCRIPTION

[Employer hidden — sign up to reveal] is hiring a Machine Learning Infrastructure Engineer (Remote - Americas) in the Engineering & Data division. This role involves building and operating the end-to-end ML platform. Apply now.

About the role

Machine Learning Infrastructure Engineers build and operate the end-to-end platform that powers AI, from data ingestion and training to large-scale, low-latency inference. They design high-performance, GPU-accelerated systems on Kubernetes, craft self-serve developer experiences, and ship the paved roads that let ML teams move fast, safely, and at global scale. Some companies separate ML Infra, ML Platform, and ML Ops; at [Employer hidden — sign up to reveal], we call all of this ML Infrastructure. We have an agile workforce who can flex their experience and solve problems across these three domains.

Responsibilities

Build and operate ML control planes, APIs, CLIs, SDKs, and self-serve golden paths
Design and optimize multi-tenant GPU Kubernetes clusters, including autoscaling, scheduling, packing, and utilization
Own model lifecycle: training orchestration/experiments, registries/versioning, CI/CD, canary/blue-green, and safe rollback
Build real-time serving stacks (KServe/Seldon/TensorFlow Serving) and end-to-end pipelines for batch and streaming
Design feature platforms and engineer storage/data movement for datasets, features, and artifacts tuned for cost/performance
Implement observability and SLOs across pipelines, training, and inference; automate remediation and capacity planning
Partner with ML, data, and product teams to unblock delivery and accelerate idea-to-impact

Qualifications

Proven platform/infrastructure engineering experience with a track record of shipping production systems and code
Deep Kubernetes/containerization expertise for ML workloads (operators, Helm, service mesh/gRPC) and multi-tenant clusters
Hands-on experience running GPU infrastructure at scale (NVIDIA ecosystem; scheduling/packing/optimization)
Strong distributed systems and API/service design fundamentals; experience with high-scale inference
Proficiency with infrastructure-as-code and automation (Terraform, Helm, GitOps) on major clouds (GCP/AWS/Azure)
Observability expertise (Prometheus/Grafana) and SLO-driven operations for ML systems
Proficient in Python/Go/Java; experience building developer tooling and self-service platforms

Nice to Haves

Model serving and lifecycle tooling: KServe/Seldon/TensorFlow Serving, Kubeflow, MLflow/W&B, model registries, DVC
Feature store experience (Feast/Tecton) with online/offline parity and SLAs
Data infrastructure familiarity (Kafka, Spark/Flink) and stateful stores (Redis/MySQL); CI/CD for online/batch inference
Model performance optimization (batching, caching, quantization, distillation) and hardware-aware tuning
Experience with experimentation/A/B testing platforms and online evaluation frameworks

At [Employer hidden — sign up to reveal], we pride ourselves on moving quickly, not just in shipping but in our hiring process as well. If you're ready to apply, please be prepared to interview with us within the week. Our goal is to complete the entire interview loop within 30 days. You will be expected to complete a live pair programming session, so come prepared with your own IDE.

This role may require on-call work.

About [Employer hidden — sign up to reveal]

Opportunity is not evenly distributed. [Employer hidden — sign up to reveal] puts independence within reach for anyone with a dream to start a business. We propel entrepreneurs and enterprises to scale the heights of their potential. Since 2006, we’ve grown to over 8,300 employees and generated over $1 trillion in sales for millions of merchants in 175 countries.

This is life-defining work that directly impacts people’s lives as much as it transforms your own. It puts the power of the few in the hands of the many, builds a future with more voices rather than fewer, and creates more choices instead of an elite option.

About you

Moving at our pace brings a lot of change, complexity, and ambiguity—and a little bit of chaos. Shopifolk thrive on that and are comfortable being uncomfortable. That means [Employer hidden — sign up to reveal] is not the right place for everyone.

Before you apply, consider if you can:

Care deeply about what you do and about making commerce better for everyone
Excel by seeking professional and personal hypergrowth
Keep up with an unrelenting pace (the week, not the quarter)
Be resilient and resourceful in the face of ambiguity and thrive on (rather than endure) change
Bring critical thought and opinion
Put AI agents and tools to work on the tasks they're built for, and focus on the work only humans can do
Embrace differences and disagreement to get shit done and move forward
Work digital-first for your daily work

We may use AI-enabled tools to screen, select, and assess applications. All AI outputs are reviewed and validated by our recruitment team.


We hire people, not resumes. If you think you’re right for the role, apply now.
