Job Title:

Lead DevOps Engineer, Foundry RnD

Overview:

Our Purpose

Mastercard powers economies and empowers people across 200+ countries and territories. Together with our customers, we build a sustainable, inclusive economy by enabling secure, simple, smart, and accessible digital payments. Our technology, innovation, partnerships, and networks deliver products and services that help people, businesses, and governments reach their full potential.

Lead DevOps Engineer

We are hiring a Lead DevOps/Platform Engineer to help build the next-generation AI/ML infrastructure used by Mastercard Foundry. This role sits at the intersection of DevOps, Platform Engineering, and MLOps, supporting R&D teams who are still shaping their product direction.

You will own the core platform components that power AI experimentation and its delivery to end users: Azure, AKS, GPU compute, Databricks, Terraform/Terragrunt, GitHub Actions, and GitOps. You will help build and scale AI/ML infrastructure to support our innovation efforts, with a focus on automation, observability, and developer experience. The ideal candidate is hands-on, curious, motivated, and comfortable working in fast-moving R&D environments.

This is not an AI modelling role — it is a deeply technical platform role focused on enabling rapid, safe, reproducible ML development.

What You'll Do

Drive Platform Infrastructure: Own DevOps and infrastructure for MLOps and agentic AI systems, establishing reusable patterns for CI/CD, scalable inference, orchestration, observability, and cost control. Design secure, scalable, repeatable systems using Infrastructure as Code (IaC) to support R&D workloads.

Build Secure CI/CD & Automation Systems: Enable secure tool access, workload isolation, and infrastructure for LLM-backed APIs and MCP servers, while partnering with security and compliance on access control, infrastructure governance, and auditability.

Ensure Reliability & Observability: Implement monitoring, logging, and alerting. Tune observability for ML-specific workloads to ensure performance, reliability, and operational insight.

Provide Technical Leadership: Offer hands-on leadership across DevOps and platform initiatives. Review code, enforce best practices, improve tooling, and promote clean, well-tested infrastructure.

Cross-Functional Collaboration: Partner with ML, software, and platform engineers to design deployment strategies, scope work, manage agile deliverables, and meet milestones.

What You'll Bring

Extensive DevOps Experience: 8–12+ years in DevOps, SRE, or platform engineering, including senior/lead roles. Experience designing end-to-end infrastructure systems, solving scale/performance challenges, and operating platforms in production.

Cloud & Infrastructure Expertise: Strong skills in cloud platforms (AWS, Azure, or GCP) and AI/ML components such as Databricks, Azure ML, and MLflow. Deep experience with Infrastructure as Code using Terraform and orchestration tools like Terragrunt.

Container & Orchestration Mastery: Expertise in Kubernetes and Docker, including how they optimise ML development workflows. Experience with container security, networking, and cluster management at scale.

AI/ML Platform Knowledge: Understanding of ML workflow requirements, including model registries, feature stores, AI agents, Retrieval-Augmented Generation (RAG) techniques, and frameworks like LangChain/LlamaIndex.

Leadership & Mentorship: Ability to translate ambiguous goals into clear plans, guide engineers, and lead technical execution.

Problem-Solving Mindset: Approach issues systematically, using analysis and data to select scalable, maintainable solutions.

Required Skills

Education & Background: Bachelor's degree in Computer Science, Engineering, or related field. 8–12+ years of proven experience architecting and operating production-grade infrastructure, especially infrastructure supporting AI/ML workloads.

Infrastructure as Code: Expert in Terraform and IaC orchestration tools like Terragrunt. Strong experience with configuration management and GitOps practices.

Programming & Scripting: Advanced Bash and Python skills and strong software engineering fundamentals (version control, CI, code reviews). Familiarity with Go or other systems programming languages is a plus.

CI/CD & Automation: Hands-on experience with Jenkins, GitHub Actions, GitLab CI, or similar tools. Strong understanding of pipeline design, artifact management, and deployment strategies.

Monitoring & Observability: Experience with monitoring stacks such as Prometheus, Grafana, Splunk, and ELK. Skilled in building dashboards, alerts, and tuning observability for ML-specific use cases.

Cloud Infrastructure: Experience deploying systems on AWS/Azure/GCP. Familiar with cloud-native services, serverless computing, and managed Kubernetes offerings (EKS, AKS, GKE). Comfortable with Linux internals and shell scripting.

Security & Networking: Knowledge of security best practices for MLOps, including data privacy, compliance, access controls, and encryption. Understanding of modern networking protocols (mTLS) and secure service communication.

Collaboration & Agile Delivery: Strong communication skills and experience working with cross-functional teams. Ability to document designs clearly and deliver iteratively using agile practices.

Preferred Skills

Databricks Experience: Hands-on experience with Databricks, including workspace administration, cluster management, Unity Catalog, Delta Lake, and Lakehouse architectures. Familiarity with Databricks workflows, jobs orchestration, and MLflow integration.

Advanced Cloud & ML Platform Expertise: Experience with Azure ML, SageMaker, or similar ML platforms. Familiarity with model serving, feature stores, and ML pipeline orchestration.

Experience with OpenShift or other on-prem containerisation offerings to support significant GPU workloads.

ML Frameworks Familiarity: Knowledge of ML frameworks like TensorFlow, PyTorch, or scikit-learn to better support ML engineering teams.

Enterprise Security: Experience working in complex enterprise environments with strict security and compliance requirements. Strong networking fundamentals, including configuring and maintaining secure mTLS-based communication between services.

DevOps & Platform Innovation: Experience implementing self-service platform automation, developer portals, or internal developer platforms (IDPs).

Continuous Learning: Motivation to explore emerging technologies, especially in AI, generative AI, and cloud-native infrastructure. Certifications, personal projects, or open-source contributions are a plus.

To find US Salary Ranges, visit People Place. Under the Compensation tab, select "Salary Structures." Within the text of "Salary Structures," click on the link "salary structures 2025," through which you will be able to access the salary ranges for each Mastercard job family. For more information regarding US benefits, visit People Place and review the Benefits tab and the Time Off & Leave tab.
