Hybrid remote, 1 day per week in Victoria, London

Role Overview

We are looking for a senior SRE / DevOps practitioner to design, standardise, and operate cloud platforms that support multiple AI-driven products and services.

This role focuses on building opinionated, reusable infrastructure patterns that enable teams to rapidly deliver AI workloads while maintaining high standards for reliability, security, and cost control.

You will develop platform architecture across multiple concurrent projects, ensuring consistency in how services are deployed, integrated, and operated. This includes shaping how AI/ML workloads are built, deployed, and monitored, as well as defining clear patterns for service communication, API exposure, and infrastructure provisioning.

This is a hands-on role for someone who is comfortable making strong architectural decisions, reducing variability across teams, and balancing flexibility with standardisation in a fast-moving environment.

Key Responsibilities

Platform Architecture & Standardisation

Define and implement opinionated architecture patterns for cloud-native and AI-enabled services on AWS
Establish reusable blueprints for building and deploying these services
Drive consistency across multiple projects through shared modules, templates, and platform tooling

Infrastructure as Code & Automation

Build and maintain Terraform-based infrastructure, using modular and reusable design
Define CI/CD patterns for infrastructure deployment and for application and model delivery
Enforce best practices through pipelines and automation rather than documentation

Reliability, Observability & Operations

Embed SRE principles across all services: monitoring, logging, and tracing; SLIs/SLOs and alerting
Continuously improve reliability, performance, and cost efficiency
Operate API gateway/data plane technologies (e.g. Kong)

Required Skills & Experience

Strong experience operating AWS-based platforms in production
Proven experience with Terraform, including module design and CI/CD integration
Hands-on experience with container platforms (ECS preferred; EKS experience acceptable if you can adapt to ECS)
Experience operating API gateways (Kong or equivalent)
Solid understanding of cloud networking and service discovery patterns
Experience supporting multiple teams or projects on a shared platform
Strong troubleshooting and production operations experience

AI / Data Platform Experience (Required)

Practical experience running or supporting AI/ML workloads in production, such as model inference services, batch processing pipelines, or integration with LLM APIs or hosted models
Understanding of the scaling characteristics of AI workloads and their cost considerations (compute-heavy workloads, GPU usage, etc.)
Familiarity with tooling such as model serving frameworks, data processing pipelines, or managed AI services on AWS

Desirable Skills

Experience with GPU workloads or specialised compute environments
Familiarity with feature stores, vector databases, or embedding pipelines
Knowledge of event-driven architectures
Experience with security best practices (IAM, secrets management, Zero Trust)
Exposure to platform engineering or internal developer platforms

Wider Skills

Ability to make and defend clear architectural decisions
Comfortable operating across multiple concurrent workstreams
Strong communication and stakeholder management skills
Detail-oriented with a bias toward automation and standardisation over ad hoc solutions

Pay: £60,000.00-£80,000.00 per year

Work Location: Hybrid remote in London W2 2UH

Senior SRE / DevOps Engineer (AI Platforms & Multi-Project Infrastructure)
