OVERVIEW
We are looking for a highly technical ML Systems Engineer to architect and build scalable AI inference capabilities across heterogeneous environments.
This role focuses on solving real-world challenges in AI model execution, runtime interoperability, and performance optimisation.
You will operate at the intersection of machine learning, systems engineering, and software engineering, building platforms and tooling that standardise and simplify AI model serving in production environments.
Key Responsibilities
- Inference Systems Engineering
  - Design and develop abstractions, middleware, and system components to support model inference across Traditional and Generative AI
  - Build integration layers across different model formats, execution engines, and deployment environments
  - Ensure consistency, portability, reliability, and scalability of model execution
- Model Handling
  - Support diverse model architectures, including:
    - Large Language Models (LLMs)
    - Computer vision models
    - NLP models
    - Multi-modal models
  - Optimise models for latency, throughput, and resource efficiency
  - Optimise model loading strategies
  - Implement robust mechanisms for model lifecycle management
- Benchmarking & Evaluation
  - Develop and execute benchmarking methodologies to evaluate:
    - Latency vs throughput trade-offs
    - Runtime and hardware performance characteristics
    - Use case performance characteristics
  - Support data-driven deployment decisions through profiling and performance analysis
- Platform Integration & Developer Experience
  - Develop APIs, libraries, and platform services that enable:
    - Simplified model deployment and serving
    - Runtime backend selection
    - Model observability
    - Model scaling
  - Improve the experience of developers and platform operators while preserving operational flexibility and low-level control
Technical Experience
Must-Have
- Hands-on experience with at least one inference stack:
  - Traditional AI
    - NVIDIA Triton Inference Server
  - Generative AI
    - vLLM
    - SGLang
    - Dynamo/llm-d
- Strong ability to profile, diagnose, and optimise performance bottlenecks
- Strong proficiency in at least one programming language (e.g. Python, C++, Go, Rust)
- Good understanding of Linux systems, distributed systems concepts, and system-level debugging
- Familiarity with containers and orchestration platforms such as Docker and Kubernetes/OpenShift
Preferred Experience
- Experience working in air-gapped or restricted environments with enterprise GPUs (e.g. A100, H200, B200)
- Experience with LLM inference, including an understanding of concepts such as KV cache management, prefill vs decode phases, continuous batching, and token-level scheduling.
- Experience with model optimisation, including an understanding of techniques such as quantisation (FP16, INT8, INT4), graph optimisation, and compilation.
JOB REQUIREMENTS
- Degree in Computer Science, Computer Engineering, or a related discipline.
- Minimum 2–3 years of relevant experience in ML systems, inference engineering, platform engineering, or performance-critical software systems.
Experience
2–5 years
Job Type
Full-Time
Qualification
Bachelor's degree or equivalent
Working Hours
Standard Hours
Programme Centre / Entity
Digital Hub