*** If you applied to this role in either of the past two openings and have not been contacted, your profile was not selected for this position. ***

We’re hiring a Senior Site Reliability Engineer to build and scale the reliability backbone of a leading GPU-powered platform.
Job Requirements
- Degree in Computer Science or a related discipline, or equivalent practical experience / solid proof of expertise.
- 4+ years of software development experience in one or more languages (Go ideal; Rust/Python).
- 4+ years designing, analyzing, and troubleshooting distributed systems and production services.
- Proficiency in debugging, profiling, and performance tuning of large-scale Linux systems.
- Experience with Kubernetes (or similar schedulers), containerized services, and IaC (Terraform/Pulumi/CloudFormation).
- Experience with observability (metrics, logs, traces), progressive delivery (canary/blue-green), and incident management.
- Track record of OSS contributions.
- Linux internals, networking, and kernel/perf tooling.
- Exposure to hypervisors (KVM/) or virtual machine introspection concepts.
- Knowledge of GPU architectures and CUDA programming.
- Cybersecurity experience (runtime security, hardening, secrets management).
- Building distributed systems on Kubernetes and high-throughput data pipelines (e.g., Kafka/Redpanda/Fluent Bit).
- Experience with multi-cloud operations, cost/perf optimization, and compliance-minded engineering.
Responsibilities
- Build and maintain systems that keep the platform stable, fast, and always available
- Automate repetitive operational tasks to reduce manual work and human errors
- Monitor system performance and set clear reliability targets (uptime, response time, etc.)
- Detect issues early and respond quickly to incidents to minimize downtime
- Work closely with engineering teams to improve system design, scalability, and efficiency
- Optimize infrastructure performance and cost across cloud environments
- Improve deployment processes to make releases safer and smoother
- Contribute to building internal tools that help teams operate systems more efficiently
- Continuously enhance system reliability, performance, and security
Preferable: Developers and volunteers contributing to open-source libraries related to Linux environments
Candidate Background: Only Computing Background
Location: Fully Remote
Job Level: Senior
Talent Country: Egypt
Technologies: Python, Go Lang, Linux, Terraform, Rust, kernel, Cloud Architecture, DevOps, Backend, Kubernetes, Security, SRE

Site Reliability Engineer-GPU
