Compest Solutions Inc
Job title: SRE / AI Ops Engineer
Client: Bank
Location: Toronto, Ontario (hybrid, 4 days per week in office)
Position Type: Contract
Please reply with your expected contract rate range.
Job Description: SRE / AI Ops Engineer
Overview
We are seeking a highly skilled Site Reliability Engineer (SRE) / AI Ops Engineer to design, build, and operate intelligent, automated reliability solutions across our production environments. This role blends deep operational expertise with modern AI‑driven observability, monitoring, and automation practices. You will work with industry‑leading tools (Dynatrace, Splunk, Moogsoft, PagerDuty, Ansible, Git/GitHub Actions, and Python) to create proactive, self‑healing, AI‑enhanced workflows that elevate system reliability and reduce manual toil.
This is a hands‑on engineering role for someone who thrives at the intersection of SRE, automation, and AI‑powered operations.
Key Responsibilities
AI‑Driven Observability & Monitoring
Implement and optimize monitoring solutions using Dynatrace, Splunk, and Moogsoft, leveraging their AI/ML capabilities (e.g., Davis AI, Splunk ITSI, Moogsoft AIOps) to:
Detect anomalies
Predict incidents
Correlate events across distributed systems
Reduce alert noise through intelligent clustering
AI Ops Workflow Engineering
Design and build AI‑powered operational workflows that automate:
Incident detection
Root cause analysis
Remediation actions
Post‑incident insights
Integrate AI insights from observability platforms into automated pipelines and runbooks.
Incident Response & Automation
Use PagerDuty for intelligent alerting, escalation policies, and automated incident response.
GitHub Actions.
Platform Reliability & SRE Practices
Apply SRE principles such as SLOs, SLIs, error budgets, and chaos testing.
Improve system reliability through automation, performance tuning, and proactive engineering.
Reduce operational toil by designing scalable, automated solutions.
DevOps & CI/CD Integration
Use GitHub Actions to build automated pipelines that integrate:
Observability signals
AI‑driven quality gates
Automated rollback and recovery workflows
Python Scripting & Tooling
Develop Python‑based automation, data processing, and AI‑enhanced operational tooling.
Build integrations between monitoring platforms, ticketing systems, and automation engines.
Required Skills & Experience
Core Technical Skills
Hands‑on experience with:
Dynatrace (including Davis AI)
Splunk (ITSI, Machine Learning Toolkit preferred)
Moogsoft AIOps
PagerDuty
Ansible
Git & GitHub Actions
AI Ops & Automation
Experience leveraging AI/ML features within observability and incident‑management tools.
Ability to design automated workflows that use AI insights for:
Event correlation
Predictive alerting
Automated remediation
Intelligent routing
SRE Expertise
Strong understanding of distributed systems, cloud infrastructure, and reliability engineering.
Experience with SLO/SLI design, error budgets, and performance optimization.
Familiarity with containerized environments (Kubernetes, Docker) is a plus.
Soft Skills
Strong analytical mindset with a passion for automation and continuous improvement.
Excellent communication and cross‑team collaboration abilities.
Ability to translate operational challenges into scalable engineering solutions.
Preferred Qualifications
Experience with the Red Hat OpenShift cloud platform.
Exposure to LLM‑based automation or generative AI for operational workflows.
Background in building or integrating with ChatOps frameworks.
What You’ll Achieve
In this role, you will help transform traditional application and infrastructure operations into a modern, AI‑enhanced reliability ecosystem. You’ll build systems that not only detect and respond to issues but learn from them—driving a future where operations are predictive, automated, and intelligent.
Regards,
Compest Solutions Inc
D: 647-660-7562
Job Type: Fixed term contract
Contract length: 12 months
Pay: $70.00-$75.00 per hour
Experience:
SRE / AI Ops Engineer: 10 years (preferred)
Dynatrace (including Davis AI): 10 years (preferred)
Splunk (ITSI, Machine Learning Toolkit preferred): 10 years (preferred)
Moogsoft AIOps: 8 years (preferred)
PagerDuty: 8 years (preferred)
Ansible: 8 years (preferred)
Git & GitHub Actions: 8 years (preferred)
Python scripting: 10 years (preferred)
Site Reliability Engineer (SRE): 10 years (preferred)
AI‑powered operational workflows: 7 years (preferred)
Davis AI: 6 years (preferred)
Work Location: In person