Staff Site Reliability Engineer - Confluent Incident Management & Reliability

IBM

Full-time, Toronto, ON, Engineering

Posted:

June 09, 2026

Location:

Toronto, ON, Canada

Job Description

Your Role and Responsibilities

About the Role

Confluent Cloud processes millions of events per second across AWS, GCP, and Azure. When incidents happen in a multi‑cloud streaming platform, they happen at scale: data in motion, exactly‑once semantics, and cascading failure modes that require deep systems thinking. We need an expert‑level engineer who can drive proactive reliability improvements that prevent these incidents before they occur.
This role combines hands‑on technical work with strategic program ownership. You'll spend roughly 75% of your time on engineering: building automation, improving tooling, analyzing systemic failure patterns, and designing reliability improvements. The remaining 25% is teaching and coordination: coaching teams through post‑mortems, training incident commanders, and evolving our incident response practices. 
You'll be part of a global team with follow‑the‑sun coverage, with clean handoffs that keep everyone work...

Job Overview

Job Type: Full-time

Location: Toronto, Canada

Posted: June 09, 2026

Deadline: July 19, 2026