Job Title: Senior Site Reliability Engineer
Job Type: Permanent
Salary: Dependent on experience
Role Location: On-site — Palo Alto, CA
The company is a well-established tech organization that builds advanced AI products for healthcare and clinical research. The team focuses on secure, reliable platforms that process sensitive medical data and support research and clinical workflows.
As a Senior SRE, you will:
Design and automate infrastructure using infrastructure-as-code tools
Build and maintain CI/CD pipelines and release automation
Operate and scale production systems on major cloud platforms
Implement monitoring, alerting, and incident response practices
Enforce security and compliance controls for protected health data
Create and test disaster recovery and continuity plans
Produce clear operational documentation and runbooks
Coach and guide more junior engineers and on-call teams
Work closely with engineering and research teams to enable fast, safe delivery of product features
What you will bring:
5+ years in an SRE / Infrastructure / Platform role
Hands-on experience with IaC (Terraform or equivalent)
Production experience with container orchestration (Kubernetes)
Solid scripting/programming skills (e.g., Python)
Proven experience with CI/CD systems and pipelines
Experience running workloads on cloud providers (GCP, Azure, or AWS)
Familiarity with observability tools (metrics, logs, tracing)
Practical knowledge of security best practices and data-protection tooling
Strong communication, troubleshooting, and incident response skills
Nice to have:
Experience with healthcare compliance (HIPAA, SOC 2)
Background in high-performance or data-intensive environments
Prior mentorship or technical leadership experience
Deep experience running Terraform and Kubernetes at production scale
Benefits:
Competitive compensation (dependent on experience)
Medical, Dental, Vision
Collaborative engineering culture with strong technical ownership