
Senior Site Reliability Engineer
Location: San Francisco, CA (On-site, Full-time)
Compensation: $200K-$275K + equity + full benefits package
About the Role
Our client is a fast-growing AI scale-up seeking a Senior Site Reliability Engineer to scale and harden the infrastructure behind their core platform. This person will own automation, deployment pipelines, observability, and reliability at scale. They will partner closely with software engineers, ML researchers, and product teams to ensure the platform is secure, performant, and ready for rapid iteration.
Key Responsibilities
Design and implement infrastructure automation and deployment pipelines using Terraform.
Build and maintain monitoring, alerting, and logging systems to ensure reliability and performance.
Work with engineering teams to design and deploy scalable, fault-tolerant, and secure production systems on AWS, GCP, or Azure.
Define and maintain cloud-native security and compliance practices, using tools such as Vault and KMS.
Troubleshoot and resolve complex infrastructure and operations issues in collaboration with cross-functional teams.
Lead disaster recovery and business continuity planning.
Document systems and processes to improve repeatability and scalability.
Mentor junior engineers and provide technical guidance.
Qualifications
Bachelor’s or Master’s degree in Computer Science, Engineering, or equivalent.
5+ years of professional SRE or infrastructure engineering experience.
Strong expertise in Terraform and Kubernetes with a demonstrable track record.
Proficiency in Python for building orchestration and automation; Go is a plus.
Hands-on experience with Docker and CI/CD platforms such as GitLab CI/CD or GitHub Actions.
Familiarity with monitoring and logging platforms such as ELK, Grafana, or Datadog.
Strong knowledge of at least one major cloud platform (AWS, GCP, or Azure). Where candidates are otherwise equally matched, experience with GCP or Azure is preferred.
Ability to thrive in fast-paced, high-growth environments with minimal ramp-up.
Excellent communication and collaboration skills.