Site Reliability Engineer (SRE)
Location: San Francisco Bay Area
Role Overview:
We are seeking a highly skilled Site Reliability Engineer (SRE) to join a dynamic team at a rapidly growing technology company. As an SRE, you will be responsible for ensuring the reliability, scalability, and performance of mission-critical systems, while implementing automation and optimizing cloud infrastructure. This role offers the opportunity to work with cutting-edge AI/ML technologies, leveraging them to solve complex challenges in cloud infrastructure management and performance optimization.
Key Responsibilities:
- System Reliability & Performance: Design, implement, and maintain scalable systems, ensuring high availability, performance, and disaster recovery across production environments.
- Automation & Tool Development: Develop automation tools to streamline operations, improve system reliability, and reduce manual interventions.
- Cloud Infrastructure Management: Create and manage cloud environments (e.g., dev, staging, production) on AWS, GCP, or Azure, optimizing infrastructure performance and cost.
- Integration of AI/ML Models: Collaborate with engineering teams to integrate machine learning models into production environments, ensuring that these models scale efficiently and perform optimally.
- Incident Management: Respond to and resolve incidents, minimizing downtime and ensuring quick recovery. Lead post-incident reviews and implement preventive measures.
- Continuous Improvement: Identify areas for improvement and drive initiatives to enhance system reliability, performance, and security.
- Security & Compliance: Ensure that infrastructure and applications adhere to security best practices and compliance standards.
Qualifications:
- Educational Background: Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
- Experience: Proven experience as a Site Reliability Engineer or in a similar role within a SaaS environment, including managing and optimizing cloud infrastructure (preferably AWS, GCP, or Azure). Familiarity with integrating AI and machine learning technologies.
- Technical Skills:
  - Proficiency in programming and scripting languages such as Python, Go, or Bash.
  - Experience with containerization and orchestration tools like Docker and Kubernetes.
  - Solid understanding of networking, security, and performance optimization practices.
  - Knowledge of CI/CD pipelines and DevOps practices to ensure smooth development and deployment cycles.
- Problem-Solving: Strong analytical and problem-solving skills with attention to detail.
- Collaboration & Communication: Excellent interpersonal skills, with the ability to work collaboratively in cross-functional teams and communicate technical concepts clearly.
Benefits:
- Competitive Salary: Attractive compensation package, including equity options.
- Health & Wellness: Comprehensive health, dental, and vision insurance, along with other benefits.
- Work Environment: A collaborative and innovative work environment within a growing company.
- Growth Opportunities: Opportunities for career growth, professional development, and a chance to shape the future of the company's technology and infrastructure.