Job Title: Site Reliability Engineer
Experience: 10+ Years
Skills: .NET, SQL, React, Dynatrace, AWS, Splunk, Elastic Stack, Python, Scripting Languages, Ansible Tower, Terraform
Location: Fort Mill, SC
We at Coforge are hiring for a Site Reliability Engineer with the following skills:
Responsibilities:
- Lead development of the SRE dashboard.
- Lead development and tracking of SRE error budgets.
- Lead root cause investigations.
- Proactively identify system anomalies.
- Recognize automation opportunities.
- Strong understanding of CI/CD practices, with robust knowledge of Git, GitHub Actions, and GitHub Workflows. Familiarity with other tools such as Jenkins is advantageous.
- Engage in and improve the whole life cycle of application and cloud services, from inception and design through deployment, operation, and refinement.
- Plug into the software release cycle. Work closely with developers to ensure software releases are well designed, planned, implemented, released, and monitored.
- Automate time-consuming and manual processes.
- Assess the current SRE solution and define the SRE approach for products.
- Work with application development teams on designing, implementing, and improving SRE practices.
- Proficiency in containerization: hands-on experience creating and managing Docker images, ensuring optimal performance and security.
- Proficiency with the Kubernetes platform, including the ability to manage containerized applications, scale resources as needed, and troubleshoot issues in production environments.
- Monitoring and Observability: Experience with monitoring tools such as Prometheus, Grafana, and the ELK Stack, including the ability to set up and configure monitoring solutions, use metrics for performance optimization, and troubleshoot issues effectively.
- Strong understanding of cloud platforms like AWS and infrastructure automation tools.
- Proven ability to design and implement monitoring solutions that ensure system uptime and performance.
- Leverage industry-leading tools like Dynatrace, Splunk, and Elastic Stack for real-time monitoring and troubleshooting.
- Maintain a deep understanding of cloud platforms like AWS and utilize infrastructure automation tools like Terraform and Ansible Tower.