Title: Site Reliability Engineer
Location: Birmingham, AL (hybrid)
FULL TIME
SUMMARY
The Site Reliability Engineer (SRE) is responsible for enhancing system reliability and resilience through automation. This role combines software and systems engineering to maintain large-scale, fault-tolerant systems, ensuring they remain available and adaptable. The SRE actively monitors system health, supports cloud-based transformations, and innovates to meet customer needs while providing operational support for multiple distributed software applications.
JOB DUTIES
- Analyzes monitoring metrics for performance and fault tolerance.
- Collaborates with developers to enhance services and testing.
- Contributes to system design, platform management, and capacity planning.
- Balances speed of feature development with reliability.
- Assists in restoring normal service through incident response.
- Debugs and troubleshoots system issues.
- Manages unwanted traffic through investigation and rate limiting.
- Utilizes monitoring for proactive adjustments and alerts.
- Implements continuous improvement for processes and technology.
- Handles other assigned tasks as necessary.
KNOWLEDGE, SKILLS, ABILITIES
- Bachelor's degree or equivalent experience.
- 5+ years of experience in a technology or software role.
- Proficient in Kubernetes, SRE principles, and cloud services (GCP).
- Experience with Dynatrace, New Relic, or SolarWinds.
- Skilled in microservice architecture and infrastructure troubleshooting.
- Experienced in deploying, monitoring, and supporting enterprise applications.
- Proficient in CI/CD tools and performance optimization.
- Strong mix of software engineering and operational support skills.
- Knowledge of web technologies and tools like Azure DevOps, Dynatrace, Prometheus, Terraform, and Grafana.
NICE TO HAVE
- Grafana
- Splunk

Regards,
Sachin
972-###-####