Location: Brentwood, TN, USA
Engineer, IT Cloud Site Reliability

Overall Job Summary
A Cloud Site Reliability Engineer is a multifaceted role that combines elements of software engineering, system administration, and IT operations. Cloud SREs are responsible for ensuring the reliability, performance, and scalability of systems by focusing on system design, automation, monitoring, incident management, performance tuning, collaboration, and security. Their efforts directly impact the stability and efficiency of critical systems, enabling organizations to deliver reliable and efficient services at scale. This role requires a blend of technical expertise, problem-solving skills, and effective communication, making it essential for the success of modern, complex infrastructures.

Essential Duties and Responsibilities

Vendor Management:
- Strong negotiation skills, the ability to build better vendor relationships, network effectively, manage multiple vendors, identify financial risks, and evaluate new vendors.
- Industry awareness, strong people skills, and the ability to make effective decisions.
- Effective management by monitoring performance, managing risks, tracking key performance indicators, and ensuring compliance with regulations.

Coordinating Team Efforts:
- Coordinate efforts with teams located onsite, offshore, nearshore, and across multiple vendors, providing clear direction, setting expectations, and motivating team members to achieve common goals.
- Ensure that tasks are assigned, schedules are aligned, and resources are allocated effectively across teams and vendors.
- Establish regular communication channels and protocols to ensure that information is shared, feedback is provided, and issues are addressed in a timely manner.
- Understand and respect cultural differences and work styles of team members and vendors from different regions.

System Design and Architecture:
- Collaborate with software engineers to identify and mitigate risks to system availability and reliability.

Automation and Tooling:
- Develop and maintain automation tools to streamline operations and reduce manual interventions.

Monitoring and Incident Management:
- Help improve monitoring and alerting systems.
- Respond to incidents, perform root cause analysis, and implement permanent fixes to prevent recurrence, maintaining detailed documentation.

Performance and Scalability:
- Conduct performance tuning, optimization, and capacity management of systems to handle increasing loads and demand.

Collaboration and Communication:
- Communicate effectively with stakeholders about system performance, incidents, and improvements.
- Foster a culture of reliability and continuous improvement across the organization.

Security and Compliance:
- Ensure that systems and infrastructure comply with security best practices and regulatory requirements.

Required Qualifications
Experience: 4+ years related work experience. Experience in the retail industry preferred.
Education: Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience. Any combination of education and experience will be considered.
Professional Certifications: None

High Demand IT Specialized Skills:
- Platform knowledge (UNIX, Linux, Windows); public and private cloud technologies (AWS, Google Cloud, Azure); and containerization technologies (Docker, Kubernetes).
- Hyper-converged platforms (Nutanix, SimpliVity), VMware vSphere 6, Microsoft applications (Active Directory, Exchange, O365, and server OS), AHV, Kubernetes, Docker, SaltStack.

Preferred Knowledge, Skills, or Abilities
- Knowledge of ITIL Foundation concepts, practices, and procedures preferred.
- Knowledge of continuous improvement concepts preferred.
- Experience with programming and scripting languages (Python, Go, Java, Bash).
- Experience with monitoring and logging tools (Prometheus, Grafana, ELK stack).
- Excellent problem-solving skills and the ability to work under pressure.
- Strong communication and collaboration skills, with a focus on teamwork and knowledge sharing.
- Strong enterprise application support experience.
- Strong process management skills.
- Ability to manage ITSM tools and enterprise support tools.
- Understanding of data integration concepts.
- SDLC Waterfall and Agile knowledge preferred.

Working Conditions
- Normal office working conditions

Physical Requirements
- Sitting
- Standing (not walking)
- Walking
- Lifting up to 20 pounds

Disclaimer
This job description represents an overview of the responsibilities for the above referenced position. It is not intended to represent a comprehensive list of responsibilities. A team member should perform all duties as assigned by his/her supervisor.

ALREADY A TEAM MEMBER?
You must apply or refer a friend through our internal portal.

Mission and Values are more than just words on the wall - they're the one constant in an ever-changing environment and the bedrock on which we build our culture. They're the core of who we are and the foundation of every decision we make. It's not just what we do that sets us apart, but how we do it.

EMPOWERMENT
We believe in managing your time for business and personal success, which is why we empower our Team Members to lead balanced lives through our benefits and total rewards offerings for full-time and eligible part-time TSC and Petsense Team Members.
We care about what you care about!

OPPORTUNITY
A lot of care goes into providing legendary service at Tractor Supply Company, which is why our Team Members are our top priority. Want a career with a clear path for growth? Your Opportunity is Out Here at Tractor Supply and Petsense.

Nearest Major Market: Nashville