Sr. Systems Operations Engineer

The Trade Desk

Location: New York, NY, USA

Date: 2024-11-12T08:27:36Z

Job Description:
Who We Are

At The Trade Desk, we recognize that a seamless customer experience is driven by operational excellence. In pursuit of constantly improving the reliability of our platform, we are establishing a global Systems Operations team. This team's core mission will be to vigilantly monitor The Trade Desk platform services, refine our incident response methodologies, and guarantee a robust and highly available customer experience. If you're passionate about ensuring system reliability, process improvement, and making an essential customer impact, we invite you to play a critical role in this next evolution of our on-call experience.

What You'll Do
  • Act as a technical expert and advisor to more junior Associate Systems Operations Engineers
  • At an escalated tier, monitor the state of platform services and stability via telemetry and alerts; triage issues and escalate to engineering teams as needed
  • Work collaboratively with development teams to facilitate issue remediation
  • Manage remediation task workflow
  • Proactively update and improve Systems Operations documentation and runbooks
  • Increase the effectiveness of the incident response process by defining and measuring relevant metrics
  • There may be periodic weekend coverage requirements

Who We Are Looking For
  • Bachelor's degree from a four-year university or relevant substitute experience
  • 6+ years of relevant work experience in Technical and/or Application Support, with strong knowledge of services support and troubleshooting

The Systems Operations Engineer will either possess or be excited to learn a number of skills...

Technical Proficiency:
  • Understanding of large-scale distributed system architectures (e.g., databases, web services, application services).
  • Familiarity with monitoring tools (e.g., Prometheus, Grafana, Nagios).
  • Ability to configure and fine-tune alerts.
  • Proficiency in, or ability to learn, programming languages including C# and SQL.

Incident Management and Troubleshooting:
  • Ability to prioritize and manage incidents based on severity, with a focus on customer impact.
  • Ability to remain calm under pressure and quickly diagnose issues.
  • Understanding of system logs, metrics, and telemetry.

Communication Skills:
  • Ability to communicate effectively with stakeholders during an incident.
  • Clear and concise documentation skills.
  • Ability to maintain and update troubleshooting guides (TSGs) and operational documentation.
  • Ability to translate complex technical issues and platform outages to non-technical stakeholders.

Automation & Scripting:
  • Ability to automate repetitive tasks.
  • Proficiency in scripting languages (e.g., Python, Bash) is a plus.