Duration: 5-6 months, with possible extension or conversion depending on performance
Contract Type: W2 Contract
Job Responsibilities: Hybrid Cloud Infrastructure
You will help ensure the availability, reliability, and performance of business-critical applications and infrastructure by providing 24x7x365 monitoring, proactive incident response, knowledge management, and automation. You'll work with internal technology teams and third-party vendors to quickly detect, escalate, and resolve incidents while reducing manual effort through shift-left practices and scripting/automation.
• Monitor applications/infrastructure using tools such as Dynatrace, Grafana, and Azure Monitor; tune dashboards, baselines, and alerts.
• Serve as an Incident Coordinator for triage and major incidents: run bridge calls, document actions, and support PIRs.
• Drive incident triage and escalation to meet rapid detection goals (e.g., a TTD target of 5 minutes for major incidents) and support RCA and communications.
• Build and maintain SOPs, knowledge articles, and known error content to improve L1 effectiveness.
• Identify repetitive issues and create scripts/runbooks (PowerShell/Python/Bash) to automate detection and remediation.
• Track and report operational KPIs (e.g., MTTD/MTTR, tickets worked, change validations, major incidents avoided).
• Provide scheduled coverage for 24x7x365 operations, including off-hours and holidays as needed.
Qualifications:
• 8+ years in IT operations, incident management, or application support in a 24/7 environment.
• Hands-on experience with observability/monitoring (Dynatrace, Grafana, and/or Azure Monitor), including alerting and dashboarding.
• Experience supporting or coordinating major incident resolution (bridge calls, documentation, stakeholder communications).
• Familiarity with ITSM tooling and workflows (e.g., ServiceNow).
• Excellent scripting/automation skills (PowerShell, Python, and/or Bash) and experience documenting SOPs/knowledge articles.
• Exceptional verbal and written communication skills; ability to document procedures, incident reports, and root cause analyses clearly.
• Proven ability to provide effective escalation support and guidance to junior engineers and Tier 1/2 teams.
• Bachelor's degree in a related field (or equivalent experience).
• Ability to travel 10% of the time, on average, depending on the work you do and the clients and industries/sectors you serve.