Site Reliability Engineer

Date: 18 Oct 2024
Location: StarHub Green
Company: StarHub Ltd

Job Description

We are looking for a talented and motivated Site Reliability Engineer (SRE) to join our team. This role requires a mix of infrastructure expertise, observability experience, and DevOps skills. As an SRE, you will be instrumental in building reliable, scalable, and efficient systems. The ideal candidate will have hands-on experience with Terraform, Ansible, and log analytics tools, and will be proficient with Linux, Kubernetes, and AIOps platforms.

Key Responsibilities

  • Design, deploy, and manage scalable infrastructure using Infrastructure as Code (IaC) tools such as Terraform and Ansible, with GitHub for version control.
  • Implement and maintain observability solutions using ELK and the Grafana suite (e.g. Loki, Tempo, Mimir, and Prometheus) to ensure complete monitoring, logging, and tracing capabilities.
  • Leverage OpenTelemetry to instrument applications and collect telemetry data for performance insights and system health.
  • Automate configuration and operational tasks using Ansible to reduce manual effort.
  • Manage and monitor Kubernetes clusters and Linux-based systems to ensure optimal performance and availability.
  • Integrate and support SNMP-based Network Performance Monitoring (NPM) tools like SolarWinds, SevOne, or OpsRamp for network observability.
  • Implement event management systems and AIOps platforms for proactive incident detection, correlation, and automated resolution.
  • Collaborate with DevOps teams to build and maintain CI/CD pipelines.
  • Perform incident management, conduct post-incident reviews, and drive long-term improvements through root-cause analysis.
  • Maintain detailed documentation for infrastructure, automation workflows, troubleshooting procedures, and operational best practices.

Required Expertise and Experience

  • At least 3 years of experience in SRE, DevOps, or a related engineering role.
  • Proficiency in Infrastructure as Code (IaC) using Terraform to manage complex infrastructure.
  • Hands-on experience with log analytics and observability tools, including ELK (Elasticsearch, Logstash, Kibana) and the Grafana suite (Loki, Tempo, Mimir, Prometheus).
  • Knowledge of and experience with OpenTelemetry for distributed tracing and telemetry collection.
  • Experience working with Kubernetes clusters and Linux-based systems in production environments.
  • Expertise in automation using Ansible to streamline configuration and deployment processes.
  • Knowledge of SNMP-based NPM tools such as SolarWinds, SevOne, or OpsRamp for network monitoring.
  • Experience with AIOps platforms for event correlation and automated incident management.
  • Strong background in CI/CD practices, with hands-on involvement in building pipelines for software delivery.

Required Skills and Qualifications

  • Technical Skills:

    • Infrastructure management with Terraform.
    • Observability with ELK, Grafana suite, and OpenTelemetry.
    • Automation using Ansible.
    • Kubernetes orchestration and Linux system administration.
    • Expertise in SNMP-based NPM tools (SolarWinds, SevOne, or OpsRamp).
    • Experience with AIOps and event management platforms.
  • Soft Skills:

    • Strong problem-solving abilities with a focus on automation and continuous improvement.
    • Excellent communication and collaboration skills across cross-functional teams.
    • Ability to thrive in a dynamic, fast-paced environment and manage multiple priorities.
  • Preferred Knowledge:

    • Familiarity with GitOps practices for infrastructure management.
    • Understanding of Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
    • Security awareness and experience implementing secure infrastructure.
  • Education:

    • Bachelor’s degree in Computer Science, Information Technology, or a related field, or equivalent work experience.

To APPLY NOW, click on Skye!
