Site Reliability Engineer
Date: 15 Dec 2024
Location: StarHub Green
Company: StarHub Ltd
Job Description
We are looking for a talented and motivated Site Reliability Engineer (SRE) to join our team. This role requires a mix of infrastructure expertise, hands-on observability experience, and DevOps skills. As an SRE, you will be instrumental in building reliable, scalable, and efficient systems. The ideal candidate will have practical experience with Terraform, Ansible, and log analytics tools, along with proficiency in Linux, Kubernetes, and AIOps platforms.
Key Responsibilities
- Design, deploy, and manage scalable infrastructure using Infrastructure as Code (IaC) tools such as Terraform and Ansible, with code managed in GitHub.
- Implement and maintain observability solutions using ELK and the Grafana suite (e.g., Loki, Tempo, Mimir, and Prometheus) to provide end-to-end monitoring, logging, and tracing.
- Leverage OpenTelemetry to instrument applications and collect telemetry data for performance insights and system health.
- Automate configuration and operational tasks using Ansible to reduce manual effort.
- Manage and monitor Kubernetes clusters and Linux-based systems to ensure optimal performance and availability.
- Integrate and support SNMP-based Network Performance Monitoring (NPM) tools like SolarWinds, SevOne, or OpsRamp for network observability.
- Implement event management systems and AIOps platforms for proactive incident detection, correlation, and automated resolution.
- Collaborate with DevOps teams to build and maintain CI/CD pipelines.
- Perform incident management, conduct post-incident reviews, and drive long-term improvements through root-cause analysis.
- Maintain detailed documentation for infrastructure, automation workflows, troubleshooting procedures, and operational best practices.
Required Expertise and Experience
- At least 3 years of experience in SRE, DevOps, or a related engineering role.
- Proficiency in Infrastructure as Code (IaC) using Terraform to manage complex infrastructure.
- Hands-on experience with log analytics and observability tools, including ELK (Elasticsearch, Logstash, Kibana) and the Grafana suite (Loki, Tempo, Mimir, Prometheus).
- Knowledge of and experience with OpenTelemetry for distributed tracing and telemetry collection.
- Experience working with Kubernetes clusters and Linux-based systems in production environments.
- Expertise in automation using Ansible to streamline configuration and deployment processes.
- Knowledge of SNMP-based NPM tools such as SolarWinds, SevOne, or OpsRamp for network monitoring.
- Experience with AIOps platforms for event correlation and automated incident management.
- Strong background in CI/CD practices, with hands-on involvement in building pipelines for software delivery.
Required Skills and Qualifications
- Technical Skills:
  - Infrastructure management with Terraform.
  - Observability with ELK, Grafana suite, and OpenTelemetry.
  - Automation using Ansible.
  - Kubernetes orchestration and Linux system administration.
  - Expertise in SNMP-based NPM tools (SolarWinds, SevOne, or OpsRamp).
  - Experience with AIOps and event management platforms.
- Soft Skills:
  - Strong problem-solving abilities with a focus on automation and continuous improvement.
  - Excellent communication and collaboration skills across cross-functional teams.
  - Ability to thrive in a dynamic, fast-paced environment and manage multiple priorities.
- Preferred Knowledge:
  - Familiarity with GitOps practices for infrastructure management.
  - Understanding of Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
  - Security awareness and experience implementing secure infrastructure.
- Education:
  - Bachelor’s degree in Computer Science, Information Technology, or a related field, or equivalent work experience.