Site Reliability Engineer

Date: 17 Dec 2025

Location: SG

Company: StarHub Ltd

Job Purpose

The Senior SRE will be responsible for the reliability, scalability, and performance of enterprise-grade Red Hat OpenShift Container Platform (OCP) clusters and of observability solutions for systems and networks deployed across hybrid cloud environments.
This role combines deep platform engineering, automation, performance optimization, observability, and security to ensure mission-critical workloads achieve high availability (99.99%+), compliance, and operational excellence.

The engineer will act as a technical authority for observability solutions, collaborating with cloud architects, security teams, DevOps engineers, and business stakeholders to design, implement, and optimize SRE practices and observability platforms that meet stringent enterprise SLAs.

Key Responsibilities

Platform Reliability & Performance

Own end-to-end reliability and scalability of multi-cluster OpenShift environments (on-premises and public cloud) supporting containerized enterprise workloads.
Manage multi-tenant observability solutions, including centralized log analytics, security event monitoring, and network and business observability.
Conduct performance engineering for both OCP clusters and ELK components.
Lead capacity planning.

Automation & Resilience Engineering

Implement Infrastructure-as-Code and GitOps pipelines using Terraform, Ansible, and Jenkins to deliver reproducible OCP and ELK deployments.
Build self-healing and auto-remediation workflows leveraging Kubernetes operators, ServiceNow ITOM/AIOps integrations, and custom runbooks.
Design and enforce automated backup/restore, failover, and disaster recovery strategies across multiple data centers and cloud regions.
Develop SLO/SLI dashboards for performance, latency, error budgets, and saturation metrics using Prometheus, Grafana, and Kibana.

Observability & Incident Response

Drive adoption of ALErTS metrics to align reliability KPIs with business outcomes.
Standardize log ingestion pipelines with Logstash/Beats/Fluent Bit across heterogeneous infrastructure.
Lead root cause analysis (RCA) for complex production issues and establish post-incident blameless retrospectives with actionable follow-ups.

Security, Compliance & Governance

Ensure secure cluster configurations (CIS-compliant hardening, RBAC/ABAC policies, secrets management with Vault/KMS).
Enforce data privacy and retention policies for logs and traces, and support audit readiness.
Partner with InfoSec to perform vulnerability scanning, patch automation, and zero-trust network policy enforcement.

Innovation & Continuous Improvement

Champion SRE best practices and cost-optimization efforts.
Mentor junior engineers and contribute to knowledge-sharing playbooks and runbooks for SRE and observability.
Evaluate and adopt emerging cloud-native observability tools (e.g., OpenTelemetry, Elastic Agent, Loki, Tempo) to modernize telemetry pipelines.

Qualifications

Education

Bachelor’s degree (or higher) in Computer Science, Information Systems, or a related engineering discipline.

Experience

5 years in enterprise infrastructure/DevOps/SRE roles, with at least 2 years managing Red Hat Linux/Kubernetes/OpenShift clusters in production.
Proven expertise in designing and operating large-scale ELK clusters (>10TB/day log ingestion) for mission-critical workloads.
Strong experience with Linux systems administration (RHEL/Ubuntu), networking (overlay networks, CNI plugins), container runtime security, and storage backends (Ceph, NFS, SAN).
Track record of leading incident response and RCA for high-severity (P1/P2) production outages.
Demonstrated experience in performance optimization and cost-efficient scaling of container and observability platforms.

Technical Skills

Container & Cloud Platforms: Kubernetes, Red Hat OCP, Docker, Podman, Helm, Operators.
Observability Stack: ELK, Beats/Fluent Bit, OpenTelemetry, Prometheus, Grafana.
Automation & CI/CD: Terraform, Ansible, Jenkins, GitHub Actions, GitOps.
Programming/Scripting: Python, Go, or Bash for automation and custom operators.
Security & Compliance: Vault/KMS, CIS Benchmarks, RBAC/OPA, TLS/mTLS.

Soft Skills

Analytical mindset with strong troubleshooting and problem-solving abilities.
Excellent communication skills for cross-functional collaboration (Cloud, Security, DevOps, Application teams).
Leadership qualities for mentoring, incident command, and driving technical decisions.

Preferred Certifications

Any Red Hat Certification in Linux and OpenShift Administration or Kubernetes CKA/CKS/CKAD.
Elastic Certified Engineer or equivalent ELK-related certification.
AWS/GCP/Azure Certified SysOps/Architect (for hybrid-cloud OCP deployments).
HashiCorp Terraform Associate, Ansible Automation, or equivalent.
ITIL v4 Foundation and/or SRE Foundation/Practitioner certifications.
