Principal Site Reliability Engineer

Date: 12 May 2024
Location: Kuala Lumpur

Company: StarHub Ltd

Job Description

The role reports to the Applications Maintenance team and is responsible for bridging the gap between development (AD team) and IT operations (AMS team) by supporting operational tasks typically managed by operations teams. He/she will ensure the efficient functioning of application systems, monitor performance, automate processes, and enhance system availability.

Responsibilities:

•    Provide effective maintenance and support services to business units. Ensure the availability and robustness of the IT applications for a 24x7 mission-critical system.
•    Develop and implement automation tools to streamline operational tasks. 
•    Ensure the efficient functioning of applications, monitor performance, automate processes, and enhance system availability.
•    Be highly responsive to the dynamic nature of the business environment and react quickly to any critical system issue.
•    Perform application support during normal office hours and on-call standby. 
•    Ensure operational health of the systems by resolving escalated problems and performing application fine-tuning and system performance improvement activities. 
•    Closely monitor incident resolution and ensure SLAs are adhered to.
•    Troubleshoot and investigate issues, and raise defects when necessary.
•    Participate in planning and revenue assurance activities such as systems reconciliation and disaster recovery.
•    Adopt good industry practices and adhere to IS Policies and Standards.
•    Review designs/solutions for system and application changes to ensure quality delivery and stability of business operations.
•    Manage application vendors to provide timely and reliable system support.
•    Lead and participate in projects aimed at improving system reliability. 
•    Collaborate with cross-functional teams to identify areas for enhancement, implement changes, and measure the impact.
•    Monitor infrastructure components such as servers, databases, and networking.
•    Build monitoring systems that focus on symptoms rather than just outages. Alerts should provide actionable insights for rapid response.
 

Others

Qualifications

•    Bachelor’s degree in Computer Science, Computer Engineering, Information Technology or related fields. At least 7 to 9 years of relevant working experience, preferably in the telco industry.
•    Certifications related to Site Reliability Engineering are a plus.
•    Experienced in program planning and initiatives, with a demonstrated ability to drive SRE initiatives across departments in a large organization, develop strategic plans, set goals, and collaborate with stakeholders to align SRE efforts with overall business goals.
•    Experienced as a Site Reliability Engineer or in a similar role, specifically handling reliability improvement projects in large-scale, complex, business-critical application environments and within an ITSM/ITIL framework.
•    Proficient in containerization technologies and container orchestration platforms (e.g. Docker and Kubernetes), with an understanding of container networking, storage, and security concepts.
•    Proficient in cloud platforms (e.g. AWS, Google Cloud Platform (GCP), or Microsoft Azure) and cloud services (e.g. compute instances, storage, databases, networking, and monitoring tools).
•    Proficient in CI/CD pipelines and tools like Jenkins, GitLab CI/CD, or CircleCI for automating software builds, testing, and deployment processes.
•    Proficient in programming languages such as Python, Java, Go, or Ruby, with scripting skills for automation tasks and tool development. Knowledge of tools such as Splunk and Kibana will be a plus.
•    Ability to communicate asynchronously and work effectively with cross-functional teams.
•    Ability to quickly master in-depth application and business domain knowledge.
•    Ability to coach junior team members.
•    Willing to work extended hours when needed.
 

To APPLY NOW, click on Skye!