Observability Engineer
Job Summary
An Observability Engineer within the Incident Command team plays a critical role in monitoring, evaluating, and optimizing the performance and health of IT systems and applications. This position is pivotal in ensuring that the IT infrastructure operates efficiently and is capable of handling emerging issues swiftly and effectively.
The primary duties of an Observability Engineer include the development and maintenance of monitoring tools and dashboards that provide real-time insights into the operational status of IT systems. This role involves the collection and evaluation of metrics, logs, and traces to proactively detect, diagnose, and resolve performance bottlenecks or anomalies before they escalate into more significant incidents. Furthermore, the Observability Engineer partners closely with other IT and incident management teams to enhance incident response strategies. They are tasked with improving the observability framework by integrating advanced analytics and machine learning techniques to predict potential system failures and automate response processes.
The Observability Engineer will positively impact UCSF's operations and culture by ensuring UCSF's IT infrastructure is operable, secure, efficient, and effective in service of the University's mission. The Observability Engineer will advance the University's mission by delivering exceptional information technology services comprehensively and consistently across customers and stakeholders. This role will execute UCSF's vision while modeling UCSF's culture and values.
The final salary and offer components are subject to additional approvals based on UC policy. Your placement within the salary range is dependent on a number of factors including your work experience and internal equity within this position classification at UCSF. For positions that are represented by a labor union, placement within the salary range will be guided by the rules in the collective bargaining agreement. The salary range for this position is $120,300 - $194,600 (Annual Rate).
To learn more about the benefits of working at UCSF, including total compensation, please visit: https://ucnet.universityofcalifornia.edu/compensation-and-benefits/index.html
Department Description
University of California, San Francisco (UCSF) is distinguished as a leading academic healthcare organization, home to groundbreaking discoveries, world-class education, and exceptional healthcare services. Infrastructure Services (IS) is the backbone of the technological infrastructure, assuring the technical services that enable the academic, medical, and research missions of the organization. Beyond a focus on maintaining systems and resolving issues, we are committed to nurturing the potential of our team members and empowering them to excel. UCSF Infrastructure Services provides 24x7 support to the University community, always upholding the highest level of responsiveness and reliability for our customers. IS values innovation and excellence in ensuring secure and efficient Information Technology (IT) services, regardless of the hour or complexity of the issue.
The Incident Command team within Infrastructure Services operates as a critical support system for the community of medical and health researchers. This team is dedicated to ensuring seamless access to essential IT resources, thereby enabling continuous and vital research work that has a profound impact on human health and well-being. Incident Command's mission is to manage major IT incidents that could disrupt ongoing research, such as data breaches or network failures, swiftly and effectively. Operating around the clock, the team's primary objective is to restore standard operations promptly, minimizing disruption to the researchers' work. The Incident Command team collaborates to diagnose the issue, evaluate its potential impact on research activities, strategize an appropriate solution, and oversee the resolution process. The team also documents each incident so that lessons learned improve future incident management.
Required Qualifications
- Bachelor's degree, or equivalent combination of experience/training, in one or more of the following fields: computer science, engineering, computer information systems, etc.
- 5 to 7 years of experience in information technology or Information Technology (IT) Service Management/Customer
- Expertise in using advanced monitoring and observability tools such as Datadog, Spectrum, Prometheus, Grafana, Splunk, or New Relic to track system performance and health.
- Advanced ability to analyze and interpret complex data from various sources to diagnose issues and understand system behaviors.
- Skilled in responding to and managing incidents efficiently, minimizing downtime and ensuring quick resolution of issues.
- Proficiency in automating monitoring tasks using scripting and programming languages such as Python, Bash, PowerShell, and Java, and data formats such as YAML and XML, to enhance system efficiency and reliability.
- Demonstrated experience using PagerDuty, OpsGenie, or comparable applications.
- Excellent communication skills for effectively articulating incident details and collaborating with cross-functional teams to resolve issues.
- Advanced problem-solving skills with an ability to think critically and strategically under pressure to address and resolve unforeseen issues swiftly.
- Deep understanding of Information Technology (IT) infrastructure including networks, servers, databases, logging, and cloud services to identify and address potential points of failure.
- Ability to document incidents, create detailed reports, and maintain clear records of system performance and issues for future reference.
- Ability to lead and collaborate with team members in high-stress situations, ensuring effective teamwork and optimal incident handling.
- Proficiency in risk management, including risk assessment and mitigation strategies related to IT.
- Understanding of compliance requirements relevant to IT operations within the specific sector, such as educational institutions, which may include data protection laws and standards.
Preferred Qualifications
- Information Technology Infrastructure Library (ITIL)
About UCSF
The University of California, San Francisco (UCSF) is a leading university dedicated to promoting health worldwide through advanced biomedical research, graduate-level education in the life sciences and health professions, and excellence in patient care. It is the only campus in the 10-campus UC system dedicated exclusively to the health sciences. We bring together the world's leading experts in nearly every area of health. We are home to five Nobel laureates who have advanced the understanding of cancer, neurodegenerative diseases, aging and stem cells.
PRIDE Values
UCSF is a diverse community made up of people with many skills and talents. We seek candidates whose work experience or community service has prepared them to contribute to our commitment to professionalism, respect, integrity, diversity and excellence - also known as our PRIDE values.
In addition to our PRIDE values, UCSF is committed to equity, both in how we deliver care and in our workforce. We are committed to building a broadly diverse community, nurturing a culture that is welcoming and supportive, and engaging diverse ideas for the provision of culturally competent education, discovery, and patient care. Additional information about UCSF is available at diversity.ucsf.edu.
Join us to find a rewarding career contributing to improving healthcare worldwide.
Equal Employment Opportunity
The University of California San Francisco is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, protected veteran or disabled status, or genetic information.
Job Code and Payroll Title
000499 INFO SYS ANL 4
Job Category
Clinical Systems / IT Professionals
Bargaining Unit
99 - Policy-Covered (No Bargaining Unit)
Location
San Francisco, CA
Campus
Mission Center Building (SF)
Additional Shift Details
M-F, 8am-5pm with on-call rotation