This job ad was posted over 40 days ago! (*)
The Hail team is seeking a DevSecOps/Site Reliability Engineer. The Hail team's mission is to accelerate our understanding of human biology and disease by building software tools that enable rapid analysis and exploration of massive biological datasets (petabytes and tripling yearly). We are dedicated to open science: our software is open source and we develop in the open (https://github.com/hail-is/hail). We currently develop in Python, Scala, and C/C++ and use Spark, Kubernetes, Google Cloud Platform (GCP), and AWS, but will use any tools we need to get the job done.
Over the past few years, we have built Hail (see https://hail.is), a powerful and flexible tool for the analysis of large-scale genomic data. We are in the process of building an elastic, multi-tenant version of Hail as a service. We are seeking a site reliability engineer to help us build and operate this service.
You will:
- Contribute to the design, implementation and operation of distributed systems to run diverse computational workloads (user services, interactive analysis, batch analysis and continuous correctness, scale, and performance testing).
- Implement infrastructure as code and automate workflows for testing and deployment of infrastructure.
- Debug live, complex distributed systems.
- Be responsible for logging, monitoring, and alerting.
- Build a security culture on the team and continually improve the security of our services.
- Perform capacity planning and scale testing.
- Work cooperatively in a multi-disciplinary environment.
- Spend at least 50% of your time on software development.
- Work with scientists to execute large, ambitious data analysis projects.
Some projects you might work on:
- We continually test and deploy our application, but not yet every component of our infrastructure. Collect requirements, then design, implement, and test extensions to our CI for infrastructure-level configuration.
- Formalize service level objectives. Monitor the relevant indicators and make an operational plan to achieve our desired level of reliability.
- We have a scheduler for batch Docker workloads. Add a micro-VM backend (like Firecracker or gVisor) to increase isolation of multi-tenant workloads.
- Implement rate-limiting APIs to improve the robustness of multi-tenant services.
Key to our success is growing a strong and diverse team whose members enable and support each other's development and achievements. Self-improvement is a fundamental part of our culture; we want to grow great engineers.
We are committed to giving equal consideration to candidates from underrepresented groups in software engineering. We know that many excellent candidates choose not to apply despite their skills; please allow us to enthusiastically counter this. Additionally, several team members have taken a non-traditional path to professional software development, and we are interested in all candidates who are a good fit for the role, whether or not they have a degree in computer science or a specific set of professional qualifications.
Key words: GCP, devops, devsecops, operations, site reliability engineer, SRE, systems engineer, infrastructure engineer, kubernetes, docker, distributed systems