Curi is committed to helping physicians in medicine, business, and life. Founded in 1975, we were built on a promise: When doctors needed help, we would answer the call. Physicians’ needs have changed over the years, but our dedication to that promise has never wavered. From wealth management to medical malpractice insurance to well-being programs, we remain passionately curious about identifying ways to meet the ever-evolving needs of physicians and those who support them.
We’re looking for a critical thinker who is both creative and thoughtful to join the Curi team as the organization’s Lead Site Reliability Engineer. In this role, you will report to the VP of Engineering and be responsible for driving the reliability, availability, security, and performance of Curi’s critical business applications and services through automation and continuous improvement. You will work closely with all technology teams to establish and measure Service-Level Objectives (SLOs). Additionally, you will help these teams streamline their application deployments and improve the observability and resiliency of their systems.
This role will also participate in the design and build of Curi’s public cloud environment, contributing significantly to the successful migration of all applications to Amazon Web Services. An ideal candidate will be able to complete technical tasks themselves while also effectively leading projects and coordinating work with third parties and team members.
- Establish site reliability engineering practices, objectives, and goals for Curi
- Share vision and direction with teams and lead through demonstration
- Partner with service owners to implement and measure service level objectives and agreements
- Recommend department policy and operational changes to help streamline business operations
- Automate and operationalize engineering and development tasks such as deployments, data migrations, performance tuning, capacity changes, backups, failover, and more
- Improve the observability of systems through monitoring, logging, tracing, and alerting
- Solve recurring issues through automation and improvements to system architectures
- Work with application and engineering teams to develop and test disaster recovery plans
- Oversee public and private cloud operations including security, automation, performance, availability, and cost management
- Manage vendors who provide day-to-day operational support for private cloud environments
Education and Experience
- Demonstrated experience using DevOps tools such as GitHub and GitHub Actions workflows
- Comfortable in one or more programming languages, with strong scripting and automation skills
- Experience automating infrastructure and deployments with tooling such as Terraform, Ansible, Packer, and PowerShell DSC
- Fluent in the terminal and comfortable with command-line interfaces
- Knowledgeable in monitoring and logging tools such as Elastic Stack, CloudWatch, Splunk, Prometheus, and Grafana
- Familiarity with implementing monitoring and alerting strategies, such as the Four Golden Signals
- A deep understanding of public cloud technologies, particularly Amazon Web Services
- Experience following site reliability engineering and DevOps disciplines
- Breadth of knowledge in infrastructure technologies from Windows to Linux to networking
- Experience executing projects in Agile environments
- Excellent problem-solving and interpersonal skills
- Familiarity with compliance frameworks such as SOC 2 and the CIS Benchmarks
- Demonstrated expertise delivering technical solutions as per specified plans, deliverables, costs and timelines, start to finish
- Strong presentation skills: written and verbal communication, including the ability to influence key business and technology partners
- Bachelor’s degree in Computer Science or similar discipline