Qumulo

Site Reliability Engineer

Reposted 13 Days Ago

2 Locations

Mid level

As a Site Reliability Engineer at Qumulo, you will help manage and monitor applications and infrastructure, implementing automated solutions for both on-prem and cloud environments. Responsibilities include troubleshooting build failures, implementing system monitoring, and participating in an on-call rotation for incident response.

The summary above was generated by AI

About the company:

Qumulo is the unstructured data platform to store and manage exabyte-scale data anywhere – at the edge, in the core data center and in the cloud. With unstructured data growing in more locations faster than ever before, enterprises today need a way to store, manage, and curate data simply and efficiently in any location, on any platform. This is precisely what Qumulo was founded to accomplish.

At Qumulo, we are building an open and collaborative culture where people can do their best work with customers as our magnetic field. We act as owners, we share by default, and we are data-driven and experimental. As an inclusive workplace, we encourage and celebrate multiple points of view. As part of our culture, we believe diversity drives innovation.

About the position:

As an SRE at Qumulo, you will help develop solutions to manage and monitor the applications we use both internally and to support our customers. We manage our internal build and test infrastructure, which runs multiple builds and hundreds of thousands of tests continuously, both in on-prem environments and in the cloud (such as AWS and Azure Native Qumulo Scalable File Service [ANQ]). This build and test environment is a core part of our engineering processes, providing continuous feedback to our engineering teams and allowing us to deliver new product releases regularly throughout each year. We also build and operate managed components of ANQ, delivering a highly available service to customers and keeping the service up to date with our latest features.

We work across engineering, product and customer success teams to identify opportunities to improve our processes and ensure that our existing systems are available and working as expected. We implement solutions that reduce work through automation and scale across our on-prem and cloud environments. We help manage the operating expense of running systems across multiple clouds. We help drive down failures by giving engineers frequent feedback on their changes through high-quality test analytics.

Responsibilities:

You will collaborate with a team that identifies opportunities, plans new features, and implements solutions. You will work with team members to build a backlog and deliver solutions iteratively. You will troubleshoot build and test failures, diagnosing problems that range from build-time compilation failures to integration test failures involving both virtual machine instances and Qumulo-qualified hardware. You will implement monitoring to ensure that systems are working as expected and to raise alerts when problems are detected.

This position includes an on-call rotation, which requires availability to respond to critical incidents impairing the applications we own.

Technologies:

Experience working in Linux (we use Ubuntu)
Experience with Python or similar programming languages
Experience with system orchestration tools (such as Ansible, Terraform, and cloud-specific implementations like AWS CloudFormation) is preferred
Experience with one or more of the major cloud providers (AWS, GCP, Azure)
Functional working understanding of Kubernetes and of using containers to manage applications (we manage clusters in our on-prem locations as well as in the cloud)
Experience with monitoring tools and technologies (we use a combination of homegrown solutions that use OpenMetrics as well as tools like Grafana, InfluxDB, and Prometheus)
Experience troubleshooting systems issues
Knowledge of build automation and test frameworks

Key Benefits

Excellent healthcare coverage
Parental leave
401K investment plan
Unlimited paid time off; employees are strongly encouraged to take at least 3 weeks per year

Other Details

Qumulo is an Equal Opportunity Employer. Qualified applicants will receive consideration for employment without regard to race, color, gender, religion, sex, sexual orientation, age, disability, military status, national origin, or any other characteristic protected under federal, state, or applicable local law.

Please note that employment at Qumulo is contingent upon completion of a satisfactory background check.

For more information on our Applicant and Employee Privacy Notice please click on the link below:

https://qumulo.com/applicant-employee-privacy-notice
