Tecsys

Site Reliability Engineer

Sorry, this job was removed at 05:36 p.m. (MST) on Wednesday, Feb 19, 2025

Canada

Description

Having recognized the advantages of remote work, including improved employee morale and productivity and the benefits of reduced commuting for employee wellbeing and the environment, we are proud to be a digital-first company. The technologies and programs in which we invested have provided a fantastic foundation to this end. Our digital-first work environment, together with our conveniently located offices and collaborative workspaces, provides our team with the freedom and flexibility to work in the way that makes our employees most productive.

About us

Tecsys is a fast-growing innovator offering supply chain solutions to organizations ranging from industry-leading healthcare systems, hospitals, and pharmacy businesses to distributors, retailers, and 3PLs. We work with industry leaders to transform their supply chains through technology. If you thrive on tackling interesting challenges with continuous learning opportunities, then Tecsys could be a good fit for you!

About the Role

We are looking for a Site Reliability Engineer to work within our “Network and Security Operations Center” department. Our NOC team aims to improve the reliability and uptime of our platform and applications in a data-driven way to support the needs of internal and external customers.

Your responsibilities

Collaborate with other Engineering teams to support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
Develop tools & automation on top of Azure & AWS to continuously reduce the need for manual intervention.
Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
Be on-call.
Practice sustainable incident response and blameless postmortems.
Implement automated solutions for continuous integration and delivery (CI / CD).
Implement monitoring, logging, alerting, and SLA reporting.
Implement service monitoring dashboards displaying key metrics.
Create and maintain technical documentation.
Apply SRE best practices.
Take command of high-severity incidents and facilitate their resolution.
Provide support for our planning and deployment teams to enable stability, predictability, and scale in our continued growth.
Collaborate with members of the Platform Engineering team to implement and support far-reaching strategic efforts, provide constructive feedback, and foster a collaborative environment.
Work cross-functionally with internal teams and vendors to manage our growth around the globe, with a strong focus on maintaining the high level of performance, availability, and reliability for our users.

Requirements:

Bachelor's degree in computer science or related technical discipline.
At least 5 years’ experience in systems engineering; demonstrable technical experience in new platform development, orchestration, product ownership, and iterative design and deployment.
Experience designing and deploying large scale systems, multi-vendor platforms and globally distributed infrastructure.
Strong knowledge of system design; high performance computing; file, block, and storage technologies; integration of compute, storage, and network technologies to deliver cohesive infrastructure solutions.
A strong understanding of full-stack automation, with examples of projects you have executed; our scale is going to require a lot of it. As we grow, we aim to reduce manual intervention and work with both internal and open-source tools to automate day-to-day activities.
Self-organize, collaborate, and manage efforts with peers and teams across responsibility areas, languages, geography, and time zones.
Be a self-starter, curious, and not afraid to ask questions and challenge the way things are done today.
See a problem or opportunity, take ownership and act on it independently.
Knowledge of Datadog preferred (or at least a similar or equivalent product).
Knowledge of Rapid7 Insight preferred (or at least a similar or equivalent product).
Knowledge and experience of AWS or Azure required.
Basic knowledge of Java- or .NET-based development required.
Knowledge of GitLab (enterprise license) preferred; at minimum, knowledge of Jenkins is required.
Experience with a SaaS company is a strong asset.
Experience with FedRAMP (the Federal Risk and Authorization Management Program) compliance is a strong asset.
Strong English communication skills, both written and spoken, are essential for effective correspondence with customers, business partners and colleagues beyond the province of Quebec.

Additional requirements:

Escalation on-call rotation
Occasional travel (quarterly offsites, conferences – less than 10%)

At Tecsys, we are committed to fostering a diverse and inclusive workplace where all employees feel valued, respected, and empowered. We believe that diversity drives innovation and strengthens our ability to deliver exceptional solutions. We welcome and encourage applicants from all backgrounds, experiences, and perspectives to join our team.

Tecsys is an equal opportunity employer. Accommodation is available for applicants selected for an interview.

NB: If you are applying to this position, you must be a Canadian citizen or a permanent resident of Canada, or have a valid Canadian work permit.
