Fireworks AI

Software Engineer, Site Reliability Engineer

Sorry, this job was removed at 06:05 p.m. (MST) on Saturday, Apr 26, 2025
In-Office
8 Locations

About Us:

Here at Fireworks, we’re building the future of generative AI infrastructure. Fireworks offers the generative AI platform with the highest-quality models and the fastest, most scalable inference. Independent benchmarks have found our LLM inference to be the fastest, and we have been getting great traction with innovative research projects, like our own function calling and multi-modal models. Fireworks is funded by top investors, like Benchmark and Sequoia, and we’re an ambitious, fun team composed primarily of veterans from PyTorch and Google Vertex AI.

The Role:

We’re seeking a highly skilled SRE/PE with deep expertise in Kubernetes (k8s), cloud networking, and infrastructure automation. This role will focus on reducing incident response time, implementing auto-remediation, optimizing auto-scaling, and improving cluster efficiency and service health. You’ll design systems that balance performance, cost, and reliability while working onsite with our team in Redwood City or New York City.

Key Responsibilities:

  1. Incident Response & Reliability Engineering:

    • Drive initiatives to reduce incident response time through improved monitoring, alerting, and automated remediation.

    • Build self-healing systems and playbooks for common failure scenarios.

    • Lead blameless post-mortems and implement preventative measures.

  2. Kubernetes & GPU Cluster Optimization:

    • Manage and optimize GPU-enabled Kubernetes clusters for AI/ML workloads, focusing on cost-performance efficiency, auto-scaling, and resource utilization.

    • Debug performance bottlenecks in distributed systems (e.g., network, storage, GPU scheduling).

  3. Cloud Networking & Service Health:

    • Strengthen service health by refining cloud networking stacks (VPCs, load balancers, service meshes) and ensuring low-latency communication.

    • Design fault-tolerant architectures to minimize downtime.

  4. Monitoring & Observability:

    • Enhance service monitoring with tools like Prometheus, Grafana, and custom metrics pipelines.

    • Implement predictive analytics to proactively address system health risks.

  5. Automation & Infrastructure-as-Code (IaC):

    • Build automation for cluster provisioning, scaling, and recovery using Terraform, Argo, and CI/CD pipelines.

    • Develop tools to streamline operational workflows (e.g., automated rollbacks, canary deployments).

Minimum Qualifications:

  • 3+ years in SRE/PE/DevOps roles with production-grade Kubernetes experience.

  • Proficiency in cloud networking (AWS/GCP/Azure VPCs, firewalls, DNS) and service monitoring (Prometheus, Alertmanager, Grafana).

  • Hands-on experience with incident management and improving system reliability/SLOs.

  • Strong scripting/coding skills (Python/Go/Bash) for automation and tooling.

  • Familiarity with object storage (S3, GCS) and data pipeline integration.

Preferred Qualifications:

  • Experience with GPU clusters (NVIDIA GPUs, MIG, CUDA) and AI/ML workloads.

  • Knowledge of auto-scaling technologies (K8s HPA/VPA) and auto-remediation frameworks.

  • Expertise in service meshes (Istio).

Why Fireworks AI?

  • Solve Hard Problems: Tackle challenges at the forefront of AI infrastructure, from low-latency inference to scalable model serving.

  • Build What’s Next: Work with bleeding-edge technology that impacts how businesses and developers harness AI globally.

  • Ownership & Impact: Join a fast-growing, passionate team where your work directly shapes the future of AI—no bureaucracy, just results.

  • Learn from the Best: Collaborate with world-class engineers and AI researchers who thrive on curiosity and innovation.
