Job Title: Site Reliability Engineer - Performance Engineer
Location: Bay Area preferred/Hybrid
Department: DevOps
At WitnessAI, we work at the intersection of innovation and security in AI. We are seeking a Site Reliability Engineer focused on performance engineering. This role emphasizes deep systems-level performance analysis, tuning, and optimization to ensure the reliability and efficiency of our cloud-based infrastructure. You will drive performance across a tech stack that includes cloud infrastructure, Linux, Kubernetes, databases, message queuing systems, AI workloads, and GPUs. The ideal candidate brings a passion for data-driven methodologies, flame graph analysis, and advanced performance debugging to solve complex system challenges.
Key Responsibilities
- Conduct root cause analysis (RCA) for performance bottlenecks using data-driven approaches such as flame graphs, heatmaps, and latency histograms.
- Perform detailed kernel and application tracing using tools built on technologies such as eBPF, perf, and ftrace to gain insight into system behavior.
- Design and implement performance dashboards to visualize key performance metrics in real time.
- Recommend Linux and cloud server tuning improvements to increase throughput and reduce latency.
- Tune Linux systems for workload-specific demands, including scheduler, I/O subsystem, and memory management optimizations.
- Analyze and optimize cloud instance types, EBS volumes, and network configurations for high performance and low latency.
- Improve throughput and latency for message queues (e.g., ActiveMQ, Kafka, SQS) by profiling producer/consumer behavior and tuning configurations.
- Apply profiling tools to analyze GPU utilization and kernel execution times, and implement techniques to boost GPU efficiency.
- Optimize distributed training pipelines using industry-standard frameworks.
- Evaluate and reduce training times through mixed-precision training, model quantization, and resource-aware scheduling in Kubernetes.
- Work with AI teams to identify scaling challenges and optimize GPU workloads for inference and training.
- Design observability systems for granular monitoring of end-to-end latency, throughput, and resource utilization.
- Implement and leverage modern observability stacks to capture critical insights into application and infrastructure behavior.
- Work with developers to refactor applications for performance and scalability, using profiling tools to guide changes.
- Mentor teams on performance best practices, debugging workflows, and methodologies inspired by leading performance engineers.
Qualifications
Required:
- Deep expertise in Linux systems internals (kernel, I/O, networking, memory management) and performance tuning.
- Strong experience with AWS cloud services and their performance optimization techniques.
- Proficiency with performance analysis, load testing, and system tracing tools and frameworks.
- Hands-on experience with database tuning, query analysis, and indexing strategies.
- Expertise in GPU workload optimization and cloud-based GPU instances.
- Familiarity with message queuing systems, including performance tuning.
- Programming experience with a focus on profiling and tuning.
- Strong scripting skills (e.g., Python, Bash) to automate performance measurement and tuning workflows.
Preferred:
- Knowledge of distributed AI/ML training frameworks.
- Experience designing and scaling GPU workloads on Kubernetes using GPU-aware scheduling and resource isolation.
- Expertise in optimizing AI inference pipelines.
- Familiarity with Brendan Gregg's methodologies for systems analysis, such as the USE method (Utilization, Saturation, Errors) and workload characterization.
Benefits:
- Hybrid work environment
- Competitive salary
- Health, dental, and vision insurance
- 401(k) plan
- Opportunities for professional development and growth
- Generous vacation policy
Salary range: $180,000-$220,000