WitnessAI
Site Reliability Engineer - Platform Engineering
Job Title: Site Reliability Engineer (SRE), Platform Engineering
About Us: WitnessAI is a leader in providing innovative networking solutions designed to enhance security, performance, and reliability for businesses of all sizes. We are seeking a highly skilled Site Reliability Engineer (SRE) with a strong background in Linux administration, AWS, and Kubernetes for our Platform Engineering team. The ideal candidate will help ensure the reliability, scalability, and performance of our systems while driving a culture of automation and continuous improvement.
Key Responsibilities
System Reliability & Operations
- Maintain and improve the reliability, availability, and performance of our services and infrastructure.
- Monitor system health, troubleshoot issues, and respond to incidents with a focus on reducing mean time to recovery (MTTR).
Infrastructure Management
- Administer and optimize Linux-based systems across development, staging, and production environments.
- Design and manage scalable, secure, and cost-effective solutions on AWS.
- Build, maintain, and monitor Kubernetes clusters to support containerized applications.
Automation & Tooling
- Develop and maintain CI/CD pipelines to streamline deployments.
- Automate operational tasks using tools such as Terraform, Crossplane, or custom scripts.
- Create and enhance monitoring, alerting, and logging systems to improve observability.
- Build ad-hoc, reusable automation solutions where required.
Collaboration & Best Practices
- Partner with engineering teams to integrate SRE principles into the software development lifecycle.
- Advocate for best practices in incident response, post-mortem reviews, and capacity planning.
- Share knowledge with team members and contribute to a culture of continuous improvement.
Security & Compliance
- Implement security best practices for cloud and containerized environments.
- Ensure compliance with organizational and industry standards.
Requirements
Technical Skills
- Proven expertise in Linux system administration (e.g., Ubuntu, CentOS, or similar).
- Deep understanding of AWS services and architecture (e.g., EC2, S3, RDS, VPC, IAM).
- Strong experience managing Kubernetes clusters in production.
- Hands-on experience with infrastructure-as-code tools such as Terraform or CloudFormation.
- Proficiency in scripting or programming languages (e.g., Python, Bash, or Go).
- Demonstrated experience in application development for backend automation solutions.
- 3+ years of experience as a Site Reliability Engineer, DevOps Engineer, or in a similar role at a SaaS or cloud-based company.
Operational Expertise
- Familiarity with monitoring and logging tools such as Prometheus, Grafana, ELK, or Datadog.
- Experience designing and maintaining CI/CD pipelines (e.g., Jenkins, GitLab CI, or CircleCI).
- Understanding of networking concepts (e.g., DNS, load balancing, firewalls).
Problem Solving & Collaboration
- Strong analytical and troubleshooting skills.
- Ability to work effectively in a collaborative, team-oriented environment.
- Excellent written and verbal communication skills.
Education
Bachelor’s degree in Computer Science, Engineering, or equivalent work experience.
Nice-to-Have Skills:
- Experience with service meshes and other CNCF technologies (e.g., Istio or Linkerd).
- Knowledge of database systems (e.g., MySQL, PostgreSQL, or NoSQL databases).
- Familiarity with cloud-native technologies and tools (e.g., Helm, ArgoCD, Spinnaker).
Benefits:
- Hybrid work environment.
- Competitive salary.
- Health, dental, and vision insurance.
- 401(k) plan.
- Opportunities for professional development and growth.
- Generous vacation policy.
Salary range:
$170,000-$200,000