Cloud Site Reliability Engineer
Primary Focus Areas: Cloud Infrastructure, System & Network Administration, Monitoring, Governance, Risk, and Compliance
Position Level: P4 - Advanced
Work Location: McLean, VA or Wilmington, NC
Overview
We are looking for a skilled and driven cloud-focused Site Reliability Engineer to help ensure the performance, scalability, and reliability of our AWS-based systems. In this role, you'll work closely with engineering and operations teams to improve system stability and developer efficiency through automation, monitoring, and incident management.
Key Responsibilities
Build and manage robust, scalable, and secure AWS-based cloud infrastructure.
Create and support monitoring systems, alert mechanisms, and dashboards to ensure uptime and service health.
Use infrastructure-as-code tools like Terraform (with Terragrunt), CDK, and CloudFormation to automate provisioning and configuration.
Set up and manage CI/CD workflows to facilitate efficient code deployment and enhance development processes.
Take ownership of incident resolution, conduct thorough root cause analysis, and develop long-term solutions to recurring problems.
Partner with engineering teams to fine-tune performance, bolster reliability, and manage cloud costs effectively.
Promote operational excellence and guide architectural decisions for infrastructure enhancements.
Develop and maintain disaster recovery strategies to guarantee system continuity in crisis scenarios.
Education Requirements
Required: Bachelor's degree in Computer Science
Preferred: Master's degree in Computer Science or related field
Experience Requirements
Required: Minimum of 5 years of relevant experience
Preferred: 8 years of experience in a similar role
Core Competencies
Required Skills:
Experience with Argo CD and Argo Workflows
Proficiency in infrastructure-as-code: Terraform and Terragrunt
Kubernetes and Linkerd knowledge
In-depth experience with AWS services (EKS, Fargate, Aurora)
Strong background in security and compliance
Containerization tools such as Docker
Monitoring and logging technologies
Scripting or programming language proficiency
Database administration
Source control using Git
Hands-on experience in incident response and management
Preferred Skills:
Familiarity with Datadog
Knowledge of Cloudflare services
Understanding of the mortgage industry
Advanced cloud security tools (e.g., GuardDuty, Security Hub)
Disaster recovery strategy experience
Experience with automation tools like Camunda
Certifications
Required: AWS Certified Solutions Architect - Associate
Preferred: AWS Certified DevOps Engineer - Professional