Site Reliability Engineering (SRE) Virtual Internship
This comprehensive virtual internship track prepares students for a career as a Site Reliability Engineer (SRE). SREs are responsible for ensuring the reliability, availability, and scalability of complex distributed systems. Through a hands-on, project-based curriculum, students will learn to design, implement, and maintain highly available and fault-tolerant infrastructure, automate operational tasks, and use data-driven approaches to optimize system performance.
Tasks & Milestones
Implement Distributed Tracing for Microservices
Medium: Create a distributed tracing solution similar to those that companies like Google and Netflix use to monitor and debug their microservices-based applications.
Implement Canary Deployments for Production Releases
Medium: Create a canary deployment strategy for a production application, similar to the approaches used by companies like Amazon and Netflix to safely roll out new features and updates.
Implement Chaos Engineering for Resilient Systems
Medium: Create a chaos engineering solution to improve the resilience of a production-like system, similar to the approaches used by companies like Netflix and Google.
Implement Infrastructure as Code for a Scalable Web Application
Advanced: Create an Infrastructure as Code (IaC) solution to deploy and manage a scalable web application, similar to the approach used on cloud platforms such as Amazon Web Services (AWS) and Google Cloud Platform (GCP).
Automate Kubernetes Cluster Deployment and Management
Advanced: Create an automated solution to deploy and manage a Kubernetes cluster, similar to the approaches used by companies like Google and Netflix.
Implement Infrastructure Monitoring and Alerting
Advanced: Create a comprehensive infrastructure monitoring and alerting solution, similar to the approaches used by companies like Netflix and Google.
Implement Distributed Tracing for Microservices Observability
Medium: Create a distributed tracing solution similar to those that companies like Google and Amazon use to monitor and observe their complex microservices architectures.
Implement Metrics-Driven Observability for a Distributed System
Medium: Create a comprehensive metrics-driven observability solution for a distributed system, similar to the approaches used by companies like Netflix and Amazon.
Implement Log-Based Observability for a Microservices Architecture
Medium: Create a log-based observability solution for a microservices architecture, similar to the approaches used by companies like Amazon and Google.
Reliability Engineering and Incident Response Professional Project
Medium: Build a professional-grade reliability engineering and incident response solution using industry best practices.
Reliability Engineering and Incident Response Assessment Challenge
Medium: Demonstrate mastery of reliability engineering and incident response concepts through practical challenges.
Scalability and Optimization Professional Project
Medium: Build a professional-grade scalability and optimization solution using industry best practices.
Scalability and Optimization Assessment Challenge
Medium: Demonstrate mastery of scalability and optimization concepts through practical challenges.
Prerequisites
- Proficiency in a programming language (e.g., Python, Go, Java)
- Experience with Linux/Unix operating systems
- Understanding of web application architecture and distributed systems
- Familiarity with cloud computing platforms (e.g., AWS, GCP, Azure)
- Knowledge of the software development lifecycle and DevOps practices
Certificate
Certificate of Completion: awarded upon successful completion of the track.