Site Reliability Engineering for Distributed Systems Virtual Internship
In this virtual internship, students will develop skills in designing and operating highly available and scalable distributed systems. They will learn to implement load balancing, service discovery, and fault tolerance mechanisms to ensure the reliability and resilience of complex, cloud-native applications. Upon completion, students will be equipped to take on roles as Site Reliability Engineers, responsible for building and maintaining mission-critical infrastructure.
Track Overview
Tasks & Milestones
Exploring SRE Principles and Practices
Intermediate
In this task, students will research and summarize the key principles and practices of Site Reliability Engineering, including the SRE approach to incident management, error budgets, and the role of automation.
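As a rough illustration of the error budget concept named in this task, the sketch below turns an availability target into a count of tolerated failures and reports how much of that allowance has been used. The function name, field names, and sample figures are hypothetical and chosen only for illustration; they are not part of the internship materials.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    allowed_failures: float   # failures the SLO tolerates in the window
    remaining: float          # allowed failures not yet consumed
    consumed_fraction: float  # share of the budget already spent


def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> ErrorBudget:
    """Compute a request-based error budget for one measurement window.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests < 0 or failed_requests < 0:
        raise ValueError("request counts must be non-negative")

    # The budget is the share of requests the SLO permits to fail.
    allowed = (1.0 - slo_target) * total_requests
    remaining = allowed - failed_requests
    # With no traffic there is no budget to consume.
    consumed = failed_requests / allowed if allowed > 0 else 0.0
    return ErrorBudget(allowed, remaining, consumed)


if __name__ == "__main__":
    budget = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"allowed failures: {budget.allowed_failures:.0f}")
    print(f"remaining budget: {budget.remaining:.0f}")
    print(f"budget consumed:  {budget.consumed_fraction:.0%}")
```

In this example a 99.9% target over one million requests tolerates about 1,000 failures, so 250 failures use roughly a quarter of the budget. Teams commonly slow down risky releases as the consumed share approaches 100%.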
Implementing Load Balancing and Service Discovery
Intermediate
In this task, students will design and implement a load balancing and service discovery solution for a distributed application, using tools like Kubernetes and Consul.
Developing Fault-Tolerant Distributed Applications
Intermediate
In this task, students will design and implement fault-tolerance mechanisms for a distributed application, including circuit breakers, retries, and fallbacks.
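To make the three mechanisms named in this task concrete, here is a minimal sketch of how a circuit breaker, retries with exponential backoff, and a fallback can fit together. The class and function names, thresholds, and timings are assumptions made for illustration; a production system would more likely rely on an established resilience library or a service mesh.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so the timing can be tested
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def call_with_resilience(breaker, func, fallback, retries=2, backoff=0.1, sleep=time.sleep):
    """Try func through the breaker, retrying with backoff, then fall back."""
    for attempt in range(retries + 1):
        try:
            return breaker.call(func)
        except CircuitOpenError:
            break                    # do not retry against an open circuit
        except Exception:
            if attempt < retries:
                sleep(backoff * 2 ** attempt)   # exponential backoff
    return fallback()


if __name__ == "__main__":
    def unreliable():
        raise ConnectionError("upstream unavailable")

    breaker = CircuitBreaker(failure_threshold=3, reset_timeout=5.0)
    print(call_with_resilience(breaker, unreliable, lambda: "cached response", backoff=0.01))
```

The breaker stops sending traffic to a dependency that keeps failing, the retries absorb brief glitches, and the fallback keeps the caller responsive with a degraded answer.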
Implementing Monitoring and Observability
Intermediate
In this task, students will design and implement a monitoring and observability solution for a distributed application, using tools like Prometheus and Grafana.
Implementing Incident Management Workflows
Intermediate
In this task, students will design and implement an incident management workflow for a distributed application, incorporating automation and on-call rotations.
Prerequisites
- Familiarity with cloud computing concepts and platforms
- Experience with containerization and container orchestration tools (e.g., Docker, Kubernetes)
Certificate
Certificate of Completion
Earn a certificate upon successful completion