Streaming Data Pipelines with Apache Spark Virtual Internship
In this advanced virtual internship, students learn to build real-time data processing pipelines with Apache Spark Structured Streaming. They gain hands-on experience ingesting data from sources such as Apache Kafka and AWS Kinesis, building scalable and fault-tolerant pipelines, and implementing streaming analytics such as windowing, sessionization, anomaly detection, and real-time model inference. On completion, students will be able to design and implement robust streaming data solutions for real-world applications.
Tasks & Milestones
Set up a Spark Structured Streaming Development Environment
Advanced
In this task, students will set up a development environment for working with Apache Spark Structured Streaming, including installing the necessary software and configuring their development tools.
Ingest and Process Streaming Data from Apache Kafka
Advanced
In this task, students will learn how to ingest data from an Apache Kafka cluster and perform basic processing using Spark Structured Streaming.
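As a rough illustration of what Kafka ingestion can look like, the sketch below uses Spark's built-in `kafka` source. The helper names `kafka_source_options` and `read_kafka_stream`, the broker address, and the topic name are assumptions made for this example and are not part of the course materials. It also assumes that an existing `SparkSession` is passed in and that the `spark-sql-kafka-0-10` connector package is available to the Spark application.

```python
def kafka_source_options(bootstrap_servers, topic, starting_offsets="latest"):
    """Build the option map for Spark's Kafka source (hypothetical helper)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


def read_kafka_stream(spark, options):
    """Return a streaming DataFrame with string key and value columns.

    `spark` is assumed to be an existing SparkSession with the Kafka
    connector on its classpath; nothing is read until a query is started.
    """
    raw = spark.readStream.format("kafka").options(**options).load()
    # Kafka delivers key and value as binary, so cast them for basic processing.
    return raw.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "timestamp",
    )


# Placeholder broker address and topic name, chosen only for illustration.
options = kafka_source_options("localhost:9092", "events")
```

A caller would pass `options` to `read_kafka_stream(spark, options)` and then attach a sink with `writeStream` to start the query.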
Ingest and Process Streaming Data from AWS Kinesis
Advanced
In this task, students will learn how to ingest data from an AWS Kinesis stream and perform basic processing using Spark Structured Streaming.
Implement Windowing Operations on Streaming Data
Advanced
In this task, students will learn how to use windowing operations to perform time-based analysis on streaming data.
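To make the idea of a time window concrete, here is a minimal plain-Python sketch of tumbling (fixed-size, non-overlapping) windows. It is not Spark code: the function names and sample timestamps are invented for this example, and event times are assumed to be integer seconds. In Spark itself, a comparable grouping is typically expressed with the `window` function from `pyspark.sql.functions`.

```python
from collections import Counter


def tumbling_window_start(event_time, window_seconds):
    """Return the start of the fixed-size window that contains event_time."""
    return event_time - (event_time % window_seconds)


def count_per_window(event_times, window_seconds):
    """Count events per tumbling window, keyed by window start time."""
    counts = Counter(
        tumbling_window_start(t, window_seconds) for t in event_times
    )
    return dict(sorted(counts.items()))


# Sample event times in seconds, grouped into 10-second windows.
events = [3, 7, 12, 14, 15, 31]
window_counts = count_per_window(events, 10)
print(window_counts)  # {0: 2, 10: 3, 30: 1}
```

Each event falls into exactly one window, which is what distinguishes tumbling windows from sliding windows, where an event can belong to several overlapping windows.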
Implement Advanced Aggregations on Streaming Data
Advanced
In this task, students will learn how to perform advanced aggregations, such as sessionization and anomaly detection, on streaming data using Spark Structured Streaming.
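Sessionization groups events that occur close together in time into one session. The plain-Python sketch below illustrates the gap-based rule under simple assumptions: event times are integer seconds for a single user, and a new session starts whenever the gap since the previous event is at least `gap_seconds`. The function name and sample data are made up for this example. Recent Spark versions offer a `session_window` function in `pyspark.sql.functions` for the streaming equivalent.

```python
def sessionize(event_times, gap_seconds):
    """Split event times into sessions separated by gaps of gap_seconds or more."""
    sessions = []
    for t in sorted(event_times):
        # Extend the current session if this event is close enough to the last one.
        if sessions and t - sessions[-1][-1] < gap_seconds:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions


# Sample event times in seconds with a 30-second inactivity gap.
sessions = sessionize([1, 5, 40, 42, 100], 30)
print(sessions)  # [[1, 5], [40, 42], [100]]
```

Unlike a fixed-size window, a session has no predetermined length: it keeps growing as long as events continue to arrive within the gap.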
Implement Real-time Inference with Streaming Data and Machine Learning
Advanced
In this task, students will learn how to integrate a pre-trained machine learning model into a Spark Structured Streaming pipeline for real-time inference on streaming data.
Prerequisites
- Proficiency in Python or Scala
- Experience with distributed systems and data processing frameworks
- Familiarity with relational databases and NoSQL data stores
Certificate
Certificate of Completion
Earn a certificate upon successful completion