Welcome to the TuraLabs Data Engineering Bootcamp. By the end of this course you will be a well-seasoned Cloud Engineer. Throughout the course you will learn to use nearly all of Google Cloud's analytical services, including Google BigQuery, Dataflow, Dataproc, Cloud Pub/Sub, Cloud Run, Cloud AI, BigTable, and Cloud Composer.
We've specifically chosen Google Cloud since its services are the closest to their open-source Big Data counterparts, such as Apache Hadoop, Hive, HBase, Spark, Beam, Kafka, Airflow, and more. Learning the material in this course will also prepare you to become a seasoned Big Data developer.
Learning these skills on the Cloud will best enable you to enter the job market as a modern Data Engineer. Nowadays, nearly all startups and applications run on the Cloud and seek to create their competitive edge by exploring large amounts of data.
We hope you enjoy this course and that it teaches you all the skills necessary to start a rewarding career in Data Engineering.
This course starts covering mid- to high-level topics almost immediately. Therefore, we strongly recommend that learners have some experience with Python and SQL. For a more in-depth explanation of the expected prerequisites and a list of resources to bring you up to speed, please visit our blog post on Helpful Resources to Prep for this Course.
As you make your way through this course, we'd love to hear from you. This course is a passion project and the result of continuous edits and many late nights. Even though we've taken a lot of steps to ensure the highest quality, we're 100% sure that we can still improve things. While you're making your way through the course, please use the following channels:
We'd love to talk to you. Use our Slack Channel to get in touch with us, ask questions, or just geek out over a topic.
Please use GitHub Issues if you find any errors or have any recommendations for improving our material.
We look forward to speaking with you soon, and thanks in advance for engaging with us.
Now... roll up your sleeves, pull up your trousers, and let's get started. We promise it'll be a joyful ride!
To make the course fun and engaging, we've decided to do a bit of "role playing"! We know that you're probably staying up late or taking time away from your lunch or family to learn this stuff; you deserve to have a little fun while going through complex Data Engineering topics.
So let's pretend...
You just got hired by a large "tech startup" called D-Air! The founders thought it was a clever (!!!) way to combine the words Data and Air (🤮barfff, right?!). They're a hip travel agency like Kayak, "revolutionizing the airline industry" by developing an AI that negotiates the best airline deals on behalf of passengers. But in reality, they are developing the AI to jack up ticket prices as it learns passengers' preferences.
Why would you want to work for them, right?
Well, they pay pretty well, have the best coffee machine, ping pong tables, and the coolest office swag. Their t-shirts are top quality for wearing at the gym. They also run their tech on the latest Google Cloud technologies, so you figure it's a great place to sharpen your skills as a Junior Data Engineer despite its broken ethical compass.
Shortly after you start working there, your inner anarchist decides to teach them a lesson. Power to the people! We encourage and teach you how to plant a backdoor in the AI to fix the numbers in a way no one at D-Air will find out 😉️
This course is divided into 10 chapters, each with multiple episodes. Each chapter is designed to introduce an overarching Cloud technology like BigQuery, Dataflow, or Dataproc, and builds upon the lessons in previous chapters. We highly recommend that you follow the episodes chronologically as they are laid out here:
Chapter 1: Loading reference Dataset into BigQuery
This chapter will teach you to work with Google BigQuery, Cloud Storage, and Python Pandas. You will load reference data from CSV files into BigQuery and perform basic transformations using Pandas.
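To give a flavor of what this looks like, here is a minimal sketch of a Pandas transform step. The CSV content and column names below are hypothetical stand-ins for the course's reference data, and the BigQuery load itself (which needs GCP credentials) is only indicated in a comment.

```python
import io

import pandas as pd

# Hypothetical reference data; the course supplies its own CSV files.
csv_data = io.StringIO(
    "airport_code,city,elevation_ft\n"
    "LAX,Los Angeles,125\n"
    "JFK,New York,13\n"
    "SFO,San Francisco,\n"
)

df = pd.read_csv(csv_data)

# Basic transformations: normalize casing and fill missing values.
df["city"] = df["city"].str.upper()
df["elevation_ft"] = df["elevation_ft"].fillna(0).astype(int)

# Loading the cleaned frame into BigQuery would then use the official
# client library, e.g.:
#   from google.cloud import bigquery
#   bigquery.Client().load_table_from_dataframe(df, "my_dataset.airports")
print(df.to_dict("records"))
```

The chapter walks through the real dataset and the actual load step in detail; this only previews the shape of the work.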
Chapter 2: Loading Flights data using Apache Beam (Google Dataflow)
Learn to use a distributed processing engine: Apache Beam (Google Dataflow). Extract, transform, and load (ETL) 4 years of data into BigQuery using Apache Beam. You will write a distributed Beam application, write data into popular Big Data file formats (such as Parquet), and load it into BigQuery using External Tables.
Chapter 3: Processing Passengers using Apache Spark (Google Dataproc)
Advance your knowledge of distributed computing engines by learning to use Spark. Spark (Cloud Dataproc on Google Cloud) is a widely used tool to process large amounts of data. You will transform and load millions of rows of passenger data into Google BigQuery using Apache Spark. You will learn to use Spark DataFrames, PySpark, and PySpark SQL.
Chapter 4: Putting on our Data Architect Hat!
In this chapter, we develop a Data Architecture which will be implemented over the next few chapters. This chapter will teach us best practices in Big Data Cloud Architecture and provide an in-depth view into the reasons behind our architectural choices: a discussion of "why" and "when" we should use each technology.
Chapter 5: Real-time Stream Processing of Live Flight Queries with Cloud Pub/Sub
We cover real-time data processing by ingesting live streams of website logs via Cloud Pub/Sub (Apache Kafka) and Cloud Dataflow (Apache Beam).
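The core Pub/Sub semantics (publish to a topic, pull, acknowledge) can be sketched locally before touching GCP. The stand-in below uses only the standard library's `queue`; the real client is `google.cloud.pubsub_v1`, and the event fields are hypothetical.

```python
import queue

# Local stand-in for a Pub/Sub topic; the real client is
# google.cloud.pubsub_v1 (PublisherClient / SubscriberClient).
topic = queue.Queue()


def publish(message: dict):
    # Pub/Sub equivalent: publisher.publish(topic_path, data=...)
    topic.put(message)


def pull(max_messages: int):
    # Pub/Sub equivalent: subscriber.pull(...) then
    # subscriber.acknowledge(...) for each delivered message.
    messages = []
    while len(messages) < max_messages and not topic.empty():
        messages.append(topic.get())
        topic.task_done()  # the "ack": message won't be redelivered
    return messages


# Hypothetical website-log events like those ingested in this chapter.
publish({"page": "/flights/LAX-JFK", "user": "u123"})
publish({"page": "/flights/SFO-ORD", "user": "u456"})
print(pull(10))
```

The chapter replaces this toy queue with a real Pub/Sub topic and attaches a Dataflow (Beam) pipeline as the consumer.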
Chapter 6: Registering Ticket Sales with Google BigTable
Develop an OLTP system to monitor live ticket sales. We will learn to process transactional data using Google BigTable (Apache HBase), Cloud Pub/Sub (Apache Kafka), and Cloud Dataflow (Apache Beam).
Chapter 7: Advanced Analytics using BigQuery
Perform advanced analytics using Google BigQuery, preparing intelligence for the AI we develop in the next chapter.
Chapter 8: Building an A/I with BigQuery ML (Machine Learning)
In this chapter, we build the evil price-gouging AI on top of our complex data pipelines: a continuously running AI that updates ticket prices based on live supply & demand.
Chapter 9: Pipeline Automation with Cloud Composer (Apache Airflow)
Pipeline automation, monitoring, and metrics with Cloud Composer (Apache Airflow); this chapter creates the glue that holds everything together.
Chapter 10: Creating a Data Hub, Exporting Data via Google App Engine (Python Flask)
Create a Data Hub and expose our AI data via REST API. This chapter builds a fully Data-Driven Backend.
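A REST API on App Engine's Python runtime is typically just a Flask app. The sketch below shows the shape of such an endpoint; the route path, field names, and fare values are hypothetical placeholders, not the course's actual API.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical AI output; chapter 10 serves real pipeline results,
# e.g. read from BigQuery or BigTable, instead of this dict.
FARES = {"LAX-JFK": 342.50, "SFO-ORD": 198.00}


@app.route("/api/fares/<route>")
def get_fare(route):
    """Return the current fare for a route as JSON."""
    if route not in FARES:
        return jsonify({"error": "unknown route"}), 404
    return jsonify({"route": route, "fare": FARES[route]})


if __name__ == "__main__":
    app.run()  # on App Engine, a WSGI server runs `app` instead
```

App Engine standard can serve this directly once an `app.yaml` points at the Flask `app` object, which is the deployment path this chapter covers.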