Apache Spark Explained

Parham Parvizi

Software Architect

A quick deep dive into Apache Spark, one of the most popular distributed data engineering tools.

What is Spark? Why is it so popular? When and how should you use it?

Learn the differences between the subcomponents (RDDs, DataFrames, SQL, Streaming, ...), set up PySpark, and learn how to write Spark transformations using Python and Jupyter Notebook.

What is Data Engineering? How is it different from Data Science?

Parham Parvizi

Software Architect

Although the terms Data Engineering and Data Science are often used interchangeably, the two groups differ distinctly in their responsibilities and skills.

In short, Data Engineering is the practice of gathering and preparing the data that data science models and algorithms use.

Intro to Pandas

Parham Parvizi

Software Architect

Python Pandas is the #1 tool in a Data Engineer's or Data Scientist's toolbox. It allows you to read and write data in a large variety of file formats, and it provides extensive built-in functionality to aggregate, join, filter, and transform datasets with high performance. Pandas is the fastest and easiest tool to extract, transform, and load (ETL) datasets that fit in memory and can be processed by a single machine.

This lesson will teach you the basic pandas Data Engineering skills.
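
As a taste of what such a workflow can look like, here is a minimal sketch of a small in-memory ETL step with pandas: read, filter, join, aggregate, and write. The two tiny datasets, their column names, and the 10.0 threshold are made up for illustration and are not taken from the lesson itself.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV data standing in for real input files.
orders_csv = io.StringIO(
    "order_id,customer_id,amount\n"
    "1,10,25.0\n"
    "2,11,40.0\n"
    "3,10,15.5\n"
    "4,12,5.0\n"
)
customers_csv = io.StringIO(
    "customer_id,name\n"
    "10,Ada\n"
    "11,Grace\n"
    "12,Linus\n"
)

# Extract: read both datasets into DataFrames.
orders = pd.read_csv(orders_csv)
customers = pd.read_csv(customers_csv)

# Transform: filter out small orders, join customer names, aggregate per customer.
large_orders = orders[orders["amount"] >= 10.0]
joined = large_orders.merge(customers, on="customer_id", how="inner")
totals = (
    joined.groupby("name", as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "total_amount"})
    .sort_values("name")
    .reset_index(drop=True)
)

# Load: write the result as CSV (to an in-memory buffer here, not a real file).
out = io.StringIO()
totals.to_csv(out, index=False)
result_csv = out.getvalue()

print(totals)
```

In a real pipeline the `io.StringIO` buffers would typically be replaced by file paths or other sources that `pd.read_csv` and `DataFrame.to_csv` accept.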

Helpful Resources to Prep for this Course

Jennifer Batara

Software Architect

Thanks for your interest in learning more about Data Engineering! We've compiled a list of prerequisites that you should be familiar with before starting this course. The course starts at an intermediate level and assumes you have a general understanding of computers. We don't require much, but we do need you to know Python and SQL well. We will rely heavily on those skills and build upon them throughout the course.

Copyright 2020 TuraLabs