<aside> ☝
List of notes for this specialization + Lecture notes & Repository & Quizzes + Home page on Coursera. Read this note alongside the lecture notes—some points aren't mentioned here as they're already covered in the lecture notes.
</aside>
This week explores advanced data transformation frameworks beyond Pandas to address scalability and performance needs.
This week's plan:
Batch transformation
Streaming transformation
A lot of a DE's work is batch processing.
Some transformation patterns:
We've already considered the first two, ETL (Course 1) and ELT (Course 4), with Spark and dbt.
Data Wrangling → use tools like AWS Glue DataBrew.
Transforming data for updates
Approach 1: Truncate and Reload ← only suitable for small datasets.
Approach 2: CDC (Change Data Capture): identify the changes → insert / update / delete (see the sketch after this list).
Single-row inserts → fine for a row-oriented database, but not for an OLAP system!
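A minimal PySpark sketch of both approaches, assuming hypothetical paths and column names (`order_id`, `changed_at`, `op`); the point is that the CDC batch is applied in one set-based pass rather than single-row statements:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("cdc-batch-apply").getOrCreate()

# Hypothetical paths and columns, for illustration only.
target = spark.read.parquet("s3a://my-bucket/orders/")        # current table state
changes = spark.read.parquet("s3a://my-bucket/orders_cdc/")   # CDC batch with op = I/U/D

# Approach 1: truncate and reload = overwrite the whole table from a fresh extract, e.g.
# full_extract.write.mode("overwrite").parquet("s3a://my-bucket/orders/")

# Approach 2: apply only the changes, as one set-based pass.
# Keep the latest change per key within the batch.
latest = Window.partitionBy("order_id").orderBy(F.col("changed_at").desc())
latest_changes = (
    changes.withColumn("rn", F.row_number().over(latest))
           .filter("rn = 1")
           .drop("rn")
)

# Rows in the target with no change stay as they are.
untouched = target.join(latest_changes.select("order_id"), "order_id", "left_anti")

# Inserts and updates are appended; deleted rows are simply not carried over.
upserts = latest_changes.filter(F.col("op") != "D").drop("op", "changed_at")

new_state = untouched.unionByName(upserts)
new_state.write.mode("overwrite").parquet("s3a://my-bucket/orders_v2/")
```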
Google developed GFS (2003) and MapReduce (2004) for distributed data processing, leading to Yahoo's creation of Hadoop in 2006.
HDFS (combines compute and storage on the same nodes) vs Object Storage (limited compute support for internal processing)
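A small sketch of how the two look from Spark, with hypothetical paths: only the URI scheme changes, but with HDFS the processing runs on the same nodes that store the blocks, while an object store only serves bytes over the network.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-sketch").getOrCreate()

# Same read API either way; the difference is where compute sits relative to the data.
df_hdfs = spark.read.parquet("hdfs://namenode:9000/datalake/events/")  # hypothetical cluster path
df_s3 = spark.read.parquet("s3a://my-bucket/datalake/events/")         # hypothetical bucket path

print(df_hdfs.count(), df_s3.count())
```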
HDFS
MapReduce:
→ Weakness: intermediate results are written to disk between stages → Spark keeps data in RAM (see the sketch below)!
Spark is written natively in Scala (with APIs for Python, Java, SQL, and R).
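A tiny word-count sketch in PySpark (the input path is a placeholder) showing the map/reduce pattern, with in-memory caching in place of MapReduce's disk writes between jobs:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

# Classic MapReduce-style word count expressed in PySpark.
lines = sc.textFile("input.txt")  # placeholder path

counts = (
    lines.flatMap(lambda line: line.split())   # map: one record per word
         .map(lambda word: (word, 1))          # map: (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)      # reduce: sum counts per word
)

# Unlike Hadoop MapReduce, which writes intermediate results to disk between jobs,
# Spark can keep them in memory and reuse them across actions.
counts.cache()
print(counts.take(10))
print(counts.count())
```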