<aside> ☝
List of notes for this specialization + Lecture notes & Repository & Quizzes + Home page on Coursera. Read this note alongside the lecture notes—some points aren't mentioned here as they're already covered in the lecture notes.
</aside>
This week explores advanced data transformation frameworks beyond Pandas to address scalability and performance needs.
This week's plan:
Batch transformation
Streaming transformation
A lot of a DE's work is batch processing.
Some transformation patterns:
We've already considered the first two, ETL (Course 1) and ELT (Course 4), with Spark and dbt.
Data Wrangling → use tools like AWS Glue DataBrew.
Transforming data for updates
Approach 1: Truncate and Reload ← only suitable for small datasets.
Approach 2: CDC (Change Data Capture): identify the changes → insert / update / delete (see the sketch after this list).
Single-row inserts → fine for a row-oriented database, but not for an OLAP system!
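A minimal PySpark sketch of both approaches, assuming hypothetical paths and column names (`order_id`, `changed_at`, `op`); the point is that the CDC batch is applied in one set-based pass rather than single-row statements:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("cdc-batch-apply").getOrCreate()

# Hypothetical paths and columns, for illustration only.
target = spark.read.parquet("s3a://my-bucket/orders/")        # current table state
changes = spark.read.parquet("s3a://my-bucket/orders_cdc/")   # CDC batch with op = I/U/D

# Approach 1: truncate and reload = overwrite the whole table from a fresh extract, e.g.
# full_extract.write.mode("overwrite").parquet("s3a://my-bucket/orders/")

# Approach 2: apply only the changes, as one set-based pass.
# Keep the latest change per key within the batch.
latest = Window.partitionBy("order_id").orderBy(F.col("changed_at").desc())
latest_changes = (
    changes.withColumn("rn", F.row_number().over(latest))
           .filter("rn = 1")
           .drop("rn")
)

# Rows in the target with no change stay as they are.
untouched = target.join(latest_changes.select("order_id"), "order_id", "left_anti")

# Inserts and updates are appended; deleted rows are simply not carried over.
upserts = latest_changes.filter(F.col("op") != "D").drop("op", "changed_at")

new_state = untouched.unionByName(upserts)
new_state.write.mode("overwrite").parquet("s3a://my-bucket/orders_v2/")
```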
Google developed GFS (2003) and MapReduce (2004) for distributed data processing, leading to Yahoo's creation of Hadoop in 2006.
HDFS (combines compute and storage on the same nodes) vs Object Storage (limited compute support for internal processing)
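A small sketch of how the two look from Spark, with hypothetical paths: only the URI scheme changes, but with HDFS the processing runs on the same nodes that store the blocks, while an object store only serves bytes over the network.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-sketch").getOrCreate()

# Same read API either way; the difference is where compute sits relative to the data.
df_hdfs = spark.read.parquet("hdfs://namenode:9000/datalake/events/")  # hypothetical cluster path
df_s3 = spark.read.parquet("s3a://my-bucket/datalake/events/")         # hypothetical bucket path

print(df_hdfs.count(), df_s3.count())
```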
HDFS
MapReduce:
→ Weakness: intermediate results are written to disk between stages → Spark keeps data in RAM (see the sketch below)!
Spark is written natively in Scala (with APIs for Python, Java, SQL, and R).
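A tiny word-count sketch in PySpark (the input path is a placeholder) showing the map/reduce pattern, with in-memory caching in place of MapReduce's disk writes between jobs:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

# Classic MapReduce-style word count expressed in PySpark.
lines = sc.textFile("input.txt")  # placeholder path

counts = (
    lines.flatMap(lambda line: line.split())   # map: one record per word
         .map(lambda word: (word, 1))          # map: (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)      # reduce: sum counts per word
)

# Unlike Hadoop MapReduce, which writes intermediate results to disk between jobs,
# Spark can keep them in memory and reuse them across actions.
counts.cache()
print(counts.take(10))
print(counts.count())
```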