👉 Lecture notes & Repository.
Cron: a command-line utility introduced in the 1970s. Used to execute a particular command at a specified date and time.
Syntax
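A minimal sketch of the schedule part of a crontab entry, assuming the standard five-field layout (minute, hour, day of month, month, day of week). The helper name `describe_cron` and the sample schedule are hypothetical, chosen only for illustration; the sketch labels the fields and does not validate their values.

```python
# Field names of a standard five-field crontab schedule, in order.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression: str) -> dict[str, str]:
    """Label each field of a five-field cron schedule (hypothetical helper)."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


# "0 2 * * *" means: minute 0 of hour 2, every day of every month (02:00 daily).
fields = describe_cron("0 2 * * *")
print(fields)
```

In a real crontab line, the command to run follows these five fields; `*` in a field means "every value".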
Example
Example: scheduling data pipeline with Cron (Pure scheduling approach)
Weakness: cron only triggers commands on a schedule and knows nothing about dependencies between steps, so if one step fails → the whole process fails
But cron is still useful for simple, repetitive tasks (e.g. regular data downloads) or in the prototyping phase (e.g. testing parts of your data pipeline)
Dataswarm (Facebook, late 2000s) → Oozie (2010s) → Airbnb’s Airflow (2014, open source) → Apache Airflow (2019)
Airflow is used by a lot of teams → worth knowing
Pros and Cons of Airflow
Others
DAG = Directed Acyclic Graph
“directed” → data flows only in one direction
“acyclic” → no circles or cycles
Dependencies: the previous tasks are required to finish before the next task starts
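The DAG idea above can be sketched with Python's standard-library `graphlib` module (Python 3.9+). This is only an illustration of ordering tasks by their dependencies and rejecting cycles, not how Airflow itself is implemented; the task names `extract`, `transform`, and `load` are assumptions.

```python
from graphlib import CycleError, TopologicalSorter

# Map each task to the set of tasks it depends on (hypothetical task names).
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# A valid execution order: every task appears after the tasks it depends on.
order = list(TopologicalSorter(dependencies).static_order())
print(order)

# Making "extract" depend on "load" closes a loop, so the graph is no longer acyclic.
cyclic = {**dependencies, "extract": {"load"}}
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print("cycle detected:", has_cycle)
```

With a cycle, no valid order exists, because every task in the loop would have to wait for another task in the same loop. That is why the graph must be acyclic.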
Basic concepts
Orchestration in Airflow
Writing tasks like this
Airflow UI
We can trigger runs based on time or on an event (e.g. a click)
We can set up data quality checks, such as checking for null values or the range of values…
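A minimal sketch of the two checks named above (null values and value ranges), written in plain Python rather than with any Airflow-specific operator. The helper `check_quality`, the `age` column, and the bounds are all hypothetical.

```python
def check_quality(rows, column, min_value, max_value):
    """Return a list of problems found in `column` (hypothetical helper)."""
    problems = []
    for i, row in enumerate(rows):
        value = row.get(column)
        if value is None:
            # Null check: the value is missing or explicitly None.
            problems.append(f"row {i}: {column} is null")
        elif not (min_value <= value <= max_value):
            # Range check: the value falls outside the allowed bounds.
            problems.append(
                f"row {i}: {column}={value} outside [{min_value}, {max_value}]"
            )
    return problems


rows = [{"age": 34}, {"age": None}, {"age": 250}]
problems = check_quality(rows, "age", 0, 120)
for problem in problems:
    print(problem)
```

In a pipeline, a check like this would typically run as its own task, so that downstream tasks start only when the list of problems is empty.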