<aside> ☝
List of notes for this specialization + Lecture notes & Repository & Quizzes + Home page on Coursera. Read this note alongside the lecture notes—some points aren't mentioned here as they're already covered in the lecture notes.
</aside>
Recall: As a DE, you get raw data somewhere → turn it into something useful → make it available for downstream use cases.
Recall all the labs we’ve done so far:
Course 1 Week 2 Lab: ingest data from RDS into S3 using Glue ETL jobs.
Course 1 Week 4 Lab: ingest data from Kinesis Data Streams and use Kinesis Data Firehose to deliver events to an S3 bucket.
Course 2 Week 1 final lab: troubleshooting some common connection issues when connecting to a database.
Plan for this week:
Data you’re working with is unbounded (a continuous stream of events) - the stream has no particular beginning or end.
If we ingest events individually, one at a time → streaming ingestion.
If we impose some boundaries and ingest all the data within those boundaries → batch ingestion.
Boundaries can be imposed in different ways, e.g., by time interval or by batch size.
→ The more we increase the ingestion frequency, the closer batch ingestion gets to streaming ingestion.
Which one to use depends on the use case and the source system.
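To make the batch-vs-streaming distinction concrete, here is a minimal sketch (plain Python, not from the labs) that imposes size-based boundaries on an unbounded event iterator; the `batch_size` parameter and the event values are made up for illustration:

```python
from itertools import islice

def batches(events, batch_size):
    """Group an (unbounded) iterator of events into fixed-size batches.

    batch_size == 1 is effectively streaming ingestion: each event
    is handed downstream on its own.
    """
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# An "unbounded" stream, truncated here for demonstration.
stream = iter(range(7))
print(list(batches(stream, 3)))  # → [[0, 1, 2], [3, 4, 5], [6]]
```

Shrinking `batch_size` (or the time window, in a time-based variant) is exactly the "increase the frequency" direction above.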
Ways to ingest data from databases:
Connectors (JDBC/ODBC API) ← Lab 1 of this week.
Ingestion Tool (AWS Glue ETL)
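The connector approach above follows the standard DB-API pattern: open a connection, pull rows, hand them downstream. The lab uses a JDBC connection to RDS via Glue; here, as a stand-in sketch, an in-memory sqlite database plays the source role, and the `orders` table and its columns are made up:

```python
import sqlite3

# Stand-in for the source database (the lab connects to RDS instead).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

# Connector pattern: query the source, then pass the rows to the next
# stage (e.g., write them to S3, as the Glue ETL job does in the lab).
rows = conn.execute("SELECT id, amount FROM orders").fetchall()
print(rows)  # → [(1, 9.5), (2, 20.0)]
conn.close()
```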
Ways to ingest data from files: use a secure file transfer protocol like SFTP or SCP.
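As a small illustration of the SCP option (an assumption, not the labs' setup): build the `scp` invocation that would pull a file from a remote host. The user, host, and paths below are hypothetical:

```python
import shlex

def scp_pull_command(user: str, host: str, remote_path: str, local_path: str) -> list:
    """Build the scp invocation that copies a remote file locally."""
    return ["scp", f"{user}@{host}:{remote_path}", local_path]

cmd = scp_pull_command("etl", "files.example.com", "/data/export.csv", "./export.csv")
print(shlex.join(cmd))  # → scp etl@files.example.com:/data/export.csv ./export.csv
```

In practice you would run this with `subprocess.run(cmd)` (or use an SFTP client library) on a schedule, which makes file-based ingestion a batch pattern.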
Ways to ingest data from streaming systems: choose batch or streaming ingestion, or set up a message queue. ← Lab 2 of this week.
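For the message-queue option, here is a minimal in-process sketch using the stdlib `queue` module (in practice a managed service like Kinesis or a broker like Kafka plays this role); the event names and `max_batch` parameter are illustrative:

```python
import queue

# The queue decouples the producer (source system) from the consumer
# (ingestion job): the consumer drains it at its own pace.
q = queue.Queue()
for event in ["click", "view", "purchase"]:
    q.put(event)  # producer side

def drain(q, max_batch):
    """Consumer side: pull up to max_batch events without blocking."""
    batch = []
    while len(batch) < max_batch and not q.empty():
        batch.append(q.get_nowait())
    return batch

print(drain(q, 2))  # → ['click', 'view']
print(drain(q, 2))  # → ['purchase']
```

Note the consumer here is doing micro-batches off the queue; with `max_batch=1` consumed in a loop, the same setup behaves like streaming ingestion.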