
Why ingestion pipelines feed the data lake in Hadoop - Challenge Your Understanding

Challenge - 5 Problems
🧠 Conceptual
intermediate
Purpose of Ingestion Pipelines in Data Lakes

Why do ingestion pipelines feed data lakes in big data systems?

A. To convert all data into a single format before storing it
B. To immediately clean and transform data before storing it in a database
C. To delete old data from the data lake to save storage space
D. To collect and store raw data from multiple sources for flexible analysis later
💡 Hint

Think about why raw data storage is useful for future analysis.

🧠 Conceptual
intermediate
Data Lake vs Data Warehouse Feeding

What is a key reason ingestion pipelines feed data lakes instead of data warehouses?

A. Data lakes are slower to access than data warehouses
B. Data lakes only store images, while data warehouses store text data
C. Data lakes store raw data, while data warehouses require cleaned and structured data
D. Data lakes automatically analyze data, data warehouses do not
💡 Hint

Consider the difference in data format requirements between lakes and warehouses.

📊 Data Output
advanced
Output of a Hadoop Ingestion Pipeline

Given a Hadoop ingestion pipeline that collects JSON logs from multiple servers and stores them in HDFS, what is the typical output format stored in the data lake?

Python
import json
logs = ['{"user":"alice","action":"login"}', '{"user":"bob","action":"logout"}']
parsed_logs = [json.loads(log) for log in logs]
# What is stored in HDFS?
A. A single CSV file with columns user and action
B. [{"user": "alice", "action": "login"}, {"user": "bob", "action": "logout"}] as raw JSON files
C. A SQL database table with user and action columns
D. An Excel spreadsheet with user and action data
💡 Hint

Think about how Hadoop stores raw data from ingestion pipelines.

🔧 Debug
advanced
Identify the Error in a Hadoop Ingestion Script

What error will this Hadoop ingestion Python script raise?

Python
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Ingest').getOrCreate()
data = spark.read.json('hdfs://logs/*.json')
data.write.csv('hdfs://output/logs.csv')
spark.stop()
A. TypeError because write.csv expects a directory, not a file path
B. AnalysisException because the output path 'hdfs://output/logs.csv' already exists
C. FileNotFoundError because the 'hdfs://logs/*.json' path is invalid
D. No error; the script runs successfully and writes CSV files
💡 Hint

Check the expected argument type for write.csv in Spark.

🚀 Application
expert
Designing an Efficient Ingestion Pipeline for a Data Lake

You need to design an ingestion pipeline that feeds a data lake with streaming sensor data. Which approach best ensures data is stored quickly and can be processed later in different ways?

A. Use a streaming system like Apache Kafka to collect raw sensor data and store it directly in the data lake as JSON files
B. Transform sensor data into a fixed schema and load it immediately into a relational database
C. Aggregate sensor data in real time and store only summary statistics in the data lake
D. Store sensor data temporarily in local files and batch-upload them to the data lake once a day
💡 Hint

Consider speed and flexibility of raw data storage for future analysis.
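The problems above all turn on one idea: an ingestion pipeline lands records in the lake exactly as received, and interpretation happens later. A minimal sketch of that pattern, using a local directory as a stand-in for HDFS or object storage (the paths, file layout, and `ingest_raw` helper here are illustrative assumptions, not part of any specific Hadoop deployment):

```python
import json
import tempfile
from pathlib import Path

def ingest_raw(records, lake_dir, source="sensor-01"):
    """Append raw JSON lines to a per-source file in the lake directory,
    without transforming or validating the payloads."""
    out = Path(lake_dir) / source / "events.jsonl"
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("a") as f:
        for rec in records:
            f.write(rec + "\n")  # stored exactly as received
    return out

# Local stand-in for a data lake; a real pipeline would target
# an hdfs:// or s3:// path instead.
lake = tempfile.mkdtemp()
path = ingest_raw(
    ['{"user":"alice","action":"login"}',
     '{"user":"bob","action":"logout"}'],
    lake,
)

# Analysis happens later, in whatever shape a consumer needs.
parsed = [json.loads(line) for line in path.read_text().splitlines()]
print(parsed[0]["user"])  # alice
```

Because nothing is aggregated or forced into a schema at write time, the same raw files can later feed a Spark job, a SQL engine, or an ad-hoc script.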