Why do ingestion pipelines feed data lakes in big data systems?
Think about why raw data storage is useful for future analysis.
Ingestion pipelines feed data lakes to gather raw data from many sources. This raw data is stored as-is, allowing flexible and varied analysis later without losing original details.
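A minimal sketch of this idea (paths, partition layout, and field names are illustrative, and a local temp directory stands in for HDFS or S3): each payload is appended to the lake exactly as received, under a date partition, with no parsing or cleaning.

```python
import json
import os
import tempfile
from datetime import date

def land_raw_events(events, lake_root):
    # Append each payload to the lake byte-for-byte, under a
    # date partition; schema is applied later, on read.
    partition = os.path.join(lake_root, f"dt={date.today().isoformat()}")
    os.makedirs(partition, exist_ok=True)
    path = os.path.join(partition, "events.jsonl")
    with open(path, "a", encoding="utf-8") as f:
        for event in events:
            f.write(event.rstrip("\n") + "\n")
    return path

lake_root = tempfile.mkdtemp()  # stand-in for a real lake path
raw = ['{"user":"alice","action":"login"}', '{"sensor":"s1","temp_c":21.4}']
out_path = land_raw_events(raw, lake_root)
stored = open(out_path, encoding="utf-8").read().splitlines()
```

Because nothing is dropped or reshaped at write time, any future job can reparse the stored lines with whatever schema it needs.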
What is a key reason ingestion pipelines feed data lakes instead of data warehouses?
Consider the difference in data format requirements between lakes and warehouses.
Data lakes accept raw, unstructured data from ingestion pipelines, while data warehouses need data to be cleaned and structured before loading.
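A toy contrast of the two loading styles (the column names and payload are illustrative): the lake keeps the payload untouched, while a warehouse load enforces a fixed schema first and drops fields outside it.

```python
import json

raw_event = ('{"user":"alice","ts":"2024-05-01T10:00:00Z",'
             '"action":"login","extra":{"ip":"10.0.0.1"}}')

# Data lake: store the payload as-is; structure is imposed on read.
lake_record = raw_event

# Data warehouse: a fixed schema is enforced before loading.
WAREHOUSE_COLUMNS = ("user", "ts", "action")

def to_warehouse_row(payload):
    # Reject payloads missing required columns; keep only schema fields.
    parsed = json.loads(payload)
    missing = [c for c in WAREHOUSE_COLUMNS if c not in parsed]
    if missing:
        raise ValueError(f"cannot load: missing columns {missing}")
    return tuple(parsed[c] for c in WAREHOUSE_COLUMNS)

row = to_warehouse_row(raw_event)
```

Note that the `extra` field survives in the lake record but is gone from the warehouse row; that lost detail is exactly what the lake preserves.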
Given a Hadoop ingestion pipeline that collects JSON logs from multiple servers and stores them in HDFS, what is the typical output format stored in the data lake?
import json

logs = ['{"user":"alice","action":"login"}', '{"user":"bob","action":"logout"}']
parsed_logs = [json.loads(log) for log in logs]
# What is stored in HDFS?
Think about how Hadoop stores raw data from ingestion pipelines.
Hadoop ingestion pipelines typically store raw JSON files directly in HDFS, preserving original data format for flexible processing later.
What error will this Hadoop ingestion Python script raise?
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Ingest').getOrCreate()
data = spark.read.json('hdfs://logs/*.json')
data.write.csv('hdfs://output/logs.csv')
spark.stop()
Check how Spark interprets the path passed to write.csv.
In Spark, write.csv treats the given path as an output directory, not a single file: the call creates a directory named logs.csv containing part files. Passing a file-like path raises no TypeError; the error you can hit is an AnalysisException, thrown when the output path already exists (the default save mode is errorifexists).
You need to design an ingestion pipeline that feeds a data lake with streaming sensor data. Which approach best ensures data is stored quickly and can be processed later in different ways?
Consider speed and flexibility of raw data storage for future analysis.
Stream the raw sensor readings through a system such as Kafka and land them unmodified in the data lake: ingestion stays fast, and every reading remains available for any kind of later processing.