Why do ingestion pipelines feed data lakes in big data systems?
Think about why raw data storage is useful for future analysis.
Ingestion pipelines feed data lakes to gather raw data from many sources. This raw data is stored as-is, allowing flexible and varied analysis later without losing original details.
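A minimal sketch of this idea (paths, partition layout, and field names are illustrative, and a local temp directory stands in for HDFS or S3): each payload is appended to the lake exactly as received, under a date partition, with no parsing or cleaning.

```python
import json
import os
import tempfile
from datetime import date

def land_raw_events(events, lake_root):
    # Append each payload to the lake byte-for-byte, under a
    # date partition; schema is applied later, on read.
    partition = os.path.join(lake_root, f"dt={date.today().isoformat()}")
    os.makedirs(partition, exist_ok=True)
    path = os.path.join(partition, "events.jsonl")
    with open(path, "a", encoding="utf-8") as f:
        for event in events:
            f.write(event.rstrip("\n") + "\n")
    return path

lake_root = tempfile.mkdtemp()  # stand-in for a real lake path
raw = ['{"user":"alice","action":"login"}', '{"sensor":"s1","temp_c":21.4}']
out_path = land_raw_events(raw, lake_root)
stored = open(out_path, encoding="utf-8").read().splitlines()
```

Because nothing is dropped or reshaped at write time, any future job can reparse the stored lines with whatever schema it needs.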
What is a key reason ingestion pipelines feed data lakes instead of data warehouses?
Consider the difference in data format requirements between lakes and warehouses.
Data lakes accept raw, unstructured data from ingestion pipelines, while data warehouses need data to be cleaned and structured before loading.
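A toy contrast of the two loading styles (the column names and payload are illustrative): the lake keeps the payload untouched, while a warehouse load enforces a fixed schema first and drops fields outside it.

```python
import json

raw_event = ('{"user":"alice","ts":"2024-05-01T10:00:00Z",'
             '"action":"login","extra":{"ip":"10.0.0.1"}}')

# Data lake: store the payload as-is; structure is imposed on read.
lake_record = raw_event

# Data warehouse: a fixed schema is enforced before loading.
WAREHOUSE_COLUMNS = ("user", "ts", "action")

def to_warehouse_row(payload):
    # Reject payloads missing required columns; keep only schema fields.
    parsed = json.loads(payload)
    missing = [c for c in WAREHOUSE_COLUMNS if c not in parsed]
    if missing:
        raise ValueError(f"cannot load: missing columns {missing}")
    return tuple(parsed[c] for c in WAREHOUSE_COLUMNS)

row = to_warehouse_row(raw_event)
```

Note that the `extra` field survives in the lake record but is gone from the warehouse row; that lost detail is exactly what the lake preserves.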
Given a Hadoop ingestion pipeline that collects JSON logs from multiple servers and stores them in HDFS, what is the typical output format stored in the data lake?
import json

logs = ['{"user":"alice","action":"login"}', '{"user":"bob","action":"logout"}']
parsed_logs = [json.loads(log) for log in logs]
# What is stored in HDFS?
Think about how Hadoop stores raw data from ingestion pipelines.
Hadoop ingestion pipelines typically store raw JSON files directly in HDFS, preserving original data format for flexible processing later.
What error will this Hadoop ingestion Python script raise?
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Ingest').getOrCreate()
data = spark.read.json('hdfs://logs/*.json')
data.write.csv('hdfs://output/logs.csv')
spark.stop()
Check how Spark interprets the path passed to write.csv.
In Spark, write.csv treats the given path as an output directory, not a single file: the call creates a directory named logs.csv containing part files. Passing a file-like path raises no TypeError; the error you can hit is an AnalysisException, thrown when the output path already exists (the default save mode is errorifexists).
You need to design an ingestion pipeline that feeds a data lake with streaming sensor data. Which approach best ensures data is stored quickly and can be processed later in different ways?
Consider speed and flexibility of raw data storage for future analysis.
Stream the raw sensor readings through a system such as Kafka and land them unmodified in the data lake: ingestion stays fast, and every reading remains available for any kind of later processing.