Complete the code to create a raw zone directory in HDFS for the data lake.
hdfs dfs -mkdir /data_lake/[1]
The raw zone is where data is first ingested in its original form.
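For context, the exercises assume a data lake laid out as zone directories under `/data_lake` (the raw, curated, and archive zones mentioned in these items). The `zone_path` helper below is a hypothetical sketch of that layout, not part of the exercise answers:

```python
# Hypothetical sketch of the data-lake zone layout assumed by these exercises.
ZONES = ("raw_zone", "curated_zone", "archive_zone")

def zone_path(zone: str, base: str = "/data_lake") -> str:
    # Reject names outside the known zones to catch typos early.
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"{base}/{zone}"

print(zone_path("raw_zone"))  # /data_lake/raw_zone
```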
Complete the code to move data from the raw zone to the curated zone in HDFS.
hdfs dfs -mv /data_lake/raw_zone/[1] /data_lake/curated_zone/
temp_data.csv is the file being moved from the raw zone to the curated zone for processing.
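When such moves are scripted rather than typed interactively, it can help to assemble the command as an argument list first (e.g. for `subprocess.run`). The helper below is a hypothetical sketch; only the `hdfs dfs -mv` invocation itself comes from the exercise:

```python
# Hypothetical helper: build the `hdfs dfs -mv` argument list for review
# or execution via subprocess.run; paths mirror the exercise's hint.
def hdfs_mv_cmd(src: str, dst: str) -> list[str]:
    return ["hdfs", "dfs", "-mv", src, dst]

cmd = hdfs_mv_cmd("/data_lake/raw_zone/temp_data.csv", "/data_lake/curated_zone/")
print(" ".join(cmd))  # hdfs dfs -mv /data_lake/raw_zone/temp_data.csv /data_lake/curated_zone/
```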
Fix the error in the Spark code to read data from the curated zone in the data lake.
df = spark.read.format('parquet').load('/data_lake/[1]/2024/06/01')
Data should be read from the curated zone after processing, not from the raw or archive zones.
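The load path above follows a `year/month/day` directory layout. A small helper like the hypothetical `curated_path` below can build such paths consistently (the function name and zero-padding convention are assumptions for illustration):

```python
from datetime import date

# Hypothetical: build a curated-zone read path for a given ingestion date,
# matching the /data_lake/curated_zone/2024/06/01 layout in the exercise.
def curated_path(d: date, base: str = "/data_lake/curated_zone") -> str:
    # Zero-pad month and day so paths sort lexicographically by date.
    return f"{base}/{d.year:04d}/{d.month:02d}/{d.day:02d}"

print(curated_path(date(2024, 6, 1)))  # /data_lake/curated_zone/2024/06/01
```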
Fill both blanks to create a partitioned table in Hive for the data lake's curated data.
CREATE EXTERNAL TABLE curated_data (id INT, name STRING) PARTITIONED BY ([1] STRING, [2] STRING) STORED AS PARQUET LOCATION '/data_lake/curated_zone/';
Partitioning by year and month helps organize data efficiently for queries.
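For an external table like the one above, new year/month partitions typically have to be registered before they are queryable. The helper below is a hypothetical sketch that renders a standard Hive `ALTER TABLE ... ADD PARTITION` statement; the `year=.../month=...` directory convention is an assumption for illustration:

```python
# Hypothetical: render a Hive ADD PARTITION statement for a year/month
# partitioned external table over the curated zone.
def add_partition_sql(year: str, month: str,
                      table: str = "curated_data",
                      base: str = "/data_lake/curated_zone") -> str:
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{year}', month='{month}') "
        f"LOCATION '{base}/year={year}/month={month}'"
    )

print(add_partition_sql("2024", "06"))
```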
Fill all three blanks to create a dictionary comprehension that maps file names to their sizes for files larger than 100MB in the raw zone.
file_sizes = { [1]: [2] for [1] in files if files[[1]] > 100 }
The comprehension iterates over file names, maps each name to its size, and keeps only files whose size exceeds 100 MB.
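For reference, a runnable version of this pattern with hypothetical sample data (file names and sizes are invented for illustration) might look like:

```python
# Hypothetical sample data: file name -> size in MB.
files = {"events.csv": 250, "users.csv": 40, "clicks.csv": 180}

# Keep only files strictly larger than 100 MB, mapping name -> size.
file_sizes = {name: files[name] for name in files if files[name] > 100}

print(file_sizes)  # {'events.csv': 250, 'clicks.csv': 180}
```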