
Creating DataFrames from files (CSV, JSON, Parquet) in Apache Spark - Try It Yourself

📖 Scenario: You work as a data analyst at a retail company. You receive sales data in different file formats: CSV, JSON, and Parquet. Your task is to load these files into Spark DataFrames so you can analyze the sales data.
🎯 Goal: Learn how to create Spark DataFrames by reading data from CSV, JSON, and Parquet files.
📋 What You'll Learn
Use SparkSession to read files
Read a CSV file into a DataFrame
Read a JSON file into a DataFrame
Read a Parquet file into a DataFrame
Print the schema of each DataFrame
💡 Why This Matters
🌍 Real World
Data scientists and analysts often receive data in different file formats. Knowing how to load these files into Spark DataFrames is essential for data processing and analysis.
💼 Career
This skill is important for roles like Data Engineer, Data Scientist, and Big Data Analyst who work with large datasets stored in various formats.
Step 1: Create a SparkSession
Create a SparkSession called spark with the app name "FileReaderApp".
Need a hint?

Use SparkSession.builder.appName(...).getOrCreate() to create the SparkSession.

Step 2: Read the CSV file into a DataFrame
Use the spark session to read the CSV file named "sales.csv" into a DataFrame called csv_df. Assume the CSV file has a header row.
Need a hint?

Use spark.read.option("header", True).csv("sales.csv") to read the CSV file with headers.

Step 3: Read the JSON and Parquet files into DataFrames
Use the spark session to read the JSON file named "sales.json" into a DataFrame called json_df, and the Parquet file named "sales.parquet" into a DataFrame called parquet_df.
Need a hint?

Use spark.read.json(...) and spark.read.parquet(...) to read JSON and Parquet files respectively.

Step 4: Print the schema of each DataFrame
Print the schema of csv_df, json_df, and parquet_df using the .printSchema() method.
Need a hint?

Use printSchema() on each DataFrame to see its structure.