Apache Spark · data · ~30 mins

Date and timestamp functions in Apache Spark - Mini Project: Build & Apply

Date and timestamp functions
📖 Scenario: You work as a data analyst for a retail company. You have a dataset of sales transactions with timestamps. You want to analyze the data by extracting useful date and time information.
🎯 Goal: Build a Spark DataFrame with sales data, add a configuration for the date format, use Spark date and timestamp functions to extract year and month, and display the results.
📋 What You'll Learn
Create a Spark DataFrame with sales data including a timestamp column
Add a configuration variable for the date format string
Use Spark SQL functions to extract the year and month from the timestamp
Display the resulting DataFrame with the extracted date parts
💡 Why This Matters
🌍 Real World
Retail companies often analyze sales data by date to find trends and patterns over time.
💼 Career
Data analysts and data scientists use date and timestamp functions to prepare and analyze time-based data for reports and decision making.
Step 1: Create the sales DataFrame
Create a Spark DataFrame called sales_df with these exact columns and data: transaction_id (integers 1 to 3), amount (floats 100.0, 150.5, 200.75), and timestamp (strings '2023-05-01 10:15:00', '2023-06-15 12:30:00', '2023-07-20 14:45:00'). Use spark.createDataFrame() with a list of tuples and a schema.
Hint: Use spark.createDataFrame() with a list of tuples and a schema defined by StructType and StructField.

Step 2: Add date format configuration
Create a variable called date_format and set it to the string 'yyyy-MM-dd HH:mm:ss' to match the timestamp format in sales_df.
Hint: Assign the exact string 'yyyy-MM-dd HH:mm:ss' to the variable date_format.
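
The assignment itself is a single line. The comments in this sketch spell out what each pattern letter means in Spark's datetime patterns, because the letters are case-sensitive and easy to mix up.

```python
# Pattern letters are case-sensitive:
#   yyyy = year, MM = month, dd = day of month
#   HH = hour of day (0-23), mm = minute, ss = second
date_format = 'yyyy-MM-dd HH:mm:ss'
```

Swapping MM and mm is a common slip: it would read the month digits as minutes and the minute digits as months.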

Step 3: Extract year and month from timestamp
Import to_timestamp, year, and month from pyspark.sql.functions. Create a new DataFrame called sales_with_date by adding two new columns to sales_df: year and month. Use to_timestamp with date_format to convert the timestamp string to a timestamp type, then extract the year and month.
Hint: Use withColumn twice to add the year and month columns, converting the timestamp string column to a timestamp type first.

Step 4: Display the result
Use print() and sales_with_date.show() to display the DataFrame with the extracted year and month columns.
Hint: Use print() to show a message and sales_with_date.show() to display the DataFrame.