
Date and timestamp functions in Apache Spark

Introduction

Date and timestamp functions make it easy to work with dates and times in Spark DataFrames. They let you calculate differences between dates, add or subtract time, extract date parts, and convert strings into proper date values.

You want to find how many days passed between two dates in a sales report.
You need to add hours or minutes to a timestamp to schedule events.
You want to extract the year or month from a date to group data by time.
You need to convert string dates into date format to analyze them.
You want to get the current date or timestamp for logging or filtering.
Syntax
Apache Spark
from pyspark.sql.functions import current_date, current_timestamp, datediff, date_add, date_sub, year, month, dayofmonth, to_date, unix_timestamp

# Example usage:
df.select(current_date(), current_timestamp())
df.select(datediff(df.end_date, df.start_date))
df.select(date_add(df.date, 5))
df.select(year(df.date))

Use from pyspark.sql.functions import <function_name> to access date functions.

Functions work on columns in Spark DataFrames, not on plain Python dates.

Examples
Gets the current date.
Apache Spark
from pyspark.sql.functions import current_date

df.select(current_date())
Calculates days between two dates.
Apache Spark
from pyspark.sql.functions import datediff

df.select(datediff(df.end_date, df.start_date))
Adds 10 days to a date.
Apache Spark
from pyspark.sql.functions import date_add

df.select(date_add(df.date, 10))
Extracts the year from a date.
Apache Spark
from pyspark.sql.functions import year

df.select(year(df.date))
Sample Program

This program creates a Spark DataFrame with start and end dates. It converts strings to dates, then calculates the difference in days, adds 5 days to the start date, extracts the year, and shows the current date and timestamp.

Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_date, current_timestamp, datediff, date_add, year

spark = SparkSession.builder.appName('DateExample').getOrCreate()

# Create sample data
data = [
    ('2024-01-01', '2024-01-10'),
    ('2024-02-15', '2024-02-20')
]

columns = ['start_date', 'end_date']
df = spark.createDataFrame(data, columns)

# Convert string to date type
from pyspark.sql.functions import to_date

df = (df.withColumn('start_date', to_date('start_date'))
        .withColumn('end_date', to_date('end_date')))

# Calculate days difference
result = df.select(
    'start_date',
    'end_date',
    datediff('end_date', 'start_date').alias('days_diff'),
    date_add('start_date', 5).alias('start_plus_5_days'),
    year('start_date').alias('start_year'),
    current_date().alias('today'),
    current_timestamp().alias('now')
)

result.show(truncate=False)

spark.stop()
Important Notes

Make sure date columns are in date format, not string, before using date functions.

current_date() and current_timestamp() depend on the system clock and session time zone (spark.sql.session.timeZone) of the cluster running Spark.

Use to_date() to convert strings to dates.

Summary

Date and timestamp functions help analyze and manipulate time data easily.

Common tasks include finding differences, adding days, and extracting parts like year or month.

Always convert strings to date type before using these functions in Spark.