0
0
Apache Sparkdata~5 mins

SQL queries on DataFrames in Apache Spark

Choose your learning style9 modes available
Introduction

SQL queries let you ask questions about your data in a simple way. Using SQL on DataFrames helps you find, filter, and summarize data easily.

You want to quickly filter rows based on conditions.
You need to group data and calculate totals or averages.
You prefer writing SQL instead of complex code to analyze data.
You want to join two sets of data to combine information.
You want to sort data to find top or bottom values.
Syntax
Apache Spark
df.createOrReplaceTempView("table_name")
spark.sql("SELECT column1, column2 FROM table_name WHERE condition")

First, register your DataFrame as a temporary table with createOrReplaceTempView.

Then, use spark.sql() to run SQL queries on that table.

Examples
Selects names and ages of people older than 30.
Apache Spark
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30")
Calculates total sales amount for each product.
Apache Spark
df.createOrReplaceTempView("sales")
spark.sql("SELECT product, SUM(amount) AS total_amount FROM sales GROUP BY product")
Finds top 5 highest paid employees.
Apache Spark
df.createOrReplaceTempView("employees")
spark.sql("SELECT * FROM employees ORDER BY salary DESC LIMIT 5")
Sample Program

This program creates a DataFrame with people data, registers it as a SQL table, and runs a query to find people older than 30. It then shows the filtered results sorted by age.

Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQLQueriesOnDataFrames").getOrCreate()

# Create sample data
data = [
    (1, "Alice", 29),
    (2, "Bob", 35),
    (3, "Cathy", 23),
    (4, "David", 40),
    (5, "Eva", 30)
]

columns = ["id", "name", "age"]

df = spark.createDataFrame(data, columns)

# Register DataFrame as a temporary view

df.createOrReplaceTempView("people")

# Run SQL query to select people older than 30
result = spark.sql("SELECT name, age FROM people WHERE age > 30 ORDER BY age")

# Show the result
result.show()
OutputSuccess
Important Notes

Temporary views last only during the Spark session.

You can use all standard SQL commands like SELECT, WHERE, GROUP BY, ORDER BY, and JOIN.

Remember to import and create a SparkSession before running SQL queries.

Summary

Use createOrReplaceTempView to turn a DataFrame into a SQL table.

Run SQL queries on DataFrames with spark.sql().

This method makes data analysis easier for those familiar with SQL.