Apache Spark · data · ~10 mins

SQL queries on DataFrames in Apache Spark - Step-by-Step Execution

Concept Flow - SQL queries on DataFrames
Create SparkSession
Load Data into DataFrame
Create Temp View from DataFrame
Write SQL Query as String
Run spark.sql(query)
Get Result DataFrame
Show or Use Result
This flow shows how to run SQL queries on Spark DataFrames by creating a temporary view and querying it with spark.sql.
Execution Sample
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [(1, 'Alice'), (2, 'Bob'), (3, 'Cathy')]
df = spark.createDataFrame(data, ['id', 'name'])
df.createOrReplaceTempView('people')
result = spark.sql('SELECT id, name FROM people WHERE id > 1')
result.show()
This code creates a DataFrame, registers it as a temp view, runs a SQL query to select rows where id > 1, and shows the result.
Execution Table
Step | Action | Input/Condition | Output/Result
1 | Create SparkSession | null | SparkSession object created
2 | Create DataFrame from list | data = [(1, 'Alice'), (2, 'Bob'), (3, 'Cathy')] | DataFrame with 3 rows and columns 'id', 'name'
3 | Create temp view 'people' | df.createOrReplaceTempView('people') | Temporary SQL view 'people' created
4 | Run SQL query | SELECT id, name FROM people WHERE id > 1 | Result DataFrame with rows where id=2 and id=3
5 | Show result | result.show() | Output rows: (2, Bob), (3, Cathy)
6 | End | No more steps | Execution complete
💡 All steps executed; SQL query filtered rows with id > 1
Variable Tracker
Variable | Start | After Step 2 | After Step 3 | After Step 4 | Final
spark | null | SparkSession object | SparkSession object | SparkSession object | SparkSession object
df | null | DataFrame with 3 rows | DataFrame with 3 rows | DataFrame with 3 rows | DataFrame with 3 rows
people (temp view) | null | null | Temp view created | Temp view created | Temp view created
result | null | null | null | DataFrame with 2 rows (id > 1) | DataFrame with 2 rows (id > 1)
Key Moments - 3 Insights
Why do we need to create a temporary view before running SQL queries on a DataFrame?
The temporary view acts like a table in SQL. spark.sql() runs queries on SQL tables or views, not directly on DataFrames. See execution_table step 3 where the temp view is created.
What happens if the SQL query references a table name that does not exist?
spark.sql() raises an AnalysisException ("table or view not found") because the name cannot be resolved. You must create the temp view first, as shown in step 3.
Does the original DataFrame change after running the SQL query?
No, the original DataFrame stays the same. The SQL query returns a new DataFrame with filtered or selected data, as shown in variable_tracker for 'result'.
Visual Quiz - 3 Questions
Test your understanding
Look at the execution_table at step 4. What does the SQL query select?
A. All rows where id is greater than 1
B. All rows where id is less than or equal to 1
C. All rows with name 'Alice'
D. All rows without any filter
💡 Hint
Check the 'Input/Condition' column at step 4 in execution_table.
According to variable_tracker, what is the state of 'result' after step 4?
A. null
B. DataFrame with 2 rows where id > 1
C. DataFrame with 3 rows
D. Temporary view object
💡 Hint
Look at the 'result' row and the 'After Step 4' column in variable_tracker.
If we skip creating the temp view (step 3), what will happen when running the SQL query?
A. The query will run successfully with no results
B. The original DataFrame will be modified
C. spark.sql() will throw an error about a missing table
D. The query will return all rows
💡 Hint
Refer to key_moments question about missing temp view.
Concept Snapshot
SQL queries on DataFrames in Spark:
- Create SparkSession
- Load data into DataFrame
- Create temp view with createOrReplaceTempView('viewName')
- Run SQL query with spark.sql('SELECT ... FROM viewName WHERE ...')
- Result is a new DataFrame
- Use show() to display results
Full Transcript
This visual execution shows how to run SQL queries on Spark DataFrames. First, a SparkSession is created. Then data is loaded into a DataFrame. The DataFrame is registered as a temporary SQL view. Next, a SQL query string is written and executed with spark.sql(). The query filters rows where id is greater than 1. The result is a new DataFrame with filtered rows. Finally, the result is displayed using show(). Variables like spark, df, and result change state as the code runs. Key points include the need to create a temp view before querying and that the original DataFrame does not change. The quizzes test understanding of query filtering, variable states, and error conditions.