Apache Spark · data · ~10 mins

SQL queries on DataFrames in Apache Spark - Step-by-Step Execution

Concept Flow - SQL queries on DataFrames
Create SparkSession
Load Data into DataFrame
Create Temp View from DataFrame
Write SQL Query as String
Run spark.sql(query)
Get Result DataFrame
Show or Use Result
This flow shows how to run SQL queries on Spark DataFrames by creating a temporary view and querying it with spark.sql.
Execution Sample
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [(1, 'Alice'), (2, 'Bob'), (3, 'Cathy')]
df = spark.createDataFrame(data, ['id', 'name'])
df.createOrReplaceTempView('people')
result = spark.sql('SELECT id, name FROM people WHERE id > 1')
result.show()
This code creates a DataFrame, registers it as a temp view, runs a SQL query to select rows where id > 1, and shows the result.
Execution Table
Step | Action | Input/Condition | Output/Result
1 | Create SparkSession | null | SparkSession object created
2 | Create DataFrame from list | data = [(1, 'Alice'), (2, 'Bob'), (3, 'Cathy')] | DataFrame with 3 rows and columns 'id', 'name'
3 | Create temp view 'people' | df.createOrReplaceTempView('people') | Temporary SQL view 'people' created
4 | Run SQL query | SELECT id, name FROM people WHERE id > 1 | Result DataFrame with rows where id=2 and id=3
5 | Show result | result.show() | Output rows: (2, Bob), (3, Cathy)
6 | End | No more steps | Execution complete
💡 All steps executed; SQL query filtered rows with id > 1
Variable Tracker
Variable | Start | After Step 2 | After Step 3 | After Step 4 | Final
spark | null | SparkSession object | SparkSession object | SparkSession object | SparkSession object
df | null | DataFrame with 3 rows | DataFrame with 3 rows | DataFrame with 3 rows | DataFrame with 3 rows
people (temp view) | null | null | Temp view created | Temp view created | Temp view created
result | null | null | null | DataFrame with 2 rows (id > 1) | DataFrame with 2 rows (id > 1)
Key Moments - 3 Insights
Why do we need to create a temporary view before running SQL queries on a DataFrame?
The temporary view acts like a table in SQL. spark.sql() runs queries on SQL tables or views, not directly on DataFrames. See execution_table step 3 where the temp view is created.
What happens if the SQL query references a table name that does not exist?
spark.sql() raises an AnalysisException ("table or view not found") because the name cannot be resolved. You must create the temp view first, as shown in step 3.
Does the original DataFrame change after running the SQL query?
No, the original DataFrame stays the same. The SQL query returns a new DataFrame with filtered or selected data, as shown in variable_tracker for 'result'.
Visual Quiz - 3 Questions
Test your understanding
Look at the execution_table at step 4. What does the SQL query select?
A. All rows where id is greater than 1
B. All rows where id is less than or equal to 1
C. All rows with name 'Alice'
D. All rows without any filter
💡 Hint
Check the 'Input/Condition' column at step 4 in execution_table.
According to variable_tracker, what is the state of 'result' after step 4?
A. null
B. DataFrame with 2 rows where id > 1
C. DataFrame with 3 rows
D. Temporary view object
💡 Hint
Look at the 'result' row and the 'After Step 4' column in variable_tracker.
If we skip creating the temp view (step 3), what will happen when running the SQL query?
A. The query will run successfully with no results
B. The original DataFrame will be modified
C. spark.sql() will throw an error about a missing table
D. The query will return all rows
💡 Hint
Refer to key_moments question about missing temp view.
Concept Snapshot
SQL queries on DataFrames in Spark:
- Create SparkSession
- Load data into DataFrame
- Create temp view with createOrReplaceTempView('viewName')
- Run SQL query with spark.sql('SELECT ... FROM viewName WHERE ...')
- Result is a new DataFrame
- Use show() to display results
Full Transcript
This visual execution shows how to run SQL queries on Spark DataFrames. First, a SparkSession is created. Then data is loaded into a DataFrame. The DataFrame is registered as a temporary SQL view. Next, a SQL query string is written and executed with spark.sql(). The query filters rows where id is greater than 1. The result is a new DataFrame with filtered rows. Finally, the result is displayed using show(). Variables like spark, df, and result change state as the code runs. Key points include the need to create a temp view before querying and that the original DataFrame does not change. The quizzes test understanding of query filtering, variable states, and error conditions.