0
0
Apache-sparkHow-ToBeginner ยท 4 min read

How to Use spark.sql in PySpark: Syntax and Examples

In PySpark, you use spark.sql to run SQL queries on DataFrames registered as temporary views. First, create or load a DataFrame, register it as a view with createOrReplaceTempView, then run SQL queries using spark.sql('your SQL query') to get a new DataFrame with the results.
๐Ÿ“

Syntax

The basic syntax to use spark.sql in PySpark is:

  • spark.sql(query_string): Runs the SQL query string on registered views.
  • Before running SQL, you must register your DataFrame as a temporary view using createOrReplaceTempView(view_name).
  • The result of spark.sql is a new DataFrame that you can further process.
python
df.createOrReplaceTempView("view_name")
result_df = spark.sql("SELECT * FROM view_name WHERE condition")
๐Ÿ’ป

Example

This example shows how to create a DataFrame, register it as a temporary view, and run a SQL query using spark.sql. It selects rows where age is greater than 25.

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Create sample data
data = [(1, "Alice", 30), (2, "Bob", 20), (3, "Cathy", 27)]
columns = ["id", "name", "age"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Register DataFrame as a temporary view
df.createOrReplaceTempView("people")

# Run SQL query
result_df = spark.sql("SELECT name, age FROM people WHERE age > 25")

# Show results
result_df.show()
Output
+-----+---+ | name|age| +-----+---+ |Alice| 30| |Cathy| 27| +-----+---+
โš ๏ธ

Common Pitfalls

Common mistakes when using spark.sql include:

  • Not registering the DataFrame as a temporary view before running SQL queries.
  • Using incorrect view names or SQL syntax errors.
  • Expecting spark.sql to modify the original DataFrame (it returns a new DataFrame).

Example of a wrong approach and the correct fix:

python
# Wrong: Running SQL without registering view
try:
    spark.sql("SELECT * FROM people").show()
except Exception as e:
    print(f"Error: {e}")

# Correct: Register view first
# df.createOrReplaceTempView("people")
# spark.sql("SELECT * FROM people").show()
Output
Error: Table or view not found: people;
๐Ÿ“Š

Quick Reference

StepDescriptionExample
Create DataFrameCreate a DataFrame from data or sourcedf = spark.createDataFrame(data, columns)
Register ViewRegister DataFrame as a temporary SQL viewdf.createOrReplaceTempView("view_name")
Run SQLRun SQL query on the registered viewresult_df = spark.sql("SELECT * FROM view_name")
Show ResultsDisplay the query resultsresult_df.show()
โœ…

Key Takeaways

Always register your DataFrame as a temporary view before using spark.sql.
spark.sql runs SQL queries and returns a new DataFrame with the results.
Use standard SQL syntax inside spark.sql strings to query your data.
If you get 'Table or view not found' error, check your view registration.
You can chain spark.sql results with other DataFrame operations.