How to Use spark.sql in PySpark: Syntax and Examples
In PySpark, you use
spark.sql to run SQL queries on DataFrames registered as temporary views. First, create or load a DataFrame, register it as a view with createOrReplaceTempView, then run SQL queries using spark.sql('your SQL query') to get a new DataFrame with the results.Syntax
The basic syntax to use spark.sql in PySpark is:
spark.sql(query_string): Runs the SQL query string on registered views.- Before running SQL, you must register your DataFrame as a temporary view using
createOrReplaceTempView(view_name). - The result of
spark.sqlis a new DataFrame that you can further process.
python
df.createOrReplaceTempView("view_name") result_df = spark.sql("SELECT * FROM view_name WHERE condition")
Example
This example shows how to create a DataFrame, register it as a temporary view, and run a SQL query using spark.sql. It selects rows where age is greater than 25.
python
from pyspark.sql import SparkSession spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate() # Create sample data data = [(1, "Alice", 30), (2, "Bob", 20), (3, "Cathy", 27)] columns = ["id", "name", "age"] # Create DataFrame df = spark.createDataFrame(data, columns) # Register DataFrame as a temporary view df.createOrReplaceTempView("people") # Run SQL query result_df = spark.sql("SELECT name, age FROM people WHERE age > 25") # Show results result_df.show()
Output
+-----+---+
| name|age|
+-----+---+
|Alice| 30|
|Cathy| 27|
+-----+---+
Common Pitfalls
Common mistakes when using spark.sql include:
- Not registering the DataFrame as a temporary view before running SQL queries.
- Using incorrect view names or SQL syntax errors.
- Expecting
spark.sqlto modify the original DataFrame (it returns a new DataFrame).
Example of a wrong approach and the correct fix:
python
# Wrong: Running SQL without registering view try: spark.sql("SELECT * FROM people").show() except Exception as e: print(f"Error: {e}") # Correct: Register view first # df.createOrReplaceTempView("people") # spark.sql("SELECT * FROM people").show()
Output
Error: Table or view not found: people;
Quick Reference
| Step | Description | Example |
|---|---|---|
| Create DataFrame | Create a DataFrame from data or source | df = spark.createDataFrame(data, columns) |
| Register View | Register DataFrame as a temporary SQL view | df.createOrReplaceTempView("view_name") |
| Run SQL | Run SQL query on the registered view | result_df = spark.sql("SELECT * FROM view_name") |
| Show Results | Display the query results | result_df.show() |
Key Takeaways
Always register your DataFrame as a temporary view before using spark.sql.
spark.sql runs SQL queries and returns a new DataFrame with the results.
Use standard SQL syntax inside spark.sql strings to query your data.
If you get 'Table or view not found' error, check your view registration.
You can chain spark.sql results with other DataFrame operations.