Cross joins and when to avoid them in Apache Spark
Cross joins combine every row of one table with every row of another, producing the Cartesian product of the two. This is useful for exploring all possible pairs but can create very large results.
df1.crossJoin(df2)
Call the crossJoin() method on one DataFrame and pass the other DataFrame as the argument.
Be careful: the result size is the product of the row counts of both tables.
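The multiplication of row counts can be sketched in plain Python, without Spark, using itertools.product, which has the same Cartesian-product semantics as crossJoin():

```python
from itertools import product

# Two small "tables" represented as lists of rows
rows1 = [(1, 'A'), (2, 'B'), (3, 'C')]  # 3 rows
rows2 = [(10, 'X'), (20, 'Y')]          # 2 rows

# Cartesian product: pair every row of rows1 with every row of rows2
pairs = [r1 + r2 for r1, r2 in product(rows1, rows2)]

print(len(pairs))  # 3 * 2 = 6 rows
```

With a thousand rows on each side, the same product rule already yields a million output rows, which is why the row counts of the inputs matter so much.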
df1.crossJoin(df2).show()
df1.crossJoin(df2).filter(df2['value'] > 10)
This code creates two small tables and combines every row of the first with every row of the second using cross join. The result has 4 rows (2x2).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('CrossJoinExample').getOrCreate()

# Create first DataFrame
data1 = [(1, 'A'), (2, 'B')]
df1 = spark.createDataFrame(data1, ['id1', 'letter'])

# Create second DataFrame
data2 = [(10, 'X'), (20, 'Y')]
df2 = spark.createDataFrame(data2, ['id2', 'symbol'])

# Perform cross join
cross_df = df1.crossJoin(df2)

# Show result
cross_df.show()

spark.stop()
Cross joins can create huge tables quickly. Avoid using them on large datasets unless necessary.
Try to use joins with conditions (like inner or left joins) to limit the size of the result.
If you must use cross join, make sure one of the tables is very small.
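To see why a conditioned join is preferable, here is a plain-Python sketch (the customers/orders data is made up for illustration): a cross join followed by a filter materializes every pair before discarding most of them, while a keyed inner join only ever produces the matching rows.

```python
# Hypothetical sample data for illustration
customers = [(1, 'Alice'), (2, 'Bob')]
orders = [(1, 'book'), (1, 'pen'), (2, 'mug')]

# Cross join then filter: builds all len(customers) * len(orders)
# pairs first, then throws most of them away
all_pairs = [(c, o) for c in customers for o in orders]
matched_after_filter = [(c, o) for (c, o) in all_pairs if c[0] == o[0]]

# Keyed (inner) join: index one side by key, emit only matches
by_id = {}
for c in customers:
    by_id.setdefault(c[0], []).append(c)
matched_joined = [(c, o) for o in orders for c in by_id.get(o[0], [])]

print(len(all_pairs))             # 6 intermediate pairs
print(len(matched_after_filter))  # 3 matches
print(len(matched_joined))        # same 3 matches, no intermediate blow-up
```

Both approaches return the same matched rows, but the intermediate size of the cross-join path grows with the product of the table sizes, while the keyed join grows only with the number of matches.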
Cross joins combine every row of two tables, creating all possible pairs.
They are useful for generating combinations but can produce very large results.
Avoid cross joins on big datasets to prevent slow performance and memory issues.