
Self joins in Apache Spark - Mini Project: Build & Apply


Self Joins with Apache Spark

📖 Scenario: You work in a company that manages employee data. Each employee has a manager, who is also an employee. You want to find pairs of employees and their managers using a self join.

🎯 Goal: Build a Spark DataFrame with employee data, then use a self join to match each employee with their manager's name.

📋 What You'll Learn

Create a Spark DataFrame called employees with columns emp_id, name, and manager_id.

Create a variable called df to hold the joined DataFrame.

Use a self join on employees to match employees with their managers.

Print the resulting DataFrame showing employee names and their manager names.
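Before turning to Spark, the matching that a self join performs can be sketched in plain Python. This is only an illustration of the semantics, not how Spark executes a join; the rows are the ones used in the steps of this project, and the names rows, name_by_id, and pairs are chosen here for illustration.

```python
# Rows as (emp_id, name, manager_id); None marks an employee without a manager.
rows = [
    (1, "Alice", 3),
    (2, "Bob", 3),
    (3, "Charlie", None),
    (4, "David", 2),
]

# The "manager" copy of the table, indexed by emp_id.
name_by_id = {emp_id: name for emp_id, name, _ in rows}

# Inner-join semantics: keep a row only when its manager_id matches some emp_id.
pairs = [
    (name, name_by_id[manager_id])
    for _, name, manager_id in rows
    if manager_id in name_by_id
]

print(pairs)
# [('Alice', 'Charlie'), ('Bob', 'Charlie'), ('David', 'Bob')]
```

Charlie does not appear as an employee in the output because his manager_id is null and therefore matches no emp_id.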

💡 Why This Matters

🌍 Real World

Companies often store employee and manager data in the same table. Self joins help find relationships within the same dataset.

💼 Career

Understanding self joins is important for data analysts and data scientists working with hierarchical or relational data.


Create the employees DataFrame

Create a Spark DataFrame called employees with these exact rows: (1, 'Alice', 3), (2, 'Bob', 3), (3, 'Charlie', null), (4, 'David', 2). The columns should be emp_id, name, and manager_id.

Apache Spark

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName('SelfJoinExample').getOrCreate()

# Create the employees DataFrame with the given rows and columns
# Your code here

Need a hint?

Use spark.createDataFrame with a schema defining emp_id, name, and manager_id. Use None for null values.

Create aliases for the self join

Create two aliases of the employees DataFrame called e and m to prepare for the self join.

Apache Spark

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName('SelfJoinExample').getOrCreate()

schema = StructType([
    StructField('emp_id', IntegerType(), False),
    StructField('name', StringType(), False),
    StructField('manager_id', IntegerType(), True)
])

data = [
    (1, 'Alice', 3),
    (2, 'Bob', 3),
    (3, 'Charlie', None),
    (4, 'David', 2)
]

employees = spark.createDataFrame(data, schema)

# Create aliases e and m for employees
# Your code here

Need a hint?

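Use the alias method on employees to create e and m, for example employees.alias('e'). The alias names let each copy's columns be referenced unambiguously, such as col('e.name') and col('m.name'), where col comes from pyspark.sql.functions.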

Perform the self join

Create a DataFrame called df by joining e and m where e.manager_id equals m.emp_id. Select e.name as employee and m.name as manager.

Apache Spark

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName('SelfJoinExample').getOrCreate()

schema = StructType([
    StructField('emp_id', IntegerType(), False),
    StructField('name', StringType(), False),
    StructField('manager_id', IntegerType(), True)
])

data = [
    (1, 'Alice', 3),
    (2, 'Bob', 3),
    (3, 'Charlie', None),
    (4, 'David', 2)
]

employees = spark.createDataFrame(data, schema)

e = employees.alias('e')
m = employees.alias('m')

# Join e and m where e.manager_id == m.emp_id and select employee and manager names
# Your code here

Need a hint?

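Use join with the condition that e's manager_id equals m's emp_id. Because e and m are copies of the same DataFrame, write the condition with alias-qualified columns, such as col('e.manager_id') == col('m.emp_id'). Then select e.name as employee and m.name as manager.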

Show the result

Print the contents of the DataFrame df using the show() method.

Apache Spark

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName('SelfJoinExample').getOrCreate()

schema = StructType([
    StructField('emp_id', IntegerType(), False),
    StructField('name', StringType(), False),
    StructField('manager_id', IntegerType(), True)
])

data = [
    (1, 'Alice', 3),
    (2, 'Bob', 3),
    (3, 'Charlie', None),
    (4, 'David', 2)
]

employees = spark.createDataFrame(data, schema)

e = employees.alias('e')
m = employees.alias('m')

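# Refer to columns through the alias names. Attributes such as e.manager_id and
# m.emp_id both resolve to the same underlying employees columns, which makes
# the self join ambiguous.
from pyspark.sql.functions import col

df = (
    e.join(m, col('e.manager_id') == col('m.emp_id'))
    .select(col('e.name').alias('employee'), col('m.name').alias('manager'))
)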

# Show the DataFrame df
# Your code here

Need a hint?

Use df.show() to display the joined employee and manager names.