Apache Spark, data, ~30 mins

Self joins in Apache Spark - Mini Project: Build & Apply

Self Joins with Apache Spark
📖 Scenario: You work at a company that manages employee data. Each employee has a manager, who is also an employee. You want to pair each employee with their manager using a self join.
🎯 Goal: Build a Spark DataFrame with employee data, then use a self join to match each employee with their manager's name.
📋 What You'll Learn
Create a Spark DataFrame called employees with columns emp_id, name, and manager_id.
Create a variable called df to hold the joined DataFrame.
Use a self join on employees to match employees with their managers.
Print the resulting DataFrame showing employee names and their manager names.
💡 Why This Matters
🌍 Real World
Companies often store employee and manager data in the same table. Self joins help find relationships within the same dataset.
💼 Career
Understanding self joins is important for data analysts and data scientists working with hierarchical or relational data.
1
Create the employees DataFrame
Create a Spark DataFrame called employees with these exact rows: (1, 'Alice', 3), (2, 'Bob', 3), (3, 'Charlie', null), (4, 'David', 2). The columns should be emp_id, name, and manager_id.
Hint: Use spark.createDataFrame with a schema that defines emp_id, name, and manager_id. Use None for null values.

2
Create aliases for the self join
Create two aliases of the employees DataFrame called e and m to prepare for the self join.
Hint: Use the alias method on employees to create e and m.

3
Perform the self join
Create a DataFrame called df by joining e and m where e.manager_id equals m.emp_id. Select e.name as employee and m.name as manager.
Hint: Use join with the condition that e.manager_id equals m.emp_id, referring to each column through its alias, for example col("e.manager_id") == col("m.emp_id"). Then select e.name as employee and m.name as manager.

4
Show the result
Print the contents of the DataFrame df using the show() method.
Hint: Use df.show() to display the joined employee and manager names.