Apache Spark · ~10 mins

Databricks platform overview in Apache Spark - Step-by-Step Execution

Concept Flow - Databricks platform overview
User logs into Databricks
Create or open Workspace
Create Notebook or Job
Write Spark code
Submit code to Cluster
Cluster runs Spark jobs
Results returned to Notebook
Visualize or export results
Manage data and resources
Collaborate with team
End
This flow shows how a user interacts with Databricks: logging in, creating notebooks, running Spark code on clusters, getting results, and collaborating.
Execution Sample
Apache Spark
# Sample Spark code in Databricks notebook
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [(1, 'Alice'), (2, 'Bob'), (3, 'Cathy')]
df = spark.createDataFrame(data, ['id', 'name'])
df.show()
This code creates a Spark DataFrame with sample data and shows it in the notebook.
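df.show() renders the DataFrame as a right-aligned ASCII table in the notebook output. As a rough plain-Python illustration of that kind of rendering (not Spark's own formatting code, and the exact padding Spark uses may differ slightly):

```python
# Plain-Python sketch of the tabular output df.show() renders.
# Illustration only; real Spark formats the table internally on the JVM.
rows = [(1, 'Alice'), (2, 'Bob'), (3, 'Cathy')]
headers = ('id', 'name')

# Column width: widest cell in each column, including the header.
widths = [max(len(str(v)) for v in (h, *col))
          for h, col in zip(headers, zip(*rows))]
border = '+' + '+'.join('-' * w for w in widths) + '+'

def fmt(values):
    # Right-align each cell to its column width, as show() does.
    return '|' + '|'.join(str(v).rjust(w) for v, w in zip(values, widths)) + '|'

print(border)
print(fmt(headers))
print(border)
for row in rows:
    print(fmt(row))
print(border)
```

Running this prints a bordered table with one row per tuple, which is essentially what you see under the df.show() cell in the notebook.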
Execution Table
Step | Action | Evaluation | Result
1 | Import SparkSession | from pyspark.sql import SparkSession | SparkSession class available
2 | Create SparkSession | spark = SparkSession.builder.getOrCreate() | SparkSession object created
3 | Prepare data list | data = [(1, 'Alice'), (2, 'Bob'), (3, 'Cathy')] | List of tuples created
4 | Create DataFrame | df = spark.createDataFrame(data, ['id', 'name']) | DataFrame with 3 rows and 2 columns created
5 | Show DataFrame | df.show() | Table displayed with rows (1, Alice), (2, Bob), (3, Cathy)
6 | End of code execution | No more code | DataFrame displayed in notebook
💡 Code execution ends after displaying the DataFrame in the notebook.
Variable Tracker
Variable | Start | After Step 3 | After Step 4 | Final
spark | None | SparkSession object | SparkSession object | SparkSession object
data | None | [(1, 'Alice'), (2, 'Bob'), (3, 'Cathy')] | [(1, 'Alice'), (2, 'Bob'), (3, 'Cathy')] | [(1, 'Alice'), (2, 'Bob'), (3, 'Cathy')]
df | None | None | DataFrame with 3 rows and 2 columns | DataFrame with 3 rows and 2 columns
Key Moments - 3 Insights
Why do we need to create a SparkSession before running Spark code?
The SparkSession is the entry point to use Spark features. Without it, Spark cannot run code or create DataFrames. See execution_table step 2 where SparkSession is created.
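getOrCreate() also explains its own name: it returns the already-active session if one exists instead of starting a second one. A toy plain-Python sketch of that reuse pattern (not Spark's actual implementation):

```python
# Toy illustration of the getOrCreate() pattern: reuse an existing
# session rather than constructing a new one. Not Spark's real code.
class ToySession:
    _active = None  # the "current session", like Spark's active session

    @classmethod
    def get_or_create(cls):
        if cls._active is None:
            cls._active = cls()  # first call: build the session
        return cls._active       # later calls: return the same one

a = ToySession.get_or_create()
b = ToySession.get_or_create()
print(a is b)  # both names point at the same session object
```

This is why calling SparkSession.builder.getOrCreate() in several notebook cells is safe: each call hands back the same session.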
What does df.show() do in the notebook?
df.show() displays the DataFrame content as a table in the notebook output. It does not change the data, just shows it. See execution_table step 5.
Why is the data variable created before the DataFrame?
The data variable holds the raw data as a list. We need it first to pass into spark.createDataFrame to make the DataFrame. See execution_table steps 3 and 4.
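If the data list were never defined, step 4 would fail with a NameError before Spark did any work at all. A quick plain-Python demonstration of that failure mode (no Spark needed; the undefined name here is deliberate):

```python
# Referencing a name that was never assigned raises NameError
# immediately, before any DataFrame could be built from it.
try:
    spark_input = missing_data  # 'missing_data' was never assigned
    failed = False
except NameError as e:
    failed = True
    print('NameError:', e)
```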
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution_table, what is the value of 'df' after step 4?
A. None
B. A list of tuples
C. A Spark DataFrame with 3 rows and 2 columns
D. A SparkSession object
💡 Hint
Check the 'Result' column in execution_table row for step 4.
At which step is the SparkSession object created?
A. Step 2
B. Step 1
C. Step 3
D. Step 4
💡 Hint
Look at the 'Action' and 'Result' columns in execution_table for SparkSession creation.
If we skip creating the 'data' list, what will happen at step 4?
A. DataFrame will be created with empty data
B. Error because data is not defined
C. SparkSession will fail to create
D. df.show() will display an empty table
💡 Hint
Refer to variable_tracker and execution_table steps 3 and 4 about 'data' variable.
Concept Snapshot
Databricks lets you write and run Spark code in notebooks.
You start by creating a SparkSession.
Load or create data, then make DataFrames.
Run code on clusters and see results instantly.
Use notebooks to visualize and share your work.
Full Transcript
Databricks is a platform where users log in and create notebooks to write Spark code. The user starts by creating a SparkSession, which is needed to run Spark commands. Then, data is prepared as a list of tuples. This data is converted into a Spark DataFrame. The DataFrame is shown in the notebook output. The platform runs the code on clusters and returns results quickly. Users can visualize data and collaborate with others in the workspace.