Apache Spark · Data · ~10 mins

What is Apache Spark - Visual Explanation

Concept Flow - What is Apache Spark
Start: Data Input
Spark Core: Distribute Data
Transformations: Map, Filter, etc.
Actions: Collect, Count, Save
Output: Results or Files
Apache Spark takes input data, splits it across many computers, applies transformations to change or analyze it, then collects the results.
Execution Sample
Apache Spark
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session
spark = SparkSession.builder.appName('Example').getOrCreate()
data = [1, 2, 3, 4]
# Distribute the list across the cluster as an RDD
rdd = spark.sparkContext.parallelize(data)
# map is lazy; collect() is the action that triggers computation
squared = rdd.map(lambda x: x*x).collect()
print(squared)  # [1, 4, 9, 16]
This code creates a Spark session, distributes a list, squares each number, and collects the results.
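For comparison, here is the same computation in plain Python, which runs eagerly on a single machine, with no cluster and no Spark involved:

```python
# Plain Python: map runs immediately, on a single machine.
data = [1, 2, 3, 4]
squared = list(map(lambda x: x * x, data))
print(squared)  # [1, 4, 9, 16]
```

The result is identical; what Spark adds is the ability to run the same logic across many machines and to defer work until an action is called.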
Execution Table
Step | Action | Data State | Result
1 | Create SparkSession | No data distributed | Spark ready
2 | Parallelize data | [1, 2, 3, 4] | Distributed across cluster
3 | Map: square each | Transformation defined (lazy) | No computation yet
4 | Collect results | [1, 4, 9, 16] | Data gathered to driver
5 | Print results | [1, 4, 9, 16] | Output shown
💡 All steps complete: data processed and output displayed
Variable Tracker
Variable | Start | After Step 2 | After Step 3 | After Step 4 | Final
spark | None | SparkSession object | SparkSession object | SparkSession object | SparkSession object
data | [1, 2, 3, 4] | [1, 2, 3, 4] | [1, 2, 3, 4] | [1, 2, 3, 4] | [1, 2, 3, 4]
rdd | None | Distributed [1, 2, 3, 4] | Distributed [1, 2, 3, 4] (map returns a new RDD) | Distributed [1, 2, 3, 4] | Distributed [1, 2, 3, 4]
squared | None | None | None | [1, 4, 9, 16] | [1, 4, 9, 16]
Key Moments - 2 Insights
Why does the map step not immediately compute the squares?
Because Spark uses lazy evaluation, the map step only defines the transformation; it does not run until an action such as collect is called (see Execution Table steps 3 and 4).
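The idea behind lazy evaluation can be sketched in plain Python with a toy class (LazyList is a hypothetical name invented for this sketch, not part of Spark's API):

```python
# Toy illustration of lazy evaluation; LazyList is a hypothetical
# class for this sketch, not part of Spark.
class LazyList:
    def __init__(self, data):
        self.data = data
        self.pending = []          # transformations recorded, not run

    def map(self, fn):
        self.pending.append(fn)    # lazy: just remember the function
        return self

    def collect(self):             # action: now run everything
        result = self.data
        for fn in self.pending:
            result = [fn(x) for x in result]
        return result

lazy = LazyList([1, 2, 3, 4]).map(lambda x: x * x)
# Nothing has been computed yet; collect() triggers the work:
print(lazy.collect())  # [1, 4, 9, 16]
```

Deferring work this way lets Spark see the whole chain of transformations before running anything, so it can plan and optimize the computation.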
What does 'parallelize' do with the data?
parallelize splits the data across the cluster so the parts can be processed simultaneously (see Execution Table step 2).
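What "splitting across the cluster" means can be sketched with a plain-Python partitioning helper (a simplification of what parallelize does conceptually; Spark's real partitioning is more involved):

```python
# Toy sketch of partitioning: slice a list into roughly equal chunks,
# the way parallelize() conceptually spreads data across workers.
def partition(data, num_partitions):
    size, extra = divmod(len(data), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

print(partition([1, 2, 3, 4], 2))  # [[1, 2], [3, 4]]
```

Each chunk would then be processed by a different worker, which is where the speedup on large datasets comes from.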
Visual Quiz - 3 Questions
Test your understanding
Looking at the Execution Table, what is the data state after the map step?
A. Transformation defined (lazy)
B. [1, 2, 3, 4]
C. No data
D. Data collected
💡 Hint
Check the 'Data State' column at step 3 in the Execution Table.
At which step does Spark actually compute the squared numbers?
A. Step 2: Parallelize data
B. Step 3: Map transformation
C. Step 4: Collect results
D. Step 5: Print results
💡 Hint
Look for when the transformation changes from lazy to actual computation in the Execution Table.
If we remove the collect() call, what happens to the map step?
A. Map runs immediately
B. Map is never executed
C. Data is printed automatically
D. Spark throws an error
💡 Hint
Refer to the key moment about lazy evaluation and Execution Table steps 3 and 4.
Concept Snapshot
Apache Spark is a fast engine for processing big data by splitting it across many computers.
You write transformations (like map) that are lazy.
Actions (like collect) trigger actual computation.
Spark handles data in memory for speed.
Use SparkSession to start working with data.
Full Transcript
Apache Spark is a tool that helps process large amounts of data quickly by spreading the work across many computers. You start by creating a SparkSession, which sets up Spark. Then you give Spark some data, which it splits up to work on in parallel. You can tell Spark what to do with the data using transformations like map, but these only set up the steps and do not run immediately. When you ask for a result with an action like collect, Spark runs all the steps and brings the data back. This way, Spark is fast and efficient for big data tasks.