What is the main purpose of Apache Spark in data processing?
Think about what Spark does with large data sets, and how fast it does it.
Apache Spark is designed to process large data sets quickly and efficiently across many computers.
What will be the output of the following Spark code snippet?
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
data = [(1, 'apple'), (2, 'banana'), (3, 'cherry')]
df = spark.createDataFrame(data, ['id', 'fruit'])
df_filtered = df.filter(df.id > 1)
df_filtered.show()
Filter keeps rows where id is greater than 1.
The filter removes the row with id 1, so only the rows with id 2 (banana) and id 3 (cherry) remain.
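The same predicate can be checked without a Spark session. The following is a plain-Python sketch, not Spark code: the list comprehension stands in for `df.filter(df.id > 1)`, and the name `kept` is made up for illustration.

```python
# Plain-Python stand-in for the Spark filter; no Spark session is needed.
data = [(1, 'apple'), (2, 'banana'), (3, 'cherry')]

# Same predicate as df.filter(df.id > 1): keep rows whose id exceeds 1.
kept = [(row_id, fruit) for row_id, fruit in data if row_id > 1]

print(kept)  # [(2, 'banana'), (3, 'cherry')]
```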
Given the following Spark DataFrame, what is the output of the aggregation?
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg
spark = SparkSession.builder.appName('Test').getOrCreate()
data = [(1, 10), (2, 20), (3, 30), (4, 40)]
df = spark.createDataFrame(data, ['id', 'value'])
avg_df = df.agg(avg('value').alias('average_value'))
avg_df.show()
Average of 10, 20, 30, 40 is (10+20+30+40)/4.
The average value is 25.0, calculated by summing all values and dividing by the count: (10 + 20 + 30 + 40) / 4 = 100 / 4.
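The arithmetic can be cross-checked in plain Python. This sketch only reproduces the calculation and does not call Spark; the variable names are illustrative.

```python
# The 'value' column from the snippet, as a plain Python list.
values = [10, 20, 30, 40]

# Sum divided by count; true division yields a float, matching the 25.0 above.
total = sum(values)
count = len(values)
average_value = total / count

print(total, count, average_value)  # 100 4 25.0
```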
Which option correctly shows how to collect Spark DataFrame data and plot a bar chart of 'fruit' counts using matplotlib?
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
spark = SparkSession.builder.appName('Test').getOrCreate()
data = [('apple', 4), ('banana', 2), ('cherry', 5)]
df = spark.createDataFrame(data, ['fruit', 'count'])
You need to convert Spark DataFrame rows to Python lists before plotting.
Option C collects Spark rows into a list, then extracts columns for plotting. Other options misuse Spark DataFrame or pandas methods.
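The column-extraction step can be sketched on its own. The helper `split_columns` below is hypothetical, and the list of dicts is a stand-in for what `df.collect()` would return, chosen so the example runs without a Spark session. It relies only on name-based indexing (`row['fruit']`), which Spark `Row` objects also support.

```python
def split_columns(rows):
    """Turn collected rows into two parallel lists suitable for a bar chart."""
    # Index by name rather than attribute: on a Spark Row, `row.count`
    # would be expected to resolve to the tuple method, not the column.
    fruits = [row['fruit'] for row in rows]
    counts = [row['count'] for row in rows]
    return fruits, counts


# Stand-in for df.collect(); real code would receive Row objects instead.
collected = [
    {'fruit': 'apple', 'count': 4},
    {'fruit': 'banana', 'count': 2},
    {'fruit': 'cherry', 'count': 5},
]

fruits, counts = split_columns(collected)
print(fruits)  # ['apple', 'banana', 'cherry']
print(counts)  # [4, 2, 5]
```

With a live session, the rows would come from `df.collect()`, and the two lists could then be passed to `plt.bar(fruits, counts)` followed by `plt.show()`.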
What error will this Spark code raise?
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
data = [(1, 'a'), (2, 'b')]
df = spark.createDataFrame(data, ['id', 'letter'])
result = df.groupBy('id').sum('letter').show()
Sum aggregation expects numeric columns, but 'letter' is text.
Trying to sum a string column causes an AnalysisException because sum only works on numbers.
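One way to avoid the error is to check column types before aggregating. The sketch below is an assumption-laden illustration: `numeric_columns` is a hypothetical helper, the set of type names is an assumed (possibly incomplete) list of the strings Spark reports for numeric columns, and the hard-coded `dtypes` list mimics what `df.dtypes` would be expected to return for the DataFrame above.

```python
# Assumed names Spark uses for numeric column types in DataFrame.dtypes.
NUMERIC_TYPES = {'tinyint', 'smallint', 'int', 'bigint', 'float', 'double'}


def numeric_columns(dtypes):
    """Return the names of columns whose type looks numeric."""
    return [
        name
        for name, type_name in dtypes
        # Decimal types carry precision and scale, e.g. 'decimal(10,2)'.
        if type_name in NUMERIC_TYPES or type_name.startswith('decimal')
    ]


# Stand-in for df.dtypes: Python ints are inferred as bigint, strings as string.
dtypes = [('id', 'bigint'), ('letter', 'string')]

summable = numeric_columns(dtypes)
print(summable)  # ['id']
```

Since 'letter' is not in the result, a caller would know not to pass it to `sum`.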