Complete the code to cache the DataFrame and avoid recomputation (including any shuffles in its lineage) during repeated actions.
df.[1]()
Using cache() stores the DataFrame in memory after the first action, so repeated actions reuse the cached partitions instead of re-running the full lineage, shuffles included.
Complete the code to avoid shuffle by using a broadcast join.
from pyspark.sql.functions import broadcast
joined_df = large_df.join([1](small_df), 'id')
Broadcasting the smaller DataFrame avoids a shuffle by sending a full copy of it to every executor, so the larger DataFrame is joined in place (a broadcast hash join) rather than being repartitioned by key.
Fix the error in the code so that repartitioning by a column is used correctly.
df = df.repartition([1])
Repartitioning by a column name performs one shuffle that colocates rows with the same key, so subsequent joins or aggregations on that column can reuse the layout and avoid further shuffles.
Fill both blanks to create a DataFrame with partitioning and avoid shuffle on writes.
df.write.partitionBy([1]).mode([2]).save('path/to/data')
Partitioning the output by a column writes one directory per distinct value, and overwrite mode replaces any existing data at the path; this enables partition pruning on later reads, though the write itself does not eliminate shuffles in the job that produced the data.
Fill all three blanks to create a dictionary comprehension that filters and maps the data correctly.
result = {[1]: [2] for k, [2] in [3] if [2] > 10}
The comprehension uses k.upper() as the key and v as the value, iterates over data.items(), and keeps only entries whose value exceeds 10, filtering the data down before any further processing.
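The completed comprehension, shown on a small sample dictionary (the contents of `data` are an assumed example):

```python
# Sample input; keys are lower-case, values are numbers.
data = {"a": 5, "b": 12, "c": 30}

# Key is upper-cased, value kept as-is, entries with v <= 10 are dropped.
result = {k.upper(): v for k, v in data.items() if v > 10}
```

Filtering inside the comprehension means the rejected entries are never built into the result at all, rather than being removed afterwards.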