Which statement best describes when batch ingestion processes data in a Hadoop ecosystem?
Think about how batch jobs usually run on a schedule, not instantly.
Batch ingestion collects data over time and processes it in bulk at scheduled times, whereas real-time ingestion processes each record as soon as it arrives.
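The difference can be sketched in plain Python. This is a minimal illustration and not Hadoop code: the function names `ingest_batch` and `ingest_realtime` are hypothetical, and a fixed batch size stands in for the scheduled window a real batch job would use.

```python
from typing import Callable, Iterable, List


def ingest_batch(records: Iterable[str], batch_size: int,
                 process: Callable[[List[str]], None]) -> None:
    """Buffer records and hand them to `process` in bulk (hypothetical helper)."""
    buffer: List[str] = []
    for record in records:
        buffer.append(record)
        if len(buffer) == batch_size:
            process(buffer)
            buffer = []  # start a fresh buffer for the next batch
    if buffer:
        process(buffer)  # flush the final, partially filled batch


def ingest_realtime(records: Iterable[str],
                    process: Callable[[List[str]], None]) -> None:
    """Hand each record to `process` as soon as it is seen (hypothetical helper)."""
    for record in records:
        process([record])


records = ["r1", "r2", "r3", "r4", "r5"]
batch_calls: List[List[str]] = []
realtime_calls: List[List[str]] = []

ingest_batch(records, 2, batch_calls.append)
ingest_realtime(records, realtime_calls.append)

print(batch_calls)     # [['r1', 'r2'], ['r3', 'r4'], ['r5']]
print(realtime_calls)  # [['r1'], ['r2'], ['r3'], ['r4'], ['r5']]
```

The batch path calls the processor three times with grouped records, while the real-time path calls it once per record.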
Given the following pseudo-code simulating real-time ingestion in Hadoop streaming, what will be the output after processing 3 data points?
stream = ['data1', 'data2', 'data3']
processed = []
for item in stream:
    processed.append(item.upper())
print(processed)
Check what the upper() method does to strings.
The upper() method converts all characters in a string to uppercase, so each data point becomes uppercase and the output is ['DATA1', 'DATA2', 'DATA3'].
Consider a Hadoop system where batch ingestion processes 10,000 records every hour, and real-time ingestion processes 100 records every minute. How many records does each process handle in 3 hours?
Calculate total records by multiplying rate by time for both methods.
Batch: 10,000 records/hour * 3 hours = 30,000 records.
Real-time: 100 records/minute * 60 minutes/hour * 3 hours = 18,000 records.
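The same arithmetic can be checked with a few lines of Python. The constant names are chosen here for illustration; the rates and duration come from the question.

```python
BATCH_RECORDS_PER_HOUR = 10_000
REALTIME_RECORDS_PER_MINUTE = 100
MINUTES_PER_HOUR = 60
HOURS = 3

# Batch: the rate is already per hour, so multiply by the number of hours.
batch_total = BATCH_RECORDS_PER_HOUR * HOURS

# Real-time: convert the per-minute rate to per-hour before multiplying.
realtime_total = REALTIME_RECORDS_PER_MINUTE * MINUTES_PER_HOUR * HOURS

print(batch_total)     # 30000
print(realtime_total)  # 18000
```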
What error will the following Hadoop streaming code produce?
data_stream = ['a', 'b', 'c']
result = []
for d in data_stream:
    result.append(d / 2)
print(result)
Consider what happens when dividing a string by a number.
You cannot divide a string by an integer, so Python raises a TypeError.
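To see the error without crashing the script, the loop can be wrapped in a try/except block. This is only a sketch for inspecting the exception; the variable `error_message` is introduced here for illustration.

```python
data_stream = ['a', 'b', 'c']
result = []
error_message = None

try:
    for d in data_stream:
        # Fails on the first item: str does not support division by int.
        result.append(d / 2)
except TypeError as exc:
    error_message = str(exc)

print(error_message)  # unsupported operand type(s) for /: 'str' and 'int'
print(result)         # [] because the first division already failed
```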
You manage a Hadoop system that collects sensor data. The sensors send data every second, and you need to detect anomalies within 5 seconds of data arrival. Which ingestion method is best to meet this requirement?
Think about how quickly you need to respond to data to detect anomalies.
Real-time ingestion processes data immediately, allowing anomaly detection within seconds, which fits the 5-second requirement.
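A rough sketch of per-record anomaly checking follows. It is plain Python, not a Hadoop API: the function `detect_anomalies`, the sample readings, and the threshold of 50.0 are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, Tuple

Reading = Tuple[int, float]  # (timestamp in seconds, sensor value)


def detect_anomalies(readings: Iterable[Reading],
                     threshold: float) -> Iterator[Reading]:
    """Yield each reading above `threshold` as soon as it is seen (hypothetical helper)."""
    for timestamp, value in readings:
        if value > threshold:
            yield timestamp, value


# Made-up sample data: one reading per second, with one obvious spike.
readings = [(0, 20.1), (1, 20.4), (2, 95.0), (3, 20.2)]

anomalies = list(detect_anomalies(readings, threshold=50.0))
print(anomalies)  # [(2, 95.0)]
```

Because each reading is checked as it arrives, the spike at second 2 is flagged without waiting for a scheduled batch window to close.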