Which component is primarily responsible for moving data from Kafka to Hadoop in a typical integration setup?
Think about the tool designed to connect Kafka with external systems like Hadoop.
Kafka Connect with HDFS Sink Connector is designed to stream data from Kafka topics directly into Hadoop's HDFS storage.
Given the following Kafka Connect HDFS Sink configuration snippet, what is the expected output directory structure in HDFS?
name=hdfs-sink-connector
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=test-topic
hdfs.url=hdfs://namenode:8020
flush.size=3
format.class=io.confluent.connect.hdfs.avro.AvroFormat
topics.dir=/user/kafka/topics
Look at the topic name, flush size, and format class in the config.
With the default partitioner, the connector writes to /user/kafka/topics/test-topic/partition={partition}/ (the configured topics directory, then the topic name, then one subdirectory per Kafka partition), committing a new Avro file after every 3 records as set by flush.size.
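Assuming the default partitioner and the configuration above, the resulting layout would look roughly like this (file names shown follow the connector's topic+partition+startOffset+endOffset naming convention; exact names depend on the offsets actually consumed):

```
/user/kafka/topics/test-topic/partition=0/test-topic+0+0000000000+0000000002.avro
/user/kafka/topics/test-topic/partition=0/test-topic+0+0000000003+0000000005.avro
/user/kafka/topics/test-topic/partition=1/test-topic+1+0000000000+0000000002.avro
```

Each file covers exactly three records, matching flush.size=3.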
When using Kafka Connect HDFS Sink with the default Avro format, what is the schema of the stored data files?
Consider how Avro format handles schema and data storage.
Avro data files embed the writer's schema in the file header along with the records, making each file self-describing and enabling schema evolution without any external metadata store.
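The idea can be illustrated with a short sketch. This is not the real Avro binary encoding; it simply mimics the Object Container File layout (schema in the header, records after it) to show why such files are self-describing, using only the Python standard library:

```python
import io
import json

# Hypothetical schema for illustration; real Avro schemas use the same
# JSON structure, but the on-disk encoding below is simplified.
SCHEMA = {
    "type": "record",
    "name": "Click",
    "fields": [
        {"name": "user", "type": "string"},
        {"name": "page", "type": "string"},
    ],
}

def write_container(records):
    """Write a schema header followed by the data, like an Avro data file."""
    buf = io.StringIO()
    buf.write(json.dumps({"avro.schema": SCHEMA}) + "\n")  # embedded schema
    for rec in records:
        buf.write(json.dumps(rec) + "\n")
    return buf.getvalue()

def read_container(blob):
    """Recover both schema and records from the file contents alone."""
    lines = blob.splitlines()
    header = json.loads(lines[0])
    records = [json.loads(line) for line in lines[1:]]
    return header["avro.schema"], records

blob = write_container([{"user": "alice", "page": "/home"}])
schema, records = read_container(blob)
print(schema["name"])      # → Click
print(records[0]["user"])  # → alice
```

Because the reader recovers the schema from the file itself, a newer reader schema can still decode older files, which is the basis of Avro's schema evolution.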
Given a Kafka Connect HDFS Sink connector failing with error 'Failed to write to HDFS: Permission denied', what is the most likely cause?
Think about file system permissions and access rights.
The error indicates that the user the connector runs as lacks write permission on the target HDFS directory (the configured topics directory or its parents), so HDFS rejects the write.
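A typical diagnosis looks like the following. These commands are illustrative: the kafka user and the /user/kafka/topics path are assumptions matching the earlier configuration, and the chown must be run as an HDFS superuser:

```
# Inspect ownership and mode of the target directory
hdfs dfs -ls /user/kafka

# Ensure the directory exists and is writable by the connector's user
hdfs dfs -mkdir -p /user/kafka/topics
hdfs dfs -chown -R kafka:kafka /user/kafka/topics
```

If Kerberos is enabled, also verify that the connector's principal maps to the expected HDFS user.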
You want to optimize the Kafka Connect HDFS Sink to reduce small file creation and improve throughput. Which configuration change is most effective?
Think about how flush.size affects file creation frequency.
Increasing flush.size makes the connector buffer more records before committing, so it writes larger files less frequently; this reduces small-file overhead on HDFS and improves throughput, at the cost of higher end-to-end latency before data is visible.
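A sketch of a tuned configuration fragment (the values are illustrative, not recommendations; rotate.interval.ms is the connector's time-based rotation setting):

```
# Commit larger files less often
flush.size=10000
# Also rotate on time so low-traffic partitions still commit files regularly
rotate.interval.ms=600000
```

Pairing a large flush.size with time-based rotation avoids records sitting unflushed indefinitely on slow topics.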