
Source connectors in Kafka - Deep Dive

Overview - Source connectors
What is it?
Source connectors are tools that move data from external systems into Apache Kafka. They act as bridges that read data from databases, files, or other services and send it into Kafka topics. This allows Kafka to collect and process data from many different sources in real time. They simplify data integration by removing the need to write custom ingestion code for every source.
Why it matters
Without source connectors, moving data into Kafka would require manual coding and complex setups for each data source. This would slow down development and increase errors. Source connectors automate and standardize data ingestion, making it easier to build real-time data pipelines. They help businesses react faster by having fresh data available in Kafka for processing and analysis.
Where it fits
Before learning source connectors, you should understand basic Kafka concepts like topics, producers, and consumers. After mastering source connectors, you can explore sink connectors, Kafka Connect framework, and building end-to-end streaming pipelines.
Mental Model
Core Idea
Source connectors automatically pull data from external systems and push it into Kafka topics for real-time processing.
Think of it like...
Imagine a water pump that draws water from a river (external system) and sends it through pipes into a storage tank (Kafka). The pump works continuously to keep the tank filled without manual effort.
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│ External      │      │ Source        │      │ Kafka Topic   │
│ System/Data   │─────▶│ Connector     │─────▶│ (Data Stream) │
│ (Database,    │      │ (Bridge)      │      │               │
│ Files, APIs)  │      │               │      │               │
└───────────────┘      └───────────────┘      └───────────────┘
Build-Up - 7 Steps
1
Foundation: Understanding Kafka Topics and Producers
Concept: Learn what Kafka topics and producers are, as source connectors act like producers sending data to topics.
Kafka topics are categories or feeds where messages are stored. Producers are clients that send data to these topics. Source connectors behave like special producers that automatically send data from external systems into Kafka topics without manual coding.
Result
You understand that source connectors produce data into Kafka topics just like any producer client.
Knowing that source connectors are specialized producers helps you see them as automated data senders, not separate systems.
2
Foundation: What Is the Kafka Connect Framework?
Concept: Kafka Connect is a framework that runs connectors to move data between Kafka and other systems.
Kafka Connect manages source and sink connectors. It handles configuration, scaling, and fault tolerance. Source connectors run inside Kafka Connect workers to continuously pull data from external systems and push it into Kafka topics.
Result
You see source connectors as part of a bigger system that manages data movement reliably.
Understanding Kafka Connect clarifies that source connectors are not standalone but run inside a managed environment for stability.
3
Intermediate: Configuring a Source Connector
🤔 Before reading on: do source connectors require custom code, or just configuration? Commit to your answer.
Concept: Source connectors are usually configured with simple JSON or properties files, not custom code.
To set up a source connector, you provide details like the external system's address, authentication, and the Kafka topic to send data to. For example, a JDBC source connector configuration includes database URL, table names, and topic names. Kafka Connect uses this config to run the connector.
Result
You can create a config file that tells Kafka Connect how to pull data from a source and where to send it in Kafka.
Knowing that source connectors are mostly configuration-driven lowers the barrier to integrating new data sources.
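As an illustration of this configuration-driven approach, a JDBC source connector definition might look roughly like the sketch below. The database, credentials, and table names here are hypothetical; the property keys follow the Confluent JDBC source connector.

```json
{
  "name": "inventory-orders-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/inventory",
    "connection.user": "connect_user",
    "connection.password": "secret",
    "table.whitelist": "orders",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "inventory-",
    "poll.interval.ms": "5000"
  }
}
```

Submitting this JSON to Kafka Connect's REST API (POST /connectors) starts the connector; it would then poll the orders table for rows with a higher id than it has already seen and write them to the inventory-orders topic.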
4
Intermediate: How Source Connectors Handle Data Changes
🤔 Before reading on: do source connectors resend all data every time, or send only the changes? Commit to your answer.
Concept: Many source connectors support change data capture (CDC) to send only new or updated data.
Instead of sending entire datasets repeatedly, source connectors can detect and send only changes like inserts, updates, or deletes. For example, a CDC connector reads database logs to capture changes. This reduces data volume and keeps Kafka topics up to date efficiently.
Result
You understand that source connectors can keep Kafka synchronized with source systems in near real-time.
Understanding CDC support explains how source connectors optimize data flow and reduce unnecessary load.
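As a sketch, a log-based CDC connector such as Debezium's MySQL connector is configured in the same declarative style. The hostnames and table names below are made up, and exact property names vary across Debezium versions:

```json
{
  "name": "inventory-cdc-source",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "localhost",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "secret",
    "database.server.id": "184054",
    "topic.prefix": "inventory",
    "table.include.list": "inventory.orders"
  }
}
```

Rather than querying tables, this connector tails the MySQL binlog, so every insert, update, and delete is captured as an event shortly after it is committed.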
5
Intermediate: Fault Tolerance and Offset Management
Concept: Source connectors track their progress using offsets to avoid data loss or duplication.
Kafka Connect stores offsets that mark how much data a source connector has read. If the connector restarts, it resumes from the last offset. This ensures no data is missed or sent twice. Offsets are stored in Kafka topics managed by Kafka Connect.
Result
You know source connectors can recover from failures without losing or repeating data.
Knowing about offset management reveals how source connectors maintain data consistency and reliability.
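Conceptually, each record in Connect's internal offsets topic pairs a source partition with the last committed position. The exact wire format is connector-specific; for an incrementing-mode JDBC connector it looks roughly like this (illustrative only):

```json
{
  "key": ["inventory-orders-source", { "table": "orders" }],
  "value": { "incrementing": 1042 }
}
```

On restart, the connector reads this back and resumes querying from id values greater than 1042 instead of starting over.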
6
Advanced: Scaling Source Connectors for High Throughput
🤔 Before reading on: can one connector instance handle all the data, or are multiple instances needed? Commit to your answer.
Concept: Source connectors can run in parallel across multiple workers to handle large data volumes.
Kafka Connect supports distributed mode where multiple workers share the load. Source connectors can be configured to split data reading by partitions or tables. This parallelism increases throughput and fault tolerance. For example, a JDBC source connector can read different tables on different workers.
Result
You understand how to scale source connectors to handle big data sources efficiently.
Knowing how to scale connectors prevents bottlenecks and supports enterprise-level data pipelines.
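Parallelism is requested declaratively: tasks.max caps how many tasks Connect may spawn, and the connector decides how to split the work (for JDBC, typically one table per task). A hypothetical three-table setup:

```json
{
  "name": "inventory-multi-table-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/inventory",
    "table.whitelist": "orders,customers,shipments",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "inventory-",
    "tasks.max": "3"
  }
}
```

In distributed mode, Connect spreads these tasks across the available workers, so each table can be read in parallel and a failed worker's tasks are reassigned to survivors.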
7
Expert: Custom Source Connectors and Transformation Logic
🤔 Before reading on: can source connectors only move data as-is, or can they modify it in flight? Commit to your answer.
Concept: You can build custom source connectors or use Single Message Transforms (SMTs) to modify data before sending to Kafka.
If existing connectors don't fit your needs, you can write custom connectors in Java using Kafka Connect APIs. Also, SMTs allow simple data changes like filtering fields, masking sensitive data, or changing formats on the fly. This adds flexibility without changing source systems.
Result
You can tailor data ingestion pipelines to complex business rules and privacy needs.
Understanding customization options empowers you to handle unique data scenarios and compliance requirements.
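As a sketch, two of the SMTs that ship with Kafka (MaskField and ReplaceField) can be chained in a connector config to mask one field and drop another; the field names here are hypothetical:

```json
{
  "transforms": "maskSsn,dropNotes",
  "transforms.maskSsn.type": "org.apache.kafka.connect.transforms.MaskField$Value",
  "transforms.maskSsn.fields": "ssn",
  "transforms.dropNotes.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
  "transforms.dropNotes.exclude": "internal_notes"
}
```

Transforms run in the order listed, on each record, before it is written to Kafka. (Older Kafka versions name the exclude property blacklist.)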
Under the Hood
Source connectors run inside Kafka Connect workers as tasks that continuously poll external systems. They use APIs or logs of the source system to fetch new data. The connector converts this data into Kafka records and sends them to specified topics. Kafka Connect manages offsets to track progress and stores them in internal Kafka topics. If a worker fails, another takes over using stored offsets to avoid data loss or duplication.
Why designed this way?
Kafka Connect was designed to separate data integration logic from application code, enabling reusable connectors. The offset system underpins at-least-once delivery guarantees, and with careful configuration, exactly-once semantics. Running connectors inside a managed framework allows scaling, fault tolerance, and centralized configuration. Alternatives like custom scripts lacked reliability and standardization, so Kafka Connect provides a robust, pluggable architecture.
┌───────────────┐       ┌───────────────┐       ┌───────────────┐
│ External      │       │ Kafka Connect │       │ Kafka Broker  │
│ System/Data   │──────▶│ Worker with   │──────▶│ (Stores Data) │
│ (Database,    │       │ Source        │       │               │
│ Files, APIs)  │       │ Connector     │       │               │
└───────────────┘       └───────────────┘       └───────────────┘
                                │                       ▲
                                ▼                       │
                        ┌───────────────┐               │
                        │ Offset Storage│───────────────┘
                        │ (Kafka Topic) │
                        └───────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Do source connectors always send full data dumps every time? Commit yes or no.
Common Belief:Source connectors send the entire dataset from the source every time they run.
Reality:Most source connectors support incremental updates or change data capture to send only new or changed data.
Why it matters:Sending full data repeatedly wastes bandwidth and processing, causing delays and higher costs.
Quick: Can source connectors run without Kafka Connect framework? Commit yes or no.
Common Belief:Source connectors are standalone programs that run independently of Kafka Connect.
Reality:Source connectors run inside Kafka Connect workers which manage their lifecycle, scaling, and fault tolerance.
Why it matters:Running connectors outside Kafka Connect loses benefits like offset management and distributed scaling.
Quick: Do source connectors guarantee no data loss or duplication by default? Commit yes or no.
Common Belief:Source connectors always guarantee exactly-once delivery of data to Kafka topics.
Reality:Source connectors typically provide at-least-once delivery; exactly-once requires careful configuration and sometimes additional processing.
Why it matters:Assuming exactly-once can lead to data duplication issues if not handled properly in production.
Quick: Are source connectors only for databases? Commit yes or no.
Common Belief:Source connectors only work with databases as data sources.
Reality:Source connectors can ingest data from many systems including files, message queues, APIs, and cloud services.
Why it matters:Limiting source connectors to databases restricts their powerful role in diverse data integration scenarios.
Expert Zone
1
Some source connectors support schema evolution, automatically adapting to changes in source data structure without downtime.
2
Offset storage in Kafka topics allows connectors to be stateless and easily recover from failures or restarts.
3
Single Message Transforms (SMTs) provide lightweight, in-flight data modification without needing full custom connector development.
When NOT to use
Avoid source connectors when data sources have very low update frequency or require complex transformations better handled upstream. In such cases, batch ETL tools or custom ingestion pipelines may be more appropriate.
Production Patterns
In production, source connectors are often deployed in distributed Kafka Connect clusters for high availability. Teams use monitoring and alerting on connector health and lag. Combining CDC connectors with SMTs enables real-time, clean, and compliant data streams into Kafka.
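Connector health is typically watched through Connect's REST API; a response from GET /connectors/&lt;name&gt;/status looks roughly like this (connector name and worker addresses are illustrative):

```json
{
  "name": "inventory-orders-source",
  "connector": { "state": "RUNNING", "worker_id": "10.0.0.5:8083" },
  "tasks": [
    { "id": 0, "state": "RUNNING", "worker_id": "10.0.0.5:8083" },
    { "id": 1, "state": "FAILED", "worker_id": "10.0.0.6:8083", "trace": "..." }
  ]
}
```

A common alerting rule is to page whenever any task leaves the RUNNING state, then inspect the trace field for the failure cause.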
Connections
Change Data Capture (CDC)
Source connectors often implement CDC to efficiently capture data changes.
Understanding CDC helps grasp how source connectors minimize data transfer and keep Kafka topics up to date.
ETL (Extract, Transform, Load)
Source connectors perform the 'Extract' and sometimes 'Transform' steps in ETL pipelines.
Knowing ETL concepts clarifies the role of source connectors in broader data workflows.
Water Supply Systems
Like pumps moving water from a river to a tank, source connectors move data from sources to Kafka.
This cross-domain view highlights the continuous, automated nature of data ingestion.
Common Pitfalls
#1 Configuring a source connector without specifying topic names.
Wrong approach:
{
  "name": "my-source-connector",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/mydb",
    "mode": "incrementing",
    "incrementing.column.name": "id"
  }
}
Correct approach:
{
  "name": "my-source-connector",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/mydb",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "mydb-"
  }
}
Root cause: Learners forget that source connectors need to know which Kafka topics to send data to, causing connector failures or silent data loss.
#2 Running a source connector in standalone mode when production needs high availability.
Wrong approach:Starting Kafka Connect in standalone mode with a single worker for critical data ingestion.
Correct approach:Deploying Kafka Connect in distributed mode with multiple workers for fault tolerance and scalability.
Root cause:Misunderstanding Kafka Connect modes leads to fragile setups that cannot handle failures or scale.
#3 Ignoring offset storage, leading to duplicate data on restart.
Wrong approach:Deleting Kafka Connect internal topics or misconfiguring offset storage causing connector to reprocess all data.
Correct approach:Preserving offset topics and configuring connectors to resume from last committed offset.
Root cause:Not understanding offset management causes data duplication or loss during connector restarts.
Key Takeaways
Source connectors automate moving data from external systems into Kafka topics without custom coding.
They run inside Kafka Connect framework which manages configuration, scaling, and fault tolerance.
Most source connectors support incremental data capture to efficiently send only new or changed data.
Offset management ensures connectors can recover from failures without losing or duplicating data.
Advanced users can customize connectors or apply transformations to tailor data ingestion pipelines.