
Source connectors in Kafka - Deep Dive

Overview - Source connectors
What is it?
Source connectors are tools that move data from external systems into Apache Kafka. They act as bridges that read data from databases, files, or other services and send it into Kafka topics. This allows Kafka to collect and process data from many different sources in real time. They simplify data integration by removing the need to write custom ingestion code for every source.
Why it matters
Without source connectors, moving data into Kafka would require manual coding and complex setups for each data source. This would slow down development and increase errors. Source connectors automate and standardize data ingestion, making it easier to build real-time data pipelines. They help businesses react faster by having fresh data available in Kafka for processing and analysis.
Where it fits
Before learning source connectors, you should understand basic Kafka concepts like topics, producers, and consumers. After mastering source connectors, you can explore sink connectors, Kafka Connect framework, and building end-to-end streaming pipelines.
Mental Model
Core Idea
Source connectors automatically pull data from external systems and push it into Kafka topics for real-time processing.
Think of it like...
Imagine a water pump that draws water from a river (external system) and sends it through pipes into a storage tank (Kafka). The pump works continuously to keep the tank filled without manual effort.
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│ External      │      │ Source        │      │ Kafka Topic   │
│ System/Data   │─────▶│ Connector     │─────▶│ (Data Stream) │
│ (Database,    │      │ (Bridge)      │      │               │
│ Files, APIs)  │      │               │      │               │
└───────────────┘      └───────────────┘      └───────────────┘
Build-Up - 7 Steps
1
Foundation: Understanding Kafka Topics and Producers
Concept: Learn what Kafka topics and producers are, as source connectors act like producers sending data to topics.
Kafka topics are categories or feeds where messages are stored. Producers are clients that send data to these topics. Source connectors behave like special producers that automatically send data from external systems into Kafka topics without manual coding.
Result
You understand that source connectors produce data into Kafka topics just like any producer client.
Knowing that source connectors are specialized producers helps you see them as automated data senders, not separate systems.
2
Foundation: What Is the Kafka Connect Framework?
Concept: Kafka Connect is a framework that runs connectors to move data between Kafka and other systems.
Kafka Connect manages source and sink connectors. It handles configuration, scaling, and fault tolerance. Source connectors run inside Kafka Connect workers to continuously pull data from external systems and push it into Kafka topics.
Result
You see source connectors as part of a bigger system that manages data movement reliably.
Understanding Kafka Connect clarifies that source connectors are not standalone but run inside a managed environment for stability.
3
Intermediate: Configuring a Source Connector
🤔 Before reading on: do source connectors require custom code, or just configuration? Commit to your answer.
Concept: Source connectors are usually configured with simple JSON or properties files, not custom code.
To set up a source connector, you provide details like the external system's address, authentication, and the Kafka topic to send data to. For example, a JDBC source connector configuration includes database URL, table names, and topic names. Kafka Connect uses this config to run the connector.
Result
You can create a config file that tells Kafka Connect how to pull data from a source and where to send it in Kafka.
Knowing that source connectors are mostly configuration-driven lowers the barrier to integrating new data sources.
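As an illustration of this configuration-driven approach, a JDBC source connector definition might look roughly like the sketch below. The database, credentials, and table names here are hypothetical; the property keys follow the Confluent JDBC source connector.

```json
{
  "name": "inventory-orders-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/inventory",
    "connection.user": "connect_user",
    "connection.password": "secret",
    "table.whitelist": "orders",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "inventory-",
    "poll.interval.ms": "5000"
  }
}
```

Submitting this JSON to Kafka Connect's REST API (POST /connectors) starts the connector; it would then poll the orders table for rows with a higher id than it has already seen and write them to the inventory-orders topic.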
4
Intermediate: How Source Connectors Handle Data Changes
🤔 Before reading on: do source connectors resend all data every time, or send only the changes? Commit to your answer.
Concept: Many source connectors support change data capture (CDC) to send only new or updated data.
Instead of sending entire datasets repeatedly, source connectors can detect and send only changes like inserts, updates, or deletes. For example, a CDC connector reads database logs to capture changes. This reduces data volume and keeps Kafka topics up to date efficiently.
Result
You understand that source connectors can keep Kafka synchronized with source systems in near real-time.
Understanding CDC support explains how source connectors optimize data flow and reduce unnecessary load.
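As a sketch, a log-based CDC connector such as Debezium's MySQL connector is configured in the same declarative style. The hostnames and table names below are made up, and exact property names vary across Debezium versions:

```json
{
  "name": "inventory-cdc-source",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "localhost",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "secret",
    "database.server.id": "184054",
    "topic.prefix": "inventory",
    "table.include.list": "inventory.orders"
  }
}
```

Rather than querying tables, this connector tails the MySQL binlog, so every insert, update, and delete is captured as an event shortly after it is committed.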
5
Intermediate: Fault Tolerance and Offset Management
Concept: Source connectors track their progress using offsets to avoid data loss or duplication.
Kafka Connect stores offsets that mark how much data a source connector has read. If the connector restarts, it resumes from the last offset. This ensures no data is missed or sent twice. Offsets are stored in Kafka topics managed by Kafka Connect.
Result
You know source connectors can recover from failures without losing or repeating data.
Knowing about offset management reveals how source connectors maintain data consistency and reliability.
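Conceptually, each record in Connect's internal offsets topic pairs a source partition with the last committed position. The exact wire format is connector-specific; for an incrementing-mode JDBC connector it looks roughly like this (illustrative only):

```json
{
  "key": ["inventory-orders-source", { "table": "orders" }],
  "value": { "incrementing": 1042 }
}
```

On restart, the connector reads this back and resumes querying from id values greater than 1042 instead of starting over.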
6
Advanced: Scaling Source Connectors for High Throughput
🤔 Before reading on: can one connector instance handle all the data, or are multiple instances needed? Commit to your answer.
Concept: Source connectors can run in parallel across multiple workers to handle large data volumes.
Kafka Connect supports distributed mode where multiple workers share the load. Source connectors can be configured to split data reading by partitions or tables. This parallelism increases throughput and fault tolerance. For example, a JDBC source connector can read different tables on different workers.
Result
You understand how to scale source connectors to handle big data sources efficiently.
Knowing how to scale connectors prevents bottlenecks and supports enterprise-level data pipelines.
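Parallelism is requested declaratively: tasks.max caps how many tasks Connect may spawn, and the connector decides how to split the work (for JDBC, typically one table per task). A hypothetical three-table setup:

```json
{
  "name": "inventory-multi-table-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/inventory",
    "table.whitelist": "orders,customers,shipments",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "inventory-",
    "tasks.max": "3"
  }
}
```

In distributed mode, Connect spreads these tasks across the available workers, so each table can be read in parallel and a failed worker's tasks are reassigned to survivors.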
7
Expert: Custom Source Connectors and Transformation Logic
🤔 Before reading on: can source connectors only move data as-is, or can they modify it in flight? Commit to your answer.
Concept: You can build custom source connectors or use Single Message Transforms (SMTs) to modify data before sending to Kafka.
If existing connectors don't fit your needs, you can write custom connectors in Java using Kafka Connect APIs. Also, SMTs allow simple data changes like filtering fields, masking sensitive data, or changing formats on the fly. This adds flexibility without changing source systems.
Result
You can tailor data ingestion pipelines to complex business rules and privacy needs.
Understanding customization options empowers you to handle unique data scenarios and compliance requirements.
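As a sketch, two of the SMTs that ship with Kafka (MaskField and ReplaceField) can be chained in a connector config to mask one field and drop another; the field names here are hypothetical:

```json
{
  "transforms": "maskSsn,dropNotes",
  "transforms.maskSsn.type": "org.apache.kafka.connect.transforms.MaskField$Value",
  "transforms.maskSsn.fields": "ssn",
  "transforms.dropNotes.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
  "transforms.dropNotes.exclude": "internal_notes"
}
```

Transforms run in the order listed, on each record, before it is written to Kafka. (Older Kafka versions name the exclude property blacklist.)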
Under the Hood
Source connectors run inside Kafka Connect workers as tasks that continuously poll external systems. They use APIs or logs of the source system to fetch new data. The connector converts this data into Kafka records and sends them to specified topics. Kafka Connect manages offsets to track progress and stores them in internal Kafka topics. If a worker fails, another takes over using stored offsets to avoid data loss or duplication.
Why designed this way?
Kafka Connect was designed to separate data integration logic from application code, enabling reusable connectors. The offset system underpins at-least-once delivery guarantees, and with careful configuration, exactly-once semantics. Running connectors inside a managed framework allows scaling, fault tolerance, and centralized configuration. Alternatives like custom scripts lacked reliability and standardization, so Kafka Connect provides a robust, pluggable architecture.
┌───────────────┐       ┌───────────────┐       ┌───────────────┐
│ External      │       │ Kafka Connect │       │ Kafka Broker  │
│ System/Data   │──────▶│ Worker with   │──────▶│ (Stores Data) │
│ (Database,    │       │ Source        │       │               │
│ Files, APIs)  │       │ Connector     │       │               │
└───────────────┘       └───────────────┘       └───────────────┘
                                │                       ▲
                                ▼                       │
                        ┌───────────────┐               │
                        │ Offset Storage│───────────────┘
                        │ (Kafka Topic) │
                        └───────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Do source connectors always send full data dumps every time? Commit yes or no.
Common Belief:Source connectors send the entire dataset from the source every time they run.
Reality:Most source connectors support incremental updates or change data capture to send only new or changed data.
Why it matters:Sending full data repeatedly wastes bandwidth and processing, causing delays and higher costs.
Quick: Can source connectors run without Kafka Connect framework? Commit yes or no.
Common Belief:Source connectors are standalone programs that run independently of Kafka Connect.
Reality:Source connectors run inside Kafka Connect workers which manage their lifecycle, scaling, and fault tolerance.
Why it matters:Running connectors outside Kafka Connect loses benefits like offset management and distributed scaling.
Quick: Do source connectors guarantee no data loss or duplication by default? Commit yes or no.
Common Belief:Source connectors always guarantee exactly-once delivery of data to Kafka topics.
Reality:Source connectors typically provide at-least-once delivery; exactly-once requires careful configuration and sometimes additional processing.
Why it matters:Assuming exactly-once can lead to data duplication issues if not handled properly in production.
Quick: Are source connectors only for databases? Commit yes or no.
Common Belief:Source connectors only work with databases as data sources.
Reality:Source connectors can ingest data from many systems including files, message queues, APIs, and cloud services.
Why it matters:Limiting source connectors to databases restricts their powerful role in diverse data integration scenarios.
Expert Zone
1
Some source connectors support schema evolution, automatically adapting to changes in source data structure without downtime.
2
Offset storage in Kafka topics allows connectors to be stateless and easily recover from failures or restarts.
3
Single Message Transforms (SMTs) provide lightweight, in-flight data modification without needing full custom connector development.
When NOT to use
Avoid source connectors when data sources have very low update frequency or require complex transformations better handled upstream. In such cases, batch ETL tools or custom ingestion pipelines may be more appropriate.
Production Patterns
In production, source connectors are often deployed in distributed Kafka Connect clusters for high availability. Teams use monitoring and alerting on connector health and lag. Combining CDC connectors with SMTs enables real-time, clean, and compliant data streams into Kafka.
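Connector health is typically watched through Connect's REST API; a response from GET /connectors/&lt;name&gt;/status looks roughly like this (connector name and worker addresses are illustrative):

```json
{
  "name": "inventory-orders-source",
  "connector": { "state": "RUNNING", "worker_id": "10.0.0.5:8083" },
  "tasks": [
    { "id": 0, "state": "RUNNING", "worker_id": "10.0.0.5:8083" },
    { "id": 1, "state": "FAILED", "worker_id": "10.0.0.6:8083", "trace": "..." }
  ]
}
```

A common alerting rule is to page whenever any task leaves the RUNNING state, then inspect the trace field for the failure cause.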
Connections
Change Data Capture (CDC)
Source connectors often implement CDC to efficiently capture data changes.
Understanding CDC helps grasp how source connectors minimize data transfer and keep Kafka topics up to date.
ETL (Extract, Transform, Load)
Source connectors perform the 'Extract' and sometimes 'Transform' steps in ETL pipelines.
Knowing ETL concepts clarifies the role of source connectors in broader data workflows.
Water Supply Systems
Like pumps moving water from a river to a tank, source connectors move data from sources to Kafka.
This cross-domain view highlights the continuous, automated nature of data ingestion.
Common Pitfalls
#1 Configuring a source connector without specifying topic names.
Wrong approach:
{
  "name": "my-source-connector",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/mydb",
    "mode": "incrementing",
    "incrementing.column.name": "id"
  }
}
Correct approach:
{
  "name": "my-source-connector",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/mydb",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "mydb-"
  }
}
Root cause: Learners forget that source connectors need to know which Kafka topics to send data to, causing connector failures or silent data loss.
#2 Running a source connector in standalone mode when production needs high availability.
Wrong approach:Starting Kafka Connect in standalone mode with a single worker for critical data ingestion.
Correct approach:Deploying Kafka Connect in distributed mode with multiple workers for fault tolerance and scalability.
Root cause:Misunderstanding Kafka Connect modes leads to fragile setups that cannot handle failures or scale.
#3 Ignoring offset storage, leading to duplicate data on restart.
Wrong approach:Deleting Kafka Connect internal topics or misconfiguring offset storage causing connector to reprocess all data.
Correct approach:Preserving offset topics and configuring connectors to resume from last committed offset.
Root cause:Not understanding offset management causes data duplication or loss during connector restarts.
Key Takeaways
Source connectors automate moving data from external systems into Kafka topics without custom coding.
They run inside Kafka Connect framework which manages configuration, scaling, and fault tolerance.
Most source connectors support incremental data capture to efficiently send only new or changed data.
Offset management ensures connectors can recover from failures without losing or duplicating data.
Advanced users can customize connectors or apply transformations to tailor data ingestion pipelines.