Hadoop · Concept · Beginner · 4 min read

What is Big Data in Hadoop: Explanation and Example

Big data in Hadoop refers to extremely large and complex datasets that traditional tools cannot handle efficiently. Hadoop is a framework that stores and processes big data across many computers using HDFS for storage and MapReduce for processing.
⚙️

How It Works

Imagine you have a huge library of books that is too big for one person to read alone. Hadoop works like a team of readers who split the books into smaller parts and read them at the same time. This way, the work gets done faster and can handle much more information than one person could.

Hadoop stores big data using HDFS (Hadoop Distributed File System), which breaks data into blocks and spreads them across many computers. Then, it uses MapReduce to process the data by dividing tasks into smaller jobs that run in parallel, combining results at the end. This makes handling big data efficient and scalable.
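To build intuition for the map and reduce phases before looking at real Hadoop code, here is a minimal plain-Java sketch (no Hadoop required, names like `MiniMapReduce` are made up for illustration). It mimics the same idea: the "map" step splits lines into words in parallel, and the "reduce" step groups identical words and combines their counts.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MiniMapReduce {
    // Counts word occurrences, imitating MapReduce in miniature:
    // map = split lines into words (in parallel),
    // shuffle + reduce = group equal words and sum their counts.
    public static Map<String, Long> wordCount(List<String> lines) {
        return lines.parallelStream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts =
                wordCount(List.of("big data big", "data hadoop"));
        System.out.println("big -> " + counts.get("big"));       // big -> 2
        System.out.println("hadoop -> " + counts.get("hadoop")); // hadoop -> 1
    }
}
```

The real framework does the same thing, except the "lines" live in HDFS blocks on many machines and the parallel work runs on a cluster instead of local CPU cores.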

💻

Example

This example shows a simple Hadoop MapReduce program in Java that counts how many times each word appears in a text file. It demonstrates how Hadoop processes big data by splitting the work into map tasks that run in parallel and reduce tasks that combine the results.

java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: for every word in each input line, emit the pair (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      String[] words = value.toString().split("\\s+");
      for (String w : words) {
        if (w.isEmpty()) continue; // skip empty tokens from leading whitespace
        word.set(w);
        context.write(word, one);
      }
    }
  }

  // Reducer: receives all counts for one word and sums them into the total.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    // Configure the job: mapper, combiner (local pre-aggregation), reducer,
    // output types, and the input/output paths passed on the command line.
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Output (one tab-separated word and count per line)

word1	3
word2	5
word3	2
🎯

When to Use

Use Hadoop big data solutions when you have very large datasets that cannot fit on a single computer or when processing needs to be done quickly by splitting tasks. Examples include analyzing social media data, processing logs from websites, or handling large scientific datasets.

Hadoop is ideal when data is too big, too fast, or too complex for traditional databases and tools. It helps businesses gain insights from massive data to make better decisions.

Key Points

  • Big data means huge, complex data sets.
  • Hadoop stores data across many computers using HDFS.
  • MapReduce processes data in parallel for speed.
  • It is useful for large-scale data analysis in many fields.

Key Takeaways

  • Hadoop handles big data by distributing storage and processing across many computers.
  • It uses HDFS to store data in blocks and MapReduce to process data in parallel.
  • Hadoop is best for very large or complex datasets that traditional tools cannot manage.
  • Common uses include social media analysis, log processing, and scientific data handling.