
Hadoop in cloud (EMR, Dataproc, HDInsight)

Introduction

Running Hadoop on a managed cloud service lets you process large data sets without buying or maintaining hardware. Clusters are created on demand and billed for the time they run.

Typical reasons to use it:
- You want to analyze large data sets without buying servers.
- You need to run data jobs quickly and scale the cluster up or down.
- You want to use Hadoop tools without a complex manual setup.
- You want to process data close to where it sits in cloud storage.
- You want to pay only for the compute you use.
Syntax
Use a managed Hadoop service such as:
- AWS EMR
- Google Dataproc
- Azure HDInsight

Each service lets you create a Hadoop cluster from its command-line tool or its web console.

The service installs and configures Hadoop on the cluster nodes and handles resizing the cluster for you.

You can submit jobs from the command line or from the cloud console.
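
As a rough illustration of command-line job submission, the sketch below submits the stock MapReduce word count example to a Dataproc cluster with gcloud dataproc jobs submit hadoop. The bucket name my-bucket and the input and output paths are placeholders, and the jar path assumes the examples jar that Dataproc images normally ship. The run helper is an assumption added for safety: it prints the command by default and only executes it when DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch: submit the built-in word count example to a Dataproc cluster.
# CLUSTER, REGION and BUCKET are placeholder values; replace them with your own.
CLUSTER="${CLUSTER:-my-cluster}"
REGION="${REGION:-us-central1}"
BUCKET="${BUCKET:-my-bucket}"

# Print the command by default; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Everything after the bare "--" is passed to the word count program itself.
submit_wordcount() {
  run gcloud dataproc jobs submit hadoop \
    --cluster="$CLUSTER" \
    --region="$REGION" \
    --jar=file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    -- wordcount "gs://$BUCKET/input" "gs://$BUCKET/output"
}

submit_wordcount
```

Run the script as is to review the command, then rerun it with DRY_RUN=0 once the cluster and bucket names are correct.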

Examples
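This command creates a 3-node cluster on AWS EMR with Hadoop and Spark installed.
# AWS EMR example to create a cluster
# --use-default-roles attaches the default EMR roles (create them once with: aws emr create-default-roles)
aws emr create-cluster --name "TestCluster" --release-label emr-6.9.0 --applications Name=Hadoop Name=Spark --ec2-attributes KeyName=myKey --instance-type m5.xlarge --instance-count 3 --use-default-roles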
This command creates a single-node Hadoop cluster on Google Dataproc with the Jupyter component enabled.
# Google Dataproc example to create a cluster
# Image version 2.0 ships with Miniconda by default, so JUPYTER is the only optional component requested
gcloud dataproc clusters create my-cluster --region=us-central1 --single-node --image-version=2.0-debian10 --optional-components=JUPYTER
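This command creates a Hadoop cluster with three worker nodes on Azure HDInsight.
# Azure HDInsight example to create a cluster
# myStorageAccount is a placeholder for an existing storage account; the CLI prompts for the cluster login (HTTP) password
az hdinsight create --name my-hadoop-cluster --resource-group myResourceGroup --type hadoop --location eastus --cluster-tier Standard --workernode-count 3 --storage-account myStorageAccount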
Sample Program

This example shows how to create an EMR cluster, run a simple word count job, and check the output in S3 storage.

# Sample AWS EMR job submission using AWS CLI
# Step 1: Create cluster (the command prints the new cluster ID)
aws emr create-cluster --name "SampleCluster" --release-label emr-6.9.0 --applications Name=Hadoop --ec2-attributes KeyName=myKey --instance-type m5.xlarge --instance-count 3 --use-default-roles

# Step 2: Submit a Hadoop streaming job (replace j-XXXXXXXXXXXXX with the ID from step 1)
# JSON syntax keeps the two comma-separated -files paths together as a single argument
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps '[{"Type":"STREAMING","Name":"WordCount","ActionOnFailure":"CONTINUE","Args":["-files","s3://mybucket/mapper.py,s3://mybucket/reducer.py","-mapper","mapper.py","-reducer","reducer.py","-input","s3://mybucket/input","-output","s3://mybucket/output"]}]'

# Step 3: Check output once the step has completed
aws s3 ls s3://mybucket/output/
Output
If the job succeeds, the listing shows an empty _SUCCESS marker file alongside the part-* result files.
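
The sample refers to mapper.py and reducer.py without showing them. As a stand-in, the sketch below imitates the same Hadoop Streaming flow locally with standard Unix tools: a mapper that emits one tab-separated word and count per line, a sort that plays the role of the shuffle phase, and a reducer that sums the counts for each word. The function names map_words and reduce_counts are hypothetical, and these are not the author's scripts.

```shell
#!/usr/bin/env bash
# Local sketch of the Hadoop Streaming contract using standard Unix tools.

# Mapper: read text on stdin and emit one "word<TAB>1" line per word.
map_words() {
  tr -s '[:space:]' '\n' | awk 'NF { print $1 "\t1" }'
}

# Reducer: read key-sorted "word<TAB>count" lines and sum the counts per word.
reduce_counts() {
  awk -F '\t' '
    $1 != prev { if (NR > 1) print prev "\t" total; prev = $1; total = 0 }
    { total += $2 }
    END { if (NR > 0) print prev "\t" total }
  '
}

# sort stands in for the shuffle step Hadoop runs between map and reduce.
printf 'the quick fox\nthe lazy dog\n' | map_words | LC_ALL=C sort | reduce_counts
```

Testing the map and reduce logic on a small file this way is usually cheaper than debugging it on a running cluster.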
Important Notes

Cloud Hadoop services handle hardware and software setup for you.

You are billed for as long as the cluster runs, so terminate or delete it when your jobs are done.
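
To make the cleanup step concrete, here is a minimal sketch of the teardown command for each provider, reusing the cluster names from the examples above. The function names and the run helper are assumptions added here; run prints each command by default and executes it only when DRY_RUN=0 is set, because deleting a cluster cannot be undone.

```shell
#!/usr/bin/env bash
# Sketch: shut down clusters so they stop accruing charges.

# Print the command by default; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# AWS EMR: terminate a cluster by its ID.
teardown_emr() {
  run aws emr terminate-clusters --cluster-ids "$1"
}

# Google Dataproc: delete a cluster by name and region, skipping the prompt.
teardown_dataproc() {
  run gcloud dataproc clusters delete "$1" --region="$2" --quiet
}

# Azure HDInsight: delete a cluster by name and resource group, skipping the prompt.
teardown_hdinsight() {
  run az hdinsight delete --name "$1" --resource-group "$2" --yes
}

teardown_emr j-XXXXXXXXXXXXX
teardown_dataproc my-cluster us-central1
teardown_hdinsight my-hadoop-cluster myResourceGroup
```

Data written to cloud storage (S3, Cloud Storage, or Azure Storage) is kept after the cluster is gone, so job output is not lost.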

Each cloud provider has its own commands and console interface.

Summary

Hadoop in the cloud lets you process big data without managing servers.

Use AWS EMR, Google Dataproc, or Azure HDInsight to create Hadoop clusters easily.

You can run Hadoop jobs and store results in cloud storage.