docker-hadoop installation steps

 

1. pull docker image
sudo docker pull kiwenlau/hadoop:1.0
2. clone github repository
git clone https://github.com/kiwenlau/hadoop-cluster-docker
3. create hadoop network
sudo docker network create --driver=bridge hadoop
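Optionally, confirm the bridge network exists before starting the containers:

sudo docker network ls
sudo docker network inspect hadoop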
4. start containers and update /etc/hosts on hadoop-master, hadoop-slave1 and hadoop-slave2
vi /etc/hosts
www.xxx.yyy.aaa hadoop-master
www.xxx.yyy.bbb hadoop-slave1
www.xxx.yyy.ccc hadoop-slave2

cd hadoop-cluster-docker
sudo ./start-container.sh

output:

start hadoop-master container...
start hadoop-slave1 container...
start hadoop-slave2 container...
root@hadoop-master:~# 
  • starts 3 containers: 1 master and 2 slaves
  • you are dropped into the /root directory of the hadoop-master container
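As a quick check of the /etc/hosts entries above, the slave hostnames should be reachable from the master prompt (assuming ping is available in the container):

ping -c 1 hadoop-slave1
ping -c 1 hadoop-slave2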
5. start hadoop
./start-hadoop.sh
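To verify that the daemons started, jps (bundled with the JDK) can be run on the master; with Hadoop 2.x you would expect processes such as NameNode and ResourceManager, and the web UIs normally listen on ports 50070 (NameNode) and 8088 (ResourceManager):

jps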
6. run wordcount
./run-wordcount.sh

output

input file1.txt:
Hello Hadoop

input file2.txt:
Hello Docker

wordcount output:
Docker    1
Hadoop    1
Hello    2
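run-wordcount.sh wraps the usual HDFS and MapReduce steps. To run wordcount on your own input instead, a minimal sketch (the file name myfile.txt and the examples-jar path are illustrative and may differ in this image):

# examples jar location depends on the image layout
hdfs dfs -mkdir -p input
hdfs dfs -put myfile.txt input
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount input output
hdfs dfs -cat output/part-r-00000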
7. rebuild the docker image to resize the cluster
sudo ./resize-cluster.sh 4
  • specify a parameter greater than 1 (e.g. 2, 3, ...)
  • this script just rebuilds the hadoop image with a different slaves file, which specifies the names of all slave nodes
8. start containers with the new cluster size
sudo ./start-container.sh 4
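After resizing, start Hadoop again inside the new master container; one way to confirm that all slaves registered is to check the datanode report (a sketch, assuming the resized cluster came up cleanly):

./start-hadoop.sh
hdfs dfsadmin -report | grep "Live datanodes"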

docker-spark installation steps

Goal: build a Spark cluster with 1 master and 2 workers

Minimum spec: 4 cores, 2 GB of memory

 

1. Install Docker

Install the required packages

$ sudo apt-get update
$ sudo apt-get install apt-transport-https ca-certificates

Fetch the GPG key

$ sudo apt-key adv \
--keyserver hkp://ha.pool.sks-keyservers.net:80 \
--recv-keys 58118E89F3A912897C070ADBF76221572C52609D

Update sources.list for Ubuntu 14 (trusty)

$ echo "deb https://apt.dockerproject.org/repo ubuntu-trusty main" | sudo tee /etc/apt/sources.list.d/docker.list

On Ubuntu, additionally install the dependency packages below.

Install the dependency packages

$ sudo apt-get update
$ sudo apt-get install linux-image-extra-$(uname -r) linux-image-extra-virtual

Reference:
https://docs.docker.com/engine/installation/linux/ubuntulinux/#/prerequisites-by-ubuntu-version

Install docker-engine

$ sudo apt-get update
$ sudo apt-get install docker-engine

Reference: https://docs.docker.com/engine/installation/linux/ubuntulinux/#/install-the-latest-version
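To confirm the Docker daemon works, the standard smoke test is:

$ sudo docker run hello-world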

 

2. Install Docker Compose

$ sudo curl -L https://github.com/docker/compose/releases/download/1.6.2/docker-compose-`uname -s`-`uname -m` -o /usr/local/bin/docker-compose
$ sudo chmod +x /usr/local/bin/docker-compose
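Check the installed version:

$ docker-compose --version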

 

3. Prepare the docker-spark files

Create a spark folder at any path and lay out the docker-spark directories as follows:

spark
├── docker-compose.yml
├── conf
│   ├── master
│   │   └── spark-defaults.conf
│   └── worker
│       └── spark-defaults.conf
└── data
    └── test.jar
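The layout can be created in one step, for example:

$ mkdir -p spark/conf/master spark/conf/worker spark/data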

 

 docker-compose.yml

master:
  image: gettyimages/spark
  command: bin/spark-class org.apache.spark.deploy.master.Master -h master
  hostname: master
  environment:
    MASTER: spark://master:7077
    SPARK_CONF_DIR: /conf
    SPARK_PUBLIC_DNS: localhost
  expose:
    - 7001
    - 7002
    - 7003
    - 7004
    - 7005
    - 7006
    - 7077
    - 6066
  ports:
    - 4040:4040
    - 6066:6066
    - 7077:7077
    - 8080:8080
  volumes:
    - ./conf/master:/conf
    - ./data:/tmp/data

worker1:
  image: gettyimages/spark
  command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
  hostname: worker1
  environment:
    SPARK_CONF_DIR: /conf
    SPARK_WORKER_CORES: 2
    SPARK_WORKER_MEMORY: 1g
    SPARK_WORKER_PORT: 8881
    SPARK_WORKER_WEBUI_PORT: 8082
    SPARK_PUBLIC_DNS: localhost
  links:
    - master
  expose:
    - 7012
    - 7013
    - 7014
    - 7015
    - 7016
    - 8881
  ports:
    - 8082:8082
  volumes:
    - ./conf/worker:/conf
    - ./data:/tmp/data

worker2:
  image: gettyimages/spark
  command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
  hostname: worker2
  environment:
    SPARK_CONF_DIR: /conf
    SPARK_WORKER_CORES: 2
    SPARK_WORKER_MEMORY: 1g
    SPARK_WORKER_PORT: 8881
    SPARK_WORKER_WEBUI_PORT: 8081
    SPARK_PUBLIC_DNS: localhost
  links:
    - master
  expose:
    - 7012
    - 7013
    - 7014
    - 7015
    - 7016
    - 8881
  ports:
    - 8081:8081
  volumes:
    - ./conf/worker:/conf
    - ./data:/tmp/data
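Before starting the cluster, the file can be syntax-checked; docker-compose prints the resolved configuration or an error:

$ docker-compose config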

 

    conf/master/spark-defaults.conf

# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.

spark.driver.port 7001
spark.fileserver.port 7002
spark.broadcast.port 7003
spark.replClassServer.port 7004
spark.blockManager.port 7005
spark.executor.port 7006

spark.broadcast.factory=org.apache.spark.broadcast.HttpBroadcastFactory
spark.port.maxRetries 4

    conf/worker/spark-defaults.conf

# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.

#spark.driver.port 7101
spark.fileserver.port 7012
spark.broadcast.port 7013
spark.replClassServer.port 7014
spark.blockManager.port 7015
spark.executor.port 7016

spark.broadcast.factory=org.apache.spark.broadcast.HttpBroadcastFactory
spark.port.maxRetries 4

4. Start docker-compose

Start docker-compose from the spark folder:

docker-compose up

 

Confirm that the log shows "INFO worker.Worker: Successfully registered with master spark://master:7077".
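If the cluster was started in the background with docker-compose up -d, the same registration message can be checked per service:

$ docker-compose logs worker1
$ docker-compose logs worker2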

 

5. Log in to the master container

root@ubuntu:~/spark# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
aae12a0f7d12 gettyimages/spark "bin/spark-class org." 2 hours ago Up 17 minutes 7012-7016/tcp, 8881/tcp, 0.0.0.0:8082->8082/tcp spark_worker1_1
dfe9cbd3c607 gettyimages/spark "bin/spark-class org." 2 hours ago Up 17 minutes 7012-7016/tcp, 8881/tcp, 0.0.0.0:8081->8081/tcp spark_worker2_1
c2e21a4b4808 gettyimages/spark "bin/spark-class org." 24 hours ago Up 17 minutes 0.0.0.0:4040->4040/tcp, 0.0.0.0:6066->6066/tcp, 0.0.0.0:7077->7077/tcp, 0.0.0.0:8080->8080/tcp, 7001-7006/tcp spark_master_1
root@ubuntu:~/spark# docker exec -it c2e21a4b4808 /bin/bash

 

Run a sample program

 root@master:/usr/spark-2.0.2# ./bin/run-example SparkPi 100
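Because the data folder is mounted at /tmp/data in every container, test.jar can also be submitted to the cluster. A sketch; the main class com.example.Test is only a placeholder for whatever class the jar actually contains:

root@master:/usr/spark-2.0.2# ./bin/spark-submit --master spark://master:7077 --class com.example.Test /tmp/data/test.jar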