SparkProject Notes

2019, Jan 05    


1. Server environment setup (required on every machine)

1.1 Set a static IP and edit the hosts file
1.2 Disable IPv6, SELinux, and the firewall
1.3 Install the JDK

Install the JDK on one machine, then copy it directly to the others.

Run the following in /usr/local:
scp -r ./jdk1.8.0_121 root@sparkproject-02:/usr/local/
scp /etc/profile root@sparkproject-02:/etc/

scp -r ./jdk1.8.0_121 root@sparkproject-03:/usr/local/
scp /etc/profile root@sparkproject-03:/etc/
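
A quick check that the JDK copy worked (a minimal sketch; run on each of the three machines after sourcing /etc/profile):
source /etc/profile
java -version          # should report java version "1.8.0_121"
echo $JAVA_HOME        # should print /usr/local/jdk1.8.0_121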
1.4 Passwordless SSH login

On each of the three machines, configure passwordless SSH to the local machine, then configure passwordless SSH between all three machines.
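
One way to set this up (a minimal sketch; assumes the root account and the hostnames used above):
# On each of the three machines: generate a key pair, then push the public key
# to all three hosts (including the machine itself)
ssh-keygen -t rsa
ssh-copy-id root@sparkproject-01
ssh-copy-id root@sparkproject-02
ssh-copy-id root@sparkproject-03
# Verify: these should log in without asking for a password
ssh root@sparkproject-02 hostname
ssh root@sparkproject-03 hostname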

2. Install Hadoop

Upload hadoop-2.5.0-cdh5.3.6.tar.gz to machine 01 and do the configuration on 01 first.

2.1 Edit the Hadoop configuration

All Hadoop configuration files are under /usr/local/hadoop/etc/hadoop.

  • Edit /etc/profile and add the environment variables
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
  • Edit hadoop-env.sh
# The java implementation to use.
export JAVA_HOME=/usr/local/jdk1.8.0_121
  • Edit core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://sparkproject-01:9000</value>
  </property>
</configuration>
  • Edit hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/usr/local/data/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/usr/local/data/datanode</value>
  </property>
  <property>
    <name>dfs.tmp.dir</name>
    <value>/usr/local/data/tmp</value>
  </property>
</configuration>
  • Edit mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
  • Edit yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>sparkproject-01</value>
  </property>
</configuration>
  • Edit the slaves file
 // First remove localhost, then add:
sparkproject-02
sparkproject-03
2.2 Sync the configuration to the other hosts (worker nodes)

Remember to create the data directory under /usr/local on sparkproject-02 and sparkproject-03.
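
A minimal sketch for doing this from sparkproject-01 (assumes the passwordless SSH set up in step 1.4):
ssh root@sparkproject-02 "mkdir -p /usr/local/data"
ssh root@sparkproject-03 "mkdir -p /usr/local/data"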

// Use scp to copy the Hadoop installation and /etc/profile to the worker nodes
scp -r /usr/local/hadoop root@sparkproject-02:/usr/local
scp /etc/profile root@sparkproject-02:/etc/

scp -r /usr/local/hadoop root@sparkproject-03:/usr/local
scp /etc/profile root@sparkproject-03:/etc/

// Source /etc/profile on each node so it takes effect
source /etc/profile

2.3 Start the HDFS cluster

  • Format the NameNode: run the following command on sparkproject-01 (the master node)

hdfs namenode -format

  • Start the HDFS cluster
[root@sparkproject-01 sbin]# start-dfs.sh
18/05/16 16:36:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [sparkproject-01]
sparkproject-01: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-sparkproject-01.bookcycle.cn.out 【NameNode started on sparkproject-01】
sparkproject-02: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-sparkproject-02.bookcycle.cn.out 【DataNode started on sparkproject-02】
sparkproject-03: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-sparkproject-03.bookcycle.cn.out 【DataNode started on sparkproject-03】
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is SHA256:7CINJAzRNELAEMIz9eWh1s/PNrhLQi9TO6sMqVvUp78.
ECDSA key fingerprint is MD5:cd:fe:54:8d:f2:e3:a5:ba:a9:8a:9d:f0:82:5f:36:53.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-sparkproject-01.bookcycle.cn.out 【SecondaryNameNode started on sparkproject-01】
18/05/16 16:37:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

The NameNode and SecondaryNameNode are started on the master node, and a DataNode is started on each worker node.

Check the Java processes on sparkproject-01: NameNode and SecondaryNameNode started successfully.

[root@sparkproject-01 sbin]# jps
1925 NameNode
2200 Jps
2091 SecondaryNameNode

Check the Java processes on sparkproject-02: DataNode started successfully.

[root@sparkproject-02 local]# jps
1776 Jps
1708 DataNode

Check the Java processes on sparkproject-03: DataNode started successfully.

[root@sparkproject-03 local]# jps
1699 DataNode
1768 Jps
  • Open the HDFS web UI

http://sparkproject-01:50070
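
Besides the web UI, the cluster can be checked from the command line (a quick sanity check on sparkproject-01; the /tmp/check path is just a throwaway):
# Should report two live datanodes (sparkproject-02 and sparkproject-03)
hdfs dfsadmin -report
# Round-trip a small file to confirm HDFS is writable
hdfs dfs -mkdir -p /tmp/check
hdfs dfs -put /etc/hosts /tmp/check/
hdfs dfs -cat /tmp/check/hosts
hdfs dfs -rm -r /tmp/check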

2.4 Start the YARN cluster
  • start-yarn.sh
[root@sparkproject-01 sbin]# start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-root-resourcemanager-sparkproject-01.bookcycle.cn.out 【ResourceManager started on sparkproject-01】
sparkproject-03: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-sparkproject-03.bookcycle.cn.out 【NodeManager started on sparkproject-03】
sparkproject-02: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-sparkproject-02.bookcycle.cn.out 【NodeManager started on sparkproject-02】

The ResourceManager is started on the master node, and a NodeManager is started on each worker node.

[root@sparkproject-01 sbin]# jps
2336 ResourceManager
2610 Jps
1925 NameNode
2091 SecondaryNameNode

--------------------------

[root@sparkproject-02 local]# jps
1831 NodeManager
1708 DataNode
1950 Jps

--------------------------

[root@sparkproject-03 local]# jps
1699 DataNode
1945 Jps
1823 NodeManager
  • Open the YARN web UI

    http://sparkproject-01:8088
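
A command-line check of the NodeManagers (run on sparkproject-01):
# Should list two RUNNING nodes: sparkproject-02 and sparkproject-03
yarn node -list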

3. Install Hive

Upload hive-0.13.1-cdh5.3.6.tar.gz to machine 01; the configuration is done only on 01.

3.1 Install MySQL (master node)
  • Install MySQL on the master node and enable it to start at boot (follow a standard MySQL-on-Linux installation guide).

  • Run yum install -y mysql-connector-java to obtain mysql-connector-java.jar (if you already have mysql-connector-java.jar you can skip this command, which avoids it pulling in upgrades such as a newer JDK), then copy /usr/share/java/mysql-connector-java.jar to /usr/local/hive/lib.

  • In MySQL, create the Hive metastore database, create the hive account, and grant it privileges

    create database if not exists hive_metadata; 【create the Hive metastore database】
    grant all privileges on hive_metadata.* to 'hive'@'%' identified by 'hive'; 【create the hive account with password hive, allowed to use hive_metadata with all privileges from any host】
    grant all privileges on hive_metadata.* to 'hive'@'localhost' identified by 'hive'; 【grant the hive user all privileges on hive_metadata when logging in from localhost】
    grant all privileges on hive_metadata.* to 'hive'@'sparkproject-01.bookcycle.cn' identified by 'hive'; 【grant the hive user all privileges on hive_metadata when logging in from sparkproject-01.bookcycle.cn】
    【with the latter two grants in place, local logins match these more specific host entries rather than '%'; note the host must be the full sparkproject-01.bookcycle.cn, including the .bookcycle.cn domain】
    flush privileges;
    use hive_metadata; 【check that the database was created successfully】
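
To confirm the hive account and its grants actually work, a quick check from the shell (a minimal sketch; assumes MySQL is running on sparkproject-01):
# Log in as the hive user and make sure hive_metadata is visible
mysql -h sparkproject-01 -u hive -phive -e "show databases; use hive_metadata; show tables;"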
    
3.2 Edit the Hive configuration

All Hive configuration files are under /usr/local/hive/conf.

  • Edit /etc/profile and add the environment variables
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin

Run source /etc/profile to reload the configuration.

  • Edit hive-site.xml
// First, save hive-default.xml.template as hive-site.xml
cp hive-default.xml.template hive-site.xml
vi hive-site.xml
// Change the following properties

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://sparkproject-01:3306/hive_metadata?createDatabaseIfNotExist=true</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
</property>
  • Edit hive-env.sh
// First, save hive-env.sh.template as hive-env.sh, then add the following
export JAVA_HOME=/usr/local/jdk1.8.0_121
export HADOOP_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive

3.3 Test Hive

[root@sparkproject-01 ~]# hive
18/05/16 20:27:52 WARN conf.HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect.  Use hive.hmshandler.retry.* instead

Logging initialized using configuration in jar:file:/usr/local/hive/lib/hive-common-0.13.1-cdh5.3.6.jar!/hive-log4j.properties
hive> exit;
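
Beyond entering and exiting the CLI, a small smoke test confirms the MySQL metastore connection works (a sketch; users_test is just a throwaway table name):
# If the metastore connection is misconfigured, CREATE TABLE is where it fails
hive -e "create table users_test (id int, name string); show tables; drop table users_test;"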

4. Install ZooKeeper

Upload zookeeper-3.4.5-cdh5.3.6.tar.gz to machine 01 and do the configuration on 01 first.

4.1 Edit the ZooKeeper configuration

The ZooKeeper configuration files are in /usr/local/zookeeper/conf.

  • Edit /etc/profile and add the environment variables
export ZOOKEEPER_HOME=/usr/local/zookeeper
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin:$ZOOKEEPER_HOME/bin

Run source /etc/profile to reload the configuration.

  • Configure zoo.cfg
cp zoo_sample.cfg zoo.cfg
vi zoo.cfg
Change: dataDir=/usr/local/zk/data
Append at the end (this lists the three nodes; an odd number of nodes is generally required):
server.0=sparkproject-01:2888:3888
server.1=sparkproject-02:2888:3888
server.2=sparkproject-03:2888:3888
  • Create the /usr/local/zk/data directory

In /usr/local/zk/data, run vi myid and set the file's contents to 0.

[root@sparkproject-01 data]# cat myid 
0
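
Equivalently, the directory and the myid file can be created in one go (a sketch for sparkproject-01):
mkdir -p /usr/local/zk/data
echo 0 > /usr/local/zk/data/myid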
4.2 Sync to the other nodes
scp -r /usr/local/zookeeper root@sparkproject-02:/usr/local
scp -r /usr/local/zk root@sparkproject-02:/usr/local
scp /etc/profile root@sparkproject-02:/etc/
source /etc/profile

scp -r /usr/local/zookeeper root@sparkproject-03:/usr/local
scp -r /usr/local/zk root@sparkproject-03:/usr/local
scp /etc/profile root@sparkproject-03:/etc/
source /etc/profile

// Change myid on the 2nd and 3rd machines to 1 and 2, respectively
[root@sparkproject-02 data]# cat myid 
1

[root@sparkproject-03 data]# cat myid 
2

Run zkServer.sh start on each of the three machines:

[root@sparkproject-02 data]# zkServer.sh start
JMX enabled by default
Using config: /usr/local/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED

Check the ZooKeeper status

[root@sparkproject-01 ~]# zkServer.sh status
JMX enabled by default
Using config: /usr/local/zookeeper/bin/../conf/zoo.cfg
Mode: follower

--------------------------

[root@sparkproject-02 ~]# zkServer.sh status
JMX enabled by default
Using config: /usr/local/zookeeper/bin/../conf/zoo.cfg
Mode: follower

--------------------------

[root@sparkproject-03 ~]# zkServer.sh status
JMX enabled by default
Using config: /usr/local/zookeeper/bin/../conf/zoo.cfg
Mode: leader
【one leader and two followers, as expected】

Then run jps to check that a QuorumPeerMain process is running on each of the three nodes.
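
A sketch that runs both checks on all three nodes from sparkproject-01 (assumes passwordless SSH and that /etc/profile sets up the PATH on every node):
for host in sparkproject-01 sparkproject-02 sparkproject-03; do
  echo "== $host =="
  ssh root@$host "source /etc/profile; zkServer.sh status; jps | grep QuorumPeerMain"
done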

5. Set up Kafka

Upload scala-2.11.4.tgz and kafka_2.9.2-0.8.1.tgz to machine 01 and do the configuration on 01 first.

5.1 Install Scala
  • Edit /etc/profile
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin

Run source /etc/profile to reload the configuration.

  • Check that Scala installed successfully
[root@sparkproject-01 scala]# scala -version
Scala code runner version 2.11.4 -- Copyright 2002-2013, LAMP/EPFL
5.2 Sync Scala to the other nodes
scp -r /usr/local/scala root@sparkproject-02:/usr/local
scp /etc/profile root@sparkproject-02:/etc/
source /etc/profile

scp -r /usr/local/scala root@sparkproject-03:/usr/local
scp /etc/profile root@sparkproject-03:/etc/
source /etc/profile
5.3 Install Kafka

The Kafka configuration files are under /usr/local/kafka/config.

  • Configure Kafka (server.properties)
// Change the following entries
broker.id=0
zookeeper.connect=192.168.38.141:2181,192.168.38.142:2181,192.168.38.143:2181 【be sure to use IP addresses here】
  • Install slf4j

Upload slf4j-1.7.6.zip to /usr/local on machine 01, run unzip slf4j-1.7.6.zip, then copy slf4j-nop-1.7.6.jar from the slf4j directory into Kafka's libs directory.

cp /usr/local/slf4j-1.7.6/slf4j-nop-1.7.6.jar /usr/local/kafka/libs/
5.4 Sync Kafka to the other nodes
scp -r /usr/local/kafka root@sparkproject-02:/usr/local
scp -r /usr/local/kafka root@sparkproject-03:/usr/local

// The only difference is broker.id in server.properties, which must be set to 1 and 2
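
A sketch for patching broker.id on the two worker nodes after the scp (assumes the copied file still contains broker.id=0 from machine 01):
ssh root@sparkproject-02 "sed -i 's/^broker.id=0/broker.id=1/' /usr/local/kafka/config/server.properties"
ssh root@sparkproject-03 "sed -i 's/^broker.id=0/broker.id=2/' /usr/local/kafka/config/server.properties"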
5.5 Start the Kafka cluster
// In the kafka directory on each of the three machines, run
nohup bin/kafka-server-start.sh config/server.properties &
This Kafka build may not support the JDK version installed on our machines:
1. Fix the Kafka error "Unrecognized VM option 'UseCompressedOops'"
vi /usr/local/kafka/bin/kafka-run-class.sh 
if [ -z "$KAFKA_JVM_PERFORMANCE_OPTS" ]; then
  KAFKA_JVM_PERFORMANCE_OPTS="-server  -XX:+UseCompressedOops -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+CMSScavengeBeforeRemark -XX:+DisableExplicitGC -Djava.awt.headless=true"
fi
Removing -XX:+UseCompressedOops from that line fixes it.

The same change must be made on the other two machines!
scp /usr/local/kafka/bin/kafka-run-class.sh root@sparkproject-02:/usr/local/kafka/bin/
scp /usr/local/kafka/bin/kafka-run-class.sh root@sparkproject-03:/usr/local/kafka/bin/

Then start Kafka
[root@sparkproject-01 kafka]# nohup bin/kafka-server-start.sh config/server.properties &
[1] 1980
[root@sparkproject-01 kafka]# nohup: ignoring input and appending output to 'nohup.out'

Check whether Kafka started successfully (you can also run cat nohup.out):

[root@sparkproject-01 kafka]# jps
2024 Jps
1980 Kafka
1725 QuorumPeerMain

[root@sparkproject-02 kafka]# jps
2868 Kafka
2908 Jps
2221 QuorumPeerMain

[root@sparkproject-03 kafka]# jps
2915 Jps
2216 QuorumPeerMain
2876 Kafka

Test the Kafka cluster (note: there were problems at this step)

[root@sparkproject-01 kafka]# bin/kafka-topics.sh --zookeeper 192.168.38.141:2181,192.168.38.142:2181,192.168.38.143:2181 --topic Test2Topic --replication-factor 1 --partitions 1 --create
Created topic "TestTopic". 【create a topic】

[root@sparkproject-01 kafka]# bin/kafka-console-producer.sh --broker-list 192.168.38.141:9092,192.168.38.142:9092,192.168.38.143:9092 --topic Test2Topic 【start a producer; --broker-list must point at the Kafka brokers (default port 9092), not at ZooKeeper on 2181 — likely the source of the problems noted above】

[root@sparkproject-02 kafka]# ssh sparkproject-01
Last login: Thu May 17 14:15:24 2018 from 192.168.38.1
[root@sparkproject-01 ~]# cd ..
[root@sparkproject-01 /]# cd /usr/local/kafka
[root@sparkproject-01 kafka]# bin/kafka-console-consumer.sh --zookeeper 192.168.38.141:2181,192.168.38.142:2181,192.168.38.143:2181 --topic Test2Topic --from-beginning 【start a consumer】

6. Set up Flume

Do the configuration on machine 01 first, then sync it to the other machines.

6.1 Configure Flume

/usr/local/flume/conf/flume-conf.properties

  • Edit /etc/profile

    export FLUME_HOME=/usr/local/flume
    export FLUME_CONF_DIR=$FLUME_HOME/conf
    export PATH=$PATH:$FLUME_HOME/bin
    
    
  • Edit the Flume configuration file

cp flume-conf.properties.template flume-conf.properties


agent1.sources=source1
agent1.sinks=sink1
agent1.channels=channel1
Change to:
# agent1 is the agent name
agent1.sources=source1
agent1.sinks=sink1
agent1.channels=channel1

agent.sources.seqGenSrc.channels = memoryChannel
Change to:
# configure source1
agent1.sources.source1.type=spooldir
agent1.sources.source1.spoolDir=/usr/local/logs
agent1.sources.source1.channels=channel1
agent1.sources.source1.fileHeader = false
agent1.sources.source1.interceptors = i1
agent1.sources.source1.interceptors.i1.type = timestamp

agent.channels.memoryChannel.capacity = 100
agent.channels.memoryChannel.type = memory
Change to:
# configure channel1
agent1.channels.channel1.type=file
agent1.channels.channel1.checkpointDir=/usr/local/logs_tmp_cp
agent1.channels.channel1.dataDirs=/usr/local/logs_tmp

agent.sinks.loggerSink.channel = memoryChannel
Change to:
# configure sink1
agent1.sinks.sink1.type=hdfs
agent1.sinks.sink1.hdfs.path=hdfs://sparkproject-01:9000/logs
agent1.sinks.sink1.hdfs.fileType=DataStream
agent1.sinks.sink1.hdfs.writeFormat=TEXT
agent1.sinks.sink1.hdfs.rollInterval=1
agent1.sinks.sink1.channel=channel1
agent1.sinks.sink1.hdfs.filePrefix=%Y-%m-%d

  • Create the directories
Local directory: mkdir /usr/local/logs
HDFS directory: hdfs dfs -mkdir /logs (make sure HDFS is already running)
6.2 Start the Flume agent
[root@sparkproject-01 conf]# flume-ng agent -n agent1 -c conf -f /usr/local/flume/conf/flume-conf.properties -Dflume.root.logger=DEBUG,console

Then open a new session on machine 01 to test it:
Create a file and move it into /usr/local/logs; Flume will automatically upload it to the /logs directory on HDFS.
[root@sparkproject-01 ~]# cd ..
[root@sparkproject-01 /]# cd /usr/local
[root@sparkproject-01 local]# vi ids
[root@sparkproject-01 local]# mv ids logs
[root@sparkproject-01 local]# hdfs dfs -lsr /logs
lsr: DEPRECATED: Please use 'ls -R' instead.
18/05/17 17:21:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-rw-r--r--   2 root supergroup        985 2018-05-17 17:20 /logs/2018-05-17.1526548803616

7. Set up Spark

Do the configuration on machine 01 first, then sync it to the other machines.

7.1 Edit /etc/profile
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
7.2 Edit spark-env.sh

In the /usr/local/spark/conf directory:

cp spark-env.sh.template spark-env.sh
vi spark-env.sh
// Append the following at the end
export JAVA_HOME=/usr/local/jdk1.8.0_121
export SCALA_HOME=/usr/local/scala
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
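
A quick sanity check on sparkproject-01 after sourcing /etc/profile (a sketch; the examples jar path is an assumption and may differ between Spark builds):
# Should print the Spark and Scala versions
spark-submit --version
# Optionally run a small job in local mode (adjust the examples jar path as needed)
spark-submit --class org.apache.spark.examples.SparkPi --master local \
  /usr/local/spark/lib/spark-examples-*.jar 10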