Command-Line Installation#

If you prefer to deploy Spark from the command line, follow the steps in this chapter.

This section assumes the yum repository has already been configured on the machine with IP 192.168.1.10.

Spark Standalone Mode Installation#

Prerequisites#

Spark Standalone mode depends on a Zookeeper ensemble and an HDFS cluster: Zookeeper provides Spark HA, and the HDFS cluster stores the application history data.

For Zookeeper deployment, see: Zookeeper Installation
For HDFS deployment, see: HDFS Installation

The Zookeeper service address is assumed to be oushu1:2181,oushu2:2181,oushu3:2181
The HDFS nameservice address is assumed to be hdfs://oushu

First log in to oushu1, then switch to the root user

ssh oushu1
su - root

Create a sparkhosts file containing all machines in the Spark cluster

cat > ${HOME}/sparkhosts << EOF
oushu1
oushu2
oushu3
oushu4
EOF

Create a sparkmasters file containing all master machines in the Spark cluster

cat > ${HOME}/sparkmasters << EOF
oushu1
oushu2
EOF

Create a sparkworkers file containing all worker machines in the Spark cluster

cat > ${HOME}/sparkworkers << EOF
oushu1
oushu2
oushu3
EOF

Configure the yum repository on the oushu1 node and install the lava command-line management tool

# Fetch the repo file from the machine hosting the yum repository
scp oushu@192.168.1.10:/etc/yum.repos.d/oushu.repo /etc/yum.repos.d/oushu.repo
# Append the host entry for the yum repository machine to /etc/hosts
# Install the lava command-line management tool
yum clean all
yum makecache
yum install lava

Exchange public keys between oushu1 and the other nodes in the cluster so that ssh can log in without a password and configuration files can be distributed.

lava ssh-exkeys -f ${HOME}/sparkhosts -p ********
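
To confirm that passwordless ssh now works across the cluster, a quick optional check using the same lava ssh form as the rest of this guide:

lava ssh -f ${HOME}/sparkhosts -e "hostname"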

Distribute the repo file to the other machines

lava scp -f ${HOME}/sparkhosts /etc/yum.repos.d/oushu.repo =:/etc/yum.repos.d

Installation#

Install Spark with yum install

lava ssh -f ${HOME}/sparkhosts -e "sudo yum install -y spark"

Configuration#

Spark configuration parameters are kept in spark-defaults.conf and spark-env.sh. Template configuration files can be found in /usr/local/oushu/spark/conf.empty.

spark-defaults.conf: default configuration read by the program; the settings are loaded into SparkConf and have the lowest precedence

spark-env.sh: environment variables used when starting the Spark processes

The configuration files only take effect when placed in Spark's configuration directory: /usr/local/oushu/conf/spark
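
As a hypothetical illustration of the precedence rule (the key and values below are only for demonstration, and the command assumes the cluster has already been started as described later in this section): a --conf flag passed to spark-submit overrides the same key in spark-defaults.conf, while values set in application code via SparkConf override both.

/usr/local/oushu/spark/bin/spark-submit \
  --conf spark.executor.memory=2g \
  --class org.apache.spark.examples.SparkPi \
  --master spark://oushu1:2882,oushu2:2882 \
  /usr/local/oushu/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100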

Prepare Data Directories#

Spark Workers save Driver execution logs to the local file system, so Spark must be given a usable local path

lava ssh -f ${HOME}/sparkhosts -e "mkdir -p /data1/spark/sparkwork"
lava ssh -f ${HOME}/sparkhosts -e "chown -R spark:spark /data1/spark"

Spark saves application run history to the HDFS cluster, so Spark must be given a usable HDFS path.
Log in to the HDFS cluster and run the following commands to create the HDFS path

sudo -u hdfs hdfs dfs -mkdir -p /spark/spark-history
sudo -u hdfs hdfs dfs -chown -R spark:spark /spark
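
Optionally verify that the directory was created and is owned by spark:

sudo -u hdfs hdfs dfs -ls /spark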

Configure Dependent Clusters#

Log in to oushu1 and switch to the root user

ssh oushu1
su - root

Add the HDFS configuration file /usr/local/oushu/conf/spark/core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://oushu</value>
    </property>
</configuration>

Add the HDFS configuration file /usr/local/oushu/conf/spark/hdfs-site.xml

hdfs-site.xml template

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>rpc.client.timeout</name>
        <value>3600000</value>
    </property>
    <property>
        <name>rpc.client.connect.tcpnodelay</name>
        <value>true</value>
    </property>
    <property>
        <name>rpc.client.max.idle</name>
        <value>10000</value>
    </property>
    <property>
        <name>rpc.client.ping.interval</name>
        <value>10000</value>
    </property>
    <property>
        <name>rpc.client.connect.timeout</name>
        <value>600000</value>
    </property>
    <property>
        <name>rpc.client.connect.retry</name>
        <value>10</value>
    </property>
    <property>
        <name>rpc.client.read.timeout</name>
        <value>3600000</value>
    </property>
    <property>
        <name>rpc.client.write.timeout</name>
        <value>3600000</value>
    </property>
    <property>
        <name>rpc.client.socket.linger.timeout</name>
        <value>-1</value>
    </property>
    <property>
        <name>dfs.client.read.shortcircuit</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.default.replica</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.prefetchsize</name>
        <value>10</value>
    </property>
    <property>
        <name>dfs.client.failover.max.attempts</name>
        <value>15</value>
    </property>
    <property>
        <name>dfs.default.blocksize</name>
        <value>134217728</value>
    </property>
    <property>
        <name>dfs.client.log.severity</name>
        <value>INFO</value>
    </property>
    <property>
        <name>input.connect.timeout</name>
        <value>600000</value>
    </property>
    <property>
        <name>input.read.timeout</name>
        <value>3600000</value>
    </property>
    <property>
        <name>input.write.timeout</name>
        <value>3600000</value>
    </property>
    <property>
        <name>input.localread.default.buffersize</name>
        <value>2097152</value>
    </property>
    <property>
        <name>input.localread.blockinfo.cachesize</name>
        <value>1000</value>
    </property>
    <property>
        <name>input.read.getblockinfo.retry</name>
        <value>3</value>
    </property>
    <property>
        <name>output.replace-datanode-on-failure</name>
        <value>false</value>
    </property>
    <property>
        <name>output.default.chunksize</name>
        <value>512</value>
    </property>
    <property>
        <name>output.default.packetsize</name>
        <value>65536</value>
    </property>
    <property>
        <name>output.default.write.retry</name>
        <value>10</value>
    </property>
    <property>
        <name>output.connect.timeout</name>
        <value>600000</value>
    </property>
    <property>
        <name>output.read.timeout</name>
        <value>3600000</value>
    </property>
    <property>
        <name>output.write.timeout</name>
        <value>3600000</value>
    </property>
    <property>
        <name>output.packetpool.size</name>
        <value>1024</value>
    </property>
    <property>
        <name>output.close.timeout</name>
        <value>900000</value>
    </property>
    <property>
        <name>dfs.domain.socket.path</name>
        <value>/var/lib/hadoop-hdfs/dn_socket</value>
    </property>
    <property>
        <name>dfs.client.use.legacy.blockreader.local</name>
        <value>false</value>
    </property>
    <property>
        <name>dfs.client.failover.proxy.provider.oushu</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
    <property>
        <name>dfs.ha.namenodes.oushu</name>
        <value>nn1,nn2</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.oushu.nn1</name>
        <value>oushu1:50070</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.oushu.nn2</name>
        <value>oushu2:50070</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.oushu.nn1</name>
        <value>oushu1:9000</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.oushu.nn2</name>
        <value>oushu2:9000</value>
    </property>
    <property>
        <name>dfs.nameservices</name>
        <value>oushu</value>
    </property>
</configuration>

The following settings must be adjusted to match the actual HDFS deployment

<property>
    <name>dfs.ha.namenodes.oushu</name>
    <value>nn1,nn2</value>
</property>
<property>
    <name>dfs.namenode.http-address.oushu.nn1</name>
    <value>oushu1:50070</value>
</property>
<property>
    <name>dfs.namenode.http-address.oushu.nn2</name>
    <value>oushu2:50070</value>
</property>
<property>
    <name>dfs.namenode.rpc-address.oushu.nn1</name>
    <value>oushu1:9000</value>
</property>
<property>
    <name>dfs.namenode.rpc-address.oushu.nn2</name>
    <value>oushu2:9000</value>
</property>
<property>
    <name>dfs.nameservices</name>
    <value>oushu</value>
</property>

Note

Spark Standalone mode does not support Kerberos-secured HDFS

Configure Spark Master/Worker#

Log in to oushu1 and switch to the root user

ssh oushu1
su - root

Create the configuration file spark-defaults.conf

cat > ${HOME}/spark-defaults.conf << EOF
spark.master.rest.enabled=true
spark.master.rest.port=2881
EOF

Create the configuration file spark-env.sh, changing the Zookeeper address oushu1:2181,oushu2:2181,oushu3:2181 to the address of your actual deployment.
If the hostname:port form is used, the IP corresponding to each hostname must be added to the /etc/hosts file (see the example after the code block below).

cat > ${HOME}/spark-env.sh << EOF
export SPARK_MASTER_PORT="2882"
export SPARK_MASTER_WEBUI_PORT="2883"
export SPARK_WORKER_WEBUI_PORT="2885"
export SPARK_WORKER_DIR="/data1/spark/sparkwork"
export SPARK_LOG_DIR="/usr/local/oushu/log/spark"
export JAVA_HOME="/usr/lib/jvm/java"
export SPARK_MASTER_OPTS="-Dfile.encoding=UTF-8"
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=oushu1:2181,oushu2:2181,oushu3:2181 -Dspark.deploy.zookeeper.dir=/oushu270120"
EOF
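
For reference, a sketch of the /etc/hosts entries for that case (the IP addresses below are placeholders, not values from this deployment; substitute the real addresses of oushu1 to oushu3):

192.168.1.21 oushu1
192.168.1.22 oushu2
192.168.1.23 oushu3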

Distribute the configuration files to the other machines

lava scp -f ${HOME}/sparkworkers ${HOME}/spark-env.sh =:/tmp
lava scp -f ${HOME}/sparkworkers ${HOME}/spark-defaults.conf =:/tmp
lava ssh -f ${HOME}/sparkworkers -e "mv -f /tmp/spark-env.sh /usr/local/oushu/conf/spark"
lava ssh -f ${HOME}/sparkworkers -e "chown spark:spark /usr/local/oushu/conf/spark/spark-env.sh"
lava ssh -f ${HOME}/sparkworkers -e "mv -f /tmp/spark-defaults.conf /usr/local/oushu/conf/spark"
lava ssh -f ${HOME}/sparkworkers -e "chown spark:spark /usr/local/oushu/conf/spark/spark-defaults.conf"

Configure Spark History Server#

The History Server is deployed only on oushu1, so the configuration only needs to be appended to spark-defaults.conf on oushu1

echo 'spark.history.ui.port=2884
spark.history.fs.logDirectory=hdfs://oushu/spark/spark-history
spark.eventLog.dir=hdfs://oushu/spark/spark-history
spark.eventLog.enabled=true' >> /usr/local/oushu/conf/spark/spark-defaults.conf

Configure Spark Client#

Log in to oushu4 and switch to the root user

ssh oushu4
su - root

Prepare directories

mkdir -p /data1/spark/spark-warehouse
chown -R spark:spark /data1/spark
chmod 733 /data1/spark/spark-warehouse

Configure the spark-defaults.conf file

echo 'spark.sql.warehouse.dir=/data1/spark/spark-warehouse' >> /usr/local/oushu/conf/spark/spark-defaults.conf

Add the following configuration to the core-site.xml file

<property>
    <name>hive.exec.scratchdir</name>
    <value>file:///data1/spark/spark-warehouse</value>
</property>

Start#

Start Spark Master#

Log in to the oushu1 node

ssh oushu1
su - root

Run the following command to start the Spark Master processes

lava ssh -f ${HOME}/sparkmasters -e "sudo -u spark /usr/local/oushu/spark/sbin/start-master.sh" 
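
To verify that the Master processes are running, one option is to look for the Master JVM on each master node (this assumes the JDK's jps tool is on the PATH of the spark user):

lava ssh -f ${HOME}/sparkmasters -e "sudo -u spark jps | grep Master"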

Start Spark Worker#

Run the following command to start the Spark Worker processes

lava ssh -f ${HOME}/sparkworkers -e "sudo -u spark /usr/local/oushu/spark/sbin/start-slave.sh 'spark://oushu1:2882,oushu2:2882'"
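
To confirm that the workers have registered, open http://oushu1:2883 (or http://oushu2:2883, whichever master is active) in a browser; the standalone Master web UI also exposes a JSON status page that can be queried from the command line, for example:

curl http://oushu1:2883/json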

Start History Server#

sudo -u spark /usr/local/oushu/spark/sbin/start-history-server.sh

Check Status#

Check that the Spark application run history is available by opening http://oushu1:2884 in a browser
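
The History Server also exposes a REST API, so the same check can be done from the command line; if no application has run yet, an empty list [] is returned:

curl http://oushu1:2884/api/v1/applications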

Log in to oushu4 and switch to the root user

ssh oushu4
su - root

Check that the Spark client can submit jobs normally

sudo -u spark /usr/local/oushu/spark/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://oushu1:2882,oushu2:2882 \
  --executor-memory 1G \
  --total-executor-cores 3 \
  /usr/local/oushu/spark/examples/jars/spark-examples_2.12-3.1.2.jar \
  1000

Check that Spark SQL starts normally

sudo -u spark /usr/local/oushu/spark/bin/spark-sql \
  --master spark://oushu1:2882,oushu2:2882 \
  --executor-memory 1G \
  --total-executor-cores 3

Run the following SQL statements in Spark SQL

show databases;
create table test(a int) using orc location 'hdfs://oushu/spark/test';
insert into test values(1);
select * from test;
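
Because the table was created with an explicit location it is an external table, so dropping it removes only the metadata and leaves the files under hdfs://oushu/spark/test; to clean up after the check, optionally run:

drop table test;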

Common Commands#

Stop the Spark services

# Stop the master
/usr/local/oushu/spark/sbin/stop-master.sh
# Stop the worker
/usr/local/oushu/spark/sbin/stop-slave.sh
# Stop the History Server
/usr/local/oushu/spark/sbin/stop-history-server.sh

Register with Skylab (Optional)#

On the oushu1 node, edit the lava command-line tool configuration and set the IP of the Skylab node

vi /usr/local/oushu/lava/conf/server.json

Write the registration request to a file, for example ~/spark-register.json

{
    "data": {
        "name": "SparkCluster",
        "group_roles": [
            {
                // master node group to install
                "role": "spark.master",
                "cluster_name": "oushu1",
                "group_name": "master1",
                // machine information for the installation; it can be found in the lavaadmin metadata table machine
                "machines": [
                    {
                        "id": 1,
                        "name": "hostname1",
                        "subnet": "lava",
                        "data_ip": "127.0.0.1",
                        "manage_ip": "",
                        "assist_port": 1622,
                        "ssh_port": 22
                    }
                ]
            },
            {
                // worker node group to install
                "role": "spark.worker",
                "cluster_name": "oushu1",
                "group_name": "worker1",
                "machines": [
                    {
                        "id": 1,
                        "name": "hostname1",
                        "subnet": "lava",
                        "data_ip": "127.0.0.1",
                        "manage_ip": "",
                        "assist_port": 1622,
                        "ssh_port": 22
                    }
                ]
            },
            {
                // history node group to install
                "role": "spark.history",
                "cluster_name": "oushu1",
                "group_name": "history1",
                "machines": [
                    {
                        "id": 1,
                        "name": "hostname1",
                        "subnet": "lava",
                        "data_ip": "127.0.0.1",
                        "manage_ip": "",
                        "assist_port": 1622,
                        "ssh_port": 22
                    }
                ]
            }
        ],
        "config": {
            "spark-defaults.conf": [
                {
                    "key": "spark.master.rest.port",
                    "value": "2881"
                },
                {
                    "key": "spark.master.rest.enabled",
                    "value": "true"
                },
                {
                    "key": "spark.history.ui.port",
                    "value": "2884"
                },
                {
                    "key": "spark.history.fs.logDirectory",
                    "value": "hdfs://oushu/littleboy/spark/spark-history"
                },
                {
                    "key": "spark.eventLog.dir",
                    "value": "hdfs://oushu/littleboy/spark/spark-history"
                },
                {
                    "key": "spark.eventLog.enabled",
                    "value": "true"
                }
            ],
            "spark-env.sh": [
                {
                    "key": "SPARK_LOG_DIR",
                    "value": "/usr/local/oushu/log/spark"
                },
                {
                    "key": "SPARK_MASTER_HOSTS",
                    "value": "oushu1,oushu2"
                },
                {
                    "key": "SPARK_MASTER_WEBUI_PORT",
                    "value": "2883"
                },
                {
                    "key": "SPARK_MASTER_PORT",
                    "value": "2882"
                },
                {
                    "key": "SPARK_WORKER_WEBUI_PORT",
                    "value": "2885"
                },
                {
                    "key": "SPARK_MASTER_OPTS",
                    "value": "\"-Dfile.encoding=UTF-8\""
                },
                {
                    "key": "SPARK_WORKER_DIR",
                    "value": "/data1/spark/sparkwork"
                },
                {
                    "key": "SPARK_DAEMON_JAVA_OPTS",
                    "value": "-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=oushu1:2181,oushu2:2181,oushu3:2181 -Dspark.deploy.zookeeper.dir=/oushu270120"
                }
            ]
        }
    }
}

In the file above, the machine information in the machines arrays must be adjusted to your actual environment. On the machine where the platform base component lava is installed, run:

psql lavaadmin -p 4432 -U oushu -c "select m.id,m.name,s.name as subnet,m.private_ip as data_ip,m.public_ip as manage_ip,m.assist_port,m.ssh_port from machine as m,subnet as s where m.subnet_id=s.id;"

Use the returned machine information to populate the machines array of each service role according to which node hosts that role.

For example, if oushu1 hosts the Spark master, then oushu1's machine information must be added to the machines array under the spark.master role.

Register the cluster with the lava command:

lava login -u oushu -p ********
lava onprem-register service -s Spark -f ~/spark-register.json

If the command returns:

Add service by self success

the registration succeeded; if an error message is returned, resolve it according to the message.

After logging in to the web console, the newly added cluster is visible under the corresponding service in the automatic deployment module, and the list shows the real-time status of the Spark processes on each machine.

Spark Yarn Mode Installation#

Prerequisites#

Spark Yarn mode depends on a Yarn cluster and an HDFS cluster.

For Yarn deployment, see: Yarn Installation

For HDFS deployment, see: HDFS Installation. The HDFS nameservice address is assumed to be hdfs://oushu

First log in to oushu4, then switch to the root user

ssh oushu4
su - root

Installation#

Install the Spark rpm as described for Spark Standalone mode

Configuration#

Prepare Data Directories#

Create the history data directory on the HDFS cluster

sudo -u hdfs hdfs dfs -mkdir -p /spark/spark-history
sudo -u hdfs hdfs dfs -chown -R spark:spark /spark

Configure Dependent Clusters#

Configure the dependent HDFS cluster as described for Spark Standalone mode

Configure Yarn#

Log in to yarn1, then install spark-shuffle on the Yarn cluster with yum install

ssh yarn1
su - root

cat > ${HOME}/yarnhost << EOF
yarn1
yarn2
yarn3
EOF

lava ssh -f ${HOME}/yarnhost -e "sudo yum install -y spark-shuffle" 

Add the following configuration to /usr/local/oushu/conf/yarn/yarn-site.xml

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>spark_shuffle,mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.classpath</name>
    <value>/usr/local/oushu/spark-shuffle-3.1.2/yarn/spark-3.1.2-yarn-shuffle.jar</value>
</property>

Distribute the configuration file to all Yarn nodes and restart the NodeManagers

lava scp -f ${HOME}/yarnhost /usr/local/oushu/conf/yarn/yarn-site.xml =:/usr/local/oushu/conf/yarn/yarn-site.xml
lava ssh -f ${HOME}/yarnhost -e 'sudo -u yarn yarn --daemon stop nodemanager'
lava ssh -f ${HOME}/yarnhost -e 'sudo -E -u yarn yarn --daemon start nodemanager'
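
To confirm that the NodeManagers have re-registered with the ResourceManager, an optional check from yarn1 (assuming the yarn client is configured there):

sudo -u yarn yarn node -list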

Copy the configuration file /usr/local/oushu/conf/yarn/yarn-site.xml to oushu4

scp /usr/local/oushu/conf/yarn/yarn-site.xml root@oushu4:/usr/local/oushu/conf/spark/yarn-site.xml

Configure Kerberos (Optional)#

If HDFS is configured with Kerberos authentication, log in to the Kerberos server and run the following command to open the Kerberos console

kadmin.local

In the console, run the following commands to create the spark principal and export its keytab

addprinc -randkey spark@OUSHU.COM 
 
ktadd -k /etc/security/keytabs/spark.keytab spark@OUSHU.COM

Note

Kerberos does not support uppercase letters in hostnames; if a hostname contains uppercase letters, change it to lowercase

Copy the generated keytab to oushu4

scp root@kerberosserver:/etc/security/keytabs/spark.keytab /etc/security/keytabs/spark.keytab
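
Optionally verify the keytab on oushu4 by obtaining a ticket with it (this assumes the Kerberos client tools kinit and klist are installed):

kinit -kt /etc/security/keytabs/spark.keytab spark@OUSHU.COM
klist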

Add the following configuration to the core-site.xml file

<property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
</property>
<property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
</property>
<property>
    <name>hadoop.rpc.protection</name>
    <value>authentication</value>
</property>

Add the following configuration to the hdfs-site.xml file

<property>
    <name>dfs.data.transfer.protection</name>
    <value>authentication</value>
</property>
<property>
  <name>dfs.namenode.kerberos.principal.pattern</name>
  <value>*</value>
</property>

Modify Configuration Files#

Create the configuration file spark-defaults.conf

cat > /usr/local/oushu/conf/spark/spark-defaults.conf << EOF
spark.history.ui.port=2884
spark.history.fs.logDirectory=hdfs://oushu/spark/spark-history
spark.eventLog.dir=hdfs://oushu/spark/spark-history
spark.eventLog.enabled=true
spark.yarn.stagingDir=hdfs://oushu/spark/staging
EOF

Create the configuration file spark-env.sh

cat > /usr/local/oushu/conf/spark/spark-env.sh << EOF
export YARN_CONF_DIR=/usr/local/oushu/conf/spark
export JAVA_HOME="/usr/lib/jvm/java"
export SPARK_MASTER_OPTS="-Dfile.encoding=UTF-8"
EOF

Configure Kerberos for the History Server (optional)

Append the following configuration to spark-defaults.conf

echo 'spark.history.kerberos.enabled=true
spark.history.kerberos.principal=spark@OUSHU.COM
spark.history.kerberos.keytab=/etc/security/keytabs/spark.keytab' >> /usr/local/oushu/conf/spark/spark-defaults.conf

Start#

Start the History Server

sudo -u spark /usr/local/oushu/spark/sbin/start-history-server.sh

Check Status#

Open http://oushu4:2884 in a browser to check that the History service is running

Submit a job

sudo -u spark /usr/local/oushu/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 1g \
    --executor-memory 2g \
    --executor-cores 1 \
    --principal spark@OUSHU.COM \
    --keytab /etc/security/keytabs/spark.keytab \
    /usr/local/oushu/spark/examples/jars/spark-examples*.jar \
    10

Note

When submitting jobs that access Kerberos-secured HDFS/Yarn, --principal and --keytab must be specified
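
To check the state of the submitted application, the yarn CLI can be used if a yarn client is available on this node (an assumption, since only the Spark client was installed here); otherwise run it on a Yarn node:

sudo -u spark yarn application -list -appStates ALL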

Common Commands#

Stop the History Server

# Stop the History Server
sudo -u spark /usr/local/oushu/spark/sbin/stop-history-server.sh

FAQ#

  • Issue: job submission fails with the error User spark not found

    Cause: the spark user does not exist on the Yarn cluster

    Fix: create the spark user on the Yarn cluster

  • Issue: job submission fails with the error Requested user spark is not whitelisted and has id 993, which is below the minimum allowed 1000

    Cause: Yarn refuses to run jobs for users whose user id is below 1000

    Fix 1: change the spark user id on the Yarn cluster to 1000 or above, e.g. usermod -u 2001 spark

    Fix 2: set min.user.id=0 in the Yarn configuration file /etc/hadoop/container-executor.cfg, then restart the Yarn cluster