
Compiling and Installing Spark

A Spark build depends on a specific Hadoop version. When we run a CDH build of Hadoop, the Spark website does not offer a matching pre-built package, so we have to compile Spark ourselves.

Download the Source Code

Go to the Spark download page at http://spark.apache.org/downloads.html and download the Spark source code, not one of the pre-built packages.

For the Spark release I chose 2.2.3.

For the package type, select Source Code.

Download and extract:

cd /usr/local/spark
wget https://archive.apache.org/dist/spark/spark-2.2.3/spark-2.2.3.tgz

tar -zxvf spark-2.2.3.tgz
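
As an optional extra step (my addition, not in the original write-up), the downloaded archive can be checked against the SHA-512 checksum published alongside it on archive.apache.org:

# Compute the local checksum and compare it with the published spark-2.2.3.tgz.sha512
shasum -a 512 spark-2.2.3.tgz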

Build the Distribution

Looking at the source of dev/make-distribution.sh, we can see that the resulting package is named spark-$VERSION-bin-$NAME.tgz, so the --name argument is set to 2.6.0-cdh5.7.0. -P activates profiles defined in pom.xml, and -D sets a Maven property; here hadoop.version pins the build to the exact CDH Hadoop dependency.
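
Before kicking off the build (a note of my own, not part of the original steps): Spark 2.2 expects Java 8 and Maven 3.3.9 or newer, and make-distribution.sh drives the build with Maven under the hood, so a quick environment check plus a larger heap can save a failed run:

# Sanity checks (my addition): Spark 2.2 expects Java 8 and Maven 3.3.9+
java -version
mvn -version
# Optional: give Maven more heap than the default before running make-distribution.sh
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"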

cd spark-2.2.3
./dev/make-distribution.sh --name 2.6.0-cdh5.7.0 --tgz -Phadoop-2.6 -Phive -Phive-thriftserver -Pmesos -Pyarn -Dhadoop.version=2.6.0-cdh5.7.0

During the build, the following error appears:

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 46.000 s (Wall Clock)
[INFO] Finished at: 2019-02-10T11:28:39+08:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project spark-launcher_2.11: Could not resolve dependencies for project org.apache.spark:spark-launcher_2.11:jar:2.2.3: Could not find artifact org.apache.hadoop:hadoop-client:jar:2.6.0-cdh5.7.0 in alimaven (http://maven.aliyun.com/nexus/content/groups/public/) -> [Help 1]

This means the CDH jars cannot be found in the configured Maven repositories, so we have to add the CDH repository to the repositories section of pom.xml.

Edit pom.xml and add the Cloudera repository after the Maven Central entry inside the repositories tag.

<repository>
  <id>cloudera</id>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
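
Alternatively (a variation of my own, not what this walkthrough does), the same repository can be declared once in ~/.m2/settings.xml through an always-active profile, so every CDH-flavoured build picks it up without touching Spark's pom.xml:

<!-- Hypothetical ~/.m2/settings.xml: declares the Cloudera repository globally -->
<settings>
  <profiles>
    <profile>
      <id>cloudera-repo</id>
      <repositories>
        <repository>
          <id>cloudera</id>
          <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
      </repositories>
    </profile>
  </profiles>
  <activeProfiles>
    <activeProfile>cloudera-repo</activeProfile>
  </activeProfiles>
</settings>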

Then run the same command again. After about 12 minutes, the spark-2.2.3-bin-2.6.0-cdh5.7.0.tgz package is built successfully.

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 12:08 min (Wall Clock)
[INFO] Finished at: 2019-02-10T11:59:57+08:00
[INFO] ------------------------------------------------------------------------

Extract the spark-2.2.3-bin-2.6.0-cdh5.7.0.tgz package into the parent spark directory:

cd ..
tar -zxvf spark-2.2.3/spark-2.2.3-bin-2.6.0-cdh5.7.0.tgz -C .

Set Environment Variables

vi ~/.bash_profile

Add the following line:

export SPARK_HOME=/usr/local/spark/spark-2.2.3-bin-2.6.0-cdh5.7.0
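
If you also want to call spark-shell and spark-submit from anywhere rather than from the installation directory (my addition; the steps below do not depend on it), append the bin directory to PATH as well:

export PATH=$SPARK_HOME/bin:$PATH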

Make the settings take effect:

source ~/.bash_profile

Start Spark

cd $SPARK_HOME
bin/spark-shell --master local[*]

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/02/10 12:13:43 WARN Utils: Your hostname, localhost resolves to a loopback address: 127.0.0.1; using 192.168.1.6 instead (on interface en0)
19/02/10 12:13:43 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
19/02/10 12:13:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://192.168.1.6:4040
Spark context available as 'sc' (master = local[*], app id = local-1549772024897).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.3
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

Spark has started successfully.
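
As a quick sanity check (my addition, assuming the default spark-shell session above), you can confirm from the scala> prompt that the build really links against the CDH Hadoop client and can run a small job:

// Versions reported by the running shell
sc.version                                      // expected: 2.2.3
org.apache.hadoop.util.VersionInfo.getVersion   // expected: 2.6.0-cdh5.7.0

// Tiny local job: sum of the doubled numbers 1 to 100
sc.parallelize(1 to 100).map(_ * 2).sum()       // expected: 10100.0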