
Running Spark in local and standalone modes

Spark run modes

Spark has many run modes. The simplest is the single-machine local mode, and there is also a single-machine pseudo-distributed mode; more complex deployments run on a cluster. Spark currently runs well on YARN and Mesos, and it also ships with its own Standalone mode. For most cases Standalone mode is sufficient, and if the organization already has a YARN or Mesos environment, deploying onto those is also convenient. In every case the mode comes down to the master URL the application is started with (see the sketch after the list below).

  • local (local mode): commonly used for local development and testing; it further splits into local (single-threaded) and local-cluster (multi-threaded);
  • standalone (cluster mode): a typical master/slave setup, in which the Master is clearly a single point of failure; Spark supports ZooKeeper-based HA for it
  • on yarn (cluster mode): runs on top of the YARN resource manager; YARN handles resource management, while Spark handles task scheduling and computation
  • on mesos (cluster mode): runs on top of the Mesos resource manager; Mesos handles resource management, while Spark handles task scheduling and computation
  • on cloud (cluster mode): for example AWS EC2; this mode makes it easy to access Amazon S3; Spark supports multiple distributed storage systems, such as HDFS and S3
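
Whichever of these is used, the choice is expressed through the master URL that spark-shell/spark-submit is launched with, or that is set when building the SparkSession. Below is a minimal Scala sketch (the application name is a hypothetical placeholder, and the standalone/YARN/Mesos URLs are examples only); in practice the master is usually supplied on the command line via --master rather than hard-coded.

import org.apache.spark.sql.SparkSession

object MasterUrlExample {
  def main(args: Array[String]): Unit = {
    // The master URL selects the run mode:
    //   "local[2]"               -> local mode with 2 threads
    //   "spark://localhost:7077" -> standalone cluster (example host)
    //   "yarn"                   -> YARN resource manager
    //   "mesos://host:5050"      -> Mesos resource manager (example host/port)
    val spark = SparkSession.builder()
      .appName("MasterUrlExample")   // hypothetical application name
      .master("local[2]")
      .getOrCreate()

    println(s"Running with master = ${spark.sparkContext.master}")
    spark.stop()
  }
}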

Running in local mode

spark-shell --master local[2]

The 2 means the shell runs with two worker threads.

With local[*], which is also the default master setting, Spark automatically uses as many threads as the machine has cores.
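
To confirm how many threads the shell actually got, you can inspect the SparkContext directly. A hedged illustration (the res values are what one would expect after starting with --master local[2], not output captured from a real session):

scala> sc.master
res0: String = local[2]

scala> sc.defaultParallelism
res1: Int = 2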

Running in standalone mode

The architecture of Spark Standalone mode is very similar to Hadoop HDFS/YARN: 1 master + n workers.

Configure the conf/spark-env.sh file

cd $SPARK_HOME
cd conf
cp spark-env.sh.template spark-env.sh
vi spark-env.sh

Add the following content:

SPARK_MASTER_HOST=localhost
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=2g
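
SPARK_WORKER_CORES and SPARK_WORKER_MEMORY cap what one worker can offer, so applications submitted to this cluster should not ask for more per executor than that, otherwise they will sit waiting for resources. A minimal Scala sketch, assuming the 1-core / 2 GB worker configured above (the application name and the exact property values are illustrative):

import org.apache.spark.sql.SparkSession

// Keep the per-executor request within the worker limits configured above.
val spark = SparkSession.builder()
  .appName("StandaloneResourceExample")  // hypothetical application name
  .master("spark://localhost:7077")
  .config("spark.executor.cores", "1")   // <= SPARK_WORKER_CORES
  .config("spark.executor.memory", "1g") // <= SPARK_WORKER_MEMORY
  .getOrCreate()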

Start Spark

cd $SPARK_HOME
sbin/start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark/spark-2.2.3-bin-2.6.0-cdh5.7.0/logs/spark-simon-org.apache.spark.deploy.master.Master-1-localhost.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-2.2.3-bin-2.6.0-cdh5.7.0/logs/spark-simon-org.apache.spark.deploy.worker.Worker-1-localhost.out

Check the master log

19/02/10 13:11:09 INFO Master: I have been elected leader! New state: ALIVE
19/02/10 13:11:12 INFO Master: Registering worker 192.168.1.6:51683 with 1 cores, 2.0 GB RAM

Check the worker log

19/02/10 13:11:12 INFO Worker: Successfully registered with master spark://localhost:7077

Run the jps command and you can see the Master and Worker processes:

3424 Master
3459 Worker

Generate the input file for wordCount

Create a new file, /usr/local/spark/data/words:

vi /usr/local/spark/data/words

Add the following content:

hello,hello,world
hello,world,
welcome

Start spark-shell

Launch it in standalone mode:

bin/spark-shell --master spark://localhost:7077
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/02/10 13:12:41 WARN Utils: Your hostname, localhost resolves to a loopback address: 127.0.0.1; using 192.168.1.6 instead (on interface en0)
19/02/10 13:12:41 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
19/02/10 13:12:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://192.168.1.6:4040
Spark context available as 'sc' (master = spark://localhost:7077, app id = app-20190210131243-0000).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.3
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

Enter the wordCount program

scala> var file = spark.sparkContext.textFile("file:///usr/local/spark/data/words")
file: org.apache.spark.rdd.RDD[String] = file:///usr/local/spark/data MapPartitionsRDD[6] at textFile at <console>:23

scala> val wordCounts = file.flatMap(line => line.split(",")).map(word => (word, 1)).reduceByKey(_ + _)
wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[9] at reduceByKey at <console>:25

scala> wordCounts.collect
res1: Array[(String, Int)] = Array((hello,3), (welcome,1), (world,2))

scala>

You can see that the words in the /usr/local/spark/data/words file were counted successfully.
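
As a small follow-up (not part of the session above), the same RDD can be sorted by count and written back to disk; the output directory below is just an illustrative path and must not already exist:

// Sort by count in descending order, print, and save as text files.
val sorted = wordCounts.sortBy(_._2, ascending = false)
sorted.collect.foreach(println)
sorted.saveAsTextFile("file:///usr/local/spark/data/wordcount-output")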