
Creating a Spark SQL Program

This post walks through how to create a simple Spark SQL demo program.

Creating the Maven Project

Two dependencies need to be added to pom.xml, along with the version properties they refer to.

<properties>
    <scala.version>2.11.12</scala.version>
    <scala.compat.version>2.11</scala.compat.version>
    <spark.version>2.2.3</spark.version>
</properties>

<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>${scala.version}</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_${scala.compat.version}</artifactId>
    <version>${spark.version}</version>
</dependency>

Creating SparkSessionApp

In Spark 1.x, you obtained the context by creating a SQLContext.
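For comparison, here is a minimal sketch of the old Spark 1.x style (the app name is just an example; SQLContext is deprecated in 2.x but still available):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Spark 1.x style: build a SparkContext first, then wrap it in a SQLContext
val conf = new SparkConf().setMaster("local[2]").setAppName("Spark SQL 1.x Style")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)  // replaced by SparkSession in Spark 2.x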

Since Spark 2.x, SparkSession is used instead.

args(0) is the path of the input file.

package gy.finolo.spark

import org.apache.spark.sql.SparkSession

object SparkSessionApp {

  def main(args: Array[String]): Unit = {

    // Path of the input JSON file, passed as the first program argument
    val path = args(0)

    val sparkSession = SparkSession.builder()
      .master("local[2]")
      .appName("Spark Session App")
      .getOrCreate()

    // Load the JSON file as a DataFrame, then print its schema and contents
    val people = sparkSession.read.format("json").load(path)
    people.printSchema()
    people.show()

    sparkSession.stop()
  }

}
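The demo only prints the schema and the rows. To run actual SQL against the data, the DataFrame can first be registered as a temporary view; a minimal sketch (the view name people and the query below are just examples):

// Register the DataFrame as a temporary view and query it with SQL
people.createOrReplaceTempView("people")
val adults = sparkSession.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()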

When running it in IDEA, the input path needs to be specified in Program Arguments: /usr/local/spark/spark-2.2.3-bin-hadoop2.6/examples/src/main/resources/people.json

cat /usr/local/spark/spark-2.2.3-bin-hadoop2.6/examples/src/main/resources/people.json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

Submitting the Job

Once local testing is done, you can package the project into a jar (e.g. with mvn clean package) and submit it with spark-submit.

bin/spark-submit \
--master local[2] \
--class gy.finolo.spark.SparkSessionApp \
/Users/simon/Development/workspace/scala/spark-sql-demo/target/spark-sql-demo-1.0-SNAPSHOT.jar \
/usr/local/spark/spark-2.2.3-bin-hadoop2.6/examples/src/main/resources/people.json
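One thing to note: the .master("local[2]") hard-coded in the builder takes precedence over the --master flag passed to spark-submit. When submitting to a real cluster, it is common to drop it from the code and let spark-submit supply the master; a sketch:

val sparkSession = SparkSession.builder()
  .appName("Spark Session App")  // no .master() here; spark-submit's --master is used instead
  .getOrCreate()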

After the job runs, the output appears in the console.

The output of the printSchema() call:

root
|-- age: long (nullable = true)
|-- name: string (nullable = true)

The output of people.show():

+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+