This post walks through building a small Spark SQL demo program.
Create the Maven Project
Two dependencies are needed in pom.xml: scala-library and spark-sql.
```xml
<properties>
    <scala.version>2.11.12</scala.version>
    <scala.compat.version>2.11</scala.compat.version>
    <spark.version>2.2.3</spark.version>
</properties>

<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>${scala.version}</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_${scala.compat.version}</artifactId>
    <version>${spark.version}</version>
</dependency>
```
Create SparkSessionApp
In Spark 1.x, you obtained the context by creating a SQLContext.
Since Spark 2.x, SparkSession is used instead.
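For comparison, below is a minimal sketch of the Spark 1.x style (not part of this demo; the object name SqlContextApp is hypothetical, and SQLContext is kept in 2.x only for backward compatibility):

```scala
// Spark 1.x style: build a SQLContext on top of a SparkContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SqlContextApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SQLContext App")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc) // deprecated entry point since Spark 2.0
    val people = sqlContext.read.json(args(0)) // first program argument is the JSON file
    people.show()
    sc.stop()
  }
}
```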
args(0) is the path of the input file.
```scala
package gy.finolo.spark

import org.apache.spark.sql.SparkSession

object SparkSessionApp {

  def main(args: Array[String]): Unit = {
    // path of the input JSON file, passed as the first program argument
    val path = args(0)

    // SparkSession is the unified entry point since Spark 2.x
    val sparkSession = SparkSession.builder()
      .master("local[2]")
      .appName("Spark Session App")
      .getOrCreate()

    // load the JSON file as a DataFrame, then print its schema and contents
    val people = sparkSession.read.format("json").load(path)
    people.printSchema()
    people.show()

    sparkSession.stop()
  }

}
```
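As a side note, `read.format("json").load(path)` can also be written with the `json` shorthand on `DataFrameReader`; the line below is just a sketch that assumes the `sparkSession` and `path` values from the listing above:

```scala
// equivalent to sparkSession.read.format("json").load(path)
val people = sparkSession.read.json(path)
```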
When running in IDEA, set the Program Arguments to
/usr/local/spark/spark-2.2.3-bin-hadoop2.6/examples/src/main/resources/people.json.
```
cat /usr/local/spark/spark-2.2.3-bin-hadoop2.6/examples/src/main/resources/people.json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
```
Submit the Job
After testing locally, you can package the application and submit it to run with spark-submit.
```bash
bin/spark-submit \
  --master local[2] \
  --class gy.finolo.spark.SparkSessionApp \
  /Users/simon/Development/workspace/scala/spark-sql-demo/target/spark-sql-demo-1.0-SNAPSHOT.jar \
  /usr/local/spark/spark-2.2.3-bin-hadoop2.6/examples/src/main/resources/people.json
```
The output appears in the console. The printSchema() call prints:
```
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
```
And people.show() prints:
```
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
```
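Since this is a Spark SQL demo, a natural extension (not part of the original program, just a sketch that reuses the `people` DataFrame and `sparkSession` from the listing above) is to register the DataFrame as a temporary view and query it with SQL:

```scala
// register the DataFrame as a temporary view and query it with SQL
people.createOrReplaceTempView("people")
val adults = sparkSession.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()
```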