
Creating a Spark SQL Hive Program

In this post, we will use SparkSession to print the contents of the emp table in Hive.

For the most basic Spark SQL program, see the earlier post Creating a Spark SQL Program.

Creating the SparkSessionHiveApp Program

pom.xml

<properties>
    <scala.version>2.11.12</scala.version>
    <scala.compat.version>2.11</scala.compat.version>
    <spark.version>2.2.3</spark.version>
</properties>
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>${scala.version}</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_${scala.compat.version}</artifactId>
    <version>${spark.version}</version>
</dependency>

<!-- Hive support -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_${scala.compat.version}</artifactId>
    <version>${spark.version}</version>
</dependency>

<!-- JDBC driver for the MySQL-backed Hive metastore -->
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.46</version>
</dependency>

SparkSessionHiveApp

package gy.finolo.spark

import org.apache.spark.sql.SparkSession

object SparkSessionHiveApp {

  def main(args: Array[String]): Unit = {

    // Create the SparkSession; enableHiveSupport() is required so that
    // Spark uses the Hive metastore catalog instead of the in-memory one
    val sparkSession = SparkSession.builder()
      .master("local[2]")
      .appName("Spark Session Hive App")
      .enableHiveSupport()
      .getOrCreate()

    // Business logic: print the contents of the Hive emp table
    sparkSession.table("emp").show()

    // Release resources
    sparkSession.stop()
  }

}
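
Note that sparkSession.table("emp") is equivalent to issuing the query as SQL text; a minimal alternative for the business-logic line (a sketch, producing the same result):

// Query the Hive table with SQL text instead of the table() API
sparkSession.sql("SELECT * FROM emp").show()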

Creating the emp Table

Create the raw emp data file:

vi /usr/local/spark/data/emp.csv

Write the following content into the emp.csv file. Note that some fields are empty.

7369,SMITH,CLERK,7902,1980/12/17,800,,20
7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30
7521,WARD,SALESMAN,7698,1981/2/22,1250,500,30
7566,JONES,MANAGER,7839,1981/4/2,2975,,20
7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30
7698,BLAKE,MANAGER,7839,1981/5/1,2850,,30
7782,CLARK,MANAGER,7839,1981/6/9,2450,,10
7788,SCOTT,ANALYST,7566,1987/4/19,3000,,20
7839,KING,PRESIDENT,,1981/11/17,5000,,10
7844,TURNER,SALESMAN,7698,1981/9/8,1500,0,30
7876,ADAMS,CLERK,7788,1987/5/23,1100,,20
7900,JAMES,CLERK,7698,1981/12/3,950,,30
7902,FORD,ANALYST,7566,1981/12/3,3000,,20
7934,MILLER,CLERK,7782,1982/1/23,1300,,10
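
As an optional sanity check (a sketch, using the sparkSession from the program above; this step is not part of the original workflow), the raw file can also be loaded straight into a DataFrame without Hive:

// Optional check: read the raw CSV directly; all columns are inferred as strings
val raw = sparkSession.read.csv("/usr/local/spark/data/emp.csv")
raw.show()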

Create the emp table in Hive

For the basics of using Hive, see the post Setting Up a Hive Environment.

Note that the statement must end with a semicolon.

hive>
CREATE TABLE emp(
    empno INT,
    ename STRING,
    job STRING,
    mgr INT,
    hiredate DATE,
    sal DOUBLE,
    comm DOUBLE,
    deptno INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
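
For reference, with Hive support enabled the same DDL can also be issued from Spark itself, so the Hive CLI is not strictly required (a sketch, reusing the sparkSession from above):

// The same Hive DDL, executed through Spark's Hive support
sparkSession.sql(
  """CREATE TABLE emp(
    |  empno INT, ename STRING, job STRING, mgr INT,
    |  hiredate DATE, sal DOUBLE, comm DOUBLE, deptno INT
    |) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','""".stripMargin)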

Load data into the emp table

hive> LOAD DATA LOCAL INPATH '/usr/local/spark/data/emp.csv' INTO TABLE emp;

There is a problem with hiredate here: the yyyy/M/d format in the CSV does not match Hive's expected DATE format (yyyy-MM-dd), so the values load as null. We will solve this properly later.
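
As a workaround sketch (hypothetical; it assumes hiredate is declared STRING instead of DATE so that the raw text survives the load), the yyyy/M/d strings can be parsed on the Spark side:

// Hypothetical workaround: with hiredate stored as STRING in Hive,
// parse the yyyy/M/d text into a real date column in Spark
import org.apache.spark.sql.functions.to_date

val emp = sparkSession.table("emp")
emp.withColumn("hiredate", to_date(emp("hiredate"), "yyyy/M/d")).show()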

Running the Program in IDEA

You can see that the contents of the emp table are successfully printed to the console.

19/02/12 00:00:23 INFO DAGScheduler: Job 0 finished: show at SparkSessionHiveApp.scala:21, took 0.340480 s
+-----+------+---------+----+--------+------+------+------+
|empno| ename| job| mgr|hiredate| sal| comm|deptno|
+-----+------+---------+----+--------+------+------+------+
| 7369| SMITH| CLERK|7902| null| 800.0| null| 20|
| 7499| ALLEN| SALESMAN|7698| null|1600.0| 300.0| 30|
| 7521| WARD| SALESMAN|7698| null|1250.0| 500.0| 30|
| 7566| JONES| MANAGER|7839| null|2975.0| null| 20|
| 7654|MARTIN| SALESMAN|7698| null|1250.0|1400.0| 30|
| 7698| BLAKE| MANAGER|7839| null|2850.0| null| 30|
| 7782| CLARK| MANAGER|7839| null|2450.0| null| 10|
| 7788| SCOTT| ANALYST|7566| null|3000.0| null| 20|
| 7839| KING|PRESIDENT|null| null|5000.0| null| 10|
| 7844|TURNER| SALESMAN|7698| null|1500.0| 0.0| 30|
| 7876| ADAMS| CLERK|7788| null|1100.0| null| 20|
| 7900| JAMES| CLERK|7698| null| 950.0| null| 30|
| 7902| FORD| ANALYST|7566| null|3000.0| null| 20|
| 7934|MILLER| CLERK|7782| null|1300.0| null| 10|
+-----+------+---------+----+--------+------+------+------+

Packaging and Submitting the Job

Now that the program runs successfully in the test environment, we package it and submit the job with spark-submit.

bin/spark-submit \
--master local[2] \
--class gy.finolo.spark.SparkSessionHiveApp \
--jars /Users/simon/.m2/repository/mysql/mysql-connector-java/5.1.46/mysql-connector-java-5.1.46.jar \
/Users/simon/Development/workspace/scala/spark-sql-demo/target/spark-sql-demo-1.0-SNAPSHOT.jar

The --jars option is needed to specify the location of the JDBC driver.

What if there are many jars to load? I will cover that in a future post.

Possible Problems

JDBC driver not added

Caused by: org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException: 
The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH.
Please check your CLASSPATH specification, and the name of the driver.

Solution:

If you are developing and running in IDEA, add the driver jar as a Maven dependency:

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.46</version>
</dependency>

DatastoreDriverNotFound

org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException: 
The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH.
Please check your CLASSPATH specification, and the name of the driver.

This is similar to the error above: when submitting the job, we must point --jars at the location of mysql-connector-java.jar.

bin/spark-submit \
--master local[2] \
--class gy.finolo.spark.SparkSessionHiveApp \
--jars /Users/simon/.m2/repository/mysql/mysql-connector-java/5.1.46/mysql-connector-java-5.1.46.jar \
/Users/simon/Development/workspace/scala/spark-sql-demo/target/spark-sql-demo-1.0-SNAPSHOT.jar

Table or view not found

Exception in thread "main" org.apache.spark.sql.AnalysisException:
Table or view not found: emp;

Solution:

Copy hive-site.xml from Hive's conf directory into the project's resources directory so that Spark can locate the metastore.

cp /usr/local/hive/hive-1.1.0-cdh5.7.0/conf/hive-site.xml spark-sql-demo/src/main/resources
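
Alternatively (a sketch; the thrift URI below is an assumption for a typical local metastore service, not taken from this post), the metastore location can be set directly on the builder instead of shipping hive-site.xml:

// Hypothetical alternative: point Spark at the Hive metastore explicitly
// (assumes a metastore service running at thrift://localhost:9083)
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("Spark Session Hive App")
  .config("hive.metastore.uris", "thrift://localhost:9083")
  .enableHiveSupport()
  .getOrCreate()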