
HBase Use Cases and Characteristics

HBase is a distributed database designed for storing massive amounts of data and querying it in near real time. HBase can handle tables on the order of 10 billion rows by 1 million columns, whereas a relational database typically needs to be redesigned once a single table exceeds roughly 5 million rows or 30 columns.

Use Cases

Transportation

Finance

E-commerce

Mobile/telecom - e.g. call detail records

Characteristics of HBase

Large capacity

Terabytes of data and hundreds of millions of rows per table

Column-oriented

Data is stored by column family, and columns can be added freely

Multi-version

Within a row, each column value can keep multiple timestamped versions

Sparsity

Because storage is column-oriented, a column with no value for a given row takes up no space

Scalability

The storage layer is HDFS, so capacity grows simply by adding DataNodes

Reliability

Built on the HDFS distributed file system; data can be recovered thanks to block replicas and write-ahead logs

High performance

The LSM-tree storage structure and rowkeys kept in sorted order give high read and write performance

HBase Concepts and Positioning

The official definition of HBase

Where HBase fits in the Hadoop ecosystem

HBase Architecture and Data Model

Service architecture

Main HBase processes

Master

RegionServer

Analogous to the NameNode and DataNode in HDFS

Two external services HBase depends on

Zookeeper

HDFS

Data model

Table structure

Table data

Personal info | Education | Work experience

Column families

A single table should have no more than 5 column families

There is no limit on the number of columns within a column family

A column exists only after data has been inserted into it

Columns are kept in sorted order within a column family

HBase tables compared with relational tables (a small shell sketch follows this list)

Columns can be added dynamically

Data is split into regions automatically

High-concurrency reads and writes

No support for SQL-style conditional queries (access is by rowkey get or scan), whereas relational databases support complex queries
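
To make the data model concrete, here is a minimal HBase shell sketch; the table name 'user' and the family names are hypothetical, chosen to mirror the personal info / education / work experience example above. Note how row2 simply omits the families it has no data for, which is the sparsity described earlier.

hbase(main):001:0> create 'user', 'info', 'edu', 'work'
hbase(main):002:0> put 'user', 'row1', 'info:name', 'simon'
hbase(main):003:0> put 'user', 'row1', 'edu:university', 'some-university'
hbase(main):004:0> put 'user', 'row2', 'info:name', 'alice'
hbase(main):005:0> scan 'user'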

HBase Installation and Deployment

Prerequisites (a quick verification sketch follows this list)

JDK 1.7 or later

Hadoop 2.5.x or later

Zookeeper 3.4.x or later
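
A quick way to verify these prerequisites from the command line (the Zookeeper check assumes the default client port 2181):

$ java -version
$ hadoop version
$ echo stat | nc localhost 2181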

Deployment

hbase-env.sh

hbase-site.xml (hbase-default.xml holds the defaults; a minimal sketch follows this list)

regionservers
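
As a rough sketch of hbase-site.xml for a distributed setup (the HDFS URL and Zookeeper host below are placeholders; adjust them to your own cluster):

<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:8020/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
</configuration>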

HBase Shell

DDL operations

create

describe

disable

enable

drop

DML operations (an end-to-end example covering the DDL and DML commands follows this list)

put

get

delete

count

scan
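
A sketch of these shell commands run end to end; the table name 't1', row key 'r1', and column family 'cf' are made up for illustration:

hbase(main):001:0> create 't1', 'cf'
hbase(main):002:0> describe 't1'
hbase(main):003:0> put 't1', 'r1', 'cf:a', 'value1'
hbase(main):004:0> get 't1', 'r1'
hbase(main):005:0> scan 't1'
hbase(main):006:0> count 't1'
hbase(main):007:0> delete 't1', 'r1', 'cf:a'
hbase(main):008:0> disable 't1'
hbase(main):009:0> drop 't1'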

The following commands let a macOS user look up the current user and group information.

List the groups of the current user

$ groups
staff everyone localaccounts _appserverusr admin _appserveradm _lpadmin _appstore _lpoperator _developer _analyticsusers
com.apple.access_ftp com.apple.access_screensharing com.apple.access_ssh com.apple.sharepoint.group.1

List the groups of a specified user

groups <username>

$ groups simon
staff everyone localaccounts _appserverusr admin _appserveradm _lpadmin _appstore _lpoperator _developer _analyticsusers
com.apple.access_ftp com.apple.access_screensharing com.apple.access_ssh com.apple.sharepoint.group.1

Show information about a specified user

id -a <username>

$ id -a simon
uid=501(simon) gid=20(staff) groups=20(staff),12(everyone),61(localaccounts),79(_appserverusr),80(admin),81(_appserveradm),
98(_lpadmin),33(_appstore),100(_lpoperator),204(_developer),250(_analyticsusers),395(com.apple.access_ftp),398(com.apple.access_screensharing),399(com.apple.access_ssh),701(com.apple.sharepoint.group.1)

Show the current user

$ whoami
simon

List all users

$ cat /etc/passwd
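
Note that on macOS /etc/passwd only lists system service accounts; regular local users live in Directory Services, so the following gives a more complete listing:

$ dscl . list /Users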

Downloading Hive

As before, we use the Cloudera CDH5 build, hive-1.1.0-cdh5.7.0.

cd /usr/local/hive
wget http://archive.cloudera.com/cdh5/cdh/5/hive-1.1.0-cdh5.7.0.tar.gz

tar -zxvf hive-1.1.0-cdh5.7.0.tar.gz

Configuring Environment Variables

vi ~/.bash_profile

Append the following at the end of the file:

export HIVE_HOME=/usr/local/hive/hive-1.1.0-cdh5.7.0

Run source to apply the change:

source ~/.bash_profile

Configuring Hive

Configure hive-env.sh

cp conf/hive-env.sh.template conf/hive-env.sh
vi conf/hive-env.sh

Set the HADOOP_HOME parameter:

HADOOP_HOME=/usr/local/hadoop/hadoop-2.6.0-cdh5.7.0

Configure hive-site.xml

The hcatalog/etc/hcatalog/proto-hive-site.xml file can be used as a reference.

vi conf/hive-site.xml

Add the following content:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

<configuration>


<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hivemetastoredb?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>

<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>

<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>PASSWORD</value>
<description>password to use against metastore database</description>
</property>

</configuration>

Then copy the MySQL JDBC driver from the local Maven repository into $HIVE_HOME/lib:

cp /Users/simon/.m2/repository/mysql/mysql-connector-java/5.1.46/mysql-connector-java-5.1.46.jar lib/

Installing MySQL

I won't cover installing MySQL here; see the separate post on installing MySQL on Mac OS X.

Starting Hive

Before starting Hive, make sure Hadoop HDFS is already running.
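
If HDFS is not running yet (and YARN, which the word-count query later submits a MapReduce job to), it can be started roughly as follows, assuming the same Hadoop layout used above; jps should then show NameNode, DataNode, and the YARN daemons:

cd /usr/local/hadoop/hadoop-2.6.0-cdh5.7.0
sbin/start-dfs.sh
sbin/start-yarn.sh
jps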

bin/hive

ls: /usr/local/spark/spark-2.2.3-bin-hadoop2.6/lib/spark-assembly-*.jar: No such file or directory
2019-01-11 10:12:46,772 WARN [main] mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present. Continuing without it.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hbase/hbase-1.2.0-cdh5.7.0/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2019-01-11 10:12:46,954 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Logging initialized using configuration in jar:file:/usr/local/hive/hive-1.1.0-cdh5.7.0/lib/hive-common-1.1.0-cdh5.7.0.jar!/hive-log4j.properties
Thu Jan 11 10:12:49 CST 2019 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
Thu Jan 11 10:12:49 CST 2019 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
Thu Jan 11 10:12:49 CST 2019 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
Thu Jan 11 10:12:49 CST 2019 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
Thu Jan 11 10:12:51 CST 2019 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
Thu Jan 11 10:12:51 CST 2019 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
Thu Jan 11 10:12:51 CST 2019 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
Thu Jan 11 10:12:51 CST 2019 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive>

Next, connect to MySQL and verify that the metastore database has been created:

cd $MYSQL_HOME
bin/mysql -u root -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 533
Server version: 5.7.17 MySQL Community Server (GPL)

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

show databases;
+---------------------------+
| Database |
+---------------------------+
| information_schema |
| hivemetastoredb |
| mysql |
| performance_schema |
+---------------------------+
11 rows in set (0.01 sec)

mysql>
use hivemetastoredb;
Database changed

You can see that the hivemetastoredb database configured earlier has been created successfully. Inside it, many metastore tables have been generated automatically, such as TBLS.
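
To list those metastore tables from the same MySQL session (the output should include standard Hive metastore tables such as DBS, TBLS, COLUMNS_V2, and SDS):

mysql> show tables;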

Creating a Table

Back in the Hive shell:

hive> create table hive_wordcount(context string);
OK
Time taken: 1.312 seconds

Back in the MySQL terminal:

mysql> select * FROM TBLS;
Empty set (0.00 sec)

mysql> select * from TBLS;
+--------+-------------+-------+------------------+-------+-----------+-------+----------------+---------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT |
+--------+-------------+-------+------------------+-------+-----------+-------+----------------+---------------+--------------------+--------------------+
| 1 | 1548903355 | 1 | 0 | simon | 0 | 1 | hive_wordcount | MANAGED_TABLE | NULL | NULL |
+--------+-------------+-------+------------------+-------+-----------+-------+----------------+---------------+--------------------+--------------------+
1 row in set (0.00 sec)


mysql> select * from COLUMNS_V2;
+-------+---------+-------------+-----------+-------------+
| CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX |
+-------+---------+-------------+-----------+-------------+
| 1 | NULL | context | string | 0 |
+-------+---------+-------------+-----------+-------------+
1 row in set (0.00 sec)

You can see that the table hive_wordcount has been created successfully, and that it has a single column named context.

Loading Data into the Table

cd $HADOOP_HOME
bin/hadoop fs -text /data/input
19/01/11 11:00:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hello world
hello hello world
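
For reference, a sample file like this could have been placed into HDFS beforehand with something like the following; the local path is hypothetical:

bin/hadoop fs -mkdir -p /data
bin/hadoop fs -put /path/to/wordcount.txt /data/input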

Now we load the HDFS file /data/input into the table:

hive> LOAD DATA INPATH '/data/input' INTO TABLE hive_wordcount;
Loading data to table default.hive_wordcount
Table default.hive_wordcount stats: [numFiles=1, totalSize=30]
OK
Time taken: 1.078 seconds

If the file is on the local filesystem rather than in HDFS, use LOAD DATA LOCAL instead.
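
For example (the local path is hypothetical):

hive> LOAD DATA LOCAL INPATH '/path/to/local/input.txt' INTO TABLE hive_wordcount;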

Running the Query

hive> select word, count(1) from hive_wordcount lateral view explode(split(context, ' ')) wc as word group by word;
Query ID = simon_20190111105555_39805c49-890c-4241-8388-56542c90fa2e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1548765492126_0003, Tracking URL = http://localhost:8088/proxy/application_1548765492126_0003/
Kill Command = /usr/local/hadoop/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1548765492126_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-01-11 11:11:01,153 Stage-1 map = 0%, reduce = 0%
2019-01-11 11:11:06,299 Stage-1 map = 100%, reduce = 0%
2019-01-11 11:11:12,475 Stage-1 map = 100%, reduce = 100%
Ended Job = job_1548765492126_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 HDFS Read: 8786 HDFS Write: 16 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
hello 3
world 2
Time taken: 18.997 seconds, Fetched: 2 row(s)

Here split breaks each context value on the given delimiter, and lateral view explode expands the resulting array so that each word becomes its own row.
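
To see that step in isolation, the exploding part of the query can be run on its own; with the two sample lines it should return one row per word (hello, world, hello, hello, world), which the outer group by then counts:

hive> select explode(split(context, ' ')) as word from hive_wordcount;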

Finally, we get the expected result:

hello	3
world 2