The following commands let a macOS user look up the current user and group information.

Query the groups of the current user

$ groups
staff everyone localaccounts _appserverusr admin _appserveradm _lpadmin _appstore _lpoperator _developer _analyticsusers
com.apple.access_ftp com.apple.access_screensharing com.apple.access_ssh com.apple.sharepoint.group.1

Query the groups of a given user

groups <username>

$ groups simon
staff everyone localaccounts _appserverusr admin _appserveradm _lpadmin _appstore _lpoperator _developer _analyticsusers
com.apple.access_ftp com.apple.access_screensharing com.apple.access_ssh com.apple.sharepoint.group.1

Query a given user's details

id -a <username>

$ id -a simon
uid=501(simon) gid=20(staff) groups=20(staff),12(everyone),61(localaccounts),79(_appserverusr),80(admin),81(_appserveradm),
98(_lpadmin),33(_appstore),100(_lpoperator),204(_developer),250(_analyticsusers),395(com.apple.access_ftp),398(com.apple.access_screensharing),399(com.apple.access_ssh),701(com.apple.sharepoint.group.1)

Query the current user

$ whoami
simon

List all users

$ cat /etc/passwd
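
Note that on macOS, /etc/passwd only holds system service accounts; regular users are managed by Directory Services, so the listing above may be incomplete. A macOS-specific alternative that lists all local users:

$ dscl . list /Users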

Download Hive

As before, we use the Cloudera CDH5 build: hive-1.1.0-cdh5.7.0.

cd /usr/local/hive
wget http://archive.cloudera.com/cdh5/cdh/5/hive-1.1.0-cdh5.7.0.tar.gz

tar -zxvf hive-1.1.0-cdh5.7.0.tar.gz

Configure environment variables

vi ~/.bash_profile

Append the following at the end of the file:

export HIVE_HOME=/usr/local/hive/hive-1.1.0-cdh5.7.0
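
You will usually also want Hive's executables on your PATH; assuming the same layout, add:

export PATH=$HIVE_HOME/bin:$PATH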

Run source to make the change take effect:

source ~/.bash_profile

Configure Hive

Configure hive-env.sh

cp conf/hive-env.sh.template conf/hive-env.sh
vi conf/hive-env.sh

Set the HADOOP_HOME parameter:

HADOOP_HOME=/usr/local/hadoop/hadoop-2.6.0-cdh5.7.0

Configure hive-site.xml

You can use the hcatalog/etc/hcatalog/proto-hive-site.xml file as a reference.

vi conf/hive-site.xml

Add the following content:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

<configuration>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hivemetastoredb?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>username to use against metastore database</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>PASSWORD</value>
    <description>password to use against metastore database</description>
  </property>

</configuration>

Next, we also need to copy the MySQL driver jar from the local Maven repository into $HIVE_HOME/lib.

cp /Users/simon/.m2/repository/mysql/mysql-connector-java/5.1.46/mysql-connector-java-5.1.46.jar lib/
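
If the driver jar is not in your local Maven repository yet (the path above assumes user simon's ~/.m2), one way to fetch it, assuming Maven is installed:

mvn dependency:get -Dartifact=mysql:mysql-connector-java:5.1.46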

Install MySQL

I won't cover installing MySQL here; see the separate guide on installing MySQL on Mac OS X.

Start Hive

Before starting Hive, you must make sure Hadoop HDFS is already running.
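
If HDFS is not up yet, a minimal way to start and verify it, assuming the Hadoop layout from the earlier sections:

cd /usr/local/hadoop/hadoop-2.6.0-cdh5.7.0
sbin/start-dfs.sh
jps   # should list NameNode, DataNode and SecondaryNameNode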

bin/hive

ls: /usr/local/spark/spark-2.2.3-bin-hadoop2.6/lib/spark-assembly-*.jar: No such file or directory
2019-01-11 10:12:46,772 WARN [main] mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present. Continuing without it.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hbase/hbase-1.2.0-cdh5.7.0/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2019-01-11 10:12:46,954 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Logging initialized using configuration in jar:file:/usr/local/hive/hive-1.1.0-cdh5.7.0/lib/hive-common-1.1.0-cdh5.7.0.jar!/hive-log4j.properties
Thu Jan 11 10:12:49 CST 2019 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
(the same SSL warning is printed several more times)
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive>
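
The repeated SSL warnings come from the MySQL JDBC driver. As the message itself suggests, they can be silenced by disabling SSL on the metastore connection, for example by extending the ConnectionURL value in hive-site.xml (note the &amp; escape, since the value sits inside XML):

<value>jdbc:mysql://localhost:3306/hivemetastoredb?createDatabaseIfNotExist=true&amp;useSSL=false</value>

Now switch to a MySQL session to inspect the metastore:
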
cd $MYSQL_HOME
bin/mysql -u root -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 533
Server version: 5.7.17 MySQL Community Server (GPL)

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| hivemetastoredb    |
| mysql              |
| performance_schema |
+--------------------+
11 rows in set (0.01 sec)

mysql>
use hivemetastoredb;
Database changed

You can see that the hivemetastoredb database we configured earlier has been created. Hive also auto-generated a number of metadata tables inside it, such as TBLS.
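
A quick way to list those metadata tables from the MySQL prompt; TBLS and COLUMNS_V2, both queried below, should be among them:

mysql> show tables;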

Create a table

Back in the Hive shell:

hive> create table hive_wordcount(context string);
OK
Time taken: 1.312 seconds
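
You can also confirm the table from the Hive side before looking at the metastore:

hive> show tables;
hive> desc hive_wordcount;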

Back in the MySQL terminal:

mysql> select * from TBLS;
+--------+-------------+-------+------------------+-------+-----------+-------+----------------+---------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME       | TBL_TYPE      | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT |
+--------+-------------+-------+------------------+-------+-----------+-------+----------------+---------------+--------------------+--------------------+
| 1      | 1548903355  | 1     | 0                | simon | 0         | 1     | hive_wordcount | MANAGED_TABLE | NULL               | NULL               |
+--------+-------------+-------+------------------+-------+-----------+-------+----------------+---------------+--------------------+--------------------+
1 row in set (0.00 sec)


mysql> select * from COLUMNS_V2;
+-------+---------+-------------+-----------+-------------+
| CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX |
+-------+---------+-------------+-----------+-------------+
| 1     | NULL    | context     | string    | 0           |
+-------+---------+-------------+-----------+-------------+
1 row in set (0.00 sec)

You can see that the hive_wordcount table has been registered, with a single column named context.

Load data into the table

cd $HADOOP_HOME
bin/hadoop fs -text /data/input
19/01/11 11:00:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hello world
hello hello world

Now load /data/input from HDFS into the table:

hive> LOAD DATA INPATH '/data/input' INTO TABLE hive_wordcount;
Loading data to table default.hive_wordcount
Table default.hive_wordcount stats: [numFiles=1, totalSize=30]
OK
Time taken: 1.078 seconds

If the file is on the local filesystem rather than in HDFS, the command is LOAD DATA LOCAL INPATH, as sketched below.
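
For example, loading the same content straight from the local filesystem would look like this; note that LOAD DATA INPATH moves the source file into the table's directory, while the LOCAL variant copies it:

hive> LOAD DATA LOCAL INPATH '/tmp/input' INTO TABLE hive_wordcount;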

Run a query

hive> select word, count(1) from hive_wordcount lateral view explode(split(context, ' ')) wc as word group by word;
Query ID = simon_20190111105555_39805c49-890c-4241-8388-56542c90fa2e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1548765492126_0003, Tracking URL = http://localhost:8088/proxy/application_1548765492126_0003/
Kill Command = /usr/local/hadoop/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1548765492126_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-01-11 11:11:01,153 Stage-1 map = 0%, reduce = 0%
2019-01-11 11:11:06,299 Stage-1 map = 100%, reduce = 0%
2019-01-11 11:11:12,475 Stage-1 map = 100%, reduce = 100%
Ended Job = job_1548765492126_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 HDFS Read: 8786 HDFS Write: 16 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
hello 3
world 2
Time taken: 18.997 seconds, Fetched: 2 row(s)

lateral view explode(split(...)) breaks each row into tokens on the given delimiter, emitting one row per token.
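
To see what the explode/split pair does on its own, you can run it against the same table; each line of context is split on spaces and emitted as one row per word, which the outer query then groups and counts:

hive> select explode(split(context, ' ')) as word from hive_wordcount;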

In the end we successfully obtain the following result:

hello	3
world	2

YARN architecture

1 RM(ResourceManager) + n NM(NodeManager)

ResourceManager responsibilities

A cluster has exactly one RM in the active state; it is responsible for resource management and scheduling across the entire cluster.

  1. Handles client requests (job submission / kill)

  2. Launches and monitors the ApplicationMaster (one AM per job)

  3. Monitors the NMs

  4. Allocates and schedules the cluster's resources

NodeManager

A cluster has n NodeManagers; each is responsible for the resources of a single node and for how its tasks are running.

  1. Periodically reports the node's resource usage and the state of each Container to the RM

  2. Receives and handles the RM's commands to start and stop containers

  3. Manages the resources and tasks of its own node

ApplicationMaster

One per application / job; responsible for managing that application.

  1. Splits the input data

  2. Requests resources (containers) from the RM for the application and assigns them to its internal tasks

  3. Communicates with the NMs to start and stop tasks; tasks run inside containers

  4. Monitors tasks and handles their fault tolerance

Container

A description of what a task runs with: CPU, memory, and environment variables.

YARN execution flow

  1. The user submits a job to YARN

  2. The RM allocates the job's first container (for the AM)

  3. The RM contacts the corresponding NM and has it launch the application's AM inside that container

  4. The AM first registers with the RM, then requests resources for the individual tasks and monitors how they run

  5. The AM polls the RM over an RPC protocol to request and receive resources

  6. Once the AM has obtained resources, it contacts the corresponding NMs and has them start the tasks

  7. The NMs launch the job's tasks (a sketch for observing this flow follows below)
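
To watch the flow above on a live cluster, you can list YARN applications and pull the logs of a finished one; a minimal sketch, assuming $HADOOP_HOME is set as in the earlier sections (bin/yarn logs requires log aggregation to be enabled):

cd $HADOOP_HOME
bin/yarn application -list
bin/yarn logs -applicationId <applicationId>   # substitute a real application id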

Configure YARN

Configure etc/hadoop/mapred-site.xml

cd $HADOOP_HOME
cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
vi etc/hadoop/mapred-site.xml

Add the following:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Configure etc/hadoop/yarn-site.xml

vi etc/hadoop/yarn-site.xml

Add the following:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

Start YARN

sbin/start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/hadoop-2.6.0-cdh5.7.0/logs/yarn-simon-resourcemanager-localhost.out
localhost: starting nodemanager, logging to /usr/local/hadoop/hadoop-2.6.0-cdh5.7.0/logs/yarn-simon-nodemanager-localhost.out

jps
41620 NodeManager
41546 ResourceManager

Seeing ResourceManager and NodeManager in the jps output means YARN has started successfully.

You can also open http://localhost:8088 in a browser to view the YARN ResourceManager web UI.

Submit a job

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount
Usage: wordcount <in> [<in>...] <out>

This tells us we must supply the input file(s) and a location for the output.

First, create a test file:

vi /tmp/input

hello world
hello hello world

Upload the input file to the /data/ directory in HDFS:

bin/hadoop fs -put /tmp/input /data
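
You can verify the upload before running the job:

bin/hadoop fs -ls /data
bin/hadoop fs -text /data/input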

Run the command again:

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount /data/input /data/out

Check the results:

bin/hadoop fs -ls /data/out
19/01/09 21:07:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r-- 1 simon supergroup 0 2019-01-29 21:06 /data/out/_SUCCESS
-rw-r--r-- 1 simon supergroup 16 2019-01-29 21:06 /data/out/part-r-00000

The wordcount result is in /data/out/part-r-00000:

bin/hadoop fs -text /data/out/part-r-00000
19/01/09 21:08:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hello 3
world 2

If you submit the same job again, you get the following FileAlreadyExistsException:

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount /data/input /data/out
19/01/09 21:09:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/01/09 21:09:53 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/01/09 21:09:53 WARN security.UserGroupInformation: PriviledgedActionException as:simon (auth:SIMPLE) cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:8020/data/out already exists
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:8020/data/out already exists

So to re-run the job, you must first delete the output directory, as shown below.
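
For example (this permanently removes the previous results, so save them first if needed):

bin/hadoop fs -rm -r /data/out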
