Finology: Big Data Finance

Quantifying finance with big data

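Jupyter Notebook works well for data research, but version control is a problem: .ipynb files store outputs and execution counts alongside the code, so diffs are noisy. The practice below works around that.

Each time an .ipynb file is saved, a hook automatically converts it to a .py file, and only the .py file is committed to GitHub.

Generate the Jupyter Notebook configuration file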

jupyter notebook --generate-config

Running this command creates the file ~/.jupyter/jupyter_notebook_config.py.

Edit the configuration file

Add the following:

### If you want to auto-save .html and .py versions of your notebook:
# modified from: https://github.com/ipython/ipython/issues/8009
# Solution2: https://jupyter-notebook.readthedocs.io/en/stable/extending/savehooks.html
import os
from subprocess import check_call
import re

def clear_prompt(dir_path, nb_fname, log_func):
    """remove the number in '# In[ ]:'"""
    name, ext = os.path.splitext(nb_fname)
    pattern = re.compile(r'^# In\[\d+\]:')

    for n_ext in ['.py', '.txt']:
        script_name = os.path.join(dir_path, name+n_ext)
        if os.path.exists(script_name):
            new_lines = []
            with open(script_name, 'rt', encoding='utf-8') as f:
                lines = f.readlines()
                for line in lines:
                    new_line = re.sub(pattern, '# In[ ]:', line)
                    new_lines.append(new_line)
            with open(script_name, 'wt', encoding='utf-8') as f:
                f.writelines(new_lines)
            log_func('Remove number in "# In[ ]:"! File Name: %s' % script_name)
            break

def post_save(model, os_path, contents_manager):
    """post-save hook for converting notebooks to .py scripts"""
    if model['type'] != 'notebook':
        return  # only do this for notebooks
    d, fname = os.path.split(os_path)
    check_call(['jupyter', 'nbconvert', '--to', 'script', fname], cwd=d)  # '--no-prompt',
    log = contents_manager.log
    # log.info('Filename:%s'%fname)
    clear_prompt(d, fname, log.info)
    # check_call(['ipython', 'nbconvert', '--to', 'html', fname], cwd=d)

c.FileContentsManager.post_save_hook = post_save

Restart Jupyter Notebook for the configuration to take effect.
From then on, saving an .ipynb file automatically generates the matching .py file.
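
The prompt-number cleanup is what keeps the exported script stable between saves: execution counts change on every run, so leaving them in would produce spurious diffs. The following is a minimal sketch of that substitution on an in-memory string, using the same regular expression as `clear_prompt`. The helper name `strip_prompt_numbers` and the sample text are illustrative assumptions, not part of the original configuration.

```python
import re

# Same pattern as in clear_prompt, compiled with MULTILINE so that ^ matches
# at the start of every line of a multi-line string.
PROMPT_PATTERN = re.compile(r'^# In\[\d+\]:', flags=re.MULTILINE)


def strip_prompt_numbers(script_text):
    """Replace '# In[<n>]:' markers with '# In[ ]:' (hypothetical helper)."""
    return PROMPT_PATTERN.sub('# In[ ]:', script_text)


# Sample text shaped like the output of `jupyter nbconvert --to script`.
sample = "# In[12]:\nimport pandas as pd\n\n# In[13]:\ndf = pd.DataFrame()\n"
cleaned = strip_prompt_numbers(sample)
print(cleaned)
```

Because the pattern is anchored at the start of a line, a comment such as `x = 1  # In[3]:` in the middle of a line is left untouched.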

Configure the repository's .gitignore file

*.ipynb

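After adding this rule, you may find that it has no effect, because .gitignore does not apply to files Git already tracks. From the project root, clear the index, then re-add and commit the files:

git rm -r --cached .
git add .
git commit -m "Stop tracking .ipynb files"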

Before installing PyUserInput, install the following dependencies for your platform:

Linux - Xlib
Mac - Quartz, AppKit
Windows - pywin32, pyHook

pip install pywin32

To install pyHook, find the wheel that matches your Python version at https://www.lfd.uci.edu/~gohlke/pythonlibs/, for example pyHook-1.5.1-cp37-cp37m-win_amd64.whl.

Download it and install it from the local file:

pip install pyHook-1.5.1-cp37-cp37m-win_amd64.whl

Install PyUserInput

pip install PyUserInput

Then import the library and create the mouse and keyboard objects:

from pymouse import PyMouse
from pykeyboard import PyKeyboard

m = PyMouse()
k = PyKeyboard()

Calling the API

x_dim, y_dim = m.screen_size()
m.click(x_dim // 2, y_dim // 2, 1)  # integer pixel coordinates; 1 is the left button
k.type_string('Hello, World!')


# pressing a key
k.press_key('H')
# which you then follow with a release of the key
k.release_key('H')
# or you can 'tap' a key which does both
k.tap_key('e')
# note that tap_key can repeat a keystroke n times, with an interval between repeats
k.tap_key('l',n=2,interval=5)
# and you can send a string if needed too
k.type_string('o World!')


#Create an Alt+Tab combo
k.press_key(k.alt_key)
k.tap_key(k.tab_key)
k.release_key(k.alt_key)

k.tap_key(k.function_keys[5]) # Tap F5
k.tap_key(k.numpad_keys['Home']) # Tap 'Home' on the numpad
k.tap_key(k.numpad_keys[5], n=3) # Tap 5 on the numpad, thrice


# Mac example
k.press_keys(['Command','shift','3'])
# Windows example
k.press_keys([k.windows_l_key,'d'])


# Windows
k.tap_key(k.alt_key)
# Mac
k.tap_key('Alternate')

Example:

from pymouse import PyMouseEvent

def fibo():
    a = 0
    yield a
    b = 1
    yield b
    while True:
        a, b = b, a+b
        yield b

class Clickonacci(PyMouseEvent):
    def __init__(self):
        PyMouseEvent.__init__(self)
        self.fibo = fibo()

    def click(self, x, y, button, press):
        '''Print Fibonacci numbers when the left click is pressed.'''
        if button == 1:
            if press:
                print(next(self.fibo))  # Python 3: use next(gen), not gen.next()
        else:  # Exit if any other mouse button used
            self.stop()

C = Clickonacci()
C.run()
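
The generator that drives the click handler can be checked on its own, without starting a mouse listener. The sketch below reuses `fibo` from the example and advances it with the built-in `next()` and with `itertools.islice`; the variable names `first` and `first_eight` are only for illustration.

```python
from itertools import islice


def fibo():
    """Yield the Fibonacci sequence 0, 1, 1, 2, 3, 5, ... indefinitely."""
    a = 0
    yield a
    b = 1
    yield b
    while True:
        a, b = b, a + b
        yield b


# In Python 3 a generator is advanced with the built-in next().
gen = fibo()
first = next(gen)

# islice takes a finite prefix of the infinite generator.
first_eight = list(islice(fibo(), 8))
print(first, first_eight)
```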

Several pandas methods take an axis parameter, and its meaning can be unclear at first. The notes below list a few uses with a brief explanation.

def func(x, y):
    return x + y

# axis=1 passes each row to the lambda, so x['col1'] and x['col2'] are scalars
df['col3'] = df.apply(lambda x: func(x['col1'], x['col2']), axis=1)

axis defaults to 0, in which case the argument x is one column.
With axis=1, the argument x is one row.
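
A small sketch makes the difference concrete. The DataFrame and its values are made up for illustration; the column names `col1` and `col2` follow the example above.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series,
# so the result has one value per column.
column_sums = df.apply(lambda col: col.sum(), axis=0)

# axis=1: the function receives each row as a Series indexed by column name,
# so the result has one value per row.
row_sums = df.apply(lambda row: row['col1'] + row['col2'], axis=1)

print(column_sums.to_dict())
print(row_sums.tolist())
```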

AND, OR, and NOT when filtering data

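# Multiple conditions: AND - rating above 5 and more than 100000 votes
And_df = df[(df['Rating']>5) & (df['Votes']>100000)]

# Multiple conditions: OR - rating above 5 or more than 100000 votes
Or_df = df[(df['Rating']>5) | (df['Votes']>100000)]

# Multiple conditions: NOT - exclude rows with rating above 5 or more than 100000 votes
Not_df = df[~((df['Rating']>5) | (df['Votes']>100000))]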
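
The three filters can be tried on a tiny table. This is a sketch with invented rows; only the column names `Rating` and `Votes` come from the text. The parentheses around each comparison matter because `&` and `|` bind more tightly than `>` in Python.

```python
import pandas as pd

df = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'Rating': [8.1, 4.2, 6.5, 3.0],
    'Votes': [757074, 150000, 9000, 500],
})

# Name the masks once so the combinations stay readable.
high_rating = df['Rating'] > 5
many_votes = df['Votes'] > 100000

and_df = df[high_rating & many_votes]      # both conditions hold
or_df = df[high_rating | many_votes]       # at least one condition holds
not_df = df[~(high_rating | many_votes)]   # neither condition holds

# De Morgan's law: NOT (a OR b) selects the same rows as (NOT a) AND (NOT b).
same_as_not = df[~high_rating & ~many_votes]

print(and_df['Title'].tolist(), or_df['Title'].tolist(), not_df['Title'].tolist())
```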

Filtering on the number of words in a title fails when written like this:

# Raises AttributeError: 'Series' object has no attribute 'split'
df[len(df['Title'].split(" "))>=5]

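The cause is easy to see: df['Title'] is a column, that is, a Series, and a Series has no split method.

Could df['Title'].str.split(" ") be used instead? It can: the .str accessor applies split to every element, and .str.len() then gives the word count of each title.

The problem can also be solved with apply, as follows.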

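# Create a new column holding the number of words in each title
df['num_words_title'] = df.apply(lambda x : len(x['Title'].split(" ")),axis=1)

# Keep the rows whose title has at least 5 words
new_df = df[df['num_words_title']>=5]

Here x is one row, so x['Title'] is a plain string.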
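
The `.str` accessor gives the same result without a row-wise `apply`. The sketch below compares the two on a few made-up titles; the sample data is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Title': [
    'Guardians of the Galaxy',
    'Prometheus',
    'The Lord of the Rings: The Return of the King',
    'Split',
]})

# Vectorized: split every title into a list of words, then take each list's length.
word_counts = df['Title'].str.split(' ').str.len()
long_titles = df[word_counts >= 5]

# Row-wise version from the text, for comparison.
via_apply = df[df.apply(lambda x: len(x['Title'].split(' ')), axis=1) >= 5]

print(word_counts.tolist())
print(long_titles['Title'].tolist())
```

On large tables the vectorized form is usually the faster of the two, since it avoids building a Series object for every row.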

Complex filtering

Select the films whose revenue is below the average for their release year.

First, compute the average revenue for each year:

# Maps each year to its mean revenue; no numpy import is needed
year_revenue_dict = df.groupby('Year')['Revenue(Millions)'].mean().to_dict()

Next, define a function that returns a Boolean indicating whether a film's revenue is below the average for its year:

def bool_provider(revenue, year):
    return revenue<year_revenue_dict[year]

Finally, apply it to the dataset by combining apply with a lambda function:

new_df = df[df.apply(lambda x : bool_provider(x['Revenue(Millions)'], x['Year']), axis = 1)]
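
The same filter can also be written without a lookup dictionary. This sketch swaps in a different technique, `groupby(...).transform('mean')`, which returns the yearly mean aligned to every row so the comparison is a plain column operation. The rows are invented; only the column names come from the text.

```python
import pandas as pd

df = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D', 'E'],
    'Year': [2015, 2015, 2016, 2016, 2016],
    'Revenue(Millions)': [100.0, 300.0, 50.0, 150.0, 400.0],
})

# transform('mean') broadcasts each year's mean back onto the rows of that year.
year_mean = df.groupby('Year')['Revenue(Millions)'].transform('mean')

# Keep the films that earned less than their year's average.
below_average = df[df['Revenue(Millions)'] < year_mean]

print(year_mean.tolist())
print(below_average['Title'].tolist())
```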

Showing a progress bar while apply runs

from tqdm.notebook import tqdm  # notebook-friendly progress bar
tqdm.pandas()  # registers progress_apply on pandas objects

df["CustomRating"] = df.progress_apply(lambda x: custom_rating(x['Genre'],x['Rating']),axis=1)
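
The call above relies on a `custom_rating` function that is not shown. The following self-contained sketch fills it in with a hypothetical rule purely for illustration, and uses the plain `tqdm` class so it also runs outside a notebook; the sample rows are likewise made up.

```python
import pandas as pd
from tqdm import tqdm

# Registers progress_apply on DataFrame and Series.
tqdm.pandas(desc='custom_rating')


def custom_rating(genre, rating):
    """Hypothetical rule: raise thrillers by 1, lower comedies by 1."""
    if 'Thriller' in genre:
        return min(10.0, rating + 1)
    if 'Comedy' in genre:
        return max(0.0, rating - 1)
    return rating


df = pd.DataFrame({
    'Genre': ['Action,Thriller', 'Comedy', 'Drama'],
    'Rating': [7.5, 6.0, 8.0],
})

# Same call shape as apply, but a progress bar is drawn while rows are processed.
df['CustomRating'] = df.progress_apply(
    lambda x: custom_rating(x['Genre'], x['Rating']), axis=1)

print(df['CustomRating'].tolist())
```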