一:分組 (groupby)
- 對(duì)數(shù)據(jù)集進(jìn)行分組,然后對(duì)每組進(jìn)行統(tǒng)計(jì)分析
- SQL 能夠?qū)?shù)據(jù)進(jìn)行過(guò)濾,分組聚合
- pandas 能利用 groupby 進(jìn)行更加復(fù)雜的分組運(yùn)算
- 分組運(yùn)算過(guò)程:split->apply->combine 拆分:進(jìn)行分組的根據(jù)應(yīng)用:每個(gè)分組運(yùn)行的計(jì)算規(guī)則合并:把每個(gè)分組的計(jì)算結(jié)果合并起來(lái)
示例代碼:
import pandas as pd
import numpy as np
dict_obj = {'key1' : ['a', 'b', 'a', 'b',
'a', 'b', 'a', 'a'],
'key2' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'data1': np.random.randn(8),
'data2': np.random.randn(8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)
運(yùn)行結(jié)果:
data1 data2 key1 key2
0 0.974685 -0.672494 a one
1 -0.214324 0.758372 b one
2 1.508838 0.392787 a two
3 0.522911 0.630814 b three
4 1.347359 -0.177858 a two
5 -0.264616 1.017155 b two
6 -0.624708 0.450885 a one
7 -1.019229 -1.143825 a three
一、GroupBy 對(duì)象:DataFrameGroupBy,SeriesGroupBy
1. 分組操作
groupby() 進(jìn)行分組,GroupBy 對(duì)象沒(méi)有進(jìn)行實(shí)際運(yùn)算,只是包含分組的中間數(shù)據(jù)按列名分組:obj.groupby(‘label’)
示例代碼:
print(type(df_obj.groupby('key1'))) # dataframe根據(jù)key1進(jìn)行分組
print(type(df_obj['data1'].groupby(df_obj['key1']))) # dataframe的 data1 列根據(jù) key1 進(jìn)行分組
運(yùn)行結(jié)果:
<class 'pandas.core.groupby.DataFrameGroupBy'>
<class 'pandas.core.groupby.SeriesGroupBy'>
2. 分組運(yùn)算
對(duì) GroupBy 對(duì)象進(jìn)行分組運(yùn)算/多重分組運(yùn)算,如 mean() 非數(shù)值數(shù)據(jù)不進(jìn)行分組運(yùn)算
示例代碼:
grouped1 = df_obj.groupby('key1') # 分組運(yùn)算
print(grouped1.mean())
grouped2 = df_obj['data1'].groupby(df_obj['key1'])
print(grouped2.mean())
運(yùn)行結(jié)果:
data1 data2
key1
a 0.437389 -0.230101
b 0.014657 0.802114
key1
a 0.437389
b 0.014657
Name: data1, dtype: float64
size() 返回每個(gè)分組的元素個(gè)數(shù)
示例代碼:
print(grouped1.size()) # size
print(grouped2.size())
運(yùn)行結(jié)果:
key1
a 5
b 3
dtype: int64
key1
a 5
b 3
dtype: int64
3. 按自定義的 key 分組
obj.groupby(self_def_key) 自定義的 key 可為列表或多層列表 obj.groupby([‘label1’, ‘label2’])->多層 dataframe
示例代碼:
self_def_key = [0, 1, 2, 3, 3, 4, 5, 7] # 按自定義key分組,列表
print(df_obj.groupby(self_def_key).size())
print(df_obj.groupby([df_obj['key1'], df_obj['key2']]).size()) # 按自定義key分組,多層列表
grouped2 = df_obj.groupby(['key1', 'key2']) # 按多個(gè)列多層分組
print(grouped2.size())
grouped3 = df_obj.groupby(['key2', 'key1']) # 多層分組按key的順序進(jìn)行
print(grouped3.mean())
print(grouped3.mean().unstack()) # unstack可以將多層索引的結(jié)果轉(zhuǎn)換成單層的dataframe
運(yùn)行結(jié)果:
0 1
1 1
2 1
3 2
4 1
5 1
7 1
dtype: int64
key1 key2
a one 2
three 1
two 2
b one 1
three 1
two 1
dtype: int64
key1 key2
a one 2
three 1
two 2
b one 1
three 1
two 1
dtype: int64
data1 data2
key2 key1
one a 0.174988 -0.110804
b -0.214324 0.758372
three a -1.019229 -1.143825
b 0.522911 0.630814
two a 1.428099 0.107465
b -0.264616 1.017155
data1 data2
key1 a b a b
key2
one 0.174988 -0.214324 -0.110804 0.758372
three -1.019229 0.522911 -1.143825 0.630814
two 1.428099 -0.264616 0.107465 1.017155
二、GroupBy對(duì)象支持迭代操作
每次迭代返回一個(gè)元組 (group_name, group_data)可用于分組數(shù)據(jù)的具體運(yùn)算
1. 單層分組
示例代碼:
for group_name, group_data in grouped1: # 單層分組,根據(jù)key1
print(group_name)
print(group_data)
運(yùn)行結(jié)果:
a data1 data2 key1 key2
0 0.974685 -0.672494 a one
2 1.508838 0.392787 a two
4 1.347359 -0.177858 a two
6 -0.624708 0.450885 a one
7 -1.019229 -1.143825 a three
b
data1 data2 key1 key2
1 -0.214324 0.758372 b one
3 0.522911 0.630814 b three
5 -0.264616 1.017155 b two
2. 多層分組
示例代碼:
for group_name, group_data in grouped2: # 多層分組,根據(jù)key1 和 key2
print(group_name)
print(group_data)
運(yùn)行結(jié)果:
('a', 'one')
data1 data2 key1 key2
0 0.974685 -0.672494 a one
6 -0.624708 0.450885 a one
('a', 'three')
data1 data2 key1 key2
7 -1.019229 -1.143825 a three
('a', 'two')
data1 data2 key1 key2
2 1.508838 0.392787 a two
4 1.347359 -0.177858 a two
('b', 'one')
data1 data2 key1 key2
1 -0.214324 0.758372 b one
('b', 'three')
data1 data2 key1 key2
3 0.522911 0.630814 b three
('b', 'two')
data1 data2 key1 key2
5 -0.264616 1.017155 b two
三、GroupBy對(duì)象可以轉(zhuǎn)換成列表或字典
示例代碼:
print(list(grouped1)) # GroupBy對(duì)象轉(zhuǎn)換list
print(dict(list(grouped1))) # GroupBy對(duì)象轉(zhuǎn)換dict
運(yùn)行結(jié)果:
[('a', data1 data2 key1 key2
0 0.974685 -0.672494 a one
2 1.508838 0.392787 a two
4 1.347359 -0.177858 a two
6 -0.624708 0.450885 a one
7 -1.019229 -1.143825 a three),
('b', data1 data2 key1 key2
1 -0.214324 0.758372 b one
3 0.522911 0.630814 b three
5 -0.264616 1.017155 b two)]
{'a': data1 data2 key1 key2
0 0.974685 -0.672494 a one
2 1.508838 0.392787 a two
4 1.347359 -0.177858 a two
6 -0.624708 0.450885 a one
7 -1.019229 -1.143825 a three,
'b': data1 data2 key1 key2
1 -0.214324 0.758372 b one
3 0.522911 0.630814 b three
5 -0.264616 1.017155 b two}
1. 按列分組、按數(shù)據(jù)類型分組
示例代碼:
print(df_obj.dtypes) # 按列分組
print(df_obj.groupby(df_obj.dtypes, axis=1).size()) # 按數(shù)據(jù)類型分組
print(df_obj.groupby(df_obj.dtypes, axis=1).sum())
運(yùn)行結(jié)果:
data1 float64
data2 float64
key1 object
key2 object
dtype: object
float64 2
object 2
dtype: int64
float64 object
0 0.302191 a one
1 0.544048 b one
2 1.901626 a two
3 1.153725 b three
4 1.169501 a two
5 0.752539 b two
6 -0.173823 a one
7 -2.163054 a three
2. 其他分組方法
示例代碼:
df_obj2 = pd.DataFrame(np.random.randint(1, 10, (5,5)),
columns=['a', 'b', 'c', 'd', 'e'],
index=['A', 'B', 'C', 'D', 'E'])
df_obj2.ix[1, 1:4] = np.NaN
print(df_obj2)
運(yùn)行結(jié)果:
a b c d e
A 7 2.0 4.0 5.0 8
B 4 NaN NaN NaN 1
C 3 2.0 5.0 4.0 6
D 3 1.0 9.0 7.0 3
E 6 1.0 6.0 8.0 1
3. 通過(guò)字典分組
示例代碼:
mapping_dict = {'a':'Python', 'b':'Python', 'c':'Java', 'd':'C', 'e':'Java'} # 通過(guò)字典分組
print(df_obj2.groupby(mapping_dict, axis=1).size())
print(df_obj2.groupby(mapping_dict, axis=1).count()) # 非NaN的個(gè)數(shù)
print(df_obj2.groupby(mapping_dict, axis=1).sum())
運(yùn)行結(jié)果:
C 1
Java 2
Python 2
dtype: int64
C Java Python
A 1 2 2
B 0 1 1
C 1 2 2
D 1 2 2
E 1 2 2
C Java Python
A 5.0 12.0 9.0
B NaN 1.0 4.0
C 4.0 11.0 5.0
D 7.0 12.0 4.0
E 8.0 7.0 7.0
4. 通過(guò)函數(shù)分組,函數(shù)傳入的參數(shù)為行索引或列索引
示例代碼:
df_obj3 = pd.DataFrame(np.random.randint(1, 10, (5,5)),
columns=['a', 'b', 'c', 'd', 'e'],
index=['AA', 'BBB', 'CC', 'D', 'EE']) # 通過(guò)函數(shù)分組
def group_key(idx): #df_obj3
"""
idx 為列索引或行索引
"""
#return idx
return len(idx)
print(df_obj3.groupby(group_key).size())
# 以上自定義函數(shù)等價(jià)于
#df_obj3.groupby(len).size()
運(yùn)行結(jié)果:
1 1
2 3
3 1
dtype: int64
5. 通過(guò)索引級(jí)別分組
示例代碼:
columns = pd.MultiIndex.from_arrays([['Python', 'Java', 'Python', 'Java', 'Python'],
['A', 'A', 'B', 'C', 'B']], names=['language', 'index']) # 通過(guò)索引級(jí)別分組
df_obj4 = pd.DataFrame(np.random.randint(1, 10, (5, 5)), columns=columns)
print(df_obj4)
print(df_obj4.groupby(level='language', axis=1).sum()) # 根據(jù)language進(jìn)行分組
print(df_obj4.groupby(level='index', axis=1).sum()) # 根據(jù)index進(jìn)行分組
運(yùn)行結(jié)果:
language Python Java Python Java Python
index A A B C B
0 2 7 8 4 3
1 5 2 6 1 2
2 6 4 4 5 2
3 4 7 4 3 1
4 7 4 3 4 8
language Java Python
0 11 13
1 3 13
2 9 12
3 10 9
4 8 18
index A B C
0 9 11 4
1 7 8 1
2 10 6 5
3 11 5 3
4 11 11 4
二:聚合 (aggregation)
- 數(shù)組產(chǎn)生標(biāo)量的過(guò)程,如mean()、count()等
- 常用于對(duì)分組后的數(shù)據(jù)進(jìn)行計(jì)算
示例代碼:
dict_obj = {'key1' : ['a', 'b', 'a', 'b',
'a', 'b', 'a', 'a'],
'key2' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'data1': np.random.randint(1,10, 8),
'data2': np.random.randint(1,10, 8)}
df_obj5 = pd.DataFrame(dict_obj)
print(df_obj5)
運(yùn)行結(jié)果:
data1 data2 key1 key2
0 3 7 a one
1 1 5 b one
2 7 4 a two
3 2 4 b three
4 6 4 a two
5 9 9 b two
6 3 5 a one
7 8 4 a three
1. 內(nèi)置的聚合函數(shù)
sum(), mean(), max(), min(), count(), size(), describe()
示例代碼:
print(df_obj5.groupby('key1').sum())
print(df_obj5.groupby('key1').max())
print(df_obj5.groupby('key1').min())
print(df_obj5.groupby('key1').mean())
print(df_obj5.groupby('key1').size())
print(df_obj5.groupby('key1').count())
print(df_obj5.groupby('key1').describe())
運(yùn)行結(jié)果:
data1 data2
key1
a 27 24
b 12 18
data1 data2 key2
key1
a 8 7 two
b 9 9 two
data1 data2 key2
key1
a 3 4 one
b 1 4 one
data1 data2
key1
a 5.4 4.8
b 4.0 6.0
key1
a 5
b 3
dtype: int64
data1 data2 key2
key1
a 5 5 5
b 3 3 3
data1 data2
key1
a count 5.000000 5.000000
mean 5.400000 4.800000
std 2.302173 1.303840
min 3.000000 4.000000
25% 3.000000 4.000000
50% 6.000000 4.000000
75% 7.000000 5.000000
max 8.000000 7.000000
b count 3.000000 3.000000
mean 4.000000 6.000000
std 4.358899 2.645751
min 1.000000 4.000000
25% 1.500000 4.500000
50% 2.000000 5.000000
75% 5.500000 7.000000
max 9.000000 9.000000
2. 可自定義函數(shù),傳入agg方法中
grouped.agg(func)func的參數(shù)為groupby索引對(duì)應(yīng)的記錄
示例代碼:
def peak_range(df): # 自定義聚合函數(shù)
"""
返回?cái)?shù)值范圍
"""
#print type(df) #參數(shù)為索引所對(duì)應(yīng)的記錄
return df.max() - df.min()
print(df_obj5.groupby('key1').agg(peak_range))
print(df_obj.groupby('key1').agg(lambda df : df.max() - df.min()))
運(yùn)行結(jié)果:
data1 data2
key1
a 5 3
b 8 5
data1 data2
key1
a 2.528067 1.594711
b 0.787527 0.386341
In [25]:
3. 應(yīng)用多個(gè)聚合函數(shù)
同時(shí)應(yīng)用多個(gè)函數(shù)進(jìn)行聚合操作,使用函數(shù)列表
示例代碼:
'''應(yīng)用多個(gè)聚合函數(shù)
同時(shí)應(yīng)用多個(gè)聚合函數(shù)'''
print(df_obj.groupby('key1').agg(['mean', 'std', 'count', peak_range])) # 默認(rèn)列名為函數(shù)名
print(df_obj.groupby('key1').agg(['mean', 'std', 'count', ('range', peak_range)])) # 通過(guò)元組提供新的列名
運(yùn)行結(jié)果:
data1 data2
mean std count peak_range mean std count peak_range
key1
a 0.437389 1.174151 5 2.528067 -0.230101 0.686488 5 1.594711
b 0.014657 0.440878 3 0.787527 0.802114 0.196850 3 0.386341
data1 data2
mean std count range mean std count range
key1
a 0.437389 1.174151 5 2.528067 -0.230101 0.686488 5 1.594711
b 0.014657 0.440878 3 0.787527 0.802114 0.196850 3 0.386341
4. 對(duì)不同的列分別作用不同的聚合函數(shù),使用dict
示例代碼:
dict_mapping = {'data1':'mean',
'data2':'sum'} # 每列作用不同的聚合函數(shù)
print(df_obj.groupby('key1').agg(dict_mapping))
dict_mapping = {'data1':['mean','max'],
'data2':'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))
運(yùn)行結(jié)果:
data1 data2
key1
a 0.437389 -1.150505
b 0.014657 2.406341
data1 data2
mean max sum
key1
a 0.437389 1.508838 -1.150505
b 0.014657 0.522911 2.406341
5. 常用的內(nèi)置聚合函數(shù)
示例代碼:
import pandas as pd
import numpy as np
dict_obj = {'key1' : ['a', 'b', 'a', 'b',
'a', 'b', 'a', 'a'],
'key2' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'data1': np.random.randint(1, 10, 8),
'data2': np.random.randint(1, 10, 8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)
k1_sum = df_obj.groupby('key1').sum().add_prefix('sum_') # 按key1分組后,計(jì)算data1,data2的統(tǒng)計(jì)信息并附加到原始表格中,并添加表頭前綴
print(k1_sum)
運(yùn)行結(jié)果:
data1 data2 key1 key2
0 5 1 a one
1 7 8 b one
2 1 9 a two
3 2 6 b three
4 9 8 a two
5 8 3 b two
6 3 5 a one
7 8 3 a three
sum_data1 sum_data2
key1
a 26 26
b 17 17
聚合運(yùn)算后會(huì)改變?cè)紨?shù)據(jù)的形狀,如何保持原始數(shù)據(jù)的形狀?
1. merge
使用merge的外連接,比較復(fù)雜
示例代碼:
k1_sum_merge = pd.merge(df_obj, k1_sum, left_on='key1', right_index=True) # 方法1,使用merge
print(k1_sum_merge)
運(yùn)行結(jié)果:
data1 data2 key1 key2 sum_data1 sum_data2
0 5 1 a one 26 26
2 1 9 a two 26 26
4 9 8 a two 26 26
6 3 5 a one 26 26
7 8 3 a three 26 26
1 7 8 b one 17 17
3 2 6 b three 17 17
5 8 3 b two 17 17
2. transform
transform的計(jì)算結(jié)果和原始數(shù)據(jù)的形狀保持一致,如:grouped.transform(np.sum)
示例代碼:
k1_sum_tf = df_obj.groupby('key1').transform(np.sum).add_prefix('sum_') # 方法2,使用transform
df_obj[k1_sum_tf.columns] = k1_sum_tf
print(df_obj)
運(yùn)行結(jié)果:
data1 data2 key1 key2 sum_data1 sum_data2 sum_key2
0 5 1 a one 26 26 onetwotwoonethree
1 7 8 b one 17 17 onethreetwo
2 1 9 a two 26 26 onetwotwoonethree
3 2 6 b three 17 17 onethreetwo
4 9 8 a two 26 26 onetwotwoonethree
5 8 3 b two 17 17 onethreetwo
6 3 5 a one 26 26 onetwotwoonethree
7 8 3 a three 26 26 onetwotwoonethree
也可傳入自定義函數(shù),
示例代碼:
def diff_mean(s): # 自定義函數(shù)傳入transform
"""
返回?cái)?shù)據(jù)與均值的差值
"""
return s - s.mean()
print(df_obj.groupby('key1').transform(diff_mean))
運(yùn)行結(jié)果:
data1 data2 sum_data1 sum_data2
0 -0.200000 -4.200000 0 0
1 1.333333 2.333333 0 0
2 -4.200000 3.800000 0 0
3 -3.666667 0.333333 0 0
4 3.800000 2.800000 0 0
5 2.333333 -2.666667 0 0
6 -2.200000 -0.200000 0 0
7 2.800000 -2.200000 0 0
groupby.apply(func)
func函數(shù)也可以在各分組上分別調(diào)用,最后結(jié)果通過(guò)pd.concat組裝到一起(數(shù)據(jù)合并)
示例代碼:
import pandas as pd
import numpy as np
dataset_path = './starcraft.csv'
df_data = pd.read_csv(dataset_path, usecols=['LeagueIndex', 'Age', 'HoursPerWeek',
'TotalHours', 'APM'])
def top_n(df, n=3, column='APM'):
"""
返回每個(gè)分組按 column 的 top n 數(shù)據(jù)
"""
return df.sort_values(by=column, ascending=False)[:n]
print(df_data.groupby('LeagueIndex').apply(top_n))
運(yùn)行結(jié)果:
LeagueIndex Age HoursPerWeek TotalHours APM
LeagueIndex
1 2214 1 20.0 12.0 730.0 172.9530
2246 1 27.0 8.0 250.0 141.6282
1753 1 20.0 28.0 100.0 139.6362
2 3062 2 20.0 6.0 100.0 179.6250
3229 2 16.0 24.0 110.0 156.7380
1520 2 29.0 6.0 250.0 151.6470
3 1557 3 22.0 6.0 200.0 226.6554
484 3 19.0 42.0 450.0 220.0692
2883 3 16.0 8.0 800.0 208.9500
4 2688 4 26.0 24.0 990.0 249.0210
1759 4 16.0 6.0 75.0 229.9122
2637 4 23.0 24.0 650.0 227.2272
5 3277 5 18.0 16.0 950.0 372.6426
93 5 17.0 36.0 720.0 335.4990
202 5 37.0 14.0 800.0 327.7218
6 734 6 16.0 28.0 730.0 389.8314
2746 6 16.0 28.0 4000.0 350.4114
1810 6 21.0 14.0 730.0 323.2506
7 3127 7 23.0 42.0 2000.0 298.7952
104 7 21.0 24.0 1000.0 286.4538
1654 7 18.0 98.0 700.0 236.0316
8 3393 8 NaN NaN NaN 375.8664
3373 8 NaN NaN NaN 364.8504
3372 8 NaN NaN NaN 355.3518
1. 產(chǎn)生層級(jí)索引:外層索引是分組名,內(nèi)層索引是df_obj的行索引
示例代碼:
print(df_data.groupby('LeagueIndex').apply(top_n, n=2, column='Age')) # apply函數(shù)接收的參數(shù)會(huì)傳入自定義的函數(shù)中
運(yùn)行結(jié)果:
LeagueIndex Age HoursPerWeek TotalHours APM
LeagueIndex
1 3146 1 40.0 12.0 150.0 38.5590
3040 1 39.0 10.0 500.0 29.8764
2 920 2 43.0 10.0 730.0 86.0586
2437 2 41.0 4.0 200.0 54.2166
3 1258 3 41.0 14.0 800.0 77.6472
2972 3 40.0 10.0 500.0 60.5970
4 1696 4 44.0 6.0 500.0 89.5266
1729 4 39.0 8.0 500.0 86.7246
5 202 5 37.0 14.0 800.0 327.7218
2745 5 37.0 18.0 1000.0 123.4098
6 3069 6 31.0 8.0 800.0 133.1790
2706 6 31.0 8.0 700.0 66.9918
7 2813 7 26.0 36.0 1300.0 188.5512
1992 7 26.0 24.0 1000.0 219.6690
8 3340 8 NaN NaN NaN 189.7404
3341 8 NaN NaN NaN 287.8128
2. 禁止層級(jí)索引, group_keys=False
示例代碼:
print(df_data.groupby('LeagueIndex', group_keys=False).apply(top_n))
運(yùn)行結(jié)果:
LeagueIndex Age HoursPerWeek TotalHours APM
2214 1 20.0 12.0 730.0 172.9530
2246 1 27.0 8.0 250.0 141.6282
1753 1 20.0 28.0 100.0 139.6362
3062 2 20.0 6.0 100.0 179.6250
3229 2 16.0 24.0 110.0 156.7380
1520 2 29.0 6.0 250.0 151.6470
1557 3 22.0 6.0 200.0 226.6554
484 3 19.0 42.0 450.0 220.0692
2883 3 16.0 8.0 800.0 208.9500
2688 4 26.0 24.0 990.0 249.0210
1759 4 16.0 6.0 75.0 229.9122
2637 4 23.0 24.0 650.0 227.2272
3277 5 18.0 16.0 950.0 372.6426
93 5 17.0 36.0 720.0 335.4990
202 5 37.0 14.0 800.0 327.7218
734 6 16.0 28.0 730.0 389.8314
2746 6 16.0 28.0 4000.0 350.4114
1810 6 21.0 14.0 730.0 323.2506
3127 7 23.0 42.0 2000.0 298.7952
104 7 21.0 24.0 1000.0 286.4538
1654 7 18.0 98.0 700.0 236.0316
3393 8 NaN NaN NaN 375.8664
3373 8 NaN NaN NaN 364.8504
3372 8 NaN NaN NaN 355.3518
apply可以用來(lái)處理不同分組內(nèi)的缺失數(shù)據(jù)填充 ,填充該分組的均值。
推薦好課:Python 自動(dòng)化辦公、Python 自動(dòng)化管理