App下載

Pandas 分組聚合 懂?

猿友 2021-01-06 11:58:51 瀏覽數(shù) (2825)
反饋

一:分組 (groupby)

  • 對數(shù)據(jù)集進(jìn)行分組,然后對每組進(jìn)行統(tǒng)計分析
  • SQL 能夠?qū)?shù)據(jù)進(jìn)行過濾,分組聚合
  • pandas 能利用 groupby 進(jìn)行更加復(fù)雜的分組運算
  • 分組運算過程:split->apply->combine 拆分:進(jìn)行分組的根據(jù)應(yīng)用:每個分組運行的計算規(guī)則合并:把每個分組的計算結(jié)果合并起來

 

示例代碼:

import pandas as pd
import numpy as np

dict_obj = {'key1' : ['a', 'b', 'a', 'b',
'a', 'b', 'a', 'a'],
'key2' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'data1': np.random.randn(8),
'data2': np.random.randn(8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)

運行結(jié)果:

      data1     data2 key1   key2
0 0.974685 -0.672494 a one
1 -0.214324 0.758372 b one
2 1.508838 0.392787 a two
3 0.522911 0.630814 b three
4 1.347359 -0.177858 a two
5 -0.264616 1.017155 b two
6 -0.624708 0.450885 a one
7 -1.019229 -1.143825 a three

一、GroupBy 對象:DataFrameGroupBy,SeriesGroupBy

1. 分組操作

groupby() 進(jìn)行分組,GroupBy 對象沒有進(jìn)行實際運算,只是包含分組的中間數(shù)據(jù)按列名分組:obj.groupby(‘label’)

示例代碼:

print(type(df_obj.groupby('key1'))) # dataframe根據(jù)key1進(jìn)行分組
print(type(df_obj['data1'].groupby(df_obj['key1']))) # dataframe的 data1 列根據(jù) key1 進(jìn)行分組

運行結(jié)果:

<class 'pandas.core.groupby.DataFrameGroupBy'>
<class 'pandas.core.groupby.SeriesGroupBy'>

2. 分組運算

對 GroupBy 對象進(jìn)行分組運算/多重分組運算,如 mean() 非數(shù)值數(shù)據(jù)不進(jìn)行分組運算

示例代碼:

grouped1 = df_obj.groupby('key1') # 分組運算
print(grouped1.mean())

grouped2 = df_obj['data1'].groupby(df_obj['key1'])
print(grouped2.mean())

運行結(jié)果:

         data1     data2
key1
a 0.437389 -0.230101
b 0.014657 0.802114
key1
a 0.437389
b 0.014657
Name: data1, dtype: float64
size() 返回每個分組的元素個數(shù)

示例代碼:

print(grouped1.size()) # size
print(grouped2.size())

運行結(jié)果:

key1
a 5
b 3
dtype: int64
key1
a 5
b 3
dtype: int64

3. 按自定義的 key 分組

obj.groupby(self_def_key) 自定義的 key 可為列表或多層列表 obj.groupby([‘label1’, ‘label2’])->多層 dataframe

示例代碼:

self_def_key = [0, 1, 2, 3, 3, 4, 5, 7] # 按自定義key分組,列表
print(df_obj.groupby(self_def_key).size())

print(df_obj.groupby([df_obj['key1'], df_obj['key2']]).size()) # 按自定義key分組,多層列表

grouped2 = df_obj.groupby(['key1', 'key2']) # 按多個列多層分組
print(grouped2.size())

grouped3 = df_obj.groupby(['key2', 'key1']) # 多層分組按key的順序進(jìn)行
print(grouped3.mean())

print(grouped3.mean().unstack()) # unstack可以將多層索引的結(jié)果轉(zhuǎn)換成單層的dataframe

運行結(jié)果:

0    1
1 1
2 1
3 2
4 1
5 1
7 1
dtype: int64

key1 key2
a one 2
three 1
two 2
b one 1
three 1
two 1
dtype: int64


key1 key2
a one 2
three 1
two 2
b one 1
three 1
two 1
dtype: int64


data1 data2
key2 key1
one a 0.174988 -0.110804
b -0.214324 0.758372
three a -1.019229 -1.143825
b 0.522911 0.630814
two a 1.428099 0.107465
b -0.264616 1.017155

data1 data2
key1 a b a b
key2
one 0.174988 -0.214324 -0.110804 0.758372
three -1.019229 0.522911 -1.143825 0.630814
two 1.428099 -0.264616 0.107465 1.017155

二、GroupBy對象支持迭代操作

每次迭代返回一個元組 (group_name, group_data)可用于分組數(shù)據(jù)的具體運算

1. 單層分組

示例代碼:

for group_name, group_data in grouped1: # 單層分組,根據(jù)key1
print(group_name)
print(group_data)

運行結(jié)果:

a      data1     data2 key1   key2
0 0.974685 -0.672494 a one
2 1.508838 0.392787 a two
4 1.347359 -0.177858 a two
6 -0.624708 0.450885 a one
7 -1.019229 -1.143825 a three

b
data1 data2 key1 key2
1 -0.214324 0.758372 b one
3 0.522911 0.630814 b three
5 -0.264616 1.017155 b two

2. 多層分組

示例代碼:

for group_name, group_data in grouped2: # 多層分組,根據(jù)key1 和 key2
print(group_name)
print(group_data)

運行結(jié)果:

('a', 'one')
data1 data2 key1 key2
0 0.974685 -0.672494 a one
6 -0.624708 0.450885 a one

('a', 'three')
data1 data2 key1 key2
7 -1.019229 -1.143825 a three

('a', 'two')
data1 data2 key1 key2
2 1.508838 0.392787 a two
4 1.347359 -0.177858 a two

('b', 'one')
data1 data2 key1 key2
1 -0.214324 0.758372 b one

('b', 'three')
data1 data2 key1 key2
3 0.522911 0.630814 b three

('b', 'two')
data1 data2 key1 key2
5 -0.264616 1.017155 b two

三、GroupBy對象可以轉(zhuǎn)換成列表或字典

示例代碼:

print(list(grouped1)) # GroupBy對象轉(zhuǎn)換list

print(dict(list(grouped1))) # GroupBy對象轉(zhuǎn)換dict

運行結(jié)果:

[('a',       data1     data2 key1   key2
0 0.974685 -0.672494 a one
2 1.508838 0.392787 a two
4 1.347359 -0.177858 a two
6 -0.624708 0.450885 a one
7 -1.019229 -1.143825 a three),
('b', data1 data2 key1 key2
1 -0.214324 0.758372 b one
3 0.522911 0.630814 b three
5 -0.264616 1.017155 b two)]

{'a': data1 data2 key1 key2
0 0.974685 -0.672494 a one
2 1.508838 0.392787 a two
4 1.347359 -0.177858 a two
6 -0.624708 0.450885 a one
7 -1.019229 -1.143825 a three,
'b': data1 data2 key1 key2
1 -0.214324 0.758372 b one
3 0.522911 0.630814 b three
5 -0.264616 1.017155 b two}

1. 按列分組、按數(shù)據(jù)類型分組

示例代碼:

print(df_obj.dtypes) # 按列分組

print(df_obj.groupby(df_obj.dtypes, axis=1).size()) # 按數(shù)據(jù)類型分組
print(df_obj.groupby(df_obj.dtypes, axis=1).sum())

運行結(jié)果:

data1    float64
data2 float64
key1 object
key2 object
dtype: object

float64 2
object 2
dtype: int64

float64 object
0 0.302191 a one
1 0.544048 b one
2 1.901626 a two
3 1.153725 b three
4 1.169501 a two
5 0.752539 b two
6 -0.173823 a one
7 -2.163054 a three

2. 其他分組方法

示例代碼:

df_obj2 = pd.DataFrame(np.random.randint(1, 10, (5,5)),
columns=['a', 'b', 'c', 'd', 'e'],
index=['A', 'B', 'C', 'D', 'E'])
df_obj2.ix[1, 1:4] = np.NaN
print(df_obj2)

運行結(jié)果:

   a    b    c    d  e
A 7 2.0 4.0 5.0 8
B 4 NaN NaN NaN 1
C 3 2.0 5.0 4.0 6
D 3 1.0 9.0 7.0 3
E 6 1.0 6.0 8.0 1

3. 通過字典分組

示例代碼:

mapping_dict = {'a':'Python', 'b':'Python', 'c':'Java', 'd':'C', 'e':'Java'} # 通過字典分組
print(df_obj2.groupby(mapping_dict, axis=1).size())
print(df_obj2.groupby(mapping_dict, axis=1).count()) # 非NaN的個數(shù)
print(df_obj2.groupby(mapping_dict, axis=1).sum())

運行結(jié)果:

C         1
Java 2
Python 2
dtype: int64

C Java Python
A 1 2 2
B 0 1 1
C 1 2 2
D 1 2 2
E 1 2 2

C Java Python
A 5.0 12.0 9.0
B NaN 1.0 4.0
C 4.0 11.0 5.0
D 7.0 12.0 4.0
E 8.0 7.0 7.0

4. 通過函數(shù)分組,函數(shù)傳入的參數(shù)為行索引或列索引

示例代碼:

 
df_obj3 = pd.DataFrame(np.random.randint(1, 10, (5,5)),
columns=['a', 'b', 'c', 'd', 'e'],
index=['AA', 'BBB', 'CC', 'D', 'EE']) # 通過函數(shù)分組

def group_key(idx): #df_obj3
"""
idx 為列索引或行索引
"""
#return idx
return len(idx)

print(df_obj3.groupby(group_key).size())

# 以上自定義函數(shù)等價于
#df_obj3.groupby(len).size()

運行結(jié)果:

1    1
2 3
3 1
dtype: int64

5. 通過索引級別分組

示例代碼:

columns = pd.MultiIndex.from_arrays([['Python', 'Java', 'Python', 'Java', 'Python'],
['A', 'A', 'B', 'C', 'B']], names=['language', 'index']) # 通過索引級別分組
df_obj4 = pd.DataFrame(np.random.randint(1, 10, (5, 5)), columns=columns)
print(df_obj4)

print(df_obj4.groupby(level='language', axis=1).sum()) # 根據(jù)language進(jìn)行分組

print(df_obj4.groupby(level='index', axis=1).sum()) # 根據(jù)index進(jìn)行分組

運行結(jié)果:

language Python Java Python Java Python
index A A B C B
0 2 7 8 4 3
1 5 2 6 1 2
2 6 4 4 5 2
3 4 7 4 3 1
4 7 4 3 4 8

language Java Python
0 11 13
1 3 13
2 9 12
3 10 9
4 8 18

index A B C
0 9 11 4
1 7 8 1
2 10 6 5
3 11 5 3
4 11 11 4

二:聚合 (aggregation)

  • 數(shù)組產(chǎn)生標(biāo)量的過程,如mean()、count()等
  • 常用于對分組后的數(shù)據(jù)進(jìn)行計算

示例代碼:

dict_obj = {'key1' : ['a', 'b', 'a', 'b', 
'a', 'b', 'a', 'a'],
'key2' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'data1': np.random.randint(1,10, 8),
'data2': np.random.randint(1,10, 8)}
df_obj5 = pd.DataFrame(dict_obj)
print(df_obj5)

運行結(jié)果:

   data1  data2 key1   key2
0 3 7 a one
1 1 5 b one
2 7 4 a two
3 2 4 b three
4 6 4 a two
5 9 9 b two
6 3 5 a one
7 8 4 a three

1. 內(nèi)置的聚合函數(shù)

sum(), mean(), max(), min(), count(), size(), describe()

示例代碼:

print(df_obj5.groupby('key1').sum())
print(df_obj5.groupby('key1').max())
print(df_obj5.groupby('key1').min())
print(df_obj5.groupby('key1').mean())
print(df_obj5.groupby('key1').size())
print(df_obj5.groupby('key1').count())
print(df_obj5.groupby('key1').describe())

運行結(jié)果:

      data1  data2
key1
a 27 24
b 12 18

data1 data2 key2
key1
a 8 7 two
b 9 9 two

data1 data2 key2
key1
a 3 4 one
b 1 4 one

data1 data2
key1
a 5.4 4.8
b 4.0 6.0

key1
a 5
b 3
dtype: int64

data1 data2 key2
key1
a 5 5 5
b 3 3 3

data1 data2
key1
a count 5.000000 5.000000
mean 5.400000 4.800000
std 2.302173 1.303840
min 3.000000 4.000000
25% 3.000000 4.000000
50% 6.000000 4.000000
75% 7.000000 5.000000
max 8.000000 7.000000
b count 3.000000 3.000000
mean 4.000000 6.000000
std 4.358899 2.645751
min 1.000000 4.000000
25% 1.500000 4.500000
50% 2.000000 5.000000
75% 5.500000 7.000000
max 9.000000 9.000000

2. 可自定義函數(shù),傳入agg方法中

grouped.agg(func)func的參數(shù)為groupby索引對應(yīng)的記錄

示例代碼:

def peak_range(df): # 自定義聚合函數(shù)
"""
返回數(shù)值范圍
"""
#print type(df) #參數(shù)為索引所對應(yīng)的記錄
return df.max() - df.min()

print(df_obj5.groupby('key1').agg(peak_range))
print(df_obj.groupby('key1').agg(lambda df : df.max() - df.min()))

運行結(jié)果:

      data1  data2
key1
a 5 3
b 8 5

data1 data2
key1
a 2.528067 1.594711
b 0.787527 0.386341
In [25]:

3. 應(yīng)用多個聚合函數(shù)

同時應(yīng)用多個函數(shù)進(jìn)行聚合操作,使用函數(shù)列表

示例代碼:

'''應(yīng)用多個聚合函數(shù)

同時應(yīng)用多個聚合函數(shù)'''
print(df_obj.groupby('key1').agg(['mean', 'std', 'count', peak_range])) # 默認(rèn)列名為函數(shù)名

print(df_obj.groupby('key1').agg(['mean', 'std', 'count', ('range', peak_range)])) # 通過元組提供新的列名

運行結(jié)果:

         data1                                data2                           
mean std count peak_range mean std count peak_range
key1
a 0.437389 1.174151 5 2.528067 -0.230101 0.686488 5 1.594711
b 0.014657 0.440878 3 0.787527 0.802114 0.196850 3 0.386341

data1 data2
mean std count range mean std count range
key1
a 0.437389 1.174151 5 2.528067 -0.230101 0.686488 5 1.594711
b 0.014657 0.440878 3 0.787527 0.802114 0.196850 3 0.386341

4. 對不同的列分別作用不同的聚合函數(shù),使用dict

示例代碼:

dict_mapping = {'data1':'mean',
'data2':'sum'} # 每列作用不同的聚合函數(shù)
print(df_obj.groupby('key1').agg(dict_mapping))

dict_mapping = {'data1':['mean','max'],
'data2':'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))

運行結(jié)果:

         data1     data2
key1
a 0.437389 -1.150505
b 0.014657 2.406341

data1 data2
mean max sum
key1
a 0.437389 1.508838 -1.150505
b 0.014657 0.522911 2.406341

5. 常用的內(nèi)置聚合函數(shù)

 

示例代碼:

import pandas as pd
import numpy as np

dict_obj = {'key1' : ['a', 'b', 'a', 'b',
'a', 'b', 'a', 'a'],
'key2' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'data1': np.random.randint(1, 10, 8),
'data2': np.random.randint(1, 10, 8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)

k1_sum = df_obj.groupby('key1').sum().add_prefix('sum_') # 按key1分組后,計算data1,data2的統(tǒng)計信息并附加到原始表格中,并添加表頭前綴
print(k1_sum)

運行結(jié)果:

   data1  data2 key1   key2
0 5 1 a one
1 7 8 b one
2 1 9 a two
3 2 6 b three
4 9 8 a two
5 8 3 b two
6 3 5 a one
7 8 3 a three

sum_data1 sum_data2
key1
a 26 26
b 17 17
聚合運算后會改變原始數(shù)據(jù)的形狀,如何保持原始數(shù)據(jù)的形狀?

1. merge

使用merge的外連接,比較復(fù)雜

示例代碼:

k1_sum_merge = pd.merge(df_obj, k1_sum, left_on='key1', right_index=True) # 方法1,使用merge
print(k1_sum_merge)

運行結(jié)果:

   data1  data2 key1   key2  sum_data1  sum_data2
0 5 1 a one 26 26
2 1 9 a two 26 26
4 9 8 a two 26 26
6 3 5 a one 26 26
7 8 3 a three 26 26
1 7 8 b one 17 17
3 2 6 b three 17 17
5 8 3 b two 17 17

2. transform

transform的計算結(jié)果和原始數(shù)據(jù)的形狀保持一致,如:grouped.transform(np.sum)

示例代碼:

k1_sum_tf = df_obj.groupby('key1').transform(np.sum).add_prefix('sum_') # 方法2,使用transform
df_obj[k1_sum_tf.columns] = k1_sum_tf
print(df_obj)

運行結(jié)果:

   data1  data2 key1   key2 sum_data1 sum_data2           sum_key2
0 5 1 a one 26 26 onetwotwoonethree
1 7 8 b one 17 17 onethreetwo
2 1 9 a two 26 26 onetwotwoonethree
3 2 6 b three 17 17 onethreetwo
4 9 8 a two 26 26 onetwotwoonethree
5 8 3 b two 17 17 onethreetwo
6 3 5 a one 26 26 onetwotwoonethree
7 8 3 a three 26 26 onetwotwoonethree
也可傳入自定義函數(shù),

示例代碼:

def diff_mean(s): # 自定義函數(shù)傳入transform
"""
返回數(shù)據(jù)與均值的差值
"""
return s - s.mean()

print(df_obj.groupby('key1').transform(diff_mean))

運行結(jié)果:

      data1     data2 sum_data1 sum_data2
0 -0.200000 -4.200000 0 0
1 1.333333 2.333333 0 0
2 -4.200000 3.800000 0 0
3 -3.666667 0.333333 0 0
4 3.800000 2.800000 0 0
5 2.333333 -2.666667 0 0
6 -2.200000 -0.200000 0 0
7 2.800000 -2.200000 0 0

groupby.apply(func)

func函數(shù)也可以在各分組上分別調(diào)用,最后結(jié)果通過pd.concat組裝到一起(數(shù)據(jù)合并)

示例代碼:

import pandas as pd
import numpy as np

dataset_path = './starcraft.csv'
df_data = pd.read_csv(dataset_path, usecols=['LeagueIndex', 'Age', 'HoursPerWeek',
'TotalHours', 'APM'])

def top_n(df, n=3, column='APM'):
"""
返回每個分組按 column 的 top n 數(shù)據(jù)
"""
return df.sort_values(by=column, ascending=False)[:n]

print(df_data.groupby('LeagueIndex').apply(top_n))

運行結(jié)果:

                  LeagueIndex   Age  HoursPerWeek  TotalHours       APM
LeagueIndex
1 2214 1 20.0 12.0 730.0 172.9530
2246 1 27.0 8.0 250.0 141.6282
1753 1 20.0 28.0 100.0 139.6362
2 3062 2 20.0 6.0 100.0 179.6250
3229 2 16.0 24.0 110.0 156.7380
1520 2 29.0 6.0 250.0 151.6470
3 1557 3 22.0 6.0 200.0 226.6554
484 3 19.0 42.0 450.0 220.0692
2883 3 16.0 8.0 800.0 208.9500
4 2688 4 26.0 24.0 990.0 249.0210
1759 4 16.0 6.0 75.0 229.9122
2637 4 23.0 24.0 650.0 227.2272
5 3277 5 18.0 16.0 950.0 372.6426
93 5 17.0 36.0 720.0 335.4990
202 5 37.0 14.0 800.0 327.7218
6 734 6 16.0 28.0 730.0 389.8314
2746 6 16.0 28.0 4000.0 350.4114
1810 6 21.0 14.0 730.0 323.2506
7 3127 7 23.0 42.0 2000.0 298.7952
104 7 21.0 24.0 1000.0 286.4538
1654 7 18.0 98.0 700.0 236.0316
8 3393 8 NaN NaN NaN 375.8664
3373 8 NaN NaN NaN 364.8504
3372 8 NaN NaN NaN 355.3518

1. 產(chǎn)生層級索引:外層索引是分組名,內(nèi)層索引是df_obj的行索引

示例代碼:

print(df_data.groupby('LeagueIndex').apply(top_n, n=2, column='Age')) # apply函數(shù)接收的參數(shù)會傳入自定義的函數(shù)中

運行結(jié)果:

                  LeagueIndex   Age  HoursPerWeek  TotalHours       APM
LeagueIndex
1 3146 1 40.0 12.0 150.0 38.5590
3040 1 39.0 10.0 500.0 29.8764
2 920 2 43.0 10.0 730.0 86.0586
2437 2 41.0 4.0 200.0 54.2166
3 1258 3 41.0 14.0 800.0 77.6472
2972 3 40.0 10.0 500.0 60.5970
4 1696 4 44.0 6.0 500.0 89.5266
1729 4 39.0 8.0 500.0 86.7246
5 202 5 37.0 14.0 800.0 327.7218
2745 5 37.0 18.0 1000.0 123.4098
6 3069 6 31.0 8.0 800.0 133.1790
2706 6 31.0 8.0 700.0 66.9918
7 2813 7 26.0 36.0 1300.0 188.5512
1992 7 26.0 24.0 1000.0 219.6690
8 3340 8 NaN NaN NaN 189.7404
3341 8 NaN NaN NaN 287.8128

2. 禁止層級索引, group_keys=False

示例代碼:

print(df_data.groupby('LeagueIndex', group_keys=False).apply(top_n))

運行結(jié)果:

      LeagueIndex   Age  HoursPerWeek  TotalHours       APM
2214 1 20.0 12.0 730.0 172.9530
2246 1 27.0 8.0 250.0 141.6282
1753 1 20.0 28.0 100.0 139.6362
3062 2 20.0 6.0 100.0 179.6250
3229 2 16.0 24.0 110.0 156.7380
1520 2 29.0 6.0 250.0 151.6470
1557 3 22.0 6.0 200.0 226.6554
484 3 19.0 42.0 450.0 220.0692
2883 3 16.0 8.0 800.0 208.9500
2688 4 26.0 24.0 990.0 249.0210
1759 4 16.0 6.0 75.0 229.9122
2637 4 23.0 24.0 650.0 227.2272
3277 5 18.0 16.0 950.0 372.6426
93 5 17.0 36.0 720.0 335.4990
202 5 37.0 14.0 800.0 327.7218
734 6 16.0 28.0 730.0 389.8314
2746 6 16.0 28.0 4000.0 350.4114
1810 6 21.0 14.0 730.0 323.2506
3127 7 23.0 42.0 2000.0 298.7952
104 7 21.0 24.0 1000.0 286.4538
1654 7 18.0 98.0 700.0 236.0316
3393 8 NaN NaN NaN 375.8664
3373 8 NaN NaN NaN 364.8504
3372 8 NaN NaN NaN 355.3518
apply可以用來處理不同分組內(nèi)的缺失數(shù)據(jù)填充 ,填充該分組的均值。

推薦好課:Python 自動化辦公Python 自動化管理


0 人點贊