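Pandas Hierarchical Indexing

Updated 2022-07-15 15:08

Hierarchical indexing (MultiIndex) is one of the most important index types in Pandas. It means that a single axis carries multiple (two or more) index levels, which lets a low-dimensional structure hold higher-dimensional data. It is the tool to reach for when you need to work with data of three or more dimensions.

The purpose of a hierarchical index is to let the low-dimensional structures (Series or DataFrame) handle high-dimensional data well. With it, data of three or more dimensions can be processed much like two-dimensional data. This makes high-dimensional data easier to analyze and to understand, and it is also easier to use than the deprecated (and since removed) Panel structure.

Pandas represents a hierarchical index with the MultiIndex class. A MultiIndex object can be thought of as a sequence of tuples, one per entry, in which each tuple is usually unique. Several ways of creating one are described below.

Creating a hierarchical index

1) Direct construction

A hierarchical index can be built directly by passing the levels and codes parameters to the MultiIndex() constructor, as shown below: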

import pandas as pd
import numpy as np
# pass levels a 2-D list with 1 row and 5 columns (a single level with 5 values)
mi = pd.MultiIndex(levels=[[np.nan, 2, pd.NaT, None, 5]], codes=[[4, -1, 1, 2, 3, 4]])
print(mi.levels)
print(mi)

Output:

[[nan, 2, NaT, None, 5]]

MultiIndex([(  5,),
            (nan,),
            (  2,),
            (nan,),
            (nan,),
            (  5,)],
           )

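In the code above, levels holds the distinct values of each index level. There is only one level here, and its values are np.nan, 2, NaT, None and 5. codes gives, for every entry of the index, the integer position of its label in levels, so each number in codes is a subscript into the levels sequence. A code of -1 stands for a missing value (NaN). The codes 2 and 3 point at NaT and None, which are themselves missing values, so those entries are also displayed as nan.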
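The relationship between levels and codes can be checked by hand. The following is a minimal sketch with two levels; the labels and the level names letter and number are made up for illustration and do not come from the example above.

```python
import pandas as pd

# Each inner list holds the distinct labels of one level.
levels = [['a', 'b'], [1, 2]]
# One list of integer codes per level; each code is a position in that level.
codes = [[0, 0, 1, 1], [0, 1, 0, 1]]
mi = pd.MultiIndex(levels=levels, codes=codes, names=['letter', 'number'])

# Rebuild the label tuples by hand: look each code up in its level.
rebuilt = [(levels[0][i], levels[1][j]) for i, j in zip(*codes)]
print(rebuilt == mi.tolist())

# A code of -1 marks a missing label on that level.
with_missing = pd.MultiIndex(levels=levels, codes=[[0, -1], [0, 1]])
print(with_missing)
```

In everyday code the constructors described in the following sections are usually more convenient, because they compute levels and codes for you.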

2) From tuples

from_tuples() creates a hierarchical index from a list of tuples.

# one list of labels per level
arrays = [['it', 'it', 'of', 'of', 'for', 'for', 'then', 'then'], 
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']] 
# use zip() to pair the labels up into tuples
tuples = list(zip(*arrays)) 
print(tuples)

Output:

[('it', 'one'),
('it', 'two'),
('of', 'one'),
('of', 'two'),
('for', 'one'),
('for', 'two'),
('then', 'one'),
('then', 'two')]

Then use tuples to create the hierarchical index:

import pandas as pd

# one list of labels per level
arrays = [['it', 'it', 'of', 'of', 'for', 'for', 'then', 'then'], 
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']] 
# use zip() to pair the labels up into tuples
tuples = list(zip(*arrays)) 
# create a two-level index and name the levels with names
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
print(index)

Output:

MultiIndex([(  'it', 'one'),
            (  'it', 'two'),
            (  'of', 'one'),
            (  'of', 'two'),
            ( 'for', 'one'),
            ( 'for', 'two'),
            ('then', 'one'),
            ('then', 'two')],
           names=['first', 'second'])

3) From a DataFrame

from_frame() creates a hierarchical index from a DataFrame, as shown below:

# first create a DataFrame
import pandas as pd
import numpy as np

df = pd.DataFrame([['bar', 'one'], ['bar', 'two'],
                   ['foo', 'one'], ['foo', 'two']],
                  columns=['first', 'second'])
# then build the hierarchical index with from_frame()
index = pd.MultiIndex.from_frame(df)
# use the index for a Series
s=pd.Series(np.random.randn(4), index=index)
print(s)

Output (the values are random, so yours will differ):

first  second
bar    one       1.151928
       two      -0.694435
foo    one      -1.701611
       two      -0.486157
dtype: float64
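As a small sketch of how from_frame() maps a frame onto an index: the column names become the level names, and to_frame() converts the index back into columns. The variable name roundtrip is only illustrative.

```python
import pandas as pd

df = pd.DataFrame([['bar', 'one'], ['bar', 'two'],
                   ['foo', 'one'], ['foo', 'two']],
                  columns=['first', 'second'])
index = pd.MultiIndex.from_frame(df)

# The column names become the level names.
print(list(index.names))

# to_frame(index=False) turns the levels back into ordinary columns.
roundtrip = index.to_frame(index=False)
print(roundtrip)
```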

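4) From a Cartesian product

The Cartesian product (also called the direct product) of several sets pairs every element of one set with every element of the others. from_product() builds a hierarchical index from the Cartesian product of several iterables.

import pandas as pd
import numpy as np
# build the data
numbers = [0, 1, 2]
language = ['Python', 'Java']
# the Cartesian product yields 6 combinations
index = pd.MultiIndex.from_product([numbers, language],names=['number', 'language'])
# use the hierarchical index for a Series
dk_er=pd.Series(np.random.randn(6), index=index)
print(dk_er)

Output (the values are random, so yours will differ):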

number  language
0       Python     -0.319739
        Java        1.599170
1       Python     -0.010520
        Java        0.262068
2       Python     -0.124177
        Java        0.315120
dtype: float64
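To make the "6 combinations" concrete, the sketch below compares from_product() with itertools.product from the standard library, which is used here only as a reference for the expected pairs.

```python
import itertools
import pandas as pd

numbers = [0, 1, 2]
language = ['Python', 'Java']
index = pd.MultiIndex.from_product([numbers, language],
                                   names=['number', 'language'])

# The same pairs, generated with the standard library for comparison.
expected = list(itertools.product(numbers, language))

print(len(index) == len(numbers) * len(language))
print(index.tolist() == expected)
```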

5) From arrays

The from_arrays() method also creates a hierarchical index, taking one array of labels per level. For example:

import pandas as pd
mi=pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'],[1, 2, 1, 2]])
print(mi)

Output:

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
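The constructors differ only in the shape of their input. As a quick sketch, the same four-entry index can be built three ways and compared with equals(); the variable names are illustrative.

```python
import pandas as pd

arrays = [['a', 'a', 'b', 'b'], [1, 2, 1, 2]]

# One array of labels per level.
via_arrays = pd.MultiIndex.from_arrays(arrays)
# One tuple per entry.
via_tuples = pd.MultiIndex.from_tuples(list(zip(*arrays)))
# Every combination of the distinct labels.
via_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

print(via_arrays.equals(via_tuples))
print(via_arrays.equals(via_product))
```

from_product() matches here only because the index contains every combination of the two levels; for an index with gaps, from_arrays() or from_tuples() is the one to use.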

Applying a hierarchical index

The following example shows how to use a hierarchical index in a DataFrame.

import pandas as pd 
import numpy as np
# one array of labels per level
arrays = [[0, 0, 1, 1], ['A', 'B', 'A', 'B']]
# build the index from the arrays
index=pd.MultiIndex.from_arrays(arrays, names=('number', 'letter'))
print(index)

Output:

MultiIndex([(0, 'A'),
            (0, 'B'),
            (1, 'A'),
            (1, 'B')],
           names=['number', 'letter'])

In the example above, the first level is number, with the two values 0 and 1; the second level is letter, with the two letters A and B.

Now apply the hierarchical index to a DataFrame:

import pandas as pd 
import numpy as np
# one array of labels per level
arrays = [[0, 0, 1, 1], ['A', 'B', 'A', 'B']]
index=pd.MultiIndex.from_arrays(arrays, names=('number', 'letter'))
# use the hierarchical index as the row index; the scalar values are broadcast to every row
df=pd.DataFrame({'a':11, 'b':22}, index=index)
print(df)

Output:

                a   b
number letter       
0      A       11  22
       B       11  22
1      A       11  22
       B       11  22

set_index() turns existing columns of a DataFrame into the row index; passing several columns produces a hierarchical row index. For example:

import pandas as pd
df= pd.DataFrame({'a': range(5), 'b': range(5, 0, -1),
                  'c': ['one', 'one', 'one', 'two', 'two'],
                  'd': [0, 1, 2, 0, 1]})
print(df)
df1=df.set_index(['a','d'],drop=False)
print(df1)
df2=df.set_index(['a','d'],drop=False,append=True)
print(df2)

Output:

Before:
   a  b    c  d
0  0  5  one  0
1  1  4  one  1
2  2  3  one  2
3  3  2  two  0
4  4  1  two  1
After set_index():
     a  b    c  d
a d             
0 0  0  5  one  0
1 1  1  4  one  1
2 2  2  3  one  2
3 0  3  2  two  0
4 1  4  1  two  1
With append=True:
       a  b    c  d
  a d            
0 0 0  0  5  one  0
1 1 1  1  4  one  1
2 2 2  2  3  one  2
3 3 0  3  2  two  0
4 4 1  4  1  two  1

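set_index() turned the columns a and d into a hierarchical row index. drop=False keeps columns a and d in the frame while setting the index. The append=True argument keeps the existing index (here the default integers 0 to 4) and appends the new levels to it, rather than replacing it.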
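The effect of drop and append can be checked on the same data. This is a small sketch; the variable names dropped, kept, appended and restored are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': range(5), 'b': range(5, 0, -1),
                   'c': ['one', 'one', 'one', 'two', 'two'],
                   'd': [0, 1, 2, 0, 1]})

dropped = df.set_index(['a', 'd'])                 # default drop=True
kept = df.set_index(['a', 'd'], drop=False)        # a and d stay as columns
appended = df.set_index(['a', 'd'], append=True)   # old index + a + d

print(list(dropped.columns))
print(list(kept.columns))
print(appended.index.nlevels)

# reset_index() moves the index levels back into columns.
restored = dropped.reset_index()
print(list(restored.columns))
```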

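Slicing and selecting with a hierarchical index

The following examples show how to slice and select values with a hierarchical index.

1) Hierarchical row index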

import pandas as pd
# build the multi-level index
tuples = [('Lakers', 2008), ('Pacers', 2008),
          ('Lakers', 2007), ('Celtics', 2007),
          ('Nets', 2007), ('Heat', 2008)]
salary = [10000, 20000, 11000, 30000, 19000, 22000]
# then use it as the index of a Series
index = pd.MultiIndex.from_tuples(tuples)
s = pd.Series(salary, index=index)
print(s)
# select by label
print(s['Lakers', 2007])
print(s['Lakers'])
print(s[:, 2008])
# filter by value
print(s[s <= 20000])

Output:

Lakers   2008    10000
Pacers   2008    20000
Lakers   2007    11000
Celtics  2007    30000
Nets     2007    19000
Heat     2008    22000
dtype: int64

Lakers salary in 2007:
11000

Lakers salaries:
2008    10000
2007    11000
dtype: int64

Salaries of all teams in 2008:
Lakers    10000
Pacers    20000
Heat      22000
dtype: int64

Teams and years with a salary of at most 20000:
Lakers  2008    10000
Pacers  2008    20000
Lakers  2007    11000
Nets    2007    19000
dtype: int64
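The same selections can also be written with .loc and with a mask built from one level. The sketch below reuses the salary data with English team names and adds the level names team and year, which are assumptions made for readability.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('Lakers', 2008), ('Pacers', 2008), ('Lakers', 2007),
     ('Celtics', 2007), ('Nets', 2007), ('Heat', 2008)],
    names=['team', 'year'])
s = pd.Series([10000, 20000, 11000, 30000, 19000, 22000], index=index)

# Sorting first avoids the performance warning pandas may emit
# when looking labels up in an unsorted MultiIndex.
s_sorted = s.sort_index()

one_value = s_sorted.loc[('Lakers', 2007)]   # a full key returns a scalar
one_team = s_sorted.loc['Lakers']            # a partial key returns a sub-Series
# Boolean mask built from the labels of the year level.
in_2008 = s[s.index.get_level_values('year') == 2008]

print(one_value)
print(one_team)
print(in_2008)
```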

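2) Hierarchical row and column indexes

A more complex case is when both the rows and the columns have a hierarchical index. The following example shows how to select values in that case: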

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(1,13).reshape((4, 3)),
               index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
               columns=[['Jack', 'Jack', 'Helen'],
              ['Python', 'Java', 'Python']])
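# select several labels on the same level; note the double brackets: df['Jack','Helen'] would be read as the key ('Jack','Helen') across two levels
print(df[['Jack','Helen']])
# select one label on each level
print(df['Jack','Python'])
# iloc: integer positions
print(df.iloc[:3,:2])
# loc: column labels
print(df.loc[:,('Helen','Python')])

Output: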

      Jack       Helen
    Python Java Python
a 1      1    2      3
  2      4    5      6
b 1      7    8      9
  2     10   11     12

a  1     1
   2     4
b  1     7
   2    10
Name: (Jack, Python), dtype: int32

      Jack    
    Python Java
a 1      1    2
  2      4    5
b 1      7    8

a  1     3
   2     6
b  1     9
   2    12
Name: (Helen, Python), dtype: int32
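When the label of interest sits on an inner level, xs() is one way to take a cross-section without spelling out the outer labels. This is a sketch on the same frame; the names rows_1 and python_cols are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 13).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Jack', 'Jack', 'Helen'],
                           ['Python', 'Java', 'Python']])

# Rows whose inner row label is 1 (the matched level is dropped from the result).
rows_1 = df.xs(1, level=1)
# Columns whose inner column label is 'Python'.
python_cols = df.xs('Python', axis=1, level=1)

print(rows_1)
print(python_cols)
```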

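Aggregation functions

By naming a level, you can choose the level on which an aggregation such as a sum or a mean is computed. Older pandas versions accepted a level argument on sum() and mean() directly; that argument was removed in pandas 2.0, so the example groups by the level instead:

import pandas as pd 
import numpy as np
df = pd.DataFrame(np.arange(1,13).reshape((4, 3)),
               index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
               columns=[['Jack', 'Jack', 'Helen'],
              ['Python', 'Java', 'Python']])
# first, name the row and column levels
df.index.names=['key1','key2']
df.columns.names=['name','course']
# sum the rows that share the same key2 value
print(df.groupby(level='key2').sum())
# average the columns that share the same course; transposing avoids groupby(axis=1), which is deprecated
print(df.T.groupby(level='course', sort=False).mean().T)

Output:

# sums for the key2 values 1 and 2
name     Jack       Helen
course Python Java Python
key2                    
1           8   10     12
2          14   16     18

# means along the horizontal direction, per course
course     Python  Java
key1 key2              
a    1        2.0   2.0
     2        5.0   5.0
b    1        8.0   8.0
     2       11.0  11.0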

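Most of the time in a data analysis project goes into preparing and preprocessing the data. Pandas is a flexible and efficient preprocessing tool that offers many data-handling methods, and hierarchical indexing (MultiIndex) is one of them. Hierarchical (or multi-level) indexing is a basic feature of Pandas that strengthens its data preprocessing capabilities.

For a Series, passing a two-dimensional array (a list of two lists) as the index parameter creates a MultiIndex with two levels. For example: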

import pandas as pd 
info = pd.Series([11, 14, 17, 24, 19, 32, 34, 27],
index = [['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],
['obj1', 'obj2', 'obj3', 'obj4', 'obj1', 'obj2', 'obj3', 'obj4']]) 
print(info)

Output:

x  obj1    11
   obj2    14
   obj3    17
   obj4    24
y  obj1    19
   obj2    32
   obj3    34
   obj4    27
dtype: int64

The example above creates an index with two levels, (x, y) and (obj1, ..., obj4). You can inspect it through the index attribute.

print(info.index)

Output:

MultiIndex([('x', 'obj1'),
            ('x', 'obj2'),
            ('x', 'obj3'),
            ('x', 'obj4'),
            ('y', 'obj1'),
            ('y', 'obj2'),
            ('y', 'obj3'),
            ('y', 'obj4')],
           )

You can also select data by the inner index level (the 'obj' labels), as follows:

print(info[:, 'obj2'])

Output:

x    14
y    32
dtype: int64

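Partial indexing

Partial indexing selects data from a hierarchical index by specifying only some of its levels. For example, the following selects all the data under the outer label 'y':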

import pandas as pd 
info = pd.Series([11, 14, 17, 24, 19, 32, 34, 27], 
index = [['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'], 
['obj1', 'obj2', 'obj3', 'obj4', 'obj1', 'obj2', 'obj3', 'obj4']]) 
print(info['y'])

Output:

obj1    19
obj2    32
obj3    34
obj4    27
dtype: int64

You can select data by the inner level as well.
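As a sketch of selecting by the inner level, the three spellings below should all return the 'obj2' rows of the same Series; which one reads best is largely a matter of taste.

```python
import pandas as pd

info = pd.Series([11, 14, 17, 24, 19, 32, 34, 27],
                 index=[['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],
                        ['obj1', 'obj2', 'obj3', 'obj4'] * 2])

# Every row whose inner label is 'obj2', written three ways.
by_getitem = info[:, 'obj2']
by_loc = info.loc[:, 'obj2']
by_xs = info.xs('obj2', level=1)

print(by_getitem)
print(by_loc)
print(by_xs)
```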

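Converting a row index level to a column index

unstack() moves one level of the row index into the column index, pivoting the labels of that level into column labels. Applied to a Series (one-dimensional) with a two-level index, it produces a DataFrame (two-dimensional). For example: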

import pandas as pd 
info = pd.Series([11, 14, 17, 24, 19, 32, 34, 27], 
index = [['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'], 
['obj1', 'obj2', 'obj3', 'obj4', 'obj1', 'obj2', 'obj3', 'obj4']]) 
# level 0 is the outer level (x, y) and level 1 is the inner level
# unstack() without an argument uses the innermost level
print(info.unstack(0))

Output:

       x   y
obj1  11  19
obj2  14  32
obj3  17  34
obj4  24  27

As the example shows, unstack(0) turns the first (outer) level into the columns, while unstack(1) does so with the second (inner) level:

import pandas as pd 
info = pd.Series([11, 14, 17, 24, 19, 32, 34, 27], 
index = [['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'], 
['obj1', 'obj2', 'obj3', 'obj4', 'obj1', 'obj2', 'obj3', 'obj4']]) 
print(info.unstack(1))

Output:

   obj1  obj2  obj3  obj4
x    11    14    17    24
y    19    32    34    27
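The reverse operation is stack(), which moves a column level back into the row index. The following sketch round-trips the same Series; depending on the pandas version, stack() may emit a warning about its upcoming implementation change, which does not affect this example.

```python
import pandas as pd

info = pd.Series([11, 14, 17, 24, 19, 32, 34, 27],
                 index=[['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],
                        ['obj1', 'obj2', 'obj3', 'obj4'] * 2])

wide = info.unstack(1)     # the inner level becomes the columns
restored = wide.stack()    # stack() moves it back into the row index

print(wide.shape)
print(restored.tolist() == info.tolist())
print(list(restored.index) == list(info.index))
```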

Hierarchical column index

Column indexes exist only in the DataFrame structure. The following creates a DataFrame to show how the column index can be made hierarchical.

import numpy as np 
import pandas as pd
info = pd.DataFrame(np.arange(12).reshape(4, 3), 
index = [['a', 'a', 'b', 'b'], ['one', 'two', 'three', 'four']],  
columns = [['num1', 'num2', 'num3'], ['x', 'y', 'x']] )  
print(info)

Output:

        num1 num2 num3
           x    y    x
a one      0    1    2
  two      3    4    5
b three    6    7    8
  four     9   10   11

View the column index:

print(info.columns)

Output:

MultiIndex([('num1', 'x'),
            ('num2', 'y'),
            ('num3', 'x')],
           )
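Once the columns are hierarchical, they can be addressed by a full key or by an outer label alone. A brief sketch on the same frame follows; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

info = pd.DataFrame(np.arange(12).reshape(4, 3),
                    index=[['a', 'a', 'b', 'b'], ['one', 'two', 'three', 'four']],
                    columns=[['num1', 'num2', 'num3'], ['x', 'y', 'x']])

outer = info.columns.get_level_values(0)   # labels of the outer column level
inner = info.columns.get_level_values(1)   # labels of the inner column level

single = info[('num2', 'y')]   # a full key selects one column as a Series
group = info['num1']           # an outer label selects a sub-DataFrame

print(list(outer), list(inner))
print(single)
print(group)
```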

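Swapping levels and sorting by level

1) Swapping levels

The swaplevel() method swaps two index levels. It returns a new object and leaves the original unchanged. For example: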

import pandas as pd
import numpy as np
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'],
                           ['Green', 'Red', 'Green']])
# name the levels of the row index
frame.index.names = ['key1', 'key2']
# name the levels of the column index
frame.columns.names = ['state','color']
# swap the key1 and key2 levels; swaplevel() returns a new DataFrame, so print its result
print(frame.swaplevel('key1','key2'))

Output:

state      Ohio     Colorado
color     Green Red    Green
key2 key1                  
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11
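Because swaplevel() returns a new object, the result has to be captured or printed. The sketch below checks that the original frame keeps its level order, and also swaps the column levels with axis=1; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

swapped_rows = frame.swaplevel('key1', 'key2')            # row levels
swapped_cols = frame.swaplevel('state', 'color', axis=1)  # column levels

print(list(frame.index.names))          # the original is unchanged
print(list(swapped_rows.index.names))
print(list(swapped_cols.columns.names))
```

Only the order of the levels changes; the rows and their data stay in the same positions.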

2) Sorting by level

The level parameter of sort_index() sorts by a given level. The following example sorts by the labels of "key1" in alphabetical order.

import pandas as pd
import numpy as np
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'],
                           ['Green', 'Red', 'Green']])
# name the levels of the row index; key1 and key2 correspond to the two levels
frame.index.names = ['key1', 'key2']
# name the levels of the column index
frame.columns.names = ['state','color']
print(frame.sort_index(level='key1'))

Output:

state      Ohio     Colorado
color     Green Red    Green
key1 key2                  
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
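The frame above is already in sorted order, so the sort does not visibly change it. The following sketch uses a deliberately shuffled index (made-up data) to show the effect of the level and ascending parameters.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays([['b', 'a', 'b', 'a'], [2, 1, 1, 2]],
                                  names=['key1', 'key2'])
s = pd.Series([10, 20, 30, 40], index=index)

# Sort on key1 first; the remaining level breaks ties by default.
by_key1 = s.sort_index(level='key1')
# Sort on key2 first, then on key1.
by_key2 = s.sort_index(level='key2')
# Descending order on all levels.
descending = s.sort_index(level='key1', ascending=False)

print(by_key1)
print(by_key2)
print(descending)
```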
