
Introduction to Pandas Data Structures

Updated 2022-07-01 14:52

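This section introduces the fundamental Pandas data structures, covering the basic behavior of each object: data types, indexing, axis labeling, and alignment. To get started, import NumPy and Pandas: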

In [1]: import numpy as np

In [2]: import pandas as pd

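One principle is fundamental: data alignment is intrinsic. Pandas will not break the link between labels and data unless you do so explicitly.

The sections below give a brief introduction to the data structures, then cover each category of functionality and methods in turn.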

#Series

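Series is a one-dimensional labeled array capable of holding any data type (integers, floats, strings, Python objects, and so on). The axis labels are collectively referred to as the index. The basic way to create a Series is to call pd.Series: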

>>> s = pd.Series(data, index=index)

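In the code above, data can be any of the following:

A Python dict
An ndarray
A scalar value (such as 5)

index is a list of axis labels. The behavior depends on what data is:

From ndarray

If data is an ndarray, index must be the same length as data. If no index is passed, an integer index is created with the values [0, ..., len(data) - 1].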

In [3]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [4]: s
Out[4]: 
a    0.469112
b   -0.282863
c   -1.509059
d   -1.135632
e    1.212112
dtype: float64

In [5]: s.index
Out[5]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [6]: pd.Series(np.random.randn(5))
Out[6]: 
0   -0.173215
1    0.119209
2   -1.044236
3   -0.861849
4   -2.104569
dtype: float64
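
The length rule above can be checked directly. The following is a minimal sketch with made-up values; the variable names are illustrative and not part of the original session.

```python
import numpy as np
import pandas as pd

values = np.array([10.0, 20.0, 30.0])

# Without an index, pandas builds the default integer index 0..len(data)-1.
default_indexed = pd.Series(values)
print(list(default_indexed.index))

# An index whose length differs from the data is rejected with a ValueError.
try:
    pd.Series(values, index=["a", "b"])
    length_error = None
except ValueError as exc:
    length_error = exc
print(type(length_error).__name__)
```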

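Note
Pandas supports non-unique index values. An exception is raised only when an operation that does not support duplicate index values is attempted. This lazy check exists mainly for performance: many computations, such as parts of GroupBy, never use the index.

From dict

A Series can be instantiated from a dict: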

In [7]: d = {'b': 1, 'a': 0, 'c': 2}

In [8]: pd.Series(d)
Out[8]: 
b    1
a    0
c    2
dtype: int64

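Note
When data is a dict and no index is passed, the Series index follows the dict's insertion order if you are using Python >= 3.6 and Pandas >= 0.23.

With Python < 3.6 or Pandas < 0.23, and no index passed, the Series index is the lexically ordered list of dict keys.

In the example above, with Python < 3.6 or Pandas < 0.23, the Series would be ordered by the sorted dict keys: ['a', 'b', 'c'] rather than ['b', 'a', 'c'].

If an index is passed, the values in data corresponding to the labels in the index are pulled out.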

In [9]: d = {'a': 0., 'b': 1., 'c': 2.}

In [10]: pd.Series(d)
Out[10]: 
a    0.0
b    1.0
c    2.0
dtype: float64

In [11]: pd.Series(d, index=['b', 'c', 'd', 'a'])
Out[11]: 
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

Note
Pandas uses NaN (Not a Number) as its standard marker for missing data.
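
As a small sketch of the lookup behavior described above (the dict contents and names here are illustrative), labels requested in index but absent from the dict come back as NaN and can be located with isna():

```python
import pandas as pd

prices = {"a": 0.0, "b": 1.0, "c": 2.0}

# The index drives both the order and the set of labels in the result.
picked = pd.Series(prices, index=["b", "c", "d", "a"])

# "d" has no entry in the dict, so its value is NaN.
missing_labels = list(picked[picked.isna()].index)
print(missing_labels)
```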

From scalar value

If data is a scalar value, an index must be provided. The value is repeated to match the length of the index.

In [12]: pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
Out[12]: 
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

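#Series is ndarray-like

Series acts very similarly to an ndarray and is a valid argument to most NumPy functions. Operations such as slicing also slice the index.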

In [13]: s[0]
Out[13]: 0.4691122999071863

In [14]: s[:3]
Out[14]: 
a    0.469112
b   -0.282863
c   -1.509059
dtype: float64

In [15]: s[s > s.median()]
Out[15]: 
a    0.469112
e    1.212112
dtype: float64

In [16]: s[[4, 3, 1]]
Out[16]: 
e    1.212112
d   -1.135632
b   -0.282863
dtype: float64

In [17]: np.exp(s)
Out[17]: 
a    1.598575
b    0.753623
c    0.221118
d    0.321219
e    3.360575
dtype: float64

Note
Array-based indexing such as s[[4, 3, 1]] is covered in the section on indexing and selecting data.
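
The session above uses integers on a Series whose index holds strings, which relies on a positional fallback. A hedged sketch of the explicit alternatives follows; the data is illustrative, and the remark that some recent Pandas releases warn about the implicit fallback is an assumption to verify against your installed version.

```python
import pandas as pd

s = pd.Series([0.5, -0.3, -1.5, -1.1, 1.2], index=["a", "b", "c", "d", "e"])

# .iloc is always positional, .loc is always label-based.
first_by_position = s.iloc[0]
first_by_label = s.loc["a"]

# Positional selection with a list of integer positions.
picked = s.iloc[[4, 3, 1]]
print(list(picked.index))
```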

Like a NumPy array, a Series has a dtype.

In [18]: s.dtype
Out[18]: dtype('float64')

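This is usually a NumPy dtype. However, Pandas and third-party libraries extend NumPy's type system in a few places, in which case the dtype is an extension dtype. Examples within Pandas include categorical data and the nullable integer data type. See the section on dtypes for more.

If you need the actual array backing a Series, use Series.array.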

In [19]: s.array
Out[19]: 
<PandasArray>
[ 0.4691122999071863, -0.2828633443286633, -1.5090585031735124,
 -1.1356323710171934,  1.2121120250208506]
Length: 5, dtype: float64

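Accessing the array can be useful when you need to perform an operation without the index, for example to disable automatic alignment.

Series.array is generally an extension array. Briefly, an extension array is a thin wrapper around one or more concrete arrays such as a numpy.ndarray. Pandas knows how to take an extension array and store it in a Series or a column of a DataFrame. See the section on dtypes for more.

While a Series is ndarray-like, if you need an actual ndarray, use Series.to_numpy().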

In [20]: s.to_numpy()
Out[20]: array([ 0.4691, -0.2829, -1.5091, -1.1356,  1.2121])

Even if the Series is backed by an extension array, Series.to_numpy() returns a NumPy ndarray.
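
The difference between the two accessors is easiest to see on a Series with an extension dtype. This is a minimal sketch assuming the nullable "Int64" dtype mentioned earlier in this section; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

nullable = pd.Series([1, 2, None], dtype="Int64")

# .array exposes the backing extension array without converting it.
backing = nullable.array

# .to_numpy() always yields a NumPy ndarray; here the missing entry
# is mapped to NaN so the result can be a plain float64 array.
as_ndarray = nullable.to_numpy(dtype="float64", na_value=np.nan)
print(type(backing).__name__, as_ndarray.dtype)
```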

#Series is dict-like

A Series is like a fixed-size dict in that you can get and set values by index label:

In [21]: s['a']
Out[21]: 0.4691122999071863

In [22]: s['e'] = 12.

In [23]: s
Out[23]: 
a     0.469112
b    -0.282863
c    -1.509059
d    -1.135632
e    12.000000
dtype: float64

In [24]: 'e' in s
Out[24]: True

In [25]: 'f' in s
Out[25]: False

Using a label that is not in the Series raises an exception:

>>> s['f']
KeyError: 'f'

With the get method, a missing label returns None or a specified default:

In [26]: s.get('f')

In [27]: s.get('f', np.nan)
Out[27]: nan

See also the section on attribute access.
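
The three lookup outcomes described above can be put side by side. A small sketch with illustrative data:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0], index=["a", "b"])

present = s.get("a")           # label exists: its value is returned
absent = s.get("f")            # label missing, no default: None
fallback = s.get("f", np.nan)  # label missing, explicit default
print(present, absent, fallback)
```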

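#Vectorized operations and label alignment with Series

When working with raw NumPy arrays, looping through value by value is usually unnecessary. The same is true of Series, which can also be passed to most NumPy methods that expect an ndarray.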

In [28]: s + s
Out[28]: 
a     0.938225
b    -0.565727
c    -3.018117
d    -2.271265
e    24.000000
dtype: float64

In [29]: s * 2
Out[29]: 
a     0.938225
b    -0.565727
c    -3.018117
d    -2.271265
e    24.000000
dtype: float64

In [30]: np.exp(s)
Out[30]: 
a         1.598575
b         0.753623
c         0.221118
d         0.321219
e    162754.791419
dtype: float64

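A key difference between Series and ndarray is that operations between Series automatically align the data based on label. You can therefore write computations without considering whether the Series involved have the same labels.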

In [31]: s[1:] + s[:-1]
Out[31]: 
a         NaN
b   -0.565727
c   -3.018117
d   -2.271265
e         NaN
dtype: float64

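The result of an operation between unaligned Series has the union of the indexes involved. If a label is not found in one Series or the other, the result is marked as missing (NaN). Being able to write code without any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research. The integrated data alignment features of the Pandas data structures set Pandas apart from most related tools for working with labeled data.

Note

In general, the default result of operations between differently indexed objects is the union of the indexes, in order to avoid loss of information. An index label carries important information about the computation even when its data is missing. You can discard labels with missing data using the dropna function.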
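
A compact sketch of the union behavior and two ways to deal with the resulting gaps. The data is illustrative; Series.add with fill_value is a standard Pandas method, shown here as one option rather than something the original text prescribes.

```python
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
right = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# The result index is the union; labels present on only one side become NaN.
aligned = left + right

# Option 1: drop the labels that have no data.
only_shared = aligned.dropna()

# Option 2: treat a missing operand as 0 instead of producing NaN.
filled = left.add(right, fill_value=0)
print(aligned, only_shared, filled, sep="\n")
```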

#Name attribute

Series also has a name attribute:

In [32]: s = pd.Series(np.random.randn(5), name='something')

In [33]: s
Out[33]: 
0   -0.494929
1    1.071804
2    0.721555
3   -0.706771
4   -1.039575
Name: something, dtype: float64

In [34]: s.name
Out[34]: 'something'

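In many cases the Series name is assigned automatically, in particular when taking 1D slices of a DataFrame, as shown later in this section.

New in version 0.18.0.

The pandas.Series.rename() method renames a Series.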

In [35]: s2 = s.rename("different")

In [36]: s2.name
Out[36]: 'different'

Note that s and s2 refer to different objects.
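
A quick check of that statement, using illustrative data: rename with a scalar returns a new Series and leaves the original name in place.

```python
import pandas as pd

s = pd.Series([1, 2, 3], name="something")
s2 = s.rename("different")

# The original keeps its name; the renamed result is a separate object.
print(s.name, s2.name, s2 is s)
```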

#DataFrame

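DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet, a SQL table, or a dict of Series objects. It is generally the most commonly used Pandas object. Like Series, DataFrame accepts many different kinds of input:

Dict of 1D ndarrays, lists, dicts, or Series
2-D numpy.ndarray
Structured or record ndarray
A Series
Another DataFrame

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and/or columns, you are guaranteeing the index and/or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching the passed index.

If axis labels are not passed, they are constructed from the input data based on common-sense rules.

Note

When the data is a dict and columns is not specified, the DataFrame columns follow the dict's insertion order if you are using Python >= 3.6 and Pandas >= 0.23.

With Python < 3.6 or Pandas < 0.23, and columns not specified, the DataFrame columns are the lexically ordered list of dict keys.

#From dict of Series or dicts

The resulting index is the union of the indexes of the various Series. Any nested dicts are converted to Series first. If no columns are passed, the columns are the ordered list of dict keys.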

In [37]: d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
   ....:      'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
   ....: 

In [38]: df = pd.DataFrame(d)

In [39]: df
Out[39]: 
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

In [40]: pd.DataFrame(d, index=['d', 'b', 'a'])
Out[40]: 
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0

In [41]: pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
Out[41]: 
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN

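Note
When a particular set of columns is passed along with a dict of data, the passed columns override the keys in the dict.

The row and column labels can be accessed through the index and columns attributes, respectively: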

In [42]: df.index
Out[42]: Index(['a', 'b', 'c', 'd'], dtype='object')

In [43]: df.columns
Out[43]: Index(['one', 'two'], dtype='object')
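
The guarantees about index and columns can be verified programmatically. This sketch reuses the shape of the dict from the session above; the name subset is illustrative.

```python
import pandas as pd

d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}

# Only the requested rows and columns survive; "one" is discarded and
# "three", which has no data behind it, is filled with NaN.
subset = pd.DataFrame(d, index=["d", "b"], columns=["two", "three"])
print(subset)
```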

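#From dict of ndarrays / lists

The ndarrays must all be the same length. If an index is passed, it must be the same length as the arrays. If no index is passed, the result index is range(n), where n is the array length.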

In [44]: d = {'one': [1., 2., 3., 4.],
   ....:      'two': [4., 3., 2., 1.]}
   ....: 

In [45]: pd.DataFrame(d)
Out[45]: 
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0

In [46]: pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
Out[46]: 
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0

#From structured or record array

This case is handled identically to a dict of arrays.

In [47]: data = np.zeros((2, ), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])

In [48]: data[:] = [(1, 2., 'Hello'), (2, 3., "World")]

In [49]: pd.DataFrame(data)
Out[49]: 
   A    B         C
0  1  2.0  b'Hello'
1  2  3.0  b'World'

In [50]: pd.DataFrame(data, index=['first', 'second'])
Out[50]: 
        A    B         C
first   1  2.0  b'Hello'
second  2  3.0  b'World'

In [51]: pd.DataFrame(data, columns=['C', 'A', 'B'])
Out[51]: 
          C  A    B
0  b'Hello'  1  2.0
1  b'World'  2  3.0

Note

DataFrame is not intended to work exactly like a 2-dimensional NumPy ndarray.

#From a list of dicts

In [52]: data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

In [53]: pd.DataFrame(data2)
Out[53]: 
   a   b     c
0  1   2   NaN
1  5  10  20.0

In [54]: pd.DataFrame(data2, index=['first', 'second'])
Out[54]: 
        a   b     c
first   1   2   NaN
second  5  10  20.0

In [55]: pd.DataFrame(data2, columns=['a', 'b'])
Out[55]: 
   a   b
0  1   2
1  5  10

#From a dict of tuples

You can automatically create a MultiIndexed DataFrame by passing a dict of tuples.

In [56]: pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
   ....:               ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
   ....:               ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
   ....:               ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
   ....:               ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
   ....: 
Out[56]: 
       a              b      
       b    a    c    a     b
A B  1.0  4.0  5.0  8.0  10.0
  C  2.0  3.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0

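#From a Series

The result is a DataFrame with the same index as the input Series and one column, whose name is the name of the Series unless another column name is provided.

Missing data

The section on missing data covers this topic in more depth. Missing values in a DataFrame are represented by np.nan. Alternatively, you can pass a numpy.MaskedArray as the data argument to the DataFrame constructor, and its masked entries are treated as missing.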
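
Neither of the two cases above comes with an example, so here is a minimal sketch of both. The data and names (ser, masked, and so on) are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# A named Series becomes a one-column DataFrame with the same index.
ser = pd.Series([1.0, 2.0], index=["x", "y"], name="score")
from_series = pd.DataFrame(ser)

# Masked entries of a MaskedArray are treated as missing values.
masked = np.ma.masked_array([[1.0, 2.0], [3.0, 4.0]],
                            mask=[[False, True], [False, False]])
from_masked = pd.DataFrame(masked, columns=["A", "B"])
print(from_series, from_masked, sep="\n")
```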

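#Alternate constructors

DataFrame.from_dict

DataFrame.from_dict takes a dict of dicts or a dict of array-like sequences and returns a DataFrame. It operates like the DataFrame constructor except for the orient parameter, which is 'columns' by default. Setting orient to 'index' uses the dict keys as row labels.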

In [57]: pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]))
Out[57]: 
   A  B
0  1  4
1  2  5
2  3  6

With orient='index', the keys become the row labels. This example also passes the desired column names:

In [58]: pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]),
   ....:                        orient='index', columns=['one', 'two', 'three'])
   ....: 
Out[58]: 
   one  two  three
A    1    2      3
B    4    5      6

DataFrame.from_records

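DataFrame.from_records takes a list of tuples or an ndarray with a structured dtype. It works like the normal DataFrame constructor, except that the resulting DataFrame index may be a specific field of the structured dtype. For example: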

In [59]: data
Out[59]: 
array([(1, 2., b'Hello'), (2, 3., b'World')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In [60]: pd.DataFrame.from_records(data, index='C')
Out[60]: 
          A    B
C               
b'Hello'  1  2.0
b'World'  2  3.0
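
The session above uses a structured ndarray; the list-of-tuples input mentioned in the text works the same way. A small sketch with illustrative records:

```python
import pandas as pd

records = [(1, 2.0, "Hello"), (2, 3.0, "World")]

# columns names the tuple fields; index promotes one of them to the row index.
frame = pd.DataFrame.from_records(records, columns=["A", "B", "C"], index="C")
print(frame)
```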

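#Column selection, addition, deletion

You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dict operations: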

In [61]: df['one']
Out[61]: 
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [62]: df['three'] = df['one'] * df['two']

In [63]: df['flag'] = df['one'] > 2

In [64]: df
Out[64]: 
   one  two  three   flag
a  1.0  1.0    1.0  False
b  2.0  2.0    4.0  False
c  3.0  3.0    9.0   True
d  NaN  4.0    NaN  False

Columns can be deleted or popped as with a dict:

In [65]: del df['two']

In [66]: three = df.pop('three')

In [67]: df
Out[67]: 
   one   flag
a  1.0  False
b  2.0  False
c  3.0   True
d  NaN  False

When a scalar value is inserted, it is broadcast to fill the column:

In [68]: df['foo'] = 'bar'

In [69]: df
Out[69]: 
   one   flag  foo
a  1.0  False  bar
b  2.0  False  bar
c  3.0   True  bar
d  NaN  False  bar

When inserting a Series that does not have the same index as the DataFrame, it is conformed to the DataFrame's index:

In [70]: df['one_trunc'] = df['one'][:2]

In [71]: df
Out[71]: 
   one   flag  foo  one_trunc
a  1.0  False  bar        1.0
b  2.0  False  bar        2.0
c  3.0   True  bar        NaN
d  NaN  False  bar        NaN

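You can insert raw ndarrays, but their length must match the length of the DataFrame's index.

By default, columns are inserted at the end. The insert function inserts a column at a particular location: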

In [72]: df.insert(1, 'bar', df['one'])

In [73]: df
Out[73]: 
   one  bar   flag  foo  one_trunc
a  1.0  1.0  False  bar        1.0
b  2.0  2.0  False  bar        2.0
c  3.0  3.0   True  bar        NaN
d  NaN  NaN  False  bar        NaN
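
The length requirement for raw ndarrays is not demonstrated above. The following sketch, with illustrative column names, shows a matching array being accepted, a short one being rejected, and insert placing a column at a chosen position.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0, 4.0]}, index=["a", "b", "c", "d"])

# A raw ndarray of matching length is assigned positionally.
df["raw"] = np.array([10, 20, 30, 40])

# A raw ndarray of the wrong length raises ValueError; no column is added.
try:
    df["short"] = np.array([1, 2, 3])
    insert_error = None
except ValueError as exc:
    insert_error = exc

# insert places the new column at position 0 instead of at the end.
df.insert(0, "first", df["one"] * 2)
print(df)
```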

#Assigning new columns in method chains

Inspired by dplyr's mutate verb, DataFrame has an assign() method that lets you easily create new columns derived from existing ones.

In [74]: iris = pd.read_csv('data/iris.data')

In [75]: iris.head()
Out[75]: 
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

In [76]: (iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength'])
   ....:      .head())
   ....: 
Out[76]: 
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa     0.686275
1          4.9         3.0          1.4         0.2  Iris-setosa     0.612245
2          4.7         3.2          1.3         0.2  Iris-setosa     0.680851
3          4.6         3.1          1.5         0.2  Iris-setosa     0.673913
4          5.0         3.6          1.4         0.2  Iris-setosa     0.720000

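In the example above, a precomputed value was inserted. You can also pass a function of one argument, which is evaluated on the DataFrame being assigned to.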

In [77]: iris.assign(sepal_ratio=lambda x: (x['SepalWidth'] / x['SepalLength'])).head()
Out[77]: 
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa     0.686275
1          4.9         3.0          1.4         0.2  Iris-setosa     0.612245
2          4.7         3.2          1.3         0.2  Iris-setosa     0.680851
3          4.6         3.1          1.5         0.2  Iris-setosa     0.673913
4          5.0         3.6          1.4         0.2  Iris-setosa     0.720000

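assign always returns a copy of the data, leaving the original DataFrame untouched.

Passing a callable, as opposed to an actual value to be inserted, is useful when you do not have a reference to the DataFrame at hand. This is common when using assign in a chain of operations. For example, you can limit the DataFrame to the observations with a sepal length greater than 5, calculate the ratios, and plot: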

In [78]: (iris.query('SepalLength > 5')
   ....:      .assign(SepalRatio=lambda x: x.SepalWidth / x.SepalLength,
   ....:              PetalRatio=lambda x: x.PetalWidth / x.PetalLength)
   ....:      .plot(kind='scatter', x='SepalRatio', y='PetalRatio'))
   ....: 
Out[78]: <matplotlib.axes._subplots.AxesSubplot at 0x7f66075a7978>

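Because a function is passed in, it is computed on the DataFrame being assigned to. Importantly, this is the DataFrame that has already been filtered to the rows with sepal length greater than 5. The filtering happens first, and then the ratio calculations. This is an example where no reference to the filtered DataFrame was available.

The function signature for assign is simply **kwargs. The keys are the column names for the new fields, and the values are either a value to be inserted (for example, a Series or NumPy array) or a function of one argument to be called on the DataFrame. A copy of the original DataFrame is returned, with the new values inserted.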
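
Both kinds of keyword values, and the copy semantics, can be seen in one call. This is a sketch with illustrative data; the callable for D refers to a column created earlier in the same call, which assumes Python 3.6+ and Pandas 0.23+ as discussed in this section.

```python
import pandas as pd

dfa = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

with_sum = dfa.assign(
    C=dfa["A"] + dfa["B"],   # a precomputed value
    D=lambda x: x["C"] * 2,  # a callable evaluated on the frame being built
)

# The original frame is left untouched.
print(list(dfa.columns), list(with_sum.columns))
```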

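New in version 0.23.0.

Starting with Python 3.6, the order of **kwargs is preserved. This allows for dependent assignment, where an expression later in **kwargs can refer to a column created earlier in the same assign().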

In [79]: dfa = pd.DataFrame({"A": [1, 2, 3],
   ....:                     "B": [4, 5, 6]})
   ....: 

In [80]: dfa.assign(C=lambda x: x['A'] + x['B'],
   ....:            D=lambda x: x['A'] + x['C'])
   ....: 
Out[80]: 
   A  B  C   D
0  1  4  5   6
1  2  5  7   9
2  3  6  9  12

In the second expression, x['C'] refers to the newly created column, which equals dfa['A'] + dfa['B'].

To write code compatible with all versions of Python, split the assignment in two.

In [81]: dependent = pd.DataFrame({"A": [1, 1, 1]})

In [82]: (dependent.assign(A=lambda x: x['A'] + 1)
   ....:           .assign(B=lambda x: x['A'] + 2))
   ....: 
Out[82]: 
   A  B
0  2  4
1  2  4
2  2  4

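Warning

Dependent assignment can subtly change the behavior of code between Python 3.6 and older versions of Python.

If you want to write code that supports Python versions both before and after 3.6, take care when passing assign expressions that:

Update an existing column
Refer to the newly updated column in the same assign

For example, the following updates column "A" and then refers to it when creating "B".

>>> dependent = pd.DataFrame({"A": [1, 1, 1]})
>>> dependent.assign(A=lambda x: x["A"] + 1, B=lambda x: x["A"] + 2)

On Python 3.5 and earlier, the expression creating B refers to the "old" value of A, [1, 1, 1], so B is [3, 3, 3].

On Python 3.6 and later, the expression creating B refers to the "new" value of A, [2, 2, 2], so B is [4, 4, 4].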

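#Indexing / selection

The basics of indexing are as follows:

Operation	Syntax	Result
Select column	`df[col]`	Series
Select row by label	`df.loc[label]`	Series
Select row by integer location	`df.iloc[loc]`	Series
Slice rows	`df[5:10]`	DataFrame
Select rows by boolean vector	`df[bool_vec]`	DataFrame

Row selection returns a Series whose index is the columns of the DataFrame: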

In [83]: df.loc['b']
Out[83]: 
one              2
bar              2
flag         False
foo            bar
one_trunc        2
Name: b, dtype: object

In [84]: df.iloc[2]
Out[84]: 
one             3
bar             3
flag         True
foo           bar
one_trunc     NaN
Name: c, dtype: object

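For more sophisticated indexing and slicing, see the section on indexing. The fundamentals of reindexing / conforming to new sets of labels are covered in the section on reindexing.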
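
The result types listed in the table can be confirmed on a tiny frame. A sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "flag": [False, False, True]},
                  index=["a", "b", "c"])

column = df["one"]            # select column        -> Series
row_by_label = df.loc["b"]    # select row by label  -> Series
row_by_position = df.iloc[2]  # select row by position -> Series
row_slice = df[0:2]           # slice rows           -> DataFrame
bool_rows = df[df["one"] > 1] # boolean vector       -> DataFrame

# A selected row is indexed by the DataFrame's column labels.
print(list(row_by_label.index), row_by_position.name)
```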

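#Data alignment and arithmetic

Operations between DataFrame objects automatically align on both the columns and the index (row labels). As before, the resulting object has the union of the column and row labels.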

In [85]: df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])

In [86]: df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])

In [87]: df + df2
Out[87]: 
          A         B         C   D
0  0.045691 -0.014138  1.380871 NaN
1 -0.955398 -1.501007  0.037181 NaN
2 -0.662690  1.534833 -0.859691 NaN
3 -2.452949  1.237274 -0.133712 NaN
4  1.414490  1.951676 -2.320422 NaN
5 -0.494922 -1.649727 -1.084601 NaN
6 -1.047551 -0.748572 -0.805479 NaN
7       NaN       NaN       NaN NaN
8       NaN       NaN       NaN NaN
9       NaN       NaN       NaN NaN

When doing an operation between a DataFrame and a Series, the default behavior is to align the Series index on the DataFrame columns, thus broadcasting row-wise. For example:

In [88]: df - df.iloc[0]
Out[88]: 
          A         B         C         D
0  0.000000  0.000000  0.000000  0.000000
1 -1.359261 -0.248717 -0.453372 -1.754659
2  0.253128  0.829678  0.010026 -1.991234
3 -1.311128  0.054325 -1.724913 -1.620544
4  0.573025  1.500742 -0.676070  1.367331
5 -1.741248  0.781993 -1.241620 -2.053136
6 -1.240774 -0.869551 -0.153282  0.000430
7 -0.743894  0.411013 -0.929563 -0.282386
8 -1.194921  1.320690  0.238224 -1.482644
9  2.293786  1.856228  0.773289 -1.446531

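Time series were historically a special case: when the DataFrame index contained dates, the operation broadcast column-wise. In the output below, the dates in the Series index are instead aligned against the DataFrame's column labels, so every value in the result is NaN: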

In [89]: index = pd.date_range('1/1/2000', periods=8)

In [90]: df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=list('ABC'))

In [91]: df
Out[91]: 
                   A         B         C
2000-01-01 -1.226825  0.769804 -1.281247
2000-01-02 -0.727707 -0.121306 -0.097883
2000-01-03  0.695775  0.341734  0.959726
2000-01-04 -1.110336 -0.619976  0.149748
2000-01-05 -0.732339  0.687738  0.176444
2000-01-06  0.403310 -0.154951  0.301624
2000-01-07 -2.179861 -1.369849 -0.954208
2000-01-08  1.462696 -1.743161 -0.826591

In [92]: type(df['A'])
Out[92]: pandas.core.series.Series

In [93]: df - df['A']
Out[93]: 
            2000-01-01 00:00:00  2000-01-02 00:00:00  2000-01-03 00:00:00  2000-01-04 00:00:00  ...  2000-01-08 00:00:00   A   B   C
2000-01-01                  NaN                  NaN                  NaN                  NaN  ...                  NaN NaN NaN NaN
2000-01-02                  NaN                  NaN                  NaN                  NaN  ...                  NaN NaN NaN NaN
2000-01-03                  NaN                  NaN                  NaN                  NaN  ...                  NaN NaN NaN NaN
2000-01-04                  NaN                  NaN                  NaN                  NaN  ...                  NaN NaN NaN NaN
2000-01-05                  NaN                  NaN                  NaN                  NaN  ...                  NaN NaN NaN NaN
2000-01-06                  NaN                  NaN                  NaN                  NaN  ...                  NaN NaN NaN NaN
2000-01-07                  NaN                  NaN                  NaN                  NaN  ...                  NaN NaN NaN NaN
2000-01-08                  NaN                  NaN                  NaN                  NaN  ...                  NaN NaN NaN NaN

[8 rows x 11 columns]

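Warning

df - df['A']

is deprecated and will be removed in a future version. The preferred way to perform this operation is:

df.sub(df['A'], axis=0)

For explicit control over matching and broadcasting behavior, see the section on flexible binary operations.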
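
A minimal sketch of the preferred form, using a small deterministic frame (illustrative values) so the column-wise broadcast is easy to follow: with axis=0 the Series is matched on the row index and subtracted from every column.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2000-01-01", periods=3)
df = pd.DataFrame(np.arange(9.0).reshape(3, 3), index=index, columns=list("ABC"))

# Align df["A"] on the row index (axis=0) and subtract it from each column.
centered = df.sub(df["A"], axis=0)
print(centered)
```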

Operations with scalars work just as they do for the other data structures:

In [94]: df * 5 + 2
Out[94]: 
                   A         B         C
2000-01-01 -4.134126  5.849018 -4.406237
2000-01-02 -1.638535  1.393469  1.510587
2000-01-03  5.478873  3.708672  6.798628
2000-01-04 -3.551681 -1.099880  2.748742
2000-01-05 -1.661697  5.438692  2.882222
2000-01-06  4.016548  1.225246  3.508122
2000-01-07 -8.899303 -4.849247 -2.771039
2000-01-08  9.313480 -6.715805 -2.132955

In [95]: 1 / df
Out[95]: 
                   A         B          C
2000-01-01 -0.815112  1.299033  -0.780489
2000-01-02 -1.374179 -8.243600 -10.216313
2000-01-03  1.437247  2.926250   1.041965
2000-01-04 -0.900628 -1.612966   6.677871
2000-01-05 -1.365487  1.454041   5.667510
2000-01-06  2.479485 -6.453662   3.315381
2000-01-07 -0.458745 -0.730007  -1.047990
2000-01-08  0.683669 -0.573671  -1.209788

In [96]: df ** 4
Out[96]: 
                    A         B         C
2000-01-01   2.265327  0.351172  2.694833
2000-01-02   0.280431  0.000217  0.000092
2000-01-03   0.234355  0.013638  0.848376
2000-01-04   1.519910  0.147740  0.000503
2000-01-05   0.287640  0.223714  0.000969
2000-01-06   0.026458  0.000576  0.008277
2000-01-07  22.579530  3.521204  0.829033
2000-01-08   4.577374  9.233151  0.466834

Boolean operators work as well:

In [97]: df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)

In [98]: df2 = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0]}, dtype=bool)

In [99]: df1 & df2
Out[99]: 
       a      b
0  False  False
1  False   True
2   True  False

In [100]: df1 | df2
Out[100]: 
      a     b
0  True  True
1  True  True
2  True  True

In [101]: df1 ^ df2
Out[101]: 
       a      b
0   True   True
1   True  False
2  False   True

In [102]: -df1
Out[102]: 
       a      b
0  False   True
1   True  False
2  False  False

#Transposing

To transpose a DataFrame, access the T attribute (also available as the transpose function), as with an ndarray:

# only show the first 5 rows
In [103]: df[:5].T
Out[103]: 
   2000-01-01  2000-01-02  2000-01-03  2000-01-04  2000-01-05
A   -1.226825   -0.727707    0.695775   -1.110336   -0.732339
B    0.769804   -0.121306    0.341734   -0.619976    0.687738
C   -1.281247   -0.097883    0.959726    0.149748    0.176444

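#DataFrame interoperability with NumPy functions

Elementwise NumPy ufuncs (log, exp, sqrt, and so on) and various other NumPy functions can be used on Series and DataFrame, assuming the data within are numeric: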

In [104]: np.exp(df)
Out[104]: 
                   A         B         C
2000-01-01  0.293222  2.159342  0.277691
2000-01-02  0.483015  0.885763  0.906755
2000-01-03  2.005262  1.407386  2.610980
2000-01-04  0.329448  0.537957  1.161542
2000-01-05  0.480783  1.989212  1.192968
2000-01-06  1.496770  0.856457  1.352053
2000-01-07  0.113057  0.254145  0.385117
2000-01-08  4.317584  0.174966  0.437538

In [105]: np.asarray(df)
Out[105]: 
array([[-1.2268,  0.7698, -1.2812],
       [-0.7277, -0.1213, -0.0979],
       [ 0.6958,  0.3417,  0.9597],
       [-1.1103, -0.62  ,  0.1497],
       [-0.7323,  0.6877,  0.1764],
       [ 0.4033, -0.155 ,  0.3016],
       [-2.1799, -1.3698, -0.9542],
       [ 1.4627, -1.7432, -0.8266]])

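DataFrame is not intended to be a drop-in replacement for ndarray, as its indexing semantics and data model differ in places from those of an n-dimensional array.

Series implements __array_ufunc__, which allows it to work with NumPy's universal functions.

The ufunc is applied to the underlying array of the Series.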

In [106]: ser = pd.Series([1, 2, 3, 4])

In [107]: np.exp(ser)
Out[107]: 
0     2.718282
1     7.389056
2    20.085537
3    54.598150
dtype: float64

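Changed in version 0.25.0: when multiple Series are passed to a ufunc, they are aligned before the operation is performed.

Pandas automatically aligns labeled inputs as part of a ufunc with multiple inputs. For example, applying a ufunc to two Series with differently ordered labels aligns them before the operation.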

In [108]: ser1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

In [109]: ser2 = pd.Series([1, 3, 5], index=['b', 'a', 'c'])

In [110]: ser1
Out[110]: 
a    1
b    2
c    3
dtype: int64

In [111]: ser2
Out[111]: 
b    1
a    3
c    5
dtype: int64

In [112]: np.remainder(ser1, ser2)
Out[112]: 
a    1
b    0
c    3
dtype: int64

As usual, the union of the two indexes is taken, and non-overlapping values are filled with missing values.

In [113]: ser3 = pd.Series([2, 4, 6], index=['b', 'c', 'd'])

In [114]: ser3
Out[114]: 
b    2
c    4
d    6
dtype: int64

In [115]: np.remainder(ser1, ser3)
Out[115]: 
a    NaN
b    0.0
c    3.0
d    NaN
dtype: float64

When a binary ufunc is applied to a Series and an Index, the Series implementation takes precedence and a Series is returned.

In [116]: ser = pd.Series([1, 2, 3])

In [117]: idx = pd.Index([4, 5, 6])

In [118]: np.maximum(ser, idx)
Out[118]: 
0    4
1    5
2    6
dtype: int64

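NumPy ufuncs are safe to apply to a Series backed by a non-ndarray array, for example SparseArray (see the section on sparse calculation). If possible, the ufunc is applied without converting the underlying data to an ndarray.

#Console display

A very large DataFrame is truncated when displayed in the console. You can also get a summary using info(). The following code reads a CSV version of the baseball dataset from the plyr R package: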

In [119]: baseball = pd.read_csv('data/baseball.csv')

In [120]: print(baseball)
       id     player  year  stint team  lg   g   ab   r    h  X2b  X3b  hr   rbi   sb   cs  bb    so  ibb  hbp   sh   sf  gidp
0   88641  womacto01  2006      2  CHN  NL  19   50   6   14    1    0   1   2.0  1.0  1.0   4   4.0  0.0  0.0  3.0  0.0   0.0
1   88643  schilcu01  2006      1  BOS  AL  31    2   0    1    0    0   0   0.0  0.0  0.0   0   1.0  0.0  0.0  0.0  0.0   0.0
..    ...        ...   ...    ...  ...  ..  ..  ...  ..  ...  ...  ...  ..   ...  ...  ...  ..   ...  ...  ...  ...  ...   ...
98  89533   aloumo01  2007      1  NYN  NL  87  328  51  112   19    1  13  49.0  3.0  0.0  27  30.0  5.0  2.0  0.0  3.0  13.0
99  89534  alomasa02  2007      1  NYN  NL   8   22   1    3    1    0   0   0.0  0.0  0.0   0   3.0  0.0  0.0  0.0  0.0   0.0

[100 rows x 23 columns]

In [121]: baseball.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 23 columns):
id        100 non-null int64
player    100 non-null object
year      100 non-null int64
stint     100 non-null int64
team      100 non-null object
lg        100 non-null object
g         100 non-null int64
ab        100 non-null int64
r         100 non-null int64
h         100 non-null int64
X2b       100 non-null int64
X3b       100 non-null int64
hr        100 non-null int64
rbi       100 non-null float64
sb        100 non-null float64
cs        100 non-null float64
bb        100 non-null int64
so        100 non-null float64
ibb       100 non-null float64
hbp       100 non-null float64
sh        100 non-null float64
sf        100 non-null float64
gidp      100 non-null float64
dtypes: float64(9), int64(11), object(3)
memory usage: 18.1+ KB

However, to_string returns a string representation of the DataFrame in tabular form, though it will not always fit the console width:

In [122]: print(baseball.iloc[-20:, :12].to_string())
       id     player  year  stint team  lg    g   ab   r    h  X2b  X3b
80  89474  finlest01  2007      1  COL  NL   43   94   9   17    3    0
81  89480  embreal01  2007      1  OAK  AL    4    0   0    0    0    0
82  89481  edmonji01  2007      1  SLN  NL  117  365  39   92   15    2
83  89482  easleda01  2007      1  NYN  NL   76  193  24   54    6    0
84  89489  delgaca01  2007      1  NYN  NL  139  538  71  139   30    0
85  89493  cormirh01  2007      1  CIN  NL    6    0   0    0    0    0
86  89494  coninje01  2007      2  NYN  NL   21   41   2    8    2    0
87  89495  coninje01  2007      1  CIN  NL   80  215  23   57   11    1
88  89497  clemero02  2007      1  NYA  AL    2    2   0    1    0    0
89  89498  claytro01  2007      2  BOS  AL    8    6   1    0    0    0
90  89499  claytro01  2007      1  TOR  AL   69  189  23   48   14    0
91  89501  cirilje01  2007      2  ARI  NL   28   40   6    8    4    0
92  89502  cirilje01  2007      1  MIN  AL   50  153  18   40    9    2
93  89521  bondsba01  2007      1  SFN  NL  126  340  75   94   14    0
94  89523  biggicr01  2007      1  HOU  NL  141  517  68  130   31    3
95  89525  benitar01  2007      2  FLO  NL   34    0   0    0    0    0
96  89526  benitar01  2007      1  SFN  NL   19    0   0    0    0    0
97  89530  ausmubr01  2007      1  HOU  NL  117  349  38   82   16    3
98  89533   aloumo01  2007      1  NYN  NL   87  328  51  112   19    1
99  89534  alomasa02  2007      1  NYN  NL    8   22   1    3    1    0

By default, wide DataFrames are printed across multiple rows:

In [123]: pd.DataFrame(np.random.randn(3, 12))
Out[123]: 
          0         1         2         3         4         5         6         7         8         9        10        11
0 -0.345352  1.314232  0.690579  0.995761  2.396780  0.014871  3.357427 -0.317441 -1.236269  0.896171 -0.487602 -0.082240
1 -2.182937  0.380396  0.084844  0.432390  1.519970 -0.493662  0.600178  0.274230  0.132885 -0.023688  2.410179  1.450520
2  0.206053 -0.251905 -2.213588  1.063327  1.266143  0.299368 -0.863838  0.408204 -1.048089 -0.025747 -0.988387  0.094055

You can change how much is printed on a single row by setting the display.width option:

In [124]: pd.set_option('display.width', 40)  # default is 80

In [125]: pd.DataFrame(np.random.randn(3, 12))
Out[125]: 
          0         1         2         3         4         5         6         7         8         9        10        11
0  1.262731  1.289997  0.082423 -0.055758  0.536580 -0.489682  0.369374 -0.034571 -2.484478 -0.281461  0.030711  0.109121
1  1.126203 -0.977349  1.474071 -0.064034 -1.282782  0.781836 -1.071357  0.441153  2.353925  0.583787  0.221471 -0.744471
2  0.758527  1.729689 -0.964980 -0.845696 -1.340896  1.846883 -1.328865  1.682706 -1.717693  0.888782  0.228440  0.901805

You can adjust the maximum width of individual columns by setting display.max_colwidth.

In [126]: datafile = {'filename': ['filename_01', 'filename_02'],
   .....:             'path': ["media/user_name/storage/folder_01/filename_01",
   .....:                      "media/user_name/storage/folder_02/filename_02"]}
   .....: 

In [127]: pd.set_option('display.max_colwidth', 30)

In [128]: pd.DataFrame(datafile)
Out[128]: 
      filename                           path
0  filename_01  media/user_name/storage/fo...
1  filename_02  media/user_name/storage/fo...

In [129]: pd.set_option('display.max_colwidth', 100)

In [130]: pd.DataFrame(datafile)
Out[130]: 
      filename                                           path
0  filename_01  media/user_name/storage/folder_01/filename_01
1  filename_02  media/user_name/storage/folder_02/filename_02

You can also disable this feature via the expand_frame_repr option, which prints the table in one block.
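
The calls to pd.set_option above change the setting for the rest of the session. If a display option is only needed temporarily, pd.option_context can scope it to a block. This is a sketch with an illustrative one-row frame, not something the original text covers.

```python
import pandas as pd

datafile = pd.DataFrame(
    {"path": ["media/user_name/storage/folder_01/filename_01"]}
)

before = pd.get_option("display.max_colwidth")

# Inside the block the option is 30; on exit the previous value is restored.
with pd.option_context("display.max_colwidth", 30):
    inside = pd.get_option("display.max_colwidth")
    truncated = repr(datafile)

after = pd.get_option("display.max_colwidth")
print(truncated)
```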

#DataFrame column attribute access and IPython completion

If a DataFrame column label is a valid Python variable name, the column can be accessed like an attribute:

In [131]: df = pd.DataFrame({'foo1': np.random.randn(5),
   .....:                    'foo2': np.random.randn(5)})
   .....: 

In [132]: df
Out[132]: 
       foo1      foo2
0  1.171216 -0.858447
1  0.520260  0.306996
2 -1.197071 -0.028665
3 -1.066969  0.384316
4 -0.303421  1.574159

In [133]: df.foo1
Out[133]: 
0    1.171216
1    0.520260
2   -1.197071
3   -1.066969
4   -0.303421
Name: foo1, dtype: float64

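The columns are also connected to the IPython completion mechanism, so they can be tab-completed:

In [134]: df.fo<TAB>  # pressing Tab at this point lists the following completions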
df.foo1  df.foo2
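
Attribute access has a limit worth knowing: when a column label collides with an existing DataFrame method, the attribute resolves to the method, and only bracket access reaches the column. A small sketch with illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({"foo1": [1.0, 2.0], "count": [10, 20]})

# A valid identifier with no name clash: attribute and bracket access agree.
same_column = df.foo1.equals(df["foo1"])

# "count" is also a DataFrame method, so df.count is the method, not the column.
count_is_method = callable(df.count)
count_column = df["count"]
print(same_column, count_is_method, count_column.tolist())
```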
