Pandas Indexing and Selecting Data

Updated 2020-04-07 09:54

Indexing and selecting data

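The axis labeling information in pandas objects serves many purposes:

Identifies data (i.e. provides metadata) using known indicators, which is important for analysis, visualization, and interactive console display.
Enables automatic and explicit data alignment.
Allows intuitive getting and setting of subsets of the data set.

In this section, we focus on the final point: how to slice, dice, and generally get and set subsets of pandas objects. The primary focus is on Series and DataFrame, as they have received more development attention in this area.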

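Note

The Python and NumPy indexing operators [] and the attribute operator . provide quick and easy access to pandas data structures across a wide range of use cases. This makes interactive work intuitive, as there is little new to learn if you already know how to deal with Python dictionaries and NumPy arrays. However, since the type of the data to be accessed isn't known in advance, directly using standard operators has some optimization limits. For production code, we recommend that you take advantage of the optimized pandas data access methods described in this chapter.

Warning

Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a view versus a copy.

Warning

Indexing on an integer-based Index with floats has been clarified in 0.18.0; for a summary of the changes, see the 0.18.0 release notes.

See MultiIndex / Advanced Indexing for MultiIndex and more advanced indexing documentation.

See the cookbook for some advanced strategies.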

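#Different choices for indexing

Object selection has had a number of user-requested additions in order to support more explicit location-based indexing. Pandas now supports three types of multi-axis indexing.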

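.loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found. Allowed inputs are:
- A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index; this use is not an integer position along the index).
- A list or array of labels, ['a', 'b', 'c'].
- A slice object with labels, 'a':'f' (note that, contrary to usual Python slices, both the start and the stop are included when present in the index! See Slicing with labels and Endpoints are inclusive).
- A boolean array.
- A callable function with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above).
New in version 0.18.1.
See more at Selection by label.
.iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc will raise IndexError if a requested indexer is out of bounds, except for slice indexers, which allow out-of-bounds indexing (this conforms with Python / NumPy slice semantics). Allowed inputs are:
- An integer, e.g. 5.
- A list or array of integers, [4, 3, 0].
- A slice object with ints, 1:7.
- A boolean array.
- A callable function with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above).
New in version 0.18.1.
See more at Selection by position, Advanced Indexing and Advanced Hierarchical.
.loc, .iloc, and also [] indexing can accept a callable as indexer. See more at Selection by callable.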

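Getting values from an object with multi-axis selection uses the following notation (using .loc as an example, but the following applies to .iloc as well). Any of the axis accessors may be the null slice :. Axes left out of the specification are assumed to be :, e.g. p.loc['a'] is equivalent to p.loc['a', :, :].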

Object type	Indexers
Series	s.loc[indexer]
DataFrame	df.loc[row_indexer,column_indexer]
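
To make the label/position distinction concrete, here is a minimal sketch. The Series `s` and its integer labels are made up for illustration; they are chosen so that labels and positions cannot be confused.

```python
import pandas as pd

# Integer labels that deliberately differ from the positions 0, 1, 2.
s = pd.Series(["x", "y", "z"], index=[10, 20, 30])

by_label = s.loc[10]          # looks up the label 10
by_position = s.iloc[0]       # looks up the first position

label_slice = s.loc[10:20]    # label slice: both endpoints are included
position_slice = s.iloc[0:1]  # positional slice: the stop is excluded

try:
    s.iloc[10]                # there is no position 10, only a label 10
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

print(list(label_slice), list(position_slice), out_of_bounds)
```

Both scalar lookups return "x" here, but the two slices differ in length because only the label slice includes its stop.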

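#Basics

As mentioned when introducing the data structures in the previous section, the primary function of indexing with [] (a.k.a. __getitem__ for those familiar with implementing class behavior in Python) is selecting out lower-dimensional slices. The following table shows the return types when indexing pandas objects with []: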

Object type	Selection	Return value type
Series	series[label]	scalar value
DataFrame	frame[colname]	Series corresponding to colname

Here we construct a simple time series data set to use for illustrating the indexing functionality:

In [1]: dates = pd.date_range('1/1/2000', periods=8)

In [2]: df = pd.DataFrame(np.random.randn(8, 4),
   ...:                   index=dates, columns=['A', 'B', 'C', 'D'])
   ...: 

In [3]: df
Out[3]: 
                   A         B         C         D
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804
2000-01-04  0.721555 -0.706771 -1.039575  0.271860
2000-01-05 -0.424972  0.567020  0.276232 -1.087401
2000-01-06 -0.673690  0.113648 -1.478427  0.524988
2000-01-07  0.404705  0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312  0.844885

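Note

None of the indexing functionality is time series specific unless specifically stated.

Thus, as per above, we have the most basic indexing using []: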

In [4]: s = df['A']

In [5]: s[dates[5]]
Out[5]: -0.6736897080883706

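You can pass a list of columns to [] to select columns in that order. If a column is not contained in the DataFrame, an exception will be raised. Multiple columns can also be set in this manner: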

In [6]: df
Out[6]: 
                   A         B         C         D
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804
2000-01-04  0.721555 -0.706771 -1.039575  0.271860
2000-01-05 -0.424972  0.567020  0.276232 -1.087401
2000-01-06 -0.673690  0.113648 -1.478427  0.524988
2000-01-07  0.404705  0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312  0.844885

In [7]: df[['B', 'A']] = df[['A', 'B']]

In [8]: df
Out[8]: 
                   A         B         C         D
2000-01-01 -0.282863  0.469112 -1.509059 -1.135632
2000-01-02 -0.173215  1.212112  0.119209 -1.044236
2000-01-03 -2.104569 -0.861849 -0.494929  1.071804
2000-01-04 -0.706771  0.721555 -1.039575  0.271860
2000-01-05  0.567020 -0.424972  0.276232 -1.087401
2000-01-06  0.113648 -0.673690 -1.478427  0.524988
2000-01-07  0.577046  0.404705 -1.715002 -1.039268
2000-01-08 -1.157892 -0.370647 -1.344312  0.844885

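You may find this useful for applying a transform (in place) to a subset of the columns.

Warning

pandas aligns all axes when setting Series and DataFrame from .loc and .iloc.

The following will not modify df, because the column alignment happens before the value assignment.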

In [9]: df[['A', 'B']]
Out[9]: 
                   A         B
2000-01-01 -0.282863  0.469112
2000-01-02 -0.173215  1.212112
2000-01-03 -2.104569 -0.861849
2000-01-04 -0.706771  0.721555
2000-01-05  0.567020 -0.424972
2000-01-06  0.113648 -0.673690
2000-01-07  0.577046  0.404705
2000-01-08 -1.157892 -0.370647

In [10]: df.loc[:, ['B', 'A']] = df[['A', 'B']]

In [11]: df[['A', 'B']]
Out[11]: 
                   A         B
2000-01-01 -0.282863  0.469112
2000-01-02 -0.173215  1.212112
2000-01-03 -2.104569 -0.861849
2000-01-04 -0.706771  0.721555
2000-01-05  0.567020 -0.424972
2000-01-06  0.113648 -0.673690
2000-01-07  0.577046  0.404705
2000-01-08 -1.157892 -0.370647

The correct way to swap column values is by using raw values:

In [12]: df.loc[:, ['B', 'A']] = df[['A', 'B']].to_numpy()

In [13]: df[['A', 'B']]
Out[13]: 
                   A         B
2000-01-01  0.469112 -0.282863
2000-01-02  1.212112 -0.173215
2000-01-03 -0.861849 -2.104569
2000-01-04  0.721555 -0.706771
2000-01-05 -0.424972  0.567020
2000-01-06 -0.673690  0.113648
2000-01-07  0.404705  0.577046
2000-01-08 -0.370647 -1.157892
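
The same effect can be reproduced on a tiny frame. This is a sketch with made-up integer data rather than the random frame above; `aligned` and `swapped` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Assigning a DataFrame through .loc aligns on column labels first,
# so column "A" is written back into "A" and nothing is swapped.
aligned = df.copy()
aligned.loc[:, ["B", "A"]] = aligned[["A", "B"]]

# Converting to a NumPy array drops the labels, so values are
# assigned purely by position and the columns really are swapped.
swapped = df.copy()
swapped.loc[:, ["B", "A"]] = swapped[["A", "B"]].to_numpy()

print(aligned["A"].tolist(), swapped["A"].tolist())
```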

#Attribute access

You may access an index on a Series, or a column on a DataFrame, directly as an attribute:

In [14]: sa = pd.Series([1, 2, 3], index=list('abc'))

In [15]: dfa = df.copy()

In [16]: sa.b
Out[16]: 2

In [17]: dfa.A
Out[17]: 
2000-01-01    0.469112
2000-01-02    1.212112
2000-01-03   -0.861849
2000-01-04    0.721555
2000-01-05   -0.424972
2000-01-06   -0.673690
2000-01-07    0.404705
2000-01-08   -0.370647
Freq: D, Name: A, dtype: float64

In [18]: sa.a = 5

In [19]: sa
Out[19]: 
a    5
b    2
c    3
dtype: int64

In [20]: dfa.A = list(range(len(dfa.index)))  # ok if A already exists

In [21]: dfa
Out[21]: 
            A         B         C         D
2000-01-01  0 -0.282863 -1.509059 -1.135632
2000-01-02  1 -0.173215  0.119209 -1.044236
2000-01-03  2 -2.104569 -0.494929  1.071804
2000-01-04  3 -0.706771 -1.039575  0.271860
2000-01-05  4  0.567020  0.276232 -1.087401
2000-01-06  5  0.113648 -1.478427  0.524988
2000-01-07  6  0.577046 -1.715002 -1.039268
2000-01-08  7 -1.157892 -1.344312  0.844885

In [22]: dfa['A'] = list(range(len(dfa.index)))  # use this form to create a new column

In [23]: dfa
Out[23]: 
            A         B         C         D
2000-01-01  0 -0.282863 -1.509059 -1.135632
2000-01-02  1 -0.173215  0.119209 -1.044236
2000-01-03  2 -2.104569 -0.494929  1.071804
2000-01-04  3 -0.706771 -1.039575  0.271860
2000-01-05  4  0.567020  0.276232 -1.087401
2000-01-06  5  0.113648 -1.478427  0.524988
2000-01-07  6  0.577046 -1.715002 -1.039268
2000-01-08  7 -1.157892 -1.344312  0.844885

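Warning

You can use this access only if the index element is a valid Python identifier, e.g. s.1 is not allowed. See the Python language reference for an explanation of valid identifiers.

The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed.
Similarly, the attribute will not be available if it conflicts with any of the following: index, major_axis, minor_axis, items.
In any of these cases, standard indexing will still work, e.g. s['1'], s['min'], and s['index'] will access the corresponding element or column.

If you are using the IPython environment, you may also use tab-completion to see these accessible attributes.

You can also assign a dict to a row of a DataFrame: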

In [24]: x = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]})

In [25]: x.iloc[1] = {'x': 9, 'y': 99}

In [26]: x
Out[26]: 
   x   y
0  1   3
1  9  99
2  3   5

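You can use attribute access to modify an existing element of a Series or an existing column of a DataFrame, but be careful: if you try to use attribute access to create a new column, it creates a new attribute rather than a new column. In 0.21.0 and later, this will raise a UserWarning: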

In [1]: df = pd.DataFrame({'one': [1., 2., 3.]})
In [2]: df.two = [4, 5, 6]
UserWarning: Pandas doesn't allow Series to be assigned into nonexistent columns - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute_access
In [3]: df
Out[3]:
   one
0  1.0
1  2.0
2  3.0
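
The pitfall can be checked programmatically. The sketch below reuses the small frame from the example above; the warning is captured only so that the script runs quietly, and the variable names are illustrative.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df.two = [4, 5, 6]  # sets a plain Python attribute, not a column

created_by_attribute = "two" in df.columns

df["two"] = [4, 5, 6]   # the bracket form does create the column
created_by_brackets = "two" in df.columns

print(created_by_attribute, created_by_brackets)
print([str(w.category.__name__) for w in caught])
```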

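#Slicing ranges

The most robust and consistent way of slicing ranges along arbitrary axes is described in the Selection by position section, which details the .iloc method. For now, we explain the semantics of slicing using the [] operator.

With Series, the syntax works exactly as with an ndarray, returning a slice of the values and the corresponding labels: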

In [27]: s[:5]
Out[27]: 
2000-01-01    0.469112
2000-01-02    1.212112
2000-01-03   -0.861849
2000-01-04    0.721555
2000-01-05   -0.424972
Freq: D, Name: A, dtype: float64

In [28]: s[::2]
Out[28]: 
2000-01-01    0.469112
2000-01-03   -0.861849
2000-01-05   -0.424972
2000-01-07    0.404705
Freq: 2D, Name: A, dtype: float64

In [29]: s[::-1]
Out[29]: 
2000-01-08   -0.370647
2000-01-07    0.404705
2000-01-06   -0.673690
2000-01-05   -0.424972
2000-01-04    0.721555
2000-01-03   -0.861849
2000-01-02    1.212112
2000-01-01    0.469112
Freq: -1D, Name: A, dtype: float64

Note that setting works as well:

In [30]: s2 = s.copy()

In [31]: s2[:5] = 0

In [32]: s2
Out[32]: 
2000-01-01    0.000000
2000-01-02    0.000000
2000-01-03    0.000000
2000-01-04    0.000000
2000-01-05    0.000000
2000-01-06   -0.673690
2000-01-07    0.404705
2000-01-08   -0.370647
Freq: D, Name: A, dtype: float64

With DataFrame, slicing inside of [] slices the rows. This is provided largely as a convenience, since it is such a common operation.

In [33]: df[:3]
Out[33]: 
                   A         B         C         D
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804

In [34]: df[::-1]
Out[34]: 
                   A         B         C         D
2000-01-08 -0.370647 -1.157892 -1.344312  0.844885
2000-01-07  0.404705  0.577046 -1.715002 -1.039268
2000-01-06 -0.673690  0.113648 -1.478427  0.524988
2000-01-05 -0.424972  0.567020  0.276232 -1.087401
2000-01-04  0.721555 -0.706771 -1.039575  0.271860
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632

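#Selection by label

Warning

Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a view versus a copy.

Warning

.loc is strict when you present slicers that are not compatible with (or convertible to) the index type, for example integers with a DatetimeIndex. These will raise a TypeError.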

In [35]: dfl = pd.DataFrame(np.random.randn(5, 4),
   ....:                    columns=list('ABCD'),
   ....:                    index=pd.date_range('20130101', periods=5))
   ....: 

In [36]: dfl
Out[36]: 
                   A         B         C         D
2013-01-01  1.075770 -0.109050  1.643563 -1.469388
2013-01-02  0.357021 -0.674600 -1.776904 -0.968914
2013-01-03 -1.294524  0.413738  0.276662 -0.472035
2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061
2013-01-05  0.895717  0.805244 -1.206412  2.565646

In [4]: dfl.loc[2:3]
TypeError: cannot do slice indexing on <class 'pandas.tseries.index.DatetimeIndex'> with these indexers [2] of <type 'int'>

String-likes in slicing can be converted to the type of the index and lead to natural slicing.

In [37]: dfl.loc['20130102':'20130104']
Out[37]: 
                   A         B         C         D
2013-01-02  0.357021 -0.674600 -1.776904 -0.968914
2013-01-03 -1.294524  0.413738  0.276662 -0.472035
2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061

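Warning

Starting in 0.21.0, pandas will show a FutureWarning if you index with a list containing missing labels. In the future this will raise a KeyError. See Indexing with list with missing labels is deprecated.

pandas provides a suite of methods in order to have purely label-based indexing. This is a strict inclusion-based protocol. Every label asked for must be in the index, or a KeyError will be raised. When slicing, both the start bound and the stop bound are included, if present in the index. Integers are valid labels, but they refer to the label and not the position.

The .loc attribute is the primary access method. The following are valid inputs:

- A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index; this use is not an integer position along the index).
- A list or array of labels, ['a', 'b', 'c'].
- A slice object with labels, 'a':'f' (note that, contrary to usual Python slices, both the start and the stop are included when present in the index! See Slicing with labels).
- A boolean array.
- A callable, see Selection by callable.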

In [38]: s1 = pd.Series(np.random.randn(6), index=list('abcdef'))

In [39]: s1
Out[39]: 
a    1.431256
b    1.340309
c   -1.170299
d   -0.226169
e    0.410835
f    0.813850
dtype: float64

In [40]: s1.loc['c':]
Out[40]: 
c   -1.170299
d   -0.226169
e    0.410835
f    0.813850
dtype: float64

In [41]: s1.loc['b']
Out[41]: 1.3403088497993827

Note that setting works as well:

In [42]: s1.loc['c':] = 0

In [43]: s1
Out[43]: 
a    1.431256
b    1.340309
c    0.000000
d    0.000000
e    0.000000
f    0.000000
dtype: float64

With a DataFrame:

In [44]: df1 = pd.DataFrame(np.random.randn(6, 4),
   ....:                    index=list('abcdef'),
   ....:                    columns=list('ABCD'))
   ....: 

In [45]: df1
Out[45]: 
          A         B         C         D
a  0.132003 -0.827317 -0.076467 -1.187678
b  1.130127 -1.436737 -1.413681  1.607920
c  1.024180  0.569605  0.875906 -2.211372
d  0.974466 -2.006747 -0.410001 -0.078638
e  0.545952 -1.219217 -1.226825  0.769804
f -1.281247 -0.727707 -0.121306 -0.097883

In [46]: df1.loc[['a', 'b', 'd'], :]
Out[46]: 
          A         B         C         D
a  0.132003 -0.827317 -0.076467 -1.187678
b  1.130127 -1.436737 -1.413681  1.607920
d  0.974466 -2.006747 -0.410001 -0.078638

Accessing via label slices:

In [47]: df1.loc['d':, 'A':'C']
Out[47]: 
          A         B         C
d  0.974466 -2.006747 -0.410001
e  0.545952 -1.219217 -1.226825
f -1.281247 -0.727707 -0.121306

For getting a cross section using a label (equivalent to df.xs('a')):

In [48]: df1.loc['a']
Out[48]: 
A    0.132003
B   -0.827317
C   -0.076467
D   -1.187678
Name: a, dtype: float64

For getting values with a boolean array:

In [49]: df1.loc['a'] > 0
Out[49]: 
A     True
B    False
C    False
D    False
Name: a, dtype: bool

In [50]: df1.loc[:, df1.loc['a'] > 0]
Out[50]: 
          A
a  0.132003
b  1.130127
c  1.024180
d  0.974466
e  0.545952
f -1.281247

For getting a value explicitly (equivalent to the deprecated df.get_value('a','A')):

# this is also equivalent to ``df1.at['a','A']``
In [51]: df1.loc['a', 'A']
Out[51]: 0.13200317033032932

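#Slicing with labels

When using .loc with slices, if both the start and the stop labels are present in the index, then the elements located between the two (including them) are returned: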

In [52]: s = pd.Series(list('abcde'), index=[0, 3, 2, 5, 4])

In [53]: s.loc[3:5]
Out[53]: 
3    b
2    c
5    d
dtype: object

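If at least one of the two is absent, but the index is sorted and can be compared against the start and stop labels, then slicing will still work as expected, by selecting the labels which rank between the two: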

In [54]: s.sort_index()
Out[54]: 
0    a
2    c
3    b
4    e
5    d
dtype: object

In [55]: s.sort_index().loc[1:6]
Out[55]: 
2    c
3    b
4    e
5    d
dtype: object

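However, if at least one of the two is absent and the index is not sorted, an error will be raised (since doing otherwise would be computationally expensive, as well as potentially ambiguous for mixed-type indexes). For instance, in the above example, s.loc[1:6] would raise KeyError.

For the rationale behind this behavior, see Endpoints are inclusive.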
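
The three cases described in this section can be put side by side in one runnable sketch. It reuses the Series from the example above; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(list("abcde"), index=[0, 3, 2, 5, 4])

# Case 1: both labels exist, so the slice runs from label 3 to label 5
# in index order, endpoints included.
both_present = s.loc[3:5]

# Case 2: labels 1 and 6 are missing, but the sorted index can still be
# compared against them, so everything ranking between them is selected.
sorted_ok = s.sort_index().loc[1:6]

# Case 3: missing labels on an unsorted index cannot be resolved.
try:
    s.loc[1:6]
    raised = False
except KeyError:
    raised = True

print(list(both_present.index), list(sorted_ok.index), raised)
```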

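#Selection by position

Warning

Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a view versus a copy.

Pandas provides a suite of methods in order to get purely integer-based indexing. The semantics closely follow Python and NumPy slicing. These are 0-based indexes. When slicing, the start bound is included, while the upper bound is excluded. Trying to use a non-integer, even a valid label, will raise an IndexError.

The .iloc attribute is the primary access method. The following are valid inputs:

- An integer, e.g. 5.
- A list or array of integers, [4, 3, 0].
- A slice object with ints, 1:7.
- A boolean array.
- A callable, see Selection by callable.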

In [56]: s1 = pd.Series(np.random.randn(5), index=list(range(0, 10, 2)))

In [57]: s1
Out[57]: 
0    0.695775
2    0.341734
4    0.959726
6   -1.110336
8   -0.619976
dtype: float64

In [58]: s1.iloc[:3]
Out[58]: 
0    0.695775
2    0.341734
4    0.959726
dtype: float64

In [59]: s1.iloc[3]
Out[59]: -1.110336102891167

Note that setting works as well:

In [60]: s1.iloc[:3] = 0

In [61]: s1
Out[61]: 
0    0.000000
2    0.000000
4    0.000000
6   -1.110336
8   -0.619976
dtype: float64

With a DataFrame:

In [62]: df1 = pd.DataFrame(np.random.randn(6, 4),
   ....:                    index=list(range(0, 12, 2)),
   ....:                    columns=list(range(0, 8, 2)))
   ....: 

In [63]: df1
Out[63]: 
           0         2         4         6
0   0.149748 -0.732339  0.687738  0.176444
2   0.403310 -0.154951  0.301624 -2.179861
4  -1.369849 -0.954208  1.462696 -1.743161
6  -0.826591 -0.345352  1.314232  0.690579
8   0.995761  2.396780  0.014871  3.357427
10 -0.317441 -1.236269  0.896171 -0.487602

Select via integer slicing:

In [64]: df1.iloc[:3]
Out[64]: 
          0         2         4         6
0  0.149748 -0.732339  0.687738  0.176444
2  0.403310 -0.154951  0.301624 -2.179861
4 -1.369849 -0.954208  1.462696 -1.743161

In [65]: df1.iloc[1:5, 2:4]
Out[65]: 
          4         6
2  0.301624 -2.179861
4  1.462696 -1.743161
6  1.314232  0.690579
8  0.014871  3.357427

Select via integer list:

In [66]: df1.iloc[[1, 3, 5], [1, 3]]
Out[66]: 
           2         6
2  -0.154951 -2.179861
6  -0.345352  0.690579
10 -1.236269 -0.487602

In [67]: df1.iloc[1:3, :]
Out[67]: 
          0         2         4         6
2  0.403310 -0.154951  0.301624 -2.179861
4 -1.369849 -0.954208  1.462696 -1.743161

In [68]: df1.iloc[:, 1:3]
Out[68]: 
           2         4
0  -0.732339  0.687738
2  -0.154951  0.301624
4  -0.954208  1.462696
6  -0.345352  1.314232
8   2.396780  0.014871
10 -1.236269  0.896171

# this is also equivalent to ``df1.iat[1,1]``
In [69]: df1.iloc[1, 1]
Out[69]: -0.1549507744249032

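For getting a cross section using an integer position (for this frame, equivalent to df1.xs(2), because the row at position 1 has the label 2):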

In [70]: df1.iloc[1]
Out[70]: 
0    0.403310
2   -0.154951
4    0.301624
6   -2.179861
Name: 2, dtype: float64

Out-of-range slice indexes are handled gracefully, just as in Python / NumPy.

# these are allowed in python/numpy.
In [71]: x = list('abcdef')

In [72]: x
Out[72]: ['a', 'b', 'c', 'd', 'e', 'f']

In [73]: x[4:10]
Out[73]: ['e', 'f']

In [74]: x[8:10]
Out[74]: []

In [75]: s = pd.Series(x)

In [76]: s
Out[76]: 
0    a
1    b
2    c
3    d
4    e
5    f
dtype: object

In [77]: s.iloc[4:10]
Out[77]: 
4    e
5    f
dtype: object

In [78]: s.iloc[8:10]
Out[78]: Series([], dtype: object)

Note that using slices that go out of bounds can result in an empty axis (e.g. an empty DataFrame being returned).

In [79]: dfl = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))

In [80]: dfl
Out[80]: 
          A         B
0 -0.082240 -2.182937
1  0.380396  0.084844
2  0.432390  1.519970
3 -0.493662  0.600178
4  0.274230  0.132885

In [81]: dfl.iloc[:, 2:3]
Out[81]: 
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]

In [82]: dfl.iloc[:, 1:3]
Out[82]: 
          B
0 -2.182937
1  0.084844
2  1.519970
3  0.600178
4  0.132885

In [83]: dfl.iloc[4:6]
Out[83]: 
         A         B
4  0.27423  0.132885

A single indexer that is out of bounds will raise an IndexError. A list of indexers where any element is out of bounds will also raise an IndexError.

>>> dfl.iloc[[4, 5, 6]]
IndexError: positional indexers are out-of-bounds

>>> dfl.iloc[:, 4]
IndexError: single positional indexer is out-of-bounds
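
The contrast between tolerant slices and strict scalar or list indexers can be sketched as follows. The frame uses made-up integer data, and `raises_index_error` is a hypothetical helper written only for this illustration.

```python
import numpy as np
import pandas as pd

dfl = pd.DataFrame(np.arange(10).reshape(5, 2), columns=list("AB"))

tail = dfl.iloc[4:6]       # slice is clipped to the rows that exist
empty = dfl.iloc[:, 2:3]   # slice entirely out of range: no columns left


def raises_index_error(indexer):
    """Return True if dfl.iloc[indexer] raises IndexError."""
    try:
        dfl.iloc[indexer]
        return False
    except IndexError:
        return True


list_fails = raises_index_error([4, 5, 6])             # like dfl.iloc[[4, 5, 6]]
scalar_fails = raises_index_error((slice(None), 4))    # like dfl.iloc[:, 4]

print(tail.shape, empty.shape, list_fails, scalar_fails)
```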

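#Selection by callable

New in version 0.18.1.

.loc, .iloc, and also [] indexing can accept a callable as indexer. The callable must be a function with one argument (the calling Series or DataFrame) that returns valid output for indexing.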

In [84]: df1 = pd.DataFrame(np.random.randn(6, 4),
   ....:                    index=list('abcdef'),
   ....:                    columns=list('ABCD'))
   ....: 

In [85]: df1
Out[85]: 
          A         B         C         D
a -0.023688  2.410179  1.450520  0.206053
b -0.251905 -2.213588  1.063327  1.266143
c  0.299368 -0.863838  0.408204 -1.048089
d -0.025747 -0.988387  0.094055  1.262731
e  1.289997  0.082423 -0.055758  0.536580
f -0.489682  0.369374 -0.034571 -2.484478

In [86]: df1.loc[lambda df: df.A > 0, :]
Out[86]: 
          A         B         C         D
c  0.299368 -0.863838  0.408204 -1.048089
e  1.289997  0.082423 -0.055758  0.536580

In [87]: df1.loc[:, lambda df: ['A', 'B']]
Out[87]: 
          A         B
a -0.023688  2.410179
b -0.251905 -2.213588
c  0.299368 -0.863838
d -0.025747 -0.988387
e  1.289997  0.082423
f -0.489682  0.369374

In [88]: df1.iloc[:, lambda df: [0, 1]]
Out[88]: 
          A         B
a -0.023688  2.410179
b -0.251905 -2.213588
c  0.299368 -0.863838
d -0.025747 -0.988387
e  1.289997  0.082423
f -0.489682  0.369374

In [89]: df1[lambda df: df.columns[0]]
Out[89]: 
a   -0.023688
b   -0.251905
c    0.299368
d   -0.025747
e    1.289997
f   -0.489682
Name: A, dtype: float64

You can use callable indexing in Series.

In [90]: df1.A.loc[lambda s: s > 0]
Out[90]: 
c    0.299368
e    1.289997
Name: A, dtype: float64

Using these methods / indexers, you can chain data selection operations without using a temporary variable.

In [91]: bb = pd.read_csv('data/baseball.csv', index_col='id')

In [92]: (bb.groupby(['year', 'team']).sum()
   ....:    .loc[lambda df: df.r > 100])
   ....: 
Out[92]: 
           stint    g    ab    r    h  X2b  X3b  hr    rbi    sb   cs   bb     so   ibb   hbp    sh    sf  gidp
year team                                                                                                      
2007 CIN       6  379   745  101  203   35    2  36  125.0  10.0  1.0  105  127.0  14.0   1.0   1.0  15.0  18.0
     DET       5  301  1062  162  283   54    4  37  144.0  24.0  7.0   97  176.0   3.0  10.0   4.0   8.0  28.0
     HOU       4  311   926  109  218   47    6  14   77.0  10.0  4.0   60  212.0   3.0   9.0  16.0   6.0  17.0
     LAN      11  413  1021  153  293   61    3  36  154.0   7.0  5.0  114  141.0   8.0   9.0   3.0   8.0  29.0
     NYN      13  622  1854  240  509  101    3  61  243.0  22.0  4.0  174  310.0  24.0  23.0  18.0  15.0  48.0
     SFN       5  482  1305  198  337   67    6  40  171.0  26.0  7.0  235  188.0  51.0   8.0  16.0   6.0  41.0
     TEX       2  198   729  115  200   40    4  28  115.0  21.0  4.0   73  140.0   4.0   5.0   2.0   8.0  16.0
     TOR       4  459  1408  187  378   96    2  58  223.0   4.0  2.0  190  265.0  16.0  12.0   4.0  16.0  38.0
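
Since data/baseball.csv is not included with this page, the same chaining pattern can be tried on a small in-memory frame. The rows below are invented stand-ins, not values from the baseball data set.

```python
import pandas as pd

# Hypothetical stand-in for the baseball data: runs ("r") per player stint.
bb = pd.DataFrame({
    "year": [2007, 2007, 2007, 2007],
    "team": ["CIN", "CIN", "DET", "HOU"],
    "r": [60, 50, 90, 120],
})

# The lambda receives the grouped-and-summed frame, so no temporary
# variable is needed to refer to the intermediate result.
result = (
    bb.groupby(["year", "team"])
      .sum()
      .loc[lambda df: df["r"] > 100]
)

print(result)
```

Here CIN sums to 110 and HOU to 120, so both pass the filter, while DET (90) is dropped.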

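#IX indexer is deprecated

Warning

Starting in 0.20.0, the .ix indexer is deprecated, in favor of the more strict .iloc and .loc indexers.

.ix offers a lot of magic on the inference of what the user wants to do. To wit, .ix can decide to index positionally or via labels depending on the data type of the index. This has caused quite a bit of user confusion over the years.

The recommended methods of indexing are:

- .loc if you want to index by label.
- .iloc if you want to index by position.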

In [93]: dfd = pd.DataFrame({'A': [1, 2, 3],
   ....:                     'B': [4, 5, 6]},
   ....:                    index=list('abc'))
   ....: 

In [94]: dfd
Out[94]: 
   A  B
a  1  4
b  2  5
c  3  6

Previous behavior, where you wish to get the 0th and the 2nd elements from the index in the 'A' column.

In [3]: dfd.ix[[0, 2], 'A']
Out[3]:
a    1
c    3
Name: A, dtype: int64

Using .loc. Here we select the appropriate labels from the index, then use label indexing.

In [95]: dfd.loc[dfd.index[[0, 2]], 'A']
Out[95]: 
a    1
c    3
Name: A, dtype: int64

This can also be expressed using .iloc, by explicitly getting locations on the indexers and using positional indexing to select things.

In [96]: dfd.iloc[[0, 2], dfd.columns.get_loc('A')]
Out[96]: 
a    1
c    3
Name: A, dtype: int64

For getting multiple indexers, use .get_indexer:

In [97]: dfd.iloc[[0, 2], dfd.columns.get_indexer(['A', 'B'])]
Out[97]: 
   A  B
a  1  4
c  3  6

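#Indexing with list with missing labels is deprecated

Warning

Starting in 0.21.0, using .loc or [] with a list with one or more missing labels is deprecated in favor of .reindex.

In prior versions, using .loc[list-of-labels] would work as long as at least one of the keys was found (otherwise it would raise a KeyError). This behavior is deprecated and will show a warning message pointing to this section. The recommended alternative is to use .reindex().

For example: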

In [98]: s = pd.Series([1, 2, 3])

In [99]: s
Out[99]: 
0    1
1    2
2    3
dtype: int64

Selection with all keys found is unchanged.

In [100]: s.loc[[1, 2]]
Out[100]: 
1    2
2    3
dtype: int64

Previous behavior

In [4]: s.loc[[1, 2, 3]]
Out[4]:
1    2.0
2    3.0
3    NaN
dtype: float64

Current behavior

In [4]: s.loc[[1, 2, 3]]
Passing list-likes to .loc with any non-matching elements will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike

Out[4]:
1    2.0
2    3.0
3    NaN
dtype: float64

#Reindexing

The idiomatic way to select potentially not-found elements is via .reindex(). See also the section on reindexing.

In [101]: s.reindex([1, 2, 3])
Out[101]: 
1    2.0
2    3.0
3    NaN
dtype: float64

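Alternatively, if you want to select only valid keys, the following is idiomatic and efficient; it is guaranteed to preserve the dtype of the selection.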

In [102]: labels = [1, 2, 3]

In [103]: s.loc[s.index.intersection(labels)]
Out[103]: 
1    2
2    3
dtype: int64

Having a duplicated index will raise an error for .reindex():

In [104]: s = pd.Series(np.arange(4), index=['a', 'a', 'b', 'c'])

In [105]: labels = ['c', 'd']

In [17]: s.reindex(labels)
ValueError: cannot reindex from a duplicate axis

Generally, you can intersect the desired labels with the current axis, and then reindex.

In [106]: s.loc[s.index.intersection(labels)].reindex(labels)
Out[106]: 
c    3.0
d    NaN
dtype: float64

However, this will still raise if the resulting index contains duplicates.

In [41]: labels = ['a', 'd']

In [42]: s.loc[s.index.intersection(labels)].reindex(labels)
ValueError: cannot reindex from a duplicate axis

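#Selecting random samples

Use the sample() method to make a random selection of rows or columns from a Series or DataFrame. By default, the method samples rows, and it accepts either a specific number of rows/columns to return or a fraction of rows.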

In [107]: s = pd.Series([0, 1, 2, 3, 4, 5])

# When no arguments are passed, returns 1 row.
In [108]: s.sample()
Out[108]: 
4    4
dtype: int64

# One may specify either a number of rows:
In [109]: s.sample(n=3)
Out[109]: 
0    0
4    4
1    1
dtype: int64

# Or a fraction of the rows:
In [110]: s.sample(frac=0.5)
Out[110]: 
5    5
3    3
1    1
dtype: int64

By default, sample will return each row at most once, but you can also sample with replacement using the replace option:

In [111]: s = pd.Series([0, 1, 2, 3, 4, 5])

# Without replacement (default):
In [112]: s.sample(n=6, replace=False)
Out[112]: 
0    0
1    1
5    5
3    3
2    2
4    4
dtype: int64

# With replacement:
In [113]: s.sample(n=6, replace=True)
Out[113]: 
0    0
4    4
3    3
2    2
4    4
4    4
dtype: int64

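By default, each row has an equal probability of being selected, but if you want rows to have different probabilities, you can pass sampling weights to the sample function as weights. These weights can be a list, a NumPy array, or a Series, but they must be of the same length as the object you are sampling. Missing values will be treated as a weight of zero, and inf values are not allowed. If the weights do not sum to 1, they will be re-normalized by dividing all weights by the sum of the weights. For example: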

In [114]: s = pd.Series([0, 1, 2, 3, 4, 5])

In [115]: example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]

In [116]: s.sample(n=3, weights=example_weights)
Out[116]: 
5    5
4    4
3    3
dtype: int64

# Weights will be re-normalized automatically
In [117]: example_weights2 = [0.5, 0, 0, 0, 0, 0]

In [118]: s.sample(n=1, weights=example_weights2)
Out[118]: 
0    0
dtype: int64

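When applied to a DataFrame, you can use a column of the DataFrame as sampling weights (provided you are sampling rows and not columns) by simply passing the name of the column as a string.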

In [119]: df2 = pd.DataFrame({'col1': [9, 8, 7, 6],
   .....:                     'weight_column': [0.5, 0.4, 0.1, 0]})
   .....: 

In [120]: df2.sample(n=3, weights='weight_column')
Out[120]: 
   col1  weight_column
1     8            0.4
0     9            0.5
2     7            0.1

sample also allows users to sample columns instead of rows using the axis argument.

In [121]: df3 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})

In [122]: df3.sample(n=1, axis=1)
Out[122]: 
   col1
0     1
1     2
2     3

Finally, you can also set a seed for sample's random number generator using the random_state argument, which accepts either an integer (as a seed) or a NumPy RandomState object.

In [123]: df4 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})

# With a given seed, the sample will always draw the same rows.
In [124]: df4.sample(n=2, random_state=2)
Out[124]: 
   col1  col2
2     3     4
1     2     3

In [125]: df4.sample(n=2, random_state=2)
Out[125]: 
   col1  col2
2     3     4
1     2     3
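
Weights and seeding can be combined. The sketch below reuses the example weights from earlier in this section; the seed value 42 is an arbitrary choice for illustration.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5])
weights = [0, 0, 0.2, 0.2, 0.2, 0.4]

# The same seed gives the same draw on every call.
first = s.sample(n=3, weights=weights, random_state=42)
second = s.sample(n=3, weights=weights, random_state=42)

# Rows with weight zero (values 0 and 1) can never be selected.
never_zero_weight = bool((first >= 2).all())

print(first.equals(second), never_zero_weight)
```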

#Setting with enlargement

The .loc/[] operations can perform enlargement when setting a non-existent key for that axis.

In the Series case this is effectively an appending operation.

In [126]: se = pd.Series([1, 2, 3])

In [127]: se
Out[127]: 
0    1
1    2
2    3
dtype: int64

In [128]: se[5] = 5.

In [129]: se
Out[129]: 
0    1.0
1    2.0
2    3.0
5    5.0
dtype: float64

A DataFrame can be enlarged on either axis via .loc.

In [130]: dfi = pd.DataFrame(np.arange(6).reshape(3, 2),
   .....:                    columns=['A', 'B'])
   .....: 

In [131]: dfi
Out[131]: 
   A  B
0  0  1
1  2  3
2  4  5

In [132]: dfi.loc[:, 'C'] = dfi.loc[:, 'A']

In [133]: dfi
Out[133]: 
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4

This is like an append operation on the DataFrame.

In [134]: dfi.loc[3] = 5

In [135]: dfi
Out[135]: 
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
3  5  5  5

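#Fast scalar value getting and setting

Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you are asking for. If you only want to access a scalar value, the fastest way is to use the at and iat methods, which are implemented on all of the data structures.

Similarly to loc, at provides label-based scalar lookups, while iat provides integer-based lookups analogously to iloc.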

In [136]: s.iat[5]
Out[136]: 5

In [137]: df.at[dates[5], 'A']
Out[137]: -0.6736897080883706

In [138]: df.iat[3, 0]
Out[138]: 0.7215551622443669

You can also set using these same indexers.

In [139]: df.at[dates[5], 'E'] = 7

In [140]: df.iat[3, 0] = 7

at may enlarge the object in place, as above, if the indexer is missing.

In [141]: df.at[dates[-1] + pd.Timedelta('1 day'), 0] = 7

In [142]: df
Out[142]: 
                   A         B         C         D    E    0
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632  NaN  NaN
2000-01-02  1.212112 -0.173215  0.119209 -1.044236  NaN  NaN
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804  NaN  NaN
2000-01-04  7.000000 -0.706771 -1.039575  0.271860  NaN  NaN
2000-01-05 -0.424972  0.567020  0.276232 -1.087401  NaN  NaN
2000-01-06 -0.673690  0.113648 -1.478427  0.524988  7.0  NaN
2000-01-07  0.404705  0.577046 -1.715002 -1.039268  NaN  NaN
2000-01-08 -0.370647 -1.157892 -1.344312  0.844885  NaN  NaN
2000-01-09       NaN       NaN       NaN       NaN  NaN  7.0

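#Boolean indexing

Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped using parentheses, since by default Python will evaluate an expression such as df.A > 2 & df.B < 3 as df.A > (2 & df.B) < 3, while the desired evaluation order is (df.A > 2) & (df.B < 3).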
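
A small sketch of what goes wrong without the parentheses. The frame and its integer columns are made up for illustration; with these values, the unparenthesized form turns into a chained comparison that asks for the truth value of a whole Series and fails.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 3, 5], "B": [1, 2, 4]})

# Parenthesized: two element-wise comparisons combined element-wise.
mask = (df["A"] > 2) & (df["B"] < 3)

# Unparenthesized: parsed as df["A"] > (2 & df["B"]) < 3, a chained
# comparison that needs bool(Series) and therefore raises ValueError.
try:
    df["A"] > 2 & df["B"] < 3
    error = None
except ValueError as exc:
    error = exc

print(mask.tolist(), type(error).__name__)
```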

Using a boolean vector to index a Series works exactly as in a NumPy ndarray:

In [143]: s = pd.Series(range(-3, 4))

In [144]: s
Out[144]: 
0   -3
1   -2
2   -1
3    0
4    1
5    2
6    3
dtype: int64

In [145]: s[s > 0]
Out[145]: 
4    1
5    2
6    3
dtype: int64

In [146]: s[(s < -1) | (s > 0.5)]
Out[146]: 
0   -3
1   -2
4    1
5    2
6    3
dtype: int64

In [147]: s[~(s < 0)]
Out[147]: 
3    0
4    1
5    2
6    3
dtype: int64

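You may select rows from a DataFrame using a boolean vector of the same length as the DataFrame's index (for example, something derived from one of the columns of the DataFrame):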

In [148]: df[df['A'] > 0]
Out[148]: 
                   A         B         C         D   E   0
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632 NaN NaN
2000-01-02  1.212112 -0.173215  0.119209 -1.044236 NaN NaN
2000-01-04  7.000000 -0.706771 -1.039575  0.271860 NaN NaN
2000-01-07  0.404705  0.577046 -1.715002 -1.039268 NaN NaN

List comprehensions and the map method of Series can also be used to produce more complex criteria:

In [149]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
   .....:                     'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
   .....:                     'c': np.random.randn(7)})
   .....: 

# only want 'two' or 'three'
In [150]: criterion = df2['a'].map(lambda x: x.startswith('t'))

In [151]: df2[criterion]
Out[151]: 
       a  b         c
2    two  y  0.041290
3  three  x  0.361719
4    two  y -0.238075

# equivalent but slower
In [152]: df2[[x.startswith('t') for x in df2['a']]]
Out[152]: 
       a  b         c
2    two  y  0.041290
3  three  x  0.361719
4    two  y -0.238075

# Multiple criteria
In [153]: df2[criterion & (df2['b'] == 'x')]
Out[153]: 
       a  b         c
3  three  x  0.361719

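With the selection methods Selection by label, Selection by position, and Advanced Indexing, you may select along more than one axis using boolean vectors combined with other indexing expressions.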

In [154]: df2.loc[criterion & (df2['b'] == 'x'), 'b':'c']
Out[154]: 
   b         c
3  x  0.361719

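#Indexing with isin

Consider the isin() method of Series, which returns a boolean vector that is true wherever the Series elements exist in the passed list. This allows you to select rows where one or more columns have the values you want: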

In [155]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')

In [156]: s
Out[156]: 
4    0
3    1
2    2
1    3
0    4
dtype: int64

In [157]: s.isin([2, 4, 6])
Out[157]: 
4    False
3    False
2     True
1    False
0     True
dtype: bool

In [158]: s[s.isin([2, 4, 6])]
Out[158]: 
2    2
0    4
dtype: int64

The same method is available for Index objects and is useful for cases when you don't know which of the sought labels are in fact present:

In [159]: s[s.index.isin([2, 4, 6])]
Out[159]: 
4    0
2    2
dtype: int64

# compare it to the following
In [160]: s.reindex([2, 4, 6])
Out[160]: 
2    2.0
4    0.0
6    NaN
dtype: float64

In addition to that, MultiIndex allows selecting a separate level to use in the membership check:

In [161]: s_mi = pd.Series(np.arange(6),
   .....:                  index=pd.MultiIndex.from_product([[0, 1], ['a', 'b', 'c']]))
   .....: 

In [162]: s_mi
Out[162]: 
0  a    0
   b    1
   c    2
1  a    3
   b    4
   c    5
dtype: int64

In [163]: s_mi.iloc[s_mi.index.isin([(1, 'a'), (2, 'b'), (0, 'c')])]
Out[163]: 
0  c    2
1  a    3
dtype: int64

In [164]: s_mi.iloc[s_mi.index.isin(['a', 'c', 'e'], level=1)]
Out[164]: 
0  a    0
   c    2
1  a    3
   c    5
dtype: int64

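DataFrame also has an isin() method. When calling isin, pass a set of values as either an array or a dict. If values is an array, isin returns a DataFrame of booleans that is the same shape as the original DataFrame, with True wherever the element is in the sequence of values.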

In [165]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],
   .....:                    'ids2': ['a', 'n', 'c', 'n']})
   .....: 

In [166]: values = ['a', 'b', 1, 3]

In [167]: df.isin(values)
Out[167]: 
    vals    ids   ids2
0   True   True   True
1  False   True  False
2   True  False  False
3  False  False  False

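Oftentimes you will want to match certain values with certain columns. Just make values a dict where the key is the column and the value is a list of items you want to check for.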

In [168]: values = {'ids': ['a', 'b'], 'vals': [1, 3]}

In [169]: df.isin(values)
Out[169]: 
    vals    ids   ids2
0   True   True  False
1  False   True  False
2   True  False  False
3  False  False  False

Combine DataFrame's isin with the any() and all() methods to quickly select subsets of your data that meet a given criterion. To select the rows where each column meets its own criterion:

In [170]: values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}

In [171]: row_mask = df.isin(values).all(1)

In [172]: df[row_mask]
Out[172]: 
   vals ids ids2
0     1   a    a
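
For comparison, any() keeps a row as soon as one column matches. This sketch rebuilds the frame and the values dict from the example above; `all_rows` and `any_rows` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"vals": [1, 2, 3, 4],
                   "ids": ["a", "b", "f", "n"],
                   "ids2": ["a", "n", "c", "n"]})
values = {"ids": ["a", "b"], "ids2": ["a", "c"], "vals": [1, 3]}

matches = df.isin(values)             # per-cell membership test

all_rows = df[matches.all(axis=1)]    # every column matches
any_rows = df[matches.any(axis=1)]    # at least one column matches

print(all_rows.index.tolist(), any_rows.index.tolist())
```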

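#The `where()` method and masking

Selecting values from a Series with a boolean vector generally returns a subset of the data. To guarantee that the selection output has the same shape as the original data, you can use the where method in Series and DataFrame.

To return only the selected rows: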

In [173]: s[s > 0]
Out[173]: 
3    1
2    2
1    3
0    4
dtype: int64

To return a Series of the same shape as the original:

In [174]: s.where(s > 0)
Out[174]: 
4    NaN
3    1.0
2    2.0
1    3.0
0    4.0
dtype: float64

Selecting values from a DataFrame with a boolean criterion now also preserves the shape of the input data. where is used under the hood as the implementation. The code below is equivalent to df.where(df < 0).

In [175]: df[df < 0]
Out[175]: 
                   A         B         C         D
2000-01-01 -2.104139 -1.309525       NaN       NaN
2000-01-02 -0.352480       NaN -1.192319       NaN
2000-01-03 -0.864883       NaN -0.227870       NaN
2000-01-04       NaN -1.222082       NaN -1.233203
2000-01-05       NaN -0.605656 -1.169184       NaN
2000-01-06       NaN -0.948458       NaN -0.684718
2000-01-07 -2.670153 -0.114722       NaN -0.048048
2000-01-08       NaN       NaN -0.048788 -0.808838

In addition, where takes an optional other argument that replaces the values where the condition is False in the returned copy.

In [176]: df.where(df < 0, -df)
Out[176]: 
                   A         B         C         D
2000-01-01 -2.104139 -1.309525 -0.485855 -0.245166
2000-01-02 -0.352480 -0.390389 -1.192319 -1.655824
2000-01-03 -0.864883 -0.299674 -0.227870 -0.281059
2000-01-04 -0.846958 -1.222082 -0.600705 -1.233203
2000-01-05 -0.669692 -0.605656 -1.169184 -0.342416
2000-01-06 -0.868584 -0.948458 -2.297780 -0.684718
2000-01-07 -2.670153 -0.114722 -0.168904 -0.048048
2000-01-08 -0.801196 -1.392071 -0.048788 -0.808838

You may wish to set values based on some boolean criteria. This can be done intuitively like so:

In [177]: s2 = s.copy()

In [178]: s2[s2 < 0] = 0

In [179]: s2
Out[179]: 
4    0
3    1
2    2
1    3
0    4
dtype: int64

In [180]: df2 = df.copy()

In [181]: df2[df2 < 0] = 0

In [182]: df2
Out[182]: 
                   A         B         C         D
2000-01-01  0.000000  0.000000  0.485855  0.245166
2000-01-02  0.000000  0.390389  0.000000  1.655824
2000-01-03  0.000000  0.299674  0.000000  0.281059
2000-01-04  0.846958  0.000000  0.600705  0.000000
2000-01-05  0.669692  0.000000  0.000000  0.342416
2000-01-06  0.868584  0.000000  2.297780  0.000000
2000-01-07  0.000000  0.000000  0.168904  0.000000
2000-01-08  0.801196  1.392071  0.000000  0.000000

By default, where returns a modified copy of the data. There is an optional parameter inplace so that the original data can be modified without creating a copy:

In [183]: df_orig = df.copy()

In [184]: df_orig.where(df > 0, -df, inplace=True)

In [185]: df_orig
Out[185]: 
                   A         B         C         D
2000-01-01  2.104139  1.309525  0.485855  0.245166
2000-01-02  0.352480  0.390389  1.192319  1.655824
2000-01-03  0.864883  0.299674  0.227870  0.281059
2000-01-04  0.846958  1.222082  0.600705  1.233203
2000-01-05  0.669692  0.605656  1.169184  0.342416
2000-01-06  0.868584  0.948458  2.297780  0.684718
2000-01-07  2.670153  0.114722  0.168904  0.048048
2000-01-08  0.801196  1.392071  0.048788  0.808838

Note

The signature of DataFrame.where() differs from numpy.where(). Roughly, df1.where(m, df2) is equivalent to np.where(m, df1, df2).

In [186]: df.where(df < 0, -df) == np.where(df < 0, df, -df)
Out[186]: 
               A     B     C     D
2000-01-01  True  True  True  True
2000-01-02  True  True  True  True
2000-01-03  True  True  True  True
2000-01-04  True  True  True  True
2000-01-05  True  True  True  True
2000-01-06  True  True  True  True
2000-01-07  True  True  True  True
2000-01-08  True  True  True  True
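
The difference in argument order is easy to get wrong, so here is a small sketch on a fixed frame. It assumes nothing beyond the two functions named above; df1, df2, m, kept and raw are illustrative names. Note that np.where returns a plain ndarray, while DataFrame.where keeps the labels.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [-1.0, 2.0, -3.0], 'B': [4.0, -5.0, 6.0]})
df2 = -df1
m = df1 < 0

# pandas: the condition comes first, the replacement second.
kept = df1.where(m, df2)

# NumPy: condition, value-if-true, value-if-false.
raw = np.where(m, df1, df2)

print(kept)
print(type(raw))
```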

Alignment

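Furthermore, where aligns the input boolean condition (ndarray or DataFrame), so that partial selection with setting is possible. This is analogous to partial setting via .loc (but on the contents rather than the axis labels).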

In [187]: df2 = df.copy()

In [188]: df2[df2[1:4] > 0] = 3

In [189]: df2
Out[189]: 
                   A         B         C         D
2000-01-01 -2.104139 -1.309525  0.485855  0.245166
2000-01-02 -0.352480  3.000000 -1.192319  3.000000
2000-01-03 -0.864883  3.000000 -0.227870  3.000000
2000-01-04  3.000000 -1.222082  3.000000 -1.233203
2000-01-05  0.669692 -0.605656 -1.169184  0.342416
2000-01-06  0.868584 -0.948458  2.297780 -0.684718
2000-01-07 -2.670153 -0.114722  0.168904 -0.048048
2000-01-08  0.801196  1.392071 -0.048788 -0.808838

where can also accept axis and level parameters to align the input when performing the where.

In [190]: df2 = df.copy()

In [191]: df2.where(df2 > 0, df2['A'], axis='index')
Out[191]: 
                   A         B         C         D
2000-01-01 -2.104139 -2.104139  0.485855  0.245166
2000-01-02 -0.352480  0.390389 -0.352480  1.655824
2000-01-03 -0.864883  0.299674 -0.864883  0.281059
2000-01-04  0.846958  0.846958  0.600705  0.846958
2000-01-05  0.669692  0.669692  0.669692  0.342416
2000-01-06  0.868584  0.868584  2.297780  0.868584
2000-01-07 -2.670153 -2.670153  0.168904 -2.670153
2000-01-08  0.801196  1.392071  0.801196  0.801196

This is equivalent to (but faster than) the following.

In [192]: df2 = df.copy()

In [193]: df.apply(lambda x, y: x.where(x > 0, y), y=df['A'])
Out[193]: 
                   A         B         C         D
2000-01-01 -2.104139 -2.104139  0.485855  0.245166
2000-01-02 -0.352480  0.390389 -0.352480  1.655824
2000-01-03 -0.864883  0.299674 -0.864883  0.281059
2000-01-04  0.846958  0.846958  0.600705  0.846958
2000-01-05  0.669692  0.669692  0.669692  0.342416
2000-01-06  0.868584  0.868584  2.297780  0.868584
2000-01-07 -2.670153 -2.670153  0.168904 -2.670153
2000-01-08  0.801196  1.392071  0.801196  0.801196

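New in version 0.18.1.

where can accept a callable as the condition and as the other argument. The function must take one argument (the calling Series or DataFrame) and return valid output for the condition or other argument.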

In [194]: df3 = pd.DataFrame({'A': [1, 2, 3],
   .....:                     'B': [4, 5, 6],
   .....:                     'C': [7, 8, 9]})
   .....: 

In [195]: df3.where(lambda x: x > 4, lambda x: x + 10)
Out[195]: 
    A   B  C
0  11  14  7
1  12   5  8
2  13   6  9

#Mask

`mask()` is the inverse boolean operation of `where`.

In [196]: s.mask(s >= 0)
Out[196]: 
4   NaN
3   NaN
2   NaN
1   NaN
0   NaN
dtype: float64

In [197]: df.mask(df >= 0)
Out[197]: 
                   A         B         C         D
2000-01-01 -2.104139 -1.309525       NaN       NaN
2000-01-02 -0.352480       NaN -1.192319       NaN
2000-01-03 -0.864883       NaN -0.227870       NaN
2000-01-04       NaN -1.222082       NaN -1.233203
2000-01-05       NaN -0.605656 -1.169184       NaN
2000-01-06       NaN -0.948458       NaN -0.684718
2000-01-07 -2.670153 -0.114722       NaN -0.048048
2000-01-08       NaN       NaN -0.048788 -0.808838
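
To make the "inverse of where" relationship concrete, the sketch below checks on a small fixed Series that masking on a condition gives the same result as where on the negated condition. The names masked and kept are illustrative.

```python
import pandas as pd

s = pd.Series([-2, -1, 0, 1, 2])

# mask() hides the entries where the condition is True ...
masked = s.mask(s >= 0)

# ... which matches where() applied to the negated condition.
kept = s.where(~(s >= 0))

# Like where(), mask() accepts a replacement value instead of NaN.
clipped = s.mask(s < 0, 0)

print(masked)
print(clipped)
```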

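#The `query()` method

DataFrame objects have a query() method that allows selection using an expression.

You can get the rows of the frame where column b has values between the values of columns a and c. For example: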

In [198]: n = 10

In [199]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

In [200]: df
Out[200]: 
          a         b         c
0  0.438921  0.118680  0.863670
1  0.138138  0.577363  0.686602
2  0.595307  0.564592  0.520630
3  0.913052  0.926075  0.616184
4  0.078718  0.854477  0.898725
5  0.076404  0.523211  0.591538
6  0.792342  0.216974  0.564056
7  0.397890  0.454131  0.915716
8  0.074315  0.437913  0.019794
9  0.559209  0.502065  0.026437

# pure python
In [201]: df[(df.a < df.b) & (df.b < df.c)]
Out[201]: 
          a         b         c
1  0.138138  0.577363  0.686602
4  0.078718  0.854477  0.898725
5  0.076404  0.523211  0.591538
7  0.397890  0.454131  0.915716

# query
In [202]: df.query('(a < b) & (b < c)')
Out[202]: 
          a         b         c
1  0.138138  0.577363  0.686602
4  0.078718  0.854477  0.898725
5  0.076404  0.523211  0.591538
7  0.397890  0.454131  0.915716

Do the same thing, but fall back on a named index if there is no column with the name a.

In [203]: df = pd.DataFrame(np.random.randint(n / 2, size=(n, 2)), columns=list('bc'))

In [204]: df.index.name = 'a'

In [205]: df
Out[205]: 
   b  c
a      
0  0  4
1  0  1
2  3  4
3  4  3
4  1  4
5  0  3
6  0  1
7  3  4
8  2  3
9  1  1

In [206]: df.query('a < b and b < c')
Out[206]: 
   b  c
a      
2  3  4

If you do not want to, or cannot, name your index, you can use the name index in your query expression:

In [207]: df = pd.DataFrame(np.random.randint(n, size=(n, 2)), columns=list('bc'))

In [208]: df
Out[208]: 
   b  c
0  3  1
1  3  0
2  5  6
3  5  2
4  7  4
5  0  1
6  2  5
7  0  1
8  6  0
9  7  9

In [209]: df.query('index < b < c')
Out[209]: 
   b  c
2  5  6

Note

If the name of your index overlaps with a column name, the column name is given precedence. For example,

In [210]: df = pd.DataFrame({'a': np.random.randint(5, size=5)})

In [211]: df.index.name = 'a'

In [212]: df.query('a > 2')  # uses the column 'a', not the index
Out[212]: 
   a
a   
1  3
3  3

You can still use the index in a query expression by using the special identifier 'index':

In [213]: df.query('index > 2')
Out[213]: 
   a
a   
3  3
4  2

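If for some reason you have a column named index, you can refer to the index as ilevel_0 as well, but at this point you should consider renaming your columns to something less ambiguous.

#`MultiIndex` `query()` syntax

You can also use the levels of a DataFrame with a MultiIndex as if they were columns in the frame: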

In [214]: n = 10

In [215]: colors = np.random.choice(['red', 'green'], size=n)

In [216]: foods = np.random.choice(['eggs', 'ham'], size=n)

In [217]: colors
Out[217]: 
array(['red', 'red', 'red', 'green', 'green', 'green', 'green', 'green',
       'green', 'green'], dtype='<U5')

In [218]: foods
Out[218]: 
array(['ham', 'ham', 'eggs', 'eggs', 'eggs', 'ham', 'ham', 'eggs', 'eggs',
       'eggs'], dtype='<U4')

In [219]: index = pd.MultiIndex.from_arrays([colors, foods], names=['color', 'food'])

In [220]: df = pd.DataFrame(np.random.randn(n, 2), index=index)

In [221]: df
Out[221]: 
                   0         1
color food                    
red   ham   0.194889 -0.381994
      ham   0.318587  2.089075
      eggs -0.728293 -0.090255
green eggs -0.748199  1.318931
      eggs -2.029766  0.792652
      ham   0.461007 -0.542749
      ham  -0.305384 -0.479195
      eggs  0.095031 -0.270099
      eggs -0.707140 -0.773882
      eggs  0.229453  0.304418

In [222]: df.query('color == "red"')
Out[222]: 
                   0         1
color food                    
red   ham   0.194889 -0.381994
      ham   0.318587  2.089075
      eggs -0.728293 -0.090255

If the levels of the MultiIndex are unnamed, you can refer to them using special names:

In [223]: df.index.names = [None, None]

In [224]: df
Out[224]: 
                   0         1
red   ham   0.194889 -0.381994
      ham   0.318587  2.089075
      eggs -0.728293 -0.090255
green eggs -0.748199  1.318931
      eggs -2.029766  0.792652
      ham   0.461007 -0.542749
      ham  -0.305384 -0.479195
      eggs  0.095031 -0.270099
      eggs -0.707140 -0.773882
      eggs  0.229453  0.304418

In [225]: df.query('ilevel_0 == "red"')
Out[225]: 
                 0         1
red ham   0.194889 -0.381994
    ham   0.318587  2.089075
    eggs -0.728293 -0.090255
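
The same behaviour can be reproduced on a small deterministic frame. The sketch below uses a made-up column x and three rows; it queries first by level name and then, after the names are cleared, by the ilevel_N identifiers.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['red', 'red', 'green'], ['ham', 'eggs', 'ham']],
    names=['color', 'food'])
df = pd.DataFrame({'x': [1, 2, 3]}, index=index)

# Named levels can be used like columns.
named = df.query('color == "red"')

# Without names, fall back to ilevel_0, ilevel_1, ...
df.index.names = [None, None]
unnamed = df.query('ilevel_0 == "red"')
second_level = df.query('ilevel_1 == "ham"')

print(named)
print(second_level)
```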

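The convention is ilevel_0, which means "index level 0" for the 0th level of the index.

#`query()` use cases

A use case for query() is when you have a collection of DataFrame objects that have a subset of column names (or index levels/names) in common. You can pass the same query to both frames without having to specify which frame you are interested in querying: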

In [226]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

In [227]: df
Out[227]: 
          a         b         c
0  0.224283  0.736107  0.139168
1  0.302827  0.657803  0.713897
2  0.611185  0.136624  0.984960
3  0.195246  0.123436  0.627712
4  0.618673  0.371660  0.047902
5  0.480088  0.062993  0.185760
6  0.568018  0.483467  0.445289
7  0.309040  0.274580  0.587101
8  0.258993  0.477769  0.370255
9  0.550459  0.840870  0.304611

In [228]: df2 = pd.DataFrame(np.random.rand(n + 2, 3), columns=df.columns)

In [229]: df2
Out[229]: 
           a         b         c
0   0.357579  0.229800  0.596001
1   0.309059  0.957923  0.965663
2   0.123102  0.336914  0.318616
3   0.526506  0.323321  0.860813
4   0.518736  0.486514  0.384724
5   0.190804  0.505723  0.614533
6   0.891939  0.623977  0.676639
7   0.480559  0.378528  0.460858
8   0.420223  0.136404  0.141295
9   0.732206  0.419540  0.604675
10  0.604466  0.848974  0.896165
11  0.589168  0.920046  0.732716

In [230]: expr = '0.0 <= a <= c <= 0.5'

In [231]: map(lambda frame: frame.query(expr), [df, df2])
Out[231]: <map at 0x7f65f7952d30>
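
In Python 3, map() is lazy, which is why the output above is a map object rather than the filtered frames. A minimal sketch of one way to materialize the results, using small made-up frames (the values are illustrative, not the random data shown above):

```python
import pandas as pd

df = pd.DataFrame({'a': [0.1, 0.6, 0.2], 'c': [0.3, 0.7, 0.1]})
df2 = pd.DataFrame({'a': [0.4, 0.0], 'c': [0.45, 0.9]})

expr = '0.0 <= a <= c <= 0.5'

# A list comprehension evaluates the query on each frame immediately.
results = [frame.query(expr) for frame in (df, df2)]

for result in results:
    print(result)
```

Wrapping the original call as list(map(...)) would work equally well; the comprehension is simply the more common spelling.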

#`query()` Python versus pandas syntax comparison

Full numpy-like syntax:

In [232]: df = pd.DataFrame(np.random.randint(n, size=(n, 3)), columns=list('abc'))

In [233]: df
Out[233]: 
   a  b  c
0  7  8  9
1  1  0  7
2  2  7  2
3  6  2  2
4  2  6  3
5  3  8  2
6  1  7  2
7  5  1  5
8  9  8  0
9  1  5  0

In [234]: df.query('(a < b) & (b < c)')
Out[234]: 
   a  b  c
0  7  8  9

In [235]: df[(df.a < df.b) & (df.b < df.c)]
Out[235]: 
   a  b  c
0  7  8  9

Slightly nicer by removing the parentheses (inside query(), comparison operators bind tighter than & and |).

In [236]: df.query('a < b & b < c')
Out[236]: 
   a  b  c
0  7  8  9

Use English instead of symbols:

In [237]: df.query('a < b and b < c')
Out[237]: 
   a  b  c
0  7  8  9

Pretty close to how you might write it on paper:

In [238]: df.query('a < b < c')
Out[238]: 
   a  b  c
0  7  8  9

#The `in` and `not in` operators

query() also supports special use of Python's in and not in comparison operators, providing a succinct syntax for calling the isin method of a Series or DataFrame.

# get all rows where columns "a" and "b" have overlapping values
In [239]: df = pd.DataFrame({'a': list('aabbccddeeff'), 'b': list('aaaabbbbcccc'),
   .....:                    'c': np.random.randint(5, size=12),
   .....:                    'd': np.random.randint(9, size=12)})
   .....: 

In [240]: df
Out[240]: 
    a  b  c  d
0   a  a  2  6
1   a  a  4  7
2   b  a  1  6
3   b  a  2  1
4   c  b  3  6
5   c  b  0  2
6   d  b  3  3
7   d  b  2  1
8   e  c  4  3
9   e  c  2  0
10  f  c  0  6
11  f  c  1  2

In [241]: df.query('a in b')
Out[241]: 
   a  b  c  d
0  a  a  2  6
1  a  a  4  7
2  b  a  1  6
3  b  a  2  1
4  c  b  3  6
5  c  b  0  2

# How you'd do it in pure Python
In [242]: df[df.a.isin(df.b)]
Out[242]: 
   a  b  c  d
0  a  a  2  6
1  a  a  4  7
2  b  a  1  6
3  b  a  2  1
4  c  b  3  6
5  c  b  0  2

In [243]: df.query('a not in b')
Out[243]: 
    a  b  c  d
6   d  b  3  3
7   d  b  2  1
8   e  c  4  3
9   e  c  2  0
10  f  c  0  6
11  f  c  1  2

# pure Python
In [244]: df[~df.a.isin(df.b)]
Out[244]: 
    a  b  c  d
6   d  b  3  3
7   d  b  2  1
8   e  c  4  3
9   e  c  2  0
10  f  c  0  6
11  f  c  1  2

You can combine this with other expressions for very succinct queries:

# rows where cols a and b have overlapping values
# and col c's values are less than col d's
In [245]: df.query('a in b and c < d')
Out[245]: 
   a  b  c  d
0  a  a  2  6
1  a  a  4  7
2  b  a  1  6
4  c  b  3  6
5  c  b  0  2

# pure Python
In [246]: df[df.a.isin(df.b) & (df.c < df.d)]
Out[246]: 
   a  b  c  d
0  a  a  2  6
1  a  a  4  7
2  b  a  1  6
4  c  b  3  6
5  c  b  0  2

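Note

Note that in and not in are evaluated in Python, since numexpr has no equivalent of this operation. However, only the in/not in expression itself is evaluated in vanilla Python. For example, in the expression

df.query('a in b + c + d')

(b + c + d) is evaluated by numexpr and then the in operation is evaluated in plain Python. In general, any operations that can be evaluated using numexpr will be.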

#Special use of the `==` operator with `list` objects

Comparing a list of values to a column using ==/!= works similarly to in/not in.

In [247]: df.query('b == ["a", "b", "c"]')
Out[247]: 
    a  b  c  d
0   a  a  2  6
1   a  a  4  7
2   b  a  1  6
3   b  a  2  1
4   c  b  3  6
5   c  b  0  2
6   d  b  3  3
7   d  b  2  1
8   e  c  4  3
9   e  c  2  0
10  f  c  0  6
11  f  c  1  2

# pure Python
In [248]: df[df.b.isin(["a", "b", "c"])]
Out[248]: 
    a  b  c  d
0   a  a  2  6
1   a  a  4  7
2   b  a  1  6
3   b  a  2  1
4   c  b  3  6
5   c  b  0  2
6   d  b  3  3
7   d  b  2  1
8   e  c  4  3
9   e  c  2  0
10  f  c  0  6
11  f  c  1  2

In [249]: df.query('c == [1, 2]')
Out[249]: 
    a  b  c  d
0   a  a  2  6
2   b  a  1  6
3   b  a  2  1
7   d  b  2  1
9   e  c  2  0
11  f  c  1  2

In [250]: df.query('c != [1, 2]')
Out[250]: 
    a  b  c  d
1   a  a  4  7
4   c  b  3  6
5   c  b  0  2
6   d  b  3  3
8   e  c  4  3
10  f  c  0  6

# using in/not in
In [251]: df.query('[1, 2] in c')
Out[251]: 
    a  b  c  d
0   a  a  2  6
2   b  a  1  6
3   b  a  2  1
7   d  b  2  1
9   e  c  2  0
11  f  c  1  2

In [252]: df.query('[1, 2] not in c')
Out[252]: 
    a  b  c  d
1   a  a  4  7
4   c  b  3  6
5   c  b  0  2
6   d  b  3  3
8   e  c  4  3
10  f  c  0  6

# pure Python
In [253]: df[df.c.isin([1, 2])]
Out[253]: 
    a  b  c  d
0   a  a  2  6
2   b  a  1  6
3   b  a  2  1
7   d  b  2  1
9   e  c  2  0
11  f  c  1  2
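
A compact check of the equivalences shown above, on a small fixed frame so the result does not depend on random data. The frame and the names via_query, via_in, via_isin and negated are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'b': list('aabbcc'), 'c': [1, 2, 3, 1, 2, 3]})

# Inside query(), == against a list acts as a membership test ...
via_query = df.query('c == [1, 2]')
# ... as does the in operator ...
via_in = df.query('c in [1, 2]')
# ... and both match isin() in plain pandas.
via_isin = df[df['c'].isin([1, 2])]

# != against a list is the negated membership test.
negated = df.query('c != [1, 2]')

print(via_query)
print(negated)
```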

#Boolean operators

You can negate boolean expressions with the word not or the ~ operator.

In [254]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

In [255]: df['bools'] = np.random.rand(len(df)) > 0.5

In [256]: df.query('~bools')
Out[256]: 
          a         b         c  bools
2  0.697753  0.212799  0.329209  False
7  0.275396  0.691034  0.826619  False
8  0.190649  0.558748  0.262467  False

In [257]: df.query('not bools')
Out[257]: 
          a         b         c  bools
2  0.697753  0.212799  0.329209  False
7  0.275396  0.691034  0.826619  False
8  0.190649  0.558748  0.262467  False

In [258]: df.query('not bools') == df[~df.bools]
Out[258]: 
      a     b     c  bools
2  True  True  True   True
7  True  True  True   True
8  True  True  True   True

Of course, expressions can be arbitrarily complex too:

# short query syntax
In [259]: shorter = df.query('a < b < c and (not bools) or bools > 2')

# equivalent in pure Python
In [260]: longer = df[(df.a < df.b) & (df.b < df.c) & (~df.bools) | (df.bools > 2)]

In [261]: shorter
Out[261]: 
          a         b         c  bools
7  0.275396  0.691034  0.826619  False

In [262]: longer
Out[262]: 
          a         b         c  bools
7  0.275396  0.691034  0.826619  False

In [263]: shorter == longer
Out[263]: 
      a     b     c  bools
7  True  True  True   True

#Performance of `query()`

DataFrame.query() using numexpr is slightly faster than Python for large frames.

query-perf

Note

You will only see the performance benefits of using the numexpr engine with DataFrame.query() if your frame has more than approximately 200,000 rows.

query-perf-small

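This plot was created using a DataFrame with 3 columns, each containing floating point values generated using numpy.random.randn().

#Duplicate data

If you want to identify and remove duplicate rows in a DataFrame, there are two methods that will help: duplicated and drop_duplicates. Each takes as an argument the columns to use to identify duplicated rows.

duplicated returns a boolean vector whose length is the number of rows and which indicates whether a row is duplicated.
drop_duplicates removes duplicate rows.

By default, the first observed row of a duplicate set is considered unique, but each method has a keep parameter to specify which rows to keep.

keep='first' (default): mark/drop duplicates except for the first occurrence.
keep='last': mark/drop duplicates except for the last occurrence.
keep=False: mark/drop all duplicates.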

In [264]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
   .....:                     'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
   .....:                     'c': np.random.randn(7)})
   .....: 

In [265]: df2
Out[265]: 
       a  b         c
0    one  x -1.067137
1    one  y  0.309500
2    two  x -0.211056
3    two  y -1.842023
4    two  x -0.390820
5  three  x -1.964475
6   four  x  1.298329

In [266]: df2.duplicated('a')
Out[266]: 
0    False
1     True
2    False
3     True
4     True
5    False
6    False
dtype: bool

In [267]: df2.duplicated('a', keep='last')
Out[267]: 
0     True
1    False
2     True
3     True
4    False
5    False
6    False
dtype: bool

In [268]: df2.duplicated('a', keep=False)
Out[268]: 
0     True
1     True
2     True
3     True
4     True
5    False
6    False
dtype: bool

In [269]: df2.drop_duplicates('a')
Out[269]: 
       a  b         c
0    one  x -1.067137
2    two  x -0.211056
5  three  x -1.964475
6   four  x  1.298329

In [270]: df2.drop_duplicates('a', keep='last')
Out[270]: 
       a  b         c
1    one  y  0.309500
4    two  x -0.390820
5  three  x -1.964475
6   four  x  1.298329

In [271]: df2.drop_duplicates('a', keep=False)
Out[271]: 
       a  b         c
5  three  x -1.964475
6   four  x  1.298329

Also, you can pass a list of columns to identify duplications.

In [272]: df2.duplicated(['a', 'b'])
Out[272]: 
0    False
1    False
2    False
3    False
4     True
5    False
6    False
dtype: bool

In [273]: df2.drop_duplicates(['a', 'b'])
Out[273]: 
       a  b         c
0    one  x -1.067137
1    one  y  0.309500
2    two  x -0.211056
3    two  y -1.842023
5  three  x -1.964475
6   four  x  1.298329
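
The three keep options are easiest to compare side by side. The sketch below uses a frame without the random column c so that every result is deterministic; the names first, last, none and pairs are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three'],
                   'b': ['x', 'y', 'x', 'y', 'x', 'x']})

first = df.drop_duplicates('a')               # keep the first row of each group
last = df.drop_duplicates('a', keep='last')   # keep the last row of each group
none = df.drop_duplicates('a', keep=False)    # drop every row that has a duplicate

# Duplicates judged on the combination of two columns.
pairs = df.duplicated(['a', 'b'])

print(first)
print(last)
print(none)
print(pairs)
```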

To drop duplicates by index value, use Index.duplicated and then perform slicing. The same set of options is available for the keep parameter.

In [274]: df3 = pd.DataFrame({'a': np.arange(6),
   .....:                     'b': np.random.randn(6)},
   .....:                    index=['a', 'a', 'b', 'c', 'b', 'a'])
   .....: 

In [275]: df3
Out[275]: 
   a         b
a  0  1.440455
a  1  2.456086
b  2  1.038402
c  3 -0.894409
b  4  0.683536
a  5  3.082764

In [276]: df3.index.duplicated()
Out[276]: array([False,  True, False, False,  True,  True])

In [277]: df3[~df3.index.duplicated()]
Out[277]: 
   a         b
a  0  1.440455
b  2  1.038402
c  3 -0.894409

In [278]: df3[~df3.index.duplicated(keep='last')]
Out[278]: 
   a         b
c  3 -0.894409
b  4  0.683536
a  5  3.082764

In [279]: df3[~df3.index.duplicated(keep=False)]
Out[279]: 
   a         b
c  3 -0.894409

#Dictionary-like `get()` method

Series and DataFrame each have a get method which can return a default value.

In [280]: s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

In [281]: s.get('a')  # equivalent to s['a']
Out[281]: 1

In [282]: s.get('x', default=-1)
Out[282]: -1
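
A short sketch of the three cases worth knowing: the key exists, the key is missing with no default (the result is None, as with dict.get), and the key is missing with an explicit default. The DataFrame part assumes a made-up single-column frame.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

present = s.get('a')                   # same value as s['a']
missing = s.get('x')                   # None instead of a KeyError
with_default = s.get('x', default=-1)

# DataFrame.get looks up columns in the same way.
df = pd.DataFrame({'A': [1, 2]})
no_column = df.get('B')
column = df.get('A')

print(present, missing, with_default, no_column)
```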

#The `lookup()` method

Sometimes you want to extract a set of values given a sequence of row labels and column labels. The lookup method allows this and returns a NumPy array. For instance:

In [283]: dflookup = pd.DataFrame(np.random.rand(20, 4), columns = ['A', 'B', 'C', 'D'])

In [284]: dflookup.lookup(list(range(0, 10, 2)), ['B', 'C', 'A', 'B', 'D'])
Out[284]: array([0.3506, 0.4779, 0.4825, 0.9197, 0.5019])
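
DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so the call above only runs on older versions. One possible replacement, sketched here on a small deterministic frame, converts the labels to positions with Index.get_indexer and then uses NumPy integer indexing. The names row_pos, col_pos and values are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['A', 'B', 'C', 'D'])

rows = [0, 1, 2]          # row labels, one per value wanted
cols = ['B', 'D', 'A']    # column labels, paired with rows

# Translate labels into integer positions on each axis.
row_pos = df.index.get_indexer(rows)
col_pos = df.columns.get_indexer(cols)

# Paired integer indexing picks one cell per (row, column) pair.
values = df.to_numpy()[row_pos, col_pos]

print(values)
```

This sketch assumes every label exists; get_indexer returns -1 for a missing label, which would silently select the last row or column.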

#Index objects

In [285]: index = pd.Index(['e', 'd', 'a', 'b'])

In [286]: index
Out[286]: Index(['e', 'd', 'a', 'b'], dtype='object')

In [287]: 'd' in index
Out[287]: True

You can also pass a name to be stored in the index:

In [288]: index = pd.Index(['e', 'd', 'a', 'b'], name='something')

In [289]: index.name
Out[289]: 'something'

The name, if set, will be shown in the console display:

In [290]: index = pd.Index(list(range(5)), name='rows')

In [291]: columns = pd.Index(['A', 'B', 'C'], name='cols')

In [292]: df = pd.DataFrame(np.random.randn(5, 3), index=index, columns=columns)

In [293]: df
Out[293]: 
cols         A         B         C
rows                              
0     1.295989  0.185778  0.436259
1     0.678101  0.311369 -0.528378
2    -0.674808 -1.103529 -0.656157
3     1.889957  2.076651 -1.102192
4    -1.211795 -0.791746  0.634724

In [294]: df['A']
Out[294]: 
rows
0    1.295989
1    0.678101
2   -0.674808
3    1.889957
4   -1.211795
Name: A, dtype: float64

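#Setting metadata

Indexes are "mostly immutable", but it is possible to set and change their metadata, like the index name (or, for a MultiIndex, levels and codes).

You can use rename, set_names, set_levels, and set_codes to set these attributes directly. They default to returning a copy; however, you can specify inplace=True to have the data change in place.

See Advanced Indexing for usage of MultiIndexes.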

In [295]: ind = pd.Index([1, 2, 3])

In [296]: ind.rename("apple")
Out[296]: Int64Index([1, 2, 3], dtype='int64', name='apple')

In [297]: ind
Out[297]: Int64Index([1, 2, 3], dtype='int64')

In [298]: ind.set_names(["apple"], inplace=True)

In [299]: ind.name = "bob"

In [300]: ind
Out[300]: Int64Index([1, 2, 3], dtype='int64', name='bob')

set_names, set_levels, and set_codes also take an optional level argument:

In [301]: index = pd.MultiIndex.from_product([range(3), ['one', 'two']], names=['first', 'second'])

In [302]: index
Out[302]: 
MultiIndex([(0, 'one'),
            (0, 'two'),
            (1, 'one'),
            (1, 'two'),
            (2, 'one'),
            (2, 'two')],
           names=['first', 'second'])

In [303]: index.levels[1]
Out[303]: Index(['one', 'two'], dtype='object', name='second')

In [304]: index.set_levels(["a", "b"], level=1)
Out[304]: 
MultiIndex([(0, 'a'),
            (0, 'b'),
            (1, 'a'),
            (1, 'b'),
            (2, 'a'),
            (2, 'b')],
           names=['first', 'second'])

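#Set operations on Index objects

The two main operations are union (|) and intersection (&). These can be called directly as instance methods or used via overloaded operators. Difference is provided via the .difference() method.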

In [305]: a = pd.Index(['c', 'b', 'a'])

In [306]: b = pd.Index(['c', 'e', 'd'])

In [307]: a | b
Out[307]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [308]: a & b
Out[308]: Index(['c'], dtype='object')

In [309]: a.difference(b)
Out[309]: Index(['a', 'b'], dtype='object')

Also available is the symmetric_difference (^) operation, which returns elements that appear in either idx1 or idx2, but not in both. This is equivalent to the Index created by idx1.difference(idx2).union(idx2.difference(idx1)), with duplicates dropped.

In [310]: idx1 = pd.Index([1, 2, 3, 4])

In [311]: idx2 = pd.Index([2, 3, 4, 5])

In [312]: idx1.symmetric_difference(idx2)
Out[312]: Int64Index([1, 5], dtype='int64')

In [313]: idx1 ^ idx2
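Out[313]: Int64Index([1, 5], dtype='int64')

When performing Index.union() between indexes with different dtypes, the indexes must be cast to a common dtype. In a union between integer and float data, the integer values are converted to float: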

In [314]: idx1 = pd.Index([0, 1, 2])

In [315]: idx2 = pd.Index([0.5, 1.5])

In [316]: idx1 | idx2
Out[316]: Float64Index([0.0, 0.5, 1.0, 1.5, 2.0], dtype='float64')
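
The operator spellings shown in this section reflect the pandas version these transcripts were recorded with. In newer releases (from pandas 2.0, to the best of my knowledge) |, & and ^ on an Index act as element-wise logical operations rather than set operations, so the method forms are the safer choice. A minimal sketch using only the methods named above:

```python
import pandas as pd

a = pd.Index(['c', 'b', 'a'])
b = pd.Index(['c', 'e', 'd'])

union = a.union(b)                 # elements in either index
common = a.intersection(b)         # elements in both
only_a = a.difference(b)           # elements of a that are not in b

idx1 = pd.Index([1, 2, 3, 4])
idx2 = pd.Index([2, 3, 4, 5])
either_not_both = idx1.symmetric_difference(idx2)

print(union, common, only_a, either_not_both, sep='\n')
```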

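#Missing values

Even though Index can hold missing values (NaN), this should be avoided if you do not want any unexpected results. For example, some operations exclude missing values implicitly.

Index.fillna fills missing values with a specified scalar value.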

In [317]: idx1 = pd.Index([1, np.nan, 3, 4])

In [318]: idx1
Out[318]: Float64Index([1.0, nan, 3.0, 4.0], dtype='float64')

In [319]: idx1.fillna(2)
Out[319]: Float64Index([1.0, 2.0, 3.0, 4.0], dtype='float64')

In [320]: idx2 = pd.DatetimeIndex([pd.Timestamp('2011-01-01'),
   .....:                          pd.NaT,
   .....:                          pd.Timestamp('2011-01-03')])
   .....: 

In [321]: idx2
Out[321]: DatetimeIndex(['2011-01-01', 'NaT', '2011-01-03'], dtype='datetime64[ns]', freq=None)

In [322]: idx2.fillna(pd.Timestamp('2011-01-02'))
Out[322]: DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq=None)

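#Set / reset index

Occasionally you will load or create a data set into a DataFrame and want to add an index after you have already done so. There are a couple of different ways.

#Set an index

DataFrame has a set_index() method which takes a column name (for a regular Index) or a list of column names (for a MultiIndex). To create a new, re-indexed DataFrame: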

In [323]: data
Out[323]: 
     a    b  c    d
0  bar  one  z  1.0
1  bar  two  y  2.0
2  foo  one  x  3.0
3  foo  two  w  4.0

In [324]: indexed1 = data.set_index('c')

In [325]: indexed1
Out[325]: 
     a    b    d
c               
z  bar  one  1.0
y  bar  two  2.0
x  foo  one  3.0
w  foo  two  4.0

In [326]: indexed2 = data.set_index(['a', 'b'])

In [327]: indexed2
Out[327]: 
         c    d
a   b          
bar one  z  1.0
    two  y  2.0
foo one  x  3.0
    two  w  4.0

The append keyword option allows you to keep the existing index and append the given columns to a MultiIndex:

In [328]: frame = data.set_index('c', drop=False)

In [329]: frame = frame.set_index(['a', 'b'], append=True)

In [330]: frame
Out[330]: 
           c    d
c a   b          
z bar one  z  1.0
y bar two  y  2.0
x foo one  x  3.0
w foo two  w  4.0

Other options in set_index allow you to keep the index columns in the frame, or to add the index in place (without creating a new object):

In [331]: data.set_index('c', drop=False)
Out[331]: 
     a    b  c    d
c                  
z  bar  one  z  1.0
y  bar  two  y  2.0
x  foo  one  x  3.0
w  foo  two  w  4.0

In [332]: data.set_index(['a', 'b'], inplace=True)

In [333]: data
Out[333]: 
         c    d
a   b          
bar one  z  1.0
    two  y  2.0
foo one  x  3.0
    two  w  4.0

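#Reset the index

As a convenience, DataFrame has a function called reset_index() which transfers the index values into the DataFrame's columns and sets a simple integer index. This is the inverse operation of set_index().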

In [334]: data
Out[334]: 
         c    d
a   b          
bar one  z  1.0
    two  y  2.0
foo one  x  3.0
    two  w  4.0

In [335]: data.reset_index()
Out[335]: 
     a    b  c    d
0  bar  one  z  1.0
1  bar  two  y  2.0
2  foo  one  x  3.0
3  foo  two  w  4.0

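The output is more similar to a SQL table or a record array. The names for the columns derived from the index are the ones stored in the names attribute.

You can use the level keyword to remove only a portion of the index: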

In [336]: frame
Out[336]: 
           c    d
c a   b          
z bar one  z  1.0
y bar two  y  2.0
x foo one  x  3.0
w foo two  w  4.0

In [337]: frame.reset_index(level=1)
Out[337]: 
         a  c    d
c b               
z one  bar  z  1.0
y two  bar  y  2.0
x one  foo  x  3.0
w two  foo  w  4.0

reset_index takes an optional parameter drop which, if true, simply discards the index instead of putting the index values in the DataFrame's columns.
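
The sketch below rebuilds the small data frame used in this section (its values are taken from the output shown earlier) and compares a full reset, a reset with drop=True, and a partial reset by level name. The variable names are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'a': ['bar', 'bar', 'foo', 'foo'],
                     'b': ['one', 'two', 'one', 'two'],
                     'c': ['z', 'y', 'x', 'w'],
                     'd': [1.0, 2.0, 3.0, 4.0]})

indexed = data.set_index(['a', 'b'])

restored = indexed.reset_index()            # index levels become columns again
dropped = indexed.reset_index(drop=True)    # index levels are discarded
partial = indexed.reset_index(level='b')    # only level 'b' becomes a column

print(restored)
print(dropped)
print(partial)
```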

#Adding an ad hoc index

If you create an index yourself, you can just assign it to the index field:

data.index = index

#Returning a view versus a copy

When setting values in a pandas object, care must be taken to avoid what is called chained indexing. Here is an example.

In [338]: dfmi = pd.DataFrame([list('abcd'),
   .....:                      list('efgh'),
   .....:                      list('ijkl'),
   .....:                      list('mnop')],
   .....:                     columns=pd.MultiIndex.from_product([['one', 'two'],
   .....:                                                         ['first', 'second']]))
   .....: 

In [339]: dfmi
Out[339]: 
    one          two       
  first second first second
0     a      b     c      d
1     e      f     g      h
2     i      j     k      l
3     m      n     o      p

Compare these two access methods:

In [340]: dfmi['one']['second']
Out[340]: 
0    b
1    f
2    j
3    n
Name: second, dtype: object

In [341]: dfmi.loc[:, ('one', 'second')]
Out[341]: 
0    b
1    f
2    j
3    n
Name: (one, second), dtype: object

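These both yield the same results, so which should you use? It is instructive to understand the order of operations on these and why method 2 (.loc) is much preferred over method 1 (chained []).

dfmi['one'] selects the first level of the columns and returns a DataFrame that is singly indexed. Then another Python operation, dfmi_with_one['second'], selects the Series indexed by 'second'. The variable dfmi_with_one is used here because pandas sees these operations as separate events, that is, separate calls to __getitem__, so it has to treat them as linear operations that happen one after another.

Contrast this with dfmi.loc[:, ('one', 'second')], which passes a nested tuple of (slice(None), ('one', 'second')) to a single call to __getitem__. This allows pandas to deal with it as a single entity. Furthermore, this order of operations can be significantly faster, and allows one to index both axes if so desired.

#Why does assignment fail when using chained indexing?

The problem in the previous section is just a performance issue. So what is up with the SettingWithCopy warning? We do not usually throw warnings around when you do something that might cost a few extra milliseconds!

But it turns out that assigning to the product of chained indexing has inherently unpredictable results. To see this, think about how the Python interpreter executes this code: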

dfmi.loc[:, ('one', 'second')] = value
# becomes
dfmi.loc.__setitem__((slice(None), ('one', 'second')), value)

But this code is handled differently:

dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)

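See that __getitem__ in there? Outside of simple cases, it is very hard to predict whether it will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees), and therefore whether the __setitem__ will modify dfmi or a temporary object that gets thrown out immediately afterward. That is what SettingWithCopy is warning you about!

Note

You may be wondering whether we should be concerned about the loc property in the first example. But dfmi.loc is guaranteed to be dfmi itself with modified indexing behavior, so dfmi.loc.__getitem__ / dfmi.loc.__setitem__ operate on dfmi directly. Of course, dfmi.loc.__getitem__(idx) may be a view or a copy of dfmi.

Sometimes a SettingWithCopy warning will arise when there is no obvious chained indexing going on. These are the bugs that SettingWithCopy is designed to catch! pandas is probably trying to warn you that you have done this: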

def do_something(df):
    foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
    # ... many lines here ...
    # We don't know whether this will modify df or not!
    foo['quux'] = value
    return foo

Yikes!
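
One common way to remove the ambiguity in a function like do_something is to take an explicit copy before modifying the selection, so that the caller's frame is never touched. This is a sketch of that approach, not the only possible fix; add_column and the column values are hypothetical.

```python
import pandas as pd

def add_column(df):
    # .copy() makes foo an independent frame, so the assignment
    # below cannot affect df.
    foo = df[['bar', 'baz']].copy()
    foo['quux'] = foo['bar'] + foo['baz']
    return foo

df = pd.DataFrame({'bar': [1, 2], 'baz': [10, 20], 'other': [0, 0]})
out = add_column(df)

print(out)
print(df.columns.tolist())
```

If the intent is instead to modify the original frame, assigning through df.loc on df itself expresses that directly.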

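#Evaluation order matters

When you use chained indexing, the order and type of the indexing operation partially determine whether the result is a slice into the original object, or a copy of the slice.

pandas has the SettingWithCopyWarning because assigning to a copy of a slice is frequently not intentional, but a mistake caused by chained indexing returning a copy where a slice was expected.

If you would like pandas to be more or less trusting about assignment to a chained indexing expression, you can set the option mode.chained_assignment to one of these values:

'warn', the default, means a SettingWithCopyWarning is printed.
'raise' means pandas will raise a SettingWithCopyException that you have to deal with.
None will suppress the warnings entirely.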

In [342]: dfb = pd.DataFrame({'a': ['one', 'one', 'two',
   .....:                           'three', 'two', 'one', 'six'],
   .....:                     'c': np.arange(7)})
   .....: 

# This will show the SettingWithCopyWarning
# but the frame values will be set
In [343]: dfb['c'][dfb.a.str.startswith('o')] = 42

This, however, is operating on a copy and will not work.

>>> pd.set_option('mode.chained_assignment','warn')
>>> dfb[dfb.a.str.startswith('o')]['c'] = 42
Traceback (most recent call last)
     ...
SettingWithCopyWarning:
     A value is trying to be set on a copy of a slice from a DataFrame.
     Try using .loc[row_index,col_indexer] = value instead

鏈?zhǔn)椒峙湟部梢栽诨旌蟙type幀中進行設(shè)置。

注意

這些設(shè)置規(guī)則適用于所有.loc/.iloc。

這是正確的訪問方法：

In [344]: dfc = pd.DataFrame({'A': ['aaa', 'bbb', 'ccc'], 'B': [1, 2, 3]})

In [345]: dfc.loc[0, 'A'] = 11

In [346]: dfc
Out[346]: 
     A  B
0   11  1
1  bbb  2
2  ccc  3

This can work at times, but it is not guaranteed to, and therefore should be avoided:

In [347]: dfc = dfc.copy()

In [348]: dfc['A'][0] = 111

In [349]: dfc
Out[349]: 
     A  B
0  111  1
1  bbb  2
2  ccc  3

This will not work at all, and so should be avoided:

>>> pd.set_option('mode.chained_assignment','raise')
>>> dfc.loc[0]['A'] = 1111
Traceback (most recent call last)
     ...
SettingWithCopyException:
     A value is trying to be set on a copy of a slice from a DataFrame.
     Try using .loc[row_index,col_indexer] = value instead
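
For the dfb example earlier in this section, the chained form can be rewritten as a single .loc assignment that passes the boolean row mask and the column label together. A minimal sketch, rebuilding dfb from the definition shown above:

```python
import numpy as np
import pandas as pd

dfb = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
                    'c': np.arange(7)})

# One call to .loc.__setitem__: row selector and column label together.
dfb.loc[dfb['a'].str.startswith('o'), 'c'] = 42

print(dfb)
```

Because only one indexing operation is involved, there is no intermediate object whose view-or-copy status could matter.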
