Python Web Scraping Basics: Introductory Examples

猿友 2020-12-14 14:17:13 Views (3501)


This article covers the following main topics:

How the web interacts (requests and responses);
The get and post functions of the requests library;
The functions and attributes of the response object.

Environment: Python 3.6 + PyCharm

Library: requests

All of the code in this article is commented in detail and can be run as is.

First, you need to install the requests library. Python itself has to be installed before that; if it is not, see this installation guide: Python 3.9.0 Interpreter Installation Tutorial.

Once Python is installed, Windows users can open cmd and run the following command (the steps are much the same on other systems).

pip install requests

Linux users:

sudo pip install requests
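
Before moving on, it can help to confirm that the install actually worked. The following is a minimal sketch that uses only the standard library; the helper name `is_installed` is made up for this illustration and is not part of requests or pip.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None


# Prints True once `pip install requests` has succeeded for this interpreter.
print(is_installed("requests"))
```

If this prints False, pip most likely installed the package for a different Python interpreter than the one running the script.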

Now for the worked examples. Try each one yourself as you read!

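1. Fetch the Baidu home page and read the page content

Example

# Fetch the Baidu home page
import requests  # import the requests library
resp = requests.get('http://www.baidu.com')  # send a GET request and get a response object
resp.encoding = 'utf-8'  # decode the response body as utf-8
print(resp.status_code)  # print the status code
print(resp.text)  # print the fetched page content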
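
Why set `resp.encoding` at all? `resp.text` is the raw bytes in `resp.content` decoded with `resp.encoding`, and when a server does not declare a charset, requests may fall back to ISO-8859-1 for text responses. The sketch below reproduces the effect offline with plain byte strings; the sample string merely stands in for a page body and is an assumption of this illustration.

```python
# Bytes as they might arrive over the network (a UTF-8 encoded sample string).
raw = "百度一下，你就知道".encode("utf-8")

# Decoding with the wrong charset yields unreadable text ...
wrong = raw.decode("iso-8859-1")

# ... while decoding with the right one recovers the original string.
right = raw.decode("utf-8")

print(wrong == "百度一下，你就知道")  # False
print(right)  # 百度一下，你就知道
```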

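2. The get method of the requests library

Before going further, here is a useful site: httpbin.org. It lets you test all kinds of HTTP request and response information, such as cookies, IP addresses, headers, and login authentication, and it supports GET, POST, and other methods, which makes it very handy for web development and testing. It is an open-source project written in Python with Flask.

Official site: http://httpbin.org/

Source code: https://github.com/Runscope/httpbin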

Example

# get method example
import requests  # import the requests library
resp = requests.get("http://httpbin.org/get")  # send a GET request
print(resp.status_code)  # print the status code
print(resp.text)  # print the fetched content
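
httpbin answers with a JSON document that describes the request it received. With requests, `resp.json()` parses that body into a dictionary, which is essentially `json.loads(resp.text)`. The sketch below shows the parsing step offline; `sample_body` is a hand-written, abridged illustration of the response shape, not captured output.

```python
import json

# Hand-written, abridged sample of the kind of body http://httpbin.org/get returns.
sample_body = (
    '{"args": {}, "headers": {"Host": "httpbin.org"}, '
    '"url": "http://httpbin.org/get"}'
)

info = json.loads(sample_body)  # resp.json() performs this step for you
print(info["url"])              # http://httpbin.org/get
print(info["headers"]["Host"])  # httpbin.org
```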

3. The post method of the requests library

Example

# post method example
import requests  # import the requests library
resp = requests.post("http://httpbin.org/post")  # send a POST request
print(resp.status_code)  # print the status code
print(resp.text)  # print the fetched content

4. The put method of the requests library

Example

# put method example
import requests  # import the requests library
resp = requests.put("http://httpbin.org/put")  # send a PUT request
print(resp.status_code)  # print the status code
print(resp.text)  # print the fetched content

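5. Passing parameters with the get method

There are two ways to pass parameters with the get method:

Append the parameters to the URL after a "?", joining each name and value with "=" and separating the pairs with "&";
Pass several parameters at once as a dictionary through params. Both are shown below: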

Example

# get with parameters, example 1
import requests  # import the requests library
resp = requests.get("http://httpbin.org/get?name=w3cschool&age=100")  # parameters written into the URL
print(resp.status_code)  # print the status code
print(resp.text)  # print the fetched content

Example

# get with parameters, example 2
import requests  # import the requests library
data = {
    "name": "w3cschool",
    "age": 100
}  # store the parameters in a dictionary
resp = requests.get("http://httpbin.org/get", params=data)  # parameters passed through params
print(resp.status_code)  # print the status code
print(resp.text)  # print the fetched content
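
The two approaches end up requesting the same URL, because the dictionary is encoded into a query string before the request is sent. The sketch below illustrates that encoding with the standard library's `urlencode`; requests does its own encoding internally, so treat this as an approximation of the idea rather than its exact implementation.

```python
from urllib.parse import urlencode

data = {"name": "w3cschool", "age": 100}

# The dictionary becomes "name=value" pairs joined by "&".
query = urlencode(data)
url = "http://httpbin.org/get?" + query
print(url)  # http://httpbin.org/get?name=w3cschool&age=100

# Special characters are escaped, which is easy to get wrong by hand.
escaped = urlencode({"q": "a b&c"})
print(escaped)  # q=a+b%26c
```

The escaping is the main practical reason to prefer the dictionary form over writing the query string by hand.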

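6. Passing parameters with the post method

Passing parameters with the post method works much like the second get approach, except that the dictionary is passed through data so that it is sent in the request body as form data. Example:

Example

# post with parameters example
import requests  # import the requests library
data = {
    "name": "w3cschool",
    "age": 100
}  # store the parameters in a dictionary
resp = requests.post("http://httpbin.org/post", data=data)  # parameters sent as form data in the body
print(resp.status_code)  # print the status code
print(resp.text)  # print the fetched content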
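
With post, it matters which keyword carries the dictionary: `params` puts it in the URL's query string, while `data` puts it in the request body as form data. The sketch below builds both requests without sending them, so the difference can be inspected offline. It relies on `requests.Request(...).prepare()`; the variable names are chosen just for this illustration.

```python
import requests

data = {"name": "w3cschool", "age": 100}
url = "http://httpbin.org/post"

# Build the requests without sending them, so they can be inspected.
as_query = requests.Request("POST", url, params=data).prepare()
as_form = requests.Request("POST", url, data=data).prepare()

print(as_query.url)   # http://httpbin.org/post?name=w3cschool&age=100
print(as_query.body)  # None
print(as_form.url)    # http://httpbin.org/post
print(as_form.body)   # name=w3cschool&age=100
print(as_form.headers["Content-Type"])  # application/x-www-form-urlencoded
```

In httpbin's reply, values sent through `params` show up under "args" and values sent through `data` show up under "form".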

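7. Getting past basic anti-scraping checks, using Maoyan Box Office as an example:

Some sites reject requests that do not look like they come from a browser. Sending a browser-like User-Agent header is often enough to pass this kind of basic check.

Example

import requests  # import the requests library
url = 'http://piaofang.maoyan.com/dashboard'  # Maoyan Box Office URL
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}  # set the request headers to mimic a browser
resp = requests.get(url, headers=headers)  # send the GET request with the custom headers
print(resp.status_code)  # print the status code
print(resp.text)  # print the page content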
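
To see why the header matters, it helps to look at what requests sends by default: a User-Agent that openly identifies the client as python-requests. The sketch below inspects that default offline and then sets the browser-like value once on a `requests.Session`, so that every request made through the session would carry it. No request is actually sent here.

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/87.0.4280.88 Safari/537.36"
}

session = requests.Session()
default_agent = session.headers["User-Agent"]  # python-requests/<version>
print(default_agent.startswith("python-requests/"))  # True

# Replace the default once; session.get(...) would then reuse it every time.
session.headers.update(headers)
print(session.headers["User-Agent"] == headers["User-Agent"])  # True
```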

8. Download an image from a web page and save it locally.

First create a directory named 爬蟲 on drive E, otherwise the file cannot be saved. You can pick any other directory instead; just change the path in the code to match.

Example

import requests  # import the requests library
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}  # set the request headers to mimic a browser
resp = requests.get('http://7n.w3cschool.cn/statics/img/logo/indexlogo@2x.png', headers=headers)  # request the image
with open("E:\\爬蟲\\test.png", "wb") as file:  # "wb" opens the file for writing in binary mode
    file.write(resp.content)  # write the image bytes; the file is closed automatically
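
Since the save fails when the target directory does not exist, one option is to let the script create it. The following is a minimal sketch; `save_bytes` is a hypothetical helper written for this illustration, and the demo writes placeholder bytes into a temporary directory rather than onto drive E.

```python
import tempfile
from pathlib import Path


def save_bytes(content, target):
    """Write bytes to target, creating missing parent directories first."""
    path = Path(target)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    return path


# Demo with placeholder bytes in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_bytes(b"\x89PNG placeholder", Path(tmp) / "crawler" / "test.png")
    saved_size = saved.stat().st_size
    print(saved.name, saved_size)
```

With a real response, the call would look like `save_bytes(resp.content, "E:\\爬蟲\\test.png")`.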

Knowledge is only useful when applied, so practice often and try these techniques on real cases. Recommended reading: Python Static Crawlers, Python Scrapy Web Crawlers.
