Beautiful Soup 4 Output

Updated 2021-05-21 14:21

This chapter covers output in Beautiful Soup 4: pretty-printing, compact output, output formatting, and the get_text() method.

Pretty-printing

The prettify() method formats the Beautiful Soup parse tree into a Unicode string, with each XML/HTML tag on its own line:

from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
# No parser is named, so Beautiful Soup uses the best one installed; the
# <html>/<head>/<body> wrapper shown below depends on that choice.
soup = BeautifulSoup(markup)
soup.prettify()
# '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'

print(soup.prettify())
# <html>
#  <head>
#  </head>
#  <body>
#   <a href="http://example.com/">
#    I linked to
#    <i>
#     example.com
#    </i>
#   </a>
#  </body>
# </html>

Both the BeautifulSoup object and any of its Tag nodes can call prettify():

print(soup.a.prettify())
# <a href="http://example.com/">
#  I linked to
#  <i>
#   example.com
#  </i>
# </a>
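
As a minimal sketch of the two calls described above, the snippet below pretty-prints both a whole document and a single tag. It assumes Python 3 and names the built-in "html.parser" explicitly; that parser choice and the sample markup are assumptions made for illustration, and html.parser does not wrap a fragment in html or body tags the way the output above shows.

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption for this sketch).
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'

# Naming the parser avoids the "no parser specified" warning.
soup = BeautifulSoup(markup, "html.parser")

pretty_doc = soup.prettify()    # pretty-print the whole document
pretty_tag = soup.i.prettify()  # pretty-print just the <i> tag

print(pretty_doc)
print(pretty_tag)
```

Both calls return an ordinary Python string, so the result can be written to a file or logged like any other text.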

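Compact output

If you only want the resulting string and do not care about formatting, call Python's unicode() or str() on a BeautifulSoup or Tag object. The examples here use Python 2; Python 3 has no unicode(), and str() there returns a Unicode string: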

str(soup)
# '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'

# unicode() exists only on Python 2; on Python 3 call str(soup.a) instead.
unicode(soup.a)
# u'<a href="http://example.com/">I linked to <i>example.com</i></a>'

On Python 2, str() returns a byte string encoded as UTF-8; other encodings can be chosen through the encoding settings.

You can also call encode() to get a byte string, or decode() to get Unicode.
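
The following sketch shows how the same three calls behave on Python 3, where str() yields a Unicode string and encode() yields bytes. The "html.parser" argument and the sample markup are assumptions chosen so the snippet runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption for this sketch).
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

compact = str(soup.a)              # Unicode str, no added whitespace
as_bytes = soup.a.encode("utf-8")  # bytes in the requested encoding
as_text = soup.a.decode()          # back to a Unicode str

print(compact)
print(type(as_bytes).__name__, type(as_text).__name__)
```

Passing a different encoding name to encode() is one way to get bytes in something other than UTF-8.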

Output format

When Beautiful Soup outputs a document, HTML entities such as "&ldquo;" have been converted to the corresponding Unicode characters:

soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.")
# Python 2 syntax; on Python 3 use str(soup) to get the Unicode string.
unicode(soup)
# u'<html><head></head><body>\u201cDammit!\u201d he said.</body></html>'

If you then convert the document to a byte string, the Unicode characters are encoded as UTF-8, and the original HTML entities are not restored:

str(soup)
# '<html><head></head><body>\xe2\x80\x9cDammit!\xe2\x80\x9d he said.</body></html>'
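
The entity round trip can be checked directly. This is a sketch for Python 3 that assumes the built-in "html.parser" (so no html or body wrapper appears around the text); it shows the entity becoming a Unicode character and then UTF-8 bytes.

```python
from bs4 import BeautifulSoup

# "html.parser" is an assumption; it leaves the fragment unwrapped.
soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", "html.parser")

text = str(soup)           # entities are now Unicode quotation marks
raw = soup.encode("utf-8")  # the same characters as UTF-8 bytes

print(repr(text))
print(raw)
```

Neither form contains the original "&ldquo;" entity any more, which is the behavior the paragraph above describes.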

get_text()

If you only want the text inside a tag, call get_text(). It returns all the text in the tag, including text in its descendant tags, as a single Unicode string:

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup)

soup.get_text()
# u'\nI linked to example.com\n'
soup.i.get_text()
# u'example.com'

You can pass a string to use as the separator between pieces of text:

soup.get_text("|")
# u'\nI linked to |example.com|\n'

You can also strip leading and trailing whitespace from each piece of text:

soup.get_text("|", strip=True)
# u'I linked to|example.com'

Or use the .stripped_strings generator and process the resulting list of strings yourself:

[text for text in soup.stripped_strings]
# [u'I linked to', u'example.com']
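
Putting the get_text() variants side by side, here is a small sketch for Python 3. The "html.parser" argument and the sample markup are assumptions for illustration; on Python 3 the results are plain str values without the u prefix.

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption for this sketch).
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

plain = soup.get_text()                  # all text, whitespace kept
joined = soup.get_text("|", strip=True)  # stripped pieces joined by "|"
pieces = list(soup.stripped_strings)     # the same pieces as a list

print(repr(plain))
print(joined)
print(pieces)
```

Using stripped_strings is handy when the pieces need further processing, for example joining them with a single space via " ".join(pieces).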

