This chapter covers output in Beautiful Soup 4: pretty-printing, compact (non-pretty) output, output formatting, and the get_text() method.
The prettify() method turns a Beautiful Soup parse tree into a nicely formatted Unicode string, with each XML/HTML tag on its own line:
markup = '<a rel="external nofollow" target="_blank">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
soup.prettify()
# '<html>\n <head>\n </head>\n <body>\n  <a rel="external nofollow" target="_blank">\n...'
print(soup.prettify())
# <html>
#  <head>
#  </head>
#  <body>
#   <a rel="external nofollow" target="_blank">
#    I linked to
#    <i>
#     example.com
#    </i>
#   </a>
#  </body>
# </html>
You can call prettify() on the top-level BeautifulSoup object or on any of its Tag objects:
print(soup.a.prettify())
# <a rel="external nofollow" target="_blank">
#  I linked to
#  <i>
#   example.com
#  </i>
# </a>
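As a minimal Python 3 sketch of the two calls above, the following runs prettify() on both the whole document and a single tag. The markup, its href value, and the variable names are assumptions made for illustration, and the "html.parser" backend is named explicitly; because that parser does not add wrapper tags on its own, they are written out in the markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration. The <html>/<head>/<body> wrappers are
# written out because html.parser does not insert them automatically.
markup = (
    '<html><head></head><body>'
    '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    '</body></html>'
)
soup = BeautifulSoup(markup, "html.parser")

whole_document = soup.prettify()  # formatted string for the entire tree
link_only = soup.a.prettify()     # formatted string for the <a> subtree only

print(link_only)
```

Each nesting level is indented by one more space, so the <i> tag inside the link sits one space deeper than the link's own text.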
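If you just want the string and do not care about formatting, you can call Python's unicode() or str() on a BeautifulSoup object or a Tag object. Note that unicode() exists only in Python 2; in Python 3, str() already returns Unicode text.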
str(soup)
# '<html><head></head><body><a rel="external nofollow" target="_blank">I linked to <i>example.com</i></a></body></html>'
unicode(soup.a)
# u'<a rel="external nofollow" target="_blank">I linked to <i>example.com</i></a>'
Under Python 2, str() returns a string encoded in UTF-8; other encodings can be chosen through the encoding settings. You can also call encode() to get a bytestring, or decode() to get Unicode.
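For readers on Python 3, here is a small sketch of the same three conversions. It assumes the "html.parser" backend and an illustrative href value, neither of which comes from the text above; the variable names are likewise hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

as_text = str(soup.a)              # Unicode str in Python 3, no pretty-printing
as_bytes = soup.a.encode("utf-8")  # bytes, encoded as UTF-8
round_trip = soup.a.decode()       # back to a Unicode str

print(type(as_text).__name__, type(as_bytes).__name__, type(round_trip).__name__)
```

In Python 3 the text/bytes split is explicit: str() and decode() give text, and only encode() produces encoded bytes.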
When Beautiful Soup is given a document that contains HTML entities such as "&ldquo;", it converts them to Unicode characters:
soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.")
unicode(soup)
# u'<html><head></head><body>\u201cDammit!\u201d he said.</body></html>'
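If you then convert the document to a byte string (str() under Python 2), the Unicode characters are encoded as UTF-8, and the original HTML entities are not restored:
str(soup)
# '<html><head></head><body>\xe2\x80\x9cDammit!\xe2\x80\x9d he said.</body></html>'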
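The entity behavior described above can be checked under Python 3 with a short sketch like the one below. It assumes the "html.parser" backend, which adds no wrapper tags, so the results contain only the sentence itself; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# The input uses named HTML entities for the curly quotes.
soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", "html.parser")

text = str(soup)             # entities have become real Unicode characters
utf8 = soup.encode("utf-8")  # the same characters as UTF-8 bytes

print(repr(text))
print(utf8)
```

Neither result contains "&ldquo;" any more: the parser has replaced the entity with U+201C, and the default output keeps it as that character.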
If you only want the text inside a tag, call the get_text() method. It collects all the text within the tag, including the text of its descendant tags, and returns it as a single Unicode string:
markup = '<a rel="external nofollow" target="_blank">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup)
soup.get_text()
# u'\nI linked to example.com\n'
soup.i.get_text()
# u'example.com'
You can pass a string argument that is used to join the pieces of text:
soup.get_text("|")
# u'\nI linked to |example.com|\n'
You can also strip the whitespace from the beginning and end of each piece of text:
soup.get_text("|", strip=True)
# u'I linked to|example.com'
Alternatively, use the .stripped_strings generator to get the pieces of text as a list and process them yourself:
[text for text in soup.stripped_strings]
# [u'I linked to', u'example.com']
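The text-extraction variants above can be compared side by side in Python 3 with a sketch like the following. The href value, the "html.parser" backend, and the variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration, with newlines inside the link.
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

plain = soup.get_text()                    # all text, concatenated as-is
joined = soup.get_text("|")                # pieces joined with a separator
stripped = soup.get_text("|", strip=True)  # pieces stripped, empty ones dropped
pieces = list(soup.stripped_strings)       # the stripped pieces as a list

print(repr(plain))
print(repr(joined))
print(repr(stripped))
print(pieces)
```

Working from .stripped_strings is handy when the pieces need further processing (filtering, joining with a different separator) rather than a single pre-joined string.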