lxml 讀取HTML文件進(jìn)行解析

2021-05-28 09:46 更新
from lxml import etree

html=etree.parse('test.html',etree.HTMLParser()) #指定解析器HTMLParser會(huì)根據(jù)文件修復(fù)HTML文件中缺失的如聲明信息
result=etree.tostring(html)   #解析成字節(jié)
#result=etree.tostringlist(html) #解析成列表
print(type(html))
print(type(result))
print(result)

#
<class 'lxml.etree._ElementTree'>
<class 'bytes'>
b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">\n<html><body><div>&#13;\n    <ul>&#13;\n         <li class="item-0"><a href="link1.html">first item</a></li>
\n         <li class="item-1"><a href="link2.html">second item</a></li>
\n         <li class="item-inactive"><a href="link3.html">third item</a></li>
\n         <li class="item-1"><a href="link4.html">fourth item</a></li>
\n         <li class="item-0"><a href="link5.html">fifth item</a>&#13;\n     </li></ul>&#13;\n </div>&#13;\n</body></html>'


以上內(nèi)容是否對(duì)您有幫助:
在線筆記
App下載
App下載

掃描二維碼

下載編程獅App

公眾號(hào)
微信公眾號(hào)

編程獅公眾號(hào)