lxml: Reading and Parsing an HTML File

Updated 2021-05-28 09:46

from lxml import etree

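# Passing HTMLParser makes lxml repair the document while parsing,
# filling in missing pieces such as the DOCTYPE declaration.
html = etree.parse('test.html', etree.HTMLParser())
result = etree.tostring(html)  # serialize the tree to bytes
# result = etree.tostringlist(html)  # or serialize to a list of byte strings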
print(type(html))
print(type(result))
print(result)

# Output:
<class 'lxml.etree._ElementTree'>
<class 'bytes'>
b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">\n<html><body><div>&#13;\n    <ul>&#13;\n         <li class="item-0"><a href="link1.html">first item</a></li>\n         <li class="item-1"><a href="link2.html">second item</a></li>\n         <li class="item-inactive"><a href="link3.html">third item</a></li>\n         <li class="item-1"><a href="link4.html">fourth item</a></li>\n         <li class="item-0"><a href="link5.html">fifth item</a>&#13;\n     </li></ul>&#13;\n </div>&#13;\n</body></html>'
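
The output above is raw bytes, which is hard to read. As a rough sketch, the same parse can be serialized to a regular string by passing encoding="unicode" to etree.tostring. The sample markup, the file name sample.html, and the helper parse_html_file below are assumptions made for illustration and are not part of the original example; the file is written to a temporary directory so the snippet runs on its own.

```python
import os
import tempfile

from lxml import etree

# Hypothetical sample markup: a bare fragment with no <html>/<body>
# wrapper and an <li> that is never closed.
SAMPLE = '<div><ul><li class="item-0"><a href="link1.html">first item</a></ul></div>'


def parse_html_file(path):
    """Parse an HTML file with the lenient HTMLParser and return the tree."""
    return etree.parse(path, etree.HTMLParser())


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.html")  # illustrative file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(SAMPLE)
    tree = parse_html_file(path)

root = tree.getroot()

# encoding="unicode" returns str instead of bytes;
# method="html" serializes using HTML rules rather than XML rules.
text = etree.tostring(tree, encoding="unicode", method="html")

print(type(tree))
print(root.tag)
print(text)
```

Here the parser should wrap the fragment in html and body elements and close the dangling li, so the printed markup is a complete document even though the input was not.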
