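PyQuery is a powerful HTML parsing library that is, in many respects, more convenient than BeautifulSoup. It is a Python library modeled closely on jQuery, and its syntax is nearly identical to jQuery's.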

Basic usage

html_doc = """
<html><head><title>The Dormouse's story</title></head>
    <body>
        <div id="container">
            <ul class="list">
                <p class="title"><b>The Dormouse's story</b></p>
                <p class="story">Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
                and they lived at the bottom of a well.</p>
                <p class="story"> test for class story </p>
             </ul>
         </div>
    </body>
</html>
"""
from pyquery import PyQuery as pq
doc = pq(html_doc) # convert the HTML string into a PyQuery object
print(f"type of doc:  {type(doc)}")
select_list = doc("#container .list") # filter content with a CSS selector
print(f"type of select_list:  {type(select_list)}")
print("select_list: ",select_list)

# output
"""
type of doc:  <class 'pyquery.pyquery.PyQuery'>
type of select_list:  <class 'pyquery.pyquery.PyQuery'>
select_list:  <ul class="list">
                <p class="title"><b>The Dormouse's story</b></p>
                <p class="story">Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
                and they lived at the bottom of a well.</p>
                <p class="story"> test for class story </p>
             </ul>
"""

As you can see, PyQuery is simple to use: first convert the HTML into a pyquery.pyquery.PyQuery object, then filter its content with CSS selectors.

Each selection also returns a pyquery.pyquery.PyQuery object, so calls can be chained. Next, we can use find() to get all the a tags inside select_list:

print(select_list.find('a'))

# output
"""
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
                and they lived at the bottom of a well.
"""

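Chaining calls like this makes it easy to drill down to exactly the content we want.

Finding child elements:

print(select_list.children('a')) # look for direct children that are 'a' elements

# output: an empty line, because no 'a' element is a direct child of the ul

print(select_list.children()) # get all direct children

# output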
"""
<p class="title"><b>The Dormouse's story</b></p>
                <p class="story">Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
                and they lived at the bottom of a well.</p>
                <p class="story"> test for class story </p>
"""

Unlike find(), children() matches only direct children, whereas find() matches descendants at any depth.

Parent elements work similarly:

print(select_list.parent()) # get the direct parent
print(select_list.parents()) # get all ancestors
print(select_list.parents("#container")) # optionally filter ancestors with a CSS selector

Sibling elements:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.list .item-0.active')
print(li)
print(li.siblings())
print(li.siblings('.active'))

# output 
"""
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             
<li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0">first item</li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>

<li class="item-1 active"><a href="link4.html">fourth item</a></li>

"""

Iterating over elements

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('li'))
lis = doc('li').items()
print(f"type(lis): {type(lis)}")
for li in lis:
    print(li) # each item is a PyQuery object, so further operations can be applied to it individually

# output 
"""
<li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         
type(lis): <class 'generator'>
<li class="item-0">first item</li>
             
<li class="item-1"><a href="link2.html">second item</a></li>
             
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
             
<li class="item-0"><a href="link5.html">fifth item</a></li>
"""

Getting attributes and text

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-0.active a')
print(a)
# <a href="link3.html"><span class="bold">third item</span></a>
print(a.attr('href')) # get an attribute: the link target
# link3.html
print(a.attr.href) # same attribute, using attribute-style access
# link3.html
print(a.text()) # get the text content
# third item
print(a.html()) # get the inner HTML
# <span class="bold">third item</span>

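DOM manipulation

The Document Object Model (DOM) is used to manipulate the content of HTML and XML documents. It represents a document as a hierarchical tree of nodes; operating on those nodes lets you add, delete, modify, and look up content.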

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)

li = doc('.item-0.active')
print(li)
# <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

li.removeClass('active') # remove the "active" class
print(li)
# <li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>

li.addClass('active') # add the "active" class back
print(li)
# <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

li.attr('name', 'add_new_name_attr') # add a new name attribute
print(li)
# <li class="item-0 active" name="add_new_name_attr"><a href="link3.html"><span class="bold">third item</span></a></li>

li.css('font-size', '16px') # set an inline style property
print(li)
# <li class="item-0 active" name="add_new_name_attr" style="font-size: 16px"><a href="link3.html"><span class="bold">third item</span></a></li>

print("------------------------------------------------------------------------")
print("<removed:>")
print(li.find('span,a').remove()) # remove() detaches the matched nodes from the tree and returns them
print(li)
# ------------------------------------------------------------------------
# <removed:>
# <a href="link3.html"/><span class="bold">third item</span>
# <li class="item-0 active" name="add_new_name_attr" style="font-size: 16px"/>

Initializing from a URL or a local file

# initialize from a URL (requires network access)
from pyquery import PyQuery as pq
doc = pq(url='http://www.baidu.com')
print(doc('head'))

# initialize from a local file
from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
print(doc('.list')) # class selector; 'list' without the leading dot would look for a <list> tag
