Beautiful Soup is a Python library for extracting data from HTML and XML files. It is commonly used to parse the HTML source returned by a crawler.

1. Basic Usage

BeautifulSoup supports several parsers; in practice 'lxml' is the most commonly used.

| Parser | Usage | Advantages | Disadvantages |
| -- | -- | -- | -- |
| Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 / 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires the C library to be installed |
| lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires the C library to be installed |
| html5lib | BeautifulSoup(markup, "html5lib") | Best error tolerance; parses documents the way a browser does; produces valid HTML5 | Slow; does not rely on external extensions |
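The examples below operate on a variable named html that holds a page's HTML source. A minimal sketch of how it might be obtained (the requests library and the URL here are illustrative assumptions, not part of the original examples):

import requests

# Download a page and keep its HTML source as a plain string;
# any of the parsers from the table above can then be passed to BeautifulSoup.
html = requests.get('http://example.com').text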

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())  # pretty-printed output
# prettify() formats the Beautiful Soup parse tree and returns it as a Unicode string,
# with each XML/HTML tag on its own line

# get the tag name
print(soup.title.name)

# get the tag's content
print(soup.title.string)
print(soup.title.text)

# nested selection:
print(soup.head.title.string)

# get a tag's attributes
print(soup.p.attrs['name'])
print(soup.p['name'])
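A subtle difference between .string and .text: .string returns None when a tag has more than one child, while .text (equivalent to get_text()) concatenates the text of all descendants. A small self-contained sketch:

from bs4 import BeautifulSoup

p = BeautifulSoup('<p>Hello <b>world</b></p>', 'lxml').p
print(p.string)  # None -- <p> has two children: 'Hello ' and <b>world</b>
print(p.text)    # 'Hello world' -- text of all descendants joined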

2. find() and find_all()

find() and find_all() are used in much the same way: find() returns the first matching element, while find_all() returns all matching elements as a list.

find_all( name , attrs , recursive , string , **kwargs )

Elements can be filtered by tag name, attributes, or text content.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

soup.find_all("title")
# [<title>The Dormouse's story</title>]

soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# filter by tag attributes such as id and class:
soup.find_all(attrs={"id": "link3"})
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(id="link3")
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(attrs={"class": "title"})
# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all(class_="title")
# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find(string=re.compile("sisters"))
# u'Once upon a time there were three little sisters; and their names were\n'
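find_all() accepts a few other filters as well; the lines below are a brief sketch reusing the soup built above (limit is an extra keyword argument beyond the signature shown earlier):

soup.find_all(["a", "b"])                 # a list of tag names matches any of them

soup.find_all("a", limit=2)               # stop after the first two matches
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.find_all("title", recursive=False)   # search only direct children of the current tag
# []  (<title> is nested inside <head>, not a direct child of the document)

soup.find_all(href=re.compile("elsie"))   # keyword arguments also accept regular expressions
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]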

Getting text content and links:

# get all the text content:
print(soup.get_text())
"""
The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
"""
# get all the links:
for link in soup.find_all('a'):
    print(link.get('href'))  # same as: link['href']
    print(link.get_text())

# http://example.com/elsie
# Elsie
# http://example.com/lacie
# Lacie
# http://example.com/tillie
# Tillie
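A note on the two access styles: tag['href'] raises a KeyError when the attribute is missing, while tag.get('href') simply returns None, which is safer when not every tag is guaranteed to carry the attribute. A small sketch:

from bs4 import BeautifulSoup

a = BeautifulSoup('<a>no link here</a>', 'lxml').a
print(a.get('href'))   # None -- safe lookup
# a['href']            # would raise KeyError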

3. select()

Beautiful Soup supports most CSS selectors. Passing a selector string to the .select() method of a Tag or the BeautifulSoup object returns the tags matching that selector.

soup.select("html .title")
# [<p class="title"><b>The Dormouse's story</b></p>]

soup.select("html title")
# [<title>The Dormouse's story</title>]

soup.select("#link1") # select by id
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("title") # select by tag
# [<title>The Dormouse's story</title>]

soup.select(".title") # select by class
# [<p class="title"><b>The Dormouse's story</b></p>]

soup.select("html head title")
# [<title>The Dormouse's story</title>]

soup.select("head > title")
# [<title>The Dormouse's story</title>]

soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("#link1,#link2") # 通过多种CSS筛选
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.select('a[href]') # find tags by the presence of an attribute
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# match by attribute value: exact, prefix (^=), suffix ($=), or substring (*=)
soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Summary

Beautiful Soup parses HTML/XML into a tree of Python objects and offers three main ways to locate elements: direct navigation by tag and attribute, find()/find_all(), and CSS selectors via select().

Reference

Beautiful Soup documentation