Python is a language that suits web scraping particularly well: its rich ecosystem of third-party libraries and its simple syntax make crawlers quick and efficient to build. Below are 7 simple Python crawler examples, each with concrete code, which I hope you will find helpful.
Case 1: Scraping the Douban Movie Top 250
import requests
from bs4 import BeautifulSoup

# Douban rejects requests that lack a browser-like User-Agent, so set one explicitly.
url = 'https://movie.douban.com/top250'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
movies = soup.find_all('div', class_='info')
for movie in movies:
    # find() returns the first match; each entry's first span.title is the Chinese title
    title = movie.find('span', class_='title').text
    rating = movie.find('span', class_='rating_num').text
    print(f'Title: {title}, Rating: {rating}')
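The snippet above only covers the first page of 25 movies. To my knowledge the Top 250 list is paginated through a start query parameter in steps of 25, so looping over 10 pages collects the whole list; a sketch under that assumption:

import time
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
for start in range(0, 250, 25):  # assumed pagination: 10 pages, 25 entries each
    page = requests.get(f'https://movie.douban.com/top250?start={start}', headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    for movie in soup.find_all('div', class_='info'):
        print(movie.find('span', class_='title').text)
    time.sleep(1)  # pause between pages to avoid hammering the server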
Case 2: Fetching Weather Information
import requests

city = 'Shanghai'
# wttr.in returns plain-text weather; %C is the condition and %t the temperature.
url = f'http://wttr.in/{city}?format=%C+%t'
response = requests.get(url)
print(f'Weather in {city}: {response.text}')
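wttr.in can also return structured data: as far as I know, requesting format=j1 yields a JSON document, which is easier to post-process than the text format. A sketch assuming that response layout (the current_condition keys below are taken from wttr.in's JSON output and may change):

import requests

city = 'Shanghai'
data = requests.get(f'http://wttr.in/{city}?format=j1').json()
current = data['current_condition'][0]  # one entry describing the present weather
print(f"{current['temp_C']}°C, {current['weatherDesc'][0]['value']}")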
Case 3: Scraping Answers to a Zhihu Question
import requests
from bs4 import BeautifulSoup

question_id = '123456'  # placeholder: suppose this is a Zhihu question ID
url = f'https://www.zhihu.com/question/{question_id}'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
answers = soup.find_all('div', class_='Answer')
for answer in answers:
    content = answer.find('div', class_='RichContent')
    if content:  # guard against answers whose body is absent from the static HTML
        print(content.text)
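Be aware that Zhihu renders most answer content with JavaScript, so the static HTML that requests receives may contain few or no Answer nodes. A common workaround is to drive a real browser with Selenium and parse the rendered page; a minimal sketch, assuming Chrome and a matching chromedriver are installed locally:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # requires a local Chrome + chromedriver setup
driver.get('https://www.zhihu.com/question/123456')  # placeholder question ID
soup = BeautifulSoup(driver.page_source, 'html.parser')  # parse the rendered DOM
for answer in soup.find_all('div', class_='RichContent'):
    print(answer.text)
driver.quit()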
Case 4: Downloading an Image
import requests

url = 'https://www.example.com/path/to/image.jpg'  # placeholder image URL
response = requests.get(url)
# response.content holds the raw bytes, so write the file in binary mode
with open('image.jpg', 'wb') as f:
    f.write(response.content)
print('Image downloaded!')
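For large files, buffering the whole response in memory is wasteful; requests can stream the body in chunks instead. The same download in streaming mode:

import requests

url = 'https://www.example.com/path/to/image.jpg'  # placeholder image URL
with requests.get(url, stream=True) as response:
    response.raise_for_status()  # fail early on 4xx/5xx instead of saving an error page
    with open('image.jpg', 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):  # 8 KB at a time
            f.write(chunk)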
Case 5: Scraping a Novel Chapter
import requests
from bs4 import BeautifulSoup

url = 'http://www.example.com/novel/chapter1'  # placeholder chapter URL
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# These selectors assume the site puts the title in <h1> and the body in div.content
chapter_title = soup.find('h1').text
content = soup.find('div', class_='content').text
print(f'Chapter title: {chapter_title}\nContent: {content}')
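In practice you usually want to save chapters to disk and walk from one chapter to the next rather than print a single page. A sketch of that loop, assuming a hypothetical "下一章" (next chapter) link on each page; both the selector and the stop condition depend entirely on the target site:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'http://www.example.com/novel/chapter1'  # placeholder starting chapter
with open('novel.txt', 'w', encoding='utf-8') as f:
    while url:
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        f.write(soup.find('h1').text + '\n')
        f.write(soup.find('div', class_='content').text + '\n\n')
        nxt = soup.find('a', string='下一章')  # hypothetical next-chapter link
        url = urljoin(url, nxt['href']) if nxt else None  # resolve relative links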
Case 6: Scraping Zhihu User Information
import requests

username = 'zhihuzhiyang'  # placeholder: suppose this is a Zhihu username
url = f'https://www.zhihu.com/people/{username}'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
print(response.text)  # prints the raw HTML of the user's profile page
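Printing raw HTML is rarely the end goal. A first parsing step that tends to survive markup changes is extracting the page <title>, which usually embeds the user's display name; anything more specific depends on class names Zhihu changes frequently:

import requests
from bs4 import BeautifulSoup

username = 'zhihuzhiyang'  # same placeholder username as above
response = requests.get(f'https://www.zhihu.com/people/{username}',
                        headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.text, 'html.parser')
if soup.title:
    print(soup.title.text)  # the <title> text usually contains the display name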
Case 7: A Simple Step Up: Using Scrapy
# Create a Scrapy project and put this spider in the spiders folder as spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/'
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

# Run with: scrapy crawl quotes -o quotes.json
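quotes.toscrape.com is paginated, and following the "Next" link is the standard Scrapy pattern for that: response.follow resolves the relative href and schedules a new request with the same callback. A revised parse method that crawls every page:

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        # Follow the pagination link, if any, and parse the next page the same way
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)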
Those are 7 simple Python crawler examples. As the cases show, scraping web data with Python is fairly straightforward, and you can adapt the code to crawl other sites as your needs dictate. When running crawlers, always follow the target site's crawling policy (robots.txt) and the relevant laws and regulations, and use the data responsibly.