[Python] Six Common Web Scraping Examples (Source Code Included)

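With the internet growing rapidly, web crawlers have become an important tool for collecting data. A crawler can automatically extract useful information from web pages. Below I walk through six common scraping examples, each with Python source code, to help you understand and apply the technique.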

1. Fetching basic page content

We can use the requests library to fetch a page's HTML.

import requests

url = 'http://example.com'
response = requests.get(url, timeout=10)  # requests has no default timeout, so set one

if response.status_code == 200:
    print(response.text)  # print the fetched page content
else:
    print('Request failed, status code:', response.status_code)
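
The status-code check above can also be wrapped in a small reusable function. The following is only a sketch: the name fetch_html and the 10-second timeout are assumptions made for illustration and are not part of the original example. It relies on raise_for_status() to turn 4xx/5xx responses into exceptions and catches requests.RequestException, which also covers connection errors and timeouts.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page HTML as text, or None if the request fails.

    fetch_html is a hypothetical helper name used for this sketch.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Raise an HTTPError for 4xx/5xx status codes
        response.raise_for_status()
    except requests.RequestException as exc:
        # Covers connection errors, timeouts and HTTP error statuses
        print(f'Request failed: {exc}')
        return None
    return response.text
```

Calling fetch_html('http://example.com') then returns either the HTML string or None, so the caller only has to handle one failure case.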

2. Parsing pages with BeautifulSoup

The BeautifulSoup library parses HTML so that specific pieces, such as the title and the paragraphs, can be extracted.

from bs4 import BeautifulSoup
import requests

url = 'http://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

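# Extract the title (soup.title is None when the page has no <title> tag)
title = soup.title.string if soup.title else None
print('Page title:', title)

# Extract all paragraphs
for p in soup.find_all('p'):
    print(p.text)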
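
Parsing can be tried out without any network access by feeding BeautifulSoup an HTML string directly. The sketch below assumes a made-up sample_html string and a hypothetical helper named extract_title_and_paragraphs; it returns the data instead of printing it and tolerates a missing title tag.

```python
from bs4 import BeautifulSoup


def extract_title_and_paragraphs(html):
    """Return (title, paragraphs) parsed from an HTML string.

    The function name is hypothetical; title is None when the
    document has no <title> tag.
    """
    soup = BeautifulSoup(html, 'html.parser')
    title_tag = soup.title
    title = title_tag.get_text(strip=True) if title_tag else None
    # get_text(strip=True) removes surrounding whitespace from each paragraph
    paragraphs = [p.get_text(strip=True) for p in soup.find_all('p')]
    return title, paragraphs


# Made-up sample document, used only for illustration
sample_html = (
    '<html><head><title>Demo</title></head>'
    '<body><p>First</p><p> Second </p></body></html>'
)
title, paragraphs = extract_title_and_paragraphs(sample_html)
```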

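3. Scraping multiple pages (pagination)

A crawler often needs data that spans several pages, such as a product listing. This can be done by constructing the URL of each page.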

import requests
from bs4 import BeautifulSoup

base_url = 'http://example.com/page/{}'

for page in range(1, 6):  # scrape the first 5 pages
    url = base_url.format(page)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # 'item', 'h2' and 'price' are placeholders; adjust them to the target page's markup
    for item in soup.find_all('div', class_='item'):
        title = item.find('h2').text
        price = item.find('span', class_='price').text
        print(f'Item: {title}, price: {price}')
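
A fixed page count works when the number of pages is known in advance. When it is not, one possible variation is to stop at the first empty page and pause between requests. The sketch below assumes a hypothetical crawl_pages function and a caller-supplied fetch_items callable (for example, one that downloads a page and returns its parsed items); neither appears in the original example.

```python
import time


def crawl_pages(fetch_items, base_url, max_pages=5, delay=1.0):
    """Collect items page by page until a page is empty or max_pages is reached.

    fetch_items is a caller-supplied function (an assumption of this sketch)
    that takes a URL and returns a list of items for that page.
    """
    results = []
    for page in range(1, max_pages + 1):
        items = fetch_items(base_url.format(page))
        if not items:
            # An empty page is treated as the end of the listing
            break
        results.extend(items)
        if page < max_pages:
            # Pause between requests to avoid hammering the server
            time.sleep(delay)
    return results
```

Passing the fetch step in as a function keeps the paging logic separate from the download and parsing code, which also makes it easy to try out with canned data.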

4. Scraping images

Download the images found on a page and save them locally.

import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = 'http://example.com/images'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Collect image URLs; urljoin turns relative src values into absolute URLs
image_urls = []
for img in soup.find_all('img', src=True):  # skip <img> tags without a src attribute
    image_urls.append(urljoin(url, img['src']))

# Create the output folder
os.makedirs('images', exist_ok=True)

# Download the images
for img_url in image_urls:
    img_response = requests.get(img_url)
    img_name = os.path.join('images', img_url.split('/')[-1])
    with open(img_name, 'wb') as f:
        f.write(img_response.content)
    print(f'Downloaded: {img_name}')
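
Taking the last path segment with split('/')[-1] keeps any query string in the file name and produces an empty name when the URL ends with a slash. A slightly more defensive approach, sketched here with a hypothetical helper named filename_from_url and an assumed fallback name of 'image', uses only the standard library:

```python
import os
from urllib.parse import unquote, urlparse


def filename_from_url(img_url, default='image'):
    """Derive a local file name from an image URL.

    filename_from_url and the 'image' fallback are assumptions of this sketch.
    """
    # urlparse separates the path from the query string and fragment
    path = urlparse(img_url).path
    # unquote decodes percent-escapes such as %20
    name = os.path.basename(unquote(path))
    return name or default
```

With this helper, the save path could be built as os.path.join('images', filename_from_url(img_url)).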

5. Using the Scrapy framework

Scrapy is a powerful crawling framework suited to large-scale scraping jobs. Below is a simple Scrapy spider.

First, install Scrapy:

pip install scrapy

Then create a new Scrapy project and generate a spider:

scrapy startproject myproject
cd myproject
scrapy genspider myspider example.com

Implement the spider logic in myspider.py:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        for title in response.css('h2.title ::text').getall():
            yield {'title': title}

Run the spider:

scrapy crawl myspider -o output.json
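
The project created by scrapy startproject also contains a settings.py file, where crawl behaviour is configured. The fragment below is a sketch of a few settings that are commonly adjusted for polite crawling; the setting names are Scrapy's own, but the values and the user agent string are illustrative assumptions, not recommendations from the original article.

```python
# myproject/settings.py (fragment; values are illustrative assumptions)

# Respect the rules published in each site's robots.txt
ROBOTSTXT_OBEY = True

# Wait this many seconds between requests to the same site
DOWNLOAD_DELAY = 1.0

# Limit how many requests run in parallel against one domain
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Let Scrapy adapt the delay to the server's response times
AUTOTHROTTLE_ENABLED = True

# Identify the crawler; this string is a made-up placeholder
USER_AGENT = 'myproject-bot/0.1'
```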

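6. Using proxies and request headers

To reduce the chance of being blocked by a site, you can route requests through a proxy and set request headers.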

import requests

url = 'http://example.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
proxies = {
    'http': 'http://your_proxy:port',
    'https': 'http://your_proxy:port'
}

response = requests.get(url, headers=headers, proxies=proxies)
print(response.text)
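
When many requests share the same headers and proxy, they can be set once on a requests.Session instead of being passed to every call. This is a sketch only: build_session is a hypothetical helper name, and the user agent and proxy address in the usage note are placeholders.

```python
import requests


def build_session(user_agent, proxy_url=None):
    """Return a Session that applies the given headers and proxy to every request.

    build_session is a hypothetical helper used for this sketch.
    """
    session = requests.Session()
    # Session-level headers are merged into each request sent through the session
    session.headers.update({'User-Agent': user_agent})
    if proxy_url:
        # Use the same proxy for both plain and TLS traffic
        session.proxies.update({'http': proxy_url, 'https': proxy_url})
    return session
```

A session built with build_session('my-crawler/0.1', 'http://your_proxy:port') can then be used as session.get(url, timeout=10), and it also reuses the underlying connections between requests.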

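Summary

These six examples show how varied and capable web scraping can be. From fetching basic page content, through multi-page scraping and image downloads, to using the Scrapy framework, they cover the common ways to collect the data you need quickly and efficiently. In practice, you also need to respect each site's crawler policy (robots.txt) as well as legal and ethical constraints, and use scraping responsibly.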
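
The robots.txt check mentioned above can be sketched with the standard library's urllib.robotparser. The helper name is_allowed and the sample rules are assumptions made for illustration; the rules are parsed from a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url.

    is_allowed is a hypothetical helper used for this sketch.
    """
    parser = RobotFileParser()
    # parse() accepts the robots.txt content as a list of lines
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules, used only for illustration
sample_rules = """User-agent: *
Disallow: /private/
"""
allowed_public = is_allowed(sample_rules, '*', 'http://example.com/public')
allowed_private = is_allowed(sample_rules, '*', 'http://example.com/private/x')
```

For a live site, RobotFileParser also offers set_url() and read() to download the file directly before calling can_fetch().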

Published: 2024-09-30 20:50:16
Original link: http://makehui.com/houduan/2761.html
