Python Web Scraping (Selenium): Fetching Information from a Website and Storing It in a Database (MySQL)


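A Python Scraper Example Using Selenium and MySQL

Scraping and processing web data has become increasingly important. Python is easy to learn and is widely used for scraper development. Selenium is a browser automation tool that can simulate browser actions and thereby extract information from websites. This article shows how to use Selenium to fetch information from a web page and store it in a MySQL database.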

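Environment Setup

Before starting, make sure the following tools and libraries are installed in your development environment:

Python
MySQL database
Selenium library
MySQL Connector library
Chrome browser and the matching ChromeDriver

You can install the required Python libraries with the following command: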

pip install selenium mysql-connector-python

Creating the Database and Table

First, create a database and a table to hold the scraped data. In MySQL, the following SQL statements do this.

CREATE DATABASE web_data;

USE web_data;

CREATE TABLE articles (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    link TEXT NOT NULL
);
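
Both columns in this table are NOT NULL, and title is limited to 255 characters, so scraped values are worth checking before they reach the database. Depending on the server's SQL mode, an over-long title is either rejected or truncated. The following is a minimal sketch of such a check; the helper name clean_row and the constant TITLE_MAX_LEN are hypothetical and not part of the original script.

```python
from typing import Optional, Tuple

TITLE_MAX_LEN = 255  # matches title VARCHAR(255) in the articles table


def clean_row(title: Optional[str], link: Optional[str]) -> Optional[Tuple[str, str]]:
    """Return a (title, link) pair that fits the articles table, or None to skip the row."""
    title = (title or "").strip()
    link = (link or "").strip()
    # Both columns are NOT NULL, so a row missing either value is skipped.
    if not title or not link:
        return None
    # Truncate long titles so they fit the VARCHAR(255) column.
    return title[:TITLE_MAX_LEN], link
```

A scraper could call clean_row on each extracted pair and insert only the rows for which it returns a tuple.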

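Scraper Code Example

The following example uses Selenium to scrape data from a website and store it in a MySQL database. In this example, we assume the target is a news site and that we want each article's title and link.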

from selenium import webdriver
from selenium.webdriver.common.by import By
import mysql.connector
import time

# Database configuration
db_config = {
    'host': 'localhost',
    'user': 'your_user',
    'password': 'your_password',
    'database': 'web_data'
}

# Connect to MySQL database
def connect_to_db():
    try:
        conn = mysql.connector.connect(**db_config)
        return conn
    except mysql.connector.Error as err:
        print(f"Error: {err}")
        return None

# Insert data into the database
def insert_data(title, link):
    conn = connect_to_db()
    if conn:
        cursor = conn.cursor()
        cursor.execute("INSERT INTO articles (title, link) VALUES (%s, %s)", (title, link))
        conn.commit()
        cursor.close()
        conn.close()
        print(f"Inserted: {title}")

# Set up Selenium
options = webdriver.ChromeOptions()
options.add_argument('--headless')  # Enable headless mode
driver = webdriver.Chrome(options=options)

# Define the URL to be scraped
url = 'https://example.com/news'

try:
    driver.get(url)
    time.sleep(3)  # Wait for the page to fully load

    # Locate the articles
    articles = driver.find_elements(By.CLASS_NAME, 'article')  # Adjust to the actual page

    # Loop through each article and extract title and link
    for article in articles:
        title = article.find_element(By.TAG_NAME, 'h2').text  # Adjust to the actual page
        link = article.find_element(By.TAG_NAME, 'a').get_attribute('href')  # Adjust to the actual page
        insert_data(title, link)

finally:
    driver.quit()
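
The script waits a fixed three seconds with time.sleep, which can be too short on a slow page and wasted time on a fast one. Selenium also provides explicit waits through WebDriverWait. The sketch below shows one way to use it, assuming the articles carry the class name 'article' as in the script; the helper name wait_for_articles and the 10-second default timeout are hypothetical.

```python
def wait_for_articles(driver, timeout=10, class_name="article"):
    """Block until at least one element with class_name is present, then return all matches."""
    # Imported here so the helper can be defined even where Selenium is not installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    # Raises selenium.common.exceptions.TimeoutException if nothing matches in time.
    return wait.until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, class_name))
    )
```

In the script above, a call such as articles = wait_for_articles(driver) could stand in for both the time.sleep(3) line and the find_elements line.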

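Code Walkthrough

Database connection: the mysql.connector library is used to connect to the MySQL database.
Data insertion: the insert_data function inserts one row into the database. It opens and closes a connection on every call.
Selenium setup: Selenium is configured and opens the target page. Headless mode is enabled so the browser runs in the background.
Data extraction: find_elements locates all articles by the given class name, and the title and link are then extracted from each one.
Storing results: each title and link is passed to insert_data and written to the database.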
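
Because insert_data connects to the database once per article, a page with many articles opens many short-lived connections. One alternative is to collect the rows first and write them over a single connection. The sketch below assumes the caller passes in an already open connection (for example, one returned by connect_to_db); the names save_articles and INSERT_SQL are hypothetical.

```python
INSERT_SQL = "INSERT INTO articles (title, link) VALUES (%s, %s)"


def save_articles(conn, rows):
    """Insert all (title, link) pairs over one open connection; return how many were sent."""
    rows = list(rows)
    if not rows:
        return 0
    cursor = conn.cursor()
    try:
        # executemany runs the same statement once per parameter tuple.
        cursor.executemany(INSERT_SQL, rows)
        conn.commit()
    except Exception:
        # Undo the partial batch so the table is left unchanged, then re-raise.
        conn.rollback()
        raise
    finally:
        cursor.close()
    return len(rows)
```

The caller stays responsible for closing the connection, so the same connection can be reused across several pages.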

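Summary

With the steps above, we built a simple Python scraper that uses Selenium to fetch data from a website and stores it in a MySQL database. The example is a basic implementation, but the same approach can be used to scrape more complex data. In practice, respect the site's robots.txt and follow the ethical norms of web scraping.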
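
Checking robots.txt can be done with Python's standard library before any page is requested. The sketch below parses robots.txt text that has already been obtained and asks whether a URL may be fetched; the helper name is_allowed and the sample rules are hypothetical and used only for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, used only for illustration.
sample = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(sample, "https://example.com/news"))       # True
print(is_allowed(sample, "https://example.com/private/x"))  # False
```

To read a live file instead of a string, RobotFileParser also offers set_url and read, which download and parse the site's robots.txt directly.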

Published: 2024-10-10 06:49:02
Article link: http://makehui.com/houduan/4913.html
