Distributed Crawling with Scrapy

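This section briefly introduces Scrapy-Redis, the framework used here to make Scrapy distributed, and then uses the plugin to build a simple distributed crawler.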

1. A Simple Distributed Crawler Example

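We take the Toutiao hot-news crawler from Lecture 16 as the starting point and adapt it with the scrapy-redis plugin so that it supports distributed crawling. The steps are as follows.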

Prepare the environment. Because of limited resources we only have two cloud hosts, named server and server2. Their roles are as follows:

Host	Services	Public IP
server	Scrapy crawler	180.76.152.113
server2	Scrapy crawler, Redis service	47.115.61.209

First, get the Redis service ready. Installing Redis and setting its password were covered in Part 1, so those steps are not repeated here:

[root@server ~]# redis-cli -h 47.115.61.209 -p 6777
47.115.61.209:6777> auth spyinx
OK
47.115.61.209:6777> get hello
"new world"

Run this test on both server and server2 to make sure each of them can connect to the Redis service on server2.
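The same connectivity check can also be scripted, which is handy when it has to be repeated on several hosts. The following is a minimal sketch, assuming the redis-py package is installed; the helper names `can_reach_redis` and `make_client` are hypothetical and not part of the original project.

```python
def can_reach_redis(client) -> bool:
    """Return True if the client answers PING, False on any connection or auth error."""
    try:
        return bool(client.ping())
    except Exception as exc:  # e.g. redis.exceptions.ConnectionError
        print(f"redis check failed: {exc}")
        return False


def make_client(host="47.115.61.209", port=6777, password="spyinx"):
    """Build a redis-py client for the Redis service on server2 (hypothetical helper)."""
    # Imported lazily so can_reach_redis() can be exercised without redis-py installed.
    import redis
    return redis.Redis(host=host, port=port, password=password, socket_timeout=3)


# Usage on each host:
#     print(can_reach_redis(make_client()))
```

Running the usage line on server and on server2 should print True on both before moving on.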

Install scrapy and scrapy-redis:

[root@server2 ~]# pip3 install scrapy scrapy-redis
# ...

Modify the spider code: change the base class from Spider to RedisSpider from the scrapy-redis plugin, and remove the start_requests() method:

# from scrapy import Request, Spider
from scrapy_redis.spiders import RedisSpider

# ...

class HotnewsSpider(RedisSpider):
    # ...
    
    # Comment out the start_requests() method
    # def start_requests(self):
        # request_url = self._get_url(max_behot_time)
        # self.logger.info(f"we get the request url : {request_url}")
        # yield Request(request_url, headers=headers, cookies=cookies, callback=self.parse)
        
    # ...
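The spider can drop start_requests() because its initial URLs no longer come from the spider itself: every crawler process takes work from one shared store. The toy simulation below illustrates that idea with plain in-memory structures. It is not how scrapy-redis is implemented; `SharedState`, `run_workers`, and the link graph in the usage note are all hypothetical.

```python
from collections import deque


class SharedState:
    """In-memory stand-in for the shared request queue and the set of seen URLs."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def push(self, url):
        # Enqueue a URL only the first time it is seen (duplicate filtering).
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def pop(self):
        return self.queue.popleft() if self.queue else None


def run_workers(state, worker_names, discover):
    """Let the workers take turns pulling URLs until the shared queue is empty."""
    handled = {name: [] for name in worker_names}
    while state.queue:
        for name in worker_names:
            url = state.pop()
            if url is None:
                break
            handled[name].append(url)
            for next_url in discover(url):
                state.push(next_url)
    return handled
```

For example, seeding the state with a single URL and a small hypothetical link graph such as `{"/1": ["/2", "/3"], "/2": ["/3", "/1"], "/3": []}` results in each URL being handled exactly once, split across the two workers.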

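Next, adjust the original pipelines.py. So that data is saved to the database in real time, we move the SQL commit statement, and we also remove the original email-sending feature: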

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
import logging
from string import Template
from itemadapter import ItemAdapter
import pymysql


from toutiao_hotnews.mail import HtmlMailSender
from toutiao_hotnews.items import ToutiaoHotnewsItem
from toutiao_hotnews.html_template import hotnews_template_html
from toutiao_hotnews import settings

class ToutiaoHotnewsPipeline:
    logger = logging.getLogger('pipelines_log')

    def open_spider(self, spider):
        # Open the database connection when the spider starts
        self.db = pymysql.connect(
            host=spider.settings.get('MYSQL_HOST', 'localhost'),
            user=spider.settings.get('MYSQL_USER', 'root'),
            password=spider.settings.get('MYSQL_PASS', '123456'),
            port=spider.settings.get('MYSQL_PORT', 3306),
            db=spider.settings.get('MYSQL_DB_NAME', 'mysql'),
            charset='utf8'
        )
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        # SQL insert statement
        sql = "insert into toutiao_hotnews(title, abstract, source, source_url, comments_count, behot_time) values (%s, %s, %s, %s, %s, %s)"
        if item and isinstance(item, ToutiaoHotnewsItem):
            self.cursor.execute(sql, (item['title'], item['abstract'], item['source'], item['source_url'], item['comments_count'], item['behot_time']))
        # The commit has been moved here so that each item is saved immediately
        self.db.commit()
        return item

    def query_data(self, sql):
        data = {}
        try:
            self.cursor.execute(sql)
            data = self.cursor.fetchall()
        except Exception as e:
            logging.error('database operate error:{}'.format(str(e)))
            self.db.rollback()
        return data

    def close_spider(self, spider):
        # Release the cursor and the connection when the spider closes
        self.cursor.close()
        self.db.close()
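Committing after every item means a failed insert should not leave the connection in a half-finished transaction. The sketch below shows one way to pair the per-item commit with a rollback. To stay runnable without a MySQL server it swaps in the standard-library sqlite3 module for pymysql and uses a reduced two-column table, so the class name, schema, and `count()` helper are assumptions made purely for illustration.

```python
import sqlite3

INSERT_SQL = "insert into toutiao_hotnews(title, source_url) values (?, ?)"


class CommitPerItemPipeline:
    """Toy pipeline: commit after each insert, roll back if the insert fails."""

    def open_spider(self, spider=None):
        # In-memory SQLite database standing in for the MySQL connection.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "create table toutiao_hotnews(title text not null, source_url text unique)"
        )

    def process_item(self, item, spider=None):
        try:
            self.db.execute(INSERT_SQL, (item["title"], item["source_url"]))
            self.db.commit()
        except sqlite3.Error as exc:
            # Undo the failed statement so later items start from a clean state.
            self.db.rollback()
            print(f"insert failed, rolled back: {exc}")
        return item

    def count(self):
        return self.db.execute("select count(*) from toutiao_hotnews").fetchone()[0]

    def close_spider(self, spider=None):
        self.db.close()
```

The same try/commit/except/rollback shape should carry over to the pymysql version, with `pymysql.MySQLError` as the exception type to catch.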

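Next comes settings.py. First set USER_AGENT, which practically every crawler needs. For the scrapy-redis plugin itself, we only have to configure its scheduler, its duplicate filter, and the Redis connection. To also store the scraped results in Redis, add the scrapy-redis item pipeline to ITEM_PIPELINES. Our configuration is as follows: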

# ...

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36'

# ...

ITEM_PIPELINES = {
    'toutiao_hotnews.pipelines.ToutiaoHotnewsPipeline': 300,
    # Enable the scrapy-redis pipeline to store the results in Redis
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

# ...
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
# URL used to connect to Redis
REDIS_URL = 'redis://:spyinx@47.115.61.209:6777'
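The REDIS_URL value packs the password, host, and port into one string, with an empty user name before the colon. As a quick sanity check, the sketch below takes the URL apart with the standard library. The `redis_conn` dictionary is only an illustration, and falling back to database 0 when the URL has no path is an assumption.

```python
from urllib.parse import urlparse

REDIS_URL = 'redis://:spyinx@47.115.61.209:6777'

parts = urlparse(REDIS_URL)
redis_conn = {
    "host": parts.hostname,
    "port": parts.port,
    "password": parts.password,
    # The path is empty in this URL, so fall back to database 0 (assumed default).
    "db": int(parts.path.lstrip("/") or 0),
}
print(redis_conn)
```

These are the same values passed to redis-cli earlier (host 47.115.61.209, port 6777, password spyinx).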

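With these small changes, a crawler that supports distributed operation is complete. Upload the crawler code to each cloud host and run scrapy crawl hotnews in the project directory. At this point every crawler sits idle and waits: the starting request URL has to be pushed manually into the Redis start-URL list, whose key defaults to hotnews:start_urls. The Redis command for adding it is: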

> lpush hotnews:start_urls url

To make this easier, I prepared a short Python script that generates the URL and pushes it to Redis:

# Location: in the toutiao_hotnews directory, at the same level as scrapy.cfg

import redis
from toutiao_hotnews.spiders.hotnews import HotnewsSpider

spider = HotnewsSpider()
r = redis.Redis(host='47.115.61.209', port=6777, password='spyinx', db=0)
# Build the first request URL and push it onto the spider's start-URL list
request_url = spider._get_url(0)
r.lpush("hotnews:start_urls", request_url)
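The key hotnews:start_urls is the spider name followed by :start_urls. The sketch below wraps that naming rule and the push in two small helpers so several seed URLs can be queued at once. The helper names are hypothetical, the default pattern is assumed to mirror the scrapy-redis default, and `client` can be any object with an `lpush(key, value)` method, such as a redis-py client.

```python
def start_urls_key(spider_name, pattern="%(name)s:start_urls"):
    """Build the start-URL list key for a spider, e.g. 'hotnews:start_urls'."""
    return pattern % {"name": spider_name}


def seed_start_urls(client, spider_name, urls):
    """Push every URL onto the spider's start-URL list and return the key used."""
    key = start_urls_key(spider_name)
    for url in urls:
        client.lpush(key, url)
    return key


# Usage with the script above, where r is the redis.Redis client:
#     seed_start_urls(r, "hotnews", [spider._get_url(0)])
```

Passing the client in as a parameter keeps the helper easy to try out with a stub object before pointing it at the real Redis service.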

Next, let's look at how this distributed crawler behaves when it runs.
