Python爬虫之Scrapy环境搭建案例教程

Python爬虫之Scrapy环境搭建

如何搭建Scrapy环境

首先要安装Python环境,Python环境搭建见:https://blog.csdn.net/alice_tl/article/details/76793590

接下来安装Scrapy

1、安装Scrapy,在终端使用pip install Scrapy(注意最好是国外的环境)

进度提示如下:

alicedeMacBook-Pro:~ alice$ pip install Scrapy
Collecting Scrapy
  Using cached https://files.pythonhosted.org/packages/5d/12/a6197eaf97385e96fd8ec56627749a6229a9b3178ad73866a0b1fb377379/Scrapy-1.5.1-py2.py3-none-any.whl
Collecting w3lib>=1.17.0 (from Scrapy)
  Using cached https://files.pythonhosted.org/packages/37/94/40c93ad0cadac0f8cb729e1668823c71532fd4a7361b141aec535acb68e3/w3lib-1.19.0-py2.py3-none-any.whl
Collecting six>=1.5.2 (from Scrapy)
 xxxxxxxxxxxxxxxxxxxxx
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/setuptools/dist.py", line 380, in fetch_build_egg
        return cmd.easy_install(req)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/setuptools/command/easy_install.py", line 632, in easy_install
        raise DistutilsError(msg)
    distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('incremental>=16.10.1')

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-U_6VZF/Twisted/

出现缺少Twisted的错误提示:

Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-U_6VZF/Twisted/

2、安装Twiseted,终端里输入:sudo pip install twisted==13.1.0

alicedeMacBook-Pro:~ alice$ pip install twisted==13.1.0
Collecting twisted==13.1.0
  Downloading https://files.pythonhosted.org/packages/10/38/0d1988d53f140ec99d37ac28c04f341060c2f2d00b0a901bf199ca6ad984/Twisted-13.1.0.tar.bz2 (2.7MB)
    100% |████████████████████████████████| 2.7MB 398kB/s
Requirement already satisfied: zope.interface>=3.6.0 in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from twisted==13.1.0) (4.1.1)
Requirement already satisfied: setuptools in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from zope.interface>=3.6.0->twisted==13.1.0) (18.5)
Installing collected packages: twisted
  Running setup.py install for twisted ... error
    Complete output from command /usr/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-inJwZ2/twisted/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-record-OmuVWF/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    creating build
    creating build/lib.macosx-10.13-intel-2.7
    creating build/lib.macosx-10.13-intel-2.7/twisted
    copying twisted/copyright.py -> build/lib.macosx-10.13-intel-2.7/twisted
    copying twisted/_version.py -> build/li

3、再次使用sudo pip install scrapy安装,发现仍然出现错误提示,这次是没有安装lxml的错误提示:

Could not find a version that satisfies the requirement lxml (from Scrapy) (from versions: )

No matching distribution found for lxml (from Scrapy)

alicedeMacBook-Pro:~ alice$ sudo pip install Scrapy
The directory '/Users/alice/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/alice/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting Scrapy
  Downloading https://files.pythonhosted.org/packages/5d/12/a6197eaf97385e96fd8ec56627749a6229a9b3178ad73866a0b1fb377379/Scrapy-1.5.1-py2.py3-none-any.whl (249kB)
    100% |████████████████████████████████| 256kB 210kB/s
Collecting w3lib>=1.17.0 (from Scrapy)
  xxxxxxxxxxxx
  Downloading https://files.pythonhosted.org/packages/90/50/4c315ce5d119f67189d1819629cae7908ca0b0a6c572980df5cc6942bc22/Twisted-18.7.0.tar.bz2 (3.1MB)
    100% |████████████████████████████████| 3.1MB 59kB/s
Collecting lxml (from Scrapy)
  Could not find a version that satisfies the requirement lxml (from Scrapy) (from versions: )
No matching distribution found for lxml (from Scrapy)

4、安装lxml,使用:sudo pip install lxml

alicedeMacBook-Pro:~ alice$ sudo pip install lxml
The directory '/Users/alice/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/alice/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting lxml
  Downloading https://files.pythonhosted.org/packages/a1/2c/6b324d1447640eb1dd240e366610f092da98270c057aeb78aa596cda4dab/lxml-4.2.4-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (8.7MB)
    100% |████████████████████████████████| 8.7MB 187kB/s
Installing collected packages: lxml
Successfully installed lxml-4.2.4

5、再次安装scrapy,使用sudo pip install scrapy,安装成功

alicedeMacBook-Pro:~ alice$ sudo pip install Scrapy
The directory '/Users/alice/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/alice/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting Scrapy
  Downloading https://files.pythonhosted.org/packages/5d/12/a6197eaf97385e96fd8ec56627749a6229a9b3178ad73866a0b1fb377379/Scrapy-1.5.1-py2.py3-none-any.whl (249kB)
    100% |████████████████████████████████| 256kB 11.5MB/s
Collecting w3lib>=1.17.0 (from Scrapy)
  xxxxxxxxx
Requirement already satisfied: lxml in /Library/Python/2.7/site-packages (from Scrapy) (4.2.4)
Collecting functools32; python_version < "3.0" (from parsel>=1.1->Scrapy)
  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.org', port=443): Read timed out. (read timeout=15)",)': /simple/functools32/
  Downloading https://files.pythonhosted.org/packages/4b/2a/0276479a4b3caeb8a8c1af2f8e4355746a97fab05a372e4a2c6a6b876165/idna-2.7-py2.py3-none-any.whl (58kB)
    100% |████████████████████████████████| 61kB 66kB/s
Installing collected packages: w3lib, cssselect, functools32, parsel, queuelib, PyDispatcher, attrs, pyasn1-modules, service-identity, zope.interface, constantly, incremental, Automat, idna, hyperlink, PyHamcrest, Twisted, Scrapy
  Running setup.py install for functools32 ... done
  Running setup.py install for PyDispatcher ... done
  Found existing installation: zope.interface 4.1.1
    Uninstalling zope.interface-4.1.1:
      Successfully uninstalled zope.interface-4.1.1
  Running setup.py install for zope.interface ... done
  Running setup.py install for Twisted ... done
Successfully installed Automat-0.7.0 PyDispatcher-2.0.5 PyHamcrest-1.9.0 Scrapy-1.5.1 Twisted-18.7.0 attrs-18.1.0 constantly-15.1.0 cssselect-1.0.3 functools32-3.2.3.post2 hyperlink-18.0.0 idna-2.7 incremental-17.5.0 parsel-1.5.0 pyasn1-modules-0.2.2 queuelib-1.5.0 service-identity-17.0.0 w3lib-1.19.0 zope.interface-4.5.0

6、检查scrapy是否安装成功,输入scrapy --version

出现scrapy的版本信息,比如:Scrapy 1.5.1 - no active project即可。

alicedeMacBook-Pro:~ alice$ scrapy --version
Scrapy 1.5.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

PS:如果中途没有能够正常访问org网和使用sudo管理员权限安装,则会出现类似的错误提示

Exception:

Traceback (most recent call last):

  File "/Library/Python/2.7/site-packages/pip/_internal/basecommand.py", line 141, in main

    status = self.run(options, args)

  File "/Library/Python/2.7/site-packages/pip/_internal/commands/install.py", line 299, in run

    resolver.resolve(requirement_set)

Exception:
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/pip/_internal/basecommand.py", line 141, in main
    status = self.run(options, args)
  File "/Library/Python/2.7/site-packages/pip/_internal/commands/install.py", line 299, in run
    resolver.resolve(requirement_set)
  File "/Library/Python/2.7/site-packages/pip/_internal/resolve.py", line 102, in resolve
    self._resolve_one(requirement_set, req)
  File "/Library/Python/2.7/site-packages/pip/_internal/resolve.py", line 256, in _resolve_one
    abstract_dist = self._get_abstract_dist_for(req_to_install)
  File "/Library/Python/2.7/site-packages/pip/_internal/resolve.py", line 209, in _get_abstract_dist_for
    self.require_hashes
  File "/Library/Python/2.7/site-packages/pip/_internal/operations/prepare.py", line 283, in prepare_linked_requirement
    progress_bar=self.progress_bar
  File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 836, in unpack_url
    progress_bar=progress_bar
  File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 673, in unpack_http_url
    progress_bar)
  File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 897, in _download_http_url
    _download_url(resp, link, content_file, hashes, progress_bar)
  File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 617, in _download_url
    hashes.check_against_chunks(downloaded_chunks)
  File "/Library/Python/2.7/site-packages/pip/_internal/utils/hashes.py", line 48, in check_against_chunks
    for chunk in chunks:
  File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 585, in written_chunks
    for chunk in chunks:
  File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 574, in resp_read
    decode_content=False):
  File "/Library/Python/2.7/site-packages/pip/_vendor/urllib3/response.py", line 465, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/Library/Python/2.7/site-packages/pip/_vendor/urllib3/response.py", line 430, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/Library/Python/2.7/site-packages/pip/_vendor/urllib3/response.py", line 345, in _error_catcher
    raise ReadTimeoutError(self._pool, None, 'Read timed out.')
ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.

按照指南上搭建好了Scrapy的环境。

Scrapy爬虫运行常见报错及解决

按照第一个Spider代码练习,保存在 tutorial/spiders 目录下的 dmoz_spider.py 文件中:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body) 

terminal中运行:scrapy crawl dmoz,试图启动爬虫

报错提示一:

Scrapy 1.6.0 - no active project

Unknown command: crawl

alicedeMacBook-Pro:~ alice$ scrapy crawl dmoz
Scrapy 1.6.0 - no active project

Unknown command: crawl

Use "scrapy" to see available commands

原因是:在使用命令行startproject的时候,会自动生成scrapy.cfg。而使用命令行cmd启动爬虫时,crawl会去搜索cmd当前目录下的scrapy.cfg文件,官方文档中也进行了说明。找不到scrapy.cfg文件则认为没有该project。

解决方案:因此cd进入该dmoz项目的根目录,即scrapy.cfg文件在的目录,执行命令scrapy crawl dmoz

正常情况下得到的输出应该是:

2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)

2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...

2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}

2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...

2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...

2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...

2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...

2014-01-23 18:13:07-0400 [dmoz] INFO: Spider opened

2014-01-23 18:13:08-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)

2014-01-23 18:13:09-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)

然而实际不是

报错提示二:

  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiderloader.py", line 71, in load

    raise KeyError("Spider not found: {}".format(spider_name))

KeyError: 'Spider not found: dmoz'

alicedeMacBook-Pro:tutorial alice$ scrapy crawl dmoz
2019-04-19 09:28:23 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: tutorial)
2019-04-19 09:28:23 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 16:39:00) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Darwin-17.3.0-x86_64-i386-64bit
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiderloader.py", line 69, in load
    return self._spiders[spider_name]
KeyError: 'dmoz'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiderloader.py", line 71, in load
    raise KeyError("Spider not found: {}".format(spider_name))
KeyError: 'Spider not found: dmoz'

原因:定位的目录不正确,要进入到dmoz在的目录

解决方案:也比较简单,重新check目录进去即可

报错提示三:

 File "/Library/Python/2.7/site-packages/twisted/internet/_sslverify.py", line 15, in <module>
from OpenSSL._util import lib as pyOpenSSLlib
ImportError: No module named _util

alicedeMacBook-Pro:tutorial alice$ scrapy crawl dmoz
2018-08-06 22:25:23 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: tutorial)
2018-08-06 22:25:23 [scrapy.utils.log] INFO: Versions: lxml 4.2.4.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.10 (default, Jul 15 2017, 17:16:57) - [GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31)], pyOpenSSL 0.13.1 (LibreSSL 2.2.7), cryptography unknown, Platform Darwin-17.3.0-x86_64-i386-64bit
2018-08-06 22:25:23 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tutorial'}
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/Library/Python/2.7/site-packages/scrapy/cmdline.py", line 150, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Library/Python/2.7/site-packages/scrapy/cmdline.py", line 90, in _run_print_help
    func(*a, **kw)
  File "/Library/Python/2.7/site-packages/scrapy/cmdline.py", line 157, in _run_command
  t/ssl.py", line 230, in <module>
    from twisted.internet._sslverify import (
  File "/Library/Python/2.7/site-packages/twisted/internet/_sslverify.py", line 15, in <module>
    from OpenSSL._util import lib as pyOpenSSLlib
ImportError: No module named _util

网上查了很久的资料,仍然无解。部分博主说是pyOpenSSL或Scrapy的安装有问题,于是重新装了pyOpenSSL和Scrapy,但还是报同样错误,实在不知道怎么解决了。

后面重装了pyOpenSSL和Scrapy,貌似是解决了~

2019-04-19 09:46:37 [scrapy.core.engine] INFO: Spider opened
2019-04-19 09:46:37 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-19 09:46:39 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://www.dmoz.org/robots.txt> (referer: None)
2019-04-19 09:46:39 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2019-04-19 09:46:40 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>: HTTP status code is not handled or not allowed
2019-04-19 09:46:40 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2019-04-19 09:46:40 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/>: HTTP status code is not handled or not allowed
2019-04-19 09:46:40 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-19 09:46:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 737,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 2103,
 'downloader/response_count': 3,
 'downloader/response_status_count/403': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 19, 1, 46, 40, 570939),
 'httperror/response_ignored_count': 2,
 'httperror/response_ignored_status_count/403': 2,
 'log_count/DEBUG': 3,
 'log_count/INFO': 9,
 'log_count/WARNING': 1,
 'memusage/max': 65601536,
 'memusage/startup': 65597440,
 'response_received_count': 3,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/403': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2019, 4, 19, 1, 46, 37, 468659)}
2019-04-19 09:46:40 [scrapy.core.engine] INFO: Spider closed (finished)
alicedeMacBook-Pro:tutorial alice$ 

到此这篇关于Python爬虫之Scrapy环境搭建案例教程的文章就介绍到这了,更多相关Python爬虫之Scrapy环境搭建内容请搜索我们以前的文章或继续浏览下面的相关文章希望大家以后多多支持我们!

时间: 2021-07-19

python爬虫scrapy框架之增量式爬虫的示例代码

scrapy框架之增量式爬虫 一 .增量式爬虫 什么时候使用增量式爬虫: 增量式爬虫:需求 当我们浏览一些网站会发现,某些网站定时的会在原有的基础上更新一些新的数据.如一些电影网站会实时更新最近热门的电影.那么,当我们在爬虫的过程中遇到这些情况时,我们是不是应该定期的更新程序以爬取到更新的新数据?那么,增量式爬虫就可以帮助我们来实现 二 .增量式爬虫 概念: 通过爬虫程序检测某网站数据更新的情况,这样就能爬取到该网站更新出来的数据 如何进行增量式爬取工作: 在发送请求之前判断这个URL之前是不是

一文读懂python Scrapy爬虫框架

Scrapy是什么? 先看官网上的说明,http://scrapy-chs.readthedocs.io/zh_CN/latest/intro/overview.html Scrapy是一个为了爬取网站数据,提取结构性数据而编写的应用框架.可以应用在包括数据挖掘,信息处理或存储历史数据等一系列的程序中. 其最初是为了页面抓取 (更确切来说, 网络抓取 )所设计的, 也可以应用在获取API所返回的数据(例如 Amazon Associates Web Services ) 或者通用的网络爬虫. S

python爬虫scrapy基本使用超详细教程

一.介绍 官方文档:中文2.3版本 下面这张图大家应该很熟悉,很多有关scrapy框架的介绍中都会出现这张图,感兴趣的再去查询相关资料,当然学会使用scrapy才是最主要的. 二.基本使用 2.1 环境安装 1.linux和mac操作系统: pip install scrapy 2.windows系统: 先安装wheel:pip install wheel 下载twisted:下载地址 安装twisted:pip install Twisted‑17.1.0‑cp36‑cp36m‑win_amd

python scrapy项目下spiders内多个爬虫同时运行的实现

一般创建了scrapy文件夹后,可能需要写多个爬虫,如果想让它们同时运行而不是顺次运行的话,得怎么做? a.在spiders目录的同级目录下创建一个commands目录,并在该目录中创建一个crawlall.py,将scrapy源代码里的commands文件夹里的crawl.py源码复制过来,只修改run()方法即可! import os from scrapy.commands import ScrapyCommand from scrapy.utils.conf import arglist

Python爬虫基础讲解之scrapy框架

网络爬虫 网络爬虫是指在互联网上自动爬取网站内容信息的程序,也被称作网络蜘蛛或网络机器人.大型的爬虫程序被广泛应用于搜索引擎.数据挖掘等领域,个人用户或企业也可以利用爬虫收集对自身有价值的数据. 一个网络爬虫程序的基本执行流程可以总结三个过程:请求数据,解析数据,保存数据 数据请求 请求的数据除了普通的HTML之外,还有json数据.字符串数据.图片.视频.音频等. 解析数据 当一个数据下载完成后,对数据中的内容进行分析,并提取出需要的数据,提取到的数据可以以多种形式保存起来,数据的格式有非常多

Python爬虫框架-scrapy的使用

Scrapy Scrapy是纯python实现的一个为了爬取网站数据.提取结构性数据而编写的应用框架. Scrapy使用了Twisted异步网络框架来处理网络通讯,可以加快我们的下载速度,并且包含了各种中间件接口,可以灵活的完成各种需求 1.安装 sudo pip3 install scrapy 2.认识scrapy框架 2.1 scrapy架构图 Scrapy Engine(引擎): 负责Spider.ItemPipeline.Downloader.Scheduler中间的通讯,信号.数据传递

Python爬虫实战之使用Scrapy爬取豆瓣图片

使用Scrapy爬取豆瓣某影星的所有个人图片 以莫妮卡·贝鲁奇为例 1.首先我们在命令行进入到我们要创建的目录,输入 scrapy startproject banciyuan 创建scrapy项目 创建的项目结构如下 2.为了方便使用pycharm执行scrapy项目,新建main.py from scrapy import cmdline cmdline.execute("scrapy crawl banciyuan".split()) 再edit configuration 然后

Python爬虫之教你利用Scrapy爬取图片

Scrapy下载图片项目介绍 Scrapy是一个适用爬取网站数据.提取结构性数据的应用程序框架,它可以通过定制化的修改来满足不同的爬虫需求. 使用Scrapy下载图片 项目创建 首先在终端创建项目 # win4000为项目名 $ scrapy startproject win4000 该命令将创建下述项目目录. 项目预览 查看项目目录 win4000 win4000 spiders __init__.py __init__.py items.py middlewares.py pipelines

python 爬虫 实现增量去重和定时爬取实例

前言: 在爬虫过程中,我们可能需要重复的爬取同一个网站,为了避免重复的数据存入我们的数据库中 通过实现增量去重 去解决这一问题 本文还针对了那些需要实时更新的网站 增加了一个定时爬取的功能: 本文作者同开源中国(殊途同归_): 解决思路: 1.获取目标url 2.解析网页 3.存入数据库(增量去重) 4.异常处理 5.实时更新(定时爬取) 下面为数据库的配置 mysql_congif.py: import pymysql def insert_db(db_table, issue, time_s

python爬虫框架scrapy实战之爬取京东商城进阶篇

前言 之前的一篇文章已经讲过怎样获取链接,怎样获得参数了,详情请看python爬取京东商城普通篇,本文将详细介绍利用python爬虫框架scrapy如何爬取京东商城,下面话不多说了,来看看详细的介绍吧. 代码详解 1.首先应该构造请求,这里使用scrapy.Request,这个方法默认调用的是start_urls构造请求,如果要改变默认的请求,那么必须重载该方法,这个方法的返回值必须是一个可迭代的对象,一般是用yield返回. 代码如下: def start_requests(self): fo

Python 利用scrapy爬虫通过短短50行代码下载整站短视频

近日,有朋友向我求助一件小事儿,他在一个短视频app上看到一个好玩儿的段子,想下载下来,可死活找不到下载的方法.这忙我得帮,少不得就抓包分析了一下这个app,找到了视频的下载链接,帮他解决了这个小问题. 因为这个事儿,勾起了我另一个念头,这不最近一直想把python爬虫方面的知识梳理梳理吗,干脆借机行事,正凑着短视频火热的势头,做一个短视频的爬虫好了,中间用到什么知识就理一理. 我喜欢把事情说得很直白,如果恰好有初入门的朋友想了解爬虫的技术,可以将就看看,或许对你的认识会有提升.如果有高手路过,

Python利用Scrapy框架爬取豆瓣电影示例

本文实例讲述了Python利用Scrapy框架爬取豆瓣电影.分享给大家供大家参考,具体如下: 1.概念 Scrapy是一个为了爬取网站数据,提取结构性数据而编写的应用框架. 可以应用在包括数据挖掘,信息处理或存储历史数据等一系列的程序中. 通过Python包管理工具可以很便捷地对scrapy进行安装,如果在安装中报错提示缺少依赖的包,那就通过pip安装所缺的包 pip install scrapy scrapy的组成结构如下图所示 引擎Scrapy Engine,用于中转调度其他部分的信号和数据

Python使用Scrapy爬虫框架全站爬取图片并保存本地的实现代码

大家可以在Github上clone全部源码. Github:https://github.com/williamzxl/Scrapy_CrawlMeiziTu Scrapy官方文档:http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html 基本上按照文档的流程走一遍就基本会用了. Step1: 在开始爬取之前,必须创建一个新的Scrapy项目. 进入打算存储代码的目录中,运行下列命令: scrapy startproject CrawlMe

Python爬虫实例——scrapy框架爬取拉勾网招聘信息

本文实例为爬取拉勾网上的python相关的职位信息, 这些信息在职位详情页上, 如职位名, 薪资, 公司名等等. 分析思路 分析查询结果页 在拉勾网搜索框中搜索'python'关键字, 在浏览器地址栏可以看到搜索结果页的url为: 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=', 尝试将?后的参数删除, 发现访问结果相同. 打开Chrome网页调试工具(F12), 分析每条搜索结果

python爬虫 正则表达式使用技巧及爬取个人博客的实例讲解

这篇博客是自己<数据挖掘与分析>课程讲到正则表达式爬虫的相关内容,主要简单介绍Python正则表达式爬虫,同时讲述常见的正则表达式分析方法,最后通过实例爬取作者的个人博客网站.希望这篇基础文章对您有所帮助,如果文章中存在错误或不足之处,还请海涵.真的太忙了,太长时间没有写博客了,抱歉~ 一.正则表达式 正则表达式(Regular Expression,简称Regex或RE)又称为正规表示法或常规表示法,常常用来检索.替换那些符合某个模式的文本,它首先设定好了一些特殊的字及字符组合,通过组合的&

零基础写python爬虫之使用Scrapy框架编写爬虫

网络爬虫,是在网上进行数据抓取的程序,使用它能够抓取特定网页的HTML数据.虽然我们利用一些库开发一个爬虫程序,但是使用框架可以大大提高效率,缩短开发时间.Scrapy是一个使用Python编写的,轻量级的,简单轻巧,并且使用起来非常的方便.使用Scrapy可以很方便的完成网上数据的采集工作,它为我们完成了大量的工作,而不需要自己费大力气去开发. 首先先要回答一个问题. 问:把网站装进爬虫里,总共分几步? 答案很简单,四步: 新建项目 (Project):新建一个新的爬虫项目 明确目标(Item

一个月入门Python爬虫学习,轻松爬取大规模数据

Python爬虫为什么受欢迎 如果你仔细观察,就不难发现,懂爬虫.学习爬虫的人越来越多,一方面,互联网可以获取的数据越来越多,另一方面,像 Python这样的编程语言提供越来越多的优秀工具,让爬虫变得简单.容易上手. 利用爬虫我们可以获取大量的价值数据,从而获得感性认识中不能得到的信息,比如: 知乎:爬取优质答案,为你筛选出各话题下最优质的内容. 淘宝.京东:抓取商品.评论及销量数据,对各种商品及用户的消费场景进行分析. 安居客.链家:抓取房产买卖及租售信息,分析房价变化趋势.做不同区域的房价分

Python爬虫爬取一个网页上的图片地址实例代码

本文实例主要是实现爬取一个网页上的图片地址,具体如下. 读取一个网页的源代码: import urllib.request def getHtml(url): html=urllib.request.urlopen(url).read() return html print(getHtml(http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=%E5%A3%81%E7%BA%B8&ct=201326592&am