A web crawler (also known as a spider or spiderbot) is an internet bot that continually browses web pages, typically for web indexing purposes.
Search engines typically use web crawling to scan the web and keep track of content, links, and the relations between websites. This data is then processed to work out which results best fit users' queries.

Crawlers consume resources on the systems they visit. For this reason, public sites that do not wish to be crawled can make this known to the crawling agent via a file named robots.txt placed under their root URL.
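
For example, a polite crawler can check a site's robots.txt before fetching a page. Below is a minimal sketch using Python's built-in urllib.robotparser module; the URL and the "mybot" user agent are just placeholder examples:

import urllib.robotparser

# Download and parse the site's robots.txt
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Ask whether our bot may fetch a given page
if rp.can_fetch("mybot", "https://example.com/some/page"):
    print("Allowed to crawl this page")
else:
    print("robots.txt disallows this page")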

Crawlers are also used by some websites to keep their own content up to date and aligned with target sources.

In this article I’ll show you how to create and configure a simple spiderbot (which crawls the posts on the duykhang.com home page) with a tiny computer like the Raspberry Pi.

What Is Scrapy

From the Scrapy official website:


Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Even though Scrapy was originally designed for web scraping, it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general purpose web crawler.

Ref. https://docs.scrapy.org/en/latest/intro/overview.html

 

Step-By-Step Procedure

First, please be sure that your system is up to date:

sudo apt update
sudo apt upgrade

Install Scrapy

Raspberry Pi OS Lite already includes Python 3, so you don’t need any specific setup to get Python working.

We do, however, need to install some required packages. From the terminal:

sudo apt install python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev

As you can see, the previous command also installs the python3-pip package, so we can now install Scrapy with pip. Please note that, unlike a standard Python package installation, you need to use “sudo” to make the scrapy command available directly from the terminal.

sudo pip3 install scrapy

During the Scrapy installation, some dependency version errors may occur. At the time of writing, an error comes up with the cryptography version:

pyopenssl 19.1.0 has requirement cryptography>=2.8, but you'll have cryptography 2.6.1 which is incompatible.

To solve this kind of error, simply install the required version of the package reported in the message. For example, I can fix my error with the following command:

sudo pip3 install cryptography==2.8

Finally, check that your Scrapy installation is OK:

pi@raspberrypi:~ $ scrapy version
Scrapy 2.1.0
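
As an extra check, you can also verify that the scrapy module is importable from Python itself; the following one-liner should print the same version number:

python3 -c "import scrapy; print(scrapy.__version__)"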

Create Your First Spiderbot

Scrapy is not complicated, but it requires a bit of study on the Scrapy tutorial pages. It can be run from an interactive shell (with the command “scrapy shell http://url…”). This is usually the best way to identify the tags you want to extract from the pages.

Another good practice is visiting the URL you want to crawl with the browser dev tools (for example, in Chrome -> Options -> Tools -> Developer Tools). Then, analyze the web page elements and identify the ones you want to extract.
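
For example, once Scrapy is installed you can open the interactive shell against the page you want to crawl and try out CSS selectors before writing any code. A short sketch follows (the selectors are the ones used by the spider later in this article and may need adjusting if the site theme changes):

scrapy shell https://duykhang.com

Then, inside the shell:

>>> response.css('div.post-wrapper article h2.entry-title a::text').get()   # first post title
>>> response.css('div.post-wrapper article time::text').get()               # first post date
>>> exit()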

Once you have identified your targets, you can build a simple crawler to run manually, or prepare a complete crawling project. Let’s try an example of a single standalone crawler, launched manually, which extracts post titles, summaries and dates from the duykhang.com home page.

You can either download my script “myspider.py” to your Raspberry Pi from my download area:

wget https://duykhang.com/download/python/myspider.py

Or manually create the spider script:

nano myspider.py

Insert the following code:

import scrapy


class QuotesSpider(scrapy.Spider):
    # Name used to identify this spider when launching it with Scrapy
    name = "duykhang"
    # Page(s) the crawl starts from
    start_urls = [
        'https://duykhang.com',
    ]

    def parse(self, response):
        # Each post on the home page is wrapped in a div.post-wrapper article element
        for post in response.css('div.post-wrapper article'):
            # Yield one item per post with its title, summary and date
            yield {
                'Title': post.css('h2.entry-title a::text').get(),
                'Description': post.css('div.post-content p::text').get(),
                'Date': post.css('time::text').get(),
            }
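
This spider only parses the home page. If you later want it to follow pagination links as well, a possible extension is to yield a follow-up request at the end of parse(); in the sketch below the 'a.next' selector is only a guess and depends on the actual theme markup:

    def parse(self, response):
        for post in response.css('div.post-wrapper article'):
            yield {
                'Title': post.css('h2.entry-title a::text').get(),
                'Description': post.css('div.post-content p::text').get(),
                'Date': post.css('time::text').get(),
            }
        # Follow the "next page" link, if any (selector is hypothetical)
        next_page = response.css('a.next::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)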

 

Run this spider by simply typing the following from the terminal:

scrapy runspider myspider.py -o duykhang.json

The final “-o duykhang.json” option writes the results to an output file named duykhang.json.

When the crawler finishes its job, you will find the new file (created if it doesn’t already exist) and you can view its content:

nano duykhang.json
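
If you prefer to inspect the results from Python instead of a text editor, here is a minimal sketch using the standard json module (it assumes the file holds a single JSON array, i.e. it was produced by a single run of the spider):

import json

# Load the feed exported by Scrapy and print one line per post
with open('duykhang.json') as f:
    posts = json.load(f)

for post in posts:
    print(post['Date'], '-', post['Title'])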

 

Each run will append the newly downloaded records to this JSON file. Note that appending a second run’s output to the same .json file leaves it with multiple concatenated JSON arrays, which most parsers will not accept; if you plan to run the spider repeatedly, the JSON Lines format (for example “-o duykhang.jl”) is usually more convenient.

 

Go ahead and leave a comment: