Web Scraping with Scrapy
Sep 6, 2019 • 17 Minute Read
Introduction
Extracting structured data from a website can be implemented using the requests and beautifulsoup libraries or the scrapy framework. Both are sufficient for extracting data from a static webpage, though, in terms of features, scrapy is a compelling choice because it has inbuilt support for downloading and processing content while applying restrictions, whereas beautifulsoup is only capable of extracting data from HTML that has already been downloaded.
Scrapy is an open source Python framework, specifically developed to:
- Automate the process of crawling through numerous websites while processing data, e.g. search engine indexing.
- Extract data from web pages or APIs.
- Apply URL restrictions and data storage mechanisms.
Scrapy offers a base structure to write your own spider or crawler. Spiders and crawlers can both be used for scraping, though a crawler provides inbuilt support for recursive web-scraping by following the extracted URLs. This guide will demonstrate the application and various features of scrapy by extracting data from the GitHub Trending page to collect the details of repositories.
Scraping Ethics
Websites provide a robots.txt file at their root URL (site-url/robots.txt) which defines the access policies for the website or its sub-directories, like:
User-Agent: *
Disallow: /posts/
User-agent: Googlebot
Allow: /*/*/main
Disallow: /raw/*
Here, Googlebot is allowed to access the main pages but not the raw sub-directory, and the posts section is prohibited for all bots. These rules must be followed to avoid getting blocked by the website. Additionally, use a delay between requests to avoid hitting the website constantly, as this could degrade its performance.
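Both of these courtesies can be enforced through Scrapy's project settings. As a minimal sketch (these are standard Scrapy settings that live in the project's settings.py file, created in the next section; the values shown are arbitrary):
# settings.py (illustrative values)
ROBOTSTXT_OBEY = True   # check robots.txt before making requests
DOWNLOAD_DELAY = 2      # wait 2 seconds between requests to the same website
Note that later in this guide ROBOTSTXT_OBEY is set to False purely to demonstrate the crawl command against the trending page.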
Prerequisites
Before going further, it is advisable to have a basic understanding of:
- HTML and CSS: I explained these previously in the Web Scraping with BeautifulSoup guide.
- Scrapy: Used to create a scrapy project and write spiders. Use the below pip command to install the required package:
pip install scrapy
Scrapy Project Setup
Execute the below command to create a Scrapy project:
scrapy startproject github_trending_bot
The startproject command will create a github_trending_bot directory inside the current directory. Use the cd command to change into it, and pwd (or cd alone on Windows) to check the name of the current directory. The github_trending_bot directory has the following contents:
github_trending_bot/
    scrapy.cfg                # configuration parameters for deployment
    github_trending_bot/      # python module, can be used for imports
        __init__.py           # required to import submodules in current directory
        __pycache__           # contains bytecode of compiled code
        items.py              # custom classes (like dictionaries) to store data
        middlewares.py        # middlewares are like interceptors to process request/response
        pipelines.py          # classes to perform tasks on scraped data, returned by spiders
        settings.py           # settings related to delay, cookies, used pipelines, etc.
        spiders/              # directory to store all spider files
            __init__.py
Spiders
A spider is a class responsible for extracting information from a website. It also contains additional information to apply or restrict the crawling process to specific domain names. To create a spider, use the genspider command:
cd github_trending_bot # move to project github_trending_bot directory
scrapy genspider GithubTrendingRepo github.com/trending/
The above command will create a GithubTrendingRepo.py file, shown below:
# -*- coding: utf-8 -*-
import scrapy


class GithubtrendingrepoSpider(scrapy.Spider):
    name = 'GithubTrendingRepo'
    allowed_domains = ['github.com/trending/']
    start_urls = ['http://github.com/trending//']

    def parse(self, response):
        pass
As you may have already inferred, the GithubtrendingrepoSpider class is a subclass of scrapy.Spider, and every spider should define the following properties:
- name: This should be a unique identifier across every spider because this will be used to run the spider with the crawl command.
- allowed_domains: This is an optional list of domains that can be crawled by this spider; other domain names will not be accessed during the crawling process.
- start_urls: This is a list of URLs used to begin the crawling.
- parse(self, response): This function will be called every time a response is acquired from a URL. The response object contains the HTML text response, HTTP status code, source url, etc. Currently, the parse() function is doing nothing and can be replaced with the below implementation to view the webpage data.
Execute Spider
- Replace the parse method with the below code to view the response content.
def parse(self, response):
    print("%s : %s : %s" % (response.status, response.url, response.text))
- Add ROBOTSTXT_OBEY=False in the settings.py file because, by default, the crawl command will verify against robots.txt and a True value will result in a forbidden access response.
- Use the crawl command with the spider name to execute the project:
scrapy crawl GithubTrendingRepo
You can also skip the startproject and crawl commands: write the python script for the spider class and then run the spidername.py file directly using the runspider command:
scrapy runspider github_trending_bot/spiders/GithubTrendingRepo.py
Use the scrapy fetch URL command to view the HTML response from a URL for testing purposes.
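For example, the trending page can be saved to a file for inspection (the --nolog flag simply suppresses log output so only the page body is printed; the output file name is arbitrary):
scrapy fetch --nolog https://github.com/trending > trending.html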
CSS and Xpath
Extracting data is one of the crucial and common tasks that occur while scraping a website. Every HTML element can be located using either unique CSS selectors or XPath expression syntax, as shown below:
- CSS: CSS is a language to apply styles to HTML documents. The style can be applied by using particular tags, unique IDs, and style classes, as shown in the example below:
<!DOCTYPE html>
<html>
  <head>
    <title><b>CSS Demo</b></title>
    <style type="text/css">
      div {                     /* style for tag */
        padding: 8px;
      }
      .container-alignment {    /* style using css classes */
        text-align: center;
      }
      .header {
        color: darkblue;
      }
      #container {              /* style using ids */
        color: rgba(0, 136, 218, 1);
        font-size: 20px;
      }
    </style>
  </head>
  <body>
    <div id="container" class="container-alignment">
      <p class="header">CSS Demonstration</p>
      <p>CSS Description paragraph</p>
    </div>
  </body>
</html>
In the above example, the header p tag can be accessed using the header class.
- :: : A double colon is used to select an attribute or text node of a tag. a::attr(href) selects the href attribute of an anchor tag, and p::text selects the text of a paragraph tag.
- space : A space is used to select a descendant (inner) tag, as in title b::text.
- XPath: XPath is an expression path syntax to find an object in DOM. XPath has its own syntax to find the node from the root element, either via an absolute path or anywhere in the document using a relative path. Below is the explanation of XPath syntax with examples:
- /: Selects a node from the root. /html/body/div[1] will find the first div inside body.
- //: Selects matching nodes anywhere below the current node. //p[1] will find the first p element.
- [@attributename='value']: Use this syntax to find a node with the required attribute value. //div[@id='container'] will find the div element whose id is container.
A simple way to get the XPath of an element is via the browser's inspect element option: right-click on the desired node and choose the Copy XPath option.
Read more about XPath to learn how to combine multiple attributes or use its supported functions.
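To make these selectors concrete, below is a small standalone sketch (not part of the project) that runs the CSS and XPath queries above against a trimmed version of the sample HTML using scrapy's Selector class:
from scrapy.selector import Selector

sample_html = """
<div id="container" class="container-alignment">
    <p class="header">CSS Demonstration</p>
    <p>CSS Description paragraph</p>
</div>
"""

selector = Selector(text=sample_html)
print(selector.css('p.header::text').get())                         # CSS Demonstration
print(selector.css('#container p::text').getall())                  # both paragraph texts
print(selector.xpath("//div[@id='container']/p[1]/text()").get())   # CSS Demonstration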
Data Extraction
Scrapy is equipped with CSS and XPath selectors to extract data from the URL response:
- Extract Text: Scrapy's scrapy.http.TextResponse object has a css(query) method which takes a string input and finds all possible matches for the passed CSS query pattern. To extract text with a CSS selector, simply pass a tag_name::text query to the css(query) method, which will return Selector objects, and then use get() to fetch the text of the first matched tag. Use the below code to extract the text of the title tag:
def parse(self, response):
    title_text = response.css('title::text')
    print(title_text.get())
This can also be done using xpath:
title_text = response.xpath('//title[1]/text()')
print(title_text.get())
//title[1]/text() indicates a text node under the first title tag and can be explained as below:
- //: This will start the search from the current tag, i.e. html.
- /: Indicates a child node in the hierarchical path.
- text(): Indicates a text node.
Both css and xpath methods return a SelectorList object which also supports css, xpath and re(regex) methods for data extractions.
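For illustration (assuming this runs inside the spider's parse method), a SelectorList returned by one query can be refined further with another selector or a regular expression:
anchors = response.css('a')                        # SelectorList of all anchor elements
print(anchors.xpath('./@href').getall()[:5])       # chain an XPath query on the CSS result
print(response.css('title::text').re(r'\w+'))      # apply a regex to the matched text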
- Extract All URLs and Corresponding Text: The list of all URLs can be extracted using
- css('a::attr(href)').getall(): Finds the a (anchor) tags with the href attribute.
- response.xpath('//a/@href').getall(): Finds the a (anchor) tags from the current node with the href attribute.
css_links = response.css('a::attr(href)').getall()
xpath_links = response.xpath('//a/@href').getall()
# print length of lists
print(len(css_links))
print(len(xpath_links))
- LxmlLinkExtractor as Filter: The drawback of the previous search is that it fetches all the URLs, including social media links and internal links to resources. So, LxmlLinkExtractor can be used as a filter to extract a list of specific URL(s):
# from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor # import required
trending_links = LxmlLinkExtractor(
    allow=r'^https://[a-z.]+/[a-z.]+$',
    deny_domains=['shop.github.com', 'youtube.com', 'twitter.com'],
    unique=True
).extract_links(response)
for link in trending_links:
    print("%s : %s " % (link.url, link.text))
The type of list element is scrapy.link.Link.
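To follow these extracted links by hand, each Link object can be turned into a new request. Below is a minimal sketch inside the spider, where parse_repo_page is a hypothetical callback name used only for this example:
def parse(self, response):
    trending_links = LxmlLinkExtractor(
        allow=r'^https://[a-z.]+/[a-z.]+$', unique=True
    ).extract_links(response)
    for link in trending_links:
        # schedule a follow-up request for each extracted link
        yield scrapy.Request(link.url, callback=self.parse_repo_page)

def parse_repo_page(self, response):
    # hypothetical callback: print the title of each followed page
    print(response.css('title::text').get())
The next section shows how CrawlSpider automates exactly this request-generation step.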
Recursive Crawler
Often it is required to extract links from a webpage and then extract further data from those extracted links. This process can be implemented using the CrawlSpider class, which provides inbuilt support for generating requests from extracted links. The CrawlSpider also supports crawling Rules, where each Rule defines:
- How links should be extracted from each web-page.
- How the result should be processed (using a callback method name).
Rules
Every Rule object takes an LxmlLinkExtractor object as a parameter, which will be used to filter links. LxmlLinkExtractor is a subclass of FilteringLinkExtractor, which does most of the filtration. A Rule object can be created by:
Rule(
    LxmlLinkExtractor(
        restrict_xpaths=["//ol[@id='repo-list']//h3/a/@href"],
        allow_domains=['github.com']),
    callback='parse'
)
- restrict_xpaths will only extract the links from the matched XPath HTML element.
- LxmlLinkExtractor will extract the links from an ordered list which has repo-list as the ID.
- //h3/a/@href: Here // will find an h3 heading element from the current element, i.e. ol.
- /a/@href indicates that the h3 elements should have a direct anchor tag with the href attribute.
LxmlLinkExtractor has various useful optional parameters, like allow and deny to match link patterns, and allow_domains and deny_domains to define desired and undesired domain names. tags and attrs are used to match specific tag and attribute values. It has restrict_css as well.
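As an illustration of these parameters (the patterns, domains, and CSS selector below are arbitrary examples, not part of the guide's project):
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

link_extractor = LxmlLinkExtractor(
    allow=r'/[\w-]+/[\w-]+$',            # keep only links matching this pattern
    deny=r'/login',                      # drop links containing /login
    allow_domains=['github.com'],        # stay on github.com
    deny_domains=['gist.github.com'],    # but skip gists
    restrict_css=['ol.repo-list'],       # only look for links inside this element
    tags=['a'], attrs=['href'],          # the defaults, shown explicitly
    unique=True,
)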
Fetch Trending Repositories Name and Details
Follow the below steps to create a custom crawler to fetch trending repositories:
- Create a new python script and create a new Githubtrendingrepocrawler class by extending CrawlSpider.
- Define the values of name and start_urls properties.
- Create a rules tuple of Rule objects.
- Optionally, use the callback property to provide methods to handle the result from a URL, filtered by a specific Rule object.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field


class PageContentItem(Item):  # A data storage class (like a dictionary) to store the extracted data
    url = Field()
    content = Field()


class Githubtrendingrepocrawler(CrawlSpider):  # 1
    name = 'GithubTrendingRepoCrawler'  # 2
    start_urls = ['http://github.com/trending/']  # 2

    # 3
    rules = (
        # Extract link from this path only
        Rule(
            LxmlLinkExtractor(
                restrict_xpaths=["//ol[@id='repo-list']//h3/a/@href"],
                allow_domains=['github.com']),
            callback='parse'
        ),
        # link should match this pattern and create new requests
        Rule(
            LxmlLinkExtractor(allow=r'https://github.com/[\w-]+/[\w-]+$', allow_domains=['github.com']),
            callback='parse_product_page'
        ),
    )

    # 4
    def parse_product_page(self, response):
        item = PageContentItem()
        item['url'] = response.url
        item['content'] = response.css('article').get()
        yield item
yield is like a return statement in that it sends a value back to the caller, but it doesn't stop the execution of the method. yield creates a generator which can be consumed later; if the body of a function contains yield, the function automatically becomes a generator function.
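A short, standalone sketch of that behavior outside of scrapy:
def count_up_to(limit):
    # a generator function: each yield hands back one value without ending the function
    n = 1
    while n <= limit:
        yield n
        n += 1

for number in count_up_to(3):
    print(number)  # prints 1, 2, 3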
Run this crawler using scrapy command by:
scrapy crawl GithubTrendingRepoCrawler
If a link matches multiple rules, then the first matching Rule object will be applied.
Recursively Extract Links or Data from Extracted Links
To continue crawling through previously extracted links, just use follow=True in the second Rule by:
rules = (
    Rule(
        LxmlLinkExtractor(allow=r'https://github.com/[\w-]+/[\w-]+$', allow_domains=['github.com']),
        callback='parse_product_page',
        follow=True  # this will continue crawling through the previously extracted links
    ),
)
DEPTH_LIMIT=1 can be added in the settings.py file to limit how many levels deep the extraction goes; 1 means do not extract links from newly extracted links.
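For example, in settings.py:
DEPTH_LIMIT = 1  # follow links found on the start pages, but not links found on the pages they lead to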
Considering the number of repositories and links in their description, this process can continue for hours or days. Use Command + C or Control + C to stop the process on the terminal.
Storing Data
Scrapy provides a convenient way to store yielded items in a separate file using the -o option:
scrapy crawl GithubTrendingRepoCrawler -o extracted_data_files/links_JSON.json
extracted_data_files is a folder and .json is the output file format. Scrapy also supports the .csv and .xml formats.
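For instance, the same crawl can be exported as CSV simply by changing the file extension (the file name here is arbitrary):
scrapy crawl GithubTrendingRepoCrawler -o extracted_data_files/trending_repos.csv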
Tips
- Scrapy has an inbuilt shell which can be used as:
scrapy shell 'https://github.com'
>>> response.css('title::text')
- settings.py provides information about the bot and contains flags like DEPTH_LIMIT, USER_AGENT, ROBOTSTXT_OBEY, DOWNLOAD_DELAY, etc. to control the behavior of bots executed via the crawl command.
- The xpath pattern also supports placeholders; the values of these placeholders can be changed by:
response.xpath('//div[@id=$value]/a/text()', value='content').get()
- Regular expressions can also be used to extract lists of string values using the re method:
response.xpath('//a/@href').re(r'https://[a-zA-Z]+\.com')  # [a-zA-Z]+ means match one or more letters
The code is available on github_trending_bot repo for demonstration and practice. Next, you can try to implement a news website crawler for fun.