Implementing Web Scraping with Scrapy
Mar 11, 2020 • 12 Minute Read
Introduction
Data is the new oil, many argue, as it becomes an increasingly valuable resource. With internet use growing, there is a massive amount of data on different websites. If you want to get data from web pages, you can either use an API (when one is available) or implement web scraping techniques. Web scrapers and crawlers read a website’s pages and feeds, analyzing the site’s structure and markup language for clues to extract data. Sometimes the data collected from scraping is fed into other programs for validation, cleaning, and input into a datastore. It may also be fed into other processes, such as natural language processing (NLP) toolchains or machine learning (ML) models.
There are a few Python packages you can use for web scraping, including Beautiful Soup and Scrapy, and we’ll focus on Scrapy in this guide. Scrapy makes it easy for us to quickly prototype and develop web scrapers.
In this guide, you will see how to scrape the IMDB website and extract some of its data into a JSON file.
What is Scrapy?
Scrapy is a free and open-source web crawling framework written in Python. It is a fast, high-level framework used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Scrapy uses spiders to define how a site should be scraped for information. It lets us determine how we want a spider to crawl, what information we want to extract, and how we can extract it.
Setup and Installation
Let’s talk about installation, creating a spider, and then testing it.
Step 1: Creating a Virtual Environment
It's best to create a separate virtual environment for Scrapy, because that isolates the installation and keeps it from affecting other programs on the machine.
First, install virtualenv using the command below.
$ pip install virtualenv
Now create a virtual environment with Python.
$ virtualenv scrapyvenv
On Linux/Mac, you can also specify which Python version the virtual environment should use:
$ virtualenv -p python3 scrapyvenv
After creating a virtual environment, activate it.
For Windows:
$ cd scrapyvenv
$ .\Scripts\activate
For Linux/Mac:
$ cd scrapyvenv
$ source bin/activate
Step 2: Installing Scrapy
Most of the dependencies will be installed automatically. Scrapy runs on Python 2.7 and Python 3 (recent Scrapy releases require Python 3).
- pip install: To install using pip, open the terminal and run the following command:
$ pip install scrapy
- conda install: To install using conda, open the terminal and run the following command:
$ conda install -c anaconda scrapy
If you have a problem installing the Twisted library, you can download a prebuilt wheel for your platform and install it locally.
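Once the installation finishes, you can confirm that the scrapy command is available by checking its version:
$ scrapy version
This prints the installed Scrapy version.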
Step 3: Creating a Scrapy Project
Since Scrapy is a framework, we need to follow some of the framework's conventions. To create a new project in Scrapy, use the startproject command. I have named my project webscrapy.
$ scrapy startproject webscrapy
This will create a webscrapy directory with the following contents:
webscrapy
├── scrapy.cfg         -- deploy configuration file of the Scrapy project
└── webscrapy          -- your Scrapy project module
    ├── __init__.py    -- module initializer (empty file)
    ├── items.py       -- project item definitions
    ├── middlewares.py -- project middlewares
    ├── pipelines.py   -- project pipelines
    ├── settings.py    -- project settings
    └── spiders        -- directory where spiders are kept
        └── __init__.py
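Project-wide behavior is configured in settings.py. For example, you may want to identify your crawler and be polite to the target site; the snippet below is an illustrative sketch using standard Scrapy settings, and the user agent URL is a placeholder:
# webscrapy/webscrapy/settings.py (excerpt)
USER_AGENT = 'webscrapy (+https://example.com)'   # identify your crawler (placeholder URL)
ROBOTSTXT_OBEY = True                             # respect the site's robots.txt
DOWNLOAD_DELAY = 1                                # wait one second between requests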
Create a Spider
Now, let's create our first spider. Use the genspider command, which takes the name of the spider and the URL it will crawl:
$ cd webscrapy
$ scrapy genspider imdb www.imdb.com
After running this command, Scrapy will automatically create a Python file named imdb.py in the spiders folder.
When you open that imdb.py file, you will see a class named ImdbSpider that inherits from scrapy.Spider and contains a method named parse, which we will discuss later.
import scrapy


class ImdbSpider(scrapy.Spider):
    name = 'imdb'
    allowed_domains = ['www.imdb.com']
    start_urls = ['http://www.imdb.com/']

    def parse(self, response):
        pass
A few things to note here:
- name: The name of the spider. In this case, it is imdb. Naming spiders properly becomes a huge relief when you have to maintain hundreds of spiders.
- allowed_domains: An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list won’t be followed.
- parse(self, response): This function is called whenever the crawler successfully crawls a URL, as shown in the sketch below.
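To get a feel for parse before building the real scraper, here is a minimal sketch of what it could do if you replaced the empty method inside ImdbSpider. It simply logs the <title> of each fetched page and is illustrative, not part of the generated spider:
    def parse(self, response):
        # log the <title> text of the page the spider just fetched
        page_title = response.css('title::text').extract_first()
        self.logger.info('Crawled %s, page title: %s', response.url, page_title)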
To run this spider, use the command below. Before running it, make sure that you are in the right directory (the project root, webscrapy).
$ scrapy crawl imdb
Note that the above command takes the spider's name as an argument.
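If you forget a spider's name, you can list every spider registered in the project:
$ scrapy list
imdb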
Scrape on IMDB
Let's now get all the table entries, such as title, year, and rating, from the IMDB Top 250 movies table.
Edit the spider imdb.py that was created earlier:
# importing scrapy
import scrapy


class ImdbSpider(scrapy.Spider):
    name = "imdb"
    allowed_domains = ["imdb.com"]
    start_urls = ['http://www.imdb.com/chart/top',]

    def parse(self, response):
        # table rows of all the movies
        rows = response.css('table[data-caller-name="chart-top250movie"] tbody[class="lister-list"] tr')
        for row in rows:
            # get the required text from each element
            yield {
                "title": row.css("td[class='titleColumn'] a::text").extract_first(),
                "year": row.css("td[class='titleColumn'] span::text").extract_first().strip("() "),
                "rating": row.css("td[class='ratingColumn imdbRating'] strong::text").extract_first(),
            }
Run the above imdb spider:
$ scrapy crawl imdb
You will get the following output:
{'title': 'The Shawshank Redemption', 'year': '1994', 'rating': '9.2'}
{'title': 'The Godfather', 'year': '1972', 'rating': '9.1'}
...
{'title': 'Swades', 'year': '2004', 'rating': '8.0'}
{'title': 'Song of the Sea', 'year': '2014', 'rating': '8.0'}
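If a selector ever returns nothing, or IMDB changes its markup, the Scrapy shell is a handy way to test selectors interactively before editing the spider. The session below is illustrative; the exact results depend on the live page:
$ scrapy shell "https://www.imdb.com/chart/top"
>>> rows = response.css('table[data-caller-name="chart-top250movie"] tbody[class="lister-list"] tr')
>>> len(rows)
250
>>> rows[0].css("td[class='titleColumn'] a::text").extract_first()
'The Shawshank Redemption'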
Create a More Advanced Scraper
Let's get more advanced with scraping IMDB. We will open the detail page of each of the 250 movies in the list and fetch its important features, such as director name, genre, cast members, etc.
Before diving into creating the spider, we have to create Movie and Cast items in items.py.
For more details, read the Scrapy Item documentation.
import scrapy


class MovieItem(scrapy.Item):
    title = scrapy.Field()
    rating = scrapy.Field()
    summary = scrapy.Field()
    genre = scrapy.Field()
    runtime = scrapy.Field()
    directors = scrapy.Field()
    writers = scrapy.Field()
    cast = scrapy.Field()


class CastItem(scrapy.Item):
    name = scrapy.Field()
    character = scrapy.Field()
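Scrapy Items behave much like Python dictionaries, so you can sanity-check the definitions above from a plain Python shell. The values below are placeholders, not real scraped data:
from webscrapy.items import MovieItem, CastItem

movie = MovieItem(title="Example Movie", rating="1")
movie["cast"] = [CastItem(name="Jane Doe", character="Narrator")]

# items support dict-style access and conversion
print(movie["title"])
print(dict(movie))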
Now that the Items are created, let's extend the spider.
Within this spider, we fetch the URL of each movie item and request that URL with parseDetailItem as the callback, which collects all the movie data from the movie detail page.
# importing scrapy
import scrapy

from webscrapy.items import MovieItem, CastItem


class ImdbSpider(scrapy.Spider):
    name = "imdb"
    allowed_domains = ["imdb.com"]
    base_url = "https://imdb.com"
    start_urls = ['https://www.imdb.com/chart/top',]

    def parse(self, response):
        # table rows of all the movies
        rows = response.css('table[data-caller-name="chart-top250movie"] tbody[class="lister-list"] tr')
        for row in rows:
            # rank of the movie, i.e., its position in the table (stored in the "rating" field)
            rating = row.css("td[class='titleColumn']::text").extract_first().strip()
            # relative URL of the detail page of that movie
            rel_url = row.css("td[class='titleColumn'] a::attr('href')").extract_first().strip()
            # add the domain to the relative URL
            col_url = self.base_url + rel_url
            # make a request to the above URL and call parseDetailItem on the response
            yield scrapy.Request(col_url, callback=self.parseDetailItem, meta={'rating': rating})

    # called for every movie fetched from the table
    def parseDetailItem(self, response):
        # create a MovieItem object
        item = MovieItem()
        # fetch the rank passed along in the request meta
        item["rating"] = response.meta["rating"]
        # get the required text from each element
        item['title'] = response.css('div[class="title_wrapper"] h1::text').extract_first().strip()
        item["summary"] = response.css("div[class='summary_text']::text").extract_first().strip()
        item['directors'] = response.css('div[class="credit_summary_item"] a::text').extract()[0].strip()
        item['writers'] = response.css('div[class="credit_summary_item"] a::text').extract()[1].strip()
        item["genre"] = response.xpath("//*[@id='title-overview-widget']/div[1]/div[2]/div/div[2]/div[2]/div/a[1]/text()").extract_first().strip()
        item["runtime"] = response.xpath("//*[@id='title-overview-widget']/div[1]/div[2]/div/div[2]/div[2]/div/time/text()").extract_first().strip()

        # create a list for the cast of the movie
        item["cast"] = list()
        # fetch all the cast of the movie from the table, skipping the header row
        for cast in response.css("table[class='cast_list'] tr")[1:]:
            castItem = CastItem()
            castItem["name"] = cast.xpath("td[2]/a/text()").extract_first().strip()
            castItem["character"] = cast.css("td[class='character'] a::text").extract_first()
            item["cast"].append(castItem)

        return item
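One caveat: IMDB's markup changes from time to time, so any of these selectors may eventually match nothing, and calling .strip() on the resulting None raises an AttributeError. A defensive variant (a sketch, not part of the spider above) is to give extract_first a default value:
# returns '' instead of None when the selector does not match, so .strip() is safe
item["title"] = response.css('div[class="title_wrapper"] h1::text').extract_first(default='').strip()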
Getting all the data on the command line is nice, but as a data scientist, it is preferable to have data in certain formats like CSV, Excel, JSON, etc. that can be imported into programs. Scrapy provides this nifty little functionality where you can export the downloaded content in various formats. Many of the popular formats are already supported.
Let's export our data in JSON format using the below command.
$ scrapy crawl imdb -o imdbdata.json -t json
You will get output like this in the file imdbdata.json:
...
{
    "rating": "4",
    ...
},
...
Conclusion
In this guide, we learned the basics of Scrapy and how to extract data in JSON format, though you can export to any supported file format. We have just scratched the surface of Scrapy’s potential as a web scraping tool.
I hope that from this guide, you understood the basics of Scrapy and are motivated to go deeper with this wonderful scraping tool. For a deeper understanding, you can always follow the Scrapy documentation or read my previous guide on Crawling the Web with Python and Scrapy.
If you have any questions related to this guide, feel free to ask me at Codealphabet.