Web Scraping with Beautiful Soup
Jan 8, 2019 • 13 Minute Read
Introduction
Web scraping is the process of extracting specific information as structured data from HTML/XML content. Data scientists and researchers often need to fetch and extract data from numerous websites to create datasets or to test and train algorithms, neural networks, and machine learning models. Many websites offer APIs, which are the preferred way to fetch structured data. However, there are times when no API is available or you want to bypass the registration process. Under these circumstances, the data can only be accessed via the web page. A manual process can be quite cumbersome and time-consuming when dealing with frequently changing data such as stocks, job listings, hotel bookings, or real estate, which needs to be accessed often. Python offers an automated way, through various modules, to fetch HTML content from the web (URL/URI) and extract data. This guide walks through the process of web scraping using the Beautiful Soup module.
Process of Web Scraping
The process of scraping includes the following steps:
- Make a request with the requests module via a URL.
- Retrieve the HTML content as text.
- Examine the HTML structure closely to identify the particular HTML element from which to extract data. To do this, right-click on the web page in the browser and select the Inspect option to view the structure. In Safari, first enable the developer menu via Safari -> Preferences -> Advanced -> Show Develop menu in menu bar.
- Use BeautifulSoup to find the particular element in the response and extract the text.
Follow the Web Requests in Python guide to learn how to make web requests in Python.
Basics of HTML and CSS
HTTP requests made to a URL are answered with an HTML web page. HTML and XML are markup languages that use tags to define the structure and formatting of text. HTML content can also contain CSS instructions within a style tag, which the browser interprets to apply styles and decorations. Below is an example of a typical HTML page:
<!DOCTYPE html> --> 1
<html> --> 2
<head> --> 3
<title>List of Game Engines</title>
<style type="text/css"> --> 4
table, th, td{
border: solid 2px black;
border-collapse: collapse;
border-color: lightgrey;
padding: 4px;
}
table th{
color: blue;
font-size: 20px;
}
table td{
color: green;
font-size: 25px;
}
</style>
</head>
<body> --> 5
<div> --> 6
<table align = "center"> --> 7
<tr>
<th>Name</th>
<th>Language</th>
<th>Platform</th>
</tr><br>
<tr>
<td>Unreal Engine</td>
<td>C++</td>
<td>Cross platform</td>
</tr>
</table>
</div>
</body>
</html>
HTML Description
Tags are written within angle brackets (<>); they can be paired (<title>) or unpaired (<br>). The numbered markers (--> 1 to --> 7) in the example above correspond, in order, to the points below:
- The <!DOCTYPE html> declaration indicates HTML5 syntax, which supports some newer tags like nav, header, etc.
- The html tag, also known as the root tag, contains all of the HTML content.
- head includes the CSS style code, JavaScript code, and meta tags.
- CSS is used to decorate content and can be added using the style tag.
- All rendered HTML content should be placed inside the body tag.
- div is used as a container to represent an area on the screen.
- The table tag is used to render data in the form of a table: th is for header cells (bold by default), td is for data cells, and tr is for rows. The example has two rows, which are known as siblings.
CSS and JavaScript files can be created separately and linked to multiple HTML pages using link or script tags.
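To connect this structure to the scraping that follows, here is a minimal sketch that parses a trimmed copy of the example page with Beautiful Soup and reads the table. The sample_html string is an abbreviated stand-in for the page above (styles omitted), not a real response.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the example page (assumption: styles left out for brevity)
sample_html = """
<!DOCTYPE html>
<html>
<head><title>List of Game Engines</title></head>
<body>
<div>
<table align="center">
<tr><th>Name</th><th>Language</th><th>Platform</th></tr>
<tr><td>Unreal Engine</td><td>C++</td><td>Cross platform</td></tr>
</table>
</div>
</body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

page_title = soup.title.get_text()                       # text of the <title> tag in <head>
headers = [th.get_text() for th in soup.find_all("th")]  # header cells of the first row
cells = [td.get_text() for td in soup.find_all("td")]    # data cells of the second row

print(page_title)
print(dict(zip(headers, cells)))  # pairs each heading with the value below it
```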
Prerequisites
Python has a rich collection of packages, and the pip tool is used to manage those packages in the current development environment. This guide will use the modules below:
- Requests: To make web requests
- Beautiful Soup: To extract data from the HTML response
Use the below pip command to install the required packages:
pip install beautifulsoup4 requests
Separate multiple packages with a space to install them in a single command.
Data Extraction and Cleaning
The first step of scraping is to get the data. For demonstration purposes, we will use the List of game engines page. Open the page and view its structure using the Inspect option.
This brings up the developer tools window, which displays the HTML element structure. There is a div with the id bodyContent which contains all the visible HTML elements:
- This is the table tag which contains the details about game engines.
- Every tr represents an entry in the list and contains column entries.
- This is the cursor, which highlights the corresponding element on the web page; here it is highlighting the first column of the second row, i.e. the "4A Engine" heading.
- Node provides the attributes of the selected node, like id, style, etc. Styles provides the details about the CSS code, and Layers provides the details about re-drawn content such as images.
- The Attribute section displays the name of the class, i.e. wikitable sortable jquery-tablesorter, which is a customizable name given to a group of CSS style properties applied to this table.
Scraping Script
Now we know about the specific HTML tags which contain the data, so let's jump straight into writing code.
The first step is to import the modules: BeautifulSoup for scraping and requests to make HTTP requests.
from bs4 import BeautifulSoup # BeautifulSoup is in bs4 package
import requests
Make an HTTP request to get HTML content via the specific URL.
URL = 'https://en.wikipedia.org/wiki/List_of_game_engines'
content = requests.get(URL)
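Note that requests.get does not raise an exception for an error status such as 404, and it has no timeout unless one is passed. A slightly more defensive variant might look like the sketch below; fetch_html is a hypothetical helper name and the 10-second timeout is an arbitrary assumption, neither comes from the original script.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on network or HTTP errors."""
    response = requests.get(url, timeout=timeout)  # give up after `timeout` seconds
    response.raise_for_status()                    # raise requests.HTTPError on 4xx/5xx
    return response.text
```

It could then be called as html = fetch_html(URL), with html passed to BeautifulSoup in place of content.text.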
Create a BeautifulSoup object and define the parser.
soup = BeautifulSoup(content.text, 'html.parser')
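Here html.parser is the parser built into Python's standard library, so it needs no extra installation. If no parser is specified, Beautiful Soup picks the best one installed, preferring lxml, which is fast and lenient but is an external C-based dependency that must be installed separately.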
Parsers convert the input into single entities known as tokens and further convert the tokens into a graph or a tree structure for processing.
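The tokenizing step can be observed with Python's built-in html.parser module, the same parser selected by 'html.parser' above. The sketch below is only an illustration of the idea: TokenLogger is a made-up class name, and Beautiful Soup's own tree builder does considerably more than collect tokens.

```python
from html.parser import HTMLParser

class TokenLogger(HTMLParser):
    """Record each token the parser reports, in document order."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        if data.strip():  # ignore whitespace-only text
            self.tokens.append(("text", data))

logger = TokenLogger()
logger.feed("<tr><td>Unreal Engine</td><td>C++</td></tr>")
logger.close()

for token in logger.tokens:
    print(token)
```

A tree builder then uses the start and end tokens to decide which nodes are nested inside which.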
BeautifulSoup can extract single or multiple occurrences of a specific tag and can also accept search criteria based on attributes such as:
- Find: This function takes the name of the tag as string input and returns the first found match of the particular tag from the webpage response as:
row = soup.find('tr') # Extract and return first occurrence of tr
print(row) # Print row with HTML formatting
print("=========Text Result==========")
print(row.get_text()) # Print row as text
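One detail worth knowing is that find returns None when nothing matches, so calling a method on the result fails with an AttributeError. A minimal sketch on a small inline document (the HTML string is made-up sample data):

```python
from bs4 import BeautifulSoup

html = "<table><tr><th>Name</th></tr><tr><td>Unreal Engine</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

first_row = soup.find("tr")   # only the first <tr>, although there are two
missing = soup.find("video")  # there is no <video> tag in this document

print(first_row.get_text())   # Name
print(missing)                # None

# Guard before calling methods on the result
video_text = missing.get_text() if missing is not None else ""
```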
- find_all: Use find_all to extract all the occurrences of a particular tag from the page response as:
rows = soup.find_all('tr')
for row in rows: # Print all occurrences
    print(row.get_text())
find_all returns a ResultSet object, which offers index-based access to the matches and can be iterated with a for loop.
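Because a ResultSet behaves like a list, it supports len, indexing, and slicing. The following sketch uses a small made-up table rather than the live page:

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr><th>Name</th><th>Language</th></tr>
<tr><td>Unreal Engine</td><td>C++</td></tr>
<tr><td>Godot</td><td>C++</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = soup.find_all("tr")
print(len(rows))               # 3
print(rows[0].get_text())      # NameLanguage
print(rows[-1].td.get_text())  # Godot

data_rows = rows[1:]           # slicing skips the header row
names = [row.td.get_text() for row in data_rows]
print(names)                   # ['Unreal Engine', 'Godot']
```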
Pass List: find_all can accept a list of tags, as in soup.find_all(['th', 'td']). It also accepts attribute filters as keyword arguments, such as id = True to match tags that have an id attribute and href = True to match tags that have an href attribute. The example below matches only tags that have both:
content = requests.get(URL)
soup = BeautifulSoup(content.text, 'html.parser')
tags = soup.find_all(id = True, href = True)
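The difference between these filters is easier to see on a small document. In this sketch the HTML, ids, and hrefs are invented for illustration:

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr><th>Name</th><th>Site</th></tr>
<tr><td>Engine A</td><td><a id="link-a" href="/a">home</a></td></tr>
<tr><td>Engine B</td><td><a href="/b">home</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# A list matches any of the listed tag names
cells = soup.find_all(["th", "td"])
print(len(cells))  # 6

# Several keyword filters must all hold for a tag to match
with_id_and_href = [tag["href"] for tag in soup.find_all(id=True, href=True)]
with_href = [tag["href"] for tag in soup.find_all(href=True)]
print(with_id_and_href)  # ['/a']
print(with_href)         # ['/a', '/b']
```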
Pass Function: A function can contain your customized logic to validate the tag. It must be defined before it is passed to find_all:
def isAnchorTagWithLargeText(tag):
    """ Return True for an anchor tag whose text is longer than 50 characters """
    return tag.name == 'a' and len(tag.get_text()) > 50

content = requests.get(URL)
soup = BeautifulSoup(content.text, 'html.parser')
tags = soup.find_all(isAnchorTagWithLargeText, limit = 10)
for tag in tags:
    print(tag.get_text())
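The same pattern works for any condition that can be computed from a tag. As a further sketch, is_external_link below is a hypothetical filter (not part of the original script) that keeps anchors whose href starts with http; the sample HTML is made up.

```python
from bs4 import BeautifulSoup

def is_external_link(tag):
    """Hypothetical filter: an <a> tag whose href starts with 'http'."""
    return tag.name == "a" and tag.get("href", "").startswith("http")

html = """
<p><a href="/wiki/Game_engine">internal</a>
<a href="https://example.com/one">first</a>
<a name="top">anchor without href</a>
<a href="https://example.com/two">second</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

matches = [a.get_text() for a in soup.find_all(is_external_link)]
limited = [a.get_text() for a in soup.find_all(is_external_link, limit=1)]
print(matches)  # ['first', 'second']
print(limited)  # ['first']
```

Using tag.get with a default avoids a KeyError for anchors that have no href attribute.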
- Attribute Driven Search: The result of the find_all function can also contain:
  - Rows from other tables
  - Unwanted values
These are not desired most of the time. So, attributes like id, class, or value are used to further refine the search.
Let's print the first table found (the content table) to identify its attributes:
table = soup.find('table') # find returns only the first match
print(table)
The content table has a unique CSS class attribute i.e. wikitable sortable which can be used to find the main content table as:
contentTable = soup.find('table', { "class" : "wikitable sortable"}) # Use dictionary to pass key : value pair
rows = contentTable.find_all('tr')
for row in rows:
    print(row.get_text())
Here find is more suitable than find_all, since only one table has wikitable sortable class property.
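Printing row.get_text() runs the cell values together, so for structured data it is usually more convenient to read each cell separately. The sketch below turns rows into dictionaries, assuming the first row holds the headings and every data row has the same number of cells; the inline table is sample data, and the real page may need extra handling for merged or missing cells.

```python
from bs4 import BeautifulSoup

html = """
<table class="wikitable sortable">
<tr><th>Name</th><th>Language</th></tr>
<tr><td> Unreal Engine </td><td>C++</td></tr>
<tr><td>Godot</td><td> C++ </td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable sortable"})
rows = table.find_all("tr")

# strip=True removes the surrounding whitespace from each cell
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]

records = []
for row in rows[1:]:
    values = [td.get_text(strip=True) for td in row.find_all("td")]
    records.append(dict(zip(headers, values)))

print(records)
# [{'Name': 'Unreal Engine', 'Language': 'C++'}, {'Name': 'Godot', 'Language': 'C++'}]
```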
Alternatively, the class_ keyword argument (not available in old versions) can be used, as in soup.find_all('table', class_ = "wikitable sortable"). The trailing underscore is needed because class is a reserved word in Python.
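A string containing two class names is compared against the exact value of the class attribute, so the order of the names matters. The sketch below, on made-up tables, contrasts that with matching a single class and with a CSS selector, which ignores order:

```python
from bs4 import BeautifulSoup

html = """
<table class="wikitable sortable"><tr><td>first</td></tr></table>
<table class="sortable wikitable"><tr><td>second</td></tr></table>
<table class="plain"><tr><td>third</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all("table", class_="wikitable sortable")  # exact attribute string
single = soup.find_all("table", class_="wikitable")          # any table carrying this class
css = soup.select("table.wikitable.sortable")                # both classes, in any order

print(len(exact), len(single), len(css))  # 1 2 2
```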
- Nested Tags: Nested tags can be found using the select method, which accepts a CSS selector:
print(soup.select("html head title")[0].get_text()) # List of game engines – Wikipedia
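Since select takes CSS selectors, it can also combine ids, classes, and nesting. A small sketch on invented markup, using select_one when only the first match is needed:

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>List of Game Engines</title></head>
<body>
<div id="bodyContent">
<table class="wikitable"><tr><td><a href="/wiki/A">A</a></td></tr></table>
<p><a href="/wiki/B">B</a></p>
</div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

title = soup.select_one("head > title").get_text()                   # direct child of <head>
table_links = [a["href"] for a in soup.select("table.wikitable a")]  # anchors inside the table
all_links = [a["href"] for a in soup.select("#bodyContent a")]       # anchors anywhere in the div

print(title)        # List of Game Engines
print(table_links)  # ['/wiki/A']
print(all_links)    # ['/wiki/A', '/wiki/B']
```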
Use of Regular Expression
Regular expressions allow you to find specific tags by matching a pattern instead of the entire value of an attribute. Beautiful Soup can take regular expression objects to refine the search. Below is an example that finds all the anchor tags whose title starts with Id Tech:
import re

contentTable = soup.find('table', { "class" : "wikitable sortable"})
rows = contentTable.find_all('a', title = re.compile('^Id Tech .*'))
print(rows)
for row in rows:
    print(row.get_text())
Explanations:
- ^: Start matching from the beginning of the value (otherwise the pattern can match anywhere, such as the middle).
- Id Tech: Match these exact characters.
- .*: . matches any character except a line break ('\n'), and * repeats the match zero or more times, so .* matches everything up to the end of the line.
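The effect of the ^ anchor can be checked with the re module alone. Beautiful Soup applies a compiled pattern with its search() method, so the sketch below does the same; the list of titles is made-up sample data.

```python
import re

anchored = re.compile("^Id Tech .*")
unanchored = re.compile("Id Tech .*")

titles = ["Id Tech 3", "Id Tech 4", "History of Id Tech 5", "Id Software"]

# With ^ only values that begin with "Id Tech " are kept
anchored_matches = [t for t in titles if anchored.search(t)]
# Without ^ the pattern may start matching in the middle of a value
unanchored_matches = [t for t in titles if unanchored.search(t)]

print(anchored_matches)    # ['Id Tech 3', 'Id Tech 4']
print(unanchored_matches)  # ['Id Tech 3', 'Id Tech 4', 'History of Id Tech 5']
```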
Parameters and Properties
Beautiful Soup search methods accept parameters like limit, string, and recursive, which can be applied as:
- Use limit = 2 to cap the number of results at two.
- Use contentTable.find_all('a', string = 'Alamo') to extract all anchor tags whose text is exactly Alamo.
- By default, Beautiful Soup searches through all of the descendants of a tag. Setting recursive = False restricts the search to the direct children of the tag on which the method is called.
contentTable = soup.find('table', { "class" : "wikitable sortable"})
rows = contentTable.find_all('a', string = 'C', limit = 2
#, recursive = False
)
# Output: [<a href="/wiki/C_(programming_language)" title="C (programming language)">C</a>]
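To see how the three parameters differ, here is a sketch on a small invented fragment in which some anchors are direct children of a div and others are nested one level deeper:

```python
from bs4 import BeautifulSoup

html = """
<div id="outer">
<a href="/c">C</a>
<p><a href="/cpp">C++</a> <a href="/c-again">C</a></p>
<a href="/rust">Rust</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", id="outer")

def hrefs(tags):
    """Small helper for display: collect the href of each tag."""
    return [tag["href"] for tag in tags]

all_links = hrefs(outer.find_all("a"))                      # every descendant <a>
direct_links = hrefs(outer.find_all("a", recursive=False))  # direct children of the div only
exact_c = hrefs(outer.find_all("a", string="C"))            # text must equal "C" exactly
first_two = hrefs(outer.find_all("a", limit=2))             # stop after two matches

print(all_links)     # ['/c', '/cpp', '/c-again', '/rust']
print(direct_links)  # ['/c', '/rust']
print(exact_c)       # ['/c', '/c-again']
print(first_two)     # ['/c', '/cpp']
```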
Beautiful Soup also allows you to use tag names as attributes to get the first occurrence of a tag:
content = requests.get(URL)
soup = BeautifulSoup(content.text, 'html.parser')
print(soup.head, soup.title)
print(soup.table.tr) # Print first row of the first table
Beautiful Soup also provides navigation properties like:
- next_sibling and previous_sibling: To traverse nodes at the same level, like tr or td elements within the same parent tag.
- next_element and previous_element: To move to the node that was parsed immediately after or before the current one, regardless of nesting level.
Multiple elements can also be traversed with next_siblings, previous_siblings, next_elements, and previous_elements.
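A common surprise with sibling navigation is that the whitespace between tags is itself a node. The sketch below, on a made-up two-row table, shows this and uses find_next_sibling to skip over text nodes:

```python
from bs4 import BeautifulSoup

html = """<table>
<tr id="first"><td>Unreal Engine</td><td>C++</td></tr>
<tr id="second"><td>Godot</td><td>C++</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")
first_row = soup.find("tr", id="first")

# The line break between the two rows is the immediate sibling
print(repr(first_row.next_sibling))             # '\n'

# find_next_sibling looks for the next sibling that is a matching tag
second_row = first_row.find_next_sibling("tr")
print(second_row["id"])                         # second

first_cell = first_row.td
print(first_cell.next_sibling.get_text())       # C++ (the neighbouring td)
print(first_cell.next_element)                  # Unreal Engine (the text inside the cell)
```

Here next_sibling stays at the same level, while next_element steps into the cell because the text was parsed right after the opening td tag.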
Key Points
- The logic to extract the data usually depends upon the HTML structure of the web page, so even small changes in that structure can break the logic.
- The content of a website can be subject to applicable laws, so make sure to read the terms and conditions about content.
- Use the prettify() method to print the formatted HTML response.
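As a quick illustration of the last point, prettify() returns the parsed document as a string with each tag on its own line and nested tags indented. The one-line table here is sample data:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<table><tr><td>Unreal Engine</td></tr></table>", "html.parser")

pretty = soup.prettify()  # a str, one tag per line, indented by nesting depth
print(pretty)
```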
The code for this script is available on GitHub for experimenting. It would be great to pick any content-based website and write your own script to scrape it. Happy Scraping!