Extracting Data from HTML with BeautifulSoup
To get the most out of BeautifulSoup, you only need a basic knowledge of HTML, which is covered in this guide.
Dec 19, 2019 • 14 Minute Read
Introduction
Nowadays everyone is talking about data and how it helps reveal hidden patterns and new insights. The right set of data can help a business improve its marketing strategy, which can increase overall sales. And let's not forget the popular example of politicians gauging public opinion before elections. Data is powerful, but it does not come for free. Gathering the right data is always expensive; think of surveys or marketing campaigns, for example.
The internet is a pool of data and, with the right set of skills, you can use it to gain a wealth of new information. You can always copy and paste the data into an Excel or CSV file, but that is also time-consuming and expensive. Why not hire a software developer who can get the data into a readable format by writing some jibber-jabber? Yes, it is possible to extract data from the web, and this "jibber-jabber" is called web scraping.
According to Wikipedia, Web Scraping is:
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites
BeautifulSoup is a popular Python library for scraping data from the web. To get the best out of it, you only need a basic knowledge of HTML, which is covered in this guide.
Components of a Webpage
If you already know basic HTML, you can skip this part.
The basic syntax of any webpage is:
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
</head>
<body>
    <h1 class="heading">My first Web Scraping with Beautiful Soup</h1>
    <p>Let's scrape the website using Python.</p>
</body>
</html>
Every tag in HTML can have attribute information (i.e., class, id, href, and other useful information) that helps in identifying the element uniquely.
For more information about basic HTML tags, check out w3schools.
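To see why attributes matter, here is a minimal, illustrative sketch that parses the small HTML sample above with Beautiful Soup (installation is covered below; the built-in html.parser is used here so nothing extra is required) and locates the h1 element by its class attribute:
from bs4 import BeautifulSoup

# The sample markup from above, as a plain Python string
sample_html = """
<html>
  <body>
    <h1 class="heading">My first Web Scraping with Beautiful Soup</h1>
    <p>Let's scrape the website using Python.</p>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")               # parse with the stdlib parser
heading = sample_soup.find("h1", attrs={"class": "heading"})          # locate the tag by its class attribute
print(heading.text.strip())  # -> My first Web Scraping with Beautiful Soup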
Steps for Scraping Any Website
To scrape a website using Python, you need to perform these four basic steps:

1. Send an HTTP GET request to the URL of the webpage that you want to scrape, which will respond with HTML content. We can do this by using the Requests library of Python.
2. Fetch and parse the data using Beautiful Soup and store it in some data structure such as a dict or list.
3. Analyze the HTML tags and their attributes, such as class, id, and other HTML tag attributes, and identify the tags in which your content lives.
4. Output the data in a file format such as CSV, XLSX, or JSON.

A minimal skeleton of these four steps is sketched below.
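Here is that skeleton as a short, hedged sketch; the URL and output file name are placeholders, and the full working version for the Wikipedia page is developed step by step in the rest of the guide:
import csv
import requests
from bs4 import BeautifulSoup

url = "https://example.com"                         # Step 1: page to scrape (placeholder)
html_content = requests.get(url).text               # Step 1: send the GET request

soup = BeautifulSoup(html_content, "html.parser")   # Step 2: parse the HTML

rows = []                                           # Step 2/3: store what we find
for tag in soup.find_all("p"):                      # Step 3: pick the tags your content lives in
    rows.append({"text": tag.text.strip()})

with open("output.csv", "w", newline="") as f:      # Step 4: export to CSV
    writer = csv.DictWriter(f, fieldnames=["text"])
    writer.writeheader()
    writer.writerows(rows)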
Understanding and Inspecting the Data
Now that you know basic HTML and its tags, you first need to inspect the page you want to scrape. Inspection is the most important job in web scraping; without knowing the structure of the webpage, it is very hard to get the information you need. To help with inspection, every browser, such as Google Chrome or Mozilla Firefox, comes with a handy feature called developer tools.
In this guide, we will be working with Wikipedia to scrape some table data from the page List of countries by GDP (nominal). This page contains a Lists heading with three tables of countries sorted by rank and GDP, as estimated by the "International Monetary Fund", the "World Bank", and the "United Nations". Note that these three tables are enclosed in an outer table.
To know about any element that you wish to scrape, just right-click on that text and examine the tags and attributes of the element.
Jump into the Code
In this guide, we will be learning how to do a simple web scraping using Python and BeautifulSoup.
Install the Essential Python Libraries
pip3 install requests beautifulsoup4 lxml
Note: If you are using Windows, use pip instead of pip3
Importing the Essential Libraries
Import the "requests" library to fetch the page content and bs4 (Beautiful Soup) for parsing the HTML page content.
from bs4 import BeautifulSoup
import requests
Collecting and Parsing a Webpage
In the next step, we will make a GET request to the URL and create a parse tree object (soup) with the help of BeautifulSoup and the "lxml" parser. Note that lxml is a third-party parser (installed alongside the other libraries above); if you prefer, you can also use Python's built-in "html.parser".
# importing the libraries
from bs4 import BeautifulSoup
import requests
url="https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse the html content
soup = BeautifulSoup(html_content, "lxml")
print(soup.prettify()) # print the parsed data of html
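As a small defensive variant (not part of the original snippet), you might also check the HTTP response before parsing and fall back to the built-in parser if lxml is unavailable; a hedged sketch:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
response = requests.get(url, timeout=10)   # fail fast instead of hanging forever
response.raise_for_status()                # raise an error for 4xx/5xx responses

try:
    soup = BeautifulSoup(response.text, "lxml")          # fast third-party parser
except Exception:
    soup = BeautifulSoup(response.text, "html.parser")   # standard-library fallback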
With our BeautifulSoup object, i.e., soup, we can move on and collect the required table data.
Before going to the actual code, let's first play with the soup object and print some basic information from it:
Example 1:
Let’s just first print the title of the webpage.
print(soup.title)
It will give an output as follows:
<title>List of countries by GDP (nominal) - Wikipedia</title>
To get the text without the HTML tags, we just use .text:
print(soup.title.text)
Which will result in:
List of countries by GDP (nominal) - Wikipedia
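A couple more quick navigation examples in the same spirit (purely illustrative): tag names used as attributes return the first matching tag, so soup.h1 gives the page's main heading.
print(soup.h1)        # the first <h1> tag on the page
print(soup.h1.text)   # its text only, e.g., "List of countries by GDP (nominal)"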
Example 2:
Now, let's get all the links on the page along with their attributes, such as href, title, and their inner text.
for link in soup.find_all("a"):
    print("Inner Text: {}".format(link.text))
    print("Title: {}".format(link.get("title")))
    print("href: {}".format(link.get("href")))
This will output all the available links from the page, along with the attributes mentioned above.
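If you only care about some of the links, you can filter them on an attribute such as href. A small, illustrative sketch that keeps only internal article links (those whose href starts with "/wiki/") might look like this:
# Illustrative filter: keep only internal Wikipedia article links
article_links = []
for link in soup.find_all("a", href=True):       # only anchors that actually have an href
    if link["href"].startswith("/wiki/"):
        article_links.append(link["href"])

print(len(article_links), "article links found")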
Now, let's get back on track and find our goal table.
Analyzing the outer table, we can see that its class attribute is wikitable and that it has two tr tags inside its tbody.
If you expand the tr tags, you will find that the first tr contains the headings of all three tables and the second tr contains the table data for all three inner tables.
Let's first get all three table headings:
Note that we are removing the newlines and the spaces to the left and right of the text by using simple string methods available in Python.
gdp_table = soup.find("table", attrs={"class": "wikitable"})
gdp_table_data = gdp_table.tbody.find_all("tr")  # contains 2 rows

# Get all the headings of Lists
headings = []
for td in gdp_table_data[0].find_all("td"):
    # remove any newlines and extra spaces from left and right
    headings.append(td.b.text.replace('\n', ' ').strip())
print(headings)
This will print a list of the three table headings; the first entry looks like this:
'Per the International Monetary Fund (2018)'
Next, let's extract the rows of each of the three inner tables and store them in a dictionary keyed by the table heading:
data = {}
for table, heading in zip(gdp_table_data[1].find_all("table"), headings):
    # Get headers of table i.e., Rank, Country, GDP.
    t_headers = []
    for th in table.find_all("th"):
        # remove any newlines and extra spaces from left and right
        t_headers.append(th.text.replace('\n', ' ').strip())

    # Get all the rows of table
    table_data = []
    for tr in table.tbody.find_all("tr"):  # find all tr's from table's tbody
        t_row = {}
        # Each table row is stored in the form of
        # t_row = {'Rank': '', 'Country/Territory': '', 'GDP(US$million)': ''}

        # find all td's(3) in tr and zip them with t_headers
        for td, th in zip(tr.find_all("td"), t_headers):
            t_row[th] = td.text.replace('\n', '').strip()
        table_data.append(t_row)

    # Put the data for the table with its heading.
    data[heading] = table_data

print(data)
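As a side note (not part of the original approach), if you only need the tables as tabular data, the pandas library can often parse HTML tables directly. A tiny sketch, assuming pandas is installed in addition to lxml:
# Alternative sketch using pandas (assumes: pip install pandas)
import pandas as pd

# read_html returns a list of DataFrames, one per HTML table found on the page
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)")
print(len(tables))       # number of tables pandas detected
print(tables[0].head())  # preview of the first table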
Writing Data to CSV
Now that we have created our data structure, we can export it to a CSV file by just iterating over it.
import csv

for topic, table in data.items():
    # Create a csv file for each table
    with open(f"{topic}.csv", 'w', newline='') as out_file:
        # Each of the 3 tables has the following headers
        headers = [
            "Country/Territory",
            "GDP(US$million)",
            "Rank"
        ]  # same keys as t_headers
        writer = csv.DictWriter(out_file, headers)
        # write the header
        writer.writeheader()
        for row in table:
            if row:
                writer.writerow(row)
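Since step 4 also mentioned other formats, here is a small, optional sketch that dumps the same data dictionary to a single JSON file using only the standard library (the file name gdp_tables.json is just an example):
# Optional: export the whole data dictionary to JSON instead
import json

with open("gdp_tables.json", "w") as out_file:
    json.dump(data, out_file, indent=2, ensure_ascii=False)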
Putting It Together
Let's join all the above code snippets.
Our complete code looks like this:
# importing the libraries
from bs4 import BeautifulSoup
import requests
import csv


# Step 1: Send an HTTP request to a URL
url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text


# Step 2: Parse the html content
soup = BeautifulSoup(html_content, "lxml")
# print(soup.prettify()) # print the parsed data of html


# Step 3: Analyze the HTML tags where your content lives
# Create a data dictionary to store the data.
data = {}
# Get the table having the class wikitable
gdp_table = soup.find("table", attrs={"class": "wikitable"})
gdp_table_data = gdp_table.tbody.find_all("tr")  # contains 2 rows

# Get all the headings of Lists
headings = []
for td in gdp_table_data[0].find_all("td"):
    # remove any newlines and extra spaces from left and right
    headings.append(td.b.text.replace('\n', ' ').strip())

# Get all the 3 tables contained in "gdp_table"
for table, heading in zip(gdp_table_data[1].find_all("table"), headings):
    # Get headers of table i.e., Rank, Country, GDP.
    t_headers = []
    for th in table.find_all("th"):
        # remove any newlines and extra spaces from left and right
        t_headers.append(th.text.replace('\n', ' ').strip())

    # Get all the rows of table
    table_data = []
    for tr in table.tbody.find_all("tr"):  # find all tr's from table's tbody
        t_row = {}
        # Each table row is stored in the form of
        # t_row = {'Rank': '', 'Country/Territory': '', 'GDP(US$million)': ''}

        # find all td's(3) in tr and zip them with t_headers
        for td, th in zip(tr.find_all("td"), t_headers):
            t_row[th] = td.text.replace('\n', '').strip()
        table_data.append(t_row)

    # Put the data for the table with its heading.
    data[heading] = table_data


# Step 4: Export the data to csv
"""
For this example let's create 3 separate csv files,
one for each of the 3 tables
"""
for topic, table in data.items():
    # Create a csv file for each table
    with open(f"{topic}.csv", 'w', newline='') as out_file:
        # Each of the 3 tables has the following headers
        headers = [
            "Country/Territory",
            "GDP(US$million)",
            "Rank"
        ]  # same keys as t_headers
        writer = csv.DictWriter(out_file, headers)
        # write the header
        writer.writeheader()
        for row in table:
            if row:
                writer.writerow(row)
Beware: Scraping Rules
Now that you have a basic idea about scraping with Python, it is important to understand the legality of web scraping before you start scraping a website. Generally, if you are using scraped data for personal use and do not plan to republish it, it may not cause any problems. Read the Terms of Use, Conditions of Use, and also the robots.txt before scraping a website. You must follow the robots.txt rules; otherwise, the website owner has every right to take legal action against you.
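You can even check the robots.txt rules programmatically using Python's standard library; a small sketch (the URLs mirror the page scraped above):
# Check whether a URL may be fetched, according to robots.txt (stdlib only)
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()

url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
print(rp.can_fetch("*", url))  # True if the generic user agent is allowed to fetch it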
Conclusion
This guide went through the process of scraping a Wikipedia page using Python 3 and Beautiful Soup and exporting the result to CSV files. We have learned how to scrape a basic website and fetch all of the useful data in just a couple of minutes.
You can continue to sharpen the art of scraping by moving on to new websites. Some good examples of data to scrape are:
- Weather forecasts
- Customer reviews and product pages
- Stock Prices
- Articles
Beautiful Soup is simple to use for small-scale web scraping. If you want to scrape webpages on a large scale, you can consider more advanced tools like Scrapy and Selenium.
Here are some of my other scraping guides:
- Crawling the Web with Python and Scrapy
- Advanced Web Scraping Tactics
- Best Practices and Guidelines for Scraping
I hope you liked this guide. If you have any queries regarding this topic, feel free to contact me at CodeAlphabet.
Learn More
Explore these Beautiful Soup and HTML courses from Pluralsight to continue learning: