Implementing Web Scraping with BeautifulSoup
Feb 5, 2020 • 13 Minute Read
Introduction
The internet revolution has resulted in an explosion of data, and many companies are trying to extract and analyze as much as they can from the web. Web scraping is the process of fetching web pages and extracting information from them programmatically. In this guide, you will learn about web scraping using Python's powerful package, BeautifulSoup, which is used for parsing HTML and XML documents.
Let's start by loading the required libraries.
import pandas as pd
import numpy as np
import bs4
import requests
import urllib
import urllib.request
import re
from bs4 import BeautifulSoup
from urllib.request import urlretrieve
from urllib.request import urlopen, Request
Getting Web Data
In this guide, we'll scrape data from a Wikipedia article on the movie Avengers: Endgame. We start by specifying the URL of the web page. URL stands for Uniform Resource Locator, the address of a resource on the web. For our purposes, it has two components:
- Protocol identifier, denoted by https:
- Resource name, denoted by en.wikipedia.org/wiki/Avengers:_Endgame in this case
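As a quick check, the split into these two parts can be reproduced with the standard library's urllib.parse module. This is only an illustrative sketch: the variable names protocol and resource_name are our own, and the resource name is assembled here from the host and path that urlparse() reports.

```python
from urllib.parse import urlparse

url = "https://en.wikipedia.org/wiki/Avengers:_Endgame"
parts = urlparse(url)

# The scheme is the protocol identifier; host and path together form the resource name.
protocol = parts.scheme
resource_name = parts.netloc + parts.path

print(protocol)
print(resource_name)
```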
These two components specify the web address completely. The first line of code below stores the URL of the Wikipedia article, while the second line opens it and returns the HTTP response, whose body is the page's HTML. HTML stands for HyperText Markup Language and is the standard markup language for web pages. We then pass the response to the BeautifulSoup constructor to parse the HTML document, as shown in the third line of code. The 'lxml' argument selects the parser and requires the lxml package to be installed. The fourth line prints the type of the resulting object.
url = "https://en.wikipedia.org/wiki/Avengers:_Endgame"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
type(soup)
Output:
bs4.BeautifulSoup
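The same fetch-and-parse steps can be wrapped in small helper functions. The sketch below rests on a few assumptions that are not part of the original code: the helper names make_soup and fetch_soup are hypothetical, Python's built-in "html.parser" is swapped in for 'lxml' so that no extra package is needed, and the User-Agent string and the 10-second timeout are arbitrary illustrative values. The download helper is defined but not called, so the example runs without network access.

```python
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup


def make_soup(markup, parser="html.parser"):
    """Parse markup (a string, bytes, or file-like object) into a BeautifulSoup tree."""
    return BeautifulSoup(markup, parser)


def fetch_soup(url, timeout=10):
    """Download a page and parse it. Not called here because it needs network access."""
    # Assumption: the header value is arbitrary and only identifies the client.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (tutorial example)"})
    with urlopen(request, timeout=timeout) as response:
        return make_soup(response.read())


# Offline check on a made-up document.
sample_html = "<html><head><title>Sample page</title></head><body><p>Hello</p></body></html>"
sample_soup = make_soup(sample_html)

print(type(sample_soup))
print(sample_soup.title.string)
```

Calling fetch_soup(url) on the Wikipedia URL should give an object equivalent to soup, although the parsed tree can differ slightly between parsers when the markup is malformed.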
We can look at the structure of the object we created above using the code below.
print(soup.prettify())
Output:
<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>
Avengers: Endgame - Wikipedia
</title>
<script>
document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"XjNILwpAAEIAAJRf3S8AAAAU","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Avengers:_Endgame","wgTitle":"Avengers: Endgame","wgCurRevisionId":938381569,"wgRevisionId":938381569,"wgArticleId":44254295,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 uses Russian-language script (ru)","CS1 Russian-language sources (ru)","Wikipedia pages semi-protected against vandalism","Articles with short description","Use American English from October 2019",
"All Wikipedia articles written in American English","Use mdy dates from January 2020","Use list-defined references from October 2019","Pages using multiple image with manual scaled images","Articles with Encyclopædia Britannica links","Comics navigational boxes purge","2019 films","English-language films","2010s science fiction action films","2010s sequel films","2010s superhero films","2019 3D films","Alien invasions in films","Alternate timeline films","American 3D films","American films","American science fiction action films","American sequel films","Avengers (film series)","Crossover films","Films about extraterrestrial life","Films about quantum mechanics","Films about size change","Films about time travel","Films directed by Anthony and Joe Russo","Films featuring anthropomorphic characters","Films scored by Alan Silvestri","Films set in 1970","Films set in 2012","Films set in 2013","Films set in 2014","Films set in 2018","Films set in 2023","Films set in New Jersey",
</script>
<script>
(RLQ=window.RLQ||[]).push(function(){mw.loader.implement("user.tokens@tffin",function($,jQuery,require,module){/*@nomin*/mw.user.tokens.set({"patrolToken":"+\\","watchToken":"+\\","csrfToken":"+\\"});
});});
</script>
<link href="/w/load.php?lang=en&modules=ext.cite.styles%7Cext.uls.interlanguage%7Cext.visualEditor.desktopArticleTarget.noscript%7Cext.wikimediaBadges%7Cjquery.makeCollapsible.styles%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.skinning.interface%7Cmediawiki.toc.styles%7Cskins.vector.styles%7Cwikibase.client.init&only=styles&skin=vector" rel="stylesheet"/>
<script async="" src="/w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=vector">
</script>
<meta content="" name="ResourceLoaderDynamicStyles"/>
<link href="/w/load.php?lang=en&modules=site.styles&only=styles&skin=vector" rel="stylesheet"/>
<meta content="MediaWiki 1.35.0-wmf.15" name="generator"/>
<meta content="origin" name="referrer"/>
<meta content="origin-when-crossorigin" name="referrer"/>
<meta content="origin-when-cross-origin" name="referrer"/>
The command print(soup.prettify()) generates a long output, which has been truncated above for the sake of brevity.
The structure we created above can be navigated in several ways, some of which are highlighted below. The line of code below will extract the title of the web page.
print(soup.title)
Output:
<title>Avengers: Endgame - Wikipedia</title>
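Printing soup.title shows the whole tag, markup included. If only the text is needed, the tag's string attribute returns it, and the name and parent attributes allow moving around the tree. The sketch below uses a small made-up document and the built-in "html.parser" (both assumptions, so that it runs offline); the same attributes are available on the soup object built from the Wikipedia page.

```python
from bs4 import BeautifulSoup

# Made-up document that reuses the page title shown above.
html_doc = (
    "<html><head><title>Avengers: Endgame - Wikipedia</title></head>"
    "<body><h1>Avengers: Endgame</h1></body></html>"
)
demo_soup = BeautifulSoup(html_doc, "html.parser")

title_tag = demo_soup.title
title_text = title_tag.string       # text inside the tag, without the markup
parent_name = title_tag.parent.name  # name of the enclosing tag

print(title_tag.name)
print(title_text)
print(parent_name)
```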
The soup.text attribute, which is shorthand for calling soup.get_text(), returns the text of the document with the HTML tags removed, as shown below.
print(soup.text)
Output:
Avengers: EndgameTheatrical release posterDirected byAnthony RussoJoe RussoProduced byKevin FeigeScreenplay byChristopher MarkusStephen McFeelyBased onThe Avengersby Stan LeeJack KirbyStarring
Robert Downey Jr.
Chris Evans
Mark Ruffalo
Chris Hemsworth
Scarlett Johansson
Jeremy Renner
Don Cheadle
Paul Rudd
Brie Larson
Karen Gillan
Danai Gurira
Benedict Wong
Jon Favreau
Bradley Cooper
Gwyneth Paltrow
Josh Brolin
Music byAlan SilvestriCinematographyTrent OpalochEdited by
Jeffrey Ford
Matthew Schmidt
Productioncompany Marvel Studios Distributed byWalt Disney StudiosMotion PicturesRelease date
April 22, 2019 (2019-04-22) (Los Angeles Convention Center)
April 26, 2019 (2019-04-26) (United States)
Running time181 minutes[1]CountryUnited States[2]LanguageEnglishBudget$356 million[3]Box office$2.8 billion[3]
Avengers: Endgame is a 2019 American superhero film based on the Marvel Comics superhero team the Avengers, produced by Marvel Studios and distributed by Walt Disney Studios Motion Pictures. It is the sequel to 2012's The Avengers, 2015's Avengers: Age of Ultron, and 2018's Avengers: Infinity War, and the twenty-second film in the Marvel Cinematic Universe (MCU). It was directed by Anthony and Joe Russo and written by Christopher Markus and Stephen McFeely, and features an ensemble cast including Robert Downey Jr., Chris Evans, Mark Ruffalo, Chris Hemsworth, Scarlett Johansson, Jeremy Renner, Don Cheadle, Paul Rudd, Brie Larson, Karen Gillan, Danai Gurira, Benedict Wong, Jon Favreau, Bradley Cooper, Gwyneth Paltrow, and Josh Brolin. In the film, the surviving members of the Avengers and their allies attempt to reverse the damage caused by Thanos in Infinity War.
The film was announced in October 2014 as Avengers: Infinity War – Part 2, but Marvel later removed this title. The Russo brothers joined as directors in April 2015, with Markus and McFeely signing on to write the script a month later. The film serves as a conclusion to the story of the MCU up to that point, ending the story arcs for several main characters. Filming began in August 2017 at Pinewood Atlanta Studios in Fayette County, Georgia, shooting back-to-back with Infinity War, and ended in January 2018. Additional filming took place in the Metro and Downtown Atlanta areas, New York, Scotland, and England. The story revisits several moments from earlier films, bringing back actors and settings from throughout the franchise as well as music from previous films. The official title was revealed in December 2018. With an estimated budget of $356 million, it is one of the most expensive films ever made.
Avengers: Endgame was widely anticipated, and Disney backed the film with Marvel's largest marketing campaign. It premiered in Los Angeles on April 22, 2019, and was theatrically released in the United States on April 26. The film received praise for its direction, acting, musical score, action sequences, visual effects, and emotional weight, with critics lauding its culmination of the 22-film story. It grossed nearly $2.8 billion worldwide, surpassing Infinity War's entire theatrical run in just eleven days and breaking numerous box office records, including becoming the highest-grossing film of all time. The film received numerous awards and nominations, including a nomination for Best Visual Effects at the 92nd Academy Awards, three nominations at the 25th Critics' Choice Awards, and a nomination for Special Visual Effects at the 73rd British Academy Film Awards.
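Notice that some words in the output above run together, for example "Directed byAnthony RussoJoe Russo". This happens because the text of neighboring tags is joined with nothing in between. The get_text() method accepts separator and strip arguments that help here. The sketch below uses a made-up fragment shaped like one row of the article's infobox, parsed with the built-in "html.parser" (an assumption, so that it runs offline).

```python
from bs4 import BeautifulSoup

# Made-up fragment shaped like one row of the article's infobox.
row_html = "<table><tr><th>Directed by</th><td>Anthony Russo<br/>Joe Russo</td></tr></table>"
row_soup = BeautifulSoup(row_html, "html.parser")

# Default behavior: the pieces of text are joined with no separator.
run_together = row_soup.get_text()

# Put a space between pieces and strip surrounding whitespace from each one.
spaced = row_soup.get_text(separator=" ", strip=True)

print(run_together)  # Directed byAnthony RussoJoe Russo
print(spaced)        # Directed by Anthony Russo Joe Russo
```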
BeautifulSoup provides many more methods that make it easy to scrape information from a web page. One of them is find_all(), which returns every tag that matches a given filter. For example, the <a> tag defines a hyperlink, and its href attribute holds the link target. The for loop below finds all the <a> tags and prints the href of each one. Tags that have no href attribute print None.
web_links = soup.find_all("a")
for link in web_links:
    print(link.get("href"))
Output:
None
/wiki/Wikipedia:Protection_policy#semi
#mw-head
#p-search
/wiki/File:Avengers_Endgame_poster.jpg
/wiki/Russo_brothers
/wiki/Kevin_Feige
/wiki/Christopher_Markus_and_Stephen_McFeely
/wiki/Avengers_(comics)
/wiki/Stan_Lee
/wiki/Jack_Kirby
/wiki/Robert_Downey_Jr.
/wiki/Chris_Evans_(actor)
/wiki/Mark_Ruffalo
/wiki/Chris_Hemsworth
/wiki/Scarlett_Johansson
#cite_ref-LebowsChris_27-1
https://variety.com/2019/film/news/chris-hemsworth-fat-thor-avengers-endgame-1203226429/
/wiki/Variety_(magazine)
https://web.archive.org/web/20190529212103/https://variety.com/2019/film/news/chris-hemsworth-fat-thor-avengers-endgame-1203226429/
#cite_ref-JohanssonAvengers4_28-0
http://screenrant.com/avengers-4-images-japan-black-widow/
https://web.archive.org/web/20170823002522/http://screenrant.com/avengers-4-images-japan-black-widow/
#cite_ref-ControWidow_29-0
#cite_ref-ControWidow_29-1
https://ew.com/movies/2019/05/02/avengers-endgame-directors-black-widow-scene/
/wiki/Entertainment_Weekly
https://web.archive.org/web/20190507013118/https://ew.com/movies/2019/05/02/avengers-endgame-directors-black-widow-scene/
#cite_ref-JohanssonPrepare_30-0
https://www.hollywoodreporter.com/news/scarlett-johanssons-avengers-workout-how-get-a-black-widow-body-1204043
/wiki/The_Hollywood_Reporter
https://web.archive.org/web/20190502183033/https://www.hollywoodreporter.com/news/scarlett-johanssons-avengers-workout-how-get-a-black-widow-body-1204043
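The output above mixes several kinds of values: None for tags without an href, in-page anchors that start with #, paths relative to the site such as /wiki/Kevin_Feige, and full URLs. One possible way to clean this up is sketched below. It runs on a small made-up fragment with the built-in "html.parser" (assumptions, so that it works offline), and the choice to skip in-page anchors is ours, not something the original code does.

```python
import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://en.wikipedia.org/wiki/Avengers:_Endgame"

# Made-up markup that mirrors the kinds of links in the output above.
sample_html = """
<a id="top"></a>
<a href="#mw-head">Jump to navigation</a>
<a href="/wiki/Kevin_Feige">Kevin Feige</a>
<a href="https://variety.com/">Variety</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

absolute_links = []
# href=True keeps only the <a> tags that actually have an href attribute.
for link in sample_soup.find_all("a", href=True):
    href = link["href"]
    if href.startswith("#"):
        continue  # skip in-page anchors such as #mw-head
    # urljoin resolves relative paths against the page URL
    # and leaves absolute URLs unchanged.
    absolute_links.append(urljoin(base_url, href))

# A compiled regular expression can also serve as the href filter.
wiki_links = [a["href"] for a in sample_soup.find_all("a", href=re.compile(r"^/wiki/"))]

print(absolute_links)
print(wiki_links)
```

The same loop can be applied to the soup object from the Wikipedia page by replacing sample_soup with soup.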
Conclusion
In this guide, you learned about the basics of web scraping using the popular BeautifulSoup library in Python. You learned how to access web data and convert it into an HTML object, along with the basic methods of parsing it with the BeautifulSoup library.