Scraping Media from the Web with Python
This guide will show you how to scrape ‘binary based’ types of files and understand how the remote webserver communicates.
May 9, 2019 • 10 Minute Read
Introduction
Most people start extracting data from websites in the form of text extracted from HTML. There is another rich vein of information available, however, in the form of multimedia. This collection of ‘binary based’ data includes images, videos, audio, and specially formatted documents like spreadsheets and PDF files, in addition to zipped or compressed data and more. This guide will show you how to scrape these types of files and understand how the remote webserver communicates the file type that it is sending to the scraping client.
Requirements
For this guide, we are going to use the Python ‘Requests’ library to get the data, and the ‘lxml’ library to parse the HTML that we download. These are very straightforward to use and suitable for most web-scraping purposes. We will also use the ‘pafy’ and ‘youtube-dl’ libraries to assist in scraping streaming video files from YouTube.
Prerequisites:
In general, once you have Python 3 installed correctly, you can install all four libraries using the ‘pip’ utility:
pip install requests
pip install lxml
pip install pafy
pip install youtube-dl
If pip is not installed, you can download and install it by following the installation instructions in the official pip documentation.
For simple web-scraping, an interactive editor like Microsoft Visual Studio Code (free to use and download) is a great choice, and it works on Windows, Linux, and Mac.
Getting Started - Scraping Simple Media Files
The first media file most developers who begin web-scraping come across is an image file format. Images can be presented to us in a webpage in many ways, but in general, they are given as simple URL-based links that are either absolute or relative. An absolute link includes everything we need to download the file and appears in the HTML code as follows:
http://www.howtowebscrape.com/examples/images/test1.jpg
A relative link on the other hand normally has only the path to the image, relative to the webpage you have called, which sometimes leads to confusion:
examples/media/images/test1.jpg
A relative link resolves to an absolute link once you add back the base path you used to call the HTML page that contains the image. For example, we call the webpage:
http://www.howtowebscrape.com/examples/media1.html
which contains the HTML tag:
<img src="/media/images/test1.jpg">
In this case, we’ll take the base path of the page we received the HTML from and prepend it to the image path to make the full, callable link.
The following diagram helps to explain the concept visually:
Regardless of how the image path is presented to us, we need to have a full valid link to be able to download the file.
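As a minimal sketch of this idea, the standard library's `urljoin` function can build the full link for us. The `resolve_link` helper name and the variant of the image path without a leading slash are assumptions added for illustration; the page URL is the one used above.

```python
from urllib.parse import urljoin

# Page URL taken from the example above
page_url = "http://www.howtowebscrape.com/examples/media1.html"


def resolve_link(page_url, src):
    """Return an absolute URL for src as found on page_url (hypothetical helper)."""
    return urljoin(page_url, src)


# An absolute link is returned unchanged
absolute = resolve_link(page_url, "http://www.howtowebscrape.com/examples/images/test1.jpg")
# A path without a leading slash is resolved against the page's directory
page_relative = resolve_link(page_url, "media/images/test1.jpg")
# A path with a leading slash is resolved against the site root
root_relative = resolve_link(page_url, "/media/images/test1.jpg")

print(absolute)
print(page_relative)
print(root_relative)
```

Note that under the standard URL rules a path starting with a slash is resolved against the site root rather than the page's directory, so the result can differ from simply prepending the page's base path. It is worth checking which form a given site actually uses.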
Extracting Media Links from the Webpage
Unless the media we are seeking to download is known to us already, we generally have to download a webpage and parse it to find the link we require. To learn more about downloading and working with HTML and scraping and parsing your first webpage, please see my previous guide Scraping Your First Webpage with Python.
The basic code needed to download the webpage and get our media target link is listed below with inline commenting to explain each line of code.
To download the page, we simply need to ask the requests library to ‘get’ it, so we declare a variable, for example called ‘page’, and the result of the call to ‘get’ is loaded into this variable.
from lxml import html, etree
import requests
# Get the original webpage html content
webpageLink = 'http://www.howtowebscrape.com/examples/simplescrape1.html'
page = requests.get(webpageLink)
# convert the data received into searchable HTML
extractedHtml = html.fromstring(page.content)
# use an XPath query to find the image link (the 'src' attribute of the 'img' tag).
imageSrc = extractedHtml.xpath("//img/@src") # returns a list; in our example, imageSrc[0] = '/images/GrokkingAlgorithms.jpg'
# strip off the actual *page* being called, as we only want the base URL
imageDomain = webpageLink.rsplit('/', 1) # returns a list; in our example, imageDomain[0] = 'http://www.howtowebscrape.com/examples'
Now that we have the link URL of the image, we can test to see if it is a relative or absolute link. One simple way to check for this is to examine the extracted link for the presence of ‘http’ at the start, which indicates that the Hypertext Transfer Protocol is being used.
# test if this is an absolute link or relative
if imageSrc[0].startswith("http"):
    # starts with http, therefore take this as the full link
    imageLink = imageSrc[0]
else:
    # does not start with http, therefore construct the full URL from the base URL plus the relative image link
    imageLink = str(imageDomain[0]) + str(imageSrc[0])
At this stage, we have a fully qualified URL, or web link, that we can use to download the media from the webserver itself. We will first extract the filename part of the link, then get the file from the webserver using ‘requests.get’, and finally save the data received to a file.
# extract file name from link
filename = imageLink.split("/")[-1]
# download image using GET
rawImage = requests.get(imageLink, stream=True)
# save the image received into the file
with open(filename, 'wb') as fd:
    for chunk in rawImage.iter_content(chunk_size=1024):
        fd.write(chunk)
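The save step above writes whatever the server returns, including an error page if the link is wrong. The following is a minimal sketch of a more defensive version, assuming the response object offers `raise_for_status()` and `iter_content()` as a Requests response does. The `save_response` helper and the `FakeResponse` class are hypothetical names introduced here so the example can run without a network connection.

```python
import os
import tempfile


def save_response(response, path, chunk_size=1024):
    """Write a streamed response body to path and return the bytes written.

    `response` is assumed to behave like a requests.Response object:
    it offers raise_for_status() and iter_content(chunk_size=...).
    """
    # Stop early on HTTP errors such as 404 so no error page is saved as media
    response.raise_for_status()
    written = 0
    with open(path, "wb") as fd:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty chunks
                fd.write(chunk)
                written += len(chunk)
    return written


class FakeResponse:
    """Hypothetical stand-in for requests.Response, used only for this demo."""

    def __init__(self, body):
        self.body = body

    def raise_for_status(self):
        pass  # the fake response always succeeds

    def iter_content(self, chunk_size=1024):
        # Yield the body in slices of at most chunk_size bytes
        for i in range(0, len(self.body), chunk_size):
            yield self.body[i:i + chunk_size]


demo_path = os.path.join(tempfile.gettempdir(), "demo_image.bin")
bytes_written = save_response(FakeResponse(b"x" * 2500), demo_path)
print(bytes_written)
```

With a real download, the same helper could be called as `save_response(requests.get(imageLink, stream=True), filename)`.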
Scraping Multiple Types of Media
We can use exactly the same concept for downloading any kind of binary file, once we know the absolute web link of the file we are looking for. To make this more efficient, we can build the main download code into a function call.
def downloadFile(AFileName):
    # extract file name from AFileName (the full URL of the file)
    filename = AFileName.split("/")[-1]
    # download file using GET
    rawImage = requests.get(AFileName, stream=True)
    # save the data received into the file
    with open(filename, 'wb') as fd:
        for chunk in rawImage.iter_content(chunk_size=1024):
            fd.write(chunk)
    return
As you can see, the function does exactly the same as the code we wrote previously but is wrapped in a function so that we can reuse it easily. To demonstrate the code working on various media types, we can call the function for video, audio, presentation, and zip archive files.
downloadFile("http://www.howtowebscrape.com/examples/media/images/BigRabbit.mp4")
downloadFile("http://www.howtowebscrape.com/examples/media/images/Clapping.mp3")
downloadFile("http://www.howtowebscrape.com/examples/media/images/SampleSlides.pptx")
downloadFile("http://www.howtowebscrape.com/examples/media/images/SampleZip.zip")
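Splitting the link on "/" works for the clean URLs above, but a link that carries a query string would leave that text in the file name. The following is a minimal sketch of a more careful approach using the standard library; the `filename_from_url` helper, the `download.bin` fallback name, and the `?version=2` query string are all assumptions added for illustration.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(url, default="download.bin"):
    """Return the last path segment of url, ignoring query string and fragment."""
    # urlparse separates the path from any '?query' or '#fragment' part
    path = urlparse(url).path
    # unquote turns escapes such as %20 back into plain characters
    name = posixpath.basename(unquote(path))
    # Fall back to a default when the URL ends in a slash
    return name or default


print(filename_from_url("http://www.howtowebscrape.com/examples/media/images/Clapping.mp3"))
print(filename_from_url("http://www.howtowebscrape.com/examples/media/images/SampleZip.zip?version=2"))
print(filename_from_url("http://www.howtowebscrape.com/"))
```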
Working with Content Attributes
Sometimes you can download a file and the name or file extension of the file may not match the content of the file itself. In these cases, we can examine the metadata sent by the webserver and use this to assist us in generating the correct file extension. The way we get access to this information is to examine the headers property of the response object after we make the connection:
rawImage = requests.get(imageLink, stream=True)
print(rawImage.headers)
This prints a dictionary-like set of headers as follows:
{'Content-Type': 'image/jpeg', 'Last-Modified': 'Tue, 07 May 2019 18:57:12 GMT', 'Accept-Ranges': 'bytes', 'ETag': '"1eae29ae65d51:0"', 'Server': 'Microsoft-IIS/8.5', 'X-Powered-By': 'ASP.NET', 'Date': 'Tue, 07 May 2019 22:29:10 GMT', 'Content-Length': '50424'}
From here, we can get the media type (image/jpeg) and the file size.
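As a rough sketch of how these headers might be used, the code below picks a file extension from the Content-Type value and reads the size from Content-Length. The helper names, the small table of preferred extensions, and the `.bin` fallback are assumptions made for this example. A plain dictionary stands in for the headers here; unlike the headers object returned by Requests, its keys are case-sensitive.

```python
import mimetypes

# Small explicit table for common types; an assumption chosen for this sketch
PREFERRED_EXTENSIONS = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "audio/mpeg": ".mp3",
    "video/mp4": ".mp4",
    "application/pdf": ".pdf",
    "application/zip": ".zip",
}


def extension_for(headers, default=".bin"):
    """Pick a file extension from the Content-Type header."""
    content_type = headers.get("Content-Type", "")
    # Drop parameters such as '; charset=utf-8' and normalise the case
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in PREFERRED_EXTENSIONS:
        return PREFERRED_EXTENSIONS[media_type]
    # Fall back to the standard library's table, then to the default
    return mimetypes.guess_extension(media_type) or default


def size_in_bytes(headers):
    """Return Content-Length as an int, or None if the header is absent."""
    value = headers.get("Content-Length")
    return int(value) if value is not None else None


headers = {"Content-Type": "image/jpeg", "Content-Length": "50424"}
print(extension_for(headers), size_in_bytes(headers))
```

With a real download, `rawImage.headers` could be passed in place of the sample dictionary.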
Scraping Streaming Video from YouTube
Media is not always presented to us as cleanly as the previous examples. To download a YouTube video we can use the ‘pafy’ library. YouTube video links can be difficult to identify, and the pafy library helps us with this problem. In addition to identifying the correct download link, the library also gives us access to other information about the video such as author, size, formats available, etc.
Downloading a YouTube file only requires a few lines of code:
import pafy
# Download the Pluralsight 'we are one' video
# url of video
url = "https://www.youtube.com/watch?v=TgRwoBgPM0o"
# create video object
video = pafy.new(url)
# extract information about best resolution video available
bestResolutionVideo = video.getbest()
# download the video
bestResolutionVideo.download()
Conclusion
Web-scraping is an important skill to have, especially for developers and for professionals who work in data, business intelligence, and data science. This guide has given a fast-track introduction to scraping different types of media from the web. If you wish to learn more about the subject, please consider the related courses Pluralsight has to offer.