Implementing Web Scraping with Selenium
Feb 15, 2020 • 13 Minute Read
Introduction
In recent years, front-end frameworks like Angular, React, and Vue have exploded in popularity. Webpages generated dynamically can offer a faster user experience, because the elements on the page are created and modified in the browser itself. These websites are of great benefit to users, but can be problematic when we want to scrape data from them. The simplest way to scrape such websites is to use an automated web browser, such as Selenium WebDriver, which can be controlled from several languages, including Python.
Selenium is a framework designed to automate tests for your web application. Through the Selenium Python API, you can access all the functionality of Selenium WebDriver intuitively. It provides a convenient way to drive Selenium webdrivers such as ChromeDriver, Firefox's geckodriver, etc.
In this guide, we will explore how to scrape webpages with the help of Selenium WebDriver and BeautifulSoup, using an example script that scrapes authors and courses from pluralsight.com for a given keyword.
Installation
Download Driver
Selenium requires a driver to interface with the chosen browser. Here are the links to some of the most popular browser drivers:

Browser | Download Link
---|---
Edge | https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
Firefox | https://github.com/mozilla/geckodriver/releases
Safari | https://webkit.org/blog/6900/webdriver-support-in-safari-10/
Chrome | https://sites.google.com/a/chromium.org/chromedriver/downloads

Make sure the driver is in a folder on your PATH; for example, on Linux, place it in /usr/bin or /usr/local/bin. Alternatively, you can place the driver in a known location and provide its executable_path afterward.
Install Required Packages
Install the Selenium, BeautifulSoup (bs4), and lxml Python packages, if they are not already installed.
```bash
pip install selenium
pip install bs4
pip install lxml
```
Initialize the Webdriver
Let's create functions to initialize the webdriver and add some options, such as running headless. In the code below, I have created two separate functions, one for Chrome and one for Firefox.
```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.firefox.options import Options as FirefoxOptions

# configure Chrome WebDriver
def configure_chrome_driver():
    # Add additional options to the webdriver
    chrome_options = ChromeOptions()
    # add the argument to make the browser headless
    chrome_options.add_argument("--headless")
    # Instantiate the webdriver with the executable path of the driver you downloaded.
    # If the driver is in PATH, there is no need to provide executable_path.
    driver = webdriver.Chrome(executable_path="./chromedriver.exe", options=chrome_options)
    return driver

# configure Firefox WebDriver
def configure_firefox_driver():
    # Add additional options to the webdriver
    firefox_options = FirefoxOptions()
    # add the argument to make the browser headless
    firefox_options.add_argument("--headless")
    # Instantiate the webdriver with the executable path of the driver you downloaded.
    # If the driver is in PATH, there is no need to provide executable_path.
    driver = webdriver.Firefox(executable_path="./geckodriver.exe", options=firefox_options)
    return driver
```
Making the Browser Headless
Headless browsers run without displaying any graphical UI, which makes automated runs faster and lets scripts run on servers or containers without a display. Selenium lets you make any supported browser headless by adding the --headless argument to its options. There are several other option parameters you can set for your Selenium webdriver. Check out some Chrome WebDriver options here.
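For example, several options can be combined when configuring the driver. The sketch below is only illustrative; the extra flags and the user-agent string are example values, not requirements of the script in this guide.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions

def configure_chrome_driver_with_extra_options():
    chrome_options = ChromeOptions()
    chrome_options.add_argument("--headless")                # run without a visible UI
    chrome_options.add_argument("--window-size=1920,1080")   # fixed viewport so layouts render consistently
    chrome_options.add_argument("--disable-gpu")             # commonly recommended for headless runs on Windows
    # an example user-agent string; some sites serve different markup to unknown agents
    chrome_options.add_argument("user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36")
    return webdriver.Chrome(executable_path="./chromedriver.exe", options=chrome_options)
```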
Locating the Elements on the Page
Selenium offers a wide variety of functions to locate an element on a web page:
```html
<div id="search-field">
  <input type="text" name="search-container" id="id_search_input" class="search_input" autocomplete="off">
  <input type="submit" class="search_submit btn btn-default">
</div>
```
```python
element = driver.find_element_by_id("id_search_input")            # by id
element = driver.find_element_by_class_name("search_input")       # by class
element = driver.find_element_by_name("search-container")         # by name
element = driver.find_element_by_xpath("//input[@type='text']")   # by xpath
```
If the element cannot be found, a NoSuchElementException is raised. You can read about more strategies to locate elements here.
XPath is a powerful language often used in scraping the web. You can learn more about XPath here.
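For instance, a few common XPath patterns look like this (assuming driver is an initialized webdriver; the attribute values and link text are only illustrative):

```python
# locate by attribute value
element = driver.find_element_by_xpath("//input[@id='id_search_input']")
# locate by partial class match
element = driver.find_element_by_xpath("//div[contains(@class, 'search-result')]")
# locate by visible text (the link text here is hypothetical)
element = driver.find_element_by_xpath("//a[text()='Load more results']")
# locate a child relative to its parent
element = driver.find_element_by_xpath("//div[@id='search-field']/input[@type='submit']")
```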
Not only can you locate elements on the page, you can also fill in a form by sending key input, add cookies, switch tabs, etc. You can read more about that here.
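As a rough sketch of those interactions (the cookie values and the URL opened in the new tab are illustrative, and driver is assumed to be an initialized webdriver):

```python
# type a query into the search box from the markup above and submit the form
search_box = driver.find_element_by_id("id_search_input")
search_box.send_keys("Machine Learning")
search_box.submit()

# add a cookie to the current browsing session (example name/value)
driver.add_cookie({"name": "example_cookie", "value": "12345"})

# open a new tab with JavaScript and switch the driver to it
driver.execute_script("window.open('https://www.pluralsight.com', '_blank');")
driver.switch_to.window(driver.window_handles[-1])
```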
Data Extraction
Let's now see how to extract the required data from a web page. In the code below, we define two functions, getCourses and getAuthors, which print the courses and authors, respectively, for a given search keyword.
Beautiful Soup is a convenient way to traverse the DOM and extract the data, so after navigating to the URL with the driver, we will transform the page source into a BeautifulSoup object. Before doing that, we can wait for the relevant element to load, and we can also load all of the paginated content by clicking Load More again and again (uncomment loadAllContent(driver) to see this in action). After that, we can quickly pull the required information from the page source using the select method.
```python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup

def getCourses(driver, search_keyword):
    # Step 1: Go to pluralsight.com and search for the keyword
    driver.get(f"https://www.pluralsight.com/search?q={search_keyword}&categories=course")
    WebDriverWait(driver, 5).until(
        lambda s: s.find_element_by_id("search-results-category-target").is_displayed()
    )
    # Load all the page data by clicking the Load More button again and again
    # loadAllContent(driver)  # Uncomment me to load all the content of the page

    # Step 2: Create a parse tree of the page source after searching
    soup = BeautifulSoup(driver.page_source, "lxml")

    # Step 3: Iterate over the search results and fetch the courses
    for course_page in soup.select("div.search-results-page"):
        for course in course_page.select("div.search-result"):
            # selectors for the required information
            title_selector = "div.search-result__info div.search-result__title a"
            author_selector = "div.search-result__details div.search-result__author"
            level_selector = "div.search-result__details div.search-result__level"
            length_selector = "div.search-result__details div.search-result__length"
            print({
                "title": course.select_one(title_selector).text,
                "author": course.select_one(author_selector).text,
                "level": course.select_one(level_selector).text,
                "length": course.select_one(length_selector).text,
            })

# Driver code
# create the driver object.
driver = configure_chrome_driver()
search_keyword = "Machine Learning"
getCourses(driver, search_keyword)
# close the driver.
driver.close()
```
The getAuthors function works the same way.
```python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup

def getAuthors(driver, search_keyword):
    driver.get(f"https://www.pluralsight.com/search?q={search_keyword}&categories=aem-author")
    WebDriverWait(driver, 5).until(
        lambda s: s.find_element_by_id("author-list-target").is_displayed()
    )
    # Load all the page data by clicking the Load More button again and again
    # loadAllContent(driver)  # Uncomment me to load all the content of the page

    # Step 1: Create a parse tree of the page source after searching
    soup = BeautifulSoup(driver.page_source, "lxml")

    # Step 2: Iterate over the search results and fetch the authors
    for author_page in soup.select("div.author-list-page"):
        for author in author_page.select("div.columns"):
            author_name = "div.author-name"
            author_img = "div.author-list-thumbnail img"
            author_profile = "a.cludo-result"
            print({
                "name": author.select_one(author_name).text,
                "img": author.select_one(author_img)["src"],
                "profile": author.select_one(author_profile)["href"]
            })

# Driver code
# create the driver object.
driver = configure_chrome_driver()
search_keyword = "Machine Learning"
getAuthors(driver, search_keyword)
# close the driver.
driver.close()
```
Waits
Nowadays, most web pages use dynamic loading techniques such as AJAX. When a page is loaded by the browser, the elements within that page may load at different time intervals, which makes locating an element difficult; sometimes the script throws an ElementNotVisibleException.
Using waits, we can resolve this issue. There are two types of waits: implicit and explicit. An explicit wait pauses execution until a specific condition is met, while an implicit wait tells the driver to keep polling the DOM for a fixed amount of time whenever it tries to locate an element. You can learn more here.
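As a quick illustration of the difference (assuming driver is already initialized; the element ID is the one used earlier in this guide):

```python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Implicit wait: the driver polls the DOM for up to 10 seconds
# every time it tries to locate an element.
driver.implicitly_wait(10)

# Explicit wait: block until this specific element becomes visible,
# or raise TimeoutException after 5 seconds.
element = WebDriverWait(driver, 5).until(
    EC.visibility_of_element_located((By.ID, "search-results-category-target"))
)
```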
So, for our example, I have used WebDriverWait, an explicit wait, to wait for an element to load.
```python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

def loadAllContent(driver):
    # accept the cookie notification so it does not block the Load More button
    WebDriverWait(driver, 5).until(
        lambda s: s.find_element_by_class_name("cookie_notification").is_displayed()
    )
    driver.find_element_by_class_name("cookie_notification--opt_in").click()
    # keep clicking the Load More button until it no longer appears
    while True:
        try:
            WebDriverWait(driver, 3).until(
                lambda s: s.find_element_by_id("search-results-section-load-more").is_displayed()
            )
        except TimeoutException:
            break
        driver.find_element_by_id("search-results-section-load-more").click()
```
Filling in Forms
Filling in a form on a web page generally involves setting values for text boxes, perhaps selecting options from a drop-down or radio control, and clicking a submit button. We have already seen how to locate elements on the page; Selenium also provides methods such as send_keys and click to send data to those elements.
Check out more on this here.
```python
def login(driver, credentials):
    driver.get("https://app.pluralsight.com/")
    # fill in the username and password fields
    uname_element = driver.find_element_by_name("Username")
    uname_element.send_keys(credentials["username"])
    pwd_element = driver.find_element_by_name("Password")
    pwd_element.send_keys(credentials["password"])
    # click the login button
    login_btn = driver.find_element_by_id("login")
    login_btn.click()
```
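For drop-down and radio controls, Selenium's Select helper and the click method cover the remaining cases. The sketch below uses hypothetical field names, since the exact markup depends on the form you are filling in:

```python
from selenium.webdriver.support.ui import Select

def fill_example_form(driver):
    # pick an option from a <select> drop-down (hypothetical element name)
    country_dropdown = Select(driver.find_element_by_name("country"))
    country_dropdown.select_by_visible_text("India")

    # choose a radio button by clicking it (hypothetical id)
    driver.find_element_by_id("subscription-monthly").click()

    # submit the form (hypothetical id)
    driver.find_element_by_id("submit-button").click()
```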
Conclusion
Web scraping using Selenium and BeautifulSoup can be a handy addition to your Python toolkit, especially when you face dynamic, heavily JavaScript-rendered websites. This guide has covered only some aspects of Selenium and web scraping. To learn more about scraping advanced sites, please visit the official docs of Python Selenium.
If you want to dive deeper, check out some of my other published guides on web scraping.
- Extracting Data from HTML with BeautifulSoup
- Advanced Web Scraping Tactics
- Crawling the Web with Python and Scrapy
- Best Practices and Guidelines for Scraping
That's it from this guide. Keep scraping challenging sites. For more queries, feel free to ask me at Codealphabet.