Web Scraping with Selenium
Jan 8, 2019 • 12 Minute Read
Introduction
The previous guide, Web Scraping with BeautifulSoup, explains the fundamentals of web scraping:
- Understanding the basics of HTML.
- Exploring the web page structure and using developer tools.
- Making HTTP requests and getting HTML responses.
- Extracting specific structured information using BeautifulSoup.
This process is suitable for static content, which is available as soon as an HTTP request returns the webpage. Dynamic websites, however, load their data at runtime from a data source (database, file, etc.) or require additional actions on the web page before the data appears, for example:
- Scroll down to load more content when the end of the page is reached.
- Click the next button to see the next page of available offers on an e-commerce website.
- Click a button to view the complete details of a comment or user profile before scraping it.
In order to automate this process, our scraping script needs to interact with the browser to perform repetitive tasks such as clicking, scrolling, and hovering. Selenium is well suited to automating these browser interactions.
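As an illustration of the first interaction in the list above, the sketch below scrolls to the bottom of a page until its height stops growing. It assumes `browser` is a Selenium WebDriver instance (created as shown later in this guide). The function name, the pause, and the round limit are choices made for this example, and pages that load content some other way will need a different stop condition.

```python
import time


def scroll_until_stable(browser, pause=1.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls performed.
    """
    get_height = "return document.body.scrollHeight"
    last_height = browser.execute_script(get_height)
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom of the page
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give the page time to load the next batch
        new_height = browser.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new was loaded, so stop
        last_height = new_height
    return rounds
```

The `max_rounds` limit keeps the loop from running forever on pages that never stop loading content.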
Selenium is an automation testing framework for web applications/websites which can also control the browser to navigate the website just like a human. Selenium uses a web-driver package that can take control of the browser and mimic user-oriented actions to trigger desired events. This guide will explain the process of building a web scraping program that will scrape data and download files from Google Shopping Insights.
Google Shopping Insights loads the data at runtime, so any attempt to extract it using the requests package alone will get back a response that does not contain the data.
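To see why a plain HTTP fetch comes back without the data, consider a page whose list is filled in by JavaScript at runtime. The sketch below parses such a static HTML shell with the standard library's html.parser. The markup, the `results` id, and the `load.js` script are made up for illustration and are not taken from Google Shopping Insights.

```python
from html.parser import HTMLParser

# Hypothetical static HTML, as a plain HTTP client would receive it.
# The <ul id="results"> is empty because load.js (made up) would fill it
# only when the page runs inside a browser.
STATIC_HTML = """
<html>
  <body>
    <ul id="results"></ul>
    <script src="load.js"></script>
  </body>
</html>
"""


class ListItemCounter(HTMLParser):
    """Count the <li> start tags seen in the parsed markup."""

    def __init__(self):
        super().__init__()
        self.items = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.items += 1


counter = ListItemCounter()
counter.feed(STATIC_HTML)
print(counter.items)  # 0: no list items exist until the script runs in a browser
```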
Setup
- Selenium: To install the selenium package, run the following pip command in a terminal:
pip install selenium
- Selenium Drivers: Web drivers enable Python to control the browser via OS-level interactions. Web drivers use the browser's built-in support for automation, so, in order to control the browser, the web driver must be installed and accessible via the PATH variable of the operating system (only required in case of manual installation).
Download the drivers from the official sites for Chrome, Firefox, and Edge. Opera drivers can also be downloaded from the Opera Chromium project hosted on GitHub.
Safari 10 on OS X El Capitan and macOS Sierra has built-in support for the automation driver. This guide contains snippets for the popular web drivers, though Safari is used as the default browser throughout.
Other browsers, such as UC and Netscape, cannot be automated this way. The older Selenium RC (Remote Control) tool controls browsers by injecting its own JavaScript code and can be used for UI testing.
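A quick way to confirm the PATH requirement described above is to ask Python where it would find the driver executables. This sketch uses only the standard library. It assumes the usual executable names, geckodriver for Firefox and chromedriver for Chrome, and the helper name is made up for this example.

```python
import shutil

# Assumed executable names of the Firefox and Chrome drivers
DRIVER_EXECUTABLES = ("geckodriver", "chromedriver")


def find_drivers(names=DRIVER_EXECUTABLES):
    """Map each driver executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in find_drivers().items():
    print(f"{name}: {path or 'not found on PATH'}")
```

If a driver is reported as not found, either add its folder to PATH or pass its location explicitly when creating the browser instance.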
Data Extraction
Let's get started by searching for a product and downloading the CSV file(s) with the following steps:
- Import Dependencies and Create Driver Instance: The first step is to import webdriver from the selenium module and create a driver object for a particular browser:
from selenium import webdriver  # Import module
from selenium.webdriver.common.keys import Keys  # For keyboard keys
import time  # For fixed waits
URL = 'https://shopping.thinkwithgoogle.com'  # Define URL
browser = webdriver.Safari()  # Create the driver object, which opens the browser
By default, automation control is disabled in Safari and must be enabled, otherwise a SessionNotCreatedException is raised. So, enable the Develop menu under the Advanced settings in Safari preferences.
Then open the Develop menu and select Allow Remote Automation.
- Apply Wait and Search: Follow these steps to run a search:
- Open the link.
- Wait for the website to load.
- Find the search element using its 'subjectInput' id.
There's also a hidden input tag that matches the same id and is not required. So, use find_elements to get the list of all elements that match the search criteria and use the index to access the required one.
- Provide input using send_keys.
- Wait for the result to appear.
- Press Enter to proceed.
browser.get('https://shopping.thinkwithgoogle.com') # 1
time.sleep(2) # 2
search = browser.find_elements_by_id('subjectInput')[1] # 3
# find_elements will give us the list of all elements with id as subjectInput
search.send_keys('Google Pixel 3') # 4
time.sleep(2) # 5
search.send_keys(Keys.ENTER) # 6
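Picking the element by a fixed index works as long as the page keeps the same order. A slightly more defensive variant, sketched below, is to take the first match that is actually displayed. It assumes the unwanted input is not rendered, so that is_displayed() returns False for it. first_displayed is a hypothetical helper name, not part of Selenium.

```python
def first_displayed(elements):
    """Return the first element whose is_displayed() is true, or None if there is none."""
    for element in elements:
        if element.is_displayed():
            return element
    return None
```

For example, step 3 could become `search = first_displayed(browser.find_elements_by_id('subjectInput'))`, followed by a check that the result is not None.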
To use Firefox or Chrome instead, create the browser instance with the corresponding class:
# Firefox
firefoxBrowser = webdriver.Firefox(executable_path=FIREFOX_GECKO_DRIVER_PATH)
# Chrome
chromeBrowser = webdriver.Chrome(executable_path=CHROMEDRIVER_PATH)
- Download Files: To download the files, locate the Download all data div tag. Its class attribute holds two classes, si-button-data and download-all, so match both with a CSS selector:
time.sleep(2)  # Wait for button to appear
browser.find_element_by_css_selector('.si-button-data.download-all').click()
- Extract Data from Elements: To extract interest data by location, first locate the ul list tag by its class values and then fetch the text of all list items:
# The class attribute holds two classes, so chain them in a CSS selector
data = browser.find_element_by_css_selector('.content.content-breakpoint-gt-md')
dataList = data.find_elements_by_tag_name('li')
for item in dataList:
    text = item.text
    print(text)
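The class attribute values used in this section, such as 'si-button-data download-all', contain two class names separated by a space. find_element_by_class_name expects a single class name, so a value like this is better matched with a CSS selector that chains the classes. The helper below (a hypothetical name, not a Selenium function) does the conversion with plain string handling.

```python
def class_value_to_css(class_value):
    """Turn a class attribute value such as 'a b' into the CSS selector '.a.b'."""
    names = class_value.split()
    if not names:
        raise ValueError("class_value must contain at least one class name")
    return "".join("." + name for name in names)


print(class_value_to_css("si-button-data download-all"))
# prints: .si-button-data.download-all
```

The resulting string can be passed to find_element_by_css_selector.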
Locating WebElement
Selenium offers a wide variety of functions to locate an element on a web page:
- find_element_by_id: Use id to find an element.
- find_element_by_name: Use name to find an element.
- find_element_by_xpath: Use XPath to find an element.
- find_element_by_link_text: Use the text value of a link to find an element.
- find_element_by_partial_link_text: Find an element by matching part of a hyperlink's text (anchor tag).
- find_element_by_tag_name: Use tag name to find an element.
- find_element_by_class_name: Use value of class attribute to find an element.
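- find_element_by_css_selector: Use a CSS selector (for example, one built from an id or class) to find an element.
Alternatively, use find_element with a By locator. Selenium 4 deprecated and later removed the find_element_by_* helpers, so this is the form that keeps working on newer releases:
from selenium.webdriver.common.by import By
search = browser.find_element(By.ID, 'subjectInput')  # Returns a single element
To find all occurrences of a searched value, use the plural versions of these functions, which return a list. Just use elements instead of element:
searchList = browser.find_elements(By.ID, 'subjectInput')  # Returns a list
search = searchList[1]  # Pick the required element by index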
XPath
XPath is a path expression syntax for locating nodes in the DOM. An expression can select a node via an absolute path starting at the root element, or via a relative path that matches anywhere in the document. The main pieces of the syntax are explained below, using the example document that follows:
- / : Select from the root node. /html/body/div[1] will find the first div inside body.
- // : Select matching nodes anywhere in the document, regardless of their position. //form[1] will find the first form element.
- [@attributename='value'] : Select the node whose attribute has the required value. //input[@name='Email'] will find the first input element with name as Email.
<html>
  <body>
    <div class="content-login">
      <form id="loginForm">
        <div>
          <input type="text" name="Email" value="Email Address:">
          <input type="password" name="Password" value="Password:">
        </div>
        <button type="submit">Submit</button>
      </form>
    </div>
  </body>
</html>
Read more about XPath to learn how to combine multiple attributes or use the supported functions.
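The expressions above can be tried without a browser. The sketch below uses Python's built-in xml.etree.ElementTree, which supports only a subset of XPath and evaluates paths relative to the root element, so /html/body/div[1] is written as ./body/div[1] and //input becomes .//input. The markup is the example document, with the input tags self-closed so that it parses as XML. Selenium itself evaluates full XPath in the browser.

```python
import xml.etree.ElementTree as ET

DOC = """
<html>
  <body>
    <div class="content-login">
      <form id="loginForm">
        <div>
          <input type="text" name="Email" value="Email Address:"/>
          <input type="password" name="Password" value="Password:"/>
        </div>
        <button type="submit">Submit</button>
      </form>
    </div>
  </body>
</html>
"""

root = ET.fromstring(DOC.strip())  # root is the <html> element

# /html/body/div[1], written relative to the <html> root
first_div = root.find("./body/div[1]")

# //input[@name='Email'], searching anywhere below the root
email_input = root.find(".//input[@name='Email']")

# //form[1], the first <form> child of whichever element contains it
form = root.find(".//form[1]")

print(first_div.get("class"))    # content-login
print(email_input.get("value"))  # Email Address:
print(form.get("id"))            # loginForm
```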
Headless Browsers
Headless or Invisible Browser: During the scraping process, any user action on the browser window can interrupt the flow and cause unexpected behavior. So, for scraping applications, it is important to avoid external dependencies such as a visible browser window. Headless browsers work without displaying any graphical UI, which leaves the application as the single point of interaction for users and provides a smoother experience.
Some well-known headless browsers are PhantomJS and HtmlUnit. Chrome and Firefox also support a headless mode, which can be enabled through the headless setting of the options object:
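from selenium import webdriver
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from selenium.webdriver.chrome.options import Options as ChromeOptions
# Firefox
firefox_options = FirefoxOptions()
firefox_options.headless = True  # newer webdriver versions
# firefox_options.set_headless(True)  # older webdriver versions
firefoxBrowser = webdriver.Firefox(options=firefox_options, executable_path=FIREFOX_GECKO_DRIVER_PATH)
# Chrome (needs Chrome's own Options class, not the Firefox one)
chrome_options = ChromeOptions()
chrome_options.headless = True
chromeBrowser = webdriver.Chrome(CHROMEDRIVER_PATH, options=chrome_options)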
At the time of writing this guide, Headless mode is not supported by Safari.
Key Points
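- To optimize the code, use WebDriverWait instead of time.sleep. It polls the page and proceeds as soon as the searched element is present in the DOM:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
browser = webdriver.Safari()
browser.get('https://shopping.thinkwithgoogle.com')
try:
    # Proceed if the element is found within 3 seconds, otherwise TimeoutException is raised
    element = WebDriverWait(browser, 3).until(
        EC.presence_of_element_located((By.ID, 'Id_Of_Element')))
except TimeoutException:
    print("Time out!")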
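The idea behind an explicit wait is a polling loop with a deadline: call a condition repeatedly and return as soon as it yields something truthy. The sketch below shows that idea with the standard library only. It is a simplified illustration, not Selenium's implementation, and wait_until and WaitTimeout are hypothetical names.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition does not become truthy before the deadline."""


def wait_until(condition, timeout=3.0, poll_interval=0.5):
    """Call condition() until it returns a truthy value, then return that value.

    Raises WaitTimeout if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)  # pause briefly instead of busy-waiting
```

With a driver, the condition could be a lambda that returns the list from find_elements, since an empty list is falsy. Unlike a fixed time.sleep, the loop returns as soon as the condition holds, so the script only waits as long as it needs to.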
- Use ActionChains for mouse interactions like click, hold, hover, and drag-and-drop, and TouchActions for touch interactions like double-tap, long-press, flick, and scroll:
# Perform a hover and a long-press
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.touch_actions import TouchActions
browser = webdriver.Firefox()
browser.get(URLString)  # URLString: address of the page to open
element_to_hover_over = browser.find_element_by_id("anyID")
ActionChains(browser).move_to_element(element_to_hover_over).perform()
scroll_up_arrow = browser.find_element_by_id("scroll_up")
TouchActions(browser).long_press(scroll_up_arrow).perform()  # perform() runs the queued actions
At the time of writing this guide, ActionChains and TouchActions are not supported by Safari.
- Selenium offers many other navigational functions like:
- Back and Forward: To navigate the history
- Switch: To move between alert dialogs, windows, and frames using browser.switch_to.alert, browser.switch_to.window(window_name), and browser.switch_to.frame(frame_reference).
The driver.switch_to_alert(), driver.switch_to_window(), and driver.switch_to_frame() methods have been deprecated, so use the driver.switch_to object instead.
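The navigation calls mentioned above can be wrapped in small helpers. The sketch below assumes `browser` is a Selenium WebDriver instance. The function names are made up for this example, and the `by` and `value` arguments are a locator pair such as By.ID and an element id.

```python
def text_inside_frame(browser, frame_reference, by, value):
    """Return the text of one element inside a frame, then switch back.

    frame_reference is whatever switch_to.frame accepts:
    a frame name, an index, or a frame element.
    """
    browser.switch_to.frame(frame_reference)
    try:
        return browser.find_element(by, value).text
    finally:
        # Always return to the top-level document, even if the lookup fails
        browser.switch_to.default_content()


def visit_and_return(browser, url):
    """Open url, read the page title, then go back one step in the history."""
    browser.get(url)
    title = browser.title
    browser.back()
    return title
```

Switching back in a finally block matters because later lookups would otherwise keep searching inside the frame.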
The code is available on GitHub for demonstration and practice. It also contains a few more use cases and optimized code.