Best Practices and Guidelines for Scraping
As web scraping becomes more sophisticated, its legality becomes more complicated. Let's explore some best practices and guidelines.
Introduction
As web scraping technology grows more productive and sophisticated, the legality of web scraping becomes more complicated. So, we'll explore the best practices and guidelines that you'll need to grasp.
My previous guide, "Advanced Web Scraping Tactics", covers the complexities of web scraping and how to tackle them. This guide gives you a set of best practices and guidelines for scraping that will help you know when to be cautious about the data you want to scrape.
If you are a beginner to web scraping with Python, check out my guides on Extracting Data from HTML with BeautifulSoup and Crawling the Web with Python and Scrapy.
User-agent Rotation
The User-Agent string in the request header identifies the browser and operating system from which the request was made.
A sample user-agent string looks like this:
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36
Every request you make carries header information, and the user-agent is one of the fields that websites use to detect bots. User-agent rotation is the best way to avoid being caught. Many websites don't allow repeated requests from a single source, so you can change your identity by randomizing the user-agent with each request.
If you're using Scrapy, you can set the USER_AGENT option in settings.py.
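For example, here is a minimal sketch of per-request user-agent rotation using the requests library; the user-agent strings and target URL are placeholders:

```python
import random
import requests

# A small pool of user-agent strings to rotate through (placeholder values).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0",
]

def fetch(url):
    # Pick a random user-agent for every request.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)

response = fetch("https://example.com")
print(response.status_code)
```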
That said, it is always better to identify yourself whenever possible. Try not to mask yourself, and provide correct contact details in the request headers.
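Here is a hedged sketch of a self-identifying request; the bot name, info URL, and email address are placeholders:

```python
import requests

# Identify the crawler and give site owners a way to contact you.
# The bot name, info URL, and email address below are placeholders.
headers = {
    "User-Agent": "my-research-bot/1.0 (+https://example.com/bot-info)",
    "From": "crawler-admin@example.com",  # standard HTTP header for a contact address
}

response = requests.get("https://example.com", headers=headers, timeout=10)
```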
Rotating IPs and Using Proxy Services
Continuing the previous practice, it is better to rotate IPs and use proxy or VPN services so that your spider won't get blocked. This helps minimize the danger of getting trapped and blacklisted.
Rotating IPs is straightforward if you are using Scrapy.
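A minimal sketch of routing Scrapy requests through rotating proxies by setting the proxy key in request.meta, which Scrapy's built-in HttpProxyMiddleware picks up; the proxy addresses and target URL are placeholders:

```python
import random

import scrapy

# Placeholder proxy endpoints; in practice these come from a proxy provider.
PROXIES = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
]

class ProxySpider(scrapy.Spider):
    name = "proxy_spider"
    start_urls = ["https://example.com"]

    def start_requests(self):
        for url in self.start_urls:
            # Scrapy's built-in HttpProxyMiddleware reads the 'proxy' meta key.
            yield scrapy.Request(url, meta={"proxy": random.choice(PROXIES)})

    def parse(self, response):
        self.logger.info("Fetched %s via a proxy", response.url)
```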
Minimize the Load
Try to minimize the load on the website that you want to scrape. Any web server may slow down or crash when the traffic exceeds the load it can handle. Keep concurrent requests to a minimum and follow the crawl-delay limit set in robots.txt.
This will also help you avoid getting blocked by the website.
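In Scrapy, for instance, you can throttle the crawl from settings.py; the numbers below are illustrative, not recommendations:

```python
# settings.py -- throttle the crawl to reduce load on the target site.
# The specific values are illustrative; tune them per site.
CONCURRENT_REQUESTS_PER_DOMAIN = 2   # limit parallel requests to a single domain
DOWNLOAD_DELAY = 2                   # seconds to wait between requests
AUTOTHROTTLE_ENABLED = True          # back off automatically when the server slows down
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```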
Follow robots.txt and Terms & Conditions
Robots.txt is a text file that webmasters create to instruct web robots how to crawl pages on their website; it contains the crawling rules for the site.
Basic format:
User-agent: [user-agent name]
Disallow: [URL string not to be crawled]
Robots.txt is the first thing you need to check when you are planning to scrape a website. Generally, a website's robots.txt is located at website-name/robots.txt.
The file contains a set of rules that the site considers good behavior, such as areas that are allowed to be crawled, restricted pages, and frequency limits for crawling. You should respect and follow all of these rules while attempting to scrape the site.
If you disobey the rules of robots.txt, the website admin can block you permanently.
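You can check a URL against robots.txt programmatically. Here is a short sketch using Python's standard urllib.robotparser; the site and user-agent name are placeholders:

```python
from urllib import robotparser

# Placeholder site; point this at the robots.txt of the site you plan to scrape.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Check whether our crawler (placeholder name) may fetch a given path.
if rp.can_fetch("my-research-bot", "https://example.com/some/page"):
    print("Allowed to crawl this page")
else:
    print("Disallowed by robots.txt -- skip it")
```

If you use Scrapy, setting ROBOTSTXT_OBEY = True in settings.py makes the crawler respect robots.txt automatically.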
Also review the Terms & Conditions (ToS) before starting the scraping process. You should always honor the terms and conditions and the privacy policy.
Use APIs
If a site you want to scrape provides an API to download the data, obtain the data that way instead of scraping. For example:
- Twitter provides an API service to access the public Twitter data of users.
- Amazon also provides an API service to fetch product information. If you are using Python, you can use a client library for it.
Always look for an API before scraping. Even if the API is paid, try to use it instead.
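As a sketch, fetching data from a documented API with the requests library typically looks like the following; the endpoint, parameters, and token are hypothetical:

```python
import requests

# Hypothetical API endpoint and token -- replace them with the provider's documented values.
API_URL = "https://api.example.com/v1/products"
API_TOKEN = "your-api-token"

response = requests.get(
    API_URL,
    params={"query": "laptops", "page": 1},
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=10,
)
response.raise_for_status()
data = response.json()  # structured JSON instead of HTML that needs parsing
```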
Caching the Pages
It is an excellent practice to cache the web pages you have already crawled so that you don't have to request the same page again.
Also, store the URLs of crawled pages to keep track of the pages you have already visited.
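If you use Scrapy, its built-in HTTP cache can handle this; here is a sketch of the relevant settings.py options (the values are illustrative):

```python
# settings.py -- cache responses on disk so repeated runs don't re-download the same pages.
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = "httpcache"          # directory where cached responses are stored
HTTPCACHE_EXPIRATION_SECS = 86400    # re-fetch pages older than one day; 0 means never expire
```

Within a single run, Scrapy also filters out duplicate requests to URLs it has already seen, which covers the visited-URL bookkeeping.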
Scrape During Off-Peak Hours
To make sure that a website isn't slowed down by the high request load from your web crawler, it is better to schedule crawling tasks to run during off-peak hours. This gives real human visitors a better experience and also improves the speed of the scraping process.
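A cron job or any other scheduler is the usual way to do this; as a rough sketch in plain Python, assuming a placeholder off-peak hour, the script can simply wait for that window before crawling:

```python
import time
from datetime import datetime, timedelta

# Placeholder off-peak hour: 2 AM local time, for example.
OFF_PEAK_HOUR = 2

def wait_until_off_peak():
    now = datetime.now()
    target = now.replace(hour=OFF_PEAK_HOUR, minute=0, second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)  # the off-peak hour has already passed today
    time.sleep((target - now).total_seconds())

wait_until_off_peak()
# ... start the crawl here ...
```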
Change the Crawling Pattern
Sites with intelligent anti-crawling mechanisms can easily detect spiders by finding patterns in their actions. So it is a good idea to avoid extracting information in a regular, monotonous way. You may incorporate some random clicks, mouse movements, and delays to look more like a human.
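One easy piece of this is randomizing the delay between requests; a minimal sketch with placeholder URLs and delay bounds:

```python
import random
import time

import requests

urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholder URLs

for url in urls:
    response = requests.get(url, timeout=10)
    # ... parse the response here ...
    # Sleep for a random interval so the request timing doesn't form an obvious pattern.
    time.sleep(random.uniform(2, 8))
```

Scrapy offers a similar knob: with RANDOMIZE_DOWNLOAD_DELAY enabled (the default), it varies DOWNLOAD_DELAY by a random factor between requests.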
Don't Violate Copyright Issues
While scraping a website, make sure you don't reproduce the copyrighted web content.
Copyright is the exclusive and assignable legal right, given to the creator of a literary or artistic work, to reproduce the work. - Wikipedia
Copyrighted content should not be republished or redistributed. Read the Terms & Conditions and Privacy Policy for more information regarding copyright.
Conclusion
That is all for this guide. We have covered the best practices that need to be followed while scraping. I hope you will follow these guidelines and best practices in your own web scraping. Cheers, and happy scraping!