Extracting Data from HTML with BeautifulSoup
This course covers the important aspects of scraping websites using Beautiful Soup. You will learn to build, manipulate, and traverse the parse tree, and to leverage advanced features such as filters, CSS selectors, and XPath.
What you'll learn
Web scraping is an important technique that is widely used as the first step in many workflows in data mining, information retrieval, and text-based machine learning.
In this course, Extracting Data from HTML with BeautifulSoup, you will gain the ability to build robust, maintainable web scraping solutions using the Beautiful Soup library in Python.
First, you will learn how regular expressions can be used to scrape web content, and where Beautiful Soup improves on them. Next, you will discover how Beautiful Soup parses HTML from web content, fixes up badly formed tags, and builds a clean, easily traversable parse tree. You will then see how that parse tree can be used to find and retrieve specific patterns.
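As a rough illustration of the contrast described above, the sketch below runs a naive regular expression and Beautiful Soup over the same small snippet. The HTML, the URLs, and the variable names are made up for this example, and the built-in `html.parser` backend is an arbitrary choice; it is not taken from the course material.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical snippet: the second link puts class before href and uses
# single quotes, and the <p> tag is never closed.
html = """
<html><body>
<p>Links:
<a href="https://example.com/a">First</a>
<a class="ext" href='https://example.com/b'>Second</a>
</body></html>
"""

# A naive pattern that assumes href is the first attribute and is double-quoted.
regex_links = re.findall(r'<a href="([^"]+)">', html)

# Beautiful Soup builds a parse tree first, so attribute order and quoting
# style do not affect the lookup.
soup = BeautifulSoup(html, "html.parser")
soup_links = [a["href"] for a in soup.find_all("a", href=True)]

print(regex_links)
print(soup_links)
```

The regular expression could be widened to cover more cases, but every new variation in the markup needs another change to the pattern, whereas the tree-based lookup stays the same.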
Finally, you will round out your knowledge by leveraging advanced features of Beautiful Soup such as working with CSS and XPath. When you’re finished with this course, you will have the skills and knowledge to implement robust web scraping using Beautiful Soup.
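For a sense of what the CSS side of this can look like, here is a minimal sketch using Beautiful Soup's `select()` and `select_one()` methods. The markup, class names, and paths are invented for illustration. XPath queries are typically run through a separate library such as lxml, so this sketch covers CSS selectors only.

```python
from bs4 import BeautifulSoup

# Hypothetical catalog markup, used only for this example.
html = """
<div id="catalog">
  <ul class="courses">
    <li class="course featured"><a href="/courses/bs4">Beautiful Soup</a></li>
    <li class="course"><a href="/courses/regex">Regular Expressions</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns every element matching the CSS selector.
titles = [a.get_text() for a in soup.select("ul.courses li.course > a")]

# select_one() returns only the first match (or None if nothing matches).
featured = soup.select_one("li.featured a")["href"]

print(titles)
print(featured)
```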
Table of contents
- Version Check 0m
- Module Overview 1m
- Prerequisites and Course Outline 1m
- Introducing Web Scraping 2m
- Regular Expressions and Beautiful Soup 7m
- Making GET Requests Using Httplib2, Urllib and Requests 8m
- Introducing Regular Expressions 4m
- Performing Simple Pattern Matches Using Regular Expressions 5m
- Parsing Web Pages Using Regular Expressions 7m
- Introducing Beautiful Soup 8m
- Module Summary 1m
- Module Overview 1m
- Parsing Web Pages with Beautiful Soup 5m
- Tags, Attributes, NavigableStrings, Comments 4m
- Navigating Using Tags and Contents 4m
- Navigating Children, Descendants, and Parents 6m
- Navigating Sideways Using Next and Previous Sibling 4m
- Navigating Sideways Using Next Element and Previous Element 3m
- Filter by Tags and Attributes Using Regular Expressions and Custom Functions 7m
- Extracting Absolute and Relative Links from HTML 5m
- Module Summary 1m
- Module Overview 1m
- Modifying the HTML Parse Tree 6m
- Exploring Beautiful Soup Functions to Modify the Parse Tree 6m
- Miscellaneous Operations Using Beautiful Soup 6m
- Working with Different Parsers 4m
- Using the Soup Strainer to Parse Parts of a Document 2m
- Encodings in Beautiful Soup 3m
- Summary and Further Study 2m
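Two of the topics listed above, extracting absolute and relative links and using a `SoupStrainer` to parse only part of a document, can be combined in a short sketch. The base URL, the HTML, and the variable names here are assumptions made for the example, not material from the course.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page location, used to resolve relative links.
base_url = "https://example.com/docs/"

html = """
<html><body>
<h1>Index</h1>
<p>See <a href="intro.html">the intro</a> and
<a href="https://example.org/ref">an external reference</a>.</p>
</body></html>
"""

# Ask the parser to keep only <a> tags; everything else is skipped.
only_links = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_links)

# urljoin leaves absolute URLs untouched and resolves relative ones
# against base_url.
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

# The <h1> was never added to the tree, so this lookup finds nothing.
heading = soup.find("h1")

print(links)
print(heading)
```

Restricting the parse this way can save time and memory on large pages, at the cost of losing access to everything outside the selected tags.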