Introduction to Web Scraping with BeautifulSoup

Web scraping is the process of downloading data from websites and extracting valuable information from that data. The demand for web scraping is growing, so it’s the perfect time to get comfortable using it.

The process of web scraping and cleaning the scraped data follows a logical pattern, so it can be implemented easily and becomes second nature after a few attempts.

In this article, we will go through a simple example of how to scrape and clean data from Wikipedia. We will take a look at how to load the data, find specific elements, and save the data to .txt and .csv files.

Getting Started

Library-wise, we have a few different choices, including:

  • Requests
  • Beautiful Soup
  • Scrapy
  • Selenium

Scrapy is a complete web scraping framework that takes care of everything from getting the HTML to processing the data. Selenium is a browser automation tool that, for example, lets you navigate between multiple pages. These two libraries have a steeper learning curve than Requests, which is used to get HTML data, and BeautifulSoup, which is used as a parser for the HTML.

Therefore we will use BeautifulSoup in this post, which can be installed using the Python package manager pip or the Anaconda package manager.

pip install beautifulsoup4
or
conda install beautifulsoup4
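To check that the installation worked, you can parse a small snippet of HTML before touching any real website (the snippet below is made up purely as a sanity check):

```python
# quick sanity check that BeautifulSoup is installed correctly
from bs4 import BeautifulSoup

snippet = "<html><body><p>Hello, soup!</p></body></html>"
soup = BeautifulSoup(snippet, "html.parser")
print(soup.p.getText())  # -> Hello, soup!
```

If this prints the paragraph text without errors, the library is ready to use.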

Inspect the website

To get information about the elements we want to access, we first need to inspect the web page using the developer tools.

In this post we will scrape the “content” and “see also” sections from an arbitrary Wikipedia article. To get information about the elements and attributes used for the sections, we can right-click on the element to inspect it. This will open the inspector, which lets us look at the HTML code.

Inspecting the website

The content section has an id of toc, and each list item has a class of tocsection-n, where n is the number of the list item. So if we want to get the content text, we can loop through all list items whose class starts with tocsection-. This can be done using BeautifulSoup in combination with regular expressions.

To get the data from the “see also” section we can loop through all the list items contained in the div with the classes div-col columns column-width.
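Before applying this to Wikipedia, the class-prefix matching can be sketched on a toy snippet (the HTML below is made up for illustration and only mimics the structure described above, not Wikipedia's actual markup):

```python
import re
from bs4 import BeautifulSoup

# toy HTML mimicking the table-of-contents structure described above
html = """
<ul>
  <li class="toclevel-1 tocsection-1">History</li>
  <li class="toclevel-1 tocsection-2">Basics</li>
  <li class="other">Not part of the contents</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
# match any list item with a class starting with "tocsection-"
items = soup.find_all("li", attrs={"class": re.compile("^tocsection-")})
print([li.getText() for li in items])  # -> ['History', 'Basics']
```

BeautifulSoup checks the regular expression against each class of a multi-class element individually, which is why toclevel-1 does not interfere with the match.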

Parse HTML

Now that we know what we need to scrape, we can get started by parsing the HTML. First, we need to import the libraries that we will be using for scraping the website. As mentioned above, we will use BeautifulSoup for parsing the page and searching for specific elements. For connecting to the website and getting the HTML we will use urllib, which is part of the Python Standard Library, so it is already installed. Lastly, the re library will be used for working with regular expressions.

# importing libraries
from bs4 import BeautifulSoup
import urllib.request
import re

Next we need the url of the Wikipedia page we want to get our information from. In this post we will scrape the data from the Artificial Intelligence Wikipedia article.

url = "https://en.wikipedia.org/wiki/Artificial_intelligence"

Now we can connect to the website using urllib.

page = urllib.request.urlopen(url) # connect to the website

If the request wasn’t successful, because of something like a wrong URL, urllib will raise an error. One way of handling this kind of exception is to wrap the urlopen call in a try-except statement.

import urllib.error

try:
    page = urllib.request.urlopen(url)
except urllib.error.URLError as e: # catch connection and HTTP errors specifically
    print("An error occurred:", e)

For parsing the HTML, the page object needs to be passed to BeautifulSoup.

soup = BeautifulSoup(page, 'html.parser')
print(soup)

Find specific elements in the page

The created BeautifulSoup object can now be used to find elements in the HTML. When we inspected the website, we saw that every list item in the content section has a class that starts with tocsection-, and we can use BeautifulSoup’s find_all method to find all list items with that class.

regex = re.compile('^tocsection-')
content_lis = soup.find_all('li', attrs={'class': regex})
print(content_lis)

This gives us an array of list items. The first few can be seen below:

<li class="toclevel-1 tocsection-1"><a href="#History"><span class="tocnumber">1</span> <span class="toctext">History</span></a></li>,
<li class="toclevel-1 tocsection-2"><a href="#Basics"><span class="tocnumber">2</span> <span class="toctext">Basics</span></a></li>,
<li class="toclevel-1 tocsection-3"><a href="#Problems"><span class="tocnumber">3</span> <span class="toctext">Problems</span></a>

To get the raw text we can loop through the array and call the getText method on each list item.

content = []
for li in content_lis:
    content.append(li.getText().split('\n')[0])
print(content)

The split on \n ensures that list items containing other list items keep only their own text and not the text from the sub list items.
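A small made-up example shows why the split is needed: calling getText on a list item that contains a sub-list returns the sub-items’ text too, separated by newlines.

```python
from bs4 import BeautifulSoup

# made-up nested list: the outer <li> contains a sub-list
html = "<li>3 Problems\n<ul><li>3.1 Reasoning</li></ul></li>"
li = BeautifulSoup(html, "html.parser").li
print(li.getText())                 # includes the sub-item's text
print(li.getText().split('\n')[0])  # -> 3 Problems
```

Taking index 0 after the split keeps only the outer item’s own line.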

Output:

'1 History',
'2 Basics',
'3 Problems',
'3.1 Reasoning, problem solving',
'3.2 Knowledge representation',
'3.3 Planning',
'3.4 Learning',

To get the data from the “see also” section, we use the find method to get the div containing the list items, and then use find_all to get an array of list items.

see_also_section = soup.find('div', attrs={'class': 'div-col columns column-width'})
see_also_soup = see_also_section.find_all('li')
print(see_also_soup)

Output:

<li><a class="image" href="/wiki/File:Animation2.gif"><img alt="Animation2.gif" class="noviewer" data-file-height="78" data-file-width="48" height="16" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/c0/Animation2.gif/10px-Animation2.gif" srcset="//upload.wikimedia.org/wikipedia/commons/thumb ...

To extract the hrefs and the text, a loop in combination with the find method can be used.

see_also = []
for li in see_also_soup:
    a_tag = li.find('a', href=True, attrs={'title':True, 'class':False}) # find a tags that have a title and no class
    href = a_tag['href'] # get the href attribute
    text = a_tag.getText() # get the text
    see_also.append([href, text]) # append to array
print(see_also)

Output:

['/wiki/Portal:Artificial_intelligence', 'Artificial intelligence portal'],
['/wiki/Abductive_reasoning', 'Abductive reasoning'],
['/wiki/Behavior_selection_algorithm', 'Behavior selection algorithm'],
['/wiki/Business_process_automation', 'Business process automation'],
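Note that the scraped hrefs are relative to the Wikipedia domain. If absolute links are needed, urllib.parse.urljoin (also part of the standard library) can combine them with the article URL; a small sketch:

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/Artificial_intelligence"
href = "/wiki/Abductive_reasoning"  # a relative href as scraped above
print(urljoin(base, href))  # -> https://en.wikipedia.org/wiki/Abductive_reasoning
```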

Saving data

Almost all of the time we would like to save our scraped data so we can use it later. The easiest way is to save it to a .txt or .csv file using the open function, which is built into Python.

We will save the content section into a text file with the name content.txt.

with open('content.txt', 'w') as f:
    for i in content:
        f.write(i+"\n")

The best format for the “see also” data is probably a CSV, because it has two columns (one for the href and one for the text).

with open('see_also.csv', 'w') as f:
    for i in see_also:
        f.write(",".join(i)+"\n")
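Joining with commas by hand breaks if the scraped text itself ever contains a comma. Python’s built-in csv module quotes such fields correctly; a sketch with made-up rows in the same [href, text] shape:

```python
import csv

# hypothetical rows shaped like the see_also data above
rows = [
    ["/wiki/Abductive_reasoning", "Abductive reasoning"],
    ["/wiki/Example", "A title, with a comma"],  # this comma gets quoted
]
with open("see_also.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(rows)
```

The resulting file can be read back with csv.reader without the comma splitting the second column.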

Conclusion

Web scraping is the process of downloading data from webpages and extracting information from that data. It is a great tool to have in your tool kit because it allows you to get a rich variety of data.

BeautifulSoup is a web scraping library which is best used for small projects. For larger projects, libraries like Scrapy and Selenium start to shine, and I will cover both of them in another blog post.

If you liked this article, consider subscribing to my YouTube channel and following me on social media.

The code covered in this article is available as a GitHub repository.

If you have any questions, recommendations or critiques, I can be reached via Twitter or the comment section.