HOWTO: Easy Web Scraping with Python

Overwhelming Offer in the Webshop

Two weeks ago, a frequently used online mail-order company, whose reminds of a river in South America, called my attention to a campaign by a friendly information email. Namely, three music CDs from a large selection were offered to me for 15€.

As in the past, I still enjoy buying music on physical sound carriers and decided to have a closer look at the offer. It turned out that approx. 9,000 CDs were offered on about 400 pages in the online shop. This shop provides the possibility to sort the offers by popularity or customer ratings. However, if I view the popularity in descending order, I find many titles which do not quite correspond to my age group. On the other hand, if I sort the offers by customer ratings, it turns out that the shop processes the ratings in an unweighted manner. That means that a CD with only one 5 star rating is listed above another CD with 4.9 stars over 1,000 ratings.

Web Scraping

At first, I did not feel like checking all 400 pages manually in order to find something interesting to me. So I used a trick which I have frequently used in the past, namely to automatically harvest the content of the website. This procedure is anything but new, however, within the data science community it now has a new name - Web Scraping.

It is quite likely that a data scientist has to extract data from the World Wide Web him-/herself. Therefore, I want to show how small the initial hurdle is with Python, using the example of my simple problem.

Let's Get to Work

URL

Fortunately, the online provider uses a request method which shows the request parameters in the URL in plain language. Here is an example which I have anonymized a little:

 http://www.onlineshop.de/s/ref=lp_12345_pg_1&rh=123456&page=1&ie=UTF8&qid=12345

One can see that the page is referenced at two points, at _pg_1 and &page=1&. So, if I adapt those two points, I can directly iterate through all subpages.

Retrieving the Website

In order to actually read the website, we use a module which is firmly integrated in Python, namely urllib. Loading the page then looks like this (again, I have altered the URL):

from urllib import request 
url_requested = request.urlopen('http://www.onlineshop.de/s/ref=lp_12345_pg_1&rh=[...]&page=1&ie=UTF8&qid=12345')
 if 200 == url_requested.code:
    html_content = str(url_requested.read())

One uses the request object to open the website. The HTML code is read if the code of the request is 200, the normal code for the successful retrieval of a website. We all know code 404, if the page is e.g. not found. The string function around the requested read is required to receive strings and not byte strings. This would otherwise cause us problems later.

Parsing the HTML Code

Now, as we have the HTML code of the page, we can directly scan it for CD titles and performers. Previously, I have had a direct look at the HTML code in the browser in order to detect suspicious patterns one can search for. This is how I came to notice that all titles start with <h2 class="a-size-medium a-color-null s-inline s-access-title a-text-normal"> and end with Audio CD. In between is a number of codes which are of no importance to us.

Now, we cut the respective code parts with regular expressions out of the page and save the information in a list.

 import re 
[...] 
page_content=[] 
html_content = str(url_requested.read()) 
re_search_chunk = re.findall('<h2>.*?Audio CD', html_content) 
for result in re_search_chunk: 
    tag_contents = re.findall('>.*?<', result) 
    album = tag_contents[0][1:-1] 
    artist = tag_contents[13][1:-1] 
    page_content.append((album, artist))

re.findall belongs to the package for regular expressions. I use it to search for all parts in the HTML code which correspond to the above pattern, where .*? is the placeholder for any chain. Here, this place holder is "not greedy", which means it tries to match the unknown area to as little signs as possible, otherwise multiple albums would be included at the same time.

For each of the code snippets, I repeat the process using the search procedure '>.*?<' which now returns the individual contents of the tags in a "not greedy" manner. By "critically considering the code" I can recognize that the album title is saved in tag 1 (index 0) and the artist in tag 14 (index 13). In order to get rid of the brackets of the tag, we index the results to the second to second to last sign ([1:-1]).

The result is attached to my result list page_content as tuple.

Writing Off in a CSV

with open('offers.txt', 'w') as f: 
    for row in page_content: 
        print('\t'.join(row), file=f)

The last part is easy. We open a "file handle" and write each album-artist-tuple into the file separated by a tabulator. I use the print literal from Python3 which automatically inserts a line break for each print command and accepts the "file handle" as target parameter for the output.

Total Script

Of course, reading out the website has to be processed via a loop and the page numbers have to be parameterized in this process. Here is the total script:

import re 
from urllib import request 
page_content=[] 
for page_number in range(400): 
    url_requested = request.urlopen('http://www.onlineshop.de/s/ref=lp_12345_pg_{page_number}&rh=[...]&page={page_number}&ie=UTF8&qid=12345'.format(page_number=page_number+1) 
    if 200 == url_requested.code: 
        html_content = str(url_requested.read()) 
        re_search_chunk = re.findall('<h2>.*?Audio CD', html_content) 
        for result in re_search_chunk:
            tag_contents = re.findall('>.*?<', result) 
            album = tag_contents[0][1:-1] 
            artist = tag_contents[13][1:-1] 
            page_content.append((album, artist)) 
with open('offers.txt', 'w') as f:
    for row in page_content:
        print('\t'.join(row), file=f)

That's it!

Viewing the Data

Now, I can open my file offers.txt e.g. in Excel and look at it in a pivot table. That makes it much easier for me to get an overview of the performers involved in the offer.

With less than 30 minutes, I assume it took me approximately the same time to develop the script as it would have taken me to click through the pages - but I had more fun. And, as I have written the script in form of a class with the partial steps as methods, I will probably be able to use it again for similar scenarios.