
Web Scraping With Python For Beginners: How To Get Started

A guide on how to get data from websites and APIs using Python

First Things First: What Libraries Will You Need?

As is often the case with a general-use language like Python, performing specialised tasks will require libraries designed specifically for those purposes.

When it comes to scraping data from the internet, there are two different libraries that you should familiarise yourself with. The first is Requests, while the second is Beautiful Soup. We will introduce you to both in greater detail over the course of the article.

We will also make marginal use of Pandas, the popular Python library for analysing and manipulating data. You will need Pandas again if you intend to plot, visualise or otherwise manipulate the data you scrape, but for the purposes of this particular guide, we will only be using one of its methods.
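If you want to follow along, your imports will look something like this (a typical setup rather than anything specific to this guide; the pip command in the comment is one common way to install the packages):

# Install the packages once, e.g. with: pip install requests beautifulsoup4 pandas
import requests # fetches web pages over HTTP
from bs4 import BeautifulSoup # parses the HTML we get back
import pandas as pd # tabulates and manipulates the scraped data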

Understanding The Code Of A Web Page

For those who have been learning Python (and programming in general) for purposes other than web development, let’s start from the absolute basics.

The majority of webpages are composed of code written in three languages: HTML, CSS, and JavaScript. Right now we are primarily interested in getting information from the HTML.

The good news is that the HTML code for any given website is incredibly easy to find.

Simply open a webpage, right-click anywhere within it, and a context menu with a number of options will appear. At the bottom of that menu you will find an option called Inspect. Click on it, and a whole new section will open (either on the right-hand side or at the bottom of your browser), showing you the code that composes the website.

This newly-opened ‘section’ is called the Developer Tools, or DevTools. It is an incredibly powerful resource for exploring, analysing and even editing the webpage you are on. Today we will look at only one of the many things we can do with it: retrieving the page’s HTML code and the information it contains.

The HTML code of the website is the long block of markup you will see in the DevTools you just opened, right beneath the tab labelled ‘Elements’.

Pro tip: If you want to see the code that generates a particular part of the website (for example the specific paragraph of text that you are reading, or an image, or a button), simply highlight that part, right-click on it, and click Inspect. The DevTools panel will open at the pertinent line of code, highlighting it for you to see!

This code that you have revealed contains literally all of the information that is in the page you are visiting. In its raw form, however, it is difficult to read, and even more difficult to use.

This is where Requests comes to our aid.

Getting Started With Requests

It makes sense to start with Requests, as this library involves some basic concepts which we will use again later on.

Requests is one of a number of libraries (alongside httplib, urllib, httplib2, and treq) that can be used to do exactly what this guide is about: retrieve information from web pages, or what in the business is known as web scraping.

How is this done? Open your console and let’s do this together!

First, of course, we import the Requests library.

import requests

Then we use the ‘get’ method from that library on the URL of a website to retrieve the contents of that website, that is to say, its HTML code. We will store those contents in an object called response, as this is, after all, the website’s response to our request.

response = requests.get('https://www.imdb.com/title/tt0034583/fullcredits/?ref_=tt_ql_cl')

The response object isn’t a simple thing. It contains the retrieved data, along with a variety of attributes and methods for working with it. This means that whenever we want to do anything with it, we can’t just treat it like any old variable – rather, we need to specify which part of it we want to work with. For example, we can print the page’s HTML as readable text:
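print(response.text) # the page’s HTML, decoded as a text string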

You can get much the same result by writing print(response.content), which returns the raw bytes rather than a decoded string. Just bear in mind that we can’t simply write print(response) and expect to see the page. The response object in its entirety is more than just readable text, so printing it directly only displays a short summary such as <Response [200]>.
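The response object also carries a few other attributes that you will often reach for. Here is a quick sketch; the exact output values will depend on the page you requested:

print(response.status_code) # 200 means the request succeeded
print(response.headers['Content-Type']) # e.g. 'text/html; charset=UTF-8'
print(response.url) # the final URL, after any redirects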

You should also know that Requests methods can be written using this more general syntax:

requests.request(method, url, **kwargs)

This means that our line response = requests.get('https://www.imdb.com/title/tt0034583/fullcredits/?ref_=tt_ql_cl') could also have been written like this:

response = requests.request('get', 'https://www.imdb.com/title/tt0034583/fullcredits/?ref_=tt_ql_cl')

Now that we have stored the entire contents of a web page in a response object, we can do all sorts of things with it.
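For instance, we can hand the HTML over to Beautiful Soup, which we will cover properly later on. A minimal sketch, assuming the beautifulsoup4 package is installed:

from bs4 import BeautifulSoup

# Parse the HTML we just retrieved with Requests
soup = BeautifulSoup(response.text, 'html.parser')

# Print the text inside the page’s <title> tag as a quick sanity check
print(soup.title.get_text())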
