Python Web Scraping Tutorial

4 min read · Jan 31, 2023

Introduction:

If you are into data analysis or big data, chances are you need to collect data from websites. Python is commonly used for working with data because of its extensive libraries and simplicity. We will now look at a simple web scraping project that I have built using Python🐍

Prerequisites:

Before we begin this tutorial, make sure you have Python installed on your machine. Head over to the official Python website (python.org) to install it if you have not done so.

In this tutorial I’ll be using PyCharm as my IDE, and we will be installing the BeautifulSoup and requests modules for this project.

Tutorial:

This is the website that we will be working with for this project:

The 100 Greatest Movies (By Empire | Posted 20 Mar 2018), archived on web.archive.org

This is Empire’s list of the 100 greatest movies of all time. We will scrape the page and prepare a list of the top 100 movies.

Installing the Dependencies:

In the terminal enter the following command:

pip install requests beautifulsoup4

Import the necessary modules:

In main.py:

from bs4 import BeautifulSoup
import requests

Getting the response from the site:

URL = "https://web.archive.org/web/20200518073855/https://www.empireonline.com/movies/features/best-movies-2/"

response = requests.get(URL)
website_html = response.text

Add print(response) to main.py and run it to see the response from the site:

<Response [200]>

Great! A status code of 200 means the page was fetched successfully.
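Rather than reading the status by eye, the check can be made part of the script. The following is only a minimal sketch: the helper name fetch_html and the 10-second timeout are my own assumptions, not part of the original project.

```python
import requests

URL = "https://web.archive.org/web/20200518073855/https://www.empireonline.com/movies/features/best-movies-2/"


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the page body as text, raising if the request fails.

    fetch_html is a hypothetical helper name used for illustration.
    """
    # A timeout keeps the script from waiting indefinitely on a slow server
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError when the status code is 4xx or 5xx
    response.raise_for_status()
    return response.text
```

With this in place, website_html = fetch_html(URL) would stop the script with an error on a failed request instead of silently parsing an error page.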

Extracting data:

Let’s now use BeautifulSoup to parse the HTML into a soup object:

soup = BeautifulSoup(website_html, "html.parser")
print(soup.prettify())

The output shows that the site’s HTML has been parsed and stored in the soup object.

Go ahead and remove the print statement or comment it out. I included it only to show what the object contains.

Getting all the Titles of the Movies:

On the website, right-click on a title and click Inspect.

As you can see, each title is stored in an h3 element with the class title.

So let’s get all the h3s that have the class “title” and store them in a variable:

all_movies_title = soup.find_all(name="h3", class_="title")
print(all_movies_title)

Cool! Now we have acquired all the titles of the Top 100 Movies
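To see what find_all hands back without fetching the live page, here is a small sketch on a made-up HTML fragment. The fragment and the movie names in it are placeholders that only imitate the h3/title structure described above. The select call is an alternative I am adding for comparison; it is not used in the original project.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that imitates the structure of the real page
sample_html = """
<div>
  <h3 class="title">3) Movie C</h3>
  <h3 class="title">2) Movie B</h3>
  <h3 class="other">Not a movie title</h3>
  <h3 class="title">1) Movie A</h3>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list-like collection of Tag objects, markup included
tags = sample_soup.find_all(name="h3", class_="title")
print(len(tags))     # 3
print(str(tags[0]))  # <h3 class="title">3) Movie C</h3>

# The CSS selector "h3.title" picks out the same elements
same_tags = sample_soup.select("h3.title")
print([str(tag) for tag in same_tags] == [str(tag) for tag in tags])  # True
```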

But there’s one problem: we acquired the HTML tags along with the titles.

So let’s extract just the text from each title and store it in a list.

Let’s use a list comprehension to get only the text, as follows:

# Getting the title of each H3 and forming a list of all titles
movie_titles = [movie.getText() for movie in all_movies_title]
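If the markup around a title contains line breaks or nested tags, the extracted text can carry stray whitespace. The sketch below uses a made-up tag to compare getText() with get_text(" ", strip=True); whether the real page needs the stripped form is something I have not verified.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with extra whitespace and a nested tag
sample_html = '<h3 class="title">\n  12) <em>Some Movie</em>\n</h3>'
tag = BeautifulSoup(sample_html, "html.parser").find("h3", class_="title")

# getText() is an alias of get_text() and keeps whitespace as-is
raw = tag.getText()
# strip=True trims each text piece; " " is the separator placed between pieces
clean = tag.get_text(" ", strip=True)

print(repr(raw))    # '\n  12) Some Movie\n'
print(repr(clean))  # '12) Some Movie'
```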

The list currently starts from 100, so let’s reverse it to make it start from 1:

# Reversing the list using reverse()
movie_titles.reverse()
print(movie_titles)

Yay! We got our list of the Top 100 Movies from Empire.
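One detail worth knowing about reverse(): it changes the list in place and returns None, so its result should not be assigned back to the variable. The sketch below uses placeholder titles and also shows a sort by rank number, which assumes every title begins with a number followed by ")". That format is an assumption of mine about the scraped text.

```python
# Hypothetical titles in page order (highest rank number first)
scraped = ["3) Movie C", "2) Movie B", "1) Movie A"]

# reverse() mutates the list in place and returns None
in_place = scraped.copy()
result = in_place.reverse()
print(result)    # None
print(in_place)  # ['1) Movie A', '2) Movie B', '3) Movie C']

# Slicing builds a new reversed list and leaves the original untouched
reversed_copy = scraped[::-1]


def rank_of(title: str) -> int:
    """Return the leading rank number, assuming the form '<rank>) <title>'."""
    return int(title.split(")", 1)[0])


# Sorting by rank does not depend on the order the page happens to use
by_rank = sorted(scraped, key=rank_of)
print(by_rank == reversed_copy)  # True
```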

Let’s go one step further and store this data in a .txt file for future reference:

# Writing the top 100 movies to a file called movies.txt
with open("movies.txt", mode="w", encoding="utf-8") as file:
    for movie in movie_titles:
        file.write(f"{movie}\n")
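To confirm the file was written as expected, it can be read back and compared with the list. This is a sketch with placeholder titles; it writes to a temporary directory (my choice, so that it does not overwrite a real movies.txt).

```python
import tempfile
from pathlib import Path

# Hypothetical titles standing in for the scraped list
movie_titles = ["1) Movie A", "2) Movie B", "3) Movie C"]

with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = Path(tmp_dir) / "movies.txt"

    # Same writing loop as in the tutorial, one title per line
    with open(output_path, mode="w", encoding="utf-8") as file:
        for movie in movie_titles:
            file.write(f"{movie}\n")

    # Read the file back and split it into lines without the trailing "\n"
    saved_titles = output_path.read_text(encoding="utf-8").splitlines()

print(saved_titles == movie_titles)  # True
```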

And that’s it! We have now acquired our Top 100 Movies to binge-watch using Python🐍🍿

Here’s the link to the full project:

100-Days-of-Code/Day-45 at main · adithya1010/100-Days-of-Code

github.com

I hope this project has helped you understand web scraping with Python. Happy coding!👋