Python Web Scraping Tutorial
Introduction:
If you are into data analysis or big data, chances are you need to collect data from websites. Python is commonly used for working with data because of its extensive libraries and simplicity. We will now look at a simple web scraping project that I built using Python🐍
Prerequisites:
Before we begin this tutorial, make sure you have Python installed on your machine. Head over to the official page here to install it if you have not done so.
In this tutorial I’ll be using PyCharm as my IDE, and we will be installing the BeautifulSoup and requests modules for this project.
Tutorial:
This is the website that we will be working with for this Project:
This is Empire’s 100 Greatest Movies of All Time. We will scrape the site and prepare a list of the top 100 movies.
Installing the Dependencies:
In the terminal enter the following command:
pip install requests beautifulsoup4
Import the necessary modules:
In main.py:
from bs4 import BeautifulSoup
import requests
Getting the response from the site:
URL = "https://web.archive.org/web/20200518073855/https://www.empireonline.com/movies/features/best-movies-2/"
response = requests.get(URL)
print(response)
website_html = response.text
Run main.py to see the response from the site:
<Response [200]>
Great! A 200 status code means the page was fetched successfully.
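You can also let the code check the status instead of reading it off the terminal. Here is a minimal sketch of that idea. The helper name `fetch_html` and the 10-second timeout are my own picks for illustration and are not part of the original project.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the page HTML, raising an error if the request failed."""
    # The timeout (an assumed value) stops the script from waiting forever
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

With this helper in place, `website_html = fetch_html(URL)` would do the job of the request and `.text` lines above.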
Extracting data:
Let’s now use BeautifulSoup to parse the HTML and store the result in an object called soup:
soup = BeautifulSoup(website_html, "html.parser")
print(soup.prettify())
We can now see that the HTML of the site is stored in the soup object.
Go ahead and remove the print statement or comment it out. I included it only to show what the soup object holds.
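Printing the whole page is a lot to scroll through. A lighter sanity check is to print just the page title. The sketch below runs on a tiny hand-written HTML string rather than the real page, so `sample_html` and its contents are made up for illustration.

```python
from bs4 import BeautifulSoup

# A made-up snippet standing in for website_html
sample_html = "<html><head><title>Best Movies</title></head><body><p>Hello</p></body></html>"

sample_soup = BeautifulSoup(sample_html, "html.parser")

# .title is the first <title> tag and .string is the text inside it
page_title = sample_soup.title.string
print(page_title)
```

On the real page you would call `soup.title.string` on the soup object created above.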
Getting all the Titles of the Movies:
On the website, right-click one of the titles and click Inspect.
As you can see, each title is stored in an h3 with the class title.
So let’s get all the h3s that have the class “title” and store them in a variable:
all_movies_title = soup.find_all(name="h3", class_="title")
print(all_movies_title)
Cool! Now we have acquired all the titles of the top 100 movies.
But there’s one problem: we got the surrounding HTML tags as well.
So let’s extract just the text from each title and store it in a list.
Let’s use a list comprehension to get only the text, as follows:
# Getting the title of each H3 and forming a list of all titles
movie_titles = [movie.getText() for movie in all_movies_title]
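If you want to see what `find_all` and the text extraction do without hitting the network, here is a small sketch on a hand-written snippet. The two titles and their numbering are made-up stand-ins and not copied from the site. I also use `get_text(strip=True)` here, which trims stray whitespace around each title.

```python
from bs4 import BeautifulSoup

# Made-up markup: two h3 tags with class "title" and one without
sample_html = """
<h3 class="title">2) Second Movie</h3>
<h3 class="other">Not a movie title</h3>
<h3 class="title"> 1) First Movie </h3>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Only the h3 tags whose class is "title" are returned
tags = sample_soup.find_all(name="h3", class_="title")

# get_text(strip=True) returns the tag's text with surrounding whitespace removed
sample_titles = [tag.get_text(strip=True) for tag in tags]
print(sample_titles)
```

The h3 with the class "other" is skipped, which is the same filtering that picks out only the movie titles on the real page.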
The list currently starts from 100, so let’s reverse it to make it start from 1:
# Reversing the list using reverse()
movie_titles.reverse()
print(movie_titles)
Yay! We got our list of the top 100 movies from Empire.
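One thing to keep in mind: `reverse()` changes the list in place and returns `None`, so writing `movie_titles = movie_titles.reverse()` would lose the list. If you would rather keep the original order around as well, slicing gives you a reversed copy. The three-item list below is made up for illustration.

```python
# A made-up list standing in for movie_titles
scraped = ["3) Third", "2) Second", "1) First"]

# [::-1] builds a new reversed list and leaves `scraped` untouched
ordered = scraped[::-1]

print(ordered)
print(scraped)
```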
Let’s go one step further and store this data in a .txt file for future reference:
# Writing the top 100 movies to a file called movies.txt
with open("movies.txt", mode="w", encoding="utf-8") as file:
    for movie in movie_titles:
        file.write(f"{movie}\n")
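To confirm the file came out as expected, you can read it back. This sketch writes a short made-up list into a temporary folder so it does not touch your real movies.txt. The `sample_titles` values are placeholders.

```python
import tempfile
from pathlib import Path

# Placeholder titles standing in for movie_titles
sample_titles = ["1) First Movie", "2) Second Movie"]

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "movies.txt"

    # Same writing pattern as in the tutorial: one title per line
    with open(path, mode="w", encoding="utf-8") as file:
        for movie in sample_titles:
            file.write(f"{movie}\n")

    # Read the file back and split it into a list of lines
    saved_titles = path.read_text(encoding="utf-8").splitlines()

print(saved_titles)
```

If the list you read back matches the list you wrote, the file is good to go.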
And that’s it! We now have our top 100 movies to binge-watch, all thanks to Python🐍🍿
Here’s the link to the full project:
Hope this project has helped you understand web scraping with Python. Happy Coding!👋