Introduction
Web scraping is a way of extracting data or information from the internet. It can be done manually or with automation, and automation is usually the better choice because it gathers data faster and more reliably. Unlike the tedious process of copying data by hand, an automated scraper can collect thousands of records in seconds, which is why this method of data gathering is so widely relied upon.
Beautiful Soup is a Python library for writing scripts that scrape, or gather, data from the web. It is a tool for automating web scraping, and its versatility and time-saving capabilities make it valuable for parsing and extracting data from HTML and XML content.
This article looks at how to use Beautiful Soup for web scraping. You will build a web scraper with Python and Beautiful Soup, navigate a real-world job listing website, extract relevant data such as the job title, company name, job role, and location, and save the results as a text file for later use.
By the end of this article, you will know what web scraping is and how to use Beautiful Soup to gather specific data, such as job listings, from the internet.
What is web scraping?
Web scraping is a technique used to extract data from sources on the internet, particularly websites. It uses automated processes or scripts to navigate web pages, retrieve HTML content, and extract particular data elements, enabling the organized gathering of information.
Web scraping is one of the most effective ways of gathering useful data from the internet. It has applications in research, business, and many other fields, enabling those who rely on it to make informed, data-driven decisions.
What is Beautiful Soup?
Beautiful Soup is a popular Python library used for web scraping. It provides a convenient way to extract data from HTML and XML files by parsing the markup and navigating the parsed tree structure. Beautiful Soup handles malformed markup well and provides high-level methods for searching, filtering, and manipulating the parsed data.
Beautiful Soup supports various parsers, including Python's built-in html.parser, lxml, and html5lib. It is widely used for scraping data from websites, generating automated tests, and parsing XML files. With its simple and intuitive API, Beautiful Soup makes web scraping seamless.
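To get a feel for the API before tackling a real site, here is a minimal sketch that parses a small, made-up HTML snippet with the built-in html.parser and extracts text from it:

from bs4 import BeautifulSoup

# A small, made-up HTML snippet to illustrate parsing
html = "<html><body><h1>Hello</h1><p class='intro'>Welcome to scraping.</p></body></html>"

# Parse the markup with Python's built-in parser
soup = BeautifulSoup(html, "html.parser")

# Navigate the parsed tree and extract text
print(soup.h1.text)                           # Hello
print(soup.find("p", class_="intro").text)    # Welcome to scraping.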
Setting up the environment for web scraping with Beautiful Soup
Install Python: If you do not have Python installed on your machine, visit the official website (python.org), download it, and follow the installation guide for your operating system.
Install Beautiful Soup: Open your command prompt or terminal and run the command:
pip install beautifulsoup4
Install a parser: Beautiful Soup supports multiple parsers, including Python's built-in HTML parser, lxml, and html5lib. This article uses the lxml parser, which you can install with pip:
pip install lxml
Install requests: The requests library lets you download the HTML of the web page you will be scraping. Run the command:
pip install requests
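To confirm everything installed correctly, a quick sanity check like the following should run without errors (the printed version numbers will vary on your machine):

import bs4
import requests
from lxml import etree

# Print installed versions to confirm the environment is ready
print("beautifulsoup4:", bs4.__version__)
print("requests:", requests.__version__)
print("lxml:", etree.__version__)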
Scraping web content with Python Beautiful Soup
- Scrape HTML content from a page.
First, you need to get the site’s HTML code into your Python script so that you can interact with it. For this task, you will use Python’s requests library.
import requests

# Download the page and keep the raw HTML as a string
html_text = requests.get("https://www.jobberman.com/job?experience=entry-level").text
This sends a GET request to the specified URL, grabs the HTML that the server sends back, and stores it in a Python string.
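In practice, you may also want to confirm the request succeeded before parsing. Here is one hedged sketch (the timeout value is an illustrative choice, not something the site requires):

import requests

url = "https://www.jobberman.com/job?experience=entry-level"

# Fail fast if the site is unreachable or returns an error status
response = requests.get(url, timeout=10)
response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses

html_text = response.text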
- Parse HTML code with Beautiful Soup.
Import the library in your script and create a Beautiful Soup object:
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, 'lxml')
- Find elements by HTML class name.
jobs = soup.find_all('div', class_='w-full')
You call .find_all() on the Beautiful Soup object soup, and it returns a list containing every element on the page that matches the given tag and class, one entry per job listing; the result is stored in jobs. The child elements of each listing can then be accessed with the .find() method.
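To make the difference between .find_all() and .find() concrete, here is a toy snippet (the markup is made up for illustration and far simpler than the real site):

from bs4 import BeautifulSoup

# Two made-up job cards mimicking the structure described above
html = """
<div class='w-full'><a href='/job/1'><p>Backend Engineer</p></a></div>
<div class='w-full'><a href='/job/2'><p>Data Analyst</p></a></div>
"""
soup = BeautifulSoup(html, 'lxml')

cards = soup.find_all('div', class_='w-full')  # every matching element
print(len(cards))                              # 2
print(cards[0].find('a').p.text)               # Backend Engineer (first match inside the first card)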
- Access child elements.
for job in jobs:
    # Job role: the <a> tag wraps a <p> that holds the role text
    role = job.find('a', class_='relative mb-3 text-lg font-medium break-words focus:outline-none metrics-apply-now text-link-500 text-loading-animate')
    if role and role.p:
        job_role = role.p.text
    else:
        print("No <p> tag found inside the <a> tag.")
        continue  # Skip to the next iteration if no <p> tag is found

    # Company name: the <p> tag wraps an <a> that holds the company text
    company_name = job.find('p', class_='text-sm text-link-500')
    if company_name and company_name.a:
        company = company_name.a.text
    else:
        print("No <a> tag found inside the <p> tag.")
        continue  # Skip to the next iteration if no <a> tag is found

    # Job function, e.g. the category of the role
    job_function = job.find('p', class_='text-sm text-gray-500 text-loading-animate inline-block')
    if job_function and job_function.text:
        plain_text = job_function.text.strip()
    else:
        plain_text = "No job function information"

    # Job location: nested inside the job details block
    job_location = "Unknown"
    job_details = job.find_all('div', class_='flex flex-wrap mt-3 text-sm text-gray-500 md:py-0')
    for detail in job_details:
        location_tag = detail.find('span', class_='mb-3 px-3 py-1 rounded bg-brand-secondary-100 mr-2 text-loading-hide')
        if location_tag:
            job_location = location_tag.text
In the code above, we loop through the jobs list with a for loop and access the child anchor tag to get the job role. Before reading its text, we confirm that the tag exists and contains the paragraph that holds the job role; only then do we collect the text content. The same check-then-extract procedure is repeated for the:
- company name
- job function
- job detail
- job location
The .text attribute returns the text content of an HTML element, and .strip() removes leading and trailing whitespace. Now you can create a .txt file to save the collected data for later use.
    # Still inside the for-job loop: append this job's details to the file
    file_name = 'posts/jobs.txt'  # illustrative name; the posts/ folder must already exist
    with open(file_name, 'a') as job_file:
        job_file.write(f'Job Role: {job_role}\n')
        job_file.write(f'Company Name: {company}\n')
        job_file.write(f'{plain_text}\n')
        job_file.write(f'Job Location: {job_location}\n')
        job_file.write('\n')  # Blank line as a separator between job entries
Now your Python Beautiful Soup script is ready for web scraping. You can wrap the entire code in a function so that it can be run at intervals whenever fresh data is needed.
from bs4 import BeautifulSoup
import requests
import time

def search_jobs():
    # Download and parse the listings page on every run
    html_text = requests.get("https://www.jobberman.com/job?experience=entry-level").text
    soup = BeautifulSoup(html_text, 'lxml')
    jobs = soup.find_all('div', class_='w-full')

    for job in jobs:
        role = job.find('a', class_='relative mb-3 text-lg font-medium break-words focus:outline-none metrics-apply-now text-link-500 text-loading-animate')
        if role and role.p:
            job_role = role.p.text
        else:
            print("No <p> tag found inside the <a> tag.")
            continue  # Skip to the next iteration if no <p> tag is found

        company_name = job.find('p', class_='text-sm text-link-500')
        if company_name and company_name.a:
            company = company_name.a.text
        else:
            print("No <a> tag found inside the <p> tag.")
            continue  # Skip to the next iteration if no <a> tag is found

        job_function = job.find('p', class_='text-sm text-gray-500 text-loading-animate inline-block')
        if job_function and job_function.text:
            plain_text = job_function.text.strip()
        else:
            plain_text = "No job function information"

        job_location = "Unknown"
        job_details = job.find_all('div', class_='flex flex-wrap mt-3 text-sm text-gray-500 md:py-0')
        for detail in job_details:
            location_tag = detail.find('span', class_='mb-3 px-3 py-1 rounded bg-brand-secondary-100 mr-2 text-loading-hide')
            if location_tag:
                job_location = location_tag.text

        file_name = 'posts/jobs.txt'  # illustrative name; the posts/ folder must already exist
        with open(file_name, 'a') as job_file:
            job_file.write(f'Job Role: {job_role}\n')
            job_file.write(f'Company Name: {company}\n')
            job_file.write(f'{plain_text}\n')
            job_file.write(f'Job Location: {job_location}\n')
            job_file.write('\n')  # Blank line as a separator between job entries

if __name__ == "__main__":
    while True:
        search_jobs()
        time_wait = 20
        print(f'Waiting {time_wait} minutes...')
        time.sleep(time_wait * 60)
- Text format of the details scraped from the website.
Conclusion
In summary, this article discussed web scraping, particularly automating the process by writing a script to scrape data from the internet. As a case study, a Nigerian job website was scraped using Beautiful Soup, a Python library for parsing HTML and XML content, to extract specific data from the site.