Python BeautifulSoup: Removing the Tags from the Content of a Page

To strip the HTML tags from a scraped web page and keep only its text, you can use the BeautifulSoup library in Python together with requests. Here’s an example:

import requests
from bs4 import BeautifulSoup

# Send a GET request to the webpage
url = "https://example.com"  # Replace with the URL of the webpage you want to scrape
response = requests.get(url, timeout=10)
response.raise_for_status()  # Stop early on HTTP errors such as 404 or 500

# Parse the HTML content
soup = BeautifulSoup(response.content, "html.parser")

# Find the desired content and remove HTML tags
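content = soup.find("div", class_="content")  # Replace with the appropriate tag and class/id
if content is None:  # find returns None when nothing matches
    raise SystemExit("No matching element found; check the tag and class/id")
text = content.get_text(strip=True)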

# Print the cleaned text
print(text)

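In the code above, we send a GET request to the webpage, parse the HTML using BeautifulSoup with Python's built-in html.parser, and then use the find method to locate the desired content on the page. Replace "div" with the appropriate HTML tag and provide the relevant class or id if necessary. Note that find returns only the first matching element, or None if nothing matches.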
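To make the lookup options concrete, here is a minimal sketch that runs find against an inline HTML string instead of a live page. The markup, class names, and id are made up for illustration; a real page will use its own.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page
html = """
<html><body>
  <div id="main" class="content"><p>Hello, <b>world</b>!</p></div>
  <article class="post">Second block</article>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

by_class = soup.find("div", class_="content")  # match on tag name and class
by_id = soup.find(id="main")                   # match on id alone
by_tag = soup.find("article")                  # match on tag name alone
missing = soup.find("div", class_="sidebar")   # no such element on this page

print(by_class["id"])     # the class lookup and the id lookup hit the same div
print(by_tag.get_text())
print(missing)            # None, so check before calling get_text on it
```

Because a failed lookup gives back None rather than raising an error, it is usually worth checking the result before calling any method on it.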

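Once the content is located, we use the get_text method to extract the text without any HTML tags. The strip=True argument strips leading and trailing whitespace from each piece of text inside the element and skips pieces that are empty. The pieces are then joined with the separator, which defaults to an empty string, so text from adjacent tags can run together. Pass separator=" " (or "\n") along with strip=True if you need a gap between them.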

Conclusion:

Finally, we print the cleaned text. You can modify this code to save the cleaned text to a file or process it further according to your needs.
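As one way to save the result instead of printing it, the sketch below writes the cleaned text to a UTF-8 file. The save_text helper, the file name, and the sample markup are all hypothetical; the temporary directory is used only so the example can run anywhere.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def save_text(text: str, path: Path) -> None:
    """Write the cleaned text to path as UTF-8 (hypothetical helper)."""
    path.write_text(text, encoding="utf-8")


# Hypothetical markup standing in for the scraped element
html = '<div class="content"><p>Line one</p><p>Line two</p></div>'
content = BeautifulSoup(html, "html.parser").find("div", class_="content")

# One line of output per text fragment
text = content.get_text(separator="\n", strip=True)

out_path = Path(tempfile.gettempdir()) / "cleaned_text.txt"  # assumed location
save_text(text, out_path)
print(out_path.read_text(encoding="utf-8"))
```

Setting the encoding explicitly avoids platform-dependent defaults when the page contains non-ASCII characters.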

Note that the specific HTML structure and class/id names will vary depending on the webpage you are scraping. You may need to inspect the webpage source code to identify the appropriate tags and attributes to use in the find method.
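If reading the page source by hand is tedious, a rough alternative is to list every tag that carries a class or id and pick a candidate from that list. This is only a sketch on made-up markup; the candidates list is an illustrative name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page
html = """
<body>
  <div id="page">
    <div class="content main"><p>Text</p></div>
    <footer class="site-footer">Footer</footer>
  </div>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

candidates = []
for tag in soup.find_all(True):   # True matches every tag in the document
    classes = tag.get("class")    # list of class names, or None if absent
    tag_id = tag.get("id")        # id string, or None if absent
    if classes or tag_id:
        candidates.append((tag.name, tag_id, classes))

for name, tag_id, classes in candidates:
    print(name, tag_id, classes)
```

Any tag name, id, or class printed here can then be passed to find as shown in the main example.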