First, a small correction: the end of the previous lesson should look something like this:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.riddlersdata.blogspot.com")
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)
Depending on what you're looking for, you won't need the nameList function.
This particular scraping and parsing method isolates the first header (bsObj.h1) on the page.
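Note that bsObj.h1 only grabs the first h1 tag; if you want every header on the page, BeautifulSoup's find_all does that. Here is a small sketch (the inline HTML below is made up so it runs without hitting the network):

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the blog page, so this runs offline.
sample_html = """
<html><body>
  <h1>Main Title</h1>
  <h2>Section One</h2>
  <h2>Section Two</h2>
</body></html>
"""
bsObj = BeautifulSoup(sample_html, "html.parser")

# bsObj.h1 returns only the FIRST <h1>; find_all collects every matching tag.
print(bsObj.h1.get_text())                 # → Main Title
for header in bsObj.find_all(["h1", "h2"]):
    print(header.get_text())
```

The same find_all call works on the real page once you feed it html.read() instead of the sample string.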
Ok, so since this is part 2 let's do some part 2 worthy execution.
First, let's take a look at the code:
html = urlopen("http://www.riddlersdata.blogspot.com")
...this piece of code retrieves the data from the webpage, and the way we have it formatted here could cause potential issues:
1) The page may not be found on the server
2) The server may not be found
In which case you will get an HTTP error code, so we handle this by wrapping the call in a try/except block (which will replace your existing line):
try:
    html = urlopen("http://www.riddlersdata.blogspot.com")
except HTTPError as e:
    print(e)
    # return None, break, or do some other "Plan B"
else:
    # program continues. Note: if you return or break in the
    # exception catch, you do not need to use the "else" statement
***NOTE*** Any line that begins with a hash (#) is not actual code; these are comments within the code that you write to yourself, just to help clarify what goes where and why. Every programmer uses these.
In case no server is found, we can add a check by using the following code:
if html is None:
    print("URL is not found")
else:
    # program continues
These commands say: if no server is found, print "URL is not found"; otherwise, continue scraping.
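One caveat: in practice, urlopen does not return None when the server is missing; it raises a URLError from urllib.error. A sketch that catches both failure modes might look like this (safe_open is a made-up helper name, and the URL is just the lesson's example):

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def safe_open(url):
    # Made-up helper: returns the page handle, or None on any fetch problem.
    try:
        return urlopen(url)
    except HTTPError as e:
        print("The server returned an HTTP error:", e)   # page not found, etc.
        return None
    except URLError as e:
        print("The server could not be found:", e)       # bad domain, no network
        return None

html = safe_open("http://www.riddlersdata.blogspot.com")
if html is None:
    print("URL is not found")
```

Keeping both except clauses means the script degrades gracefully whether the page or the whole server is missing.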
Next, type in the following code:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError as e:
        return None
    return title

title = getTitle("http://www.riddlersdata.blogspot.com")
if title is None:
    print("Title could not be found")
else:
    print(title)
With these commands we have created a scraper that handles exceptions via try/except and if/else statements. This is exactly what we want if we are going to scrape a lot of webpages looking for a particular type of data, while taking into account the possibility of broken servers and pages.
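For example, the finished getTitle function drops straight into a loop over many pages. A sketch (the second URL in the list is made up for illustration, and getTitle is repeated here so the snippet runs on its own):

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    # Same idea as above; URLError is also caught so a dead server returns None.
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError:
        return None
    return title

# Hypothetical list of pages to check; swap in your own URLs.
pages = [
    "http://www.riddlersdata.blogspot.com",
    "http://www.riddlersdata.blogspot.com/made-up-page.html",
]
for url in pages:
    title = getTitle(url)
    if title is None:
        print(url, "-> Title could not be found")
    else:
        print(url, "->", title.get_text())
```

Because every failure mode comes back as None, the loop never crashes partway through a long list of pages.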