Saturday, November 14, 2015
More web scraping with Python and Beautiful Soup!
findAll(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)
The findAll and find functions are the two most common functions you will use when you're scraping a page; they both allow you to search through HTML for specific tags, filtering by attributes, recursion, text, a result limit, and keyword arguments.
Most of the time you will only need the tag and attribute arguments.
The tag argument takes the name (or a set of names) of the specific tags you want to search for; these could be headers h1, h2, h3, etc., paragraphs p, list items li, or anything else you want to search for.
.findAll({"h1","h2","p,"li","div","header"})
Here we use the attributes argument, which takes a dictionary of attributes and matches tags containing any one of the listed values:
.findAll("span", {"class":{"blue","yellow"}})
Let's look at the recursive argument now.
The recursive argument is a Boolean, meaning that it operates based upon a True or False value. The argument itself is asking you "how deep would you like to look into the document to obtain the information you want?" If recursive is set to True (the default), findAll will look into children, children's children, and so on for the tags you want. Sometimes we don't want to look that deep, because we may only want a limited set of information within a parent tag, such as its direct children; in that case, set recursive to False.
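For example, here is a minimal sketch of the difference (the HTML snippet is a made-up example just for illustration):

from bs4 import BeautifulSoup

# A made-up snippet with a nested list
html = "<ul><li>top<ul><li>nested</li></ul></li></ul>"
bsObj = BeautifulSoup(html, "html.parser")

# The default recursive search finds the nested li as well
print(len(bsObj.ul.findAll("li")))                   # prints 2

# recursive=False only examines direct children of the ul
print(len(bsObj.ul.findAll("li", recursive=False)))  # prints 1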
Now let's look at the text argument. This argument will search for specific text surrounded by tags. For instance, let's say we are scraping a page with a large body of text, like a book. Perhaps we are scraping the Bible on a webpage and we want to find the word 'Moses':
nameList = bsObj.findAll(text="Moses")
print(len(nameList))
This will retrieve all the instances where the word Moses is used; printing the length of the list tells us how many were found.
The keyword argument will select tags that contain a specific attribute:
allText = bsObj.findAll(id="text")
print(allText[0].get_text())
Some additional notes on the keyword function:
Keyword is particularly clumsy in one case: class is a reserved word in Python, so you cannot simply use it like this
bsObj.findAll(class="green")
Instead, type either of the following to find a particular keyword:
bsObj.findAll(class_="green")
or
bsObj.findAll("", {"class":"green"}
That's all for now until next time!
Wednesday, November 11, 2015
Web Scraping with Python, Part 2
So there was a small error: the end of the previous lesson should look something like this:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.riddlersdata.blogspot.com")
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)
Depending on what you're looking for, you won't need the nameList function.
This particular scraping and parsing method will isolate the first header (bsObj.h1) on the page.
Ok, so since this is Part 2, let's do some Part 2-worthy execution.
First, let's take a look at the code:
html = urlopen("http://www.riddlersdata.blogspot.com")
...this piece of code retrieves the data from the webpage, and the way we have it formatted here could cause potential issues:
1) The page may not be found on the server
2) The server may not be found
In the first case, you will get an HTTP error, so the way we handle this is by using the following code (which will replace your existing urlopen line):
try:
    html = urlopen("http://www.riddlersdata.blogspot.com")
except HTTPError as e:
    print(e)
    # return null, break, or do some other "Plan B"
else:
    # program continues. Note: If you return or break in the
    # exception catch, you do not need to use the "else" statement
***NOTE*** Any line that begins with a hash sign (#) is not actual code; these are comments within the code that you can write to yourself, just to help clarify what goes where and why. Every programmer uses these.
In case no server is found and html comes back as None, we can add a check by using the following code:
if html is None:
    print("URL is not found")
else:
    # program continues
These commands say "if no server is found, tell me 'URL is not found'; otherwise, continue scraping."
Next, type in the following code:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError as e:
        return None
    return title

title = getTitle("http://www.riddlersdata.blogspot.com")
if title is None:
    print("Title could not be found")
else:
    print(title)
In this example, we're creating a function, getTitle, which returns either the title of the page or None if there was a problem retrieving it.
From these commands we have created a scraper with exception handling via try/except and if/else statements, which is exactly what we want if we are going to scrape a lot of webpages looking for a particular type of data while taking into account the possibility of broken servers/pages.
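And to give a taste of scraping a lot of pages, here is a minimal sketch that reuses the getTitle function above in a loop (the extra URLs are hypothetical placeholders):

# Hypothetical list of pages to scrape
urls = [
    "http://www.riddlersdata.blogspot.com",
    "http://example.com/page1",
    "http://example.com/page2",
]

for url in urls:
    title = getTitle(url)  # reuses the function defined above
    if title is None:
        print("Title could not be found for " + url)
    else:
        print(title.get_text())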
Tuesday, November 10, 2015
Tensorflow
Tensorflow has been released to the public!
This is seriously, really, really cool! I haven't looked into everything it can do, and have yet to try it out (too busy with everything else programming related), but I am dying to fiddle with this!
So what is Tensorflow and why does it matter?
Tensorflow is "...an interface for expressing machine learning algorithms, and an implementation for executing such algorithms". It is great if you want to use it for a deep neural network. What is a deep neural network? A deep neural network learns multiple levels of representation and abstraction that help to make sense of data like images, sound, or text. Ultimately, the goal of DNNs is to get to actual artificial intelligence, the awareness and identification of images, sound, text, etc., as a product of machine learning.
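For a small taste of the style (a minimal sketch based on the release-era API, which I haven't tried myself yet), a Tensorflow program builds a computation graph first and then executes it in a session:

import tensorflow as tf

# Build a tiny computation graph: two constants and their sum
a = tf.constant(2)
b = tf.constant(3)
total = tf.add(a, b)

# Nothing has been computed yet; the graph runs inside a session
with tf.Session() as sess:
    print(sess.run(total))  # prints 5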
I will try and be back soon to write more on it, but for now I wanted to help spread the word about this incredible tool.