Saturday, November 14, 2015

Web Scraping with Python; Part 3

More web scraping with Python and Beautiful Soup!

findAll(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)

The findAll and find functions are the two functions you will use most often when you're scraping a page; both let you search through the HTML for specific tags, filtering by attributes, recursion depth, text, a result limit, and keyword arguments.

Most of the time you will only need the tag and attribute arguments.
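
To see the difference between the two, here is a quick sketch (assuming bsObj is the BeautifulSoup object from the earlier lessons): find returns only the first match, while findAll returns every match.

firstHeader = bsObj.find("h1")     # just the first <h1> on the page
allHeaders = bsObj.findAll("h1")   # a list of every <h1> on the page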

The tag argument takes the name (or names) of the tags you want to search for; they could be headers like h1, h2, and h3, paragraphs (p), list items (li), or anything else on the page.

.findAll({"h1","h2","p,"li","div","header"})

Here we use the attributes argument to match spans whose class is either blue or yellow:
.findAll("span", {"class":{"blue", "yellow"}})

Let's look at the recursive argument now.
The recursive argument is a Boolean, meaning it takes a True or False value. It answers the question "how deep into the document should the search go?" If recursive is set to True (the default), findAll looks into children, children's children, and so on for the tags you want. Sometimes we don't want to look that deep, because we may only want the tags sitting directly inside a parent rather than everything nested below it; in that case we set recursive to False.
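
As a minimal sketch (again assuming bsObj is our BeautifulSoup object, and using div purely as an example tag):

# Default behaviour: dig through children, children's children, and so on
allDivs = bsObj.findAll("div", recursive=True)

# Only look at tags that are direct children of <body>, nothing deeper
topLevelDivs = bsObj.body.findAll("div", recursive=False)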

Now let's look at the text argument. This argument searches for tags by their text content rather than by tag name. For instance, let's say we are scraping a page with a large body of text in the form of a book. Perhaps we are scraping the Bible on a webpage and we want to find the word 'Moses':

nameList = bsObj.findAll(text="Moses")
print(len(nameList))

This retrieves every occurrence of the word 'Moses' surrounded by tags, and print(len(nameList)) tells us how many were found.

The keyword argument will select tags that contain a specific attribute:

allText = bsObj.findAll(id="text")
print(allText[0].get_text())

Some additional notes on the keyword argument:
The keyword argument is particularly clumsy in one case: class is a reserved word in Python, so you cannot simply use it like this

bsObj.findAll(class="green")

Instead, type either of the following to find a particular keyword:

bsObj.findAll(class_="green")
or
bsObj.findAll("", {"class":"green"}

That's all for now until next time!

Wednesday, November 11, 2015

Web Scraping with Python, Part 2

So there is a small error: the end of the previous lesson's code should look something like this:

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.riddlersdata.blogspot.com")
bsObj = BeautifulSoup(html.read())
print(bsObj.h1)

Depending on what you're looking for, you won't need the nameList portion of the code.

This particular scraping and parsing method isolates the first h1 header (bsObj.h1) on the page.

Ok, so since this is part 2 let's do some part 2 worthy execution.

First, let's take a look at the code:
html = urlopen("http://www.riddlersdata.blogspot.com")
...this piece of code here retrieves the data from the webpage, and the way we have it written could cause potential issues:
1) The page may not be found on the server
2) The server may not be found

In that case you will get an HTTP error code, so the way we handle it is by wrapping the call in a try/except block, which replaces your existing line (HTTPError comes from urllib.error; we import it in the full example further down):

try:
    html = urlopen("http://www.riddlersdata.blogspot.com")
except HTTPError as e:
    print(e)
    # return None, break, or do some other "Plan B"
else:
    # program continues. Note: if you return or break in the
    # exception catch, you do not need to use the "else" statement

***NOTE*** any line that begins with a hash sign (#) is not actual code; these are comments within the code that you write to yourself, just to help clarify what goes where and why. Every programmer uses them.

In case no server is found, we can add a check by using the following code:
if html is None:
    print("URL is not found")
else:
    # program continues

These commands say "if no server is found, tell me 'URL is not found'; otherwise continue scraping".

Next, type in the following code:

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bsObj = BeautifulSoup(html.read())
        title = bsObj.body.h1
    except AttributeError as e:
        return None
    return title

title = getTitle("http://www.riddlersdata.blogspot.com")
if title == None:
    print("Title could not be found")
else:
    print(title)
In this example, we define getTitle(), which returns the page's title, or None if there was any problem retrieving it. With these commands we have created a scraper that handles exceptions and uses if/else checks, which is exactly what we want if we are going to scrape a lot of webpages looking for a particular type of data while taking into account the possibility of broken servers and pages.
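
If you do want to point this at a lot of pages, a minimal sketch is to loop over getTitle() with a list of URLs (the list below is just a made-up example):

# Hypothetical list of pages to check; swap in the URLs you actually care about
urls = ["http://www.riddlersdata.blogspot.com", "http://example.com"]

for url in urls:
    title = getTitle(url)   # returns None on HTTP or attribute errors
    if title is None:
        print("Title could not be found for " + url)
    else:
        print(title.get_text())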

Tuesday, November 10, 2015

Tensorflow

Tensorflow has been released to the public!

This is seriously, really, really cool! I haven't looked into everything it can do, and have yet to try it out (too busy with everything else programming related), but I am dying to fiddle with this!

So what is Tensorflow and why does it matter?

Tensorflow is "...an interface for expressing machine learning algorithms, and an implementation for executing such algorithms". It is great if you want to build a deep neural network. What is a deep neural network? A deep neural network learns multiple levels of representation and abstraction that help make sense of data like images, sound, or text. Ultimately, the goal of deep neural networks is to move toward actual artificial intelligence, a product of machine learning: the awareness and identification of images, sound, text, and so on.

I will try and be back soon to write more on it, but for now I wanted to help spread the word about this incredible tool.

Saturday, October 31, 2015

Intro to Python; Web scraping Part 1

Python is a great tool for a variety of reasons, but one of the reasons I like Python is for obtaining data. As much as I can, I actually use R for number crunching and data analysis, but R lags behind the simplicity of Python if you want to "scrape the web." If you're wondering, yes, you can use R for web scraping much like I am about to show you with Python, but the process of actually setting up your scraping algorithm is a bit different; first, you have differences in language. I have been using R for quite some time, so I am comfortable with the language and actually enjoy it, whereas with Python I am still pretty new.

But if you're totally new to both languages, then great! Not only is the language used in Python relatively simpler than R (minor differences, but for the most part it is slightly easier, I think), but the scraping algorithm is much easier to create in Python than in R. In R, for example, you have to download a few different packages (not a big deal, as you also have to with Python, where the setup is actually more involved), and then you have to input all of your commands, a lot of which are intimidating to newbies. I'll do an example in R later so you can see for yourself... for now let's learn web scraping with Python!


.... This will assume that you're totally new to Python...

First thing you want to do is to go to python.org and download the appropriate package for your operating system. If you're running on a Windows-based operating system as I am, you can simply download the Python program from your browser (I use Chrome).


Once you have downloaded the program (assuming you have downloaded it from your web-browser), all you have to do is to install it and run it.

**NOTE: If you are unsure of which version of Python to download (2.7 or 3.5), I recommend 3.5. There are big differences in the language between the two versions, and 2.7 only receives maintenance fixes these days, while new development goes into the 3.x line. 3.5 is what you want.


Open up Python, and also open up a text editor; something like Notepad will work just fine. All that matters is that the program lets you write in it and save in other file formats, because for Python you want to save your document as .py.... for example "Pythonisfun.py". Let's begin.




Your setup should look something like this. The window on the left is my text editor saved with the .py extension on the end of it (so Python can read and execute it), and the window on the right is our Python IDLE.

***NOTE: when you launch Python, be sure that you're selecting the Python IDLE program (also known as the Python Shell). This is the window where we will type and execute our commands.

Now, one of the great things about Python 3.5 is that it comes with a lot of packages built in, so you can simply type in which package you want to use. If that sounds weird, let me explain it better: Python comes with pre-installed software packages (libraries) that allow you to perform a variety of tasks.

To load one of these packages, we use the import command. So, in your IDLE window, type it exactly like this: import urllib.request and hit the 'enter' key on your keyboard.


.......... Congratulations! You have just written your very first command in the Python interface! WOOHOO!

What does this do? This tells Python to import a specific Python package into the current environment. The package is urllib, which is part of Python's standard library and deals with URLs (web addresses), and request is the module inside it that actually requests data from websites.

Next type in htmlfile = urllib.request.urlopen("http://google.com") and hit 'enter'.

Followed by htmltext = htmlfile.read() and hit 'enter'.

And finally type in print(htmltext) and hit 'enter'.

....this is what the code should look like
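
In other words, the whole session in the IDLE window amounts to these four lines (using google.com as our example URL):

>>> import urllib.request
>>> htmlfile = urllib.request.urlopen("http://google.com")
>>> htmltext = htmlfile.read()
>>> print(htmltext)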


***Notice that we haven't done anything to our text editor. That's fine for now, I just want you to get used to having them both open because in the future we will be using them both.

Ok, so if you did everything correctly your IDLE window should look like mine. 

Your next question is probably.... what the heck are all those blue lines of text? Well that is the source code of google.com! You can also view that in your web browser using the shortcut CTRL + U. 

Congratulations! You have just written your first script using Python! And, you have completed the first step in web scraping! Woohoo! 

Ok awesome, this is great, but what good is having all that text if we don't know what to do with it? Well, luckily we're going to do something with it!

So, web scraping is pretty cool and very useful if you're looking for specific data. For example, if you want to find all the <tags> used on a webpage, we can simply write an algorithm much like the one we just did and find them all. What's more, we can find a whole lot of information, anything from words used, to numbers, locations, names, lists, etc. You name it! Web scraping allows us to find data that we can use for specific purposes. I won't go into a detailed explanation here, because I think it's better to learn and apply the actual techniques first and go into more detail later; quite honestly, that is what works best for me when I'm learning.

Alright, so after we collect this data we want to do something with it because in its current form it isn't neat, doesn't tell us anything, and quite honestly isn't useful. It's raw and we need to refine it!

This is where parsing comes into play. After we collect our data as we have just done, we must parse it so that we have more useful data, more refined data.

Type the following into your IDLE window:

>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> html = urlopen("http://www.riddlersdata.blogspot.com/")
>>> bsObj = BeautifulSoup(html)
>>> nameList = bsObj.findAll("span", {"class":"green"})
>>> for name in nameList:
        print(name.get_text())


.... hit 'enter' twice.
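
Since we also have that text editor open, here is the same parse written as a plain script you could save and run instead of typing it into IDLE line by line (the filename is just a suggestion):

# greenscraper.py -- the same parse as above, saved as a script
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.riddlersdata.blogspot.com/")
bsObj = BeautifulSoup(html)

# grab every <span class="green"> on the page and print its text
nameList = bsObj.findAll("span", {"class":"green"})
for name in nameList:
    print(name.get_text())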