Saturday, October 31, 2015

Intro to Python; Web scraping Part 1

Python is a great tool for a variety of reasons, but one of the main reasons I like it is for obtaining data. As much as I can, I actually use R for number crunching and data analysis, but R lags behind Python in simplicity when you want to "scrape the web." If you're wondering, yes, you can use R for web scraping much like I'm about to show you with Python, but the process of setting up your scraping algorithm is a bit different. Part of that is just the difference in language: I have been using R for quite some time, so I'm comfortable with it and actually enjoy it, whereas with Python I'm still pretty new. But if you're totally new to both languages, then great! Not only is Python's language a bit simpler than R's (minor differences, but for the most part it's slightly easier, I think), the algorithm itself is also much easier to create in Python. In R, for example, you have to download a few different packages (not a big deal, as you also have to with Python) and then input all of your commands, a lot of which are intimidating to newbies. I'll do an example in R later so you can see for yourself... for now, let's learn web scraping with Python!


.... This will assume that you're totally new to Python...

First thing you want to do is to go to python.org and download the appropriate package for your operating system. If you're running on a Windows-based operating system as I am, you can simply download the Python program from your browser (I use Chrome).


Once you have downloaded the program (assuming you have downloaded it from your web-browser), all you have to do is to install it and run it.

**NOTE: If you are unsure of which version of Python to download (2.7 or 3.5), I recommend 3.5. There are big differences in the language between the two versions, but 2.7 is the legacy branch and only gets maintenance fixes; all new development happens in 3.x. 3.5 is what you want.


Open up Python, and open up a text editor. Something like Notepad will work just fine; all that matters is that the program lets you write plain text and save it in other file formats. For Python, you want to save your document with a .py extension, for example "Pythonisfun.py". Let's begin.




Your setup should look something like this. The window on the left is my text editor saved with the .py extension on the end of it (so Python can read and execute it), and the window on the right is our Python IDLE.

***NOTE: when you launch Python, be sure that you're selecting the Python IDLE program (also known as the Python shell). This is our main executable white space.

Now, one of the great things about Python 3.5 is that it ships with a large standard library, so many packages are ready to use the moment you install it; you just type in which package you want. If that sounds weird, let me explain it better: Python comes with pre-installed software packages (libraries) that allow you to perform a variety of tasks.

To load one of these packages, we use the import command. So, in your IDLE shell, type import urllib.request, exactly like that, and hit the 'enter' key on your keyboard.


.......... Congratulations! You have just written your very first command in the Python interface! WOOHOO!

What does this do? It tells Python to import a specific package into the current environment. urllib is a package in Python's standard library for working with URLs (and yes, "url" means exactly what you think: the web address of the page we're going to look at), and request is the module inside it that requests data from websites.
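You can check this package/module relationship for yourself right in the IDLE shell (nothing here is specific to my setup):

```python
import urllib.request

# urllib is the package; request is the module inside it
print(urllib.request.__name__)  # → urllib.request
```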

Next type in htmlfile = urllib.request.urlopen("http://google.com") and hit 'enter'.

Followed by htmltext = htmlfile.read() and hit 'enter'.

And finally type in print(htmltext) and hit 'enter'.
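Put together as a single script, the three commands above look like this (note this sketch fetches google.com over the network, so it needs a live internet connection to run):

```python
import urllib.request

# Open a connection to the page and read its raw source
htmlfile = urllib.request.urlopen("http://google.com")
htmltext = htmlfile.read()  # bytes containing the page's HTML
print(htmltext[:100])  # just the first 100 bytes, to keep the output short
```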

....this is what the code should look like


***Notice that we haven't done anything to our text editor. That's fine for now, I just want you to get used to having them both open because in the future we will be using them both.

Ok, so if you did everything correctly your IDLE window should look like mine. 

Your next question is probably.... what the heck are all those blue lines of text? Well that is the source code of google.com! You can also view that in your web browser using the shortcut CTRL + U. 
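One thing worth knowing about that output: what urlopen().read() hands back is a bytes object, not a regular string. If you want to search it with normal string methods, decode it first. A minimal sketch (the inline bytes literal is just a stand-in for the real fetched source):

```python
# Stand-in for the bytes returned by htmlfile.read()
htmltext = b"<html><head><title>Google</title></head></html>"

page = htmltext.decode("utf-8")  # bytes -> str
print("<title>" in page)  # → True
```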

Congratulations! You have just written your first script using Python! And, you have completed the first step in web scraping! Woohoo! 

Ok awesome, this is great, but what good is having all that text if we don't know what to do with it? Well, luckily we're going to do something with it!

So, web scraping is pretty cool and very useful if you're looking for specific data. For example, if you want to find all the <tags> used on a webpage, we can simply write an algorithm much like the one we just did and find all <tags>. What's more, we can find out a whole lot of information: anything from words used, to numbers, locations, names, lists, etc. You name it! Web scraping allows us to find data that we can use for specific purposes. I won't go into a detailed explanation here because I think it's better to learn and apply the actual techniques first and then go into more detail afterwards; quite honestly, that's what works best for me when I'm learning.
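As a small taste of that, here's a sketch of finding every tag on a page using the BeautifulSoup library (the same one we'll use in a moment). The inline HTML snippet is made up for the example; on a real page you'd pass in the source you fetched with urlopen:

```python
from bs4 import BeautifulSoup

# A small inline HTML snippet standing in for a real page's source
html = """<html><body>
<h1>Heading</h1>
<p>First paragraph</p>
<p>Second paragraph</p>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
tags = [tag.name for tag in soup.find_all(True)]  # True matches every tag
print(tags)  # → ['html', 'body', 'h1', 'p', 'p']
```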

Alright, so after we collect this data we want to do something with it because in its current form it isn't neat, doesn't tell us anything, and quite honestly isn't useful. It's raw and we need to refine it!

This is where parsing comes into play. After we collect our data as we have just done, we must parse it so that we have more useful data, more refined data.

Type the following into your IDLE shell. We'll use the BeautifulSoup library for the parsing; if Python can't find it, you may need to install it first (pip install beautifulsoup4). After the for line, hit 'enter' twice to run the loop:

>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> html = urlopen("http://riddlersdata.blogspot.com/")
>>> bsObj = BeautifulSoup(html, "html.parser")
>>> nameList = bsObj.find_all("span", {"class": "green"})
>>> for name in nameList:
	print(name.get_text())
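If the page you point this at doesn't happen to have any span tags with class "green", the loop will simply print nothing. To see the pattern work on data you control, here's the same parse run on an inline HTML snippet (the names and the snippet are made up for the example):

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a fetched page
html = '<p><span class="green">Alice</span> met <span class="gray">Bob</span></p>'

soup = BeautifulSoup(html, "html.parser")
names = [span.get_text() for span in soup.find_all("span", {"class": "green"})]
print(names)  # → ['Alice']
```

The {"class": "green"} dictionary is the filter: only tags whose attributes match it are returned, which is how we refine the raw source down to just the pieces we care about.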
