Saturday, November 14, 2015
More web scraping with Python and Beautiful Soup!
findAll(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)
The findAll and find functions are the two most common functions you will use when you're scraping a page; they both let you search through an HTML document for specific tags, filtered by attributes, recursion depth, text, a limit on the number of results, and keyword arguments.
Most of the time you will only need the tag and attribute arguments.
The tag argument takes the name (or names) of the specific tags you want to search for; it could be headers like h1, h2, and h3, paragraphs (p), list items (li), or anything else you want to search for.
.findAll({"h1","h2","p","li","div","header"})
Here we use the attributes argument, which filters tags by their attribute values; for example, to grab every span tag whose class is either "blue" or "yellow":
.findAll("span", {"class":{"blue","yellow"}})
Let's look at the recursive argument now.
The recursive argument is a Boolean, meaning that it operates on a TRUE or FALSE value. The argument answers the question "how deep into the document would you like to look for the information you want?" If recursive is set to TRUE (the default), findAll will look into children, children's children, and so on for the tags you want. Sometimes we don't want to look that deep into the document, because we may only want a limited set of information within a parent, or a single level of children.
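Here is a short sketch of the difference, assuming the bsObj object built above and a page that has a body tag:
# With the default recursive=True the search walks the whole tree under <body>;
# with recursive=False it only inspects the direct children of <body>.
allDivs = bsObj.body.findAll("div")
topLevelDivs = bsObj.body.findAll("div", recursive=False)
print(len(allDivs), len(topLevelDivs))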
Now let's look at the text argument. This argument searches by the text content of tags rather than their names. For instance, let's say we are scraping a page with a large block of text in the form of a book; perhaps we are scraping the Bible on a webpage and we want to find the word 'Moses':
nameList = bsObj.findAll(text="Moses")
print(len(nameList))
This retrieves every tag whose text is the word 'Moses', and the print statement tells us how many there are.
The keyword argument will select tags that contain a specific attribute:
allText = bsObj.findAll(id="text")
print(allText[0].get_text())
Some additional notes on the keyword function:
The keyword approach is particularly clumsy in one case: class is a reserved word in Python, so you cannot simply use it like this
bsObj.findAll(class="green")
Instead, type either of the following to find a particular keyword:
bsObj.findAll(class_="green")
or
bsObj.findAll("", {"class":"green"}
That's all for now until next time!
Wednesday, November 11, 2015
Web Scraping with Python, Part 2
So there was a small error in the last lesson; the end of it should look something like this:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.riddlersdata.blogspot.com")
bsObj = BeautifulSoup(html.read())
print(bsObj.h1)
Depending on what you're looking for, you won't need the nameList lines from last time.
This particular scraping and parsing method isolates the first header (bsObj.h1) on the page.
Ok, so since this is part 2 let's do some part 2 worthy execution.
First, let's take a look at the code:
html = urlopen("http://www.riddlersdata.blogspot.com")
...this piece of code retrieves the data from the webpage, and the way we have it written could cause problems in two cases:
1) The page may not be found on the server
2) The server may not be found
In either case you will get an error, so the way we handle this is by using the following code (which will replace your existing line):
try:
    html = urlopen("http://www.riddlersdata.blogspot.com")
except HTTPError as e:
    print(e)
    # return null, break, or do some other "Plan B"
else:
    # program continues. Note: If you return or break in the
    # exception catch, you do not need to use the "else" statement
***NOTE*** Any line that begins with a hash sign (#) is not actual code; these are comments within the code that you write to yourself, just to help clarify what goes where and why. Every programmer uses these.
In case no server is found, we can add a check by using the following code:
if html is None:
    print("URL is not found")
else:
    # program continues
These commands say "if no server is found, tell me 'URL is not found'; otherwise, continue scraping."
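Putting the error check and the None check together, a minimal sketch might look like this (using the same URL as above):
from urllib.request import urlopen
from urllib.error import HTTPError

html = None
try:
    html = urlopen("http://www.riddlersdata.blogspot.com")
except HTTPError as e:
    print(e)

if html is None:
    print("URL is not found")
else:
    print("Page retrieved, ready to scrape")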
Next, type in the following code:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bsObj = BeautifulSoup(html.read())
        title = bsObj.body.h1
    except AttributeError as e:
        return None
    return title

title = getTitle("http://www.riddlersdata.blogspot.com")
if title is None:
    print("Title could not be found")
else:
    print(title)
In this example, we're creating a function, getTitle, which returns either the title of the page or a None object if there was a problem retrieving it.
From these commands we have created a scraper with exception handling and if/else checks, which is exactly what we want if we are going to scrape a lot of webpages looking for a particular type of data while taking into account the possibility of broken servers and pages.
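Since the whole point is scraping many pages, here is a small sketch of how getTitle could be reused in a loop; the list of URLs is purely hypothetical:
# Hypothetical list of pages to check, reusing the getTitle function defined above
urls = [
    "http://www.riddlersdata.blogspot.com",
    "http://www.riddlersdata.blogspot.com/2015/11/",
]

for url in urls:
    title = getTitle(url)
    if title is None:
        print(url, "-> Title could not be found")
    else:
        print(url, "->", title.get_text())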
Tuesday, November 10, 2015
Tensorflow
Tensorflow has been released to the public!
This is seriously, really, really cool! I haven't looked into everything it can do, and have yet to try it out (too busy with everything else programming related), but I am dying to fiddle with this!
So what is TensorFlow and why does it matter?
TensorFlow is "...an interface for expressing machine learning algorithms, and an implementation for executing such algorithms". It is great if you want to build a deep neural network. What is a deep neural network? It is a model that learns multiple levels of representation and abstraction, which helps make sense of data like images, sound, or text. Ultimately, the goal of deep neural networks is to move toward actual artificial intelligence through machine learning: awareness and identification of images, sound, text, and so on.
I will try and be back soon to write more on it, but for now I wanted to help spread the word about this incredible tool.
Saturday, October 31, 2015
Intro to Python; Web scraping Part 1
Python is a great tool for a variety of reasons, but one of the reasons I like Python is for obtaining data. As much as I can, I actually use R for number crunching and data analysis, but R lags behind the simplicity of Python if you want to "scrape the web." If you're wondering, yes, you can use R for web scraping much like I am about to show you with Python, but the process of actually setting up your scraping algorithm is a bit different. First, you have differences in language: I have been using R for quite some time, so I am comfortable with it and actually enjoy it, whereas with Python I am still pretty new. But if you're totally new to both languages, then great! Not only is the language used in Python relatively simpler than R (minor differences, but for the most part it is slightly easier, I think), but the scraping algorithm is much easier to create in Python than in R. For example, in R you have to download a few different packages (not a big deal, as you also have to with Python) and then input all of your commands, a lot of which are intimidating to newbies. I'll do an example in R later so you can see for yourself... for now let's learn web scraping with Python!
.... This will assume that you're totally new to Python...
First thing you want to do is to go to python.org and download the appropriate package for your operating system. If you're running on a Windows-based operating system as I am, you can simply download the Python program from your browser (I use Chrome).
Once you have downloaded the program (assuming you have downloaded it from your web browser), all you have to do is install it and run it.
**NOTE: If you are unsure of which version of Python to download (2.7 or 3.5), I recommend 3.5. There are big differences in the language between the two versions, but 2.7 is the legacy line and new development is focused on 3.x; 3.5 is what you want.
Open up Python, and open up a text editor; something like Notepad will work just fine, as long as the program lets you write in it and save in other file formats. For Python, you want to save your document with the .py extension, for example "Pythonisfun.py". Let's begin.
Your setup should look something like this. The window on the left is my text editor saved with the .py extension on the end of it (so Python can read and execute it), and the window on the right is our Python IDLE.
***NOTE: when you launch Python, be sure that you're selecting the Python IDLE program (also known as the Python Shell). This is our main executable workspace.
Now, one of the great things about Python 3.5 is that it comes with a large standard library (plus built-in support for installing extra packages), so you can simply tell Python which package you want to use. If that sounds weird, let me explain it better: Python ships with pre-installed software packages (libraries) that allow you to perform a variety of tasks.
To load one of these packages, we use the import command. So, in your IDLE window, type import urllib.request. Type it exactly like this: import urllib.request and hit the 'enter' key on your keyboard.
.......... Congratulations! You have just written your very first command in the Python interface! WOOHOO!
What does this do? This tells Python to import a specific package into the current environment. The package we are using is called urllib, which is part of Python's standard library; the "url" part is exactly what you think it means, and request is the module inside urllib that requests data from webpages.
Next type in htmlfile = urllib.request.urlopen("http://google.com") and hit 'enter'.
Followed by htmltext = htmlfile.read() and hit 'enter'.
And finally type in print(htmltext) and hit 'enter'.
....this is what the code should look like
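Typed into the IDLE shell, the whole session looks roughly like this (IDLE prints the >>> prompts for you, and the output shown here is trimmed):
>>> import urllib.request
>>> htmlfile = urllib.request.urlopen("http://google.com")
>>> htmltext = htmlfile.read()
>>> print(htmltext)
b'<!doctype html> ...'    (followed by a long wall of Google's raw page source)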
***Notice that we haven't done anything to our text editor. That's fine for now, I just want you to get used to having them both open because in the future we will be using them both.
Ok, so if you did everything correctly your IDLE window should look like mine.
Your next question is probably.... what the heck are all those blue lines of text? Well that is the source code of google.com! You can also view that in your web browser using the shortcut CTRL + U.
Congratulations! You have just written your first script using Python! And, you have completed the first step in web scraping! Woohoo!
Ok awesome, this is great, but what good is having all that text if we don't know what to do with it? Well, luckily, we're going to do something with it!
So, web scraping is pretty cool and very useful if you're looking for specific data. For example, if you want to find all the <tags> used on a webpage, you can simply write an algorithm much like the one we just did and find them all. What's more, we can find out a whole lot of information: anything from words used, to numbers, locations, names, lists, etc. You name it! Web scraping allows us to find data that we can use for specific purposes. I won't go into a detailed explanation here, because I think it's better to learn and apply the actual techniques first and then go into more detailed explanation afterward; quite honestly, that is what works best for me when I'm learning.
Alright, so after we collect this data we want to do something with it because in its current form it isn't neat, doesn't tell us anything, and quite honestly isn't useful. It's raw and we need to refine it!
This is where parsing comes into play. After we collect our data as we have just done, we must parse it so that we have more useful data, more refined data.
Type the following into the IDLE shell:
>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> html = urlopen("http://www.riddlersdata.blogspot.com/")
>>> bsObj = BeautifulSoup(html)
>>> nameList = bsObj.findAll("span", {"class":"green"})
>>> for name in nameList:
        print(name.get_text())
.... hit 'enter' twice.
Thursday, October 10, 2013
Example of collecting data!
I am going to use events in my life as examples of how to properly collect, organize, categorize, manage, and interpret data. Based upon my results, I will construct a model that I may use to predict future events.
Microsoft Excel will be my software of choice for actually managing my data and constructing models. What's nice about Excel is not only its easy-to-use functions but also the ability to manipulate individual cells and neatly organize data. After all the data has been collected, I can then build models in the form of graphs, charts, and functions that may be used to (hopefully) predict some future results with fair accuracy.
I will begin collecting data on my finances. Information used will be categories such as "income", "expenses", "assets", and "liabilities".
Under the "assets" we want to list our cash on hand (how much cash you have on your person, in your home, etc). Followed by "deposits" which will be used to define how much cash we have in the bank.
We will want to list our other less liquid assets such as "cars", "home", "tools", etc. These go after or below more liquid assets like cash.
Next we list our income: how much, from where, how often, etc. Income is classified as an asset so income ultimately will be listed under the "assets" heading.
After we list our assets we may list our "liabilities", beginning with small expenses like "rent", "phone bill", "fuel", "electric", "food", etc.
After we list our smaller expenses we can list our larger liabilities such as "student loans", "mortgage", etc.
We will want to record the frequency of each event as it happens, and its monthly effect on our calculated net worth. Under standard accounting rules, assets equal liabilities plus equity, but because we are dealing with individuals rather than companies, we may simply have more assets than liabilities, or, if we are in a less-than-hopeful position in life, our liabilities will be greater than our assets... let's hope it doesn't come to that!
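If you would rather prototype the arithmetic outside of Excel, here is a minimal Python sketch of the same idea; every figure below is a made-up placeholder:
# Made-up placeholder figures, just to illustrate the net-worth arithmetic
assets = {"cash on hand": 500, "deposits": 2500, "car": 8000, "tools": 1200}
liabilities = {"rent": 900, "phone bill": 60, "student loans": 15000}

net_worth = sum(assets.values()) - sum(liabilities.values())
print("Total assets:     ", sum(assets.values()))
print("Total liabilities:", sum(liabilities.values()))
print("Net worth:        ", net_worth)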
Now after we set all this up in our Excel spreadsheet, we can color coordinate it and have some fun with creating tables, fonts, etc.
We will want to create pie charts that can be used to describe our current situation at any given time, and also a line graph to chart our situation over time.
Along with this, let's create a table that lists every day of the current month. We will use this table to describe in one word whether or not we are happy; we can say "happy" or "sad". This will be tied into our charts and will become our "x-axis".
Or, we can do another topic. Anything really. This is just to give an idea of what is possible.
We can also do this with a pencil and paper, though a bit more math is required.
For example, we can create a simple Cartesian coordinate system, with wealth as the Y-axis and happiness as the X-axis. For our table of happiness, "happy" may be scored n=1, where it is attributed one point along the X-axis, and "sad" n=-1. The -1 must be used because sad is the opposite of happy, and unless we include a third descriptor such as "content," our line will drop.
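The same scoring scheme is easy to sketch in Python; the run of labels below is made up purely for illustration:
# Hypothetical happy/sad labels for a few days, scored as +1 / -1 as described above
days = ["happy", "happy", "sad", "happy", "sad"]
runningTotal = 0
for day in days:
    score = 1 if day == "happy" else -1
    runningTotal += score
    print(day, score, "running total:", runningTotal)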
Wednesday, October 9, 2013
My new blog!
Hello,
RiddlerThis here! I have created this blog to discuss the wide world of data!
Data is always increasing. It's one of the few things in the world that continues on ad infinitum, and because of that, maintaining data, organizing data, and creating correlations and links between data becomes an ever more difficult, albeit ever more important, job.
Note the following is my interpretation of [big] data:
Current affairs and history are the two main components of data. History comprises everything that has already happened: events, discoveries, conversations, developments, and so on may all be neatly organized and categorized under the heading of "history".
The second component, current affairs, contains everything happening in real time. The war in Afghanistan, the U.S. government shutdown, what I am eating right now, etc., are all current affairs. How long do current affairs stay current affairs? I find that to be relatively subjective, but ideally it should be no more than a few minutes. I say ideally because if we had an algorithm/model efficient enough to categorize and organize all data streaming into a single source, all current affairs would no longer be current affairs but history. This would take a fair amount of computing power, a stable electric grid, server capacity, and algorithms linked into RSS feeds that constantly scan all incoming information and organize that data by category, then sub-category, then sub-sub-category, and so on, until it is placed in the right folder or specific category. An example would be the "government shutdown", which would go from "current affairs" to "history" to "politics" to "American" to "government shutdown". You could break it down further if you really wanted to.
After we categorize this data we'll need to interpret it, answering such ingrained questions as: "What does this mean?" If there is a firefight happening in Afghanistan, we must interpret it as such. "Current affairs"-"history"-"war"-"Afghanistan"-"Bagram"-"October 9th, 2013", etc... you could have a vast amount of data just from one report. That is why it is important to interpret it: "shots fired near Bagram, U.S. Marines return fire"-"firefight at Bagram".
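To make the idea concrete, here is a toy sketch of how one report might carry both a category chain and a one-line interpretation; the structure is just an illustration, not a real system:
# A toy illustration of the category chain described above; nothing here is a real feed
report = {
    "headline": "Shots fired near Bagram, U.S. Marines return fire",
    "interpretation": "firefight at Bagram",
    "categories": ["current affairs", "history", "war", "Afghanistan", "Bagram", "October 9th, 2013"],
}

# Walk the chain from broad to specific, like filing the report into nested folders
print(" > ".join(report["categories"]))
print(report["interpretation"])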
Categorizing and interpreting data are the first steps. After completing both steps, we can begin to create a model or theories and try to predict future events. But in order to do so, we must go back and break down the data as much as possible and analyze each piece, until the data, broken down into its basic parts, has become so basic that it is no longer relevant to the event that occurred. That is essentially a false statement, because every piece of information that leads up to the main event being analyzed is relevant, though less and less so. It's similar to how a number can keep being subdivided without end: 0.1, 0.01, 0.001, etc. You're essentially looking at a rounded integer more closely, then dividing it into its parts, then dividing those previously divided parts into more parts, and so on. All the information is related, but not essential. That's where it is important to continue to interpret the incoming data and categorize it, until you have "essential" and "non-essential" information, where the "essential" information is all the data one should look at to understand the event that has occurred, and the "non-essential" information is everything that backs up the essential information. It's not necessary to read it to understand the event (i.e. the event's significance or meaning), but it is there and readily available if you want it.
A perfect model would require all kinds of energy, algorithms, and reliability. It would be very hard, but not impossible. IBM's Watson is kind of cool in this respect, because it can learn, which allows it to function at a higher cognitive level than a simple PC.
In most cases however, a perfect model is not necessary. In fact, a good model that is well-functioning is all we need. We don't really have to have all the data that would come with a perfect model. Although we have our model to do the categorizing and searching of data for us, we would still be overwhelmed with information due to our biology. Nate Silver is a great example of having a well-functioning model that is not perfect. He can make accurate predictions (not perfect ones) that get most of the details correct, which ultimately leads to correct predictions. His models work better than anyone else's, hence why he has garnered so much attention.
After we construct a good model we can begin making theories, saying something like "John McCain will win the presidency in 2016." Or, we can begin to develop links between data: data X says A, which connects to what data Z says about B. Making inferences about data is not a bad thing, so long as you have a model to help predict or organize your data, because, as is often the case, bits and pieces of one event may correlate to bits and pieces of another event. An example of this being... [Event]: "firefight at Bagram base" - [Event]: "Senator John McCain visits Bagram." Those are main events, but even specific key words or phrases used in one event can be cross-checked and related to information in another event. The purpose of our model is to tie together not just those events, but also the key information used in the description of that data, or the numbers found within it. Whatever! The point is that all the information must be analyzed and broken down to determine any relationship with any other data from any other event. Dates, times, people, job descriptions, locations, etc.: what, when, where, who, how? We answer all those questions for a particular set of data, then cross-check with other sets of data to establish any possible relationships and linkages, and then we can begin the process of making predictions. The more information, the more accurate the prediction.
That is why data is such an issue today. We have more data being created more rapidly than ever before. Harnessing and organizing this data is hard; using it to make predictions is even harder. Hence why Nate Silver is the most successful... he's constructed a better model than anyone else.