Thursday, October 10, 2013

Example of collecting data!

I am going to use events in my life as examples of how to properly collect, organize, categorize, manage, and interpret data. Based upon my results, I will construct a model that I can use to predict future events.

Microsoft Excel will be my software of choice for managing my data and constructing models. What's nice about Excel is not only its easy-to-use functions but also the ability to manipulate individual cells and neatly organize data. After all the data has been collected, I can then build models in the form of graphs, charts, and functions that may be used to (hopefully) predict some future results with fair accuracy.

I will begin by collecting data on my finances. The information will be organized into categories such as "income", "expenses", "assets", and "liabilities".

Under the "assets" we want to list our cash on hand (how much cash you have on your person, in your home, etc). Followed by "deposits" which will be used to define how much cash we have in the bank.

We will want to list our other less liquid assets such as "cars", "home", "tools", etc. These go after or below more liquid assets like cash.

Next we list our income: how much, from where, how often, etc. Income is treated as an asset in this setup, so it will ultimately be listed under the "assets" heading.

After we list our assets we may list our "liabilities", beginning with small expenses like "rent", "phone bill", "fuel", "electric", "food", etc.

After we list our smaller expenses we can list our larger liabilities such as "student loans", "mortgage", etc.

We will want to list the frequency of each event as it happens, and its monthly effect on our calculated net worth. Under standard accounting rules, assets always equal liabilities plus equity, but because we are dealing with an individual rather than a company, that difference is simply our net worth: we may have more assets than liabilities, or, if we are in a less-than-hopeful position in life, our liabilities will be greater than our assets... let's hope it doesn't come to that!
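If you want to sketch the same layout in code before opening Excel, here's a rough Python mock-up of what I described above. The category names and every number are placeholders I invented for illustration, not my real figures.

    # Rough mock-up of the spreadsheet layout described above.
    # All figures are made-up placeholders.
    assets = {
        "cash on hand": 200.00,      # most liquid first
        "deposits": 1500.00,         # cash in the bank
        "car": 6000.00,              # less liquid assets below
        "tools": 800.00,
    }
    monthly_income = {
        "paycheck": 2400.00,         # how much, from where, how often
    }
    monthly_liabilities = {
        "rent": 850.00,              # smaller recurring expenses first
        "phone bill": 60.00,
        "fuel": 120.00,
        "electric": 90.00,
        "food": 300.00,
        "student loan payment": 250.00,   # larger liabilities after
    }

    net_worth = sum(assets.values())
    monthly_change = sum(monthly_income.values()) - sum(monthly_liabilities.values())
    print(f"Net worth today: ${net_worth:,.2f}")
    print(f"Expected monthly change: ${monthly_change:+,.2f}")

In Excel, each of these groups would just be a labeled block of cells, with SUM() doing the same arithmetic.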

Now, after we set all this up in our Excel spreadsheet, we can color-code it and have some fun with tables, fonts, etc.

We will want to create pie charts that describe our current situation at any given time, and also a line graph to chart our situation over time.
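For anyone working outside Excel, a quick Python sketch of those two pictures (assuming matplotlib is installed; the numbers are the same kind of made-up placeholders as before) might look like this:

    # Pie chart of the current situation and a line graph over time.
    # All numbers are invented placeholders.
    import matplotlib.pyplot as plt

    expenses = {"rent": 850, "phone": 60, "fuel": 120, "electric": 90, "food": 300}
    months = ["Jul", "Aug", "Sep", "Oct"]
    net_worth_by_month = [7900, 8200, 8350, 8500]   # made-up history

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.pie(list(expenses.values()), labels=list(expenses.keys()), autopct="%1.0f%%")
    ax1.set_title("Where the money goes (current month)")
    ax2.plot(months, net_worth_by_month, marker="o")
    ax2.set_title("Net worth over time")
    ax2.set_ylabel("Dollars")
    plt.tight_layout()
    plt.show()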

Along with this, let's create a table that lists every day of the current month. We will use this table to describe in one word whether or not we are happy: we can say "happy" or "sad". This will be tied into our charts and will become our "x-axis".

Or, we can do another topic. Anything really. This is just to give an idea of what is possible.

We can also do this with a pencil and paper, though a bit more math is required.

For example, we can create a simple Cartesian coordinate system, with wealth as the Y-axis and happiness as the X-axis. For our table of happiness, "happy" may be n = 1, where it is attributed one point along the X-axis, and "sad" n = -1. The -1 must be used because sad is the opposite of happy, and unless we include a third descriptor such as "content", our level line will drop.
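Here's a tiny sketch of that scoring idea in Python, with a week of invented moods, just to show how the running level rises and falls:

    # Map each day's one-word mood to a number: "happy" -> +1, "sad" -> -1.
    # Entries are invented for illustration; a third word like "content"
    # could map to 0 so the line doesn't have to drop.
    mood_by_day = ["happy", "happy", "sad", "happy", "sad", "sad", "happy"]
    scores = {"happy": 1, "sad": -1, "content": 0}

    level = 0
    running_level = []
    for mood in mood_by_day:
        level += scores[mood]
        running_level.append(level)

    print(running_level)   # [1, 2, 1, 2, 1, 0, 1]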

Wednesday, October 9, 2013

My new blog!

Hello,

RiddlerThis here! I have created this blog to discuss the wide world of data!

Data is always increasing. It's one of the few things in the world that continues on ad infinitum, and because of that, maintaining data, organizing data, and creating correlations and links between data becomes an ever more difficult, albeit ever more important, job.

Note the following is my interpretation of [big] data: 

Current affairs and history are the two main components of data. History comprises everything that has already happened: events, discoveries, conversations, developments, etc., may all be neatly organized and categorized under the heading of "history".

The second component, current affairs, contains everything happening in real time. The war in Afghanistan, the U.S. government shutdown, what I am eating right now, etc., are all current affairs. How long do current affairs stay current affairs? That I find to be relatively subjective, but ideally it should be no more than a few minutes. I say ideally because if we had an algorithm/model efficient enough to categorize and organize all data streaming into a single source, all current affairs would no longer be current affairs but history. This would take a fair amount of computing power, a stable electric grid, server capacity, and algorithms hooked into RSS feeds that constantly scan all incoming information and organize it by category, then sub-category, then sub-sub-category, etc., until it is placed in the right folder or the specific category that fits the data. An example would be "government shutdown", which would go from "current affairs" to "history" to "politics" to "American" to "government shutdown". You could break it down further if you really wanted to.
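As a toy illustration of that filing idea, here's a little Python sketch. The nested-dictionary "archive" and the file_item helper are my own invention, not part of any real feed reader:

    # File an incoming item under nested categories, most general first.
    # The path mirrors the "government shutdown" example above.
    def file_item(archive, path, item):
        """Walk (or create) nested dicts for each category, then store the item."""
        node = archive
        for category in path[:-1]:
            node = node.setdefault(category, {})
        node.setdefault(path[-1], []).append(item)
        return archive

    archive = {}
    file_item(
        archive,
        ["history", "politics", "American", "government shutdown"],
        "2013-10-09: U.S. government shutdown enters second week",
    )
    print(archive)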

After we categorize this data we'll need to interpret it, answering such fundamental questions as: "What does this mean?" If there is a firefight happening in Afghanistan we must interpret it as such. "Current affairs"-"history"-"war"-"Afghanistan"-"Bagram"-"October 9th, 2013", etc... you could have a vast amount of data just from one report. That is why it is important to interpret it: "shots fired near Bagram, U.S. Marines return fire" becomes "firefight at Bagram".
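In code, the interpreted version of one such report could be as simple as a small record like this (the report text and tags are hypothetical, echoing the example above):

    # A raw report condensed into an interpreted, categorized record.
    raw_report = "Shots fired near Bagram, U.S. Marines return fire."
    interpreted = {
        "summary": "firefight at Bagram",
        "categories": ["history", "war", "Afghanistan", "Bagram"],
        "date": "2013-10-09",
        "source_text": raw_report,
    }
    print(interpreted["summary"], "-", " / ".join(interpreted["categories"]))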

Categorizing and interpreting data are the first steps. After completing both, we can begin to create a model, or theories, and try to predict future events. But in order to do so, we must go back and break down the data as much as possible and analyze each piece, until the data being broken down into its basic parts has become so basic that it is no longer relevant to the event that occurred. That is not quite true, because every piece of information that leads up to the main event being analyzed is relevant, just less and less so. It's similar to how the decimals after 0 can continue without stopping: 0.1, 0.01, 0.001, etc. You're essentially looking at a rounded integer more closely, then dividing it into its parts, then dividing those parts into more parts, and so on. All the information is relevant, but not essential. That's why it is important to keep interpreting and categorizing the incoming data until you have "essential" and "non-essential" information, where the "essential" information is all the data one should look at to understand the event that has occurred, and the "non-essential" information is everything that backs up the essential information. It's not necessary to read it to understand the event (i.e. the event's significance or meaning), but it is there and readily available if you want it.
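A crude way to picture that split, with relevance scores I made up purely to show the idea of a cutoff:

    # Split an event's supporting details into "essential" and "non-essential"
    # based on an invented relevance score between 0 and 1.
    details = {
        "firefight at Bagram": 0.95,
        "U.S. Marines returned fire": 0.80,
        "time of first shot": 0.40,
        "weather at the base that morning": 0.10,
    }
    CUTOFF = 0.5
    essential = [d for d, score in details.items() if score >= CUTOFF]
    non_essential = [d for d, score in details.items() if score < CUTOFF]
    print("Essential:", essential)
    print("Non-essential:", non_essential)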

A perfect model would require all kinds of energy, algorithms, and reliability. It would be very hard to build, but not impossible. IBM's Watson is kind of cool in this respect because it can learn, which allows it to function at a higher cognitive level than a simple PC.

In most cases, however, a perfect model is not necessary. In fact, a good, well-functioning model is all we need. We don't really have to have all the data that would come with a perfect model; even with a model doing the categorizing and searching for us, we would still be overwhelmed with information because of our biology. Nate Silver is a great example of having a well-functioning model that is not perfect. He can make accurate predictions (not perfect ones) that get most of the details correct, which ultimately leads to correct predictions. His models work better than anyone else's, which is why he has garnered so much attention.

After we construct a good model we can begin making theories, saying something like "John McCain will win the presidency in 2016." Or we can begin to develop links between data: data X says A, which connects to what data Z says about B. Sometimes making inferences about data is not a bad thing, so long as you have a model to help predict or organize your data, because, as is often the case with data, bits and pieces of one event may correlate with bits and pieces of another event. An example of this being... [Event]: "firefight at Bagram base"-[Event]: "Senator John McCain visits Bagram." Those are main events, but even specific key words or phrases used in one event can be cross-checked and related to information in another event. The purpose of our model is to tie not just those events together, but also the key information used in the description of that data, or the numbers found within it. Whatever! The point is that all the information must be analyzed and broken down to determine any relationship with any other data from any other event. Dates, times, people, job descriptions, locations, etc. Example: what, when, where, who, how? We answer all those questions for a particular set of data, then cross-check with other sets of data to establish any possible relationships and linkages, and then we can begin the process of making predictions. The more information, the more accurate the prediction.
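Here's a toy version of that cross-check in Python, with two invented event records and a deliberately simple matching rule (shared answers to the who/what/when/where questions):

    # Link two events if they share answers to the basic questions.
    # Records and matching rule are invented for illustration.
    event_a = {"what": "firefight", "where": "Bagram", "when": "2013-10-09", "who": "U.S. Marines"}
    event_b = {"what": "senator visit", "where": "Bagram", "when": "2013-10-09", "who": "John McCain"}

    shared = {k: event_a[k] for k in event_a if event_a[k] == event_b.get(k)}
    if shared:
        print("Possible link via:", shared)   # {'where': 'Bagram', 'when': '2013-10-09'}
    else:
        print("No obvious relationship found.")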

That is why data is such an issue today. We have more data being created more rapidly than ever before. Harnessing and organizing this data is hard; using it to make predictions is even harder. That is why Nate Silver is the most successful... he's constructed a better model than anyone else.