Wednesday, October 9, 2013

My new blog!

Hello,

RiddlerThis here! I have created this blog to discuss the wide world of data!

Data is always increasing. It's one of the few things in the world that continues on ad infinitum, and because of that, maintaining data, organizing data, and creating correlations and links between data becomes an ever more difficult, yet ever more important, job.

Note: the following is my interpretation of [big] data:

Current affairs and history are the two main components of data. History comprises everything that has already happened; events, discoveries, conversations, developments, etc., may all be neatly organized and categorized under the heading of "history".

The second component, current affairs, contains everything happening in real time. The war in Afghanistan, the U.S. government shutdown, what I am eating right now, etc., are all current affairs. How long do current affairs stay current affairs? I find that to be relatively subjective, but ideally it should be no more than a few minutes. I say ideally because if we had an algorithm/model efficient enough to categorize and organize all data streaming into a single source, all current affairs would no longer be current affairs but history. This would take a fair amount of computing power, a stable electric grid, server capacity, and algorithms tied into RSS feeds that constantly scan all incoming information and organize it by category, then sub-category, sub-sub-category, etc., until it lands in the right folder or specific category. An example would be "government shutdown," which would go from "current affairs" to "history" to "politics" to "American" to "government shutdown." You could break it down further if you really wanted to.
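To make that concrete, here's a rough sketch of what such a categorizer could look like. The category tree, the keywords, and the categorize helper are all made up purely for illustration; a real system hooked into live RSS feeds would need something far more sophisticated.

```python
# A toy hierarchical categorizer: walk an incoming item down a hand-made
# category tree by keyword matching. The tree and keywords are invented
# for this example only.
CATEGORY_TREE = {
    "history": {
        "politics": {
            "american": {
                "government shutdown": ["shutdown", "congress", "appropriations"],
            },
        },
        "war": {
            "afghanistan": {
                "bagram": ["bagram", "firefight", "marines"],
            },
        },
    },
}

def categorize(text, tree, path=()):
    """Return the deepest category path whose keywords appear in the text,
    or None if nothing matches."""
    text = text.lower()
    for name, child in tree.items():
        if isinstance(child, dict):
            deeper = categorize(text, child, path + (name,))
            if deeper is not None:
                return deeper
        elif any(keyword in text for keyword in child):
            return path + (name,)
    return None

item = "The government shutdown continues as Congress fails to pass appropriations."
print(categorize(item, CATEGORY_TREE))
# -> ('history', 'politics', 'american', 'government shutdown')
```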

After we categorize this data we'll need to interpret it, answering such ingrained questions as: "What does this mean?" If there is a firefight happening in Afghanistan we must interpret it as such. "Current affairs"-"history"-"war"-"Afghanistan"-"Bagram"-"October 9th, 2013", etc... you could have a vast amount of data just from one report. That is why it is important to interpret it: "shots fired near Bagram, U.S. Marines return fire" becomes "firefight at Bagram".
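As a rough picture of what an "interpreted" report might end up looking like, here's a hypothetical structure. The field names and the sample values are all invented for this example.

```python
# A toy "interpreted event": one raw report reduced to a category path,
# the structured facts pulled out of it, and a one-line interpretation.
from dataclasses import dataclass, field

@dataclass
class InterpretedEvent:
    raw_text: str                 # the original report, verbatim
    category_path: tuple          # e.g. ("history", "war", "afghanistan", "bagram")
    facts: dict = field(default_factory=dict)   # who / what / when / where
    interpretation: str = ""      # the short, human-readable reading of the event

report = InterpretedEvent(
    raw_text="Shots fired near Bagram, U.S. Marines return fire.",
    category_path=("history", "war", "afghanistan", "bagram", "2013-10-09"),
    facts={"who": "U.S. Marines", "what": "firefight",
           "where": "Bagram", "when": "2013-10-09"},
    interpretation="Firefight at Bagram.",
)
print(report.interpretation, report.category_path)
```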

Categorizing and interpreting data are the first steps. After completing both, we can begin to create a model or theories and try to predict future events. But in order to do so, we must go back and break the data down as much as possible and analyze each piece, until the data being broken into its basic parts has become so basic that it is no longer relevant to the event that occurred. That is essentially a false statement, because every piece of information that leads up to the main event being analyzed is relevant, just less and less so. It's similar to how a decimal can continue without stop: 0.1, 0.01, 0.001, etc. You're essentially looking at a rounded number more closely, then dividing it into its parts, then dividing those parts into more parts, and so on. All the information is relevant, but not essential. That's where it is important to keep interpreting and categorizing the incoming data until you have "essential" and "non-essential" information, where the "essential" information is all the data one should look at to understand the event that occurred, and "non-essential" is all the information that backs up the essential information. It's not necessary to read it to understand the event (i.e., its significance or meaning), but it is there and readily available if you want it.
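Here's a toy illustration of that essential/non-essential split. The relevance scores and the cutoff are completely made up; figuring out how to score relevance in the first place is the hard part.

```python
# Split an event's pieces into "essential" and "non-essential" using a
# relevance score between 0 and 1. Scores and threshold are invented.
pieces = [
    ("firefight at Bagram", 0.95),
    ("U.S. Marines returned fire", 0.80),
    ("weather at the base that morning", 0.15),
    ("shift schedule of the gate guards", 0.05),
]

THRESHOLD = 0.5  # arbitrary cutoff for this illustration

essential = [text for text, score in pieces if score >= THRESHOLD]
non_essential = [text for text, score in pieces if score < THRESHOLD]

print("essential:", essential)
print("non-essential:", non_essential)
```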

A perfect model would require all kinds of energy, algorithms, and reliability. It would be very hard, but not impossible. IBM's Watson is kind of cool in this respect because it can learn, which allows it to function at a higher cognitive level than a simple PC.

In most cases, however, a perfect model is not necessary. In fact, a good, well-functioning model is all we need. We don't really have to have all the data that would come with a perfect model. Even with our model doing the categorizing and searching of data for us, we would still be overwhelmed with information due to our biology. Nate Silver is a great example of having a well-functioning model that is not perfect. He can make accurate predictions (not perfect ones) that get most of the details correct, which ultimately leads to correct predictions. His models work better than anyone else's, which is why he has garnered so much attention.

After we construct a good model we can begin making theories, saying something like "John McCain will win the presidency in 2016." Or we can begin to develop links between data: data X says A, which connects to what data Z says about B. Sometimes making inferences about data is not a bad thing, so long as you have a model to help predict or organize your data, because as is often the case with data, bits and pieces of one event may correlate with bits and pieces of another event. An example of this: [Event] "firefight at Bagram base" and [Event] "Senator John McCain visits Bagram." Those are main events, but even specific key words or phrases used in one event can be cross-checked and related to information in another event. The purpose of our model is to tie not just those events together, but also the key information used in the description of that data, or the numbers found within it. Whatever! The point is that all the information must be analyzed and broken down to determine any relationship with any other data from any other event. Dates, times, people, job descriptions, locations, etc.: what, when, where, who, how? We answer all those questions for a particular set of data, then cross-check with other sets of data to establish any possible relationships and linkages, and then we can begin the process of making predictions. The more information, the more accurate the prediction.
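As a small illustration of that cross-checking idea, here's a hypothetical sketch that compares the who/what/when/where fields of two events and keeps whatever they share. The events and their fields are made up.

```python
# Link two events by counting the structured facts they have in common.
def shared_fields(event_a, event_b):
    """Return the facts two events agree on."""
    common = {}
    for key in event_a.keys() & event_b.keys():
        if event_a[key] == event_b[key]:
            common[key] = event_a[key]
    return common

firefight = {"what": "firefight", "where": "Bagram", "when": "2013-10-09"}
visit = {"what": "senator visit", "who": "John McCain",
         "where": "Bagram", "when": "2013-10-09"}

print(shared_fields(firefight, visit))
# -> {'where': 'Bagram', 'when': '2013-10-09'}, i.e. the two events are related
```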

That is why data is such an issue today. We have more data being created, more rapidly, than ever before. Harnessing and organizing this data is hard; using it to make predictions is even harder. Hence why Nate Silver is the most successful: he's constructed a better model than anyone else.
