Information Management and Big Data.

Does every organisation find the science of information management (IM) interesting? Probably not, but the fact remains that information management is part of the foundation of Big Data and is critically important. Think of a building: if there is a problem with the foundation, structural problems are certain to follow.

So what are these foundations? The Big Data foundation is composed of two major systems: storage and processing. We tend to associate Big Data with the Hadoop Distributed File System (HDFS), although traditional data warehouses can also store big data. The advantage of HDFS over traditional storage is that you do not need to set up a structure for the data before storing it. It can simply be dumped into a file, which is logical in the age of Big Data because, with massive quantities of data, we may not yet know its value. HDFS is also more versatile than conventional data warehousing, which may not be able to manipulate data once it is stored, rendering that data effectively useless. This leads us to the second system of the Big Data foundation: processing.

Described simply, processing Big Data involves making calculations and manipulations with the data. Traditional databases vary in effectiveness and efficiency, and this can be linked to how well the database software exploits the architecture of the underlying hardware. In effect, there is a coupling between software and hardware that affects performance. With Hadoop, the processing software is known as MapReduce, a fault-tolerant parallel programming framework (fault-tolerant meaning software that automatically handles and recovers from processing failures in a highly reliable way). It sounds quite technical, but it is a way of dividing the processing workload into smaller workloads, which are then distributed. A key advantage over traditional systems is that data manipulations and calculations can be performed independently of one another.

A useful way to understand how the parallel programming framework works is to apply simple mathematical examples. Let's use two: the average and the median. To calculate an average, the data is split up and passed to the processors, each of which calculates a subtotal and a count for the data passed to it. A final average is then calculated by a master or controller from the subtotals returned by the processors. The median, however, is a more complex function. To find the middle number, the list of numbers needs to be redistributed across the processors, or the processors must communicate with each other, to locate the middle number of the entire list.
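The difference between the two calculations can be sketched in a few lines of Python. This is only an illustration of the idea, not how Hadoop itself is programmed: the "processors" are simulated with plain lists, and the function names and sample numbers are made up for the example. Each simulated processor returns a sum and a count rather than a finished average, because chunks of different sizes would otherwise be weighted incorrectly.

```python
from statistics import median


def split_into_chunks(values, n_workers):
    """Deal the values out round-robin, one chunk per simulated processor."""
    return [values[i::n_workers] for i in range(n_workers)]


def map_partial_sum(chunk):
    """Map step: a processor reports the subtotal and count of its own chunk."""
    return sum(chunk), len(chunk)


def reduce_average(partials):
    """Reduce step: the controller combines the subtotals into one average."""
    total = sum(subtotal for subtotal, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Illustrative numbers only.
values = [3, 9, 1, 7, 5, 8, 2, 6, 4, 100]
chunks = split_into_chunks(values, 3)

# The average works: each chunk is processed independently, then combined.
parallel_average = reduce_average([map_partial_sum(c) for c in chunks])

# The median does not: combining per-chunk medians gives a different number
# from the median of the full list, so the processors would need to share data.
median_of_chunk_medians = median(median(c) for c in chunks)
true_median = median(values)
```

Here `parallel_average` matches the average of the whole list, while `median_of_chunk_medians` and `true_median` disagree, which is why the median needs the extra redistribution or communication step.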

Another technology used in Big Data processing platforms is the GPU (graphics processing unit). GPUs are mostly used for vector processing, which is common in big data analytics. In real-world terms, this technology is being used by banks to compute large-scale workloads. For example, GPU grids are being used to perform large-scale Monte Carlo simulations across 10 million portfolios with hundreds of thousands of simulations. Again, this is a calculation that is either impossible or takes too long with conventional systems. While the overheads of switching to this system are high, it is well worth the investment because of the computational speed gained by moving the data to it.
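To give a feel for what a Monte Carlo simulation of a portfolio involves, here is a minimal sketch in plain Python. It is not a GPU implementation and not any bank's model: the daily return, volatility, number of steps and function names are all assumptions chosen for illustration. The point is that every simulated path is independent of the others, which is what makes this kind of workload a natural fit for hardware that can run thousands of them side by side.

```python
import random


def simulate_final_value(start_value, mean_return, volatility, n_steps, rng):
    """Simulate one possible future for a portfolio, one step at a time."""
    value = start_value
    for _ in range(n_steps):
        # Each step applies a random return (hypothetical normal model).
        value *= 1.0 + rng.gauss(mean_return, volatility)
    return value


def estimate_loss_probability(start_value, n_simulations, seed=0):
    """Estimate the chance the portfolio ends below its starting value."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    losses = 0
    for _ in range(n_simulations):
        # Assumed parameters: 0.05% mean daily return, 1% daily volatility,
        # 250 trading days. None of these come from real data.
        final = simulate_final_value(start_value, 0.0005, 0.01, 250, rng)
        if final < start_value:
            losses += 1
    return losses / n_simulations


loss_probability = estimate_loss_probability(1_000_000.0, n_simulations=1_000)
```

On a GPU grid, the loop over simulations would be spread across many cores instead of running one after another as it does here.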

 

Valar Morghulis! …. Not if, But When.

Applying Bayesian statistical survival analysis on the characters of A Song of Ice and Fire, the book series by George R. R. Martin.

During my time studying data analytics, one of the more interesting reports I have come across was a white paper published by MIT via their Technology Review webpage. Using data analysis from the first five books, it predicted the survival outcomes of key characters in the widely anticipated next two novels in the series. I'll talk about how it is possible to generate these predictions using Bayesian modeling.

\hat \lambda^k = \frac{1}{n} \sum_{i=1}^n x_i^k    – Maximum Likelihood estimator formula

Based on the data from ‘A Wiki of Ice and Fire’, 916 named characters have appeared in the books thus far. For each of these characters we have data telling us when and how often they appear, their age and gender, their class status, their House and their affiliations to other major houses. Using Weibull distributions, we are then able to predict the survival probabilities for named characters through to the seventh book. The Weibull distribution is a way of modeling the hazard function (also known as the hazard rate). It depends on two parameters, k and lambda, which control its shape and scale respectively. For example, the graph below gives us the distribution for the Night's Watch.
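As a rough sketch of what the two Weibull parameters do, the standard survival and hazard functions can be written directly in Python, along with the maximum likelihood estimator for lambda given earlier. This is not the paper's Bayesian model: the estimator below treats k as fixed and ignores characters who are still alive (censored data), and the function names are my own.

```python
import math


def weibull_survival(t, k, lam):
    """Probability of surviving beyond time t: S(t) = exp(-(t / lam) ** k)."""
    return math.exp(-((t / lam) ** k))


def weibull_hazard(t, k, lam):
    """Hazard rate at time t: h(t) = (k / lam) * (t / lam) ** (k - 1)."""
    return (k / lam) * (t / lam) ** (k - 1)


def estimate_lambda(lifetimes, k):
    """Maximum likelihood estimate of lam for a fixed k.

    Implements lam ** k = (1 / n) * sum(x_i ** k), which assumes every
    lifetime was fully observed (no censoring).
    """
    mean_power = sum(x ** k for x in lifetimes) / len(lifetimes)
    return mean_power ** (1.0 / k)


def conditional_survival(t_now, t_future, k, lam):
    """Probability of surviving to t_future given survival up to t_now."""
    return weibull_survival(t_future, k, lam) / weibull_survival(t_now, k, lam)
```

With k equal to 1 the hazard is constant over time; values of k below 1 give a hazard that falls as a character survives longer, and values above 1 give one that rises. `conditional_survival` is the kind of calculation behind a statement such as "given survival to the end of book 5, what is the chance of surviving to the end of book 7".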

[Figure: GOT posterior distribution]

The distribution for lambda is very tight, but the distribution for k is broader.

So, using this analysis, we can make a prediction for a major character such as Jon Snow.

 

 

[Figure: survival prediction for the Night's Watch]

 

So, from the data gathered up to the end of ‘A Dance with Dragons’, the credible interval for Night's Watch survival ranges from 36 to 56 per cent. Even if Jon survives book 5, his odds of surviving until the end of book 7 are now a worrying 30-51%.

But we should also factor in that Jon is no average member of the Night's Watch. Unlike the thieves and worse in his ranks, Jon is from a noble background and is well trained in combat. However, applying the same analysis to the 11 other recorded noble members of the Night's Watch leaves the survival prediction relatively unchanged; only the credible interval becomes wider, as shown below:

[Figure: survival prediction for noble members of the Night's Watch]

If we apply the same modeling to analyse the survival probability of entire houses, the results are interesting. The factors behind these results include each house's level of involvement in the major conflicts in the books, as well as its wealth, geopolitical position and chosen strategic alliances.

[Figure: survival probability by house]

 

House Arryn has the highest survival probability due to its policy of non-involvement in major wars and conflicts with other houses; however, because of the small amount of data on House Arryn, its credible intervals are wider. The Lannisters are actively involved in the major conflicts but rate highly in terms of survival probability due to their wealth, military power and alliances. The projections for House Stark are pretty grim.

I find these methods of prediction quite interesting, not just in the context of the fictional world of Game of Thrones, but for the real-world application of data analysis and the importance of using data in predictive analysis.

 

 

 

Try R by Code School.

R is a statistical and data modeling tool, and Try R is an introductory course developed by Code School. R has a syntax specifically designed for data and advanced graphics capabilities, and it is ideal for manipulating your data and presenting it in an engaging way. This blog is about the Try R introduction course, which consists of seven chapters covering R syntax, vectors, matrices, summary statistics, factors, data frames and working with real-world data. I will discuss the learning outcomes from each chapter and have attached a link to the completed course, including specifics on command entries, formulas, graphics etc.

Try R

Chapter One – Using R.

Using R begins with simple expressions, basically numbers and strings. It also shows us how to store those values in variables and pass them to functions.

Example: for maths, simply enter 6*7 to get the answer 42. With logical values, entering 3<4 returns TRUE, while 2+2 == 5 returns FALSE. T and F can be used as shorthand for TRUE and FALSE respectively. You can also assign any value to a variable, such as a number, a string or a logical value, for example x <- TRUE. You call a function by typing its name followed by one or more values, known as arguments, in parentheses. Functions covered include sum, rep (with its times argument) and sqrt. Entering help followed by the function name in parentheses, for example help(sum), gives us information on that function. The files in the current directory can be listed by running list.files().

Chapter Two – Vectors

A vector is simply a list of values and is central to many of R's operations. A vector can hold numbers, strings, logical values or any other type, as long as all of its values are the same type. With R you can create sequence vectors, and you can index vectors numerically or by name. The barplot function allows you to draw a bar chart from the vector's values. In the example for Chapter Two, names are assigned to the vector values along with an integer range. Once you have assigned your values, you can plot the X and Y values of two vectors on a graph.

Chapter Three – Matrices

A matrix is a two-dimensional array. This chapter deals with creating, accessing and plotting matrices. You can use a vector to initialise a matrix. For example, a 3x4 matrix requires a 12-item vector, arranged as 3 rows by 4 columns. The dimensions of a vector can be changed to the required shape using the dim function. Since a matrix is two-dimensional, you need two indices to get a value from it. You can read multiple rows or columns by providing a vector or sequence of their indices. When plotting complex data in a matrix, visualisation is important. In the example we create a contour map and a 3D perspective plot of a beach using a 10x10 matrix. To improve the visualisation, you can use the expand parameter. Once you are happy with the parameters, the image function will create the heat map.

Chapter Four – Summary Statistics 

This chapter deals with using R to explain data to your audience. R can calculate mean or, if more suitable, median values and plot them on a graph. The sd function calculates the standard deviation, which describes how far values typically vary from the mean. You can add lines to the plot to show one standard deviation above or below the mean.

[Figure: standard deviation formula]
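For reference, the sample standard deviation that R's sd function computes can be written as follows, where n is the number of values, x_i is each value and \bar{x} is their mean. Note the division by n - 1 rather than n:

```latex
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2},
\qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```

The lines drawn one standard deviation above and below the mean sit at \bar{x} + s and \bar{x} - s.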

Chapter Five – Factors

R factors track categorised values. In the example we start with a vector of values. When these are categorised as “types”, each distinct value is listed only once, and the values are stored not as strings but as integer references. Plotting with factors lets you graph values, such as weights, by type. Again, the data is visualised to enhance interpretation and make it more “readable”.

Chapter Six  – Data Frames. 

Data frames tie related variables together in a single data structure. A data frame has a fixed column for each type of value and any number of rows of related values. Data frames can be loaded from external file formats, such as CSV using the read.csv function, or delimited text files using read.table with the sep argument. In the example in Chapter Six we have two different file types loaded into data frames. By assigning the respective frames to x and y, we can merge both data sets, adding more insight and intelligence to the combined data set.

Chapter Seven – Real World Data.

We now look at examples that are not abstract but from the real world: a merge of a CSV file of piracy data and a text file of nations and their GDPs. From here we can plot GDP against piracy and determine whether there is a negative correlation between wealth and piracy, which we can see there is. R's cor.test function can verify this correlation. In the example, the p-value calculated for the negative correlation is below the conventional significance threshold of 0.05. If we have the GDP values of other countries in the data set, we can estimate their piracy rates without knowing the piracy figures by fitting a linear model that best represents our data points. The estimates will include some degree of error.
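The course does all of this in R. As a language-neutral illustration of what is being calculated, here is a short Python sketch of a correlation coefficient and a straight-line fit. The GDP and piracy numbers are invented for the example and are not the course's data, and the function names are my own. The sketch does not compute a p-value, which is the extra step cor.test performs.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


def fit_line(xs, ys):
    """Least-squares slope and intercept for the line y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


# Made-up illustration data: GDP per capita (thousands) and piracy rate (%).
gdp = [5, 10, 20, 30, 40]
piracy = [80, 70, 55, 40, 25]

r = pearson_r(gdp, piracy)             # negative when piracy falls as GDP rises
slope, intercept = fit_line(gdp, piracy)

# Estimate the piracy rate for a country where only the GDP is known.
estimated_piracy = slope * 25 + intercept
```

The estimate for the unseen country comes straight from the fitted line, which is why it carries some error: real countries rarely sit exactly on the line.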

 

Team Sky – Data Mining and Dark Social.

In the world of the web, massive viral success is difficult and rare. In pro cycling, easy victories are equally elusive. It is a sport of marginal gains. The key to winning and gaining a competitive edge is focusing on niche, critical areas. Since its formation in 2010, Team Sky has achieved success both in the sport of professional cycling and through innovative analytics and data mining. However, its success in the latter has positioned it well ahead of its rivals.

[Image: Team Sky in action, © Team Sky]

One of the most visible indicators of Team Sky's digital media prominence is the team's Twitter following compared to that of rival teams on the Pro Tour. With 347,000 followers, it is strides ahead. The table below gives some context to this:

[Table: Twitter followers of Team Sky and rival Pro Tour teams]

So why is this important? A more in-depth analysis reveals the reasons behind the team's successful digital strategy. The key has been creating a connected audience. Team Sky is using link shortener and sharing widgets, created by tech company RadiumOne, to evaluate and measure all of the content interaction across its websites, social media and mobile app platforms. For the past two years it has been mining this data to expand and create new target groups. Through this, it has been able to demonstrate to sponsors its ability to become more relevant to passive consumers of its content, an audience also known as “Dark Social”. Dark Social has been compared to dark matter, in the sense that it is not seen or truly understood, but comprises most of the universe. It provides sponsors with access not only to an immediate audience, but also to a vast secondary audience. Because the reach has been magnified, sponsors realise the vast commercial opportunity this provides. Using Dark Social, Team Sky's analytics team can apply these insights to consult with sponsors on advertising across digital, mobile and video channels. Crucially, it means they are able to drive higher levels of engagement with target markets and reduce customer acquisition costs. The chart below shows the significance of Dark Social's share of online traffic compared to better-known platforms:

[Figure: Dark Social's share of online traffic compared to better-known platforms]

So, in real-world commercial applications of smart analytics and data mining, Team Sky's innovative digital strategy speaks for itself. In 2014 it delivered over $550 million to its advertising partners. This far exceeds any of the other professional cycling teams on the Pro Tour for that year. Recognising this success, sponsors have weighed in with support. News International signed a multi-year sponsorship deal in June 2014. Jaguar has signed a five-year sponsorship deal beginning in 2015, and other global brands including Adidas and Oakley have renewed their commercial association with Team Sky. The team's long-term strategy is to build, develop and expand its commercial partnerships, and central to this will be a smart, creative digital strategy and analytics.

Fusion Tables….Because Two is Better Than One.

Fusion Tables is a relatively new Google data visualisation application for gathering, visualising, assessing and sharing data tables.

Fusion Tables allows us to merge two or more data sets in a single visualisation. The attached Google Fusion Table is a merge of two data sets: a geolocation file in KML format and a spreadsheet of Ireland's population statistics.

In this case we are tasked with the following:

* Combining the data and boundary tables.

* Customising the map display.

* Publishing and embedding the heat map.

The initial step is to add the Fusion Tables extension to Google Chrome. I then began creating the fusion table by importing the public domain ‘Ireland_population’ spreadsheet file, which is arranged into columns and rows: in this case rank, county and population. The next stage allows us to name, describe and attribute the imported data. I then clicked on Finish, and the table appeared in rows, cards and map formats. FT automatically geocodes the data after a few seconds. I was able to merge the information with the KML file by clicking on File > Merge, which allowed me to select the ‘map_lead.kml’ file (the Irish KML data file) by pasting its link into the indicated bar. At this point we are prompted to confirm the source of the match. The data sources with commonality were ‘rank’ in the table and ‘name’ in the KML file, i.e. county names. From here I clicked on Next and merged the columns Rank, County, Population and Geometry. We can then view the new combined table and geolocation data as ‘Merge of ireland_population and map_lead.kml’. FT automatically geocodes the merged data sets at this point.
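Conceptually, the merge step matches rows from the two sources on a shared column and attaches the boundary geometry to each population row. The Python sketch below illustrates that idea only. It is not how Fusion Tables is implemented, the rows are placeholders rather than the real Irish figures, and it assumes a simple join that keeps only the rows found in both sources.

```python
def merge_on_key(rows, boundaries, row_key, boundary_key):
    """Attach each boundary's geometry to the row with the matching key.

    Assumption: rows without a matching boundary are dropped (an inner join).
    """
    geometry_by_name = {b[boundary_key]: b["geometry"] for b in boundaries}
    merged = []
    for row in rows:
        name = row[row_key]
        if name in geometry_by_name:
            merged.append({**row, "geometry": geometry_by_name[name]})
    return merged


# Placeholder data standing in for the spreadsheet and the KML file.
population_rows = [
    {"rank": 1, "county": "County A", "population": 1000},
    {"rank": 2, "county": "County B", "population": 500},
    {"rank": 3, "county": "County C", "population": 250},
]
boundary_rows = [
    {"name": "County A", "geometry": "<Polygon A>"},
    {"name": "County B", "geometry": "<Polygon B>"},
]

merged_table = merge_on_key(population_rows, boundary_rows, "county", "name")
```

Each row of `merged_table` now carries rank, county, population and geometry together, which is the combination the map needs in order to colour each county's polygon by its population.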

FT allows you to manipulate and improve the visualisation using the ‘Change Feature map styles’ setting. You can stratify the population range by dividing the polygon background colours into a range of 1 to 8 buckets. Colour gradation within the boundary polygons is probably the most effective way to make county-by-county population ranges instantly visual and easy to interpret. It is also possible to customise the info window settings, found below the feature styles setting.

At this point I can publish the map as a link in an email or IM, or embed it as HTML in a webpage or, in this case, a WordPress blog.

The information we glean from this heat map is an instant visualisation of Ireland's population demographics. We can see that, in the national context, the populace is disproportionately urbanised and concentrated around the major cities of Dublin, Cork, Limerick and Galway. The heat map could be a useful source for national development projects in major infrastructure such as roads, rail and ports.

Google Fusion Table