Does every organisation find the science of information management interesting? Probably not, but the fact remains that information management is part of the foundation of Big Data and is critically important. An analogy can be made with a building: if there is a problem with the foundation, structural problems are certain to follow.
So what are these foundations? The Big Data foundation is composed of two major systems: storage and processing. We tend to think of the Hadoop Distributed File System (HDFS) when we think of Big Data, but traditional data warehouses can also store big data. The advantage of HDFS over traditional storage is that you do not need to define a schema for the data up front; it can simply be dumped into a file. In the age of Big Data this is logical, because with massive quantities of data we may not yet know its value. HDFS is also more versatile than conventional data warehousing, which may not be able to manipulate data once it is stored, rendering it effectively useless. This leads us to the second system of the Big Data foundation: processing.
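To make the "dump it now, decide its structure later" idea concrete, here is a minimal Python sketch (the file name and field names are purely illustrative, not from any particular system): raw records are appended to a file with no schema declared, and structure is only imposed later, when the data is read.

```python
import json

# Write phase: dump raw events as-is, with no schema declared up front.
raw_events = [
    '{"user": "a1", "action": "click", "ts": 1700000000}',
    '{"user": "b2", "action": "view"}',       # a missing field is fine
    'free-text line whose value is unknown today',
]
with open("events.log", "a") as f:
    for line in raw_events:
        f.write(line + "\n")

# Read phase (possibly much later): impose structure only now.
with open("events.log") as f:
    for line in f:
        try:
            event = json.loads(line)
            print(event.get("user"), event.get("action"))
        except json.JSONDecodeError:
            pass  # decide later what, if anything, this record is worth
```

A traditional warehouse would force the schema decision at load time; here the decision is deferred until the data is actually queried.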
Described simply, processing Big Data involves performing calculations and manipulations on the data. Traditional databases have different levels of capability in terms of effectiveness and efficiency, and this can be linked to how well the database software exploits the architecture of the underlying hardware; in effect, there is a coupling between the two that affects performance. With Hadoop, the processing framework is known as MapReduce, a fault-tolerant (it automatically recovers from and handles processing failures in a highly reliable way) parallel programming framework. It sounds quite technical, but it is simply a way of dividing the processing workload into smaller workloads, which are then distributed. A key advantage over traditional systems is that data manipulations and calculations can be performed independently.
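As a rough illustration of the map-and-reduce idea (a toy sketch in plain Python, not Hadoop's actual API), the snippet below splits a workload across worker processes, each of which processes its own chunk independently, and a final reduce step merges the partial results.

```python
from collections import Counter
from multiprocessing import Pool

def map_chunk(lines):
    """Map phase: each worker counts words in its own chunk, independently."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(partials):
    """Reduce phase: merge the workers' partial counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    lines = [
        "big data needs processing",
        "big data needs storage",
        "storage and processing form the foundation",
    ]
    # Divide the workload into smaller chunks and distribute them to workers.
    chunks = [lines[0:1], lines[1:2], lines[2:3]]
    with Pool(processes=3) as pool:
        partials = pool.map(map_chunk, chunks)
    print(reduce_counts(partials))
```

The important property is that no worker needs to see any other worker's data; only the small partial results travel back to be combined.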
A useful way to understand how the parallel programming framework works is to apply a simple mathematical example. Let's use two calculations: the average and the median. For the average, each processor is handed a portion of the data and calculates a subtotal for it; a master or controller then combines these subtotals to produce the final average. The median is a more complex function: to find the middle number, the list either needs to be redistributed across the processors or the processors must communicate with each other to locate the middle value of the entire list.
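The contrast can be shown with a small sketch (plain Python standing in for a cluster; the partitions and numbers are made up). Each "processor" returns a (sum, count) subtotal, which is enough to reconstruct the exact average, whereas combining per-partition medians does not give the true median, so the data has to be brought together or redistributed.

```python
from statistics import median

# Data split across three "processors"
partitions = [[2, 4, 6], [1, 3], [5, 7, 9, 11]]

# Average: each partition returns a subtotal that the controller combines exactly.
subtotals = [(sum(p), len(p)) for p in partitions]
total_sum = sum(s for s, _ in subtotals)
total_count = sum(c for _, c in subtotals)
print("distributed average:", total_sum / total_count)               # 5.33...

# Median: per-partition medians cannot simply be combined.
print("median of medians:", median(median(p) for p in partitions))   # 4 (wrong in general)
print("true median:", median(x for p in partitions for x in p))      # 5 (needs all the data)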
Another technology in Big Data processing platforms is the GPU (Graphics Processing Unit), mostly used for vector processing, which is common in big data analytics. In real-world terms, this technology is being used by banks to compute large-scale workloads; for example, GPU grids are being used to perform large-scale Monte Carlo simulations across 10 million portfolios with hundreds of thousands of simulations. Again, this is another example of a calculation that is either impossible or takes too long with conventional systems. While the overheads of switching to such a system are high, it is well worth the investment because of the computational speed gained by moving the data onto it.
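To give a flavour of what vectorised, GPU-friendly code looks like, here is a toy Monte Carlo sketch in Python/NumPy (the portfolio counts and market parameters are illustrative only, not taken from any bank's workload). Because every simulation is expressed as an array operation rather than a loop, array libraries with GPU backends such as CuPy can run essentially the same code on a GPU.

```python
import numpy as np   # a GPU array library (e.g. CuPy) can stand in for NumPy here

rng = np.random.default_rng(42)

n_portfolios = 1_000      # stand-in for millions in a real grid
n_simulations = 10_000    # stand-in for hundreds of thousands
mu, sigma = 0.05, 0.20    # illustrative annual return and volatility

# One vectorised operation simulates every (portfolio, scenario) pair at once.
shocks = rng.normal(mu, sigma, size=(n_portfolios, n_simulations))
values = 100.0 * np.exp(shocks)          # simulated end-of-year portfolio values

# Per-portfolio statistics fall out of simple reductions over the scenario axis.
expected_value = values.mean(axis=1)
value_at_risk = np.percentile(values, 5, axis=1)   # 5th percentile as a crude VaR
print(expected_value[:3], value_at_risk[:3])
```

The design point is that nothing in the calculation depends on processing scenarios one at a time, which is exactly the shape of workload that GPUs accelerate well.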