Blog Archives

Data Pollution


We know about many kinds of pollution, but there is a new kind that does not affect the environment yet seriously hampers our ability to take decisions. We can call it data pollution. With storage devices becoming so cheap, we have grown lethargic about maintaining data quality. Even though we strive to maintain the quality of data in our warehouses through various control measures, the sheer volume of data still leads to a considerable amount of pollution.

Data pollution is caused by poorly formed data that is of little use to anyone. Recently we faced exactly this problem: one user entered the alias name of our currency as 'Indian rupee' and another as 'Indian national rupee'. That is a perfect example of data pollution.

We BI people always take pride in the fancy reports we develop for higher management. But suppose such a report aggregates amounts by the currency alias instead of the currency code: there will inevitably be amount mismatches, which destroy the credibility of the entire report. The business impact varies with how large the differences in the amounts turn out to be.
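To make this concrete, here is a minimal sketch in Python, with hypothetical transaction data and column names, of how grouping by a free-text alias splits what should be one currency total, while grouping by the standard code keeps it whole:

from collections import defaultdict

# Hypothetical fact rows; in a real warehouse these would come from the database.
transactions = [
    {"currency_code": "INR", "currency_alias": "Indian rupee",          "amount": 1000},
    {"currency_code": "INR", "currency_alias": "Indian national rupee", "amount": 2500},
    {"currency_code": "USD", "currency_alias": "US dollar",             "amount": 4000},
]

def total_by(rows, key):
    """Sum amounts grouped by the given column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row["amount"]
    return dict(totals)

# Grouping by the standard code gives one clean INR figure...
print(total_by(transactions, "currency_code"))
# {'INR': 3500, 'USD': 4000}

# ...while grouping by the free-text alias splits the same money in two,
# so the report totals no longer reconcile.
print(total_by(transactions, "currency_alias"))
# {'Indian rupee': 1000, 'Indian national rupee': 2500, 'US dollar': 4000}

The two alias rows refer to the same rupee, but the report treats them as different currencies, which is exactly the mismatch described above.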

I think the growth in data volume is the only serious competitor to Moore's law, and the reliability of data is a real concern these days: even the Indian government had to revise its IIP (Index of Industrial Production) figures because data pollution had crept in, causing huge embarrassment. So kindly take steps to control the pollution of data.

Kindly share your thoughts on how data pollution has affected your own data.

Need for the Hadoop Distributed File System


People often wonder how organizations like Yahoo, Google and Facebook store such large amounts of user data. Note that Facebook stores more photos than Google's Picasa. Any guesses?

The answer is Hadoop, which is a way to store huge amounts of data running into petabytes and beyond. Its storage system is called the Hadoop Distributed File System (HDFS). Hadoop was developed by Doug Cutting based on ideas described in Google's papers. Much of this data is machine generated: for example, the Large Hadron Collider, built to study the origins of the universe, produces around 15 petabytes of data every year from its experiments.

The next thing that comes to mind is how quickly we can access such large amounts of data. For processing, Hadoop uses MapReduce, which follows a 'divide and conquer' approach. The data is organised as key-value pairs. The chunks of data spread across any number of machines are processed in parallel, and the intermediate results are then shuffled, sorted and reduced to produce the final output.
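To illustrate the idea, here is a toy, single-machine sketch in Python of the three MapReduce phases for a word count. This is not Hadoop's actual API; in a real cluster the map and reduce functions would run on many nodes in parallel, with the framework handling the shuffle in between.

from collections import defaultdict

def map_phase(chunk):
    """Emit a (word, 1) key-value pair for every word in a chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle_phase(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

chunks = [
    "big data needs big storage",
    "hadoop stores big data",
]

# In a real cluster each chunk would be mapped on the node that holds it.
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(shuffle_phase(mapped)))
# {'big': 3, 'data': 2, 'needs': 1, 'storage': 1, 'hadoop': 1, 'stores': 1}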

Running on standard PC servers, Hadoop connects all of them together and distributes the data files across these nodes. It uses all the nodes as one large file system to store and process the data, making it a true distributed file system. Extra nodes can be added when the data approaches the installed capacity, which makes the setup highly scalable. It is also very cheap, since it is open source and does not require the specialised processors used in traditional high-end servers. Hadoop is often grouped with the NoSQL technologies as well.
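The idea of spreading a file across nodes can be sketched conceptually like this. It is not the real HDFS code, just a toy illustration of splitting a file into blocks and placing replicas of each block on different nodes; the block size and replication factor here are toy values, whereas HDFS typically uses very large blocks and three replicas.

import itertools

BLOCK_SIZE = 4      # bytes here, purely for illustration
REPLICATION = 2     # toy value; HDFS defaults to three replicas

nodes = {"node1": [], "node2": [], "node3": []}

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut the file into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def distribute(blocks, nodes, replication=REPLICATION):
    """Place each block's replicas on distinct nodes, roughly round-robin."""
    node_cycle = itertools.cycle(nodes)
    for block_id, block in enumerate(blocks):
        targets = set()
        while len(targets) < min(replication, len(nodes)):
            targets.add(next(node_cycle))
        for name in targets:
            nodes[name].append((block_id, block))

blocks = split_into_blocks(b"sensor readings streaming in all day long")
distribute(blocks, nodes)
for name, stored in nodes.items():
    print(name, [block_id for block_id, _ in stored])

Adding a fourth entry to nodes immediately adds capacity, which is the sense in which such a setup scales out rather than up.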

The Tennessee Valley Authority uses smart-grid field devices to collect data on its power-transmission lines and facilities across the country. These sensors send in data 30 times per second; at that rate, the TVA estimates it will have half a petabyte of data archived within a few years. TVA uses Hadoop to store and analyse this data. Our own Power Grid Corporation of India intends to install such smart devices in its grids to collect data and reduce transmission losses. It would do well to emulate TVA too.