How IT companies handle so much data

Shefali Sharma
5 min read · Sep 17, 2020

Big data is one of the most popular buzzwords in today's IT industry. To many, it may sound like an innovative technology, but it is actually the name of a problem, and one we face in our day-to-day lives. For instance, if we need to store 1,500 GB of data on a laptop whose hard disk holds only 1,000 GB, that is a big data problem.

Big data refers to volumes of structured and unstructured data so large that they are difficult to process with traditional database and software techniques.

According to recent estimates, 2.5 quintillion bytes of data are generated every day, and roughly 90% of the world's data was created in the last two years alone. By 2025, an estimated 463 exabytes of data will be generated each day. This is the big data problem again: the data is so huge that it cannot fit in ordinary databases and hard disks. If businesses simply kept buying more and more hard disks, data processing speed would stay low. Moreover, if any single hard drive failed, its data would be lost permanently, and customers would stop using their services.


So how and where is big data stored? How is it possible to run such fast queries on such a humongous amount of data?

Big data has the following four characteristics:

  • Volume: Volume refers to the amount of data.
  • Velocity: Velocity refers to the speed at which data is generated and processed.
  • Variety: Variety refers to the different types and formats of data.
  • Veracity: Veracity refers to how accurate or truthful a data set may be.

Distributed storage clusters help businesses overcome the problems of volume and velocity. The idea is to network several computers together and use their resources collectively: each computer contributes some of its hard drive space to the overall network, which turns the entire network into one massive computer in which every node acts as a storage device. The cluster also keeps replica copies of the data, so a block can be recovered whenever a node fails. There are several arrangements, or topologies, by which these nodes can be connected.
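Before turning to a specific topology, here is a minimal sketch of that replication idea in plain Java. It is not Hadoop code; the names (TinyCluster, storeBlock, the pick-the-first-nodes placement policy) are made up purely for illustration.

```java
// A minimal sketch of a distributed storage cluster: each node contributes
// disk space, and every block is replicated so a node failure loses nothing.
import java.util.*;

public class TinyCluster {
    // node name -> the blocks that node currently holds
    private final Map<String, Set<String>> nodes = new HashMap<>();

    public void addNode(String name) {
        nodes.put(name, new HashSet<>());
    }

    // Store a block on `replicas` different nodes (toy policy: just take
    // the first N nodes; real systems use smarter, rack-aware placement).
    public void storeBlock(String blockId, int replicas) {
        int placed = 0;
        for (Set<String> disk : nodes.values()) {
            if (placed == replicas) break;
            disk.add(blockId);
            placed++;
        }
    }

    // Recover a block by asking any surviving node that holds a replica.
    public Optional<String> readBlock(String blockId, Set<String> deadNodes) {
        for (Map.Entry<String, Set<String>> node : nodes.entrySet()) {
            if (!deadNodes.contains(node.getKey()) && node.getValue().contains(blockId)) {
                return Optional.of("read " + blockId + " from " + node.getKey());
            }
        }
        return Optional.empty();  // lost only if every replica's node is down
    }

    public static void main(String[] args) {
        TinyCluster cluster = new TinyCluster();
        cluster.addNode("node-1");
        cluster.addNode("node-2");
        cluster.addNode("node-3");
        cluster.storeBlock("block-42", 2);          // keep two copies
        Set<String> dead = Set.of("node-1");        // one node fails
        System.out.println(cluster.readBlock("block-42", dead).orElse("data lost"));
    }
}
```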

Here I shall explain the concept of distributed storage using the master-slave topology model. Master-slave is a model of communication for hardware devices in which one device (the master node) has unidirectional control over one or more devices (the slave nodes). It is often used in electronic hardware, where one device acts as the controller and the others are the ones being controlled. In a storage cluster, the master node is commonly known as the NameNode and each slave node as a DataNode. The entire network then acts like a single computer whose hard disk capacity equals the sum of the capacities of all the DataNodes. Whenever a file is stored in this cluster, it is first partitioned into multiple blocks of a fixed size, and these blocks are stored on different DataNodes. This lets the system hold the desired amount of data, which solves the problem of volume. Parallelism is also achieved because the master node can make calls to multiple slave nodes at the same time, so the velocity of the data is improved as well. The toy sketch below shows the kind of bookkeeping the master node does.
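The sketch assumes a made-up ToyNameNode class, a 128 MB block size (HDFS's default), and simple round-robin placement; it is meant only to make the block-splitting idea concrete, not to mirror the real HDFS implementation.

```java
// A toy master node: split a file into fixed-size blocks and record which
// slave (DataNode) holds each block, so reads can be fanned out in parallel.
import java.util.*;

public class ToyNameNode {
    private static final long BLOCK_SIZE = 128L * 1024 * 1024;  // 128 MB

    // file name -> ordered list of "blockId@dataNode" placements
    private final Map<String, List<String>> blockMap = new HashMap<>();
    private final List<String> dataNodes;

    public ToyNameNode(List<String> dataNodes) {
        this.dataNodes = dataNodes;
    }

    // Split a file of `fileSize` bytes into blocks and assign each block
    // to a DataNode in round-robin fashion.
    public void storeFile(String fileName, long fileSize) {
        int blockCount = (int) Math.ceil((double) fileSize / BLOCK_SIZE);
        List<String> placements = new ArrayList<>();
        for (int i = 0; i < blockCount; i++) {
            String dataNode = dataNodes.get(i % dataNodes.size());
            placements.add(fileName + "-block-" + i + "@" + dataNode);
        }
        blockMap.put(fileName, placements);
    }

    public static void main(String[] args) {
        ToyNameNode nameNode = new ToyNameNode(List.of("dn1", "dn2", "dn3"));
        nameNode.storeFile("movie.mp4", 700L * 1024 * 1024);   // 700 MB -> 6 blocks
        // The blocks live on different DataNodes, so they can be read in parallel.
        nameNode.blockMap.get("movie.mp4").forEach(System.out::println);
    }
}
```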

Facebook’s distributed storage for logs (Image Source: Google)

Hadoop is widely used to implement distributed storage clusters. It is open-source software, written in Java and managed by Apache, and it is used by many tech firms such as Amazon, Facebook, and Cloudspace; many universities also use it for their research projects. The Hadoop model is inspired by the Google File System (GFS), a proprietary distributed file system developed by Google to provide reliable access to data using large clusters of commodity hardware. GFS is highly efficient and scalable, meaning it sustains performance as volume increases. Similarly, Hadoop is used for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent jobs.


Hadoop consists of four main modules:

  • Hadoop Distributed File System (HDFS): HDFS is a distributed file system designed to run on commodity hardware. It provides high-performance access to data across highly scalable systems (a small client example follows after this list).
  • Yet Another Resource Negotiator (YARN): It manages and monitors cluster nodes and resource usage. It also schedules tasks and jobs.
  • MapReduce: MapReduce is a programming paradigm used to perform parallel computation on the data. The map task takes input data and converts it into a dataset of key-value pairs; the reduce tasks consume the map output, aggregate it, and produce the final result (the word-count sketch after this list shows the pattern).
  • Hadoop Common: Provides common Java libraries that can be used across all the modules.
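As a concrete example of the HDFS module mentioned above, the snippet below uses Hadoop's Java client API to write a small file into the cluster. The NameNode address hdfs://namenode:9000 and the path /user/demo/hello.txt are placeholders; adjust them for your own cluster.

```java
// Write a small file to HDFS through Hadoop's Java client API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the NameNode; HDFS then decides which
        // DataNodes will hold the file's blocks and replicas.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
            out.writeUTF("Hello, HDFS!");
        }
    }
}
```

HDFS then takes care of splitting the file into blocks and replicating them across DataNodes, exactly the behaviour described earlier.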
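And here is the classic word-count job, essentially the introductory example from the Hadoop documentation, as a minimal illustration of the MapReduce pattern: the mapper emits (word, 1) pairs and the reducer sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);      // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();              // aggregate the counts per word
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The job would typically be packaged into a jar and launched with hadoop jar wordcount.jar WordCount /input /output, where the input and output paths are placeholders.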

Applications like Spark, Hive, Presto, and HBase help Hadoop collect, store, process, analyze, and manage big data. Today many businesses are investing in big data and AI to extract meaningful information from their data and make more intelligent business decisions; Netflix, for example, is reported to save $1 billion per year on customer retention using big data. Data is growing exponentially with time, and the need for software like Hadoop will only increase.

And that’s how IT companies handle data.
