This article, “Big Data Analytics and Concepts”, helps you understand what exactly big data is. Starting from the core concept of data, it moves toward the notion of big data and its features. Big data shares the same definition as data, i.e., “some existing information or knowledge is represented or coded in some form suitable for better usage or processing,” with the only difference being that it is enormous in size.
The following topics are covered in this tutorial:
1. What is Big Data?
2. Big Data categories
3. Different aspects of Big Data
4. 4 Vs of Big Data
5. Additional Vs
6. Challenges of Big Data
7. Top big data technologies
What is Big Data?
Big data describes a large volume of data, structured or unstructured. It has the potential to grow exponentially for an indefinite period, to the point where it can no longer be managed or processed using traditional techniques such as an RDBMS.
Because of our increasing dependence on the internet, every online user activity, such as a Google search, a ‘like’ on a Facebook post, or sending and receiving emails, generates data.
To quantify big data in terms of size, consider the maximum capacity of an HDD (hard disk drive) available today, which is 16 TB. Using this size as a scale, any data volume in the range of terabytes or petabytes would be considered big data, because it would be difficult for a single system to accommodate a dataset of that size.
Why do traditional systems fail to manage big data?
Previously, the most commonly used data storage and data processing systems were RDBMSs (relational database management systems). An RDBMS uses tables to store data in a row-column format. These tables have a well-defined schema (metadata), and the data stored in each table must comply with the underlying schema. Traditional systems fail to manage big data because of its huge size and diversity.
A few reasons why traditional systems fail to manage big data:
- Types of data: Traditional databases only support structured data. Today, big data also includes semi-structured and unstructured data, which cannot be processed using traditional systems.
- Volume of data: Traditional systems cannot store and process massive volumes of data efficiently. Big data processing systems store data in distributed file systems, which leads to efficient storage and processing of data.
- Scaling: Big data processing systems follow a scale-out architecture and use distributed computing for data processing. The computational load is therefore shared among multiple systems, which is not possible in traditional systems.
- Data Schema: Traditional databases follow a strict schema for the data. Big data is first stored in raw format, and a schema is then applied while reading.
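The idea of storing raw data first and applying a schema only while reading can be sketched in a few lines of Python. This is a minimal illustration, not how any particular big data system implements it: the sample records, the `schema` mapping, and the `read_with_schema` helper are all hypothetical names chosen for the example.

```python
import json

# Raw records are stored as-is; nothing is validated at write time.
# (Hypothetical sample data for illustration.)
raw_lines = [
    '{"user": "alice", "age": "31", "city": "Pune"}',
    '{"user": "bob", "age": "27"}',
    '{"user": "carol", "clicks": 12}',
]

# Hypothetical schema chosen by the reader: field name -> type converter.
schema = {"user": str, "age": int}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and apply the schema only at read time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, convert in schema.items():
            value = record.get(field)
            # Missing fields become None instead of causing a write-time error.
            row[field] = convert(value) if value is not None else None
        rows.append(row)
    return rows


rows = read_with_schema(raw_lines, schema)
for row in rows:
    print(row)
```

A different reader could apply a different schema (for example, one that includes `clicks`) to the same raw lines, which is the flexibility a strict write-time schema does not offer.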
Some basic industry applications:
- Application of big data in the retail industry
- Application of big data in the healthcare industry
- Every online activity
- Data produced by cell phone towers
In these industries, daily transactions are recorded and analyzed. The rate of data generation is tremendous, and the volume of data is huge.
Let’s take the example of a healthcare application.
People visit doctors for consultation, and their diagnoses, treatments, and tests are recorded. The various test reports are stored as data and analyzed. Suppose we need to identify how fast the number of cancer patients is growing. A few analyses that can be done based on test reports are:
1. Real-time monitoring of patients
2. Predicting outcomes or upcoming health-related hazards
Cell phone tower data is an example of machine-generated data:
- A cell phone tower produces data about the devices connected to it
- It produces data about the calls it connects and completes
Data is divided into three categories: structured, semi-structured, and unstructured.
Different aspects of Big Data
- Volume: The size of the data has to be huge, i.e., in the range of terabytes or more.
- Rate of Change: The nature of the data has to be dynamic because of changes in transactions. There can be multiple reasons for these changes, such as a change in logic or requirements.
- Variety: Based on its form, data can be divided into three categories:
Structured: Data stored in a tabular format.
Unstructured: Data that does not have a well-defined structure, e.g., videos, images, CCTV footage, weather data, vehicular data, etc.
Semi-structured: Data that is partially structured, or a combination of structured and unstructured data. Emails are semi-structured because they have a well-defined structure consisting of sender address, receiver address, subject, message body, attachments, etc., while the content of the body is unstructured. Other examples are XML, JSON, data from social networking sites, and server logs.
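The email example can be made concrete with Python's standard `email` module. The sketch below uses a made-up message (the addresses and text are hypothetical) to show that the header fields can be read like structured data, while the body remains free-form text.

```python
from email import message_from_string

# A made-up message used only for illustration.
raw_email = """From: alice@example.com
To: bob@example.com
Subject: Quarterly report

Hi Bob, the numbers look good. Call me when you can.
"""

msg = message_from_string(raw_email)

# Structured part: well-defined header fields that can be read by name.
headers = {
    "from": msg["From"],
    "to": msg["To"],
    "subject": msg["Subject"],
}

# Unstructured part: the body is free text with no fixed fields.
body = msg.get_payload()

print(headers)
print(body)
```

The headers could be loaded straight into table columns, whereas extracting meaning from the body would need text processing.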
Big data has two essential characteristics: it is vast and it is diverse. It is diverse because it can include structured, semi-structured, and unstructured data.
Big data can also be described by a few other characteristics. These are commonly referred to as the ‘4 Vs of big data’.
- Volume: This represents the amount of data generated by an organization. The size of big data ranges from petabytes to exabytes.
- Velocity: This indicates the rate at which data is generated and consumed. Social media sites such as Twitter and Facebook create data from every activity a user performs, leading to an enormous amount of data being generated every minute.
- Variety: This represents the different types of data generated. For example, the types of data collected from Gmail may include sign-up/registration data, login data, emails, etc.
- Veracity: This represents the quality and trustworthiness of data. As more and more analysis is performed on generated data, veracity becomes very important.
Apart from the 4 Vs, a few additional Vs have been introduced. They are:
- Variability: This refers to the way the meaning of data can change with context. For example, interpreting a word without knowing the complete sentence will not give its exact meaning.
- Validity: This refers to the correctness of the values present in the working dataset. Anomalies in data can include missing values and incorrect values. It is important to remove anomalies before processing the dataset.
- Volatility: This refers to determining the expiry or life expectancy of data. For better results, it is sometimes necessary to eliminate old data.
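Validity and volatility checks can be sketched as simple filters applied before processing. The readings, the plausible temperature range, and the 30-day retention window below are all assumptions made for this example; real thresholds depend on the dataset.

```python
from datetime import date, timedelta

# Hypothetical sensor readings used only for illustration.
readings = [
    {"day": date(2024, 1, 1), "temp_c": 21.5},    # valid but old
    {"day": date(2024, 3, 1), "temp_c": None},    # missing value
    {"day": date(2024, 3, 10), "temp_c": 999.0},  # implausible value
    {"day": date(2024, 3, 15), "temp_c": 23.1},   # valid and recent
]


def is_valid(reading):
    """Validity: reject missing values and values outside an assumed range."""
    temp = reading["temp_c"]
    return temp is not None and -50.0 <= temp <= 60.0


def is_fresh(reading, today, max_age_days):
    """Volatility: keep only data younger than the assumed retention window."""
    return today - reading["day"] <= timedelta(days=max_age_days)


today = date(2024, 3, 20)
clean = [r for r in readings if is_valid(r) and is_fresh(r, today, 30)]
print(clean)
```

With these assumptions only the reading from 15 March survives: two records fail the validity check and one is older than the retention window.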
Challenges of Big Data
- Exponential Growth of business data
- Internet traffic doubling every year
- Data Transport requiring conversion to higher bandwidth networks
- Infrastructure challenges for data centers
- Storage challenges for online and archival data
Big Data Management Challenges
- Privacy and Civil Liberties
Top big data technologies
Apache Hadoop is a free, Java-based software framework that can effectively store large amounts of data in a cluster. The framework runs in parallel on a cluster and allows us to process data across all nodes. The Hadoop Distributed File System (HDFS) is the storage system of Hadoop; it splits big data into pieces and distributes them across many nodes in a cluster. It also replicates data within the cluster, thus providing high availability.
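The splitting and replication idea can be illustrated with a toy simulation. This is only a sketch of the concept and not HDFS's actual implementation: the block size, the node names, and the round-robin placement in `place_blocks` are assumptions made for the example, and real HDFS uses its own placement policy.

```python
def split_into_blocks(data, block_size):
    """Split a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def blocks_still_available(placement, failed_node):
    """Check that every block has at least one replica off the failed node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


data = b"x" * 1000                       # stand-in for a large file
blocks = split_into_blocks(data, 256)    # toy block size
nodes = ["node1", "node2", "node3", "node4"]  # hypothetical cluster
placement = place_blocks(len(blocks), nodes, replication=3)

for block_id, replicas in placement.items():
    print(block_id, replicas)
print(blocks_still_available(placement, "node1"))
```

Because every block lives on several nodes, losing one node does not make any block unreadable, which is the intuition behind the high availability mentioned above.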
HDInsight is a Big Data solution from Microsoft, powered by Apache Hadoop, that is available as a service in the cloud. HDInsight uses Windows Azure Blob storage as the default file system. It also provides high availability at low cost.
NoSQL stands for “not only SQL”. NoSQL databases can store unstructured data with no particular schema, and each row can have its own set of column values. NoSQL databases can give better performance than traditional relational systems when storing massive amounts of data. Many open-source NoSQL databases are available for analyzing big data.
Apache Hive is a distributed data management tool for Hadoop. It supports a SQL-like query language, HiveQL (HQL), for accessing big data. It converts SQL-like queries into MapReduce jobs for easy execution and processing of extremely large volumes of data.
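To get a feel for what converting a SQL-like query into a MapReduce job means, consider a query such as `SELECT city, COUNT(*) FROM visits GROUP BY city`. The pure-Python sketch below imitates the map, shuffle, and reduce phases for that query on a single machine. The `visits` data and the function names are hypothetical, and the real execution plan generated on a cluster is considerably more involved.

```python
from collections import defaultdict

# Hypothetical (user, city) records standing in for a large table.
visits = [
    ("u1", "Pune"), ("u2", "Delhi"), ("u3", "Pune"),
    ("u4", "Mumbai"), ("u5", "Delhi"), ("u6", "Pune"),
]


def map_phase(records):
    """Map: emit a (key, 1) pair for every record, keyed by city."""
    for _user, city in records:
        yield city, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate each group's values, here by summing the counts."""
    return {key: sum(values) for key, values in groups.items()}


counts = reduce_phase(shuffle_phase(map_phase(visits)))
print(counts)
```

On a cluster, the map and reduce steps would run in parallel on different nodes, each handling only part of the data.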
Sqoop is a data transfer tool that is part of the Hadoop ecosystem. It transfers data between relational database systems and the Hadoop Distributed File System.
Spark is an open-source, distributed, in-memory data processing framework. It is fast and flexible, and it supports both batch processing and stream processing. Spark can be deployed in a variety of ways, provides native bindings for the Java, Scala, Python, and R programming languages, and supports SQL, streaming data, machine learning, and graph processing.
This article, “Big Data Analytics and Concepts”, is also helpful for beginners, as it gives a conceptual idea of Big Data Analytics.