The term “Big Data” has become quite common in the field of modern Relational Database Management and Data Analysis, and today there are many Big Data platforms out there (Hadoop, SAP HANA, etc.).
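We will cover the following sub-topics: the difference between data and information, conventional data analysis, what big data is, big data analytics, and some final notes.
Let’s begin by explaining the difference between data and information.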
1. Difference Between Data and Information
What is data in the first place? There is no perfect definition of data, but we can say ‘data is quantities gathered from measurements made in the environment, where such quantities could be further analysed to produce useful information’ – Kindson The Genius.
Examples of data could include the following:
1. Temperature measurements taken several times during the day
2. Scores of students tabulated at the end of the semester exam
3. Responses collected from respondents through a distributed questionnaire
4. List of names and dates of births of all the students in a school
You can think of more examples. All of these can be viewed as data. This means that they can be further analysed to produce useful information. Note this major difference between data and information: data is analysed or processed to produce information, but information is not processed to produce data.
Let’s now look at examples of information, that is, the useful results or conclusions obtained when data has been processed.
1. The weather is hottest around midday (information derived from 1)
2. 80 percent of the students passed the exam, or, put differently, students did very well in the exam (derived from 2)
3. The majority of the students are teenagers (derived from 4)
Unlike the data presented previously, these pieces of information would be very useful to any management in making decisions and needed changes.
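To make the step from data to information concrete, here is a minimal Python sketch. The readings, the scores and the pass mark of 50 are made-up values used purely for illustration, and the function names are my own.

```python
# Hypothetical raw data: (hour of day, temperature in degrees Celsius).
temperature_readings = [(6, 22.5), (9, 26.0), (12, 33.1), (15, 31.4), (18, 27.2)]

# Hypothetical exam scores; the pass mark of 50 is an assumption.
exam_scores = [45, 67, 82, 58, 90, 39, 73, 61, 55, 88]
PASS_MARK = 50


def hottest_hour(readings):
    """Return the hour at which the highest temperature was recorded."""
    hour, _ = max(readings, key=lambda reading: reading[1])
    return hour


def pass_rate(scores, pass_mark):
    """Return the percentage of scores at or above the pass mark."""
    passed = sum(1 for score in scores if score >= pass_mark)
    return 100 * passed / len(scores)


# The raw lists are data; the two printed statements are information.
print(f"The hottest reading was taken at {hottest_hour(temperature_readings)}:00")
print(f"{pass_rate(exam_scores, PASS_MARK):.0f} percent of the students passed")
```

The lists on their own say little to a decision maker, while the two printed sentences are the kind of conclusions listed above.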
2. Data Analysis (Conventional Data Analytics)
The Data Analytics we know is the use of statistical techniques to process data to yield useful information. From the previous examples, you can see that conventional data is normally presented in tabular form, in rows and columns.
To analyse this data you apply methods such as the t-Test, ANOVA, ANCOVA, Correlation and Regression, MANOVA, Chi-Square, etc. Along the way you also compute descriptive measures such as the mean, median, mode, standard deviation and variance.
All of this works well and is still very useful today in the field of Research Statistics in Education. Some of the tools used for data analysis include SPSS by IBM, EViews and Statistica.
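As a rough illustration of the descriptive measures mentioned above, the following sketch uses Python's standard `statistics` module instead of a dedicated package like SPSS. The scores are made-up values and the `describe` helper is a hypothetical name of my own.

```python
import statistics

# Hypothetical exam scores for ten students.
scores = [45, 67, 82, 58, 90, 39, 73, 61, 55, 67]


def describe(values):
    """Return common descriptive statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),        # sample standard deviation
        "variance": statistics.variance(values),  # sample variance
    }


summary = describe(scores)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

This only works because the data is structured: every entry is a number in the same column, which is exactly the assumption that breaks down with big data.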
3. What is Big Data?
The question is, what if, in gathering data from respondents, each respondent includes a passport photo? Or, in taking temperature readings, you also take a snapshot of the weather at that particular time? How do you analyze the images?
Definition: Big data is defined as very large and complex data sets that cannot be analysed using conventional data analysis methods.
So what exactly makes data qualify as big data? Let’s consider 3 attributes. These attributes are known as the 3 Vs, that is, Volume, Variety and Velocity.
1. Volume: Big data sets have a very large volume. The size ranges from hundreds of gigabytes to terabytes and even petabytes.
2. Variety: As just mentioned, big data contains data in different structures. It could be images, text and audio, as well as data in video formats. Because of this, this kind of data is referred to as unstructured data, as opposed to structured data arranged in rows and columns in tables.
3. Velocity: This refers to the high rate of growth or generation of the data. An example would be the growth of data on a social networking site such as Facebook.
Examples of big data
Example 1: A typical example would be user data from a social networking site. This would be made up of several billion files in different formats.
Example 2: Millions of emails stored on a public mail server such as Yahoo, together with the attachments to these emails.
4. Big Data Analytics
Since big data does not always have a well-defined structure, it is often not possible to use conventional data analysis tools to analyze it and yield the needed trends and information. Some of the big data analysis tools are:
MapReduce: A programming model and software framework for processing very large data sets in parallel across a cluster of machines
Hadoop: An open-source framework from the Apache Software Foundation for distributed storage and processing of large data sets
Hive: An open-source data warehouse tool built on top of Hadoop for querying and processing big data with an SQL-like language
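To give a feel for the MapReduce idea, here is a small word-count sketch in plain Python. It runs in a single process and does not use the Hadoop API. It only imitates the map, shuffle and reduce phases, and all function names are my own. In a real cluster, each phase would be spread across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the grouped values to get one count per word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run the three phases in sequence over a list of documents."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle_phase(pairs))


# Hypothetical input: each string stands in for one unstructured text file.
documents = ["big data is big", "data about data"]
counts = word_count(documents)
print(counts)
```

The point of splitting the work this way is that the map step can run on each file independently, so the same logic can scale to far more text than one machine could hold.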
5. Final Notes
The field of data analytics is evolving, and data analysis technology is improving continuously. So for folks engaged in Research Data Analytics, this is the time to improve your skills in the area of big data analytics. Will there be a time when conventional data analytics is completely replaced and irrelevant? I don’t think so. But it’s necessary to move along with the trend and keep yourself up to date with the latest developments in Data Analytics.
I’m working on finding out available Big Data Conferences I could recommend as well as available Big Data Platforms for 2019.