Have Fun with Statistics in the Big Data Era

By Lin Cong

What is Big Data, you ask?

 

[Image Description: A grey- and white-haired Husky puppy shaking its head in confusion while lying on the floor.]

Pictured: You, ready to learn all about Big Data

Despite how it may seem, big data is actually not a brand-new concept. People have been gathering and storing data for quite a long time. However, the concept of big data did not gain momentum until the early 2000s when Gartner analyst Doug Laney articulated the now mainstream definition of big data as the three V’s:

Volume: Volume refers to the amount of data. Data today can be collected from a variety of sources. With the development of computer technology, the cost of storing data has fallen significantly, so storing data is much easier than it was in the past.

Velocity: Velocity refers to the speed at which data must be processed. Since data analysis is important in decision making, and data is streaming in at an unprecedented speed, it must be dealt with in a timely manner.

Variety: Variety refers to the number of types of data. The different types of data can be categorized into structured data, such as numerical data in traditional databases, and unstructured data, such as text documents, e-mails, video, audio, and so on.

Over time, several more characteristics have been added to this definition of big data, such as Variability and Veracity. However, the main difference between big data and the data in a traditional database is its size and complexity, which make traditional analytical methods impractical to apply. Because of this, more and more statisticians and data scientists are devoted to developing new algorithms and statistical methods to analyze big data.

 

Three Stories About Statistics with Big Data

Big data can be used in a variety of areas in everyday life. The following three stories depict three different kinds of big data analysis using statistics in industry.

 

[Image Description: Steve Carell as Michael Scott from The Office moves forward, his eyes widening with interest, as he puts his head on top of his hands.]

Pictured: It's storytime, folks

 

Big Data in the Media and Entertainment Industry: Nowadays, people are exposed to a wide range of digital media. A huge amount of data is generated every day on media platforms such as YouTube and Spotify. Companies like these can gather massive amounts of data from millions of their subscribers for analysis. Netflix, for example, uses its big data to improve its profits.

Netflix has over 100 million subscribers, and it collects a large amount of behavioral data from them, such as their search history, their ratings for programs, the date a movie or show is watched, the device it is watched on, and so on. However, the data becomes valuable only once it has been processed, cleaned, and mined for useful insights. As a first step, the data is transformed into a numerical or categorical format for descriptive statistics, which yield findings such as “A typical Netflix member loses interest in 60-90 seconds when choosing something to watch, having reviewed 10 to 20 titles.”

More sophisticated analysis can be done with the cleaned data. For example, by implementing models of a recommendation system, Netflix can provide each member with a personalized selection of videos from its entire collection. Also, people are more likely to watch videos that are similar to the ones they are watching, so a video-video similarity algorithm can provide an estimate for what the member would like to watch based on the programs watched previously.
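
To make the video-video similarity idea concrete, here is a minimal sketch. It is not Netflix's actual algorithm: it assumes a tiny, made-up watch matrix and uses cosine similarity between title columns as one common way to score how alike two titles are. The names TITLES, WATCHED, and most_similar are hypothetical.

```python
import math

# Hypothetical watch matrix: rows are members, columns are titles.
# 1 means the member watched the title, 0 means they did not.
TITLES = ["Drama A", "Drama B", "Comedy C", "Documentary D"]
WATCHED = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 0, 1, 0],
]


def column(matrix, j):
    """Return column j of the matrix, i.e. who watched title j."""
    return [row[j] for row in matrix]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(title_index, matrix, titles):
    """Rank every other title by its similarity to the given title."""
    target = column(matrix, title_index)
    scores = [
        (titles[j], cosine_similarity(target, column(matrix, j)))
        for j in range(len(titles))
        if j != title_index
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Titles most similar to "Drama A", best match first.
ranking = most_similar(0, WATCHED, TITLES)
for title, score in ranking:
    print(f"{title}: {score:.3f}")
```

In this toy data, "Drama B" ranks first because the same members tend to watch both dramas. A real system would work with far larger and sparser data and would likely combine many more signals.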

With better models and advanced algorithms based on big data, more of the detail hidden in the data can be revealed and used to guide future business decisions.

 

Big Data in the Healthcare Industry: The healthcare industry is a big source of data. Healthcare data, such as patient records and clinical data, has been kept for years in formats such as pictures, videos, and other multimedia, which makes it highly complex. Especially since the digitization of healthcare information and the rise of value-based care, data analysis has been gaining more attention in this industry.

Despite the challenges of processing the vast and complex data produced by the healthcare industry, statistical analysis of that data can be quite beneficial. For example, with a patient’s own healthcare data, a personalized diagnosis and care plan can be developed efficiently from his or her comprehensive profile. Also, by building statistical models on the history and demographics of epidemics and the records of infected patients, certain patterns in how outbreaks start and develop can be detected. This analysis can be used to predict outbreaks and guide better preventative measures.

 

Big Data in the Transportation Industry: A more familiar application of big data can be found in the transportation industry, where plenty of geographical and traffic data is collected continuously. Big data already shapes different aspects of our daily transportation, such as route planning and supply chain management.

One notable example of using big data in transportation is Uber. Uber collects a tremendous amount of data about drivers, their vehicles, their locations, and every trip made by every vehicle. For each trip, an optimal route with lighter traffic and a shorter distance can be provided based on an analysis of traffic and weather data from the surrounding area.
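
As a rough illustration of route planning, the sketch below finds the fastest route through a small road network using Dijkstra's shortest-path algorithm. This is not Uber's method. The road network, the travel times in minutes, and the names ROADS and fastest_route are all made up for the example, and the travel times stand in for whatever a traffic and weather analysis would estimate.

```python
import heapq

# Hypothetical road network: each edge holds an estimated travel time in
# minutes, which a real system would update from traffic and weather data.
ROADS = {
    "pickup": {"A": 4, "B": 2},
    "A": {"dropoff": 5},
    "B": {"A": 1, "dropoff": 8},
    "dropoff": {},
}


def fastest_route(graph, start, goal):
    """Dijkstra's algorithm: return (total_minutes, path), or None if unreachable."""
    queue = [(0, start, [start])]  # (minutes so far, current node, path taken)
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, minutes in graph[node].items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + minutes, neighbor, path + [neighbor]))
    return None


result = fastest_route(ROADS, "pickup", "dropoff")
print(result)
```

Here the direct-looking route through A takes 9 minutes, while the detour through B and then A takes 8, so the detour wins. Changing an edge weight, for example to reflect a traffic jam, changes which route is chosen.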

 

Development of Statistics in the Big Data Era

 

[Image Description: Chris Hemsworth as Agent H and Tessa Thompson as Agent M from Men in Black: International look down at the viewer curiously.]

Pictured: You, considering all the options you have with big data

 

Big data opens the door to further development of statistical methods. Even though traditional methods such as linear regression remain an important part of statistical analysis, many newer machine learning algorithms and data mining techniques, such as random forests and neural networks, have proven to fit big data well and to handle it with greater computational efficiency.

While big data offers major benefits, it nevertheless comes with its own set of challenges. Because big data is so widely used in data analysis, plenty of methods and algorithms are being created for it, which is a benefit. Each method, however, has many parameters that can be tuned to get a better fit, so training a model becomes harder as the number of options grows. One potential solution to this challenge is to cooperate with user departments to gain a better understanding of the data.
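
To see how quickly the tuning options multiply, here is a small sketch that lists every combination in a parameter grid. The grid is hypothetical: the parameter names and values are only meant to resemble the settings of a random-forest-style model and are not taken from any particular library.

```python
from itertools import product

# Hypothetical tuning grid for a random-forest-style model.
GRID = {
    "n_trees": [100, 300, 500],
    "max_depth": [4, 8, 16, None],
    "min_samples_leaf": [1, 5, 10],
    "max_features": ["sqrt", "log2"],
}


def enumerate_settings(grid):
    """Return one dict per combination of parameter values in the grid."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


settings = enumerate_settings(GRID)
print(len(settings))  # 3 * 4 * 3 * 2 = 72 candidate models to train
```

Even this modest grid of four parameters produces 72 models to train and compare, and each added parameter multiplies that number again. Knowing the data well helps narrow the grid before any training starts.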

In general, big data marks a change for official statistics. In this big data era, more information can be extracted from huge amounts of data, so statistics is becoming more and more critical. Isn’t it fun to discover the world growing wider and deeper with statistical analysis?!