Discovering the World Through Data Analysis

By Lin Cong


[Image Description: A man confused by a lot of math equations and having a hard time with calculations]

Pictured: You, preparing your data analysis


Data analysis is the process of cleaning, transforming, and modeling data in order to extract information from it. Nowadays, data analysis is widely used for decision making in many different areas of life.

Some common types of data analysis are Descriptive Analysis, Diagnostic Analysis, Predictive Analysis, and Prescriptive Analysis. Descriptive Analysis is a basic summary of the complete data or of some specific features in the data. Examples of this type of analysis are the mean and standard deviation for continuous features, and frequencies for categorical features. Diagnostic Analysis is used to identify behavior patterns in data by analyzing the relationships within the data. For example, a business may see that leads increased in the month of October and use diagnostic analysis to determine which marketing efforts contributed the most. Predictive Analysis uses current data to make a forecast for future observations. A simple example of this type of analysis would be using rainfall data from previous years to predict the rainfall in the current year in the same area. Finally, Prescriptive Analysis combines the three analyses above to plan the actions to take in a current situation.
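To make the descriptive case concrete, here is a minimal Python sketch using only the standard library. The ages and ticket classes are hypothetical values invented for illustration; they do not come from any real dataset.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical toy records, invented purely for illustration.
ages = [22, 38, 26, 35, 54]
ticket_class = ["third", "first", "third", "first", "second"]

# Continuous feature: mean and sample standard deviation.
age_mean = mean(ages)
age_sd = stdev(ages)

# Categorical feature: how often each category appears.
class_counts = Counter(ticket_class)

print(f"mean age = {age_mean:.1f}, standard deviation = {age_sd:.1f}")
print(dict(class_counts))
```

The same two ideas, a numeric summary for continuous features and a frequency table for categorical ones, carry over to larger datasets.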

As mentioned above, a typical process of data analysis includes data cleaning, modeling, interpretation, and visualization. Data can be collected in different ways, such as open data sources, questionnaires, and so on. However, the raw data collected is usually not suitable for analysis, so we need to clean the data by reorganizing and recoding it. The data cleaning process is as important as data modeling and can sometimes be the most complex part of data analysis. After the data is cleaned and processed, a summary of the data should be produced to get a first look at the data before modeling; different statistical models can then be fit to the data. A number of software tools, such as Python, R, and SAS, provide statistical tools for modeling. The best model can be picked based on a model selection criterion (for example, AIC or BIC). Data modeling is the tool people use to extract information from data, while data interpretation is how we convey the information acquired from statistical models; for example, the coefficient estimates from a linear regression model can be used to explain the contribution of specific factors. Data visualization is another way to summarize the data, and it is often more straightforward than a numerical summary. Simple visualizations include bar plots, boxplots, and scatterplots. Additionally, the modeling results can be visualized in a way that depends on which statistical models are used.
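The cleaning, modeling, and interpretation steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the survey rows (hours studied, exam score) are invented, the helper `to_float` is hypothetical, and the model is a simple one-predictor least squares fit rather than anything a real project would stop at.

```python
from statistics import mean

# Hypothetical raw survey rows (invented): hours studied and exam score.
raw_rows = [("2", "65"), ("4", "75"), ("", "70"),
            ("6", "85"), ("8", "n/a"), ("10", "105")]

def to_float(text):
    """Return the value as a float, or None if it cannot be parsed."""
    try:
        return float(text)
    except ValueError:
        return None

# Cleaning: recode strings as numbers and drop incomplete rows.
parsed = [(to_float(x), to_float(y)) for x, y in raw_rows]
clean = [(x, y) for x, y in parsed if x is not None and y is not None]

# Modeling: ordinary least squares for score = intercept + slope * hours.
xs = [x for x, _ in clean]
ys = [y for _, y in clean]
x_bar, y_bar = mean(xs), mean(ys)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in clean)
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

# Interpretation: the slope is the estimated change in score per extra hour.
print(f"kept {len(clean)} of {len(raw_rows)} rows")
print(f"score = {intercept:.1f} + {slope:.1f} * hours")
```

Here the slope plays the role of the coefficient estimate: it says how much the outcome is expected to change when the factor increases by one unit.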

Data analysis is everywhere.

[Image Description: A woman is so surprised that she picks up her phone and takes photos so she will not miss anything.]

Pictured: You, reading this data analysis example


A well-known data analysis example is the Titanic dataset, which contains data on passengers who were aboard the RMS Titanic when it sank on 15 April 1912 after colliding with an iceberg in the North Atlantic Ocean. The variables included in the dataset are personal characteristics such as age, class of ticket, sex, and so on. It is not surprising to see that people with higher class tickets on the Titanic had a higher survival rate. This is a typical dataset for classification problems, where people try to predict whether a passenger aboard the RMS Titanic survived based on their personal characteristics. Beyond predicting a binary outcome like this one, classification in general can also handle problems with more than two groups within the population.
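As a rough sketch of this kind of analysis, the Python snippet below computes a survival rate per ticket class and turns it into a very simple baseline classifier. The rows are invented toy values, not real Titanic records, and the 0.5 threshold rule is an assumption chosen only to keep the example short.

```python
from collections import defaultdict

# Hypothetical toy rows (ticket_class, survived); NOT real Titanic records.
passengers = [(1, 1), (1, 1), (1, 0), (2, 1), (2, 0),
              (3, 0), (3, 0), (3, 1), (3, 0)]

totals = defaultdict(int)
survivors = defaultdict(int)
for ticket_class, survived in passengers:
    totals[ticket_class] += 1
    survivors[ticket_class] += survived

# Share of survivors within each ticket class.
survival_rate = {c: survivors[c] / totals[c] for c in sorted(totals)}

# Baseline classifier: predict "survived" (1) when the class rate is >= 0.5.
predicted = {c: int(rate >= 0.5) for c, rate in survival_rate.items()}

for c, rate in survival_rate.items():
    print(f"class {c}: survival rate {rate:.2f}, prediction {predicted[c]}")
```

A real classifier would use several characteristics at once (age, sex, and so on), but the structure is the same: learn a rule from labeled rows, then apply it to new ones.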

In contrast to the classification problem above, regression is another important type of statistical model. An example of a regression problem is the Boston Housing dataset. The data was collected by the U.S. Census Service on housing in Boston, Massachusetts, for a study that aimed to ascertain whether the availability of clean air influenced the value of houses in Boston. The goal is to predict the median value of occupied homes and to discover explanatory variables that significantly influence median house prices. A similar regression problem is Walmart sales forecasting, where the goal is to predict the sales of the various departments in each store, given historical sales data across the departments of Walmart stores.

As datasets get larger and we begin working with a greater variety of data formats, such as text and photos, more and more complex problems arise. One example is text mining, where large amounts of unstructured data are collected from natural language in different sources such as e-mails, text messages, and platforms like Facebook and Twitter. This kind of problem can involve both the classification and regression problems mentioned above, and the difficulty is how to transform the data into a format to which standard statistical models can be applied.
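One common way to turn text into model-ready numbers is a bag-of-words count matrix. The Python sketch below shows the idea with the standard library only; the three messages are invented, and the `tokenize` helper is a hypothetical, deliberately simplistic tokenizer.

```python
import re
from collections import Counter

# Hypothetical short messages, invented for illustration.
documents = [
    "Great food, great service!",
    "The service was slow.",
    "Food was great.",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

token_lists = [tokenize(doc) for doc in documents]

# Vocabulary: every distinct word, in sorted order, becomes one column.
vocabulary = sorted({word for tokens in token_lists for word in tokens})

# Document-term matrix: one row per document, one count per vocabulary word.
matrix = []
for tokens in token_lists:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is now an ordinary numeric feature vector, so it can be passed to the same classification or regression models used for tabular data.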

Another example is the Yelp dataset, which includes Yelp’s businesses, reviews, and users, provided by the platform for educational and academic purposes. The complexity of these problems lies in the sheer amount of data and in how to transform the words and pictures into usable data. This project therefore involves topics such as natural language processing and sentiment analysis, photo classification, and graph mining.

Not only have the types and amounts of data grown, but the methods for data analysis have also developed very fast. For example, the Adult Census Income problem uses weighted census data extracted from the 1994 and 1995 Current Population Surveys, with 41 employment and demographic related variables. The purpose of the study is to predict whether income exceeds $50,000 per year. In addition to statistical models such as logistic regression for this kind of classification problem, algorithms such as KNN (K Nearest Neighbors) and Random Forests are now widely used and can sometimes give more accurate predictions.
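To give a feel for how KNN works, here is a minimal from-scratch sketch in Python. The training points (age, hours worked per week) and their labels are invented for illustration and are not census records; `knn_predict` is a hypothetical helper, and a real analysis would normally rescale the features and use many more variables.

```python
import math
from collections import Counter

# Hypothetical training points: (age, hours worked per week) -> income label.
# Values are invented for illustration and are not census records.
training = [
    ((25, 20), "<=50K"),
    ((30, 35), "<=50K"),
    ((22, 40), "<=50K"),
    ((45, 50), ">50K"),
    ((50, 45), ">50K"),
    ((40, 60), ">50K"),
]

def knn_predict(point, data, k=3):
    """Label a point by majority vote among its k nearest training points."""
    by_distance = sorted(data, key=lambda row: math.dist(point, row[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

prediction_young = knn_predict((28, 30), training)
prediction_older = knn_predict((48, 55), training)
print(prediction_young, prediction_older)
```

Unlike logistic regression, KNN does not estimate coefficients; it simply looks up the most similar known cases, which is why it is easy to apply but harder to interpret.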

Everyone can play with data analysis.


[Image Description: Mark Brennan puts on his gloves and prepares to start his work.]

Pictured: You, preparing to get your hands into some additional data sets!


The most fascinating part about data analysis is that, beyond the problems discussed above, many more questions can be asked of the same dataset. People can propose different questions that can be analyzed with a specific dataset, or collect data based on the questions they propose.

Data analysis is no longer a professional subject only for statisticians and data scientists; people from many areas want to or need to use data analysis to draw more confident conclusions from data. Additionally, it is always a good idea to get hands-on experience with data analysis, since the most important part of data analysis is not how complex a model is implemented, but how well you understand the meaning of the data and find a direction for interpreting it.

Hopefully, these examples helped highlight how exciting data analysis truly is! It's never too late to start your own data analysis project to learn more about the world!