What is data, and how is it understood from a statistical perspective?

What is Data?

The simplest definition is that data is information: information that we collect and use to understand things better. It can come in many forms, like numbers, words, or pictures. By organizing and analyzing data, we can learn about the world around us and make decisions based on what we find. Here is an example of the kind of data a grocery store might collect.

Imagine a grocery store that wants to understand the shopping habits of its customers to improve their experience and increase sales. They decide to collect data on the items purchased, the time of day customers shop, and the total amount spent on each visit.

Image showing people going shopping

Customer 1:

Items purchased: Bread, milk, eggs, apples, and orange juice
Time of day: 9:30 AM
Total spent: $15.50

Customer 2:

Items purchased: Cereal, bananas, yogurt, and coffee
Time of day: 6:15 PM
Total spent: $12.75

Customer 3:

Items purchased: Pasta, tomato sauce, ground beef, and salad
Time of day: 4:00 PM
Total spent: $18.25

By collecting and analyzing this data, the grocery store can better understand what products are popular, when customers prefer to shop, and how much they typically spend. This information can help them make informed decisions about product placement, stocking times, and promotions to improve the overall shopping experience and increase sales.

Organizing Data

The data above is in a raw format. We can start organizing it by putting it in a table, like so:

Customer	Items Purchased	Time of Day	Total Spent
1	Bread, milk, eggs, apples, orange juice	9:30 AM	$15.50
2	Cereal, bananas, yogurt, coffee	6:15 PM	$12.75
3	Pasta, tomato sauce, ground beef, salad	4:00 PM	$18.25
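The same table can also be held in code. Here is a minimal Python sketch, assuming each row becomes a dictionary and each column becomes a key. The names `visits`, `item_counts`, and `biggest` are made up for illustration and are not part of any particular tool.

```python
# Each customer visit is one record (row); each key is one column.
visits = [
    {"customer": 1,
     "items": ["bread", "milk", "eggs", "apples", "orange juice"],
     "time": "9:30 AM", "total_spent": 15.50},
    {"customer": 2,
     "items": ["cereal", "bananas", "yogurt", "coffee"],
     "time": "6:15 PM", "total_spent": 12.75},
    {"customer": 3,
     "items": ["pasta", "tomato sauce", "ground beef", "salad"],
     "time": "4:00 PM", "total_spent": 18.25},
]

# How many items did each customer buy?
item_counts = {v["customer"]: len(v["items"]) for v in visits}

# Which visit had the highest total?
biggest = max(visits, key=lambda v: v["total_spent"])

print(item_counts)
print("Highest total: customer", biggest["customer"])
```

Once the data has this shape, questions like "who spent the most?" become one-line lookups instead of a manual scan of the raw notes.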

Statistical Perspective Of Data

From a statistical point of view, data is a collection of individual pieces of information, often in the form of numbers, that help us understand patterns, trends, and relationships. By organizing, analyzing, and interpreting this information, we can make informed decisions, predictions, or conclusions about a larger group or population. In statistics, we often use data to create charts, graphs, or tables to better visualize and communicate these patterns and trends.
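To make this concrete, here is a small Python sketch that summarizes the three grocery totals from earlier using the standard library's `statistics` module. Three visits are far too few to say anything reliable about all customers, so treat this only as an illustration of the idea. The variable names are assumptions made for the example.

```python
import statistics

# Total spent by the three customers in the grocery example.
totals = [15.50, 12.75, 18.25]

mean_total = statistics.mean(totals)      # typical amount spent
median_total = statistics.median(totals)  # middle value when sorted
spread = statistics.stdev(totals)         # sample standard deviation

print(f"mean={mean_total:.2f} median={median_total:.2f} stdev={spread:.2f}")
```

The mean and median describe a "typical" visit, while the standard deviation describes how much individual visits vary around it.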

How is Data Organized?

Data is usually organized in a structured format to make it easier to understand, analyze, and use. Some common ways to organize data include:

Tables

Data can be arranged in rows and columns, similar to a spreadsheet, where each row represents a unique record or observation and each column represents a specific variable or attribute. Tables make it easy to sort, filter, and compare data.

Image showing 5 comic characters

Let's consider a small dataset of people's ages, heights, and weights:

Name	Age	Height (inches)	Weight (lbs)
Alice	25	64	120
Bob	30	70	150
Carol	35	62	130
David	28	72	175
Eve	22	66	140

In this table:

Each row represents a unique person and their associated information (Age, Height, and Weight).
Each column represents a specific attribute (Name, Age, Height, and Weight).

By organizing the data in a table, it's easier to read and compare the information. For example, you can quickly find that Eve is the youngest person in the table, David is the tallest, and Carol weighs 130 pounds. The tabular format also makes it simple to sort or filter the data by a specific attribute, like age or height.
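Sorting and filtering a table like this can be sketched in a few lines of Python. This is only one possible representation, assuming a list of dictionaries with made-up key names such as `height_in` and `weight_lb`.

```python
# One dictionary per row of the table; keys play the role of columns.
people = [
    {"name": "Alice", "age": 25, "height_in": 64, "weight_lb": 120},
    {"name": "Bob",   "age": 30, "height_in": 70, "weight_lb": 150},
    {"name": "Carol", "age": 35, "height_in": 62, "weight_lb": 130},
    {"name": "David", "age": 28, "height_in": 72, "weight_lb": 175},
    {"name": "Eve",   "age": 22, "height_in": 66, "weight_lb": 140},
]

# Compare rows on a single attribute.
youngest = min(people, key=lambda p: p["age"])
tallest = max(people, key=lambda p: p["height_in"])

# Sort by one column, filter by another.
names_by_age = [p["name"] for p in sorted(people, key=lambda p: p["age"])]
taller_than_65 = [p["name"] for p in people if p["height_in"] > 65]

print(youngest["name"], tallest["name"])
print(names_by_age)
print(taller_than_65)
```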

Databases

A database is an organized collection of data, often stored and accessed electronically. Data in databases can be organized into tables, with relationships between tables allowing for more complex data organization and retrieval.

Let's consider a simple example of a library database. In this database, we have two tables - one for books and one for authors. The tables have a relationship based on the author's unique ID.

The authors table:

Author ID	Author Name
1	J.K. Rowling
2	George R.R. Martin
3	Jane Austen

The books table:

Book ID	Book Title	Author ID
1	Harry Potter and the Sorcerer's Stone	1
2	Harry Potter and the Chamber of Secrets	1
3	A Game of Thrones	2
4	A Clash of Kings	2
5	Pride and Prejudice	3
6	Sense and Sensibility	3

ER Diagram To Represent The Relationship Between The Tables

The Entity Relationship (ER) diagram is a visual representation of the data model for the library database, which consists of two entities: AUTHOR and BOOK. The diagram depicts the relationship between these entities and their attributes.

In the diagram:

AUTHOR entity: Represents the authors in the library database.
- Attributes:
  - AuthorID: A unique identifier for each author.
  - AuthorName: The name of the author.
BOOK entity: Represents the books in the library database.
- Attributes:
  - BookID: A unique identifier for each book.
  - BookTitle: The title of the book.
  - AuthorID: A reference to the author of the book, which corresponds to the AuthorID attribute in the AUTHOR entity.
Relationship between AUTHOR and BOOK: The diagram shows a one-to-many (1:n) relationship, represented by the ||--o{ notation. This indicates that one author can have multiple books, but each book is associated with only one author.

The ER diagram helps to visually understand the structure of the data model, the entities, their attributes, and the relationships between them. In this case, it shows how the AUTHOR and BOOK entities are related through the AuthorID attribute, which is used to associate each book with its author.
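The library example can be tried out with Python's built-in `sqlite3` module. The sketch below is one possible way to model it, assuming an in-memory database and snake_case table and column names (`author`, `book`, `author_id`, and so on) chosen for this example.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

conn.execute(
    "CREATE TABLE author ("
    " author_id INTEGER PRIMARY KEY,"
    " author_name TEXT NOT NULL)"
)
# book.author_id points back to author.author_id (the one-to-many link).
conn.execute(
    "CREATE TABLE book ("
    " book_id INTEGER PRIMARY KEY,"
    " book_title TEXT NOT NULL,"
    " author_id INTEGER NOT NULL REFERENCES author(author_id))"
)

conn.executemany("INSERT INTO author VALUES (?, ?)", [
    (1, "J.K. Rowling"),
    (2, "George R.R. Martin"),
    (3, "Jane Austen"),
])
conn.executemany("INSERT INTO book VALUES (?, ?, ?)", [
    (1, "Harry Potter and the Sorcerer's Stone", 1),
    (2, "Harry Potter and the Chamber of Secrets", 1),
    (3, "A Game of Thrones", 2),
    (4, "A Clash of Kings", 2),
    (5, "Pride and Prejudice", 3),
    (6, "Sense and Sensibility", 3),
])

# Follow the relationship: how many books does each author have?
books_per_author = conn.execute(
    "SELECT a.author_name, COUNT(b.book_id)"
    " FROM author AS a JOIN book AS b ON b.author_id = a.author_id"
    " GROUP BY a.author_id ORDER BY a.author_id"
).fetchall()

# All titles for a single author, looked up through the shared ID.
rowling_titles = [row[0] for row in conn.execute(
    "SELECT book_title FROM book WHERE author_id = ? ORDER BY book_id", (1,)
)]

print(books_per_author)
print(rowling_titles)
conn.close()
```

Storing the author's name once and referring to it by ID is what lets one author row be shared by many book rows.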

Lists

Data can be organized as a simple list or sequence of items, often for one-dimensional data or when there is no need for complex relationships between data points. Suppose you want to keep track of the top 5 best-selling books of the month. In this case, you could create a simple list of the book titles:

The Lost Treasure
Journey to the Stars
The Secret Garden
The Time Traveler's Chronicles
Beneath the Waves
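In code, a list like this maps directly onto a Python list. The short sketch below assumes the titles are stored in ranking order, with the best seller first.

```python
# Position in the list doubles as the ranking (assumed: best seller first).
best_sellers = [
    "The Lost Treasure",
    "Journey to the Stars",
    "The Secret Garden",
    "The Time Traveler's Chronicles",
    "Beneath the Waves",
]

# Print the list with 1-based ranks.
for rank, title in enumerate(best_sellers, start=1):
    print(rank, title)

top_title = best_sellers[0]
garden_rank = best_sellers.index("The Secret Garden") + 1
```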

Charts and Graphs

Visual representations of data can be helpful for understanding patterns, trends, and relationships. Examples of charts and graphs include bar charts, pie charts, line charts, and scatter plots.

Let's consider an example dataset and create a table to represent it:

Month	Sales
Jan	50
Feb	70
Mar	80
Apr	40
May	60

The table can be used to create a horizontal bar chart:
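A horizontal bar chart of this table can be roughed out in plain text with a few lines of Python. This is a minimal sketch, assuming one `#` stands for 10 units of sales. The helper name `bar_chart` is made up for this example, and a plotting library would be the usual choice for a real chart.

```python
sales = {"Jan": 50, "Feb": 70, "Mar": 80, "Apr": 40, "May": 60}

def bar_chart(data, unit=10, symbol="#"):
    """Return one text line per entry; each symbol stands for `unit` sales."""
    lines = []
    for label, value in data.items():
        bar = symbol * (value // unit)
        lines.append(f"{label:<4}{bar} {value}")
    return lines

chart = bar_chart(sales)
print("\n".join(chart))
```

Even in this crude form, the longest bar (March) and the shortest (April) stand out much faster than they do in the table.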

Hierarchies and Trees

Data can be organized in hierarchical structures, like nested categories, where each level represents a different level of detail or aggregation. Let's consider an example of a company's organizational structure, which can be organized in a tree-like hierarchical structure. Here's some sample data:

CEO
- VP of Operations
  - Operations Manager
  - Plant Supervisor
- VP of Finance
  - Finance Manager
  - Accountant
- VP of Sales
  - Sales Manager
    - Sales Associate
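A hierarchy like this can be sketched in Python as nested dictionaries, where each role maps to the roles that report to it. The helper functions below (`depth`, `count_roles`, `path_to`) are illustrative names, not part of any library.

```python
# Each key is a role; its value holds the roles reporting to it.
org = {
    "CEO": {
        "VP of Operations": {"Operations Manager": {}, "Plant Supervisor": {}},
        "VP of Finance": {"Finance Manager": {}, "Accountant": {}},
        "VP of Sales": {"Sales Manager": {"Sales Associate": {}}},
    }
}

def depth(tree):
    """Number of levels in the tree (0 for an empty tree)."""
    if not tree:
        return 0
    return 1 + max(depth(children) for children in tree.values())

def count_roles(tree):
    """Total number of roles at every level."""
    return sum(1 + count_roles(children) for children in tree.values())

def path_to(tree, target, trail=()):
    """Chain of roles from the top down to `target`, or None if absent."""
    for name, children in tree.items():
        here = trail + (name,)
        if name == target:
            return list(here)
        found = path_to(children, target, here)
        if found:
            return found
    return None

levels = depth(org)
total_roles = count_roles(org)
chain = path_to(org, "Sales Associate")
print(levels, total_roles, chain)
```

Because every branch has the same shape as the whole tree, short recursive functions are a natural fit for this kind of data.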

Geographic Information Systems (GIS)

Spatial data can be organized using geographic information systems, which store, manipulate, and analyze data based on geographical locations, such as latitude and longitude. Here's an example dataset containing information about cities and their geographical coordinates:

City	Latitude	Longitude
New York	40.7128	-74.0060
Los Angeles	34.0522	-118.2437
Chicago	41.8781	-87.6298
Houston	29.7604	-95.3698
Phoenix	33.4484	-112.0740
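Once locations are stored as coordinates, distances can be computed from them. The sketch below uses the haversine formula, which treats the Earth as a sphere (an approximation) with an assumed mean radius of 6371 km. A real GIS package would handle this with more care, so the function name `haversine_km` and the result are illustrative only.

```python
import math

# City -> (latitude, longitude) in decimal degrees.
cities = {
    "New York": (40.7128, -74.0060),
    "Los Angeles": (34.0522, -118.2437),
    "Chicago": (41.8781, -87.6298),
    "Houston": (29.7604, -95.3698),
    "Phoenix": (33.4484, -112.0740),
}

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a spherical Earth

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

ny_to_la = haversine_km(cities["New York"], cities["Los Angeles"])

# The city with the largest latitude is the northernmost one.
northernmost = max(cities, key=lambda name: cities[name][0])

print(f"New York to Los Angeles: {ny_to_la:.0f} km")
print("Northernmost city:", northernmost)
```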

The choice of how to organize data depends on the type of data being collected, the purpose for which it will be used, and the tools available for analysis.

Throughout this Statistics and Probability series, we will look at various methods of structuring data and drawing conclusions from it.

What Can You Do Next 🙏😊

If you liked the article, consider subscribing to Cloudaffle, my YouTube Channel, where I keep posting in-depth tutorials and all edutainment stuff for software developers.
