Data Science and Data Science Hierarchy

İrem Kömürcü
Published in Heartbeat
7 min read · Dec 21, 2022


Data is one of the most important and sought-after resources today. Applications, artificial intelligence products, games, customer services, marketing trends, social media, and more both provide us with data and build large data pools from the data they take from us.

Big data refers to the collection, storage, processing, and analysis of huge amounts of data. Both businesses and individuals can use it. For example, a company can collect customer data to help predict sales trends and improve marketing strategies. Likewise, an individual can collect their own data to monitor their health or understand their spending habits.

Photo by Killian Cartignies on Unsplash

Obtaining, interpreting, understanding, and using data constitutes the journey of data, and in this article, we will examine the journey of data in detail.

The field of data science covers a wide area, from collecting data to learning from it. Each area has its own job titles and working principles.

Data science is often thought of as only data analysis and model development, but the field also includes data collection, storage, and development. These areas feed off each other, and the Data Science Hierarchy pyramid expresses that relationship.

Data Science Hierarchy — by author

Let’s get to know the fields that feed each other and discover the people working in this field!

Collect

First, we start by collecting data. We can obtain data from various sources in various ways.

Photo by DeepMind on Unsplash

Sensors, data entered by users, social media, and more are sources from which we can collect data. There are various tools and various techniques for collecting data.

Data collection is the process of gathering and measuring information from a variety of sources, with the goal of adding to existing knowledge or using it to make decisions. This can be done through a variety of means, such as online surveys, focus groups, interviews, observation, and experimentation. The data collected can be qualitative, quantitative, or a combination of both, and it is typically analyzed and interpreted to draw conclusions or inform decision-making. Data collection is an important part of many fields, including research, business, and policy-making, as it allows organizations and individuals to gather information, understand trends, and make informed decisions.
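As a minimal sketch of gathering data from more than one source, the snippet below reads a small survey export (CSV) and a sensor feed (JSON) into one list of records. The inline sample data, the field names, and the `collect_records` helper are all hypothetical and exist only for illustration; a real project would read from files, APIs, or devices.

```python
import csv
import io
import json

# Hypothetical raw inputs standing in for two different data sources.
SURVEY_CSV = "user_id,age,satisfaction\n1,34,4\n2,27,5\n"
SENSOR_JSON = '[{"user_id": 1, "steps": 8200}, {"user_id": 3, "steps": 4100}]'


def collect_records(csv_text, json_text):
    """Gather rows from both sources into one list, tagging each with its origin."""
    records = []
    # Survey answers arrive as text, so convert the numeric fields explicitly.
    for row in csv.DictReader(io.StringIO(csv_text)):
        records.append({
            "source": "survey",
            "user_id": int(row["user_id"]),
            "age": int(row["age"]),
            "satisfaction": int(row["satisfaction"]),
        })
    # Sensor readings are already typed by the JSON parser.
    for item in json.loads(json_text):
        records.append({"source": "sensor", **item})
    return records


records = collect_records(SURVEY_CSV, SENSOR_JSON)
print(len(records), "records collected")
```

Tagging every record with its source keeps the origin traceable once data from different channels is mixed together.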

This stage falls under the broad job title of Data Science Engineer, but the Data Infrastructure Engineer focuses on it specifically.

A Data Infrastructure Engineer deals with the design, installation, and management of a company’s data infrastructure. The role covers setting up data storage systems, configuring databases, managing data flow, and securing data. Data Infrastructure Engineers may also optimize the performance of the infrastructure and make it scalable. They work with teams in fields such as data science, data analytics, and machine learning and help meet those teams’ data needs.

Move/Store

As a second step, the data we collect must be stored somewhere.

Photo by DeepMind on Unsplash

This stage covers data storage and the infrastructure behind it: ensuring reliable data flow during storage, defining pipelines where necessary, and maintaining every system used while storing data.

The data move and storage phase refers to moving data from a source to a destination and storing it there. At this stage, security measures are applied while the data is being moved, and checks confirm that the data has been stored correctly.

The data storage stage refers to the process of storing the data after it reaches the destination. Data storage can be done by storing data on a physical medium (for example, on a disk or device) or in a cloud service. At this stage, it is ensured that the data is stored in a safe and easily accessible way.
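The move-and-store step can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumptions: the `readings` table, its columns, the sample rows, and the `store_records` helper are hypothetical, and an in-memory database stands in for a real storage system.

```python
import sqlite3


def store_records(conn, rows):
    """Move rows into the destination table and return how many rows it now holds."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (user_id INTEGER, steps INTEGER)"
    )
    # The connection context manager commits on success and rolls back on error,
    # so a failed load does not leave partial data behind.
    with conn:
        conn.executemany(
            "INSERT INTO readings (user_id, steps) VALUES (?, ?)", rows
        )
    return conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]


# Hypothetical source data; ":memory:" keeps the example self-contained.
conn = sqlite3.connect(":memory:")
source_rows = [(1, 8200), (2, 4100), (3, 12050)]
stored_count = store_records(conn, source_rows)

# A simple integrity check: everything that left the source reached the destination.
print("stored correctly:", stored_count == len(source_rows))
```

Comparing the row count at the destination with the number of rows sent is one of the simplest ways to check that data was stored correctly.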

A Data Engineer is a professional who deals with the design, installation, and management of database and data storage systems. These functions are often required to provide data to data analysts or data scientists.

A data engineer may also optimize the performance of database systems, design data flow systems, and manage data loading. The role also covers consolidating data sources, managing data storage systems, and securing data.

In general, the data engineer operates and maintains database systems and tries to optimize the availability and performance of database systems.

Explore/Transform

After storing our data, we need to perform a series of operations.

Photo by DeepMind on Unsplash

So far, the data has only been collected and stored, so at this stage we need to prepare it. Data doesn’t always come to us clean and exactly the way we want it to be. Therefore, we need to check the data, detect abnormal values, clean it if necessary, and prepare it for our operations. This stage involves cleaning, anomaly detection, and other pre-processing steps.
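A minimal sketch of cleaning and anomaly detection is shown below. It drops missing values and then flags outliers with the interquartile range (IQR) rule, which is just one common technique chosen here for illustration. The sample readings and the `split_anomalies` helper are hypothetical.

```python
import statistics


def split_anomalies(values, k=1.5):
    """Drop missing values, then separate outliers using the IQR rule."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    cleaned = [v for v in present if low <= v <= high]
    anomalies = [v for v in present if v < low or v > high]
    return cleaned, anomalies


# Hypothetical temperature readings with one missing value and one sensor glitch.
readings = [21.5, 22.0, None, 21.8, 22.3, 95.0, 21.9, 22.1]
cleaned, anomalies = split_anomalies(readings)
print("kept:", cleaned)
print("flagged:", anomalies)
```

Whether a flagged value should be removed, corrected, or kept depends on the data and the problem, so the sketch only separates the values and leaves that decision open.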

The data exploration and transformation phase refers to the operations performed to examine and understand the data. At this stage, the data is explored and made understandable so that it is ready for later use.

It is very important to carry out these operations before using the data in our classification tasks and models. They give us quality data, which directly affects our model results and lets us analyze the data correctly.

Aggregate/Label

The data has been collected, stored, and cleaned, and now it’s time to select and label features for the operations we will run on it. In this step, the data is described and labeled in a way that machines can understand.

Photo by DeepMind on Unsplash

This stage includes analyzing the data, choosing the metrics we will work with based on our goals and our data, and partitioning the data. It is one of the most important stages.
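Labeling and partitioning can be sketched as follows. The labeling rule (a daily step count of at least 10,000 counts as "active"), the sample records, and both helper functions are assumptions made up for this example, not a prescribed method.

```python
import random


def label_record(record, threshold=10000):
    """Attach a hypothetical binary label: 1 if daily steps reach the threshold."""
    return {**record, "label": 1 if record["steps"] >= threshold else 0}


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and split them into train and test partitions."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical step counts for eight users.
steps = [3000, 12000, 8000, 15000, 500, 11000, 9500, 10000]
raw = [{"user_id": i, "steps": s} for i, s in enumerate(steps)]

labeled = [label_record(r) for r in raw]
train, test = train_test_split(labeled)
print(len(train), "training rows,", len(test), "test rows")
```

Keeping a test partition that the model never sees during training is what makes the later evaluation meaningful.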

Correct labeling is very important for optimization and model results. Incorrect labels directly affect model results and cause the model to learn the wrong patterns.

Data Scientists or Data Analysts handle the pre-processing that follows data storage, as well as the labeling work in this stage.

A Data Scientist is a professional who solves problems using data science. These problems often involve making data-driven decisions, making predictions, or optimizing a system. In general, a Data Scientist is tasked with providing data science-based solutions by discovering the information that can be extracted from the data.

Learn/Optimize

Now our data is ready, and we can use it with simple machine learning algorithms. At this stage, it is possible to run A/B tests, experiments, and simple ML algorithms on our data.
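As an illustration of the A/B tests mentioned here, the sketch below compares the conversion rates of two variants with a two-proportion z-test, one standard way to evaluate such an experiment. The visitor and conversion counts are invented for the example, and `two_proportion_z_test` is a hypothetical helper name.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return the z statistic and two-sided p-value for rate B versus rate A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 120 of 1000 visitors convert on A, 165 of 1000 on B.
z, p_value = two_proportion_z_test(120, 1000, 165, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference between the variants is unlikely to be due to chance alone, although the significance threshold should be chosen before the experiment is run.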

Note from experience: the stages described so far are the ones that take the most time in data science projects, because they prepare the data to be fed into the algorithm. Preparing, understanding, and exploring the data lays the groundwork for the AI part, which is the next step.

Photo by DeepMind on Unsplash

At the top of the pyramid, which is also the last stage of data science, we again work with our prepared data. Instead of simple ML algorithms, however, we apply Artificial Intelligence and Deep Learning, which rely on larger algorithms and different computational techniques.

Conclusion

As we have seen throughout this article, the field of Data Science covers a wide area, from collecting data to analyzing it and feeding it into a model. We also learned about the different job titles found across this wide spectrum.

If you liked the article, you can follow my Medium account and show your appreciation with claps.

You can also follow and communicate with me on social media. Thanks!

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don’t sell ads.

If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletter (Deep Learning Weekly), check out the Comet blog, join us on Slack, and follow Comet on Twitter and LinkedIn for resources, events, and much more that will help you build better ML models, faster.
