Data Terminology
Knowing data analytics terminology is important for two main reasons:
First, it allows for precise communication and understanding between data professionals. Standard definitions ensure everyone is on the same page regarding processes, methods, and objectives.
Second, terminology provides a framework to properly apply data analytics concepts. The correct terms help anchor the underlying statistical, technological, and business significance of data work.
01
Data transformation
Data transformation is the process of converting, calculating, aggregating, or otherwise manipulating data from its raw form into a desired structure or layout to prepare it for analysis and use.
02
Variable
In data analytics, a variable refers to a characteristic or attribute that can take on different values. It represents a piece of information that can be measured, observed, or recorded within a dataset.
03
Data Type
In data analytics, a data type is a classification that describes the kind of information stored in a variable. It helps determine how the data is treated and what operations can be performed on it, such as arithmetic calculations or comparisons.
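For example, Python's built-in types illustrate how a data type constrains which operations a value supports (the variable names below are illustrative):

```python
# Each value carries a type that governs which operations are valid.
age = 42            # int: supports arithmetic
name = "Alice"      # str: supports concatenation, not subtraction

print(type(age).__name__)    # the value's data type: int
print(age + 8)               # arithmetic works on numbers
print(name + " Smith")       # concatenation works on strings
# name - "A" would raise a TypeError: the type forbids that operation
```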
04
Data Set
In data analytics, a data set refers to a collection of related information or data points that are organized and grouped together. It is typically structured in a way that allows for analysis, insights, and patterns to be derived from the data for various purposes, such as making informed decisions or discovering trends.
05
Information
Information refers to processed and organised data that has been transformed into a meaningful and useful form. It is the result of analysing and interpreting raw data to extract insights, patterns, or knowledge that can support decision-making or provide valuable understanding about a particular subject.
06
Data
Data refers to raw facts, observations, or measurements that are collected and recorded. It represents the unprocessed and unorganized information that serves as the foundation for analysis and insights in data-driven decision-making.
07
Raw data
Raw data refers to unprocessed and unstructured information that is collected directly from its source. It is the initial form of data before any modifications, cleaning, or organisation takes place, and it often requires further processing to be useful for analysis or decision-making.
08
A flat file
A flat file refers to a type of data storage structure where data is stored in a single table or file, with a simple and straightforward format. It typically consists of rows and columns, where each row represents a record or entry, and each column represents a specific attribute or field of the data.
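A CSV file is the most common flat file. The sketch below uses Python's standard `csv` module with an in-memory buffer standing in for a file on disk; the column names are illustrative:

```python
import csv
import io

# Write a small flat file: each row is a record, each column an attribute.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "name", "price"])
writer.writeheader()
writer.writerows([
    {"id": "1", "name": "Widget", "price": "9.99"},
    {"id": "2", "name": "Gadget", "price": "4.50"},
])

# Read it back: just rows and columns, with no relationships or indexes.
buffer.seek(0)
records = list(csv.DictReader(buffer))
print(records[0]["name"])  # Widget
```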
09
Data quality
Data quality refers to the reliability, accuracy, completeness, and consistency of data. It assesses the overall fitness or suitability of data for its intended purpose, ensuring that the data is reliable, free from errors, and aligned with the desired standards or requirements.
10
Insights
Insights refer to valuable and meaningful interpretations or discoveries derived from the analysis of data. They provide deeper understanding, patterns, correlations, or trends within the data that can inform decision-making, drive improvements, or uncover new opportunities.
11
Qualitative data
Qualitative data refers to non-numerical information that is descriptive in nature. It captures subjective observations, opinions, or characteristics that cannot be precisely measured or quantified, often obtained through methods such as interviews, surveys, or observations.
12
Quantitative data
Quantitative data refers to numerical information that can be quantified and measured. It involves data that can be expressed in terms of quantities, counts, or numerical values, allowing for mathematical calculations, statistical analysis, and objective comparisons.
13
Metric
A metric refers to a quantifiable measurement or indicator used to assess, track, or analyze a specific aspect of data. It provides a standardized way to evaluate performance, progress, or characteristics of a process, system, or entity, often enabling comparisons and the identification of trends or patterns.
14
Dashboard
A dashboard refers to a visual display of key information, metrics, and performance indicators in a consolidated and easily digestible format. It provides a real-time or summary view of data, allowing users to monitor and analyse trends, patterns, or insights at a glance for effective decision-making.
15
Visualisation
Visualisation refers to the representation of data and information in a visual format, such as charts, graphs, or diagrams. It aims to present complex or large datasets in a way that is easily understandable, enabling users to quickly grasp patterns, trends, or relationships within the data for better insights and communication.
16
Data structure
A data structure refers to the way data is organized, stored, and accessed within a system or database. It defines the format, relationships, and hierarchy of the data, facilitating efficient storage, retrieval, and manipulation of information for analysis and processing purposes.
17
Data integrity
Data integrity refers to the accuracy, consistency, and reliability of data throughout its lifecycle. It ensures that data is complete, valid, and free from errors or inconsistencies, maintaining its quality and trustworthiness for effective analysis and decision-making.
18
Metadata
Metadata refers to the information that describes and provides context about the data. It includes details such as the data source, format, structure, meaning of variables, and other properties that help users understand and interpret the data accurately.
19
Data modeling
Data modeling refers to the process of creating a conceptual or logical representation of data and its relationships within a specific domain or context. It involves defining the structure, constraints, and rules that govern how data is organized, enabling effective analysis, visualization, and decision-making.
20
Data pipeline
A data pipeline refers to a series of steps or processes that extract, transform, and load (ETL) data from various sources into a target system or database for analysis. It involves the flow of data through different stages, such as data ingestion, cleansing, integration, and transformation, to ensure that the data is prepared and ready for analysis.
21
ETL - Extract, Transform, Load
ETL refers to the process of extracting data from various sources, transforming it into a consistent and usable format, and loading it into a target destination, such as a data warehouse or database, for analysis and reporting purposes.
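A minimal ETL pass can be sketched in plain Python, with in-memory structures standing in for real source and target systems (the records below are invented for illustration):

```python
# Extract: pull raw records from the source system.
source = ['  Alice,34 ', 'Bob,29', '  Carol,41']
raw = [line.strip() for line in source]

# Transform: parse each record into a consistent, typed structure.
transformed = []
for line in raw:
    name, age = line.split(",")
    transformed.append({"name": name.strip(), "age": int(age)})

# Load: write the prepared records into the target store
# (here, a list acting as a warehouse table).
warehouse = []
warehouse.extend(transformed)
print(len(warehouse))  # 3
```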
22
Data warehouse
In data analytics, a data warehouse is a centralised repository that stores structured, historical, and integrated data from various sources. It is designed to support efficient querying, analysis, and reporting, providing a consolidated view of data for decision-making and business intelligence purposes.
23
Relational Database
A relational database is a type of database that organizes data into tables with predefined relationships between them. It uses a structured approach where data is stored in rows and columns, allowing for efficient storage, retrieval, and manipulation of data using the structured query language (SQL) for analysis and management purposes.
24
SQL
SQL (Structured Query Language) is a programming language used for managing and manipulating relational databases. It provides a standardized way to interact with databases, allowing users to retrieve, insert, update, and delete data, as well as perform various operations and transformations on the data for analysis and reporting purposes.
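A typical query filters, aggregates, and sorts data. The sketch below uses SQLite, which ships with Python, so it needs no external database; the table and values are illustrative:

```python
import sqlite3

# An in-memory database standing in for a real relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 120.0), ("South", 80.0), ("North", 50.0)],
)

# Standard SQL: aggregate sales per region, sorted by region name.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('North', 170.0), ('South', 80.0)]
```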
25
Database
A database refers to a structured collection of data that is organized, stored, and managed in a systematic manner. It provides a centralized and efficient way to store and retrieve data, enabling users to efficiently manage, access, and analyze information for various purposes such as decision-making, reporting, and data analysis.
26
Data cleansing
Data cleansing refers to the process of identifying and correcting or removing errors, inconsistencies, and inaccuracies in the data. It involves techniques such as removing duplicates, handling missing values, standardizing formats, and resolving discrepancies to ensure the data is accurate, reliable, and suitable for analysis.
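The steps above can be sketched over a few hypothetical records: drop duplicates, fill a missing value, and standardise the name format.

```python
raw = [
    {"id": 1, "name": " alice ", "age": 34},
    {"id": 1, "name": " alice ", "age": 34},   # duplicate record
    {"id": 2, "name": "BOB", "age": None},     # missing age
]

seen, clean = set(), []
for rec in raw:
    if rec["id"] in seen:
        continue                                   # remove duplicates by id
    seen.add(rec["id"])
    clean.append({
        "id": rec["id"],
        "name": rec["name"].strip().title(),       # standardise the format
        "age": rec["age"] if rec["age"] is not None else 0,  # handle missing
    })
print(clean)
```

Filling a missing age with 0 is just one policy; dropping the record or imputing an average are equally common choices.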
27
Data profiling
Data profiling refers to the process of examining and analysing the characteristics, structure, and quality of data. It involves assessing data patterns, distributions, completeness, and uniqueness to gain insights into the data's overall quality, understand its content, and identify potential issues or anomalies that may impact data analysis or decision-making.
28
Schema
A schema refers to the logical structure or blueprint that defines the organisation, relationships, and constraints of a database. It specifies the tables, columns, data types, and rules that govern how data is stored, ensuring consistency and facilitating efficient data management and analysis.
29
Data catalog
A data catalog refers to a centralised repository or system that provides a comprehensive inventory and description of available data assets within an organisation. It serves as a searchable catalog, containing metadata, data lineage, and information about data sources, helping users discover, understand, and access relevant data for analysis and decision-making.
30
Data dictionary
A data dictionary is a repository or document that provides detailed descriptions and definitions of the data elements within a database or dataset. It includes information about the data's meaning, format, data types, relationships, and constraints, serving as a reference guide for understanding and interpreting the data accurately.
31
Master data
Master data refers to the core and essential data entities that are critical for the operations and decision-making of an organization. It represents the key reference data elements such as customers, products, suppliers, or employees, which are standardized, consistent, and shared across different systems or departments to ensure data integrity and accuracy.
32
Transactional data
Transactional data refers to the detailed records of individual business transactions or events that occur within an organization. It includes information such as dates, times, quantities, prices, and specific actions taken, capturing the operational activities and interactions that drive the day-to-day operations and financial transactions of a business.
33
Transforming data
Transformed data is data that has been converted, calculated, aggregated, or otherwise manipulated from its raw form into a desired structure or layout to prepare it for analysis and use.
34
Training data
Training data refers to a set of labeled or annotated data used to train machine learning models. It consists of input data along with the corresponding desired output or target values, enabling the model to learn patterns, relationships, and make predictions based on the provided examples.
35
Data Analytics Lifecycle
The data analytics lifecycle refers to the series of stages or steps involved in the process of extracting insights and value from data. It typically includes steps such as data collection, data preparation, data analysis, data visualization, and data-driven decision-making, forming a cyclical process where insights and findings inform the next iteration of data analysis and decision-making.
36
Descriptive analytics
Descriptive analytics refers to the analysis and interpretation of historical data to understand past events, trends, and patterns. It focuses on summarizing and presenting data in a way that provides insights into what has happened, allowing for a better understanding of the current state and informing decision-making based on past observations.
37
Diagnostic analytics
Diagnostic analytics refers to the process of examining data to determine the causes and reasons behind past events or outcomes. It involves digging deeper into the data to identify patterns, correlations, and relationships in order to gain insights into why certain events or behaviors occurred, helping to uncover the root causes and factors that contributed to specific outcomes.
38
Predictive analytics
Predictive analytics refers to the use of historical data, statistical modeling, and machine learning algorithms to make predictions or forecasts about future events or outcomes. It involves analyzing past patterns and trends to identify patterns and relationships that can be used to predict future behavior or make informed projections, enabling organizations to anticipate trends, make proactive decisions, and optimize business processes.
39
Prescriptive analytics
Prescriptive analytics refers to the use of data, algorithms, and optimization techniques to provide recommendations and actions for decision-making. It goes beyond predicting future outcomes by suggesting the best course of action to achieve desired outcomes, helping organizations make informed decisions and optimize their strategies based on data-driven insights.
40
Test data
Test data refers to a subset of data used specifically for testing and validating the functionality, accuracy, and performance of a system or analytical model. It is carefully selected or generated to cover various scenarios, edge cases, and potential issues, allowing analysts or developers to assess and verify the effectiveness and reliability of their data analytics processes or models.
41
Outlier
An outlier refers to a data point or observation that significantly deviates from the usual or expected pattern of the dataset. It is an extreme value that is notably different from the majority of the data points and may indicate a potential anomaly, error, or important information that requires further investigation or consideration.
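One common way to flag outliers is the 1.5 × IQR rule, a standard heuristic rather than the only definition; the data below is invented for illustration:

```python
import statistics

data = [10, 11, 12, 12, 12, 13, 98]

# Interquartile range: the spread of the middle 50% of the data.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside the fences deviates notably from the bulk of the data.
outliers = [x for x in data if x < low or x > high]
print(outliers)  # [98]
```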
42
Frequency distribution
A frequency distribution refers to a summary of data that displays the count or frequency of each distinct value or range within a dataset. It provides a tabular or graphical representation that helps visualise the distribution of data and identify the most common or frequent values, enabling analysts to understand the patterns and characteristics of the data.
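Python's `collections.Counter` produces exactly this kind of summary; the survey responses below are illustrative:

```python
from collections import Counter

# Count how often each distinct value occurs.
responses = ["yes", "no", "yes", "yes", "maybe", "no", "yes"]
freq = Counter(responses)

# Display values from most to least frequent.
for value, count in freq.most_common():
    print(f"{value}: {count}")
# yes: 4
# no: 2
# maybe: 1
```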
43
Attributes
Attributes refer to the individual characteristics or properties of a dataset or data objects. They represent the different variables or fields that describe and provide information about the data, such as names, dates, quantities, or categories, allowing for the organization, analysis, and interpretation of the data based on its specific attributes.
44
Demographic
Demographic refers to a specific characteristic or trait of a population or group of people. It typically includes information such as age, gender, location, income, education level, and other relevant factors that help categorize and understand the composition and characteristics of a target audience or population segment.
45
Segmentation
Segmentation refers to the process of dividing a larger population or dataset into smaller, homogeneous subgroups or segments based on certain criteria or characteristics. It involves identifying and grouping individuals or entities that share similar attributes or behaviors, allowing for more targeted analysis, personalised marketing strategies, and better understanding of different customer or user segments.
46
Null value
A null value refers to the absence or lack of a value in a particular data field or attribute. It represents a missing or unknown value, indicating that the data is unavailable or has not been recorded for a specific observation or record.
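Python represents a null value as `None`; handling it explicitly avoids errors and silent mistakes in calculations (the values below are illustrative):

```python
ages = [34, None, 29, None, 41]

# Exclude nulls before calculating, otherwise sum() raises a TypeError.
known = [a for a in ages if a is not None]
average = sum(known) / len(known)

# It is often worth tracking how much data is missing.
missing = ages.count(None)
print(average, missing)  # roughly 34.67, with 2 missing records
```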
47
Record
A record refers to a single entry or instance of data that contains information about a specific entity or observation. It is a collection of related data fields or attributes that are grouped together to represent a distinct data point or unit within a dataset or database.
48
Infographic
An infographic refers to a visual representation or graphic that presents complex data, information, or statistics in a clear, concise, and visually appealing manner. It combines text, images, charts, and diagrams to convey key insights, trends, or relationships, making it easier for viewers to understand and interpret the data at a glance.
49
Logistic regression
Logistic regression is a statistical modeling technique used to predict binary or categorical outcomes based on one or more independent variables. It estimates the probability of an event occurring by fitting a logistic function to the data, enabling the identification of the relationship between the predictors and the likelihood of a specific outcome.
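At the heart of logistic regression is the logistic (sigmoid) function, which squashes a linear combination of predictors into a probability between 0 and 1. The coefficients below are illustrative, not fitted to real data:

```python
import math

def predict_probability(x, intercept=-4.0, coefficient=0.1):
    """Estimate the probability of the positive outcome for predictor x."""
    z = intercept + coefficient * x      # linear predictor
    return 1 / (1 + math.exp(-z))        # logistic (sigmoid) function

p = predict_probability(60)   # z = -4.0 + 0.1 * 60 = 2.0
print(round(p, 3))            # about 0.881
```

In practice the intercept and coefficients are estimated from training data by a fitting procedure such as maximum likelihood.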
50
Statistical modeling
Statistical modeling refers to the process of using statistical techniques and mathematical models to analyse and interpret data. It involves formulating and fitting statistical models to data, allowing for the identification of patterns, relationships, and trends, and making predictions or inferences about the data population based on the observed sample.
51
Trend analysis
Trend analysis refers to the examination of data over a period of time to identify and analyse patterns or changes in the data. It involves studying the direction and magnitude of changes in the data points or variables, enabling analysts to understand the underlying trends, make predictions, and derive insights about the future behavior or outcomes.
52
Narrative statement
A narrative statement refers to a written or verbal description that provides a coherent and meaningful explanation of the insights or findings derived from data analysis. It involves translating the data-driven insights into a narrative format that can be easily understood and communicated, helping stakeholders grasp the key messages, implications, and recommendations stemming from the data analysis process.
53
Query
A query refers to a request or inquiry made to a database or data source to retrieve specific information or perform operations on the data. It typically involves using a query language or tool to specify the desired criteria, such as selecting certain columns, filtering data based on conditions, or aggregating data, in order to extract relevant data for analysis or reporting purposes.
54
Governance
Governance refers to the framework, policies, and processes established to ensure the appropriate management, control, and integrity of data within an organisation. It involves defining roles and responsibilities, establishing data quality standards, implementing data security measures, and enforcing compliance to ensure that data is handled and used in a consistent, reliable, and ethical manner.
55
CRISP-DM
(Cross-Industry Standard Process for Data Mining): CRISP-DM is a widely adopted framework that outlines a structured approach for executing data analytics projects. It consists of six phases - Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment - providing a systematic guide for data analysis from start to finish.
56
TDSP
(Team Data Science Process): Developed by Microsoft, TDSP is a comprehensive framework that emphasizes collaboration and teamwork in data analytics projects. It encompasses various stages, including Business Understanding, Data Acquisition and Understanding, Modeling, Deployment, and Customer Acceptance, and emphasizes iterative development and frequent feedback.
57
FAIR
(Findable, Accessible, Interoperable, and Reusable): FAIR is a set of principles developed to promote the usability and value of data. It provides guidance on making data findable, accessible, interoperable, and reusable, ensuring that data can be easily discovered, shared, and effectively utilized by different stakeholders and systems.
58
Report automation
Report automation refers to the process of using software tools or scripts to automatically generate reports based on analysed data. It eliminates the need for manual report creation by automating the extraction, analysis, and presentation of data, saving time and ensuring consistency in reporting.
59
Python
Python is a widely used programming language known for its simplicity and versatility. It provides a rich ecosystem of libraries and frameworks, such as Pandas, NumPy, and Scikit-learn, which facilitate data processing, analysis, visualisation, and modeling tasks, making Python a popular choice for data analytics professionals and researchers.
60
DAX language
DAX (Data Analysis Expressions) is a formula language used in various analytics and business intelligence tools, such as Power BI and Excel Power Pivot. It allows users to create custom calculations, measures, and aggregations to manipulate and analyze data, enabling complex calculations and data transformations to be performed within these tools.
61
Power BI
Power BI is a business intelligence tool developed by Microsoft that enables users to visualise and analyse data from various sources. It provides a user-friendly interface for creating interactive dashboards, reports, and data visualisations, allowing users to gain insights and make data-driven decisions based on their data.
62
Tableau
Tableau is a popular data visualisation and business intelligence tool used for analysing and presenting data visually. It provides a user-friendly interface and a wide range of interactive visualisations, allowing users to explore data, discover patterns, and communicate insights effectively, making it a powerful tool for data analysis and storytelling.
63
Storytelling
Storytelling refers to the practice of using data, visualisations, and narratives to effectively communicate insights and findings to an audience. It involves presenting data in a compelling and understandable manner, connecting the dots between data points, and crafting a narrative that engages and informs stakeholders, enabling them to grasp the significance and implications of the data analysis.
64
Data decision
A data decision refers to a decision-making process that is informed by data analysis and insights. It involves using data-driven information and findings to guide and support decision-making, helping to reduce uncertainty, mitigate risks, and improve the overall effectiveness and efficiency of decision-making processes within an organization.
65
Data integration
Data integration in data analytics refers to the process of combining and merging data from multiple sources into a unified and consistent format. It involves transforming and harmonising the data to ensure compatibility and coherence, enabling analysts to perform comprehensive analysis and gain holistic insights.
66
API
(Application Programming Interface): An API is a set of rules and protocols that allows different software applications to communicate and exchange data with each other. It enables data analysts and developers to access and retrieve data from various sources, such as databases or web services, for analysis and integration into their own applications or systems.
67
Data Exploration
Data exploration in data analytics refers to the process of investigating and examining data to gain a better understanding of its structure, patterns, and characteristics. It involves visualising, querying, and analysing the data to discover insights, trends, and relationships that can inform decision-making and further analysis.
68
Big data
Big data refers to extremely large and complex datasets that cannot be easily managed, processed, or analysed using traditional methods. These datasets often involve high volumes, variety, and velocity of data, requiring specialised tools and techniques to extract valuable insights and uncover patterns or trends.
69
Machine learning
Machine learning is a branch of artificial intelligence that focuses on developing algorithms and models that enable computers to learn from data and make predictions or decisions without being explicitly programmed. It involves training computer systems to recognise patterns and relationships within the data, allowing them to generate insights and automate tasks based on learned knowledge.
70
Data mart
A data mart in data analytics refers to a specialised subset of a data warehouse that is designed to support the needs of a specific department, business unit, or subject area within an organisation. It contains a focused and curated collection of data relevant to the specific requirements of the intended users, making it easier to access and analyse data for specific purposes or analysis.
71
Data lake
A data lake in data analytics is a centralised repository that stores large volumes of raw and unprocessed data in its original format. It allows for the storage of diverse data types and provides a flexible and scalable environment for data exploration, analysis, and processing, enabling users to derive insights and extract value from the data at a later stage.
72
Data wrangling
Data wrangling, also known as data preparation or data cleaning, refers to the process of cleaning, transforming, and organising raw data into a structured and usable format for analysis. It involves activities such as removing duplicates, handling missing values, standardising data formats, and resolving inconsistencies to ensure that the data is accurate, complete, and ready for further analysis.
73
Data mining
Data mining in data analytics is the process of discovering patterns, relationships, and insights from large datasets. It involves using statistical techniques, machine learning algorithms, and other analytical methods to extract valuable information and knowledge, enabling organizations to make data-driven decisions and predictions.
74
NLP
NLP (Natural Language Processing) is a branch of artificial intelligence that focuses on enabling computers to understand, interpret, and process human language. It involves techniques and algorithms that allow machines to analyse and derive meaning from text or speech data, facilitating tasks such as sentiment analysis, text classification, and information extraction from unstructured textual data.
75
Data ethics
Data ethics in data analytics refers to the moral principles and guidelines that govern the responsible and ethical use of data. It involves considerations of privacy, consent, transparency, fairness, and accountability in the collection, storage, analysis, and sharing of data, ensuring that data-driven practices are conducted in an ethically sound manner to protect individual rights and promote societal well-being.
76
Data Democratization
Data democratization in data analytics refers to the process of making data accessible and available to a wider range of users within an organization. It involves empowering individuals and teams with the necessary tools, skills, and access to data, enabling them to explore, analyze, and make informed decisions based on data without relying solely on data specialists or IT departments. This promotes a culture of data-driven decision-making and fosters collaboration and innovation across the organization.
77
Data silo
A data silo refers to a situation where data is stored, managed, and maintained in isolated systems or departments within an organization, without proper integration or sharing. This results in data being fragmented and inaccessible to other parts of the organization, hindering collaboration, data-driven insights, and efficient decision-making.
78
Data fusion
Data fusion refers to the process of combining or integrating data from multiple sources or sensors to create a more comprehensive and accurate view of the data. It involves merging and harmonizing different data sets, often using advanced algorithms and techniques, to generate a unified dataset that provides richer insights and a more complete understanding of the underlying phenomena or patterns.
79
Real time data
Real-time data in data analytics refers to data that is generated, processed, and made available for analysis immediately or with minimal delay. It reflects the most up-to-date information at any given moment and allows for timely insights, decision-making, and actions based on the current state of the data.
80
Data blending
Data blending in data analytics refers to the process of combining data from multiple sources or datasets to create a unified and enriched dataset for analysis. It involves merging and integrating data with similar or related attributes, enabling analysts to gain a comprehensive view of the data and uncover insights that may not be apparent when examining individual datasets separately.
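Blending two sources on a shared key is similar in spirit to a join. The sketch below combines hypothetical sales and region datasets on a customer id:

```python
# Two sources describing the same customers with different attributes.
sales = {101: 250.0, 102: 90.0, 103: 40.0}
regions = {101: "North", 102: "South"}

# Blend on the shared key, keeping a placeholder where a source has no match.
blended = [
    {"customer": cid, "amount": amt, "region": regions.get(cid, "Unknown")}
    for cid, amt in sorted(sales.items())
]
print(blended[0])  # {'customer': 101, 'amount': 250.0, 'region': 'North'}
```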