FAIR data management
Data management is the process of collecting, validating, organizing, protecting, processing, and maintaining scientific data to ensure the accessibility, reliability, and quality of the data and metadata for its users. FAIR data management ensures that the data and metadata captured are findable, accessible, interoperable and reusable (FAIR) throughout the data lifecycle.
Systematically organized, well annotated data and associated documentation lets researchers and collaborators use data consistently and accurately. Carefully storing and documenting data also allows more people to use the data in the future, potentially leading to more discoveries beyond the initial research.
Data governance aims to ensure consistently high data quality by establishing processes to ensure effective data management throughout the data lifecycle. Focus areas include data availability, usability, consistency, integrity, security and ethics. The person responsible for ensuring that data governance processes are followed and that guidelines are enforced is commonly called a data steward.
A data architecture describes how data is managed - from collection to transformation, distribution and consumption - and how it flows through data storage systems. A data architecture lists tools and infrastructures that will be involved in data management across the data lifecycle. Some elements of a data architecture must be defined early, during the design phase of the data lifecycle, such as the administrative structure to manage the data and the methods that will be used to store data.
Data modeling and design
A data model is a conceptual representation of the data, the relationships between data, and the rules that connect them. It can be used to make datasets compatible, enable conversion between different formats, and make data easy to integrate and share.
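As a minimal sketch of such a model, the Python below describes two entities, the relationship between them, and one rule. The entity names (Sample, Measurement), their fields, and the rule itself are assumptions chosen for illustration, not part of any particular standard.

```python
from dataclasses import dataclass, field


@dataclass
class Measurement:
    """One observation. Rule: a value must always carry a unit."""
    quantity: str
    value: float
    unit: str

    def __post_init__(self):
        if not self.unit:
            raise ValueError("a measurement must declare its unit")


@dataclass
class Sample:
    """A sample owns zero or more measurements (one-to-many relationship)."""
    sample_id: str
    measurements: list[Measurement] = field(default_factory=list)

    def to_record(self) -> dict:
        # A plain dictionary is easy to convert to JSON, CSV rows, and so on.
        return {
            "sample_id": self.sample_id,
            "measurements": [vars(m) for m in self.measurements],
        }


sample = Sample("S-001", [Measurement("mass", 12.5, "g")])
record = sample.to_record()
```

Because the model is explicit about fields and rules, any dataset that can be expressed as Sample records can be exported to a shared format through a single method.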
Database and storage management
- A database is an organized collection of stored data that can be managed and accessed electronically via a database management system (DBMS), which enables users to store, retrieve, query and manage the data. Small databases can often be stored locally on a single computer, while larger databases need to be hosted on computer clusters or cloud storage.
- Database design includes data modeling, efficient data representation and storage, query languages for data search, and the security and privacy of sensitive data. It also includes computing issues such as supporting concurrent access.
- Data storage management aims to ensure that data is stored as optimally as possible given the available (financial and computational) resources and the needs for applications to access and process the data.
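For a small, locally stored database, the store, retrieve and query operations described above can be sketched with SQLite through Python's built-in sqlite3 module. The table name, columns and values are hypothetical.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained; a file path
# would be used instead for data that must persist.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE samples ("
    "sample_id TEXT PRIMARY KEY, species TEXT NOT NULL, mass_g REAL)"
)

# Store: parameter placeholders (?) keep values separate from the SQL text.
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [("S-001", "mouse", 21.4), ("S-002", "rat", 310.0), ("S-003", "mouse", 19.8)],
)
conn.commit()

# Query: retrieve only the rows that match a condition.
rows = conn.execute(
    "SELECT sample_id, mass_g FROM samples WHERE species = ? ORDER BY sample_id",
    ("mouse",),
).fetchall()
conn.close()
```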
Data security refers to protecting data from destructive forces and from the unwanted actions of unauthorized users.
Technologies employed to ensure data security include:
- Disk encryption protects data on a hard disk drive from unauthorized access. It can be hardware-based or software-based. Software-based solutions encrypt the data to protect it from theft, but remain vulnerable to malicious programs or attackers who could corrupt the data and make it unrecoverable. Hardware-based solutions can prevent read and write access to the data, which provides strong protection against tampering and unauthorized access.
- Backups ensure that lost or corrupted data can be recovered from another source. It is essential to keep a backup of all research data.
- Data masking can obscure specific data in a database table or cell, to ensure that data security is maintained and sensitive information is not exposed to unauthorized personnel.
- Data erasure is a software-based method of overwriting that completely wipes all electronic data residing on a hard drive or other digital media, to ensure that no sensitive data is left when an asset is retired or reused.
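Two of the techniques listed above, backups and data masking, can be sketched with Python's standard library alone. The function names, file names and record fields below are hypothetical, and the masking rule (keep only the last two characters) is just one possible choice.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def backup_and_verify(source: Path, backup_dir: Path) -> Path:
    """Copy a file and confirm the copy is identical by comparing checksums."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"backup of {source} does not match the original")
    return target


def mask_value(value: str, visible: int = 2, mask_char: str = "*") -> str:
    """Hide all but the last `visible` characters of a value."""
    if len(value) <= visible:
        return mask_char * len(value)
    return mask_char * (len(value) - visible) + value[-visible:]


def mask_record(record: dict, sensitive_fields: set) -> dict:
    """Return a copy of the record with the sensitive fields masked."""
    return {
        key: mask_value(str(val)) if key in sensitive_fields else val
        for key, val in record.items()
    }


# Backup demo inside a temporary directory, so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "results.csv"
    source.write_text("sample_id,mass_g\nS-001,21.4\n")
    copy = backup_and_verify(source, Path(tmp) / "backup")
    backup_ok = copy.read_text() == source.read_text()

# Masking demo on a made-up record.
patient = {"patient_id": "P-10442", "name": "Jane Doe", "age": 47}
masked = mask_record(patient, {"patient_id", "name"})
```

In practice a backup would be written to a different device or location than the original; the temporary directory here only keeps the sketch runnable.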
Reference data is used to classify or categorize other data. Typical examples are units of measurement or fixed conversion rates, which are static or change slowly over time. Reference data sets can also be referred to as "controlled vocabularies" or "lookup" data.
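A lookup table of conversion factors is one simple form of reference data. The sketch below assumes a hypothetical vocabulary of three mass units and rejects any unit outside it.

```python
# Hypothetical reference data: factors that convert a mass into grams.
MASS_TO_GRAMS = {"mg": 0.001, "g": 1.0, "kg": 1000.0}


def to_grams(value: float, unit: str) -> float:
    """Convert a mass to grams using the reference table."""
    try:
        factor = MASS_TO_GRAMS[unit]
    except KeyError:
        # Units outside the controlled vocabulary are rejected, not guessed.
        raise ValueError(f"unit {unit!r} is not in the controlled vocabulary") from None
    return value * factor


readings = [(2.0, "kg"), (500.0, "mg")]
in_grams = [to_grams(value, unit) for value, unit in readings]
```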
Data integration refers to the process of combining data from different sources into a unified view. Data integration is becoming more common as the volume of existing data and the need to share it increase. Data integration provides users with consistent access to and delivery of data across the spectrum of subjects and structure types.
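As a minimal sketch, assume two hypothetical sources that describe the same samples under different field names. Integration maps both onto one schema and keeps a gap (None) wherever a source has nothing to contribute.

```python
# Two made-up sources with different schemas for the same kind of entity.
lab_records = [
    {"sample_id": "S-001", "mass_g": 21.4},
    {"sample_id": "S-002", "mass_g": 30.1},
]
field_records = [
    {"id": "S-001", "site": "North"},
    {"id": "S-003", "site": "South"},
]


def integrate(lab, field):
    """Merge both sources into one list of records keyed by sample ID."""
    unified = {}

    def entry(sample_id):
        # Create the unified record on first sight, with empty fields.
        return unified.setdefault(
            sample_id, {"sample_id": sample_id, "mass_g": None, "site": None}
        )

    for row in lab:
        entry(row["sample_id"])["mass_g"] = row["mass_g"]
    for row in field:
        entry(row["id"])["site"] = row["site"]
    return [unified[key] for key in sorted(unified)]


unified_view = integrate(lab_records, field_records)
```

Keeping unmatched records, rather than dropping them, makes it visible which samples are incomplete in which source.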
Documentation is critical. It is recommended to use a document management system to receive, track, manage and store documents, such as an Electronic Lab Notebook (ELN) - a software tool which lets you enter protocols, observations, notes, and other data using your computer or mobile device. ELNs facilitate good data management practices, data security and collaboration.
Data warehousing and analytics
A data warehouse is a central repository of integrated data from one or more disparate sources, storing all available data in one single place where it can be used to generate reports.
Two main approaches are used to build a data warehouse system:
i. Extract, transform, load (ETL) uses staging, data integration, and access layers to house its key functions. The staging layer, or staging database, stores raw data extracted from each of the disparate source data systems. The integration layer integrates the disparate data sets by transforming the data from the staging layer, often storing the transformed data in an operational data store (ODS) database. The integrated data is then moved to yet another database, often called the data warehouse database, where it is arranged into hierarchical groups, often called dimensions, and into facts and aggregate facts.
ii. Extract, load, transform (ELT) removes the need for a separate ETL tool for data transformation. Instead, it maintains a staging area inside the data warehouse itself. In this approach, data is extracted from heterogeneous source systems and loaded directly into the data warehouse before any transformation occurs. All necessary transformations are then handled inside the data warehouse. Finally, the transformed data is loaded into target tables in the same data warehouse.
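The difference between the two approaches can be sketched with an in-memory SQLite database standing in for the warehouse. The table names (staging_mass, fact_mass), the raw rows and the unit conversion are assumptions made for illustration; a real warehouse would involve far more layers than this.

```python
import sqlite3

# Hypothetical raw rows as they arrive from a source system (all text).
raw_rows = [("S-001", " 21.5 ", "g"), ("S-002", "0.5", "kg")]


def etl(conn, rows):
    """ETL: transform outside the warehouse, then load only the clean rows."""
    factors = {"g": 1.0, "kg": 1000.0}
    clean = [(sid, float(value) * factors[unit]) for sid, value, unit in rows]
    conn.execute("CREATE TABLE fact_mass (sample_id TEXT, mass_g REAL)")
    conn.executemany("INSERT INTO fact_mass VALUES (?, ?)", clean)


def elt(conn, rows):
    """ELT: load raw rows into a staging table, then transform with SQL."""
    conn.execute("CREATE TABLE staging_mass (sample_id TEXT, value TEXT, unit TEXT)")
    conn.executemany("INSERT INTO staging_mass VALUES (?, ?, ?)", rows)
    conn.execute("CREATE TABLE fact_mass (sample_id TEXT, mass_g REAL)")
    # The transformation runs inside the warehouse itself.
    conn.execute(
        """
        INSERT INTO fact_mass
        SELECT sample_id,
               CAST(TRIM(value) AS REAL)
                   * CASE unit WHEN 'kg' THEN 1000.0 ELSE 1.0 END
        FROM staging_mass
        """
    )


def load_and_read(pipeline):
    """Run one pipeline against a fresh warehouse and read the target table."""
    conn = sqlite3.connect(":memory:")
    pipeline(conn, raw_rows)
    result = conn.execute(
        "SELECT sample_id, mass_g FROM fact_mass ORDER BY sample_id"
    ).fetchall()
    conn.close()
    return result


etl_result = load_and_read(etl)
elt_result = load_and_read(elt)
```

Both pipelines fill the same target table. What differs is where the transformation runs, and in the ELT case the raw rows remain available in the staging table for later reprocessing.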
Metadata is data that describes other data. It summarizes basic information about data, making it easier to find and work with particular instances of data. To be as useful as possible, metadata needs to be standardized so it can be used and understood by many, and especially by machines. It can be divided into four categories:
- Descriptive metadata is needed for discovering and identifying assets. It consists of information that describes the asset, such as the asset’s title, author, and relevant keywords.
- Structural metadata describes relationships among various parts of a resource - it shows how a digital asset is organized, like how pages in a book are organized to form chapters. It also indicates whether a particular asset is part of a single collection or multiple collections, and facilitates the navigation and presentation of information in an electronic resource. It is the key to documenting the relationship between two assets.
- Administrative metadata is related to the technical source of a digital asset, such as the file type, and when and how the asset was created. It also relates to usage rights and intellectual property, giving the owner of the asset, the license or conditions of use, and the allowed duration of use. It can be subdivided into technical metadata, preservation metadata, and rights metadata.
- Provenance metadata, which is not always present, can be used to describe a digital file or resource’s history - including what was done to the file, when and where, who did it, what they did and which tool(s) they used, and why.
Metadata can be stored and managed in a database, often called a metadata registry or metadata repository.
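A metadata record organized along the four categories above might look like the sketch below. Every field name and value is a made-up example rather than a published metadata standard, and the required-field check is only one simple way to enforce a minimum level of standardization.

```python
import json

# Hypothetical metadata record for a single dataset.
metadata = {
    "descriptive": {
        "title": "Mouse body mass survey",
        "author": "Example Lab",
        "keywords": ["mouse", "body mass"],
    },
    "structural": {
        "part_of": ["example-collection"],
        "files": ["results.csv", "protocol.txt"],
    },
    "administrative": {
        "technical": {"file_type": "text/csv", "created": "2024-01-15"},
        "preservation": {"checksum_algorithm": "sha256"},
        "rights": {"license": "CC-BY-4.0"},
    },
    "provenance": [
        {"action": "unit conversion", "agent": "analyst", "tool": "python"},
    ],
}

# Assumed minimum: fields that every record must provide.
REQUIRED = {"descriptive": {"title", "author"}, "administrative": {"rights"}}


def missing_fields(record):
    """List required fields that a record lacks, as 'category.field'."""
    missing = []
    for category, fields in REQUIRED.items():
        present = record.get(category, {})
        missing.extend(f"{category}.{name}" for name in fields if name not in present)
    return sorted(missing)


# JSON is one machine-readable form in which the record could be stored.
serialized = json.dumps(metadata, indent=2, sort_keys=True)
```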
Common data elements
Common data elements (CDEs) are pieces of data common to multiple datasets across different studies. The use of CDEs helps improve accuracy, consistency, and interoperability among datasets generated by clinical neuroscience research. Common data elements have been developed in response to the healthcare industry's need for clinical data content standards that can be used both for patient care in clinical settings and for secondary data uses. These may include disease surveillance, population and public health, quality improvement, clinical research, and reimbursement.
- High quality data is data "fit for [its] intended uses in operations, decision making and planning" that correctly represents the real-world construct it refers to. When the number of data sources increases, internal data consistency also becomes an important quality factor. Data quality depends on its accessibility, correctness, comparability, completeness, coherence, reliability, flexibility, plausibility, relevance, timeliness or latency, uniqueness and validity.
- Data quality assurance is the process of profiling data to discover inconsistencies and other anomalies, and of performing data cleansing activities, such as removing outliers or interpolating missing data, to improve the quality of the data. It can be a part of data warehousing or a part of the database administration of an existing piece of application software.
- Data quality control is the process of controlling the usage of data for an application or a process. This process is performed both before and after a data quality assurance (QA) process, which consists of discovering data inconsistencies and correcting them.
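The profiling and cleansing steps of data quality assurance can be sketched as follows. The plausible range, the -999.0 placeholder value and the choice of linear interpolation are assumptions; the appropriate rules depend on the data at hand.

```python
def profile(values, low, high):
    """Profiling: report positions of missing and out-of-range values."""
    return {
        "missing": [i for i, v in enumerate(values) if v is None],
        "out_of_range": [
            i for i, v in enumerate(values) if v is not None and not low <= v <= high
        ],
    }


def interpolate_missing(values):
    """Cleansing: fill interior gaps linearly between the nearest known values.

    Gaps at the very start or end have no neighbour on one side and stay None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


# Made-up readings; -999.0 stands for an implausible placeholder value.
readings = [10.0, None, 14.0, -999.0, None, 20.0, None]

report = profile(readings, low=0.0, high=100.0)

# Treat out-of-range values as missing, then interpolate the gaps.
flagged = set(report["out_of_range"])
cleaned = interpolate_missing(
    [None if i in flagged else v for i, v in enumerate(readings)]
)
```

Running the profile again after cleansing is a simple quality control check that the corrections had the intended effect.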