What are research data?

The UKRI Concordat on Open Research Data (PDF) defines research data as evidence that underpins the answer to the research question.

Research data may be quantitative information or qualitative statements that researchers collect in the course of their work through experimentation, observation, modelling, interviews or other methods, such as extracting data from existing evidence. Data may be raw or primary (for example, taken directly from measurement or collection), derived from primary data (for example, cleaned up or extracted from a larger dataset), or derived from existing sources where the rights may be held by others.

Different types of research data have different implications for how they are handled and preserved:

- Observational: data captured in real time, for example neuro-images, sample data, sensor data, survey or interview data. They are usually irreplaceable and hard or impossible to re-create.
- Experimental: data captured from laboratory equipment by the researcher or by a service they use, for example gene sequences, chromatograms, chemical toroid magnetic field data. The data are often reproducible, but reproduction could be costly.
- Simulation: data generated from test models, for example climate, mathematical or economic models. The datasets are usually very large, but the model code alone might be sufficient to reproduce the results.
- Derived or compiled: for example text and data mining, 3D models, compiled databases. The data are reproducible, but reproduction could be costly.
- Reference or secondary: a (static or organic) conglomeration or collection of smaller (peer-reviewed) datasets, most probably published and curated elsewhere for re-use, for example gene sequence databases, chemical structures or spatial data portals.

Research data can be seen as the collection of digital objects acquired and generated during the process of research. They take many forms, go by many names, and may include:
- Spreadsheets
- Notes (paper-based and digital)
  - Archive notes
  - Laboratory notes
  - Field notes
- Diaries
- Critical apparatus
- Stemmata
- Standard operating procedures, protocols and workflows
- Code
- Text with markup
- Questionnaires, surveys
- Transcripts
- Focus group documentation
- Interview coding
- Meeting minutes
- Audio recordings
- Videos
- Images
- Films
- Test responses
- Specimens
- Samples
- Artefacts
- Finding aids for archives/fonds
- Text corpus/corpora for linguistic analysis
- Thematic research collections
- Models
- Algorithms
- Scripts
- Digitised books, paintings, or other works of art
- Contents of an application
  - Input
  - Output
  - Log files for analysis software
  - Simulation software
  - Schemas

The term ‘dataset’ is used throughout this guide to mean a logically complete set of data with common or related elements. A dataset is sometimes called a ‘data product’ or ‘data package’.

A dataset consists of a group of data files together with the documentation files that explain their production or use (codebook, technical or methodology report, data dictionary, etc.) and information about the structure of the data itself. A good-quality data collection can be further enhanced by including contextual information, such as information about the study, observation or investigation.
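As a rough illustration of this composition, the sketch below models a dataset as a titled list of file names and checks whether it contains both data files and documentation files. It is a minimal example under stated assumptions, not a standard: the `Dataset` class, the `DATA_SUFFIXES` and `DOC_STEMS` groupings, and the file names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from pathlib import PurePath

# Hypothetical groupings, for illustration only. Real projects will use
# their own file formats and naming conventions.
DATA_SUFFIXES = {".csv", ".tsv", ".json", ".xlsx"}
DOC_STEMS = {"readme", "codebook", "data_dictionary", "methodology"}


@dataclass
class Dataset:
    """A logically complete set of files: data plus its documentation."""

    title: str
    files: list[str] = field(default_factory=list)

    def documentation_files(self) -> list[str]:
        # A file counts as documentation if its name (without extension)
        # matches one of the assumed documentation names.
        return [f for f in self.files if PurePath(f).stem.lower() in DOC_STEMS]

    def data_files(self) -> list[str]:
        # A file counts as data if it has an assumed data extension
        # and is not already classed as documentation.
        docs = set(self.documentation_files())
        return [
            f
            for f in self.files
            if f not in docs and PurePath(f).suffix.lower() in DATA_SUFFIXES
        ]

    def is_documented(self) -> bool:
        # True only when data and documentation are both present.
        return bool(self.data_files()) and bool(self.documentation_files())


survey = Dataset(
    title="Example survey",
    files=["responses.csv", "codebook.pdf", "README.txt"],
)
print(survey.is_documented())  # True
```

A check of this kind only confirms that documentation files are present. It says nothing about whether the documentation is adequate, which still needs human review.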

Types of datasets may include:
- a spreadsheet of numerical or encoded data,
- a collection of interview transcripts, field notes, audio recordings, readings or photographs resulting from a research project,
- a database containing survey data, numeric data files, input data and script or code used to model scenarios,
- a database of population or economic data,
- sets of images analysed during a research project,
- a collection of sequence and structure data.
