Documenting microdata
Defining microdata
When surveys or censuses are conducted, or when administrative data are recorded, information is collected on each unit of observation. The unit of observation can be a person, a household, a firm, an agricultural holding, a facility, or another entity. Microdata are the data files resulting from these data collection activities. They contain unit-level information (as opposed to aggregated data in the form of counts, means, or other summary statistics). Each row in a microdata file is referred to as an observation. Information on each unit of observation is stored in variables, which can be of different types (e.g., numeric or character variables, with discrete or continuous values). These variables may contain data reported by the respondent (e.g., the marital status of a person), obtained by observation or measurement (e.g., the GPS location of a dwelling or another sensor reading), or generated by calculation, recoding, or derivation (e.g., the sample weight in a survey).
For efficiency reasons, categorical variables are usually stored in numeric format (i.e., as coded values). For example, the sex of a respondent may be stored in a variable named ‘Q_01’ that takes the values 1, 2, and 9, where 1 represents “Male”, 2 represents “Female”, and 9 represents “Unreported”. Microdata must therefore be provided with a data dictionary (i.e., structural metadata) containing the variable and value labels and, for derived variables (if any), some information on the derivation process.
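To make the role of the data dictionary concrete, the following minimal Python sketch decodes the coded values of the ‘Q_01’ example above. The dictionary layout, the `hh_id` identifier, the variable label, and the `decode` helper are illustrative assumptions, not a format prescribed by any standard or tool.

```python
# Hypothetical data dictionary: one entry per coded variable.
# The structure and key names are illustrative assumptions.
DATA_DICTIONARY = {
    "Q_01": {
        "label": "Sex of respondent",
        "categories": {1: "Male", 2: "Female", 9: "Unreported"},
    },
}

# Three observations (rows) as they might be stored in a microdata file.
# "hh_id" is a made-up identifier with no entry in the dictionary.
observations = [
    {"hh_id": 101, "Q_01": 1},
    {"hh_id": 102, "Q_01": 2},
    {"hh_id": 103, "Q_01": 9},
]


def decode(row, dictionary):
    """Return a copy of `row` with coded values replaced by their labels."""
    decoded = {}
    for name, value in row.items():
        categories = dictionary.get(name, {}).get("categories")
        # Variables without categories, and codes without a label,
        # are passed through unchanged.
        decoded[name] = categories.get(value, value) if categories else value
    return decoded


decoded_rows = [decode(row, DATA_DICTIONARY) for row in observations]
```

Without the dictionary, the values 1, 2, and 9 cannot be interpreted, which is why structural metadata must accompany the data file.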
Many other features of a micro-dataset (descriptive and reference metadata) should also be described, such as the objectives and methodology of the data collection, the sampling design (for sample surveys), the period of data collection, the identity of the primary investigator and other contributors, and the scope and geographic coverage of the data. This information is essential to make the microdata usable and discoverable.
Metadata standard: the DDI Codebook
The Data Documentation Initiative (DDI) metadata standard provides a structured and comprehensive list of metadata elements and attributes for the documentation of microdata. The DDI originated at the Inter-university Consortium for Political and Social Research (ICPSR), a membership-based organization with more than 500 member colleges and universities worldwide. The DDI is now the project of an alliance of North American and European institutions. Member institutions include many of the largest data producers and data archives in the world. The DDI standard is published under the terms of the [GNU General Public License](http://www.gnu.org/licenses) (version 3 or later).
The DDI standard is used by a large community of data archivists, including data libraries in academia and research centers, national statistical agencies and other official data-producing agencies, and international organizations. The DDI standard has two branches: the DDI-Codebook (version 2.x) and the DDI LifeCycle (version 3.x). These two branches serve different purposes and audiences.
The Metadata Editor implements the DDI-Codebook. Internally, it uses a slightly simplified version of the DDI Codebook 2.5, to which a few elements are added. The DDI Alliance publishes the DDI-Codebook as an XML schema. The Metadata Editor uses a JSON implementation of that schema, but it can export fully compliant DDI Codebook 2.5 metadata in XML format (among other options).
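The relationship between a JSON-style description of a variable and its DDI Codebook XML counterpart can be sketched as follows. This is an illustration only, not the Metadata Editor's export code: the keys of the `variable` dictionary and the `variable_to_xml` helper are hypothetical, the XML namespace is omitted, and the output is a fragment that has not been validated against the DDI Codebook 2.5 schema. The element names `var`, `labl`, `catgry`, and `catValu` are those used by the DDI Codebook for variables and their categories.

```python
import xml.etree.ElementTree as ET

# Hypothetical JSON-like description of one variable.
# The key names are assumptions made for this sketch.
variable = {
    "name": "Q_01",
    "label": "Sex of respondent",
    "categories": [
        {"value": "1", "label": "Male"},
        {"value": "2", "label": "Female"},
        {"value": "9", "label": "Unreported"},
    ],
}


def variable_to_xml(var):
    """Build a simplified DDI Codebook-style <var> element (no namespace)."""
    elem = ET.Element("var", {"name": var["name"]})
    ET.SubElement(elem, "labl").text = var["label"]
    for cat in var["categories"]:
        # Each category carries its coded value and its label.
        catgry = ET.SubElement(elem, "catgry")
        ET.SubElement(catgry, "catValu").text = cat["value"]
        ET.SubElement(catgry, "labl").text = cat["label"]
    return elem


xml_text = ET.tostring(variable_to_xml(variable), encoding="unicode")
print(xml_text)
```

The same information is carried in both representations; only the serialization differs.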
The DDI Alliance developed the DDI-Codebook for organizing the content, presentation, transfer, and preservation of metadata in the social and behavioral sciences. It enables documenting microdata files in a way that is both flexible and rigorous. The DDI-Codebook aims to provide a straightforward means of recording and communicating all the salient characteristics of a micro-dataset. It is designed to encompass the kinds of data resulting from surveys, censuses, administrative records, experiments, direct observation, and other systematic methodologies for generating empirical measurements. The unit of observation can be individual persons, households, families, business establishments, transactions, countries, or other subjects of scientific interest.
The technical description of the JSON schema used for the documentation of microdata is available at https://worldbank.github.io/metadata-schemas/#tag/Microdata.