Introduction to the GLD

What is GLD?#

The Global Labor Database (GLD) is an effort of the Jobs Group at the World Bank to harmonize household surveys with a relevant labor module. Its mission is to create an open and transparent harmonization with sufficient contextual information to allow colleagues to use, alter, and expand the harmonization.

The GLD aims to be open, meaning that as much information should be accessible to as many people as possible, considering relevant data privacy concerns. It also strives to be transparent, making all steps that create the harmonization traceable, from raw data acquisition to harmonized variable coding. The GLD provides sufficient contextual information, with documentation that allows users to fully comprehend the survey and the choices made in the harmonization, enabling them to start using it, alter it by adding variables not in the harmonization, or expand it by correcting the harmonization or adding a variable not previously coded.

What issue is GLD solving?#

Harmonization is time-consuming: it requires reading both the data files and the survey materials in detail to understand what to code and how, followed by many validation steps. The GLD does this work once, and does it well, so that the result can serve as a springboard for all users. Most harmonization efforts give users a “take it or leave it” option. The GLD’s open and transparent approach instead lets users trace the standard harmonization and deviate from it at any point, giving them a head start wherever they wish to jump in.

Moreover, the ecosystem built around the GLD harmonization lets users send feedback to the GLD team, so that the harmonization can be corrected and the documentation improved.

In turn, a good harmonization makes for better data that informs evidence-based policy lending as well as the monitoring of jobs outcomes. Its standard approach further allows tools and automated processes to be built on top of the harmonized outputs to deepen analysis. It thus supports evidence generation and monitoring and evaluation (M&E).

Who is the GLD for? Who is the intended audience?#

GLD is for any person, member of the World Bank or not, who wishes to use the harmonization for their own projects. We distinguish between two uses (a given user may need only one of them or both).

The first use is the “as-is” harmonization. This refers to the user taking the harmonized data files as prepared by the data team and using those variables (or combinations thereof) for their analysis.

The second use is the “amended” or “hacked” harmonization. This refers to the user wanting to go beyond the prepared harmonization, for example because they are interested in a specific variable that is present in the questionnaire but was not harmonized because it is not common to most surveys.

In this case, the user can still use the harmonization do-file to standardize most variables (concepts like education level or labour status are likely to remain relevant) and add further variables on top. This use entails editing the harmonization code and/or adding to it at specific points to serve the user's purpose, without the user needing to process the survey from scratch.

How are surveys selected for GLD?#

The initial funding for the GLD was provided for a flagship report catalyzed by the Jobs Group. The initial choice of countries was driven by the needs of this flagship report. Thereafter, the GLD team has established selection guidelines to try to level the GLD selection across income groups, regions, and time. The guidelines consist of five levels of selection:

1. Ensure relative evenness of surveys across income levels#

The GLD aims to be a global database. Across the low, lower middle, and upper middle income levels (LIC, LMC, and UMC), GLD should contain a roughly equal percentage of countries. High income countries are less of a focus of GLD.

Example: There are currently 29 countries classified as low income, 50 as lower middle income, and 56 as upper middle income. If GLD contained 10 countries of the first group (34%), 11 of the second group (22%), and 20 of the third group (36%), the next countries to be selected should be lower middle income countries.
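The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of the guideline, not a tool used by the GLD team: the function names and the group labels used as dictionary keys are assumptions made for this sketch, and the counts are the ones quoted in the example.

```python
# Country counts per income group, taken from the example above.
TOTAL_COUNTRIES = {"LIC": 29, "LMC": 50, "UMC": 56}
GLD_COUNTRIES = {"LIC": 10, "LMC": 11, "UMC": 20}


def coverage_shares(covered, totals):
    """Return the share of each income group's countries present in GLD."""
    return {group: covered[group] / totals[group] for group in totals}


def least_covered_group(covered, totals):
    """Return the income group with the lowest coverage share."""
    shares = coverage_shares(covered, totals)
    return min(shares, key=shares.get)


shares = coverage_shares(GLD_COUNTRIES, TOTAL_COUNTRIES)
next_group = least_covered_group(GLD_COUNTRIES, TOTAL_COUNTRIES)

for group, share in shares.items():
    print(f"{group}: {share:.0%}")
print("Next group to prioritise:", next_group)
```

The group with the lowest share is the one the next selection should come from, which matches the conclusion of the example.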

2. Ensure topicality of surveys within income level groups#

Within each group, and for the calculation of the ratios in point 1), a country is considered present in the database if it has at least one survey from the last four years.

Example: In 2022, any survey from 2018 or later is considered valid. If GLD, within LIC, has 30 surveys from 10 countries, but the latest survey is from 2017 in one country and from 2016 in another, then for the purpose of point 1) GLD contains only 8 LIC countries.
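A minimal sketch of this topicality rule follows. The country names and the survey years of the eight up-to-date countries are made up for illustration; only the four-year window, the reference year, and the two outdated years (2017 and 2016) come from the example.

```python
def is_current(latest_survey_year, reference_year, window=4):
    """A country counts as present if its latest survey is at most `window` years old."""
    return reference_year - latest_survey_year <= window


def count_current_countries(latest_by_country, reference_year, window=4):
    """Count the countries whose latest survey falls inside the window."""
    return sum(
        is_current(year, reference_year, window)
        for year in latest_by_country.values()
    )


# Hypothetical LIC countries: eight with a recent survey, two without.
latest_survey = {f"Country {i}": 2019 for i in range(1, 9)}
latest_survey["Country 9"] = 2017
latest_survey["Country 10"] = 2016

current_count = count_current_countries(latest_survey, reference_year=2022)
print(current_count)
```

It is this count, rather than the raw number of countries with any survey, that would feed into the coverage ratios of point 1).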

3. Focus on larger countries first#

When choosing which of the countries not yet in a group to include, preference is given to countries with a larger population, provided surveys are available.
To treat countries of similar size equally, countries are placed in population groups. The smallest group covers up to 1 million; groups then proceed in steps of 5 million (i.e., 0 ≤ group 1 ≤ 1 million // 1 million < group 2 ≤ 5 million // 5 million < group 3 ≤ 10 million) until 50 million, then in steps of 10 million until 100 million, and thereafter in steps of 25 million.

Example: There are resources to add only one country to the GLD LIC group. Among those not represented are three countries: Nation A, Nation B, and Nation C, with populations of 100 million, 50 million, and 5 million respectively. Nation A would be the first option, yet it has no surveys available to include in GLD. Nations B and C, on the other hand, do. In this case, Nation B is chosen to be included in GLD.
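The population groups and the example can be sketched as below. The code reads the rule as a set of intervals with upper bounds of 1, 5, 10, ..., 50, then 60, ..., 100, then 125, 150, and so on (in millions); that reading, the function name, and the record layout are assumptions of this sketch.

```python
import math


def population_group(pop_millions):
    """Return the upper bound (in millions) of the size group containing the population."""
    if pop_millions <= 1:
        return 1
    if pop_millions <= 50:
        return 5 * math.ceil(pop_millions / 5)    # 5, 10, ..., 50
    if pop_millions <= 100:
        return 10 * math.ceil(pop_millions / 10)  # 60, 70, ..., 100
    return 100 + 25 * math.ceil((pop_millions - 100) / 25)  # 125, 150, ...


# The three nations from the example above.
candidates = [
    {"name": "Nation A", "pop_millions": 100, "has_survey": False},
    {"name": "Nation B", "pop_millions": 50, "has_survey": True},
    {"name": "Nation C", "pop_millions": 5, "has_survey": True},
]

# Only countries with available surveys can be selected.
available = [c for c in candidates if c["has_survey"]]
chosen = max(available, key=lambda c: population_group(c["pop_millions"]))
print(chosen["name"])
```

Because the comparison is made on the group rather than on the exact population, two countries of 26 and 28 million fall into the same group and are treated as equivalent in size.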

4. Favour updating the most out of date surveys first#

After considering population, if two nations are of equivalent size, preference should be given to the nation whose latest survey is more out of date. To treat surveys of similar vintage equally, surveys are grouped in steps of 3 years.

Example: From the vantage point of 2022, surveys from 2017 or older are considered out of date. Those from 2017, 2016, and 2015 are considered equally old and worthy of updating. Surveys from 2014 to 2012 are even older. If the latest survey from Nation A of 28 million is from 2015 and that from Nation B of 26 million is from 2014, priority should be given to including surveys from Nation B.
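The three-year vintage groups can be sketched as follows. The numbering of the groups (0 for up-to-date surveys, 1 for the most recent out-of-date group, and so on) is an assumption made for this sketch; the four-year window and the three-year step come from the guidelines.

```python
def vintage_group(latest_survey_year, reference_year, window=4, step=3):
    """Return 0 for an up-to-date survey, else the index of its out-of-date group."""
    age = reference_year - latest_survey_year
    if age <= window:
        return 0
    # Ages 5-7 form group 1, ages 8-10 form group 2, and so on.
    return (age - window - 1) // step + 1


# The two nations from the example above; both fall into the same population group.
nation_a = vintage_group(2015, reference_year=2022)
nation_b = vintage_group(2014, reference_year=2022)

priority = "Nation B" if nation_b > nation_a else "Nation A"
print(nation_a, nation_b, priority)
```

A higher group index means an older latest survey and therefore a higher priority for updating.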

5. Favour complementarity with GMD#

Points 1) through 4) being (roughly) equal, preference is given to countries with less representation in the Global Monitoring Database (GMD).

Example: Points 1) through 4) lead us to choose between Nation A, with 40 million inhabitants and a latest survey from 2017, and Nation B, with 37 million inhabitants and a latest survey from 2016. However, GMD has a 2019 survey for Nation A but no recent one for Nation B. We should update the survey data for Nation B.
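One way to picture how points 3) to 5) combine is as a sort key in which each later criterion only breaks ties left by the earlier ones. The sketch below is illustrative only: the field names are hypothetical, and the population and vintage groups are entered by hand (both nations fall into the group of up to 40 million, and both latest surveys fall into the first out-of-date group as seen from 2022).

```python
candidates = [
    {"name": "Nation A", "pop_group": 40, "vintage_group": 1, "recent_in_gmd": True},
    {"name": "Nation B", "pop_group": 40, "vintage_group": 1, "recent_in_gmd": False},
]


def priority_key(country):
    """Smaller keys sort first: larger size, then older survey, then absence from GMD."""
    return (
        -country["pop_group"],      # point 3: larger population group first
        -country["vintage_group"],  # point 4: more out-of-date survey first
        country["recent_in_gmd"],   # point 5: False (no recent GMD survey) first
    )


ranked = sorted(candidates, key=priority_key)
print([c["name"] for c in ranked])
```

Since the first two elements of the key are tied, the GMD flag decides the order and Nation B comes first.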

Shortcomings of the selection guidelines#

These guidelines cannot always be followed, for two main reasons:

Data access constraints: The guidelines may call for including newer surveys from a particular country, but those surveys are not shared with the GLD team.
Immediate needs of the unit: A high priority report for the Jobs Group requires data from a specific country, and thus that country “jumps the queue” in the selection process.

To address these challenges, the GLD team asks other teams to reach out to the GLD Focal Point and make their data needs known. This allows the GLD team to take these needs into account, collaborate on the harmonization, and create a broad, level database. By joining forces, the GLD team and other teams can reduce duplication of effort and ensure that the GLD serves the needs of the various units within the World Bank.

What are the strengths and weaknesses of GLD?#

This section summarizes the strengths and weaknesses of the GLD.

Strengths#

Rigour: The GLD follows a strict, standardized process with clear and continually improving documentation and a detailed quality check procedure to minimize errors. As a result, a published GLD survey is a reliable source that users can trust.

Transparency: All codes and processes are available online for anyone to see, and any World Bank staff member should be able to recreate any of the datasets from raw data to harmonized output. The GLD Focal Point is there to support these efforts.

GLD is set up to help individuals go through the code and spot potential mistakes, which continually improves its quality.

Amenability: Due to the transparent nature of GLD, with its open codes, users can “hack” the harmonization to build specialised datasets to serve their research interest without having to start from scratch. For example, some surveys may contain information about the location of work (in office building, at home, structure adjacent to home, …) which a user may wish to exploit. With a regular harmonized dataset, users could try to extract the variable and match the information to the harmonized data via the individual ID codes. This is extremely risky if the creation of the ID codes is not documented.

The other alternative is to redo the harmonization from scratch to include the variable(s) of interest. This is very resource intensive. With the public GLD codes users can copy the process from the beginning and simply append a section adding their own variable(s) of interest, greatly reducing the effort to achieve this.

Interoperability: The GLD data dictionary is modelled after the second version of the Global Monitoring Database (GMD) data dictionary. While there are variables in GMD that are not in GLD and vice versa, due to the specific nature of the surveys, about 80% of the variables in GLD also exist in GMD. These variables have the same names and definitions, ensuring that (survey methodology permitting) a GMD survey from country C1 in year Y1 can be compared to a GLD survey from country C2 in year Y2.

ISCO/ISIC information: From the start, GLD has set out to extract information on industry and occupation in greater detail. Comparable databases usually record industry and occupation as categorical variables with at most 10 categories.

GLD strives to extract ISCO and ISIC variables at the most detailed level possible. In surveys that already classify the information according to international classifications, this is straightforward.

However, many countries have coding systems that differ from the international classification. GLD invests effort in finding national-to-international correspondence systems to convert information to a common code. Even if this is only done at the two-digit level (i.e., a division of ISIC like 17 “Manufacture of paper and paper products” or 24 “Manufacture of basic metals” instead of lumping all as “Manufacturing”), it adds great value.
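To illustrate what the two-digit level means, the sketch below reduces a more detailed ISIC code to its division by keeping the first two digits. It is not taken from the GLD harmonization code and does not perform any national-to-international correspondence; the function name and the small lookup table are assumptions made for this illustration, with the two division labels taken from the text above.

```python
# Division labels quoted in the text above; a full table would cover all divisions.
DIVISION_LABELS = {
    "17": "Manufacture of paper and paper products",
    "24": "Manufacture of basic metals",
}


def isic_division(code):
    """Return the two-digit ISIC division of a code supplied as a string of digits."""
    digits = str(code).strip()
    if len(digits) < 2 or not digits.isdigit():
        raise ValueError(f"Not a usable ISIC code: {code!r}")
    return digits[:2]


# Codes are kept as strings so that leading zeros (e.g. "0111") are preserved.
for code in ["1701", "2410", "0111"]:
    division = isic_division(code)
    print(code, division, DIVISION_LABELS.get(division, "(label not in this sketch)"))
```

Keeping the division rather than only the broad section is what preserves the distinction between, say, paper manufacturing and basic metals.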

Context information: Under the heading of Country Survey Details on the GitHub repository, harmonizers write introductions to the harmonized surveys to give users contextual information about each survey and its idiosyncrasies that cannot be clarified in the harmonization code.

An example is the explanation of the correspondence process used between national and international industry or occupation systems. In the harmonization code this is done by merging in the relevant tables, but in the Country Survey Details it is described to make it understandable, replicable and, in the GLD spirit, transparent (see the example from South Africa at https://github.com/worldbank/gld/blob/main/Support/B - Country Survey Details/ZAF/QLFS/Correspondence_National_International_Classifications.md).

Weaknesses#

Preference for integrated household budget surveys: In many settings the only survey performed is an integrated household income and expenditure survey with a labour market module. This survey would commonly fall under the purview of GMD.

GLD can, if available, harmonize a labour force survey from the same country and year. However, and despite the ISCO/ISIC information, it seems safe to assume that most researchers would prefer the integrated budget survey, as it generally offers the same information and also allows them to exploit consumption patterns, enabling a richer investigation. In this sense, GLD is a second-best option.

In its logic for including new surveys, GLD tries to take this factor into account by aiming for complementarity with GMD. There is nonetheless a risk that users who expect the full GMD suite will be disillusioned or disappointed with the GLD offerings.

Updating the data dictionary: There are two reasons for changes in the data dictionary. The first is normal change over time. Changes to survey procedures and shifts in researcher interest require the data dictionary to adapt. This can cause breaks in the data series, especially if variable definitions change, but it is a small risk.

The second is changes to the GMD data dictionary. GMD, being a larger project, has the ability to set the agenda, potentially in ways that conflict with GLD's interests. Since interoperability between the data dictionaries is central to the project, GLD is more of a “dictionary taker” than a “dictionary setter”, by analogy with price taking and price setting.

Increasing management effort: As the project scales, the effort required for database access, database survey improvement, and database user support is expected to grow. Database access refers to responding to requests, made on the datalibweb site or via the GLD server, for the harmonized data, harmonization codes, and/or raw data.

Database survey improvement refers to continuously correcting the harmonization efforts. The GLD team believes this creates a better project. However, having researchers (in the case of public data, potentially from all over the world) point out (hopefully fewer and fewer) potential mistakes in the harmonization of hundreds of surveys from dozens of countries requires staff time to review, respond, and, if necessary, update.

Database user support refers to the role of GLD as a public utility. As outlined above, GLD aims to support users who wish to (1) amend the harmonization to serve their purpose or (2) build tools based on it. Both require helping users understand how GLD works in general as well as some survey specific aspects (e.g., survey specific correspondence between national and international occupation classifications).

How can GLD be sustainable?#

Maintaining the level of detail provided by the GLD is a significant undertaking that requires a serious investment of resources. The World Bank is the right organization to house such an effort due to the externalities generated by a public effort to create accessible data.

To keep the costs of the effort low and ensure the sustainability of the project as it scales and requires more management, two things are necessary: 1) strong collaboration with regional data teams (already established with the South Asia team), and 2) the creation of a community of users on GitHub, initially World Bank colleagues but eventually, it is hoped, from around the world, whose input can be leveraged through a collaborative yet curated wiki approach.