Metadata Quality Assessment with LLM

3. Metadata Quality Assessment with LLM#

This notebook provides an example of how to use a large language model (LLM) to assess the quality of metadata. The method leverages a set of expert-curated guidelines to assess the quality of metadata. The LLM is tasked to assess the quality of metadata based on the guidelines and provide a revision of the metadata. Each dimension of the guidelines is assessed separately and the LLM is instructed to analyze the metadata against the guidelines for each dimension. The LLM also rates the current metadata against the guidelines.

We use the OpenAI API for the LLM.

%%capture

cd ~/WBG/ai4data-360/

3.1. All module imports#

import seaborn as sns

sns.set_theme(
    context="notebook",        # or "paper", "notebook", "poster"
    style="white",
    palette="deep",
    font_scale=1
)

# Standard library imports
import glob
import json
import os
import re
import pytz
from datetime import datetime, timezone
from hashlib import md5

# Third-party imports
import pandas as pd
from tqdm.auto import tqdm

/home/jupyter-wb536061/WBG/ai4data-360/.venv/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

# Constants
REQUIRED_COLS = ["name", "definition_long", "methodology", "license", "sources"]
VERTICALS = set(["Digital", "Infrastructure", "People", "Planet", "Prosperity"])

3.2. Setting up directories#

CWD = os.path.abspath("./")
assert CWD.endswith("/ai4data-360")

DATA_DIR = os.path.join(CWD, "data")

OPENAI_SAVE_DIR = os.path.join(DATA_DIR, "openai-payloads")
BATCH_SAVE_DIR = os.path.join(OPENAI_SAVE_DIR, "quality-assessment")

QUALITY_DATA_DIR = os.path.join(DATA_DIR, "quality")

os.makedirs(QUALITY_DATA_DIR, exist_ok=True)
os.makedirs(BATCH_SAVE_DIR, exist_ok=True)

3.3. Loading external sources#

# Load the metadata source
# Select the most recent data file
METADATA_FNAME = sorted(glob.glob(os.path.join(DATA_DIR, "data360-indicator-*.json")), key=os.path.getmtime, reverse=True)[0]
print(f"Loading file: {METADATA_FNAME}")

df = pd.read_json(METADATA_FNAME).T
df = df["series_description"].apply(pd.Series)
df.rename(columns={"idno": "indicator_code"}, inplace=True)

# Add a simple completion measure.
df["completion"] = df[REQUIRED_COLS].notnull().sum(axis=1)

# Sort the indicators based on completion, then drop any duplicated indicator code.
df = df.sort_values("completion", ascending=False).drop_duplicates(subset=["indicator_code"])

# Create the verticals
df["verticals"] = df.apply(lambda x: ";".join(sorted([i["name"] for i in x["topics"] if i["name"] in VERTICALS])), axis=1)
df.loc[df["verticals"] == "", "verticals"] = "N/A"

# Set the code as the index for content mapping below.
df.set_index("indicator_code", inplace=True)

df

Loading file: /home/jupyter-wb536061/WBG/ai4data-360/data/data360-indicator-20250403.json

	name	database_id	database_name	measurement_unit	periodicity	definition_short	definition_long	methodology	economies_count	csv_link	indicator_count	version_statement	topics	time_periods	ref_country	license	links	sources	completion	verticals
indicator_code
WB_WDI_SH_XPD_GHED_GD_ZS	Domestic general government health expenditure...	WB_WDI	World Development Indicators	None	Annual	None	Public expenditure on health from domestic sou...	The health expenditure estimates have been pre...	None	https://data360files.worldbank.org/data360-dat...	None	{'version_date': '2025-01-27'}	[{'id': None, 'name': 'Health: Health systems'...	[{'start': '2000', 'end': '2021', 'notes': None}]	[{'name': 'Afghanistan', 'code': 'AFG'}, {'nam...	[{'name': 'CC BY-4.0', 'uri': 'https://creativ...	[{'type': 'WDI API', 'description': 'Data in J...	[{'idno': None, 'type': None, 'name': 'World H...	5	People
WB_WDI_SH_XPD_PVTD_PP_CD	Domestic private health expenditure per capita...	WB_WDI	World Development Indicators	None	Annual	None	Current private expenditures on health per cap...	The health expenditure estimates have been pre...	None	https://data360files.worldbank.org/data360-dat...	None	{'version_date': '2025-01-27'}	[{'id': None, 'name': 'Health: Health systems'...	[{'start': '2000', 'end': '2021', 'notes': None}]	[{'name': 'Afghanistan', 'code': 'AFG'}, {'nam...	[{'name': 'CC BY-4.0', 'uri': 'https://creativ...	[{'type': 'WDI API', 'description': 'Data in J...	[{'idno': None, 'type': None, 'name': 'World H...	5	N/A
WB_WDI_SI_POV_LMIC_GP	Poverty gap at $3.65 a day (2017 PPP) (%)	WB_WDI	World Development Indicators	%	Annual	None	Poverty gap at $3.65 a day (2017 PPP) is the m...	International comparisons of poverty estimates...	None	https://data360files.worldbank.org/data360-dat...	None	{'version_date': '2025-01-27'}	[{'id': None, 'name': 'Poverty: Poverty rates'...	[{'start': '1963', 'end': '2022', 'notes': None}]	[{'name': 'Angola', 'code': 'AGO'}, {'name': '...	[{'name': 'CC BY-4.0', 'uri': 'https://creativ...	[{'type': 'WDI API', 'description': 'Data in J...	[{'idno': None, 'type': None, 'name': 'World B...	5	N/A
WB_WDI_SI_SPR_PC40	Survey mean consumption or income per capita, ...	WB_WDI	World Development Indicators	2017 PPP $	Annual	Survey mean consumption or income per capita, ...	Mean consumption or income per capita (2017 PP...	Survey mean consumption or income per capita, ...	None	https://data360files.worldbank.org/data360-dat...	None	{'version_date': '2025-01-27'}	[{'id': None, 'name': 'Poverty: Shared prosper...	[{'start': '2003', 'end': '2022', 'notes': None}]	[{'name': 'Albania', 'code': 'ALB'}, {'name': ...	[{'name': 'CC BY-4.0', 'uri': 'https://creativ...	[{'type': 'WDI API', 'description': 'Data in J...	[{'idno': None, 'type': None, 'name': 'World B...	5	N/A
WB_WDI_SL_AGR_EMPL_MA_ZS	Employment in agriculture, male (% of male emp...	WB_WDI	World Development Indicators	% of male employment	Annual	None	Employment is defined as persons of working ag...	The series is part of the "ILO modeled estimat...	None	https://data360files.worldbank.org/data360-dat...	None	{'version_date': '2025-01-27'}	[{'id': None, 'name': 'Social Protection & Lab...	[{'start': '1991', 'end': '2021', 'notes': None}]	[{'name': 'Afghanistan', 'code': 'AFG'}, {'nam...	[{'name': 'CC BY-4.0', 'uri': 'https://creativ...	[{'type': 'WDI API', 'description': 'Data in J...	[{'idno': None, 'type': 'external database', '...	5	Planet
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
IMF_GFSR_W1_S13_G13	Government grants revenue from foreign govts	IMF_GFSR	None	National currency, Percentage of GDP	Annual	None	None	None	None	https://data360files.worldbank.org/data360-dat...	None	{'version_date': '2025-02'}	[{'id': 'P3', 'name': 'Prosperity', 'parent_id...	[{'start': '1972', 'end': '2023', 'notes': None}]	[{'name': 'Angola', 'code': 'AGO'}, {'name': '...	[]	[]	[{'idno': None, 'type': None, 'name': 'Governm...	3	Prosperity
IMF_GFSR_W1_S1_G1411	Government interest revenue from nonresidents	IMF_GFSR	None	National currency, Percentage of GDP	Annual	None	None	None	None	https://data360files.worldbank.org/data360-dat...	None	{'version_date': '2025-02'}	[{'id': 'P3', 'name': 'Prosperity', 'parent_id...	[{'start': '1991', 'end': '2023', 'notes': None}]	[{'name': 'Albania', 'code': 'ALB'}, {'name': ...	[]	[]	[{'idno': None, 'type': None, 'name': 'Governm...	3	Prosperity
IMF_BOP_BFDDDSD_BP6	BOP, Financial Account, Direct investment, Deb...	IMF_BOP	None	USD, Euros, LCU	Annual, Quarterly	None	None	None	None	https://data360files.worldbank.org/data360-dat...	None	{'version_date': '2025-02'}	[{'id': 'P3', 'name': 'Prosperity', 'parent_id...	[{'start': '1967', 'end': '2024-Q3', 'notes': ...	[{'name': 'Albania', 'code': 'ALB'}, {'name': ...	[]	[]	[{'idno': None, 'type': None, 'name': 'Balance...	3	Prosperity
IMF_BOP_BFDDD_BP6	BOP, Financial Account, Direct investment, Deb...	IMF_BOP	None	USD, Euros, LCU	Annual, Quarterly	None	None	None	None	https://data360files.worldbank.org/data360-dat...	None	{'version_date': '2025-02'}	[{'id': 'P3', 'name': 'Prosperity', 'parent_id...	[{'start': '1948', 'end': '2024-Q3', 'notes': ...	[{'name': 'Afghanistan', 'code': 'AFG'}, {'nam...	[]	[]	[{'idno': None, 'type': None, 'name': 'Balance...	3	Prosperity
IMF_BOP_BFDDFU_BP6	BOP, Financial Account, Direct investment, Deb...	IMF_BOP	Balance of Payments (BOP) and International In...	USD, Euros, LCU	Annual, Quarterly	None	None	None	None	https://data360files.worldbank.org/data360-dat...	None	{'version_date': '2025-02'}	[{'id': 'P3', 'name': 'Prosperity', 'parent_id...	[{'start': '1967', 'end': '2024-Q3', 'notes': ...	[{'name': 'Albania', 'code': 'ALB'}, {'name': ...	[]	[]	[{'idno': None, 'type': None, 'name': 'Balance...	3	Prosperity

10574 rows × 20 columns

# Some simple checks of missingness and duplicates
print("Indicator name is the same as definition:", (df["name"] == df["definition_long"]).sum())
print("Indicator name is missing:", df["definition_long"].isnull().sum())
print("Indicator name is duplicated:", df["definition_long"].dropna().duplicated().sum())

Indicator name is the same as definition: 2027
Indicator name is missing: 2818
Indicator name is duplicated: 1782

3.4. Guidelines and Prompts#

3.4.1. System prompt#

SYSTEM_PROMPT = """You are an expert in documenting development data. You are provided with a guideline for a specific metadata field and relevant information, and your task is to evaluate whether the given metadata complies with the guideline.

For each line in the guideline, assess whether the metadata meets the specified requirement. Provide your assessment within <analysis> tags, clearly explaining whether the criterion is satisfied and why. Rate the compliance on a scale from 0 to 10 for each criterion, placing the rating within <rating> tags.

Based on your analysis, propose an improved version of the metadata. Place your suggestion within <revised> tags. You must also provide an informative reason that highlights the key improvements made in the revision.

Compare the original and revised versions carefully, and identify any key information that was removed or added in the revised version. For each piece of information removed, explain why it is unnecessary or unhelpful. If a piece of information is added, explain why its addition is important. Place these information in the <modifications> tags.

The output must be in JSON with the following structure:

```json
{
  "analysis": [
    {"id": "<id>", "analysis": "<id-level analysis>", "rating": <0-10>}, ...
  ],
  "revised": "<revised>",
  "revision_reason": "<explanation why the revised version is better>",
  "modifications": [
    {
      "source": "<snippet that was removed/added>",
      "reason": "<reason why the information was removed/added from the original value>"
      "action": "<add/remove>"
    }
  ]
}```"""

3.4.2. Definition#

DEFINITION_GUIDELINES = """# Guidelines for Writing an Indicator Definition

Each indicator definition must follow the criteria below. The purpose is to ensure that definitions are **clear**, **accessible**, **accurate**, and **scoped appropriately**, without duplicating content from other parts of the reference metadata.

---

## [id:meaning] Conceptual Clarity

Focus on what the indicator fundamentally measures.

- Clearly state **what the indicator measures**.
- Describe the **core concept or idea** the indicator captures.
- Avoid including **unrelated information** or **secondary objectives**.

---

## [id:scope] Scope and Relevance

Define the indicator's boundaries and ensure content is appropriate to this metadata field.

- Specify **what is included and excluded**, where relevant.
- Identify any **interpretation boundaries or limitations**, where applicable.
- Avoid content that belongs in other metadata fields (e.g., methodology, statistical concepts).
- Do not **duplicate** content from other reference metadata sections.

---

## [id:audience] Plain Language for Accessibility

Write for a non-technical audience using simple and neutral language.

- Use **clear, plain language**.
- Avoid **jargon** and overly academic terminology.
- Maintain a **neutral and informative tone**.

---

## [id:grammar] Grammar and Sentence Construction

Ensure readability and grammatical correctness.

- Write in **full, complete sentences**.
- Use **correct grammar**, punctuation, and sentence structure.
- Avoid fragments or lists.

---

## [id:terminology] Technical Consistency and Standards

Ensure technical language is consistent and aligned with standards.

- Use **consistent terminology** throughout the definition.
- Align with international metadata standards (e.g., **DDI**, **SDMX**, **ISO**).
- Ensure **technical accuracy** in how the indicator is described.

---

## [id:form] Structure and Formatting

Keep the definition concise and well-structured.

- Use **1 to 3 complete sentences**.
- Keep the total length between **100 and 150 words**.
- Avoid **excessive detail**, **textbook-style narratives**, or **complex sentence structures**.

---

## [id:abbreviation] Abbreviation Clarification

Ensure acronyms and abbreviations are clearly explained.

- Spell out **any acronyms or abbreviations** the first time they appear (in either the indicator name or the definition)."""

INDICATOR_NAME_GUIDELINES = """# Guidelines for Naming Indicators

These conventions are intended to ensure that indicator names are **concise**, **clear**, **consistent**, and **suitable for general audiences**. A well-formed indicator name improves usability across systems and helps users understand what is being measured at a glance.

---

## [id:clarity] Clarity and Conceptual Meaning

Keep the name focused and understandable.

- Clearly convey **what is being measured**, avoiding technical jargon.
- Do not assume specialized knowledge; make names **comprehensible to non-experts**.
- Avoid vague or overly complex phrases.

Good: Nitrous oxide (N2O) emissions from Waste (Mt CO₂e)
Bad: Adequacy of unemployment benefits ALMP (% of total welfare of beneficiary households)

---

## [id:length] Conciseness and Display Limits

The name should ideally be short enough to fit on screens and in data visualizations.

- Keep names to **around 70 characters or fewer**.
- Avoid **redundant** or **repetitive** language.

Good: Account ownership at financial institution
Bad: "Account" in FINDEX database
Bad: Per capita daily gas consumption (barrels per capita per day)

---

## [id:self-contained] Self-Containment and Independence

Names should stand on their own without relying on external context.

- Avoid references to databases or internal sources.
- Do not use abbreviations or variable names as stand-ins for descriptive terms.

Good: Debt service on external debt (% of GNI)
Bad: PPG; DOD (current US$)

---

## [id:acronyms] Acronym Use

Use acronyms carefully and always provide context.

- Do not use acronyms **alone** as indicator names.
- Widely known acronyms (e.g., GDP, CO₂) are acceptable if used as **part of** the name.

Good: GDP per capita (current US$)
Bad: PPG; DOD

---

## [id:units] Units of Measure

Indicate the unit of measure clearly and consistently.

- Include the unit in the name or pull it from a separate "Unit of measure" field.
- Avoid using the unit "Count"; specify **what is being counted** (e.g., persons, vehicles).
- Include the unit even when the measure is a count, for consistency.

Good: School-aged children enrolled (number of persons)
Bad: School-aged children enrolled (count)

---

## [id:disaggregation] Disaggregation Handling

Avoid embedding disaggregations directly in the name unless necessary.

- Do **not include** dimensions (e.g., sex, age) in the base indicator name if handled separately.
- If disaggregated indicators are published separately, include the relevant dimension in the name.

Good: Children in employment, wage workers, male (% of male children...)
Bad: Children in employment (%), with separate disaggregation fields not defined

---

## [id:prefixes] Avoid Unnecessary Prefixes

Start the name with the core concept being measured.

- Avoid starting with words like "Share of", "Percentage of" unless essential.
- Emphasize **what is measured**, not how it is expressed.

Good: Population using safely managed sanitation services
Bad: Share of population using safely managed sanitation services

---

## [id:year] Reference Years

Keep reference years out of indicator names.

- Use **"reference year"** or **"constant US$"** instead of naming a specific year.
- This ensures names remain valid across data updates.

Good: Net official aid received (constant US$)
Bad: Net official aid received (constant 2021 US$)

---

## [id:capitalization] Capitalization and Style

Use consistent capitalization for all indicator names.

- Follow **title case** capitalization (capitalize major words, lowercase articles/conjunctions).
- Avoid unnecessary use of uppercase.

Good: Nitrous Oxide (N2O) Emissions from Waste (Mt CO₂e)
Bad: Nitrous oxide (N2O) emissions from Waste (Mt CO₂e)"""

print(f"{SYSTEM_PROMPT}\n\n{INDICATOR_NAME_GUIDELINES}")

You are an expert in documenting development data. You are provided with a guideline for a specific metadata field and relevant information, and your task is to evaluate whether the given metadata complies with the guideline.

For each line in the guideline, assess whether the metadata meets the specified requirement. Provide your assessment within <analysis> tags, clearly explaining whether the criterion is satisfied and why. Rate the compliance on a scale from 0 to 10 for each criterion, placing the rating within <rating> tags.

Based on your analysis, propose an improved version of the metadata. Place your suggestion within <revised> tags. You must also provide an informative reason that highlights the key improvements made in the revision.

Compare the original and revised versions carefully, and identify any key information that was removed or added in the revised version. For each piece of information removed, explain why it is unnecessary or unhelpful. If a piece of information is added, explain why its addition is important. Place these information in the <modifications> tags.

The output must be in JSON with the following structure:

```json
{
  "analysis": [
    {"id": "<id>", "analysis": "<id-level analysis>", "rating": <0-10>}, ...
  ],
  "revised": "<revised>",
  "revision_reason": "<explanation why the revised version is better>",
  "modifications": [
    {
      "source": "<snippet that was removed/added>",
      "reason": "<reason why the information was removed/added from the original value>"
      "action": "<add/remove>"
    }
  ]
}```

# Guidelines for Naming Indicators

These conventions are intended to ensure that indicator names are **concise**, **clear**, **consistent**, and **suitable for general audiences**. A well-formed indicator name improves usability across systems and helps users understand what is being measured at a glance.

---

## [id:clarity] Clarity and Conceptual Meaning

Keep the name focused and understandable.

- Clearly convey **what is being measured**, avoiding technical jargon.
- Do not assume specialized knowledge; make names **comprehensible to non-experts**.
- Avoid vague or overly complex phrases.

Good: Nitrous oxide (N2O) emissions from Waste (Mt CO₂e)
Bad: Adequacy of unemployment benefits ALMP (% of total welfare of beneficiary households)

---

## [id:length] Conciseness and Display Limits

The name should ideally be short enough to fit on screens and in data visualizations.

- Keep names to **around 70 characters or fewer**.
- Avoid **redundant** or **repetitive** language.

Good: Account ownership at financial institution
Bad: "Account" in FINDEX database
Bad: Per capita daily gas consumption (barrels per capita per day)

---

## [id:self-contained] Self-Containment and Independence

Names should stand on their own without relying on external context.

- Avoid references to databases or internal sources.
- Do not use abbreviations or variable names as stand-ins for descriptive terms.

Good: Debt service on external debt (% of GNI)
Bad: PPG; DOD (current US$)

---

## [id:acronyms] Acronym Use

Use acronyms carefully and always provide context.

- Do not use acronyms **alone** as indicator names.
- Widely known acronyms (e.g., GDP, CO₂) are acceptable if used as **part of** the name.

Good: GDP per capita (current US$)
Bad: PPG; DOD

---

## [id:units] Units of Measure

Indicate the unit of measure clearly and consistently.

- Include the unit in the name or pull it from a separate "Unit of measure" field.
- Avoid using the unit "Count"; specify **what is being counted** (e.g., persons, vehicles).
- Include the unit even when the measure is a count, for consistency.

Good: School-aged children enrolled (number of persons)
Bad: School-aged children enrolled (count)

---

## [id:disaggregation] Disaggregation Handling

Avoid embedding disaggregations directly in the name unless necessary.

- Do **not include** dimensions (e.g., sex, age) in the base indicator name if handled separately.
- If disaggregated indicators are published separately, include the relevant dimension in the name.

Good: Children in employment, wage workers, male (% of male children...)
Bad: Children in employment (%), with separate disaggregation fields not defined

---

## [id:prefixes] Avoid Unnecessary Prefixes

Start the name with the core concept being measured.

- Avoid starting with words like "Share of", "Percentage of" unless essential.
- Emphasize **what is measured**, not how it is expressed.

Good: Population using safely managed sanitation services
Bad: Share of population using safely managed sanitation services

---

## [id:year] Reference Years

Keep reference years out of indicator names.

- Use **"reference year"** or **"constant US$"** instead of naming a specific year.
- This ensures names remain valid across data updates.

Good: Net official aid received (constant US$)
Bad: Net official aid received (constant 2021 US$)

---

## [id:capitalization] Capitalization and Style

Use consistent capitalization for all indicator names.

- Follow **title case** capitalization (capitalize major words, lowercase articles/conjunctions).
- Avoid unnecessary use of uppercase.

Good: Nitrous Oxide (N2O) Emissions from Waste (Mt CO₂e)
Bad: Nitrous oxide (N2O) emissions from Waste (Mt CO₂e)

3.4.3. Mapping of guidelines#

# Define the guidelines mapping here
METADATA_GUIDELINES = {
    "definition_long": DEFINITION_GUIDELINES,
    "name": INDICATOR_NAME_GUIDELINES,
}

FIELD_MAPPING = {
    "definition_long": "Definition",
    "name": "Indicator name"
}

3.5. Functions#

def extract_ids(guidelines):
    """Extract guideline IDs from a guidelines string.

    Args:
        guidelines (str): A string containing guidelines with sections marked by [id:name] tags.

    Returns:
        list[str]: A list of extracted ID names from the guidelines.

    Example:
        >>> guidelines = "## [id:meaning] Section 1\\n## [id:scope] Section 2"
        >>> extract_ids(guidelines)
        ['meaning', 'scope']
    """
    return re.findall(r"\[id\:(.*)?\]", guidelines)


def get_response_format(guidelines):
  ids = extract_ids(guidelines)

  response_format = {
    "type": "json_schema",
    "json_schema": {
      "name": "metadata_analysis",
      "strict": True,
      "schema": {
        "type": "object",
        "properties": {
          "analysis": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "id": {
                  "type": "string",
                  "enum": ids
                },
                "analysis": {
                  "type": "string"
                },
                "rating": {
                  "type": "number"
                }
              },
              "required": [
                "id",
                "analysis",
                "rating"
              ],
              "additionalProperties": False
            }
          },
          "revised": {
            "type": "string"
          },
          "revision_reason": {
            "type": "string"
          },
          "modifications": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "source": {
                  "type": "string"
                },
                "reason": {
                  "type": "string"
                },
                "action": {
                  "type": "string",
                  "enum": ["add", "remove"]
                }
              },
              "required": [
                "source",
                "reason",
                "action"
              ],
              "additionalProperties": False
            }
          }
        },
        "required": [
          "analysis",
          "revised",
          "revision_reason",
          "modifications"
        ],
        "additionalProperties": False
      }
    }
  }

  return response_format


def get_system_prompt(guidelines):
    return f"{SYSTEM_PROMPT}\n\n{guidelines}"


def build_payload(content, guidelines: str, model_id: str, **api_kwargs):
    """Builds an API payload for metadata quality assessment using an LLM.

    Args:
        content (str): The metadata content to be analyzed.
        guidelines (str): The assessment guidelines containing criteria marked with [id:name] tags.
        model_id (str): The identifier for the LLM model to be used.
        **api_kwargs: Additional keyword arguments to override default API parameters.

    Returns:
        dict: A payload dictionary containing:
            - messages: List of system and user messages with the prompt and content
            - response_format: JSON schema specification for the expected response
            - model: The specified model ID
            - Additional API parameters (temperature, max_completion_tokens, etc.)

    Notes:
        Default API parameters include:
        - temperature: 0
        - max_completion_tokens: 4096
        - top_p: 1
        - frequency_penalty: 0
        - presence_penalty: 0
        - seed: 1029
    """
    response_format = get_response_format(guidelines)

    default_kwargs = dict(
        model=model_id,
        temperature=0,
        max_completion_tokens=4096,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        seed=1029,
    )

    if api_kwargs:
        default_kwargs.update(api_kwargs)

    payload = dict(
      messages=[
        {
          "role": "system",
          "content": [
            {
              "type": "text",
              "text": get_system_prompt(guidelines)
            }
          ]
        },
        {
          "role": "user",
          "content": [
            {
              "type": "text",
              "text": content
            }
          ]
        }
      ],
      response_format=response_format,
      **default_kwargs,
    )

    return payload


def write_payload(payload_fname, metadata_field, contents, model_id: str, overwrite: bool = False):
    """Writes API payload data for metadata quality assessment to a JSONL file.

    Args:
        payload_fname (str): Path to the output JSONL file where payloads will be written.
        metadata_field (str): The metadata field being analyzed (e.g., 'definition').
            Must be a key in the metadata_guidelines dictionary.
        contents (dict): Dictionary mapping identifier codes to metadata content strings to be analyzed.
        model_id (str): The ID of the LLM model to use for analysis.
        overwrite (bool, optional): If True, overwrites existing file. If False, raises
            ValueError if file exists. Defaults to False.

    Raises:
        ValueError: If the output file already exists and overwrite=False.

    Notes:
        - Creates a JSONL file where each line contains a JSON object with:
            - custom_id: Generated from prompt hash, identifier code, and content hash
            - method: Always "POST"
            - url: Always "/v1/chat/completions"
            - body: The complete API payload including system prompt, user content,
                   and response format specifications
        - The metadata_field must correspond to a key in the metadata_guidelines dictionary
          which contains the assessment guidelines for that field.
    """
    if os.path.exists(payload_fname) and not overwrite:
       raise ValueError(f"File {payload_fname} exists. Set `overwrite=True` to replace this.")

    assert metadata_field in payload_fname

    guidelines = METADATA_GUIDELINES[metadata_field]
    prompt_hash = md5(get_system_prompt(guidelines).encode()).hexdigest()[:8]

    with open(payload_fname, "w") as fl:
      for code, content in contents.items():
        batch_data = dict(
            custom_id=f"{prompt_hash}-{code}-{md5(content.encode()).hexdigest()}",
            method="POST",
            url="/v1/chat/completions",
            body=build_payload(content, guidelines=guidelines, model_id=model_id),
        )
        fl.write(json.dumps(batch_data) + "\n")


# def get_field_content(df: pd.DataFrame, field: str, code_field_name: str = "indicator_code", drop_empty: bool = True):

#     assert df.index.name == code_field_name

#     func = lambda x: f"Indicator name: {x['title']}\n\n{FIELD_MAPPING[field]}: {x[field]}".strip()

#     if drop_empty:
#         df = df[df[field].notna()]

#     return df.apply(func, axis=1)


def get_field_content(df: pd.DataFrame, field: str, code_field_name: str = "indicator_code", drop_empty: bool = True, title_name: str = "Indicator name"):

    if field == "name":
       return get_indicator_name_content(
          df=df, field=field, code_field_name=code_field_name, drop_empty=drop_empty, title_name=title_name,
       )
    else:
      assert df.index.name == code_field_name

      func = lambda x: f"{title_name}: {x['title']}\n\n{FIELD_MAPPING[field]}: {x[field]}".strip()

      if drop_empty:
          df = df[df[field].notna()]

      return df.apply(func, axis=1)


def get_indicator_name_content(df: pd.DataFrame, field: str, code_field_name: str = "indicator_code", drop_empty: bool = True, title_name: str = "Indicator name"):

    assert df.index.name == code_field_name

    func = lambda x: f"{FIELD_MAPPING[field]}: {x[field]}".strip()

    if drop_empty:
        df = df[df[field].notna()]

    return df.apply(func, axis=1)

print(str(extract_ids(INDICATOR_NAME_GUIDELINES)).replace("'", '"'))

["clarity", "length", "self-contained", "acronyms", "units", "disaggregation", "prefixes", "year", "capitalization"]

3.6. Generate the payloads#

ax = sns.histplot(data=df["name"].map(len), stat="probability", bins=30)
ax.set_ylabel("% of Indicators")
ax.set_xlabel("Indicator name length")
sns.despine(right=True, top=True, left=False, bottom=False)

../_images/63dbea025cd27d2dc9ebac11e88a7cfc0754ffb1d6bcfa6466f741e47e2cfbbe.png

df["name"].map(len).describe()

count    10574.000000
mean        77.005769
std         44.259267
min          7.000000
25%         45.000000
50%         64.000000
75%         98.000000
max        338.000000
Name: name, dtype: float64

MODEL_ID = "gpt-4o"
code_field_name = "indicator_code"
title_name = "Indicator name"
fields = ["name"]
data_type = "indicator"

# Define the Eastern Time timezone
run_date = datetime.now(pytz.timezone('US/Eastern')).date()

for metadata_field in fields:
    payload_fname = os.path.join(BATCH_SAVE_DIR, f"{data_type}-{metadata_field}-{run_date}.input.jsonl")

    print(f"Generating payloads to: {payload_fname}")

    contents = get_field_content(df, field=metadata_field, code_field_name=code_field_name, title_name=title_name)

    write_payload(payload_fname, metadata_field, contents, model_id=MODEL_ID, overwrite=False)

Generating payloads to: /home/jupyter-wb536061/WBG/ai4data-360/data/openai-payloads/quality-assessment/indicator-name-2025-04-03.input.jsonl

MODEL_ID = "gpt-4o"
code_field_name = "indicator_code"
title_name = "Indicator name"
fields = ["definition_long"]
data_type = "indicator"


# Define the Eastern Time timezone
run_date = datetime.now(pytz.timezone('US/Eastern')).date()

for metadata_field in fields:
    payload_fname = os.path.join(BATCH_SAVE_DIR, f"{data_type}-{metadata_field}-{run_date}.input.jsonl")

    print(f"Generating payloads to: {payload_fname}")

    contents = get_field_content(df, field=metadata_field, code_field_name=code_field_name, title_name=title_name)

    write_payload(payload_fname, metadata_field, contents, model_id=MODEL_ID, overwrite=False)

Generating payloads to: /home/jupyter-wb536061/WBG/ai4data-360/data/openai-payloads/quality-assessment/definition_long-2025-03-27.input.jsonl

3.7. Process output#

import glob
from datasets import load_dataset

# base_name = "definition_long-*.output.jsonl"
base_name = "indicator-name-*.output.jsonl"

out_fnames = sorted(glob.glob(os.path.join(BATCH_SAVE_DIR, base_name)))

out_dataset = load_dataset("json", data_files=out_fnames, split="train")
out_dataset

Dataset({
    features: ['id', 'custom_id', 'response', 'error'],
    num_rows: 10574
})

def parse_content(content):
    content = json.loads(content)

    output = {
        "revised": content["revised"],
        "revision_reason": content["revision_reason"],
        "modifications": "",
    }

    # Flatten the data
    for item in content["analysis"]:

        output[f"{item['id']}_analysis"] = item["analysis"]
        output[f"{item['id']}_rating"] = item["rating"]

    modifications_texts = []
    for item in content["modifications"]:
        source = item["source"]
        reason = item["reason"]
        action = item["action"].title()
        text = f"{action}: {source}\nReason: {reason}"

        modifications_texts.append(text)

    if modifications_texts:
        output["modifications"] = "\n---\n".join(modifications_texts).strip()

    return output


def extract_response(examples):
    outs = {"indicator_code": [], "finish_reason": []}
    other_cols = set()

    for custom_id, response in zip(examples["custom_id"], examples["response"]):
        # Parse the indicator_code
        _, code, _ = custom_id.split("-")
        choice = response["body"]["choices"][0]
        outs["finish_reason"].append(choice["finish_reason"])

        if choice["finish_reason"] == "stop":
            content = choice["message"]["content"]

            try:
                content = parse_content(content)
            except:
                content = {}

            for k in content:
                if k not in outs:
                    outs[k] = []
                    other_cols.add(k)

                outs[k].append(content[k])
        else:
            for k in other_cols:
                outs[k].append(None)

        outs["indicator_code"].append(code)

    return outs


def aggregate_rating(examples):
    overall_rating = []

    cols = [k for k in examples if k.endswith("_rating")]
    for c_ix, k in enumerate(cols):
        for v_ix, v in enumerate(examples[k]):
            if c_ix == 0:
                overall_rating[ix]

    pass

response_dataset = out_dataset.map(extract_response, batched=True, remove_columns=["id", "response", "error"])
print(response_dataset)

df_resp = response_dataset.to_pandas()
df_resp

Dataset({
    features: ['custom_id', 'indicator_code', 'finish_reason', 'revised', 'revision_reason', 'modifications', 'clarity_analysis', 'clarity_rating', 'length_analysis', 'length_rating', 'self-contained_analysis', 'self-contained_rating', 'acronyms_analysis', 'acronyms_rating', 'units_analysis', 'units_rating', 'disaggregation_analysis', 'disaggregation_rating', 'prefixes_analysis', 'prefixes_rating', 'year_analysis', 'year_rating', 'capitalization_analysis', 'capitalization_rating'],
    num_rows: 10574
})

	custom_id	indicator_code	finish_reason	revised	revision_reason	modifications	clarity_analysis	clarity_rating	length_analysis	length_rating	...	units_analysis	units_rating	disaggregation_analysis	disaggregation_rating	prefixes_analysis	prefixes_rating	year_analysis	year_rating	capitalization_analysis	capitalization_rating
0	b549c5b6-WB_WDI_SH_XPD_GHED_GD_ZS-f20c7a8bd713...	WB_WDI_SH_XPD_GHED_GD_ZS	stop	Domestic General Government Health Expenditure...	<revision_reason>The revised version capitaliz...	Remove: Domestic general government health exp...	<analysis>The indicator name 'Domestic general...	10	<analysis>The name is concise and fits within ...	10	...	<analysis>The unit of measure is clearly indic...	10	<analysis>The name does not include unnecessar...	10	<analysis>The name starts with the core concep...	10	<analysis>The name does not include reference ...	10	<analysis>The name uses sentence case rather t...	8
1	b549c5b6-WB_WDI_SH_XPD_PVTD_PP_CD-65294e508a0b...	WB_WDI_SH_XPD_PVTD_PP_CD	stop	Domestic Private Health Expenditure Per Capita...	The revised version improves capitalization co...	Remove: Domestic private health expenditure pe...	The indicator name clearly conveys what is bei...	9	The name is concise and fits within the 70-cha...	10	...	The unit of measure is clearly indicated as 'c...	10	The name does not include unnecessary disaggre...	10	The name starts with the core concept being me...	10	The name uses 'current international $' instea...	10	The capitalization is consistent with title ca...	10
2	b549c5b6-WB_WDI_SI_POV_LMIC_GP-ffd6b9f3de4ee96...	WB_WDI_SI_POV_LMIC_GP	stop	Poverty Gap at $3.65 a Day (PPP) (%)	The revised version removes the specific refer...	Remove: 2017\nReason: Including a specific ref...	The indicator name is clear and conveys what i...	9	The name is concise and well within the 70-cha...	10	...	The unit of measure is indicated as a percenta...	10	The name does not include unnecessary disaggre...	10	The name does not start with unnecessary prefi...	10	The reference year '2017' is included in the n...	5	The capitalization is consistent with title ca...	10
3	b549c5b6-WB_WDI_SI_SPR_PC40-cf22b3242b8604c2a1...	WB_WDI_SI_SPR_PC40	stop	Survey Mean Consumption or Income per Capita, ...	The revised version improves clarity by using ...	Remove: 2017 PPP $ per day\nReason: The specif...	The indicator name is clear in conveying what ...	9	The name is concise and fits within the 70-cha...	10	...	The unit of measure is clearly indicated as '2...	9	The disaggregation by the bottom 40% of the po...	10	The name starts with the core concept being me...	10	The reference year '2017' is included in the n...	5	The capitalization is not consistent with titl...	6
4	b549c5b6-WB_WDI_SL_AGR_EMPL_MA_ZS-30a5bd06af55...	WB_WDI_SL_AGR_EMPL_MA_ZS	stop	Male Employment in Agriculture (% of Male Empl...	The revised version improves clarity by reorde...	Add: (modeled ILO estimate)\nReason: Reordered...	The indicator name is clear in conveying what ...	7	The name is concise and fits within the 70-cha...	10	...	The unit of measure is clearly indicated as a ...	10	The disaggregation by gender is appropriately ...	10	The name starts with the core concept 'Employm...	10	There is no reference year included in the nam...	10	The capitalization is consistent with title ca...	10
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
10569	b549c5b6-IMF_GFSR_W1_S13_G13-de258bbca34f5508c...	IMF_GFSR_W1_S13_G13	stop	Revenue from Government Grants from Foreign Go...	The revised version improves clarity by replac...	Remove: govts\nReason: The abbreviation 'govts...	The indicator name 'Government grants revenue ...	6	The name is concise and well within the 70-cha...	10	...	The indicator name does not specify the unit o...	4	The name does not include any disaggregation, ...	10	The name does not start with unnecessary prefi...	10	There are no reference years included in the n...	10	The capitalization is inconsistent with the gu...	5
10570	b549c5b6-IMF_GFSR_W1_S1_G1411-008742c7b34442f2...	IMF_GFSR_W1_S1_G1411	stop	Government Interest Revenue from Nonresidents ...	The revised version includes the unit of measu...	Remove: Government interest revenue from nonre...	The indicator name 'Government interest revenu...	9	The name is concise and fits well within the 7...	10	...	The name does not specify the unit of measure,...	5	The name does not include any disaggregation d...	10	The name starts with the core concept being me...	10	The name does not include any reference years,...	10	The name uses sentence case rather than title ...	7
10571	b549c5b6-IMF_BOP_BFDDDSD_BP6-bc77b73151f060683...	IMF_BOP_BFDDDSD_BP6	stop	Balance of Payments: Financial Account - Direc...	The revised version clarifies the meaning by e...	Add: BOP\nReason: Expanded to 'Balance of Paym...	The name uses technical jargon such as 'BOP' a...	4	The name is excessively long, containing 116 c...	3	...	The name does not specify any unit of measure,...	2	The name includes disaggregation details ('Of ...	5	The name does not start with unnecessary prefi...	6	There is no reference to specific years in the...	10	The name does not follow title case capitaliza...	4
10572	b549c5b6-IMF_BOP_BFDDD_BP6-ce022971e592e69b60d...	IMF_BOP_BFDDD_BP6	stop	Balance of Payments: Financial Account - Direc...	The revised version clarifies the meaning by e...	Add: BOP\nReason: Expanded to 'Balance of Paym...	The name uses technical jargon such as 'BOP' a...	4	The name is 95 characters long, which exceeds ...	3	...	The name does not specify any units of measure...	2	The name does not include disaggregation dimen...	8	The name does not start with unnecessary prefi...	8	The name does not include reference years, whi...	10	The name does not follow title case capitaliza...	5
10573	b549c5b6-IMF_BOP_BFDDFU_BP6-8a9186e83b6c2ff771...	IMF_BOP_BFDDFU_BP6	stop	Balance of Payments: Direct Investment Debt In...	The revised version clarifies the meaning by e...	Add: BOP\nReason: Expanded to 'Balance of Paym...	The name uses technical jargon such as 'BOP' a...	4	The name is 108 characters long, which exceeds...	3	...	The name does not specify any units of measure...	2	The name includes disaggregation details ('if ...	5	The name does not start with unnecessary prefi...	6	There are no reference years included in the n...	10	The name does not follow title case capitaliza...	4

10574 rows × 24 columns

merged_df = df_resp.merge(df.reset_index(), how="left", on="indicator_code")
merged_df = merged_df[df.columns.tolist() + df_resp.columns.tolist()]

# Compute the mean rating of the original metadata value
merged_df["overall_rating"] = merged_df[merged_df.columns[merged_df.columns.str.endswith("_rating")]].mean(axis=1)


ax = sns.histplot(merged_df, x="overall_rating", bins=20, stat="probability", binrange=(0, 10)) # ["overall_rating"].hist()
ax.set_xlabel("Overall Rating")
ax.set_ylabel("% of Indicators")
sns.despine(right=True, top=True, left=False, bottom=False)

../_images/c0626e192fe49d56c236eedef2f31f91887286d9f2bc9faaefcc0a634a50d78e.png

merged_df = merged_df.sort_values("overall_rating", ascending=False)
merged_df

	name	database_id	database_name	measurement_unit	periodicity	definition_short	definition_long	methodology	economies_count	csv_link	...	units_rating	disaggregation_analysis	disaggregation_rating	prefixes_analysis	prefixes_rating	year_analysis	year_rating	capitalization_analysis	capitalization_rating	overall_rating
5973	Prevalence of anemia among children (% of chil...	WB_HCP	None	% of children aged 6-59 months	None	Prevalence of anemia (%), ages 6-59 months	Percentage of children ages 6-59 months whose ...	None	None	https://data360files.worldbank.org/data360-dat...	...	10	The name includes disaggregation by age, which...	10	The name does not use unnecessary prefixes and...	10	The name does not include any reference years,...	10	The capitalization is consistent with the guid...	10	10.000000
73	Oil rents (% of GDP)	WB_CCDR	None	Percentage of GDP	Annual	None	Oil rents are the difference between the value...	The estimates of natural resources rents are c...	None	https://data360files.worldbank.org/data360-dat...	...	10	There is no unnecessary disaggregation embedde...	10	The name does not start with unnecessary prefi...	10	There are no reference years included in the n...	10	The capitalization is consistent with title ca...	10	10.000000
94	Rural population (% of total population)	WB_CLEAR	Climate and Economic Analyses for Resilience i...	None	Annual	None	Rural population refers to people living in ru...	Rural population is calculated as the differen...	None	https://data360files.worldbank.org/data360-dat...	...	10	The name does not embed disaggregations direct...	10	The name does not start with unnecessary prefi...	10	There are no reference years included in the n...	10	The capitalization is consistent with title ca...	10	10.000000
4842	Battle-related deaths (number of people)	WB_WDI	World Development Indicators	None	Annual	None	Battle-related deaths are deaths in battle-rel...	During warfare, targets are usually the milita...	None	https://data360files.worldbank.org/data360-dat...	...	10	The name does not include any disaggregation d...	10	The name starts with the core concept being me...	10	There are no reference years included in the n...	10	The capitalization follows title case, with ma...	10	10.000000
21	Tax revenue (% of GDP)	WB_CCDR	Country Climate and Development Report	Percentage of GDP	Annual	None	Tax revenue refers to compulsory transfers to ...	The IMF's Government Finance Statistics Manual...	None	https://data360files.worldbank.org/data360-dat...	...	10	There is no unnecessary disaggregation embedde...	10	The name starts with the core concept 'Tax rev...	10	There are no reference years included in the n...	10	The capitalization is consistent with title ca...	10	10.000000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
8683	BOP, Current Account, Primary Income, Investme...	IMF_BOP	Balance of Payments (BOP) and International In...	USD, Euros, LCU	Quarterly, Annual	None	None	None	None	https://data360files.worldbank.org/data360-dat...	...	2	The name includes multiple layers of disaggreg...	3	The name does not start with unnecessary prefi...	5	There is no reference to specific years in the...	10	The capitalization is inconsistent, with some ...	4	3.555556
8250	BOP, Memorandum Items, Exceptional financing, ...	IMF_BOP	Balance of Payments (BOP) and International In...	USD, LCU, Euros	Quarterly, Annual	None	None	None	None	https://data360files.worldbank.org/data360-dat...	...	2	The name includes disaggregation details such ...	3	The name does not start with unnecessary prefi...	5	The name does not include reference years, whi...	10	The name does not follow title case capitaliza...	4	3.555556
8614	BOP, Memorandum Items, Exceptional financing, ...	IMF_BOP	Balance of Payments (BOP) and International In...	USD, LCU, Euros	Quarterly, Annual	None	None	None	None	https://data360files.worldbank.org/data360-dat...	...	2	The name includes disaggregations such as 'Non...	3	The name does not start with unnecessary prefi...	5	There are no reference years included in the n...	10	The capitalization is inconsistent, with some ...	4	3.555556
8294	BOP, Memorandum Items, Exceptional financing, ...	IMF_BOP	Balance of Payments (BOP) and International In...	USD, LCU, Euros	Quarterly, Annual	None	None	None	None	https://data360files.worldbank.org/data360-dat...	...	2	The name includes disaggregation details such ...	3	The name does not start with unnecessary prefi...	4	The name does not include reference years, whi...	10	The capitalization is inconsistent and does no...	4	3.444444
7312	LPSEV = SD: ANER; LD: TIMSS 2019 for grade 4 u...	WB_EDSTATS	None	Share (proportion)	Annual	None	This indicator uses the Minimum Proficiency Le...	None	None	https://data360files.worldbank.org/data360-dat...	...	3	The name includes disaggregation details ('gra...	4	The name does not start with unnecessary prefi...	4	The name includes a specific year ('2019'), wh...	3	The capitalization is inconsistent and does no...	4	3.111111

10574 rows × 45 columns

merged_df.groupby("database_id")["overall_rating"].mean().sort_values(ascending=False)

database_id
FAO_GFRA               10.000000
IMF_GENDER_EQUALITY    10.000000
WB_WWGD                 9.777778
WB_PPI                  9.641975
WB_MPM                  9.571429
                         ...    
IMF_FSI                 5.788833
IMF_IRFCL               5.566667
FAO_EMS                 5.111111
IMF_BOP                 4.989432
FAO_FBS                 4.333333
Name: overall_rating, Length: 154, dtype: float64

3.8. Measure semantic and lexical change#

import nltk
import numpy as np

from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer("avsolatorio/GIST-small-Embedding-v0")

name_vecs = st_model.encode(merged_df["name"], batch_size=32, show_progress_bar=True)
revised_name_vecs = st_model.encode(merged_df["revised"], batch_size=32, show_progress_bar=True)

merged_df["semantic_change"] = (1 - np.einsum("ij,ij->i", name_vecs, revised_name_vecs)).tolist()
merged_df["lexical_change"] = merged_df.apply(lambda x: nltk.edit_distance(x["name"], x["revised"]), axis=1)

Batches: 100%|██████████| 331/331 [00:03<00:00, 105.47it/s]
Batches: 100%|██████████| 331/331 [00:02<00:00, 124.71it/s]

ax = sns.histplot(merged_df, x="semantic_change", bins=10, stat="probability") # ["overall_rating"].hist()
ax.set_xlabel("Semantic Change")
ax.set_ylabel("% of Indicators")
sns.despine(right=True, top=True, left=False, bottom=False)

../_images/4189d9b595db03e896d6ecac5018c9655acf927d86065b9f84890360c164eb84.png

merged_df["lexical_change_rate"] = (merged_df["lexical_change"] / merged_df["name"].map(len))

# merged_df["lexical_change_rate"] = merged_df["lexical_change_rate"].map(lambda x: x if x < 1 else 1)

ax = sns.histplot(merged_df, x="lexical_change_rate", bins=10, binrange=(0, 1), stat="probability") # ["overall_rating"].hist()
ax.set_xlabel("Lexical Change Rate")
ax.set_ylabel("% of Indicators")
sns.despine(right=True, top=True, left=False, bottom=False)

../_images/362ff18eab831aa427c95329a5b79b5dc5a7782cc2ba8da1fd191bdea2c97459.png

merged_df.to_excel(os.path.join(QUALITY_DATA_DIR, f"assessment-indicator-name-{datetime.now().date()}.xlsx"), index=False)

merged_df.groupby("verticals")["overall_rating"].mean()

verticals
Digital                      8.422222
Digital;Prosperity           8.435626
Infrastructure               8.802168
Infrastructure;People        9.000000
Infrastructure;Planet        8.888889
Infrastructure;Prosperity    9.042838
N/A                          8.733778
People                       8.717756
People;Planet                8.450980
People;Prosperity            9.407407
Planet                       8.822258
Planet;Prosperity            9.065134
Prosperity                   7.189197
Name: overall_rating, dtype: float64

merged_df = df_resp.merge(df.reset_index(), how="left", on="indicator_code")
merged_df = merged_df[df.columns.tolist() + df_resp.columns.tolist()]

# Compute the mean rating of the original metadata value
merged_df["overall_rating"] = merged_df[merged_df.columns[merged_df.columns.str.endswith("_rating")]].mean(axis=1)

merged_df["overall_rating"].hist()

<Axes: >

../_images/d7202d0e99ff0ddcd6bb7f140ae011abcb1dbf2932d8e5a20683161be7204998.png

merged_df = merged_df.sort_values("overall_rating", ascending=False)
merged_df

	id	dataset	title	definition_short	definition_long	measurement_unit	sources	methodology	license	Planet	...	audience_rating	grammar_analysis	grammar_rating	terminology_analysis	terminology_rating	form_analysis	form_rating	abbreviation_analysis	abbreviation_rating	overall_rating
7886	6699	WB_EDSTATS	Total net enrolment rate, primary, gender pari...	NaN	Ratio of female total net enrolment rate for p...	Share (proportion)	[{'name': 'Education Statistics', 'organizatio...	NaN	[{'name': 'CC BY-4.0', 'uri': 'https://datacat...	0	...	10.0	The definition is written in full, complete se...	10.0	The terminology is consistent throughout the d...	10.0	The definition is concise, consisting of 3 com...	10.0	The abbreviation 'GPI' is clearly explained as...	10.0	10.0
7846	6577	WB_EDSTATS	Gross enrolment ratio, lower secondary, gender...	NaN	Ratio of female gross enrolment ratio for lowe...	Share (proportion)	[{'name': 'Education Statistics', 'organizatio...	NaN	[{'name': 'CC BY-4.0', 'uri': 'https://datacat...	0	...	10.0	The definition uses full, complete sentences w...	10.0	The terminology is consistent and technically ...	10.0	The definition is concise, consisting of three...	10.0	The abbreviation GPI is clearly explained as '...	10.0	10.0
7847	6578	WB_EDSTATS	Gross enrolment ratio, upper secondary, gender...	NaN	Ratio of female gross enrolment ratio for uppe...	Share (proportion)	[{'name': 'Education Statistics', 'organizatio...	NaN	[{'name': 'CC BY-4.0', 'uri': 'https://datacat...	0	...	10.0	The definition is written in full, complete se...	10.0	The terminology is consistent and technically ...	10.0	The definition is concise, consisting of 3 com...	10.0	The abbreviation 'GPI' is clearly explained as...	10.0	10.0
4358	6837	WB_EDSTATS	Out-of-school rate for children, adolescents a...	NaN	Proportion of children, adolescents, and youth...	Share (proportion)	[{'name': 'Education Statistics', 'organizatio...	To calculate the indicator, the number of stud...	[{'name': 'CC BY-4.0', 'uri': 'https://datacat...	0	...	10.0	The definition is written in a complete senten...	10.0	The terminology is consistent and technically ...	10.0	The definition is concise, consisting of one c...	10.0	There are no acronyms or abbreviations in the ...	10.0	10.0
4338	6834	WB_EDSTATS	Out-of-school rate for children, adolescents a...	NaN	Proportion of children, adolescents, and youth...	Share (proportion)	[{'name': 'Education Statistics', 'organizatio...	To calculate the indicator, the number of stud...	[{'name': 'CC BY-4.0', 'uri': 'https://datacat...	0	...	10.0	The definition is written in a complete senten...	10.0	The terminology is consistent and technically ...	10.0	The definition is concise, consisting of a sin...	10.0	There are no acronyms or abbreviations in the ...	10.0	10.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4813	10554	OECD_MSTI	Civil GERD	NaN	Civil GERD	% of GDP	[{'name': 'Main Science and Technology Indicat...	https://www.oecd.org/sti/inno/MSTI_documentati...	[{'name': 'License Specified Externally', 'uri...	0	...	0.0	The definition is not a complete sentence and ...	0.0	The definition does not use any terminology, c...	0.0	The definition is not concise or well-structur...	0.0	The definition does not spell out the acronym ...	0.0	0.0
1538	5972	WEF_TTDI	T&T GDP multiplier	T&T GDP multiplier	T&T GDP multiplier	Number	[{'name': 'Travel & Tourism Development Index ...	https://www.weforum.org/reports/travel-and-tou...	[{'name': 'CC BY-NC-ND', 'uri': 'https://creat...	0	...	0.0	The definition is not a complete sentence and ...	0.0	The definition does not use any technical lang...	0.0	The definition is not concise or well-structur...	0.0	The definition does not spell out the acronym ...	0.0	0.0
873	11785	WEF_GCI	GCI 4.0: Railroad	GCI 4.0: Railroad	GCI 4.0: Railroad	Rank, 0-100 scale	[{'name': 'WEF Global Competitiveness Index 4....	Of the 98 indicators composing the GCI 4.0 met...	[{'name': 'CC BY-NC-ND', 'uri': 'https://creat...	0	...	0.0	The definition is not a complete sentence and ...	0.0	There is no use of terminology in the definiti...	0.0	The definition is not concise or well-structur...	0.0	The definition does not spell out the acronym ...	0.0	0.0
876	11740	WEF_GCI	GCI 4.0: Depth	GCI 4.0: Depth	GCI 4.0: Depth	Rank, 0-100 scale	[{'name': 'WEF Global Competitiveness Index 4....	Of the 98 indicators composing the GCI 4.0 met...	[{'name': 'CC BY-NC-ND', 'uri': 'https://creat...	0	...	0.0	The definition is not a complete sentence and ...	0.0	There is no use of terminology in the definiti...	0.0	The definition is not concise or well-structur...	0.0	The definition does not explain any acronyms o...	0.0	0.0
4486	6703	WB_EDSTATS	Total net enrolment rate, upper secondary, gen...	NaN	Ratio of female total net enrolment rate for u...	Share (proportion)	[{'name': 'Education Statistics', 'organizatio...	For more information, consult the UIS website:...	[{'name': 'CC BY-4.0', 'uri': 'https://datacat...	0	...	NaN	None	NaN	None	NaN	None	NaN	None	NaN	NaN

8093 rows × 38 columns

(merged_df["definition_long"] == merged_df["revised"]).sum()

np.int64(0)

print(merged_df["revised"].iloc[1])

The Gross Enrolment Ratio, Lower Secondary, Gender Parity Index (GPI) measures the ratio of female to male gross enrolment ratios in lower secondary education. It is calculated by dividing the female gross enrolment ratio by the male gross enrolment ratio. A GPI of 1 indicates gender parity, while a value less than 1 suggests a disparity favoring males, and a value greater than 1 suggests a disparity favoring females.

merged_df.groupby("dataset")["overall_rating"].mean().sort_values(ascending=False)

dataset
FAO_GFRA               9.428571
OHI_OHI                9.000000
IMF_GENDER_EQUALITY    8.714286
WB_FINDEX              8.528000
WB_HCP                 8.515152
                         ...   
IMF_FFS                1.741497
IMF_EPE                1.734694
OECDWBG_PMR            1.662338
IMF_ED                 1.571429
IMF_HPDD               1.285714
Name: overall_rating, Length: 178, dtype: float64

merged_df.to_excel(os.path.join(QUALITY_DATA_DIR, "assessment-definition_long.xlsx"), index=False)

p_df = merged_df[merged_df["verticals"] == "Prosperity"]
p_df

	name	database_id	database_name	measurement_unit	periodicity	definition_short	definition_long	methodology	economies_count	csv_link	...	prefixes_analysis	prefixes_rating	year_analysis	year_rating	capitalization_analysis	capitalization_rating	overall_rating	semantic_change	lexical_change	lexical_change_rate
4274	Tax revenue (% of GDP)	WB_WDI	World Development Indicators	None	Annual	Tax revenue (% of GDP)	Tax revenue refers to compulsory transfers to ...	The IMF's Government Finance Statistics Manual...	None	https://data360files.worldbank.org/data360-dat...	...	The name does not start with unnecessary prefi...	10	There are no reference years included in the n...	10	The capitalization follows the title case styl...	10	10.000000	0.000000e+00	1	0.045455
4276	Gross capital formation (% of GDP)	WB_WDI	World Development Indicators	None	Annual	Gross capital formation (% of GDP)	Gross capital formation (formerly gross domest...	Gross domestic product (GDP) from the expendit...	None	https://data360files.worldbank.org/data360-dat...	...	The name starts with the core concept 'Gross c...	10	There are no reference years included in the n...	10	The capitalization follows title case, with ma...	10	10.000000	0.000000e+00	2	0.058824
4279	GDP per capita (current US$)	WB_WDI	World Development Indicators	None	Annual	GDP per capita (current US$)	GDP per capita is gross domestic product divid...	For more information, see the metadata for cur...	None	https://data360files.worldbank.org/data360-dat...	...	The name starts with the core concept 'GDP per...	10	The name uses 'current US$' instead of a speci...	10	The capitalization follows title case, with ma...	10	10.000000	1.192093e-07	0	0.000000
4280	GNI per capita, Atlas method (current US$)	WB_WDI	World Development Indicators	None	Annual	GNI per capita, Atlas method (current US$)	GNI per capita (formerly GNP per capita) is th...	The World Bank uses Atlas method GNI per capit...	None	https://data360files.worldbank.org/data360-dat...	...	The name starts with the core concept 'GNI per...	10	The name uses 'current US$' instead of a speci...	10	The capitalization follows title case, with ma...	10	10.000000	5.960464e-08	4	0.095238
4272	Revenue, excluding grants (% of GDP)	WB_WDI	World Development Indicators	None	Annual	Revenue, excluding grants (% of GDP)	Revenue is cash receipts from taxes, social co...	The IMF's Government Finance Statistics Manual...	None	https://data360files.worldbank.org/data360-dat...	...	The name does not start with unnecessary prefi...	10	There are no reference years included in the n...	10	The capitalization follows title case, which i...	10	10.000000	5.008996e-03	3	0.083333
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
8032	BOP, Memorandum Items, Exceptional financing, ...	IMF_BOP	None	USD, LCU, Euros	Annual, Quarterly	None	None	None	None	https://data360files.worldbank.org/data360-dat...	...	The name does not start with unnecessary prefi...	5	The name does not include reference years, whi...	10	The capitalization is inconsistent and does no...	4	3.555556	1.565850e-02	102	0.421488
8683	BOP, Current Account, Primary Income, Investme...	IMF_BOP	Balance of Payments (BOP) and International In...	USD, Euros, LCU	Quarterly, Annual	None	None	None	None	https://data360files.worldbank.org/data360-dat...	...	The name does not start with unnecessary prefi...	5	There is no reference to specific years in the...	10	The capitalization is inconsistent, with some ...	4	3.555556	2.211583e-02	161	0.774038
8250	BOP, Memorandum Items, Exceptional financing, ...	IMF_BOP	Balance of Payments (BOP) and International In...	USD, LCU, Euros	Quarterly, Annual	None	None	None	None	https://data360files.worldbank.org/data360-dat...	...	The name does not start with unnecessary prefi...	5	The name does not include reference years, whi...	10	The name does not follow title case capitaliza...	4	3.555556	1.329958e-02	153	0.642857
8614	BOP, Memorandum Items, Exceptional financing, ...	IMF_BOP	Balance of Payments (BOP) and International In...	USD, LCU, Euros	Quarterly, Annual	None	None	None	None	https://data360files.worldbank.org/data360-dat...	...	The name does not start with unnecessary prefi...	5	There are no reference years included in the n...	10	The capitalization is inconsistent, with some ...	4	3.555556	1.146848e-01	131	0.629808
8294	BOP, Memorandum Items, Exceptional financing, ...	IMF_BOP	Balance of Payments (BOP) and International In...	USD, LCU, Euros	Quarterly, Annual	None	None	None	None	https://data360files.worldbank.org/data360-dat...	...	The name does not start with unnecessary prefi...	4	The name does not include reference years, whi...	10	The capitalization is inconsistent and does no...	4	3.444444	9.184921e-02	97	0.411017

5834 rows × 48 columns

# Some simple checks of missingness and duplicates
print((p_df["name"] == p_df["definition_long"]).sum())
print(p_df["definition_long"].isnull().sum())
print(p_df["definition_long"].dropna().duplicated().sum())

1060
2757
214

p_df["verticals"]

  Prosperity
  Prosperity
  Prosperity
  Prosperity
  Prosperity
           ...    
  Prosperity
  Prosperity
  Prosperity
  Prosperity
  Prosperity
Name: verticals, Length: 5834, dtype: object