Weakly Supervised Labeling of Documents (PART 1)

Objective

This notebook demonstrates how to use structured outputs from OpenAI’s GPT-4o-mini model to label climate-related research papers. The task is to analyze academic texts, identify and classify mentions of datasets, and keep the surrounding context consistent across pages.

Workflow

PDF Text Extraction:

  • Use PyMuPDF to extract pages from PDF documents.

  • Prefilter document pages with a trained Hugging Face (HF) classifier.

Weakly Supervised Data Labeling

  • Use the GPT-4o-mini model with a customized prompt for structured data extraction.

LLM as a Judge (Validation & Error Correction):

  • Use an LLM to validate extracted dataset mentions.

  • Correct or remove errors in dataset identification.

  • Filter only valid dataset mentions (valid: true), discarding invalid entries.

Autonomous Reasoning Agent

  • Use a reasoning pipeline to validate the LLM-as-a-judge output.

Next Steps

  • Scale this into batch processing of multiple files or a directory of research papers.

This workflow demonstrates a weakly supervised approach to labeling documents, specifically focusing on identifying and classifying dataset mentions in research papers.

Install Required Packages

%%capture
!pip install pymupdf openai nltk scikit-learn python-dotenv networkx transformers torch pydantic requests tqdm

Helper Functions

import pymupdf
import requests
import tempfile


def load_doc(fname_or_url: str, n_pages: int = 1) -> list:
    """
    Loads a PDF document from a file or URL and extracts content from it.

    Args:
        fname_or_url (str): The path to the PDF file or a URL where the file can be downloaded.
        n_pages (int, optional): The number of pages to extract. Defaults to 1.

    Returns:
        list: A list of dictionaries containing the extracted text and page indices.

    Raises:
        ValueError: If the number of pages is not greater than 0.
        Exception: If the PDF file fails to download from the specified URL or if there's an issue loading the document.
    """

    # Validate that the number of pages is greater than 0
    if n_pages <= 0:
        raise ValueError("The number of pages must be greater than 0.")

    def _load_doc(fname: str) -> list:
        """
        Builds one entry per sliding window of `n_pages` successive pages.

        Args:
            fname (str): The path to the PDF file.

        Returns:
            list: A list of dictionaries containing the extracted text and page indices.
        """

        # Initialize an empty list to store the contents
        contents = []

        # Open the PDF document
        doc = pymupdf.open(fname)

        # Iterate over the pages, skipping the last n_pages - 1 pages
        for page_idx in range(len(doc) - (n_pages - 1)):
            # Extract text from each of the next n_pages pages and store it as a dictionary
            contents.append(
                dict(
                    text="\n\n".join(
                        [doc[page_idx + i].get_text() for i in range(n_pages)]
                    ),
                    pages=[page_idx + i for i in range(n_pages)],
                )
            )

        # Validate that all pages were loaded successfully
        assert len(doc) - (n_pages - 1) == len(contents), "Failed to load all pages."

        return contents

    # Check if the file or URL starts with 'http:' or 'https:'
    if fname_or_url.startswith(("http:", "https:")):
        # Download the PDF file from the specified URL
        with tempfile.NamedTemporaryFile(suffix=".pdf") as temp_pdf:
            response = requests.get(fname_or_url, stream=True)
            if response.status_code == 200:
                # Write the downloaded data to the temporary file
                for chunk in response.iter_content(chunk_size=8192):
                    temp_pdf.write(chunk)
                # Seek back to the beginning of the file and return the loaded document
                temp_pdf.seek(0)
                return _load_doc(temp_pdf.name)
            else:
                # Raise an exception if there's an issue with the download or loading the document
                raise Exception(
                    f"Failed to download PDF, status code: {response.status_code}"
                )

    else:
        # If it's not a URL, simply load the document from the specified file path
        return _load_doc(fname_or_url)
# Load the document from a URL with PyMuPDF. A local filename also works, and you can loop over a list of URLs or filenames.
url_path = "https://documents1.worldbank.org/curated/en/776741468181503442/pdf/The-local-socioeconomic-effects-of-gold-mining-evidence-from-Ghana.pdf"
loaded_doc = load_doc(url_path, n_pages=1)
# inspect the loaded document
len(loaded_doc)  # number of pages
47
print(loaded_doc[3]["text"][:500])
2 
1 Introduction 
The mining sector in Africa is growing rapidly and is the main recipient of foreign direct 
investment (World Bank 2011). The welfare effects of this sector are not well understood, 
although a literature has recently developed around this question. The main contribution of this 
paper is to shed light on the welfare effects of gold mining in a detailed, in-depth country study 
of Ghana, a country with a long tradition of gold mining and a recent, large expansion in capital-
i
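The `n_pages` argument of `load_doc` acts as a sliding window: each entry holds the text of `n_pages` consecutive pages, and the window advances one page at a time. The sketch below reproduces only that index arithmetic, without opening a PDF. The helper name `page_windows` is hypothetical and is not part of the notebook.

```python
def page_windows(num_pages: int, n_pages: int = 1) -> list[list[int]]:
    """Return the page-index groups produced by a sliding window of size n_pages."""
    if n_pages <= 0:
        raise ValueError("n_pages must be greater than 0")
    # The last n_pages - 1 pages cannot start a full window, so they are skipped as starts.
    return [
        [start + offset for offset in range(n_pages)]
        for start in range(num_pages - (n_pages - 1))
    ]


windows = page_windows(num_pages=5, n_pages=2)
print(windows)  # [[0, 1], [1, 2], [2, 3], [3, 4]]
```

With `n_pages=1`, as used above, every page becomes its own entry, which is why the 47-page document yields 47 entries.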

Load utility functions

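import json
import os

import nltk
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# sent_tokenize needs the Punkt sentence tokenizer data; newer NLTK releases look for "punkt_tab".
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)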


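def find_best_matching_span(text, snippet, window: int = 1):
    """
    Find the sentence in `text` most similar to `snippet` (TF-IDF cosine similarity)
    and return it together with `window` neighbouring sentences on each side.
    """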
    sents = sent_tokenize(text)
    tfidf = TfidfVectorizer(ngram_range=(1, 3))
    mi_vec = tfidf.fit_transform([snippet])
    sents_vec = tfidf.transform(sents)

    mx_idx = cosine_similarity(mi_vec, sents_vec).flatten().argmax()
    span_sents = sents[max(mx_idx - window, 0) : min(mx_idx + window + 1, len(sents))]

    return {
        "match_idx": mx_idx,
        "match_sent": sents[mx_idx],
        "match_span_sents": span_sents,
        "match_span": " ".join(span_sents),
    }


def find_empirical_span(
    text: str, sentences: list, best_match_idx: int, window: int = 1
):
    # Define the start and end indices to include adjacent sentences for context
    start_idx = text.index(sentences[max(best_match_idx - window, 0)])
    last_sent = sentences[min(best_match_idx + window, len(sentences) - 1)]
    # Search for last_sent from start_idx onward. Searching the whole text would
    # match an earlier occurrence if the same sentence also appears before the span.
    end_idx = start_idx + text[start_idx:].index(last_sent) + len(last_sent)

    # Extract the final span
    context_span = text[start_idx:end_idx]

    return {
        "empirical_span": context_span,  # Extracted span
        "start_idx": start_idx,
        "end_idx": end_idx,
    }


def get_empirical_mentioned_in(
    text, mentioned_in, window: int = 1, with_match_output: bool = False
):
    """
    Extract the most relevant span of text from the original document (`text`)
    that matches the `mentioned_in` field. Returns the span, label, start, and end indices.
    """
    # Tokenize the text into sentences
    sentences = sent_tokenize(text)
    match_output = find_best_matching_span(text, mentioned_in, window=window)
    best_match_idx = match_output["match_idx"]

    output = find_empirical_span(text, sentences, best_match_idx, window=window)
    output["empirical_mentioned_in"] = output.pop("empirical_span")

    output = {
        "label": "mentioned_in",  # Label as "mentioned_in"
        **output,
    }

    if with_match_output:
        output.update(match_output)

    return output
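`find_empirical_span` maps sentence indices back to character offsets in the page text, and it looks for the closing sentence only from the span's start onward. The following minimal sketch shows why that matters when a sentence repeats. It assumes a naive regex splitter in place of NLTK's `sent_tokenize`, and the names `naive_sentences` and `span_offsets` are hypothetical.

```python
import re


def naive_sentences(text: str) -> list[str]:
    """Split on sentence-ending punctuation followed by whitespace (stand-in for sent_tokenize)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def span_offsets(text: str, sentences: list[str], match_idx: int, window: int = 1) -> tuple[int, int]:
    """Return (start, end) character offsets of the matched sentence plus its neighbours."""
    first = sentences[max(match_idx - window, 0)]
    last = sentences[min(match_idx + window, len(sentences) - 1)]
    start = text.index(first)
    # Search for the closing sentence from `start` onward so that an identical
    # sentence earlier in the text is not picked up by mistake.
    end = start + text[start:].index(last) + len(last)
    return start, end


text = "See Table 1. We use the DHS survey. Results follow. See Table 1. The end."
sentences = naive_sentences(text)
start, end = span_offsets(text, sentences, match_idx=2, window=1)
print(text[start:end])  # We use the DHS survey. Results follow. See Table 1.
```

Calling `text.index(last)` on the whole string would return offset 0 for the repeated "See Table 1." sentence, which lies before `start` and would yield an empty span.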
def chunk_text(text, tokenizer, max_length=500):
    """
    Split the text into chunks of max_length tokens, ensuring no chunk exceeds the model's token limit,
    and includes special tokens properly.

    Args:
        text (str): The input text to be chunked.
        tokenizer: The tokenizer to use for encoding and decoding.
        max_length (int): The maximum length of tokens allowed in each chunk, including special tokens.

    Returns:
        list: A list of text chunks as strings.
    """
    # Reserve space for special tokens (e.g., [CLS], [SEP])
    special_tokens_count = 2  # Adjust based on the tokenizer's special token usage
    chunk_size = max_length - special_tokens_count

    # Tokenize the text into token IDs without truncation
    tokens = tokenizer.encode(text, add_special_tokens=False)

    # Split the tokens into chunks
    chunks = []
    for i in range(0, len(tokens), chunk_size):
        token_chunk = tokens[i : i + chunk_size]
        # Add special tokens to the chunk
        token_chunk_with_specials = (
            [tokenizer.cls_token_id] + token_chunk + [tokenizer.sep_token_id]
        )
        # Decode the chunk back to text (named so it does not shadow the chunk_text function)
        decoded_chunk = tokenizer.decode(
            token_chunk_with_specials, skip_special_tokens=False
        )
        chunks.append(decoded_chunk)

    return chunks
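The chunk size used by `chunk_text` is `max_length` minus the two slots reserved for special tokens, so a 500-token limit leaves 498 content tokens per chunk. The sketch below illustrates only that arithmetic on a list of made-up token ids. It does not call a real tokenizer, and `split_token_ids` is a hypothetical helper.

```python
def split_token_ids(
    token_ids: list[int], max_length: int = 500, special_tokens: int = 2
) -> list[list[int]]:
    """Split token ids into chunks that leave room for the special tokens."""
    chunk_size = max_length - special_tokens
    return [token_ids[i : i + chunk_size] for i in range(0, len(token_ids), chunk_size)]


token_ids = list(range(1200))  # pretend these are 1,200 token ids from one page
chunks = split_token_ids(token_ids, max_length=500)
print([len(c) for c in chunks])  # [498, 498, 204]
```

Only the last chunk can be shorter than the budget, and concatenating the chunks gives back the original token sequence.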
def save_text_per_document(text, text_output_path, page_idx):
    """
    Save cleaned text for each page to a single JSON file, appending page data.

    Parameters:
        text (str): The cleaned text for the current page.
        text_output_path (str): The path to the text JSON file.
        page_idx (int): The current page index.

    Returns:
        None
    """
    # Load existing text data or create a new structure
    if os.path.exists(text_output_path):
        with open(text_output_path, "r", encoding="utf-8") as existing_file:
            text_data = json.load(existing_file)
    else:
        text_data = {
            "source": os.path.splitext(os.path.basename(text_output_path))[0],
            "pages": {},
        }

    # Add text for the current page
    text_data["pages"][str(page_idx + 1)] = text

    # Save the updated text data
    os.makedirs(os.path.dirname(text_output_path), exist_ok=True)
    with open(text_output_path, "w", encoding="utf-8") as text_file:
        json.dump(text_data, text_file, indent=4)
# load helper functions

from copy import deepcopy
import networkx as nx


def consolidate_dataset(raw_text: str, data: dict):
    text = raw_text
    page_data = {"dataset_used": data.get("dataset_used", False), "data_mentions": []}

    G = nx.Graph()
    sents = sent_tokenize(text)
    _datasets = []

    for ds in data.get("dataset", []):
        mentioned_in = ds.pop("mentioned_in") or ""

        try:
            mi = find_best_matching_span(mentioned_in, ds["raw_name"], window=0)
            mi = mi["match_span"]
            match_output = find_best_matching_span(text, mi, window=1)
        except ValueError:
            # Likely that the `mentioned_in` is not found in the text or not correct.
            # We try expanding the search to the entire text.
            match_output = find_best_matching_span(text, ds["raw_name"], window=1)

        ds["sent_spans"] = match_output["match_span_sents"]
        sents_idx = sorted([sents.index(s) for s in ds["sent_spans"]])
        ds["sent"] = match_output["match_sent"]
        ds["sent_idx"] = sents_idx

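        # Register the indices as nodes first so a single-sentence span still forms a component
        G.add_nodes_from(sents_idx)
        G.add_edges_from(zip(sents_idx[:-1], sents_idx[1:]))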
        _datasets.append(ds)

    _datasets = sorted(_datasets, key=lambda x: x["sent_idx"][0])

    # The connected components in the graphs form the `mentioned_in`s.
    mentioned_ins = sorted(
        [sorted(x) for x in nx.connected_components(G)], key=lambda x: x[0]
    )
    updated_mentions = []

    for midx in mentioned_ins:
        _mi = {"mentioned_in": " ".join([sents[i] for i in midx]), "datasets": []}

        for ds in _datasets:
            ds = deepcopy(ds)
            if ds["sent_idx"][0] in midx:
                ds.pop("sent_idx")
                ds.pop("sent_spans")
                _mi["datasets"].append(ds)

        updated_mentions.append(_mi)

    page_data["data_mentions"] = updated_mentions

    return page_data


def save_output_per_document(raw_text, data, output_path, page_idx):
    """
    Save output data to a JSON file per document, appending new page data.

    Parameters:
        raw_text (str): The raw page text, used to locate the sentences that mention each dataset.
        data (dict): The extraction output for the page (a dumped `LabelledResponseFormat`).
        output_path (str): The output path for the document-wide JSON file.
        page_idx (int): The zero-based index of the page being processed.

    Returns:
        None
    """

    # Restructure and consolidate dataset if possible
    page_data = consolidate_dataset(raw_text, data)

    # Initialize the new page's data structure
    page_data = {"page": page_idx + 1, **page_data}

    # Check if the file already exists
    if os.path.exists(output_path):
        with open(output_path, "r", encoding="utf-8") as existing_file:
            document_data = json.load(existing_file)
    else:
        # Create a new JSON structure
        document_data = {
            "source": os.path.splitext(os.path.basename(output_path))[0],
            "pages": [],
        }

    # Append the new page data
    document_data["pages"].append(page_data)

    # Save the updated document data back to the file
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as output_file:
        json.dump(document_data, output_file, indent=4)
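`consolidate_dataset` groups dataset mentions whose sentence windows overlap: it links consecutive sentence indices as graph edges and takes the connected components with networkx. The following sketch shows the same grouping idea without any dependency, replacing the graph with a sort-and-merge pass. It assumes every window is a contiguous run of sentence indices, and `merge_sentence_windows` is a hypothetical name.

```python
def merge_sentence_windows(windows: list[list[int]]) -> list[list[int]]:
    """Merge windows of sentence indices that share at least one index."""
    merged: list[set[int]] = []
    # Sorting by the first index means a window can only overlap the latest group.
    for window in sorted(windows, key=min):
        current = set(window)
        if merged and merged[-1] & current:
            merged[-1] |= current
        else:
            merged.append(current)
    return [sorted(group) for group in merged]


windows = [[5, 6, 7], [3, 4, 5], [10, 11, 12]]
print(merge_sentence_windows(windows))  # [[3, 4, 5, 6, 7], [10, 11, 12]]
```

Two mentions that share sentence 5 end up under one `mentioned_in` passage, while the mention around sentence 11 stays separate.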
import re


def clean_extracted_text(text):
    """
    Cleans text extracted from PDFs using PyMuPDF.
    - Reduces unnecessary whitespace and artifacts while preserving meaningful structure.
    - Prevents unintentional removal of spaces or concatenation of words.
    """

    # Replace non-breaking spaces (\xa0) with regular spaces
    text = text.replace("\xa0", " ")

    # Remove control characters (ASCII 0-31) except tabs and newlines
    text = re.sub(r"[\x00-\x08\x0B-\x1F]", "", text)

    # Collapse excessive newlines (more than 2) but preserve single newlines
    text = re.sub(r"\n{3,}", "\n\n", text)

    # Collapse multiple spaces but preserve single spaces between words
    text = re.sub(r"[ \t]{2,}", " ", text)

    # Rejoin words hyphenated at a line break, keeping the dash (e.g., "address-\nclimate" to "address-climate")
    text = re.sub(r"([a-zA-Z])-\n([a-zA-Z])", r"\1-\2", text)

    # Trim leading/trailing spaces and newlines
    text = text.strip()

    return text
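To see what the whitespace and control-character substitutions in `clean_extracted_text` do, the sketch below replays the first four steps on a made-up string. The sample input is an assumption chosen to trigger each rule; it does not come from the paper.

```python
import re

raw = "Gold\xa0mining   in\tGhana\x0c\n\n\n\nSection 2"

text = raw.replace("\xa0", " ")                    # non-breaking space -> regular space
text = re.sub(r"[\x00-\x08\x0B-\x1F]", "", text)   # drops the form feed, keeps \t and \n
text = re.sub(r"\n{3,}", "\n\n", text)             # four newlines -> two
text = re.sub(r"[ \t]{2,}", " ", text)             # runs of spaces/tabs -> one space

print(repr(text))  # 'Gold mining in\tGhana\n\nSection 2'
```

The single tab survives because the last pattern only matches runs of two or more spaces or tabs.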
from typing import Callable


def should_process_page(text: str, classifier: Callable, tokenizer) -> bool:
    """Determine whether a page should be processed."""

    chunks = chunk_text(text, tokenizer, max_length=500)
    return any(classifier(chunk)[0]["label"] != "NO_DATA" for chunk in chunks)
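`should_process_page` keeps a page as soon as one of its chunks gets a label other than `NO_DATA`. The sketch below isolates that rule with a stub in place of the Hugging Face pipeline. The stub, its keyword rule, and the `DATA` label are hypothetical; only the `NO_DATA` label and the `[{"label": ...}]` output shape are taken from the code above.

```python
from typing import Callable


def any_chunk_has_data(chunks: list[str], classifier: Callable) -> bool:
    """Return True as soon as one chunk is labelled something other than NO_DATA."""
    return any(classifier(chunk)[0]["label"] != "NO_DATA" for chunk in chunks)


def stub_classifier(chunk: str) -> list[dict]:
    """Hypothetical stand-in for the pipeline: flags chunks containing the word 'survey'."""
    label = "DATA" if "survey" in chunk.lower() else "NO_DATA"
    return [{"label": label, "score": 1.0}]


print(any_chunk_has_data(["Acknowledgements", "We use the DHS survey."], stub_classifier))  # True
print(any_chunk_has_data(["Acknowledgements"], stub_classifier))  # False
```

Because `any` short-circuits, the remaining chunks of a page are not classified once one positive chunk is found.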

Weakly Supervised Labeling

from openai import OpenAI

# Optionally load environment variables (e.g., the API key) from a .env file
# from dotenv import load_dotenv
# load_dotenv()

API_KEY = "YOUR_API_KEY"
MODEL = "gpt-4o-mini"
client = OpenAI(api_key=API_KEY)  # initialize the client

Create the Prompt and Pydantic Model for Structured Outputs

from pydantic import BaseModel, Field
from typing import List, Optional
from enum import Enum


# Define Enums for categorical fields
class Context(str, Enum):
    background = "background"
    supporting = "supporting"
    primary = "primary"


class Specificity(str, Enum):
    properly_named = "properly_named"
    descriptive_but_unnamed = "descriptive_but_unnamed"
    vague_generic = "vague_generic"


class Relevance(str, Enum):
    directly_relevant = "directly_relevant"
    indirectly_relevant = "indirectly_relevant"
    not_relevant = "not_relevant"


class DatasetEntry(BaseModel):
    raw_name: Optional[str] = Field(
        ..., description="The exact dataset name as it appears in the text."
    )
    harmonized_name: Optional[str] = Field(
        None, description="The standardized or full name of the dataset."
    )
    acronym: Optional[str] = Field(
        None, description="The short name or acronym associated with the dataset."
    )
    context: Context
    specificity: Specificity
    relevance: Relevance
    mentioned_in: Optional[str] = Field(
        None, description="The exact text excerpt where the dataset is mentioned."
    )
    producer: Optional[str] = Field(
        None, description="The organization responsible for producing the dataset."
    )
    data_type: Optional[str] = Field(
        None, description="The type of data represented by the dataset."
    )


class LabelledResponseFormat(BaseModel):
    dataset: List[DatasetEntry] = Field(
        ..., description="A list of datasets mentioned in the paper."
    )
    dataset_used: bool = Field(
        ..., description="A boolean indicating if a dataset is used in the paper."
    )
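The `str`-based enums above restrict each categorical field to a fixed set of values, so a label outside that set is rejected instead of being stored silently. Pydantic applies this check during validation. The sketch below shows the same constraint using only the standard library; `parse_context` is a hypothetical helper, not part of the notebook.

```python
from enum import Enum


class Context(str, Enum):
    background = "background"
    supporting = "supporting"
    primary = "primary"


def parse_context(value: str):
    """Return the matching Context member, or None if the value is not allowed."""
    try:
        return Context(value)
    except ValueError:
        return None


print(parse_context("primary") is Context.primary)  # True
print(parse_context("main"))  # None
```

Since the enum also subclasses `str`, a member compares equal to its plain string value, which is why the labels appear as ordinary strings in the saved JSON.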
DATA_USE_TASK_PROMPT = """You are an expert in extracting and categorizing dataset mentions from research papers and policy documents. Your task is to **identify and extract all valid dataset mentions**, ensuring they are correctly classified based on naming specificity, context, and relevance.

### **What Qualifies as a Dataset?**
A dataset is a structured collection of data used for empirical research, analysis, or policy-making. Examples include:
- **Surveys & Census Data** (e.g., LSMS, DHS, national census records)
- **Indicators & Indexes** (e.g., HDI, GFSI, WDI, ND-GAIN, EPI)
- **Geospatial & Environmental Data** (e.g., OpenStreetMap, Sentinel-2 imagery)
- **Economic & Trade Data** (e.g., UN Comtrade, Balance of Payments Statistics)
- **Health & Public Safety Data** (e.g., epidemiological surveillance, crime reports)
- **Time-Series & Energy Data** (e.g., climate projections, electricity demand records)
- **Transport & Mobility Data** (e.g., road accident statistics, smart city traffic flow)
- **Other emerging dataset types** as identified in the text.

**Important:**  
If the dataset does not fit into the examples above, infer the **most appropriate category** from the context and **create a new `"data_type"` if necessary**.

### **What Should NOT Be Extracted?**
Do **not** extract mentions that do not clearly refer to a dataset, including, but not limited to:
1. **Organizations & Institutions** (e.g., WHO, IMF, UNDP, "World Bank data" unless it explicitly refers to a dataset)
2. **Reports & Policy Documents** (e.g., "Fiscal Monitor by the IMF", "IEA Energy Report"; only extract if the dataset itself is referenced)
3. **Generic Mentions of Data** (e.g., "various sources", "survey results from multiple institutions")
4. **Economic Models & Policy Frameworks** (e.g., "GDP growth projections", "macroeconomic forecasts")
5. **Legislation & Agreements** (e.g., "Paris Agreement", "General Data Protection Regulation")

### **Rules for Extraction**
1. **Extract All Structured Data Mentions**
   - If the dataset is explicitly named (e.g., "Global Fishing Watch"), label it as `"properly_named"`.
   - If the dataset is described but not explicitly named (e.g., "electricity usage data from Albania"), label it as `"descriptive_but_unnamed"`.
   - If the dataset mention is too generic (e.g., "electricity usage data"), label it as `"vague_generic"`.

2. **Ensure `"data_type"` Is Always Assigned**
   - **Use an existing category if applicable.**
   - **If no suitable category exists, create a new `"data_type"` based on context.**

3. **Classify `"context"` Correctly**
   - `"primary"`: The dataset is used for direct analysis in the document.
   - `"supporting"`: The dataset is referenced to validate or compare findings.
   - `"background"`: The dataset is mentioned as general context or prior research.

   **Examples:**
   - `"The LSMS-ISA data is analyzed to assess the impact of agricultural practices on productivity."` → `"primary"`
   - `"Our results align with previous studies that used LSMS-ISA."` → `"supporting"`
   - `"LSMS-ISA is widely recognized as a reliable data source for agricultural research."` → `"background"`

4. **Capture Full Sentence Context**
   - The `"mentioned_in"` field must always include the **full sentence** where the dataset is referenced.
   - If a dataset is mistakenly extracted from an unrelated sentence, correct it.

### **Extraction Schema**
Each extracted dataset should have the following fields:
- `raw_name`: Exact dataset name from the text (**no paraphrasing**).
- `harmonized_name`: If properly named, use directly; if referenced in multiple ways, standardize using the most precise form in the text; otherwise, set this to None.
- `acronym`: Extract if explicitly mentioned.
- `mentioned_in`: **Full sentence** where the dataset appears (**no paraphrasing**).
- `context`: **primary / supporting / background**
- `specificity`: **properly_named / descriptive_but_unnamed / vague_generic**
- `relevance`: **directly_relevant / indirectly_relevant / not_relevant**
- `producer`: **Extract only if explicitly mentioned; otherwise, set to `None`.**
- `data_type`: **Assign based on existing categories, but create new ones if necessary.**

### **Handling New or Unlisted Data Types**
- If a dataset does not fit into existing categories, **infer an appropriate name** for its `"data_type"` based on context.
- Use a **general but informative label** for new data types (e.g., `"Climate Risk Data"`, `"Social Media Analytics"`).

### **Important: Do NOT Skip Unnamed Datasets**
If a dataset is described but lacks a proper name, extract it under `"descriptive_but_unnamed"` or `"vague_generic"`, whichever is appropriate.
If `"producer"` is not mentioned, set it to `None` rather than inferring."""
from typing import List, Optional
from transformers import AutoTokenizer
from transformers import pipeline
from tqdm.auto import tqdm


def process_document_extraction(fname_or_url: str):
    # load the document one page at a time
    loaded_doc = load_doc(fname_or_url, n_pages=1)
    base_name = os.path.splitext(os.path.basename(fname_or_url))[0]

    # results are written to one JSON file per document
    extraction_path = f"./output/extracted_data/{base_name}.json"
    text_output_path = f"./output/text/{base_name}.json"
    raw_text_output_path = f"./output/raw_text/{base_name}.json"

    # use a fine-tuned BERT classifier to keep only the pages that likely contain dataset mentions

    data_model_id = "ai4data-use/bert-base-uncased-data-use"
    tokenizer = AutoTokenizer.from_pretrained(data_model_id)

    # load model from huggingface.co/models using our repository id
    classifier = pipeline(
        "text-classification", model=data_model_id, tokenizer=tokenizer
    )
    for page_idx, page in tqdm(
        enumerate(loaded_doc), desc="Processing pages", total=len(loaded_doc)
    ):
        raw_text = page["text"]
        text = clean_extracted_text(raw_text)

        # Save raw text for the page
        save_text_per_document(raw_text, raw_text_output_path, page_idx)

        # Save cleaned text for the page
        save_text_per_document(text, text_output_path, page_idx)
        # Skip empty pages and pages where the classifier labels every chunk as NO_DATA
        if not raw_text or not should_process_page(text, classifier, tokenizer):
            # print(f"skipping {page_idx}, contains no data")
            continue

        # Process the page
        completion = client.beta.chat.completions.parse(
            model=MODEL,
            temperature=0.2,  # you can tweak this if you want
            messages=[
                {"role": "system", "content": DATA_USE_TASK_PROMPT},
                {"role": "user", "content": text},
            ],
            response_format=LabelledResponseFormat,
        )

        parsed_data = completion.choices[0].message.parsed
        if parsed_data is None:
            # The model refused or returned nothing parseable; skip this page
            continue

        # Save the extraction output
        save_output_per_document(
            raw_text, parsed_data.model_dump(), extraction_path, page_idx
        )
# uncomment to run
# process_document_extraction(url_path)
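The "Next Steps" item in the workflow mentions scaling this to many files. One possible shape for that loop is sketched below, under the assumption that each document can be processed independently and that a failure on one file should not stop the batch. The helpers `process_many` and `list_pdfs` and the `./papers` directory are hypothetical.

```python
from pathlib import Path
from typing import Callable, Iterable


def process_many(sources: Iterable[str], process_one: Callable[[str], None]) -> dict[str, str]:
    """Run process_one on every source and record 'ok' or the error message."""
    status = {}
    for source in sources:
        try:
            process_one(source)
            status[source] = "ok"
        except Exception as exc:  # keep going so one bad file does not stop the batch
            status[source] = f"failed: {exc}"
    return status


def list_pdfs(directory: str) -> list[str]:
    """Return the PDF files in a directory, sorted for a reproducible order."""
    return sorted(str(p) for p in Path(directory).glob("*.pdf"))


# Example (not run here): process_many(list_pdfs("./papers"), process_document_extraction)
```

The returned status dictionary makes it easy to retry only the sources that failed.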
# inspect the output of the extraction

with open(
    "./output/extracted_data/The-local-socioeconomic-effects-of-gold-mining-evidence-from-Ghana.json",
    "r",
) as f:
    extracted_data = json.load(f)
    print(json.dumps(extracted_data, ensure_ascii=False, indent=2))
{
  "source": "The-local-socioeconomic-effects-of-gold-mining-evidence-from-Ghana",
  "pages": [
    {
      "page": 4,
      "dataset_used": true,
      "data_mentions": [
        {
          "mentioned_in": "We also allow for spillovers across \ndistricts, in a district-level analysis. We use two complementary geocoded household data sets \nto analyze outcomes in Ghana: the Demographic and Health Survey (DHS) and the Ghana \nLiving Standard Survey (GLSS), which provide information on a wide range of welfare \noutcomes. The paper contributes to the growing literature on the local effects of mining.",
          "datasets": [
            {
              "raw_name": "Demographic and Health Survey (DHS)",
              "harmonized_name": "Demographic and Health Survey (DHS)",
              "acronym": "DHS",
              "context": "primary",
              "specificity": "properly_named",
              "relevance": "directly_relevant",
              "producer": null,
              "data_type": "Surveys & Census Data",
              "sent": "We use two complementary geocoded household data sets \nto analyze outcomes in Ghana: the Demographic and Health Survey (DHS) and the Ghana \nLiving Standard Survey (GLSS), which provide information on a wide range of welfare \noutcomes."
            },
            {
              "raw_name": "Ghana Living Standard Survey (GLSS)",
              "harmonized_name": "Ghana Living Standard Survey (GLSS)",
              "acronym": "GLSS",
              "context": "primary",
              "specificity": "properly_named",
              "relevance": "directly_relevant",
              "producer": null,
              "data_type": "Surveys & Census Data",
              "sent": "We use two complementary geocoded household data sets \nto analyze outcomes in Ghana: the Demographic and Health Survey (DHS) and the Ghana \nLiving Standard Survey (GLSS), which provide information on a wide range of welfare \noutcomes."
            }
          ]
        }
      ]
    },
    {
      "page": 5,
      "dataset_used": true,
      "data_mentions": [
        {
          "mentioned_in": "Mining is also associated with more economic \nactivity measured by nightlights (Benshaul-Tolonen, 2019; Mamo et al, 2019). Kotsadam and Tolonen (2016) use DHS data from Africa, and find that mine openings cause \nwomen to shift from agriculture to service production and that women become more likely to \nwork for cash and year-round as opposed to seasonally. Continuing this analysis, Benshaul-\nTolonen (2018) explores the links between mining and female empowerment in eight gold-\nproducing countries in East and West Africa, including Ghana.",
          "datasets": [
            {
              "raw_name": "DHS data",
              "harmonized_name": "Demographic and Health Surveys (DHS)",
              "acronym": "DHS",
              "context": "primary",
              "specificity": "properly_named",
              "relevance": "directly_relevant",
              "producer": null,
              "data_type": "Health & Public Safety Data",
              "sent": "Kotsadam and Tolonen (2016) use DHS data from Africa, and find that mine openings cause \nwomen to shift from agriculture to service production and that women become more likely to \nwork for cash and year-round as opposed to seasonally."
            }
          ]
        },
        {
          "mentioned_in": "We explore the effects of mining activity on employment, earnings, expenditure, and children’s \nhealth outcomes in local communities and in districts with gold mining. We combine the DHS \nand GLSS with production data for 17 large-scale gold mines in Ghana. We find that a new \nlarge-scale gold mine changes economic outcomes, such as access to employment and cash \nearnings.",
          "datasets": [
            {
              "raw_name": "GLSS",
              "harmonized_name": "Ghana Living Standards Survey (GLSS)",
              "acronym": "GLSS",
              "context": "primary",
              "specificity": "properly_named",
              "relevance": "directly_relevant",
              "producer": null,
              "data_type": "Economic & Trade Data",
              "sent": "We combine the DHS \nand GLSS with production data for 17 large-scale gold mines in Ghana."
            }
          ]
        }
      ]
    },
    {
      "page": 7,
      "dataset_used": false,
      "data_mentions": []
    },
    {
      "page": 8,
      "dataset_used": false,
      "data_mentions": [
        {
          "mentioned_in": "12 currently active mines dominate the sector, and there are an additional five suspended mines \nthat have been in production in recent decades. Table 1 presents a full list of the mines, the year \nthey opened, and their status as of December 2012. Company name and country are for the \nmain shareowner in the mine.",
          "datasets": [
            {
              "raw_name": "Table 1 Gold Mines in Ghana",
              "harmonized_name": "Gold Mines in Ghana",
              "acronym": null,
              "context": "background",
              "specificity": "properly_named",
              "relevance": "indirectly_relevant",
              "producer": null,
              "data_type": "Mining Operations Data",
              "sent": "Table 1 presents a full list of the mines, the year \nthey opened, and their status as of December 2012."
            }
          ]
        },
        {
          "mentioned_in": "Most are open-pit mines, although a few consist of a combination of open-pit \nand underground operations. Table 1 Gold Mines in Ghana \nName \nOpening \nyear \nClosing year \nCompany \nCountry \nAhafo \n2006 \nactive \nNewmont Mining Corp. \nUSA \nBibiani \n1998 \nactive \nNoble Mineral Resources \nAustralia \nBogoso Prestea \n1990 \nactive \nGolden Star Resources \nUSA \nChirano \n2005 \nactive \nKinross Gold \nCanada \nDamang \n1997 \nactive \nGold Fields Ghana Ltd. \nSouth Africa \nEdikan (Ayanfuri) \n1994 \nactive \nPerseus Mining \nAustralia \nIduapriem \n1992 \nactive \nAngloGold Ashanti \nSouth Africa \nJeni (Bonte) \n1998 \n2003 \nAkrokeri-Ashanti \nCanada \nKonongo \n1990 \nactive \nLionGold Corp. \nSingapore \nKwabeng \n1990 \n1993 \nAkrokeri-Ashanti \nCanada \nNzema \n2011 \nactive \nEndeavour \nCanada \nObotan \n1997 \n2001 \nPMI Gold \nCanada \nObuasi \n1990 \nactive \nAngloGold Ashanti \nSouth Africa \nPrestea Sankofa \n1990 \n2001 \nAnglogold Ashanti \nSouth Africa \nTarkwa \n1990 \nactive \nGold Fields Ghana Ltd. \nSouth Africa \nTeberebie \n1990 \n2005 \nAnglogold Ashanti \nSouth Africa \nWassa \n1999 \nactive \nGolden Star Resources \nUSA \nSource: InterraRMG 2013. Note: Active is production status as of December 2012, the last available data point.",
          "datasets": [
            {
              "raw_name": "InterraRMG 2013",
              "harmonized_name": "InterraRMG 2013",
              "acronym": null,
              "context": "background",
              "specificity": "properly_named",
              "relevance": "indirectly_relevant",
              "producer": "InterraRMG",
              "data_type": "Mining Data",
              "sent": "Table 1 Gold Mines in Ghana \nName \nOpening \nyear \nClosing year \nCompany \nCountry \nAhafo \n2006 \nactive \nNewmont Mining Corp. \nUSA \nBibiani \n1998 \nactive \nNoble Mineral Resources \nAustralia \nBogoso Prestea \n1990 \nactive \nGolden Star Resources \nUSA \nChirano \n2005 \nactive \nKinross Gold \nCanada \nDamang \n1997 \nactive \nGold Fields Ghana Ltd. \nSouth Africa \nEdikan (Ayanfuri) \n1994 \nactive \nPerseus Mining \nAustralia \nIduapriem \n1992 \nactive \nAngloGold Ashanti \nSouth Africa \nJeni (Bonte) \n1998 \n2003 \nAkrokeri-Ashanti \nCanada \nKonongo \n1990 \nactive \nLionGold Corp. \nSingapore \nKwabeng \n1990 \n1993 \nAkrokeri-Ashanti \nCanada \nNzema \n2011 \nactive \nEndeavour \nCanada \nObotan \n1997 \n2001 \nPMI Gold \nCanada \nObuasi \n1990 \nactive \nAngloGold Ashanti \nSouth Africa \nPrestea Sankofa \n1990 \n2001 \nAnglogold Ashanti \nSouth Africa \nTarkwa \n1990 \nactive \nGold Fields Ghana Ltd. \nSouth Africa \nTeberebie \n1990 \n2005 \nAnglogold Ashanti \nSouth Africa \nWassa \n1999 \nactive \nGolden Star Resources \nUSA \nSource: InterraRMG 2013."
            }
          ]
        }
      ]
    },
    {
      "page": 9,
      "dataset_used": true,
      "data_mentions": [
        {
          "mentioned_in": "3 Data \nTo conduct this analysis, we combine different data sources using spatial analysis. The main \nmining data is a dataset from InterraRMG covering all large-scale mines in Ghana, explained \nin more detail in section 3.1. This dataset is linked to survey data from the DHS and GLSS, \nusing spatial information. Geographical coordinates of enumeration areas in GLSS are from \nGhana Statistical Services (GSS).2 Point coordinates (global positioning system [GPS]) for the \nsurveyed DHS clusters3 allow us to match all individuals to one or several mineral mines. We \ndo this in two ways.",
          "datasets": [
            {
              "raw_name": "InterraRMG dataset",
              "harmonized_name": "InterraRMG dataset",
              "acronym": null,
              "context": "primary",
              "specificity": "properly_named",
              "relevance": "directly_relevant",
              "producer": null,
              "data_type": "Mining Data",
              "sent": "The main \nmining data is a dataset from InterraRMG covering all large-scale mines in Ghana, explained \nin more detail in section 3.1."
            },
            {
              "raw_name": "DHS",
              "harmonized_name": "Demographic and Health Surveys",
              "acronym": "DHS",
              "context": "primary",
              "specificity": "properly_named",
              "relevance": "directly_relevant",
              "producer": null,
              "data_type": "Survey Data",
              "sent": "This dataset is linked to survey data from the DHS and GLSS, \nusing spatial information."
            },
            {
              "raw_name": "GLSS",
              "harmonized_name": "Ghana Living Standards Survey",
              "acronym": "GLSS",
              "context": "primary",
              "specificity": "properly_named",
              "relevance": "directly_relevant",
              "producer": null,
              "data_type": "Survey Data",
              "sent": "This dataset is linked to survey data from the DHS and GLSS, \nusing spatial information."
            },
            {
              "raw_name": "Ghana Statistical Services (GSS) geographical coordinates",
              "harmonized_name": "Ghana Statistical Services geographical coordinates",
              "acronym": null,
              "context": "primary",
              "specificity": "descriptive_but_unnamed",
              "relevance": "directly_relevant",
              "producer": null,
              "data_type": "Geospatial Data",
              "sent": "Geographical coordinates of enumeration areas in GLSS are from \nGhana Statistical Services (GSS).2 Point coordinates (global positioning system [GPS]) for the \nsurveyed DHS clusters3 allow us to match all individuals to one or several mineral mines."
            }
          ]
        }
      ]
    },
    {
      "page": 10,
      "dataset_used": true,
      "data_mentions": [
        {
          "mentioned_in": "8 \n \nwe use a cutoff distance of 20 km, we assume there is little economic footprint beyond that \ndistance. Of course, any such distance is arbitrarily chosen, which is why we try different \nspecifications to explore the spatial heterogeneity by varying this distance (using 10 km, 20 km, \nthrough 50 km) as well as a spatial lag structure (using 0 to 10 km, 10 to 20 km, through 40 to \n50 km distance bins).4  \nSecond, we collapse the DHS mining data at the district level.5 The number of districts has \nchanged over time in Ghana, because districts with high population growth have been split into \nsmaller districts. To avoid endogeneity concerns, we use the baseline number of districts that \nexisted at the start of our analysis period, which are 137.",
          "datasets": [
            {
              "raw_name": "DHS mining data",
              "harmonized_name": "Demographic and Health Surveys (DHS) mining data",
              "acronym": "DHS",
              "context": "primary",
              "specificity": "properly_named",
              "relevance": "directly_relevant",
              "producer": "None",
              "data_type": "Resource Data",
              "sent": "Of course, any such distance is arbitrarily chosen, which is why we try different \nspecifications to explore the spatial heterogeneity by varying this distance (using 10 km, 20 km, \nthrough 50 km) as well as a spatial lag structure (using 0 to 10 km, 10 to 20 km, through 40 to \n50 km distance bins).4  \nSecond, we collapse the DHS mining data at the district level.5 The number of districts has \nchanged over time in Ghana, because districts with high population growth have been split into \nsmaller districts."
            }
          ]
        },
        {
          "mentioned_in": "Because some mines are close to district boundaries, we additionally test whether there \nis an effect in neighboring districts. 3.1 Resource data \nThe Raw Materials Data are from InterraRMG (2013). The data set contains information on \npast or current industrial mines.",
          "datasets": [
            {
              "raw_name": "Raw Materials Data",
              "harmonized_name": "Raw Materials Data from InterraRMG",
              "acronym": "None",
              "context": "primary",
              "specificity": "properly_named",
              "relevance": "directly_relevant",
              "producer": "InterraRMG",
              "data_type": "Resource Data",
              "sent": "3.1 Resource data \nThe Raw Materials Data are from InterraRMG (2013)."
            }
          ]
        },
        {
          "mentioned_in": "All mines have information on annual production volumes, \nownership structure, and GPS coordinates on location. We complete this data with exact \ngeographic location data from MineAtlas (2013), where satellite imagery shows the actual mine \nboundaries, which allows us to identify and update the center point of each mine. The \nproduction data and ownership information are double-checked against the companies’ annual \nreports.",
          "datasets": [
            {
              "raw_name": "MineAtlas data",
              "harmonized_name": "MineAtlas geographic location data",
              "acronym": "None",
              "context": "primary",
              "specificity": "properly_named",
              "relevance": "directly_relevant",
              "producer": "MineAtlas",
              "data_type": "Geospatial Data",
              "sent": "We complete this data with exact \ngeographic location data from MineAtlas (2013), where satellite imagery shows the actual mine \nboundaries, which allows us to identify and update the center point of each mine."
            }
          ]
        }
      ]
    },
    {
      "page": 11,
      "dataset_used": true,
      "data_mentions": [
        {
          "mentioned_in": "Road data is an alternative way of defining \ndistance from mines, but time series data on roads is not available. 3.2 Household data \nWe use microdata from the DHS, obtained from standardized surveys across years and \ncountries. We combine the respondents from all four DHS standard surveys in Ghana for which \nthere are geographic identifiers.",
          "datasets": [
            {
              "raw_name": "DHS",
              "harmonized_name": "Demographic and Health Surveys",
              "acronym": "DHS",
              "context": "primary",
              "specificity": "properly_named",
              "relevance": "directly_relevant",
              "producer": null,
              "data_type": "Surveys & Census Data",
              "sent": "3.2 Household data \nWe use microdata from the DHS, obtained from standardized surveys across years and \ncountries."
            }
          ]
        },
        {
          "mentioned_in": "See Appendix table 1 for definition of \noutcome variables. We complement the analysis with household data from the GLSS collected in the years—1998–\n99, 2004–05, and 2012–13. These data are a good complement to the DHS data, because they \n                                                            \n6 The first mines were opened in 1990, prior to the first household survey.",
          "datasets": [
            {
              "raw_name": "GLSS",
              "harmonized_name": "Ghana Living Standards Survey",
              "acronym": "GLSS",
              "context": "primary",
              "specificity": "properly_named",
              "relevance": "directly_relevant",
              "producer": null,
              "data_type": "Surveys & Census Data",
              "sent": "We complement the analysis with household data from the GLSS collected in the years—1998–\n99, 2004–05, and 2012–13."
            }
          ]
        }
      ]
    },
    {
      "page": 12,
      "dataset_used": true,
      "data_mentions": [
        {
          "mentioned_in": "In addition, they provide more detailed information on labor market \nparticipation, such as exact profession (where, for example, being a miner is a possible \noutcome), hours worked, and a wage indicator. The data estimate household expenditure and \nhousehold income. Wages, income, and expenditure can, however, be difficult to measure in \neconomies where nonmonetary compensation for labor and subsistence farming are common \npractices. 4 Empirical Strategies \n4.1 Individual-level difference-in-differences  \nTime-varying data on production and repeated survey data allow us to use a difference-in-\ndifferences approach.7 However, due to the spatial nature of our data and the fact that some \nmines are spatially clustered, we use a strategy developed by Benshaul-Tolonen (2018). The \ndifference-in-difference model compares the treatment group (close to mines) before and after \nthe mine opening, while removing the change that happens in the control group (far away from \nmines) over time under the assumption that such changes reflect underlying temporal variation \ncommon to both treatment and control areas.",
          "datasets": [
            {
              "raw_name": "household expenditure and household income",
              "harmonized_name": null,
              "acronym": null,
              "context": "primary",
              "specificity": "descriptive_but_unnamed",
              "relevance": "directly_relevant",
              "producer": null,
              "data_type": "Economic Data",
              "sent": "The data estimate household expenditure and \nhousehold income."
            },
            {
              "raw_name": "data on production",
              "harmonized_name": null,
              "acronym": null,
              "context": "primary",
              "specificity": "descriptive_but_unnamed",
              "relevance": "directly_relevant",
              "producer": null,
              "data_type": "Economic Data",
              "sent": "4 Empirical Strategies \n4.1 Individual-level difference-in-differences  \nTime-varying data on production and repeated survey data allow us to use a difference-in-\ndifferences approach.7 However, due to the spatial nature of our data and the fact that some \nmines are spatially clustered, we use a strategy developed by Benshaul-Tolonen (2018)."
            },
            {
              "raw_name": "survey data",
              "harmonized_name": null,
              "acronym": null,
              "context": "primary",
              "specificity": "descriptive_but_unnamed",
              "relevance": "directly_relevant",
              "producer": null,
              "data_type": "Survey Data",
              "sent": "4 Empirical Strategies \n4.1 Individual-level difference-in-differences  \nTime-varying data on production and repeated survey data allow us to use a difference-in-\ndifferences approach.7 However, due to the spatial nature of our data and the fact that some \nmines are spatially clustered, we use a strategy developed by Benshaul-Tolonen (2018)."
            }
          ]
        }
      ]
    },
    {
      "page": 13,
      "dataset_used": true,
      "data_mentions": [
        {
          "mentioned_in": "Moreover, cluster \nfixed effects are not possible because of clusters are not repeatedly sampled over time. However, since the estimation is at individual level, all standard errors are clustered at the DHS \ncluster level. The sample is restricted to individuals living within 100 km of a deposit location (mine), so \nmany parts of Northern Ghana where there are few gold mines are not included in the analysis.",
          "datasets": [
            {
              "raw_name": "DHS",
              "harmonized_name": "Demographic and Health Surveys",
              "acronym": "DHS",
              "context": "primary",
              "specificity": "properly_named",
              "relevance": "directly_relevant",
              "producer": null,
              "data_type": "Surveys & Census Data",
              "sent": "However, since the estimation is at individual level, all standard errors are clustered at the DHS \ncluster level."
            }
          ]
        }
      ]
    },
    {
      "page": 15,
      "dataset_used": false,
      "data_mentions": []
    },
    {
      "page": 17,
      "dataset_used": true,
      "data_mentions": [
        {
          "mentioned_in": "The baseline differences in observable characteristics – in particular, \nlower levels of economic development preceding the mine opening - indicate that a cross-\nsectional approach using only the post-period may not be sufficient to understand the impact of \ngold mining on socio-economic variables. Table 2 Summary statistics for women’s survey  \n \n(1) \n(2) \n \n(3)                 (4)  \n \nBefore mining \n  \n \nDuring Mining \n \n>20 km \n<20 km \n \n>20 km \n<20 km \n \nMean \nCoefficient  \nMean \nCoefficient \n \n \n \n \n \n \nWoman Characteristics \n \n \n \n \nAge \n28.79 \n0.836 \n \n28.95 \n-0.352 \nTotal children \n2.18 \n0.417* \n \n2.56 \n-0.035 \nWealth \n3.85 \n-0.619** \n \n3.33 \n-0.028 \nNonmigrant \n0.32 \n0.123** \n \n0.33 \n-0.028 \nUrban \n0.62 \n-0.300** \n \n0.49 \n-0.150** \nNo education \n0.17 \n-0.045 \n \n0.20 \n-0.042** \n<3 years education \n0.77 \n0.035 \n \n0.74 \n0.045**",
          "datasets": [
            {
              "raw_name": "women’s survey",
              "harmonized_name": "Women's Survey Data",
              "acronym": null,
              "context": "primary",
              "specificity": "properly_named",
              "relevance": "directly_relevant",
              "producer": null,
              "data_type": "Survey Data",
              "sent": "Table 2 Summary statistics for women’s survey  \n \n(1) \n(2) \n \n(3)                 (4)  \n \nBefore mining \n  \n \nDuring Mining \n \n>20 km \n<20 km \n \n>20 km \n<20 km \n \nMean \nCoefficient  \nMean \nCoefficient \n \n \n \n \n \n \nWoman Characteristics \n \n \n \n \nAge \n28.79 \n0.836 \n \n28.95 \n-0.352 \nTotal children \n2.18 \n0.417* \n \n2.56 \n-0.035 \nWealth \n3.85 \n-0.619** \n \n3.33 \n-0.028 \nNonmigrant \n0.32 \n0.123** \n \n0.33 \n-0.028 \nUrban \n0.62 \n-0.300** \n \n0.49 \n-0.150** \nNo education \n0.17 \n-0.045 \n \n0.20 \n-0.042** \n<3 years education \n0.77 \n0.035 \n \n0.74 \n0.045**"
            }
          ]
        }
      ]
    },
    {
      "page": 18,
      "dataset_used": true,
      "data_mentions": [
        {
          "mentioned_in": "To test for exogeneity, we run regressions using baseline individual-level data to explore \nchanges in observable characteristics among women (the main part of the sample). Table 3 \nshows that there are no significant effects of the mine opening on the age structure, migration \nhistory, marital status, fertility, or education, using the difference-in-difference specification \nwith a full set of controls. If anything, it seems that women in active mining communities are \nmarginally older, more likely to never have moved, and more likely to be or have been in a \ncohabiting relationship or married.",
          "datasets": [
            {
              "raw_name": "DHS individual data",
              "harmonized_name": "Demographic and Health Surveys (DHS)",
              "acronym": "DHS",
              "context": "primary",
              "specificity": "properly_named",
              "relevance": "directly_relevant",
              "producer": null,
              "data_type": "Surveys & Census Data",
              "sent": "Table 3 \nshows that there are no significant effects of the mine opening on the age structure, migration \nhistory, marital status, fertility, or education, using the difference-in-difference specification \nwith a full set of controls."
            }
          ]
        }
      ]
    },
    {
      "page": 19,
      "dataset_used": true,
      "data_mentions": [
        {
          "mentioned_in": "There is no change in the likelihood that she is not working. These 5 categories \nstem from the same occupational variable in the DHS data, and are mutually exclusive. The \nsurveyed individual is told to report their main occupation.",
          "datasets": [
            {
              "raw_name": "DHS data",
              "harmonized_name": "Demographic and Health Surveys (DHS)",
              "acronym": "DHS",
              "context": "primary",
              "specificity": "properly_named",
              "relevance": "directly_relevant",
              "producer": null,
              "data_type": "Surveys & Census Data",
              "sent": "These 5 categories \nstem from the same occupational variable in the DHS data, and are mutually exclusive."
            }
          ]
        }
      ]
    },
    {
      "page": 21,
      "dataset_used": true,
      "data_mentions": [
        {
          "mentioned_in": "Splitting the sample by gender, we note that this decrease is \nonly statistically significant for boys at an effect size of 6.6 percentage points. Table 5 OLS estimates of birth outcomes, infant survival, and child health in the DHS individual-\nlevel analysis \n \nPANEL A                                  size at birth                        infant mortality (<12months)                  antenatal visits \n \nsmall \naverage \nlarge \n  \nall \nboys \ngirls \n  \n# visits \nat least 1 \nactive*mine \n0.022 \n0.053 \n-0.075* \n \n-0.041* \n-0.066** \n-0.020 \n \n-0.151 \n-0.007 \n \n(0.028) \n(0.041) \n(0.041) \n \n(0.022) \n(0.030) \n(0.035) \n \n(0.331) \n(0.028) \nmine \n-0.010 \n0.071** \n-0.061** \n \n0.004 \n0.008 \n0.001 \n \n0.153 \n0.000 \n \n(0.019) \n(0.028) \n(0.030) \n \n(0.015) \n(0.020) \n(0.024) \n \n(0.241) \n(0.019) \nactive \n-0.010 \n0.054** \n-0.044 \n \n0.002 \n0.014 \n-0.012 \n \n0.012 \n0.002 \n \n(0.016) \n(0.026) \n(0.027) \n \n(0.014) \n(0.022) \n(0.018) \n \n(0.209) \n(0.012) \n \n \n \n \n \n \n \n \n \n \n \nObservations \n6,771 \n6,771 \n6,771 \n \n5,356 \n2,718 \n2,638 \n \n5,704 \n5,704 \nR-squared \n0.031 \n0.054 \n0.059 \n \n0.135 \n0.160 \n0.152 \n \n0.186 \n0.062 \nMean of dep var. 
0.136 \n0.359 \n0.505 \n \n0.073 \n0.08 \n0.066 \n \n5.79 \n0.941 \n \n \n \n \n \n \n \n \n \n \n \nPANEL B \nin the last 2 weeks, had:         \nfever          cough        diarrhea \n  \nanthropometrics (WHO) in sd \nht/a              wt/a            wt/ht \n  \nhas                             \nhealth card \nactive*mine \n-0.035 \n-0.061* \n0.042 \n \n-3.532 \n-5.208 \n-0.641 \n \n0.014 \n  \n \n(0.037) \n(0.033) \n(0.027) \n \n(11.472) \n(9.283) \n(8.948) \n \n(0.027) \n \nmine \n-0.002 \n-0.006 \n-0.038 \n \n-0.828 \n3.481 \n3.853 \n \n-0.006 \n \n \n(0.031) \n(0.028) \n(0.024) \n \n(10.385) \n(8.574) \n(7.468) \n \n(0.022) \n \nactive \n0.023 \n-0.003 \n-0.033** \n \n-1.904 \n5.265 \n9.433* \n \n0.009 \n \n \n(0.020) \n(0.020) \n(0.016) \n \n(5.942) \n(5.304) \n(5.183) \n \n(0.012) \n \n \n \n \n \n \n \n \n \n \n \n \nObservations \n6,246 \n6,257 \n6,262 \n \n5,627 \n5,627 \n5,727 \n \n6,378 \n \nR-squared \n0.024 \n0.043 \n0.024 \n \n0.136 \n0.080 \n0.036 \n \n0.084 \n \nMean of dep var.",
          "datasets": [
            {
              "raw_name": "DHS individual-level analysis",
              "harmonized_name": "Demographic and Health Surveys (DHS)",
              "acronym": "DHS",
              "context": "primary",
              "specificity": "properly_named",
              "relevance": "directly_relevant",
              "producer": null,
              "data_type": "Health & Public Safety Data",
              "sent": "Table 5 OLS estimates of birth outcomes, infant survival, and child health in the DHS individual-\nlevel analysis \n \nPANEL A                                  size at birth                        infant mortality (<12months)                  antenatal visits \n \nsmall \naverage \nlarge \n  \nall \nboys \ngirls \n  \n# visits \nat least 1 \nactive*mine \n0.022 \n0.053 \n-0.075* \n \n-0.041* \n-0.066** \n-0.020 \n \n-0.151 \n-0.007 \n \n(0.028) \n(0.041) \n(0.041) \n \n(0.022) \n(0.030) \n(0.035) \n \n(0.331) \n(0.028) \nmine \n-0.010 \n0.071** \n-0.061** \n \n0.004 \n0.008 \n0.001 \n \n0.153 \n0.000 \n \n(0.019) \n(0.028) \n(0.030) \n \n(0.015) \n(0.020) \n(0.024) \n \n(0.241) \n(0.019) \nactive \n-0.010 \n0.054** \n-0.044 \n \n0.002 \n0.014 \n-0.012 \n \n0.012 \n0.002 \n \n(0.016) \n(0.026) \n(0.027) \n \n(0.014) \n(0.022) \n(0.018) \n \n(0.209) \n(0.012) \n \n \n \n \n \n \n \n \n \n \n \nObservations \n6,771 \n6,771 \n6,771 \n \n5,356 \n2,718 \n2,638 \n \n5,704 \n5,704 \nR-squared \n0.031 \n0.054 \n0.059 \n \n0.135 \n0.160 \n0.152 \n \n0.186 \n0.062 \nMean of dep var."
            }
          ]
        }
      ]
    },
    {
      "page": 27,
      "dataset_used": true,
      "data_mentions": [
        {
          "mentioned_in": "25 \n \n \nNote: Figure 5 shows the main treatment coefficients (active*mine) using the baseline estimation strategy (with \nDHS individual-level data; see table 4 for more information) in the top panel, but with different cutoffs (10 km, \n20 km, 30 km, 40 km, and 50 km). *** p<0.01, **p<0.05, *p<0.1.",
          "datasets": [
            {
              "raw_name": "DHS individual-level data",
              "harmonized_name": "Demographic and Health Surveys (DHS)",
              "acronym": "DHS",
              "context": "primary",
              "specificity": "properly_named",
              "relevance": "directly_relevant",
              "producer": null,
              "data_type": "Surveys & Census Data",
              "sent": "25 \n \n \nNote: Figure 5 shows the main treatment coefficients (active*mine) using the baseline estimation strategy (with \nDHS individual-level data; see table 4 for more information) in the top panel, but with different cutoffs (10 km, \n20 km, 30 km, 40 km, and 50 km)."
            }
          ]
        }
      ]
    },
    {
      "page": 29,
      "dataset_used": true,
      "data_mentions": [
        {
          "mentioned_in": "OLS = ordinary least squares. 6.4 Bottom 40% of the population \nTo understand the welfare effects of the bottom 40 percent of the population in the income \nscale, we split the sample according to the wealth score provided by DHS. Given the data \nstructure, which is repeated cross-section, we cannot follow a particular household that was \nidentified as belonging to the bottom 40 percent in the initial time period.",
          "datasets": [
            {
              "raw_name": "DHS",
              "harmonized_name": "Demographic and Health Surveys",
              "acronym": "DHS",
              "context": "primary",
              "specificity": "properly_named",
              "relevance": "directly_relevant",
              "producer": null,
              "data_type": "Surveys & Census Data",
              "sent": "6.4 Bottom 40% of the population \nTo understand the welfare effects of the bottom 40 percent of the population in the income \nscale, we split the sample according to the wealth score provided by DHS."
            }
          ]
        }
      ]
    },
    {
      "page": 32,
      "dataset_used": true,
      "data_mentions": [
        {
          "mentioned_in": "30 \n \nTable 11 Using GLSS: Employment on extensive and intensive margin and wages \n  \n(1) \nworked \nlast  year \n(2) \nwork 7 \ndays \n(3) \nhours \nworked \nper week \n(4) \nagri- \nculture \n(5) \nservice \nand sales \n(6) \nminer \nPanel A: Women  \n  \n  \n  \n  \n  \n  \n1. baseline  \n \n \n \n \n \n \nactive*mine \n-0.067* \n-0.032 \n3.565 \n-0.075 \n0.074 \n0.025 \n \n(0.040) \n(0.038) \n(3.140) \n(0.064) \n(0.054) \n(0.016) \n2. drop 20-40 km \n \n \n \n \n \n \nactive*mine \n-0.062 \n-0.039 \n3.849 \n-0.076 \n0.094* \n0.026* \n \n(0.040) \n(0.039) \n(3.359) \n(0.064) \n(0.057) \n(0.015) \n3. drop 2 years before \n \n \n \n \n \n \nactive*mine \n-0.067* \n-0.031 \n3.565 \n-0.087 \n0.080 \n0.024 \n \n(0.040) \n(0.038) \n(3.140) \n(0.065) \n(0.055) \n(0.016) \n4. mine FE \n \n \n \n \n \n \nactive*mine \n-0.067 \n-0.012 \n8.560* \n-0.084 \n0.104 \n0.025* \n \n(0.051) \n(0.048) \n(5.125) \n(0.075) \n(0.065) \n(0.015) \n5. mine clustering \n \n \n \n \n \n \nactive*mine \n-0.067* \n-0.032 \n3.565 \n-0.075 \n0.074 \n0.025 \n \n(0.032) \n(0.036) \n(3.521) \n(0.081) \n(0.080) \n(0.022) \n \n \n \n \n \n \n \nMean dep var. 0.727 \n0.673 \n40.39 \n42.32 \n0.391 \n0.005 \nPanel B: Men \n  \n  \n  \n  \n  \n  \n1. baseline  \n-0.086** \n-0.055 \n3.705 \n-0.058 \n-0.032 \n0.125*** \nactive*mine \n(0.041) \n(0.039) \n(3.460) \n(0.066) \n(0.036) \n(0.043) \n \n \n \n \n \n \n \n2. drop 20-40 km \n \n \n \n \n \n \nactive*mine \n-0.094** \n-0.062 \n3.893 \n-0.064 \n-0.031 \n0.126*** \n \n(0.042) \n(0.040) \n(3.842) \n(0.066) \n(0.038) \n(0.042) \n3. drop 2 years before \n \n \n \n \n \n \nactive*mine \n-0.094** \n-0.062 \n3.708 \n-0.071 \n-0.026 \n0.125*** \n \n(0.041) \n(0.039) \n(3.459) \n(0.067) \n(0.036) \n(0.043) \n4. mine FE \n \n \n \n \n \n \nactive*mine \n-0.123** \n-0.094* \n8.233 \n-0.068 \n-0.049 \n0.113** \n \n(0.057) \n(0.051) \n(5.425) \n(0.075) \n(0.044) \n(0.045) \n5. 
mine clustering \n \n \n \n \n \n \nactive*mine \n-0.086*** \n-0.055** \n3.705 \n-0.058 \n-0.032 \n0.125** \n \n(0.025) \n(0.025) \n(2.898) \n(0.086) \n(0.032) \n(0.051) \n \n \n \n \n \n \n \nMean dep var \n0.715 \n0.705 \n45.71 \n0.491 \n0.259 \n0.028 \n \nNote: The table uses GLSS data for Ghana for the survey years 1998, 2005, 2012. The sample is restricted to \nwomen and men aged 15–49.",
          "datasets": [
            {
              "raw_name": "GLSS data",
              "harmonized_name": "Ghana Living Standards Survey (GLSS)",
              "acronym": "GLSS",
              "context": "primary",
              "specificity": "properly_named",
              "relevance": "directly_relevant",
              "producer": "None",
              "data_type": "Surveys & Census Data",
              "sent": "0.727 \n0.673 \n40.39 \n42.32 \n0.391 \n0.005 \nPanel B: Men \n  \n  \n  \n  \n  \n  \n1. baseline  \n-0.086** \n-0.055 \n3.705 \n-0.058 \n-0.032 \n0.125*** \nactive*mine \n(0.041) \n(0.039) \n(3.460) \n(0.066) \n(0.036) \n(0.043) \n \n \n \n \n \n \n \n2. drop 20-40 km \n \n \n \n \n \n \nactive*mine \n-0.094** \n-0.062 \n3.893 \n-0.064 \n-0.031 \n0.126*** \n \n(0.042) \n(0.040) \n(3.842) \n(0.066) \n(0.038) \n(0.042) \n3. drop 2 years before \n \n \n \n \n \n \nactive*mine \n-0.094** \n-0.062 \n3.708 \n-0.071 \n-0.026 \n0.125*** \n \n(0.041) \n(0.039) \n(3.459) \n(0.067) \n(0.036) \n(0.043) \n4. mine FE \n \n \n \n \n \n \nactive*mine \n-0.123** \n-0.094* \n8.233 \n-0.068 \n-0.049 \n0.113** \n \n(0.057) \n(0.051) \n(5.425) \n(0.075) \n(0.044) \n(0.045) \n5. mine clustering \n \n \n \n \n \n \nactive*mine \n-0.086*** \n-0.055** \n3.705 \n-0.058 \n-0.032 \n0.125** \n \n(0.025) \n(0.025) \n(2.898) \n(0.086) \n(0.032) \n(0.051) \n \n \n \n \n \n \n \nMean dep var \n0.715 \n0.705 \n45.71 \n0.491 \n0.259 \n0.028 \n \nNote: The table uses GLSS data for Ghana for the survey years 1998, 2005, 2012."
            }
          ]
        }
      ]
    },
    {
      "page": 37,
      "dataset_used": true,
      "data_mentions": [
        {
          "mentioned_in": "Natural resource extraction is often argued to have detrimental effects on countries, \nhowever, and the so-called natural resource curse may imply that resource wealth is harmful to \nsocial development and inclusive growth. We use rich geocoded data with information on \nhouseholds and mining production over time to evaluate the gold boom at the local and district \nlevels in difference-in-differences analyses. Men benefit from direct job creation within the mining sector, and women seem to benefit from \nindirectly generated jobs in the service sector (statistically significant within 10 km from a \nmine).",
          "datasets": [
            {
              "raw_name": "geocoded data with information on households and mining production over time",
              "harmonized_name": null,
              "acronym": null,
              "context": "primary",
              "specificity": "descriptive_but_unnamed",
              "relevance": "directly_relevant",
              "producer": null,
              "data_type": "Geospatial & Economic Data",
              "sent": "We use rich geocoded data with information on \nhouseholds and mining production over time to evaluate the gold boom at the local and district \nlevels in difference-in-differences analyses."
            }
          ]
        }
      ]
    },
    {
      "page": 39,
      "dataset_used": false,
      "data_mentions": []
    }
  ]
}

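In this step, we perform weakly supervised labeling of the document to extract dataset mentions. For each page, the output records whether a dataset is used (dataset_used) and lists the mentions found, each with structured attributes: raw and harmonized names, acronym, context, specificity, relevance, producer, data type, and the sentence containing the mention. The extraction uses the GPT-4o-mini model with a customized prompt to encourage consistent output, but the labels are not guaranteed to be correct. In the output above, for example, missing values appear both as null and as the string "None".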
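Before passing the result on, it can help to flatten the nested output into one record per dataset mention. The following is a minimal sketch, not part of the original pipeline: the helper names `clean` and `flatten_mentions` are hypothetical, and because the top-level key that holds the page list is not shown in this section, the function takes the list of page dictionaries directly. It also maps the string "None" to a real null, since both forms occur in the output above.

```python
from typing import Any, Dict, List

# Strings treated as missing values (assumption: "None" as a string means null).
NULL_LIKE = {"none", "null", ""}


def clean(value: Any) -> Any:
    """Map null-like strings such as "None" to None; pass other values through."""
    if isinstance(value, str) and value.strip().lower() in NULL_LIKE:
        return None
    return value


def flatten_mentions(pages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return one flat record per dataset mention, tagged with its page number."""
    records = []
    for page in pages:
        # Skip pages the model marked as not using any dataset.
        if not page.get("dataset_used"):
            continue
        for mention in page.get("data_mentions", []):
            for dataset in mention.get("datasets", []):
                record = {key: clean(val) for key, val in dataset.items()}
                record["page"] = page["page"]
                records.append(record)
    return records


# Abbreviated excerpt of the output above (pages 10 and 15).
sample_pages = [
    {
        "page": 10,
        "dataset_used": True,
        "data_mentions": [
            {
                "mentioned_in": "The Raw Materials Data are from InterraRMG (2013).",
                "datasets": [
                    {
                        "raw_name": "Raw Materials Data",
                        "harmonized_name": "Raw Materials Data from InterraRMG",
                        "acronym": "None",
                        "producer": "InterraRMG",
                        "data_type": "Resource Data",
                    }
                ],
            }
        ],
    },
    {"page": 15, "dataset_used": False, "data_mentions": []},
]

records = flatten_mentions(sample_pages)
print(records)
```

A flat list like this is easier to load into a table or to deduplicate by harmonized name than the nested per-page structure.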

Next Step#

The output from this step will be processed by the critic model (LLM-as-a-Judge) to validate and refine the extracted dataset mentions, ensuring their quality and correctness.
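As a rough illustration of the filtering that follows validation, the sketch below keeps only records flagged as valid. It assumes, hypothetically, that the judge attaches a boolean `valid` field to each dataset record; the actual judge output format is not shown in this section, and the sample records are made up for illustration.

```python
from typing import Any, Dict, List


def keep_valid(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only records whose (assumed) `valid` flag is exactly True."""
    return [record for record in records if record.get("valid") is True]


# Made-up records for illustration only; these are not judge results for the paper.
judged = [
    {"raw_name": "Example Survey A", "valid": True},
    {"raw_name": "example table caption", "valid": False},
    {"raw_name": "Example Survey B"},  # no flag: treated as not validated
]

validated = keep_valid(judged)
print([record["raw_name"] for record in validated])
```

Requiring the flag to be exactly True means that records with a missing or malformed flag are discarded rather than silently kept.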