Autonomous Reasoning Agent (PART 3)#

Objective#

This notebook demonstrates how to leverage structured outputs from OpenAI’s GPT-4o-mini model for data labeling of climate-related research papers. The task involves analyzing academic texts to identify and classify mentions of datasets while ensuring contextual consistency across pages.

Workflow#

PDF Text Extraction:

  • Use PyMuPDF to extract pages from PDF documents.

  • Prefilter document pages using an HF-trained model (see the sketch after this list).

Weakly Supervised Data Labeling:

  • Use the GPT-4o-mini model with a customized prompt for structured data extraction.

LLM as a Judge (Validation & Error Correction):

  • Use an LLM to validate extracted dataset mentions.

  • Correct or remove errors in dataset identification.

  • Filter only valid dataset mentions (valid: true), discarding invalid entries.

Autonomous Reasoning Agent:

  • Use a reasoning pipeline to validate the LLM-as-a-Judge output.

Next Steps:

  • Scale this into batch processing of multiple files / a directory of research papers.
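Steps 1 and 2 above were covered in earlier parts of this series. For orientation, here is a minimal sketch of the prefiltering idea, assuming the transformers package is installed; the model id and label below are placeholders, not the actual classifier trained earlier:

import pymupdf  # PyMuPDF
from transformers import pipeline

# Placeholder model id; substitute the page classifier from the earlier parts
classifier = pipeline("text-classification", model="your-org/page-prefilter")

doc = pymupdf.open("paper.pdf")  # illustrative path
candidate_pages = []
for i, page in enumerate(doc):
    text = page.get_text()
    pred = classifier(text, truncation=True)[0]
    if pred["label"] == "HAS_DATA_MENTION":  # placeholder label
        candidate_pages.append({"page": i + 1, "text": text})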

Install Required Packages#

%%capture
!pip install pymupdf openai nltk scikit-learn python-dotenv tqdm beautifulsoup4

Autonomous Reasoning for Filtering Validated Dataset Mentions#

Even after LLM-as-a-Judge validation, many non-dataset references are still mistakenly classified as datasets. To improve classification accuracy, we will implement a reasoning agent that autonomously develops and executes a structured self-evaluation strategy.

import os

from dotenv import load_dotenv
from openai import OpenAI

# Load environment variables from a .env file (expects OPENAI_API_KEY)
load_dotenv()

API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY")
MODEL = "gpt-4o-mini"
client = OpenAI(api_key=API_KEY)  # initialize the client
THINKING_PROMPT = """Your task is to review a structured user input that may mention a dataset in a text. Please take your time.

Carefully analyze what the text in the `mentioned_in` field explicitly means and in what context the `raw_name` is discussed. Never infer, imply, or assume; rely exclusively on the text as fact. If there are multiple datasets, assess each one individually.

Plan a strategy that maximizes your chances of correctly judging and classifying the provided input:
- Whether the `raw_name` clearly falls under the concept of data/a dataset, and not merely by extension or implication.
- Whether the raw_name is actually in the `mentioned_in`.
- Whether the harmonized_name (if present) is actually in the `mentioned_in`. If not found, remove it from the output.
- Whether the `raw_name` is `properly_named` (e.g., DHS, LSMS, etc.), `descriptive_but_unnamed` (e.g., administrative school records in Ghana for 2020), or `vague_generic` (e.g., survey data). Any of these are valid data mentions. To be sure, elaborate on how you interpret these classes and use that interpretation for classifying.
- Whether the context of the dataset's usage is mentioned: is it `primary`, `supporting`, or `background`.

Then, write down your strategy.

After you write down your strategy, synthesize it to develop a rubric of what qualifies as a dataset, which you must use to base your judgment.

Incorporate a devil's advocate review as part of your strategy. If the review shows inconsistency, update accordingly. Do not reason based on assumption, inference, or implicit thinking. Relationships do not count as a dataset; for example, the producer is not a dataset.

Execute the strategy, **step by step**, and write an analysis of how you interpret the `raw_name` in the context of the `mentioned_in`.

If your analysis results in the `raw_name` being a dataset, set the `valid` field to `true`, otherwise, set it to `false`. In both cases, return the result of your analysis focusing on the `raw_name` in the `reason` field. If it is invalid, set the `specificity` and `context` to null.

ALWAYS WRITE A DEVIL'S ADVOCATE REVIEW AFTER THE ANALYSIS BEFORE CONCLUDING.

After you write your analysis, your output must repeat the input with the `specificity`, `context`, `valid`, and `invalid_reason` values replaced accordingly, at the same level as the corresponding `raw_name`. IMPORTANT: the final output must be between these tags <OUTPUTDATA>```json<the output must be here>```</OUTPUTDATA>"""
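The prompt asks the model to wrap its final JSON between <OUTPUTDATA> tags inside a ```json fence. As a rough illustration of that envelope (the sample string below is made up, not real model output), the payload can be recovered with a regular expression; later in the notebook we use a BeautifulSoup-based variant of the same idea:

import json
import re

# Made-up example of the envelope the prompt requests
sample = '<OUTPUTDATA>```json\n{"raw_name": "DHS", "valid": true}\n```</OUTPUTDATA>'

# Pull the JSON payload out from between the tags and the fence
match = re.search(r"<OUTPUTDATA>\s*```json\s*(.*?)\s*```\s*</OUTPUTDATA>", sample, re.DOTALL)
if match:
    payload = json.loads(match.group(1))
    print(payload)  # {'raw_name': 'DHS', 'valid': True}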
import json
from tqdm.auto import tqdm

# Output of the LLM-as-a-Judge critic
llm_validated_path = "output/llm_judge_validation/The-local-socioeconomic-effects-of-gold-mining-evidence-from-Ghana.json"
with open(llm_validated_path, "r") as infile:
    validated_input = json.load(infile)
def filter_valid_mentions(validated_input):
    # Filter out invalid datasets before passing to LLM
    filtered_mentions = []
    for page in validated_input["pages"]:
        for mention in page.get("data_mentions", []):
            valid_datasets = [
                dataset
                for dataset in mention["datasets"]
                if dataset.get("valid", False)
            ]

            if valid_datasets:  # Only keep mentions with at least one valid dataset
                filtered_mentions.append(
                    {
                        "mentioned_in": mention["mentioned_in"],
                        "datasets": valid_datasets,
                        "page": page["page"],
                        "dataset_used": page["dataset_used"],
                    }
                )

    input_data = {
        "source": validated_input.get("source"),
        "data_mentions": filtered_mentions,
    }

    return input_data
input_data = filter_valid_mentions(validated_input)
def prepare_input_data(data):
    for mention in data.get("data_mentions", []):
        for ds in mention.get("datasets", []):
            # Replace string "None" with actual None
            if ds.get("producer") == "None":
                ds["producer"] = None

            # Remove unwanted keys
            keys_to_remove = [
                "sent",
                "specificity",
                "context",
                "relevance",
                "data_type",
                "valid",
                "invalid_reason",
            ]
            for key in keys_to_remove:
                ds.pop(key, None)  # `None` as default to avoid KeyError
    return data
input_data = prepare_input_data(input_data)
# inspect our prepared input data

input_data.get("data_mentions")[0]
{'mentioned_in': 'We also allow for spillovers across \ndistricts, in a district-level analysis. We use two complementary geocoded household data sets \nto analyze outcomes in Ghana: the Demographic and Health Survey (DHS) and the Ghana \nLiving Standard Survey (GLSS), which provide information on a wide range of welfare \noutcomes. The paper contributes to the growing literature on the local effects of mining.',
 'datasets': [{'raw_name': 'Demographic and Health Survey (DHS)',
   'harmonized_name': 'Demographic and Health Survey (DHS)',
   'acronym': 'DHS',
   'producer': None,
   'year': None},
  {'raw_name': 'Ghana Living Standard Survey (GLSS)',
   'harmonized_name': 'Ghana Living Standard Survey (GLSS)',
   'acronym': 'GLSS',
   'producer': None,
   'year': None}],
 'page': 4,
 'dataset_used': True}
def process_reasoning(input_data):
    all_outputs = []
    for mention in tqdm(
        input_data.get("data_mentions", []),
        desc="processing autonomous reasoning",
    ):
        if not mention:
            continue

        messages = [
            {"role": "system", "content": THINKING_PROMPT},
            {"role": "user", "content": json.dumps(mention, indent=2)},
        ]

        completion = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            temperature=0.2,
        )

        response = completion.choices[0].message.content
        all_outputs.append(response)

    base_source = input_data["source"]
    output_path = f"output/reasoning/{base_source}.json"
    os.makedirs(os.path.dirname(output_path), exist_ok=True)  # ensure the output directory exists
    with open(output_path, "w") as outfile:
        json.dump(all_outputs, outfile, indent=2)
# Run the reasoning agent (uncomment to call the API; we load the saved output below)
# process_reasoning(input_data)
reasoning_fpath = "output/reasoning/The-local-socioeconomic-effects-of-gold-mining-evidence-from-Ghana.json"

with open(reasoning_fpath) as f:
    reasoning_payload = json.load(f)
reasoning_payload[0]
'### Strategy\n\n1. **Identify the Context**: Analyze the `mentioned_in` text to determine if it explicitly mentions the `raw_name` datasets and their context of use.\n  \n2. **Check for Explicit Mentions**: Ensure that the `raw_name` is clearly stated in the `mentioned_in` text without any ambiguity.\n\n3. **Classify the Dataset Type**: Determine if the `raw_name` is `properly_named`, `descriptive_but_unnamed`, or `vague_generic` based on the definitions:\n   - **Properly Named**: Clearly defined datasets with specific titles (e.g., DHS).\n   - **Descriptive but Unnamed**: Descriptions of datasets without specific titles (e.g., administrative records).\n   - **Vague Generic**: General terms that do not specify a dataset (e.g., survey data).\n\n4. **Assess the Context of Usage**: Identify if the dataset is used as `primary`, `supporting`, or `background` based on how it is referenced in the text.\n\n5. **Devil\'s Advocate Review**: Critically evaluate the findings to ensure no assumptions are made and that the analysis aligns strictly with the provided text.\n\n### Rubric for Dataset Qualification\n- **Dataset Identification**: The `raw_name` must be explicitly mentioned in the `mentioned_in` text.\n- **Type Classification**: Must fit into one of the three categories: `properly_named`, `descriptive_but_unnamed`, or `vague_generic`.\n- **Context of Use**: Must clearly indicate how the dataset is utilized in the analysis.\n- **Validity**: If the above criteria are met, the dataset is considered valid.\n\n### Analysis\n\n1. **Demographic and Health Survey (DHS)**:\n   - **Mentioned in Text**: The text explicitly states "Demographic and Health Survey (DHS)".\n   - **Type Classification**: This is a `properly_named` dataset as it has a specific title.\n   - **Context of Use**: The DHS is used in a district-level analysis, indicating it is a `primary` dataset for the analysis.\n\n2. **Ghana Living Standard Survey (GLSS)**:\n   - **Mentioned in Text**: The text explicitly states "Ghana Living Standard Survey (GLSS)".\n   - **Type Classification**: This is also a `properly_named` dataset.\n   - **Context of Use**: Similar to the DHS, the GLSS is used in a district-level analysis, indicating it is a `primary` dataset.\n\nBoth datasets are clearly mentioned, properly named, and used as primary datasets in the analysis.\n\n### Devil\'s Advocate Review\nUpon reviewing the analysis, both datasets are explicitly mentioned and classified correctly. There are no inconsistencies or assumptions made. The context of their usage is clear and aligns with the definitions provided.\n\n### Final Output\n```json\n{\n  "mentioned_in": "We also allow for spillovers across \\ndistricts, in a district-level analysis. We use two complementary geocoded household data sets \\nto analyze outcomes in Ghana: the Demographic and Health Survey (DHS) and the Ghana \\nLiving Standard Survey (GLSS), which provide information on a wide range of welfare \\noutcomes. 
The paper contributes to the growing literature on the local effects of mining.",\n  "datasets": [\n    {\n      "raw_name": "Demographic and Health Survey (DHS)",\n      "harmonized_name": "Demographic and Health Survey (DHS)",\n      "acronym": "DHS",\n      "producer": null,\n      "year": null,\n      "specificity": "properly_named",\n      "context": "primary",\n      "valid": true,\n      "invalid_reason": null\n    },\n    {\n      "raw_name": "Ghana Living Standard Survey (GLSS)",\n      "harmonized_name": "Ghana Living Standard Survey (GLSS)",\n      "acronym": "GLSS",\n      "producer": null,\n      "year": null,\n      "specificity": "properly_named",\n      "context": "primary",\n      "valid": true,\n      "invalid_reason": null\n    }\n  ],\n  "page": 4,\n  "dataset_used": true\n}\n```'
from bs4 import BeautifulSoup

validated_datasets = []
for content in reasoning_payload:
    try:
        # If the model wrapped its answer in <OUTPUTDATA> tags, unwrap it first
        if "<outputdata>" in content.lower() and "</outputdata>" in content.lower():
            soup = BeautifulSoup(content, "html.parser")
            content = soup.find("outputdata").text

        # Strip the ```json ... ``` fence around the payload
        content = content[content.index("```json") + len("```json") :]
        content = content[: content.index("```")]

        content = json.loads(content)
        validated_datasets.append(content)
    except Exception:
        # Skip responses without a parsable JSON payload
        pass
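As a quick sanity check (illustrative, not part of the original pipeline), we can report how many raw responses yielded a parsable payload:

print(f"Parsed {len(validated_datasets)} of {len(reasoning_payload)} reasoning outputs")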
validated_datasets[0]
{'mentioned_in': 'We also allow for spillovers across \ndistricts, in a district-level analysis. We use two complementary geocoded household data sets \nto analyze outcomes in Ghana: the Demographic and Health Survey (DHS) and the Ghana \nLiving Standard Survey (GLSS), which provide information on a wide range of welfare \noutcomes. The paper contributes to the growing literature on the local effects of mining.',
 'datasets': [{'raw_name': 'Demographic and Health Survey (DHS)',
   'harmonized_name': 'Demographic and Health Survey (DHS)',
   'acronym': 'DHS',
   'producer': None,
   'year': None,
   'specificity': 'properly_named',
   'context': 'primary',
   'valid': True,
   'invalid_reason': None},
  {'raw_name': 'Ghana Living Standard Survey (GLSS)',
   'harmonized_name': 'Ghana Living Standard Survey (GLSS)',
   'acronym': 'GLSS',
   'producer': None,
   'year': None,
   'specificity': 'properly_named',
   'context': 'primary',
   'valid': True,
   'invalid_reason': None}],
 'page': 4,
 'dataset_used': True}

Next Steps#

You can scale this process to handle multiple PDFs in a directory via OpenAI batch processing; you can learn more from this notebook. A minimal sketch of preparing the batch request file is shown below.
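As a rough sketch of that scaling step, assuming the input_data, MODEL, and THINKING_PROMPT objects defined in this notebook (the file path and custom_id scheme are illustrative), each mention becomes one request line in the JSONL file that the OpenAI Batch API expects:

import json

batch_requests_path = "output/batch_requests.jsonl"  # illustrative path
with open(batch_requests_path, "w") as f:
    for i, mention in enumerate(input_data["data_mentions"]):
        request = {
            "custom_id": f"mention-{i}",  # illustrative id scheme
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": MODEL,
                "temperature": 0.2,
                "messages": [
                    {"role": "system", "content": THINKING_PROMPT},
                    {"role": "user", "content": json.dumps(mention, indent=2)},
                ],
            },
        }
        f.write(json.dumps(request) + "\n")

# Upload the file and create the batch:
# batch_file = client.files.create(file=open(batch_requests_path, "rb"), purpose="batch")
# client.batches.create(
#     input_file_id=batch_file.id,
#     endpoint="/v1/chat/completions",
#     completion_window="24h",
# )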

Once scaled, you can use the results to create a fine-tuning dataset for dataset mentions; you can check this notebook for a reference on how to do it.