Generate Synthetic Fine Tuning Dataset (PART 4)


This notebook assumes that you have already processed the papers with the pipeline and that your extraction_outputs/reasoning folder contains the resulting output, which serves as the input for this task.

import os
import json

# Define paths
reasoning_folder = "./extraction_outputs/reasoning"
text_folder = "./output/text"
output_folder = "./extraction_outputs/simpleschema"

# Ensure output folder exists
os.makedirs(output_folder, exist_ok=True)

def load_json_files(folder):
    """Load all JSON files from a given folder into a dictionary {paper_id: content}."""
    json_data = {}
    for filename in os.listdir(folder):
        if filename.endswith(".json"):
            filepath = os.path.join(folder, filename)
            with open(filepath, "r", encoding="utf-8") as f:
                data = json.load(f)
                paper_id = filename.replace(".json", "")
                json_data[paper_id] = data
    return json_data


# Load JSON files
payload_data = load_json_files(reasoning_folder)
text_data = load_json_files(text_folder)

for paper_id, text_json in text_data.items():
    # Get pages from the text JSON
    text_pages = text_json.get("pages", {})

    # Retrieve the corresponding payload JSON
    payload_json = payload_data.get(paper_id, {})
    payload_pages_list = payload_json.get("pages", [])

    # Build a mapping from page number to a list of data_mentions
    payload_pages_mapping = {}
    for page_obj in payload_pages_list:
        data_mentions = page_obj.get("data_mentions", [])
        for mention in data_mentions:
            page_no = mention.get("page")
            if page_no is not None:
                payload_pages_mapping.setdefault(page_no, []).append(mention)

    merged_pages = []
    # Iterate over each page in the text JSON and create a merged object per page
    for page_key, text in text_pages.items():
        try:
            page_number = int(page_key)
        except ValueError:
            continue  # Skip keys that cannot be converted to int

        # Set payload: if no data_mentions exist, set data_used to False and data_mentions to an empty list.
        data_mentions = payload_pages_mapping.get(page_number, [])
        if not data_mentions:
            payload_field = {"data_used": False, "data_mentions": []}
        else:
            payload_field = {"data_used": True, "data_mentions": data_mentions}

        merged_page = {
            "paper_id": text_json.get("source", paper_id),
            "page": page_number,
            "text": text,
            "payload": payload_field,
        }
        merged_pages.append(merged_page)

    # Consolidate pages for this paper_id into one JSON object
    merged_output = {
        "paper_id": text_json.get("source", paper_id),
        "pages": merged_pages,
    }

    # Save the merged output per paper_id / source
    output_filename = f"{paper_id}.json"
    output_filepath = os.path.join(output_folder, output_filename)
    with open(output_filepath, "w", encoding="utf-8") as outfile:
        json.dump(merged_output, outfile, indent=2, ensure_ascii=False)
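To make the merge step concrete, the following minimal sketch applies the same logic to a small in-memory example. The helper name `merge_pages` and all toy values are illustrative assumptions and are not part of the pipeline; the sketch only mirrors how mentions are grouped by their `page` field and attached to the matching text page.

```python
def merge_pages(paper_id, text_pages, payload_pages):
    """Attach data mentions to each text page (illustrative helper, not pipeline code)."""
    # Group mentions by the page number recorded on each mention
    mentions_by_page = {}
    for page_obj in payload_pages:
        for mention in page_obj.get("data_mentions", []):
            page_no = mention.get("page")
            if page_no is not None:
                mentions_by_page.setdefault(page_no, []).append(mention)

    merged = []
    for page_key, text in text_pages.items():
        try:
            page_number = int(page_key)
        except ValueError:
            continue  # Skip keys that are not page numbers
        mentions = mentions_by_page.get(page_number, [])
        merged.append(
            {
                "paper_id": paper_id,
                "page": page_number,
                "text": text,
                "payload": {"data_used": bool(mentions), "data_mentions": mentions},
            }
        )
    return merged


# Toy inputs, made up for illustration only
toy_text_pages = {"1": "Introduction ...", "2": "We use the DHS survey ..."}
toy_payload_pages = [
    {"data_mentions": [{"page": 2, "datasets": [{"raw_name": "DHS"}]}]}
]

toy_merged = merge_pages("toy-paper", toy_text_pages, toy_payload_pages)
for page in toy_merged:
    print(page["page"], page["payload"]["data_used"])
```

Page 1 has no mentions, so it becomes a negative example with `data_used` set to False, while page 2 carries its mention and is marked True.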

import os
import json

folder = "extraction_outputs/simpleschema"

# Collect the page-level items from every file into one flat list
all_items = []

# Iterate through each file in the folder.
for filename in os.listdir(folder):
    if filename.endswith(".json"):
        filepath = os.path.join(folder, filename)
        with open(filepath, "r", encoding="utf-8") as f:
            data = json.load(f)

        # If the JSON data is a dictionary with a key "pages", iterate over its list.
        if isinstance(data, dict):
            if "pages" in data and isinstance(data["pages"], list):
                for item in data["pages"]:
                    all_items.append(item)
            else:
                # Otherwise, iterate over all values in the dictionary.
                for key, value in data.items():
                    if isinstance(value, list):
                        all_items.extend(value)
                    else:
                        all_items.append(value)
        # If the JSON data is already a list, extend the master list.
        elif isinstance(data, list):
            all_items.extend(data)
        else:
            all_items.append(data)

print(f"Total individual items: {len(all_items)}")

Total individual items: 47
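Before splitting, it can be useful to check how many pages are positive examples (`data_used` is True) versus negative ones. The sketch below is one way to do that; `count_labels` is a hypothetical helper, and it assumes every item is a dictionary with a `payload` field as produced by the merge step. The toy items stand in for `all_items`.

```python
from collections import Counter


def count_labels(items):
    """Count pages by their data_used flag (hypothetical helper)."""
    return Counter(
        bool(item.get("payload", {}).get("data_used")) for item in items
    )


# Toy items for illustration; in the notebook you would pass all_items
toy_items = [
    {"payload": {"data_used": True, "data_mentions": [{}]}},
    {"payload": {"data_used": False, "data_mentions": []}},
    {"payload": {"data_used": False, "data_mentions": []}},
]

label_counts = count_labels(toy_items)
print(label_counts)
```

A heavily skewed count would suggest checking whether the splits still contain enough positive pages.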

all_items[8]

{'paper_id': 'The-local-socioeconomic-effects-of-gold-mining-evidence-from-Ghana',
 'page': 9,
 'text': '7 \n \nAlongside the large-scale, capital-intensive mining industry in Ghana, there is an artisanal and \nsmall-scale mining sector (ASM). ASM activities were legalized in 1984, when the state \nloosened its monopoly on gold mining. In Ghana, as in many other African countries, the sector \nis an important employer (ILO 1999). It is estimated that around 1 million people in Ghana \nsupport themselves with revenues from ASM activities. \nThe sector is associated with several hazardous labor conditions, however. This includes child \nlabor, mercury exposure, and risk of mine collapse (Hilson 2009). The ASM and the large-scale \nmining sector sometimes thrive side by side, but sometimes competing interests lead to conflict \nbetween the two sectors, such as around Prestea, where domestic galamsey miners (informal \nsmall-scale miners) have been in conflict with the multinational concession owner (Hilson and \nYakoleva 2007). \nIn this analysis, we focus solely on large-scale mining. We understand, however, that small- \nand large-scale operations may be geographically correlated. Assuming that the start of a large-scale mine does not affect the likelihood or viability of artisanal and small-scale mining, it is \nnot a threat to our identifying assumptions. However, should ASM respond to large-scale \nactivities, either by increasing or decreasing activity in the close geographic area, we will end \nup estimating the impact of these sectors jointly. In a later stage, should the opportunity arise, \nwe encourage researchers to try to disentangle the effects of small-scale and large-scale mining. \n3 Data \nTo conduct this analysis, we combine different data sources using spatial analysis. The main \nmining data is a dataset from InterraRMG covering all large-scale mines in Ghana, explained \nin more detail in section 3.1. This dataset is linked to survey data from the DHS and GLSS, \nusing spatial information. 
Geographical coordinates of enumeration areas in GLSS are from \nGhana Statistical Services (GSS).2 Point coordinates (global positioning system [GPS]) for the \nsurveyed DHS clusters3 allow us to match all individuals to one or several mineral mines. We \ndo this in two ways. \nFirst, we calculate distance spans from an exact mine location given by its GPS coordinates, \nand match surveyed individuals to mines. These are concentric circles with radiuses of 10, 20, \nand 30 kilometers (km), and so on, up to 100 km and beyond. In the baseline analysis where \n \n2 The data was shared by Aragón and Rud (2013) \n3 Both the DHS and GLSS enumeration area coordinates have a 1-5 km offset. The DHS clusters have up to \n10km displacement in 1% of the cases.',
 'payload': {'data_used': True,
  'data_mentions': [{'mentioned_in': '3 Data \nTo conduct this analysis, we combine different data sources using spatial analysis. The main \nmining data is a dataset from InterraRMG covering all large-scale mines in Ghana, explained \nin more detail in section 3.1. This dataset is linked to survey data from the DHS and GLSS, \nusing spatial information. Geographical coordinates of enumeration areas in GLSS are from \nGhana Statistical Services (GSS).2 Point coordinates (global positioning system [GPS]) for the \nsurveyed DHS clusters3 allow us to match all individuals to one or several mineral mines. We \ndo this in two ways.',
    'page': 9,
    'dataset_used': True,
    'datasets': [{'raw_name': 'dataset from InterraRMG covering all large-scale mines in Ghana',
      'harmonized_name': 'InterraRMG Large-Scale Mines Dataset',
      'acronym': 'None',
      'producer': 'InterraRMG',
      'year': None,
      'specificity': 'descriptive_but_unnamed',
      'context': 'primary',
      'valid': True,
      'invalid_reason': None},
     {'raw_name': 'DHS',
      'harmonized_name': 'Demographic and Health Surveys',
      'acronym': 'DHS',
      'producer': None,
      'year': None,
      'specificity': 'properly_named',
      'context': 'supporting',
      'valid': True,
      'invalid_reason': None},
     {'raw_name': 'GLSS',
      'harmonized_name': 'Ghana Living Standards Survey',
      'acronym': 'GLSS',
      'producer': None,
      'year': None,
      'specificity': 'properly_named',
      'context': 'supporting',
      'valid': True,
      'invalid_reason': None}]}]}}

import random
from copy import deepcopy

dataset_consolidated = deepcopy(all_items)
random.seed(123)
random.shuffle(dataset_consolidated)

# Ideally, you would split the entire reasoning output directory into train, valid, and test sets.
# The code below performs that split: 10% test, 10% valid, and the remainder for train.

test_data = random.sample(dataset_consolidated, int(len(dataset_consolidated) * 0.1))
valid_data = [i for i in dataset_consolidated if i not in test_data]
valid_data = random.sample(valid_data, int(len(dataset_consolidated) * 0.1))
train_data = [
    i for i in dataset_consolidated if i not in test_data and i not in valid_data
]

len(test_data), len(valid_data), len(train_data)

(4, 4, 39)
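The split above works at the page level, so pages from the same paper can end up in different splits. If that is a concern once more papers are processed, one option is to split by `paper_id` instead. The following is a minimal sketch under that assumption; `split_by_paper`, its default fractions, and the toy data are illustrative and not part of the original notebook. It needs at least three papers to give every split something to hold.

```python
import random


def split_by_paper(items, test_frac=0.1, valid_frac=0.1, seed=123):
    """Split page items so that all pages of a paper stay in one split (sketch)."""
    paper_ids = sorted({item["paper_id"] for item in items})
    rng = random.Random(seed)
    rng.shuffle(paper_ids)

    # Reserve at least one paper each for test and valid
    n_test = max(1, int(len(paper_ids) * test_frac))
    n_valid = max(1, int(len(paper_ids) * valid_frac))
    test_ids = set(paper_ids[:n_test])
    valid_ids = set(paper_ids[n_test:n_test + n_valid])

    splits = {"train": [], "valid": [], "test": []}
    for item in items:
        if item["paper_id"] in test_ids:
            splits["test"].append(item)
        elif item["paper_id"] in valid_ids:
            splits["valid"].append(item)
        else:
            splits["train"].append(item)
    return splits


# Toy data: 10 papers with 3 pages each (made up for illustration)
toy_items = [{"paper_id": f"paper-{i % 10}", "page": i} for i in range(30)]
toy_splits = split_by_paper(toy_items)
print({name: len(rows) for name, rows in toy_splits.items()})
```

Using a local `random.Random(seed)` keeps the split reproducible without touching the global random state used elsewhere in the notebook.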

def build_finetuning_dataset(dataset, split_name, out_dir, data_id):
    assert split_name in ["train", "valid", "test"]

    with open(os.path.join(out_dir, f"{data_id}-{split_name}.json"), "w") as f:
        json.dump(dataset, f)

    fine_tune_dataset = dataset  # no filtering is applied; every item is used

    # Each conversation follows this two-turn format, for example:
    # [{'from': 'human',
    #   'value': 'What is the typical wattage of bulb in a lightbox?'},
    #  {'from': 'gpt',
    #   'value': 'The typical wattage of a bulb in a lightbox is 60 watts, although
    #   domestic LED bulbs are normally much lower than 60 watts, as they produce the
    #   same or greater lumens for less wattage than alternatives. A 60-watt Equivalent
    #   LED bulb can be calculated using the 7:1 ratio, which divides 60 watts by 7 to
    #   get roughly 9 watts.'}]

    # Create the dataset
    conversation_data = []

    for fd in fine_tune_dataset:
        conv = []
        conv.append(
            {
                "from": "human",
                "value": fd["text"],
            }
        )

        fd = fd.copy()

        payload = deepcopy(fd["payload"])

        for dm in payload["data_mentions"]:
            for d in dm["datasets"]:
                d.pop("valid", None)
                d.pop("invalid_reason", None)
                d.pop("sent", None)

        conv.append(
            {
                "from": "gpt",
                "value": json.dumps(payload, indent=2),
            }
        )

        conversation_data.append(conv)

    print(len(conversation_data))

    with open(
        os.path.join(out_dir, f"conversation_data_{data_id}-{split_name}.json"), "w"
    ) as f:
        json.dump(conversation_data, f)

data_id = "finetune-simpleschema"
out_dir = os.path.join("extraction_outputs", "simpleschema", "finetune")
os.makedirs(out_dir, exist_ok=True)

build_finetuning_dataset(train_data, "train", out_dir, data_id)
build_finetuning_dataset(valid_data, "valid", out_dir, data_id)
build_finetuning_dataset(test_data, "test", out_dir, data_id)

39
4
4
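Since the model is trained to emit the payload as a JSON string, a quick structural check of the conversation data can catch problems early. The sketch below assumes the two-turn `human`/`gpt` layout written by `build_finetuning_dataset`; `validate_conversations` is a hypothetical helper, and the toy conversations are made up for illustration.

```python
import json


def validate_conversations(conversations):
    """Return indices of conversations that break the expected layout (sketch)."""
    bad = []
    for idx, conv in enumerate(conversations):
        # Expect exactly one human turn followed by one gpt turn
        roles = [turn.get("from") for turn in conv]
        if roles != ["human", "gpt"]:
            bad.append(idx)
            continue
        # The gpt turn must hold a JSON object with the two payload keys
        try:
            payload = json.loads(conv[1]["value"])
        except json.JSONDecodeError:
            bad.append(idx)
            continue
        if (
            not isinstance(payload, dict)
            or "data_used" not in payload
            or "data_mentions" not in payload
        ):
            bad.append(idx)
    return bad


toy_conversations = [
    [
        {"from": "human", "value": "Page text ..."},
        {"from": "gpt", "value": json.dumps({"data_used": False, "data_mentions": []})},
    ],
    [
        {"from": "human", "value": "Another page ..."},
        {"from": "gpt", "value": "not json"},
    ],
]

bad_indices = validate_conversations(toy_conversations)
print(bad_indices)
```

In the notebook, the same function could be applied to the contents of any `conversation_data_*.json` file after loading it with `json.load`.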

import glob

finetune_fpaths = glob.glob("extraction_outputs/simpleschema/finetune/*")

finetune_fpaths

['extraction_outputs/simpleschema/finetune/finetune-simpleschema-valid.json',
 'extraction_outputs/simpleschema/finetune/conversation_data_finetune-simpleschema-valid.json',
 'extraction_outputs/simpleschema/finetune/conversation_data_finetune-simpleschema-train.json',
 'extraction_outputs/simpleschema/finetune/finetune-simpleschema-test.json',
 'extraction_outputs/simpleschema/finetune/finetune-simpleschema-train.json',
 'extraction_outputs/simpleschema/finetune/conversation_data_finetune-simpleschema-test.json']

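# Inspect the finetuning data.
# glob does not guarantee any ordering, so finetune_fpaths[0] may differ between runs;
# in the listing above it is the valid split.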

with open(finetune_fpaths[0], "r") as f:
    data = json.load(f)

data[0:5]

[{'paper_id': 'The-local-socioeconomic-effects-of-gold-mining-evidence-from-Ghana',
  'page': 25,
  'text': '23 \n \ntable 7). Low weight-for-age is an indicator for acute malnutrition, whereas height-for-age is \nan indicator for chronic malnutrition. This could indicate that mining districts are less food \nsecure.12 Table 7 shows that there are no effects on illness in the last two weeks. \n \n6. Distributional effects, mechanisms and robustness \n6.1 Decomposing results by migration status \nWe argue that one source of heterogeneity is to consider when exploring socio-economic \nimpacts and distributional effects of large-scale mining is migration status. First because mining \nmay cause inward migration of individuals that are different from the previous local population. \nWhile it has its limitations, disaggregating the effects between nonmigrants and migrants may \nshed some light on the effect on the initial population. Second, to understand the distributional \neffects of mining we argue that migration status may be an important factor. \nIn the analysis, we distinguish between nonmigrants (where the woman respondent report being \nborn in the locality) and migrants (born elsewhere). We note several caveats with this analysis, \nthe first being that we cannot follow migrant households before the migration decision. \nTherefore, we cannot make any causal claims on changes in this group over time. We compare \nmigrant households in mining communities with migrant households elsewhere, and the null \nhypothesis would be similar trajectory over time. If we reject the null, we cannot distinguish \nbetween selective migration to mining communities and the impact of the mining. The \nnonmigrant analysis can plausibly reflect similar households over time, with the limitation of \nselective outward migration. We believe inward migration to mining areas to be more common \nthan outward migration (in line with Fafchamps et al., 2016). \nDiarrhea is a major concern in many developing countries. 
Diarrheal diseases are, in part, a \nmatter of infrastructure, where access to clean water and proper sanitation are important \ndeterminants. To further understand the effects on diarrhea, we look at the difference between \nmigrants and nonmigrants and the effects by distance (Figure 5). There are, in fact, large \ndifferences between the migrant and the nonmigrant populations. Among nonmigrants, a mine \nopening is associated with large decreases in incidence, whereas for migrants, the opposite is \ntrue. Considering all children between 0 km and 20 km of an active mine, children born to \n \n12 In table 5 we saw very small insignificant changes in nutritional status.',
  'payload': {'data_used': False, 'data_mentions': []}},
 {'paper_id': 'The-local-socioeconomic-effects-of-gold-mining-evidence-from-Ghana',
  'page': 13,
  'text': '11 \n \nThe choice of district – rather than cluster – fixed effect is informed by the understanding that \nmeaningful time-invariant factors - such as mining laws, level of development, local political \ninstitutions, norms regarding environment, women’s participation in the labor market, etc. - that \ninfluence exploitation of the mine happens at the district level. Including district fixed effects, \nwe control for various institutional and cultural factors at the district level that are stable over \ntime. Including district fixed effects also ensures that we are not only capturing effects from \ntransfers or the fiscal system as we compare individuals within the same districts. With this \nmethod we capture the geographic spillover effects in the vicinity of the mine. Moreover, cluster \nfixed effects are not possible because of clusters are not repeatedly sampled over time. \nHowever, since the estimation is at individual level, all standard errors are clustered at the DHS \ncluster level. \nThe sample is restricted to individuals living within 100 km of a deposit location (mine), so \nmany parts of Northern Ghana where there are few gold mines are not included in the analysis. \nThe sample restriction is created by using the time-stable continuous distance measure that we \ncalculate from each mine location to each DHS cluster. This is also the distance measure that \nwe use to create the “mine” dummy, which captures whether the cluster lies within 20 km of a \nknown gold deposit. Note that we only consider deposits that have been in production at some \npoint until December 2012. \nAll households are thus within 100 km of one, or several, gold deposits. To ascertain whether \nthere is any gold production in these potential mining sites, we construct an indicator variable \nactive, which takes a value of 1 if there is at least one mine within 100 km that was extracting \ngold in the year the household was surveyed, and 0 otherwise. 
While the mine dummy captures \nsome of the special characteristics of mining areas (for example, whether mines tend to open in \nless urban areas), the active dummy captures long-range spillovers of mining. \nThe treatment effect that we are mostly interested in is captured with the active*mine \ncoefficient. The coefficient for β3 tells us what the effect of being close to an actively producing \nmine is. Since the inclusion of the three dummies (active, mine, and active*mine) captures the \ndifference between close and far, and before and after mine opening, we have created a \ndifference-in-differences estimator. \nPanel B of figure 2 shows this strategy in a map, where the small blue circles show the treatment \nareas, and the 100-km-radius green circles show the geographic areas that constitute the control \ngroup. As is common in difference-in-differences analysis, the estimation relies on treatment',
  'payload': {'data_used': False, 'data_mentions': []}},
 {'paper_id': 'The-local-socioeconomic-effects-of-gold-mining-evidence-from-Ghana',
  'page': 10,
  'text': '8 \n \nwe use a cutoff distance of 20 km, we assume there is little economic footprint beyond that \ndistance. Of course, any such distance is arbitrarily chosen, which is why we try different \nspecifications to explore the spatial heterogeneity by varying this distance (using 10 km, 20 km, \nthrough 50 km) as well as a spatial lag structure (using 0 to 10 km, 10 to 20 km, through 40 to \n50 km distance bins).4 \nSecond, we collapse the DHS mining data at the district level.5 The number of districts has \nchanged over time in Ghana, because districts with high population growth have been split into \nsmaller districts. To avoid endogeneity concerns, we use the baseline number of districts that \nexisted at the start of our analysis period, which are 137. Eleven of these districts have industrial \nmining. Because some mines are close to district boundaries, we additionally test whether there \nis an effect in neighboring districts. \n3.1 Resource data \nThe Raw Materials Data are from InterraRMG (2013). The data set contains information on \npast or current industrial mines. All mines have information on annual production volumes, \nownership structure, and GPS coordinates on location. We complete this data with exact \ngeographic location data from MineAtlas (2013), where satellite imagery shows the actual mine \nboundaries, which allows us to identify and update the center point of each mine. The \nproduction data and ownership information are double-checked against the companies’ annual \nreports. \nFor Ghana, this exercise results in 17 industrial mines tracked over time. We have annual \nproduction levels from 1990 until 2012. As mentioned, Table 1 shows the mining companies \nactive in Ghana during recent decades, with opening and closing years (although some were \nclosed in between, and are not presented in the table). Figure 2 shows the geographic \ndistribution of these mines. 
\nFigure 2 Gold mines and DHS clusters in Ghana \nPanel A Gold mines and 20 km buffer zones Panel B Gold mines, DHS clusters, and 100 km buffer zones \n \n4 The distances are radii from mine center point, and form concentric circles around the mine. \n5 The DHS and the GLSS data are representative at the regional level, and not at the district level. Since the \nregional level is too aggregated, we do the analysis at the district level, but note that the sample may not be \nrepresentative.',
  'payload': {'data_used': True,
   'data_mentions': [{'mentioned_in': '8\n \nwe use a cutoff distance of 20 km, we assume there is little economic footprint beyond that distance. Of course, any such distance is arbitrarily chosen, which is why we try different specifications to explore the spatial heterogeneity by varying this distance (using 10 km, 20 km, through 50 km) as well as a spatial lag structure (using 0 to 10 km, 10 to 20 km, through 40 to 50 km distance bins).4  \nSecond, we collapse the DHS mining data at the district level.5 The number of districts has changed over time in Ghana, because districts with high population growth have been split into smaller districts. To avoid endogeneity concerns, we use the baseline number of districts that existed at the start of our analysis period, which are 137.',
     'page': 10,
     'dataset_used': True,
     'datasets': [{'raw_name': 'DHS mining data',
       'harmonized_name': None,
       'acronym': 'DHS',
       'producer': None,
       'year': None,
       'specificity': 'properly_named',
       'context': 'primary',
       'valid': True,
       'invalid_reason': None}]},
    {'mentioned_in': 'Because some mines are close to district boundaries, we additionally test whether there is an effect in neighboring districts. 3.1 Resource data \nThe Raw Materials Data are from InterraRMG (2013). The data set contains information on past or current industrial mines.',
     'page': 10,
     'dataset_used': True,
     'datasets': [{'raw_name': 'Raw Materials Data',
       'harmonized_name': 'Raw Materials Data from InterraRMG',
       'acronym': 'None',
       'producer': 'InterraRMG',
       'year': '2013',
       'specificity': 'properly_named',
       'context': 'primary',
       'valid': True,
       'invalid_reason': None}]},
    {'mentioned_in': 'All mines have information on annual production volumes, ownership structure, and GPS coordinates on location. We complete this data with exact geographic location data from MineAtlas (2013), where satellite imagery shows the actual mine boundaries, which allows us to identify and update the center point of each mine. The production data and ownership information are double-checked against the companies’ annual reports.',
     'page': 10,
     'dataset_used': True,
     'datasets': [{'raw_name': 'MineAtlas data',
       'harmonized_name': None,
       'acronym': 'None',
       'producer': 'MineAtlas',
       'year': '2013',
       'specificity': 'properly_named',
       'context': 'supporting',
       'valid': True,
       'invalid_reason': None}]}]}},
 {'paper_id': 'The-local-socioeconomic-effects-of-gold-mining-evidence-from-Ghana',
  'page': 45,
  'text': '43 \n \n \n(1) \n(2) \n(3) \n(4) \n(5) \n(6) \n(7) \n \nnot \nagri- \nservice \nprofess- \nmanual \nearns \nworks \nVARIABLES \nworking \nculture \nor sales \nional \nwork \ncash \nall year \ngold period \n0.012 \n-0.033 \n0.020 \n0.019* \n-0.018 \n-0.001 \n0.028 \nDistrict \n(0.022) \n(0.024) \n(0.013) \n(0.011) \n(0.015) \n(0.015) \n(0.022) \n \n \n \n \n \n \n \n \nneighbor \n-0.042** \n0.036 \n0.007 \n-0.009** \n0.008 \n0.020 \n0.013 \ngold production \n(0.017) \n(0.025) \n(0.021) \n(0.004) \n(0.010) \n(0.025) \n(0.019) \n \n \n \n \n \n \n \n \nobservations \n19,175 \n19,175 \n19,175 \n19,175 \n19,175 \n14,852 \n11,568 \nR-squared \n0.207 \n0.327 \n0.128 \n0.137 \n0.037 \n0.146 \n0.255 \nNote: Robust standard errors clustered at the district level in parentheses. All regressions control for year and \ndistrict fixed effects, urban dummy, age, and years of education. *** p<0.01, **p<0.05, *p<0.1.',
  'payload': {'data_used': False, 'data_mentions': []}}]

### End of the pipeline

You have now created your fine-tuning dataset. The next step is to fine-tune Phi-3.5 mini instruct using Unsloth; refer to this notebook for more information.