### Generate Synthetic Fine Tuning Dataset (PART 4)

This notebook assumes that you have already processed the papers with the pipeline
and that your `extraction_outputs/reasoning` folder contains its output, which is the input for this task.

In [37]:
import os
import json

# Define paths
reasoning_folder = "./extraction_outputs/reasoning"
text_folder = "./output/text"
output_folder = "./extraction_outputs/simpleschema"

# Ensure output folder exists
os.makedirs(output_folder, exist_ok=True)

In [38]:
def load_json_files(folder):
    """Load all JSON files from a given folder into a dictionary {paper_id: content}."""
    json_data = {}
    for filename in os.listdir(folder):
        if filename.endswith(".json"):
            filepath = os.path.join(folder, filename)
            with open(filepath, "r", encoding="utf-8") as f:
                data = json.load(f)
                # Strip only the trailing extension to get the paper ID
                paper_id = os.path.splitext(filename)[0]
                json_data[paper_id] = data
    return json_data


# Load JSON files
payload_data = load_json_files(reasoning_folder)
text_data = load_json_files(text_folder)

In [39]:
for paper_id, text_json in text_data.items():
    # Get pages from the text JSON
    text_pages = text_json.get("pages", {})

    # Retrieve the corresponding payload JSON
    payload_json = payload_data.get(paper_id, {})
    payload_pages_list = payload_json.get("pages", [])

    # Build a mapping from page number to a list of data_mentions
    payload_pages_mapping = {}
    for page_obj in payload_pages_list:
        data_mentions = page_obj.get("data_mentions", [])
        for mention in data_mentions:
            page_no = mention.get("page")
            if page_no is not None:
                payload_pages_mapping.setdefault(page_no, []).append(mention)

    merged_pages = []
    # Iterate over each page in the text JSON and create a merged object per page
    for page_key, text in text_pages.items():
        try:
            page_number = int(page_key)
        except ValueError:
            continue  # Skip keys that cannot be converted to int

        # Set payload: if no data_mentions exist, set data_used to False and data_mentions to an empty list.
        data_mentions = payload_pages_mapping.get(page_number, [])
        if not data_mentions:
            payload_field = {"data_used": False, "data_mentions": []}
        else:
            payload_field = {"data_used": True, "data_mentions": data_mentions}

        merged_page = {
            "paper_id": text_json.get("source", paper_id),
            "page": page_number,
            "text": text,
            "payload": payload_field,
        }
        merged_pages.append(merged_page)

    # Consolidate pages for this paper_id into one JSON object
    merged_output = {
        "paper_id": text_json.get("source", paper_id),
        "pages": merged_pages,
    }

    # Save the merged output per paper_id / source
    output_filename = f"{paper_id}.json"
    output_filepath = os.path.join(output_folder, output_filename)
    with open(output_filepath, "w", encoding="utf-8") as outfile:
        json.dump(merged_output, outfile, indent=2, ensure_ascii=False)
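The merge step above can also be written as a pure function, which makes it easier to check on toy input before running it over real files. This is a minimal sketch, assuming the same input shapes the loop relies on: a `pages` dict keyed by page-number strings in the text JSON, and a `pages` list of objects carrying `data_mentions` in the reasoning JSON. The name `merge_paper`, the toy values, and the `raw_name` key are hypothetical.

```python
def merge_paper(paper_id, text_json, payload_json):
    """Merge page text with per-page data mentions (sketch of the loop above)."""
    # Group mentions by the page number recorded on each mention.
    mentions_by_page = {}
    for page_obj in payload_json.get("pages", []):
        for mention in page_obj.get("data_mentions", []):
            page_no = mention.get("page")
            if page_no is not None:
                mentions_by_page.setdefault(page_no, []).append(mention)

    source_id = text_json.get("source", paper_id)
    merged_pages = []
    for page_key, text in text_json.get("pages", {}).items():
        try:
            page_number = int(page_key)
        except ValueError:
            continue  # skip page keys that are not numeric
        mentions = mentions_by_page.get(page_number, [])
        merged_pages.append(
            {
                "paper_id": source_id,
                "page": page_number,
                "text": text,
                "payload": {"data_used": bool(mentions), "data_mentions": mentions},
            }
        )
    return {"paper_id": source_id, "pages": merged_pages}


# Toy input with made-up values, for illustration only.
toy_text = {"source": "paper-a", "pages": {"1": "Intro text", "2": "We use survey X."}}
toy_payload = {
    "pages": [{"data_mentions": [{"page": 2, "datasets": [{"raw_name": "survey X"}]}]}]
}
merged = merge_paper("paper-a", toy_text, toy_payload)
print([(p["page"], p["payload"]["data_used"]) for p in merged["pages"]])
# → [(1, False), (2, True)]
```

Page 1 has no mentions, so it becomes a negative example with `data_used` set to `False`; page 2 carries its mention through unchanged.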

In [40]:
import os
import json

folder = "extraction_outputs/simpleschema"

# init
all_items = []

# Iterate through each file in the folder.
for filename in os.listdir(folder):
    if filename.endswith(".json"):
        filepath = os.path.join(folder, filename)
        with open(filepath, "r", encoding="utf-8") as f:
            data = json.load(f)

        # If the JSON data is a dictionary with a key "pages", iterate over its list.
        if isinstance(data, dict):
            if "pages" in data and isinstance(data["pages"], list):
                for item in data["pages"]:
                    all_items.append(item)
            else:
                # Otherwise, iterate over all values in the dictionary.
                for key, value in data.items():
                    if isinstance(value, list):
                        all_items.extend(value)
                    else:
                        all_items.append(value)
        # If the JSON data is already a list, extend the master list.
        elif isinstance(data, list):
            all_items.extend(data)
        else:
            all_items.append(data)

print(f"Total individual items: {len(all_items)}")

Total individual items: 80


In [41]:
all_items[8]

{'paper_id': '06c998e896785ab8b6d6caa4a8beb2f505c375a5',
 'page': 9,
 'text': '7\n3. \nEmpirical Results \n3.1. \nHeadcount index of poverty \nTable 2 presents results regarding the impact on poverty of the increases in prices for the \ngoods listed in table 1 by country, together with data on the share of total consumption \nrepresented by these goods. These shares of total consumption range from 6.5 percent in Togo to \n28.3 percent in the Democratic Republic of Congo and even 41.0 percent in Niger. Yet for two \nthirds of the countries, the food items included in the simulations account for less than 15 percent \nof total consumption. The summary data on the impact on the headcount index of poverty (i.e., \nthe share of the population in poverty) of the higher food prices is given for two levels of price \nincrease: 25 percent and 50 percent. As mentioned earlier, the lower bound impact on poverty is \nobtained by combining the consumer and producer impact, while the upper bound imp

In [42]:
import random
from copy import deepcopy

dataset_consolidated = deepcopy(all_items)
random.seed(123)
random.shuffle(dataset_consolidated)

In [43]:
# Ideally, split the full reasoning output (all processed papers) into train, valid, and test sets.
# The code below holds out 10% of the pages for test and 10% for valid; the rest is train.

test_data = random.sample(dataset_consolidated, int(len(dataset_consolidated) * 0.1))
valid_data = [i for i in dataset_consolidated if i not in test_data]
valid_data = random.sample(valid_data, int(len(dataset_consolidated) * 0.1))
train_data = [
    i for i in dataset_consolidated if i not in test_data and i not in valid_data
]
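The split above samples individual pages, so pages from the same paper can end up in different splits. If that kind of overlap matters for your evaluation, one option is to assign whole papers to a split instead. The sketch below is an alternative under that assumption, not the method used in this notebook; `split_by_paper` and the toy data are hypothetical, and it needs at least three distinct papers to leave anything for training.

```python
import random


def split_by_paper(items, test_frac=0.1, valid_frac=0.1, seed=123):
    """Assign whole papers to train/valid/test so no paper spans two splits."""
    paper_ids = sorted({item["paper_id"] for item in items})
    rng = random.Random(seed)  # local RNG, leaves the global random state alone
    rng.shuffle(paper_ids)

    # Hold out at least one paper for each of test and valid.
    n_test = max(1, int(len(paper_ids) * test_frac))
    n_valid = max(1, int(len(paper_ids) * valid_frac))
    test_ids = set(paper_ids[:n_test])
    valid_ids = set(paper_ids[n_test : n_test + n_valid])

    splits = {"train": [], "valid": [], "test": []}
    for item in items:
        if item["paper_id"] in test_ids:
            splits["test"].append(item)
        elif item["paper_id"] in valid_ids:
            splits["valid"].append(item)
        else:
            splits["train"].append(item)
    return splits


# Toy data: 10 hypothetical papers with 3 pages each.
toy_items = [{"paper_id": f"paper-{p}", "page": n} for p in range(10) for n in range(1, 4)]
toy_splits = split_by_paper(toy_items)
print({name: len(rows) for name, rows in toy_splits.items()})
# → {'train': 24, 'valid': 3, 'test': 3}
```

Because papers differ in length, the page counts per split will not be exactly 80/10/10 on real data; the fractions apply to the number of papers.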

In [44]:
len(test_data), len(valid_data), len(train_data)

(8, 8, 64)

In [45]:
def build_finetuning_dataset(dataset, split_name, out_dir, data_id):
    assert split_name in ["train", "valid", "test"]

    with open(os.path.join(out_dir, f"{data_id}-{split_name}.json"), "w") as f:
        json.dump(dataset, f)

    fine_tune_dataset = dataset  # [o for o in data_detected_dataset if not o["skip"]]

    # Target conversation format (one list of two turns per page), for example:
    # [{'from': 'human',
    #   'value': 'What is the typical wattage of bulb in a lightbox?'},
    #  {'from': 'gpt',
    #   'value': 'The typical wattage of a bulb in a lightbox is 60 watts, although domestic LED bulbs are normally much lower than 60 watts, as they produce the same or greater lumens for less wattage than alternatives. A 60-watt Equivalent LED bulb can be calculated using the 7:1 ratio, which divides 60 watts by 7 to get roughly 9 watts.'}]

    # Create the dataset
    conversation_data = []

    for fd in fine_tune_dataset:
        conv = []
        conv.append(
            {
                "from": "human",
                "value": fd["text"],
            }
        )

        fd = fd.copy()

        payload = deepcopy(fd["payload"])

        for dm in payload["data_mentions"]:
            for d in dm["datasets"]:
                d.pop("valid", None)
                d.pop("invalid_reason", None)
                d.pop("sent", None)

        conv.append(
            {
                "from": "gpt",
                "value": json.dumps(payload, indent=2),
            }
        )

        conversation_data.append(conv)

    print(len(conversation_data))

    with open(
        os.path.join(out_dir, f"conversation_data_{data_id}-{split_name}.json"), "w"
    ) as f:
        json.dump(conversation_data, f)

In [46]:
data_id = "finetune-simpleschema"
out_dir = os.path.join("extraction_outputs", "simpleschema", "finetune")
os.makedirs(out_dir, exist_ok=True)

build_finetuning_dataset(train_data, "train", out_dir, data_id)
build_finetuning_dataset(valid_data, "valid", out_dir, data_id)
build_finetuning_dataset(test_data, "test", out_dir, data_id)

64
8
8


In [47]:
import glob

finetune_fpaths = glob.glob("extraction_outputs/simpleschema/finetune/*")

In [48]:
finetune_fpaths

['extraction_outputs/simpleschema/finetune/conversation_data_finetune-simpleschema-test.json',
 'extraction_outputs/simpleschema/finetune/finetune-simpleschema-train.json',
 'extraction_outputs/simpleschema/finetune/conversation_data_finetune-simpleschema-train.json',
 'extraction_outputs/simpleschema/finetune/finetune-simpleschema-test.json',
 'extraction_outputs/simpleschema/finetune/finetune-simpleschema-valid.json',
 'extraction_outputs/simpleschema/finetune/conversation_data_finetune-simpleschema-valid.json']

In [49]:
# Inspect the finetuning data

with open(finetune_fpaths[0], "r") as f:
    data = json.load(f)

In [50]:
data[0:5]

[[{'from': 'human',
   'value': "11\nReferences \n \nBoyce, J. K., and M. Ravallion, 1991, A Dynamic Econometric Model of Agricultural Wage \nDetermination in Bangladesh, Oxford Bulletin of Economics and Statistics, 53(4): 361-76 \n \nBudd, J. W., 1993, Changing Food Prices and Rural Welfare: A Non-Parametric Examination of \nthe Cote d’Ivoire, Economic Development and Cultural Change, 41(3): 587-603. \n \nCoudouel, A., J. Hentschel, and Q. Wodon, 2002, Poverty Measurement and Analysis, in J. \nKlugman, editor, A Sourcebook for Poverty Reduction Strategies, Volume 1: Core Techniques \nand Cross-Cutting Issues, World Bank, Washington, DC. \n \nBarrett, C. D. and P. A. Dorosh, 1996, Farmers' Welfare and Changing Food Prices: \nNonparametric Evidence from Rice in Madagascar, American Journal of Agricultural \nEconomics, 78(3): 656-69. \n \nChristiaensen, L. and L. Demery, 2007, Down to Earth: Agriculture and Poverty Reduction in \nAfrica, Directions in Development, World Bank, Washingto
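Beyond eyeballing a few records, it can help to check the whole conversation file programmatically. The sketch below assumes the structure written by `build_finetuning_dataset`: each conversation is a `human` turn followed by a `gpt` turn whose `value` is a JSON string with `data_used` and `data_mentions`. The helper name `check_conversations` and the toy records are hypothetical.

```python
import json


def check_conversations(conversations):
    """Return a list of (index, problem) tuples; an empty list means all checks passed."""
    problems = []
    for idx, conv in enumerate(conversations):
        # Each conversation should be exactly one human turn, then one gpt turn.
        if [turn.get("from") for turn in conv] != ["human", "gpt"]:
            problems.append((idx, "expected one human turn followed by one gpt turn"))
            continue
        # The gpt turn should hold the payload serialized as JSON.
        try:
            payload = json.loads(conv[1].get("value", ""))
        except json.JSONDecodeError:
            problems.append((idx, "gpt value is not valid JSON"))
            continue
        if not isinstance(payload, dict):
            problems.append((idx, "payload is not a JSON object"))
            continue
        # data_used should be True exactly when there is at least one mention.
        if payload.get("data_used") != bool(payload.get("data_mentions")):
            problems.append((idx, "data_used disagrees with data_mentions"))
    return problems


# Toy records, for illustration only.
good = [
    {"from": "human", "value": "some page text"},
    {"from": "gpt", "value": json.dumps({"data_used": False, "data_mentions": []})},
]
bad = [
    {"from": "human", "value": "some page text"},
    {"from": "gpt", "value": "not json"},
]
print(check_conversations([good, bad]))
# → [(1, 'gpt value is not valid JSON')]
```

On the file loaded above, the equivalent call would be `check_conversations(data)`.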

In [51]:
### End of the pipeline

Now that you have created your fine-tuning dataset, the next step is to fine-tune Phi-3.5 mini instruct
using Unsloth. Refer to this [notebook]() for more information.
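The conversation files use `from`/`value` keys with `human` and `gpt` speakers. Some chat-template tooling expects `role`/`content` messages with `user` and `assistant` roles instead. Whether the fine-tuning notebook needs this conversion is an assumption here; if it does, a plain-Python mapping such as the sketch below is enough. `ROLE_MAP` and `to_messages` are hypothetical names.

```python
# Hypothetical mapping from the speaker labels used in this notebook
# to the role names used by role/content style chat messages.
ROLE_MAP = {"human": "user", "gpt": "assistant"}


def to_messages(conv):
    """Convert one [{'from', 'value'}, ...] conversation to role/content messages."""
    return [{"role": ROLE_MAP[turn["from"]], "content": turn["value"]} for turn in conv]


# Toy conversation, for illustration only.
toy_conv = [
    {"from": "human", "value": "some page text"},
    {"from": "gpt", "value": '{"data_used": false, "data_mentions": []}'},
]
print([m["role"] for m in to_messages(toy_conv)])
# → ['user', 'assistant']
```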