Chatting with a Population Dataset Using LangChain and LLMs#


In this simple demonstration, we show how you can use natural language to query a structured dataset. The dataset is a 2018 population census enumeration level data from Malawi.

import getpass
import pandas as pd
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass()

1. Creating a SQLLite Database#

Use a CSV file to create a database. The file which was used to create a database is shown below as a Pandas Dataframe

df_pop = pd.read_csv("mw-ea-pop.csv")
df_pop.head(5)
RegionName DistrictName TAName EnumerationArea TotalPopulation PopulationMale PopulationFemale NumberHouseholds
0 Central Ntchisi TA Malenga ea-20307802 633.0 331.0 302.0 145.0
1 Central Ntchisi TA Malenga ea-20307025 1006.0 507.0 499.0 226.0
2 Central Ntchisi TA Malenga ea-20307007 1503.0 740.0 763.0 338.0
3 Central Ntchisi TA Malenga ea-20307005 1139.0 553.0 586.0 251.0
4 Central Ntchisi TA Malenga ea-20307012 1400.0 668.0 732.0 284.0

2. Setup LangChain for Connecting to Database#

The tool we will use is called LangChain. Its a popular tool for creating apps ontop of LLMs. During the course, we will delve more into using LangChain.

2.1 Import LangChain Packages and Setup Connection#

from langchain_community.utilities import SQLDatabase
from langchain.chains import create_sql_query_chain
from langchain_openai import ChatOpenAI

from langchain_community.tools.sql_database.tool import QuerySQLDataBaseTool

from operator import itemgetter

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

2.2 Create the SQL Agent and a Chain#

# Test connection to the database
db = SQLDatabase.from_uri("sqlite:///mydatabase.db")
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

execute_query = QuerySQLDataBaseTool(db=db)
write_query = create_sql_query_chain(llm, db)
chain = write_query | execute_query
answer_prompt = PromptTemplate.from_template(
    """Given the following user question, corresponding SQL query, and SQL result, answer the user question.

Question: {question}
SQL Query: {query}
SQL Result: {result}
Answer: """
)

answer = answer_prompt | llm | StrOutputParser()
chain = (
    RunnablePassthrough.assign(query=write_query).assign(
        result=itemgetter("query") | execute_query
    )
    | answer
)

3. Chat with the Data#

user_question = "How many districts are there in Malawi"
chain.invoke({"question": "{}".format(user_question)})
Failed to batch ingest runs: LangSmithError('Failed to POST https://api.smith.langchain.com/runs/batch in LangSmith API. HTTPError(\'403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/batch\', \'{"detail":"Forbidden"}\')')
Failed to batch ingest runs: LangSmithError('Failed to POST https://api.smith.langchain.com/runs/batch in LangSmith API. HTTPError(\'403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/batch\', \'{"detail":"Forbidden"}\')')
'There are 33 districts in Malawi.'
Failed to batch ingest runs: LangSmithError('Failed to POST https://api.smith.langchain.com/runs/batch in LangSmith API. HTTPError(\'403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/batch\', \'{"detail":"Forbidden"}\')')
# VERIFY THIS INFORMATION USING PYTHON
df_pop.DistrictName.nunique()
32
user_question = "How many women are there in Malawi"
chain.invoke({"question": "{}".format(user_question)})
Failed to batch ingest runs: LangSmithError('Failed to POST https://api.smith.langchain.com/runs/batch in LangSmith API. HTTPError(\'403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/batch\', \'{"detail":"Forbidden"}\')')
Failed to batch ingest runs: LangSmithError('Failed to POST https://api.smith.langchain.com/runs/batch in LangSmith API. HTTPError(\'403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/batch\', \'{"detail":"Forbidden"}\')')
'There are 9,042,289 women in Malawi.'
Failed to batch ingest runs: LangSmithError('Failed to POST https://api.smith.langchain.com/runs/batch in LangSmith API. HTTPError(\'403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/batch\', \'{"detail":"Forbidden"}\')')
# VERIFY THIS INFORMATION USING PYTHON
df_pop.PopulationFemale.sum()
9042289.0
user_question = "How many women are there in Salima district"
chain.invoke({"question": "{}".format(user_question)})
Failed to batch ingest runs: LangSmithError('Failed to POST https://api.smith.langchain.com/runs/batch in LangSmith API. HTTPError(\'403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/batch\', \'{"detail":"Forbidden"}\')')
Failed to batch ingest runs: LangSmithError('Failed to POST https://api.smith.langchain.com/runs/batch in LangSmith API. HTTPError(\'403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/batch\', \'{"detail":"Forbidden"}\')')
'There are 246,415 women in Salima district.'
Failed to batch ingest runs: LangSmithError('Failed to POST https://api.smith.langchain.com/runs/batch in LangSmith API. HTTPError(\'403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/batch\', \'{"detail":"Forbidden"}\')')
# We can check that the answer above is correct using Python code
df_pop.query('DistrictName == "Salima"')['PopulationFemale'].sum()
246415.0

Complicated question#

user_question = "What percent of the population is female in Malawi?"
chain.invoke({"question": "{}".format(user_question)})
Failed to batch ingest runs: LangSmithError('Failed to POST https://api.smith.langchain.com/runs/batch in LangSmith API. HTTPError(\'403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/batch\', \'{"detail":"Forbidden"}\')')
Failed to batch ingest runs: LangSmithError('Failed to POST https://api.smith.langchain.com/runs/batch in LangSmith API. HTTPError(\'403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/batch\', \'{"detail":"Forbidden"}\')')
'Approximately 51.48% of the population in Malawi is female.'
Failed to batch ingest runs: LangSmithError('Failed to POST https://api.smith.langchain.com/runs/batch in LangSmith API. HTTPError(\'403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/batch\', \'{"detail":"Forbidden"}\')')
fem = df_pop.PopulationFemale.sum()
tot = df_pop.TotalPopulation.sum()

fem/tot*100
51.482681744085504
user_question = "are the number of men increasing in the four last years in malawi"
chain.invoke({"question": "{}".format(user_question)})
Failed to batch ingest runs: LangSmithError('Failed to POST https://api.smith.langchain.com/runs/batch in LangSmith API. HTTPError(\'403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/batch\', \'{"detail":"Forbidden"}\')')
Failed to batch ingest runs: LangSmithError('Failed to POST https://api.smith.langchain.com/runs/batch in LangSmith API. HTTPError(\'403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/batch\', \'{"detail":"Forbidden"}\')')
'Based on the SQL query and result provided, we are only retrieving the population of males in the specified region (Central, Ntchisi, TA Malenga) for the last four years. We are not directly comparing the number of men over the years to determine if they are increasing. To answer the user question accurately, we would need to retrieve the population data for men in Malawi over the last four years and compare the numbers to see if there is an increase.'
Failed to batch ingest runs: LangSmithError('Failed to POST https://api.smith.langchain.com/runs/batch in LangSmith API. HTTPError(\'403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/batch\', \'{"detail":"Forbidden"}\')')
user_question = "what is the fertilely rate of Malawi(Calculate)?"
chain.invoke({"question": "{}".format(user_question)})
Failed to batch ingest runs: LangSmithError('Failed to POST https://api.smith.langchain.com/runs/batch in LangSmith API. HTTPError(\'403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/batch\', \'{"detail":"Forbidden"}\')')
Failed to batch ingest runs: LangSmithError('Failed to POST https://api.smith.langchain.com/runs/batch in LangSmith API. HTTPError(\'403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/batch\', \'{"detail":"Forbidden"}\')')
'The fertility rate of Malawi can be calculated by dividing the total female population by the total population. \n\nFor the first set of data:\nFertility rate = total_female_population / total_population\nFertility rate = 1303 / 2604\nFertility rate = 0.5008\n\nFor the second set of data:\nFertility rate = total_female_population / total_population\nFertility rate = 9042289 / 17563749\nFertility rate = 0.5143\n\nTherefore, the fertility rate of Malawi is approximately 0.5008 for the first set of data and 0.5143 for the second set of data.'
Failed to batch ingest runs: LangSmithError('Failed to POST https://api.smith.langchain.com/runs/batch in LangSmith API. HTTPError(\'403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/batch\', \'{"detail":"Forbidden"}\')')

4. EXERCISE: What Question Do You Want Me to Try?#

Share any question in the chat you would like me to try based on this dataset so that we see how much it can handle.

  • Share your question on the chat

  • I will run the question here and we will inspect the response together

5. What We will Do During the Course#

During the course we will use LangChain to build our own Ask-A-Question (AAQ) type of Chatbot to enable a user to chat with a dataset by asking natural language questions. We will build an interactive app like this using Streamlit and be able to share it with others.

Deployment#

  1. Web app

  2. WhatsApp

  3. Chatbot on website of NSO or Health ministry