{ "cells": [ { "cell_type": "markdown", "id": "de8b7f60", "metadata": {}, "source": [ "# Model - Scikit-Learn" ] }, { "cell_type": "markdown", "id": "f1a813d7", "metadata": {}, "source": [ "### Imports" ] }, { "cell_type": "code", "execution_count": 1, "id": "25a735ed", "metadata": {}, "outputs": [], "source": [ "import os\n", "import pandas as pd\n", "from iqual import iqualnlp, evaluation, crossval" ] }, { "cell_type": "markdown", "id": "17a42bc8", "metadata": {}, "source": [ "### Load `annotated (human-coded)` and `unannotated` datasets" ] }, { "cell_type": "code", "execution_count": 2, "id": "a7d035ab", "metadata": {}, "outputs": [], "source": [ "data_dir = \"../../data\"\n", "human_coded_df = pd.read_csv(os.path.join(data_dir,\"annotated.csv\"))\n", "uncoded_df = pd.read_csv(os.path.join(data_dir,\"unannotated.csv\"))" ] }, { "cell_type": "markdown", "id": "9b308f43", "metadata": {}, "source": [ "### Split the data into training and test sets" ] }, { "cell_type": "code", "execution_count": 3, "id": "e0f95233", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train Size: 7470\n", "Test Size: 2490\n" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "train_df, test_df = train_test_split(human_coded_df,test_size=0.25)\n", "print(f\"Train Size: {len(train_df)}\\nTest Size: {len(test_df)}\")" ] }, { "cell_type": "markdown", "id": "f0ea3ad2", "metadata": {}, "source": [ "### Configure training data" ] }, { "cell_type": "code", "execution_count": 4, "id": "ebffbf2c", "metadata": {}, "outputs": [], "source": [ "### Select Question and Answer Columns\n", "question_col = 'Q_en'\n", "answer_col = 'A_en'\n", "\n", "### Select a code\n", "code_variable = 'marriage'\n", "\n", "### Create X and y\n", "X = train_df[[question_col,answer_col]]\n", "y = train_df[code_variable]" ] }, { "cell_type": "markdown", "id": "23910543", "metadata": {}, "source": [ "### Initiate model" ] }, { "cell_type": "code", "execution_count": 5, "id": "ecd4446c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('Input',\n",
       "                 FeatureUnion(transformer_list=[('question',\n",
       "                                                 Pipeline(steps=[('selector',\n",
       "                                                                  FunctionTransformer(func=<function column_selector at 0x000002B49C2C9820>,\n",
       "                                                                                      kw_args={'column_name': 'Q_en'})),\n",
       "                                                                 ('vectorizer',\n",
       "                                                                  Vectorizer(analyzer='word',\n",
       "                                                                             binary=False,\n",
       "                                                                             decode_error='strict',\n",
       "                                                                             dtype=<class 'numpy.float64'>,\n",
       "                                                                             encoding='utf-8',\n",
       "                                                                             env='scikit-learn',\n",
       "                                                                             input='co...\n",
       "                                                                             tokenizer=None,\n",
       "                                                                             use_idf=True,\n",
       "                                                                             vocabulary=None))]))])),\n",
       "                ('Classifier',\n",
       "                 Classifier(C=1.0, class_weight=None, dual=False,\n",
       "                            fit_intercept=True, intercept_scaling=1,\n",
       "                            l1_ratio=None, max_iter=100,\n",
       "                            model='LogisticRegression', multi_class='auto',\n",
       "                            n_jobs=None, penalty='l2', random_state=None,\n",
       "                            solver='lbfgs', tol=0.0001, verbose=0,\n",
       "                            warm_start=False)),\n",
       "                ('Threshold', BinaryThresholder())])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
" ], "text/plain": [ "Pipeline(steps=[('Input',\n", " FeatureUnion(transformer_list=[('question',\n", " Pipeline(steps=[('selector',\n", " FunctionTransformer(func=,\n", " kw_args={'column_name': 'Q_en'})),\n", " ('vectorizer',\n", " Vectorizer(analyzer='word',\n", " binary=False,\n", " decode_error='strict',\n", " dtype=,\n", " encoding='utf-8',\n", " env='scikit-learn',\n", " input='co...\n", " tokenizer=None,\n", " use_idf=True,\n", " vocabulary=None))]))])),\n", " ('Classifier',\n", " Classifier(C=1.0, class_weight=None, dual=False,\n", " fit_intercept=True, intercept_scaling=1,\n", " l1_ratio=None, max_iter=100,\n", " model='LogisticRegression', multi_class='auto',\n", " n_jobs=None, penalty='l2', random_state=None,\n", " solver='lbfgs', tol=0.0001, verbose=0,\n", " warm_start=False)),\n", " ('Threshold', BinaryThresholder())])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Step 1: Initiate the model class\n", "iqual_model = iqualnlp.Model()\n", "\n", "# Step 2: Add layers to the model\n", "# Add text columns, and choose a feature extraction model (Available options: scikit-learn, spacy, sentence-transformers, precomputed (picklized dictionary))\n", "iqual_model.add_text_features(question_col,answer_col,model='TfidfVectorizer')\n", "\n", "# Step 3: Add a feature transforming layer (optional)\n", "# A. Choose a feature-scaler. Available options: \n", "# any scikit-learn scaler from `sklearn.preprocessing`\n", "# iqual_model.add_feature_transformer(name='Normalizer', transformation=\"FeatureScaler\")\n", "\n", "# OR\n", "# B. Choose a dimensionality reduction model. Available options:\n", "# - Any scikit-learn dimensionality reduction model from `sklearn.decomposition`\n", "# - Uniform Manifold Approximation and Projection (UMAP) using umap.UMAP (https://umap-learn.readthedocs.io/en/latest/)\n", "\n", "### iqual_model.add_feature_transformer(name='TruncatedSVD', transformation=\"DimensionalityReduction\")\n", "\n", "# Step 4: Add a classifier layer\n", "# Choose a primary classifier model (Available options: any scikit-learn classifier)\n", "iqual_model.add_classifier(name = \"LogisticRegression\")\n", "\n", "# Step 5: Add a threshold layer. This is optional, but recommended for binary classification\n", "iqual_model.add_threshold(scoring_metric='f1')\n", "\n", "# Step 6: Compile the model\n", "iqual_model.compile()" ] }, { "cell_type": "markdown", "id": "0e7ce910", "metadata": {}, "source": [ "### Configure a Hyperparameter Grid for cross-validation + fitting" ] }, { "cell_type": "code", "execution_count": 6, "id": "d11e1507", "metadata": {}, "outputs": [], "source": [ "search_param_config = {\n", " \"Input\":{\n", " \"question\":{\n", " \"vectorizer\":{\n", " \"model\":[\"TfidfVectorizer\"],\n", " \"env\":[\"scikit-learn\"], \n", " \"max_features\":[1000,2000],\n", " \"ngram_range\":[(1,2)],\n", " },\n", " },\n", " \"answer\":{\n", " \"vectorizer\":{\n", " \"model\":[\"TfidfVectorizer\"],\n", " \"env\":[\"scikit-learn\"], \n", " \"max_features\":[1000,2500,4000,6000],\n", " \"ngram_range\":[(1,2)],\n", " }, \n", " },\n", " },\n", " \"Classifier\":{\n", " \"model\":[\"LogisticRegression\"],\n", " \"C\":[0.001,0.01, 0.1, 1.1,],\n", " },\n", "}\n", "\n", "CV_SEARCH_PARAMS = crossval.convert_nested_params(search_param_config)" ] }, { "cell_type": "markdown", "id": "5250bd6d", "metadata": {}, "source": [ "### Model training:\n", "Cross-validate over hyperparameters and select the best model" ] }, { "cell_type": "code", "execution_count": 7, "id": "41bfb8ef", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ".......32 hyperparameters configurations possible.....\n", "Average F1 score: 0.759\n" ] } ], "source": [ "# Scoring Dict for evaluation\n", "scoring_dict = {'f1':evaluation.get_scorer('f1')}\n", "\n", "cv_dict = iqual_model.cross_validate_fit(\n", " X,y, # X: Pandas DataFrame of features, y: Pandas Series of labels\n", " search_parameters=CV_SEARCH_PARAMS, # search_parameters: Dictionary of parameters to use for cross-validation\n", " cv_method='GridSearchCV', # cv_method: Cross-validation method to use, options: GridSearchCV, RandomizedSearchCV\n", " scoring=scoring_dict, # scoring: Scoring metric to use for cross-validation\n", " refit='f1', # refit: Metric to use for refitting the model\n", " n_jobs=-1, # n_jobs: Number of parallel threads to use\n", " cv_splits=3, # cv_splits: Number of cross-validation splits\n", ")\n", "print()\n", "print(\"Average F1 score: {:.3f}\".format(cv_dict['avg_test_score']))" ] }, { "cell_type": "markdown", "id": "c7575dd4", "metadata": {}, "source": [ "### Evaluate model using out sample data (Held out human-coded data)" ] }, { "cell_type": "code", "execution_count": 8, "id": "6e771963", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Out-sample F1-score: 0.822\n" ] } ], "source": [ "test_X = test_df[['Q_en','A_en']]\n", "test_y = test_df[code_variable]\n", "\n", "f1_metric = evaluation.get_metric('f1_score')\n", "f1_score = iqual_model.score(test_X,test_y,scoring_function=f1_metric)\n", "print(f\"Out-sample F1-score: {f1_score:.3f}\")" ] }, { "cell_type": "markdown", "id": "93f3255a", "metadata": {}, "source": [ "### Predict labels for unannotated data" ] }, { "cell_type": "code", "execution_count": 9, "id": "626b81f4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAN8AAADCCAYAAADJsRdpAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAAsTAAALEwEAmpwYAAAMlUlEQVR4nO3db6xcdZ3H8fdnW0G2rtJa96YBQ0u2iSkSgTbQuGa3iKG1JFuNxpSstmAVXcr+yfaBZXmAgSULD9Ck6rpx1waaKAVRY3ctW6+1jfHBBcpupRS3tpQaaUq7UP5sIYEt+frg/C6eXmZ6586dme/c6eeVTObM7/yZ7z25H87M757yVURgZr33B9kFmJ2pHD6zJA6fWRKHzyyJw2eWxOEzSzI9u4B2zZ49O+bOndtw3SuvvMKMGTN6W1ALXNfETNW6Hnvsseci4j3jHigipuRj4cKF0cyOHTuarsvkuiZmqtYF7IoWfof9sdMsicNnlsThM0vi8JklcfjMkkzZPzWczp7DL3Hd+h9nl/EW6y4+OeG6Dt15TZeqsWy+8pklcfjMkjh8ZkkcPrMkDp9ZEofPLInDZ5bE4TNLMm74JL1X0g5JT0raK+lvy/gsScOS9pfnmWVckjZIOiDpcUmX1Y61umy/X9Lq2vhCSXvKPhskqRs/rFk/aeXKdxJYFxELgMXAWkkLgPXA9oiYD2wvrwE+CswvjxuAb0IVVuBW4ArgcuDW0cCWbT5f22/Z5H80s/42bvgi4khE/FdZ/j/gV8B5wArg3rLZvcDHyvIKYFP5d4UjwLmS5gBLgeGIOB4RLwDDwLKy7p0RMVL+IeKm2rHMBtaE7u2UNBe4FHgYGIqII2XVs8BQWT4P+G1tt2fK2OnGn2kw3uj9b6C6mjI0NMTOnTsb1jl0TnUfZb9pp65mP2MnnThxoifvM1GDXlfL4ZP0DuD7wN9FxMv1r2UREZK6/v+dj4hvAd8CWLRoUSxZsqThdl/7zo+4e0//3TO+7uKTE67r0F8u6U4xNTt37qTZucw06HW1NNsp6W1UwftORPygDB8tHxkpz8fK+GHgvbXdzy9jpxs/v8G42UBrZbZTwLeBX0XEV2qrtgCjM5argR/VxleVWc/FwEvl4+k24GpJM8tEy9XAtrLuZUmLy3utqh3LbGC18hnoT4HPAHsk7S5j/wDcCTwgaQ3wG+BTZd1WYDlwAHgVuB4gIo5Luh14tGx3W0QcL8s3AvcA5wAPlYfZQBs3fBHxC6DZ392uarB9AGubHGsjsLHB+C7g/ePVYjZIfIeLWRKHzyyJw2eWxOEzS+LwmSVx+MySOHxmSRw+syQOn1kSh88sicNnlsThM0vi8JklcfjMkjh8ZkkcPrMkDp9ZEofPLInDZ5bE4TNL4vCZJXH4zJI4fGZJHD6zJA6fWRKHzyyJw2eWpJUuRRslHZP0RG3sy5IOS9pdHstr624uvdX3SVpaG19Wxg5IWl8bnyfp4TJ+v6SzOvkDmvWrVq5899C4R/pXI+KS8tgKUHq1rwQuKvv8s6RpkqYB36Dq174AuLZsC3BXOdafAC8AaybzA5lNFa30ZP85cHy87YoVwOaIeC0inqZqE3Z5eRyIiIMR8TqwGVhR+vF9GHiw7F/v7W420CbTO/kmSauAXcC6iHiBqpf6SG2ben/1sf3YrwDeDbwYEScbbP8W7sneHYPe+7zTet6TfYxvArcDUZ7vBj476WrG4Z7s3THovc87rVN1tfUbGhFHR5cl/SvwH+Vls77rNBl/HjhX0vRy9XM/djtjtPWnBklzai8/DozOhG4BVko6W9I8YD7wCFUr6PllZvMsqkmZLaWL7Q7gk2X/em93s4E27pVP0n3AEmC2pGeAW4Elki6h+th5CPgCQETslfQA8CRwElgbEW+U49wEbAOmARsjYm95iy8BmyX9I/DfwLc79cOZ9bNWerJf22C4aUAi4g7gjgbjW4GtDcYPUs2Gmp1RfIeLWRKHzyyJw2eWxOEzS+LwmSVx+MySOHxmSRw+syQOn1kSh88sicNnlsThM0vi8JklcfjMkjh8ZkkcPrMkDp9ZEofPLInDZ5bE4TNL4vCZJXH4zJI4fGZJHD6zJA6fWRKHzyyJw2eWpN2e7LMkDUvaX55nlnFJ2lD6qz8u6bLaPqvL9vslra6NL5S0p+yzoXSrNRt47fZkXw9sj4j5wPbyGqqe6/PL4waqJppImkXV3egKqqYot44Gtmzz+dp+jfq/mw2cdnuyr6Dqnw6n9lFfAWyKyghV48s5wFJgOCKOl/bRw8Cysu6dETFSevVtwj3Z7QzRbu/koYg4UpafBYbK8nm8tff6eeOMP9NgvCH3ZO+OQe993mnZPdnfFBEhKSZdSWvv5Z7sXTDovc87rVN1tTvbeXS0NXR5PlbGm/VkP934+Q3GzQZeu+HbQtU/HU7to74FWFVmPRcDL5WPp9uAqyXNLBMtVwPbyrqXJS0us5yrcE92O0O025P9TuABSWuA3wCfKptvBZYDB4BXgesBIuK4pNuBR8t2t0XE6CTOjVQzqucAD5WH2cBrtyc7wFUNtg1gbZPjbAQ2NhjfBbx/vDrMBo3vcDFL4vCZJXH4zJI4fGZJHD6zJA6fWRKHzyyJw2eWxOEzS+LwmSVx+MySOHxmSRw+syQOn1kSh88sicNnlsThM0vi8JklcfjMkjh8ZkkcPrMkDp9ZEofPLInDZ5bE4TNL4vCZJZlU+CQdKi2dd0vaVcY61jLabJB14sp3ZURcEhGLyutOtow2G1jd+NjZkZbRXajLrK9MNnwB/ETSY6VlM3SuZbTZQJts7+QPRcRhSX8MDEv6n/rKTreMdk/27hj03ued1hc92SPicHk+JumHVN/ZjkqaExFHJtAyesmY8Z1N3s892btg0Hufd1p2T3YkzZD0R6PLVK2en6BDLaPbrctsqpjM5WEI+GHVSp3pwHcj4j8lPUrnWkabDay2wxcRB4EPNBh/ng61jDYbZL7DxSyJw2eWxOEzS+LwmSVx+MySOHxmSRw+syQOn1kSh88sicNnlsThM0vi8JklcfjMkjh8ZkkcPrMkDp9ZEofPLInDZ5bE4TNL4vCZJXH4zJI4fGZJHD6zJA6fWRKHzyxJ/3UTsYE1d/2PJ7T9uotPct0E9+mFe5bN6MhxfOUzS9I34ZO0TNK+0rN9/fh7mE1tfRE+SdOAb1D1bV8AXCtpQW5VZt3VF+Gjaqp5ICIORsTrwGaqHu5mA6tfwue+7HbGmVKznfWe7MAJSfuabDobeK43VbXub9qoS3d1qZhTDcz56oUr7xq3rgtaOU6/hK9Zv/ZT1Huyn46kXRGxqHPldYbrmphBr6tfPnY+CsyXNE/SWcBKqh7uZgOrL658EXFS0k3ANmAasDEi9iaXZdZVfRE+gIjYCmzt0OHG/WiaxHVNzEDXpYjoxHHMbIL65Tuf2RlnyoVvvNvQJJ0t6f6y/mFJc2vrbi7j+yQt7XFdfy/pSUmPS9ou6YLaujck7S6Pjk40tVDXdZL+t/b+n6utWy1pf3ms7nFdX63V9GtJL9bWdfN8bZR0TNITTdZL0oZS9+OSLqutm9j5iogp86CajHkKuBA4C/glsGDMNjcC/1KWVwL3l+UFZfuzgXnlONN6WNeVwB+W5b8arau8PpF4vq4Dvt5g31nAwfI8syzP7FVdY7b/a6pJuK6er3LsPwMuA55osn458BAgYDHwcLvna6pd+Vq5DW0FcG9ZfhC4SpLK+OaIeC0ingYOlOP1pK6I2BERr5aXI1R/y+y2ydy2txQYjojjEfECMAwsS6rrWuC+Dr33aUXEz4Hjp9lkBbApKiPAuZLm0Mb5mmrha+U2tDe3iYiTwEvAu1vct5t11a2h+q/nqLdL2iVpRNLHOlTTROr6RPkI9aCk0Zsd+uJ8lY/n84Cf1Ya7db5a0az2CZ+vvvlTw5lC0qeBRcCf14YviIjDki4EfiZpT0Q81aOS/h24LyJek/QFqk8NH+7Re7diJfBgRLxRG8s8Xx0z1a58rdyG9uY2kqYD7wKeb3HfbtaFpI8AtwB/ERGvjY5HxOHyfBDYCVzaq7oi4vlaLf8GLGx1327WVbOSMR85u3i+WtGs9omfr259ce3Sl+HpVF9k5/H7L+oXjdlmLadOuDxQli/i1AmXg3RuwqWVui6lmmSYP2Z8JnB2WZ4N7Oc0kw9dqGtObfnjwEj8fgLh6VLfzLI8q1d1le3eBxyi/D262+er9h5zaT7hcg2nTrg80u75Sg9UGydmOfDr8ot8Sxm7jepqAvB24HtUEyqPABfW9r2l7LcP+GiP6/opcBTYXR5byvgHgT3lF3APsKbHdf0TsLe8/w7gfbV9P1vO4wHg+l7WVV5/GbhzzH7dPl/3AUeA/6f63rYG+CLwxbJeVP/w+6ny/ovaPV++w8UsyVT7zmc2MBw+syQOn1kSh88sicNnlsThM0vi8JklcfjMkvwOUI9llZ7lDnAAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "uncoded_df[code_variable+'_pred'] = iqual_model.predict(uncoded_df[['Q_en','A_en']])\n", "\n", "uncoded_df[code_variable+\"_pred\"].hist(figsize=(3,3),bins=3)" ] }, { "cell_type": "markdown", "id": "8e3a3e9f", "metadata": {}, "source": [ "### Examples for positive predictions" ] }, { "cell_type": "code", "execution_count": 12, "id": "b1e02a70", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Q: Sister, tell us the dreams and hopes you have about your children.\n", "A: You, girls are 5. I will educate the girls. If I get married, I will give it, if I don't get married, I will study more. And 2 sons. He has a dream to do something big by educating his eldest son. If you can't afford it, do business like father. The little boy is still little. I can't talk about him. Want to do the same to both.\n", "\n", "Q: So what else do you hope for your son except for you to study?\n", "A: What else is the hope that my eldest child is a girl, so when she grows up, I will marry her. If I had money, I could study well. Now we have to get married. Even then, there is no money, how can I meet the dowry that I want now?\n", "\n", "Q: What dreams do you all have about your children? said\n", "A: My dream is that my daughter will study, get a good family and get married. And I gave my son in business, he will run the family by doing business. This is more. Will work hard to run the family. If you don't work, will the money come?\n", "\n" ] } ], "source": [ "for idx, row in uncoded_df.loc[(uncoded_df[code_variable+\"_pred\"]==1),['Q_en','A_en']].sample(3).iterrows():\n", " print(\"Q: \",row['Q_en'],\"\\n\",\"A: \", row['A_en'],sep='')\n", " print()" ] } ], "metadata": { "celltoolbar": "Edit Metadata", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }