{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(content:post_process_collection)=\n",
    "# Post-processing the CSV file created by pySocialWatcher\n",
    "\n",
    "So far, we have:\n",
    "1. [Created a development account and generated a token for our collection](getting_your_token)\n",
    "2. [Learned how to run a data collection](content:basic_example)\n",
    "3. [Learned how to customize our collection and save the results to disk](content:json_creation)\n",
    "\n",
    "We assume that a file named ``output_psw_top5_cities.csv`` was created on disk by running the [previous notebook](content:json_creation).\n",
    "\n",
    "We will now use the pySocialWatcher tools to post-process the collected data into a human-readable file that is also ready [to plot some maps](content:plotting_maps)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-02-24T11:35:05.370203Z",
     "start_time": "2021-02-24T11:35:05.364538Z"
    }
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from pysocialwatcher import post_process"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-02-24T11:35:07.398072Z",
     "start_time": "2021-02-24T11:35:07.372890Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Unnamed: 0</th>\n",
       "      <th>name</th>\n",
       "      <th>interests</th>\n",
       "      <th>ages_ranges</th>\n",
       "      <th>genders</th>\n",
       "      <th>behavior</th>\n",
       "      <th>scholarities</th>\n",
       "      <th>languages</th>\n",
       "      <th>family_statuses</th>\n",
       "      <th>relationship_statuses</th>\n",
       "      <th>...</th>\n",
       "      <th>household_composition</th>\n",
       "      <th>all_fields</th>\n",
       "      <th>targeting</th>\n",
       "      <th>response</th>\n",
       "      <th>dau_audience</th>\n",
       "      <th>mau_audience</th>\n",
       "      <th>access_device</th>\n",
       "      <th>timestamp</th>\n",
       "      <th>publisher_platforms</th>\n",
       "      <th>mock_response</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>test</td>\n",
       "      <td>NaN</td>\n",
       "      <td>{'min': 18}</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>(('ages_ranges', {'min': 18}), ('genders', 0),...</td>\n",
       "      <td>{'geo_locations': {'cities': [{'key': 2880782,...</td>\n",
       "      <td>b'{\"data\":[{\"daily_outcomes_curve\":[{\"spend\":0...</td>\n",
       "      <td>0</td>\n",
       "      <td>1000</td>\n",
       "      <td>{'name': '2G', 'or': [6017253486583]}</td>\n",
       "      <td>1614166082</td>\n",
       "      <td>[\"facebook\"]</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>test</td>\n",
       "      <td>NaN</td>\n",
       "      <td>{'min': 18}</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>(('ages_ranges', {'min': 18}), ('genders', 0),...</td>\n",
       "      <td>{'geo_locations': {'cities': [{'key': 2490299,...</td>\n",
       "      <td>b'{\"data\":[{\"daily_outcomes_curve\":[{\"spend\":0...</td>\n",
       "      <td>0</td>\n",
       "      <td>1000</td>\n",
       "      <td>{'name': '2G', 'or': [6017253486583]}</td>\n",
       "      <td>1614166082</td>\n",
       "      <td>[\"facebook\"]</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>test</td>\n",
       "      <td>NaN</td>\n",
       "      <td>{'min': 18}</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>(('ages_ranges', {'min': 18}), ('genders', 0),...</td>\n",
       "      <td>{'geo_locations': {'cities': [{'key': 2673660,...</td>\n",
       "      <td>b'{\"data\":[{\"daily_outcomes_curve\":[{\"spend\":0...</td>\n",
       "      <td>463</td>\n",
       "      <td>1700</td>\n",
       "      <td>{'name': '2G', 'or': [6017253486583]}</td>\n",
       "      <td>1614166082</td>\n",
       "      <td>[\"facebook\"]</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>test</td>\n",
       "      <td>NaN</td>\n",
       "      <td>{'min': 18}</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>(('ages_ranges', {'min': 18}), ('genders', 0),...</td>\n",
       "      <td>{'geo_locations': {'cities': [{'key': 1035921,...</td>\n",
       "      <td>b'{\"data\":[{\"daily_outcomes_curve\":[{\"spend\":0...</td>\n",
       "      <td>5055</td>\n",
       "      <td>14000</td>\n",
       "      <td>{'name': '2G', 'or': [6017253486583]}</td>\n",
       "      <td>1614166082</td>\n",
       "      <td>[\"facebook\"]</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>test</td>\n",
       "      <td>NaN</td>\n",
       "      <td>{'min': 18}</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>(('ages_ranges', {'min': 18}), ('genders', 0),...</td>\n",
       "      <td>{'geo_locations': {'cities': [{'key': 269969, ...</td>\n",
       "      <td>b'{\"data\":[{\"daily_outcomes_curve\":[{\"spend\":0...</td>\n",
       "      <td>777</td>\n",
       "      <td>2000</td>\n",
       "      <td>{'name': '2G', 'or': [6017253486583]}</td>\n",
       "      <td>1614166082</td>\n",
       "      <td>[\"facebook\"]</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 21 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   Unnamed: 0  name  interests  ages_ranges  genders  behavior  scholarities  \\\n",
       "0           0  test        NaN  {'min': 18}        0       NaN           NaN   \n",
       "1           1  test        NaN  {'min': 18}        0       NaN           NaN   \n",
       "2           2  test        NaN  {'min': 18}        0       NaN           NaN   \n",
       "3           3  test        NaN  {'min': 18}        0       NaN           NaN   \n",
       "4           4  test        NaN  {'min': 18}        0       NaN           NaN   \n",
       "\n",
       "   languages  family_statuses  relationship_statuses  ...  \\\n",
       "0        NaN              NaN                    NaN  ...   \n",
       "1        NaN              NaN                    NaN  ...   \n",
       "2        NaN              NaN                    NaN  ...   \n",
       "3        NaN              NaN                    NaN  ...   \n",
       "4        NaN              NaN                    NaN  ...   \n",
       "\n",
       "  household_composition                                         all_fields  \\\n",
       "0                   NaN  (('ages_ranges', {'min': 18}), ('genders', 0),...   \n",
       "1                   NaN  (('ages_ranges', {'min': 18}), ('genders', 0),...   \n",
       "2                   NaN  (('ages_ranges', {'min': 18}), ('genders', 0),...   \n",
       "3                   NaN  (('ages_ranges', {'min': 18}), ('genders', 0),...   \n",
       "4                   NaN  (('ages_ranges', {'min': 18}), ('genders', 0),...   \n",
       "\n",
       "                                           targeting  \\\n",
       "0  {'geo_locations': {'cities': [{'key': 2880782,...   \n",
       "1  {'geo_locations': {'cities': [{'key': 2490299,...   \n",
       "2  {'geo_locations': {'cities': [{'key': 2673660,...   \n",
       "3  {'geo_locations': {'cities': [{'key': 1035921,...   \n",
       "4  {'geo_locations': {'cities': [{'key': 269969, ...   \n",
       "\n",
       "                                            response dau_audience  \\\n",
       "0  b'{\"data\":[{\"daily_outcomes_curve\":[{\"spend\":0...            0   \n",
       "1  b'{\"data\":[{\"daily_outcomes_curve\":[{\"spend\":0...            0   \n",
       "2  b'{\"data\":[{\"daily_outcomes_curve\":[{\"spend\":0...          463   \n",
       "3  b'{\"data\":[{\"daily_outcomes_curve\":[{\"spend\":0...         5055   \n",
       "4  b'{\"data\":[{\"daily_outcomes_curve\":[{\"spend\":0...          777   \n",
       "\n",
       "   mau_audience                          access_device   timestamp  \\\n",
       "0          1000  {'name': '2G', 'or': [6017253486583]}  1614166082   \n",
       "1          1000  {'name': '2G', 'or': [6017253486583]}  1614166082   \n",
       "2          1700  {'name': '2G', 'or': [6017253486583]}  1614166082   \n",
       "3         14000  {'name': '2G', 'or': [6017253486583]}  1614166082   \n",
       "4          2000  {'name': '2G', 'or': [6017253486583]}  1614166082   \n",
       "\n",
       "   publisher_platforms mock_response  \n",
       "0         [\"facebook\"]         False  \n",
       "1         [\"facebook\"]         False  \n",
       "2         [\"facebook\"]         False  \n",
       "3         [\"facebook\"]         False  \n",
       "4         [\"facebook\"]         False  \n",
       "\n",
       "[5 rows x 21 columns]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(\"./output_psw_top5_cities.csv\")\n",
    "df.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`pySocialWatcher` provides two main functions to post-process a collection. They must be used in the following order:\n",
    "1. ``post_process.post_process_df_collection`` creates new columns based on the fields used in the collection, for example Gender, Ages, Education and Device.\n",
    "2. ``post_process.combine_cols`` takes a (sub)set of the columns created by ``post_process.post_process_df_collection`` and generates a new dataframe that combines them and has as many rows as unique locations in the collection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-02-24T11:35:13.036509Z",
     "start_time": "2021-02-24T11:35:12.857577Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>LocationType</th>\n",
       "      <th>FullLocation</th>\n",
       "      <th>Gender</th>\n",
       "      <th>Ages</th>\n",
       "      <th>Education</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>city</td>\n",
       "      <td>Minato-ku, Tokyo, JP</td>\n",
       "      <td>both</td>\n",
       "      <td>18-</td>\n",
       "      <td>AllDegrees</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>city</td>\n",
       "      <td>New York, New York, US</td>\n",
       "      <td>both</td>\n",
       "      <td>18-</td>\n",
       "      <td>AllDegrees</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>city</td>\n",
       "      <td>Mexico City, Distrito Federal, MX</td>\n",
       "      <td>both</td>\n",
       "      <td>18-</td>\n",
       "      <td>AllDegrees</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>city</td>\n",
       "      <td>Mumbai, Maharashtra, IN</td>\n",
       "      <td>both</td>\n",
       "      <td>18-</td>\n",
       "      <td>AllDegrees</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  LocationType                       FullLocation Gender Ages   Education\n",
       "0         city               Minato-ku, Tokyo, JP   both  18-  AllDegrees\n",
       "1         city             New York, New York, US   both  18-  AllDegrees\n",
       "2         city  Mexico City, Distrito Federal, MX   both  18-  AllDegrees\n",
       "3         city            Mumbai, Maharashtra, IN   both  18-  AllDegrees"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "processed_df = post_process.post_process_df_collection(df)\n",
    "processed_df.head(4)[[\"LocationType\", \"FullLocation\", \"Gender\", \"Ages\", \"Education\"]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-02-24T11:35:26.625502Z",
     "start_time": "2021-02-24T11:35:26.499249Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>combo</th>\n",
       "      <th>Key</th>\n",
       "      <th>both_18-40_2G</th>\n",
       "      <th>both_18-40_3G</th>\n",
       "      <th>both_18-40_4G</th>\n",
       "      <th>both_18-40_AllDevices</th>\n",
       "      <th>both_18-40_Wifi</th>\n",
       "      <th>both_18-_2G</th>\n",
       "      <th>both_18-_3G</th>\n",
       "      <th>both_18-_4G</th>\n",
       "      <th>both_18-_AllDevices</th>\n",
       "      <th>...</th>\n",
       "      <th>male_41-54_2G</th>\n",
       "      <th>male_41-54_3G</th>\n",
       "      <th>male_41-54_4G</th>\n",
       "      <th>male_41-54_AllDevices</th>\n",
       "      <th>male_41-54_Wifi</th>\n",
       "      <th>male_55-_2G</th>\n",
       "      <th>male_55-_3G</th>\n",
       "      <th>male_55-_4G</th>\n",
       "      <th>male_55-_AllDevices</th>\n",
       "      <th>male_55-_Wifi</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>269969</td>\n",
       "      <td>1000</td>\n",
       "      <td>45000</td>\n",
       "      <td>510000</td>\n",
       "      <td>5800000</td>\n",
       "      <td>3700000</td>\n",
       "      <td>2000</td>\n",
       "      <td>88000</td>\n",
       "      <td>870000</td>\n",
       "      <td>9800000</td>\n",
       "      <td>...</td>\n",
       "      <td>1000</td>\n",
       "      <td>12000</td>\n",
       "      <td>120000</td>\n",
       "      <td>1000000</td>\n",
       "      <td>610000</td>\n",
       "      <td>1000</td>\n",
       "      <td>9300</td>\n",
       "      <td>68000</td>\n",
       "      <td>600000</td>\n",
       "      <td>370000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1035921</td>\n",
       "      <td>11000</td>\n",
       "      <td>46000</td>\n",
       "      <td>5300000</td>\n",
       "      <td>9000000</td>\n",
       "      <td>1700000</td>\n",
       "      <td>14000</td>\n",
       "      <td>58000</td>\n",
       "      <td>6500000</td>\n",
       "      <td>11000000</td>\n",
       "      <td>...</td>\n",
       "      <td>1800</td>\n",
       "      <td>5700</td>\n",
       "      <td>640000</td>\n",
       "      <td>1100000</td>\n",
       "      <td>290000</td>\n",
       "      <td>1000</td>\n",
       "      <td>2700</td>\n",
       "      <td>210000</td>\n",
       "      <td>450000</td>\n",
       "      <td>170000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2490299</td>\n",
       "      <td>1000</td>\n",
       "      <td>4900</td>\n",
       "      <td>520000</td>\n",
       "      <td>3300000</td>\n",
       "      <td>1600000</td>\n",
       "      <td>1000</td>\n",
       "      <td>11000</td>\n",
       "      <td>1100000</td>\n",
       "      <td>5900000</td>\n",
       "      <td>...</td>\n",
       "      <td>1000</td>\n",
       "      <td>1700</td>\n",
       "      <td>150000</td>\n",
       "      <td>670000</td>\n",
       "      <td>300000</td>\n",
       "      <td>1000</td>\n",
       "      <td>2000</td>\n",
       "      <td>130000</td>\n",
       "      <td>540000</td>\n",
       "      <td>270000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2673660</td>\n",
       "      <td>1200</td>\n",
       "      <td>160000</td>\n",
       "      <td>1000000</td>\n",
       "      <td>7600000</td>\n",
       "      <td>4800000</td>\n",
       "      <td>1700</td>\n",
       "      <td>240000</td>\n",
       "      <td>1400000</td>\n",
       "      <td>11000000</td>\n",
       "      <td>...</td>\n",
       "      <td>1000</td>\n",
       "      <td>28000</td>\n",
       "      <td>180000</td>\n",
       "      <td>1100000</td>\n",
       "      <td>710000</td>\n",
       "      <td>1000</td>\n",
       "      <td>17000</td>\n",
       "      <td>77000</td>\n",
       "      <td>590000</td>\n",
       "      <td>410000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2880782</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>8400</td>\n",
       "      <td>64000</td>\n",
       "      <td>34000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>15000</td>\n",
       "      <td>120000</td>\n",
       "      <td>...</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>2900</td>\n",
       "      <td>23000</td>\n",
       "      <td>12000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1500</td>\n",
       "      <td>11000</td>\n",
       "      <td>5600</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 61 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "combo      Key  both_18-40_2G  both_18-40_3G  both_18-40_4G  \\\n",
       "0       269969           1000          45000         510000   \n",
       "1      1035921          11000          46000        5300000   \n",
       "2      2490299           1000           4900         520000   \n",
       "3      2673660           1200         160000        1000000   \n",
       "4      2880782           1000           1000           8400   \n",
       "\n",
       "combo  both_18-40_AllDevices  both_18-40_Wifi  both_18-_2G  both_18-_3G  \\\n",
       "0                    5800000          3700000         2000        88000   \n",
       "1                    9000000          1700000        14000        58000   \n",
       "2                    3300000          1600000         1000        11000   \n",
       "3                    7600000          4800000         1700       240000   \n",
       "4                      64000            34000         1000         1000   \n",
       "\n",
       "combo  both_18-_4G  both_18-_AllDevices  ...  male_41-54_2G  male_41-54_3G  \\\n",
       "0           870000              9800000  ...           1000          12000   \n",
       "1          6500000             11000000  ...           1800           5700   \n",
       "2          1100000              5900000  ...           1000           1700   \n",
       "3          1400000             11000000  ...           1000          28000   \n",
       "4            15000               120000  ...           1000           1000   \n",
       "\n",
       "combo  male_41-54_4G  male_41-54_AllDevices  male_41-54_Wifi  male_55-_2G  \\\n",
       "0             120000                1000000           610000         1000   \n",
       "1             640000                1100000           290000         1000   \n",
       "2             150000                 670000           300000         1000   \n",
       "3             180000                1100000           710000         1000   \n",
       "4               2900                  23000            12000         1000   \n",
       "\n",
       "combo  male_55-_3G  male_55-_4G  male_55-_AllDevices  male_55-_Wifi  \n",
       "0             9300        68000               600000         370000  \n",
       "1             2700       210000               450000         170000  \n",
       "2             2000       130000               540000         270000  \n",
       "3            17000        77000               590000         410000  \n",
       "4             1000         1500                11000           5600  \n",
       "\n",
       "[5 rows x 61 columns]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
    ],
    "source": [
     "# Combine the selected columns into a single \"combo\" label (e.g. both_18-40_2G)\n",
     "cols_to_combine = [\"Gender\", \"Ages\", \"Device\"]\n",
     "combo_df = post_process.combine_cols(processed_df, cols_to_combine)\n",
     "\n",
     "# Pivot so that each row holds the data for one location (Key) and each column is one combo\n",
     "combo_df = combo_df.pivot(index=\"Key\", columns=\"combo\", values=\"mau_audience\").reset_index()\n",
    "combo_df.head()"
   ]
  },
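  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the reshaping step above more concrete, here is a minimal pure-pandas sketch of the same idea on made-up data. It is not the implementation of ``post_process.combine_cols``: the toy values, the ``toy_`` variable names and the use of ``\"_\".join`` are assumptions made for illustration. It only shows how per-row labels such as ``both_18-_2G`` can be built and then pivoted into one row per location."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical long-format data: one row per (location, gender, age, device)\n",
    "toy_df = pd.DataFrame({\n",
    "    \"Key\": [1, 1, 2, 2],\n",
    "    \"Gender\": [\"both\", \"both\", \"both\", \"both\"],\n",
    "    \"Ages\": [\"18-\", \"18-\", \"18-\", \"18-\"],\n",
    "    \"Device\": [\"2G\", \"3G\", \"2G\", \"3G\"],\n",
    "    \"mau_audience\": [1000, 4900, 1200, 160000],\n",
    "})\n",
    "\n",
    "# Build one label per row by joining the chosen columns with \"_\"\n",
    "toy_cols = [\"Gender\", \"Ages\", \"Device\"]\n",
    "toy_df[\"combo\"] = toy_df[toy_cols].agg(\"_\".join, axis=1)\n",
    "\n",
    "# Pivot: one row per Key, one column per combo label\n",
    "toy_wide = toy_df.pivot(index=\"Key\", columns=\"combo\", values=\"mau_audience\").reset_index()\n",
    "toy_wide"
   ]
  },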
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A useful trick is to recover which location each ``Key`` refers to.\n",
    "For that, we can use the ``processed_df`` dataframe again as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-02-24T11:35:28.681949Z",
     "start_time": "2021-02-24T11:35:28.666139Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Key</th>\n",
       "      <th>Name</th>\n",
       "      <th>Region</th>\n",
       "      <th>FullLocation</th>\n",
       "      <th>both_18-40_2G</th>\n",
       "      <th>both_18-40_3G</th>\n",
       "      <th>both_18-40_4G</th>\n",
       "      <th>both_18-40_AllDevices</th>\n",
       "      <th>both_18-40_Wifi</th>\n",
       "      <th>both_18-_2G</th>\n",
       "      <th>...</th>\n",
       "      <th>male_41-54_2G</th>\n",
       "      <th>male_41-54_3G</th>\n",
       "      <th>male_41-54_4G</th>\n",
       "      <th>male_41-54_AllDevices</th>\n",
       "      <th>male_41-54_Wifi</th>\n",
       "      <th>male_55-_2G</th>\n",
       "      <th>male_55-_3G</th>\n",
       "      <th>male_55-_4G</th>\n",
       "      <th>male_55-_AllDevices</th>\n",
       "      <th>male_55-_Wifi</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2880782</td>\n",
       "      <td>Minato-ku</td>\n",
       "      <td>Tokyo</td>\n",
       "      <td>Minato-ku, Tokyo, JP</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>8400</td>\n",
       "      <td>64000</td>\n",
       "      <td>34000</td>\n",
       "      <td>1000</td>\n",
       "      <td>...</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>2900</td>\n",
       "      <td>23000</td>\n",
       "      <td>12000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1500</td>\n",
       "      <td>11000</td>\n",
       "      <td>5600</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2490299</td>\n",
       "      <td>New York</td>\n",
       "      <td>New York</td>\n",
       "      <td>New York, New York, US</td>\n",
       "      <td>1000</td>\n",
       "      <td>4900</td>\n",
       "      <td>520000</td>\n",
       "      <td>3300000</td>\n",
       "      <td>1600000</td>\n",
       "      <td>1000</td>\n",
       "      <td>...</td>\n",
       "      <td>1000</td>\n",
       "      <td>1700</td>\n",
       "      <td>150000</td>\n",
       "      <td>670000</td>\n",
       "      <td>300000</td>\n",
       "      <td>1000</td>\n",
       "      <td>2000</td>\n",
       "      <td>130000</td>\n",
       "      <td>540000</td>\n",
       "      <td>270000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2673660</td>\n",
       "      <td>Mexico City</td>\n",
       "      <td>Distrito Federal</td>\n",
       "      <td>Mexico City, Distrito Federal, MX</td>\n",
       "      <td>1200</td>\n",
       "      <td>160000</td>\n",
       "      <td>1000000</td>\n",
       "      <td>7600000</td>\n",
       "      <td>4800000</td>\n",
       "      <td>1700</td>\n",
       "      <td>...</td>\n",
       "      <td>1000</td>\n",
       "      <td>28000</td>\n",
       "      <td>180000</td>\n",
       "      <td>1100000</td>\n",
       "      <td>710000</td>\n",
       "      <td>1000</td>\n",
       "      <td>17000</td>\n",
       "      <td>77000</td>\n",
       "      <td>590000</td>\n",
       "      <td>410000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1035921</td>\n",
       "      <td>Mumbai</td>\n",
       "      <td>Maharashtra</td>\n",
       "      <td>Mumbai, Maharashtra, IN</td>\n",
       "      <td>11000</td>\n",
       "      <td>46000</td>\n",
       "      <td>5300000</td>\n",
       "      <td>9000000</td>\n",
       "      <td>1700000</td>\n",
       "      <td>14000</td>\n",
       "      <td>...</td>\n",
       "      <td>1800</td>\n",
       "      <td>5700</td>\n",
       "      <td>640000</td>\n",
       "      <td>1100000</td>\n",
       "      <td>290000</td>\n",
       "      <td>1000</td>\n",
       "      <td>2700</td>\n",
       "      <td>210000</td>\n",
       "      <td>450000</td>\n",
       "      <td>170000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>269969</td>\n",
       "      <td>São Paulo</td>\n",
       "      <td>São Paulo (state)</td>\n",
       "      <td>São Paulo, São Paulo (state), BR</td>\n",
       "      <td>1000</td>\n",
       "      <td>45000</td>\n",
       "      <td>510000</td>\n",
       "      <td>5800000</td>\n",
       "      <td>3700000</td>\n",
       "      <td>2000</td>\n",
       "      <td>...</td>\n",
       "      <td>1000</td>\n",
       "      <td>12000</td>\n",
       "      <td>120000</td>\n",
       "      <td>1000000</td>\n",
       "      <td>610000</td>\n",
       "      <td>1000</td>\n",
       "      <td>9300</td>\n",
       "      <td>68000</td>\n",
       "      <td>600000</td>\n",
       "      <td>370000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 64 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       Key         Name             Region                       FullLocation  \\\n",
       "0  2880782    Minato-ku              Tokyo               Minato-ku, Tokyo, JP   \n",
       "1  2490299     New York           New York             New York, New York, US   \n",
       "2  2673660  Mexico City   Distrito Federal  Mexico City, Distrito Federal, MX   \n",
       "3  1035921       Mumbai        Maharashtra            Mumbai, Maharashtra, IN   \n",
       "4   269969    São Paulo  São Paulo (state)   São Paulo, São Paulo (state), BR   \n",
       "\n",
       "   both_18-40_2G  both_18-40_3G  both_18-40_4G  both_18-40_AllDevices  \\\n",
       "0           1000           1000           8400                  64000   \n",
       "1           1000           4900         520000                3300000   \n",
       "2           1200         160000        1000000                7600000   \n",
       "3          11000          46000        5300000                9000000   \n",
       "4           1000          45000         510000                5800000   \n",
       "\n",
       "   both_18-40_Wifi  both_18-_2G  ...  male_41-54_2G  male_41-54_3G  \\\n",
       "0            34000         1000  ...           1000           1000   \n",
       "1          1600000         1000  ...           1000           1700   \n",
       "2          4800000         1700  ...           1000          28000   \n",
       "3          1700000        14000  ...           1800           5700   \n",
       "4          3700000         2000  ...           1000          12000   \n",
       "\n",
       "   male_41-54_4G  male_41-54_AllDevices  male_41-54_Wifi  male_55-_2G  \\\n",
       "0           2900                  23000            12000         1000   \n",
       "1         150000                 670000           300000         1000   \n",
       "2         180000                1100000           710000         1000   \n",
       "3         640000                1100000           290000         1000   \n",
       "4         120000                1000000           610000         1000   \n",
       "\n",
       "   male_55-_3G  male_55-_4G  male_55-_AllDevices  male_55-_Wifi  \n",
       "0         1000         1500                11000           5600  \n",
       "1         2000       130000               540000         270000  \n",
       "2        17000        77000               590000         410000  \n",
       "3         2700       210000               450000         170000  \n",
       "4         9300        68000               600000         370000  \n",
       "\n",
       "[5 rows x 64 columns]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
    ],
    "source": [
     "# Get the unique set of <Key, location> pairs\n",
     "location_mapping = processed_df[[\"Key\", \"Name\", \"Region\", \"FullLocation\"]].drop_duplicates()\n",
     "\n",
     "# Merge it with the combined dataframe on the shared \"Key\" column\n",
     "final_df = pd.merge(location_mapping, combo_df, on=\"Key\")\n",
    "final_df"
   ]
  },
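  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When merging on ``Key``, it can be worth checking that no location was duplicated or silently dropped. The cell below is a small sketch on made-up data: the ``toy_`` dataframes are hypothetical stand-ins for ``location_mapping`` and ``combo_df``, while ``validate`` and ``indicator`` are standard ``pd.merge`` parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical stand-ins for location_mapping and combo_df\n",
    "toy_locations = pd.DataFrame({\"Key\": [1, 2], \"Name\": [\"City A\", \"City B\"]})\n",
    "toy_counts = pd.DataFrame({\"Key\": [1, 2], \"both_18-_2G\": [1000, 1200]})\n",
    "\n",
    "# validate raises pandas.errors.MergeError if a Key is duplicated on either side\n",
    "toy_merged = pd.merge(toy_locations, toy_counts, on=\"Key\", how=\"outer\",\n",
    "                      validate=\"one_to_one\", indicator=True)\n",
    "\n",
    "# Rows not tagged \"both\" are keys present in only one of the two dataframes\n",
    "toy_unmatched = toy_merged[toy_merged[\"_merge\"] != \"both\"]\n",
    "print(len(toy_unmatched))"
   ]
  },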
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And we are done!\n",
    "Note how we combined the dimensions <Gender, Age, Device>, in that order, to generate columns like __\"both_18-40_2G\"__, meaning the Facebook audience of _both_ genders, aged _18 to 40_, that connected primarily using a _2G_ connection.\n",
    "\n",
    "Look at the dataframe above: for Minato-ku, New York and São Paulo, the value of the column __\"both_18-40_2G\"__ is *1000*. The value **_1000_** is returned whenever the Facebook audience that matches our criteria is *equal to or smaller than 1000*. Unfortunately, we cannot tell whether a 1000 really is 1000 or something smaller, like 100 or 0.\n",
    "\n",
    "The default solution is not to trust values of **1000**. An alternative is to try to find a more precise estimate. This is challenging and not currently supported by pySocialWatcher (i.e., no function does it automatically yet), but one workaround is to submit API calls that combine multiple criteria and compare the results. For example, New York has about 5000 users matching __both_18-40_3G__ (4900 in the table above). If an additional API call for people using either 2G or 3G returned, let's say, 5400 users for a combined column __both_18-40_2G3G__, then the 1000 for __\"both_18-40_2G\"__ should actually be something like 400.\n"
   ]
  },
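  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The arithmetic behind the workaround described above can be sketched as follows. This is only an illustration and not part of pySocialWatcher: the function name ``estimate_floored_audience`` is hypothetical, and 5400 is the made-up result of a combined \"2G or 3G\" query. It assumes the two groups do not overlap. Since the API returns rounded values, the result should be read as a rough estimate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "FLOOR = 1000  # smallest value reported in the tables above\n",
    "\n",
    "def estimate_floored_audience(union_audience, known_audience, floor=FLOOR):\n",
    "    \"\"\"Rough estimate of a floored group: union minus the known group, kept within [0, floor].\"\"\"\n",
    "    estimate = union_audience - known_audience\n",
    "    return max(0, min(estimate, floor))\n",
    "\n",
    "# Numbers from the New York example: 5400 is a made-up result for a \"2G or 3G\" query\n",
    "print(estimate_floored_audience(union_audience=5400, known_audience=5000))"
   ]
  },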
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we can save the result as another CSV file, ready [to plot some maps](content:plotting_maps)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-02-24T11:35:34.272174Z",
     "start_time": "2021-02-24T11:35:34.268707Z"
    }
   },
   "outputs": [],
   "source": [
    "final_df.to_csv(\"processed_top5_cities.csv\", index=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}