{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "(content:post_process_collection)=\n", "# Post-processing the CSV file created by pySocialWatcher\n", "\n", "So far, we have: \n", "1. [Created a development account and generated a token for our collection](getting_your_token)\n", "2. [Learned how to run a data collection](content:basic_example)\n", "3. [Learned how to customize our collection and save the results to disk](content:json_creation)\n", "\n", "We now assume that a file named ``output_psw_top5_cities.csv`` is created on disk after processing the [previous notebook](content:json_creation).\n", "\n", "Now we will learn how to use pySocialWatcher tools to post-process the data collected and create a human-readable file, which is also ready [to plot some maps](content:plotting_maps)." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2021-02-24T11:35:05.370203Z", "start_time": "2021-02-24T11:35:05.364538Z" } }, "outputs": [], "source": [ "import pandas as pd\n", "from pysocialwatcher import post_process" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2021-02-24T11:35:07.398072Z", "start_time": "2021-02-24T11:35:07.372890Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0nameinterestsages_rangesgendersbehaviorscholaritieslanguagesfamily_statusesrelationship_statuses...household_compositionall_fieldstargetingresponsedau_audiencemau_audienceaccess_devicetimestamppublisher_platformsmock_response
00testNaN{'min': 18}0NaNNaNNaNNaNNaN...NaN(('ages_ranges', {'min': 18}), ('genders', 0),...{'geo_locations': {'cities': [{'key': 2880782,...b'{\"data\":[{\"daily_outcomes_curve\":[{\"spend\":0...01000{'name': '2G', 'or': [6017253486583]}1614166082[\"facebook\"]False
11testNaN{'min': 18}0NaNNaNNaNNaNNaN...NaN(('ages_ranges', {'min': 18}), ('genders', 0),...{'geo_locations': {'cities': [{'key': 2490299,...b'{\"data\":[{\"daily_outcomes_curve\":[{\"spend\":0...01000{'name': '2G', 'or': [6017253486583]}1614166082[\"facebook\"]False
22testNaN{'min': 18}0NaNNaNNaNNaNNaN...NaN(('ages_ranges', {'min': 18}), ('genders', 0),...{'geo_locations': {'cities': [{'key': 2673660,...b'{\"data\":[{\"daily_outcomes_curve\":[{\"spend\":0...4631700{'name': '2G', 'or': [6017253486583]}1614166082[\"facebook\"]False
33testNaN{'min': 18}0NaNNaNNaNNaNNaN...NaN(('ages_ranges', {'min': 18}), ('genders', 0),...{'geo_locations': {'cities': [{'key': 1035921,...b'{\"data\":[{\"daily_outcomes_curve\":[{\"spend\":0...505514000{'name': '2G', 'or': [6017253486583]}1614166082[\"facebook\"]False
44testNaN{'min': 18}0NaNNaNNaNNaNNaN...NaN(('ages_ranges', {'min': 18}), ('genders', 0),...{'geo_locations': {'cities': [{'key': 269969, ...b'{\"data\":[{\"daily_outcomes_curve\":[{\"spend\":0...7772000{'name': '2G', 'or': [6017253486583]}1614166082[\"facebook\"]False
\n", "

5 rows × 21 columns

\n", "
" ], "text/plain": [ " Unnamed: 0 name interests ages_ranges genders behavior scholarities \\\n", "0 0 test NaN {'min': 18} 0 NaN NaN \n", "1 1 test NaN {'min': 18} 0 NaN NaN \n", "2 2 test NaN {'min': 18} 0 NaN NaN \n", "3 3 test NaN {'min': 18} 0 NaN NaN \n", "4 4 test NaN {'min': 18} 0 NaN NaN \n", "\n", " languages family_statuses relationship_statuses ... \\\n", "0 NaN NaN NaN ... \n", "1 NaN NaN NaN ... \n", "2 NaN NaN NaN ... \n", "3 NaN NaN NaN ... \n", "4 NaN NaN NaN ... \n", "\n", " household_composition all_fields \\\n", "0 NaN (('ages_ranges', {'min': 18}), ('genders', 0),... \n", "1 NaN (('ages_ranges', {'min': 18}), ('genders', 0),... \n", "2 NaN (('ages_ranges', {'min': 18}), ('genders', 0),... \n", "3 NaN (('ages_ranges', {'min': 18}), ('genders', 0),... \n", "4 NaN (('ages_ranges', {'min': 18}), ('genders', 0),... \n", "\n", " targeting \\\n", "0 {'geo_locations': {'cities': [{'key': 2880782,... \n", "1 {'geo_locations': {'cities': [{'key': 2490299,... \n", "2 {'geo_locations': {'cities': [{'key': 2673660,... \n", "3 {'geo_locations': {'cities': [{'key': 1035921,... \n", "4 {'geo_locations': {'cities': [{'key': 269969, ... \n", "\n", " response dau_audience \\\n", "0 b'{\"data\":[{\"daily_outcomes_curve\":[{\"spend\":0... 0 \n", "1 b'{\"data\":[{\"daily_outcomes_curve\":[{\"spend\":0... 0 \n", "2 b'{\"data\":[{\"daily_outcomes_curve\":[{\"spend\":0... 463 \n", "3 b'{\"data\":[{\"daily_outcomes_curve\":[{\"spend\":0... 5055 \n", "4 b'{\"data\":[{\"daily_outcomes_curve\":[{\"spend\":0... 777 \n", "\n", " mau_audience access_device timestamp \\\n", "0 1000 {'name': '2G', 'or': [6017253486583]} 1614166082 \n", "1 1000 {'name': '2G', 'or': [6017253486583]} 1614166082 \n", "2 1700 {'name': '2G', 'or': [6017253486583]} 1614166082 \n", "3 14000 {'name': '2G', 'or': [6017253486583]} 1614166082 \n", "4 2000 {'name': '2G', 'or': [6017253486583]} 1614166082 \n", "\n", " publisher_platforms mock_response \n", "0 [\"facebook\"] False \n", "1 [\"facebook\"] False \n", "2 [\"facebook\"] False \n", "3 [\"facebook\"] False \n", "4 [\"facebook\"] False \n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\"./output_psw_top5_cities.csv\")\n", "df.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are two main functions from `pySocialWatcher` to post-process the collection and they need to be used in the following order:\n", "1. ``post_process.post_process_df_collection`` creates new columns based on what was used in the collection. For example, there will be a new column for Gender, Ages, Education, Device, etc.\n", "2. ``post_process.combine_cols`` generates, given a (sub)set of columns created by ``post_process.post_process_df_collection``, a new dataframe that combines the columns and has as many rows as unique locations in the collection." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2021-02-24T11:35:13.036509Z", "start_time": "2021-02-24T11:35:12.857577Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
LocationTypeFullLocationGenderAgesEducation
0cityMinato-ku, Tokyo, JPboth18-AllDegrees
1cityNew York, New York, USboth18-AllDegrees
2cityMexico City, Distrito Federal, MXboth18-AllDegrees
3cityMumbai, Maharashtra, INboth18-AllDegrees
\n", "
" ], "text/plain": [ " LocationType FullLocation Gender Ages Education\n", "0 city Minato-ku, Tokyo, JP both 18- AllDegrees\n", "1 city New York, New York, US both 18- AllDegrees\n", "2 city Mexico City, Distrito Federal, MX both 18- AllDegrees\n", "3 city Mumbai, Maharashtra, IN both 18- AllDegrees" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "processed_df = post_process.post_process_df_collection(df)\n", "processed_df.head(4)[[\"LocationType\", \"FullLocation\", \"Gender\", \"Ages\", \"Education\"]]" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2021-02-24T11:35:26.625502Z", "start_time": "2021-02-24T11:35:26.499249Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
comboKeyboth_18-40_2Gboth_18-40_3Gboth_18-40_4Gboth_18-40_AllDevicesboth_18-40_Wifiboth_18-_2Gboth_18-_3Gboth_18-_4Gboth_18-_AllDevices...male_41-54_2Gmale_41-54_3Gmale_41-54_4Gmale_41-54_AllDevicesmale_41-54_Wifimale_55-_2Gmale_55-_3Gmale_55-_4Gmale_55-_AllDevicesmale_55-_Wifi
0269969100045000510000580000037000002000880008700009800000...10001200012000010000006100001000930068000600000370000
1103592111000460005300000900000017000001400058000650000011000000...18005700640000110000029000010002700210000450000170000
22490299100049005200003300000160000010001100011000005900000...1000170015000067000030000010002000130000540000270000
3267366012001600001000000760000048000001700240000140000011000000...100028000180000110000071000010001700077000590000410000
4288078210001000840064000340001000100015000120000...1000100029002300012000100010001500110005600
\n", "

5 rows × 61 columns

\n", "
" ], "text/plain": [ "combo Key both_18-40_2G both_18-40_3G both_18-40_4G \\\n", "0 269969 1000 45000 510000 \n", "1 1035921 11000 46000 5300000 \n", "2 2490299 1000 4900 520000 \n", "3 2673660 1200 160000 1000000 \n", "4 2880782 1000 1000 8400 \n", "\n", "combo both_18-40_AllDevices both_18-40_Wifi both_18-_2G both_18-_3G \\\n", "0 5800000 3700000 2000 88000 \n", "1 9000000 1700000 14000 58000 \n", "2 3300000 1600000 1000 11000 \n", "3 7600000 4800000 1700 240000 \n", "4 64000 34000 1000 1000 \n", "\n", "combo both_18-_4G both_18-_AllDevices ... male_41-54_2G male_41-54_3G \\\n", "0 870000 9800000 ... 1000 12000 \n", "1 6500000 11000000 ... 1800 5700 \n", "2 1100000 5900000 ... 1000 1700 \n", "3 1400000 11000000 ... 1000 28000 \n", "4 15000 120000 ... 1000 1000 \n", "\n", "combo male_41-54_4G male_41-54_AllDevices male_41-54_Wifi male_55-_2G \\\n", "0 120000 1000000 610000 1000 \n", "1 640000 1100000 290000 1000 \n", "2 150000 670000 300000 1000 \n", "3 180000 1100000 710000 1000 \n", "4 2900 23000 12000 1000 \n", "\n", "combo male_55-_3G male_55-_4G male_55-_AllDevices male_55-_Wifi \n", "0 9300 68000 600000 370000 \n", "1 2700 210000 450000 170000 \n", "2 2000 130000 540000 270000 \n", "3 17000 77000 590000 410000 \n", "4 1000 1500 11000 5600 \n", "\n", "[5 rows x 61 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We then combine the columns to obtain a dataframe in which every single line is the data for a location\n", "cols_to_combine = [\"Gender\", \"Ages\", \"Device\"]\n", "combo_df = post_process.combine_cols(processed_df, cols_to_combine)\n", "\n", "combo_df = combo_df.pivot(index=\"Key\", columns=\"combo\", values=\"mau_audience\").reset_index()\n", "combo_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An important trick is to recover what are the locations that each Key refers to.\n", "For that, we can use the ``processed_df`` dataframe again as follows:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2021-02-24T11:35:28.681949Z", "start_time": "2021-02-24T11:35:28.666139Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
KeyNameRegionFullLocationboth_18-40_2Gboth_18-40_3Gboth_18-40_4Gboth_18-40_AllDevicesboth_18-40_Wifiboth_18-_2G...male_41-54_2Gmale_41-54_3Gmale_41-54_4Gmale_41-54_AllDevicesmale_41-54_Wifimale_55-_2Gmale_55-_3Gmale_55-_4Gmale_55-_AllDevicesmale_55-_Wifi
02880782Minato-kuTokyoMinato-ku, Tokyo, JP10001000840064000340001000...1000100029002300012000100010001500110005600
12490299New YorkNew YorkNew York, New York, US10004900520000330000016000001000...1000170015000067000030000010002000130000540000270000
22673660Mexico CityDistrito FederalMexico City, Distrito Federal, MX12001600001000000760000048000001700...100028000180000110000071000010001700077000590000410000
31035921MumbaiMaharashtraMumbai, Maharashtra, IN110004600053000009000000170000014000...18005700640000110000029000010002700210000450000170000
4269969São PauloSão Paulo (state)São Paulo, São Paulo (state), BR100045000510000580000037000002000...10001200012000010000006100001000930068000600000370000
\n", "

5 rows × 64 columns

\n", "
" ], "text/plain": [ " Key Name Region FullLocation \\\n", "0 2880782 Minato-ku Tokyo Minato-ku, Tokyo, JP \n", "1 2490299 New York New York New York, New York, US \n", "2 2673660 Mexico City Distrito Federal Mexico City, Distrito Federal, MX \n", "3 1035921 Mumbai Maharashtra Mumbai, Maharashtra, IN \n", "4 269969 São Paulo São Paulo (state) São Paulo, São Paulo (state), BR \n", "\n", " both_18-40_2G both_18-40_3G both_18-40_4G both_18-40_AllDevices \\\n", "0 1000 1000 8400 64000 \n", "1 1000 4900 520000 3300000 \n", "2 1200 160000 1000000 7600000 \n", "3 11000 46000 5300000 9000000 \n", "4 1000 45000 510000 5800000 \n", "\n", " both_18-40_Wifi both_18-_2G ... male_41-54_2G male_41-54_3G \\\n", "0 34000 1000 ... 1000 1000 \n", "1 1600000 1000 ... 1000 1700 \n", "2 4800000 1700 ... 1000 28000 \n", "3 1700000 14000 ... 1800 5700 \n", "4 3700000 2000 ... 1000 12000 \n", "\n", " male_41-54_4G male_41-54_AllDevices male_41-54_Wifi male_55-_2G \\\n", "0 2900 23000 12000 1000 \n", "1 150000 670000 300000 1000 \n", "2 180000 1100000 710000 1000 \n", "3 640000 1100000 290000 1000 \n", "4 120000 1000000 610000 1000 \n", "\n", " male_55-_3G male_55-_4G male_55-_AllDevices male_55-_Wifi \n", "0 1000 1500 11000 5600 \n", "1 2000 130000 540000 270000 \n", "2 17000 77000 590000 410000 \n", "3 2700 210000 450000 170000 \n", "4 9300 68000 600000 370000 \n", "\n", "[5 rows x 64 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get the unique set of \n", "location_mapping = processed_df[[\"Key\", \"Name\", \"Region\", \"FullLocation\"]].drop_duplicates()\n", "\n", "# Merge it back to the post-processed dataframe\n", "final_df = pd.merge(location_mapping, combo_df)\n", "final_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we are done! \n", "Look how we combined, respectively, the dimensions to generate columns like __\"both_18-40_2G\"__ meaning _both_ male and female FB audience aged _18 to 40_ that connected primarily using _2G_ connection. \n", "\n", "\n", "Look at the dataframe above: note that for Minato-ki, New Yorn and Mexico City, the value of the column __\"both_18-40_2G\"__ is *1000*. The value of '**_1000_**' is retrieved when the Facebook audience that matches our criteria is *equal to or smaller than 1000*. Unfortunately, we cannot tell if the 1000s are really 1000s or something else, like 100s or 0s.\n", "\n", "While the default solution is not trusting on values of **1000**, another is trying to finding a more precise estimate. While this is a challenge to do and currently not automatically supported by pySocialWatcher (i.e., no function does it automatically yet), one workaround is submitting API calls that combine multiple criteria and checking the results. For example, for New York, we have 5000 users matching the criteria of __both_18-40_3G__, and if we issued additional API calls for people using either 2G or 3G we might be able to create a column __both_18-40_2G3G__, with let's say 5400 users, meaning that the 1000 for __\"both_18-40_2G\"__ should actually be something like 400.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can save the output file as another csv file [to plot some maps](content:plotting_maps)." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2021-02-24T11:35:34.272174Z", "start_time": "2021-02-24T11:35:34.268707Z" } }, "outputs": [], "source": [ "final_df.to_csv(\"processed_top5_cities.csv\", index=False)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }