6. Post-processing the CSV file created by pySocialWatcher

So far, we have:

  1. Created a development account and generated a token for our collection

  2. Learned how to run a data collection

  3. Learned how to customize our collection and save the results to disk

We now assume that a file named output_psw_top5_cities.csv was created on disk after running the previous notebook.

Now we will learn how to use the pySocialWatcher tools to post-process the collected data and create a human-readable file that is also ready to be used for plotting maps.

import pandas as pd
from pysocialwatcher import post_process

# Load the raw collection created by the previous notebook
df = pd.read_csv("./output_psw_top5_cities.csv")
df.head(5)
Unnamed: 0 name interests ages_ranges genders behavior scholarities languages family_statuses relationship_statuses ... household_composition all_fields targeting response dau_audience mau_audience access_device timestamp publisher_platforms mock_response
0 0 test NaN {'min': 18} 0 NaN NaN NaN NaN NaN ... NaN (('ages_ranges', {'min': 18}), ('genders', 0),... {'geo_locations': {'cities': [{'key': 2880782,... b'{"data":[{"daily_outcomes_curve":[{"spend":0... 0 1000 {'name': '2G', 'or': [6017253486583]} 1614166082 ["facebook"] False
1 1 test NaN {'min': 18} 0 NaN NaN NaN NaN NaN ... NaN (('ages_ranges', {'min': 18}), ('genders', 0),... {'geo_locations': {'cities': [{'key': 2490299,... b'{"data":[{"daily_outcomes_curve":[{"spend":0... 0 1000 {'name': '2G', 'or': [6017253486583]} 1614166082 ["facebook"] False
2 2 test NaN {'min': 18} 0 NaN NaN NaN NaN NaN ... NaN (('ages_ranges', {'min': 18}), ('genders', 0),... {'geo_locations': {'cities': [{'key': 2673660,... b'{"data":[{"daily_outcomes_curve":[{"spend":0... 463 1700 {'name': '2G', 'or': [6017253486583]} 1614166082 ["facebook"] False
3 3 test NaN {'min': 18} 0 NaN NaN NaN NaN NaN ... NaN (('ages_ranges', {'min': 18}), ('genders', 0),... {'geo_locations': {'cities': [{'key': 1035921,... b'{"data":[{"daily_outcomes_curve":[{"spend":0... 5055 14000 {'name': '2G', 'or': [6017253486583]} 1614166082 ["facebook"] False
4 4 test NaN {'min': 18} 0 NaN NaN NaN NaN NaN ... NaN (('ages_ranges', {'min': 18}), ('genders', 0),... {'geo_locations': {'cities': [{'key': 269969, ... b'{"data":[{"daily_outcomes_curve":[{"spend":0... 777 2000 {'name': '2G', 'or': [6017253486583]} 1614166082 ["facebook"] False

5 rows × 21 columns
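Note that the raw file keeps the request and the raw API response as strings (the targeting and response columns above). If you want to double-check what was queried for a given row, you can parse the targeting column back into a Python dict; this is an optional sanity check, assuming the column was written as a Python-dict literal, as in the output above:

import ast

# Parse the 'targeting' string of the first row back into a dict
# (assumes the column was saved as a Python-dict literal, as shown above)
targeting = ast.literal_eval(df.loc[0, "targeting"])
print(targeting["geo_locations"]["cities"])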

There are two main functions in pySocialWatcher to post-process the collection, and they need to be used in the following order:

  1. post_process.post_process_df_collection creates new columns based on the dimensions used in the collection. For example, there will be new columns for Gender, Ages, Education, Device, etc.

  2. post_process.combine_cols takes a (sub)set of the columns created by post_process.post_process_df_collection and generates a new dataframe that combines those columns, one row per <location, combination> pair. Pivoting it, as we do below, yields a dataframe with as many rows as there are unique locations in the collection.

processed_df = post_process.post_process_df_collection(df)
processed_df.head(4)[["LocationType", "FullLocation", "Gender", "Ages", "Education"]]
LocationType FullLocation Gender Ages Education
0 city Minato-ku, Tokyo, JP both 18- AllDegrees
1 city New York, New York, US both 18- AllDegrees
2 city Mexico City, Distrito Federal, MX both 18- AllDegrees
3 city Mumbai, Maharashtra, IN both 18- AllDegrees
# We then combine the columns to obtain a dataframe in which each row holds the data for one location
cols_to_combine = ["Gender", "Ages", "Device"]
combo_df = post_process.combine_cols(processed_df, cols_to_combine)

combo_df = combo_df.pivot(index="Key", columns="combo", values="mau_audience").reset_index()
combo_df.head()
combo Key both_18-40_2G both_18-40_3G both_18-40_4G both_18-40_AllDevices both_18-40_Wifi both_18-_2G both_18-_3G both_18-_4G both_18-_AllDevices ... male_41-54_2G male_41-54_3G male_41-54_4G male_41-54_AllDevices male_41-54_Wifi male_55-_2G male_55-_3G male_55-_4G male_55-_AllDevices male_55-_Wifi
0 269969 1000 45000 510000 5800000 3700000 2000 88000 870000 9800000 ... 1000 12000 120000 1000000 610000 1000 9300 68000 600000 370000
1 1035921 11000 46000 5300000 9000000 1700000 14000 58000 6500000 11000000 ... 1800 5700 640000 1100000 290000 1000 2700 210000 450000 170000
2 2490299 1000 4900 520000 3300000 1600000 1000 11000 1100000 5900000 ... 1000 1700 150000 670000 300000 1000 2000 130000 540000 270000
3 2673660 1200 160000 1000000 7600000 4800000 1700 240000 1400000 11000000 ... 1000 28000 180000 1100000 710000 1000 17000 77000 590000 410000
4 2880782 1000 1000 8400 64000 34000 1000 1000 15000 120000 ... 1000 1000 2900 23000 12000 1000 1000 1500 11000 5600

5 rows × 61 columns
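As a side note, before the pivot combine_cols returns a long-format dataframe with one row per <Key, combo> pair, where the combo column joins (with underscores) the values of the columns we asked to combine, e.g. "both_18-40_2G". A quick way to inspect it (re-running combine_cols, since combo_df was overwritten with the pivoted version above):

# Long-format output of combine_cols, before pivoting
long_df = post_process.combine_cols(processed_df, cols_to_combine)
long_df[["Key", "combo", "mau_audience"]].head()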

An important trick is to recover which location each Key refers to. For that, we can use the processed_df dataframe again, as follows:

# Get the unique set of <Key, location>
location_mapping = processed_df[["Key", "Name", "Region", "FullLocation"]].drop_duplicates()

# Merge it back to the post-processed dataframe
final_df = pd.merge(location_mapping, combo_df)
final_df
Key Name Region FullLocation both_18-40_2G both_18-40_3G both_18-40_4G both_18-40_AllDevices both_18-40_Wifi both_18-_2G ... male_41-54_2G male_41-54_3G male_41-54_4G male_41-54_AllDevices male_41-54_Wifi male_55-_2G male_55-_3G male_55-_4G male_55-_AllDevices male_55-_Wifi
0 2880782 Minato-ku Tokyo Minato-ku, Tokyo, JP 1000 1000 8400 64000 34000 1000 ... 1000 1000 2900 23000 12000 1000 1000 1500 11000 5600
1 2490299 New York New York New York, New York, US 1000 4900 520000 3300000 1600000 1000 ... 1000 1700 150000 670000 300000 1000 2000 130000 540000 270000
2 2673660 Mexico City Distrito Federal Mexico City, Distrito Federal, MX 1200 160000 1000000 7600000 4800000 1700 ... 1000 28000 180000 1100000 710000 1000 17000 77000 590000 410000
3 1035921 Mumbai Maharashtra Mumbai, Maharashtra, IN 11000 46000 5300000 9000000 1700000 14000 ... 1800 5700 640000 1100000 290000 1000 2700 210000 450000 170000
4 269969 São Paulo São Paulo (state) São Paulo, São Paulo (state), BR 1000 45000 510000 5800000 3700000 2000 ... 1000 12000 120000 1000000 610000 1000 9300 68000 600000 370000

5 rows × 64 columns

And we are done! Note how we combined the dimensions <Gender, Age, Device>, in that order, to generate columns like “both_18-40_2G”, meaning the Facebook audience of both genders (male and female), aged 18 to 40, that connects primarily using a 2G connection.
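Since the combined names always follow that same order, you can also go back from a column name to its parts, or select a subset of columns by gender, age, or device. A small example:

# Split a combined column name back into its <Gender, Ages, Device> parts
gender, ages, device = "both_18-40_2G".split("_")

# Select, for instance, all columns reporting the 2G audience
two_g_cols = [c for c in final_df.columns if c.endswith("_2G")]
print(two_g_cols)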

Look at the dataframe above: note that for Minato-ku, New York, and Mexico City, the value of the column “both_18-40_2G” is 1000. The value 1000 is returned whenever the Facebook audience matching our criteria is equal to or smaller than 1000. Unfortunately, we cannot tell whether those 1000s are really 1000s or something smaller, like 100s or even 0s.
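If you want to make this distrust explicit in later analyses, one simple convention (not something pySocialWatcher does for you) is to flag those cells as censored, for example by replacing them with a missing value:

# Audience columns are everything except the location metadata
audience_cols = final_df.columns.difference(["Key", "Name", "Region", "FullLocation"])

# Treat 1000 as "at most 1000": keep a mask of the censored cells and,
# if preferred, replace them with NaN so they are not mistaken for exact counts
censored = final_df[audience_cols] == 1000
final_df_clean = final_df.copy()
final_df_clean[audience_cols] = final_df[audience_cols].mask(censored)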

While the default solution is simply not to trust values of 1000, another is to try to find a more precise estimate. This is challenging and currently not automatically supported by pySocialWatcher (i.e., no function does it for you yet), but one workaround is to submit API calls that combine multiple criteria and check the results. For example, for New York we have 4,900 users matching both_18-40_3G (see the table above); if we issued an additional API call for people using either 2G or 3G, we could create a column both_18-40_2G3G with, let’s say, 5,400 users, meaning that the 1000 in “both_18-40_2G” should actually be something like 500.
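Once such a combined column has been collected, the refinement itself is a simple subtraction. A hypothetical sketch, assuming we managed to collect a both_18-40_2G3G column as described (that column name and the extra collection are made up for illustration):

# Hypothetical: subtract the 3G-only audience from the combined 2G-or-3G audience
# to obtain a finer estimate of the 2G audience than the lower bound of 1000
estimated_2g = final_df["both_18-40_2G3G"] - final_df["both_18-40_3G"]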

Finally, we can save the final dataframe to another CSV file, which we will use later to plot maps.

final_df.to_csv("processed_top5_cities.csv", index=False)