6. Post-processing the CSV file created by pySocialWatcher
So far, we have:

- Created a development account and generated a token for our collection
- Learned how to customize our collection and save the results to disk

We now assume that a file named output_psw_top5_cities.csv was created on disk after running the previous notebook.
Now we will learn how to use pySocialWatcher tools to post-process the data collected and create a human-readable file, which is also ready to plot some maps.
import pandas as pd
from pysocialwatcher import post_process
df = pd.read_csv("./output_psw_top5_cities.csv")
df.head(5)
Unnamed: 0 | name | interests | ages_ranges | genders | behavior | scholarities | languages | family_statuses | relationship_statuses | ... | household_composition | all_fields | targeting | response | dau_audience | mau_audience | access_device | timestamp | publisher_platforms | mock_response | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | test | NaN | {'min': 18} | 0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | (('ages_ranges', {'min': 18}), ('genders', 0),... | {'geo_locations': {'cities': [{'key': 2880782,... | b'{"data":[{"daily_outcomes_curve":[{"spend":0... | 0 | 1000 | {'name': '2G', 'or': [6017253486583]} | 1614166082 | ["facebook"] | False |
1 | 1 | test | NaN | {'min': 18} | 0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | (('ages_ranges', {'min': 18}), ('genders', 0),... | {'geo_locations': {'cities': [{'key': 2490299,... | b'{"data":[{"daily_outcomes_curve":[{"spend":0... | 0 | 1000 | {'name': '2G', 'or': [6017253486583]} | 1614166082 | ["facebook"] | False |
2 | 2 | test | NaN | {'min': 18} | 0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | (('ages_ranges', {'min': 18}), ('genders', 0),... | {'geo_locations': {'cities': [{'key': 2673660,... | b'{"data":[{"daily_outcomes_curve":[{"spend":0... | 463 | 1700 | {'name': '2G', 'or': [6017253486583]} | 1614166082 | ["facebook"] | False |
3 | 3 | test | NaN | {'min': 18} | 0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | (('ages_ranges', {'min': 18}), ('genders', 0),... | {'geo_locations': {'cities': [{'key': 1035921,... | b'{"data":[{"daily_outcomes_curve":[{"spend":0... | 5055 | 14000 | {'name': '2G', 'or': [6017253486583]} | 1614166082 | ["facebook"] | False |
4 | 4 | test | NaN | {'min': 18} | 0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | (('ages_ranges', {'min': 18}), ('genders', 0),... | {'geo_locations': {'cities': [{'key': 269969, ... | b'{"data":[{"daily_outcomes_curve":[{"spend":0... | 777 | 2000 | {'name': '2G', 'or': [6017253486583]} | 1614166082 | ["facebook"] | False |
5 rows × 21 columns
There are two main post-processing functions in pySocialWatcher, and they need to be used in the following order:

1. post_process.post_process_df_collection creates new columns based on what was used in the collection. For example, there will be a new column for Gender, Ages, Education, Device, etc.
2. post_process.combine_cols takes a (sub)set of the columns created by post_process.post_process_df_collection and generates a new dataframe that combines them into a single "combo" column; after pivoting on it, we obtain one row per unique location in the collection.
processed_df = post_process.post_process_df_collection(df)
processed_df.head(4)[["LocationType", "FullLocation", "Gender", "Ages", "Education"]]
LocationType | FullLocation | Gender | Ages | Education | |
---|---|---|---|---|---|
0 | city | Minato-ku, Tokyo, JP | both | 18- | AllDegrees |
1 | city | New York, New York, US | both | 18- | AllDegrees |
2 | city | Mexico City, Distrito Federal, MX | both | 18- | AllDegrees |
3 | city | Mumbai, Maharashtra, IN | both | 18- | AllDegrees |
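Before combining columns, it can be useful to check which attribute values were actually collected. This is a quick check with plain pandas (not a pySocialWatcher function), using the columns shown above:

# Inspect the breakdowns available in the post-processed dataframe
for col in ["Gender", "Ages", "Education", "Device"]:
    print(col, sorted(processed_df[col].dropna().unique()))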
# We then combine the columns to obtain a dataframe in which every single line is the data for a location
cols_to_combine = ["Gender", "Ages", "Device"]
combo_df = post_process.combine_cols(processed_df, cols_to_combine)
combo_df = combo_df.pivot(index="Key", columns="combo", values="mau_audience").reset_index()
combo_df.head()
combo | Key | both_18-40_2G | both_18-40_3G | both_18-40_4G | both_18-40_AllDevices | both_18-40_Wifi | both_18-_2G | both_18-_3G | both_18-_4G | both_18-_AllDevices | ... | male_41-54_2G | male_41-54_3G | male_41-54_4G | male_41-54_AllDevices | male_41-54_Wifi | male_55-_2G | male_55-_3G | male_55-_4G | male_55-_AllDevices | male_55-_Wifi |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 269969 | 1000 | 45000 | 510000 | 5800000 | 3700000 | 2000 | 88000 | 870000 | 9800000 | ... | 1000 | 12000 | 120000 | 1000000 | 610000 | 1000 | 9300 | 68000 | 600000 | 370000 |
1 | 1035921 | 11000 | 46000 | 5300000 | 9000000 | 1700000 | 14000 | 58000 | 6500000 | 11000000 | ... | 1800 | 5700 | 640000 | 1100000 | 290000 | 1000 | 2700 | 210000 | 450000 | 170000 |
2 | 2490299 | 1000 | 4900 | 520000 | 3300000 | 1600000 | 1000 | 11000 | 1100000 | 5900000 | ... | 1000 | 1700 | 150000 | 670000 | 300000 | 1000 | 2000 | 130000 | 540000 | 270000 |
3 | 2673660 | 1200 | 160000 | 1000000 | 7600000 | 4800000 | 1700 | 240000 | 1400000 | 11000000 | ... | 1000 | 28000 | 180000 | 1100000 | 710000 | 1000 | 17000 | 77000 | 590000 | 410000 |
4 | 2880782 | 1000 | 1000 | 8400 | 64000 | 34000 | 1000 | 1000 | 15000 | 120000 | ... | 1000 | 1000 | 2900 | 23000 | 12000 | 1000 | 1000 | 1500 | 11000 | 5600 |
5 rows × 61 columns
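As a quick sanity check (again plain pandas, not part of pySocialWatcher), the pivoted dataframe should contain exactly one row per location key:

# After the pivot, each location key should appear exactly once
assert combo_df["Key"].nunique() == len(combo_df)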
An important trick is to recover which location each Key refers to. For that, we can use the processed_df dataframe again, as follows:
# Get the unique set of <Key, location>
location_mapping = processed_df[["Key", "Name", "Region", "FullLocation"]].drop_duplicates()
# Merge it back to the post-processed dataframe
final_df = pd.merge(location_mapping, combo_df)
final_df
Key | Name | Region | FullLocation | both_18-40_2G | both_18-40_3G | both_18-40_4G | both_18-40_AllDevices | both_18-40_Wifi | both_18-_2G | ... | male_41-54_2G | male_41-54_3G | male_41-54_4G | male_41-54_AllDevices | male_41-54_Wifi | male_55-_2G | male_55-_3G | male_55-_4G | male_55-_AllDevices | male_55-_Wifi | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2880782 | Minato-ku | Tokyo | Minato-ku, Tokyo, JP | 1000 | 1000 | 8400 | 64000 | 34000 | 1000 | ... | 1000 | 1000 | 2900 | 23000 | 12000 | 1000 | 1000 | 1500 | 11000 | 5600 |
1 | 2490299 | New York | New York | New York, New York, US | 1000 | 4900 | 520000 | 3300000 | 1600000 | 1000 | ... | 1000 | 1700 | 150000 | 670000 | 300000 | 1000 | 2000 | 130000 | 540000 | 270000 |
2 | 2673660 | Mexico City | Distrito Federal | Mexico City, Distrito Federal, MX | 1200 | 160000 | 1000000 | 7600000 | 4800000 | 1700 | ... | 1000 | 28000 | 180000 | 1100000 | 710000 | 1000 | 17000 | 77000 | 590000 | 410000 |
3 | 1035921 | Mumbai | Maharashtra | Mumbai, Maharashtra, IN | 11000 | 46000 | 5300000 | 9000000 | 1700000 | 14000 | ... | 1800 | 5700 | 640000 | 1100000 | 290000 | 1000 | 2700 | 210000 | 450000 | 170000 |
4 | 269969 | São Paulo | São Paulo (state) | São Paulo, São Paulo (state), BR | 1000 | 45000 | 510000 | 5800000 | 3700000 | 2000 | ... | 1000 | 12000 | 120000 | 1000000 | 610000 | 1000 | 9300 | 68000 | 600000 | 370000 |
5 rows × 64 columns
And we are done! Note how we combined the dimensions <Gender, Age, Device> to generate columns such as “both_18-40_2G”, meaning the Facebook audience of both genders, aged 18 to 40, that connects primarily through a 2G connection.
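The combined column names join the selected dimensions with underscores, so you can recover the individual parts later if needed. A minimal illustration, assuming the individual values themselves contain no underscores (which holds for this collection):

# Unpack a combined column name back into its <Gender, Ages, Device> parts
gender, ages, device = "both_18-40_2G".rsplit("_", 2)
print(gender, ages, device)  # both 18-40 2G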
Look at the dataframe above: for Minato-ku, New York, and Mexico City, the value of the column “both_18-40_2G” is 1000. The API returns 1000 whenever the Facebook audience matching our criteria is equal to or smaller than 1000, so we cannot tell whether these 1000s are really 1000s or something smaller, such as 100s or even 0s.
The default solution is simply not to trust values of 1000; another is to try to find a more precise estimate. This is challenging and not yet automated by pySocialWatcher (i.e., no function does it for you), but one workaround is to submit additional API calls that combine multiple criteria and compare the results. For example, New York has about 4,900 users matching both_18-40_3G. If we issued an extra API call for people using either 2G or 3G, we could create a column both_18-40_2G3G with, say, 5,400 users, implying that the 1000 reported for “both_18-40_2G” is actually closer to 500.
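If you prefer the conservative route, a simple cleanup before any analysis is to treat the 1000s as missing values. The sketch below is plain pandas, not a pySocialWatcher function, and audience_cols is just a hypothetical name for the combined columns:

import numpy as np

# Treat the API lower bound of 1000 as "unknown" rather than a real estimate
audience_cols = [c for c in final_df.columns
                 if c not in ("Key", "Name", "Region", "FullLocation")]
cleaned_df = final_df.copy()
cleaned_df[audience_cols] = cleaned_df[audience_cols].replace(1000, np.nan)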
Finally, we save the output as another CSV file, which we will later use to plot some maps.
final_df.to_csv("processed_top5_cities.csv", index=False)