6. Post-processing the CSV file created by pySocialWatcher
So far, we have:

- Created a development account and generated a token for our collection
- Learned how to customize our collection and save the results to disk

We now assume that a file named output_psw_top5_cities.csv was created on disk after running the previous notebook.
Now we will learn how to use pySocialWatcher tools to post-process the data collected and create a human-readable file, which is also ready to plot some maps.
import pandas as pd
from pysocialwatcher import post_process
df = pd.read_csv("./output_psw_top5_cities.csv")
df.head(5)
Unnamed: 0 | name | interests | ages_ranges | genders | behavior | scholarities | languages | family_statuses | relationship_statuses | ... | household_composition | all_fields | targeting | response | dau_audience | mau_audience | access_device | timestamp | publisher_platforms | mock_response | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | test | NaN | {'min': 18} | 0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | (('ages_ranges', {'min': 18}), ('genders', 0),... | {'geo_locations': {'cities': [{'key': 2880782,... | b'{"data":[{"daily_outcomes_curve":[{"spend":0... | 0 | 1000 | {'name': '2G', 'or': [6017253486583]} | 1614166082 | ["facebook"] | False |
1 | 1 | test | NaN | {'min': 18} | 0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | (('ages_ranges', {'min': 18}), ('genders', 0),... | {'geo_locations': {'cities': [{'key': 2490299,... | b'{"data":[{"daily_outcomes_curve":[{"spend":0... | 0 | 1000 | {'name': '2G', 'or': [6017253486583]} | 1614166082 | ["facebook"] | False |
2 | 2 | test | NaN | {'min': 18} | 0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | (('ages_ranges', {'min': 18}), ('genders', 0),... | {'geo_locations': {'cities': [{'key': 2673660,... | b'{"data":[{"daily_outcomes_curve":[{"spend":0... | 463 | 1700 | {'name': '2G', 'or': [6017253486583]} | 1614166082 | ["facebook"] | False |
3 | 3 | test | NaN | {'min': 18} | 0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | (('ages_ranges', {'min': 18}), ('genders', 0),... | {'geo_locations': {'cities': [{'key': 1035921,... | b'{"data":[{"daily_outcomes_curve":[{"spend":0... | 5055 | 14000 | {'name': '2G', 'or': [6017253486583]} | 1614166082 | ["facebook"] | False |
4 | 4 | test | NaN | {'min': 18} | 0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | (('ages_ranges', {'min': 18}), ('genders', 0),... | {'geo_locations': {'cities': [{'key': 269969, ... | b'{"data":[{"daily_outcomes_curve":[{"spend":0... | 777 | 2000 | {'name': '2G', 'or': [6017253486583]} | 1614166082 | ["facebook"] | False |
5 rows × 21 columns
There are two main post-processing functions in pySocialWatcher, and they need to be used in the following order:

1. post_process.post_process_df_collection creates new columns based on what was used in the collection. For example, there will be a new column for Gender, Ages, Education, Device, etc.
2. post_process.combine_cols takes a (sub)set of the columns created by post_process.post_process_df_collection and generates a new dataframe that combines them into a single "combo" column; after pivoting on it, we obtain one row per unique location in the collection.
processed_df = post_process.post_process_df_collection(df)
processed_df.head(4)[["LocationType", "FullLocation", "Gender", "Ages", "Education"]]
LocationType | FullLocation | Gender | Ages | Education | |
---|---|---|---|---|---|
0 | city | Minato-ku, Tokyo, JP | both | 18- | AllDegrees |
1 | city | New York, New York, US | both | 18- | AllDegrees |
2 | city | Mexico City, Distrito Federal, MX | both | 18- | AllDegrees |
3 | city | Mumbai, Maharashtra, IN | both | 18- | AllDegrees |
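Before combining columns, it can be useful to check which attribute values were actually collected. This is a quick check with plain pandas (not a pySocialWatcher function), using the columns shown above:

# Inspect the breakdowns available in the post-processed dataframe
for col in ["Gender", "Ages", "Education", "Device"]:
    print(col, sorted(processed_df[col].dropna().unique()))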
# We then combine the columns to obtain a dataframe in which every single line is the data for a location
cols_to_combine = ["Gender", "Ages", "Device"]
combo_df = post_process.combine_cols(processed_df, cols_to_combine)
combo_df = combo_df.pivot(index="Key", columns="combo", values="mau_audience").reset_index()
combo_df.head()
combo | Key | both_18-40_2G | both_18-40_3G | both_18-40_4G | both_18-40_AllDevices | both_18-40_Wifi | both_18-_2G | both_18-_3G | both_18-_4G | both_18-_AllDevices | ... | male_41-54_2G | male_41-54_3G | male_41-54_4G | male_41-54_AllDevices | male_41-54_Wifi | male_55-_2G | male_55-_3G | male_55-_4G | male_55-_AllDevices | male_55-_Wifi |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 269969 | 1000 | 45000 | 510000 | 5800000 | 3700000 | 2000 | 88000 | 870000 | 9800000 | ... | 1000 | 12000 | 120000 | 1000000 | 610000 | 1000 | 9300 | 68000 | 600000 | 370000 |
1 | 1035921 | 11000 | 46000 | 5300000 | 9000000 | 1700000 | 14000 | 58000 | 6500000 | 11000000 | ... | 1800 | 5700 | 640000 | 1100000 | 290000 | 1000 | 2700 | 210000 | 450000 | 170000 |
2 | 2490299 | 1000 | 4900 | 520000 | 3300000 | 1600000 | 1000 | 11000 | 1100000 | 5900000 | ... | 1000 | 1700 | 150000 | 670000 | 300000 | 1000 | 2000 | 130000 | 540000 | 270000 |
3 | 2673660 | 1200 | 160000 | 1000000 | 7600000 | 4800000 | 1700 | 240000 | 1400000 | 11000000 | ... | 1000 | 28000 | 180000 | 1100000 | 710000 | 1000 | 17000 | 77000 | 590000 | 410000 |
4 | 2880782 | 1000 | 1000 | 8400 | 64000 | 34000 | 1000 | 1000 | 15000 | 120000 | ... | 1000 | 1000 | 2900 | 23000 | 12000 | 1000 | 1000 | 1500 | 11000 | 5600 |
5 rows × 61 columns
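As a quick sanity check (again plain pandas, not part of pySocialWatcher), the pivoted dataframe should contain exactly one row per location key:

# After the pivot, each location key should appear exactly once
assert combo_df["Key"].nunique() == len(combo_df)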
An important trick is to recover which location each Key refers to. For that, we can use the processed_df dataframe again, as follows:
# Get the unique set of <Key, location>
location_mapping = processed_df[["Key", "Name", "Region", "FullLocation"]].drop_duplicates()
# Merge it back to the post-processed dataframe
final_df = pd.merge(location_mapping, combo_df)
final_df
Key | Name | Region | FullLocation | both_18-40_2G | both_18-40_3G | both_18-40_4G | both_18-40_AllDevices | both_18-40_Wifi | both_18-_2G | ... | male_41-54_2G | male_41-54_3G | male_41-54_4G | male_41-54_AllDevices | male_41-54_Wifi | male_55-_2G | male_55-_3G | male_55-_4G | male_55-_AllDevices | male_55-_Wifi | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2880782 | Minato-ku | Tokyo | Minato-ku, Tokyo, JP | 1000 | 1000 | 8400 | 64000 | 34000 | 1000 | ... | 1000 | 1000 | 2900 | 23000 | 12000 | 1000 | 1000 | 1500 | 11000 | 5600 |
1 | 2490299 | New York | New York | New York, New York, US | 1000 | 4900 | 520000 | 3300000 | 1600000 | 1000 | ... | 1000 | 1700 | 150000 | 670000 | 300000 | 1000 | 2000 | 130000 | 540000 | 270000 |
2 | 2673660 | Mexico City | Distrito Federal | Mexico City, Distrito Federal, MX | 1200 | 160000 | 1000000 | 7600000 | 4800000 | 1700 | ... | 1000 | 28000 | 180000 | 1100000 | 710000 | 1000 | 17000 | 77000 | 590000 | 410000 |
3 | 1035921 | Mumbai | Maharashtra | Mumbai, Maharashtra, IN | 11000 | 46000 | 5300000 | 9000000 | 1700000 | 14000 | ... | 1800 | 5700 | 640000 | 1100000 | 290000 | 1000 | 2700 | 210000 | 450000 | 170000 |
4 | 269969 | São Paulo | São Paulo (state) | São Paulo, São Paulo (state), BR | 1000 | 45000 | 510000 | 5800000 | 3700000 | 2000 | ... | 1000 | 12000 | 120000 | 1000000 | 610000 | 1000 | 9300 | 68000 | 600000 | 370000 |
5 rows × 64 columns
And we are done! Note how we combined the dimensions <Gender, Age, Device> to generate columns such as “both_18-40_2G”, meaning the Facebook audience of both genders, aged 18 to 40, that connects primarily through a 2G connection.
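The combined column names join the selected dimensions with underscores, so you can recover the individual parts later if needed. A minimal illustration, assuming the individual values themselves contain no underscores (which holds for this collection):

# Unpack a combined column name back into its <Gender, Ages, Device> parts
gender, ages, device = "both_18-40_2G".rsplit("_", 2)
print(gender, ages, device)  # both 18-40 2G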
Look at the dataframe above: for Minato-ku, New York, and Mexico City, the value of the column “both_18-40_2G” is 1000. The API returns 1000 whenever the Facebook audience matching our criteria is equal to or smaller than 1000, so we cannot tell whether these 1000s are really 1000s or something smaller, such as 100s or even 0s.
The default solution is simply not to trust values of 1000; another is to try to find a more precise estimate. This is challenging and not yet automated by pySocialWatcher (i.e., no function does it for you), but one workaround is to submit additional API calls that combine multiple criteria and compare the results. For example, New York has about 4,900 users matching both_18-40_3G. If we issued an extra API call for people using either 2G or 3G, we could create a column both_18-40_2G3G with, say, 5,400 users, implying that the 1000 reported for “both_18-40_2G” is actually closer to 500.
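If you prefer the conservative route, a simple cleanup before any analysis is to treat the 1000s as missing values. The sketch below is plain pandas, not a pySocialWatcher function, and audience_cols is just a hypothetical name for the combined columns:

import numpy as np

# Treat the API lower bound of 1000 as "unknown" rather than a real estimate
audience_cols = [c for c in final_df.columns
                 if c not in ("Key", "Name", "Region", "FullLocation")]
cleaned_df = final_df.copy()
cleaned_df[audience_cols] = cleaned_df[audience_cols].replace(1000, np.nan)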
Finally, we save the output as another CSV file, which we will later use to plot some maps.
final_df.to_csv("processed_top5_cities.csv", index=False)