3. Introduction to Pandas, Series and dataframes#
ModelFlow is built on top of the Pandas library. Pandas is the Swiss Army knife of data science and can perform an impressive array of data oriented tasks.
This tutorial is a very short introduction to how pandas Series and Dataframes are used with ModelFlow. For a more complete discussion see any of the many tutorials on the internet, notably:
In this chapter - Introduction to Pandas
This chapter introduces pandas, a powerful Python library for data manipulation and analysis, essential for working with World Bank models. It equips users with basic skills needed to manage and analyze data effectively, and work with the ModelFlow package.
This chapter can be safely skipped by users familiar with Python and Pandas.
Key points include:
Core Data Structures:
Series: One-dimensional arrays for handling single-variable data.
DataFrames: Two-dimensional, tabular data structures with labeled rows and columns.
Basic Operations:
Create Series and DataFrames from lists, dictionaries, or external files (e.g., CSV, Excel).
Access and manipulate data using indexing, slicing, and filtering.
Common Methods:
Analyze data with methods like
.mean(),.sum(), and.describe().Clean data using
.dropna()to handle missing values and.apply()for transformations.Select and modify data subsets using
.loc[]and.iloc[].
Advanced Features:
Merge and join datasets for integrating multiple sources.
Reshape data using pivot tables and group operations.
Best Practices:
Structure data clearly for easier analysis.
Use
pandas’ extensive functionality for efficient handling of large datasets.
3.1. Import the pandas library#
As with any python program, in order to use a package or library it must first be imported into the session. As noted above, by convention pandas is imported as pd (assigning the alias pd to the imported pandas library).
import pandas as pd
Note
By importing pandas with the alias pd (import pandas as pd), future references to the pandas library can be truncated as by using pd. instead of pandas.
The Pandas library contains many classes and methods. The discussion below focuses on Series and DataFrames, two classes that are part of the pandas library. Both series and dataframes are containers that can be used to store time-series data and that have associated with them a number of very useful methods for displaying and manipulating time-series data.
Unlike other statistical packages neither series nor dataframes are inherently or exclusively time-series in nature. ModelFlow and macro-economists use them in this way, but the classes themselves are not necessarily dated or numerical.
3.2. The Series class in Pandas#
Series is a class that is part of the pandas package and can be used to instantiate an object that holds a two dimensional array comprised of values and an index.
The constructor for a Series object is pandas.Series(). The content inside the parentheses will determine the nature of the series-object generated. As an object-oriented language, Python supports overrides. Overrides mean a method – including constructors – can have more than one way in which they can be called. Specifically there can be different constructors for a class, depending on how the data used to initialize the object are organized.
3.2.1. Series declared from a list#
The simplest way to create a Series is to pass an array of values as a Python list to the Series constructor.
Note
A list in python is a comma delimited collection of items. It could be text, numbers or even more complex objects. When declared (and returned) lists are enclosed in square brackets.
For example both of the following two lines are perfectly good examples of lists.
mylist=[2,7,8,9]
mylist2=[“Some text”,”Some more Text”,2,3]
The list is entirely agnostic about the type of data it contains.
In the examples below Simplest, Simple, Simple2 and simple3 are all python data objects of type Pandas.Series. Each Series object is instantiated (created) in a different way. Simplest is instantiated by passing it a python list of numbers, simple2 is instantiated by passing a list object (values), series3 is instantiated indirectly by passing a list mixing text and numeric values. Series3 would be hard to interpret as an economic series, but is a perfectly valid python series.
Note
All series have an index. If no index is explicitly declared (see following section) python will assign a zero-based index (a numerical index that starts with 0).
values=[7,8,9,10,11]
weird=["Some text","Some more Text",2,3]
# Here the constructor is passed a numeric list
Simplest=pd.Series([2,3,4,5,6,7])
Simplest
0 2
1 3
2 4
3 5
4 6
5 7
dtype: int64
# In this case the constructor is passed a variable that was defined above as a list
simple2=pd.Series(values)
simple2
0 7
1 8
2 9
3 10
4 11
dtype: int64
# Here the constructor is passed a variable containing a list that is a mix of
# alphanumerics and numerical values
simple3=pd.Series(weird)
simple3
0 Some text
1 Some more Text
2 2
3 3
dtype: object
Note that all three series have different length (6 datapoints for simplest;5 for simple2 and 4 for simple3).
3.2.2. Series declared using a specific index#
In the example below, the series Simplest and Simple2 are recreated (overwritten), but this time an index is specified. Here the index is declared as a(nother) list.
# In this example the constructor is given both the values
# and specific values for the index
Simplest=pd.Series([2,3,4,5,6],index=[1966,1967,1996,1999,2000])
Simplest
1966 2
1967 3
1996 4
1999 5
2000 6
dtype: int64
simple2=pd.Series(values,index=[1966,1967,1996,1999,2000])
simple2
1966 7
1967 8
1996 9
1999 10
2000 11
dtype: int64
Now these Series look more like time-series data!
3.2.3. Create Series from a dictionary#
In python, a dictionary is a data structure that is more generally known in computer science as an associative array. A dictionary consists of a collection of key-value pairs, where each key-value pair maps or links the key to its associated value.
Note
A dictionary is enclosed in curly brackets {}, versus a list which is enclosed in square brackets[].
Thus mydict={“1966”:2,”1967”:3,”1968”:4,”1969”:5,”2000”:-15} creates a dictionary object called mydict. mydictmaps (or links) the key “1966” to the value 2.
Note
In this example the Key was a string but we could just as easily made it a numerical value:
In the following example the keys are numeric
mydict2={1966:2,1967:3,1968:4,1969:5,2000:-15}
Here, mydict2 links (maps) the key 1966 to the value 2 and “1969” links to the value 5.
The series constructor can also accept a dictionary. In this instance, the key of the dictionary becomes the index of the series that was used to instantiate the series.
mydict2={1966:2,1967:3,1968:4,1969:5,2000:6}
simple2=pd.Series(mydict2)
simple2
1966 2
1967 3
1968 4
1969 5
2000 6
dtype: int64
3.3. The DataFrame class in Pandas#
The DataFrame is the primary structure of pandas. It is a two-dimensional data structure with named rows and columns. Each column can have different data types (numeric, string, etc).
By convention, a dataframe is often called df or some other modifier followed by df, to assist in reading the code.
Much more detail on standard pandas dataframes can be found on the official pandas website https://pandas.pydata.org/docs/reference/frame.html.
3.3.1. Creating or instantiating a dataframe#
Like any object, a DataFrame can be created by calling the constructor of the pandas class DataFrame.
The pandas.DataFrame() method is constructor for the DataFrame class. It can take several forms (as with Series), but always returns (instantiates) an instance of a DataFrame object – i.e. a variable whose contents are a DataFrame.
The code example below creates a DataFrame called df comprised of three columns B,C and E; indexed between 2018 and 2021. Macroeconomists may interpret the index as dates, but for pandas they are just numbers.
The DataFrame is instantiated from a dictionary and assigned a specific index by passing a list of years as the index.
df = pd.DataFrame({'B': [1,1,1,1],'C':[1,2,3,6],'E':[4,4,4,4]},
index=[2018,2019,2020,2021])
df
| B | C | E | |
|---|---|---|---|
| 2018 | 1 | 1 | 4 |
| 2019 | 1 | 2 | 4 |
| 2020 | 1 | 3 | 4 |
| 2021 | 1 | 6 | 4 |
Note
In the DataFrames that are used in macrostructural models like MFMod, each column is often interpreted as a time-series of an economic variable. So in this dataframe, normally B, C and E would each be interpreted as an economic time-series.
That said, there is nothing in the DataFrame class that suggests that the data it stores must be time-series or even numeric in nature.
3.3.2. Alternative ways to set the time period of a dated index#
A somewhat more creative way to initialize the dataframe for dates would use a loop to specify the dates that get passed to the constructor as an argument.
Below a DataFrame df with two Series (A and B), is initialized with the values 100 for all data points.
The index is defined dynamically by a loop index=[2020+v for v in range(number_of_rows)] that runs number_of_rows times (6 times in this example) setting v equal to 2020+0, 2020+1,…,2020+5 (recall, that python is zero-indexed so the first value of v is 0, not one). The resulting list whose values are assigned to index is [2020,2021,2022,2023,2024,2025].
The big advantage of this method is that if the user wanted to have data created for the period 1990 to 2030, they would only have to change number_of_rows from 6 to 41, and then change the staring date in the loop from 2020 to 1990.
#define the number of years for which the data is to be created.
number_of_rows = 6
# call the dataframe constructor
df = pd.DataFrame(100,
index=[2020+v for v in range(number_of_rows)], # create row index
# equivalent to index=[2020,2021,2022,2023,2024,2025]
columns=['A','B']) # Assign the output to two columns,named A and B
df
| A | B | |
|---|---|---|
| 2020 | 100 | 100 |
| 2021 | 100 | 100 |
| 2022 | 100 | 100 |
| 2023 | 100 | 100 |
| 2024 | 100 | 100 |
| 2025 | 100 | 100 |
This second example simplifies the creation even further by specifying the begin and end point as a range.
df1 = pd.DataFrame(200,
index=[v for v in range(2020,2030)], # create row index
# equivalent to index=[2020,2021,...,2029]
columns=['A1','B1']) # create column name
df1
| A1 | B1 | |
|---|---|---|
| 2020 | 200 | 200 |
| 2021 | 200 | 200 |
| 2022 | 200 | 200 |
| 2023 | 200 | 200 |
| 2024 | 200 | 200 |
| 2025 | 200 | 200 |
| 2026 | 200 | 200 |
| 2027 | 200 | 200 |
| 2028 | 200 | 200 |
| 2029 | 200 | 200 |
Warning
the range(2020,2030) creates an indexed series from 2020 through 2029, not 2030 as might be expected. If you want a value created for 2030 then the range would have to be range(2020,2031)
3.3.3. Adding a column to a dataframe#
If a value is assigned to a column that does not exist, pandas will add a column with that name and fill it with values resulting from the calculation.
Note
The size of the object assigned to the new column must match the size (number of rows) of the pre-existing DataFrame. Originally df has 6 rows so we must supply 6 data points for this command to run error free.
df["NEW"]=[10,12,10,13,14,15] #df originally has 6 rows so we must supply 6 data points for this command to run error free
df
| A | B | NEW | |
|---|---|---|---|
| 2020 | 100 | 100 | 10 |
| 2021 | 100 | 100 | 12 |
| 2022 | 100 | 100 | 10 |
| 2023 | 100 | 100 | 13 |
| 2024 | 100 | 100 | 14 |
| 2025 | 100 | 100 | 15 |
3.3.4. Revising values#
If the column exists, then the = method will revise the values of the rows with the values assigned in the statement.
Warning
The dimensions of the list assigned via the = method must be the same as the DataFrame (i.e. there must be exactly as many values as there are rows). Alternatively if only one value is provided, then that value will replace all of the values in the specified column (be broadcast to the other rows in the column).
Below we change the values of the NEW Column that we generated earlier.
df["NEW"]=[11,12,10,14,2,1]
df
| A | B | NEW | |
|---|---|---|---|
| 2020 | 100 | 100 | 11 |
| 2021 | 100 | 100 | 12 |
| 2022 | 100 | 100 | 10 |
| 2023 | 100 | 100 | 14 |
| 2024 | 100 | 100 | 2 |
| 2025 | 100 | 100 | 1 |
# replace all of the rows of column B with the same value
df['B']=17
df
| A | B | NEW | |
|---|---|---|---|
| 2020 | 100 | 17 | 11 |
| 2021 | 100 | 17 | 12 |
| 2022 | 100 | 17 | 10 |
| 2023 | 100 | 17 | 14 |
| 2024 | 100 | 17 | 2 |
| 2025 | 100 | 17 | 1 |
Note
The declaration of the column name, can either use single or double quotes – but they cannot be mixed. Thus
df['A']=7
is fine, and so is
df["B"]=7
but
df["Error']=7
would generate an error because the string created with " is not closed by another matching ". Similarily the string created with ' is not closed by a matching '.
3.4. Selected pandas methods#
Pandas has a number of methods (functions that operate on the underlying data of a pandas object). Several of them are used when working with a World Bank model in ModelFlow and these are discussed below.
3.4.1. .columns lists the column names of a dataframe#
The method .columns returns the names of the columns in the dataframe.
df.columns
Index(['A', 'B', 'NEW'], dtype='object')
3.4.2. .size indicates the dimension of a list#
so df.columns.size returns the number of columns in a dataframe.
df.columns.size
3
The dataframe df has 3 columns.
3.4.3. .eval() evaluates (calculates) an expression on the data of a dataframe#
.eval is a native dataframe method, which does calculations on a dataframe and returns a revised dataframe. With this method expressions can be evaluated and new columns created.
df.eval('X = B*NEW')
| A | B | NEW | X | |
|---|---|---|---|---|
| 2020 | 100 | 17 | 11 | 187 |
| 2021 | 100 | 17 | 12 | 204 |
| 2022 | 100 | 17 | 10 | 170 |
| 2023 | 100 | 17 | 14 | 238 |
| 2024 | 100 | 17 | 2 | 34 |
| 2025 | 100 | 17 | 1 | 17 |
3.4.4. multiple expressions can be evaluated with .eval()#
With .eval() multiple expressions can be evaluated at the same time. In the example below three apostrophe’s are used to indicate a multiple-line block of text. Each line is executed separately by eval.
In this example, two new variables X and THE_ANSWER are created in a single call to .eval().
df.eval('''X = B*NEW
THE_ANSWER = 42''')
| A | B | NEW | X | THE_ANSWER | |
|---|---|---|---|---|---|
| 2020 | 100 | 17 | 11 | 187 | 42 |
| 2021 | 100 | 17 | 12 | 204 | 42 |
| 2022 | 100 | 17 | 10 | 170 | 42 |
| 2023 | 100 | 17 | 14 | 238 | 42 |
| 2024 | 100 | 17 | 2 | 34 | 42 |
| 2025 | 100 | 17 | 1 | 17 | 42 |
Note
In python three apostrophes ''' signals a block of text. Of course the block needs to be opened and closed – in each case by three apostrophes. New lines that appear within a block (if any) are an integral part of the block.
Because the result of the .df.eval() call was not assigned to anything, least of all the dataframe df, the value of df is unchanged.
df
| A | B | NEW | |
|---|---|---|---|
| 2020 | 100 | 17 | 11 |
| 2021 | 100 | 17 | 12 |
| 2022 | 100 | 17 | 10 |
| 2023 | 100 | 17 | 14 |
| 2024 | 100 | 17 | 2 |
| 2025 | 100 | 17 | 1 |
To store the results of the calculation, the expression must be assigned to a variable. The pre-existing DataFrame can be overwritten by assigning it the result of the eval statement or a new DataFrame could be created.
df=df.eval('''X = B*NEW
Y = 42''')
df
| A | B | NEW | X | Y | |
|---|---|---|---|---|---|
| 2020 | 100 | 17 | 11 | 187 | 42 |
| 2021 | 100 | 17 | 12 | 204 | 42 |
| 2022 | 100 | 17 | 10 | 170 | 42 |
| 2023 | 100 | 17 | 14 | 238 | 42 |
| 2024 | 100 | 17 | 2 | 34 | 42 |
| 2025 | 100 | 17 | 1 | 17 | 42 |
With this operation the new columns, X and Y have been appended to the dataframe df.
Note
The .eval() method is a native pandas method. As such it cannot handle lagged variables (because pandas do not support the idea of a lagged variable.
The .mfcalc() and the .upd() methods discussed in the next chapter are ModelFlow features that extend the functionalities native to dataframe that allows such calculations to be performed.
df2=df.eval('''X = B*NEW
Y = 42''')
df2
| A | B | NEW | X | Y | |
|---|---|---|---|---|---|
| 2020 | 100 | 17 | 11 | 187 | 42 |
| 2021 | 100 | 17 | 12 | 204 | 42 |
| 2022 | 100 | 17 | 10 | 170 | 42 |
| 2023 | 100 | 17 | 14 | 238 | 42 |
| 2024 | 100 | 17 | 2 | 34 | 42 |
| 2025 | 100 | 17 | 1 | 17 | 42 |
Here the results are assigned to a new dataframe df2.
3.5. The .loc[] method#
The .loc[] method is a powerful pandas routine used extensively in ModelFlow. .loc[] selects a portion (slice) of a dataframe and allows the user to display and/or revise specific sub-sections of a single or several columns or rows in a dataframe.
3.5.1. Selecting a single element in a dataframe: .loc[row,column]#
.loc[row,column] operates on a single cell in the dataframe. Thus, the below displays the value of the cell with index=2023 observation from the column NEW.
df.loc[2023,'NEW']
np.int64(14)
3.5.2. Selecting a single columns: .loc[:,column]#
The lone colon in a .loc[] statement indicates all the rows or columns. Here all of the rows of the column labeled NEW are displayed.
df.loc[:,'NEW']
2020 11
2021 12
2022 10
2023 14
2024 2
2025 1
Name: NEW, dtype: int64
3.5.3. Selecting a single row: .loc[row,:] #
Here all of the columns, for the selected row are displayed.
df.loc[2023,:]
A 100
B 17
NEW 14
X 238
Y 42
Name: 2023, dtype: int64
3.5.4. Selecting specific rows and columns: .loc[xxx,xxx],[names…]]#
Passing a list in either the rows or columns portion of the .loc statement will allow multiple rows or columns to be displayed.
df.loc[[2021,2024],['B','NEW']]
| B | NEW | |
|---|---|---|
| 2021 | 17 | 12 |
| 2024 | 17 | 2 |
3.5.5. Select a range: the “:” operator in a ‘.loc’ statement#
with the colon operator we can also select a range of results.
Here from 2018 to 2019.
df.loc[2021:2023,['B','NEW']]
| B | NEW | |
|---|---|---|
| 2021 | 17 | 12 |
| 2022 | 17 | 10 |
| 2023 | 17 | 14 |
3.5.6. Revising a sub-section of a DataFrame, using .loc[] on the left hand side to assign values to specific cells#
This can be handy when revising a DataFrame in preparation for running a scenario.
df.loc[2022:2024,'NEW'] = 20
df
| A | B | NEW | X | Y | |
|---|---|---|---|---|---|
| 2020 | 100 | 17 | 11 | 187 | 42 |
| 2021 | 100 | 17 | 12 | 204 | 42 |
| 2022 | 100 | 17 | 20 | 170 | 42 |
| 2023 | 100 | 17 | 20 | 238 | 42 |
| 2024 | 100 | 17 | 20 | 34 | 42 |
| 2025 | 100 | 17 | 1 | 17 | 42 |
Warning
The dimensions on the right hand side of = and the left hand side should match. That is: either the dimensions should be the same, or the right hand side should be a single value that is then broadcast into all of the lements of the left hand slice.
For more on broadcasting see here (https://jakevdp.github.io/PythonDataScienceHandbook/02.05-computation-on-arrays-broadcasting.html)
For more info on the .loc[] method
Description https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html.
Search https://www.google.com/search?q=pandas+dataframe+loc&newwindow=1.
For more info on pandas:
Pandas homepage https://pandas.pydata.org/.
Pandas community tutorials https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html.