See below for godzilla.dev materials about: AI x Quant Trader Series - Day 5

The Swiss Army Knife of Python Data Processing: pandas

Part 1: Introduction to Basic Data Structures

1. Introduction to Pandas

We've finally arrived at the module the author is most eager to introduce — and arguably the most powerful Python extension for data processing: pandas.

When working with real-world financial data, a single record often contains multiple types of data. For example, a stock ticker is a string, the closing price is a float, and the trading volume is an integer. In C++, this can be handled using a container like a vector of custom structs. In Python, pandas provides high-level data structures — Series and DataFrame — that make data manipulation extremely convenient, fast, and straightforward.

Note that there are some incompatibilities between different versions of pandas, so it's important to know which version you are using. Let's first check the version of pandas in your local environment:

import pandas as pd
pd.__version__
the output:
'2.2.3'

The two main data structures in pandas are Series and DataFrame. In the next two sections, we’ll explore how to create these structures either from other data types or from scratch. But first, let’s import them along with the relevant modules:

import numpy as np
from pandas import Series, DataFrame

2. Pandas Data Structure: Series

Generally speaking, a Series can be thought of as a one-dimensional array. The main difference between a Series and a regular 1D array is that a Series has an index, which makes it similar to a hash (dictionary-like structure) commonly seen in programming.

2.1 Creating a Series

The basic format for creating a Series is:

s = Series(data, index=index, name=name)

Below are a few examples of how to create a Series. Let's start by creating a Series from an array:

a = np.random.randn(5)
print("a is an array:")
print(a)
s = Series(a)
print("s is a Series:")
print(s)
the output:
a is an array:
[ 1.35729482 -1.45138391  0.91716941 -1.24918144 -0.68685959]
s is a Series:
0    1.357295
1   -1.451384
2    0.917169
3   -1.249181
4   -0.686860
dtype: float64

You can specify an index when creating a Series, and you can use Series.index to view the specific index values. One important thing to note is that when creating a Series from an array, the length of the specified index must match the length of the data.

s = Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)
s.index
the output:
a   -1.898245
b    0.172835
c    0.779262
d    0.289468
e   -0.947995
dtype: float64
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
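
The length requirement mentioned above can be checked directly. The following is a minimal sketch (the variable names are illustrative, not from the original text): three values are paired with five index labels, and pandas raises a ValueError instead of guessing how to align them.

```python
import numpy as np
from pandas import Series

values = np.random.randn(3)           # three data points
labels = ['a', 'b', 'c', 'd', 'e']    # five index labels

try:
    Series(values, index=labels)
    length_error = None
except ValueError as exc:
    # pandas refuses to pair 3 values with 5 labels
    length_error = exc

print(type(length_error).__name__)
```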

Another optional parameter when creating a Series is name, which allows you to assign a name to the Series. You can access it using Series.name. In a DataFrame, the name of each column becomes the name of the Series when that column is extracted individually.

s = Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'], name='my_series')
print(s)
print(s.name)
the output:
a   -1.898245
b    0.172835
c    0.779262
d    0.289468
e   -0.947995
Name: my_series, dtype: float64
my_series
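
To illustrate the last point about column names, here is a small sketch. The table, tickers, and numbers are made up for illustration; the DataFrame type itself is covered in section 3.

```python
from pandas import DataFrame

# Hypothetical price table; tickers and numbers are invented for this example
prices = DataFrame({'close': [10.5, 20.1], 'volume': [1000, 2500]},
                   index=['AAA', 'BBB'])

close = prices['close']   # extracting a single column yields a Series
print(close.name)         # the column label is carried over as the Series name
```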

A Series can also be created from a dictionary (dict):

d = {'a': 0., 'b': 1, 'c': 2}
print("d is a dict:")
print(d)
s = Series(d)
print("s is a Series:")
print(s)
the output:
d is a dict:
{'a': 0.0, 'b': 1, 'c': 2}
s is a Series:
a    0.0
b    1.0
c    2.0
dtype: float64

Let’s take a look at the case where we specify an index when creating a Series from a dictionary (the index does not have to match the dictionary’s length):

Series(d, index=['b', 'c', 'd', 'a'])
the output:
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

We can observe two things:

When creating a Series from a dictionary, the data is reordered to match the specified index.

The length of the index does not need to match the length of the dictionary. If there are extra index labels, pandas will automatically assign them a value of NaN (Not a Number — the standard marker for missing data in pandas). If the index is shorter, only the corresponding subset of the dictionary will be used.
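
The shorter-index case is not shown above, so here is a brief sketch of both situations using the same dictionary. The names subset and extended are illustrative.

```python
from pandas import Series

d = {'a': 0., 'b': 1, 'c': 2}

# Index shorter than the dict: only the listed keys are kept, in the given order
subset = Series(d, index=['c', 'a'])
print(subset)

# Index containing a label that is not in the dict: that label gets NaN
extended = Series(d, index=['a', 'z'])
print(extended.isna())
```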

If the data is a single value, such as the number 4, then the Series will repeat this value across all index labels:

Series(4., index=['a', 'b', 'c', 'd', 'e'])
the output:
a    4.0
b    4.0
c    4.0
d    4.0
e    4.0
dtype: float64

2.2 Accessing Data in a Series

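You can access data in a Series by integer position (like arrays), by index label (like dictionaries), and even through conditional filtering. In recent pandas versions, using plain brackets with an integer (s[0]) or a list of integers as positions on a label-indexed Series is deprecated, so prefer .iloc for those two cases:

s = Series(np.random.randn(10), index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
s.iloc[0]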
the output:
1.4328106520571824

s[:2]
the output:
a    1.432811
b    0.120681
dtype: float64

s.iloc[[2, 0, 4]]
the output:
c    0.578146
a    1.432811
e    1.327594
dtype: float64

s[['e', 'i']]
the output:
e    1.327594
i   -0.634347
dtype: float64

s[s > 0.5]
the output:
a    1.432811
c    0.578146
e    1.327594
g    1.850783
dtype: float64

'e' in s
the output:
True
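
Two details of the examples above are easy to trip over: the in operator tests index labels rather than values, and .loc / .iloc make it explicit whether a key is a label or a position. The sketch below uses a small Series with fixed values (chosen for illustration) so the results are reproducible.

```python
from pandas import Series

s = Series([1.5, -0.2, 0.7], index=['a', 'b', 'c'])

print(s.loc['b'])        # label-based access
print(s.iloc[1])         # position-based access to the same element
print('a' in s)          # True: 'a' is an index label
print(1.5 in s)          # False: membership is tested against labels, not values
print(1.5 in s.values)   # True: test the underlying array to look for a value
print(s.get('z', 0.0))   # missing label with a fallback instead of a KeyError
```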

3. Pandas Data Structure: DataFrame

Before using a DataFrame, let’s briefly go over its characteristics. A DataFrame is a two-dimensional data structure formed by combining multiple Series (column-wise). Each column, when extracted individually, is a Series. This is very similar to how data is retrieved from a SQL database. Therefore, it’s often more convenient to process a DataFrame column by column, and it's helpful for users to develop a column-oriented mindset when working with data.

One of the key advantages of a DataFrame is its ability to handle columns of different data types with ease. The flip side is that it is not designed for pure linear algebra: if you need operations like matrix inversion on a table full of floats, it's usually better to store the data in a NumPy array.
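
As a rough sketch of both points, the example below builds a small mixed-type table and then pulls the numeric columns out as a NumPy array. The tickers and numbers are invented for illustration.

```python
import numpy as np
from pandas import DataFrame

# Hypothetical quotes; tickers and numbers are made up for this example
quotes = DataFrame({
    'ticker': ['AAA', 'BBB', 'CCC'],   # strings
    'close': [10.5, 20.1, 30.0],       # floats
    'volume': [1000, 2500, 400],       # integers
})
print(quotes.dtypes)   # each column keeps its own dtype

# For numerical work, extract the numeric columns as a plain float array
m = quotes[['close', 'volume']].to_numpy(dtype=float)
print(m.shape)
```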

3.1 Creating a DataFrame

Let’s first look at how to create a DataFrame from a dictionary. A DataFrame is a 2D data structure that serves as a collection of Series. We’ll start by creating a dictionary where the values are Series, and then convert it into a DataFrame:

d = {'one': Series([1., 2., 3.], index=['a', 'b', 'c']), 'two': Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = DataFrame(d)
print(df)
the output:
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

You can specify the desired rows (index) and columns when creating the DataFrame. If the dictionary does not contain the corresponding elements, those entries will be filled with NaN (missing values):

df = DataFrame(d, index=['r', 'd', 'a'], columns=['two', 'three'])
print(df)
the output:
   two three
r  NaN   NaN
d  4.0   NaN
a  1.0   NaN

You can use dataframe.index and dataframe.columns to view the rows and columns of a DataFrame. The dataframe.values attribute returns the elements of the DataFrame as a NumPy array.

print("DataFrame index:")
print(df.index)
print("DataFrame columns:")
print(df.columns)
print("DataFrame values:")
print(df.values)
the output:
DataFrame index:
Index(['r', 'd', 'a'], dtype='object')
DataFrame columns:
Index(['two', 'three'], dtype='object')
DataFrame values:
[[nan nan]
 [4.0 nan]
 [1.0 nan]]

A DataFrame can also be created from a dictionary whose values are arrays, but all arrays must be of the same length.

d = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]}
df = DataFrame(d, index=['a', 'b', 'c', 'd'])
print(df)
the output:
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0

When the data is a list of dictionaries (one dictionary per row) rather than a dictionary of arrays, this length restriction does not apply, and any missing values will be automatically filled with NaN.

d = [{'a': 1.6, 'b': 2}, {'a': 3, 'b': 6, 'c': 9}]
df = DataFrame(d)
print(df)
the output:
     a  b    c
0  1.6  2  NaN
1  3.0  6  9.0

When working with real-world data, you may sometimes need to create an empty DataFrame. This can be done as follows:

df = DataFrame()
print(df)
the output:
Empty DataFrame
Columns: []
Index: []

Another very useful way to create a DataFrame is by using the concat function, which allows you to build a DataFrame from one or more Series or existing DataFrames.

a = Series(range(5))
b = Series(np.linspace(4, 20, 5))
df = pd.concat([a, b], axis=1)
print(df)
the output:
   0     1
0  0   4.0
1  1   8.0
2  2  12.0
3  3  16.0
4  4  20.0

Here, axis=1 means concatenation by columns, while axis=0 means concatenation by rows. Note that a Series is treated as a single column, so if you choose axis=0 here, the two Series are stacked end to end and you get a single Series of length 10 rather than a two-column DataFrame.
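
A quick check of what axis=0 returns for two Series is sketched below. The names stacked and renumbered are illustrative; ignore_index is a standard pd.concat parameter.

```python
import numpy as np
import pandas as pd
from pandas import Series

a = Series(range(5))
b = Series(np.linspace(4, 20, 5))

stacked = pd.concat([a, b], axis=0)
print(type(stacked).__name__)   # Series
print(stacked.shape)            # (10,)
print(stacked.index.tolist())   # original labels are kept, so 0..4 appears twice

# ignore_index=True renumbers the result 0..9
renumbered = pd.concat([a, b], axis=0, ignore_index=True)
print(renumbered.index.tolist())
```

Because the original labels are kept by default, label-based lookups on stacked can return two rows for one label; renumbering avoids that when the old labels carry no meaning.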

The following example shows how to concatenate DataFrames by rows to form a larger DataFrame:

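index = ['alpha', 'beta', 'gamma', 'delta', 'eta']
rows = []
for i in range(5):
    # one-row DataFrame holding i, 2i, ..., 5i, labeled by index[i]
    rows.append(DataFrame([np.linspace(i, 5*i, 5)], index=[index[i]]))
# concatenate once at the end instead of growing df inside the loop
df = pd.concat(rows, axis=0)
print(df)
the output:
         0    1     2     3     4
alpha  0.0  0.0   0.0   0.0   0.0
beta   1.0  2.0   3.0   4.0   5.0
gamma  2.0  4.0   6.0   8.0  10.0
delta  3.0  6.0   9.0  12.0  15.0
eta    4.0  8.0  12.0  16.0  20.0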

3.2 Accessing Data in a DataFrame

First, it’s important to emphasize again that DataFrame operations are fundamentally column-based. You can think of every operation as first selecting a column (which is a Series), and then accessing elements from that Series.

You can select a column using either dataframe.column_name or dataframe[]. You’ll quickly notice that:

The dot notation (dataframe.column_name) can only select a single column.

The bracket notation (dataframe[]) can be used to select one or multiple columns.

If you did not assign column names, pandas labels the columns with the default integers 0, 1, 2, and so on, and you select a column by putting that integer inside the brackets. Once string column names exist, you must use those names instead. Also, with integer labels, dataframe.column_name is not valid, because an integer is not a valid attribute name.

print(df[1])
print(type(df[1]))
df.columns = ['a', 'b', 'c', 'd', 'e']
print(df['b'])
print(type(df['b']))
print(df.b)
print(type(df.b))
print(df[['a', 'd']])
print(type(df[['a', 'd']]))
the output:
alpha    0.0
beta     2.0
gamma    4.0
delta    6.0
eta      8.0
Name: 1, dtype: float64
<class 'pandas.core.series.Series'>
alpha    0.0
beta     2.0
gamma    4.0
delta    6.0
eta      8.0
Name: b, dtype: float64
<class 'pandas.core.series.Series'>
alpha    0.0
beta     2.0
gamma    4.0
delta    6.0
eta      8.0
Name: b, dtype: float64
<class 'pandas.core.series.Series'>
         a     d
alpha  0.0   0.0
beta   1.0   4.0
gamma  2.0   8.0
delta  3.0  12.0
eta    4.0  16.0
<class 'pandas.core.frame.DataFrame'>

In the code above, we used dataframe.columns to assign column names to the DataFrame. As shown, when a single column is extracted, the resulting data structure is a Series. However, when two or more columns are selected, the result remains a DataFrame.

To access specific elements, select the column first and then use a position (via .iloc) or a label, just like with a Series.

print(df['b'].iloc[2])
print(df['b']['gamma'])
the output:
4.0
4.0

To select rows, you can use dataframe.iloc to select by position (index number), or dataframe.loc to select by label (index name).

print(df.iloc[1])
print(df.loc['beta'])
the output:
a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
Name: beta, dtype: float64
a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
Name: beta, dtype: float64

Rows can also be selected using slicing or a Boolean array (Boolean mask).

print("Selecting by slices:")
print(df[1:3])
bool_vec = [True, False, True, True, False]
print("Selecting by boolean vector:")
print(df[bool_vec])
the output:
Selecting by slices:
         a    b    c    d     e
beta   1.0  2.0  3.0  4.0   5.0
gamma  2.0  4.0  6.0  8.0  10.0
Selecting by boolean vector:
         a    b    c     d     e
alpha  0.0  0.0  0.0   0.0   0.0
gamma  2.0  4.0  6.0   8.0  10.0
delta  3.0  6.0  9.0  12.0  15.0
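
In practice, the Boolean vector is usually computed from the data rather than typed by hand. The sketch below rebuilds the same 5×5 frame so that it runs on its own; the threshold 8 is an arbitrary choice for illustration.

```python
import numpy as np
from pandas import DataFrame

# Rebuild the frame used above: row i holds i, 2i, ..., 5i
df = DataFrame([np.linspace(i, 5 * i, 5) for i in range(5)],
               index=['alpha', 'beta', 'gamma', 'delta', 'eta'],
               columns=['a', 'b', 'c', 'd', 'e'])

mask = df['e'] > 8               # Boolean Series aligned with the row index
print(mask.tolist())
print(df[mask].index.tolist())   # labels of the rows that pass the filter
```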

Rows and columns can be combined to select specific data.

print(df[['b', 'd']].iloc[[1, 3]])
print(df.iloc[[1, 3]][['b', 'd']])
print(df[['b', 'd']].loc[['beta', 'delta']])
print(df.loc[['beta', 'delta']][['b', 'd']])
the output:
         b     d
beta   2.0   4.0
delta  6.0  12.0
         b     d
beta   2.0   4.0
delta  6.0  12.0
         b     d
beta   2.0   4.0
delta  6.0  12.0
         b     d
beta   2.0   4.0
delta  6.0  12.0
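
The chained selections above can also be written as a single .loc or .iloc call that takes the row selector and the column selector together. This is a sketch on a rebuilt copy of the same frame; the single-call form is generally the safer choice when you later want to assign to the selection, since chained indexing may operate on a temporary copy.

```python
import numpy as np
from pandas import DataFrame

# Rebuild the frame used above: row i holds i, 2i, ..., 5i
df = DataFrame([np.linspace(i, 5 * i, 5) for i in range(5)],
               index=['alpha', 'beta', 'gamma', 'delta', 'eta'],
               columns=['a', 'b', 'c', 'd', 'e'])

by_label = df.loc[['beta', 'delta'], ['b', 'd']]   # rows and columns by label
by_position = df.iloc[[1, 3], [1, 3]]              # rows and columns by position
print(by_label)
print(by_label.equals(by_position))
```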

If you want to access a specific element at a particular position (rather than an entire row or column), the fastest way is to use dataframe.at and dataframe.iat, which access data by label and integer position, respectively.

print(df.iat[2, 3])
print(df.at['gamma', 'd'])
the output:
8.0
8.0