See below for godzilla.dev materials about: AI x Quant Trader Series - Day 3

"Widely used Python Libraries"¶

The upcoming series will introduce some of the most widely used Python libraries in quantitative finance:

numpy

scipy

pandas

matplotlib

Each will be explained one by one for beginners.

NumPy¶

What is NumPy¶

Quantitative analysis involves a large amount of numerical computation, so an efficient and convenient scientific computing tool is essential. Python was not originally designed as a language for scientific computing. However, as more people recognized its ease of use, a wide range of external extensions emerged, NumPy (Numerical Python) being one of them.

NumPy provides a wealth of tools for numerical programming, making it easy to handle operations on vectors, matrices, and more, which significantly facilitates scientific computing tasks. In addition, Python is free; compared with the high cost of commercial software like MATLAB, this has helped make Python an increasingly popular choice.

Let’s take a quick look at how to get started with NumPy:

import numpy
numpy.version.full_version

the output:

2.2.4

We used the import command to load the NumPy library and checked the version with numpy.version.full_version, which turned out to be 2.2.4.

In the upcoming lessons, we’ll frequently use functions from NumPy. However, constantly writing numpy as a prefix before every function call can be tedious. As mentioned earlier, there’s a shortcut when importing external modules: using from numpy import * allows you to access all functions without the prefix.

Problem solved? Not so fast!

Python has thousands of external modules, and in practice, it’s common to import several of them at once. If two modules happen to include functions or properties with the same name, this can lead to conflicts. To avoid such name clashes—also known as namespace confusion—it’s generally better to keep the module prefix.
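To make the name-clash problem concrete, here is a minimal sketch. It uses math and numpy, which both export a function called sqrt; the variable name values is only for illustration.

```python
import math
import numpy as np

# Two libraries export a function with the same name.
from math import sqrt   # scalar-only square root
from numpy import sqrt  # silently rebinds the bare name `sqrt` to NumPy's version

values = [1.0, 4.0, 9.0]

# The bare name now refers to whichever import ran last.
print(sqrt is np.sqrt)
print(sqrt(values))      # element-wise square root of the whole list

# With explicit prefixes there is no doubt about which function runs.
print(math.sqrt(4.0))
try:
    math.sqrt(values)    # math.sqrt only accepts a single number
except TypeError as exc:
    print("math.sqrt failed:", exc)
```

With only two modules the clash is easy to spot, but with many star imports it becomes hard to tell which library a bare name comes from.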

So is there a simpler way? Yes—when importing a module, you can assign it an alias. This way, you don’t need to write the full module name every time. For example, we can import NumPy as np and call version.full_version like this:

import numpy as np
np.version.full_version

the output:

2.2.4

A First Look at NumPy Objects: Arrays¶

The fundamental object in NumPy is the homogeneous multidimensional array, meaning all elements in the array must be of the same type—just like arrays in C++. For example, character and numeric types cannot coexist in the same array.
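As a small illustration of what "homogeneous" means in practice, the sketch below builds arrays from mixed Python lists and inspects the resulting element type. The variable names are made up for this example.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed_numbers = np.array([1, 2.5, 3])   # the integers are promoted to float
mixed_text = np.array([1, "a", 3.5])    # everything is converted to a string

print(ints.dtype)
print(mixed_numbers.dtype)     # a float type
print(mixed_text.dtype.kind)   # 'U' means a unicode string type
print(mixed_text)
```

NumPy does not reject the mixed input; it picks one common type that can hold every element, which may not be the type you intended.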

Let’s look at an example:

a = np.arange(20)

Here, we’ve created a one-dimensional array a starting from 0, with a step size of 1, and a total length of 20. In Python, indexing starts at 0, so users coming from R or MATLAB should be cautious about this difference. You can use print to view the array:

print(a)

the output:

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]

We can use the type function to check the type of a. Here, it shows that a is an array:

type(a)

the output:

numpy.ndarray

Using the reshape function, we can restructure this array. For example, we can create a 4×5 two-dimensional array. The arguments passed to reshape specify the size of each dimension, and the data is arranged in order by dimension (for two dimensions, this means row-wise order). This is different from R, where arrays are filled column-wise by default.

a = a.reshape(4, 5)
print(a)

the output:

[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]]
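The difference between row-wise and column-wise filling can be seen directly with the order argument of reshape. This is a minimal sketch; order="F" (Fortran order) mimics the column-wise behaviour that R users are used to.

```python
import numpy as np

flat = np.arange(6)

row_major = flat.reshape(2, 3)             # default order="C": fill row by row
col_major = flat.reshape(2, 3, order="F")  # Fortran order: fill column by column
inferred = flat.reshape(-1, 2)             # -1 asks NumPy to infer that dimension

print(row_major)
print(col_major)
print(inferred.shape)
```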

Creating higher-dimensional arrays is no problem either:

a = a.reshape(2, 2, 5)
print(a)

the output:

[[[ 0  1  2  3  4]
  [ 5  6  7  8  9]]

 [[10 11 12 13 14]
  [15 16 17 18 19]]]

Since a is an array, we can read its attributes to further inspect its properties:

ndim shows the number of dimensions

shape returns the size of each dimension

size gives the total number of elements (equal to the product of all dimension sizes)

dtype displays the data type of the elements

itemsize shows the number of bytes each element occupies

a.ndim

the output:

3

a.shape

the output:

(2, 2, 5)

a.size

the output:

20

a.dtype

the output:

dtype('int64')
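The itemsize attribute is listed above but not demonstrated, so here is a short sketch. The dtype is fixed to int64 explicitly so the byte counts do not depend on the platform; nbytes is the total memory used by the elements.

```python
import numpy as np

a = np.arange(20, dtype=np.int64).reshape(2, 2, 5)

print(a.ndim, a.shape, a.size)
print(a.itemsize)   # bytes per element (8 for int64)
print(a.nbytes)     # total bytes = size * itemsize

# Converting to a smaller integer type halves the memory footprint.
small = a.astype(np.int32)
print(small.itemsize, small.nbytes)
```

Note that these are attributes, not methods, so they are read without parentheses.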

Creating Arrays¶

Arrays can be created by converting lists, and higher-dimensional arrays can be created by converting nested lists.

raw = [0,1,2,3,4]
a = np.array(raw)
a

the output:

array([0, 1, 2, 3, 4])

raw = [[0,1,2,3,4], [5,6,7,8,9]]
b = np.array(raw)
b

the output:

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

Some special arrays have dedicated commands for creation—for example, a 4×5 matrix filled with zeros:

d = (4, 5)
np.zeros(d)

the output:

array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])

By default, the generated array is of float type, but you can specify the data type to create an integer array instead:

d = (4, 5)
np.ones(d, dtype=int)

the output:

array([[1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1]])

An array of random numbers in the interval [0,1):

np.random.rand(5)

the output:

array([ 0.93807818,  0.45307847,  0.90732828,  0.36099623,  0.71981451])
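Random output changes on every run, which makes results hard to reproduce. One common approach, shown here as a sketch, is NumPy's Generator interface with a fixed seed; the seed value 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
draws = rng.random(5)                   # uniform numbers in [0, 1)
normal_draws = rng.standard_normal(5)   # standard normal numbers

# A second generator with the same seed reproduces the same sequence.
rng_again = np.random.default_rng(seed=42)
same_draws = rng_again.random(5)

print(draws)
print(np.array_equal(draws, same_draws))
```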

Array Operations¶

Basic arithmetic operations have been overloaded—operators like +, -, *, and / are all applied element-wise to the entire array. For example, with addition:

a = np.array([[1.0, 2], [2, 4]])
print("a:")
print(a)
b = np.array([[3.2, 1.5], [2.5, 4]])
print("b:")
print(b)
print("a+b:")
print(a+b)

the output:

a:
[[ 1.  2.]
 [ 2.  4.]]
b:
[[ 3.2  1.5]
 [ 2.5  4. ]]
a+b:
[[ 4.2  3.5]
 [ 4.5  8. ]]

Here, you can see that even though only one element in array a is written as a float and the rest as integers, NumPy automatically converts all elements to float, because NumPy arrays are homogeneous. Also, when adding two 2D arrays, their shapes must match (or at least be compatible under NumPy's broadcasting rules).

Of course, in NumPy, these operators can also be used between a scalar and an array. The result is that the operation is applied element-wise between the scalar and each element in the array, and the output is still an array.

print("3 * a:")
print(3 * a)
print("b + 1.8:")
print(b + 1.8)

the output:

3 * a:
[[  3.   6.]
 [  6.  12.]]
b + 1.8:
[[ 5.   3.3]
 [ 4.3  5.8]]
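The scalar case is the simplest form of what NumPy calls broadcasting. As a sketch of a slightly richer case, the example below subtracts a per-column mean from every row; the returns data is invented purely for illustration.

```python
import numpy as np

# Hypothetical daily returns: 3 days (rows) x 2 assets (columns).
returns = np.array([[0.01, -0.02],
                    [0.03,  0.00],
                    [-0.01, 0.01]])

col_means = returns.mean(axis=0)   # shape (2,): one mean per asset
demeaned = returns - col_means     # (3, 2) - (2,) is applied to every row

print(demeaned)
print(demeaned.mean(axis=0))       # numerically zero for each column

# Shapes that cannot be aligned raise an error instead of guessing.
try:
    returns + np.array([1.0, 2.0, 3.0])   # (3, 2) + (3,) does not broadcast
except ValueError as exc:
    print("broadcast error:", exc)
```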

Just like in C++, the +=, -=, *=, and /= operators are also supported in NumPy.

a /= 2
print(a)

the output:

[[ 0.5  1. ]
 [ 1.   2. ]]

Taking square roots or computing exponentials is also very straightforward:

print("a:")
print(a)
print("np.exp(a):")
print(np.exp(a))
print("np.sqrt(a):")
print(np.sqrt(a))
print("np.square(a):")
print(np.square(a))
print("np.power(a, 3):")
print(np.power(a, 3))

the output:

a:
[[ 0.5  1. ]
 [ 1.   2. ]]
np.exp(a):
[[ 1.64872127  2.71828183]
 [ 2.71828183  7.3890561 ]]
np.sqrt(a):
[[ 0.70710678  1.        ]
 [ 1.          1.41421356]]
np.square(a):
[[ 0.25  1.  ]
 [ 1.    4.  ]]
np.power(a, 3):
[[ 0.125  1.   ]
 [ 1.     8.   ]]

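Need to find the maximum or minimum of a 2D array? Want to calculate the total sum of all elements, or sum by rows or columns? Use a for loop? No need: NumPy’s ndarray class already provides built-in methods for these operations. The axis argument selects the dimension to reduce over: axis=0 works down each column, and axis=1 works across each row.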

a = np.arange(20).reshape(4,5)
print("a:")
print(a)
print("sum of all elements in a: " + str(a.sum()))
print("maximum element in a: " + str(a.max()))
print("minimum element in a: " + str(a.min()))
print("maximum element in each row of a: " + str(a.max(axis=1)))
print("minimum element in each column of a: " + str(a.min(axis=0)))

the output:

a:
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]]
sum of all elements in a: 190
maximum element in a: 19
minimum element in a: 0
maximum element in each row of a: [ 4  9 14 19]
minimum element in each column of a: [0 1 2 3 4]

Matrix operations are heavily used in scientific computing. In addition to arrays, NumPy also provides a dedicated matrix object. There are two main differences between matrices and arrays:

Matrices are strictly 2-dimensional, whereas arrays can have any number of dimensions.

The * operator performs matrix multiplication for matrix objects, meaning the number of columns in the left matrix must equal the number of rows in the right matrix. In contrast, for arrays, the * operator performs element-wise multiplication, requiring that the arrays have the same shape (or shapes that NumPy can broadcast to a common shape).

You can convert an array to a matrix using asmatrix, or you can create a matrix directly. (The older alias mat was removed in NumPy 2.0, so it is not available in the 2.2.4 release used here.) For example:

a = np.arange(20).reshape(4, 5)
a = np.asmatrix(a)
print(type(a))

b = np.matrix('1.0 2.0; 3.0 4.0')
print(type(b))

the output:

<class 'numpy.matrixlib.defmatrix.matrix'>
<class 'numpy.matrixlib.defmatrix.matrix'>

Let’s take another look at matrix multiplication. Here, we use the arange function to generate another matrix b. The arange function can also be called with the form arange(start, stop, step) to create an arithmetic sequence. Note that the range includes the start value but excludes the stop value.

b = np.arange(2, 45, 3).reshape(5, 3)
b = np.asmatrix(b)  # np.mat was removed in NumPy 2.0
print(b)

the output:

[[ 2  5  8]
 [11 14 17]
 [20 23 26]
 [29 32 35]
 [38 41 44]]

Some might ask: arange specifies the step size, but what if you want to specify the length of the generated 1D array instead? No problem — linspace can do just that.

np.linspace(0, 2, 9)

the output:

array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ,  1.25,  1.5 ,  1.75,  2.  ])
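To compare the two functions side by side, here is a small sketch. The retstep and endpoint arguments are part of linspace; the variable names are illustrative.

```python
import numpy as np

# Ask linspace to also report the spacing it chose.
points, step = np.linspace(0, 2, 9, retstep=True)
print(points)
print(step)

# linspace includes the stop value by default; arange never does.
half_open = np.linspace(0, 2, 8, endpoint=False)
by_step = np.arange(0, 2, 0.25)
print(np.allclose(half_open, by_step))
```

When the step is a float that cannot be represented exactly, linspace is usually the safer choice because the number of points is fixed in advance.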

Back to our problem: perform matrix multiplication on matrices a and b.

print("matrix a:")
print(a)
print("matrix b:")
print(b)
c = a * b
print("matrix c:")
print(c)

the output:

matrix a:
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]]
matrix b:
[[ 2  5  8]
 [11 14 17]
 [20 23 26]
 [29 32 35]
 [38 41 44]]
matrix c:
[[ 290  320  350]
 [ 790  895 1000]
 [1290 1470 1650]
 [1790 2045 2300]]
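The same product can also be computed without the matrix class. As a sketch, the @ operator (or np.dot) performs matrix multiplication on ordinary arrays, while * stays element-wise.

```python
import numpy as np

a = np.arange(20).reshape(4, 5)        # plain ndarray, shape (4, 5)
b = np.arange(2, 45, 3).reshape(5, 3)  # plain ndarray, shape (5, 3)

c = a @ b                              # matrix product, shape (4, 3)
print(c)
print(np.array_equal(c, np.dot(a, b)))

# Element-wise * is not defined for these shapes.
try:
    a * b
except ValueError as exc:
    print("element-wise product failed:", exc)
```

The result stays an ndarray, so everything else in this lesson (indexing, reductions, stacking) keeps working on it unchanged.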

Array element access¶

Elements of arrays and matrices can be accessed using indices. The following examples all use two-dimensional arrays (or matrices).

a = np.array([[3.2, 1.5], [2.5, 4]])
print(a[0][1])
print(a[0, 1])

the output:

1.5
1.5

Array element values can be modified using index-based access.

b = a
a[0][1] = 2.0
print("a:")
print(a)
print("b:")
print(b)

the output:

a:
[[ 3.2  2. ]
 [ 2.5  4. ]]
b:
[[ 3.2  2. ]
 [ 2.5  4. ]]

Now here comes the problem: you clearly modified a[0][1], so why did b[0][1] also change? This is a common pitfall in Python programming. The reason is that the assignment b = a does not copy the array; it only makes b another name for the same object in memory. To create a real copy of a for b, you can use copy.

a = np.array([[3.2, 1.5], [2.5, 4]])
b = a.copy()
a[0][1] = 2.0
print("a:")
print(a)
print("b:")
print(b)

the output:

a:
[[ 3.2  2. ]
 [ 2.5  4. ]]
b:
[[ 3.2  1.5]
 [ 2.5  4. ]]

If you reassign a, meaning you point it to a different address, b will still remain at the original address.

a = np.array([[3.2, 1.5], [2.5, 4]])
b = a
a = np.array([[2, 1], [9, 3]])
print("a:")
print(a)
print("b:")
print(b)

the output:

a:
[[2 1]
 [9 3]]
b:
[[ 3.2  1.5]
 [ 2.5  4. ]]
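The same sharing happens with slices, which is easy to overlook. The sketch below uses np.shares_memory to check whether two arrays point at the same data; the names alias, first_row and independent are chosen for this example.

```python
import numpy as np

a = np.array([[3.2, 1.5], [2.5, 4.0]])
alias = a                # another name for the same object
first_row = a[0]         # basic indexing returns a view, not a copy
independent = a.copy()   # a separate block of memory

first_row[1] = 2.0       # writes through to a (and therefore to alias)

print(alias is a)
print(np.shares_memory(a, first_row))
print(np.shares_memory(a, independent))
print(a[0, 1], independent[0, 1])
```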

The colon : can be used to access all elements along a certain dimension — for example, to extract a specific column from a matrix.

a = np.arange(20).reshape(4, 5)
print("a:")
print(a)
print("the 2nd and 4th column of a:")
print(a[:,[1,3]])

the output:

a:
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]]
the 2nd and 4th column of a:
[[ 1  3]
 [ 6  8]
 [11 13]
 [16 18]]

Let’s try something a bit more complex: extracting elements that meet certain conditions — a common task in data processing, usually applied to a single row or column. In the example below, we extract the third column elements (12 and 17) that correspond to the rows where the first column values are greater than 5 (i.e., 10 and 15).

a[:, 2][a[:, 0] > 5]

the output:

array([12, 17])
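It may help to break that one-liner into steps. In this sketch the comparison first produces a boolean mask, which is then used as an index; several conditions can be combined with & (and) or | (or), each wrapped in parentheses.

```python
import numpy as np

a = np.arange(20).reshape(4, 5)

mask = a[:, 0] > 5       # one True/False value per row
third_col = a[:, 2]
selected = third_col[mask]

print(mask)
print(selected)

# Two conditions at once: first column > 5 AND last column < 19.
both = a[(a[:, 0] > 5) & (a[:, 4] < 19), 2]
print(both)
```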

The where function can be used to find the positions of specific values in an array.

loc = np.where(a==11)
print(loc)
print(a[loc[0][0], loc[1][0]])

the output:

(array([2]), array([1]))
11
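Two related tools are worth a quick look, sketched below: np.argwhere returns the matches as (row, column) pairs, and the three-argument form of np.where picks values from one of two sources depending on a condition.

```python
import numpy as np

a = np.arange(20).reshape(4, 5)

positions = np.argwhere(a == 11)    # one [row, col] pair per match
print(positions)

clipped = np.where(a > 10, a, 0)    # keep values above 10, replace the rest with 0
print(clipped)

rows, cols = np.where(a % 7 == 0)   # positions of the multiples of 7
print(rows, cols)
```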

Matrix operations¶

Let’s continue using a matrix (or 2D array) as an example. First, let’s look at matrix transposition.

a = np.random.rand(2,4)
print("a:")
print(a)
a = np.transpose(a)
print("a is an array, by using transpose(a):")
print(a)
b = np.random.rand(2,4)
b = np.asmatrix(b)  # np.mat was removed in NumPy 2.0
print("b:")
print(b)
print("b is a matrix, by using b.T:")
print(b.T)

the output:

a:
[[ 0.17571282  0.98510461  0.94864387  0.50078988]
 [ 0.09457965  0.70251658  0.07134875  0.43780173]]
a is an array, by using transpose(a):
[[ 0.17571282  0.09457965]
 [ 0.98510461  0.70251658]
 [ 0.94864387  0.07134875]
 [ 0.50078988  0.43780173]]
b:
[[ 0.09653644  0.46123468  0.50117363  0.69752578]
 [ 0.60756723  0.44492537  0.05946373  0.4858369 ]]
b is a matrix, by using b.T:
[[ 0.09653644  0.60756723]
 [ 0.46123468  0.44492537]
 [ 0.50117363  0.05946373]
 [ 0.69752578  0.4858369 ]]

Matrix inversion

import numpy.linalg as nlg
a = np.random.rand(2,2)
a = np.asmatrix(a)  # np.mat was removed in NumPy 2.0
print("a:")
print(a)
ia = nlg.inv(a)
print("inverse of a:")
print(ia)
print("a * inv(a)")
print(a * ia)

the output:

a:
[[ 0.86211266  0.6885563 ]
 [ 0.28798536  0.70810425]]
inverse of a:
[[ 1.71798445 -1.6705577 ]
 [-0.69870271  2.09163573]]
a * inv(a)
[[ 1.  0.]
 [ 0.  1.]]
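A frequent reason for inverting a matrix is to solve a linear system A x = b. As a sketch, np.linalg.solve does this directly without forming the inverse; the small system below is made up so the answer can be checked by hand.

```python
import numpy as np

# 3x + y = 9 and x + 2y = 8, whose solution is x = 2, y = 3.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x_solve = np.linalg.solve(A, b)    # solves A x = b directly
x_inv = np.linalg.inv(A) @ b       # same answer via the explicit inverse

print(x_solve)
print(np.allclose(x_solve, x_inv))
print(np.allclose(A @ x_solve, b)) # check the solution
```

Using solve is generally preferred over the explicit inverse when only the solution is needed.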

Computing eigenvalues and eigenvectors

a = np.random.rand(3,3)
eig_value, eig_vector = nlg.eig(a)
print("eigen value:")
print(eig_value)
print("eigen vector:")
print(eig_vector)

the output:

eigen value:
[ 1.35760609  0.43205379 -0.53470662]
eigen vector:
[[-0.76595379 -0.88231952 -0.07390831]
 [-0.55170557  0.21659887 -0.74213622]
 [-0.33005418  0.41784829  0.66616169]]
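When reading the result of eig, note that the eigenvectors are stored as columns. The sketch below checks the defining relation A v = λ v on a small symmetric matrix chosen for illustration.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eig_values, eig_vectors = np.linalg.eig(A)

# Column i of eig_vectors belongs to eig_values[i].
for i in range(len(eig_values)):
    v = eig_vectors[:, i]
    print(eig_values[i], np.allclose(A @ v, eig_values[i] * v))
```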

Concatenate two vectors into a matrix by columns.

a = np.array((1,2,3))
b = np.array((2,3,4))
print(np.column_stack((a,b)))

the output:

[[1 2]
 [2 3]
 [3 4]]

After processing some data in a loop and obtaining results, it's often useful to combine those results into a matrix. This can be done using vstack and hstack.

a = np.random.rand(2,2)
b = np.random.rand(2,2)
print("a:")
print(a)
print("b:")
print(b)
c = np.hstack([a,b])
d = np.vstack([a,b])
print("horizontal stacking a and b:")
print(c)
print("vertical stacking a and b:")
print(d)

the output:

a:
[[ 0.6738195   0.4944045 ]
 [ 0.25702675  0.15422012]]
b:
[[ 0.28058267  0.0967197 ]
 [ 0.55191041  0.04694485]]
horizontal stacking a and b:
[[ 0.6738195   0.4944045   0.28058267  0.0967197 ]
 [ 0.25702675  0.15422012  0.55191041  0.04694485]]
vertical stacking a and b:
[[ 0.6738195   0.4944045 ]
 [ 0.25702675  0.15422012]
 [ 0.28058267  0.0967197 ]
 [ 0.55191041  0.04694485]]
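For the loop scenario mentioned above, one possible pattern is to collect each result in a Python list and stack once at the end. This is only a sketch: the loop body, the seed, and the names rows and results are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

rows = []
for window in range(3):
    sample = rng.random(10)                     # stand-in for real data
    rows.append([sample.mean(), sample.std()])  # one result row per iteration

results = np.vstack(rows)   # stack once, after the loop
print(results.shape)
print(results)
```

Stacking once at the end avoids copying the growing result array on every iteration.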

Missing Value¶

Missing values are also a form of information in data analysis. NumPy provides nan to represent missing values, and isnan can be used to detect them.

a = np.random.rand(2,2)
a[0, 1] = np.nan
print(np.isnan(a))

the output:

[[False  True]
 [False False]]

By default, nan_to_num replaces nan with 0. In the more advanced module pandas, which we’ll cover later, we’ll see functions such as fillna that let you specify the replacement value for nan.

print(np.nan_to_num(a))

the output:

[[ 0.58144238  0.        ]
 [ 0.26789784  0.48664306]]
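A few more details about nan are easy to trip over, so here is a short sketch. It uses a fixed array rather than random data; the nan keyword of nan_to_num and the nanmean function are part of NumPy, while the replacement value -1.0 is arbitrary.

```python
import numpy as np

a = np.array([[0.5, np.nan],
              [0.25, 0.75]])

# nan is not equal to anything, including itself, so use isnan to test for it.
print(np.nan == np.nan)

filled = np.nan_to_num(a, nan=-1.0)   # choose the replacement value
print(filled)

plain_mean = a.mean()                 # a single nan makes the result nan
mean_ignoring_nan = np.nanmean(a)     # skips the missing entry
print(plain_mean, mean_ignoring_nan)
```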

NumPy offers many more functions. For a detailed understanding, you can refer to the following links: http://wiki.scipy.org/Numpy_Example_List and http://docs.scipy.org/doc/numpy