AI x Quant Trader Series - Day 4
Widely used Python Libraries¶
Last time we introduced NumPy. In this article, we'll focus on another commonly used library in quantitative finance: SciPy.
SciPy¶
Overview of SciPy¶
In the previous article, we briefly introduced NumPy. Now let’s take a look at what SciPy can do. While NumPy handles vector and matrix operations—essentially functioning like an advanced scientific calculator—SciPy builds on top of NumPy and provides a more comprehensive and advanced set of functionalities. It offers a wide array of functions for statistics, optimization, interpolation, numerical integration, signal processing, and more, covering almost all fundamental scientific computing tasks.
In quantitative analysis, the most commonly used areas are statistics and optimization. Therefore, this article will focus on SciPy’s statistics and optimization modules. Other modules will be introduced in future articles when relevant.
This article will involve some matrix algebra. If you find it difficult, feel free to skip Part 3 or try to understand the concepts using one-dimensional scalars instead of higher-dimensional vectors.
As always, let's start by importing the necessary modules. Here, we’ll be using the statistics and optimization parts of SciPy:
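The exact import block isn't reproduced in the source material; based on the names used in the snippets below (np and stats), it presumably looks something like this, with scipy.optimize included since the optimization module is mentioned above:

import numpy as np            # array operations, means, medians, etc.
import scipy.stats as stats   # probability distributions and hypothesis tests
import scipy.optimize as opt  # optimization routines, used later in the series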
Statistics Module¶
Generating Random Numbers¶
Let’s begin with generating random numbers, as this will make it easier to demonstrate other concepts later. To generate n random numbers, you can use rv_continuous.rvs(size=n) or rv_discrete.rvs(size=n).
rv_continuous refers to continuous probability distributions such as:
Uniform distribution: uniform
Normal distribution: norm
Beta distribution: beta, etc.
rv_discrete refers to discrete probability distributions such as:
Bernoulli distribution: bernoulli
Geometric distribution: geom
Poisson distribution: poisson, etc.
For example, to generate:
10 random numbers from a uniform distribution on the interval [0, 1], and
10 random numbers from a Beta(α, β) distribution, here with shape parameters α = 4 and β = 2:
rv_unif = stats.uniform.rvs(size=10)          # 10 draws from Uniform(0, 1)
print(rv_unif)
rv_beta = stats.beta.rvs(size=10, a=4, b=2)   # 10 draws from Beta(4, 2)
print(rv_beta)
[ 0.6419336 0.48403001 0.89548809 0.73837498 0.65744886 0.41845577
0.3823512 0.0985301 0.66785949 0.73163835]
[ 0.82164685 0.69563836 0.74207073 0.94348192 0.82979411 0.87013796
0.78412952 0.47508183 0.29296073 0.52551156]
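Discrete distributions are sampled with exactly the same rvs call. As a quick sketch (a Poisson example added here for illustration, not part of the original output), ten draws from a Poisson distribution with rate parameter mu = 3 would look like:

rv_poisson = stats.poisson.rvs(mu=3, size=10)   # 10 draws from Poisson(3)
print(rv_poisson)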
Each random distribution function in SciPy comes with built-in default parameters; for example, the uniform distribution defaults to the interval [0, 1]. However, when you need to modify these parameters, having to type out the full call each time can be a bit tedious.
To simplify this, SciPy provides a "freezing" feature. This allows you to create a frozen distribution object with fixed parameters, so you don't need to repeatedly specify them. This is particularly useful in scenarios where you work with the same distribution settings multiple times.
For example, in the case of the Beta distribution, instead of specifying the parameters α and β every time you call .rvs(), you can define a frozen distribution like this:
np.random.seed(seed=2015)
rv_beta = stats.beta.rvs(size=10, a=4, b=2)   # pass the shape parameters on every call
print("method 1:")
print(rv_beta)

np.random.seed(seed=2015)
beta = stats.beta(a=4, b=2)                   # "frozen" Beta(4, 2) distribution with fixed parameters
print("method 2:")
print(beta.rvs(size=10))
method 1:
[ 0.43857338 0.9411551 0.75116671 0.92002864 0.62030521 0.56585548
0.41843548 0.5953096 0.88983036 0.94675351]
method 2:
[ 0.43857338 0.9411551 0.75116671 0.92002864 0.62030521 0.56585548
0.41843548 0.5953096 0.88983036 0.94675351]
Hypothesis Testing¶
Now, let’s generate a dataset and examine its related statistical properties. (You can find the parameters and documentation for the relevant distributions here: http://docs.scipy.org/doc/scipy/reference/stats.html)
norm_dist = stats.norm(loc=0.5, scale=2)   # frozen normal with mean 0.5 and standard deviation 2
n = 200
dat = norm_dist.rvs(size=n)
print("mean of data is: " + str(np.mean(dat)))
print("median of data is: " + str(np.median(dat)))
print("standard deviation of data is: " + str(np.std(dat)))
mean of data is: 0.383309149888
median of data is: 0.394980561217
standard deviation of data is: 2.00589851641
Suppose this dataset represents actual observed data—such as daily returns of a stock. We can perform a basic analysis on it. One of the simplest analyses is to test whether this dataset follows a given distribution, such as the normal distribution.
This is a classic one-sample hypothesis testing problem. A commonly used method for this is the Kolmogorov–Smirnov test (K-S test).
In a one-sample K-S test, the null hypothesis is that the sample comes from the specified theoretical distribution.
In SciPy, this can be done using the kstest function, where the parameters are:
the dataset,
the name of the distribution to test against (as a string),
and the parameters of that distribution, passed as a tuple.
mu = np.mean(dat)
sigma = np.std(dat)
stat_val, p_val = stats.kstest(dat, 'norm', (mu, sigma))   # test dat against N(mu, sigma^2)
print('KS-statistic D = %6.3f p-value = %6.4f' % (stat_val, p_val))
If the p-value from the hypothesis test is large (note that under the null hypothesis, the p-value is a random variable uniformly distributed over the interval [0, 1]; see: http://en.wikipedia.org/wiki/P-value), then we fail to reject the null hypothesis; in other words, we accept that the data passes the normality test.
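The uniformity claim is easy to check empirically. The snippet below is an illustration added here (not part of the original walkthrough): it repeatedly draws samples from the very distribution specified under the null, runs the K-S test against the true parameters, and counts how often the p-value falls below 0.05, which should happen roughly 5% of the time.

np.random.seed(seed=2015)
p_vals = []
for _ in range(1000):
    sample = stats.norm.rvs(size=100, loc=0.5, scale=2)   # data truly drawn from N(0.5, 2^2)
    _, p = stats.kstest(sample, 'norm', (0.5, 2))         # test against the true parameters
    p_vals.append(p)
print("share of p-values below 0.05: %.3f" % np.mean(np.array(p_vals) < 0.05))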
Given the assumption of normality, we can further test whether the mean of this dataset is significantly different from zero. A common method for this is the t-test, specifically the one-sample t-test.
In SciPy, this is done using the ttest_1samp function:
stat_val, p_val = stats.ttest_1samp(dat, 0)   # H0: the mean of dat equals 0
print('One-sample t-statistic t = %6.3f, p-value = %6.4f' % (stat_val, p_val))
We observe that p-value < 0.05, which means that under a significance level of 0.05, we should reject the null hypothesis—that is, the data’s mean is not equal to 0.
Next, let’s generate another dataset and try a two-sample t-test using ttest_ind. This test checks whether two independent samples have significantly different means.
norm_dist2 = stats.norm(loc=-0.2, scale=1.2)   # second sample drawn from N(-0.2, 1.2^2)
dat2 = norm_dist2.rvs(size=n // 2)             # half the sample size of dat
stat_val, p_val = stats.ttest_ind(dat, dat2, equal_var=False)   # Welch's t-test
print('Two-sample t-statistic t = %6.3f, p-value = %6.4f' % (stat_val, p_val))
Note that in this case, the second dataset we generated differs from the first in terms of sample size and variance. Therefore, when performing the t-test, we need to use Welch’s t-test by setting equal_var=False in the ttest_ind function.
We again obtain a relatively small p-value, which means that under the 0.05 significance level, we reject the null hypothesis and conclude that the two groups do not have equal means.
The scipy.stats module also provides many other hypothesis testing functions, such as:
bartlett and levene: for testing whether two or more samples have equal variances.
anderson_ksamp: for performing the Anderson-Darling k-sample test, used to check whether multiple samples come from the same distribution.
These tools are useful for more advanced statistical analysis, depending on the properties of your data; a sketch of how they might be called on the two samples above follows below.
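As a minimal sketch (reusing the dat and dat2 samples generated in the t-test example; this snippet is an illustration added here, not code from the original article):

stat_val, p_val = stats.levene(dat, dat2)        # equal-variance test, robust to non-normality
print('Levene statistic = %6.3f, p-value = %6.4f' % (stat_val, p_val))

stat_val, p_val = stats.bartlett(dat, dat2)      # equal-variance test, assumes normal data
print('Bartlett statistic = %6.3f, p-value = %6.4f' % (stat_val, p_val))

adk = stats.anderson_ksamp([dat, dat2])          # k-sample Anderson-Darling test
print('Anderson-Darling statistic = %6.3f, approx. significance level = %6.4f'
      % (adk.statistic, adk.significance_level))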