See below for godzilla.dev materials about: AI x Quant Trader Series - Day 6

The Swiss Army Knife of Python Data Processing: pandas"¶

Part 2: Rapid Advancement¶

In the previous article, we introduced how to create and access data in pandas using the Series and DataFrame types. In this article, we will cover how to perform operations on pandas data. Once you’ve mastered these operations, you’ll be able to handle most data processing tasks.

First, let’s import the modules we’ll be using in this article:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

To make the data easier to view, let’s adjust the output display width

pd.set_option('display.width', 200)

1. Other Ways to Create Data¶

The creation of data structures is not limited to the standard forms introduced in the previous article.

In this article, we’ll look at a few more. For example, we can create a Series with dates as its elements:

dates = pd.date_range('20250101', periods=5)
print(dates)

the output:

DatetimeIndex(['2025-01-01', '2025-01-02', '2025-01-03', '2025-01-04', '2025-01-05'], dtype='datetime64[ns]', freq='D')

Assign this date Series as the index of a DataFrame:

df = pd.DataFrame(np.random.randn(5, 4),index=dates,columns=list('ABCD'))
print(df)

the output:

                   A         B         C         D
2025-01-01 -1.119762 -0.088336  1.921095  1.158499
2025-01-02 -0.250627  0.271175 -0.505430 -1.490358
2025-01-03  0.710884 -1.478697  0.537757  1.448547
2025-01-04 -1.658607 -0.364456  0.196627  0.881224
2025-01-05  0.347936  0.312740 -0.199889  2.881074

Any object that can be converted into a Series can be used to create a DataFrame:

df2 = pd.DataFrame({ 'A' : 1., 'B': pd.Timestamp('20250214'), 'C': pd.Series(1.6,index=list(range(4)),dtype='float64'), 'D' : np.array([4] * 4, dtype='int64'), 'E' : 'hello pandas!' })
print(df2)

the output:

     A          B    C  D              E
0  1.0 2025-02-14  1.6  4  hello pandas!
1  1.0 2025-02-14  1.6  4  hello pandas!
2  1.0 2025-02-14  1.6  4  hello pandas!
3  1.0 2025-02-14  1.6  4  hello pandas!

2. Viewing Data¶

In most cases, data is not generated by the analysts themselves but obtained through data APIs, external files, or other sources.

Here, we’ll use a dataset retrieved from binance rest api as an example:

# pip install pandas requests python-dateutil
import requests
import pandas as pd
from dateutil import parser

symbol = "BTCUSDT"  # REST 接口不带斜杠
interval = "1d"

start = "2025-01-01 00:00:00"
end   = "2025-02-01 00:00:00"

start_ms = int(parser.isoparse(start).timestamp() * 1000)
end_ms   = int(parser.isoparse(end).timestamp() * 1000)

url = "https://api.binance.com/api/v3/klines"
params = {
    "symbol": symbol,
    "interval": interval,
    "startTime": start_ms,
    "endTime": end_ms,
    "limit": 1000
}
r = requests.get(url, params=params, timeout=15)
r.raise_for_status()
data = r.json()

# 返回每一行：
# [
#   0 open time, 1 open, 2 high, 3 low, 4 close, 5 volume,
#   6 close time, 7 quote asset volume, 8 number of trades,
#   9 taker buy base volume, 10 taker buy quote volume, 11 ignore
# ]
df = pd.DataFrame(data, columns=[
    "open_time","open","high","low","close","volume",
    "close_time","quote_vol","trades","taker_base","taker_quote","ignore"
])

# 转数值
for col in ["open","high","low","close","volume","quote_vol","taker_base","taker_quote"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# 只保留核心列，并用UTC日期索引
df["date_utc"] = pd.to_datetime(df["open_time"], unit="ms", utc=True).dt.date
df = df[["open","high","low","close","volume","date_utc"]].set_index("date_utc").sort_index()

print(df)

the output:

                 open       high        low      close        volume
date_utc                                                            
2025-01-01   93576.00   95151.15   92888.00   94591.79  10373.326130
2025-01-02   94591.78   97839.50   94392.00   96984.79  21970.489480
2025-01-03   96984.79   98976.91   96100.01   98174.18  15253.829360
2025-01-04   98174.17   98778.43   97514.79   98220.50   8990.056510
2025-01-05   98220.51   98836.85   97276.79   98363.61   8095.637230
2025-01-06   98363.61  102480.00   97920.00  102235.60  25263.433750
2025-01-07  102235.60  102724.38   96181.81   96954.61  32059.875370
2025-01-08   96954.60   97268.65   92500.90   95060.61  33704.678940
2025-01-09   95060.61   95382.32   91203.67   92552.49  34544.836850
2025-01-10   92552.49   95836.00   92206.02   94726.11  31482.864240
2025-01-11   94726.10   95050.94   93831.73   94599.99   7047.904300
2025-01-12   94599.99   95450.10   93711.19   94545.06   8606.866220
2025-01-13   94545.07   95940.00   89256.69   94536.10  42619.564230
2025-01-14   94536.11   97371.00   94346.22   96560.86  27846.617530
2025-01-15   96560.85  100681.94   96500.00  100497.35  30509.991790
2025-01-16  100497.35  100866.66   97335.13   99987.30  27832.853170
2025-01-17   99987.30  105865.22   99950.77  104077.48  39171.852920
2025-01-18  104077.47  104988.88  102277.55  104556.23  24307.829980
2025-01-19  104556.23  106422.43   99651.60  101331.57  43397.282980
2025-01-20  101331.57  109588.00   99550.00  102260.01  89529.231732
2025-01-21  102260.00  107240.81  100119.04  106143.82  45941.020020
2025-01-22  106143.82  106394.46  103339.12  103706.66  22248.692540
2025-01-23  103706.66  106850.00  101262.28  103910.34  53953.120310
2025-01-24  103910.35  107120.00  102750.00  104870.50  23609.240170
2025-01-25  104870.51  105286.52  104106.09  104746.85   9068.323770
2025-01-26  104746.86  105500.00  102520.44  102620.00   9812.512380
2025-01-27  102620.01  103260.00   97777.77  102082.83  50758.134100
2025-01-28  102082.83  103800.00  100272.68  101335.52  22022.057650
2025-01-29  101335.52  104782.68  101328.01  103733.24  23155.358020
2025-01-30  103733.25  106457.44  103278.54  104722.94  19374.074720
2025-01-31  104722.94  106012.00  101560.00  102429.56  21983.181930

Using the code above, we retrieved BTC’s daily market data for all trading days in January 2025. First, let’s check the size of the dataset:

print(df.shape)

the output:

(31, 5)

We can see there are 31 rows, which means we fetched 31 records. Each record has 5 fields.

Now let’s preview the data: DataFrame.head() and DataFrame.tail() show the first five and last five rows, respectively. To change the number of rows displayed, pass a number in the parentheses.

print("Head of this DataFrame:")
print(df.head())
print("Tail of this DataFrame:")
print(df.tail(3))

the output:

Head of this DataFrame:
                open      high       low     close       volume
date_utc                                                       
2025-01-01  93576.00  95151.15  92888.00  94591.79  10373.32613
2025-01-02  94591.78  97839.50  94392.00  96984.79  21970.48948
2025-01-03  96984.79  98976.91  96100.01  98174.18  15253.82936
2025-01-04  98174.17  98778.43  97514.79  98220.50   8990.05651
2025-01-05  98220.51  98836.85  97276.79  98363.61   8095.63723
Tail of this DataFrame:
                 open       high        low      close       volume
date_utc                                                           
2025-01-29  101335.52  104782.68  101328.01  103733.24  23155.35802
2025-01-30  103733.25  106457.44  103278.54  104722.94  19374.07472
2025-01-31  104722.94  106012.00  101560.00  102429.56  21983.18193

DataFrame.describe() provides statistical summaries for the purely numeric data in the DataFrame.

print(df.describe())

the output:

                open           high            low          close        volume
count      31.000000      31.000000      31.000000      31.000000     31.000000
mean    99750.482258  101877.524839   97835.769032  100036.080645  27888.217365
std      4148.002200    4544.879400    4050.642975    4011.242254  17319.976611
min     92552.490000   95050.940000   89256.690000   92552.490000   7047.904300
25%     95810.730000   97605.250000   94369.110000   96757.735000  17313.952040
50%    100497.350000  102724.380000   97777.770000  101331.570000  24307.829980
75%    103719.955000  105938.610000  101295.145000  103719.950000  34124.757895
max    106143.820000  109588.000000  104106.090000  106143.820000  89529.231732

Sorting the data makes it easier to inspect. A DataFrame offers two kinds of sorting.

One is label-based sorting—i.e., sorting by the index (row labels) or by column names.

Use DataFrame.sort_index, with axis=0 to sort by the index (rows) and axis=1 to sort by column names. You can also specify ascending or descending order.

print("Order by column names, descending:")
print(df.sort_index(axis=1, ascending=False).head())

the output:

Order by column names, descending:
                 volume      open       low      high     close
date_utc                                                       
2025-01-01  10373.32613  93576.00  92888.00  95151.15  94591.79
2025-01-02  21970.48948  94591.78  94392.00  97839.50  96984.79
2025-01-03  15253.82936  96984.79  96100.01  98976.91  98174.18
2025-01-04   8990.05651  98174.17  97514.79  98778.43  98220.50
2025-01-05   8095.63723  98220.51  97276.79  98836.85  98363.61

The second type is value-based sorting. You can specify the column name(s) and the sort order; by default, it sorts in ascending order.

print("Order by column value, ascending:")
print(df.sort_values(by="date_utc", ascending=True).head())

the output:

Order by column value, ascending:
                open      high       low     close       volume
date_utc                                                       
2025-01-01  93576.00  95151.15  92888.00  94591.79  10373.32613
2025-01-02  94591.78  97839.50  94392.00  96984.79  21970.48948
2025-01-03  96984.79  98976.91  96100.01  98174.18  15253.82936
2025-01-04  98174.17  98778.43  97514.79  98220.50   8990.05651
2025-01-05  98220.51  98836.85  97276.79  98363.61   8095.63723

3. Data Access and Manipulation¶

3.1 Revisiting Data Access¶

In the previous section, we introduced several ways to access data in a DataFrame using loc, iloc, at, iat, ix, and [].

Here, we’ll introduce another method: using ":" to retrieve part of the rows or all of the columns.

print(df.iloc[1:4][:])

the output:

                open      high       low     close       volume
date_utc                                                       
2025-01-02  94591.78  97839.50  94392.00  96984.79  21970.48948
2025-01-03  96984.79  98976.91  96100.01  98174.18  15253.82936
2025-01-04  98174.17  98778.43  97514.79  98220.50   8990.05651

We can extend the method introduced in the previous section that uses Boolean vectors to access data.

This makes it very convenient to filter data. For example, we can select the rows where the closing price is above the average.

print(df[df.close> df.close.mean()].head())

the output:

                 open       high        low      close       volume
date_utc                                                           
2025-01-06   98363.61  102480.00   97920.00  102235.60  25263.43375
2025-01-15   96560.85  100681.94   96500.00  100497.35  30509.99179
2025-01-17   99987.30  105865.22   99950.77  104077.48  39171.85292
2025-01-18  104077.47  104988.88  102277.55  104556.23  24307.82998
2025-01-19  104556.23  106422.43   99651.60  101331.57  43397.28298