[pandas] User Guide: 10 Minutes to pandas (Part 2)


Grouping

  • Splitting: split the data into groups based on some criteria
  • Applying: apply a function to each group independently
  • Combining: combine the results into a data structure
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
df
     A      B         C         D
0  foo    one -0.738005 -2.019732
1  bar    one  0.887627  0.015670
2  foo    two -0.108933 -0.077614
3  bar  three  0.076641  1.675694
4  foo    two -0.787585  0.466678
5  bar    two  0.193921 -0.345819
6  foo    one  0.846988 -1.513333
7  foo  three  1.110915  0.189766
df.groupby("A")[["C", "D"]].sum()
            C         D
A                      
bar  1.158189  1.345545
foo  0.323379 -2.954235

Group by A and apply sum() to compute the per-group totals.

df.groupby(["A", "B"]).sum()
                  C         D
A   B                        
bar one    0.887627  0.015670
    three  0.076641  1.675694
    two    0.193921 -0.345819
foo one    0.108983 -3.533064
    three  1.110915  0.189766
    two   -0.896518  0.389064

Group by A first, then by B.

df.groupby(["B", "A"]).sum()
                  C         D
B     A                      
one   bar  0.887627  0.015670
      foo  0.108983 -3.533064
three bar  0.076641  1.675694
      foo  1.110915  0.189766
two   bar  0.193921 -0.345819
      foo -0.896518  0.389064

Group by B first, then by A.

Note: to operate on several columns, select them with [["C", "D"]].
For a single column you can use ["C"] (the result is a Series) or [["C"]] (the result keeps the DataFrame shape).
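
For example, a minimal sketch of the difference (reusing the df from above):

# ["C"] aggregates to a Series indexed by A
df.groupby("A")["C"].sum()
# [["C"]] keeps a one-column DataFrame
df.groupby("A")[["C"]].sum()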

Reshaping

Stack

tuples = list(
    zip(
        ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
        ["one", "two", "one", "two", "one", "two", "one", "two"],
    )
)
# tuples
# build a hierarchical MultiIndex from the tuples
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df = pd.DataFrame(np.random.randn(8, 3), columns=["C1", "C2", "C3"], index=index)
df2 = df[:5]
df2
                    C1        C2        C3
first second                              
bar   one    -1.347431  0.153681 -1.006217
      two    -0.741849 -0.117988 -0.593601
baz   one     0.394623 -0.360702  0.062728
      two    -0.477569 -1.504717  0.124419
foo   one     0.340487 -1.045430 -0.623986
stacked = df2.stack()
stacked
first  second    
bar    one     C1   -1.347431
               C2    0.153681
               C3   -1.006217
       two     C1   -0.741849
               C2   -0.117988
               C3   -0.593601
baz    one     C1    0.394623
               C2   -0.360702
               C3    0.062728
       two     C1   -0.477569
               C2   -1.504717
               C3    0.124419
foo    one     C1    0.340487
               C2   -1.045430
               C3   -0.623986
dtype: float64

stack() compresses the columns into the innermost level of the index, producing a Series.
In the example above, df2 has shape (5, 3) and stacked has shape (15,).
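
As a quick sketch of the inverse operation (continuing from the example above), unstack() pivots the innermost index level back into columns:

unstacked = stacked.unstack()
unstacked.shape  # (5, 3) again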

Pivot

pivot_table() creates a spreadsheet-style pivot table as a DataFrame.
Function signature: pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)

df = pd.DataFrame(
    {
        "C1": ["one", "one", "two", "three"] * 3,
        "C2": ["A", "B", "C"] * 4,
        "C3": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
        "C4": np.random.randn(12),
        "C5": np.random.randn(12),
    }
)
df
       C1 C2   C3        C4        C5
0     one  A  foo -0.111176 -0.049645
1     one  B  foo -0.483144 -2.182207
2     two  C  foo  0.841522 -0.669410
3   three  A  bar  1.074447 -1.335228
4     one  B  bar -1.949381  0.594608
5     one  C  bar -1.544474 -0.873641
6     two  A  foo -0.837036 -1.054699
7   three  B  foo  0.537476 -0.359334
8     one  C  foo  0.169522 -1.594076
9     one  A  bar -0.595527  0.225416
10    two  B  bar -0.443136 -1.495795
11  three  C  bar -0.081103  1.551327

The values of column C1 become the new column labels,
the values of columns C2 and C3 become the (hierarchical) index,
and the values of column C5 fill the table; combinations with no data are filled with NaN.

pd.pivot_table(df, values="C5", index=["C2", "C3"], columns=["C1"])
C1           one     three       two
C2 C3                               
A  bar  0.225416 -1.335228       NaN
   foo -0.049645       NaN -1.054699
B  bar  0.594608       NaN -1.495795
   foo -2.182207 -0.359334       NaN
C  bar -0.873641  1.551327       NaN
   foo -1.594076       NaN -0.669410
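
If the NaN cells are unwanted, the fill_value parameter from the signature above can replace them; a minimal sketch:

# fill missing combinations with 0 instead of NaN
pd.pivot_table(df, values="C5", index=["C2", "C3"], columns=["C1"], fill_value=0)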

Time series

pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (for example, converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. See the Time Series section of the user guide.

rng = pd.date_range("1/1/2023", periods=100, freq="S")
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts
2023-01-01 00:00:00    194
2023-01-01 00:00:01    306
2023-01-01 00:00:02     54
2023-01-01 00:00:03    198
2023-01-01 00:00:04    368
                      ... 
2023-01-01 00:01:35    431
2023-01-01 00:01:36    276
2023-01-01 00:01:37    286
2023-01-01 00:01:38    223
2023-01-01 00:01:39    217
Freq: S, Length: 100, dtype: int32
ts.resample("5Min").sum()
2023-01-01    25350
Freq: 5T, dtype: int32
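
resample() behaves like a groupby over time bins, so other bin sizes and aggregations work the same way; a small sketch continuing with the ts above:

# downsample to 30-second bins and take the mean of each bin
ts.resample("30S").mean()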

Series.tz_localize() localizes a time series to a time zone:

rng = pd.date_range("1/6/2023 00:00", periods=5, freq="D")
ts = pd.Series(np.random.randn(len(rng)), rng)
ts_utc = ts.tz_localize('UTC')
ts_utc
2023-01-06 00:00:00+00:00    0.418221
2023-01-07 00:00:00+00:00   -1.714893
2023-01-08 00:00:00+00:00   -0.464742
2023-01-09 00:00:00+00:00    0.005428
2023-01-10 00:00:00+00:00    0.209386
Freq: D, dtype: float64

tz_convert() converts a time-zone-aware series to another time zone:

ts_utc.tz_convert("US/Eastern")
2023-01-05 19:00:00-05:00    0.418221
2023-01-06 19:00:00-05:00   -1.714893
2023-01-07 19:00:00-05:00   -0.464742
2023-01-08 19:00:00-05:00    0.005428
2023-01-09 19:00:00-05:00    0.209386
Freq: D, dtype: float64

Categoricals

df["raw_grade"].astype("category")raw_grade类型换成了类别。

df = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']

rename_categories() renames the categories:

new_categories = ["very good", "good", "very bad"]
df["grade"] = df["grade"].cat.rename_categories(new_categories)
df["grade"]
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (3, object): ['very good', 'good', 'very bad']

set_categories() reorders the categories and adds the new ones at the same time:

df["grade"] = df["grade"].c> `df["raw_grade"].astype("category")` 将`raw_grade`类型换成了类别。at.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)
df["grade"]
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']
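
Sorting and grouping then follow the defined category order rather than lexical order; a minimal sketch (observed=False is written out explicitly so that the empty categories stay visible):

# sorts "very bad" first and "very good" last
df.sort_values(by="grade")
# group sizes, including the categories with no rows (bad, medium)
df.groupby("grade", observed=False).size()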

Plotting

import matplotlib.pyplot as plt
plt.close("all")
ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2022", periods=1000))
ts = ts.cumsum()
ts.plot()

(figure: line plot of the cumulative-sum series ts)

df = pd.DataFrame(np.random.randn(1000, 3), index=ts.index, columns=['A', 'B', 'C'])
df = df.cumsum()
plt.figure()
df.plot()
plt.legend(loc='best')

(figure: line plot of the three cumulative-sum columns A, B and C)
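
When running these examples as a plain script rather than in a notebook, the figures only appear after an explicit call (a matplotlib note, not part of the original example):

plt.show()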

Getting data in/out

CSV

df.to_csv("foo.csv")
pd.read_csv("foo.csv")
     Unnamed: 0          A          B          C
0    2022-01-01  -2.112172  -0.161145  -1.891843
1    2022-01-02  -1.787807  -0.469220  -1.592460
2    2022-01-03  -2.366840  -0.465609  -3.204489
3    2022-01-04  -2.913202  -0.220295  -3.415782
4    2022-01-05  -3.819952  -0.831654  -3.465468
..          ...        ...        ...        ...
995  2024-09-22  45.661361  13.760668  40.401864
996  2024-09-23  45.608082  14.161003  41.035935
997  2024-09-24  45.256665  12.934910  41.751221
998  2024-09-25  46.313781  12.783737  41.720967
999  2024-09-26  46.183519  12.790855  41.323802

[1000 rows x 4 columns]
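
The Unnamed: 0 column is just the DatetimeIndex that to_csv wrote out. A minimal sketch for reading it back in as the index (index_col and parse_dates are standard read_csv parameters):

# treat the first column as the index and parse it as dates
pd.read_csv("foo.csv", index_col=0, parse_dates=True)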

HDF5

df.to_hdf("foo.h5", "df")
pd.read_hdf("foo.h5", "df")

This step may raise an error:

File d:\Anaconda3\envs\pytorch\lib\site-packages\pandas\compat\_optional.py:141, in import_optional_dependency(name, extra, errors, min_version)
    140 try:
--> 141     module = importlib.import_module(name)
    142 except ImportError:

File d:\Anaconda3\envs\pytorch\lib\importlib\__init__.py:127, in import_module(name, package)
    126         level += 1
--> 127 return _bootstrap._gcd_import(name[level:], package, level)

File <frozen importlib._bootstrap>:1014, in _gcd_import(name, package, level)

File <frozen importlib._bootstrap>:991, in _find_and_load(name, import_)

File <frozen importlib._bootstrap>:973, in _find_and_load_unlocked(name, import_)

ModuleNotFoundError: No module named 'tables'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
Cell In [90], line 1
----> 1 df.to_hdf("foo.h5", "df")
...
--> 144         raise ImportError(msg)
    145     else:
    146         return None

ImportError: Missing optional dependency 'pytables'.  Use pip or conda to install pytables.

The error message suggests pip install pytables, but that fails as well:

ERROR: Could not find a version that satisfies the requirement pytables (from versions: none)
ERROR: No matching distribution found for pytables

The package is published under the name tables, so pip install tables solves the problem. After that, pd.read_hdf("foo.h5", "df") returns the data:
                    A          B          C
2022-01-01  -2.112172  -0.161145  -1.891843
2022-01-02  -1.787807  -0.469220  -1.592460
2022-01-03  -2.366840  -0.465609  -3.204489
2022-01-04  -2.913202  -0.220295  -3.415782
2022-01-05  -3.819952  -0.831654  -3.465468
...               ...        ...        ...
2024-09-22  45.661361  13.760668  40.401864
2024-09-23  45.608082  14.161003  41.035935
2024-09-24  45.256665  12.934910  41.751221
2024-09-25  46.313781  12.783737  41.720967
2024-09-26  46.183519  12.790855  41.323802

[1000 rows x 3 columns]

Excel

If openpyxl is not installed, to_excel/read_excel raise ModuleNotFoundError: No module named 'openpyxl'; install it first with pip install openpyxl.
df.to_excel("foo.xlsx", sheet_name="Sheet1")
pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"])
    Unnamed: 0          A          B          C
0   2022-01-01  -2.112172  -0.161145  -1.891843
1   2022-01-02  -1.787807  -0.469220  -1.592460
2   2022-01-03  -2.366840  -0.465609  -3.204489
3   2022-01-04  -2.913202  -0.220295  -3.415782
4   2022-01-05  -3.819952  -0.831654  -3.465468
..         ...        ...        ...        ...
995 2024-09-22  45.661361  13.760668  40.401864
996 2024-09-23  45.608082  14.161003  41.035935
997 2024-09-24  45.256665  12.934910  41.751221
998 2024-09-25  46.313781  12.783737  41.720967
999 2024-09-26  46.183519  12.790855  41.323802

[1000 rows x 4 columns]

References

10 minutes to pandas — pandas 1.5.2 documentation (pydata.org)