[pandas] Tutorial 9: How to handle time series data with ease

2023-09-14 09:15:12

How to handle time series data with ease in pandas

Data

This section uses the data file data/air_quality_no2_long.csv, available from the CSDN Library page "pandas案例和教程所使用的数据" (the data used by the pandas examples and tutorials).

import pandas as pd 
import matplotlib.pyplot as plt

air_quality = pd.read_csv("data/air_quality_no2_long.csv")
# Rename the date.utc column to datetime
air_quality = air_quality.rename(columns={"date.utc": "datetime"})
air_quality

        city country                   datetime            location parameter  \
0      Paris      FR  2019-06-21 00:00:00+00:00             FR04014       no2   
1      Paris      FR  2019-06-20 23:00:00+00:00             FR04014       no2   
2      Paris      FR  2019-06-20 22:00:00+00:00             FR04014       no2   
3      Paris      FR  2019-06-20 21:00:00+00:00             FR04014       no2   
4      Paris      FR  2019-06-20 20:00:00+00:00             FR04014       no2   
...      ...     ...                        ...                 ...       ...   
2063  London      GB  2019-05-07 06:00:00+00:00  London Westminster       no2   
2064  London      GB  2019-05-07 04:00:00+00:00  London Westminster       no2   
2065  London      GB  2019-05-07 03:00:00+00:00  London Westminster       no2   
2066  London      GB  2019-05-07 02:00:00+00:00  London Westminster       no2   
2067  London      GB  2019-05-07 01:00:00+00:00  London Westminster       no2   

      value   unit  
0      20.0  µg/m³  
1      21.8  µg/m³  
2      26.5  µg/m³  
3      24.9  µg/m³  
4      21.4  µg/m³  
...     ...    ...  
2063   26.0  µg/m³  
2064   16.0  µg/m³  
2065   19.0  µg/m³  
2066   19.0  µg/m³  
2067   23.0  µg/m³  

[2068 rows x 7 columns]

Using pandas `datetime` properties

Convert text data to datetime

air_quality["datetime"] = pd.to_datetime(air_quality["datetime"])
air_quality["datetime"]

0      2019-06-21 00:00:00+00:00
1      2019-06-20 23:00:00+00:00
2      2019-06-20 22:00:00+00:00
3      2019-06-20 21:00:00+00:00
4      2019-06-20 20:00:00+00:00
                  ...         
2063   2019-05-07 06:00:00+00:00
2064   2019-05-07 04:00:00+00:00
2065   2019-05-07 03:00:00+00:00
2066   2019-05-07 02:00:00+00:00
2067   2019-05-07 01:00:00+00:00
Name: datetime, Length: 2068, dtype: datetime64[ns, UTC]

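The to_datetime function converts the time values stored as strings into datetime64[ns, UTC] objects.
The conversion can also happen while the data is read: pd.read_csv("data/air_quality_no2_long.csv", parse_dates=["date.utc"]) parses the date column directly into datetime64[ns, UTC] objects. Note that parse_dates expects the column name as it appears in the file, which is date.utc here; the name datetime only exists after the rename above.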
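The two conversion routes can be compared on a tiny in-memory sample. This is only a sketch: the three-row CSV text below is a hypothetical stand-in for the real file, keeping just the city, date.utc and value columns.

```python
import io

import pandas as pd

# Hypothetical three-row sample laid out like the tutorial's CSV file.
csv_text = """city,date.utc,value
Paris,2019-06-21 00:00:00+00:00,20.0
Paris,2019-06-20 23:00:00+00:00,21.8
Paris,2019-06-20 22:00:00+00:00,26.5
"""

# Route 1: read the column as plain text, then convert it explicitly.
df_text = pd.read_csv(io.StringIO(csv_text))
converted = pd.to_datetime(df_text["date.utc"])

# Route 2: convert while reading; parse_dates uses the name found in the file.
df_parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date.utc"])
df_parsed = df_parsed.rename(columns={"date.utc": "datetime"})

print(df_text["date.utc"].dtype)   # still text
print(converted.dtype)             # timezone-aware datetime
print(df_parsed["datetime"].dtype)
```

Both routes should end with a timezone-aware datetime column; the second one simply saves a step.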

Why do we need datetime?

We can find the start and end of the time range.
We can compute time intervals.
We can compare points in time, and so on.

air_quality["datetime"].min(), air_quality["datetime"].max()

(Timestamp('2019-05-07 01:00:00+0000', tz='UTC'),
 Timestamp('2019-06-21 00:00:00+0000', tz='UTC'))

air_quality["datetime"].max() - air_quality["datetime"].min()

Timedelta('44 days 23:00:00')
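The same three operations (range, interval, comparison) can be tried on a handful of timestamps. The values below are a made-up sample that only reuses the first and last timestamps of the data set; the cutoff date is an arbitrary choice for illustration.

```python
import pandas as pd

# Small illustrative sample of timezone-aware timestamps.
times = pd.to_datetime(pd.Series([
    "2019-05-07 01:00:00+00:00",
    "2019-05-20 12:00:00+00:00",
    "2019-06-21 00:00:00+00:00",
]))

# Start and end of the range, and the interval between them (a Timedelta).
start, end = times.min(), times.max()
span = end - start

# Comparison: keep only timestamps after an arbitrary cutoff.
cutoff = pd.Timestamp("2019-06-01", tz="UTC")
after_cutoff = times[times > cutoff]

print(span)
print(after_cutoff)
```

Because the series is timezone-aware, the cutoff is created with tz="UTC" as well; comparing aware and naive timestamps raises an error.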

Add the month of each measurement as a new column of the DataFrame.

air_quality["month"] = air_quality["datetime"].dt.month
air_quality.head()

    city country                  datetime location parameter  value   unit  \
0  Paris      FR 2019-06-21 00:00:00+00:00  FR04014       no2   20.0  µg/m³   
1  Paris      FR 2019-06-20 23:00:00+00:00  FR04014       no2   21.8  µg/m³   
2  Paris      FR 2019-06-20 22:00:00+00:00  FR04014       no2   26.5  µg/m³   
3  Paris      FR 2019-06-20 21:00:00+00:00  FR04014       no2   24.9  µg/m³   
4  Paris      FR 2019-06-20 20:00:00+00:00  FR04014       no2   21.4  µg/m³   

   month  
0      6  
1      6  
2      6  
3      6  
4      6

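Timestamp objects provide many properties. Besides month there are year, weekofyear, quarter and more, all of which are reachable through the dt accessor. (In pandas 2.0 and later, weekofyear is no longer available on the dt accessor; use dt.isocalendar().week instead.)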
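As a quick illustration of the dt accessor, the sketch below extracts several properties from two sample timestamps (the first and last points of the data set, without time zone for brevity). It uses dt.isocalendar().week for the week number, which assumes pandas 1.1 or newer.

```python
import pandas as pd

# Two sample timestamps; a Series is needed to use the dt accessor.
s = pd.Series(pd.to_datetime(["2019-05-07 01:00", "2019-06-21 00:00"]))

# Collect a few datetime properties side by side.
parts = pd.DataFrame({
    "year": s.dt.year,
    "month": s.dt.month,
    "quarter": s.dt.quarter,
    "weekday": s.dt.weekday,            # Monday=0, Sunday=6
    "week": s.dt.isocalendar().week,    # ISO week number
})

print(parts)
```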

How do we compute the average $NO_2$ concentration for each day of the week at each measurement location?

air_quality.groupby([air_quality["datetime"].dt.weekday, "location"])["value"].mean()

datetime  location        
0         BETR801               27.875000
          FR04014               24.856250
          London Westminster    23.969697
1         BETR801               22.214286
          FR04014               30.999359
          London Westminster    24.885714
2         BETR801               21.125000
          FR04014               29.165753
          London Westminster    23.460432
3         BETR801               27.500000
          FR04014               28.600690
          London Westminster    24.780142
4         BETR801               28.400000
          FR04014               31.617986
          London Westminster    26.446809
5         BETR801               33.500000
          FR04014               25.266154
          London Westminster    24.977612
6         BETR801               21.896552
          FR04014               23.274306
          London Westminster    24.859155
Name: value, dtype: float64

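Remember the split-apply-combine pattern provided by groupby?
Here we want the mean concentration for each day of the week at each measurement location.
We first group by weekday (Monday=0, Sunday=6), which is obtained through the dt accessor,
then by location, compute the mean of each group, and combine the results.
Because these examples use a very short time series, the analysis does not give results that are representative in the long term!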
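To make the grouping easier to follow, here is the same pattern on four hand-made rows. The station names A and B and the values are invented for illustration; only the groupby call mirrors the tutorial.

```python
import pandas as pd

# Invented measurements: two Mondays and one Tuesday at station A,
# one Monday at station B.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2019-05-06 10:00",   # Monday
        "2019-05-13 10:00",   # Monday
        "2019-05-07 10:00",   # Tuesday
        "2019-05-06 10:00",   # Monday
    ]),
    "location": ["A", "A", "A", "B"],
    "value": [10.0, 20.0, 30.0, 40.0],
})

# Split by (weekday, location), apply mean, combine into one Series.
weekday = df["datetime"].dt.weekday.rename("weekday")
mean_by_day = df.groupby([weekday, "location"])["value"].mean()

print(mean_by_day)
```

The two Monday values at station A (10 and 20) collapse into a single mean of 15, which is exactly the "apply" step acting on one group.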

Plot the average $NO_2$ concentration for each hour of the day

fig, axs = plt.subplots(figsize=(12, 4))
air_quality.groupby(air_quality["datetime"].dt.hour)["value"].mean().plot(
    kind='bar', rot=0, ax=axs)
plt.xlabel("Hour of the day")  # custom x label using Matplotlib
plt.ylabel("$NO_2 (µg/m^3)$")

`Datetime` as the index

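Tutorial 7 of this series ("【pandas】教程：7-调整表格数据的布局", on reshaping the layout of tables) introduced pivot, which reshapes a table so that each measurement location becomes a separate column.
After pivoting, datetime becomes the index of the table. In general, a column can be made the index with the set_index function.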
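The difference between the two routes can be seen on a small invented table (stations A and B and their values are placeholders): set_index keeps the long layout and only moves the column into the index, while pivot also spreads the locations into columns.

```python
import pandas as pd

# Invented long-format table: two timestamps for each of two stations.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2019-05-07 01:00", "2019-05-07 02:00",
        "2019-05-07 01:00", "2019-05-07 02:00",
    ]),
    "location": ["A", "A", "B", "B"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# set_index: same rows, datetime moved into the index.
indexed = df.set_index("datetime")

# pivot: one row per timestamp, one column per location.
wide = df.pivot(index="datetime", columns="location", values="value")

print(indexed)
print(wide)
```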

no_2 = air_quality.pivot(index="datetime", columns="location", values="value")
no_2

location                   BETR801  FR04014  London Westminster
datetime                                                       
2019-05-07 01:00:00+00:00     50.5     25.0                23.0
2019-05-07 02:00:00+00:00     45.0     27.7                19.0
2019-05-07 03:00:00+00:00      NaN     50.4                19.0
2019-05-07 04:00:00+00:00      NaN     61.9                16.0
2019-05-07 05:00:00+00:00      NaN     72.4                 NaN
...                            ...      ...                 ...
2019-06-20 20:00:00+00:00      NaN     21.4                 NaN
2019-06-20 21:00:00+00:00      NaN     24.9                 NaN
2019-06-20 22:00:00+00:00      NaN     26.5                 NaN
2019-06-20 23:00:00+00:00      NaN     21.8                 NaN
2019-06-21 00:00:00+00:00      NaN     20.0                 NaN

[1033 rows x 3 columns]

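A datetime index (a DatetimeIndex) offers powerful functionality. For example, we do not need the dt accessor to get the time series properties; they are available directly on the index: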

no_2.index.year, no_2.index.weekday

(Int64Index([2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019,
             ...
             2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019],
            dtype='int64', name='datetime', length=1033),
 Int64Index([1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
             ...
             3, 3, 3, 3, 3, 3, 3, 3, 3, 4],
            dtype='int64', name='datetime', length=1033))

Plot the $NO_2$ concentrations at the different stations from May 20 until the end of May 21

no_2["2019-05-20":"2019-05-21"].plot()
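The slice above relies on partial string indexing: a date string without a time selects the whole day, and both endpoints are included. A minimal sketch on a synthetic hourly series (the values are just a counter) shows which rows are returned.

```python
import pandas as pd

# Synthetic hourly series covering May 19 to May 22, 2019 (96 hours).
idx = pd.date_range("2019-05-19", periods=96, freq="h", tz="UTC")
s = pd.Series(range(96), index=idx)

# Date-only strings select whole days; the end date is inclusive.
sel = s.loc["2019-05-20":"2019-05-21"]

print(sel.index[0], sel.index[-1], len(sel))
```

The selection runs from midnight on May 20 through 23:00 on May 21, so it contains 48 hourly values rather than 24.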

Resample a time series to another frequency

Aggregate the current hourly time series values to the monthly maximum value at each station:

monthly_max = no_2.resample("M").max()
monthly_max

location                   BETR801  FR04014  London Westminster
datetime                                                       
2019-05-31 00:00:00+00:00     74.5     97.0                97.0
2019-06-30 00:00:00+00:00     52.5     84.7                52.0

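A very powerful method for time series with a datetime index is resample, which converts the series to another frequency (for example, turning per-second data into five-minute data).
resample is similar to a groupby operation:
it provides time-based grouping, specified with a frequency string (for example M, 5H, ...; pandas 2.2 and later prefer the spellings ME and 5h);
it requires an aggregation function such as mean or max.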
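The grouping behaviour of resample is easiest to see on a tiny series. The sketch below uses ten invented per-minute values (rather than per-second data, to keep it short) and aggregates them into five-minute bins with two different aggregation functions.

```python
import pandas as pd

# Ten invented per-minute values: 0.0, 1.0, ..., 9.0.
idx = pd.date_range("2019-05-07 00:00", periods=10, freq="min")
s = pd.Series(range(10), index=idx, dtype="float64")

# Each five-minute bin is one group; the aggregation reduces it to one value.
five_min_mean = s.resample("5min").mean()
five_min_max = s.resample("5min").max()

print(five_min_mean)
print(five_min_max)
```

The first bin holds the values 0 to 4 and the second 5 to 9, so the means are 2.0 and 7.0 and the maxima 4.0 and 9.0. Without an aggregation function, resample only returns a lazy Resampler object, much like groupby.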

Plot the daily mean $NO_2$ concentration at each station

no_2.resample("D").mean().plot(style="-o", figsize=(12, 4))

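Remember

Valid date strings can be converted to datetime objects with to_datetime, or converted directly while the data is read.
datetime objects in pandas support calculations and logical operations, and the dt accessor gives convenient access to the datetime properties.
A DatetimeIndex contains the date- and time-related properties and supports convenient slicing.
resample is a powerful method for changing the sampling frequency of a time series.

References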

How to handle time series data with ease? — pandas 1.5.2 documentation (pydata.org)
