Learning Python for Data Analysis (4): Pandas Basics

Time: 2023-09-27 14:29:29

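I. Introduction to Pandas:

1. About Pandas:

  • A library created by Wes McKinney in 2008
  • An open-source Python library designed for data analysis and data mining
  • Built on NumPy, so it benefits from NumPy's fast numerical computation
  • Builds on matplotlib, which makes plotting simple
  • Provides its own data structures (Series and DataFrame)

2. Why use Pandas:

NumPy can already process data, and together with matplotlib it covers part of the data-presentation problem. So what does learning pandas add?

  • Convenient data processing
  • Easy file reading
  • Wraps Matplotlib plotting and NumPy computation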

3. DataFrame:

import numpy as np

# Create normally distributed price-change data for 10 stocks over 5 days
stock_change = np.random.normal(0, 1, (10, 5))
stock_change
array([[-0.78146676, -0.29810035,  0.17317068, -0.78727269, -1.13741097],
       [-1.64768295,  0.1966735 , -0.40381405, -1.38547391,  1.03162812],
       [-0.88359711, -0.51776621,  0.31386734, -0.79209882, -0.75448839],
       [ 0.39497997,  0.47411555, -1.22856179,  2.32711195,  0.16330958],
       [ 1.71156574,  1.32175126, -0.27637519, -0.1037488 ,  0.80180467],
       [ 0.16196088,  1.23434847,  0.09890927,  0.39747989, -0.28454071],
       [ 1.17218486,  1.57634118, -0.58714471,  1.40127241,  0.19774915],
       [ 0.76779403,  1.44145798, -1.36100164,  0.44464079, -0.56796337],
       [-1.80942914,  1.89610206, -0.37059895, -0.95929575,  0.19099914],
       [ 0.53646672, -0.19264632, -1.61610463,  1.27208662,  0.61560309]])

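In this form, however, it is hard to tell what the data represents, and hard to retrieve a specific part of it, such as the data for one particular stock.

Question: how can the data be displayed in a more meaningful way?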

import pandas as pd
# Use the pandas data structure
stock_data = pd.DataFrame(stock_change)
stock_data
          0          1          2          3          4
0 -0.781467  -0.298100   0.173171  -0.787273  -1.137411
1 -1.647683   0.196674  -0.403814  -1.385474   1.031628
2 -0.883597  -0.517766   0.313867  -0.792099  -0.754488
3  0.394980   0.474116  -1.228562   2.327112   0.163310
4  1.711566   1.321751  -0.276375  -0.103749   0.801805
5  0.161961   1.234348   0.098909   0.397480  -0.284541
6  1.172185   1.576341  -0.587145   1.401272   0.197749
7  0.767794   1.441458  -1.361002   0.444641  -0.567963
8 -1.809429   1.896102  -0.370599  -0.959296   0.190999
9  0.536467  -0.192646  -1.616105   1.272087   0.615603
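  • Add a row index;

  • Add a column index:

    • The stock dates form a time series. Building them by hand means stepping forward in time while accounting for the number of days in each month, which is inconvenient. Use pd.date_range() to generate a sequence of consecutive dates (a brief look is enough for now)
    date_range(start=None,end=None, periods=None, freq='B')
    
    start: start date
    
    end: end date
    
    periods: number of periods (dates) to generate
    
    freq: step between dates; the default is 'D' (calendar day), while 'B' (business day) skips weekends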
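
The effect of freq can be checked with a short sketch. It assumes only pandas itself and compares business-day stepping ('B') with the default calendar-day stepping ('D'), starting on 2019-01-01 (a Tuesday); the variable names are illustrative.

```python
import pandas as pd

# Five business days: Saturday 2019-01-05 and Sunday 2019-01-06 are skipped.
business_days = pd.date_range(start="2019-01-01", periods=5, freq="B")

# Five calendar days: the default freq is 'D', so weekends are included.
calendar_days = pd.date_range(start="2019-01-01", periods=5)

print(business_days.strftime("%Y-%m-%d").tolist())
print(calendar_days.strftime("%Y-%m-%d").tolist())
```

The business-day sequence ends on 2019-01-07, while the calendar-day sequence ends on 2019-01-05.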
    
help(pd.date_range)
Help on function date_range in module pandas.core.indexes.datetimes:

date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs)
    Return a fixed frequency DatetimeIndex.
    
    Parameters
    ----------
    start : str or datetime-like, optional
        Left bound for generating dates.
    end : str or datetime-like, optional
        Right bound for generating dates.
    periods : integer, optional
        Number of periods to generate.
    freq : str or DateOffset, default 'D'
        Frequency strings can have multiples, e.g. '5H'. See
        :ref:`here <timeseries.offset_aliases>` for a list of
        frequency aliases.
    tz : str or tzinfo, optional
        Time zone name for returning localized DatetimeIndex, for example
        'Asia/Hong_Kong'. By default, the resulting DatetimeIndex is
        timezone-naive.
    normalize : bool, default False
        Normalize start/end dates to midnight before generating date range.
    name : str, default None
        Name of the resulting DatetimeIndex.
    closed : {None, 'left', 'right'}, optional
        Make the interval closed with respect to the given frequency to
        the 'left', 'right', or both sides (None, the default).
    **kwargs
        For compatibility. Has no effect on the result.
    
    Returns
    -------
    rng : DatetimeIndex
    
    See Also
    --------
    DatetimeIndex : An immutable container for datetimes.
    timedelta_range : Return a fixed frequency TimedeltaIndex.
    period_range : Return a fixed frequency PeriodIndex.
    interval_range : Return a fixed frequency IntervalIndex.
    
    Notes
    -----
    Of the four parameters ``start``, ``end``, ``periods``, and ``freq``,
    exactly three must be specified. If ``freq`` is omitted, the resulting
    ``DatetimeIndex`` will have ``periods`` linearly spaced elements between
    ``start`` and ``end`` (closed on both sides).
    
    To learn more about the frequency strings, please see `this link
    <http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases>`__.
    
    Examples
    --------
    **Specifying the values**
    
    The next four examples generate the same `DatetimeIndex`, but vary
    the combination of `start`, `end` and `periods`.
    
    Specify `start` and `end`, with the default daily frequency.
    
    >>> pd.date_range(start='1/1/2018', end='1/08/2018')
    DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
                   '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
                  dtype='datetime64[ns]', freq='D')
    
    Specify `start` and `periods`, the number of periods (days).
    
    >>> pd.date_range(start='1/1/2018', periods=8)
    DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
                   '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
                  dtype='datetime64[ns]', freq='D')
    
    Specify `end` and `periods`, the number of periods (days).
    
    >>> pd.date_range(end='1/1/2018', periods=8)
    DatetimeIndex(['2017-12-25', '2017-12-26', '2017-12-27', '2017-12-28',
                   '2017-12-29', '2017-12-30', '2017-12-31', '2018-01-01'],
                  dtype='datetime64[ns]', freq='D')
    
    Specify `start`, `end`, and `periods`; the frequency is generated
    automatically (linearly spaced).
    
    >>> pd.date_range(start='2018-04-24', end='2018-04-27', periods=3)
    DatetimeIndex(['2018-04-24 00:00:00', '2018-04-25 12:00:00',
                   '2018-04-27 00:00:00'],
                  dtype='datetime64[ns]', freq=None)
    
    **Other Parameters**
    
    Changed the `freq` (frequency) to ``'M'`` (month end frequency).
    
    >>> pd.date_range(start='1/1/2018', periods=5, freq='M')
    DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
                   '2018-05-31'],
                  dtype='datetime64[ns]', freq='M')
    
    Multiples are allowed
    
    >>> pd.date_range(start='1/1/2018', periods=5, freq='3M')
    DatetimeIndex(['2018-01-31', '2018-04-30', '2018-07-31', '2018-10-31',
                   '2019-01-31'],
                  dtype='datetime64[ns]', freq='3M')
    
    `freq` can also be specified as an Offset object.
    
    >>> pd.date_range(start='1/1/2018', periods=5, freq=pd.offsets.MonthEnd(3))
    DatetimeIndex(['2018-01-31', '2018-04-30', '2018-07-31', '2018-10-31',
                   '2019-01-31'],
                  dtype='datetime64[ns]', freq='3M')
    
    Specify `tz` to set the timezone.
    
    >>> pd.date_range(start='1/1/2018', periods=5, tz='Asia/Tokyo')
    DatetimeIndex(['2018-01-01 00:00:00+09:00', '2018-01-02 00:00:00+09:00',
                   '2018-01-03 00:00:00+09:00', '2018-01-04 00:00:00+09:00',
                   '2018-01-05 00:00:00+09:00'],
                  dtype='datetime64[ns, Asia/Tokyo]', freq='D')
    
    `closed` controls whether to include `start` and `end` that are on the
    boundary. The default includes boundary points on either end.
    
    >>> pd.date_range(start='2017-01-01', end='2017-01-04', closed=None)
    DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04'],
                  dtype='datetime64[ns]', freq='D')
    
    Use ``closed='left'`` to exclude `end` if it falls on the boundary.
    
    >>> pd.date_range(start='2017-01-01', end='2017-01-04', closed='left')
    DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03'],
                  dtype='datetime64[ns]', freq='D')
    
    Use ``closed='right'`` to exclude `start` if it falls on the boundary.
    
    >>> pd.date_range(start='2017-01-01', end='2017-01-04', closed='right')
    DatetimeIndex(['2017-01-02', '2017-01-03', '2017-01-04'],
                  dtype='datetime64[ns]', freq='D')
# Build the row index (股票 means "stock")
stock_index = ['股票'+str(i) for i in range(stock_change.shape[0])]

# Generate a date sequence that skips weekends (non-trading days)
date = pd.date_range('2019-01-01', periods=stock_change.shape[1], freq='B')

# index sets the row index, columns sets the column index
data = pd.DataFrame(stock_change, index=stock_index, columns=date)

data
       2019-01-01  2019-01-02  2019-01-03  2019-01-04  2019-01-07
股票0   -0.781467   -0.298100    0.173171   -0.787273   -1.137411
股票1   -1.647683    0.196674   -0.403814   -1.385474    1.031628
股票2   -0.883597   -0.517766    0.313867   -0.792099   -0.754488
股票3    0.394980    0.474116   -1.228562    2.327112    0.163310
股票4    1.711566    1.321751   -0.276375   -0.103749    0.801805
股票5    0.161961    1.234348    0.098909    0.397480   -0.284541
股票6    1.172185    1.576341   -0.587145    1.401272    0.197749
股票7    0.767794    1.441458   -1.361002    0.444641   -0.567963
股票8   -1.809429    1.896102   -0.370599   -0.959296    0.190999
股票9    0.536467   -0.192646   -1.616105    1.272087    0.615603

4. DataFrame

4.1 DataFrame structure

A DataFrame has both a row index and a column index:

  • Row index: identifies the rows; it is called index
  • Column index: identifies the columns; it is called columns

4.2 DataFrame attributes

data.index  # row index: the row labels of the DataFrame
Index(['股票0', '股票1', '股票2', '股票3', '股票4', '股票5', '股票6', '股票7', '股票8', '股票9'], dtype='object')
data.columns  # column index: the column labels of the DataFrame
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-07'],
              dtype='datetime64[ns]', freq='B')
data.shape  # shape of the data as (rows, columns)
(10, 5)
data.values  # contents: the underlying values as a NumPy array
array([[-0.78146676, -0.29810035,  0.17317068, -0.78727269, -1.13741097],
       [-1.64768295,  0.1966735 , -0.40381405, -1.38547391,  1.03162812],
       [-0.88359711, -0.51776621,  0.31386734, -0.79209882, -0.75448839],
       [ 0.39497997,  0.47411555, -1.22856179,  2.32711195,  0.16330958],
       [ 1.71156574,  1.32175126, -0.27637519, -0.1037488 ,  0.80180467],
       [ 0.16196088,  1.23434847,  0.09890927,  0.39747989, -0.28454071],
       [ 1.17218486,  1.57634118, -0.58714471,  1.40127241,  0.19774915],
       [ 0.76779403,  1.44145798, -1.36100164,  0.44464079, -0.56796337],
       [-1.80942914,  1.89610206, -0.37059895, -0.95929575,  0.19099914],
       [ 0.53646672, -0.19264632, -1.61610463,  1.27208662,  0.61560309]])
data.T  # transpose
                股票0      股票1      股票2      股票3      股票4      股票5      股票6      股票7      股票8      股票9
2019-01-01  -0.781467  -1.647683  -0.883597   0.394980   1.711566   0.161961   1.172185   0.767794  -1.809429   0.536467
2019-01-02  -0.298100   0.196674  -0.517766   0.474116   1.321751   1.234348   1.576341   1.441458   1.896102  -0.192646
2019-01-03   0.173171  -0.403814   0.313867  -1.228562  -0.276375   0.098909  -0.587145  -1.361002  -0.370599  -1.616105
2019-01-04  -0.787273  -1.385474  -0.792099   2.327112  -0.103749   0.397480   1.401272   0.444641  -0.959296   1.272087
2019-01-07  -1.137411   1.031628  -0.754488   0.163310   0.801805  -0.284541   0.197749  -0.567963   0.190999   0.615603

4.3 Common DataFrame methods:

data.head(5)  # show the first 5 rows; with no argument the default is 5, and head(N) shows the first N rows
       2019-01-01  2019-01-02  2019-01-03  2019-01-04  2019-01-07
股票0   -0.781467   -0.298100    0.173171   -0.787273   -1.137411
股票1   -1.647683    0.196674   -0.403814   -1.385474    1.031628
股票2   -0.883597   -0.517766    0.313867   -0.792099   -0.754488
股票3    0.394980    0.474116   -1.228562    2.327112    0.163310
股票4    1.711566    1.321751   -0.276375   -0.103749    0.801805
data.tail(5)  # show the last 5 rows; with no argument the default is 5, and tail(N) shows the last N rows
       2019-01-01  2019-01-02  2019-01-03  2019-01-04  2019-01-07
股票5    0.161961    1.234348    0.098909    0.397480   -0.284541
股票6    1.172185    1.576341   -0.587145    1.401272    0.197749
股票7    0.767794    1.441458   -1.361002    0.444641   -0.567963
股票8   -1.809429    1.896102   -0.370599   -0.959296    0.190999
股票9    0.536467   -0.192646   -1.616105    1.272087    0.615603

4.3 Setting the DataFrame index

  • Modifying row and column index values:
    Note: the following approach does not work

# Wrong: an Index is immutable, so assigning to a single element raises TypeError
data.index[3] = '股票_3'

The correct approach:

stock_code = ["股票_" + str(i) for i in range(stock_change.shape[0])]

# Replace the whole index at once
data.index = stock_code
# Result
data
        2019-01-01  2019-01-02  2019-01-03  2019-01-04  2019-01-07
股票_0   -0.781467   -0.298100    0.173171   -0.787273   -1.137411
股票_1   -1.647683    0.196674   -0.403814   -1.385474    1.031628
股票_2   -0.883597   -0.517766    0.313867   -0.792099   -0.754488
股票_3    0.394980    0.474116   -1.228562    2.327112    0.163310
股票_4    1.711566    1.321751   -0.276375   -0.103749    0.801805
股票_5    0.161961    1.234348    0.098909    0.397480   -0.284541
股票_6    1.172185    1.576341   -0.587145    1.401272    0.197749
股票_7    0.767794    1.441458   -1.361002    0.444641   -0.567963
股票_8   -1.809429    1.896102   -0.370599   -0.959296    0.190999
股票_9    0.536467   -0.192646   -1.616105    1.272087    0.615603
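
The failure mode can be reproduced on a tiny frame. The sketch below uses made-up labels (s0, s1, s2) rather than the stock data, and also shows rename() as one alternative for changing a single label; rename() is not used in the original walkthrough and is included here only as an assumption about what a reader might want.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3]}, index=["s0", "s1", "s2"])

# Index objects are immutable, so item assignment raises TypeError.
try:
    frame.index[1] = "s_1"
except TypeError as exc:
    error_message = str(exc)

# rename() returns a new DataFrame in which only the listed labels change.
renamed = frame.rename(index={"s1": "s_1"})

print(error_message)
print(renamed.index.tolist())
```

The original frame is left unchanged by rename(); only the returned copy carries the new label.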

Resetting the index

  • reset_index(drop=False)
    • Replaces the current index with a new default integer index
    • drop: defaults to False, which keeps the old index as a column; if True, the old index values are discarded
# Reset the index with drop=False
data.reset_index()
    index  2019-01-01 00:00:00  2019-01-02 00:00:00  2019-01-03 00:00:00  2019-01-04 00:00:00  2019-01-07 00:00:00
0  股票_0            -0.781467            -0.298100             0.173171            -0.787273            -1.137411
1  股票_1            -1.647683             0.196674            -0.403814            -1.385474             1.031628
2  股票_2            -0.883597            -0.517766             0.313867            -0.792099            -0.754488
3  股票_3             0.394980             0.474116            -1.228562             2.327112             0.163310
4  股票_4             1.711566             1.321751            -0.276375            -0.103749             0.801805
5  股票_5             0.161961             1.234348             0.098909             0.397480            -0.284541
6  股票_6             1.172185             1.576341            -0.587145             1.401272             0.197749
7  股票_7             0.767794             1.441458            -1.361002             0.444641            -0.567963
8  股票_8            -1.809429             1.896102            -0.370599            -0.959296             0.190999
9  股票_9             0.536467            -0.192646            -1.616105             1.272087             0.615603
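
To contrast the two settings of drop, here is a minimal sketch on a two-row frame with hypothetical labels and prices (not the stock data above):

```python
import pandas as pd

frame = pd.DataFrame({"price": [10.0, 11.5]}, index=["s0", "s1"])

# drop=False (default): the old index is kept as a column named "index".
kept = frame.reset_index()

# drop=True: the old index is discarded entirely.
dropped = frame.reset_index(drop=True)

print(kept.columns.tolist())
print(dropped.columns.tolist())
```

In both cases the new row index is the default 0..n-1 integer range.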
  • Setting a column as the new index
    • set_index(keys, drop=True)
      • keys: a column name or a list of column names
      • drop: boolean, default True; the column used as the new index is removed from the columns

Example of setting a new index:

  • 1. Create the data
df = pd.DataFrame({'month': [12, 3, 6, 9],
                    'year': [2013, 2014, 2014, 2014],
                    'sale':[55, 40, 84, 31]})
df
   month  year  sale
0     12  2013    55
1      3  2014    40
2      6  2014    84
3      9  2014    31
  • 2. Use the month column as the new index
df.set_index('month')
       year  sale
month
12     2013    55
3      2014    40
6      2014    84
9      2014    31
df.set_index(keys = ['year', 'month'])
            sale
year month
2013 12       55
2014 3        40
     6        84
     9        31
df.set_index(keys = ['year', 'month']).index
MultiIndex([(2013, 12),
            (2014,  3),
            (2014,  6),
            (2014,  9)],
           names=['year', 'month'])
  • Note: after this setting, the DataFrame has a MultiIndex.

4.4 MultiIndex and Panel

1. MultiIndex

A multi-level (hierarchical) index object.

  • Attributes of the index
    • names: the name of each level
    • levels: the unique values of each level
df.set_index(keys = ['year', 'month']).index.names
FrozenList(['year', 'month'])
df.set_index(keys = ['year', 'month']).index.levels
FrozenList([[2013, 2014], [3, 6, 9, 12]])
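
As a rough illustration of what the two levels are good for, the sketch below rebuilds the same small df and selects rows through the MultiIndex with loc. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"month": [12, 3, 6, 9],
                   "year": [2013, 2014, 2014, 2014],
                   "sale": [55, 40, 84, 31]})
indexed = df.set_index(["year", "month"])

# A label from the outer level selects every row for that year;
# the result is indexed by the remaining level (month).
sales_2014 = indexed.loc[2014]

# A full (year, month) tuple plus a column name selects a single value.
june_2014 = indexed.loc[(2014, 6), "sale"]

print(sales_2014)
print(june_2014)
```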

4.5 Series objects:

  • A Series has only a row index
df
   month  year  sale
0     12  2013    55
1      3  2014    40
2      6  2014    84
3      9  2014    31
type(df)
pandas.core.frame.DataFrame
ser = df['sale']
ser
0    55
1    40
2    84
3    31
Name: sale, dtype: int64
type(ser)
pandas.core.series.Series
ser.index
RangeIndex(start=0, stop=4, step=1)
ser.values
array([55, 40, 84, 31])

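1. Creating a Series:

Creating from existing data

  • Specify the values and use the default index
pd.Series(np.arange(10))
  • Specify the index
pd.Series([6.7, 5.6, 3, 10, 2], index=[1, 2, 3, 4, 5])

Creating from a dictionary

pd.Series({'red':100, 'blue':200, 'green': 500, 'yellow':1000})
# Create a Series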
pd.Series([5,6,7,8,9], index=[1,2,3,4,5])
1    5
2    6
3    7
4    8
5    9
dtype: int64
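
For the dictionary case, the keys become the index labels. A small sketch using the same color dictionary shows label lookup and filtering; the name colors is illustrative.

```python
import pandas as pd

colors = pd.Series({"red": 100, "blue": 200, "green": 500, "yellow": 1000})

# Dictionary keys become the index, in insertion order.
labels = colors.index.tolist()

# Look up one value by label, and filter values with a boolean condition.
green_value = colors["green"]
large = colors[colors > 150]

print(labels)
print(green_value)
print(large)
```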

II. Basic pandas operations:

To make these basic operations easier to follow, we load a real stock dataset. File I/O is covered later; for now we only use the API:

1. Reading data:

import pandas as pd
# Read the file
data = pd.read_csv("./stock_day/stock_day.csv")

# Drop some columns to keep the data simple for the operations that follow
data = data.drop(["ma5","ma10","ma20","v_ma5","v_ma10","v_ma20"], axis=1)
data
             open   high  close    low     volume  price_change  p_change  turnover
2018-02-27  23.53  25.88  24.16  23.53   95578.03          0.63      2.68      2.39
2018-02-26  22.80  23.78  23.53  22.80   60985.11          0.69      3.02      1.53
2018-02-23  22.88  23.37  22.82  22.71   52914.01          0.54      2.42      1.32
2018-02-22  22.25  22.76  22.28  22.02   36105.01          0.36      1.64      0.90
2018-02-14  21.49  21.99  21.92  21.48   23331.04          0.44      2.05      0.58
...           ...    ...    ...    ...        ...           ...       ...       ...
2015-03-06  13.17  14.48  14.28  13.13  179831.72          1.12      8.51      6.16
2015-03-05  12.88  13.45  13.16  12.87   93180.39          0.26      2.02      3.19
2015-03-04  12.80  12.92  12.90  12.61   67075.44          0.20      1.57      2.30
2015-03-03  12.52  13.06  12.70  12.52  139071.61          0.18      1.44      4.76
2015-03-02  12.25  12.67  12.52  12.20   96291.73          0.32      2.62      3.30

643 rows × 8 columns

data.columns
Index(['open', 'high', 'close', 'low', 'volume', 'price_change', 'p_change',
       'turnover'],
      dtype='object')
data.index
Index(['2018-02-27', '2018-02-26', '2018-02-23', '2018-02-22', '2018-02-14',
       '2018-02-13', '2018-02-12', '2018-02-09', '2018-02-08', '2018-02-07',
       ...
       '2015-03-13', '2015-03-12', '2015-03-11', '2015-03-10', '2015-03-09',
       '2015-03-06', '2015-03-05', '2015-03-04', '2015-03-03', '2015-03-02'],
      dtype='object', length=643)

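1.1 Indexing

In NumPy we already covered selecting elements by index and by slice. pandas supports similar operations, and also lets you use column names and row labels directly, or combine them.

1. Indexing directly with row and column labels (column first, then row):

data["close"]  # selecting by column name returns a Series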
2018-02-27    24.16
2018-02-26    23.53
2018-02-23    22.82
2018-02-22    22.28
2018-02-14    21.92
              ...  
2015-03-06    14.28
2015-03-05    13.16
2015-03-04    12.90
2015-03-03    12.70
2015-03-02    12.52
Name: close, Length: 643, dtype: float64
data.open  # attribute-style shorthand for data["open"]
2018-02-27    23.53
2018-02-26    22.80
2018-02-23    22.88
2018-02-22    22.25
2018-02-14    21.49
              ...  
2015-03-06    13.17
2015-03-05    12.88
2015-03-04    12.80
2015-03-03    12.52
2015-03-02    12.25
Name: open, Length: 643, dtype: float64
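data.open[0]  # get a single value by position (recent pandas versions deprecate integer positions on a labeled index; data.open.iloc[0] is the explicit form)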
23.53
data.open[:10]  # slicing returns a Series
2018-02-27    23.53
2018-02-26    22.80
2018-02-23    22.88
2018-02-22    22.25
2018-02-14    21.49
2018-02-13    21.40
2018-02-12    20.70
2018-02-09    21.20
2018-02-08    21.79
2018-02-07    22.69
Name: open, dtype: float64
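# Index with a list (or array) of column names
data[['close','open']].head()  # the result is still a DataFrame (two-dimensional)
            close   open
2018-02-27  24.16  23.53
2018-02-26  23.53  22.80
2018-02-23  22.82  22.88
2018-02-22  22.28  22.25
2018-02-14  21.92  21.49

2. Row first, then column

Use loc or iloc for indexing

  • iloc: indexes by integer position; also supports slicing
  • loc: indexes by label; also supports slicing
  • ix: mixed indexing that accepts both positions and labels (deprecated, and removed in pandas 1.0)
data.iloc[:2]  # first two rows
             open   high  close    low     volume  price_change  p_change  turnover
2018-02-27  23.53  25.88  24.16  23.53   95578.03          0.63      2.68      2.39
2018-02-26  22.80  23.78  23.53  22.80   60985.11          0.69      3.02      1.53
data.iloc[:2,:3]  # first two rows, first three columns
             open   high  close
2018-02-27  23.53  25.88  24.16
2018-02-26  22.80  23.78  23.53
data.iloc[:2,3]  # first two rows of the column at position 3 (the fourth column, low)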
2018-02-27    23.53
2018-02-26    22.80
Name: low, dtype: float64
data.iloc[-2]
open                12.52
high                13.06
close               12.70
low                 12.52
volume          139071.61
price_change         0.18
p_change             1.44
turnover             4.76
Name: 2015-03-03, dtype: float64
  • loc:
# Slicing by row and column labels with loc includes both endpoints
data.loc[:"2018-02-14", 'open':'close']
             open   high  close
2018-02-27  23.53  25.88  24.16
2018-02-26  22.80  23.78  23.53
2018-02-23  22.88  23.37  22.82
2018-02-22  22.25  22.76  22.28
2018-02-14  21.49  21.99  21.92
  • ix
data.ix[:4, 'open':'close']
/home/chengfei/miniconda3/envs/jupyter/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.
/home/chengfei/miniconda3/envs/jupyter/lib/python3.6/site-packages/pandas/core/indexing.py:822: FutureWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated
  retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
             open   high  close
2018-02-27  23.53  25.88  24.16
2018-02-26  22.80  23.78  23.53
2018-02-23  22.88  23.37  22.82
2018-02-22  22.25  22.76  22.28
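
Since ix is deprecated, the same mixed selection (rows by position, columns by label) can be written with loc or iloc alone. The sketch below is one possible way, assuming a small frame that copies a few rows of the stock data shown earlier; by_label and by_position are illustrative names.

```python
import pandas as pd

# Sample rows taken from the stock data shown earlier.
data = pd.DataFrame(
    {"open": [23.53, 22.80, 22.88, 22.25, 21.49],
     "high": [25.88, 23.78, 23.37, 22.76, 21.99],
     "close": [24.16, 23.53, 22.82, 22.28, 21.92]},
    index=["2018-02-27", "2018-02-26", "2018-02-23", "2018-02-22", "2018-02-14"],
)

# Option 1: turn row positions into labels, then use loc.
by_label = data.loc[data.index[:4], "open":"close"]

# Option 2: turn column labels into positions, then use iloc.
col_positions = data.columns.get_indexer(["open", "high", "close"])
by_position = data.iloc[:4, col_positions]

print(by_label)
print(by_label.equals(by_position))
```

Both options return the first four rows of the open, high, and close columns.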

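1.2 Assignment:

Reassign every value in the close column of the DataFrame to 1


# Modify the original values directly
data['close'] = 1
# Or
data.close = 1

1.3 Sorting:

There are two kinds of sorting: sorting by values and sorting by index

DataFrame:

  • Use df.sort_values(by=, ascending=) to sort by values
    • Sort by a single key or by several keys; ascending by default
    • ascending=False: descending
    • ascending=True: ascending
  • Use df.sort_index() to sort by index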

1. df.sort_index():

data.head()
             open   high  close    low     volume  price_change  p_change  turnover
2018-02-27  23.53  25.88  24.16  23.53   95578.03          0.63      2.68      2.39
2018-02-26  22.80  23.78  23.53  22.80   60985.11          0.69      3.02      1.53
2018-02-23  22.88  23.37  22.82  22.71   52914.01          0.54      2.42      1.32
2018-02-22  22.25  22.76  22.28  22.02   36105.01          0.36      1.64      0.90
2018-02-14  21.49  21.99  21.92  21.48   23331.04          0.44      2.05      0.58
data.head().sort_index()  # ascending by default; pass ascending=False for descending order
             open   high  close    low     volume  price_change  p_change  turnover
2018-02-14  21.49  21.99  21.92  21.48   23331.04          0.44      2.05      0.58
2018-02-22  22.25  22.76  22.28  22.02   36105.01          0.36      1.64      0.90
2018-02-23  22.88  23.37  22.82  22.71   52914.01          0.54      2.42      1.32
2018-02-26  22.80  23.78  23.53  22.80   60985.11          0.69      3.02      1.53
2018-02-27  23.53  25.88  24.16  23.53   95578.03          0.63      2.68      2.39
data.head().sort_index(ascending=False)
             open   high  close    low     volume  price_change  p_change  turnover
2018-02-27  23.53  25.88  24.16  23.53   95578.03          0.63      2.68      2.39
2018-02-26  22.80  23.78  23.53  22.80   60985.11          0.69      3.02      1.53
2018-02-23  22.88  23.37  22.82  22.71   52914.01          0.54      2.42      1.32
2018-02-22  22.25  22.76  22.28  22.02   36105.01          0.36      1.64      0.90
2018-02-14  21.49  21.99  21.92  21.48   23331.04          0.44      2.05      0.58
help(data.sort_values)
Help on method sort_values in module pandas.core.frame:

sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last') method of pandas.core.frame.DataFrame instance
    Sort by the values along either axis.
    
    Parameters
    ----------
            by : str or list of str
                Name or list of names to sort by.
    
                - if `axis` is 0 or `'index'` then `by` may contain index
                  levels and/or column labels
                - if `axis` is 1 or `'columns'` then `by` may contain column
                  levels and/or index labels
    
                .. versionchanged:: 0.23.0
                   Allow specifying index or column level names.
    axis : {0 or 'index', 1 or 'columns'}, default 0
         Axis to be sorted.
    ascending : bool or list of bool, default True
         Sort ascending vs. descending. Specify list for multiple sort
         orders.  If this is a list of bools, must match the length of
         the by.
    inplace : bool, default False
         If True, perform operation in-place.
    kind : {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'
         Choice of sorting algorithm. See also ndarray.np.sort for more
         information.  `mergesort` is the only stable algorithm. For
         DataFrames, this option is only applied when sorting on a single
         column or label.
    na_position : {'first', 'last'}, default 'last'
         Puts NaNs at the beginning if `first`; `last` puts NaNs at the
         end.
    
    Returns
    -------
    sorted_obj : DataFrame or None
        DataFrame with sorted values if inplace=False, None otherwise.
    
    Examples
    --------
    >>> df = pd.DataFrame({
    ...     'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    ...     'col2': [2, 1, 9, 8, 7, 4],
    ...     'col3': [0, 1, 9, 4, 2, 3],
    ... })
    >>> df
        col1 col2 col3
    0   A    2    0
    1   A    1    1
    2   B    9    9
    3   NaN  8    4
    4   D    7    2
    5   C    4    3
    
    Sort by col1
    
    >>> df.sort_values(by=['col1'])
        col1 col2 col3
    0   A    2    0
    1   A    1    1
    2   B    9    9
    5   C    4    3
    4   D    7    2
    3   NaN  8    4
    
    Sort by multiple columns
    
    >>> df.sort_values(by=['col1', 'col2'])
        col1 col2 col3
    1   A    1    1
    0   A    2    0
    2   B    9    9
    5   C    4    3
    4   D    7    2
    3   NaN  8    4
    
    Sort Descending
    
    >>> df.sort_values(by='col1', ascending=False)
        col1 col2 col3
    4   D    7    2
    5   C    4    3
    2   B    9    9
    0   A    2    0
    1   A    1    1
    3   NaN  8    4
    
    Putting NAs first
    
    >>> df.sort_values(by='col1', ascending=False, na_position='first')
        col1 col2 col3
    3   NaN  8    4
    4   D    7    2
    5   C    4    3
    2   B    9    9
    0   A    2    0
    1   A    1    1

2. df.sort_values()

data.head(10).sort_values(by="close",ascending=False)  # sort by close in descending order
             open   high  close    low     volume  price_change  p_change  turnover
2018-02-27  23.53  25.88  24.16  23.53   95578.03          0.63      2.68      2.39
2018-02-26  22.80  23.78  23.53  22.80   60985.11          0.69      3.02      1.53
2018-02-23  22.88  23.37  22.82  22.71   52914.01          0.54      2.42      1.32
2018-02-22  22.25  22.76  22.28  22.02   36105.01          0.36      1.64      0.90
2018-02-14  21.49  21.99  21.92  21.48   23331.04          0.44      2.05      0.58
2018-02-08  21.79  22.09  21.88  21.75   27068.16          0.09      0.41      0.68
2018-02-07  22.69  23.11  21.80  21.29   53853.25         -0.50     -2.24      1.35
2018-02-13  21.40  21.90  21.48  21.31   30802.45          0.28      1.32      0.77
2018-02-12  20.70  21.40  21.19  20.63   32445.39          0.82      4.03      0.81
2018-02-09  21.20  21.46  20.36  20.19   54304.01         -1.50     -6.86      1.36
data.head(10).sort_values(by=["close","open"],ascending=False)  # close takes priority over open
             open   high  close    low     volume  price_change  p_change  turnover
2018-02-27  23.53  25.88  24.16  23.53   95578.03          0.63      2.68      2.39
2018-02-26  22.80  23.78  23.53  22.80   60985.11          0.69      3.02      1.53
2018-02-23  22.88  23.37  22.82  22.71   52914.01          0.54      2.42      1.32
2018-02-22  22.25  22.76  22.28  22.02   36105.01          0.36      1.64      0.90
2018-02-14  21.49  21.99  21.92  21.48   23331.04          0.44      2.05      0.58
2018-02-08  21.79  22.09  21.88  21.75   27068.16          0.09      0.41      0.68
2018-02-07  22.69  23.11  21.80  21.29   53853.25         -0.50     -2.24      1.35
2018-02-13  21.40  21.90  21.48  21.31   30802.45          0.28      1.32      0.77
2018-02-12  20.70  21.40  21.19  20.63   32445.39          0.82      4.03      0.81
2018-02-09  21.20  21.46  20.36  20.19   54304.01         -1.50     -6.86      1.36
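
In the two results above the secondary key has no visible effect, because no two rows share the same close value. A small sketch with made-up numbers that do contain ties shows how the second key, and a per-key ascending list, come into play:

```python
import pandas as pd

# Hypothetical values chosen so that close contains ties.
frame = pd.DataFrame({"close": [22.0, 21.0, 22.0, 21.0],
                      "open": [21.5, 20.0, 22.5, 20.5]},
                     index=["a", "b", "c", "d"])

# close descending; rows with equal close are ordered by open ascending.
ordered = frame.sort_values(by=["close", "open"], ascending=[False, True])

print(ordered)
```

Rows a and c (close 22.0) come first, ordered by open, followed by b and d (close 21.0).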

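III. DataFrame operations:

  • Arithmetic operators;
  • Methods provided by pandas;

1. Arithmetic operations

  • DataFrame.add(other): addition, for example adding a number to every element
  • DataFrame.sub(other): subtraction
  • DataFrame.mul(other): multiplication
  • DataFrame.div(other): division
  • DataFrame.truediv(other): true (floating-point) division
  • DataFrame.floordiv(other): floor (integer) division
  • DataFrame.mod(other): modulo
  • DataFrame.pow(other): exponentiation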
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.arange(16).reshape(4,4), index = list("ABCD"))
df
    0   1   2   3
A   0   1   2   3
B   4   5   6   7
C   8   9  10  11
D  12  13  14  15
df + 1
    0   1   2   3
A   1   2   3   4
B   5   6   7   8
C   9  10  11  12
D  13  14  15  16
df.add(1)
    0   1   2   3
A   1   2   3   4
B   5   6   7   8
C   9  10  11  12
D  13  14  15  16
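
The same operator and method forms also work between two columns, not only with a scalar. A minimal sketch, assuming a two-row frame with open and close values taken from the stock data:

```python
import pandas as pd

prices = pd.DataFrame({"open": [23.53, 22.80], "close": [24.16, 23.53]},
                      index=["2018-02-27", "2018-02-26"])

# The operator form and the method form give the same element-wise result.
diff_operator = prices["close"] - prices["open"]
diff_method = prices["close"].sub(prices["open"])

print(diff_method.round(2))
```

Values are matched by index label, so each row's open is subtracted from the same row's close.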

2. Logical operations:

  • Conditional tests;
  • Boolean indexing;
  • Boolean assignment;

2.1 Conditional tests:

df>10
       0      1      2      3
A  False  False  False  False
B  False  False  False  False
C  False  False  False   True
D   True   True   True   True

2.2 Boolean indexing

df[df>10]  # elements that fail the condition are filled with NaN
      0     1     2     3
A   NaN   NaN   NaN   NaN
B   NaN   NaN   NaN   NaN
C   NaN   NaN   NaN  11.0
D  12.0  13.0  14.0  15.0

2.3 Boolean assignment

df[df>10] = 1000
df
      0     1     2     3
A     0     1     2     3
B     4     5     6     7
C     8     9    10  1000
D  1000  1000  1000  1000
data.head()
             open   high  close    low     volume  price_change  p_change  turnover
2018-02-27  23.53  25.88  24.16  23.53   95578.03          0.63      2.68      2.39
2018-02-26  22.80  23.78  23.53  22.80   60985.11          0.69      3.02      1.53
2018-02-23  22.88  23.37  22.82  22.71   52914.01          0.54      2.42      1.32
2018-02-22  22.25  22.76  22.28  22.02   36105.01          0.36      1.64      0.90
2018-02-14  21.49  21.99  21.92  21.48   23331.04          0.44      2.05      0.58
data = data.astype('float64')  # convert the data to float64
data[(data.close > 21.5) & (data.close < 23) ].head(10)
             open   high  close    low     volume  price_change  p_change  turnover
2018-02-23  22.88  23.37  22.82  22.71   52914.01          0.54      2.42      1.32
2018-02-22  22.25  22.76  22.28  22.02   36105.01          0.36      1.64      0.90
2018-02-14  21.49  21.99  21.92  21.48   23331.04          0.44      2.05      0.58
2018-02-08  21.79  22.09  21.88  21.75   27068.16          0.09      0.41      0.68
2018-02-07  22.69  23.11  21.80  21.29   53853.25         -0.50     -2.24      1.35
2018-02-06  22.80  23.55  22.29  22.20   55555.00         -0.97     -4.17      1.39
2018-02-02  22.40  22.70  22.62  21.53   33242.11          0.20      0.89      0.83
2018-02-01  23.71  23.86  22.42  22.22   66414.64         -1.30     -5.48      1.66
2018-01-03  22.42  22.83  22.79  22.18   74687.10          0.38      1.70      1.87
2018-01-02  22.30  22.54  22.42  22.05   42677.76          0.12      0.54      1.07
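
When conditions are combined, each comparison needs its own parentheses, because & and | bind more tightly than > and <. The sketch below applies the element-wise operators &, | and ~ to a short Series of close values from the stock data; the variable names are illustrative.

```python
import pandas as pd

close = pd.Series([24.16, 23.53, 22.82, 22.28, 21.92, 21.48])

in_range = (close > 21.5) & (close < 23)   # element-wise AND
out_of_range = ~in_range                   # element-wise NOT
extreme = (close < 21.5) | (close > 24)    # element-wise OR

print(close[in_range].tolist())
print(close[extreme].tolist())
```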

2.4 Logical functions:

  • query(expr)
    - expr: the query string
    query makes the filtering shown above more concise
data.query("p_change > 2 & turnover > 15")
  • isin(values)
    • Tests whether each element is contained in values
data.query('close>21.5 & open < 23' ).head()
             open   high  close    low     volume  price_change  p_change  turnover
2018-02-26  22.80  23.78  23.53  22.80   60985.11          0.69      3.02      1.53
2018-02-23  22.88  23.37  22.82  22.71   52914.01          0.54      2.42      1.32
2018-02-22  22.25  22.76  22.28  22.02   36105.01          0.36      1.64      0.90
2018-02-14  21.49  21.99  21.92  21.48   23331.04          0.44      2.05      0.58
2018-02-08  21.79  22.09  21.88  21.75   27068.16          0.09      0.41      0.68
data.close.isin([23.53,21.92]).head(10)
2018-02-27    False
2018-02-26     True
2018-02-23    False
2018-02-22    False
2018-02-14     True
2018-02-13    False
2018-02-12    False
2018-02-09    False
2018-02-08    False
2018-02-07    False
Name: close, dtype: bool
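
The boolean Series returned by isin is usually used as a row filter, and a query string is equivalent to the corresponding boolean mask. A short sketch, assuming a five-row frame copied from the stock data:

```python
import pandas as pd

data = pd.DataFrame(
    {"open": [23.53, 22.80, 22.88, 22.25, 21.49],
     "close": [24.16, 23.53, 22.82, 22.28, 21.92]},
    index=["2018-02-27", "2018-02-26", "2018-02-23", "2018-02-22", "2018-02-14"],
)

# Use the isin() mask to keep only the matching rows.
matched = data[data["close"].isin([23.53, 21.92])]

# query() and an explicit boolean mask select the same rows.
via_query = data.query("close > 21.5 & open < 23")
via_mask = data[(data["close"] > 21.5) & (data["open"] < 23)]

print(matched)
print(via_query.equals(via_mask))
```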

3. Statistical operations:

3.1 describe()

Summary analysis: produces many statistics at once, including count, mean, std, min, and max

# Compute the mean, standard deviation, maximum, and minimum
data.describe()
             open        high       close         low         volume  price_change    p_change    turnover
count  643.000000  643.000000  643.000000  643.000000     643.000000    643.000000  643.000000  643.000000
mean    21.272706   21.900513   21.336267   20.771835   99905.519114      0.018802    0.190280    2.936190
std      3.930973    4.077578    3.942806    3.791968   73879.119354      0.898476    4.079698    2.079375
min     12.250000   12.670000   12.360000   12.200000    1158.120000     -3.520000  -10.030000    0.040000
25%     19.000000   19.500000   19.045000   18.525000   48533.210000     -0.390000   -1.850000    1.360000
50%     21.440000   21.970000   21.450000   20.980000   83175.930000      0.050000    0.260000    2.500000
75%     23.400000   24.065000   23.415000   22.850000  127580.055000      0.455000    2.305000    3.915000
max     34.990000   36.350000   35.210000   34.010000  501915.410000      3.030000   10.030000   12.560000

3.2 Statistical functions

These were covered in detail for NumPy. Here we demonstrate min, max, mean, median, var (variance), and std (standard deviation).

function  description
count     Number of non-NA observations
sum       Sum of values
mean      Mean of values
median    Arithmetic median of values
min       Minimum
max       Maximum
mode      Mode
abs       Absolute value
prod      Product of values
std       Bessel-corrected sample standard deviation
var       Unbiased variance
idxmax    Index label of the maximum value
idxmin    Index label of the minimum value
data.max()  # by default, takes the maximum of each column
open                34.99
high                36.35
close               35.21
low                 34.01
volume          501915.41
price_change         3.03
p_change            10.03
turnover            12.56
dtype: float64
data.max(axis=1).head(10)
2018-02-27    95578.03
2018-02-26    60985.11
2018-02-23    52914.01
2018-02-22    36105.01
2018-02-14    23331.04
2018-02-13    30802.45
2018-02-12    32445.39
2018-02-09    54304.01
2018-02-08    27068.16
2018-02-07    53853.25
dtype: float64
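
The axis argument and the idxmax/idxmin functions from the table can be tried on a small frame. This sketch assumes three rows copied from the stock data, restricted to the close and volume columns:

```python
import pandas as pd

data = pd.DataFrame(
    {"close": [24.16, 23.53, 22.82],
     "volume": [95578.03, 60985.11, 52914.01]},
    index=["2018-02-27", "2018-02-26", "2018-02-23"],
)

col_max = data.max()         # axis=0 (default): one value per column
row_max = data.max(axis=1)   # axis=1: one value per row

# idxmax / idxmin return the index label where the extreme value occurs.
peak_day = data["close"].idxmax()
low_day = data["close"].idxmin()

print(col_max)
print(row_max)
print(peak_day, low_day)
```

With axis=1 the result is dominated by volume, because it is by far the largest number in every row, which is also what the output above shows.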

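3.4 Cumulative statistical functions

function  description
cumsum    Sum of the first 1/2/3/…/n values
cummax    Maximum of the first 1/2/3/…/n values
cummin    Minimum of the first 1/2/3/…/n values
cumprod   Product of the first 1/2/3/…/n values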
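
All four functions can be compared side by side on a short Series. The numbers below are arbitrary and serve only as an illustration:

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5])

running_sum = s.cumsum()
running_max = s.cummax()
running_min = s.cummin()
running_prod = s.cumprod()

print(running_sum.tolist())   # [3, 4, 8, 9, 14]
print(running_max.tolist())   # [3, 3, 4, 4, 5]
print(running_min.tolist())   # [3, 1, 1, 1, 1]
print(running_prod.tolist())  # [3, 3, 12, 12, 60]
```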

1. Cumulative sum:

data.head()
             open   high  close    low     volume  price_change  p_change  turnover
2018-02-27  23.53  25.88  24.16  23.53   95578.03          0.63      2.68      2.39
2018-02-26  22.80  23.78  23.53  22.80   60985.11          0.69      3.02      1.53
2018-02-23  22.88  23.37  22.82  22.71   52914.01          0.54      2.42      1.32
2018-02-22  22.25  22.76  22.28  22.02   36105.01          0.36      1.64      0.90
2018-02-14  21.49  21.99  21.92  21.48   23331.04          0.44      2.05      0.58
data.cumsum().head()  # cumulative sum
              open    high   close     low     volume  price_change  p_change  turnover
2018-02-27   23.53   25.88   24.16   23.53   95578.03          0.63      2.68      2.39
2018-02-26   46.33   49.66   47.69   46.33  156563.14          1.32      5.70      3.92
2018-02-23   69.21   73.03   70.51   69.04  209477.15          1.86      8.12      5.24
2018-02-22   91.46   95.79   92.79   91.06  245582.16          2.22      9.76      6.14
2018-02-14  112.95  117.78  114.71  112.54  268913.20          2.66     11.81      6.72
data = pd.read_csv('./stock_day.csv')
data
             open   high  close    low     volume  price_change  p_change     ma5    ma10    ma20      v_ma5     v_ma10     v_ma20  turnover
2018-02-27  23.53  25.88  24.16  23.53   95578.03          0.63      2.68  22.942  22.142  22.875   53782.64   46738.65   55576.11      2.39
2018-02-26  22.80  23.78  23.53  22.80   60985.11          0.69      3.02  22.406  21.955  22.942   40827.52   42736.34   56007.50      1.53
2018-02-23  22.88  23.37  22.82  22.71   52914.01          0.54      2.42  21.938  21.929  23.022   35119.58   41871.97   56372.85      1.32
2018-02-22  22.25  22.76  22.28  22.02   36105.01          0.36      1.64  21.446  21.909  23.137   35397.58   39904.78   60149.60      0.90
2018-02-14  21.49  21.99  21.92  21.48   23331.04          0.44      2.05  21.366  21.923  23.253   33590.21   42935.74   61716.11      0.58
...           ...    ...    ...    ...        ...           ...       ...     ...     ...     ...        ...        ...        ...       ...
2015-03-06  13.17  14.48  14.28  13.13  179831.72          1.12      8.51  13.112  13.112  13.112  115090.18  115090.18  115090.18      6.16
2015-03-05  12.88  13.45  13.16  12.87   93180.39          0.26      2.02  12.820  12.820  12.820   98904.79   98904.79   98904.79      3.19
2015-03-04  12.80  12.92  12.90  12.61   67075.44          0.20      1.57  12.707  12.707  12.707  100812.93  100812.93  100812.93      2.30
2015-03-03  12.52  13.06  12.70  12.52  139071.61          0.18      1.44  12.610  12.610  12.610  117681.67  117681.67  117681.67      4.76
2015-03-02  12.25  12.67  12.52  12.20   96291.73          0.32      2.62  12.520  12.520  12.520   96291.73   96291.73   96291.73      3.30

643 rows × 14 columns

data.price_change.sort_index().cumsum()  # sort by date index in ascending order, then take the cumulative sum
2015-03-02     0.32
2015-03-03     0.50
2015-03-04     0.70
2015-03-05     0.96
2015-03-06     2.08
              ...  
2018-02-14     9.87
2018-02-22    10.23
2018-02-23    10.77
2018-02-26    11.46
2018-02-27    12.09
Name: price_change, Length: 643, dtype: float64
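# Plotting (a simple application)
import matplotlib.pyplot as plt
data.price_change.sort_index().cumsum().plot()
plt.show()

3.5 Custom operations

  • apply(func, axis=0)
    • func: a user-defined function
    • axis=0: the default, applies func to each column; axis=1 applies it to each row
  • Define a function that computes the maximum minus the minimum of each column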
data[['open', 'close']].apply(lambda x: x.max() - x.min(), axis=0)

open     22.74
close    22.85
dtype: float64
# Compute the range (maximum minus minimum) of every column
data.apply(lambda x:x.max() - x.min(), axis=0)
open                22.740
high                23.680
close               22.850
low                 21.810
volume          500757.290
price_change         6.550
p_change            20.060
ma5                 21.176
ma10                19.666
ma20                17.478
v_ma5           393638.800
v_ma10          340897.650
v_ma20          245969.790
turnover            12.520
dtype: float64
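
The examples above all use axis=0. For comparison, the sketch below applies a function along each axis of a two-row frame (high and low values taken from the stock data); with axis=1 the function receives one row at a time.

```python
import pandas as pd

data = pd.DataFrame({"high": [25.88, 23.78], "low": [23.53, 22.80]},
                    index=["2018-02-27", "2018-02-26"])

# axis=0: the function receives each column as a Series.
col_range = data.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives each row as a Series, indexed by column name.
row_range = data.apply(lambda row: row["high"] - row["low"], axis=1)

print(col_range.round(2))
print(row_range.round(2))
```

For simple arithmetic like this, data["high"] - data["low"] gives the same per-row result without apply; apply is more useful when the function cannot be expressed as a vectorized operation.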

IV. Plotting with pandas:

1. pandas.DataFrame.plot

  • DataFrame.plot(x=None, y=None, kind='line')

    • x : label or position, default None
    • y : label, position or list of label, positions, default None
      • Allows plotting of one column versus another
    • kind : str
      • 'line' : line plot (default)
      • 'bar' : vertical bar plot
      • 'barh' : horizontal bar plot
      • 'hist' : histogram
      • 'pie' : pie plot
      • 'scatter' : scatter plot
        More parameter details: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html?highlight=plot#pandas.DataFrame.plot
ret = data[['high', 'low']]
ret.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7efd64428da0>


ret[:10].plot(kind='bar')  # bar chart
plt.show()


data.price_change.plot(kind='hist', figsize=(20,10))  # histogram; the values are approximately normally distributed
plt.show()
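
The x, y and kind parameters can be tried out on a small frame. This is a sketch under stated assumptions: the four rows are made up, the Agg backend is chosen only so the script runs without a display, and the ax argument (a real DataFrame.plot parameter not listed above) places each plot on its own subplot.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample of daily highs and lows
df = pd.DataFrame(
    {"high": [25.9, 23.8, 23.4, 22.8], "low": [23.5, 22.8, 22.7, 22.0]}
)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Default kind='line': one line per column
df.plot(ax=axes[0])

# kind='scatter' needs both x and y: one column plotted against another
df.plot(x="low", y="high", kind="scatter", ax=axes[1])

# Series.plot accepts the same kind values
df["high"].plot(kind="hist", bins=4, ax=axes[2])

fig.tight_layout()
```

In an interactive session, plt.show() would display the figure as in the examples above.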


2.pandas.Series.plot

More parameter details: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.plot.html?highlight=plot#pandas.Series.plot

import pandas as pd
import matplotlib.pyplot as plt
pd.plotting.scatter_matrix(data,figsize=(20,10))
plt.show()


pd.plotting.scatter_matrix(data.iloc[:,:10],figsize=(20,10))  # all rows, first 10 columns
plt.show()


V. Reading and writing files:

Most data lives in files, so pandas supports a rich set of I/O operations. The pandas API handles many file formats, such as CSV, SQL, XLS, JSON, and HDF5.

  • Note: HDF5 and CSV files are the most commonly used

Format Type  Data Description      Reader          Writer
text         CSV                   read_csv        to_csv
text         JSON                  read_json       to_json
text         HTML                  read_html       to_html
text         local clipboard       read_clipboard  to_clipboard
binary       MS Excel              read_excel      to_excel
binary       HDF5 Format           read_hdf        to_hdf
binary       Feather Format        read_feather    to_feather
binary       Parquet Format        read_parquet    to_parquet
binary       Msgpack               read_msgpack    to_msgpack
binary       Stata                 read_stata      to_stata
binary       SAS                   read_sas
binary       Python Pickle Format  read_pickle     to_pickle
SQL          SQL                   read_sql        to_sql
SQL          Google Big Query      read_gbq        to_gbq

  • Note: read_msgpack and to_msgpack were removed in pandas 1.0

1.CSV

1.1 Reading a CSV file - read_csv

  • pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None)
    • filepath_or_buffer: file path
    • usecols: the column names to read, given as a list
import pandas as pd
data = pd.read_csv("./stock_day/stock_day.csv", usecols=['open', 'high', 'close','low'])
data.head(10)
             open   high  close    low
2018-02-27  23.53  25.88  24.16  23.53
2018-02-26  22.80  23.78  23.53  22.80
2018-02-23  22.88  23.37  22.82  22.71
2018-02-22  22.25  22.76  22.28  22.02
2018-02-14  21.49  21.99  21.92  21.48
2018-02-13  21.40  21.90  21.48  21.31
2018-02-12  20.70  21.40  21.19  20.63
2018-02-09  21.20  21.46  20.36  20.19
2018-02-08  21.79  22.09  21.88  21.75
2018-02-07  22.69  23.11  21.80  21.29
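
The effect of usecols can be checked without the stock file by reading CSV text from memory. A minimal sketch: the column name date, the volume figures and the use of io.StringIO are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content; the volume values are made up
csv_text = """date,open,high,close,low,volume
2018-02-27,23.53,25.88,24.16,23.53,100
2018-02-26,22.80,23.78,23.53,22.80,200
"""

# Only the listed columns are parsed; index_col turns one of them into the index
df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["date", "open", "high"],
    index_col="date",
)

print(df)
```

Restricting the columns at read time avoids loading data that is dropped right afterwards, which matters for wide files.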

1.2 Writing a CSV file - to_csv

  • DataFrame.to_csv(path_or_buf=None, sep=',', columns=None, header=True, index=True, index_label=None, mode='w', encoding=None)

    • path_or_buf : string or file handle, default None
    • sep : character, default ','
    • columns : sequence, optional
    • mode: 'w' overwrites the file, 'a' appends to it
    • index: whether to write the row index
    • header : boolean or list of string, default True; whether to write the column names
  • Series.to_csv(path=None, index=True, sep=',', na_rep='', float_format=None, header=False, index_label=None, mode='w', encoding=None, compression=None, date_format=None, decimal='.') (signature from an older pandas release; newer releases align it with DataFrame.to_csv)

Write Series to a comma-separated values (csv) file

ret.head().to_csv("./test.csv")
ret = pd.read_csv("./test.csv")
ret
   Unnamed: 0   high    low
0  2018-02-27  25.88  23.53
1  2018-02-26  23.78  22.80
2  2018-02-23  23.37  22.71
3  2018-02-22  22.76  22.02
4  2018-02-14  21.99  21.48

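Notice that the index was written to the file and came back as a separate column. To get rid of it, specify the index parameter (index=False), delete the original file, and save again.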

ret.set_index("Unnamed: 0")
             high    low
Unnamed: 0
2018-02-27  25.88  23.53
2018-02-26  23.78  22.80
2018-02-23  23.37  22.71
2018-02-22  22.76  22.02
2018-02-14  21.99  21.48
# index=False: the index values are not written out as a column
ret.head().to_csv("./test.csv", columns=['high'], index=False)
pd.read_csv("./test.csv")
    high
0  25.88
1  23.78
2  23.37
3  22.76
4  21.99
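
The two ways of handling the index can be compared side by side. This sketch relies on to_csv returning the CSV text when no path is given; the two-row frame is a hypothetical excerpt.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt with dates as the index
df = pd.DataFrame(
    {"high": [25.88, 23.78], "low": [23.53, 22.80]},
    index=["2018-02-27", "2018-02-26"],
)

# With path_or_buf=None, to_csv returns the CSV text instead of writing a file
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index)
print(without_index)

# Option 1: keep the index in the file and restore it on read with index_col=0
restored = pd.read_csv(io.StringIO(with_index), index_col=0)
print(restored)
```

Use index=False when the index is just a row counter; keep the index and read it back with index_col=0 when it carries information such as dates.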
  • Specify append mode
stock_day[:10].to_csv("./test.csv", mode='a')
import pandas as pd
ret = pd.read_csv("./stock_day/stock_day.csv", usecols=['open', 'high', 'close','low'])
ret.head().to_csv("./test.csv", mode='a')
ret = pd.read_csv("./test.csv")
ret.set_index("Unnamed: 0")  # returns a new DataFrame; ret itself is unchanged
ret
   Unnamed: 0   open   high  close    low
0  2018-02-27  23.53  25.88  24.16  23.53
1  2018-02-26  22.80  23.78  23.53  22.80
2  2018-02-23  22.88  23.37  22.82  22.71
3  2018-02-22  22.25  22.76  22.28  22.02
4  2018-02-14  21.49  21.99  21.92  21.48

The column names were written a second time. So when appending data, be sure to leave out the column names by specifying header=False.

import pandas as pd
ret = pd.read_csv("./stock_day/stock_day.csv", usecols=['open', 'high', 'close','low'])
ret.head().to_csv("./test.csv", mode='a',header=False)
ret = pd.read_csv("./test.csv",index_col=0)
ret
             open   high  close    low
2018-02-27  23.53  25.88  24.16  23.53
2018-02-26  22.80  23.78  23.53  22.80
2018-02-23  22.88  23.37  22.82  22.71
2018-02-22  22.25  22.76  22.28  22.02
2018-02-14  21.49  21.99  21.92  21.48
2018-02-27  23.53  25.88  24.16  23.53
2018-02-26  22.80  23.78  23.53  22.80
2018-02-23  22.88  23.37  22.82  22.71
2018-02-22  22.25  22.76  22.28  22.02
2018-02-14  21.49  21.99  21.92  21.48
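
The difference between appending with and without the header can be reproduced in a scratch directory. A minimal sketch: the file names and the two-row frame are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row excerpt
df = pd.DataFrame(
    {"high": [25.88, 23.78], "low": [23.53, 22.80]},
    index=["2018-02-27", "2018-02-26"],
)

tmp_dir = tempfile.mkdtemp()  # scratch directory for the two test files

# Correct: the first write includes the header, the append writes rows only
good_path = os.path.join(tmp_dir, "good.csv")
df.to_csv(good_path)
df.to_csv(good_path, mode="a", header=False)
good = pd.read_csv(good_path, index_col=0)

# Problematic: appending with the default header=True repeats the header line
bad_path = os.path.join(tmp_dir, "bad.csv")
df.to_csv(bad_path)
df.to_csv(bad_path, mode="a")
bad = pd.read_csv(bad_path, index_col=0)

print(good)
print(bad)  # contains a stray "high  low" row, so the columns are no longer numeric
```

The repeated header line is read as an ordinary data row, which silently turns the numeric columns into text.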

1.3 Reading a remote CSV

Specify names, i.e. the column names

names = [f"Column {x}" for x in range(1,12)]
pd.read_csv("url",names = names)
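
read_csv accepts a URL in place of a file path. The names parameter matters when the file has no header row, which the following sketch simulates with in-memory text instead of a network request (the data and the col_ prefix are hypothetical):

```python
import io

import pandas as pd

# Hypothetical header-less CSV content, standing in for a remote file
raw = "1,2,3\n4,5,6\n"

names = [f"col_{i}" for i in range(1, 4)]

# With names, every line is treated as data and the columns get the given labels
with_names = pd.read_csv(io.StringIO(raw), names=names)

# Without names, the first data line is consumed as the header
without_names = pd.read_csv(io.StringIO(raw))

print(with_names)
print(without_names)
```

When names is omitted for a header-less file, one row of data is lost to the header, so it is worth checking the row count after loading.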

2.HDF5

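Extension:
Prefer HDF5 for file storage

  • HDF5 supports compression when storing data. blosc is a fast compressor that pandas supports; compression is switched on through the complevel and complib parameters of to_hdf
  • Compression improves disk utilization and saves space
  • HDF5 is also cross-platform and can be migrated to Hadoop easily

2.1 read_hdf and to_hdf

Reading and writing an HDF5 file requires a key; the value stored under that key is the DataFrame

  • pandas.read_hdf(path_or_buf, key=None, **kwargs)

Reads data from an h5 file

    • path_or_buf: file path
    • key: the key to read
    • mode: the mode in which to open the file
    • return: the selected object
  • DataFrame.to_hdf(path_or_buf, key, **kwargs)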
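# Read data from an HDF5 file
hdf_data = pd.read_hdf("./stock_data/day/day_close.h5")
ret = hdf_data.iloc[:10,:10]
# Write to HDF5; a key name must be given when storing
ret.to_hdf("./test.h5", key="close_10")
# An h5 file cannot be opened directly (it is a binary format)
# When reading it back, specify the key name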
ret = pd.read_hdf("./test.h5", key="close_10")
ret
   000001.SZ  000002.SZ  000004.SZ  000005.SZ  000006.SZ  000007.SZ  000008.SZ  000009.SZ  000010.SZ  000011.SZ
0      16.30      17.71       4.58       2.88      14.60       2.62       4.96       4.66       5.37       6.02
1      17.02      19.20       4.65       3.02      15.97       2.65       4.95       4.70       5.37       6.27
2      17.02      17.28       4.56       3.06      14.37       2.63       4.82       4.47       5.37       5.96
3      16.18      16.97       4.49       2.95      13.10       2.73       4.89       4.33       5.37       5.77
4      16.95      17.19       4.55       2.99      13.18       2.77       4.97       4.42       5.37       5.92
5      17.76      17.30       4.78       3.10      13.70       3.01       5.17       4.63       5.37       6.22
6      18.10      16.93       4.98       3.16      13.48       3.31       5.69       4.78       5.37       6.48
7      17.71      17.93       4.91       3.25      13.89       3.25       5.98       4.88       5.37       6.57
8      17.40      17.65       4.95       3.20      13.89       3.01       5.58       4.84       5.37       6.25
9      18.27      18.58       4.95       3.23      13.97       3.05       5.76       4.94       5.37       6.56
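
Keys and compression can be tried together in one round trip. This is a sketch under stated assumptions: pandas needs the optional PyTables package (module name tables) for HDF5, so the example skips itself when it is missing; the file name, the key names and the two-row frame are hypothetical.

```python
import importlib.util
import os
import tempfile

import pandas as pd

# Hypothetical two-row excerpt of closing prices
df = pd.DataFrame({"000001.SZ": [16.30, 17.02], "000002.SZ": [17.71, 19.20]})
path = os.path.join(tempfile.mkdtemp(), "test.h5")  # scratch file

restored = None
if importlib.util.find_spec("tables") is None:
    # pandas relies on the optional PyTables package for HDF5 support
    print("PyTables is not installed; skipping the HDF5 round trip")
else:
    # One file can hold several DataFrames, each under its own key.
    # Compression is only applied when complevel is set above 0.
    df.to_hdf(path, key="close", complevel=5, complib="blosc")
    (df * 2).to_hdf(path, key="close_doubled", complevel=5, complib="blosc")

    # With more than one key in the file, the key must be given when reading
    restored = pd.read_hdf(path, key="close")

    with pd.HDFStore(path, mode="r") as store:
        print(store.keys())
    print(restored)
```

Because to_hdf opens the file in append mode by default, the second call adds a key rather than replacing the file.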

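3.Reading Excel files:

Library: xlrd (recent xlrd releases read only .xls; pandas uses openpyxl for .xlsx)
File extensions: xls, xlsx

3.1 Reading an Excel file: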

ex_data = pd.read_excel("./scores.xlsx")
ex_data
    Unnamed: 0  一本分数线  Unnamed: 2  二本分数线  Unnamed: 4
0          NaN     文科         理科     文科         理科
1       2018.0    576         532    488         432
2       2017.0    555         537    468         439
3       2016.0    583         548    532         494
4       2015.0    579         548    527         495
5       2014.0    565         543    507         495
6       2013.0    549         550    494         505
7       2012.0    495         477    446         433
8       2011.0    524         484    481         435
9       2010.0    524         494    474         441
10      2009.0    532         501    489         459
11      2008.0    515         502    472         455
12      2007.0    528         531    489         478
13      2006.0    516         528    476         476
# With index_col=0 and a two-row header, the Unnamed columns no longer appear in the output
ex_data = pd.read_excel("./scores.xlsx", header=[0,1],index_col=0)
ex_data
      一本分数线        二本分数线
        文科   理科     文科   理科
2018   576  532    488  432
2017   555  537    468  439
2016   583  548    532  494
2015   579  548    527  495
2014   565  543    507  495
2013   549  550    494  505
2012   495  477    446  433
2011   524  484    481  435
2010   524  494    474  441
2009   532  501    489  459
2008   515  502    472  455
2007   528  531    489  478
2006   516  528    476  476
ex_data.一本分数线
       文科   理科
2018  576  532
2017  555  537
2016  583  548
2015  579  548
2014  565  543
2013  549  550
2012  495  477
2011  524  484
2010  524  494
2009  532  501
2008  515  502
2007  528  531
2006  516  528
ex_data.一本分数线.to_excel("./test.xls")
ex_data2 = pd.read_excel("./test.xls",index_col=0)
ex_data2
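
Reading with header=[0,1] produces two-level (MultiIndex) columns, which is why ex_data.一本分数线 returns a sub-table. The sketch below rebuilds that structure without an Excel file; the English labels tier1/tier2 and arts/science are stand-ins chosen for illustration, corresponding to 一本分数线/二本分数线 and 文科/理科.

```python
import pandas as pd

# Two-level column labels, like the result of read_excel(..., header=[0, 1])
columns = pd.MultiIndex.from_product([["tier1", "tier2"], ["arts", "science"]])

scores = pd.DataFrame(
    [[576, 532, 488, 432], [555, 537, 468, 439]],
    index=[2018, 2017],
    columns=columns,
)

# Selecting a top-level label returns a DataFrame with the second level as columns
tier1 = scores["tier1"]

# A (level 0, level 1) tuple selects a single column as a Series
tier2_science = scores[("tier2", "science")]

print(tier1)
print(tier2_science)
```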

4.Reading JSON data:

4.1 read_json

  • pandas.read_json(path_or_buf=None, orient=None, typ='frame', lines=False)

    • Converts JSON into a pandas DataFrame by default
    • orient : string, indication of expected JSON string format.
      • 'split' : dict like {index -> [index], columns -> [columns], data -> [values]}
      • 'records' : list like [{column -> value}, ... , {column -> value}]
      • 'index' : dict like {index -> {column -> value}}
      • 'columns' : dict like {column -> {index -> value}}; this is the default format
      • 'values' : just the values array
    • lines : boolean, default False
      • Read the file as one JSON object per line
    • typ : default 'frame'; the type of object to produce, 'series' or 'frame'
help(pd.read_json)
# orient: the JSON layout; lines: whether the file stores one object per line
json_data = pd.read_json("./Sarcasm_Headlines_Dataset.json", orient='records',lines=True)
json_data
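
What lines=True expects can be seen on a two-line sample. A minimal sketch: the records and field names are made up to resemble a headline dataset, and io.StringIO stands in for the file.

```python
import io

import pandas as pd

# Hypothetical JSON Lines content: one complete JSON object per line
json_lines = (
    '{"headline": "first headline", "is_sarcastic": 0}\n'
    '{"headline": "second headline", "is_sarcastic": 1}\n'
)

# lines=True parses each line separately; every object becomes one row
df = pd.read_json(io.StringIO(json_lines), orient="records", lines=True)

print(df)
```

Without lines=True the same text is not a single valid JSON document, so the parser would reject it.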

4.2 to_json

  • DataFrame.to_json(path_or_buf=None, orient=None, lines=False)
    • Stores a pandas object in JSON format
    • path_or_buf=None: file path
    • orient: the JSON layout to write, one of {'split', 'records', 'index', 'columns', 'values'}
    • lines: write one object per line
json_data[:10].to_json("./test.json", orient='records',lines=True)
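
The orient values are easiest to compare by writing the same small frame in each layout. A minimal sketch on hypothetical data; with no path given, to_json returns the JSON text.

```python
import json

import pandas as pd

# Hypothetical two-row frame
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]}, index=["r1", "r2"])

orients = ["split", "records", "index", "columns", "values"]

# Parse each result back into Python objects so the layouts are easy to compare
parsed = {o: json.loads(df.to_json(orient=o)) for o in orients}
for name, value in parsed.items():
    print(name, value)

# lines=True (only valid with orient='records') writes one object per line
as_lines = df.to_json(orient="records", lines=True)
print(as_lines)
```

Note that 'records' and 'values' drop the index, so they suit data whose index carries no information.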