Learning Python Data Analysis (4): Pandas Basics

Tags: Python, pandas, basics, data analysis

Time: 2023-09-27 14:29:29

Table of Contents

I. Introduction to Pandas
II. Basic pandas operations
- 1. Reading data
III. DataFrame operations
IV. Plotting with pandas
- 1. pandas.DataFrame.plot
- 2. pandas.Series.plot
V. Reading and storing files

I. Introduction to Pandas

1. About Pandas

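A library whose development was started by Wes McKinney in 2008
An open-source Python library built for data analysis and data mining
Built on NumPy, taking advantage of NumPy's high computational performance
Integrates with matplotlib, which makes plotting simple
Provides its own distinctive data structures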

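2. Why use Pandas?

NumPy can already help us process data, and combined with matplotlib it covers part of the data-visualization work. So what is the point of learning pandas?

Convenient data-handling capabilities
Easy file reading
Wraps matplotlib's plotting and NumPy's computation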

3. DataFrame:

import numpy as np

# Create normally distributed price-change data for 10 stocks over 5 days
stock_change = np.random.normal(0, 1, (10, 5))
stock_change

array([[-0.78146676, -0.29810035,  0.17317068, -0.78727269, -1.13741097],
       [-1.64768295,  0.1966735 , -0.40381405, -1.38547391,  1.03162812],
       [-0.88359711, -0.51776621,  0.31386734, -0.79209882, -0.75448839],
       [ 0.39497997,  0.47411555, -1.22856179,  2.32711195,  0.16330958],
       [ 1.71156574,  1.32175126, -0.27637519, -0.1037488 ,  0.80180467],
       [ 0.16196088,  1.23434847,  0.09890927,  0.39747989, -0.28454071],
       [ 1.17218486,  1.57634118, -0.58714471,  1.40127241,  0.19774915],
       [ 0.76779403,  1.44145798, -1.36100164,  0.44464079, -0.56796337],
       [-1.80942914,  1.89610206, -0.37059895, -0.95929575,  0.19099914],
       [ 0.53646672, -0.19264632, -1.61610463,  1.27208662,  0.61560309]])

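In this form, however, it is hard to tell what the stored data represents, and just as hard to retrieve a particular piece of it, for example the data of one specific stock.

Question: how can we display the data in a more meaningful way?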

import pandas as pd
# Use a pandas data structure
stock_data = pd.DataFrame(stock_change)
stock_data

	0	1	2	3	4
0	-0.781467	-0.298100	0.173171	-0.787273	-1.137411
1	-1.647683	0.196674	-0.403814	-1.385474	1.031628
2	-0.883597	-0.517766	0.313867	-0.792099	-0.754488
3	0.394980	0.474116	-1.228562	2.327112	0.163310
4	1.711566	1.321751	-0.276375	-0.103749	0.801805
5	0.161961	1.234348	0.098909	0.397480	-0.284541
6	1.172185	1.576341	-0.587145	1.401272	0.197749
7	0.767794	1.441458	-1.361002	0.444641	-0.567963
8	-1.809429	1.896102	-0.370599	-0.959296	0.190999
9	0.536467	-0.192646	-1.616105	1.272087	0.615603

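Add a row index.
Add a column index:
- Stock dates form a time series. Building one by hand means walking forward through the calendar and accounting for the number of days in each month, which is inconvenient. Use pd.date_range() to generate a run of consecutive dates (a first look for now).
```
date_range(start=None, end=None, periods=None, freq='B')

start: start date

end: end date

periods: number of periods (dates) to generate

freq: step unit; the library default is 'D' (one calendar day), while 'B' (business day) skips weekends
```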
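To make the effect of freq concrete, here is a minimal sketch that generates five dates starting from 2019-01-01 (a Tuesday), once with calendar days and once with business days. The variable names daily and business are illustrative only.

```python
import pandas as pd

# Five consecutive calendar days: the weekend is included
daily = pd.date_range(start="2019-01-01", periods=5, freq="D")

# Five consecutive business days: Saturday and Sunday are skipped
business = pd.date_range(start="2019-01-01", periods=5, freq="B")

print(daily[-1].date())     # 2019-01-05 (a Saturday)
print(business[-1].date())  # 2019-01-07 (the following Monday)
```

With freq='B' the fifth date lands on 2019-01-07, which is why the stock table in this section jumps from 2019-01-04 to 2019-01-07.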

help(pd.date_range)

Help on function date_range in module pandas.core.indexes.datetimes:

date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs)
    Return a fixed frequency DatetimeIndex.
    
    Parameters
    ----------
    start : str or datetime-like, optional
        Left bound for generating dates.
    end : str or datetime-like, optional
        Right bound for generating dates.
    periods : integer, optional
        Number of periods to generate.
    freq : str or DateOffset, default 'D'
        Frequency strings can have multiples, e.g. '5H'. See
        :ref:`here <timeseries.offset_aliases>` for a list of
        frequency aliases.
    tz : str or tzinfo, optional
        Time zone name for returning localized DatetimeIndex, for example
        'Asia/Hong_Kong'. By default, the resulting DatetimeIndex is
        timezone-naive.
    normalize : bool, default False
        Normalize start/end dates to midnight before generating date range.
    name : str, default None
        Name of the resulting DatetimeIndex.
    closed : {None, 'left', 'right'}, optional
        Make the interval closed with respect to the given frequency to
        the 'left', 'right', or both sides (None, the default).
    **kwargs
        For compatibility. Has no effect on the result.
    
    Returns
    -------
    rng : DatetimeIndex
    
    See Also
    --------
    DatetimeIndex : An immutable container for datetimes.
    timedelta_range : Return a fixed frequency TimedeltaIndex.
    period_range : Return a fixed frequency PeriodIndex.
    interval_range : Return a fixed frequency IntervalIndex.
    
    Notes
    -----
    Of the four parameters ``start``, ``end``, ``periods``, and ``freq``,
    exactly three must be specified. If ``freq`` is omitted, the resulting
    ``DatetimeIndex`` will have ``periods`` linearly spaced elements between
    ``start`` and ``end`` (closed on both sides).
    
    To learn more about the frequency strings, please see `this link
    <http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases>`__.
    
    Examples
    --------
    **Specifying the values**
    
    The next four examples generate the same `DatetimeIndex`, but vary
    the combination of `start`, `end` and `periods`.
    
    Specify `start` and `end`, with the default daily frequency.
    
    >>> pd.date_range(start='1/1/2018', end='1/08/2018')
    DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
                   '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
                  dtype='datetime64[ns]', freq='D')
    
    Specify `start` and `periods`, the number of periods (days).
    
    >>> pd.date_range(start='1/1/2018', periods=8)
    DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
                   '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
                  dtype='datetime64[ns]', freq='D')
    
    Specify `end` and `periods`, the number of periods (days).
    
    >>> pd.date_range(end='1/1/2018', periods=8)
    DatetimeIndex(['2017-12-25', '2017-12-26', '2017-12-27', '2017-12-28',
                   '2017-12-29', '2017-12-30', '2017-12-31', '2018-01-01'],
                  dtype='datetime64[ns]', freq='D')
    
    Specify `start`, `end`, and `periods`; the frequency is generated
    automatically (linearly spaced).
    
    >>> pd.date_range(start='2018-04-24', end='2018-04-27', periods=3)
    DatetimeIndex(['2018-04-24 00:00:00', '2018-04-25 12:00:00',
                   '2018-04-27 00:00:00'],
                  dtype='datetime64[ns]', freq=None)
    
    **Other Parameters**
    
    Changed the `freq` (frequency) to ``'M'`` (month end frequency).
    
    >>> pd.date_range(start='1/1/2018', periods=5, freq='M')
    DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
                   '2018-05-31'],
                  dtype='datetime64[ns]', freq='M')
    
    Multiples are allowed
    
    >>> pd.date_range(start='1/1/2018', periods=5, freq='3M')
    DatetimeIndex(['2018-01-31', '2018-04-30', '2018-07-31', '2018-10-31',
                   '2019-01-31'],
                  dtype='datetime64[ns]', freq='3M')
    
    `freq` can also be specified as an Offset object.
    
    >>> pd.date_range(start='1/1/2018', periods=5, freq=pd.offsets.MonthEnd(3))
    DatetimeIndex(['2018-01-31', '2018-04-30', '2018-07-31', '2018-10-31',
                   '2019-01-31'],
                  dtype='datetime64[ns]', freq='3M')
    
    Specify `tz` to set the timezone.
    
    >>> pd.date_range(start='1/1/2018', periods=5, tz='Asia/Tokyo')
    DatetimeIndex(['2018-01-01 00:00:00+09:00', '2018-01-02 00:00:00+09:00',
                   '2018-01-03 00:00:00+09:00', '2018-01-04 00:00:00+09:00',
                   '2018-01-05 00:00:00+09:00'],
                  dtype='datetime64[ns, Asia/Tokyo]', freq='D')
    
    `closed` controls whether to include `start` and `end` that are on the
    boundary. The default includes boundary points on either end.
    
    >>> pd.date_range(start='2017-01-01', end='2017-01-04', closed=None)
    DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04'],
                  dtype='datetime64[ns]', freq='D')
    
    Use ``closed='left'`` to exclude `end` if it falls on the boundary.
    
    >>> pd.date_range(start='2017-01-01', end='2017-01-04', closed='left')
    DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03'],
                  dtype='datetime64[ns]', freq='D')
    
    Use ``closed='right'`` to exclude `start` if it falls on the boundary.
    
    >>> pd.date_range(start='2017-01-01', end='2017-01-04', closed='right')
    DatetimeIndex(['2017-01-02', '2017-01-03', '2017-01-04'],
                  dtype='datetime64[ns]', freq='D')

# Build the row index
stock_index = ['股票'+str(i) for i in range(stock_change.shape[0])]

# Generate a date sequence that skips weekends (non-trading days)
date = pd.date_range('2019-01-01', periods=stock_change.shape[1], freq='B')

# index is the row index, columns is the column index
data = pd.DataFrame(stock_change, index=stock_index, columns=date)

data

	2019-01-01	2019-01-02	2019-01-03	2019-01-04	2019-01-07
股票0	-0.781467	-0.298100	0.173171	-0.787273	-1.137411
股票1	-1.647683	0.196674	-0.403814	-1.385474	1.031628
股票2	-0.883597	-0.517766	0.313867	-0.792099	-0.754488
股票3	0.394980	0.474116	-1.228562	2.327112	0.163310
股票4	1.711566	1.321751	-0.276375	-0.103749	0.801805
股票5	0.161961	1.234348	0.098909	0.397480	-0.284541
股票6	1.172185	1.576341	-0.587145	1.401272	0.197749
股票7	0.767794	1.441458	-1.361002	0.444641	-0.567963
股票8	-1.809429	1.896102	-0.370599	-0.959296	0.190999
股票9	0.536467	-0.192646	-1.616105	1.272087	0.615603
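Once the rows and columns carry labels, fetching the data of one specific stock becomes a lookup by name. The sketch below rebuilds a labelled table of the same shape and reads one row and one cell. It uses a seeded np.random.default_rng generator so that reruns give the same numbers; that is an assumption of this example, not what the article's cell does.

```python
import numpy as np
import pandas as pd

# Seeded generator (an assumption of this sketch) so the data is repeatable
rng = np.random.default_rng(0)
stock_change = rng.normal(0, 1, (10, 5))

stock_index = ["股票" + str(i) for i in range(stock_change.shape[0])]
date = pd.date_range("2019-01-01", periods=stock_change.shape[1], freq="B")
data = pd.DataFrame(stock_change, index=stock_index, columns=date)

# All five days for one stock, selected by its row label
one_stock = data.loc["股票3"]

# A single cell: row label plus column label (a Timestamp)
one_value = data.loc["股票3", pd.Timestamp("2019-01-04")]

print(one_stock.shape)  # (5,)
```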

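4. DataFrame

4.1 DataFrame structure

A DataFrame object has both a row index and a column index.

Row index: labels the individual rows (the horizontal entries); it is called index
Column index: labels the individual columns (the vertical entries); it is called columns

4.2 DataFrame attributes

data.index  # row index: the list of row labels of the DataFrame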

Index(['股票0', '股票1', '股票2', '股票3', '股票4', '股票5', '股票6', '股票7', '股票8', '股票9'], dtype='object')

data.columns  # column index: the list of column labels of the DataFrame

DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-07'],
              dtype='datetime64[ns]', freq='B')

data.shape  # shape of the data as (rows, columns)

(10, 5)

data.values  # contents: the underlying values as a NumPy array

array([[-0.78146676, -0.29810035,  0.17317068, -0.78727269, -1.13741097],
       [-1.64768295,  0.1966735 , -0.40381405, -1.38547391,  1.03162812],
       [-0.88359711, -0.51776621,  0.31386734, -0.79209882, -0.75448839],
       [ 0.39497997,  0.47411555, -1.22856179,  2.32711195,  0.16330958],
       [ 1.71156574,  1.32175126, -0.27637519, -0.1037488 ,  0.80180467],
       [ 0.16196088,  1.23434847,  0.09890927,  0.39747989, -0.28454071],
       [ 1.17218486,  1.57634118, -0.58714471,  1.40127241,  0.19774915],
       [ 0.76779403,  1.44145798, -1.36100164,  0.44464079, -0.56796337],
       [-1.80942914,  1.89610206, -0.37059895, -0.95929575,  0.19099914],
       [ 0.53646672, -0.19264632, -1.61610463,  1.27208662,  0.61560309]])

data.T  # transpose

	股票0	股票1	股票2	股票3	股票4	股票5	股票6	股票7	股票8	股票9
2019-01-01	-0.781467	-1.647683	-0.883597	0.394980	1.711566	0.161961	1.172185	0.767794	-1.809429	0.536467
2019-01-02	-0.298100	0.196674	-0.517766	0.474116	1.321751	1.234348	1.576341	1.441458	1.896102	-0.192646
2019-01-03	0.173171	-0.403814	0.313867	-1.228562	-0.276375	0.098909	-0.587145	-1.361002	-0.370599	-1.616105
2019-01-04	-0.787273	-1.385474	-0.792099	2.327112	-0.103749	0.397480	1.401272	0.444641	-0.959296	1.272087
2019-01-07	-1.137411	1.031628	-0.754488	0.163310	0.801805	-0.284541	0.197749	-0.567963	0.190999	0.615603

4.3 Common DataFrame methods

data.head(5)  # show the first 5 rows; with no argument the default is 5, and passing N shows the first N rows

	2019-01-01	2019-01-02	2019-01-03	2019-01-04	2019-01-07
股票0	-0.781467	-0.298100	0.173171	-0.787273	-1.137411
股票1	-1.647683	0.196674	-0.403814	-1.385474	1.031628
股票2	-0.883597	-0.517766	0.313867	-0.792099	-0.754488
股票3	0.394980	0.474116	-1.228562	2.327112	0.163310
股票4	1.711566	1.321751	-0.276375	-0.103749	0.801805

data.tail(5)  # show the last 5 rows; with no argument the default is 5, and passing N shows the last N rows

	2019-01-01	2019-01-02	2019-01-03	2019-01-04	2019-01-07
股票5	0.161961	1.234348	0.098909	0.397480	-0.284541
股票6	1.172185	1.576341	-0.587145	1.401272	0.197749
股票7	0.767794	1.441458	-1.361002	0.444641	-0.567963
股票8	-1.809429	1.896102	-0.370599	-0.959296	0.190999
股票9	0.536467	-0.192646	-1.616105	1.272087	0.615603

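4.3 Setting the DataFrame index

Modifying row and column index values:
Note: the following approach is wrong, because an Index is immutable and does not support item assignment


# Wrong: raises TypeError (Index does not support mutable operations)
data.index[3] = '股票_3'

The correct approach:

stock_code = ["股票_" + str(i) for i in range(stock_change.shape[0])]

# The index must be replaced as a whole
data.index = stock_code
# Result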
data

	2019-01-01	2019-01-02	2019-01-03	2019-01-04	2019-01-07
股票_0	-0.781467	-0.298100	0.173171	-0.787273	-1.137411
股票_1	-1.647683	0.196674	-0.403814	-1.385474	1.031628
股票_2	-0.883597	-0.517766	0.313867	-0.792099	-0.754488
股票_3	0.394980	0.474116	-1.228562	2.327112	0.163310
股票_4	1.711566	1.321751	-0.276375	-0.103749	0.801805
股票_5	0.161961	1.234348	0.098909	0.397480	-0.284541
股票_6	1.172185	1.576341	-0.587145	1.401272	0.197749
股票_7	0.767794	1.441458	-1.361002	0.444641	-0.567963
股票_8	-1.809429	1.896102	-0.370599	-0.959296	0.190999
股票_9	0.536467	-0.192646	-1.616105	1.272087	0.615603
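The immutability of the index can be checked on a toy frame. The sketch below confirms that item assignment raises TypeError and then changes a single label with DataFrame.rename, which is an alternative not used in the article; the frame and its labels are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])

# Item assignment on an Index is rejected
try:
    df.index[1] = "Y"
    mutated = True
except TypeError:
    mutated = False

# Alternative (not from the article): rename returns a new frame
# in which only the mapped labels are changed
renamed = df.rename(index={"y": "Y"})

print(mutated)              # False
print(list(renamed.index))  # ['x', 'Y', 'z']
```

Replacing the whole index, as shown above, changes the frame in place, while rename leaves the original untouched and returns a copy.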

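Resetting the index

reset_index(drop=False)
- Replaces the index with a new default integer index
- drop: defaults to False, which keeps the old index as a column; if True, the old index values are discarded

# Reset the index with drop=False
data.reset_index()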

	index	2019-01-01 00:00:00	2019-01-02 00:00:00	2019-01-03 00:00:00	2019-01-04 00:00:00	2019-01-07 00:00:00
0	股票_0	-0.781467	-0.298100	0.173171	-0.787273	-1.137411
1	股票_1	-1.647683	0.196674	-0.403814	-1.385474	1.031628
2	股票_2	-0.883597	-0.517766	0.313867	-0.792099	-0.754488
3	股票_3	0.394980	0.474116	-1.228562	2.327112	0.163310
4	股票_4	1.711566	1.321751	-0.276375	-0.103749	0.801805
5	股票_5	0.161961	1.234348	0.098909	0.397480	-0.284541
6	股票_6	1.172185	1.576341	-0.587145	1.401272	0.197749
7	股票_7	0.767794	1.441458	-1.361002	0.444641	-0.567963
8	股票_8	-1.809429	1.896102	-0.370599	-0.959296	0.190999
9	股票_9	0.536467	-0.192646	-1.616105	1.272087	0.615603
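The difference between the two drop settings is easiest to see side by side. This is a small sketch on a made-up two-row frame; the names kept and dropped are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"sale": [55, 40]}, index=["s0", "s1"])

# drop=False (the default): the old index survives as a column named "index"
kept = df.reset_index()

# drop=True: the old index is thrown away
dropped = df.reset_index(drop=True)

print(list(kept.columns))     # ['index', 'sale']
print(list(dropped.columns))  # ['sale']
```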

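Using the values of a column as the new index
- set_index(keys, drop=True)
  - keys : a column label or a list of column labels
  - drop : boolean, default True. Delete the columns that are used as the new index

Example of setting a new index:

1. Create the DataFrame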

df = pd.DataFrame({'month': [12, 3, 6, 9],
                    'year': [2013, 2014, 2014, 2014],
                    'sale':[55, 40, 84, 31]})
df

	month	year	sale
0	12	2013	55
1	3	2014	40
2	6	2014	84
3	9	2014	31

2. Use the month as the new index

df.set_index('month')

	year	sale
month
12	2013	55
3	2014	40
6	2014	84
9	2014	31

df.set_index(keys = ['year', 'month'])

		sale
year	month
2013	12	55
2014	3	40
	6	84
	9	31

df.set_index(keys = ['year', 'month']).index

MultiIndex([(2013, 12),
            (2014,  3),
            (2014,  6),
            (2014,  9)],
           names=['year', 'month'])

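Note: after the call above, the DataFrame has a MultiIndex.

4.4 MultiIndex and Panel

1. MultiIndex

A multi-level, or hierarchical, index object.

Attributes of the index
- names: the names of the levels
- levels: the unique values of each level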

df.set_index(keys = ['year', 'month']).index.names

FrozenList(['year', 'month'])

df.set_index(keys = ['year', 'month']).index.levels

FrozenList([[2013, 2014], [3, 6, 9, 12]])
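A MultiIndex is mainly useful for selecting by one or several levels. The following sketch reuses the small sales table from this section and selects with loc; the names indexed, sales_2014 and june_sale are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'month': [12, 3, 6, 9],
                   'year': [2013, 2014, 2014, 2014],
                   'sale': [55, 40, 84, 31]})
indexed = df.set_index(['year', 'month'])

# Selecting on the first level returns the rows of that year,
# indexed by the remaining level (month)
sales_2014 = indexed.loc[2014]

# A tuple addresses both levels at once
june_sale = indexed.loc[(2014, 6), 'sale']

print(list(sales_2014.index))  # [3, 6, 9]
print(june_sale)               # 84
```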

4.5 The Series object

A Series has only a row index.

df

	month	year	sale
0	12	2013	55
1	3	2014	40
2	6	2014	84
3	9	2014	31

type(df)

pandas.core.frame.DataFrame

ser = df['sale']
ser

0    55
1    40
2    84
3    31
Name: sale, dtype: int64

type(ser)

pandas.core.series.Series

ser.index

RangeIndex(start=0, stop=4, step=1)

ser.values

array([55, 40, 84, 31])

1. Creating a Series:

From existing data

Specify the contents and use the default index

pd.Series(np.arange(10))

Specify the index

pd.Series([6.7, 5.6, 3, 10, 2], index=[1, 2, 3, 4, 5])

From a dictionary

pd.Series({'red':100, 'blue':200, 'green': 500, 'yellow':1000})

# Create a Series
pd.Series([5,6,7,8,9], index=[1,2,3,4,5])

1    5
2    6
3    7
4    8
5    9
dtype: int64
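When a Series is built from a dictionary, the keys become the index labels, so values can be read back by label. A minimal sketch using the colour dictionary from above (the variable names are illustrative):

```python
import pandas as pd

colors = pd.Series({'red': 100, 'blue': 200, 'green': 500, 'yellow': 1000})

# A single label returns a scalar
green = colors['green']

# A list of labels returns a smaller Series
subset = colors[['red', 'yellow']]

print(list(colors.index))  # ['red', 'blue', 'green', 'yellow']
print(green)               # 500
```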

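II. Basic pandas operations

To understand these basic operations better, we will load a real stock data set. File I/O is covered later; for now we simply use the API.

1. Reading data

import pandas as pd
# Read the file
data = pd.read_csv("./stock_day/stock_day.csv")

# Drop some columns to simplify the data before the operations that follow
data = data.drop(["ma5","ma10","ma20","v_ma5","v_ma10","v_ma20"], axis=1)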
data

	open	high	close	low	volume	price_change	p_change	turnover
2018-02-27	23.53	25.88	24.16	23.53	95578.03	0.63	2.68	2.39
2018-02-26	22.80	23.78	23.53	22.80	60985.11	0.69	3.02	1.53
2018-02-23	22.88	23.37	22.82	22.71	52914.01	0.54	2.42	1.32
2018-02-22	22.25	22.76	22.28	22.02	36105.01	0.36	1.64	0.90
2018-02-14	21.49	21.99	21.92	21.48	23331.04	0.44	2.05	0.58
...	...	...	...	...	...	...	...	...
2015-03-06	13.17	14.48	14.28	13.13	179831.72	1.12	8.51	6.16
2015-03-05	12.88	13.45	13.16	12.87	93180.39	0.26	2.02	3.19
2015-03-04	12.80	12.92	12.90	12.61	67075.44	0.20	1.57	2.30
2015-03-03	12.52	13.06	12.70	12.52	139071.61	0.18	1.44	4.76
2015-03-02	12.25	12.67	12.52	12.20	96291.73	0.32	2.62	3.30

643 rows × 8 columns

data.columns

Index(['open', 'high', 'close', 'low', 'volume', 'price_change', 'p_change',
       'turnover'],
      dtype='object')

data.index

Index(['2018-02-27', '2018-02-26', '2018-02-23', '2018-02-22', '2018-02-14',
       '2018-02-13', '2018-02-12', '2018-02-09', '2018-02-08', '2018-02-07',
       ...
       '2015-03-13', '2015-03-12', '2015-03-11', '2015-03-10', '2015-03-09',
       '2015-03-06', '2015-03-05', '2015-03-04', '2015-03-03', '2015-03-02'],
      dtype='object', length=643)

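1.1 Indexing

With NumPy we already covered selecting elements by index and by slice. pandas supports similar operations, and can also use column names and row names directly, or even combine them.

1. Indexing directly with row and column labels (column first, then row)

data["close"]  # one way to get a Series by column label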

2018-02-27    24.16
2018-02-26    23.53
2018-02-23    22.82
2018-02-22    22.28
2018-02-14    21.92
              ...  
2015-03-06    14.28
2015-03-05    13.16
2015-03-04    12.90
2015-03-03    12.70
2015-03-02    12.52
Name: close, Length: 643, dtype: float64

data.open  # shorthand: attribute access to the column

2018-02-27    23.53
2018-02-26    22.80
2018-02-23    22.88
2018-02-22    22.25
2018-02-14    21.49
              ...  
2015-03-06    13.17
2015-03-05    12.88
2015-03-04    12.80
2015-03-03    12.52
2015-03-02    12.25
Name: open, Length: 643, dtype: float64

data.open[0]  # get a single value by position (recent pandas versions recommend data.open.iloc[0] for positional access)

23.53

data.open[:10]  # get a Series by slicing

2018-02-27    23.53
2018-02-26    22.80
2018-02-23    22.88
2018-02-22    22.25
2018-02-14    21.49
2018-02-13    21.40
2018-02-12    20.70
2018-02-09    21.20
2018-02-08    21.79
2018-02-07    22.69
Name: open, dtype: float64

# Index with an array or a list
data[['close','open']].head()  # the result is still a two-dimensional DataFrame

	close	open
2018-02-27	24.16	23.53
2018-02-26	23.53	22.80
2018-02-23	22.82	22.88
2018-02-22	22.28	22.25
2018-02-14	21.92	21.49

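2. Indexing row first, then column

Indexing with loc or iloc

iloc: indexes by integer position; also supports slicing
loc: indexes by label; also supports slicing
ix: mixed indexing that accepts both positions and labels (deprecated, and removed in pandas 1.0)

data.iloc[:2]  # get the first two rows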

	open	high	close	low	volume	price_change	p_change	turnover
2018-02-27	23.53	25.88	24.16	23.53	95578.03	0.63	2.68	2.39
2018-02-26	22.80	23.78	23.53	22.80	60985.11	0.69	3.02	1.53

data.iloc[:2,:3]  # get the first two rows and the first three columns

	open	high	close
2018-02-27	23.53	25.88	24.16
2018-02-26	22.80	23.78	23.53

data.iloc[:2,3]  # get the first two rows of the column at position 3 (the fourth column, low)

2018-02-27    23.53
2018-02-26    22.80
Name: low, dtype: float64

data.iloc[-2]

open                12.52
high                13.06
close               12.70
low                 12.52
volume          139071.61
price_change         0.18
p_change             1.44
turnover             4.76
Name: 2015-03-03, dtype: float64

loc:

# Slicing by row and column labels with loc includes both endpoints
data.loc[:"2018-02-14", 'open':'close']

	open	high	close
2018-02-27	23.53	25.88	24.16
2018-02-26	22.80	23.78	23.53
2018-02-23	22.88	23.37	22.82
2018-02-22	22.25	22.76	22.28
2018-02-14	21.49	21.99	21.92

data.ix[:4, 'open':'close']  # deprecated; .ix was removed in pandas 1.0

/home/chengfei/miniconda3/envs/jupyter/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.
/home/chengfei/miniconda3/envs/jupyter/lib/python3.6/site-packages/pandas/core/indexing.py:822: FutureWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated
  retval = getattr(retval, self.name)._getitem_axis(key, axis=i)

	open	high	close
2018-02-27	23.53	25.88	24.16
2018-02-26	22.80	23.78	23.53
2018-02-23	22.88	23.37	22.82
2018-02-22	22.25	22.76	22.28
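Since .ix is no longer available in current pandas, the same mixed selection (rows by position, columns by label) has to be written with iloc and loc. The sketch below shows two possible spellings on the first five rows of the stock table; it is one way to do it under that assumption, not the article's own code.

```python
import pandas as pd

data = pd.DataFrame(
    {"open":  [23.53, 22.80, 22.88, 22.25, 21.49],
     "high":  [25.88, 23.78, 23.37, 22.76, 21.99],
     "close": [24.16, 23.53, 22.82, 22.28, 21.92],
     "low":   [23.53, 22.80, 22.71, 22.02, 21.48]},
    index=["2018-02-27", "2018-02-26", "2018-02-23", "2018-02-22", "2018-02-14"],
)

# Option 1: take rows by position, then columns by label
by_chain = data.iloc[:4].loc[:, "open":"close"]

# Option 2: convert the column label to a position and use iloc only
stop = data.columns.get_loc("close") + 1  # +1 because iloc excludes the end
by_position = data.iloc[:4, :stop]

print(by_chain.equals(by_position))  # True
```

Note the asymmetry: the label slice "open":"close" includes close, while the positional slice needs the +1 to include it.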

1.2 Assignment

Reassign the close column of the DataFrame to 1


# Modify the original values directly
data['close'] = 1
# Or
data.close = 1

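1.3 Sorting

There are two kinds of sorting: sorting by content and sorting by index.

DataFrame:

Use df.sort_values(by=, ascending=) to sort by content
- Sort by a single key or by several keys; ascending by default
- ascending=False: descending
- ascending=True: ascending
Use df.sort_index() to sort by index

1. df.sort_index():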

data.head()

	open	high	close	low	volume	price_change	p_change	turnover
2018-02-27	23.53	25.88	24.16	23.53	95578.03	0.63	2.68	2.39
2018-02-26	22.80	23.78	23.53	22.80	60985.11	0.69	3.02	1.53
2018-02-23	22.88	23.37	22.82	22.71	52914.01	0.54	2.42	1.32
2018-02-22	22.25	22.76	22.28	22.02	36105.01	0.36	1.64	0.90
2018-02-14	21.49	21.99	21.92	21.48	23331.04	0.44	2.05	0.58

data.head().sort_index()  # ascending by default; pass ascending=False for descending order

	open	high	close	low	volume	price_change	p_change	turnover
2018-02-14	21.49	21.99	21.92	21.48	23331.04	0.44	2.05	0.58
2018-02-22	22.25	22.76	22.28	22.02	36105.01	0.36	1.64	0.90
2018-02-23	22.88	23.37	22.82	22.71	52914.01	0.54	2.42	1.32
2018-02-26	22.80	23.78	23.53	22.80	60985.11	0.69	3.02	1.53
2018-02-27	23.53	25.88	24.16	23.53	95578.03	0.63	2.68	2.39

data.head().sort_index(ascending=False)

	open	high	close	low	volume	price_change	p_change	turnover
2018-02-27	23.53	25.88	24.16	23.53	95578.03	0.63	2.68	2.39
2018-02-26	22.80	23.78	23.53	22.80	60985.11	0.69	3.02	1.53
2018-02-23	22.88	23.37	22.82	22.71	52914.01	0.54	2.42	1.32
2018-02-22	22.25	22.76	22.28	22.02	36105.01	0.36	1.64	0.90
2018-02-14	21.49	21.99	21.92	21.48	23331.04	0.44	2.05	0.58

help(data.sort_values)

Help on method sort_values in module pandas.core.frame:

sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last') method of pandas.core.frame.DataFrame instance
    Sort by the values along either axis.
    
    Parameters
    ----------
            by : str or list of str
                Name or list of names to sort by.
    
                - if `axis` is 0 or `'index'` then `by` may contain index
                  levels and/or column labels
                - if `axis` is 1 or `'columns'` then `by` may contain column
                  levels and/or index labels
    
                .. versionchanged:: 0.23.0
                   Allow specifying index or column level names.
    axis : {0 or 'index', 1 or 'columns'}, default 0
         Axis to be sorted.
    ascending : bool or list of bool, default True
         Sort ascending vs. descending. Specify list for multiple sort
         orders.  If this is a list of bools, must match the length of
         the by.
    inplace : bool, default False
         If True, perform operation in-place.
    kind : {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'
         Choice of sorting algorithm. See also ndarray.np.sort for more
         information.  `mergesort` is the only stable algorithm. For
         DataFrames, this option is only applied when sorting on a single
         column or label.
    na_position : {'first', 'last'}, default 'last'
         Puts NaNs at the beginning if `first`; `last` puts NaNs at the
         end.
    
    Returns
    -------
    sorted_obj : DataFrame or None
        DataFrame with sorted values if inplace=False, None otherwise.
    
    Examples
    --------
    >>> df = pd.DataFrame({
    ...     'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    ...     'col2': [2, 1, 9, 8, 7, 4],
    ...     'col3': [0, 1, 9, 4, 2, 3],
    ... })
    >>> df
        col1 col2 col3
    0   A    2    0
    1   A    1    1
    2   B    9    9
    3   NaN  8    4
    4   D    7    2
    5   C    4    3
    
    Sort by col1
    
    >>> df.sort_values(by=['col1'])
        col1 col2 col3
    0   A    2    0
    1   A    1    1
    2   B    9    9
    5   C    4    3
    4   D    7    2
    3   NaN  8    4
    
    Sort by multiple columns
    
    >>> df.sort_values(by=['col1', 'col2'])
        col1 col2 col3
    1   A    1    1
    0   A    2    0
    2   B    9    9
    5   C    4    3
    4   D    7    2
    3   NaN  8    4
    
    Sort Descending
    
    >>> df.sort_values(by='col1', ascending=False)
        col1 col2 col3
    4   D    7    2
    5   C    4    3
    2   B    9    9
    0   A    2    0
    1   A    1    1
    3   NaN  8    4
    
    Putting NAs first
    
    >>> df.sort_values(by='col1', ascending=False, na_position='first')
        col1 col2 col3
    3   NaN  8    4
    4   D    7    2
    5   C    4    3
    2   B    9    9
    0   A    2    0
    1   A    1    1

2. df.sort_values()

data.head(10).sort_values(by="close",ascending=False)  # sort by close in descending order

	open	high	close	low	volume	price_change	p_change	turnover
2018-02-27	23.53	25.88	24.16	23.53	95578.03	0.63	2.68	2.39
2018-02-26	22.80	23.78	23.53	22.80	60985.11	0.69	3.02	1.53
2018-02-23	22.88	23.37	22.82	22.71	52914.01	0.54	2.42	1.32
2018-02-22	22.25	22.76	22.28	22.02	36105.01	0.36	1.64	0.90
2018-02-14	21.49	21.99	21.92	21.48	23331.04	0.44	2.05	0.58
2018-02-08	21.79	22.09	21.88	21.75	27068.16	0.09	0.41	0.68
2018-02-07	22.69	23.11	21.80	21.29	53853.25	-0.50	-2.24	1.35
2018-02-13	21.40	21.90	21.48	21.31	30802.45	0.28	1.32	0.77
2018-02-12	20.70	21.40	21.19	20.63	32445.39	0.82	4.03	0.81
2018-02-09	21.20	21.46	20.36	20.19	54304.01	-1.50	-6.86	1.36

data.head(10).sort_values(by=["close","open"],ascending=False)  # priority: close first, then open

	open	high	close	low	volume	price_change	p_change	turnover
2018-02-27	23.53	25.88	24.16	23.53	95578.03	0.63	2.68	2.39
2018-02-26	22.80	23.78	23.53	22.80	60985.11	0.69	3.02	1.53
2018-02-23	22.88	23.37	22.82	22.71	52914.01	0.54	2.42	1.32
2018-02-22	22.25	22.76	22.28	22.02	36105.01	0.36	1.64	0.90
2018-02-14	21.49	21.99	21.92	21.48	23331.04	0.44	2.05	0.58
2018-02-08	21.79	22.09	21.88	21.75	27068.16	0.09	0.41	0.68
2018-02-07	22.69	23.11	21.80	21.29	53853.25	-0.50	-2.24	1.35
2018-02-13	21.40	21.90	21.48	21.31	30802.45	0.28	1.32	0.77
2018-02-12	20.70	21.40	21.19	20.63	32445.39	0.82	4.03	0.81
2018-02-09	21.20	21.46	20.36	20.19	54304.01	-1.50	-6.86	1.36
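As the help text above notes, ascending may also be a list with one flag per key. The sketch below sorts a toy frame by close descending and breaks ties by open ascending; the numbers are made up so that a tie actually occurs.

```python
import pandas as pd

# Toy data (not from the stock file): rows "a" and "c" tie on close
df = pd.DataFrame(
    {"close": [21.92, 21.88, 21.92],
     "open":  [21.49, 21.79, 20.00]},
    index=["a", "b", "c"],
)

# close: highest first; within equal close values, open: lowest first
ordered = df.sort_values(by=["close", "open"], ascending=[False, True])

print(list(ordered.index))  # ['c', 'a', 'b']
```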

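III. DataFrame operations

Arithmetic operators
Methods wrapped by pandas

1. Arithmetic operations

DataFrame.add(other): addition, for example adding a specific number
DataFrame.sub(other): subtraction
DataFrame.mul(other): multiplication
DataFrame.div(other): division
DataFrame.truediv(other): true (floating-point) division
DataFrame.floordiv(other): floor (integer) division
DataFrame.mod(other): modulo
DataFrame.pow(other): exponentiation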

import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.arange(16).reshape(4,4), index = list("ABCD"))
df

	0	1	2	3
A	0	1	2	3
B	4	5	6	7
C	8	9	10	11
D	12	13	14	15

df + 1

	0	1	2	3
A	1	2	3	4
B	5	6	7	8
C	9	10	11	12
D	13	14	15	16

df.add(1)

	0	1	2	3
A	1	2	3	4
B	5	6	7	8
C	9	10	11	12
D	13	14	15	16
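The method forms behave like their operator counterparts. As a quick check, the sketch below compares df + 1 with df.add(1) and then recombines the results of floordiv and mod to recover the original frame; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.arange(16).reshape(4, 4), index=list("ABCD"))

# Operator and method give the same result
same = (df + 1).equals(df.add(1))

# Integer division and remainder by 3, element by element
quotient = df.floordiv(3)
remainder = df.mod(3)

# quotient * 3 + remainder reconstructs the original values
rebuilt = quotient.mul(3).add(remainder)

print(same)                # True
print(rebuilt.equals(df))  # True
```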

2. Logical operations

Conditional tests
Boolean indexing
Boolean assignment

2.1 Conditional tests

df>10

	0	1	2	3
A	False	False	False	False
B	False	False	False	False
C	False	False	False	True
D	True	True	True	True

2.2 Boolean indexing

df[df>10]  # entries that do not satisfy the condition are filled with missing values (NaN)

	0	1	2	3
A	NaN	NaN	NaN	NaN
B	NaN	NaN	NaN	NaN
C	NaN	NaN	NaN	11.0
D	12.0	13.0	14.0	15.0

2.3 Boolean assignment

df[df>10] = 1000
df

	0	1	2	3
A	0	1	2	3
B	4	5	6	7
C	8	9	10	1000
D	1000	1000	1000	1000

data.head()

	open	high	close	low	volume	price_change	p_change	turnover
2018-02-27	23.53	25.88	24.16	23.53	95578.03	0.63	2.68	2.39
2018-02-26	22.80	23.78	23.53	22.80	60985.11	0.69	3.02	1.53
2018-02-23	22.88	23.37	22.82	22.71	52914.01	0.54	2.42	1.32
2018-02-22	22.25	22.76	22.28	22.02	36105.01	0.36	1.64	0.90
2018-02-14	21.49	21.99	21.92	21.48	23331.04	0.44	2.05	0.58

data = data.astype('float64')  # convert the data type to float64

data[(data.close > 21.5) & (data.close < 23) ].head(10)

	open	high	close	low	volume	price_change	p_change	turnover
2018-02-23	22.88	23.37	22.82	22.71	52914.01	0.54	2.42	1.32
2018-02-22	22.25	22.76	22.28	22.02	36105.01	0.36	1.64	0.90
2018-02-14	21.49	21.99	21.92	21.48	23331.04	0.44	2.05	0.58
2018-02-08	21.79	22.09	21.88	21.75	27068.16	0.09	0.41	0.68
2018-02-07	22.69	23.11	21.80	21.29	53853.25	-0.50	-2.24	1.35
2018-02-06	22.80	23.55	22.29	22.20	55555.00	-0.97	-4.17	1.39
2018-02-02	22.40	22.70	22.62	21.53	33242.11	0.20	0.89	0.83
2018-02-01	23.71	23.86	22.42	22.22	66414.64	-1.30	-5.48	1.66
2018-01-03	22.42	22.83	22.79	22.18	74687.10	0.38	1.70	1.87
2018-01-02	22.30	22.54	22.42	22.05	42677.76	0.12	0.54	1.07
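When combining conditions, each comparison needs its own parentheses because & binds more tightly than > and <. The sketch below builds the same kind of mask on a short Series of closing prices taken from the table above and also inverts it with ~; the names mask, inside and outside are illustrative.

```python
import pandas as pd

close = pd.Series(
    [24.16, 23.53, 22.82, 22.28, 21.92, 21.48],
    index=["2018-02-27", "2018-02-26", "2018-02-23",
           "2018-02-22", "2018-02-14", "2018-02-13"],
)

# Element-wise AND of two conditions; the parentheses are required
mask = (close > 21.5) & (close < 23)

inside = close[mask]    # values strictly between 21.5 and 23
outside = close[~mask]  # everything else

print(list(inside))   # [22.82, 22.28, 21.92]
print(list(outside))  # [24.16, 23.53, 21.48]
```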

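2.4 Logical-operation functions

query(expr)
- expr: the query string
query makes the filtering shown above simpler and more convenient

data.query("p_change > 2 & turnover > 15")

isin(values)
- Tests whether each element is contained in the given values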

data.query('close>21.5 & open < 23' ).head()

	open	high	close	low	volume	price_change	p_change	turnover
2018-02-26	22.80	23.78	23.53	22.80	60985.11	0.69	3.02	1.53
2018-02-23	22.88	23.37	22.82	22.71	52914.01	0.54	2.42	1.32
2018-02-22	22.25	22.76	22.28	22.02	36105.01	0.36	1.64	0.90
2018-02-14	21.49	21.99	21.92	21.48	23331.04	0.44	2.05	0.58
2018-02-08	21.79	22.09	21.88	21.75	27068.16	0.09	0.41	0.68

data.close.isin([23.53,21.92]).head(10)

2018-02-27    False
2018-02-26     True
2018-02-23    False
2018-02-22    False
2018-02-14     True
2018-02-13    False
2018-02-12    False
2018-02-09    False
2018-02-08    False
2018-02-07    False
Name: close, dtype: bool
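A query string is shorthand for the equivalent boolean mask, and the boolean Series returned by isin can be used as a filter in the same way. The sketch below checks both on the first five rows of the stock table; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"open":  [23.53, 22.80, 22.88, 22.25, 21.49],
     "close": [24.16, 23.53, 22.82, 22.28, 21.92]},
    index=["2018-02-27", "2018-02-26", "2018-02-23", "2018-02-22", "2018-02-14"],
)

# The same filter written as a query string and as a boolean mask
by_query = df.query("close > 21.5 & open < 23")
by_mask = df[(df["close"] > 21.5) & (df["open"] < 23)]

# isin returns a boolean Series, which can select the matching rows
picked = df[df["close"].isin([23.53, 21.92])]

print(by_query.equals(by_mask))  # True
print(list(picked.index))        # ['2018-02-26', '2018-02-14']
```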

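3. Statistical operations

3.1 describe()

Summary analysis: produces many statistics at once, including count, mean, std, min, max and the quartiles

# Compute the count, mean, standard deviation, minimum, maximum and quartiles
data.describe()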

	open	high	close	low	volume	price_change	p_change	turnover
count	643.000000	643.000000	643.000000	643.000000	643.000000	643.000000	643.000000	643.000000
mean	21.272706	21.900513	21.336267	20.771835	99905.519114	0.018802	0.190280	2.936190
std	3.930973	4.077578	3.942806	3.791968	73879.119354	0.898476	4.079698	2.079375
min	12.250000	12.670000	12.360000	12.200000	1158.120000	-3.520000	-10.030000	0.040000
25%	19.000000	19.500000	19.045000	18.525000	48533.210000	-0.390000	-1.850000	1.360000
50%	21.440000	21.970000	21.450000	20.980000	83175.930000	0.050000	0.260000	2.500000
75%	23.400000	24.065000	23.415000	22.850000	127580.055000	0.455000	2.305000	3.915000
max	34.990000	36.350000	35.210000	34.010000	501915.410000	3.030000	10.030000	12.560000

3.2 Statistical functions

These were covered in detail with NumPy; here we demonstrate min (minimum), max (maximum), mean (average), median, var (variance) and std (standard deviation).

Function	Description
count	Number of non-NA observations
sum	Sum of values
mean	Mean of values
median	Arithmetic median of values
min	Minimum
max	Maximum
mode	Mode
abs	Absolute value
prod	Product of values
std	Bessel-corrected sample standard deviation
var	Unbiased variance
idxmax	Index label of the maximum value
idxmin	Index label of the minimum value

data.max()  # by default, takes the maximum of each column

open                34.99
high                36.35
close               35.21
low                 34.01
volume          501915.41
price_change         3.03
p_change            10.03
turnover            12.56
dtype: float64

data.max(axis=1).head(10)

2018-02-27    95578.03
2018-02-26    60985.11
2018-02-23    52914.01
2018-02-22    36105.01
2018-02-14    23331.04
2018-02-13    30802.45
2018-02-12    32445.39
2018-02-09    54304.01
2018-02-08    27068.16
2018-02-07    53853.25
dtype: float64
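The axis argument decides the direction of the reduction, and idxmax reports where the maximum sits rather than its value. A minimal sketch on three rows of the stock table (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"open":  [23.53, 22.80, 22.88],
     "close": [24.16, 23.53, 22.82]},
    index=["2018-02-27", "2018-02-26", "2018-02-23"],
)

col_max = df.max()        # one value per column (axis=0, the default)
row_max = df.max(axis=1)  # one value per row
where = df.idxmax()       # row label of each column's maximum

print(row_max.tolist())  # [24.16, 23.53, 22.88]
print(where["close"])    # 2018-02-27
```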

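3.4 Cumulative statistics functions

Function	Purpose
cumsum	Sum of the first 1/2/3/…/n values
cummax	Maximum of the first 1/2/3/…/n values
cummin	Minimum of the first 1/2/3/…/n values
cumprod	Product of the first 1/2/3/…/n values

1. Cumulative sum: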

data.head()

	open	high	close	low	volume	price_change	p_change	turnover
2018-02-27	23.53	25.88	24.16	23.53	95578.03	0.63	2.68	2.39
2018-02-26	22.80	23.78	23.53	22.80	60985.11	0.69	3.02	1.53
2018-02-23	22.88	23.37	22.82	22.71	52914.01	0.54	2.42	1.32
2018-02-22	22.25	22.76	22.28	22.02	36105.01	0.36	1.64	0.90
2018-02-14	21.49	21.99	21.92	21.48	23331.04	0.44	2.05	0.58

data.cumsum().head()  # cumulative sum

	open	high	close	low	volume	price_change	p_change	turnover
2018-02-27	23.53	25.88	24.16	23.53	95578.03	0.63	2.68	2.39
2018-02-26	46.33	49.66	47.69	46.33	156563.14	1.32	5.70	3.92
2018-02-23	69.21	73.03	70.51	69.04	209477.15	1.86	8.12	5.24
2018-02-22	91.46	95.79	92.79	91.06	245582.16	2.22	9.76	6.14
2018-02-14	112.95	117.78	114.71	112.54	268913.20	2.66	11.81	6.72

data = pd.read_csv('./stock_day.csv')
data

	open	high	close	low	volume	price_change	p_change	ma5	ma10	ma20	v_ma5	v_ma10	v_ma20	turnover
2018-02-27	23.53	25.88	24.16	23.53	95578.03	0.63	2.68	22.942	22.142	22.875	53782.64	46738.65	55576.11	2.39
2018-02-26	22.80	23.78	23.53	22.80	60985.11	0.69	3.02	22.406	21.955	22.942	40827.52	42736.34	56007.50	1.53
2018-02-23	22.88	23.37	22.82	22.71	52914.01	0.54	2.42	21.938	21.929	23.022	35119.58	41871.97	56372.85	1.32
2018-02-22	22.25	22.76	22.28	22.02	36105.01	0.36	1.64	21.446	21.909	23.137	35397.58	39904.78	60149.60	0.90
2018-02-14	21.49	21.99	21.92	21.48	23331.04	0.44	2.05	21.366	21.923	23.253	33590.21	42935.74	61716.11	0.58
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2015-03-06	13.17	14.48	14.28	13.13	179831.72	1.12	8.51	13.112	13.112	13.112	115090.18	115090.18	115090.18	6.16
2015-03-05	12.88	13.45	13.16	12.87	93180.39	0.26	2.02	12.820	12.820	12.820	98904.79	98904.79	98904.79	3.19
2015-03-04	12.80	12.92	12.90	12.61	67075.44	0.20	1.57	12.707	12.707	12.707	100812.93	100812.93	100812.93	2.30
2015-03-03	12.52	13.06	12.70	12.52	139071.61	0.18	1.44	12.610	12.610	12.610	117681.67	117681.67	117681.67	4.76
2015-03-02	12.25	12.67	12.52	12.20	96291.73	0.32	2.62	12.520	12.520	12.520	96291.73	96291.73	96291.73	3.30

643 rows × 14 columns

data.price_change.sort_index().cumsum()  # sort by the date index in ascending order, then take the cumulative sum

2015-03-02     0.32
2015-03-03     0.50
2015-03-04     0.70
2015-03-05     0.96
2015-03-06     2.08
              ...  
2018-02-14     9.87
2018-02-22    10.23
2018-02-23    10.77
2018-02-26    11.46
2018-02-27    12.09
Name: price_change, Length: 643, dtype: float64

# Plotting (simple usage)
import matplotlib.pyplot as plt
data.price_change.sort_index().cumsum().plot()
plt.show()


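3.5 Custom operations

apply(func, axis=0)
- func: a user-defined function
- axis=0: the default, func is applied to each column; axis=1 applies it to each row
Define a function that computes max minus min for each column: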

data[['open', 'close']].apply(lambda x: x.max() - x.min(), axis=0)

open     22.74
close    22.85
dtype: float64

# Compute the range (max - min) of every column
data.apply(lambda x:x.max() - x.min(), axis=0)

open                22.740
high                23.680
close               22.850
low                 21.810
volume          500757.290
price_change         6.550
p_change            20.060
ma5                 21.176
ma10                19.666
ma20                17.478
v_ma5           393638.800
v_ma10          340897.650
v_ma20          245969.790
turnover            12.520
dtype: float64
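To make the effect of the axis argument concrete, the following sketch applies the same range function along both axes. The frame and the helper name value_range are hypothetical and only stand in for the stock data used above:

```python
import pandas as pd

# Hypothetical toy frame standing in for the stock data
df = pd.DataFrame({"open": [10.0, 12.0, 11.0], "close": [11.0, 15.0, 9.0]})


def value_range(values):
    """Return max - min of the Series that apply() passes in."""
    return values.max() - values.min()


col_range = df.apply(value_range, axis=0)  # one result per column
row_range = df.apply(value_range, axis=1)  # one result per row

print(col_range)
print(row_range)
```

With axis=0 the function receives each column as a Series, so the result is indexed by column name; with axis=1 it receives each row, so the result is indexed like the original frame.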

IV. Plotting with pandas:

1.pandas.DataFrame.plot

DataFrame.plot(x=None, y=None, kind='line')
- x : label or position, default None
- y : label, position or list of label, positions, default None
  - Allows plotting of one column versus another
- kind : str
  - 'line' : line plot (default)
  - 'bar' : vertical bar plot
  - 'barh' : horizontal bar plot
  - 'hist' : histogram
  - 'pie' : pie plot
  - 'scatter' : scatter plot
    More parameter details: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html?highlight=plot#pandas.DataFrame.plot
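The kind, x and y parameters listed above can be tried out with a small sketch like the one below. It assumes matplotlib is installed and selects the non-interactive Agg backend so that it also runs without a display; the three rows of data are copied from the high/low columns shown earlier, and the variable names are only illustrative:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, an assumption so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# First three high/low rows from the stock output shown earlier
df = pd.DataFrame({"high": [25.88, 23.78, 23.37], "low": [23.53, 22.80, 22.71]})

ax_line = df.plot()                                      # kind='line' is the default: one line per column
ax_bar = df.plot(kind="bar")                             # one bar per column for every row
ax_scatter = df.plot(x="low", y="high", kind="scatter")  # one column plotted against another

plt.close("all")  # release the figures once we are done with them
```

Each call returns a matplotlib Axes object, so titles, labels and limits can be adjusted afterwards with the usual matplotlib methods.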

ret = data[['high', 'low']]
ret.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x7efd64428da0>


ret[:10].plot(kind='bar')  # bar chart
plt.show()


data.price_change.plot(kind='hist', figsize=(20,10))  # histogram; roughly follows a normal distribution
plt.show()


2 pandas.Series.plot

More parameter details: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.plot.html?highlight=plot#pandas.Series.plot

import pandas as pd
import matplotlib.pyplot as plt
pd.plotting.scatter_matrix(data,figsize=(20,10))
plt.show()


pd.plotting.scatter_matrix(data.iloc[:,:10],figsize=(20,10))  # all rows, first 10 columns
plt.show()


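V. Reading and writing files:

Most data lives in files, so pandas supports a wide range of I/O operations. Its API handles many file formats, such as CSV, SQL, XLS, JSON, and HDF5.

Note: HDF5 and CSV are the most commonly used.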

format type	data description	reader	writer
text	CSV	read_csv	to_csv
text	JSON	read_json	to_json
text	HTML	read_html	to_html
text	local clipboard	read_clipboard	to_clipboard
binary	MS Excel	read_excel	to_excel
binary	HDF5 Format	read_hdf	to_hdf
binary	Feather Format	read_feather	to_feather
binary	Parquet Format	read_parquet	to_parquet
binary	Msgpack (removed in pandas 1.0)	read_msgpack	to_msgpack
binary	Stata	read_stata	to_stata
binary	SAS	read_sas
binary	Python Pickle Format	read_pickle	to_pickle
SQL	SQL	read_sql	to_sql
SQL	Google Big Query	read_gbq	to_gbq

1.CSV

1.1 Reading a CSV file: read_csv

pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None)
- filepath_or_buffer: path to the file
- usecols: the names of the columns to read, given as a list
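The following sketch shows usecols in isolation. To keep it self-contained it reads from an in-memory string instead of stock_day.csv; the text and the "date" column name are hypothetical stand-ins, and a file path can be passed in the same position:

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk
csv_text = """date,open,high,close,low,volume
2018-02-27,23.53,25.88,24.16,23.53,95578.03
2018-02-26,22.80,23.78,23.53,22.80,60985.11
"""

# Only the listed columns are parsed; "date" is kept so it can serve as the index
subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["date", "open", "close"],
    index_col="date",
)
print(subset)
```

Restricting the columns at read time avoids parsing data that would be dropped right afterwards, which matters for wide files.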

import pandas as pd
data = pd.read_csv("./stock_day/stock_day.csv", usecols=['open', 'high', 'close','low'])
data.head(10)

	open	high	close	low
2018-02-27	23.53	25.88	24.16	23.53
2018-02-26	22.80	23.78	23.53	22.80
2018-02-23	22.88	23.37	22.82	22.71
2018-02-22	22.25	22.76	22.28	22.02
2018-02-14	21.49	21.99	21.92	21.48
2018-02-13	21.40	21.90	21.48	21.31
2018-02-12	20.70	21.40	21.19	20.63
2018-02-09	21.20	21.46	20.36	20.19
2018-02-08	21.79	22.09	21.88	21.75
2018-02-07	22.69	23.11	21.80	21.29

1.2 Writing a CSV file: to_csv

DataFrame.to_csv(path_or_buf=None, sep=',', columns=None, header=True, index=True, index_label=None, mode='w', encoding=None)
- path_or_buf : string or file handle, default None
- sep : character, default ','
- columns : sequence, optional
- mode : 'w' overwrites the file, 'a' appends to it
- index : whether to write the row index
- header : boolean or list of string, default True; whether to write the column names
Series.to_csv(path=None, index=True, sep=',', na_rep='', float_format=None, header=False, index_label=None, mode='w', encoding=None, compression=None, date_format=None, decimal='.')

Write Series to a comma-separated values (csv) file

ret.head().to_csv("./test.csv")
ret = pd.read_csv("./test.csv")
ret

	Unnamed: 0	high	low
0	2018-02-27	25.88	23.53
1	2018-02-26	23.78	22.80
2	2018-02-23	23.37	22.71
3	2018-02-22	22.76	22.02
4	2018-02-14	21.99	21.48

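Notice that the index was written to the file and comes back as a separate column of data. To drop it, delete the original file and save again with the index parameter set (index=False). The column can also be turned back into the index after reading: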

ret.set_index("Unnamed: 0")

	high	low
Unnamed: 0
2018-02-27	25.88	23.53
2018-02-26	23.78	22.80
2018-02-23	23.37	22.71
2018-02-22	22.76	22.02
2018-02-14	21.99	21.48

# index=False: the index is not written to the file as a column
ret.head().to_csv("./test.csv", columns=['high'], index=False)
pd.read_csv("./test.csv")

	high
0	25.88
1	23.78
2	23.37
3	22.76
4	21.99
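The two ways of handling the index can be compared side by side in a small round trip. This is only a sketch on a hypothetical two-row frame, written to strings rather than to ./test.csv:

```python
import io

import pandas as pd

# Hypothetical two-row frame with dates as the index
df = pd.DataFrame(
    {"high": [25.88, 23.78], "low": [23.53, 22.80]},
    index=["2018-02-27", "2018-02-26"],
)

with_index = df.to_csv()                # index written as an unnamed first column
without_index = df.to_csv(index=False)  # index left out of the output

# Reading back without index_col turns the old index into an 'Unnamed: 0' column
naive = pd.read_csv(io.StringIO(with_index))
# index_col=0 uses the first column as the index again
restored = pd.read_csv(io.StringIO(with_index), index_col=0)

print(naive.columns.tolist())
print(restored)
```

Use index=False when the index carries no information (a plain 0..n-1 range), and index_col=0 on reading when it does, as with the dates here.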

Specifying append mode

stock_day[:10].to_csv("./test.csv", mode='a')

import pandas as pd
ret = pd.read_csv("./stock_day/stock_day.csv", usecols=['open', 'high', 'close','low'])
ret.head().to_csv("./test.csv", mode='a')
ret = pd.read_csv("./test.csv")
ret.set_index("Unnamed: 0")  # returns a new DataFrame; ret itself is unchanged
ret

	Unnamed: 0	open	high	close	low
0	2018-02-27	23.53	25.88	24.16	23.53
1	2018-02-26	22.80	23.78	23.53	22.80
2	2018-02-23	22.88	23.37	22.82	22.71
3	2018-02-22	22.25	22.76	22.28	22.02
4	2018-02-14	21.49	21.99	21.92	21.48

The row of column names was written to the file a second time. So when appending data, always leave the column names out by passing header=False.

import pandas as pd
ret = pd.read_csv("./stock_day/stock_day.csv", usecols=['open', 'high', 'close','low'])
ret.head().to_csv("./test.csv", mode='a',header=False)
ret = pd.read_csv("./test.csv",index_col=0)
ret

	open	high	close	low
2018-02-27	23.53	25.88	24.16	23.53
2018-02-26	22.80	23.78	23.53	22.80
2018-02-23	22.88	23.37	22.82	22.71
2018-02-22	22.25	22.76	22.28	22.02
2018-02-14	21.49	21.99	21.92	21.48
2018-02-27	23.53	25.88	24.16	23.53
2018-02-26	22.80	23.78	23.53	22.80
2018-02-23	22.88	23.37	22.82	22.71
2018-02-22	22.25	22.76	22.28	22.02
2018-02-14	21.49	21.99	21.92	21.48
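The same append pattern can be reproduced end to end without touching the working directory. The sketch below uses a temporary directory and a hypothetical two-row frame; the file name test.csv is reused only for illustration:

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame standing in for the stock data
df = pd.DataFrame(
    {"open": [23.53, 22.80], "close": [24.16, 23.53]},
    index=["2018-02-27", "2018-02-26"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.csv")
    df.to_csv(path)                          # first write: header row plus data rows
    df.to_csv(path, mode="a", header=False)  # append: data rows only
    combined = pd.read_csv(path, index_col=0)

print(combined)
```

Because the header is written only once, every column keeps its numeric dtype. If the header were appended again, it would be read back as a data row and turn the columns into strings.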

1.3 Reading a remote CSV

Specify names, i.e. the column names

names = [f"column_{x}" for x in range(1,12)]
pd.read_csv("url",names = names)
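The names parameter matters when the file has no header row. As a minimal sketch, the example below uses an in-memory string in place of a real URL (which is left unspecified above); the data and the col_ prefix are hypothetical:

```python
import io

import pandas as pd

# Hypothetical header-less data; a URL string could be passed to read_csv in the same position
raw = "1,2,3\n4,5,6\n"

names = [f"col_{i}" for i in range(1, 4)]
frame = pd.read_csv(io.StringIO(raw), names=names)

print(frame)
```

When names is given and header is not, pandas treats the first line as data rather than as column names, so no row is lost.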

2.HDF5

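Further notes:
Prefer HDF5 files for storage.

HDF5 supports compression when writing. blosc, one of the compressors pandas accepts through the complib parameter, is generally the fastest; compression is only applied when complevel is set.
Compression improves disk utilization and saves space.
HDF5 is also cross-platform and can be migrated to Hadoop easily.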

2.1 read_hdf and to_hdf

Reading and writing an HDF5 file requires a key; the value stored under that key is the DataFrame.

pandas.read_hdf(path_or_buf, key=None, **kwargs)

Reads data from an h5 file.

- path_or_buf: path to the file
- key: the key to read
- mode: the mode in which to open the file
- return: the selected object

DataFrame.to_hdf(path_or_buf, key, **kwargs)

# Read data from an HDF5 file
hdf_data = pd.read_hdf("./stock_data/day/day_close.h5")
ret = hdf_data.iloc[:10,:10]

# Write to HDF5; a key name must be given when storing
ret.to_hdf("./test.h5", key="close_10")

# An h5 file cannot be opened directly (for example in a text editor)
# When reading it back, the key name must be given
ret = pd.read_hdf("./test.h5", key="close_10")
ret

	000001.SZ	000002.SZ	000004.SZ	000005.SZ	000006.SZ	000007.SZ	000008.SZ	000009.SZ	000010.SZ	000011.SZ
0	16.30	17.71	4.58	2.88	14.60	2.62	4.96	4.66	5.37	6.02
1	17.02	19.20	4.65	3.02	15.97	2.65	4.95	4.70	5.37	6.27
2	17.02	17.28	4.56	3.06	14.37	2.63	4.82	4.47	5.37	5.96
3	16.18	16.97	4.49	2.95	13.10	2.73	4.89	4.33	5.37	5.77
4	16.95	17.19	4.55	2.99	13.18	2.77	4.97	4.42	5.37	5.92
5	17.76	17.30	4.78	3.10	13.70	3.01	5.17	4.63	5.37	6.22
6	18.10	16.93	4.98	3.16	13.48	3.31	5.69	4.78	5.37	6.48
7	17.71	17.93	4.91	3.25	13.89	3.25	5.98	4.88	5.37	6.57
8	17.40	17.65	4.95	3.20	13.89	3.01	5.58	4.84	5.37	6.25
9	18.27	18.58	4.95	3.23	13.97	3.05	5.76	4.94	5.37	6.56

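3. Reading Excel files:

Library: xlrd (recent pandas versions read .xlsx files with openpyxl instead)
File extensions: xls, xlsx

3.1 Reading an Excel file: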

ex_data = pd.read_excel("./scores.xlsx")
ex_data

	Unnamed: 0	一本分数线	Unnamed: 2	二本分数线	Unnamed: 4
0	NaN	文科	理科	文科	理科
1	2018.0	576	532	488	432
2	2017.0	555	537	468	439
3	2016.0	583	548	532	494
4	2015.0	579	548	527	495
5	2014.0	565	543	507	495
6	2013.0	549	550	494	505
7	2012.0	495	477	446	433
8	2011.0	524	484	481	435
9	2010.0	524	494	474	441
10	2009.0	532	501	489	459
11	2008.0	515	502	472	455
12	2007.0	528	531	489	478
13	2006.0	516	528	476	476

# With a two-row header and index_col=0, the output no longer contains Unnamed columns
ex_data = pd.read_excel("./scores.xlsx", header=[0,1],index_col=0)
ex_data

	一本分数线		二本分数线
	文科	理科	文科	理科
2018	576	532	488	432
2017	555	537	468	439
2016	583	548	532	494
2015	579	548	527	495
2014	565	543	507	495
2013	549	550	494	505
2012	495	477	446	433
2011	524	484	481	435
2010	524	494	474	441
2009	532	501	489	459
2008	515	502	472	455
2007	528	531	489	478
2006	516	528	476	476

ex_data.一本分数线

	文科	理科
2018	576	532
2017	555	537
2016	583	548
2015	579	548
2014	565	543
2013	549	550
2012	495	477
2011	524	484
2010	524	494
2009	532	501
2008	515	502
2007	528	531
2006	516	528

ex_data.一本分数线.to_excel("./test.xls")
ex_data2 = pd.read_excel("./test.xls",index_col=0)
ex_data2

4. Reading JSON data:

4.1 read_json

pandas.read_json(path_or_buf=None, orient=None, typ='frame', lines=False)
- Converts JSON data into a pandas object (a DataFrame by default)
- orient : string, Indication of expected JSON string format.
  - 'split' : dict like {index -> [index], columns -> [columns], data -> [values]}
  - 'records' : list like [{column -> value}, … , {column -> value}]
  - 'index' : dict like {index -> {column -> value}}
  - 'columns' : dict like {column -> {index -> value}}, the default format
  - 'values' : just the values array
- lines : boolean, default False
  - Read the file as one JSON object per line
- typ : default 'frame'; the type of object to produce, a Series or a DataFrame
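To show how orient and lines change what read_json expects, here is a small sketch that parses two hypothetical in-memory documents, one in JSON Lines form and one in 'split' form. The field names are made up for the example:

```python
import io

import pandas as pd

# Hypothetical JSON Lines text: one object per line
lines_text = (
    '{"headline": "a", "is_sarcastic": 0}\n'
    '{"headline": "b", "is_sarcastic": 1}\n'
)
from_lines = pd.read_json(io.StringIO(lines_text), orient="records", lines=True)

# Hypothetical 'split' document: index, columns and data stored separately
split_text = '{"index": ["r1", "r2"], "columns": ["x", "y"], "data": [[1, 2], [3, 4]]}'
from_split = pd.read_json(io.StringIO(split_text), orient="split")

print(from_lines)
print(from_split)
```

The orient value has to match the layout of the input; reading a 'split' document with the default orient would not recover the same table.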

help(pd.read_json)

# orient: layout of the JSON; lines: whether each line holds one JSON object
json_data = pd.read_json("./Sarcasm_Headlines_Dataset.json", orient='records',lines=True)
json_data

4.2 to_json

DataFrame.to_json(path_or_buf=None, orient=None, lines=False)
- Stores a pandas object in JSON format
- path_or_buf=None: path of the output file
- orient: layout of the stored JSON, one of {'split', 'records', 'index', 'columns', 'values'}
- lines: write one object per line

json_data[:10].to_json("./test.json", orient='records',lines=True)
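The effect of orient and lines on the output is easiest to see on a tiny frame. In this sketch to_json is called without a path, in which case it returns the JSON as a string; the frame is hypothetical:

```python
import json

import pandas as pd

# Hypothetical two-row frame
df = pd.DataFrame({"x": [1, 2], "y": ["a", "b"]})

as_records = df.to_json(orient="records")            # a single JSON array of row objects
as_lines = df.to_json(orient="records", lines=True)  # one row object per line
as_columns = df.to_json(orient="columns")            # {column -> {index -> value}}

# Parse the strings back with the standard library to inspect their structure
parsed_records = json.loads(as_records)
parsed_lines = [json.loads(line) for line in as_lines.splitlines() if line]
parsed_columns = json.loads(as_columns)

print(parsed_records)
print(parsed_lines)
print(parsed_columns)
```

The line-per-object form is convenient for large files because it can be appended to and processed one line at a time.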
