Pandas Basics: Creating and Accessing Data

Jonathan Shieber, published 2019-07-30 15:42 / 3193 views

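Abstract: This article covers the basic ways to create and access the two core pandas data structures, Series and DataFrame. A Series is a one-dimensional array-like object made up of a set of data (a one-dimensional ndarray) and a matching set of data labels (the index). When no index is specified, an integer index is created automatically. A Series created from a dict can be viewed as a fixed-length ordered dict.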

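Preface

Pandas is one of the best-known data analysis packages for Python. It is built on NumPy and adds higher-level data structures and tools. Pandas revolves around two core data structures, Series and DataFrame. This article focuses on the basic ways to create and access them.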

Series

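A Series is a one-dimensional array-like object made up of a set of data (a one-dimensional ndarray) and an associated set of data labels (the index).
Note: NumPy (Numerical Python) gives Python support for multidimensional arrays through the ndarray, which supports vectorized operations and is fast and memory-efficient.

(1) The pandas documentation describes Series as follows: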

""" One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index. Statistical
methods from ndarray have been overridden to automatically exclude
missing data (currently represented as NaN).

Operations between Series (+, -, /, *, **) align values based on their
associated index values-- they need not be the same length. The result
index will be the sorted union of the two indexes.

Parameters
----------
data : array-like, dict, or scalar value
    Contains data stored in Series
index : array-like or Index (1d)
    Values must be hashable and have the same length as `data`.
    Non-unique index values are allowed. Will default to
    RangeIndex(len(data)) if not provided. If both a dict and index
    sequence are used, the index will override the keys found in the
    dict.
dtype : numpy.dtype or None
    If None, dtype will be inferred
copy : boolean, default False
    Copy input data
"""
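The alignment behaviour described in the docstring can be seen in a small sketch. The two Series below (`s1`, `s2`) and their values are made up for illustration and are not part of the original article.

```python
import pandas as pd

# Two Series with partially overlapping labels and different lengths
s1 = pd.Series([1.0, 2.0, 3.0], index=["c", "a", "b"])
s2 = pd.Series([10.0, 20.0], index=["b", "d"])

# Addition aligns on labels; a label present in only one operand gives NaN
total = s1 + s2
print(total)

# Statistical methods skip NaN by default
print(total.sum())  # 13.0
```

Only the label "b" exists in both operands, so it is the only position with a numeric result; the result index is the sorted union of the two indexes.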

(2) The basic way to create a Series is shown below. The data can be array-like (a list or ndarray), a dict, or a scalar value.

s = pd.Series(data, index=index)

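# Recent pandas versions refuse dtype="int8" for non-integer floats, so convert explicitly (astype truncates toward zero)
s = pd.Series([-1.55666192,-0.75414753,0.47251231,-1.37775038,-1.64899442], index=["a", "b", "c", "d", "e"]).astype("int8")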
a   -1
b    0
c    0
d   -1
e   -1
dtype: int8

s = pd.Series(["a",-0.75414753,123,66666,-1.64899442], index=["a", "b", "c", "d", "e"],)
a           a
b   -0.754148
c         123
d       66666
e    -1.64899
dtype: object

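Note: Series supports the NumPy dtypes, including integers, floats, complex numbers, booleans, and strings. As with an ndarray, if no dtype is given, pandas tries to infer a suitable one. In the examples above, data that mixes numbers and strings is inferred as object, and data converted to int8 is truncated toward zero and displayed as int8.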
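As a rough illustration of dtype inference, the sketch below builds three small Series and inspects their dtypes. The variable names and values are illustrative assumptions. The exact integer width can depend on the platform, so the code checks only the kind of dtype.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])        # all integers -> an integer dtype
floats = pd.Series([1, 2.5, 3])    # one float promotes everything to float64
mixed = pd.Series(["a", 2.5, 3])   # numbers mixed with text -> object
print(ints.dtype, floats.dtype, mixed.dtype)

# Explicit conversion: astype("int8") drops the fractional part
truncated = pd.Series([-1.6, 0.47, 1.9]).astype("int8")
print(truncated.tolist())  # [-1, 0, 1]
```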

s = pd.Series(np.random.randn(5))
0    0.485468
1   -0.912130
2    0.771970
3   -1.058117
4    0.926649
dtype: float64

s.index
RangeIndex(start=0, stop=5, step=1)

s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
a    0.485468
b   -0.912130
c    0.771970
d   -1.058117
e    0.926649
dtype: float64

Note: when no index is specified, Series automatically creates an integer index (a RangeIndex).

s = pd.Series({"a" : 0., "b" : 1., "c" : 2.})
a    0.0
b    1.0
c    2.0
dtype: float64

s = pd.Series({"a" : 0., "b" : 1., "c" : 2.}, index=["b", "c", "d", "a"])
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

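Note: a Series created from a Python dict can be viewed as a fixed-length ordered dict. If only a dict is passed, its keys become the index of the Series. If an index is also passed, the values are looked up by matching keys and placed in index order; labels with no matching key get NaN.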
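The dict-like behaviour can be sketched as follows. The dict name `prices` is a hypothetical placeholder; the rest uses the same data as the example above.

```python
import pandas as pd

prices = {"a": 0.0, "b": 1.0, "c": 2.0}
s = pd.Series(prices, index=["b", "c", "d", "a"])

# Labels requested in the index but absent from the dict hold NaN
missing = s[s.isna()].index.tolist()
print(missing)  # ['d']

# Membership tests look at the index labels, as with dict keys
print("d" in s, "z" in s)  # True False

# dropna() returns a new Series without the missing entries
clean = s.dropna()
print(clean.index.tolist())  # ['b', 'c', 'a']
```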

s = pd.Series(5., index=["a", "b", "c", "d", "e"])
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

Note: the scalar value is repeated to match the length of the index.

(3) Accessing the elements and index of a Series

s = pd.Series({"a" : 0., "b" : 1., "c" : 2.}, index=["b", "c", "d", "a"])
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

s.values
[  1.   2.  nan   0.]

s.index
Index([u"b", u"c", u"d", u"a"], dtype="object")

Note: the values and index attributes return the Series data as an array and its index object.

s["a"]
0.0

s[["a","b"]]
a    0.0
b    1.0
dtype: float64

s[["a","b","c"]]
a    0.0
b    1.0
c    2.0
dtype: float64

s[:2] 
b    1.0
c    2.0
dtype: float64

Note: a single value or a group of values can be selected from a Series by indexing.
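Plain square brackets accept both labels and slices, which can be ambiguous to read. A minimal sketch of the explicit accessors, assuming the same Series as above: `loc` is label-based and `iloc` is position-based.

```python
import pandas as pd

s = pd.Series({"a": 0.0, "b": 1.0, "c": 2.0}, index=["b", "c", "d", "a"])

by_label = s.loc["a"]          # value stored under label "a"
by_position = s.iloc[0]        # first element, whatever its label
label_slice = s.loc["b":"d"]   # label slices include both endpoints
pos_slice = s.iloc[:2]         # position slices exclude the end position

print(by_label, by_position)        # 0.0 1.0
print(label_slice.index.tolist())   # ['b', 'c', 'd']
print(pos_slice.index.tolist())     # ['b', 'c']
```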

DataFrame

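A DataFrame is a tabular (two-dimensional) data structure holding an ordered set of columns, each of which can have a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and can be thought of as a dict of Series that share the same index.

(1) The pandas documentation describes DataFrame as follows: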

""" Two-dimensional size-mutable, potentially heterogeneous tabular
data structure with labeled axes (rows and columns). Arithmetic
operations align on both row and column labels. Can be thought of as a
dict-like container for Series objects. The primary pandas data
structure

Parameters
----------
data : numpy ndarray (structured or homogeneous), dict, or DataFrame
    Dict can contain Series, arrays, constants, or list-like objects
index : Index or array-like
    Index to use for resulting frame. Will default to np.arange(n) if
    no indexing information part of input data and no index provided
columns : Index or array-like
    Column labels to use for resulting frame. Will default to
    np.arange(n) if no column labels are provided
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer
copy : boolean, default False
    Copy data from inputs. Only affects DataFrame / 2d ndarray input
"""

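(2) The basic way to create a DataFrame is shown below. The data can be a dict of lists, one-dimensional ndarrays, or Series (lists and ndarrays must all have the same length), a two-dimensional ndarray, a dict of dicts, and so on.

df = pd.DataFrame(data, index=index)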

df = pd.DataFrame({"one": [1., 2., 3., 5], "two": [1., 2., 3., 4.]})
   one  two
0  1.0  1.0
1  2.0  2.0
2  3.0  3.0
3  5.0  4.0

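Note: when created from a dict of lists, each list becomes one column of the DataFrame. Passing the lists without keys, as in df = pd.DataFrame({[1., 2., 3., 5], [1., 2., 3., 4.]}), does not work: the braces form a set literal, and a list is unhashable.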
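The failure and two working alternatives can be sketched as follows. The variable names are illustrative. Note that the TypeError is raised by Python while building the set, before pandas is called.

```python
import pandas as pd

col_one = [1.0, 2.0, 3.0, 5.0]
col_two = [1.0, 2.0, 3.0, 4.0]

error_message = None
try:
    pd.DataFrame({col_one, col_two})   # set literal of lists
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Dict of lists: each list becomes a column (4 rows, 2 columns)
by_column = pd.DataFrame({"one": col_one, "two": col_two})

# List of lists: each list becomes a row (2 rows, 4 columns)
by_row = pd.DataFrame([col_one, col_two])

print(by_column.shape, by_row.shape)  # (4, 2) (2, 4)
```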

df = pd.DataFrame([[1., 2., 3., 5],[1., 2., 3., 4.]],index=["a", "b"],columns=["one","two","three","four"])
   one  two  three  four
a  1.0  2.0    3.0   5.0
b  1.0  2.0    3.0   4.0

Note: a nested list creates a table of 2 rows and 4 columns; the index and columns parameters set the row labels and column names.

data = np.zeros((2,), dtype=[("A", "i4"),("B", "f4"),("C", "U10")])  # "U10" is a 10-character text field; the older "a10" code yields bytes on Python 3
[(0,  0., "") (0,  0., "")]

Note: zeros(shape, dtype=float, order="C") returns an array of the given shape and type, filled with zeros.

data[:] = [(1,2.,"Hello"), (2,3.,"World")]        
df = pd.DataFrame(data)
   A    B      C
0  1  2.0  Hello
1  2  3.0  World

df = pd.DataFrame(data, index=["first", "second"])
        A    B      C
first   1  2.0  Hello
second  2  3.0  World

df = pd.DataFrame(data, columns=["C", "A", "B"])
       C  A    B
0  Hello  1  2.0
1  World  2  3.0

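Note: as with Series, a DataFrame gets an automatic index when none is specified; when columns are specified, the columns are arranged in the given order.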

data = {"one" : pd.Series([1., 2., 3.], index=["a", "b", "c"]),
        "two" : pd.Series([1., 2., 3., 4.], index=["a", "b", "c", "d"])}
df = pd.DataFrame(data)
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

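Note: when created from a dict of Series, each Series becomes a column. If no index is specified explicitly, the indexes of the Series are merged (as a union) into the row index of the result. NaN fills in the missing values.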

df = pd.DataFrame(data,index=["d", "b", "a"])
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0

df = pd.DataFrame(data,index=["d", "b", "a"], columns=["two", "three"])
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN

data2 = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
df = pd.DataFrame(data2)
   a   b     c
0  1   2   NaN
1  5  10  20.0

Note: when created from a list of dicts, each dict becomes one row of the DataFrame, and the union of the dict keys becomes the column labels.
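One side effect worth noticing in the output above is that column c prints as a float (20.0) although the input was the integer 20, because NaN forces a float dtype. A small sketch, with `records` and `filled` as illustrative names; filling with 0 is only one possible choice and depends on what a missing value means in the data.

```python
import pandas as pd

records = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
df = pd.DataFrame(records)
print(df.dtypes)   # a and b stay integer, c becomes float64

# Fill the gap, then convert the column back to an integer dtype
filled = df.fillna({"c": 0}).astype({"c": "int64"})
print(filled["c"].tolist())  # [0, 20]
```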

df = pd.DataFrame(data2, index=["first", "second"])
        a   b     c
first   1   2   NaN
second  5  10  20.0

df = pd.DataFrame(data2, columns=["a", "b"])
   a   b
0  1   2
1  5  10

df = pd.DataFrame({("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
                 ("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
                 ("a", "c"): {("A", "B"): 5, ("A", "C"): 6}, 
                 ("b", "a"): {("A", "C"): 7, ("A", "B"): 8},  
                 ("b", "b"): {("A", "D"): 9, ("A", "B"): 10}})
       a              b
       a    b    c    a     b
A B  4.0  1.0  5.0  8.0  10.0
  C  3.0  2.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0

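Note: when created from a dict of dicts, the outer keys become the column index, each inner dict becomes a column, and the inner keys are merged into the row index of the result. Tuple keys, as in this example, produce hierarchical (MultiIndex) labels.

(3) Accessing the elements and index of a DataFrame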
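A reduced version of the example, with only two columns, shows how the tuple keys turn into two-level labels and how a value is addressed with them. The name `nested` is illustrative.

```python
import pandas as pd

nested = {("a", "a"): {("A", "B"): 4, ("A", "C"): 3},
          ("a", "b"): {("A", "B"): 1, ("A", "C"): 2}}
df = pd.DataFrame(nested)

# Both axes carry two levels of labels
print(df.columns.nlevels, df.index.nlevels)  # 2 2

# A full key on each axis addresses one cell: row ("A", "B"), column ("a", "b")
cell = df.loc[("A", "B"), ("a", "b")]
print(cell)  # 1

# Selecting the outer column key returns the sub-frame beneath it
sub = df["a"]
print(list(sub.columns))  # ['a', 'b']
```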

data = {"one" : pd.Series([1., 2., 3.], index=["a", "b", "c"]),
        "two" : pd.Series([1., 2., 3., 4.], index=["a", "b", "c", "d"])}
df = pd.DataFrame(data)
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

df["one"]  # or df.one
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

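Note: a DataFrame column can be retrieved as a Series using dict-style notation or attribute access. The returned Series has the same index as the original DataFrame, and its name attribute is set to the column label.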

df[0:1]
   one  two
a  1.0  1.0

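Note: a slice inside df[...] selects rows by position, not columns; df[0:1] returns the first row only, because the end position is excluded.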
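Square brackets on a DataFrame behave differently depending on what is passed in. The sketch below contrasts the cases using the same df as above; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
                   "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])})

column = df["one"]         # a label selects a column and returns a Series
sub_frame = df[["one"]]    # a list of labels returns a DataFrame
first_row = df[0:1]        # a slice selects rows by position (end excluded)
same_row = df.iloc[0:1]    # explicit positional form of the same selection

print(type(column).__name__, type(sub_frame).__name__)  # Series DataFrame
print(first_row.equals(same_row))  # True
```

Writing the row selection as `df.iloc[0:1]` makes the intent clearer to a reader than the bare slice.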

df.loc["a"]
one    1.0
two    1.0
Name: a, dtype: float64

df.loc[:,["one","two"] ]
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

df.loc[["a",],["one","two"]]
   one  two
a  1.0  1.0

df.loc["a","one"]
1.0

Note: loc selects data by label.

df.iloc[0:2,0:1]  
   one
a  1.0
b  2.0

df.iloc[0:2]  
   one  two
a  1.0  1.0
b  2.0  2.0

df.iloc[[0,2],[0,1]]  # pick arbitrary row positions and column positions
   one  two
a  1.0  1.0
c  3.0  3.0

Note: iloc selects data by integer position.
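One difference between the two accessors is easy to miss: a label slice with loc includes its end label, while a position slice with iloc excludes its end position. A minimal sketch with the same df:

```python
import pandas as pd

df = pd.DataFrame({"one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
                   "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])})

by_label = df.loc["a":"b", "one"]   # rows "a" and "b": end label included
by_position = df.iloc[0:1, 0]       # row 0 only: end position excluded

print(by_label.index.tolist())      # ['a', 'b']
print(by_position.index.tolist())   # ['a']
```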

df.ix["a"]
one    1.0
two    1.0
Name: a, dtype: float64

df.ix["a",["one","two"]]
one    1.0
two    1.0
Name: a, dtype: float64

df.ix["a",[0,1]]
one    1.0
two    1.0
Name: a, dtype: float64

df.ix[["a","b"],[0,1]]
   one  two
a  1.0  1.0
b  2.0  2.0

df.ix[1,[0,1]]
one    2.0
two    2.0
Name: b, dtype: float64

df.ix[[0,1],[0,1]]
   one  two
a  1.0  1.0
b  2.0  2.0

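Note: ix selects rows using a mix of labels and integer positions. The ix indexer was deprecated in pandas 0.20 and removed in pandas 1.0, so the ix examples here run only on older versions; on current versions use loc (labels) or iloc (positions).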

df.ix[df.one>1,:1]
   one
b  2.0
c  3.0

Note: selection by condition; this picks the rows where column one is greater than 1, together with the first column.
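Because ix no longer exists in pandas 1.0 and later, the selections above need loc or iloc on a current installation. The sketch below gives one possible translation of each pattern; where rows are chosen by label and columns by position, `df.columns[...]` is used to turn positions into labels. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
                   "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])})

row_a = df.loc["a"]                                # was df.ix["a"]
row_a_cols = df.loc["a", ["one", "two"]]           # was df.ix["a", ["one", "two"]]
mixed = df.loc[["a", "b"], df.columns[[0, 1]]]     # was df.ix[["a", "b"], [0, 1]]
second_row = df.iloc[1, [0, 1]]                    # was df.ix[1, [0, 1]]
block = df.iloc[[0, 1], [0, 1]]                    # was df.ix[[0, 1], [0, 1]]
filtered = df.loc[df["one"] > 1, df.columns[:1]]   # was df.ix[df.one > 1, :1]

print(filtered)
```

The last line keeps rows b and c only: the NaN in row d compares as False, so that row is dropped by the condition.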

df["one"]=16.8
    one  two
a  16.8  1.0
b  16.8  2.0
c  16.8  3.0
d  16.8  4.0

val = pd.Series([2,2,2],index=["b", "c", "d"])
df["one"]=val
   one  two
a  NaN  1.0
b  2.0  2.0
c  2.0  3.0
d  2.0  4.0

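Note: a column can be modified by assignment. A list or array assigned to a column must match the length of the DataFrame. An assigned Series is aligned exactly on the DataFrame's index, and any gaps are filled with NaN.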
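The two assignment rules can be contrasted in a short sketch. The DataFrame below is a simplified stand-in for the one in the article, and `length_error` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0, 4.0],
                   "two": [1.0, 2.0, 3.0, 4.0]},
                  index=["a", "b", "c", "d"])

# A Series is aligned on the index: row "a" has no match and becomes NaN
df["one"] = pd.Series([2, 2, 2], index=["b", "c", "d"])
print(df["one"].isna().tolist())  # [True, False, False, False]

# A plain list has no labels, so its length must match the number of rows
length_error = None
try:
    df["two"] = [1, 2, 3]
except ValueError as exc:
    length_error = str(exc)
print(length_error)
```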

df["four"]=[3,3,3,3]
   one  two  four
a  NaN  1.0     3
b  2.0  2.0     3
c  2.0  3.0     3
d  2.0  4.0     3

Note: assigning to a column that does not exist creates a new column.

df.index.get_loc("a")
0

df.index.get_loc("b")
1

df.columns.get_loc("one")
0

Note: get_loc returns the integer position of a row or column label.
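One use for get_loc is to start a positional selection from a known label. A minimal sketch with the same df as in this section; `row_pos` and `col_pos` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
                   "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])})

row_pos = df.index.get_loc("b")       # position of row label "b"
col_pos = df.columns.get_loc("two")   # position of column label "two"

value = df.iloc[row_pos, col_pos]     # single cell by position
tail = df.iloc[row_pos:, col_pos]     # from row "b" to the end, column "two"

print(row_pos, col_pos, value)        # 1 1 2.0
print(tail.tolist())                  # [2.0, 3.0, 4.0]
```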
