df.at["20190102", "A"]
File "pandas/_libs/index.pyx", line 81, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 89, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 449, in pandas._libs.index.DatetimeEngine.get_loc
File "pandas/_libs/index.pyx", line 455, in pandas._libs.index.DatetimeEngine._date_check_type
KeyError: "20190102"
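The lookup fails because at[] expects the exact index label, not a date string. A minimal sketch of working alternatives, assuming df uses the same daily DatetimeIndex (dates) seen later in this section:
df.at[pd.Timestamp("2019-01-02"), "A"]  # pass the exact Timestamp label
df.loc["2019-01-02", "A"]               # loc parses date strings on a DatetimeIndex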
Selection by position
Select a single row
df.iloc[3]
Out[71]:
A -0.485980
B -1.281454
C 0.354063
D -1.418858
Name: 2019-01-04 00:00:00, dtype: float64
The arguments to the iloc[] method must be integers (positions), not labels.
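For scalar access by position there is also iat[]; a small sketch (both lines refer to the same cell):
df.iloc[3, 0]  # value at row 3, column 0
df.iat[3, 0]   # same cell, optimized for scalar access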
Select data from specified rows and columns
df.iloc[3:5, 0:2]
Out[72]:
A B
2019-01-04 -0.485980 -1.281454
2019-01-05 -1.122717 -2.789041
df.iloc[:,:]
Out[73]:
A B C D
2019-01-01 0.671622 0.785726 0.392435 0.874692
2019-01-02 -2.420703 -1.116208 -0.346070 0.785941
2019-01-03 1.364425 -0.947641 2.386880 0.585372
2019-01-04 -0.485980 -1.281454 0.354063 -1.418858
2019-01-05 -1.122717 -2.789041 -0.791812 -0.174345
2019-01-06 0.221597 -0.753038 -1.741256 0.287280
df.iloc[[1, 2, 4], [0, 2]]
Out[74]:
A C
2019-01-02 -2.420703 -0.346070
2019-01-03 1.364425 2.386880
2019-01-05 -1.122717 -0.791812
Filter rows using the values of a single column:
df[df.A > 0]
Out[77]:
A B C D
2019-01-01 0.671622 0.785726 0.392435 0.874692
2019-01-03 1.364425 -0.947641 2.386880 0.585372
2019-01-06 0.221597 -0.753038 -1.741256 0.287280
Filter values across the whole DataFrame that meet a condition
df[df > 0]
Out[78]:
A B C D
2019-01-01 0.671622 0.785726 0.392435 0.874692
2019-01-02 NaN NaN NaN 0.785941
2019-01-03 1.364425 NaN 2.386880 0.585372
2019-01-04 NaN NaN 0.354063 NaN
2019-01-05 NaN NaN NaN NaN
2019-01-06 0.221597 NaN NaN 0.287280
Values that do not meet the condition are set to NaN.
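The same masking can be written with where(), which also lets you choose a replacement value; a brief sketch:
df.where(df > 0)      # identical to df[df > 0]
df.where(df > 0, 0)   # use 0 instead of NaN for the masked-out values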
Filtering with the isin() method
df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]
df2
Out[88]:
A B C D E
2019-01-01 0.671622 0.785726 0.392435 0.874692 one
2019-01-02 -2.420703 -1.116208 -0.346070 0.785941 one
2019-01-03 1.364425 -0.947641 2.386880 0.585372 two
2019-01-04 -0.485980 -1.281454 0.354063 -1.418858 three
2019-01-05 -1.122717 -2.789041 -0.791812 -0.174345 four
2019-01-06 0.221597 -0.753038 -1.741256 0.287280 three
df2["E"].isin(["two", "four"])
Out[89]:
2019-01-01 False
2019-01-02 False
2019-01-03 True
2019-01-04 False
2019-01-05 True
2019-01-06 False
Freq: D, Name: E, dtype: bool
df2[df2["E"].isin(["two", "four"])]
Out[90]:
A B C D E
2019-01-03 1.364425 -0.947641 2.386880 0.585372 two
2019-01-05 -1.122717 -2.789041 -0.791812 -0.174345 four
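To keep the rows that are not in the list, the boolean mask can be inverted with ~; for example:
df2[~df2["E"].isin(["two", "four"])]  # rows whose E value is neither "two" nor "four"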
Create a Series aligned to the same date index:
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates)
s
Out[122]:
2019-01-01 1.0
2019-01-02 3.0
2019-01-03 5.0
2019-01-04 NaN
2019-01-05 6.0
2019-01-06 8.0
Freq: D, dtype: float64
s = s.shift(2)
s
Out[125]:
2019-01-01 NaN
2019-01-02 NaN
2019-01-03 1.0
2019-01-04 3.0
2019-01-05 5.0
2019-01-06 NaN
Freq: D, dtype: float64
Here the values of s are shifted down by two positions; the vacated positions are automatically filled with NaN.
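shift() also accepts a negative period, and newer pandas versions accept a fill_value; a small sketch (fill_value assumes pandas >= 0.24):
s.shift(-2)               # shift in the opposite direction
s.shift(2, fill_value=0)  # fill the vacated positions with 0 instead of NaN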
For operations between objects of different dimensions, pandas automatically aligns and broadcasts along the index:
df.sub(s, axis="index")
Out[128]:
A B C D F
2019-01-01 NaN NaN NaN NaN NaN
2019-01-02 NaN NaN NaN NaN NaN
2019-01-03 0.364425 -1.947641 1.386880 4.0 1.0
2019-01-04 -3.485980 -4.281454 -2.645937 2.0 0.0
2019-01-05 -6.122717 -7.789041 -5.791812 0.0 -1.0
2019-01-06 NaN NaN NaN NaN NaN
Apply
The apply() method applies a function to the data, column by column:
Cumulative sum
df.apply(np.cumsum)
Out[130]:
A B C D F
2019-01-01 0.000000 0.000000 0.392435 5 NaN
2019-01-02 -2.420703 -1.116208 0.046365 10 1.0
2019-01-03 -1.056278 -2.063849 2.433245 15 3.0
2019-01-04 -1.542258 -3.345303 2.787307 20 6.0
2019-01-05 -2.664975 -6.134345 1.995495 25 10.0
2019-01-06 -2.443377 -6.887383 0.254239 30 15.0
Here apply() is used to call np.cumsum; the same result can be obtained directly with df.cumsum():
df.cumsum()
Out[133]:
A B C D F
2019-01-01 0.000000 0.000000 0.392435 5.0 NaN
2019-01-02 -2.420703 -1.116208 0.046365 10.0 1.0
2019-01-03 -1.056278 -2.063849 2.433245 15.0 3.0
2019-01-04 -1.542258 -3.345303 2.787307 20.0 6.0
2019-01-05 -2.664975 -6.134345 1.995495 25.0 10.0
2019-01-06 -2.443377 -6.887383 0.254239 30.0 15.0
Custom functions
Combining a custom function with apply() makes more elaborate data processing possible:
df.apply(lambda x: x.max() - x.min())
Out[134]:
A 3.785129
B 2.789041
C 4.128136
D 0.000000
F 4.000000
dtype: float64
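The same computation with a named helper, as a sketch (value_range is just an illustrative name); apply() passes each column to the function as a Series, and axis=1 would pass rows instead:
def value_range(col):
    return col.max() - col.min()

df.apply(value_range)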
df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar",
"foo", "bar", "foo", "foo"],
"B": ["one", "one", "two", "three",
"two", "two", "one", "three"],
"C": np.random.randn(8),
"D": np.random.randn(8)})
df
Out[166]:
A B C D
0 foo one -1.252153 0.172863
1 bar one 0.238547 -0.648980
2 foo two 0.756975 0.195766
3 bar three -0.933405 -0.320043
4 foo two -0.310650 -1.388255
5 bar two 1.568550 -1.911817
6 foo one -0.340290 -2.141259
Group by column A and compute the sum of the remaining columns:
df.groupby("A").sum()
Out[167]:
C D
A
bar 0.873692 -2.880840
foo -1.817027 -5.833961
Because the sum function cannot be applied to column B (it holds strings), that column is simply dropped from the result.
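If you prefer not to rely on non-numeric columns being dropped silently (newer pandas versions are stricter about this), the numeric columns can be selected explicitly; a hedged sketch:
df.groupby("A")[["C", "D"]].sum()        # aggregate only the chosen columns
df.groupby("A").sum(numeric_only=True)   # or request numeric columns explicitly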
Group by columns A and B and compute the sum:
df.groupby(["A", "B"]).sum()
Out[168]:
C D
A B
bar one 0.238547 -0.648980
three -0.933405 -0.320043
two 1.568550 -1.911817
foo one -1.592443 -1.968396
three -0.670909 -2.673075
two 0.446325 -1.192490
This produces a result with a multi-level index (MultiIndex).
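If a flat table is preferred, the MultiIndex can be turned back into ordinary columns, for example:
df.groupby(["A", "B"]).sum().reset_index()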
Reshaping
Stacking
Python's zip function packs the corresponding elements of its arguments into tuples:
tuples = list(zip(["bar", "bar", "baz", "baz",
"foo", "foo", "qux", "qux"],
["one", "two", "one", "two",
"one", "two", "one", "two"]))
tuples
Out[172]:
[("bar", "one"),
("bar", "two"),
("baz", "one"),
("baz", "two"),
("foo", "one"),
("foo", "two"),
("qux", "one"),
("qux", "two")]
## Set up a two-level MultiIndex
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
index
Out[174]:
MultiIndex(levels=[["bar", "baz", "foo", "qux"], ["one", "two"]],
codes=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
names=["first", "second"])
## Create the DataFrame
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
df
Out[176]:
A B
first second
bar one -0.501215 -0.947993
two -0.828914 0.232167
baz one 1.245419 1.006092
two 1.016656 -0.441073
foo one 0.479037 -0.500034
two -1.113097 0.591696
qux one -0.014760 -0.320735
two -0.648743 1.499899
## Take a slice of the DataFrame
df2 = df[:4]
df2
Out[179]:
A B
first second
bar one -0.501215 -0.947993
two -0.828914 0.232167
baz one 1.245419 1.006092
two 1.016656 -0.441073
The stack() method compresses the columns into the index, turning the two-dimensional data into one-dimensional data:
stacked = df2.stack()
stacked
Out[181]:
first second
bar one A -0.501215
B -0.947993
two A -0.828914
B 0.232167
baz one A 1.245419
B 1.006092
two A 1.016656
B -0.441073
dtype: float64
The corresponding inverse operation is the unstack() method:
stacked.unstack()
Out[182]:
A B
first second
bar one -0.501215 -0.947993
two -0.828914 0.232167
baz one 1.245419 1.006092
two 1.016656 -0.441073
stacked.unstack(1)
Out[183]:
second one two
first
bar A -0.501215 -0.828914
B -0.947993 0.232167
baz A 1.245419 1.016656
B 1.006092 -0.441073
stacked.unstack(0)
Out[184]:
first bar baz
second
one A -0.501215 1.245419
B -0.947993 1.006092
two A -0.828914 1.016656
B 0.232167 -0.441073
By default unstack() operates on the last level; a different level can be specified with its argument.
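Levels can also be referred to by name instead of position; for example:
stacked.unstack("second")   # same as stacked.unstack(1)
stacked.unstack("first")    # same as stacked.unstack(0)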
Pivot tables
df = pd.DataFrame({"A": ["one", "one", "two", "three"] * 3,
"B": ["A", "B", "C"] * 4,
"C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
"D": np.random.randn(12),
"E": np.random.randn(12)})
df
Out[190]:
A B C D E
0 one A foo -0.933264 -2.387490
1 one B foo -0.288101 0.023214
2 two C foo 0.594490 0.418505
3 three A bar 0.450683 1.939623
4 one B bar 0.243897 -0.965783
5 one C bar -0.705494 -0.078283
6 two A foo 1.560352 0.419907
7 three B foo 0.199453 0.998711
8 one C foo 1.426861 -1.108297
9 one A bar -0.570951 -0.022560
10 two B bar -0.350937 -1.767804
11 three C bar 0.983465 0.065792
The pivot_table() method makes it easy to pivot rows into columns:
pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])
Out[191]:
C bar foo
A B
one A -0.570951 -0.933264
B 0.243897 -0.288101
C -0.705494 1.426861
three A 0.450683 NaN
B NaN 0.199453
C 0.983465 NaN
two A NaN 1.560352
B -0.350937 NaN
C NaN 0.594490
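pivot_table() aggregates duplicate combinations (mean by default); a sketch using sum and filling the missing combinations with 0:
pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"],
               aggfunc="sum", fill_value=0)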
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
"raw_grade": ["a", "b", "b", "a", "a", "e"]})
df
Out[255]:
id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 a
5 6 e
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
Out[257]:
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
Rename the categories via the .cat accessor:
df["grade"].cat.categories = ["very good", "good", "very bad"]
df["grade"]
Out[269]:
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (3, object): [very good, good, very bad]
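In newer pandas versions, assigning to cat.categories directly is deprecated; rename_categories() is the supported equivalent, for example:
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])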
Reset the categories (categories not present in the data can also be added):
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium","good", "very good"])
df["grade"]
Out[271]:
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
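Passing ordered=True makes the categories comparable, so range filters become possible; a brief sketch:
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"], ordered=True)
df[df["grade"] > "medium"]   # keeps only the "good" and "very good" rows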
Sorting
Sorting follows the order of the categories, not lexical order:
df.sort_values(by="grade")
Out[272]:
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good
Grouping
Grouping by a categorical column also shows the empty categories:
df.groupby("grade").size()
Out[273]:
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64
>>> if pd.Series([False, True, False]):
... print("I was true")
Traceback
...
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
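The intended test can be expressed by reducing the Series to a single boolean first, for example:
>>> if pd.Series([False, True, False]).any():
...     print("I was true")
I was true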