Pandas Basics

2024-01-02

cs-base

pandas

python

pandas

Pandas Introduction

Pandas is an open-source library for data analysis and data processing, based on the Python programming language.

Pandas provides easy-to-use data structures and data analysis tools, especially suitable for handling structured data such as tabular data (similar to Excel spreadsheets).

Pandas is one of the commonly used tools in data science and analytics, enabling users to easily import data from various data sources and perform efficient operations and analysis on the data.

Pandas introduces two main data structures, Series and DataFrame:

Series: a one-dimensional labeled array, similar to a list, made up of a set of values and their associated labels (the index). A Series can be a single column of a DataFrame or stand on its own.
DataFrame: a two-dimensional table and the most important data structure in Pandas. It can be seen as several Series arranged as columns. Because it has both a row index and column labels, selecting, filtering, and merging by row or column is straightforward.
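
The relationship between the two structures can be sketched in a few lines. This is a minimal illustration, not part of the original article; the site names, ages, and labels "a", "b", "c" are made-up sample values.

```python
import pandas as pd

# Two Series that share the same index labels (sample values)
site = pd.Series(["Google", "Runoob", "Wiki"], index=["a", "b", "c"], name="Site")
age = pd.Series([10, 12, 13], index=["a", "b", "c"], name="Age")

# Arrange the two Series side by side as the columns of one table
df = pd.concat([site, age], axis=1)
print(df)

# Selecting a single column gives back a Series
age_column = df["Age"]
print(type(age_column).__name__)
```

Each Series name becomes a column label, and the shared index becomes the row index of the table.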

Pandas provides a rich set of features, including:

Data cleaning: handling missing data, duplicate data, etc.
Data transformation: changing the shape, structure, or format of data.
Data analysis: performing statistical analysis, aggregation, grouping, etc.
Data visualization: by integrating libraries such as Matplotlib and Seaborn, you can perform data visualization.
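
As a small taste of the analysis features listed above, the following sketch groups rows and aggregates them. The column names `team` and `score` and their values are hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample data: two teams with two scores each
df = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "score": [10, 20, 30, 50],
})

# Group the rows by team, then take the mean score of each group
mean_by_team = df.groupby("team")["score"].mean()
print(mean_by_team)
```

The result is a Series indexed by team, holding one aggregated value per group.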

Pandas Installation

Install Python

Download it from the official site, or use Docker.

Install Pandas

```
pip install pandas
```

Verify the installation:
```
import pandas as pd
print(pd.__version__)
```

Pandas Series

Structure

Index: every Series has an index, whose labels can be integers, strings, dates, and so on. If no index is given, Pandas creates a default integer index starting at 0.
Data type: a Series has a single dtype, such as integer, float, or string. Values of mixed types are stored with the generic object dtype.
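
How pandas picks the data type can be checked directly. The sketch below uses made-up values: a list of integers, a list mixing integers with a float, and a list mixing numbers with a string.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])         # only integers
floats = pd.Series([1, 2.5, 3])     # integers mixed with a float
mixed = pd.Series([1, "two", 3.0])  # numbers mixed with a string

# Each Series reports exactly one dtype
print(ints.dtype)
print(floats.dtype)
print(mixed.dtype)
```

The all-integer Series gets an integer dtype, the second is promoted to a float dtype, and the mixed one falls back to object.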

```
pandas.Series(data, index, dtype, name, copy)

# data: the values (array-like such as a list or ndarray, a dict, or a scalar).
# index: labels for the values; defaults to 0, 1, 2, ... if not given.
# dtype: data type; inferred from the data if not given.
# name: name of the Series.
# copy: whether to copy the input data; defaults to False.
```

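Examples

Using Series

```
import pandas as pd

a = [1, 2, 3]
myvar = pd.Series(a)
print(myvar)     # the values next to their default labels 0, 1, 2
print(myvar[1])  # the value at label 1
```

The first print shows the three values next to their default labels 0, 1, 2, followed by the dtype; the second prints 2.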

Setting the index with pd.Series

```
import pandas as pd

a = ["Google", "Runoob", "Wiki"]
myvar = pd.Series(a, index=["x", "y", "z"])
print(myvar)
print(myvar["y"])  # look up a value by its label
```

Creating from a dictionary

```
import pandas as pd

# The dictionary keys become the index labels
sites = {1: "Google", 2: "Runoob", 3: "Wiki"}
myvar = pd.Series(sites)
print(myvar)

# Keep only the entries with keys 1 and 2, and give the Series a name
myvar = pd.Series(sites, index=[1, 2], name="RUNOOB-Series-TEST")
print(myvar)
```

Basic Operations

Basic Operations

```
import pandas as pd

# Illustrative sample data so the snippet runs on its own
series = pd.Series([1, 2, 3, 4, 5])
series_with_index = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Get a single value
value = series[2]  # value at label 2

# Get multiple values
subset = series[1:4]  # slice by position: elements 1 to 3

# Use a custom index
value = series_with_index['b']  # value at label 'b'

# Iterate over index/value pairs
for index, value in series_with_index.items():
    print(f"Index: {index}, Value: {value}")
```

Arithmetic Operations

```
import numpy as np
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5])  # illustrative sample data

# Arithmetic
result = series * 2  # multiply every element by 2

# Filtering
filtered_series = series[series > 2]  # keep elements greater than 2

# Mathematical functions
result = np.sqrt(series)  # square root of every element
```

Attributes and Methods

```
import pandas as pd

series_with_index = pd.Series([10, 20, 30], index=['a', 'b', 'c'])  # illustrative sample data

# Get the index
index = series_with_index.index

# Get the values as an array
values = series_with_index.values

# Get descriptive statistics
stats = series_with_index.describe()

# Get the index labels of the maximum and minimum values
max_index = series_with_index.idxmax()
min_index = series_with_index.idxmin()
```

Notes
- The values in a Series keep their order.
- A Series can be viewed as a one-dimensional array with an index.
- Index labels are usually unique, but they do not have to be.
- The data can be a scalar, a list, a NumPy array, and so on.
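
The note about non-unique indexes matters in practice, because a lookup on a repeated label returns several values instead of one. A minimal sketch with made-up labels and values:

```python
import pandas as pd

# The label "a" appears twice (sample data)
s = pd.Series([10, 20, 30], index=["a", "b", "a"])

print(s.index.is_unique)

# Looking up a repeated label returns a Series with every match
repeated = s.loc["a"]
print(repeated)
```

Checking `index.is_unique` before relying on label lookups is a cheap way to catch this case.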

Pandas DataFrame

DataFrame Structure

Columns and rows: a DataFrame is made up of columns; each column has a name and can be treated as a Series. The DataFrame also has a row index that identifies each row.
Two-dimensional structure: a DataFrame is a two-dimensional table with rows and columns, and can be viewed as a dictionary of Series objects that share one index.
Column data types: different columns can hold different data types, such as integers, floats, and strings.

```
pandas.DataFrame(data, index, columns, dtype, copy)

# data: the values (ndarray, Series, list, dict, and similar types).
# index: the index values, also called row labels.
# columns: column labels; defaults to RangeIndex (0, 1, 2, ..., n).
# dtype: data type; inferred from the data if not given.
# copy: whether to copy the input data.
```

DataFrame Examples

Using DataFrame

```
import pandas as pd

# Each inner list is one row
data = [['Google', 10], ['Runoob', 12], ['Wiki', 13]]
df = pd.DataFrame(data, columns=['Site', 'Age'])
print(df)
```

Creating from a dictionary of lists (or ndarrays)

```
import pandas as pd

# Each key is a column name; each list holds that column's values
data = {'Site': ['Google', 'Runoob', 'Wiki'], 'Age': [10, 12, 13]}
df = pd.DataFrame(data)
print(df)
```

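Creating from a list of dictionaries

```
import pandas as pd

# Each dictionary is one row; the keys are the column names
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)
```

Keys that are missing from a dictionary show up as NaN in that row.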
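
A quick way to see where the gaps are, and what they do to the column type, is sketched below. It reuses the same two dictionaries as the example above.

```python
import pandas as pd

data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)

# True marks the cells that were filled with NaN
print(df.isna())

# Column c holds a NaN, so it is stored as a float column
print(df.dtypes)
```

Because NaN is a floating-point value, an integer column that gains a missing entry is converted to float by default.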

Returning specified rows with loc

The loc attribute returns the row with the given label. If no index was set, the first row has label 0, the second has label 1, and so on.

```
import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

# Load the data into a DataFrame object
df = pd.DataFrame(data)

# Return the first row
print(df.loc[0])
# Return the second row
print(df.loc[1])

# Return the first and second rows
print(df.loc[[0, 1]])

# "duration" is a column, not a row label, so df.loc["duration"]
# raises a KeyError; select the column instead
print(df["duration"])
```

pd.DataFrame with a specified index

```
import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

# Use day1..day3 as row labels instead of 0..2
df = pd.DataFrame(data, index=["day1", "day2", "day3"])
print(df)
```
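
Once the rows have named labels, loc looks rows up by label while iloc still works by position. The sketch below reuses the day1..day3 data from the example above; `iloc` is not covered in the original text and is shown here only for contrast.

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

by_label = df.loc["day2"]          # row whose label is "day2"
by_position = df.iloc[1]           # second row, regardless of its label
cell = df.loc["day3", "calories"]  # one cell: row label, then column label

print(by_label)
print(cell)
```

Here `by_label` and `by_position` refer to the same row, since "day2" is the second label.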

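Basic DataFrame Operations

Basic Operations

```
import pandas as pd

# Illustrative sample data so the snippet runs on its own
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Carol'], 'Age': [25, 30, 35], 'City': ['Paris', 'London', 'Tokyo']})

# Get a column
name_column = df['Name']

# Get a row by its label
first_row = df.loc[0]

# Select multiple columns
subset = df[['Name', 'Age']]

# Filter rows
filtered_rows = df[df['Age'] > 30]
```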

Data Manipulation

```
# df is assumed to be a DataFrame with three rows and the columns Name, Age, and City

# Add a new column (one value per row)
df['Salary'] = [50000, 60000, 70000]

# Delete a column
df.drop('City', axis=1, inplace=True)

# Sort by Age, largest first
df.sort_values(by='Age', ascending=False, inplace=True)

# Rename a column
df.rename(columns={'Name': 'Full Name'}, inplace=True)
```

Attributes and Methods

```
# Get the column names
columns = df.columns

# Get the shape as (number of rows, number of columns)
shape = df.shape

# Get the row index
index = df.index

# Get descriptive statistics for the numeric columns
stats = df.describe()
```

Creating from external data sources

```
import pandas as pd

# Create a DataFrame from a CSV file
df_csv = pd.read_csv('example.csv')

# Create a DataFrame from an Excel file (needs an Excel engine such as openpyxl installed)
df_excel = pd.read_excel('example.xlsx')

# Create a DataFrame from a list of dictionaries
data_list = [{'Name': 'Alice', 'Age': 25}, {'Name': 'Bob', 'Age': 30}]
df_from_list = pd.DataFrame(data_list)
```

Notes
- A DataFrame is a flexible data structure whose columns can have different data types.
- Column names and row labels can be strings, integers, and so on.
- A DataFrame can be selected from, filtered, modified, and analyzed in many ways.
- Operations on a DataFrame cover data cleaning, transformation, analysis, and visualization.

Pandas CSV

Introduction

CSV (Comma-Separated Values; sometimes also called Character-Separated Values, since the separator does not have to be a comma) stores tabular data (numbers and text) in plain text files.

CSV is a general-purpose, relatively simple file format widely used by individuals, businesses, and scientists.

Handling CSVs

Pandas can read a CSV file in one call:

```
import pandas as pd

df = pd.read_csv('site.csv')
print(df.to_string())  # to_string() prints every row instead of a truncated view
```

Storing CSVs

Use the to_csv() method to store a DataFrame as a CSV file:

```
import pandas as pd

# Three fields: name, site, age
nme = ["Google", "Runoob", "Taobao", "Wiki"]
st = ["www.google.com", "www.runoob.com", "www.taobao.com", "www.wikipedia.org"]
ag = [90, 40, 80, 98]

# Collect the fields in a dictionary (named data to avoid shadowing the built-in dict)
data = {'name': nme, 'site': st, 'age': ag}
df = pd.DataFrame(data)

# Save the DataFrame
df.to_csv('site.csv')
```
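
By default, to_csv() also writes the row index as an unnamed first column. The sketch below shows the difference that `index=False` makes. To stay self-contained it writes to a string rather than to site.csv, and the two sample rows are made up.

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["Google", "Runoob"], "age": [90, 40]})

# Without a path argument, to_csv() returns the CSV text
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # header line including the index column
print(without_index.splitlines()[0])  # header line with the data columns only

# Reading the index-free text back gives the original table
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip)
```

Passing `index=False` avoids an extra "Unnamed: 0" column when the file is read back later.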

Data Processing

head()

head(n) returns the first n rows; if n is omitted, it defaults to 5.

```
import pandas as pd

df = pd.read_csv('nba.csv')
print(df.head())    # first 5 rows

print(df.head(10))  # first 10 rows
```

tail()

tail(n) returns the last n rows; if n is omitted, it defaults to 5. Empty fields in those rows are shown as NaN.

```
import pandas as pd

df = pd.read_csv('nba.csv')
print(df.tail())    # last 5 rows

print(df.tail(10))  # last 10 rows
```

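info()

info() prints basic information about the table: the index range, the column names, the non-null count and dtype of each column, and the memory usage. It returns None, so call it directly rather than wrapping it in print().

```
import pandas as pd

df = pd.read_csv('nba.csv')
df.info()
```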

Pandas JSON

JSON (JavaScript Object Notation) is a text format for storing and exchanging data, similar in purpose to XML.

JSON is generally more compact and easier to parse than XML; see a JSON tutorial for more details.

Pandas can read JSON data in one call.

Plain JSON Handling

```
import pandas as pd

# Read from a local file
df = pd.read_json('sites.json')
print(df.to_string())

# Read from a URL
URL = 'https://static.runoob.com/download/sites.json'
df = pd.read_json(URL)
print(df)
```

A JSON object has the same shape as a Python dictionary, so a Python dictionary (or a list of dictionaries) can be converted directly into a DataFrame.
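
The correspondence between dictionaries and JSON can be sketched as follows. The field names `id`, `name`, and `likes` and their values are hypothetical; the contents of sites.json are not shown in the article.

```python
import io
import json
import pandas as pd

# Hypothetical records, written as ordinary Python dictionaries
records = [
    {"id": "A001", "name": "Google", "likes": 111},
    {"id": "A002", "name": "Runoob", "likes": 222},
]

# Build a DataFrame straight from the dictionaries
df_from_dicts = pd.DataFrame(records)

# The same records serialized to JSON text and parsed by read_json
json_text = json.dumps(records)
df_from_json = pd.read_json(io.StringIO(json_text))

print(df_from_dicts)
print(df_from_json)
```

Both routes produce a table with the same columns and values.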

Nested JSON Handling

Use the json_normalize() method to flatten nested data into a table.

```
import json
import pandas as pd

# Load the data with Python's json module
with open('nested_list.json', 'r') as f:
    data = json.load(f)

# Flatten the list stored under the "students" key into rows
df_nested_list = pd.json_normalize(data, record_path=['students'])
print(df_nested_list)
```

More complex data

```
import json
import pandas as pd

# Load the data with Python's json module
with open('nested_mix.json', 'r') as f:
    data = json.load(f)

# record_path picks the list to expand into rows;
# meta lists the outer fields to repeat on every row
df = pd.json_normalize(
    data,
    record_path=['students'],
    meta=[
        'class',
        ['info', 'president'],
        ['info', 'contacts', 'tel']
    ]
)

print(df)
```
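
Since nested_mix.json itself is not shown, here is a self-contained sketch with an in-memory dictionary. Its layout is an assumption chosen to match the `record_path` and `meta` arguments used above, and all values are made up.

```python
import pandas as pd

# Assumed structure: top-level fields plus a list of student records
data = {
    "class": "Year 1",
    "info": {"president": "John Kasich", "contacts": {"tel": "123456789"}},
    "students": [
        {"id": "A001", "name": "Tom", "math": 60},
        {"id": "A002", "name": "James", "math": 89},
    ],
}

df = pd.json_normalize(
    data,
    record_path=["students"],
    meta=["class", ["info", "president"], ["info", "contacts", "tel"]],
)

# Nested meta paths are joined with dots to form column names
print(df.columns.tolist())
print(df)
```

Each student becomes one row, and the outer fields are repeated on every row under names such as `info.contacts.tel`.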

Reading a Group of Data from Nested JSON

The glom module handles nested data and lets us reach nested attributes with a dotted path such as 'grade.math'.

Install glom
```
pip3 install glom
```

Usage

```
import pandas as pd
from glom import glom

df = pd.read_json('nested_deep.json')

# For each student record, follow the path grade -> math
data = df['students'].apply(lambda row: glom(row, 'grade.math'))
print(data)
```

Data Cleaning

Cleaning Missing Values

To delete rows that contain empty fields, use the dropna() method. Its syntax is:

```
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

# axis: defaults to 0, which drops rows that contain missing values; axis=1 drops columns instead.
# how: defaults to 'any', which drops a row (or column) if any value in it is NA; how='all' drops it only when every value is NA.
# thresh: the minimum number of non-NA values a row (or column) needs in order to be kept.
# subset: the columns to check for missing values; pass a list of column names for several columns.
# inplace: if True, modify the DataFrame itself and return None.
```
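
The effect of each parameter is easiest to see on a tiny table. In the sketch below the columns `a`, `b`, `c` and their values are made up so that every option keeps a different set of rows.

```python
import numpy as np
import pandas as pd

# Row 0 is complete, row 1 has one value, row 2 is empty, row 3 has two values
df = pd.DataFrame({
    "a": [1, np.nan, np.nan, 4],
    "b": [1, 2, np.nan, np.nan],
    "c": [1, np.nan, np.nan, 4],
})

any_na = df.dropna()                # drop rows with at least one NA
all_na = df.dropna(how="all")       # drop rows where every value is NA
at_least_two = df.dropna(thresh=2)  # keep rows with 2 or more non-NA values
only_b = df.dropna(subset=["b"])    # only look at column b

print(any_na.index.tolist())
print(all_na.index.tolist())
print(at_least_two.index.tolist())
print(only_b.index.tolist())
```

None of these calls changes `df` itself, because `inplace` is left at its default of False.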

isnull() reports, cell by cell, whether a value is missing.

read_csv accepts an na_values argument that lists extra strings to treat as missing.

```
import pandas as pd

df = pd.read_csv('property-data.csv')

print(df['NUM_BEDROOMS'])
print(df['NUM_BEDROOMS'].isnull())

# Treat these strings as missing values when reading
missing_values = ["n/a", "na", "--"]
df = pd.read_csv('property-data.csv', na_values=missing_values)

print(df['NUM_BEDROOMS'])
print(df['NUM_BEDROOMS'].isnull())

# Drop rows with missing data; this returns a new DataFrame
new_df = df.dropna()
print(new_df.to_string())

# Drop only the rows whose ST_NUM is missing, modifying df in place
df.dropna(subset=['ST_NUM'], inplace=True)
print(df.to_string())

# Drop every remaining row with a missing value, in place
df.dropna(inplace=True)
print(df.to_string())
```

Using fillna() to replace missing values

A common way to replace empty cells is with the column's mean, median, or mode.

Pandas provides mean(), median(), and mode() to compute the mean (the sum of the values divided by their count), the median (the middle value after sorting), and the mode (the most frequent value).

```
import pandas as pd

df = pd.read_csv('property-data.csv')

# Replace every missing field with 12345 (returns a new DataFrame, df is unchanged)
print(df.fillna(12345).to_string())

# Replace missing ST_NUM values with the column mean
x = df["ST_NUM"].mean()
print(df["ST_NUM"].fillna(x))

# Replace missing ST_NUM values with the column median
x = df["ST_NUM"].median()
print(df["ST_NUM"].fillna(x))

# mode() returns a Series because several values can tie, so take the first one
x = df["ST_NUM"].mode()[0]
# Assign the result back instead of calling fillna(inplace=True) on the column
df["ST_NUM"] = df["ST_NUM"].fillna(x)
print(df.to_string())
```
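
When different columns need different replacement values, fillna() also accepts a dictionary that maps column names to fill values. The sketch below borrows the column names from property-data.csv, but the rows are made up because the file contents are not shown.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names follow property-data.csv
df = pd.DataFrame({
    "ST_NUM": [104.0, np.nan, 201.0, 201.0],
    "NUM_BEDROOMS": [3.0, np.nan, 1.0, 2.0],
})

# One fill value per column: the mode for ST_NUM, the median for NUM_BEDROOMS
fill_values = {
    "ST_NUM": df["ST_NUM"].mode()[0],
    "NUM_BEDROOMS": df["NUM_BEDROOMS"].median(),
}

filled = df.fillna(fill_values)
print(filled)
```

The original `df` keeps its missing values; only the returned `filled` table has them replaced.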

Cleaning Incorrectly Formatted Data

Cells with inconsistent formats can make data analysis difficult or even impossible.

We can address this either by dropping the affected rows or by converting every cell in the column to the same format.

```
import pandas as pd

# The third date uses a different format from the first two
data = {
  "Date": ['2020/12/01', '2020/12/02', '20201226'],
  "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# In recent pandas versions (2.0 and later) the plain call below raises an error,
# because the third value does not match the format of the first;
# format='mixed' parses each value individually
# pd.to_datetime(df['Date'])
df['Date'] = pd.to_datetime(df['Date'], format='mixed')
print(df.to_string())
```
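
An alternative, sketched here as an assumption rather than something the article prescribes, is to enforce one explicit format and let `errors='coerce'` turn every non-matching value into NaT. That makes the badly formatted rows easy to find before deciding how to repair them.

```python
import pandas as pd

dates = pd.Series(["2020/12/01", "2020/12/02", "20201226"])

# Values that do not match the format become NaT instead of raising an error
parsed = pd.to_datetime(dates, format="%Y/%m/%d", errors="coerce")

# Use the NaT positions to list the original strings that failed to parse
bad_rows = dates[parsed.isna()]

print(parsed)
print(bad_rows)
```

This trades the convenience of `format='mixed'` for a strict check: only values in the expected format survive.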

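Using astype to modify data types

```
# data is assumed to be a DataFrame with a numeric column named '语文' (Chinese-language scores);
# drop the missing values first, since NaN cannot be converted to int
data['语文'].dropna().astype('int')
```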

Cleaning Erroneous Data

```
import pandas as pd

person = {
  "name": ['Google', 'Runoob', 'Taobao'],
  "age": [50, 200, 12345]
}

df = pd.DataFrame(person)

# Fix a single known-bad value directly
df.loc[2, 'age'] = 30

# Loop over the rows and cap any age above 120
for x in df.index:
  if df.loc[x, "age"] > 120:
    df.loc[x, "age"] = 120

# Alternatively, delete the rows with an age above 120
# (nothing is left to delete here, since the values were just capped)
for x in df.index:
  if df.loc[x, "age"] > 120:
    df.drop(x, inplace=True)

print(df.to_string())
```
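
The same two repairs can be written without explicit loops by using a boolean mask. This is a sketch of an alternative, not the method used above; it starts again from the unmodified sample data.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Google", "Runoob", "Taobao"], "age": [50, 200, 12345]})

# Option 1: cap every age above 120, working on a copy
capped = df.copy()
capped.loc[capped["age"] > 120, "age"] = 120

# Option 2: keep only the rows with a plausible age
kept = df[df["age"] <= 120]

print(capped)
print(kept)
```

Mask-based updates handle all matching rows in one step, which is usually shorter and faster than iterating over the index.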

Cleaning Duplicate Data

To clean duplicate data, use the duplicated() and drop_duplicates() methods.

duplicated() returns True for each row that repeats an earlier row, and False otherwise.

```
import pandas as pd

person = {
  "name": ['Google', 'Runoob', 'Runoob', 'Taobao'],
  "age": [50, 40, 40, 23]
}
df = pd.DataFrame(person)

# Find duplicated rows (the second Runoob row is flagged)
print(df.duplicated())

# Remove duplicated rows, keeping the first occurrence
df.drop_duplicates(inplace=True)
print(df)
```
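
Both methods compare entire rows by default. The sketch below shows the `subset` and `keep` parameters, which restrict the comparison to chosen columns and choose which occurrence survives. The ages are altered sample values so that the two Runoob rows differ.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Google", "Runoob", "Runoob", "Taobao"],
    "age": [50, 40, 41, 23],
})

# No two rows are identical in every column
full_row = df.duplicated()

# Comparing on name alone flags the second Runoob row
by_name = df.duplicated(subset=["name"])

# Keep the last occurrence of each name instead of the first
keep_last = df.drop_duplicates(subset=["name"], keep="last")

print(full_row.tolist())
print(by_name.tolist())
print(keep_last)
```

Choosing the comparison columns explicitly is useful when rows describe the same entity but differ in a less important field.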

2407 字

6 分钟

Pandas基礎使用

2024-01-02

cs-base

pandas

python

pandas#

pandasの紹介#

Pandas は、Python プログラミング言語をベースにしたオープンソースのデータ分析・データ処理ライブラリです。

Pandas は、使いやすいデータ構造とデータ分析ツールを提供し、特に表形式データ（Excel の表に似たデータなど）の処理に適しています。

Pandas はデータサイエンスおよび分析分野で広く使われているツールの一つで、さまざまなデータソースからデータを容易に取り込み、データを効率的に操作・分析できるようにします。

Pandas は主に2つの新しいデータ構造を導入しました：DataFrame と Series

Series: 一次元配列またはリストと似ており、一組のデータとそれに関連するデータラベル（インデックス）で構成されます。Series は DataFrame の列のようにも、単独の1次元データ構造としても扱えます。
DataFrame: 二次元の表のようなもので、Pandas の中で最も重要なデータ構造です。DataFrame は複数の Series を列方向に並べてできた表で、行インデックスと列インデックスの両方を持つため、行と列の選択、フィルタ、結合などを容易に行えます。

Pandas は豊富な機能を提供します。以下を含みます：

データクリーニング：欠損データ、重複データなどの処理。
データ変換：データの形状・構造・形式を変更。
データ分析：統計分析、集計、グルーピングなど。
データの可視化：Matplotlib や Seaborn などのライブラリと統合してデータの可視化を行うことができます。

pandasのインストール#

Pythonのインストール

公式サイトからダウンロード/ Docker でのインストール
pandasのインストール

pip install pandas

動作確認：

2024-01-02 22:26:35の出力（例）

1
import pandas as pd
2
pd.__version__

pandas series#

構造#

インデックス：各 Series にはインデックスがあり、整数・文字列・日付などの型になり得ます。明示的にインデックスを指定しない場合、Pandasはデフォルトの整数インデックスを自動作成します。
データ型： Series は異なるデータ型の要素を格納できます。整数、浮動小数点数、文字列など。

1
pandas.Series( data, index, dtype, name, copy)
2

3
## data：一組のデータ（ndarray 型）。
4
## index：データのインデックスラベル。指定しなければ 0 から始まるデフォルト。
5
## dtype：データ型。デフォルトは自動判定。
6
## name：名前を設定。
7
## copy：データをコピー。デフォルトは False。

実例#

Series の使用

1
import pandas as pd
2

3
a = [1, 2, 3]
4
myvar = pd.Series(a)
5
print(myvar)
6
print(myvar[1])

出力は：

pd.Series でインデックスを設定

1
import pandas as pd
2

3
a = ["Google", "Runoob", "Wiki"]
4
myvar = pd.Series(a, index = ["x", "y", "z"])
5
print(myvar)
6
print(myvar["y"])

辞書から作成

1
import pandas as pd
2

3
sites = {1: "Google", 2: "Runoob", 3: "Wiki"}
4
myvar = pd.Series(sites)
5
print(myvar)
6

7
myvar = pd.Series(sites, index = [1, 2], name="RUNOOB-Series-TEST" )
8
print(myvar)
9

10
myvar = pd.Series(sites, index = [1, 2], name="RUNOOB-Series-TEST" )
11
print(myvar)

基本操作#

基本操作

1
## 値の取得
2
value = series[2]  ## インデックスが 2 の値を取得
3

4
## 複数の値を取得
5
subset = series[1:4]  ## インデックスが 1 から 3 の値を取得
6

7
## カスタムインデックスを使用
8
value = series_with_index['b']  ## インデックスが 'b' の値を取得
9

10
## インデックスと値の対応関係
11
for index, value in series_with_index.items():
12
    print(f"Index: {index}, Value: {value}")

基本運算

1
## 算術運算
2
result = series * 2  ## 全要素を 2 倍
3

4
## フィルタリング
5
filtered_series = series[series > 2]  ## 2 より大きい要素を選択
6

7
## 数学関数
8
import numpy as np
9
result = np.sqrt(series)  ## 各要素の平方根を取る

属性とメソッド

1
## インデックスの取得
2
index = series_with_index.index
3

4
## 値配列の取得
5
values = series_with_index.values
6

7
## 記述統計情報の取得
8
stats = series_with_index.describe()
9

10
## 最大値・最小値のインデックス取得
11
max_index = series_with_index.idxmax()
12
min_index = series_with_index.idxmin()

注意事項
- Series のデータは有序です。
- Series はインデックス付きの1次元配列と見なすことができます。
- インデックスは一意である必要はありません。
- データはスカラー、リスト、NumPy配列などで構いません。

pandas dataframe#

dataframe構造#

列と行： DataFrame は複数の列で構成され、それぞれの列には名前があり、1つの Series として見ることができます。同時に、DataFrame には行インデックスがあり、各行を識別します。
二次元構造： DataFrame は行と列を持つ二次元の表で、複数の Series オブジェクトからなる辞書のように見ることもできます。
列のデータ型：異なる列は異なるデータ型を含むことができます。例えば整数、浮動小数、文字列など。

1
pandas.DataFrame( data, index, columns, dtype, copy)
2

3
# data：一組のデータ（ndarray、series、map、lists、dict 型）。
4
# index：インデックス値、行ラベルとも呼ばれます
5
# columns：列ラベル、デフォルトは RangeIndex (0, 1, 2, …, n)
6
# dtype：データ型、デフォルトは自動判定。
7
# copy：データをコピー、デフォルトは False。

dataframeの実例#

DataFrame の使用

1
import pandas as pd
2

3
data = [['Google',10],['Runoob',12],['Wiki',13]]
4
df = pd.DataFrame(data,columns=['Site','Age'])
5
print(df)

ndarrays で作成

1
import pandas as pd
2

3
data = {'Site':['Google', 'Runoob', 'Wiki'], 'Age':[10, 12, 13]}
4
df = pd.DataFrame(data)
5
print (df)

辞書リストから作成

1
import pandas as pd
2

3
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
4
df = pd.DataFrame(data)
5
print (df)

対応するデータがない部分は NaN。

loc を使って指定行を返す Pandas は loc 属性を用いて指定した行のデータを返します。インデックスが設定されていない場合、最初の行のインデックスは 0、次の行は 1、以下同様です。

1
import pandas as pd
2

3
data = {
4
  "calories": [420, 380, 390],
5
  "duration": [50, 40, 45]
6
}
7

8
# DataFrame へデータを読み込む
9
df = pd.DataFrame(data)
10

11
# 第1行を返す
12
print(df.loc[0])
13
# 第2行を返す
14
print(df.loc[1])
15

16
# 第1行と第2行を返す
17
print(df.loc[[0, 1]])
18

19
# 指定インデックス
20
print(df.loc["duration"])

pd.DataFrame でインデックスを指定

1
import pandas as pd
2
data = {
3
  "calories": [420, 380, 390],
4
  "duration": [50, 40, 45]
5
}
6

7
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
8
print(df)

dataframeの基本操作#

基本操作

1
# 列の取得
2
name_column = df['Name']
3

4
# 行の取得
5
first_row = df.loc[0]
6

7
# 複数列の選択
8
subset = df[['Name', 'Age']]
9

10
# 行のフィルタ
11
filtered_rows = df[df['Age'] > 30]

データ操作

1
# 新しい列の追加
2
df['Salary'] = [50000, 60000, 70000]
3

4
# 列の削除
5
df.drop('City', axis=1, inplace=True)
6

7
# ソート
8
df.sort_values(by='Age', ascending=False, inplace=True)
9

10
# 列名の変更
11
df.rename(columns={'Name': 'Full Name'}, inplace=True)

属性とメソッド

1
# 列名の取得
2
columns = df.columns
3

4
# 形状の取得（行数と列数）
5
shape = df.shape
6

7
# インデックスの取得
8
index = df.index
9

10
# 記述統計情報の取得
11
stats = df.describe()

外部データ源からの作成

1
# CSV ファイルから DataFrame を作成
2
df_csv = pd.read_csv('example.csv')
3

4
# Excel ファイルから DataFrame を作成
5
df_excel = pd.read_excel('example.xlsx')
6

7
# 辞書リストから DataFrame を作成
8
data_list = [{'Name': 'Alice', 'Age': 25}, {'Name': 'Bob', 'Age': 30}]
9
df_from_list = pd.DataFrame(data_list)

注意事項
- DataFrame は柔軟なデータ構造で、異なるデータ型の列を格納できます。
- 列名と行インデックスは文字列、整数などを含むことがあります。
- DataFrame はデータの選択、フィルタ、修正、分析を多様な方法で行えます。
- DataFrame の操作を通じて、データのクリーニング、変換、分析、可視化などを行うことができます。

pandas CSV#

紹介#

CSV（Comma-Separated Values、カンマ区切り値、時には文字区切り値とも呼ばれる。区切り文字が必ずしもカンマとは限らない）、ファイルはプレーンテキスト形式で表形式データ（数字とテキスト）を保存します。

CSV は一般的で比較的シンプルなファイル形式で、ユーザー・ビジネス・科学の分野で広く利用されています。

CSV の処理 Pandas は CSV ファイルの処理を非常に容易に行えます
```
1
import pandas as pd
2
df = pd.read_csv('site.csv')
3
print(df.to_string())
```

CSV の保存 DataFrame を CSV ファイルとして保存するには to_csv() を使用します

1
import pandas as pd
2

3
# 三つのフィールド name, site, age
4
nme = ["Google", "Runoob", "Taobao", "Wiki"]
5
st = ["www.google.com", "www.runoob.com", "www.taobao.com", "www.wikipedia.org"]
6
ag = [90, 40, 80, 98]
7
# 辞書
8
dict = {'name': nme, 'site': st, 'age': ag}
9
df = pd.DataFrame(dict)
10

11
# DataFrame の保存
12
df.to_csv('site.csv')

Data Processing#

head()#

The head(n) method returns the first n rows. If n is omitted, it returns 5 rows by default.

import pandas as pd

df = pd.read_csv('nba.csv')
print(df.head())

print(df.head(10))

tail()#

The tail(n) method returns the last n rows. If n is omitted, it returns 5 rows by default. In an empty row, every field has the value NaN.

import pandas as pd

df = pd.read_csv('nba.csv')
print(df.tail())

print(df.tail(10))
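
Since `nba.csv` is not bundled with the post, here is a small sketch of `head()` and `tail()` on a generated 20-row DataFrame (the column name `n` is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"n": range(20)})  # stand-in for nba.csv

print(len(df.head()))            # 5
print(len(df.tail(10)))          # 10
print(df.head(3)["n"].tolist())  # [0, 1, 2]
print(df.tail(3)["n"].tolist())  # [17, 18, 19]
```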

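info()#

The info() method prints basic information about the table, such as the column names, non-null counts, and dtypes. It returns None, so call it directly rather than wrapping it in print():

import pandas as pd

df = pd.read_csv('nba.csv')
df.info()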
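
Because `info()` writes its summary to standard output and returns `None`, capturing the text takes one extra step. A sketch using the `buf` parameter, on made-up data:

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["Google", "Runoob"], "age": [90, 40]})

result = df.info()   # prints the summary; the return value is None
print(result)        # None

buffer = io.StringIO()
df.info(buf=buffer)  # write the summary into a text buffer instead
summary = buffer.getvalue()
print(summary)
```

This is handy when the summary should go into a log file rather than the console.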

Pandas JSON#

JSON (JavaScript Object Notation) is a syntax for storing and exchanging text information, similar in purpose to XML.

JSON is typically smaller than XML, faster, and easier to parse. For more about JSON, see a JSON tutorial.

pandas makes it easy to work with JSON data.

Plain JSON Handling#

import pandas as pd

df = pd.read_json('sites.json')
print(df.to_string())

# read_json also accepts a URL
URL = 'https://static.runoob.com/download/sites.json'
df = pd.read_json(URL)
print(df)

A JSON object has the same layout as a Python dictionary, so a Python dictionary can be converted to a DataFrame directly.
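
As a sketch of that last point, a dictionary of columns can be passed straight to `pd.DataFrame`. The field names and values here are hypothetical, not the contents of `sites.json`:

```python
import pandas as pd

# Hypothetical data, shaped like a JSON object whose keys are column names
data = {
    "id": ["A001", "A002", "A003"],
    "name": ["Google", "Runoob", "Taobao"],
    "likes": [124, 367, 45],
}

df = pd.DataFrame(data)
print(df)
```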

Nested JSON Handling#

To flatten nested data completely, use the json_normalize() method.

import pandas as pd
import json

# Load the data with Python's json module
with open('nested_list.json', 'r') as f:
    data = json.load(f)

# Flatten the data: one row per entry in 'students'
df_nested_list = pd.json_normalize(data, record_path=['students'])
print(df_nested_list)
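
The file `nested_list.json` is not shown in the post. The sketch below uses an inline dictionary with an assumed structure (a `students` list inside a top-level object) to show what `record_path` changes:

```python
import pandas as pd

# Hypothetical stand-in for the contents of nested_list.json
data = {
    "school_name": "ABC primary school",
    "class": "Year 1",
    "students": [
        {"id": "A001", "name": "Tom", "math": 60},
        {"id": "A002", "name": "James", "math": 89},
    ],
}

flat = pd.json_normalize(data)  # 'students' stays one cell holding the whole list
df_students = pd.json_normalize(data, record_path=["students"])  # one row per student

print(flat)
print(df_students)
```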

More complex data

import pandas as pd
import json

# Load the data with Python's json module
with open('nested_mix.json', 'r') as f:
    data = json.load(f)

# record_path picks the list to expand into rows;
# meta lists the outer fields to repeat on every row
df = pd.json_normalize(
    data,
    record_path=['students'],
    meta=[
        'class',
        ['info', 'president'],
        ['info', 'contacts', 'tel']
    ]
)

print(df)
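
To make the effect of `meta` visible without `nested_mix.json`, here is a sketch on an inline dictionary. Its structure is an assumption inferred from the paths used in the call above:

```python
import pandas as pd

# Hypothetical stand-in for the contents of nested_mix.json
data = {
    "class": "Year 1",
    "info": {
        "president": "John Kasich",
        "contacts": {"tel": "123456789"},
    },
    "students": [
        {"id": "A001", "name": "Tom"},
        {"id": "A002", "name": "James"},
    ],
}

df = pd.json_normalize(
    data,
    record_path=["students"],
    meta=["class", ["info", "president"], ["info", "contacts", "tel"]],
)

print(df.columns.tolist())
print(df)
```

Nested meta paths become column names joined with a dot, for example `info.contacts.tel`, and their values are repeated on every student row.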

Reading a Group of Data from Nested JSON#

The glom module helps with nested data: it lets you reach the attributes of nested objects with a '.'-separated path.

Install glom
```
pip3 install glom
```

Usage

import pandas as pd
from glom import glom

df = pd.read_json('nested_deep.json')

# For each student, follow the path grade -> math
data = df['students'].apply(lambda row: glom(row, 'grade.math'))
print(data)
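
If installing glom is not an option, the same value can be reached with plain dictionary indexing. This is an alternative sketch, not the glom approach itself, and it assumes each `students` entry is a dictionary with a nested `grade` mapping:

```python
import pandas as pd

# Hypothetical stand-in for the 'students' part of nested_deep.json
students = [
    {"id": "A001", "name": "Tom", "grade": {"math": 60, "physics": 66}},
    {"id": "A002", "name": "James", "grade": {"math": 89, "physics": 76}},
]
df = pd.DataFrame({"students": students})

# Index into each nested dictionary by hand instead of using a dotted path
math_scores = df["students"].apply(lambda row: row["grade"]["math"])
print(math_scores)
```

The dotted-path form scales better as nesting gets deeper, which is the case glom is meant for.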

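Data Cleaning#

Data used: the examples below read property-data.csv.

Cleaning Missing Values#

To remove rows that contain missing values, use the dropna() method. Its signature is:

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

# axis: default 0, which drops rows containing NA; axis=1 drops columns containing NA.
# how: default 'any', which drops a row (or column) if it contains at least one NA; how='all' drops it only when every value is NA.
# thresh: the minimum number of non-NA values a row (or column) needs in order to be kept; recent pandas versions do not allow combining it with how.
# subset: the columns to check; pass a list of column names for several columns.
# inplace: if True, modifies the original data in place and returns None.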
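
The parameters are easier to compare side by side. A minimal sketch on a made-up 4-row DataFrame (the column names `a`, `b`, `c` are assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, np.nan, np.nan, np.nan],
})

any_na = df.dropna()              # drop rows with at least one NA
all_na = df.dropna(how="all")     # drop rows only when every value is NA
at_least_2 = df.dropna(thresh=2)  # keep rows with at least 2 non-NA values
only_a = df.dropna(subset=["a"])  # only look at column 'a'
cols = df.dropna(axis=1)          # drop columns that contain NA

print(any_na.index.tolist())      # [0]
print(all_na.index.tolist())      # [0, 1, 3]
print(at_least_2.index.tolist())  # [0, 3]
print(only_a.index.tolist())      # [0, 3]
print(cols.shape)                 # (4, 0)
```

None of these calls modifies `df`, because `inplace` is left at its default of `False`.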

isnull() tells you whether each cell is empty.

When reading a file with pandas.read_csv, na_values lets you name additional strings that should be treated as empty values.

import pandas as pd

df = pd.read_csv('property-data.csv')

print(df['NUM_BEDROOMS'])
print(df['NUM_BEDROOMS'].isnull())

# Specify which strings count as empty data
missing_values = ["n/a", "na", "--"]
df = pd.read_csv('property-data.csv', na_values = missing_values)

print(df['NUM_BEDROOMS'])
print(df['NUM_BEDROOMS'].isnull())

# Drop rows with empty data (returns a new DataFrame)
new_df = df.dropna()
print(new_df.to_string())

# Modify the original DataFrame in place
df.dropna(inplace = True)
print(df.to_string())

# Drop rows where a specific column is empty
df.dropna(subset=['ST_NUM'], inplace = True)
print(df.to_string())
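
The effect of `na_values` can be checked without `property-data.csv`. The sketch below reads a few made-up rows from an in-memory buffer (the values are assumptions chosen to show the difference):

```python
import io
import pandas as pd

# Hypothetical rows: a number, "na", "--", and a truly empty field
csv_text = "PID,NUM_BEDROOMS\n1,3\n2,na\n3,--\n4,\n"

plain = pd.read_csv(io.StringIO(csv_text))
marked = pd.read_csv(io.StringIO(csv_text), na_values=["na", "--"])

print(plain["NUM_BEDROOMS"].isnull().tolist())   # [False, False, False, True]
print(marked["NUM_BEDROOMS"].isnull().tolist())  # [False, True, True, True]
```

Without `na_values`, only the empty field is recognized as missing, and the column is read as text because of the `na` and `--` strings.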

Use fillna() to replace empty values.

A common way to replace empty cells is to compute the mean, median, or mode of the column.

pandas provides the mean(), median(), and mode() methods to compute a column's mean (the sum of all values divided by their count), median, and mode (the most frequent value).

import pandas as pd

df = pd.read_csv('property-data.csv')

# Replace every empty field with 12345 (returns a copy, so df keeps its NaN values)
print(df.fillna(12345).to_string())

# The three options below are alternatives for filling one column

# Replace empty values with the mean
x = df["ST_NUM"].mean()
print(df["ST_NUM"].fillna(x))

# Replace empty values with the median
x = df["ST_NUM"].median()
print(df["ST_NUM"].fillna(x))

# Replace empty values with the mode; mode() returns a Series, so take its first entry
x = df["ST_NUM"].mode()[0]
df["ST_NUM"] = df["ST_NUM"].fillna(x)
print(df.to_string())
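
To see how the three fill values differ, here is a sketch on a small made-up Series with one missing entry (the numbers are assumptions):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 2.0, np.nan, 10.0], name="ST_NUM")

by_mean = s.fillna(s.mean())      # mean of 1, 2, 2, 10 is 3.75
by_median = s.fillna(s.median())  # median is 2.0
by_mode = s.fillna(s.mode()[0])   # mode() returns a Series; take the first entry

print(by_mean[3], by_median[3], by_mode[3])  # 3.75 2.0 2.0
```

The mean is pulled upward by the outlier 10, which is one reason the median is often preferred for skewed columns.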

Cleaning Format Errors#

Cells whose data is in the wrong format make analysis difficult, and sometimes impossible.

You can deal with this either by removing the rows that contain such cells or by converting every cell in the column to the same format.

import pandas as pd

# The third date is in a different format
data = {
  "Date": ['2020/12/01', '2020/12/02' , '20201226'],
  "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

# In pandas 2.0 and later, the line below raises an error; pass format='mixed' explicitly to allow mixed formats
# pd.to_datetime(df['Date'])
df['Date'] = pd.to_datetime(df['Date'], format='mixed')
print(df.to_string())
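
`format='mixed'` lets pandas guess the format of each value individually. When the possible formats are known in advance, an alternative is to parse with each explicit format and combine the results. A sketch of that approach, assuming only the two formats seen above occur:

```python
import pandas as pd

dates = pd.Series(["2020/12/01", "2020/12/02", "20201226"])

# Try the main format first; values that do not match become NaT
parsed = pd.to_datetime(dates, format="%Y/%m/%d", errors="coerce")

# Parse again with the second format and use it to fill the gaps
fallback = pd.to_datetime(dates, format="%Y%m%d", errors="coerce")
parsed = parsed.fillna(fallback)

print(parsed)
```

Anything that matches neither format stays `NaT`, which makes unexpected values easy to find with `parsed.isna()`.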

Changing the data type with astype

# Drop missing values first, because NaN cannot be cast to int
data['語文'].dropna(how='any').astype('int')
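
A self-contained sketch of the same idea, on a hypothetical score column with one missing value. It also shows the nullable `Int64` dtype as an option when the missing row should be kept:

```python
import numpy as np
import pandas as pd

scores = pd.Series([90.0, np.nan, 75.0], name="score")  # hypothetical column

as_int = scores.dropna().astype("int")  # remove the NaN row, then cast to integer
nullable = scores.astype("Int64")       # nullable integer dtype keeps the row as <NA>

print(as_int.tolist())  # [90, 75]
print(nullable)
```

Calling `scores.astype("int")` directly would raise an error, since NaN has no integer representation.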

Cleaning Erroneous Data#

import pandas as pd

person = {
  "name": ['Google', 'Runoob' , 'Taobao'],
  "age": [50, 200, 12345]
}

df = pd.DataFrame(person)

# Fix a single value directly
df.loc[2, 'age'] = 30

# Loop over the rows and cap ages above 120
for x in df.index:
  if df.loc[x, "age"] > 120:
    df.loc[x, "age"] = 120

# Alternatively, drop the rows with ages above 120
# (a no-op here, because the loop above already capped them)
for x in df.index:
  if df.loc[x, "age"] > 120:
    df.drop(x, inplace = True)

print(df.to_string())
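
The loops above can usually be replaced by vectorized operations. A sketch of both options, capping and dropping, on the same data, with 120 as the assumed upper limit:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Google", "Runoob", "Taobao"],
    "age": [50, 200, 12345],
})

capped = df.assign(age=df["age"].clip(upper=120))  # cap ages at 120
dropped = df[df["age"] <= 120]                     # or keep only plausible rows

print(capped)
print(dropped)
```

Both produce new DataFrames and leave `df` unchanged, so the two results can be compared before choosing one.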

Cleaning Duplicate Data#

To clean duplicate data, use the duplicated() and drop_duplicates() methods.

duplicated() returns True for a row that repeats an earlier row, and False otherwise.

import pandas as pd

person = {
  "name": ['Google', 'Runoob', 'Runoob', 'Taobao'],
  "age": [50, 40, 40, 23]
}
df = pd.DataFrame(person)

# Find duplicate rows
print(df.duplicated())

# Remove duplicate rows
df.drop_duplicates(inplace = True)
print(df)
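
Both methods accept `subset` and `keep` to control what counts as a duplicate and which copy survives. A short sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Google", "Runoob", "Runoob", "Taobao"],
    "age": [50, 40, 40, 23],
})

print(df.duplicated().tolist())            # [False, False, True, False]
print(df.duplicated(keep=False).tolist())  # [False, True, True, False]

deduped = df.drop_duplicates()                               # keeps the first copy
by_name = df.drop_duplicates(subset=["name"], keep="last")   # compare 'name' only, keep the last copy

print(deduped.index.tolist())  # [0, 1, 3]
print(by_name.index.tolist())  # [0, 2, 3]
```

`keep=False` is useful for inspecting every member of a duplicate group before deciding which one to drop.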

Basic pandas usage

https://dreaife.tokyo/posts/pandas-basics/

Author

dreaife

Published

2024-01-02

License

CC BY-NC-SA 4.0
