판다스(Pandas) : 데이터 Selection 및 Filtering

DataFrame의 [ ] 연산자

넘파이에서 []연산자는 행의 위, 열의 위치, 슬라이싱 범위 등을 지정해 데이터를 가져올 수 있다.
하지만 DataFrame 바로 뒤에 있는 ‘[]’안에 들어갈 수 있는 것은 컬럼 명 문자, 또는 인덱스로 변환 가능한 표현식이다.

import pandas as pd

titanic_df = pd.read_csv("titanic_train.csv")
print("단일 컬럼 데이터 추출 : \n{}".format(titanic_df["Pclass"].head(3)))
print("")
print("여러 컬럼들의 데이터 추출 : \n{}".format(titanic_df[["Survived", "Pclass"]].head(3)))
print("")
print("[] 안에 숫자 index는 KeyError 오류 발생 : {}".format(titanic_df[0]))

단일 컬럼 데이터 추출 : 
0    3
1    1
2    3
Name: Pclass, dtype: int64

여러 컬럼들의 데이터 추출 : 
   Survived  Pclass
0         0       3
1         1       1
2         1       3

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

C:\anaconda\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

C:\anaconda\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

C:\anaconda\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)

~\AppData\Local\Temp/ipykernel_6328/905428361.py in <module>
      4 print("여러 컬럼들의 데이터 추출 : \n{}".format(titanic_df[["Survived", "Pclass"]].head(3)))
      5 print("")
----> 6 print("[] 안에 숫자 index는 KeyError 오류 발생 : {}".format(titanic_df[0]))

C:\anaconda\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]

C:\anaconda\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 0

titanic_df[ titanic_df["Pclass"] == 3 ].head(3)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.250	NaN	S
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.925	NaN	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.050	NaN	S

DataFrame iloc[ ] 연산자

위치기반 인덱싱을 제공

data = {"Name" : ["Dowon", "Junho", "Bomi", "Youngsu"],
       "Year" : [2011, 2016, 2015, 2015],
       "Gender" : ["Male", "Male", "Female", "Male"]}
data_df = pd.DataFrame(data, index = ["one", "two", "three", "four"])
data_df

	Name	Year	Gender
one	Dowon	2011	Male
two	Junho	2016	Male
three	Bomi	2015	Female
four	Youngsu	2015	Male

data_df.iloc[0, 0]

'Dowon'

iloc로 인덱싱을 확인할 때는 숫자 데이터만 들어가야한다.

data_df.iloc[0, 1]

DataFrame loc[ ] 연산자

명칭기반 인덱싱을 제공

data_df

	Name	Year	Gender
one	Dowon	2011	Male
two	Junho	2016	Male
three	Bomi	2015	Female
four	Youngsu	2015	Male

data_df.loc["two", "Year"]

data_df.loc["four", "Gender"]

'Male'

iloc[ ], loc[ ]를 사용한 slicing

data_df

	Name	Year	Gender
one	Dowon	2011	Male
two	Junho	2016	Male
three	Bomi	2015	Female
four	Youngsu	2015	Male

print("위치 기반 인덱싱 iloc[ ]로 slice! : \n{}".format(data_df.iloc[0:3, 1]))

print("-"*35)

print("명칭 기반 인덱싱 loc[ ]로 slice! : \n{}".format(data_df.loc["one" : "three", "Name"]))

위치 기반 인덱싱 iloc[ ]로 slice! : 
one      2011
two      2016
three    2015
Name: Year, dtype: int64
-----------------------------------
명칭 기반 인덱싱 loc[ ]로 slice! : 
one      Dowon
two      Junho
three     Bomi
Name: Name, dtype: object

불린 인덱싱(Boolean indexing)

헷갈리는 위치기반, 명칭기반 인덱싱을 사용할 필요없이 조건식을 [ ]안에 기입하여 간편하게 필터링을 수행

titanic_df = pd.read_csv("titanic_train.csv")
titanic_boolean = titanic_df[ titanic_df["Age"] > 60 ]

print(type(titanic_boolean))
titanic_boolean

<class 'pandas.core.frame.DataFrame'>

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
33	34	0	2	Wheadon, Mr. Edward H	male	66.0	0	0	C.A. 24579	10.5000	NaN	S
54	55	0	1	Ostby, Mr. Engelhart Cornelius	male	65.0	0	1	113509	61.9792	B30	C
96	97	0	1	Goldschmidt, Mr. George B	male	71.0	0	0	PC 17754	34.6542	A5	C
116	117	0	3	Connors, Mr. Patrick	male	70.5	0	0	370369	7.7500	NaN	Q
170	171	0	1	Van der hoef, Mr. Wyckoff	male	61.0	0	0	111240	33.5000	B19	S
252	253	0	1	Stead, Mr. William Thomas	male	62.0	0	0	113514	26.5500	C87	S
275	276	1	1	Andrews, Miss. Kornelia Theodosia	female	63.0	1	0	13502	77.9583	D7	S
280	281	0	3	Duane, Mr. Frank	male	65.0	0	0	336439	7.7500	NaN	Q
326	327	0	3	Nysveen, Mr. Johan Hansen	male	61.0	0	0	345364	6.2375	NaN	S
438	439	0	1	Fortune, Mr. Mark	male	64.0	1	4	19950	263.0000	C23 C25 C27	S
456	457	0	1	Millet, Mr. Francis Davis	male	65.0	0	0	13509	26.5500	E38	S
483	484	1	3	Turkula, Mrs. (Hedwig)	female	63.0	0	0	4134	9.5875	NaN	S
493	494	0	1	Artagaveytia, Mr. Ramon	male	71.0	0	0	PC 17609	49.5042	NaN	C
545	546	0	1	Nicholson, Mr. Arthur Ernest	male	64.0	0	0	693	26.0000	NaN	S
555	556	0	1	Wright, Mr. George	male	62.0	0	0	113807	26.5500	NaN	S
570	571	1	2	Harris, Mr. George	male	62.0	0	0	S.W./PP 752	10.5000	NaN	S
625	626	0	1	Sutton, Mr. Frederick	male	61.0	0	0	36963	32.3208	D50	S
630	631	1	1	Barkworth, Mr. Algernon Henry Wilson	male	80.0	0	0	27042	30.0000	A23	S
672	673	0	2	Mitchell, Mr. Henry Michael	male	70.0	0	0	C.A. 24580	10.5000	NaN	S
745	746	0	1	Crosby, Capt. Edward Gifford	male	70.0	1	1	WE/P 5735	71.0000	B22	S
829	830	1	1	Stone, Mrs. George Nelson (Martha Evelyn)	female	62.0	0	0	113572	80.0000	B28	NaN
851	852	0	3	Svensson, Mr. Johan	male	74.0	0	0	347060	7.7750	NaN	S

titanic_df["Age"] > 60

    False
    False
    False
    False
    False
       ...  
  False
  False
  False
  False
  False
Name: Age, Length: 891, dtype: bool

var1 = titanic_df["Age"] > 60
print(type(var1))

<class 'pandas.core.series.Series'>

titanic_df[titanic_df["Age"] > 60][["Name", "Age"]].head(3)

	Name	Age
33	Wheadon, Mr. Edward H	66.0
54	Ostby, Mr. Engelhart Cornelius	65.0
96	Goldschmidt, Mr. George B	71.0

titanic_df[["Name", "Age"]][titanic_df["Age"] > 60].head(3)

	Name	Age
33	Wheadon, Mr. Edward H	66.0
54	Ostby, Mr. Engelhart Cornelius	65.0
96	Goldschmidt, Mr. George B	71.0

titanic_df.loc[titanic_df["Age"] > 60, ["Name", "Age"]].head(3)

	Name	Age
33	Wheadon, Mr. Edward H	66.0
54	Ostby, Mr. Engelhart Cornelius	65.0
96	Goldschmidt, Mr. George B	71.0

titanic_df[ (titanic_df["Age"] > 60) & (titanic_df["Pclass"] == 1) & (titanic_df["Sex"] == "female") ]

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
275	276	1	1	Andrews, Miss. Kornelia Theodosia	female	63.0	1	0	13502	77.9583	D7	S
829	830	1	1	Stone, Mrs. George Nelson (Martha Evelyn)	female	62.0	0	0	113572	80.0000	B28	NaN

논리 연산자로 결합된 조건식도 불린 인덱싱으로 적용 가능

cond1 = titanic_df["Age"] > 60
cond2 = titanic_df["Pclass"] == 1
cond3 = titanic_df["Sex"] == "female"

titanic_df[cond1 & cond2 & cond3]

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
275	276	1	1	Andrews, Miss. Kornelia Theodosia	female	63.0	1	0	13502	77.9583	D7	S
829	830	1	1	Stone, Mrs. George Nelson (Martha Evelyn)	female	62.0	0	0	113572	80.0000	B28	NaN

조건식은 변수로도 할당 가능. 복잡한 조건식은 변수로 할당하여 가독성을 향상

Aggregation 함수 및 GroupBy 적용

Aggregation 함수

titanic_df.count()

PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

titanic_df[["Fare", "Age"]].mean()

Fare    32.204208
Age     29.699118
dtype: float64

titanic_df[["Age", "Fare"]].sum()

Age     21205.1700
Fare    28693.9493
dtype: float64

titanic_df[["Age", "Fare"]].sum(axis=1)

     29.2500
    109.2833
     33.9250
     88.1000
     43.0500
         ...   
   40.0000
   49.0000
   23.4500
   56.0000
   39.7500
Length: 891, dtype: float64

titanic_df[["Age", "Fare"]].count()

Age     714
Fare    891
dtype: int64

groupby()

groupby() by 인자에 Group By 하고자 하는 컬럼을 입력, 여러개의 컬럼으로 Group By하고자 하면 [ ] 내에 해당 컬럼명 입력.
DataFrame에 groupby()를 호출하면 DataFrameGroupBy 객체를 반환.

titanic_groupby = titanic_df.groupby(by = "Pclass")
print(type(titanic_groupby))
print(titanic_groupby)

<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000024AD4F01250>

titanic_groupby= titanic_df.groupby("Pclass").count()
titanic_groupby

	PassengerId	Survived	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
Pclass
1	216	216	216	216	186	216	216	216	216	176	214
2	184	184	184	184	173	184	184	184	184	16	184
3	491	491	491	491	355	491	491	491	491	12	491

위 출력 결과에서 Pclass는 index이다.

print(type(titanic_groupby))
print(titanic_groupby.shape)
print(titanic_groupby.index)

<class 'pandas.core.frame.DataFrame'>
(3, 11)
Int64Index([1, 2, 3], dtype='int64', name='Pclass')

titanic_groupby = titanic_df.groupby(by = "Pclass")[["PassengerId", "Survived"]].count()
titanic_groupby

	PassengerId	Survived
Pclass
1	216	216
2	184	184
3	491	491

titanic_df[["Pclass", "PassengerId", "Survived"]].groupby("Pclass").count()

	PassengerId	Survived
Pclass
1	216	216
2	184	184
3	491	491

titanic_df.groupby("Pclass")["Pclass"].count()

Pclass
1    216
2    184
3    491
Name: Pclass, dtype: int64

titanic_df["Pclass"].value_counts()

  491
  216
  184
Name: Pclass, dtype: int64

titanic_df.groupby("Pclass")["Age"].agg([max, min])

	max	min
Pclass
1	80.0	0.92
2	70.0	0.67
3	74.0	0.42

agg_format = {"Age" : "max", "SibSp" : "sum", "Fare" : "mean"}
titanic_df.groupby("Pclass").agg(agg_format)

	Age	SibSp	Fare
Pclass
1	80.0	90	84.154687
2	70.0	74	20.662183
3	74.0	302	13.675550

Missing 데이터 처리하기

DataFrame의 isna() 메소드는 모든 컬럼 값들이 NaN인지 True/False값을 반환(NaN이면 True)

isna()

titanic_df.isna().head(3)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	False	False	False	False	False	False	False	False	False	False	True	False
1	False	False	False	False	False	False	False	False	False	False	False	False
2	False	False	False	False	False	False	False	False	False	False	True	False

NaN인 곳만 True가 된다

titanic_df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

컬럼 별로 NaN이 몇 개인지

fillna()로 Missing Data 대체하기

titanic_df["Cabin"] = titanic_df["Cabin"].fillna("0000")
titanic_df.head(3)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	0000	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	0000	S

titanic_df["Age"] = titanic_df["Age"].fillna(titanic_df["Age"].mean())
titanic_df["Embarked"] = titanic_df["Embarked"].fillna("S")
titanic_df.isna().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

apply lambda 식으로 데이터 가공

파이썬 lambda 식 기본

def get_square(a) :
    return a**2

print("3의 제곱은 {}".format(get_square(3)))

3의 제곱은 9

lambda_square = lambda x : x**2
print("3의 제곱은 {}".format(lambda_square(3)))

3의 제곱은 9

a = [1, 2, 3]
squares = map(lambda x : x**2, a)
list(squares)

[1, 4, 9]

판다스에 apply lambda식 적용

titanic_df["Name_len"] = titanic_df["Name"].apply(lambda x : len(x))
titanic_df[["Name", "Name_len"]].head(3)

	Name	Name_len
0	Braund, Mr. Owen Harris	23
1	Cumings, Mrs. John Bradley (Florence Briggs Th...	51
2	Heikkinen, Miss. Laina	22

titanic_df["Child_Adult"] = titanic_df["Age"].apply(lambda x : "Child" if x <= 15 else "Adult")
titanic_df[["Age", "Child_Adult"]].head(10)

	Age	Child_Adult
0	22.000000	Adult
1	38.000000	Adult
2	26.000000	Adult
3	35.000000	Adult
4	35.000000	Adult
5	29.699118	Adult
6	54.000000	Adult
7	2.000000	Child
8	27.000000	Adult
9	14.000000	Child

titanic_df["Age_cat"] = titanic_df["Age"].apply(lambda x : "Child" if x <= 15 else("Adult" if x <= 60 else "Elderly"))

titanic_df["Age_cat"].value_counts()

Adult      786
Child       83
Elderly     22
Name: Age_cat, dtype: int64

def get_category(age):
    cat = ""
    if age <= 5 : cat = "Baby"
    elif age <= 12 : cat = "Child"
    elif age <= 18 : cat = "Teen"
    elif age <= 25 : cat = "Student"
    elif age <= 35 : cat = "Young Adult"
    elif age <= 60 : cat = "Adult"
    else : cat = "Elderly"
        
    return cat

titanic_df["Age_cat"] = titanic_df["Age"].apply(lambda x : get_category(x))
titanic_df[["Age", "Age_cat"]].head()
  

	Age	Age_cat
0	22.0	Student
1	38.0	Adult
2	26.0	Young Adult
3	35.0	Young Adult
4	35.0	Young Adult