판다스(Pandas) : 데이터 Selection 및 Filtering
DataFrame의 [ ] 연산자
- 넘파이에서 []연산자는 행의 위, 열의 위치, 슬라이싱 범위 등을 지정해 데이터를 가져올 수 있다.
- 하지만 DataFrame 바로 뒤에 있는 ‘[]’안에 들어갈 수 있는 것은 컬럼 명 문자, 또는 인덱스로 변환 가능한 표현식이다.
import pandas as pd
titanic_df = pd.read_csv("titanic_train.csv")
print("단일 컬럼 데이터 추출 : \n{}".format(titanic_df["Pclass"].head(3)))
print("")
print("여러 컬럼들의 데이터 추출 : \n{}".format(titanic_df[["Survived", "Pclass"]].head(3)))
print("")
print("[] 안에 숫자 index는 KeyError 오류 발생 : {}".format(titanic_df[0]))
단일 컬럼 데이터 추출 :
0 3
1 1
2 3
Name: Pclass, dtype: int64
여러 컬럼들의 데이터 추출 :
Survived Pclass
0 0 3
1 1 1
2 1 3
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
C:\anaconda\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
C:\anaconda\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
C:\anaconda\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 0
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_6328/905428361.py in <module>
4 print("여러 컬럼들의 데이터 추출 : \n{}".format(titanic_df[["Survived", "Pclass"]].head(3)))
5 print("")
----> 6 print("[] 안에 숫자 index는 KeyError 오류 발생 : {}".format(titanic_df[0]))
C:\anaconda\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
3456 if self.columns.nlevels > 1:
3457 return self._getitem_multilevel(key)
-> 3458 indexer = self.columns.get_loc(key)
3459 if is_integer(indexer):
3460 indexer = [indexer]
C:\anaconda\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:
KeyError: 0
titanic_df[ titanic_df["Pclass"] == 3 ].head(3)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.250 | NaN | S |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.050 | NaN | S |
DataFrame iloc[ ] 연산자
- 위치기반 인덱싱을 제공
data = {"Name" : ["Dowon", "Junho", "Bomi", "Youngsu"],
"Year" : [2011, 2016, 2015, 2015],
"Gender" : ["Male", "Male", "Female", "Male"]}
data_df = pd.DataFrame(data, index = ["one", "two", "three", "four"])
data_df
Name | Year | Gender | |
---|---|---|---|
one | Dowon | 2011 | Male |
two | Junho | 2016 | Male |
three | Bomi | 2015 | Female |
four | Youngsu | 2015 | Male |
data_df.iloc[0, 0]
'Dowon'
iloc로 인덱싱을 확인할 때는 숫자 데이터만 들어가야한다.
data_df.iloc[0, 1]
2011
DataFrame loc[ ] 연산자
- 명칭기반 인덱싱을 제공
data_df
Name | Year | Gender | |
---|---|---|---|
one | Dowon | 2011 | Male |
two | Junho | 2016 | Male |
three | Bomi | 2015 | Female |
four | Youngsu | 2015 | Male |
data_df.loc["two", "Year"]
2016
data_df.loc["four", "Gender"]
'Male'
iloc[ ], loc[ ]를 사용한 slicing
data_df
Name | Year | Gender | |
---|---|---|---|
one | Dowon | 2011 | Male |
two | Junho | 2016 | Male |
three | Bomi | 2015 | Female |
four | Youngsu | 2015 | Male |
print("위치 기반 인덱싱 iloc[ ]로 slice! : \n{}".format(data_df.iloc[0:3, 1]))
print("-"*35)
print("명칭 기반 인덱싱 loc[ ]로 slice! : \n{}".format(data_df.loc["one" : "three", "Name"]))
위치 기반 인덱싱 iloc[ ]로 slice! :
one 2011
two 2016
three 2015
Name: Year, dtype: int64
-----------------------------------
명칭 기반 인덱싱 loc[ ]로 slice! :
one Dowon
two Junho
three Bomi
Name: Name, dtype: object
불린 인덱싱(Boolean indexing)
- 헷갈리는 위치기반, 명칭기반 인덱싱을 사용할 필요없이 조건식을 [ ]안에 기입하여 간편하게 필터링을 수행
titanic_df = pd.read_csv("titanic_train.csv")
titanic_boolean = titanic_df[ titanic_df["Age"] > 60 ]
print(type(titanic_boolean))
titanic_boolean
<class 'pandas.core.frame.DataFrame'>
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
33 | 34 | 0 | 2 | Wheadon, Mr. Edward H | male | 66.0 | 0 | 0 | C.A. 24579 | 10.5000 | NaN | S |
54 | 55 | 0 | 1 | Ostby, Mr. Engelhart Cornelius | male | 65.0 | 0 | 1 | 113509 | 61.9792 | B30 | C |
96 | 97 | 0 | 1 | Goldschmidt, Mr. George B | male | 71.0 | 0 | 0 | PC 17754 | 34.6542 | A5 | C |
116 | 117 | 0 | 3 | Connors, Mr. Patrick | male | 70.5 | 0 | 0 | 370369 | 7.7500 | NaN | Q |
170 | 171 | 0 | 1 | Van der hoef, Mr. Wyckoff | male | 61.0 | 0 | 0 | 111240 | 33.5000 | B19 | S |
252 | 253 | 0 | 1 | Stead, Mr. William Thomas | male | 62.0 | 0 | 0 | 113514 | 26.5500 | C87 | S |
275 | 276 | 1 | 1 | Andrews, Miss. Kornelia Theodosia | female | 63.0 | 1 | 0 | 13502 | 77.9583 | D7 | S |
280 | 281 | 0 | 3 | Duane, Mr. Frank | male | 65.0 | 0 | 0 | 336439 | 7.7500 | NaN | Q |
326 | 327 | 0 | 3 | Nysveen, Mr. Johan Hansen | male | 61.0 | 0 | 0 | 345364 | 6.2375 | NaN | S |
438 | 439 | 0 | 1 | Fortune, Mr. Mark | male | 64.0 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S |
456 | 457 | 0 | 1 | Millet, Mr. Francis Davis | male | 65.0 | 0 | 0 | 13509 | 26.5500 | E38 | S |
483 | 484 | 1 | 3 | Turkula, Mrs. (Hedwig) | female | 63.0 | 0 | 0 | 4134 | 9.5875 | NaN | S |
493 | 494 | 0 | 1 | Artagaveytia, Mr. Ramon | male | 71.0 | 0 | 0 | PC 17609 | 49.5042 | NaN | C |
545 | 546 | 0 | 1 | Nicholson, Mr. Arthur Ernest | male | 64.0 | 0 | 0 | 693 | 26.0000 | NaN | S |
555 | 556 | 0 | 1 | Wright, Mr. George | male | 62.0 | 0 | 0 | 113807 | 26.5500 | NaN | S |
570 | 571 | 1 | 2 | Harris, Mr. George | male | 62.0 | 0 | 0 | S.W./PP 752 | 10.5000 | NaN | S |
625 | 626 | 0 | 1 | Sutton, Mr. Frederick | male | 61.0 | 0 | 0 | 36963 | 32.3208 | D50 | S |
630 | 631 | 1 | 1 | Barkworth, Mr. Algernon Henry Wilson | male | 80.0 | 0 | 0 | 27042 | 30.0000 | A23 | S |
672 | 673 | 0 | 2 | Mitchell, Mr. Henry Michael | male | 70.0 | 0 | 0 | C.A. 24580 | 10.5000 | NaN | S |
745 | 746 | 0 | 1 | Crosby, Capt. Edward Gifford | male | 70.0 | 1 | 1 | WE/P 5735 | 71.0000 | B22 | S |
829 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0000 | B28 | NaN |
851 | 852 | 0 | 3 | Svensson, Mr. Johan | male | 74.0 | 0 | 0 | 347060 | 7.7750 | NaN | S |
titanic_df["Age"] > 60
0 False
1 False
2 False
3 False
4 False
...
886 False
887 False
888 False
889 False
890 False
Name: Age, Length: 891, dtype: bool
var1 = titanic_df["Age"] > 60
print(type(var1))
<class 'pandas.core.series.Series'>
titanic_df[titanic_df["Age"] > 60][["Name", "Age"]].head(3)
Name | Age | |
---|---|---|
33 | Wheadon, Mr. Edward H | 66.0 |
54 | Ostby, Mr. Engelhart Cornelius | 65.0 |
96 | Goldschmidt, Mr. George B | 71.0 |
titanic_df[["Name", "Age"]][titanic_df["Age"] > 60].head(3)
Name | Age | |
---|---|---|
33 | Wheadon, Mr. Edward H | 66.0 |
54 | Ostby, Mr. Engelhart Cornelius | 65.0 |
96 | Goldschmidt, Mr. George B | 71.0 |
titanic_df.loc[titanic_df["Age"] > 60, ["Name", "Age"]].head(3)
Name | Age | |
---|---|---|
33 | Wheadon, Mr. Edward H | 66.0 |
54 | Ostby, Mr. Engelhart Cornelius | 65.0 |
96 | Goldschmidt, Mr. George B | 71.0 |
titanic_df[ (titanic_df["Age"] > 60) & (titanic_df["Pclass"] == 1) & (titanic_df["Sex"] == "female") ]
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
275 | 276 | 1 | 1 | Andrews, Miss. Kornelia Theodosia | female | 63.0 | 1 | 0 | 13502 | 77.9583 | D7 | S |
829 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0000 | B28 | NaN |
- 논리 연산자로 결합된 조건식도 불린 인덱싱으로 적용 가능
cond1 = titanic_df["Age"] > 60
cond2 = titanic_df["Pclass"] == 1
cond3 = titanic_df["Sex"] == "female"
titanic_df[cond1 & cond2 & cond3]
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
275 | 276 | 1 | 1 | Andrews, Miss. Kornelia Theodosia | female | 63.0 | 1 | 0 | 13502 | 77.9583 | D7 | S |
829 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0000 | B28 | NaN |
- 조건식은 변수로도 할당 가능. 복잡한 조건식은 변수로 할당하여 가독성을 향상
Aggregation 함수 및 GroupBy 적용
Aggregation 함수
titanic_df.count()
PassengerId 891
Survived 891
Pclass 891
Name 891
Sex 891
Age 714
SibSp 891
Parch 891
Ticket 891
Fare 891
Cabin 204
Embarked 889
dtype: int64
titanic_df[["Fare", "Age"]].mean()
Fare 32.204208
Age 29.699118
dtype: float64
titanic_df[["Age", "Fare"]].sum()
Age 21205.1700
Fare 28693.9493
dtype: float64
titanic_df[["Age", "Fare"]].sum(axis=1)
0 29.2500
1 109.2833
2 33.9250
3 88.1000
4 43.0500
...
886 40.0000
887 49.0000
888 23.4500
889 56.0000
890 39.7500
Length: 891, dtype: float64
titanic_df[["Age", "Fare"]].count()
Age 714
Fare 891
dtype: int64
groupby()
- groupby() by 인자에 Group By 하고자 하는 컬럼을 입력, 여러개의 컬럼으로 Group By하고자 하면 [ ] 내에 해당 컬럼명 입력.
- DataFrame에 groupby()를 호출하면 DataFrameGroupBy 객체를 반환.
titanic_groupby = titanic_df.groupby(by = "Pclass")
print(type(titanic_groupby))
print(titanic_groupby)
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000024AD4F01250>
titanic_groupby= titanic_df.groupby("Pclass").count()
titanic_groupby
PassengerId | Survived | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
Pclass | |||||||||||
1 | 216 | 216 | 216 | 216 | 186 | 216 | 216 | 216 | 216 | 176 | 214 |
2 | 184 | 184 | 184 | 184 | 173 | 184 | 184 | 184 | 184 | 16 | 184 |
3 | 491 | 491 | 491 | 491 | 355 | 491 | 491 | 491 | 491 | 12 | 491 |
- 위 출력 결과에서 Pclass는 index이다.
print(type(titanic_groupby))
print(titanic_groupby.shape)
print(titanic_groupby.index)
<class 'pandas.core.frame.DataFrame'>
(3, 11)
Int64Index([1, 2, 3], dtype='int64', name='Pclass')
titanic_groupby = titanic_df.groupby(by = "Pclass")[["PassengerId", "Survived"]].count()
titanic_groupby
PassengerId | Survived | |
---|---|---|
Pclass | ||
1 | 216 | 216 |
2 | 184 | 184 |
3 | 491 | 491 |
titanic_df[["Pclass", "PassengerId", "Survived"]].groupby("Pclass").count()
PassengerId | Survived | |
---|---|---|
Pclass | ||
1 | 216 | 216 |
2 | 184 | 184 |
3 | 491 | 491 |
titanic_df.groupby("Pclass")["Pclass"].count()
Pclass
1 216
2 184
3 491
Name: Pclass, dtype: int64
titanic_df["Pclass"].value_counts()
3 491
1 216
2 184
Name: Pclass, dtype: int64
titanic_df.groupby("Pclass")["Age"].agg([max, min])
max | min | |
---|---|---|
Pclass | ||
1 | 80.0 | 0.92 |
2 | 70.0 | 0.67 |
3 | 74.0 | 0.42 |
agg_format = {"Age" : "max", "SibSp" : "sum", "Fare" : "mean"}
titanic_df.groupby("Pclass").agg(agg_format)
Age | SibSp | Fare | |
---|---|---|---|
Pclass | |||
1 | 80.0 | 90 | 84.154687 |
2 | 70.0 | 74 | 20.662183 |
3 | 74.0 | 302 | 13.675550 |
Missing 데이터 처리하기
- DataFrame의 isna() 메소드는 모든 컬럼 값들이 NaN인지 True/False값을 반환(NaN이면 True)
isna()
titanic_df.isna().head(3)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | False | False | True | False |
1 | False | False | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False | True | False |
- NaN인 곳만 True가 된다
titanic_df.isna().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
- 컬럼 별로 NaN이 몇 개인지
fillna()로 Missing Data 대체하기
titanic_df["Cabin"] = titanic_df["Cabin"].fillna("0000")
titanic_df.head(3)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | 0000 | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | 0000 | S |
titanic_df["Age"] = titanic_df["Age"].fillna(titanic_df["Age"].mean())
titanic_df["Embarked"] = titanic_df["Embarked"].fillna("S")
titanic_df.isna().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 0
Embarked 0
dtype: int64
apply lambda 식으로 데이터 가공
파이썬 lambda 식 기본
def get_square(a) :
return a**2
print("3의 제곱은 {}".format(get_square(3)))
3의 제곱은 9
lambda_square = lambda x : x**2
print("3의 제곱은 {}".format(lambda_square(3)))
3의 제곱은 9
a = [1, 2, 3]
squares = map(lambda x : x**2, a)
list(squares)
[1, 4, 9]
판다스에 apply lambda식 적용
titanic_df["Name_len"] = titanic_df["Name"].apply(lambda x : len(x))
titanic_df[["Name", "Name_len"]].head(3)
Name | Name_len | |
---|---|---|
0 | Braund, Mr. Owen Harris | 23 |
1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 51 |
2 | Heikkinen, Miss. Laina | 22 |
titanic_df["Child_Adult"] = titanic_df["Age"].apply(lambda x : "Child" if x <= 15 else "Adult")
titanic_df[["Age", "Child_Adult"]].head(10)
Age | Child_Adult | |
---|---|---|
0 | 22.000000 | Adult |
1 | 38.000000 | Adult |
2 | 26.000000 | Adult |
3 | 35.000000 | Adult |
4 | 35.000000 | Adult |
5 | 29.699118 | Adult |
6 | 54.000000 | Adult |
7 | 2.000000 | Child |
8 | 27.000000 | Adult |
9 | 14.000000 | Child |
titanic_df["Age_cat"] = titanic_df["Age"].apply(lambda x : "Child" if x <= 15 else("Adult" if x <= 60 else "Elderly"))
titanic_df["Age_cat"].value_counts()
Adult 786
Child 83
Elderly 22
Name: Age_cat, dtype: int64
def get_category(age):
cat = ""
if age <= 5 : cat = "Baby"
elif age <= 12 : cat = "Child"
elif age <= 18 : cat = "Teen"
elif age <= 25 : cat = "Student"
elif age <= 35 : cat = "Young Adult"
elif age <= 60 : cat = "Adult"
else : cat = "Elderly"
return cat
titanic_df["Age_cat"] = titanic_df["Age"].apply(lambda x : get_category(x))
titanic_df[["Age", "Age_cat"]].head()
Age | Age_cat | |
---|---|---|
0 | 22.0 | Student |
1 | 38.0 | Adult |
2 | 26.0 | Young Adult |
3 | 35.0 | Young Adult |
4 | 35.0 | Young Adult |