판다스(Pandas) : 데이터 Selection 및 Filtering


DataFrame의 [ ] 연산자

  • 넘파이에서 []연산자는 행의 위, 열의 위치, 슬라이싱 범위 등을 지정해 데이터를 가져올 수 있다.
  • 하지만 DataFrame 바로 뒤에 있는 ‘[]’안에 들어갈 수 있는 것은 컬럼 명 문자, 또는 인덱스로 변환 가능한 표현식이다.
import pandas as pd
titanic_df = pd.read_csv("titanic_train.csv")
print("단일 컬럼 데이터 추출 : \n{}".format(titanic_df["Pclass"].head(3)))
print("")
print("여러 컬럼들의 데이터 추출 : \n{}".format(titanic_df[["Survived", "Pclass"]].head(3)))
print("")
print("[] 안에 숫자 index는 KeyError 오류 발생 : {}".format(titanic_df[0]))
단일 컬럼 데이터 추출 : 
0    3
1    1
2    3
Name: Pclass, dtype: int64

여러 컬럼들의 데이터 추출 : 
   Survived  Pclass
0         0       3
1         1       1
2         1       3




---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

C:\anaconda\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:


C:\anaconda\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()


C:\anaconda\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()


pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()


pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()


KeyError: 0


The above exception was the direct cause of the following exception:


KeyError                                  Traceback (most recent call last)

~\AppData\Local\Temp/ipykernel_6328/905428361.py in <module>
      4 print("여러 컬럼들의 데이터 추출 : \n{}".format(titanic_df[["Survived", "Pclass"]].head(3)))
      5 print("")
----> 6 print("[] 안에 숫자 index는 KeyError 오류 발생 : {}".format(titanic_df[0]))


C:\anaconda\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]


C:\anaconda\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:


KeyError: 0
titanic_df[ titanic_df["Pclass"] == 3 ].head(3)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.250 NaN S
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.925 NaN S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.050 NaN S

DataFrame iloc[ ] 연산자

  • 위치기반 인덱싱을 제공
data = {"Name" : ["Dowon", "Junho", "Bomi", "Youngsu"],
       "Year" : [2011, 2016, 2015, 2015],
       "Gender" : ["Male", "Male", "Female", "Male"]}
data_df = pd.DataFrame(data, index = ["one", "two", "three", "four"])
data_df
Name Year Gender
one Dowon 2011 Male
two Junho 2016 Male
three Bomi 2015 Female
four Youngsu 2015 Male
data_df.iloc[0, 0]
'Dowon'

iloc로 인덱싱을 확인할 때는 숫자 데이터만 들어가야한다.

data_df.iloc[0, 1]
2011

DataFrame loc[ ] 연산자

  • 명칭기반 인덱싱을 제공
data_df
Name Year Gender
one Dowon 2011 Male
two Junho 2016 Male
three Bomi 2015 Female
four Youngsu 2015 Male
data_df.loc["two", "Year"]
2016
data_df.loc["four", "Gender"]
'Male'

iloc[ ], loc[ ]를 사용한 slicing

data_df
Name Year Gender
one Dowon 2011 Male
two Junho 2016 Male
three Bomi 2015 Female
four Youngsu 2015 Male
print("위치 기반 인덱싱 iloc[ ]로 slice! : \n{}".format(data_df.iloc[0:3, 1]))

print("-"*35)

print("명칭 기반 인덱싱 loc[ ]로 slice! : \n{}".format(data_df.loc["one" : "three", "Name"]))
위치 기반 인덱싱 iloc[ ]로 slice! : 
one      2011
two      2016
three    2015
Name: Year, dtype: int64
-----------------------------------
명칭 기반 인덱싱 loc[ ]로 slice! : 
one      Dowon
two      Junho
three     Bomi
Name: Name, dtype: object

불린 인덱싱(Boolean indexing)

  • 헷갈리는 위치기반, 명칭기반 인덱싱을 사용할 필요없이 조건식을 [ ]안에 기입하여 간편하게 필터링을 수행
titanic_df = pd.read_csv("titanic_train.csv")
titanic_boolean = titanic_df[ titanic_df["Age"] > 60 ]

print(type(titanic_boolean))
titanic_boolean
<class 'pandas.core.frame.DataFrame'>
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
33 34 0 2 Wheadon, Mr. Edward H male 66.0 0 0 C.A. 24579 10.5000 NaN S
54 55 0 1 Ostby, Mr. Engelhart Cornelius male 65.0 0 1 113509 61.9792 B30 C
96 97 0 1 Goldschmidt, Mr. George B male 71.0 0 0 PC 17754 34.6542 A5 C
116 117 0 3 Connors, Mr. Patrick male 70.5 0 0 370369 7.7500 NaN Q
170 171 0 1 Van der hoef, Mr. Wyckoff male 61.0 0 0 111240 33.5000 B19 S
252 253 0 1 Stead, Mr. William Thomas male 62.0 0 0 113514 26.5500 C87 S
275 276 1 1 Andrews, Miss. Kornelia Theodosia female 63.0 1 0 13502 77.9583 D7 S
280 281 0 3 Duane, Mr. Frank male 65.0 0 0 336439 7.7500 NaN Q
326 327 0 3 Nysveen, Mr. Johan Hansen male 61.0 0 0 345364 6.2375 NaN S
438 439 0 1 Fortune, Mr. Mark male 64.0 1 4 19950 263.0000 C23 C25 C27 S
456 457 0 1 Millet, Mr. Francis Davis male 65.0 0 0 13509 26.5500 E38 S
483 484 1 3 Turkula, Mrs. (Hedwig) female 63.0 0 0 4134 9.5875 NaN S
493 494 0 1 Artagaveytia, Mr. Ramon male 71.0 0 0 PC 17609 49.5042 NaN C
545 546 0 1 Nicholson, Mr. Arthur Ernest male 64.0 0 0 693 26.0000 NaN S
555 556 0 1 Wright, Mr. George male 62.0 0 0 113807 26.5500 NaN S
570 571 1 2 Harris, Mr. George male 62.0 0 0 S.W./PP 752 10.5000 NaN S
625 626 0 1 Sutton, Mr. Frederick male 61.0 0 0 36963 32.3208 D50 S
630 631 1 1 Barkworth, Mr. Algernon Henry Wilson male 80.0 0 0 27042 30.0000 A23 S
672 673 0 2 Mitchell, Mr. Henry Michael male 70.0 0 0 C.A. 24580 10.5000 NaN S
745 746 0 1 Crosby, Capt. Edward Gifford male 70.0 1 1 WE/P 5735 71.0000 B22 S
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0000 B28 NaN
851 852 0 3 Svensson, Mr. Johan male 74.0 0 0 347060 7.7750 NaN S
titanic_df["Age"] > 60
0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Name: Age, Length: 891, dtype: bool
var1 = titanic_df["Age"] > 60
print(type(var1))
<class 'pandas.core.series.Series'>
titanic_df[titanic_df["Age"] > 60][["Name", "Age"]].head(3)
Name Age
33 Wheadon, Mr. Edward H 66.0
54 Ostby, Mr. Engelhart Cornelius 65.0
96 Goldschmidt, Mr. George B 71.0
titanic_df[["Name", "Age"]][titanic_df["Age"] > 60].head(3)
Name Age
33 Wheadon, Mr. Edward H 66.0
54 Ostby, Mr. Engelhart Cornelius 65.0
96 Goldschmidt, Mr. George B 71.0
titanic_df.loc[titanic_df["Age"] > 60, ["Name", "Age"]].head(3)
Name Age
33 Wheadon, Mr. Edward H 66.0
54 Ostby, Mr. Engelhart Cornelius 65.0
96 Goldschmidt, Mr. George B 71.0
titanic_df[ (titanic_df["Age"] > 60) & (titanic_df["Pclass"] == 1) & (titanic_df["Sex"] == "female") ]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
275 276 1 1 Andrews, Miss. Kornelia Theodosia female 63.0 1 0 13502 77.9583 D7 S
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0000 B28 NaN
  • 논리 연산자로 결합된 조건식도 불린 인덱싱으로 적용 가능
cond1 = titanic_df["Age"] > 60
cond2 = titanic_df["Pclass"] == 1
cond3 = titanic_df["Sex"] == "female"

titanic_df[cond1 & cond2 & cond3]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
275 276 1 1 Andrews, Miss. Kornelia Theodosia female 63.0 1 0 13502 77.9583 D7 S
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0000 B28 NaN
  • 조건식은 변수로도 할당 가능. 복잡한 조건식은 변수로 할당하여 가독성을 향상

Aggregation 함수 및 GroupBy 적용


Aggregation 함수

titanic_df.count()
PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64
titanic_df[["Fare", "Age"]].mean()
Fare    32.204208
Age     29.699118
dtype: float64
titanic_df[["Age", "Fare"]].sum()
Age     21205.1700
Fare    28693.9493
dtype: float64
titanic_df[["Age", "Fare"]].sum(axis=1)
0       29.2500
1      109.2833
2       33.9250
3       88.1000
4       43.0500
         ...   
886     40.0000
887     49.0000
888     23.4500
889     56.0000
890     39.7500
Length: 891, dtype: float64
titanic_df[["Age", "Fare"]].count()
Age     714
Fare    891
dtype: int64

groupby()

  • groupby() by 인자에 Group By 하고자 하는 컬럼을 입력, 여러개의 컬럼으로 Group By하고자 하면 [ ] 내에 해당 컬럼명 입력.
  • DataFrame에 groupby()를 호출하면 DataFrameGroupBy 객체를 반환.
titanic_groupby = titanic_df.groupby(by = "Pclass")
print(type(titanic_groupby))
print(titanic_groupby)
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000024AD4F01250>
titanic_groupby= titanic_df.groupby("Pclass").count()
titanic_groupby
PassengerId Survived Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
Pclass
1 216 216 216 216 186 216 216 216 216 176 214
2 184 184 184 184 173 184 184 184 184 16 184
3 491 491 491 491 355 491 491 491 491 12 491
  • 위 출력 결과에서 Pclass는 index이다.
print(type(titanic_groupby))
print(titanic_groupby.shape)
print(titanic_groupby.index)
<class 'pandas.core.frame.DataFrame'>
(3, 11)
Int64Index([1, 2, 3], dtype='int64', name='Pclass')
titanic_groupby = titanic_df.groupby(by = "Pclass")[["PassengerId", "Survived"]].count()
titanic_groupby
PassengerId Survived
Pclass
1 216 216
2 184 184
3 491 491
titanic_df[["Pclass", "PassengerId", "Survived"]].groupby("Pclass").count()
PassengerId Survived
Pclass
1 216 216
2 184 184
3 491 491
titanic_df.groupby("Pclass")["Pclass"].count()
Pclass
1    216
2    184
3    491
Name: Pclass, dtype: int64
titanic_df["Pclass"].value_counts()
3    491
1    216
2    184
Name: Pclass, dtype: int64
titanic_df.groupby("Pclass")["Age"].agg([max, min])
max min
Pclass
1 80.0 0.92
2 70.0 0.67
3 74.0 0.42
agg_format = {"Age" : "max", "SibSp" : "sum", "Fare" : "mean"}
titanic_df.groupby("Pclass").agg(agg_format)
Age SibSp Fare
Pclass
1 80.0 90 84.154687
2 70.0 74 20.662183
3 74.0 302 13.675550

Missing 데이터 처리하기

  • DataFrame의 isna() 메소드는 모든 컬럼 값들이 NaN인지 True/False값을 반환(NaN이면 True)

isna()

titanic_df.isna().head(3)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 False False False False False False False False False False True False
1 False False False False False False False False False False False False
2 False False False False False False False False False False True False
  • NaN인 곳만 True가 된다
titanic_df.isna().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
  • 컬럼 별로 NaN이 몇 개인지

fillna()로 Missing Data 대체하기

titanic_df["Cabin"] = titanic_df["Cabin"].fillna("0000")
titanic_df.head(3)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 0000 S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 0000 S
titanic_df["Age"] = titanic_df["Age"].fillna(titanic_df["Age"].mean())
titanic_df["Embarked"] = titanic_df["Embarked"].fillna("S")
titanic_df.isna().sum()
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

apply lambda 식으로 데이터 가공


파이썬 lambda 식 기본

def get_square(a) :
    return a**2

print("3의 제곱은 {}".format(get_square(3)))
3의 제곱은 9
lambda_square = lambda x : x**2
print("3의 제곱은 {}".format(lambda_square(3)))
3의 제곱은 9
a = [1, 2, 3]
squares = map(lambda x : x**2, a)
list(squares)
[1, 4, 9]

판다스에 apply lambda식 적용

titanic_df["Name_len"] = titanic_df["Name"].apply(lambda x : len(x))
titanic_df[["Name", "Name_len"]].head(3)
Name Name_len
0 Braund, Mr. Owen Harris 23
1 Cumings, Mrs. John Bradley (Florence Briggs Th... 51
2 Heikkinen, Miss. Laina 22
titanic_df["Child_Adult"] = titanic_df["Age"].apply(lambda x : "Child" if x <= 15 else "Adult")
titanic_df[["Age", "Child_Adult"]].head(10)
Age Child_Adult
0 22.000000 Adult
1 38.000000 Adult
2 26.000000 Adult
3 35.000000 Adult
4 35.000000 Adult
5 29.699118 Adult
6 54.000000 Adult
7 2.000000 Child
8 27.000000 Adult
9 14.000000 Child
titanic_df["Age_cat"] = titanic_df["Age"].apply(lambda x : "Child" if x <= 15 else("Adult" if x <= 60 else "Elderly"))

titanic_df["Age_cat"].value_counts()
Adult      786
Child       83
Elderly     22
Name: Age_cat, dtype: int64
def get_category(age):
    cat = ""
    if age <= 5 : cat = "Baby"
    elif age <= 12 : cat = "Child"
    elif age <= 18 : cat = "Teen"
    elif age <= 25 : cat = "Student"
    elif age <= 35 : cat = "Young Adult"
    elif age <= 60 : cat = "Adult"
    else : cat = "Elderly"
        
    return cat

titanic_df["Age_cat"] = titanic_df["Age"].apply(lambda x : get_category(x))
titanic_df[["Age", "Age_cat"]].head()
  
Age Age_cat
0 22.0 Student
1 38.0 Adult
2 26.0 Young Adult
3 35.0 Young Adult
4 35.0 Young Adult