Feature Engineering with Pandas

Aug 19, 2020

Feature engineering is the process of creating new features that help a machine learning model perform better. It takes raw, unprocessed data and prepares it so that it is as ready for modeling as possible.

In this post we will do feature engineering using 7 techniques:

- Imputation

- Handling Outliers

- Drop Outlier with Standard Deviation

- Drop with Percentiles

- Binning

- Log Transform

- One-hot encoding

First, let's install Pandas. The command below installs pandas-profiling, which pulls in Pandas as a dependency:

pip install pandas-profiling[notebook]

Next, download the dataset we will use:

https://gitlab.cpsudevops.com/nuttachot/dataset_w5

Import the required libraries:

import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

Load the data, then use the ProfileReport function to inspect it in detail.
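The loading step itself is not shown in this post. As a rough sketch, assuming the dataset is a CSV file, it could look like the following. The rows are made up so that the snippet runs on its own, the `Id` column is invented, and the file name in the comment is hypothetical.

```python
import io

import pandas as pd

# Stand-in for the downloaded file: a few invented rows in CSV form.
# Only the Age and Cabin column names come from this article.
csv_text = """Id,Age,Cabin
1,22.0,
2,38.0,C85
3,,
4,35.0,C123
"""

# In practice you would pass the path of the downloaded CSV instead,
# e.g. pd.read_csv("train.csv")  # hypothetical file name
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (4, 3)
```

Empty fields in the CSV are read as NaN, which is what the missing-value steps later in this post work on.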

Display the first 5 rows:

df.head()

Overview

profile = ProfileReport(df, title="Pandas Profiling Report")
profile

Notice that there are as many as 866 missing cells, or 8.1% of all cells.

Use the isnull() and sum() functions to count the missing values in each column:

print(df.isnull().sum())
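To see how per-column counts relate to an overall percentage like the one in the Overview, here is a small sketch on a made-up DataFrame (the values and the `Fare` column are invented for illustration). The overall share is the number of missing cells divided by the total number of cells.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 8.05, 8.46],
})

missing_per_column = toy.isnull().sum()      # Age: 2, Cabin: 3, Fare: 0
total_missing = int(missing_per_column.sum())
missing_share = total_missing / toy.size     # toy.size = rows * columns = 12

print(missing_per_column)
print(total_missing, round(missing_share * 100, 1))  # 5 41.7
```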

Imputation

Imputation means replacing missing values with some substitute value, so that the data is closer to what we need.

new_df = df.copy()
new_df['Age'] = new_df['Age'].fillna(df['Age'].mean())
print(new_df.isnull().sum())

In the output above, the missing values in Age have been replaced with the mean, so the number of missing values in that column is now 0.
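A minimal sketch of mean imputation on made-up values shows what actually ends up in the column. Note that mean() skips missing values by default, so the mean is computed from the known ages only.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({"Age": [20.0, np.nan, 30.0, np.nan, 40.0]})

age_mean = toy["Age"].mean()  # NaN is skipped: (20 + 30 + 40) / 3 = 30.0

imputed = toy.copy()
imputed["Age"] = imputed["Age"].fillna(age_mean)

print(imputed["Age"].tolist())  # [20.0, 30.0, 30.0, 30.0, 40.0]
```

Assigning the result back to the column, rather than calling fillna with inplace=True on a selected column, avoids depending on whether the selection is a view or a copy.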

Another option is to drop columns in which more than 0.5 (50%) of the values are missing. First, check the fraction of missing values in each column:

df.isnull().mean()

Drop the Cabin column, whose fraction of missing values exceeds 0.5:

threshold = 0.5
new_df = df[df.columns[df.isnull().mean() < threshold]]
new_df.isnull().mean()
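The same threshold idea can be tried on a tiny made-up DataFrame, where it is easy to check by hand which column gets dropped. This is only a sketch: the values are invented, and it uses .loc with a boolean mask over the columns, which selects the same columns as the code above.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],     # 1 of 4 missing -> 0.25
    "Cabin": [None, None, None, "C85"],    # 3 of 4 missing -> 0.75
})

threshold = 0.5
missing_ratio = toy.isnull().mean()        # fraction of missing values per column
kept = toy.loc[:, missing_ratio < threshold]

print(list(kept.columns))  # ['Age']
```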

To drop an entire row whenever any of its cells contains a missing value:

print(df.shape)
new_df = df.dropna(how='any')
print(new_df.shape)
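For comparison, here is a small sketch on made-up data of how how='any' differs from how='all'. The first drops a row if at least one cell is missing, the second only if every cell in the row is missing.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": ["C85", "C123", None, None],
})

any_dropped = toy.dropna(how="any")  # keeps only row 0, the one complete row
all_dropped = toy.dropna(how="all")  # drops only row 3, where everything is missing

print(any_dropped.shape, all_dropped.shape)  # (1, 2) (3, 2)
```

With how='any', a dataset that has one sparse column can lose most of its rows, so it is worth comparing the shapes before and after.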

Handling Outliers

An outlier, or anomalous value, is a data point whose value is much higher or much lower than most of the other values in the dataset.

Import the libraries needed for plotting first:

import matplotlib.pyplot as plt
import seaborn as sns

Visualize the data with a box plot:

fig = plt.figure(figsize=(12,8))
sns.boxplot(x=df['Age'], color='brown')
plt.xlabel('Age Featured', fontsize=14)
plt.show()

Drop Outlier with Standard Deviation

This drops the rows that contain outliers, here defined as values more than 3 standard deviations away from the mean.

print(df.shape)
factor = 3
upper_lim = df['Age'].mean() + df['Age'].std() * factor
lower_lim = df['Age'].mean() - df['Age'].std() * factor
drop_outlier1 = df[(df['Age'] < upper_lim) & (df['Age'] > lower_lim)]
print(drop_outlier1.shape)
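The following sketch applies the same rule to made-up ages with one extreme value and one missing value, so the effect can be verified by hand. One thing it illustrates: comparisons against NaN are False, so rows with a missing Age are dropped by this filter as well.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: ages 20..39, one extreme value, one NaN.
ages = list(range(20, 40)) + [250.0, np.nan]
toy = pd.DataFrame({"Age": ages})

factor = 3
mean = toy["Age"].mean()  # NaN is skipped: 840 / 21 = 40.0
std = toy["Age"].std()    # sample standard deviation, about 48.46
upper_lim = mean + factor * std
lower_lim = mean - factor * std

kept = toy[(toy["Age"] < upper_lim) & (toy["Age"] > lower_lim)]

print(len(toy), len(kept))  # 22 20
```

If rows with a missing Age should be kept, impute the column first or add `| toy["Age"].isnull()` to the mask.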

Drop with Percentiles

print(df.shape)
upper_lim = df['Age'].quantile(.95)
lower_lim = df['Age'].quantile(.05)
drop_outlier2 = df[(df['Age'] < upper_lim) & (df['Age'] > lower_lim)]
print(drop_outlier2.shape)
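To get a feel for what percentile trimming does, here is a sketch on made-up ages from 1 to 100. Unlike the standard deviation rule, this approach always removes roughly the lowest 5% and the highest 5% of rows, whether or not those values are really unusual.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: the ages 1.0, 2.0, ..., 100.0.
toy = pd.DataFrame({"Age": np.arange(1, 101, dtype=float)})

upper_lim = toy["Age"].quantile(0.95)  # about 95.05 (linear interpolation)
lower_lim = toy["Age"].quantile(0.05)  # about 5.95

kept = toy[(toy["Age"] < upper_lim) & (toy["Age"] > lower_lim)]

print(len(kept))                             # 90
print(kept["Age"].min(), kept["Age"].max())  # 6.0 95.0
```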

fig = plt.figure(figsize=(12,8))
sns.boxplot(x=drop_outlier2['Age'], color='brown')
plt.xlabel('Age Featured', fontsize=14)
plt.show()

After removing the outliers, the max and the mean have changed.
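One way to check this kind of change is to compare summary statistics before and after filtering. The sketch below does that on made-up ages, so the numbers are not those of the dataset used in this post.

```python
import pandas as pd

# Toy data, invented for illustration: mostly ages 20..34 plus two extremes.
toy = pd.DataFrame({"Age": [2.0, 20.0, 22.0, 24.0, 26.0, 28.0, 30.0, 32.0, 34.0, 80.0]})

upper_lim = toy["Age"].quantile(0.95)  # about 59.3
lower_lim = toy["Age"].quantile(0.05)  # about 10.1
trimmed = toy[(toy["Age"] < upper_lim) & (toy["Age"] > lower_lim)]

before_mean, before_max = toy["Age"].mean(), toy["Age"].max()
after_mean, after_max = trimmed["Age"].mean(), trimmed["Age"].max()

print(before_max, after_max)  # 80.0 34.0
print(round(before_mean, 1), round(after_mean, 1))  # 29.8 27.0
```

On real data, calling describe() on the Age column before and after gives the same comparison along with the count, min, and quartiles.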

Binning

Binning groups continuous values into discrete intervals (bins), which can help reduce overfitting.

labels = ['Childhood', 'teens', 'Mature', 'Elderly']
bins = [0., 12., 22., 60., 100.]
drop_outlier2 = drop_outlier2.copy()  # work on a copy to avoid SettingWithCopyWarning
drop_outlier2['Age_cat'] = pd.cut(drop_outlier2['Age'], labels=labels, bins=bins, include_lowest=False)
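To see how pd.cut assigns values to these bins, here is a sketch on a handful of made-up ages, using the same labels and bin edges. By default the bins are closed on the right, so they are (0, 12], (12, 22], (22, 60] and (60, 100], and with include_lowest=False an age of exactly 0 falls outside the first bin.

```python
import pandas as pd

# Toy data, invented for illustration; 0, 12, 22 and 60 sit on bin edges.
ages = pd.Series([0, 5, 12, 18, 22, 45, 60, 75], name="Age")

labels = ["Childhood", "teens", "Mature", "Elderly"]
bins = [0, 12, 22, 60, 100]

age_cat = pd.cut(ages, bins=bins, labels=labels, include_lowest=False)

# Show each age next to its assigned category.
print(pd.DataFrame({"Age": ages, "Age_cat": age_cat}))
```

Here 12 lands in Childhood, 22 in teens, and 60 in Mature because each bin includes its right edge, while the age of 0 gets NaN. Setting include_lowest=True would put 0 into the first bin instead.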