RandomForest: how does it decide whether a feature is categorical or numerical?

꼬꼬마코더 2024. 7. 18. 01:24


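A RandomForest model does not itself recognize whether each feature in the data is categorical or numerical. scikit-learn's implementation, for example, treats every input column as a number. The data scientist therefore has to separate categorical from numerical variables explicitly and apply the appropriate preprocessing.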
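To make this concrete, here is a minimal sketch, assuming pandas and scikit-learn are installed. The column names `size` and `color` and the toy values are made up for illustration. It shows that a raw string column makes `fit` fail, because the forest tries to convert every column to floating point, and that the same data fits once the column is one-hot encoded.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data: one numeric column and one raw string column.
X = pd.DataFrame({
    "size": [1.0, 2.0, 3.0, 4.0],
    "color": ["red", "blue", "red", "green"],
})
y = [10, 20, 15, 30]

model = RandomForestRegressor(n_estimators=10, random_state=0)

try:
    model.fit(X, y)
    fit_failed = False
except (ValueError, TypeError) as exc:
    # The strings cannot be converted to floats, so fitting is rejected.
    fit_failed = True
    print("fit failed:", exc)

# After one-hot encoding every column is numeric, so fit succeeds.
X_encoded = pd.get_dummies(X, columns=["color"])
model.fit(X_encoded, y)
n_features = model.n_features_in_
print("features seen by the model:", n_features)
```

Note that an integer-coded category (for example 1, 2, 3 for three regions) would not raise any error. The model would simply treat it as an ordered number, which is why the decision has to be made by the person preparing the data.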

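Identifying and preprocessing categorical variables

By data type:
- In Pandas, columns with the object or category dtype are generally treated as categorical variables.
By number of unique values:
- A column with relatively few unique values can be treated as a categorical variable.
By explicit specification:
- The data scientist designates categorical variables explicitly, based on domain knowledge.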
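The first two heuristics above can be sketched as a small helper. This is only an illustration: `guess_categorical_columns`, its `max_unique` threshold, and the sample columns are hypothetical, and the result should be reviewed with domain knowledge rather than trusted blindly.

```python
import pandas as pd

def guess_categorical_columns(df, max_unique=10):
    """Return names of columns that look categorical (heuristic only)."""
    candidates = []
    for col in df.columns:
        # Heuristic 1: text-like or explicitly categorical dtype.
        is_text_like = (
            pd.api.types.is_object_dtype(df[col])
            or pd.api.types.is_string_dtype(df[col])
            or isinstance(df[col].dtype, pd.CategoricalDtype)
        )
        # Heuristic 2: integer column with few distinct values.
        is_low_cardinality_int = (
            pd.api.types.is_integer_dtype(df[col])
            and df[col].nunique() <= max_unique
        )
        if is_text_like or is_low_cardinality_int:
            candidates.append(col)
    return candidates

# Hypothetical sample data.
sample = pd.DataFrame({
    "price": [100.5, 200.1, 150.3, 300.9, 120.0, 99.9],
    "region_code": [1, 2, 1, 3, 2, 1],
    "city": ["A", "B", "A", "C", "B", "A"],
    "row_id": [101, 102, 103, 104, 105, 106],
})

guessed = guess_categorical_columns(sample, max_unique=3)
print(guessed)
```

Here `region_code` is picked up because it has only three distinct integer values, while `row_id` is not because every value is unique. The threshold is a judgment call and depends on the size of the dataset.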

Example code: identifying categorical variables and one-hot encoding

The following example uses Pandas to identify categorical variables and convert them with one-hot encoding.

import numpy as np  # needed for np.inf and np.nan below
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Build an example DataFrame (replace with real data); '계약년월' is the contract date column
data = {
    '계약년월': ['2003-12-26', '2005-07-13', '2010-01-01', '2021-09-15'],
    'target': [100, 200, 150, 300],
    'feature1': [1, 2, 3, 4],
    'feature2': [5, 4, 3, 2],
    'category_feature': ['A', 'B', 'A', 'C']
}
df = pd.DataFrame(data)

# Convert '계약년월' to datetime
df['계약년월'] = pd.to_datetime(df['계약년월'])

# Add year and month as separate columns
df['year'] = df['계약년월'].dt.year
df['month'] = df['계약년월'].dt.month

# Specify the categorical variables explicitly
categorical_columns = ['category_feature']

# One-hot encode the categorical variables
df = pd.get_dummies(df, columns=categorical_columns)

# Separate the target from the independent variables
y = df['target']
X = df.drop(['target', '계약년월'], axis=1)

# Hold-out split: 80% training data, 20% validation data
# (with only four example rows, the validation set holds a single row)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=2023)

# Replace infinities with NaN, then fill NaN with 0
# (reassign instead of using inplace=True, which would modify a slice of X)
X_train = X_train.replace([np.inf, -np.inf], np.nan).fillna(0)
X_val = X_val.replace([np.inf, -np.inf], np.nan).fillna(0)

# Train the model
model = RandomForestRegressor(n_estimators=100, criterion='squared_error', random_state=1, n_jobs=-1)
model.fit(X_train, y_train)

# Predict
pred = model.predict(X_val)

# Evaluate the model
mse = mean_squared_error(y_val, pred)
print(f"Mean Squared Error: {mse}")

# Inspect the results
print("Predictions:", pred)
print("Actual values:", y_val.values)

Explanation

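Identifying categorical variables:
- The categorical variable names are listed explicitly in categorical_columns.
- Alternatively, df.select_dtypes(include=['object', 'category']).columns can pick up the categorical columns automatically by dtype.
One-hot encoding:
- The pd.get_dummies function one-hot encodes the categorical variables.
Adding time-related variables:
- The year and month are extracted from the datetime column and added as new columns.
Splitting the data and training the model:
- The data is split into training and validation sets, and the RandomForest model is trained on the training set.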
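The automatic, dtype-based alternative mentioned above can be sketched as follows. The `grade` column and its values are hypothetical additions for illustration, and `category_feature` is given an explicit object dtype so that the dtype filter matches it regardless of the pandas version's default string handling.

```python
import pandas as pd

df = pd.DataFrame({
    "feature1": [1, 2, 3, 4],
    # Explicit object dtype so the dtype filter below matches this column.
    "category_feature": pd.Series(["A", "B", "A", "C"], dtype="object"),
    # Hypothetical column stored with the pandas category dtype.
    "grade": pd.Categorical(["low", "high", "low", "mid"]),
})

# select_dtypes returns a DataFrame, so take .columns to get the names.
categorical_columns = df.select_dtypes(include=["object", "category"]).columns.tolist()

# One-hot encode exactly the columns that were detected.
encoded = pd.get_dummies(df, columns=categorical_columns)
encoded_columns = encoded.columns.tolist()

print(categorical_columns)
print(encoded_columns)
```

Detection by dtype only finds columns that are already stored as text or as category. A categorical variable stored as integers still has to be listed by hand or converted with astype('category') first.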

With this approach, categorical and time-related variables reach the RandomForest model in a numeric form it can process properly, which can improve predictive performance.