딥딥딥

Database Normalization - Dependencies

카카오그래놀라 — Tue, 12 Oct 2021 16:57:22 +0900

In the previous post, we saw that normalization is needed to resolve anomalies.

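These anomalies arise because too many attributes are forced into a single relation.

It follows that the relation needs to be decomposed so that only closely related attributes stay together.

So how do we judge whether attributes are closely related?

We use the concept of functional dependency to judge how closely attributes are related.

To state the conclusion up front: to prevent anomalies, analyze the relationships between attributes, that is, their dependencies, and split the relation so that each relation expresses only one dependency.

The process of splitting relations by following a sequence of step-by-step rules is called normalization.

That was a lot at once, so let's go through it one step at a time.

First, the definition of functional dependency.

Definition: a functional dependency is a relationship between attributes of a relation in which the value of one attribute functionally determines the value of another. Put simply, if knowing attribute A tells you attributes B and C, then B and C are functionally dependent on A.

The wording is abstract, so let's look at an example.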

Relation

Student ID (primary key)	Name	Phone number	Email
2020123	Kim Cheol-su	010-1111-1234	kim@mail.mail
2020124	Kim Seo-jun	010-1111-1235	kim1@mail.mail
2021445	Lee Seo-ah	010-1111-1236	lee@mail.mail
2020145	Kim Cheol-su	010-1111-1237	han@mail.mail

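In the relation above, the student ID determines the name, phone number, and email. Names are not unique: the name Kim Cheol-su appears twice, so knowing the name alone does not tell us whether the tuple's student ID is 2020123 or 2020145. Knowing the student ID, however, always determines exactly one name, phone number, and email. In this way, the value of one attribute can uniquely determine the value of another.

When attribute 2 is determined by attribute 1, we write attribute 1 → attribute 2. Attribute 1 is called the determinant of attribute 2, and attribute 2 is called the dependent of attribute 1.

Each value of the determinant corresponds to exactly one value of the dependent. Both attribute 1 and attribute 2 may be composite, that is, made up of several attributes.

The dependencies in the relation above can be written as

Student ID → Name

Student ID → Phone number

Student ID → Email

or

Student ID → (Name, Phone number, Email)

As the relation above shows, a primary key or candidate key uniquely identifies a tuple, so it functionally determines every other attribute and is therefore a determinant. Ordinary attributes that are neither primary keys nor candidate keys can also be determinants.

When judging functional dependencies, be careful not to rely only on the values currently stored in the relation. Those values can change at any time, and a functional dependency must hold for every possible state of the relation. In the end, dependencies should be judged from the meaning of the attributes themselves rather than from their current values.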
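
As a rough illustration, the check below tests whether a candidate dependency is contradicted by the rows of the sample relation. The function name `fd_holds`, the English column names, and the romanized names are assumptions made for this sketch. As noted above, passing the check on the current rows does not prove a dependency; it only fails to refute it.

```python
def fd_holds(rows, determinant, dependent):
    """Return True if no two rows agree on `determinant` but differ on `dependent`."""
    seen = {}
    for row in rows:
        key = tuple(row[attr] for attr in determinant)
        value = tuple(row[attr] for attr in dependent)
        # setdefault stores the first value seen for this key and returns it afterwards.
        if seen.setdefault(key, value) != value:
            return False
    return True


students = [
    {"student_id": "2020123", "name": "Kim Cheol-su", "phone": "010-1111-1234", "email": "kim@mail.mail"},
    {"student_id": "2020124", "name": "Kim Seo-jun", "phone": "010-1111-1235", "email": "kim1@mail.mail"},
    {"student_id": "2021445", "name": "Lee Seo-ah", "phone": "010-1111-1236", "email": "lee@mail.mail"},
    {"student_id": "2020145", "name": "Kim Cheol-su", "phone": "010-1111-1237", "email": "han@mail.mail"},
]

print(fd_holds(students, ["student_id"], ["name", "phone", "email"]))  # True
print(fd_holds(students, ["name"], ["student_id"]))  # False
```

The second call fails because the same name maps to two different student IDs, which is exactly the duplicate-name case described above.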

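This may have been a little difficult. For now, understanding the summary below is enough.

The next post goes into more detail and covers full functional dependency, partial functional dependency, and the other kinds of functional dependency.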

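Summary

Anomalies occur because too many attributes are forced into a single relation.

The relation therefore has to be decomposed, and the process of decomposing it is called normalization.

So how should it be decomposed?

⇒ Group closely related attributes together and decompose the relation along those groups.

How do we know which attributes are closely related?

⇒ Through functional dependencies.

What is a functional dependency?

⇒ A relationship in which attribute A determines attribute B. (For example, knowing the student ID determines exactly one name and one phone number.)

In conclusion

To prevent anomalies, analyze the relationships between attributes, that is, their dependencies, and split the relation so that each relation expresses only one dependency.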

Reference: these notes summarize my reading of the book "MySQL과 모바일 웹으로 만나는 데이터베이스의 정석".

Database Anomalies

카카오그래놀라 — Tue, 12 Oct 2021 16:17:26 +0900

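A database anomaly is a side effect that arises from unnecessary data redundancy when a relation is processed.

There are three kinds of anomalies:

1. Insertion anomaly

2. Update anomaly

3. Deletion anomaly

Let's go through an example.

Suppose a university runs its IT system on just a single relation. ~~(That already sounds like a recipe for anomalies.)~~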

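Example relation

Student ID (primary key)	Name	Course number	Course title	Professor
2020123	Hong Gil-dong	c103	Linear Algebra	Kim Ji-a
2020124	Cheol-su	e403	Understanding Western Philosophy	Lee Seo-jun
2020125	Yeong-hui	k114	Introduction to Economics	Han Seo-ah
2020202	Lee Ji-su	e403	Understanding Western Philosophy	Lee Seo-jun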

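1. Insertion anomaly

Suppose a new course, Understanding Quantum Mechanics (q291), taught by Professor Park Min-jun, opens this year for the first time.

The course has to be added to the relation, but because the student ID is the primary key, the information can only be entered by making up a student ID. For example, a tuple such as <99999, NULL, q291, Understanding Quantum Mechanics, Park Min-jun> would have to be inserted.

A situation in which data cannot be inserted without also entering unnecessary data is called an insertion anomaly.

2. Update anomaly

Suppose we want to rename Professor Lee Seo-jun's course Understanding Western Philosophy (e403) to A True Understanding of Western Philosophy (e403). The course title then has to be changed in both the 2nd and the 4th rows.

If there were more than 1,000 tuples for Understanding Western Philosophy (e403), updating them one by one could easily leave some of them unchanged. A situation in which only some of the copies are updated and the data becomes inconsistent is called an update anomaly.

3. Deletion anomaly

Suppose Linear Algebra (c103) is so difficult that Hong Gil-dong is the only student taking it. After attending the first orientation session, Hong Gil-dong decides the course is not for him and drops it.

When that tuple is deleted, the fact that Professor Kim Ji-a teaches Linear Algebra (c103) is deleted with it. Because Hong Gil-dong was the only student enrolled, removing his row also removes the course and the professor's lecture information.

A situation in which deleting data also deletes data we wanted to keep, causing data loss, is called a deletion anomaly.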

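How to resolve anomalies

The main cause of anomalies is unnecessary data redundancy within a relation, and the biggest cause of that redundancy is trying to express too many attributes in one relation. Attributes that are only loosely related should be split into separate relations. This process is called normalization, and carrying it out requires the concept of dependency.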
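
To make the idea concrete, here is a minimal sketch that splits the single relation from the example into three smaller ones and replays the three scenarios. The table names (`students`, `courses`, `enrollments`), the use of plain dictionaries and sets in place of a DBMS, and the romanized names are assumptions made for illustration only.

```python
rows = [
    ("2020123", "Hong Gil-dong", "c103", "Linear Algebra", "Kim Ji-a"),
    ("2020124", "Cheol-su", "e403", "Understanding Western Philosophy", "Lee Seo-jun"),
    ("2020125", "Yeong-hui", "k114", "Introduction to Economics", "Han Seo-ah"),
    ("2020202", "Lee Ji-su", "e403", "Understanding Western Philosophy", "Lee Seo-jun"),
]

# One relation per dependency: student_id -> name, course_id -> (title, professor),
# plus the pairs recording who takes which course.
students = {sid: name for sid, name, _, _, _ in rows}
courses = {cid: {"title": title, "professor": prof} for _, _, cid, title, prof in rows}
enrollments = {(sid, cid) for sid, _, cid, _, _ in rows}

# Insertion: a new course can be added without inventing a student.
courses["q291"] = {"title": "Understanding Quantum Mechanics", "professor": "Park Min-jun"}

# Update: the title is stored once, so renaming touches a single entry.
courses["e403"]["title"] = "A True Understanding of Western Philosophy"

# Deletion: dropping the only enrollment in c103 leaves the course itself intact.
enrollments.discard(("2020123", "c103"))

print(len(courses))                   # 4
print(courses["c103"]["professor"])   # Kim Ji-a
```

Each of the three operations now touches exactly one place, which is the practical payoff of keeping one dependency per relation.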

Dependencies are covered in the next post.

Reference: these notes summarize my reading of the book "MySQL과 모바일 웹으로 만나는 데이터베이스의 정석".

Checking XGBoost Metrics

카카오그래놀라 — Tue, 12 Oct 2021 14:41:29 +0900

Available XGBoost metrics

https://xgboost.readthedocs.io/en/latest/parameter.html#:~:text=too%20much%20effect.-,eval_metric,-%5Bdefault%20according%20to

Commonly used metrics

rmse: root mean square error
rmsle: root mean square log error
mae: mean absolute error
auc: area under the ROC curve

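How to use them

When calling fit() on the model you created, pass eval_set and eval_metric.

Specifying eval_set

Put each pair of x and y values in a tuple and collect the tuples in a list. The code below makes this clearer.

Specifying eval_metric

Pass the name of one of the XGBoost metrics listed above ('rmse', 'rmsle', 'mae', 'auc', and so on). The code below makes this clearer.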

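import xgboost as xgb
from sklearn.model_selection import train_test_split

# `data` is assumed to be a pandas DataFrame with a 'target' column.
x = data.drop(['target'], axis=1)
y = data['target']

train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=32)

model = xgb.XGBRegressor()
# Note: newer XGBoost releases deprecate eval_metric in fit() and expect it
# in the constructor instead, e.g. xgb.XGBRegressor(eval_metric='rmse').
model.fit(
    train_x,
    train_y,
    # Every (x, y) pair in eval_set is evaluated after each boosting round.
    eval_set=[(train_x, train_y), (test_x, test_y)],
    eval_metric='rmse'
)
pred = model.predict(test_x)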

Extra tip: using a custom metric

What you need

1. xgboost.DMatrix

The example above trained the model through the scikit-learn interface. To use a custom metric as shown here, train through the native Python package interface instead.

To do that, first convert the data to xgboost.DMatrix.

The conversion is straightforward, as the code below shows.

Pass the x values to xgb.DMatrix and the y values as label.

x = data.drop(['target'], axis=1)
y = data['target']
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=32)

train = xgb.DMatrix(train_x, label=train_y)
test = xgb.DMatrix(test_x, label=test_y)

2. Metric function

Write a function that takes y_pred and a DMatrix as arguments.

The DMatrix's get_label() method returns the true labels.

Compute the metric from y_pred and the labels.

Return the metric's name together with the metric's value.

def custom_rmse(y_pred, Dmatrix):
    labels = Dmatrix.get_label()
    return ('custom_rmse', np.sqrt(np.mean(np.power(y_pred - labels, 2))))
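
To see what such a function receives and returns without training a model, the sketch below calls a metric of the same shape on a small stand-in object. `FakeDMatrix` is a hypothetical class written only for this illustration (it mimics nothing but `get_label()`), and the metric is rewritten with the standard library instead of NumPy so that the snippet runs on its own.

```python
import math


class FakeDMatrix:
    """Hypothetical stand-in that only mimics xgboost.DMatrix.get_label()."""

    def __init__(self, labels):
        self._labels = list(labels)

    def get_label(self):
        return self._labels


def custom_rmse(y_pred, dmatrix):
    labels = dmatrix.get_label()
    squared_errors = [(p - t) ** 2 for p, t in zip(y_pred, labels)]
    # Return the metric's name together with its value.
    return ("custom_rmse", math.sqrt(sum(squared_errors) / len(squared_errors)))


name, value = custom_rmse([2.0, 4.0], FakeDMatrix([1.0, 1.0]))
print(name, value)
```

Here the errors are 1 and 3, so the value is the square root of (1 + 9) / 2 = 5.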

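Usage example

With xgboost.DMatrix data, the model is trained with xgb.train(). Passing feval=<function name> enables the custom metric. (In recent XGBoost releases this argument is called custom_metric, and feval is deprecated.)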

param = {
    'max_depth':5, 
    'objective': 'reg:squarederror', 
    'eta' : 0.1,
}
model = xgb.train(param, train, num_boost_round=100, evals=[(test, 'test')], feval=custom_rmse)

To see how to check LightGBM metrics, see the post below!
https://deep-deep-deep.tistory.com/159

Checking LightGBM Metrics

카카오그래놀라 — Tue, 12 Oct 2021 13:48:53 +0900

Metrics supported by LightGBM

https://lightgbm.readthedocs.io/en/latest/Parameters.html?highlight=metric#metric

Commonly used metrics

'l1': absolute loss
- aliases: mean_absolute_error, mae, regression_l1
'l2': square loss
- aliases: mean_squared_error, mse, regression_l2, regression
'rmse': root square loss
- aliases: root_mean_squared_error, l2_root
'auc': area under the ROC curve

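How to use them

When calling fit() on the model you created, pass eval_set and eval_metric.

Specifying eval_set

Put each pair of x and y values in a tuple and collect the tuples in a list. The code below makes this clearer.

Specifying eval_metric

Pass the name of one of the LightGBM metrics listed above ('l1', 'l2', 'rmse', 'auc', and so on). The code below makes this clearer.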

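import lightgbm as lgbm
from sklearn.model_selection import train_test_split

# `data` is assumed to be a pandas DataFrame with a 'target' column.
x = data.drop(['target'], axis=1)
y = data['target']

train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=32)

model = lgbm.LGBMRegressor(random_state=32)
model.fit(
    train_x,
    train_y,
    # Every (x, y) pair in eval_set is evaluated after each boosting round.
    eval_set=[(train_x, train_y), (test_x, test_y)],
    eval_metric='rmse'
)
pred = model.predict(test_x)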

Extra tip: using a custom metric

Defining a custom metric

1. Define a function that takes y_true and y_pred (the true values and the predictions).

2. Values to return

- Metric name (string)

- Metric value (for example, computed with NumPy)

- Whether a higher value is better (True or False)

As an example (read this alongside the code below):

Suppose we define RMSE as a custom metric.

1. Choose a name. We will call it 'custom_RMSE'. (Any name you like works.)

2. Define the metric, as in the NumPy expression below.

3. Lower RMSE is better, so return False.

Return these three values as a tuple.

The same goes for accuracy:

1. Choose a name: we will call it 'acc'.

2. Define the metric, as in the NumPy expression below.

3. Higher accuracy is better, so return True.

Return these three values as a tuple.

import numpy as np


def custom_rmse(y_true, y_pred):
    # Returns (metric name, metric value, is a higher value better?)
    return ('custom_RMSE', np.sqrt(np.mean(np.power(y_pred - y_true, 2))), False)


def custom_acc(y_true, y_pred):
    # Returns (metric name, metric value, is a higher value better?)
    return ('acc', np.mean(y_pred == y_true), True)
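
One caveat with the accuracy metric: comparing predictions to labels with `==` only makes sense once the predictions are hard class labels. The sketch below is a variant that thresholds scores first. It assumes a binary task where the predictions arrive as probabilities, and the `THRESHOLD` constant is a choice made for this illustration rather than anything LightGBM prescribes. It uses plain Python lists so that it runs without extra packages.

```python
THRESHOLD = 0.5  # assumed cut-off for this sketch


def custom_acc(y_true, y_pred):
    # Turn scores into 0/1 labels before comparing them with the true labels.
    hard_preds = [1 if p >= THRESHOLD else 0 for p in y_pred]
    correct = sum(int(h == t) for h, t in zip(hard_preds, y_true))
    # (metric name, metric value, is a higher value better?)
    return ("acc", correct / len(hard_preds), True)


print(custom_acc([1, 0, 1, 1], [0.9, 0.2, 0.4, 0.7]))  # ('acc', 0.75, True)
```

The threshold is kept as a module-level constant rather than a third function parameter so that the function keeps the two-argument shape described above.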

Once the custom metric is defined, pass the function to eval_metric in model.fit() to use it.

x = data.drop(['target'], axis=1)
y = data['target']

train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=32)

model = lgbm.LGBMRegressor(random_state=32)
model.fit(
    train_x,
    train_y,
    eval_set=[(train_x, train_y),(test_x, test_y)],
    eval_metric= custom_rmse
    )
pred = model.predict(test_x)

To see how to check XGBoost metrics, see the post below!

https://deep-deep-deep.tistory.com/160

Measuring Image Similarity with Triplet Loss

카카오그래놀라 — Thu, 7 Oct 2021 16:05:08 +0900

Original: https://keras.io/examples/vision/siamese_network/

Summary: let's look at a Siamese Network that compares the similarity of images using the triplet loss function.

Overview

A Siamese Network is a network architecture that contains two or more identical subnetworks, which generate a feature vector for each input and compare them.

This example uses a Siamese Network with three identical subnetworks. We feed the model three images: two of them are similar (the Anchor and Positive images) and the third (the Negative image) is unrelated. Our goal is for the model to learn to estimate the similarity between images.

To train the network we use the triplet loss function, which is described in detail in the FaceNet paper by Schroff et al., 2015. The triplet loss is defined as follows.

L(A, P, N) = max(‖f(A) - f(P)‖² - ‖f(A) - f(N)‖² + margin, 0)
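
As a small worked example of the formula, the sketch below evaluates the loss on hand-picked 2-dimensional vectors that stand in for the embeddings f(A), f(P), and f(N). The vectors and the helper names are made up for illustration, and the margin of 0.5 is just an example value.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.5):
    ap = squared_distance(anchor, positive)
    an = squared_distance(anchor, negative)
    return max(ap - an + margin, 0.0)


anchor, positive = [0.0, 0.0], [1.0, 0.0]
print(triplet_loss(anchor, positive, [0.0, 2.0]))  # 0.0: the negative is already far enough away
print(triplet_loss(anchor, positive, [0.5, 0.0]))  # 1.25: the negative is closer than the positive
```

The loss is zero only when the negative is farther from the anchor than the positive by at least the margin, which is what pushes unrelated images apart during training.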

The dataset is the Totally Looks Like dataset by Rosenfeld et al., 2018.

Loading libraries

import matplotlib.pyplot as plt
import numpy as np
import os
import random
import tensorflow as tf
from pathlib import Path
from tensorflow.keras import applications
from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import optimizers
from tensorflow.keras import metrics
from tensorflow.keras import Model
from tensorflow.keras.applications import resnet


target_shape = (200, 200)

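Preparing the dataset

Download the dataset above and unzip it into the ~/.keras directory.

The dataset consists of two separate files:

left.zip contains the images used as anchors.
right.zip contains the images used as positives (images that look like the anchor).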

cache_dir = Path(Path.home()) / ".keras"
anchor_images_path = cache_dir / "left"
positive_images_path = cache_dir / "right"

!gdown --id 1jvkbTr_giSP3Ru8OwGNCg6B4PvVbcO34
!gdown --id 1EzBZUb_mh_Dp_FKD0P4XiYYSd0QBH5zW
!unzip -oq left.zip -d $cache_dir
!unzip -oq right.zip -d $cache_dir

Loading the data

We use a tf.data pipeline to load the data and generate the triplets (groups of three images) needed to train the Siamese Network.

The pipeline is set up from the file names of the Anchor, Positive, and Negative images. It loads and preprocesses the corresponding images.

def preprocess_image(filename):
    image_string = tf.io.read_file(filename)
    image = tf.image.decode_jpeg(image_string, channels=3)
    image = tf.image.convert_image_dtype(image, tf.float32)
    image = tf.image.resize(image, target_shape)
    return image


def preprocess_triplets(anchor, positive, negative):
    return (
        preprocess_image(anchor),
        preprocess_image(positive),
        preprocess_image(negative),
    )

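Let's build the pipeline with tf.data.Dataset.

anchor_images holds the paths of the anchor images, and

positive_images holds the paths of the positive images.

Negative images are not collected separately. Instead, we concatenate the anchor and positive images and shuffle them at random. (The same image may occasionally end up as both a positive and a negative, but because the list is shuffled at random, the negative is a different image in most cases.)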

anchor_images = sorted(
    [str(anchor_images_path / f) for f in os.listdir(anchor_images_path)]
)

positive_images = sorted(
    [str(positive_images_path / f) for f in os.listdir(positive_images_path)]
)

image_count = len(anchor_images)

anchor_dataset = tf.data.Dataset.from_tensor_slices(anchor_images)
positive_dataset = tf.data.Dataset.from_tensor_slices(positive_images)

rng = np.random.RandomState(seed=42)
rng.shuffle(anchor_images)
rng.shuffle(positive_images)

# Concatenate anchor_images and positive_images,
# then shuffle once more and use the result as negative_images.
negative_images = anchor_images + positive_images
np.random.RandomState(seed=32).shuffle(negative_images)

negative_dataset = tf.data.Dataset.from_tensor_slices(negative_images)
negative_dataset = negative_dataset.shuffle(buffer_size=4096)

dataset = tf.data.Dataset.zip((anchor_dataset, positive_dataset, negative_dataset))
dataset = dataset.shuffle(buffer_size=1024)
dataset = dataset.map(preprocess_triplets)

train_dataset = dataset.take(round(image_count * 0.8))
val_dataset = dataset.skip(round(image_count * 0.8))

train_dataset = train_dataset.batch(32, drop_remainder=False)
train_dataset = train_dataset.prefetch(8)

val_dataset = val_dataset.batch(32, drop_remainder=False)
val_dataset = val_dataset.prefetch(8)
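
To get a feel for how rarely a randomly drawn negative coincides with the anchor or positive of the same triplet, the sketch below repeats the pooling-and-shuffling idea on plain Python lists. The file names and the count of 1000 are made up for illustration, and the real pipeline applies additional tf.data shuffling that is not reproduced here.

```python
import random

# Hypothetical file names standing in for the real image paths.
anchors = [f"left/{i}.jpg" for i in range(1000)]
positives = [f"right/{i}.jpg" for i in range(1000)]

# Pool both lists and shuffle, mirroring how negative_images is built.
negatives = anchors + positives
random.Random(32).shuffle(negatives)

# zip stops at the shortest input, so only the first 1000 negatives are paired.
triplets = list(zip(anchors, positives, negatives))
collisions = sum(1 for a, p, n in triplets if n in (a, p))

print(len(triplets), collisions)
```

Each position has roughly a 2-in-2000 chance of a collision, so only about one collision is expected per thousand triplets.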

In the sample images, the 1st and 2nd images in each row look alike, while the 3rd looks different.

def visualize(anchor, positive, negative):
    def show(ax, image):
        ax.imshow(image)
        ax.get_xaxis().set_visible(False)
        ax.get_yaxis().set_visible(False)

    fig = plt.figure(figsize=(9, 9))

    axs = fig.subplots(3, 3)
    for i in range(3):
        show(axs[i, 0], anchor[i])
        show(axs[i, 1], positive[i])
        show(axs[i, 2], negative[i])


visualize(*list(train_dataset.take(1).as_numpy_iterator())[0])

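Defining the embedding generator model

The Siamese Network generates an embedding for each image of the triplet. To do this, we use a ResNet50 model pretrained on ImageNet and attach a few Dense layers so that the network can learn to separate these embeddings.

We freeze the weights of every layer of the model before conv5_block1_out. This is important so that the weights the model has already learned are not disturbed. The last few layers, from conv5_block1_out onward, are left trainable so that their weights can be fine-tuned during training.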

base_cnn = resnet.ResNet50(
    weights="imagenet", input_shape=target_shape + (3,), include_top=False
)

flatten = layers.Flatten()(base_cnn.output)
dense1 = layers.Dense(512, activation="relu")(flatten)
dense1 = layers.BatchNormalization()(dense1)
dense2 = layers.Dense(256, activation="relu")(dense1)
dense2 = layers.BatchNormalization()(dense2)
output = layers.Dense(256)(dense2)

embedding = Model(base_cnn.input, output, name="Embedding")

trainable = False
for layer in base_cnn.layers:
    if layer.name == "conv5_block1_out":
        trainable = True
    layer.trainable = trainable
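
The freezing loop relies on a flag that flips once and then stays on. The sketch below replays the same loop on stand-in objects to show which layers end up trainable. The shortened list of layer names and the use of `SimpleNamespace` in place of real Keras layers are assumptions made for illustration.

```python
from types import SimpleNamespace

# Hypothetical, shortened list of layer names; the real ResNet50 has many more.
layer_names = [
    "conv1_conv",
    "conv4_block6_out",
    "conv5_block1_out",
    "conv5_block2_out",
    "conv5_block3_out",
]
fake_layers = [SimpleNamespace(name=n, trainable=True) for n in layer_names]

trainable = False
for layer in fake_layers:
    # The flag flips once the marker layer is reached and stays on afterwards.
    if layer.name == "conv5_block1_out":
        trainable = True
    layer.trainable = trainable

print([(layer.name, layer.trainable) for layer in fake_layers])
```

Everything before the marker layer is frozen, while the marker layer and all layers after it remain trainable.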

Defining the Siamese Network model

The Siamese Network takes each image of the triplet as input, generates the embeddings, and outputs the distance between the Anchor and Positive embeddings and the distance between the Anchor and Negative embeddings.

To compute the distances, we can use a custom layer, DistanceLayer, that returns both values as a tuple.

For more on custom layers, see https://www.tensorflow.org/guide/keras/custom_layers_and_models?hl=ko.

class DistanceLayer(layers.Layer):
    """
    This layer computes the distance between the Anchor embedding and the
    Positive embedding, and between the Anchor embedding and the Negative embedding.
    """

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def call(self, anchor, positive, negative):
        ap_distance = tf.reduce_sum(tf.square(anchor - positive), -1)
        an_distance = tf.reduce_sum(tf.square(anchor - negative), -1)
        return (ap_distance, an_distance)


anchor_input = layers.Input(name="anchor", shape=target_shape + (3,))
positive_input = layers.Input(name="positive", shape=target_shape + (3,))
negative_input = layers.Input(name="negative", shape=target_shape + (3,))

distances = DistanceLayer()(
    embedding(resnet.preprocess_input(anchor_input)),
    embedding(resnet.preprocess_input(positive_input)),
    embedding(resnet.preprocess_input(negative_input)),
)

siamese_network = Model(
    inputs=[anchor_input, positive_input, negative_input], outputs=distances
)

Customizing Model.fit()

We now need to implement a model with a custom training loop so that the triplet loss can be computed from the three embeddings produced by the Siamese Network. We create a Mean metric instance to track the training loss.

Understanding the code below requires some familiarity with custom models.

See https://www.tensorflow.org/guide/keras/customizing_what_happens_in_fit?hl=ko.

class SiameseModel(Model):
    """The Siamese Network model with a custom training and testing loops.

    Computes the triplet loss using the three embeddings produced by the
    Siamese Network.

    The triplet loss is defined as:
       L(A, P, N) = max(‖f(A) - f(P)‖² - ‖f(A) - f(N)‖² + margin, 0)
    """

    def __init__(self, siamese_network, margin=0.5):
        super(SiameseModel, self).__init__()
        self.siamese_network = siamese_network
        self.margin = margin
        self.loss_tracker = metrics.Mean(name="loss")

    def call(self, inputs):
        return self.siamese_network(inputs)

    def train_step(self, data):
        # GradientTape is a context manager that records every operation done inside it.
        # We use it here to compute the loss so that we can get the gradients
        # and apply them with the optimizer specified in `compile()`.
        
        with tf.GradientTape() as tape:
            loss = self._compute_loss(data)

        # Store the gradients of the loss with respect to the weights.
        gradients = tape.gradient(loss, self.siamese_network.trainable_weights)

        # Apply the gradients using the optimizer specified for the model.
        self.optimizer.apply_gradients(
            zip(gradients, self.siamese_network.trainable_weights)
        )

        # Update the training loss.
        self.loss_tracker.update_state(loss)
        return {"loss": self.loss_tracker.result()}

    def test_step(self, data):
        loss = self._compute_loss(data)

        # Update the loss.
        self.loss_tracker.update_state(loss)
        return {"loss": self.loss_tracker.result()}

    def _compute_loss(self, data):
        # Compute the loss.
        # First get the anchor-positive and anchor-negative distances.
        ap_distance, an_distance = self.siamese_network(data)

        # Following the triplet loss definition, subtract the two distances
        # and apply max(loss + margin, 0) so that the result is never negative.
        loss = ap_distance - an_distance
        loss = tf.maximum(loss + self.margin, 0.0)
        return loss

    @property
    def metrics(self):
        # Metrics must be listed here so that `reset_states()` is called automatically.
        return [self.loss_tracker]

Training

siamese_model = SiameseModel(siamese_network)
siamese_model.compile(optimizer=optimizers.Adam(0.0001))
siamese_model.fit(train_dataset, epochs=10, validation_data=val_dataset)

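Checking the results

At this point we can check how well the network has learned to separate the embeddings depending on whether they belong to similar images.

We can measure the similarity between two embeddings with cosine similarity.

Let's take a sample to check the similarity between the embeddings generated for each image.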

sample = next(iter(train_dataset))
visualize(*sample)

anchor, positive, negative = sample
anchor_embedding, positive_embedding, negative_embedding = (
    embedding(resnet.preprocess_input(anchor)),
    embedding(resnet.preprocess_input(positive)),
    embedding(resnet.preprocess_input(negative)),
)

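Finally, we can compute the cosine similarity between the Anchor and Positive images and compare it with the similarity between the Anchor and Negative images.

We should expect the similarity between the Anchor and Positive images to be larger than the similarity between the Anchor and Negative images.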

cosine_similarity = metrics.CosineSimilarity()

positive_similarity = cosine_similarity(anchor_embedding, positive_embedding)
print("Positive similarity:", positive_similarity.numpy())

negative_similarity = cosine_similarity(anchor_embedding, negative_embedding)
print("Negative similarity", negative_similarity.numpy())


# Positive similarity: 0.9940324
# Negative similarity 0.9918252
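
For reference, cosine similarity depends only on the direction of two vectors and not on their length. The sketch below is a plain-Python version of the definition for a single pair of vectors. It is written for illustration and is not the Keras implementation, which, as far as I understand, also averages the result over the batch.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0: same direction, different length
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0: orthogonal
```

In the output above both similarities are close to 1, so what matters is the comparison: the Anchor-Positive value is the larger of the two.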

Fixing random seeds in TensorFlow

카카오그래놀라 — Thu, 1 Jul 2021 15:39:52 +0900

import os
SEED = 42
os.environ['PYTHONHASHSEED'] = str(SEED)
os.environ['TF_DETERMINISTIC_OPS'] = '1'

import random
random.seed(SEED)

import tensorflow as tf
import numpy as np


tf.random.set_seed(SEED)
np.random.seed(SEED)
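
A quick way to convince yourself that seeding works is to draw the same numbers twice. The sketch below does this with the standard library's random module only. The `draw` helper is a name made up for this illustration, and the same kind of check can be repeated with NumPy or TensorFlow generators.

```python
import random

SEED = 42


def draw(seed, n=3):
    # Re-seeding resets the generator, so the sequence starts over.
    random.seed(seed)
    return [random.random() for _ in range(n)]


first, second = draw(SEED), draw(SEED)
print(first == second)  # True: same seed, same sequence
```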

Fixing the TensorFlow error: Could not load dynamic library 'cusolver64_11.dll'; dlerror: cusolver64_11.dll not found. Cannot dlopen some GPU libraries.

카카오그래놀라 — Thu, 17 Jun 2021 20:09:02 +0900

Could not load dynamic library 'cusolver64_11.dll'; dlerror: cusolver64_11.dll not found.

This is a fix for the following situation on Windows 10 with CUDA 11.0.

After installing with pip install tensorflow, the error Could not load dynamic library 'cusolver64_11.dll'; dlerror: cusolver64_11.dll not found. Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform. Skipping registering GPU devices... appears and the GPU is not detected.

The short answer:

With CUDA 11.0, installing tensorflow with pip without pinning a version leaves you unable to use the GPU.

(As of 2021.06.17)

The reason is that pip install tensorflow currently (since 2021.06.17) installs tensorflow==2.5.0.

As the link below shows, tensorflow==2.5.0 is built against CUDA 11.2 and cuDNN 8.1.

https://www.tensorflow.org/install/source_windows?hl=en#gpu

Therefore, do one of the following:

1. If you have been using TensorFlow on the GPU with CUDA 10.1, run pip install tensorflow==2.3.0 to keep using the GPU.

2. If you have been using TensorFlow on the GPU with CUDA 11.0 and cuDNN 8.0, use tensorflow==2.4.0.

3. Otherwise, update to CUDA 11.2 and cuDNN 8.1 and use the GPU with tensorflow==2.5.0.
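
The three options above amount to a small lookup table. The sketch below encodes only the pairs mentioned in this post. The names `TF_BUILD_MATRIX` and `tensorflow_version_for` are made up for illustration, and a cuDNN value of None means the post does not state one.

```python
# Pairs taken from this post; a cuDNN value of None means the post does not state it.
TF_BUILD_MATRIX = {
    "2.3.0": {"cuda": "10.1", "cudnn": None},
    "2.4.0": {"cuda": "11.0", "cudnn": "8.0"},
    "2.5.0": {"cuda": "11.2", "cudnn": "8.1"},
}


def tensorflow_version_for(cuda_version):
    """Return the TensorFlow version listed for the given CUDA version, or None."""
    for tf_version, build in TF_BUILD_MATRIX.items():
        if build["cuda"] == cuda_version:
            return tf_version
    return None


print(tensorflow_version_for("11.0"))  # 2.4.0
print(tensorflow_version_for("9.0"))   # None
```

For CUDA versions that are not in the table, the official tested-build table linked above is the place to check.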

Fixing the TensorFlow error: Could not load 'cudart64_110.dll'; dlerror: cudart64_110.dll not found

카카오그래놀라 — Tue, 9 Feb 2021 23:48:45 +0900

could not load 'cudart64_110.dll' ; dlerror: cudart64_110.dll not found

Windows 10

This is a fix for the following situation with CUDA 10.1.

After installing with pip install tensorflow, the error Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found appears and the GPU is not detected.

The short answer:

With CUDA 10.1, installing tensorflow with pip without pinning a version leaves you unable to use the GPU.

The reason is that pip install tensorflow currently (since 20.12.15) installs tensorflow==2.4.0.

As the link below shows, tensorflow==2.4.0 is built against CUDA 11.0 and cuDNN 8.0.

https://www.tensorflow.org/install/source_windows?hl=ko#gpu

Therefore, do one of the following:

1. If you have been using TensorFlow on the GPU with CUDA 10.1, run pip install tensorflow==2.3.0 to keep using the GPU.

2. Otherwise, update to CUDA 11.0 and cuDNN 8.0 and use the GPU with tensorflow==2.4.0.

Movie Recommendation with Collaborative Filtering

카카오그래놀라 — Sat, 19 Dec 2020 01:10:11 +0900

Collaborative Filtering for Movie Recommendations

Author: Siddhartha Banerjee
Date created: 2020/05/24
Last modified: 2020/05/24
Description: Recommending movies using a model trained on Movielens dataset.

Introduction

This example demonstrates collaborative filtering using the Movielens dataset to recommend movies to users. The MovieLens ratings dataset lists the ratings given by a set of users to a set of movies. Our goal is to be able to predict ratings for movies a user has not yet watched. The movies with the highest predicted ratings can then be recommended to the user.

The steps in the model are as follows:

1. Map user ID to a "user vector" via an embedding matrix

2. Map movie ID to a "movie vector" via an embedding matrix

3. Compute the dot product between the user vector and movie vector, to obtain a match score between the user and the movie (predicted rating).

4. Train the embeddings via gradient descent using all known user-movie pairs.
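
Steps 1 to 3 can be sketched without any deep learning library. The snippet below looks up made-up 2-dimensional embeddings for one user and one movie, takes their dot product, adds the two biases, and applies a sigmoid. All numbers and names are invented for illustration, and step 4 (training the tables with gradient descent) is not shown.

```python
import math

# Tiny made-up embedding tables: row i is the vector for encoded ID i.
user_embeddings = [[0.1, 0.3], [0.5, -0.2]]
movie_embeddings = [[0.4, 0.1], [-0.3, 0.8], [0.2, 0.2]]
user_bias = [0.0, 0.1]
movie_bias = [0.05, 0.0, -0.1]


def predict(user, movie):
    u, m = user_embeddings[user], movie_embeddings[movie]
    score = sum(a * b for a, b in zip(u, m)) + user_bias[user] + movie_bias[movie]
    # The sigmoid squashes the score into (0, 1), matching ratings scaled to that range.
    return 1.0 / (1.0 + math.exp(-score))


print(predict(0, 1))
```

For user 0 and movie 1 the raw score is 0.1 * -0.3 + 0.3 * 0.8 = 0.21, and the sigmoid maps it to a value slightly above 0.5.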

Setup

import pandas as pd
import numpy as np
from zipfile import ZipFile
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from pathlib import Path
import matplotlib.pyplot as plt

Load the data

# First, load the data and apply preprocessing
# Download the actual data from http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
# Use the ratings.csv file
movielens_data_file_url = (
    "http://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
)
movielens_zipped_file = keras.utils.get_file(
    "ml-latest-small.zip", movielens_data_file_url, extract=False
)
keras_datasets_path = Path(movielens_zipped_file).parents[0]
movielens_dir = keras_datasets_path / "ml-latest-small"

# Only extract the data the first time the script is run.
if not movielens_dir.exists():
    with ZipFile(movielens_zipped_file, "r") as zip:
        # Extract files
        print("Extracting all the files now...")
        zip.extractall(path=keras_datasets_path)
        print("Done!")

ratings_file = movielens_dir / "ratings.csv"
df = pd.read_csv(ratings_file)

Preprocess the data

First, we need to perform some preprocessing to encode users and movies as integer indices.

# Convert the unique user IDs to a list
user_ids = df["userId"].unique().tolist()

# Dictionary mapping each original ID (key) to its index (value)
user2user_encoded = {x: i for i, x in enumerate(user_ids)}

# Dictionary mapping each index (key) back to its original ID (value)
userencoded2user = {i: x for i, x in enumerate(user_ids)}

# Same as above, for movies
movie_ids = df["movieId"].unique().tolist()
movie2movie_encoded = {x: i for i, x in enumerate(movie_ids)}
movie_encoded2movie = {i: x for i, x in enumerate(movie_ids)}

# Map user and movie IDs to their integer indices
df["user"] = df["userId"].map(user2user_encoded)
df["movie"] = df["movieId"].map(movie2movie_encoded)

num_users = len(user2user_encoded)
num_movies = len(movie_encoded2movie)

df["rating"] = df["rating"].values.astype(np.float32)

# min and max ratings will be used to normalize the ratings later
min_rating = min(df["rating"])
max_rating = max(df["rating"])

print(
    "Number of users: {}, Number of Movies: {}, Min rating: {}, Max rating: {}".format(
        num_users, num_movies, min_rating, max_rating
    )
)

# Number of users: 610, Number of Movies: 9724, Min rating: 0.5, Max rating: 5.0

Create the training and validation sets

# Prepare training and validation data

# df.sample draws rows at random; frac is the fraction to draw, so frac=1 draws every row in random order.
# In other words, this shuffles the whole dataset.
df = df.sample(frac=1, random_state=42)
x = df[["user", "movie"]].values

# Normalize the targets between 0 and 1. Makes it easy to train.
y = df["rating"].apply(lambda x: (x - min_rating) / (max_rating - min_rating)).values

# Assuming training on 90% of the data and validating on 10%.
train_indices = int(0.9 * df.shape[0])
x_train, x_val, y_train, y_val = (
    x[:train_indices],
    x[train_indices:],
    y[:train_indices],
    y[train_indices:],
)
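
Because the targets are scaled to the range 0 to 1, a predicted score has to be mapped back before it can be read as a rating. The sketch below shows the min-max scaling used above together with its inverse. It uses the minimum and maximum ratings printed earlier (0.5 and 5.0), and the helper names are made up for illustration.

```python
MIN_RATING, MAX_RATING = 0.5, 5.0  # values printed earlier in the post


def normalize(rating):
    return (rating - MIN_RATING) / (MAX_RATING - MIN_RATING)


def denormalize(score):
    return score * (MAX_RATING - MIN_RATING) + MIN_RATING


print(normalize(0.5), normalize(5.0))  # 0.0 1.0
print(denormalize(normalize(3.5)))
```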

Create the model

We embed both users and movies into 50-dimensional vectors.

The model computes a match score between user and movie embeddings via a dot product, and adds a per-movie and per-user bias. The match score is scaled to the [0, 1] interval via a sigmoid (since our ratings are normalized to this range).

EMBEDDING_SIZE = 50


class RecommenderNet(keras.Model):
    def __init__(self, num_users, num_movies, embedding_size, **kwargs):
        super(RecommenderNet, self).__init__(**kwargs)
        self.num_users = num_users
        self.num_movies = num_movies
        self.embedding_size = embedding_size
        self.user_embedding = layers.Embedding(
            num_users,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-6),
        )
        self.user_bias = layers.Embedding(num_users, 1)
        self.movie_embedding = layers.Embedding(
            num_movies,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-6),
        )
        self.movie_bias = layers.Embedding(num_movies, 1)

    def call(self, inputs):
        user_vector = self.user_embedding(inputs[:, 0])
        user_bias = self.user_bias(inputs[:, 0])
        movie_vector = self.movie_embedding(inputs[:, 1])
        movie_bias = self.movie_bias(inputs[:, 1])
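        # Per-example dot product; tf.tensordot with axes=2 would instead
        # collapse the whole batch into a single scalar.
        dot_user_movie = tf.reduce_sum(user_vector * movie_vector, axis=1, keepdims=True)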
        # Add all the components (including bias)
        x = dot_user_movie + user_bias + movie_bias
        # The sigmoid activation forces the rating to between 0 and 1
        return tf.nn.sigmoid(x)


model = RecommenderNet(num_users, num_movies, EMBEDDING_SIZE)
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(), optimizer=keras.optimizers.Adam(learning_rate=0.001)
)

Train the model based on the data split

history = model.fit(
    x=x_train,
    y=y_train,
    batch_size=64,
    epochs=5,
    verbose=1,
    validation_data=(x_val, y_val),
)


# Epoch 1/5
# 1418/1418 [==============================] - 6s 4ms/step - loss: 0.6368 - val_loss: 0.6206
# Epoch 2/5
# 1418/1418 [==============================] - 7s 5ms/step - loss: 0.6131 - val_loss: 0.6176
# Epoch 3/5
# 1418/1418 [==============================] - 6s 4ms/step - loss: 0.6083 - val_loss: 0.6146
# Epoch 4/5
# 1418/1418 [==============================] - 6s 4ms/step - loss: 0.6072 - val_loss: 0.6131
# Epoch 5/5
# 1418/1418 [==============================] - 6s 4ms/step - loss: 0.6075 - val_loss: 0.6150

Plot training and validation loss

plt.plot(history.history["loss"])
plt.plot(history.history["val_loss"])
plt.title("model loss")
plt.ylabel("loss")
plt.xlabel("epoch")
plt.legend(["train", "test"], loc="upper left")
plt.show()

Show the top 10 movies by predicted rating

# Show top 10 movie recommendations to a user

movie_df = pd.read_csv(movielens_dir / "movies.csv")

# Let us get a user and see the top recommendations.
user_id = df.userId.sample(1).iloc[0]
movies_watched_by_user = df[df.userId == user_id]
movies_not_watched = movie_df[
    ~movie_df["movieId"].isin(movies_watched_by_user.movieId.values)
]["movieId"]
movies_not_watched = list(
    set(movies_not_watched).intersection(set(movie2movie_encoded.keys()))
)
movies_not_watched = [[movie2movie_encoded.get(x)] for x in movies_not_watched]
user_encoder = user2user_encoded.get(user_id)
user_movie_array = np.hstack(
    ([[user_encoder]] * len(movies_not_watched), movies_not_watched)
)
ratings = model.predict(user_movie_array).flatten()
top_ratings_indices = ratings.argsort()[-10:][::-1]
recommended_movie_ids = [
    movie_encoded2movie.get(movies_not_watched[x][0]) for x in top_ratings_indices
]

print("Showing recommendations for user: {}".format(user_id))
print("====" * 9)
print("Movies with high ratings from user")
print("----" * 8)
top_movies_user = (
    movies_watched_by_user.sort_values(by="rating", ascending=False)
    .head(5)
    .movieId.values
)
movie_df_rows = movie_df[movie_df["movieId"].isin(top_movies_user)]
for row in movie_df_rows.itertuples():
    print(row.title, ":", row.genres)

print("----" * 8)
print("Top 10 movie recommendations")
print("----" * 8)
recommended_movies = movie_df[movie_df["movieId"].isin(recommended_movie_ids)]
for row in recommended_movies.itertuples():
    print(row.title, ":", row.genres)
    
    
# Showing recommendations for user: 474
# ====================================
# Movies with high ratings from user
# --------------------------------
# Fugitive, The (1993) : Thriller
# Remains of the Day, The (1993) : Drama|Romance
# West Side Story (1961) : Drama|Musical|Romance
# X2: X-Men United (2003) : Action|Adventure|Sci-Fi|Thriller
# Spider-Man 2 (2004) : Action|Adventure|Sci-Fi|IMAX
# --------------------------------
# Top 10 movie recommendations
# --------------------------------
# Dazed and Confused (1993) : Comedy
# Ghost in the Shell (Kôkaku kidôtai) (1995) : Animation|Sci-Fi
# Drugstore Cowboy (1989) : Crime|Drama
# Road Warrior, The (Mad Max 2) (1981) : Action|Adventure|Sci-Fi|Thriller
# Dark Knight, The (2008) : Action|Crime|Drama|IMAX
# Inglourious Basterds (2009) : Action|Drama|War
# Up (2009) : Adventure|Animation|Children|Drama
# Dark Knight Rises, The (2012) : Action|Adventure|Crime|IMAX
# Star Wars: Episode VII - The Force Awakens (2015) : Action|Adventure|Fantasy|Sci-Fi|IMAX
# Thor: Ragnarok (2017) : Action|Adventure|Sci-Fi
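
The ranking step `ratings.argsort()[-10:][::-1]` is easy to check in isolation. Below is a minimal sketch on a made-up score array (the values and the names `toy_ratings` and `k` are hypothetical, not taken from the model above). `argsort` returns indices in ascending order of score, the slice keeps the last `k` of them, and the reversal puts the highest score first.

```python
import numpy as np

# Hypothetical predicted ratings for six unseen movies (illustrative values only).
toy_ratings = np.array([0.31, 0.92, 0.15, 0.78, 0.64, 0.88])
k = 3

# argsort sorts ascending, so the k largest scores sit at the end;
# reversing the slice puts the best-scoring index first.
top_k_indices = toy_ratings.argsort()[-k:][::-1]

print(top_k_indices.tolist())  # [1, 5, 3]
```

Note that the recommendation code then filters `movie_df` with `isin`, so the printed list follows the row order of `movies.csv` rather than the predicted-rating order.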

Applying a Bidirectional LSTM model to IMDB

카카오그래놀라 — Thu, 17 Dec 2020 22:40:36 +0900

Bidirectional LSTM on IMDB

Author: fchollet
Date created: 2020/05/03
Last modified: 2020/05/03
Description: Train a 2-layer bidirectional LSTM on the IMDB movie review sentiment classification dataset.

- Keras

- Colab

- Github

Load libraries

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

max_features = 20000  # Only consider the top 20,000 most frequent words
maxlen = 200  # Pad or truncate each movie review to 200 tokens

Build the model

# Input for variable-length sequences of integers
inputs = keras.Input(shape=(None,), dtype="int32")

# Embed each integer in a 128-dimensional vector
x = layers.Embedding(max_features, 128)(inputs)

# Add 2 bidirectional LSTM layers
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64))(x)

# Add a binary classifier
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.summary()
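
As a sanity check on what `model.summary()` reports, the parameter counts can be worked out by hand. The following is a rough sketch, assuming the standard LSTM parameterization (four gates, each with an input kernel, a recurrent kernel, and a bias) and the default `merge_mode="concat"` of `Bidirectional`. The helper `lstm_params` is a hypothetical name introduced here for illustration.

```python
def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel, and a bias.
    return 4 * ((input_dim + units) * units + units)

max_features, embed_dim, units = 20000, 128, 64

embedding = max_features * embed_dim
bilstm_1 = 2 * lstm_params(embed_dim, units)  # forward + backward LSTM
bilstm_2 = 2 * lstm_params(2 * units, units)  # input is the concatenated 2 * 64 outputs
dense = 2 * units * 1 + 1                     # weights + bias of the sigmoid unit
total = embedding + bilstm_1 + bilstm_2 + dense

print(embedding, bilstm_1, bilstm_2, dense, total)
# 2560000 98816 98816 129 2757761
```

Most of the parameters live in the embedding table, so `max_features` and the embedding dimension dominate the model size.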

Load the IMDB movie review sentiment data

(x_train, y_train), (x_val, y_val) = keras.datasets.imdb.load_data(
    num_words=max_features
)
print(len(x_train), "Training sequences")
print(len(x_val), "Validation sequences")
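# pad_sequences pads and truncates at the front by default, so reviews longer
# than maxlen keep their last maxlen tokens.
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_val = keras.preprocessing.sequence.pad_sequences(x_val, maxlen=maxlen)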
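
To make the effect of `pad_sequences` concrete, here is a small pure-Python sketch of its default behavior (`padding="pre"`, `truncating="pre"`). The function `pad_pre` is a hypothetical stand-in written for illustration, not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Pure-Python sketch of pad_sequences with padding="pre", truncating="pre"."""
    # Keep only the last `maxlen` tokens, dropping tokens from the front.
    truncated = seq[-maxlen:]
    # Left-pad with `value` so every sequence has length `maxlen`.
    return [value] * (maxlen - len(truncated)) + truncated

print(pad_pre([5, 6, 7], maxlen=5))           # [0, 0, 5, 6, 7]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```

With these defaults a long review keeps its ending rather than its beginning. Passing `truncating="post"` to `pad_sequences` would keep the first `maxlen` tokens instead.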

Train and evaluate the model

model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=2, validation_data=(x_val, y_val))


