[LLM][Google/T5] Structure of the T5ForConditionalGeneration Model

꼬꼬마코더 2024. 9. 4. 10:38

T5ForConditionalGeneration(
  (shared): Embedding(32128, 1024)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 1024)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=1024, out_features=1024, bias=False)
              (k): Linear(in_features=1024, out_features=1024, bias=False)
              (v): Linear(in_features=1024, out_features=1024, bias=False)
              (o): Linear(in_features=1024, out_features=1024, bias=False)
              (relative_attention_bias): Embedding(32, 16)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=1024, out_features=2816, bias=False)
              (wi_1): Linear(in_features=1024, out_features=2816, bias=False)
              (wo): Linear(in_features=2816, out_features=1024, bias=False)
              (dropout): Dropout(p=0.1, inplace=False)
              (act): NewGELUActivation()
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
      (1-23): 23 x T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=1024, out_features=1024, bias=False)
              (k): Linear(in_features=1024, out_features=1024, bias=False)
              (v): Linear(in_features=1024, out_features=1024, bias=False)
              (o): Linear(in_features=1024, out_features=1024, bias=False)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=1024, out_features=2816, bias=False)
              (wi_1): Linear(in_features=1024, out_features=2816, bias=False)
              (wo): Linear(in_features=2816, out_features=1024, bias=False)
              (dropout): Dropout(p=0.1, inplace=False)
              (act): NewGELUActivation()
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (final_layer_norm): T5LayerNorm()
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (decoder): T5Stack(
    (embed_tokens): Embedding(32128, 1024)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=1024, out_features=1024, bias=False)
              (k): Linear(in_features=1024, out_features=1024, bias=False)
              (v): Linear(in_features=1024, out_features=1024, bias=False)
              (o): Linear(in_features=1024, out_features=1024, bias=False)
              (relative_attention_bias): Embedding(32, 16)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerCrossAttention(
            (EncDecAttention): T5Attention(
              (q): Linear(in_features=1024, out_features=1024, bias=False)
              (k): Linear(in_features=1024, out_features=1024, bias=False)
              (v): Linear(in_features=1024, out_features=1024, bias=False)
              (o): Linear(in_features=1024, out_features=1024, bias=False)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (2): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=1024, out_features=2816, bias=False)
              (wi_1): Linear(in_features=1024, out_features=2816, bias=False)
              (wo): Linear(in_features=2816, out_features=1024, bias=False)
              (dropout): Dropout(p=0.1, inplace=False)
              (act): NewGELUActivation()
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
      (1-23): 23 x T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=1024, out_features=1024, bias=False)
              (k): Linear(in_features=1024, out_features=1024, bias=False)
              (v): Linear(in_features=1024, out_features=1024, bias=False)
              (o): Linear(in_features=1024, out_features=1024, bias=False)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerCrossAttention(
            (EncDecAttention): T5Attention(
              (q): Linear(in_features=1024, out_features=1024, bias=False)
              (k): Linear(in_features=1024, out_features=1024, bias=False)
              (v): Linear(in_features=1024, out_features=1024, bias=False)
              (o): Linear(in_features=1024, out_features=1024, bias=False)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (2): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=1024, out_features=2816, bias=False)
              (wi_1): Linear(in_features=1024, out_features=2816, bias=False)
              (wo): Linear(in_features=2816, out_features=1024, bias=False)
              (dropout): Dropout(p=0.1, inplace=False)
              (act): NewGELUActivation()
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (final_layer_norm): T5LayerNorm()
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (lm_head): Linear(in_features=1024, out_features=32128, bias=False)
)

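This printout shows the module structure of T5ForConditionalGeneration as implemented in Hugging Face's Transformers library. The model is used mainly for natural language processing (NLP) tasks such as text generation, summarization, translation, and question answering. T5 stands for "Text-To-Text Transfer Transformer": both the input and the output are text. Each part of the structure is described below.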

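1. Embedding Layers:

(shared): Embedding(32128, 1024)
- The shared embedding layer maps each token ID to a 1024-dimensional vector; 32,128 is the vocabulary size. It is called shared because the encoder and the decoder use the same embedding layer: the embed_tokens entry in each T5Stack refers to this one table.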

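2. Encoder:

The first T5Stack is the encoder. It turns the embedded input tokens into contextual representations, one 1024-dimensional vector per token.
(block): There are 24 T5Blocks in total (indices 0 to 23); this is the depth of the encoder.
T5Block: each encoder block consists of two sublayers:
1. T5LayerSelfAttention: self-attention, which learns relationships between input tokens. The input is projected linearly into q (query), k (key), and v (value), attention scores are computed from q and k, and o projects the result back to 1024 dimensions. Only block 0 holds relative_attention_bias, Embedding(32, 16): 32 relative-position buckets for each of 16 heads. The other blocks reuse this bias, which is why it is absent from blocks 1-23.
2. T5LayerFF: the feed-forward sublayer. Here it is the gated variant (T5DenseGatedActDense): the GELU-activated wi_0 projection is multiplied elementwise by the wi_1 projection, and wo maps the 2816-dimensional result back to 1024.
T5LayerNorm: layer normalization is applied to the input of each sublayer (pre-norm), and the sublayer output is added back to the unnormalized input as a residual. T5LayerNorm only rescales by the root mean square; it does not subtract the mean and has no bias. A final_layer_norm follows the last block.
Dropout: each sublayer applies dropout to reduce overfitting, here with probability 0.1.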
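
The feed-forward sublayer described above can be sketched in plain PyTorch. This is a minimal illustration and not the library's code: the class names RMSNormSketch and GatedFeedForwardSketch are made up for this example, and the tanh approximation of GELU is assumed to correspond to NewGELUActivation in the printout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNormSketch(nn.Module):
    """Scale-only normalization: divide by the root mean square, no mean subtraction, no bias."""

    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mean_square = x.pow(2).mean(dim=-1, keepdim=True)
        return self.weight * x * torch.rsqrt(mean_square + self.eps)


class GatedFeedForwardSketch(nn.Module):
    """Pre-norm gated-GELU feed-forward sublayer with a residual connection."""

    def __init__(self, d_model: int = 1024, d_ff: int = 2816, dropout: float = 0.1):
        super().__init__()
        self.layer_norm = RMSNormSketch(d_model)
        self.wi_0 = nn.Linear(d_model, d_ff, bias=False)  # activated (gate) branch
        self.wi_1 = nn.Linear(d_model, d_ff, bias=False)  # linear branch
        self.wo = nn.Linear(d_ff, d_model, bias=False)    # back to d_model
        self.inner_dropout = nn.Dropout(dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        normed = self.layer_norm(hidden)  # normalize the sublayer input
        gated = F.gelu(self.wi_0(normed), approximate="tanh") * self.wi_1(normed)
        projected = self.wo(self.inner_dropout(gated))
        return hidden + self.dropout(projected)  # residual on the unnormalized input


# Small sizes keep the example light; eval() disables dropout
layer = GatedFeedForwardSketch(d_model=64, d_ff=176).eval()
x = torch.randn(2, 5, 64)
y = layer(x)
print(y.shape)  # torch.Size([2, 5, 64])
```

The output has the same shape as the input, which is what lets 24 such blocks be stacked.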

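3. Decoder:

The second T5Stack is the decoder. It takes the encoder output and generates the output text one token at a time.
Compared with the encoder, each decoder block has an additional cross-attention sublayer, which lets the decoder attend to the encoder output while producing each token.
T5LayerSelfAttention: the first sublayer performs self-attention over the tokens the decoder has produced so far; a causal mask keeps each position from attending to later positions.
T5LayerCrossAttention: queries come from the decoder states, and keys and values come from the encoder output, so the generated text reflects the meaning of the input.
T5LayerFF: the same feed-forward sublayer as in the encoder.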
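
As a rough sketch of the two decoder-specific ideas above, the following PyTorch code builds a causal mask and a bare cross-attention module. The names causal_mask and CrossAttentionSketch are hypothetical, the head size is assumed to be d_model / num_heads (1024 / 16 = 64 for the printout), and normalization, dropout, and position bias are left out for brevity.

```python
import torch
import torch.nn as nn


def causal_mask(length: int) -> torch.Tensor:
    """True where a position may attend: itself and earlier positions only."""
    return torch.ones(length, length).tril().bool()


class CrossAttentionSketch(nn.Module):
    """Queries from the decoder, keys and values from the encoder output."""

    def __init__(self, d_model: int = 1024, num_heads: int = 16):
        super().__init__()
        self.num_heads = num_heads
        self.d_kv = d_model // num_heads
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_model, d_model, bias=False)
        self.v = nn.Linear(d_model, d_model, bias=False)
        self.o = nn.Linear(d_model, d_model, bias=False)

    def _split_heads(self, x: torch.Tensor) -> torch.Tensor:
        batch, length, _ = x.shape
        return x.view(batch, length, self.num_heads, self.d_kv).transpose(1, 2)

    def forward(self, decoder_states: torch.Tensor, encoder_states: torch.Tensor) -> torch.Tensor:
        batch, tgt_len, _ = decoder_states.shape
        q = self._split_heads(self.q(decoder_states))  # (batch, heads, tgt_len, d_kv)
        k = self._split_heads(self.k(encoder_states))  # (batch, heads, src_len, d_kv)
        v = self._split_heads(self.v(encoder_states))
        # T5 does not divide the scores by sqrt(d_kv)
        scores = q @ k.transpose(-1, -2)               # (batch, heads, tgt_len, src_len)
        weights = scores.softmax(dim=-1)
        context = (weights @ v).transpose(1, 2).reshape(batch, tgt_len, -1)
        return self.o(context)


attn = CrossAttentionSketch(d_model=64, num_heads=4).eval()
decoder_states = torch.randn(2, 3, 64)  # 3 target positions
encoder_states = torch.randn(2, 7, 64)  # 7 source positions
out = attn(decoder_states, encoder_states)
mask = causal_mask(3)
print(out.shape)  # torch.Size([2, 3, 64])
```

The output length follows the decoder (3 positions) even though the keys and values cover 7 encoder positions.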

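4. lm_head (language model output head):

Linear(in_features=1024, out_features=32128): a linear projection from the decoder's final hidden state (1024 dimensions) to one score (logit) per vocabulary token; 32,128 is the vocabulary size. Applying softmax to the logits gives a probability distribution over tokens. With greedy decoding the highest-probability token is selected; beam search and sampling choose differently.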
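
The step from logits to a predicted token can be illustrated with a toy vocabulary of four tokens. The logit values below are made up for the example; a real lm_head produces 32,128 of them per position.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 0.5, -1.0, 3.0]  # hypothetical lm_head output for one position
probs = softmax(logits)

# Greedy decoding: pick the index with the highest probability
greedy_token = max(range(len(probs)), key=probs.__getitem__)
print(greedy_token)  # 3
```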

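How it works:

The input text is tokenized, embedded by the shared embedding layer, and processed by the encoder.
The encoder produces a contextual representation of the input and passes it to the decoder.
The decoder predicts the next token from the encoder output and its own previous outputs.
lm_head scores every vocabulary token at each step, and repeating this step token by token generates new text.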
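
The four steps above can be traced with a hand-written greedy loop. This is only a sketch: the model is a tiny, randomly initialized T5 (the sizes are illustrative and not those of the printout), the token IDs are made up, and the generated IDs are therefore meaningless. In practice model.generate() does this work.

```python
import torch
from transformers import T5Config, T5ForConditionalGeneration

# Tiny random model so the example runs quickly without downloading weights
config = T5Config(
    vocab_size=128, d_model=64, d_kv=16, d_ff=128,
    num_layers=2, num_decoder_layers=2, num_heads=4,
    feed_forward_proj="gated-gelu",
)
model = T5ForConditionalGeneration(config).eval()

input_ids = torch.tensor([[5, 17, 42, 1]])                   # made-up IDs; 1 is T5's end-of-sequence ID
decoder_input_ids = torch.tensor([[config.pad_token_id]])    # T5 starts decoding from the pad token

with torch.no_grad():
    # Steps 1-2: embed the input and run the encoder once
    encoder_outputs = model.get_encoder()(input_ids=input_ids)
    for _ in range(5):
        # Step 3: the decoder sees the encoder output and its own previous tokens
        out = model(encoder_outputs=encoder_outputs, decoder_input_ids=decoder_input_ids)
        # Step 4: lm_head logits for the last position, greedy choice
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        decoder_input_ids = torch.cat([decoder_input_ids, next_token], dim=-1)

print(decoder_input_ids.shape)  # torch.Size([1, 6])
```

The encoder runs once, while the decoder and lm_head run once per generated token.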

Because T5 treats both input and output as text, a single Transformer model covers many NLP tasks, which is why it is widely used.
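
The shapes in the printout are enough to count the model's parameters by hand. The sketch below assumes that each T5LayerNorm holds one scale vector of size 1024, that embed_tokens in both stacks reuses the shared table (so it is counted once), and that lm_head has its own weights rather than being tied to the embedding.

```python
vocab, d_model, d_ff = 32128, 1024, 2816
num_blocks, buckets, heads = 24, 32, 16

embedding = vocab * d_model          # shared
attention = 4 * d_model * d_model    # q, k, v, o
feed_forward = 3 * d_model * d_ff    # wi_0, wi_1, wo
layer_norm = d_model                 # scale vector only, no bias
rel_bias = buckets * heads           # relative_attention_bias, block 0 only

# Encoder block: self-attention + feed-forward, each with a layer norm
encoder = num_blocks * (attention + feed_forward + 2 * layer_norm) + rel_bias + layer_norm
# Decoder block: self-attention + cross-attention + feed-forward
decoder = num_blocks * (2 * attention + feed_forward + 3 * layer_norm) + rel_bias + layer_norm
lm_head = d_model * vocab            # assumed untied

total = embedding + encoder + decoder + lm_head
print(f"{total:,}")  # 783,150,080
```

Under these assumptions the printed structure holds roughly 783 million parameters, of which the two embedding-sized matrices account for about 66 million.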

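To change a T5 model's parameters, you can either change the architecture (embedding size, number of layers, hidden size, and so on) or retrain the model through fine-tuning. Most models in Hugging Face's Transformers library expose these settings through a config object. Note that changing the architecture produces weight shapes that no longer match a pretrained checkpoint, so such a model has to be trained from scratch. The main approaches are described below.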

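1. Changing the model configuration (Config)

A T5 model is configured through a T5Config object. Edit the config first, then build the model from it.

from transformers import T5ForConditionalGeneration, T5Config

# Load an existing configuration
config = T5Config.from_pretrained("t5-small")

# Modify the parameters you want
config.d_model = 768            # hidden (embedding) dimension
config.num_layers = 16          # number of encoder layers
config.num_decoder_layers = 16  # number of decoder layers (stored separately)
config.num_heads = 12           # number of attention heads

# Initialize a model from the modified config (random weights, not pretrained)
model = T5ForConditionalGeneration(config)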

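Main parameters:

d_model: dimension of the embedding and hidden vectors (T5Config default: 512; 1024 in the model printed above).
num_layers: number of encoder layers (T5Config default: 6; 24 in the model printed above). The decoder depth is num_decoder_layers, which defaults to num_layers when it is not given.
num_heads: number of attention heads (T5Config default: 8; 16 in the model printed above).
d_ff: inner dimension of the feed-forward network (T5Config default: 2048; 2816 in the model printed above).
vocab_size: size of the vocabulary (T5Config default: 32128).
max_length: not an architecture parameter. T5 uses relative position biases, so the model itself has no fixed maximum input length; 512 is the input limit set on the tokenizer (model_max_length) for the original T5 checkpoints, while max_length in the generation settings limits the output length.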
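
To see how these parameters map onto the built model, the sketch below constructs a deliberately tiny, randomly initialized T5 (the sizes are illustrative only) and inspects it. It also shows why num_decoder_layers needs its own assignment when a config is edited after creation.

```python
from transformers import T5Config, T5ForConditionalGeneration

# Illustrative sizes, chosen only to keep the model small
config = T5Config(
    vocab_size=128, d_model=64, d_kv=16, d_ff=128,
    num_layers=3, num_decoder_layers=2, num_heads=4,
)
model = T5ForConditionalGeneration(config)

print(len(model.encoder.block), len(model.decoder.block))  # 3 2
print(model.shared.embedding_dim)                          # 64

# num_decoder_layers is filled in once, when the config is created
other = T5Config(num_layers=2)
other.num_layers = 4             # later edits touch the encoder depth only
print(other.num_decoder_layers)  # 2
```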

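2. Modifying the embedding or the layers

(1) Changing the embedding size

The width of the embedding vectors is d_model, and every attention and feed-forward layer expects inputs of that width. To change it, set config.d_model and build a new model. You can also replace the embed_tokens layer directly, but the new layer's width must equal d_model; swapping a 768-dimensional embedding into a 1024-dimensional model fails with a shape mismatch on the first forward pass.

import torch

# Replace the shared embedding table; its width must equal config.d_model
new_embeddings = torch.nn.Embedding(model.config.vocab_size, model.config.d_model)
model.shared = new_embeddings
model.encoder.embed_tokens = new_embeddings
model.decoder.embed_tokens = new_embeddings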
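
When the goal is a different vocabulary size rather than a different width, for example after adding tokens to the tokenizer, the library's resize_token_embeddings method is the usual route. The sketch below uses a tiny random model with illustrative sizes, and the new size of 136 is arbitrary.

```python
from transformers import T5Config, T5ForConditionalGeneration

config = T5Config(vocab_size=128, d_model=64, d_kv=16, d_ff=128, num_layers=2, num_heads=4)
model = T5ForConditionalGeneration(config)

# Grow the vocabulary from 128 to 136 rows; the width (d_model) is unchanged
model.resize_token_embeddings(136)

embeddings = model.get_input_embeddings()
print(embeddings.num_embeddings, embeddings.embedding_dim)  # 136 64
print(model.lm_head.weight.shape[0])                        # 136
```

The existing rows are kept and only the new rows are freshly initialized, so a pretrained model stays usable after the resize.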

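(2) Changing the number of layers

Set config.num_layers for the encoder and config.num_decoder_layers for the decoder, then build the model. The example below changes both from 24 to 12.

# Change the encoder and decoder depth (24 -> 12)
config.num_layers = 12
config.num_decoder_layers = 12
model = T5ForConditionalGeneration(config)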

3. Fine-tuning

To adapt the model to a specific task, such as text generation, translation, or summarization, continue training it on data for that task.

(1) Adjusting training hyperparameters

Training hyperparameters such as the learning rate and its schedule can be set as follows.

from torch.optim import AdamW
from transformers import get_scheduler

# Optimizer with the learning rate
optimizer = AdamW(model.parameters(), lr=5e-5)

# Linear schedule: warm up for 100 steps, then decay to zero at step 1000
num_training_steps = 1000
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=100, num_training_steps=num_training_steps
)
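
What the "linear" schedule does to the learning rate can be worked out by hand. The function below is a plain-Python re-derivation written for this post, not the library's code; linear_schedule_factor is a hypothetical name.

```python
def linear_schedule_factor(step, num_warmup_steps=100, num_training_steps=1000):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Warmup: rise linearly from 0 to 1
        return step / max(1, num_warmup_steps)
    # Decay: fall linearly from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 5e-5
for step in (0, 50, 100, 550, 1000):
    print(step, base_lr * linear_schedule_factor(step))
```

The learning rate peaks at 5e-5 at step 100, is half of that at step 550, and reaches zero at step 1000.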

(2) Using Trainer

Hugging Face's Trainer runs the fine-tuning loop for you once the model and the datasets are ready.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    eval_strategy="epoch",           # evaluate every epoch ("evaluation_strategy" in older transformers releases)
    per_device_train_batch_size=8,   # training batch size
    per_device_eval_batch_size=8,    # evaluation batch size
    num_train_epochs=3,              # number of epochs
    weight_decay=0.01,               # weight decay
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # tokenized training dataset (prepared beforehand)
    eval_dataset=eval_dataset     # tokenized evaluation dataset (prepared beforehand)
)

# Start fine-tuning
trainer.train()

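4. Re-initializing the weights and retraining

After changing the architecture, or to train from scratch, start from freshly initialized weights. Building the model from a config gives random weights and discards everything learned in pretraining.

# Build a model with freshly initialized (random) weights
model = T5ForConditionalGeneration(config)

# Train again; the Trainer must be created with this new model
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()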

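5. Other configuration changes

Other settings, such as the dropout rate (dropout_rate) and the relative-attention settings, can be changed the same way. Set them on the config before building the model. For example, to change the dropout rate:

config.dropout_rate = 0.2  # change the dropout probability from 0.1 to 0.2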
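
The dropout setting only takes effect if it is on the config when the model is built. The sketch below checks this on a tiny random model with illustrative sizes.

```python
from transformers import T5Config, T5ForConditionalGeneration

config = T5Config(
    vocab_size=128, d_model=64, d_kv=16, d_ff=128,
    num_layers=2, num_heads=4, dropout_rate=0.2,
)
model = T5ForConditionalGeneration(config)
print(model.encoder.dropout.p)  # 0.2

# Editing the config afterwards does not touch modules that already exist
model.config.dropout_rate = 0.3
print(model.encoder.dropout.p)  # 0.2
```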

With these approaches you can modify or fine-tune almost every parameter of the model.
