I always thought the gate structure in MLP layers was something that only shows up in MoE models like Mixtral, but it turns out Llama-based models also have a gate in the MLP, as in the printout below (gate_proj).
Can this gate be considered the same as the gate (router) in MoE? If so, does that mean Llama is also borrowing an MoE-style architecture..??
The model is Trelis/TinyLlama-1.1B-4k-chat-SFT.
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32001, 2048, padding_idx=32000)
    (layers): ModuleList(
      (0-21): 22 x LlamaDecoderLayer(
        (self_attn): LlamaFlashAttention2(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (up_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (down_proj): Linear(in_features=5632, out_features=2048, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=2048, out_features=32001, bias=False)
)
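For reference, my understanding is that this MLP computes a SwiGLU-style feed-forward, where gate_proj produces an element-wise multiplicative gate rather than routing tokens between experts. Here is a minimal sketch of that forward pass (the class name GatedMLP and the toy input are mine; the expression follows the usual gated-MLP formulation with the module names from the printout, so please correct me if the actual implementation differs):

import torch
import torch.nn as nn

class GatedMLP(nn.Module):
    """Minimal sketch of a Llama-style gated MLP (SwiGLU),
    using the module names and sizes from the printout above."""
    def __init__(self, hidden_size=2048, intermediate_size=5632):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act_fn = nn.SiLU()

    def forward(self, x):
        # "gate" here is an element-wise multiplicative gate on the
        # up-projected activations -- no expert routing is involved.
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))

x = torch.randn(1, 4, 2048)   # (batch, seq_len, hidden)
print(GatedMLP()(x).shape)    # torch.Size([1, 4, 2048])

So every token goes through the same single MLP each layer, which is what makes me wonder how this relates to the Mixtral-style gate that picks among multiple experts.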