Here is the YAML file I tried:

model:
  base_model: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
  model_type: LlamaForCausalLM
  tokenizer_type: AutoTokenizer
  load_in_4bit: true
  strict: false

training_args:
  output_dir: "./training_results"
  num_train_epochs: 3
  per_device_train_batch_size: 4
  gradient_accumulation_steps: 16  # use a consistent value
  fp16: true  # enable FP16
  flash_attention: false  # keep disabled if not needed

datasets:
  - path: /workspace/axolotl/tests/transform_json_4.json
    ds_type: json
    type:
      system_prompt: ""
      system_format: "{system}"
      field_system: system
      field_instruction: human
      field_input: ""
      field_output: gpt
    format: |-
        System: {system}
        User: {human}
        Assistant: {gpt}
    no_input_format: "System: {system} User: {human} Assistant: {gpt}"

val_set_size: 0.05
dataset_prepared_path: last_run_prepared
output_dir: ./lora-out
sequence_len: 4096
pad_to_sequence_len: true
adapter: lora
lora_model_dir:
lora_r: 16
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
lora_target_linear: true
num_epochs: 1
micro_batch_size: 2
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0004
train_on_inputs: false
group_by_length: false
bf16: false
gradient_checkpointing: true
logging_steps: 1
warmup_steps: 100
evals_per_epoch: 4
saves_per_epoch: 1
weight_decay: 0.01
eval_sample_packing: false


I put this YAML together while asking GPT questions. When I run it, I get:


The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `2`
                More than one GPU was found, enabling multi-GPU training.
                If this was unintended please pass in `--num_processes=1`.
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Using RTX 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:122: UserWarning: 

================================================================================
WARNING: Manual override via BNB_CUDA_VERSION env variable detected!
BNB_CUDA_VERSION=XXX can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_cuda_dir/lib64
Loading: libbitsandbytes_cuda118.so
================================================================================

  warn(
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:122: UserWarning: 

================================================================================
WARNING: Manual override via BNB_CUDA_VERSION env variable detected!
BNB_CUDA_VERSION=XXX can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_cuda_dir/lib64
Loading: libbitsandbytes_cuda118.so
================================================================================

  warn(
[2024-04-20 09:01:15,913] [INFO] [datasets.<module>:58] [PID:765] PyTorch version 2.1.2+cu118 available.
[2024-04-20 09:01:15,941] [INFO] [datasets.<module>:58] [PID:764] PyTorch version 2.1.2+cu118 available.
[2024-04-20 09:01:16,347] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-20 09:01:16,392] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 59, in <module>
    fire.Fire(do_cli)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 30, in do_cli
    parsed_cfg = load_cfg(config, **kwargs)
  File "/workspace/axolotl/src/axolotl/cli/__init__.py", line 353, in load_cfg
    cfg = validate_config(
  File "/workspace/axolotl/src/axolotl/utils/config/__init__.py", line 209, in validate_config
    AxolotlConfigWCapabilities(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/pydantic/main.py", line 171, in __init__
    self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for AxolotlConfigWCapabilities
  Value error, At least two of micro_batch_size, gradient_accumulation_steps, batch_size must be set [type=value_error, input_value={'model': {'base_model': ...e_capability': 'sm_89'}}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.6/v/value_error
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 59, in <module>
    fire.Fire(do_cli)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 30, in do_cli
    parsed_cfg = load_cfg(config, **kwargs)
  File "/workspace/axolotl/src/axolotl/cli/__init__.py", line 353, in load_cfg
    cfg = validate_config(
  File "/workspace/axolotl/src/axolotl/utils/config/__init__.py", line 209, in validate_config
    AxolotlConfigWCapabilities(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/pydantic/main.py", line 171, in __init__
    self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for AxolotlConfigWCapabilities
  Value error, At least two of micro_batch_size, gradient_accumulation_steps, batch_size must be set [type=value_error, input_value={'model': {'base_model': ...e_capability': 'sm_89'}}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.6/v/value_error
[2024-04-20 09:01:19,230] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 764) of binary: /root/miniconda3/envs/py3.10/bin/python3
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1048, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 702, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
axolotl.cli.train FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-04-20_09:01:19
  host      : 2819657d1cb0
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 765)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-20_09:01:19
  host      : 2819657d1cb0
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 764)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================


It fails with this error output every time and won't go any further.


When I feed GPT the error message and ask again, it suggests these environment settings:

export BNB_CUDA_VERSION=118  # use the CUDA 11.8 build
echo $BNB_CUDA_VERSION  # confirm the value
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.8/lib64
echo $LD_LIBRARY_PATH  # confirm the path
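
For reference, I believe bitsandbytes also ships a diagnostic entry point that prints which CUDA setup it actually loaded, which should show whether the override took effect (this is just my understanding of the tool, so the exact output may differ by version):

python -m bitsandbytes  # prints the detected CUDA setup and runs a small self-test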


GPT also gave me this revised YAML file to go with those settings:

model:
  base_model: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
  model_type: LlamaForCausalLM
  tokenizer_type: AutoTokenizer
  load_in_4bit: true
  strict: false

training_args:
  output_dir: "./training_results"
  num_train_epochs: 3
  per_device_train_batch_size: 4
  micro_batch_size: 2
  gradient_accumulation_steps: 16
  fp16: true
  flash_attention: false

datasets:
  - path: /workspace/axolotl/tests/transform_json_4.json
    ds_type: json
    type:
      system_prompt: ""
      system_format: "{system}"
      field_system: system
      field_instruction: human
      field_input: ""
      field_output: gpt
    format: |-
        System: {system}
        User: {human}
        Assistant: {gpt}
    no_input_format: "System: {system} User: {human} Assistant: {gpt}"

val_set_size: 0.05
dataset_prepared_path: last_run_prepared
sequence_len: 4096
pad_to_sequence_len: true
adapter: lora
lora_model_dir:
lora_r: 16
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
lora_target_linear: true
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0004
train_on_inputs: false
group_by_length: false
bf16: false
gradient_checkpointing: true
logging_steps: 1
warmup_steps: 100
evals_per_epoch: 4
saves_per_epoch: 1
weight_decay: 0.01
eval_sample_packing: false


But even after changing the environment variables and the YAML file exactly as instructed, the same error keeps occurring and I'm stuck.
Sorry if this is a bit disorganized, but I'd really appreciate help from anyone with solid axolotl experience.
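
In case my own guess helps anyone answering: as far as I can tell from Axolotl's example configs, the file is expected to be flat, without the model: / training_args: nesting that GPT generated, so the validator presumably never sees the batch keys tucked under training_args:. A minimal sketch of what I think it wants, with the values copied from my file above (I'm not certain this is the whole fix):

# Flat, top-level keys only -- no model:/training_args: sections (my guess)
base_model: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
load_in_4bit: true
# The validation error asks for at least two of these three at the top level:
micro_batch_size: 2
gradient_accumulation_steps: 16
num_epochs: 1

If that reading is right, it would also explain why moving micro_batch_size under training_args: in the second file changed nothing.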