This is the YAML file I tried:
```yaml
model:
  base_model: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
  model_type: LlamaForCausalLM
  tokenizer_type: AutoTokenizer
  load_in_4bit: true
  strict: false

training_args:
  output_dir: "./training_results"
  num_train_epochs: 3
  per_device_train_batch_size: 4
  gradient_accumulation_steps: 16  # use a consistent value
  fp16: true                       # enable FP16
  flash_attention: false           # keep disabled if not needed

datasets:
  - path: /workspace/axolotl/tests/transform_json_4.json
    ds_type: json
    type:
      system_prompt: ""
      system_format: "{system}"
      field_system: system
      field_instruction: human
      field_input: ""
      field_output: gpt
      format: |-
        System: {system}
        User: {human}
        Assistant: {gpt}
      no_input_format: "System: {system} User: {human} Assistant: {gpt}"

val_set_size: 0.05
dataset_prepared_path: last_run_prepared
output_dir: ./lora-out
sequence_len: 4096
pad_to_sequence_len: true

adapter: lora
lora_model_dir:
lora_r: 16
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
lora_target_linear: true

num_epochs: 1
micro_batch_size: 2
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0004

train_on_inputs: false
group_by_length: false
bf16: false
gradient_checkpointing: true
logging_steps: 1
warmup_steps: 100
evals_per_epoch: 4
saves_per_epoch: 1
weight_decay: 0.01
eval_sample_packing: false
```
This is a YAML file I put together while asking GPT. When I run it, I get:
```
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `2`
		More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Using RTX 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:122: UserWarning:
================================================================================
WARNING: Manual override via BNB_CUDA_VERSION env variable detected!
BNB_CUDA_VERSION=XXX can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_cuda_dir/lib64
Loading: libbitsandbytes_cuda118.so
================================================================================
  warn(
[... identical bitsandbytes UserWarning repeated by the second process ...]
[2024-04-20 09:01:15,913] [INFO] [datasets.<module>:58] [PID:765] PyTorch version 2.1.2+cu118 available.
[2024-04-20 09:01:15,941] [INFO] [datasets.<module>:58] [PID:764] PyTorch version 2.1.2+cu118 available.
[2024-04-20 09:01:16,347] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-20 09:01:16,392] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 59, in <module>
    fire.Fire(do_cli)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 30, in do_cli
    parsed_cfg = load_cfg(config, **kwargs)
  File "/workspace/axolotl/src/axolotl/cli/__init__.py", line 353, in load_cfg
    cfg = validate_config(
  File "/workspace/axolotl/src/axolotl/utils/config/__init__.py", line 209, in validate_config
    AxolotlConfigWCapabilities(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/pydantic/main.py", line 171, in __init__
    self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for AxolotlConfigWCapabilities
  Value error, At least two of micro_batch_size, gradient_accumulation_steps, batch_size must be set [type=value_error, input_value={'model': {'base_model': ...e_capability': 'sm_89'}}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.6/v/value_error
[... identical traceback and ValidationError repeated by the second process ...]
[2024-04-20 09:01:19,230] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 764) of binary: /root/miniconda3/envs/py3.10/bin/python3
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1048, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 702, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
axolotl.cli.train FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-04-20_09:01:19
  host      : 2819657d1cb0
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 765)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-20_09:01:19
  host      : 2819657d1cb0
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 764)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
The run stops with this error message every time.
Even when I feed the error message back to GPT, it just answers with this:
```bash
export BNB_CUDA_VERSION=118   # use CUDA 11.8
echo $BNB_CUDA_VERSION        # confirm the value
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.8/lib64
echo $LD_LIBRARY_PATH         # confirm the path
```

together with this revised YAML file:

```yaml
model:
  base_model: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
  model_type: LlamaForCausalLM
  tokenizer_type: AutoTokenizer
  load_in_4bit: true
  strict: false

training_args:
  output_dir: "./training_results"
  num_train_epochs: 3
  per_device_train_batch_size: 4
  micro_batch_size: 2
  gradient_accumulation_steps: 16
  fp16: true
  flash_attention: false

datasets:
  - path: /workspace/axolotl/tests/transform_json_4.json
    ds_type: json
    type:
      system_prompt: ""
      system_format: "{system}"
      field_system: system
      field_instruction: human
      field_input: ""
      field_output: gpt
      format: |-
        System: {system}
        User: {human}
        Assistant: {gpt}
      no_input_format: "System: {system} User: {human} Assistant: {gpt}"

val_set_size: 0.05
dataset_prepared_path: last_run_prepared
sequence_len: 4096
pad_to_sequence_len: true

adapter: lora
lora_model_dir:
lora_r: 16
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
lora_target_linear: true

optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0004
train_on_inputs: false
group_by_length: false
bf16: false
gradient_checkpointing: true
logging_steps: 1
warmup_steps: 100
evals_per_epoch: 4
saves_per_epoch: 1
weight_decay: 0.01
eval_sample_packing: false
```
But even after setting the environment variables and editing the YAML exactly as instructed, the same error keeps occurring and I'm stuck.
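One thing I'm wondering about, though I haven't been able to verify it: the `input_value={'model': {'base_model': ...}}` in the pydantic error looks like the config reached the validator with everything still nested under `model:` and `training_args:`, so the validator may never see `micro_batch_size` or `gradient_accumulation_steps` at the level where it looks for them. As far as I understand, axolotl example configs use flat top-level keys with no `model:`/`training_args:` wrappers. A minimal sketch of what I mean (my guess, not a confirmed fix):

```yaml
# Flat axolotl-style layout -- no model:/training_args: nesting (untested guess)
base_model: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
load_in_4bit: true
strict: false

# Batch settings at the top level, so the validator can find
# at least two of micro_batch_size / gradient_accumulation_steps / batch_size:
micro_batch_size: 2
gradient_accumulation_steps: 16
num_epochs: 1
```

Is this the right direction, or is the nesting fine and something else is going on?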
Sorry if this is a bit disorganized, but I'd really appreciate help from anyone with a lot of axolotl experience.