Unsloth Fine-tuning DeepSeek R1 on your PC with Nvidia GPU¶
In this notebook, we demonstrate how to fine-tune DeepSeek-R1-Distill-Llama-8B with Unsloth on your PC, using a medical dataset.
After fine-tuning, we will have the model in both safetensors and gguf format.
We will also guide you through converting the fine-tuned model to an Ollama-compatible format and running inference via ollama and Page Assist.
Why do we need LLM fine-tuning?¶
Fine-tuning adapts an existing model to a particular task or domain, improving its performance and making it more effective and versatile in real-world applications.
Preparation¶
Install software¶
Before we start the fine-tuning process, we need to get some tools ready (a quick sanity check for them is sketched after this list):
- Nvidia Driver and CUDA Toolkit: choose a proper driver for your Nvidia GPU and install the CUDA 12.4 toolkit
- Anaconda: we need this to create a virtual environment with Python 3.10 and to install PyTorch and Unsloth
- git and cmake: when we save the fine-tuned model in gguf format, we call the save_pretrained_gguf() method, which clones llama.cpp from GitHub and compiles it. You can download them from the git official site and the cmake official site and append their bin folders to Windows' path
- ollama and Page Assist: we use ollama to run the fine-tuned model and Page Assist for a friendly UI. You can download ollama -> here and install Page Assist from -> here
- Triton wheel for Python 3.10: if you are using a Windows PC, you will need to download the triton-windows build and install it later via pip install triton-3.0.0-cp310-cp310-win_amd64.whl
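As a quick sanity check (a minimal sketch, not part of the required setup), you can verify from Python that these command-line tools are reachable on your PATH:
# Minimal sketch: check that the prerequisite command-line tools are on PATH.
# The names below are the standard executables; adjust them if yours differ.
import shutil
for tool in ["nvidia-smi", "nvcc", "git", "cmake", "ollama"]:
    path = shutil.which(tool)
    print(f"{tool:12s} -> {path if path else 'NOT FOUND'}")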
Create a Virtual Environment with Anaconda and Install Essential Packages¶
- Download this notebook and triton-3.0.0-cp310-cp310-win_amd64.whl into a folder llama_finetune
- Open cmd and cd to llama_finetune
- Run conda create -p llama python=3.10 to create a venv called llama in the current folder
- Run conda activate ./llama to activate the venv
- Run pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124 to install PyTorch
- Run pip install jupyter
- Run pip install triton-3.0.0-cp310-cp310-win_amd64.whl (you can skip this if you are using WSL)
- Run pip install unsloth, then pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git, and finally pip install bitsandbytes unsloth_zoo
- Run jupyter notebook and open this Jupyter Notebook
# check if PyTorch is properly installed, torch.cuda.is_available() should be True
import torch
torch.cuda.is_available()
True
Set Hyperparameters¶
Notice:
- BASE_MODEL: str = "./DeepSeek-R1-Distill-Llama-8B" means we pre-downloaded the model files into the folder llama/DeepSeek-R1-Distill-Llama-8B. If set to "unsloth/DeepSeek-R1-Distill-Llama-8B", the model will be downloaded from Hugging Face and saved to the cache folder; check the docstring to find the actual path of the cache folder
- TRAIN_SIZE and DATASET_NUM_PROCESS: change these two parameters according to your device; setting them too high will result in an Out of CUDA Memory error
- RANDOM_STATE: set it to any value you wish; it is used for reproducibility
- LORA_OUTPUT_DIR: the training checkpoints will be saved here
- Others: check the PyTorch concepts
# Hyperparameters
BASE_MODEL: str = "./DeepSeek-R1-Distill-Llama-8B" # pre-downloaded into folder llama/DeepSeek-R1-Distill-Llama-8B; use "unsloth/DeepSeek-R1-Distill-Llama-8B" to download from Hugging Face instead
SAVED_MODEL_NAME: str = "Medical_DeepSeek"
REPORT_TO: str = "none" # Use this for WandB etc
TRAIN_EPOCHS: float = 1.0 # should be a float according to the docstring of TrainingArguments
LEARNING_RATE: float = 2e-4
TRAIN_SIZE: int = 1000
RANDOM_STATE: int = 3407
LORA_ALPHA: int = 16
MAX_SEQUENCE_LENGTH: int = 2048
DATASET_NUM_PROCESS: int = 1
PER_DEVICE_TRAIN_BATCH_SIZE: int = 2
GRADIENT_ACCUMULATION_STEPS: int = 4
WARMUP_STEPS: int = 5
MAX_STEPS: int = 80
LOGGING_STEPS: int = 1
WEIGHT_DECAY: float = 0.01
OPTIMIZER: str = "adamw_8bit"
LEARNING_RATE_SCHEDULER_TYPE: str = "linear"
LORA_OUTPUT_DIR: str = "outputs"
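With these defaults, the effective batch size is PER_DEVICE_TRAIN_BATCH_SIZE × GRADIENT_ACCUMULATION_STEPS = 2 × 4 = 8 examples per optimizer step, so MAX_STEPS = 80 covers about 80 × 8 = 640 of the TRAIN_SIZE = 1000 examples, i.e. less than one full epoch. A tiny sketch of that arithmetic (not part of the training code itself):
# Rough bookkeeping for the hyperparameters above.
effective_batch_size = PER_DEVICE_TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS  # 2 * 4 = 8
examples_seen = effective_batch_size * MAX_STEPS  # 8 * 80 = 640
print(f"effective batch size: {effective_batch_size}, examples seen in {MAX_STEPS} steps: {examples_seen} / {TRAIN_SIZE}")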
Choose a Base Model¶
- Choose a model that aligns with your use case
- Assess your storage, compute capacity and dataset (a rough estimate is sketched after this list)
- Select a model and its parameter count
- Choose between base and instruct models
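For the compute-capacity step, a back-of-the-envelope estimate (an assumption-based sketch, not an exact requirement): an 8B-parameter model loaded in 4-bit needs roughly 8e9 × 0.5 bytes ≈ 4 GB of VRAM just for the weights, with the LoRA adapters, optimizer states and activations on top of that.
# Back-of-the-envelope VRAM estimate for the 4-bit base weights only.
# Assumption: ~0.5 bytes per parameter for 4-bit quantization; activations and optimizer states are not included.
params = 8e9  # DeepSeek-R1-Distill-Llama-8B
bytes_per_param = 0.5  # 4-bit
print(f"~{params * bytes_per_param / 1024**3:.1f} GiB for the quantized weights")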
from unsloth import FastLanguageModel
import torch
max_seq_length = MAX_SEQUENCE_LENGTH # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
model, tokenizer = FastLanguageModel.from_pretrained(
# model_name = "unsloth/DeepSeek-R1-Distill-Llama-8B",
model_name = BASE_MODEL,
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = load_in_4bit,
# token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
D:\llama_finetune\llama\lib\site-packages\unsloth_zoo\gradient_checkpointing.py:330: UserWarning: expandable_segments not supported on this platform (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\c10/cuda/CUDAAllocatorConfig.h:28.) GPU_BUFFERS = tuple([torch.empty(2*256*2048, dtype = dtype, device = f"cuda:{i}") for i in range(n_gpus)])
==((====))== Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.50.2. \\ /| Tesla P40. Num GPUs = 1. Max memory: 24.0 GB. Platform: Windows. O^O/ \_/ \ Torch: 2.6.0+cu124. CUDA: 6.1. CUDA Toolkit: 12.4. Triton: 3.2.0 \ / Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False] "-____-" Free license: http://github.com/unslothai/unsloth Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
Inference before fine-tuning¶
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.
### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.
### Question:
{}
### Response:
<think>{}"""
question = "一个患有急性阑尾炎的病人已经发病5天,腹痛稍有减轻但仍然发热,在体检时发现右下腹有压痛的包块,此时应如何处理?"
FastLanguageModel.for_inference(model)
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")
outputs = model.generate(
input_ids=inputs.input_ids,
attention_mask=inputs.attention_mask,
max_new_tokens=1200,
use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])
<think> 好的,我现在需要帮助处理一个急性阑尾炎患者的情况。患者已经发病5天,腹痛稍有减轻但仍然发热。在体检时,发现了右下腹有压痛的包块。作为医生,我应该如何处理呢? 首先,急性阑尾炎的典型症状包括发热、腹痛、发热5天以上,且通常在右下腹出现压痛的包块。患者的情况符合急性阑尾炎的表现,所以首先要考虑是否需要进一步的诊断。 接下来,发现右下腹有压痛的包块,这可能提示有炎症或附加的结核病性变化。因此,需要进行进一步的影像学检查,例如超声检查,以确定包块的性质。超声可以帮助判断包块是否是实性或虚性,并评估周围组织的状况。 如果超声检查结果显示包块周围有液体,或者有增厚的结构,可能需要进行穿刺检查来获取细胞学检查,以明确诊断。穿刺检查可以提供细菌培养和细胞学结果,帮助确定是否需要抗生素治疗或进一步的治疗措施。 在抗生素治疗方面,考虑患者仍然有发热,可能需要抗生素治疗。选择敏感的药物,如青霉素或第三代 cephalosporin,以覆盖可能的病原菌。同时,考虑是否有胃肠炎的可能,但通常急性阑尾炎的包块更倾向于炎症性,而不是感染性。 如果有结核病史或其他免疫抑制疾病,可能需要进一步评估,并考虑是否需要抗结核治疗。但在这个情况下,暂时不需要立即考虑这一点。 此外,患者的腹痛稍有减轻,可能是因为炎症开始缓解,但仍需要观察是否有其他症状,如发热、腹泻等。如果有其他症状加重,可能需要考虑重症或其他并发症。 总结一下,处理步骤应该是: 1. 确认诊断:急性阑尾炎,伴有包块,需要进一步检查。 2. 进行超声检查,评估包块性质。 3. 如果必要,进行穿刺检查获取细菌培养和细胞学检查。 4. 根据检查结果决定是否需要抗生素或抗结核治疗。 5. 监测患者的症状,及时调整治疗计划。 在这个过程中,可能需要结合患者的病史和其他检查结果,进行综合判断,以确定最合适的治疗方案。 </think> 对于一个患有急性阑尾炎且右下腹有压痛包块的患者,处理步骤如下: 1. **确认诊断**:确定患者确实患有急性阑尾炎,考虑包块的性质,可能提示炎症或附加的结核病性变化。 2. **进行超声检查**:评估右下腹包块的性质,包括是否为实性或虚性,以及周围组织的状态。 3. **穿刺检查**(如必要):若超声结果显示包块周围有液体或增厚结构,进行穿刺检查以获取细胞学检查和细菌培养。 4. **抗生素治疗**:考虑患者仍有发热,选择敏感的抗生素如青霉素或第三代 cephalosporin。 5. **监测和调整**:观察患者症状,及时调整治疗计划,特别是如果有其他症状加重或包块性质变化。 6. **结核病评估**:在有结核病史或免疫抑制疾病的情况下,进行进一步评估和必要的抗结核治疗。 通过以上步骤,确保患者得到全面和及时的处理,防止并发症的发生。<|end▁of▁sentence|>
Prepare Dataset¶
A medical dataset https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT/ will be used to train the selected model.
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.
### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.
### Question:
{}
### Response:
<think>
{}
</think>
{}"""
Important Notice¶
It's crucial to add the EOS (End of Sequence) token at the end of each training dataset entry, otherwise you may encounter infinite generations.
EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
inputs = examples["Question"]
cots = examples["Complex_CoT"]
outputs = examples["Response"]
texts = []
for input, cot, output in zip(inputs, cots, outputs):
text = train_prompt_style.format(input, cot, output) + EOS_TOKEN
texts.append(text)
return {
"text": texts,
}
from datasets import load_dataset
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", 'zh', split = f"train[0:{TRAIN_SIZE}]", trust_remote_code=True)
# dataset = load_dataset("./FreedomIntelligence/medical-o1-reasoning-SFT", 'default', split = f"train[0:{TRAIN_SIZE}]", trust_remote_code=True)
print(dataset.column_names)
['Question', 'Complex_CoT', 'Response']
For Ollama and llama.cpp to function like a custom ChatGPT chatbot, we must have only two columns: an instruction and an output column. We need to transform the dataset into the proper structure.
dataset = dataset.map(formatting_prompts_func, batched = True)
dataset["text"][0]
Map: 0%| | 0/1000 [00:00<?, ? examples/s]
'Below is an instruction that describes a task, paired with an input that provides further context.\nWrite a response that appropriately completes the request.\nBefore answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.\n\n### Instruction:\nYou are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.\nPlease answer the following medical question.\n\n### Question:\n根据描述,一个1岁的孩子在夏季头皮出现多处小结节,长期不愈合,且现在疮大如梅,溃破流脓,口不收敛,头皮下有空洞,患处皮肤增厚。这种病症在中医中诊断为什么病?\n\n### Response:\n<think>\n这个小孩子在夏天头皮上长了些小结节,一直都没好,后来变成了脓包,流了好多脓。想想夏天那么热,可能和湿热有关。才一岁的小孩,免疫力本来就不强,夏天的湿热没准就侵袭了身体。\n\n用中医的角度来看,出现小结节、再加上长期不愈合,这些症状让我想到了头疮。小孩子最容易得这些皮肤病,主要因为湿热在体表郁结。\n\n但再看看,头皮下还有空洞,这可能不止是简单的头疮。看起来病情挺严重的,也许是脓肿没治好。这样的情况中医中有时候叫做禿疮或者湿疮,也可能是另一种情况。\n\n等一下,头皮上的空洞和皮肤增厚更像是疾病已经深入到头皮下,这是不是说明有可能是流注或瘰疬?这些名字常描述头部或颈部的严重感染,特别是有化脓不愈合,又形成通道或空洞的情况。\n\n仔细想想,我怎么感觉这些症状更贴近瘰疬的表现?尤其考虑到孩子的年纪和夏天发生的季节性因素,湿热可能是主因,但可能也有火毒或者痰湿造成的滞留。\n\n回到基本的症状描述上看,这种长期不愈合又复杂的状况,如果结合中医更偏重的病名,是不是有可能是涉及更深层次的感染?\n\n再考虑一下,这应该不是单纯的瘰疬,得仔细分析头皮增厚并出现空洞这样的严重症状。中医里头,这样的表现可能更符合‘蚀疮’或‘头疽’。这些病名通常描述头部严重感染后的溃烂和组织坏死。\n\n看看季节和孩子的体质,夏天又湿又热,外邪很容易侵入头部,对孩子这么弱的免疫系统简直就是挑战。头疽这个病名听起来真是切合,因为它描述的感染严重,溃烂到出现空洞。\n\n不过,仔细琢磨后发现,还有个病名似乎更为合适,叫做‘蝼蛄疖’,这病在中医里专指像这种严重感染并伴有深部空洞的情况。它也涵盖了化脓和皮肤增厚这些症状。\n\n哦,该不会是夏季湿热,导致湿毒入侵,孩子的体质不能御,其病情发展成这样的感染?综合分析后我觉得‘蝼蛄疖’这个病名真是相当符合。\n</think>\n从中医的角度来看,你所描述的症状符合“蝼蛄疖”的病症。这种病症通常发生在头皮,表现为多处结节,溃破流脓,形成空洞,患处皮肤增厚且长期不愈合。湿热较重的夏季更容易导致这种病症的发展,特别是在免疫力较弱的儿童身上。建议结合中医的清热解毒、祛湿消肿的治疗方法进行处理,并配合专业的医疗建议进行详细诊断和治疗。<|end▁of▁sentence|>'
Train the model¶
Now let's use Hugging Face TRL's SFTTrainer.
model = FastLanguageModel.get_peft_model(
model,
r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha = LORA_ALPHA,
lora_dropout = 0, # Supports any, but = 0 is optimized
bias = "none", # Supports any, but = "none" is optimized
# [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
random_state = RANDOM_STATE,
use_rslora = False, # We support rank stabilized LoRA
loftq_config = None, # And LoftQ
)
Unsloth 2025.3.19 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset,
dataset_text_field = "text",
max_seq_length = max_seq_length,
dataset_num_proc = DATASET_NUM_PROCESS,
packing = False, # Can make training 5x faster for short sequences.
args = TrainingArguments(
per_device_train_batch_size = PER_DEVICE_TRAIN_BATCH_SIZE,
gradient_accumulation_steps = GRADIENT_ACCUMULATION_STEPS,
warmup_steps = WARMUP_STEPS,
max_steps = MAX_STEPS,
# num_train_epochs = TRAIN_EPOCHS, # For longer training runs!
learning_rate = LEARNING_RATE,
fp16 = not is_bfloat16_supported(),
bf16 = is_bfloat16_supported(),
logging_steps = LOGGING_STEPS,
optim = OPTIMIZER,
weight_decay = WEIGHT_DECAY,
lr_scheduler_type = LEARNING_RATE_SCHEDULER_TYPE,
seed = RANDOM_STATE,
output_dir = LORA_OUTPUT_DIR,
report_to = REPORT_TO, # Use this for WandB etc
),
)
Unsloth: Tokenizing ["text"]: 0%| | 0/1000 [00:00<?, ? examples/s]
trainer_stats = trainer.train()
==((====))== Unsloth - 2x faster free finetuning | Num GPUs used = 1 \\ /| Num examples = 1,000 | Num Epochs = 1 | Total steps = 80 O^O/ \_/ \ Batch size per device = 2 | Gradient accumulation steps = 4 \ / Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8 "-____-" Trainable parameters = 41,943,040/8,000,000,000 (0.52% trained)
Unsloth: Will smartly offload gradients to save VRAM!
Step | Training Loss |
---|---|
1 | 2.080300 |
2 | 2.122400 |
3 | 2.182200 |
4 | 2.165600 |
5 | 2.254100 |
6 | 2.186100 |
7 | 2.000400 |
8 | 2.040800 |
9 | 1.781600 |
10 | 1.700400 |
11 | 1.573500 |
12 | 1.746100 |
13 | 1.666600 |
14 | 1.593900 |
15 | 1.592100 |
16 | 1.623200 |
17 | 1.607300 |
18 | 1.680500 |
19 | 1.626200 |
20 | 1.646600 |
21 | 1.508600 |
22 | 1.475300 |
23 | 1.738100 |
24 | 1.591400 |
25 | 1.513300 |
26 | 1.511800 |
27 | 1.695300 |
28 | 1.540100 |
29 | 1.427000 |
30 | 1.541700 |
31 | 1.601300 |
32 | 1.547600 |
33 | 1.599300 |
34 | 1.592200 |
35 | 1.559100 |
36 | 1.576000 |
37 | 1.594600 |
38 | 1.627500 |
39 | 1.516900 |
40 | 1.672800 |
41 | 1.519100 |
42 | 1.655700 |
43 | 1.625400 |
44 | 1.475800 |
45 | 1.402200 |
46 | 1.582300 |
47 | 1.584600 |
48 | 1.567900 |
49 | 1.521600 |
50 | 1.497500 |
51 | 1.225000 |
52 | 1.478400 |
53 | 1.647300 |
54 | 1.664400 |
55 | 1.492300 |
56 | 1.716000 |
57 | 1.519300 |
58 | 1.600100 |
59 | 1.579400 |
60 | 1.600500 |
61 | 1.546000 |
62 | 1.468200 |
63 | 1.483900 |
64 | 1.573500 |
65 | 1.403700 |
66 | 1.586200 |
67 | 1.448200 |
68 | 1.636000 |
69 | 1.450900 |
70 | 1.539400 |
71 | 1.541300 |
72 | 1.475000 |
73 | 1.629100 |
74 | 1.609800 |
75 | 1.472500 |
76 | 1.414100 |
77 | 1.457300 |
78 | 1.465500 |
79 | 1.666700 |
80 | 1.450300 |
Inference after fine-tuning¶
Let's run inference with the same question again and see the difference.
print(question) # "一个患有急性阑尾炎的病人已经发病5天,腹痛稍有减轻但仍然发热,在体检时发现右下腹有压痛的包块,此时应如何处理?"
一个患有急性阑尾炎的病人已经发病5天,腹痛稍有减轻但仍然发热,在体检时发现右下腹有压痛的包块,此时应如何处理?
FastLanguageModel.for_inference(model) # Unsloth has 2x faster inference!
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")
outputs = model.generate(
input_ids=inputs.input_ids,
attention_mask=inputs.attention_mask,
max_new_tokens=1200,
use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])
<think> 这个病人已经有5天的急性阑尾炎了,虽然腹痛减轻了,但他还是发热。还发现右下腹有压痛的包块,这个包块是什么呢? 嗯,包块的存在让我想到阑尾炎。因为通常在急性阑尾炎中,阑尾会变硬,导致在右下腹出现压痛的包块。所以,我怀疑是阑尾炎的问题。 但也不能排除其他可能的原因,比如说,可能是肝炎或者胆囊炎。这些疾病也可能在右下腹出现压痛的包块,特别是肝炎和胆囊炎。 不过,考虑到他已经有5天的症状,而且包块的位置和症状都符合阑尾炎的特征,我倾向于认为是阑尾炎。 嗯,既然如此,那么他需要什么样的治疗呢? 首先,他的发热和腹痛应该控制住。对发热,可以用退烧药,比如对乙酰氨基酚或者布洛芬。 然后,为了缓解他的腹痛,可以考虑用止痛药,比如美芬酸或者氨基丁酚。 接下来,处理包块的问题。包块的存在可能意味着阑尾炎已经发展到一定的阶段,可能需要手术干预。 所以,他需要尽快进行手术,去除这个包块,或者进行阑尾切除术。 同时,手术后可能需要使用肠道保护液,帮助他恢复肠道功能。 嗯,综合考虑,他的症状和检查结果,手术似乎是必要的,才能彻底解决他的病情。 </think> 根据描述的病人症状和检查结果,右下腹有压痛的包块很可能是急性阑尾炎的表现。由于病人已经有5天的症状,并且包块的存在可能意味着阑尾炎已经发展到一定阶段,因此需要进行手术干预。 在手术中,医生可能会进行阑尾切除术,以彻底解决病人的症状。此外,手术后可以考虑使用肠道保护液,以帮助病人恢复肠道功能。 总结来说,目前病人最需要的是手术治疗,以解决包块和发热等症状。<|end▁of▁sentence|>
Save the fine-tuned model to GGUF format¶
Choose the llama.cpp GGUF quantization method we prefer by setting quantization_method accordingly.
# Save to 8bit Q8_0
quantization_method = "q8_0"
model.save_pretrained_gguf("model", tokenizer, quantization_method = quantization_method)
# model.save_pretrained("model_safetensor", safe_serialization = None)
Unsloth: ##### The current model auto adds a BOS token. Unsloth: ##### Your chat template has a BOS token. We shall remove it temporarily.
Unsloth: Merging 4bit and LoRA weights to 16bit... Unsloth: Will use up to 8.52 out of 31.84 RAM for saving. Unsloth: Saving model... This might take 5 minutes ...
100%|██████████████████████████████████████████████████████████████████████████████████| 32/32 [00:02<00:00, 13.69it/s]
Unsloth: Saving tokenizer... Done. Done. ==((====))== Unsloth: Conversion from QLoRA to GGUF information \\ /| [0] Installing llama.cpp might take 3 minutes. O^O/ \_/ \ [1] Converting HF to GGUF 16bits might take 3 minutes. \ / [2] Converting GGUF 16bits to ['q8_0'] might take 10 minutes each. "-____-" In total, you will have to wait at least 16 minutes. Unsloth: Installing llama.cpp. This might take 3 minutes... Unsloth: [1] Converting model at model into q8_0 GGUF format. The output location will be D:\llama_finetune\model\unsloth.Q8_0.gguf This might take 3 minutes... INFO:hf-to-gguf:Loading model: model INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only INFO:hf-to-gguf:Set model parameters INFO:hf-to-gguf:gguf: context length = 131072 INFO:hf-to-gguf:gguf: embedding length = 4096 INFO:hf-to-gguf:gguf: feed forward length = 14336 INFO:hf-to-gguf:gguf: head count = 32 INFO:hf-to-gguf:gguf: key-value head count = 8 INFO:hf-to-gguf:gguf: rope theta = 500000.0 INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-05 INFO:hf-to-gguf:gguf: file type = 7 INFO:hf-to-gguf:Set model tokenizer WARNING:gguf.vocab:Adding merges requested but no merges found, output may be non-functional. INFO:gguf.vocab:Setting special token type bos to 128000 INFO:gguf.vocab:Setting special token type eos to 128001 INFO:gguf.vocab:Setting special token type pad to 128004 INFO:gguf.vocab:Setting add_bos_token to True INFO:gguf.vocab:Setting add_eos_token to False INFO:gguf.vocab:Setting chat_template to {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<��User��>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<��Assistant��><��tool�xcalls�xbegin��><��tool�xcall�xbegin��>' + tool['type'] + '<��tool�xsep��>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<��tool�xcall�xend��>'}}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<��tool�xcall�xbegin��>' + tool['type'] + '<��tool�xsep��>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<��tool�xcall�xend��>'}}{{'<��tool�xcalls�xend��><��end�xof�xsentence��>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<��tool�xoutputs�xend��>' + message['content'] + '<��end�xof�xsentence��>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<��Assistant��>' + content + '<��end�xof�xsentence��>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<��tool�xoutputs�xbegin��><��tool�xoutput�xbegin��>' + message['content'] + '<��tool�xoutput�xend��>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\n<��tool�xoutput�xbegin��>' + message['content'] + '<��tool�xoutput�xend��>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool 
%}{{'<��tool�xoutputs�xend��>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<��Assistant��><think>\n'}}{% endif %} INFO:hf-to-gguf:Exporting model... INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json' INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00004.safetensors' INFO:hf-to-gguf:token_embd.weight, torch.float16 --> Q8_0, shape = {4096, 128256} INFO:hf-to-gguf:blk.0.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.0.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.0.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.0.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.0.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.0.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.0.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.0.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.0.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.1.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.1.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.1.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.1.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.1.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.1.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.1.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.1.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.1.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.2.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.2.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.2.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.2.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.2.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.2.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.2.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.2.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.2.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.3.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.3.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.3.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.3.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.3.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.3.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.3.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.3.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.3.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.4.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.4.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} 
INFO:hf-to-gguf:blk.4.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.4.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.4.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.4.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.4.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.4.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.4.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.5.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.5.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.5.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.5.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.5.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.5.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.5.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.5.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.5.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.6.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.6.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.6.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.6.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.6.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.6.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.6.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.6.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.6.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.7.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.7.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.7.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.7.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.7.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.7.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.7.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.7.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.7.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.8.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.8.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.8.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.8.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.8.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.8.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.8.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.8.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.8.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} 
INFO:hf-to-gguf:gguf: loading model part 'model-00002-of-00004.safetensors' INFO:hf-to-gguf:blk.10.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.10.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.10.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.10.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.10.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.10.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.10.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.10.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.10.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.11.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.11.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.11.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.11.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.11.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.11.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.11.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.11.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.11.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.12.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.12.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.12.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.12.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.12.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.12.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.12.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.12.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.12.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.13.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.13.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.13.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.13.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.13.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.13.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.13.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.13.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.13.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.14.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.14.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.14.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.14.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.14.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.14.attn_k.weight, torch.float16 --> Q8_0, 
shape = {4096, 1024} INFO:hf-to-gguf:blk.14.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.14.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.14.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.15.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.15.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.15.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.15.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.15.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.15.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.15.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.15.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.15.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.16.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.16.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.16.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.16.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.16.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.16.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.16.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.16.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.16.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.17.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.17.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.17.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.17.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.17.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.17.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.17.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.17.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.17.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.18.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.18.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.18.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.18.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.18.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.18.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.18.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.18.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.18.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.19.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.19.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.19.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} 
INFO:hf-to-gguf:blk.19.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.19.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.19.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.19.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.19.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.19.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.20.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.20.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.20.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.20.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.20.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.9.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.9.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.9.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.9.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.9.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.9.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.9.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.9.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.9.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:gguf: loading model part 'model-00003-of-00004.safetensors' INFO:hf-to-gguf:blk.20.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.20.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.20.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.20.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.21.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.21.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.21.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.21.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.21.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.21.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.21.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.21.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.21.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.22.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.22.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.22.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.22.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.22.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.22.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.22.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.22.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.22.attn_v.weight, torch.float16 --> Q8_0, shape = 
{4096, 1024} INFO:hf-to-gguf:blk.23.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.23.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.23.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.23.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.23.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.23.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.23.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.23.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.23.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.24.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.24.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.24.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.24.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.24.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.24.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.24.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.24.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.24.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.25.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.25.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.25.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.25.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.25.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.25.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.25.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.25.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.25.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.26.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.26.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.26.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.26.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.26.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.26.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.26.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.26.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.26.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.27.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.27.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.27.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.27.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.27.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.27.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.27.attn_output.weight, 
torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.27.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.27.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.28.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.28.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.28.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.28.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.28.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.28.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.28.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.28.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.28.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.29.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.29.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.29.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.29.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.29.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.29.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.29.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.29.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.29.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.30.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.30.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.30.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.30.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.30.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.30.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.30.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.30.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.30.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.31.ffn_gate.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.31.ffn_up.weight, torch.float16 --> Q8_0, shape = {4096, 14336} INFO:hf-to-gguf:blk.31.attn_k.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:blk.31.attn_output.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.31.attn_q.weight, torch.float16 --> Q8_0, shape = {4096, 4096} INFO:hf-to-gguf:blk.31.attn_v.weight, torch.float16 --> Q8_0, shape = {4096, 1024} INFO:hf-to-gguf:gguf: loading model part 'model-00004-of-00004.safetensors' INFO:hf-to-gguf:output.weight, torch.float16 --> Q8_0, shape = {4096, 128256} INFO:hf-to-gguf:blk.31.attn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:blk.31.ffn_down.weight, torch.float16 --> Q8_0, shape = {14336, 4096} INFO:hf-to-gguf:blk.31.ffn_norm.weight, torch.float16 --> F32, shape = {4096} INFO:hf-to-gguf:output_norm.weight, torch.float16 --> F32, shape = {4096} INFO:gguf.gguf_writer:Writing the following files: INFO:gguf.gguf_writer:D:\llama_finetune\model\unsloth.Q8_0.gguf: 
n_tensors = 291, total_size = 8.5G Writing: 100%|████████████████████| 8.53G/8.53G [04:24<00:00, 32.2Mbyte/s] INFO:hf-to-gguf:Model successfully exported to D:\llama_finetune\model\unsloth.Q8_0.gguf
Unsloth: ##### The current model auto adds a BOS token. Unsloth: ##### We removed it in GGUF's chat template for you.
Unsloth: Conversion completed! Output location: D:\llama_finetune\model\unsloth.Q8_0.gguf
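If you also want a smaller file, the same call can be repeated with a different quantization method; this is a sketch that assumes "q4_k_m" is among the quantization methods supported by your installed Unsloth/llama.cpp version:
# Optional: export an additional, smaller 4-bit GGUF into a separate folder.
# Assumption: "q4_k_m" is supported by the installed Unsloth/llama.cpp version.
model.save_pretrained_gguf("model_q4", tokenizer, quantization_method = "q4_k_m")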
Transfer the fine-tuned model to an Ollama-compatible model¶
The fine-tuned model is now saved to the ./model folder in safetensors and gguf format, so we can use the ollama command to convert it and run inference with it.
Here's how:
import os
from pathlib import Path
ollama_modelfile_path = Path("./model/modelfile.txt").resolve()
model_path = Path("./model/").resolve()
original_gguf_path = Path(f"./model/unsloth.{quantization_method.upper()}.gguf").resolve()
final_gguf_path = Path(f"./model/{SAVED_MODEL_NAME}.{quantization_method.upper()}.gguf").resolve()
try:
os.rename(original_gguf_path, final_gguf_path)
except FileNotFoundError as e:
print(f"文件不存在, 请检查gguf是否转换完成,或者gguf文件是否已经重名为 {final_gguf_path.name}")
with open(ollama_modelfile_path, "w") as modelfile:
# modelfile.write(f"FROM '{model_path / final_gguf_path.name}'") # 如果你希望通过量化后的gguf转为ollama格式
modelfile.write(f"FROM '{model_path}'") # 如果你希望直接通过safetensor转为ollama格式
create_cmd = f"ollama create {SAVED_MODEL_NAME} -f 'modelfile.txt'"
print(f"请打开cmd/终端,并切换到 {model_path} 并运行命令 {create_cmd}")
文件不存在, 请检查gguf是否转换完成,或者gguf文件是否已经重名为Medical_DeepSeek.Q8_0.gguf 请打开cmd/终端,并切换到 D:\llama_finetune\model 并运行命令 ollama create Medical_DeepSeek -f 'modelfile.txt'
Ollama inference¶
Open cmd or terminal and run the command below to run inference. PAY ATTENTION TO SAVED_MODEL_NAME AND REPLACE IT WITH THE REAL MODEL NAME. If you are using a Chrome extension like Page Assist, you can select SAVED_MODEL_NAME there and run inference easily.
ollama run SAVED_MODEL_NAME
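If you prefer to test from Python instead of the terminal, here is a minimal sketch against Ollama's local REST API, assuming the server is running on the default port 11434 and the model was created as Medical_DeepSeek:
# Minimal sketch: query the locally running Ollama server over its REST API.
# Assumes http://localhost:11434 (Ollama's default) and a model created as "Medical_DeepSeek".
import json
import urllib.request

payload = {
    "model": "Medical_DeepSeek",
    "prompt": "一个患有急性阑尾炎的病人已经发病5天,腹痛稍有减轻但仍然发热,在体检时发现右下腹有压痛的包块,此时应如何处理?",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])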