Unsloth Fine-tuning DeepSeek R1 on your PC with Nvidia GPU¶

In this notebook, we will demonstrate how to fine-tune DeepSeek-R1-Distill-Llama-8B with Unsloth on your PC, using a medical dataset.

After fine-tuning, we will have the model in both safetensors and GGUF format.

We will also show how to convert the fine-tuned model to an Ollama-compatible format and run inference with Ollama and Page Assist.

Why do we need LLM fine-tuning?¶

Fine-tuning adapts an existing model to a particular task or domain, improving its performance and making it more effective in real-world applications.

Preparation¶

Install software¶

Before we start the fine-tuning process, we need to get some tools ready:

  1. Nvidia driver and CUDA toolkit: choose a proper driver for your Nvidia GPU and install the CUDA 12.4 toolkit
  2. Anaconda: we need this to create a virtual environment with Python 3.10 and to install PyTorch and Unsloth
  3. git and cmake: when we save the fine-tuned model into GGUF format, we call the save_pretrained_gguf() method, which clones llama.cpp from GitHub and compiles it. You can download both from the git official site and the cmake official site and add their bin folders to the Windows PATH (a quick check follows after this list)
  4. Ollama and Page Assist: we use Ollama to run the fine-tuned model and Page Assist for a friendly UI. You can download Ollama -> here and install Page Assist from -> here
  5. Triton wheel for Python 3.10: if you are using a Windows PC, download the triton-windows-builds wheel and install it later via pip install triton-3.0.0-cp310-cp310-win_amd64.whl
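
Item 3 matters because save_pretrained_gguf() will clone and build llama.cpp during the GGUF export, so git and cmake must be reachable from your shell. Here is a small standard-library sketch (not one of the original notebook cells) you can run later inside the notebook to confirm they are on PATH:

# Verify that git and cmake are reachable on PATH (needed for the llama.cpp build during GGUF export)
import shutil

for tool in ("git", "cmake"):
    print(tool, "->", shutil.which(tool) or "NOT FOUND on PATH")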

Create Virtual Environment with Anaconda and install essential packages¶

  1. Download this notebook and triton-3.0.0-cp310-cp310-win_amd64.whl into a folder named llama_finetune
  2. Open cmd and cd to llama_finetune
  3. Run conda create -p llama python=3.10 to create a venv called llama in the current folder
  4. Run conda activate ./llama to activate the venv
  5. Run pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124 to install PyTorch
  6. Run pip install jupyter
  7. Run pip install triton-3.0.0-cp310-cp310-win_amd64.whl (you can skip this if you are using WSL)
  8. Run pip install unsloth, then pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git, and finally pip install bitsandbytes unsloth_zoo
  9. Run jupyter notebook and open this Jupyter Notebook
In [6]:
# check if PyTorch is properly installed, torch.cuda.is_available() should be True 
import torch
torch.cuda.is_available()
Out[6]:
True
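
Optionally, we can also check which GPU PyTorch sees and confirm the other packages installed above import cleanly. This is a small sanity-check sketch (not one of the original notebook cells); the package list simply mirrors what was installed earlier.

# Optional sanity check: report the detected GPU and verify key package imports
import importlib
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {torch.cuda.get_device_name(0)}, {props.total_memory / 1024**3:.1f} GB VRAM")

for pkg in ("triton", "unsloth", "bitsandbytes", "trl", "datasets"):
    try:
        mod = importlib.import_module(pkg)
        print(pkg, getattr(mod, "__version__", "ok"))
    except ImportError as err:
        print(pkg, "MISSING:", err)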

Set Hyperparameters¶

Notice:

  • BASE_MODEL: str = "./DeepSeek-R1-Distill-Llama-8B" <-- means we pre-downloaded the model files into the folder llama/DeepSeek-R1-Distill-Llama-8B; if set to "unsloth/DeepSeek-R1-Distill-Llama-8B" it will download from Hugging Face and save to the cache folder (check the from_pretrained docstring to find the actual cache path)
  • TRAIN_SIZE and DATASET_NUM_PROCESS: adjust these two parameters according to your device; setting them too high will cause a CUDA out-of-memory error
  • RANDOM_STATE: set to any value you wish; it is used for reproducibility
  • LORA_OUTPUT_DIR: the training checkpoints will be saved here
  • others: see the PyTorch and TRL documentation for the remaining training concepts
In [5]:
# Hyperparameters
BASE_MODEL: str = "./DeepSeek-R1-Distill-Llama-8B" # local folder with pre-downloaded model files; use "unsloth/DeepSeek-R1-Distill-Llama-8B" to download from Hugging Face instead
SAVED_MODEL_NAME: str = "Medical_DeepSeek"

REPORT_TO: str = "none" # Use this for WandB etc

TRAIN_EPOCHS: float = 1.0 # TrainingArguments expects a float here (see its docstring)
LEARNING_RATE: float = 2e-4
TRAIN_SIZE: int = 1000
RANDOM_STATE: int = 3407
LORA_ALPHA: int = 16
MAX_SEQUENCE_LENGTH: int = 2048
DATASET_NUM_PROCESS: int = 1
PER_DEVICE_TRAIN_BATCH_SIZE: int = 2
GRADIENT_ACCUMULATION_STEPS: int = 4
WARMUP_STEPS: int = 5
MAX_STEPS: int = 80
LOGGING_STEPS: int = 1
WEIGHT_DECAY: float = 0.01
OPTIMIZER: str = "adamw_8bit"
LEARNING_RATE_SCHEDULER_TYPE: str = "linear"
LORA_OUTPUT_DIR: str = "outputs"
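
To see how these values interact: the effective batch size is PER_DEVICE_TRAIN_BATCH_SIZE × GRADIENT_ACCUMULATION_STEPS, and with MAX_STEPS capped at 80 only part of TRAIN_SIZE is actually consumed. A quick sketch using the values defined above:

# Effective batch size and how many examples one training run consumes
effective_batch_size = PER_DEVICE_TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS  # 2 * 4 = 8
examples_seen = effective_batch_size * MAX_STEPS                                  # 8 * 80 = 640
print(f"effective batch size: {effective_batch_size}, examples seen: {examples_seen} of {TRAIN_SIZE}")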

Choose a Base Model¶

  1. Choose a model that aligns with your use case
  2. Assess your storage, compute capacity and dataset
  3. Select a Model and Parameters
  4. Choose Between Base and Instruct Models
In [7]:
from unsloth import FastLanguageModel
import torch
max_seq_length = MAX_SEQUENCE_LENGTH # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    # model_name = "unsloth/DeepSeek-R1-Distill-Llama-8B",
    model_name = BASE_MODEL,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
D:\llama_finetune\llama\lib\site-packages\unsloth_zoo\gradient_checkpointing.py:330: UserWarning: expandable_segments not supported on this platform (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\c10/cuda/CUDAAllocatorConfig.h:28.)
  GPU_BUFFERS = tuple([torch.empty(2*256*2048, dtype = dtype, device = f"cuda:{i}") for i in range(n_gpus)])
==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.50.2.
   \\   /|    Tesla P40. Num GPUs = 1. Max memory: 24.0 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 6.1. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Inference before fine-tuning¶

In [8]:
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.

### Question:
{}

### Response:
<think>{}"""
In [9]:
question = "一个患有急性阑尾炎的病人已经发病5天,腹痛稍有减轻但仍然发热,在体检时发现右下腹有压痛的包块,此时应如何处理?"


FastLanguageModel.for_inference(model)
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])
<think>
好的,我现在需要帮助处理一个急性阑尾炎患者的情况。患者已经发病5天,腹痛稍有减轻但仍然发热。在体检时,发现了右下腹有压痛的包块。作为医生,我应该如何处理呢?

首先,急性阑尾炎的典型症状包括发热、腹痛、发热5天以上,且通常在右下腹出现压痛的包块。患者的情况符合急性阑尾炎的表现,所以首先要考虑是否需要进一步的诊断。

接下来,发现右下腹有压痛的包块,这可能提示有炎症或附加的结核病性变化。因此,需要进行进一步的影像学检查,例如超声检查,以确定包块的性质。超声可以帮助判断包块是否是实性或虚性,并评估周围组织的状况。

如果超声检查结果显示包块周围有液体,或者有增厚的结构,可能需要进行穿刺检查来获取细胞学检查,以明确诊断。穿刺检查可以提供细菌培养和细胞学结果,帮助确定是否需要抗生素治疗或进一步的治疗措施。

在抗生素治疗方面,考虑患者仍然有发热,可能需要抗生素治疗。选择敏感的药物,如青霉素或第三代 cephalosporin,以覆盖可能的病原菌。同时,考虑是否有胃肠炎的可能,但通常急性阑尾炎的包块更倾向于炎症性,而不是感染性。

如果有结核病史或其他免疫抑制疾病,可能需要进一步评估,并考虑是否需要抗结核治疗。但在这个情况下,暂时不需要立即考虑这一点。

此外,患者的腹痛稍有减轻,可能是因为炎症开始缓解,但仍需要观察是否有其他症状,如发热、腹泻等。如果有其他症状加重,可能需要考虑重症或其他并发症。

总结一下,处理步骤应该是:

1. 确认诊断:急性阑尾炎,伴有包块,需要进一步检查。
2. 进行超声检查,评估包块性质。
3. 如果必要,进行穿刺检查获取细菌培养和细胞学检查。
4. 根据检查结果决定是否需要抗生素或抗结核治疗。
5. 监测患者的症状,及时调整治疗计划。

在这个过程中,可能需要结合患者的病史和其他检查结果,进行综合判断,以确定最合适的治疗方案。
</think>

对于一个患有急性阑尾炎且右下腹有压痛包块的患者,处理步骤如下:

1. **确认诊断**:确定患者确实患有急性阑尾炎,考虑包块的性质,可能提示炎症或附加的结核病性变化。

2. **进行超声检查**:评估右下腹包块的性质,包括是否为实性或虚性,以及周围组织的状态。

3. **穿刺检查**(如必要):若超声结果显示包块周围有液体或增厚结构,进行穿刺检查以获取细胞学检查和细菌培养。

4. **抗生素治疗**:考虑患者仍有发热,选择敏感的抗生素如青霉素或第三代 cephalosporin。

5. **监测和调整**:观察患者症状,及时调整治疗计划,特别是如果有其他症状加重或包块性质变化。

6. **结核病评估**:在有结核病史或免疫抑制疾病的情况下,进行进一步评估和必要的抗结核治疗。

通过以上步骤,确保患者得到全面和及时的处理,防止并发症的发生。<|end▁of▁sentence|>

Prepare Dataset¶

A medical dataset https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT/ will be used to train the selected model.

In [10]:
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.

### Question:
{}

### Response:
<think>
{}
</think>
{}"""

Important Notice¶

It's crucial to add the EOS (End of Sequence) token at the end of each training dataset entry, otherwise you may encounter infinite generations.

In [11]:
EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN


def formatting_prompts_func(examples):
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    for input, cot, output in zip(inputs, cots, outputs):
        text = train_prompt_style.format(input, cot, output) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }
In [13]:
from datasets import load_dataset
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", 'zh', split = f"train[0:{TRAIN_SIZE}]", trust_remote_code=True)
# dataset = load_dataset("./FreedomIntelligence/medical-o1-reasoning-SFT", 'default', split = f"train[0:{TRAIN_SIZE}]", trust_remote_code=True)
print(dataset.column_names)
['Question', 'Complex_CoT', 'Response']

For training (and later for Ollama and llama.cpp to behave like a custom chatbot), each example needs to be a single formatted prompt/response text, so we transform the dataset into that structure with the template above.

In [14]:
dataset = dataset.map(formatting_prompts_func, batched = True)
dataset["text"][0]
Map:   0%|          | 0/1000 [00:00<?, ? examples/s]
Out[14]:
'Below is an instruction that describes a task, paired with an input that provides further context.\nWrite a response that appropriately completes the request.\nBefore answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.\n\n### Instruction:\nYou are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.\nPlease answer the following medical question.\n\n### Question:\n根据描述,一个1岁的孩子在夏季头皮出现多处小结节,长期不愈合,且现在疮大如梅,溃破流脓,口不收敛,头皮下有空洞,患处皮肤增厚。这种病症在中医中诊断为什么病?\n\n### Response:\n<think>\n这个小孩子在夏天头皮上长了些小结节,一直都没好,后来变成了脓包,流了好多脓。想想夏天那么热,可能和湿热有关。才一岁的小孩,免疫力本来就不强,夏天的湿热没准就侵袭了身体。\n\n用中医的角度来看,出现小结节、再加上长期不愈合,这些症状让我想到了头疮。小孩子最容易得这些皮肤病,主要因为湿热在体表郁结。\n\n但再看看,头皮下还有空洞,这可能不止是简单的头疮。看起来病情挺严重的,也许是脓肿没治好。这样的情况中医中有时候叫做禿疮或者湿疮,也可能是另一种情况。\n\n等一下,头皮上的空洞和皮肤增厚更像是疾病已经深入到头皮下,这是不是说明有可能是流注或瘰疬?这些名字常描述头部或颈部的严重感染,特别是有化脓不愈合,又形成通道或空洞的情况。\n\n仔细想想,我怎么感觉这些症状更贴近瘰疬的表现?尤其考虑到孩子的年纪和夏天发生的季节性因素,湿热可能是主因,但可能也有火毒或者痰湿造成的滞留。\n\n回到基本的症状描述上看,这种长期不愈合又复杂的状况,如果结合中医更偏重的病名,是不是有可能是涉及更深层次的感染?\n\n再考虑一下,这应该不是单纯的瘰疬,得仔细分析头皮增厚并出现空洞这样的严重症状。中医里头,这样的表现可能更符合‘蚀疮’或‘头疽’。这些病名通常描述头部严重感染后的溃烂和组织坏死。\n\n看看季节和孩子的体质,夏天又湿又热,外邪很容易侵入头部,对孩子这么弱的免疫系统简直就是挑战。头疽这个病名听起来真是切合,因为它描述的感染严重,溃烂到出现空洞。\n\n不过,仔细琢磨后发现,还有个病名似乎更为合适,叫做‘蝼蛄疖’,这病在中医里专指像这种严重感染并伴有深部空洞的情况。它也涵盖了化脓和皮肤增厚这些症状。\n\n哦,该不会是夏季湿热,导致湿毒入侵,孩子的体质不能御,其病情发展成这样的感染?综合分析后我觉得‘蝼蛄疖’这个病名真是相当符合。\n</think>\n从中医的角度来看,你所描述的症状符合“蝼蛄疖”的病症。这种病症通常发生在头皮,表现为多处结节,溃破流脓,形成空洞,患处皮肤增厚且长期不愈合。湿热较重的夏季更容易导致这种病症的发展,特别是在免疫力较弱的儿童身上。建议结合中医的清热解毒、祛湿消肿的治疗方法进行处理,并配合专业的医疗建议进行详细诊断和治疗。<|end▁of▁sentence|>'
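
As a follow-up to the EOS notice above, we can verify that every formatted example really ends with the EOS token. A small sanity-check sketch over the mapped dataset:

# Confirm each formatted example ends with the EOS token to avoid infinite generations
assert all(t.endswith(EOS_TOKEN) for t in dataset["text"]), "some examples are missing the EOS token"
print("All", len(dataset), "examples end with", EOS_TOKEN)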

Train the model¶

Now let's train with Hugging Face TRL's SFTTrainer.

In [15]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = LORA_ALPHA,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = RANDOM_STATE,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
Unsloth 2025.3.19 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
In [16]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = DATASET_NUM_PROCESS,
    packing = False, # Can make training 5x faster for short sequences.
    
    args = TrainingArguments(
        per_device_train_batch_size = PER_DEVICE_TRAIN_BATCH_SIZE,
        gradient_accumulation_steps = GRADIENT_ACCUMULATION_STEPS,
        warmup_steps = WARMUP_STEPS,
        max_steps = MAX_STEPS,
        # num_train_epochs = TRAIN_EPOCHS, # For longer training runs!
        learning_rate = LEARNING_RATE,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = LOGGING_STEPS,
        optim = OPTIMIZER,
        weight_decay = WEIGHT_DECAY,
        lr_scheduler_type = LEARNING_RATE_SCHEDULER_TYPE,
        seed = RANDOM_STATE,
        output_dir = LORA_OUTPUT_DIR,
        report_to = REPORT_TO, # Use this for WandB etc
    ),
)
Unsloth: Tokenizing ["text"]:   0%|          | 0/1000 [00:00<?, ? examples/s]
In [17]:
trainer_stats = trainer.train()
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 80
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040/8,000,000,000 (0.52% trained)
Unsloth: Will smartly offload gradients to save VRAM!
[80/80 1:00:39, Epoch 0/1]
Step Training Loss
1 2.080300
2 2.122400
3 2.182200
4 2.165600
5 2.254100
6 2.186100
7 2.000400
8 2.040800
9 1.781600
10 1.700400
11 1.573500
12 1.746100
13 1.666600
14 1.593900
15 1.592100
16 1.623200
17 1.607300
18 1.680500
19 1.626200
20 1.646600
21 1.508600
22 1.475300
23 1.738100
24 1.591400
25 1.513300
26 1.511800
27 1.695300
28 1.540100
29 1.427000
30 1.541700
31 1.601300
32 1.547600
33 1.599300
34 1.592200
35 1.559100
36 1.576000
37 1.594600
38 1.627500
39 1.516900
40 1.672800
41 1.519100
42 1.655700
43 1.625400
44 1.475800
45 1.402200
46 1.582300
47 1.584600
48 1.567900
49 1.521600
50 1.497500
51 1.225000
52 1.478400
53 1.647300
54 1.664400
55 1.492300
56 1.716000
57 1.519300
58 1.600100
59 1.579400
60 1.600500
61 1.546000
62 1.468200
63 1.483900
64 1.573500
65 1.403700
66 1.586200
67 1.448200
68 1.636000
69 1.450900
70 1.539400
71 1.541300
72 1.475000
73 1.629100
74 1.609800
75 1.472500
76 1.414100
77 1.457300
78 1.465500
79 1.666700
80 1.450300
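
After training finishes, it is worth checking how long the run took and how much GPU memory was reserved at peak. A minimal sketch (trainer_stats.metrics follows the Hugging Face Trainer convention; the memory call is plain PyTorch):

# Runtime and peak reserved GPU memory for the training run
print(f"train runtime: {trainer_stats.metrics['train_runtime']:.1f} s")
print(f"peak reserved GPU memory: {torch.cuda.max_memory_reserved() / 1024**3:.2f} GB")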

Inference after fine-tuning¶

Let's run inference with the same question again and see the difference.

In [18]:
print(question) # "一个患有急性阑尾炎的病人已经发病5天,腹痛稍有减轻但仍然发热,在体检时发现右下腹有压痛的包块,此时应如何处理?"
一个患有急性阑尾炎的病人已经发病5天,腹痛稍有减轻但仍然发热,在体检时发现右下腹有压痛的包块,此时应如何处理?
In [19]:
FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])
<think>
这个病人已经有5天的急性阑尾炎了,虽然腹痛减轻了,但他还是发热。还发现右下腹有压痛的包块,这个包块是什么呢?

嗯,包块的存在让我想到阑尾炎。因为通常在急性阑尾炎中,阑尾会变硬,导致在右下腹出现压痛的包块。所以,我怀疑是阑尾炎的问题。

但也不能排除其他可能的原因,比如说,可能是肝炎或者胆囊炎。这些疾病也可能在右下腹出现压痛的包块,特别是肝炎和胆囊炎。

不过,考虑到他已经有5天的症状,而且包块的位置和症状都符合阑尾炎的特征,我倾向于认为是阑尾炎。

嗯,既然如此,那么他需要什么样的治疗呢?

首先,他的发热和腹痛应该控制住。对发热,可以用退烧药,比如对乙酰氨基酚或者布洛芬。

然后,为了缓解他的腹痛,可以考虑用止痛药,比如美芬酸或者氨基丁酚。

接下来,处理包块的问题。包块的存在可能意味着阑尾炎已经发展到一定的阶段,可能需要手术干预。

所以,他需要尽快进行手术,去除这个包块,或者进行阑尾切除术。

同时,手术后可能需要使用肠道保护液,帮助他恢复肠道功能。

嗯,综合考虑,他的症状和检查结果,手术似乎是必要的,才能彻底解决他的病情。
</think>
根据描述的病人症状和检查结果,右下腹有压痛的包块很可能是急性阑尾炎的表现。由于病人已经有5天的症状,并且包块的存在可能意味着阑尾炎已经发展到一定阶段,因此需要进行手术干预。

在手术中,医生可能会进行阑尾切除术,以彻底解决病人的症状。此外,手术后可以考虑使用肠道保护液,以帮助病人恢复肠道功能。

总结来说,目前病人最需要的是手术治疗,以解决包块和发热等症状。<|end▁of▁sentence|>

Save the fine-tuned model to GGUF format¶

Choose the llama.cpp GGUF quantization we prefer by setting quantization_method accordingly.
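
Besides the q8_0 GGUF used below, Unsloth's official notebooks expose a few other save paths. The commented sketch below lists the common ones; the method names and save_method values are taken from those notebooks and may differ between Unsloth releases, so treat it as a reference rather than tested code.

# Save only the LoRA adapters (small on disk, requires the base model at load time)
# model.save_pretrained("lora_model"); tokenizer.save_pretrained("lora_model")
# Merge LoRA into the base weights and save 16-bit safetensors
# model.save_pretrained_merged("model_merged_16bit", tokenizer, save_method = "merged_16bit")
# Export a smaller 4-bit GGUF instead of q8_0
# model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")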

In [23]:
# Save to 8bit Q8_0
quantization_method = "q8_0"
model.save_pretrained_gguf("model", tokenizer, quantization_method = quantization_method)

# model.save_pretrained("model_safetensor", safe_serialization = None)
Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### Your chat template has a BOS token. We shall remove it temporarily.
Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 8.52 out of 31.84 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...
100%|██████████████████████████████████████████████████████████████████████████████████| 32/32 [00:02<00:00, 13.69it/s]
Unsloth: Saving tokenizer... Done.
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at model into q8_0 GGUF format.
The output location will be D:\llama_finetune\model\unsloth.Q8_0.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: model
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 131072
INFO:hf-to-gguf:gguf: embedding length = 4096
INFO:hf-to-gguf:gguf: feed forward length = 14336
INFO:hf-to-gguf:gguf: head count = 32
INFO:hf-to-gguf:gguf: key-value head count = 8
INFO:hf-to-gguf:gguf: rope theta = 500000.0
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-05
INFO:hf-to-gguf:gguf: file type = 7
INFO:hf-to-gguf:Set model tokenizer
WARNING:gguf.vocab:Adding merges requested but no merges found, output may be non-functional.
INFO:gguf.vocab:Setting special token type bos to 128000
INFO:gguf.vocab:Setting special token type eos to 128001
INFO:gguf.vocab:Setting special token type pad to 128004
INFO:gguf.vocab:Setting add_bos_token to True
INFO:gguf.vocab:Setting add_eos_token to False
INFO:gguf.vocab:Setting chat_template to {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<｜User｜>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<｜Assistant｜><｜tool▁calls▁begin｜><｜tool▁call▁begin｜>' + tool['type'] + '<｜tool▁sep｜>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<｜tool▁call▁end｜>'}}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<｜tool▁call▁begin｜>' + tool['type'] + '<｜tool▁sep｜>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<｜tool▁call▁end｜>'}}{{'<｜tool▁calls▁end｜><｜end▁of▁sentence｜>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<｜tool▁outputs▁end｜>' + message['content'] + '<｜end▁of▁sentence｜>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<｜Assistant｜>' + content + '<｜end▁of▁sentence｜>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<｜tool▁outputs▁begin｜><｜tool▁output▁begin｜>' + message['content'] + '<｜tool▁output▁end｜>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\n<｜tool▁output▁begin｜>' + message['content'] + '<｜tool▁output▁end｜>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<｜tool▁outputs▁end｜>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<｜Assistant｜><think>\n'}}{% endif %}
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00004.safetensors'
INFO:hf-to-gguf:token_embd.weight,           torch.float16 --> Q8_0, shape = {4096, 128256}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_up.weight,         torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,       torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.0.attn_output.weight,    torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.0.attn_q.weight,         torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.0.attn_v.weight,         torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.1.attn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.1.ffn_down.weight,       torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.1.ffn_gate.weight,       torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.1.ffn_up.weight,         torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.1.ffn_norm.weight,       torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.1.attn_k.weight,         torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.1.attn_output.weight,    torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.1.attn_q.weight,         torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.1.attn_v.weight,         torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.2.attn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.2.ffn_down.weight,       torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.2.ffn_gate.weight,       torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.2.ffn_up.weight,         torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.2.ffn_norm.weight,       torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.2.attn_k.weight,         torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.2.attn_output.weight,    torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.2.attn_q.weight,         torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.2.attn_v.weight,         torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.3.attn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.3.ffn_down.weight,       torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.3.ffn_gate.weight,       torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.3.ffn_up.weight,         torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.3.ffn_norm.weight,       torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.3.attn_k.weight,         torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.3.attn_output.weight,    torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.3.attn_q.weight,         torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.3.attn_v.weight,         torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.4.attn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.4.ffn_down.weight,       torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.4.ffn_gate.weight,       torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.4.ffn_up.weight,         torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.4.ffn_norm.weight,       torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.4.attn_k.weight,         torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.4.attn_output.weight,    torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.4.attn_q.weight,         torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.4.attn_v.weight,         torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.5.attn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.5.ffn_down.weight,       torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.5.ffn_gate.weight,       torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.5.ffn_up.weight,         torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.5.ffn_norm.weight,       torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.5.attn_k.weight,         torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.5.attn_output.weight,    torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.5.attn_q.weight,         torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.5.attn_v.weight,         torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.6.attn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.6.ffn_down.weight,       torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.6.ffn_gate.weight,       torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.6.ffn_up.weight,         torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.6.ffn_norm.weight,       torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.6.attn_k.weight,         torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.6.attn_output.weight,    torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.6.attn_q.weight,         torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.6.attn_v.weight,         torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.7.attn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.7.ffn_down.weight,       torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.7.ffn_gate.weight,       torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.7.ffn_up.weight,         torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.7.ffn_norm.weight,       torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.7.attn_k.weight,         torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.7.attn_output.weight,    torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.7.attn_q.weight,         torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.7.attn_v.weight,         torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.8.attn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.8.ffn_down.weight,       torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.8.ffn_gate.weight,       torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.8.ffn_up.weight,         torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.8.ffn_norm.weight,       torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.8.attn_k.weight,         torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.8.attn_output.weight,    torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.8.attn_q.weight,         torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.8.attn_v.weight,         torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:gguf: loading model part 'model-00002-of-00004.safetensors'
INFO:hf-to-gguf:blk.10.attn_norm.weight,     torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.10.ffn_down.weight,      torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.10.ffn_gate.weight,      torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.10.ffn_up.weight,        torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.10.ffn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.10.attn_k.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.10.attn_output.weight,   torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.10.attn_q.weight,        torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.10.attn_v.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.11.attn_norm.weight,     torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.11.ffn_down.weight,      torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.11.ffn_gate.weight,      torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.11.ffn_up.weight,        torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.11.ffn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.11.attn_k.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.11.attn_output.weight,   torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.11.attn_q.weight,        torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.11.attn_v.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.12.attn_norm.weight,     torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.12.ffn_down.weight,      torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.12.ffn_gate.weight,      torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.12.ffn_up.weight,        torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.12.ffn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.12.attn_k.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.12.attn_output.weight,   torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.12.attn_q.weight,        torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.12.attn_v.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.13.attn_norm.weight,     torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.13.ffn_down.weight,      torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.13.ffn_gate.weight,      torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.13.ffn_up.weight,        torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.13.ffn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.13.attn_k.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.13.attn_output.weight,   torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.13.attn_q.weight,        torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.13.attn_v.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.14.attn_norm.weight,     torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.14.ffn_down.weight,      torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.14.ffn_gate.weight,      torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.14.ffn_up.weight,        torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.14.ffn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.14.attn_k.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.14.attn_output.weight,   torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.14.attn_q.weight,        torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.14.attn_v.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.15.attn_norm.weight,     torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.15.ffn_down.weight,      torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.15.ffn_gate.weight,      torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.15.ffn_up.weight,        torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.15.ffn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.15.attn_k.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.15.attn_output.weight,   torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.15.attn_q.weight,        torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.15.attn_v.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.16.attn_norm.weight,     torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.16.ffn_down.weight,      torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.16.ffn_gate.weight,      torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.16.ffn_up.weight,        torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.16.ffn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.16.attn_k.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.16.attn_output.weight,   torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.16.attn_q.weight,        torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.16.attn_v.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.17.attn_norm.weight,     torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.17.ffn_down.weight,      torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.17.ffn_gate.weight,      torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.17.ffn_up.weight,        torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.17.ffn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.17.attn_k.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.17.attn_output.weight,   torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.17.attn_q.weight,        torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.17.attn_v.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.18.attn_norm.weight,     torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.18.ffn_down.weight,      torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.18.ffn_gate.weight,      torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.18.ffn_up.weight,        torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.18.ffn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.18.attn_k.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.18.attn_output.weight,   torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.18.attn_q.weight,        torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.18.attn_v.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.19.attn_norm.weight,     torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.19.ffn_down.weight,      torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.19.ffn_gate.weight,      torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.19.ffn_up.weight,        torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.19.ffn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.19.attn_k.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.19.attn_output.weight,   torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.19.attn_q.weight,        torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.19.attn_v.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.20.ffn_gate.weight,      torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.20.attn_k.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.20.attn_output.weight,   torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.20.attn_q.weight,        torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.20.attn_v.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.9.attn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.9.ffn_down.weight,       torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.9.ffn_gate.weight,       torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.9.ffn_up.weight,         torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.9.ffn_norm.weight,       torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.9.attn_k.weight,         torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.9.attn_output.weight,    torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.9.attn_q.weight,         torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.9.attn_v.weight,         torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:gguf: loading model part 'model-00003-of-00004.safetensors'
INFO:hf-to-gguf:blk.20.attn_norm.weight,     torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.20.ffn_down.weight,      torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.20.ffn_up.weight,        torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.20.ffn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.21.attn_norm.weight,     torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.21.ffn_down.weight,      torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.21.ffn_gate.weight,      torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.21.ffn_up.weight,        torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.21.ffn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.21.attn_k.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.21.attn_output.weight,   torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.21.attn_q.weight,        torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.21.attn_v.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.22.attn_norm.weight,     torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.22.ffn_down.weight,      torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.22.ffn_gate.weight,      torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.22.ffn_up.weight,        torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.22.ffn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.22.attn_k.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.22.attn_output.weight,   torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.22.attn_q.weight,        torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.22.attn_v.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.23.attn_norm.weight,     torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.23.ffn_down.weight,      torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.23.ffn_gate.weight,      torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.23.ffn_up.weight,        torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.23.ffn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.23.attn_k.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.23.attn_output.weight,   torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.23.attn_q.weight,        torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.23.attn_v.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.24.attn_norm.weight,     torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.24.ffn_down.weight,      torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.24.ffn_gate.weight,      torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.24.ffn_up.weight,        torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.24.ffn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.24.attn_k.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.24.attn_output.weight,   torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.24.attn_q.weight,        torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.24.attn_v.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.25.attn_norm.weight,     torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.25.ffn_down.weight,      torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.25.ffn_gate.weight,      torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.25.ffn_up.weight,        torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.25.ffn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.25.attn_k.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.25.attn_output.weight,   torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.25.attn_q.weight,        torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.25.attn_v.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.26.attn_norm.weight,     torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.26.ffn_down.weight,      torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.26.ffn_gate.weight,      torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.26.ffn_up.weight,        torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.26.ffn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.26.attn_k.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.26.attn_output.weight,   torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.26.attn_q.weight,        torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.26.attn_v.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.27.attn_norm.weight,     torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.27.ffn_down.weight,      torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.27.ffn_gate.weight,      torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.27.ffn_up.weight,        torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.27.ffn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.27.attn_k.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.27.attn_output.weight,   torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.27.attn_q.weight,        torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.27.attn_v.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.28.attn_norm.weight,     torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.28.ffn_down.weight,      torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.28.ffn_gate.weight,      torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.28.ffn_up.weight,        torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.28.ffn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.28.attn_k.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.28.attn_output.weight,   torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.28.attn_q.weight,        torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.28.attn_v.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.29.attn_norm.weight,     torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.29.ffn_down.weight,      torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.29.ffn_gate.weight,      torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.29.ffn_up.weight,        torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.29.ffn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.29.attn_k.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.29.attn_output.weight,   torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.29.attn_q.weight,        torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.29.attn_v.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.30.attn_norm.weight,     torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.30.ffn_down.weight,      torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.30.ffn_gate.weight,      torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.30.ffn_up.weight,        torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.30.ffn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.30.attn_k.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.30.attn_output.weight,   torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.30.attn_q.weight,        torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.30.attn_v.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.31.ffn_gate.weight,      torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.31.ffn_up.weight,        torch.float16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.31.attn_k.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.31.attn_output.weight,   torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.31.attn_q.weight,        torch.float16 --> Q8_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.31.attn_v.weight,        torch.float16 --> Q8_0, shape = {4096, 1024}
INFO:hf-to-gguf:gguf: loading model part 'model-00004-of-00004.safetensors'
INFO:hf-to-gguf:output.weight,               torch.float16 --> Q8_0, shape = {4096, 128256}
INFO:hf-to-gguf:blk.31.attn_norm.weight,     torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.31.ffn_down.weight,      torch.float16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.31.ffn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:output_norm.weight,          torch.float16 --> F32, shape = {4096}
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:D:\llama_finetune\model\unsloth.Q8_0.gguf: n_tensors = 291, total_size = 8.5G
Writing: 100%|████████████████████| 8.53G/8.53G [04:24<00:00, 32.2Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to D:\llama_finetune\model\unsloth.Q8_0.gguf
Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### We removed it in GGUF's chat template for you.
Unsloth: Conversion completed! Output location: D:\llama_finetune\model\unsloth.Q8_0.gguf

Convert the fine-tuned model to an Ollama-compatible model¶

The fine-tuned model is now saved to the ./model folder in both safetensors and GGUF format, so we can use the ollama command to convert it and run inference with it.

Here's how:

In [29]:
import os
from pathlib import Path

ollama_modelfile_path = Path("./model/modelfile.txt").resolve()
model_path =  Path("./model/").resolve()
original_gguf_path =  Path(f"./model/unsloth.{quantization_method.upper()}.gguf").resolve()
final_gguf_path = Path(f"./model/{SAVED_MODEL_NAME}.{quantization_method.upper()}.gguf").resolve()

try:
    os.rename(original_gguf_path, final_gguf_path)
except FileNotFoundError as e:
    print(f"文件不存在, 请检查gguf是否转换完成,或者gguf文件是否已经重名为 {final_gguf_path.name}")

with open(ollama_modelfile_path, "w") as modelfile:
    # modelfile.write(f"FROM '{model_path / final_gguf_path.name}'") # 如果你希望通过量化后的gguf转为ollama格式
    modelfile.write(f"FROM '{model_path}'") # 如果你希望直接通过safetensor转为ollama格式

create_cmd = f"ollama create {SAVED_MODEL_NAME} -f 'modelfile.txt'"
print(f"请打开cmd/终端,并切换到 {model_path} 并运行命令 {create_cmd}")
文件不存在, 请检查gguf是否转换完成,或者gguf文件是否已经重名为Medical_DeepSeek.Q8_0.gguf
请打开cmd/终端,并切换到 D:\llama_finetune\model 并运行命令 ollama create Medical_DeepSeek -f 'modelfile.txt'
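
The Modelfile written above contains only a FROM line, which is all ollama create needs to import the weights. If you also want to pin generation settings, Ollama's Modelfile supports PARAMETER entries; a minimal sketch (PARAMETER temperature and num_ctx are standard Modelfile keywords, and the values here are only examples):

# Hypothetical richer Modelfile: keep the FROM line, add generation parameters
modelfile_text = (
    f"FROM '{model_path}'\n"
    "PARAMETER temperature 0.6\n"
    f"PARAMETER num_ctx {MAX_SEQUENCE_LENGTH}\n"
)
with open(ollama_modelfile_path, "w") as modelfile:
    modelfile.write(modelfile_text)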

Ollama inference¶

Open cmd or a terminal and run the command below to run inference. Replace SAVED_MODEL_NAME with the actual model name you created above. If you are using a Chrome extension such as Page Assist, you can simply select the model named SAVED_MODEL_NAME there and chat with it.

ollama run SAVED_MODEL_NAME
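
If you prefer to query the local Ollama server from Python instead of the CLI, here is a minimal sketch using Ollama's REST API. It assumes Ollama is running on the default port 11434 and that the model was created with the name used above (replace it with yours):

# Query the locally running Ollama server via its REST API
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "Medical_DeepSeek",  # replace with your actual model name
        "prompt": "What are the typical management options for an appendiceal mass five days after symptom onset?",
        "stream": False,
    },
)
print(resp.json()["response"])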