
A Hands-on Approach to Fine Tuning DeepSeek R1

Introduction to Fine-Tuning DeepSeek R1

The rise of large language models (LLMs) has opened exciting possibilities across industries—from creating intuitive chatbots to automating complex business tasks. One such impressive model gaining momentum is DeepSeek R1, known for its powerful capabilities and versatility.

However, generic pre-trained models often don’t fully meet specific business or project requirements. This is where fine-tuning comes into play. By fine-tuning, you adapt the general-purpose knowledge of an LLM like DeepSeek R1 so that it performs exceptionally well on your particular tasks or data.

In this blog, we’ll explore a hands-on approach to fine tuning DeepSeek R1. Whether you’re a developer eager to customize a chatbot, a researcher exploring specialized text generation, or a business professional aiming to build tailored AI solutions, this guide will empower you with practical knowledge to get started immediately.

Understanding Fine Tuning

Fine-tuning refers to the practice of taking a pre-trained model—trained on vast amounts of general data—and further training it on a more specific dataset. This additional training process allows the model to better understand nuances relevant to your particular domain or use case.

Why fine-tune instead of training from scratch? Simply put, training large language models from scratch is expensive, time-consuming, and resource-intensive. Fine-tuning, on the other hand, leverages existing model knowledge, significantly reducing the resources and time needed.

Here are some benefits of fine-tuning:

  • Customization: Adjust the model precisely for your application or industry.
  • Efficiency: Achieve desired results quickly without massive computing power.
  • Enhanced performance: Improved accuracy and contextual relevance.

For example, a healthcare application might fine-tune an LLM to accurately answer patient queries, while an e-commerce chatbot might focus on product recommendations. Through fine-tuning, each application achieves superior performance tailored specifically to its needs.

Introduction to DeepSeek R1

DeepSeek R1 is a state-of-the-art large language model known for its strong performance in conversational AI and general text generation. Built upon advanced transformer architecture, it has been pre-trained on vast quantities of diverse textual data, enabling robust understanding and generation of human-like text.

Key features of DeepSeek R1 include:

  • Highly contextual responses: Capable of maintaining consistent and relevant conversation flow.
  • Flexibility: Suitable for various applications such as chatbots, content generation, and information retrieval.
  • Openness: Designed for community involvement, allowing researchers and developers to easily access and modify it for their own use cases.

What makes DeepSeek R1 particularly appealing is its balanced trade-off between performance and efficiency. It achieves impressive results without the enormous computational demands associated with some other massive LLMs, making it accessible even to those with moderate hardware setups.

In the next sections, we’ll dive deeper into practical steps for fine tuning DeepSeek R1, ensuring you can fully leverage its power for your projects.

Getting Started: Setting Up Your Environment

Before you jump into fine-tuning DeepSeek R1, you’ll need to ensure your environment is properly set up. Here’s a straightforward guide to quickly get you started:

1. System Requirements:

This experiment was performed on a Linux (Ubuntu 22.04) system with an NVIDIA RTX 4070 Ti Super GPU (16 GB VRAM) and 64 GB of system memory.

You will need a Linux-based OS because Unsloth depends on Triton, which does not have proper Windows support. You will also need a GPU with at least 14 GB of VRAM and at least 32 GB of system memory (RAM).

As an alternative, you can use a Google Colab T4 instance, which is free to use with a Google account, or similarly a Kaggle notebook T4 instance. This will not allow you to run a full-epoch fine-tuning run, but you can still run and test a few steps. You might also need to use a smaller model than the one showcased here due to system memory constraints.

2. Install Dependencies

Install PyTorch with CUDA support (in our case, CUDA 12.6) and Unsloth.
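
For example, assuming a pip-based setup (adjust the CUDA index URL to match your installed CUDA version):

```bash
# Install PyTorch built against CUDA 12.6 (pick the index URL matching your CUDA version)
pip install torch --index-url https://download.pytorch.org/whl/cu126

# Install Unsloth and its dependencies
pip install unsloth
```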

3. Log in to the HuggingFace Hub
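
One way to do this is through the huggingface_hub Python API; you’ll need an access token from your HuggingFace account settings:

```python
from huggingface_hub import login

# Prompts for a HuggingFace access token (or pass it directly via token="...")
login()
```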

After completing these steps, your environment is ready for the hands-on fine-tuning process.

Preparing Your Dataset

To effectively fine-tune a model like DeepSeek R1, you’ll start by preparing a high-quality, domain-specific dataset. In this tutorial, we’re using the bird-cot-reasoning dataset from HuggingFace. This dataset is ideal because it encourages the model to learn structured reasoning and provide clear, concise answers.

Load the dataset using:
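
Below is a sketch using the Hugging Face datasets library; the Hub ID is a placeholder, so replace it with the exact path of the bird-cot-reasoning dataset you are using.

```python
from datasets import load_dataset

# Placeholder Hub ID -- replace with the exact path of the bird-cot-reasoning dataset
dataset = load_dataset("<your-org>/bird-cot-reasoning", split="train")

# First 10,000 rows for training; the remaining ~200 rows are held out for evaluation
train_dataset = dataset.select(range(10_000))
eval_dataset = dataset.select(range(10_000, len(dataset)))
```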

We use the first 10,000 rows of the dataset, leaving roughly 200 rows for testing/evaluation.

Notice that in this dataset the reasoning is merged with the final output in a single column. We want to clearly separate the reasoning (Chain-of-Thought, or CoT) from the final answers. Here’s how you can do that:
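
The exact column name and delimiter depend on the dataset; the sketch below assumes the merged column is called "response" and that the reasoning is wrapped in <think>...</think> tags, followed by the final answer. Adjust both to match your data.

```python
def split_cot(example):
    # Assumption: reasoning is wrapped in <think>...</think>, followed by the final answer
    text = example["response"]
    if "</think>" in text:
        reasoning, answer = text.split("</think>", 1)
        reasoning = reasoning.replace("<think>", "").strip()
    else:
        reasoning, answer = "", text
    return {"reasoning": reasoning, "answer": answer.strip()}

train_dataset = train_dataset.map(split_cot)
eval_dataset = eval_dataset.map(split_cot)
```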

Hands-on Fine-Tuning Process

Let’s dive into the practical steps of fine-tuning DeepSeek R1, breaking down each step clearly and providing intuitive explanations of the key parameters used. 

If you have fine-tuned other open-source LLMs before, you most probably used the HuggingFace Transformers library. However, that approach requires computational resources that might not be available to an everyday individual. Instead, we are going to use Unsloth, which offers a more optimized approach and makes fine-tuning possible even on consumer-grade GPUs. By combining it with techniques like LoRA, we can fine-tune comparatively large models with minimal compute and minimal loss in accuracy or performance.

Step 1: Load the Model and Tokenizer

First, we’ll load DeepSeek R1 along with its tokenizer. Using 4-bit quantization helps in reducing memory usage, making it accessible even if you don’t have extensive GPU resources:
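
With Unsloth this is a single call. The model ID below is illustrative (a distilled DeepSeek R1 checkpoint that fits in 16 GB of VRAM); swap in the variant you actually intend to use.

```python
from unsloth import FastLanguageModel

max_seq_length = 2048  # adjust to the longest prompt + response in your data

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/DeepSeek-R1-Distill-Llama-8B",  # illustrative checkpoint
    max_seq_length=max_seq_length,
    load_in_4bit=True,  # 4-bit quantization to reduce VRAM usage
)
```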

Step 2: Parameter-Efficient Fine-Tuning with LoRA

Next, we’ll apply Low-Rank Adaptation (LoRA). LoRA helps adjust the model effectively without retraining all weights, saving both computational cost and time:
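
A typical Unsloth LoRA configuration looks like the following; the rank, alpha, and target modules are reasonable defaults rather than tuned values:

```python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                   # LoRA rank: higher = more capacity, more memory
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",   # trades a little speed for lower VRAM usage
    random_state=42,
)
```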

Step 3: Format the Dataset

We need to format prompts consistently to help guide the model effectively. For this, we create a system prompt template which will be used throughout our fine-tuning process. We merge our bird-cot-reasoning dataset with this prompt template:
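
Here is a sketch of such a template and the mapping step. The wording of the template is illustrative, and the field names (question, reasoning, answer) are the columns produced earlier, so adjust them if yours differ.

```python
prompt_template = """Below is a question. Think through the problem step by step,
then give the final answer.

### Question:
{}

### Response:
<think>
{}
</think>
{}"""

EOS_TOKEN = tokenizer.eos_token  # appended so the model learns when to stop generating

def format_prompts(examples):
    texts = [
        prompt_template.format(q, r, a) + EOS_TOKEN
        for q, r, a in zip(examples["question"], examples["reasoning"], examples["answer"])
    ]
    return {"text": texts}

train_dataset = train_dataset.map(format_prompts, batched=True)
```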

Step 4: Training the Model

Finally, we’ll fine-tune the model using Hugging Face’s SFTTrainer. Here’s a brief explanation of the key parameters used, followed by a sketch of the full trainer setup:

  • per_device_train_batch_size: Samples processed at once per GPU (larger batch = faster but requires more memory)
  • gradient_accumulation_steps: Accumulates gradients across batches, effectively increasing batch size.
  • learning_rate: Controls how much weights are adjusted per training step.
  • warmup_ratio: Gradually increases the learning rate initially to improve convergence.
  • optim: Optimization method, here adamw_8bit, reduces GPU memory usage.
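
Putting these together, a sketch of the trainer setup is shown below. Argument placement can vary slightly between trl versions, and most values are illustrative; the batch size and gradient accumulation are chosen so that one epoch over 10,000 rows corresponds to 1,250 steps.

```python
import torch
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,   # effective batch size of 8
        num_train_epochs=1,              # one full pass over the 10,000 training rows
        warmup_ratio=0.05,
        learning_rate=2e-4,
        optim="adamw_8bit",
        logging_steps=10,                # log the loss every 10 steps
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        output_dir="outputs",
    ),
)

trainer.train()
```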

We run the fine-tuning process for 1 epoch, which in our case is 1,250 steps. This means the entire training dataset is passed through the model once during fine-tuning.

This took around 4 hours on the specified system configuration. If you want to do a quick test, you can remove the num_train_epochs and warmup_ratio arguments and add max_steps=60.
If you are short on GPU memory, try halving per_device_train_batch_size and gradient_accumulation_steps. If you are short on system memory, try reducing the number of rows in the training dataset.

Step 5: Save the Model

After successfully fine-tuning your DeepSeek R1 model, you’ll want to save it for future use or deployment. Here’s how you can easily do this:
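
For example (the directory and repository names below are illustrative):

```python
# Save the LoRA adapters and tokenizer locally
model.save_pretrained("deepseek-r1-finetuned")
tokenizer.save_pretrained("deepseek-r1-finetuned")

# Optionally, push to the HuggingFace Hub
# model.push_to_hub("your-username/deepseek-r1-finetuned")
# tokenizer.push_to_hub("your-username/deepseek-r1-finetuned")
```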

Evaluating Your Model

After fine-tuning, evaluating your model ensures it performs accurately and effectively in practice. Let’s use one of the unused rows from our dataset: 
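
For instance, take the first held-out row (the column name is the one assumed during preprocessing):

```python
sample = eval_dataset[0]
question = sample["question"]  # assumed column name from the earlier preprocessing step
```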

Switch your model to inference mode first:
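
With Unsloth this is a single call:

```python
FastLanguageModel.for_inference(model)  # enables Unsloth's optimized inference path
```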

Create an evaluation prompt and run the inference:
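
The sketch below reuses the training template, cut off right after the opening <think> tag so the model generates the reasoning and answer itself:

```python
inference_prompt = """Below is a question. Think through the problem step by step,
then give the final answer.

### Question:
{}

### Response:
<think>"""

inputs = tokenizer([inference_prompt.format(question)], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512, use_cache=True)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```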

Common Pitfalls & Best Practices

Fine tuning can be highly effective, but there are common challenges you’ll want to avoid:

  • Insufficient Data: Ensure your dataset is large enough and adequately represents your target domain.
  • Overfitting: Avoid excessive training on your dataset; use proper validation and stop when performance plateaus.
  • Inconsistent Formatting: Maintain consistent prompts and responses to improve training efficiency and effectiveness.
  • Ignoring Resource Limits: Fine-tuning large models can quickly consume GPU memory; always optimize parameters like batch size, gradient accumulation, and quantization to stay within hardware constraints.

Best Practices:

  • Use LoRA: Leverage parameter-efficient methods to save computational resources.
  • Gradually Tune Hyperparameters: Experiment systematically with hyperparameters such as the learning rate and warmup steps.
  • Monitor Training: Regularly track metrics and training logs to catch and address issues early. In our case, we log every 10 steps to monitor the loss.

Conclusion

Fine tuning DeepSeek R1 empowers you to tailor an already powerful model to your specific needs, ensuring precise, efficient, and contextually accurate results. With the hands-on knowledge you’ve gained, you’re now well-equipped to explore further possibilities in specialized text generation and conversational AI.
