
DeepSeek-R1-0528 is the latest update to DeepSeek's R1 reasoning model. Its full-precision weights require 715GB of disk space, making it one of the largest open-source models available. However, thanks to advanced quantization techniques from Unsloth, the model can be shrunk to 162GB, a reduction of roughly 80%. This lets users experience the full power of the model with significantly lower hardware requirements, albeit with a slight trade-off in performance.
In this tutorial, we will:
- Set up Ollama and Open Web UI to run the DeepSeek-R1-0528 model locally.
- Download and configure the 1.78-bit quantized version (IQ1_S) of the model.
- Run the model using both GPU + CPU and CPU-only setups.
Step 0: Prerequisites
To run the IQ1_S quantized version, your system must meet the following requirements:
- GPU Requirements: at least one 24GB GPU (e.g., NVIDIA RTX 4090 or A6000) and 128GB of RAM. With this setup, you can expect a generation speed of approximately 5 tokens/second.
- RAM Requirements: a minimum of 64GB of RAM is required to run the model without a GPU, but performance will be limited to about 1 token/second.
- Optimal Setup: for the best performance (5+ tokens/second), you need at least 180GB of unified memory, or a combined 180GB of RAM + VRAM.
- Storage: ensure you have at least 200GB of free disk space for the model and its dependencies. A quick way to check these numbers is shown below.
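Before committing to a 162GB download, it is worth sanity-checking your hardware from the terminal. This is a minimal check; the nvidia-smi line assumes NVIDIA drivers are installed and can be skipped for a CPU-only setup:
# Check available RAM and free disk space
free -h
df -h .
# Check GPU model and total VRAM (requires NVIDIA drivers)
nvidia-smi --query-gpu=name,memory.total --format=csv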
Step 1: Install Dependencies and Ollama
Update your system and install the required tools. Ollama is a lightweight server for running large language models locally. Install it on an Ubuntu distribution using the following commands:
# Update package lists and install pciutils (the Ollama installer uses lspci to detect GPUs)
apt-get update
apt-get install pciutils -y
# Download and run the official Ollama install script
curl -fsSL https://ollama.com/install.sh | sh
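If the script completes without errors, you can confirm the binary is available before starting the server:
# Print the installed Ollama version
ollama --version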
Step 2: Download and Run the Model
Run the dynamically quantized version of the DeepSeek-R1-0528 model using the following commands. Note that Unsloth publishes the Ollama-ready build of this quant under the TQ1_0 tag in its GGUF repository:
# Start the Ollama server in the background
ollama serve &
# Download (~162GB) and run the quantized model
ollama run hf.co/unsloth/DeepSeek-R1-0528-GGUF:TQ1_0
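The first run downloads the full 162GB of weights, so expect a long wait. Once the model is loaded, you can also test generation directly against Ollama's REST API, which listens on port 11434 by default. This is a minimal sketch with a throwaway prompt:
# Send a single non-streaming generation request to the local Ollama server
curl http://localhost:11434/api/generate -d '{
  "model": "hf.co/unsloth/DeepSeek-R1-0528-GGUF:TQ1_0",
  "prompt": "Why is the sky blue?",
  "stream": false
}'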

Step 3: Set Up and Run Open Web UI
Pull the Open Web UI Docker image with CUDA support, then run the container with GPU access and Ollama integration.
This command will:
- Start the Open Web UI server on container port 8080, published to host port 9783
- Enable GPU acceleration using the --gpus all flag
- Mount a persistent data volume using -v open-webui:/app/backend/data
docker pull ghcr.io/open-webui/open-webui:cuda
docker run -d -p 9783:8080 --gpus all -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:cuda
Once the container is running, access the Open Web UI interface in your browser at http://localhost:9783/.
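If the UI starts but no models appear, the container may not be able to reach the Ollama server running on the host. One common workaround (these are standard Docker and Open Web UI options, though the right fix depends on your environment) is to map the host gateway into the container and point Open Web UI at Ollama explicitly:
# Remove the previous container first: docker rm -f open-webui
docker run -d -p 9783:8080 --gpus all \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:cuda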
Step 4: Run DeepSeek R1 0528 in Open Web UI
Select the hf.co/unsloth/DeepSeek-R1-0528-GGUF:TQ1_0 model from the model menu.

If the Ollama server fails to properly use the GPU, you can switch to CPU execution. While this will significantly reduce performance (approximately 1 token/second), it ensures the model can still run.
# Kill any existing Ollama processes
pkill ollama
# List any processes still holding the GPU (use fuser -k to terminate them)
sudo fuser -v /dev/nvidia*
# Restart Ollama with the GPU hidden so it falls back to CPU execution
CUDA_VISIBLE_DEVICES="" ollama serve
Once the model is running, you can interact with it via Open Web UI. However, note that the speed will be limited to 1 token/second due to the lack of GPU acceleration.
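You can confirm which processor the model is actually using from another terminal; in this fallback mode the PROCESSOR column should report 100% CPU:
# List loaded models along with the processor they are running on
ollama ps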

Final Thoughts
Running even the quantized version was challenging. You need a fast internet connection to download the model, and if the download fails, you have to restart the entire process from the beginning. I also faced many issues trying to run it on my GPU, as I kept getting GGUF errors related to low VRAM. Despite trying several common fixes for GPU errors, nothing worked, so I eventually switched everything to CPU. While this did work, it now takes about 10 minutes just for the model to generate a response, which is far from ideal.
I’m sure there are better solutions out there, perhaps using llama.cpp, but trust me, it took me the whole day just to get this running.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master’s degree in technology management and a bachelor’s degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.