DINOv2.cpp: Accelerating Edge Inference with C++ and GGML

Code Availability
View on GitHub
Introduction
Edge computing for AI deploys models directly on local devices and sensors, enabling a wider range of AI use cases and providing substantial benefits. Running models locally on limited hardware means devices do not need a cloud connection for real-time inference. Benefits include reduced memory and energy use, lower latency, and improved privacy, since data can remain on the device. Use cases include real-time processing for autonomous vehicles, analytics for security devices, and improved sensor understanding for robotics.
The Vision Transformer (ViT) architecture has achieved state-of-the-art results on a wide range of traditional computer vision problems, pushing the capabilities of image analysis (Dosovitskiy et al., 2021). Among recent innovations, the DINOv2 family of vision transformer models achieves competitive results in classification, depth estimation, and semantic segmentation tasks. Developed by Meta, pretrained weights and model architectures have been made publicly available on GitHub (Oquab et al., 2024). Despite their improved performance, ViT models are more computationally expensive than convolutional neural networks: the self-attention mechanism scales quadratically with the number of input patches, leading to increased memory usage and slower inference. As a result, deploying ViTs on resource-constrained systems can prove challenging and may fail to deliver competitive speeds.
We introduce DINOv2.cpp, an inference engine for the DINOv2 family of models for use on edge devices. Written in C++ with minimal dependencies, our lightweight implementation achieves lower memory use and improved inference speeds on all models.
Related Work
vit.cpp (Taghadouini et al., 2024)
Our work is inspired by vit.cpp, a C++ implementation of a ViT inference engine built on the ggml library. Table 1 reports the speed and memory improvements vit.cpp achieves over the PyTorch ViT implementation.
Model | Max Mem (PyTorch) | Max Mem (vit.cpp) | Speed (PyTorch) | Speed (vit.cpp) |
---|---|---|---|---|
tiny | ~780 MB | ~20 MB | 431 ms | 120 ms |
small | ~965 MB | ~52 MB | 780 ms | 463 ms |
base | ~1.61 GB | ~179 MB | 2393 ms | 1441 ms |
large | ~3.86 GB | ~597 MB | 8151 ms | 4892 ms |
Table 1. Comparison of peak memory and inference speed between the PyTorch ViT implementation and vit.cpp across model sizes.
DINOv2 (Oquab et al., 2024)
DINOv2 is a self-supervised vision model that trains Vision Transformers on a massive 142-million-image dataset. It combines two key objectives: aligning global representations through an image-level (class-token) loss and learning fine-grained local features through a masked patch-token loss. The model also integrates techniques such as Sinkhorn–Knopp centering for clustering and KoLeo regularization for a uniform feature distribution. Additionally, DINOv2 includes a high-resolution adaptation stage and produces dense patch embeddings that capture local image semantics.
Approach
We build on the existing vit.cpp codebase, updating and refactoring the implementation as required to support DINOv2. The steps involved writing a script to convert the model to GGUF format, loading the model weights into ggml, rewriting the forward pass to support new layers, and updating quantization support. In addition, we upgraded the GGML version used by the base vit.cpp implementation. Following the initial implementation of the inference pipeline, we added support for real-time visualization of extracted features, flash attention, and positional embedding interpolation.
GGUF Format (Gerganov et al., 2025)
GGUF is a binary format optimized for quick loading and saving of models, making it highly efficient for inference. GGUF is designed for use with GGML and other executors. vit.cpp provides a Python script that downloads a ViT model from Hugging Face and converts it to the now-deprecated GGML binary format. We rewrote this script to account for the additional layers present in DINOv2, such as the register tokens, and to emit the updated GGUF format.
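To illustrate how a converted file is consumed on the C++ side, the sketch below reads GGUF metadata and tensor names with ggml's gguf API. It is a minimal example rather than the actual dinov2.cpp loader: the metadata key shown is hypothetical, and depending on the ggml version the gguf declarations live in ggml.h rather than a separate gguf.h header.

```cpp
// Minimal sketch of inspecting a converted GGUF file with ggml's gguf API.
// The key name "dinov2.embedding_length" is illustrative; the real keys are
// whatever the conversion script writes.
#include "ggml.h"
#include "gguf.h" // in older ggml releases these declarations are in ggml.h
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

    struct ggml_context * meta = nullptr;
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ &meta };

    struct gguf_context * gctx = gguf_init_from_file(argv[1], params);
    if (!gctx) { std::fprintf(stderr, "failed to open %s\n", argv[1]); return 1; }

    // Hyperparameters are stored as key/value metadata in the header.
    const int64_t key = gguf_find_key(gctx, "dinov2.embedding_length"); // hypothetical key
    if (key >= 0) {
        std::printf("hidden size: %u\n", gguf_get_val_u32(gctx, (int) key));
    }

    // Tensor names and shapes are listed up front, so memory for each tensor
    // can be sized before the weight data is read.
    const int64_t n_tensors = gguf_get_n_tensors(gctx);
    for (int64_t i = 0; i < n_tensors; ++i) {
        std::printf("tensor %lld: %s\n", (long long) i, gguf_get_tensor_name(gctx, (int) i));
    }

    gguf_free(gctx);
    ggml_free(meta);
    return 0;
}
```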
Correcting for the DINOv2 Architecture
Adapting the vit.cpp pipeline involved rewriting the model loading code and the forward pass to handle the modified DINOv2 architecture. Memory must be allocated manually for each tensor, accounting for the variable sizes of the different models. The GGUF file provides metadata on the loaded model architecture, allowing the inference pipeline to adapt to different model sizes and architectures. The layer structure of the model is detailed in Figure 1.
Processing variable: dinov2.embeddings.cls_token with shape: torch.Size([1, 1, 384]) and type: torch.float32
...
Processing variable: classifier.bias with shape: torch.Size([1000]) and type: torch.float32
Figure 1. An example of DINOv2’s model architecture information from our GGUF conversion script.
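As a concrete example of the rewritten forward pass, the sketch below builds the MLP sub-block of one DINOv2 transformer layer with ggml operations, including the LayerScale multiplication that plain ViT lacks. It is a simplified fragment with illustrative tensor names, and it assumes a recent ggml version that broadcasts the bias and scale tensors; older versions require an explicit ggml_repeat.

```cpp
// Simplified sketch of the MLP sub-block of a DINOv2 transformer layer built
// with ggml graph operations. Tensor names are illustrative.
static struct ggml_tensor * dinov2_mlp_block(
        struct ggml_context * ctx,
        struct ggml_tensor  * inp,        // [hidden, n_tokens]
        struct ggml_tensor  * norm_w, struct ggml_tensor * norm_b,
        struct ggml_tensor  * up_w,   struct ggml_tensor * up_b,
        struct ggml_tensor  * down_w, struct ggml_tensor * down_b,
        struct ggml_tensor  * ls_gamma,   // LayerScale parameter, DINOv2-specific
        float eps) {
    // Pre-norm: LayerNorm followed by the learned scale/shift.
    struct ggml_tensor * cur = ggml_norm(ctx, inp, eps);
    cur = ggml_add(ctx, ggml_mul(ctx, cur, norm_w), norm_b);

    // Feed-forward: expand, GELU, project back down.
    cur = ggml_add(ctx, ggml_mul_mat(ctx, up_w, cur), up_b);
    cur = ggml_gelu(ctx, cur);
    cur = ggml_add(ctx, ggml_mul_mat(ctx, down_w, cur), down_b);

    // LayerScale: per-channel learned scaling of the residual branch.
    cur = ggml_mul(ctx, cur, ls_gamma);

    // Residual connection.
    return ggml_add(ctx, cur, inp);
}
```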
Visualizing DINOv2’s Feature Extraction
dinov2.cpp supports parameters to enable either classification or feature extraction and visualization. When visualizing features, the classification head is removed and the learned feature representation of the image is output. To produce the visualization, principal component analysis (PCA) is run on the patch features from the last layer, and OpenCV is used to convert and save the resulting image. Figure 2 shows an example of this visual feature representation.



Figure 2. Left: Original. Middle: PCA Features without Register Tokens. Right: PCA Features with Register Tokens.
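The following sketch shows one way the PCA step can be implemented with OpenCV, assuming the last-layer patch embeddings have already been copied out of the ggml graph into a flat float buffer. Function and variable names are illustrative rather than the exact dinov2.cpp code.

```cpp
// Rough sketch of the PCA feature visualization: project each patch embedding
// onto its first three principal components and treat them as RGB channels.
#include <opencv2/opencv.hpp>
#include <vector>

cv::Mat visualize_patch_features(const std::vector<float> & feats,
                                 int n_patches, int hidden_dim, int grid_size) {
    // One row per patch token (assumes n_patches == grid_size * grid_size).
    cv::Mat data(n_patches, hidden_dim, CV_32F, const_cast<float *>(feats.data()));

    // PCA down to three components per patch.
    cv::PCA pca(data, cv::Mat(), cv::PCA::DATA_AS_ROW, 3);
    cv::Mat projected = pca.project(data);          // [n_patches x 3]

    // Rescale each component to 0..255 so it can be used as a color channel.
    cv::normalize(projected, projected, 0, 255, cv::NORM_MINMAX);

    // Fold the flat list of patches back into the patch grid and upsample.
    cv::Mat img = projected.reshape(3, grid_size);  // grid_size x grid_size, 3 channels
    img.convertTo(img, CV_8UC3);
    cv::resize(img, img, cv::Size(518, 518), 0, 0, cv::INTER_NEAREST);
    return img;
}
```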
Quantization
We adapted the ViT quantization script to support DINOv2. Table 2 details the quantization formats supported with DINOv2; a simplified sketch of the block-wise round-to-nearest scheme follows the table.
Format | Description |
---|---|
4_0 | 4-bit round-to-nearest quantization. Each block of 32 weights shares one scale. |
4_1 | 4-bit round-to-nearest quantization. Each block of 32 weights stores a scale and a minimum offset. |
5_0 | 5-bit round-to-nearest quantization. Each block of 32 weights shares one scale. |
5_1 | 5-bit round-to-nearest quantization. Each block of 32 weights stores a scale and a minimum offset. |
8_0 | 8-bit round-to-nearest quantization. Each block of 32 weights shares one scale. |
Table 2. Supported quantization formats
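The sketch below illustrates the block-wise round-to-nearest idea behind the 4_0 format: 32 weights share one scale, and each weight is rounded to a 4-bit code. It is a conceptual simplification (a single fp32 scale, codes left unpacked) rather than ggml's actual packed kernel.

```cpp
// Conceptual sketch of block-wise round-to-nearest 4-bit quantization in the
// spirit of ggml's q4_0 format. Not the exact ggml kernel: ggml stores an
// fp16 scale and packs two 4-bit codes per byte.
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int QK = 32; // weights per block

struct BlockQ4 {
    float   scale;   // per-block scale
    uint8_t q[QK];   // 4-bit codes, kept unpacked for readability
};

BlockQ4 quantize_block_q4_0(const float * x) {
    BlockQ4 out{};

    // Find the value with the largest magnitude in the block.
    float amax = 0.0f, maxv = 0.0f;
    for (int i = 0; i < QK; ++i) {
        if (std::fabs(x[i]) > amax) { amax = std::fabs(x[i]); maxv = x[i]; }
    }

    // Map the range so that codes fall in [0, 15] around an implicit offset of 8.
    out.scale = maxv / -8.0f;
    const float inv = out.scale != 0.0f ? 1.0f / out.scale : 0.0f;

    for (int i = 0; i < QK; ++i) {
        const int q = (int) std::lround(x[i] * inv + 8.0f); // round to nearest
        out.q[i] = (uint8_t) std::clamp(q, 0, 15);
    }
    return out;
}

// Dequantization reverses the mapping: w ~ scale * (q - 8).
float dequantize(const BlockQ4 & b, int i) {
    return b.scale * ((int) b.q[i] - 8);
}
```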
Additional Features
Flash Attention: To improve inference speed and reduce memory usage, we implemented flash attention within the DINOv2 pipeline. Flash attention computes the attention operation in a much more memory-efficient manner, using tiling and kernel fusion so that the full attention matrix never has to be materialized. This can be optionally toggled to maximize performance on CPU.
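The core idea is sketched below for a single query vector in plain C++: keys and values are streamed through a numerically stable running softmax, so the full attention matrix is never stored. This is purely illustrative; the in-pipeline implementation is built on ggml rather than this standalone loop.

```cpp
// Streaming (online-softmax) attention for one query vector: the attention
// weights are folded into running statistics instead of being materialized.
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<float> streaming_attention(
        const std::vector<float> & q,               // [d]
        const std::vector<std::vector<float>> & K,  // [n_kv][d]
        const std::vector<std::vector<float>> & V,  // [n_kv][d]
        float scale) {
    const size_t d = q.size();
    std::vector<float> out(d, 0.0f);

    float running_max = -INFINITY; // largest score seen so far
    float running_sum = 0.0f;      // softmax normalizer accumulated so far

    for (size_t i = 0; i < K.size(); ++i) {
        // Scaled dot-product score for this key.
        float s = 0.0f;
        for (size_t j = 0; j < d; ++j) s += q[j] * K[i][j];
        s *= scale;

        // Numerically stable online softmax update.
        const float new_max    = std::max(running_max, s);
        const float correction = std::exp(running_max - new_max);
        const float w          = std::exp(s - new_max);

        for (size_t j = 0; j < d; ++j) {
            out[j] = out[j] * correction + w * V[i][j];
        }
        running_sum = running_sum * correction + w;
        running_max = new_max;
    }

    for (size_t j = 0; j < d; ++j) out[j] /= running_sum;
    return out;
}
```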
Positional Embedding Interpolation: During training, Vision Transformers rely on a fixed-size positional embedding. During inference, however, the input image size may vary, especially as models are deployed on different devices. The interpolation process resizes the learned positional embeddings to match the number of patches generated by the new input resolution, enabling the model to generalize to arbitrary input sizes.
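A sketch of this step is shown below: each embedding dimension of the patch position grid is resampled bicubically to the new grid size, while the class and register token embeddings are left unchanged. The code uses OpenCV for the resampling and illustrative names throughout.

```cpp
// Sketch of positional embedding interpolation: resample the learned
// grid_old x grid_old patch position grid to grid_new x grid_new, one
// embedding dimension at a time.
#include <opencv2/opencv.hpp>
#include <vector>

// pos: row-major [grid_old*grid_old x dim] patch position embeddings
// returns row-major [grid_new*grid_new x dim]
std::vector<float> interpolate_pos_embed(const std::vector<float> & pos,
                                         int grid_old, int grid_new, int dim) {
    std::vector<float> out((size_t) grid_new * grid_new * dim);

    cv::Mat plane_old(grid_old, grid_old, CV_32F);
    cv::Mat plane_new;

    for (int c = 0; c < dim; ++c) {
        // Gather one embedding channel into a 2D grid.
        for (int p = 0; p < grid_old * grid_old; ++p) {
            plane_old.at<float>(p / grid_old, p % grid_old) = pos[(size_t) p * dim + c];
        }

        // Bicubic resampling to the new patch grid.
        cv::resize(plane_old, plane_new, cv::Size(grid_new, grid_new), 0, 0, cv::INTER_CUBIC);

        // Scatter the channel back into the flattened layout.
        for (int p = 0; p < grid_new * grid_new; ++p) {
            out[(size_t) p * dim + c] = plane_new.at<float>(p / grid_new, p % grid_new);
        }
    }
    return out;
}
```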
Cross-Platform Support: We have tested dinov2.cpp on Windows, Linux, and Mac and have ensured that the inference pipeline works cross-platform.
Datasets
The original DINOv2 family of models was trained on the LVD-142M dataset, a curated set of 142 million images assembled from ImageNet-22k, Google Landmarks, and other fine-grained datasets. A self-supervised training pipeline was developed to capture features at both the image and pixel level. The diversity of the dataset and the self-supervised pipeline enable a wide range of computer vision tasks without any fine-tuning, including classification, semantic segmentation, depth estimation, instance retrieval, and video understanding.
DINOv2.cpp retains these capabilities. The publicly available weights can be converted to the GGUF format without any loss of accuracy. The inference pipeline supports images up to 518 × 518 pixels and automatically resizes input images, providing support for any image-based dataset. The diversity and size of the dataset used to train DINOv2 ensure that dinov2.cpp can support a wide range of real-time tasks in a variety of environments.
Evaluation Results
We evaluate our results on inference speed and memory usage, aiming for faster inference and lower memory use than DINOv2's PyTorch implementation. We ran four benchmarks. The first two are unquantized: the small, base, large, and giant models are each run 100 times, with and without register tokens, and we measure the average inference speed and peak memory usage. The other two benchmarks cover quantized models: for each model variant (small, base, large, and giant, with and without register tokens) we benchmark the 4_0, 4_1, 5_0, 5_1, and 8_0 formats across 100 inference runs each. Every experiment was run on an Intel Core i9-14900HX, using 24 of its 32 threads for best results.
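For reference, the sketch below shows the shape of such a timing loop using std::chrono; run_inference() is a placeholder standing in for a full dinov2.cpp forward pass.

```cpp
// Illustrative benchmark loop: average wall-clock latency over 100 runs.
#include <chrono>
#include <cstdio>
#include <thread>

// Placeholder standing in for one full model forward pass.
static void run_inference() {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

int main() {
    constexpr int n_runs = 100;
    double total_ms = 0.0;

    for (int i = 0; i < n_runs; ++i) {
        const auto t0 = std::chrono::high_resolution_clock::now();
        run_inference();
        const auto t1 = std::chrono::high_resolution_clock::now();
        total_ms += std::chrono::duration<double, std::milli>(t1 - t0).count();
    }

    std::printf("average inference time over %d runs: %.2f ms\n", n_runs, total_ms / n_runs);
    return 0;
}
```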
Our benchmark results, detailed in the Benchmark Results tables below, show universal improvements in inference speed and memory use when compared to PyTorch. Our most notable improvement was with the small variant with register tokens: speed from 297 ms → 64 ms and memory from ~455 MB → ~110 MB.
Our quantization benchmark results, detailed in the Quantization Benchmark Results tables below, show universal improvements in inference speed and memory use when compared to the non-quantized models run in our pipeline. Our most notable improvement is the giant model with register tokens under 4_0 quantization: speed from 1995 ms → 1275 ms and memory from ~4400 MB → 1281 MB.
Benchmark Results
Without Register Tokens
Model | Max Mem (PyTorch) | Max Mem (dinov2.cpp) | Speed (PyTorch) | Speed (dinov2.cpp) |
---|---|---|---|---|
small | ~455 MB | ~110 MB | 181 ms | 62 ms |
base | ~720 MB | ~367 MB | 462 ms | 197 ms |
large | ~1.55 GB | ~1.2 GB | 1288 ms | 600 ms |
giant | ~4.8 GB | ~4.4 GB | 4384 ms | 1969 ms |
With Register Tokens
Model | Max Mem (PyTorch) | Max Mem (dinov2.cpp) | Speed (PyTorch) | Speed (dinov2.cpp) |
---|---|---|---|---|
small | ~455 MB | ~110 MB | 297 ms | 64 ms |
base | ~720 MB | ~366 MB | 436 ms | 200 ms |
large | ~1.55 GB | ~1.2 GB | 1331 ms | 597 ms |
giant | ~4.8 GB | ~4.4 GB | 4472 ms | 1995 ms |
Quantization Benchmark Results
Without Register Tokens
Model | Quant | Speed (ms) | Mem (MB) |
---|---|---|---|
small | q4_0 | 46 | 49 |
small | q4_1 | 48 | 51 |
small | q5_0 | 63 | 54 |
small | q5_1 | 58 | 57 |
small | q8_0 | 50 | 70 |
base | q4_0 | 141 | 129 |
base | q4_1 | 135 | 140 |
base | q5_0 | 162 | 150 |
base | q5_1 | 161 | 160 |
base | q8_0 | 125 | 212 |
large | q4_0 | 389 | 371 |
large | q4_1 | 382 | 407 |
large | q5_0 | 497 | 444 |
large | q5_1 | 478 | 480 |
large | q8_0 | 348 | 661 |
giant | q4_0 | 1268 | 1281 |
giant | q4_1 | 1248 | 1417 |
giant | q5_0 | 1625 | 1553 |
giant | q5_1 | 1576 | 1688 |
giant | q8_0 | 1059 | 2364 |
With Register Tokens
Model | Quant | Speed (ms) | Mem (MB) |
---|---|---|---|
small | q4_0 | 52 | 49 |
small | q4_1 | 50 | 52 |
small | q5_0 | 59 | 54 |
small | q5_1 | 57 | 57 |
small | q8_0 | 51 | 70 |
base | q4_0 | 136 | 129 |
base | q4_1 | 133 | 139 |
base | q5_0 | 164 | 150 |
base | q5_1 | 158 | 160 |
base | q8_0 | 124 | 211 |
large | q4_0 | 395 | 371 |
large | q4_1 | 395 | 407 |
large | q5_0 | 493 | 443 |
large | q5_1 | 490 | 480 |
large | q8_0 | 353 | 661 |
giant | q4_0 | 1275 | 1281 |
giant | q4_1 | 1261 | 1417 |
giant | q5_0 | 1615 | 1552 |
giant | q5_1 | 1583 | 1687 |
giant | q8_0 | 1065 | 2364 |
Conclusion
For our final project, we developed dinov2.cpp, a lightweight C++ inference engine for the DINOv2 family of models. Our implementation enables efficient inference on edge devices, using the GGML tensor library with minimal dependencies.
Our benchmarks demonstrate significant reductions in memory usage and inference time across all model sizes when compared to the original PyTorch implementation. Quantization improves these results further, allowing for efficient model loading and inference. We provide support for classification and real-time feature extraction, giving insight into how the model understands visual information. We also implemented flash attention and positional embedding interpolation, giving users the flexibility to balance classification accuracy against inference speed.
Future work may include adding support for different visual analysis tasks, such as depth estimation and semantic segmentation.
References
- Taghadouini, S., Gerganov, G., & Elion, M. (2024, April 11). staghado/vit.cpp. GitHub. https://github.com/staghado/vit.cpp
- Gerganov, G., Devesa, D., Gäßler, J., Bolz, J., et al. (2025, May 2). ggml-org/ggml. GitHub. https://github.com/ggml-org/ggml
- Oquab, M., Darcet, T., Moutakanni, T., Vo, H., et al. (2024). DINOv2: Learning Robust Visual Features without Supervision. arXiv. https://arxiv.org/abs/2304.07193
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., et al. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv. https://arxiv.org/abs/2010.11929
👥 Collaborators
- Michael Krah (Boston University)
- Zach Gentile (Boston University)