
DINOv2.cpp: Accelerating Edge Inference with C++ and GGML

Code Availability

View on GitHub

Introduction

Edge computing for AI means deploying AI algorithms directly on local devices and sensors rather than in the cloud, enabling a wider range of AI use cases. Running models locally on limited hardware removes the need for a cloud connection during real-time inference, reducing memory and energy use and cutting latency. It also offers privacy benefits, since data can remain on the device. Use cases include real-time processing for autonomous vehicles, analytics for security devices, and improved sensor understanding for robotics.

The Vision Transformer (ViT) architecture has achieved state-of-the-art results on a wide range of traditional computer vision problems, pushing the capabilities of image analysis (Dosovitskiy et al., 2021). Among recent innovations, the DINOv2 family of vision transformer models achieves competitive results in classification, depth estimation, and semantic segmentation tasks. Meta has made pretrained weights and model architectures publicly available on GitHub (Oquab et al., 2024). Despite their improved performance, ViT models are more computationally expensive than convolutional neural networks: the self-attention mechanism scales quadratically with input size, leading to increased memory usage and slower inference. As a result, deploying ViTs on resource-constrained systems can prove challenging and may fail to deliver competitive speeds.

We introduce DINOv2.cpp, an inference engine for the DINOv2 family of models for use on edge devices. Written in C++ with minimal dependencies, our lightweight implementation achieves lower memory use and improved inference speeds on all models.

vit.cpp Taghadouini et al., 2024

Our work is inspired by vit.cpp, a C++ inference engine for ViT models built on the ggml library.

Model | Max Mem (PyTorch) | Max Mem | Speed (PyTorch) | Speed
tiny  | ~780 MB           | ~20 MB  | 431 ms          | 120 ms
small | ~965 MB           | ~52 MB  | 780 ms          | 463 ms
base  | ~1.61 GB          | ~179 MB | 2393 ms         | 1441 ms
large | ~3.86 GB          | ~597 MB | 8151 ms         | 4892 ms

Table 1. Comparison of peak memory and inference speed across model sizes.

DINOv2 Oquab et al., 2024

DINOv2 is a self-supervised vision model that trains Vision Transformers on a massive 142 million-image dataset. It combines two key objectives: aligning global representations through a class-token loss and learning fine-grained features via a masked patch-token loss. The model also integrates techniques like Sinkhorn–Knopp for clustering and KoLeo regularization for a uniform feature distribution. Additionally, DINOv2 includes a high-resolution adaptation stage and produces dense patch embeddings to capture local image semantics.

Approach

We build on the existing vit.cpp codebase, updating and refactoring the implementation as required to support DINOv2. This involved writing a script to convert the model to GGUF format, loading the model weights into ggml, rewriting the forward pass to support new layers, and updating quantization support. In addition, we upgraded the GGML version underlying the base vit.cpp implementation. Following the initial implementation of the inference pipeline, we added support for real-time visualization of extracted features, flash attention, and positional embedding interpolation.
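
For readers unfamiliar with ggml, the sketch below shows the general pattern the forward pass follows: allocate a context, declare tensors, compose operations into a compute graph, and execute it on the CPU. The toy linear layer and its dimensions are illustrative, not the actual dinov2.cpp graph, and the single-context compute helper assumes the style used in older ggml examples rather than the newer backend API.

// Minimal sketch of the ggml pattern behind a forward pass: allocate a
// context, declare tensors, compose ops into a graph, then run it on CPU.
// The toy linear layer and its sizes are illustrative only.
#include "ggml.h"
#include <cstdio>

int main() {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 64 * 1024 * 1024,  // arena for tensors, graph, and work buffer
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // Toy "layer": y = W * x + b. In dinov2.cpp the weight data would be
    // copied in from the GGUF file; here the tensors are left uninitialized.
    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 384);
    struct ggml_tensor * W = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 384, 384);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 384);
    struct ggml_tensor * y = ggml_add(ctx, ggml_mul_mat(ctx, W, x), b);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, y);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/8);

    printf("output elements: %lld\n", (long long) ggml_nelements(y));
    ggml_free(ctx);
    return 0;
}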

GGUF Format Gerganov et al., 2025

GGUF is a binary format optimized for quick loading and saving of models, making it highly efficient for inference purposes. GGUF is designed for use with GGML and other executors. A Python script is provided in vit.cpp that downloads a ViT model from HuggingFace and converts it to a now-deprecated GGML format. We rewrote this script to account for additional layers present in DINOv2, such as the register tokens, and to use the updated GGUF format.

Correcting for the DINOv2 Architecture

Adapting the vit.cpp pipeline involved rewriting the model loading and forward pass to handle the modified DINOv2 architecture. Memory must be manually allocated for each tensor, accounting for the variable sizes of the different models. The GGUF file format provides metadata on the loaded model architecture, allowing the inference pipeline to adapt to different model sizes and architectures. The layer structure of the model is detailed in Figure 1.

Processing variable: dinov2.embeddings.cls_token with shape:  torch.Size([1, 1, 384])  and type:  torch.float32
...
Processing variable: classifier.bias with shape:  torch.Size([1000])  and type:  torch.float32

Figure 1. An example of DINOv2’s model architecture information from our GGUF conversion script.
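
As a rough illustration of the loading side, the sketch below opens a converted .gguf file with ggml's GGUF API and enumerates its metadata and tensors before any weight data is read. The metadata key name "hidden_size" is illustrative rather than the actual key used by our converter, and the header the API lives in depends on the ggml version.

// Sketch: open a converted .gguf model, read one hyperparameter from the
// key/value metadata, and list tensors before any weight data is loaded.
#include "ggml.h"
#include "gguf.h"   // recent ggml; older releases declare this API in ggml.h
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

    struct ggml_context * meta = nullptr;
    struct gguf_init_params params = {
        /*.no_alloc =*/ true,    // headers only: shapes and metadata, no tensor data
        /*.ctx      =*/ &meta,
    };
    struct gguf_context * gguf = gguf_init_from_file(argv[1], params);
    if (!gguf) { fprintf(stderr, "failed to open %s\n", argv[1]); return 1; }

    int64_t key = gguf_find_key(gguf, "hidden_size");  // illustrative key name
    if (key >= 0) {
        printf("hidden_size = %u\n", gguf_get_val_u32(gguf, key));
    }

    // Tensor names and shapes are available up front, so the loader can size
    // its ggml context exactly before allocating the real weights.
    for (int64_t i = 0; i < gguf_get_n_tensors(gguf); ++i) {
        printf("tensor %lld: %s\n", (long long) i, gguf_get_tensor_name(gguf, i));
    }

    gguf_free(gguf);
    ggml_free(meta);
    return 0;
}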

Visualizing DINOv2’s Feature Extraction

dinov2.cpp supports parameters to enable either classification or feature extraction and visualization. When visualizing features, the classification head is removed and the learned feature representation of the image is output instead. To produce a visualization of the features, principal component analysis (PCA) is run on the patch features from the final layer, and OpenCV is used to convert and save the resulting image. Figure 2 shows an example of this visual feature representation.


Figure 2. Left: Original. Middle: PCA Features without Register Tokens. Right: PCA Features with Register Tokens.
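
The sketch below shows one way such a PCA visualization can be produced with OpenCV: per-patch features are projected onto three principal components and saved as an image. The row-per-patch layout, the 37 × 37 patch grid, the 384-dimensional features, and the random stand-in data are assumptions for illustration, not the exact dinov2.cpp code.

// Sketch: project patch features to 3 principal components and save them
// as an image, upscaling each patch so the result is easy to inspect.
#include <opencv2/opencv.hpp>

cv::Mat pca_to_image(const cv::Mat & feats, int grid_h, int grid_w) {
    // feats: (grid_h * grid_w) x dim, CV_32F — one row of features per image patch.
    CV_Assert(feats.rows == grid_h * grid_w);
    cv::PCA pca(feats, cv::Mat(), cv::PCA::DATA_AS_ROW, 3);
    cv::Mat proj = pca.project(feats);                 // (grid_h * grid_w) x 3
    cv::normalize(proj, proj, 0, 255, cv::NORM_MINMAX);

    cv::Mat img = proj.reshape(3, grid_h);             // grid_h x grid_w, 3 channels
    img.convertTo(img, CV_8UC3);

    cv::Mat big;                                       // upsample patches for viewing
    cv::resize(img, big, cv::Size(), 14, 14, cv::INTER_NEAREST);
    return big;
}

int main() {
    // Random stand-in for final-layer patch features of a 37x37 grid (518 / 14 = 37).
    cv::Mat feats(37 * 37, 384, CV_32F);
    cv::randu(feats, -1.0, 1.0);
    cv::imwrite("pca_features.png", pca_to_image(feats, 37, 37));
    return 0;
}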

Quantization

We adapted the ViT quantization script to support DINOv2. Table 2 details the different quantization formats supported with DINOv2.

Format | Description
4_0    | 4-bit round-to-nearest quantization (q). Each block has 32 weights.
4_1    | 4-bit round-to-nearest quantization (q). Each block has 32 weights.
5_0    | 5-bit round-to-nearest quantization (q). Each block has 32 weights.
5_1    | 5-bit round-to-nearest quantization (q). Each block has 32 weights.
8_0    | 8-bit round-to-nearest quantization (q). Each block has 32 weights.

Table 2. Supported quantization formats
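
As a rough sketch of how one of these block formats is applied, the example below quantizes a single stand-in FP32 weight matrix to the 4_0 format using ggml's quantization helper. The tensor shape is illustrative, and the helper's signature assumes a recent ggml release; our actual script applies this tensor by tensor while rewriting the GGUF file.

// Sketch: quantize one FP32 weight matrix to Q4_0 with ggml's helper.
#include "ggml.h"
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const int64_t n_per_row = 384;  // must be a multiple of the 32-weight block size
    const int64_t nrows     = 384;

    std::vector<float>   weights(n_per_row * nrows, 0.5f);  // stand-in FP32 tensor
    std::vector<uint8_t> quantized(ggml_row_size(GGML_TYPE_Q4_0, n_per_row) * nrows);

    size_t bytes = ggml_quantize_chunk(GGML_TYPE_Q4_0,
                                       weights.data(), quantized.data(),
                                       /*start=*/0, nrows, n_per_row,
                                       /*imatrix=*/nullptr);

    printf("fp32: %zu bytes -> q4_0: %zu bytes\n",
           weights.size() * sizeof(float), bytes);
    return 0;
}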

Additional Features

Flash Attention: To improve inference speed and reduce memory usage, we implemented flash attention within the DINOv2 pipeline. Flash attention computes attention in a much more memory-efficient manner using kernel fusion techniques, and it can be optionally toggled to maximize performance on CPU.
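
A minimal sketch of the fused op ggml exposes for this is shown below. The tensor layout, head dimensions, and the op's exact signature are assumptions based on a recent ggml release rather than the precise dinov2.cpp code; the key point is that the fused kernel never materializes the full patch-by-patch attention score matrix.

// Sketch: fused flash attention over image patches via ggml.
#include "ggml.h"
#include <math.h>

// q, k, v: per-head tensors laid out as [head_dim, n_tokens, n_heads].
// The mask is omitted because DINOv2 uses full (unmasked) self-attention.
struct ggml_tensor * fused_attention(struct ggml_context * ctx,
                                     struct ggml_tensor * q,
                                     struct ggml_tensor * k,
                                     struct ggml_tensor * v) {
    const float scale = 1.0f / sqrtf((float) q->ne[0]);  // 1 / sqrt(head_dim)
    return ggml_flash_attn_ext(ctx, q, k, v, /*mask=*/nullptr,
                               scale, /*max_bias=*/0.0f, /*logit_softcap=*/0.0f);
}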

Positional Embedding Interpolation: During training, Vision Transformers rely on a fixed-size positional embedding. During inference, however, the input image size may vary, especially as models are deployed on different devices. The interpolation process resizes the learned positional embeddings to match the number of patches generated by the new input resolution, enabling the model to generalize to arbitrary input sizes.
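
The sketch below illustrates the idea in plain C++ using bilinear interpolation over the patch-position grid. It is a simplified illustration only: the real pipeline operates on ggml tensors, and the PyTorch reference resamples with a higher-order (bicubic) filter.

// Sketch: resize a learned grid of positional embeddings from its
// training-time size (old_h x old_w) to the grid implied by a new input
// resolution (new_h x new_w). CLS/register token embeddings are excluded.
#include <algorithm>
#include <cstdio>
#include <vector>

// pos: old_h*old_w rows of `dim` floats, row-major over the patch grid.
std::vector<float> interpolate_pos_embed(const std::vector<float> & pos,
                                         int old_h, int old_w,
                                         int new_h, int new_w, int dim) {
    std::vector<float> out((size_t) new_h * new_w * dim);
    for (int y = 0; y < new_h; ++y) {
        for (int x = 0; x < new_w; ++x) {
            // Map the target cell back into the source grid's coordinates.
            float sy = (new_h > 1) ? (float) y * (old_h - 1) / (new_h - 1) : 0.0f;
            float sx = (new_w > 1) ? (float) x * (old_w - 1) / (new_w - 1) : 0.0f;
            int y0 = (int) sy, x0 = (int) sx;
            int y1 = std::min(y0 + 1, old_h - 1);
            int x1 = std::min(x0 + 1, old_w - 1);
            float fy = sy - y0, fx = sx - x0;
            for (int d = 0; d < dim; ++d) {
                float v00 = pos[((size_t) y0 * old_w + x0) * dim + d];
                float v01 = pos[((size_t) y0 * old_w + x1) * dim + d];
                float v10 = pos[((size_t) y1 * old_w + x0) * dim + d];
                float v11 = pos[((size_t) y1 * old_w + x1) * dim + d];
                float top = v00 + fx * (v01 - v00);
                float bot = v10 + fx * (v11 - v10);
                out[((size_t) y * new_w + x) * dim + d] = top + fy * (bot - top);
            }
        }
    }
    return out;
}

int main() {
    // Illustrative: shrink a 37x37 grid of 384-d embeddings to a 16x16 grid.
    std::vector<float> pos(37 * 37 * 384, 0.1f);
    std::vector<float> resized = interpolate_pos_embed(pos, 37, 37, 16, 16, 384);
    printf("resized embedding count: %zu\n", resized.size() / 384);
    return 0;
}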

Cross-Platform Support: We have tested dinov2.cpp on Windows, Linux, and Mac and have ensured that the inference pipeline works cross-platform.

Datasets

The original DINOv2 family of models was trained on the LVD-142M dataset, a curated set of 142 million images assembled from ImageNet-22k, Google Landmarks, and other fine-grained datasets. A self-supervised pipeline was developed for training, capturing features at both the image and pixel level. The diverse dataset and self-supervised pipeline enable DINOv2 to be applied to a wide range of computer vision tasks without any finetuning, including classification, semantic segmentation, depth estimation, instance retrieval, and video understanding.

DINOv2.cpp retains these capabilities. Publicly available weights can be converted to the GGUF format without any loss of accuracy. The inference pipeline supports images at sizes up to 518 × 518 pixels and automatically resizes input images, providing support for any image-based dataset. The diversity and size of the dataset used to train DINOv2 ensure that dinov2.cpp can support a wide range of real-time tasks in a variety of environments.
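
A simplified sketch of this kind of preprocessing is shown below. The fixed 518 × 518 target, the use of OpenCV for resizing, and the ImageNet mean/std normalization constants are assumptions for illustration rather than the exact dinov2.cpp preprocessing.

// Sketch: resize an input image to the model's maximum supported resolution
// and normalize it before it is copied into the input tensor.
#include <opencv2/opencv.hpp>
#include <cstdio>

cv::Mat preprocess(const cv::Mat & bgr, int target = 518) {
    cv::Mat img;
    cv::cvtColor(bgr, img, cv::COLOR_BGR2RGB);
    cv::resize(img, img, cv::Size(target, target), 0, 0, cv::INTER_LINEAR);
    img.convertTo(img, CV_32FC3, 1.0 / 255.0);

    // Channel-wise normalization (standard ImageNet statistics, assumed).
    const cv::Scalar mean(0.485, 0.456, 0.406);
    const cv::Scalar std (0.229, 0.224, 0.225);
    cv::subtract(img, mean, img);
    cv::divide(img, std, img);
    return img;  // HWC float32, ready to be copied into the input tensor
}

int main() {
    cv::Mat dummy(600, 400, CV_8UC3, cv::Scalar(90, 120, 200));  // stand-in image
    cv::Mat ready = preprocess(dummy);
    printf("preprocessed: %dx%d, %d channels\n", ready.cols, ready.rows, ready.channels());
    return 0;
}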

Evaluation Results

We evaluate our implementation on inference speed and memory usage, aiming to beat DINOv2's PyTorch implementation on both. To compare results, we ran four benchmarks. The first two run without quantization: the small, base, large, and giant models are each run 100 times, and we measure the average inference speed and peak memory usage, once for the variants with register tokens and once for those without. The other two benchmarks cover quantized models: for each model size, with and without register tokens, we benchmark the 4_0, 4_1, 5_0, 5_1, and 8_0 formats across 100 inference runs each. Each experiment was run on an Intel Core i9-14900HX, using 24 of its 32 threads for best results.
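
The timing side of this follows the usual pattern sketched below: a warm-up pass, then averaging wall-clock time over repeated forward passes. run_inference() is a hypothetical stand-in for the actual dinov2.cpp entry point.

// Sketch: averaging inference latency over repeated runs.
#include <chrono>
#include <cstdio>
#include <thread>

// Hypothetical stand-in for one full dinov2.cpp forward pass on a fixed image.
static void run_inference() {
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
}

int main() {
    const int n_runs = 100;
    run_inference();  // warm-up pass, excluded from the average

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n_runs; ++i) {
        run_inference();
    }
    auto t1 = std::chrono::steady_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count() / n_runs;
    printf("average inference time: %.1f ms over %d runs\n", ms, n_runs);
    return 0;
}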

Our benchmark results, detailed under Benchmark Results below, show universal improvements in inference speed and memory use when compared to PyTorch. Our most notable improvement was with the small variant with register tokens: speed from 297 ms → 64 ms and memory from 455 MB → 110 MB.

Our quantization benchmark results, detailed under Quantization Benchmark Results below, show universal improvements in inference speed and memory use when compared to the non-quantized models run in our pipeline. Our most notable improvement can be seen with the giant model with register tokens and quantization type 4_0: speed from 1995 ms → 1275 ms and memory from 4400 MB → 1281 MB.

Benchmark Results

Without Register Tokens

Model | Max Mem (PyTorch) | Max Mem | Speed (PyTorch) | Speed
small | ~455 MB           | ~110 MB | 181 ms          | 62 ms
base  | ~720 MB           | ~367 MB | 462 ms          | 197 ms
large | ~1.55 GB          | ~1.2 GB | 1288 ms         | 600 ms
giant | ~4.8 GB           | ~4.4 GB | 4384 ms         | 1969 ms

With Register Tokens

Model | Max Mem (PyTorch) | Max Mem | Speed (PyTorch) | Speed
small | ~455 MB           | ~110 MB | 297 ms          | 64 ms
base  | ~720 MB           | ~366 MB | 436 ms          | 200 ms
large | ~1.55 GB          | ~1.2 GB | 1331 ms         | 597 ms
giant | ~4.8 GB           | ~4.4 GB | 4472 ms         | 1995 ms

Quantization Benchmark Results

Without Register Tokens

Model | Quant | Speed (ms) | Mem (MB)
small | q4_0  | 46         | 49
small | q4_1  | 48         | 51
small | q5_0  | 63         | 54
small | q5_1  | 58         | 57
small | q8_0  | 50         | 70
base  | q4_0  | 141        | 129
base  | q4_1  | 135        | 140
base  | q5_0  | 162        | 150
base  | q5_1  | 161        | 160
base  | q8_0  | 125        | 212
large | q4_0  | 389        | 371
large | q4_1  | 382        | 407
large | q5_0  | 497        | 444
large | q5_1  | 478        | 480
large | q8_0  | 348        | 661
giant | q4_0  | 1268       | 1281
giant | q4_1  | 1248       | 1417
giant | q5_0  | 1625       | 1553
giant | q5_1  | 1576       | 1688
giant | q8_0  | 1059       | 2364

With Register Tokens

Model | Quant | Speed (ms) | Mem (MB)
small | q4_0  | 52         | 49
small | q4_1  | 50         | 52
small | q5_0  | 59         | 54
small | q5_1  | 57         | 57
small | q8_0  | 51         | 70
base  | q4_0  | 136        | 129
base  | q4_1  | 133        | 139
base  | q5_0  | 164        | 150
base  | q5_1  | 158        | 160
base  | q8_0  | 124        | 211
large | q4_0  | 395        | 371
large | q4_1  | 395        | 407
large | q5_0  | 493        | 443
large | q5_1  | 490        | 480
large | q8_0  | 353        | 661
giant | q4_0  | 1275       | 1281
giant | q4_1  | 1261       | 1417
giant | q5_0  | 1615       | 1552
giant | q5_1  | 1583       | 1687
giant | q8_0  | 1065       | 2364

Conclusion

For our final project we developed dinov2.cpp, a lightweight C++ inference engine for the DINOv2 family of models. Our implementation enables efficient inference on edge devices, using the GGML tensor library with minimal dependencies.

Our benchmarks demonstrate significant reductions in memory usage and inference time across all model sizes when compared to the original PyTorch implementation. Quantization improves these results further, allowing for efficient model loading and inference. We provide support for classification and real-time feature extraction, giving insight into how the model understands visual information. We also implemented flash attention and positional embedding interpolation, letting users balance classification accuracy against inference speed.

Future work may include adding support for different visual analysis tasks, such as depth estimation and semantic segmentation.

References

  1. Taghadouini, S., Gerganov, G., & Elion, M. (2024, April 11). staghado/vit.cpp. GitHub. https://github.com/staghado/vit.cpp
  2. Gerganov, G., Devesa, D., Gäßler, J., Bolz, J., et al. (2025, May 2). ggml-org/ggml. GitHub. https://github.com/ggml-org/ggml
  3. Oquab, M., Darcet, T., Moutakanni, T., Vo, H., et al. (2024). DINOv2: Learning Robust Visual Features without Supervision. arXiv. https://arxiv.org/abs/2304.07193
  4. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., et al. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv. https://arxiv.org/abs/2010.11929

👥 Collaborators

  • Michael Krah (Boston University)
  • Zach Gentile (Boston University)