An idea: Combine KNN and Attention

The emergence of Attention Mechanisms has revolutionized the field of deep learning, especially within the realms of Natural Language Processing (NLP) and Computer Vision. These mechanisms allow models to focus on specific parts of the input data that are more relevant to solving the task at hand, mimicking the attentive processes observed in human cognition. However, the scalability of traditional attention mechanisms, especially in contexts with large input sequences or feature dimensions, remains a challenge due to the quadratic complexity associated with computing pairwise relevance scores.

In light of these challenges, we introduces a novel attention mechanism, termed as KNN (K-Nearest Neighbors) Attention, that leverages the principles of efficient nearest neighbor search to enhance the scalability and efficiency of attention computations. Unlike the traditional attention mechanisms that compute scores across all pairs of input units, KNN Attention focuses on identifying and utilizing a limited number of most relevant key-value pairs for each query.

KNN Attention Mechanism

The KNN Attention module operates as follows:

  1. Key Indexing: During initialization, keys are randomly generated, normalized, and added to a FAISS index. This index allows for efficient retrieval of the nearest neighbors based on Euclidean distance.
  2. Query Projection: Queries are derived by projecting the input features through a learned linear transformation. The queries are then normalized.
  3. Neighbor Retrieval and Scoring: For each query, the top-k nearest key neighbors are retrieved from the index. Distance measures from these neighbors are used to compute scaled attention scores, where the scaling factor is derived from the dimension of the key space to ensure unit variance in the dot product space.
  4. Weighted Value Aggregation: The retrieved top-k value vectors corresponding to the nearest keys are weighted by the softmax-normalized attention scores and aggregated to produce the output representation.

Fine-tune Algorithm

Our fine-tuning approach in the KNN Attention model adeptly maintains a balance between general and domain-specific knowledge. Key-value pairs from extensive pre-training capture broad abilities and are frozen to preserve this universal knowledge. Concurrently, the addition of new key-value pairs targeted towards specific vertical domains enriches the model with specialized insight. This dual strategy ensures the model retains its foundational strengths while gaining precise, domain-specific expertise. Thus, our fine-tuning process skillfully navigates the preservation of wide-ranging capabilities and the incorporation of targeted knowledge, making the model versatile and precise across varying applications.

To further enhance the responsiveness and relevance of our KNN Attention model in dynamic data environments, the incorporation of a short-term memory mechanism is crucial. This necessity stems from the observation that in many real-world scenarios—ranging from conversational AI to real-time data analysis—the ability to rapidly process and recall the most recent information is paramount. The short-term memory mechanism addresses this by instantiating fine-tuning during inference, seamlessly integrating embeddings from newly encountered text or data into the vector database. This ensures the model’s output reflects the latest changes in the data landscape. Notably, our algorithm’s efficiency in fine-tuning processes makes it exceptionally well-suited to meeting interactive demands. By facilitating swift updates to the model’s knowledge base without significant computational overheads, our approach guarantees that even in highly interactive and data-intensive applications, the model remains both current and accurate. This aspect of our KNN Attention mechanism underscores its potential in applications where timeliness and relevance of information are key to achieving high levels of user engagement and satisfaction.

Comparison with Traditional Attention Mechanisms

Our KNN Attention mechanism presents a significant departure from traditional attention mechanisms in terms of computational complexity and scalability. While traditional attention mechanisms, such as the standard dot-product attention, exhibit a computational complexity of O(N^2) for sequence lengths N, our KNN Attention mechanism achieves a more favorable complexity of O(N log⁡ K), where N is the sequence length and K represents the size of the vector database. This improvement is attributed to our efficient neighbor retrieval process within the embedding space, leveraging the FAISS library. Consequently, our approach scales more effectively with sequence length, facilitating the application of attention mechanisms to longer sequences or larger datasets without prohibitive computational costs.

Additionally, our KNN Attention model introduces a dynamic and scalable approach to memory capacity. Namely, end-users have the flexibility to enhance the model’s memory capabilities by purchasing additional vector database storage space. This feature allows for a customizable and scalable memory solution, enabling users to dynamically adjust the model’s recall abilities according to their specific needs and resources. The integration of this scalable memory capacity, in conjunction with the efficient and adaptable mechanisms of the KNN Attention module, ensures an unparalleled level of performance and responsiveness in managing large-scale and evolving datasets. This characteristic further distinguishes our approach from traditional attention mechanisms, providing a powerful tool for users requiring enhanced memory and recall functionalities in their applications.

Vector database can reside in system memory rather than GPU memory, allowing compute operations to be carried out on the CPU. This feature significantly alleviates the constraints imposed by limited GPU memory resources, which is a common bottleneck in deploying large-scale deep learning models, especially when processing extensive datasets or employing models with substantial memory requirements. By leveraging the CPU for these operations, our approach ensures broader applicability and scalability of attention-based models across a wider range of hardware configurations. This adaptability enhances the feasibility of implementing complex models in environments where GPU memory is scarce or when the computational demands exceed the available GPU resources, thus facilitating more diverse and inclusive research and application scenarios in the field of deep learning.


In this work, we put forward a novel methodology that intertwines the K-Nearest Neighbors (KNN) approach with traditional attention mechanisms, integrating a vector database into the architecture to effectively reduce computational burdens associated with attention processes. Moreover, it accentuates the capability for dynamic adjustment of database sizes, harmonizing considerations of expense with the prowess of the model. This innovative technique facilitates streamlined fine-tuning processes and is poised to be a pivotal strategy for embedding vertical domain expertise and acquiring short-term memory capabilities during analytical reasoning.

import faiss.contrib.torch_utils
import torch
import faiss

class KNNAttention(torch.nn.Module):
    def __init__(self, input_dim, key_dim, value_dim, ntotal, num_head=1):
        self.key_dim = key_dim
        self.index = faiss.IndexFlatL2(key_dim)

        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        if self.device == "cuda":
            self.index = faiss.index_cpu_to_all_gpus(self.index)

        gauss = torch.randn((ntotal, key_dim), device=self.device)
        keys = gauss / torch.norm(gauss, dim=1, keepdim=True)
        self.weight_query = torch.nn.Parameter(
            torch.empty(num_head, input_dim, key_dim, device=self.device)
        torch.nn.init.kaiming_normal_(self.weight_query, mode="fan_out")
        self.weight_values = torch.nn.Parameter(
            torch.empty(ntotal, value_dim, device=self.device)
        torch.nn.init.kaiming_normal_(self.weight_values, mode="fan_out")

    def forward(self, x, topk):
        query = torch.einsum("...i,hik->", x, self.weight_query)
        normalized_query = query / torch.norm(query, dim=-1, keepdim=True)

        # Flatten the input to support multi-dimensional input
        flatten = normalized_query.reshape(-1, self.key_dim)
        distances, indices =, topk)
        index_shape = query.shape[:-1] + (topk,)
        distances = distances.reshape(index_shape)
        indices = indices.reshape(index_shape)

        sqrt_dk = self.key_dim ** 0.5
        # Scale to achieve a dot product with unit variance.
        scaled_product = sqrt_dk - distances * (sqrt_dk / 2)
        attention_scores = torch.nn.functional.softmax(scaled_product, dim=-1)
        weighted_values = torch.einsum(
            "...k,...kv->...v", attention_scores, self.weight_values[indices]
        result = weighted_values.flatten(-2, -1)
        return result