Dot Product is a Bad Distance Function

Josua Krause, PhD
Jul 27, 2023 · 10 min read

Machine learning has created a need for non-spatial k-Nearest-Neighbor (kNN) solutions for tasks such as topic modeling, semantic search, and attention. This has shifted the focus of kNN towards similarity measures like cosine similarity or the dot product. In general, we have a good intuition, based on day-to-day experience, of how the Euclidean distance L2 (or L1 if you are from New York City) behaves and what to expect from it. On the other hand, we rarely encounter the cosine similarity or the dot product in our daily lives and thus lack a strong real-world intuition for them, which often makes their behavior surprising. In this article we explore the cosine similarity and the dot product in an attempt to gain an intuition about their respective behaviors (and learn why you probably shouldn’t use the dot product for kNN tasks). We do this using a visualization [1] of a two-dimensional vector space.

However, before diving into how the visualization works, how it can be interpreted, and what its limitations are, we need to properly define our terminology. A distance function, or metric, is a bivariate function obeying the following rules:

The definition of a distance function with the four rules: reflexivity, positivity, symmetry, and the triangle inequality.
The four rules of distance functions: Reflexivity, Positivity, Symmetry, and the Triangle Inequality.
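Spelled out, for a distance function d and arbitrary vectors a, b, and c, the four rules read:

```latex
\begin{aligned}
&\text{Reflexivity:} && d(a, a) = 0 \\
&\text{Positivity:} && d(a, b) > 0 \quad \text{for } a \neq b \\
&\text{Symmetry:} && d(a, b) = d(b, a) \\
&\text{Triangle inequality:} && d(a, c) \leq d(a, b) + d(b, c)
\end{aligned}
```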

On the other hand, similarity functions have very few requirements, often only requiring the bivariate function to be symmetric. As we will see, these lax requirements lead to some interesting and surprising behaviors.

Visualization

The visualization we will be using to explore various similarity functions shows a representation of a two-dimensional vector space. The points shown are the items embedded in the vector space. The brightness of the background color is proportional to the distance from each pixel position to the closest item of our item set. A red line is drawn along the boundary where the closest item changes. Below, a short video illustrates how moving an item around changes those properties in the L2 (Euclidean distance) and L1 (Manhattan distance) spaces. Also click on the “Live Demo” link below to play around and familiarize yourself with the visualization:

Exploring how the Voronoi regions of the L2 and L1 metrics change when moving an item across the space.
Demonstrating the visualization using the L2 and L1 metrics.

As expected for the L2 metric, the distances to the closest items increase as concentric circles around those items. For the L1 metric, this change in distance takes the shape of diamonds. Note that at some point the background color stops changing. This happens because we have to map a distance, which ranges from 0 to infinity, to the color space, which ranges from 0 to 1. We could transform one range into the other using, for example, a sigmoid function, but this would distort the shapes we see and in turn make it harder to draw conclusions from the visualization. Instead, we stop the change at a certain distance and only show the distances from 0 to that constant. In the link above you can change the “Correction” value to adjust this constant and explore different distance ranges.
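As a minimal sketch of this clamping idea (not the demo’s actual code; the correction constant plays the role of the “Correction” value mentioned above):

```python
import numpy as np

def brightness(pixel, items, correction=5.0):
    """Map the distance to the closest item into [0, 1] by clamping.

    Sketch only: distances beyond `correction` all map to the same
    maximum brightness of 1.0 instead of being squashed by a sigmoid.
    """
    dists = np.linalg.norm(items - pixel, axis=1)  # L2 distance to every item
    return min(dists.min() / correction, 1.0)

items = np.array([[0.0, 0.0], [3.0, 4.0]])
print(brightness(np.array([1.0, 1.0]), items))  # ~0.28
```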

The closest-item boundaries in red form the Voronoi regions [2] of the respective metric. Even though the visualization is only in two dimensions, boundaries in higher dimensions would have similar properties. While a three-dimensional version would still be relatively easy to imagine (think bubbly foam), four or higher dimensional boundaries become basically impossible to visualize intuitively. This also applies to other phenomena we will encounter with this visualization. Two dimensions make it easy to identify patterns, but it is crucial to convince ourselves that those patterns hold in higher dimensions as well. We will sketch mathematical proofs to do this.

Cosine Similarity

Cosine similarity measures the angle between two vectors as follows:

The definition of cosine similarity and distance.
Definition of the cosine similarity and its corresponding distance function.
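In formulas, with θ being the angle between the vectors a and b:

```latex
\text{sim}_{\cos}(a, b) = \cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
\qquad\qquad
d_{\cos}(a, b) = 1 - \text{sim}_{\cos}(a, b)
```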

In order for cosine similarity to be usable in a kNN algorithm we need to convert it into a distance function. This is typically done by subtracting its value from 1, since the maximum cosine similarity, reached when both vectors align perfectly, is 1. Note that this conversion does not give us a true distance function (the triangle inequality is not satisfied; when using an algorithm that relies on this property the results might be incorrect), but for kNN tasks it is still a very useful function, especially for high dimensional vector spaces. Looking at the cosine distance in our visualization we can see the boundaries for the closest items:

The proximity boundaries of the cosine similarity. All boundaries are straight lines from the origin.
Closest neighbor regions of the cosine similarity.

We can see that all boundaries are straight lines emerging from the origin (i.e., the zero vector) at half the angle between the two items forming neighboring regions. This is expected, as the cosine similarity measures the angle between two points in the vector space with respect to the origin.

Whenever there are angles involved it cannot hurt to look at the unit circle. Let’s project our items onto the unit circle (by normalizing the vectors) and see what the L2 Voronoi regions of the projected items look like:

The proximity boundaries of L2 on normalized points. All boundaries are straight lines from the origin and look the same as the cosine similarity boundaries above.
Voronoi regions using the L2 metric after projecting points to the unit circle.

Wow, this looks pretty similar to the cosine distance regions above. Maybe this holds true in general? You can play around with the live demo below to explore the two-dimensional case further. But after that, and to check higher dimensional cases, let’s do some math.

Given two n-dimensional points a and b spanning an angle θ at the origin, we can use trigonometry to compute the L2 distance d as follows:

Starting with the point distance based on the angle between the vectors we can derive a formula to equate the normalized L2 distance to the cosine distance.
Proof to show that the normalized L2 distance preserves the relative order of the cosine distance.
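Paraphrasing the figure, the key step is the law of cosines:

```latex
\begin{aligned}
d^2 &= \lVert a \rVert^2 + \lVert b \rVert^2 - 2\,\lVert a \rVert\,\lVert b \rVert \cos\theta \\
    &= 2 - 2\cos\theta && \text{(normalized vectors: } \lVert a \rVert = \lVert b \rVert = 1\text{)} \\
    &= 2\,\bigl(1 - \text{sim}_{\cos}(a, b)\bigr) = 2\,d_{\cos}(a, b) \\
d   &= \sqrt{2\,d_{\cos}(a, b)}
\end{aligned}
```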

After rearranging and substituting the cosine similarity back in, we can see that, indeed, the two methods produce the exact same result (in terms of relative order, that is). This has several implications. For one, computing the L2 distance on normalized vectors is slightly cheaper. More importantly, when working on normalized vectors with L2 we have a full distance function again, including the triangle inequality. So all fast kNN algorithms that make use of the properties of distance functions (such as BSP trees [3]) can be used to speed up cosine similarity lookups.
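As a small illustration of that last point, here is a sketch using scipy’s k-d tree, one such space-partitioning structure (not code from the article):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
items = rng.normal(size=(10_000, 16))
query = rng.normal(size=16)

# Normalize once: on the unit sphere, L2 orders neighbors exactly like the cosine distance.
unit_items = items / np.linalg.norm(items, axis=1, keepdims=True)
unit_query = query / np.linalg.norm(query)

tree = cKDTree(unit_items)                 # relies on L2 being a proper metric
dists, idx = tree.query(unit_query, k=10)  # the 10 nearest cosine neighbors
```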

Dot Product Similarity

So at this point, you might be thinking: “I know exactly why you say ‘the dot product is a bad distance function’. Like with the cosine similarity above, we need to transform the dot product into something resembling a distance function. However, this time, we are not only losing the triangle inequality but also reflexivity. This is the case because the dot product of a vector with itself yields a large number, but there’s no universal maximum that could be used to construct a transformation that maps all reflexive dot products to zero. Big deal. Does it actually matter?” In fact, it does, and it is more intricate than it seems. A dot product distance is not only not reflexive; with the dot product, a vector is not even closest to itself:

Constructing an easy example for showing that a vector is not closest to itself in the dot product similarity space.
For every vector, a scaled-up copy of that vector is always closer to it than the vector is to itself.
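The construction boils down to scaling the vector:

```latex
(2v) \cdot v = 2\,\lVert v \rVert^2 \;>\; \lVert v \rVert^2 = v \cdot v
\qquad \text{for any } v \neq 0
```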

We can almost trivially construct a vector that will always be “closer” to any given vector than that vector is to itself. This raises the question: can we construct a set of items where some items will never be closest to any possible vector in the vector space?

But before we can investigate this question using the visualization, we have to decide on a transformation function that turns the dot product into a “distance-like” function.

Alternative ways of mapping the dot product into a “distance-like” function.
Two ways of mapping the dot product into a “distance-like” function.
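The figure above gives the exact definitions; as a rough reconstruction of the idea (my own sketch, the article’s precise formulas may differ), the e-Mapping exponentiates the negated dot product while the Absolute-Mapping squashes it via its absolute value, for example:

```latex
d_e(a, b) = e^{-\,a \cdot b}
\qquad\qquad
d_a(a, b) = \frac{1}{2}\left(1 - \frac{a \cdot b}{1 + \lvert a \cdot b \rvert}\right)
```

Both are strictly decreasing in the dot product; the difference is how quickly they approach 0, which will matter later.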

Really, any strictly monotonically decreasing function will properly map the dot product into a “distance-like” function that correctly preserves the relative order of all inputs. The dₑ mapping above is such a function and is frequently used (it also often shows up in unexpected ways, e.g., when computing the softmax of a dot product). But for reasons that we will discuss in the “Numerical Limitations” section below we will use dₐ instead. Mathematically it makes no difference which method we choose, as we are only interested in the relative order of vectors.

Back to our question of whether we can fully hide items. Let’s play around a bit in the visualization:

Moving an item around to see where the item is completely hidden.
Points become unreachable beyond certain invisible lines using dot product similarity.

It seems that for a given set of items there is a region in which additional items will never be closest to any vector in the entire vector space. In order to find out the shape of this region we must first formalize what it means for an item to be hidden:

A system of linear inequalities to determine whether an item is completely hidden.
Given a set of items V we can hide h if all inequalities are satisfied.
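Paraphrasing the figure: h is hidden if, for every possible query vector x, at least one item in V scores a dot product at least as high as h does:

```latex
\forall x: \quad \max_{v \in V} \; x \cdot v \;\geq\; x \cdot h
```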

In order to hide an item h in a set of items V we need the above system of linear inequalities to be satisfied for every x. This is exactly the case when h lies inside the convex hull of V (in higher dimensions, the convex hull of a finite point set is a convex polytope): if h = Σᵢ λᵢ vᵢ with λᵢ ≥ 0 and Σᵢ λᵢ = 1, then x · h = Σᵢ λᵢ (x · vᵢ) ≤ maxᵢ x · vᵢ for every x. Accordingly, we can compute whether a vector h is hidden using linear programming [4].
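Following the approach from [4], here is a sketch of such a check using scipy’s linear programming solver (a hypothetical helper, not part of the article’s demo):

```python
import numpy as np
from scipy.optimize import linprog

def is_hidden(h, V):
    """Check whether h lies in the convex hull of the rows of V.

    We search for weights lam >= 0 with sum(lam) == 1 and V.T @ lam == h;
    if such weights exist, h can never win a dot product nearest neighbor query.
    """
    V = np.asarray(V, dtype=float)
    n = V.shape[0]
    A_eq = np.vstack([V.T, np.ones((1, n))])          # V.T @ lam = h and sum(lam) = 1
    b_eq = np.append(np.asarray(h, dtype=float), 1.0)
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.success

square = [[0, 0], [0, 1], [1, 0], [1, 1]]
print(is_hidden([0.5, 0.5], square))  # True: inside the hull, never a nearest neighbor
print(is_hidden([2.0, 2.0], square))  # False: outside, reachable by some query
```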

In turn, this means that, when performing a nearest neighbor search, only items on the surface of the convex hull of the item set are possible candidates for any query. For the dot product similarity, the vast majority of items can never be a nearest neighbor to anything! Note that this is quite surprising, since hiding items in this way is generally not possible with (proper) distance functions (the only way to hide items with them is to place them exactly on top of each other).

This also has some interesting consequences for a kNN search. To obtain the candidate set for the k-th neighbor we can remove all previous nearest neighbors from the item set. The candidate set is then exactly the set of items on the surface of the new convex hull that emerges after removing those nearest neighbors:

Getting the top 10 nearest neighbors by removing nearest neighbors one by one to shrink the convex hull.
The top 10 closest neighbors of the point in black are shown in orange.

In the above visualization we can see how the convex hull changes when removing the currently closest neighbor of a given point. Each time, only a few new items are added to the new convex hull, and the relevant set of candidates stays relatively local to the previous nearest neighbor. This might have some potential to speed up kNN lookups for dot product metrics, but I couldn’t find any paper that looks into it. If anyone wants to try it out, connect and we can write one together.
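To make the candidate-set idea concrete, here is a naive sketch that recomputes the full hull in every round (so it is nowhere near a speed-up, and it assumes the points are in general position); it simply illustrates that peeling hull vertices reproduces the dot product neighbor order:

```python
import numpy as np
from scipy.spatial import ConvexHull

def dot_product_knn(items, query, k):
    """Naive top-k dot product search by peeling convex hulls.

    Only vertices of the current convex hull can maximize a dot product,
    so each round we pick the best hull vertex and remove it.
    """
    points = np.asarray(items, dtype=float)
    remaining = np.arange(len(points))
    result = []
    for _ in range(k):
        sub = points[remaining]
        if len(sub) <= sub.shape[1] + 1:
            candidates = np.arange(len(sub))  # too few points left to build a hull
        else:
            candidates = ConvexHull(sub).vertices
        best = candidates[np.argmax(sub[candidates] @ query)]
        result.append(int(remaining[best]))
        remaining = np.delete(remaining, best)
    return result
```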

That said, considering that kNN lookups typically ask for on the order of 10 to 100 closest neighbors, for large data sets the majority of items will never show up in any query result! Even though this does not seem to be a big issue in practice, it makes the dot product a bad candidate for kNN tasks in my book.

Below is a link to the full visualization with all similarity functions discussed in this article. Feel free to play around and explore more. I’d love to hear about further insights that were not covered in this article:

Numerical Limitations

While implementing the visualization, some other surprising behaviors occurred. However, in those cases the math did not support the phenomena, and they could be considered bugs. Understanding how those bugs happened does not necessarily improve our intuition about the similarity functions involved, but it serves as a cautionary tale about how real-world limitations can mess with the mathematics behind similarity functions.

One such phenomenon is easiest to trigger with the cosine similarity:

Noisy boundaries created by collinear or identical points.
Collinear or identical points can cause noise in the boundary calculation.

In order to understand what is going on, it is important to know how the visualization determines where to draw the neighborhood boundaries. It does so by computing the nearest neighbor of a point and then computing the nearest neighbors of adjacent points. If those nearest neighbors differ, a boundary is drawn at the given point. For collinear or identical items, the cosine similarity can return slightly different distances depending on the reference point. This is due to the limited precision of the floating point numbers involved. With that, either item can end up being the closest to the reference point. Even adjacent points can therefore have different nearest neighbors, creating boundaries where there shouldn’t be any. A solution to this issue is to reduce the precision of the distance values slightly before comparing them. This forces a consistent pick of the nearest neighbor whenever either candidate could be the closest.
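In code, the fix can be as simple as rounding before the comparison (a hypothetical sketch, not the demo’s exact implementation):

```python
import numpy as np

def pick_nearest(distances, decimals=6):
    """Round first so that near-ties caused by floating point noise
    always resolve to the same (lowest-index) item."""
    return int(np.argmin(np.round(distances, decimals)))
```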

Another glitch can happen for the dot product similarity when using the previously mentioned e-Mapping dₑ to convert the dot product into a usable “distance-like” function:

Numerical limitations cause bends in the boundary of dot product nearest neighbors.
Boundary lines bend for large values when using the e-Mapping for the dot product.

Here, a “large” dot product value created by vectors roughly 10 units away from the origin causes the e-Mapping to underflow to 0, in which case the item with the lower list index is chosen as the nearest neighbor. This bends the nearest neighbor regions whose boundaries should be straight lines from the origin. The issue can be resolved by using a different similarity formula that does not approach 0 as quickly, such as the Absolute-Mapping dₐ. Mathematically, either function produces identical nearest neighbor results, but due to the limited precision of floating point numbers this is not the case in practice for large vectors. This is especially worrisome as the e-Mapping often sneaks in from other places, for example when applying the softmax to the result of a dot product. Transformers mitigate the issue by dividing the attention scores (i.e., the dot product values) by the square root of the number of dimensions [5] before passing the values on to the softmax function. However, others have suggested, albeit for unrelated reasons, that the softmax function might not be ideal [6] in the first place when computing attention.
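Going back to the underflow itself, it is easy to reproduce, assuming single precision floats as commonly used on the GPU (the demo’s actual precision is an assumption on my part):

```python
import numpy as np

dot = 10.0 * 10.0 + 10.0 * 10.0      # dot product of two vectors roughly 10 units from the origin
print(np.exp(-dot))                  # ~1.4e-87 in float64: tiny, but still ordered
print(np.exp(np.float32(-dot)))      # 0.0 in float32: every large dot product maps to "distance" 0
```

Once everything maps to 0 the ordering is lost and the tie-breaking by list index takes over, which is exactly the bending we see above.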


References

[1] https://searchspace.josuakrause.com/
[2] https://en.wikipedia.org/wiki/Voronoi_diagram
[3] https://en.wikipedia.org/wiki/Binary_space_partitioning
[4] https://stackoverflow.com/a/43564754/20487202
[5] https://github.com/huggingface/transformers/blob/6704923107dfab02f76546f816911f907af66e47/src/transformers/models/bert/modeling_bert.py#L349
[6] https://www.evanmiller.org/attention-is-off-by-one.html


Josua has led Data Science teams focused on deep representation learning, natural language processing, and adaptive learning. His PhD focused on explainable AI.