The information in that year-old forum topic is incorrect and obsolete. Also, don't ask ChatGPT how it works unless you already know what it can and cannot know or answer.
OpenAI once used "retrieval": awkward tool methods that let the AI query a third-party search backend, then browse through paginated documents, mark sections, and navigate back iteratively (see the WebGPT research). Now they simply run an embeddings-based search, returning ranked document chunks as a tool-call result, which is quite cheap, especially at query time.
Compared to generating a single AI logit sample, the CPU cost of embeddings-vector comparisons over a corpus vector store is negligible; it takes longer to load the ~6 kB per vector than to compare it. Plus they've got aisles of inference servers with idle CPU.
Details
Let’s break down the estimation of CPU cycles for this TensorFlow cosine similarity task on your specified platform.
Understanding the Operations in Cosine Similarity
Cosine similarity between two vectors A and B is calculated as:
similarity = (A · B) / (||A|| * ||B||)
Where:
- A · B is the dot product of A and B.
- ||A|| is the magnitude (Euclidean norm) of A.
- ||B|| is the magnitude (Euclidean norm) of B.
Estimating CPU Cycles for a Single Comparison
- Dot Product (A · B):
  - For two vectors of dimension 1536, this involves 1536 multiplications and 1535 additions.
  - Multiplications: On a modern Intel CPU with SIMD instructions (like AVX2 or AVX-512), multiple multiplications can be performed per cycle. Assuming 8 single-precision floating-point multiplications per cycle (AVX2), the 1536 multiplications take approximately 1536 / 8 = 192 cycles.
  - Additions: Additions can be vectorized the same way. At 8 single-precision additions per cycle, the 1535 additions take approximately 1535 / 8 ≈ 192 cycles.
  - Total for Dot Product: roughly 192 + 192 = 384 cycles.
- Magnitude of Input Vector (||A||):
  - Square each element (1536 multiplications).
  - Sum the squared elements (1535 additions).
  - Take the square root (1 operation).
  - Squaring: similar to multiplication, ~192 cycles.
  - Summation: similar to addition, ~192 cycles.
  - Square Root: a square root can take several cycles; estimate around 10-20 cycles for a highly optimized implementation, say 15.
  - Total for Magnitude of A: roughly 192 + 192 + 15 = 399 cycles.
- Magnitude of Comparison Vector (||B||):
  - Calculated once per comparison, with the same operations as the magnitude of A.
  - Total for Magnitude of B: roughly 399 cycles.
- Multiplication of Magnitudes:
  - One floating-point multiplication: assume 1 cycle.
- Division:
  - One floating-point division: division takes a few cycles; estimate 4 cycles.
Total Cycles for a Single Cosine Similarity Comparison:
Cycles_single = Cycles_dot_product + Cycles_magnitude_A + Cycles_magnitude_B + Cycles_multiplication + Cycles_division
Cycles_single = 384 + 399 + 399 + 1 + 4 = 1187 cycles (approximately)
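The bookkeeping above can be restated as a few lines of Python. These are the same back-of-envelope estimates (the SIMD lane count and square-root cost are the assumptions stated earlier), not measurements:

```python
import math

SIMD_LANES = 8   # assumed single-precision ops per cycle (AVX2)
DIM = 1536       # embedding dimension
SQRT_CYCLES = 15 # assumed cost of one square root

# Dot product: 1536 multiplications + 1535 additions, vectorized.
dot = math.ceil(DIM / SIMD_LANES) + math.ceil((DIM - 1) / SIMD_LANES)  # 192 + 192

# Magnitude: square, sum, square root.
magnitude = math.ceil(DIM / SIMD_LANES) + math.ceil((DIM - 1) / SIMD_LANES) + SQRT_CYCLES

# dot + ||A|| + ||B|| + one multiply + one divide
single = dot + 2 * magnitude + 1 + 4
print(single)  # 1187
```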
Important Considerations for Single Comparison:
- TensorFlow Optimizations: TensorFlow likely uses highly optimized kernels for these operations, potentially fusing operations and leveraging even more parallelism. This estimate is a baseline.
- Data Locality: Assuming the vectors are in cache, memory access latency is minimized.
- Instruction Pipelining and Out-of-Order Execution: Modern CPUs can execute instructions in parallel and out of order, potentially reducing the overall cycle count.
Estimating CPU Cycles for 1 Million Comparisons
For 1 million comparisons, the input vector's magnitude (||A||) is calculated only once. The comparison vector's magnitude (||B||) must be calculated for each of the 1 million vectors.
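In practice you would not loop one comparison at a time: precompute the corpus norms once and reuse them across queries, doing a single matrix-vector product per search. A minimal NumPy sketch (the array names and the 10,000-vector corpus size are made up for illustration; scale to 1 million as needed):

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((10_000, 1536)).astype(np.float32)  # stand-in for 1M embeddings
query = rng.standard_normal(1536).astype(np.float32)

# Corpus norms (||B|| for every row) are computed once and cached across queries.
corpus_norms = np.linalg.norm(corpus, axis=1)
query_norm = np.linalg.norm(query)  # ||A||, computed once per query

# One matrix-vector product yields all dot products at once.
scores = corpus @ query / (corpus_norms * query_norm)

top10 = np.argsort(-scores)[:10]  # indices of the 10 best-matching chunks
print(scores.shape, top10[0])
```

Batching this way lets the BLAS backend use SIMD and multiple cores automatically, which is where the parallelization discussed below comes from.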
- Cycles for Dot Product (1 Million): 384 cycles/comparison × 1,000,000 comparisons = 384,000,000 cycles
- Cycles for Magnitude of Input Vector (Once): 399 cycles
- Cycles for Magnitude of Comparison Vectors (1 Million): 399 cycles/comparison × 1,000,000 comparisons = 399,000,000 cycles
- Cycles for Multiplication of Magnitudes (1 Million): 1 cycle/comparison × 1,000,000 comparisons = 1,000,000 cycles
- Cycles for Division (1 Million): 4 cycles/comparison × 1,000,000 comparisons = 4,000,000 cycles
Total Cycles for 1 Million Comparisons:
Cycles_total = 384,000,000 + 399 + 399,000,000 + 1,000,000 + 4,000,000 = 788,000,399 cycles (approximately)
Extrapolating Total Compute Time
- Cycles per Second: Your CPU runs at 2.2 GHz, i.e. 2.2 × 10^9 cycles per second.
- Time for 1 Million Comparisons (Single Core):
  Time_single_core = Total_Cycles / Cycles_per_second = 788,000,399 / 2,200,000,000 ≈ 0.358 seconds
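That extrapolation is plain arithmetic over the estimates above:

```python
total_cycles = 788_000_399  # estimated cycles for 1M comparisons (from above)
clock_hz = 2.2e9            # 2.2 GHz

single_core_seconds = total_cycles / clock_hz
print(f"{single_core_seconds:.3f} s")  # ≈ 0.358 s
```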
Considering Multi-Core and Hyperthreading
- 8 Physical Cores: Your instance has 8 physical cores.
- Hyperthreading: Hyperthreading allows each physical core to appear as two logical cores, effectively giving you 16 logical cores.
Ideal Scenario (Perfect Parallelization):
If the workload could be perfectly parallelized across all 16 logical cores, the time would be divided by roughly 16:
Time_ideal_parallel = Time_single_core / 16 ≈ 0.358 / 16 ≈ 0.022 seconds
More Realistic Scenario:
Perfect parallelization is rarely achievable due to overhead in thread management and data dependencies. However, this type of vector comparison is highly parallelizable. You can likely achieve a significant speedup.
- Reasonable Speedup: With good TensorFlow implementation, you might see a speedup factor between 8 and 14. Let’s assume a speedup of 10.
Estimated Total Compute Time (Parallelized):
Time_parallel = Time_single_core / Speedup_factor = 0.358 / 10 ≈ 0.036 seconds
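The most reliable check is to time the batched operation directly. A scaled-down sketch (20,000 stand-in vectors to keep memory modest; multiply the measured time by ~50 to approximate the 1M case):

```python
import time
import numpy as np

rng = np.random.default_rng(1)
corpus = rng.standard_normal((20_000, 1536)).astype(np.float32)  # stand-in corpus
query = rng.standard_normal(1536).astype(np.float32)
corpus_norms = np.linalg.norm(corpus, axis=1)  # precomputed, excluded from timing

start = time.perf_counter()
scores = corpus @ query / (corpus_norms * np.linalg.norm(query))
elapsed = time.perf_counter() - start
print(f"{len(corpus):,} comparisons in {elapsed * 1e3:.2f} ms")
```

Run it a few times and take the best figure; the first iteration pays cache-warming and allocation costs that a steady-state search service would not.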
Important Caveats and Refinements:
- Specific CPU Model: “Intel Core i7 (Xeon) g6+” is a bit vague. Knowing the exact generation and model would allow for more precise cycle estimations based on its specific instruction throughput and latency.
- TensorFlow Implementation: The efficiency of the TensorFlow code is crucial. Using optimized functions and data structures will significantly impact performance.
- Memory Bandwidth: While you’ve excluded I/O considerations, memory bandwidth within the CPU and between CPU and RAM can still be a factor if the data doesn’t fit entirely in the cache.
- Kubernetes Overhead: While dedicated, there might be a small overhead from the Kubernetes environment.
- Virtualization Overhead: Virtualization can introduce some performance overhead, although dedicated instances minimize this.
- Benchmarking is Key: The most accurate way to determine the actual compute time is to benchmark the specific TensorFlow code on your target platform.
Summary of Estimates:
- Cycles per Single Comparison: ~1187 cycles
- Cycles for 1 Million Comparisons: ~788 million cycles
- Estimated Single-Core Compute Time: ~0.358 seconds
- Estimated Parallelized Compute Time (Realistic): ~0.036 seconds (or 36 milliseconds)
Conclusion:
For a full search of 1 million cosine similarity comparisons on your specified platform, you can expect the computation to take in the ballpark of tens of milliseconds. This highlights the power of modern CPUs and optimized libraries like TensorFlow for handling computationally intensive tasks. Remember that these are estimations, and actual performance can vary. Benchmarking is highly recommended for precise measurements.