Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3144
Visualize results of running the benchmark with Pareto optima filtering (a small sketch of the filtering idea follows the list):
1. per index or across indices
2. for space, time or space & time
3. knn or range search, the latter @ specific precision
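To make the filtering concrete, here is a minimal sketch of what Pareto-optimum filtering over (time, accuracy) pairs amounts to; this is not the bench_fw code, and the `Result` struct with its `time`/`accuracy` fields is a hypothetical stand-in.
```
#include <vector>

struct Result {
    double time;      // cost axis, lower is better (e.g. search time)
    double accuracy;  // quality axis, higher is better (e.g. knn intersection)
};

// Keep a result only if no other result is at least as fast and at least as
// accurate while being strictly better on one of the two axes.
std::vector<Result> pareto_front(const std::vector<Result>& results) {
    std::vector<Result> front;
    for (size_t i = 0; i < results.size(); i++) {
        bool dominated = false;
        for (size_t j = 0; j < results.size() && !dominated; j++) {
            dominated = j != i &&
                    results[j].time <= results[i].time &&
                    results[j].accuracy >= results[i].accuracy &&
                    (results[j].time < results[i].time ||
                     results[j].accuracy > results[i].accuracy);
        }
        if (!dominated) {
            front.push_back(results[i]);
        }
    }
    return front;
}
```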
Reviewed By: mdouze
Differential Revision: D51552775
fbshipit-source-id: d4f29e3d46ef044e71b54439b3972548c86af5a7
Summary:
1. Support for index construction parameters outside of the factory string (arbitrary depth of quantizers); see the illustration after this list.
2. Refactor that provides an index wrapper which is a prereq for the optimizer, which will generate indices from pre-optimized components (particularly quantizers)
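As an illustration of point 1 (hedged: this uses the core Faiss C++ API directly, not the benchmark framework's own wrapper), the coarse quantizer can be built separately with its own construction parameters instead of being encoded in the factory string:
```
#include <faiss/IndexHNSW.h>
#include <faiss/IndexIVFFlat.h>

int main() {
    int d = 64;
    // Quantizer constructed outside the factory string, so its own
    // construction parameters (HNSW M, efConstruction) can be set explicitly.
    faiss::IndexHNSWFlat quantizer(d, 32);
    quantizer.hnsw.efConstruction = 80;

    // The IVF index then wraps the pre-built quantizer.
    faiss::IndexIVFFlat index(&quantizer, d, 1024, faiss::METRIC_L2);
    // train / add / search as usual ...
    return 0;
}
```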
Reviewed By: mdouze
Differential Revision: D51427452
fbshipit-source-id: 014d05dd798d856360f2546963e7cad64c2fcaeb
Summary:
1. Support `search_preassigned` in IVFFastScan (usage sketch after this list)
2. `try_extract_index_ivf` to search recursively and support `IndexRefine`
3. `get_InvertedListScanner` to fail where not available
4. Workaround an OpenMP issue with `IndexIVFSpectralHash`
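A hedged usage sketch for item 1, based on my reading of the `IndexIVF::search_preassigned` signature (double-check against `IndexIVF.h`): the coarse assignment is computed once and then reused for the fine search.
```
#include <faiss/IndexIVF.h>
#include <faiss/index_factory.h>
#include <memory>
#include <vector>

void search_preassigned_example(
        const float* xb, size_t nb, const float* xq, size_t nq) {
    int d = 32, k = 5;
    std::unique_ptr<faiss::Index> index(
            faiss::index_factory(d, "IVF64,PQ16x4fs")); // an IVFFastScan index
    index->train(nb, xb);
    index->add(nb, xb);

    auto* ivf = dynamic_cast<faiss::IndexIVF*>(index.get());
    size_t nprobe = ivf->nprobe;

    // 1) Coarse assignment, done once (and reusable across searches).
    std::vector<faiss::idx_t> assign(nq * nprobe);
    std::vector<float> coarse_dis(nq * nprobe);
    ivf->quantizer->search(nq, xq, nprobe, coarse_dis.data(), assign.data());

    // 2) Fine search restricted to the preassigned lists.
    std::vector<float> D(nq * k);
    std::vector<faiss::idx_t> I(nq * k);
    ivf->search_preassigned(
            nq, xq, k, assign.data(), coarse_dis.data(),
            D.data(), I.data(), /*store_pairs=*/false);
}
```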
Reviewed By: mdouze
Differential Revision: D51427241
fbshipit-source-id: 365e3f11d24e80f101f986fc358c28dcc00805fa
Summary:
Introduces the `FAISS_ALWAYS_INLINE` directive (a macro wrapping a compiler-specific always-inline attribute) and improves `ScalarQuantizer` performance with it.
Most of the performance-critical methods of `ScalarQuantizer` are marked with this new directive, because a compiler (especially an older one) may be unable to inline them properly on its own. In some of my GCC experiments, such inlining yields +50% queries per second in search.
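For context, an always-inline macro of this kind is usually defined along the following lines; this is a generic sketch, not necessarily the exact Faiss definition (which lives in the platform macros header).
```
#include <cstdint>

// Generic always-inline macro: MSVC uses a keyword, GCC/Clang an attribute.
#if defined(_MSC_VER)
#define FAISS_ALWAYS_INLINE __forceinline
#else
#define FAISS_ALWAYS_INLINE inline __attribute__((always_inline))
#endif

// Usage on a small, hot method that the compiler might otherwise not inline.
struct ToyCodec {
    FAISS_ALWAYS_INLINE float decode_component(const uint8_t* code, int i) const {
        return (code[i] + 0.5f) / 255.0f;
    }
};
```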
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3141
Reviewed By: algoriddle
Differential Revision: D51615609
Pulled By: mdouze
fbshipit-source-id: 9c755c3e1a289b5d498306c1b9d6fcc21b0bec28
Summary: It seems that for some build modes, SWIG chokes on static_assert, so protect this with #ifndef SWIG. Let's see what the tests say....
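For reference, the guard pattern looks like this; a generic sketch, not the exact assertion being protected here.
```
// SWIG's parser can trip over static_assert in some build modes, so hide it
// from the wrapper generator; the C++ compiler still sees and checks it.
#ifndef SWIG
static_assert(sizeof(long long) == 8, "example compile-time check");
#endif
```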
Reviewed By: algoriddle
Differential Revision: D50971042
fbshipit-source-id: 83e2ccb464c0bd024cbf3a494357147d75a76ca2
Summary:
This PR adds functionality so that an IVF index can be searched and the corresponding codes returned. It also adds a few functions to compress int arrays into a bit-compact representation.
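As an illustration of the second part, a generic sketch of bit-compact packing of integer arrays; this is not the actual functions added in this PR, just the underlying idea.
```
#include <cstdint>
#include <vector>

// Pack values of `nbit` bits each (nbit <= 32) into a contiguous bit stream,
// so n values take ceil(n * nbit / 8) bytes instead of 4 or 8 bytes each.
std::vector<uint8_t> pack_ints(const std::vector<uint32_t>& vals, int nbit) {
    std::vector<uint8_t> out((vals.size() * nbit + 7) / 8, 0);
    size_t bit = 0;
    for (uint32_t v : vals) {
        for (int b = 0; b < nbit; b++, bit++) {
            if ((v >> b) & 1) {
                out[bit / 8] |= uint8_t(1) << (bit % 8);
            }
        }
    }
    return out;
}
```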
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3143
Test Plan:
```
buck test //faiss/tests/:test_index_composite -- TestSearchAndReconstruct
buck test //faiss/tests/:test_standalone_codec -- test_arrays
```
Reviewed By: algoriddle
Differential Revision: D51544613
Pulled By: mdouze
fbshipit-source-id: 875f72d0f9140096851592422570efa0f65431fc
Summary:
In the GPU IVF (Flat, SQ and PQ) code, temporary memory is needed to store the unfiltered (or partially filtered) vector distances calculated during list scanning, which are then k-selected by separate kernels.
While a batch query may be presented to an IVF index, the amount of temporary memory needed to store all these unfiltered distances prior to filtering can be very large, depending upon IVF characteristics (such as the maximum number of vectors encoded in any of the IVF lists). In that case we cannot process the entire batch of queries at once and must instead tile over the batch of queries to reuse the temporary memory that we make available for these distances.
The old code duplicated this roughly equivalent logic in 3 different places (the IVFFlat/SQ code, IVFPQ with precomputed codes, and IVFPQ without precomputed codes). Furthermore, when little or no temporary memory was available, or when the available temporary memory was (vastly) exceeded by the amount needed to handle a particular query, the old code enforced a minimum of 8 queries to be processed at once. In certain cases (huge IVF list imbalance), this memory request could exceed the amount of memory that can be safely allocated on a GPU.
This diff consolidates the original 3 separate places where this calculation took place into one place in IVFUtils. The logic proceeds roughly as before, figuring out how many queries can be processed in the available temporary memory, except we add a new heuristic for the case where the number of queries that can be concurrently processed falls below 8. This could be due to little temporary memory being available, or due to huge memory requirements. In this case, we ignore the amount of temporary memory available and instead see how many queries' memory requirements would fit into a single 512 MiB memory allocation, so we cap this amount at something reasonable. If the query still cannot be satisfied with this allocation, we proceed executing 1 query at a time (which, note, could still potentially exhaust GPU memory, but that error is unavoidable).
While a different heuristic using the amount of actual memory allocatable on the device could be used instead of this fixed 512 MiB amount, there is no guarantee to my knowledge that a single cudaMalloc up to this limit could succeed (e.g., GPU reports 3 GiB available, you attempt to allocate all of that in a single allocation), so we just pick an amount which is a reasonable balance between efficiency (parallelism) and memory consumption. Note that if not enough temporary memory is available and a single 512 MiB allocation fails, then there is likely little memory to proceed efficiently at all under any scenario, as Faiss does require some headroom in terms of memory available for scratch spaces.
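In code terms, the heuristic described above boils down to roughly the following; a simplified sketch with illustrative names, not the actual IVFUtils implementation.
```
#include <algorithm>
#include <cstddef>

// How many queries of a batch to process per tile, given the temp memory
// available and the worst-case per-query scratch requirement.
size_t queriesPerTile(
        size_t numQueries, size_t bytesPerQuery, size_t tempMemAvailable) {
    const size_t kMinQueries = 8;
    const size_t kFallbackAlloc = size_t(512) * 1024 * 1024; // 512 MiB cap

    size_t fit = bytesPerQuery ? tempMemAvailable / bytesPerQuery : numQueries;
    if (fit >= kMinQueries) {
        return std::min(fit, numQueries);
    }
    // Too little temp memory (or huge per-query needs): ignore temp memory
    // and see how many queries fit in a single capped allocation instead.
    size_t fitCap = bytesPerQuery ? kFallbackAlloc / bytesPerQuery : numQueries;
    // Even if a single query exceeds the cap, still proceed 1 query at a time.
    return std::max<size_t>(1, std::min(fitCap, numQueries));
}
```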
Reviewed By: mdouze
Differential Revision: D45574455
fbshipit-source-id: 08f5204e3e9656627c9134d7409b9b0960f07b2d
Summary:
nvcc, starting with CUDA 11.5, offers a `-hls` option to generate host-side linker scripts to support large cubin files.
Since faiss supports CUDA 11.4, we replicate that behavior by injecting the same linker script into the link line manually.
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3115
Reviewed By: mdouze
Differential Revision: D51308908
Pulled By: algoriddle
fbshipit-source-id: c6dd073cd3f44dbc99d2e2da97f79b9ebc843b59
Summary:
This diff fixes the bug associated with moving Faiss GPU to CUDA 12.
The following tests were succeeding in CUDA 11.x but failed in CUDA 12:
```
✗ faiss/gpu/test:test_gpu_basics_py - test_input_types (faiss.gpu.test.test_gpu_basics.TestKnn)
✗ faiss/gpu/test:test_gpu_basics_py - test_dist (faiss.gpu.test.test_gpu_basics.TestAllPairwiseDistance)
✗ faiss/gpu/test:test_gpu_index_ivfpq - TestGpuIndexIVFPQ.Add_L2
✗ faiss/gpu/test:test_gpu_basics_py - test_input_types_tiling (faiss.gpu.test.test_gpu_basics.TestKnn)
✗ faiss/gpu/test:test_gpu_index_ivfpq - TestGpuIndexIVFPQ.Add_IP
✗ faiss/gpu/test:test_gpu_index_ivfpq - TestGpuIndexIVFPQ.Float16Coarse
✗ faiss/gpu/test:test_gpu_index_ivfpq - TestGpuIndexIVFPQ.LargeBatch
```
It took a long while to track down, but the issue presented itself when a number of dimensions not divisible by 32 was used in cases where we needed to calculate an L2 norm for vectors, which occurs in brute-force L2 distance computation as well as certain L2 IVFPQ operations. The issue appeared because some tests were using 33 as the dimensionality of the vectors.
The issue is that the number of threads given to the L2 norm kernel was effectively `min(dims, 1024)` where 1024 is the standard maximum number of CUDA threads per CTA on all devices at present. In the case where the result was not a multiple of 32, this would result in a partial warp being passed to the kernel (with non-participating lanes having no side effects).
The change in CUDA 12 here seemed to be a change in the compiler behavior for warp-synchronous shuffle instructions (such as `__shfl_up_sync`). In the case of the partial warp, we were passing `0xffffffff` as the active lane mask, implying that all lanes were present for the warp. For dims = 33, we would have 1 full warp with all lanes present and 1 partial warp with only 1 active thread, so `0xffffffff` is a lie in this case. Prior to CUDA 12, these shuffle instructions appeared to pass 0 around for the lanes not present (or perhaps stalled?), so the result was still calculated correctly. However, with the change to CUDA 12, the compiler and/or device firmware (or something) interprets this differently, and the warp lanes not present provided garbage. The shuffle instructions were used to perform in-warp reductions (e.g., summing a bunch of floating point numbers), namely those needed to sum up the L2 vector norm value. So for dims = 32 or dims = 64 (and, bizarrely, dims = 40 and some other choices) it still worked, but for dims = 33 it was adding in garbage, producing erroneous results.
This diff removes the non-dim-loop specialization of runL2Norm (where we could statically avoid a for loop over dimensions when the threadblock is exactly sized to the number of dimensions present) and just uses the general-purpose fallback. Second, we now always provide a whole number of warps when running the L2 norm kernel, avoiding the issue of the warp-synchronous instructions not having a full warp present.
This bug has been present since the code was written in 2016 and was technically wrong before, but it only surfaced as a bug/problem with the CUDA 12 change.
tl;dr: if you use any kind of `_sync` instruction involving warp sync, always have a whole number of warps present, k thx.
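To make the failure mode concrete, here is a generic warp-reduction sketch (not the Faiss kernel) showing why the all-lanes mask is only valid with whole warps.
```
// Warp-level sum reduction. The 0xffffffff mask asserts that all 32 lanes of
// the warp participate in the shuffle. If the kernel is launched with, say,
// 33 threads, the second warp has a single active lane, the assertion is
// false, and under CUDA 12 the missing lanes contribute garbage. Launching a
// whole number of warps (blockDim.x a multiple of 32) and padding the extra
// lanes' inputs with 0 keeps the full mask valid.
__device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    return val;
}

__global__ void squaredNormSingleWarp(const float* x, int dims, float* out) {
    // Inactive *work* is padded with 0; the lanes themselves still execute.
    float v = threadIdx.x < dims ? x[threadIdx.x] * x[threadIdx.x] : 0.0f;
    v = warpReduceSum(v);
    if (threadIdx.x == 0) {
        *out = v; // correct only when launched with exactly one full warp
    }
}
```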
Reviewed By: mdouze
Differential Revision: D51335172
fbshipit-source-id: 97da88a8dcbe6b4d8963083abc01d5d2121478bf
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3097
A framework for evaluating indices offline.
Long term objectives:
1. Generate offline similarity index performance data with test datasets both for existing indices and automatically generated alternatives. That is, given a dataset and some constraints this workflow should automatically discover optimal index types and parameter choices as well as evaluate the performance of existing production indices and their parameters.
2. Allow researchers, platform owners (Laser, Unicorn) and product teams to understand how different index types perform on their datasets and make optimal choices wrt their objectives. Longer term, this should enable automatic decision-making/auto-tuning.
Constraints, design choices:
1. I want to run the same evaluation on Meta-internal infrastructure (fblearner, data from hive and manifold) or on a local machine + research cluster (data on local disk or NFS) via OSS Faiss. Via fblearner, I want this to work in a way that it can be turned into a service and plugged into Unicorn or Laser, while the core Faiss part can be used/referred to in our research and to update the wiki with the latest results/recommendations for public datasets.
2. It must support a range of metrics for KNN and range search, and it should be easy to add new ones. Cost metrics need to be fine-grained to allow extrapolation.
3. It should automatically sweep all query time params (eg. nprobe, polysemous code hamming distance, params of quantizers), using `OperatingPointsWithRanges` to cut down the optimal param search space. (For now, it sweeps nprobes only.)
4. [FUTURE] It will generate/sweep index creation hyperparams (factory strings, quantizer sizes, quantizer params), using heuristics.
5. [FUTURE] It will sweep the dataset size: start small test with e.g. 100K db vectors and go up to millions, billions potentially, while narrowing down the index+param choices at each step.
6. [FUTURE] Extrapolate perf metrics (cost and accuracy)
7. Intermediate results must be saved (to disk, to manifold) throughout, and reused as much as possible to cut down on overall runtime and enable faster iteration during development.
For range search, this diff supports the metric proposed in https://docs.google.com/document/d/1v5OOj7kfsKJ16xzaEHuKQj12Lrb-HlWLa_T2ct0LJiw/edit?usp=sharing. I also added support for the classical case where the scoring function steps from 1 to 0 at some arbitrary threshold.
For KNN, I added knn_intersection, but other metrics, particularly recall@1 will also be interesting. I also added the distance_ratio metric, which we previously discussed as an interesting alternative, since it shows how much the returned results approximate the ground-truth nearest-neighbours in terms of distances.
In the test case, I evaluated three current production indices for VCE with 1M vectors in the database and 10K queries. Each index is tested at various operating points (nprobes), which are shown on the charts. The results are not extrapolated to the true scale of these indices.
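A stripped-down sketch of the kind of query-time sweep and knn_intersection metric described above; names and the choice of nprobe values are illustrative, the framework itself is far more general.
```
#include <faiss/IndexIVF.h>
#include <unordered_set>
#include <vector>

// Fraction of ground-truth k-NN ids recovered by the index, averaged over queries.
double knn_intersection(
        const std::vector<faiss::idx_t>& I,
        const std::vector<faiss::idx_t>& gt,
        size_t nq,
        size_t k) {
    size_t hits = 0;
    for (size_t q = 0; q < nq; q++) {
        std::unordered_set<faiss::idx_t> ref(
                gt.begin() + q * k, gt.begin() + (q + 1) * k);
        for (size_t j = 0; j < k; j++) {
            hits += ref.count(I[q * k + j]);
        }
    }
    return double(hits) / (nq * k);
}

void sweep_nprobe(
        faiss::IndexIVF& index,
        const float* xq,
        size_t nq,
        size_t k,
        const std::vector<faiss::idx_t>& gt) {
    std::vector<float> D(nq * k);
    std::vector<faiss::idx_t> I(nq * k);
    for (size_t nprobe : {1, 4, 16, 64}) {
        index.nprobe = nprobe; // the query-time parameter being swept
        index.search(nq, xq, k, D.data(), I.data());
        double ki = knn_intersection(I, gt, nq, k);
        (void)ki; // record (nprobe, measured time, ki) as one operating point
    }
}
```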
Reviewed By: yonglimeta
Differential Revision: D49958434
fbshipit-source-id: f7f567b299118003955dc9e2d9c5b971e0940fc5
Summary:
This is a design proposal that demonstrates an approach to enabling optional support for [RAFT](https://github.com/rapidsai/raft) versions of IVF PQ and IVF Flat (and brute force w/ fused k-selection when k <= 64). There are still a few open issues and design discussions needed for the new RAFT index types to support the full range of features of FAISS' current GPU index types.
Checklist for the integration todos:
- [x] Rebase on current `main` branch
- [X] The raft handle has been plugged directly into the StandardGpuResources
- [X] `FlatIndex` passing Googletests
- [x] Use `CodePacker` to support `copyFrom()` and `copyTo()`
- [X] `IVF-flat` passing Googletests
- [ ] Raise appropriate exceptions for operations which are not yet supported by RAFT
Additional features we've discussed:
- [x] Separate IVF lists into individual memory chunks
- [ ] Saving/loading
To build FAISS w/ optional RAFT support:
```
mkdir build
cd build
cmake ../ -DFAISS_ENABLE_RAFT=ON -DFAISS_ENABLE_GPU=ON
make -j
```
For development/testing, we've also supplied a bash script to make things easier: `build.sh`
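Once built with RAFT enabled, selecting the RAFT backend is expected to be a flag on the GPU index config; a hedged sketch, where the `use_raft` field name reflects my understanding of this integration and may differ in the final API.
```
#include <faiss/gpu/GpuIndexIVFFlat.h>
#include <faiss/gpu/StandardGpuResources.h>

void raft_ivf_flat_example(const float* xb, size_t nb, int d) {
    faiss::gpu::StandardGpuResources res;

    faiss::gpu::GpuIndexIVFFlatConfig config;
    config.use_raft = true; // assumption: routes to the RAFT IVF-Flat backend

    faiss::gpu::GpuIndexIVFFlat index(
            &res, d, /*nlist=*/1024, faiss::METRIC_L2, config);
    index.train(nb, xb);
    index.add(nb, xb);
}
```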
Below is a benchmark comparing the training of IVF Flat indices for RAFT and FAISS:
[benchmark chart: RAFT vs. FAISS IVF-Flat training]
The benchmark was produced using Googlebench in [this](https://github.com/tfeher/raft/tree/raft_faiss_bench) RAFT fork. We're going to provide benchmarks for the queries as well. There are still a couple bottlenecks to be removed in the IVF-Flat training implementation and we'll update the current benchmark when ready.
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/2521
Test Plan: `buck test mode/dev-nosan //faiss/gpu/test:test_gpu_index_ivfflat`
Reviewed By: algoriddle
Differential Revision: D49118319
Pulled By: mdouze
fbshipit-source-id: 5916108bc27154acf7c92021ba579a6ca85d730b
Summary:
The CMake CUDA architecture value `60` means to generate both PTX and SASS for that arch. We only need SASS for the architectures we support, plus one PTX version for future hardware.
So now we build SASS for everything (`60-real`) and use 80 as the baseline for newer archs like 90.
By removing this unneeded PTX code we can reduce the libfaiss.a binary to 305MB from the current 484MB.
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3083
Reviewed By: wickedfoo
Differential Revision: D49901896
Pulled By: algoriddle
fbshipit-source-id: 15e98f81e191a565319cf855debad33b24ebf10b
Summary: 1L and 1UL are problematic because sizeof(long) depends on the platform
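A small illustration of the pitfall (generic example, not the code changed here): on LLP64 platforms such as 64-bit Windows, `long` is 32 bits, so the width of the literal silently changes the result.
```
#include <cstdint>

// On LP64 (Linux/macOS) long is 64-bit; on LLP64 (64-bit Windows) it is
// 32-bit, so shifting 1L by 40 is undefined behaviour there.
uint64_t bad_mask = 1L << 40;            // platform-dependent, UB on LLP64
uint64_t good_mask = uint64_t(1) << 40;  // explicit 64-bit literal, portable
```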
Reviewed By: mlomeli1
Differential Revision: D49911901
fbshipit-source-id: d4e4cb1f0283a33330bf1b8ca6b7f7bf41bc6ff4
Summary:
Adds a function argument to `ResidualCoarseQuantizer()` that was missing in the code path where the data is processed in chunks.
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3047
Reviewed By: mlomeli1
Differential Revision: D49687858
Pulled By: mdouze
fbshipit-source-id: 1456138fe1ff3a033b73e97f16470ac8ceca60ab
Summary:
The implementations of `fvec_madd` and `fvec_madd_and_argmin` are in `utils/distances.cpp`, so I moved the declarations from `utils/utils.h` to `utils/distances.h`.
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3054
Reviewed By: mlomeli1
Differential Revision: D49687725
Pulled By: mdouze
fbshipit-source-id: b98c13f5710f06daba479767a7aab8d62d6e6ddf
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3012
The cross-tables for codebook construction contained the dot products between codebook entries, which is not necessary (and caused OOMs in some cases). This diff computes only the off-diagonal blocks.
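For intuition, a generic sketch of the change (not the actual Faiss code): with M codebooks of K entries each, only cross terms between different codebooks are needed, so the M diagonal blocks are never materialized.
```
#include <cstddef>
#include <vector>

// codebooks: M tables of K x d floats (row-major, flattened). Returns, for
// each pair i < j, the K x K table of dot products between entries of
// codebook i and codebook j; the diagonal blocks <C_i, C_i> are skipped.
std::vector<std::vector<float>> cross_tables_offdiag(
        const std::vector<std::vector<float>>& codebooks, size_t K, size_t d) {
    const size_t M = codebooks.size();
    std::vector<std::vector<float>> tables;
    for (size_t i = 0; i < M; i++) {
        for (size_t j = i + 1; j < M; j++) {
            std::vector<float> tab(K * K, 0.0f);
            for (size_t a = 0; a < K; a++) {
                for (size_t b = 0; b < K; b++) {
                    float dp = 0.0f;
                    for (size_t t = 0; t < d; t++) {
                        dp += codebooks[i][a * d + t] * codebooks[j][b * d + t];
                    }
                    tab[a * K + b] = dp;
                }
            }
            tables.push_back(std::move(tab));
        }
    }
    return tables;
}
```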
Reviewed By: pemazare
Differential Revision: D48448615
fbshipit-source-id: 494b54e2900754a3ff5d3c8073cb9a768e578c58
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3011
After Alexandr's optimizations the ResidualQuantizer code has become harder to read. Split off the quantization code to a separate .h / .cpp to make it clearer.
Reviewed By: pemazare
Differential Revision: D48448614
fbshipit-source-id: c90d572ea3afe12a7a7e5092f88710e8eceaa2d1
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3030
Added default arguments to the .h file (for some reason I forgot this file when migrating default args).
Logging a hash value in MatrixStats, useful to check if two runs really really run on the same matrix...
Reviewed By: pemazare
Differential Revision: D48834343
fbshipit-source-id: 7c1948464e66ada1f462f4486f7cf3159bbf9dfd
Summary:
This is a minor bug that comes with a perf impact. The classic FAISS `FlatIndex` always uses the expanded form of distance computation even though an `exactDistances` argument is provided, while `RaftFlatIndex` was using this argument to determine whether the computation should be exhaustive.
This PR includes one additional change to eagerly initialize the `cublas_handle` on the `device_resources` instance when it's created.
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3021
Reviewed By: pemazare
Differential Revision: D48739660
Pulled By: mdouze
fbshipit-source-id: a361334eb243df86c169c69d24bb10fed8876ee9