Summary: Removes unused host fp16 code; the code that depended on it was removed a while ago.
Reviewed By: beauby
Differential Revision: D24279982
fbshipit-source-id: 5f6820c41eb387f766b2bed7e70203f5e01f49e9
Summary:
PyTorch GPU code is in general free to use whatever stream it currently wants, based on `torch.cuda.current_stream()`. Due to C++/Python language barrier issues, we couldn't previously pass the actual `cudaStream_t` that is currently in use on a given device from PyTorch C++ to Faiss C++ via Python.
This diff adds conversion functions that turn a Python integer representing a pointer into a `cudaStream_t` (which is itself a `CUstream_st*`), so we can pass the stream specified in `torch.cuda.current_stream()` to `StandardGpuResources::setDefaultStream`. We thus guarantee that all Faiss work is ordered on the same stream that is in use in PyTorch.
For use in Python, there is now the `faiss.contrib.pytorch_tensors.using_stream` context object, which automatically sets and unsets the current PyTorch stream within Faiss. It takes a `StandardGpuResources` object in Python, and an optional `torch.cuda.Stream` if one wants to use a different stream; otherwise it uses the current one. This is how it is used:
```
# Create a non-default stream
s = torch.cuda.Stream()
# Have Torch use it
with torch.cuda.stream(s):
    # Have Faiss use the same stream as the above
    with faiss.contrib.pytorch_tensors.using_stream(res):
        # Do some work on the GPU
        faiss.bfKnn(res, args)
```
`using_stream` follows the same pattern as PyTorch's `torch.cuda.stream` context manager.
This replaces any brute-force GPU/CPU synchronization work that was necessary before.
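For reference, here is a minimal sketch of what such a context manager can look like. The conversion helper `cast_integer_to_cudastream_t` and the `revertDefaultStream` call are assumptions based on the description above, not necessarily the exact shipped API:
```
import contextlib

import torch
import faiss


@contextlib.contextmanager
def using_stream(res, pytorch_stream=None):
    """Order all Faiss work on `res` on the given PyTorch stream
    (the current stream if None) for the duration of the block."""
    if pytorch_stream is None:
        pytorch_stream = torch.cuda.current_stream()
    dev = torch.cuda.current_device()
    # Stream.cuda_stream exposes the raw cudaStream_t as a Python int;
    # the cast below stands in for the new int -> cudaStream_t
    # conversion function added in this diff
    res.setDefaultStream(
        dev, faiss.cast_integer_to_cudastream_t(pytorch_stream.cuda_stream))
    try:
        yield
    finally:
        # fall back to Faiss-internal streams on exit
        res.revertDefaultStream(dev)
```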
Other changes in this diff:
- cleans up the config objects in the GpuIndex subclasses, distinguishing between read-only parameters that can only be set upon index construction and those that can be changed at runtime.
- `StandardGpuResources` now more properly distinguishes between user-supplied streams (like the PyTorch one), which will not be destroyed when the resources object is destroyed, and internal streams, which will.
- `search_index_pytorch` now needs to take a `StandardGpuResources` object as well; there is no way to get this from an index instance otherwise (or at least, I would have to return a `shared_ptr`, in which case we should just update the Python SWIG stuff to use `shared_ptr` for `GpuResources`).
Reviewed By: mdouze
Differential Revision: D24260026
fbshipit-source-id: b18bb0eb34eb012584b1c923088228776c10b720
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1459
In C++11, destructors default to `noexcept(true)`. This destructor can throw (through `FAISS_THROW_IF_NOT()`), so it is marked `noexcept(false)` accordingly.
Reviewed By: mdouze
Differential Revision: D24253879
fbshipit-source-id: 7ba40387ed214dc2a03a495bc0d31ac9601c4c15
Summary: Those docs are not very useful as is, and having to re-generate the html manually leads to them being stale most of the time. Should we decide that we want to have them, we can bring them back with some automated generation.
Reviewed By: mdouze
Differential Revision: D24246072
fbshipit-source-id: 39798b2861ff25ee3fa1f95abdbc3e7ddf3469ed
Summary:
Removed an unused function that caused compile errors in some configurations.
Added a contrib function (`exhaustive_search.knn`) to compute the k nearest neighbors without constructing an index.
Renamed the equivalent GPU function to `exhaustive_search.knn_gpu` (it does not make much sense to mention numpy in the name, as all functions take numpy arguments by default).
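A quick usage sketch of the two functions; array sizes and the exact positional signatures are assumptions:
```
import numpy as np
import faiss
from faiss.contrib import exhaustive_search

xb = np.random.rand(10000, 64).astype('float32')   # database vectors
xq = np.random.rand(100, 64).astype('float32')     # query vectors

# brute-force k nearest neighbors on CPU, no index construction
D, I = exhaustive_search.knn(xq, xb, 10)

# the equivalent GPU function, renamed in this diff
res = faiss.StandardGpuResources()
D_gpu, I_gpu = exhaustive_search.knn_gpu(res, xq, xb, 10)
```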
Reviewed By: beauby
Differential Revision: D24215427
fbshipit-source-id: 6d8e1eafa7c57593304b7b76f83b3015e4d2a2bb
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1445
As requested in https://github.com/facebookresearch/faiss/issues/1304, `bfKnn` can now produce int32 indices for output.
The native kernels themselves for brute-force kNN only operate on int32 indices in any case, so this is faster.
Also added a SWIG definition for float16 numpy arrays. As there is no native Python half type, the reverse definition is left undefined, so this is really only used for taking float16 data (e.g., from numpy) as input in Python.
Added a `knn_numpy_gpu` wrapper as well that handles calling the `bfKnn` GPU implementation using CPU numpy arrays. This handles transposition and f32/f16/i32 data types as needed.
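Roughly how the wrapper can be invoked; the module placement and exact signature are assumptions based on this summary (the newer entry above records its later rename to `exhaustive_search.knn_gpu`):
```
import numpy as np
import faiss
# module placement of the wrapper is an assumption here
from faiss.contrib import exhaustive_search

res = faiss.StandardGpuResources()

# float16 inputs are accepted thanks to the new SWIG definition
xb = np.random.rand(100000, 32).astype('float16')
xq = np.random.rand(50, 32).astype('float16')

# the wrapper transposes and converts f32/f16/i32 data as needed
# before dispatching to the bfKnn GPU implementation
D, I = exhaustive_search.knn_numpy_gpu(res, xq, xb, 10)
```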
Reviewed By: mdouze
Differential Revision: D24152296
fbshipit-source-id: caa7daea23438cf26aa248e380f0dab2b6b907fd
Summary:
Legacy `print` statements are syntax errors in Python 3, but the `print()` function works as expected in both Python 2 and Python 3.
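For illustration:
```
n = 1000
# Python 2-only statement form, a SyntaxError under Python 3:
#   print "loaded", n, "vectors"
# portable function form, valid in both Python 2 and Python 3:
print("loaded", n, "vectors")
```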
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1443
Reviewed By: LowikC
Differential Revision: D24157415
Pulled By: mdouze
fbshipit-source-id: 4ec637aa26b61272e5337d47b7796a330ce25bad
Summary:
This diff removes a long-standing limitation with GpuIndexIVFPQ, in that only a limited number of dimensions per sub-quantizer were supported when not using precomputed codes. This is part of the general cleanup and extension/optimization that I am performing of the GPU PQ code.
Now, we keep the same specialized distance computations as before, but if we attempt to use a number of dimensions per sub-Q that is not specialized, we fall back to a general implementation based on batch matrix multiplication for computing PQ distances per code.
The batch MM PQ distance computation is enabled automatically if you use an odd number of dimensions per sub-quantizer (say, 7, 11, 53, ...). It can also be manually enabled via the `useMMCodeDistance` option in `GpuIndexIVFPQConfig` for testing purposes, though the result should be within some epsilon of the other implementation.
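A sketch of enabling the option manually; the constructor arguments here are illustrative only:
```
import faiss

res = faiss.StandardGpuResources()

cfg = faiss.GpuIndexIVFPQConfig()
# force the batch-MM fallback even for specialized dims (testing only)
cfg.useMMCodeDistance = True

# d=56 with 8 sub-quantizers gives 7 dims per sub-quantizer, which has no
# specialized kernel and would select the MM path automatically anyway
index = faiss.GpuIndexIVFPQ(res, 56, 1024, 8, 8, faiss.METRIC_L2, cfg)
```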
This diff also removes the iterated GEMM wrapper. I don't honestly know why I was using this instead of `cublasGemmStridedBatchedEx`; maybe I couldn't find it, or this was originally implemented against a much older version of CUDA. The iterated GEMM call was used in a few other places (e.g., precomputed code computation). Now, this (and the PQ distance computation) use batch MM, which is a single CUDA call.
This diff also adds stream synchronization to the temporary memory manager, as the fallback PQ distance computation needs to use temporary memory, and there were too many distinct buffers to pre-allocate.
It also fixes the bug in https://github.com/facebookresearch/faiss/issues/1421.
Reviewed By: mdouze
Differential Revision: D24130629
fbshipit-source-id: 1c8bc53c86d0523832ad89c8bd4fa4b5fc187cae
Summary:
This diff contains the following changes:
- adds support for an alternative IVFPQ memory layout, where the codes are interleaved by vector in groups of 32 rather than by sub-quantizer, in order to support a variety of SIMD-like optimizations for PQ lookup kernels (e.g., SCANN or other techniques that use in-register storage). This internal GPU-only format is transparent to the rest of the code, and attempts to copy an index to/from the CPU deal with the difference in layout. The feature is enabled using `GpuIndexIVFPQConfig::alternativeLayout` upon index construction; it is not intended for general use yet, though it is functional.
This is the difference in layout explained:
```
/// The default memory layout is [vector][PQ component]:
/// (v0 d0) (v0 d1) ... (v0 dD-1) (v1 d0) (v1 d1) ...
///
/// An alternative memory layout (layoutBy32) is
/// [vector / 32][PQ component][vector % 32] with padding:
/// (v0 d0) (v1 d0) ... (v31 d0) (v0 d1) (v1 d1) ... (v31 d1) ...
/// ... (v0 dD-1) (v1 dD-1) ... (v31 dD-1) (v32 d0) ...
/// so the list length is always a multiple of numSubQuantizers * 32
```
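Equivalently, the code for (vector v, sub-quantizer d) lives at the position given by this small sketch (one code unit per sub-quantizer per vector assumed):
```
def interleaved_offset(v, d, num_sub_quantizers):
    # which group of 32 vectors, and the lane within that group
    block, lane = divmod(v, 32)
    # each group stores 32 codes contiguously for each sub-quantizer d
    return (block * num_sub_quantizers + d) * 32 + lane
```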
- adds kernels to support IVFPQ query using this format. These new kernels are naive implementations that do not use register or shared memory at all to store the code distance information, however unlike the prior GPU IVFPQ code, they support arbitrary-sized PQ encodings (arbitrary dimensions per sub-quantizer and arbitrary number of sub-quantizers per vector). This is enabled for both precomputed and normal codes. It is my intention that I will eventually remove the restriction on dimensions per sub-Q / number of sub-Qs in the GPU code so that it functions more like the CPU code, though due to the necessity of implementation specialization, it will be likely that a small number of choices will be optimized, leaving the rest to use slower fallback implementations.
It is likely that this will eventually become the default / only format supported by the GPU, but the optimized kernels have not yet been developed using this layout. This diff is being checked in first in order to checkpoint the development. The existing lookup kernels and storage have not been affected.
Furthermore, it is likely that IVFFlat and IVFSQ will eventually change to this interleaved format as well, as it offers some advantages in implementation.
- Unifies the IVF handling and copy code between IVFFlat, IVFPQ and IVFScalarQuantizer on the GPU. There was a lot of copy-pasta/duplicated code between the three implementations, which had also diverged. This code is now all handled by the IVFBase and GpuIndexIVF classes.
- Adds a logging feature to StandardGpuResources which allows for printing all memory allocation/deallocation requests to the console as they happen in real time. This is useful for debugging.
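A minimal sketch of turning the logging on, assuming the Python binding exposes the feature under this name:
```
import faiss

res = faiss.StandardGpuResources()
# print each allocation/deallocation request to the console as it happens
res.setLogMemoryAllocations(True)
```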
Reviewed By: mdouze
Differential Revision: D24064745
fbshipit-source-id: 434fb4ec39aaba32271742ba7a40460847386141
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1432
The contrib function `knn_ground_truth` does not produce exactly the same results on GPU and CPU (though the relative error is still within 1e-7). This diff relaxes the tolerance for the CPU check and adds a test on GPU.
Reviewed By: wickedfoo
Differential Revision: D24012199
fbshipit-source-id: aaa20dbdf42b876b3ed7da34028646dbb20833d3
Summary:
This diff fixes https://github.com/facebookresearch/faiss/issues/1412
There were various inconsistencies in how the shard and replica wrappers updated their internal state as the sub-indices were updated. This makes the two container classes work in the same way with similar synchronization functionality.
Reviewed By: beauby
Differential Revision: D23974186
fbshipit-source-id: c688c0c9124f823e4239aa2ff617b007b4564859
Summary:
If `ils = dynamic_cast<ArrayInvertedLists *>(index_ivf->invlists)` fails, `ils` will be `nullptr`,
so check whether `ils` is `nullptr` before using it.
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1410
Reviewed By: beauby
Differential Revision: D23985814
Pulled By: mdouze
fbshipit-source-id: f62a3988e74b4de1f1c9a127475368302a35d4a5
Summary:
For some obscure reason, Lua support depends on Python 2, which is going to be removed.
https://fb.workplace.com/groups/311767668871855/permalink/4219593711422545/
Since the Lua interface does not appear to be used in any active code, it is easier to just remove it.
This diff also removes the Faiss Lua dependency in the few occurrences where it is used (omry: see recog-eval).
Reviewed By: wickedfoo
Differential Revision: D23865458
fbshipit-source-id: 4149517af18acce29179d04152c7364c2548efa0
Summary:
This PR paves the way for nightly builds.
+ Get rid of the manual CMake 3.17 install, as CMake 3.18 is now available
in conda.
+ Update docker files for conda packages.
+ Specify CUDA architectures via CMake's `CMAKE_CUDA_ARCHITECTURES`.
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1422
Reviewed By: mdouze
Differential Revision: D23870447
Pulled By: beauby
fbshipit-source-id: 40ae7517e83356443a007a43261713e7e3a140d4
Summary:
There was a dynamic allocation of a `std::string` from multiple threads. Instead of adding a mutex in performance-sensitive code, I use a statically allocated string instead.
The stress test crashed before; now it runs fine.
Reviewed By: wickedfoo
Differential Revision: D23702154
fbshipit-source-id: 5dd37f1c151d8ce7f756f54a059235d8673cdabc
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1407
When inverted list reading throws an exception, it is propagated until it reaches the OpenMP loop, which crashes the caller.
This diff catches the exception and properly propagates it to the caller in Python.
It should be possible to test this with an ondisk index instead of relying on Manifold.
Reviewed By: MDSilber
Differential Revision: D23688968
fbshipit-source-id: 0943fac41d4e9b8b86535439e3fdee18ce96d4a5
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1406
I appear to have broken this with the rework of float16 support in Faiss GPU, though I cannot figure out why the tests only started failing recently.
cuBLAS does not support an f16 x f32 = f32 matrix multiplication. With an f16 coarse quantizer and IVFPQ precomputed codes, we were attempting to perform exactly such a multiplication.
Now, in the precomputed code calculation, we intercept this and change it to a f32 x f32 = f32 computation.
The test, when run by itself, was also failing separately, though when run in series with the other test_gpu_index_ivfpq tests it was succeeding, due to the fact that the random seed is only initialized once. The epsilon needed to change slightly.
Reviewed By: mdouze
Differential Revision: D23687070
fbshipit-source-id: 14a535407ed433eeaef3bc77cb0d6f5909c55b9f
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1402
In C++03 there was no way to get the minimum value that is portable between int and float types (for floats, `min` returns the lowest strictly positive value). In C++11 this is `lowest`, so let's use it in the heap functions.
Reviewed By: beauby
Differential Revision: D23622612
fbshipit-source-id: d3e3b2b7f695d971866f7b45bfc41986cd6b9bf4
Summary:
Fix for https://github.com/facebookresearch/faiss/issues/1385: set the value during cuBLAS handle construction.
Also note that the tensor core option is deprecated for CUDA 11+.
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1388
Test Plan: Unable to test numerical results, but builds on GCP A100 instance with CUDA 11.
Reviewed By: mdouze
Differential Revision: D23427285
Pulled By: wickedfoo
fbshipit-source-id: d9487559035175ec7e06600dcd8f6a307f50abad
Summary:
The PyTorch interop code was in a test until now. However, it is better if people can rely on it being updated when the API changes. Therefore, we move it into contrib.
Also added a README.md
Reviewed By: wickedfoo
Differential Revision: D23392962
fbshipit-source-id: 9b7c0e388a7ea3c0b73dc0018322138f49191673
Summary:
This diff adds an object for a few useful datasets in `faiss.contrib`.
This includes synthetic datasets and the classic ones.
It is intended to work on:
- the FAIR cluster
- gluster
- manifold
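A hedged usage sketch for the synthetic case; the class and method names are assumptions:
```
from faiss.contrib import datasets

# synthetic dataset: 64 dims, 10000 train, 4000 database, 100 query vectors
ds = datasets.SyntheticDataset(64, 10000, 4000, 100)
xt = ds.get_train()
xb = ds.get_database()
xq = ds.get_queries()
gt = ds.get_groundtruth(10)  # exact 10-NN ground truth ids
```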
Reviewed By: wickedfoo
Differential Revision: D23378763
fbshipit-source-id: 2437a7be9e712fd5ad1bccbe523cc1c936f7ab35
Summary:
`long` is 32 bits on Windows, and so is the default int type for numpy (e.g., the one used for `np.arange`).
This diff explicitly specifies 64-bit ints for all occurrences where it matters.
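For example (sizes illustrative):
```
import numpy as np

nb = 1000
# on Windows the default dtype here would be int32 (C long); be explicit
ids = np.arange(nb, dtype='int64')
# Faiss expects 64-bit ids, e.g. for index.add_with_ids(xb, ids)
```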
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1381
Reviewed By: wickedfoo
Differential Revision: D23371232
Pulled By: mdouze
fbshipit-source-id: 220262cd70ee70379f83de93561a4eae71c94b04
Summary: Properties on swig sources must be set prior to the `swig_add_library()` call.
Reviewed By: mdouze
Differential Revision: D23338668
fbshipit-source-id: 71fdd1221ef0fabbd5597eff5e71d36e26435304