Summary:
## Description
This PR adds support for LSQ on GPU. Only the encoding part runs on the GPU; the rest still runs on the CPU.
Multi-GPU is also supported.
## Usage
```python
lsq = faiss.LocalSearchQuantizer(d, M, nbits)
ngpus = faiss.get_num_gpus()
lsq.icm_encoder_factory = faiss.GpuIcmEncoderFactory(ngpus)  # use all GPUs
lsq.train(xt)
codes = lsq.compute_codes(xb)
decoded = lsq.decode(codes)
```
## Performance on SIFT1M
On 1 GPU:
```
===== lsq-gpu:
mean square error = 17337.878528
training time: 40.9857234954834 s
encoding time: 27.12640070915222 s
```
On 2 GPUs:
```
===== lsq-gpu:
mean square error = 17364.658176
training time: 25.832106113433838 s
encoding time: 14.879548072814941 s
```
On CPU:
```
===== lsq:
mean square error = 17305.880576
training time: 152.57522344589233 s
encoding time: 110.01779270172119 s
```
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1978
Test Plan: buck test mode/dev-nosan //faiss/gpu/test/:test_gpu_index_py -- TestLSQIcmEncoder
Reviewed By: wickedfoo
Differential Revision: D29609763
Pulled By: mdouze
fbshipit-source-id: b6ffa2a3c02bf696a4e52348132affa0dd838870
Summary:
I want to invoke norm computations via CGO, but some functions that are implemented in C++ are not exported in the C API, so I am submitting this PR to solve the problem.
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/2036
Reviewed By: beauby
Differential Revision: D30762172
Pulled By: mdouze
fbshipit-source-id: 097b32f29658c1864bd794734daaef0dd75d17ef
Summary:
This is required for the renaming of the default branch from `master` to `main`, in accordance with the new Facebook OSS guidelines.
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/2029
Reviewed By: mdouze
Differential Revision: D30672862
Pulled By: beauby
fbshipit-source-id: 0b6458a4ff02a12aae14cf94057e85fdcbcbff96
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/2018
The centroid norms table was not reconstructed correctly after being stored in the RCQ.
Reviewed By: Sugoshnr
Differential Revision: D30484389
fbshipit-source-id: 9f618a3939c99dc987590c07eda8e76e19248b08
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1983
Add clone_index support to ResidualCoarseQuantizer to enable GPU training. Similar to D28614996.
Reviewed By: mdouze
Differential Revision: D29605169
fbshipit-source-id: bf9cc32b60061a42310506058ebb45d5f2cea8d8
Summary:
## Description
The codebook update in LSQ may be numerically unstable if the data is not zero-centered. This diff fixes it by using `double` instead of `float` during the codebook update. This does not noticeably affect performance, since the update step is quite fast.
Users can switch back to `float` mode by setting `update_codebooks_with_double = False`.
## Changes
1. Support `double` during the codebook update.
2. Add a unit test.
3. Add `__init__.py` under `contrib/` to avoid warnings.
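As an illustration of the precision issue (not faiss code), float32 accumulation can silently drop small updates once the running sum is far from zero, which is exactly the situation with non-zero-centered data:

```python
# Sketch: why float32 accumulation can stall when data is far from
# zero-centered. The LSQ codebook update accumulates many such terms,
# hence the optional float64 path. Pure stdlib; f32() emulates float32.
import struct

def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack('f', struct.pack('f', x))[0]

s32 = f32(1e8)                 # large offset, as with non-centered data
for _ in range(1000):
    s32 = f32(s32 + 1.0)       # each +1.0 is below float32 resolution here
assert s32 == f32(1e8)         # all 1000 updates were lost entirely

s64 = 1e8                      # Python floats are float64
for _ in range(1000):
    s64 += 1.0
assert s64 == 1e8 + 1000       # float64 keeps every update
```

Accumulating in `double` and only converting back at the end avoids this stall, at negligible cost for the fast update step.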
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1975
Reviewed By: wickedfoo
Differential Revision: D29565632
Pulled By: mdouze
fbshipit-source-id: 932d7932ae9725c299cd83f87495542703ad6654
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1972
This fixes a few issues that I ran into + adds tests:
- range_search_max_results with IP search
- a few missing downcasts for VectorTransforms
- ResultHeap supports max IP search
Reviewed By: wickedfoo
Differential Revision: D29525093
fbshipit-source-id: d4ff0aff1d83af9717ff1aaa2fe3cda7b53019a3
Summary:
CI jobs using conda are currently failing due to conflicting packages.
This PR fixes this:
- use a newer `numpy` to build `faiss-cpu`
- install `pytorch` when testing `faiss-cpu`
- set the `pytorch` channel at `conda build` time so that the correct `pytorch` package is found
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1884
Reviewed By: mdouze
Differential Revision: D28777447
Pulled By: beauby
fbshipit-source-id: 82a1ce076abe6bbbba9415e8935ed57b6104b6c3
Summary:
This should fix the GPU nightlies.
The rationale for the cp is that there is a file shared between the CPU and GPU tests.
Ideally, this file should probably be moved to contrib at some point.
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1901
Reviewed By: beauby
Differential Revision: D28680898
Pulled By: mdouze
fbshipit-source-id: b9d0e1969103764ecb6f1e047c9ed4bd4a76aaba
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1908
To search for the best combination of codebooks, the method implemented so far is a beam search.
It is possible to make this faster for a query vector q by precomputing look-up tables in the form of
LUT_m = <q, cent_m>
where cent_m is the set of centroids for quantizer m=0..M-1.
The LUT can then be used as
inner_prod = sum_m LUT_m[c_m]
and
L2_distance = norm_q + norm_db - 2 * inner_prod
This diff implements this computation by:
- adding the LUT precomputation
- storing an exhaustive table of all centroid norms (when using L2)
This is only practical for small additive quantizers, e.g. when a residual vector quantizer is used as a coarse quantizer (ResidualCoarseQuantizer).
This diff is based on the AdditiveQuantizer diff because it applies equally to other quantizers (e.g. the LSQ).
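The formulas above can be checked with a small pure-Python sketch (illustrative only, not the faiss API; all names here are made up):

```python
# Sketch of LUT-based distances for an additive quantizer:
# LUT_m = <q, cent_m>, inner_prod = sum_m LUT_m[c_m],
# L2 = ||q||^2 + ||x||^2 - 2 * inner_prod.
import random

random.seed(0)
d, M, K = 8, 4, 16                 # dim, num sub-quantizers, centroids each
cent = [[[random.gauss(0, 1) for _ in range(d)] for _ in range(K)]
        for _ in range(M)]         # cent[m][k] is one centroid of quantizer m
q = [random.gauss(0, 1) for _ in range(d)]  # query vector

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Precompute LUT_m[k] = <q, cent_m[k]> once per query
LUT = [[dot(q, cent[m][k]) for k in range(K)] for m in range(M)]

code = [3, 7, 0, 12]               # one database code c_0..c_{M-1}
# Decoded vector: sum of the selected centroids
x = [sum(cent[m][code[m]][i] for m in range(M)) for i in range(d)]

# M table look-ups replace a full d-dimensional dot product
inner_prod = sum(LUT[m][code[m]] for m in range(M))
assert abs(inner_prod - dot(q, x)) < 1e-9

# L2 distance from the stored database norm and query norm
l2 = dot(q, q) + dot(x, x) - 2 * inner_prod
assert abs(l2 - sum((qi - xi) ** 2 for qi, xi in zip(q, x))) < 1e-9
```

Storing `dot(x, x)` for every database code is what makes the exhaustive centroid-norms table necessary in the L2 case.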
Reviewed By: sc268
Differential Revision: D28467746
fbshipit-source-id: 82611fe1e4908c290204d4de866338c622ae4148
Summary:
Moving an index from CPU to GPU fails with the error message `RuntimeError: Error in virtual faiss::Index *faiss::Cloner::clone_Index(const faiss::Index *) at faiss/clone_index.cpp:144: clone not supported for this type of Index`.
This diff supports cloning IndexResidual and unblocks GPU training.
Reviewed By: sc268, mdouze
Differential Revision: D28614996
fbshipit-source-id: 9b1e5e7c5dd5da6d55f02594b062691565a86f49
Summary: This is necessary to share a `SQDistanceComputer` instance among multiple threads, when the codes are not stored in a faiss index. The function is `const` and thread-safe.
Reviewed By: philippv, mdouze
Differential Revision: D28623897
fbshipit-source-id: e527d98231bf690dc01191dcc597ee800b5e57a9
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1906
This PR implements LSQ/LSQ++, a vector quantization technique described in the following two papers:
1. Revisiting additive quantization
2. LSQ++: Lower running time and higher recall in multi-codebook quantization
Here is a benchmark running on SIFT1M for 64 bits encoding:
```
===== lsq:
mean square error = 17335.390208
training time: 312.729779958725 s
encoding time: 244.6277096271515 s
===== pq:
mean square error = 23743.004672
training time: 1.1610801219940186 s
encoding time: 2.636141061782837 s
===== rq:
mean square error = 20999.737344
training time: 31.813055515289307 s
encoding time: 307.51959800720215 s
```
Changes:
1. Add LocalSearchQuantizer object
2. Fix an out of memory bug in ResidualQuantizer
3. Add a benchmark for evaluating quantizers
4. Add tests for LocalSearchQuantizer
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1862
Test Plan:
```
buck test //faiss/tests/:test_lsq
buck run mode/opt //faiss/benchs/:bench_quantizer -- lsq pq rq
```
Reviewed By: beauby
Differential Revision: D28376369
Pulled By: mdouze
fbshipit-source-id: 2a394d38bf75b9de0a1c2cd6faddf7dd362a6fa8
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1905
This PR adds some tests to make sure that building with AVX2 works as expected on Linux.
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1792
Test Plan: buck test //faiss/tests/:test_fast_scan -- test_PQ4_speed
Reviewed By: beauby
Differential Revision: D27435796
Pulled By: mdouze
fbshipit-source-id: 901a1d0abd9cb45ccef541bd7a570eb2bd8aac5b
Summary:
related: https://github.com/facebookresearch/faiss/issues/1815, https://github.com/facebookresearch/faiss/issues/1880
`vshl` / `vshr` of ARM NEON require an immediate (compile-time constant) value as the shift parameter.
However, the GCC implementations of those intrinsics can accept a runtime value.
The current faiss implementation depends on this, so some correctly behaving compilers such as Clang can't build faiss for aarch64.
This PR fixes the issue, so faiss with this PR applied can be built with Clang for aarch64 machines such as the M1 Mac.
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1882
Reviewed By: beauby
Differential Revision: D28465563
Pulled By: mdouze
fbshipit-source-id: e431dfb3b27c9728072f50b4bf9445a3f4a5ac43
Summary:
Also remove support for deprecated compute capabilities 3.5 and 5.2 in
CUDA 11.
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1899
Reviewed By: mdouze
Differential Revision: D28539826
Pulled By: beauby
fbshipit-source-id: 6e8265f2bfd991ff3d14a6a5f76f9087271f3f75
Summary:
The current `faiss` code produces warnings when compiled with options like `-Wall -Wextra`.
IMHO, warnings in `.cpp` and `.cu` files don't need to be fixed if the project's policy allows warnings.
However, I think it is better to fix the code in `.h` and `.cuh` files, which are also included by `faiss` users.
Currently, `#include`-ing some faiss headers, e.g. `#include <faiss/IndexHNSW.h>`, causes an error when compiling with `-pedantic-errors`.
This PR fixes this problem.
For the reasons above, this PR fixes `-Wpedantic` issues only in `.h` files.
This PR doesn't change `faiss` behavior.
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1888
Reviewed By: wickedfoo
Differential Revision: D28506963
Pulled By: beauby
fbshipit-source-id: cbdf0506a95890c9c1b829cb89ee60e69cf94a79
Summary:
This diff fixes a serious bug in the range search implementation.
During range search in a flat index (exhaustive_L2sqr_seq and exhaustive_inner_product_seq), when running in multiple threads, the per-thread results are collected into RangeSearchPartialResult structures.
When the computation is finished, they are aggregated into a RangeSearchResult. In the previous version of the code, this aggregation loop was nested inside a second loop used to check for KeyboardInterrupts. Thus, at each iteration, the results were overwritten.
The fix removes the outer loop. It is most likely useless anyway, because the sequential code is called only for a small number of queries; for larger numbers the BLAS version is used.
Reviewed By: wickedfoo
Differential Revision: D28486415
fbshipit-source-id: 89a52b17f6ca1ef68fc5e758f0e5a44d0df9fe38
Summary:
In the current `faiss` implementation for x86, `fvec_L2sqr` , `fvec_inner_product` , and `fvec_norm_L2sqr` are [optimized for any dimensionality](e86bf8cae1/faiss/utils/distances_simd.cpp (L404-L432)).
On the other hand, the functions for aarch64 are optimized [**only** if `d` is a multiple of 4](e86bf8cae1/faiss/utils/distances_simd.cpp (L583-L584)); thus, they are not very fast for vectors with `d % 4 != 0`.
This PR accelerates the above three functions for any input size on aarch64.
- Evaluated on an AWS EC2 ARM instance (c6g.4xlarge)
- sift1m127 is the sift1m dataset with the trailing elements of each vector dropped
- The vector length of sift1m127 is therefore 127, which is not a multiple of 4
- "optimized" runs 2.45-2.77x faster than "original" on sift1m127
- The two methods, "original" and "optimized", are expected to achieve the same level of performance on sift1m
- And indeed, there is almost no significant difference
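The usual way to handle arbitrary `d` is a SIMD-width main loop plus a scalar tail for the leftover elements. A pure-Python sketch of the idea (illustrative only; the real code uses NEON intrinsics):

```python
# Sketch: process the vector in chunks of 4 (the SIMD width) and finish
# the d % 4 leftover elements with a scalar tail, so every `d` is fast.
def fvec_l2sqr_sketch(x, y):
    d = len(x)
    acc = [0.0, 0.0, 0.0, 0.0]          # stands in for a 4-lane register
    i = 0
    while i + 4 <= d:                   # vectorized main loop
        for lane in range(4):
            diff = x[i + lane] - y[i + lane]
            acc[lane] += diff * diff
        i += 4
    total = sum(acc)                    # horizontal reduction of the lanes
    while i < d:                        # scalar tail handles d % 4 != 0
        diff = x[i] - y[i]
        total += diff * diff
        i += 1
    return total

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   # d = 7, not a multiple of 4
y = [0.0] * 7
assert fvec_l2sqr_sketch(x, y) == 140.0   # 1 + 4 + 9 + 16 + 25 + 36 + 49
```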
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1878
Reviewed By: beauby
Differential Revision: D28376329
Pulled By: mdouze
fbshipit-source-id: c68f13b4c426e56681d81efd8a27bd7bead819de
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1865
This diff chunks the vectors to be encoded, making encoding more memory-efficient.
Reviewed By: sc268
Differential Revision: D28234424
fbshipit-source-id: c1afd2aaff953d4ecd339800d5951ae1cae4789a
Summary:
An SSH key needs to be added to CircleCI to be able to debug.
For my own reference, how to connect to the job:
```
[matthijs@matthijs-mbp /Users/matthijs/Desktop/faiss_github/circleci_keys] ssh -p 54782 38.39.188.110 -i id_ed25519
```
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1849
Reviewed By: wickedfoo
Differential Revision: D28234897
Pulled By: mdouze
fbshipit-source-id: 6827fa45f24b3e4bf586315bd38f18608d07ecf9
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1840
This diff is related to
https://github.com/facebookresearch/faiss/issues/1762
The ResultHandler introduced for FlatL2 and FlatIP was not multithreaded. This diff attempts to fix that. It remains to be verified whether it is indeed faster.
Reviewed By: wickedfoo
Differential Revision: D27939173
fbshipit-source-id: c85f01a97d4249fe0c6bfb04396b68a7a9fe643d
Summary:
This diff adds the following to bring the residual quantizer support on-par with PQ:
- IndexResidual can be built with the index factory, serialized, and used as a Faiss codec.
- ResidualCoarseQuantizer can be used as a coarse quantizer for inverted files.
The factory string looks like "RQ1x16_6x8", which means a first 16-bit quantizer followed by six 8-bit ones. For IVF it's "IVF4096(RQ2x6),Flat".
Reviewed By: sc268
Differential Revision: D27865612
fbshipit-source-id: f9f11d29e9f89d3b6d4cd22e9a4f9222422d5f26