547 Commits

Author SHA1 Message Date
Lucas Hosseini
08a0ce72a2 Fix nightly build for CUDA 11. (#1675)
Summary: Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1675

Reviewed By: mdouze

Differential Revision: D26338704

Pulled By: beauby

fbshipit-source-id: f440bbd05d6dbc09280e4f3631e4a9af99bde5f5
2021-02-09 07:44:27 -08:00
Lucas Hosseini
f5a8c29c57 Parameterize CUDA_ARCHS in packaging jobs. (#1671)
Summary:
This will allow us to support compute capabilities 8.0 and 8.6 (for
Ampere devices) with CUDA 11.

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1671

Reviewed By: mdouze

Differential Revision: D26338700

Pulled By: beauby

fbshipit-source-id: f023e7a37504d79ab78a45319e5a9cb825e7604a
2021-02-09 07:37:51 -08:00
Matthijs Douze
10c8583b2d Fix order of results for IndexBinaryHash and IndexBinaryMultiHash
Summary: The IndexBinaryHash and IndexBinaryMultiHash knn search functions returned results in a random order. This diff fixes that to the standard decreasing Hamming distance order and adds a test for it. I noticed this on a notebook from sc268.

Reviewed By: sc268

Differential Revision: D26324795

fbshipit-source-id: 1444e26950e24bfac297f34f3d481d902d8ee769
2021-02-08 18:22:55 -08:00
Authman
976a942838 Cuda 11.0 Dockerimage for CircleCI conf (#1669)
Summary:
This small change adds a Docker image for CUDA 11.0.

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1669

Reviewed By: mdouze

Differential Revision: D26278940

Pulled By: beauby

fbshipit-source-id: 59af80c0eac1fe8b512a8543ec15b5c7174219fb
2021-02-08 00:53:03 -08:00
Jeff Johnson
f15ce621f3 Expect warpSize == 32 and align allocations
Summary:
When new GPU compute capabilities were released, DeviceDefs.cuh had to be manually updated to expect them, as we statically compile the warp size (32 in all of Nvidia's current GPUs) into kernel code.

In order to avoid having to change this header for each new GPU generation (e.g., the new RTX devices which are CC 8.6), instead we just assume the warp size is 32, but when we initialize a GPU device and its resources in StandardGpuResources, we check to make sure that the GPU has a warp size of 32 as expected. Much code would have to change for a non-32 warp size (e.g., 64, as seen in AMD GPUs), so this is a hard assert. It is likely that Nvidia will never change this anyways for this reason.

Also, as part of the PQ register change, I noticed that temporary memory allocations were only being aligned to 16 bytes. This could cause inefficiencies in terms of excess gmem transactions. Instead, we bump this up to 256 bytes as the guaranteed alignment for all temporary memory allocations, which is the same guarantee that cudaMalloc provides.
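As a rough illustration of the 256-byte alignment guarantee described above, rounding a requested size up to the next boundary looks like this (a minimal sketch; the helper is hypothetical, not the Faiss allocator API):

```
def round_up(size, alignment=256):
    # Round a requested allocation size up to the next multiple of `alignment`.
    return (size + alignment - 1) // alignment * alignment

assert round_up(1000) == 1024  # 1000 bytes lands on the next 256-byte boundary
assert round_up(256) == 256    # already-aligned sizes are unchanged
```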

Reviewed By: mdouze

Differential Revision: D26259976

fbshipit-source-id: 10b5fc708fffc9433683e85b9fd60da18fa9ed28
2021-02-04 13:22:36 -08:00
H. Vetinari
73141fb872 Add missing headers in faiss/[gpu/]CMakeLists.txt (#1666)
Summary:
While preparing https://github.com/conda-forge/faiss-split-feedstock/pull/26, I grepped for the expected headers based on the files in the repo, à la:
```
>ls faiss/invlists/ | grep -E "h$"
BlockInvertedLists.h
DirectMap.h
InvertedLists.h
InvertedListsIOHook.h
OnDiskInvertedLists.h
```

Doing so uncovered that there were some headers missing (apparently) in `CMakeLists.txt`, namely:
```
faiss/impl/ResultHandler.h
faiss/gpu/impl/IVFInterleaved.cuh
faiss/gpu/impl/InterleavedCodes.h
faiss/gpu/utils/WarpPackedBits.cuh
```

It's possible that they were left out intentionally, but I didn't see anything that would make me think so, e.g. in [`ResultHandler.h`](https://github.com/facebookresearch/faiss/blob/master/faiss/impl/ResultHandler.h).

While I was at it, I decided to order the filenames consistently (alphabetically, except for the increasing bit-sizes for blockselect/warpselect, as is already the case for `impl/scan/IVFInterleaved<x>.cu`), but of course, those commits could easily be dropped.

By reviewing the commits separately, it should be clear (for the first two) from the equal number of deletions/insertions (and the simple diff) that this is just a reshuffle. The only additions are in the last commit.

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1666

Reviewed By: wickedfoo

Differential Revision: D26248038

Pulled By: mdouze

fbshipit-source-id: 4add4959446deb16126c59b2d1e3f0305e6236c1
2021-02-04 09:22:58 -08:00
Matthijs Douze
5602724979 make calling conventions uniform between faiss.knn and faiss.knn_gpu
Summary: The order of xb and xq was different between `faiss.knn` and `faiss.knn_gpu`. Also, the metric argument was called distance_type. This diff fixes both. Hopefully not too much external code depends on it.
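A minimal sketch of the unified convention, assuming the post-change signatures (keyword names are illustrative):

```
import numpy as np
import faiss

d, k = 64, 5
xb = np.random.rand(10000, d).astype('float32')  # database
xq = np.random.rand(100, d).astype('float32')    # queries

# CPU brute-force search: queries first, then database, then k.
D, I = faiss.knn(xq, xb, k, metric=faiss.METRIC_L2)

# GPU version takes xq/xb in the same order, plus a resources object.
res = faiss.StandardGpuResources()
Dg, Ig = faiss.knn_gpu(res, xq, xb, k, metric=faiss.METRIC_L2)
```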

Reviewed By: wickedfoo

Differential Revision: D26222853

fbshipit-source-id: b43e143d64d9ecbbdf541734895c13847cf2696c
2021-02-03 12:21:40 -08:00
shengjun.li
cf33102a7e Improve performance of Hamming computer (#1661)
Summary:
Signed-off-by: shengjun.li <shengjun.li@zilliz.com>

Improve performance of Hamming computer

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1661

Reviewed By: wickedfoo

Differential Revision: D26222892

Pulled By: mdouze

fbshipit-source-id: 5c1228b9e6c0f196ebcdfb0227ecdf7a02610871
2021-02-03 10:32:24 -08:00
Matthijs Douze
8894ba7488 convert CPU fp16 scalar quantizer to GpuFlat index
Summary:
The fp16 scalar quantizer is supported via IndexFlat with the float16 option.
This diff also splits the python GPU tests in 2 files.
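A sketch of the conversion path this enables, using the standard CPU-to-GPU cloning call (illustrative, not taken from the diff):

```
import faiss

d = 64
cpu_index = faiss.IndexScalarQuantizer(d, faiss.ScalarQuantizer.QT_fp16)
# ... train and add vectors on the CPU side ...

res = faiss.StandardGpuResources()
# The fp16 scalar quantizer is cloned to a GPU flat index with float16 storage.
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)
```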

Reviewed By: wickedfoo

Differential Revision: D26221563

fbshipit-source-id: c08fce27e6acedc486478b37ef77ccebcefb3dc0
2021-02-03 09:13:07 -08:00
H. Vetinari
16b4e88aca make AVX2-detection platform-independent (#1600)
Summary:
In the context of https://github.com/conda-forge/faiss-split-feedstock/issues/23, I discussed with some of the conda-folks how we should support AVX2 (and potentially other builds) for faiss. In the meantime, we'd like to follow the model that faiss itself is using (i.e. build with AVX2 and without and then load the corresponding library at runtime depending on CPU capabilities).

Since Windows support for this is missing (and the other stuff is also non-portable in `loader.py`), I chased down `numpy.distutils.cpuinfo`, which is pretty outdated, and opened: https://github.com/numpy/numpy/issues/18058

While the [private API](https://github.com/numpy/numpy/issues/18058#issuecomment-749852711) is obviously something that _could_ change at any time, I still think it's better than platform-dependent shenanigans.

Opening this here to ideally upstream this right away, rather than carrying patches in the conda-forge feedstock.
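A rough sketch of the numpy-based detection idea (the flag keys vary by platform, which is exactly what the numpy issue above is about):

```
from numpy.distutils import cpuinfo

info = cpuinfo.cpu.info[0]
# On Linux the CPU features are listed under "flags"; other platforms use different keys.
flags = info.get("flags", "")
has_avx2 = "avx2" in flags
print("would load", "libfaiss_avx2" if has_avx2 else "libfaiss")
```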

TODO:
* [ ] adapt conda recipe for windows in this repo to also build avx2 version

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1600

Reviewed By: beauby

Differential Revision: D25994705

Pulled By: mdouze

fbshipit-source-id: 9986bcfd4be0f232a57c0a844c72ec0e308fff19
2021-02-03 08:02:14 -08:00
Matthijs Douze
04f777ead5 Re-enable fast scan on Windows tests (#1663)
Summary:
Fast-scan tests were disabled on Windows because of a heap corruption. This diff re-enables them, now that the free_aligned bug has been fixed.

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1663

Reviewed By: beauby

Differential Revision: D26201040

Pulled By: mdouze

fbshipit-source-id: 8d6223b4e42ccb1ce2da6e2c51d9e0833199bde7
2021-02-03 07:48:52 -08:00
Matthijs Douze
27077c4627 Small fixes for compilation on ARM (#1655)
Summary:
This PR fixes a few small issues with compilation on ARM.
It has been tested on an AWS c6g.8xlarge machine with Ubuntu 18.04.5 LTS.
Compilation instructions are here: https://github.com/facebookresearch/faiss/wiki/Installing-Faiss#compiling-faiss-on-arm

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1655

Reviewed By: wickedfoo

Differential Revision: D26145921

Pulled By: mdouze

fbshipit-source-id: 007e57a610f489885e78ba22bc82605d67661c44
2021-01-29 10:06:45 -08:00
Lucas Hosseini
7c2d2388a4 Bump version to 1.7.0. (#1652)
Summary: Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1652

Reviewed By: mdouze

Differential Revision: D26077948

Pulled By: beauby

fbshipit-source-id: 599ee61edd2425250948577cb55d145d9179ab25
v1.7.0
2021-01-27 03:37:26 -08:00
Andrew Aksyonoff
bc35be3e77 fixed free vs aligned_free mismatch (crashes on Windows!) (#1647)
Summary: Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1647

Reviewed By: wickedfoo

Differential Revision: D26074437

Pulled By: mdouze

fbshipit-source-id: f72b56c1a3681694e14ae07400e56a4a657c327c
2021-01-26 09:03:01 -08:00
Matthijs Douze
f351a83ef6 Update demo_imi_pq.cpp (#1636)
Summary:
Remove use of `long`.
This is to close PR https://github.com/facebookresearch/faiss/issues/1050

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1636

Reviewed By: beauby

Differential Revision: D25994690

Pulled By: mdouze

fbshipit-source-id: 35dfa4295602f64053883594b5a7d0b9b7293545
2021-01-22 00:04:19 -08:00
shengjun.li
908812266c Add heap_replace_top to simplify heap_pop + heap_push (#1597)
Summary:
Signed-off-by: shengjun.li <shengjun.li@zilliz.com>

Add heap_replace_top to simplify heap_pop + heap_push
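The idea, illustrated with Python's heapq rather than the Faiss C++ heap helpers (a sketch of the concept, not the new API):

```
import heapq

h = [3, 5, 9, 7]
heapq.heapify(h)

# Old pattern: pop the current top, then push the new element (two sift passes).
heapq.heappop(h)
heapq.heappush(h, 4)

# heap_replace_top is the analogue of heapreplace: swap out the top element
# and restore the heap in a single sift pass.
heapq.heapreplace(h, 6)
print(h[0])
```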

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1597

Test Plan:
OMP_NUM_THREADS=1 buck run mode/opt //faiss/benchs/:bench_heap_replace
OMP_NUM_THREADS=8 buck run mode/opt //faiss/benchs/:bench_heap_replace

Reviewed By: beauby

Differential Revision: D25943140

Pulled By: mdouze

fbshipit-source-id: 66fe67779dd281a7753f597542c2e797ba0d7df5
2021-01-20 11:28:08 -08:00
Matthijs Douze
a2791322d9 Update ProductQuantizer.cpp (#1634)
Summary:
Make the error message clearer.
See https://github.com/facebookresearch/faiss/issues/1632

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1634

Reviewed By: beauby

Differential Revision: D25942961

Pulled By: mdouze

fbshipit-source-id: e96ca808a9a0dcd1de5417a5e048d39431b30a5e
2021-01-20 00:12:54 -08:00
Eduardo Pinho
9503cf0d6a Add GPU device utility functions (#1613)
Summary:
This adds some more functions to the C API, under a new DeviceUtils_c.h module. Resolves https://github.com/facebookresearch/faiss/issues/1414.

- `faiss_get_num_gpus`
- `faiss_gpu_profiler_start`
- `faiss_gpu_profiler_stop`
- `faiss_gpu_sync_all_devices`

The only minor issue right now is that building this requires basing it against an older version of Faiss until the build system is updated to use CMake (https://github.com/facebookresearch/faiss/issues/1390). I have provided a separate branch with the same contribution, based against a version that works and builds OK: [`imp/c_api/add_gpu_device_utils`](https://github.com/Enet4/faiss/tree/imp/c_api/add_gpu_device_utils)

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1613

Reviewed By: wickedfoo

Differential Revision: D25942933

Pulled By: mdouze

fbshipit-source-id: 5b73a86b0c1702dfb7b9e56bd741f72495aac2fd
2021-01-19 17:23:12 -08:00
Jeff Johnson
d8b64b5122 Improve transpose performance
Summary:
The general array transposition kernel for the GPU in Faiss had two issues.

One, there was a typo (`+` instead of `*`) which did not cause a correctness bug but was a severe performance issue; it had been there since the general transposition kernel was written in 2016/2017. This was causing large slowdowns with precomputed code usage that I noticed while profiling over IVFPQ issues.

Two, the general transposition code was written for the most generic case. The transposition that we care about/use the most in Faiss is a transposition of outermost dimensions, say transposing an array [s1 s2 s3] -> [s2 s1 s3], where there are one or more innermost dimensions which are still contiguous in the new layout. A separate kernel has been written to cover this transposition case.
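In NumPy terms, the specialized case is the following outer-dimension permutation (a sketch of the layout being optimized, not the GPU kernel):

```
import numpy as np

s1, s2, s3 = 4, 8, 16
a = np.arange(s1 * s2 * s3, dtype=np.float32).reshape(s1, s2, s3)

# Swap only the two outermost dimensions: [s1 s2 s3] -> [s2 s1 s3].
# The innermost dimension (s3) stays contiguous in the new layout,
# which is what the dedicated kernel exploits.
b = np.ascontiguousarray(a.transpose(1, 0, 2))
print(b.shape)  # (8, 4, 16)
```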

Also updates the code to use `uint32_t` and `uint64_t` instead of `unsigned int` and `unsigned long`.

D25703821 (removing serialize tags for GPU tests) is reverted in this as well, as that change prevents all GPU tests from being run locally on devservers; RE might have implicit serialization, but local execution doesn't.

Reviewed By: beauby

Differential Revision: D25929892

fbshipit-source-id: 66ddfc56189305f698a85c44abdeb64eb95ffe6b
2021-01-19 13:22:27 -08:00
Lucas Hosseini
010b05712c Update README.md (#1635)
Summary:
Update link to API docs website.

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1635

Reviewed By: mdouze

Differential Revision: D25942944

Pulled By: beauby

fbshipit-source-id: 573024b01c61f2464ecbf33e233cd93b2903a493
2021-01-18 01:13:28 -08:00
LiuJuanXi
39c4ff218a Update IVFBase.cuh (#1625)
Summary:
Visual Studio compile problem (undefined reference).

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1625

Reviewed By: beauby

Differential Revision: D25928255

Pulled By: wickedfoo

fbshipit-source-id: c123eb6f263e8748fca6390beac0f7c0f4868163
2021-01-18 00:30:24 -08:00
LiuJuanXi
149e66959f Update GpuIndexFlat.h (#1626)
Summary:
Visual Studio compile problem (undefined reference).

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1626

Reviewed By: beauby

Differential Revision: D25928072

Pulled By: wickedfoo

fbshipit-source-id: 464ad453f8f85a802e81d28d8f1a5fbae0b388f3
2021-01-18 00:29:58 -08:00
Matthijs Douze
950c831d45 Fix number of threads in test_ivfpq_codec when running in sandcastle
Summary: When running in a heavily parallelized environment, the test becomes very slow and causes timeouts. Here we reduce the number of threads.
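In the Python tests, capping the OpenMP thread count looks roughly like this (the value shown is illustrative, not the one used in the test):

```
import faiss

faiss.omp_set_num_threads(4)  # limit OpenMP parallelism for the duration of the test
```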

Reviewed By: wickedfoo, Ben0mega

Differential Revision: D25921771

fbshipit-source-id: 1e0aacbb3e4f6e8f33ec893984b343eb5a610424
2021-01-17 13:42:06 -08:00
Matthijs Douze
0b55f10f7e Fix ASAN bugs in demo_sift1M
Summary: Fixes 2 bugs spotted by ASAN in the demo.

Reviewed By: wenjieX

Differential Revision: D25897053

fbshipit-source-id: fd2bed13faded42426cefc5ebe9d027adec78015
2021-01-13 08:28:25 -08:00
Matthijs Douze
c7fd8d7ac3 improved range search evaluation functions
Summary: For range search evaluation, this diff adds optimized functions for ground-truth generation (on GPU).

Reviewed By: beauby

Differential Revision: D25822716

fbshipit-source-id: c5278dfad0510d24c2a5c87c1f8c81161fa8c5d3
2021-01-11 08:12:10 -08:00
Matthijs Douze
9b2384f305 Fix serialization of large hash indexes
Summary:
64-bit cleanness issue for BitstringWriter.
Shows in HashIndex I/O, see https://github.com/facebookresearch/faiss/issues/1532

Reviewed By: beauby

Differential Revision: D25804891

fbshipit-source-id: d4cd3714d116a1b2fe1c9446eb1e9d3a8acf854e
2021-01-11 05:48:12 -08:00
H. Vetinari
3d3d539bb5 Some fixes for building on windows + cuda (#1614)
Summary:
Upstreaming some work towards https://github.com/facebookresearch/faiss/issues/1586 from https://github.com/conda-forge/faiss-split-feedstock/pull/19; complements https://github.com/facebookresearch/faiss/issues/1610

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1614

Reviewed By: mdouze

Differential Revision: D25864994

Pulled By: wickedfoo

fbshipit-source-id: 4abaaebf1c4102c24c3a9dca544744318e780581
2021-01-11 00:46:02 -08:00
Matthijs Douze
1c996d0db7 Fix refine implementation (#1607)
Summary:
The IndexRefineFlat with pre-populated indexes could not be used because of the order of construction of the parent class. This diff fixes it. This addresses https://github.com/facebookresearch/faiss/issues/1604.
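For reference, a minimal sketch of standard IndexRefineFlat usage (parameters are illustrative; the pre-populated case repaired here follows the same pattern):

```
import numpy as np
import faiss

d = 32
xb = np.random.rand(1000, d).astype('float32')

base = faiss.index_factory(d, "IVF16,PQ8")
base.train(xb)

refined = faiss.IndexRefineFlat(base)
refined.add(xb)        # adds to the base index and keeps full vectors for re-ranking
refined.k_factor = 4   # rank 4*k base candidates, keep the best k after exact refinement
D, I = refined.search(xb[:5], 10)
```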

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1607

Reviewed By: wickedfoo

Differential Revision: D25801869

Pulled By: mdouze

fbshipit-source-id: 6098065e497dff39f4dd7474fa7136ba3610ef7e
2021-01-06 20:49:02 -08:00
Lucas Hosseini
d494251b5d Increase no-output timeout for Linux GPU nightlies build. (#1605)
Summary:
Since https://github.com/facebookresearch/faiss/issues/1566, the GPU build seems to take significantly longer, causing GPU nightly builds to time out.

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1605

Reviewed By: mdouze

Differential Revision: D25780754

Pulled By: beauby

fbshipit-source-id: 879fcb55550711dd21030a640eed9cc1ee820fba
2021-01-05 05:04:06 -08:00
Matthijs Douze
05774f1996 Fix compile gcc 7.3.0 (#1593)
Summary:
Fixes a small compilation issue with gcc 7.3.0 that does not appear with 7.4.0.

Also updated the readme.

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1593

Reviewed By: beauby

Differential Revision: D25705885

Pulled By: mdouze

fbshipit-source-id: 920b35264463cdd6ad10bbb09e07cf483fcaa724
2021-01-04 11:49:16 -08:00
Lucas Hosseini
4774951982 GitHub actions hooks for GitHub pages docs website. (#1599)
Summary:
These hooks, along with the creation of the `gh-pages` branch with a Sphinx-powered website ([preview](https://beauby.github.io/faiss/)), will ensure an automatic rebuild of the C++ API (doxygen) docs upon modifications to `master`.
Moreover, direct changes to the `gh-pages` branch will trigger a rebuild of the website as well.

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1599

Reviewed By: wickedfoo

Differential Revision: D25723322

Pulled By: beauby

fbshipit-source-id: fafeed64b393dce8c569f9fd1f5bc3b004706589
2020-12-29 13:46:54 -08:00
Lucas Hosseini
e6a19f190a Replace tempnam with mkstemp in tests. (#1596)
Summary:
This avoids triggering the following warnings:
```
tests/test_ondisk_ivf.cpp:36:24: warning: 'tempnam' is deprecated: This function is provided for compatibility reasons only.  Due to security concerns inherent in the design of tempnam(3), it is highly recommended that you use mkstemp(3) instead. [-Wdeprecated-declarations]
        char *cfname = tempnam (nullptr, prefix);
                       ^
tests/test_merge.cpp:34:24: warning: 'tempnam' is deprecated: This function is provided for compatibility reasons only.  Due to security concerns inherent in the design of tempnam(3), it is highly recommended that you use mkstemp(3) instead. [-Wdeprecated-declarations]
        char *cfname = tempnam (nullptr, prefix);
```

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1596

Reviewed By: wickedfoo

Differential Revision: D25710654

Pulled By: beauby

fbshipit-source-id: 2aa027c3b32f6cf7f41eb55360424ada6d200901
2020-12-29 13:37:05 -08:00
Lucas Hosseini
544e08f660 Fix SWIG dependencies in CMake. (#1591)
Summary:
CMake's SWIG module does not track dependencies on header files by
default, so they have to be stated manually.

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1591

Reviewed By: wickedfoo

Differential Revision: D25705493

Pulled By: beauby

fbshipit-source-id: faf70415efb0db677ea3ee8e38495d9ed39432d7
2020-12-29 13:27:03 -08:00
egolearner
7dce45f10f fix typo (#1553)
Summary: Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1553

Reviewed By: mdouze

Differential Revision: D25704031

Pulled By: beauby

fbshipit-source-id: d5a632a529ad71b40a583340a383071e74c356da
2020-12-25 05:57:25 -08:00
Matthijs Douze
3dd7ba8ff9 Add range search accuracy evaluation
Summary:
Added a few functions in contrib to:
- run range searches by batches on the query or the database side
- emulate range search on GPU: search on GPU with k=1024; if the farthest neighbor is still within range, re-run the search on CPU (see the sketch below)
- as reference implementations for precision-recall on range search datasets
- optimized code to plot precision-recall plots (i.e. sweep over thresholds)

The new functions are mainly in a new `evaluation.py`
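A rough sketch of the GPU range-search emulation described above (names and structure are illustrative, not the exact contrib API):

```
import numpy as np

def range_search_gpu_emulated(gpu_index, cpu_index, xq, radius, k=1024):
    D, I = gpu_index.search(xq, k)          # GPU search with a large fixed k
    lims, all_D, all_I = [0], [], []
    for qi in range(len(xq)):
        if D[qi, -1] < radius:
            # The farthest of the k results is still within range, so the GPU
            # answer may be truncated: redo this query exactly on the CPU.
            l, Dq, Iq = cpu_index.range_search(xq[qi:qi + 1], radius)
            Dq, Iq = Dq[l[0]:l[1]], Iq[l[0]:l[1]]
        else:
            mask = D[qi] < radius
            Dq, Iq = D[qi][mask], I[qi][mask]
        all_D.append(Dq)
        all_I.append(Iq)
        lims.append(lims[-1] + len(Dq))
    return np.array(lims), np.concatenate(all_D), np.concatenate(all_I)
```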

Reviewed By: wickedfoo

Differential Revision: D25627619

fbshipit-source-id: 58f90654c32c925557d7bbf8083efbb710712e03
2020-12-17 17:17:09 -08:00
Jeff Johnson
32df3f3198 GPU IVFPQ nbits != 8 support (#1576)
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1576

This diff updates the simple SIMD interleaved add and search kernels I added in D24064745 to support an arbitrary number of bits per PQ code, bitwise packed (just as the CPU). Right now only nbits of 4, 5, 6, 8 are supported, but I will also update to support 7 bits before checking in, and the framework exists for any other value that we want (even > 8 bits) later on. For nbits != 8, copy to/from the packed CPU format is also supported. This new functionality is still experimental and is opt-in at the moment when interleaved codes are enabled in the config object, though after a few more subsequent diffs it will become the default and all the old code/kernels will be deleted.
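A sketch of the opt-in as it would look from the Python bindings (field and constructor arguments are assumed to mirror the C++ config; treat them as illustrative):

```
import faiss

res = faiss.StandardGpuResources()
cfg = faiss.GpuIndexIVFPQConfig()
cfg.interleavedLayout = True  # opt in to the experimental interleaved codes

d, nlist, M, nbits = 64, 1024, 8, 6  # nbits other than 8 is the new part
index = faiss.GpuIndexIVFPQ(res, d, nlist, M, nbits, faiss.METRIC_L2, cfg)
```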

Since I originally wrote the GPU IVFPQ training code in 2016, the CPU version had a subsampling step added, which I was missing in the GPU version, and was causing non-reproducibility between the CPU and GPU when trained on the same data (if subsampling was required). This step has been added to the GPU version as well.

This diff also unifies the IVF add and search kernels somewhat between IVFPQ and IVFFlat/IVFSQ for SIMD interleaved codes. This required adding an additional step to IVFSQ to calculate distances in a separate kernel instead of a fused kernel (much as IVFPQ does with its separate kernels to determine the codes).

From here it shouldn't be too difficult to create a primitive version of the in-register IVFPQ list scanning code that I will iterate on :)

Reviewed By: mdouze

Differential Revision: D25597421

fbshipit-source-id: cbcf1a2b79d92cc007ab95428057f9de643baf1a
2020-12-17 07:36:30 -08:00
Matthijs Douze
c5975cda72 PQ4 fast scan benchmarks (#1555)
Summary:
Code + scripts for Faiss benchmarks around the  Fast scan codes.

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1555

Test Plan: buck test //faiss/tests/:test_refine

Reviewed By: wickedfoo

Differential Revision: D25546505

Pulled By: mdouze

fbshipit-source-id: 902486b7f47e36221a2671d124df8c114f25db58
2020-12-16 01:18:58 -08:00
Jeff Johnson
90c891b616 Optimized SIMD interleaved IVF flat/SQ implementation (#1566)
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1566

This diff converts the IVF flat and IVFSQ code to use an interleaved-by-32 format, the same as was added to the IVFPQ GPU implementation in D24064745. It also implements SQ6 on the GPU.

However, the new interleaved format is now enabled by default for GpuIndexIVFFlat and GpuIndexIVFScalarQuantizer, while the IVFPQ version is still opt-in until I can develop optimized PQ kernels.

For extension of the interleaved format to non-word size (8/16/32 bit) codes, arbitrary bit codes are packed in groups of 32 vectors, so each dimension of SQ6 for 32 vectors is packed into (32 * 6) / 8 = 24 bytes, and SQ4 packs into 16 bytes.
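The packing arithmetic above, spelled out (a trivial check, not the GPU packing code):

```
def packed_bytes_per_dim(bits_per_code, group_size=32):
    # One dimension of a group of 32 interleaved vectors, bit-packed.
    return group_size * bits_per_code // 8

assert packed_bytes_per_dim(6) == 24  # SQ6
assert packed_bytes_per_dim(4) == 16  # SQ4
```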

The new IVF code also fuses the k-selection kernel with the distance computation, so results per-(query, ivf list) are already k-selected. This improves the speed, especially at small query batch sizes, where it is as much as 84% faster. The float32 version at large nq batch size (16384) is 13% faster, though this now seems to be running at the peak memory bandwidth of the GPU and cannot go any faster as far as I can tell. There is still room for improvement with the sq8/6/4 versions, which are at about 50% of peak; I'll work on optimizing these in subsequent diffs.

Performance numbers for nlist = 1000, nb = 10^6, nprobe = 16, dim = 128 at varying nq are (in milliseconds), with all versions compared against the old interleaved version, but sq6 compared against the CPU implementation:

```
float32
nq 1 new 0.420811 old 0.774816 speedup 1.8412446442702306x
nq 4 new 0.377007 old 0.573527 speedup 1.5212635309158717x
nq 16 new 0.474821 old 0.611986 speedup 1.2888772821758094x
nq 64 new 0.926598 old 1.124938 speedup 1.2140518326178127x
nq 256 new 2.918364 old 3.339133 speedup 1.1441797527655906x
nq 1024 new 11.097743 old 12.647599 speedup 1.1396550631961833x
nq 4096 new 43.828697 old 50.088993 speedup 1.142835549046781x
nq 16384 new 173.582674 old 196.415956 speedup 1.1315412504821765x
sq8
nq 1 new 0.419673 old 0.660393 speedup 1.5735894374906176x
nq 4 new 0.396551 old 0.55526 speedup 1.4002234264949527x
nq 16 new 0.437477 old 0.589546 speedup 1.3476045597825714x
nq 64 new 0.697084 old 0.889233 speedup 1.2756468373969279x
nq 256 new 1.904308 old 2.231102 speedup 1.1716077441254251x
nq 1024 new 6.539976 old 8.23596 speedup 1.2593257222962286x
nq 4096 new 25.524117 old 31.276868 speedup 1.2253849173313223x
nq 16384 new 101.992982 old 125.355406 speedup 1.2290591327156215x
sq6
nq 1 new 0.693262 old 2.007591 speedup 2.895861881943623x
nq 4 new 0.62342 old 3.049899 speedup 4.892205896506368x
nq 16 new 0.626906 old 2.760067 speedup 4.402680784679043x
nq 64 new 1.002582 old 7.152971 speedup 7.134549592951x
nq 256 new 2.806507 old 19.4322 speedup 6.923980592245094x
nq 1024 new 9.414069 old 65.208767 speedup 6.926735612411593x
nq 4096 new 36.099553 old 249.866567 speedup 6.921597256342759x
nq 16384 new 142.230624 old 1040.07494 speedup 7.312594930329491x
sq4
nq 1 new 0.46687 old 0.670675 speedup 1.436534795553366x
nq 4 new 0.436246 old 0.589663 speedup 1.351675430834896x
nq 16 new 0.473243 old 0.628914 speedup 1.3289451719306993x
nq 64 new 0.789141 old 1.018548 speedup 1.2907047029618282x
nq 256 new 2.314464 old 2.592711 speedup 1.1202209237214318x
nq 1024 new 8.203663 old 9.574067 speedup 1.167047817541993x
nq 4096 new 31.910841 old 37.19758 speedup 1.1656721927197093x
nq 16384 new 126.195179 old 147.004414 speedup 1.164897226382951x
```

This diff also provides a new method for packing data of uniform arbitrary bitwidth in parallel, where a warp uses warp shuffles to exchange data to the right lane which is then bit packed in the appropriate lane. Unpacking data happens in a similar fashion. This allows for coalesced memory loads and stores, instead of individual lanes having to read multiple bytes or words out of global or shared memory. This was the most difficult thing about this particular diff.

The new IVF layout is completely transparent to the user. When copying to/from a CPU index, the codes are converted as needed. This functionality is implicitly tested in all of the CPU <-> GPU copy tests for the index types that currently exist.

This diff also contains an optimization to the scalar quantizers to only require an int-to-float conversion and a single multiply-add as opposed to more operations previously, by rewriting vmin and vdiff at runtime in the kernel.

The old IVF flat code is still in the tree and is accessible by setting `interleavedLayout` to false in the config object. This will be deleted in a later diff as part of a cleanup when I am finally done with performance comparisons.

The diff also contains various other changes:

- new code to allow copying a Tensor to/from a std::vector which reduces the amount of boilerplate code required in some places
- fixes a bug where miscellaneous index API calls were not properly stream synchronized if the user was using a non-default stream (e.g., a pytorch provided stream). This would not have been noticed by any regular user for the wrapped index calls, but it would be noticed if you attempted to call some of the debugging functions (e.g., get the GPU codes). This is done by adding additional logic to the StandardGpuResources stream update functions to add the required synchronization if the user manually changes the stream
- function to retrieve encoded IVF data in either CPU native or GPU interleaved format
- the CPU scalar quantizer object now directly reports how many bits are in a single scalar code, as previously the only information was how many bytes were used for a full encoded vector

Reviewed By: mdouze

Differential Revision: D24862911

fbshipit-source-id: 9a92486306b4b0c6ac30e5cd22c1ffbb6ed2faf4
2020-12-15 21:17:56 -08:00
Matthijs Douze
218a6a9b90 Update INSTALL.md (#1456)
Summary:
Added documentation + placeholders on how to compile demos and tests with CMake.

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1456

Reviewed By: wickedfoo

Differential Revision: D25562575

Pulled By: mdouze

fbshipit-source-id: d1decdfc263b4ca3fcd8c5f6ec50f5d950ac5588
2020-12-15 15:00:25 -08:00
Matthijs Douze
6d0bc58db6 Implementation of PQ4 search with SIMD instructions (#1542)
Summary:
IndexPQ and IndexIVFPQ implementations with AVX shuffle instructions.

The training and computation of the codes do not change wrt. the original PQ versions, but the code layout is "packed" so that it can be used efficiently by the SIMD computation kernels.
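Typical usage of the new IndexPQFastScan object introduced below, sketched with illustrative parameters (see the tests in the Test Plan for the authoritative examples):

```
import numpy as np
import faiss

d, M = 64, 16  # 16 sub-quantizers, 4 bits each
xb = np.random.rand(10000, d).astype('float32')

index = faiss.IndexPQFastScan(d, M, 4)
index.train(xb)
index.add(xb)
D, I = index.search(xb[:5], 10)
```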

The main changes are:

- new IndexPQFastScan and IndexIVFPQFastScan objects

- simdlib.h for an abstraction above the AVX2 intrinsics

- BlockInvertedLists for invlists that are 32-byte aligned and where codes are not sequential

- pq4_fast_scan.h/.cpp: for packing codes and look-up tables + optimized distance computation kernels

- simd_result_handlers.h: SIMD version of result collection in heaps / reservoirs

Misc changes:

- added contrib.inspect_tools to access fields in C++ objects

- moved .h and .cpp code for inverted lists to an invlists/ subdirectory, and made a .h/.cpp for InvertedListsIOHook

- added a new inverted lists type with 32-byte aligned codes (for consumption by SIMD)

- moved Windows-specific intrinsics to platform_macros.h

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1542

Test Plan:
```
buck test mode/opt  -j 4  //faiss/tests/:test_fast_scan_ivf //faiss/tests/:test_fast_scan
buck test mode/opt  //faiss/manifold/...
```

Reviewed By: wickedfoo

Differential Revision: D25175439

Pulled By: mdouze

fbshipit-source-id: ad1a40c0df8c10f4b364bdec7172e43d71b56c34
2020-12-03 10:06:38 -08:00
redwrasse
c66ffe8736 typo fixes (#1548)
Summary: Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1548

Reviewed By: beauby

Differential Revision: D25265990

Pulled By: mdouze

fbshipit-source-id: 210b0f76aafc401fec8b71f7dfc57dc99c856680
2020-12-02 01:30:49 -08:00
Matthijs Douze
204ada93a1 Enable AVX optimizations for ScalarQuantizer (#1546)
Summary:
Add F16C support flag for main Faiss.
Otherwise all the code in ScalarQuantizer does not use the AVX optimizations.

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1546

Reviewed By: beauby

Differential Revision: D25216912

Pulled By: mdouze

fbshipit-source-id: 7e2a495f6e89a03513e4e06572b360bce0348fea
2020-11-30 08:41:47 -08:00
Mo Zhou
1ac4ef5b77 CMake: use GNUInstallDirs instead of hardcoded paths. (#1541)
Summary:
Upstreamed from Debian packaging: https://salsa.debian.org/deeplearning-team/faiss

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1541

Reviewed By: mdouze

Differential Revision: D25175035

Pulled By: beauby

fbshipit-source-id: c6bc5896e2b602e49edc4bf6ccc8cf97df25ad85
2020-11-24 23:10:06 -08:00
Lucas Hosseini
88eabe97f9 Fix version string in conda builds.
Summary: Currently, conda version strings are built from the latest git tag, which starts with the letter `v`. This confuses conda, which orders v1.6.5 before 1.6.3.

Reviewed By: LowikC

Differential Revision: D25151276

fbshipit-source-id: 7abfb547fee3468b26fedb6637a15e725755daf3
v1.6.5
2020-11-22 08:58:08 -08:00
Lowik Chanussot
f171d19ae8 bump version to 1.6.5
Summary: As the title says.

Reviewed By: beauby

Differential Revision: D25093295

fbshipit-source-id: 1c019bf525eb62b591bb7c1327ceb27e39dc29b8
2020-11-19 11:54:25 -08:00
Matthijs Douze
25adab7425 fix 64-bit arrays on the Mac (#1531)
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1531

vector_to_array assumes that long is 64 bit. Fix this and test it.

Reviewed By: wickedfoo

Differential Revision: D25022363

fbshipit-source-id: f51f723d590d71ee5ef39e3f86ef69426df833fa
2020-11-17 09:00:06 -08:00
Matthijs Douze
fa85ddf8fa reduce nb of pq training iterations in test
Summary:
The tests TestPQTables are very slow in dev mode with BLAS. This seems to be due to the training operation of the PQ. However, since it does not matter whether the training is accurate or not, we can just reduce the number of training iterations from the default 25 to 4.

Still unclear why this happens, because the runtime is spent in BLAS, which should be independent of mode/opt or mode/dev.
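In code, the knob being turned is roughly the following (assuming the ProductQuantizer clustering parameters exposed in Python):

```
import faiss

pq = faiss.ProductQuantizer(64, 8, 8)  # d=64, 8 sub-quantizers, 8 bits each
pq.cp.niter = 4                        # down from the default 25 k-means iterations
```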

Reviewed By: wickedfoo

Differential Revision: D24783752

fbshipit-source-id: 38077709eb9a6432210c11c3040765e139353ae8
2020-11-08 22:26:08 -08:00
Hap-Hugh
9b0029bd7e Fix Bugs in Link&Code (#1510)
Summary:
As described in the issue, I patched these two bugs and the code is working well now.

https://github.com/facebookresearch/faiss/issues/1503#issuecomment-722172257

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1510

Reviewed By: wickedfoo

Differential Revision: D24786497

Pulled By: mdouze

fbshipit-source-id: e7fc538ae2c5f20caf4cc9a3e9f369db7bf48a71
2020-11-06 10:28:15 -08:00
Matthijs Douze
e1adde0d84 Faster brute force search (#1502)
Summary:
This diff streamlines the code that collects results for brute force distance computations for the L2 / IP and range search / knn search combinations.

It introduces a `ResultHandler` template class that abstracts what happens with the computed distances and ids. In addition to the heap result handler and the range search result handler, it introduces a reservoir result handler that improves the search speed for large k (>=100).

Benchmark results (https://fb.quip.com/y0g1ACLEqJXx#OCaACA2Gm45) show that on small datasets (10k) search is 10-50% faster (improvements are larger for small k). There is room for improvement in the reservoir result handler, whose implementation is quite naive currently, but the diff is already useful in its current form.

Experiments on precomputed db vector norms for L2 distance computations were not very conclusive performance-wise, so the implementation is removed from IndexFlatL2.

This diff also removes IndexL2BaseShift, which was never used.

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1502

Test Plan:
```
buck test //faiss/tests/:test_product_quantizer
buck test //faiss/tests/:test_index -- TestIndexFlat
```

Reviewed By: wickedfoo

Differential Revision: D24705464

Pulled By: mdouze

fbshipit-source-id: 270e10b19f3c89ed7b607ec30549aca0ac5027fe
2020-11-04 22:16:23 -08:00
Matthijs Douze
698a4592e8 fix clustering objective for inner product use cases
Summary: When an INNER_PRODUCT index is used for clustering, higher objective is better, so when redoing clusterings the highest objective should be retained (not the lowest). This diff fixes this and adds a test.
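The selection rule, sketched (illustrative only, not the Clustering class internals):

```
import faiss

def best_objective(objectives, metric):
    # For inner product a larger objective is better; for L2, smaller is better.
    pick = max if metric == faiss.METRIC_INNER_PRODUCT else min
    return pick(objectives)
```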

Reviewed By: wickedfoo

Differential Revision: D24701894

fbshipit-source-id: b9ec224cf8f4ffdfd2b8540ce37da43386a27b7a
2020-11-03 09:44:09 -08:00