Matthijs Douze
c5975cda72 PQ4 fast scan benchmarks (#1555)
Summary:
Code + scripts for Faiss benchmarks around the fast-scan codes.

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1555

Test Plan: buck test //faiss/tests/:test_refine

Reviewed By: wickedfoo

Differential Revision: D25546505

Pulled By: mdouze

fbshipit-source-id: 902486b7f47e36221a2671d124df8c114f25db58
2020-12-16 01:18:58 -08:00
Jeff Johnson
90c891b616 Optimized SIMD interleaved IVF flat/SQ implementation (#1566)
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1566

This diff converts the IVF flat and IVFSQ code to use an interleaved-by-32 format, the same as was added to the IVFPQ GPU implementation in D24064745. It also implements SQ6 on the GPU.

The new interleaved format is now enabled by default for GpuIndexIVFFlat and GpuIndexIVFScalarQuantizer, while the IVFPQ version remains opt-in until I can develop optimized PQ kernels.

To extend the interleaved format to codes that are not a word size (8/16/32 bits), arbitrary-bit codes are packed in groups of 32 vectors, so each dimension of SQ6 for 32 vectors is packed into (32 * 6) / 8 = 24 bytes, and SQ4 packs into 16 bytes.
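To make the arithmetic concrete, here is a minimal NumPy sketch of the per-dimension packing for one group of 32 vectors (illustrative only: the bit order here is not necessarily the kernel's exact layout):

```
import numpy as np

def packed_bytes_per_dim(bits, group=32):
    # bytes used by one dimension of `group` vectors at `bits` bits per code
    return (group * bits) // 8

assert packed_bytes_per_dim(6) == 24   # SQ6: (32 * 6) / 8
assert packed_bytes_per_dim(4) == 16   # SQ4: (32 * 4) / 8

def pack_6bit(codes):
    """Pack 32 6-bit codes into 24 bytes (MSB-first, for illustration)."""
    assert codes.shape == (32,) and codes.max() < 64
    bits = np.unpackbits(codes.astype(np.uint8)[:, None], axis=1)[:, 2:]
    return np.packbits(bits.reshape(-1))   # 32 * 6 = 192 bits -> 24 bytes

print(pack_6bit(np.arange(32)).nbytes)   # 24
```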

The new IVF code also fuses the k-selection kernel with the distance computation, so results per (query, IVF list) pair are already k-selected. This improves speed, especially at small query batch sizes, where it is as much as 84% faster. The float32 version at large nq batch size (nq = 16384) is 13% faster, though it now seems to run at the peak memory bandwidth of the GPU and cannot go any faster as far as I can tell. There is still room for improvement in the sq8/6/4 versions, which are at about 50% of peak; I'll work on optimizing these in subsequent diffs.

Performance numbers for nlist = 1000, nb = 10^6, nprobe = 16, dim = 128 at varying nq (in milliseconds), with all versions compared against the old non-interleaved implementation, except sq6, which is compared against the CPU implementation:

```
float32
nq 1 new 0.420811 old 0.774816 speedup 1.8412446442702306x
nq 4 new 0.377007 old 0.573527 speedup 1.5212635309158717x
nq 16 new 0.474821 old 0.611986 speedup 1.2888772821758094x
nq 64 new 0.926598 old 1.124938 speedup 1.2140518326178127x
nq 256 new 2.918364 old 3.339133 speedup 1.1441797527655906x
nq 1024 new 11.097743 old 12.647599 speedup 1.1396550631961833x
nq 4096 new 43.828697 old 50.088993 speedup 1.142835549046781x
nq 16384 new 173.582674 old 196.415956 speedup 1.1315412504821765x
sq8
nq 1 new 0.419673 old 0.660393 speedup 1.5735894374906176x
nq 4 new 0.396551 old 0.55526 speedup 1.4002234264949527x
nq 16 new 0.437477 old 0.589546 speedup 1.3476045597825714x
nq 64 new 0.697084 old 0.889233 speedup 1.2756468373969279x
nq 256 new 1.904308 old 2.231102 speedup 1.1716077441254251x
nq 1024 new 6.539976 old 8.23596 speedup 1.2593257222962286x
nq 4096 new 25.524117 old 31.276868 speedup 1.2253849173313223x
nq 16384 new 101.992982 old 125.355406 speedup 1.2290591327156215x
sq6
nq 1 new 0.693262 old 2.007591 speedup 2.895861881943623x
nq 4 new 0.62342 old 3.049899 speedup 4.892205896506368x
nq 16 new 0.626906 old 2.760067 speedup 4.402680784679043x
nq 64 new 1.002582 old 7.152971 speedup 7.134549592951x
nq 256 new 2.806507 old 19.4322 speedup 6.923980592245094x
nq 1024 new 9.414069 old 65.208767 speedup 6.926735612411593x
nq 4096 new 36.099553 old 249.866567 speedup 6.921597256342759x
nq 16384 new 142.230624 old 1040.07494 speedup 7.312594930329491x
sq4
nq 1 new 0.46687 old 0.670675 speedup 1.436534795553366x
nq 4 new 0.436246 old 0.589663 speedup 1.351675430834896x
nq 16 new 0.473243 old 0.628914 speedup 1.3289451719306993x
nq 64 new 0.789141 old 1.018548 speedup 1.2907047029618282x
nq 256 new 2.314464 old 2.592711 speedup 1.1202209237214318x
nq 1024 new 8.203663 old 9.574067 speedup 1.167047817541993x
nq 4096 new 31.910841 old 37.19758 speedup 1.1656721927197093x
nq 16384 new 126.195179 old 147.004414 speedup 1.164897226382951x
```

This diff also provides a new method for packing data of uniform arbitrary bitwidth in parallel, where a warp uses warp shuffles to exchange data to the right lane which is then bit packed in the appropriate lane. Unpacking data happens in a similar fashion. This allows for coalesced memory loads and stores, instead of individual lanes having to read multiple bytes or words out of global or shared memory. This was the most difficult thing about this particular diff.

The new IVF layout is completely transparent to the user. When copying to/from a CPU index, the codes are converted as needed. This functionality is implicitly tested in all of the CPU <-> GPU copy tests for the index types that currently exist.

This diff also contains an optimization to the scalar quantizers: by rewriting vmin and vdiff at runtime in the kernel, decoding now requires only an int-to-float conversion and a single multiply-add, as opposed to more operations previously.
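For illustration, a minimal sketch of that rewrite, assuming the usual uniform scalar-quantizer decode formula x = vmin + (code / 255) * vdiff (the exact Faiss formula may differ):

```
import numpy as np

vmin, vdiff = -1.0, 2.0
code = np.uint8(200)

# naive decode: divide, multiply, add
x_naive = vmin + (float(code) / 255.0) * vdiff

# rewriting vdiff before the kernel runs leaves one int-to-float
# conversion plus a single multiply-add per scalar
scale = vdiff / 255.0
x_fast = float(code) * scale + vmin

assert abs(x_naive - x_fast) < 1e-12
```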

The old IVF flat code is still in the tree and is accessible by setting `interleavedLayout` to false in the config object. This will be deleted in a later diff as part of a cleanup when I am finally done with performance comparisons.
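A sketch of how to opt back into the old layout from Python, assuming the config field is exposed as on the other Gpu*Config objects:

```
import faiss

res = faiss.StandardGpuResources()
cfg = faiss.GpuIndexIVFFlatConfig()
cfg.interleavedLayout = False   # revert to the old IVF flat layout
index = faiss.GpuIndexIVFFlat(res, 128, 1024, faiss.METRIC_L2, cfg)
```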

The diff also contains various other changes:

- new code to allow copying a Tensor to/from a std::vector, which reduces the amount of boilerplate code required in some places
- fixes a bug where miscellaneous index API calls were not properly stream synchronized if the user was using a non-default stream (e.g., a pytorch-provided stream). This would not have been noticed by any regular user for the wrapped index calls, but it would be noticed if you attempted to call some of the debugging functions (e.g., getting the GPU codes). The fix adds logic to the StandardGpuResources stream update functions to perform the required synchronization when the user manually changes the stream
- function to retrieve encoded IVF data in either CPU native or GPU interleaved format
- the CPU scalar quantizer object now directly reports how many bits are in a single scalar code, as previously the only information was how many bytes were used for a full encoded vector

Reviewed By: mdouze

Differential Revision: D24862911

fbshipit-source-id: 9a92486306b4b0c6ac30e5cd22c1ffbb6ed2faf4
2020-12-15 21:17:56 -08:00
Matthijs Douze
218a6a9b90 Update INSTALL.md (#1456)
Summary:
Added doc + placeholders on how to compile demos and tests with cmake

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1456

Reviewed By: wickedfoo

Differential Revision: D25562575

Pulled By: mdouze

fbshipit-source-id: d1decdfc263b4ca3fcd8c5f6ec50f5d950ac5588
2020-12-15 15:00:25 -08:00
Matthijs Douze
6d0bc58db6 Implementation of PQ4 search with SIMD instructions (#1542)
Summary:
IndexPQ and IndexIVFPQ implementations with AVX shuffle instructions.

The training and computation of the codes do not change w.r.t. the original PQ versions, but the code layout is "packed" so that it can be used efficiently by the SIMD computation kernels.

The main changes are:

- new IndexPQFastScan and IndexIVFPQFastScan objects (see the usage sketch below)

- simdlib.h for an abstraction above the AVX2 intrinsics

- BlockInvertedLists for invlists that are 32-byte aligned and where codes are not sequential

- pq4_fast_scan.h/.cpp: for packing codes and look-up tables + optimized distance computation kernels

- simd_result_handlers.h: SIMD version of result collection in heaps / reservoirs

Misc changes:

- added contrib.inspect_tools to access fields in C++ objects

- moved .h and .cpp code for inverted lists to an invlists/ subdirectory, and made a .h/.cpp for InvertedListsIOHook

- added a new inverted lists type with 32-byte aligned codes (for consumption by SIMD)

- moved Windows-specific intrinsics to platform_macros.h
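A minimal usage sketch of the new flat PQ fast-scan index (4-bit PQ; names as introduced in this diff):

```
import numpy as np
import faiss

d, M = 64, 16                            # 16 sub-quantizers, 4 bits each
xb = np.random.rand(10000, d).astype('float32')
xq = np.random.rand(8, d).astype('float32')

index = faiss.IndexPQFastScan(d, M, 4)   # nbits = 4 for the SIMD kernels
index.train(xb)
index.add(xb)
D, I = index.search(xq, 10)
```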

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1542

Test Plan:
```
buck test mode/opt  -j 4  //faiss/tests/:test_fast_scan_ivf //faiss/tests/:test_fast_scan
buck test mode/opt  //faiss/manifold/...
```

Reviewed By: wickedfoo

Differential Revision: D25175439

Pulled By: mdouze

fbshipit-source-id: ad1a40c0df8c10f4b364bdec7172e43d71b56c34
2020-12-03 10:06:38 -08:00
redwrasse
c66ffe8736 typo fixes (#1548)
Summary: Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1548

Reviewed By: beauby

Differential Revision: D25265990

Pulled By: mdouze

fbshipit-source-id: 210b0f76aafc401fec8b71f7dfc57dc99c856680
2020-12-02 01:30:49 -08:00
Matthijs Douze
204ada93a1 Enable AVX optimizations for ScalarQuantizer (#1546)
Summary:
Add the F16C support flag for main Faiss.
Without it, the ScalarQuantizer code does not use the AVX optimizations.

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1546

Reviewed By: beauby

Differential Revision: D25216912

Pulled By: mdouze

fbshipit-source-id: 7e2a495f6e89a03513e4e06572b360bce0348fea
2020-11-30 08:41:47 -08:00
Mo Zhou
1ac4ef5b77 CMake: use GNUInstallDirs instead of hardcoded paths. (#1541)
Summary:
Upstreamed from Debian packaging: https://salsa.debian.org/deeplearning-team/faiss

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1541

Reviewed By: mdouze

Differential Revision: D25175035

Pulled By: beauby

fbshipit-source-id: c6bc5896e2b602e49edc4bf6ccc8cf97df25ad85
2020-11-24 23:10:06 -08:00
Lucas Hosseini
88eabe97f9 Fix version string in conda builds.
Summary: Currently, conda version strings are built from the latest git tag, which starts with the letter `v`. This confuses conda, which orders v1.6.5 before 1.6.3.

Reviewed By: LowikC

Differential Revision: D25151276

fbshipit-source-id: 7abfb547fee3468b26fedb6637a15e725755daf3
v1.6.5
2020-11-22 08:58:08 -08:00
Lowik Chanussot
f171d19ae8 bump version to 1.6.5
Summary: As the title says.

Reviewed By: beauby

Differential Revision: D25093295

fbshipit-source-id: 1c019bf525eb62b591bb7c1327ceb27e39dc29b8
2020-11-19 11:54:25 -08:00
Matthijs Douze
25adab7425 fix 64-bit arrays on the Mac (#1531)
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1531

vector_to_array assumes that long is 64 bit. Fix this and test it.

Reviewed By: wickedfoo

Differential Revision: D25022363

fbshipit-source-id: f51f723d590d71ee5ef39e3f86ef69426df833fa
2020-11-17 09:00:06 -08:00
Matthijs Douze
fa85ddf8fa reduce nb of pq training iterations in test
Summary:
The TestPQTables tests are very slow in dev mode with BLAS. This seems to be due to the training of the PQ. However, since it does not matter whether the training is accurate, we can just reduce the number of training iterations from the default 25 to 4.

It is still unclear why this happens, because the runtime is spent in BLAS, which should be independent of mode/opt or mode/dev.
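The change amounts to something like the following (sketch; the test sets this on its own quantizer):

```
import numpy as np
import faiss

pq = faiss.ProductQuantizer(64, 8, 8)   # d=64, M=8 sub-quantizers, 8 bits
pq.cp.niter = 4                         # k-means iterations, default is 25
pq.train(np.random.rand(2000, 64).astype('float32'))
```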

Reviewed By: wickedfoo

Differential Revision: D24783752

fbshipit-source-id: 38077709eb9a6432210c11c3040765e139353ae8
2020-11-08 22:26:08 -08:00
Hap-Hugh
9b0029bd7e Fix Bugs in Link&Code (#1510)
Summary:
As described in the issue, I patched these two bugs and the code is working correctly now.

https://github.com/facebookresearch/faiss/issues/1503#issuecomment-722172257

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1510

Reviewed By: wickedfoo

Differential Revision: D24786497

Pulled By: mdouze

fbshipit-source-id: e7fc538ae2c5f20caf4cc9a3e9f369db7bf48a71
2020-11-06 10:28:15 -08:00
Matthijs Douze
e1adde0d84 Faster brute force search (#1502)
Summary:
This diff streamlines the code that collects results for brute force distance computations for the L2 / IP and range search / knn search combinations.

It introduces a `ResultHandler` template class that abstracts what happens with the computed distances and ids. In addition to the heap result handler and the range search result handler, it introduces a reservoir result handler that improves the search speed for large k (>=100).
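The reservoir idea, in a minimal Python sketch (not the actual C++ handler): admit only candidates that beat the current threshold into a buffer of O(k) capacity, and re-select when it fills, so the per-distance cost is amortized:

```
import numpy as np

def reservoir_topk(pairs, k, capacity_factor=8):
    cap = capacity_factor * k
    buf, thresh = [], np.inf
    for d, i in pairs:
        if d < thresh:
            buf.append((d, i))
            if len(buf) >= cap:
                buf.sort()               # the real handler uses a partial sort
                buf = buf[:k]
                thresh = buf[-1][0]      # tighten the admission threshold
    buf.sort()
    return buf[:k]

dists = np.random.rand(100000)
print(reservoir_topk(((d, i) for i, d in enumerate(dists)), k=10)[:3])
```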

Benchmark results (https://fb.quip.com/y0g1ACLEqJXx#OCaACA2Gm45) show that on small datasets (10k) search is 10-50% faster (improvements are larger for small k). There is room for improvement in the reservoir handler, whose implementation is currently quite naive, but the diff is already useful in its current form.

Experiments on precomputed db vector norms for L2 distance computations were not very conclusive performance-wise, so the implementation is removed from IndexFlatL2.

This diff also removes IndexL2BaseShift, which was never used.

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1502

Test Plan:
```
buck test //faiss/tests/:test_product_quantizer
buck test //faiss/tests/:test_index -- TestIndexFlat
```

Reviewed By: wickedfoo

Differential Revision: D24705464

Pulled By: mdouze

fbshipit-source-id: 270e10b19f3c89ed7b607ec30549aca0ac5027fe
2020-11-04 22:16:23 -08:00
Matthijs Douze
698a4592e8 fix clustering objective for inner product use cases
Summary: When an INNER_PRODUCT index is used for clustering, higher objective is better, so when redoing clusterings the highest objective should be retained (not the lowest). This diff fixes this and adds a test.

Reviewed By: wickedfoo

Differential Revision: D24701894

fbshipit-source-id: b9ec224cf8f4ffdfd2b8540ce37da43386a27b7a
2020-11-03 09:44:09 -08:00
Lucas Hosseini
7212261a86 Fix docker build for GPU nightlies.
Reviewed By: wickedfoo

Differential Revision: D24670301

fbshipit-source-id: 5b19a9a88a880c20e51f6db1ce663224cf8d212c
2020-11-02 07:49:33 -08:00
Lucas Hosseini
04fde4032a Fix nightly build for GPU.
Reviewed By: wickedfoo

Differential Revision: D24559296

fbshipit-source-id: a9fbf51c5153b8b2dff4b2dd684cd84f5aaabc49
2020-10-26 22:33:52 -07:00
Jeff Johnson
8d776e6453 PyTorch tensor / Faiss index interoperability (#1484)
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1484

This diff allows for native usage of PyTorch tensors for Faiss indexes on both CPU and GPU. It is currently only implemented in this diff for things that inherit from `faiss.Index`, which covers the non-binary indices, and it patches the same functions on `faiss.Index` that were also covered by `__init__.py` for numpy interoperability.

There must be uniformity among the inputs: if any array input is a Torch tensor, then all array inputs must be Torch tensors. Similarly, if any array input is a numpy ndarray, then all array inputs must be numpy ndarrays.

If `faiss.contrib.torch_utils` is imported, it ensures that `import faiss` has already been performed to patch all of the functions using the base `__init__.py` numpy wrappers, and then patches the following functions again:

```
add
add_with_ids
assign
train
search
remove_ids
reconstruct
reconstruct_n
range_search
update_vectors
search_and_reconstruct
sa_encode
sa_decode
```

to allow usage of PyTorch CPU tensors, and additionally PyTorch GPU tensors if the index being used is on the GPU.

numpy functionality is still available when `faiss.contrib.torch_utils` is imported; we pass through to the original patched numpy function when we detect numpy inputs.

In addition, to allow for better (asynchronous) GPU usage without requiring the CPU to be involved, all of these functions which construct tensors/arrays for output now take optional arguments for storage (numpy or torch.Tensor) to be provided that will contain the output data. `range_search` is the only exception to this, as the size of the output data is indeterminate; the eventual GPU implementation will likely require the user to provide a maximum cap on the output size, and allow that to be passed instead. If the optional pre-allocated output values are provided by the user, they are used; otherwise, new return ndarrays / Tensors are constructed as before and used for the return.

If this feature were not provided on the GPU, then every execution would be completely serial, as we would depend upon the CPU to allocate GPU memory before every operation. Instead, this can now function much like NN graph execution on the GPU: assuming that all of the data requirements are pre-allocated, the execution will run at the full speed of the GPU and not be stalled sequentially launching kernels.
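A sketch of the patched call with pre-allocated outputs (the keyword names `D` / `I` for the optional output arguments are my assumption):

```
import torch
import faiss
import faiss.contrib.torch_utils       # patches faiss.Index for torch tensors

index = faiss.IndexFlatL2(64)
index.add(torch.rand(1000, 64))

xq = torch.rand(8, 64)
D = torch.empty(8, 10, dtype=torch.float32)   # pre-allocated distances
I = torch.empty(8, 10, dtype=torch.int64)     # pre-allocated ids
index.search(xq, 10, D=D, I=I)                # results written in place
```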

This diff also exposes the `GpuResources` shared_ptr object owned by a GPU index. This is required for pytorch GPU so that we can perform proper stream ordering in Faiss with respect to the current pytorch stream. So, Faiss indices now perform more or less as any NN operation in Torch does.

Note, however, that a Faiss index has its own setting on current device, and if the pytorch GPU tensor inputs are resident on a different device than what the Faiss index expects, a cross-device copy will be initiated. I may choose to make this an error in the future and require matching device to device.

This diff also found a bug when passing GPU data directly to `train()` for `GpuIndexIVFFlat` and `GpuIndexIVFScalarQuantizer`, as I guess we never tested passing GPU data directly to these functions before. `GpuIndexIVFPQ` was doing the right thing however.

The assign function is now implemented on the GPU as well, and is marked `const` to be in line with the `search` function.

Also added better checking of non-contiguous inputs for both Torch tensors and numpy ndarrays.

Updated the `knn_gpu` function with a base implementation always present that allows for usage of numpy arrays, which is overridden when `torch_utils` is imported to allow torch usage. This supports row/column major layout, float32/float16 data and int64/int32 indices for both numpy and torch.
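A usage sketch (module path per the exhaustive_search rename elsewhere in this log; argument order is my assumption):

```
import numpy as np
import faiss
from faiss.contrib.exhaustive_search import knn_gpu

res = faiss.StandardGpuResources()
xb = np.random.rand(100000, 64).astype('float32')
xq = np.random.rand(16, 64).astype('float32')
# works with numpy arrays; torch tensors if torch_utils is imported
D, I = knn_gpu(res, xq, xb, 10)
```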

Reviewed By: mdouze

Differential Revision: D24299400

fbshipit-source-id: b4f117b9c120bd1ad83e7702087051ab7b303b29
2020-10-23 22:24:22 -07:00
Lucas Hosseini
bab6db84e0 Make pytorch available in CircleCI. (#1486)
Summary: Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1486

Reviewed By: wickedfoo

Differential Revision: D24493486

Pulled By: beauby

fbshipit-source-id: 00156213061503ff593b2e9ede062850b23527a9
2020-10-22 21:10:32 -07:00
Lucas Hosseini
7891094da6 Add nightly packages for GPU. (#1485)
Summary: Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1485

Test Plan: Imported from OSS

Reviewed By: wickedfoo

Differential Revision: D24492171

Pulled By: beauby

fbshipit-source-id: 20fbcbdd50ab30e110e41b34e0c07d88432b1422
2020-10-22 19:47:13 -07:00
Lucas Hosseini
0b365fa6d8 Add docker image for CUDA 10.1. (#1477)
Summary: Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1477

Test Plan: Imported from OSS

Reviewed By: wickedfoo

Differential Revision: D24492173

Pulled By: beauby

fbshipit-source-id: 5247accb2dc31bb125f9b06fb2275346b2e6465f
2020-10-22 19:47:13 -07:00
Lucas Hosseini
616dc44e1e Update conda package for faiss-gpu. (#1476)
Summary: Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1476

Test Plan: Imported from OSS

Reviewed By: wickedfoo

Differential Revision: D24492172

Pulled By: beauby

fbshipit-source-id: 63497b54d8aed10d45ebc4ed7659dd1d18b36edf
2020-10-22 19:47:13 -07:00
Lucas Hosseini
64c13cdda3 Run python gpu tests. (#1479)
Summary: Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1479

Reviewed By: mdouze

Differential Revision: D24413984

Pulled By: beauby

fbshipit-source-id: 006343c996a278df1d9fc70e11283d31d63a0330
2020-10-22 09:56:30 -07:00
Matthijs Douze
f2369fcc82 benchmark SSD IndexIVF
Summary: This is some code for benchmarking the SSD reads.

Reviewed By: MDSilber

Differential Revision: D24457715

fbshipit-source-id: 475668e4dc450dc4652ef8828111335c236bfa44
2020-10-21 18:21:39 -07:00
Eyal Trabelsi
33e319f8b3 Skip TestShardedFlat for 1 GPU fixes #982 (#1466)
Summary: Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1466

Reviewed By: wickedfoo, mdouze

Differential Revision: D24448344

Pulled By: beauby

fbshipit-source-id: 7e8c563f1a5a1d745a1073319365c485fcbe1698
2020-10-21 09:15:12 -07:00
Jeff Johnson
e9dda0590c Deeper tests of GPU IVF list equality + test that would have caught bug fixed by D24405231 (#1480)
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1480

There was no test that captured the bug fixed in D24405231, namely that the GPU's IVF lists for IVFSQ both contained garbage and were longer than the CPU version's.

This diff contains 4 changes:
- provides an API for accessing both the list indices and encoded vectors for all IVF GPU index types, as well as the number of lists, on par with the CPU's InvertedLists structure. The encoded vectors are returned in the expected, canonical CPU format (even when the GPU version may differ)
- updates the inverted list vector encoding from `unsigned char` to `uint8_t` to match the CPU's InvertedLists datatype
- adds tests for IVFFlat, IVFPQ and IVFSQ to explicitly assert CPU and GPU IVF list equality when copying both to and from GPU
- removes usage of `long` to represent indices in Faiss GPU, replacing it with `Index::idx_t` everywhere

Reviewed By: beauby

Differential Revision: D24411004

fbshipit-source-id: b3335e559102008d805122f3b4594db6738c3ae9
2020-10-20 17:17:00 -07:00
Matthijs Douze
cf6593bdca Update ISSUE_TEMPLATE.md (#1462)
Summary:
add a field to ask people how they installed Faiss

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1462

Reviewed By: LowikC

Differential Revision: D24279581

Pulled By: mdouze

fbshipit-source-id: 2492c73e31d22f3b7f37de6bcfcac90eae0ccd07
2020-10-20 04:37:50 -07:00
Matthijs Douze
9c51bbb977 Fix faiss_contrib (#1478)
Summary:
Fixes the path issue mentioned in https://github.com/facebookresearch/faiss/issues/1472

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1478

Reviewed By: LowikC

Differential Revision: D24394529

Pulled By: mdouze

fbshipit-source-id: 5e4261a0f271751c736c562514f2ee8604c50702
2020-10-20 04:35:19 -07:00
Matthijs Douze
92306e3a69 Synthetic dataset with inner product option
Summary: The synthetic dataset can now have IP groundtruth

Reviewed By: wickedfoo

Differential Revision: D24219860

fbshipit-source-id: 42e094479311135e932821ac0a97ed0fb237bf78
2020-10-20 03:46:26 -07:00
Jeff Johnson
fbb6789f0e cuda 11 fix
Summary: Fix compilation of a CUDA 11 API to disable tensor core usage.

Reviewed By: ip4368

Differential Revision: D24404288

fbshipit-source-id: 5cc9fdcf3c86669bc85d5c13a7a523daf7fee62d
2020-10-19 18:53:43 -07:00
Lucas Hosseini
89187fee3c Fix CMake build on GPU. (#1475)
Summary: Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1475

Test Plan: Imported from OSS

Reviewed By: mdouze

Differential Revision: D24393544

Pulled By: beauby

fbshipit-source-id: 4776cb75dc11e5bb4bb4417ee44f0c794084c301
2020-10-19 16:56:17 -07:00
Jeff Johnson
6be85b0554 GPU IVFSQ code_size fix
Summary:
This bug was introduced in D24064745, which broke the code distance for GPU IVFSQ. The `code_size` is the size in bytes per encoded vector, not per scalar. This diff updates the expression for computing GPU and CPU vector sizes.
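In other words (a sketch of the corrected arithmetic):

```
d = 128
# code_size counts bytes per encoded vector, not per scalar:
assert (d * 8 + 7) // 8 == 128   # SQ8: 1 byte per scalar, 128 per vector
assert (d * 6 + 7) // 8 == 96    # SQ6: 6 bits per scalar, 96 per vector
```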

The bug was not seen on FB machines since it appears that the memory allocator (jemalloc?) was more forgiving in terms of mapped page sizes, and garbage tends to be far away in N-dimensional space from real queries.

Reviewed By: beauby

Differential Revision: D24405231

fbshipit-source-id: f9ad0d3f326afe412ea864537a24efbd74d97f1f
2020-10-19 16:54:00 -07:00
Matthijs Douze
28edc56fa8 Search in sharded invlists
Summary:
This diff adds a CombinedIndexSharded1T class to combined_index that uses the 30 shards from the Spark reducer.
The metadata is stored in pickle files on manifold.

Differential Revision: D24018824

fbshipit-source-id: be4ff8b38c3d6e1bb907e02b655d0e419b7a6fea
2020-10-19 10:39:22 -07:00
Jeff Johnson
8fd753e7e6 Disable tensor core usage on V100 GPUs / CUDA 10
Summary:
Tensor core usage on V100 + CUDA 10 for sgemm f32 x f32 = f32 seems to allow demotion of the inputs to f16 (wtf?!), resulting in an unacceptable loss of precision. All accumulation in Faiss is f32, and the use cases for f16 x f16 = f32 are opt-in and relatively small in practice I believe.

For A100 or CUDA 11, there is an option to only allow tensor core usage if we guarantee preservation of precision, which we instead prefer.

Reviewed By: beauby

Differential Revision: D24348944

fbshipit-source-id: d22cfaa233d21ee9c20974914ad155dab8c901fd
2020-10-16 11:50:27 -07:00
Lucas Hosseini
d1f72c5922 Remove guard for OSX packages.
Reviewed By: wickedfoo

Differential Revision: D24295557

fbshipit-source-id: 87608edba1c67f10ea11a4ceb73234ada7663bab
2020-10-13 19:44:26 -07:00
Lucas Hosseini
4a63f77fde Remove redundant jobs for releases.
Reviewed By: mdouze

Differential Revision: D24286656

fbshipit-source-id: 93c176e5c063f845114f76e2ac01dbad69b70fb5
2020-10-13 14:03:04 -07:00
Jeff Johnson
92f2391f41 Remove unused nvidia host fp16 headers/functions
Summary: Removes unused host fp16 code, the dependency upon which was removed a while ago.

Reviewed By: beauby

Differential Revision: D24279982

fbshipit-source-id: 5f6820c41eb387f766b2bed7e70203f5e01f49e9
2020-10-13 11:47:00 -07:00
Lucas Hosseini
70eaa9b1a3 Add missing copyright headers. (#1460)
Summary: Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1460

Reviewed By: wickedfoo

Differential Revision: D24278804

Pulled By: beauby

fbshipit-source-id: 5ea96ceb63be76a34f1eb4da03972159342cd5b6
2020-10-13 11:15:59 -07:00
Jeff Johnson
e796f4f9df Improve Faiss / PyTorch GPU interoperability
Summary:
PyTorch GPU in general is free to use whatever stream it currently wants, based on `torch.cuda.current_stream()`. Due to C++/python language barrier issues, we couldn't previously pass the actual `cudaStream_t` that is currently in use on a given device from PyTorch C++ to Faiss C++ via python.

This diff adds conversion functions to convert a Python integer representing a pointer to a `cudaStream_t` (which is itself a `CUstream_st*`), so we can pass the stream specified in `torch.cuda.current_stream()` to `StandardGpuResources::setDefaultStream`. We thus guarantee that all Faiss work is ordered on the same stream that is in use in PyTorch.

For use in Python, there is now the `faiss.contrib.pytorch_tensors.using_stream` context object which automatically sets and unsets the current PyTorch stream within Faiss. This takes a `StandardGpuResources` object in Python, and an optional `torch.cuda.Stream` if one wants to use a different stream; otherwise, it uses the current one. This is how it is used:

```
# Create a non-default stream
s = torch.cuda.Stream()

# Have Torch use it
with torch.cuda.stream(s):

  # Have Faiss use the same stream as the above
  with faiss.contrib.pytorch_tensors.using_stream(res):
    # Do some work on the GPU
    faiss.bfKnn(res, args)
```

`using_stream` uses the same pattern as the PyTorch `torch.cuda.stream` object.

This replaces any brute-force GPU/CPU synchronization work that was necessary before.

Other changes in this diff:
- cleans up the config objects in the GpuIndex subclasses, to distinguish between read-only parameters that can only be set upon index construction, versus those that can be changed at runtime.
- StandardGpuResources now more properly distinguishes between user-supplied streams (like the PyTorch one) which will not be destroyed upon resources destruction, versus internal streams.
- `search_index_pytorch` now needs to take a `StandardGpuResources` object as well; there is no way to get this from an index instance otherwise (or at least, I would have to return a `shared_ptr`, in which case we should just update the Python SWIG stuff to use `shared_ptr` for GpuResources or something)

Reviewed By: mdouze

Differential Revision: D24260026

fbshipit-source-id: b18bb0eb34eb012584b1c923088228776c10b720
2020-10-13 09:11:19 -07:00
Lucas Hosseini
b459931ae4 Mark ~BufferedIOWriter() noexcept(false) (#1459)
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1459

In C++11, destructors default to `noexcept(true)`. This destructor can throw (through `FAISS_THROW_IF_NOT()`), so marking it accordingly.

Reviewed By: mdouze

Differential Revision: D24253879

fbshipit-source-id: 7ba40387ed214dc2a03a495bc0d31ac9601c4c15
2020-10-13 06:55:27 -07:00
Lucas Hosseini
882d4f1051 Fix conda build cmd for nightly packages.
Reviewed By: mdouze

Differential Revision: D24269549

fbshipit-source-id: 8e6800781ae54a67c1d2424611a761f838d12026
2020-10-12 22:17:11 -07:00
Lucas Hosseini
52c6465a5e Test reporting in CircleCI (#1452)
Summary: Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1452

Reviewed By: mdouze

Differential Revision: D24269693

Pulled By: beauby

fbshipit-source-id: a1a98d263ef4b20107c845421615b1f35b52c6e2
2020-10-12 22:13:31 -07:00
Lucas Hosseini
8e44bff055 Nightly builds (#1451)
Summary: Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1451

Reviewed By: mdouze

Differential Revision: D24245777

Pulled By: beauby

fbshipit-source-id: f2ce92b28e3d7ffdc2e85bcd78d321da15fec87e
2020-10-12 14:40:52 -07:00
Lucas Hosseini
0aaf0a6357 Enable tests by default. (#1458)
Summary: Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1458

Reviewed By: mdouze

Differential Revision: D24252321

Pulled By: beauby

fbshipit-source-id: 38dc1f710c63ff1a292e962c636c380d82281b7f
2020-10-12 10:39:35 -07:00
Lucas Hosseini
38bf9f64e9 Get rid of stale generated docs.
Summary: Those docs are not very useful as is, and having to re-generate the html manually leads to them being stale most of the time. Should we decide that we want to have them, we can bring them back with some automated generation.

Reviewed By: mdouze

Differential Revision: D24246072

fbshipit-source-id: 39798b2861ff25ee3fa1f95abdbc3e7ddf3469ed
2020-10-12 00:33:37 -07:00
Lucas Hosseini
0fb6c00cfa Bump version to 1.6.4 (#1453)
Summary: Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1453

Reviewed By: mdouze

Differential Revision: D24243546

Pulled By: beauby

fbshipit-source-id: 190e0601bad3f08c5fac37923170a68ba1e83f16
v1.6.4
2020-10-12 00:01:06 -07:00
Lucas Hosseini
81b1aeea5e Fix warnings.
Reviewed By: wickedfoo

Differential Revision: D24168429

fbshipit-source-id: 0f5fe5eee0f313224a4681dc84ba05169ceb482d
2020-10-09 16:49:18 -07:00
Matthijs Douze
8b05434a50 Remove useless function
Summary:
Removed an unused function that caused compile errors in some configurations.
Added contrib function (exhaustive_search.knn) to compute the k nearest neighbors without constructing an index.
Renamed the equivalent GPU function to exhaustive_search.knn_gpu (it does not make much sense to mention numpy in the name, as all functions take numpy arguments by default).
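A sketch of the new contrib function:

```
import numpy as np
from faiss.contrib.exhaustive_search import knn

xb = np.random.rand(10000, 32).astype('float32')
xq = np.random.rand(5, 32).astype('float32')
D, I = knn(xq, xb, 10)   # brute-force k nearest neighbors, no index built
```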

Reviewed By: beauby

Differential Revision: D24215427

fbshipit-source-id: 6d8e1eafa7c57593304b7b76f83b3015e4d2a2bb
2020-10-09 07:57:04 -07:00
Jeff Johnson
0412d761e5 GPU brute-force kNN can take int32 indices (#1445)
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1445

As requested in https://github.com/facebookresearch/faiss/issues/1304, `bfKnn` can now produce int32 indices for output.

The native kernels themselves for brute-force kNN only operate on int32 indices in any case, so this is faster.

Also added a SWIG definition for float16 numpy arrays. As there is no native half type, the reverse definition is undefined, so this is only really used for taking float16 data (e.g., from numpy) as input from Python.

Added a `knn_numpy_gpu` wrapper as well that handles calling the `bfKnn` GPU implementation using CPU numpy arrays. This handles transposition and f32/f16/i32 data types as needed.

Reviewed By: mdouze

Differential Revision: D24152296

fbshipit-source-id: caa7daea23438cf26aa248e380f0dab2b6b907fd
2020-10-08 17:50:42 -07:00
cclauss
efa1e3f64f Use print() function in both Python 2 and Python 3 (#1443)
Summary:
Legacy `print` statements are syntax errors in Python 3, but the `print()` function works as expected in both Python 2 and Python 3.

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1443

Reviewed By: LowikC

Differential Revision: D24157415

Pulled By: mdouze

fbshipit-source-id: 4ec637aa26b61272e5337d47b7796a330ce25bad
2020-10-08 00:27:29 -07:00
Jeff Johnson
9b007c7418 GPU supports arbitrary dimensions per PQ sub-quantizer
Summary:
This diff removes a long-standing limitation of GpuIndexIVFPQ, namely that only a limited number of dimensions per sub-quantizer were supported when not using precomputed codes. This is part of the general cleanup and extension/optimization that I am performing on the GPU PQ code.

Now, we keep the same old specialized distance computations, but if we attempt to use a number of dimensions per sub-Q that is not specialized, we fall back to a general implementation based on batch matrix multiplication for computing PQ distances per code.

The batch MM PQ distance computation is enabled automatically if you use an odd number of dimensions per sub-quantizer (say, 7, 11, 53, ...). It can also be manually enabled via the `useMMCodeDistance` option in `GpuIndexIVFPQConfig` for testing purposes, though the result should be within some epsilon of the other implementation.
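A sketch of opting into the batch-MM path manually (an odd dims-per-sub-quantizer value would also trigger it automatically):

```
import faiss

res = faiss.StandardGpuResources()
cfg = faiss.GpuIndexIVFPQConfig()
cfg.useMMCodeDistance = True   # force the batch matrix-multiply fallback
# 56-d vectors, 8 sub-quantizers -> 7 dims per sub-Q (not specialized)
index = faiss.GpuIndexIVFPQ(res, 56, 100, 8, 8, faiss.METRIC_L2, cfg)
```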

This diff also removes the iterated GEMM wrapper. I don't honestly know why I was using this instead of `cublasGemmStridedBatchedEx`, maybe I couldn't find that or this was originally implemented in a much older version of CUDA. The iterated GEMM call was used in a few other places (e.g., precomputed code computation). Now, this (and the PQ distance computation) use batch MM which is a single CUDA call.

This diff also adds stream synchronization to the temporary memory manager, as the fallback PQ distance computation needs to use temporary memory, and there were too many buffers for these to pre-allocate.

It also fixes the bug in https://github.com/facebookresearch/faiss/issues/1421.

Reviewed By: mdouze

Differential Revision: D24130629

fbshipit-source-id: 1c8bc53c86d0523832ad89c8bd4fa4b5fc187cae
2020-10-06 11:06:03 -07:00