faiss/faiss/gpu/utils/MatrixMult.cuh
Jeff Johnson 9b007c7418 GPU supports arbitrary dimensions per PQ sub-quantizer
Summary:
This diff removes a long-standing limitation of GpuIndexIVFPQ: when precomputed codes are not used, only a limited set of values for the number of dimensions per sub-quantizer was supported. This is part of the general cleanup, extension, and optimization I am performing on the GPU PQ code.

Now, we keep the same old specialized distance computations, but if the requested number of dimensions per sub-quantizer has no specialization, we fall back to a general implementation based on batch matrix multiplication for computing the PQ distances per code.
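
At a high level, the fallback computes the per-code distance table as a norm term plus a cross term, where the cross term is a batched matrix product over sub-quantizers: per sub-quantizer, [numQueries x dsub] times [256 x dsub] transposed. A rough CPU-side sketch of the shape of that computation (names, layouts, and the residual/centroid inputs here are illustrative, not the actual GPU kernel code):

    #include <cstddef>
    #include <vector>

    // Illustrative sketch only: numSubQ sub-quantizers, each with 256 centroids
    // of dsub dims; numQueries query residuals split into numSubQ sub-vectors.
    // table[q][m][code] = ||r_{q,m}||^2 + ||c_{m,code}||^2 - 2 * (r_{q,m} . c_{m,code})
    void pqDistanceTable(
        const std::vector<float>& residuals,     // numQueries * numSubQ * dsub
        const std::vector<float>& centroids,     // numSubQ * 256 * dsub
        const std::vector<float>& centroidNorms, // numSubQ * 256, precomputed ||c||^2
        std::vector<float>& table,               // numQueries * numSubQ * 256
        size_t numQueries, size_t numSubQ, size_t dsub) {
      for (size_t m = 0; m < numSubQ; ++m) {     // m indexes the batch dimension
        for (size_t q = 0; q < numQueries; ++q) {
          const float* r = &residuals[(q * numSubQ + m) * dsub];
          float rNorm = 0.f;
          for (size_t d = 0; d < dsub; ++d) rNorm += r[d] * r[d];

          // this inner double loop is one matrix product of the batch
          for (size_t code = 0; code < 256; ++code) {
            const float* c = &centroids[(m * 256 + code) * dsub];
            float dot = 0.f;
            for (size_t d = 0; d < dsub; ++d) dot += r[d] * c[d];
            table[(q * numSubQ + m) * 256 + code] =
                rNorm + centroidNorms[m * 256 + code] - 2.f * dot;
          }
        }
      }
    }

On the GPU, the loop over sub-quantizers is what maps onto the single batched matrix multiplication described above.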

The batch MM PQ distance computation is enabled automatically if you use a number of dimensions per sub-quantizer that has no specialized kernel (say, 7, 11, 53, ...). It can also be manually enabled via the `useMMCodeDistance` option in `GpuIndexIVFPQConfig` for testing purposes; the result should be within some epsilon of the specialized implementation.
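
For example, forcing the MM path on a configuration that would otherwise take a specialized kernel could look roughly like this (an untested sketch; the dimensions, list counts, and sub-quantizer settings are arbitrary illustrations):

    #include <faiss/gpu/GpuIndexIVFPQ.h>
    #include <faiss/gpu/StandardGpuResources.h>

    int main() {
      faiss::gpu::StandardGpuResources res;

      faiss::gpu::GpuIndexIVFPQConfig config;
      config.useMMCodeDistance = true;  // force the batch MM distance path

      // 256-dim vectors, 1024 coarse lists, 32 sub-quantizers (8 dims each),
      // 8 bits per code; train/add/search then proceed as usual
      faiss::gpu::GpuIndexIVFPQ index(
          &res, 256, 1024, 32, 8, faiss::METRIC_L2, config);
      return 0;
    }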

This diff also removes the iterated GEMM wrapper. I honestly don't know why I was using that instead of `cublasGemmStridedBatchedEx`; perhaps I couldn't find it at the time, or this was originally implemented against a much older version of CUDA. The iterated GEMM call was used in a few other places as well (e.g., precomputed code computation). These (and the PQ distance computation) now use batch MM, which is a single CUDA call.
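
For reference, a minimal sketch of the strided-batched call (plain column-major, all-float32 cuBLAS rather than the Faiss wrapper; error checking omitted):

    #include <cublas_v2.h>

    // Multiplies `batch` independent (m x k) * (k x n) = (m x n) problems,
    // laid out contiguously in device memory, in one cuBLAS call; the old
    // wrapper issued one gemm per batch element instead.
    void batchGemmF32(cublasHandle_t handle,
                      const float* A, const float* B, float* C,
                      int m, int n, int k, int batch) {
      const float alpha = 1.0f;
      const float beta = 0.0f;

      cublasGemmStridedBatchedEx(
          handle,
          CUBLAS_OP_N, CUBLAS_OP_N,
          m, n, k,
          &alpha,
          A, CUDA_R_32F, m, (long long) m * k,  // lda, strideA
          B, CUDA_R_32F, k, (long long) k * n,  // ldb, strideB
          &beta,
          C, CUDA_R_32F, m, (long long) m * n,  // ldc, strideC
          batch,
          CUBLAS_COMPUTE_32F,  // pre-CUDA 11 toolkits take CUDA_R_32F here
          CUBLAS_GEMM_DEFAULT);
    }

The Faiss wrapper additionally handles the row-major view and the f16/f32 input types; the sketch above only shows the single-call structure.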

This diff also adds stream synchronization to the temporary memory manager, as the fallback PQ distance computation needs temporary memory, and there were too many distinct buffers to pre-allocate them all.
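
The hazard being guarded against (a generic sketch, not the actual Faiss allocator code) is releasing a heap-allocated temporary while a kernel queued on the stream may still be reading it:

    #include <cstddef>
    #include <cuda_runtime.h>

    // Generic illustration: a temporary that could not be served from the
    // pre-allocated pool falls back to cudaMalloc; before the memory is
    // released, drain the stream so no in-flight kernel can still touch it.
    void* allocFallbackTemp(size_t bytes) {
      void* p = nullptr;
      cudaMalloc(&p, bytes);
      return p;
    }

    void freeFallbackTemp(void* p, cudaStream_t stream) {
      cudaStreamSynchronize(stream);  // wait for pending work using `p`
      cudaFree(p);
    }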

It also fixes the bug in https://github.com/facebookresearch/faiss/issues/1421.

Reviewed By: mdouze

Differential Revision: D24130629

fbshipit-source-id: 1c8bc53c86d0523832ad89c8bd4fa4b5fc187cae
2020-10-06 11:06:03 -07:00


/**
* Copyright (c) Facebook, Inc. and its affiliates.
*
* This source code is licensed under the MIT license found in the
* LICENSE file in the root directory of this source tree.
*/

#pragma once

#include <cublas_v2.h>
#include <faiss/gpu/utils/Tensor.cuh>
#include <faiss/gpu/utils/DeviceTensor.cuh>
#include <faiss/gpu/utils/HostTensor.cuh>
#include <faiss/gpu/utils/Float16.cuh>

namespace faiss { namespace gpu {

class GpuResources;

/// C = alpha * A * B + beta * C
/// Expects row major layout, not fortran/blas column major!
template <typename AT, typename BT>
void runMatrixMult(Tensor<float, 2, true>& c, bool transC,
                   Tensor<AT, 2, true>& a, bool transA,
                   Tensor<BT, 2, true>& b, bool transB,
                   float alpha,
                   float beta,
                   cublasHandle_t handle,
                   cudaStream_t stream);

/// C_i = alpha * A_i * B_i + beta * C_i
/// where `i` is the outermost dimension, via batched gemm
/// Expects row major layout, not fortran/blas column major!
template <typename AT, typename BT>
void runBatchMatrixMult(Tensor<float, 3, true>& c, bool transC,
                        Tensor<AT, 3, true>& a, bool transA,
                        Tensor<BT, 3, true>& b, bool transB,
                        float alpha,
                        float beta,
                        cublasHandle_t handle,
                        cudaStream_t stream);

} } // namespace

#include <faiss/gpu/utils/MatrixMult-inl.cuh>