Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/4198
1. pins lief due to `AttributeError: type object 'CLASS' has no attribute 'CLASS64'` (just set it to last passing nightly version)
2. pins mkl in gpu builds due to it trying to pull in 2024.2.2 which conflicts with 2023 in the libfaiss.
Added nightlies to make sure they pass https://github.com/facebookresearch/faiss/actions/runs/13422430425/job/37498020894. Not all passed: I'm not sure the `build-pull-request / Linux x86_64 GPU w/ cuVS nightlies (CUDA 12.4.0)` nightly is actually broken, but this unblocks the PR builds for now.
Reviewed By: junjieqi
Differential Revision: D69860604
fbshipit-source-id: 2da623c71b03c22d581b78655253a863fbafd3ed
Summary:
For the release, we want to standardize on Faiss instead of FAISS.
Changed everything except the CHANGELOG.md which I assume should not be changed after it lands.
This doesn't aim to fix any existing lints / errors. Those can be handled at another time.
Reviewed By: junjieqi
Differential Revision: D68842649
fbshipit-source-id: c0b60d5baa0e1f710db3638ffcc6f223fb3408ad
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/4126
Good resource on overriding channels to make sure we aren't using `defaults`: https://stackoverflow.com/questions/67695893/how-do-i-completely-purge-and-disable-the-default-channel-in-anaconda-and-switch
Explanation of changes:
-
- changed to miniforge from miniconda: this ensures we only pull in from conda-forge when creating the environment
- architecture: ARM64 and aarch64 are the same thing. But there is no miniforge package for ARM64, so we need to make it check for aarch64 instead. However, mac breaks this rule, and does have macOS-arm64! So there is a conditional for mac to use arm64. https://github.com/conda-forge/miniforge/releases/
- action.yml mkl 2022.2.1 change: conda-forge and defaults have completely different dependencies. On defaults, mkl required intel-openmp. On conda-forge, mkl 2023.1 or higher requires llvm-openmp >=14.0.6, which is incompatible with pytorch builds <2.5 that require llvm-openmp <14.0. We would need to upgrade Python to 3.12 first, then upgrade the PyTorch build, then upgrade this mkl. (The meta.yaml changes are the ones that narrow it to 2022.2.1 during `conda build faiss`.) So, this has just been changed to 2022.2.1.
- mkl now requires _openmp_mutex of type "llvm" instead of "gnu": prior non-cuVS builds all used gnu, because intel-openmp from anaconda defaults channel does not require llvm-openmp. Now we need to remove the gnu one which is automatically pulled in during miniconda setup, and only keep the llvm version of _openmp_mutex.
- liblief: The above changes tried to pull in liblief 0.15. This results in an error like `AttributeError: module 'lief._lief.ELF' has no attribute 'ELF_CLASS'`. When I checked passing PR builds on defaults, they use lief 0.12, so I pinned that one for Python 3.9, 3.10, and 3.11. For Python 3.12, we need lief 0.14 or higher.
- gcc_linux-64 =11.2 for faiss-gpu on cudatoolkit-11.2: GPU builds kept trying to reference 11.2 when 14.2 was installed. I couldn't figure out why, or how to point it to the 14.2 installed on the host. Current nightly builds still reference 11.2, so I gave up and pinned 11.2 to keep it the same. Moving to 14.2 will take some more investigation.
- meta.yaml mkl 2023.0 vs 2023.1 with python versions: 3.9, 3.10, and 3.11 pass with 2023.0, but python 3.12 needs mkl 2023.1 or higher. Otherwise we get:
```
INTEL MKL ERROR: $PREFIX/lib/python3.12/site-packages/faiss/../../.././libmkl_def.so.2: undefined symbol: mkl_sparse_optimize_bsr_trsm_i8.
Intel MKL FATAL ERROR: Cannot load libmkl_def.so.2.
```
so the solution was to put a bunch of conditions in faiss/meta.yaml.
We should be able to use Jinja macros to reduce duplication but it requires some investigation. It was failing: https://github.com/facebookresearch/faiss/actions/runs/12915187334/job/36016477707?pr=4126 (paste of logs here: P1716887936). This can be a future BE task.
Macro example (the `-` signs remove whitespace lines before and after)
```
{% macro inclmkldevel() %}
{%- if PY_VER == '3.9' or PY_VER == '3.10' or PY_VER == '3.11' -%}
- mkl-devel =2023.0 # [x86_64]
- liblief =0.12.3 # [not win]
- python_abi <3.12
{%- elif PY_VER == '3.12' %}
- mkl-devel >=2023.2.0 # [x86_64]
- liblief =0.15.1 # [not win]
- python_abi =3.12
{% endif -%}
{% endmacro %}
```
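For readers less familiar with Jinja, the branching in the macro above can be restated as a small Python function. This is only a sketch: the function name `conda_pins` is hypothetical, the pin strings are copied from the macro, and the platform selectors (`# [x86_64]`, `# [not win]`) are deliberately left out.

```python
def conda_pins(py_ver: str) -> list:
    """Return the requirement pins the macro would emit for a Python version.

    Hypothetical helper for illustration; the real selection happens in
    faiss/meta.yaml via Jinja conditions, not in Python.
    """
    if py_ver in ("3.9", "3.10", "3.11"):
        # Older interpreters pass with mkl 2023.0 and the lief 0.12 series.
        return ["mkl-devel =2023.0", "liblief =0.12.3", "python_abi <3.12"]
    if py_ver == "3.12":
        # Python 3.12 needs a newer mkl and a newer lief.
        return ["mkl-devel >=2023.2.0", "liblief =0.15.1", "python_abi =3.12"]
    raise ValueError(f"no pins defined for Python {py_ver}")
```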
The python_abi was required to be pinned inside these conditions because otherwise several builds got this error:
```
File "/Users/runner/miniconda3/lib/python3.12/site-packages/conda_build/utils.py", line 1919, in insert_variant_versions
matches = [regex.match(pkg) for pkg in reqs]
^^^^^^^^^^^^^^^^
TypeError: expected string or bytes-like object, got 'list'
```
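The installer architecture rule from the list of changes (Linux ARM64 runners must ask for `aarch64`, while macOS keeps `arm64`) can be sketched in Python as follows. The function name `miniforge_arch` and the input spellings it accepts are assumptions made for this illustration; the actual check lives in the workflow's action.yml.

```python
def miniforge_arch(system: str, machine: str) -> str:
    """Return the architecture token to request for a Miniforge installer.

    `system` follows platform.system() values ("Linux", "Darwin", "Windows").
    `machine` is the runner architecture as reported by CI or by
    platform.machine(). This helper is hypothetical, not part of the PR.
    """
    m = machine.lower()
    if m in ("arm64", "aarch64"):
        # ARM64 and aarch64 are the same architecture, but only the macOS
        # installers are published under the name "arm64".
        return "arm64" if system == "Darwin" else "aarch64"
    if m in ("x64", "x86_64", "amd64"):
        return "x86_64"
    raise ValueError(f"unsupported architecture: {machine}")
```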
Unit test notes:
-
- test_gpu_basics.py: GPU residual quantizer: Debugged extensively with Matthijs. The problem is in the C++ -> Python conversion. The C++ side prints the right values, but when getting it back to Python, it is filled with junk data. It is only reproducible on CUDA 11.4.4 after switching channels. It is likely a compiler problem. We discussed, and resolved to create a C++ side unit test (so this diff creates TestGpuResidualQuantizer) to verify the functionality and disable the Python unit test, but leave it in the codebase with a comment. Matthijs made extensive notes in https://docs.google.com/document/d/1MjMdOpPgx-MArdrYJZCaQlRqlrhSj5Y1Z9lTyiab8jc/edit?usp=sharing .
- test_contrib.py: this now hangs forever and times out the runner for Windows on Python 3.12. It is skipped for now.
- test_mem_leak.cpp seems flaky. It sometimes fails, then passes with rerun.
Unfixed issues:
-
- Downloads sometimes fail with output like the text below. They pass on re-run.
```
libgomp-14.2.0-h77fa898_1.conda extraction failed
Warning: error libmamba Error when extracting package: Could not chdir info/recipe/parent/patches/0005-Hardcode-HAVE_ALIGNED_ALLOC-1-in-libstdc-v3-configur.patch
error libmamba Error when extracting package: Could not chdir info/recipe/parent/patches/0005-Hardcode-HAVE_ALIGNED_ALLOC-1-in-libstdc-v3-configur.patch
Warning: Found incorrect download: libgomp. Aborting
Found incorrect download: libgomp. Aborting
Warning:
```
Green build and tests for both build pull request and nightlies: https://github.com/facebookresearch/faiss/actions/runs/12956402963/job/36148818361
Reviewed By: asadoughi
Differential Revision: D68043874
fbshipit-source-id: b105a1e3e6272763ad9daab7fc6f05a79f01c9e2
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/4140
Auto retry build should have been made more reliable in D68032708, so reenable it.
Reviewed By: mengdilin
Differential Revision: D68636531
fbshipit-source-id: 00fe52bbd2a711d3a74b6febaac9763d384af0bf
Summary:
`gh run watch` seems to time out every now and then, crashing the retry build. I suspect this because whenever the retry build fails, it lacks the logging that would appear if `gh run watch` had finished.
Switching to `gh run view` and sleeping for 10 minutes between checks may resolve this issue, since it no longer watches continuously.
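A minimal sketch of the polling approach described above, assuming the run state is read with `gh run view <run-id> --json status,conclusion`. The function names, the poll limit, and the injectable `fetch`/`sleep` parameters are assumptions for illustration; the actual workflow does this in its shell step.

```python
import json
import subprocess
import time


def gh_run_status(run_id: str) -> dict:
    """Ask the GitHub CLI for the current status of a workflow run."""
    out = subprocess.run(
        ["gh", "run", "view", run_id, "--json", "status,conclusion"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def wait_for_run(run_id, fetch=gh_run_status, sleep=time.sleep,
                 interval_s=600, max_polls=36):
    """Poll until the run completes, sleeping between checks.

    Returns the run's conclusion (for example "success" or "failure").
    `max_polls` is a hypothetical safety limit so the loop cannot hang.
    """
    for _ in range(max_polls):
        info = fetch(run_id)
        if info["status"] == "completed":
            return info["conclusion"]
        # Not done yet: sleep instead of holding a long-lived watch open.
        sleep(interval_s)
    raise TimeoutError(f"run {run_id} did not complete after {max_polls} polls")
```

Each poll is a short, independent call, so one slow or dropped connection only costs a single check rather than the whole wait.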
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/4124
Test Plan:
https://github.com/gtwang01/faiss/actions/runs/12718383559/job/35456710369
Also, monitor nightlies to ensure it works properly
Reviewed By: mengdilin
Differential Revision: D68032708
Pulled By: gtwang01
fbshipit-source-id: 02f57239e6203a1d0accc09c48aa7071e553878f
Summary:
Small updates to the README files. A more detailed description will follow in a separate PR for the wiki.
Remove the cuVS conda CI checks.
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/4084
Reviewed By: mengdilin
Differential Revision: D67602013
Pulled By: mnorris11
fbshipit-source-id: f7c40440d278f00195bcad2dbdd2187325f40662
Summary:
Pin dependency version to get stable CI signal
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/4046
Reviewed By: mnorris11
Differential Revision: D66684832
Pulled By: junjieqi
fbshipit-source-id: 9749d328688034514d6f2315d313fbf8045405ee
Summary:
Remove the dependency on `raft::compiled` and modify GPU implementations to use cuVS backend in place of RAFT.
A deeper insight into the dependency:
FAISS gets the ANN algorithm implementations such as IVF-Flat and IVF-PQ from cuVS. RAFT is meant to be a lightweight C++ header-only template library that cuVS relies on for the more fundamental / low-level utilities. Some examples of these are RAFT's device mdarray and mdspan objects; the RAFT resource object (`raft::resource`) that takes care of the stream ordering of device functions; linear algebra functions such as mapping, reduction, BLAS routines etc. A lot of the cuVS functions take the RAFT mdspan objects as arguments (for example `raft::device_matrix_view`). Therefore FAISS relies on both cuVS and RAFT. FAISS gets RAFT headers through cuVS and uses them to create the function arguments that can be consumed by cuVS. Note that we are not explicitly linking FAISS against `raft::raft` or `raft::compiled`. Only the required headers are included and compiled rather than compiling the whole RAFT shared library. This is the reason we still see mentions of `raft` in FAISS.
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3549
Reviewed By: ramilbakhshyiev
Differential Revision: D62041013
Pulled By: asadoughi
fbshipit-source-id: 7230dcc06cf47baf95873adc1dec2adca4a8f82a
Summary:
- Called the hipify script at CMake configure time, removing the need for the user to run it.
- Now removes any .hip files left over when running the hipify script.
- Cleaned up the hipify script to remove redundancy.
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3962
Reviewed By: asadoughi, ramilbakhshyiev
Differential Revision: D64495550
Pulled By: mnorris11
fbshipit-source-id: 5547712a4e46fc18cf62346adb0395d0e5626399
Summary:
Sometime between Sept 25 and Oct 2, downloading and linking against the `openblas=*=*openmp*` package to run tests started causing a 4-7x slowdown. Link against the regular openblas package instead, which is not compiled with `USE_OPENMP=1`. We will set the number of openblas threads via the environment variable `OPENBLAS_NUM_THREADS`, according to https://github.com/OpenMathLib/OpenBLAS/wiki/Faq#multi-threaded
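As a rough illustration of the environment-variable approach, the sketch below builds an environment with `OPENBLAS_NUM_THREADS` set for a child test process. The helper name `openblas_test_env` and the thread count are assumptions; the CI sets the variable in its workflow configuration rather than in Python.

```python
import os


def openblas_test_env(num_threads: int, base=None) -> dict:
    """Return a copy of the environment with OPENBLAS_NUM_THREADS set.

    The variable is read when OpenBLAS is loaded, so it has to be in place
    before the test process starts. Hypothetical helper for illustration.
    """
    env = dict(os.environ if base is None else base)
    env["OPENBLAS_NUM_THREADS"] = str(num_threads)
    return env
```

The result could then be passed as `env=` to `subprocess.run` when launching the test command.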
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3918
Test Plan: SVE CI should finish within 40 minutes
Reviewed By: ramilbakhshyiev
Differential Revision: D64059860
Pulled By: mengdilin
fbshipit-source-id: 3ba2bda5fce5122f051421f459692f15ad5360a4
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3839
This is a prerequisite to fixing issue 3787 and an upgrade to a newer stable version.
Reviewed By: mengdilin
Differential Revision: D62284555
fbshipit-source-id: 946f7757eea36bdddc3f8bb7d8c16168e90fd063
Summary: ROCm does not require CUDA, this change stops installing it.
Reviewed By: mengdilin
Differential Revision: D62283602
fbshipit-source-id: 8fd0d770c5bd407b0c7bca7e92d754e05b5083da
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3836
This disables verbose output from apt-get and only outputs on errors to make the build output logs more readable.
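The "quiet unless it fails" idea can be illustrated generically in Python. This is not how the change itself is implemented (it adjusts how apt-get is invoked in the workflow); `run_quiet` is a hypothetical helper that only shows the pattern of buffering output and printing it on error.

```python
import subprocess
import sys


def run_quiet(cmd):
    """Run a command, printing its captured output only if it fails."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into the same buffer
        text=True,
    )
    if proc.returncode != 0:
        # Surface the full log only when there is something to debug.
        sys.stderr.write(proc.stdout)
    return proc.returncode
```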
Reviewed By: junjieqi
Differential Revision: D62278742
fbshipit-source-id: 524490ffd95fc1283f69797c0da57886e68206a6
Summary:
I noticed that by default, `conda install openblas` installs `libopenblas-pthreads` on our SVE CI. This can be problematic as described in https://github.com/facebookresearch/faiss/wiki/Troubleshooting#surprising-faiss-openmp-and-openblas-interaction
Updating the installation of openblas to be more specific and use the variant that works well with openmp.
With this change, the CI sees version `0.3.27-openmp_h1b0c31a_0` for openblas instead of the `pthreads` variant.
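A CI check for the installed variant could look like the sketch below, which reads the threading variant from the prefix of the build string. The function name is hypothetical, and it assumes the `<version>-<variant>_<hash>_<number>` shape seen in the example above.

```python
def openblas_threading(version_build: str) -> str:
    """Extract the threading variant from a '<version>-<build>' string.

    Assumes the build string starts with the variant name followed by an
    underscore, as in '0.3.27-openmp_h1b0c31a_0'.
    """
    _, _, build = version_build.partition("-")
    return build.split("_", 1)[0]
```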
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3776
Reviewed By: ramilbakhshyiev
Differential Revision: D61856775
Pulled By: mengdilin
fbshipit-source-id: 950bd68ba438d221b39d25b2d6e185bc61512243
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3808
Previously, we only uploaded the results when the tests passed. It is useful to also upload the results of failing tests for debugging.
Reviewed By: ramilbakhshyiev
Differential Revision: D61972433
fbshipit-source-id: 1825e6ebea56279e4d8d8c6480841985c1626674
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3786
The ROCm build successfully passes all but 2 GPU tests, and we want to enable the passing tests on CI while skipping the 2 failing tests to make progress. The 2 failing tests fail specifically on the hardware type that we use for our runners, and the AMD team is actively working on root causing it and providing a fix:
`TestGpuIndexIVFPQ.Query_L2_MMCodeDistance`
`TestGpuIndexIVFPQ.Query_IP_MMCodeDistance`
Reviewed By: asadoughi
Differential Revision: D61688657
fbshipit-source-id: 3fedfcf22a0ccf40ac8aff033e8bc09c4eb0cbd5
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3772
It looks like there are many failures on the retry build workflow, but these are mainly due to retry attempts with the --failed flag being unable to rerun workflows that don't have any failed jobs.
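One way to avoid the spurious failures described above is to check for failed jobs before attempting a rerun. The sketch below assumes job data shaped like the output of `gh run view <run-id> --json jobs`, where each job carries a `conclusion` field. The function name and the decision to treat only `"failure"` as failed are assumptions, not the PR's actual fix.

```python
from typing import Optional


def rerun_command(run_id: str, jobs: list) -> Optional[list]:
    """Return the `gh run rerun --failed` command, or None if nothing failed.

    `jobs` is a list of dicts with a "conclusion" key. Returning None lets
    the caller skip the rerun instead of invoking --failed on a run that
    has no failed jobs.
    """
    has_failed = any(job.get("conclusion") == "failure" for job in jobs)
    if not has_failed:
        return None
    return ["gh", "run", "rerun", run_id, "--failed"]
```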
Reviewed By: kuarora, junjieqi, ramilbakhshyiev
Differential Revision: D61489426
fbshipit-source-id: 6dcef6ba422634bb333e44a5b12c74c5d3b3df8f
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3747
This change converts the ROCm build to run inside containers and updates it to run on AMD GPU based runners. We are still working with the AMD team to resolve test failures before enabling those.
Differential Revision: D61049115
fbshipit-source-id: 28274e0bde795f99b3d78711beaf9b3ed3c5e66c
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3744
gpg is needed for ROCm builds but does not come with containerized builds. This change adds installation of gpg.
Reviewed By: junjieqi
Differential Revision: D61007840
fbshipit-source-id: 6322112803866dff57637bea290dc032e2bf41ad
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3743
This fixes containerized builds that will be needed for ROCm.
Reviewed By: junjieqi
Differential Revision: D61007764
fbshipit-source-id: 11fa8dc3641a85d4c220832bedf0f6d62ae49426
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3718
This workflow needs to be pushed first before it can be called from the build workflow.
Reviewed By: ramilbakhshyiev
Differential Revision: D60697701
fbshipit-source-id: 40cb6b7006dae8293e966cc2cbb0ebda5d606045
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3737
This moves nightly builds away from when most of the team is working to avoid exhausting limited resources like custom hardware / specialized hardware.
Reviewed By: bshethmeta
Differential Revision: D60976671
fbshipit-source-id: 1a8521379654a06a793fda0ae3f3bd1bf6fa8bf6
Summary:
The TestPartitioning.TestPartitioningBigRange test case fails on gcc version 13.2. We can avoid this by requiring gcc version 11.2 where the test case works.
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3655
Test Plan: Check github workflows and test branches
Reviewed By: ramilbakhshyiev
Differential Revision: D59988036
Pulled By: gtwang01
fbshipit-source-id: ae6d7f7888c9d7a2e59f557e05dbd4f318983668
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3725
This step is necessary for both of the builds with the newer gxx_linux package version. ROCm is already using this symlinking and this change expands it to RAFT as well.
Reviewed By: mengdilin
Differential Revision: D60830977
fbshipit-source-id: fe95a6580b3866e17b56d542509405e93a3ff453