<divclass="line"><aname="l00006"></a><spanclass="lineno"> 6</span> <spanclass="comment"> * This source code is licensed under the CC-by-NC license found in the</span></div>
<divclass="line"><aname="l00007"></a><spanclass="lineno"> 7</span> <spanclass="comment"> * LICENSE file in the root directory of this source tree.</span></div>
<divclass="line"><aname="l00010"></a><spanclass="lineno"> 10</span> <spanclass="comment">// Copyright 2004-present Facebook. All Rights Reserved.</span></div>
<divclass="line"><aname="l00025"></a><spanclass="lineno"> 25</span> <spanclass="preprocessor"></span><spanclass="comment">// dummy value for host compiler</span></div>
<divclass="line"><aname="l00030"></a><spanclass="lineno"> 30</span> <spanclass="comment">// Technically, memory barriers are required via the CUDA</span></div>
<divclass="line"><aname="l00031"></a><spanclass="lineno"> 31</span> <spanclass="comment">// programming model, since warp synchronous programming no longer</span></div>
<divclass="line"><aname="l00032"></a><spanclass="lineno"> 32</span> <spanclass="comment">// is guaranteed.</span></div>
<divclass="line"><aname="l00034"></a><spanclass="lineno"> 34</span> <spanclass="comment">// There are two components to it:</span></div>
<divclass="line"><aname="l00035"></a><spanclass="lineno"> 35</span> <spanclass="comment">// -a barrier known to the compiler such that the compiler will not</span></div>
<divclass="line"><aname="l00036"></a><spanclass="lineno"> 36</span> <spanclass="comment">// schedule loads and stores across the barrier;</span></div>
<divclass="line"><aname="l00037"></a><spanclass="lineno"> 37</span> <spanclass="comment">// -a HW-level barrier that guarantees that writes are seen in the</span></div>
<divclass="line"><aname="l00040"></a><spanclass="lineno"> 40</span> <spanclass="comment">// However, __threadfence_block() is a stronger constraint than what</span></div>
<divclass="line"><aname="l00041"></a><spanclass="lineno"> 41</span> <spanclass="comment">// we really want out of the hardware: a warp-wide barrier.</span></div>
<divclass="line"><aname="l00043"></a><spanclass="lineno"> 43</span> <spanclass="comment">// In current hardware, it appears that warp synchronous programming</span></div>
<divclass="line"><aname="l00044"></a><spanclass="lineno"> 44</span> <spanclass="comment">// is a reality; by all tests it appears safe and race-free.</span></div>
<divclass="line"><aname="l00046"></a><spanclass="lineno"> 46</span> <spanclass="comment">// However, understandably it may not be in the future (based on</span></div>
<divclass="line"><aname="l00047"></a><spanclass="lineno"> 47</span> <spanclass="comment">// what Nvidia says in the Kepler guide, it may change depending</span></div>
<divclass="line"><aname="l00048"></a><spanclass="lineno"> 48</span> <spanclass="comment">// upon compiler/toolchain issues or future hardware).</span></div>
<divclass="line"><aname="l00050"></a><spanclass="lineno"> 50</span> <spanclass="comment">// Removing the fence results in 10%+ faster performance.</span></div>
<divclass="line"><aname="l00051"></a><spanclass="lineno"> 51</span> <spanclass="comment">// However, we are judicious as to where we insert the fence, so if</span></div>
<divclass="line"><aname="l00052"></a><spanclass="lineno"> 52</span> <spanclass="comment">// this reality ever changes, uncommenting this will result in CUDA</span></div>
<divclass="line"><aname="l00055"></a><spanclass="lineno"> 55</span> <spanclass="comment">// FIXME: we should probably qualify as volatile as well, since the</span></div>
<divclass="line"><aname="l00056"></a><spanclass="lineno"> 56</span> <spanclass="comment">// compiler could technically preserve values across loops? This</span></div>
<divclass="line"><aname="l00057"></a><spanclass="lineno"> 57</span> <spanclass="comment">// seems very impractical for the compiler to do, however.</span></div>