<divclass="line"><aname="l00005"></a><spanclass="lineno"> 5</span> <spanclass="comment"> * This source code is licensed under the BSD+Patents license found in the</span></div>
<divclass="line"><aname="l00006"></a><spanclass="lineno"> 6</span> <spanclass="comment"> * LICENSE file in the root directory of this source tree.</span></div>
<divclass="line"><aname="l00009"></a><spanclass="lineno"> 9</span> <spanclass="comment">// Copyright 2004-present Facebook. All Rights Reserved.</span></div>
<divclass="line"><aname="l00024"></a><spanclass="lineno"> 24</span> <spanclass="preprocessor"></span><spanclass="comment">// dummy value for host compiler</span></div>
<divclass="line"><aname="l00029"></a><spanclass="lineno"> 29</span> <spanclass="comment">// Technically, memory barriers are required via the CUDA</span></div>
<divclass="line"><aname="l00030"></a><spanclass="lineno"> 30</span> <spanclass="comment">// programming model, since warp synchronous programming no longer</span></div>
<divclass="line"><aname="l00031"></a><spanclass="lineno"> 31</span> <spanclass="comment">// is guaranteed.</span></div>
<divclass="line"><aname="l00033"></a><spanclass="lineno"> 33</span> <spanclass="comment">// There are two components to it:</span></div>
<divclass="line"><aname="l00034"></a><spanclass="lineno"> 34</span> <spanclass="comment">// -a barrier known to the compiler such that the compiler will not</span></div>
<divclass="line"><aname="l00035"></a><spanclass="lineno"> 35</span> <spanclass="comment">// schedule loads and stores across the barrier;</span></div>
<divclass="line"><aname="l00036"></a><spanclass="lineno"> 36</span> <spanclass="comment">// -a HW-level barrier that guarantees that writes are seen in the</span></div>
<divclass="line"><aname="l00039"></a><spanclass="lineno"> 39</span> <spanclass="comment">// However, __threadfence_block() is a stronger constraint than what</span></div>
<divclass="line"><aname="l00040"></a><spanclass="lineno"> 40</span> <spanclass="comment">// we really want out of the hardware: a warp-wide barrier.</span></div>
<divclass="line"><aname="l00042"></a><spanclass="lineno"> 42</span> <spanclass="comment">// In current hardware, it appears that warp synchronous programming</span></div>
<divclass="line"><aname="l00043"></a><spanclass="lineno"> 43</span> <spanclass="comment">// is a reality; by all tests it appears safe and race-free.</span></div>
<divclass="line"><aname="l00045"></a><spanclass="lineno"> 45</span> <spanclass="comment">// However, understandably it may not be in the future (based on</span></div>
<divclass="line"><aname="l00046"></a><spanclass="lineno"> 46</span> <spanclass="comment">// what Nvidia says in the Kepler guide, it may change depending</span></div>
<divclass="line"><aname="l00047"></a><spanclass="lineno"> 47</span> <spanclass="comment">// upon compiler/toolchain issues or future hardware).</span></div>
<divclass="line"><aname="l00049"></a><spanclass="lineno"> 49</span> <spanclass="comment">// Removing the fence results in 10%+ faster performance.</span></div>
<divclass="line"><aname="l00050"></a><spanclass="lineno"> 50</span> <spanclass="comment">// However, we are judicious as to where we insert the fence, so if</span></div>
<divclass="line"><aname="l00051"></a><spanclass="lineno"> 51</span> <spanclass="comment">// this reality ever changes, uncommenting this will result in CUDA</span></div>
<divclass="line"><aname="l00054"></a><spanclass="lineno"> 54</span> <spanclass="comment">// FIXME: we should probably qualify as volatile as well, since the</span></div>
<divclass="line"><aname="l00055"></a><spanclass="lineno"> 55</span> <spanclass="comment">// compiler could technically preserve values across loops? This</span></div>
<divclass="line"><aname="l00056"></a><spanclass="lineno"> 56</span> <spanclass="comment">// seems very impractical for the compiler to do, however.</span></div>