* support fp16 training
* Use compiled training program
* Change how ips is timed.
* Use DALI
* add pure fp16 training
* fix a bug where the fuse pass was not used with pure fp16 training.
* modify code per review comments
* modify loss so that a different loss is used with pure fp16 training.
* remove some fluid API
* add static optimizer.