Without training a new model, which model optimization technique for reducing latency should you try first?

You are an ML engineer at a mobile gaming company. A data scientist on your team recently trained a TensorFlow model, and you are responsible for deploying this model into a mobile application. You discover that the inference latency of the current model doesn’t meet production requirements. You need to reduce the inference time by 50%, and you are willing to accept a small decrease in model accuracy in order to reach the latency requirement.

Without training a new model, which model optimization technique for reducing latency should you try first?
A. Weight pruning
B. Dynamic range quantization
C. Model distillation
D. Dimensionality reduction

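Answer: B

Dynamic range quantization is a post-training technique, so it can be applied to the existing model at conversion time without retraining. Model distillation (C) requires training a new student model, weight pruning (A) normally needs fine-tuning and does not by itself guarantee lower latency, and dimensionality reduction (D) changes the model's inputs, which also means retraining.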
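
To make option B concrete: dynamic range quantization stores the trained float weights as 8-bit integers plus a scale factor, and activations are quantized on the fly at inference time. In TensorFlow Lite this is enabled on an already-trained model by setting `converter.optimizations = [tf.lite.Optimize.DEFAULT]` on a `tf.lite.TFLiteConverter` before calling `convert()`. The sketch below is not that converter. It is a minimal pure-Python illustration of the underlying weight mapping, assuming symmetric per-tensor scaling; the function names and sample weights are hypothetical.

```python
# Illustrative sketch only: symmetric per-tensor int8 quantization of weights,
# the core idea behind post-training dynamic range quantization.
# Function names and sample values are hypothetical, not a TensorFlow API.

def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] plus one float scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale would work.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]


weights = [0.82, -1.27, 0.05, 0.0, 0.4]  # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest integer step bounds the error by half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("quantized:", q)
print("scale:", scale)
print("max error within half a step:", max_error <= scale / 2 + 1e-12)
```

Each weight now needs one byte instead of four, and no training data or retraining is involved. The cost is the bounded rounding error shown above, which matches the small accuracy loss the scenario is willing to accept.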