First, a deserialization task is run on the input bytes (Array[Byte]).
Then, model prediction is run on the deserialized inputs
as soon as a vacant instance is available (the total number of instances is numThreads).
Otherwise, the call blocks until an instance is released.
Finally, the prediction result is serialized to Array[Byte] according to BigDL.proto.
input bytes, which will be deserialized by BigDL.proto
output bytes, which are serialized by BigDL.proto
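As a rough sketch, this round trip can be pictured as three composed steps. Everything below is illustrative: Activity, the codec stubs, and predictActivity are placeholder stand-ins, not real BigDL APIs.

object ByteRoundTripSketch {
  // Placeholder payload type standing in for BigDL's Activity.
  type Activity = Any

  // Illustrative codec stubs standing in for the BigDL.proto (de)serializer.
  def deserialize(bytes: Array[Byte]): Activity = new String(bytes, "UTF-8")
  def serialize(activity: Activity): Array[Byte] = activity.toString.getBytes("UTF-8")

  // Stand-in for the pooled model prediction step.
  def predictActivity(input: Activity): Activity = s"prediction($input)"

  // The three-step flow: deserialize, predict, then serialize the result.
  def predict(inputBytes: Array[Byte]): Array[Byte] =
    serialize(predictActivity(deserialize(inputBytes)))
}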
Model prediction is run with the input Activity as soon as
a vacant instance is available (the size of the pool is numThreads).
Otherwise, the call blocks until an instance is released.
Outputs are deep-copied after model prediction, so they are not mutated by subsequent calls.
input Activity, which can be a Tensor or a Table(key, Tensor)
output Activity, which can be a Tensor or a Table(key, Tensor)
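The deep copy matters because a pooled instance typically reuses its internal output buffer, so an un-copied result would be silently overwritten by the next prediction on that instance. A minimal sketch of that failure mode, with purely illustrative names (not BigDL APIs):

// ReusableModel mimics a model instance that reuses an internal output buffer.
final class ReusableModel {
  private val buffer = new Array[Float](4)
  def forward(x: Float): Array[Float] = {
    java.util.Arrays.fill(buffer, x * 2) // overwrites the previous result
    buffer
  }
}

object DeepCopySketch extends App {
  val model  = new ReusableModel
  val unsafe = model.forward(1f)          // aliases the internal buffer
  val safe   = model.forward(1f).clone()  // deep copy: stays valid
  model.forward(3f)                       // the next call mutates the buffer
  println(unsafe.mkString(", "))          // 6.0, 6.0, 6.0, 6.0 -- silently changed
  println(safe.mkString(", "))            // 2.0, 2.0, 2.0, 2.0 -- unaffected
}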
Thread-safe Prediction Service for Concurrent Calls
In this service, concurrency is kept no greater than numThreads by a BlockingQueue, which contains the available model instances.

numThreads model instances sharing weights/bias are put into the BlockingQueue during initialization.

When the predict method is called, the service tries to take an instance from the BlockingQueue, which means that if all instances are busy serving, the predicting request is blocked until some instance is released.

If an exception is caught during prediction, a scalar Tensor[String] containing the thrown message is returned.