We are looking to set up a pipeline with two layers. We are thinking of creating a DAG that combines the two layers below, but we are still not sure which tools to use for the data science layer.
- Data generation layer: generation of a BigQuery table containing hundreds of thousands of rows (one per product) and about 26 fields, based on SQL logic run in BigQuery. Bear in mind that the number of rows might grow to millions in the near future.
- Data science layer: prediction of the possible outcomes for each product by a data science model written in Python (for each product we need to predict its next possible stages). The model is computationally heavy: it requires fitting 20 types of Gaussian mixtures, and its performance will depend on the number of input products and output outcomes.
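To make the data science layer more concrete, here is a minimal sketch of the sampling step only, using the Python standard library. It assumes the mixtures are already fitted (in practice something like scikit-learn's `GaussianMixture` could do the fitting). The class `Mixture1D`, the function `predict_outcomes`, the mixture types `"early"`/`"late"`, and all parameter values are hypothetical placeholders, not our real model.

```python
import random
from dataclasses import dataclass


@dataclass
class Mixture1D:
    """Hypothetical 1-D Gaussian mixture with already-fitted parameters."""
    weights: list
    means: list
    stds: list

    def sample(self, n, rng):
        # Pick a component for each draw according to the mixture weights,
        # then draw from that component's normal distribution.
        components = rng.choices(range(len(self.weights)), weights=self.weights, k=n)
        return [rng.gauss(self.means[c], self.stds[c]) for c in components]


def predict_outcomes(products, mixtures, n_samples, seed=0):
    """Return n_samples simulated outcomes for each product.

    products: dict of product_id -> mixture type (assumed schema)
    mixtures: dict of mixture type -> Mixture1D
    """
    rng = random.Random(seed)
    return {
        pid: mixtures[mtype].sample(n_samples, rng)
        for pid, mtype in products.items()
    }


# Made-up parameters, for illustration only.
mixtures = {
    "early": Mixture1D([0.7, 0.3], [1.0, 5.0], [0.5, 1.0]),
    "late": Mixture1D([1.0], [10.0], [2.0]),
}
products = {"p1": "early", "p2": "late", "p3": "early"}
outcomes = predict_outcomes(products, mixtures, n_samples=4)
```

The cost of this step grows with both the number of products and the number of samples per product, which is why both dimensions matter for the tool choice.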
The solution would run hourly, every day. Our criteria for the data science layer of the pipeline, in priority order, are as follows:
- Inference: the ability to scale the model horizontally (increase the number of samples x) or vertically (increase the number of products n) in order to produce x possible outcomes for each product. The scaling will be in the hands of the data scientist.
- Costs
- A model registry (similar to an image registry) that keeps a history of the model artifacts that can be deployed
- Training whilst doing inference
- Letting the end user choose the input and output (one or more specific product IDs as input, and the number of samples to output for those products)
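As an illustration of the inference and user-choice criteria above, here is a rough sketch of the kind of entry point we have in mind, again using only the standard library. The caller passes the product IDs and the number of samples, and the IDs are split into batches that run in parallel workers. The names `run_inference`, `predict_batch`, and `chunked` are hypothetical, and the placeholder model (standard normal draws seeded per product) stands in for the real one.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def chunked(ids, size):
    """Split a list of product IDs into consecutive batches of at most `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def predict_batch(product_ids, n_samples):
    """Placeholder model: n_samples standard normal draws per product.

    Seeding per product keeps results independent of how IDs are batched.
    """
    results = {}
    for pid in product_ids:
        rng = random.Random(pid)
        results[pid] = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    return results


def run_inference(product_ids, n_samples, batch_size=1000, max_workers=4):
    """Run the placeholder model for the requested products and sample count."""
    batches = chunked(list(product_ids), batch_size)
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(lambda b: predict_batch(b, n_samples), batches):
            results.update(partial)
    return results


# The end user picks specific product IDs and the number of samples.
selected = run_inference([101, 102, 103, 104, 105], n_samples=3, batch_size=2)
```

Threads are used here only to show the shape of the interface; for a CPU-heavy model the batches would more likely go to separate processes or machines, which is the part we need a tool for.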