Best data science (DS) solution for the use case below

de_ma_21 · 07-25-2023 03:54 AM

We are looking to set up a pipeline that will contain two layers. We are thinking of creating a DAG that combines the two layers below, but we are still not sure which tools to use for the data science layer.

Data generation layer: generation of a table in BigQuery containing hundreds of thousands of rows (one per product) and around 26 fields, based on SQL logic (BigQuery), bearing in mind that the number of rows might increase to millions in the near future.
Data science layer: prediction of the possible outcomes of each product by a data science model written in Python (for each product we need to predict what its next possible stages are). The model does a lot of computation, which requires 20 types of Gaussian mixture fittings, and the performance will depend on the number of input products and output outcomes.
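
To make the data science layer a bit more concrete, here is a minimal sketch of drawing several possible outcomes per product from a Gaussian mixture. It uses only the Python standard library. The `Mixture` type, the `sample_outcomes` and `predict` functions, and the mixture parameters are all hypothetical stand-ins for illustration, not the actual model; the real model fits 20 types of mixtures and would presumably rely on a dedicated library for the fitting step.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class Mixture:
    """Hypothetical 1-D Gaussian mixture: one weight, mean and std per component."""
    weights: list[float]
    means: list[float]
    stds: list[float]


def sample_outcomes(mixture: Mixture, n_samples: int, rng: random.Random) -> list[float]:
    """Draw n_samples values: pick a component by weight, then sample from it."""
    outcomes = []
    components = range(len(mixture.weights))
    for _ in range(n_samples):
        k = rng.choices(components, weights=mixture.weights)[0]
        outcomes.append(rng.gauss(mixture.means[k], mixture.stds[k]))
    return outcomes


def predict(product_ids: list[str], mixture: Mixture, n_samples: int, seed: int = 0) -> dict[str, list[float]]:
    """Return n_samples possible outcomes for each product id (assumed interface)."""
    rng = random.Random(seed)
    return {pid: sample_outcomes(mixture, n_samples, rng) for pid in product_ids}


# Illustrative parameters only; a real mixture would be fitted on the BigQuery table.
example_mixture = Mixture(weights=[0.7, 0.3], means=[0.0, 5.0], stds=[1.0, 0.5])
example_result = predict(["p1", "p2"], example_mixture, n_samples=3)
```

In this sketch the cost grows with the number of products times the number of samples, which is why both dimensions matter for scaling.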

The solution would run hourly, every day. Our criteria for the data science layer of the pipeline, in order of priority, are as follows:

Inference: possibility of making the model scale horizontally (increase the number of samples x) or vertically (increase the number of products n) in order to produce x possible outcomes for each product; the scaling will be in the hands of the data scientist
Costs
Possibility of having a model registry (similar to an image registry) that keeps a history of the model artifacts that can be deployed
Training whilst doing inference
Possibility of giving the end user a choice of input and output (one or more specific product IDs as input, and the number of samples to produce as output for those products)
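
As a rough illustration of the inference and user-choice criteria above, the sketch below takes a list of product IDs and a number of samples, splits the products into batches, and fans the batches out to a worker pool. Everything here is an assumption for illustration: `InferenceRequest`, `make_batches`, `run_inference` and `dummy_predict_batch` are hypothetical names, and the thread pool is only a stand-in for whatever workers the chosen platform provides (CPU-bound fitting in Python would need processes or separate machines to actually run in parallel).

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class InferenceRequest:
    """What the end user chooses: which products, and how many samples each."""
    product_ids: Sequence[str]
    n_samples: int = 10

    def __post_init__(self) -> None:
        if not self.product_ids:
            raise ValueError("at least one product id is required")
        if self.n_samples < 1:
            raise ValueError("n_samples must be >= 1")


def make_batches(items: Sequence[str], batch_size: int) -> list[list[str]]:
    """Split the product ids into consecutive batches of at most batch_size."""
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


def run_inference(
    request: InferenceRequest,
    predict_batch: Callable[[list[str], int], dict[str, list[float]]],
    batch_size: int = 1000,
    max_workers: int = 4,
) -> dict[str, list[float]]:
    """Run predict_batch on every batch and merge the per-product results."""
    batches = make_batches(request.product_ids, batch_size)
    results: dict[str, list[float]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(lambda b: predict_batch(b, request.n_samples), batches):
            results.update(partial)
    return results


def dummy_predict_batch(batch: list[str], n_samples: int) -> dict[str, list[float]]:
    """Placeholder for the real model: returns n_samples zeros per product."""
    return {pid: [0.0] * n_samples for pid in batch}
```

For example, `run_inference(InferenceRequest(["a", "b", "c"], n_samples=2), dummy_predict_batch, batch_size=2)` processes two batches and returns two placeholder outcomes for each of the three products.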