Which one to choose? Job vs. Deployment for batch data preprocessing

Situation:
Say I have 1 million sentences in S3. My task is simple: extract the word count and character length of each sentence.

OPTION # 1
The lecture says this batch-processing workload can be accomplished with a Job resource. But how? If each of the 10 parallel pods in a Job processes the same batch of sentences, how does that speed anything up? Each pod of the Job should run on a different batch of sentences and then persist its results on my host/physical machine; that is how parallel processing is attained. So how can this be done with a Job?
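(For what it's worth, Kubernetes Indexed Jobs, `completionMode: Indexed`, do exactly this kind of batch assignment: Kubernetes injects a `JOB_COMPLETION_INDEX` environment variable into each pod, so pod i can deterministically claim batch i without coordination. A minimal sketch of the worker side; the batch size and data layout are illustrative assumptions:)

```python
import os

# Worker-side sketch for a Kubernetes Indexed Job (completionMode: Indexed).
# Kubernetes sets JOB_COMPLETION_INDEX in each pod's environment.
# BATCH_SIZE and the storage layout are assumptions for illustration.
BATCH_SIZE = 500

def batch_range(index, batch_size=BATCH_SIZE):
    """Map a pod's completion index to the half-open slice of sentences it owns."""
    start = index * batch_size
    return start, start + batch_size

index = int(os.environ.get("JOB_COMPLETION_INDEX", "0"))
start, end = batch_range(index)
print(f"pod {index} processes sentences [{start}, {end})")
# Each pod would then fetch only its own slice from S3 and persist its
# results somewhere durable (a shared volume, or back to S3).
```

(With `completions: 2000` and `parallelism: 10` in the Job spec, ten pods run at a time, and the 2000 indices cover all 1 million sentences at 500 per pod with no overlap.)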

OPTION # 2: (self conceived)
Instead of setting this up as a Job, would it be better to set up preprocessing as a ClusterIP Service that exposes an autoscaled Deployment, e.g. http://preprocess-service:6666? Then I make HTTP requests such that each POST request carries one batch of 500 sentences, and the output of those pods is persisted on the host.

import requests

# 500 sentences per POST x 20,000 requests = 1 million sentences total
for i in range(20000):
    data = batches[i]  # each JSON payload holds one batch of 500 sentences
    response = requests.post("http://preprocess-service:6666", json=data)

Is the 2nd option a valid approach as well? Which option would you prefer? Thanks!

IIRC, this is a class of question you’ve asked before :slight_smile: I think for either case, you need to imitate what a streaming-type environment does: create a list or map of batches (perhaps as offsets into the file of sentences), and have some sort of ACK protocol for marking when a batch is successfully handled. Then, like streaming processing, you assign a batch to a worker and either mark it “done” or reassign it to a new worker, logging the errors.
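The batch-map-plus-ACK idea could be sketched like this (the class and method names are illustrative, not any real library):

```python
from collections import deque

class BatchTracker:
    """Track batches through pending -> in-flight -> done, re-queueing failures."""

    def __init__(self, n_batches):
        self.pending = deque(range(n_batches))  # batch ids not yet assigned
        self.in_flight = set()                  # assigned, awaiting ACK
        self.done = set()                       # acknowledged as complete

    def assign(self):
        """Hand the next unprocessed batch to a worker, or None if drained."""
        if not self.pending:
            return None
        b = self.pending.popleft()
        self.in_flight.add(b)
        return b

    def ack(self, b):
        """Worker confirmed success: mark the batch done."""
        self.in_flight.discard(b)
        self.done.add(b)

    def fail(self, b):
        """Worker failed or timed out: put the batch back for reassignment."""
        self.in_flight.discard(b)
        self.pending.append(b)
```

The key property is that a batch only leaves the system via `ack`; anything else eventually gets reassigned, which is the error recovery the do-it-yourself loop lacks.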
Your cook-it-yourself version does not appear to have much space for error recovery; depending upon how reliable your processing technique is, this might or might not be a problem. But if it's pretty reliable (the number of batches lost is very small), then you could use a Deployment to hold the workers, and use a Service to distribute requests among the workers. Note that you'd probably need to throttle the feeder process, however.
The throttling problem is why the approach in the docs is probably better, but as I've said before, I haven't tried this myself, so YMMV.
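One simple way to throttle the feeder, assuming the setup in option 2: cap the number of requests in flight with a bounded thread pool instead of firing 20,000 POSTs as fast as the loop can run. Here `post_batch` is a stand-in for the real `requests.post` call, and the limit of 10 is an illustrative assumption:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def post_batch(batch_id):
    # Stand-in for the real call:
    #   requests.post("http://preprocess-service:6666", json=batches[batch_id])
    return batch_id

# At most MAX_IN_FLIGHT requests run concurrently; the rest queue up
# inside the pool, so the service is never hit by all batches at once.
MAX_IN_FLIGHT = 10

with ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT) as pool:
    futures = [pool.submit(post_batch, i) for i in range(100)]
    results = sorted(f.result() for f in as_completed(futures))
```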