Situation:
Say I have 1 million sentences in S3. My task is simple: extract the number of words and the length of each sentence.
OPTION # 1
The lecture says this (i.e., a batch-processing workload) can be accomplished using the Job resource. But how? If each of the 10 parallel Pods in a Job processes the same batch of sentences, how does that speed anything up? Each Pod of the Job should run on a different batch of sentences and then persist the results on my host/physical machine; that is how parallel processing would actually be achieved… So how can this be done with a Job?
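Is something like the following what is meant? This is only my rough sketch: I assume the Job is created with completions: 10, parallelism: 10 and completionMode: Indexed, so that Kubernetes injects a unique JOB_COMPLETION_INDEX (0–9) into each Pod and each Pod fetches its own shard. The bucket/key names are placeholders I made up, and I write results back to S3 instead of the host.

# Worker sketch for an Indexed Job (completions: 10, parallelism: 10,
# completionMode: Indexed). Kubernetes sets JOB_COMPLETION_INDEX to a
# unique value 0..9 in each Pod, so each Pod processes a different shard.
# Bucket and key names are hypothetical placeholders.
import json
import os

import boto3

index = int(os.environ["JOB_COMPLETION_INDEX"])   # unique per Pod

s3 = boto3.client("s3")
shard = s3.get_object(Bucket="my-sentences", Key=f"shard-{index}.json")
sentences = json.loads(shard["Body"].read())      # ~100,000 sentences per shard

results = [{"num_words": len(s.split()), "length": len(s)} for s in sentences]

s3.put_object(
    Bucket="my-results",
    Key=f"result-{index}.json",
    Body=json.dumps(results).encode("utf-8"),
)

(Writing to a PersistentVolume mounted into each Pod would work the same way if the results really need to land on the host.)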
OPTION # 2: (self-conceived)
Instead of setting this up as a Job, would it be better to set up preprocessing as a ClusterIP Service that exposes an autoscaled Deployment, e.g. http://preprocess-service:6666? Then I would make HTTP requests such that each POST request carries 1 batch of 500 sentences to http://preprocess-service:6666, and the output of those Pods is persisted on the host. Roughly:
import requests

# sentences = full list of 1M sentences; 500 per batch x 20,000 batches = 1 million
for i in range(20000):
    batch = sentences[i * 500 : (i + 1) * 500]
    response = requests.post("http://preprocess-service:6666", json={"sentences": batch})
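And each Pod behind the Service would run something like this (Flask and the "sentences" payload key are just my assumptions, matching the loop above):

# Minimal request handler for each Pod of the autoscaled Deployment:
# accept one batch of sentences, return word count and length for each.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/", methods=["POST"])
def preprocess():
    sentences = request.get_json()["sentences"]   # one batch of 500 sentences
    return jsonify([{"num_words": len(s.split()), "length": len(s)}
                    for s in sentences])

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=6666)            # serving on 6666 for simplicity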
Is the 2nd option a valid approach as well? Which option would you prefer? Thanks!