Cost-efficient batch processing in Amazon cloud with deadline awareness
Amazon spot instances have become a popular way to save cost in the cloud. Spot instances are prone to abrupt termination whenever the spot market price exceeds the bid price. In this paper, spot instances are used in the task instance group of an Amazon Elastic MapReduce (EMR) cluster to process batch jobs with deadlines. Amazon EMR makes it convenient to process Big Data with the aid of the Hadoop framework. However, the intermediate results held on the task nodes of the cluster are lost if the spot instances are terminated, which can delay processing. Cost efficiency can be realized by exploiting the non-real-time nature of batch computing for Big Data. Two algorithms are devised for achieving cost-efficient processing in Hadoop MapReduce. Both algorithms process data in divisions such that abrupt termination of spot instances only affects the last division. Based on monitoring the progress at given checkpoints, the task group's capacity is resized to complete the processing within the deadline. Progress is measured in terms of the number of completed work divisions. The first algorithm begins with an initially estimated number of spot instances. To complete processing of all data in time, on-demand instances are deployed after a certain threshold time. The second algorithm starts with a higher number of spot instances than required to complete the work within the given deadline. Therefore, it has a higher chance of relying solely on spot instances during the whole execution of the batch job. On-demand instances are deployed only in case of slow progress caused by termination of the spot instances combined with subsequent unsuccessful bids. The experiments show that both algorithms are able to minimize the processing cost. The second algorithm reduces the cost further in most cases.
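The checkpoint-based resizing described above can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes that throughput scales linearly with the number of task instances, and the function names `required_instances` and `on_demand_topup` are hypothetical names introduced only for this illustration.

```python
import math


def required_instances(total_divisions, completed_divisions,
                       elapsed, deadline, instances_so_far):
    """Estimate the task-group size needed to finish by the deadline.

    Assumption (not from the paper): every instance completes divisions
    at the same rate, so throughput is linear in the instance count.
    """
    remaining = total_divisions - completed_divisions
    if remaining <= 0:
        return 0  # all divisions are done, no capacity needed
    time_left = deadline - elapsed
    if time_left <= 0:
        raise ValueError("deadline has already passed")
    if completed_divisions <= 0 or elapsed <= 0 or instances_so_far <= 0:
        raise ValueError("no observed progress to extrapolate from")
    # Observed rate per instance = completed / (elapsed * instances).
    # Needed instances = remaining / (rate per instance * time left).
    return math.ceil((remaining * elapsed * instances_so_far)
                     / (completed_divisions * time_left))


def on_demand_topup(required, running_spot):
    """On-demand instances to add when spot capacity alone falls short."""
    return max(0, required - running_spot)


# Checkpoint at t = 60 of a 120-unit deadline: 40 of 100 divisions
# were completed by 4 instances.
needed = required_instances(100, 40, 60, 120, 4)   # 6
extra = on_demand_topup(needed, running_spot=4)    # 2
```

In this sketch the shortfall is covered entirely by on-demand instances. Whether to bid for more spot instances first, and when the threshold time is reached, are policy choices that the two algorithms make differently.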