

Reliability Evaluation of a Parallel Job with Real-Time Redundant Computing

Volume 14, Number 7, July 2018, pp. 1487-1492
DOI: 10.23940/ijpe.18.07.p12.14871492

Xiwei Qiu, Liang Luo, Sa Meng, and Han Xu

University of Electronic Science and Technology of China, Chengdu, 611731, China

(Submitted on April 14, 2018; Revised on May 23, 2018; Accepted on June 22, 2018)


In network systems, a job with a large work requirement is usually divided into multiple tasks to achieve parallel computing. However, if any task fails because of a random task failure or a server failure, the entire job cannot be completed. In traditional redundant computing, copies of tasks are initiated whenever tasks are found to have failed, which is inefficient from a performance perspective. Real-time parallel computing, which runs tasks and their copies simultaneously, is more efficient. In this paper, to evaluate the reliability of such a job, we first describe a complicated parallel and redundant computing environment as multiple minimal-job spanning trees (MJSTs) consisting of tasks and servers. Then, we design two operators on MJSTs to systematically analyze the complicated failure correlations among multiple MJSTs. Finally, an algorithm based on Bayes' theorem is presented to evaluate the reliability of a parallel job with real-time parallel computing. Illustrative examples are provided.
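As a rough illustration of the setting described above, the following sketch computes the reliability of a job whose tasks each run several copies simultaneously. It is not the MJST-based algorithm of the paper: it assumes that all copies fail independently, whereas the paper specifically addresses correlated failures (for example, copies sharing a server). The function names and the numeric reliabilities are hypothetical and chosen only for illustration.

```python
from math import prod


def task_reliability(copy_reliabilities):
    """Probability that at least one simultaneously running copy succeeds.

    Assumption (not from the paper): copies fail independently.
    """
    return 1.0 - prod(1.0 - r for r in copy_reliabilities)


def job_reliability(tasks):
    """Probability that every task succeeds, i.e. the whole job completes.

    `tasks` is a list of tasks; each task is a list holding the success
    probability of each of its copies (the original counts as one copy).
    """
    return prod(task_reliability(copies) for copies in tasks)


# Hypothetical job: three tasks, two of them protected by one extra copy.
example_tasks = [[0.9, 0.9], [0.8], [0.95, 0.7]]
print(job_reliability(example_tasks))
```

Under the independence assumption, the unprotected second task dominates the result, which suggests why placing copies matters; once copies share servers, this simple product no longer holds and a correlation-aware analysis such as the one in the paper is needed.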



