The Implementation of Titian for Data Provenance on DISC Systems Automated Debugging
DOI:
https://doi.org/10.21108/ijoict.v10i1.929
Keywords:
Automated-debugging, DISC, Snowfall, Titian
Abstract
Data-Intensive Scalable Computing (DISC) systems are critical for managing large datasets while prioritizing fault tolerance, cost-effectiveness, and user accessibility. However, errors in the input data they process pose considerable difficulties for programmers. The Snowfall Analysis program, known for anomalous input data that causes forecasting failures, serves as the key case study in this research. To address this problem, the study uses Titian, a library extension designed to speed up debugging by methodically tracing the provenance of incorrect data back to its original source. We evaluated Titian's accuracy using a confusion matrix and compared its efficiency with standard manual debugging approaches; the results provide solid evidence of its utility in improving data provenance in DISC systems.
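The idea of tracing a suspicious output back to its inputs and then scoring the trace with a confusion matrix can be illustrated with a minimal plain-Python sketch. This is not Titian's API and not the study's code: the function names, the toy records, the unit handling, and the threshold are all assumptions made for illustration. Each output keeps the ids of every input record that contributed to it, so an anomalous output can be followed back to its candidate source records.

```python
from collections import defaultdict

MM_PER_FOOT = 304.8

# Hypothetical toy input: (record_id, station, reading). Record 3 is assumed
# to be the faulty one (an implausibly large reading recorded in feet).
RECORDS = [
    (0, "A", "120mm"),
    (1, "A", "95mm"),
    (2, "B", "110mm"),
    (3, "B", "30ft"),
]


def to_mm(reading):
    """Convert a reading such as '120mm' or '30ft' to millimetres."""
    if reading.endswith("ft"):
        return float(reading[:-2]) * MM_PER_FOOT
    return float(reading[:-2])


def snowfall_delta_with_lineage(records):
    """Per-station (max - min) delta plus the ids of all contributing inputs."""
    groups = defaultdict(list)
    for rec_id, station, reading in records:
        groups[station].append((rec_id, to_mm(reading)))
    result = {}
    for station, rows in groups.items():
        values = [value for _, value in rows]
        lineage = {rec_id for rec_id, _ in rows}
        result[station] = (max(values) - min(values), lineage)
    return result


def trace_back(result, threshold):
    """Collect the input ids behind every output whose delta exceeds threshold."""
    traced = set()
    for delta, lineage in result.values():
        if delta > threshold:
            traced |= lineage
    return traced


def confusion_matrix(all_ids, traced, faulty):
    """Count how well the traced ids match the known faulty ids."""
    return {
        "tp": len(traced & faulty),            # faulty and traced
        "fp": len(traced - faulty),            # clean but traced
        "fn": len(faulty - traced),            # faulty but missed
        "tn": len(all_ids - traced - faulty),  # clean and not traced
    }


if __name__ == "__main__":
    outputs = snowfall_delta_with_lineage(RECORDS)
    traced_ids = trace_back(outputs, threshold=1000.0)
    all_ids = {rec_id for rec_id, _, _ in RECORDS}
    print(confusion_matrix(all_ids, traced_ids, faulty={3}))
```

In this toy run the trace returns both records of station B, although only one of them is assumed faulty. Group-level lineage can over-approximate the true culprits in this way, and a confusion matrix is one way to quantify that gap.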