The Implementation of Titian for Data Provenance on DISC Systems Automated Debugging

Authors

Putri, A., Selviandro, N., Wulandari, G. S.

DOI:

https://doi.org/10.21108/ijoict.v10i1.929

Keywords:

Automated-debugging, DISC, Snowfall, Titian

Abstract

Data-Intensive Scalable Computing (DISC) systems are critical for managing large datasets while prioritizing fault tolerance, cost effectiveness, and user accessibility. However, input errors in the processed data are difficult for programmers to locate. The Snowfall Analysis program, whose anomalous input data is known to cause forecasting failures, serves as the key case study in this research. To address this problem, this study uses Titian, an extended library designed to speed up debugging by methodically tracing the provenance of incorrect data back to its original source. We evaluated Titian's accuracy using a confusion matrix and compared its efficiency with standard manual debugging approaches. The results provide solid evidence of its utility in improving data provenance in DISC systems.
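A confusion-matrix evaluation of a provenance tracer can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names `confusion_counts` and `metrics`, the record IDs, and the example sets are all hypothetical. It assumes that each input record has an ID, that the tracer returns the set of IDs it blames for a wrong output, and that the truly faulty IDs are known in advance.

```python
def confusion_counts(all_ids, traced_ids, faulty_ids):
    """Count TP, FP, FN, TN for a set of traced records.

    all_ids    -- every input record ID (hypothetical identifiers)
    traced_ids -- IDs the tracer reported as the source of the error
    faulty_ids -- IDs known to be faulty (ground truth)
    """
    all_ids, traced, faulty = set(all_ids), set(traced_ids), set(faulty_ids)
    tp = len(traced & faulty)            # faulty and reported
    fp = len(traced - faulty)            # clean but reported
    fn = len(faulty - traced)            # faulty but missed
    tn = len(all_ids - traced - faulty)  # clean and not reported
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, and recall; 0.0 when a ratio is undefined."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical example: 10 input records, 3 truly faulty, 3 reported.
    counts = confusion_counts(range(10), traced_ids={2, 5, 8}, faulty_ids={2, 5, 7})
    print(counts)            # tp=2, fp=1, fn=1, tn=6
    print(metrics(*counts))
```

Sets keep the comparison independent of record order, which matters when traced records come back from a distributed job in arbitrary order.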

Published

2024-07-03

How to Cite

Putri, A., Selviandro, N., & Wulandari, G. S. (2024). The Implementation of Titian for Data Provenance on DISC Systems Automated Debugging. International Journal on Information and Communication Technology (IJoICT), 10(1), 116–126. https://doi.org/10.21108/ijoict.v10i1.929

Section

Software Engineering