Abstract
In recent years, a growing number of knowledge bases have been built as linked data, and the resulting datasets have grown substantially. Storing a large volume of triples in a single graph is not reasonable, and storing RDF in named graphs keyed by class URI is not appropriate either, because queries that join across many graphs can perform poorly. This paper presents an adapted agglomerative approach to partitioning large-scale graphs, that is, a bottom-up merging process. The proposed algorithm partitions triple data at three levels: blank nodes, associated nodes, and inference nodes. Blank nodes, and classes or nodes involved in reasoning rules, are better stored with an optimal neighbor node in the same partition than split across separate partitions. Merging of associated nodes starts with the node of smallest cost and repeats until the target number of partitions is reached. The feasibility and rationality of the merging algorithm are then analyzed in detail through bibliographic cases. The partitioning methods proposed in this paper can be applied to distributed storage, data retrieval, data export, and semantic reasoning over large-scale triple graphs. In future work, we will investigate using machine learning algorithms to set the number of partitions automatically.
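The bottom-up merging described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not define the cost function or what makes a neighbor "optimal", so the sketch assumes that a blank node joins the neighbor it shares the most triples with, and that the cheapest merge is the pair of partitions linked by the most triples. The function name `merge_partitions`, the `_:` prefix test for blank nodes, and the sample data are all hypothetical, and the inference-node level is left out.

```python
from collections import Counter, defaultdict


def merge_partitions(triples, k):
    """Greedily merge single-node partitions until at most k remain.

    triples: iterable of (subject, predicate, object) strings.
    Assumed cost of a merge: the negative number of triples linking
    the two partitions (an illustrative choice, not the paper's).
    """
    triples = list(triples)

    # Bottom-up start: every node is the root of its own partition.
    owner = {}
    for s, _, o in triples:
        owner.setdefault(s, s)
        owner.setdefault(o, o)

    def find(n):
        # Follow parent links up to the partition root.
        while owner[n] != n:
            n = owner[n]
        return n

    # Level 1 (assumption): attach each blank node to the neighbor
    # it shares the most triples with.
    for b in [n for n in owner if n.startswith("_:")]:
        neighbors = Counter()
        for s, _, o in triples:
            if s == b and o != b:
                neighbors[o] += 1
            elif o == b and s != b:
                neighbors[s] += 1
        if neighbors:
            target = neighbors.most_common(1)[0][0]
            owner[find(b)] = find(target)

    # Level 2: repeatedly apply the lowest-cost merge.
    while len({find(n) for n in owner}) > k:
        cross = Counter()
        for s, _, o in triples:
            a, b = find(s), find(o)
            if a != b:
                cross[tuple(sorted((a, b)))] += 1
        if not cross:
            break  # remaining partitions are disconnected
        # Most linking triples first; ties broken by root name.
        (a, b), _ = min(cross.items(), key=lambda kv: (-kv[1], kv[0]))
        owner[a] = b

    groups = defaultdict(set)
    for n in owner:
        groups[find(n)].add(n)
    return list(groups.values())


# Hypothetical bibliographic sample: two unrelated papers.
sample = [
    ("ex:paper1", "dc:creator", "_:b1"),
    ("_:b1", "foaf:name", "ex:alice"),
    ("ex:paper1", "dc:subject", "ex:topic1"),
    ("ex:paper2", "dc:creator", "ex:bob"),
    ("ex:paper2", "dc:subject", "ex:topic2"),
]
for part in merge_partitions(sample, 2):
    print(sorted(part))
```

Because merges only ever follow linking triples, the sketch stops early when the remaining partitions are disconnected, so it can return more than `k` partitions for a graph with more than `k` components.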