Abstract
Three automated summarization techniques from natural language processing (NLP) were tested on a special collection of Catholic Pamphlets acquired by Hesburgh Libraries. The automated summaries were generated after feeding the pamphlets, as PDF files, into an OCR pipeline; extensive data cleaning and text preprocessing were necessary before the summarization algorithms could be run. Scored with the standard ROUGE F1 metric against human reference summaries, the BERT Extractive Summarizer performed best, with an average ROUGE F1 score of 0.239. The Gensim Python implementation of TextRank scored 0.151, and a hand-implemented TextRank algorithm scored 0.144. This article covers the implementation of automated pipelines for reading PDF text, the strengths and weaknesses of automated summarization techniques, and what the successes and failures of these summaries mean for their potential use in Hesburgh Libraries.
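To make the scoring concrete, the short sketch below computes a ROUGE-1 F1 score between a machine-generated summary and a human reference: precision is the share of candidate unigrams found in the reference, recall is the share of reference unigrams found in the candidate, and F1 is their harmonic mean. This is a minimal illustrative re-implementation in Python, not the evaluation code used for the study, and the two sample sentences are invented for the example.

    from collections import Counter

    def rouge1_f1(candidate: str, reference: str) -> float:
        """ROUGE-1 F1: unigram overlap between a candidate summary and a reference.

        Overlap counts are clipped so a word repeated in the candidate is not
        credited more times than it appears in the reference.
        """
        cand_tokens = candidate.lower().split()
        ref_tokens = reference.lower().split()
        overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
        if not cand_tokens or not ref_tokens or overlap == 0:
            return 0.0
        precision = overlap / len(cand_tokens)
        recall = overlap / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)

    # Hypothetical example: a machine summary scored against a human reference.
    machine = "the pamphlet argues for catholic education in american schools"
    human = "this pamphlet defends catholic education in america's public schools"
    print(round(rouge1_f1(machine, human), 3))

The same idea extends to ROUGE-2 (bigram overlap) and ROUGE-L (longest common subsequence), which standard ROUGE tooling reports alongside ROUGE-1.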
This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright (c) 2020 Jeremiah Flannery