Prototype to Investigate the Extent to Which Words with Specific Attributes Can Be Retrieved Using Granular Metadata

Liezl Hilde Ball; Theo J.D. Bothma

doi:10.25159/2663-659X/14399

Authors

Liezl Hilde Ball, University of Pretoria, https://orcid.org/0000-0002-1483-0780
Theo J.D. Bothma, University of Pretoria, https://orcid.org/0000-0001-7850-3263

Keywords:

digital humanities, digital libraries, metadata, information retrieval, text collections, prototype

Abstract

Despite the growth of digital text collections, it remains difficult to retrieve words or phrases with specific attributes, for example, words with a specific meaning within a specific section of a text. Many systems rely on coarse bibliographic metadata. Fine-grained retrieval requires that texts be encoded with granular metadata. In this study, sample texts were encoded using five categories of metadata that capture additional data about texts, namely, morphological, syntactic, semantic, functional and bibliographic. A prototype was developed to parse the encoded texts and store the information in a database, and was then used to test the extent to which words or phrases with specific attributes could be retrieved. The prototype made detailed retrieval possible: searches using each of the five categories of metadata were demonstrated, as well as advanced searches combining metadata from different categories in a single search. This article demonstrates that encoding texts with granular metadata improves retrieval, because relevant information can be selected, and irrelevant information excluded, even within a single text.
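The parse, store and retrieve pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' prototype or encoding scheme: the XML element names (`text`, `section`, `w`), the attribute names (`morph`, `pos`, `sense`, `type`, `title`, `year`), the mapping of those attributes to the five metadata categories, and the helper functions `load` and `search` are all assumptions made for illustration, using only the Python standard library.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical encoding, invented for this sketch. One attribute per category:
# morph (morphological), pos (syntactic), sense (semantic),
# section type (functional), title/year (bibliographic).
SAMPLE = """
<text title="Sample Novel" year="1900">
  <section type="preface">
    <w morph="singular" pos="noun" sense="riverside">bank</w>
    <w morph="past" pos="verb" sense="overflow">flooded</w>
  </section>
  <section type="chapter">
    <w morph="singular" pos="noun" sense="financial">bank</w>
    <w morph="plural" pos="noun" sense="riverside">banks</w>
  </section>
</text>
"""

COLUMNS = ("form", "morph", "pos", "sense", "section", "title", "year")


def load(xml_text, conn):
    """Parse the encoded text and store one database row per word."""
    conn.execute(
        "CREATE TABLE word (form TEXT, morph TEXT, pos TEXT, sense TEXT, "
        "section TEXT, title TEXT, year INTEGER)"
    )
    root = ET.fromstring(xml_text)
    for section in root.iter("section"):
        for w in section.iter("w"):
            conn.execute(
                "INSERT INTO word VALUES (?, ?, ?, ?, ?, ?, ?)",
                (
                    w.text,
                    w.get("morph"),
                    w.get("pos"),
                    w.get("sense"),
                    section.get("type"),
                    root.get("title"),
                    int(root.get("year")),
                ),
            )
    conn.commit()


def search(conn, **attrs):
    """Return word forms matching every given attribute, in document order."""
    for name in attrs:
        if name not in COLUMNS:
            raise ValueError(f"unknown attribute: {name}")
    where = " AND ".join(f"{name} = ?" for name in attrs) or "1 = 1"
    sql = f"SELECT form FROM word WHERE {where} ORDER BY rowid"
    return [row[0] for row in conn.execute(sql, tuple(attrs.values()))]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load(SAMPLE, conn)
    # Combine semantic, functional and bibliographic metadata in one search.
    print(search(conn, sense="riverside", section="chapter", year=1900))
```

In this sketch, a search that combines a semantic attribute with a functional one selects the riverside sense of "bank" in the chapter while excluding both the financial sense and the occurrence in the preface, which is the kind of within-text selection and exclusion the abstract describes.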
