Managing Bias When Library Collections Become Data


Developments in AI research have dramatically changed what we can do with data and how we can learn from it. At the same time, implementations of AI amplify the prejudices in data, a problem often framed as ‘data bias’ and ‘algorithmic bias.’ Libraries, tasked with deciding what is worth keeping, are inherently discriminatory and yet remain trusted sources of information. As libraries begin to systematically approach their collections as data, will they be able to adopt AI-driven tools and adapt them to traditional practices?


Drawing on the work of the AI initiative within Stanford Libraries, the Fantastic Futures conference on AI for libraries, archives, and museums, and recent scholarship on data bias and algorithmic bias, this article encourages libraries to engage critically with AI and to help shape applications of the technology so that they reflect the ethos of libraries, for the benefit of both libraries and the patrons they serve. A brief examination of two core concepts in machine learning, generalization and unstructured data, provides points of comparison with library practices and uncovers the theoretical assumptions driving each domain. The comparison also offers a point of entry for libraries to adopt machine learning methods on their own terms.


Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.