Computational Methods for Historians of Science
Automated Image Extraction
Scientific images are almost always published alongside text in articles, books, and on the internet. To analyze published scientific images with computers, you need to extract them from their original text-image environments and create relevant metadata. Extracting images manually from large corpora of digital documents is tedious and time-consuming. Many historians and meta-science researchers who want to study large groups of images don’t have the time or technical skills to automate this process, so we are doing it for them!
Currently, our team (including Dr. Aaron Dinner at UChicago and Dr. Julia Damerow at Arizona State) is fine-tuning a non-AI-based image extraction tool. You give the program a corpus of PDFs, then the code removes the text and extracts the remaining images. It’s not perfect, but it’s much faster than extracting images one by one by hand. If you're interested in getting involved, check out our GitHub and contact me!
The software will be integrated into the Giles Ecosystem.
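To make the "remove the text, keep what is left" idea concrete, here is a minimal toy sketch in Python. It is not our actual code. It assumes a page has already been turned into a small grid of inked (1) and blank (0) cells, and that the text regions are known as rectangles, for example from a PDF's text layer. The function names `mask_text` and `find_regions` and the sample page are invented for this illustration.

```python
from collections import deque


def mask_text(page, text_boxes):
    """Return a copy of the page with every text box blanked out (set to 0).

    Boxes are (top, left, bottom, right), inclusive. Assumption: the boxes
    come from some earlier step, such as a PDF's embedded text layer.
    """
    masked = [row[:] for row in page]
    for top, left, bottom, right in text_boxes:
        for r in range(top, bottom + 1):
            for c in range(left, right + 1):
                masked[r][c] = 0
    return masked


def find_regions(page):
    """Return bounding boxes of connected non-blank regions (4-neighbour)."""
    rows, cols = len(page), len(page[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if page[r][c] == 0 or seen[r][c]:
                continue
            # Breadth-first flood fill from this inked cell.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, left = min(top, y), min(left, x)
                bottom, right = max(bottom, y), max(right, x)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and page[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Invented toy page: a line of text, a figure, and a caption.
page = [
    [1, 1, 1, 1, 1, 1],  # body text
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],  # figure
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],  # caption text
]
text_boxes = [(0, 0, 0, 5), (5, 0, 5, 3)]

image_boxes = find_regions(mask_text(page, text_boxes))
print(image_boxes)
```

Whatever survives the masking step is treated as a candidate image. On real scanned pages the text regions are rarely this clean, which is one reason the results are not perfect.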
Concepts and Keywords
When researchers want to learn about the development of a scientific concept, they often search digital databases, like Scopus or Web of Science, for publications that used that concept in the past. Say you wanted to learn about the history of the concept of “microbial biofilms.” If you go to one of those digital publication databases, you might type in “biofilm” and sort the results by date. You might find that the earliest article about biofilms was published in 1968, and that the publication didn’t even use the word “biofilm,” but rather, “microbial film.”
However, as a historian, I know that researchers were studying “biofilms” long before using that term, or even the term “microbial film.” Database results will likely give you an incomplete picture of the development of a concept.
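The gap between a keyword search and the history of a concept can be sketched in a few lines of Python. The records below are invented for illustration (they are not real publications, and the years are made up), and the `search` helper is a hypothetical stand-in for a database query that only matches titles.

```python
import re

# Invented toy records: (year, title). Not real publications.
RECORDS = [
    (1935, "Bacterial slime layers on submerged surfaces"),
    (1968, "Growth of a microbial film in a flow reactor"),
    (1981, "Biofilm formation in industrial pipelines"),
    (1990, "Antibiotic resistance of biofilms"),
]


def search(records, terms):
    """Return records whose title contains any of the terms, oldest first."""
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    return sorted((year, title) for year, title in records
                  if pattern.search(title))


# Searching for the modern keyword alone misses the older vocabulary.
keyword_hits = search(RECORDS, ["biofilm"])

# Adding older terms a historian already knows pushes the record back.
concept_hits = search(RECORDS, ["biofilm", "microbial film", "slime layer"])

print("Earliest keyword hit:", keyword_hits[0][0])
print("Earliest concept hit:", concept_hits[0][0])
```

Expanding the term list helps only as far as you already know the older vocabulary, so the result is still a picture of the words, not of the concept itself.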
Keywords are not synonymous with concepts. We tend to rely on language to learn about concepts, and for good reasons, but words are not the only things that make up concepts. When using computers to aid research about the development of scientific knowledge, we should always keep in mind the limits of language in representing that knowledge.
Co-citation (author) network for the “Biofilm” concept up to 1974 (made with VOSviewer)
The network would be MUCH smaller if I had used database search results!
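For readers unfamiliar with co-citation, here is a minimal Python sketch of the basic counting behind a network like this one: two authors are linked each time the same paper cites both of them. The papers and author names are placeholders, and `cocitation_counts` is a hypothetical helper. VOSviewer applies its own counting, normalization, and layout options, which this sketch does not reproduce.

```python
from collections import Counter
from itertools import combinations

# Placeholder data: each citing paper maps to the authors it cites.
citing_papers = {
    "paper_1": ["Author A", "Author B", "Author C"],
    "paper_2": ["Author A", "Author B"],
    "paper_3": ["Author B", "Author C", "Author B"],  # repeated citation
}


def cocitation_counts(reference_lists):
    """Count how many citing papers cite each pair of authors together."""
    counts = Counter()
    for cited in reference_lists.values():
        # Deduplicate so a pair counts at most once per citing paper.
        for pair in combinations(sorted(set(cited)), 2):
            counts[pair] += 1
    return counts


edges = cocitation_counts(citing_papers)
for (a, b), weight in sorted(edges.items()):
    print(f"{a} -- {b}: co-cited in {weight} paper(s)")
```

Each pair becomes an edge and the count becomes its weight. Because the input is the reference lists themselves, authors can appear in the network even when no keyword search would have returned their work.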