A University of Washington research team has built a search system that can scan 10 million government PDFs for less than $1,500 in processing costs, a fraction of what commercial AI tools would charge for the same task.
The tool is called GovScape, and it was built to work with the End of Term Web Archive, a project that has preserved federal government websites at the close of every presidential administration since 2008. The archive covers the George W. Bush, Obama, Trump, and Biden administrations, storing images, text, graphs, redacted pages, and other media. The current version of GovScape can search PDFs from Donald Trump's first term. The team plans to expand it to cover the full archive.
According to Phys.org, the project was led by Benjamin Charles Germain Lee, a UW assistant professor in the Information School. "The End of Term Web Archive is immensely important to historians, journalists and the American public," Lee said. "But many of these digital archives are getting so big — the Internet Archive just announced its trillionth page archived — that finding information is the real challenge."
GovScape offers three types of search. The first is a standard keyword search, similar to a book index, that locates every page where a specific word or phrase appears. The second is a semantic search, which finds documents related to a topic even when the exact search terms do not appear in the text. The third is a visual search, which lets users look for visual features of a page, such as redactions, aerial photographs, or pie charts.
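The book-index comparison can be made concrete with a small sketch. The Python below is not GovScape's code; it is a minimal illustration of an inverted index, with page identifiers and texts invented for the example. It matches pages that contain every word of the query, which is simpler than true phrase matching.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of page ids where it appears."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            # Drop surrounding punctuation so "response." matches "response".
            index[word.strip(".,;:!?\"'()")].add(page_id)
    return index

def keyword_search(index, query):
    """Return sorted ids of pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

# Hypothetical page texts, invented purely for illustration.
PAGES = {
    "doc1-p1": "Annual report on wildfire response.",
    "doc1-p2": "Budget tables for wildfire prevention programs.",
    "doc2-p1": "Guidance on forest fires and evacuation routes.",
}
INDEX = build_index(PAGES)

print(keyword_search(INDEX, "wildfire"))      # pages that mention "wildfire"
print(keyword_search(INDEX, "forest fires"))  # pages with both words
```

In this toy data, a search for "wildfire" misses the page about forest fires because the exact word never appears there. That is the kind of gap a semantic search is meant to close.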
To build the system, the researchers created a processing pipeline that splits each PDF into individual pages, saves those pages as images, and then extracts the text. They then used AI models to generate what are called embeddings, which are lists of numbers that capture the content of both the text and the images on each page. "Just as library classification systems group books on similar topics on the same shelf, these embeddings group similar pages with one another based on their visual and textual content," Lee said.
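How embeddings "group similar pages" can be shown with a toy example. The sketch below assumes cosine similarity as the closeness measure, a common choice that the article does not confirm for GovScape, and uses hand-written three-number vectors in place of real model output. The page names and values are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_pages(query_vec, page_vecs, k=2):
    """Return the ids of the k pages whose embeddings best match the query."""
    ranked = sorted(
        page_vecs,
        key=lambda page_id: cosine_similarity(query_vec, page_vecs[page_id]),
        reverse=True,
    )
    return ranked[:k]

# Made-up embeddings: real ones have hundreds of dimensions and come from a model.
PAGE_VECS = {
    "wildfire-report": [0.9, 0.1, 0.0],
    "forest-fire-guidance": [0.8, 0.2, 0.1],
    "tax-form": [0.0, 0.1, 0.9],
}
QUERY = [1.0, 0.0, 0.0]  # stands in for the embedding of a fire-related query

print(nearest_pages(QUERY, PAGE_VECS))
```

The two fire-related pages point in nearly the same direction as the query, so they rank ahead of the tax form even though no keywords are compared.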
The system's cost efficiency stands out. Processing all 10 million PDFs cost less than $1,500, which works out to roughly $1 per 47,000 pages, a rate that implies the collection runs to about 70 million pages in total. By comparison, Google might charge consumers $1 to parse around 100 pages using AI tools.
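The figures above can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the article and treats $1,500 as the full cost, although the article says the true figure was lower, so the results are rough estimates. The variable names are illustrative.

```python
# Figures quoted in the article; $1,500 is an upper bound on the real cost.
TOTAL_COST_USD = 1_500
PAGES_PER_DOLLAR = 47_000
PDF_COUNT = 10_000_000
COMPARISON_PAGES_PER_DOLLAR = 100  # the quoted consumer rate

# Page count implied by the cost and the per-dollar rate.
implied_pages = TOTAL_COST_USD * PAGES_PER_DOLLAR

# Average length of a PDF under that estimate.
pages_per_pdf = implied_pages / PDF_COUNT

# Cost of the same page count at the comparison rate.
comparison_cost_usd = implied_pages / COMPARISON_PAGES_PER_DOLLAR

# How many times more pages per dollar the pipeline handles.
cost_ratio = PAGES_PER_DOLLAR / COMPARISON_PAGES_PER_DOLLAR

print(f"Implied pages: {implied_pages:,}")
print(f"Average pages per PDF: {pages_per_pdf:.2f}")
print(f"Cost at comparison rate: ${comparison_cost_usd:,.0f}")
print(f"Pages-per-dollar ratio: {cost_ratio:.0f}x")
```

On these numbers, the same workload at the comparison rate would cost several hundred thousand dollars, which is why the team's figure is described as a fraction of the commercial price.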
The research team will present GovScape on July 5 at the Annual Meeting of the Association for Computational Linguistics in San Diego. The work is currently available on the arXiv preprint server.
