The Genealogy Indexer website lets users run full-text searches across more than one million pages of historical records. First, however, that material must be converted into searchable digital files from paper documents that are often of poor quality, hundreds to thousands of pages long, and set in hard-to-recognize typefaces. The task is made possible by accurate, automated Optical Character Recognition (OCR) from ABBYY Recognition Server.
“Without Recognition Server, I would simply not be able to do any of this. No other solution I have tested comes close to delivering acceptable accuracy,” said Logan Kleinwaks, Founder, Genealogy Indexer.
For those who seek insights into the history of Jewish communities, as well as individuals researching their own ancestry, Genealogy Indexer provides an invaluable resource. A unique innovation in the field of Jewish Genealogy, the free website makes it possible to search original documents that have not been previously indexed. Created and maintained by Logan Kleinwaks as a service to historians and genealogists, Genealogy Indexer utilizes source materials from around the world — but primarily from Central and Eastern Europe, as Kleinwaks describes:
“Genealogy Indexer makes searchable more than a million pages of historical European directories, books commemorating Jewish communities destroyed in the Holocaust, military lists, school records and other documents of interest to genealogists and historians. Most of the material is not searchable elsewhere. The website is also free to use and completely non-commercial.”
In 2008, Kleinwaks began the process of converting documents into fully searchable files and integrating them into Genealogy Indexer. “Even with many volunteers,” says Kleinwaks, “manually transcribing documents took a very long time. So OCR was key.” Initially, Kleinwaks tried a mix of OCR solutions. But the accuracy and versatility of ABBYY FineReader led him to standardize on it.
“Many of our documents,” explains Kleinwaks, “are from business directories, address directories, or telephone directories. They may arrive as paper, or as DjVu or PDF files of between 200 and 3,000 pages each, or multiple JPG or TIFF files. Often these are challenging for OCR because of poor print and paper quality, small dense text, complex layouts, and the high percentage of non-dictionary words such as surnames. ABBYY’s software,” Kleinwaks states, “was very good at meeting those challenges.”
However, Kleinwaks’ vision for Genealogy Indexer also extended to adding thousands of historical directories from Germany and German-speaking areas, printed in Fraktur Gothic fonts from the 18th to the early 20th centuries. He especially wanted to make directories from the 1930s searchable, to assist researchers of families separated during World War II. “Because of the large numbers involved,” says Kleinwaks, “finding a highly automated OCR solution was essential. There are millions of pages that need to be converted.”
So, after discussing options with ABBYY, Kleinwaks decided to adopt ABBYY Recognition Server. “I discovered it is capable of handling high-volume Fraktur recognition,” states Kleinwaks, “thanks to its inclusion of the FineReader XIX module.”
As a server-based document conversion solution, Recognition Server automatically converts high volumes of paper, image-only digital files and electronic documents into searchable records. Moreover, the software is capable of recognizing over 190 languages in a wide variety of fonts — including Fraktur.
Using Recognition Server, Kleinwaks performs OCR tasks on a single PC that hosts the server manager, processing station and verification station. Software developed by Kleinwaks then automates the post-OCR workflow — integrating the output files and document metadata with the site’s search engine.
“After OCR,” explains Kleinwaks, “I upload the output and a spreadsheet featuring metadata about the documents to my website and search engine server. From there, software I created integrates the OCR output and metadata into my search engine automatically, making the information available to users of Genealogy Indexer.”
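The integration step Kleinwaks describes, combining OCR text with a metadata spreadsheet so that pages become searchable, can be illustrated with a minimal sketch. This is not Kleinwaks' software, whose design is not described here. It assumes the spreadsheet is exported as CSV with hypothetical columns doc_id, title and year, and that the OCR output is available as plain text per page. All function names and the sample data are invented for illustration.

```python
import csv
import io
import re
from collections import defaultdict


def load_metadata(csv_text):
    """Parse a metadata spreadsheet (exported as CSV) into {doc_id: row}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["doc_id"]: row for row in reader}


def tokenize(text):
    """Lowercase the text and extract runs of letters (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def build_index(pages):
    """Build an inverted index: token -> set of (doc_id, page_number)."""
    index = defaultdict(set)
    for (doc_id, page_no), text in pages.items():
        for token in tokenize(text):
            index[token].add((doc_id, page_no))
    return index


def search(index, metadata, query):
    """Return sorted (title, year, page) hits for pages containing every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted((metadata[d]["title"], metadata[d]["year"], p) for d, p in hits)


# Invented sample data: two hypothetical directories and three OCR'd pages.
METADATA_CSV = """doc_id,title,year
d1,Example Address Directory,1932
d2,Example Business Directory,1929
"""
PAGES = {
    ("d1", 17): "Goldberg, Abram, tailor, Main Street 5",
    ("d1", 18): "Klein, Moses, baker",
    ("d2", 4): "Goldberg & Sons, textile merchants",
}

metadata = load_metadata(METADATA_CSV)
index = build_index(PAGES)
print(search(index, metadata, "goldberg tailor"))
```

Each hit carries the document's title and year from the spreadsheet alongside the page number from the OCR output, which is the kind of pairing the described workflow relies on. A production search engine would also need to tolerate OCR errors and surname spelling variants, which this sketch does not attempt.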
According to Kleinwaks, users perform between 4,000 and 5,000 searches every day. The searchable content at their disposal now includes: 900,000 pages of 1,800 historical directories; 114,000 pages from 256 yizkor books; 32,000 pages of military lists; 43,000 pages of community and personal histories; and 24,000 pages of Polish secondary school reports and other school sources.
“Generally,” says Kleinwaks, “it is fair to say that OCR has greatly increased the use of Central and Eastern European directories as a genealogical source. And without Recognition Server I would simply not be able to do any of this.”
“Being able,” he concludes, “to OCR Fraktur documents using Recognition Server has brought new users to my site and allowed existing users to search documents they never could before. No other solution I have tested comes close to delivering acceptable accuracy. And since I’ve been using Recognition Server, its automation features have proven incredibly valuable. It saves a lot of time.”