Exploiting a hidden treasure: Automated chemical entity recognition in Chemisches Zentralblatt

Valentina Eigner-Pitto, Heinz Saller, and Peter Loew, InfoChem GmbH, Landsberger Strasse 408, Munich 81241, Germany

The German publication Chemisches Zentralblatt, launched in 1830, was the first chemistry abstract collection in history and contains 140 years of research progress in chemistry and chemical knowledge. Modern scanning and OCR software was used to make the entire content of this unique reference work available for full-text retrieval, but chemical structure search seemed infeasible because the work is written in German, the quality of the original documents is inconsistent, and numerous obsolete compound names occur. This talk describes our approach to automatically identifying and extracting chemical compounds from the text and converting them into a structure database. The process is based on systematically training and enhancing the OCR, annotation, and name-to-structure steps using specifically developed German dictionaries. A web-based prototype application has been implemented that provides structure, substructure, and similarity search, with hits linked directly back to the original pages of Chemisches Zentralblatt.
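As a rough illustration of the dictionary-driven annotation and name-to-structure idea described above, the following minimal Python sketch looks up German compound names in OCR text and maps them to SMILES strings. It is not the InfoChem implementation: the function `annotate`, the table `GERMAN_NAME_TO_SMILES`, and its five entries are hypothetical and chosen only to show how an obsolete spelling such as "Äthanol" can resolve to the same structure as the modern name.

```python
import re
import unicodedata

# Hypothetical toy dictionary (an assumption for illustration only):
# German compound names, including one obsolete spelling, mapped to SMILES.
_RAW_DICTIONARY = {
    "benzol": "c1ccccc1",
    "essigsäure": "CC(=O)O",
    "äthanol": "CCO",   # obsolete spelling of "Ethanol"
    "ethanol": "CCO",
    "aceton": "CC(C)=O",
}

# Normalize keys to NFC so that umlauts compare equal regardless of encoding form.
GERMAN_NAME_TO_SMILES = {
    unicodedata.normalize("NFC", name): smiles
    for name, smiles in _RAW_DICTIONARY.items()
}

# Matches runs of Unicode letters (word characters that are not digits or "_").
_WORD = re.compile(r"[^\W\d_]+")


def annotate(text):
    """Return (start, end, matched_text, smiles) for every dictionary hit.

    Offsets refer to the NFC-normalized form of the input text.
    """
    text = unicodedata.normalize("NFC", text)
    hits = []
    for match in _WORD.finditer(text):
        smiles = GERMAN_NAME_TO_SMILES.get(match.group().lower())
        if smiles is not None:
            hits.append((match.start(), match.end(), match.group(), smiles))
    return hits


if __name__ == "__main__":
    for hit in annotate("Äthanol wird mit Essigsäure verestert."):
        print(hit)
```

The character offsets kept with each hit are what would allow a search result to be linked back to its position on the original page. A real system would additionally need to handle OCR errors, multi-word names, and systematic nomenclature, none of which this sketch attempts.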