CINF 27
Mining a large reaction database with name reaction patterns
Matthew A. Kayala (1), mkayala@ics.uci.edu, Qian-Nan Hu (1), qhu@uci.edu, Jonathan H. Chen (1), chenjh@uci.edu, James S. Nowick (2), jsnowick@uci.edu, and Pierre Baldi (1), pfbaldi@uci.edu. (1) Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California, Irvine, Irvine, CA 92697, (2) Department of Chemistry, University of California, Irvine, 4126 Natural Sciences 1, Irvine, CA 92697-2025
Over the past several years, comprehensive data sets of small chemical compounds, such as our own ChemDB (http://cdb.ics.uci.edu), have been made publicly available for statistical analysis and data mining. Access to reaction data, however, remains comparatively restricted, and with so little data available, how to approach knowledge discovery in reaction databases is an open question. One potential method is to classify reactions using pattern-matching rules. We present initial results on mining more than 2,000,000 well-annotated reactions from a database of published reactions (SPRESI). We hand-composed more than 500 SMIRKS patterns covering 306 common "name reactions". The rules provide a broad classification of the database into a small number of classes based on net structural changes. To facilitate future research, a tool that classifies reactions using these patterns has been made available as part of ChemDB.
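To make the idea of classifying reactions by net structural change concrete, the following is a minimal sketch in plain Python. It is not the SMIRKS rule set or the ChemDB tool described above: it is a toy stand-in that assumes each reaction is already given as atom-mapped bond lists, and the names `net_change_signature`, `RULES`, and `classify` are hypothetical, introduced only for illustration.

```python
# Toy stand-in for rule-based reaction classification.
# ASSUMPTION: this is NOT the authors' SMIRKS patterns or the ChemDB tool.
# A reaction is given as atom-mapped bonds: (map_i, map_j, bond_order).


def bond_orders(bonds):
    """Index bond orders by an order-independent pair of atom-map numbers."""
    return {(min(i, j), max(i, j)): order for i, j, order in bonds}


def net_change_signature(elements, reactant_bonds, product_bonds):
    """Return (bonds whose order increased, bonds whose order decreased)."""
    before = bond_orders(reactant_bonds)
    after = bond_orders(product_bonds)
    gained, lost = [], []
    for pair in before.keys() | after.keys():
        delta = after.get(pair, 0) - before.get(pair, 0)
        if delta == 0:
            continue  # bond unchanged, not part of the net change
        label = "-".join(sorted(elements[m] for m in pair))
        (gained if delta > 0 else lost).append(label)
    return tuple(sorted(gained)), tuple(sorted(lost))


# Hypothetical rule table: net-change signature -> class label.
RULES = {
    (("C-C", "C-C", "C-C"), ("C-C", "C-C", "C-C")): "Diels-Alder cycloaddition",
    (("C-O",), ("Br-C",)): "Williamson ether synthesis",
}


def classify(elements, reactant_bonds, product_bonds):
    """Look up the reaction's net-change signature in the rule table."""
    signature = net_change_signature(elements, reactant_bonds, product_bonds)
    return RULES.get(signature, "unclassified")


# Butadiene (atoms 1-4) + ethene (atoms 5-6) -> cyclohexene
da_elements = {m: "C" for m in range(1, 7)}
da_reactants = [(1, 2, 2), (2, 3, 1), (3, 4, 2), (5, 6, 2)]
da_products = [(1, 2, 1), (2, 3, 2), (3, 4, 1), (4, 5, 1), (5, 6, 1), (6, 1, 1)]
da_label = classify(da_elements, da_reactants, da_products)

# Alkoxide C-O (atoms 1-2) + alkyl bromide C-Br (atoms 3-4) -> ether
we_elements = {1: "C", 2: "O", 3: "C", 4: "Br"}
we_reactants = [(1, 2, 1), (3, 4, 1)]
we_products = [(1, 2, 1), (2, 3, 1)]
we_label = classify(we_elements, we_reactants, we_products)
```

A signature this coarse can be shared by chemically distinct reactions, since it ignores the environment of each atom. Pattern languages such as SMIRKS express the atom environments as well as the bond changes, which is presumably why several patterns may be needed to cover a single name reaction.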