#236 - Abstracts

ACS National Meeting
August 17-21, 2008
Philadelphia, PA

 1 Chemical encyclopedists: The prehistory of databases
Barbara Charton, Department of Mathematics and Science, Pratt Institute, 200 Willoughby Ave, ARC LL G-35, Brooklyn, NY 12205

The need to collect accurate chemical information arose as chemistry emerged from two sources: collections of recipes and alchemy. The Age of Enlightenment and the Industrial Revolution meant that manufacturing increased, as did the need for explanations and for precision. Early manufacturers were using chemical substances, and they wanted to know what they were using and how much they needed. Early writers on chemistry attempted to codify and explain what was happening to substances as those substances were heated, reshaped, glazed and fired. The scientists who examined the materials used in commerce produced a body of literature for others pursuing the same quest for information. Gmelin is certainly a well-known name in the history of chemistry, but he was not the first to attempt to codify this “new science”. Thomson, Macquer, and Nicholson all contributed to the classification of chemical information. These and others, and their effect on the growth of chemistry, will be discussed.

 2 M. G. Mellon's Chemical Publications: Their Nature and Use: Five editions, six decades
Adrienne W. Kozlowski, Department of Chemistry, Central Connecticut State University, 194 Blake Road, New Britain, CT 06050, Fax: 860-832-2704

Generations of chemistry students and instructors have been influenced by M. G. Mellon's work. Professor Mellon was a pioneering author in establishing the field of chemical literature and in the instruction of students in its use. In 1925, just six years after joining the Purdue chemistry department as an analytical chemist, he published his first book on chemical publications. He had a remarkable record of keeping the book up to date, publishing the final edition just eleven years before his death in 1993 at the age of 100. The presentation will include some constants and some evolutions through the successive editions as well as some examples of how his expertise in chemical information was integrated with his interests in chemical education and analytical chemistry.

 3 Peter Norton: A life in chemical indexing
Bob Stembridge, bob.stembridge@thomson.com, Customer Relations, Thomson Scientific, 77 Hatton Garden, EC1N 8JS London, United Kingdom, and Gez Cross, gez.cross@thomson.com, Editorial Operations Group, Thomson Scientific, EC1N 8JS London, United Kingdom

Peter Norton is recognized for a career devoted to enhancing access to chemical information via systematic indexing. Starting as a synthetic organic chemist at the Aspro-Nicholas pharmaceutical company, Peter soon became involved in information research and found himself searching a new punchcard system for patent information. In 1963 Peter joined Derwent where he was responsible for the creation of the Derwent Manual code, Fragmentation Code and Plasdoc Code systems. Later in his career he was a member of the team that created the Markush DARC chemical structure-based indexing and retrieval system used as the basis for the Merged Markush Service. This paper will review the work of this “renaissance information professional” and his contributions to the field of chemical information search and retrieval.

 4 Philip Sadtler: Founder of analytical informatics--in the era of punch cards
Gregory M. Banik and Marie Scandone, Informatics Division, Bio-Rad Laboratories, Inc, Two Penn Center Plaza, Suite 800, 1500 John F. Kennedy Boulevard, Philadelphia, PA 19102

The Sadtler family has had a long and illustrious history in chemistry, starting with Samuel P. Sadtler, who founded an eponymous company in Philadelphia in 1874, and continuing with his son, Samuel S. Sadtler, who joined the family business. The scion of this family of great chemists, S. Philip Sadtler (son of Samuel S.), left a permanent mark in spectroscopy with the collections of reference spectral databases that still bear his name. This paper will trace the history of this renowned family, focusing on the intellectual contributions made by Philip Sadtler in founding the field of Analytical Informatics in the era of punch cards and following the legacy of his contributions through the present day.

 5 Rise and fall of British Chemical Abstracts
Helen Cooke, GlaxoSmithKline, 709 Swedeland Road, King of Prussia, PA 19406

Starting in the late 19th century, J. Chem. Soc. and J. Soc. Chem. Ind. published abstracts of patents and articles from other journals in addition to original work. These abstracts were the predecessors of British Chemical Abstracts (BCA), which was launched in 1926 as a separate publication under the direction of the Bureau of Abstracts, a joint committee of the Society of Chemical Industry (SCI) and the Chemical Society (CS). Prior to the start of Chemical Abstracts (CA) in 1907, the ACS unsuccessfully attempted to collaborate with the SCI and CS to produce a single abstracts journal. The increasing importance of CA and the impact of outside influences, such as the war, led to the demise of BCA in 1953. The successes, struggles and dilemmas experienced by the publishers of BCA will be explored, as well as its organisation, indexing and coverage. BCA users' experiences will also be presented. *This work represents the author's personal thoughts and does not represent the views of GlaxoSmithKline.

 6 Science of structural revolutions: August Kekulé and chemical representation - WITHDRAWN
Robert Schombs Jr., Department of Science & Technology Studies, Cornell University, 306 Rockefeller Hall, Ithaca, NY 14853

In 1857-58, August Kekulé proposed the theory of atomicity, or atomic valence, and leveraged this theoretical resource into a theory of chemical structure. The articulation of chemical structure opened up new vistas for experimental investigation and instigated the explosion in organic chemical research in the second half of the 19th century. Linked with the theoretical development of structure was the codification of a system by which the practicing chemist could draw the chemical structure of compounds on paper, both for heuristic and explanatory use. A number of fundamental questions arise from the development of the conventions of structure drawing (1850-1870): In what way was the representation of compounds on paper a ‘revolution’? In what ways was it continuous with other representational practices? How did the deployment of this new “paper tool” affect experimental practice and theory construction? This talk will be an introduction to some of these issues.

 7 Emergent Knowledge from Chemical and Biological Data Integration
Anthony J. Trippe, atrippe@cas.org, Chemical Abstracts Service, 2540 Olentangy River Road, Columbus, OH 43202-1505, Fax: 614-447-5443, and Fred Winer, fwiner@cas.org, New Product Development, Chemical Abstracts Service, Columbus, OH 43224

With over 28 million substances and more than 26 million references, SciFinder from CAS is the world's largest repository of chemical and life-sciences related information. CAS editorial staff annotate the references in this database, identifying key concepts and substances found within the associated documents. It has previously been reported that bioactivity-related concepts have been associated with specific substances based on a shared presence in a document. Taking the next step forward, CAS has now associated specific substances with a hierarchy of more than 30,000 specific biomolecular targets. Using these two features in combination, it is possible to associate biomolecular targets with bioactivity based on shared small molecules. Specific examples of this functionality will be provided.

8 Progress toward the bioeconomy: An overview
Jeremy L. Jenkins, Lead Finding Platform, Novartis Institutes for BioMedical Research, 250 Massachusetts Avenue, Cambridge, MA 02139

Cheminformatics and bioinformatics data sources are often disconnected with respect to their semantic underpinnings. By simply annotating chemical SAR databases with standard gene or protein IDs (e.g. Entrez Gene Symbol), we have found it possible to meaningfully federate compound assay results with extensive information surrounding targets, such as pathways, tissue expression, drug side effects, and crystal structures. The result is a chemogenomics knowledgebase that supports the complexity of real-life drug discovery questions. We show examples of applied chemogenomics, where exploiting drug target phylogenetic trees as well as “phylochemical” trees can drive lead finding even for new targets.

9 Systems Pharmacology Pathway Analysis of drugs and endogenous compounds targeting opioid receptors
Ally Perlina, Yuri Nikolsky, and Tatiana Nikolskaya, GeneGo, Inc, 169 Saxony Rd, #104, Encinitas, CA 92024

Opioids are commonly used for treatment of neuropathic pain. However, morphine, for instance, as well as other opioids demonstrate inconsistent efficacy upon administration. Tramadol hydrochloride, a semisynthetic opioid analgesic, may also affect neuropathic pain by low-affinity binding to mu receptors, as well as having weak inhibition of norepinephrine and serotonin reuptake, mirroring the mechanism of action of both opioids and TCAs. Both tramadol and morphine are metabolized by CYP2D6 and activate Mu-type opioid receptors. However, they have different systems effects and pharmacological profiles.

In order to analyze pharmacogenomic data while accounting for genetic differences and the chemical properties of any compound and/or its metabolites, integrative meta-analysis tools are needed. With MetaDrug we can analyze gene expression profiling data. Predicted targets for morphine and tramadol were used to construct networks and elucidate common and unique signaling and metabolic events attributable to each drug. Once the published data was overlaid, the expression data allowed identification of those genes in a set of potential common targets that were more likely to be associated with inter-strain variability due to different opioid-related phenotypes. Drugs can also be compared to endogenous compounds and analyzed with GeneGo systems pharmacology tools, as this type of comprehensive analysis may help pinpoint those genes that are more responsible for phenotypes of therapeutic compounds, as well as the association of genetic predisposition in key identified genes with susceptibility or resistance to various modes of treatment.

10 Systems chemical biology modeling of virulence-related pathways of Mycobacterium tuberculosis
Elebeoba May, eemay@sandia.gov, Sandia National Laboratories, P. O. Box 5800, Albuquerque, NM 87185, Andrei Leitao, aleitao@salud.umn.edu, Division of Biocomputing, University of New Mexico School of Medicine, Albuquerque, NM 87131-0001, Jean-Loup Faulon, jfaulon@sandia.gov, Computational Systems Biology Department, Sandia National Laboratories and Joint BioEnergy Institute, Albuquerque, NM 87185, Jaewook Joo, jjoo@sandia.gov, Computational Bioscience Dept, Sandia National Laboratories, Albuquerque, NM 87185, Milind Misra, mmisra@sandia.gov, Computational Systems Biology Department, Sandia National Laboratories, Albuquerque, NM 87185, and Tudor I. Oprea, toprea@salud.unm.edu, Health Sciences Center, University of New Mexico, Albuquerque, NM 87131-0001

Mycobacterium tuberculosis (Mtb) is able to persist in host tissues in a non-replicating persistence (NRP) or latent state. This presents a challenge to the treatment of tuberculosis (TB). To develop an effective treatment against latent TB, we need to understand how potential anti-microbial agents affect NRP Mtb. We investigate two virulence-associated pathways: 1) the glyoxylate-to-glycine (GtG) shunt, which may be part of an alternative energy generation path in NRP Mtb; and 2) the biosynthesis of mycolic acid, a key compound in the Mtb cell wall. We developed a systems-chemical biology (SCB) platform to perform a chemistry-centric systemic analysis of these virulence-related metabolic pathways in Mtb. Simulations of the glyoxylate pathway indicate that inhibition of malate synthase leads to an effective depletion of glyoxylate, the key metabolite in the GtG shunt. Cheminformatics studies have been used to identify potential inhibitors of this pathway, based on available chemical, biological and structural information.

 11 A chemical systems biology approach to metabolic network inference
Jean-Loup Faulon, jfaulon@sandia.gov, Computational Systems Biology Department, Sandia National Laboratories and Joint BioEnergy Institute, Albuquerque, NM 87185, Fax: 505-284-1323, and Milind Misra, mmisra@sandia.gov, Computational Systems Biology Department, Sandia National Laboratories, Albuquerque, NM 87185

The traditional method for constructing a metabolic map of a newly sequenced organism is to assign enzymatic activities (EC numbers) to its proteins. Many proteins remain unannotated not only because their sequences cannot be mapped to an already classified enzyme, but also because the reactions catalyzed by the proteins have not been characterized in the EC nomenclature. Thus, many enzymes and reactions, although occurring in various pathways, remain unannotated. In this paper, we will present unsupervised and supervised similarity and machine learning kernel methods for predicting which metabolic reactions enzymes can catalyze, using heterogeneous input consisting of both sequences and chemical structures. The methods rely on fusing protein sequence data with chemical structure data by representing each with a common cheminformatics description. We will demonstrate that our methods can perform accurate predictions (>80% accuracy) in situations where neither the sequence nor the reaction has been previously classified.

12 Pathway Analysis based on Gene Expression Profiles from Huntington's Disease Brain
Jung-Hsin Lin and Tien-Lu Huang, School of Pharmacy, National Taiwan University, 12F No.1 Ren-Ai Road Sec. 1, Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan, Fax: +886-2-23919098

Huntington's disease is a neurological disorder associated with dysfunction and degeneration of the basal ganglia. The abnormal CAG repeat in the HD gene leads to the characteristic motor and cognitive symptoms with a mid-life onset. The striatum was found to be the tissue with the earliest noted changes and the most severe damage. We used the GenePathway Viewer (http://bioinfo.mc.ntu.edu.tw:8080/GenePathway) to analyze microarray data from a public database in order to detect the important genes involved in Huntington's disease and other neurodegenerative diseases. GenePathway Viewer is a web server that can be used to visualize gene expression levels and correlated genes on maps of biological pathways. Unlike other existing pathway viewers, GenePathway Viewer is driven directly by the web service provided by the KEGG API and acquires the most up-to-date pathway information in the KEGG database. GenePathway Viewer is also a meta-server that can combine various resources on gene identification and gene annotation. The aim of this analysis was to characterize Huntington's disease pathology at the molecular level. Four brain regions were investigated: caudate nucleus, motor cortex, cerebellum, and prefrontal association cortex. In the current pathway analysis, only the microarray data for the caudate nucleus were used because the greatest number and magnitude of differentially expressed genes were found in this tissue. mRNA levels in laser capture micro-dissected neurons were measured to confirm that the mRNA changes are not due to cell loss alone. The GenePathway Viewer can successfully identify relevant pathways for Huntington's disease using the microarray data. These pathways provide a direct visualization of the roles of various differentially expressed genes in the proteomic maps. The pathways obtained in this analysis may be further integrated to provide a systems biological view of Huntington's disease.

13 Promiscuousness of cancer target binding sites: Integrating molecular structure and systems biology
Raphael A. Bauer, Jessica Ahmed, Stefan Günther, Dominic Jansen, and Robert Preissner, Institute of Molecular Biology and Bioinformatics, Structural Bioinformatics Group, Charité Universitätsmedizin Berlin (Medical University), Arnimallee 22, Berlin 14195, Germany

The analysis of microarrays from drug-treated cells revealed that drugs generally affect the expression levels of hundreds of proteins, which has to be understood from a systems biology viewpoint, considering two aspects: signaling cascades and many-to-many relations between drugs and their targets. In this survey, we analyze in atomic detail the basis of fuzzy, multiple molecular recognition, leading to redundancy, which secures the desired effects under diverse conditions. A key principle for understanding the balance between drug action and adverse effects is the promiscuousness of binding pockets in target proteins. This means that drugs generally address a number of targets with a profile of affinities and, vice versa, a distinct target will be attacked by a diverse set of compounds. The increased knowledge of drug-target structures will help to elucidate the relation of similarity between the complementary parts – binding site and ligand, respectively.

Here, we present a study that examines the relation between some 4,000 known target structures and more than a hundred anti-cancer drugs bound to them. The ligands and targets are analyzed regarding their similarity, binding specificity and cross-reactivity. The activity profiles of the drugs were calculated as cellular fingerprints, using information from the National Cancer Institute, and integrated into the study. For the estimation of similarity, general approaches were applied, which allow consistent consideration of similarity on different levels such as cellular effect and 2D or 3D similarity. A detailed understanding of the complex drug-target networks will enable rational drug combinations tailored against distinct types of cancer.

14 PubChem bioassays as a source of polypharmacology
B Chen, David J Wild, and Rajarshi Guha, School of Informatics, Indiana University, 1130 Eigenmann Hall, 1900 E 10th Street, Bloomington, IN 47406

There are currently upwards of 700 bioassays in PubChem, ranging in size from 80 to 80,000 molecules. Though the primary focus of the bioassays has been identifying molecular probes, we consider the possibility of analyzing the bioassay data to identify ligands with polypharmacological effects. We define a network model of enzymatic assays for which the protein target is known. Assays are represented as nodes, which are connected if their associated targets have a similarity greater than a cutoff value. We also consider an alternative assay network, defined in terms of active compounds common to a pair of assays. With these networks we investigate mapping functions that map the assay targets to other protein target networks. We initially focus on a drug target network and use the assay network as a method of exploring the drug target network in terms of compounds that may be active against a given drug target. We use a combination of Tanimoto similarity and binding site similarity to suggest whether a ligand tested in a bioassay may also be active against a drug target. Coupled with pathway data, we then identify compounds that may be active against multiple targets located in different pathways. We discuss some issues that affect the robustness of the approach, such as promiscuity.
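
As an illustrative aside (not the authors' implementation), the Tanimoto similarity and the alternative assay network built from shared active compounds might be sketched as below. The assay names, compound identifiers, fingerprint bit sets, and the `min_shared` threshold are all hypothetical.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def shared_active_network(actives, min_shared=1):
    """Link two assays when they share at least `min_shared` active compounds.

    `actives` maps an assay name to the set of compounds active in it.
    Returns a dict mapping (assay_x, assay_y) to the shared compounds.
    """
    edges = {}
    for x, y in combinations(sorted(actives), 2):
        shared = actives[x] & actives[y]
        if len(shared) >= min_shared:
            edges[(x, y)] = shared
    return edges


# Hypothetical data, for illustration only.
actives = {
    "assay_A": {"cpd1", "cpd2", "cpd3"},
    "assay_B": {"cpd2", "cpd3", "cpd4"},
    "assay_C": {"cpd5"},
}
edges = shared_active_network(actives, min_shared=2)
similarity = tanimoto({1, 2, 3, 4}, {3, 4, 5, 6})  # 2 shared bits / 6 total bits
```

In this toy case only `assay_A` and `assay_B` are linked, because they are the only pair sharing two or more actives.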

15 Chemical Abstracts Service 1907-2008: key people, services and enhancements
Evelyn C. Powell, Physical and Chemical Sciences Librarian, Rensselaer Polytechnic Institute, Folsom Library, Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY 12180-3590, Fax: (518) 276-2044

This paper highlights the key innovators and innovations leading to the development of the Chemical Abstracts Service we know today. We begin in 1907, when abstracts were written by hand, and follow through to the early computerization in the 1960s. We look at later computerization and enhancements. We discuss diversified products such as CA on CD-ROM, STN Express, STN Easy, SciFinder and its companion SciFinder Scholar. We conclude by discussing the new product, SciFinder on the Web.

16 Eugene Garfield and his ideas, writings, and accomplishments: Impact on science, scientists, and politics
Svetla Baykoucheva, White Memorial Chemistry Library, University of Maryland, College Park, MD 20742, Fax: 301-314-5910

When Eugene Garfield created the Science Citation Index, he could not have foreseen the dramatic impact his brilliant ideas would have on science and scientists in decades to come. As founder of the Institute for Scientific Information in Philadelphia, he laid the foundations for creating valuable information products such as Index Chemicus, Current Chemical Reactions, Current Contents, Web of Science, Essential Science Indicators, and Journal Citation Reports. His weekly essays, published for many years in Current Contents, touched on themes of enormous interest to a broad audience of scientists, academic administrators, and even politicians. In his 80s now, Dr. Garfield continues to create new information tools, his latest accomplishment being HistCite, a program that allows users to perform sophisticated analysis of the scientific literature and the publication behavior of scientists. This paper discusses how Dr. Garfield's ideas have influenced the development of science, the life of scientists, and even the (international) political discourse.

17 Josef Houben and Theodor Weyl
M. Fiona Shortt de Hernandez, Science of Synthesis, Georg Thieme Verlag, Ruedigerstrasse 14, D-70469 Stuttgart, Germany, Fax: 0049-711-8931777

Josef Houben and Theodor Weyl were two German chemists who made a significant contribution to the field of chemical information in the early 20th century. They structured and assessed organic synthetic chemical information in an exhaustive and comprehensive manner. Their reference work series, Methoden der Organischen Chemie (Houben–Weyl), provided extensive experimental detail and literature references as well as a critical assessment of synthetic methodology by experts. This meant that the reader did not have to refer to other chemistry handbooks or even the original journal articles when searching for an organic synthesis strategy thus saving valuable time. The Houben–Weyl concept involved not only processing and evaluating synthetic chemistry information, it also provided the context associated with each organic synthetic method, thus helping to inspire the chemist as well as encourage creativity. Josef Houben and Theodor Weyl were pioneers in developing a new methodology for the processing and presentation of organic synthetic information. Houben–Weyl, which is available today electronically as part of Science of Synthesis, is still cherished and valued by the organic chemist 100 years later for its structured organization and comprehensiveness.

18 Treasures from the vault: Unique items from the collections of the Chemical Heritage Foundation
Rosanne Divernieri, rosanned@chemheritage.org, Collections Coordinator, Chemical Heritage Foundation, 315 Chestnut Street, Philadelphia, PA 19106, and Jennifer Landry, jenniferl@chemheritage.org, Senior Archivist and Acting Head of Collections, The Chemical Heritage Foundation, Philadelphia, PA 19106

“It's like a candy store” is a phrase often heard by the curatorial staff at the Chemical Heritage Foundation when they bring visitors into the collection storage rooms. Shelves and aisles lined with laboratory notebooks, archival boxes, chemical apparatus, spectrometers, balances, paintings, and chemistry sets are just a few of the items that are housed at CHF. With a collecting mandate of “collecting the history of the chemical and molecular sciences,” the collection is quite far-reaching and contains many fascinating pieces of our chemical heritage. Of particular focus for this presentation will be the chemistry set collection, which is arguably one of the largest in the world, and CHF's unique collection of manuals and advertisements that document 20th century analytical instrumentation.

19 What are we going to do with these old tapes?: Creating the Woodward Chemistry Media Archive
Marcia L. Chapin, Chemistry & Chemical Biology, Harvard University, 12 Oxford St., Cambridge, MA 02138, Fax: 617-495-0788

Noted chemists have been presenting seminars at the Harvard Chemistry & Chemical Biology Department for many years. Beginning in the 1970s, these lectures were captured on various magnetic videotape formats and distributed throughout the Department. As the Department evolved, the natural repository for this material became the Chemistry & Chemical Biology Library. It was discovered that these magnetic formats were deteriorating substantially and that a concerted effort was needed to preserve this material, which was of current and historical interest. Digitization using DVD+R format provided the solution. This is the story of creating the Woodward Chemistry Media Archive, including permissions, cataloging, technical aspects, and availability (all at a reasonable cost). May I view a lecture by R.B. Woodward? Sure!

20 Real enthusiast: Edgar Fahs Smith and the history of chemistry
Lynne Farrington, Rare Book & Manuscript Library, University of Pennsylvania Libraries, Van Pelt-Dietrich Library Center, 3420 Walnut Street, Philadelphia, PA 19104

This presentation highlights both the collection and the contributions of Edgar Fahs Smith (1854-1928), professor of chemistry and later provost of the University of Pennsylvania, to the history of chemistry in the United States. Smith, who originally collected historical works on chemistry both to inspire his students and to foster his own research in the field, was instrumental in founding the Division of the History of Chemistry while serving as president of the ACS in the early 1920s.

21 Cheminformatics analysis of HIV-1 protease mutations
Gene M. Ko1, gko@rohan.sdsu.edu, A. Srinivas Reddy2, asvreddy@gmail.com, Sunil Kumar2, skumar@mail.sdsu.edu, and Rajni Garg1, rgarg@csusm.edu. (1) Computational Science Research Center, San Diego State University, 5500 Campanile Drive, San Diego, CA 92182-1245, (2) Electrical and Computer Engineering Department, San Diego State University, San Diego, CA 92182-1309

Mutations that arise in HIV-1 protease after exposure to various HIV-1 protease inhibitors have proved to be a difficult aspect in the treatment of HIV. The crystal structures of 52 HIV-1 proteases complexed with FDA-approved protease inhibitors from the Protein Data Bank (PDB) were studied. The information reported by the PDB for each crystal structure has been found to be error prone due to the nature of the PDB verification process. Incorrect structural classifications reported by the database may lead to potential structures being overlooked during the dataset collection process. The inconsistent mutation information also leads to incorrect data parameters in one’s own research. Each of the 52 structures was aligned against the wild-type HXB2 HIV-1 protease strain to create a baseline sequence from which mutations can be identified. The mutations were mapped according to their bound ligand in an attempt to analyze the mutations for each protease inhibitor.
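
The baseline-comparison step described above could be illustrated, under simplifying assumptions, with the sketch below. It assumes the two sequences are already aligned to equal length and uses made-up toy sequences rather than real HXB2 protease data; the function name `find_mutations` is hypothetical and this is not the authors' code.

```python
def find_mutations(reference, variant):
    """List point substitutions between two pre-aligned, equal-length sequences.

    Each mutation is reported as wild-type residue, 1-based position, and
    variant residue (for example 'E4Q'). Gap characters ('-') are skipped.
    """
    if len(reference) != len(variant):
        raise ValueError("sequences must be aligned to the same length")
    return [
        f"{ref}{pos}{var}"
        for pos, (ref, var) in enumerate(zip(reference, variant), start=1)
        if ref != var and "-" not in (ref, var)
    ]


# Toy sequences, for illustration only (not real protease sequences).
wild_type = "ACDEFGHIKL"
variant = "ACDQFGHIRL"
mutations = find_mutations(wild_type, variant)
print(mutations)  # ['E4Q', 'K9R']
```

Grouping the resulting mutation lists by bound ligand would then give a per-inhibitor view of the kind the abstract mentions.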

22 Extracting chemical protein interactions from literature using natural language processing methods
Dazhi Jiao, djiao@indiana.edu, School of Informatics, Indiana University at Bloomington, Wells Library 043, Bloomington, IN 47405, and David J Wild, djwild@indiana.edu, School of Informatics, Indiana University, Bloomington, IN 47408

This poster describes the development of a system to automatically build a database and entity representation of chemical-protein interactions based on information extracted from abstracts of journal articles, using machine learning and natural language processing methods. In this system, abstracts related to proteins and chemical interactions are preprocessed using named entity recognition methods to identify chemical names and protein names. Chemical structures are also attached to chemical names for future processing. The texts are then syntactically analyzed, and grammatical relationships between constituents of the sentences are generated. Interactions between proteins and chemicals are then extracted by identifying certain keywords, together with the protein and chemical names, based on the dependency graph. The extracted information, including the chemical compounds, their structures, the proteins, and the interactions between chemicals and proteins, is stored in a database for retrieval and further analysis. The information is also represented based on biological ontologies for molecular interaction networks. In this poster, the training process to build certain components of the system, problems encountered during the system creation, and the creation of the database and ontology-based representation will be discussed in detail.

23 Hierarchical screening with multiple receptor structures to target the non-nucleoside binding site of HIV-1 reverse transcriptase
Sara E. Nichols1, sara.nichols@yale.edu, Christopher Bailey2, Robert Domaoal2, Ligong Wang2, Karen S. Anderson2, and William L. Jorgensen3, william.jorgensen@yale.edu. (1) Interdepartmental program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511, (2) Department of Pharmacology, Yale Medical School, (3) Interdepartmental program in Computational Biology and Bioinformatics, Department of Chemistry, Yale University

At present, multiple protein targets are being investigated in order to suppress the retrovirus HIV-1. The target of our study, reverse transcriptase (RT), reverse-transcribes the single-stranded RNA of HIV into DNA. There are notable entries in the Protein Data Bank of an alternative conformation of residue 181, located at the non-nucleoside binding site of RT, which is different from the most common bound conformations. The significance of Y181 interactions with the ligand is confirmed by the resistance conferred by the Y181C mutation. These interactions are integral to inhibitor activity, and the alternative conformation provides new information about the dynamic nature of the binding site. Our study uses this knowledge to screen a large database, specifically targeting inhibitors which can accommodate the conformational variations of residue 181. Since there is a lack of a standard protocol for flexible receptor docking in the literature, we present a case study of RT which compromises between speed and accuracy, while still focusing on the inherent flexibility of the receptor residues. By using multiple structures, we also aimed to tackle the problem of false negatives using a rapid scoring function. After screening a publicly available database of over two million compounds, one of the four top compounds assayed showed antiviral activity in HIV-infected cell culture.

24 Reaction mechanism prediction by transformation rules and general principles
Jonathan H. Chen and Pierre Baldi, Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California, Irvine, Irvine, CA 92697

A core skill chemists need is to predict the course and major products of arbitrary reactions. This is crucial for pre-validating synthesis plans and can provide insight into new types of reactivity. A rule-based reaction expert system has been developed to support this process, with applications in computer-based learning in organic chemistry, retro-synthetic analysis, combinatorial library design and automated classification of large reaction databases. The current system comprises over 1,500 manually curated reaction patterns written using the SMIRKS language with transformation rule extensions to enable robust predictions. Application of this expert system in reaction prediction will be illustrated, including production of complete curved-arrow mechanism diagrams. Initial results on another approach based on general principles of reactivity, such as frontier molecular orbital theory, resonance, and thermodynamic and kinetic analysis, will be discussed as a means to develop even more general and robust predictions. Selected applications are available via http://cdb.ics.uci.edu.

25 Game theory and biochemical networks
John H Van Drie, John H Van Drie Research LLC, 34 Stinson Rd, Andover, MA 01810

A traditional way of modeling biochemical networks (examples include work of A Perelson and that of E Ross) is to treat the system as a set of coupled ordinary differential equations, using experimentally-determined rate constants and initial concentrations. While this approach has been successful when sufficient experimental data is available, it tends to be extremely difficult to extract general, global behaviors of the system.

Game theory may provide an alternative approach to modeling biochemical networks. It has been applied successfully to modeling ecological networks (e.g., work of J Maynard Smith), where it yields insights into phenomena like predator/prey oscillations. The key game theory concepts whose counterparts may be sought in biochemical networks are Nash equilibrium and Braess' paradox, which highlights how inhibition of one step in a system producing quantity A may paradoxically increase the overall quantity of A produced by the system.
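Braess' paradox is easiest to see in its textbook road-network form, sketched below. This is not the author's biochemical model: the network, the traffic volume and the cost functions are all illustrative assumptions. Two routes S->A->T and S->B->T each have one load-dependent leg and one fixed-cost leg, and a free shortcut A->B is then added. The sketch computes the cost per unit of traffic at the Nash equilibrium in both cases.

```python
# Textbook road-network illustration of Braess' paradox.
# All numbers are illustrative assumptions, not taken from the abstract.

N = 4000       # total units of traffic entering at S
FIXED = 45.0   # cost of a load-independent leg (A->T and S->B)


def variable_leg(load: float) -> float:
    """Cost of a load-dependent leg (S->A and B->T) carrying `load` units."""
    return load / 100.0


def equilibrium_cost_without_shortcut() -> float:
    """The two routes are symmetric, so at equilibrium traffic splits evenly."""
    half = N / 2
    return variable_leg(half) + FIXED


def equilibrium_cost_with_shortcut() -> float:
    """With a free A->B link, everyone takes S->A->B->T at equilibrium."""
    return variable_leg(N) + 0.0 + variable_leg(N)


def deviation_costs_with_shortcut() -> dict:
    """Cost for one unit that leaves the shortcut route while all others stay.

    Both values exceed the shortcut cost, so nobody gains by deviating:
    the all-shortcut state is a Nash equilibrium.
    """
    return {
        "S->A->T": variable_leg(N) + FIXED,
        "S->B->T": FIXED + variable_leg(N),
    }


if __name__ == "__main__":
    print("without shortcut:", equilibrium_cost_without_shortcut())  # 65.0
    print("with shortcut:   ", equilibrium_cost_with_shortcut())     # 80.0
    print("deviations:      ", deviation_costs_with_shortcut())      # 85.0 each
```

Read in reverse, the result mirrors the statement above: removing (inhibiting) the shortcut step lowers the equilibrium cost from 80 to 65, so the system as a whole performs better with one step blocked.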

26 Interplay of protein sequence and structure: A concept with broad implications in biology and molecular design
Nathalie Meurice1, NMeurice@tgen.org, Joseph C. Loftus2, loftus.joseph@mayo.edu, Christopher A. Lipinski2, Lipinski.Christopher@mayo.edu, Daniel P. Vercauteren3, daniel.vercauteren@fundp.ac.be, Spyro Mousses1, smousses@tgen.org, and Gerald M. Maggiora4, maggiora@pharmacy.arizona.edu. (1) Pharmaceutical Genomics Division, Translational Genomics Research Institute (TGen), 13208 E. Shea Blvd., Suite 110, Scottsdale, AZ 85259, Fax: 602-358-8360, (2) Mayo Clinic Arizona, Scottsdale, AZ 85259, (3) PCI Laboratory, University of Namur, Namur 5000, Belgium, (4) College of Pharmacy & BIO5 Institute, University of Arizona, Tucson, AZ 85721

The efforts in genome sequencing and structural genomics initiatives are generating massive amounts of protein sequence and structure data, but still many proteins lack correct functional annotation. Because structure diverges less than sequence in distant homologs, structure-derived patterns reveal important features that sequence comparison methods cannot capture. In addition to biology, this also has broad implications in molecular design. In homology modeling, assessing local structure conservation across protein families is essential since it indicates the level of confidence for important functional regions of protein models. It subsequently impacts structure-based design applications, which often rely on computational models in early stages of drug discovery. An example is presented, based on the FERM domain of the pro-invasive kinase Pyk2, that shows how an improved homology model constructed using the above approach can positively impact the design of Pyk2 FERM inhibitors, leading to improved clinical outcomes.

27 Modeling chemical reactivity in metabolism and degradation reactions
Johann Gasteiger, Lothar Terfloth, and A Tarkhov, Molecular Networks GmbH, Henkestrasse 91, D-91052, Erlangen, Germany, Fax: 0049-9131-8526566

Hazard and risk assessment of chemical compounds is presently of high interest and should be assisted by the application of computational tools. This paper will present a chemoinformatics approach to the prediction of chemical reactivity in metabolism and degradation reactions. The emphasis is on assessing the reactivity of chemicals in biotic and abiotic processes and in biodegradation.

28 Optimized coverage of ring-system and functional-group chemotypic environments in a screening library
Mark A Johnson, mark@pannanugget.com, Pannanugget Consulting, 2015 Grand Avenue, Kalamazoo, MI 49006, Gordon Bundy, Independent Consultant, Kalamazoo, MI, Darryl Chapman, dlchapman@kvcc.edu, MHTSC, Kalamazoo, MI, and Robert Kilkuskie, rkilkuskie@kvcc.edu, Michigan High Throughput Screening Center, Kalamazoo, MI

After some reminiscences regarding Gerry Maggiora's role in promoting methods in molecular similarity and in chemotypic analysis, the notions of ring-system and functional-group chemotypic environments are formally defined. The view of a molecule as a key chain with chemotypic environments for keys is motivated in the context of screening-library design. The overall protocol for selecting the 100K screening library for the Michigan High Throughput Screening Center (MHTSC) is summarized. Three critical chemotypic-coverage runs of the selection protocol are detailed. Every ring-system and functional-group chemotypic environment is represented by at least three structures if also so represented in the reference library of 480K structures. Different ideas related to chemotypic coverage are discussed in the context of the results obtained from each of the three chemotypic-coverage runs.

29 Techniques for effective integrated access to large compound-oriented drug discovery databases
Thomas R Hagadone, Global Scientific Informatics, Eli Lilly and Company, Eli Lilly Corporate Center, Indianapolis, IN 46285

Drug discovery organizations face significant challenges in organizing and analyzing the large and complex collection of data necessary to drive the discovery process. Public and proprietary data concerning disease states, genes, proteins, pathways, targets, assays (in vitro, in vivo and in silico), compounds, patents, projects and resources must be maintained and made available through appropriate interfaces to the many disciplines involved. Although there is a need for specialized systems to meet the particular requirements of each discipline, there is also a need for systems that provide an integrated view and set of analysis tools that can be applied over large segments of the total data collection and user population.

In this presentation, we will describe an approach to providing an integrated interface to a broad range of discovery data from the compound perspective to a large group of researchers. This integration method, which has been instantiated in multiple applications at various pharmaceutical companies, has proven to be effective, economical, stable and extensible over a period of three decades. The most-recent embodiment of this approach, Eli Lilly's Mobius system, employs a high-level generic model and implementation for key discovery components including ontologies, molecule and assay databases, an ad hoc query builder/reporting system, an underlying SQL-like query language and a plug-in architecture for extending the system with a variety of visualization and analysis tools. This high-level model provides the necessary hooks for the integration of specific lower-level commercial and proprietary database content, search engines, molecule objects and visualization and analysis tools. The lower-level data sources and software components which come from a variety of providers can be updated and replaced over time as desired without significantly altering the user's interaction with the system, the higher-level software components or the internally developed extensions.

30 Understanding Holistic Approaches in Molecular Similarity Analysis
Jürgen Bajorath, Life Science Informatics, University of Bonn, Dahlmannstr. 2, 53113 Bonn, Germany, Fax: +49-228-2699-341

In drug design, there are many different ways to assess molecular similarity and its relationship to biological activity. Numerous studies have shown that the complexity of computational methods does not necessarily correlate with their success in recognizing or predicting structure-activity relationships of small molecules. Holistic similarity methods and rather simple molecular representations are often surprisingly effective in identifying novel active compounds. Why is this so? Why do similarity methods succeed in some cases and fail in others? Why do different computational approaches often display comparably good or poor performance on a given compound class? If methodological details are not the sole determinants of success or failure in molecular similarity analysis, then attempts to answer such questions must ultimately take principal differences between structure-activity relationships into account. Systematic qualitative and quantitative correlation of compound similarity and potency presents us with some surprises and some good news, at least for virtual screeners and medicinal chemists.

31 2D- vs. 3D-similarity studies in combinatorial and other compound libraries
Jose L. Medina-Franco, jmedina@tpims.org, Torrey Pines Institute for Molecular Studies, 5775 Old Dixie Highway, Fort Pierce, FL 34946, Fax: 772-462-0886, and Karina Martínez-Mayorga, kmartinez@tpims.org, Computer-aided Drug Design, Torrey Pines Institute for Molecular Studies, Fort Pierce, FL 34946

Shape-based and fingerprint-based similarity methods have been used as powerful tools in virtual screening. The widespread use of the latter methods is due, at least in part, to their fast and efficient computational performance. However, one of the major problems with these methods is their inability to discern differences among stereoisomers. This is not a problem with shape-based methods, but this comes at greater computational cost. In this work, comparisons are made of the results obtained using 2D fingerprint-based and 3D shape-based similarity from our in-house combinatorial libraries and other compound collections. Despite the well-recognized issue in 3D-similarity analysis regarding the conformation of the query, we successfully applied the Rapid Overlay of Chemical Structures software (ROCS, OpenEye Scientific Software, Santa Fe, NM) to the cases under study. Comparisons between the fingerprint- and shape-based similarity methods used in this study provide guidance as to where 3D approaches are particularly useful.
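The stereoisomer limitation of 2D fingerprints can be illustrated with a minimal sketch. It assumes the Tanimoto coefficient as the similarity measure (a common choice, though the abstract does not name one) and uses made-up sets of "on" bit positions rather than fingerprints computed from real structures.

```python
# Minimal sketch of 2D fingerprint similarity, assuming the Tanimoto
# coefficient. The bit positions below are invented for illustration.


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Shared 'on' bits divided by the bits set in either fingerprint."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


# A connectivity-only fingerprint encodes which atoms are bonded, not their
# spatial arrangement, so two enantiomers map to the same bit set.
fp_r_isomer = frozenset({1, 4, 7, 9})   # hypothetical (R)-isomer
fp_s_isomer = frozenset({1, 4, 7, 9})   # hypothetical (S)-isomer
fp_analog = frozenset({1, 4, 8})        # hypothetical structural analog

if __name__ == "__main__":
    print(tanimoto(fp_r_isomer, fp_s_isomer))  # 1.0
    print(tanimoto(fp_r_isomer, fp_analog))    # 0.4
```

Set operations of this kind are cheap, which is consistent with the speed advantage noted above. The identical score for the two isomers is the blind spot that a 3D shape comparison does not share.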

32 An integrated desktop computing environment for medicinal and computational chemists
W. Jeffrey Howe, Computational Sciences, Pfizer Global Research & Development, Eastern Point Rd., Groton, CT 06340

The rapidly increasing application of in silico molecular property predictions, pharmacophore and QSAR models, and structure-based drug design calculations to assist in the drug discovery effort has resulted in heavy demand being placed on computational chemists to deliver modeling results. To address this demand, we have implemented an integrated computational environment that automates the creation of such tools by computational chemists and enables their direct exposure to project team scientists for end-user access. The environment is underpinned by the Computational Chemistry Toolbox (CCT), which now contains over 1000 commonly used protocols that are accessible to our desktop applications for structure-activity analysis, molecular modeling, library design, and so on. CCT provides a standard infrastructure for publication of models and protocols to desktop applications, eliminating duplication of effort and providing consistent, validated results regardless of which applications invoke the service. The high-level architecture of the system will be described, and examples of its application to compound design will be provided. The placement of such tools directly on the end-user scientists' desktops, for more routine calculations, allows computational chemists to take on more complex analyses and has promoted a 'design culture' among the population of project team scientists for the discovery of new drugs.

33 Computational model of molecular evolution
I. D. Kuntz, Department of Pharmaceutical Chemistry, University of California at San Francisco, Genentech Hall, 600 16th Street, Box 2240, San Francisco, CA 94143-2240

I describe a simple model for molecular evolution based on Monte Carlo simulations of short 2-dimensional "peptide-like" chains. These chains contain sequence information that can be altered through mutation. A cycle of conformation search, survival testing, asexual reproduction, and mutation constitutes a "generation". The simulation lasts for a few hundred generations and involves thousands of individual chains. While this model is extremely simple, it illuminates a number of fundamental aspects of biological evolution such as the definition of "species", "fitness", and adaptation. The model also leads to the characterization of a Darwinian optimizer that differs considerably from the genetic algorithms in current use. Finally, aspects of the flow of information during evolution can be examined and characterized.
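The generation cycle described above can be sketched as a short loop. This is only a toy under stated assumptions, not the author's model: the two-letter alphabet, chain length, population size, mutation rate and selection rule are all invented, and the Monte Carlo conformation search is replaced by a trivial placeholder score.

```python
import random

# Toy sketch of a generation cycle: survival testing, asexual reproduction
# and mutation. Every constant and the scoring rule are assumptions.

ALPHABET = "HP"        # two hypothetical residue types
CHAIN_LENGTH = 12
POP_SIZE = 50          # kept even so that halving and doubling round-trips
MUTATION_RATE = 0.05   # per-residue chance of being resampled


def score(chain: str) -> int:
    """Placeholder for the conformation search: counts adjacent H-H pairs."""
    return sum(1 for a, b in zip(chain, chain[1:]) if a == b == "H")


def mutate(chain: str, rng: random.Random) -> str:
    """Resample each residue with probability MUTATION_RATE."""
    return "".join(
        rng.choice(ALPHABET) if rng.random() < MUTATION_RATE else residue
        for residue in chain
    )


def generation(population: list, rng: random.Random) -> list:
    """One cycle: keep the better-scoring half, copy each survivor twice, mutate."""
    survivors = sorted(population, key=score, reverse=True)[: len(population) // 2]
    return [mutate(parent, rng) for parent in survivors for _ in range(2)]


def simulate(generations: int = 200, seed: int = 0) -> list:
    """Run the cycle from a random starting population and return the final one."""
    rng = random.Random(seed)
    population = [
        "".join(rng.choice(ALPHABET) for _ in range(CHAIN_LENGTH))
        for _ in range(POP_SIZE)
    ]
    for _ in range(generations):
        population = generation(population, rng)
    return population


if __name__ == "__main__":
    final = simulate()
    print(max(final, key=score), max(score(chain) for chain in final))
```

There is no crossover step: offspring differ from their single parent only through mutation. That matches the asexual reproduction named in the abstract and is one way such a scheme can differ from a conventional genetic algorithm.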

34 The Similarity-Property Principle and Beyond
Jordi Mestres, Chemogenomics Laboratory, Research Unit on Biomedical Informatics, Municipal Institute of Medical Research and University Pompeu Fabra, Doctor Aiguader 88, 08003 Barcelona, Spain, Fax: +34 93 316 0550

One of the key concepts introduced in the seminal book entitled “Concepts and Applications of Molecular Similarity” edited by Johnson and Maggiora in 1990 is the similarity-property principle by which similar molecules should exhibit similar properties. Despite its simplicity, this principle has been widely used in molecular design, with applicability in areas such as compound acquisition, library design, virtual screening and, more recently, pharmacological profiling. This talk will offer a modern perspective of the principle and its projection towards a systems approach to multitarget drug discovery.

35 Fuzzy Set Theory - A Tool for Soft Modeling in Chemical and Bioinformatics
Gerald M Maggiora, Department of Pharmacology and Toxicology, University of Arizona, College of Pharmacy, Bio5 Institute, Tucson, AZ 85721

Fuzzy set theory (FST) was designed to handle the types of vague and uncertain information that are an important part of the emerging field of soft and granular computing. FST has seen widespread application in many areas of engineering, but has not as yet had much impact in chemical and bioinformatics. This may be about to change, especially as the amount of information in the biological and pharmaceutical sciences continues to grow at a significant rate. To paraphrase Lotfi Zadeh, the father of FST, as the amount of information grows the level of detail at which it can be treated effectively must decrease - a situation that FST is ideally suited to handle. A brief description of the salient features of FST will be presented, along with examples illustrating how FST can be applied to problems of interest in chemical and bioinformatics. Future prospects of FST will also be discussed.
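As a minimal illustration of the basic machinery, the sketch below defines a triangular membership function and Zadeh's standard min/max/complement operators. The two fuzzy sets ("drug-like molecular weight" and "moderate lipophilicity") and their breakpoints are hypothetical examples chosen for this sketch, not taken from the talk.

```python
# Minimal fuzzy-set sketch: graded membership plus Zadeh's standard operators.
# The fuzzy sets and their breakpoints are hypothetical illustrations.


def triangular(x: float, low: float, peak: float, high: float) -> float:
    """Membership rises linearly from low to peak, then falls to zero at high."""
    if x <= low or x >= high:
        return 0.0
    if x <= peak:
        return (x - low) / (peak - low)
    return (high - x) / (high - peak)


def fuzzy_and(mu_a: float, mu_b: float) -> float:
    return min(mu_a, mu_b)      # standard intersection


def fuzzy_or(mu_a: float, mu_b: float) -> float:
    return max(mu_a, mu_b)      # standard union


def fuzzy_not(mu_a: float) -> float:
    return 1.0 - mu_a           # standard complement


def druglike_weight(mw: float) -> float:
    """Hypothetical fuzzy set: full membership at 350 Da, none outside 150-550 Da."""
    return triangular(mw, 150.0, 350.0, 550.0)


def moderate_logp(logp: float) -> float:
    """Hypothetical fuzzy set: full membership at logP 3, none outside 1-5."""
    return triangular(logp, 1.0, 3.0, 5.0)


if __name__ == "__main__":
    mu_mw = druglike_weight(250.0)    # 0.5
    mu_logp = moderate_logp(3.0)      # 1.0
    print(fuzzy_and(mu_mw, mu_logp))  # 0.5
    print(fuzzy_or(mu_mw, mu_logp))   # 1.0
```

A crisp cutoff would classify the 250 Da compound as simply in or out. The graded value of 0.5 records that it is a borderline member, which is the kind of vagueness FST is meant to represent.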

36 Mining the space of known drugs and targets to rationalize pharmacology
Ajay N. Jain, Cancer Research Institute, University of California, San Francisco, Box 0128, San Francisco, CA 94143-0128, Fax: 650-240-1781

There are slightly more than 1000 marketed small molecule therapeutics in North America. Just tens of desired biological targets account for the therapeutic effects of the vast majority of them. However, advances in biological experimental investigation have begun to uncover an increasing number of undesired side-effect targets. These include enzymes, receptors, transporters, ion channels, and transcription factors, and the picture of biological impact within the pharmacopeia looks very complex from this perspective. We advocate systematic 3D computational modeling of many targets, both by docking and by ligand-based methods, in order to rationalize drug activity patterns and also to make predictions that can have a practical impact on the lead optimization cycle.

37 Accurate and fast virtual screening using 3D pharmacophore queries
Gerhard Wolber, wolber@inteligand.com, Inte:Ligand GmbH, Mariahilferstrasse 74B/11, 1070 Vienna, Austria, Johannes Kirchmair, Institute of Pharmacy, University of Innsbruck, Innsbruck 6020, Austria, Fabian Bendix, bendix@inteligand.com, Computer Science Group, Inte:Ligand GmbH, 1070 Vienna, Austria, and Thierry Langer, langer@inteligand.com, Inte:Ligand GmbH, 2344 Maria Enzersdorf, Austria

Virtual screening using 3D pharmacophores has become an important technique for fast searching of large multi-conformational molecule databases and has been successfully used in virtual drug discovery. Commercially available search algorithms are based on cascading n-point fingerprinting and filtering steps and suffer from inaccuracies regarding the search results due to speed optimizations, which are needed for the high throughput these algorithms provide. We propose a new search algorithm that is based on our recently published pattern-matching 3D alignment algorithm that provides high accuracy while still maintaining high performance. Additionally, we compare and discuss 3D pharmacophore search algorithms with respect to their data mining capabilities, their geometric accuracy, coverage and speed.

38 Hierarchical clustering of chemical structures by maximum common substructures
Miklos Vargyas and Ferenc Csizmadia, ChemAxon Ltd, Maramaros koz 3/a, 1037 Budapest, Hungary, Fax: 361-453-2659

Cluster analysis has been shown to be successful in the categorization of physico-chemical and biological properties of compounds. However, conventional approaches to clustering molecular structures, where chemical graphs are transformed into sequences of numbers, seldom meet chemists' expectations. Graph-based techniques that cluster compounds with respect to common structural motifs are gaining in popularity as these can better mimic human categorization. One such graph-based method, called LibraryMCS, which clusters compounds according to their maximum common substructures (MCS) in a hierarchical manner, is presented. Unlike some other graph-based clustering methods, LibraryMCS neither involves a similarity-based pre-clustering step nor relies on predefined fragments. Recent evaluation by different research groups indicated that LibraryMCS was capable of producing high-quality clusters agreeing with human categorization within practicable time (approximately 1000 structures/s). The presentation will recount and demonstrate typical usages of LibraryMCS: virtual HTS hit set profiling, R-group decomposition by learned scaffolds, perception of novel scaffolds, reverse engineering of combinatorial libraries, diversity assessment of large chemical libraries and compound acquisition.

39 Searching fragment spaces with Feature Trees
Uta Lessel, Uta.Lessel@bc.boehringer-ingelheim.com, Department of Lead Discovery - Computational Chemistry, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riss 88397, Germany, Fax: +49/7351/83-3062, and Bernd Wellenzohn, Bernd.Wellenzohn@bc.boehringer-ingelheim.com, Department of Lead Discovery - Computational Chemistry, Boehringer Ingelheim Pharma GmbH & Co KG, Biberach D-88397, Germany

Virtual combinatorial chemistry easily produces billions of compounds, which cannot be screened in a conventional manner even with the fastest methods available. An efficient solution for such a scenario is the generation of fragment spaces which encode huge numbers of virtual compounds by their fragments/reagents and rules of how to combine them. Fragment spaces can be screened with so-called fragment space searches using e.g. the Feature Tree descriptor.

This descriptor is frequently used for virtual screening and has potential for scaffold hopping. The fragment space searches are performed without ever fully enumerating all virtual products.

In this presentation we show the preparation of fragment spaces based on combinatorial chemistry and share our experiences with fragment space searches based on the Feature Tree descriptor in a possible workflow to use this methodology in a pharmaceutical setup.

40 Applications of Rough Set Theory in drug discovery: Analysis of HTS data relative to the inhibition of Aurora A kinase
Joachim Petit1, jpetit@tgen.org, Nathalie Meurice1, NMeurice@tgen.org, Spyro Mousses1, smousses@tgen.org, Daniel Von Hoff2, and Haiyong Han2, hhan@tgen.org. (1) Pharmaceutical Genomics Division, Translational Genomics Research Institute (TGen), 13208 E. Shea Blvd, Suite 110, Scottsdale, AZ 85259, (2) Translational Drug Development Division, Translational Genomics Research Institute (TGen)

The advent of high-throughput experimental methods in pharmaceutical research has led to dramatic increases in the amount and variety of data available. Extracting essential information from these large datasets poses a significant challenge to many current methods of data analysis, which represents a potential impediment to drug discovery. Rough Set Theory (RST) is a data classification methodology originally developed by Pawlak in the 1980s, which has met with success in medical applications such as diagnosis. RST relies on two principles: (1) removing superfluous information from the data and (2) expressing significant information as a set of simple association rules.

In this contribution, we show how RST can be used to develop structure-based rules associated with inhibition of Aurora A Kinase from the results of high-throughput screening of a large, diverse compound library.
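The core rough-set construction, Pawlak's lower and upper approximations of a target set, can be sketched in a few lines. The compounds, the two binary descriptors and the activity labels below are invented for illustration and have no connection to the Aurora A screening data.

```python
from collections import defaultdict

# Minimal rough-set sketch: lower and upper approximations of the set of
# active compounds. The decision table is invented for illustration.

COMPOUNDS = {
    "c1": {"hinge_binder": 1, "halogen": 0},
    "c2": {"hinge_binder": 1, "halogen": 0},
    "c3": {"hinge_binder": 1, "halogen": 1},
    "c4": {"hinge_binder": 1, "halogen": 1},
    "c5": {"hinge_binder": 0, "halogen": 0},
    "c6": {"hinge_binder": 0, "halogen": 1},
}
ACTIVES = {"c1", "c2", "c3"}


def indiscernibility_classes(objects: dict, attributes: list) -> list:
    """Group objects that share the same values on the chosen attributes."""
    classes = defaultdict(set)
    for name, values in objects.items():
        classes[tuple(values[a] for a in attributes)].add(name)
    return list(classes.values())


def approximations(objects: dict, attributes: list, target: set) -> tuple:
    """Return (lower, upper): objects certainly / possibly in the target set."""
    lower, upper = set(), set()
    for cls in indiscernibility_classes(objects, attributes):
        if cls <= target:
            lower |= cls     # whole class lies inside the target
        if cls & target:
            upper |= cls     # class overlaps the target
    return lower, upper


if __name__ == "__main__":
    print(approximations(COMPOUNDS, ["hinge_binder", "halogen"], ACTIVES))
    print(approximations(COMPOUNDS, ["hinge_binder"], ACTIVES))
```

With both descriptors the lower approximation is {c1, c2}, which supports the certain rule "hinge binder and no halogen implies active", while c3 and c4 are indiscernible yet differ in activity and so fall only in the upper approximation. Dropping the halogen descriptor empties the lower approximation, so in this toy table that descriptor is not superfluous.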

41 Finding drug information in integrated chemistry and life sciences databases: PubChem and DiscoveryGate
Svetla Baykoucheva, White Memorial Chemistry Library, University of Maryland, College Park, MD 20742, Fax: 301-314-5910

While science is becoming more and more interdisciplinary, valuable resources such as PubChem and DiscoveryGate (DG) are still regarded as created for, and used mainly by, chemists. PubChem is a fast-growing database of properties of small organic molecules, including many drugs. DG provides access to chemical, medicinal, toxicity, metabolite, cancer, and drug databases that can be accessed from a single entry point, the MDL Compound Index. Users of DG can expand their searches to PubChem, and users of PubChem can expand their searches to DG. While PubChem is a free resource, DG requires expensive subscriptions. The purpose of this paper is to show how the life sciences community could greatly benefit from using integrated chemistry and life sciences databases such as PubChem and DG to find specific, detailed, accurate, and independent information about the properties, synthesis, metabolism, and effects of drugs.

42 Digital preservation readiness
Thomas F. R. Clareson, PALINET, 3000 Market Street, Suite 200, Philadelphia, PA 19104

This presentation will cover work on the Northeast Document Conservation Center (NEDCC) "Digital Preservation Readiness Survey" project, where consultants visited cultural institutions to review their digital programs, policies, and work in digital preservation. This groundbreaking project adapts some tools and activities from the traditional preservation world to the digital age, and has resulted in the development of new resources which cultural heritage institutions can utilize in building their digital preservation programs.

43 Using CLOCKSS for long-term digital preservation
Grace Baysinger, graceb@stanford.edu, Swain Library of Chemistry and Chemical Engineering, Stanford University Libraries, 364 Lomita Drive, Organic Chemistry Building, Stanford, CA 94305-5081, Fax: 650-725-2274, and Victoria Reich, vreich@stanford.edu, Director, LOCKSS Program, Stanford University Libraries

Unlike paper copies, digital information is fragile. Because the online version of journals is now considered the authoritative version of record, assuring long-term access to online content has become an urgent problem. The CLOCKSS - Controlled LOCKSS (Lots of Copies Keep Stuff Safe) initiative is a partnership of libraries and publishers committed to ensuring long-term access to scholarly work in digital format. To address digital preservation needs, the initiative is creating a secure, multi-sited "dark" archive of web-published content that will become accessible to researchers worldwide for free only after the occurrence of certain defined trigger events. This talk will cover current activities and future plans for CLOCKSS as well as a brief overview of LOCKSS.

44 Addressing the e-journal preservation conundrum: Understanding Portico
Ken DiFiore, Portico, 149 Fifth Avenue, 8th Floor, New York, NY 10010

Teaching and research have become increasingly dependent upon the convenience and enhanced accessibility of electronic scholarly resources, particularly electronic journals. Along with the use of these resources comes the challenge associated with protecting them for future generations of scholars, researchers, and students. Prior to the development of e-journals, libraries maintained the long-term availability of scholarly research through storing and preserving print copies. However, the preservation of e-journals raises many new questions for librarians. Portico has been created to address the growing concern over the preservation of scholarly e-journals, and with the hope that providing a trusted archival home for these 'at-risk assets' will assist libraries in their transition from print to electronic. This presentation will examine the issues surrounding e-journal preservation and provide a general overview of the Portico archiving service, including a description of Portico's approach to preserving e-journals and the business model for supporting the Archive.

45 Digital preservation of scholarly journals: The publisher perspective
Adam Chesler, Publications Division, American Chemical Society, 1155 16th Street NW, Washington, DC 20036, Fax: 202-776-8290

Archival preservation of books and journals has long been the purview of the library: after all, in the print environment, journals were delivered to the library and physical copies were kept on its shelves. Now, as the online versions of journals become the primary if not only form of access to scholarly materials, publishers are not only distributing content, but hosting it as well. How can they serve their customers and their authors, and collaborate in the development and implementation of digital preservation programs? What are the unique and shared approaches available to publishers, and how can they partner with libraries to manage the long-term stewardship of the scholarly record?

46 Ensuring long-term digital preservation: The view from a large STM publisher
Daviess Menefee, Director, Library Relations, Elsevier, 343 Reinhard Ave., Columbus, OH 43206

This presentation examines the path that Elsevier followed in creating its Digital Archiving and Preservation policy. It traces the steps from the early stages where there was a Mellon Grant with Yale University up to the present alliances with Portico and the Royal Dutch Library. Along the road there were potholes and these are discussed also. The presentation, furthermore, stresses the need for publisher participation in the various archiving initiatives. It concludes with a review of the CLIR recommendations for publishers.

47 Integration of data curation into publishing workflows
Sayeed Choudhury, Sheridan Libraries, Johns Hopkins University, 3400 North Charles Street, Baltimore, MD 21218

One of the emerging areas of preservation relates to data curation, which the Digital Curation Centre defines as “maintaining and adding value to a trusted body of digital information for current and future use; specifically, we mean the active management and appraisal of data over the life-cycle of scholarly and scientific materials.” The Sheridan Libraries at Johns Hopkins University have been working with the Virtual Observatory to develop a prototype data curation system for high-level, refined data that are most often cited in publications. This prototype system brings together the library, the professional society and publishers. By combining data curation into existing publishing workflows, we maximize the probability of gathering data and associated metadata for long-term preservation.

48 Using text-mining and crowdsourced curation to build a structure centric community for chemists
Anthony J. Williams, ChemZoo Inc, 904 Tamaras Circle, Wake Forest, NC 27587

ChemSpider is a free-access online structure-based community for chemists to research data and information. The database of over 20 million chemical structures and associated data has been derived from depositions by well over a hundred contributing data sources including chemical vendors, commercial database providers, web-based scraping of data and individual scientists looking to share their information with the community. Text-mining and conversion of chemical names and identifiers to chemical structures have made an enormous contribution to the availability of diverse data on ChemSpider, with contributions drawn from patents, open access articles and various online resources. This presentation will provide an overview of the present state of development of this important public resource and review the processes and procedures for the harvesting, deposition and curation of large datasets derived via text-mining and conversion.

49 Text mining for Cheminformatics applications
Mark J. Embrechts1, embrem@rpi.edu, Mike Krein2, and Curt M. Breneman2, brenec@rpi.edu. (1) Department of Decision Sciences & Engineering Systems, Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY 12180, (2) Department of Chemistry / RECCR Center, Rensselaer Polytechnic Institute, Troy, NY 12180

Text mining is the discovery of novel, non-obvious, and interesting information from a single text document, or from a vast number of text documents. This presentation demonstrates text mining for a corpus of cheminformatics-based abstracts and documents. Automated document clustering and searching for relevant documents from a cheminformatics-specialized corpus of documents will be explained. Special operations such as extracting a specialized cheminformatics relevant dictionary, text cleansing and constructing chemically relevant tagging systems will also be addressed in this presentation.

50 Mining, Storage, Retrieval: The Challenge of Integrating Cheminformatics with Chemical Structure Recognition in Text and Images
Valentina Eigner-Pitto, Josef Eiblmaier, Hans Kraut, Larisa Isenko, Heinz Saller, and Peter Loew, InfoChem GmbH, Landsberger Strasse 408, Munich 81241, Germany

Text mining in chemistry and drug discovery relies heavily on the automated extraction of chemical compounds and pharmaceutical substance names from text and images.

In this presentation, a hybrid approach combining information science, cheminformatics, computational linguistics and pattern recognition techniques will be presented. Various text mining applications have been developed recently that promise comprehensive access to knowledge for researchers. However, in many cases the quality of the extracted chemical content in terms of precision and recall is questionable. Poor image quality, ambiguous notation or incorrect names can be the source of errors and wrong results. Thus, strict chemical validation and verification of the extracted information is of utmost importance to achieve reliable and consistent results. The approach presented here combines specialized software tools for graphical structure recognition, chemical named entity extraction and name-to-structure conversion. Combination with established verification and checking tools for automatic chemical validation ensures high quality in the generated content.

51 Automated extraction of chemical structures in large text corpora
Nicko Goncharoff, SureChem, Inc, 2255 Van Ness Avenue, Suite 101, San Francisco, CA 94109

Chemists, biologists and intellectual property analysts are increasingly seeking the ability to perform structure searches on large text collections, such as scientific journals, in-house document repositories and patent collections. We present a scalable, validated system for extracting chemical names from very large collections of documents, converting those names to chemical structures and presenting them to users as a text and structure-searchable database. We will discuss the process used to generate structure searchable databases, including methodology, validation results and technical challenges.

We will also show a production application of this process, in which we have generated a structure searchable database of more than 9 million full-text patent documents from the US, Europe, World Intellectual Property Organization and Japan. Lastly we will focus on ways to apply this approach to journal article and in-house text collections.

52 Chemical data mining in documents
Matthew T. Stahl, Roger A. Sayle, and Joseph J. Corkery, OpenEye Scientific Software, 9 Bisbee Court, Suite D, Santa Fe, NM 87508

The chemical information present in document data sources of disparate types has tremendous value. Manual extraction, if done properly, has the highest likelihood of teasing out the relevant data with high integrity. For low throughput use cases where data quality is paramount, manual extraction is routinely practiced. Automated high throughput data extraction methods produce value, even when precision is low, due to the sheer magnitude of data being processed. This paper will focus on medium throughput methods for extracting chemical information out of a number of document sources. The technology described can both assist the low throughput extraction of information in an interactive fashion, and feed systems built for high throughput extraction of chemical information. A number of practical applications will be presented.

53 Optical Structure Recognition Application (OSRA)
Igor V. Filippov, igorf@helix.nih.gov, Laboratory of Medicinal Chemistry, SAIC-Frederick, Inc., NCI-Frederick, 376 Boyles St, Frederick, MD 21702, and Marc C. Nicklaus, mn1@helix.nih.gov, Laboratory of Medicinal Chemistry, National Cancer Institute - Frederick Cancer Research and Development Center, National Institutes of Health, Frederick, MD 21702

We present the latest developments of our Optical Structure Recognition Application (OSRA). OSRA is an open source project which has been designed to extract chemical structure images from documents such as patents and scientific publications and convert the extracted images into the computer-readable SMILES format. A variety of image formats, drawing conventions and graphical resolutions is currently supported. Recent work concentrates on improvements in the areas of conversion accuracy, efficiency and automation.

54 Introducing CLiDE Pro
Aniko Valko, a.p.johnson@chemistry.leeds.ac.uk, Keymodule Ltd, Leeds, United Kingdom, A Peter Johnson, a.p.johnson@chemistry.leeds.ac.uk, School of Chemistry, University of Leeds, Leeds LS2 9JT, United Kingdom, and Aniko Simon, aniko@simbiosys.ca, SimBioSys Inc, Toronto, ON M9W 6V1, Canada

CLiDE Pro is the latest incarnation of software to emerge from the long-term CLiDE (Chemical Literature Data Extraction) project. Chemical OCR involves three main problems: (a) identification of chemical images within a document, (b) compilation of chemical graphs of individual molecules from chemical images, and (c) interpretation of complex objects such as generic molecules and reaction schemes using the retrieved chemical graphs. The structure recognition methods implemented in CLiDE Pro will be presented. Structure features which frequently cause problems such as crossing bonds, lines found in various chemical entities such as single bonds attached to triple bonds, dashed bonds and parts of atom labels commonly misclassified as lines (e.g. I and Cl) will be discussed together with our solutions to these problems. A key component of the presentation will be CLiDE Pro's approach to the interpretation of generic structures.

55 Riding the rising tide of research data: A university library considers its role
Leah R. Solla, lrm1@cornell.edu, Physical Sciences Library, Cornell University, 293 Clark Hall, Ithaca, NY 14853-2501, Fax: 607-255-5288, and Gail Steinhart, gss1@cornell.edu, Life Sciences and Specialized Services, Cornell University, Ithaca, NY 14853

Enabling new discoveries by exposing data for use in data-driven research, ensuring access to and preservation of scholarly output, and meeting requirements of funding agencies and institutions regarding data management, retention, and access are primary motivations for developing robust data curation infrastructure. However, such infrastructure is not well developed across disciplines, scales, or contexts, and organizations are examining potential roles in the areas of cyberinfrastructure development, data-driven scholarship, and data curation. Research libraries have demonstrated expertise in a number of areas that could be productively applied to the practice of data curation, including: principles and policies related to scholarly communication; description and discovery; interoperability; digital preservation; selection and appraisal; user support for information retrieval, computer and Internet applications; and business models. This paper discusses data curation activities being investigated by the Cornell University Library in support of the university goal to “enable and encourage the faculty, their students and staff to lead in the preservation, discovery, transmission, and application of knowledge, creativity and critical thought.”

56 Curating chemistry data through its lifecycle: A collaboration between library and laboratory in scientific data preservation
Jeremy R Garritano, jgarrita@purdue.edu, Mellon Library of Chemistry, Purdue University, 504 W. State St., West Lafayette, IN 47907, and Jake R. Carlson, jrcarlso@purdue.edu, Distributed Data Curation Center (D2C2), Purdue University, West Lafayette, IN 47907

Collaborating with the Center for Authentic Science Practice in Education (CASPiE), the Purdue University Libraries have created a model for the capture and preservation of data as part of the experimental workflow. In the CASPiE program, students work closely with a faculty researcher on a real world research problem and generate data using a networked system of scientific instrumentation. The data may pass through three stages in its lifecycle. First, the data is the student's record within the course and therefore has educational value. Second, the data also supports the scientist's work and therefore has research value. Finally, the data may be “published” to a wider community and therefore has lasting value which needs to be preserved. The flow of data from student to researcher to a digital repository has been outlined and partially implemented. Collaboration tips, metadata issues, preservation concerns, and instrument/software challenges will be discussed.

57 Archiving the Sloan Digital Sky Survey: A model for collaborative digital preservation
Andrea Twiss-Brooks, University of Chicago, John Crerar Library, 5730 S. Ellis Ave, Chicago, IL 60637-1403

The Sloan Digital Sky Survey (SDSS) is a co-operative scientific project to map a portion of the sky in detail. When the project is completed in fall 2008, it will have produced over 35 terabytes of data comprising object catalogs, images, and spectra. While the project remained active, SDSS data was housed at Fermilab, but long-term storage and preservation of the data were not within the scope of Fermilab's mission. Following discussions with the SDSS project director and others, the University of Chicago Library undertook a pilot project in 2007-2008 to investigate the feasibility of long-term storage and archiving of the project data, archiving of the administrative materials for SDSS, and providing ongoing access by scientists and educators to the data through the SkyServer user interface.

58 On the TRAIL of technical reports
Daureen Nesdill, daureen.nesdill@utah.edu, Interim Head, Science and Engineering Library, University of Utah, J. Willard Marriott Library, 295 South 1500 East, Salt Lake City, UT 84112-0860, and Patricia Kirkwood, pkirkwo@uark.edu, University of Arkansas Libraries, University of Arkansas, Fayetteville, AR 72701-4002

The microcard has taught the library community to avoid unrecoverable formats and processes. Digital preservation processes have yet to be proven sustainable. For these reasons the Greater Western Library Alliance (GWLA), along with the Center for Research Libraries (CRL) and other libraries, is establishing a paper archive for the TRAIL project until the digital preservation issues have become more standardized and routine. TRAIL is the Technical Reports Archive and Image Library. In addition to assembling a copy for digitization, the project will maintain a paper archive of this important gray literature through a collaborative model. The resulting electronic images will serve as the main access point as well as the index and shelf list to an important body of federal technical reports that has languished without indexing in paper and microformats. Engineering and government documents librarians are working together to preserve and provide access to this unique literature. This paper will describe our process to ensure preservation as well as improve access.

62 Comparison of machine learning algorithms to predict ADME properties using chemical descriptors and molecular fingerprints
Anthony E. Klon, aklon@pcop.com, Molecular Modeling, Pharmacopeia Drug Discovery, Inc, P.O. Box 5350, Princeton, NJ 08543-5350, Fax: 609-655-4187, and David J. Diller, ddiller@pharmacop.com, Molecular Modeling, Pharmacopeia, Princeton, NJ 08543-5350

We have compared the performance of ten different machine learning algorithms available in Weka to create binary classification models for blood-brain barrier (BBB) penetration and human intestinal absorption (HIA). For each data set, two models were constructed for each binary classifier: one using chemical descriptors and one using molecular fingerprints based on atom pairs and topological torsions, resulting in a total of 20 models for BBB penetration and HIA prediction. We describe the selection of descriptors used to train the chemical descriptor models. For both BBB and HIA datasets, the performance of all ten chemical descriptor models was tested by randomly scrambling the descriptors. For both datasets, the performance of all twenty models, descriptor- and fingerprint-based, was further assessed by randomly assigning compounds to the BBB penetrant / non-penetrant or HIA well-absorbed / poorly absorbed classes.
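The class-randomization check described in this abstract can be illustrated with a minimal sketch. It assumes a leave-one-out 1-nearest-neighbour classifier as a stand-in for the ten Weka algorithms, and the toy two-cluster data, function names and number of scrambling rounds are all hypothetical. A model that has learned a real relationship should score well on the true labels and close to chance once the labels are shuffled.

```python
import random


def loo_1nn_accuracy(points, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i, p in enumerate(points):
        # Nearest other point by squared Euclidean distance.
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, points[j])),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(points)


def scrambled_baseline(points, labels, n_rounds=20, seed=0):
    """Mean accuracy after randomly reassigning the class labels."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        scores.append(loo_1nn_accuracy(points, shuffled))
    return sum(scores) / len(scores)


# Toy descriptor data (hypothetical): two well-separated clusters.
points = [(float(i), 0.0) for i in range(10)]
points += [(float(i) + 100.0, 0.0) for i in range(10)]
labels = [0] * 10 + [1] * 10

real = loo_1nn_accuracy(points, labels)
baseline = scrambled_baseline(points, labels)
```

If `real` is not clearly above `baseline`, the apparent performance of the model is likely to reflect chance rather than structure in the data.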

63 New insights into membrane permeability gained from statistical models trained on high content screening, Caco-2 and passive permeability assay data
Sai Chetan K. Sukuru1, chetan.sukuru@novartis.com, Meir Glick1, meir.glick@novartis.com, Suzanne Tilton2, suzanne.tilton@novartis.com, Josef Scheiber1, josef.scheiber@novartis.com, Jeremy L. Jenkins1, jeremy.jenkins@novartis.com, and John W. Davies1, john-w.davies@novartis.com. (1) Lead Finding Platform, Novartis Institutes for BioMedical Research, 250 Massachusetts Avenue, Cambridge, MA 02139, (2) In Vitro ADME Profiling, Novartis Institutes for BioMedical Research, Cambridge, MA 02139

Recent years have seen an increased emphasis in the pharmaceutical industry on thorough evaluation of ADME properties of compounds in the early drug discovery phase. Cell membrane permeability plays an important role in GI-tract absorption and in eventually reaching the desired cellular target. Predictive in silico models of cell membrane permeability are therefore a useful tool in lead optimization and library design. However, improving the accuracy and the interpretation of such models is often challenging. We have built naïve Bayesian models trained on extended-connectivity fingerprints (ECFPs) of compounds tested in two permeability assays (PAMPA, measuring passive membrane permeability, and the Caco-2 cell-based assay) and in a high content cell-based screening assay. The experimental data from each assay were divided into training and test sets (70:30). Applying the models to each of their corresponding test sets yielded ROC curve AUCs of 0.85 or greater, providing a clear and robust classification of the favorable compounds. The information-rich ECFP descriptors enable us to identify molecular features that could enhance the permeability of compounds. The relationship between permeability of compounds and their activity (or lack thereof) in high content cell-based screening assays was studied by applying our models built on different assays to each other's test sets. While the permeability models perform well on each other's test sets (ROC curve AUCs of 0.81-0.82 between Caco-2 and PAMPA), their correlation with the model built on cell-based screening assays seems to be poor (ROC curve AUCs of 0.55-0.61 on each other's test sets), suggesting that highly permeable compounds need not be active in cell-based screening assays and vice versa.
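The evaluation protocol in this abstract (a 70:30 split followed by ROC AUC on the held-out set) can be sketched as follows. The function names are hypothetical, the split is a plain random shuffle (the abstract does not say how the split was made), and the AUC is computed from its rank definition rather than with the authors' software.

```python
import random


def split_70_30(items, seed=0):
    """Randomly split items into a 70% training set and a 30% test set."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def roc_auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


train, test = split_70_30(range(10))

# Hypothetical model scores for four test compounds (1 = permeable).
auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

On this scale, 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the reading behind the cross-assay comparison in the abstract.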

64 Processing drug discovery raw data collaboratively and openly using Open Notebook Science
Jean-Claude Bradley, bradlejc@drexel.edu, Department of Chemistry, Drexel University, 3141 Chestnut Street, Philadelphia, PA 19104, Rajarshi Guha, rguha@indiana.edu, School of Informatics, Indiana University, Bloomington, IN 47406, and Phillip Rosenthal, San Francisco Division of Infectious Diseases, University of California, San Francisco, CA 94143

The UsefulChem project, designed to publicly report ongoing research within a research group working on the development of anti-malarial and anti-tumor agents, will be described. The project makes use of free hosted tools as much as possible so that the infrastructure can be easily replicated by other research groups. Such an open architecture is conducive to productive collaboration between groups of complementary competency. For example, the design, synthesis and testing of novel anti-malarial agents, bringing together groups from Indiana University, Drexel University and UCSF, will be detailed.

65 Evolution of Data Models for SAR and Modeling of ADME/Tox Properties
Rishi R. Gupta, rishi.gupta@pfizer.com, CS CoE, Pfizer Inc, Eastern Point Road, MS 8260-1422, Groton, CT 06340, and Eric M. Gifford, eric.gifford@pfizer.com, CS CoE, Pfizer Global Research and Development, Groton, CT 06340

We place significant demands on SAR models in terms of training set selection, statistical behavior, interpretability, usage traceability and predictive abilities. Recently, a considerable emphasis has been placed on assessing the true prediction scope of models such as those used in the prediction of ADME/Tox endpoints. That is, defining a space where we can expect a model to perform within given accuracy guidelines. Although internal statistics can be captured as metadata for individual models, these do not generally provide sufficient information to forecast external predictions.

We present herein the status of our work towards the evolution of data models capturing the historical information. We introduce the concept of a “smart model update”, that is, a mechanism that is aware of a model's predictive ability and coverage. This mechanism provides an easier model update when new compounds are available.

We believe that there are several practical applications of this concept. For example, descriptors could be queried in chronological order as the model evolves to get some sense of synthesizing new structures and their ability to predict a given chemotype. This may be particularly useful when dealing with large collections of chemotypes (chemical series) and searching for the one suitable for a specific therapeutic area. Besides this, a chemist can observe the underlying predictability of the model across different versions of the model.

66 Fragment-based prediction of inhibitors of cytochromes P450 1A2 and 2D6
Julien Burton1, julien.burton@fundp.ac.be, Emeric Danloy1, emeric.danloy@student.fundp.ac.be, Nathalie Meurice2, NMeurice@tgen.org, Gerald M. Maggiora3, maggiora@pharmacy.arizona.edu, and Daniel P. Vercauteren1, daniel.vercauteren@fundp.ac.be. (1) PCI Laboratory, University of Namur, 61, rue de Bruxelles, 5000 Namur, Belgium, (2) Pharmaceutical Genomics Division, Translational Genomics Research Institute (TGen), Scottsdale, AZ 85259, (3) College of Pharmacy & BIO5 Institute, University of Arizona, Tucson, AZ 85721

Early prediction of ADME properties in the field of cytochrome P450 (CYP)-mediated drug-drug interactions is an important challenge. This contribution presents an original approach based on coupling data mining methods to stereoelectronic molecular descriptors. A collection of molecules involved in CYP1A2 and CYP2D6 inhibition was described using the MACCS keys and five in-house fingerprints derived from properties of electron density distributions of chemical functions. Recursive partitioning was used to build decision trees and Rough Set Theory allowed extracting rules utilized as classifiers to predict the inhibitory power of an independent set of test molecules. Resulting prediction accuracies exceeded 85% for both CYPs. Additionally, these classifiers were analyzed to determine which structural fragments were most employed for classification, revealing relationships between the occurrence of particular functional groups and CYP inhibition. These results assess the proposed fragment-based approach as a powerful tool to build predictive models and to infer potent structure-activity relationships.

67 Can a free access structure-centric community for chemists benefit drug discovery?
A J Williams, ChemZoo, 904 Tamaras Circle, Wake Forest, NC 27587

ChemSpider is an online database of over 20 million chemical structures assembled from well over a hundred data sources including chemical and screening library vendors, publicly accessible databases and resources, commercial databases and Open Access literature articles. Such a public resource provides a rich source of ligands for the purpose of virtual screening experiments. These can take many forms. This work will present results from two specific types of studies: 1) Quantitative Structure Activity Relationship (QSAR) based analyses and 2) in silico docking into protein receptor sites. We will review results from the application of both approaches to a number of specific examples. QSAR analyses utilizing the ChemModLab environment for assessing quantitative structure-activity relationships will be presented, as will screening using a molecular surface descriptor model.

68 Learning from failures - discontinued compounds as a source for knowledge in drug discovery
David Marcus, Anwar Rayan, Dinorah Barasch, Maayan Elias, and Amiram Goldblum, Department of Medicinal Chemistry, Hebrew University of Jerusalem, Grass Center for Drug Design and Synthesis, and Sudarsky Center for Computational Biology, Jerusalem 91120, Israel

Our Iterative Stochastic Elimination (ISE) algorithm (PNAS 99, 703-8, 2002) has been recently applied to several ligand-based screenings, such as specific molecular bioactivity, selectivity and toxicity. Here, we use molecular information derived from discontinued compounds at Phase I and from launched drugs to evaluate a molecule's potential to be a drug. Phase I evaluates ADME characteristics in humans and failure at this stage is usually due to poor pharmacokinetics or to severe adverse reactions and toxicity. Our "potential drug" index distinguishes between these compound populations and can be used with much confidence to extract highly enriched subsets of drug-like compounds based on real failures, rather than on molecular sets of presumed "non-drugs". This index can also clean large datasets or pipelines from ADME adverse compounds. Applying this method to large datasets such as MDDR reveals the characteristics of compounds in several phases of the drug discovery pipeline.

69 Improving reliability of in silico ADME/Tox property prediction by incorporating in-house/proprietary data
Paulius Jurgutis, Pharma Algorithms, BCE Place - TD Canada, Trust Tower, 161 Bay Street, 27th Floor, Toronto, ON M5J 2S1, Canada

A number of problems are known to limit the effective use of third-party predictive algorithms used in compound screening and development. Commonly the training sets do not cover the chemical space occupied by the compounds of interest. Therefore the need arises for a method that would allow any research group to tailor a third-party predictive algorithm to its specific needs using proprietary in-house data. We present a methodology that allows for the addition of user data to the original data models. Predicted values can be corrected according to experimental results present in the in-house/proprietary databases resulting in the amendment of the chemical space not initially included in the original training set. A reliability index (RI) is also calculated as a measure of the quality of the predictions. This novel RI takes into account both local molecular similarity aspects to the molecule of interest and the consistency of the experimental data.

70 Effect of ionization on lipophilicity
Greg Pearl, Sanji Bhal, Ian Peirson, and Karim Kassam, ACD/Labs, 110 Yonge Street, 14th Floor, Toronto, ON M5C 1T4, Canada

The effects of ionization and tautomers in computational modeling have historically been considered negligible due to the complexities involved in including them in an a priori calculation. To determine the effect of neglecting ionization, we have investigated Lipinski's "Rule-of-5", which has been widely adopted to eliminate drug candidates that are deemed to have poor physicochemical properties. Although logP is a useful descriptor, it fails to take into account any variation in lipophilicity of a drug due to the potential ionization at a key biological pH. Given that >95% of commercial pharmaceuticals contain an ionizable moiety, we propose that logD could be used as an alternate descriptor for lipophilicity in the Rule-of-5 in order to reduce the number of potential false-positives that are eliminated in screening. The adapted Rule-of-5 was applied to a series of commercial compound libraries and notable improvements in pass rate were attained.
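The effect of swapping logP for logD in the Rule-of-5 can be illustrated with a short sketch. It assumes the textbook approximation for a monoprotic acid in which only the neutral form partitions into octanol, logD = logP - log10(1 + 10^(pH - pKa)); the abstract does not state how logD was calculated or at which pH, so the formula, the pH of 7.4, the function names and the example compound are all assumptions. The four thresholds are those of Lipinski's published rule.

```python
import math


def log_d_monoprotic_acid(log_p, pka, ph=7.4):
    """logD of a monoprotic acid, assuming only the neutral form partitions."""
    return log_p - math.log10(1.0 + 10.0 ** (ph - pka))


def rule_of_5_violations(mol_weight, lipophilicity, h_donors, h_acceptors):
    """Count violations of the four Rule-of-5 criteria."""
    return sum([
        mol_weight > 500,    # molecular weight above 500
        lipophilicity > 5,   # logP (or logD, in the adapted rule) above 5
        h_donors > 5,        # more than 5 hydrogen-bond donors
        h_acceptors > 10,    # more than 10 hydrogen-bond acceptors
    ])


# Hypothetical carboxylic acid: logP 5.6, pKa 4.4, evaluated at pH 7.4.
log_p = 5.6
log_d = log_d_monoprotic_acid(log_p, pka=4.4)
with_log_p = rule_of_5_violations(420.0, log_p, 1, 4)
with_log_d = rule_of_5_violations(420.0, log_d, 1, 4)
```

For this hypothetical compound the lipophilicity criterion is violated when logP is used but not when logD is used, which is the kind of change in pass rate the abstract describes.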

71 Enhancements to CAS' predicted properties coverage
Roger Schenck and Elizabeth Drotleff, CAS, 2540 Olentangy River Road, Columbus, OH 43202

Molecular properties have gained in importance as more chemistry is being performed in silico. While experimental properties are valued, predicted property algorithms have been greatly improved and the resulting data has steadily gained acceptance in the scientific community. It is critical for information platforms to deliver this data in an efficient and comprehensive manner. CAS' STN and SciFinder products are well-known delivery vehicles with approximately 1.6 billion predicted properties in the flagship CAS Registry database. CAS plans to focus on augmenting its existing collection with the inclusion of additional predicted spectral and property values.

72 Inverse design of host-guest complexes in competitive binding problems
B. Christopher Rinderspacher1, c.rinders@duke.edu, David N. Beratan2, david.beratan@duke.edu, and Weitao Yang1, weitao.yang@duke.edu. (1) Department of Chemistry, Duke University, Durham, NC 27707, (2) Department of Chemistry, Duke University Box 90346, Durham, NC 27708

The host-guest problem has many applications in chemistry: enzymes are hosts to their substrates, drugs are guests to their delivery systems, and chemical sensors play host to target molecules. Similarly, competitive binding is important in many natural processes. We have developed a new general purpose data structure and branch-and-bound algorithm that allows the design of an optimal binder specific to a small to medium-sized organic molecule. We have applied the method to find a molecule that maximizes its binding energy to sarin vs. acetylcholine using AM1 and B3LYP methods. The results clarify binding motifs, differentiate between the binding modes, and demonstrate the general applicability of the theoretical approach.

73 Topological polar surface area: A useful descriptor in 2D-QSAR
Robert J. Doerksen and Prasanna Sivaprakasam, Department of Medicinal Chemistry, School of Pharmacy, University of Mississippi, 421 Faser Hall, University, MS 38677-1848, Fax: 662-915-5638

Topological polar surface area (TPSA) is shown to be a useful descriptor in 2D-QSAR. TPSA is a convenient measure of the polar surface area that avoids the need to calculate ligand 3D structure or to decide which is the relevant biological conformation or conformations. This is the first report to demonstrate the value of TPSA as a relevant descriptor applicable to a large, structurally and pharmacologically diverse set of classes of compounds. We observed a negative correlation of TPSA with activity data for anticancer alkaloids, MT1 and MT2 agonists, MAO-B and tumor necrosis factor-α inhibitors and a positive correlation with inhibitory activity data for telomerase, PDE-5, GSK-3, DNA-PK, aromatase, malaria, trypanosomatids and CB2 agonists.

74 Automatic generation of predictive property models
George D Purvis III1, GPurvis@us.fujitsu.com, David T Stanton2, and William D Laidig2. (1) Biosciences Group, Fujitsu Computer Systems, 15244 NW Greenbrier Pkwy, Beaverton, OR 97007, (2) Modeling and Simulations Group, Procter & Gamble, Cincinnati, OH 45253

In the design of new products, specialized properties that are unique to the use of that product are optimized. Products can be improved faster with greater certainty that the improvements are the best possible when predictive models assist the exploration. Proprietary ownership of product performance data prevents development of useful models for these properties outside of the companies that own the products and measure the data. We are developing an automated tool for rapid creation of predictive properties based on quantitative structure-property relationships (QSPR) which enables owners of proprietary data to build their own predictive models. In this talk, we describe our optimization methodology for finding QSPR relationships, the descriptors incorporated, the description of chemical space and the methodology for error estimation based on QSPR ensembles in the context of developing a predictive QSPR model for boiling points.

75 A Status Report on the InChI & InChIKey Project
Stephen R. Heller, Physical and Chemical Properties Division, NIST, Gaithersburg, MD 20899-8380

IUPAC has developed an algorithm to create a unique chemical identifier, the InChI, and a related fixed-length InChIKey. As opposed to other chemical structure representations, with the InChI/InChIKey, anyone, anywhere is able to create their own structure files and databases using this public domain open source algorithm. This means one is no longer dependent on any outside source or organization for a unique chemical identifier. Using InChI means you can freely create and exchange structure files with others within your organization and with any person or organization anywhere in the world, knowing the structure name, the InChI/InChIKey, will be the same. You can search for the InChI/InChIKey on the Internet, using Google/Yahoo/Microsoft Live, and so on. Using an InChI/InChIKey, you know you will find a match if it is there and need not worry whether it was coded differently by another person or program. InChI/InChIKey means you are no longer dependent on any proprietary system and you are much more likely to link to and to be linked from many, many more chemists and sources of chemical information than has been possible in the past. The InChI/InChIKey is a system for both public and private (internal proprietary and commercial fee-based) sources. Details of the InChI/InChIKey and its rapid and worldwide adoption will be presented.
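Because the InChIKey has a fixed length, it can be matched as a plain string, which is what makes it searchable with general-purpose Internet search engines. The sketch below checks only the layout of a key and compares two keys for equality; it does not generate keys from structures. It assumes the 27-character layout of the standard InChIKey (14 letters, a hyphen, 10 letters, a hyphen, 1 letter), which may differ from the key format current when this abstract was written, and the function names are hypothetical.

```python
import re

# Layout of the standard InChIKey: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def looks_like_inchikey(text):
    """Return True if text has the layout of a standard InChIKey."""
    return bool(INCHIKEY_PATTERN.match(text.strip()))


def same_key(key_a, key_b):
    """Two records carry the same identifier if their well-formed keys match exactly."""
    return looks_like_inchikey(key_a) and key_a.strip() == key_b.strip()


ethanol_key = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"  # standard InChIKey of ethanol
water_key = "XLYOFNOQVPJJNP-UHFFFAOYSA-N"    # standard InChIKey of water
```

A layout check of this kind only screens out malformed strings; whether two keys denote the same compound still rests on both having been produced by the InChI algorithm.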

76 Dynamic data evaluation for thermodynamic properties of binary mixtures
Chris D. Muzny, Vladimir V. Diky, Andrei Kazakov, Eric W. Lemmon, Robert D. Chirico, and Michael Frenkel, Physical and Chemical Properties Division, National Institute of Standards and Technology, 325 Broadway, Boulder, CO 80305-3328, Fax: 303-497-5044

Thermodynamic property data evaluation and the subsequent production of data correlations or equations of state for chemical compounds have traditionally been performed by thermodynamicists who are experts in this field. The concept of using an expert software package to dynamically perform this function was demonstrated for pure compounds with the release of ThermoData Engine (TDE) 1.0, a product of the National Institute of Standards and Technology. Implementation of the dynamic data evaluation concept requires the development of large electronic databases capable of storing essentially all experimental data known to date with detailed descriptions of relevant metadata and uncertainties. The combination of these electronic databases with expert-system software, designed for automatic generation of recommended property values based on available experimental data plus a system of prediction methods, leads to the ability to produce critically evaluated data dynamically or 'to order'. TDE 1.0 demonstrates the ability of a software system to perform the various tasks of the traditional data evaluator including data quality analysis, cross-property data consistency checking, the production of accurate data correlations and, in cases where sufficient data is available, the automated production of a high quality equation of state. Recently, this expert system was extended to dynamic data evaluation for binary mixtures of organic compounds. The data analysis algorithms, data correlations, data models used for the analysis of binary mixture data, and the subsequent production of mixture models will be presented. In doing so, specific problems associated with binary mixture data will be covered and the expert data analysis system will be demonstrated.

77 Framework structures of zeolite crystals: A machine learning classification approach
Shujiang Yang, Mohammed Lach-hab, Iosif Vaisman, and Estela Blaisten-Barojas, Computational Materials Science Center, George Mason University, 4400 University Dr., MSN 6A2, Fairfax, VA 22030, Fax: 703-993-9300

With their unique 3D microporous structure, zeolites have been extensively used in the field of absorption, ion exchange, and catalysis. The framework type of zeolites has been mainly determined from coordination sequences and vertex symbols. In this work a machine learning approach is used for predicting the zeolite framework based on the topology of the structures. A data set of zeolites from the Inorganic Crystal Structure Database (ICSD) is used. A supercell of each zeolite is constructed, and extra-framework cations and adsorbed phase are eliminated such that only tetrahedrally-bonded framework atoms are retained. These supercells are Delaunay-tessellated and several topological descriptors are developed as the foundation of the Zeolite-Structure-Predictor (ZSP). The ZSP uses the Random Forest algorithm, is trained with 130 zeolites evenly distributed in 13 framework type classes, and correctly classifies over 90% of the crystals.

78 Quantum information and chemistry: using quantum computers to simulate chemical systems
Alán Aspuru-Guzik, Department of Chemistry and Chemical Biology, Harvard University, 12 Oxford St, Cambridge, MA 02138

Quantum information science encompasses different areas such as quantum cryptography and quantum computation. In 1982, Feynman suggested that a quantum computer could simulate quantum systems in polynomial time. We present our progress on quantum algorithms for the simulation of the dynamical properties of molecules, such as chemical reactions, exactly on a quantum computer in polynomial time. We will also summarize other applications of quantum information to chemistry in which we have recently obtained exciting results, such as the prospects of a quantum computer for protein folding and electronic structure, and the understanding of energy transfer in biological systems using a quantum information perspective.

79 A new, automated retrosynthetic search engine: ARChem
A Peter Johnson1, a.p.johnson@chemistry.leeds.ac.uk, Jacqueline Law2, Zsolt Zsoldos3, zsolt@simbiosys.ca, Aniko Simon3, aniko@simbiosys.ca, and Anthony J. Williams4, tony@chemspider.com. (1) School of Chemistry, University of Leeds, Leeds LS2 9JT, United Kingdom, (2) SymBioSys Inc, Toronto M9W 6V1, Canada, (3) SimBioSys Inc, Toronto, ON M9W 6V1, Canada, (4) ChemZoo Inc, Wake Forest, NC 27587

ARChem Route Designer is a new retrosynthetic analysis package (1) that generates complete synthetic routes for target molecules from readily available starting materials. Rule generation from reaction databases is fully automated to ensure that the system can keep abreast of the latest reaction literature and available starting materials. These rules are used to carry out an exhaustive retrosynthetic analysis of the target molecules, with special heuristics applied to mitigate combinatorial explosion. Proposed routes are then prioritized by a merit ranking algorithm so that the most diverse solution profile is presented to the viewer. Users then have the option to view the overall reaction tree and to scroll through the reaction details. The program runs on a server with a web-based user interface. An overview of some of the challenges and solutions, along with examples that illustrate ARChem's utility, will be presented.

80 Ligand tautomer enumeration and scoring for structure-based 3D pharmacophore modeling
Thomas Seidel, seidel@inteligand.com, Inte:Ligand GmbH, Mariahilferstrasse 74B/11, A-1070 Vienna, Austria, Gerhard Wolber, wolber@inteligand.com, Inte:Ligand GmbH, 1070 Vienna, Austria, and Thierry Langer, thierry.langer@uibk.ac.at, Department of Pharmaceutical Chemistry, Computer Aided Molecular Design Group, University of Innsbruck, Institute of Pharmacy, Innsbruck A-6020, Austria

Tautomeric rearrangements of molecules lead to distinct equilibrated structural states of the same chemical compound and affect nearly all aspects of computer-based chemical data processing. Especially for structure-based pharmacophore modeling of ligand-protein complexes, tautomerism is a determining factor for the presence or absence of possible H-bonding interactions because the H-donor/H-acceptor properties change. Knowledge of the most favorable tautomeric states is therefore crucial for the quality and correctness of the derived pharmacophore models and putative binding modes. We will present a ligand-side tautomer enumeration and ranking procedure that considers both the geometrical constraints imposed by the conformation of the bound ligand and intra- and inter-molecular energetic contributions. The ranking algorithm is based on energetic scores and has been able to top-rank the known preferred tautomeric states of bound ligands in a series of investigated protein complexes.

81 Working with IUPAC names using ChemAxon tools
Daniel Bonniot, Rita Veréb, Ferenc Csizmadia, and Gyorgy Pirok, ChemAxon Ltd, Maramaros koz 3/a, 1037 Budapest, Hungary, Fax: +36-1-453-2659

ChemAxon provides tools to generate the IUPAC names of chemical structures and to import chemical names as structures. We illustrate the possibilities of these tools, using concrete examples covering different nomenclatures and complex cases. We consider the differences between traditional and preferred IUPAC nomenclature and the options to handle both. We present the different ways these tools can be used, from real-time, interactive naming of drawn structures to batch naming and automatically computed names in databases. Finally, we evaluate the export and import rate and the quality of generated names using both human expert analysis and automated methods.


