#240 - Abstracts
ACS National Meeting
August 22-26, 2010
Semantic envelopment of cheminformatics resources with SADI
Leonid L Chepelev, Egon Willighagen, Michel Dumontier. Department of Biology, School of Computer Science, and Institute of Biochemistry, Carleton University, Ottawa, Ontario, Canada; Department of Pharmaceutical Sciences, Uppsala University, Uppsala, Sweden
The distribution of computational resources as web services and their execution as workflows has enabled facile computation and data integration for bio- and cheminformatics. The Semantic Automated Discovery and Integration (SADI) framework addresses many shortcomings of similar frameworks, such as SSWAP and BioMoby, while allowing for more efficient semantic envelopment of computational chemistry services, resource discovery, and automated workflow organization. In this work, we apply the CHEMINF ontology and the Chemical Entity Semantic Specification and demonstrate the usability of the SADI framework in solving common cheminformatics problems starting from RDF-based chemical entity representations. Our eventual goal is to convert all of the functions and functionalities of the Chemistry Development Kit (CDK) into distinct SADI services. This would enable the formulation of all cheminformatics problems currently addressed by CDK as SPARQL queries, returning meaningful RDF output that can then be easily integrated with existing RDF-based knowledgebases or used for further processing.
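A minimal sketch of the underlying idea — phrasing a chemical question as a graph pattern over RDF triples. The `ex:` predicate and resource names below are invented for illustration (they are not actual CHEMINF terms), and the one-pattern matcher is a toy stand-in for a real SPARQL engine:

```python
# Toy triple store: each triple is (subject, predicate, object).
# Predicate and resource names are hypothetical illustrations.
TRIPLES = [
    ("ex:caffeine", "ex:hasFormula", "C8H10N4O2"),
    ("ex:caffeine", "ex:hasMolecularWeight", 194.19),
    ("ex:aspirin",  "ex:hasFormula", "C9H8O4"),
    ("ex:aspirin",  "ex:hasMolecularWeight", 180.16),
]

def match(pattern, triples):
    """Match one (s, p, o) pattern; '?'-prefixed terms are variables.
    Returns one binding dict per matching triple, like a single-pattern
    SPARQL SELECT."""
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                binding[term] = value
            elif term != value:
                break
        else:
            results.append(binding)
    return results

# Roughly "SELECT ?mol WHERE { ?mol ex:hasMolecularWeight ?mw }":
hits = match(("?mol", "ex:hasMolecularWeight", "?mw"), TRIPLES)
light = [b["?mol"] for b in hits if b["?mw"] < 190]
print(light)  # ['ex:aspirin']
```

A real SADI service would consume and produce such RDF rather than plain values, so query results from one service can feed directly into the next.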
RESTful RDF web services for predictive toxicology
Dr. Nina Jeliazkova PhD. Ideaconsult Ltd., Sofia, Bulgaria
The Open Source Predictive Toxicology Framework http://www.opentox.org, developed by partners of the EC FP7 OpenTox project, aims at providing unified access to toxicity data and predictive models, as well as validation procedures. This is achieved by i) an information model, based on a common OWL-DL ontology (http://www.opentox.org/api/1.1/opentox.owl); ii) flexibility through linking with related ontologies; iii) availability of data and algorithms via a standardized REST web services interface, where every compound, data set, or predictive method has a unique web address, used to retrieve its RDF representation or initiate calculations. The OpenTox framework allows building user-friendly applications for toxicological experts or model developers, or direct access through an application programming interface for development, integration, and validation of new algorithms. The work presented describes the experience of building RESTful web services, based on RDF representations of resources, to incorporate diverse IT solutions into a distributed and interoperable system.
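The pattern described — one URI per resource, with the representation chosen by content negotiation — can be sketched with the standard library. The base URL and URI scheme below are hypothetical placeholders, not the actual OpenTox addresses, and no request is actually sent:

```python
from urllib.request import Request

# Hypothetical OpenTox-style resource URI; the real service defines
# its own URI scheme for compounds, datasets, and models.
BASE = "http://example.org/opentox"
compound_uri = f"{BASE}/compound/42"

# Content negotiation: ask for the RDF representation of the resource.
# The same URI with a different Accept header would yield another
# representation; a POST to a model URI would initiate a calculation.
req = Request(compound_uri, headers={"Accept": "application/rdf+xml"})

print(req.full_url, req.get_header("Accept"))
```

The design point is that the URI identifies the resource while the header selects the format, so one address serves both human-facing and machine-readable clients.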
Linking the resource description framework to cheminformatics and proteochemometrics
Dr. Egon L. Willighagen, Prof. Jarl E.S. Wikberg. Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden
We have shown that existing RDF standards can suitably be integrated into existing molecular chemometrics methods. Platforms that unite these technologies, like Bioclipse, make this even simpler and more transparent. Being able to create and share workflows that integrate data aggregation and analysis (visual and statistical) is beneficial to interoperability and reproducibility. The current work shows that RDF approaches are sufficiently powerful to support molecular chemometrics workflows.
Chemical e-Science Information Cloud (ChemCloud): A semantic web based eScience infrastructure
Prof. Dr. Adrian Paschke PhD, Stephan Heineke. FIZ Chemie, Berlin, Germany; Department of Mathematics and Computer Science, FU Berlin, Berlin, Germany
Our Chemical e-Science Information Cloud (ChemCloud), a Semantic Web based eScience infrastructure, integrates and automates a multitude of databases, tools, and services in the domains of chemistry, pharmacy, and biochemistry available at the Fachinformationszentrum Chemie (FIZ Chemie), at the Freie Universitaet Berlin (FUB), and on the public Web. Based on the approach of the W3C Linked Open Data initiative and the W3C Semantic Web technologies for ontologies and rules, it semantically links and integrates knowledge from several sources: our W3C HCLS knowledge base hosted at the FUB; our multi-domain knowledge base DBpedia (Deutschland), implemented at FUB and extracted from the German Wikipedia, which provides a public semantic resource for chemistry; and our well-established databases at FIZ Chemie, such as ChemInform for organic reaction data, InfoTherm, the leading source for thermophysical data, Chemisches Zentralblatt, the complete chemistry knowledge from 1830 to 1969, and ChemgaPedia, the largest and most frequented e-Learning platform for chemistry and related sciences in the German language.
Use of semantic web services to access small molecule ligand database
Anay P Tamhankar, Aniket S Ausekar. Software Solutions Group, Evolvus, Pune, Maharashtra, India
Resource Description Framework (RDF) and a set of associated technologies such as OWL and SPARQL, which form the W3C's semantic web technology stack, are renewing interest in semantic chemistry. Semantic Web Services not only specify syntactic interoperability but also specify and enforce the semantic constraints of messages being transmitted and objects being accessed.
The Liceptor database is a small molecule ligand database consisting of approximately 4 million compounds. The database schema consists of fields such as molecular properties (2D structure, molecular weight, molecular formula, etc.), molecular descriptors (H-donors, H-acceptors, logP, logD, number of rotatable bonds, etc.), and pharmacological properties (bioassays, receptors, enzymes, parameters, animal models, therapeutic indications, etc.). Pharmaceutical and biotechnology companies use this database to mine chemical space for internal research, to prioritize QSAR and pharmacophore studies, for synthetic chemistry endeavors, and for advancing hit-to-lead patterns.
The database records are available in multiple formats (relational database, XML, RDfile, etc.) as well as online through an interactive web application (HTML format).
The soon-to-be-released version of the database includes access via semantic web services. The ontology is expressed in OWL, and RDF defines the overall framework. Typical consumers of the data through this access mechanism are expected to be third-party tool vendors and data aggregators.
Use of semantic web services allows the schema to evolve over time without explicitly communicating each change or requiring all data consumers to be updated.
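Why predicate-keyed data tolerates schema evolution can be sketched in a few lines. The predicate names and the version-2 field are invented for illustration: a consumer written against the original schema simply selects the predicates it understands and ignores additions.

```python
# A record keyed by predicate names (hypothetical). Schema version 2
# added ex:logD; a consumer written against version 1 still works
# because it keeps only the predicates it was built to understand.
record_v2 = {
    "ex:molecularWeight": 180.16,
    "ex:hDonors": 1,
    "ex:logD": 1.2,  # new in schema v2, unknown to this consumer
}

KNOWN_PREDICATES = {"ex:molecularWeight", "ex:hDonors"}

def consume(record):
    """Project a record onto the predicates this consumer understands."""
    return {p: v for p, v in record.items() if p in KNOWN_PREDICATES}

print(consume(record_v2))
```

Contrast this with a fixed-column export, where an inserted column silently shifts every downstream parser.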
Usage metrics: Tools for evaluating science monograph collections
Asst Univ Librarian Michelle M Foss, Dr. Vernon Kisling, Ms. Stephanie Haas. Department of Marston Science Library, University of Florida, Gainesville, FL, United States
As academic libraries are increasingly supported by a matrix of database functions, the use of data mining and visualization techniques offers significant potential for future collection development based on quantifiable data. While data collection techniques are not standardized and results may be skewed by granularity problems or faulty algorithms, useful baseline data can be extracted and broad trends identified. The purpose of the study is to provide an initial assessment of data associated with the science monograph collection at the Marston Science Library (MSL), University of Florida. The sciences fall within the major Library of Congress Classification schedules of Q, S, and T, excluding TN, TR, TT, and R. The overall strategy of this project is to analyze audience-based circulation patterns, e-book usage, purchases, and interlibrary loan statistics from the academic year July 1, 2008 to June 30, 2009. Such analyses provide an evidence-based framework for future collection decisions.
Happily ever after or not: E-book collection usage analysis and assessment at USC Library
Norah Xiao. University of Southern California, United States
With more and more e-book collections being launched by publishers, the USC Science and Engineering Library began acquiring e-book collections in late 2008; one of the first and biggest acquired collections is Springer e-books. Now, after two years, are users satisfied with this e-book collection? Are they accessing and using it? As with any other e-collection, how well have we, librarians and staff, been coping with this collection in collection development (e.g., e-book packages from other publishers), access services (e.g., interlibrary loan, off-campus access, e-book technical issues), outreach (e.g., e-book marketing strategies), and information literacy?
This presentation will give an overview of our assessment of this e-book collection after two years. What have we learned from the usage data? And by analyzing the data, how did we, and how can we, improve our services to users? It is hoped that our experience can offer a proactive implementation plan for others considering comprehensive digital migration of their content, with the goal of not only better coping with the current economic environment but also spurring development, innovation, and efficiency in the long run.
From Chemical Abstracts to SciFinder: Transitioning to SciFinder and assessing customer usage
Susan Makar, Stacy Bruss. National Institute of Standards and Technology, United States
The Research Library of the National Institute of Standards and Technology (NIST) monitors SciFinder usage to ensure customers have ready access to the database and to determine who uses it. Usage statistics played a critical role in determining whether to increase the number of seats and which heavy users should help pay for those additional seats. While most NIST researchers were very excited to acquire access to this product, many, who were well acquainted with using the print version of Chemical Abstracts, needed to learn best techniques for searching and browsing the chemistry literature using SciFinder. Transitioning from the printed Chemical Abstracts to SciFinder posed significant challenges to one research project. This presentation will describe how the NIST Research Library used SciFinder usage statistics to make collection development decisions and how library staff worked with NIST researchers to successfully transition from the printed Chemical Abstracts to SciFinder.
Using Web of Knowledge to identify publishing and citation patterns of campus researchers at the University of Arkansas
Lutishoor Salisbury, Jeremy S. Smith. University of Arkansas, United States
This presentation will provide information on a project undertaken at the University of Arkansas in Fayetteville to study publications by campus researchers, with an emphasis on the STEM disciplines (agricultural sciences, physical sciences, biological sciences, engineering, mathematics, etc.) at the macro level for a three-year period. The overall objective of the study was (1) to provide an overview of the productivity of faculty and researchers in the various departments, which could be used in allocating resources for collection development, and (2) to provide evidence-based data on periodical use to assist with collection decisions and to identify collection strengths at the university level. We used the Web of Knowledge database (Science Citation Index, Social Sciences Citation Index, and Arts and Humanities Citation Index) to identify the periodical literature in which our researchers published and the periodicals they cite in their publications, performing several analyses, including determining the extent to which our researchers publish in and cite periodicals from the Elsevier, Wiley, and IEEE journal packages. A methodology for extracting citations from Web of Knowledge into an Excel spreadsheet will also be presented. The strengths and weaknesses of the Web of Knowledge for this study will also be highlighted.
Don't forget the qualitative: Including focus groups in the collection assessment process
Susan Shepherd, Teri M. Vogel. University of California San Diego, United States
To complement our ongoing quantitative collection evaluations based on cost and usage data, the UC San Diego Science & Engineering Library conducted a series of focus groups with graduate students and faculty in our core departments. Our objective was to learn more about how they use the collection for research and teaching, so that we could make more informed decisions about collection management, as well as how best to deploy our staff resources for increased promotion, outreach and instruction. Participants were asked about the resources they use, how they use them, and what gaps they perceived. We also probed their familiarity with the top licensed resources in their fields.
In this presentation we will discuss our focus group methods, results and the next steps we have taken in this assessment, including a follow-up survey to the same departments to obtain more quantitative information about usage of the collection.
Strategies for the identification and generation of informative compound sets
Michael S Lajiness. Computer Aided Drug Discovery, Eli Lilly & Company, Indianapolis, IN, United States
Mounting pressures in drug discovery research dictate more efficient methods of picking the winners: molecules that actually have a chance to be the drugs of the future. Clearly, these methods need to navigate a highly multi-dimensional landscape. It is also clear that hard filters should never be used and that a more continuous treatment or prioritization has clear advantages. Further, structural diversity needs to be considered in order for the best structural ideas to be found most efficiently. In addition, history and external sources of information must also be examined. This presentation will describe some of the methods, techniques, and strategies employed by the author over the past 25 years of working in cheminformatics to identify compounds that are likely to provide the most useful information, so that one might discover solid leads more rapidly.
Public-domain data resources at the European Bioinformatics Institute and their use in drug discovery
Christoph Steinbeck. European Bioinformatics Institute, EMBL Outstation - Hinxton, Hinxton, Cambridge, United Kingdom
Small molecules are of increasing interest for bioinformatics in areas such as metabolomics and drug discovery. The recent release of large open chemistry databases into the public domain calls for flexible, open toolkits to process them. These databases and tools will, for the first time, create opportunities for academia and third-world countries to perform state-of-the-art open drug discovery and translational research - endeavors so far a domain of the pharmaceutical industry. This talk will describe a couple of relevant data resources at the European Bioinformatics Institute and will also outline our research on and development of toolkits such as the Chemistry Development Kit and CDK-Taverna to support the exploitation of these data sources.
Decision making in the face of complicated drug discovery data using the Novartis system for virtual medicinal chemistry (FOCUS)
Donovan Chin. Global Discovery Chemistry, Novartis Institutes for BioMedical Research, Cambridge, MA, United States
This talk will describe some of the broad concepts that led to the development of the Novartis software system for data analysis & virtual medicinal chemistry (FOCUS). The system, which is routinely used globally, is designed to present the scientist with an accessible interface that permits iterative hypothesis testing of many possible chemical candidates while accounting for undesirable ADMET properties. Some of the key principles are to present the data in a way that reflects stored knowledge and facilitates the decision about which compound to make next. We will highlight some of these concepts in applications spanning the range from target identification to drug optimization.
Integrating chemical and biological data: Insights from 10 years of VERDI
Susan Roberts, W. Patrick Walters, Ryan McLoughlin, Philppe Gabriel, Jonathan Willis, Trevor Kramer. Vertex Pharmaceuticals, Cambridge, MA, United States
VERDI is a software system, originally developed in 2000 at Vertex Pharmaceuticals, for integrating chemical and biological data and delivering this information to drug discovery teams. In addition to traditional table views, VERDI incorporated a number of modules designed to enable scientists to understand relationships between chemical structure and biological data. Over the last 10 years, VERDI has been the primary data access tool for hundreds of scientists at multiple sites around the world. A retrospective evaluation of VERDI has provided us with a number of 'lessons-learned', which come from a multitude of revisions, improvements and new feature additions. Some of these lessons, which are being used as the basis for development of the next generation of data analysis and visualization tools at Vertex, will be presented and discussed in detail.
Collaborative database and computational models for tuberculosis drug discovery decision making
Dr. Sean Ekins PhD, Dr Justin Bradford PhD, Krishna Dole, Anna Spektor, Kellan Gregory, David Blondeau, Dr Moses Hohman PhD, Dr Barry A Bunin. Collaborative Drug Discovery, Burlingame, CA, United States; Collaborations in Chemistry, Jenkintown, PA, United States; Department of Pharmaceutical Sciences, University of Maryland, Baltimore, MD, United States; Department of Pharmacology, Robert Wood Johnson Medical School, University of Medicine & Dentistry of New Jersey, Piscataway, NJ, United States
Drug discovery is being re-shaped involving large scale collaborations that connect individual researchers using collaborative computational approaches and crowdsourcing. Future drug discovery decisions will ultimately still be made based on massive multidimensional datasets. As an example, the search for molecules with activity against Mycobacterium tuberculosis (Mtb) is employing many approaches in collaborating national and international laboratories. We have developed a database (CDD TB) to capture public and private Mtb data while enabling data mining and collaborations with other researchers. We have also used the public data along with several computational approaches including Bayesian classification models for 220,463 molecules and tested them with external molecules, enabling the discrimination of active or inactive substructures from other datasets in CDD TB. The combination of the database, dataset analysis, and computational models provides new insights into molecular properties and features that are determinants of whole cell activity, allowing prioritization and decision making around molecules.
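The Bayesian classification mentioned above can be illustrated with a toy Laplace-smoothed naive Bayes over binary substructure fingerprints. The fingerprints, bit meanings, and labels below are invented for illustration; they are not CDD TB data, and production models would use far richer descriptors:

```python
from math import log

# Toy training set: binary substructure fingerprints with activity
# labels. Bits and labels are invented, not CDD TB data.
train = [
    ((1, 1, 0), "active"), ((1, 0, 0), "active"),
    ((0, 1, 1), "inactive"), ((0, 0, 1), "inactive"),
]

def fit(data):
    """Per-class bit frequencies with Laplace smoothing."""
    model = {}
    for label in {lab for _, lab in data}:
        fps = [fp for fp, lab in data if lab == label]
        n = len(fps)
        probs = [(sum(fp[i] for fp in fps) + 1) / (n + 2)
                 for i in range(len(fps[0]))]
        model[label] = (n / len(data), probs)
    return model

def predict(model, fp):
    """Pick the class with the highest log-posterior for fingerprint fp."""
    def score(prior, probs):
        return log(prior) + sum(
            log(p if bit else 1 - p) for bit, p in zip(fp, probs))
    return max(model, key=lambda lab: score(*model[lab]))

model = fit(train)
print(predict(model, (1, 1, 0)))  # 'active'
```

Scoring a large external set with such a model and ranking by posterior is one way the discrimination of active from inactive substructures supports prioritization.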
Data-driven life sciences: The Pyramids meet the Tower of Babel
Dr. Rajarshi Guha. Department of Informatics, NIH Chemical Genomics Center, Rockville, MD, United States
A characteristic feature of modern life science research is that it has become data intensive. As a result we are faced with datasets of massive size and wide variety in terms of the type of data. Examples range from massive next-generation sequencing datasets to more complex datasets of chemical structure and activity from high-throughput small-molecule screens. In this talk I will discuss some aspects of how one can handle datasets of such size and variability. I will consider examples ranging from computational science and distributed services that allow us to easily and cheaply handle massive datasets, to integration approaches that attempt to merge data from multiple sources to obtain a systems-level view of the biological effects of small molecules. In all cases, the focus will be on data generated from and for small-molecule studies.
Design principles for diversity-oriented synthesis: Facilitating downstream discovery with upfront design
Lisa Marcaurelle. Chemical Biology Platform, Broad Institute, Cambridge, MA, United States
To expand the diversity of our screening collection to access a broad range of biological targets, we aspire to produce libraries of small-molecules that combine the structural complexity of natural products and the efficiency of high-throughput processes. Moreover, we aim to synthesize the complete matrix of stereoisomers for all library members. We reason that this unique collection will enable the rapid development of stereo-structure/activity relationships (SSAR) upon biological testing providing valuable information for the prioritization and optimization of hit compounds. Although our library products may be distinct compared to traditional compound collections, we are faced with fundamental questions relevant to library design: How do you prioritize scaffolds for synthesis? How do you select products with desirable physicochemical properties? In designing DOS libraries we employ a number of cheminformatic methods to tackle such issues and select compounds for synthesis/screening. An overview of our design criteria and decision-making process will be presented.
Overview: Data-intensive drug design
John H Van Drie. R&D, Van Drie Research, Andover, MA, United States
How do we best make med chem decisions in the face of a lot of data? This is an issue that confronts us at many stages of the drug discovery process: screening, hit-to-lead, early lead optimization, and late-stage lead optimization. In this session, speakers representing each of these stages will describe how they have successfully tackled these issues, emphasizing general principles over specific computational tools. Our brains can conveniently handle only about seven things at a time, and most traditional med chem decision-making processes reflect that. Already when the number of molecules being considered is in the range of dozens, things get tricky; when that number is in the thousands to hundreds of thousands, one must re-orient one's perspective.
Data-driven development: How ACS Publications uses data to enhance products and services, and respond to customer needs
Melissa Blaney, Sara Rouhi. ACS Publications, United States
As the scholarly publishing landscape continues to rapidly transform in unprecedented ways, publishers and libraries have had to quickly pivot to accommodate the changing preferences that users have for accessing, collecting, and consuming digital information. ACS Publications has used a data-driven approach to handle these changing customer and end-user needs. Everything from our ACS Mobile iPhone application to our transition from print to online Web products has been shaped by this approach. This presentation will address the role of data in developing new products, enhancing our web presence, and responding to user behavior on the ACS Web Editions Platform.
Objective collections evaluation using statistics at the MIT Libraries
Mathew Willmott, Erja Kajosalo. Engineering & Science Libraries, Massachusetts Institute of Technology, United States
Recent budget pressures have forced many libraries to reevaluate their collections and substantially cut back on their subscription spending. The task of evaluating a large collection of subscription-based materials, however, is a difficult one. Journals from different subject areas are used differently, and journals from different publishers have their usage measured differently. Evaluating each individual journal subscription separately would be a monumental task bordering on infeasibility. This paper will discuss the approach taken by the MIT Engineering and Science Libraries in the spring of 2009 and 2010 to evaluate their journal collections, specifically for Springer, Elsevier, and Wiley-Blackwell, the three journal publishers with which these libraries hold the most subscriptions. Discussion will include the gathering and analysis of usage data, publication data, and citation data, as well as the process by which these data were combined to create an objective ranking for each journal. These objective rankings were not final decisions; librarians with subject expertise then evaluated the lower-ranked journals to determine if they were appropriate choices for cancellation, often taking into consideration many additional factors. However, these objective evaluations helped librarians to more efficiently use their time by indicating which journals may be strong candidates for cancellation, and they helped department liaisons to defend final cancellation choices to a very data-driven faculty. The end result was a more efficient cancellation process as well as a more comprehensive understanding of the library's journal collections.
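The combination step described above can be sketched as a weighted sum of normalized metrics. The journals, numbers, metric choices, and weights below are all invented placeholders; the actual MIT evaluation combined its own usage, publication, and citation data with its own weighting:

```python
# Hypothetical journal metrics and weights, for illustration only.
journals = {
    "Journal A": {"downloads": 5200, "citations": 310, "cost": 4100},
    "Journal B": {"downloads": 800,  "citations": 45,  "cost": 3900},
    "Journal C": {"downloads": 2600, "citations": 140, "cost": 1200},
}

def normalize(values):
    """Scale a metric so the best-performing title scores 1.0."""
    hi = max(values.values())
    return {k: v / hi for k, v in values.items()}

use = normalize({k: j["downloads"] for k, j in journals.items()})
cite = normalize({k: j["citations"] for k, j in journals.items()})
# Cost per download, inverted so cheaper-per-use titles score higher.
value = normalize({k: j["downloads"] / j["cost"] for k, j in journals.items()})

score = {k: 0.4 * use[k] + 0.3 * cite[k] + 0.3 * value[k] for k in journals}
ranked = sorted(score, key=score.get, reverse=True)
print(ranked)  # lowest-ranked titles become cancellation candidates
```

As the abstract stresses, such a score is a triage device for subject librarians, not a final verdict.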
Getting the biggest bang for your buck: Methods and strategies for managing journal collections
Grace Baysinger. Stanford University, United States
Chemistry journals have the highest average cost per title of all subject areas. Library collection budgets have not kept pace with price increases and funds to acquire new titles are scarce. Signing big deals for journals has limited flexibility in adapting to changes. These factors have made acquiring journals to support programmatic needs more of a challenge than ever before. This presentation will cover methods, strategies, and tools that can be used to help assess how resources are allocated when developing and managing journal collections.
Taking a collection down to its elements: Using various assessment techniques to revitalize a library
Leah Solla. Cornell University, 283 Clark Hall, Ithaca, NY, United States
What are the elements of a research literature collection in the physical sciences? How are they being used, and what roles are they playing in research, teaching, and learning? Who is using them: students, faculty, related disciplines? These are the questions that drove the extensive analyses conducted on the print and electronic literature collections in the Physical Sciences Library at Cornell University in preparation for transitioning the service model from a print-based facility to electronic collections and services. General trends indicated the usage of the collection had been well over 90% electronic for years, and the acquisition of books and journals in print had been reduced to minimal levels under budget pressures. But there were significant gaps in the electronic holdings, and there remained a small but very active core of the print collection; both warranted further study to enable us to provide the best possible access to these crucial materials in the new service model. The library management system was mined for a variety of data points and complemented with external data sources and user input to build the transition map for the physical sciences literature collections.
Predicting specific inhibition of cyclophilins A and B using docking, growing, and free energy perturbation calculations
Somisetti V Sambasivarao, Orlando Acevedo. Department of Chemistry and Biochemistry, Auburn University, Auburn, AL, United States
Cyclophilins (Cyp) belong to the enzyme class of peptidyl-prolyl isomerases which catalyze the cis-trans conversion of prolyl bonds in peptides and proteins. Twenty human Cyp isoenzymes have been reported and many are excellent targets for the inhibition of hepatitis C virus replication and multiple inflammatory diseases and cancers. Given the complete conservation of all active site residues between many of the enzymes, i.e., CypA, CypB, CypC and CypD, a better understanding of how to specifically inhibit individual targets could potentially reduce reported side effects in current treatments. Docking and growing programs have been used to construct protein-ligand complexes for a variety of reported selective inhibitors, including acylurea and aryl 1-indanylketone derivatives. Free-energy perturbation/Monte Carlo (FEP/MC) calculations have been utilized to quantitatively reproduce the free energies of binding for the inhibitors in multiple Cyp active sites in order to elucidate the origin of the specificity for the compounds.
Using aggregative web services for drug discovery
Dr. Qian Zhu PhD, Dr. Michael S. Lajiness PhD, Dr. David J. Wild PhD. School of Informatics and Computing, Indiana University, Bloomington, IN, United States
Recent years have seen a huge increase in the amount of publicly available information pertinent to drug discovery, including online databases of compound and bioassay information; scholarly publications linking compounds with genes, targets, and diseases; and predictive models that can suggest new links between compounds, genes, targets, and diseases. However, there is a distinct lack of data mining tools available to harness this information, and in particular to look for information across multiple sources. At Indiana University we are developing an aggregative web service framework to solve this kind of problem. It offers a new approach to data mining that crosses information source types to look at the "big picture" and to identify corroborating or conflicting information from models, assays, databases, and publications.
Semantifying polymer science using ontologies
Dr. Edward O. Cannon PhD, Dr. Adams Nico, Prof. Peter Murray-Rust. Department of Chemistry, Unilever Centre for Molecular Science Informatics, University of Cambridge, Cambridge, Cambridgeshire, United Kingdom
Ontologies are graph-based, formal representations of information in a domain. Currently, there is a large interest in ontologies for biology and medicine, though little effort has been concentrated on chemistry, let alone polymer science. We have developed a number of ontologies for polymer science, covering properties, measurement techniques, and measurement conditions, using the Web Ontology Language. These ontologies will help facilitate the standardization of data exchange formats in polymer science by providing a common domain of knowledge. The properties ontology contains over 150 properties and has been integrated with the measurement techniques and conditions ontologies to give information on how a property is measured and under what conditions. The ontologies will be of use to polymer scientists wishing to reach a consensus in this area of knowledge. They also have the advantage that they can be integrated into software applications to leverage this knowledge.
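The property-to-technique linkage described above can be sketched as a handful of RDF statements. The class and property names below are invented stand-ins, not terms from the Cambridge ontologies, and the serializer is a bare-bones Turtle emitter:

```python
# Invented polymer-ontology terms sketched as RDF triples; the actual
# ontologies define their own class and property names in OWL.
triples = [
    ("po:GlassTransitionTemperature", "rdfs:subClassOf", "po:ThermalProperty"),
    ("po:ThermalProperty", "rdfs:subClassOf", "po:PolymerProperty"),
    ("po:GlassTransitionTemperature", "po:measuredBy",
     "po:DifferentialScanningCalorimetry"),
]

def to_turtle(triples):
    """Serialize (s, p, o) triples as one Turtle statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)

ttl = to_turtle(triples)
print(ttl)
```

Because the measurement link is a first-class statement, software can answer "how is this property measured?" by graph traversal rather than by convention.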
Toxicity reference database (ToxRefDB) to develop predictive toxicity models and prioritize compounds for future toxicity testing
Hao Tang, Hao Zhu PhD, Liying Zhang, Alexander Sedykh PhD, Ann Richard PhD, Ivan Rusyn MD, PhD, Prof. Alexander Tropsha PhD. Division of Medicinal Chemistry and Natural Products, School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States; Department of Biochemistry and Biophysics, School of Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States; National Center for Computational Toxicology, Office of Research & Development, U.S. Environmental Protection Agency, Chapel Hill, NC, United States; Department of Environmental Sciences and Engineering, School of Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
EPA's ToxCast program aims to use in vitro assays to predict chemical hazards and prioritize chemicals for toxicity testing. We employed the predictive QSAR workflow to develop computational toxicity models for ToxCast compounds with historical animal testing results available from ToxRefDB. To ensure model stability and robustness, multiple classifiers and 5-fold external cross-validation were applied. Results show that for three of the 78 toxicity endpoints, including one chronic and two reproductive endpoints, the Correct Classification Rate for external validation datasets was above 0.6 for all types of QSAR models. Our studies suggest that it is feasible to develop QSAR models for some endpoints, which could be further augmented by in vitro assay measures. The validated toxicity models were used for virtual screening of 50,000 chemicals compiled for the REACH program. The compounds predicted as toxic could be regarded as candidates for future toxicity testing. Abstract does not reflect EPA policy.
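The Correct Classification Rate used to judge the external folds is the mean of per-class recalls (balanced accuracy), which matters for toxicity endpoints with skewed class ratios. A minimal sketch, with invented fold labels rather than ToxRefDB data:

```python
def ccr(y_true, y_pred):
    """Correct Classification Rate: mean of per-class recalls
    (balanced accuracy), computed on an external validation fold."""
    recalls = []
    for c in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical external fold: 8 toxic (1) and 2 non-toxic (0) compounds.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1]
print(ccr(y_true, y_pred))  # 0.625, while plain accuracy of 0.7 would overstate it
```

Averaging this over the five external folds gives the per-endpoint figure compared against the 0.6 threshold in the abstract.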
OrbDB: A database of molecular orbital interactions
Matthew A. Kayala, Chloe A. Azencott, Dr. Jonathan H. Chen PhD, Prof. Pierre F. Baldi PhD. Department of Computer Science, University of California - Irvine, Irvine, CA, United States
The ability to anticipate the course of a reaction is essential to the practice of chemistry. This aptitude relies on the understanding of elementary mechanistic steps, which can be described as the interaction of filled and unfilled molecular orbitals. Here, we create a database of mechanistic steps from previous work on a rule-based expert system (ReactionExplorer). We derive 21,000 priority-ordered favorable elementary steps for 7,800 distinct reactants or intermediates. All other filled-to-unfilled molecular orbital interactions yield 106 million unfavorable elementary steps. To predict the course of reactions, one must recover the relative priority of these elementary steps. Initial cross-validated results for a neural network on several stratified samples indicate that we are able to retrieve this ordering with a precision of 98.9%. The quality of our database makes it an invaluable resource for the prediction of elementary reactions, and therefore of full chemical processes.
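Recovering the relative priority of elementary steps is a ranking problem. As a minimal sketch of the idea (the paper uses a neural network; the pairwise perceptron and the two-feature orbital descriptors below are simplifying assumptions of ours, not the OrbDB method):

```python
def score(w, x):
    """Linear ranking score for one orbital-interaction feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_rank_perceptron(pairs, n_features, epochs=50):
    """Pairwise ranking perceptron: learn w so that every favorable step
    outscores its unfavorable partner, i.e. w.x_fav > w.x_unfav."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for fav, unfav in pairs:
            if score(w, fav) <= score(w, unfav):   # mis-ordered pair
                w = [wi + (f - u) for wi, f, u in zip(w, fav, unfav)]
    return w
```

Precision of the recovered ordering can then be measured as the fraction of held-out favorable/unfavorable pairs ranked correctly.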
Novel approach to drug discovery integrating chemogenomics and QSAR modeling: Applications to anti-Alzheimer's agents
Rima Hajjo, Dr. Simon Wang PhD, Prof. Bryan L. Roth MD, PhD, Prof. Alexander Tropsha PhD. Department of Medicinal Chemistry and Natural Products, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States; Department of Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
Chemogenomics is an emerging interdisciplinary field relating receptorome-wide biological screening to the functional or clinical effects of chemicals. We have developed a novel chemogenomics approach combining QSAR modeling, virtual screening (VS), and gene expression profiling for drug discovery. Gene signatures for Alzheimer's disease (AD) were used to query the Connectivity Map (cmap, http://www.broad.mit.edu/cmap/) to identify potential anti-AD agents. Concurrently, QSAR models were developed for the serotonin, dopamine, muscarinic, and sigma receptor families implicated in AD. The models were used for VS of the World Drug Index database to identify putative ligands. Twelve common hits from the QSAR/VS and cmap studies were subjected to parallel binding assays against a panel of GPCRs. All compounds were found to bind to at least one receptor, with binding affinities between 1.7 and 9000 nM. Thus, our approach afforded novel, experimentally confirmed GPCR ligands that may serve as putative treatments for AD.
Cheminformatics improvements by combining semantic web technologies, cheminformatics representations, and chemometrics for statistical modeling and pattern recognition
Dr. Egon L. Willighagen. Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Uppland, Sweden
My research focuses on the methods needed for large-scale molecular property prediction, using semantic web, cheminformatics, and chemometrics methods. Originally starting with a Dictionary on Organic Chemistry website, research was begun to find methods to accurately disseminate molecular knowledge, resulting in participation in Open Source cheminformatics projects, including Jmol, JChemPaint, and the Chemical Markup Language project, and an oral presentation at the "Chemistry & the Internet" conference in 2000. In that year, the applicant, together with the Jmol and JChemPaint project leaders, founded the Chemistry Development Kit (CDK), which is now a highly cited Open Source cheminformatics toolkit. Between 2001 and 2006, the applicant continued research in the area of data analysis with a PhD thesis on the "Representation of Molecules and Molecular Systems in Data Analysis and Modeling" with Prof. dr. L.M.C. Buydens at the Analytical Chemistry Department of Radboud University Nijmegen. The thesis studies the interaction between molecular representation and statistical methods and shows how tightly the two need to match. Topics of the thesis include a critical analysis of the use of proton and carbon NMR in QSAR; the use of Open Source, Open Data, and Open Standards for interoperability in cheminformatics; the clustering of crystal structures using a novel similarity measure; and the use of new supervised self-organizing maps for pattern recognition in crystallography. Part of the research was performed in the group of dr. P. Murray-Rust at Cambridge University. Later research focused on the use of semantic technologies to reduce error in the aggregation and exchange of molecular data. Recent work applies the developed technologies to cheminformatics in general, and to QSAR and metabolite identification in particular, with dr. C. Steinbeck at Cologne University in Germany, and with dr. R. van Ham at Wageningen University within the Netherlands Metabolomics Center.
The applicant recently joined the development team of the award-winning cheminformatics platform Bioclipse in Uppsala, with Prof. J. Wikberg in Sweden, to continue his research on improving interoperability and reproducibility in cheminformatics, and in pharmaceutical bioinformatics and proteochemometrics in particular. This implies continued CDK development, development of semantic methods in computational chemistry, and making these technologies accessible to the non-programming chemist by supporting the development of cheminformatics in bench-chemist-oriented platforms such as Bioclipse and Taverna.
Prediction of consistent water networks in uncomplexed protein binding sites based on knowledge-based potentials
Michael Betz, Gerd Neudert, Gerhard Klebe. Pharmaceutical Chemistry, Philipps-University Marburg, Marburg, Germany
Within the active site of a protein, water fulfills a variety of different roles. Solvation of hydrophilic parts stabilizes a distinct protein conformation, whereas desolvation upon ligand binding may lead to a gain of entropy. In an overwhelming number of cases, water molecules mediate interactions between the protein and the bound ligand. Therefore, a reliable prediction of the water molecules participating in ligand binding is essential for docking and scoring, and is necessary to develop strategies in ligand design. This requires reasonable estimates of the free energy contributions of water to binding.
Useful parameters for such estimates are the total number of displaceable water molecules and the probabilities of their displacement upon ligand binding. These parameters depend on specific interactions with the protein and with other water molecules, and thus on the positions of individual water molecules.
The high flexibility of water networks makes it difficult to observe distinct water molecules at well-defined positions in structure determinations. Thus, experimentally observed positions of water molecules have to be assessed critically, bearing in mind that they represent an average picture of a highly dynamic equilibrium ensemble. Moreover, many structures contain inconsistent and incomplete water networks.
To address these deficiencies, we developed a tool that predicts possible configurations of complete water networks in binding pockets in a consistent way. It is based on the well-established knowledge-based potentials implemented in DrugScore, which also allow for a reasonable differentiation between "conserved" and "displaceable" water molecules. The potentials used were derived specifically for water positions as observed in small-molecule crystal structures in the CSD.
To account for the flexibility and high intercorrelation of water positions, we apply a clique-based approach, resulting in water networks that maximize the total DrugScore.
To incorporate as much known information as possible about a given target, we also allow constraints defined by experimentally observed water positions to be included.
Our tool provides a useful starting point whenever a possible configuration of water molecules needs to be estimated in an uncomplexed protein, suggesting their spatial positions and their classification for use in affinity prediction.
In first tests, we obtained classifications and positional predictions in good agreement with crystallographically observed water molecules, with remarkably small deviations.
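The clique-based selection of a mutually compatible, score-maximising water network can be illustrated with a small brute-force search. The scores and the compatibility graph below are toy inputs of ours; in the actual tool, DrugScore potentials supply the weights and geometric criteria define compatibility.

```python
def best_water_network(scores, adj):
    """Brute-force clique search: choose a subset of candidate water
    positions that are pairwise compatible (a clique in the compatibility
    graph `adj`) and maximise the summed score."""
    best, best_score = [], 0.0

    def extend(clique, candidates, total):
        nonlocal best, best_score
        if total > best_score:
            best, best_score = list(clique), total
        for i, v in enumerate(candidates):
            # v may join only if compatible with every water already chosen
            if all(u in adj[v] for u in clique):
                clique.append(v)
                extend(clique, candidates[i + 1:], total + scores[v])
                clique.pop()

    extend([], list(range(len(scores))), 0.0)
    return best, best_score
```

For realistic numbers of candidate positions, a dedicated maximum-weight clique algorithm would replace this exhaustive enumeration.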
Functional binders for non-specific binding: Evaluation of virtual screening methods for the elucidation of novel transthyretin amyloid inhibitors
Carlos J.V. Simões, Trishna Mukherjee, Prof. Richard M. Jackson PhD, Prof. Rui M.M. Brito PhD. Department of Chemistry, Center for Neuroscience and Cell Biology, University of Coimbra, Coimbra, Portugal; Institute of Molecular and Cellular Biology, University of Leeds, Leeds, West Yorkshire, United Kingdom
Inhibition of fibril formation by stabilization of the native form of transthyretin (TTR) is a viable approach for the treatment of Familial Amyloid Polyneuropathy that has been gaining momentum in the field of amyloid research. Herein, we present a benchmark of five virtual screening strategies to identify novel TTR stabilizers: (1) 2D similarity searches with chemical hashed fingerprints, pharmacophore fingerprints and UNITY fingerprints, (2) 3D searches based on shape, chemical, and electrostatic similarity, (3) LigMatch, a ligand-based method employing multiple templates, (4) 3D pharmacophore searches, and (5) docking to consensus X-ray crystal structures. By combining the best-performing VS protocols, a small subset of molecules was selected from a tailored library of 2.3 million compounds and identified as representative of multiple series of potential leads. According to our predictions, the retrieved molecules present better solubility, halogen fraction, and binding affinity for both TTR pockets than the stabilizers discovered to date.
Using the oreChem experiments ontology: Planning and enacting chemistry
Prof Jeremy G Frey, Mark I Borkum, Prof Carl Lagoze, Dr. Simon J Coles. School of Chemistry, University of Southampton, Southampton, Hants, United Kingdom; Department of Information Science, Cornell University, Ithaca, NY, United States
This paper presents the oreChem Experiments Ontology, an extensible model that describes the formulation and enactment of scientific methods (referred to as “plans”), designed to enable new models of research and facilitate the dissemination of scientific data on the Semantic Web. Currently, a high level of domain-specific knowledge is required to identify and resolve the implicit links that exist between digital artefacts, constituting a significant barrier to entry for third parties that wish to discover and reuse published data. The oreChem ontology radically simplifies and clarifies the problem of representing an experiment, facilitating the discovery and reuse of the data in the correct context. We describe the main parts of the ontology and detail the enhancements made to the Southampton eCrystals repository to enable the publication of oreChem metadata.
CHEMINF: Community-developed ontology of chemical information and algorithms
Leonid L Chepelev, Janna Hastings, Egon Willighagen, Nico Adams, Christoph Steinbeck, Peter Murray-Rust, Michel Dumontier. Department of Biology, School of Computer Science, and Institute of Biochemistry, Carleton University, Ottawa, Ontario, Canada; Chemoinformatics and Metabolism Team, European Bioinformatics Institute, Cambridge, United Kingdom; Department of Pharmaceutical Sciences, Uppsala University, Uppsala, Sweden; Department of Chemistry, Unilever Centre for Molecular Informatics, University of Cambridge, Cambridge, United Kingdom
In order to truly convert RDF-encoded chemical information into knowledge and break out of domain- and vendor-specific data silos, reliable chemical ontologies are necessary. To date, no standard ontology that addresses all chemical information representation and service integration needs has emerged from previously proposed ontologies, ironically threatening yet another “Tower of Babel” event in cheminformatics. To avoid the resulting substantial ontology mapping costs, we hereby propose CHEMINF, a community-developed, modular, and unified ontology for chemical graphs, qualities, descriptors, algorithms, implementations, and data representations/formalisms. Further, CHEMINF is aligned with ontologies developed within the OBO Foundry effort, such as the Information Artifact Ontology. We present the application of CHEMINF to efficiently integrate two RDF-based chemical knowledgebases with different representation structures and aims, but with common classes and properties drawn from CHEMINF. Finally, we discuss the steps taken to ensure the applicability of this ontology in the semantic envelopment of computational chemistry resources, algorithms, and their output.
Chemical entity semantic specification: Knowledge representation for efficient semantic cheminformatics and facile data integration
Leonid L Chepelev, Michel Dumontier. Department of Biology, School of Computer Science, and Institute of Biochemistry, Carleton University, Ottawa, Ontario, Canada
Though the nature of RDF implies the ability to interoperate and integrate diverse knowledgebases, designing adequate and efficient RDF-based representations of knowledge concerning chemical entities is non-trivial. We hereby describe the Chemical Entity Semantic Specification (CHESS), which captures chemical descriptors, molecular connectivity, functional composition, and the geometric structure of chemical entities and their components. CHESS also handles multiple data sources and multiple conformers for molecules, as well as reactions and interactions. We demonstrate the generation of a chemical knowledgebase from disparate data sources and use it to analyze how design choices in CHESS affect the efficiency of solutions to some classical cheminformatics problems, including molecular similarity searching and subgraph detection. We do this through automated conversion of SMILES-encoded query fragments into SPARQL queries and DL-Safe rules. Finally, we discuss approaches to the identification of potential reaction participants and class members in chemical entity knowledgebases represented with CHESS.
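The automated conversion of a connectivity fragment into a SPARQL subgraph-matching query might look like the sketch below. The `chess:` namespace and the `hasAtom`/`element`/`bondedTo` predicates are invented placeholders for illustration, not the actual CHESS vocabulary.

```python
def fragment_to_sparql(atoms, bonds):
    """Translate a query fragment (list of element symbols plus bond pairs,
    as a SMILES parser might produce) into a SPARQL query that matches any
    molecule containing that subgraph.  Vocabulary is hypothetical."""
    lines = [
        "PREFIX chess: <http://example.org/chess#>",
        "SELECT DISTINCT ?mol WHERE {",
    ]
    for i, elem in enumerate(atoms):
        lines.append(f'  ?mol chess:hasAtom ?a{i} . ?a{i} chess:element "{elem}" .')
    for i, j in bonds:
        lines.append(f"  ?a{i} chess:bondedTo ?a{j} .")
    lines.append("}")
    return "\n".join(lines)
```

For example, the fragment C-O (two atoms, one bond) becomes a query for all molecules containing a carbon bonded to an oxygen.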
Semantic assistant for lipidomics researchers
Alexandre Kouznetsov, Rene Witte, Christopher J.O. Baker. Department of Computer Science and Applied Statistics, University of New Brunswick, Saint John, New Brunswick, Canada; Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada
Lipid nomenclature has yet to become a robust research tool for lipidomics or lipid research in general. This is in part because no rigorous structure-based definitions of membership in specific lipid classes have existed. Recent work on the OWL-DL Lipid Ontology, with defined axioms for class membership, has provided new opportunities to revisit the lipid nomenclature issue [1,2]. Also necessary is a framework for sharing these axioms with scientists during scientific discourse and the drafting of publications. To achieve this, we introduce here a new paradigm for lipidomics researchers in which a client-side application tags raw text about lipids with information derived from the ontology, such as canonical names or relevant functional groups, delivered using web services. Our approach includes the following core components: (i) the Semantic Assistant Framework [3]; (ii) the Lipid Ontology [4]; (iii) an ontological NLP methodology; and (iv) an ontology axiom extractor for the GATE framework. The Semantic Assistant Framework is a service-oriented architecture used to enhance existing end-user clients, such as OpenOffice Writer, with online lipidomics text analysis capabilities provided as a set of web services. The ontological NLP methodology links lipid named entities occurring in a document opened on the client side with existing ontologies on the server side. The ontology axiom extractor annotates each named entity with its canonical name, class name, and related class axioms, providing annotations for documents on the client side. The proposed system is scalable and extensible, allowing researchers to easily customize the information to be delivered as annotations, depending on the availability of chemical ontologies with defined axioms linked to canonical names for chemical entities.
[1] Baker CJO, Low HS, Kanagasabai R, Wenk MR (2010) Lipid Ontologies. 3rd Interdisciplinary Ontology Conference, Tokyo, Japan, February 27-28, 2010.
[2] Low HS, Baker CJO, Garcia A, Wenk M (2009) OWL-DL Ontology for Classification of Lipids. International Conference on Biomedical Ontology, Buffalo, New York, July 24-26, 2009.
[3] Witte R, Gitzinger T (2008) A General Architecture for Connecting NLP Frameworks and Desktop Clients Using Web Services. 13th International Conference on Applications of Natural Language to Information Systems.
[4] Lipid Ontology, available at http://bioportal.bioontology.org/ontologies/39503
ChemicalTagger: A tool for semantic text-mining in chemistry
Dr Lezan Hawizy, Dave M Jessop, Professor Peter Murray-Rust. The Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Cambridge, United Kingdom
The primary vehicle for scientific communication is the published article or thesis, written in natural language combined with domain-specific terminology. As such, these documents contain unstructured data.
Given the unquestionable usefulness of data extraction from unstructured literature, we aim to show how this can be achieved for the discipline of chemistry. The highly formulaic style of writing most chemists adopt makes their contributions well suited to high-throughput Natural Language Processing (NLP) approaches. Using chemical synthesis procedures as an exemplar, we present ChemicalTagger, a tool that combines chemical entity recognisers such as OSCAR with tokenisers, part-of-speech taggers, and shallow parsing tools to produce a formal structure of reactions.
The extracted data can then be expressed in RDF. This allows for the generation of highly informative visualisations, such as visual document summaries, and for structured querying; further enrichment can be provided by linking with domain-specific ontologies.
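The flavour of such tagging can be conveyed with a miniature sketch. The tiny tag set, the regex patterns, and the `chem_names` lookup below are our own simplifications; ChemicalTagger itself combines OSCAR with full part-of-speech tagging and shallow parsing.

```python
import re

# Hypothetical miniature tag set for synthesis text.
PATTERNS = [
    (re.compile(r"^\d+(\.\d+)?$"), "CD"),                      # quantities
    (re.compile(r"^(mg|g|ml|mmol|mol)$", re.I), "UNIT"),       # units
    (re.compile(r"^(added|stirred|heated|cooled|filtered)$", re.I), "VERB"),
]

def tag_tokens(sentence, chem_names):
    """Tag the tokens of a synthesis sentence.  `chem_names` stands in for
    a chemical named-entity recogniser such as OSCAR."""
    tags = []
    for tok in re.findall(r"[\w\.]+", sentence):
        if tok.lower() in chem_names:
            tags.append((tok, "CHEM"))
            continue
        for pat, tag in PATTERNS:
            if pat.match(tok):
                tags.append((tok, tag))
                break
        else:
            tags.append((tok, "OTHER"))
    return tags
```

A shallow parser would then group the tagged tokens into reaction phrases (reagent, quantity, action) before export to RDF.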
From canonical numbering to the analysis of enzyme-catalyzed reactions: 32 years of publishing in JCIM (JCICS)
Prof. Johann Gasteiger. Computer-Chemie-Centrum, University of Erlangen-Nuremberg, Erlangen, Germany; Molecular Networks GmbH, Erlangen, Germany
In 1972 we embarked on the development of a program for computer-assisted synthesis design, which eventually led to the present system THERESA. Along the way, many fundamental problems had to be solved, such as the unique representation of chemical structures, published in 1977. This work laid the foundation for building the Beilstein database. Methods had to be developed for the computer representation of chemical reactions, which formed the basis for constructing the ChemInform reaction database. Recent work has concentrated on the analysis of biochemical reactions, the prediction of metabolism, and the risk assessment of chemicals.
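The core idea behind unique (canonical) atom numbering can be illustrated with Morgan-style iterative refinement of connectivity invariants. This is a generic textbook sketch, not the 1977 algorithm itself: atoms start with their degree as an invariant and are repeatedly re-ranked by the multiset of their neighbours' ranks until the partition stops refining.

```python
def canonical_ranks(adj):
    """Morgan-style refinement: adj maps each atom to its neighbour list.
    Returns a rank per atom; symmetry-equivalent atoms share a rank."""
    ranks = {v: len(adj[v]) for v in adj}          # start from atom degree
    n_classes = len(set(ranks.values()))
    while True:
        # New invariant: own rank plus sorted multiset of neighbour ranks.
        inv = {v: (ranks[v], tuple(sorted(ranks[u] for u in adj[v])))
               for v in adj}
        order = {t: i for i, t in enumerate(sorted(set(inv.values())))}
        ranks = {v: order[inv[v]] for v in adj}
        if len(set(ranks.values())) == n_classes:  # partition stable
            return ranks
        n_classes = len(set(ranks.values()))
```

On a linear chain a-b-c-d-e, for example, the two terminal atoms end up with the same rank, as do the two atoms adjacent to them, while the central atom is unique.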
Fifteen years of JCICS
Dr. George W Milne. NCI, NIH (Retd), Williamsburg, VA, United States
During the period 1989-2004 when I was Editor of the Journal of Chemical Information and Computer Sciences (JCICS), the predecessor of the Journal of Chemical Information and Modeling (JCIM), many papers appeared addressing contemporary problems in computational chemistry.
Some of these problems were completely settled and significant progress was made with others. A third group, in spite of numerous publications, defied attempts at resolution and remain to this day as challenges to computational chemists.
As JCIM, aka JCICS, aka J. Chem. Doc., embarks upon its second 50 years, the progress recorded during the 1990s and the advances in computer hardware and software are reviewed. With a longer perspective, the impact of computers on chemistry is considered.
Fifteen years in chemical informatics: Lessons from the past, ideas for the future
Dimitris Agrafiotis PhD. Pharmaceutical Research & Development, Johnson & Johnson, Spring House, Pennsylvania, United States
A unique aspect of chemical informatics is that it has been heavily influenced and shaped by the needs of the pharmaceutical industry. As this industry undergoes a profound transformation, so will the field itself. In this talk, we reflect on the experiences of the past and explore the possibilities we see for the future. These possibilities lie at the convergence of chemistry, biology, and information technology, and will require thinking and working across scientific and organizational boundaries in ways that have never previously been possible.
Applications of wavelets in virtual screening
Prof Val Gillet PhD, Mr Richard Martin, Dr Eleanor Gardiner, Dr Stefan Senger. Department of Information Studies, University of Sheffield, Sheffield, United Kingdom; Computational and Structural Chemistry, GlaxoSmithKline, Stevenage, Hertfordshire, United Kingdom
The interactions which a small molecule can make with a receptor can be modelled using three-dimensional molecular fields, such as GRID fields; however, the cumbersome nature of these fields makes their storage and comparison computationally expensive. Wavelets are a family of multiresolution signal analysis functions which have become widely used in data compression. We have applied the non-standard wavelet transform to generate low-resolution approximations (wavelet thumbnails) of finely sampled GRID fields without loss of information. We demonstrate various applications of wavelet thumbnails, including the development of an alignment method to enable the comparison of wavelet representations of GRID fields in arbitrary orientations.
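The building block of such multiresolution transforms is the Haar analysis step, shown here in a minimal one-dimensional sketch (the actual work applies the 2D/3D non-standard transform to GRID fields; the field values below are toy data):

```python
def haar_step(signal):
    """One Haar analysis step: pairwise averages (coarse approximation)
    and pairwise differences (detail coefficients)."""
    avg = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    det = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return avg, det

def wavelet_thumbnail(field, levels):
    """Repeatedly average a sampled field; the surviving coarse coefficients
    act as a low-resolution 'thumbnail'.  Keeping the detail coefficients as
    well makes the transform exactly invertible, i.e. lossless."""
    details = []
    for _ in range(levels):
        field, det = haar_step(field)
        details.append(det)
    return field, details
```

Two thumbnails can then be compared far more cheaply than the full fields, with the details retained for exact reconstruction when needed.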
Privileged substructures revisited: Target community-selective scaffolds
Jürgen Bajorath. Department of Life Science Informatics, University of Bonn, Germany
Molecular scaffolds that preferentially bind to a given target family, so-called “privileged” substructures, have long been of high interest in drug discovery. Many privileged substructures have been proposed, in particular, for G protein coupled receptors and protein kinases. However, the existence of truly privileged structural motifs has remained controversial. Frequency-based analysis has shown that many scaffolds thought to be target class-specific also occur in compounds active against other types of targets. In order to explore scaffold selectivity on a large scale, we have carried out a systematic survey of publicly available compound data and defined target communities on the basis of ligand-target networks. The analysis was based on compound potency data and target pair potency-derived selectivity. More than 200 hierarchical scaffolds were identified, each represented by at least five compounds, which exclusively bound to targets within one of ca. 20 target communities. By contrast, currently available compound data is too sparsely distributed to assign target-specific scaffolds. Most scaffolds that exclusively bind to a single target within a community are only represented by one or two compounds in public domain databases. However, characteristic selectivity patterns are found to evolve around community-selective scaffolds that can be explored to guide the design of target-selective compounds.
Automated retrosynthetic analysis: An old flame rekindled
Prof Peter Johnson PhD, Anthony P Cook, James Law, Mahdi Mirzazadeh, Dr Aniko Simon PhD. School of Chemistry, University of Leeds, Leeds, United Kingdom; Simbiosys Inc, Toronto, Ontario, Canada
The last century saw truly innovative research aimed at the creation of systems for computer-aided organic synthesis design (CAOSD). However, such systems have not achieved significant user acceptance, perhaps because they required manual creation of reaction knowledge bases, a time-consuming task which requires considerable synthetic chemistry expertise. More recent systems such as ARChem [1] circumvent this problem by automated abstraction of transformation rules from very large databases of specific reaction examples. ARChem is still a work in progress, and specific problems being addressed include:
a) identification of the precise structural characteristics of each reaction, often requiring knowledge of the reaction mechanism;
b) treatment of interfering functional groups;
c) minimising the combinatorial explosion inherent in automated multistep retrosynthesis;
d) treatment of the results of extensive recent research into enantioselective and stereoselective reactions.
[1] Law J et al., J. Chem. Inf. Model., 2009, 49 (3), 593-602.
Dietary supplements: Free evidence-based resources for the cautious consumer
MLS Brian Erb. McGoogan Library of Medicine, University of Nebraska Medical Center, Omaha, NE, United States
Vitamin, mineral and dietary supplements are a 70 billion dollar industry. With marginal FDA regulation, it can be difficult to evaluate the health claims of a given product. How can the skeptical consumer distinguish a promising nutritional supplement from a substance that lacks the evidence to back its nutritional claims? This short presentation will highlight some evidence-based Internet sources that will help the consumer navigate the dietary supplement minefield. These sources will not only help the consumer separate bogus claims from research supported evidence, but also help the consumer make informed nutritional decisions regarding which supplements might be a relevant and useful part of their healthy diet and lifestyle. The resources to be explored have been collected in a UNMC libguide at http://unmc.libguides.com/supplements for ease of navigation and dissemination.
What lessons learned can we generalize from evaluation and usability of a health website designed for lower literacy consumers?
Mary J Moore PhD, Randolph G. Bias PhD. Department of Health Informatics, University of Miami Miller School of Medicine, Miami, FL, United States; Department of Information, University of Texas at Austin, Austin, Texas, United States
Objectives: Researchers conducted multifaceted usability testing and evaluation of a website designed for use by those with lower computer literacy and lower health literacy. Methods included heuristic evaluation by a usability engineer, remote usability testing and face-to-face testing. Results: Standard usability testing methods required modification, including interpreters, increased flexibility for time on task, presence of a trusted intermediary, and accommodation for family members who accompanied participants. Participants suggested website redesign, including simplified language, engaging and relevant graphics, culturally relevant examples, and clear navigation. Conclusions: User-centered design was especially important for this audience. Some lessons learned from this experience are echoed in usability and evaluation of commercial sites designed for similar audiences, and may be generalizable.
National Library of Medicine resources for consumer health information
Michelle Eberle. National Network of Libraries of Medicine - New England, Shrewsbury, MA, United States
Come learn about free, high-quality web resources for consumer health information from the National Library of Medicine. We will cover MedlinePlus, a resource for health information for the public. The presenter will take you on a guided tour of http://medlineplus.gov and other specialized web resources for consumer health information, including the Drug Information Portal, DailyMed, and the Dietary Supplements Labels Database. The program will wrap up with a brief introduction to ClinicalTrials.gov. You will leave this program equipped with the expertise to find, critically appraise, and use online health information more effectively.
Better prescription for information: Dietary supplements online
Gail Y. Hendler MLS. Hirsh Health Sciences Library, Tufts University, Boston, MA, United States
Dietary supplements are becoming staples in the health regimens of a growing number of consumers worldwide. According to the most recent National Health and Nutrition Examination Survey, 52% of adults in the United States reported taking a nutraceutical in the past month. Consumers turn to these products believing they are safe and effective because they are “all natural.” Supplementing knowledge about the benefits and the potential risks associated with nutraceutical use requires information resources that are authoritative, accurate, and readable to a large and general audience. This presentation will provide recommendations for locating high-quality, freely available online resources that today's consumers need to support decision-making. Featured resources will include books, databases, and websites that discuss the pros and cons and provide the evidence for better use of dietary supplements, herbs, and functional foods.
Overview of the linking open drug data task
Eric Prud'hommeaux, Egon Willighagen, Susie Stephens. W3C/MIT, Cambridge, MA, United States; Uppsala University, Uppsala, Sweden; Johnson and Johnson, United States
There is much interesting information about drugs that is available on the Web. Data sources range from medicinal chemistry results, to the impacts of drugs on gene expression, through to the results of drugs in clinical trials.
Linking Open Drug Data (LODD) is a task within the W3C's Health Care and Life Sciences Interest Group. LODD has surveyed publicly available data sets about drugs, created Linked Data representations of these data sets and interlinked them, and identified interesting scientific and business questions that can be answered once the data sets are connected. The task also actively explores best practices for exposing data in a Linked Data representation.
The figure below shows part of the data sets that have been published and interlinked by the task so far.
Control, monitoring, analysis and dissemination of laboratory physical chemistry experiments using semantic web and broker technologies
Prof Jeremy G Frey, Stephen Wilson. School of Chemistry, Univeristy of Southampton, Southampton, Hants, United Kingdom
A suite of software was developed to control and monitor experimental and environmental data, and was used for probing the air/water interface using Second Harmonic Generation. A centralised message broker enabled a common communication protocol between all objects in the system: experimental apparatus, data loggers, storage solutions, and displays. The data and context are captured and represented in ways compatible with the Semantic Web. Experimental plans and their enactment are described using the oreChem experiments ontology; this provides the means to capture the metadata associated with the experimental process and the resulting data. Environmental data was stored in the Open Geospatial Consortium Sensor Observation Service (SOS). The SOS is part of the Sensor Web Enablement architecture, which describes a number of interoperable interfaces and metadata encodings for integrating sensor webs into the cloud. A mashup web interface was produced to link all these sources of information from a single point.
Semantic analysis of chemical patents
Dave M Jessop, Dr Lezan Hawizy, Prof. Peter Murray-Rust, Professor Robert C Glen
Data mining and querying of integrated chemical and biological information using Chem2Bio2RDF
Dr David J Wild, Bin Chen, Dr Ying Ding, Xiao Dong, Huijun Wang, Dazhi Jiao, Dr Qian Zhu, Madhuvanti Sankaranarayanan. School of Informatics and Computing, Indiana University, Bloomington, IN, United States; School of Library and Information Science, Indiana University, Bloomington, IN, United States
We have recently developed a freely available resource called Chem2Bio2RDF (http://chem2bio2rdf.org) that consists of chemical, biological, and chemogenomic datasets in a consistent RDF framework, along with SPARQL querying tools that have been extended to allow chemical structure and similarity searching. Chem2Bio2RDF allows integrated querying that crosses chemical and biological information, including compounds, publications, drugs, genes, diseases, pathways, and side-effects. It has been used for a variety of applications, including investigation of compound polypharmacology, linking drug side-effects to pathways, and identifying potential multi-target pathway inhibitors. In the work reported here, we describe a new set of tools and methods that we have developed for querying and data mining in Chem2Bio2RDF, including: Linked Path Generation (a method for automatically identifying paths between datasets and generating SPARQL queries from these paths); an ontology for integrated chemical and biological information; a Cytoscape plugin that allows dynamic querying and network visualization of query results; and a facet-based browser for browsing results.
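The Linked Path Generation idea, finding a path between two datasets in the link graph and then emitting a chained SPARQL pattern, can be sketched as below. The dataset names and predicate URIs are invented examples, not the actual Chem2Bio2RDF schema.

```python
from collections import deque

def find_link_path(links, start, goal):
    """Breadth-first search over a dataset-link graph.
    `links` maps dataset -> {neighbour dataset: linking predicate URI}."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in links.get(path[-1], {}):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None   # datasets not connected

def path_to_sparql(links, path):
    """Turn a dataset path into a chained SPARQL basic graph pattern."""
    lines = ["SELECT * WHERE {"]
    for i in range(len(path) - 1):
        pred = links[path[i]][path[i + 1]]
        lines.append(f"  ?{path[i]} <{pred}> ?{path[i + 1]} .")
    lines.append("}")
    return "\n".join(lines)
```

Chaining the per-hop triple patterns in this way is what lets a single query cross from compounds through genes to diseases once the datasets are interlinked.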
Mining and visualizing chemical compound-specific chemical-gene/disease/pathway/literature relationships
Dr. Qian Zhu, Prajakta Purohit, Jong Youl Choi, Seung-Hee Bae, Dr. Judy Qiu, Prof. Ying Ding, Prof. David Wild. School of Informatics and Computing, Indiana University, Bloomington, IN, United States; School of Library & Information Science, Indiana University, Bloomington, IN, United States; Department of Computer Science, Indiana University, Bloomington, IN, United States
In common with most scientific disciplines, drug discovery has in the last few years seen a huge increase in the amount of publicly available and proprietary information, owing to a variety of factors including improvements in experimental technologies. The central challenge is how to use all of this information together in an intelligent, integrative fashion.
We are developing an application to mine relationships between chemicals and genes, diseases, pathways and the literature, and to visualize them. It aims to help answer the question "what else should I know about this compound?" from a medicinal chemistry perspective, based on the full picture of a chemical. For the mining part, we have developed an aggregating web service, named WENDI, which calls multiple individual (atomic) web services covering a diversity of compound-related data sources, predictive models and in-house algorithms, and aggregates the results from these services in XML. For visualization, we take two approaches. First, we created an RDF reasoner that converts the XML from WENDI to RDF, finds inferred relationships in the RDF, ranks the evidence for chemical-disease associations, and presents all evidence in an SWP faceted browser based on Longwell (http://simile.mit.edu/wiki/Longwell); this combines the flexibility of the RDF data model with a faceted browser, enabling users to explore complex RDF triples in a user-friendly and meaningful manner. Second, we place all relationships from WENDI into a chemical space consisting of 60M PubChem compounds, then cluster and highlight compounds with specific gene, disease, pathway or literature attributes using PubChemBrowse, a customized visualization tool for cheminformatics research. PubChemBrowse provides a novel 3D data-point browser that displays complex properties of massive datasets on commodity clients and supports fast interaction with an external property database via a semantic web interface.
What makes polyphenols good antioxidants? Alton Brown, you should take notes...
Emilio Xavier Esposito PhD. The Chem21 Group, Inc, Lake Forest, Illinois, United States
The dominant physical feature of antioxidants is the phenol group; polyphenols, according to Alton Brown. The proposed antioxidant-tyrosinase mechanism, based on a series of experimentally determined mushroom tyrosinase structures, provides insight into the molecular interactions that drive the reaction. While the enzyme structures illustrate the molecular interactions important for tyrosinase inhibition, they do not always facilitate an understanding of what makes a good inhibitor or of the reaction mechanism. Using an antioxidant (tyrosinase inhibitor) dataset of 626 compounds (from the linear discriminant analysis research of Martín et al., Eur J Med Chem 2007, 42, 1370-1381), we constructed binary QSAR models to indicate the important antioxidant molecular features. Exploring models constructed from molecular descriptors based on fingerprints (MACCS keys), traditional molecular descriptors (2D and 2½D), VolSurf-like molecular descriptors (3D) and molecular dynamics (4D-Fingerprints), the relationship between polyphenols' biologically relevant molecular features, as determined by each set of descriptors, and their antioxidant abilities will be discussed.
Engineering and 3D protein-ligand interaction scaling of 2D fingerprints
Jürgen Bajorath. Department of Life Science Informatics, University of Bonn, Bonn, Germany
Different concepts are introduced to further refine and advance molecular descriptors for SAR analysis. Fingerprints have long been among preferred descriptors for similarity searching and SAR studies. Standard fingerprints typically have a constant bit string format and are used as individual database search tools. However, by applying “engineering” techniques such as “bit silencing”, fingerprint reduction, and “recombination”, standard fingerprints can be tuned in a compound class-directed manner and converted into size-reduced versions with higher search performance. It is also possible to combine preferred bit segments from fingerprints of distinct design and generate “hybrids” that exceed the search performance of their parental fingerprints. Furthermore, effective 2D fingerprint representations can be generated from strongly interacting parts of ligands in complex crystal structures. These “interacting fragment” fingerprints focus search calculations on pharmacophore elements without the need to encode interactions directly. Moreover, 3D protein-ligand interaction information can implicitly be taken into account in 2D similarity searching through fingerprint scaling techniques that emphasize characteristic bit patterns.
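The bit-level operations described above can be sketched with fingerprints held as sets of on-bit indices. The `silence` helper and the example bit sets are illustrative only, not an actual fingerprint engineering implementation:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

def silence(fp, silenced_bits):
    """'Bit silencing': drop bits judged unhelpful for a given compound class."""
    return fp - silenced_bits

query  = {1, 4, 7, 9}
db_mol = {1, 4, 8, 9}
noise  = {9}   # a bit found (hypothetically) to hurt recall for this class

print(tanimoto(query, db_mol))                                # 0.6 on the full fingerprints
print(tanimoto(silence(query, noise), silence(db_mol, noise)))  # 0.5 on the reduced fingerprints
```

Compound class-directed tuning then amounts to choosing which bits to silence or recombine so that search performance on that class improves.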
In silico binary QSAR models based on 4D-fingerprints and MOE descriptors for prediction of hERG blockage
Prof. Y. Jane Tseng PhD. Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan Republic of China
Blockage of the human ether-a-go-go related gene (hERG) potassium ion channel is a major factor related to cardiotoxicity. Hence, drugs binding to this channel have become an important biological endpoint in side effects screening. We have collected all available biologically active hERG compounds from the hERG literature for a total of 250 structurally diverse compounds. This data set was used to construct a set of two-state hERG QSAR models. The descriptor pool used to construct the models consisted of 4D-fingerprints generated from the thermodynamic distribution of conformer states available to a molecule, 204 traditional 2D descriptors and 76 3D VolSurf-like descriptors computed using the Molecular Operating Environment (MOE) software. One model is a continuous partial least squares (PLS) QSAR hERG binding model. Another related model is an optimized binary QSAR model that classifies compounds as active, or inactive. This binary model achieves 91% accuracy over a large range of molecular diversity spanning the training set. An external test set was constructed from the condensed PubChem bioassay database containing 816 compounds and successfully used to validate the binary model. The binary QSAR model permits a structural interpretation of possible sources for hERG activity. In particular, the presence of a polar negative group at a distance of 6 to 8 Å from a hydrogen bond donor in a compound is predicted to be a quite structure-specific pharmacophore that increases hERG blockage. Since a data set of high chemical diversity was used to construct the binary model, it is applicable for performing general virtual hERG screening.
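The distance-based pharmacophore mentioned at the end of the abstract (a polar negative group 6 to 8 Å from a hydrogen-bond donor) amounts to a simple geometric test. This sketch, with hypothetical feature coordinates, is illustrative only and not the authors' model:

```python
import math

def distance(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def herg_pharmacophore_flag(neg_group_xyz, donor_xyz, lo=6.0, hi=8.0):
    """Flag the abstract's pharmacophore: a polar negative group 6-8 Angstroms
    from a hydrogen-bond donor (feature coordinates are hypothetical inputs)."""
    return lo <= distance(neg_group_xyz, donor_xyz) <= hi

# Illustrative feature coordinates for one conformer
print(herg_pharmacophore_flag((0.0, 0.0, 0.0), (7.0, 0.0, 0.0)))  # True: 7.0 A apart
print(herg_pharmacophore_flag((0.0, 0.0, 0.0), (3.0, 0.0, 0.0)))  # False: too close
```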
Telling the good from the bad and the ugly: The challenge of evaluating pharmacophore model performance
Robert D. Clark PhD. Simulations Plus, Inc., Lancaster, California, United States
Pharmacophore models are useful when they provide qualitative insight into the interactions between ligands and their target macromolecules, and therefore are more akin in many ways to molecular simulations than to quantitative structure activity relationships (QSARs) based on the partition of activity across a set of molecular descriptors. When the performance of a pharmacophore model is assessed quantitatively, it is usually in terms of its ability to recover known ligands or, less often, in terms of how well it distinguishes ligands from non-ligands. This status as a classification technique also sets it apart from more numerical QSAR methods, in part because of fundamental differences in what being "good" means. Carefully defining what "good" classification is, however, can make creative combination with other techniques a productive way to capture the value of their intrinsic complementarity.
Creative application of ligand-based methods to solve structure-based problems: Using QSAR approaches to learn from protein crystal structures
Prof. Curt M Breneman, Dr. Sourav Das, Dr. Matt Sundling, Mr. Mike Krein, Prof. Steven Cramer, Prof. Kristin P Bennett, Dr. Charles Bergeron, Mr. Jed Zaretzki. Department of Chemistry and Chemical Biology, Rensselaer Polytechnic Institute, Troy, NY, United States; Department of Chemical and Biological Engineering, Rensselaer Polytechnic Institute, Troy, NY, United States; Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY, United States
In practice, there is no inherent disconnect between the descriptor-based cheminformatics methods commonly used for predicting small molecule properties and those that can be used to understand and predict protein behaviors. Examples of such connections include the development of predictive models of protein/stationary phase binding in HIC and ion-exchange chromatography, protein/ligand binding mode characterization through PROLICSS analysis of crystal structures, and the use of PESD binding site signatures for pose scoring and predicting off-target drug interactions. In all of these cases, models were created using descriptors based on protein electronic and structural features and modern machine learning methods that include model validation tools and domain of applicability assessment metrics.
Computer-aided drug discovery
Prof. William L Jorgensen. Department of Chemistry, Yale University, New Haven, CT, United States
Drug development is being pursued through computer-aided structure-based design. For de novo lead generation, the BOMB program builds combinatorial libraries in a protein binding site using a selected core and substituents, and QikProp is applied to filter all designed molecules to ensure that they have drug-like properties. Monte Carlo/free-energy perturbation simulations are then executed to refine the predictions for the best scoring leads including ca. 1000 explicit water molecules and extensive sampling for the protein and ligand. FEP calculations for optimization of substituents on an aromatic ring and for choice of heterocycles are now common. Alternatively, docking with Glide is performed with the large databases of purchasable compounds to provide leads, which are then optimized via the FEP-guided route. Successful application has been achieved for HIV reverse transcriptase, FGFR1 kinase, and macrophage migration inhibitory factor (MIF); micromolar leads have been rapidly advanced to extraordinarily potent inhibitors.
Structure-based discovery and QSAR methods: A marriage of convenience
Jose S Duca. Novartis, Cambridge, MA, United States
The art of building predictive models of the relationships between structural descriptors and molecular properties has been historically important to drug design. In recent years there has been an extraordinary amount of experimental data available from processes designed to accelerate drug discovery in pharma: from high-throughput screening and automation applied to library design and synthesis to chemogenomics and microarray analysis. QSAR methods are one of the many tools to predict affinity-related, physicochemical, pharmacokinetic and toxicological properties through analyzing and extracting information from molecular databases and HTS campaigns.
This presentation will cover case studies in which QSAR and Structure-Based Drug Design (SBDD) have worked in concert during the discovery of pre-clinical candidates. The importance of incorporating time-dependent sampling to improve the quality of nD-QSAR models (n=3,4) will also be discussed and compared to simplified low-dimensional QSAR models. For those cases where structural information is not readily available, an extension of these methodologies will be discussed in relation to ligand-based approaches.
Extending the QSAR Paradigm using molecular modeling and simulation
Professor Anton J Hopfinger Ph.D.. College of Pharmacy, MSC 09 5360, University of New Mexico, Albuquerque, NM, United States; Computational Chemistry, The Chem21 Group, Inc., Lake Forest, IL, United States
QSAR analysis and molecular modeling/simulation methods are often complementary, and when combined in a study yield results greater than the sum of their parts. Modeling and simulation offer the ability to design custom, information-rich trial descriptors for a QSAR analysis. In turn, QSAR analysis is able to discern which of the custom descriptors most fully relate to the behavior of an endpoint of interest. One useful set of custom QSAR descriptors from modeling and simulation for describing ligand-receptor interactions is the grid cell occupancy descriptors, GCODs, of 4D-QSAR analysis. These descriptors characterize the relative spatial occupancy of all the atoms of a molecule over the set of conformations available to the molecule in a particular environment. GCODs permit the construction of a 4D-QSAR equation for virtual screening, as well as a spatial pharmacophore of the 4D-QSAR equation for exploring mechanistic insight. Applications that can particularly benefit from combining QSAR analysis and modeling/simulation tools are those in which a model chemical system is needed to determine the sought-after property. One such application is the transport of molecules through biological compartments, an integral part of many ADMET properties. The reliable estimation of eye irritation is greatly enhanced by simulating the transport of test solutes through membrane bilayers, and using properties extracted from the simulation trajectories as custom descriptors to build eye irritation QSAR models. These key descriptors of the QSAR models, in turn, also permit the investigator to probe and postulate detailed molecular mechanisms of action.
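As a rough illustration of grid cell occupancy descriptors, the following sketch counts, for each grid cell, the fraction of conformers that place any atom in that cell. The cell size and toy conformations are arbitrary; this is a conceptual sketch, not the actual 4D-QSAR code:

```python
from collections import Counter

def gcods(conformers, cell=1.0):
    """Grid cell occupancy descriptors (conceptual): for each grid cell, the
    fraction of conformers in which any atom of the molecule occupies it.
    `conformers` is a list of conformations, each a list of (x, y, z) atoms."""
    counts = Counter()
    for conf in conformers:
        # Set of cells touched by this conformation (each counted once)
        cells = {tuple(int(c // cell) for c in atom) for atom in conf}
        counts.update(cells)
    n = len(conformers)
    return {c: k / n for c, k in counts.items()}

# Two toy conformations of a three-atom molecule
confs = [
    [(0.2, 0.1, 0.0), (1.4, 0.2, 0.0), (2.6, 0.3, 0.0)],
    [(0.3, 0.2, 0.0), (1.2, 0.1, 0.0), (1.8, 1.5, 0.0)],
]
occ = gcods(confs)
print(occ[(0, 0, 0)])  # 1.0: this cell is occupied in both conformers
print(occ[(2, 0, 0)])  # 0.5: occupied in only one conformer
```

Each occupancy value then serves as one trial descriptor in the subsequent QSAR fit.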
Overview of activity landscapes and activity cliffs: Prospects and problems
Prof Gerald M Maggiora. Department of Pharmacology & Toxicology, University of Arizona College of Pharmacy, Tucson, AZ, United States; BIO5 Institute, University of Arizona, Tucson, AZ, United States; Translational Genomics Research Institute, Phoenix, AZ, United States
Substantial growth in the size and diversity of compound collections, and the capability to subject them to an increasing variety of high-throughput assays, manifest the need for a more systematic and global view of structure-activity relationships. The concepts of chemical space and molecular similarity, which are now well known to the drug-research community, provide a suitable framework for developing such a view. Augmenting a chemical space with activity data from various assays generates a set of activity landscapes, one for each assay. The topography of these landscapes contains important information on the structure-activity relationships of compounds that inhabit the chemical space. Activity cliffs, which arise when similar compounds possess widely different activities, are a particularly informative feature of activity landscapes with respect to SAR. The talk will present an overview of activity landscapes and cliffs and will describe some of the prospects and problems associated with these important concepts.
Exploring and exploiting the potential of structure-activity cliffs
Dr Gerald M Maggiora PhD, Michael S Lajiness. Department of Pharmacology & Toxicology, University of Arizona College of Pharmacy, Tucson, Arizona, United States; Scientific Informatics, Eli Lilly & Co, Indianapolis, IN, United States
It is well known that small structural changes sometimes result in large changes in activity. There have been some recent efforts to identify such changes, but little work on defining which structural changes are most informative or even real. Also, the missing-value problem often obscures relevant patterns, if in fact they exist. This presentation will present several ideas and applications for exploring and exploiting structure-activity cliffs. In addition, various visualizations and approaches to communicating the information contained in these "cliffs" will be shared. Examples will be drawn from PubChem.
What makes a good structure activity landscape? Network metrics and structure representations as a way of exploring activity landscapes
Dr. Rajarshi Guha. Department of Informatics, NIH Chemical Genomics Center, Rockville, MD, United States
The representation of SAR data in the form of landscapes and the identification of activity cliffs in such landscapes are well known. A number of approaches to identifying activity cliffs have been described, including several network-based methods such as the SALI approach (JCIM, 2008, 48, 646-658). While a network representation of an SAR landscape moves away from the intuitive idea of rolling hills and steep gorges, it allows us to apply a variety of quantitative analyses. In this talk I will first examine some properties of SALI networks using various measures of network structure and attempt to correlate these features with features of the SAR data. While most examples are from relatively small datasets, I will highlight some examples from larger high-throughput screening datasets. Although such data can be noisy and contain artifacts, I will examine whether the underlying network structure can shed light on specific molecules that may be worth following up. The second focus of the talk is the effect of structure representations on the smoothness of the landscape, and how the SALI characterization can be used to suggest good or bad landscapes.
Consensus model of activity landscapes and consensus activity cliffs
Jose L Medina-Franco, Karina Martinez-Mayorga, Fabian Lopez-Vallejo. Torrey Pines Institute for Molecular Studies, Port St Lucie, FL, United States
Characterization of activity landscapes is a valuable tool in lead optimization, virtual screening and computational modeling of active compounds. As such, understanding the activity landscape and early detection of activity cliffs [Maggiora, G. M. J. Chem. Inf. Model. 2006, 46, 1535] can be crucial to the success of computational models. Similarly, characterizing the activity landscape will be critical in future ligand-based virtual screening campaigns. However, the chemical space and activity landscape are influenced by the particular representation used, and certain representations may lead to apparent activity cliffs. A strategy to address this problem is to consider multiple molecular representations in order to derive a consensus model for the activity landscape and, in particular, identify consensus activity cliffs [Medina-Franco, J. L. et al. J. Chem. Inf. Model. 2009, 49, 477]. The current approach can be extended to identify consensus selectivity cliffs.
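A minimal sketch of the consensus idea: assuming precomputed pairwise similarities under each representation, keep only the pairs whose SALI value exceeds a cliff threshold under every representation. All names, similarities and potencies below are invented for illustration:

```python
def sali(act_i, act_j, sim):
    """Structure-activity landscape index for one compound pair:
    activity difference divided by structural distance (1 - similarity)."""
    return abs(act_i - act_j) / (1.0 - sim + 1e-9)

def consensus_cliffs(activity, sims_by_rep, threshold):
    """Pairs that are cliffs (SALI >= threshold) under EVERY representation."""
    cliffs = None
    for rep, sims in sims_by_rep.items():
        found = {
            pair for pair, s in sims.items()
            if sali(activity[pair[0]], activity[pair[1]], s) >= threshold
        }
        cliffs = found if cliffs is None else cliffs & found
    return cliffs

# Hypothetical pIC50 values and pairwise similarities under two fingerprints
activity = {"A": 8.0, "B": 5.0, "C": 7.9}
sims = {
    "fp_maccs": {("A", "B"): 0.90, ("A", "C"): 0.95},
    "fp_path":  {("A", "B"): 0.85, ("A", "C"): 0.40},
}
print(consensus_cliffs(activity, sims, threshold=15.0))  # only ('A', 'B') survives
```

Pairs that look like cliffs under only one representation are discarded as representation artifacts; only representation-independent cliffs remain.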
R-Cliffs: Activity cliffs within a single analog series
Dimitris Agrafiotis PhD. Pharmaceutical Research & Development, Johnson & Johnson, Spring House, Pennsylvania, United States
The concept of activity cliffs has gained popularity as a means to identify and understand discontinuous SAR, i.e., regions of SAR where minor changes in structure have unpredictably large effects on biological activity. To the best of our knowledge, activity cliffs have invariably been evaluated using global measures of molecular similarity that do not take into account the presence of finer substructure among a series of related analogs. In this talk, we look at activity cliffs within a congeneric series by decomposing the analogs into R-groups and analyzing how activity is affected by changes at a single variation site. The analysis is greatly enhanced by R-group-aware visualization tools such as SAR maps, which have been enhanced to specifically highlight such discontinuities.
Chemical structure representation in the DuPont Chemical Information Management Solutions database: Challenges posed by complex materials in a diversified science company
Dr. Mark A Andrews, Dr. Edward S. Wilks. CR&D, Information & Computing Technologies, DuPont, Wilmington, DE, United States
This talk will describe the novel ways we have developed to represent precisely the structures of the diverse chemical materials of interest to DuPont. These range from simple organics and inorganics to polymers, mixtures, formulations, multi-layer films, composites, and even devices and incompletely defined substances. Part of the solution involves evaluating trade-offs, which may be situation dependent, between details captured in the structure vs. details captured at the sample history level, e.g., ratios of components, polymer molecular weights and microstructures, and the existence of “fairy dust” components. An important aspect of the solution involves ensuring robust structure standardization and duplicate checking for complex and ill-defined substances. We believe that our needs and solutions have challenged and inspired a number of chemical software vendors to provide significant upgrades to the functionalities of their drawing packages and database cartridges.
From deposition to application: Technologies for storing and exploiting crystal structure data
Dr Colin R Groom, Dr Jason Cole, Dr Simon Bowden, Dr Tjelvar Olsson. Cambridge Crystallographic Data Centre, United Kingdom
In December 2009 The Cambridge Crystallographic Data Centre (CCDC) archived the 500,000th small-molecule crystal structure to the Cambridge Structural Database (CSD). The passing of this milestone highlights the rate of growth of the CSD in recent years and the continuing challenges this represents in terms of information storage and exchange.
This talk will describe the development of a number of tools for the processing, validation, and storage of crystal structure data. Recent developments that will enable this growing body of structural knowledge to be exploited in a range of applications, and the provision of additional services to assist the scientific community, will also be illustrated.
Recent IUPAC recommendations for chemical structure representation: An overview
Mr. Jonathan Brecher. CambridgeSoft Corporation, Cambridge, MA, United States
Accurate and unambiguous depiction of chemical information is a key step in communicating that information. Such depiction is equally important whether the intended audience is a human chemist (as in a journal article or patent) or a computer (as in a chemical registration system). Recent IUPAC publications provide chemists a practical guide for producing chemical structure diagrams that accurately convey the author's intended meaning. A summary of those recommendations will be presented. As part of that summary, common pitfalls in producing chemical structure diagrams will be discussed. Solutions to those pitfalls will also be described, with an emphasis on solutions that are simple, straightforward, and accessible to the majority of practicing chemists.
Orbital development kit
Dr. Egon L. Willighagen. Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden
Understanding the properties of molecular structures requires a computer representation, and quantum mechanical and chemical graph representations have both been used abundantly. Each has found its own area of application in chemistry, and their fields are best described as theoretical chemistry and cheminformatics, respectively. The Orbital Development Kit (ODK) positions itself between these two representations, though closest to chemical graph theory, and addresses shortcomings of the latter. In particular, it replaces the coloring of nodes and edges in the chemical graph with explicit atom hybridization and bond order, making the representation more precise in how it captures the geometrical features of the molecule. The ODK does so by replacing the atom as a single node in the chemical graph with a central atomic core surrounded by valence orbitals, possibly hybridized. Using this approach, the definition of an atom type is reformulated as a core element with a particular and well-defined set of identifiable orbitals with an implied, though relative, geometrical orientation. Bonding is now the connection of two orbitals, and a lone pair becomes a single orbital, and is therefore directional too. This approach means that the classical double bond in ethene is now represented by one sigma bond between sp2 orbitals of the two carbons, and one bonding combination of their two pz orbitals. This ODK representation also leaves room for representations beyond the chemical graph, such as that proposed by Dietz in 1995: more than two orbitals can be combined into a set to represent delocalization. The presentation will cover the ODK data model, serialization and deserialization into a Resource Description Framework-based file format, and a bridge to the Chemistry Development Kit for visualization and molecular property calculation.
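A minimal sketch of such an orbital-centric data model, in which bonds connect orbitals rather than atoms; the class and field names are invented for illustration and do not reflect the actual ODK API:

```python
from dataclasses import dataclass, field

@dataclass
class Orbital:
    kind: str          # e.g. "sp2", "pz", "lone"
    occupied: int = 1

@dataclass
class Atom:
    element: str
    orbitals: list = field(default_factory=list)

@dataclass
class Bond:
    a: Orbital
    b: Orbital         # a bond connects two orbitals, not two atoms

def ethene_cc():
    """Ethene's C=C as one sigma (sp2-sp2) plus one pi-type (pz-pz) bond."""
    c1 = Atom("C", [Orbital("sp2") for _ in range(3)] + [Orbital("pz")])
    c2 = Atom("C", [Orbital("sp2") for _ in range(3)] + [Orbital("pz")])
    sigma = Bond(c1.orbitals[0], c2.orbitals[0])
    pi = Bond(c1.orbitals[3], c2.orbitals[3])
    return c1, c2, [sigma, pi]

c1, c2, bonds = ethene_cc()
print(len(bonds), bonds[1].a.kind)  # two orbital-level bonds; the second joins pz orbitals
```

A lone pair would simply be a single `Orbital("lone", occupied=2)` on one atom, and delocalized systems could group more than two orbitals into one bonding set.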
Line notations as unique identifiers
Krisztina Boda PhD. OpenEye Scientific Software, Santa Fe, New Mexico, United States
A wide variety of structure representation formats have been devised to encode molecular information in order to register, store and manipulate molecules in silico.
One class of these formats, called line notations, is designed to express molecules as compact, unambiguous strings that can be used as unique identifiers for compound registration, eliminating computationally more expensive graph matching.
The presentation will provide an overview of popular line notations, such as canonical SMILES, isomeric SMILES, and InChI, discussing their merits and shortcomings with regard to their use as robust, lossless unique identifiers.
We will present results of testing a variety of line notations on a diverse set of 10M compounds generated by combining organic and inorganic vendor databases. We will also examine the information loss of various molecular normalization procedures with regard to line notation generation.
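The registration use case reduces duplicate detection to a string lookup once a canonical identifier has been computed. A minimal sketch, assuming the canonical strings are produced by some external toolkit (the registry class and record strings are illustrative):

```python
class Registry:
    """Compound registry keyed on a canonical line notation: duplicate
    detection is a dictionary lookup instead of pairwise graph matching."""

    def __init__(self):
        self._by_id = {}

    def register(self, canonical_id, record):
        """Return True if the structure is new; False if already registered."""
        if canonical_id in self._by_id:
            return False
        self._by_id[canonical_id] = record
        return True

reg = Registry()
key = "InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N"   # illustrative identifier
print(reg.register(key, "ethanol, vendor A"))  # True: first registration
print(reg.register(key, "ethanol, vendor B"))  # False: duplicate detected
```

The scheme is only as reliable as the identifier: any information loss or normalization difference upstream (tautomers, salts, stereochemistry) changes which structures collide, which is exactly the trade-off the talk examines.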
Analysis of activity landscapes, activity cliffs, and selectivity cliffs
Jürgen Bajorath. Department of Life Science Informatics, University of Bonn, Germany
The concept of activity landscapes (ALs) is of fundamental importance for the exploration of structure-activity relationships (SARs). ALs are best rationalized as biological activity hypersurfaces in chemical space. When reduced to three dimensions, ALs display characteristic topologies that determine the SAR behavior of compound sets. Prominent features of ALs are activity cliffs, which are formed by structurally similar compounds having large potency differences, giving rise to SAR discontinuity. ALs and activity cliffs can be analyzed in different ways, including similarity-potency diagrams, approximate three-dimensional landscape representations, and molecular networks integrating compound similarity and potency information. Annotated similarity-based compound networks that incorporate results of numerical SAR analysis functions, termed Network-like Similarity Graphs (NSGs), are designed to explore relationships between global and local SAR features in compound data sets of any source. For collections of analogs, substitution patterns that introduce activity cliffs are identified in Combinatorial Analog Graphs (CAGs), which also make it possible to study additive and non-additive effects of compound modifications. Activity cliffs identified in CAGs can frequently be rationalized on the basis of complex crystal structures. When studying multi-target SARs within the NSG framework, the concept of activity cliffs can be extended to selectivity cliffs, i.e., similar compounds having significant differences in target selectivity.
Using Activity Cliff Information in structure-based design approaches
Birte Seebeck, Markus Wagener, Prof. Dr. Matthias Rarey. Center for Bioinformatics (ZBH), University of Hamburg, Hamburg, Germany; Molecular Design and Informatics, MSD, Oss, The Netherlands
Activity cliffs are often the pitfall of QSAR modeling techniques, but at the same time they exhibit key features of a SAR. Based on the principles of the structure-activity landscape index (SALI) [1], we present an approach that uses the valuable information in activity cliffs in a structure-based design scenario, analyzing key interactions in protein-ligand complexes involved in activity-cliff events. We visualize these interaction "hot spots" directly in the active site of the target protein. In addition, we use the activity-cliff information to derive target-specific scoring models and pharmacophore hypotheses, which are validated in enrichment experiments on independent external test sets. The results show improved enrichment compared to the standard score for various protein targets. [1] Guha, R.; Van Drie, J. H. J. Chem. Inf. Model. 2008, 48, 646-658.
Exploring activity cliffs using large scale semantic analysis of PubChem
Dr David J Wild, Bin Chen, Qian Zhu. School of Informatics and Computing, Indiana University, Bloomington, IN, United States
Identification of activity cliffs, defined via the ratio of the difference in activity of two compounds to their "distance" of separation in a given chemical space [1], has been established as important in the creation of robust quantitative structure-activity relationship models. Previously, a method for identifying and visualizing these activity cliffs, SALI, was developed at Indiana University and applied successfully to several established QSAR datasets [2]. In the work reported here, we have extended this work in two ways. First, we have used structure and activity data from the public PubChem BioAssay dataset to evaluate the method on a much larger scale; second, we have integrated it with a project called Chem2Bio2RDF to look not just for activity cliffs based on reported assay values, but also on computationally established relationships between compounds and genes and diseases. We thus propose an extended application of SALI which can be used in a systems chemical biology and chemogenomics context.
[1] J. Chem. Inf. Model. 2006, 46 (4), 1535.
[2] J. Chem. Inf. Model. 2008, 48 (3), 646-658.
Quantifying the usefulness of a model of a structure-activity relationship: The SALI Curve Integral
John H Van Drie, Rajarshi Guha. R&D, Van Drie Research LLC, Andover, MA, United States; Chemical Genomics Center, NIH, Bethesda, MD, United States
In two 2008 papers, Guha and Van Drie introduced the notion of structure-activity landscape index (SALI) curves as a way to assess a model and a modeling protocol applied to structure-activity relationships. The starting point is to study a structure-activity relationship pairwise, based on the notion of "activity cliffs": pairs of molecules that are structurally similar but have large differences in activity. The basic idea behind the "SALI curve" is to tally how many of these pairwise orderings a model is able to predict. Empirically, testing these SALI curves against a variety of models, ranging over structure-based and non-structure-based models, the utility of a model corresponds to characteristics of these curves. In particular, the integral of these curves, denoted SCI and ranging from -1.0 to 1.0, approaches 1.0 for two literature models that are both known to be prospectively useful.
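A simplified stand-in for the SALI-curve tally: count the fraction of pre-identified cliff pairs whose activity ordering a model reproduces. The published SCI is normalized differently and ranges from -1 to 1; all compound names and values here are invented for illustration:

```python
def ordering_score(pairs, observed, predicted):
    """Fraction of the given pairs whose activity ORDER the model reproduces
    (a simplified stand-in for the SALI-curve tally, not the published SCI)."""
    ok = 0
    for i, j in pairs:
        # Same sign of difference => the model ranks the pair correctly
        if (observed[i] - observed[j]) * (predicted[i] - predicted[j]) > 0:
            ok += 1
    return ok / len(pairs)

observed  = {"m1": 9.1, "m2": 6.0, "m3": 8.8}   # measured activities
predicted = {"m1": 8.5, "m2": 6.5, "m3": 8.9}   # model predictions
cliff_pairs = [("m1", "m2"), ("m3", "m2")]      # pairs flagged as cliffs beforehand

print(ordering_score(cliff_pairs, observed, predicted))  # 1.0: both orderings reproduced
```

A model that predicts the ordering of every cliff pair scores 1.0; one that inverts them all scores 0.0, mirroring how the SALI curve separates useful models from useless ones.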
Status of the InChI and InChIKey algorithms
Dr. Stephen Heller. CBRD, MS - 8320, NIST, Gaithersburg, MD, United States
The Open Source chemical structure representation standard, the IUPAC InChI/InChIKey project, has evolved considerably in the past two years. The project is now being supported and widely used by virtually all major publishers of chemical journals, databases, and structure drawing and related software. This usage of the InChI/InChIKey in their products enable them to link information between their products and other (fee-free and fee-based) chemical information available on the world wide web via the Internet
These organizations now provide a stable and financially viable structure for the project through the InChI Trust. This enables the worldwide chemistry community to expand its use of the InChI, knowing that this freely available Open Source algorithm will be widely accepted and used as a mainstream standard. The mission of the Trust is quite simple and limited; its sole purpose is to create and support, administratively and financially, a scientifically robust and comprehensive InChI algorithm and related standards and protocols.
This presentation will describe the current technical state of the InChI and InChIKey algorithms.
Self-contained sequence representation (SCSR): Bridging the gap between bioinformatics and cheminformatics
Dr Keith T Taylor, Dr William L Chen, Brad D Christie, Joe L Durant, David L Grier, Burt A Leland, Jim G Nourse. Symyx Technologies Inc, San Ramon, CA, United States
In this paper we will discuss the benefits and disadvantages of the current approaches for storing biological sequence information.
We have developed a hybrid representation that combines the compactness of the sequence with detailed chemical connectivity information for modified regions. Standard residues are represented by their full substructure, with all instances of the same residue sharing a single template. This hybrid approach is compact and scalable.
We have developed a converter that takes a UniProt-format file, extracts the sequence information, and derives the modifications, producing an SCSR record. The SCSR is encoded as a molfile and registered into a Symyx Direct database. Duplicate checking, exact matching (with and without the modifications), molecular weight calculation, and substructure searching are all available with these structures.
We are using this representation for peptides and oligonucleotides, and we are now extending it to oligosaccharides. Non-natural residues can also be included in an SCSR.
Representation of Markush structures: From molecules toward patents
Szabolcs Csepregi, Nóra Máté, Róbert Wágner, Tamás Csizmazia, Szilárd Dóránt, Erika Bíró, Tim Dudgeon, Ali Baharev, Ferenc Csizmadia. ChemAxon Ltd., Budapest, Hungary
Cheminformatics systems usually focus primarily on handling specific molecules and reactions. However, Markush structures are also indispensable in various areas, such as combinatorial library design and chemical patent applications, for the description of compound classes.
The presentation will discuss how an existing molecule drawing tool (Marvin) and chemical database engine (JChem Base/Cartridge) are extended to handle generic features (R-group definitions, atom and bond lists, link nodes and larger repeating units, position and homology variation). Markush structures can be drawn and visualized in the Marvin sketcher and viewer, registered in JChem databases and their library space is searchable without the enumeration of library members. Different enumeration methods allow the analysis of Markush structures and their enumerated libraries. These methods include full, partial and random enumerations as well as calculation of the library size. Furthermore, unique visualization techniques will be demonstrated on real-life examples that illustrate the relationship between Markush structures and the chemical structures contained in their libraries (involving substructures and enumerated structures).
Special attention will be given to file formats and how they were extended to hold generic features.
CSRML: A new markup language definition for chemical substructure representation
Dr. Christof H. Schwab, Dr. Bruno Bienfait, Dr. Johann Gasteiger, Dr. Thomas Kleinoeder, Dr. Joerg Marucszyk, Dr. Oliver Sacher, Dr. Aleksey Tarkhov, Dr. Lothar Terfloth, Dr. Chihae Yang. Molecular Networks GmbH, Erlangen, Bavaria, Germany; Altamira LLC, Columbus, Ohio, United States
Although chemical subgraphs, or substructures, are popular and have long been used in chemoinformatics, the existing, well-established standards still have some limitations. In general, these standards are suited even for complex substructure queries; however, they show some insufficiencies, e.g., for the inclusion of physicochemical properties or the annotation of meta-information. In addition, the existing standards are not fully interconvertible and specify no validation techniques to check the semantic correctness of a query definition. This paper proposes an approach for the representation of chemical subgraphs that aims to overcome the limitations of existing standards. The approach presents a well-structured, XML-based standard specification, the Chemical Subgraph Representation Markup Language (CSRML), that supports a flexible annotation mechanism for meta-information and properties at each level of a substructure, as well as user-defined extensions. Furthermore, the specification foresees a mandatory inclusion and use of test cases. In addition, it can be used as an exchange format.
Prediction of solvent physical properties using the hierarchical clustering method
Dr. Todd M Martin, Dr. Douglas M Young. National Risk Management Research Laboratory, Environmental Protection Agency, Cincinnati, OH, United States
Recently a QSAR (Quantitative Structure-Activity Relationship) method, the hierarchical clustering method, was developed to estimate acute toxicity values for large, diverse datasets. This methodology has now been applied to the estimation of solvent physical properties, including surface tension and the normal boiling point. The hierarchical clustering method divides a chemical dataset into a series of clusters containing similar compounds (in terms of their 2D molecular descriptors). Multilinear regression models are fit to each cluster. The toxicity or property is estimated using the prediction values from several different cluster models. The physical properties are estimated using 2D molecular structure only (i.e., without the use of critical constants). The hierarchical clustering methodology was able to achieve excellent predictions for the external prediction sets. A freely available software tool to estimate toxicity and physical properties has been developed, based on the open source Chemistry Development Kit (written in Java).
Scaffold diversity analysis using scaffold retrieval curves and an entropy-based measure
Jose L Medina-Franco PhD, Karina Martinez-Mayorga, Andreas Bender PhD, Thomas Scior PhD. Torrey Pines Institute for Molecular Studies, Port St. Lucie, FL, United States; Leiden University, Leiden, The Netherlands; Benemerita Universidad Autonoma de Puebla, Puebla, Mexico
Scaffold diversity analysis of compound collections has several applications in medicinal chemistry and drug discovery. Applications include, but are not limited to, library design, compound acquisition, and assessment of structure-activity relationships. Scaffold diversity is commonly measured based on frequency counts; scaffold retrieval curves are also employed. Further information can be obtained by considering the specific distribution of the molecules over those scaffolds. To this end, we present an entropy-based information metric to assess the scaffold diversity of compound databases [Medina-Franco, J. L. et al. QSAR Comb. Sci. 2009, 28, 1551]. The entropy-based metric takes into account the frequency distribution of the different scaffolds and is a complementary measure of scaffold diversity, enabling a more comprehensive analysis.
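The frequency-distribution idea can be sketched with Shannon entropy over scaffold counts. This is a generic sketch only; the published metric's exact form and normalization may differ, and the function names are ours:

```python
from collections import Counter
from math import log2

def scaffold_entropy(scaffolds):
    # Shannon entropy of a scaffold frequency distribution.
    # `scaffolds` holds one scaffold identifier (e.g. a Murcko-framework
    # SMILES) per molecule; higher entropy means the molecules are spread
    # more evenly across scaffolds.
    counts = Counter(scaffolds)
    n = len(scaffolds)
    return -sum(c / n * log2(c / n) for c in counts.values())

def scaled_entropy(scaffolds):
    # Normalize by log2(number of distinct scaffolds) so that libraries of
    # different sizes are comparable; 1.0 is a perfectly even distribution.
    k = len(set(scaffolds))
    return scaffold_entropy(scaffolds) / log2(k) if k > 1 else 0.0
```

Two libraries with the same scaffold count but different molecule-per-scaffold distributions then receive different diversity values, which is the extra information frequency counts alone miss.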
Nonsubjective clustering scheme for multiconformer databases
Dr. Austin B. Yongye, Dr. Andreas Bender, Dr. Karina Martinez-Mayorga. Torrey Pines Institute for Molecular Studies, Port St Lucie, FL, United States; Medicinal Chemistry Division and Pharma-IT Platform, Leiden/Amsterdam Center for Drug Research, Leiden University, Leiden, The Netherlands
Representing the 3D structures of ligands in virtual screening via multi-conformer ensembles can be computationally intensive, especially for compounds with a large number of rotatable bonds. While clustering and RMSD-filtering methods are employed in existing conformer generators, the novelty of this work is the inclusion of a non-subjective clustering scheme. This algorithm simultaneously optimizes the number and the average spread of the clusters. Using this method, on average ten times fewer conformers per compound were obtained, while performing as well as OMEGA. Furthermore, we propose thresholds for root-mean-square deviation filtering depending on the number of rotors in a compound: 0.8, 1.0 and 1.4 for structures with low (1-4), medium (5-9) and high (10-15) numbers of rotatable bonds, respectively. The protocol employed is general and can be applied to reduce the number of conformers in multi-conformer compound collections and to alleviate the complexity of downstream data processing in virtual screening experiments.
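The proposed thresholds amount to a simple lookup by rotor count. A minimal sketch, where the function name is illustrative and the units are assumed to be angstroms:

```python
def rmsd_threshold(n_rotatable_bonds):
    # RMSD filtering thresholds binned by rotor count, per the abstract:
    # low (1-4) -> 0.8, medium (5-9) -> 1.0, high (10-15) -> 1.4
    if not 1 <= n_rotatable_bonds <= 15:
        raise ValueError("thresholds were proposed for 1-15 rotatable bonds")
    if n_rotatable_bonds <= 4:
        return 0.8
    if n_rotatable_bonds <= 9:
        return 1.0
    return 1.4
```

Looser thresholds for floppier molecules prune more aggressively where conformer counts explode, which is the point of scaling the filter with rotor count.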
Finding drug discovery "rules of thumb" with bump hunting
Mr. Tatsunori Hashimoto, Dr. Matthew Segall PhD. Department of Statistics, Harvard University, Cambridge, MA, United States; Optibrium, Cambridge, United Kingdom
Rules-of-thumb for evaluating potential drug molecules, such as Lipinski's Rule of Five, are commonly used because they are easy to understand and translate into practice. These rules have traditionally been constructed by observation or by following simple statistical analysis. However, application of these techniques to QSAR models or early screening data often ignores the underlying statistical structure. Conversely, when machine learning algorithms are used to classify 'drug-like' molecules, they often result in black-box classifiers that cannot be modified to suit a particular target drug profile. We propose a novel hybrid approach to constructing rules-of-thumb from existing data to match a given target product profile for any therapeutic objective. These rules are easily interpretable and can be rapidly modified to reflect expert opinions before application.
Machine learning in discovery research: Polypharmacology predictions as a use case
Nikil Wale PhD, Kevin McConnell PhD, Eric M Gifford PhD. Computational Sciences Center of Emphasis, Pfizer Inc, Groton, CT, United States
In this talk I will lay out the increasing role of machine learning technology in discovery research at Pfizer. Specifically, I will discuss how algorithms and methods inspired by (machine) learning theory are playing an increasing role in in-silico predictive technologies in pharmaceutical research. These methods will be put in the context of other popular methods based on classical statistics, and their overlap and contrasts will be discussed. I will use polypharmacology predictions as an important use case to demonstrate the power of large-scale machine learning methods for such applications. In particular, prospective validation of these methods will be emphasized and discussed.
Interpretable correlation descriptors for quantitative structure-activity relationships
Prof. Jonathan D. Hirst. School of Chemistry, University of Nottingham, Nottingham, Nottinghamshire, United Kingdom
Highly predictive Topological Maximum Cross Correlation (TMACC) descriptors for the derivation of quantitative structure-activity relationships (QSARs) are presented, based on the widely used autocorrelation method. They require neither the calculation of three-dimensional conformations, nor an alignment of structures. Open source software for generating the TMACC descriptors is freely available from our website: http://comp.chem.nottingham.ac.uk/download/TMACC. We illustrate the interpretability of the TMACC descriptors, through the analysis of the QSARs of inhibitors of angiotensin converting enzyme (ACE) and dihydrofolate reductase. In the case of the ACE inhibitors, the TMACC interpretation shows features specific to C-domain inhibition, which have not been explicitly identified in previous QSAR studies.
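TMACC builds on the autocorrelation method mentioned above. As background, a minimal sketch of a generic Moreau-Broto-style topological autocorrelation vector; this illustrates the underlying autocorrelation idea only, not the TMACC cross-correlation itself, and the atom properties and adjacency list are supplied by the caller:

```python
from collections import deque

def topo_distances(adj):
    # all-pairs topological (bond-count) distances via BFS on an adjacency list
    n = len(adj)
    dist = [[-1] * n for _ in range(n)]
    for s in range(n):
        dist[s][s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if dist[s][v] < 0:
                    dist[s][v] = dist[s][u] + 1
                    q.append(v)
    return dist

def autocorrelation(adj, props, max_d):
    # A[d] = sum over atom pairs (i < j) at topological distance d
    # of props[i] * props[j]; props holds one atomic property per atom
    dist = topo_distances(adj)
    vec = [0.0] * (max_d + 1)
    for i in range(len(adj)):
        for j in range(i + 1, len(adj)):
            d = dist[i][j]
            if 0 < d <= max_d:
                vec[d] += props[i] * props[j]
    return vec
```

Because the vector depends only on the molecular graph and per-atom properties, no 3D conformation or alignment is needed, which is the property the TMACC descriptors share.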
Chemistry in your hand: Using mobile devices to access public chemistry compound data
Dr Antony J Williams PhD, Valery Tkachenko. ChemSpider, Royal Society of Chemistry, Wake Forest, North Carolina, United States
Mobile devices that allow browsing of the internet to access chemistry-related data come in many forms: phones, music players and, increasingly, “tablets” and “pads”. With the always-online connectivity of these mobile devices, the browser now being the default environment for much of our computer-based interaction, and the increasing availability of rich datasets online, these offerings mesh together to provide chemists with the capability to query and search for chemistry in ways that were the stuff of science fiction only a few years ago. Using the ChemSpider platform as a foundation, and with the intention of continuing to enable the community to access chemistry, we have delivered mobile chemistry applications to search across over 20 million compounds sourced from over 300 data sources and to retrieve data including properties, spectra, and links to patents and publications. This presentation will discuss Mobile ChemSpider and the challenges of delivering such a tool.
Feature analysis of ToxCast(TM) compounds
Patra Volarath, Stephen Little, Chihae Yang, Matt Martin, David Reif, Ann Richard. National Center for Computational Toxicology, U.S. Environmental Protection Agency, Research Triangle Park, NC, United States; Center for Food Safety and Nutrition, U.S. Food and Drug Administration, Bethesda, MD, United States
ToxCast(TM) was initiated by the US Environmental Protection Agency (EPA) to prioritize environmental chemicals for toxicity testing. Phase I generated data for 309 unique chemicals, mostly pesticide actives, that span diverse chemical feature/property space, as determined by quantum mechanical, feature-/QSAR-based, and ADME-based descriptors. Results in over 450 high-throughput screening assays were generated for the chemicals. Deriving associations across such a structurally diverse and information-rich dataset is challenging. Approaches to determine relationships between the bioassay data and chemistry-/biology-informed structural features, and methods to meaningfully represent this knowledge, are being developed. We initially focus on the Phase I data set. Successful approaches will be applied to the much larger chemical libraries in the ToxCast Phase II and Tox21 projects (the latter to screen approximately 10,000 chemicals). These approaches will be used to develop data mining methods to inform toxicity testing and risk assessment modeling. This abstract does not reflect EPA or FDA policy.
Extracting information from the IUPAC Green Book
Prof Jeremy G Frey, Mark I Borkum. School of Chemistry, University of Southampton, Southampton, Hants, United Kingdom
The IUPAC manual of Symbols and Terminology for Physicochemical Quantities and Units (the Green Book) was first published in 1969. One of the fundamental principles of the IUPAC Green Book is the reuse of existing symbols and terminology, in order to enable the accurate exchange of information and data. Accordingly, there is a need for the IUPAC Green Book to be repurposed as a machine-processable resource. This paper reports an experiment where we define a syntax for the subject index of the IUPAC Green Book in the Parsing Expression Grammar (PEG) formalism. We repurpose the resulting Abstract Syntax Tree (AST) as the primary data source for a Ruby on Rails application and Simple Knowledge Organization System (SKOS) concept scheme. We demonstrate a metric that gives prominence to the most significant terms and pages in the subject index, and reflect upon the usefulness and relevance of the information obtained.
Biologics and biosimilars: One and the same?
Roger Schenck. Chemical Abstracts Service, Columbus, OH, United States
Biopharmaceuticals (or biologics) and generic follow-on biosimilars currently account for more than 10% of the revenue in the pharmaceutical market. As patent protection for first-generation biotherapeutics begins to expire, follow-on biosimilars have begun to appear. This presentation will provide insights into how the CAS databases handle biologics and biosimilars, how these substances are treated differently in patents, and how biosimilars are viewed by different patenting authorities. What the CAS databases reveal about trends in biopharmaceutical research and development will be discussed, along with specific examples.
Intelligent mining of drug information resources
Rashmi Jain, Anay Tamhankar, Aniket Ausekar, Yuthika Dixit. Evolvus Group, Pune, India
A fundamental aspect of any research is to understand and keep track of progress made by peer groups in terms of scientific discoveries, and research conferences form a definitive source of this information. Annually, thousands of papers are presented in such conferences for any given disease vertical from a therapeutic, biological, pharmacological, or clinical perspective. At first glance, the problem of finding relevant conference proceedings of interest, and then organizing the information into a format that is easily analyzed, stored, and efficiently retrieved, seems difficult and chaotic: there are no patterns by which a process can be defined, and conference presentations are highly fragmented and non-standardized.
A hybrid approach, wherein machine-learning-based text-extraction software is coupled with assisted expert annotation by human editors, comes to the rescue. An in-house machine learning software system is used in the first stage, wherein the conference proceedings are classified based on keywords, segmented, and converted into a standardized format.
In the second stage, the software uses a proprietary, heuristic-based learning algorithm to extract relevant data from the segments. Since it is well known that no automated approach can be 100% accurate, at this step the software is assisted by a team of expert human editors who analyze the extracted and segmented data and perform any necessary corrections. In the third stage, the software pushes each segment to a team of expert human editors who analyze the segment, extract information relevant to the area of research, and store the information in our internal databases.
Cheminformatics semantic grid for neglected diseases
Paul J Kowalczyk PhD. Department of Computational Chemistry, SCYNEXIS, Durham, NC, United States
We present a summary of our progress towards establishing a cheminformatics semantic grid for neglected diseases. Our efforts are based on using public data and open-source programs to generate both descriptive and predictive models, which are themselves made publicly available. There are three modes of model access: as web services, via web portals, and as downloads. Models are saved in Predictive Model Markup Language (PMML) format. Information stored for each model includes the training set, test set, descriptors and model tuning parameters. This information is provided so that researchers may determine a model's domain, and its applicability to their data. Examples will be presented for two data sets retrieved from PubChem: enzyme inhibition of dihydroorotate dehydrogenase (AID:1175), and a cytochrome panel assay with activity outcomes (AID:1851).
Extraction and integration of chemical information from documents
Dr Hugo O Villar, Dr. Juan Betancort, Dr Mark R Hansen. Altoris, Inc., La Jolla, California, United States
Effective chemical research requires that all sources of information be incorporated into decision making. Here we introduce a tool that saves time in building chemical databases from web information or the chemical literature, including patents. We discuss some of the challenges faced in automating the identification and extraction of chemicals named in patents, and their conversion into chemical databases that can be mined effectively. The integration of external sources of data can be valuable for research informatics. To that end, we have integrated the conversion of IUPAC names with chemical optical character recognition. We show examples where such integration can provide useful competitive information.
SAR and the role of active-site waters in blood coagulating serine proteases: A thermodynamic analysis of ligand-protein binding
Dr. Noeris K Salam, Dr. Woody Sherman, Dr. Robert Abel. Schrodinger, Inc., San Diego, CA, United States; Schrodinger, Inc., New York, New York, United States
The prevention of blood coagulation is important in treating thromboembolic disorders. Several serine proteases involved in the coagulation cascade are classified as pharmaceutically relevant and are the focus of structure-based drug design campaigns. Here, we investigate the serine proteases thrombin and factors VIIa, Xa, and XIa, using a computational method called WaterMap that describes the thermodynamic properties of the water solvating the active site. We show that the displacement of key waters from specific subpockets (e.g. S1, S2, S3 and S4) of the active site by the ligand is a dominant term governing potency, providing insights into SAR cliffs observed in several compound series. Furthermore, we describe how WaterMap scoring can be supplemented with terms from an MM-GBSA calculation to improve the overall predictive capabilities.