#232 - Abstracts
ACS National Meeting
September 10-14, 2006
San Francisco, CA
PubChem: An information resource linking chemistry and biology
Evan Bolton, National Center for Biotechnology Information, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894
PubChem is a free, online public information resource from the National Center for Biotechnology Information (NCBI). The system provides information on the biological properties and activities of chemical substances, linking together results from different sources on the basis of chemical structure and/or chemical structure similarity. Following the deposition model introduced by GenBank, PubChem's content is derived from user depositions of chemical structure and bioassay data, including high-throughput biological screening data from the National Institutes of Health's Molecular Libraries initiative. PubChem's retrieval system supports searches based on chemical names and chemical structure, as well as searches based on bioassay descriptions and activity criteria. PubChem provides further information on biological properties via links to other NCBI information resources, such as the PubMed biomedical literature database and NCBI's protein 3D structure database, as well as via links to depositor web sites.
Mining chemical information from the literature: Finding the right stuff
Debra L. Banville, Global Information Science & Libraries, AstraZeneca Pharmaceuticals, 1800 Concord Pike, CRDL2, Wilmington, DE 19850
It is easier to find too many documents on a particular life science topic than to find the right information inside those documents. With the application of text data mining to biological documents, it is no surprise that researchers are starting to look at applications to mine out chemical information as well. The mining of chemical entities, both names and structures, brings with it some unique challenges. Commercial and academic efforts are beginning to address these challenges. Ultimately, text data mining applications need to focus on the marriage of biological and chemical information.
Creating a virtual large research infrastructure
René Deplanque, FIZ CHEMIE Berlin, Franklin Str. 11, 10583 Berlin, Germany
For large infrastructures with costly and complex scientific equipment, it is necessary to share resources, people and results in order to speed up the scientific discovery process and lower the total cost for each individual infrastructure. For this purpose, several large European research infrastructures combined their major tools, such as networks and research activities, via a transnational access system. During the first 18 months of this project, 280 scientists from 21 European countries worked in 18 research institutions for 1600 experimental days to build the largest distributed superstructure for ultrafast laser optics.
The necessary structural adaptations, infrastructure tools, and publication and coordination networks will be presented.
SemanticEye: A semantic Web application to rationalise and enhance chemical electronic publishing
Omer Casher, Information Architecture and Engineering, GlaxoSmithKline, New Frontiers Science Park, Third Avenue, CM19 5AW, Harlow, United Kingdom and Henry S. Rzepa, Department of Chemistry, Imperial College London, Exhibition Road, London, SW7 2AZ, United Kingdom.
Despite the revolution in information access caused by the uptake of the Web, scientific electronic publishing has yet to realise its full potential compared to other media, such as digital music management and online retailing. The latter two have been relying on specialised semantic models to provide highly targeted product access to users. Such a model, which hitherto has been lacking in scientific electronic publishing, is the goal of SemanticEye. SemanticEye is a Semantic Web (a “Web 2.0”) application which applies the digital music management metaphor to electronic chemical journal articles. Here, key metadata objects are embedded in the PDF representation of articles as XMP, an RDF vocabulary published by Adobe. A workflow for the automated extraction of this metadata from PDF and its management in an RDF repository (Sesame) is described. By including unique identifiers within the XMP, such as the DOI and InChI identifiers for molecular structures, document associations can be mapped and the documents themselves resolved with the help of Web agents. The applicability of SemanticEye to other domains, such as medical imaging, will be discussed.
CombeChem - Semantic support for the chemical information lifecycle
Jeremy G Frey, School of Chemistry, University of Southampton, Southampton, SO17 1BJ, United Kingdom
“CombeChem” provided experience of e-science semantic support for the chemical data lifecycle, from inception in the laboratory to dissemination of data, showing how laboratory data should be recorded using electronic laboratory notebooks and enriched with appropriate metadata, to ensure that the information can be correctly understood when subsequently accessed (“Annotation@Source”). Chemical information results from a chain of analysis and data integration. Current chemical data storage methodologies place restrictions on the use of this data; the absence of sufficient high-quality metadata, particularly in a computer-readable form, prevents automated access to the data without significant human intervention. The Semantic Web approach enhances the data by making use of unique identifiers and relationships described with RDF. This informs new routes to dissemination, with data and ideas being treated by parallel but linked methods: Grid-style access to information spread across several administrative domains, from individual laboratories to national repositories, the concept of “Publication@Source”.
Technical and social aspects of collaboratively developed information systems
Christoph Steinbeck, Research Group for Molecular Informatics, Cologne University Bioinformatics Center (CUBIC), Zuelpicher Str. 47, D-50674 Cologne, Germany
Recent years have seen the emergence of technologies allowing communities to collaboratively develop open information systems. A prominent example is Wikipedia, a general, open encyclopedia now rivaling even long-established commercial products. This talk will highlight what chemistry can learn and adopt from these new developments, exemplify the technologies available, and show how they might be adapted to the specific situation of chemical information systems.
Growth of e-Chemistry
Peter Murray-Rust, Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, CB2 1EW Cambridge, United Kingdom
Chemistry has been slow to adopt the cyber-revolution but there are now signs that the technology and interest are set to grow rapidly. Major driving forces include the demand for online data and the realisation by many scientists and publishers that they can and will start to create and disseminate semantically enhanced documents. This is facilitated by a variety of tools which make it easier to capture or translate current sources of chemistry and provide repositories of semantically enriched data, mainly through markup languages. The first sets of such datuments (combined documents and data) are likely to be public repositories of scientific data (e.g. biological testing or thermochemistry), peer-reviewed articles marked up by the publishers, and translations of computational chemistry log files to CML/XML. Success here will encourage the next wave of the capture of experimental data such as from analytical instruments. The presentation will highlight some of the early adopter projects in creating semantic chemistry. This includes the use of machines for automatic filtering and analysis of publications - the "journal-eating robot" is now a reality.
Bayesian modeling in Pipeline Pilot™: Application to structural analysis of CDK2 inhibitors
Shikha Varma-O’Brien, Accelrys, Inc, 10188 Telesis Court, Suite 100, San Diego, CA 92121 and David Rogers, SciTegic, Inc, 9665 Chesapeake Dr, Suite 401, San Diego, CA 92123.
Laplacian-modified Bayesian modeling as implemented in Pipeline Pilot™ is ideally suited to the rapid analysis of data with a view to library development and compound prioritization via virtual screening. The modeling process identifies molecular features that are associated with compound activity (or inactivity). The proprietary functional class and extended connectivity fingerprints (FCFP and ECFP, respectively) characterize each molecule as a combination of 2D chemical features. These fingerprints are extremely fast to calculate and can represent a very large number of different features.
This work describes the application of Bayesian modeling to 17,550 compounds and their corresponding cyclin-dependent kinase-2 (CDK2) activities. The model distinguishes good CDK2 inhibitors (actives) from bad ones (inactives) by using FCFPs. These fingerprints enable us to recover chemical scaffolds and substructures that are intrinsically associated with CDK2 activity. A receiver operating characteristic (ROC) plot with an area under the curve (AUC) of 0.83 reveals the significant enrichment obtained using this virtual screening methodology: 17% of active compounds are identified by screening just 1% of the database.
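As a rough illustration of how a Laplacian-corrected Bayesian classifier scores a molecule from binary fingerprint features, consider the sketch below. The feature sets and counts are invented stand-ins for FCFP bits, and this is a generic rendition of the technique, not the Pipeline Pilot implementation.

```python
import math

def laplacian_bayes_score(actives, inactives, query):
    """Score a query feature set with a Laplacian-corrected naive Bayes
    classifier; positive scores suggest activity."""
    p_active = len(actives) / (len(actives) + len(inactives))  # baseline hit rate
    score = 0.0
    for f in query:
        a = sum(f in fp for fp in actives)        # actives containing feature f
        t = a + sum(f in fp for fp in inactives)  # all samples containing f
        if t == 0:
            continue                              # feature never seen in training
        # the Laplacian correction shrinks rarely-seen features toward the baseline
        p_corr = (a + p_active) / (t + 1.0)
        score += math.log(p_corr / p_active)
    return score

# invented feature sets standing in for fingerprint bits
actives = [{1, 2, 3}, {1, 2, 4}]
inactives = [{5, 6}, {2, 5}]
print(laplacian_bayes_score(actives, inactives, {1, 2}))  # positive (active-like)
print(laplacian_bayes_score(actives, inactives, {5, 6}))  # negative (inactive-like)
```

Because each feature contributes an independent log-term, the score also decomposes into per-feature contributions, which is what allows the scaffolds driving activity to be recovered.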
Virtual Ligand Screening with eHiTS
Zsolt Zsoldos1, Darryl Reid1, Aniko Simon1, Bashir S. Sadjad1, and A Peter Johnson2. (1) SimBioSys Inc, 135 Queen's Plate Dr, Suite 520, Toronto, ON M9W 6V1, Canada, (2) School of Chemistry, University of Leeds, Leeds, LS2 9JT, United Kingdom
Virtual Ligand Screening (VLS) has become an integral part of the drug design process for many pharmaceutical companies. In protein-structure-based VLS the aim is to find a ligand that has a high binding affinity to a target receptor whose 3D structure is known. Ligand similarity searches also provide a very powerful method of quickly screening large databases of ligands to identify possible hits. This presentation will describe the docking tool eHiTS and its seamless integration with a new ligand-based pre-screening filter tool, eHiTS_Filter. eHiTS_Filter uses 23 surface point types (chemical property identifiers) to create a feature vector of active and presumed inactive ligands. The filter is trained to recognize active ligands and can then be used to screen large databases of ligands extremely rapidly (5-7 ligands per second per CPU). eHiTS_Filter has been integrated into eHiTS to allow docking poses to be generated for the top N% of the database as ranked by eHiTS_Filter. Enrichment results obtained over a wide range of receptor families consistently show that eHiTS_Filter is able to recover ~80% of the actives in the top 10% of a screened database.
For more information see http://www.simbiosys.ca/ehits/
Advanced HTS data mining using web service workflows
David Wild1, Gary Wiggins1, Xiao Dong1, Huijun Wang1, Marlon Pierce2, Jang H Lee3, and Geoffrey Fox3. (1) School of Informatics, Indiana University, Bloomington, IN 47408, (2) School of Informatics, Bloomington, IN 47408, (3) Indiana University School of Informatics, Bloomington, IN 47408
We will discuss the application of web services and workflows of these services in advanced data mining of HTS data. We have worked with HTS data from a number of sources, including the NIH DTP human tumor cell lines and MLSCN data from PubChem. Our workflows are able to find relationships and correlations of the HTS data using a wide variety of computational techniques, including docking, machine learning, similarity searching, and inclusion of genomic data. We are able to use models built from these analyses and workflows to provide virtual HTS environments within an information system targeted to particular disease areas.
Similarity searching in large virtual chemistry spaces derived from synthetically accessible combinatorial libraries
Markus Boehm, Gregory A. Bakken, and Alan M. Mathiowetz. Computational Chemistry, Pfizer Global Research and Development, Eastern Point Road, Groton, CT 06340
Fast molecular similarity searching is widely established as part of the drug discovery process. At the same time, large collections of validated high-speed synthetic protocols are an integral element of today's pharmaceutical industry. It would be of great interest to perform similarity searches against a database of all virtual compounds that are synthetically accessible by any such combinatorial library protocol. However, the number of possible compounds easily exceeds, by many orders of magnitude, the number of compounds that can be stored and searched by conventional searching methods. We have developed a software tool that converts large numbers of combinatorial libraries into an enclosed "virtual chemistry space". Feature Trees Fragment Spaces are capable of searching those libraries without ever enumerating all possible molecular structures. The result of the similarity search is a set of compounds which are synthetically accessible by one or more of the existing synthetic protocols. Such output can provide library design ideas for tasks like hit follow-up from high-throughput screening or lead hopping from one compound series to another attractive series.
Increasing the speed and accuracy of the virtual screening process
Pascal Bonnet1, Eric Arnoult2, and Christophe Meyer2. (1) Molecular Informatics, Johnson & Johnson PRD, A division of Janssen-Pharmaceutica, Turnhoutseweg 30, 2340 Beerse, Belgium, (2) Molecular Informatics, Janssen-Cilag S.A, Campus de Maigremont, 27106 Val de Reuil, France
Virtual screening in various guises is employed by most pharmaceutical companies to investigate large collections of compounds for the identification of possible hits against biological targets. Many computational tools are available, but speed and quality are a major concern. We describe here a sequential step process, implemented on a GRID-computing platform, for large-scale database mining using intensive scoring computations to enable the identification of the best in silico candidates for biological testing. A first, optional filter is applied to a large database of putative compounds to remove non-drug-like compounds. This is followed by a 3D-pharmacophore search. The remaining compounds are docked into a known protein binding site, and an accurate scoring function combining a molecular dynamics/continuum solvent potential is used. Three case studies with high enrichment factors are presented.
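The sequential filtering idea can be sketched as a simple pipeline. The compound records, property names, and thresholds below are invented stand-ins for the drug-likeness filter, pharmacophore search, and docking score stages.

```python
# Hypothetical candidate records; "dock" stands in for a docking score
# (more negative = better) that would really be computed, not stored.
candidates = [
    {"id": "A", "mw": 320, "logp": 2.1, "pharm_match": True,  "dock": -9.2},
    {"id": "B", "mw": 650, "logp": 6.3, "pharm_match": True,  "dock": -11.0},
    {"id": "C", "mw": 410, "logp": 3.8, "pharm_match": False, "dock": -8.1},
    {"id": "D", "mw": 290, "logp": 1.5, "pharm_match": True,  "dock": -7.4},
]

def druglike(c):
    # step 1: optional drug-likeness filter (illustrative thresholds)
    return c["mw"] <= 500 and c["logp"] <= 5

def pharmacophore(c):
    # step 2: 3D-pharmacophore search, reduced here to a stored flag
    return c["pharm_match"]

survivors = [c for c in candidates if druglike(c) and pharmacophore(c)]
# step 3: rank the survivors by the (mock) docking score
ranked = sorted(survivors, key=lambda c: c["dock"])
print([c["id"] for c in ranked])  # ['A', 'D']
```

The point of the sequence is economy: each stage is more expensive than the last, so the costly scoring is only ever applied to the small set that survives the cheap filters.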
CHEMGAROO: A chemistry educational system
Jost T. Bohlen, FIZ CHEMIE Berlin, Franklinstrasse 11, D-10587 Berlin, Germany
CHEMGAROO is the umbrella brand name for all educational offerings of FIZ CHEMIE. It is based on ChemgaPedia, the interactive encyclopedia for chemistry education, comprising about 15,000 pages on all topics of chemistry and related sciences. A multitude of elaborately designed onscreen displays, video footage of experiments, and complex 3D animations of camera rides through chemical structures invite users to learn in a playful manner. With 19,000 graphics, 3,000 animations, 2,300 3D molecules, 650 video and 600 audio files, it is one of the largest multimedia collections of its kind in the world. All content is grouped in chapters, fully linked, and complemented with approx. 900 exercises and 2,500 glossary and biographical entries in order to ensure a motivating and explorative approach.
Guaranteed high-interest withdrawals: Creating a dynamic and usable chemical information instruction digital depository
F. Bartow Culp, Mellon Library of Chemistry, Purdue University, 504 West State Street, West Lafayette, IN 47907-2058, Grace A. Baysinger, Swain Library of Chemistry & Chemical Engineering, Stanford University, Organic Chemistry Building, 364 Lomita Drive, Stanford, CA 94305-5080, and Susan Cardinal, Carlson Library, University of Rochester, Rochester, NY 14627.
While most chemical educators and managers agree that information skills are important components of their students' and employees' success, they typically do not know how to impart these skills. A recent survey has shown that most who teach chemical information are self-taught – a manifestly insufficient preparation for today's complex and changing environment. The creation of a useful and adaptable digital depository of chemical instructional materials is a high priority goal of the CIC-CINF Working Committee. This presentation will outline our progress towards that goal.
New information services in Germany: GetInfo and vascoda
Irina Sens, German National Library of Science and Technology (TIB), Welfengarten 1B, 30169 Hannover, Germany
GetInfo is the information portal for scientific and technical information of the German National Library of Science and Technology and the subject information centers, FIZ CHEMIE, FIZ Karlsruhe and FIZ Technik. GetInfo offers an integrated information infrastructure where content and service converge to make resources readily accessible, openly available, useful, and usable.
Vascoda is an interdisciplinary portal for scientific and scholarly information in Germany. Vascoda unites the internet services of numerous high-performance academic libraries and information institutions. Vascoda is a strategic alliance between the Virtual Libraries, Information Networks and the Electronic Journal Library.
Impact of cyberinfrastructure on large research libraries
Grace Baysinger, Swain Library of Chemistry and Chemical Engineering, Stanford University Libraries, 364 Lomita Drive, Organic Chemistry Building, Stanford, CA 94305-5081
Mass digitization and ubiquitous computing are revolutionizing access to information. A rich array of resources and tools is now available to researchers and students at their desktop. This talk will cover how technologies will shape major research libraries of the future, how collections are changing, and how user behavior and needs are shifting. It will also cover what roles and services libraries will offer, as well as what skills will be needed by library staff.
New tools for virtual high-throughput screening
N. Sukumar, Curt M. Breneman, and C. Matthew Sundling. Department of Chemistry and Center for Biotechnology and Interdisciplinary Studies, Rensselaer Polytechnic Institute, 110 8th St., Troy, NY 12180-3590
Several new software tools for virtual high-throughput screening, such as PROTEIN RECON, DIXEL and PPEST, have recently been developed at the NIH-funded Rensselaer Exploratory Center for Cheminformatics Research (RECCR) and are being made available online (http://reccr.chem.rpi.edu). These tools employ electron-density-derived surface property maps to screen protein-DNA and protein-ligand complexes for significant interactions. The Center welcomes the participation of domain specialists and application scientists as both data generators and end-users of the molecular property models. The application of these new techniques, which include new families of descriptors and methods of encoding and analysis, will be highlighted with examples from the screening of small molecules from online catalogs and structures from the Protein Data Bank, and the chemical insight gained from such analysis will be discussed.
Consensus descriptor-subset selection for ensemble QSAR models
Debojyoti Dutta1, Rajarshi Guha2, Peter C. Jurs2, and Ting Chen1. (1) Department of Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089, (2) Department of Chemistry, Pennsylvania State University, 104 Chemistry Building, University Park, State College, PA 16802
Selecting a small subset of descriptors from a large pool to build a good QSAR model is a hard problem. Even heuristics typically aim to find a subset that leads to a good model for a single model type. Ensemble QSAR models combine predictions of multiple instances of different model types. Traditionally, descriptor selection for ensemble models has consisted of performing feature selection for the individual models, leading to a set of features that are specific to the model type. However, for more interpretable QSAR models, it is advantageous to have a single consistent set of features that can be used for different model types.
In this work, we select a single optimal subset of descriptors for multiple model types by jointly optimizing the prediction accuracy of the model types using a genetic algorithm and linear combination functions. We apply this approach to both regression and classification problems. In particular, for two datasets, using an ensemble of a linear model and a neural network, we show that the predictive ability decreases by only 1.14% and 2.3%, respectively.
This work is a first step in consensus descriptor modeling, and we are not aware of any other work in this area. Several directions are currently being pursued to improve the approach, including designing better scoring functions, exploring alternative optimization techniques, and developing novel ways to combine predictions.
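The core idea, scoring one shared descriptor subset by a linear combination of the accuracies of several model types, can be sketched on toy data. Here two simple stand-in models (nearest-centroid and leave-one-out 1-NN) replace the paper's linear/neural-network ensemble, and an exhaustive scan over small subsets replaces the genetic algorithm; all data are invented.

```python
from itertools import combinations

# Invented descriptor matrix (rows = compounds) and activity classes;
# descriptor 0 separates the classes perfectly, the others are noisier.
X = [(0, 1, 0, 0), (0, 0, 0, 1), (0, 1, 1, 0),
     (1, 0, 1, 1), (1, 1, 1, 0), (1, 0, 0, 1)]
y = [0, 0, 0, 1, 1, 1]

def dist(a, b, cols):
    return sum((a[c] - b[c]) ** 2 for c in cols)

def centroid_acc(cols):
    """Model type 1: nearest-centroid classifier on the chosen descriptors."""
    cent = {cls: [sum(X[i][c] for i in range(len(X)) if y[i] == cls) / 3.0
                  for c in range(4)] for cls in (0, 1)}
    hits = sum((dist(X[i], cent[1], cols) < dist(X[i], cent[0], cols)) == (y[i] == 1)
               for i in range(len(X)))
    return hits / len(X)

def knn_acc(cols):
    """Model type 2: leave-one-out 1-nearest-neighbour classifier."""
    hits = 0
    for i in range(len(X)):
        j = min((k for k in range(len(X)) if k != i),
                key=lambda k: dist(X[i], X[k], cols))
        hits += y[j] == y[i]
    return hits / len(X)

def fitness(cols):
    # joint objective: equal-weight linear combination of both accuracies
    return 0.5 * centroid_acc(cols) + 0.5 * knn_acc(cols)

subsets = [s for n in (1, 2) for s in combinations(range(4), n)]
best = max(subsets, key=fitness)
print(best)  # (0,) - the informative descriptor works for both model types
```

A genetic algorithm would explore the same fitness landscape by mutating and recombining subsets instead of enumerating them, which is what makes the approach scale to large descriptor pools.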
Neighbor voting: A method to improve confidence in docking poses
Santosh Putta, Paul Beroza, Komath Damodaran, and Thomas Macke. Computational Sciences, Telik Inc, 3165 Porter Drive, Palo Alto, CA 94304
Predicting the binding mode for a lead compound plays a key role during drug lead optimization. Compounds are often proposed for chemical synthesis based on potential new interactions suggested by the docked pose of the original lead compound. However, if a lead is small and displays only moderate inhibition, several alternative docked poses may be reasonable, and it becomes difficult to determine which pose to use to guide lead optimization. Resolving these predicted binding modes is often done visually by rationalizing the SAR for related compounds. Here we present a computational approach to learn the correct binding mode by using activity data for closely related compounds. Each compound for which activity data is available votes for the binding mode that best explains its activity. We will illustrate this method and present results on public data sets.
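A minimal sketch of the voting idea follows, with invented measured activities and per-pose activity predictions; in practice the per-pose predictions would come from scoring each related compound in each candidate binding mode.

```python
from collections import Counter

# Hypothetical related compounds: a measured activity and a predicted
# activity under each candidate binding mode of the lead.
neighbors = [
    {"measured": 7.2, "predicted": {"poseA": 7.0, "poseB": 5.1, "poseC": 6.0}},
    {"measured": 6.5, "predicted": {"poseA": 6.4, "poseB": 4.9, "poseC": 7.9}},
    {"measured": 5.8, "predicted": {"poseA": 5.5, "poseB": 7.7, "poseC": 4.0}},
]

def vote(compound):
    """Each compound votes for the pose whose prediction best matches
    its measured activity."""
    return min(compound["predicted"],
               key=lambda pose: abs(compound["predicted"][pose]
                                    - compound["measured"]))

tally = Counter(vote(c) for c in neighbors)
consensus_pose = tally.most_common(1)[0][0]
print(consensus_pose)  # poseA
```

The appeal of the scheme is that no single compound has to be decisive: a pose only wins if it explains the SAR of the series as a whole.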
Using physicochemical parameter models to screen for bioavailability and drug plasma concentration
Alanas Petrauskas, Pranas Japertas, and Remigijus Didziapetris. Pharma Algorithms Inc, 591 Indian Rd., Toronto, ON M6P 2C4, Canada
This study presents a simulation of drug plasma concentration (Cp) versus time, dose and the physicochemical parameters of drugs. Dissolution, solubility in the gastrointestinal tract, passive absorption, the first-pass effect in the liver and gut, volume of distribution and total body clearance were considered. The simulation leads to an estimation of the dependency of oral bioavailability (%F) on dose and physicochemical parameters (ionization constant (pKa) and hydrophobicity (logP)). Simulations can be useful in lead optimization and selection, as they allow modeling of changes in the pharmacokinetic parameters (%F, Cp, Cmax, AUC0-t and others) of drugs by changing the main physicochemical parameters, pKa and logP. The model was validated by analyzing published Cp-time curves and predicting oral bioavailability (%F) values for a number of drugs. The resulting algorithm allows batch screening of compounds and close analogs, revealing how the PK properties of compounds could be changed in the optimization stage, and can be employed in scoring and ranking of potential hits.
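As a simplified illustration of this kind of Cp-time simulation, the sketch below uses a textbook one-compartment oral-absorption model with hypothetical parameters; the model described above additionally treats dissolution, gut-wall and hepatic first-pass effects, and clearance in far more detail.

```python
import math

def cp(t, dose=100.0, F=0.6, ka=1.2, ke=0.15, V=50.0):
    """Plasma concentration (mg/L) at time t (h) for a one-compartment
    oral-absorption model; all parameter values are hypothetical.
    F = bioavailable fraction, ka/ke = absorption/elimination rate
    constants (1/h), V = volume of distribution (L)."""
    return (F * dose * ka) / (V * (ka - ke)) * (
        math.exp(-ke * t) - math.exp(-ka * t))

ts = [i * 0.1 for i in range(241)]                 # sample 0-24 h
curve = [cp(t) for t in ts]
cmax = max(curve)                                  # peak concentration
tmax = ts[curve.index(cmax)]                       # time of the peak
# AUC(0-24h) by the trapezoid rule; analytically close to F*dose/(V*ke)
auc = sum((curve[i] + curve[i + 1]) * 0.05 for i in range(len(curve) - 1))
print(round(cmax, 2), round(tmax, 1))  # Cmax near 0.89 mg/L around t = 2 h
```

Re-running such a simulation while varying the pKa- and logP-dependent inputs (here, F, ka and V) is what lets Cmax, AUC and %F be traced back to the underlying physicochemical parameters.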
Probabilistic pharmacophore matching
Boris Klebansky, BioPredict, Inc, 4 Adele Avenue, Demarest, NJ 07627
Many problems in computational chemistry can be reduced to finding a match of chemical features in 3D. We describe a Markov Random Field (MRF) computational approach to pharmacophore search and to pharmacophore-driven docking. Pharmacophore matching is cast as matching a pharmacophore graph to the pharmacophore graph of a target ligand or protein active site. This weighted graph-matching problem is expressed as an MRF model, whose solution minimizes its associated free energy function. The resulting solution is a maximum a posteriori (MAP) probability distribution that statistically describes the ensemble of optimal (in the MAP sense) matches. Individual low-energy placements of the molecule are obtained by marginalizing this distribution. MRFs can be combined, allowing simultaneous probabilistic matching into multiple models of targeted ligand conformations or protein active sites, accounting for ligand and protein flexibility. To bias pharmacophore matches, MRFs can incorporate prior knowledge as probabilistic beliefs. The derivation of consensus pharmacophores for a set of ligands is also described.
Using physicochemical parameter models to screen for bioavailability and drug plasma concentration
Kristian Birchall1, Gavin Harper2, Valerie J. Gillet1, and Stephen Pickett2. (1) Department of Information Studies, University of Sheffield, Regent Court, 211 Portobello St., Sheffield, S1 4DP, United Kingdom, (2) Computational Chemistry and Informatics, GlaxoSmithKline, Medicines Research Centre, Gunnels Wood Road, Stevenage, SG1 2NY, United Kingdom
Reduced Graphs (RGs) summarise a chemical structure by grouping atoms into nodes based on properties likely to be important for bioactivity (H-bond donors, aromatic rings, etc.). RG queries can be represented using SMARTS notation, where RG nodes replace the normal atoms. A genetic program (GP) has been used to derive SMARTS-type RG queries for identifying SARs. The GP is provided with a list of RG node types and SMARTS syntactic constraints and is trained on a labelled mixture of active and inactive RGs for a specific activity class. From this it evolves queries that maximise the precision and recall of the actives. Their predictive power is then validated on datasets not used in deriving the queries. Furthermore, the queries' inherent SAR information reveals the key features that identify a particular activity class, and which RG nodes are functionally similar in terms of their importance for bioactivity.
Structure Searching Concepts at Chemical Abstracts Service: Past, Present, and Future (?)
W. Fisanick, Research, Chemical Abstracts Service, 2540 Olentangy River Road, P. O. Box 3012, Columbus, OH 43210, Karen Lucas, Online Services, Chemical Abstracts Service, 2540 Olentangy River Road, P. O. Box 3012, Columbus, OH 43210, and Tommy Ebe, Database Quality Engineering, Chemical Abstracts Service, 2540 Olentangy River Rd, Columbus, OH 43202-1505
Since the advent of the Chemical Registry in 1965, Chemical Abstracts Service (CAS) has developed and used a variety of approaches and techniques relative to the searching of structures for chemical substances. Included are capabilities for exact, substructure, structure similarity, and Markush structure searching. Important components of these search capabilities are the families of structural search fragments and the system architecture. In addition, CAS continues to explore new and modified structure search techniques. The focus of this paper will be on the key concepts and algorithms for structure searching developed at CAS.
Weaving tools and techniques to create a tapestry of substance-oriented information from CAS
Linda S. Toler and Kathleen D. Schmidt. Chemical Abstracts Service, Columbus, OH 43210
The CAS databases contain the largest collection of substance information from the world's chemistry-related literature and patents. With over 28 million records for organic, inorganic, and polymeric substances in the CAS REGISTRY℠, over 670,000 Markush records in the MARPAT® database, and over 2,700 controlled terms in CAplus℠ describing various substance classes, the CAS databases present quite a challenge for searchers trying to balance precision and recall in their substance search queries. This talk will focus on how the various tools and techniques that CAS has developed for working with its data collections can be woven together to make the most of the content of the CAS databases. It will focus on techniques that are especially useful for dealing with very broadly or imprecisely defined substance questions.
Combining text and structural search in the chemical literature
John G. Cleary and Nicholas T. Goncharoff. Reel Two, Innovation Park, Ruakura Rd, Hamilton, New Zealand
There are many ways that the chemical literature can be searched. The simplest is to supply a single keyword or substructure and retrieve all the documents or structures that match. This, of course, has many problems. It can be difficult to select a single word that both returns all the hits you want (because of the use of synonyms) and eliminates the many false hits (because of ambiguous usage). The traditional solution has been the use of Boolean search terms to both extend and refine the search.
An adjunct to this is a text similarity search where all documents or structures that are “similar” to some given document or structure are retrieved. An effective way to extend and refine these similarity searches is to do the search against a whole set of documents or structures. An example of this is searching for those patents similar to a whole portfolio of patents.
This paper examines ways that similarity searching can be done in an environment where there are very large numbers of documents (tens of millions of patents) containing many structures (millions of chemical structures). It also considers how both document similarity and structural similarity can be combined in a single similarity search. Such a search would start with a portfolio of patents and then find those patents which both use the same language in the text and which reference the same structures. Such searches can then be iteratively refined by including or excluding any of the items in the search: documents, words, phrases, structures, or fragments; and then repeating the similarity search.
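One simple way to combine document and structural similarity in a single score is sketched below, using a bag-of-words cosine for text and a Tanimoto coefficient over fragment sets; the documents, words, and fragment identifiers are invented, and a production system would use far richer representations.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def tanimoto(a, b):
    """Tanimoto similarity between two sets of structure fragments."""
    return len(a & b) / len(a | b) if a | b else 0.0

def combined(doc, query, w_text=0.5):
    # weighted blend of textual and structural similarity
    text_sim = cosine(Counter(doc["text"].split()),
                      Counter(query["text"].split()))
    struct_sim = tanimoto(doc["frags"], query["frags"])
    return w_text * text_sim + (1 - w_text) * struct_sim

query = {"text": "kinase inhibitor synthesis", "frags": {"f1", "f2", "f3"}}
docs = [
    {"id": 1, "text": "kinase inhibitor assay", "frags": {"f1", "f2"}},
    {"id": 2, "text": "polymer coating process", "frags": {"f7", "f8"}},
]
ranked = sorted(docs, key=lambda d: combined(d, query), reverse=True)
print([d["id"] for d in ranked])  # [1, 2]
```

Iterative refinement then amounts to editing the query's word counts and fragment set (adding or removing items) and re-running the same ranking.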
Compound Selection and Library Analysis at Bayer HealthCare AG
Jens Schamberger, PH-GDD-EURC-CR MC V, Bayer HealthCare AG, Elberfeld, Germany
Selection of compounds from libraries is the founding step of successful screening and testing. Despite the numerous applications available to rationalize and facilitate this process, selecting molecules from very large databases, or comparing huge amounts of molecular data, still proves challenging. At the Pharma Division of Bayer HealthCare AG, InforSense KDE computational workflow technology is used to integrate and analyze structural data. Using a relational Oracle database together with Auspyx data cartridge technology, structure comparison and selection can be carried out on standard molecular descriptors and properties, as well as on in-house applications and data. Special care was taken to consider increasingly important ADMET parameters by integrating BHC's recently developed ADMET-Traffic-Light scheme into the workflow. Chemically intuitive grouping and selection of compounds is achieved by in-house topological descriptors called BayTrees. In addition to fully automatic selection, the visualization software Spotfire allows users to view data and select compounds interactively. Selected compounds and accompanying data are reported using BHC's Pharmacophore Information System PiX, which furthermore provides extended methods of data analysis. Examples illustrating this workflow will be presented.
Substructure searching of Markush structures and its uses
John M. Barnard1, Anthony P. F. Cook1, Geoff M. Downs1, Annette von Scholley-Pfab1, Daniel G. Thomas1, P. Matthew Wright1, Jimmy Chung2, Gavin Harper2, and Stephen Pickett2. (1) Digital Chemistry Ltd, The Iron Shed, Harewood House Estate, Harewood, Leeds, LS17 9LF, United Kingdom, (2) Computational Chemistry and Informatics, GlaxoSmithKline, Medicines Research Centre, Gunnels Wood Road, Stevenage, SG1 2NY, United Kingdom
Direct substructure searching of Markush structure representations of large virtual combinatorial libraries offers substantial speed advantages over searching the enumerated molecules, since each individual building block needs only to be examined once. Markush substructure searching has several potential applications in the management of collections of large virtual libraries, and has been implemented in the context of an Oracle RDBMS data cartridge. The libraries can be stored as Markush structures (or as pools of precursor molecules with generic reactions) rather than as enumerated products. The presence or absence (in the product set) of substructures or complete molecules of interest can be rapidly identified, and just the relevant molecules enumerated if required. Physicochemical and ADME properties based on additive contributions from predefined sets of substructures can be rapidly calculated. Work on a recent collaborative project will also be described, in which Markush substructure search techniques are used for rapid generation of the "reduced feature graphs" (Harper et al., J. Chem. Inf. Comput. Sci., 2004, 44, 2145-2156) present in a large virtual library, and for identification of those that contain specified patterns of feature nodes, along with the corresponding individual molecules.
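The speed argument, that each building block need only be examined once rather than once per product, can be illustrated with a toy one-position Markush library in which substring matching stands in for real substructure search. A real implementation must also handle matches that span the attachment points, which this sketch deliberately ignores.

```python
# Toy Markush library: a scaffold with one R position and its building
# blocks; the strings are SMILES-like but matching is plain substring.
scaffold = "c1ccccc1C(=O)N[R1]"
blocks = {"[R1]": ["CCl", "CCO", "CBr"]}

def enumerated_hits(query):
    """Brute force: enumerate every product, then search each one."""
    products = [scaffold.replace("[R1]", b) for b in blocks["[R1]"]]
    return {p for p in products if query in p}

def markush_hits(query):
    """Markush-style: examine the scaffold and each building block once."""
    if query in scaffold.replace("[R1]", ""):
        # the match lies in the constant part, so every product hits
        return {scaffold.replace("[R1]", b) for b in blocks["[R1]"]}
    # otherwise only products whose building block contains the query hit
    return {scaffold.replace("[R1]", b)
            for b in blocks["[R1]"] if query in b}

print(markush_hits("Cl") == enumerated_hits("Cl"))  # True
```

With k substituent positions of n blocks each, the enumerated search touches n**k products while the Markush-style search touches only k*n blocks plus the scaffold, which is the source of the speed advantage claimed above.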
A new look at the Merged Markush Service
Joseph M. Terlizzi, Questel-Orbit, Inc, 81 Pierrepont St., Brooklyn, NY 11201
The Merged Markush Service (MMS) has been available exclusively on Questel•Orbit since 1987. Designed for graphical searching of generic and specific chemical structures indexed in the Derwent World Patents Index (DWPI) and the INPI Pharm databases, MMS has recently been enhanced with new capabilities. Structures can now be monitored as alerts (SDIs) on a weekly or monthly basis, and compound number answer sets can be parsed to their various DWPI and INPI formats. This presentation will review all of the new enhancements and take a look at some underused functions, such as file segmentation, roles, and subset searching, and other useful methods for processing and organizing answer sets. Techniques for integrating and transferring results to other Questel•Orbit patent databases, such as FamPat, will also be explored.
Towards Markush: First steps to turn a chemistry database engine into a Markush device
Szabolcs Csepregi, Nóra Máté, Szilárd Dóránt, Andras Volford, György Pirok, and Ferenc Csizmadia. ChemAxon Ltd, Maramaros koz 3/a, 1037 Budapest, Hungary
We will present how an existing chemical database tool (JChem Base/Cartridge) has been extended to handle generic Markush features (R-group definitions, atom and bond lists, and link nodes) in chemical structures registered in the database.
A modified Ullmann algorithm performs structural search in Markush databases without explicit enumeration of library members. This algorithm exploits a supergraph representation of the Markush library and takes care of generic feature groups and atom exclusions within the supergraph. For structural search pre-screening, a modified chemical hashed fingerprint is used that is also calculated from the supergraph representation.
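The fingerprint pre-screening step mentioned above can be illustrated with a toy hashed fingerprint: a target can contain the query substructure only if every bit set by the query is also set by the target. The fragment strings and bit-assignment scheme here are simplified stand-ins for a real hashed fingerprint.

```python
import zlib

# Toy hashed-fingerprint pre-screen. Integers act as bit sets; each fragment
# string hashes to one bit (real systems hash many enumerated paths).
def fingerprint(fragments, n_bits=64):
    fp = 0
    for frag in fragments:
        fp |= 1 << (zlib.crc32(frag.encode()) % n_bits)
    return fp

def may_contain(target_fp, query_fp):
    # Screen-out test: all query bits must be present in the target.
    # A pass is only a candidate; it must still be confirmed by an
    # atom-by-atom (Ullmann-style) search, since collisions cause
    # false positives but never false negatives.
    return query_fp & target_fp == query_fp

target = fingerprint(["c1ccccc1", "C=O", "C-N"])
query = fingerprint(["c1ccccc1"])
```

The same subset test applies unchanged when the fingerprint is calculated from a Markush supergraph rather than a single molecule.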
Special handling of possible ambiguous aromaticity in generic structures is also presented.
Catalyzing sustainability: Emerging fields in green chemistry
Kathryn E. Parent, Jennifer L. Young, Julie B. Manley, and Paul T. Anastas. Green Chemistry Institute, American Chemical Society, 1155 Sixteenth Street, NW, Washington, DC 20036 - PDF - PPT
Green Chemistry – a new approach to designing chemicals and chemical transformations that are beneficial for human health and the environment – is an area that continues to emerge as an important field of study. New journals, research centers, academic courses, and industrial initiatives all focused on Green Chemistry have been announced in recent years. Conferences and symposia focused on technical advances in Green Chemistry ranging from synthesis to solvents, polymers to analytical methods, bio-renewables to green nanotechnology, are taking place in nations around the world. The mission of the ACS Green Chemistry Institute (GCI) is to advance sustainability through the implementation of green chemistry and engineering principles into all aspects of the chemical enterprise. This presentation will provide an overview of this emerging field, particularly in the areas that GCI seeks to catalyze through research, education, industrial implementation, and information dissemination.
Soft Matter: Where hard science gets soft and squishy
Carol Stanier, MRSC Editor, Royal Society of Chemistry, Thomas Graham House, Science Park, Cambridge, CB4 0WF, United Kingdom - PDF - PPT
Fifteen years ago, physicist Pierre-Gilles de Gennes won a Nobel Prize "for discovering that methods developed for studying order phenomena in simple systems can be generalized to more complex forms of matter, in particular to liquid crystals and polymers". He is widely regarded as the founding father of soft matter research.
"Soft matter" pervades our world, from biological systems, washing powders, medical formulations, plastics, and paints to modern TV and computer screens. The driving force for the creation of a new journal, Soft Matter, came from the community, who called for an interdisciplinary home for their work giving physicists, chemists, biologists, chemical engineers, materials and food scientists a forum for discussing their research. Editor, Dr Carol Stanier, will give you a whirlwind tour of the wonderful world of soft matter and advise on key information resources available in this area, including the RSC's publication, Soft Matter.
Brief introduction to materials science for the information professional
Meghan Lafferty, Science & Engineering Library, University of Minnesota, 108 Walter Library, 117 Pleasant St SE, Minneapolis, MN 55455 - PDF - PPT
Although materials science appeared as a field of study fairly recently, in some senses it has existed since people first began manipulating stone and other materials in their environments. The level of sophistication is much greater these days with the vastly improved ability to control properties and techniques for characterization. In the process of creating and studying new materials, materials scientists and engineers draw on a number of disciplines including chemistry, mechanical engineering, and physics. This presentation will give background on materials science and discuss current major areas of emphasis in materials science such as biomaterials, nanotechnology, polymers, and semiconductors. I will also address important sources of information for materials science.
Trends in Chinese chemical research
DingFei Lui, Wanfang Data, Institute of Scientific and Technical Information of China (ISTIC), Suite 1-19, Hua Tong Plaza B Tower, 100044, Beijing, China, Song Yu, Columbia University Libraries, Columbia University, 454 Chandler, 3010 Broadway, New York, NY 10027, and Dani Xu, OriProbe Information Services, 3238 Curry Ave, Windsor, N9E 2T5, Canada. - PDF - PPT
Chinese chemical research is having an increasing impact on world knowledge. Research output in chemistry from China has more than tripled since 1989, and China is now the third largest producer of chemical research. As of 2005, 128 of the 1000 journals most frequently cited in Chemical Abstracts were published in China. The hottest research areas in China are materials synthesis and characterization, asymmetric catalyzed reactions, and the acquisition of structure determination data. Major trends in chemical engineering include innovative ways to produce clean and reusable energy, the development of prototype molecular electronic devices and nanodevices, computational simulations, and the investigation of biomolecular recognition and interactions. The trend of Chinese chemical research is clearly aligned with China's strategy of using science and technology to serve national goals and help lead economic growth, under the pressure of being the world's most populous country and its second largest energy consumer.
Is it chemistry or biology – or both?
Sarah Thomas, MRSC Editor, Royal Society of Chemistry, Thomas Graham House, Science Park, Cambridge CB4 0WF, United Kingdom - PDF - PPT
The power of the chemical approach to biology, when combined with the amazing technologies being developed in the –omic sciences, is having a transforming effect on biological research and clinical medicine. From stem cell research and cancer drug development to genetic screening and drug delivery systems - all have the potential to make a real difference to our lives.
Chemical biologists, biological chemists, biochemists, molecular and structural biologists, drug discovery scientists, protein chemists, and bio/cheminformaticians all draw on the fundamental disciplines and technologies of chemistry and biology, but researchers found that their work often did not fit within the scope of traditional discipline journals. Molecular BioSystems has a particular focus on the interface between chemistry and the -omic sciences and systems biology, thus providing a unique and targeted forum for communication. Dr Sarah Thomas, the Editor, looks at the developments and the available resources, including the RSC Biomolecular Sciences book series.
Bioinformatics: an instructional opportunity for academic science and engineering libraries
Erja Kajosalo, MIT Libraries, Massachusetts Institute of Technology, 14S-M48, 77 Massachusetts Ave, Cambridge, MA 02139-4307 - PDF - PPT
Bioinformatics merges molecular biology, computer science, and information technology into a single discipline, which has drastically changed the amount and type of data used in biology research, as well as how research is carried out. This presentation will discuss what types of information resources are available for bioinformatics and what role academic science & engineering libraries might have in teaching the use of these resources.
The art and science of information dissemination
Sarah Tegen, ACS Chemical Biology, American Chemical Society, 1155 16th St NW, Washington, DC 20036 and Adam Chesler, ACS Publications, American Chemical Society, 1155 16th Street NW, Washington, DC 20036 - PDF - PPT
Today's technologies have revolutionized the way we disseminate information. We read papers on our desktops, we receive alerts through our cell phones or handheld devices, we aggregate information, we collaborate virtually. All of these advances help us disseminate our scientific ideas. How can chemists better use the technologies available today, and what kinds of technology might we envision in the future? This talk will detail some of the innovations used by ACS Publications for information dissemination and will speculate on new features that the community could adopt.
Think like a database: Substructure searching in the classroom
Judith N. Currano, Chemistry Library, University of Pennsylvania, 3301 Spruce St. 5th Floor, Philadelphia, PA 19104
Substructure searching is critical in many areas of chemistry, including synthetic organic chemistry: when trying to synthesize novel compounds, exact structure searches will not retrieve the information needed to construct the molecule. However, this search technique is frequently difficult for students to grasp; it requires them to think like databases instead of chemists. The author presents several techniques and methods of teaching substructure searching at various levels, based on the following points: what a substructure is and what it can do; the circumstances under which this type of search should be employed; how one should analyze a molecule to create a set of specifications for a substructure; and where and how one can actually run a substructure search. The teaching of substructure searching in both undergraduate and graduate courses is discussed, and suggestions for contexts in which to present the material are made.
Comparison of structure searching between SciFinder Scholar and MDL CrossFire Commander
Bing Wang, Library and Information Center, Georgia Institute of Technology, 704 Cherry Street NW, Atlanta, GA 30332
Among the over two hundred databases to which the Georgia Tech Libraries subscribe, only SciFinder Scholar and MDL CrossFire Commander have the structure searching feature. SciFinder Scholar and MDL CrossFire Commander are two proprietary clients that provide web access to Chemical Abstracts Service (CAS) databases and Beilstein/Gmelin databases respectively. This paper will compare both clients based on query build-up, structure editor tool, result navigation, and result management through demonstration of different searching scenarios. The relative importance of the advantages and disadvantages of the two databases will vary depending on information needs. Additionally, online resources are included to help users in conducting structure searches in SciFinder Scholar and MDL CrossFire Commander.
Advantages of multi-file chemical structure searching
Bob Stewart, Dialog, Thomson Scientific, 3501 Market Street, Philadelphia, PA 19104-3302
Commercial online database suppliers offer access to many sources of chemical information. These sources may include patents, scientific literature, drug pipeline data, and industry information. Each has its own specific editorial focus, leading to some significant differences in the information covered. Given these differences, it is essential that searchers, regardless of their method of searching (free text, key words, CAS Registry Number, or chemical structure searching), check multiple sources to ensure comprehensive retrieval.
With regard to chemical structure searching specifically, there are several advantages of performing chemical structure searching directly in multiple databases, particularly within drug pipeline databases. Additionally, collating the results of multi-file searches via XML also has several benefits worth highlighting.
Multi-host, multi-file chemical structure patent searching
Donald Walter, Customer Training, Thomson Scientific, 1725 Duke Street Suite 250, Alexandria, VA 22314
We all have our favorite hosts and databases for searching patents with specific chemical and Markush structures. Not all of them are on the same host. How can I search several different databases, and combine the results on my favorite host for easy downloading of all my results at once? This talk will review:
- Some advantages of each host
- Searching of chemical and Markush structures in the Derwent World Patents Index on Dialog, Questel and STN
- Comparing results between each type of search (fragmentation codes vs. MMS vs. DCR)
- Coordinating searches between DWPI and other databases
- Moving the results to, and displaying the results from, your favorite host
Advances in chemical reaction searching
Jim Nourse, William Lingran Chen, Bradley D. Christie, David L. Grier, and Burton A. Leland. Elsevier-MDL, 2440 Camino Ramon, San Ramon, CA 94583
The demands on reaction searching have grown substantially with the advent of automated lab notebooks and parallel synthesis, among other factors. Databases with millions of reactions are now common, and these must be updated constantly with immediate access to newly registered reactions. In addition, expectations for search performance have grown with the widespread use of Google and other search engines that return results instantaneously. We will report on new work on direct indexing of entire reactions, returning first hits immediately, and on improvements in performance for various classes of difficult queries.
Rapid structure lookup and distributed substructure searches in very large databases
Marc C. Nicklaus1, Markus Sitzmann1, Igor V. Filippov1, and Wolf-Dietrich Ihlenfeldt2. (1) Laboratory of Medicinal Chemistry, Center for Cancer Research, National Cancer Institute/Frederick, NIH, DHHS, 376 Boyles Street, Frederick, MD 21702, (2) Xemistry GmbH, Auf den Stieden 8, D-35094 Lahntal, Germany
We present new tools and services developed by the CADD Group, NCI, for searching for structures in very large databases, such as very large screening sample collections. One of these tools is a service for very rapid structure lookup, making use of InChIs as well as CACTVS hash code-based identifiers. The latter, designed to take into account tautomerism, different resonance structures drawn for charged species, and the presence of additional fragments, enable fine-tunable yet rapid compound identification and database overlap analyses. We also present a powerful substructure search tool, implemented in the form of a web service, for databases of millions of compounds, using a search engine operating in distributed mode across a Linux cluster. Finally, a tool for automatic generation of a web interface, for searches by substructure and other criteria, from a database file, e.g. an SDF, is presented. Some of these tools and services are being made publicly available on the CADD Group's web server.
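The identifier-based lookup and overlap analysis described above reduce to dictionary access and set intersection once every structure is keyed by a canonical identifier. The sketch below uses a trivial string normalization as a stand-in for real canonicalization (InChI or a CACTVS hash code); the compound names are invented.

```python
# Hypothetical normalization; real services canonicalize the structure
# itself (tautomers, resonance forms, extra fragments), not just the name.
def normalize(name):
    return name.strip().lower()

# Two toy "databases" keyed by the canonical identifier: lookup is O(1).
db_a = {normalize(s): s for s in ["Aspirin", "Caffeine", "Taxol"]}
db_b = {normalize(s): s for s in ["caffeine", "Ibuprofen"]}

def lookup(db, name):
    # Rapid structure lookup: one hash probe per query.
    return db.get(normalize(name))

# Database overlap analysis: set intersection of identifier keys.
overlap = db_a.keys() & db_b.keys()
```

Making the identifier more or less aggressive in what it normalizes is exactly what makes such lookup "fine-tunable": a stricter identifier separates tautomers, a looser one merges them.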
Produce, purchase, or partner? An informatics business development case study
Gregory M. Banik and Kevin Scully. Bio-Rad Laboratories, Informatics Division, 3316 Spring Garden Street, Philadelphia, PA 19104
The business dilemma of whether or not to produce, purchase, or partner to add new technologies will be explored via a case study involving Bio-Rad Laboratories, Informatics Division, a provider of state-of-the-art chemical software and database solutions. The company's KnowItAll® Informatics System offers a fully integrated environment for spectroscopy, cheminformatics, ADME/Tox, and metabolomics. A strategic cornerstone has been a “produce and partner” approach that facilitated the rapid release of products to the scientific community through both internal development and external partnerships.
Using an "agile" software development process, the Informatics Division and its partners have made major product launches at a pace and with a quality unheard of in the industry. In addition, we examine the company's "Powered by KnowItAll" program, a business development program for software development outsourcing that quickly enables partners to create new product solutions using a combination of out-of-the-box and custom applications.
Strategic analysis of discovery research informatics in the pharmaceutical Industry
Hugo O. Villar and Richard Kho. Altoris, Inc, 11575 Sorrento Valley Rd, Suite 214, San Diego, CA 92121
Chemoinformatics is in transition, with higher levels of integration required with other disciplines, including but not limited to bioinformatics, clinical trial management systems, modeling software, and patent informatics. The field has changed significantly since the time when results were analyzed in terms of individual compounds, one at a time. While the drug discovery paradigm has shifted, the informatics products in the marketplace have not kept pace with those changes. A Porter's five forces competitive analysis will be carried out for different aspects of the research informatics field. The drivers for change and innovation will be discussed, with real-life examples from our company's development of ChemApps and PatentInformatics as new brands in this field.
Nothing ventured, nothing gained
Wendy A Warr, Wendy Warr & Associates, 6 Berwick Court, Holmes Chapel, Cheshire CW4 7HZ, United Kingdom
The decision to strike out on your own is not taken lightly. Entrepreneurship and risk-taking go hand in hand but the temptation at first is to limit risk and expenditure. Often, the initial business plans, and the products and services they cover, quite quickly get forgotten as a true market niche opens up. As an entrepreneurial service professional, it is essential to be nimble and responsive to the emerging technologies that excite the client community. By being agile, responsive to change and quick to learn, the business grows and the entrepreneurial spirit is maintained. This paper will cover some of the thrills (money, success, travel) and a few of the hazards (time-consuming paperwork, obstructive purchasing departments, disaster contingency planning) from a personal viewpoint but will also be exemplified by other entrepreneurs in the industry.
Freedom to use and barriers to entry for the informatics entrepreneur
Gianna Arnold, Miles Stockbridge, PC, 10490 Little Patuxent Parkway, Suite 300, Columbia, MD 21044
This session will review recent statutory and case law and explore its effect on the informatics entrepreneur. Discussion will include an overview of intellectual property law with a focus on tools that (1) are available to ensure that you may practice your technology and (2) provide the opportunity to prevent others from practicing your technology. Ramifications of proposed U.S. Patent and Trademark Office rule changes will also be discussed.
Crossing the Pacific: Relocating a small patent information business
Alan Engel, Paterra, Inc, 526 N Spring Mill Road, Villanova, PA 19085-1928
Paterra pioneered services for the machine translation of Japanese patents 10 years ago. To meet changes in the marketplace, expand our service offering, access deeper information resources, and tap linguistic talent, we have decided to move our primary location of operations to Tsukuba Science City in Japan. This presentation will describe the background to this decision, what Paterra aims to accomplish for its clients, and the mechanics of relocating an Internet-based business to the other side of the Pacific Ocean.
Time for a global patent database
Gregory Aharonian, Source Translation & Optimization, P.O. Box 475847, San Francisco, CA 94147-5847
In the 1990s, many successfully fought to get patent offices to open up their databases of patents and patent applications to the public. Until then, inventors and the public at large had to pay the same high hourly fees to access commercial databases that large companies were paying, an obvious inequity. Is there a new battle in the making - is it time for a global patent database, with better database retrieval tools? Increasingly, major patent offices are providing overlapping information: EPO, PTO, JPO and the national patent offices of Europe. Need a PDF of a U.S. patent - go to the EPO databases. Want to do decent range searches on fields - don't go to the EPO databases. Need really clean bibliographic information about patents - don't go to the government patent databases. A mess. I suggest then that it is time, ten years after our initial battles for access to patent information, to battle for a WIPO-sponsored global patent database - with more databases, more servers, better search engines - to help the patent systems of the world further their role in providing to the world the latest information on new inventions. It's how they do it on Star Trek.
MDL® CMC database used to seek a SARS cure
Pieder Caduff, Elsevier MDL, Gewerbestrasse 12, CH-4123 Allschwil, Switzerland and Hualiang Jiang, Drug Discovery and Design Center, Chinese Academy of Sciences, Shanghai 201203, China.
SARS emerged as a communicable human disease in November 2002 and rapidly spread throughout the world. A coronavirus, called SARS-associated coronavirus (SARS-CoV), has been identified as the culprit. The 3C-like proteinase (3CLpro) of SARS-CoV is one of the most promising targets for anti-SARS-CoV drugs due to its critical role in the viral life cycle. Docking experiments revealed the characteristics a molecule should have in order to bind to the SARS-CoV 3CLpro protein. The MDL® Comprehensive Medicinal Chemistry (CMC) database was screened for candidates using a virtual screening strategy. Based on the analysis of the compounds in CMC, cinanserin, a well-characterized serotonin antagonist that has undergone preliminary clinical testing in humans, showed a high score in the virtual screening and was chosen for further experimental evaluation. The experiments, which included surface plasmon resonance and various assays, demonstrated that cinanserin inhibits SARS-CoV replication, most likely by inhibiting 3CLpro. In a lead optimization effort consistent with this strategy, pharmacology-related information about compounds with the cinanserin chemical substructure (or similar compounds) could have been obtained using the MDL® Drug Data Report (MDDR).
A procedure for modeling induced fit effects in receptor-ligand complexes
Florian Raubacher, Lead Generation, AstraZeneca, Pepperedsleden 1, 43183 Mölndal, Sweden
The binding of a ligand to its protein receptor often introduces a substantial change in the receptor structure. In these cases, it is difficult to predict the correct side chain placement in homology models or receptor structures used for docking. To generate structures that account for induced fit, these ligand-induced side chain movements can be treated as a number of additional constraints that should be satisfied in a homology modeling approach.
Therefore, a procedure has been developed that is capable of automatically generating an ensemble of receptor-ligand complexes based on the homology modeling program MODELLER, which allows the generation of complex models from an apo template structure by incorporating flexible ligand placement including known spatial interactions.
Statistics of the performance of the protocol using a dataset of experimental complex structures from the literature are presented.
Addressing the speed/accuracy dilemma of the virtual screening process
Pascal Bonnet1, Eric Arnoult2, and Christophe Meyer2. (1) Molecular Informatics, Johnson & Johnson PRD, A division of Janssen-Pharmaceutica, Turnhoutseweg 30, 2340 Beerse, Belgium, (2) Molecular Informatics, Janssen-Cilag S.A, Campus de Maigremont, 27106 Val de Reuil, France
Virtual screening is employed by most pharmaceutical companies to screen large collections of compounds in silico in order to identify new hits against biological targets. Several computational tools are available, but speed and accuracy are still a major concern. To relieve the limitations of this technique while improving the quality of its outcome, we describe here a sequential, multi-step process implemented on a grid-computing platform. This tool was applied to large-scale database mining and intensive scoring computations to enable the identification of the best candidates for biological testing. A first, optional filter is applied to remove non-drug-like compounds, followed by a 3D pharmacophore search. The remaining compounds are docked in the protein binding site and then rescored with an accurate scoring function using a combined molecular dynamics/continuum solvent potential. All the computationally intensive steps have been grid-enabled to accelerate the process. Three case studies with high enrichment factors are presented.
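The filter-then-score cascade described in the abstract can be sketched as a sequence of stage predicates applied to a shrinking candidate set, with the cheap filters placed before the expensive ones. The compound records, property names, and thresholds below are invented for illustration.

```python
# Toy screening funnel: drug-likeness pre-filter -> pharmacophore match ->
# docking-score cutoff. Each stage only sees survivors of the previous one.
compounds = [
    {"id": "c1", "mw": 350, "pharmacophore_match": True,  "dock_score": -9.1},
    {"id": "c2", "mw": 720, "pharmacophore_match": True,  "dock_score": -10.0},
    {"id": "c3", "mw": 410, "pharmacophore_match": False, "dock_score": -8.0},
    {"id": "c4", "mw": 290, "pharmacophore_match": True,  "dock_score": -6.2},
]

stages = [
    lambda c: c["mw"] <= 500,            # cheap drug-likeness filter
    lambda c: c["pharmacophore_match"],  # 3D pharmacophore search
    lambda c: c["dock_score"] <= -8.0,   # docking/rescoring cutoff
]

def screen(compounds, stages):
    hits = compounds
    for stage in stages:
        hits = [c for c in hits if stage(c)]  # each stage prunes the set
    return [c["id"] for c in hits]

hits = screen(compounds, stages)
```

Because each stage is independent and the candidate list shrinks monotonically, the expensive stages parallelize naturally across a grid, which is the point of the grid-enabled implementation described above.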
eHiTS_Score: A new statistically derived empirical scoring function
Zsolt Zsoldos1, Darryl Reid1, Aniko Simon1, Bashir S. Sadjad1, and A Peter Johnson2. (1) SimBioSys Inc, 135 Queen's Plate Dr, Suite 520, Toronto, ON M9W 6V1, Canada, (2) School of Chemistry, University of Leeds, Leeds, LS2 9JT, United Kingdom
eHiTS_Score is the new scoring function in the latest release of the eHiTS docking program. eHiTS_Score takes advantage of the temperature factors in PDB files during the statistics collection phase of scoring function generation to better capture the interaction geometries between ligands and receptors. The collected data are then fitted to create an "empirical" function that represents the statistical interaction data. The weights of each interaction term are derived by training against experimentally determined binding affinities to generate the full eHiTS_Score scoring function. This novel scoring function has the additional benefit of family training based on automatic clustering of input receptor structures. Analysis of the results from eHiTS_Score on a very large and diverse test set of 1091 PDB structures showed very good correlation with known binding affinities. Additional results from a complementary ligand-based screening tool will also be presented.
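The weight-training step of an empirical scoring function can be illustrated in miniature: given per-complex values of each interaction term and measured affinities, solve a least-squares fit for the term weights. This two-term normal-equations sketch uses invented term values and affinities and is not the eHiTS_Score training procedure itself.

```python
# Fit weights w0, w1 so that w0*t0 + w1*t1 best reproduces the affinities,
# via the normal equations (X^T X) w = X^T y for a 2-column design matrix.
def fit_weights(terms, affinities):
    xtx = [[sum(r[i] * r[j] for r in terms) for j in range(2)] for i in range(2)]
    xty = [sum(r[i] * y for r, y in zip(terms, affinities)) for i in range(2)]
    det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
    w0 = (xty[0] * xtx[1][1] - xty[1] * xtx[0][1]) / det
    w1 = (xty[1] * xtx[0][0] - xty[0] * xtx[1][0]) / det
    return w0, w1

# Two interaction terms per complex (e.g., H-bond and hydrophobic counts).
terms = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0), (0.0, 1.0)]
affinities = [2.0 * a + 0.5 * b for a, b in terms]  # noiseless toy data
w = fit_weights(terms, affinities)
```

With noiseless data the fit recovers the generating weights exactly; real training data are noisy, and family-specific training (as described above) amounts to running such a fit per receptor cluster.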
for more information see: http://www.simbiosys.ca/ehits/index.html
Fragmental QSAR model for the prediction of AMES genotoxicity
Kiril Lanevskij, Pranas Japertas, Remigijus Didziapetris, and Alanas Petrauskas. ADMETox Development, Pharma Algorithms Inc, 591 Indian Rd., Toronto, ON M6P 2C4, Canada
This study presents a computational analysis and the development of a predictive model of genotoxicity, based on in vitro Ames reverse mutation test data. In addition to generating an estimate of the probability of a compound's genotoxicity, toxicophores can be identified in their specific chemical structural environment, providing insight into which parts of the molecule are responsible for the genotoxic effect. The predictive genotoxicity model can be used by researchers to supplement various pre-defined genotoxicity filters that ignore the chameleonic dependence of genotoxicity on substituent effects. The model was validated on a set of marketed drugs and on a randomly selected compound validation set (N=945). The accuracy of compound classification into the 'genotoxic' or 'non-genotoxic' categories is close to 95%. Although in vitro bacterial reversal tests are relatively inexpensive to perform, it is much easier to test the outcome of multiple structural modifications in silico.
Identification of unique and redundant scaffolds in chemical databases
Mark R. Hansen and Jason Hodges. Altoris, Inc, 11575 Sorrento Valley Rd, Suite 214, San Diego, CA 92121
The use of chemoinformatics techniques to select compounds for study continues to be of great importance in the drug discovery process. The selection of compounds to be added to a library from among the myriad of commercially available chemicals can be challenging. In some cases, compounds containing particular scaffolds are desirable, but often the set of unique scaffolds in a library is what matters. The extent to which different commercial vendors provide unique chemistry can be of great value to a user. We build on our methods for automated scaffold enumeration and detection (SARvision, www.chemapps.com) to perform Boolean operations at the level of phylogenetic trees. These operations allow us to quantitate the number of unique scaffolds in a molecule library, the number of scaffolds common to pairs of molecule libraries, and the union of all scaffolds from a series of chemical libraries.
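Once each library is reduced to its set of detected scaffolds, the Boolean operations mentioned above are plain set algebra. The scaffold names and vendor libraries below are placeholders, not output of any actual scaffold-detection run.

```python
# Hypothetical scaffold sets for three vendor libraries, as would be
# produced by automated scaffold enumeration/detection.
vendor_a = {"benzimidazole", "quinoline", "piperazine"}
vendor_b = {"quinoline", "indole"}
vendor_c = {"indole", "piperazine", "pyrimidine"}

unique_to_a = vendor_a - (vendor_b | vendor_c)   # scaffolds only vendor A offers
shared_ab = vendor_a & vendor_b                  # scaffolds common to A and B
all_scaffolds = vendor_a | vendor_b | vendor_c   # union across the series
```

Counting the elements of each result answers the questions in the abstract: how much unique chemistry a vendor adds, and how large the combined scaffold space is.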
In silico technology for identification of potentially toxic compounds in drug discovery
Pranas Japertas, Alanas Petrauskas, and Remigijus Didziapetris. Pharma Algorithms Inc, 591 Indian Rd., Toronto, ON M6P 2C4, Canada
This study presents a computational analysis and derivation of endpoint-specific predictive models based on toxicity data of several types: acute toxicity (Mouse and Rat LD50), genotoxicity (Ames test) and organ-specific health effects (based on diverse animal and human studies). Classification and quantitative structure-activity analyses of the data for each toxicity endpoint were performed. This work is an attempt at stepwise identification of unknown effects using simple descriptors to facilitate chemical explanations of toxicity. In drug discovery these tools can help prioritize in vitro measurements and estimate animal toxicity, although multiple data gaps in the training sets restrict their usefulness. A partial solution to this problem is the calculation of 95% confidence intervals which indicates toxicological similarity of a given compound to the training set. If a compound is not too dissimilar, ‘hazard substructures' can be automatically generated, thus suggesting possible mechanistic explanations and structural modifications of a compound.
Maximum common substructure search in focused set profiling and in library analysis
Miklos Vargyas1, Ferenc Csizmadia1, Szabolcs Csepregi1, and Peter Vadasz2. (1) ChemAxon Ltd, Maramaros koz 3/a, 1037 Budapest, Hungary, (2) Department of Algorithms and Datastructures, Eotvos University, Budapest, Pázmány Péter sétány 1/c, Budapest, 1117, Hungary
Fingerprints have proved to be feasible and efficient tools for solving various problems in chemical information technology. Applications range from pre-filtering in structure searching to similarity calculations, including clustering and virtual screening. The success of fingerprints is rooted in their descriptive power, as well as in their fast generation, compact storage, and simple use.
The continuous growth of computational power allows the use of molecular descriptors that are much harder to compute. The maximum common substructure (MCS) is an example of such a descriptor: finding the MCS of two structures requires exponential computational time in the worst case. Yet it is feasible to compute the MCSs of even a medium-sized set of compounds (around 50,000 structures) in a hierarchical manner within an hour. Non-hierarchical comparison of millions of compounds against one or more query structures is also viable.
The presentation outlines the computational methods applied, briefly introduces the application program LibMCS, which is based on the hierarchical MCS search, and discusses application areas including focused set profiling, diversity analysis, combinatorial library optimization, and compound acquisition.
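The fingerprint-based similarity that the abstract contrasts with MCS is cheap precisely because it is just bit arithmetic: the Tanimoto coefficient counts shared bits over total bits. The bit patterns below are arbitrary illustrations rather than real molecular fingerprints.

```python
# Tanimoto similarity on integer bit vectors: |A & B| / |A | B|.
# Runs in time proportional to the fingerprint length, unlike MCS,
# which is exponential in the worst case.
def tanimoto(fp1, fp2):
    common = bin(fp1 & fp2).count("1")
    total = bin(fp1 | fp2).count("1")
    return common / total if total else 1.0

a = 0b101101  # 4 bits set
b = 0b100111  # 4 bits set, 3 in common with a
```

The trade-off motivating MCS-based methods is that a Tanimoto score reports only a number, while the MCS also says *which* common substructure drives the similarity.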
Mechanistic, ionization-specific model of human intestinal absorption
Pranas Japertas, Alanas Petrauskas, and Remigijus Didziapetris. Pharma Algorithms Inc, 591 Indian Rd., Toronto, ON M6P 2C4, Canada
This study presents a mechanistic QSAR analysis of Human Intestinal Absorption (HIA) that takes into account the dependence of passive absorption on several physicochemical descriptors (log P(ow), H-bonding parameters) and ionization. Since most oral drugs that are passively absorbed have an HIA very close to 100%, we cannot transform HIA into P(app)HIA and then correlate it with QSAR parameters. (This would mean attempting to differentiate between 99% and 99.9% absorption.) Instead, we must attempt to relate HIA to all QSAR parameters directly using non-linear fitting of experimental data to a system of equations. This system of equations represents both routes of permeation: transcellular and paracellular. Overall, the present QSAR model explained over 95% of all analyzed HIA values. The QSAR model was converted into an automated software system that can be used both for virtual screening (supplementing 'lead-likeness' filters) and for lead optimization (supplementing absorption simulations in early lead development).
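As a hedged illustration of why near-complete absorption compresses large permeability differences into the last percent, consider a generic two-route first-order absorption sketch. The rate constants and the transit time are invented for illustration; this is not the authors' fitted system of equations.

```python
import math

def fraction_absorbed(k_trans, k_para, transit_time=3.3):
    """Fraction absorbed for two parallel first-order permeation routes
    (transcellular + paracellular) during a fixed intestinal transit
    time in hours (3.3 h is an assumed typical value). A generic
    compartmental sketch, not the published model."""
    return 1.0 - math.exp(-(k_trans + k_para) * transit_time)
```

With these illustrative units, doubling an absorption rate constant that already gives ~99% absorption moves the result only fractions of a percent closer to 100%, which is why HIA itself, rather than a transformed permeability, must be fitted directly.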
Multivariate analysis of pyranone based HIV protease inhibitors: Cheminformatics approach
Barun Bhhatarai and Rajni Garg. Department of Chemistry, Clarkson University, 8 Clarkson Avenue, Potsdam, NY 13699-5812
Different multivariate statistical techniques are employed to analyse SAR data in chemistry, particularly in drug design. HIV protease, on which abundant SAR data are available, is an attractive target for such statistical in-silico drug-design tools. In the present context, several datasets compiled on pyranone-based HIV protease inhibitors (HIV-PI) are studied. Pyranone was one of the lead molecules for the development of Tipranavir, a recently approved non-peptidic HIV-PI. Several statistical models are developed on SAR data of pyranone derivatives to link the structure of the ligands to their biological activity. Each ligand molecule under study was described by means of physico-chemical descriptors and structural parameters which encode topological, geometrical and electronic features. Assessment of the different QSAR models allows physical interpretation and quantitative treatment of the structure-activity trend captured by each model. The combined information from these models helps in 'transforming data into information and information into knowledge' from a cheminformatics point of view. The results obtained from the different statistical models on the individual datasets as well as the combined dataset will be analysed and discussed.
Normalizing ionic resonance structures
Markus Sitzmann1, Wolf-Dietrich Ihlenfeldt2, and Marc C. Nicklaus1. (1) Laboratory of Medicinal Chemistry, Center for Cancer Research, National Cancer Institute/Frederick, NIH, DHHS, 376 Boyles Street, Frederick, MD 21702, (2) Xemistry GmbH, Auf den Stieden 8, D-35094 Lahntal, Germany
We present our algorithm for normalizing ionic resonance structures. This is an integral part of the calculation of our structure identifiers based on CACTVS hashcodes, calculated from the molecular structure of a compound. We have developed several variants of these hashcode-based identifiers providing adjustable sensitivity to certain molecular or atomic features; i.e. the identifier can be set to be sensitive to, or to ignore, fragments (e.g. salt counter-ions), isotopes, charges, tautomerism, and/or stereochemistry. Hence, as a preliminary step to the calculation of the hashcode, we normalize the input structure according to the selected sensitivity to structural features. In the case of charges, the normalization algorithm involves (1) finding a reasonable or canonical resonance structure, and (2) transforming this into the least charged form of the compound. In the latter step, the removal of chemically reasonable charges is avoided (e.g. charges in nitro groups and many other functional groups, charges that complete aromaticity, etc.). The algorithm is also capable of correcting ill-defined resonance structures.
Optimizing CoMFA settings
Shane D. Peterson, Wesley Schaal, Torbjörn Lundstedt, and Anders Karlén. Department of Medicinal Chemistry, Division of Organic Pharmaceutical Chemistry, Uppsala University, BMC, Box 574, SE-751 23 Uppsala, Sweden
We have recently published a practical method for improving CoMFA models by optimizing modeling settings (J. Chem. Inf. Model. 2006, 46, 355-364). The method has been applied to a variety of data sets, including the original benchmark steroid data set as well as eight other electronically available datasets (J. Med. Chem. 2004, 47, 5541-5554). Results indicate that statistically better models can be obtained using this methodology than with default CoMFA models. Furthermore, optimized CoMFA models performed as well as or better than models made using CoMSIA and HQSAR, as reflected by both q2 and r2pred. Since publication, we have applied multivariate modeling techniques to understand which settings are correlated with model improvement. This information may allow practitioners to reduce the number of models to be evaluated, while still obtaining improved models.
Prediction of hERG liability using a novel approach
Max Leong, Department of Chemistry, National Dong Hwa University, Hualien, 97401, Taiwan
Drugs that inhibit the channel encoded by the human ether-a-go-go related gene (hERG) risk causing prolongation of the QT interval or, in the worst case, torsade de pointes. It is therefore important to devise a model that predicts hERG liability at an early stage of drug discovery. A pharmacophore ensemble is constructed from a number of pharmacophore hypotheses to address the plasticity of the hERG protein as it interacts with various compounds, and is subjected to regression by a support vector machine (SVM) to generate the final model. The SVM-based model performed better than any single pharmacophore candidate in the ensemble and yielded correlation coefficients of 0.98 and 0.97 for the training set and test set, respectively, suggesting that this is a plausible in silico model for predicting the hERG liability of novel compounds.
Predictive data mining system to build models based on customized training sets and adding knowledge
Guangyu Sun, Chihae Yang, Kevin P. Cross, and Jared Archer. Leadscope, Inc, 1393 Dublin Road, Columbus, OH 43215
The development of a comprehensive predictive toxicology system faces numerous challenges, most importantly the lack of data and the lack of transparency of the methods. To address these challenges, we have designed a predictive data mining system that streamlines the logical stages of the model-building process, from the database, through searching/retrieval for profiling and read-across, classification and model building, to prediction. This allows users to build customized training sets for a particular chemical space and to link biological findings to the structural domain. We prepared various QSAR training sets suitable for different chemical classes, according to their structural similarity. Predictive models were built after extensive cross-validation using PCA and PLS techniques. Since the number of structural features and PLS factors affects performance, these are optimized for each model. The availability of these predictive models and a large amount of high-quality data provides a valuable tool for assessing chemical toxicity.
Quantitative structure and activity relationships study on the Ah receptor binding affinities of polybrominated diphenyl ethers using a support vector machine
Gang Zheng1, Man Xiao1, and Xiaohua Lu2. (1) School of Environmental Science and Engineering, Huazhong University of Science and Technology, 1037 Luoyu Road, Wuhan, 430074, China, (2) Environmental Science Research Institute, Huazhong University of Science and Technology, 1037 Luoyu Road, Wuhan, 430074, China
Polybrominated diphenyl ethers (PBDEs), which are widely distributed in the environment due to their use as flame retardants, may cause long-term health problems in humans. Their structural similarity to polychlorinated biphenyls (PCBs) implies possible dioxin-like toxicity. Using partial least squares regression with leave-one-out cross-validation, five net atomic charge descriptors and the first-order hyperpolarizability were extracted from more than 80 quantum descriptors for predicting the aryl hydrocarbon receptor (AhR) relative binding affinities (RBA) of PBDEs. Using a support vector machine (SVM) and a radial basis function network (RBFN), the RBAs of 18 PBDE congeners were correlated with the six extracted quantum chemical descriptors. The SVM model performs well in avoiding over-training. The cross-validation correlation coefficients (q2) for the SVM and RBFN models are 0.841 and 0.927, respectively. The good performance of the QSAR models based on net atomic charges and hyperpolarizability suggests that electrostatic and dispersion-type interactions may play important roles in the AhR binding of PBDEs.
Quantum mechanical energy-based screening of combinatorially generated library of tautomers
Maciej Haranczyk, Department of Chemistry, University of Gdansk, Sobieskiego 18, Gdansk, 80-952, Poland and Maciej Gutowski, Chemical Sciences Division, Pacific Northwest National Laboratory, 902 Battelle Blvd., P.O. Box 999, MS K1-83, Richland, WA 99352.
Our recent studies on anions of biologically relevant molecules and complexes suggest that even small molecules like nucleic acid bases are challenging systems in terms of identifying the most stable tautomers. Although the most stable neutral tautomers were identified decades ago, anionic tautomers have only recently been discovered among unconventional structures that do not result from oxo-hydroxy and/or imine-amine tautomerizations. The current contribution describes a procedure for finding low-energy tautomers that have not been identified using common organic chemistry knowledge. This procedure consists of (i) combinatorial generation of tautomers, (ii) initial density functional (B3LYP) geometry optimization and energy screening, and (iii) final energy refinement at the MP2 and CCSD(T) levels for the top hits from step (ii). The library of initial tautomer geometries is generated with TauTGen, the tautomer generator program.
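Step (i), the combinatorial generation of tautomers, amounts to distributing the mobile protons over the candidate donor/acceptor sites of a fixed heavy-atom skeleton. The sketch below uses hypothetical site labels and is not TauTGen itself.

```python
from itertools import combinations

def enumerate_tautomers(sites, n_protons):
    """All placements of n indistinguishable mobile protons on the
    candidate sites of a fixed heavy-atom skeleton; each placement is
    one candidate tautomer to pass on to DFT screening."""
    return [frozenset(c) for c in combinations(sites, n_protons)]

# a hypothetical six-site skeleton carrying two mobile protons
library = enumerate_tautomers(["N1", "N3", "O2", "O4", "C5", "C6"], 2)
```

The library size is simply C(sites, protons), which is why the cheap DFT screening step (ii) is needed before the expensive MP2/CCSD(T) refinement.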
R-NN curves: A method for diversity analysis and cluster identification
Rajarshi Guha1, Debojyoti Dutta2, Peter C. Jurs1, and Ting Chen2. (1) Department of Chemistry, Pennsylvania State University, 104 Chemistry Building, University Park, State College, PA 16802, (2) Department of Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089
When working with libraries of chemical structures it is useful to understand the distribution of compounds in a space defined by a set of molecular descriptors. We present a simple approach to the analysis of the spatial distribution of the compounds in a library, based on counts of neighbors within a series of increasing radii. The resultant curves, termed R-NN curves, follow a logistic model for any given descriptor space. The method can be applied to large datasets of arbitrary dimensionality. R-NN curves provide a visual method to easily detect compounds lying in sparse regions of a given descriptor space. We describe a method to numerically characterize R-NN curves, allowing one to summarize the location characteristics of an entire dataset in a single plot. We also consider the scenario involving clustered data and describe approaches to identifying the number of clusters using R-NN curves.
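The neighbour-counting construction of an R-NN curve can be sketched directly from the description above; the function and variable names are illustrative, not from the authors' code.

```python
import math

def rnn_curve(points, query_index, n_radii=20):
    """Count, for one compound, how many other compounds fall within
    each of a series of increasing radii in descriptor space. Compounds
    in dense clusters rise early and steeply; compounds in sparse
    regions stay flat until large radii."""
    q = points[query_index]
    dists = [math.dist(q, p) for i, p in enumerate(points) if i != query_index]
    r_max = max(dists)
    radii = [r_max * (k / n_radii) for k in range(1, n_radii + 1)]
    return [(r, sum(d <= r for d in dists)) for r in radii]
```

Fitting a logistic function to each such curve then yields the per-compound parameters that can be summarized for the whole dataset in a single plot.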
Sources of information on chemical industry statistics for chemical engineering students
Bing Wang, Library and Information Center, Georgia Institute of Technology, 704 Cherry Street NW, Atlanta, GA 30332
Chemical engineering students need to become familiar with chemical industry statistical resources. Knowledge of these resources is not only important for their senior design and graduate research projects but will also be useful throughout their careers as chemical engineers. However, finding chemical industry statistics can be very challenging. This paper will review the publication process of chemical industry statistics and examine the strategies and challenges of locating these resources. In addition, major sources, both print and online, will be identified.
Standardizer: Molecular cosmetics for cheminformatics
György Pirok, Nóra Máté, István Cseh, Szilárd Dóránt, Péter Kovács, Szabolcs Csepregi, and Ferenc Csizmadia. ChemAxon Ltd, Maramaros koz 3/a, 1037 Budapest, Hungary
A chemical compound can be represented in various ways. Therefore, canonicalization of the original structures is usually unavoidable before storing a library in a database or using it in cheminformatics applications. To automate this process, ChemAxon has developed a software tool for the batch standardization of molecules according to a user-defined configuration. Apart from visualization issues (cleaning the layout, reorientation of wedges, changing the display mode of hydrogens and aromaticity, etc.), Standardizer helps in the canonicalization of mesomers and tautomers, and allows the removal of counter-ions or solvents, the manipulation of stereo information, S-groups and attached data, and the custom transformation of functional groups. Standardizer is integrated with the JChem database systems, including the JChem Cartridge, eliminating many common traps of chemical database administration.
Studying the effects of individual interaction energies in a variety of protein-ligand complexes
Sally Mardikian1, Valerie J. Gillet1, Richard M. Jackson2, and David R. Westhead2. (1) Department of Information Studies, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield, S1 4DP, United Kingdom, (2) Institute of Molecular and Cellular Biology, University of Leeds, Garstang Building, Institute of Molecular and Cellular Biology, University of Leeds, Leeds LS2 9JT, United Kingdom
The influence of the individual interaction energies within a given scoring function - e.g. the electrostatic, hydrogen-bond and van der Waals interactions of the GRID scoring function - is usually unknown. This is because, during a conformational search, these contributions are summed and only the total energy, used to assess the quality of a pose, is known. We have developed a genetic algorithm-based method that optimises a molecule's pose based on the individual GRID interaction energy contributions. This was tested on several protein-ligand complexes, including twenty selected from across the four categories of the original GOLD dataset and various small molecules in complex with glycogen synthase kinase 3-beta. We observed that, for correct solutions (rmsd < 2.0 Å), the effects of the individual interaction energies are not equal, and their balance varies substantially between complexes. In fact, for many of the complexes it was found that the van der Waals interactions alone were guiding the search.
Understanding stereoselectivity: Molecular modeling to inform organic synthesis
Robert S. Paton, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, United Kingdom and Jonathan M. Goodman, Department of Chemistry, Cambridge University, Lensfield Road, Cambridge, CB2 1EW, United Kingdom.
In the synthesis of complex natural products, highly elaborate fragments with multiple stereocentres are brought together. In particular, the boron-mediated aldol reaction of methyl ketones has become a key coupling strategy. For each new substrate there is uncertainty over the stereochemical outcome, whilst the complexity of these reactions rules out both simple stereochemical models (e.g. Felkin-Anh) and high-level calculations. We can enhance the efficiency of organic chemistry by developing a widely applicable and easy-to-use predictive model able to aid and inform synthesis planning.
We have performed ab initio calculations to model the transition structures in the boron-mediated aldol reaction of methyl ketones. The resulting data were used to derive a force field that accurately reproduces the ab initio results in a tiny fraction of the time. The parameterization of our force field was automated using a multi-objective genetic algorithm. Our model can be run on large reaction systems in hours, where high-level calculations would be infeasible, to give a quantitative prediction of product selectivity. The model is truly predictive, since no experimental data have gone into its construction, and it has performed well in validation against experimental results.
ZINC: Molecular representation is important for virtual screening
John J Irwin and Brian K. Shoichet. Department of Pharmaceutical Chemistry, University of California, San Francisco, 1700 4th Street, San Francisco, CA 94143
An important problem in virtual screening is the quality of the molecular representations being screened. Careful database creation and curation are ongoing problems for experts in the field and barriers to entry for non-experts. This has led us to create ZINC, a free database of commercially available compounds for virtual screening, on the web at http://zinc.docking.org. ZINC contains multiple representations because molecules that are not in the biologically relevant protonation state, tautomeric, stereo- or regio-isomeric form in the virtual library often fail to dock and score well. Since its debut in 2005, many features have been added to ZINC and numerous problems have been fixed. For example, we have created pH-range-specific databases that attempt to represent the molecular forms seen only at higher or lower pH, as well as those expected near physiological pH.
Do MedChem rules stand up to validation?
Yvonne C. Martin, D-47E AP 10/2, Abbott Laboratories, 100 Abbott Park Road, Abbott Park, IL 60064-6100
Scientists involved in drug discovery have been taught a number of concepts that are generally accepted as useful for lead selection and optimization. This talk will examine some of these in the light of data. For example, does increasing octanol-water logP increase or decrease bioavailability; indeed, can we even accurately calculate logP? How about predicting pKa? Do similar molecules have similar biological activity? Is the rule-of-five predictive? This talk will challenge several such concepts with data and discuss the implications of these findings for computer-assisted molecular design.
Exploiting QSAR models for effective lead optimization
Richard A. Lewis, Computer-aided Drug Discovery, Novartis Institutes for Biomedical Research, WKL136.3.94, CH-4002 Basel, Switzerland
In our work as drug designers, we generate many useful and powerful models that explain structure-activity relationship (SAR) observations. These models can be used to assess molecules before expensive synthesis and testing. The accuracy of these models can be on a level with the experimental assay, and yet the models may not be driving the decisions in the project. How can we make our models more interpretable and useful to the bench chemist, and how can we increase the confidence chemists have in them? Issues around extrapolation, multi-objective optimization and in silico exploration of the chemistry space around an idea will be addressed.
The more things change, the more they stay the same: Issues in data analysis from QSAR to HTS
David Rogers, Workplace, 10188 Telesis Court, Suite 100, San Diego, CA 92121-4779
In the 20+ years from the development of the earliest quantitative structure-activity relationship (QSAR) methods to the current era of data from high-throughput screening (HTS), it would appear that much has changed: data sets have grown from dozens to millions of samples, novel analysis techniques have appeared, and new descriptors (including high-dimensional molecular fingerprints) have greatly increased the data content of each data point. However, throughout this transformation of the available data, the primary issues affecting the choice of analysis method remain the same: data dimensionality, algorithm scaling, and algorithm tuning. In this talk I will discuss a number of methods used in QSAR and HTS analysis and show how the choice among them is driven by these same underlying issues.
What's a drug designer to do?
Robert D. Clark, Informatics Research Center, Tripos, Inc, 1699 S. Hanley Rd., St. Louis, MO 63144
The goal of computer-aided drug design (CADD) has always been to help identify better drugs faster. Achieving this goal has always been difficult, and it has become more so in the last several years. Increasingly complex constraints have been imposed on drug candidates as the focus of pharmaceutical research efforts has shifted away from cures and towards very long-term treatments for chronic diseases. Such treatments are necessarily more prone to side effects and long-term toxicity. Techniques based on structure-activity relationships (SAR) of one sort or another have been critical to CADD, but such techniques are by their nature ill-suited to identifying the sort of idiosyncratic problems involved in clinical failures due to lack of efficacy and toxicity. In light of these considerations, finding ways to combine SARs into robust meta-analyses is likely to become more important in the years ahead.
Identifying the optimal energy window in pharmacophore discovery
John H. van Drie, Computer-Aided Drug Discovery, Global Discovery Chemistry, Novartis Institutes for BioMedical Research, Cambridge, MA 02139
The initial step in pharmacophore discovery is exhaustive conformational analysis. The user must specify an energy window, tau; all conformations are retained that are within tau kcal/mol of the global minimum. Generally, one employs 'recommended' values of tau, based on experience, and/or studies of strain energy in ligands (e.g. Nicklaus et al, Bioorg Med Chem, 1995; Bostrom et al, J CAMD, 1998; Perola and Charifson, J Med Chem, 2004). These values for tau used in pharmacophore discovery typically range from 3 kcal/mol to 20 kcal/mol (the default in Catalyst).
We describe here a methodical approach for determining tau, which is dataset-dependent, and which leads to exceptionally high-quality and selective pharmacophores. The application to a number of standard datasets is shown.
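The energy-window retention rule described above (keep every conformation within tau kcal/mol of the global minimum) takes only a couple of lines; this sketch is illustrative.

```python
def retained_conformations(energies, tau):
    """Retain every conformation whose energy (kcal/mol) lies within
    tau of the global minimum of the conformational ensemble."""
    e_min = min(energies)
    return [e for e in energies if e - e_min <= tau]
```

The whole question addressed in the talk is how to pick tau: a tight window such as 3 kcal/mol keeps only the low-energy conformers, while a permissive 20 kcal/mol window can retain the entire ensemble.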
High strain energies of bound ligands: What is going on?
Paul Labute, Chemical Computing Group, Inc, 1010 Sherbrooke Street W, Suite 910, Montreal, QC H3A 2R7, Canada
The prediction of the bioactive bound conformation of a candidate ligand is important for computational methodologies such as pharmacophore search and docking. The strain energy of a conformation (relative to the global minimum energy) is often used as a criterion for rejecting a conformation from consideration. Recent molecular mechanics studies using ligand-receptor complexes from the PDB have suggested that high strain energies (> 10 kcal/mol) are not only possible but routinely observed. We present the results of computational experiments that attempt to explain these observations and determine their validity.
"Fuzzy" pharmacophores for virtual screening and library design
Gisbert Schneider, Institute of Organic Chemistry & Chemical Biology, Johann Wolfgang Goethe-University, Siesmayerstr. 70, D-60323 Frankfurt, Germany
The degree of scaffold diversity among virtual hits obtained by similarity searching is influenced by the molecular representation. Different hit lists are found at different levels of abstraction from the atomic structure. A high degree of abstraction is thought to facilitate the identification of novel structures that exhibit a pharmacological profile similar to that of the query. We have developed molecular descriptors that allow for variable degrees of fuzziness of potential pharmacophore points. The theoretical concept and several prospective applications will be presented, including scaffold-hopping applications and the design of natural-product-derived combinatorial libraries.
Structure-based 3D pharmacophores as a tool for efficient virtual screening
Gerhard Wolber, Inte:Ligand GmbH, Mariahilferstrasse 74B/11, 1070 Wien, Austria
Chemical-feature-based pharmacophore models have been established as a state-of-the-art technique for characterizing the interaction between a macromolecule and a ligand. While in ligand-based drug design feature-based pharmacophore creation from a set of bioactive molecules is a frequently chosen approach, structure-based 3D pharmacophores still lack the reputation of being an alternative, or at least a supplement, to docking techniques. Nevertheless, 3D pharmacophore screening has the advantages of being faster than docking and of transparently providing the user with all the information used by the screening algorithms to characterize the ligand-macromolecule interaction. Besides the presentation of our structure-based pharmacophore elucidation, our fast rigid 3D pharmacophore superpositioning technique is applied to several examples. Geometric fitting of multi-conformational models of small organic molecules to structure-based pharmacophores is compared with docking methods and discussed in terms of conformational coverage, flexibility and eligibility for virtual screening.
H. Kubinyi, In Search for New Leads, EFMC Yearbook 2003, pp. 14-28.
Recent advances in molecular docking
Christian Lemmen, Holger Claußen, and Marcus Gastreich. BioSolveIT GmbH, An der Ziegelei 75, 53757 Sankt Augustin, Germany
Even after more than a decade of research, molecular docking remains a challenging task. The scoring problem and the proper handling of protein flexibility aside, other issues also remain largely unresolved. We have enhanced the docking software FlexX in several ways to address some of these. First, we added functionality for a more appropriate treatment of water molecules in the active site. Secondly, we introduced a new way of handling metal ions that explicitly considers the different possible coordination geometries. Third, we implemented a completely novel placement strategy (single interaction scan) that has proven most suitable in cases of largely hydrophobic active sites. Finally, we took measures to make the new modules combinable with each other as well as with the established extensions for handling pharmacophore constraints, combinatorial library docking and the simultaneous consideration of multiple active-site conformations. Altogether, this provides a powerful set of tools to rise to the challenge of molecular docking. We present recent results using this functionality in a series of application examples.
Understanding decoys and hits in molecular docking
Brian K. Shoichet, Department of Pharmaceutical Chemistry, University of California, San Francisco, 1700 4th Street, San Francisco, CA 94143-2240
Molecular docking is widely used to screen large compound collections for novel lead molecules that complement a receptor of known structure. Docking energy functions are approximate and many degrees of freedom are under-sampled. To understand where algorithms can be improved, we have turned to model systems where predictions can be tested in detail. These are simplified, small buried cavities where the interactions are dominated by one particular term. Thus, the L99A cavity in T4 lysozyme is dominated by non-polar complementarity, the L99A/M102Q cavity has a single hydrogen bond acceptor, and the W191G cavity in cytochrome C peroxidase is dominated by a single ionic interaction. Predicted ligands are being tested for binding, geometry, and protein motion using x-ray crystallography. We use a cycle of theory development and experimental testing in these systems, where mis-predicted ligands and geometries are as informative as correct predictions.
Modeling chemical reactions in drug design
Johann Gasteiger, Computer-Chemie-Centrum, University of Erlangen-Nuremberg, Erlangen, 91052, Germany
Chemical reactions play a major role at many steps of the drug design process. A better understanding and modelling of chemical reactions could greatly increase the efficiency of developing a new drug. In target identification, an understanding of enzyme reactions is needed. In lead discovery and lead optimization, an estimate of synthetic accessibility is desired, syntheses have to be designed, and the synthesis of a library asks for knowledge of the scope and limitations of a reaction type. Furthermore, knowledge of the stability of the compounds of a library is necessary. The estimation of ADME-Tox properties has to model the metabolism of drugs and to predict pKa values, both of which involve chemical reactions. Furthermore, many toxic modes of action are the result of chemical reactions. Examples of modelling these various types of chemical reactions will be given.
Oncology: Targeting the Hedgehog Signaling Pathway with small molecules
Alex S. Kiselyov, Sergey Tkachenko, Sergei Malarchuk, and Ilya Okun. Executive VP of R&D, ChemDiv, Inc, 11558 Sorrento Valley Road, Suite #5, San Diego, CA 92121
Aberrant activation of the Hedgehog (Hh) pathway is associated with numerous malignancies including basal cell carcinoma, medulloblastoma and pancreatic cancer. Several reports also suggest that positive regulators of the Hh pathway could be used in the treatment of neurodegenerative diseases.
ChemDiv, Inc. has been working in collaboration with Dr. James Chen of Stanford University on the design and evaluation of compound libraries that modulate the Hh pathway. Considering the "druggability" of specific targets in both pathways, we focused our initial efforts on the identification of 7TM and S/T kinase- or ATP-motor-specific inhibitors. Particular attention is paid to compounds acting downstream of Smo, a 7TM protein that is known to have several onco-mutations. An initial set of 5,000 compounds was assayed to yield 37 compounds that affected Hh signaling in Shh-N-producing HEK 293 cells with EC50 values in the 0.4-5 µM range. Following these studies, 52 compounds were shown to be active by binding directly to, and antagonizing, the respective 7TM protein (Smo). Based on the success of the first phase of screening, we have assembled a second-generation library (7,500 compounds) of Hh modulators. Particular attention has been paid to (i) IP potential, (ii) synthetic feasibility, (iii) drug-like potential and (iv) the activity profile of the identified scaffolds. Secondary screens to confirm the resulting 'hits' and epistatically link them to known Hh pathway components are ongoing, and these studies will guide our lead optimization efforts.
Design and synthesis of tailor-made compound libraries via a knowledge-based approach: A case study
Wibke E. Diederich, Christof Gerlach, Andreas Blum, Jark Boettcher, Sascha Brass, Torsten Luksch, and Gerhard Klebe. Fachbereich Pharmazie, Philipps-Universitaet Marburg, Institut fuer Pharmazeutische Chemie, Marbacher Weg 6, 35032 Marburg, Germany
Our approach to designing tailor-made compound libraries starts with a privileged ligand scaffold well suited to address the key interactions of the conserved recognition pattern of the respective protein target family. Through specific decoration with appropriate side chains, individual library members can be tailored with respect to selectivity towards particular family members. The selection of the appropriate side chains also takes feasible synthetic strategies into account, giving rise to a subset of putative building blocks useful for decoration of the main scaffold. The obtained building blocks are docked combinatorially using FlexXC, leading to a virtual library of putative inhibitors. Through re-scoring, the most promising library members are determined. To verify this knowledge-based library design approach, we selected the serine and aspartic proteases as well-studied model cases. For the two protease families, we investigated the suitability of five- and seven-membered azacycles, respectively, as privileged structural elements for the design and synthesis of selective inhibitors. In the case of the aspartic proteases, the azacycle scaffold addresses the conserved catalytic dyad via its basic nitrogen, as revealed by X-ray crystallography. For the serine proteases, the S2 pocket and the non-specific peptide recognition unit are addressed, in this case by a non-basic azacycle. Both core structures can easily be modified by means of standard synthetic chemistry, thus allowing specific side-chain decoration. Via synthesis, enzymatic assays and crystal structure analysis, the most promising library members are further characterized. Based on this thorough investigation, the design of structure-based combinatorial libraries directed towards highly selective aspartic as well as serine protease inhibitors is currently under way in our laboratories.
Herman Skolnik Award Lecture: Why models fail
Hugo Kubinyi, University of Heidelberg, c/o Donnersbergstrasse 9, D-67256 Weisenheim am Sand, Germany
Quantitative models describe inaccurate data in terms of independent variables. Whereas precise mathematical procedures exist to arrive at a "best" fit of the error-containing experimental data, model and variable selection are based on more or less arbitrary choices. Significance measures and validation procedures are applied to check the consistency of a model by its internal predictivity. However, even the most sophisticated validation procedures, such as leave-many-out (LMO) cross-validation or y-scrambling, do not guarantee good external (test set) predictivity. Reasons for such failure include wrong model selection, poor test set selection, inappropriate scaling, artificial cut-offs, and variable selection from a large pool of variables. Several examples will illustrate the most common problems and will provide a simple explanation for the lack of relationship between internal and external predictivity.
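The last-named failure mode, variable selection from a large pool, can be illustrated with a minimal numerical sketch (purely synthetic data, not an example from the lecture): when the descriptors are pure noise, picking the few that happen to correlate best with y on the training set still produces an impressive internal fit with essentially no external predictivity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_pool = 20, 20, 500  # few samples, many candidate descriptors

# Pure noise: the descriptors carry no real information about y
X = rng.normal(size=(n_train + n_test, n_pool))
y = rng.normal(size=n_train + n_test)
Xtr, Xte, ytr, yte = X[:n_train], X[n_train:], y[:n_train], y[n_train:]

# Select the 5 pool variables best correlated with y *on the training set*
corr = np.array([abs(np.corrcoef(Xtr[:, j], ytr)[0, 1]) for j in range(n_pool)])
best = np.argsort(corr)[-5:]

# Ordinary least squares on the selected variables (plus intercept)
A = np.column_stack([Xtr[:, best], np.ones(n_train)])
coef, *_ = np.linalg.lstsq(A, ytr, rcond=None)

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

r2_internal = r2(ytr, A @ coef)
B = np.column_stack([Xte[:, best], np.ones(n_test)])
r2_external = r2(yte, B @ coef)

print(f"internal R2 = {r2_internal:.2f}")  # looks impressive
print(f"external R2 = {r2_external:.2f}")  # near zero or negative
```

Because the selection step saw only the training data, no amount of internal validation on those same 20 samples can reveal that the "model" is fitting noise; only the external test set does.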
The past illuminates the future
Philip Abrahams, Customer Services, Royal Society of Chemistry, Thomas Graham House, Science Park, Milton Road, Cambridge, CB4 0WF, United Kingdom - PDF - PPT - MP3
Since its launch in December 2003, the RSC Archive has become one of the RSC's most used and most popular products. There seems to be no period in which historic chemical science literature suddenly becomes irrelevant to current researchers' needs. We will give a brief background to the extent of the RSC Archive and the kinds of first-wave customers who have bought and are using it. How do Archive papers compare in terms of citations, or hits via reference linking? How might Archive material be used to promote usage of current content? And looking to the near future, what are the possibilities for customers to text-mine the information? Revenues resulting from the sale of the Archive have funded the RSC's support of chemical science teaching initiatives. The presentation will close with a look at the forthcoming launch of journal-specific and package-specific archives and a digitised books archive, all of which are intended to help improve the RSC's ability to invest in and advance the chemical sciences.
Cooperation between Canadian university chemistry departments, the Chemical Institute of Canada, and publisher: Archiving the Canadian Journal of Chemistry, issues 1951 through 1997
Lai Im Lancaster and Brian Maurice Lynch. Department of Chemistry, St. Francis Xavier University, Physical Sciences Complex, 1 West Street, Antigonish, NS B2G 2W5, Canada - PDF - PPT - MP3
In June 2003, the Atlantic Section of the Chemical Institute of Canada (representing New Brunswick, Newfoundland, Nova Scotia, and Prince Edward Island) voted to propose to the publisher of the Canadian Journal of Chemistry (National Research Council of Canada [NRC] Press, a federal government agency responsible for 16 journals) that the backfiles be converted into searchable electronic files as a matter of extreme urgency. At that time, the ACS archive covering all issues of its journals back to 1879 had been completed, and the corresponding archive from the Royal Society of Chemistry (RSC), going back to 1841, was near completion. Our paper will give a chronological account of the many approaches made in search of financial support from other federal science-related departments and agencies, and from the Canadian chemical and scientific instrumentation industries. Eventually a partially successful result was achieved through collaboration between academic chemistry departments and science libraries, national chemical societies, and the publisher. Significant work remains before we reach the objective of a free, universally accessible, and completely searchable archive at the RSC/ACS level, but completion is now in sight. A detailed comparison will be made of the archival procedures adopted by the ACS, the RSC, NRC Press, and the Australian CSIRO.
A remembrance (and retrieval) of information past: Finding older information in CAS databases
Jan Williams, Chemical Abstracts Service, 2540 Olentangy River Rd, Columbus, OH 43202-1505 and Ida L. Copenhaver, Editorial Operations, Chemical Abstracts Service, P.O. Box 3012, Columbus, OH 43210. - PDF - PPT - MP3
In a very real sense, CAS databases function as a collective memory of the world's chemical literature. Now available online are the abstracts captured digitally from CA issues back to 1907, additional abstracts from ACS journals, older reaction information and more. Knowing the extent of what is available in the CAS collection online and the techniques for optimal retrieval in light of database organization will help to unlock this retrospective treasure trove of chemistry-related literature and patent records. Examples will illustrate the perhaps surprising relevance of older publications for today's research.
Backfile journals and indexes: Impact and issues for researchers and research institutions
Gary Ives, Texas A&M University Libraries, Texas A&M University, 5000 TAMU, College Station, TX 77843-5000 - PDF - PPT - MP3
Since 2002, Texas A&M University Libraries have invested heavily in electronic resource backfiles of both journal and indexing tools. Among our first journal backfile purchases were those from Elsevier and JSTOR, followed by subscriptions or purchases from ACS, RSC, and Wiley. Our purchases of index backfiles have included Compendex, INSPEC, and Web of Science. In this paper, I will focus on the impact that these electronic backfiles have had on research at our institution, as indicated by various measures of usage. I will also report on a number of issues faced by those research institutions which invest in such information products, including missing content, access vs. ownership of the content, and “misuse” of content by researchers.
An analysis of citations in scientific and patent literature to historical research from the first half of the 20th century and the relationship to the accessibility of these works on electronic archives
Simon M Pratt, Thomson Scientific, 14 Great Queen Street, London, WC2B 5DF, United Kingdom and Robert A Stembridge, Global Marketing Services, Thomson Scientific, 14 Great Queen Street, London, United Kingdom. - PDF - PPT - MP3
An increasing rate of citation of classical works of research from the early 20th century has been observed. Indeed, some works, such as those of Einstein, are more heavily cited today than at any time in their history. In this paper we use citation analysis to investigate this trend, and discuss the relationship between the citation rates of these works and the easier access afforded by the introduction of archives of electronic journals and indexing services. We also explore the impact that improved visibility of historical research has had on the patentability of inventions vis-à-vis the prior art.
Cell-Surface Informatics (CSI): A platform for quantifying effects of biomaterial surface features on cells
Jing Su, Coulter School of Biomedical Engineering, Emory University/Georgia Institute of Technology, Atlanta, GA 30332-0100 and J. Carson Meredith, School of Chemical and Biomolecular Engineering, Georgia Institute of Technology, 311 Ferst Drive, Atlanta, GA 30332-0100.
The relationships between biomaterial surfaces and cells are foundational to a broad range of potentially revolutionary health-care solutions: tissue engineering, diagnostic devices, drug delivery, and culture surfaces for isolating stem cells. Despite major advances in material surface imaging and cell assays, an informatics package does not exist for a systems-based analysis and discovery of cell-surface responses. This is a critical limitation in developing materials that control cell functions on surfaces. We present the recent development of cell-surface informatics (CSI) tools for discovery of relationships between material surface features and cell responses. These tools combine cell assay and material image data using metrics called localized cell-features. These metrics are based on a naïve Bayes classification model which differs significantly from global, summary statistic approaches traditionally employed. We have validated the new algorithms via experiments on combinatorial libraries designed to discover relationships between micropatterned surfaces and osteoblasts.
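A naïve Bayes classifier of the general kind the abstract mentions can be sketched as follows; the two "localized features" and their values here are hypothetical stand-ins, not the actual CSI metrics, and this is a generic Gaussian naïve Bayes rather than the authors' model.

```python
import math

# Classify a local surface patch as "attached" (cell present) vs "empty"
# from two hypothetical localized features, e.g. local roughness and a
# local chemistry signal, both scaled to [0, 1].
def fit(patches, labels):
    """Per-class priors plus per-feature Gaussian mean/variance."""
    model = {}
    for c in set(labels):
        rows = [p for p, l in zip(patches, labels) if l == c]
        n = len(rows)
        means = [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]
        vars_ = [sum((r[j] - means[j]) ** 2 for r in rows) / n + 1e-9
                 for j in range(len(rows[0]))]
        model[c] = (n / len(patches), means, vars_)
    return model

def predict(model, x):
    def log_post(c):
        prior, means, vars_ = model[c]
        ll = math.log(prior)
        for xi, m, v in zip(x, means, vars_):
            # Gaussian log-likelihood, features assumed independent ("naive")
            ll += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        return ll
    return max(model, key=log_post)

# Toy training data: attached cells favour rough, chemistry-rich patches
patches = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.9), (0.1, 0.2), (0.2, 0.1), (0.15, 0.3)]
labels = ["attached", "attached", "attached", "empty", "empty", "empty"]
model = fit(patches, labels)
print(predict(model, (0.85, 0.85)))  # -> attached
print(predict(model, (0.1, 0.15)))   # -> empty
```

The contrast with summary-statistic approaches is that each patch is scored locally rather than averaging features over the whole image.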
Evolving an informatics environment for materials discovery and research
Charles P. McGonegal, Joseph F. Schaaf, and J. W. Adriaan Sachtler. UOP LLC, 25 East Algonquin Road, Des Plaines, IL 60017
While developing high-throughput material creation and screening capabilities, UOP has grown an increasingly complex informatics system: first to manage the large volume of data generated, and then to provide an integrated environment for working with a combination of traditional and high-throughput equipment and methods. The core goal of any informatics system is to collect and deliver data. Over several generations, our system has evolved into an environment that enables information flow among researchers, instrument operators, and hardware. Its scope extends from material DOE, synthesis, preparation, and characterization through assay screening, capturing key variables and a rich set of context-setting metadata. The system comprises many small unit-operation-oriented programs whose modularity makes it expandable and flexible. These have been adapted to successive generations of instrument hardware and database schemas. The key principles driving the system's evolution, the underlying database architecture, and the software tools will be presented and illustrated with selected applications.
High-throughput sensor-based solvent-resistance mapping of copolymers and determination of quantitative structure-property relationships
Radislav A. Potyrailo, Materials Analysis and Chemical Sciences, GE Global Research Center, 1 Research Circle, Niskayuna, NY 12309 and Ronald J. Wroczynski, Materials Analysis and Chemical Sciences, General Electric Corporate Research and Development, One Research Circle, Niskayuna, NY 12309.
Polymers are important materials for sensor, microfluidic, and other demanding applications. High-throughput screening methodology has been applied for the evaluation of the solvent-resistance of a family of polycarbonate copolymers in different solvents of practical importance. We employed a 24-channel acoustic-wave sensor system that provided previously unavailable capabilities for parallel evaluation of polymer solvent-resistance. This high-throughput polymer evaluation approach assisted in construction of detailed solvent-resistance maps of polycarbonate copolymers and in determination of quantitative structure-property relationships. A D-optimal mixture design was employed to explore the relationship between the copolymer compositions and their solvent-resistance.
Informatics for a high-throughput liquid formulations workflow
Dave Rothman, Tom Boomgaard, Bruce Wilson, and Mary Beth Seasholtz. The Dow Chemical Company, 1776 Building, Midland, MI 48674
High-throughput research is being applied to new problems in materials science. As in other fields where high-throughput research has been adopted, informatics capability is needed to deal with the large volumes of data. This talk will describe an informatics system built for high-throughput liquid formulations research in personal care products, based on commercially available components enabling experimental design, lab automation, device control, and the collection, management, analysis, and visualization of data.
Using integrative analytics for on-line optimization of catalyst libraries
Francois Gilardoni, Industrial Applications, InforSense Ltd, 459A Fulham Road, London, SW10 9UZ, United Kingdom and David Farrusseng, Groupe de Catalyse, Institut de Recherches sur la Catalyse IRC–CNRS, 2, Av. Albert Einstein, F-69626 Villeurbanne, France
Computational technologies and methodologies developed within the life sciences are not seamlessly transferable to materials science because of the nature of the materials employed. The major difficulty is the intricacy of describing a solid catalyst. These materials are usually poorly characterized or not characterized at all, partly because fast and inexpensive in situ techniques are lacking, and partly because the impact of their preparation on their intrinsic activity and properties is hard to grasp. Moreover, a fundamental understanding of the reaction paths involved usually requires substantial research efforts that are often unsuited to industrial time constraints. The alternative to a trial-and-error modus operandi is to rely on integrative analytics, in which robotics and informatics are combined and interoperate seamlessly. As a case study, we will present how TOPCOMBI, a project for nanotechnologies and nanosciences funded by the European Commission, leverages the integrative-analytics paradigm as a methodology to scrutinize and identify catalysts suitable for testing in high-throughput campaigns. We will demonstrate how elemental proprietary descriptors are combined with material preparation recipes and experimental data in order to identify relevant inputs capable of predicting catalyst activities and performance. By setting a very high bar for relevance at an early stage of the high-throughput experimentation program, the number of trials is reduced drastically and the knowledge gained per experiment is maximized. The methodology and heuristics are encapsulated in a scientific workflow in which each building block performs a set of operations, such as data mining and processing, on the experimental and virtual catalyst libraries.
Materials informatics for materials chemistry
Krishna Rajan, Department of Materials Science & Engineering, Iowa State University, Ames, IA 50011
In this presentation we discuss how data dimensionality reduction and data mining can serve as powerful tools for both classification and prediction of structure-property relationships in materials. Using examples from different applications of materials chemistry such as catalysis design and crystal chemistry we show how high dimensional data can be used to assess patterns of behavior as well as establish predictive Quantitative Structure Activity Relationships (QSARs). The talk will also examine the value of informatics techniques in materials characterization and spectroscopy.
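The dimensionality-reduction step described above can be sketched generically with principal component analysis via the singular value decomposition; the descriptor table below is synthetic and its column meanings (ionic radius, electronegativity, and so on) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical descriptor table: rows = candidate materials, columns =
# descriptors (e.g. ionic radius, electronegativity, ...). Two latent
# factors generate eight correlated descriptors plus a little noise.
latent = rng.normal(size=(50, 2))
mixing = rng.normal(size=(2, 8))
X = latent @ mixing + 0.05 * rng.normal(size=(50, 8))

# PCA by SVD of the column-centred data matrix
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)

# Scores in the reduced 2-D space: the low-dimensional coordinates that
# classification or QSAR models would consume instead of all 8 columns
scores = Xc @ Vt[:2].T

print(f"variance captured by 2 components: {explained[:2].sum():.3f}")
```

Because the eight descriptors were driven by two latent factors, two components capture nearly all the variance, which is the pattern such analyses look for in real materials data.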
Experimental strategies for combinatorial and high throughput materials development
James N. Cawse, GE Corporate R&D, 1 Research Circle, Niskayuna, NY 12309
High throughput and combinatorial methods for materials discovery and optimization have presented a real challenge for the effective planning of experiments. When experiments can be run in parallel by the dozens or hundreds, the classic experimental designs for data-sparse systems must be rethought for data-rich ones. The structure of the experimental space is critical to the planning process; it determines the combinatorial methodology, the throughput required, the data structures, and the visualization and analysis tools. I will review the latest developments in experimental design in this context.
Effects of electronic indexes and journals on citation patterns in chemical information
Beth Thomsett-Scott, Reference and Information Services, University of North Texas Libraries, P.O. Box 305190, Denton, TX 76226 - PDF - PPT - MP3
Over the past decade, a large number of indexes and journals in chemistry have moved to an electronic format. In general, the electronic format is considered easier to use and more accessible. This paper will examine whether the availability of electronic indexes and journals has changed citation patterns in chemical information. Using a small sample of chemistry faculty and comparing their citation patterns over the last ten years, the effects of electronic information will be evaluated and discussed.
If you build it, will they come? Experience with journal backfiles at HighWire Press
Helen Barsky Atkins, HighWire Press, Stanford University, 1454 Page Mill Rd, Palo Alto, CA 94304 - PDF - PPT
HighWire Press began in 1995 by hosting the online version of the Journal of Biological Chemistry (JBC). For the next several years, our work consisted of adding new journals, ones whose publishers understood that an online presence was essential to their future development and success. But merely having an online presence is no longer sufficient. In recent years we have seen these same prominent journals, from a variety of publishers, assign a high priority to ensuring that their full historical content is available online. A number of journals have now completed placing their complete backfiles online, and while it is still early days, we are beginning to see some usage trends. In this session, we'll share some of this preliminary usage data.
Looking back, moving forward: Examining the impact of digitizing the ACS archive
David P Martinsen and Adam Chesler. ACS Publications, American Chemical Society, 1155 16th Street NW, Washington, DC 20036 - PDF - PPT - MP3
Culminating an effort of well over a year to scan the hard copy editions of the ACS journals, the ACS Archive was launched between April and June, 2002. A number of interesting challenges were encountered during that effort, and the scope of the project itself evolved during the process. While hopes for user satisfaction were high from the outset, actual interest in and use of this older material has outstripped expectations and revealed some important insights into user interests and behavior. This exercise in upgrading an older format (ink-on-paper) to today's technology has evoked some thoughts on how to prepare for future technology upgrades and new publication models.
Recent advances in the phlogiston theory: Mining the *really* old literature
F. Bartow Culp, Mellon Library of Chemistry, Purdue University, 504 West State Street, West Lafayette, IN 47907-2058 - PDF - PPT - MP3
Electronic access to chemical information has generally occurred in the perceived order of its importance, that is, in reverse chronological order. The recent availability of such resources as EEBO (Early English Books Online) and ECCO (Eighteenth Century Collections Online) now allows us to search electronically back to the 17th and 18th centuries, the time when modern chemistry was being born. These resources, along with others, will be used to investigate some of the fundamental tenets of chemistry at the period when they were still being formalized.
Reviving analytical data of the past with open submission databases and text mining tools
Sam Adams1, Stefan Kuhn2, Peter Murray-Rust3, Christoph Steinbeck2, Joe A Townsend1, and Christopher A. Waudby4. (1) Unilever Centre for Molecular Science Informatics, University of Cambridge, Lensfield Road, CB2 1EW Cambridge, United Kingdom, (2) Research Group for Molecular Informatics, Cologne University Bioinformatics Center (CUBIC), Zuelpicher Str. 47, D-50674 Cologne, Germany, (3) Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, CB2 1EW Cambridge, United Kingdom, (4) Department of Chemistry, University of Cambridge, Lensfield Road, CB2 1EW Cambridge, United Kingdom - PDF - PPT - MP3
In contrast to molecular biology, chemistry faces a significant lack of open databases. We have addressed this lack in our own field of research, computer-assisted structure elucidation, by creating an open-access, open-submission database of Nuclear Magnetic Resonance (NMR) spectra called NMRShiftDB. NMR data have been published in the literature for 40 years but are electronically available only as scanned bitmaps. NMRShiftDB makes it possible to revive this information by providing the means to enter data via a submission interface, augmented by quality-assurance procedures. We also present the application of the analytical data mining tool OSCAR to produce starting material for NMRShiftDB's authoring process. OSCAR parses organic chemistry papers, summarizes the data it finds, and alerts the user to potential errors in the data. The discovered spectral data, stored by OSCAR as CMLSpect files, are used to author NMRShiftDB datasets.
Chemical ontologies: Towards automated data mining and knowledge generation on chemical compounds
Lutz Weber, Chemoinformatics, OntoChem GmbH, Weinbergweg 22, Halle, 06120, Germany
Ontologies and semantic data mining are mature and useful concepts in genomics and biological information technologies, while chemical ontologies are still at their dawn. The automated classification of compounds into chemical classes, such as natural products or compounds that exhibit biological activities, is gaining increasing interest. We will discuss some prerequisites for and new developments in the field of chemical ontologies, as well as possible applications. In particular, OntoChem has developed technologies that address problems of 2D and 3D searches in very large chemical databases, potentially containing several trillion compounds. Thus, we have used validated chemical reactions and accessible starting materials to construct very large databases of synthetically accessible, drug-like compounds for in silico screening. The presentation will also expand on our novel 2D topological descriptor technology for substructure and similarity searching and for classifying chemical reactions and compounds, providing a framework for a future chemical knowledge system.
Open-standards based IT integration in Life Science research: From present to future
Carsten Stauffer, Frank N. Penkert, Karsten Tittmann, and Heinz Rakel. Science & Technology, Bayer Business Services GmbH, Olof-Palme-Str. 15, Leverkusen, 51368, Germany
Bayer Business Services Science & Technology has for many years been the main research IT department within the Bayer group. These years of experience, and the continuous development and application of research IT systems, have greatly increased our knowledge of IT technology, of research processes, and of ways to combine the two.
This presentation shows the current state of integrating cheminformatics applications within Bayer, using the IT landscape of Bayer's pharma research unit and its two key components, the central structure warehouse and the electronic lab notebook, as an example. It then goes beyond the current state to show the steps now being taken towards open-standards-based IT integration, and concludes with a mid-term vision for which internal research projects are already laying the foundations.
The overall key to success is the availability and use of open, flexible standards for interfacing and linking the different applications. Within Bayer this has already been achieved to a large extent, but it is still at an early stage when it comes to accessing external data sources.
Classification of organic and bio-organic reactions with MOLMAP physicochemical descriptors
Joao Aires-de-Sousa, Sunil Gupta, Diogo A. R. S. Latino, and Qing-You Zhang. REQUIMTE and Department of Chemistry, New University of Lisbon, campus FCTUNL, 2829-516 Caparica, Portugal
Automatic classification of organic reactions is of high importance for the analysis of reaction databases, reaction retrieval, reaction prediction, and synthesis planning. In bioinformatics, the reconstruction of metabolic pathways from genomes requires the classification of enzymatic reactions. Encoding chemical reactions with physicochemical parameters has the potential to account for the influence of electronic effects on reactivity. We have developed MOLMAP molecular descriptors that encode physicochemical properties of the bonds present in a molecule. For a chemical reaction, the difference between the MOLMAP of the products and the MOLMAP of the reactants represents the structural changes effected by the reaction. This is a numerical, fixed-length representation of a chemical reaction that does not require explicit assignment of the reaction center. We show how MOLMAP descriptors are used for the classification of chemical reactions. Their application is demonstrated for assigning EC numbers to enzymatic reactions and for estimating reaction likelihood.
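The product-minus-reactant idea can be sketched in a few lines. Note this is a simplified stand-in: the published MOLMAPs map bond properties onto a trained Kohonen self-organizing map, whereas the sketch below uses a plain 2-D histogram on hypothetical, pre-scaled bond properties to convey the same fixed-length construction.

```python
import numpy as np

GRID = 5  # 5 x 5 grid -> 25-component fixed-length descriptor

def molmap(bonds):
    """bonds: list of (polarity, dissociation_energy) pairs scaled to [0, 1)."""
    m = np.zeros((GRID, GRID))
    for p, e in bonds:
        m[int(p * GRID), int(e * GRID)] += 1  # count bonds per property cell
    return m.ravel()

def reaction_descriptor(reactant_bonds, product_bonds):
    # The difference of product and reactant maps encodes the bonds broken
    # (negative entries) and made (positive entries); no explicit
    # reaction-centre assignment is needed.
    return molmap(product_bonds) - molmap(reactant_bonds)

# Toy reaction: one bond type disappears, two new ones appear;
# the unchanged bond (0.2, 0.4) cancels out of the difference.
reactants = [(0.2, 0.4), (0.7, 0.3)]
products = [(0.2, 0.4), (0.5, 0.8), (0.9, 0.1)]
d = reaction_descriptor(reactants, products)

print(d.sum())          # net change in bond count
print(np.abs(d).sum())  # total bonds broken + made
```

Because every reaction maps to a vector of the same fixed length, these descriptors can feed standard classifiers directly, which is what makes database-wide reaction classification tractable.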
Following the road step by step: A new reaction database-driven tool for stepwise retrosynthetic analysis
Christof H. Schwab, Bruno Bienfait, and Johann Gasteiger. Molecular Networks GmbH, Naegelsbachstrasse 25, D-91052 Erlangen, Germany
The synthesis of new compounds is a time-consuming and costly task, and searching for and evaluating alternative synthetic paths is a mandatory step before going to the lab. Searching reaction databases may provide some information about how a compound can be synthesized, but often fails if the target is not present in the database.
We present the Retrosynthesis Browser (RSB), a novel, web-based, easy-to-use tool for the stepwise retrosynthetic analysis of a given target compound. RSB scans reaction databases to suggest new synthetic routes and simultaneously searches catalogs of available starting materials for the proposed precursors. Furthermore, it provides the corresponding published reaction data for each suggested synthetic step. Owing to its rather general definition of reactivity, it can provide the chemist with new ideas for organic synthesis, since it deals with a broad range of diverse chemistry, including the formation of heterocycles, pericyclic reactions, rearrangements, and metathesis.
In our presentation we will provide insights into the general algorithms of our approach and demonstrate its application to some simple but medicinally relevant synthetic targets.
Introducing Route Designer v1.0
A Peter Johnson1, Zsolt Zsoldos2, Aniko Simon2, Darryl Reid2, James Law2, Yang Liu2, Sing Yoong Khew2, and Howard Y. Ando3. (1) School of Chemistry, University of Leeds, Leeds, LS2 9JT, United Kingdom, (2) SimBioSys Inc, 135 Queen's Plate Dr, Suite 520, Toronto, ON M9W 6V1, Canada, (3) Ann Arbor Laboratories, Pfizer Inc., Pfizer Global Research & Development, 2800 Plymouth Road, Ann Arbor, MI 48105
Route Designer is a new retrosynthetic analysis package that generates complete routes to synthetic targets, starting from readily available starting materials. A key feature of the system is the fully automated generation of retrosynthetic reaction rules by analysis of a reaction database, avoiding the need for time-consuming manual rule creation. Route Designer uses these rules to carry out an exhaustive retrosynthetic analysis of a synthetic target. Special heuristics have been developed to mitigate the combinatorial explosion inherent in such an exhaustive approach. The system employs a user-friendly, web-based front end that sorts the routes according to a merit ranking and allows the user to view literature examples of the reactions suggested. An overview of the problems (and some solutions) faced in the creation of this system will be presented, together with examples that demonstrate its use.
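The overall search scheme such systems use can be sketched in miniature. Everything below is a toy: molecules are plain strings, the three "rules" and the starting-material catalogue are invented, and the depth cap stands in for the much richer heuristics a real package like Route Designer applies against combinatorial explosion.

```python
from collections import deque

# Toy retrosynthetic rules: product class -> required precursor classes.
# Real systems derive such rules automatically from a reaction database.
RULES = {
    "amide": ["amine", "acid"],  # amide coupling, applied in reverse
    "acid": ["ester"],           # ester hydrolysis, in reverse
    "amine": ["nitro"],          # nitro reduction, in reverse
}
STARTING_MATERIALS = {"ester", "nitro"}  # the "available" catalogue

def retrosynthesize(target, max_depth=4):
    """Breadth-first search; the depth cap is the anti-explosion heuristic."""
    queue = deque([(frozenset([target]), [])])
    while queue:
        front, route = queue.popleft()
        if front <= STARTING_MATERIALS:
            return route  # every open position is purchasable
        if len(route) >= max_depth:
            continue
        for mol in front - STARTING_MATERIALS:
            if mol in RULES:
                new_front = (front - {mol}) | set(RULES[mol])
                queue.append((new_front, route + [(mol, RULES[mol])]))
    return None

route = retrosynthesize("amide")
for product, precursors in route:
    print(f"{product} <= {' + '.join(precursors)}")
```

Even this tiny rule set already shows why pruning matters: each disconnection multiplies the open positions on the search frontier.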
Organic chemistry in the ChemgaPedia encyclopaedia: Useful visualization tools for e-teaching and e-learning
Torsten Winkler, Otto Diels Institute for Organic Chemistry, University of Kiel, Germany, Otto-Hahn-Platz 4, D-24118 Kiel, Germany and Rainer Herges, Institut f. Organische Chemie, Universität Kiel, Otto-Hahn Platz 4, Kiel, D-24098, Germany.
The ChemgaPedia encyclopaedia is based on the project „Vernetztes Studium – Chemie“ (VS-C, „networked chemistry studies“). Starting in 1999, 16 workgroups from Germany, Switzerland and Great Britain created substantial learning material covering all aspects of chemistry and related topics. Currently, there are 15000 pages and 25000 media files available (graphics, movie clips, 3D models, animations, www.chemgapedia.de). After the official termination of the VS-C project at the end of 2004, the project coordinator FIZ CHEMIE Berlin is continuing the development towards the ChemgaPedia learning system.
Organic chemistry was represented by the workgroups of Prof. Gregor Fels (Paderborn, Germany) and Prof. Rainer Herges (Kiel, Germany), who closely collaborated, sharing their respective results and expertise. The aim of our material is to enhance the visualization of organic reactions and structures and thus to improve their comprehensibility. Furthermore, streaming video clips, optional exercises, a search function and a glossary section add to the value of this learning system.
Libraries and librarians: Supporting chemists with information
Andrea Twiss-Brooks, University of Chicago, John Crerar Library, 5730 S. Ellis Ave, Chicago, IL 60637-1403
University chemists with various levels of expertise need information to carry out a variety of synthetic experiments. Chemistry faculty and research associates need sophisticated tools to discover the latest research in synthetic methodology; graduate students need information tools to develop their skills as synthetic chemists and as users of the synthetic literature; undergraduate students need basic information resources designed to support their educational needs; and researchers in other disciplines need to discover chemical information specific to their own areas of interest. The university chemistry librarian fulfills a critical role by selecting appropriate information resources to meet user needs, ensuring adequate access for all users, providing instruction in effective resource selection and use, serving as liaison between information provider and researchers, and increasing scientists' and students' awareness of new and existing information resources. The presentation will discuss the librarian's present and future roles in supporting the work of the synthetic chemist.
Chemical information instruction and support at ETH Zürich: Concept, realization, and trends
Martin P. Braendle, Engelbert Zass, Blanka Cartier, and Arun Kumar. Informationszentrum Chemie Biologie Pharmazie, ETH Zuerich, HCI G 5.3, CH-8093 Zuerich, Switzerland
Nowadays, chemists are faced with a variety of electronic information sources at their workbench. Proper selection and use of these tools requires knowledge of their strengths and deficiencies. Students and faculty often do not appreciate the need for the chemical information instruction that would impart this knowledge, and librarians find it difficult to secure enough time in tightly packed curricula. Given these conditions at ETH Zürich, we have worked out a comprehensive concept for chemical information instruction and support. It comprises: 1. lessons in chemical information integrated into lab courses and lectures; these problem-oriented units are taught by scientific staff from the Information Center and accompany students throughout their Bachelor courses, starting in the first term; 2. e-learning material produced in the BMBF project "Networked Chemistry Studies"; 3. supporting web pages for major databases and individual end-user support. In addition, we try to improve our basic access points and our instruction by analyzing feedback from end-user searches, e.g. with text mining.
Beilstein Chemical Toolkit (BCT): A flexible framework for handling chemical structure drawing
Jochen Zügge, Beilstein Institut, Trakehner Strasse 7-9, 60487 Frankfurt, Germany
The BCT is being designed as a general-purpose toolkit for the efficient handling of chemical structures in software. Its main focus will be the input, normalization, and canonicalization of 2D drawings of stereochemically defined substances. The software will be object-oriented and highly modular, making extensive use of Java and XML as key technologies. Wherever possible, a data-driven design approach will be followed, employing known design patterns. Algorithms are to be based, where possible, on scientifically accepted and published concepts, for example: the bonding model first outlined by Dietz, the aromaticity estimation described by Randic, the ring perception algorithm of Pearlman, and stereoperception according to Cahn, Ingold, and Prelog. For stereo recognition, new approaches relying on XML patterns, mathematical group theory, and graph topology, as well as a new Morgan-type algorithm for the unique numbering of atoms, will be used. The Beilstein-Institut plans to make the BCT available to the scientific community as open-source software.
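For readers unfamiliar with the Morgan algorithm that such canonical-numbering schemes extend, the textbook idea can be sketched briefly (this shows only the classic extended-connectivity iteration, not the BCT's new variant):

```python
# Classic Morgan extended-connectivity sketch: iteratively replace each
# atom's value by the sum of its neighbours' values until the number of
# distinct equivalence classes stops growing.
def morgan_classes(adjacency):
    """adjacency: dict atom -> list of neighbour atoms."""
    ec = {a: len(nbrs) for a, nbrs in adjacency.items()}  # start: degrees
    n_classes = len(set(ec.values()))
    while True:
        new_ec = {a: sum(ec[b] for b in nbrs) for a, nbrs in adjacency.items()}
        new_n = len(set(new_ec.values()))
        if new_n <= n_classes:
            return ec  # class count stopped growing; ec partitions the atoms
        ec, n_classes = new_ec, new_n

# n-butane as a path graph C0-C1-C2-C3: the two terminal carbons are
# symmetry-equivalent, as are the two inner carbons.
butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
ec = morgan_classes(butane)
print(ec[0] == ec[3], ec[1] == ec[2])  # True True
```

The resulting classes then seed a unique atom numbering, which is what makes canonical structure comparison and registration possible.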
2D and 3D Visualization of pharmacophoric ligand-receptor interaction
Fabian Bendix1, Philipp Adaktylos1, Gerhard Wolber1, and Thierry Langer2. (1) Computer Science Group, Inte:Ligand GmbH, Mariahilferstrasse 74B/11, 1070 Vienna, Austria, (2) Institute of Pharmacy, University of Innsbruck, Innrain 52, 6020 Innsbruck, Austria
3D pharmacophores have proven to be an efficient and comprehensive method to model ligand-macromolecule interactions, describing both steric and chemical-feature complementarity. Within the dense information content of a protein, identifying pharmacophoric features is still complex; a promising approach is to combine automated feature generation with a comprehensive visual representation.
In this work, a visualization framework that is capable of displaying the target molecule and the ligand together with its pharmacophore is presented. Two views are provided showing the receptor-ligand interactions, one in 3D and one in 2D. The pharmacophore features are generated automatically and can be visually verified by chemists by interacting with the display in the 3D view. Additionally, the visualization of the binding pocket's surface provides essential information about the receptor-ligand interaction (e.g., polarity of the surrounding environment). To overcome the inherent weaknesses of 3D representations, such as occlusion or the complexity of 3D perception, the framework provides a 2D view linked to the 3D representation of a particular ligand. This automatically calculated structure diagram displays the topology of the molecule, and also includes comprehensive visual metaphors for the pharmacophore features and associated amino acids.
We present a convenient way of displaying pharmacophoric features in 3D as well as in 2D, suitable visual metaphors for pharmacophoric interactions, the generation of 2D layouts as collision-free as possible, and an implementation of linking and brushing among the views.
Wolber, G.; Kosara, R. Pharmacophores from macromolecular complexes with LigandScout. In 'Pharmacophores and Pharmacophore Searches'; Langer, T.; Hoffmann, R., Eds.; Wiley-VCH: Weinheim, Germany, 2006; pp 131-150.
Designed visualizations for enabling medicinal chemistry
Barry J. Wythoff, Scientific Reasoning, One Market Square, Suite 2, Newburyport, MA 01950
Why do scientists bother to look at visualizations at all? What characteristics make a visualization useful? Is it attractive, high-impact colors? Dynamic animation? Highly detailed rendering? We will discuss some of the cognitive principles that actually determine whether a visualization will be useful, rather than merely appealing. Then, a number of novel visualizations that have been designed to aid medicinal chemists in LeadDecisionTM will be presented in this context.
A novel interactive tool for multidimensional biological data analysis
Zhaowen Luo and Xuliang Jiang. System Biology, Serono Research Institute, Inc, One Technology Place, Rockland, MA 02370 - PPT
We have developed an interactive tool for the analysis and visualization of multidimensional structural and biological data. The top layer of the tool is a heat map that provides an overview of all data. Any area of the map can be zoomed for a close-up view. Data analysis methods, such as clustering and profiling, are embedded in the tool to aid the selection of focused sets for further analysis of structure-activity/property relationships. Clicking on any point in the heat map launches tooltip windows that display drilldown views of structural and biological information, including targets, compound structure and physical properties, and assay results from different measurements. The tool can display tens of thousands of data points, covering hundreds of assay results for hundreds of compounds, in a single map. We have used the tool to speed up decision-making in our drug-discovery process.
Dynamic indexing of chemical metadata using open tools: Case study of Open Babel, CDK, and the Blue Obelisk
Geoffrey R. Hutchison1, Tobias Helmus2, Stefan Kuhn2, Henry S. Rzepa3, Christoph Steinbeck2, Christopher J. Swain4, and Egon L. Willighagen5. (1) Department of Chemistry and Chemical Biology, Cornell University, Baker Laboratory, Ithaca, NY 14853-1301, (2) Research Group for Molecular Informatics, Cologne University Bioinformatics Center (CUBIC), Zuelpicher Str. 47, D-50674 Cologne, Germany, (3) Department of Chemistry, Imperial College of Science, Technology and Medicine, Exhibition Road, South Kensington, London SW7 2AY, United Kingdom, (4) Cambridge MedChem Consulting, United Kingdom, (5) Institute for Molecules and Materials, Radboud University Nijmegen, Department of Analytical Chemistry, Toernooiveld 1, Nijmegen, NL-6525 ED, Netherlands - PDF
Chemical data files, including common formats, are handled by a variety of proprietary and open software tools. However, while many of these formats contain a wide variety of chemical information, such data is inaccessible to modern operating systems and end users. Chemists may have thousands of chemical files (in a variety of formats) on their drives, yet cannot easily index and search for molecular information. For example, users can easily identify the artist of a music file, but a chemist cannot easily identify the chemical formula or number of atoms in a downloaded file. We discuss open approaches for dynamically indexing and searching chemical metadata using existing open source software.
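The kind of metadata extraction described above can be illustrated with a short sketch: pulling an atom count, bond count and molecular formula out of an MDL molfile so a desktop indexer could expose them much as it exposes an MP3's artist tag. This is illustrative Python, not the Open Babel or CDK code; field positions follow the published V2000 counts-line layout, and the sketch reads only explicit atoms (implicit hydrogens are ignored).

```python
# Extract simple searchable metadata from a V2000 molfile (illustrative).

from collections import Counter

def mol_metadata(molfile_text):
    lines = molfile_text.splitlines()
    counts = lines[3]                 # 4th line of a molfile is the counts line
    n_atoms = int(counts[0:3])        # columns 1-3: number of atoms
    n_bonds = int(counts[3:6])        # columns 4-6: number of bonds
    symbols = [lines[4 + i].split()[3] for i in range(n_atoms)]
    tally = Counter(symbols)
    # Hill order: C first, then H, then the rest alphabetically
    order = (["C", "H"] if "C" in tally else []) + \
            sorted(s for s in tally if s not in ("C", "H"))
    formula = "".join(s + (str(tally[s]) if tally[s] > 1 else "")
                      for s in order if s in tally)
    return {"atoms": n_atoms, "bonds": n_bonds, "formula": formula}

# Hypothetical minimal molfile for methanol (heavy atoms only)
methanol = """methanol
  sketch

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0
    1.4000    0.0000    0.0000 O   0  0
  1  2  1  0
M  END"""
print(mol_metadata(methanol))  # {'atoms': 2, 'bonds': 1, 'formula': 'CO'}
```

A real indexer would hand such key-value pairs to the operating system's metadata store (e.g. Spotlight or Tracker) so files become searchable by formula or atom count.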
Chemical ontologies and safety intelligence networks
Jürgen Harter, Life Sciences, Biowisdom ltd, Harston Mill, Harston, CB2 5GG Cambridge, United Kingdom
'Ontology' can mean different things, e.g. glossaries, dictionaries, thesauri, taxonomies, schemas and data models. This presentation explains what an ontology is and how you can use it to build intelligence networks in the life sciences domain, thereby enhancing decision making alongside the drug discovery process. The benefits of employing ontologies for knowledge management are shown (semantic integration of chemical, biological and safety data). A chemistry knowledge map will illustrate the links between compounds, drugs, biological entities, pharmacology, disorders, side effects, toxicology, molecular properties etc. Some use cases for intelligence networks will be discussed (Safety Intelligence Programme), in particular the sort of questions/issues to which answers can be found in those networks: 'Which brain-specific proteins are the targets for established marketed drugs?' 'What are good biomarkers for a disease?' Furthermore, various chem-/bioinformatics tools can interact with an ontology knowledge server, thus providing a mechanism to explore the chemical landscape using similarity searching or statistical analyses (cluster analyses).
Chemical structure search engines in cyberspace
Klaus Gubernator and Craig A. James. eMolecules, Inc, PO Box 2790, Del Mar, CA 92014 - PPT
The web has revolutionized the way we retrieve information. Chemistry is a late participant in this revolution, probably because searching for chemical structures is significantly more difficult than text searching. Recently, a number of chemical search engines have emerged that give free access to large databases of public domain chemical structures. They differ in scope, content, functionality and performance. By indexing content by unique chemical structure, these search engines provide bridges between disconnected and sometimes hidden sources of information about the same structure. Searching millions of chemical structures and returning results on the time scale that a web surfer expects is a particular challenge. Chemists search for exact matches or for substructures. They frequently refine their searches by applying further restrictions to structural features, and export lists of structures to apply computational methods. We will discuss approaches to meeting these challenges and requirements, and we will present solutions that provide chemists with productive web-based tools.
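The speed requirement described above is usually met with a fingerprint prescreen: every bit set in a query's structural-key fingerprint must also be set in a candidate's, so a cheap bitwise test discards most of a large database before the expensive atom-by-atom substructure match. The sketch below is illustrative Python with invented feature keys, not the eMolecules implementation.

```python
# Fingerprint prescreen for substructure search (illustrative sketch).

def fingerprint(features, n_bits=64):
    """Hash a set of structural features into an integer bitmask."""
    fp = 0
    for f in features:
        fp |= 1 << (hash(f) % n_bits)
    return fp

def may_contain(candidate_fp, query_fp):
    """Necessary (not sufficient) condition for a substructure match:
    every query bit must be present in the candidate."""
    return candidate_fp & query_fp == query_fp

aspirin = fingerprint({"benzene", "ester", "carboxylic-acid"})
phenol  = fingerprint({"benzene", "hydroxyl"})
query   = fingerprint({"benzene", "ester"})

print(may_contain(aspirin, query))  # True: all query bits are present
print(may_contain(phenol, query))   # usually False (unless bits collide)
```

Candidates that pass the screen still go through a full graph-matching step; the screen only guarantees that failing candidates cannot match, which is what makes million-structure databases searchable at interactive speed.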
Exploring the feasibility of a protein structure prediction metaserver based on the AMBER/PBSA scoring function
Hai-Feng Chen, MJ Hsieh, and Ray Luo. Department of Molecular Biology and Biochemistry, UC-IRVINE, 3144 Natural Sciences I, Irvine, CA 92697
Protein structure prediction is a challenging scientific field with clear applications in molecular biology. The goal of this research is to explore the feasibility of setting up a metaserver based on the AMBER/PBSA scoring function to improve the quality of predicted protein structures. To test the robustness of the scoring function, we ranked all predicted folds from all automatic servers in CASP5 and CASP6. Before the AMBER/PBSA scoring function was used to rank the structures, the predicted folds were converted into all-atom structures with MODELLER. The prediction accuracy of AMBER/PBSA was found to be 70.9% for CASP5 and 71.2% for CASP6, much higher than that of individual servers (56.4% in CASP5, 45.5% in CASP6). AMBER/PBSA also performs much better than other knowledge-based scoring functions, such as Dfire and Rosetta, under the same conditions. This suggests that our AMBER/PBSA scoring function can be used for more consistent model selection in a metaserver setting. Development of a metaserver based on the AMBER/PBSA scoring function is currently underway in our group.
Continuum polarizable force field
Yuhong Tan and Ray Luo. Department of Molecular Biology and Biochemistry, UC-IRVINE, Irvine, CA 92697-3900
A great deal of effort has been directed at developing polarizable force fields that account for varying dielectric environments, for accurate representation of the energetic surface. Despite this effort, no clear consensus on how to incorporate polarization has emerged. Unlike explicit polarizable force fields, which are expensive in calculations with implicit solvents, we propose to treat polarization in a continuum manner. A continuum polarizable force field was developed based on the fact that different dielectric constants effectively imply different polarizabilities. We validated the continuum polarizable force field with respect to small-molecule dipole moments, interaction energies, and geometries, using high-level quantum mechanical data as the benchmark. These validations were performed both in vacuum and in implicit water. Our validation data show that the quality of the continuum polarizable force field is comparable to that of the explicit polarizable force field distributed with the Amber package.
High-throughput and conventional experimentation coexistence: War of the worlds?
Francois Gilardoni, Industrial Applications, InforSense Ltd, 459A Fulham Road, London, SW10 9UZ, United Kingdom and David Farrusseng, Groupe de Catalyse, Institut de Recherches sur la Catalyse IRC–CNRS, 2, Av. Albert Einstein, F-69626 Villeurbanne, France.
The chemical and pharmaceutical industries are facing significant internal and external pressure to boost experimental efficiency and effectiveness by cutting direct research costs and reducing the time to market for new sustainable products. There is a constellation of software packages for knowledge management and discovery available, but none fully encapsulates the process from inception to delivery of modern multidisciplinary R&D projects. Catalysis, for instance, is crippled by non-interoperability at many levels: encoding, syntactic, semantic and ontological. Best-practice data mining techniques are ineffective without high-quality data; fast, reliable and full access to information; and consistent capture of data and processes. Furthermore, modern R&D depends heavily on data generated by high-throughput screening and experimentation for the identification of new catalysts and the corresponding experimental routes. The corollary is an overwhelming data avalanche in which experimentalists still struggle to fully exploit and combine information from high-throughput and conventional experimentation. This precludes proper dissemination of exploitable knowledge and hinders both scientific breakthroughs and short development cycles. We will address some of these issues and introduce the concept of integrative analytics, specifically in materials science. As a case study, we will present how TOPCOMBI, a project for Nanotechnologies and Nanosciences funded by the European Commission, deals with the exploitation of incongruent information and protocols issued from high-throughput campaigns and conventional experimentation using scientific workflows.
In silico compound activity reprofiling
A. W. Edith Chan1, Richard J Fagan2, and John P Overington2. (1) Inpharmatica, Commonwealth House, 1 New Oxford Street, WC1A 1NU London, United Kingdom, (2) Inpharmatica Ltd, 1 New Oxford Street, London, WC1A 1NU, United Kingdom
The human genome offers many potential novel drug targets. This opportunity demands better ways to translate potential molecular targets into disease-relevant therapeutics. Cell-based assays discover compounds with activity against a signalling pathway rather than a specific protein. Unfortunately, identification of the molecular target requires a labor-intensive, time-consuming experimental strategy. The Chematica platform enables molecular target prediction by searching for chemical structural similarity in its molecular databases. It consists of multiple chemogenomics databases and various cheminformatics and bioinformatics tools. These databases contain highly curated compound, assay activity, molecular target, and SAR data abstracted from 20 years of medicinal chemistry literature. The premise is that if two compounds are highly similar, their bioactivities may be similar too. Target prediction was performed on novel compounds identified in a cell-based screen. 70% of the compounds were highly similar to those in the databases, and for about 25% of these, target predictions were made with high confidence.
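The similarity premise stated above is typically quantified with the Tanimoto coefficient between structural fingerprints. The sketch below is an illustrative Python fragment over feature sets (real platforms use hashed bit vectors); the feature names and the 0.85 cutoff are assumptions for the example, not values from the talk.

```python
# Tanimoto similarity between two compounds' feature sets (illustrative).

def tanimoto(a, b):
    """|A intersect B| / |A union B| for two feature sets; 1.0 = identical."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

known = {"benzene", "amide", "piperidine", "fluoro"}   # compound with known target
novel = {"benzene", "amide", "piperidine", "chloro"}   # screening hit

sim = tanimoto(known, novel)
print(round(sim, 2))   # 0.6 -- three shared features out of five total
if sim >= 0.85:        # an assumed similarity cutoff for confident prediction
    print("predict shared target")
```

In a target-reprofiling workflow, each screening hit would be compared against every annotated compound in the database, and targets of the best-scoring neighbors proposed as candidate mechanisms.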
Interplay of sequence and structure: Extending the limits of detectability of distantly-related proteins
Nathalie Meurice, F.N.R.S. Postdoctoral Researcher, Department of Pharmacology and Toxicology, University of Arizona, College of Pharmacy, Bio5 Institute, Tucson, AZ 85721, Daniel P. Vercauteren, Laboratoire de Physico-Chimie Informatique, University of Namur, Rue de Bruxelles, 61, B-5000 Namur, Belgium, and Gerald M Maggiora, Department of Pharmacology and Toxicology, University of Arizona, College of Pharmacy, Bio5 Institute, Tucson, AZ 85721.
The combined efforts in genome sequencing projects and structural genomics initiatives are generating massive amounts of protein sequence and structure data. However, molecular function remains unknown for many of these proteins, even when their folds are known. Thus, the relationship to homologs of known function, if it exists, is likely very distant, and their function cannot be reliably identified with sequence-based methods alone. In this context, we carried out a comprehensive analysis of sequence and structure similarity of 18 proteins from the metzincin family that indicates the increasing importance of structural similarity when both sequence and structure have diverged. Because structure diverges less than sequence in remote homologs, structure-derived patterns can reveal the features that traditional sequence comparison methods cannot possibly capture. In extreme cases, functional residues are isolated in 3-D space and sequences differ to such an extent that related proteins can only be detected from 3-D structure.
Interpretable correlation descriptors for quantitative structure-activity relationships
James L. Melville and Jonathan D. Hirst. School of Chemistry, University of Nottingham, University Park, Nottingham NG7 2RD, United Kingdom
New Topological Maximum Cross Correlation (TMACC) descriptors for the derivation of quantitative structure-activity relationships (QSARs) are presented, based on the widely used autocorrelation method. They require neither the calculation of three-dimensional structures nor alignment. We have validated the TMACC descriptors across eight literature datasets, ranging in size from 66 to 361 molecules. In combination with partial least squares regression, they perform competitively with a current state-of-the-art 2D QSAR methodology, hologram QSAR (HQSAR), yielding superior leave-one-out cross-validated coefficients of determination (LOO q2) for six datasets, illustrating their wide applicability. Like HQSAR, these descriptors are interpretable, but they do not require hashing.
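The autocorrelation method the TMACC descriptors build on can be sketched briefly: for each topological (bond-count) distance d, sum the products of an atomic property over all atom pairs that far apart. The Python below is an illustrative simplification, not the TMACC definition itself; the graph and charge values are invented.

```python
# Topological autocorrelation descriptor sketch (illustrative).

from collections import deque

def bond_distances(adjacency, start):
    """Breadth-first topological (bond-count) distances from one atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        a = queue.popleft()
        for b in adjacency[a]:
            if b not in dist:
                dist[b] = dist[a] + 1
                queue.append(b)
    return dist

def autocorrelation(adjacency, prop, max_d=3):
    """A(d) = sum of prop[i] * prop[j] over unordered atom pairs at distance d."""
    acc = {d: 0.0 for d in range(1, max_d + 1)}
    for i in sorted(adjacency):
        dist = bond_distances(adjacency, i)
        for j in sorted(adjacency):
            if j > i and 1 <= dist[j] <= max_d:
                acc[dist[j]] += prop[i] * prop[j]
    return acc

# Propane-like chain 1-2-3 with made-up partial charges as the property
chain = {1: [2], 2: [1, 3], 3: [2]}
charge = {1: 0.1, 2: -0.2, 3: 0.1}
print(autocorrelation(chain, charge))  # A(1) ~ -0.04, A(2) ~ 0.01, A(3) = 0
```

Concatenating such A(d) values for several properties and distances gives a fixed-length, alignment-free vector per molecule, which is why descriptors of this family feed directly into partial least squares regression.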
Local lazy regression: Making use of the neighborhood to improve QSAR predictions
Rajarshi Guha1, Debojyoti Dutta2, Peter C. Jurs1, and Ting Chen2. (1) Department of Chemistry, Pennsylvania State University, 104 Chemistry Building, University Park, State College, PA 16802, (2) Department of Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089 - PDF
Traditional QSAR models aim to capture global structure-activity trends. In many situations, however, there may be groups of molecules that exhibit a specific set of features related to their activity. Such a group of features can be said to represent a local structure-activity relationship. We describe the use of local lazy regression (LLR), which obtains a prediction for a query molecule using its local neighborhood rather than considering the whole dataset. This modeling approach is useful for large datasets since no a priori model is built. We have applied the method to three biological datasets, where we observe improvements in RMSE ranging from 2% to 8% for external prediction sets. The approach also explains why a global model behaves poorly for some molecules. On the other hand, certain molecules are poorly predicted by the LLR method, and we discuss the underlying problem as well as possible improvements based on descriptor distributions.
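The lazy, neighborhood-based prediction described above can be sketched as follows. This is a simplified illustration, not the authors' method: a distance-weighted average over the k nearest neighbors in descriptor space stands in for the local regression models of the talk, and the descriptors and activities are invented.

```python
# Lazy neighborhood-based QSAR prediction (simplified illustrative sketch).

import math

def predict(query, train, k=3):
    """train: list of (descriptor_vector, activity) pairs.
    No global model is fit; each query is answered from its k nearest
    neighbors, weighted by inverse distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    nearest = sorted(train, key=lambda t: dist(query, t[0]))[:k]
    weights = []
    for x, y in nearest:
        d = dist(query, x)
        if d == 0:
            return y            # exact match: return its activity directly
        weights.append((1.0 / d, y))
    total = sum(w for w, _ in weights)
    return sum(w * y for w, y in weights) / total

# Invented data -- descriptors: (logP, polar surface area / 100); activity: pIC50
train = [((1.2, 0.6), 5.1), ((1.3, 0.5), 5.3), ((3.0, 0.2), 7.8),
         ((2.9, 0.3), 7.5), ((0.5, 0.9), 4.2)]
print(round(predict((1.25, 0.55), train), 2))  # close to the 5.1/5.3 cluster
```

Because each prediction is assembled at query time, adding new training compounds requires no refitting, and a query's neighborhood also diagnoses failures: if its nearest neighbors are distant or have wildly varying activities, the prediction is flagged as unreliable.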
MCDL: A new public domain chemical information toolkit
Andrei A. Gakh1, Michael N. Burnett1, Sergei V. Trepalin2, Alexander V. Yarkov2, and Igor V. Pletnev3. (1) Oak Ridge National Laboratory, Oak Ridge, TN 37831-6242, (2) Institute of Physiologically Active Compounds RAS, 142432 Chernogolovka, Moscow Region, Russia, (3) Chemistry Department, Lomonosov Moscow State University, GSP-3 Vorobyovy Gory, Moscow, 119899, Russia
The Modular Chemical Descriptor Language (MCDL) was developed as a decentralized public domain tool [1,2]. It was designed to evolve so that it can better fit the diverse needs of the scientific community, and so that valuable modifications can be shared among users and then formally incorporated in standardized form. Another feature of the MCDL is its all-inclusive nature, conceived with the understanding that no known chemical language is perfect, and that the best way to describe a molecular object is to preserve the advantages of many chemical descriptors by presenting them together for a particular compound. The MCDL concept will be discussed in the context of the recently introduced set of supplementary modules designed to handle stereoisomers, mixtures, tautomers, and other molecular systems. An overview of several new MCDL software packages will also be provided, including LINDES 2.7 and the Java™ MCDL chemical editor. [1] J. Chem. Inf. Comput. Sci. 2001, 41, 1494-1499. [2] Molecules 2006, 11, 129-141.
Perchlorate and the press: Reporting on ambiguity
Margaret W. Batschelet, Communication Department, University of Texas at San Antonio, One UTSA Circle, San Antonio, TX 78249-1644 and William H. Batschelet, Air Force Center for Environmental Excellence, 3300 Sidney Brooks, Brooks City-Base, TX 78235-5112 - PDF
Press coverage of perchlorate exemplifies the problems in reporting on contaminants with ambiguous health implications.
Measurement of perchlorate occurrence in water and food was not possible until the development of an ion chromatographic method in 1995. Since that time, detection levels have decreased from hundreds of parts per billion to tens of parts per trillion. With this improved analytical sensitivity has come increased frequency of detection. As the frequency of finding perchlorate has increased, public debate has centered on the safe level for daily consumption of perchlorate. Press coverage of perchlorate contamination has mirrored this growing controversy, rising from a total of 94 articles listed in the Lexis/Nexis news database in 2000-2001 to 579 in 2004-2005. Yet the complexity of perchlorate data--the level of its natural occurrence, the difficulty in gauging the safe dosage of perchlorate in food and water, even the disagreement over the nature of perchlorate pollution--has not always been adequately reported. This study of 74 news reports on perchlorate levels in food found that while the reports usually included basic facts about perchlorate, they frequently failed to adequately explain the significance of those facts. Moreover, the language used to describe perchlorate contamination included emotionally charged terms that tended to detract from the scientific evidence. Press coverage of perchlorate suggests the difficulties of reporting on complex substances, difficulties that may become more problematic as lower detection levels for more compounds become more common.
UsefulChem project: Open source chemical research with blogs and wikis
Jean Claude Bradley, Khalid Mirza, Alicia Holsey, Brett Rosen, James Giammarco, and Julimarie DeNicco. Department of Chemistry, Drexel University, 3141 Chestnut Street, Philadelphia, PA 19104 - PPT
The UsefulChem project is an initiative to carry out open source science based on current problems likely to have chemical solutions in the immediate future. An important aim of the initiative is to report progress in a public and transparent manner, inviting contributions and comments from the community. The most active current project involves the synthesis and testing of new anti-malarial compounds. An overview of the projects and ways of contributing is maintained on the UsefulChem Wiki. Discussion of the synthetic strategies and objectives is carried out in the UsefulChem Blog. Molecules of interest are housed in the Molecules Blog. The UsefulChem Experiments Blog serves as a common public laboratory notebook detailing the experimental work of students in the laboratory in near real-time. Project info at http://usefulchem.wikispaces.com