#229 - Abstracts

ACS National Meeting
March 13-17, 2005
San Diego, CA

Please note: Presentations given at CINF symposia have been posted to the CINF website with express permission granted by the authors who retain the original copyright. These presentations are for information purposes only and cannot be further disseminated without the author's prior written permission.

8:00 1 The importance of being Ernest: Why gathering and cleaning all the relevant data matters for patent analysis
Anthony J. Trippe, Science IP/Chemical Abstracts Service, 2540 Olentangy River Rd., Columbus, OH 43210, atrippe@cas.org - SLIDES

More and more in the process of making critical business decisions, technical patent and non-patent information is used as a means to determine competitive position and formulate company strategy on technical subjects. The importance of having all the relevant data available for analysis and having that data normalized so accurate statistics can be generated cannot be overstated.

The purpose of this talk will be to examine the requirements for ensuring that, as much as possible, all of the pertinent data, whether from patent or non-patent sources, has been gathered. Further, this presentation will examine the pitfalls of performing an analysis on an incomplete data set or on a collection which has not been cleaned and normalized. Specific examples from the author's personal experience will be shared.

8:30 2 Patent analysis: The technical intelligence professional’s adjustable spanner
Robert A Stembridge, Global Marketing Services, Thomson Scientific, 14 Great Queen Street, London, United Kingdom, bob.stembridge@thomson.com - SLIDES

The use of patent analysis in tracking the development and evolutionary trends within a technology is a vital component in the technical intelligence professional's toolbox. However, as with any specialist tool, a degree of knowledge and experience goes a long way towards using the tool safely.

Using a case study approach, we will explore some of the issues in using patent information and illustrate how valuable technical intelligence can be derived from judicious use of patent analysis.

9:00 3 Technology oriented competitive intelligence: A primer
Bruce Mason, Research, Development & Technical Services, CIBA Vision Corporation, 11460 Johns Creek Parkway, Duluth, GA 30097, Fax: 1-678-415-7467, bruce.mason@cibavision.novartis.com - SLIDES

Competitive or business intelligence comes in many shapes and flavors. Think of a business function and place the term in front of the word intelligence and you have described a facet of competitive intelligence. For example, marketing intelligence, sales intelligence, distribution intelligence, manufacturing intelligence, human resource intelligence, are but a few. Technical or technology intelligence is given significant emphasis in many organizations because it historically has been linked to R&D, research trends, scientific breakthroughs and innovation. In other words, technology oriented competitive intelligence encompasses how a competitor does things, e.g., develops new products or services, manages processes, responds to scientific advancements that impact its industry, and interacts with its customers and suppliers. An overview of technology oriented competitive intelligence, what it is, who is served, analytical tools, and frameworks for assessing competitors' technology will be discussed.

9:30 4 Rapid technology intelligence process
Alan L. Porter, R&D, Search Technology, Inc, 4960 Peachtree Industrial Blvd, Norcross, GA 30071-1580, Fax: 770-263-0802, aporter@searchtech.com - SLIDES

Technical intelligence conveys value when it affects decision processes. Too many managers and professionals have come to disregard technical intelligence because it has been too slow to provide timely guidance. This is changing. I indicate how combining five key features enables rapid technology intelligence processes (RTIP):
• Immediate desktop access to science & technology database search results
• Standardized sets of technology questions
• Templates of “innovation indicators” to answer specific questions to profile a technology or an organization
• Analytical software tuned by macros to clean those data and generate results pertinent to the question at hand in seconds
• Wizards to guide the user to the right answers, presented in the right form, for the target audience

RTIP answers certain technical questions in minutes. It provides essential empirical evidence to inform strategic technology decisions in hours. These capabilities can, and will, change the very nature of technology management.

10:15 5 PatGen DB: A consolidated genetic patent database platform
Richard JD Rouse, PatentInformatics, Inc, PO Box 948586, La Jolla, CA 92037, rjdrouse@patentinformatics.com - SLIDES

Patent information is voluminous. According to the 2003 United States Patent and Trademark Office (USPTO) annual report, the office received 333,452 applications; this amounts to roughly 913 applications a day. Compared to the wealth of online resources covering genomic, proteomic and derived data, the scientific community is rather underserved when it comes to patent information related to genetic sequences. Here we describe PatGen DB, an integrated database containing data from bioinformatic and patent resources. This resource is an open-ended service designed to enable customized searching and database compilation. Features of PatGen DB can be searched at http://www.patgendb.com where bibliography, taxonomy and sequence search tools are provided.

10:45 6 Globalization trends measured via patent analysis
Anthony F. Breitzman Sr., CHI Research, Inc, 10 White Horse Pike, Haddon Heights, NJ 08035, Fax: 856-546-9633, abreitz@chiresearch.com

There have been a lot of discussions of globalization and outsourcing of jobs in manufacturing industries, but virtually no discussion concerning globalization of R&D. Trends in patent activity show increased globalization in R&D. In the last 10 years, large companies have increased R&D efforts outside their home countries. As an example, General Electric has always had facilities all over the world, but until very recently virtually all of its R&D was done in the US. In 1994, 96% of its patents were invented in the US; in 2004, that number is 86%. The numbers are similar for many US companies. In Japan we see the same thing. In 1994, 99.5% of Canon's US patents were invented in Japan. In 2004, that number is down to 94% and the number for Sony is down to 85%. This has huge implications in terms of jobs, since a 5% decline in patents invented in the US translates to a $300 billion+ drop in GDP, but it also has implications for competitive intelligence, counter-intelligence, etc. In this study we examine US and EP patents in order to analyze these trends. Questions we consider include: Is globalization of R&D occurring? Is it a positive or negative for US companies? That is, is the US a net exporter or importer of R&D jobs in recent years? Finally, we take a similar look at companies in Japan and Europe and attempt to answer the same questions.

11:15 7 Assembling the information mosaic
Donald Walter, Customer Training, Thomson Scientific, 1725 Duke Street Suite 250, Alexandria, VA 22314, Fax: 703 519 5838, Don.Walter@Thomson.com - SLIDES

Technical Intelligence requires information from many sources and disciplines, and of many types. This talk will focus on the integration of patent, technical and business information as raw material for analysis. Case studies will show how the information mosaic can be assembled into an informative picture.

11:45 8 Analyzing and presenting chemical structural information in support of competitor or technology assessment
Kerry G. Stanley, Science IP, Chemical Abstracts Service, 2540 Olentangy River Rd, Columbus, OH 43202-1505, Fax: 614-447-5627, kstanley@cas.org - SLIDES

In many areas of research any truly diligent review of the technological landscape will require an assessment of a structural profile. This may be most evident in exploring SAR relationships in the pharmaceutical industry but may apply to other areas as well; for instance an analysis of the monomeric components imparting specific properties to a class of polymers. This talk will present several case studies where a "R-group Analysis" of a class of chemical structures will provide insights into the chemical approaches explored by differing competitors within a research area. Similar analytical approaches may be used to showcase the similarities, and most importantly, the differences in a class of compounds otherwise viewed only at the individual substance level. For an organization interested in innovating in a crowded art space this type of analysis is useful for identifying which organizations have covered what structural modifications to a central class of compounds or around a specific scaffold.

1:30 9 Start-up companies and chemical informatics: A professional service provider's perspective
Robert D. Feinstein, Kelaroo, Inc, 312 S. Cedros Ave., Suite 320, Solana Beach, CA 92075, rdf@kelaroo.com

Kelaroo integrates and enhances the drug discovery efforts of companies through a combination of cheminformatics products and professional services. We have worked with dozens of start-ups and other small drug discovery companies. Most small companies face similar challenges in terms of balancing basic research needs, budget constraints and resource issues. However, start-ups typically strive for novel research techniques that may not conform to commercially available software and database solutions. We will present our perspective on how start-ups and other small drug discovery companies can best prioritize and implement solutions to their cheminformatics needs. Examples will include commercial and custom systems for reagent management and procurement, library enumeration, compound registration and archival, and biological data management.

1:55 10 Developing an hepatotoxicity database
James Kelly, Amphioxus Cell Technologies, Inc, 11222 Richmond Ave, Suite 180, Houston, TX 77082, jkelly@amphioxus.com

Amphioxus Cell Technologies has developed a series of tools for high throughput hepatotoxicity testing. These tools are intended to be used early in the drug discovery process so that structure toxicity relationships can be developed along with SAR, allowing compounds to be optimized for toxicity and activity simultaneously. It became clear that in order for this system to be truly useful, we needed to develop a database of compounds that had been screened through the assays. This would allow our customers to place the results in the context of other known toxins and structurally related compounds. We set about screening several thousand known compounds in each of seven assays at multiple concentrations. We quickly realized that our then-current information resources were insufficient. We needed a system that could group the results according to chemical structure and that would allow structural searches within the database, yet we had only rudimentary knowledge of chemistry-based software. With the help of MDL Information Systems, we were able to implement a relatively sophisticated system quickly and inexpensively without the addition of substantial information technology resources.

2:20 11 Battling the data avalanche: A chemical data management solution for the start-up company
Antony Williams, Advanced Chemistry Development, 90 Adelaide Street West, Suite 600, Toronto, ON M5H 2L3, Canada, tony@acdlabs.com

The pharmaceutical and chemical industries are well acquainted with the challenges of managing various forms of chemical data across an organization. These challenges are augmented when considering the plight of start-up companies, whose monetary and human resources are often severely compromised relative to the need to manage the volumes of chemical data they are generating.

This talk will discuss the emergence of a novel database software system designed for standardizing and consolidating chemical information company-wide. The software integrates chemical structures with images, reaction diagrams, documents, and text in a manner that is customizable to the user, and thus is malleable to the specific data management needs of an organization. Databases that are built in this system are searchable by chemical structure, sub-structure, text, and other user-defined data fields. The databases can be distributed via thick client or shared across an organization via a web interface. Such databases are easily accessible by all beneficiaries in the company, and can be connected to commercial tools for physical property and spectroscopy prediction, systematic nomenclature generation, and analytical data management (for example, NMR, MS, IR, UV, HPLC, and GC).

2:45 12 Integrating ISIS/Host RCG databases with other applications
Mark Runyan, Richard Sandstrom, Julie Myhre, Alex Tulinsky, and Ambrogio Oliva, Cell Therapeutics, Inc, 501 Elliott Avenue West, Suite 400, Seattle, WA 98119, mrunyan@ctiseattle.com

Our group deploys a variety of cheminformatics and biological database management software at CTI. Most data originates with the registration of new chemical entities in an ISIS/Host Relational Chemical Gateway (RCG) database; therefore we must integrate ISIS/Host RCG databases with a variety of other systems which manage related information. We look for the simplest and most direct method of integration, but our decisions are always guided by the requirements and capabilities of the third party application. Common methods are direct access and replication, both of which rely on the underlying framework of the Oracle relational database on which ISIS/Host RCG is based. Specific integration techniques and associated implementation details will be discussed in the context of CTI's Scientific Systems environment, which includes ISIS/Host, IDBS ActivityBase, Chemical Computing Group's Molecular Operating Environment, and other third party applications.

3:20 13 Capturing and aggregating large-scale discovery data in a start-up environment
Susan M. Baxter, National Center for Genome Resources, 2935 Rodeo Park Drive East, Santa Fe, NM 87505, smb@ncgr.org, Jacquelyn Fetrow, Departments of Physics and Computer Science, Wake Forest University, and Stephanie J. Reisinger, ProSanos Corporation - SLIDES

GeneFormatics' founding target identification technology was ideally suited as a platform for lead discovery and evolved into an in-house, centralized application for matching small molecules with human protein tyrosine phosphatase targets. The large-scale approach taken by GeneFormatics (GFI) for target and lead discovery required relational databases and applications to automate workflow, to compute and update large amounts of sequence information regularly, to manage intellectual property, and, importantly, to reliably and quickly deliver information to customers. The major challenges faced by GFI were the integration of disparate, genomic-scale databases, and the rapid development of an automated work-flow to manage and analyze the data. To solve this, GeneFormatics used a multidisciplinary team of research scientists, who articulated the short and long-term needs, and professionally trained software and database engineers, who quickly translated those needs into useful and validated software applications.

3:45 14 Mobilizing published data to make informed drug discovery decisions
Russ Hillard, Marketing, Elsevier MDL, 14600 Catalina Street, San Leandro, CA 94530, russ@mdli.com

High throughput techniques in chemistry and biology generate ever-increasing volumes of chemical structures, physical properties, and bioassay data. Much of this data is indexed in databases, posted on web sites or divulged in patents, conferences, journals and reviews.

The researchers' challenge is to extract actionable information – without being experts in locating data sources and using multiple search applications. Typical questions concern which chemical series to explore, the synthesis or modification of compounds, purchasing of starting materials, pharmacological profiles, metabolic liabilities or toxic properties, and safety issues.

This presentation focuses on using Elsevier MDL's DiscoveryGate service to answer such questions. DiscoveryGate delivers key chemistry-related databases from a variety of sources including CrossFire Beilstein, ChemInform, MDL Available Chemicals Directory, MDL Drug Data Report, MDL Toxicity and MDL Metabolite. It is linked by reaction type to major reference works on chemical synthesis. Researchers can view cited papers or patents from licensed electronic repositories such as ScienceDirect. We will compare the use of DiscoveryGate with other non-integrated sources and discuss searching workflows and strategies.

4:10 15 The Vault, ArQule’s dry compound archive
Rebecca J. Carazza, Research Informatics, ArQule Inc, 19 Presidential Way, Woburn, MA 01801, rcarazza@arqule.com

ArQule's strategy in January 2002 required that we leverage our scientific excellence and resources to support our transition into a recognized R&D organization with compounds in clinical development. With this, we identified the need to change from a solution phase, plate based storage of compounds with limited characterization to a fully managed dry compound archive with increased characterization of compounds to support compound identification as well as legal needs. In less than four months' time, with equipment costs under $30K, ArQule defined and implemented robust new processes, including software and hardware systems to manage dry compounds. The new processes included preparing and characterizing, submitting, storing and handling, requisitioning and dispensing of dry compounds that have been synthesized as singletons or in high-throughput production.

4:35 16 Extracting knowledge and delivering data: From the analytical laboratory to the chemist's desktop using web-enabled technologies
Antony John Williams, Scientific Development, Advanced Chemistry Development, 90 Adelaide Street West, Suite 600, Toronto, ON M5H 3V9, Canada, Fax: 416-368-5596, tony@acdlabs.com

Walk-up or open-access laboratories have dramatically impacted the ability for a small organization to support the analytical needs of its chemists. Commonly, skilled professionals assume the duty of laboratory manager as well as skilled technical consultant. As part of this responsibility one challenge is the distribution of data from the instruments to the chemist as well as providing enabling technologies to extract full-value from the data. Open-access laboratories are heterogeneous in nature requiring that data from a series of techniques can be distributed in a homogenizing fashion. The world-wide web has certainly assumed the primary mantle of electronic communication nowadays and would be assumed to be an ideal solution for analytical data dissemination as well as management and distribution of the extracted knowledge. This talk will detail technical approaches for the delivery of heterogeneous analytical data, including integrated chemical structures, to an organization.

1:00 17 What do they want from me? A chemistry librarian explores liaison needs and desires
Beth Thomsett-Scott, Reference and Information Services, University of North Texas Libraries, P.O. Box 305190, Denton, TX 76226, Fax: 940-565-3695, bscott@library.unt.edu

Have you ever wondered what a liaison librarian does? What their role is in providing services to an active chemistry department in an academic library? What do the faculty members, students and staff want from the library? This session will answer many of your questions!

Three years ago, I became a Chemistry Liaison Librarian at the University of North Texas. My last chemistry course was in 1986! The challenges and thrills that have occurred since then will be presented. Lessons learned and recommended preparations will be discussed. Survey results and comments from chemistry faculty on the traits and skills they desire in chemistry librarians and what they want from a chemistry liaison librarian will be offered. Examples and advice from other practicing chemistry librarians will be included to provide a well-rounded information session.

1:25 18 Opportunity knocks: Chemical information careers in industry
David A. Breiner, Technical Information Center, Cytec Industries Inc, 1937 West Main Street, Stamford, CT 06904, Fax: 203-321-2985, david.breiner@cytec.com - SLIDES

Working in an industrial information center requires a vast array of skills and talents, and can be an extremely rewarding and challenging career. Whether searching online databases, designing educational webpages, or conducting training sessions, today's information professionals must understand their customers' needs first and foremost. The rapidly changing technology landscape requires information professionals to proactively deliver valuable solutions and services that drive productivity for their organization. Simply stated, they must get the right information to the right people at the right time. Therefore, developmental opportunities must always be sought to gain the necessary experience to be successful in industry.

This presentation will reflect on a 14 year career in chemical information ranging from sales to management. Highlighted experiences will include working as an account representative, searching chemical and patent literature, training end-users, building websites, and managing a technical information center. Lessons learned and career strategies will also be shared.

1:50 19 From lab chemist to patent searcher: Why, what, and how
Randall K. Ward, Science & Maps, Brigham Young University, Harold B. Lee Library 2320, Provo, UT 84602, Fax: 801-422-0466, randy_ward@byu.edu - SLIDES

If one is a practicing lab chemist and is looking at different career options, patent/information searching is one to seriously consider. In the form of questions, this presentation will specifically cover three aspects of becoming a patent searcher. First explored will be “Why would one want to be a patent searcher?” Most of the observations in this section come from years of personal experience. Secondly, “What does a patent searcher do?” This section will cover the kinds of work involved as well as a typical “day in the life of . . .”. The third question is “How would one become a patent searcher?” In this section, some common threads in the career progression to patent searching will be explored as well as the author's own personal path. Interspersed within the presentation will be slightly liberal doses of advice on career planning from the author's own experience.

2:15 20 Chemical information careers at U.S. GOCO research laboratories
Diane M. Kozelka, Rio Rancho, NM 87144

Technical information specialists are continually challenged when helping their customers find the right answer for that obscure question. When working for a government-owned, contractor-operated (GOCO) facility, that usually occurs every week! The technical information needs of a GOCO technical library are very similar to those of any other technical library, with one large exception -- classified requests. I will discuss the unique resources that a GOCO technical library has access to (especially since 2001), and which resources are available to non-governmental organizations as well. Additional information about the US GOCO labs will be mentioned, if time permits.

2:40 21 Chemical information in not-for-profit nirvana
Anne T. O'Brien, Creative Connections, 15 Crest Drive, Tarrytown, NY 10591-4305, Fax: 914-631-5241, ronanne@attglobal.net - SLIDES

Foundations, societies, public radio and television, hospitals, high schools, public health, emergency, and world service organizations have purpose. They serve society and each of us. What is particularly challenging, especially demanding, most rewarding for a chemical information professional working in these environments? Which individual human traits are needed? What are the unusual opportunities? What is uniquely compelling about working in these surroundings? Why do individuals choose the non-profit sector? The presentation will use examples from well-known organizations to initiate discussion of the financial, human, technical, time-pressure, and career development challenges – and the potent corresponding rewards – of serving in such settings.

3:05 22 So you are thinking of becoming an online information entrepreneur
Alan Engel, Paterra, Inc, 526 N Spring Mill Road, Villanova, PA 19085-1928, Fax: 610-527-2041, aengel@paterra.com

The financial and technical barriers to becoming an online information vendor are as low as they have ever been. The sci-tech information market is broken and in need of innovation. Open Access and other initiatives are roiling the waters and making raw information materials increasingly available. Is it time to contribute your talents to the fray as an online information entrepreneur? The author will provide pointers drawn from 18 years of experience as an independent consultant, translator and online information vendor.

3:30 23 Careers in science writing and publishing
Lynne Friedmann, Freelance Science Writer, P.O. Box 1725, Solana Beach, CA 92075, Fax: 858-793-1144, lfriedmann@nasw.org

To individuals who love science but not necessarily lab work, science writing sounds appealing as a career alternative. But it's a highly competitive field that requires specialized training and, in many cases, the mind-set of a small-business owner. People who write about science for a living fall into two broad categories: 1) science journalists who are staff reporters for news organizations or freelance writers who write for magazines and the Web, and 2) science writers who find work as public information officers for universities, government science agencies, and research institutions or as public relations professionals for industry. In the publishing arena, technically trained individuals work as acquisition editors for major publishing houses or university presses. Nonfiction book writers author original works, co-author/edit books with other scientists, or "ghost write" manuscripts. The common denominator in all these endeavors is communicating science in an accurate yet compelling manner. Training requirements, science-writing programs, lifestyle issues, and strategies for entering the field and building a science-writing career will be discussed.

3:55 24 Career opportunities in computational chemistry and computer-assisted drug design
J. Phillip Bowen, Center for Drug Design, Department of Chemistry and Biochemistry, University of North Carolina at Greensboro, 401 New Science Building, PO Box 26170, Greensboro, NC 27402-6170, Fax: 336-334-5402, jpbowen@uncg.edu

Computer-based methods have changed the world, particularly scientific research. Computational chemistry may be defined as the use of theory and computer technology to calculate molecular structures, properties, and related effects. Today computational chemistry methods are widely used in industrial and academic settings throughout the world to gain insight into chemical and biochemical problems at the molecular level. Over the years the uses of computer-based methods in drug design have been successful in predicting biological activity. With the increasing awareness of the power of computational chemistry, new career opportunities have emerged. This presentation will focus on discussing career options in computational chemistry.

8:40 25 Sharing chemical information without sharing chemical structure
Lingling Shen1, Karl M. Smith2, Brian B. Masek2, and Robert S. Pearlman1. (1) Laboratory for the Development of CADD Software, University of Texas, College of Pharmacy, Austin, TX 78712, Fax: 512-471-7474, shenl@list.phr.utexas.edu, bob.pearlman@optive.com, (2) Optive Research, Inc

There are various reasons for which scientists might want to share measured and/or calculated properties or “descriptors” of chemical compounds without revealing the actual chemical structures of those compounds. However, there is growing concern that, using emerging software technology, the chemical structures could be deduced from the chemical information which is shared.

We will briefly describe software technology which, unless precautions are taken, can indeed be used to deduce chemical structures from chemical descriptors. We will also discuss how the ability to deduce structure depends upon which descriptors or which combinations of descriptors are used. Lastly, we will suggest a simple but very effective mechanism by which chemical information (descriptors) can be shared in a manner which enables the desired use of the information but which thwarts efforts to deduce the corresponding chemical structures.

9:10 26 How to reveal without revealing
Ruben Abagyan1, Eugene Raush2, and Levon Budagyan2. (1) Department of Molecular Biology, The Scripps Research Institute, 10550 North Torrey Pines Road TPC-28, La Jolla, CA 92037, abagyan@scripps.edu, (2) R&D, Molsoft LLC

Safe exchange of data associated with a chemical compound, along with the essential descriptors of the compound but without revealing its structure, is highly desirable. Solving this problem may dramatically expand the public knowledge base on physico-chemical and biological properties of compounds. We present statistical analysis of the difficulty of deciphering the chemical structure and make recommendations on how to modify this process to make it more robust and safe.

One idea is to add artificial numerical noise to the descriptors to the degree which can be tolerated by the property prediction methods. For example, knowing the molecular mass of a compound to four-to-five decimal places is sufficient to derive the molecular formula (still not the structure), while knowing the molecular mass to 1-to-10 dalton accuracy makes cracking the formula next to impossible. At the same time, the druggability rules may easily tolerate that 1-to-10 dalton uncertainty in the mass value.
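The effect of mass precision described above can be illustrated with a rough sketch. This is not the authors' method: it is a brute-force enumeration restricted to C, H, N and O, it applies no valence or other chemical plausibility filter, and the helper names (`mass`, `formulas_within`) and the choice of glucose as the test compound are assumptions made here for illustration only.

```python
# Monoisotopic masses in daltons for the most abundant isotopes.
MASS = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}


def mass(c, h, n, o):
    """Monoisotopic mass of the formula C_c H_h N_n O_o."""
    return c * MASS["C"] + h * MASS["H"] + n * MASS["N"] + o * MASS["O"]


def formulas_within(target, tol):
    """List every (c, h, n, o) whose mass lies within tol of target.

    Hypothetical helper: plain enumeration, no chemical validity checks.
    """
    hits = []
    limit = target + tol
    for c in range(int(limit // MASS["C"]) + 1):
        rest_c = limit - c * MASS["C"]
        for n in range(int(rest_c // MASS["N"]) + 1):
            rest_n = rest_c - n * MASS["N"]
            for o in range(int(rest_n // MASS["O"]) + 1):
                rest_o = rest_n - o * MASS["O"]
                for h in range(int(rest_o // MASS["H"]) + 1):
                    if abs(mass(c, h, n, o) - target) <= tol:
                        hits.append((c, h, n, o))
    return hits


glucose = (6, 12, 0, 6)              # C6H12O6, used only as an example
target = mass(*glucose)
narrow = formulas_within(target, 0.0005)   # mass known to ~4 decimals
wide = formulas_within(target, 5.0)        # mass blurred by several daltons
print(len(narrow), "candidate formulas at +/-0.0005 Da")
print(len(wide), "candidate formulas at +/-5 Da")
```

Because no chemical rules are applied, both counts overstate the number of plausible formulas; the point is only the relative growth of the candidate set as the mass is blurred.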

The deciphering complexity depends strongly on the initial conditions of the task. There are two radically different situations, namely, searching among the ~20 million available/known compounds, or searching among a virtually infinite number of the theoretically possible compounds. We demonstrate that recognizing a compound from a database of available compounds using a set of descriptors is a relatively easy but not always unambiguous task. However, finding a non-available theoretical compound using rounded or distorted numerical descriptors, as well as finite length chemical fingerprints is practically impossible.

9:40 27 Reverse engineering chemical structures from molecular descriptors: How many solutions?
Jean-Loup Faulon, William M. Brown, and Shawn Martin, Computational Biology Dept, Sandia National Laboratories, P.O. Box 969, MS 9951, Livermore, CA 94551, Fax: 925-924-3020, jfaulon@sandia.gov

Physical, chemical and biological properties and activities are the ultimate information of interest for chemical compounds. Regardless of the information sharing system one designs, this system should allow for the calculation of such properties and activities. Molecular descriptors that map structural information to activities and properties are obvious candidates for information sharing. In this talk we examine to what extent the sharing of chemical descriptors is safe, by computing how many structures in the chemical universe match a given set of descriptor values. Precisely, we examine several classical 2D descriptors (from the CODESSA software package) and molecular fragments (signature descriptors) for various properties including log P and IC50. We first select sets of descriptors that provide meaningful QSARs for the chosen properties. Next, we stochastically search (using a bond swapping algorithm JCICS 1996, 43, 731) and deterministically count and enumerate (JCICS 2003, 43, 721) the compounds matching the selected descriptors.

10:10 28 Possibilities for transfer of relevant data without revealing structural information
Omoshile O. Clement and Osman F. Guner, 9685 Scranton Rd, Accelrys Inc, San Diego, CA 92121-3752, omoshile@accelrys.com

In this paper, we will discuss how we approached the problem of keeping structural information proprietary in the early years of predictive ADME/Tox model development. At that time, scientists in the industry wanted to evaluate the predictive models, but were not willing to share their structures. At the same time, the commercial model developers were willing to run the scientists' structures through the model, but they were not willing to reveal which descriptors were important for a particular predictive model. We developed a process where the scientists could perform calculations on a broad number of commercially available public descriptors and forward this property file, instead of the structures. Meanwhile, the model developer could extract those descriptors that are used in the predictive model, run the model and pass the results back to the scientist. We will discuss the pros and cons of such an approach. We propose to address questions such as: Can structural information that is proprietary be compromised from descriptors in ADME/Tox models? And can ADME/Tox predictions be made purely from descriptors without the need for explicit knowledge of chemical structures, proprietary or otherwise?

11:00 29 Screens as a secure descriptor of chemistry space
Nikolay Osadchiy and Sergey Trepalin, Department of Chemoinformatics, ChemDiv, Inc, 11558 Sorrento Valley Rd, San Diego, CA 92121, Fax: 858-794-4931, no@chemdiv.com - SLIDES

Chemical structure provides an exhaustive description of a compound, but it is often proprietary and thus an impediment to the exchange of information. An effective representation of the structural properties of a chemical library can be made with Screens, a set of substructures pertaining to this library. We define a Screen as a structural fragment centered on an atom, with a radius of N bond lengths between the central atom and the atoms maximally remote from it. Screens, and their occurrence frequencies, are gathered for each atom being used as a center and for each compound in the library. Using the Screens descriptor of a library, we can assess its similarity to another library and select compounds which enrich its chemistry space or, alternatively, fill its voids. While providing a relevant description of the compounds, the descriptor conceals real structures and can facilitate the exchange of sensitive information. A case study about Screen descriptor applications at ChemDiv will be presented.
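The idea of collecting atom-centered fragments and their frequencies can be sketched as follows. This is a rough illustration, not the ChemDiv implementation: the molecule encoding (a list of element symbols plus a list of bonded index pairs), the function names, and the fragment key (the sorted element symbols in each bond-distance shell) are all assumptions. A real Screen would keep the fragment's connectivity.

```python
from collections import Counter, deque


def screen_key(atoms, bonds, center, radius):
    """Crude key for the fragment within `radius` bonds of atom `center`.

    Assumption: the fragment is summarised by the sorted element symbols
    found at each bond distance, which discards its internal connectivity.
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)
    depth = {center: 0}
    queue = deque([center])
    while queue:  # breadth-first walk outwards from the central atom
        atom = queue.popleft()
        if depth[atom] == radius:
            continue
        for nxt in neighbours[atom]:
            if nxt not in depth:
                depth[nxt] = depth[atom] + 1
                queue.append(nxt)
    shells = [[] for _ in range(radius + 1)]
    for atom, d in depth.items():
        shells[d].append(atoms[atom])
    return tuple("".join(sorted(shell)) for shell in shells)


def screen_frequencies(molecules, radius):
    """Count fragment keys over every atom of every molecule in a library."""
    counts = Counter()
    for atoms, bonds in molecules:
        for center in range(len(atoms)):
            counts[screen_key(atoms, bonds, center, radius)] += 1
    return counts
```

Two libraries could then be compared through their `Counter` objects alone, without exchanging the molecules themselves. How much structure such counts still leak is the question the talk addresses.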

11:30 30 Why relevant chemical information cannot be exchanged without disclosing structures
Dmitry Filimonov and Vladimir V. Poroikov, Russian Academy of Medical Science, Institute of Biomedical Chemistry, Pogodinskaya Str., 10, Moscow 119121, Russia, Fax: 007-095-245-0857, dmitry.filimonov@ibmc.msk.ru, vladimir.poroikov@ibmc.msk.ru - SLIDES

For the usual confidential exchange of information between two or more persons, traditional cryptographic means can be applied. It is easy to show that any meaningful (relevant) information about chemical structures can be used to search for either a particular compound itself or its close analogues. Since the meaningful information is represented by different descriptors, a set of these descriptors can be used as a fingerprint to search for a particular molecule itself or molecules with a particular property. The success of recognition depends only on the number of descriptors used. However, this information may not be enough for appropriate QSAR/QSPR investigations. Some case studies based on the analysis of NCI and MDDR databases will be presented.

12:00 31 Are topomers a useful representation for “safe exchange of chemical information”?
Richard D. Cramer, Chief Scientific Officer, Tripos, Inc, 1699 South Hanley Road, St. Louis, MO 63144, Fax: 314-647-9241, dcramer@tripos.com

Encoding molecules into a useful but non-structurally-revealing representation is a difficult problem. Different applications will require different representations. However, for applications involving biological or other shape-related effects, topomer properties have several relevant and particularly well-characterized behaviors. Thus topomers exemplify a relatively specific candidate structure encoding, whose strengths and weaknesses as a useful representation for “safe exchange of chemical information” may be instructive to consider.

8:05 32 The perfect storm: Electronic publishing and the Internet
Stephen R. Heller, Physical and Chemical Properties Division, NIST, Gaithersburg, MD 20899-8380, srheller@nist.gov - SLIDES

The frenzy of Open Access has come to the publishing scene in the past 1-2 years like a major storm. With each month come new activities in this area. Much is being said and written about Open Access, with very strong proponents for and against Open Access.

Organizations that fail to recognize and confront technological and market changes often tend to lose their positions, if not their organizations. History is replete with such examples. In the 18th century the power looms replaced the handloom weavers. In the early 20th century the horse and buggy industry gave way to automobiles. In the late 20th century the airplane replaced the train and boat for long distance traveling. Now, at the start of the 21st century, the technology of the Internet is threatening the way in which the 3+ century old scientific publishing industry, and the libraries which subscribe to scholarly publications, have done business for many decades.

In this presentation the author promises to provide many facts, many extreme opinions, and no solutions.

8:35 33 Scientific and technological data in society
René Deplanque, FIZ CHEMIE Berlin, Franklin Str. 11, 10583 Berlin, Germany, deplanque@fiz-chemie.de

The use of scientific data has changed over the years. In the past, very large databases, both bibliographical and factual, were built up as large archiving and retrieval systems for published data. In recent years a concentration process has taken place in databank production, and hardly any new database has entered the market. The use of databases today is commonplace and they are accepted tools within the scientific working process. But with the advance of the Internet, evolving Grid technology and the open access initiatives, new ways of handling and distributing data will change the functions and applications of information systems. Whereas the user of yesterday was satisfied by finding the appropriate publication, for today's user of information systems the direct application of information within the scientific process is of greatest importance. Networking of computers to process immense amounts of experimental data, networking of experiments, and easy, inexpensive access to full-text publications are changing the scientific community. This talk will give an overview of where we are and what we have to expect next, and how this will affect the everyday work of the scientist.

9:05 34 Open access and the Chemical Semantic Web
Peter Murray-Rust, Unilever Centre for Molecular Informatics, University of Cambridge, University Chemical Laboratory, Lensfield Road, CB2 1EW Cambridge, United Kingdom, Fax: +44-1223-763076, pm286@cam.ac.uk, and Henry S. Rzepa, Department of Chemistry, Imperial College London - SLIDES

We have developed the Chemical Semantic Web so that computers can understand primary publications and act upon them. An autonomous machine could read and understand an issue from J. Med. Chem., extract the information, run high-throughput computations and systematize the results leading to new scientific insights.

For robots the most exciting and most tractable parts of scientific publications are formalized presentations of data (e.g. analytical proof of synthesis) and supplemental data (e.g. crystallography and spectra). We argue that these are "facts" under the Berne Copyright convention and therefore re-usable without hindrance. For many decades humans have manually abstracted articles and produced compilations, and we argue that robots can do the same to great communal benefit. However, it appears that some publishers now see a journal as a database and may regard chemically-aware robots as unacceptable under their license terms.

The public Semantic Web currently depends on the complete absence of barriers to the re-use of information. Robots cannot currently negotiate license agreements, log on to sites, or make micropayments. We see Open Access, especially to data, as an exciting opportunity to transform chemical informatics and provide a global knowledge base. We shall present arguments that funders, researchers, editors and readers should promote a model of publication for Open Data.

We shall provide online demonstrations of the power and potential of the Chemical Semantic Web based on Open Access to primary publications.

9:35 35 RDF-based molecular relationships, the Semantic Web and the future of scientific publishing
Henry S. Rzepa, Department of Chemistry, Imperial College London, South Kensington Campus, London SW7 2AY, United Kingdom, h.rzepa@imperial.ac.uk, Omer Casher, Information Architecture and Engineering, GlaxoSmithKline, and Peter Murray-Rust, Unilever Centre for Molecular Informatics, University of Cambridge

We describe an XML/RDF model developed to improve the classification and (open) accessibility of chemical information within the de facto output of electronic journals. This model enhances the Adobe Extensible Metadata Platform (XMP), an RDF vocabulary which can be readily embedded in text documents such as SVG or CML (Chemical Markup Language), or in a variety of binary documents which support it, such as PDF or JPEG. Molecular structures for given journal articles are represented as unique INChI identifiers and embedded in electronic articles as part of the XMP. By extracting this XMP from multiple related articles and managing it with an RDF repository, expandable lightweight Chemical Ontologies, fine-tuned to a scientist's research needs, can be auto-generated. The use of Semantic Web technologies to link the Chemical Ontology with related resources on the Web is explored. Here, using INChIs as the nodes for establishing the relationships provides a "semantically intuitive" alternative to text-based relationship mapping.

10:05 36 Movement toward open access: Why new models of research communication are inevitable
Ann J. Wolpert, Director of Libraries, Massachusetts Institute of Technology, 14S-216, 77 Massachusetts Avenue, Cambridge, MA 02139, awolpert@mit.edu - SLIDES

Advances in computing and communications technologies over the past decade have introduced significantly disruptive technologies into both the conduct of research and traditional systems of research reporting and scholarly communication. The open access “movement” developed as a response to two separate phenomena. First, researchers and educators began to use and appreciate the power of new computational and communications technologies in their research, teaching, and collaboration. Second, these same researchers and educators became aware that control over the record of published research was moving into the proprietary hands of publishers who did not always share their values, and that such control might well stifle scientific progress and diminish learning opportunities in the 21st century. Publishers, scientists, librarians, and universities need to move beyond the current narrow debate about the sustainability of 20th century publishing models. Scientists and educators will not turn back from the advantages of new computing and communication technologies. It is time to devise new models of scientific publishing that support the larger interests of research and education.

10:35 37 Open access and the BERLIN DECLARATION: The MPG strategy
Robert Schlögl, Department of Inorganic Chemistry, Fritz-Haber-Institut der Max-Planck-Gesellschaft, Faradayweg 4-6, Berlin 14195, Germany, Fax: +49-30-8413-4401, acsek@fhi-berlin.mpg.de, and Theresa Velden, Heinz Nixdorf Zentrum für Informationsmanagement in der Max-Planck-Gesellschaft

The Internet drives a transformation of the scientific discovery and dissemination processes. It is currently used as a multifunctional tool to support the traditional work flows. The vision of MPG is to integrate the Internet into the scientific work flows. The realisation of “e-science” that is attempted by research institutions world-wide requires creative solutions on several levels of legal, organisational and technical dimensions. Open access, the unlimited, free and immediate access to all materials of scholarly interest, is a cornerstone of e-science. Many components of e-science already exist today as disciplinary or island solutions. A key effort is needed to link and exchange their information content over national, institutional and disciplinary borders.

11:05 38 Open reader access, a better business model? A view from the STM-Association
Pieter Bolman, International Association of Scientific, Technical & Medical Publishers, The Hague, Netherlands, bolman@stm.nl - SLIDES

The STM-Association is a global organisation and is emphatically 'business model neutral'. STM's main concern in the Open Access debate is that new business models are sustainable in such a way that continuity and enhancement of access for researchers, scholars, and practitioners is guaranteed, that it attracts innovation, and that the publishing system maintains its independence from any national government. We will apply these criteria when examining the current status of both Open Access Publishing per se and open access via the 'self-archiving' route.

11:35 39 Springer Open Choice: evolution, not revolution
Derk Haank, Chief Executive Officer, Springer Science+Business Media, Heidelberger Platz 3, Berlin 14197, Germany, derk.haank@springer-sbm.com

In response to Open Access, Springer is now letting its authors decide: They can choose between the traditional publishing model and an additional new model, Springer Open Choice. In the latter model, it is the authors and not the users who pay for publishing quality and service. The paper is then accessible via the Internet free of charge to anyone interested. This would make things cheaper for libraries, but it also means that funds would be diverted. Scientists and researchers now have an opportunity to show how serious they are about wanting Open Access. We're prepared to experiment.

2:20 40 Secure statistical analyses on distributed databases
S. Stanley Young1, Alan Karr2, and Ashish P. Sanil2. (1) Bioinformatics, National Institute of Statistical Sciences, PO Box 14006, Research Triangle Park, NC 27709, genetree@bellsouth.net, (2) NISS - SLIDES

A principal reason for sharing chemical data is to conduct analyses of the combined data that are more powerful and informative than analyses of the individual databases. The impediments to "full" sharing are well known: proprietary information, the scale of the data, and even the reluctance to disclose who "owns" particular data points. Trusted third parties, whether human or machine, are not seen as feasible strategies.

We show how computer science concepts known as secure multi-party computation (specifically, secure summation) can be used to perform two important classes of statistical analyses--regression and recursive partitioning--for "horizontally partitioned" data. That is, the databases contain the same attributes (for example, chemical descriptors) for different sets of compounds. The basis of the methods is secure sharing of data summaries that are sufficient to conduct the analyses (indeed, they are known as sufficient statistics). We also note how secure database query techniques can be used to deal with "duplicate" compounds that may be in more than one of the databases.

The techniques will be illustrated with applications to real data.
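The combination of secure summation and sufficient statistics can be sketched for the simplest case, a one-descriptor least-squares line fitted to horizontally partitioned data. This is a single-process simulation under stated assumptions, not the authors' software. The modulus, the fixed-point scale and all function names are illustrative. In a real deployment each party would run on its own machine, and at least three parties are needed for the masking to hide anything.

```python
import random

MODULUS = 2**64   # assumed modulus, far larger than any true sum
SCALE = 10**6     # assumed fixed-point scale for real-valued summaries


def secure_sum(private_values, rng=random):
    """Ring-style secure summation of one integer per party.

    Party 0 adds a random mask before passing the running total on, so each
    later party sees only a masked partial sum. Party 0 removes the mask
    once the total has gone round the ring.
    """
    mask = rng.randrange(MODULUS)
    running = (mask + private_values[0]) % MODULUS
    for value in private_values[1:]:
        running = (running + value) % MODULUS
    total = (running - mask) % MODULUS
    # Map the modular result back to a signed integer.
    return total - MODULUS if total >= MODULUS // 2 else total


def local_summaries(rows):
    """Sufficient statistics for y = a + b*x from one party's (x, y) rows."""
    stats = [
        len(rows),
        sum(x for x, _ in rows),
        sum(y for _, y in rows),
        sum(x * x for x, _ in rows),
        sum(x * y for x, y in rows),
    ]
    return [round(s * SCALE) for s in stats]


def pooled_regression(parties):
    """Fit the pooled least-squares line from securely summed summaries."""
    per_party = [local_summaries(rows) for rows in parties]
    n, sx, sy, sxx, sxy = (
        secure_sum([p[i] for p in per_party]) / SCALE for i in range(5)
    )
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope
```

Only the five summed totals are revealed. The fitted line is the same as if all rows had been pooled, up to the rounding introduced by the fixed-point scale.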

2:40 41 Encoding molecular structures as ranks of models: A new, secure way for sharing chemical data and development of ADME/T models
Igor V. Tetko, Institute of Bioorganic & Petrochemistry, Kiev, Ukraine and Institute for Bioinformatics, Neuherberg D-85764, Germany, itetko@vcclab.org - SLIDES

For a lead compound to become a drug, it has to possess a number of important ADME/T properties, e.g. favorable lipophilicity and solubility. A poor ADME/T profile may result in a drug's failure during the late stages of development. Some companies have experimental databases of such properties. Sharing these data could yield much better models for the whole community, but the proprietary value of chemical structures is a major impediment to doing this. Recently we developed the ALOGPS program (http://www.vcclab.org). It can incorporate user-specific data and dramatically improve its prediction ability for similar series of compounds. The external molecules are represented in it as ranks of 64 neural network models, i.e. as an array of 64 numbers where each number is in the [0,63] range. Such a representation makes it impossible to disclose the underlying chemical structures and allows secure sharing of corporate data.
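The abstract does not say how the ranks are derived. The sketch below assumes one plausible reading: the outputs of the ensemble's models for a molecule are sorted, and each model's position in that ordering is recorded. The function name and the tie-breaking rule are assumptions made for illustration and are not taken from ALOGPS.

```python
def rank_encode(predictions):
    """Replace each model's prediction by its rank within the ensemble.

    Assumption: rank 0 is the smallest prediction, and ties are broken by
    model index. For 64 models the result is a permutation of 0..63.
    """
    order = sorted(range(len(predictions)), key=lambda i: predictions[i])
    ranks = [0] * len(predictions)
    for rank, model_index in enumerate(order):
        ranks[model_index] = rank
    return ranks
```

Under this reading, only the ordering of the models' outputs is shared, not the outputs themselves or the descriptors that produced them.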

42 Open access, open minds
Andrea Twiss-Brooks, University of Chicago, John Crerar Library, 5730 S. Ellis Ave, Chicago, IL 60637-1403 - SLIDES

Discussions of open access publishing are characterized by highly charged rhetoric and nearly religious fervor. Proponents of open access are highly visible, and often occupy what appears a moral high ground. Publishers are coming under significant pressure from government authorities, scientific communities, and other parties to move to open access models of publishing journals. Libraries and their institutions are caught in the middle, wanting to support what is best for scientific communication, while coming to grips with the organizational and financial implications of transition to new publishing models. At this time, even the most highly touted open access publishing efforts should be considered experiments. Open access publishing carries both risks and benefits for these various stakeholders. This presentation will attempt to identify major risks and benefits of open access publishing for libraries and their organizations and the data needed by those organizations to make responsible decisions regarding open access.

43 Wide road to open access
Nicholas R. Cozzarelli, Department of Molecular and Cell Biology, University of California, Berkeley, 16 Barker Hall MC 3204, Berkeley, CA 94720-3204

Scientific publishing is undergoing a revolution, but thus far chemical journals have stayed on the sidelines. I suggest that they start with releasing back content six months after publication, preferably at PubMed Central. I think the financial loss will be minimal and the gain to chemists, from students to professionals, will be enormous. The ACS runs many of the best journals in chemistry. It is an organization with a proud past and should now play a leadership role in shaping improved access to the scientific literature. Many others would follow its lead. I will also discuss additional aspects of Open Access that are followed by the journal I edit, the Proceedings of the National Academy of Sciences.

2:30 44 Chemistry journals: A modest proposal
Steven M. Bachrach, Department of Chemistry, Trinity University, 1 Trinity Place, San Antonio, TX 78212, Fax: 210-999-7569, sbachrach@trinity.edu - SLIDES

Solutions to the journals crisis have coalesced around a small number of options: open access, preprint archives, embargo periods, consortia arrangements. These efforts focus on the concern of ever-rising costs of STM journals. While I will briefly suggest that enhanced publication is the real publication revolution awaiting the STM world, I will offer a proposal for re-positioning of the journal components amongst the interested parties (authors, universities and chemical industries, publishers, and the abstracting/indexing services) that preserves their value-added roles yet allows for potentially cheaper dissemination of information.

3:00 45 Open access publication: One editor’s perspective
Lawrence J. Marnett, Biochemistry, Vanderbilt University School of Medicine, 23rd Ave at Pierce, Nashville, TN 37232-0146, Fax: 615-343-7534, larry.marnett@vanderbilt.edu - SLIDES

Electronic publishing has had a dramatic impact on scientific publishing. The speed of submission, review, and access is significantly improved, and the number of libraries subscribing to packages of journals produced by single publishers has increased. Most institutional subscriptions provide unlimited access to all users within the institution's network. However, for individuals not affiliated with an institution, electronic access to a range of journals is very uneven. Multiple proposals have been made to provide unlimited access at no charge to articles 6-12 months after their publication. Implementing open access is a desirable goal but it presents significant challenges to scientific publishers, particularly those affiliated with non-profit societies. The presentation will focus on some of the key issues as seen through the eyes of an editor of an American Chemical Society journal.

46 Publishing implications of open archiving proposals: An examination of academic chemistry research funding sources
George S. Porter, Caltech Library System, 1-43, Pasadena, CA 91125-4300, Fax: 626-431-2681, george@library.caltech.edu

Speculation is currently rife about the possible impact of the National Institutes of Health (NIH) proposed mandate for open archiving of all NIH-funded research. The speculation making the rounds is routinely devoid of data, which seriously undercuts one's ability to judge the probability of any projected future for the STM publishing industry and scholarly communication. Similar initiatives have been proposed by the Parliament Science & Technology Committee and by the Wellcome Trust, a charitable source of funding for biomedical research.

We reviewed the funding sources acknowledged by authors from six leading US chemistry departments (Caltech, Harvard, MIT, Stanford, Yale, and UCSD) in their journal articles published in 2004. In addition, a corresponding survey was conducted of the journal articles produced from Oxford and Cambridge universities.

Alternative Open Access models include the “author pays” Open Access journal concept. The same analysis of funding sources and publication frequency could be used to project the additional costs associated with the dissemination of research results within this model and the funding sources which might be expected to cover those fees. An analysis was prepared of the declared funding sources in the research articles of PLoS Biology, PLoS Medicine, and 4 BMC titles for the period 2003-2004. These were compared with the funding sources acknowledged in a month's worth of research articles from Nature, Science, and JAMA, and a single issue of PNAS, JACS, and Chemical Communications. We attempt to discern whether the authors' funding sources influence their choice of journal in which to publish.

47 Practical use of scientific and engineering information at United Technologies and Hamilton Sundstrand
Suzanne Cristina, Information Research, Hamilton Sundstrand, 1-3-BC38, One Hamilton Road, Windsor Locks, CT 06096, suzanne.cristina@hs.utc.com

Corporations conduct numerous engineering/scientific/business projects each year. Increasingly, the output of this research is in electronic formats including documents, datasets, and databases. This technical intelligence is stored in a variety of systems, such as document management systems or records management systems. However, in many corporations, technical intelligence is generally hard to discover and reuse, especially after the project is completed. This presentation will cover how United Technologies is taking a basic business driver and utilizing it to create, develop, sustain and reuse technical information throughout the corporation.

48 Aqueous solubility prediction using 7,000 compounds
Paulius J. Jurgutis, Andrius Sazonovas, and Pranas Japertas, Pharma Algorithms, Inc, 591 Indian Road, Toronto, ON, Canada, jurgutis@ap-algorithms.com

Aqueous solubility of a compound can be characterized by multiple means. For example, consider the "intrinsic SW" vs. "characteristic SW", SW in pure water vs. SW in buffer, SW of free electrolytes vs. SW of salts, SW by dissolution vs. SW by precipitation, etc. Different types of solubilities can be described by different superpositions of three factors - crystallization, solvation, and ionization. Provided that the influence of ionization can be estimated from pKa calculations, solvation and crystallization remain the most important factors. Most frequently they are crudely estimated by the following expression: -log SW ≈ log P + mpo, where mpo is melting point divided by 100. For hydrophilic compounds with log P

49 Estimation of estrogen receptor binding affinity using theoretical molecular descriptors
Denise Mills1, Subhash C. Basak1, and Douglas M. Hawkins2. (1) Center for Water and the Environment, Natural Resources Research Institute, University of Minnesota, 5013 Miller Trunk Hwy, Duluth, MN 55811, Fax: 218-720-4328, dmills@nrri.umn.edu, (2) School of Statistics, University of Minnesota

Calf estrogen receptor binding affinity was modeled using the quantitative structure-activity relationship approach for a set of 46 compounds consisting of 2-phenylindoles and 5,6-dihydroindolo[2,1-α]isoquinolines. Molecular descriptors based solely on chemical structure were partitioned into three classes based on level of complexity and demand for computational resources. The topostructural descriptors encode information strictly on the adjacency and connectedness of atoms within a molecule, while the topochemical descriptors encode chemical information such as atom and bond type in addition to topological information. The geometrical or 3-dimensional indices encode three-dimensional aspects of molecular structure. For comparative purposes, three regression methods were used, namely ridge regression (RR), partial least squares (PLS), and principal components regression (PCR). Results indicated that RR generally outperforms PLS and PCR, and acceptable models were obtained from the use of the topochemical descriptors alone.

50 Alchemist Club at Missouri Western State College
Janessa M Hovey, Jessica M McKinzie, Cindy M Peters, LeeAnn M Schuster, Alexa Cook, Shellney A Oehlert, and Michael B Mears, Alchemist Club, School, 4525 Downs Drive, St. Joseph, MO 64507, jmh7742@mwsc.edu

The Alchemist Club at Missouri Western State College has been on the campus since the 1980s. The Chapter even received an Outstanding Chapter award from the American Chemical Society in 1985-1986. Over the course of the past few years, the Alchemist Club has decreased in numbers of members. One of the major goals for this year was to get the club back rolling with activities on Campus and in the Community. Activities for the year have included a booth at the Campus Family Day, a float for Homecoming, participating in Super Science Saturday and hosting a Boy Scout Workshop. In doing so, the club now has many new members and may even have the largest numbers in the history of our Chapter.

51 Application of rough set theory to structure-activity relationships
Joachim Petit, Pharmacology and Toxicology, University of Arizona - College of Pharmacy, 1703 E. Mabel street, PO Box 210207, Tucson, AZ 85721-0207, Fax: 520 626 2466, petit@pharmacy.arizona.edu, and Gerald M Maggiora, Department of Pharmacology and Toxicology, University of Arizona

Rough set theory (RST), developed more than 25 years ago by Pawlak, provides a powerful means for organizing and analyzing data. RST is a set-based method that uses equivalence relationships to group objects with similar attributes into indiscernibility classes, which are the basis for the development of decision rules. The present work focuses on an application of RST to structure-activity relationships. A brief introduction to RST will be presented along with an example of how it can be applied to develop decision rules from structure-activity data.
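The grouping into indiscernibility classes, and the lower and upper approximations that rough set theory builds on them, can be sketched in a few lines. The table layout (a dictionary of compound name to attribute values), the function names, and the toy attributes used in the usage note are illustrative assumptions and are not taken from the talk.

```python
from collections import defaultdict


def indiscernibility_classes(table, attributes):
    """Group objects that share identical values on the chosen attributes."""
    classes = defaultdict(set)
    for name, row in table.items():
        classes[tuple(row[a] for a in attributes)].add(name)
    return list(classes.values())


def approximations(classes, target):
    """Return the lower and upper approximations of `target`.

    The lower approximation is the union of classes wholly inside the target
    (certain members). The upper approximation is the union of classes that
    overlap the target (possible members).
    """
    lower = set().union(*(c for c in classes if c <= target))
    upper = set().union(*(c for c in classes if c & target))
    return lower, upper
```

With a hypothetical table in which compounds m1 and m2 share the same attribute values but only m1 is active, the class {m1, m2} falls in the upper approximation of the actives and not in the lower one. A certain decision rule can only be drawn from classes in the lower approximation.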

8:30 52 Canonicalized systematic nomenclature in chemoinformatics
Jeremy J Yang, OpenEye Scientific Software, 3600 Cerrillos Road, Suite 1107, Santa Fe, NM 87507, Fax: 505.473.0833, jj@eyesopen.com

A fundamental task of chemistry is identifying distinct chemical entities. In chemoinformatics, species must be specified rigorously to facilitate unambiguous expression of chemical data and knowledge. A theoretically equivalent task is determining the equality of two molecules. However, the meaning of sameness or identity depends upon the context or hierarchical chemical level of abstraction, for example, whether stereochemistry or tautomerism is considered. An important subset of this problem can be addressed by graph theory, which applies well to valence models for covalently bonded molecules. Algorithms generating canonical (unique) identifiers for chemical graphs exist and are available. However, due to the multiple contexts mentioned, a single algorithm is not sufficient to solve all problems. This study reviews some existing canonicalization methodology and describes new methods implemented by the chemoinformatics library OEChem and other OpenEye tools.

9:00 53 Data publication @ source via the open archive initiative
Simon J. Coles1, Jeremy G Frey1, Michael B. Hursthouse1, Leslie A Carr2, and Christopher J Gutteridge2. (1) School of Chemistry, University of Southampton, Southampton, United Kingdom, Fax: 442380596723, S.J.Coles@soton.ac.uk, (2) School of Electronics and Computer Science, University of Southampton - SLIDES

A crystallography-based exemplar for open archive publication of scientific data will be presented.

Advances in instrumentation and computation have caused an explosion of scientific data. However, this has not resulted in the expected growth of scientific databases and the reason for this can be clearly identified as a publication bottleneck. As a result of this situation, the user community is deprived of valuable information, and the funding bodies are getting a poor return for their investments!

Unlike other disciplines, the chemical sciences have been reluctant or slow to embrace the 'preprint concept'. This poster outlines a pre-print procedure for the rapid and effective dissemination of structural information to the scientific community (eCrystals) which removes the lengthy peer review process that hampers traditional publication routes, but provides an alternative mechanism. eCrystals is built on a concept developed in the computer science community whereby an author may reveal archives of information to the public. eCrystals makes available all raw, derived and results data from a crystallographic experiment via a searchable and hierarchical system. Bibliographic and chemical metadata items, which are associated with the data, are published through standard protocols and therefore immediately and globally disseminated.

Hence scientific data may be disseminated in such a manner that anyone wishing to utilise the information may access the entire archive of data related to it and assess its validity and worth. Recent advances in developing this approach to openly publish ANY form of chemical, or indeed scientific, data will also be presented.

9:30 54 Designing libraries from HTS data: Hot fragments and activity models
Carolyn M. Barker and James E Mills, Molecular Informatics, Structure and Design, Pfizer Global R&D, Ramsgate Road (ipc 636), Kent CT13 9NJ, Sandwich, United Kingdom

Parallel chemistry and high throughput screening (HTS) are an integral part of Drug Discovery. HTS is routinely used to identify novel chemical series but the data, as a whole, are rarely used to drive compound design. This paper demonstrates that mining HTS data is key to designing information-rich libraries. We highlight the application and success of an array of new library design approaches, for example, multiple-target activity models and mining HTS data at the fragment level (existing in available monomers). Key issues, such as how to optimise multiple dimensions (primary and secondary pharmacology, ADMET and physical properties), will be discussed.

10:15 55 Hierarchical quantitative structure-toxicity relationship (Hi-QSTR) modeling of aquatic toxicity and mutagenicity
Denise Mills, Subhash C. Basak, and Brian D. Gute, Center for Water and the Environment, Natural Resources Research Institute, University of Minnesota, 5013 Miller Trunk Hwy, Duluth, MN 55811, Fax: 218-720-4328, dmills@nrri.umn.edu

Two toxicity endpoints were modeled using the hierarchical quantitative structure-toxicity relationship (Hi-QSTR) method, namely aquatic toxicity, LC50, for a set of 69 benzene derivatives and mutagenicity for a set of 95 aromatic and heteroaromatic amines. With the hierarchical approach, we begin with the least complex descriptors, the topostructural (TS), which encode information strictly about the adjacency and topological distances between atoms in a molecule. The topochemical (TC) descriptors encode chemical information, such as bond and atom type, in addition to information about molecular topology. The geometrical (3D) descriptors are more complex yet, encoding three-dimensional aspects of molecular structure. Finally, the quantum chemical (QC) descriptors encode electronic information. In particular, we were interested to see whether the addition of the quantum chemical descriptors, which are more demanding in terms of computational resources, results in significant model improvement. Marginal improvement in model quality was obtained upon the addition of such descriptors.

10:45 56 MGE: A model generating engine and its applications
Sabine Schefzick, Discovery Technology (Scientific Computing), Pfizer Global R&D, 2800 Plymouth St., Bldg.28/G-131W/G-9, Ann Arbor, MI 48105, Fax: 734-622-2782, sabine.schefzick@pfizer.com, and Mary Bradley, Discovery Technology (Scientific Computing), Pfizer Inc

Abstract text not available.

8:00pm 57 Mutagen/non-mutagen classification of congeneric and diverse sets of chemicals using computed molecular descriptors: A hierarchical approach
Denise Mills1, Subhash C. Basak1, Douglas M. Hawkins2, and Brian D. Gute1. (1) Center for Water and the Environment, Natural Resources Research Institute, University of Minnesota, 5013 Miller Trunk Hwy, Duluth, MN 55811, Fax: 218-720-4328, dmills@nrri.umn.edu, (2) School of Statistics, University of Minnesota

Ridge linear discriminant analysis was used to classify a diverse set of 508 mutagens/non-mutagens, as well as three structurally homogeneous subsets, viz., 260 monocyclic carbocycles and heterocycles, 192 polycyclic carbocycles and heterocycles, and 124 aliphatic alkanes, alkenes, and alkynes. Software programs including POLLY, Triplet, Molconn-Z, Sybyl, and MOPAC were used to calculate a large and diverse set of theoretical molecular descriptors. Subsequently, the descriptors were divided into hierarchical classes based on level of complexity and demand for computational resources. Results indicate that inclusion of the more complex descriptors does not lead to a significant increase in model quality. In addition, correct classification rates for the relatively homogeneous subsets are comparable to those obtained for the entire set of 508 diverse compounds, indicating that the diverse set of theoretical descriptors is capable of representing the diversity of structural features present in the data set.

58 NMR spectral invariants as numerical descriptors for diastereomers and enantiomers
Ramanathan Natarajan and Subhash C. Basak, Center for Water and the Environment, Natural Resources Research Institute, University of Minnesota, 5013 Miller Trunk Hwy, Duluth, MN 55811, Fax: 218-720-4328, rnataraj@nrri.umn.edu

Topostructural or topochemical invariants derived for a molecule from its molecular graph (hydrogen included or suppressed) based on the edge count or information content can differentiate structural isomers. However, they are incapable of differentiating geometrical isomers because the 3-D orientations of the atoms in a molecule are not considered in their computation. Although the next generation indices, the geometrical indices such as the 3-D Wiener index, can account for molecular volume etc., diastereoisomers cannot be distinguished. While attempts have been made along these lines by Schulz et al., the indices they created could not be applied in SAR modeling. NMR, a powerful tool in the hands of chemists, can differentiate diastereoisomers of a compound because it sees the three-dimensional disposition, or “environment”, of the protons in a molecule; that is to say, it has a higher dimensional perception of a molecule than that of a chemist who tries to visualize it using a molecular graph. We use this higher dimensional perception to convert NMR spectra into invariants. The new spectral invariants thus generated can differentiate diastereoisomers. The ability of NMR to differentiate diastereomeric compounds has been used to assign the absolute configuration of several organic compounds after reaction with chiral derivatizing agents. We have shown how the 1H-NMR data of such derivatives can be used to calculate spectral invariants for the enantiomers of 1) chiral alcohols from their esters with 2-methoxy-2-(1-naphthyl)propionic acid and 2) α-chiral carboxylic acids from their esters with ethyl 2-hydroxy-2-(9-anthryl) acetate.

59 Partition of solvents–co-solvents of nanotubes: Proteins and cyclopyranoses
Francisco Torrens, Institut Universitari de Ciencia Molecular, Universitat de Valencia, Dr. Moliner-50, EI-1-38, Burjassot (Valencia) 46100, Spain, Fax: 34-96-354-3156, Francisco.Torrens@uv.es

The main contribution to the water-accessible surface area of lysozyme helices is the hydrophobic term, while the hydrophilic part dominates in the sheet. This is related to the 1-octanol-, cyclohexane- and chloroform-water partition coefficients P_o-ch-cf of the helices, which are greater than those of the sheet. The analysis of atom-group partial contributions to log_P_o-ch-cf allows local maps to be built. The molecular lipophilicity pattern differentiates among helices, sheet and binding site. For a given atom, log_P is sensitive to the presence of other atoms. The contributions of C_70-a-c atoms to log_P are slightly greater than those of d-e, which correlate with the distances from the nearest pentagon. (10,10) is the favourite single-wall carbon nanotube (SWNT), presenting consistency between a relatively small aqueous solubility and great P_o-ch-cf. Efforts to use fullerenes-SWNTs in therapeutic applications are re-evaluated.

60 Prediction of biologic partition coefficients and binding affinities using QSAR models
Denise Mills1, Moiz M. Mumtaz2, Hisham A. El-Masri2, Douglas M. Hawkins3, and Subhash C. Basak1. (1) Center for Water and the Environment, Natural Resources Research Institute, University of Minnesota, 5013 Miller Trunk Hwy, Duluth, MN 55811, Fax: 218-720-4328, dmills@nrri.umn.edu, (2) Computational Toxicology Laboratory, Division of Toxicology, Agency for Toxic Substances and Disease Registry, (3) School of Statistics, University of Minnesota

For contaminants, toxicological data are usually not available to conduct health risk assessments. In such cases, ATSDR and other federal agencies often recommend the use of surrogate values obtained from computational tools such as quantitative structure-activity relationship (QSAR) techniques and physiologically based pharmacokinetic (PBPK) modeling. In an ongoing effort to develop alternative toxicity assessment methods, we have applied QSAR to compute: 1) tissue:air partition coefficients, including fat:air, liver:air, and muscle:air, for a group of 46 low molecular weight volatile organic compounds (VOCs); 2) blood:air partition coefficients for a set of 39 VOCs; and 3) aryl hydrocarbon (Ah) receptor binding affinity for a set of 34 dibenzofurans. The structural descriptors consisted of four classes based on increased level of complexity and computational demand: topostructural (TS), topochemical (TC), geometrical (3D) and quantum chemical (QC). Results indicate that structure-based models using the simple descriptors alone adequately predict toxicological characteristics of these environmental contaminants.

61 Prediction of blood: Brain penetration of chemicals using computed molecular descriptors
Christian T Matson, Center for Water and the Environment, Natural Resource Research Institute, University of Minnesota, Duluth, 5013 Miller Trunk Highway, Duluth, MN 55811, mats0126@d.umn.edu, Subhash C. Basak, Center for Water and the Environment, Natural Resources Research Institute, University of Minnesota, and Lester R. Drewes, Department of Biochemistry and Molecular Biology, School of Medicine, University of Minnesota Duluth

Prediction of blood-brain barrier (BBB) entry of chemicals is important both for drug discovery and environmental protection. A drug designer would like to know whether candidate chemicals for the design of psychoactive drugs would be sufficiently available at the specific receptor site. This will be guided by the BBB entry of the chemicals. In environmental protection, people in the USA are collectively exposed to over 75,000 chemicals. The regulator would like to know how many of them will enter the brain from environmental exposure. But exhaustive experimental testing of all chemicals for their BBB entry is not possible because of the enormous cost and the large number of animals necessary for such testing. So, there is a need for the development of computational models that predict BBB entry directly from molecular structure, without the input of any experimental data.

We have developed computational models for the prediction of BBB penetration of a selected set of 27 chemicals from their computed molecular descriptors. The descriptors used include topological and geometrical indices. The usefulness of this approach in BBB research will be discussed.

62 How to find the best computational chemistry method using cheminformatics
Tulay Ercanli and Donald B. Boyd, Department of Chemistry, Indiana University-Purdue University at Indianapolis, 402 N Blackford Street, Indianapolis, IN 46202, Fax: 317-274-4701, tercanli@iupui.edu

Cheminformatics is used to compare the capabilities of widely used quantum chemistry and molecular mechanics methods. Among the quantum methods examined are the semiempirical MNDO, AM1, and PM3 methods, Hartree-Fock (HF) at a range of basis set levels, density functional theory (DFT) at a range of basis sets, and a post-Hartree-Fock method, local Møller-Plesset second-order perturbation theory (LMP2). Among the force fields compared are AMBER*, MMFF94, MMFF94s, OPLS/A, OPLS-AA, Sybyl, and Tripos. The Spartan, MacroModel, SYBYL, and Jaguar programs are used. The test molecule is (2-amino-5-thiazolyl)-alpha-(methoxyimino)-N-methylacetamide, an analogue of the aminothiazole methoxime (ATMO) substructure of the 7-acylamido side chain of third-generation cephalosporin antibacterial agents. The Ward hierarchical clustering technique is shown to be very useful for comparing experimental and calculated (optimized) bond lengths, bond angles, and torsional potentials.

63 Scaffold hopping and virtual screening using similarity search and bioisosteric replacement
Guyan Liang and Isabelle Morize, Department of Molecular Modeling, Sanofi-Aventis, Route 202-206, Bridgewater, NJ 07059, Fax: 908-231-3605, Guyan.Liang@aventis.com

Ligand-based virtual screening of selected therapeutic targets has been carried out using several popular software packages and various descriptors including MDL public keys [http://www.mdl.com], SciTegic extended connectivity fingerprint [http://www.scitegic.com], Unity fingerprint [http://www.tripos.com], Daylight fingerprint [http://www.daylight.com], Similarity in BioIsosteric Space (SiBIS) [http://moltop.com], BCI fingerprint [http://www.bci.gb.com], and Feature Tree [http://www.biosolveit.de]. Their ability to differentiate active inhibitors from negative background compounds was evaluated. While they were all able to identify actives with high structural similarity to query molecules (a subset of known inhibitors), some of them can extrapolate to novel scaffolds better than others. Contrary to a common perception that ligand-based 2D virtual screening is incapable of identifying novel scaffolds, our study demonstrated that 2D based methods can offer a reasonable enrichment rate during virtual screening and that their capability of novel scaffold identification rivals 3D docking approaches. During this comparison study two unique 2D approaches came to our attention. With a built-in knowledge of bioisosteric replacements, SiBIS combines conventional medicinal chemistry thinking with similarity search and significantly enhances the enrichment rate of novel scaffolds. Feature Tree, representing molecules in a tree-like structure, makes its results more interpretable chemically.
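As an illustration of the similarity search step shared by the 2D fingerprint methods above, the following minimal sketch ranks library compounds against a query by Tanimoto coefficient. It assumes fingerprints are available as sets of "on" bit positions; the compound names and bit sets are hypothetical, and none of the listed vendor tools is being reproduced here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of 'on' bits."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def rank_by_similarity(query_fp, library):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy fingerprints, for illustration only
query = {1, 4, 7, 9}
library = {
    "cmpd_A": {1, 4, 7, 9, 12},
    "cmpd_B": {2, 3, 5},
    "cmpd_C": {1, 4, 8},
}
ranking = rank_by_similarity(query, library)
for name, score in ranking:
    print(name, round(score, 2))
```

In a retrospective study of the kind described, the top of such a ranking would be compared with the known actives to estimate enrichment.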

64 QSTR models of juvenile hormone mimetic compounds for Culex pipiens larvae
Jessica J. Kraker1, Douglas M. Hawkins1, Denise Mills2, Ramanathan Natarajan2, and Subhash C. Basak2. (1) School of Statistics, University of Minnesota, 224 Church Street S.E, Minneapolis, MN 55455, Fax: 612-624-8868, stoerijj@stat.umn.edu, (2) Center for Water and the Environment, Natural Resources Research Institute, University of Minnesota

The goal of this study was to develop quantitative structure-toxicity relationship (QSTR) models for predicting insect juvenile hormone (JH) activity of 304 JH mimetic compounds against Culex pipiens. The activities (pI50) of the compounds were modeled using various calculated predictors to find a function relating the activities to the predictors. Linear models can often predict as well as more flexible models (such as k nearest neighbors or neural nets) without overfitting the data. However, in a regression setting, using highly correlated predictors to estimate a response can lead to difficulties in estimation, due to uncertainty in estimating the coefficients of the predictors. Many methods have been suggested to alleviate this problem, including ridge regression (RR), principal component regression (PCR), and partial least squares (PLS).

Our analysis of 304 compounds aimed to predict measured toxicity to mosquitoes using various chemodescriptors: 920 atom pair descriptors (denoted AP) and 268 topological indices (denoted DES). In addition to separate modeling using each type, we combined them into a 1188-predictor run (BTH). The models were compared using the cross-validated squared multiple correlation q2 of the regressions. RR performed best (BTH data gave q2 = 0.621, AP data gave q2 = 0.600). PLS also performed reasonably well with the BTH data (q2 = 0.549) or AP data (q2 = 0.515), but PCR performed poorly.

To both lighten computation and improve interpretability, thinning of the descriptors was investigated. This involved stepwise selection of 100, 150 or 200 of the predictors. Whereas conventional predictor thinning by such methods as stepwise regression has been largely discredited, a moderate pruning of the predictor list did nothing to diminish the fit of the models. The best fits came with the AP data, thinned to 150 variables, using ridge regression (q2 = 0.680) and PLS (q2 = 0.633) methods.
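The comparison above rests on two ingredients: ridge regression and a cross-validated q2. The sketch below shows both in plain Python under stated assumptions: leave-one-out cross-validation, q2 computed as 1 - PRESS/SS_total about the overall mean, and a fixed penalty `lam`. The abstract does not specify these details, and the toy data are invented for illustration.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the matrix is non-singular (true for ridge with lam > 0).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_fit(X, y, lam):
    """Fit ridge regression on centered data; return (intercept, coefficients)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Penalized normal equations: (Xc'Xc + lam * I) b = Xc'yc
    A = [[sum(Xc[i][j] * Xc[i][k] for i in range(n)) + (lam if j == k else 0.0)
          for k in range(p)] for j in range(p)]
    rhs = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    coef = solve(A, rhs)
    intercept = y_mean - sum(c * mu for c, mu in zip(coef, x_mean))
    return intercept, coef


def loo_q2(X, y, lam):
    """Leave-one-out cross-validated q2 = 1 - PRESS / total sum of squares."""
    press = 0.0
    for i in range(len(y)):
        b0, b = ridge_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:], lam)
        pred = b0 + sum(c * v for c, v in zip(b, X[i]))
        press += (y[i] - pred) ** 2
    y_mean = sum(y) / len(y)
    return 1.0 - press / sum((v - y_mean) ** 2 for v in y)


# Invented toy data with two highly correlated predictors
X = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1], [6.0, 6.0]]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
q2 = loo_q2(X, y, lam=1.0)
print(round(q2, 3))
```

The penalty term keeps the normal equations well conditioned even when predictors are nearly collinear, which is the difficulty the abstract attributes to unpenalized regression.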

65 Beyond the ADME challenge: Integration of experimental and in silico approaches
Jacques R. Chretien, Marco Pintore, and Nadege Piclin, BioChemics Consulting SAS, Innovation Center, 16 L. de Vinci, 45074 Orleans cedex 2, France, Fax: +33 2 38 41 72 21, jacques.chretien@biochemics-consulting.com

Gathering appropriate pharmacokinetic properties, experimental and/or in silico ones, is important for the success of a drug discovery program. Early ADME prediction has long been considered a key challenge in large pharma. Using recent innovative strategies, we will demonstrate how to overcome this challenge by delineating a Global Strategy derived from Molecular Experimental Design (MED) concepts. Recently, we have developed new computational methods based on Genetic Algorithms, Fuzzy Logic and Radial Basis Function (RBF) that make it possible to develop a number of early ADME predictors. More particularly, our Global Strategy was applied to the main pharmacokinetic properties (oral absorption, bioavailability, volume of distribution and clearance), with prediction rates higher than 65-70%. It then appeared fundamental to validate a priori in silico predictions against a posteriori, experimentally based predictions. Examples derived from various RBF predictions exhibit good agreement. This demonstrates that both approaches are worth incorporating in a Global ADME Strategy, each in its proper place. This may offer a chance to reduce the time and cost of drug discovery in industry.

66 Beyond the LFER paradigm: Harnessing atomic descriptors and artificial neural networks to predict pKa
Robert Fraczkiewicz, Boyd Steere, and Michael B. Bolger, Life Sciences Department, Simulations Plus, Inc, 1220 West Avenue J, Lancaster, CA 93534, Fax: (661) 723-5524

A vast array of chemical and biological properties of molecules strongly depend on ionization in water - the fundamental solvent in living systems. Consequently, the knowledge and understanding of ionization constants (pKa) is of chief importance to the pharmaceutical and environmental industries. The cheapest and easiest way of obtaining pKa for new molecules is in silico prediction. Almost all computational methods for this purpose are based on a perturbational approach utilizing the Linear Free Energy Relationships (LFER) of Hammett and Taft. An alternative route to pKa prediction, successful in the area of QSAR/QSPR, involves a direct, non-linear correlation between the observed property, pKa, and calculated descriptors. However, unlike straightforward molecular property-molecular descriptors relationships (e.g., log P, solubility), the localized nature of ionization requires creation of a new class of atomic descriptors. Another serious problem: unlike measured log P or solubility which, essentially, describe single reactions, the measured pKas of a polyprotic molecule are a net effect of (sometimes) very large and complex networks of microscopic equilibria. We have solved both problems and will present the resulting model of pKa prediction.
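To make the point about microscopic equilibria concrete, the sketch below works through the simplest case, a diprotic acid with two distinct ionizable sites a and b. It uses the standard textbook relations Ka1 = k_a + k_b and Ka2 = k_a * k_ab / (k_a + k_b); it is not the authors' neural network model, and the function name and input values are illustrative assumptions.

```python
import math


def macro_pkas(pk_a, pk_b, pk_ab):
    """Macroscopic (pKa1, pKa2) of a diprotic acid from microscopic pKa values.

    pk_a, pk_b: micro pKa for losing the proton at site a or site b first
    pk_ab:      micro pKa for losing the proton at site b once site a is gone
    The fourth micro constant is fixed by the thermodynamic cycle.
    """
    k_a, k_b, k_ab = (10.0 ** -v for v in (pk_a, pk_b, pk_ab))
    ka1 = k_a + k_b                   # both first-step routes contribute
    ka2 = k_a * k_ab / (k_a + k_b)    # overall second dissociation
    return -math.log10(ka1), -math.log10(ka2)


# Illustrative values: two equivalent sites with micro pKa 4, then micro pKa 5
pka1, pka2 = macro_pkas(4.0, 4.0, 5.0)
print(round(pka1, 3), round(pka2, 3))
```

Even in this two-site case the measured constants differ from every microscopic one; with n sites the network grows to 2^n microstates, which is why the measured pKa values cannot be assigned to single reactions.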

67 Building a computational platform for predicting toxicity
Julie E. Penzotti, Gregory A. Landrum, and Santosh Putta, Rational Discovery LLC, 555 Bryant St. #467, Palo Alto, CA 94301, penzotti@RationalDiscovery.com

When one uncovers a novel chemotype in drug discovery, there is often little or no information available to assess the toxic potential of the series, particularly in humans. Any available toxicology data often exists in multiple sources and divergent formats, making it difficult to assemble and evaluate. While computational methods like machine learning and QSAR show promise in predicting a compound's toxic potential from chemical structure, their accuracy is limited by the size and diversity of the training data used to create the models. To address these issues, we will present an integrated data and analysis software platform that combines a chemically searchable database of existing toxicological knowledge with a collection of in silico predictive models for toxicity and ADME properties. Various similarity metrics enable the user to explore properties of related compounds. We will also discuss the application of machine learning to generate robust models for toxicity.

68 Making FDA toxicity data available to the public: FDA ToxML database for genetic toxicity
Kirk B. Arvidson1, Julie Mayer1, Michelle L. Twaroski1, R. Daniel Benz2, Edwin J. Matthews2, Naomi L. Kruhlak2, Mitchell A. Cheeseman1, and Chihae Yang3. (1) HFS-275, FDA/CFSAN/OFAS, College Park, MD 20740, Kirk.Arvidson@cfsan.fda.gov, (2) HFD-901, FDA/CDER/OPS/ICSAS, (3) Leadscope, Inc

The Center for Food Safety and Applied Nutrition (CFSAN) and the Center for Drug Evaluation and Research (CDER) are repositories for a great deal of toxicity data that are not available in the public literature. CFSAN/OFAS and CDER/OPS/ICSAS, in collaboration with Leadscope, Inc., are consolidating non-proprietary data submitted in petitions to FDA. Structure-searchable electronic toxicity databases will assist the FDA in its mission to review incoming submissions promptly and efficiently. Incorporating chemical structures with the toxicity information will enable read-across and (Q)SAR methodologies in FDA. These databases follow the ToxML standard and controlled vocabularies for various toxicity endpoints: bacterial mutagenesis, in vitro chromosome aberration, in vitro mammalian mutagenesis, in vivo micronucleus, (sub)chronic toxicity, and reproductive and developmental toxicity. This talk presents the genetic toxicity database integrated from separate endpoints using the ToxML controlled vocabulary. The unique structural space and diversity assessment of the database will be compared to other genetic toxicity databases.

69 Strategic assessment of domain applicability of QSAR models
Grace Patlewicz1, Chihae Yang2, Glenn J. Myatt2, Kevin Cross2, and Paul E. Blower2. (1) SEAC, Unilever, Colworth House, Sharnbrook MK44 1LQ, United Kingdom, Grace.Patlewicz@unilever.com, (2) Leadscope, Inc

The ability to predict toxicity via QSAR is becoming more important as international regulations seek QSAR-based evaluations for risk assessment including toxicity. Legislation examples include the REACH initiative and the 7th Amendment to the Cosmetics Directive in the European Union, as well as the DSL in Canada. In reality, building reliable QSAR models meets many challenges. Access to high-quality data from which predictive models can be derived continues to be a major impediment. Another obstacle is the current paradigm of the black-box nature of the models. Understanding the structure and data space is essential when developing models that can be externally validated. A strategic process needs to be designed so that each stage is transparently interpretable. Assessing whether an untested compound is within the model domain is critical to this QSAR strategy. This paper will demonstrate strategic assessment of model domains and applicability using a structural features-based weight of evidence approach.

70 CAMS: A high-throughput compound archive management system
Robert D. Feinstein, Kelaroo, Inc, 312 S. Cedros Ave., Suite 320, Solana Beach, CA 92075, rdf@kelaroo.com

The procurement, archival and management of drug-like molecules present significant informatics challenges in today's information-driven drug discovery process. Robotic compound manipulation and storage hardware can mitigate some of these issues only if they are well integrated with complementary software and database solutions. We present a high-throughput Compound Archive Management System (CAMS) that encompasses storage, retrieval and tracking of samples (in plates and tubes) using a 3rd party automated compound storage and retrieval system capable of holding ~30 million samples. The development of CAMS illustrates many technical challenges typical of applications designed to handle large amounts of data. For example, despite CAMS being a web-based application, chemists use it to register 100,000+ samples at a time. Furthermore, biologists use CAMS to format, request and withdraw compound libraries containing hundreds of thousands of samples. Similarly, CAMS must process large output files generated by archive robotics in order to track and reconcile compound inventory data.

71 SeQuence IDentification: A peptide sequencing algorithm based on gas-phase peptide fragmentation patterns in tandem mass spectrometry
Li Ji1, Joseph Triscari2, Yingying Huang1, George Tseng3, Shinsheng Yuan4, Ljiljana Pasa-Tolic5, Mary S Lipton5, Richard D. Smith5, and Vicki H Wysocki1. (1) Department of Chemistry, University of Arizona, 1306 E. University Blvd, Box 210041, Tucson, AZ 85721-0041, lji@u.arizona.edu, (2) Science Application International Corporation, (3) Department of Biostatistics and Department of Human Genetics, University of Pittsburgh, (4) Department of Statistics, University of California, (5) Pacific Northwest National Laboratory - SLIDES

Automated peptide sequencing combined with database searching via tandem mass spectrometry is a promising approach to characterize and identify peptides and proteins from complex mixtures in a high-throughput mode. Currently, no popular peptide sequencing algorithms use sophisticated peptide dissociation models and consider potentially relevant factors such as size, charge state, charge location, and amino acid (AA) content. We are developing a new Bayesian learning-based approach, called SeQuence IDentification (SQID), which employs differentiated intensities for each ion by incorporating peptide fragmentation statistics for higher identification accuracy. Previously, 28,330 spectra of known sequence were clustered using the penalized K-means method, and their corresponding chemical properties were determined. In SQID's training stage, spectra from each cluster are used to derive cleavage probability "lookup" histograms of all AA pairs. In the testing stage, the probability "lookup" histograms are then used to calculate the probability of each candidate sequence matching a given spectrum.
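The testing stage described above can be pictured with a deliberately simplified sketch: each candidate sequence is scored by summing log-probabilities over its adjacent amino acid pairs, using a lookup of cleavage probabilities. Everything specific here is an assumption made for illustration (the independence of cleavage sites, the `default` probability, the lookup values, and the function name); the actual SQID scoring is not given in the abstract.

```python
import math


def log_score(sequence, observed_sites, cleavage_prob, default=0.5):
    """Log-probability of the observed cleavages given a candidate sequence.

    observed_sites: set of bond indices i (between residues i and i+1)
                    at which a fragment ion was detected
    cleavage_prob:  dict mapping residue pairs such as "DP" to an assumed
                    probability of observing cleavage between them
    """
    total = 0.0
    for i in range(len(sequence) - 1):
        p = cleavage_prob.get(sequence[i:i + 2], default)
        total += math.log(p if i in observed_sites else 1.0 - p)
    return total


# Hypothetical lookup values and a spectrum showing cleavage only at bond 0
lookup = {"DP": 0.9, "PG": 0.2, "GA": 0.4}
observed = {0}
candidates = ["DPGA", "PGAD"]
best = max(candidates, key=lambda s: log_score(s, observed, lookup))
print(best)
```

In the approach described, a separate lookup would be derived for each spectrum cluster rather than the single table used here.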

8:30 72 Progressable hit identification from HTS data: An integrated informatics solution
Mark A. Hermsmeier, New Leads Chemistry, Bristol-Myers Squibb, P.O. Box 4000, Princeton, NJ 08543, Fax: 609-252-7446

A method to identify pharmaceutically interesting hits from High Throughput Screening data is presented. An integrated web-based platform, Pick-a-thon, allows automatic profiling, filtering and selection by potency, molecular properties, chemotype clusters, compound origin, assay history, similarity to known bioactives and chemical fragments. Results are easily shared within groups and a history of the selection process is recorded. This tool has been adopted at BMS for early phase hit identification.

9:00 73 Triple store databases and their role in high throughput, automated, extensible data analysis
Kieron R Taylor, Robert J Gledhill, Jonathan W Essex, and Jeremy G Frey, School of Chemistry, University of Southampton, Highfield, Southampton SO17 1BJ, United Kingdom, Fax: +44 (0)23 8059 3781, j.g.frey@soton.ac.uk

A critical component of high throughput experiments is the ability to store, retrieve, and analyse the resulting data. This is arguably best accomplished using a relational database. However, an elaborate relational database is founded on a complicated schema, and changing this schema requires a major act of redesign. This rigidity is incompatible with the scientific method, by which new hypotheses are continually devised and tested. Triple store databases, on the other hand, can be modified and extended without requiring a major redesign. In this presentation, the triple store method, and its application to the cheminformatics problem of solubility prediction, will be described.
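The extensibility argument can be seen in a minimal in-memory sketch of a triple store, where every fact is a (subject, predicate, object) tuple and a query is a pattern with wildcards. The class, predicate names, and values below are hypothetical and are not taken from the authors' system.

```python
class TripleStore:
    """Minimal in-memory store of (subject, predicate, object) facts."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return the set of triples matching the pattern; None is a wildcard."""
        pattern = (subject, predicate, obj)
        return {
            t for t in self.triples
            if all(p is None or p == v for p, v in zip(pattern, t))
        }


store = TripleStore()
store.add("compound_1", "hasSolubility", "0.5 g/L")   # hypothetical values
store.add("compound_2", "hasSolubility", "12 g/L")
# A new kind of measurement needs no schema change, only a new predicate:
store.add("compound_1", "hasMeltingPoint", "80 C")

solubilities = store.match(predicate="hasSolubility")
about_1 = store.match(subject="compound_1")
print(len(solubilities), len(about_1))
```

A relational design would need a new column or table for the melting point; here the store's structure is unchanged, at the cost of pushing consistency checks onto the application.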

9:30 74 Informatics implementation in ExxonMobil Chemical Company
Robert J Wittenbrink, Michael E. Lacey, Gregg J. Howsmon, and Dave A. Stachelczyk, Research, ExxonMobil Chemical Company, 2205 BTEC-West, 5200 Bayway Drive, Baytown, TX 77520, Fax: 281-834-1793, robert.j.wittenbrink@exxonmobil.com - SLIDES

High throughput experimentation (HTE) is an important, rapidly expanding technology that will impact the way research is done in the chemicals industry. The ability to conduct 100s and even 1000s of experiments a day generates a large amount of data. The ability to manage and analyze these very large data sets in an efficient way becomes critical to the success of the HTE programs. Further, the ability to relate data generated in HTE experiments to data generated in conventional, or non-HTE, applications is a critical step in making full use of HTE programs. In order to achieve the full benefit of HTE, we have begun the broad implementation of an informatics system across our entire technology pipeline within ExxonMobil Chemical Company. The system will enhance the way we design experiments, automate our experimental equipment, and capture data from multiple sources. It will also improve our capability to visualize data in multiple dimensions and analyze the results. The vision is that ALL of our data (from HTE tools, lab tools, and small and large pilot plants) will be captured, stored, and integrated such that it can be retrieved for analysis from a single point. The use of this system will allow us to extract key learnings and turn data into useful information - Informatics.

10:00 75 Designing test plates with maximal information content and diversity for the development of library protocols
Jean E. Patterson1, Ying Zhang1, Andrew Smellie2, Daming Li1, David S. Hartsough3, Libing Yu2, and Carmen M. Baldino4. (1) Department of Chemistry, ArQule, Inc, 19 Presidential Way, Woburn, MA 01801, jpatterson@arqule.com, (2) ArQule Inc, (3) Informatics and Modeling, ArQule, Inc, (4) Chemistry Department, Arqule Inc - SLIDES

The design of test plates used to develop a parallel synthesis library often has multiple, competing constraints such as a wide range of physico-chemical properties (molecular weights, logP, logD, solubility, HPLC retention times, etc.), structural diversity, and practical limitations such as reagent availability or reactivity that need to be taken into account. At ArQule, we have developed a test plate design process that involves reagent binning, automated Pipeline Pilot protocols that allow chemists maximum flexibility in choosing preferred reagents, and standardized Spotfire visualization tools that enable chemists to interactively view the design diversity and other information with respect to the virtual library.

76 Difference in vector-based and graph-based coding for ADME prediction
Joerg K. Wegner and Andreas Zell, Department of Bioinformatics (ZBIT), University, Sand 1, Tuebingen 72076, Germany, Fax: 049-7071-29-5091, wegnerj@informatik.uni-tuebingen.de

We present an extensive study to build classification and regression models using five different ADMET data sets (HIA, LogP, LogS, BBB, and two toxicological data sets causing cancer in rats and mice).

We compare especially the relevance of vector based coding for molecules using descriptors and fingerprints and a coordinate-free coding working directly on the molecular structures avoiding a temporary abstract vector representation.

We see that the vector coding can be used for large data sets at the cost of losing accuracy, while the coordinate-free approach avoids the feature selection problem but is only applicable to smaller data sets. Furthermore, we briefly discuss the underlying space and time complexities.

77 Electron density derived descriptors in ADME/tox screening
N. Sukumar1, Curt M. Breneman1, and Mark J. Embrechts2. (1) Department of Chemistry, Rensselaer Polytechnic Institute, 110 8th St., Troy, NY 12180-3590, Fax: 518-276-4045, nagams@rpi.edu, (2) Decision Sciences and Engineering Systems, Rensselaer Polytechnic Institute - SLIDES

Rapid and efficient ADME/tox screening is of vital importance in the design and development of new chemical entities for drug discovery. We have applied electron density derived descriptors, generated by the RECON program, as screens for a variety of ADME/tox datasets. The RECON program is based on a pre-computed library of additive ab initio quality electron density derived properties for atomic (virial) fragments, that are matched to the atom types in a molecule and recombined; close to half a million molecules can be processed in an hour. Some new types of electron density derived descriptors are introduced and their performance compared to those of several other families of descriptors.

78 Methods to assess diversity and quality of local neighbors in toxicity databases
Kevin P. Cross1, Chihae Yang2, Glenn J. Myatt2, and Paul E. Blower2. (1) LeadScope, Inc, Columbus, OH 43215, kcross@leadscope.com, (2) Leadscope, Inc

The ability to accurately predict toxicity is becoming increasingly important as the pharmaceutical and chemical industries move towards efficient up-front screening to reduce late-stage attrition. Bringing toxicity assessment into the discovery phase requires access to toxicity databases that can provide training sets. Large quantities of toxicity information are publicly available; however, most of these databases are not optimized for building structure-toxicity relationships. Quite often a new focused training set must be constructed to relate endpoint relationships with structural features. Understanding the data space in terms of structural diversity and data distribution is a key to building interpretable predictive models. Techniques to assess the diversity of a dataset at both global and local levels will be discussed. A new “coverage” statistic will be introduced which calculates set diversity based on the number of local structure neighborhoods required to contain (or “cover”) all of the structures in the set.
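The idea of counting the neighborhoods needed to cover a set can be illustrated with a small sketch. The abstract does not say how neighborhoods are defined or how the count is computed, so everything below is an assumption made for illustration: a neighborhood is taken to be all items whose similarity to a center is at least `threshold`, and the count comes from a greedy set-cover heuristic, which gives an upper bound rather than a guaranteed minimum. The function name `greedy_cover_count` is hypothetical.

```python
def greedy_cover_count(items, similarity, threshold):
    """Greedy estimate of how many similarity neighborhoods cover all items.

    Assumption for this sketch: the neighborhood of item i is every item j
    with similarity(items[i], items[j]) >= threshold (i itself included).
    """
    n = len(items)
    neighborhoods = []
    for i in range(n):
        members = {j for j in range(n)
                   if similarity(items[i], items[j]) >= threshold}
        members.add(i)  # a center always covers itself, so the loop terminates
        neighborhoods.append(members)

    uncovered = set(range(n))
    count = 0
    while uncovered:
        # Pick the center whose neighborhood covers the most uncovered items.
        center = max(range(n), key=lambda i: len(neighborhoods[i] & uncovered))
        uncovered -= neighborhoods[center]
        count += 1
    return count


if __name__ == "__main__":
    # Toy one-dimensional "structures"; closeness stands in for similarity.
    values = [0.0, 0.1, 0.2, 5.0, 5.1]
    closeness = lambda a, b: 1.0 / (1.0 + abs(a - b))
    print(greedy_cover_count(values, closeness, threshold=0.8))
```

Under these assumptions, a set needing few neighborhoods is tightly clustered, while a set needing nearly one neighborhood per structure is highly diverse.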

79 An approach to the interpretation of computational neural network QSAR models
Rajarshi Guha1, David T. Stanton2, and Peter C. Jurs1. (1) Department of Chemistry, Pennsylvania State University, 104 Chemistry Building, University Park, State College, PA 16802, rxg218@psu.edu, (2) Central Research Chemical Technology Division, Procter & Gamble

We present a simple and general method using the weights and biases from a trained computational neural network (CNN) to correlate individual input descriptors to the predicted activities. The method is based on the assumption that certain descriptors contribute more to the hidden layer than others. Similarly, certain hidden neurons contribute more to the output layer than others. By considering the signs and magnitudes of these contributions, the method allows us to explain how each descriptor contributes to the structure-activity relationship being modeled. We tested the method by developing interpretations for CNN models based on three datasets and comparing them to conclusions drawn from the interpretation of linear models for these datasets. Preliminary results indicate that conclusions from the CNN interpretation compare well with those made for the linear models using a technique based on partial least squares. Applications of this methodology to other datasets are in progress.

80 ALOGPS (http://www.vcclab.org) is a free on-line program to predict lipophilicity and aqueous solubility of chemical compounds
Igor V. Tetko, Institute of Bioorganic & Petrochemistry, Kiev, Ukraine and Institute for Bioinformatics, Neuherberg D-85764, Germany, itetko@vcclab.org, and Vsevolod Yu. Tanchuk, Institute of Bioorganic & Petrochemistry, Kiev, Ukraine

The program was developed using an Associative Neural Network and combines models to predict the lipophilicity and aqueous solubility of chemicals. The cross-validation (CV) accuracy is rms=0.35 and standard mean error s=0.26 for 12908 molecules. The aqueous solubility module calculated CV rms=0.49 and s=0.38. The main feature of ALOGPS is the ability to add a user's “in-house” molecules (the “LIBRARY” mode) without the need to retrain the neural networks and/or generate new indices. Use of the LIBRARY increases the prediction ability of the method for the user's molecules by up to 5 times using just a few new compounds per series. This makes it an invaluable tool for applications in pharmaceutical firms that have private “in-house” collections of compounds. A description of the concept of LIBRARIES, as well as an introduction to the property-based similarity space, will be provided using several illustrative examples. This study was supported by INTAS grant 00-0363.

81 Oral bioavailability prediction based on expert knowledge and informatics
Paulius J. Jurgutis, Donatas Zmuidinavicius, Remigijus Didziapetris, and Pranas Japertas, Pharma Algorithms, Inc, 591 Indian Road, Toronto, ON, Canada, jurgutis@ap-algorithms.com

Accurate predictions of oral bioavailability (%F) are highly desirable in the early stages of oral drug development. Once the target activity has been identified, compounds that are both active and bioavailable must be developed. To achieve this goal, medicinal chemists use various "generic" criteria, such as Lipinski's "rule of five", log P - TPSA profiles, counts of rotatable bonds, and other computational thresholds. All of these criteria ignore the influence of specific transporters and other pharmacokinetic (PK) factors that are class-specific in nature. These factors are only considered during the later stages of drug development, when dose-response curves are obtained. Our goal was to make these considerations available to medicinal chemists before the actual PK measurements are made. This resulted in the development of a new software system that provides experimental data and expert advice on a number of PK-related properties: (i) solubility in stomach and intestine, (ii) stability in lumen, (iii) human intestinal permeability, (iv) active transport, (v) Pgp efflux, (vi) 1st pass metabolism, and (vii) oral bioavailability in humans. Each prediction is accompanied by experimental data for similar compounds and the logical reasoning behind it. All of this information can help medicinal chemists to avoid errors at the earliest stages of lead optimization.

82 Library design through lead optimization: An application for integrating data and workflow among high-throughput scientists
Louis J. Culot Jr., CambridgeSoft Corporation, 100 Cambridge Park Drive, Cambridge, MA 02140, lculot@cambridgesoft.com - SLIDES

Laboratory automation and electronic information capture have changed the way chemists and biologists approach problems by enabling rapid exploration and testing of large chemical spaces. Even so, good library design tools, such as property predictors, data visualization, list manipulation, reagent selection, and automated synthesis layout, are essential for efficient exploration of the nearly infinite chemical space. Likewise, good early biological protocol design and data visualization tools are needed for high-throughput screening biology. The ability to share information between these disciplines, such as using real biological assay data alongside bio-availability predictors, provides a substantial advantage in lead optimization. We present here an integrated application framework that connects the high-throughput scientific disciplines for data sharing (not replication), providing chemical information forwards through to biology, and feeding results backwards to future chemical library design.

83 Boosting descriptors for similarity searches: feature trees trained by machine learning
Marcus Gastreich1, Jun Liao2, Gerhard Hessler3, Stefania Pfeiffer-Marek3, Sally Ann Hindle1, Manfred Warmuth2, Christian Lemmen1, Thorsten Naumann3, and Karl-Heinz Baringhaus3. (1) BioSolveIT GmbH, An der Ziegelei 75, 53757 Sankt Augustin, Germany, Fax: +49 2241 2525 525, marcus.gastreich@biosolveit.de, (2) Computer Science Department, University of California Santa Cruz, (3) Molecular Modelling, Aventis Pharma Deutschland GmbH

The FTrees program, being based on the Feature Trees descriptor, is an extremely fast, effective tool for similarity searching. Compounds are described in a topology-preserving way, assigning physico-chemical properties to the tree nodes. Similarities between two compounds are scores for an optimum superposition of compared trees.

Multiple Feature Trees can be 'overlaid' into a so-called model which represents the characteristics of a series of compounds. Moreover, models can store pharmacophore-related information.

Since the constituent parts of models can be assigned weights, it is possible to employ machine learning procedures to adjust them and so improve the predictive ability of the models.

This highly time-efficient post-processing of Feature Trees-based calculations leads to a distinct advantage: through backmapping, molecular features important for biological activity can be identified.

We evaluated the procedure on three large data sets and report comparatively on the performance of a specially developed machine learning algorithm.

84 Comparison of the effect of false positives on Tanimoto and modified Bayesian similarity
David Rogers, SciTegic, Inc, 9665 Chesapeake Dr, Suite 401, San Diego, CA 92123, Fax: 858-279-8804, drogers@scitegic.com

A common technique for following up results from a high-throughput screening campaign is to take the actives ("hits") from the study and use Tanimoto similarity to identify other in-house or vendor compounds that are similar to an active and thus worth testing while the campaign is still in progress. However, the assays in use often have a high false positive rate, that is, identify many more compounds as active than are truly active. This is particularly true if the expensive step of confirmation is not performed before the analysis.

In this study, we examined the effect of random false positives on the quality of the results. In particular, we compared Tanimoto similarity and a modified Bayesian method. For the descriptor we used extended-connectivity fingerprints (ECFPs), recently reported to be effective for similarity-based virtual screening by Hert et al. The results demonstrate that the modified Bayesian method is superior to Tanimoto similarity in its resistance to false positive noise. Further, it will be shown how the results from the Bayesian method could be used to improve the robustness of the Tanimoto similarity method.
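For readers unfamiliar with the Tanimoto coefficient used as the baseline here, a minimal sketch follows. It treats a fingerprint as a set of "on" bit indices and scores a candidate by its best similarity to any hit; the helper name `nearest_hit_similarity`, the toy bit sets, and the convention for two empty fingerprints are assumptions made for illustration. The modified Bayesian method is not sketched because the abstract does not describe it.

```python
from __future__ import annotations


def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto coefficient: shared bits / (bits in a + bits in b - shared bits)."""
    if not a and not b:
        return 0.0  # convention chosen for this sketch when both are empty
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def nearest_hit_similarity(candidate: set[int], hits: list[set[int]]) -> float:
    """Score a candidate by its highest Tanimoto similarity to any screening hit."""
    return max(tanimoto(candidate, hit) for hit in hits)


if __name__ == "__main__":
    # Toy fingerprints; real ECFPs would come from a cheminformatics toolkit.
    hits = [{1, 2, 3, 4}, {2, 3, 5}]
    print(nearest_hit_similarity({1, 2, 3}, hits))
```

Because each candidate is scored against individual hits, a single false positive among the hits can pull in an entire neighborhood of inactive compounds, which is the sensitivity the study examines.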

85 Design and linkage of compound filters to HTS assay promiscuity
Bradley C. Pearce1, Michael J Sofia1, David A. Stock2, and Dieter A. Drexler3. (1) New Leads Chemistry, Bristol-Myers Squibb, 5 Research Parkway, Wallingford, CT 06492, Fax: 203-677-6984, bradley.pearce@bms.com, (2) Non-Clinical Biostatistics, Bristol-Myers Squibb, (3) Discovery Analytical Sciences, Bristol-Myers Squibb - SLIDES

A process for identifying and filtering undesirable compounds that contribute to HTS screening deck promiscuity is described. An analysis was made linking SMARTS-based structural queries with primary HTS data obtained from historic screens at Bristol-Myers Squibb. Two complementary views of promiscuity were developed and the data were assessed relative to HTS benchmarks. One captures an expected assay hit rate and the other examines how strongly active compounds are expressed across multiple assays. Statistical evaluation of the data indicates the impact of functional group filters as they relate to compound promiscuity. The empirically derived model helps remove some of the usual subjectivity in applying filtering guidelines. A limited set of structural integrity data was also used to help assess the functional group filters used in this study. Using SciTegic's Pipeline Pilot, integrated protocols were designed and built that greatly facilitate implementation of these filters. Utility for HTS screening architecture, combinatorial libraries and external compound acquisitions is discussed.

86 Fingerprint-based virtual screening using multiple reference structures
Jérôme Hert1, Peter Willett1, David J. Wilton1, Pierre Acklin2, Kamal Azzaoui2, Edgar Jacoby2, and Ansgar Schuffenhauer2. (1) Department of Information Studies, University of Sheffield, Western Bank, Sheffield S10 2TN, United Kingdom, j.hert@sheffield.ac.uk, (2) Discovery Technologies, Novartis Institute for Biomedical Research - SLIDES

Fingerprint-based similarity searching is widely used for virtual screening when only a single bioactive reference structure is available. This paper considers similarity approaches that can be used when multiple, structurally heterogeneous reference structures are available. Extensive simulated virtual screening searches on the MDL Drug Data Report database suggest that the best results come from data fusion, specifically fusing the similarity scores for similarity searches using individual reference molecules, and an approximate form of the binary kernel discrimination technique. A detailed comparison was then carried out using these two approaches with 14 different types of 2D fingerprint, evaluating the experiments in terms of both active molecules retrieved and chemotypes retrieved. The results demonstrate the effectiveness of fingerprints that encode circular substructure descriptors generated using the Morgan algorithm. The combination of these fingerprints with data fusion based on similarity scores would seem to provide both an effective and an efficient approach to virtual screening in lead-discovery programmes.
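Fusion of similarity scores across several reference structures can be sketched as follows. The abstract does not state which fusion rule was used, so taking the maximum score over the references is an assumption here, exposed through the `fuse` parameter; the function name `fused_ranking` and the toy fingerprints are likewise hypothetical. Fingerprints are represented as sets of "on" bit indices.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of bit indices."""
    if not a and not b:
        return 0.0  # convention chosen for this sketch
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def fused_ranking(database, references, fuse=max):
    """Rank database molecules by a fused similarity score.

    database:   dict mapping molecule id -> fingerprint (set of bit indices)
    references: list of reference fingerprints
    fuse:       rule combining the per-reference scores (max is assumed here)
    """
    fused = {
        mol_id: fuse(tanimoto(fp, ref) for ref in references)
        for mol_id, fp in database.items()
    }
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)


if __name__ == "__main__":
    references = [{1, 2, 3}, {7, 8, 9}]
    database = {"a": {1, 2, 3, 4}, "b": {7, 8}, "c": {4, 5, 6}}
    print(fused_ranking(database, references))
```

With a maximum rule, a database molecule ranks highly if it resembles any one of the structurally heterogeneous references, rather than having to resemble all of them.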

87 Learning from library design
Gregory A. Landrum, Julie E. Penzotti, and Santosh Putta, Rational Discovery LLC, 555 Bryant St. #467, Palo Alto, CA 94301, Landrum@RationalDiscovery.com

High throughput chemistry and screening technologies, combined with a growing list of targets from genomics, are flooding the drug discovery process with data. Data mining methods are routinely applied in an attempt to extract valuable knowledge from these large data sets and build predictive models. However, to maximize the perceptive power of these methods, the size of a data set is secondary to its information content. Information-driven library designs can enable successful data mining and construction of robust predictive models for activity. We are evaluating several new strategies for information-driven design of chemical libraries that are based on ensemble models and prediction confidence measures. Our approach incorporates active learning into the iterative process of discovery: compounds are selected for synthesis/screening based upon their ability to improve predictive models. We will compare these approaches to diversity and similarity designs using a published CDK-2 dataset to simulate multiple rounds of drug discovery.

88 Using topomer similarity and FlexS with OptDesign to create targeted GPCR libraries
Sun Choi1, Farhad Soltanshahi1, Michael S. Lawless1, Richard Hufton2, and Robert D. Clark1. (1) Tripos, Inc, 1699 South Hanley Road, St. Louis, MO 63144, Fax: 314-647-9241, schoi@tripos.com, (2) Tripos Discovery Research, LTD

OptDesign® is an extension of optimizable k-dissimilarity (OptiSim) selection that supports efficient construction of multiblock and sparse combinatorial libraries. At the method's heart is evaluation of a series of small reactant subsamples drawn alternately from each reagent pool, with the "best" candidate drawn from the subsample chosen at each iteration. When the criterion for "best" is structural dissimilarity with respect to products selected at preceding iterations, the libraries produced are diverse and representative. In addition, OptDesign allows for other product scoring mechanisms like physical property diversity, property profiling and externally generated scores with the option of ranking products using a multi-objective function that optimizes different criteria at the same time. Here we describe a more generalized approach for incorporating other product-based criteria for what constitutes "best" that complement structural diversity. In particular, FlexS and topomeric similarity to known ligands will be used to create targeted G-protein-coupled receptor (GPCR) libraries.
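The subsample-and-pick-best loop at the heart of k-dissimilarity selection can be sketched in a few lines. This is a simplified, single-pool illustration and not OptDesign itself: it omits the alternation between reagent pools, any similarity exclusion radius, and the handling of rejected subsample members. The function name `optisim_select`, its parameters, and the toy numeric pool are assumptions for the example.

```python
import random


def optisim_select(pool, n_select, k, dissimilarity, seed=0):
    """Pick n_select items; at each step draw k random candidates and keep
    the one whose nearest already-selected item is farthest away."""
    rng = random.Random(seed)
    remaining = list(pool)
    if not remaining or n_select <= 0:
        return []
    # Seed the selection with one random item.
    selected = [remaining.pop(rng.randrange(len(remaining)))]
    while len(selected) < n_select and remaining:
        subsample = rng.sample(remaining, min(k, len(remaining)))
        best = max(
            subsample,
            key=lambda c: min(dissimilarity(c, s) for s in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    # Toy pool of numbers with absolute difference as the dissimilarity.
    picks = optisim_select(range(20), n_select=5, k=4,
                           dissimilarity=lambda a, b: abs(a - b))
    print(picks)
```

In this sketch, a subsample size of 1 reduces to random selection and a subsample spanning the whole pool reduces to maximum-dissimilarity selection, so `k` trades representativeness against diversity. Swapping the `key` function for another score is one way to picture the alternative criteria for "best" that the abstract mentions.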

9:00 89 Solubility and permeability in oral absorption: Prediction success depends on chemistry structure
Christopher A. Lipinski, Exploratory Medicinal Sciences, Pfizer Global Research and Development, Groton Laboratories, Eastern Point Road, mail stop 8200-36, Groton, CT 06340, Fax: 860-715-3149, christopher_a_lipinski@groton.pfizer.com

To screen or not to screen for solubility and permeability? The answer depends on the compound chemistry. About 30-40% of medicinal chemistry compounds currently have poor aqueous solubility regardless of whether the compounds are combinatorial, are from hit to lead or from lead optimization. Predicting poor aqueous solubility for the large lipophilic compound is easy. Predicting poor solubility for the compound with strong inter-molecular crystal packing is more difficult. Permeability problems are more variable with few predicted permeability problems among combinatorial chemistry compounds. The chemistry compound spectrum ranges from compounds accurately (and easily) predicted to have few permeability issues to compounds where calculations are of little value and where experimental assays are absolutely critical to assessment of permeability. A rational approach to handling solubility / permeability issues requires an understanding of the compound solubility / permeability chemistry space.

9:30 90 Pharmaceutical decision making using LeadDecision™
Barry J. Wythoff, Research, Scientific Reasoning, 23B Congress Street, Newburyport, MA 01950, barry@scientificreasoning.com

The drug discovery process proceeds iteratively and discontinuously. At each iteration, a fateful decision must be made which might be phrased as “Out of the molecules that we can ? which should we ?”, wherein the action denoted by the question mark might be “order”, “test”, “synthesize”, “carry forward”… This is a question then, of selecting among alternatives. Increasingly, this selection must be made in the face of manifold dimensions that we wish to optimize. LeadDecision is designed to aid the scientist in rapidly accomplishing such selections using a combination of calculation, visualization and interaction. The calculation methods employed are adapted from economics, statistics, mathematics, artificial intelligence and operations research.

10:00 91 High-throughput hERG models derived from a high-quality training set
Mark Seierstad, Dimitris K. Agrafiotis, and Christophe Buyck, Computer Aided Drug Discovery, Johnson & Johnson Pharmaceutical Research & Development, L.L.C, 3210 Merryfield Row, San Diego, CA 92121, mseierst@prdus.jnj.com - SLIDES

The pharmaceutical industry has been forced to address a recent addition to ADME/Tox profiling: the ability of a compound to bind to and inhibit the hERG-encoded cardiac potassium channel. Our in silico efforts in this area started with the compilation of a large (more than 400) training set of compounds measured in a hERG channel inhibition assay. Initial models based on decision tree and linear regression techniques showed good predictive power for several different classes of compounds. These models were subsequently refined using feed-forward neural networks in conjunction with a variety of molecular descriptor sets and feature selection algorithms.

10:30 92 Prioritizing hit series when hERG is inherent
David Patterson, Tripos, Inc, 1699 South Hanley Road, St. Louis, MO 63144, Fax: 314-647-9241, pat@tripos.com, and Barbara Wible, ChanTest, Inc

Published pharmacophores for hERG are remarkably simple: a basic N roughly 5-9 Angstroms from the center of an aromatic ring. Since this molecular feature is a subset of known pharmacophores for many GPCR, kinase, and other targets, programs must address hERG liability in quantitative terms of therapeutic ratio. This requires QSAR methods rather than 3-D screens. Using the method of topomer CoMFA (Cramer, 2003), excellent predictive models of hERG IC50 for a specific chemical series are created from 10-20 assayed compounds. Multidimensional biological optimization is thereby enabled. These QSAR models are instantly applicable to large virtual libraries, typically representing millions of potential compounds. As hits emerge in the drug discovery process, all possible attractive analogs within each series are computationally evaluated for quantitative hERG prediction. Although individual compound predictions are subject to error, ensemble predictions are very suitable as one important measure in risk assessment to decide which series to progress.

11:00 93 Substructural analysis of toxicological databases
Hugo O. Villar, Mark R. Hansen, Jason Hodges, and Robin Friedman, Altoris, Inc, 5820 Miramar Rd #207, San Diego, CA 92121, hugo@altoris.com - SLIDES

Analysis of chemical databases with toxicological information can be of value in developing knowledge-based rules that improve the design of libraries for screening and for the generation of new chemical entities. SARvision, a new tool for scaffold perception, was used to analyze the prevalence of substructures in AMES and carcinogenicity databases. SARvision automatically identifies the scaffolds contained in these datasets by first carrying out an extensive enumeration of the substructures present and then applying a knowledge-based approach to narrow down the scaffolds considered. Finally, scaffolds are organized hierarchically to identify parent-child relationships between scaffolds. We present a comparison of results for different Salmonella strains and carcinogenicity in different species.

11:30 94 QSPR Studies of PBDEs
Paul G. Seybold and Matthew J O'Malley, Department of Chemistry, Wright State University, Dayton, OH 45435, Fax: 937-775-2717, paul.seybold@wright.edu

Three important environmentally related properties (chromatographic retentions, octanol-air partition coefficients, and vapor pressures) of the widely used flame retardants polybrominated diphenyl ethers (PBDEs) were modeled using a variety of molecular structural descriptors. Whereas the employment of relatively sophisticated structural descriptors generally provided slightly better statistical models for the properties of these compounds, the use of quite simple structural parameters provided stronger insights into the physical mechanisms underlying the properties. The results are interpreted in terms of models for the cohesive forces acting between the compounds involved.

9:00 95 Application of virtual screening technologies on discovery of factor Xa inhibitors
Guyan Liang, Department of Molecular Modeling, Sanofi-Aventis, Route 202-206, Bridgewater, NJ 07059, Fax: 908-231-3605, Guyan.Liang@aventis.com, and Isabelle Morize, Molecular Modeling, Sanofi-Aventis

Virtual screening (VS) is a powerful tool for lead identification. It allows us to focus our attention on chemotypes with desired features and, in cases where a high-throughput screening assay is not possible, allows for a rational selection of compounds. While the enrichment rate of virtual screening is always one of the focal points of discussion, it is in our view more important to ensure that the VS algorithms offer adequate chemotype coverage and allow us to leap beyond known scaffold series. However, this is most often the most challenging part of the problem, especially for ligand-based virtual screening technologies, e.g., similarity search. To better understand the limitations and power of virtual screening technologies, we performed a retrospective analysis of factor Xa inhibitor chemotype discoveries. In this study, chemotypes were grouped by generations (i.e., chronological discoveries of chemotypes) and the following software/descriptors were included: MDL public keys (http://www.mdl.com), SciTegic extended connectivity fingerprint (http://www.scitegic.com), Unity fingerprint (http://www.tripos.com), Daylight fingerprint (http://www.daylight.com), SiBIS (http://www.moltop.com), BCI fingerprint (http://www.bci.gb.com), and Feature Tree (http://www.biosolveit.de). Starting with the first generation of chemotypes and sequentially adding chemotypes to derive queries, each software package and descriptor was used to search a large database of randomly selected compounds and all factor Xa inhibitors. The ability to retrieve similar compounds and more distant factor Xa chemotypes within the first fractions of the hitlist was studied and will be presented and discussed.

9:25 96 Comparative pharmacophore modeling of organic anion transporting polypeptides: a meta-analysis of rat Oatp1a1, human OATP1A2 and OATP1B1
Cheng Chang, Biophysics Program, Ohio State University, 1614 Sparks Rd, Sparks, MD 21152, chang.440@osu.edu, Sandy Pang, Department of Pharmacology, University of Toronto, S Ekins, GeneGo, and Peter Swaan, Department of Pharmaceutics, University of Maryland at Baltimore - SLIDES

The organic anion transporting polypeptides (OATPs) are key membrane transporters for which crystal structures are not currently available. They transport a diverse array of xenobiotics and are expressed at the interface of hepatocytes, renal tubular cells, enterocytes and the choroid plexus. Pharmacophore models were produced for rat Oatp1a1 and human OATP1B1 and OATP1A2 to aid the understanding of the key molecular features for substrate-transporter interactions. Literature data from CHO, HeLa, HEK-293 cells and X. laevis oocytes were used to construct pharmacophores for each individual transporter, which were later merged. Additionally, meta-pharmacophores were generated from the combined datasets of each cell system used with the same transporter. The pharmacophores for each transporter consisted of hydrogen bond acceptor and hydrophobic features. There was good agreement between the merged and meta-pharmacophores containing 2 hydrogen bond acceptors and 2 or 3 hydrophobic features for Oatp1a1 and OATP1B1. The OATP1A2 pharmacophore overlapped with these transporters but consisted of one hydrogen bond acceptor and two hydrophobes. External test sets were used to validate the individual pharmacophores. The meta-pharmacophore approach provided new molecular insight into the key features for these OATP transporters with the limited data available.

9:50 97 Integrated approaches to informatics: Bayer HealthCare Pharmacophore Informatics Platform, Part 1: Document handling, project support and portfolio management
William J. Scott1, Stefan Weigand2, Peter G. Nell2, Stephan-Nicholas Wirtz2, Emanuel Lohrmann2, Roger-Michael Brunne2, and Joachim Mittendorf2. (1) Department of Chemistry Research, Bayer HealthCare Corp, 400 Morgan Lane, West Haven, CT 06516, Fax: 203-812-3655, william.scott.b@bayer.com, (2) Bayer HealthCare AG

This presentation will highlight the new and innovative, integrated Pharmacophore Informatics Platform that is used at Bayer HealthCare AG. A single sign-on system has been developed to provide project teams with document handling and sharing centers. The system is constructed so as to allow the team to readily access central project information, including reports, spreadsheets, and project status information. Selected information may also be shared with the Bayer research community at the discretion of the project leader. The system also includes a portfolio management component that taps information on project status and milestone achievement to aid planning and prioritization. Illustrative examples of the components and their interaction will be presented.

A section on data integration, analysis and visualization as part of the Pharmacophore Informatics Platform will be given in a separate session.

10:15 98 Integrated approaches to informatics: Bayer HealthCare Pharmacophore Informatics Platform, Part 2: Data integration, analysis and visualization
Peter G. Nell1, Michael Haerter1, Roger-Michael Brunne1, William J. Scott2, Stefan Mundt1, Andreas Goeller1, Jill Wood2, Florian Reiche3, Martin Ruppelt3, and Joachim Mittendorf1. (1) Bayer HealthCare AG, Aprather Weg 18, D-42096 Wuppertal, Germany, peter.nell@bayerhealthcare.com, (2) Department of Chemistry Research, Bayer HealthCare Corp, (3) Bayer Business Services GmbH

This presentation will highlight the new and innovative, integrated Pharmacophore Informatics Platform that is used at Bayer HealthCare. Databases that were historically scattered and overlapping have now been integrated into a single data warehouse that includes a newly developed query device providing single point access to all research-relevant data. Data analysis and visualization methods that feature a fully synchronized spreadsheet with different types of graphs and charts, will be described, as well as several state of the art computational analysis tools for the exploration of data sets of all sizes. These tools are easy to use and are available to all research scientists to be used for the generation of structure activity relationships and as decision support for lead identification and optimization. An illustrative overview will be given.

A section on project management and document handling as part of the Pharmacophore Informatics Platform will be given in a separate session.

10:40 99 LigandScout: Interactive automated pharmacophore model generation from ligand-target complexes
Gerhard Wolber, Inte:Ligand GmbH, Clemens Maria Hofbauer-G. 6, 2344 Maria Enzersdorf, Austria, wolber@inteligand.com, and Thierry Langer, Institute of Pharmacy, University of Innsbruck - SLIDES

Computer-aided molecular design together with virtual screening have emerged as one answer to steadily increasing economic pressure that forces the pharmaceutical industry to develop new drugs in a faster and more efficient way. [1] We present a new approach for structure-based high throughput pharmacophore model generation. The LigandScout program [2] provides an automated method for creating feature-based pharmacophore models from experimentally determined structure data, e.g. publicly available from the Protein Databank (PDB). In a first step, small molecule ligands from the PDB are extracted and assignment of hybridization states and bond orders is performed. Second, from the interactions of the interpreted ligands with relevant surrounding amino acids, pharmacophore models reflecting functional interactions are created. These models can be used for screening molecular databases for similar modes of actions on the one hand, or for establishing bio-activity profiles for one single compound on the other hand.


[1] H. Kubinyi. In Search for New Leads, EFMC - Yearbook 2003, 14-28.

[2] G. Wolber, T. Langer: 3D Pharmacophores Derived from Protein-Bound Ligands and their Use as Virtual Screening Filters, J. Chem. Inf. Comput. Sci, in press.

11:05 100 Virtual screening of combinatorial libraries for asymmetric catalysis
Jonathan D. Hirst, School of Chemistry, University of Nottingham, University Park, Nottingham NG7 2RD, United Kingdom, Fax: +44-115-951-3562, jonathan.hirst@nottingham.ac.uk

High-throughput screening has revolutionized drug discovery; the field of catalyst discovery and optimization is poised to undergo an analogous upheaval. However, due to the large number of possible compounds that can be synthesized, computational approaches to guide synthetic efforts are needed. In the area of asymmetric catalysis, some first steps in this direction have recently been published, but these have concentrated on building and analyzing the entire catalyst structure. An advantage of combinatorial chemistry is that a small number of reactants can be combined to form a large number of products. Here, we describe a method of constructing a three-dimensional Quantitative Structure-Selectivity Relationship (QSSR), based around the Comparative Molecular Field Analysis (CoMFA) methodology, that is focused on the substituents of a common catalytic core. By avoiding the necessity of modeling the individual catalysts as a whole, very large reductions in complexity and computational effort can be achieved.

101 Making the most of what you have: an information-theoretical approach to the selection of diverse subsets of predetermined size from chemical libraries
Melissa R. Landon, Bioinformatics Program, Boston University, 44 Cummington St., Boston, MA 02215, mlandon@bu.edu, and Scott E. Schaus, Department of Chemistry, Boston University

Current methodologies for diversity analysis and subset selection consist of either distance-based or partition-based metrics, where in both cases the compounds are first represented by a set of descriptors. While both methods are valid for general library reduction, they do not allow the chemist to ask the question “If I can make N compounds feasibly, which N should I choose?” We present a simplified information-theoretical-based approach to diversity analysis which allows the chemist to ask this question and results in subsets that, to the extent possible, represent the total diversity of the library. Our method determines the optimal N subset to be that which has the maximal information content based on Shannon entropy calculations. Results are then both mapped onto a cell-based space and analyzed by a distance-based metric to provide a comparison to standard techniques and prove that this method is a valid approach to library subset selection.
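The notion of choosing the N compounds with maximal Shannon entropy can be illustrated with a small sketch. The abstract does not specify how compounds are discretized or how the optimal subset is searched for, so two assumptions are made here: each compound has already been assigned to a descriptor bin, and the subset is built greedily, which does not guarantee the true optimum. The function names and the toy library are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the distribution of bin labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def greedy_max_entropy_subset(bins, n):
    """Greedily pick n compound ids so the entropy of their bins stays high.

    bins: dict mapping compound id -> descriptor bin label (assumed given)
    """
    chosen = []
    remaining = dict(bins)
    while len(chosen) < n and remaining:
        current = [bins[c] for c in chosen]
        # Add the compound that yields the highest entropy when included.
        best = max(remaining, key=lambda cid: shannon_entropy(current + [bins[cid]]))
        chosen.append(best)
        del remaining[best]
    return chosen


if __name__ == "__main__":
    library = {"c1": "A", "c2": "A", "c3": "B", "c4": "C"}
    print(greedy_max_entropy_subset(library, 3))
```

Entropy is highest when the chosen compounds are spread evenly over as many bins as possible, which is why a fixed budget of N compounds favors one representative per occupied region before any region is sampled twice.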

102 Information-theoretic approach to calculating molecular vibrational spectra
Ralph A. Wheeler, Haitao Dong, and Scott E. Boesch, Department of Chemistry and Biochemistry, University of Oklahoma, 620 Parrington Oval, Room 208, Norman, OK 73019, Fax: 405-325-6111, rawheeler@chemdept.chem.ou.edu

Information theory was used in the 1950s to reformulate statistical thermodynamics, but time correlation functions have never been considered in equivalent detail. Our recently developed method of spectral analysis of time correlation functions, called principal mode analysis when applied to vibrational spectra, will be presented in the framework of information theory. This perspective gives new insights into the method's relation to traditional methods of spectral analysis such as Fourier transforms and the maximum entropy method and shows why principal mode analysis is more accurate. Numerical tests will be presented to compare the accuracy of various methods for calculating vibrational spectra of several molecules, including liquid water.

103 Applications of information theory in quantum chemistry: The electron density function
Paul Geerlings and Greet Boon, Department of General Chemistry (ALGC), Free University of Brussels (VUB), Pleinlaan 2, Brussels 1050, Belgium, Fax: 32-2-6293317, pgeerlin@vub.ac.be

In recent years Density Functional Theory (DFT) has revolutionized Quantum Chemistry and thereby increased its impact on other subdisciplines of Chemistry, ranging from organic to inorganic and biological chemistry. In DFT, which combines computational efficiency with conceptual richness, the electron density function is considered the fundamental carrier of information. It is therefore of ever increasing importance to study this function for atoms and molecules. In this contribution two types of studies on the electron density function of atoms and molecules are presented in which Information Theory plays an important role. In the first part it is shown that Information Theory can be used to extract chemically relevant information from atomic densities. It is shown how periodicity appears in a natural way in the information-theory-based analysis of numerical Hartree-Fock densities for the elements H to Xe. In the second part the Hirshfeld partitioning of the electron density, known to show maximal conservation of the information content of isolated atoms upon molecule formation, is used to study the dissimilarity of enantiomers at the global and local level, quantifying the asymmetry of the chiral center in CHFClBr and simple amino acids and providing numerical evidence for Mezey's holographic density theorem.

104 Indices of the neighborhood complexity of molecules
Subhash C. Basak, Center for Water and the Environment, Natural Resources Research Institute, University of Minnesota, 5013 Miller Trunk Hwy, Duluth, MN 55811, Fax: 218-720-4328, sbasak@nrri.umn.edu

A contemporary interest of computational chemistry and biology is the characterization of complexity of chemical and biological systems. One important approach to the quantification of complexity of a system is through the application of information theoretic formalism originating from Shannon's information theory. We applied information theoretic formalism on neighborhoods of molecular systems represented by molecular graphs, multigraphs and pseudographs and calculated indices of molecular complexity. This follows from the discrete mathematical idea of open sphere, S(v,r), which defines different orders of topological neighborhoods for different integral values of r. The utility of these information theoretic neighborhood complexity indices in the characterization of chemical and biological systems will be discussed.

105 SHED: Molecular Shannon entropy descriptors from atom-centered feature distributions
Jordi Mestres and Elisabet Gregori-Puigjané, Chemogenomics Laboratory, Research Unit on Biomedical Informatics (GRIB), Municipal Institute of Medical Research (IMIM), Dr. Aiguader, 80, 08003 Barcelona, Spain, Fax: +34 93 2240875, jmestres@imim.es

A new set of information-theory molecular descriptors based on binned distributions of atom-centered features will be introduced. Basically, atoms in molecules are assigned to one or more of four pharmacophoric features, namely, hydrophobic, aromatic, hydrogen-bond acceptor and hydrogen-bond donor. The feature assignment is based on Sybyl atom types. Then, atom-centered feature distributions are derived from distances between pairs of features using crisp or fuzzy binning criteria. Their use and relative performance in applications to molecular similarity searching and diversity analysis will be presented and discussed.
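As a rough illustration of an entropy descriptor built from a binned feature-pair distance distribution, the sketch below uses crisp binning only. It is an assumption-laden toy, not the SHED procedure itself: the atom representation (a set of feature labels plus 3D coordinates), the bin width, and all function names are hypothetical, and the feature assignment from Sybyl atom types is taken as already done.

```python
import math
from collections import Counter
from itertools import combinations


def feature_pair_distances(atoms, feature_a, feature_b):
    """Distances between atom pairs in which one atom carries feature_a and the other feature_b."""
    distances = []
    for (feats_i, xyz_i), (feats_j, xyz_j) in combinations(atoms, 2):
        if (feature_a in feats_i and feature_b in feats_j) or (
            feature_b in feats_i and feature_a in feats_j
        ):
            distances.append(math.dist(xyz_i, xyz_j))
    return distances


def binned_entropy(distances, bin_width=1.0):
    """Shannon entropy (bits) of the crisp-binned distance distribution."""
    if not distances:
        return 0.0
    counts = Counter(int(d // bin_width) for d in distances)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical molecule: (feature labels, coordinates) per atom.
atoms = [
    ({"donor"}, (0.0, 0.0, 0.0)),
    ({"acceptor"}, (1.5, 0.0, 0.0)),
    ({"acceptor", "aromatic"}, (0.0, 2.5, 0.0)),
    ({"hydrophobic"}, (5.0, 5.0, 5.0)),
]
donor_acceptor = feature_pair_distances(atoms, "donor", "acceptor")
print(donor_acceptor, binned_entropy(donor_acceptor))
```

Repeating the calculation for each pair of the four feature types would give one entropy value per pair, which could then serve as a fixed-length descriptor vector for similarity or diversity comparisons.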

106 Relevance of feature selection for clustering molecules
Joerg K. Wegner, Florian Sieker, and Andreas Zell, Department of Bioinformatics (ZBIT), University, Sand 1, Tuebingen 72076, Germany, Fax: 049-7071-29-5091, wegnerj@informatik.uni-tuebingen.de

We present an extensive study to classify and cluster four different activity classes (5HT1A antagonists, thrombin inhibitors, MAO inhibitors and H2 antagonists) by using four different feature selection algorithms, ten different classification algorithms and three clustering algorithms.

We show that, depending on the number of features used, the generalization ability of the results varies dramatically for clustering compounds. The number of features used was based on the previous feature selection results. The classification rates range from 85% up to 97%, which is much better than the clustering confusion matrix results.

Finally, we conclude the work by presenting a new approach that avoids the feature selection dilemma.

9:00 107 Chemical information instruction, 1984–2004: who is leading the charge?
Jeremy R Garritano and F. Bartow Culp, Mellon Library of Chemistry, Purdue University, 504 W. State St., West Lafayette, IN 47907, jgarrita@purdue.edu

The ACS Committee on Professional Training has long emphasized the importance of chemical information instruction (CII) in the education of both undergraduate and graduate students. In 1984 and 1993, the ACS Chemical Information Division Education Committee surveyed US academic institutions regarding their level of CII. The results of these surveys have provided valuable information concerning the levels of activity and the difficulties faced in providing information instruction. However, in the 11 years since the last survey was conducted, there have been explosive changes in both information delivery and instructional methodology. We have therefore updated the survey instrument to capture such changes. We will present the results of the latest CII survey, conducted in 2004-5, and examine and discuss the trends shown by its comparison with findings from the previous surveys.

9:30 108 Chemistry meets marine biology: Where is the literature indexed?
James W. Markham and Charles F. Huber, Davidson Library, University of California, Santa Barbara, Santa Barbara, CA 93106-9010, Fax: 805-893-8620, markham@library.ucsb.edu, huber@library.ucsb.edu

There is a large body of literature for studies involving both chemistry and marine biology, but it is not clear whether this literature is more likely to be indexed as chemical literature or as marine biological literature. To compare coverage of these studies, four different databases were searched for four different topics. The databases searched were CAplus on the SciFinder Scholar interface; BIOSIS on the Ovid interface; ASFA on the CSA interface; and Science Citation Index on the Web of Science interface. The topics searched covered inorganic and/or organic chemistry, with fish, bacteria, or algae. Searches in all databases were limited to items published during 1994-2003; and to test currency, the same searches were run for items published in 2004. Comparative results are presented for each search, including number of unique citations for each database, and retrieval overlap between databases. Recommendations are given for optimal searching.

10:00 109 Creating a current awareness web page on complexity theory, life sciences, information theory, and entropy
Suzanne Fedunok, Coles Science Center, New York University Bobst Library, 70 Washington Square South, New York, NY 10012, Fax: 212-995-4283, suzanne.fedunok@nyu.edu - SLIDES

The goal of the site http://library.nyu.edu/research/physics/entropy/entropy.html is to guide users to scholarly resources, both current and retrospective, on Complexity Theory, Life Sciences, Information Theory, and Entropy. It brings bibliographic and other information on these subjects together for the first time in one site, which is to be updated on a routine basis using saved searches and bibliographic software. The site was designed specifically to assist researchers in the sciences, and the content is selected for the professional reader. Recognized specialists provided input and guidance regarding content.

10:30 110 GPCR KnowledgeBase: From sequences to ligands
Ah Wing E. Chan, Bissan Al-Lazikani, Ian Carruthers, Richard Cox, Scott Dann, Mark Davies, David Michalovich, and John Overington, Molecular Design, Inpharmatica, 60 Charlotte Street, London W1T 2NU, United Kingdom, Fax: +44 207 074 4700, e.chan@inpharmatica.co.uk - SLIDES

The GPCR KnowledgeBase is a curated database of sequence, structural and cheminformatic knowledge for a non-redundant set of 297 human type-1 GPCRs, accessible via a web interface. The contents are searchable by sequence, gene names, descriptions, synonyms and reported ligands. The receptors are classified in a hierarchical ligand-based scheme, and their associated natural ligands as well as FDA-approved drugs are annotated. The KnowledgeBase contains a library of homology models for all receptors based on the X-ray structure of bovine rhodopsin, which are suitable for initial binding and docking experiments. Five different views of the binding site are defined. The pre-computed clustering of binding site profiles for each GPCR, calculated using 491 physicochemical descriptors, provides an exhaustive analysis of receptor relationships, which is useful for de-orphanisation programs. Finally, the structural analysis of over 39,000 known GPCR-focused compounds, together with their calculated drug-like properties and associated bioactivity data, is mapped onto receptor classes. This database can be used for screening, designing focused libraries, addressing compound selectivity, and identifying appropriate counter screens based on binding site and/or ligand similarities.

11:00 111 New web services of public small-molecule databases, tools, and identifiers
Marc C. Nicklaus1, Markus Sitzmann1, and Wolf-Dietrich Ihlenfeldt2. (1) Laboratory of Medicinal Chemistry, CCR, National Cancer Institute/Frederick, NIH, DHHS, 376 Boyles Street, Frederick, MD 21702, Fax: 301-846-6033, mn1@helix.nih.gov, (2) Xemistry GmbH

We present the next generation of our online services and public databases at http://cactus.nci.nih.gov. Expanding from our previous 250,000-compound Enhanced NCI Database Browser, we have made available more than 20 different databases with a total of approximately 1.8 million structures in a similar web-based search and display interface. These databases originate from both U.S. Government agencies and commercial screening sample suppliers. We also present new automated tools for generating such web services, as well as new calculable hash code-based identifiers useful for rapid compound identification and database overlap analyses, which are made available to the public.
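The general principle behind hash-based identifiers for overlap analysis can be sketched as follows. This is not the identifier scheme described in the abstract: the sketch assumes each structure has already been reduced to a canonical string (for example a canonical SMILES), and the choice of SHA-1, the 16-character truncation, and the function names are illustrative assumptions.

```python
import hashlib


def structure_hash(canonical_structure):
    """Fixed-length identifier derived from a canonical structure string (assumed input)."""
    digest = hashlib.sha1(canonical_structure.encode("utf-8")).hexdigest()
    return digest[:16].upper()


def database_overlap(db_a, db_b):
    """Identifiers present in both databases, found by set intersection."""
    hashes_a = {structure_hash(s) for s in db_a}
    hashes_b = {structure_hash(s) for s in db_b}
    return hashes_a & hashes_b


# Hypothetical databases of already-canonicalized structure strings.
db_a = ["CCO", "c1ccccc1", "CC(=O)O"]
db_b = ["CCO", "CN"]
shared = database_overlap(db_a, db_b)
print(len(shared), structure_hash("CCO") in shared)
```

Because the identifier is computed from the structure alone, two databases can be compared by exchanging short hashes rather than full structure records; the quality of the comparison depends entirely on the canonicalization step, which is outside this sketch.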

11:30 112 Open Archive publication of scientific data: How “crystalloinformatics” can enable chemoinformatics
Simon J. Coles, School of Chemistry, University of Southampton, Southampton, United Kingdom, Fax: 442380596723, S.J.Coles@soton.ac.uk

Recent work by the UK National Crystallography Service (NCS) (http://www.soton.ac.uk/~xservice) has been aimed at developing an eScience infrastructure to facilitate the end-to-end crystallographic experiment. In addition, recent advances in instrumentation and computational resources have dramatically increased the output of the crystallographic laboratory. However, this presents a new problem in the dissemination of these vast amounts of structural data and information through current peer review publication protocols. Thus the funding bodies are getting poor value for money in their investments and the chemistry and chemoinformatics communities are being deprived of valuable data. The eBank-UK project (http://www.ukoln.ac.uk/projects/ebank-uk/) addresses the issue of dissemination of scientific data and uses the philosophy of the Open Archives Initiative (OAI) to solve this problem. The NCS has developed an Open Access Archive (OAA) of crystal structure data (http://ecrystals.chem.soton.ac.uk) which is operated in a similar fashion to an institutional repository. All the data generated during the course of the crystal structure experiment are deposited in the OAA with attached metadata, such as chemical name, empirical formula, authors, institution, International Chemical Identifier (INChI), etc. These metadata are exported to the public domain through the OAI Protocol for Metadata Harvesting (OAI-PMH) following conventional protocols for open publication (Dublin Core). This methodology allows electronic harvesting agents to visit the archive and gather any new metadata, which may then be stored, aggregated and linked by information provision services. The OAI publishing of crystallography data not only allows a fast track route to the public for reuse of this data, but it also enables more detailed discussions of chemistry in conventional journal publications without the distracting reproduction of experimental data.
The informatician may easily discover the existence of structural chemistry data, seamlessly navigate to any aspect of it, openly access it and download it for reuse in further 'value added' studies.
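How a harvesting agent gathers Dublin Core metadata over OAI-PMH can be sketched in a few lines. The request parameters (`verb=ListRecords`, `metadataPrefix=oai_dc`, optional `from`) and the XML namespaces come from the OAI-PMH and Dublin Core specifications; the base URL, the sample response, and the function names are hypothetical, and no real archive endpoint is assumed or contacted.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL; `from_date` limits it to newer records."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Extract the identifier and Dublin Core fields from each record in a response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(OAI_NS + "record"):
        identifier = record.findtext(OAI_NS + "header/" + OAI_NS + "identifier")
        fields = {}
        for element in record.iter():
            if element.tag.startswith(DC_NS):
                name = element.tag[len(DC_NS):]
                fields.setdefault(name, []).append((element.text or "").strip())
        records.append({"identifier": identifier, "dc": fields})
    return records


# Hand-written sample response (hypothetical content), so no network access is needed.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example crystal structure</dc:title>
          <dc:creator>A. Author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("http://example.org/oai", from_date="2005-01-01")
records = parse_records(SAMPLE_RESPONSE)
print(url)
print(records)
```

A real harvester would fetch the URL, follow any `resumptionToken` returned for large result sets, and store the harvested fields for aggregation; those steps are omitted here.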

113 An extensible framework for chemical application development in drug discovery
Lakshmi Akella, Michael Mandl, and Webster Homer, Tripos, Inc, 1699 South Hanley Road, St. Louis, MO 63144, Fax: 314-647-9241, lakella@tripos.com

Today's medicinal chemists face many challenges, one of which is to mine structural data consistently and accurately in tandem with associated data. Tripos' AUSPYX/Oracle extensible chemical framework allows developers and scientists to build chemical applications quickly and effectively. AUSPYX extends Oracle with powerful search operators, built-in chemical features and methods, and processes that address aromatic normalization, functional group standardization, and validation of structures. These robust chemical manipulation tools provide a highly desirable environment for smooth application development in drug discovery. A prototypical application utilizing this framework will be demonstrated.

114 Comparing human interface devices for chemical informatics
J. Christopher Phelan, David H. Silber, and A. James Laurino, Abacalab, Inc, 811 N. Franklin St., Wilmington, DE 19806, Fax: 302-213-9179, phelan@abacalab.com

Computer technology has become indispensable in the handling of chemical information, from spectrometry to chemical databases to electronic lab notebooks. Great strides have been made in structuring, searching, and presenting chemical data, and in capturing those generated by machines. However, hardware still limits our ability to capture human-generated chemical information despite improvements in graphical interfaces. We present here the results of a preliminary comparative study of traditional and emerging human interface devices that roughly quantifies this limitation in the chemical information context. We focus on the effect of hardware input devices on data capture, both efficiency (time and user effort) and effectiveness (accuracy and level of detail). We also evaluate the effect of the software interface (degree of interactivity) upon the relative inefficiency of the hardware devices. Finally, we discuss user perceptions of the various input devices, which are likely to affect acceptance of real-life chemical informatics tools.

115 Semantic support for smart laboratories
Jeremy G. Frey1, Hugo R. Mills2, Gareth V. Hughes2, Jamie Robinson3, Dave De Roure2, Monica M.C. Schraefel2, and Luc Moreau3. (1) Chemistry, University of Southampton, School of Chemistry, Southampton SO17 1BJ, United Kingdom, Fax: +44 23 8059 3781, j.g.frey@soton.ac.uk, (2) School of Electronics and Computer Science, University of Southampton, (3) School of Chemistry, University of Southampton

The electronic notebook is an integral part of a smart laboratory and is one of the main ways the laboratory record is generated. Triple stores containing RDF and other semantically-rich information provide a basis for recording this information in a flexible manner, and ultimately allow automated semantic reasoning on the stored information. These automatic processes are essential not only for subsequent analysis but also to provide suitable context with which to place observations in their proper place in the experiment record, and to ensure that the electronic lab book adjusts itself to the context in which it is being used. Information flows from other sources, such as environmental monitors for temperature and humidity, or systems for tracking researchers' presence, contribute to and annotate the records generated as the experiments are run and are conveniently handled using middleware tools including data broker services. Examples of how these systems have been set up and the way they aid chemical investigations will be provided.

116 Chemistry ELN challenges and benefits: Evaluation to implementation
Dilip P. Modi, Department of Chemistry, Incyte Corporation, Experimental Station - E336/137, Route 141 and Henry Clay Road, Wilmington, DE 19880, Fax: 302-425-2750, dmodi@incyte.com

Abstract text not available.


