Potential Projects from RDA Sponsored Workshop
Heads up for those attending ACS in Philadelphia. The RDA sponsored workshop 'Prioritizing Digital Data Challenges in Chemistry' held in July at the National Center for Computational Toxicology produced a number of potential projects that are listed below. You can find more information in the workshop summary at https://drive.google.com/open?id=12KF4wRCawY_ZZV3_OUrElSwBYSQzQeLlPT8CCCtL8Mg.
The workshop came out of the pain points collected during the ACS CINF Data Summit held in San Diego in March, and we encourage you to read the summaries below and then attend the 'Chemistry data pain points: distilled, analyzed, and next steps' symposium at ACS in Philadelphia. The session will be held in Room 112A - Pennsylvania Convention Center (1.55 - 4.10 pm Monday August 22nd). We are looking for your feedback, volunteers to participate in the projects listed, information about other data projects, and ideas you might have for additional projects.
Please join Evan Bolton, Leah McEwen and Ian Bruno tomorrow for what will be a lively discussion!
Stuart Chalk and Leah McEwen
RDA Co-Chairs and Organizers of the Workshop
Chemical Structure Standardization
Normalizing Chemical Structures Best Practices
Chemical structure standardization is necessary to transfer data between systems. There are many ways standardization is currently done, often with different goals in mind. As a result, applying different standardizers to the same structure can yield different structures. Each organization and chemist has their own way of representing chemical structures, with different preferences. For example, there are different approaches to aromaticity (e.g., in SMILES) that are incompatible and result in addition or subtraction of H2. Some ‘normalizers’ go wild and corrupt structures by making assumptions, such as the presence of acid or base or heating to 40°C when it comes to tautomer normalization, which transform the chemical substance into a related but not identical structure. There is great potential to facilitate the sharing of chemical data if all standardizing software could be directed to operate in a “sanctioned” fashion using best practices established for standardization of structures in chemical databases.
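To make the disagreement problem concrete, here is a minimal sketch of how one might compare standardizers on the same inputs. The two rules below are toy, string-level stand-ins invented for illustration (they are not real cheminformatics routines and do not parse SMILES); the function names and the sample inputs are assumptions, not anything discussed at the workshop.

```python
from typing import Callable, Dict, List

# A "standardizer" here is just a function from one SMILES string to another.
Standardizer = Callable[[str], str]


def keep_largest_fragment(smiles: str) -> str:
    """Toy rule (hypothetical): keep only the longest dot-separated fragment."""
    return max(smiles.split("."), key=len)


def sort_fragments(smiles: str) -> str:
    """Toy rule (hypothetical): keep every fragment, in sorted order."""
    return ".".join(sorted(smiles.split(".")))


def compare(
    smiles_list: List[str], standardizers: Dict[str, Standardizer]
) -> Dict[str, Dict[str, str]]:
    """Return only the inputs on which the standardizers disagree."""
    disagreements = {}
    for smi in smiles_list:
        results = {name: fn(smi) for name, fn in standardizers.items()}
        if len(set(results.values())) > 1:
            disagreements[smi] = results
    return disagreements


# Sample inputs (assumed): sodium acetate written as a salt, and benzene.
inputs = ["CC(=O)[O-].[Na+]", "c1ccccc1"]
rules = {
    "largest_fragment": keep_largest_fragment,
    "sorted_fragments": sort_fragments,
}
report = compare(inputs, rules)

for smi, results in report.items():
    print(smi, "->", results)
```

In this sketch the salt is flagged because one rule drops the counter-ion and the other keeps it, while benzene passes through both unchanged. A real comparison would swap in actual standardization tools in place of the toy rules.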
Open Chemical Structure File Formats
Recommending and standardizing use of a small number of open chemical file formats/representations will improve interoperability and reduce error for chemistry data exchange. This strawman covers three commonly used de facto community-based file formats: SMILES, SMARTS and CTAB. Each project proposal is of fairly limited scope and likely completable in a comparatively short period of time, with some degree of overlap in processes and concerns.
Chemical Structure Standardization Education and Outreach
This project focuses on helping more chemists and other stakeholders understand the issues of chemical structure standardization, its importance for chemical data exchange among humans and machines, and how these issues relate to their own work. Foundational activities will focus on identifying examples of how lack of standardization hinders research; articulating benefits to authors, readers, publishers, reviewers and educators; and better understanding why chemists draw molecules the way they do and where the critical points exist in communicating chemistry among humans and machines.
IUPAC Graphical Representation Guidelines Update
This project will focus on updating the IUPAC Chemical Structure Drawing standards to consider machine interpretation of chemical depictions and prevent corruption of the chemist's intention (obvious in the chemical depiction) when a depiction is converted to a chemical structure. The goal is to harmonize the guidelines with structure standardization and nomenclature considerations. There is a particular interest in helping to teach students/educators about these issues, based on the premise that if we train the next generation about the positive outcomes from using updated standards, the result will be an improvement in data quality over time.
Chemical Terminology Representation
IUPAC Orange Book Ontology
This project will focus on development of a small scale ontology of chemical terms based on terms in the current IUPAC Orange Book as a case study. Foundational activities will look for example terminologies that have been converted to ontologies, identify where terms are currently being used and in what contexts, and look at relationships of those terms to others and potential differences in definitions. Terms will be transferred to a formal ontology in a plain bibliographic format, and a framework will be developed for augmenting the definition of terms to clarify the semantic meaning and context.
IUPAC Gold Book Data Structure
The IUPAC Gold Book is a valued compendium of terms sourced from IUPAC published recommendations, including other Color Books and Pure and Applied Chemistry. The content is electronically accessible and linkable but not easily processable. This project is related to a current effort to extract the content data and term identifiers and migrate them into a more accessible format for increased usability.
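As a rough illustration of what "a more accessible format" could look like, the sketch below models one term as a structured record and round-trips it through JSON. The field names, the identifier scheme, and the placeholder values are all hypothetical; they do not reflect the actual Gold Book schema or the format chosen by the project.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class TermRecord:
    """Hypothetical record for one terminology entry (field names are assumed)."""

    identifier: str                                    # stable term identifier
    term: str                                          # the term itself
    definition: str                                    # definition text
    sources: List[str] = field(default_factory=list)   # citations for the definition
    related: List[str] = field(default_factory=list)   # identifiers of related terms


# Placeholder content only; not a real Gold Book entry.
record = TermRecord(
    identifier="EXAMPLE-0001",
    term="example term",
    definition="Placeholder definition text.",
    sources=["Placeholder citation"],
)

# Serialize to JSON so other tools can process the entry.
as_json = json.dumps(asdict(record), indent=2, sort_keys=True)
print(as_json)

# Reading the JSON back recovers an equal record.
restored = TermRecord(**json.loads(as_json))
```

Keeping the identifier as an explicit field is what would let other resources link to a term without scraping the rendered page.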
Use Cases for Semantic Chemical Terminology Applications
This scoping project will focus on researching the current chemical data transfer and communication landscape for potential applications of semantic terminology. Example use cases might include textbooks, patents, article and data indexing, standard protocols, experimental literature, published ontologies and thesauri with chemical terms, dictionaries for text mining, etc. Initial activities will analyze citations to terminology in the IUPAC Color Books (including the Gold Book) and Pure and Applied Chemistry.
Additional Ideas Discussed
• Semantic web exchange format for chemical structures
• Chemistry data and metadata documentation and interoperability guidelines
• Integration of chemical terminology into digital information literacy technologies
• Normalizing/harmonizing terminology to access toxicity data