In-House Development of a Chemical Registration System: A Case Study
Chemical registration systems play a central role in many R&D organizations. The main purpose of a registration system is to provide each scientist with a common framework for working with and sharing information about chemical entities. But this disarmingly simple mission statement hides a multitude of complexities just below the surface. Often, these complexities only become apparent after work on a chemical registration system has already begun.
Case studies about the creation of industrial R&D registration systems exist, but they are relatively uncommon. In this context, a recent publication by Martin and coworkers at Philip Morris International (PMI), "Building an R&D chemical registration system", may be useful for anyone planning to build or maintain an industrial R&D chemical registration system.
Starting Point
Before the registration system was developed, PMI's scientists kept substance data in Excel files and plain text, keyed by chemical names and other high-level descriptors.
An important prerequisite was therefore the ability to generate chemical structures from IUPAC or trivial names alone. The tool used by PMI was able to convert only about 70% of the 7,000 starting chemical names into structures, leaving roughly 2,100 names unresolved. Structures for these unconvertible names were found by manually searching public and closed databases. This phase of the conversion process required a non-trivial amount of time and effort.
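The split between automated conversion and manual follow-up can be sketched as below. This is a minimal illustration, not PMI's tool: `convert_names` and `toy_converter` are hypothetical names, and the small lookup table merely stands in for a real name-to-structure converter.

```python
def convert_names(names, converter):
    """Split names into converted structures and names needing manual lookup."""
    converted = {}
    unresolved = []
    for name in names:
        structure = converter(name)
        if structure is None:
            unresolved.append(name)  # queue for manual database search
        else:
            converted[name] = structure
    return converted, unresolved


# Hypothetical stand-in for a real converter: maps a few names to SMILES.
_TOY_LOOKUP = {"ethanol": "CCO", "benzene": "c1ccccc1", "acetic acid": "CC(=O)O"}


def toy_converter(name):
    """Return a SMILES string for a known name, or None if conversion fails."""
    return _TOY_LOOKUP.get(name.strip().lower())


converted, unresolved = convert_names(
    ["Ethanol", "benzene", "compound X-17"], toy_converter
)
```

Keeping the unresolved names in a separate list makes the size of the manual curation task visible from the start.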
The resulting collection of structures, chemical names, and associated data was converted into a structure-data file (SDF). An automated script then imported this file.
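For readers unfamiliar with the format, an SDF record is a molfile block followed by named data items and a `$$$$` terminator. The sketch below assembles one record by hand. The function name `sdf_record`, the field names, and the single-atom molfile are illustrative assumptions; in practice the molfile block would come from a cheminformatics toolkit, and data values are assumed here to be single-line.

```python
def sdf_record(molblock, data):
    """Build one SDF record: molfile block, data items, '$$$$' terminator."""
    lines = [molblock.rstrip("\n")]
    for field_name, value in data.items():
        lines.append(f"> <{field_name}>")  # data header line
        lines.append(str(value))           # data value (single line assumed)
        lines.append("")                   # blank line ends the data item
    lines.append("$$$$")                   # record delimiter
    return "\n".join(lines) + "\n"


# Hand-written V2000 molfile for a single carbon atom (illustration only).
METHANE_MOLBLOCK = "\n".join([
    "methane",
    "",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
])

record = sdf_record(METHANE_MOLBLOCK, {"NAME": "methane", "SOURCE": "legacy spreadsheet"})
```

Concatenating such records yields a file that a bulk-import script can process one record at a time.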
Chemical Uniqueness
The main purpose of a registration system is to provide a common framework (or vocabulary) for chemical substances. In most cases, this vocabulary turns out to be a system of numerical identifiers that can be shared throughout an organization. Two scientists working with the same identifier should never be confused about the identity of the substance that identifier represents. Likewise, they should never disagree about the identifier associated with a substance.
What properties should be used to determine uniqueness between two substances or samples? This was the central question faced by PMI. It is also the fundamental question faced by any organization attempting to build a registration system.
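The two requirements above (one substance per identifier, one identifier per substance) amount to a one-to-one mapping between identifiers and some canonical key. The sketch below assumes that such a key is already available (InChIKeys are used as examples); the `Registry` class and the `REG-` identifier format are hypothetical, and choosing the key is exactly the hard question the paper addresses.

```python
class Registry:
    """Toy registry: exactly one identifier per canonical structure key."""

    def __init__(self, prefix="REG"):
        self._prefix = prefix
        self._ids_by_key = {}

    def register(self, canonical_key):
        """Return the existing identifier for a key, or mint a new one."""
        if canonical_key not in self._ids_by_key:
            next_number = len(self._ids_by_key) + 1
            self._ids_by_key[canonical_key] = f"{self._prefix}-{next_number:06d}"
        return self._ids_by_key[canonical_key]


registry = Registry()
ethanol_id = registry.register("LFQSCWFLJHTTHZ-UHFFFAOYSA-N")    # ethanol
water_id = registry.register("XLYOFNOQVPJJNP-UHFFFAOYSA-N")      # water
ethanol_again = registry.register("LFQSCWFLJHTTHZ-UHFFFAOYSA-N")  # duplicate
```

Registering the same key twice returns the same identifier, so two scientists submitting the same substance cannot end up with different numbers.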
PMI's solution was based on a three-level hierarchy consisting of Molecules, Substances, and Batches.
By linking experimental data directly to Batches, rather than Substances or Molecules, PMI can easily handle cases in which a structure needs to be reassigned, a sample is a mixture, or no structure information is available.
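One way to picture this is with three small record types. The sketch below is an assumption-laden reading of the hierarchy, not PMI's schema: it treats a Molecule as a single structure, a Substance as zero or more Molecules, and a Batch as a physical sample of a Substance. All class and field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Molecule:
    molecule_id: str
    structure: str  # e.g. a SMILES string


@dataclass
class Substance:
    substance_id: str
    # Empty list: no structure known. Several entries: a mixture.
    molecules: list = field(default_factory=list)


@dataclass
class Batch:
    batch_id: str
    substance: Substance
    results: dict = field(default_factory=dict)  # experimental data lives here


# A batch is registered before its structure is known.
batch = Batch("B-1", Substance("S-1"), results={"assay_1": 0.42})

# Later the structure is assigned; the experimental data is untouched.
batch.substance = Substance("S-2", molecules=[Molecule("M-1", "CCO")])
```

Because results hang off the Batch, reassigning the structure only changes a reference and never requires migrating experimental data.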
Another advantage of this approach lies in information retrieval, particularly structure or chemical name-based searches. In these searches, the exact composition of the sample is less important than the presence of a molecule with the constitution and possibly stereochemistry of interest.
But the main motivation for a multi-tiered system like the one chosen by PMI is that certain kinds of data only make sense when associated with certain chemical concepts. For example, linking a location to a Batch makes sense, but linking a location to a Molecule less so. Likewise, a monoisotopic molecular mass only makes practical sense when calculated for a Molecule, not a Substance.
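The monoisotopic mass example can be made concrete. The sketch below sums the masses of the most abundant isotopes for a single molecule's elemental composition; the function name and the four-element table are illustrative assumptions, and a real system would cover the full periodic table.

```python
# Masses of the most abundant isotope of each element, in daltons.
MONOISOTOPIC_MASS = {
    "H": 1.00782503,
    "C": 12.0,
    "N": 14.00307401,
    "O": 15.99491462,
}


def monoisotopic_mass(element_counts):
    """Sum isotope masses for one molecule, given its element counts."""
    return sum(MONOISOTOPIC_MASS[el] * n for el, n in element_counts.items())


# Ethanol, C2H6O: a single, well-defined molecule.
ethanol_mass = monoisotopic_mass({"C": 2, "H": 6, "O": 1})
```

For ethanol this gives about 46.042 Da. A Substance that is a mixture of several Molecules has no single elemental composition to pass in, which is why the property belongs at the Molecule level.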
Significant effort at PMI was made to ensure consistent and meaningful distinctions between Molecules, particularly with respect to stereochemistry and partially-defined stereochemistry. For comparison, see how Genentech addressed issues around stereochemistry and tautomerism in its own chemical registration system.
Curation
Given the emphasis PMI placed on automated registration procedures, it may be surprising that the workflow calls for human validation of every submission prior to registration.
This decision was likely motivated by two factors: (1) the registration system was new and its performance characteristics unknown; and (2) submissions were expected to be few in number compared to queries, at least initially. For registration systems accepting hundreds or thousands of submissions a day, a strict manual validation system is likely to be impractical.
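A strict manual validation gate of this kind can be sketched as a small state machine in which nothing becomes registered without a named reviewer. The status names, the `Submission` fields, and the `review` function are hypothetical and are not taken from PMI's implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING = "pending"
    REGISTERED = "registered"
    REJECTED = "rejected"


@dataclass
class Submission:
    name: str
    structure: str
    status: Status = Status.PENDING
    reviewed_by: Optional[str] = None


def review(submission, registrar, approve):
    """Record a human decision; a submission can only be reviewed once."""
    if submission.status is not Status.PENDING:
        raise ValueError("submission has already been reviewed")
    submission.reviewed_by = registrar
    submission.status = Status.REGISTERED if approve else Status.REJECTED
    return submission


accepted = review(Submission("ethanol", "CCO"), registrar="registrar_1", approve=True)
rejected = review(Submission("mystery oil", ""), registrar="registrar_1", approve=False)
```

Every submission passes through `review`, so throughput is bounded by registrar time, which is the scaling concern raised above.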
User Interface
PMI developed two different user interfaces, one for each of the main roles it identified. For the limited number of Submitters and Registrars (read/write), a thick client was created. For the more numerous Viewers (read only), a Web application usable with Internet Explorer 7 was deployed.
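The role split described here maps naturally onto a small permission table. The sketch below encodes only what is stated above (Submitters and Registrars read and write, Viewers read only); the enum, the table, and the `can` helper are hypothetical names.

```python
from enum import Enum


class Role(Enum):
    VIEWER = "viewer"
    SUBMITTER = "submitter"
    REGISTRAR = "registrar"


# Permissions per role, as described in the text.
PERMISSIONS = {
    Role.VIEWER: {"read"},
    Role.SUBMITTER: {"read", "write"},
    Role.REGISTRAR: {"read", "write"},
}


def can(role, action):
    """Return True if the role is allowed to perform the action."""
    return action in PERMISSIONS[role]


viewer_can_write = can(Role.VIEWER, "write")
registrar_can_write = can(Role.REGISTRAR, "write")
```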
Conclusions
Chemical registration systems are valuable assets in many chemically-oriented R&D organizations. Case studies like those from Philip Morris and Genentech offer useful insights into some of the many complex issues to be considered before building a registration system.