Computer Translation of IUPAC Chemical Nomenclature
Few methods for conveying organic chemical structures can match the scope of IUPAC nomenclature. Central to patents, papers, and reports, IUPAC names have the rare distinction of being readable by humans and machines alike. This article, the first in a series on IUPAC nomenclature translation, introduces some of the foundational works in the field.
Eugene Garfield
Although several name-to-structure software systems have been developed over the last thirty years, their origins can be traced to a single 1961 paper by Eugene Garfield that was subsequently republished.
Garfield's paper described a computer program capable of converting systematic organic nomenclature into empirical formulas. As explained in a much later interview, Garfield's immediate interest was to use the formulas as unique keys in his growing Index Chemicus. Nevertheless, the ultimate goal was to develop a program capable of producing structural diagrams from arbitrary systematic names.
A later paper described an eight-step algorithm for generating molecular formulas from systematic chemical names.
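To make the idea of computing a formula from a name concrete, here is a minimal Python sketch that handles only unbranched acyclic names built from one stem and one suffix. It is not Garfield's eight-step algorithm, whose steps are not reproduced here. The tables `STEMS` and `SUFFIXES` and the function `molecular_formula` are assumptions introduced purely for illustration.

```python
import re

# Hypothetical lookup tables for this sketch (not taken from Garfield's paper).
STEMS = {"meth": 1, "eth": 2, "prop": 3, "but": 4, "pent": 5, "hex": 6}
# suffix -> (hydrogen adjustment relative to the alkane CnH2n+2, oxygens added)
SUFFIXES = {"ane": (0, 0), "ene": (-2, 0), "yne": (-4, 0), "anol": (0, 1)}


def _term(symbol: str, count: int) -> str:
    """Format one element of the formula, omitting a count of 1."""
    if count == 0:
        return ""
    return symbol if count == 1 else f"{symbol}{count}"


def molecular_formula(name: str) -> str:
    """Return the molecular formula for a simple stem + suffix name."""
    # Locants and punctuation do not change the formula, so drop them:
    # "but-2-ene" -> "butene", "propan-1-ol" -> "propanol".
    bare = re.sub(r"[\d,\-]", "", name.lower())
    for stem, carbons in STEMS.items():
        suffix = bare[len(stem):]
        if bare.startswith(stem) and suffix in SUFFIXES:
            h_delta, oxygens = SUFFIXES[suffix]
            hydrogens = 2 * carbons + 2 + h_delta
            return _term("C", carbons) + _term("H", hydrogens) + _term("O", oxygens)
    raise ValueError(f"name not covered by this sketch: {name!r}")


print(molecular_formula("but-2-ene"))    # C4H8
print(molecular_formula("propan-1-ol"))  # C3H8O
```

Each name component contributes a fixed atom count or adjustment, which is why a formula can be derived without ever building the full structure.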
Chemical Abstracts Service
In 1967, Chemical Abstracts Service (CAS) published the first widely applicable set of rules for converting systematic organic nomenclature into machine-readable structures (a process the authors termed "Nomenclature Translation"). The algorithm begins at the first character of a name and works its way rightward, one character at a time. State is accumulated along the way by recognizing name components such as locants, punctuation, and name roots.
The CAS group later described a software implementation of the original algorithm. Written in assembly language, it weighed in at 205K of machine code, ran on an IBM 360/370, and could process 4900 names per minute.
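The left-to-right, character-at-a-time scan described above can be illustrated with a small Python sketch. This is not the CAS rule set: the `ROOTS` vocabulary and the `scan` function are hypothetical, and the sketch assumes that no root is a prefix of another, so a root can be emitted as soon as the accumulated letters match.

```python
# Hypothetical, prefix-free vocabulary for this sketch only.
ROOTS = {"meth", "eth", "prop", "but", "pent", "hex",
         "yl", "ane", "ene", "ol", "chloro", "di"}
PUNCTUATION = ",-()"


def scan(name: str) -> list[tuple[str, str]]:
    """Classify a name into locants, punctuation, and roots in one pass."""
    tokens: list[tuple[str, str]] = []
    letters = ""  # partial root accumulated so far
    digits = ""   # partial locant accumulated so far

    def flush_locant() -> None:
        nonlocal digits
        if digits:
            tokens.append(("locant", digits))
            digits = ""

    for ch in name.lower():
        if ch.isalpha():
            flush_locant()
            letters += ch
            if letters in ROOTS:          # recognized a complete root
                tokens.append(("root", letters))
                letters = ""
        elif letters:
            # A digit or punctuation mark arrived in the middle of a root.
            raise ValueError(f"unrecognized component: {letters!r}")
        elif ch.isdigit():
            digits += ch
        elif ch in PUNCTUATION:
            flush_locant()
            tokens.append(("punctuation", ch))
        else:
            raise ValueError(f"unexpected character: {ch!r}")

    if letters:
        raise ValueError(f"unrecognized component: {letters!r}")
    flush_locant()
    return tokens


print(scan("2-chlorobutane"))
```

A real vocabulary is not prefix-free ("meth" versus "methoxy", for instance), which is one reason practical translators need lookahead or backtracking.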
Grammar-Based Translators
Although grammar-based analysis might seem like an obvious choice for chemical nomenclature translation given its systematic nature, the first in-depth studies did not appear until 1988. A series of papers by Kirby's group at the University of Hull comprehensively surveyed the field, developed a detailed context-free grammar for an important subset of IUPAC nomenclature, and described a working software implementation.
- Introduction and background to a grammar-based approach: Comprehensive review of the field of systematic nomenclature translation.
- Development of a formal grammar: Illustrates the process of building a systematic nomenclature grammar starting with saturated hydrocarbons.
- Development of a formal grammar (Supporting Information): Provides the first published example of an IUPAC nomenclature grammar.
- Syntax analysis and semantic processing: Implementation of a nomenclature translator using a Simple LR (SLR) backtracking algorithm together with the preceding grammar to produce a semantic tree.
- Concise connection tables to structure diagrams: Description of the temporary data structure obtained immediately after parsing a name, and how to transform it into a connection table.
- Steroid nomenclature: Expansion of the grammar and parser to include steroids.
- (Semi)automatic name correction: A combination of loosened grammar rules and ad-hoc procedures can be used to correct many common errors in systematic names.
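To give a feel for what a grammar for saturated hydrocarbons looks like, here is a minimal Python sketch. The grammar is a toy invented for this article, not the one published by the Kirby group, and it is parsed by hand-written recursive descent rather than the table-driven SLR approach described in the papers. The names `Parser`, `STEM`, and `MULTIPLIERS` are all assumptions.

```python
import re

# Toy grammar (hypothetical):
#   name        := substituent* stem "ane"
#   substituent := locants "-" multiplier? stem "yl" "-"?
#   locants     := number ("," number)*
STEM = r"meth|eth|prop|but|pent|hex"
MULTIPLIERS = {"": 1, "di": 2, "tri": 3}


class Parser:
    def __init__(self, text: str) -> None:
        self.text = text.lower()
        self.pos = 0

    def accept(self, pattern: str):
        """Consume and return a match at the current position, else None."""
        match = re.compile(pattern).match(self.text, self.pos)
        if match is None:
            return None
        self.pos = match.end()
        return match.group(0)

    def expect(self, pattern: str) -> str:
        token = self.accept(pattern)
        if token is None:
            raise ValueError(f"expected {pattern!r} at position {self.pos}")
        return token

    def name(self) -> dict:
        substituents = []
        # A leading digit can only start a substituent in this grammar.
        while self.pos < len(self.text) and self.text[self.pos].isdigit():
            substituents.append(self.substituent())
        parent = self.expect(STEM)
        self.expect("ane")
        if self.pos != len(self.text):
            raise ValueError(f"trailing text at position {self.pos}")
        return {"parent": parent, "substituents": substituents}

    def substituent(self) -> dict:
        locants = [int(self.expect(r"\d+"))]
        while self.accept(","):
            locants.append(int(self.expect(r"\d+")))
        self.expect("-")
        multiplier = self.accept("di|tri") or ""
        group = self.expect(STEM) + self.expect("yl")
        self.accept("-")
        # Semantic check: "di" needs two locants, "tri" three, and so on.
        if len(locants) != MULTIPLIERS[multiplier]:
            raise ValueError("locant count does not match multiplier")
        return {"locants": locants, "group": group}


print(Parser("3-ethyl-2-methylpentane").name())
```

The resulting dictionary plays the role of a very small semantic tree: it records the parent chain and where each substituent attaches, which is the information a later stage would need to build a connection table.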
Name=Struct
Rejecting grammar-based approaches as too rigid for the loose way systematic nomenclature has been used in practice, CambridgeSoft developed a different approach to the problem, publishing a description in 1999. Following a set of principles derived in large part from vendor catalogs and name queries received by a company-run web service, Name=Struct consisted of two main steps:
- Divide a name into a set of recognized fragments of maximum length, proceeding left to right, one character at a time.
- Given a set of fragments, assemble the corresponding structure.
Although conceptually simple, the Name=Struct approach required close attention to the ways name fragments relate to one another. The complete implementation to produce in-memory structure representations from arbitrary names consisted of roughly 30,000 lines of C++.
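The first of the two steps, splitting a name into recognized fragments of maximum length, can be sketched in Python as a greedy longest-match loop. The `FRAGMENTS` vocabulary and the `fragment` function are hypothetical and far smaller than anything a production system would use. The second step, structure assembly, is not attempted here.

```python
# Hypothetical fragment dictionary for this sketch only.
FRAGMENTS = {
    "meth", "methyl", "methoxy", "eth", "ethyl", "prop", "propyl",
    "but", "butyl", "ane", "ene", "ol", "benzene", "chloro", "-", ",",
}


def fragment(name: str) -> list[str]:
    """Split a name left to right, always taking the longest known fragment."""
    text = name.lower()
    longest = max(len(f) for f in FRAGMENTS)
    pieces: list[str] = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(longest, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            # Runs of digits are treated as locants.
            if candidate in FRAGMENTS or candidate.isdigit():
                pieces.append(candidate)
                pos += size
                break
        else:
            raise ValueError(f"no known fragment at position {pos}")
    return pieces


print(fragment("2-methoxypropane"))  # ['2', '-', 'methoxy', 'prop', 'ane']
print(fragment("ethylbenzene"))      # ['ethyl', 'benzene']
```

Preferring "methoxy" over "meth" shows why maximum length matters: the shorter match would leave "oxy" stranded. Greedy matching never revisits a choice, though, so the relationships between neighboring fragments have to be resolved during assembly.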
OPSIN
OPSIN currently stands as the only broadly applicable, open-source systematic name translation software. A 2011 paper by Murray-Rust's group at Cambridge describes OPSIN's design and implementation in Java. A high-level overview can be given as:
- Tokenization into "Words" via a backtracking, grammar-based automaton similar to that described by the Kirby group. Multiple valid parses may be detected, although in practice this was true for fewer than 10% of all names.
- Generation, processing, and assembly of structure fragments.
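The possibility of multiple valid parses is easy to demonstrate with a toy backtracking tokenizer in Python. This is not OPSIN's automaton or vocabulary: the function `all_parses` and the token set passed to it are assumptions made for illustration only.

```python
from functools import lru_cache


def all_parses(name: str, vocabulary: set[str]) -> list[list[str]]:
    """Return every way to split `name` into tokens from `vocabulary`."""
    text = name.lower()
    vocab = frozenset(vocabulary)

    @lru_cache(maxsize=None)
    def parses_from(pos: int) -> tuple[tuple[str, ...], ...]:
        if pos == len(text):
            return ((),)  # one way to parse nothing: the empty parse
        results = []
        for end in range(pos + 1, len(text) + 1):
            token = text[pos:end]
            if token in vocab:
                # Try this token, then every parse of the remainder.
                for rest in parses_from(end):
                    results.append((token,) + rest)
        return tuple(results)

    return [list(parse) for parse in parses_from(0)]


# Hypothetical vocabulary in which "propanol" is ambiguous.
print(all_parses("propanol", {"prop", "propan", "an", "ol"}))
# [['prop', 'an', 'ol'], ['propan', 'ol']]
```

Unlike a greedy tokenizer, this approach keeps every candidate split and leaves it to a later stage to decide which one is chemically meaningful. An empty list means the name could not be tokenized at all.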
OPSIN's recall and accuracy were found to be competitive with those of the CambridgeSoft implementation (in the form of ChemDraw 12). Machine-readable datasets used to determine the accuracy of OPSIN's results are available in the supporting information, as are the specific failure cases.
OPSIN continues to be actively maintained, with an up-to-date source code repository hosted on Bitbucket and a Web-based demo.
Conclusions
Systematic nomenclature translation continues to play an important role in many cheminformatics workflows today. As shown by the diversity and complexity of the approaches disclosed over the last 50+ years, the problem remains difficult to solve comprehensively.