Signals Blog

Computer Translation of IUPAC Chemical Nomenclature

Few methods for conveying organic chemical structures can match the scope of IUPAC nomenclature. Central to patents, papers, and reports, IUPAC names have the rare distinction of being readable by humans and machines alike. This article, the first in a series on IUPAC nomenclature translation, introduces some of the foundational works in the field.

Eugene Garfield

Although several name-to-structure software systems have been developed over the last thirty years, their origins can be traced to a single 1961 paper by Eugene Garfield that was subsequently republished.

Garfield's paper described a computer program capable of converting systematic organic nomenclature into empirical formulas. As explained in a much later interview, Garfield's immediate interest was to use the formulas as unique keys in his growing Index Chemicus. Nevertheless, the ultimate goal was to develop a program capable of producing structural diagrams from arbitrary systematic names.

A later paper described an eight-step algorithm for generating molecular formulas from systematic chemical names.
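To make the idea of going from a name to a formula concrete, here is a minimal Python sketch. It is not a reconstruction of Garfield's eight steps, which are not given here; it only handles short unbranched hydrocarbons, and the tables `STEMS` and `ENDINGS` and the function `molecular_formula` are hypothetical names chosen for illustration.

```python
# Hypothetical lookup tables covering only short unbranched hydrocarbons.
STEMS = {"meth": 1, "eth": 2, "prop": 3, "but": 4, "pent": 5, "hex": 6}
# Hydrogens lost relative to the saturated alkane CnH(2n+2).
ENDINGS = {"ane": 0, "ene": 2, "yne": 4}


def molecular_formula(name):
    """Return the molecular formula for names such as 'propane' or 'ethyne'."""
    name = name.lower()
    for stem, carbons in STEMS.items():
        ending = name[len(stem):]
        if name.startswith(stem) and ending in ENDINGS:
            # A double or triple bond needs at least two carbon atoms.
            if ENDINGS[ending] and carbons < 2:
                raise ValueError(f"{name!r} needs at least two carbons")
            hydrogens = 2 * carbons + 2 - ENDINGS[ending]
            carbon_part = "C" if carbons == 1 else f"C{carbons}"
            return f"{carbon_part}H{hydrogens}"
    raise ValueError(f"unsupported name: {name!r}")


print(molecular_formula("methane"))  # CH4
print(molecular_formula("ethyne"))   # C2H2
```

Even this toy version shows why a formula is attractive as an index key: it can be computed from name parts alone, without building a full structure.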

Chemical Abstracts Service

In 1967, Chemical Abstracts Service (CAS) published the first widely applicable set of rules for converting systematic organic nomenclature into machine-readable structures (a process the authors termed "Nomenclature Translation"). The algorithm begins at the first character of a name and works its way rightward, one character at a time. State is accumulated along the way by recognizing name components such as locants, punctuation, and name roots.
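The left-to-right scan described above can be sketched in a few lines of Python. This is an illustration of the general idea only, not the CAS rule set: the vocabularies `ROOTS`, `BONDS`, and `SUFFIXES`, the fixed root-bond-suffix order, and the function `scan` are all assumptions made for the example.

```python
# Hypothetical mini-vocabularies; the real CAS rules are far larger.
ROOTS = {"meth", "eth", "prop", "but", "pent", "hex"}
BONDS = {"an", "en", "yn"}
SUFFIXES = {"e", "ol", "one", "al"}
# Order in which letter components are expected (an assumption of this sketch).
EXPECTED = [("root", ROOTS), ("bond", BONDS), ("suffix", SUFFIXES)]


def scan(name):
    """Read a name one character at a time, accumulating state as we go."""
    components = []       # recognized (kind, text, locants) triples
    pending_locants = []  # locants read but not yet attached to a component
    digits = ""           # locant currently being read
    letters = ""          # component currently being read
    stage = 0             # index into EXPECTED
    for ch in name.lower():
        if ch.isdigit():
            if letters:
                raise ValueError(f"unrecognized component {letters!r}")
            digits += ch
        elif ch in "-,":
            if letters:
                raise ValueError(f"unrecognized component {letters!r}")
            if digits:
                pending_locants.append(int(digits))
                digits = ""
        elif ch.isalpha():
            if digits:
                raise ValueError("a locant must be followed by punctuation")
            if stage >= len(EXPECTED):
                raise ValueError("unexpected text after the suffix")
            letters += ch
            kind, vocabulary = EXPECTED[stage]
            if letters in vocabulary:
                # Attach any waiting locants to the component just recognized.
                components.append((kind, letters, pending_locants))
                pending_locants = []
                letters = ""
                stage += 1
        else:
            raise ValueError(f"unexpected character {ch!r}")
    if letters or digits or pending_locants:
        raise ValueError("name ended in the middle of a component")
    return components


print(scan("butan-2-ol"))
# [('root', 'but', []), ('bond', 'an', []), ('suffix', 'ol', [2])]
```

The scanner never looks ahead. It can get away with that only because each hypothetical vocabulary is prefix-free and the current stage says which vocabulary applies, which hints at how much state a full rule set has to carry.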

The CAS group later described a software implementation of the original algorithm. Written in assembly language, it weighed in at 205K of machine code, ran on an IBM 360/370, and could process 4900 names per minute.

Grammar-Based Translators

Although grammar-based analysis might seem like an obvious choice for chemical nomenclature translation given its systematic nature, the first in-depth studies did not appear until 1988. A series of papers by Kirby's group at the University of Hull comprehensively surveyed the field, developed a detailed context-free grammar for an important subset of IUPAC nomenclature, and described a working software implementation.

Name=Struct

Rejecting grammar-based approaches as too rigid for the loose way systematic nomenclature has been used in practice, CambridgeSoft worked on a different approach to the problem, publishing a description in 1999. Following a set of principles derived in large part from vendor catalogs and name queries received by a company-run web service, Name=Struct consisted of two main steps:

  1. Divide a name into a set of recognized fragments of maximum length, proceeding left to right, one character at a time.
  2. Given a set of fragments, assemble the corresponding structure.

Although conceptually simple, the Name=Struct approach required close attention to the ways name fragments relate to one another. The complete implementation, which produced in-memory structure representations from arbitrary names, consisted of roughly 30,000 lines of C++.
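The first of the two steps, splitting a name into recognized fragments of maximum length, amounts to a greedy longest-match scan. The Python sketch below shows that technique under stated assumptions: the lexicon `FRAGMENTS` and the function `split_fragments` are hypothetical and are not taken from Name=Struct, and the second step (structure assembly) is not attempted.

```python
# Hypothetical fragment lexicon; a production lexicon would be far larger.
FRAGMENTS = {
    "chloro", "methyl", "meth", "ethyl", "eth",
    "benzene", "benz", "an", "ane", "ol",
} | set("0123456789-,")


def split_fragments(name, lexicon=FRAGMENTS):
    """Split a name left to right, always taking the longest known fragment."""
    name = name.lower()
    longest = max(len(fragment) for fragment in lexicon)
    fragments = []
    pos = 0
    while pos < len(name):
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(longest, len(name) - pos), 0, -1):
            candidate = name[pos:pos + length]
            if candidate in lexicon:
                fragments.append(candidate)
                pos += length
                break
        else:
            raise ValueError(f"no known fragment at position {pos} of {name!r}")
    return fragments


print(split_fragments("methylbenzene"))    # ['methyl', 'benzene']
print(split_fragments("2-chloroethanol"))  # ['2', '-', 'chloro', 'eth', 'an', 'ol']
```

Note that "methyl" wins over the shorter "meth" in the first example. A purely greedy split never revisits such a choice, so the lexicon and the later assembly step have to absorb the cases where the longest fragment is not the intended one.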

OPSIN

OPSIN currently stands as the only broadly applicable, open-source systematic name translation software. A 2011 paper by Murray-Rust's group at Cambridge describes OPSIN's design and implementation in Java. At a high level, OPSIN works in two stages:

  1. Tokenization into "Words" via a backtracking, grammar-based automaton similar to that described by the Kirby group. Multiple valid parses may be detected, although in practice this was true for fewer than 10% of all names.
  2. Generation, processing, and assembly of structure fragments.
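The first stage above can be illustrated with a small backtracking tokenizer that returns every parse allowed by a toy grammar. This Python sketch is an assumption-laden illustration, not OPSIN's grammar or code: the token classes in `LEXICON`, the transition table `ALLOWED_NEXT`, and the function `parses` are all hypothetical.

```python
# Hypothetical token -> class table.
LEXICON = {
    "meth": "stem", "eth": "stem", "prop": "stem",
    "yl": "substituent_end",
    "an": "bond", "en": "bond",
    "ane": "ending", "e": "ending", "ol": "ending",
}
# Toy grammar: which classes may follow which ("start" and "end" are markers).
ALLOWED_NEXT = {
    "start": {"stem"},
    "stem": {"substituent_end", "bond", "ending"},
    "substituent_end": {"stem"},
    "bond": {"ending"},
    "ending": {"end"},
}


def parses(name, pos=0, state="start"):
    """Yield every token sequence that covers the name and obeys the grammar."""
    if pos == len(name):
        if "end" in ALLOWED_NEXT[state]:
            yield []
        return
    for token, token_class in LEXICON.items():
        if token_class in ALLOWED_NEXT[state] and name.startswith(token, pos):
            # Commit to this token, then backtrack to try the alternatives.
            for rest in parses(name, pos + len(token), token_class):
                yield [token] + rest


print(list(parses("ethanol")))  # [['eth', 'an', 'ol']]
print(list(parses("methane")))  # [['meth', 'an', 'e'], ['meth', 'ane']]
```

In this toy lexicon "methane" has two valid parses because "ane" overlaps with "an" followed by "e". That is the kind of ambiguity a backtracking tokenizer can surface and a single greedy pass would hide.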

OPSIN's recall and accuracy were found to be competitive with those of the CambridgeSoft implementation (in the form of ChemDraw 12). Machine-readable datasets used to determine the accuracy of OPSIN's results are available in the supporting information, as are the specific failure cases.

OPSIN continues to be actively maintained, with an up-to-date source code repository hosted on Bitbucket and a Web-based demo.

Conclusions

Systematic nomenclature translation continues to play an important role in many cheminformatics workflows today. As shown by the diversity and complexity of the approaches disclosed over the last 50+ years, the problem remains difficult to solve comprehensively.