Signals Blog

Visualizing the SMILES Language with Railroad Diagrams

SMILES is a language for encoding chemical structures. As with any language, a "grammar" (a set of rules) determines how SMILES components can be arranged. Unfortunately, most descriptions of SMILES grammar begin and end with text narratives. Text has its place, but pictures can be far more effective.

How can we graphically represent the rules for making a valid SMILES string?

Railroad Diagrams

Railroad Diagrams have been used to describe many computer languages. An excellent example can be found in the JSON Specification. The fundamental unit of JSON, object, is defined graphically as:

JSON Object
Railroad Diagram Example (json.org)

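Reading Railroad Diagrams is simple. Start on the left and follow the line rightward. Each box or oval on the path names something that must appear at that point in the text. At a branch, take any one of the available paths; a path that loops back allows repetition. The text is complete when the line exits on the right. Applying these rules to the above diagram gives these valid JSON objects: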

{}
{"color":"red"}
{"width":10,"height":5}
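As a quick cross-check, the three strings above can be fed to a JSON parser. The sketch below uses Python's standard json module. The helper name is_json_object is made up for this illustration, and the rejected strings are my own examples of paths the diagram does not allow, not samples from the JSON Specification.

```python
import json

def is_json_object(text):
    """Return True if text parses as JSON and its top-level value is an object."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(value, dict)

# The three strings traced through the diagram above.
valid = ['{}', '{"color":"red"}', '{"width":10,"height":5}']

# Illustrative strings that leave the tracks: a missing colon,
# a trailing comma, and an unquoted key.
invalid = ['{"color" "red"}', '{"width":10,}', '{color:"red"}']

for text in valid + invalid:
    print(text, "->", is_json_object(text))
```

Each rejected string corresponds to a spot where no track exists in the diagram, for example between a key and a value without passing through the colon.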
 

A Railroad Diagram for SMILES

A full Railroad Diagram for SMILES is available online as a hyperlinked document. To my knowledge, it is the only published, complete example of such a diagram for the SMILES language. This diagram is a work in progress and is based on the OpenSMILES specification.

The SMILES Railroad Diagram is made up of several interlocking modules. A few of them are described in detail here to illustrate interpretation.

SMILES
Top-Level SMILES Railroad Diagram

A SMILES consists of one mandatory Atom, optionally followed by any number of Chain or Branch elements in any order. The terms Atom, Chain, and Branch are themselves defined elsewhere within the full diagram.
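The top-level rule can be read as "Atom, then zero or more of (Chain or Branch)". Below is a minimal Python sketch of that shape only. The function parse_smiles and the three toy_* helpers are hypothetical names, and the toy helpers are deliberately crude stand-ins (three single-letter atoms, no bonds, no ring closures, no nested branches), not the definitions used in the full diagram.

```python
def parse_smiles(text, parse_atom, parse_chain, parse_branch):
    """Recognize the shape: Atom (Chain | Branch)*.

    Each parse_* argument takes (text, pos) and returns the new position
    on success or None on failure. Returns True only if the whole string
    is consumed.
    """
    pos = parse_atom(text, 0)          # the one mandatory Atom
    if pos is None:
        return False
    while pos < len(text):             # any number of Chain or Branch
        nxt = parse_chain(text, pos)
        if nxt is None:
            nxt = parse_branch(text, pos)
        if nxt is None or nxt == pos:  # neither matched, or no progress
            return False
        pos = nxt
    return True

# Toy stand-ins (assumptions for illustration only).
def toy_atom(text, pos):
    return pos + 1 if text[pos:pos + 1] in ("C", "N", "O") else None

def toy_chain(text, pos):
    # One or more toy atoms in a row.
    end = pos
    while (nxt := toy_atom(text, end)) is not None:
        end = nxt
    return end if end > pos else None

def toy_branch(text, pos):
    # A parenthesized run of toy atoms, without nesting.
    if text[pos:pos + 1] != "(":
        return None
    end = toy_chain(text, pos + 1)
    if end is None or text[end:end + 1] != ")":
        return None
    return end + 1

for s in ["C", "CCO", "CC(O)C", "", "(C)", "C("]:
    print(repr(s), parse_smiles(s, toy_atom, toy_chain, toy_branch))
```

Passing the sub-parsers in as arguments mirrors the way the diagram defers Atom, Chain, and Branch to their own modules.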

Atom
Atom Railroad Diagram

An Atom is exactly one of the following: OrganicSymbol, AromaticSymbol, AtomSpec, or WILDCARD.

OrganicSymbol
OrganicSymbol Railroad Diagram

An OrganicSymbol represents one of the chemical elements in the "Organic Subset" that is widely used in organic chemistry: B, C, N, O, P, S, F, Cl, Br, and I. The branching notation used here both compacts the graphical presentation and aids in developing automated parsers.
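To show why a branching layout maps well onto parser code, here is a small Python sketch of an OrganicSymbol matcher. It assumes the branches group symbols that share a first letter (B with Br, C with Cl), so one character is read and a second is consumed only if it completes a two-letter symbol. The function name match_organic_symbol is hypothetical and is not taken from any existing tool.

```python
def match_organic_symbol(text, pos=0):
    """Match one OrganicSymbol starting at pos.

    Returns (symbol, new_pos) on success, or None if no symbol starts here.
    """
    ch = text[pos:pos + 1]        # empty string if pos is past the end
    nxt = text[pos + 1:pos + 2]   # lookahead of one character
    if ch == "B":
        # Branch: B alone, or B followed by r.
        return ("Br", pos + 2) if nxt == "r" else ("B", pos + 1)
    if ch == "C":
        # Branch: C alone, or C followed by l.
        return ("Cl", pos + 2) if nxt == "l" else ("C", pos + 1)
    if ch and ch in "NOPSFI":
        # Single-letter symbols with no further branching.
        return (ch, pos + 1)
    return None

for s in ["C", "Cl", "Br", "CC", "I", "c", ""]:
    print(repr(s), match_organic_symbol(s))
```

Note that "CC" matches only the first C: the second character is left for whatever rule follows, which is how the longer symbols Cl and Br are kept distinct from a C or B followed by another atom.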

The full diagram defines every element of the SMILES language in terms of similar Railroad Diagrams.

Testing

Specific SMILES examples can be parsed and validated using Smidge. This tool is based on the same underlying grammar used to generate the SMILES Railroad Diagram described here. What holds for the diagrams should hold for the parser, and vice versa. As the SMILES grammar is refined, both Smidge and the SMILES Railroad Diagram will immediately reflect those changes.

Conclusions

Railroad Diagrams are extremely useful, both as a learning tool for beginners and as a communication medium for experts. This article hasn't described how the SMILES Railroad Diagrams were generated or how the required grammar was developed. Future articles will address these points.