Visualizing the SMILES Language with Railroad Diagrams
SMILES is a language for encoding chemical structures. As with any language, a "grammar" (a set of rules) determines how SMILES components can be arranged. Unfortunately, most descriptions of SMILES grammar begin and end with text narratives. Text has its place, but pictures can be far more effective.
How can we graphically represent the rules for making a valid SMILES string?
Railroad Diagrams
Railroad Diagrams have been used to describe many computer languages. An excellent example can be found in the JSON Specification. The fundamental unit of JSON, object, is defined graphically as:
Reading Railroad Diagrams is simple. Start on the left and follow the horizontal line rightward. On reaching a square or oval, write down the item it contains; on reaching a branching path, pick any one branch. Continue until you exit on the right. Applying these rules to the above diagram gives these valid JSON objects:
{}
{"color":"red"}
{"width":10,"height":5}
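As a quick cross-check, the three strings above can be run through a real JSON parser. The sketch below uses Python's standard `json` module; the helper name `is_json_object` is an illustrative assumption, not part of any specification.

```python
import json

# The three example strings listed above.
examples = ['{}', '{"color":"red"}', '{"width":10,"height":5}']

def is_json_object(text):
    """Return True if text parses as JSON and its top-level value is an object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False

results = [is_json_object(example) for example in examples]
```

A string that skips a required stop on the track, such as `{"color"}` with no colon and value, is rejected by the same helper.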
A Railroad Diagram for SMILES
A full Railroad Diagram for SMILES is available online as a hyperlinked document. To my knowledge, it is the only published, complete example of such a diagram for the SMILES language. This diagram is a work in progress and is based on the OpenSMILES specification.
The SMILES Railroad Diagram is made up of several interlocking modules. A few of them are described in detail here to illustrate interpretation.
A SMILES consists of one mandatory Atom optionally followed by any number of Chain or Branch elements in any order. The terms Atom, Chain, and Branch are themselves defined elsewhere within the full diagram.
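The shape of this top-level rule, one Atom followed by a loop over Chain or Branch, maps naturally onto a small recursive-descent recognizer. The following is only a sketch: the real Atom, Chain, and Branch rules live in the full diagram, so the versions here are deliberately simplified stand-ins (no ring bonds, bracket atoms, or aromatic symbols), and every function name is hypothetical.

```python
# Stand-in sub-rules: simplified assumptions for illustration only.
# Two-letter symbols come first so "Cl" is not read as "C" followed by "l".
ATOMS = ("Cl", "Br", "B", "C", "N", "O", "P", "S", "F", "I")
BONDS = "-=#"

def parse_atom(s, i):
    """Return the index just past one atom starting at i, or None."""
    for symbol in ATOMS:
        if s.startswith(symbol, i):
            return i + len(symbol)
    return None

def parse_chain(s, i):
    """Stand-in Chain: an optional bond symbol followed by one atom."""
    if i < len(s) and s[i] in BONDS:
        i += 1
    return parse_atom(s, i)

def parse_branch(s, i):
    """Stand-in Branch: '(' then an optional bond, a nested SMILES, and ')'."""
    if i >= len(s) or s[i] != "(":
        return None
    i += 1
    if i < len(s) and s[i] in BONDS:
        i += 1
    i = parse_smiles(s, i)
    if i is None or i >= len(s) or s[i] != ")":
        return None
    return i + 1

def parse_smiles(s, i=0):
    """SMILES: one mandatory Atom, then any number of Chain or Branch elements."""
    i = parse_atom(s, i)
    if i is None:
        return None
    while True:
        nxt = parse_chain(s, i)
        if nxt is None:
            nxt = parse_branch(s, i)
        if nxt is None:
            return i  # neither alternative matched: the loop is finished
        i = nxt

def is_valid(s):
    """True if the whole string is consumed by the top-level rule."""
    return parse_smiles(s) == len(s)
```

Under these assumptions, `is_valid("CC(=O)O")` accepts acetic acid, while `is_valid("(C)")` fails because the string does not start with the mandatory Atom.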
An Atom consists of exactly one of the following: OrganicSymbol, AromaticSymbol, AtomSpec, or WILDCARD.
An OrganicSymbol represents one of the chemical elements making up the "Organic Subset" that is widely used in organic chemistry: B, C, N, O, P, S, F, Cl, Br, and I. The branching notation used here is useful both to compact the graphical presentation and to aid in developing automated parsers.
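To see why a branching layout can help a parser, consider a hedged sketch in which the branches are assumed to factor out shared first letters, so that B may continue to Br and C may continue to Cl. The table and function name below are illustrative assumptions, not taken from the diagram itself.

```python
# First letter -> optional second letter (None means no two-letter form).
# Assumed factoring of the Organic Subset: B, C, N, O, P, S, F, Cl, Br, I.
ORGANIC = {
    "B": "r", "C": "l",
    "N": None, "O": None, "P": None, "S": None, "F": None, "I": None,
}

def match_organic_symbol(text, pos=0):
    """Return the organic-subset symbol found at pos, or None if there is none."""
    if pos >= len(text) or text[pos] not in ORGANIC:
        return None
    first = text[pos]
    second = ORGANIC[first]
    # Take the longer branch when the optional second letter is present.
    if second is not None and text.startswith(second, pos + 1):
        return first + second
    return first
```

Reading one character and then deciding whether to follow the longer branch means the matcher never has to back up, which is one reason this kind of factoring suits hand-written or generated parsers.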
The full diagram defines every element of the SMILES language in terms of similar Railroad Diagrams.
Testing
Specific SMILES examples can be parsed and validated using Smidge. This tool is based on the same underlying grammar used to generate the SMILES Railroad Diagram described here. What holds for the diagrams should hold for the parser, and vice versa. As the SMILES grammar is refined, both Smidge and the SMILES Railroad Diagram will immediately reflect those changes.
Conclusions
Railroad Diagrams are extremely useful, both as a learning tool for beginners and as a communication medium for experts. This article has described neither how the SMILES Railroad Diagrams were generated nor how the required grammar was developed. Future articles will address these points.