Working with collections of chemical structures often presents problems that plain text or numerical datasets do not. These problems all tend to revolve around the difficulties that general-purpose software packages have in dealing with chemical structures.
Two popular stand-ins for chemical structures are: (1) systematic names such as IUPAC nomenclature; and (2) Chemical Abstracts Service (CAS) registry numbers. Each has limitations. Although IUPAC names are relatively easy to generate, different software packages often produce notably different results. CAS numbers have the advantage of being short and unique, but are quite difficult and expensive to generate for new substances.
InChI™ is a system of chemical nomenclature designed to address the shortcomings of both IUPAC nomenclature and CAS numbers. IUPAC and NIST have jointly developed the system for over twelve years, during which time it has steadily gained in use and acceptance. Here are some things that may be useful to know before using InChIs.
- InChIs are widely used in databases. The most important reason to be familiar with InChI is that the system has become ubiquitous in a number of chemical databases. For example, each of the tens of millions of Compounds in PubChem has been assigned an InChI. A basic knowledge of InChI can go far in helping you answer questions that might not otherwise have obvious answers.
- InChIs are unique chemical names. Although InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H, the InChI for benzene, may not look like much of a name, it has many of the characteristics of a good name. For one, it uniquely identifies benzene. For another, it's a single line of text. These two simple features lead to many more advantages. InChI is even better than many names because generally speaking there is a 1:1 correspondence between every organic chemical structure and a single InChI.
- InChIs can be Googled. Because InChIs are just plain text, they can be pasted into a search engine like any other text. For example, a Google search for the InChI of benzene returns results that are generally relevant to the substance called benzene. No special software is necessary to do this search.
- InChIs can be easily generated - but only by software. If an InChI is not available, it must be generated. Fortunately, this is quite easy. IUPAC freely distributes the software for generating InChIs, which has enabled many software vendors to bundle InChI generators with their products. For example, ChemDraw™ 12 supports InChI generation with the menu option Edit->Copy As...->InChI. However, generating InChIs requires the application of numerous detailed rules that can only efficiently be followed by software. It's very unlikely any organic chemistry class of the future will have a section on constructing InChIs manually.
- InChIs can be converted to structures. Like an IUPAC name, an InChI can be decoded to yield a chemical structure. One caveat is that the software performing the decoding must be able to generate 2D coordinates and properly handle stereochemistry.
- InChIs can be used for exact structure search. Because any two structures representing the same molecule will always produce the same InChI, InChIs can be used as a highly efficient exact structure matching system.
- InChIs can be very long. The more atoms, bonds, and special features such as isotopes that are present in a molecule, the longer the corresponding InChI becomes. This property is a disadvantage within spreadsheets and web pages, in which fixed-width columns make layout of long text strings messy.
- InChIKeys make InChIs shorter. One solution to the problem of long InChIs is to use an InChIKey instead. An InChIKey is a fixed-length (27-character) version of an InChI that works well in spreadsheets, web pages, and any situation in which a short piece of text is needed. Unlike InChIs, InChIKeys are devoid of most chemical information. The InChIKey for benzene is UHOVQNZJYSORNB-UHFFFAOYSA-N.
- InChIKeys cannot be converted to structures. Generating an InChIKey removes most chemical information from the identifier through a process called hashing. The end result is that converting chemical structures to InChIKeys is a one-way process. However, a dictionary of InChIKey-to-structure mappings can be used if it is compiled before the conversion needs to take place.
- Standard InChIs are consistent across software packages. There is only one implementation of the structure to InChI converter, the one released by IUPAC. Each software package or database providing InChI functionality uses the same underlying software to do so.
- InChIs come in different flavors. Although "standard InChIs" are consistent across software and databases, the base InChI software exposes a number of settings that can affect the exact composition of resulting InChIs. Fortunately, these non-standard InChIs are easy to spot because they begin with "InChI=1/" instead of "InChI=1S/". The "S" stands for "standard".
- Use InChIs with organic molecules only. The InChI system was specifically optimized around organic molecules. Although metal-containing species such as metallocenes can be encoded as InChIs, the output is highly dependent on the input representation. In other words, there is no guarantee of uniquely identifying inorganic and organometallic substances with InChI.
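Several of the points above can be checked with ordinary string handling, since InChIs and InChIKeys are plain text. The Python sketch below is illustrative only: it does not generate InChIs (that still requires the IUPAC software), and the function names, the lookup table, and the assumption that the second slash-separated segment is always the formula are simplifications introduced here, not part of the InChI specification.

```python
import re

# Shape of an InChIKey: blocks of 14, 10, and 1 uppercase letters joined by hyphens.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_standard_inchi(inchi):
    """Return True if the string starts with the standard prefix "InChI=1S/"."""
    return inchi.startswith("InChI=1S/")


def split_inchi(inchi):
    """Split an InChI into (version, formula, layers).

    Simplification (assumption): the second segment is treated as the
    formula, and every later segment as a (prefix letter, content) pair.
    """
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI: " + inchi)
    segments = inchi[len("InChI="):].split("/")
    version = segments[0]
    formula = segments[1] if len(segments) > 1 else ""
    layers = [(s[0], s[1:]) for s in segments[2:] if s]
    return version, formula, layers


def looks_like_inchikey(text):
    """Check only the textual shape of an InChIKey, not whether it is real."""
    return INCHIKEY_PATTERN.fullmatch(text) is not None


# Identifiers for benzene, taken from the text above.
BENZENE_INCHI = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"
BENZENE_KEY = "UHOVQNZJYSORNB-UHFFFAOYSA-N"

# Hypothetical precompiled InChIKey-to-InChI dictionary. An InChIKey cannot
# be decoded, so it can only be looked up in a table built in advance.
key_to_inchi = {BENZENE_KEY: BENZENE_INCHI}

if __name__ == "__main__":
    print(is_standard_inchi(BENZENE_INCHI))
    print(split_inchi(BENZENE_INCHI))
    print(looks_like_inchikey(BENZENE_KEY))
    print(key_to_inchi.get(BENZENE_KEY))
```

Because two structures that represent the same molecule are expected to yield the same standard InChI, exact structure matching reduces to comparing strings or using them as dictionary keys, as `key_to_inchi` does here.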
More information can be found in the InChI FAQ.