Signals Blog

Nine Things Every (Organic) Chemist Should Know About Structure Data Files, aka SDfiles

One problem every organic chemist runs into eventually is how to keep track of the chemical structures and associated data they work with. Whether it's trying to plan the next synthesis, developing a new building block to sell, or writing a publication or patent, managing chemical data can quickly get out of hand. If collaborators or customers are also involved, then problems around sharing data begin to crop up.

Fortunately, organic chemistry has a file format specially designed to store both chemical structures and associated data. It's called the Structure Data file, or SDfile. Unfortunately, practical information on the SDfile format is hard to come by. This article attempts to describe the SDfile from the perspective of a working scientist.

  1. SDfiles Are Used to Share Chemical Structures and Associated Data. Think of an SDfile as a table of data. Each row contains information about one structure. Column one is the chemical structure itself and each column after it is a piece of data associated with the structure. This model happens to map very well onto many kinds of chemical data management and sharing problems. This makes SDfiles a good first choice for short-term storage and exchange of chemical data.
  2. Chemical Spreadsheets and Many Other Programs Can Work With SDfiles. Chemical Spreadsheets enable the visualization and manipulation of chemical structures and associated data. In addition, most offer the ability to view the entire dataset as either a grid or a table, while some offer a great deal more. Although you generally get what you pay for, some free chemical spreadsheets are worth looking into.
  3. SDfiles Can Be Downloaded From Many Sources. Many chemical suppliers offer product catalogs in SDfile format. Government agencies such as the NIH and some academic research groups also offer downloads of chemical-biological data in SDfile format. SDfiles are also available from many free, public chemical databases.
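  4. SDfiles Come in Many Sizes. The format itself places no limit on the number of structures in a file. The only hard limit on the size of an SDfile is the maximum file size your computer's operating system and file system allow. On older systems, this limit is a couple of gigabytes. On newer systems, this limit is much higher.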
  5. SDfiles Are Compressible. Chemical datasets can be very large, which means that SDfiles can take up a lot of disk space and be very slow to send, for example, by email. One fix is to compress the SDfile with a utility of the kind that comes with most Windows, Mac, and Linux computers. Compression ratios of 10:1 are typical.
  6. SDfiles Are Directly Editable. If you open an SDfile with a plain text editor (such as Notepad on Windows or TextEdit on Mac) you'll notice that it contains readable text. SDfiles can be edited like any other text document. Although it's far better to use a Chemical Spreadsheet (see 2 above), in case one isn't available many SDfile data corrections can be made directly using simple free tools. Keep hand edits to the data fields where you can, because the structure block uses a fixed-column layout that is easy to break.
  7. SDfiles Can Be Easily Reprocessed. You may want to do something with SDfiles that seems difficult (for example, combine a lot of them together into one file). Because SDfiles are plain text with a relatively simple format, many problems can be solved with simple tools. Most people with even passing experience in programming can write useful SDfile processing routines.
  8. SDfiles Are the Industry Standard. A written, freely available specification of the entire SDfile format is available from Accelrys (now BIOVIA). This makes it relatively easy to create software that can manipulate encoded SDfile structures and data in interesting and useful ways.
  9. SDfiles Are No Substitute For a Good Database System. The convenience of SDfiles may make it tempting to use them to solve every chemical data management problem. Resist the temptation. SDfiles can be powerful when combined with a good database system, but over-reliance on these files for long-term storage and collaboration suffers from the same limitations as all file-based "databases": synchronization problems, data loss, and overly complex workflows. If you find yourself relying on SDfiles as your primary tool for storage and collaboration, chances are good you could benefit from a dedicated chemical database system.
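
To make points 1, 5, 6, and 7 a little more concrete, here is a minimal Python sketch of the kind of "simple tool" a chemist might write. It reads an SDfile (optionally gzip-compressed) one record at a time, turns each record into a row of structure plus named data fields, and combines several files into one. It relies only on two features of the format: every record ends with a `$$$$` line, and each data field starts with a header line such as `> <MW>`. The function names (`open_sdfile`, `iter_records`, `split_record`, `combine`) and the sample record are made up for illustration, and the sketch assumes each structure block ends with an `M  END` line. It is not a full parser and does not check the chemistry.

```python
import gzip
import io
from typing import Dict, Iterable, Iterator, List, TextIO, Tuple

RECORD_END = "$$$$"  # the line that closes every record in an SDfile


def open_sdfile(path: str) -> TextIO:
    """Open a plain or gzip-compressed SDfile as text (hypothetical helper)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def iter_records(handle: TextIO) -> Iterator[str]:
    """Yield one record at a time, without its closing $$$$ line."""
    lines: List[str] = []
    for line in handle:
        if line.strip() == RECORD_END:
            yield "".join(lines)
            lines = []
        else:
            lines.append(line)
    if "".join(lines).strip():  # tolerate a missing final $$$$
        yield "".join(lines)


def split_record(record: str) -> Tuple[str, Dict[str, str]]:
    """Split a record into (structure text, {field name: value})."""
    lines = record.splitlines()
    # Assumption: the structure block ends at the first "M  END" line.
    end = next((i for i, text in enumerate(lines) if text.rstrip() == "M  END"),
               len(lines) - 1)
    molfile = "\n".join(lines[: end + 1])
    fields: Dict[str, List[str]] = {}
    name = None
    for text in lines[end + 1:]:
        if text.startswith(">"):  # data header, e.g. "> <MW>"
            start, stop = text.find("<"), text.rfind(">")
            name = text[start + 1:stop] if 0 <= start < stop else None
            if name is not None:
                fields[name] = []
        elif not text.strip():  # a blank line ends the current field
            name = None
        elif name is not None:
            fields[name].append(text)
    return molfile, {key: "\n".join(value) for key, value in fields.items()}


def combine(handles: Iterable[TextIO], out: TextIO) -> int:
    """Copy every record from several SDfiles into one; return the count."""
    count = 0
    for handle in handles:
        for record in iter_records(handle):
            out.write(record if record.endswith("\n") else record + "\n")
            out.write(RECORD_END + "\n")
            count += 1
    return count


# A made-up one-record SDfile: a single carbon atom with two data fields.
SAMPLE = (
    "methane\n"
    "  sketch\n"
    "\n"
    "  1  0  0  0  0  0  0  0  0  0999 V2000\n"
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
    "M  END\n"
    "> <NAME>\n"
    "methane\n"
    "\n"
    "> <MW>\n"
    "16.04\n"
    "\n"
    "$$$$\n"
)

for molfile, fields in map(split_record, iter_records(io.StringIO(SAMPLE))):
    print(molfile.splitlines()[0], fields)
```

Because records are read one at a time, the same loop works on files too large to open in an editor, and `open_sdfile` lets it read a compressed file without unpacking it first. For anything beyond quick fixes and merges, a Chemical Spreadsheet or a dedicated cheminformatics toolkit is the safer choice.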