Amino acid codes

The twenty standard amino acids are commonly referred to using either three-letter codes, such as ALA (alanine), or the single-letter codes shown below. The single-letter codes were introduced in the 1960s, in the early days of computational sequence analysis, to keep long sequences compact. They are also much more convenient than groups of three letters to manipulate using scripts written in interpreted languages such as Perl and Python.

Amino acid      Three-letter   Single-letter
Alanine         ALA            A
Arginine        ARG            R
Asparagine      ASN            N
Aspartic acid   ASP            D
Cysteine        CYS            C
Glycine         GLY            G
Glutamic acid   GLU            E
Glutamine       GLN            Q
Histidine       HIS            H
Isoleucine      ILE            I
Leucine         LEU            L
Lysine          LYS            K
Methionine      MET            M
Phenylalanine   PHE            F
Proline         PRO            P
Serine          SER            S
Threonine       THR            T
Tryptophan      TRP            W
Tyrosine        TYR            Y
Valine          VAL            V

Although it is somewhat confusing at first that the single-letter codes do not all match the first letter of the amino acid they represent, it is worth remembering that most proteins of interest contain hundreds of amino acid residues. To illustrate how useful the amino acid codes can be, let's have a look at a rather small imaginary protein with only eight residues:

Alanine-Phenylalanine-Proline-Leucine-Serine-Valine-Valine-Arginine

This is already irritatingly long if you have to write it out more than once. So, using the three-letter codes we have instead:

ALA-PHE-PRO-LEU-SER-VAL-VAL-ARG

This is already a great improvement in terms of the length of the sequence that we have to write, and it remains fairly human-readable since most of the codes are just the first three letters of the amino acid names. However, if our protein had a more realistic number of residues, e.g. 800 instead of 8, this would clearly still be a fairly long piece of text when fully written out.

Finally, moving to the single-letter codes, we can write our model sequence as:

AFPLSVVR

As well as being a lot shorter, the single-letter codes have the advantage that we do not need any extra formatting, such as the white space or hyphens used in the examples above, to make them easily readable.
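The conversion from three-letter to single-letter codes is easy to script. The following is a minimal Python sketch, assuming the residues are separated by hyphens as in the example above. The names THREE_TO_ONE and to_single_letter are made up for this illustration and do not come from any particular library.

```python
# Three-letter to single-letter codes, copied from the table above.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLY": "G", "GLU": "E", "GLN": "Q", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def to_single_letter(sequence, separator="-"):
    """Convert a separated three-letter sequence to single-letter codes."""
    residues = sequence.upper().split(separator)
    # A KeyError here means the input contained a code that is not in the table.
    return "".join(THREE_TO_ONE[residue] for residue in residues)


print(to_single_letter("ALA-PHE-PRO-LEU-SER-VAL-VAL-ARG"))  # AFPLSVVR
```

Going the other way would need a separator to be chosen, which is another small reminder that the single-letter form carries no formatting of its own.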

The single-letter codes also lend themselves nicely to a shorthand notation that can be used to describe changes or mutations occurring in the amino acid sequence of a protein. In the 'protein' above, the serine residue (the fifth residue in the chain) would be S5 using the single-letter codes. If this serine were changed to a phenylalanine, the mutation could be written conveniently as S5F.
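Because this notation is so regular, it can be read and applied by a short script. The sketch below is one possible way to do it in Python, assuming a simple substitution written as original residue, position counted from 1, and new residue. The function name apply_mutation is hypothetical, and the check that the original residue matches the sequence is a design choice made here, not part of the notation itself.

```python
import re

# Notation such as S5F: original residue, 1-based position, new residue.
MUTATION_PATTERN = re.compile(r"([A-Z])(\d+)([A-Z])")


def apply_mutation(sequence, mutation):
    """Return a copy of sequence with a substitution such as 'S5F' applied."""
    match = MUTATION_PATTERN.fullmatch(mutation)
    if match is None:
        raise ValueError(f"Unrecognised mutation: {mutation!r}")
    original, replacement = match.group(1), match.group(3)
    # Residues are numbered from 1, Python strings from 0.
    index = int(match.group(2)) - 1
    if not 0 <= index < len(sequence):
        raise ValueError(f"Position in {mutation!r} is outside the sequence")
    if sequence[index] != original:
        raise ValueError(
            f"Expected {original} at position {index + 1}, found {sequence[index]}"
        )
    return sequence[:index] + replacement + sequence[index + 1:]


print(apply_mutation("AFPLSVVR", "S5F"))  # AFPLFVVR
```

Checking the original residue before substituting catches a common slip, namely quoting a position from a differently numbered version of the sequence.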