FASTA/TFASTA/FASTX/TFASTXv2.0u(1) DragonFly General Commands Manual
NAME
fasta - scan a protein or DNA sequence library for similar sequences
tfasta - compare a protein sequence to a DNA sequence library,
translating the DNA sequence library `on-the-fly'.
lfasta - compare two protein or DNA sequences for local similarity and
show the local sequence alignments
plfasta - compare two sequences for local similarity and plot the local
sequence alignments
SYNOPSIS
fasta [-a -A -b # -c # -d # -E # -f # -g # -k # -l file -L FASTLIBS
-r STATFILE -m # -o -O file -p # -Q -s SMATRIX -w # -x "# #" -y # -z -1
] query-sequence-file library-file [ ktup ]
fasta [-QaAbcdEfgHiklmnoOprswxyz] query-file @library-name-file
fasta [-QaAbcdEfgHiklmnoOprswxyz] query-file "%PRMVI"
fasta [-aAbcdEgHlmnoOprswyx] - interactive mode
fastx [-aAbcdEfghHlmnoOprswyx] DNA-query-file protein-library [ ktup ]
tfasta [-aAbcdEfgkmoOprswy3] protein-query-file DNA-library [ ktup ]
tfastx [-abcdEfghHikmoOprswy3] protein-query-file DNA-library [ ktup ]
lfasta [-afgmnpswx] sequence-file-1 sequence-file-2 [ ktup ]
plfasta [-afgkmnpsxv] sequence-file-1 sequence-file-2 [ ktup ]
DESCRIPTION
fasta is used to compare a protein or DNA sequence to all of the
entries in a sequence library. For example, fasta can compare a
protein sequence to all of the sequences in the NBRF PIR protein
sequence database. fasta will automatically decide whether the query
sequence is DNA or protein by reading the query sequence as protein and
determining whether the `amino-acid composition' is more than 85%
A+C+G+T. fasta uses an improved version of the rapid sequence
comparison algorithm of Lipman and Pearson (Science, (1985)
227:1427); the improved version is described in Pearson and Lipman,
Proc. Natl. Acad. Sci. USA, (1988) 85:2444. The program can be invoked
either with command line arguments or in interactive mode. The optional
third argument, ktup, sets the sensitivity and speed of the search. If
ktup=2, similar
regions in the two sequences being compared are found by looking at
pairs of aligned residues; if ktup=1, single aligned amino acids are
examined. ktup can be set to 2 or 1 for protein sequences, or from 1
to 6 for DNA sequences. The default if ktup is not specified is 2 for
proteins and 6 for DNA.
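The automatic DNA/protein decision described above can be sketched in a few lines of Python. This is only an illustration of the 85% A+C+G+T rule, not the code fasta itself uses; the function name looks_like_dna and the reading of "more than 85%" as a strict inequality are assumptions.

```python
def looks_like_dna(sequence, threshold=0.85):
    """Guess whether a sequence is DNA from its residue composition.

    Hypothetical helper: the name and the strict '>' comparison are
    assumptions made for this sketch.
    """
    # Consider letters only; blanks, digits and punctuation are skipped.
    residues = [c for c in sequence.upper() if c.isalpha()]
    if not residues:
        return False
    acgt = sum(1 for c in residues if c in "ACGT")
    # "More than 85% A+C+G+T" is read here as a strict inequality.
    return acgt / len(residues) > threshold
```

A short protein that happens to be rich in A, C, G, and T residues could be misclassified by such a rule, which is presumably why the -n option exists to force DNA treatment.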
fasta compares a query sequence to a sequence library which consists of
sequence data interspersed with comments (see below). Normally fasta,
fastx, tfasta, and tfastx search the libraries listed in the file
pointed to by the environment variable FASTLIBS. The format of this
file is described in the file FASTA.DOC. tfasta compares a protein
sequence to a DNA sequence database, translating the DNA sequence
library in 6 frames `on-the-fly' (3 frames with the -3 option). The
search uses the standard BLOSUM50 scoring matrix and ktup=2 by
default. tfasta searches a DNA sequence database in the standard text
format described below. tfastx, like tfasta, compares a protein
sequence to a DNA sequence library. However, tfastx compares the
protein sequence to the forward and reverse three-frame translation of
the DNA library sequence, allowing for frameshifts. fastx compares a
DNA sequence to a protein sequence database, translating the DNA
sequence in three frames and allowing frameshifts in the alignment.
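As a rough illustration of what translating a DNA sequence in six frames (or three forward frames, as with -3) means, the following Python sketch uses the standard genetic code. It is not the translation code used by tfasta or tfastx and does not model frameshifts; the function names and the use of "X" for codons containing unrecognized bases are assumptions.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, frame):
    """Translate one frame (0, 1, or 2); unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def translate_frames(dna, forward_only=False):
    """Return 3 forward frames, plus 3 reverse frames unless forward_only."""
    frames = [translate(dna, f) for f in range(3)]
    if not forward_only:
        rc = reverse_complement(dna)
        frames += [translate(rc, f) for f in range(3)]
    return frames
```

For example, translate_frames("ATGGCC") yields the forward translations "MA", "W", and "G", followed by the three translations of the reverse complement GGCCAT.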
The lfasta and plfasta programs compare two sequences, looking for local
sequence similarities. While fasta, fastx, and tfasta report only the
best alignment between the query sequence and the library sequence,
lfasta and plfasta will report all of the alignments between the two
sequences with scores greater than a cut-off value. lfasta shows the
actual local alignments between the two sequences and their scores,
while plfasta produces a plot of the alignments that looks similar to a
`dot-matrix' homology plot. On Unix systems, plfasta generates
PostScript output.
The fasta programs use a standard text format sequence file. Lines
beginning with '>' or ';' are considered comments and ignored.
Sequences can be upper or lower case; blanks, tabs, and unrecognizable
characters are ignored. fasta expects sequences to use the single
letter amino acid codes; see protcodes(1). Library files for fasta
should have the form shown below.
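The file format rules above can be sketched as a small Python reader. This is a minimal illustration, not the parser used by the fasta programs: the function name parse_library is hypothetical, "unrecognizable characters" is approximated as "anything that is not a letter", and sequences are upper-cased for convenience.

```python
def parse_library(text):
    """Return a list of (comment, sequence) pairs from library text.

    Assumptions for this sketch: each entry starts with a '>' line,
    ';' lines are skipped, and only letters count as residues.
    """
    entries = []
    comment, residues = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            # A '>' line closes the previous entry and opens a new one.
            if comment is not None:
                entries.append((comment, "".join(residues)))
            comment, residues = line[1:].strip(), []
        elif line.startswith(";"):
            continue  # comment line, ignored
        elif comment is not None:
            # Keep letters only; blanks, tabs, digits, etc. are dropped.
            residues.extend(c.upper() for c in line if c.isalpha())
    if comment is not None:
        entries.append((comment, "".join(residues)))
    return entries
```

Sequence data that appears before the first '>' line is discarded by this sketch, which matches what fasta requires of library files but is stricter than lfasta and plfasta.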
OPTIONS
fasta and the other programs can be directed to change the scoring
matrix, search parameters, output format, and default search
directories by entering options on the command line (preceded by a `-',
or by a `/' for MS-DOS). All of the options should precede the file name
and ktup arguments. Alternatively, these options can be changed by
setting environment variables. The options and environment variables are:
-1 Normally, the top scoring sequences are ranked by the z-score
based on the opt score. To rank sequences by raw scores, use
the -z option. With the -1 option, sequences are ranked by the
z-score based on the init1 score.
-a (SHOWALL) Modifies the display of the two sequences in
alignments. Normally, both sequences are shown only where they
overlap (SHOWALL=0); if -a is given or the environment variable SHOWALL
is set to 1, both sequences are shown in their entirety.
-A Force use of unlimited Smith-Waterman alignment for DNA FASTA
and TFASTA. By default, the program uses the older (and faster)
band-limited Smith-Waterman alignment for DNA FASTA and TFASTA
alignments.
-b # The number of similarity scores to be shown when the -Q option
is used. This value is usually calculated based on the actual
scores.
-c # (OPTCUT) The threshold for optimization: library sequences with
initn scores greater than OPTCUT are optimized. The OPTCUT value
is normally calculated based on sequence length.
-d # The number of alignments to be shown. Normally, fasta shows the
same number of alignments as similarity scores. By using fasta
-Q -b 200 -d 50, one would see the top scoring 200 sequences and
alignments for the 50 best scores.
-E # The expectation value threshold for displaying similarity scores
and sequence alignments. fasta -Q -E 2.0 would show all
library sequences with scores expected to occur no more than 2
times by chance in a search of the library.
-f # Penalty for the first residue in a gap (-12 by default for fasta
with proteins, -16 for DNA).
-g # Penalty for additional residues in a gap (-2 by default for
fasta with proteins, -4 for DNA).
-h # (fastx, tfastx only) penalty for a +1 or -1 frameshift.
-H Do not display histogram of similarity scores.
-i (fasta, fastx) search with the reverse-complement of the query
DNA sequence. (tfastx) search only the reverse complement of
the DNA library sequence.
-k # (GAPCUT) Sets the threshold for joining the initial regions for
calculating the initn score.
-l file
(FASTLIBS) The name of the library menu file. Normally this
will be determined by the environment variable FASTLIBS.
However, a library menu file can also be specified with -l.
-L display more information about the library sequence in the
alignment.
-m # (MARKX) =0,1,2,3,4,10. Alternate display of matches and
mismatches in alignments. MARKX=0 uses ":", ".", and " " for
identities, conservative replacements, and non-conservative
replacements, respectively. MARKX=1 uses " ", "x", and "X".
MARKX=2 does not show the second sequence, but uses the second
alignment line to display matches with a "." for identity, or
with the mismatched residue for mismatches. MARKX=2 is useful
for aligning large numbers of similar sequences. MARKX=3 writes
out a file of library sequences in FASTA format. MARKX=3 should
always be used with the "SHOWALL" (-a) option, but this does not
completely ensure that all of the sequences output will be
aligned. MARKX=4 displays a graph of the alignment of the
library sequence with respect to the query sequence, so that one
can identify the regions of the query sequence that are
conserved. MARKX=10 is used to produce a parseable output
format.
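To make the MARKX=2 display concrete, the following Python sketch builds the second alignment line from two already-aligned sequences of equal length. It only illustrates the "." for identity rule described above; the function name markx2_line is hypothetical, and how fasta itself treats gap characters or case differences is not covered here.

```python
def markx2_line(query_aln, library_aln):
    """Return the MARKX=2 style line for two aligned sequences.

    Identities are shown as '.', mismatches as the library residue.
    Hypothetical helper; gap handling in fasta itself may differ.
    """
    if len(query_aln) != len(library_aln):
        raise ValueError("aligned sequences must have equal length")
    return "".join("." if q == s else s
                   for q, s in zip(query_aln, library_aln))
```

For the aligned pair WILLLSQ and WIVLLTQ this produces ..V..T., so that only the positions that differ from the query stand out.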
-n Forces the query sequence to be treated as a DNA sequence.
-O filename
Send a copy of the results to "filename".
-o Turns off default fasta limited optimization on all of the
sequences in the library with initn scores greater than OPTCUT.
The meaning of this option is reversed from previous versions of fasta.
-Q Quiet option. This allows fasta and tfasta to search a database
and report the results without asking any questions. fasta -Q
file library > output can be put in the background or run at a
later time with the Unix 'at' command. The number of similarity
scores and alignments displayed with the -Q option can be
modified with the -b (scores) and -d (alignments) options.
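For scripted, non-interactive runs, a command line using -Q, -b, and -d can be assembled programmatically. The Python sketch below only builds the argument list, with options placed before the file names and ktup as required above; the function name build_fasta_command and its parameter names are hypothetical, and it assumes a program called fasta is on the search path.

```python
def build_fasta_command(query_file, library_file, ktup=None,
                        scores=None, alignments=None, program="fasta"):
    """Build an argument list for a quiet (-Q) fasta search.

    Hypothetical helper: only options described in this manual page
    (-Q, -b, -d) are used, and all options precede the file names.
    """
    argv = [program, "-Q"]
    if scores is not None:
        argv += ["-b", str(scores)]        # number of scores shown
    if alignments is not None:
        argv += ["-d", str(alignments)]    # number of alignments shown
    argv += [query_file, library_file]
    if ktup is not None:
        argv.append(str(ktup))             # optional third argument
    return argv
```

The resulting list can be passed to subprocess.run with stdout redirected to a file, which corresponds to running fasta -Q file library > output from the shell.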
-r STATFILE Causes fasta to write out the sequence identifier,
superfamily number (if available), and similarity scores to
STATFILE for every sequence in the library. These results are
not sorted.
-s str (SMATRIX) the filename of an alternative scoring matrix file.
For protein sequences, BLOSUM50 is used by default; PAM250 can
be used with the command line option -s 250.
-v str (LINEVAL) (plfasta only) plfasta and pclfasta can use up to 4
different line styles to denote the scores of local alignments.
The scores that correspond to these line styles can be specified
with the environment variable LINEVAL, or with the -v option. In
either case, a string with three numbers separated by spaces
should be given. This string must be surrounded by double
quotation marks. For example, LINEVAL="200 100 50" tells
plfasta to use solid lines for local alignments with scores
greater than 200, long dashed lines for scores between 100 and
200, short dashed lines for scores between 50 and 100, and
dotted lines for scores less than 50.
plfasta -v "200 100 50"
Normally, the values are 200, 100, and 50 for protein sequence
comparisons and 400, 200, and 100 for DNA sequence comparisons.
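The mapping from a LINEVAL string to the four line styles can be sketched as follows in Python. This is an illustration of the rule stated above, not plfasta's own code; the function names are hypothetical, and the treatment of a score exactly equal to a threshold (it falls into the lower style) is an assumption, since the text does not say.

```python
def parse_lineval(lineval):
    """Parse a LINEVAL string such as "200 100 50" into three integers."""
    values = [int(v) for v in lineval.split()]
    if len(values) != 3:
        raise ValueError("LINEVAL must contain exactly three numbers")
    return tuple(values)


def line_style(score, lineval="200 100 50"):
    """Return the line style for a local alignment score.

    Assumption: a score equal to a threshold gets the lower style.
    """
    high, mid, low = parse_lineval(lineval)
    if score > high:
        return "solid"
    if score > mid:
        return "long dashed"
    if score > low:
        return "short dashed"
    return "dotted"
```

With the protein defaults, a score of 150 maps to long dashed lines; with the DNA values "400 200 100", a score of 250 does as well.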
-w # (LINLEN) output line length for sequence alignments. (normally
60, can be set up to 200).
-x "offset1 offset2"
Causes fasta/lfasta/plfasta to start numbering the aligned
sequences starting with offset1 and offset2, rather than 1 and
1. This is particularly useful for showing alignments of
promoter regions.
-y # Set the band-width used for optimization. -y 16 is the default
for protein when ktup=2 and for all DNA alignments. -y 32 is
used for protein when ktup=1. For proteins, optimization slows
comparison 2-fold but is highly recommended.
-z Do not do statistical significance calculation. Results are
ranked by the unnormalized opt, initn, or init1 score.
-3 (tfasta, tfastx only) Normally tfasta and tfastx translate
sequences in the DNA sequence library in all six frames. With
the -3 option, only the three forward frames are searched.
EXAMPLES
(1) fasta musplfm.aa $AABANK
Compare the amino acid sequence in the file musplfm.aa with the
complete PIR protein sequence library using ktup = 2. Each "library"
sequence (there need only be one) should start with a comment line
which starts with a '>', e.g.
>LCBO bovine preprolactin
WILLLSQ ...
>LCHU human ...
...
(2) fasta -a -w 80 musplfm.aa lcbo.aa 1
Compare the amino acid sequence in the file musplfm.aa with the
sequences in the file lcbo.aa using ktup = 1. Show both sequences in
their entirety, with 80 residues on each output line.
(3) fasta
Run the fasta program in interactive mode. The program will prompt for
the file name for the query sequence, list alternative libraries to be
searched (if FASTLIBS is set), and prompt for the ktup.
FILES
This version of fasta prompts for the library file to be searched from
a list of file names that are saved in the file pointed to by the
environment variable FASTLIBS. If FASTLIBS = fastgb.list, then the
file fastgb.list might have the entries:
NBRF Protein$0P/u/lib/aabank.lib 0
GB Primate$1P@/u/lib/gpri.nam
GB Rodent$1R@/u/lib/grod.nam
GB Mammal$1M@/u/lib/gmammal.nam
Each line in this file has 4 fields: (1) The library name, separated
from the remaining fields by a '$'; (2) A 0 or a 1 indicating protein
or DNA library respectively; (3) A single letter that will be used to
choose the library; (4) the location of the library file itself. The
library file name can contain an optional library format specifier.
fasta recognizes the following library formats: 0 - Pearson/FASTA; 1 -
Genbank flat file; 2 - NBRF/PIR Codata; 3 - EMBL/SWISS-PROT; 4 -
Intelligenetics; 5 - NBRF/PIR VMS. Note that this fourth field can
contain an '@' character, which indicates that the library file is an
indirect library file containing a list of library files, one per line.
An indirect library file might have the lines:
</usr/slib/genbank (the directory for the library files)
gbpri.seq 1
gbrod.seq 1
gbmam.seq 1
...
gbvrl.seq 1
...
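The four-field layout of a FASTLIBS entry, as described earlier in this section, can be illustrated with a small Python parser. This is a sketch based only on the example entries shown here, not the parsing code in fasta; the names LibraryEntry and parse_fastlibs_line are hypothetical, and it assumes that the '@' marker, when present, is the first character of the fourth field and that the format specifier follows the file name after a blank.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LibraryEntry:
    name: str                   # field 1: library name (before '$')
    is_dna: bool                # field 2: 0 = protein, 1 = DNA
    letter: str                 # field 3: letter used to choose the library
    path: str                   # field 4: library (or indirect) file name
    indirect: bool              # True if field 4 started with '@'
    format_code: Optional[int]  # optional library format specifier


def parse_fastlibs_line(line):
    """Parse one FASTLIBS entry (hypothetical helper, see assumptions)."""
    name, rest = line.rstrip("\n").split("$", 1)
    kind, letter, location = rest[0], rest[1], rest[2:]
    if kind not in ("0", "1"):
        raise ValueError("library type must be 0 (protein) or 1 (DNA)")
    indirect = location.startswith("@")
    if indirect:
        location = location[1:]
    parts = location.split()
    path = parts[0]
    format_code = int(parts[1]) if len(parts) > 1 else None
    return LibraryEntry(name, kind == "1", letter, path, indirect,
                        format_code)
```

Applied to the entry "GB Rodent$1R@/u/lib/grod.nam", this yields a DNA library selected with the letter R whose file /u/lib/grod.nam is an indirect library file.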
You can use your own sequence files for fasta; just be certain to put a
'>' and comment as the first line before the sequence. Only one
library file type, the standard NBRF library format, is supported by
the VAX/VMS programs. lfasta and plfasta do not require the '>' and
comment line; fasta does.
SEE ALSO
rdf2(1), protcodes(5), dnacodes(5)
AUTHOR
Bill Pearson
wrp@virginia.EDU
local FASTA/TFASTA/FASTX/TFASTXv2.0u(1)