FASTA/TFASTA/FASTX/TFASTXv2.0u(1)            DragonFly General Commands Manual

NAME
       fasta - scan a protein or DNA sequence library for similar sequences

       tfasta - compare a protein sequence to a DNA sequence library,
       translating the DNA sequence library `on-the-fly'.

       lfasta - compare two protein or DNA sequences for local similarity and
       show the local sequence alignments

       plfasta - compare two sequences for local similarity and plot the local
       sequence alignments

SYNOPSIS
       fasta [-a -A -b # -c # -d #  -E # -f # -g # -k # -l file -L FASTLIBS
       -r STATFILE -m # -o -O file -p # -Q -s SMATRIX -w # -x "# #" -y # -z -1
       ] query-sequence-file library-file [ ktup ]

       fasta [-QaAbcdEfgHiklmnoOprswxyz] query-file @library-name-file

       fasta [-QaAbcdEfgHiklmnoOprswxyz] query-file "%PRMVI"

       fasta [-aAbcdEgHlmnoOprswyx] - interactive mode

       fastx [-aAbcdEfghHlmnoOprswyx] DNA-query-file protein-library [ ktup ]

       tfasta [-aAbcdEfgkmoOprswy3] protein-query-file DNA-library [ ktup ]

       tfastx [-abcdEfghHikmoOprswy3] protein-query-file DNA-library [ ktup ]

       lfasta [-afgmnpswx] sequence-file-1 sequence-file-2 [ ktup ]

       plfasta [-afgkmnpsxv] sequence-file-1 sequence-file-2 [ ktup ]

DESCRIPTION
       fasta is used to compare a protein or DNA sequence to all of the
       entries in a sequence library.  For example, fasta can compare a
       protein sequence to all of the sequences in the NBRF PIR protein
       sequence database.  fasta will automatically decide whether the query
       sequence is DNA or protein by reading the query sequence as protein and
       determining whether the `amino-acid composition' is more than 85%
       A+C+G+T.  fasta uses an improved version of the rapid sequence
       comparison algorithm of Lipman and Pearson (Science, (1985) 227:1427);
       the improved version is described in Pearson and Lipman, Proc. Natl.
       Acad. Sci. USA, (1988) 85:2444.  The program can be invoked either with
       command line arguments or in interactive mode.  The optional third
       argument, ktup, sets the sensitivity and speed of the search.  If
       ktup=2, similar
       regions in the two sequences being compared are found by looking at
       pairs of aligned residues; if ktup=1, single aligned amino acids are
       examined.  ktup can be set to 2 or 1 for protein sequences, or from 1
       to 6 for DNA sequences.  The default if ktup is not specified is 2 for
       proteins and 6 for DNA.
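
       As an illustration of the query-type test described above, the
       following Python sketch treats a sequence as DNA when more than 85% of
       its letters are A, C, G, or T.  The function name looks_like_dna and
       the decision to count only alphabetic characters are assumptions made
       for this example; it is not fasta's own code.

```python
def looks_like_dna(sequence, threshold=0.85):
    """Guess whether a sequence is DNA from its A+C+G+T fraction.

    Hypothetical helper: only letters are counted, on the assumption that
    blanks, tabs and other non-letter characters should be ignored.
    """
    letters = [ch for ch in sequence.upper() if ch.isalpha()]
    if not letters:
        return False
    acgt = sum(1 for ch in letters if ch in "ACGT")
    return acgt / len(letters) > threshold


if __name__ == "__main__":
    # Made-up sequences, used only to exercise the function.
    print(looks_like_dna("ACGTACGTTTGACA"))      # nucleotide-like composition
    print(looks_like_dna("MKWVTFISLLLLFSSAYS"))  # protein-like composition
```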

       fasta compares a query sequence to a sequence library that consists of
       sequence data interspersed with comments (see below).  Normally fasta,
       fastx, tfasta, and tfastx search the libraries listed in the file
       pointed to by the environment variable FASTLIBS.  The format of this
       file is described in the file FASTA.DOC.  tfasta compares a protein
       sequence to a DNA sequence database, translating the DNA sequence
       library in 6 frames `on-the-fly' (3 frames with the -3 option).  The
       search uses the standard BLOSUM50 scoring matrix and ktup=2 by
       default.  tfasta searches a DNA sequence database in the standard text
       format described below.  tfastx, like tfasta, compares a protein
       sequence to a DNA sequence library.  However, tfastx compares the
       protein sequence to the forward and reverse three-frame translation of
       the DNA library sequence, allowing for frameshifts.  fastx compares a
       DNA sequence to a protein sequence database, translating the DNA
       sequence in three frames and allowing frameshifts in the alignment.
       The lfasta and plfasta programs compare two sequences, looking for
       local sequence similarities.  While fasta, fastx, and tfasta report
       only the
       best alignment between the query sequence and the library sequence,
       lfasta and plfasta will report all of the alignments between the two
       sequences with scores greater than a cut-off value.  lfasta shows the
       actual local alignments between the two sequences and their scores,
       while plfasta produces a plot of the alignments that looks similar to a
       `dot-matrix' homology plot.  On Unix(tm) systems, plfasta generates
       PostScript output.
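
       To make the idea of translating a DNA library entry in several frames
       concrete, here is a minimal Python sketch that produces the three
       forward translations and, optionally, the three reverse-complement
       translations.  The function names, the order in which frames are
       returned, and the use of 'X' for unrecognized codons are assumptions
       for illustration; the sketch uses the standard genetic code and does
       not model the frameshift handling of fastx and tfastx.

```python
BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order.
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, offset=0):
    """Translate starting at `offset`; unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def translate_frames(dna, forward_only=False):
    """Return six translations, or only the three forward ones.

    forward_only=True plays the role of the -3 option in this sketch.
    """
    frames = [translate(dna, offset) for offset in range(3)]
    if not forward_only:
        rc = reverse_complement(dna)
        frames += [translate(rc, offset) for offset in range(3)]
    return frames


if __name__ == "__main__":
    # Made-up nine-base sequence.
    print(translate_frames("ATGGCCTAA"))
```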

       The fasta programs use a standard text format sequence file.  Lines
       beginning with '>' or ';' are considered comments and ignored.
       Sequences can be upper or lower case; blanks, tabs, and unrecognizable
       characters are ignored.  fasta expects sequences to use the single
       letter amino acid codes; see protcodes(5).  Library files for fasta
       should have the form shown below.
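
       The text format described above can be read with a few lines of
       Python.  This is a sketch under stated assumptions, not the parser
       used by fasta: the name read_library is hypothetical, a '>' line is
       taken to start a new library entry, ';' lines are skipped, and every
       character that is not a letter is dropped.

```python
def read_library(lines):
    """Yield (comment, sequence) pairs from '>'-delimited text lines."""
    comment, residues = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new entry begins; emit the previous one, if any.
            if comment is not None:
                yield comment, "".join(residues)
            comment, residues = line[1:].strip(), []
        elif line.startswith(";"):
            continue  # additional comment line, ignored
        elif comment is not None:
            # Assumption: anything that is not a letter is ignored.
            residues.extend(ch for ch in line if ch.isalpha())
    if comment is not None:
        yield comment, "".join(residues)


if __name__ == "__main__":
    # Made-up entries, used only to exercise the parser.
    example = [
        ">seq1 first made-up entry",
        "WILL LSQ",
        "; a comment line",
        "MDSK",
        ">seq2 second made-up entry",
        "mnik",
    ]
    for comment, sequence in read_library(example):
        print(comment, sequence)
```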

OPTIONS
       fasta and the other programs can be directed to change the scoring
       matrix, search parameters, output format, and default search
       directories by entering options on the command line (preceded by a `-'
       or `/' for MS-DOS).  All of the options should precede the file name
       and ktup arguments.  Alternatively, these options can be changed by
       setting environment variables.  The options and environment variables
       are:

       -1     Normally, the top scoring sequences are ranked by the z-score
              based on the opt score.  To rank sequences by raw scores, use
              the -z option.  With the -1 option, sequences are ranked by the
              z-score based on the init1 score.

       -a     (SHOWALL) Modifies the display of the two sequences in
              alignments.  Normally, both sequences are shown only where they
              overlap (SHOWALL=0).  If -a is given or the environment variable
              SHOWALL=1, both sequences are shown in their entirety.

       -A     Force use of unlimited Smith-Waterman alignment for DNA FASTA
              and TFASTA.  By default, the program uses the older (and faster)
              band-limited Smith-Waterman alignment for DNA FASTA and TFASTA
              alignments.

       -b #   The number of similarity scores to be shown when the -Q option
              is used.  This value is usually calculated based on the actual
              scores.

       -c #   (OPTCUT) The threshold for optimization: library sequences with
              initn scores greater than OPTCUT are optimized.  The OPTCUT
              value is normally calculated based on sequence length.

       -d #   The number of alignments to be shown.  Normally, fasta shows the
              same number of alignments as similarity scores.  By using fasta
              -Q -b 200 -d 50, one would see the top scoring 200 sequences and
              alignments for the 50 best scores.

       -E #   The expectation value threshold for displaying similarity scores
              and sequence alignments.  fasta -Q -E 2.0 would show all
              library sequences with scores expected to occur no more than 2
              times by chance in a search of the library.

       -f #   Penalty for the first residue in a gap (-12 by default for fasta
              with proteins, -16 for DNA).

       -g #   Penalty for additional residues in a gap (-2 by default for
              fasta with proteins, -4 for DNA).
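
              As a worked illustration of how the -f and -g penalties
              combine, the Python sketch below scores a gap of n residues
              as the first-residue penalty plus (n - 1) additional-residue
              penalties.  The function name gap_penalty is hypothetical and
              the defaults are the protein values quoted above; it shows the
              arithmetic only and is not taken from fasta.

```python
def gap_penalty(length, first=-12, additional=-2):
    """Total penalty for a gap of `length` residues.

    `first` corresponds to -f and `additional` to -g; the defaults are the
    protein values quoted in this manual page.
    """
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0
    return first + (length - 1) * additional


if __name__ == "__main__":
    print(gap_penalty(1))                              # -12
    print(gap_penalty(3))                              # -12 + 2 * -2 = -16
    print(gap_penalty(3, first=-16, additional=-4))    # DNA values: -24
```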

       -h #   (fastx, tfastx only) penalty for a +1 or -1 frameshift.

       -H     Do not display histogram of similarity scores.

       -i     (fasta, fastx) search with the reverse-complement of the query
              DNA sequence.  (tfastx) search only the reverse complement of
              the DNA library sequence.

       -k #   (GAPCUT) Sets the threshold for joining the initial regions for
              calculating the initn score.

       -l file
              (FASTLIBS) The name of the library menu file.  Normally this
              will be determined by the environment variable FASTLIBS.
              However, a library menu file can also be specified with -l.

       -L     display more information about the library sequence in the
              alignment.

       -m #   (MARKX) =0,1,2,3,4,10. Alternate display of matches and
              mismatches in alignments. MARKX=0 uses ":","."," ", for
              identities, conservative replacements, and non-conservative
              replacements, respectively. MARKX=1 uses " ","x", and "X".
              MARKX=2 does not show the second sequence, but uses the second
              alignment line to display matches with a "."  for identity, or
              with the mismatched residue for mismatches.  MARKX=2 is useful
              for aligning large numbers of similar sequences.  MARKX=3 writes
              out a file of library sequences in FASTA format.  MARKX=3 should
              always be used with the "SHOWALL" (-a) option, but this does not
              completely ensure that all of the sequences output will be
              aligned. MARKX=4 displays a graph of the alignment of the
              library sequence with respect to the query sequence, so that one
              can identify the regions of the query sequence that are
              conserved. MARKX=10 is used to produce a parseable output
              format.
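
              The MARKX=2 display described above can be imitated with a
              short Python sketch that builds the second alignment line from
              two already-aligned strings.  The name markx2_line is
              hypothetical, and treating gap characters like any other
              character is an assumption; the actual output of fasta may
              differ in detail.

```python
def markx2_line(query, library):
    """Build a MARKX=2-style second line for two aligned sequences.

    A "." marks an identity; otherwise the library residue is shown.
    Assumption: gap characters are compared like ordinary characters.
    """
    if len(query) != len(library):
        raise ValueError("aligned sequences must have equal length")
    return "".join("." if q == l else l for q, l in zip(query, library))


if __name__ == "__main__":
    # Made-up aligned sequences.
    print("MKVLA")
    print(markx2_line("MKVLA", "MKILA"))
```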

       -n     Forces the query sequence to be treated as a DNA sequence.

       -O filename
              Send a copy of the results to "filename".

       -o     Turns off the default fasta limited optimization of all of the
              sequences in the library with initn scores greater than OPTCUT.
              The meaning of this option is the reverse of that in previous
              versions of fasta.

       -Q     Quiet option.  This allows fasta and tfasta to search a database
              and report the results without asking any questions. fasta -Q
              file library > output can be put in the background or run at a
              later time with the unix 'at' command.  The number of similarity
              scores and alignments displayed with the -Q option can be
              modified with the -b (scores) and -d (alignments) options.

       -r STATFILE
              Causes fasta to write out the sequence identifier, superfamily
              number (if available), and similarity scores to STATFILE for
              every sequence in the library.  These results are not sorted.

       -s str (SMATRIX) the filename of an alternative scoring matrix file.
              For protein sequences, BLOSUM50 is used by default; PAM250 can
              be used with the command line option -s 250.

       -v str (LINEVAL) (plfasta only) plfasta and pclfasta can use up to 4
              different line styles to denote the scores of local alignments.
              The scores that correspond to these line styles can be specified
              with the environment variable LINEVAL, or with the -v option.  In
              either case, a string with three numbers separated by spaces
              should be given.  This string must be surrounded by double
              quotation marks.  For example, LINEVAL="200 100 50" tells
              plfasta to use solid lines for local alignments with scores
              greater than 200, long dashed lines for scores between 100 and
              200, short dashed lines for scores between 50 and 100, and
              dotted lines for scores less than 50.
                   plfasta -v "200 100 50"
              Normally, the values are 200, 100, and 50 for protein sequence
              comparisons and 400, 200, and 100 for DNA sequence comparisons.
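
              The mapping from scores to line styles can be sketched in
              Python as follows.  The function name line_style and the style
              labels are hypothetical, and the handling of a score exactly
              equal to a threshold (it falls into the lower style here) is an
              assumption, since the text above does not specify it.

```python
def line_style(score, lineval="200 100 50"):
    """Map a local alignment score to one of four line styles.

    `lineval` is a string of three numbers separated by spaces, highest
    threshold first, in the form given to -v or LINEVAL.
    """
    high, middle, low = (int(value) for value in lineval.split())
    if score > high:
        return "solid"
    if score > middle:
        return "long dashed"
    if score > low:
        return "short dashed"
    return "dotted"


if __name__ == "__main__":
    for score in (250, 150, 75, 10):
        print(score, line_style(score))
    # DNA thresholds quoted above.
    print(300, line_style(300, "400 200 100"))
```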

       -w #   (LINLEN) Output line length for sequence alignments (normally
              60, can be set up to 200).

       -x "offset1 offset2"
              Causes fasta/lfasta/plfasta to number the aligned sequences
              starting with offset1 and offset2, rather than 1 and 1.  This
              is particularly useful for showing alignments of promoter
              regions.

       -y #   Set the band-width used for optimization.  -y 16 is the default
              for protein when ktup=2 and for all DNA alignments. -y 32 is
              used for protein and ktup=1.  For proteins, optimization slows
              comparison 2-fold and is highly recommended.

       -z     Do not do statistical significance calculation. Results are
              ranked by the unnormalized opt, initn, or init1 score.

       -3     (tfasta, tfastx only) Normally tfasta and tfastx translate
              sequences in the DNA sequence library in all six frames.  With
              the -3 option, only the three forward frames are searched.

EXAMPLES
       (1)    fasta musplfm.aa $AABANK

       Compare the amino acid sequence in the file musplfm.aa with the
       complete PIR protein sequence library using ktup = 2.  Each "library"
       sequence (there need only be one) should start with a comment line
       which starts with a '>', e.g.

            >LCBO bovine preprolactin
            WILLLSQ ...
            >LCHU human ...
            ...

       (2)    fasta -a -w 80 musplfm.aa lcbo.aa 1

       Compare the amino acid sequence in the file musplfm.aa with the
       sequences in the file lcbo.aa using ktup = 1.  Show both sequences in
       their entirety, with 80 residues on each output line.

       (3)    fasta

       Run the fasta program in interactive mode.  The program will prompt for
       the file name for the query sequence, list alternative libraries to be
       searched (if FASTLIBS is set), and prompt for the ktup.

FILES
       This version of fasta prompts for the library file to be searched from
       a list of file names that are saved in the file pointed to by the
       environment variable FASTLIBS.  If FASTLIBS = fastgb.list, then the
       file fastgb.list might have the entries:

            NBRF Protein$0P/u/lib/aabank.lib 0
            GB Primate$1P@/u/lib/gpri.nam
            GB Rodent$1R@/u/lib/grod.nam
            GB Mammal$1M@/u/lib/gmammal.nam

       Each line in this file has 4 fields: (1) the library name, separated
       from the remaining fields by a '$'; (2) a 0 or a 1, indicating a
       protein or DNA library respectively; (3) a single letter that will be
       used to choose the library; (4) the location of the library file
       itself.  The library file name can contain an optional library format
       specifier.  fasta recognizes the following library formats: 0 -
       Pearson/FASTA; 1 - Genbank flat file; 2 - NBRF/PIR Codata; 3 -
       EMBL/SWISS-PROT; 4 - Intelligenetics; 5 - NBRF/PIR VMS.  Note that the
       fourth field can contain an '@' character, which indicates that the
       library file is an indirect library file containing a list of library
       files, one per line.  An indirect library file might have the lines:

            </usr/slib/genbank  (the directory for the library files)
            gbpri.seq 1
            gbrod.seq 1
            gbmam.seq 1
            ...
            gbvrl.seq 1
            ...
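
       The four-field layout of a library menu line can be made concrete
       with a small Python sketch.  The names LibraryEntry and
       parse_fastlibs_line are hypothetical, and the sketch assumes that the
       '@' marker, when present, is the first character of the fourth field
       and that file names contain no blanks, as in the sample entries.

```python
from typing import NamedTuple, Optional


class LibraryEntry(NamedTuple):
    name: str                      # field 1: library name
    is_dna: bool                   # field 2: 0 = protein, 1 = DNA
    letter: str                    # field 3: letter used to choose it
    path: str                      # field 4: location of the library file
    indirect: bool                 # True if the location began with '@'
    library_format: Optional[int]  # optional trailing format specifier


def parse_fastlibs_line(line):
    """Split one menu line such as 'NBRF Protein$0P/u/lib/aabank.lib 0'."""
    name, rest = line.rstrip("\n").split("$", 1)
    kind, letter, location = rest[0], rest[1], rest[2:]
    if kind not in ("0", "1"):
        raise ValueError("second field must be 0 (protein) or 1 (DNA)")
    indirect = location.startswith("@")
    if indirect:
        location = location[1:]
    parts = location.split()
    library_format = int(parts[1]) if len(parts) > 1 else None
    return LibraryEntry(name, kind == "1", letter, parts[0],
                        indirect, library_format)


if __name__ == "__main__":
    print(parse_fastlibs_line("NBRF Protein$0P/u/lib/aabank.lib 0"))
    print(parse_fastlibs_line("GB Rodent$1R@/u/lib/grod.nam"))
```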

       You can use your own sequence files with fasta; just be certain to put
       a '>' and comment as the first line before the sequence.  Only one
       library file type, the standard NBRF library format, is supported by
       the VAX/VMS programs.  lfasta and plfasta do not require the '>' and
       comment line; fasta does.

SEE ALSO
       rdf2(1), protcodes(5), dnacodes(5)

AUTHOR
       Bill Pearson
       wrp@virginia.EDU

                                     local   FASTA/TFASTA/FASTX/TFASTXv2.0u(1)