NAME
map_transcripts.pl
A script to associate enriched regions of transcription with gene annotation.
SYNOPSIS
map_transcripts.pl --ffile <filename> --rfile <filename> --out <filename>
map_transcripts.pl --db <name> --win <i> --step <i> --fdata <name> --rdata <name> --thresh <i> --out <filename>
Options:
--out <filename>
--ffile <filename>
--rfile <filename>
--db <db_name>
--win <integer>
--step <integer>
--thresh <number>
--fdata <dataset | filename>
--rdata <dataset | filename>
--tol <integer>
--min <integer>
--(no)gff
--source <text>
--version
--help
OPTIONS
The command line flags and descriptions:
- --out <filename>
-
Provide the base output filename for the transcript data file, GFF file, and summary file.
- --ffile <filename>
- --rfile <filename>
-
Provide the names of the enriched regions files representing transcription fragments. The files should be generated by the program find_enriched_regions.pl. If these files are not provided, they will be automatically generated by running the find_enriched_regions.pl program using the parameters defined here. Two files should be provided, one each for transcription fragments on the forward and reverse strands.
- --db <database_name>
-
Specify the name of a Bio::DB::SeqFeature::Store annotation database from which gene or feature annotation may be derived. For more information about using annotation databases, see https://code.google.com/p/biotoolbox/wiki/WorkingWithDatabases.
- --win <integer>
-
Specify the window size in bp to scan the genome for enriched regions. When the find_enriched_regions.pl program is run, trimming is turned on so that the final regions are not limited to multiples of window size.
- --step <integer>
-
Specify the step size in bp for advancing the window across the genome when scanning for enriched regions.
- --thresh <number>
-
Specify the threshold value to identify regions that are enriched, i.e. expressed. A relatively low value should be set to ensure that lowly expressed transcripts are also included. The value may be gleaned from the metadata in the provided enriched regions files.
- --fdata <dataset | filename>
- --rdata <dataset | filename>
-
Specify the names of the datasets in the database representing the forward and reverse transcription data. Alternatively, the paths of data files may be provided. Supported formats include Bam (.bam), BigBed (.bb), or BigWig (.bw). Files may be local or remote (http:// or ftp://).
- --tol <integer>
-
Specify the tolerance distance in bp. The ends of a transcription fragment and a gene must lie within this distance of each other for the transcription fragment to be called a complete transcript. If the ends exceed the tolerance value, then the transcription fragment is labeled as 'overlap'. The default value is 20 bp.
- --min <integer>
-
Small, spurious transcription fragments may be identified, particularly if the threshold is set too low or the dataset is noisy. This parameter sets the minimum size of the transcription fragment to initiate a search for associated known genes. The default value is 100 bp.
- --(no)gff
-
Indicate whether a GFF file should also be written. A version 3 file is generated. The default is true.
- --source <text>
-
Specify the GFF source value when writing a GFF file. The default value is the name of this program.
- --version
-
Print the version number.
- --help
-
Display the POD documentation.
DESCRIPTION
This program will identify transcription fragments that correspond to gene transcripts using available transcriptome microarray data. Specifically, it will identify the start and stop coordinates of transcribed regions that correspond to or overlap an annotated gene or ORF. It does not identify exons or introns.
The program was initially written to address the lack of officially mapped transcripts in the S. cerevisiae genome, which has few and small introns. It may work well with other similar genomes, but probably not with complex metazoan genomes that have very large introns.
Transcription fragments are identified as windows of enrichment using the script 'find_enriched_regions.pl'. This script can either be automatically executed using the specified parameters, or run separately. If run separately, the output files should be indicated (--ffile and --rfile).
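The idea of scanning for windows of enrichment can be pictured with a short Python sketch. This is an illustration only, not the code of find_enriched_regions.pl: the function name, the list of per-base scores, the use of the window mean as the test statistic, and the merging of overlapping windows are all assumptions, and the trimming of region ends is left out.

```python
def find_enriched_windows(scores, win, step, thresh):
    """Return merged (start, end) regions, 0-based half-open, made of
    windows whose mean score is at or above thresh.

    Hypothetical helper for illustration; the real program may score
    windows differently and also trims the ends of each region.
    """
    regions = []
    last_start = max(len(scores) - win + 1, 1)
    for start in range(0, last_start, step):
        window = scores[start:start + win]
        if not window:
            continue
        if sum(window) / len(window) >= thresh:
            end = start + len(window)
            if regions and start <= regions[-1][1]:
                # Overlapping or adjacent enriched windows extend the
                # previous region instead of starting a new one.
                regions[-1] = (regions[-1][0], max(regions[-1][1], end))
            else:
                regions.append((start, end))
    return regions


# Made-up scores with one enriched stretch in the middle.
scores = [0, 0, 5, 6, 7, 6, 0, 0, 0, 0]
print(find_enriched_windows(scores, win=2, step=1, thresh=4))
```

A smaller step gives finer resolution at the cost of more windows to evaluate, and a lower threshold admits weaker regions.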
Each enriched window, or transcription fragment, is checked for corresponding or overlapping genomic features. The name(s) of the overlapping gene(s) are reported, as well as a classification for the transcript. Transcription fragments that completely contain (within a default 20 bp tolerance) a single known gene (ORF or ncRNA) are labeled as 'complete'. Transcription fragments that simply overlap a known gene are labeled as 'overlap'. Transcription fragments that overlap more than one annotated gene are labeled as 'multi-orf'.
Transcription fragments that only overlap annotated genes on the opposite strand are labeled as 'anti-sense'. Transcription fragments overlapping repetitive elements or no known feature are also reported.
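The classification rules above can be summarized in a small Python sketch. It is one reading of the description, not the program's actual logic: the Interval type, the classify function, and the exact containment test are assumptions, and 'no-feature' is a placeholder label, since the label used for fragments without a known gene is not stated here. Repetitive elements are not handled.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    """Hypothetical record for a fragment or gene (1-based, inclusive)."""
    name: str
    start: int
    end: int
    strand: str  # '+' or '-'


def overlaps(a: Interval, b: Interval) -> bool:
    return a.start <= b.end and b.start <= a.end


def classify(frag: Interval, genes: List[Interval], tol: int = 20) -> str:
    hits = [g for g in genes if overlaps(frag, g)]
    sense = [g for g in hits if g.strand == frag.strand]
    if len(sense) > 1:
        return "multi-orf"
    if len(sense) == 1:
        gene = sense[0]
        # Assumed reading of the tolerance: the gene may extend past
        # either end of the fragment by at most tol bp.
        contained = (frag.start - tol <= gene.start
                     and gene.end <= frag.end + tol)
        return "complete" if contained else "overlap"
    if hits:
        # Only genes on the opposite strand overlap the fragment.
        return "anti-sense"
    return "no-feature"  # placeholder label, not taken from the program


# Made-up annotation for illustration.
genes = [
    Interval("geneA", 1000, 2000, "+"),
    Interval("geneB", 2500, 3000, "+"),
    Interval("geneC", 5000, 6000, "-"),
]
print(classify(Interval("frag1", 950, 2100, "+"), genes))
print(classify(Interval("frag2", 1010, 1500, "+"), genes))
```

The first fragment spans all of geneA and is classified as 'complete'; the second covers only part of it and is classified as 'overlap'.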
Some rudimentary calculations are performed to identify the length of the 5' and 3' UTRs.
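One plausible form of such a calculation is sketched below in Python. It assumes the annotated gene coordinates describe the ORF and that the UTR lengths are simply the parts of the transcription fragment extending beyond each end of the ORF; the function name and the clamping of negative values to zero are illustrative choices, not details taken from the program.

```python
def utr_lengths(frag_start, frag_end, orf_start, orf_end, strand):
    """Return (five_prime_utr, three_prime_utr) lengths in bp.

    Hypothetical helper: coordinates are genomic (start <= end), and a
    fragment that does not reach past the ORF yields a length of 0.
    """
    left = max(0, orf_start - frag_start)   # fragment bases left of the ORF
    right = max(0, frag_end - orf_end)      # fragment bases right of the ORF
    if strand == "+":
        return left, right
    if strand == "-":
        # On the reverse strand the 5' end is at the higher coordinate.
        return right, left
    raise ValueError("strand must be '+' or '-'")


print(utr_lengths(950, 2100, 1000, 2000, "+"))  # (50, 100)
print(utr_lengths(950, 2100, 1000, 2000, "-"))  # (100, 50)
```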
The program writes out a tab-delimited text file of the transcript data. Unless --nogff is specified, it also writes out a GFF file for use in a genome browser. Finally, it writes out a summary report file.
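For orientation, a GFF version 3 record has nine tab-separated columns: seqid, source, type, start, end, score, strand, phase, and attributes. The Python sketch below builds one such line. The function name, the 'transcript' feature type, the attribute keys, and the example values are assumptions for illustration, and the actual output of this program may differ. The source default follows the --source option, assuming the program name is written as map_transcripts.pl.

```python
def gff3_line(seqid, start, end, strand, name,
              source="map_transcripts.pl", feature_type="transcript"):
    """Format one GFF3 feature line (hypothetical helper).

    Score and phase are written as '.', meaning not applicable.
    Names are assumed to need no percent-encoding.
    """
    attributes = f"ID={name};Name={name}"
    columns = [seqid, source, feature_type, str(start), str(end),
               ".", strand, ".", attributes]
    return "\t".join(columns)


print("##gff-version 3")
print(gff3_line("chrI", 950, 2100, "+", "fragment1"))
```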
AUTHOR
Timothy J. Parnell, PhD
Howard Hughes Medical Institute
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112
This package is free software; you can redistribute it and/or modify it under the terms of the GPL (either version 1, or at your option, any later version) or the Artistic License 2.0.