NAME

map_transcripts.pl

A script to associate enriched regions of transcription with gene annotation.

SYNOPSIS

map_transcripts.pl --ffile <filename> --rfile <filename> --out <filename>

map_transcripts.pl --db <name> --win <i> --step <i> --fdata <name> --rdata <name> --thresh <i> --out <filename>

Options:
--out <filename>
--ffile <filename>
--rfile <filename>
--db <database_name>
--win <integer>
--step <integer>
--thresh <number>
--fdata <dataset | filename>
--rdata <dataset | filename>
--tol <integer>
--min <integer>
--(no)gff
--source <text>
--version
--help

OPTIONS

The command line flags and descriptions:

--out <filename>

Provide the base output filename for the transcript data file, GFF file, and summary file.

--ffile <filename>
--rfile <filename>

Provide the names of the enriched regions files representing transcription fragments. Two files should be provided, one for the forward strand and one for the reverse strand. The files should be generated by the program find_enriched_regions.pl. If these files are not provided, they will be generated automatically by running find_enriched_regions.pl with the parameters defined here.

--db <database_name>

Specify a BioPerl Bio::DB::SeqFeature::Store database. Required for generating the transcription fragment enriched regions files. The database name may instead be gleaned from the metadata of the provided enriched regions files.

--win <integer>

Specify the window size in bp to scan the genome for enriched regions. When the find_enriched_regions.pl program is run, trimming is turned on so that the final regions are not limited to multiples of window size.

--step <integer>

Specify the step size in bp for advancing the window across the genome when scanning for enriched regions.

--thresh <number>

Specify the threshold value used to identify regions that are enriched, i.e. expressed. A relatively low value should be set to ensure that weakly expressed transcripts are also included. The value may be gleaned from the metadata in the provided enriched regions files.
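The interaction of --win, --step, and --thresh can be illustrated with a minimal sketch. This is not the code of find_enriched_regions.pl: the function name, the use of a window mean as the score, and the merging of adjacent windows are all assumptions made for illustration, and the trimming step mentioned above is not implemented.

```python
from statistics import mean


def find_enriched_windows(scores, win, step, thresh):
    """Hypothetical sliding-window scan over per-bp scores.

    Returns merged (start, end) intervals, 1-based inclusive, for every
    window whose mean score is at least `thresh`. Using the mean is an
    assumption; the real program may score windows differently.
    """
    regions = []
    last_start = max(len(scores) - win + 1, 1)
    for start in range(0, last_start, step):
        window = scores[start:start + win]
        if not window or mean(window) < thresh:
            continue
        begin, end = start + 1, start + len(window)
        if regions and begin <= regions[-1][1] + 1:
            # Overlapping or adjacent to the previous region: extend it.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((begin, end))
    return regions


# Toy data: 20 bp of signal flanked by 10 bp of background on each side.
toy_scores = [0] * 10 + [5] * 20 + [0] * 10
print(find_enriched_windows(toy_scores, win=10, step=5, thresh=4))
```

A smaller step relative to the window gives finer placement of region boundaries at the cost of more windows to score.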

--fdata <dataset | filename>
--rdata <dataset | filename>

Specify the names of the datasets in the database representing the forward and reverse transcription data. Alternatively, the paths of data files may be provided. Supported formats are Bam (.bam), BigBed (.bb), and BigWig (.bw). Files may be local or remote (http:// or ftp://).

--tol <integer>

Specify the tolerance distance in bp within which the ends of a transcription fragment and a gene must lie for the transcription fragment to be called a complete transcript. If the ends exceed the tolerance value, the transcription fragment is labeled as 'overlap'. The default value is 20 bp.

--min <integer>

Small, spurious transcription fragments may be identified, particularly if the threshold is set too low or the dataset is noisy. This parameter sets the minimum size of the transcription fragment to initiate a search for associated known genes. The default value is 100 bp.
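As a rough illustration of the size filter, the sketch below drops fragments shorter than the minimum before any gene search would begin. The function name and the tuple layout are hypothetical, and fragment length is assumed to be computed from 1-based inclusive coordinates.

```python
def filter_by_min_size(fragments, min_size=100):
    """Keep only (start, end) fragments at least `min_size` bp long.

    Coordinates are assumed 1-based and inclusive, so length = end - start + 1.
    """
    return [(s, e) for (s, e) in fragments if e - s + 1 >= min_size]


# The 50 bp fragment is discarded; the 100 bp and 900 bp fragments are kept.
print(filter_by_min_size([(1, 50), (201, 300), (1000, 1899)]))
```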

--(no)gff

Indicate whether a GFF file should also be written. A version 3 file is generated. The default is true.

--source <text>

Specify the GFF source value when writing a GFF file. The default value is the name of this program.
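For orientation, a GFF version 3 record has nine tab-separated columns: seqid, source, type, start, end, score, strand, phase, and attributes. The sketch below shows where the --source value would appear in such a record. The function name, the feature type 'transcript', and the attribute keys chosen here are assumptions for illustration, not necessarily what this program writes.

```python
def gff3_line(seqid, start, end, strand, name,
              source="map_transcripts.pl", ftype="transcript"):
    """Format one GFF3 record; score and phase are left undefined ('.')."""
    attributes = f"ID={name};Name={name}"
    fields = [seqid, source, ftype, str(start), str(end),
              ".", strand, ".", attributes]
    return "\t".join(fields)


# Hypothetical fragment on chrI; the second column carries the source value.
print(gff3_line("chrI", 100, 500, "+", "frag1"))
```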

--version

Print the version number.

--help

Display the POD documentation.

DESCRIPTION

This program identifies transcription fragments that correspond to gene transcripts using available transcriptome microarray data. Specifically, it identifies the start and stop coordinates of transcribed regions that correspond to or overlap an annotated gene or ORF. It does not identify exons or introns.

The program was initially written to address the lack of officially mapped transcripts in the S. cerevisiae genome, which has few introns, all of them small. It may work well with other similar genomes, but probably not with complex metazoan genomes that have very large introns.

Transcription fragments are identified as windows of enrichment using the script 'find_enriched_regions.pl'. This script can either be executed automatically using the specified parameters, or run separately. If it is run separately, its output files should be supplied with --ffile and --rfile.

Each enriched window, or transcription fragment, is checked for corresponding or overlapping genomic features. The name(s) of the overlapping gene(s) are reported, as well as a classification for the transcript. Transcripts that completely contain (within a default 20 bp tolerance) a single known gene (ORF or ncRNA) are labeled as 'complete'. Transcription fragments that simply overlap a known gene are labeled as 'overlap'. Transcription fragments that overlap more than one annotated gene are labeled as 'multi-orf'.

Transcription fragments that only overlap annotated genes on the opposite strand are labeled as 'anti-sense'. Transcription fragments overlapping repetitive elements or no known feature are also reported.
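The classification rules above can be summarized in a short sketch. This is an interpretation of the description, not the program's code: the Feature class, the function names, and the 'none' label for fragments without any overlapping gene are hypothetical, 'complete' is assumed to mean that the fragment spans the gene to within the tolerance at both ends, and repetitive elements are not handled.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    name: str
    start: int   # 1-based, inclusive
    end: int
    strand: str  # '+' or '-'


def overlaps(a, b):
    """True if two features share at least one base."""
    return a.start <= b.end and b.start <= a.end


def classify(frag, genes, tol=20):
    """Return (label, gene_names) for one transcription fragment."""
    hits = [g for g in genes if overlaps(frag, g)]
    if not hits:
        return "none", []  # hypothetical label for 'no known feature'
    sense = [g for g in hits if g.strand == frag.strand]
    if not sense:
        return "anti-sense", [g.name for g in hits]
    if len(sense) > 1:
        return "multi-orf", [g.name for g in sense]
    gene = sense[0]
    # Assumed test: the fragment contains the gene, allowing `tol` bp of slack.
    contains = frag.start <= gene.start + tol and frag.end >= gene.end - tol
    return ("complete" if contains else "overlap"), [gene.name]


genes = [Feature("geneA", 200, 800, "+"), Feature("geneB", 1200, 1500, "-")]
print(classify(Feature("frag1", 150, 850, "+"), genes))
print(classify(Feature("frag2", 400, 850, "+"), genes))
print(classify(Feature("frag3", 1100, 1600, "+"), genes))
```

With these toy coordinates, frag1 spans geneA and would be called 'complete', frag2 covers only part of it and would be called 'overlap', and frag3 overlaps only a gene on the opposite strand.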

Some rudimentary calculations are performed to identify the length of the 5' and 3' UTRs.
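One plausible form of such a calculation is sketched below, assuming the gene coordinates mark the ORF boundaries and the transcript coordinates mark the transcription fragment. The function name and argument layout are hypothetical, and the program's actual calculation may differ. On the reverse strand the 5' end lies at the higher coordinate, so the two flanks swap roles.

```python
def utr_lengths(t_start, t_end, g_start, g_end, strand):
    """Return (five_prime_utr, three_prime_utr) lengths in bp.

    Coordinates are genomic with start <= end. A flank that does not
    extend past the gene is reported as 0.
    """
    left = max(g_start - t_start, 0)   # transcript extends left of the gene
    right = max(t_end - g_end, 0)      # transcript extends right of the gene
    if strand == "+":
        return left, right
    return right, left


print(utr_lengths(100, 1000, 150, 900, "+"))
print(utr_lengths(100, 1000, 150, 900, "-"))
```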

The program writes three output files: a tab-delimited text file of the transcript data, a GFF file for use in a genome browser (unless disabled with --nogff), and a summary report file.

AUTHOR

Timothy J. Parnell, PhD
Howard Hughes Medical Institute
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the GPL (either version 1, or at your option, any later version) or the Artistic License 2.0.