NAME
Bio::ToolBox::Data - Reading, writing, and manipulating data structures
SYNOPSIS
    use Bio::ToolBox::Data;

    ### Create new gene list from database
    my $Data = Bio::ToolBox::Data->new(
        db      => 'hg19',
        feature => 'gene:ensGene',
    );

    ### Or create a list of genomic windows from database
    my $Data = Bio::ToolBox::Data->new(
        db      => 'hg19',
        feature => 'genome',
        win     => 1000,
        step    => 1000,
    );

    ### Open a pre-existing file
    my $Data = Bio::ToolBox::Data->new(
        file => 'coordinates.bed',
    );

    ### Get a specific value
    my $value = $Data->value($row, $column);

    ### Replace or add a value
    $Data->value($row, $column, $new_value);

    ### Iterate through a Data structure one row at a time
    my $stream = $Data->row_stream;
    while (my $row = $stream->next_row) {
        # get the positional information from the file data
        # assuming that the input file had these identifiable columns
        my $seq_id = $row->seq_id;
        my $start  = $row->start;
        my $stop   = $row->end;

        # generate a Segment object from the database using
        # these coordinates ($db is a previously opened database object)
        my $region = $db->segment($seq_id, $start, $stop);

        my $value = $row->value($column);
        my $new_value = $value + 1;
        $row->value($column, $new_value);
    }

    ### Write the data to file
    my $success = $Data->write_file(
        filename => 'new_data.txt',
        gz       => 1,
    );
    print "wrote new file $success\n"; # file is new_data.txt.gz
PREFACE
This is an object-oriented interface to the Bio::ToolBox Data structure. Most of the remaining Bio::ToolBox libraries are collections of exported subroutines, and the data structures required a lot of manual manipulation (and redundant code - sigh). They were written before I learned to fully appreciate the benefits of OO-code. Hence, this module is an attempt to right the wrongs of my early practices.
Many of the provided scripts that accompany the Bio::ToolBox distribution do not use this OO interface. They can be cryptic, obtuse, and hard to follow. New scripts should follow this interface instead.
DESCRIPTION
This module works with the primary Bio::ToolBox Data structure. Simply put, it is a complex data structure representing a tab-delimited table (array of arrays), with plenty of options for metadata. Many common bioinformatic file formats are simply tab-delimited text files (think BED and GFF). Each row is a feature or genomic interval, and each column is a piece of information about that feature, such as name, type, and/or coordinates. We can append additional columns of information to that table, perhaps scores from genomic data sets. We can record metadata regarding how and where we obtained that data. Finally, we can write the updated table to a new file.
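To make the "array of arrays" idea concrete, here is a minimal plain-Perl sketch of such a table. It does not use Bio::ToolBox::Data at all; the variable names and values are made up for illustration, and the real object stores additional metadata alongside the table.

```perl
use strict;
use warnings;

# A toy table: row 0 holds the column headers, rows 1..N hold features.
# All values here are made up for illustration.
my @table = (
    [ 'Chromosome', 'Start', 'Stop', 'Name'  ],
    [ 'chr1',       1000,    2000,   'geneA' ],
    [ 'chr1',       5000,    7500,   'geneB' ],
    [ 'chr2',       300,     900,    'geneC' ],
);

# Because the header occupies index 0, the last row index equals
# the number of content rows.
my $last_row = $#table;

# Append a new column: add the header name, then fill in each row.
push @{ $table[0] }, 'Length';
my $length_col = $#{ $table[0] };
for my $row (1 .. $last_row) {
    $table[$row][$length_col] = $table[$row][2] - $table[$row][1] + 1;
}

# Print the table as tab-delimited text.
print join("\t", @$_), "\n" for @table;
```

The same row and column indexing (0-based, header at row 0) is what the value(), add_column(), and last_row methods described below operate on.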
METHODS
Initializing the structure
- new()
-
Initialize a new Data structure. This generally requires options, provided as an array of key => values. A new list of features may be obtained from an annotation database, an existing file may be loaded, or a new empty structure may be generated.
These are the options available.
- file => $filename
-
Provide the path and name of an existing tab-delimited text file. BED and GFF files and their variants are accepted. Except for structured files, e.g. BED and GFF, the first line is assumed to be column header names. Commented lines (beginning with #) are parsed as metadata. The files may be compressed (gzip or bzip2).
- feature => $type
- feature => "$type:$source"
- feature => 'genome'
-
For de novo lists from an annotation database, provide the GFF type or type:source (columns 3 and 2) for collection. Multiple types may be given as a single comma-delimited string (not an array).
For a list of genomic intervals across the genome, specify a feature of 'genome'.
- db => $name
- db => $path
- db => $database_object
-
Provide the name of the database from which to collect the features. It may be a short name, whereupon it is checked in the Bio::ToolBox configuration file .biotoolbox.cfg for connection information. Alternatively, a path to a database file or directory may be given.
If you already have an opened Bio::DB::SeqFeature::Store database object, you can simply pass that. See Bio::ToolBox::db_helper for more information. However, this in general should be discouraged, since the name of the database will not be properly recorded when saving to file.
- win => $integer
- step => $integer
-
If generating a list of genomic intervals, optionally provide the window and step values. Default values are defined in the Bio::ToolBox configuration file .biotoolbox.cfg.
If successful, the method will return a Bio::ToolBox::Data object.
General Metadata
There is a variety of general metadata regarding the Data structure.
The following methods may be used to access or set these metadata properties.
- feature($text)
-
Returns or sets the name of the features used to collect the list of features. The actual feature types are listed in the table, so this metadata is merely descriptive.
- program($name)
-
Returns or sets the name of the program generating the list.
- database($name)
-
Returns or sets the name or path of the database from which the features were derived.
The following methods may be used to access metadata only.
- gff
- bed
-
Returns the GFF version number or the number of BED columns indicating that the Data structure is properly formatted as such. A value of 0 means they are not formatted as such.
File information
- filename($text)
-
Returns or sets the filename for the Data structure. If you set a new filename, the path, basename, and extension are automatically derived for you. If a path was not provided, the current working directory is assumed.
- path
- basename
- extension
-
Returns the full path, basename, and extension of the filename. Concatenating these three values will reconstitute the original filename.
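The relationship between the three parts can be illustrated with the core File::Basename module. This is only a sketch of the decomposition; it is not necessarily how Bio::ToolBox::Data derives the values, the file path is hypothetical, and the module may treat compound extensions such as .bed.gz differently.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

my $filename = '/data/project/coordinates.bed';    # hypothetical path

# fileparse() returns the base name, the directory (with its trailing
# separator), and the suffix matched by the supplied pattern.
my ($basename, $path, $extension) = fileparse($filename, qr/\.[^.]+/);

print "path:      $path\n";
print "basename:  $basename\n";
print "extension: $extension\n";

# Concatenating the three parts gives back the original filename.
my $rebuilt = $path . $basename . $extension;
print "rebuilt:   $rebuilt\n";
```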
Comments
Comments are the other commented lines from a text file (lines beginning with a #) that were not parsed as metadata.
- comments
-
Returns a copy of the array containing commented lines.
- add_comment($text)
-
Appends the text string to the comment array.
- delete_comment
- delete_comment($index)
-
Deletes a comment. Provide the array index of the comment to delete. If an index is not provided, ALL comments will be deleted!
The Data table
The Data table is the array of arrays containing all of the actual information. It has some metadata as well.
- number_columns
-
Returns the number of columns in the Data table.
- last_row
-
Returns the array index number of the last row. Since the header row is index 0, this is also the number of actual content rows.
- add_column($name)
-
Appends a new empty column to the Data table at the rightmost position (highest index). It adds the column header name and creates a new column metadata hash. But it doesn't actually fill values in every row. It returns the new column index.
- delete_column($index1, $index2, ...)
-
Deletes one or more specified columns. Any remaining columns rightwards will have their indices shifted down appropriately. If you had identified one of the shifted columns, you may need to re-find or calculate its new index.
- reorder_column($index1, $index2, ...)
-
Reorders columns into the specified order. Provide the new desired order of indices. Columns could be duplicated or deleted using this method. The columns will adopt their new index numbers.
- add_row
- add_row(\@values)
-
Add a new row of data values to the end of the Data table. Optionally provide a reference to an array of values to put in the row. The array is filled up with undef for missing values, and excess values are dropped.
- delete_row($row1, $row2, ...)
-
Deletes one or more specified rows. Rows are spliced out highest to lowest index to avoid issues. Be very careful deleting rows while simultaneously iterating through the table!
- row_values($row)
-
Returns a copy of the array for the specified row index. Modifying this returned array does not change the Data table; use the value() method below instead.
- value($row, $column)
- value($row, $column, $new_value)
-
Returns or sets the value at a specific row and column index. Index positions are 0-based (header row is index 0).
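The row handling described for add_row() and delete_row() can be sketched in plain Perl. The helper normalize_row() below is hypothetical and written only for this illustration; it mimics the documented behavior (pad with undef, drop excess values) but is not the module's own code.

```perl
use strict;
use warnings;

my @table = (
    [ 'Name', 'Score', 'Note' ],    # header row, index 0
    [ 'a', 1, 'x' ],
    [ 'b', 2, 'y' ],
    [ 'c', 3, 'z' ],
);
my $number_columns = scalar @{ $table[0] };

# Hypothetical helper: pad a short row with undef, drop excess values.
sub normalize_row {
    my ($values, $width) = @_;
    my @row = @$values;
    $#row = $width - 1;    # truncates or extends (with undef) in one step
    return \@row;
}

push @table, normalize_row([ 'd', 4 ], $number_columns);               # padded
push @table, normalize_row([ 'e', 5, 'w', 'extra' ], $number_columns); # trimmed

# Delete rows 1 and 3. Splicing from the highest index down keeps the
# lower indices valid; in ascending order, removing row 1 first would
# shift the intended row 3 down to index 2.
for my $index (sort { $b <=> $a } (1, 3)) {
    splice @table, $index, 1;
}

# Print the result, showing undefined values as '.'
print join("\t", map { defined $_ ? $_ : '.' } @$_), "\n" for @table;
```

The same index shifting is the reason for the warning about deleting rows while iterating: every row after a deleted one moves down by one.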
Column Metadata
Each column has metadata, stored as a series of key => value pairs. The minimum keys are 'index' (the 0-based index of the column) and 'name' (the column header name). Additional keys and values may be queried or set as appropriate. When the file is written, these are stored as commented metadata lines at the beginning of the file.
- metadata($index, $key)
- metadata($index, $key, $new_value)
-
Returns or sets the metadata value for a specific $key for a specific column $index.
This may also be used to add a new metadata key. Simply provide the name of a new $key that is not yet present.
- delete_metadata($index, $key)
-
Deletes a column-specific metadata $key and value for a specific column $index. If a $key is not provided, then all metadata keys for that index will be deleted.
- find_column($name)
-
Searches the column names for the specified column name. This employs a case-insensitive grep search, so simple substitutions may be made.
- chromo_column
- start_column
- stop_column
- strand_column
- name_column
- type_column
- id_column
-
These methods will return the identified column best matching the description. Returns undef if that column is not present. These use the find_column() method with a predefined list of aliases.
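A case-insensitive column search with a fallback list of aliases might look like the following plain-Perl sketch. Both subroutines and the alias list are hypothetical, written only to illustrate the idea; the module's actual matching rules, alias lists, and return values may differ.

```perl
use strict;
use warnings;

my @headers = ('Chromosome', 'Start', 'Stop', 'Gene_Name', 'Score');

# Hypothetical stand-in for a case-insensitive column search. Returns the
# index of the first header matching the pattern, or undef if none match.
sub find_column_index {
    my ($pattern, @names) = @_;
    my @hits = grep { $names[$_] =~ /$pattern/i } 0 .. $#names;
    return @hits ? $hits[0] : undef;
}

# Try a list of aliases in order and return the first exact
# (case-insensitive) match. The alias lists used below are made up.
sub find_by_alias {
    my ($aliases, @names) = @_;
    for my $alias (@$aliases) {
        my $index = find_column_index("^$alias\$", @names);
        return $index if defined $index;
    }
    return;
}

my $name_col   = find_column_index('name', @headers);
my $chromo_col = find_by_alias([qw(chr chrom chromosome seq_id)], @headers);
my $strand_col = find_by_alias([qw(strand)], @headers);

printf "name column:   %s\n", defined $name_col   ? $name_col   : 'not found';
printf "chromo column: %s\n", defined $chromo_col ? $chromo_col : 'not found';
printf "strand column: %s\n", defined $strand_col ? $strand_col : 'not found';
```

Because a missing column yields undef rather than 0, always test the result with defined before using it as an index; column 0 is a valid answer.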
Efficient Data Access
Most of the time we need to iterate over the Data table, one row at a time, collecting data or processing information. These methods simplify the process.
- row_stream()
-
This returns a Bio::ToolBox::Data::Iterator object, which has one method, next_row(). Call this method repeatedly until it returns undef to work through each row of data.
Users of the Bio::DB family of database adaptors may recognize the analogy to the seq_stream() method.
- next_row()
-
Called from a Bio::ToolBox::Data::Iterator object, it returns a Bio::ToolBox::Data::Feature object. This object represents the values in the current Data table row.
Bio::ToolBox::Data::Feature Methods
These are methods for working with the current data row generated using the next_row() method from a row_stream iterator.
- seq_id
- start
- end
- strand
- name
- type
- id
-
These methods return the corresponding appropriate value, if present. These rely on the corresponding find_column methods.
- value($index)
- value($index, $new_value)
-
Returns or sets the value at a specific column index in the current data row.
The following are convenience methods for using the attributes in the current data row to interact with databases. They are wrappers to methods in the Bio::ToolBox::db_helper module.
- feature
-
Returns a SeqFeature object from the database using the name and type values in the current Data table row. The SeqFeature object is requested from the database named in the general metadata. If an alternate database is desired, you should change it first using the general database() method. If the feature name or type is not present in the table, then nothing is returned.
- segment
-
Returns a database Segment object corresponding to the coordinates defined in the Data table row. The database named in the general metadata is used to establish the Segment object. If a different database is desired, it should be changed first using the general database() method.
- get_score(%args)
-
This is a convenience method for the Bio::ToolBox::db_helper::get_chromo_region_score() method. It will return a single score value for the region defined by the coordinates or typed named feature in the current data row. If the Data table has coordinates, then those will be automatically used. If the Data table has typed named features, then the coordinates will automatically be looked up for you by requesting a SeqFeature object from the database.
The name of the dataset from which to collect the data must be provided. This may be a GFF type in a SeqFeature database, a BigWig member in a BigWigSet database, or a path to a BigWig, BigBed, Bam, or USeq file. Additional parameters may also be specified; please see the Bio::ToolBox::db_helper::get_chromo_region_score() method for full details.
Here is an example of collecting mean values from a BigWig and adding the scores to the Data table.
    my $index = $Data->add_column('MyData');
    my $stream = $Data->row_stream;
    while (my $row = $stream->next_row) {
        my $score = $row->get_score(
            'method'  => 'mean',
            'dataset' => '/path/to/MyData.bw',
        );
        $row->value($index, $score);
    }
- get_position_scores(%args)
-
This is a convenience method for the Bio::ToolBox::db_helper::get_region_dataset_hash() method. It will return a hash of positions => scores over the region defined by the coordinates or typed named feature in the current data row. The coordinates for the interrogated region will be automatically provided.
Just like the get_score() method, the dataset from which to collect the scores must be provided, along with any other optional arguments. See the documentation for the Bio::ToolBox::db_helper::get_region_dataset_hash() method for more details.
Here is an example for collecting positioned scores around the 5 prime end of a feature from a BigWigSet directory.
    my $stream = $Data->row_stream;
    while (my $row = $stream->next_row) {
        my %position2score = $row->get_position_scores(
            'ddb'      => '/path/to/BigWigSet/',
            'dataset'  => 'MyData',
            'position' => 5,
            'start'    => -500,
            'stop'     => 500,
        );
        # do something with %position2score
    }
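As one idea for the "do something" step, the sketch below summarizes a positions => scores hash into mean scores upstream and downstream of the reference point. The hash contents are made-up stand-ins for what get_position_scores() might return for one row, and treating the keys as positions relative to the reference point is an assumption of this example.

```perl
use strict;
use warnings;

# Made-up data standing in for the hash returned for one row:
# relative position => score.
my %position2score = (
    -400 => 1.5,
    -150 => 2.0,
    -20  => 4.0,
    75   => 6.0,
    310  => 3.0,
);

# Accumulate scores on each side of the reference point (position 0).
my (%sum, %count);
for my $pos (sort { $a <=> $b } keys %position2score) {
    my $side = $pos < 0 ? 'upstream' : 'downstream';
    $sum{$side}   += $position2score{$pos};
    $count{$side} += 1;
}

# Mean score per side.
my %mean = map { $_ => $sum{$_} / $count{$_} } keys %sum;
printf "%-10s mean: %.2f\n", $_, $mean{$_} for sort keys %mean;
```

Inside a row_stream loop, the two means could then be stored back into the table with $row->value() after adding two columns with add_column().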
Data Table Functions
These methods alter the Data table en masse.
- verify
-
This method will verify the Data structure, including the metadata and the Data table. It ensures that the table has the correct number of rows and columns as described in the metadata, and that each column has the basic metadata.
If the Data structure is marked as a GFF or BED structure, then the table is checked that the structure matches the proper format. If not, for example when additional columns have been added, then the GFF or BED value is set to null.
This method is automatically called prior to writing the Data table to file.
- splice_data($current_part, $total_parts)
-
This method will splice the Data table into $total_parts number of pieces, retaining the $current_part piece. The other parts are discarded. This method is intended to be used when a program is forked into separate processes, allowing each child process to work on a subset of the original Data table.
Two values are passed to the method. The first is the current part number, 1-based. The second value is the total number of parts into which the table should be divided, corresponding to the number of concurrent processes. For example, to fork the program into four concurrent processes:
    use Parallel::ForkManager;

    my $Data = Bio::ToolBox::Data->new(file => $file);
    my $pm = Parallel::ForkManager->new(4);
    for my $i (1..4) {
        $pm->start and next;
        ### in child
        $Data->splice_data($i, 4);
        my $db = $Data->open_database; # a clone-safe new db object
        # do something with this portion
        $Data->save('filename' => "file#$i");
        $pm->finish;
    }
    $pm->wait_all_children;
There is no convenient method for merging the modified contents of the table from each child process back into the original Data table, as each child is essentially isolated from the parent. The Parallel::ForkManager documentation recommends going through a disk file intermediate. See the accompanying BioToolBox script join_data_file.pl for concatenating Data table files together.
Remember that if you fork your script into child processes, any database connections must be re-opened; they are typically not clone safe. If you have an existing database connection by using the open_database() method, it should be automatically re-opened for you when you use the splice_data() method, but you will need to call open_database() again in the child process to obtain the new database object.
- convert_gff(%options)
-
This method will irreversibly convert the Data table into a GFF format. Table columns will be added, deleted, reordered, and renamed as necessary to generate the GFF structure. An array of options should be passed to control the conversion step.
- version => <2|3>
-
Provide the GFF version. The default is version 3.
- chromo => $index
- start => $index
- stop => $index
- strand => $index
-
Provide the column indices for the appropriate columns. These should be automatically identified from the column header names. Indices are 0-based.
- score => $index
-
Provide the column index of the score column. This is not automatically determined.
- source => $index|$text
- type => $index|$text
- name => $index|$text
-
Provide either a column index (0-based) or a text name to be used for all the features. Integers between 0 and the rightmost column index are presumed to be an index; everything else is taken as text.
- tag => \@indices
-
Provide an array reference of column indices to be used for GFF tags.
- id => $index
-
Provide a column index of unique values to be used for GFF3 ID tag.
- midpoint => <boolean>
-
Flag to use the midpoint instead of actual start and stop coordinates.
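The rule for the source, type, and name options (an integer within the range of column indices is an index; anything else is text) can be sketched as below. The helper is_column_index() is hypothetical and exists only for this illustration; it is not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether an option value names a column
# index or is literal text to apply to every feature.
sub is_column_index {
    my ($value, $last_column) = @_;
    return ($value =~ /\A\d+\z/ && $value <= $last_column) ? 1 : 0;
}

my $last_column = 5;    # a table with six columns, indices 0..5
for my $value (0, 3, 5, 6, 'gene', 'SGD', '-1') {
    printf "%-6s => %s\n", $value,
        is_column_index($value, $last_column) ? 'column index' : 'text';
}
```

One consequence of this rule is that a purely numeric literal, such as a source named "2", cannot be passed as text when the table has at least that many columns; it would be read as a column index.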
Data Table File Functions
When you are finished modifying the Data table, it may then be written out as a tab-delimited text file. If the format corresponds to a valid BED or GFF file, then it may be written in that format.
Several functions are available for writing the Data table, exporting to a compatible GFF file format, or writing a summary of the Data table.
- write_file()
- save()
-
These methods will write the Data structure out to file. It will be first verified as to proper structure. Opened BED and GFF files are checked to see if their structure is maintained. If so, they are written in the same format; if not, they are written as regular tab-delimited text files. You may pass additional options.
- filename => $filename
-
Optionally pass a new filename. Required for new objects; previously opened files may be overwritten if a new name is not provided. If necessary, the file extension may be changed; for example, BED files that no longer match the defined format lose the .bed and gain a .txt extension. Compression may add or strip .gz as appropriate. If a path is not provided, the current working directory is used.
- gz => boolean
-
Change the compression status of the output file. The default is to maintain the status of the original opened file.
If the file save is successful, it will return the full path and name of the saved file, complete with any changes to the file extension.
- summary_file()
-
Write a separate file summarizing columns of data (mean values). The mean value of each column becomes a row value, and each column header becomes a row identifier (i.e. the table is transposed). The best use of this is to summarize the mean profile of windowed data collected across a feature. See the Bio::ToolBox scripts get_relative_data.pl and average_gene.pl as an example. You may pass options.
- filename => $filename
-
Pass an optional new filename. The default is to take the basename and append "_summed" to it.
- startcolumn => $index
- stopcolumn => $index
-
Provide the starting and ending columns to summarize. The default start is the leftmost column without a recognized standard name. The default ending column is the last rightmost column. Indexes are 0-based.
If successful, it will return the name of the file saved.
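The transposition performed by the summary can be sketched in plain Perl as follows. The table contents and the output layout are made up for illustration, and skipping undefined values when averaging is an assumption of this example; the file written by summary_file() may be formatted differently.

```perl
use strict;
use warnings;
use List::Util qw(sum);

my @table = (
    [ 'Name',  '-100', '0', '100' ],    # header: name plus three window columns
    [ 'geneA', 1.0,    4.0, 2.0   ],
    [ 'geneB', 3.0,    6.0, 4.0   ],
);
my ($start_col, $stop_col) = (1, 3);    # columns to summarize, 0-based

# Each summarized column becomes one row: [ column header, mean value ].
my @summary;
for my $col ($start_col .. $stop_col) {
    my @values = grep { defined } map { $table[$_][$col] } 1 .. $#table;
    my $mean = @values ? sum(@values) / @values : undef;
    push @summary, [ $table[0][$col], $mean ];
}

# Print the transposed summary, showing undefined means as '.'
print join("\t", 'Window', 'Mean'), "\n";
print join("\t", $_->[0], defined $_->[1] ? $_->[1] : '.'), "\n" for @summary;
```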
- write_gff()
-
This will write out the existing data in GFF format. A number of options may be passed to control the conversion.
- filename => $filename
-
Optionally pass the filename to save. A suitable default will be generated if not provided.
- version => <2|3>
-
Provide the GFF version. The default is version 3.
- chromo => $index
- start => $index
- stop => $index
- strand => $index
-
Provide the column indices for the appropriate columns. These should be automatically identified from the column header names. Indices are 0-based.
- score => $index
-
Provide the column index of the score column. This is not automatically determined.
- source => $index|$text
- type => $index|$text
- name => $index|$text
-
Provide either a column index (0-based) or a text name to be used for all the features. Integers between 0 and the rightmost column index are presumed to be an index; everything else is taken as text.
- tag => \@indices
-
Provide an array reference of column indices to be used for GFF tags.
- id => $index
-
Provide a column index of unique values to be used for GFF3 ID tag.
- midpoint => <boolean>
-
Flag to use the midpoint instead of actual start and stop coordinates.
AUTHOR
Timothy J. Parnell, PhD
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112
This package is free software; you can redistribute it and/or modify it under the terms of the GPL (either version 1, or at your option, any later version) or the Artistic License 2.0.