Genomes and Genes

Loading Feature Data

Now that we have our organism and whole genome analysis ready, we can begin loading genomic data. For this tutorial only a single gene from sweet orange will be loaded into the database, so that we can move through the tutorial quickly. The following datasets will be used for this tutorial:

  • Citrus_sinensis-orange1.1g015632m.g.gff3
  • Citrus_sinensis-scaffold00001.fasta
  • Citrus_sinensis-orange1.1g015632m.g.fasta

Download these files to the /var/www/html/sites/default/files directory. The quickest method is to copy the address of each file and use wget to retrieve it:

  cd /var/www/html/sites/default/files
  wget http://www.gmod.org/mediawiki/images/d/dc/Citrus_sinensis-orange1.1g015632m.g.gff3
  wget http://www.gmod.org/mediawiki/images/8/87/Citrus_sinensis-scaffold00001.fasta
  wget http://www.gmod.org/mediawiki/images/9/90/Citrus_sinensis-orange1.1g015632m.g.fasta

 

Loading a GFF3 File

The gene features (e.g. gene, mRNA, 5_prime_UTR, CDS, 3_prime_UTR) are stored in the GFF3 file downloaded in the previous step. We will load this GFF3 file and thereby load our gene features into the database. Navigate to Tripal → Chado Data Loaders → GFF3 file loader.

Tripal2.0 gff3 import.png

Perform the following:

  1. Enter the path on the file system where our GFF3 file resides (/var/www/html/sites/default/files/Citrus_sinensis-orange1.1g015632m.g.gff3).
  2. Choose the organism to which the GFF3 file belongs (in this case Citrus sinensis (sweet orange)).
  3. Select the analysis named "Whole Genome Assembly and Annotation of Citrus sinensis...".
  4. Leave all other options as default.

Finally, click the Import GFF3 file button. You'll notice a job was submitted to the jobs subsystem. Now, to complete the process we need the job to run. We'll do this manually:

cd /var/www/html;
drush trp-run-jobs --user=administrator

You should see output similar to the following:

Tripal Job Launcher
Running as user 'administrator'
-------------------
Calling: tripal_feature_load_gff3(/var/www/html/sites/default/files/Citrus_sinensis-orange1.1g015632m.g.gff3, 13, 10, 0, 1, 0, 0, 1, , , 0, , , , 0, 8)

NOTE: Loading of this GFF file is performed using a database transaction. 
If the load fails or is terminated prematurely then the entire set of 
insertions/updates is rolled back and will not be found in the database

Opening /var/www/html/sites/default/files/Citrus_sinensis-orange1.1g015632m.g.gff3
Parsing Line 138 (100.00%). Memory: 25,873,800 bytes.
Setting ranks of children...
Setting 10 of 10 (100.00%). Memory: 25,901,976 bytes.
Done

Note: For very large GFF files the loader can take quite a while to complete.

Loading FASTA files

Using the Tripal GFF loader we were able to populate the database with the genomic features for our organism. However, those features now need nucleotide sequence data. To provide it, we will load the nucleotide sequences for the mRNA features and the scaffold sequence. Navigate to the Tripal → Chado Data Loaders → FASTA file loader page.

Tripal2.0 fasta loader.png

Before loading the FASTA file we must first know the Sequence Ontology (SO) term that describes the sequences we are about to upload. We can find the appropriate SO terms from our GFF file. In the GFF file we see the SO terms that correspond to our FASTA files are 'scaffold' and 'mRNA'.
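One way to see which SO terms a GFF3 file uses is to tally its third column. The following is a minimal sketch rather than part of Tripal: the function name gff3_type_counts is made up for illustration, and the gene line in the stand-in input is hypothetical (only the mRNA line is taken from this tutorial).

```shell
# Count how often each feature type (GFF3 column 3) occurs.
# Reads the named file(s), or standard input when no file is given.
gff3_type_counts() {
    awk -F'\t' '
        /^##FASTA/     { exit }   # an embedded FASTA section ends the feature lines
        /^#/ || NF < 9 { next }   # skip comments, directives and malformed lines
        { count[$3]++ }
        END { for (t in count) print count[t] "\t" t }
    ' "$@" | sort -k2
}

# Stand-in input: the mRNA line is from this tutorial, the gene line is hypothetical.
{
    printf '##gff-version 3\n'
    printf 'scaffold00001\tphytozome6\tgene\t4058460\t4062210\t.\t+\t.\tID=orange1.1g015632m.g;Name=orange1.1g015632m.g\n'
    printf 'scaffold00001\tphytozome6\tmRNA\t4058460\t4062210\t.\t+\t.\tID=PAC:18136217;Name=orange1.1g015632m;PACid=18136217;Parent=orange1.1g015632m.g\n'
} | gff3_type_counts
```

To inspect the tutorial file, pass its path as the argument, for example gff3_type_counts /var/www/html/sites/default/files/Citrus_sinensis-orange1.1g015632m.g.gff3.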

IMPORTANT: Before importing, ensure that the FASTA loader will be able to match each sequence in the FASTA file with an existing feature in the database. Take special care that the definition line of each FASTA record uniquely identifies the feature for the specific organism and sequence type. For example, in our GFF file an mRNA feature appears as follows:

scaffold00001   phytozome6      mRNA    4058460 4062210 .       +       .       ID=PAC:18136217;Name=orange1.1g015632m;PACid=18136217;Parent=orange1.1g015632m.g

Note that for this mRNA feature the ID is PAC:18136217 and the name is orange1.1g015632m. In Chado, features always have a human-readable name, which does not need to be unique, and also a unique name, which must be unique for the organism and SO type. When the GFF file is loaded, the ID becomes the unique name and the Name becomes the human-readable name.
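The ID-to-unique-name and Name-to-name mapping can be previewed directly from the GFF3 attributes column. This is an illustrative sketch only: gff3_names is a hypothetical helper, not a Tripal command, and it simply prints what the mapping described above would produce.

```shell
# Print "type<TAB>uniquename<TAB>name" for each feature line,
# taking the unique name from ID= and the name from Name= in column 9.
gff3_names() {
    awk -F'\t' '
        /^#/ || NF < 9 { next }
        {
            id = ""; name = ""
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                if (attrs[i] ~ /^ID=/)   id   = substr(attrs[i], 4)
                if (attrs[i] ~ /^Name=/) name = substr(attrs[i], 6)
            }
            print $3 "\t" id "\t" name
        }
    ' "$@"
}

# The mRNA line shown above.
printf 'scaffold00001\tphytozome6\tmRNA\t4058460\t4062210\t.\t+\t.\tID=PAC:18136217;Name=orange1.1g015632m;PACid=18136217;Parent=orange1.1g015632m.g\n' | gff3_names
```

For the line above this prints the type mRNA, the unique name PAC:18136217 and the name orange1.1g015632m, separated by tabs.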

In our FASTA file the definition line for this mRNA is:

>orange1.1g015632m PAC:18136217 (mRNA) Citrus sinensis

By default Tripal will match the sequence in a FASTA file with the feature that matches the first word in the definition line. In this case the first word is orange1.1g015632m. As defined in the GFF file, the name and unique name are different for this mRNA. However, we can see that the first word in the definition line of the FASTA file is the name and the second is the unique name. Therefore, when we load the FASTA file we should specify that we are matching by the name because it appears first in the definition line.
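Before loading, it can help to list the first word of every definition line and confirm that none is repeated. The sketch below assumes a plain FASTA file; both function names are hypothetical, and the sequence ATGC is a placeholder rather than real data.

```shell
# Print the first word of every FASTA definition line, without the leading ">".
fasta_first_words() {
    awk '/^>/ { print substr($1, 2) }' "$@"
}

# Print any first word that occurs more than once. No output means every
# record can be identified unambiguously by its first word.
fasta_duplicate_names() {
    fasta_first_words "$@" | sort | uniq -d
}

# The definition line from this tutorial, followed by a placeholder sequence.
printf '>orange1.1g015632m PAC:18136217 (mRNA) Citrus sinensis\nATGC\n' | fasta_first_words
```

Running fasta_duplicate_names on the mRNA FASTA file and getting no output suggests that matching by the first word is safe for that file.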

If, however, we cannot guarantee that the feature name is unique, then we can use a regular expression in the Advanced Options to tell Tripal where to find the name or unique name in the definition line of the FASTA file.
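A candidate regular expression can be tried on the command line before it is entered in the form. The sketch below uses sed with extended regular expressions to capture the second word of each definition line, which in this file is the unique name. It is only an approximation: the exact regular expression syntax the loader accepts, and whether the pattern should include the leading ">", are not specified here, and fasta_second_words is a hypothetical helper.

```shell
# Capture the second whitespace-separated word of each definition line.
fasta_second_words() {
    sed -nE 's/^>[^[:space:]]+[[:space:]]+([^[:space:]]+).*/\1/p' "$@"
}

# The definition line from this tutorial, followed by a placeholder sequence.
printf '>orange1.1g015632m PAC:18136217 (mRNA) Citrus sinensis\nATGC\n' | fasta_second_words
```

For the tutorial's definition line this prints PAC:18136217.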

IMPORTANT: When loading FASTA files to update existing features, always choose "Update only" as the import method. Otherwise, Tripal may add the features in the FASTA file as new features if it cannot properly match them to existing features.

Now, enter the following values in the fields on the web form:

  • FASTA file: /var/www/html/sites/default/files/Citrus_sinensis-scaffold00001.fasta
  • Organism: Citrus sinensis (Sweet orange)
  • Sequence type: supercontig (scaffold is an alias for supercontig in the sequence ontology)
  • Method: Update only (we do not want to insert these as they should already be there)
  • Name Match Type: Name
  • Analysis: Whole Genome Assembly and Annotation of Citrus sinensis....

Click the Import FASTA file button, and a job will be added to the jobs system. Run the job:

cd /var/www/html
drush trp-run-jobs --user=administrator

Next, do the same for the mRNA FASTA file:

  • FASTA file: /var/www/html/sites/default/files/Citrus_sinensis-orange1.1g015632m.g.fasta
  • Organism: Citrus sinensis (Sweet orange)
  • Sequence type: mRNA
  • Method: Update only
  • Name Match Type: Name
  • Analysis: Whole Genome Assembly and Annotation of Citrus sinensis....

Now run this job:

cd /var/www/html
drush trp-run-jobs --user=administrator

Now the scaffold sequence and mRNA sequences are loaded.

Note: It is not strictly necessary to load the mRNA sequences, as they can be derived from their alignments with the scaffold sequence. However, the Chado feature table has a 'residues' column for storing a feature's sequence, so it is best practice to load the sequence when possible.

The FASTA loader has some advanced options which we will not use in this tutorial. But briefly, the advanced options allow you to create relationships between features and associate them with external databases. For example, the definition line for an mRNA is:

>orange1.1g015632m PAC:18136217 (mRNA) Citrus sinensis

Here we have more information than just the feature name. We have a unique Phytozome accession number (the PAC ID) for the mRNA. Using the External Database Reference section under Advanced Options we can provide the name of the database and a regular expression to tell the loader how to find the accession number in the definition line.

If the name of the gene to which this mRNA belongs were also on the definition line, we could use the Relationships section to link this mRNA with its parent gene. Fortunately, this information is also in our GFF file and these relationships have already been made.
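As a rough illustration of the kind of pattern involved, the sketch below pulls the digits that follow "PAC:" out of each definition line. It assumes that the accession is the numeric part after the prefix, which matches the PACid attribute in the GFF file but is not confirmed as what the loader expects. fasta_pac_accessions is a hypothetical helper.

```shell
# Print the digits that follow "PAC:" in each definition line.
fasta_pac_accessions() {
    sed -nE 's/^>.*[[:space:]]PAC:([0-9]+)([[:space:]].*)?$/\1/p' "$@"
}

# The definition line from this tutorial, followed by a placeholder sequence.
printf '>orange1.1g015632m PAC:18136217 (mRNA) Citrus sinensis\nATGC\n' | fasta_pac_accessions
```

For the tutorial's definition line this prints 18136217.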

Creating Feature Pages

Now that we've loaded our feature data, we must "sync" them. Loading the GFF file in the previous step populated the feature tables of Chado for us, but Drupal does not yet know about these features. To sync features, navigate to Tripal → Chado Modules → Features → Sync.

Tripal2.0 feature sync.png

Here we can specify the types of features to sync and the organism. This allows us to create feature pages for different types of features for different organisms. In our case, we want gene and mRNA pages (these types were present in our GFF file). To only create pages for genes and mRNA we want to enter the sequence ontology terms gene and mRNA in the Feature Types box. Place each term on a separate line.

Next, select the organism "Citrus sinensis", and click the "Sync Features" button. A job is then added to the jobs management system which we need to manually run rather than wait on the cron entry to run it.

cd /var/www/html
drush trp-run-jobs --user=administrator

The following text will be seen on the command-line to indicate that the features were synced:

Sync'ing feature records.  Records matching these criteria will be synced:
  Type(s): mRNA, gene

10 feature records found.

NOTE: Syncing is performed using a database transaction.
If the sync fails or is terminated prematurely then the entire set of
synced items is rolled back and will not be found in the database

Syncing feature 10 of 10 (100.00%). Memory: 23,686,264 bytes.

Complete!

 

Note: It is not necessary to sync all types of features in the GFF file. For example, do not sync the scaffold. The feature is large and would have many relationships to other features. Only sync features that you will want users to view. For example, each mRNA is composed of several CDS features. These CDS features do not need their own page and therefore do not need to be synced.

Now we can view our gene and mRNA pages. Click the Find Content link to see the newly added features. Click the new page titled orange1.1g015615m, PAC:18136219 (mRNA) Citrus sinensis. Here we can see the gene feature we added and its corresponding mRNAs.

Tripal2.0 feature2.png

Feature Page Configuration

The feature configuration page allows us to make configuration changes that apply to the entire site. Navigate to the Tripal → Chado Modules → Features → Settings page.

Feature Titles

First on the configuration page is the Set Page Titles settings. In this section, there is a list of radio buttons with different styles of titles you can apply to features.  Possible feature titles include the feature name only, the uniquename only, the unique constraint or a custom title.   If you select the unique constraint you will always have a unique title for each page.  By default, the unique constraint is used.  However, you can customize the title by selecting the 'Custom' radio button.   In the Custom Page Title text field you can enter a string containing tokens that can be used to create custom titles.  The default string is:

[feature.name], [feature.uniquename] ([feature.type_id>cvterm.name]) [feature.organism_id>organism.genus] [feature.organism_id>organism.species]. 

The tokens in this custom title string are surrounded by square brackets (for example [feature.uniquename] and [feature.name] are tokens). These tokens are substituted by appropriate values related to the feature (e.g. the feature uniquename and name respectively). All other characters not part of a token are left as is. Thus, for feature orange1.1g015615m the title becomes: orange1.1g015615m, PAC:18136219 (mRNA) Citrus sinensis. You can customize the titles by using any combination of tokens.  The list of all available tokens can be found in the section titled Available Tokens.
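The substitution can be pictured with a small stand-alone sketch. It is not Tripal's implementation: render_tokens is a hypothetical function that replaces each bracketed token literally, and the token values are supplied by hand here rather than looked up in Chado.

```shell
# Replace each [token] in a template with its value.
# Usage: render_tokens TEMPLATE 'token=value' ...
# Values are passed through awk -v, so they must not contain backslashes.
render_tokens() {
    template=$1
    shift
    for pair in "$@"; do
        token="[${pair%%=*}]"      # text before the first "=" wrapped in brackets
        value=${pair#*=}           # text after the first "="
        template=$(awk -v s="$template" -v t="$token" -v v="$value" 'BEGIN {
            out = ""
            while ((i = index(s, t)) > 0) {      # literal match, not a regex
                out = out substr(s, 1, i - 1) v
                s = substr(s, i + length(t))
            }
            print out s
        }')
    done
    printf '%s\n' "$template"
}

render_tokens '[feature.name], [feature.uniquename] ([feature.type_id>cvterm.name]) [feature.organism_id>organism.genus] [feature.organism_id>organism.species]' \
    'feature.name=orange1.1g015615m' \
    'feature.uniquename=PAC:18136219' \
    'feature.type_id>cvterm.name=mRNA' \
    'feature.organism_id>organism.genus=Citrus' \
    'feature.organism_id>organism.species=sinensis'
```

With these values the call prints orange1.1g015615m, PAC:18136219 (mRNA) Citrus sinensis, the same title as in the example above.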

Feature URLs

Next on the feature settings page is the Set Page URL settings. Similar to titles, you can select from a list of available options for URLs. URLs must be unique, therefore there are only three possible options: feature_id, unique constraint and custom. If you select the feature_id option then the URL becomes:

http://localhost/feature/37

If you select the unique constraint, then the URL will contain the genus, species, feature type and feature unique name and becomes:

http://localhost/feature/Citrus/sinensis/mRNA/PAC%3A18136219
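Note that the colon in the unique name appears as %3A in the URL above. The sketch below reproduces that URL by percent-encoding each path component. It only imitates the pattern shown here and is not Tripal's own code: urlencode and feature_url are hypothetical helpers, and ASCII input is assumed.

```shell
# Percent-encode every character outside the RFC 3986 unreserved set.
urlencode() {
    printf '%s' "$1" | awk '
        BEGIN { for (i = 1; i < 128; i++) ord[sprintf("%c", i)] = i }
        {
            out = ""
            for (i = 1; i <= length($0); i++) {
                c = substr($0, i, 1)
                out = out (c ~ /[A-Za-z0-9._~-]/ ? c : sprintf("%%%02X", ord[c]))
            }
            printf "%s", out
        }'
}

# Build a unique-constraint style URL from genus, species, type and unique name.
feature_url() {
    printf 'http://localhost/feature/%s/%s/%s/%s\n' \
        "$(urlencode "$1")" "$(urlencode "$2")" "$(urlencode "$3")" "$(urlencode "$4")"
}

feature_url Citrus sinensis mRNA PAC:18136219
```

The call prints http://localhost/feature/Citrus/sinensis/mRNA/PAC%3A18136219.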

Just like feature titles, you can also create custom URLs by selecting the Custom radio button and providing a string containing tokens in the Custom Page URL text field. By default the custom string creates a URL identical to the unique constraint:

/feature/[feature.organism_id>organism.genus]/[feature.organism_id>organism.species]/[feature.type_id>cvterm.name]/[feature.uniquename]. 

You can change the tokens as desired. But, be certain to always create a URL that is guaranteed to be unique. The URL string provided by default will always be unique.

URLs must be unique, but unique URLs can be difficult to link to from other sites, GBrowse, JBrowse, Excel documents, etc.  Therefore, Tripal provides a convenient single URL for accessing any feature.  This URL is of the form:

feature/[feature]

where [feature] can be the name, uniquename or a synonym of the feature.  If two or more features have the same name then Tripal will present a table of matching features for the user to select from.

Feature Summary Report

Last on the configuration page is the Feature Summary Report setting. On the organism page, Tripal provides a table summarizing the features that belong to that organism. This summary requires that the organism_feature_counts materialized view is populated. Thus, after loading feature data this view should be populated as described in the Introduction to Materialized Views section. Once the organism_feature_counts materialized view is populated, the feature summary report on the Citrus sinensis page for the data we loaded will appear as in the screenshot below:

Tripal2.0 feature summary.png

For sites with only a single unigene (transcriptome analysis) or a single whole genome, this summary is appropriate. For sites with multiple analyses it may confuse site visitors, who will see multiple counts, and it should be hidden. Instructions for hiding content on a page are given in the Managing the Table of Contents (TOC) section.

On the feature settings page, you can specify which feature types should appear in the feature summary. You can also rename them to be more meaningful. For this tutorial, we want to provide a list of the total number of scaffolds, genes and mRNA. To do this, enter the following contents in the Map feature types box:

supercontig = Scaffolds
gene = Genes
mRNA
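The mapping above can be read as one feature type per line, with an optional display label after the equals sign. The sketch below shows one plausible interpretation, in which a line without a label (such as mRNA) is displayed under its own name. That reading is an assumption, and parse_type_map is a hypothetical helper rather than anything Tripal provides.

```shell
# Print "type<TAB>label" for each mapping line; a line without "=" uses
# the type itself as the label (an assumed interpretation).
parse_type_map() {
    awk '
        NF == 0 { next }
        {
            n = index($0, "=")
            so_type = (n > 0) ? substr($0, 1, n - 1) : $0
            label   = (n > 0) ? substr($0, n + 1)    : $0
            gsub(/^[ \t]+|[ \t]+$/, "", so_type)   # trim surrounding whitespace
            gsub(/^[ \t]+|[ \t]+$/, "", label)
            print so_type "\t" label
        }
    ' "$@"
}

printf 'supercontig = Scaffolds\ngene = Genes\nmRNA\n' | parse_type_map
```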

Click the Save configuration button at the bottom. Now the Feature Summary on the organism page appears as:

Tripal2.0 feature summary2.png