Tripal Developer's Meeting 2015-05-05

Meeting Date
Attendees

Ames Team (Ethy Cannon, Sudhansu Dash, Nathan Weeks)
Andrew Farmer, Sudhansu Dash (NCGR)
Gerard Lazo
Lacey Sanderson (U Saskatchewan)
Stephen Ficklin, Chun-Huai Cheng (WSU)
Valentin Guignon (Bioversity International, CGIAR)
Chris Childers, Monica Poelchau, Vijaya Tsavatapalli (NAL)
Meg Staton, Nathan Henry (U Tennessee)
Vivek, Irina (JCVI)

  1. Meeting notes are now archived on the Tripal.info site:  http://tripal.info/meetings/developers/summary

  2. Strategies for dealing with large numbers of features: memory and slowness issues. (EC)

    1. Related to Tripal issue https://www.drupal.org/node/2463211.

    2. Problematic pages:

      1. feature admin.

        1. Timeout in the first instance has been traced to very slow queries to count, retrieve, and sort all features.

        2. FIX:

          1. change the select boxes to text boxes in the view.

          2. disable the sorting will speed up the query.

          3. loose index scan is faster than a select DISTINCT. See notes posted by Nathan below for solution to code.

          4. The pagination does a query to get the count and second to populate the page.

          5. We can setup the view so we only display results when the user presses the submit/filter button

      2. organisms with many features

        1. Feature browser: not very useful (Ethy).  Maybe have some sort of example page to replace the feature page.  Perhaps a block.

        2. Feature summary.

      3. maps with many features.

        1. possible solution would be to limit the nested objects that are returned from chado_generate_var().

    3. Problematic processes:

      1. Sync-ing (it is possible to go directly to sync page, which by-passes the time-out on the feature list pages).

      2. Updating URLs (though much improved now that it no longer runs out of memory).

        1. Want to select to a subset of features (by type, organism, or simply a count).

      3. Removing orphaned features.

        1. Runs out of memory.

    4. Partition feature table some way? Or the queries? Alternative methods for counting records?

      1. Suggest using a bigserial for primary keys in Chado.

      2. Need to explore table partitioning in Chado.  Valentin will be potentially exploring this.  

    5. Multiple schema? (E.g. one per organism) But how to deal with foreign keys across schema? Could feature be replaced with a materialized view?

      1. For questions please contact Valentin for how this could be done.

    6. Statistics for the database:  Taein Lee is working on a module to show the number and amount of data types.  To provide information about structure in the database in relation to classes.

    7. Sync-ing a large number of features/stocks in a bulk fashion.

      1. memory problems on syncing stocks.

      2. drush command. **

      3. You can limit the number of stocks you want to sync at one time

  3. Search Engine for Chado: https://www.drupal.org/node/2214775

    1. feedback on the use of elasticsearch search API module (by Valentin)

      http://www.crop-diversity.org/mgis/accession-search

      1. The ElasticSearch Drupal module ( https://www.drupal.org/project/search_api_elasticsearch) is new and under development.

      2. More details: https://www.drupal.org/node/2214775#comment-9895567

    2. has anyone worked on incorporating ontologies into searches for tripal? (ADF).

      1. A project is in the works at Ames to take advantage of ontology hierarchy to search for data objects attached to ontology terms. Stephen is doing something similar with REST services. Will collaborate.

    3. Add to the Tripal User’s Guide a section on options for setup of Searching.

    4. USDA/ARS is working on integrating Solr into the Tripal site to index Drupal/Chado and other file types.  

      1. Two modules: Drupal Solr (well supported) and an additional module to add file support (not well supported)

  4. Mailing lists (gmod-tripal vs gmod-tripal-devel).

  5. Updates to Extension page on tripal.info.  Forthcoming interface on Tripal sites for easier installation.

  6. Issues we didn’t cover, but will be on agenda for next month’s meeting:

    1. Downloads: it would be nice to have downloads built in to Tripal, especially downloads of search results, configurable as to which fields from the record pages to include. In the meantime, what have other groups done? (EC) NB: this relates to an issue I raised for our site regarding the views-data-export module listed as a "Highly Recommended Module" in the README that comes with tripal (ADF)

    2. Permissions: having to manually reset permissions for modules on install seems to trip us up frequently during development/deployment. Is there a reason that modules should not have some default permission settings and if not, is there a way to effect this? (ADF)

 

Action Items

  1. Stephen will fix the slowness on the admin pages and memory issues for stable release v2.0

  2. Stephen will experiment with bigserial for pkeys in Chado.

 

Loose Index Scan Notes:

It turns out that the primary bottleneck in the 3rd query is the "SELECT DISTINCT":

SELECT distinct(type_id) FROM chado.feature

 

This is apparently implemented by doing a sequential scan on the chado.feature table:

drupal=> explain SELECT distinct(type_id) FROM chado.feature;

                              QUERY PLAN                               

------------------------------------------------------------------------

HashAggregate  (cost=176753.74..176753.85 rows=11 width=4)

  ->  Seq Scan on feature  (cost=0.00..165521.79 rows=4492779 width=4)

 

I stumbled upon the following optimization, called a "loose indexscan", which works well when there are many rows containing few distinct values:

https://wiki.postgresql.org/wiki/Loose_indexscan

Using that to implement a faster equivalent to the 3rd query results in a run time of ~5 milliseconds (vs over 13 _seconds_):

LOG:  statement: WITH RECURSIVE t AS (

         SELECT MIN(type_id) AS type_id FROM chado.feature

         UNION ALL

         SELECT (SELECT MIN(type_id) FROM chado.feature WHERE type_id > t.type_id)

         FROM t WHERE t.type_id IS NOT NULL

       )

       SELECT cvterm_id, name FROM chado.cvterm WHERE cvterm_id IN

         (SELECT type_id FROM t WHERE type_id IS NOT NULL) ORDER BY cvterm.name ASC;

LOG:  duration: 4.762 ms

 

I think we should update tripal_views_handler_filter_select_cvterm.inc to use the new query.

--- a/tripal_views/views/handlers/tripal_views_handler_filter_select_cvterm.inc

+++ b/tripal_views/views/handlers/tripal_views_handler_filter_select_cvterm.inc

@@ -67,7 +67,7 @@ class tripal_views_handler_filter_select_cvterm extends tripal_views_handler_fil

        $where = ' AND ' . implode(' AND ', $where_clauses);

      }



-      $sql = "SELECT cvterm_id, name FROM {cvterm} WHERE cvterm_id IN (SELECT distinct(" . $this->field . ") FROM {" . $this->table . "}) " . $where . ' ORDER BY cvterm.name ASC';

+      $sql = "WITH RECURSIVE t AS (SELECT MIN(" . $this->field . ") AS col FROM {" . $this->table . "} " . ($where == '' ? '' : "WHERE " . $where) . " UNION ALL SELECT (SELECT MIN(" . $this->field . ") FROM {" . $this->table . "} WHERE " . $this->field . " > col " . $where . ") FROM t WHERE col IS NOT NULL) SELECT cvterm_id, name FROM {cvterm} WHERE cvterm_id IN (SELECT col FROM t where col IS NOT NULL) ORDER BY cvterm.name ASC";

      $resource = chado_query($sql);

      $cvterms = array();

Meeting Type