User Guide

Overview

CrustyBase is a repository and analysis suite for crustacean transcriptome data. CrustyBase hosts two separate kinds of dataset: community datasets and CrusTome datasets. The table below summarises the differences.

You can submit feedback about additional tools that you would like implemented.

	Community datasets	CrusTome datasets
Data import	By users	Curated
Expression data	Yes	No
Purpose	Various	Annotation, phylogeny
Data access	Access levels can be set by user that is uploading	All datasets are fully accessible
Relevant tools	BLAST, Extract, Data browser, Domain search	CrusTome BLAST

Community datasets

Community datasets consist of assembled transcriptomes and gene expression data for a particular species, across a set of samples.

There are currently 38 community datasets, but anyone can add to this in the future by uploading their own. Users can make use of Groups to control access to these datasets.

You can access these datasets through the following tools: BLAST, Extract, Data browser, Domain search

CrusTome datasets

CrusTome is a curated database of 201 assembled transcriptomes from a taxonomically comprehensive range of Pancrustacean species.

The transcriptome assemblies in this database were produced using a consistent methodology and processed to remove microbial contamination and redundancy. It provides datasets for sequence similarity searches, orthology assignments, phylogenetic inference, etc.

You can access these datasets through the CrusTome BLAST tool.

Datasets

Datasets have been structured to provide as much information as possible for each transcriptome.

Metadata

Each dataset is described by a structured array of metadata.

For community datasets, this will include information like taxonomy, experiment description and assembly procedure. It can also include a reference to a publication so that the dataset author can be easily cited, which we strongly encourage if you find a dataset useful. You can view the metadata for a community dataset on its profile page in the data browser.

For CrusTome datasets, metadata includes fields such as taxonomy, sampling condition, keywords, and NCBI SRA/TSA accession.

Transcript data

Each dataset has a number of data types corresponding to each transcript. Some of these are rendered in a graphical display as you browse through transcripts in a BLAST result, and all of them are available to download in some form.

Data types relevant to transcripts from community datasets are summarised in the table below.

Data type	Description	Rendered in output	Available for download
Nucleotide sequence	cDNA sequence of the transcript. This sequence is the origin of all other data types.	Only as alignment	Yes, full access required
CDS sequence	Coding DNA sequence predicted by TransDecoder.	No	Yes, full access required
Peptide sequence	Protein sequence predicted by TransDecoder.	Only as alignment	Yes, full access required
Expression data	Mean expression level across experiment features	Yes	Yes, full access required for raw data
Conserved domains	Conserved protein domains predicted by CD-search	Yes	Yes, graphics only

Data types relevant to transcripts from CrusTome datasets are summarised in the table below.

Data type	Description	Rendered in output	Available for download
mRNA sequence	mRNA sequence of the transcript. This sequence is the origin of all other data types.	Yes	Yes
Amino acid sequence	mRNA sequences were translated to amino acid sequences.	Yes	Yes
Conserved domains	Conserved protein domains predicted by CD-search	Yes	Yes, graphics only

Data browser

While the number of transcriptomes is small, navigating datasets remains quite simple. But once the database grows, finding a relevant dataset among thousands becomes challenging. The browser tool was designed to efficiently find the community datasets that are of most interest to you.

You can search for a particular species, or try any other keyword that is relevant to you. "Molt", "disease", "virus", "immune" and "brain" should all return related datasets. You can also search by taxonomy ("portunidae"). Each dataset has a dedicated page describing the species, data and experimental conditions.

BLAST search

BLAST overview

The BLAST tool is a long-time staple of the bioinformatics toolbelt. Submit a DNA or protein sequence that you're interested in and it will give you back the best-matching sequences in the transcriptomes that you have selected. Learn more about the NCBI's BLAST tool here.

In CrustyBase, the BLAST tool is for searching through the community datasets and allows users to view the expression of the transcripts found. On the other hand, the CrusTome BLAST tool is for searching through the CrusTome datasets and provides users the option of analyzing transcripts in Galaxy for alignment and phylogeny. The user interface is similar for both tools, so these instructions apply to both.

Search for a sequence

The BLAST submission form is made up of three simple components:

Query sequence input

To conduct a BLAST search you must first find a DNA or protein sequence of a gene that you're interested in. The "sample sequence" button can give you an example if you're just trying it out. Sequences can be obtained from the NCBI website or sometimes directly from published research articles. Copy and paste your sequence into this box to begin. It is worth remembering that protein sequences are usually more conserved than DNA between distant taxa, and are therefore more likely to match your target gene.

The BLAST tool accepts one query sequence only. For CrusTome BLAST, you can input multiple query sequences.

Database list

Select the transcriptome dataset you wish to search from this list. There may be many databases, but only 10 are shown at a time. To find the database you're looking for, type some keywords into the input field above to filter the database list. Try searching with species or family names, tissues ("brain" or "gill") or other biological keywords such as "immune", "environment" or "reproduction".

If you want to search and view the community datasets in more detail, switch to the data browser. The access level of each database is shown on the right as a green (full access) or orange (restricted access) light.

You can add one or many datasets to the "selected" pane before running the search, but more datasets will of course increase the run time.

Search algorithm

The search algorithm that you choose depends on the query sequence that you entered above:

BLASTN searches nucleotide transcriptomes with a DNA sequence
tBLASTN takes a protein sequence and searches nucleotide sequences that are translated on-the-fly. This tends to be more reliable when searching across genera and beyond, as protein sequences are almost always more conserved.
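
As a rough illustration of the choice above, the sketch below guesses whether a query looks like DNA or protein and suggests the matching algorithm. The function name `guess_blast_program` and the 90% threshold are assumptions made for this example. They are not part of CrustyBase, and the heuristic can misjudge short or unusual sequences.

```python
# Hypothetical helper, not part of CrustyBase: suggest BLASTN or tBLASTN
# from the composition of the query sequence.
NUCLEOTIDE_CHARS = set("ACGTUN")


def guess_blast_program(query: str, threshold: float = 0.9) -> str:
    """Return 'blastn' for a DNA-like query and 'tblastn' for a protein-like one."""
    # Drop blank lines and any FASTA title line, then join the residues.
    lines = [line.strip() for line in query.splitlines()]
    lines = [line for line in lines if line and not line.startswith(">")]
    residues = "".join(lines).upper()
    if not residues:
        raise ValueError("query sequence is empty")
    # Fraction of characters that could be nucleotides (assumed cut-off: 90%).
    nucleotide_fraction = sum(ch in NUCLEOTIDE_CHARS for ch in residues) / len(residues)
    return "blastn" if nucleotide_fraction >= threshold else "tblastn"


if __name__ == "__main__":
    print(guess_blast_program(">example\nATGCGTACGTTAGC"))
    print(guess_blast_program("MKTLLVLAVCLAAASA"))
```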

Viewing the result

Datasets with transcripts matching your query are stacked up the page.
Each dataset in the stack features a species image, experiment description and table summarising the BLAST hits.
Select transcripts in the table with the checkbox, and download data relating to them.
Click "expand" to open up a full-screen view of a dataset's results
Click or use arrow keys to visualize transcripts in three corresponding panes:

The alignment pane

The conventional alignment output from the BLAST tool is displayed in the first viewing pane. This shows exactly which residues match between the query sequence and the selected transcript. Match statistics are shown above the alignment. One match may return a number of HSPs (high-scoring pairs).

HSPs are sections of matching sequence. When searching a transcriptome with protein or cDNA sequence you would normally expect to find one HSP. Two or more indicates large insertions or deletions between the query and subject sequence, probably due to differential mRNA splicing or perhaps a sequencing error.

The expression pane

This pane is only shown in the BLAST tool, not the CrusTome BLAST tool.

The expression pane shows the expression profile of the selected transcript. Move the cursor over markers on the chart to see mean and standard error of the data. If you aren't sure what the x-axis labels mean, try hovering over the info symbol on the right to see the experiment description. You can also click-and-drag on the y-axis to increase or decrease scale. This can be useful for zooming in on low-expressed samples.
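
For readers who want to reproduce the hover values from downloaded raw data, here is a minimal sketch of the mean and standard error for one transcript in one sample group. It assumes the usual definition of standard error (sample standard deviation divided by the square root of the number of replicates). This guide does not state which formula CrustyBase applies, and the function name is hypothetical.

```python
import math
import statistics
from typing import Optional, Sequence, Tuple


def mean_and_standard_error(replicates: Sequence[float]) -> Tuple[float, Optional[float]]:
    """Return (mean, standard error) for the replicate values of one feature."""
    n = len(replicates)
    if n == 0:
        raise ValueError("at least one replicate value is required")
    mean = statistics.fmean(replicates)
    if n == 1:
        # The standard error is undefined for a single replicate.
        return mean, None
    # Assumed definition: sample standard deviation / sqrt(n).
    return mean, statistics.stdev(replicates) / math.sqrt(n)


if __name__ == "__main__":
    # Hypothetical expression values for three replicates of one tissue.
    print(mean_and_standard_error([2.0, 4.0, 6.0]))
```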

You may notice that some datasets are displayed with bar charts and others with line graphs. This depends on whether the expression data describes a categorical variable (tissues, treatment etc.) or continuous variable (days, temperature etc.).

The protein structure pane

The structure pane shows the protein length predicted by TransDecoder, with conserved domains (predicted by the NCBI's CD-search) plotted along its length. Domains are linked to their descriptions in NCBI, PFAM, TIGRFAM and other databases, which can be followed by clicking on them. Not all transcripts will have a predicted protein, and not all predicted proteins will have predicted domains. However, if you know what structure to expect (like the nuclear receptor DBD and LBD in the protein on the right) it can be a good indication of a correct and complete sequence.

Downloading sequence data

Plots and sequences from the results page can be downloaded in bulk.

Select transcripts of interest in the match list

As you browse matching transcripts, you can select any that seem interesting with the checkboxes on the right of the match list.

Once you're happy you've got all the interesting matches, click the download button above to open the download dialog. The download button from the BLAST app is shown in the top image.

If you're using the CrusTome BLAST tool, selecting transcripts using the checkbox will add them to cart (bottom image). Click on the cart to see the "Download All" button.

The download dialog

Simply select the formats that you wish to download and click the download button. For community datasets, the available formats depend on whether the owner of the dataset allows full or only partial access.

It may take up to several minutes to render the requested files if many formats and transcripts are selected.

You can optionally add a file prefix. This will be used to name the downloaded file, so you can remember the origin of these files later on. For example, entering "myresult" will lead to downloading a file called "myresult.zip".

Saving results

If you are a registered user, you have the option to save results that might be useful in the future. When logged in, you should see a "save" button in the top-right. Click save, enter a useful identifier (so you can remember what it is) and then click save or hit the enter key.

You can then return to this result at any time through the saved results page of your user profile. CrustyBase will store a maximum of 200 saved results. When this limit is reached, further saves will begin overwriting the oldest save.

You can also view and revisit all results from the past 7 days in the "BLAST history" panel at the bottom of your dashboard. Please note that this history does not apply to CrusTome BLAST results.

CrusTome BLAST

CrusTome BLAST overview

The CrusTome BLAST tool allows users to search for their query sequences in CrusTome datasets. The search, view results, save results, and download functionality is similar to the BLAST tool. Please see the BLAST search section of the guide for more details about these steps. The "Analyze in Galaxy" step is unique to the CrusTome BLAST tool and is explained more below.

Analyze in Galaxy

You can use the "Analyze in Galaxy" option if you would like to perform alignment and phylogeny in Galaxy using the sequences that you have selected from CrusTome.

View the cart

From the results page, click on the cart icon to view all the transcripts in the cart. This icon is only visible if you have selected one or more transcripts using the checkbox. From the cart, click on the "Analyze in Galaxy" button to proceed to the instructions for utilizing Galaxy.

Prepare input files

Three FASTA files are needed to run our Galaxy workflow: amino acid sequences from CrusTome, amino acid sequences from your query sequences, and outgroup amino acid sequence/s.

The first two inputs can be obtained by clicking "Download CrusTome Files" and unzipping the downloaded folder. The outgroup sequence/s, which will be used for rooting the phylogenetic tree, can be obtained from the NCBI website. Please create a FASTA file with sequences suitable for your analysis. This can be done by using a text editor.

You can choose outgroup sequences based on established evolutionary relationships from the literature. Outgroup sequences should be related enough to your ingroup (the group of genes and species that you're studying) that you can confidently align the sequences, but distant enough that they clearly branch off before the diversification of your main group of interest. For example, if you are studying genes in Malacostraca, sequences from Hexapods may serve as an outgroup.

Instructions for utilizing Galaxy workflow

Go to Galaxy

Go to Galaxy Australia and login. Currently, our workflow is only available on this server, but we aim to make it available on other servers in the future.

Click on the plus icon near the top right corner to create a new history.

Upload input files to Galaxy

Click on the "Upload" button on the top left corner to bring up the upload dialog. Drag and drop your FASTA files there and click "Start" to upload them to your new history. Click "Close" to close this dialog.

Import workflow to Galaxy

Click on "Import Phylogeny Workflow" and then click on "Version 1" to import the workflow.

Next, click on the "Run workflow" icon on the bottom right.

Run workflow

Select your input FASTA files and then click the "Run Workflow" button on the top right to run.

The following steps will be performed by the workflow:

All FASTA files will be merged into a single FASTA file. Only unique sequences, determined using accession and sequence, will be kept.
A multiple sequence alignment will be produced using MAFFT.
The alignment will be trimmed using ClipKIT.
A phylogenetic tree will be reconstructed using IQTREE.
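
The first step above (merging and de-duplication) can be sketched locally if you want to preview what the workflow will keep. This is only an illustration and not the code that the Galaxy workflow runs. It assumes that the accession is the first whitespace-delimited token of each title line and that sequences are compared case-insensitively. The function names are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (title without ">", sequence)


def parse_fasta(text: str) -> Iterator[Record]:
    """Yield (title, sequence) pairs from FASTA-formatted text."""
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


def merge_unique(fasta_texts: Iterable[str]) -> List[Record]:
    """Merge several FASTA texts, keeping the first copy of each (accession, sequence)."""
    seen = set()
    merged: List[Record] = []
    for text in fasta_texts:
        for title, sequence in parse_fasta(text):
            tokens = title.split()
            accession = tokens[0] if tokens else ""  # assumed: first token is the accession
            key = (accession, sequence.upper())
            if key not in seen:
                seen.add(key)
                merged.append((title, sequence))
    return merged


def to_fasta(records: Iterable[Record], width: int = 60) -> str:
    """Format records as FASTA text with sequence lines wrapped at `width`."""
    lines: List[str] = []
    for title, sequence in records:
        lines.append(">" + title)
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"
```

To use it, read the CrusTome, query and outgroup files into strings, pass them to `merge_unique` in a list, and write the result of `to_fasta` to a new file.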

Examine outputs

The outputs will be available in your history. Notice the icons under each item in the history. You can click on the "i" icon (third icon) to view details from the tool execution. You can use this to examine the standard error and output. For MAFFT, this can be helpful for seeing which strategy was selected by the tool for performing alignment.

We recommend examining the alignment files produced by MAFFT and ClipKIT. You can do this in Galaxy by clicking on the visualize icon which is represented with a bar chart (fifth icon). Alternatively, you can download the file using the first icon and then open it with a tool such as Jalview.

Similarly, the maximum likelihood tree produced by IQTREE can be visualised in Galaxy. Alternatively, you can use a tool such as iTOL.

After examining the outputs, you may wish to remove some outliers or duplicate sequences and rerun the workflow.

User accounts

You don't need to be registered to use CrustyBase, but it does come with benefits!

When you create an account with CrustyBase, you will be able to save BLAST results and view your search history. You can also create groups and upload your own transcriptome datasets. It's free to register and always will be.

Account overview

The dashboard gives you a brief overview of your account. This page allows you to edit your personal details, view your groups and datasets, view recent search history and delete your account (why would you do that!?).

When you are logged in, the dashboard and other pages related to your account can be found by clicking on the login prompt in the top-right of every page, as shown on the right.

Groups

Groups control access to transcriptome datasets.
There are two situations when you might need to make use of groups:

You are going to import a dataset
You want to get access to a colleague's datasets

You can manage groups through your user profile.

Data access

The purpose of groups is to control data access. Not all datasets are fully accessible to the public, but the members of a group always have full access to that group's data. Groups are designed to reflect data ownership in the real world - datasets are typically owned by a research group, not by a single person in that group.

You can still view and search restricted datasets, but you cannot access or download raw sequence or expression data. The purpose of this is to encourage sharing of datasets which are restricted by intellectual property rights, since graphical results are usually insufficient for published research. If you find something of interest in these datasets, we encourage users to look up the owner of the dataset and seek collaboration. You can find out who uploaded the dataset by checking the dataset's profile in the data browser.

Create a group

The only time you might need to create a group is when you are going to import a dataset. You can also consider joining a colleague's group and importing the data to there, if you wish to share access. However, if you are going to import publicly available data, you can simply opt to import the dataset into the Public Domain instead.

You can create new groups in the group management page, which is accessible through the login prompt in the top-right. When creating a group, think of a name that is descriptive and unique. For example, Albert Einstein's research group at ETH Zurich might be called "Einstein ETHZ".

Join a group

If a colleague has a research group that you wish to join, they will need to send you an invite. This can be done easily by visiting the group management page.

Simply select the appropriate group and hit the "invite" button. Enter the email address of the person you wish to invite and they will be sent a link to join the group.

Check the email address carefully - once someone joins there is no (easy) way to remove them from the group.

Leave a group

There are situations where you may want to leave a group - perhaps the group is redundant or you are leaving an institution. Simply select the group on the group management page and select "leave group" at the bottom of the page.

Consider this carefully though. Once you leave you'll lose access to the group's datasets and, if you were the last member in the group, the group and all its data will be deleted. This is the only way to delete a group.

Import data

Why import data?

There are a number of reasons to import data into CrustyBase. The most obvious reason to contribute your data is because it is the right thing to do! We are all in the business of advancing global knowledge, and the process is far more efficient when we work together.

Aside from that, there are several benefits of having your data in the hands of CrustyBase:

CrustyBase is a free service for navigating your transcriptome data.
You and your research group can access your data wherever you are.
Increase your exposure. Make collaborations, get citations.

Prepare an import

You can find the data import app in the data tab in the navigation bar.
There are three phases to the data import process:

Metadata input (dataset information and access level)
File upload & validation
Review & submit

Public access level

Full access allows any user of CrustyBase to download raw sequence and expression data from any transcripts that they find. This data might be sufficient to publish findings made using CrustyBase.

Partial access restricts the data types that a public user is able to download, allowing only graphical content to be seen. Given that publication typically requires reporting of original data, users who wish to make such use out of these datasets would be expected to contact the dataset owner to seek collaboration.

All members of the group owning the dataset have full access, including permission to edit and delete the dataset.

File upload

Upload files must be correctly formatted for the server to parse them correctly. If there is a problem validating the data, you will be given a useful message to help fix the problem. It should be possible to make any formatting changes with conventional spreadsheet and text-editor software.

There are two files that need to be uploaded:

Transcriptome assembly in FASTA format
Expression data in CSV format

These files are limited to 1000MB each.

Assembly file

The assembly file should be a FASTA-formatted sequence file.
Each sequence should start with a title line that begins with an angle bracket ">" and ends with a new line. The sequence that follows should contain only the characters "ATGCN" and new lines. The title can be no longer than 25 characters for any given sequence - longer titles are unnecessary and difficult to display on CrustyBase.

In the FASTA file on the right you can see that the first sequence title (green) is an appropriate length and format, whereas the second sequence title (red) has been appended with surplus information by the assembly software. This additional information needs to be removed.

If you are unable to reformat the sequence titles, try uploading the file with the "Reformat contig IDs" option enabled. Please note that you will have no way to link CrustyBase results to your local dataset if you choose this method, as the contig IDs will no longer match.

>c10000_c0_s1
CGACACCCAGAAGGGCCTGCAGCACGCCATGATGCAGATGAACGGCCCGATGATGGAAGG
ACGTCGCCTGGATCTGCGCGATGATCCCGCATCACATGGGCGCCATCGCCATGGCCCAGG
AATCCCGAGGCTAAGAAGATCGCCGAGAAGAGCATCCAGGAACAGGAGAAGAGCATCAAG

>comp3_c0_seq1 len=319 path=[12086:0-74 3106:75-318]
ATCTGTTTCTCCTTTTCATATTTTTCTTTTCTTTTGTTCCCTGTGTTCCACTTCTCTGTC
CTTTCACTTCCCTCCTTCTCTTCCTTCTGTTTATTTGCTTCTTCTTCAGTATCCTTTCTT
CCTCTTTTCCTTTTCCCTTCGCATCTTTCTCTCCTTCTTCTTTCTCTTCATCTATTCCTT
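
If you would like to check an assembly file before uploading, the rules above can be approximated with a short script. This is a sketch and not the validator that CrustyBase runs. It assumes that the 25-character limit excludes the leading ">", that only upper-case "ATGCN" is accepted, and that surplus title information is separated from the contig ID by whitespace. The function names are hypothetical.

```python
import re
from typing import List

MAX_TITLE_LENGTH = 25                      # from the guide; assumed to exclude ">"
ALLOWED_BASES = re.compile(r"^[ATGCN]*$")  # assumed: upper case only


def check_assembly(text: str) -> List[str]:
    """Return a list of human-readable problems found in FASTA text."""
    problems: List[str] = []
    title = None
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.rstrip()
        if not line:
            continue
        if line.startswith(">"):
            title = line[1:]
            if not title:
                problems.append(f"line {number}: empty sequence title")
            elif len(title) > MAX_TITLE_LENGTH:
                problems.append(f"line {number}: title longer than {MAX_TITLE_LENGTH} characters")
        elif title is None:
            problems.append(f"line {number}: sequence data before any title line")
        elif not ALLOWED_BASES.match(line):
            problems.append(f"line {number}: characters other than ATGCN")
    return problems


def trim_title(title: str) -> str:
    """Keep only the first whitespace-delimited token of a title."""
    tokens = title.split()
    return tokens[0] if tokens else ""


if __name__ == "__main__":
    print(trim_title("comp3_c0_seq1 len=319 path=[12086:0-74 3106:75-318]"))
```

Unlike the "Reformat contig IDs" option, trimming titles this way keeps the contig ID itself. If you trim titles, the IDs in the expression file must use the same trimmed form.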

Expression file

The expression quantitation file should be a plain CSV file. If the samples were sequenced with replicates, make sure that this file includes the raw replicate data and not mean data.

The file on the right shows how the data should be formatted. Contig IDs in the first column should match those in the assembly file, otherwise there is no way to connect the two together!

It may be helpful for you to label the column header with meaningful names, but these will be discarded when the dataset is imported. Column names are taken from the "Series labels" field during Metadata entry.
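
The requirement that contig IDs in the first column match those in the assembly can be checked locally before upload. The sketch below assumes a CSV layout with one header row, the contig ID in the first column and one column per replicate. The function names and the example data are hypothetical.

```python
import csv
import io
from typing import List


def assembly_ids(fasta_text: str) -> List[str]:
    """Collect sequence titles (without ">") from FASTA text."""
    return [line[1:].strip() for line in fasta_text.splitlines() if line.startswith(">")]


def expression_ids(csv_text: str, has_header: bool = True) -> List[str]:
    """Collect the first-column contig IDs from CSV text."""
    rows = csv.reader(io.StringIO(csv_text))
    if has_header:
        next(rows, None)  # skip the column names, which are discarded on import
    return [row[0].strip() for row in rows if row]


def unmatched_ids(csv_text: str, fasta_text: str) -> List[str]:
    """Return contig IDs in the expression data that are absent from the assembly."""
    known = set(assembly_ids(fasta_text))
    return [contig for contig in expression_ids(csv_text) if contig not in known]


if __name__ == "__main__":
    fasta = ">c1\nATG\n>c2\nGGC\n"
    table = "contig,brain_1,brain_2\nc1,1.0,2.0\nc3,0.5,0.7\n"
    print(unmatched_ids(table, fasta))
```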

We highly recommend that you use RLE, TMM or TPM as units of transcript expression, though there are other metrics that can be used. For more accurate quantitation it is often better to map reads to the CDS (coding DNA sequence) rather than the entire transcript, as this results in a more uniform distribution of read mapping.

Restrictions

There are several restrictions on data imports which may prevent some datasets from being imported.

Comparable samples

The transcriptome must have originated from sequencing multiple samples that can be compared. Some transcriptomes are generated with only one sample feature without examining any variables - these datasets are not suitable for CrustyBase because there is no expression data to compare. An example would be a transcriptome of brain tissue which includes no other features for comparison.

CrustyBase is designed to show experiments where transcript abundance has been estimated across a number of features, for example brain, gonad and muscle tissues. In this case, we would be able to visualise the difference in transcript abundance between these three tissues.

Feature levels

Features describe the variable(s) compared in an RNA-seq experiment.

Example features could be tissue type, developmental stage or experimental treatment. However, some experiments define multiple feature levels.

An example of this is shown on the right (red box), where samples have been taken from two different tissues under a control and treatment condition. In order to import this dataset, these features would need to be flattened into one level, as shown on the bottom panel (green box).
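
Flattening can be thought of as taking every combination of the feature levels and giving each combination a single label. The sketch below illustrates this with hypothetical tissue and condition names. The underscore separator is also an assumption, and any label that you find readable should work.

```python
from itertools import product
from typing import List, Sequence


def flatten_features(*levels: Sequence[str], separator: str = "_") -> List[str]:
    """Combine several feature levels into one flat list of labels."""
    return [separator.join(combination) for combination in product(*levels)]


if __name__ == "__main__":
    tissues = ["gill", "muscle"]            # hypothetical first feature level
    conditions = ["control", "treatment"]   # hypothetical second feature level
    print(flatten_features(tissues, conditions))
    # ['gill_control', 'gill_treatment', 'muscle_control', 'muscle_treatment']
```

Each flattened label then corresponds to one entry in the "Series labels" field, with its replicate columns in the expression file.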

File size

Uploads have been restricted to a maximum file size of 1000MB to preserve server resources. If this limit prevents you from uploading a genuine dataset, please let us know by leaving some feedback. We can raise this limit if necessary.

Delete a dataset

There are a number of reasons why you might want to remove a dataset from CrustyBase. Any member of a group can delete any dataset under that group's ownership. Find the dataset under My datasets. On the profile page you will find the option to delete at the end of the Details section.