User Guide

Overview

CrustyBase is a repository and analysis suite for crustacean transcriptome data. Each dataset describes the activity of every expressed gene in a particular species, across a set of samples.

CrustyBase currently contains 37 transcriptome datasets, but anyone can add to this in the future by uploading their own.

You can access the datasets on CrustyBase through different tools (aka apps). We may implement more apps depending on user feedback, but for now there are two core apps:

The data browser

While the number of transcriptomes is small, navigating datasets remains quite simple. But once the database grows, finding a relevant dataset among thousands becomes challenging. The browser tool was designed to efficiently find the datasets that are of most interest to you.

You can search for a particular species, or try any other keyword that is relevant to you. "Molt", "disease", "virus", "immune" and "brain" should all return related datasets. You can also search by taxonomy ("portunidae"). Each dataset has a dedicated page describing the species, data and experimental conditions.

The BLAST tool

The BLAST tool is a long-time staple of the bioinformatics toolbelt. Submit a DNA or protein sequence that you're interested in and it will give you back the best-matching sequences in the transcriptomes that you have selected. Typically you would obtain the sequence of a gene of interest from a related species (there are millions available on NCBI) and use that as a query sequence. Learn more about the NCBI's BLAST tool here.

Once your result has been returned, you can expand details on the datasets that yield the most interesting results. You can then click (or use arrow keys) through the matching transcripts while viewing expression profiles and sequence structures in real time.

Datasets

Datasets have been structured to provide as much information as possible for each transcriptome.

Metadata

Each dataset is described by a structured array of metadata, including information like taxonomy, experiment description and assembly procedure. It can also include a reference to a publication so that the dataset author can be easily cited, which we strongly encourage if you find a dataset useful.

You can view the metadata for a dataset on its profile page in the data browser.

Transcript data

Each dataset has a number of data types corresponding to each transcript. Some of these are rendered in a graphical display as you browse through transcripts in a BLAST result, and all of them are available to download in some form.

These data types are summarised in the table below.

Data type	Description	Rendered in output	Available for download
Nucleotide sequence	cDNA sequence of the transcript. This sequence is the origin of all other data types.	Only as alignment	Yes, full access required
CDS sequence	Coding DNA sequence predicted by TransDecoder.	No	Yes, full access required
Peptide sequence	Protein sequence predicted by TransDecoder.	Only as alignment	Yes, full access required
Expression data	Mean expression level across experiment features	Yes	Yes, full access required for raw data
Conserved domains	Conserved protein domains predicted by CD-search	Yes	Yes, graphics only

BLAST search

Search for a sequence

The BLAST submission form is made up of three simple components:

Query sequence input

To conduct a BLAST search you must first find a DNA or protein sequence of a gene that you're interested in. The "sample sequence" button can give you an example if you're just trying it out. Sequences can be obtained from the NCBI website or sometimes directly from published research articles. Copy and paste your sequence into this box to begin. It is worth remembering that protein sequences are usually more conserved than DNA between distant taxa, and are therefore more likely to match your target gene.

Database list

Select the transcriptome dataset you wish to search from this list. There may be many databases, but only 10 are shown at a time. To find the database you're looking for, type some keywords into the input field above to filter the database list. Try searching with species or family names, tissues ("brain" or "gill") or other biological keywords such as "immune", "environment" or "reproduction". If you want to search and view the datasets in more detail, switch to the data browser.
The access level of each database is shown on the right as a green (full access) or orange (restricted access) light.

You can add one or many datasets to the "selected" pane before running the search, but more datasets will of course increase the run time.

Search algorithm

The search algorithm that you choose depends on the query sequence that you entered above:

BLASTN searches nucleotide transcriptomes with a DNA sequence
tBLASTN takes a protein sequence and searches nucleotide sequences that are translated on the fly. This tends to be more reliable when searching cross-genus and beyond, as protein sequences are almost always more conserved.

Viewing the result

Datasets with transcripts matching your query are stacked up the page.
Each dataset in the stack features a species image, experiment description and table summarising the BLAST hits.
Select transcripts in the table with the checkbox, and download data relating to them.
Click "expand" to open up a full-screen view of a dataset's results.
Click or use arrow keys to visualize transcripts in three corresponding panes:

The alignment pane

A conventional alignment output from the BLAST tool is displayed in the first viewing pane. This shows exactly which residues match between the query sequence and the selected transcript. Match statistics are shown above the alignment. One match may return a number of HSPs (high-scoring segment pairs).

HSPs are sections of matching sequence. When searching a transcriptome with a protein or cDNA sequence you would normally expect to find one HSP. Two or more indicate large insertions or deletions between the query and subject sequence, probably due to differential mRNA splicing or perhaps a sequencing error.

The expression pane

The expression pane shows the expression profile of the selected transcript. Move the cursor over markers on the chart to see mean and standard error of the data. If you aren't sure what the x-axis labels mean, try hovering over the info symbol on the right to see the experiment description. You can also click-and-drag on the y-axis to increase or decrease scale. This can be useful for zooming in on low-expressed samples.
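
The mean and standard error shown on the chart can be reproduced from raw replicate values. A minimal sketch, assuming the conventional sample standard error (sample standard deviation divided by the square root of the number of replicates). The exact calculation used by CrustyBase is not described in this guide, and `mean_and_standard_error` is a hypothetical helper name:

```python
import math
import statistics


def mean_and_standard_error(replicates):
    """Return (mean, standard error) for the replicate values of one sample feature.

    The standard error is None when there is only one replicate,
    because it cannot be estimated from a single value.
    """
    values = list(replicates)
    mean = statistics.mean(values)
    if len(values) < 2:
        return mean, None
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, standard_error


# Hypothetical expression values for three replicates of one tissue.
mean, sem = mean_and_standard_error([10.0, 12.0, 14.0])
print(mean, sem)
```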

You may notice that some datasets are displayed with bar charts and others with line graphs. This depends on whether the expression data describes a categorical variable (tissues, treatment etc.) or continuous variable (days, temperature etc.).

The protein structure pane

The structure pane shows the protein length predicted by TransDecoder, with conserved domains (predicted by the NCBI's CD-search) plotted along its length. Domains are linked to their descriptions in NCBI, PFAM, TIGRFAM and other databases, which can be followed by clicking on them. Not all transcripts will have a predicted protein, and not all predicted proteins will have predicted domains. However, if you know what structure to expect (like the nuclear receptor DBD and LBD in the protein on the right) it can be a good indication of a correct and complete sequence.

Downloading sequence data

Plots and sequences from the results page can be downloaded in bulk.

Select transcripts of interest in the match list

As you browse matching transcripts, you can select any that seem interesting with the checkboxes on the right of the match list.

Once you're happy you've got all the interesting matches, click the download button above to open the download dialog.

The download dialog

Simply select the formats that you wish to download and click the download button. The available formats depend on whether the owner of the dataset allows full or only partial access.

It may take up to several minutes to render the requested files if many formats and transcripts are selected.

You can optionally add a file prefix. This will be used to name the downloaded file, so you can remember the origin of these files later on. For example, entering "myresult" will lead to downloading a file called "myresult.zip".

Saving results

If you are a registered user, you have the option to save results that might be useful in the future. When logged in, you should see a "save" button in the top-right. Click save, enter a useful identifier (so you can remember what it is) and then click save or hit the enter key.

You can then return to this result at any time through the saved results page of your user profile. CrustyBase will store a maximum of 200 saved results. When this limit is reached, further saves will begin overwriting the oldest save.
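
The overwrite behaviour described above is that of a fixed-size queue: once the store is full, each new save pushes out the oldest one. A minimal sketch of that behaviour, not CrustyBase's actual implementation; the function names and result identifiers are hypothetical:

```python
from collections import deque

MAX_SAVED_RESULTS = 200  # limit stated in this guide


def new_saved_results():
    """Create an empty store that silently drops its oldest entry when full."""
    return deque(maxlen=MAX_SAVED_RESULTS)


def save_result(store, identifier):
    """Append a saved result; the oldest entry is discarded once the limit is reached."""
    store.append(identifier)


store = new_saved_results()
for i in range(205):
    save_result(store, f"result-{i}")

print(len(store), store[0], store[-1])
```

If an old result matters to you, download its data before you get close to the limit.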

You can also view and revisit all results from the past 7 days in the "BLAST history" panel at the bottom of your dashboard.

User accounts

You don't need to be registered to use CrustyBase, but it does come with benefits!

When you create an account with CrustyBase, you will be able to save BLAST results and view your search history. You can also create groups and upload your own transcriptome datasets. It's free to register and always will be.

Account overview

The dashboard gives you a brief overview of your account. This page allows you to edit your personal details, view your groups and datasets, view recent search history and delete your account (why would you do that!?).

When you are logged in, the dashboard and other pages related to your account can be found by clicking on the login prompt in the top-right of every page, as shown on the right.

Groups

Groups control access to transcriptome datasets.
There are two situations when you might need to make use of groups:

You are going to import a dataset
You want to get access to a colleague's datasets

You can manage groups through your user profile.

Data access

The purpose of groups is to control data access. Not all datasets are fully accessible to the public, but the members of a group always have full access to that group's data. Groups are designed to reflect data ownership in the real world - datasets are typically owned by a research group, not by a single person in that group.

You can still view and search restricted datasets, but you cannot access or download raw sequence or expression data. The purpose of this is to encourage sharing of datasets which are restricted by intellectual property rights, since graphical results are usually insufficient for published research. If you find something of interest in these datasets, we encourage you to look up the owner of the dataset and seek collaboration. You can find out who uploaded the dataset by checking the dataset's profile in the data browser.

Create a group

The only time you might need to create a group is when you are going to import a dataset. You can also consider joining a colleague's group and importing the data to there, if you wish to share access. However, if you are going to import publicly available data, you can simply opt to import the dataset into the Public Domain instead.

You can create new groups in the group management page, which is accessible through the login prompt in the top-right. When creating a group, think of a name that is descriptive and unique. For example, Albert Einstein's research group at ETH Zurich might be called "Einstein ETHZ".

Join a group

If a colleague has a research group that you wish to join, they will need to send you an invite. This can be done easily by visiting the group management page.

Simply select the appropriate group and hit the "invite" button. Enter the email address of the person you wish to invite and they will be sent a link to join the group.

Check the email address carefully - once someone joins there is no (easy) way to remove them from the group.

Leave a group

There are situations where you may want to leave a group - perhaps the group is redundant or you are leaving an institution. Simply select the group on the group management page and select "leave group" at the bottom of the page.

Consider this carefully though. Once you leave you'll lose access to the group's datasets and, if you were the last member in the group, the group and all its data will be deleted. This is the only way to delete a group.

Import data

Why import data?

There are a number of reasons to import data into CrustyBase. The most obvious reason to contribute your data is because it is the right thing to do! We are all in the business of advancing global knowledge, and the process is far more efficient when we work together.

Aside from that, there are several benefits of having your data in the hands of CrustyBase:

CrustyBase is a free service for navigating your transcriptome data.
You and your research group can access your data wherever you are.
Increase your exposure. Make collaborations, get citations.

Prepare an import

You can find the data import app in the data tab in the navigation bar.
There are three phases to the data import process:

Metadata input (dataset information and access level)
File upload & validation
Review & submit

Public access level

Full access allows any user of CrustyBase to download raw sequence and expression data from any transcripts that they find. This data might be sufficient to publish findings made on CrustyBase.

Partial access restricts the data types that a public user is able to download, allowing only graphical content to be seen. Given that publication typically requires reporting of original data, users who wish to make such use out of these datasets would be expected to contact the dataset owner to seek collaboration.

All members of the group owning the dataset have full access, including permission to edit and delete the dataset.
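
The access rules can be summarised as a small decision function. This is a sketch of the rules as described in this guide, not of how CrustyBase enforces them; the item names and the `can_download` function are hypothetical labels chosen for illustration:

```python
# Items that need full access, following the data type table in the Datasets section.
RAW_DATA = {"nucleotide sequence", "CDS sequence", "peptide sequence", "raw expression data"}
# Graphical items that remain available under partial access.
GRAPHICS = {"expression plot", "conserved domain graphic"}


def can_download(item: str, public_access: str, is_group_member: bool) -> bool:
    """Return True if a user may download the given item from a dataset."""
    if public_access not in ("full", "partial"):
        raise ValueError("public_access must be 'full' or 'partial'")
    if item not in (RAW_DATA | GRAPHICS):
        raise ValueError(f"unknown item: {item}")
    if is_group_member or public_access == "full":
        return True
    # Public users of a partial-access dataset only get graphical content.
    return item in GRAPHICS


print(can_download("peptide sequence", "partial", is_group_member=False))
print(can_download("peptide sequence", "partial", is_group_member=True))
```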

File upload

Upload files must be correctly formatted for the server to parse them correctly. If there is a problem validating the data, you will be given a useful message to help fix the problem. It should be possible to make any formatting changes with conventional spreadsheet and text-editor software.

There are two files that need to be uploaded:

Transcriptome assembly in FASTA format
Expression data in CSV format

These files are limited to 1000MB each.

Assembly file

The assembly file should be a FASTA-formatted sequence file.
Each sequence should start with a title line, beginning with an angle bracket ">" and ending with a new line. The sequence that follows should be composed of only the characters "ATGCN" and new lines. The title can be no longer than 25 characters for any given sequence - any longer than this is unnecessary and makes titles difficult to display on CrustyBase.
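
You can check a file against these rules before uploading it. A minimal sketch, assuming that the 25-character limit excludes the leading ">" and that only upper-case characters are accepted. The server's own validation may differ, and `validate_fasta` is a hypothetical helper:

```python
MAX_TITLE_LENGTH = 25
ALLOWED_BASES = set("ATGCN")


def validate_fasta(text: str) -> list[str]:
    """Return a list of human-readable problems found in FASTA text (empty if none)."""
    problems = []
    seen_title = False
    for number, raw_line in enumerate(text.splitlines(), start=1):
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            seen_title = True
            title = line[1:]
            if not title:
                problems.append(f"line {number}: empty sequence title")
            elif len(title) > MAX_TITLE_LENGTH:
                problems.append(
                    f"line {number}: title is {len(title)} characters (max {MAX_TITLE_LENGTH})"
                )
        elif not seen_title:
            problems.append(f"line {number}: sequence data before the first title line")
        else:
            unexpected = set(line) - ALLOWED_BASES
            if unexpected:
                problems.append(
                    f"line {number}: unexpected characters {''.join(sorted(unexpected))}"
                )
    return problems


print(validate_fasta(">c10000_c0_s1\nCGACACCCAGAAGG\n"))
```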

In the FASTA file on the right you can see that the first sequence title (green) is an appropriate length and format, whereas the second sequence title (red) has been appended with surplus information by the assembly software. This additional information needs to be removed.

If you are unable to reformat the sequence titles, try uploading the file with the "Reformat contig IDs" option enabled. Please note that you will have no way to link CrustyBase results to your local dataset if you choose this method, as the contig IDs will no longer match.

>c10000_c0_s1
CGACACCCAGAAGGGCCTGCAGCACGCCATGATGCAGATGAACGGCCCGATGATGGAAGG
ACGTCGCCTGGATCTGCGCGATGATCCCGCATCACATGGGCGCCATCGCCATGGCCCAGG
AATCCCGAGGCTAAGAAGATCGCCGAGAAGAGCATCCAGGAACAGGAGAAGAGCATCAAG

>comp3_c0_seq1 len=319 path=[12086:0-74 3106:75-318]
ATCTGTTTCTCCTTTTCATATTTTTCTTTTCTTTTGTTCCCTGTGTTCCACTTCTCTGTC
CTTTCACTTCCCTCCTTCTCTTCCTTCTGTTTATTTGCTTCTTCTTCAGTATCCTTTCTT
CCTCTTTTCCTTTTCCCTTCGCATCTTTCTCTCCTTCTTCTTTCTCTTCATCTATTCCTT
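
Surplus information like that in the second title above can usually be removed by keeping only the text before the first space. A minimal sketch, assuming the contig ID is the first whitespace-delimited word of each title line (true for the example above, but worth checking for your own assembler); `trim_fasta_titles` is a hypothetical helper:

```python
def trim_fasta_titles(text: str) -> str:
    """Shorten every FASTA title line to its first whitespace-delimited word."""
    cleaned = []
    for line in text.splitlines():
        if line.startswith(">"):
            words = line[1:].split()
            contig_id = words[0] if words else ""
            cleaned.append(">" + contig_id)
        else:
            # Sequence lines are passed through unchanged.
            cleaned.append(line)
    return "".join(line + "\n" for line in cleaned)


example = ">comp3_c0_seq1 len=319 path=[12086:0-74 3106:75-318]\nATCTGTTTCTCC\n"
print(trim_fasta_titles(example))
```

Apply the same trimming to the contig IDs in your expression file so that the two files still match.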

Expression file

The expression quantitation file should be a plain CSV file. If the samples were sequenced with replicates, make sure that this file includes the raw replicate data and not mean data.

The file on the right shows how the data should be formatted. Contig IDs in the first column should match those in the assembly file, otherwise there is no way to connect the two together!

It may be helpful for you to label the column headers with meaningful names, but these will be discarded when the dataset is imported. Column names are taken from the "Series labels" field during metadata entry.
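
Before uploading, it can be worth confirming that the contig IDs in the two files agree. A minimal sketch, assuming that the first row of the CSV is a header row and that the first column holds the contig ID; the function names are hypothetical:

```python
import csv
import io


def fasta_ids(fasta_text: str) -> set[str]:
    """Collect the titles (without '>') of all sequences in FASTA text."""
    return {line[1:].strip() for line in fasta_text.splitlines() if line.startswith(">")}


def expression_ids(csv_text: str) -> set[str]:
    """Collect the first-column values of a CSV, skipping the header row."""
    rows = csv.reader(io.StringIO(csv_text))
    next(rows, None)  # assumption: the first row holds column headers
    return {row[0].strip() for row in rows if row}


def compare_ids(fasta_text: str, csv_text: str):
    """Return (IDs only in the CSV, IDs only in the FASTA), each sorted."""
    in_fasta = fasta_ids(fasta_text)
    in_csv = expression_ids(csv_text)
    return sorted(in_csv - in_fasta), sorted(in_fasta - in_csv)


fasta = ">c1\nATGC\n>c2\nGGCC\n"
table = "id,brain_1,brain_2\nc1,1.0,2.0\nc3,0.5,0.1\n"
print(compare_ids(fasta, table))
```

Two empty lists mean that every contig appears in both files.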

We highly recommend that you use RLE, TMM or TPM as units of transcript expression, though there are other metrics that can be used. For more accurate quantitation it is often better to map reads to the CDS (coding DNA sequence) rather than the entire transcript, as this results in a more uniform distribution of read mapping.
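
For reference, TPM (transcripts per million) divides each read count by the transcript length and then scales the results so that they sum to one million. A minimal sketch of that standard formula, for illustration only: quantification tools normally compute it for you, often using effective rather than full lengths. The input values are hypothetical:

```python
def counts_to_tpm(counts: dict[str, float], lengths: dict[str, float]) -> dict[str, float]:
    """Convert read counts to TPM using per-transcript lengths (in bases)."""
    # Length-normalised rate: reads per base of transcript (or CDS).
    rates = {contig: counts[contig] / lengths[contig] for contig in counts}
    total = sum(rates.values())
    if total == 0:
        return {contig: 0.0 for contig in counts}
    # Scale so that all values in the sample sum to one million.
    return {contig: rate / total * 1_000_000 for contig, rate in rates.items()}


tpm = counts_to_tpm({"c1": 100, "c2": 100}, {"c1": 1000, "c2": 2000})
print(tpm)
```

In this example both contigs have the same read count, but the shorter one receives twice the TPM because its reads are spread over half the length.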

Restrictions

There are several restrictions on data imports which may prevent some datasets from being imported.

Comparable samples

The transcriptome must have originated from sequencing multiple samples that can be compared. Some transcriptomes are generated with only one sample feature without examining any variables - these datasets are not suitable for CrustyBase because there is no expression data to compare. An example would be a transcriptome of brain tissue which includes no other features for comparison.

CrustyBase is designed to show experiments where transcript abundance has been estimated across a number of features, for example brain, gonad and muscle tissues. In this case, we would be able to visualise the difference in transcript abundance between these three tissues.

Feature levels

Features describe the variable(s) compared in an RNA-seq experiment.

Example features could be tissue type, developmental stage or experimental treatment. However, some experiments define multiple feature levels.

An example of this is shown on the right (red box), where samples have been taken from two different tissues under a control and treatment condition. In order to import this dataset, these features would need to be flattened into one level, as shown on the bottom panel (green box).
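
Flattening amounts to combining each pair of levels into a single label. A minimal sketch with hypothetical tissue and condition names; the underscore separator is an assumption, and any clear naming scheme would do:

```python
from itertools import product


def flatten_features(*levels, separator="_"):
    """Combine several feature levels into one flat list of labels."""
    return [separator.join(combination) for combination in product(*levels)]


tissues = ["brain", "gill"]
conditions = ["control", "treatment"]
print(flatten_features(tissues, conditions))
```

Each flattened label then becomes one series in the imported dataset, with its own set of replicate columns in the expression file.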

File size

Uploads have been restricted to a maximum file size of 1000MB to preserve server resources. If this limit prevents you from uploading a genuine dataset, please let us know by leaving some feedback. We can raise this limit if necessary.

Delete a dataset

There are a number of reasons why you might want to remove a dataset from CrustyBase. Any member of a group can delete any dataset under that group's ownership. Find the dataset under My datasets. On the profile page you will find the option to delete at the end of the Details section.