CrustyBase is a repository and analysis suite for crustacean
transcriptome data. Each dataset describes the activity of
every expressed gene in a particular species, across a set of
CrustyBase currently contains 23 transcriptome datasets, but anyone can add to this in the future by uploading their own.
You can access the datasets on CrustyBase through different tools (aka apps). We may implement more apps depending on user feedback, but for now there are two core apps: the data browser and the BLAST tool.
While the number of transcriptomes is small, navigating datasets remains quite
simple. But once the database grows, finding a relevant dataset among thousands
becomes challenging. The browser tool was designed to efficiently find
the datasets that are of most interest to you.
You can search for a particular species, or try any other keyword that is relevant to you. "Molt", "disease", "virus", "immune" and "brain" should all return related datasets. You can also search by taxonomy ("portunidae"). Each dataset has a dedicated page describing the species, data and experimental conditions.
The BLAST tool is a long-time staple of the bioinformatics toolbelt. Submit
a DNA or protein sequence that you're interested in and it will give you back the
best-matching sequences in the transcriptomes that you have selected.
Typically you would
obtain the sequence of a gene of interest from a related species (there
are millions available on NCBI) and use that as a query sequence.
Learn more about the NCBI's BLAST tool
Once your result has been returned, you can expand details on the datasets that yield the most interesting results. You can then click (or use arrow keys) through the matching transcripts while viewing expression profiles and sequence structures in real time.
Datasets have been structured to provide as much information as possible for each transcriptome.
Each dataset is described by a structured array of metadata,
including information like taxonomy, experiment description and
assembly procedure. It can also include a reference to a publication
so that the dataset author can be easily cited, which we
strongly encourage if you find a dataset useful.
You can view the metadata for a dataset on its profile page in the data browser.
Each dataset has a number of data types corresponding to each transcript. Some of these are rendered in a graphical display as you browse through transcripts in a BLAST result, and all of them are available to download in some form.
These data types are summarised in the table below.
| Data type | Description | Rendered in output | Available for download |
|---|---|---|---|
| Nucleotide sequence | cDNA sequence of the transcript. This sequence is the origin of all other data types. | Only as alignment | Yes, full access required |
| CDS sequence | Coding DNA sequence predicted by TransDecoder. | No | Yes, full access required |
| Peptide sequence | Protein sequence predicted by TransDecoder. | Only as alignment | Yes, full access required |
| Expression data | Mean expression level across experiment features. | Yes | Yes, full access required for raw data |
| Conserved domains | Conserved protein domains predicted by CD-search. | Yes | Yes, graphics only |
Search for a sequence
The BLAST submission form is made up of three simple components:
Query sequence input
To conduct a BLAST search you must first find a DNA or protein sequence of a gene that you're interested in. The "sample sequence" button can give you an example if you're just trying it out. Sequences can be obtained from the NCBI website or sometimes directly from published research articles. Copy and paste your sequence into this box to begin. It is worth remembering that protein sequences are usually more conserved than DNA between distant taxa, and are therefore more likely to match your target gene.
Select the transcriptome dataset you wish to search from this list. There
may be many databases, but only 10 are shown at a time. To find the
database you're looking for, simply type some keywords into the
input field above to filter the database list. Try searching with
species or family names, tissues ("brain" or "gill") or other
biological keywords such as "immune", "environment" or "reproduction".
If you want to search and view the datasets in more detail, switch
to the data browser.
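To make the filtering behaviour concrete, here is a minimal sketch of how a keyword filter over a dataset list might work. It is not CrustyBase's actual implementation: the metadata field names, the rule that every keyword must match, and the two example records are all assumptions made for illustration.

```python
def searchable_text(dataset):
    """Join the metadata fields a keyword might match (field names are hypothetical)."""
    fields = ("species", "family", "tissues", "description")
    return " ".join(str(dataset.get(name, "")) for name in fields).lower()


def filter_datasets(datasets, query, page_size=10):
    """Return at most `page_size` datasets whose metadata contains every keyword."""
    keywords = query.lower().split()
    matches = [d for d in datasets
               if all(word in searchable_text(d) for word in keywords)]
    return matches[:page_size]  # only 10 entries are shown at a time


# Made-up records for illustration, not real CrustyBase datasets.
datasets = [
    {"species": "Portunus pelagicus", "family": "Portunidae",
     "tissues": "brain, gill", "description": "molt cycle"},
    {"species": "Penaeus monodon", "family": "Penaeidae",
     "tissues": "hepatopancreas", "description": "immune response to virus"},
]

print([d["species"] for d in filter_datasets(datasets, "immune")])
```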
The access level of each database is shown on the right as a green (full access) or orange (restricted access) light.
You can add one or many datasets to the "selected" pane before running the search, but more datasets will of course increase the run time.
The search algorithm that you choose depends on the query sequence
that you entered above:
- BLASTN searches nucleotide transcriptomes with a DNA sequence.
- tBLASTN takes a protein sequence and searches the nucleotide sequences, which are translated on the fly. This tends to be more reliable when searching cross-genus and beyond, as protein sequences are almost always more conserved.
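If you are unsure which algorithm suits your query, the rule above can be sketched in a few lines of Python. This is only an illustration: the helper names and the 0.9 cut-off are assumptions, and CrustyBase itself leaves the choice to you.

```python
# Hypothetical helper: guess the query type and suggest a search algorithm.
NUCLEOTIDE_CHARS = set("ACGTUN")


def strip_fasta_title(query: str) -> str:
    """Drop any title lines (starting with '>') and join the sequence lines."""
    lines = [line.strip() for line in query.strip().splitlines()]
    return "".join(l for l in lines if l and not l.startswith(">")).upper()


def suggest_algorithm(query: str, threshold: float = 0.9) -> str:
    """Return 'BLASTN' for DNA-like queries and 'tBLASTN' for protein-like ones.

    The threshold is an arbitrary illustrative cut-off, not a CrustyBase setting.
    """
    sequence = strip_fasta_title(query)
    if not sequence:
        raise ValueError("query contains no sequence")
    nucleotide_fraction = sum(ch in NUCLEOTIDE_CHARS for ch in sequence) / len(sequence)
    return "BLASTN" if nucleotide_fraction >= threshold else "tBLASTN"


print(suggest_algorithm(">dna\nATGGCCAAGTTGACC"))
print(suggest_algorithm(">protein\nMKVLAAGIVGLLLAQWERT"))
```

A short peptide made up only of A, C, G and T residues would be misclassified by a heuristic like this, so treat the suggestion as a hint rather than a rule.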
Viewing the result
- Datasets with transcripts matching your query are stacked up the page.
- Each dataset in the stack features a species image, experiment description and table summarising the BLAST hits.
- Select transcripts in the table with the checkbox, and download data relating to them.
- Click "expand" to open up a full-screen view of a dataset's results
- Click or use arrow keys to visualize transcripts in three corresponding panes:
The alignment pane
A conventional alignment output from the BLAST tool is displayed in
the first viewing pane. This shows exactly which residues match
between the query sequence and the selected transcript. Match
statistics are shown above the alignment. One match may return
a number of HSPs (high-scoring segment pairs).
HSPs are sections of matching sequence. When searching a transcriptome with protein or cDNA sequence you would normally expect to find one HSP. Two or more indicates large insertions or deletions between the query and subject sequence, probably due to differential mRNA splicing or perhaps a sequencing error.
The expression pane
The expression pane shows the expression profile of the
selected transcript. Move the cursor over markers on the chart
to see mean and standard error of the data. If you aren't sure
what the x-axis labels mean, try hovering over the info symbol
on the right to see the experiment description. You can also
click-and-drag on the y-axis to increase or decrease scale.
This can be useful for zooming in on low-expressed samples.
You may notice that some datasets are displayed with bar charts and others with line graphs. This depends on whether the expression data describes a categorical variable (tissues, treatment etc.) or continuous variable (days, temperature etc.).
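The mean and standard error shown on the chart are standard summary statistics. The sketch below shows how they could be computed from replicate values, and one possible way to choose between a bar chart and a line graph. The function names and the "all labels are numeric" test are assumptions for illustration; CrustyBase may decide the chart type differently, for example from dataset metadata.

```python
import math
import statistics


def summarise_replicates(values):
    """Return (mean, standard error of the mean) for one group of replicates."""
    mean = statistics.mean(values)
    if len(values) < 2:
        return mean, 0.0  # no spread can be estimated from a single replicate
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


def chart_type(feature_labels):
    """Pick 'line' when every label parses as a number, otherwise 'bar'."""
    try:
        for label in feature_labels:
            float(label)
    except ValueError:
        return "bar"
    return "line"


print(summarise_replicates([10.0, 12.0, 14.0]))
print(chart_type(["brain", "gill", "muscle"]))  # categorical feature
print(chart_type(["1", "3", "7"]))              # continuous feature, e.g. days
```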
The protein structure pane
The structure pane shows the protein length predicted by TransDecoder, with conserved domains (predicted by the NCBI's CD-search) plotted along its length. Domains are linked to their descriptions in NCBI, PFAM, TIGRFAM and other databases, which can be followed by clicking on them. Not all transcripts will have a predicted protein, and not all predicted proteins will have predicted domains. However, if you know what structure to expect (like the nuclear receptor DBD and LBD in the protein on the right) it can be a good indication of a correct and complete sequence.
Downloading sequence data
Plots and sequences from the results page can be downloaded in bulk.
Select transcripts of interest in the match list
As you browse matching transcripts, you can select any that
seem interesting with the checkboxes on the right of the match list.
Once you're happy you've got all the interesting matches, click the download button above to open the download dialog.
The download dialog
Simply select the formats that you wish to download and click
the download button. The available formats depend on whether
the owner of the dataset allows full or only restricted access.
It may take up to several minutes to render the requested files if many formats and transcripts are selected.
You can optionally add a file prefix. This will be used to name the downloaded file, so you can remember the origin of these files later on. For example, entering "myresult" will lead to downloading a file called "myresult.zip".
If you are a registered user, you have the option to save results
that might be useful in the future. When logged in,
you should see a "save" button in the top-right. Click save,
enter a useful identifier (so you can remember what it is) and
then click save or hit the enter key.
You can then return to this result at any time through the saved results page of your user profile. CrustyBase will store a maximum of 200 saved results. When this limit is reached, further saves will begin overwriting the oldest save.
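The overwrite behaviour is that of a fixed-size, first-in-first-out store. A minimal sketch, assuming a hypothetical `SavedResults` class (not CrustyBase's actual code), with the limit lowered to 3 so the effect is easy to see:

```python
from collections import deque


class SavedResults:
    """Keep at most `limit` saved results, dropping the oldest when full."""

    def __init__(self, limit=200):
        # A deque with maxlen discards items from the opposite end when full.
        self._items = deque(maxlen=limit)

    def save(self, identifier, result):
        self._items.append((identifier, result))

    def identifiers(self):
        return [identifier for identifier, _ in self._items]


store = SavedResults(limit=3)
for name in ["run-1", "run-2", "run-3", "run-4"]:
    store.save(name, result={})
print(store.identifiers())
```

If an old result matters to you, it is worth downloading it before you approach the limit.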
You can also view and revisit all results from the past 7 days in the "BLAST history" panel at the bottom of your dashboard.
You don't need to be registered to use CrustyBase, but it does come with benefits!
When you create an account with CrustyBase, you will be able to save BLAST results and view your search history. You can also create groups and upload your own transcriptome datasets. It's free to register and always will be.
The dashboard gives you a brief overview of your account. This page allows
you to edit your personal details,
view your groups and datasets, view recent search history and
delete your account (why would you do that!?).
When you are logged in, the dashboard and other pages related to your account can be found by clicking on the login prompt in the top-right of every page, as shown on the right.
Groups control access to transcriptome datasets.
There are two situations when you might need to make use of groups:
- You are going to import a dataset
- You want to get access to a colleague's datasets
You can manage groups through your user profile.
The purpose of groups is to control data access.
Not all datasets are fully accessible to the public, but the members
of a group always have full access to that group's data. Groups
are designed to reflect data ownership in the real world - datasets
are typically owned by a research group, not by a single person in
You can still view and search restricted datasets, but you cannot access or download raw sequence or expression data. The purpose of this is to encourage sharing of datasets which are restricted by intellectual property rights, since graphical results are usually insufficient for published research. If you find something of interest in these datasets, we encourage you to look up the owner of the dataset and seek collaboration. You can find out who uploaded the dataset by checking the dataset's profile in the data browser.
Create a group
The only time you might need to create a group is when you are
going to import a dataset. You can also consider joining a
colleague's group and importing the data there, if you wish to
share access. However, if you are going to import publicly available
data, you can simply opt to import the dataset into the
Public Domain instead.
You can create new groups in the group management page, which is accessible through the login prompt in the top-right. When creating a group, think of a name that is descriptive and unique. For example, Albert Einstein's research group at ETH Zurich might be called "Einstein ETHZ".
Join a group
If a colleague has a research group that you wish to join, they
will need to send you an invite. This can be done easily from the
group management page.
Simply select the appropriate group and hit the "invite" button. Enter the email address of the person you wish to invite and they will be sent a link to join the group.
Check the email address carefully - once someone joins there is no (easy) way to remove them from the group.
Leave a group
There are situations where you may want to leave a group - perhaps
the group is redundant or you are leaving an institution. Simply
select the group on the group management
page and select "leave group" at the bottom of the page.
Consider this carefully though. Once you leave you'll lose access to the group's datasets and, if you were the last member in the group, the group and all its data will be deleted. This is the only way to delete a group.
Why import data?
There are a number of reasons to import data into CrustyBase.
The most obvious reason to contribute your data is because it is
the right thing to do! We are all in the business of advancing
global knowledge, and the process is far more efficient when we
Aside from that, there are several benefits of having your data in the hands of CrustyBase:
- CrustyBase is a free service for navigating your transcriptome data.
- You and your research group can access your data wherever you are.
- Increase your exposure. Make collaborations, get citations.
Prepare an import
You can find the data import app in the data tab in the navigation bar.
There are three phases to the data import process:
- Metadata input (dataset information and access level)
- File upload & validation
- Review & submit
Public access level
Full access allows any user of CrustyBase to download raw sequence and expression data from any transcripts that they find. This data might be sufficient to publish findings made with CrustyBase.
Partial access restricts the data types that a public user is able to download, allowing only graphical content to be seen. Given that publication typically requires reporting of original data, users who wish to make such use out of these datasets would be expected to contact the dataset owner to seek collaboration.
All members of the group owning the dataset have full access, including permission to edit and delete the dataset.
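The access rules can be summarised as a small decision function. The sketch below is an assumption-laden illustration: the `Dataset` and `User` classes, their fields and the two function names are hypothetical and do not describe how CrustyBase stores or checks permissions.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    owner_group: str
    public_access: str  # "full" or "partial" (hypothetical field values)


@dataclass
class User:
    name: str
    groups: set = field(default_factory=set)


def can_download_raw(user: User, dataset: Dataset) -> bool:
    """Group members always have full access; others depend on the public level."""
    if dataset.owner_group in user.groups:
        return True
    return dataset.public_access == "full"


def can_edit_or_delete(user: User, dataset: Dataset) -> bool:
    """Only members of the owning group may edit or delete a dataset."""
    return dataset.owner_group in user.groups


restricted = Dataset("crab brain", owner_group="Einstein ETHZ", public_access="partial")
member = User("albert", groups={"Einstein ETHZ"})
visitor = User("guest")

print(can_download_raw(member, restricted), can_download_raw(visitor, restricted))
```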
Uploaded files must be
correctly formatted for the server to parse them. If
there is a problem validating the data, you will be given a
useful message to help fix the problem. It should be possible
to make any formatting changes with conventional spreadsheet
and text-editor software.
There are two files that need to be uploaded:
- Transcriptome assembly in FASTA format
- Expression data in CSV format
These files are limited to 1000MB each.
The assembly file should be a FASTA-formatted sequence file.
Each sequence should start with a title line that begins with an angle bracket ">" and ends with a new line. The sequence that follows should be composed of only the characters "ATGCN" and new lines. The title can be no longer than 25 characters for any given sequence; anything longer is unnecessary and makes titles difficult to display on CrustyBase.
In the FASTA file on the right you can see that the first sequence title (green) is an appropriate length and format, whereas the second sequence title (red) has been appended with surplus information by the assembly software. This additional information needs to be removed.
If you are unable to reformat the sequence titles, try uploading the file with the "Reformat contig IDs" option enabled. Please note that you will have no way to link CB results to your local dataset if you choose this method, as the contig IDs will no longer match.
CGACACCCAGAAGGGCCTGCAGCACGCCATGATGCAGATGAACGGCCCGATGATGGAAGG
ACGTCGCCTGGATCTGCGCGATGATCCCGCATCACATGGGCGCCATCGCCATGGCCCAGG
AATCCCGAGGCTAAGAAGATCGCCGAGAAGAGCATCCAGGAACAGGAGAAGAGCATCAAG
>comp3_c0_seq1 len=319 path=[12086:0-74 3106:75-318]
ATCTGTTTCTCCTTTTCATATTTTTCTTTTCTTTTGTTCCCTGTGTTCCACTTCTCTGTC
CTTTCACTTCCCTCCTTCTCTTCCTTCTGTTTATTTGCTTCTTCTTCAGTATCCTTTCTT
CCTCTTTTCCTTTTCCCTTCGCATCTTTCTCTCCTTCTTCTTTCTCTTCATCTATTCCTT
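If you would like to check an assembly file before uploading, the formatting rules above can be tested locally. The following is a rough sketch, not the validator that CrustyBase runs: the function names are hypothetical, the 25-character limit is assumed to exclude the leading ">", and lowercase bases are assumed to be invalid. The `trim_title` helper shows one way to strip surplus information yourself so that contig IDs stay linked to your local data; it is not what the "Reformat contig IDs" option does.

```python
import io

MAX_TITLE_LENGTH = 25          # assumed to exclude the leading ">"
ALLOWED_BASES = set("ATGCN")   # assumed to be case-sensitive


def check_fasta(handle):
    """Yield (line_number, message) for each formatting problem found."""
    seen_title = False
    for number, raw in enumerate(handle, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith(">"):
            seen_title = True
            title = line[1:]
            if not title:
                yield number, "empty title"
            elif len(title) > MAX_TITLE_LENGTH:
                yield number, f"title is {len(title)} characters (limit {MAX_TITLE_LENGTH})"
        elif not seen_title:
            yield number, "sequence data before the first title line"
        elif set(line) - ALLOWED_BASES:
            bad = "".join(sorted(set(line) - ALLOWED_BASES))
            yield number, f"unexpected characters: {bad!r}"


def trim_title(line):
    """Keep only the first whitespace-delimited token of a title line."""
    return line.split(maxsplit=1)[0] if line.startswith(">") else line


# ">contig_ok" is a made-up title used only for this demonstration.
sample = (">contig_ok\nATGCN\n"
          ">comp3_c0_seq1 len=319 path=[12086:0-74 3106:75-318]\nATGX\n")
for number, message in check_fasta(io.StringIO(sample)):
    print(f"line {number}: {message}")
print(trim_title(">comp3_c0_seq1 len=319 path=[12086:0-74 3106:75-318]"))
```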
The expression quantitation file should be a plain CSV file.
If the samples were sequenced with replicates, make sure that
this file includes the raw replicate data and not mean data.
The file on the right shows how the data should be formatted. Contig IDs in the first column should match those in the assembly file, otherwise there is no way to connect the two together!
It may be helpful for you to label the column headers with
meaningful names, but these will be discarded when the dataset
is imported. Column names are taken from the "Series labels"
field during Metadata entry.
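Because the two files are joined on contig ID, it can save time to compare the IDs yourself before uploading. Here is a minimal sketch using only the Python standard library. It assumes the CSV has a single header row with contig IDs in the first column; the function names and the tiny in-memory files are hypothetical.

```python
import csv
import io


def fasta_ids(handle):
    """Collect contig IDs from the title lines of a FASTA file."""
    return {line[1:].strip() for line in handle if line.startswith(">")}


def csv_ids(handle):
    """Collect contig IDs from the first column, skipping the header row."""
    reader = csv.reader(handle)
    next(reader, None)
    return {row[0].strip() for row in reader if row}


def compare_ids(fasta_handle, csv_handle):
    """Return (IDs only in the CSV, IDs only in the FASTA), each sorted."""
    assembly = fasta_ids(fasta_handle)
    expression = csv_ids(csv_handle)
    return sorted(expression - assembly), sorted(assembly - expression)


fasta = io.StringIO(">contig1\nATGC\n>contig2\nATGC\n")
table = io.StringIO("id,brain_1,brain_2\ncontig1,1.5,2.0\ncontig3,0.1,0.2\n")
only_in_csv, only_in_fasta = compare_ids(fasta, table)
print("in CSV but not in assembly:", only_in_csv)
print("in assembly but not in CSV:", only_in_fasta)
```

With real files, replace the `io.StringIO` objects with `open(path, newline="")` handles.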
We highly recommend that you use RLE, TMM or TPM as units of transcript expression, though there are other metrics that can be used. For more accurate quantitation it is often better to map reads to the CDS (coding DNA sequence) rather than the entire transcript, as this results in a more uniform distribution of read mapping.
There are several restrictions on data imports which may prevent some datasets from being imported.
The transcriptome must have originated from sequencing multiple
samples that can be compared. Some transcriptomes are generated
with only one sample feature without examining any variables -
these datasets are not suitable for CrustyBase because there is
no expression data to compare. An example would be a
transcriptome of brain tissue which includes no other
features for comparison.
CrustyBase is designed to show experiments where transcript abundance has been estimated across a number of features, for example brain, gonad and muscle tissues. In this case, we would be able to visualise the difference in transcript abundance between these three tissues.
Features describe the variable(s) compared in an RNA-seq experiment.
Example features could be tissue type, developmental stage or experimental treatment. However, some experiments define multiple feature levels.
An example of this is shown on the right (red box), where samples have been taken from two different tissues under a control and treatment condition. In order to import this dataset, these features would need to be flattened into one level, as shown on the bottom panel (green box).
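Flattening amounts to combining each pair of feature levels into a single label. A short sketch, assuming two hypothetical tissues and an underscore separator (neither is prescribed by CrustyBase):

```python
from itertools import product


def flatten_features(*levels, separator="_"):
    """Combine several feature levels into one flat list of labels."""
    return [separator.join(combo) for combo in product(*levels)]


tissues = ["brain", "gill"]              # illustrative tissue names
conditions = ["control", "treatment"]
print(flatten_features(tissues, conditions))
```

Each flattened label then becomes one feature in the imported dataset, with its replicate columns grouped under it.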
Uploads have been restricted to a maximum file size of 1000MB to preserve server resources. If this limit prevents you from uploading a genuine dataset, please let us know by leaving some feedback. We can raise this limit if necessary.
Delete a dataset
There are a number of reasons why you might want to remove a dataset from CrustyBase. Any member of a group can delete any dataset under that group's ownership. Find the dataset under My datasets. On the profile page you will find the option to delete at the end of the Details section.