Services and software
Here is a description of the different services available.
Grape is a pipeline for processing and analyzing RNA-Seq data
High throughput sequencing technologies generate vast amounts of data that require subsequent management, analysis and visualization.
The Grape RNAseq Analysis Pipeline Environment implements a set of workflows that allow for easy exploration of RNA-Seq data. Among other features, it enables the users to perform
- quality checks
- read mapping
- generation of expression and splicing statistics
The results are stored in a MySQL database and become immediately available through a RESTful back end server that is connected to a web application using the Google chart tools for display.
The documentation is hosted here: http://grapebuildout.readthedocs.org
Download the latest stable versions of Grape from here:
Follow the instructions in the README.txt
Knowles D, Röder M, Merkel A, Guigó R.
Bioinformatics. 2013 Mar 1;29(5):614-21. doi: 10.1093/bioinformatics/btt016. Epub 2013 Jan 17.
Check out the development version of Grape:
Then follow the steps in the README.txt
You need to have access to a
- MySQL database
Make sure to have the following standard programming languages installed:
The following Perl modules must be installed
You need to have the following module installed in Python:
To give you a preview of the statistical results produced by Grape, our lab has published a set of results for the following RNASeq projects:
New in Grape 1.9.5 (2013-10-21):
- update bootstrap.py from http://downloads.buildout.org/1/bootstrap.py
- Upgrade to grape.recipe.pipeline 1.1.16
- Better error message when the type parameter in the accession is neither fastq nor bam
New in Grape 1.9.4 (2013-02-13):
- Make TEMPLATE parameter optional, the right one is chosen by default now
- Detect wrong template for fastq and bam files.
New in Grape 1.9.3 (2013-02-12):
- Fix configuration checks for bam files
- Allow plus sign in parameter values
- Use a single bootstrap.py, and update the documentation
New in Grape 1.9.2 (2013-01-09):
- New MAXINTRONLENGTH option: Sets the maximum length of splits allowed during the postprocessing of the files generated by gem-2-sam removing the noise. The default is set to 50000, which is reasonable in mammals, however different species may require different settings. Setting it to 0 will remove this filter.
- Improve the documentation for grape.recipe.pipeline: https://graperecipepipeline.readthedocs.org/en/latest/
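The effect of the MAXINTRONLENGTH option described above can be pictured with a small sketch. This is not code from Grape: the function name and the representation of a split as a pair of coordinates are assumptions made purely for illustration. Only the default of 50000 and the meaning of 0 come from the release notes.

```python
# Illustrative sketch only; not taken from the Grape sources.
DEFAULT_MAX_INTRON_LENGTH = 50000  # default value stated in the release notes


def filter_splits(splits, max_intron_length=DEFAULT_MAX_INTRON_LENGTH):
    """Keep the split mappings whose gap is at most max_intron_length.

    `splits` is a list of (donor_end, acceptor_start) coordinate pairs,
    a hypothetical representation chosen for this example.
    A max_intron_length of 0 disables the filter.
    """
    if max_intron_length == 0:
        return list(splits)
    return [
        (donor_end, acceptor_start)
        for donor_end, acceptor_start in splits
        if acceptor_start - donor_end <= max_intron_length
    ]


example = [(1000, 1500), (2000, 90000)]
print(filter_splits(example))     # only the short split is kept
print(filter_splits(example, 0))  # filter disabled: both splits are kept
```

For a species with longer introns, a larger value would be passed instead of the default.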
New in Grape 1.9.1 (2012-12-09):
- Fix bug that caused an uninitialized value in exon, junction and transcript tables when the initial fasta_file table was not present in the database.
New in Grape 1.9 (2012-11-11):
- The charts shown in the Raisin web application now have a better layout and a proper title.
- Fix Quick runs. Soft links were not made correctly, and some accession parameters are now filled in correctly as well.
- add FLUXMEM parameter to control the maximum number of GB of memory that the Flux can use
- add MIN_RECURSIVE_MAPPING_TRIM_LENGTH parameter that allows tuning the minimum length to which a read will be trimmed during the recursive mapping.
- check read labels for consistency
- Add the basic scripts to run the IDR. This is not incorporated yet to the pipeline itself, but the scripts can be used to run it.
- Servers now have a project_downloads and project_downloads_folder section that can be configured in servers/devel/buildout.cfg
New in Grape 1.8 (2012-10-09):
- Use new Cufflinks, version 2.0.2
- Upgrade to Grape pipeline 6.5
- Allow the start script to be run with only species, genome, annotation and read length specified, apart from a list of one or two files.
- Set the number of CPUs used by fastqc to one.
- new dependency on grape.recipe.pipeline to share validation code with the Raisin
- Fix a bug that prevented the correct running of the Flux when the read Ids came from HiSeq
- Add verbose to the mysqlimport statement in the build_exon_junctions.RNAseq.pl
- Install the pre version of GEM in grape.recipe.pipeline 1.1.9, with binaries prefixed with "next".
- Download packages from PyPI instead of from the SVN
- add .downloads and .eggs folders
- Raisin web server
- The download paths and the project folders are now configured in the buildout.cfg
- Remove pickle caching code
- Remove code previously used for dumping resources
- Move dumps folder to the top
- improve .gitignore
- pin MySQL-python = 1.2.3
New in Grape 1.7 (2012-07-25):
- Fix a bug that prevented the pipeline from building the inclusion exclusion table
- Speed up the recursive mapping part of the pipeline
- The output from the Flux capacitor is not deleted any more, making it available for further analysis
- In the Raisin web application, links are now shorter when pointing to pages with tabs. For example, it is not necessary to add /tab/experiments to URLs any more, if the experiments tab is the default tab.
New in Grape 1.6 (2012-07-10):
- Now creates the var/log folder needed when starting raisin with supervisord
- Raisin can now be installed even if some annotation information is missing
- Now correctly gets all the scores in qualities and ambiguous
- Parsing reads is fixed
- HiSeq read IDs are now handled
New in Grape 1.5 (2012-07-06):
- Installs and integrates FastQC
- Default parameters to make configuration easier. When no project parameters are given, use read_length. When no project user is given, use anonymous
New in Grape 1.4 (2012-06-27):
- Install Cufflinks 2.0.1 binaries and use it for detecting novel transcripts
- Use FastQC for quality control
- Calculate gene and exon RPKM now from the Flux Capacitor results
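For readers unfamiliar with the measure, RPKM (reads per kilobase of feature per million mapped reads) can be computed as sketched below. This shows only the standard definition; how Grape derives the read counts from the Flux Capacitor output is not covered here, and the function name is an assumption.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Standard RPKM: reads per kilobase of feature per million mapped reads."""
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and total mapped reads must be positive")
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# 500 reads on a 2 kb exon, in a library of 10 million mapped reads
print(rpkm(500, 2000, 10_000_000))  # 25.0
```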
New in Grape 1.3 (2012-05-18):
- Integrating version 6.0 of the pipeline that now depends on a new version of the Flux-Capacitor: 1.0 RC2.
- Speed improvements: The pipeline now relies less on the overlap tool by using information already included in the BAM files.
New in Grape 1.2 (2012-04-20):
The automatic installation of dependencies has changed. Before, they were taken from the SVN using mr.developer. Now that releases are available for the packages needed by the pipeline, they are downloaded from our web server using the hexagonit.recipe.download recipe, or taken from PyPI (See grape.recipe.pipeline) using zc.recipe.egg.
New in Grape 1.1.2 (2012-03-30):
Moved the installation instructions to INSTALL.txt and added a README.txt to pipelines/Quick.
New in Grape 1.1.1 (2012-03-30):
Grape 1.1.1 includes a better README.txt that will guide you through the installation process.
New in Grape 1.1 (2012-03-29):
Grape 1.1 allows you to quickly analyse one set of RNAseq reads without a lot of configuration overhead.
We provide annotations and genomes that are known to work perfectly for the following species:
- Homo sapiens
- Caenorhabditis elegans (Coming soon)
- Mus musculus
- Drosophila melanogaster
All versions of Grape:
AcE is a program to aid gene prediction accuracy evaluation. It uses GFF format to make it easy to convert gene prediction results into an analyzable format. Novel features include isoform accuracy evaluation from either the annotated gene or gene prediction perspective or both at the same time. Masking of genomic sequence which has unknown features allows gene predictions in annotated regions to be analyzed in a genomic context. Test sets, such as an artificial sequence test set or genomic context test set, can be generated by selecting specified annotated sequences from a master set.
BioMoby Web Services and Workflows
The bioMoby project aims to provide bioinformatics resources through the web. In this regard, we have designed and implemented a set of Web services that are compliant with bioMoby specifications. We have been focusing on providing genome analysis resources. These resources are of various types:
- Sequence analysis
- Gene prediction (i.e. runGeneIDGFF, runSGP2GFF)
- Signal prediction (transcription regulatory elements or splicing elements) [i.e. runMatScanGFF, runMetaAlignmentGFF]
- ESTs assembly
- Sequence data retrieval (promoter sequences from Ensembl database)
- Data format conversion
A more complete list of Web resources can be found in the WEB SERVICES section. We have also set up some analysis pipelines to illustrate the use of these Web resources. Here are the main pipelines of computational analysis that were implemented:
- Promoter analysis
- ESTs assembly
More information about our pipelines of analysis can be found in the WORKFLOWS section.
compmerge is a program that tries to solve the same problem as cuffmerge. It is not limited to cufflinks models and transcripts, but can work with any .gtf file. It merges the spliced transcripts that have a compatible intron structure and merges the monoexonic transcripts based on simple stranded overlap. The output is a .gtf file of merged transcripts.
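The monoexonic part of the merge described above can be sketched as follows. This is not compmerge's code, only an illustration of merging by stranded overlap. The tuple layout is an assumption, and so is the choice to merge chains of overlapping transcripts transitively. The handling of spliced transcripts is not covered.

```python
from collections import defaultdict


def merge_monoexonic(transcripts):
    """Merge single-exon transcripts that overlap on the same strand.

    `transcripts` is an iterable of (chrom, strand, start, end) tuples with
    inclusive coordinates, as in GTF (a hypothetical in-memory layout).
    """
    by_key = defaultdict(list)
    for chrom, strand, start, end in transcripts:
        by_key[(chrom, strand)].append((start, end))

    merged = []
    for (chrom, strand), intervals in sorted(by_key.items()):
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:  # overlaps the current merged interval
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, strand, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((chrom, strand, cur_start, cur_end))
    return merged


models = [
    ("chr1", "+", 100, 200),
    ("chr1", "+", 150, 300),
    ("chr1", "-", 180, 250),  # opposite strand: never merged with the "+" models
    ("chr1", "+", 400, 500),
]
for model in merge_monoexonic(models):
    print(model)
```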
After getting the compmerge executable, type:
without any argument to get the help.
This will give you all the possible options of compmerge.
Do not hesitate to contact sarahqd at gmail dot com for more information.
DeathBase is a database of proteins involved in cell death. It compiles relevant data on the function, structure and evolution of proteins involved in apoptosis and other forms of cell death in several organisms. Information contained in this database is subjected to manual curation. You can help maintain DeathBase by editing the wiki page for any protein.
Environment for Tree Exploration (ETE)
ETE is a python programming toolkit that assists in the automated manipulation, analysis and visualization of hierarchical trees. Besides a broad set of tree handling options, ETE provides specific methods to analyze phylogenetic and clustering trees. It also supports large tree data structures, node annotation, independent editing and analysis of tree partitions, and the association of trees with external data such as multiple sequence alignments or numerical matrices.
ETE is available at http://ete.cgenomics.org
geneid is a program to predict genes along a DNA sequence in a large set of organisms. While its accuracy compares favorably to that of other existing tools, geneid is more efficient in terms of speed and memory usage, and it offers some rudimentary support for integrating predictions from multiple sources.
You will also find whole genome annotations for different species obtained with geneid in our "Gene Predictions" web pages.
Visualizing pair-wise alignments with annotated axes from GFF files.
We are proud to announce the new version of gff2aplot, that has been re-implemented in perl. Visit the program's web page downloading section to obtain v2.0. You can obtain the full distribution tarball from there.
Although the "gff2aplot User's Manual" is not finished yet, you can start using the program, as we have written several HTML tutorials that introduce you to its use. We hope you will enjoy them.
A snapshots web page is also available, listing a few examples of what can be done with gff2aplot.
Obtaining plots to compare genomic sequences and/or sources from GFF files. Last available version is 0.98. Get the PostScript version of "gff2ps Users Manual" (v0.96). A Web Server is also available at Institut Pasteur thanks to Catherine Letondal. A new section has been created: HTML HOWTOs for gff2ps. The first two HOWTOs were also added: "Comparing sources with gff2ps" and "Visualizing PostScript output from gff2ps". We hope you will find them useful.
gff2ps was used to obtain the six chromosome arm plots (X, 2L, 2R, 3L, 3R and 4) appearing in the "Coding content of the fly genome" genome map (figure 4), included as a poster in "The Genome Sequence of Drosophila melanogaster" [Adams et al. Science 287(5461):2185-2195(2000)].
We have produced the map of the Human Genome with gff2ps. The 22 autosomes and the X and Y chromosomes were displayed in a large poster appearing as figure 1 of "The Sequence of the Human Genome" [Venter et al. Science 291(5507):1304-1351 (2001)]. The single chromosome pictures can be accessed from here to visualize the web version of the "Annotation of the Celera Human Genome Assembly" poster.
gff2ps has achieved another genome landmark. The mosquito genome annotation for five chromosome arms (2L, 2R, 3L, 3R and X) has been summarized in a two-sided, five-page foldout included as figure 1 of "The Genome Sequence of the Malaria Mosquito Anopheles gambiae" [Holt et al. Science 298(5591):129-149 (2002)], available from the "Annotation of the Anopheles gambiae genome sequence" web page.
meta is a program to produce and to align the TF-maps of two gene promoter regions. meta is very useful for characterizing promoter regions from orthologous genes, or from co-regulated genes in microarrays, as it reduces the noise very significantly while still detecting the real functional sites.
MetaPhOrs is a public repository of phylogeny-based orthology and paralogy predictions that were computed using resources available in seven popular homology prediction services (PhylomeDB, EnsemblCompara, EggNOG, OrthoMCL, COG, Fungal Orthogroups, and TreeFam). Currently, more than 306 million unique homologous protein pairs are deposited in the MetaPhOrs database. These predictions were retrieved from 705,123 phylogenetic trees for 829 genomes. For each prediction, MetaPhOrs provides a Consistency Score and an Evidence Level describing its reliability, together with the number of trees and links to their source databases.
A database of tetrapod mitochondrial tRNAs. The database includes secondary-structure-based alignments of mt-tRNAs from 277 species with completely sequenced tetrapod mitochondrial genomes as of 2007.
mmeta is a program to produce and to align the TF-maps of multiple promoter regions. mmeta is very powerful for characterizing promoter regions from multiple orthologous genes, or from co-regulated genes in microarrays, as it reduces the noise very significantly while still detecting the real functional sites.
overlap is a program that computes the overlap between two sets of genomic features. More precisely, it takes two GFF files of genomic features as input and, for each feature of the first set, reports whether it is overlapped by a feature of the second set. This is the basic mode; more detailed information can also be retrieved.
After getting the overlap executable, type:
without any argument to get the help.
This will give you all the possible options of overlap. Basically the output will be equal to the first input file (file1) with additional information about the overlap of its features with the features of the second file (file2). There are 4 basic modes:
- Mode 0 (option -m 0) is to report boolean overlap: 1 if the feature of file1 is overlapped by a feature of file2, 0 otherwise.
- Mode 1 (option -m 1) is to report quantitative overlap: the number of file2 features overlapping a file1 feature.
- Negative mode (option -m -1 for example) is to report the list of coordinates of file2 features overlapping a file1 feature.
- Value mode (option -m n where n>=10 and n is even) is to report the list of values located in field 10 in file2 associated to file2 features overlapping a file1 feature.
You can also ask for inclusion instead of general overlap with the option -i, or for a stranded overlap with the option -st.
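To make the first three modes concrete, here is a minimal sketch of what each one reports. It is not the overlap program itself: the tuple layout, the function names and the simple all-against-all scan are assumptions chosen for readability. The value mode and the inclusion option are left out.

```python
def overlaps(a, b, stranded=False):
    """True if features a and b overlap.

    Features are (chrom, start, end, strand) tuples with inclusive
    coordinates (a hypothetical layout, not the GFF parser of the tool).
    """
    same_chrom = a[0] == b[0]
    intersect = a[1] <= b[2] and b[1] <= a[2]
    strand_ok = (not stranded) or a[3] == b[3]
    return same_chrom and intersect and strand_ok


def report(file1_features, file2_features, mode=0, stranded=False):
    """Return one (feature, value) pair per file1 feature."""
    results = []
    for f1 in file1_features:
        hits = [f2 for f2 in file2_features if overlaps(f1, f2, stranded)]
        if mode == 0:    # boolean overlap
            value = 1 if hits else 0
        elif mode == 1:  # number of overlapping file2 features
            value = len(hits)
        elif mode < 0:   # coordinates of overlapping file2 features
            value = [(f2[1], f2[2]) for f2 in hits]
        else:
            raise ValueError("mode not covered by this sketch")
        results.append((f1, value))
    return results


file1 = [("chr1", 100, 200, "+"), ("chr1", 500, 600, "-")]
file2 = [("chr1", 150, 250, "+"), ("chr1", 190, 210, "-"), ("chr2", 100, 200, "+")]
for mode in (0, 1, -1):
    print(mode, [value for _, value in report(file1, file2, mode)])
```

The scan above compares every pair of features, which is fine for a toy example but would be too slow for whole-genome annotations.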
Do not hesitate to contact sarahqd at gmail dot com for more information.
PATRONUS (from "PATtern Recognition by Optimized Numerical Universal Scoring") is a program designed to compute in a very fast way the exact probability of observing a given number of occurrences of a simple motif (that is, a continuous word without gaps) in a sequence. Its intended scope is the analysis of very long biological sequences, like chromosomes or whole genomes of complex organisms. The probability is computed on the basis of the Markovian statistics of order m for the sequence, that is the recorded number of the occurrences of all the submotifs of length m + 1 in the sequence. Contrary to what many people believe, computing such a probability for a generic motif is a computationally demanding task, mainly because motifs can overlap in non-trivial ways.
A detailed description of both the PATRONUS algorithm and its excellent performance can be found here.
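The two quantities that the description relies on, the number of (possibly overlapping) occurrences of a motif and the counts of all words of length m + 1, can be gathered as in the sketch below. This only illustrates the inputs to the calculation; it does not implement the PATRONUS probability computation, and the function names are assumptions.

```python
from collections import Counter


def count_occurrences(sequence, motif):
    """Count occurrences of `motif` in `sequence`, overlapping ones included."""
    if not motif:
        raise ValueError("motif must not be empty")
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        start = sequence.find(motif, start + 1)  # step by one to allow overlaps
    return count


def markov_statistics(sequence, order):
    """Counts of every word of length order + 1 found in `sequence`."""
    k = order + 1
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


print(count_occurrences("AAAA", "AA"))  # 3, whereas "AAAA".count("AA") gives 2
print(markov_statistics("ACGAC", 1))
```

The first call shows why overlaps matter: a non-overlapping count would miss the middle occurrence.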
PeSV-Fisher is a pipeline for the detection of five general types of structural variants (SVs): deletions, gains, intra- and inter-chromosomal translocations, and inversions, at very reasonable computational costs. The pipeline further provides comprehensive information on co-localization of SVs in the genome, a key aspect for studying biological consequences. The algorithm uses a combination of methods based on paired-reads (PR) and read-depth strategies (RD). PeSV-Fisher has been designed with the aim to facilitate identification of somatic variation, and, as such, it is capable of analysing two or more samples simultaneously, producing a list of non-shared variants between samples, although it can also analyse individual samples.
Download the latest version of PeSV-Fisher from here: PeSVFisher-0.93.tar.gz
Access the quick guide, dependencies, test data and demo through this link: http://gd.crg.es/tools/PeSVFisher/
PhylomeDB is a public database for complete collections of gene phylogenies (phylomes). It allows users to interactively explore the evolutionary history of genes through the visualization of phylogenetic trees and multiple sequence alignments. Moreover, PhylomeDB provides genome-wide orthology and paralogy predictions which are based on the analysis of the phylogenetic trees. The automated pipeline used to reconstruct trees aims at providing a high-quality phylogenetic analysis of different genomes, including Maximum Likelihood or Bayesian tree inference, alignment trimming and evolutionary model testing. PhylomeDB also includes a public download section with the complete set of trees, alignments and orthology predictions.
Predictor of Interactions with Molecular Chaperones
We use a series of stringent relationships between abundance, solubility and chaperone usage of proteins. Based on these relationships, we show that the need of Escherichia coli proteins for the chaperonin GroEL can be predicted with 86% accuracy. Furthermore, from the observation that the abundance and solubility of proteins depend on the physicochemical properties of their amino acid sequences, we demonstrate that the requirement for GroEL can also be predicted directly from the sequences with 90% accuracy. These results indicate that the physicochemical properties of the amino acid sequences represent an essential component of the cellular quality control system that ensures the maintenance of protein homeostasis in living systems.
Predictor of Protein Amyloidogenicity
Protein aggregation causes many devastating neurological and systemic diseases and represents a major problem in the preparation of recombinant proteins in biotechnology. Major advances in understanding the causes of this phenomenon have been made through the realisation that the analysis of the physico-chemical characteristics of the amino acids can provide accurate predictions about the rates of growth of the misfolded assemblies and the specific regions of the sequences that promote aggregation. More recently it has also been shown that the toxicity in vivo of protein aggregates can be predicted by estimating the propensity of polypeptide chains to form protofibrillar assemblies.
Predictor of Protein Folding Propensities
With the advent of proteomics, there is an increasing need of tools for predicting the properties of large numbers of proteins by using the information provided by their amino acid sequences, even in the absence of the knowledge of their structures. One of the most important types of predictions concerns whether proteins will fold or aggregate. These profiles are calculated, respectively, using the CamFold method, which we introduce in this server, and the Zyggregator method. Our results indicate that the kinetic behavior of proteins is, to a large extent, determined by the interplay between regions of low folding and high aggregation propensities.
Predictor of protein-RNA interactions
Fast predictions of RNA-Protein interactions and domains
Query sequences should be pasted in the 'Protein sequence' and 'RNA sequence' text area.
The catRAPID server accepts only amino acid and nucleic acid sequences, defined as lines of sequence data, without the FASTA definition line, sequence identifiers and/or other symbols; eg:
General guidelines are reported here below:
1. Use one word to identify your protein and your RNA
2. Use standard IUB/IUPAC amino acid and nucleic acid codes.
The nucleic acid codes supported are:
A adenosine; C cytidine; G guanosine; U uridine.
The accepted amino acid codes are:
A alanine; C cysteine; E glutamate; D aspartate; F phenylalanine; G glycine; H histidine; I isoleucine; K lysine; L leucine;
M methionine; N asparagine; P proline; Q glutamine; R arginine; S serine; T threonine; V valine; W tryptophan; Y tyrosine.
3. Blank lines and spaces are not allowed in the middle of the sequence input.
4. Use of sequence identifiers, such as simply accession, accession.version or gi's (e.g., p01013, AAA68881.1, 129295) is not supported in the protein and RNA sequence fields.
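The guidelines above can be checked locally before pasting a sequence into the form. The sketch below is not part of the catRAPID server; the function name is hypothetical, and treating lowercase letters as acceptable is an assumption.

```python
# Alphabets taken from the lists of supported codes above.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
NUCLEOTIDES = set("ACGU")


def validate_sequence(text, alphabet):
    """Return a list of problems found in `text`; an empty list means none."""
    problems = []
    lines = text.strip("\n").split("\n")
    for number, line in enumerate(lines, start=1):
        if line.startswith(">"):
            problems.append(f"line {number}: FASTA definition lines are not accepted")
            continue
        if line.strip() == "":
            problems.append(f"line {number}: blank line inside the sequence")
            continue
        # Assumption: lowercase input is tolerated, so compare in uppercase.
        bad = sorted(set(line.upper()) - alphabet)
        if bad:
            problems.append(f"line {number}: unsupported symbols {bad}")
    return problems


print(validate_sequence("MKTAYIAKQR\nQISFVKSHFS", AMINO_ACIDS))  # []
print(validate_sequence(">my_rna\nACGU\n\nACGT", NUCLEOTIDES))
```

The second call reports the definition line, the blank line and the unsupported T (DNA rather than RNA).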
Predictor of Soluble Expression
Each step in the process of gene expression, from the transcription of DNA into mRNA to the folding and posttranslational modification of proteins, is regulated by complex cellular mechanisms. At the same time, stringent conditions on the physicochemical properties of proteins, and hence on the nature of their amino acids, are imposed by the need to avoid aggregation at the concentrations required for optimal cellular function. A relationship is therefore expected to exist between mRNA expression levels and protein solubility in the cell. By investigating such a relationship, we formulate a method that enables the prediction of the maximal levels of mRNA expression in Escherichia coli with an accuracy of 83% and of the solubility of recombinant human proteins expressed in E. coli with an accuracy of 86%.
SECISaln will predict a SECIS element in the query sequence, split it into its constituent parts and align these against a precompiled database of eukaryotic SECIS elements. The user can choose whether the database sequences are sorted by protein family or by species, thereby offering the possibility of comparing the submitted sequence to other, known SECISes. In addition, SECISaln returns a graphical image of the predicted structure of the user-submitted sequence as well as a multiple structural alignment of all SECIS elements of that type already present in the database.
SECISearch3 and Seblastian: prediction of SECIS elements and selenoproteins
Selenoproteins are proteins containing the uncommon amino acid selenocysteine (Sec). Sec is inserted by a specific translational machinery that recognizes a stem-loop structure, the SECIS element, in the 3′ UTR of selenoprotein genes and recodes a UGA codon within the coding sequence. As UGA is normally a translational stop signal, selenoproteins are generally misannotated, and dedicated tools have to be developed for this class of proteins.
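The consequence of UGA recoding for annotation can be seen in a toy translation routine. This sketch is unrelated to the SECISearch3 or Seblastian code. The set of recoded codon positions is supplied by the caller here, which is an assumption standing in for the role of the SECIS element; U is the one-letter code for selenocysteine.

```python
from itertools import product

# Standard genetic code, codons enumerated in UCAG order.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(mrna, sec_codon_indices=frozenset()):
    """Translate `mrna` from its first base up to the first stop codon.

    A UGA codon whose 0-based codon index is in `sec_codon_indices`
    is read as selenocysteine (U) instead of as a stop.
    """
    protein = []
    for index in range(len(mrna) // 3):
        codon = mrna[3 * index:3 * index + 3]
        if codon == "UGA" and index in sec_codon_indices:
            protein.append("U")
            continue
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = "AUGUGAGGCUAA"
print(translate(mrna))       # M    (UGA read as a stop: truncated product)
print(translate(mrna, {1}))  # MUG  (UGA at codon index 1 recoded to Sec)
```

A standard gene predictor behaves like the first call, which is why selenoprotein genes tend to be truncated or missed.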
In this webserver (go to http://seblastian.crg.es) we provide public access to two new computational methods for selenoprotein identification and analysis: SECISearch3 replaces its predecessor SECISearch as a tool for prediction of eukaryotic SECIS elements. Seblastian is a new method for selenoprotein gene detection that uses SECISearch3 and then predicts selenoprotein sequences encoded upstream of SECIS elements. Seblastian is able to both identify known selenoproteins and predict new selenoproteins.
An open-access paper describing these methods, including their validation and the prediction of new selenoproteins in many eukaryotic lineages, was published in Nucleic Acids Research: http://nar.oxfordjournals.org/content/early/2013/06/19/nar.gkt550.full
This project is the result of a collaboration with Vadim Gladyshev's lab at Harvard (http://gladyshevlab.bwh.harvard.edu/).
Selenoprofiles is a pipeline for profile-based protein finding in genomes.
Given one or more protein alignments, it scans a target genome (or any other nucleotide database) and reports the gene structures of homologous genes. It can be used to easily characterize any protein family of interest across a large number of sequenced genomes, allowing finely tuned filtering of results. Alternatively, it can be used with comprehensive sets of input profiles to completely annotate one or more genomes by homology. The pipeline internally runs blast (psitblastn), exonerate (p2g mode) and genewise, combining them into a final set of non-overlapping predictions for all profiles.
Selenoprofiles is highly flexible. Even inexperienced users can edit the filtering procedures, adapting them for each profile. Advanced users can plug in their own code to customize the internal procedures for labelling, filtering, resolving overlaps and producing output. It is also possible to write code to annotate genomic features of the predicted gene, such as protein or RNA motifs, which are stored and added to the native selenoprofiles output.
Although the program offers a variety of filtering methods, the default filter is quite effective. For each candidate, a measure of its similarity with the sequences in the profile is computed (AWSI score, see manual). The resulting score is compared to the distribution of this measure within the profile sequences. In this way, very conserved alignment profiles allow only highly similar sequences to pass the filter and be output.
Selenoprofiles can be used with any input protein family, but we initially developed it for selenoproteins. These peculiar proteins contain a selenocysteine, the 21st amino acid, which is inserted at specific UGA codons that normally signal translation termination. Selenoprotein transcripts contain specific secondary structures (SECIS elements), which cause a specific UGA to be read as Sec instead of as a stop. Since selenoproteins possess this peculiar feature (recoding of specific stop codons), normal gene prediction programs fail to predict them. Selenoprofiles, in contrast, can correctly predict selenoprotein genes by using technical expedients to align selenocysteine positions. Selenoprofiles includes built-in profiles for selenoproteins and other proteins related to selenocysteine, allowing out-of-the-box prediction of these families.
Mariotti M, Guigó R. Selenoprofiles: profile-based scanning of eukaryotic genome sequences for selenoprotein genes. Bioinformatics. 2010 Nov 1;26(21):2656-63. Epub 2010 Sep 21
(Note that the paper refers to outdated version 1)
All "slave" programs run by selenoprofiles must be installed by the user:
- Blast: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/legacy/2.2.26/ [blastall legacy suite required (2.2.2x). Blast+ programs are not supported yet]
- Exonerate: http://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate
- Genewise: ftp://ftp.ebi.ac.uk/pub/software/unix/wise2/
- Mafft: http://mafft.cbrc.jp/alignment/software/
If you experience problems in the installation of these programs, this page may help you.
Selenoprofiles can be installed on any unix system with python 2.6 or newer. A python command line installer (install_selenoprofiles.py) is provided inside the installation package. You can get the installation package at http://github.com/marco-mariotti/selenoprofiles ; simply open a terminal and type:
git clone https://github.com/marco-mariotti/selenoprofiles
cd selenoprofiles
After obtaining the package, follow the instructions on the README file included in the package. The simplest installation, suitable to be used with your own custom profiles, can be performed with:
python install_selenoprofiles.py -min
Alternatively, you can perform a full installation. This is necessary to scan with the built-in profiles for selenoproteins and Sec machinery. Selenoprofiles requires a few accessory files for this purpose, including the large Uniprot uniref50 database. If you need to use Selenoprofiles for this purpose, run:
python install_selenoprofiles.py -full
For more information run:
python install_selenoprofiles.py --help
The manual is included within the installation package. However you can also download it here:
NOTE: Starting from version 3.4, due to the unsustainable growth of the NCBI NR database, the full installation of Selenoprofiles employs the Uniprot Uniref50 database as the protein reference for the tag score and GO score methods. This may cause slight differences in the performance of the built-in selenoprotein and Sec machinery profiles when compared with older versions.
Output and accessory programs
Several output formats are available, such as gff or fasta for nucleotide or protein sequences. The manual explains how to activate the built-in output types, and how to customize output by adding information of interest. By default, selenoprofiles produces only two types of output file: a fasta alignment of all predictions aligned to their corresponding profile, and a human-readable p2g file showing the alignment, gene structure and sequence (find an example of a p2g file here). The selenoprofiles package also contains a few additional programs, suited to projects that search for certain protein families in many target species. The program selenoprofiles_join_alignments retrieves the results from different species and merges them into a single alignment. Then, selenoprofiles_tree_drawer allows their visualization on the phylogenetic tree of the target species. This program requires the installation of the python tree environment ete2: http://ete.cgenomics.org/. Selenoprofiles_tree_drawer can generate images like those below.
Abstract mode (-a):
If you need help with selenoprofiles, do not hesitate to contact me by email: marco.mariotti at crg.eu
Selenoproteins are a group of proteins that contain selenocysteine (Sec), a rare amino acid inserted co-translationally into the protein chain. The Sec codon is UGA, which is normally a stop codon. In selenoproteins, UGA is recoded to Sec in the presence of specific signals on selenoprotein gene transcripts. Due to the dual role of the UGA codon, gene prediction programs fail to predict selenoproteins correctly. Selenoprofiles is a homology-based in silico tool able to scan genomes for members of the known selenoprotein families, thus finding both selenoproteins and cysteine homologues. Selenoprofiles is built in python, and it internally runs psitblastn, exonerate, genewise and SECISearch.
Selenoprofiles is tuned to search for selenoprotein genes, and comes out of the box with profile alignments for each known selenoprotein and selenocysteine-related family (Note: profiles will be released soon. The current release contains only the program and a single example profile).
Selenoprofiles can be used to search for any protein family (also non-selenoprotein), given an input profile alignment. This pipeline combines standard gene prediction tools to provide a clean and fast way to scan genomes for protein families, and provides a wide repertoire of output formats which can also be extended by the user. The program allows for a deep level of customization, and provides many built-in methods to filter spurious hits.
NOTE: This page describes selenoprofiles version 2.2. A newer version of this program is available here. Version 1 is no longer maintained.
This version features major improvements on the previous ones, such as:
- improved workflow control
- prediction by blast can be output, allowing use of selenoprofiles in bacterial genomes (exonerate and genewise are eukaryote specific)
- lazy computing implemented
- pre-clustering of the profile alignment: multiple blast searches are run if the profile is highly variable
- an SQLite database is used to store results, making it possible to search for a large number of families without producing an enormous number of files, since they can be deleted at the end of the computation
- improved customization of the options used with the slave programs, which can potentially be different for each profile
- improved filtering of results: all filtering procedures are defined as pieces of python code which are run internally in selenoprofiles. Several methods useful for filtering are provided. Filtering can be customized for each family
- intra-family and inter-family redundancy of results is removed
- tag blast and gene ontology extensions implemented for filtering (see manual)
Tools for graphical representation of selenoprofiles results are under development and will be released in the next few months.
Download the latest version of the selenoprofiles manual here: http://genome.crg.es/~mmariotti/selenoprofiles_manual.2.2.pdf
For selenoprofiles to work, all the slave programs that it uses (blastall, exonerate, genewise) must already be installed on your machine. You will also need some external Python modules if you want to use all of its features. These additional modules are needed if you want to scan genomes for selenoproteins, but may be omitted if you want to scan for your protein family of interest. On this page you can find help with installing the slave programs and the additional Python modules.
To install selenoprofiles, download this tarball. Then, follow the instructions in the README.
Selenoprofiles was published in Bioinformatics. To read the article or access the online data, check this page. Please cite:
Mariotti M, Guigó R - Selenoprofiles: profile-based scanning of eukaryotic genome sequences for selenoprotein genes. Bioinformatics. 2010 Nov 1;26(21):2656-63. Epub 2010 Sep 21
sgp2 is a program to predict genes by comparing anonymous genomic sequences from different species. It combines tblastx, a sequence similarity search program, with geneid, an ab initio gene prediction program.
You will also find whole genome annotations for different species obtained with sgp2 in our "Gene Predictions" web pages.
sQTLseekeR is an R package to detect splicing QTLs (sQTLs), which are variants associated with changes in the splicing pattern of a gene. Here, splicing patterns are modeled by the relative expression of the transcripts of a gene. For more information about the method and its performance, see the article: Monlong, J. et al. Identification of genetic variants associated with alternative splicing using sQTLseekeR. Nat. Commun. 5:4698 doi: 10.1038/ncomms5698 (2014).
The latest version, as well as more details on installation and usage, can be found on the sQTLseekeR GitHub page.
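As an illustration of what "relative expression of the transcripts of a gene" means, here is a minimal Python sketch. It is not part of sQTLseekeR (which is an R package) and does not reproduce its interface; the function name splicing_ratios and the input layout are assumptions made for this example only.

```python
def splicing_ratios(transcript_expression):
    """Convert per-transcript expression into per-sample splicing ratios.

    transcript_expression maps a transcript id to a list of expression
    values, one per sample, for the transcripts of a single gene
    (hypothetical layout chosen for this sketch).
    """
    n_samples = len(next(iter(transcript_expression.values())))
    # Total expression of the gene in each sample.
    totals = [
        sum(values[i] for values in transcript_expression.values())
        for i in range(n_samples)
    ]
    ratios = {}
    for transcript_id, values in transcript_expression.items():
        # A sample in which the gene is not expressed has no defined ratio.
        ratios[transcript_id] = [
            value / total if total > 0 else float("nan")
            for value, total in zip(values, totals)
        ]
    return ratios
```

For a gene with two transcripts expressed at 30 and 10 in one sample and at 10 and 30 in another, the ratios are 0.75/0.25 and 0.25/0.75: the gene-level expression is unchanged, but the splicing pattern differs between the two samples.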
New in sQTLseekeR 2.0 (2014-10-14): New packaging available and maintained on GitHub.
New in sQTLseekeR-1-3 (2014-09-18): Fix svQTL problem due to an updated version of the 'vegan' package.
New in sQTLseekeR-1-2 (2014-06-18): More extensive output: MaxDiff splicing ratios between groups and the pair of transcripts that change the most.
New in sQTLseekeR-1-1 (2014-02-21): 'vegan' package error catching when splicing dispersion and gene expression are too low.
SymCurv is a computational ab initio method for nucleosome positioning prediction. It is based on a structural property of natural nucleosome-forming sequences: they are symmetrically curved around a local minimum of curvature. The method takes the primary DNA sequence as input and calculates the expected curvature, from which it deduces possible centers of nucleosomal sequences by imposing symmetry constraints. SymCurv's performance is comparable to that of existing tools, and it offers the additional advantage of predicting nucleosome positions under two assumed states (stationary and dynamic), providing insight into the remodelling potential of nucleosomes with a possible regulatory function.
TCoffee is a multiple alignment package
The Flux Capacitor
The Flux Capacitor predicts abundances for transcript molecules and alternative splicing events from RNAseq experiments. Additionally, there is a simulation pipeline capable of simulating whole transcriptome sequencing experiments.
The GEM (GEnome Multi-tool) Library
The GEM (GEnome Multi-tool) Library is a set of highly optimized tools for indexing and querying huge genomes and files. Provided so far are a very fast exhaustive mapper (the GEM mapper), an unconstrained split mapper (the GEM split mapper), and a very fast program to compute genome mappability (the GEM mappability).
The latest version of the GEM Aligner Library can be accessed by clicking on the link below:
TreeKO is a Python package used to compare phylogenetic trees. Currently it contains two different programs:
- Tree comparison: The tree comparison algorithm is designed to compare trees that have undergone gene loss and gene duplication and therefore do not necessarily have the same number of leaves. TreeKO computes all the possible pruned trees for each original tree by splitting the tree at the duplication nodes and reassembling it from combinations of pruned trees. The pruned trees are then compared all against all, pruning the leaves that are not common to both pruned trees. TreeKO offers two distance measures that are modifications of the Robinson & Foulds distance. The speciation distance computes the distance between two trees without penalizing gene loss and duplication events. The strict distance, on the other hand, computes the distance between the complete structures of the two trees.
- Phylome support: The phylome support algorithm is designed to identify conflicting nodes and to incorporate genome-wide information on species trees. The algorithm can map the gene-tree variability levels of large groups of gene trees (e.g. a whole phylome) onto the nodes of the species tree.
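For readers unfamiliar with the Robinson & Foulds distance that the tree comparison builds on, the following Python sketch computes the classic (unmodified) distance between two rooted trees after pruning the leaves they do not share. It is not TreeKO's implementation and does not reproduce its speciation or strict distances or its handling of duplication nodes; the nested-tuple tree representation and all function names are assumptions made for this example.

```python
def leaves(tree):
    """Return the set of leaf names of a tree given as nested tuples."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def prune(tree, keep):
    """Remove leaves not in keep and collapse nodes left with one child."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    children = [c for c in (prune(child, keep) for child in tree) if c is not None]
    if not children:
        return None
    if len(children) == 1:
        return children[0]
    return tuple(children)


def clades(tree):
    """Return the leaf set below every internal node of the tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(visit(child) for child in node))
        found.add(members)
        return members

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Classic Robinson & Foulds distance on the leaves common to both trees."""
    common = leaves(tree_a) & leaves(tree_b)
    if not common:
        raise ValueError("the trees share no leaves")
    pruned_a = prune(tree_a, common)
    pruned_b = prune(tree_b, common)
    # Count the clades present in one pruned tree but not in the other.
    return len(clades(pruned_a) ^ clades(pruned_b))
```

Comparing ((("A", "B"), "C"), "D") with ((("A", "C"), "B"), "D") gives a distance of 2, because each tree contains one clade ({A, B} and {A, C} respectively) that is absent from the other.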
trimAl is a tool for automated alignment trimming. Its speed and its ability to automatically adjust the parameters to optimize the phylogenetic signal-to-noise ratio for different families make trimAl especially suited for large-scale phylogenomic analyses involving thousands of large multiple sequence alignments.
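To give an idea of what alignment trimming does, here is a deliberately simple Python sketch that removes columns with too many gaps. It uses a fixed gap threshold and is not trimAl's algorithm, which selects its parameters automatically; the function name trim_gappy_columns and the max_gap_fraction parameter are hypothetical.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Drop alignment columns whose fraction of gaps exceeds a threshold.

    alignment maps sequence names to aligned sequences of equal length,
    with "-" as the gap character (layout assumed for this sketch).
    """
    sequences = list(alignment.values())
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    # Indices of the columns that are kept.
    kept = [
        i for i in range(length)
        if sum(seq[i] == "-" for seq in sequences) / len(sequences) <= max_gap_fraction
    ]
    return {name: "".join(seq[i] for i in kept) for name, seq in alignment.items()}
```

With three sequences AC-GT, A--GT and ACTG-, only the third column has more than half of its positions as gaps, so it is the only one removed.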