Introductions to the website and dataset.
Easy-to-use search page
FTP site for downloading pig genomic sequences and other data
Frequently asked questions
PigGIS Update to v2
Update Date: 2010.03.03
Use this map of human chromosomes to find pig genes
The Pig Genomic Informatics System (PigGIS) presents accurate pig gene annotations in all sequenced genomic regions. It integrates the available pig sequence data, including 3.84 million whole-genome shotgun (WGS) reads and 0.7 million expressed sequence tags (ESTs) generated by the Sino-Danish Pig Genome Project, and 1 million miscellaneous GenBank records. The Pig Analysis Database covers nearly 50% of the pig genome and over 70% of the coding sequences (CDS), and aims to provide the most complete pig gene set to date.
In addition to gene annotations, PigGIS presents expression information from 98 EST libraries, SNPs detected from both WGS reads and ESTs, oligos that can be used in microarray design, and relevant evolutionary data. SIFT analysis of deleterious mutations will be added in the future.
Although an international partnership has recently initiated a project to sequence the pig genome, the publicly available sequences at present cover only a little more than half of the genome. Large sequencing gaps are scattered throughout, which makes it impossible to obtain long contiguous sequences and poses a great challenge to gene annotation. In designing this database, we developed a pipeline that recovers most of the sequenced CDS of the pig genome without a high-quality assembly. The basic idea is to map the pig segments, including WGS reads, ESTs and various GenBank records, directly to a complete set of annotated human genes (Ensembl v32 in our case), and then to assemble the segments that align well to a human gene or exon. In this way, we save computing resources by avoiding large-scale genome-wide alignment, and we make full use of every base pair that contributes to the pig CDS despite the large number of sequencing gaps. A similar method could also be used to annotate other low-coverage genomes when a closely related reference genome is available. The ongoing Mammalian Genome Project provides another example of low-coverage genomes.
Human proteins were downloaded from Ensembl v32, and the ESTs and WGS reads were aligned to them with BLASTX. Each sequence, whether a read or an EST, was anchored to its best-aligned human gene, but it was discarded as a repeat if its best match was similar to its second-best match. After all sequences had been anchored, PHRAP was applied to assemble the sequences collected for each exon. The resulting contigs were then aligned to the corresponding exon with FASTY in order to fix potential frameshifts. Contigs with less than 80% identity at the protein level were discarded, and only the single best-aligned contig was retained.
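The anchoring and filtering rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PigGIS implementation: the `Hit` record, the function names, and the `repeat_ratio` cutoff (how close the second-best match must be to count as "similar") are hypothetical, since the text does not give them. Ranking contigs by protein identity to pick the "best-aligned" one is likewise an assumption. Only the 80% identity threshold comes from the description.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    seq_id: str    # pig read or EST identifier
    gene_id: str   # human gene the sequence aligned to
    score: float   # alignment score, e.g. a BLASTX bit score


def anchor_sequences(hits, repeat_ratio=0.9):
    """Anchor each sequence to its best-scoring gene, or drop it as a repeat.

    repeat_ratio is a hypothetical cutoff: a sequence is discarded when its
    second-best gene scores at least repeat_ratio times the best gene.
    """
    # Keep the best score per (sequence, gene) pair.
    best_per_gene = defaultdict(dict)
    for hit in hits:
        scores = best_per_gene[hit.seq_id]
        scores[hit.gene_id] = max(scores.get(hit.gene_id, 0.0), hit.score)

    anchored = defaultdict(list)  # gene_id -> anchored sequence ids
    discarded = []                # sequence ids treated as repeats
    for seq_id, scores in best_per_gene.items():
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        best_gene, best_score = ranked[0]
        if len(ranked) > 1 and ranked[1][1] >= repeat_ratio * best_score:
            discarded.append(seq_id)
        else:
            anchored[best_gene].append(seq_id)
    return dict(anchored), discarded


def select_contig(contigs, min_identity=0.80):
    """Return the single best (contig_id, identity) pair with identity of
    at least 80%, or None if no contig passes the filter."""
    passing = [c for c in contigs if c[1] >= min_identity]
    return max(passing, key=lambda c: c[1]) if passing else None
```

In this sketch, a read whose two best genes score 150 and 149 would be discarded as a repeat, while a read scoring 200 and 100 would be anchored to the first gene.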
Up to this step, each human exon was aligned to zero or one pig contig. The SNP detection pipeline was then applied. In this pipeline, a high-quality base is one that satisfies two conditions: a) its quality is not lower than 25, and b) the qualities of the bases in its 5-bp flanking sequences are not lower than 20. If high-quality bases at the same position disagreed with each other, a SNP was called.
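The high-quality base rule can be illustrated with a short Python sketch. The thresholds 25 and 20 come from the description above; everything else is an assumption made for the example: that "5-bp flanking sequences" means five bases on each side, that bases too close to a read end to have full flanks are rejected, and that reads are placed on the contig by a simple ungapped offset. The function names are hypothetical.

```python
from collections import Counter

MIN_BASE_QUAL = 25   # threshold for the base itself (from the text)
MIN_FLANK_QUAL = 20  # threshold for the flanking bases (from the text)
FLANK = 5            # assumed: five bases on each side


def is_high_quality(quals, i):
    """True if base i passes both quality criteria.

    Bases without a full flank on both sides are rejected (an assumption).
    """
    if i < FLANK or i + FLANK >= len(quals):
        return False
    if quals[i] < MIN_BASE_QUAL:
        return False
    flanks = quals[i - FLANK:i] + quals[i + 1:i + FLANK + 1]
    return all(q >= MIN_FLANK_QUAL for q in flanks)


def call_snps(reads):
    """Find contig positions where high-quality bases disagree.

    reads: list of (offset, bases, quals) tuples, each read placed on the
    contig without gaps, with quals given as a list of integers.
    Returns {position: Counter of high-quality bases} for candidate SNPs.
    """
    column = {}
    for offset, bases, quals in reads:
        for i, base in enumerate(bases):
            if is_high_quality(quals, i):
                column.setdefault(offset + i, Counter())[base] += 1
    return {pos: counts for pos, counts in column.items() if len(counts) > 1}
```

Two overlapping reads that differ at one position are reported as a candidate SNP only if both bases pass the quality test; a mismatch supported by a low-quality base is ignored.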