This is the NCBI Build37 (hg19) release of this track. This release includes the 3 datasets previously released on NCBI Build36 (hg18) and adds data for several more cell types and growth conditions in replicate. Four types of download files are available for each replicate: the raw data (fastq) and 3 display files, namely FPKM values for GencodeV3c gene models (gtf), raw signal (bigwig), and alignments (bam).
This track is produced as part of the ENCODE Project. RNA-seq is a method for mapping and quantifying the transcriptome of any organism that has a genomic DNA sequence assembly (Mortazavi et al., 2008). Biological replicates of ENCODE cell lines were grown on separate culture plates; total RNA was purified and polyA-selected twice. mRNA was then fragmented by magnesium-catalyzed hydrolysis, reverse transcribed to cDNA by random priming, and amplified. The cDNA was sequenced on an Illumina Genome Analyzer (GAI or GAIIx).
The DNA sequences were aligned to the NCBI Build37 (hg19) version of the human genome using the sequence alignment programs ELAND (Illumina) or Bowtie (Langmead et al., 2009). The first 10 residues of sequencing have a weak characteristic nucleotide bias of unknown origin. This RNA-seq protocol does not specify the coding strand. As a result, there will be ambiguity at loci where both strands are transcribed.
This track is a multi-view composite track that contains multiple data types (views). For each view, there are multiple subtracks (cell lines, replicates and growth conditions) that display individually on the browser. Instructions for configuring multi-view tracks are here. The following views are in this track:
Metadata for a particular subtrack can be found by clicking the down arrow in the list of subtracks.
Cells were grown according to the approved ENCODE cell culture protocols, except for H1-hESC, for which frozen cell pellets were purchased from Cellular Dynamics. Cells were lysed in RLT buffer (Qiagen RNeasy kit) and processed on RNeasy midi columns according to the manufacturer's protocol, with the inclusion of the "on-column" DNase digestion step to remove residual genomic DNA. mRNA was isolated from at least 10 ug of total RNA with oligo(dT) two times (Dynabeads mRNA Purification Kit, Invitrogen). Alternatively, cells were lysed and mRNA was purified directly two times with oligo(dT) (Dynabeads mRNA DIRECT Kit, Invitrogen). 100 ng of mRNA was fragmented by magnesium-catalyzed hydrolysis and reverse transcribed to cDNA by random priming according to the protocol in Mortazavi et al. (2008). cDNA was prepared for sequencing on the Genome Analyzer flowcell according to the protocol for the ChIPSeq DNA genomic DNA kit (Illumina). The sequencing libraries were size-selected around 225 bp and amplified with 15 rounds of PCR.
Libraries were sequenced with an Illumina Genome Analyzer I or an Illumina Genome Analyzer IIx according to the manufacturer's recommendations. Single-end reads of 36 nt in length were obtained.
Fastq files were made from qseq files generated by the Illumina pipeline (Casava 1.7). The Raw Signal files (bigWig) were generated from bedgraph files; the score at each position is the number of reads at that position divided by the total number of reads in millions (that is, reads per million).
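The raw signal score described above is a simple per-position scaling. The following is a minimal sketch of that arithmetic only, not the pipeline's actual code; the function name, the dictionary layout, and the example counts are assumptions made for illustration.

```python
def reads_per_million(position_counts, total_reads):
    """Scale raw per-position read counts by the library size in millions.

    position_counts: dict mapping a genomic position to its raw read count
                     (hypothetical layout, chosen for illustration).
    total_reads:     total number of reads in the experiment.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    millions = total_reads / 1_000_000  # total reads expressed in millions
    return {pos: count / millions for pos, count in position_counts.items()}


# Made-up example: three positions in a library of 20 million reads.
counts = {100: 4, 101: 6, 102: 2}
signal = reads_per_million(counts, total_reads=20_000_000)
for pos in sorted(signal):
    print(pos, signal[pos])
```

Scaling by library size in this way makes signal heights comparable between replicates that were sequenced to different depths.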
Casava export files were aligned to the NCBI Build37 (hg19) version of the human genome with ELAND (Illumina), generating SAM files. Fastq files of experiments that were previously aligned to NCBI Build36 (hg18) were aligned to NCBI Build37 (hg19) using Bowtie (Langmead et al., 2009; parameters: -S -n 2 -k 11 -m 10 --best), also generating SAM files. SAM files were converted to BAM with SAMtools (Li et al., 2009).
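The Bowtie re-alignment and BAM conversion steps could be scripted roughly as follows. This is a sketch under stated assumptions: only the Bowtie parameters (-S -n 2 -k 11 -m 10 --best) come from the text above; the index prefix, the file names, and the samtools options (-b, -S, -o) are illustrative choices, and the tool versions used by the lab are not stated here. The code only builds and prints the command lines; it does not run them.

```python
import shlex


def bowtie_command(index_prefix, fastq_path, sam_path):
    """Build a Bowtie (version 1) command line with the parameters quoted above."""
    return [
        "bowtie",
        "-S",        # write alignments in SAM format
        "-n", "2",   # allow up to 2 mismatches in the seed
        "-k", "11",  # report up to 11 alignments per read
        "-m", "10",  # suppress reads with more than 10 reportable alignments
        "--best",    # report alignments in best-to-worst order
        index_prefix,
        fastq_path,
        sam_path,
    ]


def sam_to_bam_command(sam_path, bam_path):
    """Build a samtools command that converts SAM to BAM (assumed options)."""
    return ["samtools", "view", "-b", "-S", "-o", bam_path, sam_path]


# Hypothetical index prefix and file names, for illustration only.
align_cmd = bowtie_command("hg19", "rep1.fastq", "rep1.sam")
convert_cmd = sam_to_bam_command("rep1.sam", "rep1.bam")
print(shlex.join(align_cmd))
print(shlex.join(convert_cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) once Bowtie, SAMtools, and a genome index are installed.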
RNA-seq reads were aligned to Gencode.v3c (Harrow et al., 2006) gene models and gene expression was measured in Fragments Per Kilobase of exon per Million reads (FPKM) using Cufflinks v0.9.3 (Roberts et al., 2011). FPKM is calculated by dividing the total number of fragments that align to the gene model by the size of the spliced transcript (exons) in kilobases. This number is then divided by the total number of reads in millions for the experiment. FPKM is reported in the last column of the gtf (TranscriptGencV3c) files.
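As a worked illustration of the FPKM arithmetic described above, the sketch below applies the two divisions in order. It is not Cufflinks itself, which estimates transcript abundances with a statistical model; the function name and the example numbers are assumptions.

```python
def fpkm(fragment_count, exon_length_bp, total_reads):
    """Fragments per kilobase of exon per million reads, as described in the text."""
    if exon_length_bp <= 0 or total_reads <= 0:
        raise ValueError("exon_length_bp and total_reads must be positive")
    kilobases = exon_length_bp / 1_000       # spliced transcript size in kb
    millions = total_reads / 1_000_000       # library size in millions of reads
    return fragment_count / kilobases / millions


# Made-up example: 500 fragments on a 2,000 bp spliced transcript
# in an experiment with 10 million reads.
value = fpkm(fragment_count=500, exon_length_bp=2_000, total_reads=10_000_000)
print(value)  # 500 / 2 kb / 10 million reads = 25.0
```

Dividing by transcript length corrects for longer transcripts yielding more fragments, and dividing by library size corrects for sequencing depth.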
RawData (fastq), RawSignal (bigWig), Alignments (bam) and TranscriptGencV3c (gtf) files are available from the Downloads page.
These data were produced by the Dr. Richard Myers Lab at the HudsonAlpha Institute for Biotechnology.
Contact: Dr. Florencia Pauli.
Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D, Rossier C, Ucla C, Hubbard T, Antonarakis SE, Guigo R. GENCODE: producing a reference annotation for ENCODE. Genome Biology. 2006; 7 Suppl 1:S4.1-9.
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009 Mar; 10:R25.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics. 2009; 25:2078-9.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold BJ. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods. 2008 Jul; 5(7):621-628.
Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology. 2011 Mar; 12:R22.
Data users may freely use ENCODE data, but may not, without prior consent, submit publications that use an unpublished ENCODE dataset until nine months following the release of the dataset. This date is listed in the Restricted Until column, above. The full data release policy for ENCODE is available here.