Data Download
Validated Sites of Structural Variation
Formatted and compressed files are available for bulk data download. The most useful file is likely to be the listing of sample-level, validated sites of structural variation. This table is available here.
Novel Sequence Information
Files describing the “novel” sequences not found in the build35 reference are available here. The sequence of all 1,736 assembled contigs is available (freeze1.contigs.fa.1736), as well as the names of the 1,435 contigs analyzed by array CGH and the 1,299 contigs confirmed as human in origin. The clone names and mapped positions for each of the 525 identified regions of new insertion are also given (clones_in_nils.txt).
Single Nucleotide and Small Insertion-Deletion Variants
Single nucleotide and small indel variants identified using the ESPs from the ABC7-ABC14 libraries are available here. All variants have been submitted to dbSNP and should be included in the next release, depending on the dbSNP update schedule.
ESP Mapping Data
These files represent the raw end-sequence mapping data and are available here. For each library the following files are available:
- *.conversion: This file lists the correspondence between the ESP clone names and the trace names reported in the NCBI trace repository. Note: these are trace names, not TI numbers. Also, this file is not present for G248 because the clone names and trace names are the same.
- *_individual.txt: This file summarizes quality information for each read.
- *clonelayer files: There are three clonelayer files for each library. These files give the mapping location for each end sequence pair. The clonelayer.best.matched file reports “best” clone placements (based on the 13-point scoring system). The clonelayer.tied.matched file reports all “tied” clone placements (when a clone has two placements with the same highest score). In this process, multiple possible positions are considered and the top (best or tied) placements are chosen. The other, lower-scoring possibilities are reported in the clonelayer.other.matched files.
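The split into best, tied, and other placements can be sketched as follows. This is a minimal illustration, not the project's pipeline: the `(clone, position, score)` record layout and the function name `partition_placements` are hypothetical, and the 13-point scoring system itself is not reproduced; scores are assumed to be already computed.

```python
from collections import defaultdict


def partition_placements(placements):
    """Split (clone, position, score) records into best, tied, and other.

    Hypothetical record layout; the real clonelayer columns may differ.
    """
    by_clone = defaultdict(list)
    for clone, position, score in placements:
        by_clone[clone].append((position, score))

    best, tied, other = [], [], []
    for clone, candidates in by_clone.items():
        top = max(score for _, score in candidates)
        winners = [pos for pos, score in candidates if score == top]
        # A unique top score is a "best" placement; a shared top score is "tied".
        target = best if len(winners) == 1 else tied
        target.extend((clone, pos, top) for pos in winners)
        # Everything scoring below the top goes to "other".
        other.extend((clone, pos, s) for pos, s in candidates if s < top)
    return best, tied, other


# Made-up example records, for illustration only.
placements = [
    ("c1", "chr1:100", 13),
    ("c1", "chr2:500", 9),
    ("c2", "chr3:10", 11),
    ("c2", "chr4:20", 11),
    ("c2", "chr5:30", 4),
]
best, tied, other = partition_placements(placements)
```

Here clone `c1` has a single top-scoring position and lands in `best`, while `c2` has two positions sharing the top score and lands in `tied`; the remaining positions go to `other`.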
In addition to placement information, these files report statistics on the quality of each clone position, such as sequence identity, number of Q30 positions in the alignment, and intersections with common repeats. These files list all placed clones. For variant analysis, more stringent mapping criteria involving read and alignment quality were applied. A detailed description of these criteria is given in the supplementary material of Kidd et al. (2008) and Tuzun et al. (2005).
- *qualityaligns files: Quality-rescored alignments are given in the qualityaligns.bestandtied and qualityaligns.other files. Following megaBLAST searching, a complete sequence alignment of the end placement is recomputed and the alignment is rescored considering only high-quality (Q30) positions. Positions in the end sequences having a quality of at least Q30 are given in uppercase letters. The alignments are reported as two strings (with gaps represented as dashes). For example, the following entry:
| Seq1 | aaacTG-GTCCATGc |
| Seq2 | AAACTTTGTCCATGC |
would correspond to this alignment:
    Seq1  aaacTG-GTCCATGc
          |||||| ||||||||
    Seq2  AAACTTTGTCCATGC
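The two-string format lends itself to a simple column-by-column walk. The sketch below tallies gaps, matches, and mismatches, and separately counts columns where both bases are uppercase. It is only an illustration under stated assumptions: the function name `summarize_alignment` is hypothetical, treating a column as high quality only when neither base is lowercase is an assumption, and the counts are not the project's rescoring scheme.

```python
def summarize_alignment(seq1, seq2):
    """Tally column types for two gapped alignment strings of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned strings must have equal length")

    counts = {"match": 0, "mismatch": 0, "gap": 0,
              "hq_match": 0, "hq_mismatch": 0}
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            counts["gap"] += 1
            continue
        same = a.upper() == b.upper()  # compare bases ignoring case
        counts["match" if same else "mismatch"] += 1
        # Assumption: uppercase on both strings marks a high-quality column.
        if a.isupper() and b.isupper():
            counts["hq_match" if same else "hq_mismatch"] += 1
    return counts


summary = summarize_alignment("aaacTG-GTCCATGc", "AAACTTTGTCCATGC")
print(summary)
# {'match': 13, 'mismatch': 1, 'gap': 1, 'hq_match': 8, 'hq_mismatch': 1}
```

For the example entry, the lowercase bases at the ends of Seq1 still count toward the overall matches but are excluded from the high-quality tallies.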