Background
Comparison of the human genome with the genomes of other primates offers the opportunity to detect evolutionary events that created the diverse phenotypes among the primate species. Because the primate genomes are highly similar to one another, methods developed for analysis of more divergent species do not always detect signs of evolutionary selection.
Results
We have developed a new method, called DivE, specifically designed to find regions that have evolved either more or less rapidly than expected, for any clade within a set of very closely related species. Unlike some previous methods, DivE does not rely on rates of synonymous and nonsynonymous substitution, which enables it to detect evolutionary events in noncoding regions. We demonstrate using simulated data that DivE compares favorably to alternative methods, and we then apply DivE to the ENCODE regions in 14 primate species. We identify thousands of regions in these primates, ranging from 50 to >10000 bp in length, that appear to have experienced either constrained or accelerated rates of evolution. In particular, we detected 4942 regions that have potentially undergone positive selection in one or more primate species. Most of these regions occur outside of protein-coding genes, although we identified 20 proteins that have experienced positive selection.
Conclusions
DivE provides an easy-to-use method for predicting both positive and negative selection in noncoding DNA, and it is particularly well suited to detecting lineage-specific selection in large genomes.
Background
The genome of a living species is the product of a long series of changes, including neutral, beneficial, and detrimental alterations to the sequence. Sequence changes that affect the organism's fitness are subject to evolutionary pressures, such as the pressure to survive, to out-compete other species, and to defend the organism against external attack. In order to uncover these changes, we need to know what the ancestral genome looked like, which we can infer by comparing multiple genomes to one another. As we accumulate genomes from species related to human, and especially from within the primate lineages, we should be able to learn more about what makes humans special. At the same time, we can learn what makes each primate different from the others. Until recently, methods for detecting the effects of evolution had been designed for relatively distant species such as humans and mice. With the publication of the chimpanzee genome [1], we had our first look at a very close relative of human. The genomes of chimpanzees and humans are so close, in fact, that sequence similarity cannot be used to infer functional significance: in most cases, similarity simply reflects the recent divergence between the species. With more species, sequence comparison even among close relatives can be used to tease apart regions that are constrained by evolutionary forces and that, consequently, are likely to have functional importance to the biology of humans.
Recently, the ENCODE project selected 13 primates (in addition to human) and sequenced 1% of each genome to produce "comparative grade" [2] assemblies. These high-quality sequences from close human relatives give us a greater ability than before to detect the signs of evolutionary selection on the genomes of human and other primates. The traces of evolution's effects can be found more easily when they are shared among multiple species. Signs of selection also may indicate functionally important sequences, and in particular they can be used to identify regulatory regions that fall outside protein-coding regions and are otherwise difficult to find.
Broadly speaking, there are two main types of selective processes driving the evolution of genomes. Negative or purifying selection is the evolutionary pressure that eliminates deleterious mutations from a population. Most mutations in the genome are probably neutral, because most of the genome is itself non-functional, but within coding regions, the majority of mutations are deleterious [3]. Deleterious mutations are likely to be transient; i.e., they do not become fixed in the human population. Negative selection has been identified principally by pairwise sequence alignment methods, through which DNA or amino acid sequences can be shown to be more highly conserved than expected based on the overall evolutionary distance between a pair of species. By one well-known estimate, approximately 5% of the human genome is under negative selection [4], of which only 1.5% is contained in protein-coding exons.
Positive selection is more difficult to detect. In positive selection, a region of the genome, protein coding or otherwise, accumulates beneficial mutations that provide a survival advantage to the organism. One way to detect positive selection is by the presence of genes that have acquired many more mutations than other genes when compared to close relatives. A well-documented example of positive selection is the rapid change in the hemagglutinin protein on the surface of the human influenza virus, which is in constant competition with the human immune system [5]. Positive selection must be carefully distinguished from the relaxation of selective constraints, however. If a sequence (a gene or a regulatory sequence) ceases to perform its function, and if that function is no longer needed by the organism, then it might accumulate mutations faster precisely because it is no longer functional.
In this study, we describe a new method, called DivE, for detecting lineage-specific regions evolving at a slower or faster rate than the background evolutionary rate in the primate genomes. Other methods have been previously developed for detecting selection, but most look only at conservation of sequence (negative selection) in all aligned species, and are not lineage specific [6-12]. Methods to detect accelerated regions (i.e., regions evolving at faster-than-neutral rates) have also appeared recently [13-18]. Some of these methods allow for lineage-specific selection [14-16,18], but in contrast with conservation-detection methods, they cannot be easily used for genome-wide scans to detect selection, and look only at particular regions of interest. Although accelerated regions may indicate positive selection, this is not necessarily the case [19]. There are many examples where positive selection manifests itself at only a small number of sites [20-23]. Our method is not suited to the identification of positive selection in these cases.
Recently a new program, phyloP, was developed to examine the more general problem of detecting either conserved or accelerated regions in a set of aligned orthologous sequences from multiple species [24]. PhyloP implements four different statistical phylogenetic tests to find significant departures from neutral substitution rates on a whole phylogeny as well as on selected subtrees (clades) of interest in the phylogeny. It was shown to have fairly good accuracy in detecting strong selection even at individual nucleotides. In one respect, DivE is similar to phyloP in that both methods try to solve the general problem of detecting an increase or a decrease in the rate of substitution in a given genomic region, either on a whole phylogeny or within a clade of the phylogeny.
However, in phyloP the phylogenetic subtree of interest needs to be provided to the program, while in contrast DivE addresses the more complicated problem in which the lineage of interest is not pre-specified. Therefore the lineage under selection must be detected automatically by DivE from among all possible subtrees within a phylogeny. Another significant difference is that applying phyloP to an entire genome to detect selection involves using a sliding window approach. Although a sliding-window analysis is a popular method to test for negative or positive selection, there are results that show that this approach is not generally valid if selective trends are not known a priori in a given region [25]. In addition, the sensitivity of phyloP is dependent on the size of the window used to scan the genome, which in turn depends on the number of species available. DivE does not use a sliding window approach, but instead tries to determine the optimal size for the selected genomic element that is predicted to be under selection. In regard to these differences, DivE is more similar to DLESS [26], a method that detects sequences that have either come under selection, or begun to drift, in any lineage. While DLESS only allows for detection of a "gain" event (conservation in a phylogenetic subtree) or a "loss" event (where a subtree is evolving neutrally while the rest of the tree is conserved), DivE also detects acceleration events in any clade of the tree. DLESS is the only other computational method, prior to DivE, that can detect lineage-specific selection when the lineage of interest is not pre-specified.
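To make concrete what "all possible subtrees within a phylogeny" means when the lineage is not pre-specified, the following minimal sketch enumerates every clade of a small rooted tree. It is an illustration only and is not taken from the DivE implementation: the nested-tuple tree encoding, the four-species toy tree, and the function names leaf_set and candidate_clades are all hypothetical.

```python
from typing import FrozenSet, List, Tuple, Union

# Hypothetical toy encoding: a tree is a leaf name or a tuple of child trees.
Tree = Union[str, Tuple["Tree", ...]]


def leaf_set(tree: Tree) -> FrozenSet[str]:
    """Return the set of species names found below this node."""
    if isinstance(tree, str):
        return frozenset([tree])
    names: FrozenSet[str] = frozenset()
    for child in tree:
        names |= leaf_set(child)
    return names


def candidate_clades(tree: Tree) -> List[FrozenSet[str]]:
    """List the leaf set of every subtree, from single species to the whole tree."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    found = [leaf_set(tree)]
    for child in tree:
        found.extend(candidate_clades(child))
    return found


# Assumed four-species toy tree, used only for illustration.
toy_tree: Tree = ((("human", "chimp"), "gorilla"), "macaque")
clades = candidate_clades(toy_tree)
```

A rooted binary tree with n species has 2n - 1 such clades, so the number of candidate lineages grows only linearly with the number of species, which keeps an exhaustive search over lineages feasible for a set of 14 primates.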
Below we present our method for detecting both conserved and accelerated regions and apply it to 14 primate genomes. We describe results on simulated and real data, including the identification of positively selected genes that intersect regions evolving faster than the neutral mutation rate. The method described in this paper is implemented in the DivE package which is available as free, open-source software [27].
Results
Simulation results
For our simulation tests, we created sequence elements that were both positively and negatively selected within the same 14 primate species used for our later experiments on real data. Because we knew the precise location, size, and type of selection involved in each element, we could use this data to evaluate the accuracy of DivE and compare it to other methods.
We created simulated data sets that contain selected elements of lengths between 50 bp and 1000 bp in all subtrees of the phylogeny of the 14 primates (see Figure 1 and Methods for a description of the primate phylogeny). Conserved elements are either "gained" or "lost" on a particular lineage, where a "gain" event implies that the region defined by that particular lineage will experience selective pressure that will tend to eliminate individuals with mutations in that region (i.e., negative or purifying selection). A "loss" event implies that the region in question does not have evolutionary constraints, and will evolve at the neutral substitution rate, while the rest of the tree is constrained. The average substitution rate observed for conserved elements is a fraction of that observed for non-conserved regions, and we therefore can simulate negative selection by reducing the branch lengths of the selected subtree (for gain) or supertree (for loss), as depicted in Figure 2. For accelerated elements, the observed substitution rate is greater than the neutral rate.
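The branch-length rescaling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulation code used in this study: the Node class, the rescale and total_length functions, the three-species toy tree, its branch lengths, and the scaling factors 0.5 and 2.0 are all hypothetical. The sketch also assumes that the branch leading to the selected clade is rescaled together with the clade itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    length: float = 0.0  # length of the branch leading to this node
    children: List["Node"] = field(default_factory=list)


def rescale(node: Node, clade: str, factor: float,
            scale_inside: bool, _inside: bool = False) -> Node:
    """Return a copy of the tree with some branch lengths multiplied by factor.

    scale_inside=True rescales the clade rooted at `clade` (gain if factor < 1,
    acceleration if factor > 1); scale_inside=False rescales the rest of the
    tree and leaves the clade at its neutral lengths (loss).
    """
    inside = _inside or node.name == clade
    apply_factor = inside if scale_inside else not inside
    length = node.length * factor if apply_factor else node.length
    kids = [rescale(c, clade, factor, scale_inside, inside) for c in node.children]
    return Node(node.name, length, kids)


def total_length(node: Node) -> float:
    """Sum of all branch lengths in the tree."""
    return node.length + sum(total_length(c) for c in node.children)


# Assumed neutral toy tree; the branch lengths are illustrative only.
neutral = Node("root", 0.0, [
    Node("hc", 0.02, [Node("human", 0.01), Node("chimp", 0.01)]),
    Node("macaque", 0.05),
])

gain = rescale(neutral, "hc", 0.5, scale_inside=True)    # clade conserved
loss = rescale(neutral, "hc", 0.5, scale_inside=False)   # rest of tree conserved
accel = rescale(neutral, "hc", 2.0, scale_inside=True)   # clade accelerated
```

Sequences for a simulated element would then be evolved along the rescaled tree rather than the neutral one, so that the affected lineages accumulate fewer (or, for acceleration, more) substitutions than expected under neutrality.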