
Thursday 25 January 2018

Guest Post: “Common design” vs consilience of independent phylogenies

As many will be aware, I cut my discursive teeth on the now-defunct Richard Dawkins forum. This was a vibrant community of more than 85,000 members with something like 3,000 active members united by a single obsession. 

That obsession was the truth. Not 'Da Troof™', but what's actually true about the real world.

This obsession manifests in many ways, but most tangibly in checking what the people whose business is observing the world, diligently recording those observations, and drawing robust (if sometimes tentative) conclusions from them, actually say.

Our guest today is somebody whose particular obsession is with relationships, particularly relationships between organisms, and what it takes to demonstrate them.

I will, ladles and jellyspoons, say no more. I will let my old friend Mikkel Rasmussen speak for himself. 

People, I give you Rumraket.


The single greatest proof realistically imaginable, for the reality of macroevolution

In arguments with creationists, a contentious issue that often comes up is the potential common descent of different species of organisms. Do different species of apes share common descent? Do different species of sharks share common descent? Do different breeds of dogs share common descent? And what would evidence for common descent even look like if it existed?

Biologists argue (and I will attempt to show, at a level that can be understood by most laymen) that a very powerful piece of evidence for common descent is what is known as consilience of independent phylogenies. This evidence is so overwhelming that it is hard to convey, in intuitively graspable terms, just how overwhelming it actually is. It is perhaps the single most powerful piece of evidence, in a sense that can be rigorously quantified, for a fundamental theory in its field, in all of the sciences.

Before we move on, I will just note here that this argument isn’t actually one I have invented. It essentially goes back to Darwin, in chapter 13 of the first edition of The Origin of Species (which everyone should read and you can find here). A modern version of it (which is where I first learned about it myself) can be found in Douglas Theobald’s 29+ Evidences for Macroevolution – The Scientific Case for Common Descent.

I should clarify what consilience of independent phylogenies even means. People in the sciences are familiar with the concept of consilience of independent lines of evidence. It basically just means that different types of evidence, each independent of the others, support the same hypothesis.

In the case of phylogenetics, it means that phylogenetic trees inferred from independent sets of data show similar branching orders. Another way of putting it is that in an evolutionary tree of species, some species are systematically grouped closer together (share more recent common ancestors), than other species, almost no matter what attribute of the organism is used to construct the tree from. Basically, the different trees you can make from different genes, look the same. They corroborate the same genealogical relationship. They converge on the same branching order.

Let’s dig into that a bit. Suppose we want to use a particular gene to construct a phylogenetic tree. We take the sequence of that gene from, say, ten different species of organisms (which could be anything: a rabbit, a rat, a cow, and so on), then compare them by how similar they are (make an alignment). Then we use a phylogenetic algorithm to make one or more trees from that alignment. We don’t tell the algorithm which species is thought to be more closely related to which, so it doesn’t have any sort of inside knowledge about what we might happen to think. The only thing the algorithm knows is the sequence alignment we feed it.

A historically much-used algorithm, called maximum parsimony, basically works like this:

What evolutionary explanation (i.e. what tree) that invokes the least number of character state changes, explains this set of sequences?

Another way of saying the same thing is, how little evolution does it take to “make” all these sequences from a common ancestor, using only copying with mutations?

The algorithm then does this basically by comparing lots and lots of trees with the sequences at different branches, then counting how many total mutations it takes to make all the sequences given a certain tree, and gives its results by showing the “best” tree(s). Sometimes there’s more than one “best” tree, as some of them have close or equal scores. A score here is, again, really just a reflection of how many “mutations” the tree implies. Lower is taken to be better with this sort of algorithm.
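To make the scoring step concrete, here is a minimal sketch of Fitch's small-parsimony algorithm, the classic way to count how many substitutions a single candidate tree implies. The four-species sequences are made up for illustration; real programs score huge numbers of candidate trees and report the best, but this just scores three candidates:

```python
# A minimal sketch of Fitch's small-parsimony algorithm: given one candidate
# tree (nested tuples) and an alignment, count the minimum number of
# substitutions that tree implies. Sequences here are made up.

def fitch_score(tree, seqs):
    """tree: nested 2-tuples of leaf names; seqs: leaf name -> aligned sequence."""
    length = len(next(iter(seqs.values())))
    total = 0
    for i in range(length):                      # score each alignment column independently
        def states(node):
            """Return (possible ancestral bases, substitutions) for this subtree."""
            if isinstance(node, str):            # a leaf: its observed base
                return {seqs[node][i]}, 0
            left_set, left_subs = states(node[0])
            right_set, right_subs = states(node[1])
            common = left_set & right_set
            if common:                           # children can agree: no new substitution
                return common, left_subs + right_subs
            return left_set | right_set, left_subs + right_subs + 1
        total += states(tree)[1]
    return total

seqs = {"rabbit": "AATT", "rat": "AATA", "cow": "GGTA", "pig": "GGTT"}
for tree in [(("rabbit", "rat"), ("cow", "pig")),
             (("rabbit", "cow"), ("rat", "pig")),
             (("rabbit", "pig"), ("rat", "cow"))]:
    print(tree, "->", fitch_score(tree, seqs))
# parsimony prefers ((rabbit, rat), (cow, pig)): it implies the fewest substitutions
```

With these made-up sequences the first tree needs 4 substitutions, the others 6 and 5, so the first tree gets the "best" (lowest) score.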

Okay, suppose we now have a gene sequence from ten different species, and this set of gene sequences was used to build a tree with this method. The question now is: suppose we take a different gene entirely, from the same ten species, and submit it to the same algorithm, what kind of tree will the algorithm make? How will that tree look, compared to the first one? Well it turns out that, when we do this with real biological data, we get a tree that is highly similar to the first one.

Oftentimes they’re basically identical. Other times they’re not exactly identical, but still very similar to each other. So similar, in fact, that it demands an explanation why they are so consistently pretty much the same tree. Vertebrates (organisms with a spine) will always group together in a clade that traces back to a single common ancestor, and in that tree that emerges from the common ancestor of vertebrates, you never find organisms that aren’t vertebrates.

Inside vertebrates you find lots of other groups, for example amniotes (organisms that lay eggs on dry land). Same rules apply. You never find non-amniotes inside the tree that connects all the amniotes to a common ancestor. Inside amniotes you find groups like reptiles and mammals. You never find mammals that sit in the reptile clade, nor reptiles that sit in the mammal clade. Inside mammals you find other groups like rodents, primates and so on. This pattern will be almost entirely identical no matter what gene you use to construct a phylogenetic tree with.

Now your resident creationist comes along, and declares: I can explain that, it’s because there’s a designer that is “re-using parts to design her organisms”. Okay, so that is one creationist hypothesis that is supposed to explain why there is consilience of independent phylogenies. Let’s see if that’s correct.

So the designer has this “part”, a gene. She takes this gene-part she’s created, and copies it, because she intends to re-use it in a new organism to be created. She does so. Creates a new organism, but re-uses an identical copy of the gene. Now two identical copies of the gene exist, one in each species. She does it again when creating the third species. Now three identical copies of the gene exist in three different species. She does it again for the fourth species, re-uses the gene-part. Now four identical copies exist. And so on and so forth until she’s created ten species with ten identical copies of that gene-part.

Maybe the gene looks like this:


And since an identical copy of it was re-used in every new species, that’s just what it looks like for each of those ten species.

Now biologists come along and submit those ten identical copies of the gene to the above-mentioned phylogenetic algorithm. What do they get? They get a star-tree without any groups. Nothing is any closer to any one member than to any other. They’re all equally distant from each other, because the gene is identical in all ten species. So the tree is just a star with ten branches all connected to the same node (actually it comes out as a dot, because the branches have zero length). There is no way to elucidate any groups where some species are more closely related than others, if all you have are ten identical sequences.
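It's easy to verify this point with a few lines of code. Here's a quick sketch (the 40-nucleotide sequence is one I made up) showing that ten identical copies of a gene yield zero pairwise differences, which is exactly the star-tree-with-zero-length-branches situation:

```python
# A quick check of the point above: ten identical copies of a gene carry zero
# phylogenetic signal. The 40-nucleotide sequence is made up for illustration.
from itertools import combinations

gene = "ATGCGTACGTTAGCCATTGACCGATAGCTTAGGCATCGAA"
species = {f"species_{k}": gene for k in range(1, 11)}   # ten identical copies

def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

distances = [hamming(species[a], species[b]) for a, b in combinations(species, 2)]
print(sorted(set(distances)))   # [0]: every pair is equally (zero) distant -> a star tree
```

All 45 pairwise comparisons come out zero, so there is nothing for any tree-building algorithm to group on.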

Okay, so simply re-using the part doesn’t explain why there are groupings of organisms, with some being more closely related to each other than others. Much less why different genes would ever yield similar trees.

Hold on! The creationist says. Perhaps the designer slightly tweaks the re-used part every time for functional reasons? Perhaps the part NEEDS to be ever so slightly different in a specific way, in order to function in a slightly different new species she creates? And THEN you’d get differences that give hierarchical structure to the tree. Okay, let’s run with that.

So the designer copies the gene from the first organism, then slightly alters it in a particular location in order for it to function in the second. Then she does the same thing again for the third species. But does she take the first gene again and copy it, and then slightly alter it? Or does she take the second gene, the one that was already altered a bit? Either way, we end up with a third copy that is not identical to the first or second copy. And so on and so forth, until we have ten copies of the gene, all of them slightly different from each other.

Maybe the first gene the designer made looks like this (the gene from above):


And then she makes altered versions of it until there are ten in total, like this:


So there we go: ten sequences. And they were designed by re-using a common template, then altered in slightly different ways. And the species are different, so different functional alterations were chosen. Those changes could affect anything from binding spots to regulatory elements, alternative splice sites, frameshifts, amino acid substitutions in protein-coding genes, protection against mutation due to altered base frequencies, or whatever else you can think of.

Now we imagine these ten genes are in ten different species, and biologists come along and sequence the ten versions of this gene from the ten different organisms, make an alignment, and then use the algorithm described earlier to make a tree from the alignment. They get something that actually has definite tree-like structure this time. I used an online implementation of a maximum parsimony algorithm to get this tree:

This gene has hierarchical structure because I actually generated the above ten gene sequences by first copying the #1 gene, altering it with some mutations, and then copying the mutated version again to make more copies. In effect, I have mirrored what happens when organisms reproduce. That’s the process that gives hierarchical structure to data. Already here, this should give us pause. In order to make data yield significant tree-like hierarchical structure, we have to behave as if we are a branching genealogical process. We have to copy, then independently mutate, then copy the mutated copies again and mutate them further. We are playing the role of a process of splitting and divergence.
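For the curious, that copy-mutate-copy-the-mutant process is trivial to sketch in code. The starting sequence, the random seed, and the three-substitutions-per-copy rule are all arbitrary choices of mine, but the effect is the same as described above:

```python
# The copy-mutate-copy-the-mutant process described above, in code. The starting
# sequence, seed, and three-changes-per-copy rule are arbitrary choices.
import random

random.seed(1)
BASES = "ACGT"

def mutate(seq, n_changes=3):
    """Copy seq with n_changes substitutions at distinct random positions."""
    seq = list(seq)
    for pos in random.sample(range(len(seq)), n_changes):
        seq[pos] = random.choice([b for b in BASES if b != seq[pos]])
    return "".join(seq)

lineage = ["ATGCGTACGTTAGCCATTGACCGATAGCTTAGGCATCGAA"]   # "gene #1"
for _ in range(9):                    # derive nine more copies, each from the last
    lineage.append(mutate(lineage[-1]))

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# The last copy differs from its immediate ancestor at exactly 3 sites, but has
# drifted much further from copy #1: the derivation order is written into the
# data, and that is exactly what a tree-building algorithm reads back out.
print(hamming(lineage[-1], lineage[-2]), hamming(lineage[-1], lineage[0]))
```

Each copy is closest to the copy it was derived from, so similarity nests: the derivation history itself is the hierarchical structure.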

Of course it takes more than one gene to make an organism, so the designer makes lots of genes by the same method for her 10 different species. She takes the 2nd gene and goes through a similar process as for the 1st gene. Copies it, makes some changes to it for functional reasons, and inserts it into organism #2. Then copies it again, makes some more changes to it, inserts it into organism #3. And so on, until all 10 organisms have two genes each. And the genes are all different, but sort of derivations of each other. They are “commonly designed”. They follow a “common design”.
So let me do that, here’s 10 versions of Gene #2:


Again I put these ten sequences to the algorithm and let it find the “best” tree:

This tree looks almost nothing like the tree we got from the first gene we made. The only shared feature seems to be that species 8 and 5 are more “closely related” to each other than to other species.

Let me make another gene so we have three genes with ten versions of each, for the ten species:


How does the tree look?

Nothing like the other two trees. We can safely say here that there is NOT consilience of independent phylogenies in these three different data sets. They do NOT corroborate the same overall branching structure.

Let’s just put them side by side to make it easier to do the comparison:

But remember, the number of possible unrooted trees for 10 species is 2,027,025. So it is extremely unlikely that we would, by the above process, just so happen to produce trees that are similar. It is entirely possible that we would have to generate hundreds of thousands of such genes for those 10 species before we happened to get two trees that started to look significantly similar. Which is of course why we didn’t get trees that are similar, it is extremely unlikely to end up like that even when we are deliberately designing the genes around a common sequence template.
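Those tree counts aren't pulled from thin air; they follow from a standard combinatorial formula. The number of distinct binary trees for n labelled species is the double factorial (2n-5)!! for unrooted trees and (2n-3)!! for rooted trees, which a few lines of code can confirm:

```python
# The counts quoted in this post follow from a standard formula: the number of
# distinct binary trees for n labelled species is a double factorial,
# (2n-5)!! for unrooted trees and (2n-3)!! for rooted trees.

def double_factorial(m):
    """Product m * (m-2) * (m-4) * ... down to 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result

def unrooted_trees(n):
    return double_factorial(2 * n - 5)

def rooted_trees(n):
    return double_factorial(2 * n - 3)

print(unrooted_trees(10))           # 2027025 unrooted trees for 10 species
print(rooted_trees(13))             # 316234143225 rooted trees for 13 species
print(len(str(rooted_trees(186))))  # 395 digits: roughly 8.4e394 for 186 taxa
```

The same two functions reproduce the other numbers used later in this post: the 316 billion rooted trees for 13 species, and the roughly 8.4 × 10^394 rooted trees for 186 primate species.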

None of the trees output for these three genes, by the same algorithm, are similar. So even if the designer re-uses the same general gene-template, and slightly alters it for functional reasons (or even aesthetic reasons, like “hmm, I like G’s here”), there’s still no reason they should yield highly similar branching patterns. Which they do with real data from actual biological organisms.

And remember also, in real life there are many, many more genes and many, many more species. And real genes are much, much longer than the 40-nucleotide genes used here.

Statistically speaking, there are only two options here. Either the designer is being deliberately deceptive with the real data sets from the diversity of life, and goes back on purpose and re-tweaks the copied genes with intent such that they yield similar trees when analyzed by the algorithm, or there was common descent.

This is where creationists have tried some additional arguments before, which I will address. One creationist commenter asked, in effect, “What if the changes you make in the sequence of gene #1, makes it so you have to make a specific set of changes to the sequence of gene #2 in the same species, otherwise they won’t work in combination. The genome could be like a software program, built on “interdependent sequences”. A designer could be forced to make changes in one gene, after altering another, because these genes somehow interact and therefore are constrained by each other?”.

Yes, it is true: genes oftentimes “cooperate” to yield certain functions, and can sometimes be constrained at some sites to be inter-compatible. A very simple example I can think of is a transcription factor that has to bind a particular stretch of DNA and thereby regulate a downstream gene. In this situation, the sequence of the DNA to which the TF binds, and the amino acid sequence of the TF itself, are co-dependent.

However, there are at least two very important caveats to this, which render this creationist rationalization completely ineffective. First of all, the number of possible TF-to-DNA binding relationships so unbelievably vastly outnumbers the number of extant TF and DNA sequences that it is incredibly unlikely that phylogenies constructed from orthologous genes of each will end up exhibiting similar branching patterns merely because they have to bind each other.

So there is sequence-constraint on both. But this relationship between them doesn’t dictate that if you change the DNA sequence in a particular way, the sequence of the associated TF must change in a way that gives a phylogeny constructed from it the same branching order as the tree derived from the DNA binding spot. The constraint is in the direction of whether the TF will function. It is not in the direction of what kind of branching pattern a likelihood or parsimony-based algorithm makes. So on this fact alone, the suggestion that genes interact in interdependent ways doesn’t rescue a “design-plan” rationalization from the consilience of independent phylogenies.

Second: even if there were such associations, constrained to yield similar branching patterns (and to be clear, there are; the most obvious example is RNA-based transcription factors that bind complementarily to DNA, so you wouldn’t compare the RNA TF to the DNA it binds, as they are directly complementary), we can simply detect such associations and avoid using them when testing common descent by the consilience of independent phylogenies.

In other words, we just have to think about which data sets to use and pick the ones actually known to be independent. Which really isn’t difficult. To pick an example, we can compare trees from genes that are demonstrably independent in their sequences, like two enzymes in some metabolic pathway. The gene sequence of enzyme 1 on chromosome 4 does not cause, nor is it caused by, the gene sequence of enzyme 8 on chromosome 20. The genetic sequences of these two enzymes are constrained by the function they have, not by the order of amino acids in each other. Or we can just compare the RNA-based transcription factor tree to a tree derived from a locus elsewhere in the genome to which this RNA doesn’t bind. Or we can use SINE insertions, and compare them to enzymes, or to transcription factors, and so on and so forth.
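As an aside, "the trees corroborate each other" can itself be made precise: decompose each tree into its clades (the group of species under each internal node) and count how many clades two trees share. A sketch, with hypothetical species names and trees of my own invention:

```python
# Making "the trees corroborate each other" precise: decompose each tree into
# its clades (the set of tips under each internal node) and count how many
# clades two trees share. Species names and trees here are hypothetical.

def clades(tree):
    """Collect every clade (frozenset of tip names) in a nested-tuple tree."""
    found = set()
    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        tips = frozenset().union(*(walk(child) for child in node))
        found.add(tips)
        return tips
    walk(tree)
    return found

gene1_tree = ((("human", "chimp"), "gorilla"), ("mouse", "rat"))
gene2_tree = ((("human", "chimp"), "gorilla"), ("mouse", "rat"))   # same branching order
gene3_tree = ((("human", "rat"), "gorilla"), ("mouse", "chimp"))   # scrambled

print(len(clades(gene1_tree) & clades(gene2_tree)))  # all 4 clades shared
print(len(clades(gene1_tree) & clades(gene3_tree)))  # only the root clade shared
```

Counting shared versus conflicting clades like this is the intuition behind standard tree-comparison measures such as the Robinson-Foulds distance.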

So no, that creationist rationalization doesn’t work. The fact that different genes are functionally interdependent doesn’t explain why they should yield similar branching orders. Only if they went through the same genealogical relationship would you expect that.

And in the cases where they ARE sequence-constrained in a way that forces them to exhibit similar branching orders, we can simply avoid using them to test for consilience of independent phylogenies. This is why the strength of the evidence from consilience is so incredible, because we know there is no mechanistic reason why the data should exhibit such convergence on the same overall branching tree structure, unless there really was common descent. Common descent is the mechanism that forces the same tree-structure on independent data sets from real biological organisms. It is unavoidable.

At this stage, creationist arguments either break down into nonsensical incoherency, or they start mindlessly declaring what is opposite to demonstrable fact. For example in response to the above, I had a creationist assert to me that, in fact, common descent would NOT yield consilience of independent phylogenies because… wait for it, “mutations are random”.

Well, while that is stupid on its face, let’s test it anyway. I will make five gene sequences of a more realistic size (300 nucleotides per gene), and then I will evolve them by splitting and copying, introducing random mutations (using a dice-roller to determine which nucleotide position to mutate, and another roll to determine what it mutates into) until we have 13 “species”. Then I will “evolve” those 13 species for a few generations with more random mutations. Then I’m going to make trees for all five genes from the 13 species and see if they match, or if random mutations somehow magically destroy the phylogenetic signal left by a common ancestral template.

Okay, so here are the results. First of all, here is the overall phylogeny I generated:

It looks like that because it’s actually a zoomed-out screenshot from Excel, where I saved the ancestral and descendant gene sequences. Subsequent generations were evolved by simply making two copies of the ancestor, then introducing between three and five random mutations into each gene. Then making two new copies of the mutated genes, and mutating them again by the same rules: between three and five randomly determined mutations.

That means there are two “levels” of randomness to each mutation. The position in the 300-nucleotide gene where a mutation occurs is random, and the nucleotide it is changed into is random. To keep things simple, I only allowed substitutions.
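The dice-and-spreadsheet procedure is easy to automate. Here's a sketch that does the same thing in code, with one simplification of mine: every lineage splits each round, giving 16 tip "species" after four rounds rather than the unbalanced 13-species tree used in the post:

```python
# The dice-and-spreadsheet simulation automated: one 300-nt ancestral gene,
# each lineage split in two every generation, each descendant getting three to
# five random substitutions. Simplification: every lineage splits each round,
# so we end with 16 tips instead of the post's unbalanced 13.
import random

random.seed(42)
BASES = "ACGT"

def mutate(seq):
    """Copy with 3-5 substitutions at distinct random positions."""
    seq = list(seq)
    for pos in random.sample(range(len(seq)), random.randint(3, 5)):
        seq[pos] = random.choice([b for b in BASES if b != seq[pos]])
    return "".join(seq)

ancestor = "".join(random.choice(BASES) for _ in range(300))

generation = [ancestor]
for _ in range(4):                    # 1 -> 2 -> 4 -> 8 -> 16 lineages
    generation = [mutate(parent) for parent in generation for _ in (0, 1)]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# Sister species (indices 0 and 1 share their immediate parent) differ at only
# a handful of sites; species from opposite sides of the first split differ at
# far more. That gradient of similarity is the phylogenetic signal.
print(hamming(generation[0], generation[1]), hamming(generation[0], generation[-1]))
```

Run this for each of the five genes (with different seeds) and you get exactly the situation analyzed below: random mutations at random positions, yet every gene records the same branching history.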

The five gene sequences for the common ancestor:

>Gene 1

>Gene 2

>Gene 3

>Gene 4

>Gene 5

This time, because of a limitation of the online tool I used to get phylogenetic trees, I used a maximum likelihood algorithm instead, because it allows one to determine the root of the tree.

Okay, so how does the tree for gene 1, from the 13 species, look? I submitted the 13 gene sequences to the maximum likelihood algorithm, and got this tree:

Looks pretty much identical to the true structure of the phylogeny I generated. Could it be a lucky statistical fluke? Let’s see how the gene 2 tree came out:

The branching orders are identical between the two gene trees, and match the true phylogeny completely. How strange! What about the rest of them?

Okay, so while it may look like there are some differences (the bottom branch in each tree is different), these trees are actually genealogically identical. In all five trees, (B, C) forms a clade, which sits inside the clade (A, (B, C)), which sits inside the clade ((A, (B, C)), (F, (E, D))). The same overall nested hierarchical structure, (((A, (B, C)), (F, (E, D))), (((I, J), (H, G)), ((L, M), K))), is exhibited by all five trees. Despite it being possible to arrange the 13 “species” into 316,234,143,225 different rooted trees. Stated another way, there are over three hundred billion ways these five data sets could have failed to exhibit identical branching orders.

In Douglas Theobald's 29+ Evidences for Macroevolution - The Scientific Case for Common Descent there is a section devoted to the convergence of independent phylogenies, and the associated statistics of phylogenetic trees. In reality, with real biological data, the degree to which independent phylogenies corroborate each other is hard to overstate. To pick an example, a phylogeny of 186 species of primates was generated using 54 independent genes (many of them thousands of nucleotides long)[1].

According to the phylogenetic tree calculator on Theobald’s page, there are 8.3803 × 10^394 possible phylogenetic trees for a rooted phylogeny containing 186 taxa. Eight times ten to the three-hundred-and-ninety-fourth power. This number is incomprehensible.

In other words, there are 8.3803 × 10^394 ways to arrange the branches in the tree figure provided in the paper:

That means two independent genetic loci have 8.3803 × 10^394 ways to fail to corroborate each other for a tree constructed using those 186 taxa. So when we find that they nevertheless DO corroborate each other, that is a result that cries out for an explanation. And every one of those 54 gene trees significantly corroborates the others. How significantly?

Even if those trees were to have 90 incongruent branches (they don't; even the most different among them actually have fewer than 20 incongruent branches), it would still yield a result with a significance of P ≤ 3.14423 × 10^-318.
Let's try to put this into other words, so it becomes clearer what I'm saying here.

To those of you who like watching youtube videos, you will no doubt have come across Lawrence Krauss giving one of his public lectures on theoretical physics and cosmology. In one of these lectures, you will hear Lawrence Krauss say something along the lines of "In quantum-electrodynamics you will find the greatest agreement between theory and observation in all of science in that the strength of the electromagnetic field around an electron has been verified to 13 decimal places".

What does that actually mean? Well it means that there is a theory (quantum electrodynamics), and this theory predicts a certain value that we should be able to measure in an experiment.

In particular, it predicts a certain value in the strength of the electromagnetic field that surrounds a charged elementary particle. Experimental physicists have measured this value, and found that it agrees with the value predicted by theory, to the 13th decimal place.

This is the 13th decimal place: 0.0000000000001
This is the predicted and the measured value:

(This picture is from this youtube video (video inserted at bottom; hack)).

That is the accuracy with which the theory that predicts the value of the strength of the electromagnetic field around an electron, has been observationally verified. Physicists are rightly very proud of the observational corroboration of the theoretical prediction of this value. Recall that every additional decimal place you can verify your theoretical prediction, corresponds to a reduction in the amount of uncertainty by a factor of ten. So going from 0.01 to 0.001 means you are ten times less uncertain about the real value.

But Lawrence Krauss is wrong. The greatest agreement between theory and observation in all of science, is the agreement between the theory of common descent, and the significance of the convergence of independent phylogenetic trees. As we have just seen above, the phylogenetic tree of primates can be observationally verified to an accuracy of over three hundred decimal places.

To put that into perspective like the above agreement between prediction and observation in physics, it is like having a theory that predicts a value of a measurement like this:


And you measure this:

To sum up all of the above, common design, or "following a design-plan" where the designer is re-using slightly tweaked parts in new creations, demonstrably does not predict consilience of independent phylogenies. Stated another way, there is NO reason to expect that "common design" should yield highly congruent independent phylogenetic trees. There is only ONE predictive and observationally falsifiable explanation for this phenomenon: There must have been common descent by a branching genealogical process. Common descent.

To borrow that phrase from Emile Zuckerkandl and Linus Pauling, this is the single greatest proof one could realistically imagine, for the reality of macroevolution.


There you have it. As always, nits, crits and comments always welcome. Please contact me at info@hackenslash.co.uk
