Friday linksfest: Messiness and using Google Street View to map species

Anyone who knows me in real life will know that I hate mess. I hate wires that tangle everything up in my flat, creating impenetrable black and off-white thickets. I hated my old office mate leaving his fieldwork kit strewn all across the office, grass festering away on the edges of his quadrat. I hate my sister’s inability to cook without the whole kitchen descending into culinary chaos. In summary, I hate mess.*

So it was with some concern that I saw the recent TED talk by Tim Harford, about how messiness can improve problem-solving, flash up on my Twitter feed. I clicked the link, despairing that I might have to scatter papers all over my desk in an effort to become a better scientist. Thankfully, that wasn’t what happened. Harford actually says that putting constraints on, or introducing randomness into, how you solve problems can mean you produce better solutions. It’s very TED-y, linking disparate ideas from music and science, but I think Harford has something. Give it a look.


Something else that grabbed my attention this week was a paper which used Google Street View to build up a picture of where a plant species in Spain can be found. I like papers like this that just blow my mind. I would never have thought differently enough to come up with this idea.

I love using ggplot2 for analysis in R, but up until now I have had to use a different package to do my diagnostic plots for models. Thankfully, this has now been solved by the ggfortify package, so you never need to use ugly base R plots again….

And finally, if you are an early career researcher in the UK you might be interested in a joint BES & ZSL meeting aimed at helping you establish a career in conservation. Check it out, it has great speakers and I’m sure it will be as good as all other BES meetings I’ve been to.


* I do, however, love all of the people I mention here. Even if I don’t like the mess you cause.

What Attenborough got wrong about the population crisis

Recently David Attenborough has been causing a bit of a fuss by talking about the increasing human population in apocalyptic terms.

According to him, some places, like Ethiopia, are in the grip of the Malthusian nightmare of starvation and disease caused by overpopulation. He has a point, but Ethiopia is not a great example if he wants to say something about the developing world in general. For a start, I’m not sure it’s particularly representative.

I can hear you now: “Phil – don’t have a go at lovely David Attenborough. He’s like the granddad I never had.” Or “Who are you to criticise this national treasure? He’s achieved more than you ever will!” Or, if you read the Daily Mail, “I agree with Attenborough, we should let those scroungers starve!”

Two out of three of these people have a point (I’ll leave it up to you to guess which one I think doesn’t). It’s true – Attenborough has had a more positive influence on biodiversity conservation than me or any of my colleagues. His programmes continue to inspire future conservation biologists all over the world and grip people with a fascination for the natural world. But this doesn’t make him perfect. I’m sure even Jesus had his faults – at the very least he (allegedly) liked a drink and hanging out with prostitutes.

First, Attenborough was right to raise the issue of increasing human population – I share some of his concerns. Global human population is currently at an all-time high, having increased massively over the last century, and if estimates are to be believed it will continue to increase for about the next 50-100 years. This, along with changing human behaviour, has led to an era of unprecedented human domination of the globe, which is becoming known as the Anthropocene.

Looks pretty bad – right?

Here’s why Ethiopia is not a great example when we’re talking population growth and the environment.

Here is the global distribution of percentage population growth for 2000-2010 in countries with a population of more than one million.

You can see that most countries had increasing populations, with most having an increase of around 10%. If we break that up by human development index (HDI) category (an index based on measures of poverty, education, equality etc.) we get this:

Countries with the lowest HDI ranking tend to have higher population increases, and Ethiopia lies at the peak of the histogram.

Here, in fact.

So the population growth in Ethiopia is actually very representative of the poorest countries in the world.

However, if you look at how many people live on degraded land in Ethiopia it’s hardly surprising that there are so many hungry people in the country.

Ethiopia is clearly not ‘normal’ in terms of the percentage of people who live on degraded land. The mean global percentage is around 22%, but in Ethiopia it’s around 72%. This has long been recognised as a problem, and has probably resulted from people with relatively large families having only small amounts of farmland; as a result, this land cannot be left fallow, quickly becomes degraded, and the situation worsens.

So Attenborough is probably right about rising population in Ethiopia causing degradation. However, there is actually relatively little link between population growth and the percentage of people living in degraded land or indeed changes in forest cover on a global scale.

The links between land use change, degradation, population change and poverty are really difficult to generalise about. On the whole it looks like the movement of people into natural ecosystems, rather than population increase by itself, is the cause of much conversion to cropland. This movement can be driven by the perceived need to expand agricultural production, much of which is for foreign markets in the case of developing countries.

This, combined with corruption, can mean the economic benefits of exporting agricultural products never reach the pockets of the average person in a developing country. Using this land for export crops also prevents growing food for the local population, which might otherwise have improved food security. Alternatively, the natural ecosystems could have been spared from conversion.

Given all this, changing consumption habits in China & Brazil will have much greater environmental impacts than population growth in desperately poor countries. These changes in the environment may lead to further degradation of farmland and reduced food security.

So, Attenborough is right to be concerned about population growth – it is important to slow it as much as possible. But next time he speaks about it he would do well not to ignore the issue of consumption.

Using metafoR and ggplot together: part 2

In the last post I introduced you to the basics of meta-analysis using metafor together with ggplot2. If you haven’t seen that post, I suggest you go back and read it since this post will be hard to follow otherwise.

In this post I will concentrate on trying to explain differences between different sites/studies.

Going back to our previous example, we saw that logging tended to have a negative effect on species richness.

Luckily my completely made up data had information on the intensity of logging (volume of wood removed per hectare) and the method used (conventional or reduced impact).

We can make a model looking at the effects of logging intensity and method on species richness. First we can start with our most complex model:

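The code for this model didn’t survive in this copy of the post, and the original made-up data isn’t shared either, so here is a hedged sketch: it fabricates a stand-in `ROM` data frame (all values, and the column names `Intensity` and `Method`, are assumptions) and fits intensity, method and their interaction as moderators.

```r
library(metafor)

# fabricated stand-in for the made-up data from part 1: effect sizes (yi),
# sampling variances (vi), logging intensity and logging method
set.seed(1)
ROM <- data.frame(yi = rnorm(30, -0.4, 0.5),
                  vi = runif(30, 0.005, 0.05),
                  Intensity = runif(30, 5, 60),
                  Method = rep(c("Conventional", "RIL"), 15))

# most complex model, fitted with ML so models can later be compared by AIC
Model1 <- rma.uni(yi, vi, mods = ~ Intensity * Method,
                  method = "ML", data = ROM)
summary(Model1)
```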

This spits out the following:

Mixed-Effects Model (k = 30; tau^2 estimator: ML)

  logLik  deviance       AIC       BIC
-15.8430  121.7556   41.6860   48.6920

tau^2 (estimated amount of residual heterogeneity):     0.1576 (SE = 0.0436)
tau (square root of estimated tau^2 value):             0.3970
I^2 (residual heterogeneity / unaccounted variability): 96.52%
H^2 (unaccounted variability / sampling variability):   28.74

Test for Residual Heterogeneity:
QE(df = 26) = 1151.1656, p-val < .0001

Test of Moderators (coefficient(s) 2,3,4):
QM(df = 3) = 38.2110, p-val < .0001

Model Results:

                     estimate      se     zval    pval    ci.lb    ci.ub
intrcpt                0.1430  0.3566   0.4011  0.6883  -0.5558   0.8419
Intensity             -0.0281  0.0099  -2.8549  0.0043  -0.0474  -0.0088  **
MethodRIL              0.1715  0.4210   0.4074  0.6837  -0.6536   0.9966
Intensity:MethodRIL    0.0106  0.0138   0.7638  0.4450  -0.0166   0.0377

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Our residual plots seem to conform to the assumption of constant variance.


Our model suggests that the more wood is extracted from a forest the greater the negative effect on species richness, but there seems to be little evidence of an effect of method or of an interaction between intensity and method.

So, let’s get rid of the interaction and see what happens:

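Again, the code itself is missing from this copy; assuming the same fabricated stand-in data, the additive model would look something like:

```r
library(metafor)

set.seed(1)  # regenerate the made-up stand-in data so this block runs on its own
ROM <- data.frame(yi = rnorm(30, -0.4, 0.5),
                  vi = runif(30, 0.005, 0.05),
                  Intensity = runif(30, 5, 60),
                  Method = rep(c("Conventional", "RIL"), 15))

# additive model: intensity and method as moderators, no interaction
Model2 <- rma.uni(yi, vi, mods = ~ Intensity + Method,
                  method = "ML", data = ROM)
summary(Model2)
```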


Mixed-Effects Model (k = 30; tau^2 estimator: ML)

  logLik  deviance       AIC       BIC
-16.1323  122.3341   40.2646   45.8694

tau^2 (estimated amount of residual heterogeneity):     0.1606 (SE = 0.0443)
tau (square root of estimated tau^2 value):             0.4008
I^2 (residual heterogeneity / unaccounted variability): 96.75%
H^2 (unaccounted variability / sampling variability):   30.80

Test for Residual Heterogeneity:
QE(df = 27) = 1151.4310, p-val < .0001

Test of Moderators (coefficient(s) 2,3):
QM(df = 2) = 36.9762, p-val < .0001

Model Results:

           estimate      se     zval    pval    ci.lb    ci.ub
intrcpt     -0.0416  0.2648  -0.1570  0.8753  -0.5606   0.4774
Intensity   -0.0228  0.0070  -3.2617  0.0011  -0.0364  -0.0091  **
MethodRIL    0.4630  0.1803   2.5681  0.0102   0.1096   0.8163   *

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Now both intensity and method are statistically significant, but method less so. Let’s take it out:

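A sketch of the intensity-only model, with the fabricated stand-in data regenerated so the block runs on its own:

```r
library(metafor)

set.seed(1)  # the same made-up stand-in data as before
ROM <- data.frame(yi = rnorm(30, -0.4, 0.5),
                  vi = runif(30, 0.005, 0.05),
                  Intensity = runif(30, 5, 60),
                  Method = rep(c("Conventional", "RIL"), 15))

# intensity as the only moderator
Model3 <- rma.uni(yi, vi, mods = ~ Intensity, method = "ML", data = ROM)
summary(Model3)
```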


Mixed-Effects Model (k = 30; tau^2 estimator: ML)

  logLik  deviance       AIC       BIC
-19.1526  128.3747   44.3052   48.5088

tau^2 (estimated amount of residual heterogeneity):     0.1962 (SE = 0.0536)
tau (square root of estimated tau^2 value):             0.4429
I^2 (residual heterogeneity / unaccounted variability): 97.45%
H^2 (unaccounted variability / sampling variability):   39.27

Test for Residual Heterogeneity:
QE(df = 28) = 1258.6907, p-val < .0001

Test of Moderators (coefficient(s) 2):
QM(df = 1) = 25.2163, p-val < .0001

Model Results:

           estimate      se     zval    pval    ci.lb    ci.ub
intrcpt      0.4678  0.1927   2.4276  0.0152   0.0901   0.8455    *
Intensity   -0.0324  0.0065  -5.0216  <.0001  -0.0451  -0.0198  ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Unsurprisingly, intensity is now more significant. However, we have taken out method, which may have some value in explaining differences. The tool I use most for model selection is AIC. This basically aims to get the best-fitting model with the fewest explanatory variables. This is why I did the last step of removing method: it could be that the effect of method is significant but adds relatively little explanatory power to our model.

Let’s test this by calculating their AICs.



> AIC(Model1)
[1] 41.68604
> AIC(Model2)
[1] 40.26459
> AIC(Model3)
[1] 44.30516

We choose the model with the lowest AIC, in this case Model2.

To calculate the fit of our model, we can compare it to our intercept only model (the original meta-analysis from part 1), but we have to set method=”ML” since REML is a poor estimator of deviance.

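A sketch of that refit on the fabricated stand-in data (the object name `ROM.ma2` follows the formula used in this post):

```r
library(metafor)

set.seed(1)  # the made-up stand-in data again
ROM <- data.frame(yi = rnorm(30, -0.4, 0.5),
                  vi = runif(30, 0.005, 0.05))

# intercept-only model from part 1, refitted with ML so the
# deviances are comparable with the moderator models
ROM.ma2 <- rma.uni(yi, vi, method = "ML", data = ROM)
summary(ROM.ma2)
```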

Using the two tau-squared statistics we can calculate the proportion of heterogeneity explained by the moderators, relative to the intercept-only model:

1-\frac{\tau^{2}_{Model2}}{\tau^{2}_{ROM.ma2}}
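In code, that calculation might look like this (fitting both models on the fabricated stand-in data; rma objects store tau-squared as `$tau2`):

```r
library(metafor)

set.seed(1)  # the made-up stand-in data again
ROM <- data.frame(yi = rnorm(30, -0.4, 0.5),
                  vi = runif(30, 0.005, 0.05),
                  Intensity = runif(30, 5, 60),
                  Method = rep(c("Conventional", "RIL"), 15))

ROM.ma2 <- rma.uni(yi, vi, method = "ML", data = ROM)
Model2  <- rma.uni(yi, vi, mods = ~ Intensity + Method,
                   method = "ML", data = ROM)

# proportion of heterogeneity accounted for by the moderators
pseudo_R2 <- 1 - (Model2$tau2 / ROM.ma2$tau2)
pseudo_R2
```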

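The code that built `in.plot2` was lost from this copy of the post; a minimal sketch of how it might be constructed (all names are assumptions, data fabricated) is:

```r
library(ggplot2)

set.seed(1)  # the made-up stand-in data again
ROM <- data.frame(yi = rnorm(30, -0.4, 0.5),
                  vi = runif(30, 0.005, 0.05),
                  Intensity = runif(30, 5, 60),
                  Method = rep(c("Conventional", "RIL"), 15))

# points sized by precision (1/vi), coloured by logging method,
# with a precision-weighted regression line per method
in.plot <- ggplot(ROM, aes(x = Intensity, y = yi, colour = Method))
in.plot2 <- in.plot +
  geom_point(aes(size = 1/vi)) +
  geom_smooth(aes(weight = 1/vi), method = "lm", se = FALSE)
```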
in.plot3<-in.plot2+ylab("log response ratio")+xlab("Logging intensity (m3 ha-1)")

Very pretty.

Circle size is relative to weight for each study, since we weight the most precise studies most heavily. Blue points and line refer to reduced impact logging sites, while red refers to conventionally logged sites.

We can also transform it all to proportions for more intuitive interpretations:

in.plot3<-in.plot2+ylab("Proportional change\nin species richness")+xlab("Logging intensity (m3 ha-1)")

The graphs show that, for my made-up data, reduced impact logging has a less negative effect on species richness than conventional logging at similar intensities, but that increasing intensity reduces species richness under both methods.

I think that wraps it up for now. I hope you’ve found this useful. As ever any comments or questions are welcome.
I will be back to writing non-stats posts again soon!

Using metafoR and ggplot together: part 1

I really like meta-analysis, as some of you might already  know.

It has its critics, but used well it can help us achieve one of the major goals of applied ecology: generalisation. This way we can say how we think things generally work and draw conclusions bigger than individual studies.

However, for some reason people have got it into their heads that meta-analysis is hard. It’s not. The hardest thing is the same as always: coming up with a good question/problem.

However, for a first timer it can seem daunting. Luckily there are a few good R packages for doing meta-analysis, my favourite of which is metafor.

The aim of these posts is to take some of the mystery out of how to do this and was prompted by a comment from Jarret Byrnes on a previous blog post.

In the first post I will deal with effect sizes, basic meta-analysis, how to draw a few plots and some analyses to explore bias. In the second I will go through meta-regression/sub-group analysis and a few more plots. It shouldn’t be too hard to understand.

For this I will use a dataset based on some work I am doing at the moment looking at the effects of logging on forest species richness. However, I don’t really feel like sharing unpublished data on the web so I’ve made some up based on my hypotheses prior to doing the work.

First of all we need to install the right packages and open them

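That’s just:

```r
# first time only:
# install.packages(c("metafor", "ggplot2"))

library(metafor)
library(ggplot2)
```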


Next we need the data. Download it from here and get R to load it up:

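The download link hasn’t survived in this copy, so if you can’t get the file you can fabricate a data set with the same structure instead. All names and values below are made up; I’ve put the logged forests in group 1 so that, as in the post, negative effect sizes mean logging reduces richness.

```r
# logging <- read.csv("logging_data.csv")  # hypothetical file name, if you have a copy

# otherwise, fabricate something with the same structure: means, SDs and
# sample sizes for each group in each of 30 made-up studies.
# Group 1 here is the logged forest, group 2 the unlogged control.
set.seed(1)
logging <- data.frame(
  Study = paste("Study", 1:30),
  m1i = runif(30, 25, 60), sd1i = runif(30, 5, 15), n1i = sample(5:20, 30, replace = TRUE),
  m2i = runif(30, 40, 80), sd2i = runif(30, 5, 15), n2i = sample(5:20, 30, replace = TRUE)
)
head(logging)
```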

This file contains data on the species richness of unlogged (G1) and logged (G2) forests. It’s not perfect, as all of the data is made up, but it should be useful for an exercise.

First we need to calculate the effect size. This is essentially a measure of the difference between a treatment and control group, in our case logged (treatment) and unlogged (control) forests. We can do this in different ways; the most commonly used in ecology are the standardised mean difference (SMD) and the log response ratio (lnRR).

The SMD is calculated like this:

\frac{\bar{X_{1}}-\bar{X_{2}}} {SD_{pooled}}

where \bar{X_{1}} and \bar{X_{2}} are the two group means and SD_{pooled} is the pooled standard deviation (I won’t explain this here but there are good explanations elsewhere). Essentially this gives an effect size based on the difference between groups, with more imprecise studies having a smaller effect size.

The lnRR is calculated as:

\ln\left(\frac{\bar{X_{1}}}{\bar{X_{2}}}\right)
which gives the log of the proportional difference between the groups. I quite like this as it is easier to interpret than the SMD as the units can be back-transformed to percentages.

In metafor you calculate SMD by doing this:

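Assuming the fabricated stand-in data above (regenerated here so the block runs on its own), `escalc()` does the calculation; measure "SMD" gives Hedges’ g:

```r
library(metafor)

set.seed(1)  # the fabricated stand-in data again
logging <- data.frame(
  m1i = runif(30, 25, 60), sd1i = runif(30, 5, 15), n1i = sample(5:20, 30, replace = TRUE),
  m2i = runif(30, 40, 80), sd2i = runif(30, 5, 15), n2i = sample(5:20, 30, replace = TRUE)
)

# standardised mean difference; escalc adds the effect size (yi)
# and its sampling variance (vi) to the data
SMD <- escalc(measure = "SMD", m1i = m1i, sd1i = sd1i, n1i = n1i,
              m2i = m2i, sd2i = sd2i, n2i = n2i, data = logging)
head(SMD)
```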

and lnRR like this:

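With the same fabricated stand-in data, the lnRR version just swaps the measure:

```r
library(metafor)

set.seed(1)  # the fabricated stand-in data again
logging <- data.frame(
  m1i = runif(30, 25, 60), sd1i = runif(30, 5, 15), n1i = sample(5:20, 30, replace = TRUE),
  m2i = runif(30, 40, 80), sd2i = runif(30, 5, 15), n2i = sample(5:20, 30, replace = TRUE)
)

# log response ratio: metafor computes log(m1i/m2i), so with the logged
# forests as group 1, negative values mean fewer species after logging
ROM <- escalc(measure = "ROM", m1i = m1i, sd1i = sd1i, n1i = n1i,
              m2i = m2i, sd2i = sd2i, n2i = n2i, data = logging)
head(ROM)
```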

From now on in this post I will use the lnRR effect size.

Next we carry out a meta-analysis, using the variability and effect sizes of each set of comparisons to calculate an overall mean effect size, with the idea that we want to give most weight to the most precise studies.

In metafor you do that like this:

ROM.ma<-rma.uni(yi,vi,method="REML",data=ROM)

This is the code for a random-effects meta-analysis, which basically assumes there are real differences between study effect sizes and then calculates a mean effect size. This seems sensible in ecology, where everything is variable, even in field sites which are quite close to each other.

The output it spits out can be a bit daunting. It looks like this:

Random-Effects Model (k = 30; tau^2 estimator: REML)

  logLik  deviance       AIC       BIC
-27.8498   55.6997   59.6997   62.4343

tau^2 (estimated amount of total heterogeneity): 0.3859 (SE = 0.1044)
tau (square root of estimated tau^2 value):      0.6212
I^2 (total heterogeneity / total variability):   98.79%
H^2 (total variability / sampling variability):  82.36

Test for Heterogeneity:
Q(df = 29) = 2283.8442, p-val < .0001

Model Results:

estimate       se     zval     pval    ci.lb    ci.ub
 -0.4071   0.1151  -3.5363   0.0004  -0.6327  -0.1815      ***

Essentially what it’s saying is that logged forests tend to have lower species richness than unlogged forests (the model results bit), and that there is quite a lot of heterogeneity between studies (the test for heterogeneity bit).

We can calculate the percentage difference between groups by taking the estimate part of the model results, along with the se, and doing this:

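Plugging in the estimate (-0.4071) and standard error (0.1151) from the output above, the back-transformation is just exp() minus one:

```r
est <- -0.4071  # model estimate from the output above
se  <-  0.1151  # its standard error

# back-transform the log response ratio to a percentage change
(exp(est) - 1) * 100              # point estimate: about -33%
(exp(est + 1.96 * se) - 1) * 100  # upper 95% confidence limit
(exp(est - 1.96 * se) - 1) * 100  # lower 95% confidence limit
```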

to get things in percentages. So now we can say that logged forests have 33% fewer species than unlogged forests, +/- a confidence interval of 25%.

We can summarise this result using a forest plot in metafor, but this is a bit ugly.

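With the fitted model from before (sketched here on fabricated stand-in data so the block runs on its own), that is just metafor’s forest() function:

```r
library(metafor)

# stand-in for the ROM effect sizes built earlier (values made up)
set.seed(1)
ROM <- data.frame(yi = rnorm(30, -0.4, 0.5), vi = runif(30, 0.005, 0.05))

ROM.ma <- rma.uni(yi, vi, method = "REML", data = ROM)
forest(ROM.ma)  # metafor's built-in forest plot
```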


What did I tell you? Pretty ugly right?

Much better to use ggplot2 to do this properly. But this is a bit of a pain in the arse:

forrest_data$Study2<-factor(forrest_data$Study, levels=rev(levels(forrest_data$Study)) )
plot2<-plot1+coord_flip()+geom_hline(yintercept=0, lty=2,size=1)+scale_size_manual(values=c(0.5,1))
plot3<-plot2+xlab("Study")+ylab("log response ratio")+scale_colour_manual(values=c("grey","black"))

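Because `forrest_data` and `plot1` were built off-screen, here is a self-contained sketch of the whole thing (all data and names fabricated):

```r
library(ggplot2)

# fabricated study-level results plus a summary row
set.seed(1)
forrest_data <- data.frame(
  Study = factor(c(paste("Study", 1:10), "Summary")),
  yi    = c(rnorm(10, -0.4, 0.3), -0.41),
  type  = c(rep("study", 10), "summary")
)
forrest_data$ci.lb <- forrest_data$yi - 0.3
forrest_data$ci.ub <- forrest_data$yi + 0.3

# reverse the factor levels so the first study sits at the top after coord_flip()
forrest_data$Study2 <- factor(forrest_data$Study,
                              levels = rev(levels(forrest_data$Study)))

plot1 <- ggplot(forrest_data, aes(x = Study2, y = yi, ymin = ci.lb,
                                  ymax = ci.ub, size = type, colour = type)) +
  geom_pointrange()
plot2 <- plot1 + coord_flip() + geom_hline(yintercept = 0, lty = 2) +
  scale_size_manual(values = c(1, 0.5))
plot3 <- plot2 + xlab("Study") + ylab("log response ratio") +
  scale_colour_manual(values = c("black", "grey")) +
  theme_bw() + theme(legend.position = "none")
plot3
```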

There. Much better.

With a little tweak we can even have it displayed as percentage change, though your confidence intervals will not be symmetrical any more.

plot2<-plot1+coord_flip()+geom_hline(yintercept=0, lty=2,size=1)+scale_size_manual(values=c(0.5,1))
plot3<-plot2+xlab("Study")+ylab("Percentage change in richness\n as a result of logging")+scale_colour_manual(values=c("grey","black"))
plot3+theme(legend.position="none")+ scale_y_continuous(breaks=c(-50,0,50,100,150,200))


There that’s a bit more intuitively understandable, isn’t it?

Right, I will sort out the next post on meta-regression and sub-group analysis soon.

In the meantime feel free to make any comments you like.

R code for Meta-analysis summary plots

Meta-analysis summary plot

I think meta-analysis is great. I am aware that some out there are a bit less complimentary about it, but I like it. Sure, it can be used stupidly and give spurious results, but then so can ANOVAs or GLMs if you don’t know what you’re doing.

I also love R and have recently been converted to using ggplot2, one of the fantastic packages put together by Hadley Wickham. For me ggplot2 is the most aesthetically pleasing way to plot your data in R. It also has the ability to draw error bars easily, which in base graphics can be a bit of a faff. It’s endlessly customisable and can produce things of beautiful simplicity. I was quite proud of the rarefaction curves I produced the other day for example:

Some rarefaction curves we came up with the other day

Anyway, I recently did a meta-analysis and found the best way to plot the results was in ggplot2. It wasn’t really easy to get into the form commonly used for summary plots so I thought I’d stick up some sample code here in the hope that it might help some people stuck with the same problems. The data I used here is all made up but the code for the graphs should prove useful.

#this is a file to create the standard meta-analysis 
#summary plot using ggplot2

#first load in/create some data, this should be in the form of 
#summarised means and upper and lower confidence intervals

#these are the mean values (made up for illustration)
means<-c(0.2,-0.5,0.1,0.8,-0.3)
#and this is a bit of code to come up with randomly sized CIs
CIs<-runif(5,0.2,0.6)
#this is the upper 95% confidence interval
upper<-means+CIs
#this is the lower 95% confidence interval
lower<-means-CIs
#and this is a bunch of different taxonomic groups for my imaginary meta-analysis
treatment<-c("Birds","Mammals","Amphibians", "Insects", "Plants") 
#stick all the data together
summary_data<-data.frame(means,upper,lower,treatment)

#now install and run ggplot from the library
library(ggplot2)

#the easiest way for me to make a ggplot figure is to build things up
#a bit at a time. That way if you only need to change one bit you can do it
#without hunting through a heap of code

#this defines the elements to go in the plot, both the x and y and upper and lower CIs
a<-ggplot(summary_data,aes(x=treatment,y=means,ymax=upper,ymin=lower))
#this defines the plot type
b<-a+geom_pointrange()
#this flips the co-ordinates so your x axis becomes your y and vice versa
c<-b+coord_flip()
#this puts in a dotted line at the point of no difference between groups
d<-c+geom_hline(yintercept=0, lty=2,size=1)
#all of this gets rid of the grey grid and legends
#(theme() and element_blank() have replaced the old opts() and theme_blank())
e<-d+theme_bw()+theme(panel.grid.minor=element_blank(), panel.grid.major=element_blank(), axis.ticks=element_blank(), legend.position="none")
#this sets x and y axis titles
f<-e+ xlab('Taxonomic group') +ylab ('Change following removal of herbivores')
#this sets axis label size
g<-f+theme(axis.text.x=element_text(size=16, colour='black'), axis.text.y=element_text(size=16, colour='black'))
#this sets axis title size and there is your finished summary plot!
g+theme(axis.title.x=element_text(size=20, colour='black'), axis.title.y=element_text(size=20, colour='black'))


At the end you should have something that looks a bit like this:

Meta-analysis summary plot

though your error bars will obviously be a different width.

Hope someone finds this useful. Drop me a comment if you did.


I have been asked below to supply the code for the rarefaction curves. The data for the curves was produced using a program other than R (shock horror – do they even exist?). I think my friend used estimateS. Anyway once you have done that you can get the data in R and do something like this to it:

#script for drawing rarefaction curves#

#read in data#
leah<-read.csv("C:/Users/Phil/Documents/My Dropbox/rare2.csv")

#load the ggplot2 library (or install it if you don't have it yet!)
library(ggplot2)

#this sorts the order of the separate panels for each plot

#this tells ggplot what data to plot and the limits for the 'ribbon' around the curve

#this plots your line along with the ribbon and different 'facets' or panels and sets the y axis limits
b<-a+geom_line(size=0.5)+geom_ribbon(colour=NA,fill="blue",alpha=0.2)+facet_wrap(~sampleNo2,scales="free_x")+ylim(0,80)
#this changes the colour of your plots to white and gets rid of gridlines
c<-b+theme_bw()+theme(panel.grid.major=element_blank())
#this changes the angle of your x axis labels
d<-c+theme(axis.text.x=element_text(angle=90, hjust=1))
#this changes the size of your axis titles
e<-d+theme(axis.title.x=element_text(size=20, colour='black'), axis.title.y=element_text(angle=90, size=20, colour='black'))
#this sets your axis titles
f<-e+ylab ('Mean richness')+xlab ('Sample size')
#this plots the final plot
f
#and this saves it
setwd("C:/Documents and Settings/PMART/My Documents/Dropbox/")
ggsave("rarefaction_curves.png") #file name made up - ggsave saves the last plot drawn


The data we used can be found here.