Using metafoR and ggplot together: part 1

I really like meta-analysis, as some of you might already know.

It has its critics, but used well it can help us achieve one of the major goals of applied ecology: generalisation. This way we can say how we think things generally work and draw conclusions bigger than individual studies.

However, for some reason people have got it into their head that meta-analysis is hard. It’s not. The hardest thing is the same as always: coming up with a good question/problem.

However, for a first timer it can seem daunting. Luckily there are a few good R packages for doing meta-analysis, my favourite of which is metafor.

The aim of these posts is to take some of the mystery out of how to do this and was prompted by a comment from Jarret Byrnes on a previous blog post.

In the first post I will deal with effect sizes, basic meta-analysis, how to draw a few plots and some analyses to explore bias. In the second I will go through meta-regression/sub-group analysis and a few more plots. It shouldn’t be too hard to understand.

For this I will use a dataset based on some work I am doing at the moment looking at the effects of logging on forest species richness. However, I don’t really feel like sharing unpublished data on the web so I’ve made some up based on my hypotheses prior to doing the work.

First of all we need to install the right packages and open them

install.packages("metafor")
install.packages("ggplot2")

library(metafor)
library(ggplot2)

Next we need the data. Download it from here and get R to load it up:

# change the path to wherever you saved the file
logging <- read.csv("C:/whereveryouputthefile/logging.csv")

This file contains data on the species richness of unlogged (G1) and logged (G2) forests. It’s not perfect, as all of the data is made up, but it should be useful for an exercise.
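If you can’t get hold of the file, here is a minimal sketch that simulates a stand-in data frame with the same structure. The column names (Study, G1, G1SD, G1SS, G2, G2SD, G2SS) are assumptions based on the escalc() and ggplot() calls used later in the post, and the values are random, so any numbers you get will not match the output shown further down.

# A stand-in for logging.csv: random values only, so results will differ from the post.
# Column names are assumed from the escalc() and ggplot() calls used later.
set.seed(42)
k <- 30  # the summary output below is based on 30 comparisons
logging <- data.frame(Study = paste("Study", 1:k),
                      G1    = rnorm(k, mean = 50, sd = 10),    # unlogged (control) mean richness
                      G1SD  = runif(k, 5, 15),                 # unlogged standard deviation
                      G1SS  = sample(5:30, k, replace = TRUE), # unlogged sample size
                      G2    = rnorm(k, mean = 35, sd = 10),    # logged (treatment) mean richness
                      G2SD  = runif(k, 5, 15),                 # logged standard deviation
                      G2SS  = sample(5:30, k, replace = TRUE)) # logged sample size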

First we need to calculate the effect size. This is essentially a measure of the difference between a treatment and a control group, in our case logged (treatment) and unlogged (control) forests. We can do this in different ways; the two most commonly used in ecology are the standardised mean difference (SMD) and the log response ratio (lnRR).

The SMD is calculated like this:

\frac{\bar{X_{1}}-\bar{X_{2}}} {SD_{pooled}}

where \bar{X_{1}} and \bar{X_{2}} are the two group means and SD_{pooled} is the pooled standard deviation (I won’t explain this here, but there are good explanations elsewhere). Essentially this gives an effect size based on the difference between the groups, scaled so that studies with more variable (less precise) measurements end up with smaller effect sizes.
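To make the formula concrete, here is a hand calculation of the SMD for a single made-up comparison (the numbers are invented purely for illustration). Note that metafor’s measure="SMD" also applies a small-sample bias correction (Hedges’ g), so escalc() will give a slightly smaller value than this raw calculation.

x1 <- 35; sd1 <- 8;  n1 <- 10   # logged (treatment) mean, SD and sample size (made up)
x2 <- 50; sd2 <- 10; n2 <- 12   # unlogged (control) mean, SD and sample size (made up)
sd_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
(x1 - x2) / sd_pooled           # negative because the logged mean is lower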

The lnRR is calculated as:

\ln\bar{X_{1}}-\ln\bar{X_{2}}

which gives the log of the proportional difference between the groups. I quite like this as it is easier to interpret than the SMD, since it can be back-transformed to percentages.
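As a quick illustration of why that back-transformation is handy, here is the same made-up comparison on the lnRR scale (again, invented numbers):

x1 <- 35   # logged (treatment) mean (made up)
x2 <- 50   # unlogged (control) mean (made up)
lnRR <- log(x1) - log(x2)   # the same as log(x1 / x2)
lnRR
(exp(lnRR) - 1) * 100       # back-transformed: logged forests have 30% fewer species here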

In metafor you calculate SMD by doing this:

SMD <- escalc(data=logging, measure="SMD", m2i=G1, sd2i=G1SD, n2i=G1SS, m1i=G2, sd1i=G2SD, n1i=G2SS, append=TRUE)

and lnRR like this:

ROM <- escalc(data=logging, measure="ROM", m2i=G1, sd2i=G1SD, n2i=G1SS, m1i=G2, sd1i=G2SD, n1i=G2SS, append=TRUE)

From now on in this post I will use the lnRR effect size.

Next we carry out a meta-analysis, using the variability and effect sizes of each set of comparisons to calculate an overall mean effect size, the idea being that we want to give most weight to the most precise studies.
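To give a feel for what that weighting means, here is a rough sketch that calculates an inverse-variance weighted mean by hand from the yi and vi columns escalc() appended. These are the fixed-effect weights; the random-effects model fitted below also adds the between-study variance (tau^2) to each study’s vi, so it won’t give exactly the same answer.

w <- 1 / ROM$vi            # precise studies (small vi) get large weights
sum(w * ROM$yi) / sum(w)   # inverse-variance weighted mean effect size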

In metafor you do that like this:

ROM.ma <- rma.uni(yi, vi, method="REML", data=ROM)
summary(ROM.ma)

This is the code for a random-effects meta-analysis, which basically assumes there are real differences between study effect sizes and then calculates a mean effect size. This seems sensible in ecology, where everything is variable, even in field sites that are quite close to each other.

The output it spits out can be a bit daunting. It looks like this:

Random-Effects Model (k = 30; tau^2 estimator: REML)

  logLik  deviance       AIC       BIC
-27.8498   55.6997   59.6997   62.4343

tau^2 (estimated amount of total heterogeneity): 0.3859 (SE = 0.1044)
tau (square root of estimated tau^2 value):      0.6212
I^2 (total heterogeneity / total variability):   98.79%
H^2 (total variability / sampling variability):  82.36

Test for Heterogeneity:
Q(df = 29) = 2283.8442, p-val < .0001

Model Results:

estimate       se     zval     pval    ci.lb    ci.ub
 -0.4071   0.1151  -3.5363   0.0004  -0.6327  -0.1815      ***

Essentially what it’s saying is that logged forests tend to have lower species richness than unlogged forests (the model results bit), and that there is quite a lot of heterogeneity between studies (the test for heterogeneity bit).
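If you want confidence intervals around those heterogeneity estimates (tau^2, I^2 and H^2) rather than just point estimates, metafor’s confint() method will give you them (a small aside, not something we need for the rest of the post):

confint(ROM.ma)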

We can calculate the percentage difference between the groups by taking the estimate and standard error from the model results and doing this:

(exp(-0.4071)-1)*100        # back-transform the estimate to a percentage change
(exp(0.1151*1.96)-1)*100    # approximate 95% confidence interval, as a percentage

to get things in percentages. So now we can say that logged forests have roughly 33% fewer species than unlogged forests, +/- a confidence interval of about 25%.
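You can also get a back-transformed summary straight from metafor using predict() with a transformation function; the anonymous function converting lnRR to a percentage change here is my own addition, so treat it as a sketch:

predict(ROM.ma, transf=function(x) (exp(x)-1)*100)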

We can summarise this result using a forest plot in metafor, but this is a bit ugly.

forest.rma(ROM.ma)

[Figure: default metafor forest plot]

What did I tell you? Pretty ugly right?

Much better to use ggplot2 to do this properly. But this is a bit of a pain in the arse:

theme_set(theme_bw(base_size=10))
# combine the individual study results with the model summary in one data frame
forrest_data <- rbind(
  data.frame(ES=ROM.ma$yi, SE=sqrt(ROM.ma$vi), Type="Study", Study=logging$Study),
  data.frame(ES=ROM.ma$b, SE=ROM.ma$se, Type="Summary", Study="Summary"))
# reverse the factor levels so the studies run top to bottom after coord_flip()
forrest_data$Study2 <- factor(forrest_data$Study, levels=rev(levels(forrest_data$Study)))
levels(forrest_data$Study2)  # check the ordering
plot1 <- ggplot(data=forrest_data,
                aes(x=Study2, y=ES, ymax=ES+(1.96*SE), ymin=ES-(1.96*SE),
                    size=factor(Type), colour=factor(Type))) + geom_pointrange()
plot2 <- plot1 + coord_flip() + geom_hline(yintercept=0, lty=2, size=1) +
  scale_size_manual(values=c(0.5,1))
plot3 <- plot2 + xlab("Study") + ylab("log response ratio") +
  scale_colour_manual(values=c("grey","black"))
plot3 + theme(legend.position="none")

[Figure: forest plot drawn with ggplot2]

There. Much better.

With a little tweak we can even have it displayed as percentage change, though your confidence intervals will not be symmetrical any more.

theme_set(theme_bw(base_size=10))
plot1 <- ggplot(data=forrest_data,
                aes(x=Study2, y=(exp(ES)-1)*100,
                    ymax=(exp(ES+(1.96*SE))-1)*100, ymin=(exp(ES-(1.96*SE))-1)*100,
                    size=factor(Type), colour=factor(Type))) + geom_pointrange()
plot2 <- plot1 + coord_flip() + geom_hline(yintercept=0, lty=2, size=1) +
  scale_size_manual(values=c(0.5,1))
plot3 <- plot2 + xlab("Study") + ylab("Percentage change in richness\n as a result of logging") +
  scale_colour_manual(values=c("grey","black"))
plot3 + theme(legend.position="none") + scale_y_continuous(breaks=c(-50,0,50,100,150,200))

[Figure: forest plot of percentage change drawn with ggplot2]

There, that’s a bit more intuitively understandable, isn’t it?

Right, I will sort out the next post on meta-regression and sub-group analysis soon.

In the meantime feel free to make any comments you like.

26 thoughts on “Using metafoR and ggplot together: part 1”

  1. I’m sure you have it, but here is a handy introduction to meta-analysis for the readers of your blog:

    Harrison, F. 2011. Getting started with meta-analysis.
    Methods in Ecology and Evolution, 2, 1-10.

    I haven’t done my own meta-analysis, but I am very keen to try someday. What I find most off-putting is the actual screening of the literature and the standardisation of the raw data set. Maybe you can shed some light on my confusion.

    Here are my questions:
    First, would you only start a meta-analysis after you already have a good grasp of the available literature? Or would you start from scratch?

    Second, if you can start from scratch, would you read all the papers (in their entirety) or do you have useful heuristics that can screen suitable studies based solely on the abstracts?

    I tried to initiate a meta-analysis once, but found that I couldn’t source enough suitable studies to include in the analysis. You can read about it here. Unfortunately, I wasted a lot of time reading entire papers that ended up being useless; so any advice would be helpful.

    1. Hey Falko. Thanks for that article, useful for other people reading the blog.

      In answer to your first question – it is good to have a reasonable grasp of the literature. I have done two now and I know the one I had less feel for before I started was less successful. Personally, I would recommend that you do some reading on a subject before starting a meta-analysis. But you can still be a relative newcomer and do a good job.

      In answer to your second question, I would always carry out a systematic review. Andrew Pullin has written loads about this, but it really is the only way to be sure you are getting an unbiased set of studies to work with. Once I have a list of studies I go through it, first getting rid of the ones with irrelevant titles and then reading the abstracts and getting rid of those that seem irrelevant. After that I read the articles, more or less in full. I usually concentrate on the methods and results sections, as that is where I can find out whether the data is applicable to my question. I think there is a gap for someone to produce a program/tool to make this process easier.

      I think your last point is unfortunately one of the problems with meta-analysis. Sometimes you can have too much data and sometimes very little. I think this just comes down to knowledge of your field again.

      Does that help?

      1. Yes, that’s sensible advice. Confirmed my suspicions.

        I was secretly hoping you had some amazing insider’s short-cut that makes screening the literature an automated exercise! Too bad.

        Wouldn’t it be awesome if journals encouraged authors to add the means/effect sizes in abstracts, especially when the study uses a standardised experimental protocol? It would make doing a meta-analysis much easier.

        …On second thoughts, it would also make mindless machine-based data-mining a possibility – and I foresee many major mistakes if that becomes a reality. Maybe it’s not such a good idea after all? I don’t know.

      2. I really wish there was a push towards reporting means and effect sizes, plus geographic locations, in ecology and conservation journals. With the advent of the web there is no reason that all articles couldn’t have a data supplement containing that info. More often than not, though, I have had problems getting SEs and sample sizes from papers – this is unacceptable.

        The next thing for me would be to have a way that people can get access to very detailed info from studies. If we could get info on each replicate we wouldn’t need to use the same weighting mechanism as in current meta-analyses and could use a more mixed model approach. The only people I have seen do this are those from the predicts.org project, but it has so many other applications.

        Oh, and I share your fears about mindless data-mining. I have been guilty of it myself in the past. It’s not science, just fishing for p values in my experience.

  2. I’ve wanted a ggplot solution to forest plots for a while. Why use the circle over the square? It seems comparison of the weights using a square is easier, as it draws on a preattentive attribute, length, that is useful for comparing quantitative measures. The circles are more difficult to compare. This is likely as easy as changing the symbol being used, but I wanted to get your take on this as I’m sure you have a rationale for this design choice.

    1. Squares might be better. I have no particular reason for using circles, just that it’s the default symbol in ggplot2. I think if you are going to vary the size of the points based on the weight each study has in the analysis, squares make more sense. You can do this in ggplot2 by setting shape=15. It’s a good point, thanks!

  3. Nice post. I too am a fan of metafor, though yours is my first exposure to ggplot. I’ve even shown people how to use metafor to generate confidence bands around meta-regression predicted trend lines (see http://tinyurl.com/kysf3uv).

    In your post, you stated “However, for some reason people have got it into their head that meta-analysis is hard. It’s not. The hardest thing is the same as always, coming up with a good question/problem.”
    I would also agree with this statement, especially as someone who has contributed software solutions to meta-analysis for over 20 years (I’m the author of dstat, an early program for calculating and analyzing the standardized mean difference).
    One caveat: the ease with which the STATISTICS can be created should not be used to imply that the assumptions underlying these statistics are easy. They are often anything but. In fact, a lot of novice meta-analysts get themselves into trouble because they are too naive about what they are doing. It is not just finding a good problem; it is also using thorough search techniques, using duplication of effort in retrieval of studies, etc. Why not a few words to that effect?

    1. Thanks Blair.

      I agree completely that the method for how you actually collect data is one of the most important elements of a meta-analysis.

      I haven’t concentrated on this so much because I assumed that since the readers of this blog are largely ecologists, they should know something about this already. Andrew Pullin et al have published a series of really well known papers on how to do systematic reviews in ecology. Having said this, I still see numerous meta-analyses that do not carry out systematic reviews and therefore may actually miss studies that are relevant to their question.

      I agree that in a perfect world duplication of effort in collating studies would be best but I personally don’t do it. There is no way I could do it as my PhD project is basically me with some supervisors offering advice. I do aim to be very conservative when removing studies that I deem not to be relevant.

      Does that answer what you wanted? I will stick up a link to the Pullin papers tomorrow as well.

  4. Hi Phil,
    I am the author of the metafor package. I agree that the default forest plot looks a bit “plain” (ok ok, ugly). But that should only be the starting point for making it look a bit nicer. How about this:

    ROM.ma <- rma.uni(yi, vi, data=ROM, slab=Study)
    par(mar=c(4,4,2,2))
    forest.rma(ROM.ma, cex=.7, order=order(ROM$Study), xlim=c(-5,4), at=log(c(.1, .25, .5, 1, 2, 4)), atransf=exp, xlab="Response Ratio (log scale)", psize=1)
    text(-5, 32, "Author and Year", pos=4, font=2, cex=.7)
    text( 4, 32, "ROM [95% CI]", pos=2, font=2, cex=.7)

    Much better now (IMHO). I would make the width of the plotting device a bit more narrow as well. Looks even better that way.

    Of course, the figure you created with ggplot2 is very nice as well and ggplot2 is ultimately much more versatile. Always nice to have options!

    1. Hi Wolfgang,

      Thanks for the response!

      I like your solution; it comes out really well. However, I have always found ggplot2 a bit easier to use and as a result have neglected learning a lot of the standard plotting techniques in R.

      Even though I called your plots ‘ugly’ I really love the package – it’s so much better than all the other meta-analysis packages and programs I have seen before, so thanks for producing it in the first place!

      1. For me, it’s the other way around. I am quite familiar with the plotting stuff in the default “graphics” package and have not delved into ggplot2 enough. If someone was really ambitious, one could provide ggplot2 equivalents of all the plotting functions in metafor. That would be awesome (and quite time consuming).

        And thanks for the positive comments about the package.

  5. Thanks for this great post Phil, really helpful. Did you get around to publishing a post about meta-regression and subgroup analysis?

  6. Thanks for this really useful post! Forest plots are really important in medical statistics where we do a lot of systematic reviews and meta-analyses. To use ggplot2 (as I’d like) I would need the summary to be in the form of a diamond and to have a table of values and intervals alongside (e.g. http://en.wikipedia.org/wiki/File:Generic_forest_plot.png). Would be good to hear any ideas as to how to approach this. The diamond might be tricky because it would need to be asymmetrical if the lower and upper CLs are of different lengths. I thought the geom_polygon() function could be useful, but I can’t think of a way to do it with a discrete x axis.

  7. Liked the post and found it useful… a few quick questions. Is it possible to still use the response ratio if the studies don’t report SD or N? I’m dredging through some very old reports and attempting a meta-analysis, but the older data doesn’t report even a sample size, never mind a standard deviation or standard error. Should I give up on the idea of a meta-analysis?

    1. Hi Emilie. First of all, don’t give up. Working out the consequences of ignoring SD and sample size depends a bit on what kind of study you are doing. However, I have done this in the past with a paper I got published last year. There is no problem using a response ratio; you just have to be aware of what you are doing. Send me an email and I can send you the PDF of our paper if you want?

  8. Hello! Thanks for the post! I was trying to follow the instructions with my data, but it gives some errors, and I cannot compare it with your work because your data is not uploaded anymore. Would you mind uploading it again? That would help a lot! Thanks in advance!

    1. Hi Pilar. I’m actually away from the office until about Christmas and don’t have access to the computer with that data on it, so I’m sorry but I can’t do much to fix that at the moment.

      1. Ok, looking forward to when it is available again! For simplicity, you could post example data (random values with correct format would be enough).

  9. I know that with the transf argument in forest {metafor} you can back-transform the log ratio to a ratio, but do you know if you can plot the forest graph in percentages? I mean, without this trick of moving the meta-analysis to ggplot, but directly in forest. Thanks

  10. Regarding your previous reply to Emilie, I recently read a paper by de Graaff et al. (2005) Global Change Biology where “studies were weighted by experimental replication, using the function (na × ne)/(na + ne) (Hedges & Olkin, 1985; Adams et al., 1997), where na and ne represent the number of replicates under ambient and elevated CO2, respectively.” For those cases when you don’t have SD.

  11. Great post on meta-analysis for beginners like me. Looking through your blog, I see your dataset has the mean, standard deviation and number of replicates for each observation.

    My question for you is: is it possible to perform a meta-analysis if the dataset lacks information on the standard deviation? I found that most of the published articles in my field of interest have not reported the standard deviation or any other measure of variance. They have only reported the mean and number of replicates. Going through some previous meta-analyses, I found that CIs can be calculated using the boot package through a resampling test. However, I am unable to perform a cumulative group analysis (calculating the mean effect size and CI for each categorical sub-group).

    Can someone provide me with guidance in this regard, or share R scripts if you have them? Thanks!
