Tuesday, September 18, 2007

Better late than never...


Wednesday, April 18, 2007

Perl scripts & LaTeX

Not really related to any news, but I've written a couple of Perl scripts for formatting the output from two tagging programs. In both cases the scripts take the standard output and generate LaTeX-formatted tables for subsequent use.

Tagger (de Bakker et al 2005) is implemented in the program Haploview, and uses pairwise LD to tag regions. The script for re-formatting the output is written in Perl and is available here.

ldSelect (Carlson et al 2004) is an alternative method of tagging SNPs, also based on LD. The output format from this is (in my opinion) a little awkward, as it refers to markers by their base-pair positions (mainly because the input file format is Prettybase). Thus, in addition to the output file, this script requires the user to provide a tab-delimited text file of the markers and their base-pair positions, and it converts the base-pair positions of markers to their rs numbers in the resulting LaTeX table. The script is again written in Perl and is available here.
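To illustrate the conversion step, here is a minimal Python sketch of the same idea (not the Perl script itself; the marker-file layout and the bin structure shown are invented for the example):

```python
def load_positions(lines):
    """Parse 'rs_number<TAB>position' lines into a position -> rs lookup."""
    lookup = {}
    for line in lines:
        rs, pos = line.strip().split("\t")
        lookup[pos] = rs
    return lookup

def bins_to_latex(bins, lookup):
    """Render ldSelect-style bins (lists of base-pair positions) as a
    LaTeX tabular, substituting rs numbers for positions where known."""
    rows = []
    for i, positions in enumerate(bins, start=1):
        markers = ", ".join(lookup.get(p, p) for p in positions)
        rows.append(f"{i} & {markers} \\\\")
    return ("\\begin{tabular}{ll}\n"
            "Bin & Tag SNPs \\\\\n"
            "\\hline\n"
            + "\n".join(rows) + "\n"
            "\\end{tabular}")

# Hypothetical marker file and bins, just to show the shape of the output.
marker_file = ["rs123\t10123", "rs456\t10456", "rs789\t10789"]
table = bins_to_latex([["10123", "10456"], ["10789"]],
                      load_positions(marker_file))
print(table)
```

Unknown positions fall through unchanged (the `lookup.get(p, p)` default), which mirrors the sensible behaviour when a marker is missing from the user-supplied file.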


Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA (2004) Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. American Journal of Human Genetics 74:106-120

de Bakker PI, Yelensky R, Pe'er I, Gabriel SB, Daly MJ, Altshuler D (2005) Efficiency and power in genetic association studies. Nature Genetics 37:1217-1223

Tuesday, January 09, 2007

To test or not to test, that is the question

A recent paper by Zou & Donner (2006) questions whether testing for Hardy-Weinberg equilibrium (HW-eqm) in case-control association studies is a viable strategy.

The main point they make is that genotyping error is unlikely to be detected by testing for departure from HW-eqm. This is for a number of reasons: firstly, the assumptions underlying HW-eqm; and secondly (perhaps more importantly), that a two-stage analysis, in which markers are screened for deviations from HW-eqm and only those that do not deviate are taken forward, inflates the Type I error rate, because a non-significant p-value cannot be interpreted as evidence that the alleles are independent (i.e. that HW-eqm holds).

Dropping the HW-eqm screen is of course appealing, as it reduces the computational burden (particularly when performing whole-genome association screens, where multiple testing becomes a big problem), but also because a large proportion of associations are seen at loci that do deviate from HW-eqm.

The authors propose an adjusted χ² test based on the difference in the variances of the estimated allele frequencies in cases and controls, which is essentially the same as the Cochran-Armitage trend test (Sasieni, 1997). The power and performance of this test are discussed in an upcoming paper in Annals of Human Genetics (Ahn et al 2007).
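For concreteness, the trend test is straightforward to compute from a 2 × 3 table of genotype counts. A minimal sketch of the standard Cochran-Armitage statistic (not the authors' adjusted variant) with the usual additive weights:

```python
def armitage_trend(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test for a 2 x 3 genotype table.
    cases/controls are counts of (AA, Aa, aa) in each group;
    returns the chi-square statistic (1 df)."""
    R1, R2 = sum(cases), sum(controls)
    N = R1 + R2
    C = [r + s for r, s in zip(cases, controls)]  # column (genotype) totals
    # Trend statistic: weighted contrast of case vs control counts.
    T = sum(w * (r * R2 - s * R1)
            for w, r, s in zip(weights, cases, controls))
    # Variance of T under the null of no trend.
    var = (R1 * R2 / N) * (
        sum(w * w * c * (N - c) for w, c in zip(weights, C))
        - 2 * sum(weights[i] * weights[j] * C[i] * C[j]
                  for i in range(len(C)) for j in range(i + 1, len(C))))
    return T * T / var

chi2 = armitage_trend(cases=(10, 20, 30), controls=(30, 20, 10))
print(chi2)  # 20.0
```

Because the statistic depends only on the weighted allele dosages, it is robust to departures from HW-eqm, which is precisely why it is attractive as a first-pass screen.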

So the upshot of all of this is that, as a first-pass screen, you're probably better off using a robust test than worrying about deviations from Hardy-Weinberg equilibrium.

References and Links

  • Zou GY, Donner A (2006) The Merits of Testing Hardy-Weinberg Equilibrium in the Analysis of Unmatched Case-Control Data: A Cautionary Note. Annals of Human Genetics 70:923-933

  • Ahn K, Haynes C, Kim W, Fleur RS, Gordon D, Finch SJ (2007) The Effects of SNP Genotyping Errors on the Power of the Cochran-Armitage Linear Trend Test for Case/Control Association Studies. Annals of Human Genetics, advance online publication

  • Sasieni PD (1997) From genotype to genes: doubling the sample size. Biometrics 53:1253-1261

Monday, January 01, 2007

Fast Genome Wide Association Analysis

Recently a new R package providing an interface for running PBAT in parallel was published in Bioinformatics. The package is described in the Open Access article here.

This is the first package I've come across that provides an easy interface for analysing the massive amounts of data being generated by the new genotyping arrays, such as Illumina's BeadStation and Affymetrix's SNP chips. Basically, the data are split up and small chunks analysed on each processor, with a separate instance of PBAT running on each one. The graphical R front end gives users an easy way to select their analysis options, and then handles splitting the data, sending off the jobs and amalgamating the results when they are completed.
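The split/dispatch/amalgamate pattern itself is simple. A Python sketch of the idea (names invented; the real package drives separate PBAT processes, whereas here a thread pool and a dummy per-chunk analysis stand in for the PBAT instances):

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(items, size):
    """Split the marker list into chunks of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def analyse_chunk(markers):
    """Stand-in for one PBAT instance: return a result per marker."""
    return {m: f"analysed({m})" for m in markers}

def run_parallel(markers, n_workers=4, chunk_size=2):
    """Dispatch chunks to workers and amalgamate the per-chunk results."""
    results = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(analyse_chunk, chunked(markers, chunk_size)):
            results.update(partial)
    return results

snps = [f"rs{i}" for i in range(1, 8)]
out = run_parallel(snps)
print(len(out))  # one result per marker: 7
```

Because each chunk is independent, the scheme scales with the number of processors, and the merge step at the end is trivial.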


Monday, October 23, 2006

A few special issues of journals have focused on Statistical Genetics recently.

The journal Statistics in Medicine is celebrating its 25th anniversary and recently published a special issue focusing on statistical genetic analyses. Articles cover genome-wide analyses, the estimation of genetic and environmental components in specific diseases, multiple testing and more. The full contents can be read here.

The October issue of Nature Reviews Genetics has a series of interesting articles on statistical genetics. Papers include reviews of population genetics software, methods of analysing molecular genetic variation, and the analysis of relatedness. There is also a thoughtful perspective paper on the need for longitudinal analyses for investigating the aetiology of complex diseases. The full contents can be read here.

Tuesday, September 05, 2006

Genetics Blogs....

A Research Highlight in Nature Reviews Genetics highlights the use of the much-hyped Web 2.0 (wikis, blogs, social networking sites, interactive localised information, etc.) by geneticists and picks out a few interesting sites (although it manages to miss this one; perhaps that's a bit of motivation to actually get on and write more regularly about interesting developments!!!).

The blogs listed are...

Mendel's Garden
Free Association
Daily Transcript
Genomics Policy

All a lot more topical than this little blog, but then the authors are probably more motivated to write than I am.

Wednesday, April 26, 2006

Rub the lamp and make a wish....

A recent article by Allen-Brady, Wong & Camp (2006) in BMC Bioinformatics details a new program called PedGenie, which provides a Java graphical user interface (GUI) for performing association and transmission disequilibrium tests (TDTs).

Personally I have mixed feelings about GUI interfaces for analysis. Firstly, they make it hard to keep work reproducible, which is an essential part of any investigation (Gentleman 2005). More importantly, because people are often very familiar with point-and-click interfaces, they can go ahead and perform their analysis without necessarily understanding whether it is appropriate, or, even worse, they'll perform every possible test they can and think that because they've found something "significant" they have found an important result. This may come across as elitist, but it's really not meant to be; when performing statistical analysis one needs to be very careful and precise about what one is doing.

On the flip side, GUIs provide a relatively easy interface to what can sometimes be an impenetrable field, since there is often a steep learning curve to becoming proficient enough in a given statistical package, OS or scripting language to perform one's analysis in a reproducible manner.

At the end of the day it's about finding a balance with the need to get things done, and hey, if it means there's one less person banging on my door needing their hand held through their analysis then it's not all bad!