Below is a list of several ongoing projects in my lab. Note that not all projects are necessarily active or current. They all represent research interests of mine, and work in these areas depends on the availability of funding, time, and "able bodies". For software packages developed as part of this research, see our Software page.
Note that I do not do research on machine learning, statistics, or graphics, though I use tools from these fields on occasion. My primary interests relate to algorithms for processing strings (pairwise alignment and multiple alignment of DNA or protein sequences) and graphs (uncovering interesting patterns in assembly graphs). I am also very interested in graph drawing and in software testing for bioinformatics/scientific applications.
Some of these projects provide opportunities for undergraduate research, either as summer projects or as part of the CS honors program. For more information on how to apply for such research opportunities, see our Undergraduate Programs page.
Metagenomics is a new scientific field that uses high-throughput genomic technologies to analyze the microbial communities that inhabit our bodies and our world (see our overview of metagenomics for more information). Our current research in metagenomics focuses primarily on two areas: metagenomic assembly and whole-metagenome analyses, with a particular focus on understanding the genomic variation within microbial communities; and the development of comparative methods that enable the analysis of clinical data-sets (generally comprising hundreds to thousands of samples). For the latter, see our comparative packages Metastats and MetaPath.
In addition, we are involved in several metagenomic projects analyzing the microbial, viral, and parasitic communities that cause diarrhea in third-world children; and analyzing the role microbes and viruses play in lung disease in HIV-infected patients.
One of our long-term goals in this field is to develop predictive models of microbial communities that will enable biologists to simulate community dynamics in order to better understand the effects of treatment or other external factors on health.
My work on genome assembly currently focuses primarily on the assembly of metagenomic data, with particular attention to uncovering genomic variation within the assemblies.
Also, we are interested in developing approaches for the validation of genome assemblies, in particular de novo validation approaches that can assess the quality of assemblies in the absence of a 'golden truth'.
The research performed in my lab is motivated and driven by real biological applications. In addition to doing basic research and writing software, the researchers in the lab work to analyze real biological datasets generated by our collaborators. Here are some examples from among the many projects we are, or have been, involved in:
In addition to the major research interests outlined above, my lab is also working on several other topics:
New DNA sequencing technologies are generating large amounts of data at a significantly higher pace than was possible just a few years ago. The analysis of new-generation sequencing data poses significant computational challenges, due both to the sheer size of the data-sets being analyzed and to individual characteristics of the new sequences. We are currently conducting research to evaluate whether highly parallel computing clusters can be used to efficiently analyze such data, with the goal of providing researchers with the ability to rent CPU cycles rather than having to implement and maintain an expensive computational infrastructure in their labs. We are primarily focused on algorithms for sequence alignment (see Crossbow) and for genome assembly (see Contrail).
For more information check out our High Performance Computing page.
We created a database, ARDB, of all the information we could easily extract from the literature and other public databases. This database is freely available to all scientists, both through the web and as a flat-file download from ftp://ftp.cbcb.umd.edu/pub/data/ARDB.
Together with colleagues at the NMRC we have developed a modular prokaryotic annotation pipeline, primarily for use in various genome projects we are involved in, but also as a framework for exploring research questions regarding the functional annotation of genomes and metagenomes. The software is available, open-source, from http://sourceforge.net/projects/diyg.
This research is an unexpected result of our research on genome assembly and on incorporating new types of data in the assembly process. We noticed that mate-pair information, optical mapping data, as well as other information generated during the genome assembly process (specifically assembly graphs) can be used to improve genome assemblies (e.g. by resolving certain classes of repeats) as well as to guide the design of experiments aimed at finishing genomes. We have successfully applied some of these ideas to the finishing of Aggregatibacter aphrophilus and Vibrio harveyi, and we are currently in the process of finishing Yersinia rohdei and Yersinia ruckeri.