Pollination is a vital ecosystem service and a key consideration for food security. Globally, 75% of crops depend on animal pollination, with vegetables and fruits being the most dependent on insects. Despite their importance, honeybees and wild pollinators are facing declines throughout the world due to habitat loss, agricultural intensification, pests, disease and climate change. Pollinating insects require access to suitable plants for foraging and, as native habitats decrease, gardens may become increasingly important refuges. Our research investigates the plants that pollinators need and the extent to which these can be provided within gardens.
We use DNA metabarcoding of pollen collected by insects to track which plants they visit. Pollen is retrieved from the bodies of insects or extracted from honey. DNA from the pollen is amplified using DNA barcode markers and sequenced using next-generation sequencing (Illumina MiSeq). Key to the ability to identify unknown DNA samples is a comprehensive DNA barcode reference library. We have DNA barcoded all of the native flowering plants of the UK, along with non-native plants likely to be important for pollinators.
The National Botanic Garden of Wales and agricultural habitats are used as study sites to assess plant use by different pollinator groups in order to build a temporal and spatial picture of foraging. The vegetation within the botanic garden has been mapped and plants in flower are recorded on a monthly basis. Honey is collected from honeybee colonies within the botanic garden and DNA metabarcoded to see which plants are used compared to those available. This approach is being extended to other pollinator groups in order to examine resource partitioning. Our results show the importance of native and near-native plants within gardens.
We are using our findings to develop evidence-based guidance on horticultural best practice for pollinators and we are creating seed mixes and planting-plans. We use wide public engagement within the botanic garden to highlight the importance of pollinators. This includes an Apiary and Bee Garden, Butterfly House and art-science exhibitions.
Common practice suggests that human-origin genome data should be deposited in public repositories for further reuse. The current system for finding and accessing deposited genome datasets, however, is cumbersome and not scalable, leading to a huge loss of mining opportunities. Manuel Corpas argues that established public efforts for maintaining and archiving controlled access genome datasets are barely serving their purposes of enabling data re-utilisation; and that in order to address this bottleneck, access protocols and dataset descriptors require better coordination and standardisation.
tissue specificity super-enhancers regulatory elements predicting phenotypes machine learning
Densely spaced clusters of active enhancers, called super-enhancers, have recently been found to control and maintain cell identity in mammalian genomes. Disease-associated sequence variations are enriched in super-enhancers of disease-relevant cell types compared to typical enhancers and have been proposed to be involved in development and disease. However, the structure and function of super-enhancers are controversial and not completely understood. In addition, the extent of the relationship between tissue-specific regulatory regions (TSREs) and the phenotypes of the genes they control, and the mechanisms underlying it, remain unclear.
Using an in silico integrated approach, we present a novel analysis to identify genome-wide TSREs that regulate key cell identity genes and are associated with mutant mouse phenotypes. Using histone modification data in 22 mouse epigenomes, we systematically predicted genome-wide regulatory elements, identified highly tissue-specific enhancers and active promoter elements, and produced a catalogue of super-enhancers in every epigenome. We demonstrate that super-enhancers, compared to other TSREs, drive significantly higher expression of target genes and maintain their cell-type specificity, have a stronger correlation with mammalian phenotypes, and are enriched with known and novel transcription factors that define the biology of these epigenomes. These results suggest that super-enhancers are regulatory entities in their own right and conceptually distinct from other TSREs. Furthermore, using machine learning we show that combining TSRE information with existing gene expression profiles and protein-protein interactions improves our capability to predict mammalian phenotypes.
Overall, this study highlights the strong association between TSREs and mouse phenotypes as well as providing a pool of candidate regulatory elements and genes for hypothesis-driven biology.
metagenomics assembly 16S rRNA diversity
Metagenomics is the study of the genomes of entire microbial communities based on samples obtained directly from the environment and with little knowledge as to what species might be included in each sample. The term “environment” is used loosely and indicates anything that isn’t a well-defined organism. Examples include soil, ocean water, fresh water, ice, or human (or animal) biomes such as those found in the gut or the mouth. Those “biomes” are known to host a large variety of microorganisms which together constitute a genome quite different from that of the host and have been proven to play a significant role in disease processes and other characteristics such as propensity to obesity or the strength of the immune system.
Existing research on metagenomics focuses primarily on identifying known species within the sample and, in some cases, studying their relative abundance under certain (normally artificial) stresses such as chemicals or other pollutants. This confinement to known/culturable species means that the research relies heavily on existing methods of analysis, such as 16S rRNA analysis for species identification, and known genome assembly algorithms used for standard DNA or RNA analysis where an organism is known and its genome has been mapped.
Our research attempts to reframe the problem and widen the premise of metagenomics research. The main question posed is: “What happens to a microbial community if it is exposed to certain types of stress?” The stress can be intermittent, such as a cyclical spike in temperature; temporary, such as a flood; or permanent, such as decreased soil moisture. Of particular interest is the change in the “functional diversity” of the community due to this stress. Functional diversity is the range of functions the community is able to perform. This can include metabolic functions such as respiration, cell division, or fermentation, or functions specific to the environment such as digestion, temperature regulation, aiding in the growth of particular types of plants, etc. A study of the changes these functions undergo due to stress can shed light on an environment’s ability to adapt to change, as is the case in conditions like climate change or antibiotic resistance. Our hypothesis is that by eliminating the assembly and species identification steps we can gain novel insights about a community's functional profile as well as discover potentially new protein domains or entire microbial species.
This talk will give a brief overview of metagenomics and outline the new methodology proposed to analyse microbial communities along with possible application areas.
cryptosporidium comparative genomics assembly parasitology
Cryptosporidium is a genus of apicomplexan protozoan parasites responsible for diarrhoeal disease worldwide and a well-known pathogenic contaminant of water sources. Because of this, it is capable of causing sudden outbreaks, such as the epidemic in Milwaukee in 1993, where 403,000 people fell ill with cryptosporidiosis, including 19 deaths. The source was traced to a contaminated drinking water supply. In immunocompetent individuals, the disease is self-limiting, with cases usually resolving within a few weeks without need for treatment. However, in immunosuppressed patients, the disease is potentially life-threatening. Furthermore, the only effective treatment for cryptosporidiosis, nitazoxanide, lacks efficacy in immunosuppressed patients. Because of this, the ability to accurately track the spread of the disease in cluster cases and epidemics is of great importance in identifying the source of infection and developing effective prevention strategies.
The diagnosis of Cryptosporidium favours two approaches: microscopic and molecular. Microscopic diagnosis is generally the first-pass approach because it identifies Cryptosporidium within a faecal specimen quickly and effectively. However, because human-infective Cryptosporidium species are morphologically identical, molecular and genetic approaches are necessary to identify isolates to species and subtype level. Currently, species identification is carried out using 18S rDNA qPCR analysis. Cryptosporidium spp. are further subdivided by analysis of a sporozoite surface protein gene, gp60, owing to the presence of a highly polymorphic Variable Number Tandem Repeat (VNTR) region within this gene. However, the lack of a consensus MLST approach, the sexual nature of the Cryptosporidium life cycle, and the fact that Cryptosporidium is known to recombine around the gp60 VNTR allele necessitate the identification of further biomarkers around the genome of this protozoan that can be used in tandem with the gp60 allele to form novel subtyping paradigms. These subtyping approaches can furnish epidemiological data to elucidate transmission cycles and so improve understanding of how the parasite spreads through a population during an outbreak, which is essential for developing novel prevention strategies.
Our research involves identifying novel biomarkers around the genome of Cryptosporidium using comparative genomic approaches, and developing tools to automate processes that can be run on non-HPC systems, primarily workstation computers and laptops. We have developed a pipeline which identifies and analyses VNTRs within the genomes of multiple isolates of Cryptosporidium to establish whether they can be used as viable alternatives to, or in tandem with, the gp60 subtyping approach. Using this pipeline, 213 potentially viable VNTR alleles within the genome have been identified as suitable for interrogation. This work requires sequenced and assembled high-quality Cryptosporidium genomes, which have been provided by the work of Hadfield et al (2015) and Swain et al (2012). However, the intention is to avoid the highly computationally intensive step of genome assembly and apply these methodologies directly to sequencing reads.
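As a rough illustration of the kind of scan a VNTR pipeline performs, the sketch below finds perfect tandem repeats in a DNA string with a regular expression. It is not the pipeline described above: the function name `find_tandem_repeats`, the unit-length bounds and the minimum copy number are assumptions chosen for illustration, and a real VNTR caller would also need to tolerate imperfect repeats.

```python
import re

def find_tandem_repeats(seq, min_unit=2, max_unit=6, min_copies=3):
    """Return sorted (start, unit, copies) tuples for perfect tandem repeats.

    Hypothetical helper: the thresholds are illustrative defaults only.
    """
    hits = []
    for unit_len in range(min_unit, max_unit + 1):
        # One captured unit followed by at least (min_copies - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for match in pattern.finditer(seq):
            copies = len(match.group(0)) // unit_len
            hits.append((match.start(), match.group(1), copies))
    return sorted(hits)

# Toy sequence (not real Cryptosporidium data): four copies of "TCA".
toy_seq = "GGTCATCATCATCAGG"
repeats = find_tandem_repeats(toy_seq)
```

Comparing the copy number found at the same locus across isolates is what would make such a repeat usable as a subtyping marker. Note that a long repeat can be reported more than once when its unit is itself a multiple of a shorter unit.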
metagenome genes open reading frames
Metagenomes can be described as an environmental sample of genetic material. Current methods of annotating this genetic material rely on referencing previously annotated cultured organisms stored in large databases. This can be a problem when posed with novel genes and organisms in an environmental sample as there is no reference for them. In some metagenomic samples this lack of annotation can be as large as 50% of the recovered material. Therefore a number of new and existing tools should be studied to see whether they could uncover this metagenomic ‘dark matter’.
Recent studies suggest that current knowledge of gene identification and annotation may not be as universal or accurate as once thought. Are these rigid rules to blame for the inability to assign function to this dark matter?
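To make the notion of "rigid rules" concrete, here is a minimal sketch of a conventional open reading frame scan. The rules it encodes (an ATG start, the three standard stop codons, forward strand only, a minimum length) are assumptions picked for illustration and are not taken from any specific tool; the name `find_orfs` is hypothetical. A gene that uses an alternative start codon such as GTG or TTG, or that is truncated at the end of a read, would be missed by a scan like this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) pairs of forward-strand ORFs, end exclusive.

    An ORF here runs from the first ATG in a frame to the next in-frame stop.
    min_codons counts the start and stop codons as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

# Toy sequence: ATG AAA TAG embedded in frame 2.
toy = "CCATGAAATAGCC"
orfs = find_orfs(toy)
```

Relaxing any one of these rules changes which stretches of sequence are called genes, which is one way tools could be compared on unannotated material.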
automatic phenotyping arabidopsis computer vision
The combination of digital cameras and computer vision techniques is changing the way plant morphological traits are measured (i.e. phenotyped). Image analysis allows size and shape metrics to be calculated with little manipulation of specimens, in a dynamic, time-resolved and non-destructive manner. Persistent images taken at multiple time points and viewpoints are stored for later analysis, allowing many parameters to be calculated and the same image to be revisited whenever a new metric is required.
In addition, digital images facilitate shape metrics that are otherwise impossible to calculate from analogue (continuous) images or live specimens. Examples are shape descriptors such as circularity (Perimeter^2/Area, or Haralick's robust version, spatial average/spatial variance), eccentricity, compactness, etc. Statistical shape descriptors make use of the distribution of pixels in discrete space to measure how close an image object is to a given geometrical object, e.g. circularity as similarity to a circle.
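The Perimeter^2/Area form of circularity can be sketched on a binary pixel mask as follows. This is an illustrative simplification, not the implementation used on the phenotyping platform: the perimeter is approximated by counting exposed pixel edges (an assumption; other perimeter estimators give different values), and the function names are hypothetical. For a continuous circle the ratio equals 4π (about 12.57), and less compact shapes give larger values.

```python
def area_and_perimeter(mask):
    """mask: list of equal-length rows of 0/1 values.

    Area is the number of foreground pixels; perimeter is the number of
    pixel edges that border background or the image boundary.
    """
    rows, cols = len(mask), len(mask[0])
    area = perimeter = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            area += 1
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                inside = 0 <= rr < rows and 0 <= cc < cols
                if not inside or not mask[rr][cc]:
                    perimeter += 1
    return area, perimeter

def circularity(mask):
    """Perimeter^2 / Area; lower values indicate a more compact shape."""
    area, perimeter = area_and_perimeter(mask)
    if area == 0:
        raise ValueError("mask contains no foreground pixels")
    return perimeter ** 2 / area

# Toy masks: a compact 3x3 block and an elongated 1x9 strip of equal area.
block = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(5)] for r in range(5)]
strip = [[1] * 9]
```

The two toy masks have the same area, but the strip scores much higher than the block, which is the kind of contrast that separates elongated from rounded leaf blades.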
A set of statistical shape descriptors implemented in robotic phenotyping platforms has been used to quantify shape changes in developing juveniles of the rosette plant Arabidopsis thaliana. It has been observed that Arabidopsis juveniles from several geographical origins develop through divergent growth trajectories, such as short vs long petioles (leaf stalks) and long vs round leaf blades. Twenty shape descriptors were calculated on daily images of ~500 lines derived from a Multiparent Advanced Generation InterCross (MAGIC) population. Shape descriptor values over time describe rosette development as a multivariate feature vector.
This vector is transformed into principal components to obtain the major axes of variability that best represent the developmental pathway and separate rosette size from shape. Principal components and raw values were used as input for a Genome-Wide Association Mapping (GWAS) workflow that localises genomic factors associated with differences in rosette development.
Altogether, the shape-descriptor-by-time GWAS analysis found 43 genomic regions associated with rosette shape differences in the MAGIC population. Some of these regions harbour candidate genes well known to be associated with rosette morphology. One example is the gene ERECTA, whose mutation "er" is related to compact rosettes and tougher stems. Other putative genes related to rosette shape are phytochromes B and D, known to be related to petiole length and the response of aerial parts to low light conditions. Interestingly, some regions contain genes related to flowering time and other environmental responses, suggesting that shape descriptors and GWAS in MAGIC could be sensitive to small variations in rosette morphology caused by environmental variation. Our proposed strategy could therefore be suitable for studying the effect of stressors such as low or high light intensity, moderate drought stress and others under gradual conditions rather than widely contrasting ones.
cancer cell-lines pre-clinical models machine learning
Pre-clinical cancer models, such as tumour-derived cell-lines and animal models, are essential in cancer research. Consistently used as a platform to investigate mechanism of action, they can identify potential biomarkers prior to clinical trials where similar exploration is more complicated and expensive. However, whilst cell-lines are the most used pre-clinical model, their applicability in certain settings is questioned because of the difficulty of aligning the appropriate cell-lines with a clinically relevant disease segment.
We aim to develop computational tools that determine, for a given pre-clinical model, its suitability for clinical experiments and the most relevant disease segment. This would increase the information available to researchers when choosing a pre-clinical model prior to experiments and thus potentially reduce the use of unsuitable models and increase the reliability of the experiments conducted.
Genomics profiling data from patient tumours (The Cancer Genome Atlas) and cell-lines (Cancer Cell Line Encyclopaedia) were used to train and test the methods. Machine learning techniques (including random forests, principal component analysis, Gaussian processes) were applied to create predictive models based on patient training data. Their accuracy was evaluated on the patient test set and then applied to cell-line data.
Endometrial and breast cancer classification achieved good correspondence with established subtypes (around 0.90 AUC). With the appropriate classifiers (copy-number for endometrial, expression for breast), cell-lines were mostly accurately differentiated into their respective subtypes. Cancer-related genes were predominant among the most influential genes in the models' decision making.
Whilst most cell-lines associated with clinically relevant segments, a significant number were ambiguous. Furthermore, cell-line suitability scores across different subtypes were not complementary: a cell-line that is inappropriate for one subtype is likely to be inappropriate for the other. We will refine the methodology and ultimately develop an online scoring tool to improve the usage of pre-clinical cancer models in therapeutic testing.
sigma factors regulation plasmid copy regulation rule-based genome-scale models
Synthetic Biology has the ultimate objective of designing cells with predictable responses. Our ability to develop modified and synthetic organisms tailored to chemical production is fostered by our ability to recombine DNA with error-free protocols. However, our current capacity for modeling how cells work lags far behind our synthesis and analysis tools, which hinders the prediction of desired cell responses. Computational modeling has nonetheless had a prominent impact on Synthetic Biology, where the manipulation of biological systems is cost-intensive and computational resources can complement experimental procedures. Traditionally, Ordinary Differential Equations (ODEs) have been employed to model biological systems, but their assumptions are not realistic. In particular, it has long been known that biological processes are stochastic, discrete and structurally complex, properties that systems of differential equations struggle to capture. Even if noise is considered, modelers would be making assumptions about how cell components traveling between compartments affect physically separated processes, how they bind each other, and how they produce behaviors that resemble cooperativity and competition.
To strengthen the connection between modeling and designing organisms, we present a rule-based model simulated using Gillespie’s Stochastic Simulation Algorithm. Under this approach, rules are macroscopic chemical reactions between entities that recapitulate one or several patterns necessary for a transformation. The rate associated with each rule represents how often a reaction fires in a given time. Notably, our laboratory has developed software called PISKaS that enables explicit compartmentalized modeling in the Kappa language. We modeled two gene regulatory networks of E. coli: the core network that regulates transcription, and the replication of the ColE1 plasmid. The mean and variance of selected variables were analyzed in these examples, which were simulated with arbitrary rates; surprisingly, their properties are in close agreement with experimental data. Specifically, when the core transcription network reached pseudo-equilibrium, it predicted free RNA polymerase holoenzyme close to 20%, relatively near the 30% reported for exponentially growing E. coli. Similarly, plasmid replication controlled by negative feedback showed saturation dynamics, producing tens or hundreds of copies, depending strongly on the rate of interaction between its non-coding RNAs.
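The core of Gillespie's algorithm can be sketched in a few lines. The example below is a generic direct-method implementation applied to a toy production and decay system; it is not PISKaS, it does not use Kappa, and it ignores compartments. The species name `rna`, the two rules and their rates are arbitrary assumptions for illustration.

```python
import random

def gillespie(state, rules, t_end, seed=0):
    """Direct-method stochastic simulation.

    state: dict of species counts.
    rules: list of (propensity_function, change_dict) pairs.
    Returns a list of (time, state_snapshot) tuples.
    """
    rng = random.Random(seed)
    t = 0.0
    state = dict(state)
    trajectory = [(t, dict(state))]
    while True:
        propensities = [fn(state) for fn, _ in rules]
        total = sum(propensities)
        if total <= 0:
            break  # no rule can fire any more
        t += rng.expovariate(total)  # exponential waiting time to next event
        if t > t_end:
            break
        # Pick one rule with probability proportional to its propensity.
        threshold = rng.random() * total
        cumulative = 0.0
        for a, (_, change) in zip(propensities, rules):
            cumulative += a
            if threshold < cumulative:
                break
        for species, delta in change.items():
            state[species] += delta
        trajectory.append((t, dict(state)))
    return trajectory

# Toy system with arbitrary rates: constant production, first-order decay.
k_make, k_decay = 10.0, 0.1
rules = [
    (lambda s: k_make, {"rna": +1}),
    (lambda s: k_decay * s["rna"], {"rna": -1}),
]
traj = gillespie({"rna": 0}, rules, t_end=200.0, seed=1)
```

For this toy system the copy number fluctuates around k_make / k_decay once the transient has passed, so the mean and variance of a trajectory can be compared against a known expectation, in the same spirit as the comparison with experimental data described above.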
We are aware of limitations in our example models. We considered cells in a pseudo-stationary state, and therefore disregarded the need to model metabolism, translation and protein degradation or dilution, although these processes could easily be incorporated in successive refinements. Importantly, modeling metabolism and linking it to transcription and translation could facilitate a more reliable prediction of phenotype emergence. To this end, a Gene Regulatory Network (GRN), a Genome-Scale Metabolic Model (GSMM) and, optionally, protein-protein and RNA-protein interaction networks will serve as inputs for writing draft models. We sought to automatically write a genome-scale model of replication, transcription, translation, and RNA and protein degradation, joined to metabolism. For instance, we wrote a combined metabolism and gene expression model that resembles the published model of the central metabolism of E. coli (MODEL1505110000), employing the RegulonDB GRNs and the iJO1366 GSMM, which resulted in dynamics comparable to those of the published ODE model.
hyperspectral imaging water stress bioenergy crops
Arundo donax L., common name giant cane or giant reed, is a plant that is widespread in many environments of temperate and hot areas across the world. Due to its high productivity, adaptability to marginal land conditions, and suitability for biofuel and biomaterial production, it is a candidate energy crop for use in biomass-to-liquid fuel conversion and biorefineries. As its cultivation for these purposes is relatively recent, its growth, management and breeding to improve stress tolerance are currently being studied. Because of its dense canopies, often 4 m tall, it is difficult to study, but Unmanned Aerial Vehicles (UAVs) equipped with hyperspectral cameras offer a potentially attractive means of providing remotely sensed data at high spatial resolution, which will allow detailed study of these crops.
Here, hyperspectral images were acquired from a UAV in the VNIR spectral region (400 – 1000 nm) of canopies of three ecotypes of A. donax grown under well-watered and droughted conditions. The steps taken to analyse hyperspectral images will be described, as well as the multivariate analysis techniques used to obtain relevant information from the data extracted. The development of water stress indicators based on hyperspectral imagery for field phenotyping in response to water constraints in A. donax will be described. Other suggestions for multivariate analyses appropriate to these data will be welcome!
genome rearrangements evolutionary breakpoints cancer breakpoints lowest common ancestor
Genome rearrangement is one of the major forces that drive evolution. It happens when germline DNA is broken at two or more places (breakpoints) and reassembled in a way that changes the landscape of the genome. This event does not occur randomly along the genome: certain regions, termed evolutionary breakpoint hotspot regions (EBHRs), are more susceptible to rearrangement in the course of genome evolution. Genome rearrangements in somatic cells that contribute to cancer development also occur in a non-random fashion. Although evolutionary breakpoints (EBrs) and cancer breakpoints (CBrs) are under different selective pressures, they are known to be associated with similar functional markers and to result from similar molecular mechanisms. Moreover, a number of observations of cancer rearrangements that coincide with EBrs have been reported previously [1, 2]. Nonetheless, no systematic method has yet been proposed to evaluate the correlation between EBrs and CBrs and the affinity of CBrs for EBHRs.
Given the human 44-way alignment of the ENCODE project [3] and the corresponding species tree, we developed an original Lowest Common Ancestor predictor for EBrs and a statistical framework to predict genomic regions enriched for EBrs (EBHRs) and CBrs (Cancer Breakpoint Hotspot Regions or CBHRs).
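The lowest common ancestor operation itself can be sketched on a small tree stored as a child-to-parent map. This shows only the generic operation, not the predictor described above; the toy tree, its node labels and the function names are hypothetical. One plausible use, stated here as an assumption, is to assign a breakpoint to the LCA of the species that share it.

```python
def path_to_root(node, parent):
    """List of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def lowest_common_ancestor(nodes, parent):
    """Deepest node that is an ancestor of (or equal to) every given node."""
    nodes = list(nodes)
    common = path_to_root(nodes[0], parent)
    for node in nodes[1:]:
        ancestors = set(path_to_root(node, parent))
        # Keep leaf-to-root order so the first survivor is the lowest one.
        common = [n for n in common if n in ancestors]
    return common[0]

# Toy species tree (child -> parent); internal node labels are made up.
parent = {
    "human": "anc_human_chimp",
    "chimp": "anc_human_chimp",
    "anc_human_chimp": "anc_primates",
    "macaque": "anc_primates",
    "anc_primates": "root",
    "mouse": "root",
}
```

With this toy tree, a feature shared by human and chimp maps to their immediate ancestor, while one shared by chimp and mouse maps to the root.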
We predicted 261,391 human lineage-specific EBrs with different ancestral origins, covering more than 50% of the human genome. 16,120 CBrs were collected from previously published studies [4, 5]. Using a non-overlapping sliding window approach with 100 Kbp windows, 1,395 windows were identified as EBHRs and 1,589 windows as CBHRs, significantly enriched in EBrs and CBrs respectively. Comparing the two sets of hotspots, we observed only 79 windows shared by EBHRs and CBHRs, which is statistically consistent with random coincidence. A further analysis of the functional properties of both hotspot categories shows significant over-representation of G-quadruplexes (G4), CpG islands, repeats, segmental duplications (SDs) and copy number variations (CNVs). The enriched Gene Ontology (GO) terms for the two hotspot categories were distinct. Moreover, by ranking the chromosomes according to the proportion of each chromosome occupied by hotspots, we observed that for 13 chromosomes these proportions show a strong negative correlation between EBHRs and CBHRs (ρ = -0.73), whereas for the remaining chromosomes they show a strong positive correlation (ρ = 0.78). These results may reflect the different nature of the genomic regions that are more susceptible to cancer rearrangements and to evolutionary rearrangements.
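The windowing step can be sketched as follows. Breakpoint positions are binned into non-overlapping 100 Kbp windows and windows with unexpectedly many breakpoints are flagged. The enrichment test is an assumption: the abstract does not state which test was used, so this sketch compares each window against a Poisson model with a uniform genome-wide rate and applies no multiple-testing correction. Function names and the `alpha` threshold are hypothetical.

```python
import math
from collections import Counter

WINDOW = 100_000  # 100 Kbp, as in the text

def window_counts(positions, window=WINDOW):
    """Count breakpoints per non-overlapping window index."""
    return Counter(pos // window for pos in positions)

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)

def hotspot_windows(positions, n_windows, alpha=0.01, window=WINDOW):
    """Window indices whose count is improbably high under a uniform rate."""
    counts = window_counts(positions, window)
    lam = len(positions) / n_windows  # expected breakpoints per window
    return sorted(w for w, k in counts.items()
                  if poisson_upper_tail(k, lam) < alpha)

# Toy data: ten breakpoints packed into window 0, one each in windows 1-9,
# on a hypothetical 2 Mbp chromosome (20 windows).
toy_positions = [i * 5_000 for i in range(10)]
toy_positions += [w * WINDOW + 50 for w in range(1, 10)]
toy_hotspots = hotspot_windows(toy_positions, n_windows=20)
```

Running the same procedure on evolutionary and cancer breakpoints separately would give two sets of window indices whose overlap can then be counted directly.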
1. Murphy, W. J. et al. (2005) Science, 309(5734), 613-617
2. Kost-Alimova, M. et al. (2003) Proceedings of the National Academy of Sciences, 100(11), 6622-6627
3. ENCODE Project Consortium et al. (2004) Science, 306(5696), 636-640
4. Malhotra et al. (2013) Genome Research, 23(5), 762-776
5. Sudmant et al. (2015) Nature, 526(7571), 75-81
machine learning circadian rhythms mouse
In mouse models, gene function can be discovered through reverse genetic screens that analyze the phenotypes of knockout or mutagenized mouse lines. Current large phenotyping programs such as the International Mouse Phenotyping Consortium test for a variety of phenotypes, but the tests indicative of neurobehavioral phenotypes capture only high-level phenotypes. Among the specific phenotypes not tested for are changes in circadian rhythm. Disruptions in circadian rhythms have been associated with disorders ranging from schizophrenia to bipolar disorder, implicating mutations in known circadian clock genes as potential therapeutic targets or biomarkers of mental illness. To prioritize candidates for circadian phenotype screens, we used machine learning approaches to predict the abnormal circadian rhythm phenotype in the mouse genome.
We analyzed RNA-Sequencing libraries from two mouse tissues: the central pacemaker clock of the suprachiasmatic nucleus, and a peripheral clock tissue, the liver. We detected signature circadian oscillations in each tissue’s expression, indicative of circadian characteristics. Abundance of expression in the central clock and relative expression distributions across 26 tissues were also shown to be indicative of a potential circadian phenotype. Using predicted protein-protein interaction graphs, we constructed a diffusion kernel on known interactions, scoring nodes from short random walks around known circadian genes. These were used as features in a RUSBoost ensemble tree classifier. 91 genes with annotated abnormal circadian rhythm phenotypes were used as positive targets. A total of 275 out of 12,502 protein-coding genes expressed in the suprachiasmatic nucleus were predicted to have novel abnormal phenotypes. The most highly ranked are known clock genes that have no annotated abnormal circadian phenotypes in mouse but whose human orthologs present circadian phenotypes. Known circadian genes, including Npas2 and Fbxw11, were successfully predicted to contribute to circadian phenotypes.
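The idea of scoring nodes by short random walks from known genes can be sketched as below. This is a simplified stand-in rather than the diffusion kernel used in the study: it propagates probability mass from a set of seed nodes for a fixed number of steps and sums the visit probabilities. The toy graph, the node names and the function name are hypothetical, and the resulting scores would be only one feature among several passed to a classifier.

```python
def random_walk_scores(graph, seeds, steps=3):
    """Sum over steps 1..steps of the probability that a walker, started
    uniformly on the seed nodes, is at each node.

    graph: dict mapping node -> list of neighbouring nodes (undirected).
    """
    prob = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
    scores = {n: 0.0 for n in graph}
    for _ in range(steps):
        nxt = {n: 0.0 for n in graph}
        for node, p in prob.items():
            neighbours = graph[node]
            if not neighbours:
                nxt[node] += p  # isolated node: the walker stays put
                continue
            share = p / len(neighbours)
            for nb in neighbours:
                nxt[nb] += share
        prob = nxt
        for n, p in prob.items():
            scores[n] += p
    return scores

# Toy interaction graph; "clockA" and "clockB" stand in for known genes.
toy_graph = {
    "clockA": ["x", "y"],
    "clockB": ["x"],
    "x": ["clockA", "clockB", "y"],
    "y": ["clockA", "x"],
    "z": [],
}
one_step = random_walk_scores(toy_graph, {"clockA", "clockB"}, steps=1)
three_step = random_walk_scores(toy_graph, {"clockA", "clockB"}, steps=3)
```

Nodes that interact with several seeds accumulate more mass than nodes reached through a single path, and nodes disconnected from every seed score zero.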
Our model predicts genes which, if disturbed, may contribute to abnormal circadian phenotypes in mouse. Our findings highlight bias in mammalian phenotype annotations, reflecting both what has been studied to date in mouse and the redundancy in the mammalian clock. By integrating a combination of gene expression and protein interaction metrics, our machine learning method reduces the search space for novel phenotype annotations. We are extending this methodology to several neurological phenotypes besides abnormal circadian rhythm.
flowering locus t miscanthus qpcr
The onset of flowering has been shown to have a significant impact on biomass development, as a sudden arrest of growth occurs shortly before the early stages of flowering. Extending the growing season of the perennial bioenergy crop Miscanthus by 2 months may increase biomass yields by over 50% (Jensen et al., 2012). Later-flowering species, such as M. sacchariflorus, therefore have an extended growing season compared to early-flowering genotypes, such as M. sinensis, and typically produce significantly greater biomass yields.
The Florigen pathway is a genetic pathway provoking flowering in most flowering plant species, induced by vernalisation and photoperiod (day length). Flowering Locus T (FT), a floral integrator gene involved in the Florigen pathway, has been identified in multiple crops, and highlighted as a key component of the initiation of flowering. Primers from an identified Sorghum FT gene were used to amplify fragments of the FT homolog in Miscanthus. These primers were tested against M. sinensis, M. floridulus and M. giganteus, to identify how conserved the gene was within the genus.
A qPCR analysis will be performed to identify how gene expression changes with the onset of flowering and to confirm the identity of the candidate gene. If the gene's expression is shown to change substantially over the course of the onset of flowering, the gene will be functionally tested by creating a knock-out in M. sinensis. Transgenic plants will be cultivated in a controlled growth room and monitored for any physiological effects, particularly effects on the onset of flowering and the preceding arrest of biomass development.
amphibians poison arrow frogs gut microbiome toxin sequestration
Poison arrow frogs (of the family Dendrobatidae) secrete alkaloid toxins in their skin as a defence mechanism against predators. Numerous studies have shown that the alkaloid toxins in dendrobatid skin are acquired by “sequestration from diet”, i.e. uptake and storage of toxins or their chemical precursors, mostly from consumed arthropods. There exists the intriguing possibility that the gut microbiome of these frogs may play a role in this process.
We address this question by looking at the organism together with its associated microbial communities, an effective symbiotic relationship between host and microbiome that could have allowed phenotypic adaptation of the host to a toxic diet. Following a metagenomic approach, we sequenced the bacterial and archaeal 16S rRNA and fungal ITS regions of the gut microbiome of 7 dendrobatid species and 9 outgroup frog species caught in the rainforest of Eastern Peru. Frog species were selected on the basis of sharing similar microhabitats and having comparable individual sizes. A comparative analysis of the microbiome composition across all our samples allowed us to assess whether there is a core group of symbiotic microbes unique to poison arrow frogs that could be associated with their ability to sequester toxins. In this talk we will discuss the possibilities and pitfalls of sampling the gut microbiome in small anurans and showcase the potential of this fast-evolving area of research.