JabRef Journal Article Index Application

    Author / Editor / Organization Title Year Journal / Proceedings / Book BibTeX type Keywords
    Christensen, R.G.; Gupta, A.; Zuo, Z.; Schriefer, L.A.; Wolfe, S.A. & Stormo, G.D. A modified bacterial one-hybrid system yields improved quantitative models of transcription factor specificity. 2011 Nucleic Acids Res   article
    Abstract: We examine the use of high-throughput sequencing on binding sites recovered using a bacterial one-hybrid (B1H) system and find that improved models of transcription factor (TF) binding specificity can be obtained compared to standard methods of sequencing a small subset of the selected clones. We can obtain even more accurate binding models using a modified version of B1H selection method with constrained variation (CV-B1H). However, achieving these improved models using CV-B1H data required the development of a new method of analysis-GRaMS (Growth Rate Modeling of Specificity)-that estimates bacterial growth rates as a function of the quality of the recognition sequence. We benchmark these different methods of motif discovery using Zif268, a well-characterized C(2)H(2) zinc-finger TF on both a 28 bp randomized library for the standard B1H method and on 6 bp randomized library for the CV-B1H method for which 45 different experimental conditions were tested: five time points and three different IPTG and 3-AT concentrations. We find that GRaMS analysis is robust to the different experimental parameters whereas other analysis methods give widely varying results depending on the conditions of the experiment. Finally, we demonstrate that the CV-B1H assay can be performed in liquid media, which produces recognition models that are similar in quality to sequences recovered from selection on solid media.
    BibTeX:
    @article{
      author = {Ryan G Christensen and Ankit Gupta and Zheng Zuo and Lawrence A Schriefer and Scot A Wolfe and Gary D Stormo},
      title = {A modified bacterial one-hybrid system yields improved quantitative models of transcription factor specificity.},
      journal = {Nucleic Acids Res},
      year = {2011},
      url = {http://dx.doi.org/10.1093/nar/gkr239},
      doi = {10.1093/nar/gkr239}
    }
    					
    Grimm, A.A.; Brace, C.S.; Wang, T.; Stormo, G.D. & Imai, S.-i. A nutrient-sensitive interaction between Sirt1 and HNF-1α regulates Crp expression. 2011 Aging Cell
    Vol. 10 (2) , pp. 305-317  
    article
    Abstract: Silent information regulator 2 (Sir2) orthologs are an evolutionarily conserved family of NAD-dependent protein deacetylases that regulate aging and longevity in model organisms. The mammalian Sir2 ortholog Sirt1 regulates metabolic and stress responses through the deacetylation of many transcriptional regulatory factors. To elucidate the mechanism by which Sirt1 controls gene expression in response to nutrient availability, we devised a bioinformatic screen combining gene expression analysis with phylogenetic footprinting to identify transcription factors as new candidate partners of Sirt1. One candidate target was HNF-1α, a homeodomain transcription factor that regulates pancreatic β-cell and hepatocyte functions and is commonly mutated in patients with maturity-onset diabetes of the young (MODY). Interestingly, Sirt1 physically interacts with HNF-1α in vitro but does so in vivo only in nutrient-restricting conditions. This interaction requires 12-24 h of nutrient restriction and is dependent on protein synthesis. Both nutrient restriction and Sirt1 suppress HNF-1α transcriptional activity and the expression of one of its target genes, C-reactive protein (Crp), in mouse primary hepatocytes. Pharmacological inhibition of Sirt1 blocks the suppression of Crp by nutrient restriction. Similarly, Crp expression is also suppressed in fasted and diet-restricted liver. Furthermore, Sirt1 and HNF-1α co-localize on two HNF-1α binding sites on the Crp promoter, leading to decreased acetylation of lysine 16 of histone H4 at these sites only in response to nutrient restriction. These findings reveal a novel nutrient-dependent interaction between Sirt1 and HNF-1α and provide important insight into the molecular mechanism by which Sirt1 mediates the anti-aging effects of diet restriction.
    BibTeX:
    @article{
      author = {Andrew A Grimm and Cynthia S Brace and Ting Wang and Gary D Stormo and Shin-ichiro Imai},
      title = {A nutrient-sensitive interaction between Sirt1 and HNF-1α regulates Crp expression.},
      journal = {Aging Cell},
      year = {2011},
      volume = {10},
      number = {2},
      pages = {305--317},
      url = {http://dx.doi.org/10.1111/j.1474-9726.2010.00667.x},
      doi = {10.1111/j.1474-9726.2010.00667.x}
    }
    					
    Stormo, G.D. Maximally efficient modeling of DNA sequence motifs at all levels of complexity. 2011 Genetics
    Vol. 187 (4) , pp. 1219-1224  
    article
    Abstract: Identification of transcription factor binding sites is necessary for deciphering gene regulatory networks. Several new methods provide extensive data about the specificity of transcription factors but most methods for analyzing these data to obtain specificity models are limited in scope by, for example, assuming additive interactions or are inefficient in their exploration of more complex models. This article describes an approach-encoding of DNA sequences as the vertices of a regular simplex-that allows simultaneous direct comparison of simple and complex models, with higher-order parameters fit to the residuals of lower-order models. In addition to providing an efficient assessment of all model parameters, this approach can yield valuable insight into the mechanism of binding by highlighting features that are critical to accurate models.
    BibTeX:
    @article{
      author = {Gary D Stormo},
      title = {Maximally efficient modeling of DNA sequence motifs at all levels of complexity.},
      journal = {Genetics},
      year = {2011},
      volume = {187},
      number = {4},
      pages = {1219--1224},
      url = {http://dx.doi.org/10.1534/genetics.110.126052},
      doi = {10.1534/genetics.110.126052}
    }
    					
    Stormo, G.D. An introduction to recognizing functional domains. 2011 Curr Protoc Bioinformatics
    Vol. Chapter 2 , pp. Unit2.1  
    article
    Abstract: This unit provides an overview of issues involved in domain recognition in protein and DNA sequences. It opens with a discussion of the two primary methods of domain representation, namely consensus sequences and alignment matrices (e.g., the log-odds matrix). The unit continues with a brief overview of some of the resources available for identifying functional domains in nucleotide sequences (e.g., transcription factor binding sites). In addition, it reviews databases such as Pfam and InterPro, which are available for protein analysis. Curr. Protoc. Bioinform. 34:2.1.1-2.1.6. © 2011 by John Wiley & Sons, Inc.
    BibTeX:
    @article{
      author = {Gary D Stormo},
      title = {An introduction to recognizing functional domains.},
      journal = {Curr Protoc Bioinformatics},
      year = {2011},
      volume = {Chapter 2},
      pages = {Unit2.1},
      url = {http://dx.doi.org/10.1002/0471250953.bi0201s34},
      doi = {10.1002/0471250953.bi0201s34}
    }
    					
    Zhao, Y. & Stormo, G.D. Quantitative analysis demonstrates most transcription factors require only simple models of specificity. 2011 Nat Biotechnol
    Vol. 29 (6) , pp. 480-483  
    article
    BibTeX:
    @article{
      author = {Yue Zhao and Gary D Stormo},
      title = {Quantitative analysis demonstrates most transcription factors require only simple models of specificity.},
      journal = {Nat Biotechnol},
      year = {2011},
      volume = {29},
      number = {6},
      pages = {480--483},
      url = {http://dx.doi.org/10.1038/nbt.1893},
      doi = {10.1038/nbt.1893}
    }
    					
    Kwan, A.; Dutcher, S.K. & Stormo, G.D. Detecting Coevolution of Functionally Related Proteins for Automated Protein Annotation 2010 10th International IEEE Conference on Bioinformatics and Bioengineering , pp. 99-105   article
    BibTeX:
    @article{
      author = {Kwan, A. and Dutcher, S. K. and Stormo, G. D.},
      title = {Detecting Coevolution of Functionally Related Proteins for Automated Protein Annotation},
      journal = {10th International IEEE Conference on Bioinformatics and Bioengineering},
      year = {2010},
      pages = {99--105},
      url = {http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=05521705}
    }
    					
    Sahota, G. & Stormo, G.D. Novel sequence-based method for identifying transcription factor binding sites in prokaryotic genomes. 2010 Bioinformatics   article
    Abstract: MOTIVATION: Computational techniques for microbial genomic sequence analysis are becoming increasingly important. With next-generation sequencing technology and the human microbiome project underway, current sequencing capacity is significantly greater than the speed at which organisms of interest can be studied experimentally. Most related computational work has been focused on sequence assembly, gene annotation, and metabolic network reconstruction. We have developed a method that will primarily use available sequence data in order to determine prokaryotic transcription factor binding specificities. RESULTS: Specificity determining residues (critical residues) were identified from crystal structures of DNA-protein complexes and transcription factors (TFs) with the same critical residues were grouped into specificity classes. The putative binding regions for each class were defined as the set of promoters for each TF itself (autoregulatory) and the immediately upstream and downstream operons. MEME was used to find putative motifs within each separate class. Tests on the LacI and TetR TF families, using RegulonDB annotated sites, showed the sensitivity of prediction is 86% and 80% respectively. AVAILABILITY: http://ural.wustl.edu/~gsahota/HTHmotif/ CONTACT: stormo@wustl.edu.
    BibTeX:
    @article{
      author = {Gurmukh Sahota and Gary D Stormo},
      title = {Novel sequence-based method for identifying transcription factor binding sites in prokaryotic genomes.},
      journal = {Bioinformatics},
      year = {2010},
      url = {http://dx.doi.org/10.1093/bioinformatics/btq501},
      doi = {10.1093/bioinformatics/btq501}
    }
    					
    Stormo, G.D. Motif discovery using expectation maximization and Gibbs' sampling. 2010 Methods Mol Biol
    Vol. 674 , pp. 85-95  
    article
    Abstract: Expectation maximization and Gibbs' sampling are two statistical approaches used to identify transcription factor binding sites and the motif that represents them. Both take as input unaligned sequences and search for a statistically significant alignment of putative binding sites. Expectation maximization is deterministic so that starting with the same initial parameters will always converge to the same solution, making it wise to start it multiple times from different initial parameters. Gibbs' sampling is stochastic so that it may arrive at different solutions from the same initial parameters. In both cases multiple runs are advised because comparisons of the solutions after each run can indicate whether a global, optimum solution is likely to have been achieved.
    BibTeX:
    @article{
      author = {Gary D Stormo},
      title = {Motif discovery using expectation maximization and Gibbs' sampling.},
      journal = {Methods Mol Biol},
      year = {2010},
      volume = {674},
      pages = {85--95},
      url = {http://dx.doi.org/10.1007/978-1-60761-854-6_6},
      doi = {10.1007/978-1-60761-854-6_6}
    }
    					
    Stormo, G.D. & Zhao, Y. Determining the specificity of protein-DNA interactions. 2010 Nat Rev Genet   article
    Abstract: Proteins, such as many transcription factors, that bind to specific DNA sequences are essential for the proper regulation of gene expression. Identifying the specific sequences that each factor binds can help to elucidate regulatory networks within cells and how genetic variation can cause disruption of normal gene expression, which is often associated with disease. Traditional methods for determining the specificity of DNA-binding proteins are slow and laborious, but several new high-throughput methods can provide comprehensive binding information much more rapidly. Combined with in vivo determinations of transcription factor binding locations, this information provides more detailed views of the regulatory circuitry of cells and the effects of variation on gene expression.
    BibTeX:
    @article{
      author = {Gary D Stormo and Yue Zhao},
      title = {Determining the specificity of protein-DNA interactions.},
      journal = {Nat Rev Genet},
      year = {2010},
      url = {http://dx.doi.org/10.1038/nrg2845},
      doi = {10.1038/nrg2845}
    }
    					
    Foat, B.C. & Stormo, G.D. Discovering structural cis-regulatory elements by modeling the behaviors of mRNAs. 2009 Mol Syst Biol
    Vol. 5 , pp. 268  
    article
    Abstract: Gene expression is regulated at each step from chromatin remodeling through translation and degradation. Several known RNA-binding regulatory proteins interact with specific RNA secondary structures in addition to specific nucleotides. To provide a more comprehensive understanding of the regulation of gene expression, we developed an integrative computational approach that leverages functional genomics data and nucleotide sequences to discover RNA secondary structure-defined cis-regulatory elements (SCREs). We applied our structural cis-regulatory element detector (StructRED) to microarray and mRNA sequence data from Saccharomyces cerevisiae, Drosophila melanogaster, and Homo sapiens. We recovered the known specificities of Vts1p in yeast and Smaug in flies. In addition, we discovered six putative SCREs in flies and three in humans. We characterized the SCREs based on their condition-specific regulatory influences, the annotation of the transcripts that contain them, and their locations within transcripts. Overall, we show that modeling functional genomics data in terms of combined RNA structure and sequence motifs is an effective method for discovering the specificities and regulatory roles of RNA-binding proteins.
    BibTeX:
    @article{
      author = {Barrett C Foat and Gary D Stormo},
      title = {Discovering structural cis-regulatory elements by modeling the behaviors of mRNAs.},
      journal = {Mol Syst Biol},
      year = {2009},
      volume = {5},
      pages = {268},
      url = {http://dx.doi.org/10.1038/msb.2009.24},
      doi = {10.1038/msb.2009.24}
    }
    					
    Homsi, D.S.F.; Gupta, V. & Stormo, G.D. Modeling the quantitative specificity of DNA-binding proteins from example binding sites. 2009 PLoS One
    Vol. 4 (8) , pp. e6736  
    article
    Abstract: BACKGROUND: The binding of transcription factors to their respective DNA sites is a key component of every regulatory network. Predictions of transcription factor binding sites are usually based on models for transcription factor specificity. These models, in turn, are often based on examples of known binding sites. METHODOLOGY/PRINCIPAL FINDINGS: Collections of binding sites are obtained in simulation experiments where the true model for the transcription factor is known and various sampling procedures are employed. We compare the accuracies of three different and commonly used methods for predicting the specificity of the transcription factor based on example binding sites. Different methods for constructing the models can lead to significant differences in the accuracy of the predictions and we show that commonly used methods can be positively misleading, even at large sample sizes and using noise-free data. Methods that minimize the number of predicted binding sequences are often significantly more accurate than the other methods tested. CONCLUSIONS/SIGNIFICANCE: Different methods for generating motifs from example binding sites can have significantly different numbers of false positive and false negative predictions. For many different sampling procedures models based on quadratic programming are the most accurate.
    BibTeX:
    @article{
      author = {Dana S F Homsi and Vineet Gupta and Gary D Stormo},
      title = {Modeling the quantitative specificity of DNA-binding proteins from example binding sites.},
      journal = {PLoS One},
      year = {2009},
      volume = {4},
      number = {8},
      pages = {e6736},
      url = {http://dx.doi.org/10.1371/journal.pone.0006736},
      doi = {10.1371/journal.pone.0006736}
    }
    					
    Kwan, A.; Li, L.; Kulp, D.; Dutcher, S. & Stormo, G. Improving Gene-finding in Chlamydomonas reinhardtii: GreenGenie2. 2009 BMC Genomics
    Vol. 10 (1) , pp. 210  
    article
    Abstract: BACKGROUND: The availability of whole-genome sequences allows for the identification of the entire set of protein coding genes as well as their regulatory regions. This can be accomplished using multiple complementary methods that include ESTs, homology searches and ab initio gene predictions. Previously, the Genie gene-finding algorithm was trained on a small set of Chlamydomonas genes and shown to improve the accuracy of gene prediction in this species compared to other available programs. To improve ab initio gene finding in Chlamydomonas, we assemble a new training set consisting of over 2,300 cDNAs by assembling over 167,000 Chlamydomonas EST entries in GenBank using the EST assembly tool PASA. RESULTS: The prediction accuracy of our cDNA-trained gene-finder, GreenGenie2, attains 83% sensitivity and 83% specificity for exons on short-sequence predictions. We predict about 12,000 genes in the version v3 Chlamydomonas genome assembly, most of which (78%) are either identical to or significantly overlap the published catalog of Chlamydomonas genes [1]. Twenty-two percent of the published catalog is absent from the GreenGenie2 predictions; there is also a fraction (23%) of GreenGenie2 predictions that are absent from the published gene catalog. Randomly chosen gene models were tested by RT-PCR and most support the GreenGenie2 predictions. CONCLUSIONS: These data suggest that training with EST assemblies is highly effective and that GreenGenie2 is a valuable, complementary tool for predicting genes in Chlamydomonas reinhardtii.
    BibTeX:
    @article{
      author = {Alan Kwan and Linya Li and David Kulp and Susan Dutcher and Gary Stormo},
      title = {Improving Gene-finding in Chlamydomonas reinhardtii: GreenGenie2.},
      journal = {BMC Genomics},
      year = {2009},
      volume = {10},
      number = {1},
      pages = {210},
      url = {http://dx.doi.org/10.1186/1471-2164-10-210},
      doi = {10.1186/1471-2164-10-210}
    }
    					
    Schraml, B.U.; Hildner, K.; Ise, W.; Lee, W.-L.; Smith, W.A.-E.; Solomon, B.; Sahota, G.; Sim, J.; Mukasa, R.; Cemerski, S.; Hatton, R.D.; Stormo, G.D.; Weaver, C.T.; Russell, J.H.; Murphy, T.L. & Murphy, K.M. The AP-1 transcription factor Batf controls T(H)17 differentiation. 2009 Nature   article
    Abstract: Activator protein 1 (AP-1, also known as JUN) transcription factors are dimers of JUN, FOS, MAF and activating transcription factor (ATF) family proteins characterized by basic region and leucine zipper domains. Many AP-1 proteins contain defined transcriptional activation domains, but BATF and the closely related BATF3 (refs 2, 3) contain only a basic region and leucine zipper, and are considered to be inhibitors of AP-1 activity. Here we show that Batf is required for the differentiation of IL17-producing T helper (T(H)17) cells. T(H)17 cells comprise a CD4(+) T-cell subset that coordinates inflammatory responses in host defence but is pathogenic in autoimmunity. Batf(-/-) mice have normal T(H)1 and T(H)2 differentiation, but show a defect in T(H)17 differentiation, and are resistant to experimental autoimmune encephalomyelitis. Batf(-/-) T cells fail to induce known factors required for T(H)17 differentiation, such as RORgammat (encoded by Rorc) and the cytokine IL21 (refs 14-17). Neither the addition of IL21 nor the overexpression of RORgammat fully restores IL17 production in Batf(-/-) T cells. The Il17 promoter is BATF-responsive, and after T(H)17 differentiation, BATF binds conserved intergenic elements in the Il17a-Il17f locus and to the Il17, Il21 and Il22 (ref. 18) promoters. These results demonstrate that the AP-1 protein BATF has a critical role in T(H)17 differentiation.
    BibTeX:
    @article{
      author = {Barbara U Schraml and Kai Hildner and Wataru Ise and Wan-Ling Lee and Whitney A-E Smith and Ben Solomon and Gurmukh Sahota and Julia Sim and Ryuta Mukasa and Saso Cemerski and Robin D Hatton and Gary D Stormo and Casey T Weaver and John H Russell and Theresa L Murphy and Kenneth M Murphy},
      title = {The AP-1 transcription factor Batf controls T(H)17 differentiation.},
      journal = {Nature},
      year = {2009},
      url = {http://dx.doi.org/10.1038/nature08114},
      doi = {10.1038/nature08114}
    }
    					
    Stormo, G.D. An introduction to sequence similarity ("homology") searching. 2009 Curr Protoc Bioinformatics
    Vol. Chapter 3 , pp. Unit 3.1 3.1.1-Unit 3.1 3.1.7  
    article sequence similarity; homology; dynamic programming; similarity-scoring matrices; sequence alignment; multiple alignment; sequence evolution
    Abstract: Homologous sequences usually have the same, or very similar, functions, so new sequences can be reliably assigned functions if homologous sequences with known functions can be identified. Homology is inferred based on sequence similarity, and many methods have been developed to identify sequences that have statistically significant similarity. This unit provides an overview of some of the basic issues in identifying similarity among sequences and points out other units in this chapter that describe specific programs that are useful for this task.
    BibTeX:
    @article{
      author = {Gary D Stormo},
      title = {An introduction to sequence similarity ("homology") searching.},
      journal = {Curr Protoc Bioinformatics},
      year = {2009},
      volume = {Chapter 3},
      pages = {Unit 3.1 3.1.1--Unit 3.1 3.1.7},
      url = {http://dx.doi.org/10.1002/0471250953.bi0301s27},
      doi = {10.1002/0471250953.bi0301s27}
    }
    					
    Xu, X.; Ji, Y. & Stormo, G.D. Discovering cis-regulatory RNAs in Shewanella genomes by Support Vector Machines. 2009 PLoS Comput Biol
    Vol. 5 (4) , pp. e1000338  
    article algorithms; artificial intelligence; base sequence; chromosome mapping, methods; genome, bacterial, genetics; molecular sequence data; pattern recognition, automated, methods; rna, bacterial, genetics; regulatory sequences, ribonucleic acid, genetics; sequence analysis, rna, methods; shewanella, genetics
    Abstract: An increasing number of cis-regulatory RNA elements have been found to regulate gene expression post-transcriptionally in various biological processes in bacterial systems. Effective computational tools for large-scale identification of novel regulatory RNAs are strongly desired to facilitate our exploration of gene regulation mechanisms and regulatory networks. We present a new computational program named RSSVM (RNA Sampler+Support Vector Machine), which employs Support Vector Machines (SVMs) for efficient identification of functional RNA motifs from random RNA secondary structures. RSSVM uses a set of distinctive features to represent the common RNA secondary structure and structural alignment predicted by RNA Sampler, a tool for accurate common RNA secondary structure prediction, and is trained with functional RNAs from a variety of bacterial RNA motif/gene families covering a wide range of sequence identities. When tested on a large number of known and random RNA motifs, RSSVM shows a significantly higher sensitivity than other leading RNA identification programs while maintaining the same false positive rate. RSSVM performs particularly well on sets with low sequence identities. The combination of RNA Sampler and RSSVM provides a new, fast, and efficient pipeline for large-scale discovery of regulatory RNA motifs. We applied RSSVM to multiple Shewanella genomes and identified putative regulatory RNA motifs in the 5' untranslated regions (UTRs) in S. oneidensis, an important bacterial organism with extraordinary respiratory and metal reducing abilities and great potential for bioremediation and alternative energy generation. From 1002 sets of 5'-UTRs of orthologous operons, we identified 166 putative regulatory RNA motifs, including 17 of the 19 known RNA motifs from Rfam, an additional 21 RNA motifs that are supported by literature evidence, 72 RNA motifs overlapping predicted transcription terminators or attenuators, and other candidate regulatory RNA motifs. Our study provides a list of promising novel regulatory RNA motifs potentially involved in post-transcriptional gene regulation. Combined with the previous cis-regulatory DNA motif study in S. oneidensis, this genome-wide discovery of cis-regulatory RNA motifs may offer more comprehensive views of gene regulation at a different level in this organism. The RSSVM software, predictions, and analysis results on Shewanella genomes are available at http://ural.wustl.edu/resources.html#RSSVM.
    BibTeX:
    @article{
      author = {Xing Xu and Yongmei Ji and Gary D Stormo},
      title = {Discovering cis-regulatory RNAs in Shewanella genomes by Support Vector Machines.},
      journal = {PLoS Comput Biol},
      year = {2009},
      volume = {5},
      number = {4},
      pages = {e1000338},
      url = {http://dx.doi.org/10.1371/journal.pcbi.1000338},
      doi = {10.1371/journal.pcbi.1000338}
    }
    					
    Zhao, Y.; Granas, D. & Stormo, G.D. Inferring binding energies from selected binding sites. 2009 PLoS Comput Biol
    Vol. 5 (12) , pp. e1000590  
    article
    Abstract: We employ a biophysical model that accounts for the non-linear relationship between binding energy and the statistics of selected binding sites. The model includes the chemical potential of the transcription factor, non-specific binding affinity of the protein for DNA, as well as sequence-specific parameters that may include non-independent contributions of bases to the interaction. We obtain maximum likelihood estimates for all of the parameters and compare the results to standard probabilistic methods of parameter estimation. On simulated data, where the true energy model is known and samples are generated with a variety of parameter values, we show that our method returns much more accurate estimates of the true parameters and much better predictions of the selected binding site distributions. We also introduce a new high-throughput SELEX (HT-SELEX) procedure to determine the binding specificity of a transcription factor in which the initial randomized library and the selected sites are sequenced with next generation methods that return hundreds of thousands of sites. We show that after a single round of selection our method can estimate binding parameters that give very good fits to the selected site distributions, much better than standard motif identification algorithms.
    BibTeX:
    @article{
      author = {Yue Zhao and David Granas and Gary D Stormo},
      title = {Inferring binding energies from selected binding sites.},
      journal = {PLoS Comput Biol},
      year = {2009},
      volume = {5},
      number = {12},
      pages = {e1000590},
      url = {http://dx.doi.org/10.1371/journal.pcbi.1000590},
      doi = {10.1371/journal.pcbi.1000590}
    }
    					
    Chang, L.W.; Payton, J.E.; Yuan, W.; Ley, T.J.; Nagarajan, R. & Stormo, G.D. Computational identification of the normal and perturbed genetic networks involved in myeloid differentiation and acute promyelocytic leukemia. 2008 Genome Biol
    Vol. 9 (2) , pp. R38  
    article animals; binding sites; cell differentiation, genetics; computational biology, methods; gene expression profiling; gene regulatory networks; humans; leukemia, promyelocytic, acute, genetics; mice; myeloid cells, cytology; rats; sequence analysis, dna; transcription factors, metabolism
    Abstract: BACKGROUND: Acute myeloid leukemia (AML) comprises a group of diseases characterized by the abnormal development of malignant myeloid cells. Recent studies have demonstrated an important role for aberrant transcriptional regulation in AML pathophysiology. Although several transcription factors (TFs) involved in myeloid development and leukemia have been studied extensively and independently, how these TFs coordinate with others and how their dysregulation perturbs the genetic circuitry underlying myeloid differentiation is not yet known. We propose an integrated approach for mammalian genetic network construction by combining the analysis of gene expression profiling data and the identification of TF binding sites. RESULTS: We utilized our approach to construct the genetic circuitries operating in normal myeloid differentiation versus acute promyelocytic leukemia (APL), a subtype of AML. In the normal and disease networks, we found that multiple transcriptional regulatory cascades converge on the TFs Rora and Rxra, respectively. Furthermore, the TFs dysregulated in APL participate in a common regulatory pathway and may perturb the normal network through Fos. Finally, a model of APL pathogenesis is proposed in which the chimeric TF PML-RARalpha activates the dysregulation in APL through six mediator TFs. CONCLUSION: This report demonstrates the utility of our approach to construct mammalian genetic networks, and to obtain new insights regarding regulatory circuitries operating in complex diseases in humans.
    BibTeX:
    @article{
      author = {Li Wei Chang and Jacqueline E Payton and Wenlin Yuan and Timothy J Ley and Rakesh Nagarajan and Gary D Stormo},
      title = {Computational identification of the normal and perturbed genetic networks involved in myeloid differentiation and acute promyelocytic leukemia.},
      journal = {Genome Biol},
      year = {2008},
      volume = {9},
      number = {2},
      pages = {R38},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/18291030},
      doi = {10.1186/gb-2008-9-2-r38}
    }
    					
    Laramie, J.M.; Chung, T.P.; Brownstein, B.; Stormo, G.D. & Cobb, J.P. Transcriptional profiles of human epithelial cells in response to heat: computational evidence for novel heat shock proteins. 2008 Shock
    Vol. 29 (5) , pp. 623-630  
    article binding sites; blotting, western; calibration; cell survival; computational biology, methods; dna, chemistry; epithelial cells, cytology; gene expression regulation; heat; heat-shock proteins, metabolism; humans; models, biological; oligonucleotide array sequence analysis; transcription factors, metabolism; transcription, genetic
    Abstract: We hypothesized that broad-scale expression profiling would provide insight into the regulatory pathways that control gene expression in response to stress and potentially identify novel heat-responsive genes. HEp2 cells, a human malignant epithelial cell line, were heated at 37 degrees C to 43 degrees C for 60 min to gauge the heat shock response, using as a proxy inducible Hsp70 quantified by Western blot analysis. Based on these results, microarray experiments were conducted at 37 degrees C, 40 degrees C, 41 degrees C, 42 degrees C, and 43 degrees C. Using linear modeling, we compared the sets of microarrays at 40 degrees C, 41 degrees C, 42 degrees C, and 43 degrees C with the 37 degrees C baseline temperature and took the union of the genes exhibiting differential gene expression signal to create two sets of "heat shock response" genes, each set reflecting either increased or decreased RNA abundance. Leveraging human and mouse orthologous alignments, we used the two lists of coexpressed genes to predict transcription factor binding sites in silico, including those for heat shock factor (HSF) 1 and HSF2 transcription factors. We discovered HSF1 and HSF2 binding sites in 15 genes not previously associated with the heat shock response. We conclude that microarray experiments coupled with upstream promoter analysis can be used to identify novel genes that respond to heat shock. Additional experiments are required to validate these putative heat shock proteins and facilitate a deeper understanding of the mechanisms involved during the stress response.
    BibTeX:
    @article{
      author = {Jason M Laramie and T. Philip Chung and Buddy Brownstein and Gary D Stormo and J. Perren Cobb},
      title = {Transcriptional profiles of human epithelial cells in response to heat: computational evidence for novel heat shock proteins.},
      journal = {Shock},
      year = {2008},
      volume = {29},
      number = {5},
      pages = {623--630},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/17885648},
      doi = {http://dx.doi.org/10.1097/shk.0b013e318157f33c}
    }
    					
    Liu, J.; Xu, X. & Stormo, G.D. The cis-regulatory map of Shewanella genomes. 2008 Nucleic Acids Res
    Vol. 36 (16) , pp. 5376-5390  
    article dna, bacterial, chemistry; escherichia coli, genetics; gene regulatory networks; genome, bacterial; genomics; metals, metabolism; phylogeny; promoter regions (genetics); regulon; shewanella, classification/genetics/metabolism
    Abstract: While hundreds of microbial genomes are sequenced, the challenge remains to define their cis-regulatory maps. Here, we present a comparative genomic analysis of the cis-regulatory map of Shewanella oneidensis, an important model organism for bioremediation because of its extraordinary abilities to use a wide variety of metals and organic molecules as electron acceptors in respiration. First, from the experimentally verified transcriptional regulatory networks of Escherichia coli, we inferred 24 DNA motifs that are conserved in S. oneidensis. We then applied a new comparative approach on five Shewanella genomes that allowed us to systematically identify 194 nonredundant palindromic DNA motifs and corresponding regulons in S. oneidensis. Sixty-four percent of the predicted motifs are conserved in at least three of the seven newly sequenced and distantly related Shewanella genomes. In total, we obtained 209 unique DNA motifs in S. oneidensis that cover 849 unique transcription units. Besides conservation in other genomes, 77 of these motifs are supported by at least one additional type of evidence, including matching to known transcription factor binding motifs and significant functional enrichment or expression coherence of the corresponding target genes. Using the same approach on a more focused gene set, 990 differentially expressed genes derived from published microarray data of S. oneidensis during exposure to metal ions, we identified 31 putative cis-regulatory motifs (16 with at least one type of additional supporting evidence) that are potentially involved in the process of metal reduction. The majority (18/31) of those motifs had been found in our whole-genome comparative approach, further demonstrating that such an approach is capable of uncovering a large fraction of the regulatory map of a genome even in the absence of experimental data. The integrated computational approach developed in this study provides a useful strategy to identify genome-wide cis-regulatory maps and a novel avenue to explore the regulatory pathways for particular biological processes in bacterial systems.
    BibTeX:
    @article{
      author = {Jiajian Liu and Xing Xu and Gary D Stormo},
      title = {The cis-regulatory map of Shewanella genomes.},
      journal = {Nucleic Acids Res},
      year = {2008},
      volume = {36},
      number = {16},
      pages = {5376--5390},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/18701645},
      doi = {http://dx.doi.org/10.1093/nar/gkn515}
    }
    					
    Liu, J. & Stormo, G.D. Context-dependent DNA recognition code for C2H2 zinc-finger transcription factors. 2008 Bioinformatics
    Vol. 24 (17) , pp. 1850-1857  
    article base sequence; binding sites; computer simulation; dna, chemistry/genetics; dna-binding proteins, chemistry/genetics; models, genetic; molecular sequence data; pattern recognition, automated, methods; protein binding; transcription factors, chemistry/genetics; zinc fingers, genetics
    Abstract: MOTIVATION: Modeling and identifying the DNA-protein recognition code is one of the most challenging problems in computational biology. Several quantitative methods have been developed to model DNA-protein interactions with specific focus on the C(2)H(2) zinc-finger proteins, the largest transcription factor family in eukaryotic genomes. In many cases, they performed well. But overall, the predictive accuracy of these methods is still limited. One of the major reasons is that all these methods used weight matrix models to represent DNA-protein interactions, assuming all base-amino acid contacts contribute independently to the total free energy of binding. RESULTS: We present a context-dependent model for DNA-zinc-finger protein interactions that allows us to identify inter-positional dependencies in the DNA recognition code for C(2)H(2) zinc-finger proteins. The degree of non-independence was detected by comparing the linear perceptron model with the non-linear neural net (NN) model for their predictions of DNA-zinc-finger protein interactions. This dependency is supported by the complex base-amino acid contacts observed in DNA-zinc-finger interactions from structural analyses. Using extensive published qualitative and quantitative experimental data, we demonstrated that the context-dependent model developed in this study can significantly improve predictions of DNA binding profiles and free energies of binding for both individual zinc fingers and proteins with multiple zinc fingers when compared to previous positional-independent models. This approach can be extended to other protein families with complex base-amino acid residue interactions, which would help to further understand the transcriptional regulation in eukaryotic genomes. AVAILABILITY: The software is implemented as C programs and is available by request. http://ural.wustl.edu/softwares.html
    BibTeX:
    @article{
      author = {Jiajian Liu and Gary D Stormo},
      title = {Context-dependent DNA recognition code for C2H2 zinc-finger transcription factors.},
      journal = {Bioinformatics},
      year = {2008},
      volume = {24},
      number = {17},
      pages = {1850--1857},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/18586699},
      doi = {http://dx.doi.org/10.1093/bioinformatics/btn331}
    }
    					
    Noyes, M.B.; Christensen, R.G.; Wakabayashi, A.; Stormo, G.D.; Brodsky, M.H. & Wolfe, S.A. Analysis of homeodomain specificities allows the family-wide prediction of preferred recognition sites. 2008 Cell
    Vol. 133 (7) , pp. 1277-1289  
    article amino acid sequence; animals; bacteria, chemistry/genetics; base sequence; dna, chemistry/metabolism; drosophila proteins, chemistry/genetics; drosophila melanogaster, chemistry/genetics; genome, insect; homeodomain proteins, chemistry/genetics; humans; models, molecular; phylogeny; protein engineering; protein structure, tertiary; two-hybrid system techniques
    Abstract: We describe the comprehensive characterization of homeodomain DNA-binding specificities from a metazoan genome. The analysis of all 84 independent homeodomains from D. melanogaster reveals the breadth of DNA sequences that can be specified by this recognition motif. The majority of these factors can be organized into 11 different specificity groups, where the preferred recognition sequence between these groups can differ at up to four of the six core recognition positions. Analysis of the recognition motifs within these groups led to a catalog of common specificity determinants that may cooperate or compete to define the binding site preference. With these recognition principles, a homeodomain can be reengineered to create factors where its specificity is altered at the majority of recognition positions. This resource also allows prediction of homeodomain specificities from other organisms, which is demonstrated by the prediction and analysis of human homeodomain specificities.
    BibTeX:
    @article{
      author = {Marcus B Noyes and Ryan G Christensen and Atsuya Wakabayashi and Gary D Stormo and Michael H Brodsky and Scot A Wolfe},
      title = {Analysis of homeodomain specificities allows the family-wide prediction of preferred recognition sites.},
      journal = {Cell},
      year = {2008},
      volume = {133},
      number = {7},
      pages = {1277--1289},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/18585360},
      doi = {http://dx.doi.org/10.1016/j.cell.2008.05.023}
    }
    					
    Zeng, J.; Yan, J.; Wang, T.; Mosbrook-Davis, D.; Dolan, K.T.; Christensen, R.; Stormo, G.D.; Haussler, D.; Lathrop, R.H.; Brachmann, R.K. & Burgess, S.M. Genome wide screens in yeast to identify potential binding sites and target genes of DNA-binding proteins. 2008 Nucleic Acids Res
    Vol. 36 (1) , pp. e8  
    article animals; base sequence; binding sites; computational biology; consensus sequence; dna, chemistry; dna-binding proteins, metabolism; forkhead transcription factors, metabolism; genomic library; genomics, methods; mice; plasmids, genetics; regulatory elements, transcriptional; saccharomyces cerevisiae, genetics; transcription factors, metabolism; tumor suppressor protein p53, metabolism; zebrafish proteins, metabolism; zebrafish, genetics
    Abstract: Knowledge of all binding sites for transcriptional activators and repressors is essential for computationally aided identification of transcriptional networks. The techniques developed for defining the binding sites of transcription factors tend to be cumbersome and not adaptable to high throughput. We refined a versatile yeast strategy to rapidly and efficiently identify genomic targets of DNA-binding proteins. Yeast expressing a transcription factor is mated to yeast containing a library of genomic fragments cloned upstream of the reporter gene URA3. DNA fragments with target-binding sites are identified by growth of yeast clones in media lacking uracil. The experimental approach was validated with the tumor suppressor protein p53 and the forkhead protein FoxI1 using genomic libraries for zebrafish and mouse generated by shotgun cloning of short genomic fragments. Computational analysis of the genomic fragments recapitulated the published consensus-binding site for each protein. Identified fragments were mapped to identify the genomic context of each binding site. Our yeast screening strategy, combined with bioinformatics approaches, will allow both detailed and high-throughput characterization of transcription factors, scalable to the analysis of all putative DNA-binding proteins.
    BibTeX:
    @article{
      author = {Jue Zeng and Jizhou Yan and Ting Wang and Deborah Mosbrook-Davis and Kyle T Dolan and Ryan Christensen and Gary D Stormo and David Haussler and Richard H Lathrop and Rainer K Brachmann and Shawn M Burgess},
      title = {Genome wide screens in yeast to identify potential binding sites and target genes of DNA-binding proteins.},
      journal = {Nucleic Acids Res},
      year = {2008},
      volume = {36},
      number = {1},
      pages = {e8},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/18086703},
      doi = {http://dx.doi.org/10.1093/nar/gkm1117}
    }
    					
    Chang, L.-W.; Fontaine, B.R.; Stormo, G.D. & Nagarajan, R. PAP: a comprehensive workbench for mammalian transcriptional regulatory sequence analysis. 2007 Nucleic Acids Res
    Vol. 35 (Web Server issue) , pp. W238-W244  
    article algorithms; animals; computational biology; databases, genetic; gene expression profiling; gene expression regulation; humans; internet; models, genetic; promoter regions (genetics); protein binding; transcription factors; transcription, genetic; user-computer interface
    Abstract: Given the recent explosion of publications that employ microarray technology to monitor genome-wide expression and that correlate these expression changes to biological processes or to disease states, the determination of the transcriptional regulation of these co-expressed genes is the next major step toward deciphering the genetic network governing the pathway or disease under study. Although computational approaches have been proposed for this purpose, there is no integrated and user-friendly software application that allows experimental biologists to tackle this problem in higher eukaryotes. We have previously reported a systematic, statistical model of mammalian transcriptional regulatory sequence analysis. We have now made crucial extensions to this model and have developed a comprehensive, user-friendly web application suite termed the Promoter Analysis Pipeline (PAP). PAP is available at: http://bioinformatics.wustl.edu/webTools/portalModule/PromoterSearch.do.
    BibTeX:
    @article{
      author = {Li-Wei Chang and Burr R Fontaine and Gary D Stormo and Rakesh Nagarajan},
      title = {PAP: a comprehensive workbench for mammalian transcriptional regulatory sequence analysis.},
      journal = {Nucleic Acids Res},
      year = {2007},
      volume = {35},
      number = {Web Server issue},
      pages = {W238--W244},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/17517777},
      doi = {http://dx.doi.org/10.1093/nar/gkm308}
    }
    					
    Davies, S.R.; Chang, L.-W.; Patra, D.; Xing, X.; Posey, K.; Hecht, J.; Stormo, G.D. & Sandell, L.J. Computational identification and functional validation of regulatory motifs in cartilage-expressed genes. 2007 Genome Res
    Vol. 17 (10) , pp. 1438-1447  
    article
    Abstract: Chondrocyte gene regulation is important for the generation and maintenance of cartilage tissues. Several regulatory factors have been identified that play a role in chondrogenesis, including the positive transacting factors of the SOX family such as SOX9, SOX5, and SOX6, as well as negative transacting factors such as C/EBP and delta EF1. However, a complete understanding of the intricate regulatory network that governs the tissue-specific expression of cartilage genes is not yet available. We have taken a computational approach to identify cis-regulatory, transcription factor (TF) binding motifs in a set of cartilage characteristic genes to better define the transcriptional regulatory networks that regulate chondrogenesis. Our computational methods have identified several TFs, whose binding profiles are available in the TRANSFAC database, as important to chondrogenesis. In addition, a cartilage-specific SOX-binding profile was constructed and used to identify both known, and novel, functional paired SOX-binding motifs in chondrocyte genes. Using DNA pattern-recognition algorithms, we have also identified cis-regulatory elements for unknown TFs. We have validated our computational predictions through mutational analyses in cell transfection experiments. One novel regulatory motif, N1, found at high frequency in the COL2A1 promoter, was found to bind to chondrocyte nuclear proteins. Mutational analyses suggest that this motif binds a repressive factor that regulates basal levels of the COL2A1 promoter.
    BibTeX:
    @article{
      author = {Sherri R Davies and Li-Wei Chang and Debabrata Patra and Xiaoyun Xing and Karen Posey and Jacqueline Hecht and Gary D Stormo and Linda J Sandell},
      title = {Computational identification and functional validation of regulatory motifs in cartilage-expressed genes.},
      journal = {Genome Res},
      year = {2007},
      volume = {17},
      number = {10},
      pages = {1438--1447},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/17785538},
      doi = {http://dx.doi.org/10.1101/gr.6224007}
    }
    					
    GuhaThakurta, D. & Stormo, G. Dear, P. (Ed.) Finding regulatory elements in DNA sequence ( Methods Express: Bioinformatics ) 2007 Methods Express: Bioinformatics   inbook
    BibTeX:
    @inbook{
      author = {GuhaThakurta, D. and Stormo, G.},
      title = {Methods Express: Bioinformatics},
      publisher = {Scion Pub. Ltd.},
      year = {2007}
    }
    					
    Rochat, R.H.; de las Fuentes, L.; Stormo, G.; Davila-Roman, V.G. & Gu, C.C. A novel method combining linkage disequilibrium information and imputed functional knowledge for tagSNP selection. 2007 Hum Hered
    Vol. 64 (4) , pp. 243-249  
    article algorithms; genome, human; humans; linkage disequilibrium; methods; polymorphism, single nucleotide
    Abstract: Analyses of high-density SNPs in genetic studies have the potential problems of prohibitive genotyping costs and inflated false discovery rates. Current methods select subsets of representative SNPs (tagSNPs) using information either on potential biologic functionality of the SNPs or on the underlying linkage disequilibrium (LD) structure, but not both. Combining the two types of information may lead to more effective tagSNP selection. The proposed method combines both functional and LD information using a weighted factor analysis (WFA) model. The WFA was applied to the dense SNP collection from 129 genes sequenced by the SeattleSNPs Program for Genomic Application. TagSNPs selected by WFA were compared with those selected by an LD-based method. WFA allowed prioritization of SNPs that would otherwise share equivalent ranking due to underlying LD structure alone. Furthermore, WFA consistently included SNPs not selected by function or by LD alone. A literature review of a subset of genes revealed that SNPs selected by WFA were more likely represented in published reports.
    BibTeX:
    @article{
      author = {R. H. Rochat and L. de las Fuentes and G. Stormo and V. G. Davila-Roman and C. Charles Gu},
      title = {A novel method combining linkage disequilibrium information and imputed functional knowledge for tagSNP selection.},
      journal = {Hum Hered},
      year = {2007},
      volume = {64},
      number = {4},
      pages = {243--249},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/17587853},
      doi = {http://dx.doi.org/10.1159/000104227}
    }
    					
    Souvenir, R.; Buhler, J.; Stormo, G. & Zhang, W. Yuryev, A. (Ed.) An Iterative Method for Selecting Degenerate Multiplex PCR Primers ( Methods in Molecular Biology: PCR Primer Design ) 2007 Methods Mol Biol Methods in Molecular Biology: PCR Primer Design
    Vol. 402 , pp. 245-268  
    inbook algorithms; dna primers; genotype; polymorphism, single nucleotide; quantitative trait loci; sequence analysis, dna
    Abstract: Single-nucleotide polymorphism (SNP) genotyping is an important molecular genetics process, which can produce results that will be useful in the medical field. Because of inherent complexities in DNA manipulation and analysis, many different methods have been proposed for a standard assay. One of the proposed techniques for performing SNP genotyping requires amplifying regions of DNA surrounding a large number of SNP loci. To automate a portion of this particular method, it is necessary to select a set of primers for the experiment. Selecting these primers can be formulated as the Multiple Degenerate Primer Design (MDPD) problem. The Multiple, Iterative Primer Selector (MIPS) is an iterative beam-search algorithm for MDPD. Theoretical and experimental analyses show that this algorithm performs well compared with the limits of degenerate primer design. Furthermore, MIPS outperforms an existing algorithm that was designed for a related degenerate primer selection problem. An implementation of the MIPS algorithm is available for research purposes from the website http://www.cse.wustl.edu/~zhang/software/mips.
    BibTeX:
    @inbook{
      author = {Richard Souvenir and Jeremy Buhler and Gary Stormo and Weixiong Zhang},
      title = {Methods in Molecular Biology: PCR Primer Design},
      journal = {Methods Mol Biol},
      year = {2007},
      volume = {402},
      pages = {245--268},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/17951799}
    }
    					
    Stormo, G.D. & Zhao, Y. Putting numbers on the network connections. 2007 Bioessays
    Vol. 29 (8) , pp. 717-721  
    article dna-binding proteins; gene regulatory networks; humans; microfluidic analytical techniques; models, theoretical; protein binding; substrate specificity; transcription factors
    Abstract: DNA-protein interactions are fundamental to many biological processes, including the regulation of gene expression. Determining the binding affinities of transcription factors (TFs) to different DNA sequences allows the quantitative modeling of transcriptional regulatory networks and has been a significant technical challenge in molecular biology for many years. A recent paper by Maerkl and Quake(1) demonstrated the use of microfluidic technology for the analysis of DNA-protein interactions. An array of short DNA sequences was spotted onto a glass slide, which was then covered with a microfluidic device allowing each spot to be within a chamber into which the flow of materials was controlled by valves. By trapping the DNA-protein complexes on the surface and measuring their concentrations microscopically, they could determine the binding affinity to a large number of DNA sequences that were varied systematically. They studied four TFs from the basic helix-loop-helix family of proteins, all of which bind to E-box sites with the consensus CAnnTG (where "n" can be any base), and showed that variations in affinity for different sites allow each TF to regulate different genes.
    BibTeX:
    @article{
      author = {Gary D Stormo and Yue Zhao},
      title = {Putting numbers on the network connections.},
      journal = {Bioessays},
      year = {2007},
      volume = {29},
      number = {8},
      pages = {717--721},
      url = {http://dx.doi.org/10.1002/bies.20617},
      doi = {http://dx.doi.org/10.1002/bies.20617}
    }
    					
    Xu, X.; Ji, Y. & Stormo, G.D. RNA Sampler: a new sampling based algorithm for common RNA secondary structure prediction and structural alignment. 2007 Bioinformatics
    Vol. 23 (15) , pp. 1883-1891  
    article algorithms; base sequence; computer simulation; databases, genetic; information storage and retrieval; models, chemical; models, molecular; molecular sequence data; nucleic acid conformation; rna; sample size; sequence alignment; sequence analysis, rna
    Abstract: MOTIVATION: Non-coding RNA genes and RNA structural regulatory motifs play important roles in gene regulation and other cellular functions. They are often characterized by specific secondary structures that are critical to their functions and are often conserved in phylogenetically or functionally related sequences. Predicting common RNA secondary structures in multiple unaligned sequences remains a challenge in bioinformatics research. METHODS AND RESULTS: We present a new sampling based algorithm to predict common RNA secondary structures in multiple unaligned sequences. Our algorithm finds the common structure between two sequences by probabilistically sampling aligned stems based on stem conservation calculated from intrasequence base pairing probabilities and intersequence base alignment probabilities. It iteratively updates these probabilities based on sampled structures and subsequently recalculates stem conservation using the updated probabilities. The iterative process terminates upon convergence of the sampled structures. We extend the algorithm to multiple sequences by a consistency-based method, which iteratively incorporates and reinforces consistent structure information from pairwise comparisons into consensus structures. The algorithm has no limitation on predicting pseudoknots. In extensive testing on real sequence data, our algorithm outperformed other leading RNA structure prediction methods in both sensitivity and specificity with a reasonably fast speed. It also generated better structural alignments than other programs in sequences of a wide range of identities, which more accurately represent the RNA secondary structure conservations. AVAILABILITY: The algorithm is implemented in a C program, RNA Sampler, which is available at http://ural.wustl.edu/software.html
    BibTeX:
    @article{
      author = {Xing Xu and Yongmei Ji and Gary D Stormo},
      title = {RNA Sampler: a new sampling based algorithm for common RNA secondary structure prediction and structural alignment.},
      journal = {Bioinformatics},
      year = {2007},
      volume = {23},
      number = {15},
      pages = {1883--1891},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/17537756},
      doi = {http://dx.doi.org/10.1093/bioinformatics/btm272}
    }
    					
    Zhao, G.; Chang, K.Y.; Varley, K. & Stormo, G.D. Evidence for active maintenance of inverted repeat structures identified by a comparative genomic approach. 2007 PLoS ONE
    Vol. 2 , pp. e262  
    article
    Abstract: Inverted repeats have been found to occur in both prokaryotic and eukaryotic genomes. Usually they are short and some have important functions in various biological processes. However, long inverted repeats are rare and can cause genome instability. Analyses of the C. elegans genome identified long, nearly-perfect inverted repeat sequences involving both divergently and convergently oriented homologous gene pairs and complete intergenic sequences. Comparisons with the orthologous regions from the genomes of C. briggsae and C. remanei show that the inverted repeat structures are often far more conserved than the sequences. This observation implies that there is an active mechanism for maintaining the inverted repeat nature of the sequences.
    BibTeX:
    @article{
      author = {Guoyan Zhao and Kuan Y Chang and Katherine Varley and Gary D Stormo},
      title = {Evidence for active maintenance of inverted repeat structures identified by a comparative genomic approach.},
      journal = {PLoS ONE},
      year = {2007},
      volume = {2},
      pages = {e262},
      doi = {http://dx.doi.org/10.1371/journal.pone.0000262}
    }
    					
    Zhao, G.; Schriefer, L.A. & Stormo, G.D. Identification of muscle-specific regulatory modules in Caenorhabditis elegans. 2007 Genome Res
    Vol. 17 (3) , pp. 348-357  
    article animals; caenorhabditis elegans; computational biology; gene expression regulation, developmental; genomics; green fluorescent proteins; muscle, skeletal; regulatory elements, transcriptional; transcription factors
    Abstract: Transcriptional regulation is the major regulatory mechanism that controls the spatial and temporal expression of genes during development. This is carried out by transcription factors (TFs), which recognize and bind to their cognate binding sites. Recent studies suggest a modular organization of TF-binding sites, in which clusters of transcription-factor binding sites cooperate in the regulation of downstream gene expression. In this study, we report our computational identification and experimental verification of muscle-specific cis-regulatory modules in Caenorhabditis elegans. We first identified a set of motifs that are correlated with muscle-specific gene expression. We then predicted muscle-specific regulatory modules based on clusters of those motifs with characteristics similar to a collection of well-studied modules in other species. The method correctly identifies 88% of the experimentally characterized modules with a positive predictive value of at least 65%. The prediction accuracy of muscle-specific expression on an independent test set is highly significant (P<0.0001). We performed in vivo experimental tests of 12 predicted modules, and 10 of those drive muscle-specific gene expression. These results suggest that our method is highly accurate in identifying functional sequences important for muscle-specific gene expression and is a valuable tool for guiding experimental designs.
    BibTeX:
    @article{
      author = {Guoyan Zhao and Lawrence A Schriefer and Gary D Stormo},
      title = {Identification of muscle-specific regulatory modules in Caenorhabditis elegans.},
      journal = {Genome Res},
      year = {2007},
      volume = {17},
      number = {3},
      pages = {348--357},
      doi = {http://dx.doi.org/10.1101/gr.5989907}
    }
    					
    Zhou, Y.; Cras-Méneur, C.; Ohsugi, M.; Stormo, G.D. & Permutt, M.A. A global approach to identify differentially expressed genes in cDNA (two-color) microarray experiments. 2007 Bioinformatics
    Vol. 23 (16) , pp. 2073-2079  
    article algorithms; gene expression profiling; in situ hybridization, fluorescence; microscopy, fluorescence, multiphoton; multigene family; oligonucleotide array sequence analysis
    Abstract: MOTIVATION: Currently most of the methods for identifying differentially expressed genes fall into the category of so called single-gene-analysis, performing hypothesis testing on a gene-by-gene basis. In a single-gene-analysis approach, estimating the variability of each gene is required to determine whether a gene is differentially expressed or not. Poor accuracy of variability estimation makes it difficult to identify genes with small fold-changes unless a very large number of replicate experiments are performed. RESULTS: We propose a method that can avoid the difficult task of estimating variability for each gene, while reliably identifying a group of differentially expressed genes with low false discovery rates, even when the fold-changes are very small. In this article, a new characterization of differentially expressed genes is established based on a theorem about the distribution of ranks of genes sorted by (log) ratios within each array. This characterization of differentially expressed genes based on rank is an example of all-gene-analysis instead of single gene analysis. We apply the method to a cDNA microarray dataset and many low fold-changed genes (as low as 1.3 fold-changes) are reliably identified without carrying out hypothesis testing on a gene-by-gene basis. The false discovery rate is estimated in two different ways reflecting the variability from all the genes without the complications related to multiple hypothesis testing. We also provide some comparisons between our approach and single-gene-analysis based methods. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
    BibTeX:
    @article{
      author = {Yiyong Zhou and Corentin Cras-Méneur and Mitsuru Ohsugi and Gary D Stormo and M. Alan Permutt},
      title = {A global approach to identify differentially expressed genes in cDNA (two-color) microarray experiments.},
      journal = {Bioinformatics},
      year = {2007},
      volume = {23},
      number = {16},
      pages = {2073--2079},
      doi = {http://dx.doi.org/10.1093/bioinformatics/btm292}
    }
    					
    Agrawal, R. & Stormo, G.D. Using mRNAs lengths to accurately predict the alternatively spliced gene products in Caenorhabditis elegans. 2006 Bioinformatics
    Vol. 22 (10) , pp. 1239-1244  
    article algorithms; alternative splicing; animals; base sequence; caenorhabditis elegans; caenorhabditis elegans proteins; chromosome mapping; computer simulation; gene expression; models, genetic; molecular sequence data; rna, messenger; sequence analysis, rna
    Abstract: MOTIVATION: Computational gene prediction methods are an important component of whole genome analyses. While ab initio gene finders have demonstrated major improvements in accuracy, the most reliable methods are evidence-based gene predictors. These algorithms can rely on several different sources of evidence including predictions from multiple ab initio gene finders, matches to known proteins, sequence conservation and partial cDNAs to predict the final product. Despite the success of these algorithms, prediction of complete gene structures, especially for alternatively spliced products, remains a difficult task. RESULTS: LOCUS (Length Optimized Characterization of Unknown Spliceforms) is a new evidence-based gene finding algorithm which integrates a length-constraint into a dynamic programming-based framework for prediction of gene products. On a Caenorhabditis elegans test set of alternatively spliced internal exons, its performance exceeds that of current ab initio gene finders and in most cases can accurately predict the correct form of all the alternative products. As the length information used by the algorithm can be obtained in a high-throughput fashion, we propose that integration of such information into a gene-prediction pipeline is feasible and doing so may improve our ability to fully characterize the complete set of mRNAs for a genome. AVAILABILITY: LOCUS is available from http://ural.wustl.edu/software.html
    BibTeX:
    @article{
      author = {Ritesh Agrawal and Gary D Stormo},
      title = {Using mRNAs lengths to accurately predict the alternatively spliced gene products in Caenorhabditis elegans.},
      journal = {Bioinformatics},
      year = {2006},
      volume = {22},
      number = {10},
      pages = {1239--1244},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/16595562},
      doi = {http://dx.doi.org/10.1093/bioinformatics/btl076}
    }
    					
    Chang, L.-W.; Nagarajan, R.; Magee, J.A.; Milbrandt, J. & Stormo, G.D. A systematic model to predict transcriptional regulatory mechanisms based on overrepresentation of transcription factor binding profiles. 2006 Genome Res
    Vol. 16 (3) , pp. 405-413  
    article animals; cell proliferation; cholesterol; chromatin immunoprecipitation; computational biology; conserved sequence; gene expression profiling; gene expression regulation; humans; mice; models, genetic; promoter regions (genetics); protein binding; transcription factors
    Abstract: An important aspect of understanding a biological pathway is to delineate the transcriptional regulatory mechanisms of the genes involved. Two important tasks are often encountered when studying transcription regulation, i.e., (1) the identification of common transcriptional regulators of a set of coexpressed genes; (2) the identification of genes that are regulated by one or several transcription factors. In this study, a systematic and statistical approach was taken to accomplish these tasks by establishing an integrated model considering all of the promoters and characterized transcription factors (TFs) in the genome. A promoter analysis pipeline (PAP) was developed to implement this approach. PAP was tested using coregulated gene clusters collected from the literature. In most test cases, PAP identified the transcription regulators of the input genes accurately. When compared with chromatin immunoprecipitation experiment data, PAP's predictions are consistent with the experimental observations. When PAP was used to analyze one published expression-profiling data set and two novel coregulated gene sets, PAP was able to generate biologically meaningful hypotheses. Therefore, by taking a systematic approach of considering all promoters and characterized TFs in our model, we were able to make more reliable predictions about the regulation of gene expression in mammalian organisms.
    BibTeX:
    @article{
      author = {Li-Wei Chang and Rakesh Nagarajan and Jeffrey A Magee and Jeffrey Milbrandt and Gary D Stormo},
      title = {A systematic model to predict transcriptional regulatory mechanisms based on overrepresentation of transcription factor binding profiles.},
      journal = {Genome Res},
      year = {2006},
      volume = {16},
      number = {3},
      pages = {405--413},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/16449500},
      doi = {http://dx.doi.org/10.1101/gr.4303406}
    }
    					
    Chung, T.P.; Laramie, J.M.; Meyer, D.J.; Downey, T.; Tam, L.H.Y.; Ding, H.; Buchman, T.G.; Karl, I.; Stormo, G.D.; Hotchkiss, R.S. & Cobb, J.P. Molecular diagnostics in sepsis: from bedside to bench. 2006 J Am Coll Surg
    Vol. 203 (5) , pp. 585-598  
    article adolescent; adult; aged; aged, 80 and over; animals; female; gene expression profiling; humans; inflammation; male; mice; mice, inbred c57bl; middle aged; principal component analysis; protein array analysis; sepsis; spleen
    Abstract: BACKGROUND: Based on recent in vitro data, we tested the hypothesis that microarray expression profiles can be used to diagnose sepsis, distinguishing in vivo between sterile and infectious causes of systemic inflammation. STUDY DESIGN: Exploratory studies were conducted using spleens from septic patients and from mice with abdominal sepsis. Seven patients with sepsis after injury were identified retrospectively and compared with six injured patients. C57BL/6 male mice were subjected to cecal ligation and puncture, or to IP lipopolysaccharide. Control mice had sham laparotomy or injection of IP saline, respectively. A sepsis classification model was created and tested on blood samples from septic mice. RESULTS: Accuracy of sepsis prediction was obtained using cross-validation of gene expression data from 12 human spleen samples and from 16 mouse spleen samples. For blood studies, classifiers were constructed using data from a training data set of 26 microarrays. The error rate of the classifiers was estimated on seven de-identified microarrays, and then on a subsequent cross-validation for all 33 blood microarrays. Estimates of classification accuracy of sepsis in human spleen were 67.1%; in mouse spleen, 96%; and in mouse blood, 94.4% (all estimates were based on nested cross-validation). Lists of genes with substantial changes in expression between study and control groups were used to identify nine mouse common inflammatory response genes, six of which were mapped into a single pathway using contemporary pathway analysis tools. CONCLUSIONS: Sepsis induces changes in mouse leukocyte gene expression that can be used to diagnose sepsis apart from systemic inflammation.
    BibTeX:
    @article{
      author = {T. Philip Chung and Jason M Laramie and Donald J Meyer and Thomas Downey and Laurence H Y Tam and Huashi Ding and Timothy G Buchman and Irene Karl and Gary D Stormo and Richard S Hotchkiss and J. Perren Cobb},
      title = {Molecular diagnostics in sepsis: from bedside to bench.},
      journal = {J Am Coll Surg},
      year = {2006},
      volume = {203},
      number = {5},
      pages = {585--598},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/17084318},
      doi = {http://dx.doi.org/10.1016/j.jamcollsurg.2006.06.028}
    }
    					
    DeMets, D.L.; Stormo, G.; Boehnke, M.; Louis, T.A.; Taylor, J. & Dixon, D. Training of the next generation of biostatisticians: a call to action in the U.S. 2006 Stat Med
    Vol. 25 (20) , pp. 3415-3429  
    article biological sciences; biometry; cooperative behavior; curriculum; education; national institutes of health (u.s.); united states
    Abstract: Two workshops (2001, 2003) were held by the National Institutes of Health (NIH) to examine the need to train more biostatisticians in the U.S. to meet the increasing opportunities in the biomedical research enterprise. The supply of new PhD graduates in biostatistics in the U.S. has been relatively steady for the past two decades while the demand has increased dramatically. These workshops concluded that a renewed effort must be made in the U.S., led in part by the NIH, to add to and expand the existing training programs to increase the supply. This article summarizes those two workshops and their recommendations. Some progress has been made through a new biostatistics training program with emphasis in bioinformatics sponsored by the National Institute of General Medical Sciences (NIGMS).
    BibTeX:
    @article{
      author = {David L DeMets and Gary Stormo and Michael Boehnke and Thomas A Louis and Jeremy Taylor and Dennis Dixon},
      title = {Training of the next generation of biostatisticians: a call to action in the U.S.},
      journal = {Stat Med},
      year = {2006},
      volume = {25},
      number = {20},
      pages = {3415--3429},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/16927449},
      doi = {http://dx.doi.org/10.1002/sim.2668}
    }
    					
    MacIsaac, K.D.; Wang, T.; Gordon, D.B.; Gifford, D.K.; Stormo, G.D. & Fraenkel, E. An improved map of conserved regulatory sites for Saccharomyces cerevisiae. 2006 BMC Bioinformatics
    Vol. 7 , pp. 113  
    article algorithms; chromosome mapping; conserved sequence; gene expression regulation, fungal; phylogeny; regulatory elements, transcriptional; saccharomyces cerevisiae; sequence analysis, dna; trans-activation (genetics); transcription factors
    Abstract: BACKGROUND: The regulatory map of a genome consists of the binding sites for proteins that determine the transcription of nearby genes. An initial regulatory map for S. cerevisiae was recently published using six motif discovery programs to analyze genome-wide chromatin immunoprecipitation data for 203 transcription factors. The programs were used to identify sequence motifs that were likely to correspond to the DNA-binding specificity of the immunoprecipitated proteins. We report improved versions of two conservation-based motif discovery algorithms, PhyloCon and Converge. Using these programs, we create a refined regulatory map for S. cerevisiae by reanalyzing the same chromatin immunoprecipitation data. RESULTS: Applying the same conservative criteria that were applied in the original study, we find that PhyloCon and Converge each separately discover more known specificities than the combination of all six programs in the previous study. Combining the results of PhyloCon and Converge, we discover significant sequence motifs for 36 transcription factors that were previously missed. The new set of motifs identifies 636 more regulatory interactions than the previous one. The new network contains 28% more regulatory interactions among transcription factors, evidence of greater cross-talk between regulators. CONCLUSION: Combining two complementary computational strategies for conservation-based motif discovery improves the ability to identify the specificity of transcriptional regulators from genome-wide chromatin immunoprecipitation data. The increased sensitivity of these methods significantly expands the map of yeast regulatory sites without the need to alter any of the thresholds for statistical significance. The new map of regulatory sites reveals a more elaborate and complex view of the yeast genetic regulatory network than was observed previously.
    BibTeX:
    @article{
      author = {Kenzie D MacIsaac and Ting Wang and D. Benjamin Gordon and David K Gifford and Gary D Stormo and Ernest Fraenkel},
      title = {An improved map of conserved regulatory sites for Saccharomyces cerevisiae.},
      journal = {BMC Bioinformatics},
      year = {2006},
      volume = {7},
      pages = {113},
      url = {http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1435934/?tool=pubmed},
      doi = {http://dx.doi.org/10.1186/1471-2105-7-113}
    }
    					
    Magee, J.A.; Chang, L.-W.; Stormo, G.D. & Milbrandt, J. Direct, androgen receptor-mediated regulation of the FKBP5 gene via a distal enhancer element. 2006 Endocrinology
    Vol. 147 (1) , pp. 590-598  
    article animals; base sequence; chromosome mapping; conserved sequence; enhancer elements (genetics); gene expression regulation; humans; introns; male; mice; molecular sequence data; orchiectomy; prostate; receptors, androgen; reverse transcriptase polymerase chain reaction; sequence alignment; sequence homology, nucleic acid; tacrolimus binding proteins; transcription, genetic
    Abstract: Androgen signaling via the androgen receptor (AR) transcription factor is crucial to normal prostate homeostasis and prostate tumorigenesis. Current models of AR function are predominantly based on studies of prostate-specific antigen regulation in androgen-responsive cell lines. To expand on these in vitro paradigms, we used the mouse prostate to elucidate the mechanisms through which AR regulates another direct target, FKBP5, in vivo. FKBP5 encodes an immunophilin that has been previously implicated in glucocorticoid and progestin signaling pathways and that likely influences prostate physiology in the presence of androgens. In this work, we show that androgens directly regulate FKBP5 via an interaction between the AR and a distal enhancer located 65 kb downstream of the transcription start site in the fifth intron of the FKBP5 gene. We have found that AR selectively recruits cAMP response element-binding protein to this enhancer. These interactions, in turn, result in chromatin remodeling that affects the enhancer proper but not the FKBP5 locus as a whole. Furthermore, in contrast to prostate-specific antigen-regulatory mechanisms, we show that transactivation of the FKBP5 gene does not rely on a single looping complex to mediate communication between the distal enhancer and proximal promoter. Rather, the distal enhancer complex and basal transcription apparatus communicate indirectly with one another, implicating a regulatory mechanism that has not been previously appreciated for AR target genes.
    BibTeX:
    @article{
      author = {Jeffrey A Magee and Li-Wei Chang and Gary D Stormo and Jeffrey Milbrandt},
      title = {Direct, androgen receptor-mediated regulation of the FKBP5 gene via a distal enhancer element.},
      journal = {Endocrinology},
      year = {2006},
      volume = {147},
      number = {1},
      pages = {590--598},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/16210365},
      doi = {http://dx.doi.org/10.1210/en.2005-1001}
    }
    					
    Ong, C.-T.; Cheng, H.-T.; Chang, L.-W.; Ohtsuka, T.; Kageyama, R.; Stormo, G.D. & Kopan, R. Target selectivity of vertebrate notch proteins. Collaboration between discrete domains and CSL-binding site architecture determines activation probability. 2006 J Biol Chem
    Vol. 281 (8) , pp. 5106-5119  
    article animals; base sequence; basic helix-loop-helix transcription factors; binding sites; blotting, western; cell line; cell line, tumor; dna; dose-response relationship, drug; female; gene deletion; green fluorescent proteins; hela cells; homeodomain proteins; humans; immunohistochemistry; kinetics; luciferases; male; mice; models, biological; models, genetic; models, statistical; molecular sequence data; nih 3t3 cells; organ culture techniques; promoter regions (genetics); protein binding; protein structure, tertiary; receptors, notch; repressor proteins; time factors; trans-activation (genetics); transcription factors; transcription, genetic; transfection
    Abstract: All four mammalian Notch proteins interact with a single DNA-binding protein (RBP-jkappa), yet they are not equivalent in activating target genes. Parallel assays of three Notch-responsive promoters in several cell lines revealed that relative activation strength is dependent on protein module and promoter context more than the cellular context. Each Notch protein reads binding site orientation and distribution on the promoter differently; Notch1 performs extremely well on paired sites, and Notch3 prefers single sites in conjunction with a proximal zinc finger transcription factor. Although head-head sites can elicit a Notch response on their own, use of CBS (CSL binding site) in tail-tail orientation is context-dependent. Bias for specific DNA elements is achieved by interplay between the N-terminal RAM (RBP-jkappa-associated molecule/ankyrin region), which interprets CBS proximity and orientation, and the C-terminal transactivation domain that interacts specifically with the transcription machinery or nearby factors. To confirm the prediction that modular design underscores the evolution of functional divergence between Notch proteins, we generated a synthetic Notch protein (Notch1 ankyrin with Notch3 transactivation domain) that displayed superior signaling strength on the hes5 promoter. Consistent with the prediction that "preferred" targets (Hes1) should respond faster and at lower Notch concentration than other targets, we showed that Hes5-GFP was extinguished fast and recovered slowly, whereas Hes1-GFP was inhibited late and recovered quickly after a pulse of DAPT in metanephroi cultures.
    BibTeX:
    @article{
      author = {Chin-Tong Ong and Hui-Teng Cheng and Li-Wei Chang and Toshiyuki Ohtsuka and Ryoichiro Kageyama and Gary D Stormo and Raphael Kopan},
      title = {Target selectivity of vertebrate notch proteins. Collaboration between discrete domains and CSL-binding site architecture determines activation probability.},
      journal = {J Biol Chem},
      year = {2006},
      volume = {281},
      number = {8},
      pages = {5106--5119},
      url = {http://www.jbc.org/content/281/8/5106.long},
      doi = {http://dx.doi.org/10.1074/jbc.M506108200}
    }
    					
    Stormo, G.D. An introduction to recognizing functional domains. 2006 Curr Protoc Bioinformatics
    Vol. Chapter 2 , pp. Unit 2.1  
    article algorithms; conserved sequence; dna, genetics/metabolism; protein structure, tertiary; proteins, chemistry/metabolism; sequence alignment, methods; sequence analysis, dna, methods; sequence analysis, protein, methods; software
    Abstract: This unit provides an overview of issues involved in domain recognition in protein and DNA sequences. It opens with a discussion of the two primary methods of domain representation, namely consensus sequences and alignment matrices (e.g., the log-odds matrix). The unit continues with a brief overview of some of the resources available for identifying functional domains in nucleotide sequences (e.g., TRANSFAC). In addition, it reviews databases such as Pfam, InterPro and Blocks, which are available for protein analysis.
    BibTeX:
    @article{
      author = {Gary D Stormo},
      title = {An introduction to recognizing functional domains.},
      journal = {Curr Protoc Bioinformatics},
      year = {2006},
      volume = {Chapter 2},
      pages = {Unit 2.1},
      url = {http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0201s15/abstract},
      doi = {http://dx.doi.org/10.1002/0471250953.bi0201s15}
    }
    					
    Stormo, G.D. An overview of RNA structure prediction and applications to RNA gene prediction and RNAi design. 2006 Curr Protoc Bioinformatics
    Vol. Chapter 12 , pp. Unit 12.1  
    article base sequence; computer simulation; models, chemical; models, genetic; molecular sequence data; nucleic acid conformation; rna interference; rna, chemistry/genetics; sequence analysis, rna, methods
    Abstract: This unit briefly describes the two fundamentally different methods for predicting RNA structures. The first is to find that structure with the minimum free energy of folding, as predicted by various thermodynamic parameters related to base-pair stacking, loop lengths, and other features. If one has only a single sequence, this thermodynamic approach is the best available method. The second fundamental approach to RNA structure prediction is to use multiple, homologous sequences for which one can infer a common structure, and then try and predict a structure common to all of the sequences. Such an approach is referred to as a comparative method or phylogenetic method of RNA structure prediction.
    BibTeX:
    @article{
      author = {Gary D Stormo},
      title = {An overview of RNA structure prediction and applications to RNA gene prediction and RNAi design.},
      journal = {Curr Protoc Bioinformatics},
      year = {2006},
      volume = {Chapter 12},
      pages = {Unit 12.1},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/18428758},
      doi = {http://dx.doi.org/10.1002/0471250953.bi1201s13}
    }
    					
    Stormo, G. Rigoutsos, I. & Stephanopoulos, G. (Eds.) DNA-Protein Interactions ( Systems Biology I: Genomics ) 2006 Systems Biology I: Genomics , pp. 219-247   inbook
    BibTeX:
    @inbook{
      author = {Stormo, G.D.},
      title = {Systems Biology I: Genomics},
      publisher = {Oxford Univ. Press},
      year = {2006},
      pages = {219--247}
    }
    					
    Agrawal, R. & Stormo, G.D. Editing efficiency of a Drosophila gene correlates with a distant splice site selection. 2005 RNA
    Vol. 11 (5) , pp. 563-566  
    article alternative splicing; animals; drosophila proteins; drosophila melanogaster; genes, insect; models, genetic; rna editing; rna splice sites; reverse transcriptase polymerase chain reaction
    Abstract: RNA editing and alternative splicing are two processes that increase protein diversity. The relationship between the two processes is not well understood. There are a few examples of correlations between editing and alternative splicing, but these are all nearby effects. A search for alternative splicing among 16 edited genes in Drosophila reveals two novel instances of alternative splicing. In one example where alternative splicing occurs downstream of editing, a strong correlation between editing efficiency and splice site selection is observed. In contrast, when editing occurs downstream of alternative splicing, no correlation is seen. These results suggest some models for the coupling of editing and splicing processes.
    BibTeX:
    @article{
      author = {Ritesh Agrawal and Gary D Stormo},
      title = {Editing efficiency of a Drosophila gene correlates with a distant splice site selection.},
      journal = {RNA},
      year = {2005},
      volume = {11},
      number = {5},
      pages = {563--566},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/15840811},
      doi = {http://dx.doi.org/10.1261/rna.7280605}
    }
    					
    Cobb, J.P.; Mindrinos, M.N.; Miller-Graziano, C.; Calvano, S.E.; Baker, H.V.; Xiao, W.; Laudanski, K.; Brownstein, B.H.; Elson, C.M.; Hayden, D.L.; Herndon, D.N.; Lowry, S.F.; Maier, R.V.; Schoenfeld, D.A.; Moldawer, L.L.; Davis, R.W.; Tompkins, R.G.; Baker, H.V.; Bankey, P.; Billiar, T.; Brownstein, B.H.; Calvano, S.E.; Camp, D.; Chaudry, I.; Cobb, J.P.; Davis, R.W.; Elson, C.M.; Freeman, B.; Gamelli, R.; Gibran, N.; Harbrecht, B.; Hayden, D.L.; Heagy, W.; Heimbach, D.; Herndon, D.N.; Horton, J.; Hunt, J.; Laudanski, K.; Lederer, J.; Lowry, S.F.; Maier, R.V.; Mannick, J.; McKinley, B.; Miller-Graziano, C.; Mindrinos, M.N.; Minei, J.; Moldawer, L.L.; Moore, E.; Moore, F.; Munford, R.; Nathens, A.; O'keefe, G.; Purdue, G.; Rahme, L.; Remick, D.; Sailors, M.; Schoenfeld, D.A.; Shapiro, M.; Silver, G.; Smith, R.; Stephanopoulos, G.; Stormo, G.; Tompkins, R.G.; Toner, M.; Warren, S.; West, M.; Wolfe, S.; Xiao, W.; Young, V. & Inflammation and Host Response to Injury Large-Scale Collaborative Research Program Application of genome-wide expression analysis to human health and disease. 2005 Proc Natl Acad Sci U S A
    Vol. 102 (13) , pp. 4801-4806  
    article cluster analysis; gene expression; genome, human; genotype; humans; leukocytes; multicenter studies; oligonucleotide array sequence analysis; patient selection; principal component analysis; reproducibility of results; specimen handling; wounds and injuries
    Abstract: The application of genome-wide expression analysis to a large-scale, multicentered program in critically ill patients poses a number of theoretical and technical challenges. We describe here an analytical and organizational approach to a systematic evaluation of the variance associated with genome-wide expression analysis specifically tailored to study human disease. We analyzed sources of variance in genome-wide expression analyses performed with commercial oligonucleotide arrays. In addition, variance in gene expression in human blood leukocytes caused by repeated sampling in the same subject, among different healthy subjects, among different leukocyte subpopulations, and the effect of traumatic injury, were also explored. We report that analytical variance caused by sample processing was acceptably small. Blood leukocyte gene expression in the same individual over a 24-h period was remarkably constant. In contrast, genome-wide expression varied significantly among different subjects and leukocyte subpopulations. Expectedly, traumatic injury induced dramatic changes in apparent gene expression that were greater in magnitude than the analytical noise and interindividual variance. We demonstrate that the development of a nation-wide program for gene expression analysis with careful attention to analytical details can reduce the variance in the clinical setting to a level where patterns of gene expression are informative among different healthy human subjects, and can be studied with confidence in human disease.
    BibTeX:
    @article{
      author = {J. Perren Cobb and Michael N Mindrinos and Carol Miller-Graziano and Steve E Calvano and Henry V Baker and Wenzhong Xiao and Krzysztof Laudanski and Bernard H Brownstein and Constance M Elson and Douglas L Hayden and David N Herndon and Stephen F Lowry and Ronald V Maier and David A Schoenfeld and Lyle L Moldawer and Ronald W Davis and Ronald G Tompkins and Henry V Baker and Paul Bankey and Timothy Billiar and Bernard H Brownstein and Steve E Calvano and David Camp and Irshad Chaudry and J. Perren Cobb and Ronald W Davis and Constance M Elson and Bradley Freeman and Richard Gamelli and Nicole Gibran and Brian Harbrecht and Douglas L Hayden and Wyrta Heagy and David Heimbach and David N Herndon and Jureta Horton and John Hunt and Krzysztof Laudanski and James Lederer and Stephen F Lowry and Ronald V Maier and John Mannick and Bruce McKinley and Carol Miller-Graziano and Michael N Mindrinos and Joseph Minei and Lyle L Moldawer and Ernest Moore and Frederick Moore and Robert Munford and Avery Nathens and Grant O'keefe and Gary Purdue and Laurence Rahme and Daniel Remick and Matthew Sailors and David A Schoenfeld and Michael Shapiro and Geoffrey Silver and Richard Smith and Gregory Stephanopoulos and Gary Stormo and Ronald G Tompkins and Mehmet Toner and Shaw Warren and Michael West and Steven Wolfe and Wenzhong Xiao and Vernon Young and {Inflammation and Host Response to Injury Large-Scale Collaborative Research Program}},
      title = {Application of genome-wide expression analysis to human health and disease.},
      journal = {Proc Natl Acad Sci U S A},
      year = {2005},
      volume = {102},
      number = {13},
      pages = {4801--4806},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/15781863}
    }
    					
    Freimuth, R.R.; Stormo, G.D. & McLeod, H.L. PolyMAPr: programs for polymorphism database mining, annotation, and functional analysis. 2005 Hum Mutat
    Vol. 25 (2) , pp. 110-117  
    article databases, nucleic acid; gene frequency; genetic predisposition to disease; humans; internet; pharmacogenetics; polymorphism, single nucleotide; software; user-computer interface
    Abstract: Pharmacogenomic and disease-association studies rely on identifying a comprehensive set of polymorphisms within candidate genes. Public SNP databases are a rich source of polymorphism data, but mining them effectively requires overcoming at least four challenges: ensuring accurate annotations for genes and polymorphisms, eliminating both inter- and intra-database redundancy, integrating data from multiple public sources with data generated locally, and prioritizing the variants for further study. PolyMAPr (Polymorphism Mining and Annotation Programs) was developed to overcome these challenges and to improve the efficiency of database mining and polymorphism annotation. PolyMAPr takes as input a file containing a list of genes to be processed and files containing each annotated gene sequence. Polymorphic sequences obtained from public databases (dbSNP, CGAP, and JSNP) or through local SNP discovery efforts, as well as oligonucleotide sequences (e.g., PCR primers), are mapped to the annotated gene sequences and named according to suggested nomenclature guidelines. The functional effects of nonsynonymous coding-region SNPs (cSNPs) and any variants that might alter exon splicing enhancer (ESE) sites, putative transcription factor binding sites, or intron-exon splice sites are predicted. The output files are accessible through a browser interface. In addition, the results are also provided in Extensible Markup Language (XML) format to facilitate uploading them into a local relational database. PolyMAPr increases the efficiency of mining public databases for genetic variants within candidate genes and provides a mechanism by which data from multiple sources (both public and private) can be uniformly integrated, thereby significantly reducing the effort required to obtain a comprehensive set of polymorphisms for pharmacogenomic and disease-association studies. PolyMAPr can be obtained from http://pharmacogenomics.wustl.edu.
    BibTeX:
    @article{
      author = {Robert R Freimuth and Gary D Stormo and Howard L McLeod},
      title = {PolyMAPr: programs for polymorphism database mining, annotation, and functional analysis.},
      journal = {Hum Mutat},
      year = {2005},
      volume = {25},
      number = {2},
      pages = {110--117},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/15643605},
      doi = {http://dx.doi.org/10.1002/humu.20123}
    }
    					
    Gershenzon, N.I.; Stormo, G.D. & Ioshikhes, I.P. Computational technique for improvement of the position-weight matrices for the DNA/protein binding sites. 2005 Nucleic Acids Res
    Vol. 33 (7) , pp. 2290-2301  
    article algorithms; binding sites; computational biology; dna-binding proteins; genome, human; humans; promoter regions (genetics); response elements; sequence analysis, dna; sp1 transcription factor; transcription factors
    Abstract: Position-weight matrices (PWMs) are broadly used to locate transcription factor binding sites in DNA sequences. The majority of existing PWMs provide a low level of both sensitivity and specificity. We present a new computational algorithm, a modification of the Staden-Bucher approach, that improves the PWM. We applied the proposed technique on the PWM of the GC-box, binding site for Sp1. The comparison of old and new PWMs shows that the latter increase both sensitivity and specificity. The statistical parameters of GC-box distribution in promoter regions and in the human genome, as well as in each chromosome, are presented. The majority of commonly used PWMs are the 4-row mononucleotide matrices, although 16-row dinucleotide matrices are known to be more informative. The algorithm efficiently determines the 16-row matrices and preliminary results show that such matrices provide better results than 4-row matrices.
    BibTeX:
    @article{
      author = {Naum I Gershenzon and Gary D Stormo and Ilya P Ioshikhes},
      title = {Computational technique for improvement of the position-weight matrices for the DNA/protein binding sites.},
      journal = {Nucleic Acids Res},
      year = {2005},
      volume = {33},
      number = {7},
      pages = {2290--2301},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/15849315},
      doi = {http://dx.doi.org/10.1093/nar/gki519}
    }
    					
    Havgaard, J.H.; Lyngsø, R.B.; Stormo, G.D. & Gorodkin, J. Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%. 2005 Bioinformatics
    Vol. 21 (9) , pp. 1815-1824  
    article algorithms; conserved sequence; nucleic acid conformation; rna, untranslated; sequence alignment; sequence analysis, rna; sequence homology, nucleic acid; software
    Abstract: MOTIVATION: Searching for non-coding RNA (ncRNA) genes and structural RNA elements (eleRNA) are major challenges in gene finding today as these often are conserved in structure rather than in sequence. Even though the number of available methods is growing, it is still of interest to pairwise detect two genes with low sequence similarity, where the genes are part of a larger genomic region. RESULTS: Here we present such an approach for pairwise local alignment which is based on foldalign and the Sankoff algorithm for simultaneous structural alignment of multiple sequences. We include the ability to conduct mutual scans of two sequences of arbitrary length while searching for common local structural motifs of some maximum length. This drastically reduces the complexity of the algorithm. The scoring scheme includes structural parameters corresponding to those available for free energy as well as for substitution matrices similar to RIBOSUM. The new foldalign implementation is tested on a dataset where the ncRNAs and eleRNAs have sequence similarity <40% and where the ncRNAs and eleRNAs are energetically indistinguishable from the surrounding genomic sequence context. The method is tested in two ways: (1) its ability to find the common structure between the genes only and (2) its ability to locate ncRNAs and eleRNAs in a genomic context. In case (1), it makes sense to compare with methods like Dynalign, and the performances are very similar, but foldalign is substantially faster. The structure prediction performance for a family is typically around 0.7 using Matthews correlation coefficient. In case (2), the algorithm is successful at locating RNA families with an average sensitivity of 0.8 and a positive predictive value of 0.9 using a BLAST-like hit selection scheme. AVAILABILITY: The program is available online at http://foldalign.kvl.dk/
    BibTeX:
    @article{
      author = {Jakob Hull Havgaard and Rune B Lyngsø and Gary D Stormo and Jan Gorodkin},
      title = {Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%.},
      journal = {Bioinformatics},
      year = {2005},
      volume = {21},
      number = {9},
      pages = {1815--1824},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/15657094},
      doi = {http://dx.doi.org/10.1093/bioinformatics/bti279}
    }
    					
    Li, J.B.; Zhang, M.; Dutcher, S.K. & Stormo, G.D. Procom: a web-based tool to compare multiple eukaryotic proteomes. 2005 Bioinformatics
    Vol. 21 (8) , pp. 1693-1694  
    article animals; chromosome mapping; eukaryotic cells; gene expression profiling; humans; internet; linkage disequilibrium; proteome; software; species specificity
    Abstract: SUMMARY: Each organism has traits that are shared with some, but not all, organisms. Identification of genes needed for a particular trait can be accomplished by a comparative genomics approach using three or more organisms. Genes that occur in organisms without the trait are removed from the set of genes in common among organisms with the trait. To facilitate these comparisons, a web-based server, Procom, was developed to identify the subset of genes that may be needed for a trait. AVAILABILITY: The Procom program is freely available with documentation and examples at http://ural.wustl.edu/~billy/Procom/ CONTACT: billy@ural.wustl.edu.
    BibTeX:
    @article{
      author = {Jin Billy Li and Miao Zhang and Susan K Dutcher and Gary D Stormo},
      title = {Procom: a web-based tool to compare multiple eukaryotic proteomes.},
      journal = {Bioinformatics},
      year = {2005},
      volume = {21},
      number = {8},
      pages = {1693--1694},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/15564299},
      doi = {http://dx.doi.org/10.1093/bioinformatics/bti161}
    }
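The subtraction described in the Procom abstract above can be sketched with plain set operations. This is only an illustration under stated assumptions: the function name `trait_candidates`, the organism labels, and the gene identifiers are hypothetical, and genes are assumed to be already mapped to shared ortholog identifiers. It is not the Procom implementation.

```python
def trait_candidates(with_trait, without_trait):
    """Return genes common to every organism with the trait and absent
    from every organism without it.

    Both arguments map an organism name to a set of gene identifiers.
    The identifiers are assumed to be comparable across organisms.
    """
    if not with_trait:
        raise ValueError("need at least one organism with the trait")
    # Genes in common among all organisms that have the trait.
    shared = set.intersection(*(set(genes) for genes in with_trait.values()))
    # Remove any gene that also occurs in an organism lacking the trait.
    for genes in without_trait.values():
        shared -= set(genes)
    return shared


# Hypothetical toy proteomes, for illustration only.
with_trait = {
    "organism_a": {"g1", "g2", "g3", "g4"},
    "organism_b": {"g2", "g3", "g4", "g5"},
}
without_trait = {"organism_c": {"g3", "g6"}}

print(sorted(trait_candidates(with_trait, without_trait)))
```

With three or more organisms the candidate set shrinks quickly, which is why the abstract calls for at least three.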
    					
    Liu, J. & Stormo, G.D. Combining SELEX with quantitative assays to rapidly obtain accurate models of protein-DNA interactions. 2005 Nucleic Acids Res
    Vol. 33 (17) , pp. e141  
    article base sequence; binding sites; chromatography, affinity; dna; dna-binding proteins; early growth response protein 1; genomics; immediate-early proteins; models, biological; molecular sequence data; oligonucleotides; reproducibility of results; transcription factors; zinc fingers
    Abstract: Models for the specificity of DNA-binding transcription factors are often based on small amounts of qualitative data and therefore have limited accuracy. In this study we demonstrate a simple and efficient method of affinity chromatography-SELEX followed by a quantitative binding (QuMFRA) assay to rapidly collect the data necessary for more accurate models. Using the zinc finger protein EGR as an example, we show that many binding sites can be obtained efficiently with affinity chromatography-SELEX, but those sequences alone provide a weight matrix model with limited accuracy. Using a QuMFRA assay to determine the quantitative relative affinity for only a subset of the sequences obtained by SELEX leads to a much more accurate model. Application of this method to variants of a transcription factor would allow us to generate a large collection of quantitative data for modeling protein-DNA interactions that could facilitate the determination of recognition codes for different transcription factor families.
    BibTeX:
    @article{
      author = {Jiajian Liu and Gary D Stormo},
      title = {Combining SELEX with quantitative assays to rapidly obtain accurate models of protein-DNA interactions.},
      journal = {Nucleic Acids Res},
      year = {2005},
      volume = {33},
      number = {17},
      pages = {e141},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/16186128},
      doi = {http://dx.doi.org/10.1093/nar/gni139}
    }
    					
    Liu, J. & Stormo, G.D. Quantitative analysis of EGR proteins binding to DNA: assessing additivity in both the binding site and the protein. 2005 BMC Bioinformatics
    Vol. 6 , pp. 176  
    article algorithms; amino acids; binding sites; computational biology; dna; early growth response transcription factors; models, genetic; models, molecular; protein binding; zinc fingers
    Abstract: BACKGROUND: Recognition codes for protein-DNA interactions typically assume that the interacting positions contribute additively to the binding energy. While this is known to not be precisely true, an additive model over the DNA positions can be a good approximation, at least for some proteins. Much less information is available about whether the protein positions contribute additively to the interaction. RESULTS: Using EGR zinc finger proteins, we measure the binding affinity of six different variants of the protein to each of six different variants of the consensus binding site. Both the protein and binding site variants include single and double mutations that allow us to assess how well additive models can account for the data. For each protein and DNA alone we find that additive models are good approximations, but over the combined set of data there are context effects that limit their accuracy. However, a small modification to the purely additive model, with only three additional parameters, improves the fit significantly. CONCLUSION: The additive model holds very well for every DNA site and every protein included in this study, but clear context dependence in the interactions was detected. A simple modification to the independent model provides a better fit to the complete data.
    BibTeX:
    @article{
      author = {Jiajian Liu and Gary D Stormo},
      title = {Quantitative analysis of EGR proteins binding to DNA: assessing additivity in both the binding site and the protein.},
      journal = {BMC Bioinformatics},
      year = {2005},
      volume = {6},
      pages = {176},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/16014175},
      doi = {http://dx.doi.org/10.1186/1471-2105-6-176}
    }
    					
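    Olson, M.V.; Bickel, P.J.; Cowan, J.D.; Federoff, N.; Greengard, L.; Hudson, R.; Keener, J.; Lipshutz, R.; Mesirov, J.P.; Neuhauser, C.; Shvartsman, S.Y.; Stormo, G.D.; Waterman, M.S.; Wolynes, P.G.; Wong, W.H. & Wooley, J. Mathematics and 21st Century Biology 2005   book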
    BibTeX:
    @book{
      author = {Olson, M.V. and Bickel, P.J. and Cowan, J.D. and Federoff, N. and Greengard, L. and Hudson, R. and Keener, J. and Lipshutz, R. and Mesirov, J.P. and Neuhauser, C. and Shvartsman, S.Y. and Stormo, G.D. and Waterman, M.S. and Wolynes, P.G. and Wong, W.H. and Wooley, J.},
      title = {Mathematics and 21st Century Biology},
      publisher = {National Academies Press, Washington, D.C.},
      year = {2005},
      url = {http://www.nap.edu/catalog.php?record_id=11315}
    }
    					
    Tan, K.; McCue, L.A. & Stormo, G.D. Making connections between novel transcription factors and their DNA motifs. 2005 Genome Res
    Vol. 15 (2) , pp. 312-320  
    article algorithms; base composition; binding sites; dna, bacterial; genome, bacterial; gram-negative bacteria; peptides; phylogeny; predictive value of tests; protein binding; protein structure, tertiary; regulon; transcription factors; transcription, genetic
    Abstract: The key components of a transcriptional regulatory network are the connections between trans-acting transcription factors and cis-acting DNA-binding sites. In spite of several decades of intense research, only a fraction of the estimated approximately 300 transcription factors in Escherichia coli have been linked to some of their binding sites in the genome. In this paper, we present a computational method to connect novel transcription factors and DNA motifs in E. coli. Our method uses three types of mutually independent information, two of which are gleaned by comparative analysis of multiple genomes and the third one derived from similarities of transcription-factor-DNA-binding-site interactions. The different types of information are combined to calculate the probability of a given transcription-factor-DNA-motif pair being a true pair. Tested on a study set of transcription factors and their DNA motifs, our method has a prediction accuracy of 59% for the top predictions and 85% for the top three predictions. When applied to 99 novel transcription factors and 70 novel DNA motifs, our method predicted 64 transcription-factor-DNA-motif pairs. Supporting evidence for some of the predicted pairs is presented. Functional annotations are made for 23 novel transcription factors based on the predicted transcription-factor-DNA-motif connections.
    BibTeX:
    @article{
      author = {Kai Tan and Lee Ann McCue and Gary D Stormo},
      title = {Making connections between novel transcription factors and their DNA motifs.},
      journal = {Genome Res},
      year = {2005},
      volume = {15},
      number = {2},
      pages = {312--320},
      doi = {http://dx.doi.org/10.1101/gr.3069205}
    }
    					
    Wang, T. & Stormo, G.D. Identifying the conserved network of cis-regulatory sites of a eukaryotic genome. 2005 Proc Natl Acad Sci U S A
    Vol. 102 (48) , pp. 17400-17405  
    article algorithms; computational biology; gene expression regulation, fungal; genomics; models, genetic; promoter regions (genetics); regulatory elements, transcriptional; saccharomyces cerevisiae; species specificity
    Abstract: A major focus of genome research has been to decipher the cis-regulatory code that governs complex transcriptional regulation. We report a computational approach for identifying conserved regulatory motifs of an organism directly from whole genome sequences of several related species without reliance on additional information. We first construct phylogenetic profiles for each promoter, then use a BLAST-like algorithm to efficiently search through the entire profile space of all of the promoters in the genome to identify conserved motifs and the promoters that contain them. Statistical significance is estimated by modified Karlin-Altschul statistics. We applied this approach to the analysis of 3,524 Saccharomyces cerevisiae promoters and identified a highly organized regulatory network involving 3,315 promoters and 296 motifs. This network includes nearly all of the currently known motifs and covers >90% of known transcription factor binding sites. Most of the predicted coregulated gene clusters in the network have additional supporting evidence. Theoretical analysis suggests that our algorithm should be applicable to much larger genomes, such as the human genome, without reaching its statistical limitation.
    Review: This paper introduces the PhyloNet program.
    PhyloNet is a motif discovery program. PhyloNet stands for
    "Phylogenetic Regulatory Network". It represents a new
    paradigm for motif discovery: based on sequences of several
    evolutionarily related genomes, PhyloNet predicts a near-complete
    set of conserved motifs of the organism of interest,
    as well as gene clusters that share these motifs, without
    relying on additional data such as gene regulation.

    The algorithm takes advantage of two important features of a
    regulatory network: phylogenetic conservation and network topology.

    The architecture of the program follows that of BLAST. The input
    sequences are divided into two parts: query and database. The query
    is the "gene of interest", or the promoter sequence of the gene
    of interest; the database contains all the genes/promoters of
    the genome. For each promoter, a few orthologous promoters are
    needed, but they do not have to come from the same set of genomes
    or be equal in number. Basically, each "unit" is a group of
    orthologous sequences, with the first sequence being the promoter
    from the genome of interest. Just like BLAST, the algorithm compares
    the promoter of interest to all the promoters of the genome and
    determines local similarities between the query and database
    promoters. The output of the program contains motifs of the promoter
    of interest, together with a set of genes that share each motif.
    The program determines the width of the pattern being sought.

    For whole-genome motif discovery, one can simply run PhyloNet
    once per promoter, using each promoter in turn as the query and
    all promoters as the database, and then consolidate the predictions.

    Before running PhyloNet, a separate step of "phylogenetic footprinting"
    needs to be performed. The "wconsensus" algorithm, bundled with
    PhyloNet, is used for phylogenetic footprinting. This algorithm
    locally aligns sequences of multiple genomes, producing multiple,
    suboptimal ungapped alignments. By replacing the input module of
    PhyloNet one can use other algorithms for this step.

    The complete pipeline, starting with phylogenetic footprinting, has
    these components:
    1) Phylogenetic footprinting of the promoters: the wconsensus algorithm
    is used to extract conserved regions of the promoters based on
    reference genome sequences.
    2) Promoter profile construction: multiple, suboptimal sequence
    alignments from phylogenetic footprinting are converted to
    sequence profiles.
    3) Profile space partition: the continuous profile space is partitioned
    into discrete profile clusters. Each partition is represented by a
    single profile. Distances among the partitions are calculated with
    the ALLR statistic. An ALLR scoring matrix is constructed
    for profile comparison.
    4) Query hashing: the query promoter profiles are converted into
    a collection of formatted seeds (or words) of flexible length.
    Neighborhood words of each seed are generated via a branch-and-bound
    algorithm. A hash (or index) is built for the query promoter.
    5) Motif BLAST: the entire database (all promoter profiles) is
    searched against the query hash to locate word hits; each hit is
    then extended by local dynamic programming into a high-scoring
    pair (HSP). The significance of these HSPs is estimated with
    Karlin-Altschul statistics.
    6) HSP clustering: significant HSPs are mapped back to the query
    promoter and clustered by applying a maximum clique finding
    algorithm from graph theory, based on the overlapping relations
    among HSPs.
    7) Motif construction: clustered HSPs are converted to motifs using
    a greedy approach. The final significance of each motif is estimated
    from the sum of p-values.
    8) Background control: the algorithm has options to shuffle either
    the query promoter, or the database, or both, while conserving
    the sequence identity, sequence length and length of conserved
    blocks. The program is then run on the shuffled datasets to generate
    a background score distribution.
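
Step 3 above relies on the ALLR (average log-likelihood ratio) statistic to compare profile columns. The review does not spell out the formula, so the sketch below uses the form in which ALLR is usually stated: each column's counts are scored against the other column's log-odds, and the sum is divided by the total count. The uniform background, the pseudocount, and all function names are assumptions made for illustration; this is not the PhyloNet source code.

```python
import math

# Assumed uniform background; a real run would use genome base frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def column_freqs(counts, background=BACKGROUND, pseudocount=1.0):
    """Turn base counts into frequencies, smoothed by a background-weighted pseudocount."""
    total = sum(counts.get(b, 0) for b in background) + pseudocount
    return {
        b: (counts.get(b, 0) + pseudocount * background[b]) / total
        for b in background
    }


def allr_column(counts_i, counts_j, background=BACKGROUND, pseudocount=1.0):
    """ALLR between two profile columns given as base-count dictionaries."""
    f_i = column_freqs(counts_i, background, pseudocount)
    f_j = column_freqs(counts_j, background, pseudocount)
    n_total = sum(counts_i.get(b, 0) + counts_j.get(b, 0) for b in background)
    if n_total == 0:
        raise ValueError("both columns are empty")
    # Counts of one column weight the log-odds of the other, and vice versa.
    numerator = sum(
        counts_j.get(b, 0) * math.log(f_i[b] / background[b])
        + counts_i.get(b, 0) * math.log(f_j[b] / background[b])
        for b in background
    )
    return numerator / n_total


def allr_score(profile_i, profile_j, **kwargs):
    """Sum of column ALLRs for two ungapped profiles of equal width."""
    if len(profile_i) != len(profile_j):
        raise ValueError("profiles must have the same width")
    return sum(allr_column(a, b, **kwargs) for a, b in zip(profile_i, profile_j))


similar = allr_column({"A": 10}, {"A": 8, "C": 1})
dissimilar = allr_column({"A": 10}, {"T": 10})
print(similar > 0, dissimilar < 0)
```

Columns with similar, non-background composition score positive and dissimilar columns score negative, which is what allows the statistic to drive a BLAST-style local extension.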

    BibTeX:
    @article{
      author = {Ting Wang and Gary D Stormo},
      title = {Identifying the conserved network of cis-regulatory sites of a eukaryotic genome.},
      journal = {Proc Natl Acad Sci U S A},
      year = {2005},
      volume = {102},
      number = {48},
      pages = {17400--17405},
      doi = {http://dx.doi.org/10.1073/pnas.0505147102}
    }
    					
    Workman, C.T.; Yin, Y.; Corcoran, D.L.; Ideker, T.; Stormo, G.D. & Benos, P.V. enoLOGOS: a versatile web tool for energy normalized sequence logos. 2005 Nucleic Acids Res
    Vol. 33 (Web Server issue) , pp. W389-W392  
    article amino acids; binding sites; computer graphics; dna-binding proteins; internet; sequence alignment; sequence analysis, dna; software; user-computer interface; zinc fingers
    Abstract: enoLOGOS is a web-based tool that generates sequence logos from various input sources. Sequence logos have become a popular way to graphically represent DNA and amino acid sequence patterns from a set of aligned sequences. Each position of the alignment is represented by a column of stacked symbols with its total height reflecting the information content in this position. Currently, the available web servers are able to create logo images from a set of aligned sequences, but none of them generates weighted sequence logos directly from energy measurements or other sources. With the advent of high-throughput technologies for estimating the contact energy of different DNA sequences, tools that can create logos directly from binding affinity data are useful to researchers. enoLOGOS generates sequence logos from a variety of input data, including energy measurements, probability matrices, alignment matrices, count matrices and aligned sequences. Furthermore, enoLOGOS can represent the mutual information of different positions of the consensus sequence, a unique feature of this tool. Another web interface for our software, C2H2-enoLOGOS, generates logos for the DNA-binding preferences of the C2H2 zinc-finger transcription factor family members. enoLOGOS and C2H2-enoLOGOS are accessible over the web at http://biodev.hgen.pitt.edu/enologos/.
    BibTeX:
    @article{
      author = {Christopher T Workman and Yutong Yin and David L Corcoran and Trey Ideker and Gary D Stormo and Panayiotis V Benos},
      title = {enoLOGOS: a versatile web tool for energy normalized sequence logos.},
      journal = {Nucleic Acids Res},
      year = {2005},
      volume = {33},
      number = {Web Server issue},
      pages = {W389--W392},
      doi = {http://dx.doi.org/10.1093/nar/gki439}
    }
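The enoLOGOS abstract above describes each logo column as a stack whose total height reflects the information content at that position. A minimal sketch of that calculation for aligned DNA sites follows. It uses the standard uncorrected measure with a uniform background (log2 of the alphabet size minus the column entropy), not the energy normalization specific to enoLOGOS, and the function names and example sites are hypothetical.

```python
import math
from collections import Counter


def column_frequencies(column, alphabet="ACGT"):
    """Relative frequency of each alphabet symbol in one alignment column."""
    counts = Counter(column)
    n = sum(counts[b] for b in alphabet)
    if n == 0:
        raise ValueError("column contains no symbols from the alphabet")
    return {b: counts[b] / n for b in alphabet}


def column_information(column, alphabet="ACGT"):
    """Information content in bits: log2(alphabet size) minus Shannon entropy."""
    freqs = column_frequencies(column, alphabet)
    entropy = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
    return math.log2(len(alphabet)) - entropy


def logo_heights(sites, alphabet="ACGT"):
    """Per-position symbol heights: frequency times column information."""
    if not sites:
        raise ValueError("need at least one aligned site")
    if any(len(s) != len(sites[0]) for s in sites):
        raise ValueError("aligned sites must have equal length")
    heights = []
    for column in zip(*sites):
        freqs = column_frequencies(column, alphabet)
        info = column_information(column, alphabet)
        heights.append({b: f * info for b, f in freqs.items()})
    return heights


# Hypothetical aligned binding sites, for illustration only.
sites = ["ACGT", "ACGA", "ACCA", "ATCA"]
for position, stack in enumerate(logo_heights(sites), start=1):
    print(position, {b: round(h, 3) for b, h in stack.items() if h > 0})
```

A fully conserved DNA column reaches the 2-bit maximum, and a column with all four bases equally frequent has zero height.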
    					
    Akopyants, N.S.; Matlib, R.S.; Bukanova, E.N.; Smeds, M.R.; Brownstein, B.H.; Stormo, G.D. & Beverley, S.M. Expression profiling using random genomic DNA microarrays identifies differentially expressed genes associated with three major developmental stages of the protozoan parasite Leishmania major. 2004 Mol Biochem Parasitol
    Vol. 136 (1) , pp. 71-86  
    article animals; gene expression profiling; gene expression regulation; genome, protozoan; leishmania major; leishmaniasis, cutaneous; life cycle stages; mice; mice, inbred balb c; oligonucleotide array sequence analysis; protozoan proteins; transcription, genetic
    Abstract: To complete its life cycle, protozoan parasites of the genus Leishmania undergo at least three major developmental transitions. However, previous efforts to identify genes showing stage regulated changes in transcript abundance have yielded relatively few. Here we used expression profiling to assess changes in transcript abundance in three stages: replicating promastigotes and infective non-replicating metacyclics, which occur in the sand fly vector, and in the amastigote stage residing with macrophage phagolysosomes in mammals. Microarrays were developed containing 11,484 PCR products that included a number of known genes and 10,464 random 1 kb genomic DNA fragments. Arrays were hybridized in triplicate and genes showing two-fold or greater changes in 2/3 experiments were scored as differentially expressed. Remarkably, only about one percent of the DNAs expression varied by this criteria, in either stage comparison. Northern blot analysis confirmed the predicted change in mRNA abundance for most of these (68%). This set of genes included most of those previously identified in the literature as differentially regulated as well as a number of novel genes. Notably, Leishmania maxicircle transcripts showed strong up-regulation in metacyclic and amastigote parasites, probably associated with changes in parasite energy metabolism. However, current data suggest that expression profiling using shotgun DNA libraries significantly underestimates the extent of regulated transcripts.
    BibTeX:
    @article{
      author = {Natalia S Akopyants and Robin S Matlib and Elena N Bukanova and Matthew R Smeds and Bernard H Brownstein and Gary D Stormo and Stephen M Beverley},
      title = {Expression profiling using random genomic DNA microarrays identifies differentially expressed genes associated with three major developmental stages of the protozoan parasite Leishmania major.},
      journal = {Mol Biochem Parasitol},
      year = {2004},
      volume = {136},
      number = {1},
      pages = {71--86},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/15138069},
      doi = {http://dx.doi.org/10.1016/j.molbiopara.2004.03.002}
    }
    					
    Friedman, C.P.; Altman, R.B.; Kohane, I.S.; McCormick, K.A.; Miller, P.L.; Ozbolt, J.G.; Shortliffe, E.H.; Stormo, G.D.; Szczepaniak, M.C.; Tuck, D.; Williamson, J. & American College of Medical Informatics. Training the next generation of informaticians: the impact of "BISTI" and bioinformatics-a report from the American College of Medical Informatics. 2004 J Am Med Inform Assoc
    Vol. 11 (3) , pp. 167-172  
    article computational biology; curriculum; medical informatics; societies, medical; united states
    Abstract: In 2002-2003, the American College of Medical Informatics (ACMI) undertook a study of the future of informatics training. This project capitalized on the rapidly expanding interest in the role of computation in basic biological research, well characterized in the National Institutes of Health (NIH) Biomedical Information Science and Technology Initiative (BISTI) report. The defining activity of the project was the three-day 2002 Annual Symposium of the College. A committee, comprised of the authors of this report, subsequently carried out activities, including interviews with a broader informatics and biological sciences constituency, collation and categorization of observations, and generation of recommendations. The committee viewed biomedical informatics as an interdisciplinary field, combining basic informational and computational sciences with application domains, including health care, biological research, and education. Consequently, effective training in informatics, viewed from a national perspective, should encompass four key elements: (1). curricula that integrate experiences in the computational sciences and application domains rather than just concatenating them; (2). diversity among trainees, with individualized, interdisciplinary cross-training allowing each trainee to develop key competencies that he or she does not initially possess; (3). direct immersion in research and development activities; and (4). exposure across the wide range of basic informational and computational sciences. Informatics training programs that implement these features, irrespective of their funding sources, will meet and exceed the challenges raised by the BISTI report, and optimally prepare their trainees for careers in a field that continues to evolve.
    BibTeX:
    @article{
      author = {Charles P Friedman and Russ B Altman and Isaac S Kohane and Kathleen A McCormick and Perry L Miller and Judy G Ozbolt and Edward H Shortliffe and Gary D Stormo and M. Cleat Szczepaniak and David Tuck and Jeffrey Williamson and {American College of Medical Informatics}},
      title = {Training the next generation of informaticians: the impact of "BISTI" and bioinformatics--a report from the American College of Medical Informatics.},
      journal = {J Am Med Inform Assoc},
      year = {2004},
      volume = {11},
      number = {3},
      pages = {167--172},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/14764617}
    }
    					
    GuhaThakurta, D.; Schriefer, L.A.; Waterston, R.H. & Stormo, G.D. Novel transcription regulatory elements in Caenorhabditis elegans muscle genes. 2004 Genome Res
    Vol. 14 (12) , pp. 2457-2468  
    article animals; base sequence; binding sites; caenorhabditis elegans; conserved sequence; dna primers; gene expression regulation; green fluorescent proteins; muscle, skeletal; promoter regions (genetics); regulatory sequences, nucleic acid; sequence alignment; species specificity; transcription, genetic
    Abstract: We report the identification of three new transcription regulatory elements that are associated with muscle gene expression in the nematode Caenorhabditis elegans. Starting from a subset of well-characterized nematode muscle genes, we identified conserved DNA motifs in the promoter regions using computational DNA pattern-recognition algorithms. These were considered to be putative muscle transcription regulatory motifs. Using the green-fluorescent protein (GFP) as a reporter, experiments were done to determine the biological activity of these motifs in driving muscle gene expression. Prediction accuracy of muscle expression based on the presence of these three motifs was encouraging; nine of 10 previously uncharacterized genes that were predicted to have muscle expression were shown to be expressed either specifically or selectively in the muscle tissues, whereas only one of the nine that scored low for these motifs expressed in muscle. Knockouts of putative regulatory elements in the promoter of the mlc-2 and unc-89 genes show that they significantly contribute to muscle expression and act in a synergistic manner. We find that these DNA motifs are also present in the muscle promoters of C. briggsae, indicating that they are functionally conserved in the nematodes.
    BibTeX:
    @article{
      author = {Debraj GuhaThakurta and Lawrence A Schriefer and Robert H Waterston and Gary D Stormo},
      title = {Novel transcription regulatory elements in Caenorhabditis elegans muscle genes.},
      journal = {Genome Res},
      year = {2004},
      volume = {14},
      number = {12},
      pages = {2457--2468},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/15574824},
      doi = {http://dx.doi.org/10.1101/gr.2961104}
    }
    					
    Hu, Y.; Wang, T.; Stormo, G.D. & Gordon, J.I. RNA interference of achaete-scute homolog 1 in mouse prostate neuroendocrine cells reveals its gene targets and DNA binding sites. 2004 Proc Natl Acad Sci U S A
    Vol. 101 (15) , pp. 5559-5564  
    article animals; base sequence; basic helix-loop-helix transcription factors; binding sites; carcinoma, neuroendocrine; cell line, tumor; dna; dna-binding proteins; gene expression profiling; liver neoplasms, experimental; male; mice; mice, transgenic; models, genetic; neoplasm transplantation; oligonucleotide array sequence analysis; prostatic neoplasms; rna interference; signal transduction; transcription factors; transcription, genetic
    Abstract: We have previously characterized a transgenic mouse model (CR2-TAg) of metastatic prostate cancer arising in the neuroendocrine (NE) cell lineage. Biomarkers of NE differentiation in this model are expressed in conventional adenocarcinoma of the prostate with NE features. To further characterize the pathways that control NE proliferation, differentiation, and survival, we established prostate NE cancer (PNEC) cell lines from CR2-TAg prostate tumors and metastases. GeneChip analyses of cell lines harvested at different passages, and as xenografted tumors, indicated that PNECs express consistent features ex vivo and in vivo and share a remarkable degree of similarity with primary CR2-TAg prostate NE tumors. PNECs express mAsh1, a basic helix-loop-helix (bHLH) transcription factor essential for NE cell differentiation in other tissues. RNA interference knockdown of mAsh1, GeneChip comparisons of treated and control cell populations, and a computational analysis of down-regulated genes identified 12 transcriptional motifs enriched in the gene set. Affected genes, including Adcy9, Hes6, Iapp1, Ndrg4, c-Myb, and Mesdc2, are enriched for a palindromic E-box motif, CAGCTG, indicating that it is a physiologically relevant mAsh1 binding site. The enrichment of a c-Myb binding site and the finding that c-Myb is down-regulated by mAsh1 RNA interference suggest that mAsh1 and c-Myb are in the same signaling pathway. Our data indicate that mAsh1 negatively regulates the cell cycle (e.g., via enhanced Cdkn2d, Bub1 expression), promotes differentiation (e.g., through effects on cAMP), and enhances survival by inhibiting apoptosis. PNEC cell lines should be generally useful for genetic and/or pharmacologic studies of the regulation of NE cell proliferation, differentiation, and tumorigenesis.
    BibTeX:
    @article{
      author = {Yan Hu and Ting Wang and Gary D Stormo and Jeffrey I Gordon},
      title = {RNA interference of achaete-scute homolog 1 in mouse prostate neuroendocrine cells reveals its gene targets and DNA binding sites.},
      journal = {Proc Natl Acad Sci U S A},
      year = {2004},
      volume = {101},
      number = {15},
      pages = {5559--5564},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/15060276},
      doi = {http://dx.doi.org/10.1073/pnas.0306988101}
    }
    					
    Ji, Y.; Xu, X. & Stormo, G.D. A graph theoretical approach for predicting common RNA secondary structure motifs including pseudoknots in unaligned sequences. 2004 Bioinformatics
    Vol. 20 (10) , pp. 1591-1602  
    article algorithms; base sequence; molecular sequence data; nucleic acid conformation; rna; regulatory sequences, ribonucleic acid; sequence alignment; sequence analysis, rna; sequence homology, nucleic acid
    Abstract: MOTIVATION: RNA structure motifs contained in mRNAs have been found to play important roles in regulating gene expression. However, identification of novel RNA regulatory motifs using computational methods has not been widely explored. Effective tools for predicting novel RNA regulatory motifs based on genomic sequences are needed. RESULTS: We present a new method for predicting common RNA secondary structure motifs in a set of functionally or evolutionarily related RNA sequences. This method is based on comparison of stems (palindromic helices) between sequences and is implemented by applying graph-theoretical approaches. It first finds all possible stable stems in each sequence and compares stems pairwise between sequences by some defined features to find stems conserved across any two sequences. Then by applying a maximum clique finding algorithm, it finds all significant stems conserved across at least k sequences. Finally, it assembles in topological order all possible compatible conserved stems shared by at least k sequences and reports a number of the best assembled stem sets as the best candidate common structure motifs. This method does not require prior structural alignment of the sequences and is able to detect pseudoknot structures. We have tested this approach on some RNA sequences with known secondary structures, in which it is capable of detecting the real structures completely or partially correctly and outperforms other existing programs for similar purposes. AVAILABILITY: The algorithm has been implemented in C++ in a program called comRNA, which is available at http://ural.wustl.edu/softwares.html
    BibTeX:
    @article{
      author = {Yongmei Ji and Xing Xu and Gary D Stormo},
      title = {A graph theoretical approach for predicting common RNA secondary structure motifs including pseudoknots in unaligned sequences.},
      journal = {Bioinformatics},
      year = {2004},
      volume = {20},
      number = {10},
      pages = {1591--1602},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/14962926},
      doi = {http://dx.doi.org/10.1093/bioinformatics/bth131}
    }
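The comRNA abstract above finds stems conserved across at least k sequences by clique finding on a graph of pairwise stem matches. The sketch below shows that idea with a basic Bron-Kerbosch enumeration of maximal cliques, followed by a filter on the number of distinct sequences. The choice of Bron-Kerbosch, the node labels of the form (sequence, stem), and the toy match graph are all assumptions for illustration; the abstract does not say which clique algorithm or data layout comRNA uses.

```python
def maximal_cliques(adjacency):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion.

    adjacency maps each node to the set of its neighbours (undirected graph).
    """
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for node in list(candidates):
            expand(
                current | {node},
                candidates & adjacency[node],
                excluded & adjacency[node],
            )
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return cliques


def conserved_stem_sets(adjacency, k):
    """Keep cliques whose stems come from at least k distinct sequences."""
    return [
        clique
        for clique in maximal_cliques(adjacency)
        if len({sequence for sequence, _stem in clique}) >= k
    ]


# Hypothetical match graph: an edge means two stems from different
# sequences were judged similar by some pairwise comparison.
edges = [
    (("seq1", "a"), ("seq2", "a")),
    (("seq1", "a"), ("seq3", "a")),
    (("seq2", "a"), ("seq3", "a")),
    (("seq1", "b"), ("seq2", "b")),
]
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

for clique in conserved_stem_sets(adjacency, k=3):
    print(sorted(clique))
```

Only the stem "a" family survives with k=3, since stem "b" is matched in just two sequences.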
    					
    Li, J.B.; Gerdes, J.M.; Haycraft, C.J.; Fan, Y.; Teslovich, T.M.; May-Simera, H.; Li, H.; Blacque, O.E.; Li, L.; Leitch, C.C.; Lewis, R.A.; Green, J.S.; Parfrey, P.S.; Leroux, M.R.; Davidson, W.S.; Beales, P.L.; Guay-Woodford, L.M.; Yoder, B.K.; Stormo, G.D.; Katsanis, N. & Dutcher, S.K. Comparative genomics identifies a flagellar and basal body proteome that includes the BBS5 human disease gene. 2004 Cell
    Vol. 117 (4) , pp. 541-552  
    article animals; arabidopsis; bardet-biedl syndrome; caenorhabditis elegans; caenorhabditis elegans proteins; chlamydomonas; chromosomes, human, pair 2; cilia; dna mutational analysis; dna, complementary; female; flagella; genomic library; humans; male; mice; molecular sequence data; mutation; pedigree; proteins; proteome; rna interference; sequence homology, amino acid; sequence homology, nucleic acid; transcription factors
    Abstract: Cilia and flagella are microtubule-based structures nucleated by modified centrioles termed basal bodies. These biochemically complex organelles have more than 250 and 150 polypeptides, respectively. To identify the proteins involved in ciliary and basal body biogenesis and function, we undertook a comparative genomics approach that subtracted the nonflagellated proteome of Arabidopsis from the shared proteome of the ciliated/flagellated organisms Chlamydomonas and human. We identified 688 genes that are present exclusively in organisms with flagella and basal bodies and validated these data through a series of in silico, in vitro, and in vivo studies. We then applied this resource to the study of human ciliation disorders and have identified BBS5, a novel gene for Bardet-Biedl syndrome. We show that this novel protein localizes to basal bodies in mouse and C. elegans, is under the regulatory control of daf-19, and is necessary for the generation of both cilia and flagella.
    BibTeX:
    @article{
      author = {Jin Billy Li and Jantje M Gerdes and Courtney J Haycraft and Yanli Fan and Tanya M Teslovich and Helen May-Simera and Haitao Li and Oliver E Blacque and Linya Li and Carmen C Leitch and Richard Allan Lewis and Jane S Green and Patrick S Parfrey and Michel R Leroux and William S Davidson and Philip L Beales and Lisa M Guay-Woodford and Bradley K Yoder and Gary D Stormo and Nicholas Katsanis and Susan K Dutcher},
      title = {Comparative genomics identifies a flagellar and basal body proteome that includes the BBS5 human disease gene.},
      journal = {Cell},
      year = {2004},
      volume = {117},
      number = {4},
      pages = {541--552},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/15137946}
    }
    					
    Lin, Y.; Stormo, G.D. & Taghert, P.H. The neuropeptide pigment-dispersing factor coordinates pacemaker interactions in the Drosophila circadian system. 2004 J Neurosci
    Vol. 24 (36) , pp. 7951-7957  
    article animals; cell nucleus; circadian rhythm; drosophila proteins; drosophila melanogaster; female; microscopy, confocal; neurons; neuropeptides; nuclear proteins
    Abstract: In Drosophila, the neuropeptide pigment-dispersing factor (PDF) is required to maintain behavioral rhythms under constant conditions. To understand how PDF exerts its influence, we performed time-series immunostainings for the PERIOD protein in normal and pdf mutant flies over 9 d of constant conditions. Without pdf, pacemaker neurons that normally express PDF maintained two markers of rhythms: that of PERIOD nuclear translocation and its protein staining intensity. As a group, however, they displayed a gradual dispersion in their phasing of nuclear translocation. A separate group of non-PDF circadian pacemakers also maintained PERIOD nuclear translocation rhythms without pdf but exhibited altered phase and amplitude of PERIOD staining intensity. Therefore, pdf is not required to maintain circadian protein oscillations under constant conditions; however, it is required to coordinate the phase and amplitude of such rhythms among the diverse pacemakers. These observations begin to outline the hierarchy of circadian pacemaker circuitry in the Drosophila brain.
    BibTeX:
    @article{
      author = {Yiing Lin and Gary D Stormo and Paul H Taghert},
      title = {The neuropeptide pigment-dispersing factor coordinates pacemaker interactions in the Drosophila circadian system.},
      journal = {J Neurosci},
      year = {2004},
      volume = {24},
      number = {36},
      pages = {7951--7957},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/15356209},
      doi = {http://dx.doi.org/10.1523/JNEUROSCI.2370-04.2004}
    }
    					
    Man, T.-K.; Yang, J.S. & Stormo, G.D. Quantitative modeling of DNA-protein interactions: effects of amino acid substitutions on binding specificity of the Mnt repressor. 2004 Nucleic Acids Res
    Vol. 32 (13) , pp. 4026-4032  
    article amino acid substitution; binding sites; dna; dna-binding proteins; models, genetic; operator regions (genetics); repressor proteins; viral proteins
    Abstract: Understanding DNA-protein recognition quantitatively is essential to developing computational algorithms for accurate transcriptional binding site prediction. Using a quantitative, multiple fluorescence, relative affinity (QuMFRA) assay, we determine the binding specificity of 11 different position 6 variants of the Mnt repressor for operators containing all 16 possible dinucleotides at operator positions 16 and 17. We show that the wild-type and all variant proteins interact with the two positions in a non-independent manner, but that a simple independent model provides a close approximation to the true binding affinities. The wild-type His at amino acid 6 is the only protein to prefer the AC sequence of the wild-type operator, whereas most of the variant proteins prefer TA. H6R is unique in having a strong preference for C at position 16. A comparison of the quantitative binding data for all of the protein variants with a model for recognition of the early growth response (EGR) zinc finger family suggests that interactions of Mnt with positions 16 and 17 are similar to interactions of EGR with positions 1 and 2, respectively. This information leads to an augmented model for the interaction of Mnt with its operator.
    BibTeX:
    @article{
      author = {Tsz-Kwong Man and Joshua SungWoo Yang and Gary D Stormo},
      title = {Quantitative modeling of DNA-protein interactions: effects of amino acid substitutions on binding specificity of the Mnt repressor.},
      journal = {Nucleic Acids Res},
      year = {2004},
      volume = {32},
      number = {13},
      pages = {4026--4032},
      url = {http://dx.doi.org/10.1093/nar/gkh729},
      doi = {http://dx.doi.org/10.1093/nar/gkh729}
    }
    					
    Ruan, J.; Stormo, G.D. & Zhang, W. ILM: a web server for predicting RNA secondary structures with pseudoknots. 2004 Nucleic Acids Res
    Vol. 32 (Web Server issue) , pp. W146-W149  
    article algorithms; internet; nucleic acid conformation; rna; sequence analysis, rna; software; user-computer interface
    Abstract: The ILM web server provides a web interface to two algorithms, iterated loop matching and maximum weighted matching, for efficiently predicting RNA secondary structures with pseudoknots. The algorithms can utilize either thermodynamic or comparative information or both, and thus can work on both aligned and individual sequences. Predicted secondary structures are presented in several formats compatible with a variety of existing visualization tools. The service can be accessed at http://cic.cs.wustl.edu/RNA/.
    BibTeX:
    @article{
      author = {Jianhua Ruan and Gary D Stormo and Weixiong Zhang},
      title = {ILM: a web server for predicting RNA secondary structures with pseudoknots.},
      journal = {Nucleic Acids Res},
      year = {2004},
      volume = {32},
      number = {Web Server issue},
      pages = {W146--W149},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/15215368},
      doi = {http://dx.doi.org/10.1093/nar/gkh444}
    }
    					
    Ruan, J.; Stormo, G.D. & Zhang, W. An iterated loop matching approach to the prediction of RNA secondary structures with pseudoknots. 2004 Bioinformatics
    Vol. 20 (1) , pp. 58-66  
    article algorithms; base pairing; base sequence; feedback; models, molecular; molecular sequence data; nucleic acid conformation; rna; reproducibility of results; sensitivity and specificity; sequence alignment; sequence analysis, rna; sequence homology, nucleic acid
    BibTeX:
    @article{
      author = {Jianhua Ruan and Gary D Stormo and Weixiong Zhang},
      title = {An iterated loop matching approach to the prediction of RNA secondary structures with pseudoknots.},
      journal = {Bioinformatics},
      year = {2004},
      volume = {20},
      number = {1},
      pages = {58--66},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/14693809},
      doi = {http://dx.doi.org/10.1093/bioinformatics/btg373}
    }
    					
    Stormo, G. A.D. Baxevanis, D.B. Davison, R.P.G.P.L.S. & Stormo, G. (Eds.) An overview of RNA structure prediction 2004 , pp. 12.1.1-2   book
    BibTeX:
    @book{
      author = {Stormo, G.D.},
      title = {An overview of RNA structure prediction},
      publisher = {Wiley and Sons},
      year = {2004},
      pages = {12.1.1-2}
    }
    					
    Zhao, T.; Chang, L.-W.; McLeod, H.L. & Stormo, G.D. PromoLign: a database for upstream region analysis and SNPs. 2004 Hum Mutat
    Vol. 23 (6) , pp. 534-539  
    article animals; chromosome mapping; databases, genetic; humans; internet; mice; polymorphism, single nucleotide; regulatory sequences, nucleic acid; sequence alignment; user-computer interface
    Abstract: The study of transcriptional regulation at the genomic level has been hindered by the lack of functional annotation in the putative regulatory regions. Phylogenetic footprinting, in which cross-species sequence alignment among orthologous genes is applied to locate conserved sequence blocks, is an effective strategy to attack this problem. Single nucleotide polymorphisms (SNPs) in transcription factor (TF) binding sites contribute to the heterogeneity of TF binding sites and might disrupt or enhance their regulatory activity. The correlation of SNPs with the TF sites will not only help in functional evaluation of SNPs, but will also help in the study of transcription regulation by focusing attention on specific TF sites. PromoLign (http://polly.wustl.edu/promolign/main.html) is an online database application that presents SNPs and TF binding profiles in the context of human-mouse orthologous sequence alignment with a hyperlinked graphical interface. PromoLign could be applied to a variety of SNPs and transcription related studies, including association genetics, population genetics, and pharmacogenetics.
    BibTeX:
    @article{
      author = {Tao Zhao and Li-Wei Chang and Howard L McLeod and Gary D Stormo},
      title = {PromoLign: a database for upstream region analysis and SNPs.},
      journal = {Hum Mutat},
      year = {2004},
      volume = {23},
      number = {6},
      pages = {534--539},
      url = {http://www.ncbi.nlm.nih.gov/pubmed},
      doi = {http://dx.doi.org/10.1002/humu.20049}
    }
    					
    Johnston, M. & Stormo, G.D. Evolution. Heirlooms in the attic. 2003 Science
    Vol. 302 (5647) , pp. 997-999  
    article animals; base sequence; chickens; chromosomes; chromosomes, human, pair 21; chromosomes, human, pair 7; chromosomes, mammalian; conserved sequence; dna, intergenic; evolution, molecular; fishes; genome; genome, human; humans; mammals; polymerase chain reaction; species specificity
    BibTeX:
    @article{
      author = {Mark Johnston and Gary D Stormo},
      title = {Evolution. Heirlooms in the attic.},
      journal = {Science},
      year = {2003},
      volume = {302},
      number = {5647},
      pages = {997--999},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/14605357},
      doi = {http://dx.doi.org/10.1126/science.1092271}
    }
    					
    Li, J.B.; Lin, S.; Jia, H.; Wu, H.; Roe, B.A.; Kulp, D.; Stormo, G.D. & Dutcher, S.K. Analysis of Chlamydomonas reinhardtii genome structure using large-scale sequencing of regions on linkage groups I and III. 2003 J Eukaryot Microbiol
    Vol. 50 (3) , pp. 145-155  
    article amino acid sequence; animals; base sequence; chlamydomonas reinhardtii; dna, protozoan; genome; linkage (genetics); molecular sequence data; rna, transfer; repetitive sequences, nucleic acid; sequence alignment
    Abstract: Chlamydomonas reinhardtii is a unicellular green alga that has been used as a model organism for the study of flagella and basal bodies as well as photosynthesis. This report analyzes finished genomic DNA sequence for 0.5% of the nuclear genome. We have used three gene prediction programs as well as EST and protein homology data to estimate the total number of genes in Chlamydomonas to be between 12,000 and 16,400. Chlamydomonas appears to have many more genes than any other unicellular organism sequenced to date. Twenty-seven percent of the predicted genes have significant identity to both ESTs and to known proteins in other organisms, 32% of the predicted genes have significant identity to ESTs alone, and 14% have significant similarity to known proteins in other organisms. For gene prediction in Chlamydomonas, GreenGenie appeared to have the highest sensitivity and specificity at the exon level, scoring 71% and 82%, respectively. Two new alternative splicing events were predicted by aligning Chlamydomonas ESTs to the genomic sequence. Finally, recombination differs between the two sequenced contigs. The 350-Kb of the Linkage group III contig is devoid of recombination, while the Linkage group I contig is 30 map units long over 33-kb.
    BibTeX:
    @article{
      author = {Jin Billy Li and Shaoping Lin and Honggui Jia and Hongmin Wu and Bruce A Roe and David Kulp and Gary D Stormo and Susan K Dutcher},
      title = {Analysis of Chlamydomonas reinhardtii genome structure using large-scale sequencing of regions on linkage groups I and III.},
      journal = {J Eukaryot Microbiol},
      year = {2003},
      volume = {50},
      number = {3},
      pages = {145--155},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/12836870}
    }
    					
    Liu, J.; Tan, K. & Stormo, G.D. Computational identification of the Spo0A-phosphate regulon that is essential for the cellular differentiation and development in Gram-positive spore-forming bacteria. 2003 Nucleic Acids Res
    Vol. 31 (23) , pp. 6891-6903  
    article algorithms; amino acid sequence; bacillus; bacterial proteins; base sequence; binding sites; computational biology; dna-binding proteins; gene expression profiling; gene expression regulation, bacterial; genes, bacterial; genome, bacterial; molecular sequence data; oligonucleotide array sequence analysis; phosphates; protein structure, tertiary; regulon; software; spores, bacterial; transcription factors; transcription, genetic
    Abstract: Spo0A-phosphate is essential for the initiation of cellular differentiation and developmental processes in Gram-positive spore-forming bacteria. Here we combined comparative genomics with analyses of microarray expression profiles to identify the Spo0A-phosphate regulon in Bacillus subtilis. The consensus Spo0A-phosphate DNA-binding motif identified from the training set based on different computational algorithms is an 8 bp sequence, TTGTCGAA. The same motif was identified by aligning the upstream regulatory sequences of spo0A-dependent genes obtained from the expression profile of Sad67 (a constitutively active form of Spo0A) and their orthologs. After the transcription units (TUs) having putative Spo0A-phosphate binding sites were obtained, conservation of regulons among the genomes of B.subtilis, Bacillus halodurans and Bacillus anthracis, and expression profiles were employed to identify the most confident predictions. Besides genes already known to be directly under the control of Spo0A-phosphate, 276 novel members (organized in 109 TUs) of the Spo0A-phosphate regulon in B.subtilis are predicted in this study. The sensitivity and specificity of our predictions are estimated based on known sites and combinations of different types of evidence. Further characterization of the novel candidates will provide information towards understanding the role of Spo0A-phosphate in the sporulation process, as well as the entire genetic network governing cellular differentiation and developmental processes in B.subtilis.
    BibTeX:
    @article{
      author = {Jiajian Liu and Kai Tan and Gary D Stormo},
      title = {Computational identification of the Spo0A-phosphate regulon that is essential for the cellular differentiation and development in Gram-positive spore-forming bacteria.},
      journal = {Nucleic Acids Res},
      year = {2003},
      volume = {31},
      number = {23},
      pages = {6891--6903},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/14627822}
    }
    					
    Souvenir, R., Z.W.S.G. & Buhler, J. Benson, G. & Page, R.D.M. (Eds.) WABI An Efficient Algorithm for Selecting Degenerate Multiplex PCR Primers 2003 Workshop on Algorithms in Bioinformatics
    Vol. 2812 Workshop on Algorithms in Bioinformatics, Budapest, Hungary, Proceedings , pp. 512-526  
    article
    BibTeX:
    @article{
      author = {Souvenir, R. and Zhang, W. and Stormo, G. and Buhler, J.},
      title = {An Efficient Algorithm for Selecting Degenerate Multiplex PCR Primers},
      booktitle = {Workshop on Algorithms in Bioinformatics, Budapest, Hungary, Proceedings},
      journal = {Workshop on Algorithms in Bioinformatics},
      publisher = {Springer},
      year = {2003},
      volume = {2812},
      pages = {512--526}
    }
    					
    Stormo, G.D. New tricks for an old dogma: riboswitches as cis-only regulatory systems. 2003 Mol Cell
    Vol. 11 (6) , pp. 1419-1420  
    article 5' untranslated regions; bacillus subtilis; gene expression regulation, bacterial; genes, bacterial; mutation; purines; rna, messenger; transcription, genetic
    Abstract: Riboswitches are mRNAs that can act as direct sensors of small molecules to control their own expression. In the May 30, 2003, issue of Cell, Mandal et al. show that cis elements in mRNAs involved in purine metabolism measure the effector molecule concentration with sensitivity and specificity, and control expression of adjacent genes. Analysis of several recently discovered riboswitches suggests that this may be a common, efficient mechanism for regulating the synthesis of proteins required for the production of important metabolites.
    BibTeX:
    @article{
      author = {Gary D Stormo},
      title = {New tricks for an old dogma: riboswitches as cis-only regulatory systems.},
      journal = {Mol Cell},
      year = {2003},
      volume = {11},
      number = {6},
      pages = {1419--1420},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/12820954},
      doi = {http://dx.doi.org/10.1016/S1097-2765(03)00240-5}
    }
    					
    Stormo, G. A.D. Baxevanis, D.B. Davison, R.P.G.P.L.S. & Stormo, G. (Eds.) An Introduction to Recognizing Functional Domains (Current Protocols in Bioinformatics) 2003 Current Protocols in Bioinformatics , pp. 2.1.1-5   inbook
    BibTeX:
    @inbook{
      author = {Stormo, G.D.},
      title = {Current Protocols in Bioinformatics},
      journal = {Current Protocols in Bioinformatics},
      publisher = {Wiley and Sons, Inc.},
      year = {2003},
      pages = {2.1.1-5}
    }
    					
    Wang, T. & Stormo, G.D. Combining phylogenetic data with co-regulated genes to identify regulatory motifs. 2003 Bioinformatics
    Vol. 19 (18) , pp. 2369-2380  
    article algorithms; conserved sequence; gene expression profiling; gene expression regulation; phylogeny; regulatory sequences, nucleic acid; reproducibility of results; sensitivity and specificity; sequence alignment; sequence analysis, dna
    Abstract: MOTIVATION: Discovery of regulatory motifs in unaligned DNA sequences remains a fundamental problem in computational biology. Two categories of algorithms have been developed to identify common motifs from a set of DNA sequences. The first can be called a 'multiple genes, single species' approach. It proposes that a degenerate motif is embedded in some or all of the otherwise unrelated input sequences and tries to describe a consensus motif and identify its occurrences. It is often used for co-regulated genes identified through experimental approaches. The second approach can be called 'single gene, multiple species'. It requires orthologous input sequences and tries to identify unusually well conserved regions by phylogenetic footprinting. Both approaches perform well, but each has some limitations. It is tempting to combine the knowledge of co-regulation among different genes and conservation among orthologous genes to improve our ability to identify motifs. RESULTS: Based on the Consensus algorithm previously established by our group, we introduce a new algorithm called PhyloCon (Phylogenetic Consensus) that takes into account both conservation among orthologous genes and co-regulation of genes within a species. This algorithm first aligns conserved regions of orthologous sequences into multiple sequence alignments, or profiles, then compares profiles representing non-orthologous sequences. Motifs emerge as common regions in these profiles. Here we present a novel statistic to compare profiles of DNA sequences and a greedy approach to search for common subprofiles. We demonstrate that PhyloCon performs well on both synthetic and biological data. AVAILABILITY: Software available upon request from the authors. http://ural.wustl.edu/softwares.html
    BibTeX:
    @article{
      author = {Ting Wang and Gary D Stormo},
      title = {Combining phylogenetic data with co-regulated genes to identify regulatory motifs.},
      journal = {Bioinformatics},
      year = {2003},
      volume = {19},
      number = {18},
      pages = {2369--2380},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/14668220}
    }
    					
    Benos, P.V.; Lapedes, A.S. & Stormo, G.D. Is there a code for protein-DNA recognition? Probab(ilistical)ly. . . 2002 Bioessays
    Vol. 24 (5) , pp. 466-475  
    article animals; binding sites; dna; models, genetic; models, statistical; nucleotides; protein binding; proteins; thermodynamics; transcription, genetic
    Abstract: Transcriptional regulation of all genes is initiated by the specific binding of regulatory proteins called transcription factors to specific sites on DNA called promoter regions. Transcription factors employ a variety of mechanisms to recognise their DNA target sites. In the last few decades, attempts have been made to describe these mechanisms by general sets of rules and associated models. We give an overview of these models, starting with a historical review of the somewhat controversial issue of a "recognition code" governing protein-DNA interaction. We then present a probabilistic framework in which advantages and disadvantages of various models can be discussed. Finally, we conclude that simplifying assumptions about additivity of interactions are sufficiently justified in many situations (and can be suitably extended in other situations) to allow a unifying concept of a "probabilistic code" for protein-DNA recognition to be defined.
    BibTeX:
    @article{
      author = {Panayiotis V Benos and Alan S Lapedes and Gary D Stormo},
      title = {Is there a code for protein-DNA recognition? Probab(ilistical)ly. . .},
      journal = {Bioessays},
      year = {2002},
      volume = {24},
      number = {5},
      pages = {466--475},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/12001270},
      doi = {http://dx.doi.org/10.1002/bies.10073}
    }
    					
    Benos, P.V.; Bulyk, M.L. & Stormo, G.D. Additivity in protein-DNA interactions: how good an approximation is it? 2002 Nucleic Acids Res
    Vol. 30 (20) , pp. 4442-4451  
    article animals; basic helix-loop-helix leucine zipper transcription factors; binding sites; dna; dna-binding proteins; entropy; genes, suppressor; mice; models, theoretical; nuclear proteins; operator regions (genetics); protein binding; repressor proteins; transcription factors
    Abstract: Man and Stormo and Bulyk et al. recently presented their results on the study of the DNA binding affinity of proteins. In both of these studies the main conclusion is that the additivity assumption, usually applied in methods to search for binding sites, is not true. In the first study, the analysis of binding affinity data from the Mnt repressor protein bound to all possible DNA (sub)targets at positions 16 and 17 of the binding site, showed that those positions are not independent. In the second study, the authors analysed DNA binding affinity data of the wild-type mouse EGR1 protein and four variants differing on the middle finger. The binding affinity of these proteins was measured to all 64 possible trinucleotide (sub)targets of the middle finger using microarray technology. The analysis of the measurements also showed interdependence among the positions in the DNA target. In the present report, we review the data of both studies and we re-analyse them using various statistical methods, including a comparison with a multiple regression approach. We conclude that despite the fact that the additivity assumption does not fit the data perfectly, in most cases it provides a very good approximation of the true nature of the specific protein-DNA interactions. Therefore, additive models can be very useful for the discovery and prediction of binding sites in genomic DNA.
    BibTeX:
    @article{
      author = {Panayiotis V Benos and Martha L Bulyk and Gary D Stormo},
      title = {Additivity in protein-DNA interactions: how good an approximation is it?},
      journal = {Nucleic Acids Res},
      year = {2002},
      volume = {30},
      number = {20},
      pages = {4442--4451},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/12384591}
    }
    					
    Benos, P.V.; Lapedes, A.S. & Stormo, G.D. Probabilistic code for DNA recognition by proteins of the EGR family. 2002 J Mol Biol
    Vol. 323 (4) , pp. 701-727  
    article algorithms; amino acids; binding sites; computational biology; computer simulation; dna-binding proteins; early growth response transcription factors; kruppel-like transcription factors; models, molecular; probability; protein binding; thermodynamics; transcription factors; zinc fingers
    Abstract: A recognition code for protein-DNA interactions would allow for the prediction of binding sites based on protein sequence, and the identification of binding proteins for specific DNA targets. Crystallographic studies of protein-DNA complexes showed that a simple, deterministic recognition code does not exist. Here, we present a probabilistic recognition code (P-code) that assigns energies to all possible base-pair-amino acid interactions for the early growth response factor (EGR) family of zinc-finger transcription factors. The specific energy values are determined by a maximum likelihood method using examples from in vitro randomisation experiments (namely, SELEX and phage display) reported in the literature. The accuracy of the model is tested in several ways, including the ability to predict in vivo binding sites of EGR proteins and other non-EGR zinc-finger proteins, and the correlation between predicted and measured binding affinities of various EGR proteins to several different DNA sites. We also show that this model improves significantly upon the prediction capabilities of previous qualitative and quantitative models. The probabilistic code we develop uses information about the interacting positions between the protein and DNA, but we show that such information is not necessary, although it reduces the number of parameters to be determined. We also employ the assumption that the total binding energy is the sum of the energies of the individual contacts, but we describe how that assumption can be relaxed at the cost of additional parameters.
    BibTeX:
    @article{
      author = {Panayiotis V Benos and Alan S Lapedes and Gary D Stormo},
      title = {Probabilistic code for DNA recognition by proteins of the EGR family.},
      journal = {J Mol Biol},
      year = {2002},
      volume = {323},
      number = {4},
      pages = {701--727},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/12419259}
    }
    					
    Beverley, S.M.; Akopyants, N.S.; Goyard, S.; Matlib, R.S.; Gordon, J.L.; Brownstein, B.H.; Stormo, G.D.; Bukanova, E.N.; Hott, C.T.; Li, F.; MacMillan, S.; Muo, J.N.; Schwertman, L.A.; Smeds, M.R. & Wang, Y. Putting the Leishmania genome to work: functional genomics by transposon trapping and expression profiling. 2002 Philos Trans R Soc Lond B Biol Sci
    Vol. 357 (1417) , pp. 47-53  
    article animals; dna transposable elements; gene expression profiling; genes, protozoan; genome, protozoan; genomics; leishmania; proteome; protozoan proteins; rna, protozoan
    Abstract: Leishmania are important protozoan pathogens of humans in temperate and tropical regions. The study of gene expression during the infectious cycle, in mutants or after environmental or chemical stimuli, is a powerful approach towards understanding parasite virulence and the development of control measures. Like other trypanosomatids, Leishmania gene expression is mediated by a polycistronic transcriptional process that places increased emphasis on post-transcriptional regulatory mechanisms including RNA processing and protein translation. With the impending completion of the Leishmania genome, global approaches surveying mRNA and protein expression are now feasible. Our laboratory has developed the Drosophila transposon mariner as a tool for trapping Leishmania genes and studying their regulation in the form of protein fusions; a classic approach in other microbes that can be termed 'proteogenomics'. Similarly, we have developed reagents and approaches for the creation of DNA microarrays, which permit the measurement of RNA abundance across the parasite genome. Progress in these areas promises to greatly increase our understanding of global mechanisms of gene regulation at both mRNA and protein levels, and to lead to the identification of many candidate genes involved in virulence.
    BibTeX:
    @article{
      author = {Stephen M Beverley and Natalia S Akopyants and Sophie Goyard and Robin S Matlib and Jennifer L Gordon and Bernard H Brownstein and Gary D Stormo and Elena N Bukanova and Christian T Hott and Fugen Li and Sandra MacMillan and James N Muo and Libbey A Schwertman and Matthew R Smeds and Yujia Wang},
      title = {Putting the Leishmania genome to work: functional genomics by transposon trapping and expression profiling.},
      journal = {Philos Trans R Soc Lond B Biol Sci},
      year = {2002},
      volume = {357},
      number = {1417},
      pages = {47--53},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/11839181},
      doi = {http://dx.doi.org/10.1098/rstb.2001.1048}
    }
    					
    Cobb, J.P.; Laramie, J.M.; Stormo, G.D.; Morrissey, J.J.; Shannon, W.D.; Qiu, Y.; Karl, I.E.; Buchman, T.G. & Hotchkiss, R.S. Sepsis gene expression profiling: murine splenic compared with hepatic responses determined by using complementary DNA microarrays. 2002 Crit Care Med
    Vol. 30 (12) , pp. 2711-2721  
    article animals; apoptosis; cluster analysis; gene expression; liver; male; mice; mice, inbred c57bl; multiple organ failure; oligonucleotide array sequence analysis; organ specificity; prospective studies; rna, messenger; spleen
    Abstract: OBJECTIVE: DNA microarrays allow genome-wide assessment of changes in relative messenger RNA abundance and thus can be used to monitor changes in gene expression. The aim of this series of experiments was to gain experience in sepsis gene expression profiling in a well-accepted model of murine polymicrobial abdominal sepsis and begin characterizing (in the parlance of genomics) the sepsis "transcriptome." DESIGN: Prospective animal study. SETTING: University-based animal research facility. SUBJECTS: C57BL/6 mice. INTERVENTIONS: After induction of general anesthesia, cecal ligation and puncture were performed to induce peritonitis and polymicrobial sepsis. The control group had sham laparotomy only. Three samples of spleen and liver were collected from septic and sham animals at 24 hrs after laparotomy. Changes in expression were measured for 588 annotated mouse genes by using a commercially available complementary DNA microarray kit. MEASUREMENTS AND MAIN RESULTS: Broad-scale gene expression profiles were characterized for septic liver and spleen and compared with sham controls. The analytical tools used included commercially available software packages and a novel analysis program. Very little overlap was observed in the septic gene expression profiles of these two organs. Most of the genes identified have previously been linked to regulation of the inflammatory response; importantly, however, some have not. In addition, hierarchical cluster analysis showed that cecal ligation and puncture at 24 hrs induced coordinate expression of genes that alter cell signaling and survival pathways in spleen, consistent with previously published reports of sepsis-induced splenocyte apoptosis. The current limitations of microarray analysis as reflected in these studies are also discussed. CONCLUSIONS: Microarray technology provides a powerful new tool for rapidly analyzing tissue-specific changes in gene expression induced by sepsis in animal models. To our knowledge, these data constitute the first report on the use of microarrays to determine the sepsis transcriptome.
    BibTeX:
    @article{
      author = {J. Perren Cobb and Jason M Laramie and Gary D Stormo and Jerry J Morrissey and William D Shannon and Yuyu Qiu and Irene E Karl and Timothy G Buchman and Richard S Hotchkiss},
      title = {Sepsis gene expression profiling: murine splenic compared with hepatic responses determined by using complementary DNA microarrays.},
      journal = {Crit Care Med},
      year = {2002},
      volume = {30},
      number = {12},
      pages = {2711--2721},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/12483063},
      doi = {http://dx.doi.org/10.1097/01.CCM.0000034693.46903.ED}
    }
    					
    Gu, C.C.; Rao, D.C.; Stormo, G.; Hicks, C. & Province, M.A. Role of gene expression microarray analysis in finding complex disease genes. 2002 Genet Epidemiol
    Vol. 23 (1) , pp. 37-56  
    article gene expression; genetic diseases, inborn; genetic techniques; humans; oligonucleotide array sequence analysis
    BibTeX:
    @article{
      author = {Chi C Gu and D. C. Rao and Gary Stormo and Chindo Hicks and Michael A Province},
      title = {Role of gene expression microarray analysis in finding complex disease genes.},
      journal = {Genet Epidemiol},
      year = {2002},
      volume = {23},
      number = {1},
      pages = {37--56},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/12112247},
      doi = {http://dx.doi.org/10.1002/gepi.220}
    }
    					
    GuhaThakurta, D.; Palomar, L.; Stormo, G.D.; Tedesco, P.; Johnson, T.E.; Walker, D.W.; Lithgow, G.; Kim, S. & Link, C.D. Identification of a novel cis-regulatory element involved in the heat shock response in Caenorhabditis elegans using microarray gene expression and computational methods. 2002 Genome Res
    Vol. 12 (5) , pp. 701-712  
    article animals; base sequence; binding sites; caenorhabditis elegans; computational biology; conserved sequence; dna; gene expression profiling; genes, helminth; green fluorescent proteins; heat-shock response; luminescent proteins; molecular sequence data; mutagenesis, site-directed; oligonucleotide array sequence analysis; probability; promoter regions (genetics); regulatory sequences, nucleic acid; species specificity; statistics, nonparametric; up-regulation
    Abstract: We report here the identification of a previously unknown transcription regulatory element for heat shock (HS) genes in Caenorhabditis elegans. We monitored the expression pattern of 11,917 genes from C. elegans to determine the genes that were up-regulated on HS. Twenty eight genes were observed to be consistently up-regulated in several different repetitions of the experiments. We analyzed the upstream regions of these genes using computational DNA pattern recognition methods. Two potential cis-regulatory motifs were identified in this way. One of these motifs (TTCTAGAA) was the DNA binding motif for the heat shock factor (HSF), whereas the other (GGGTGTC) was previously unreported in the literature. We determined the significance of these motifs for the HS genes using different statistical tests and parameters. Comparative sequence analysis of orthologous HS genes from C. elegans and Caenorhabditis briggsae indicated that the identified DNA regulatory motifs are conserved across related species. The role of the identified DNA sites in regulation of HS genes was tested by in vitro mutagenesis of a green fluorescent protein (GFP) reporter transgene driven by the C. elegans hsp-16-2 promoter. DNA sites corresponding to both motifs are shown to play a significant role in up-regulation of the hsp-16-2 gene on HS. This is one of the rare instances in which a novel regulatory element, identified using computational methods, is shown to be biologically active. The contributions of individual sites toward induction of transcription on HS are nonadditive, which indicates interaction and cross-talk between the sites, possibly through the transcription factors (TFs) binding to these sites.
    BibTeX:
    @article{
      author = {Debraj GuhaThakurta and Lisanne Palomar and Gary D Stormo and Pat Tedesco and Thomas E Johnson and David W Walker and Gordon Lithgow and Stuart Kim and Christopher D Link},
      title = {Identification of a novel cis-regulatory element involved in the heat shock response in Caenorhabditis elegans using microarray gene expression and computational methods.},
      journal = {Genome Res},
      year = {2002},
      volume = {12},
      number = {5},
      pages = {701--712},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/11997337},
      doi = {http://dx.doi.org/10.1101/gr.228902}
    }
    					
    Guhathakurta, D.; Schriefer, L.A.; Hresko, M.C.; Waterston, R.H. & Stormo, G.D. Identifying muscle regulatory elements and genes in the nematode Caenorhabditis elegans. 2002 Pac Symp Biocomput , pp. 425-436   article animals; base sequence; binding sites; caenorhabditis elegans; consensus sequence; dna, helminth; genes, helminth; muscle proteins; regulatory sequences, nucleic acid; software
    Abstract: We report the identification of several putative muscle-specific regulatory elements, and genes which are expressed preferentially in the muscle of the nematode Caenorhabditis elegans. We used computational pattern finding methods to identify cis-regulatory motifs from promoter regions of a set of genes known to express preferentially in muscle; each motif describes the potential binding sites for an unknown regulatory factor. The significance and specificity of the identified motifs were evaluated using several different control sequence sets. Using the motifs, we searched the entire C. elegans genome for genes whose promoter regions have a high probability of being bound by the putative regulatory factors. Genes that met this criterion and were not included in our initial set were predicted to be good candidates for muscle expression. Some of these candidates are additional, known muscle expressed genes and several others are shown here to be preferentially expressed in muscle cells by using GFP (green fluorescent protein) constructs. The methods described here can be used to predict the spatial expression pattern of many uncharacterized genes.
    BibTeX:
    @article{
      author = {D. Guhathakurta and L. A. Schriefer and M. C. Hresko and R. H. Waterston and G. D. Stormo},
      title = {Identifying muscle regulatory elements and genes in the nematode Caenorhabditis elegans.},
      journal = {Pac Symp Biocomput},
      year = {2002},
      pages = {425--436},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/11928496}
    }
    					
    Lin, Y.; Han, M.; Shimada, B.; Wang, L.; Gibler, T.M.; Amarakone, A.; Awad, T.A.; Stormo, G.D.; Van Gelder, R.N. & Taghert, P.H. Influence of the period-dependent circadian clock on diurnal, circadian, and aperiodic gene expression in Drosophila melanogaster. 2002 Proc Natl Acad Sci U S A
    Vol. 99 (14) , pp. 9562-9567  
    article animals; circadian rhythm; drosophila melanogaster; gene expression; gene expression profiling; genes, insect; mutation; nuclear proteins; oligonucleotide array sequence analysis; photoperiod
    Abstract: We measured daily gene expression in heads of control and period mutant Drosophila by using oligonucleotide microarrays. In control flies, 72 genes showed diurnal rhythms in light-dark cycles; 22 of these also oscillated in free-running conditions. The period gene significantly influenced the expression levels of over 600 nonoscillating transcripts. Expression levels of several hundred genes also differed significantly between control flies kept in light-dark versus constant darkness but differed minimally between per(01) flies kept in the same two conditions. Thus, the period-dependent circadian clock regulates only a limited set of rhythmically expressed transcripts. Unexpectedly, period regulates basal and light-regulated gene expression to a very broad extent.
    BibTeX:
    @article{
      author = {Yiing Lin and Mei Han and Brian Shimada and Lin Wang and Therese M Gibler and Aloka Amarakone and Tarif A Awad and Gary D Stormo and Russell N {Van Gelder} and Paul H Taghert},
      title = {Influence of the period-dependent circadian clock on diurnal, circadian, and aperiodic gene expression in Drosophila melanogaster.},
      journal = {Proc Natl Acad Sci U S A},
      year = {2002},
      volume = {99},
      number = {14},
      pages = {9562--9567},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/12089325},
      doi = {http://dx.doi.org/10.1073/pnas.132269699}
    }
    					
    Silbaq, F.S.; Ruttenberg, S.E. & Stormo, G.D. Specificity of Mnt 'master residue' obtained from in vivo and in vitro selections. 2002 Nucleic Acids Res
    Vol. 30 (24) , pp. 5539-5548  
    article base sequence; binding sites; binding, competitive; dna, viral; mutation; oligonucleotides; operator regions (genetics); repressor proteins; viral proteins
    Abstract: Mnt is a repressor from phage P22 that belongs to the ribbon-helix-helix family of DNA binding factors. Four amino acids from the N-terminus of the protein, Arg2, His6, Asn8 and Arg10, interact with the base pairs of the DNA to provide the sequence specificity. Raumann et al. (Nature Struct. Biol., 2, 1115-1122) identified position 6 as a 'master residue' that controls the specificity of the protein. Models for the interaction have residue 6 of Mnt interacting directly with position 5 of the operator. In vivo selections demonstrated that protein variants at residue 6 bound specifically to operator mutations at that position. Operators in which the wild-type G at position 5 was replaced by T specifically bound to several different protein variants, primarily hydrophobic residues. The obtained protein variants, plus some others, were used in in vitro selections to determine their preferred binding sites. The results showed that the residue at position 6 influenced the preference for binding site bases predominantly at position 5, but that the effects of altering it can extend over longer distances, consistent with its designation as a 'master residue'. The similarities of binding sites for different residues do not correlate strongly with common measures of amino acid similarities.
    BibTeX:
    @article{
      author = {Fauzi S Silbaq and Steven E Ruttenberg and Gary D Stormo},
      title = {Specificity of Mnt 'master residue' obtained from in vivo and in vitro selections.},
      journal = {Nucleic Acids Res},
      year = {2002},
      volume = {30},
      number = {24},
      pages = {5539--5548},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/12490722}
    }
    					
    Stormo, G.D. & Tan, K. Mining genome databases to identify and understand new gene regulatory systems. 2002 Curr Opin Microbiol
    Vol. 5 (2) , pp. 149-153  
    article computational biology; databases; genes, regulator; genome, bacterial; transcription, genetic
    Abstract: The availability of a large number of sequenced microbial genomes allows us to conduct systematic studies on microbial gene regulatory systems. Computational methods, using comparative genomics approaches, are powerful tools to understand their mechanisms and evolutionary history. Recent advances in computational methodology for uncovering transcriptional regulatory components and their interactions are discussed.
    BibTeX:
    @article{
      author = {Gary D Stormo and Kai Tan},
      title = {Mining genome databases to identify and understand new gene regulatory systems.},
      journal = {Curr Opin Microbiol},
      year = {2002},
      volume = {5},
      number = {2},
      pages = {149--153},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/11934610},
      doi = {http://dx.doi.org/10.1016/S1369-5274(02)00309-0}
    }
    					
    Stormo, G. Collado-Vides, J. & Hofestadt, R. (Eds.) Specificity of DNA-protein interactions ( Gene Regulation and Metabolism: Post-Genomic Computational Approaches ) 2002 Gene Regulation and Metabolism: Post-Genomic Computational Approaches , pp. 87-101   inbook
    BibTeX:
    @inbook{
      author = {Stormo, G.D.},
      title = {Gene Regulation and Metabolism: Post-Genomic Computational Approaches},
      publisher = {MIT Press},
      year = {2002},
      pages = {87--101}
    }
    					
    Benos, P.V.; Lapedes, A.S.; Fields, D.S. & Stormo, G.D. SAMIE: statistical algorithm for modeling interaction energies. 2001 Pac Symp Biocomput
    Vol. 6 , pp. 115-126  
    article algorithms; binding sites; dna; dna-binding proteins; data interpretation, statistical; models, chemical; models, statistical; neural networks (computer); peptide library; protein binding; saccharomyces cerevisiae; thermodynamics; transcription factors
    Abstract: We are investigating the rules that govern protein-DNA interactions, using a statistical mechanics based formalism that is related to the Boltzmann Machine of the neural net literature. Our approach is data-driven, in which probabilistic algorithms are used to model protein-DNA interactions, given SELEX and/or phage data as input. In the current report, we trained the network using SELEX data, under the "one-to-one" model of interactions (i.e. one amino acid contacts one base). The trained network was able to successfully identify the wild-type binding sites of EGR and MIG protein families. The predictions using our method are the same as or better than those of methods existing in the literature. However, our methodology offers the potential to capitalise in quantitative detail, as well as to be used to explore more general models of interactions, given availability of data.
    BibTeX:
    @article{
      author = {P. V. Benos and A. S. Lapedes and D. S. Fields and G. D. Stormo},
      title = {SAMIE: statistical algorithm for modeling interaction energies.},
      journal = {Pac Symp Biocomput},
      year = {2001},
      volume = {6},
      pages = {115--126},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/11262933}
    }
    					
    Cobb, J.P.; Brownstein, B.H.; Watson, M.A.; Shannon, W.D.; Laramie, J.M.; Qiu, Y.; Stormo, G.D.; Morrissey, J.J.; Buchman, T.G.; Karl, I.E. & Hotchkiss, R.S. Injury in the era of genomics. 2001 Shock
    Vol. 15 (3) , pp. 165-170  
    article forecasting; genetic techniques; genome, fungal; genomics; humans; multiple organ failure; research; saccharomyces cerevisiae; spleen; wounds and injuries
    Abstract: The traditional approach to the study of biology employs small-scale experimentation that results in the description of a molecular sequence of known function or relevance. In the era of the genome the reverse is true, as large-scale cloning and gene sequencing come first, followed by the use of computational methods to systematically determine gene function and regulation. The overarching goal of this new approach is to translate the knowledge learned from a systematic, global analysis of genomic data into a complete understanding of biology. For investigators who study shock, the specific goal is to increase understanding of the adaptive response to injury at the level of the entire genome. This review describes our initial experience using DNA microarrays to profile stress-induced changes in gene expression. We conclude that efforts to apply genomics to the study of injury are best coordinated by multi-disciplinary groups, because of the extensive expertise required.
    BibTeX:
    @article{
      author = {J. P. Cobb and B. H. Brownstein and M. A. Watson and W. D. Shannon and J. M. Laramie and Y. Qiu and G. D. Stormo and J. J. Morrissey and T. G. Buchman and I. E. Karl and R. S. Hotchkiss},
      title = {Injury in the era of genomics.},
      journal = {Shock},
      year = {2001},
      volume = {15},
      number = {3},
      pages = {165--170},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/11236897}
    }
    					
    Gorodkin, J.; Lyngso, R.B. & Stormo, G.D. A mini-greedy algorithm for faster structural RNA stem-loop search. 2001 Genome Inform
    Vol. 12 , pp. 184-193  
    article algorithms; base sequence; computational biology; databases, nucleic acid; nucleic acid conformation; rna; sequence alignment
    Abstract: When a set of coregulated genes share a common structural RNA motif, e.g. a hairpin, most motif search approaches fail to locate the covarying but structurally conserved motif. There do exist methods that can locate structural RNA motifs, like FOLDALIGN, but the main problem with these methods is that they are computationally expensive. In FOLDALIGN, a major contribution to this is the use of a greedy algorithm to construct the multiple alignment. To ensure good quality many redundant computations must be made. However, by applying the greedy algorithm on a carefully selected subset of sequences, near full greedy quality can be obtained. The basic idea is to estimate the order in which the sequences entered a good greedy alignment. If such a ranking, found from all pairwise alignments, is in good agreement with the order of appearance in the multiple alignment, the core structural motif can be found by performing the greedy algorithm on just the top sequences in the ranking. The ranking used in this mini-greedy algorithm is found by using two complementing approaches: 1) When interpreting the FOLDALIGN score as an inner product (kernel), the sequences can be ranked according to their distance to their center of mass; 2) We construct an algorithm that attempts to find the K closest sequences in the vector space associated with the inner product, and the remaining sequences can be ranked by their minimum distance to any of the sequences, or to the center of mass in this set. The two approaches are compared and merged, and the results discussed. We also show that structural alignments of near full greedy quality can be found in significantly reduced time, using these methods. The algorithm is being included in the SLASH (Stem-Loop Align SearcH) server available at http://www.bioinf.au.dk/slash.
    BibTeX:
    @article{
      author = {J. Gorodkin and R. B. Lyngso and G. D. Stormo},
      title = {A mini-greedy algorithm for faster structural RNA stem-loop search.},
      journal = {Genome Inform},
      year = {2001},
      volume = {12},
      pages = {184--193},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/11791237}
    }
    					
    Gorodkin, J.; Stricklin, S.L. & Stormo, G.D. Discovering common stem-loop motifs in unaligned RNA sequences. 2001 Nucleic Acids Res
    Vol. 29 (10) , pp. 2135-2144  
    article algorithms; base sequence; computational biology; databases; internet; molecular sequence data; nucleic acid conformation; rna; rna, archaeal; rna, ribosomal; regulatory sequences, nucleic acid; sensitivity and specificity; sequence alignment; software; untranslated regions
    Abstract: Post-transcriptional regulation of gene expression is often accomplished by proteins binding to specific sequence motifs in mRNA molecules, to affect their translation or stability. The motifs are often composed of a combination of sequence and structural constraints such that the overall structure is preserved even though much of the primary sequence is variable. While several methods exist to discover transcriptional regulatory sites in the DNA sequences of coregulated genes, the RNA motif discovery problem is much more difficult because of covariation in the positions. We describe the combined use of two approaches for RNA structure prediction, FOLDALIGN and COVE, that together can discover and model stem-loop RNA motifs in unaligned sequences, such as UTRs from post-transcriptionally coregulated genes. We evaluate the method on two datasets, one a section of rRNA genes with randomly truncated ends so that a global alignment is not possible, and the other a hyper-variable collection of IRE-like elements that were inserted into randomized UTR sequences. In both cases the combined method identified the motifs correctly, and in the rRNA example we show that it is capable of determining the structure, which includes bulge and internal loops as well as a variable length hairpin loop. Those automated results are quantitatively evaluated and found to agree closely with structures contained in curated databases, with correlation coefficients up to 0.9. A basic server, Stem-Loop Align SearcH (SLASH), which will perform stem-loop searches in unaligned RNA sequences, is available at http://www.bioinf.au.dk/slash/.
    BibTeX:
    @article{
      author = {J. Gorodkin and S. L. Stricklin and G. D. Stormo},
      title = {Discovering common stem-loop motifs in unaligned RNA sequences.},
      journal = {Nucleic Acids Res},
      year = {2001},
      volume = {29},
      number = {10},
      pages = {2135--2144},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/11353083}
    }
    					
    GuhaThakurta, D. & Stormo, G.D. Identifying target sites for cooperatively binding factors. 2001 Bioinformatics
    Vol. 17 (7) , pp. 608-621  
    article algorithms; bacterial proteins; base sequence; binding sites; computational biology; dna; dna, bacterial; dna-binding proteins; databases; escherichia coli; genes, fungal; genome, bacterial; saccharomyces cerevisiae; software; trans-activation (genetics); transcription factors
    Abstract: MOTIVATION: Transcriptional activation in eukaryotic organisms normally requires combinatorial interactions of multiple transcription factors. Though several methods exist for identification of individual protein binding site patterns in DNA sequences, there are few methods for discovery of binding site patterns for cooperatively acting factors. Here we present an algorithm, Co-Bind (for COperative BINDing), for discovering DNA target sites for cooperatively acting transcription factors. The method utilizes a Gibbs sampling strategy to model the cooperativity between two transcription factors and defines position weight matrices for the binding sites. Sequences from both the training set and the entire genome are taken into account, in order to discriminate against commonly occurring patterns in the genome, and produce patterns which are significant only in the training set. RESULTS: We have tested Co-Bind on semi-synthetic and real data sets to show it can efficiently identify DNA target site patterns for cooperatively binding transcription factors. In cases where binding site patterns are weak and cannot be identified by other available methods, Co-Bind, by virtue of modeling the cooperativity between factors, can identify those sites efficiently. Though developed to model protein-DNA interactions, the scope of Co-Bind may be extended to combinatorial, sequence specific, interactions in other macromolecules. AVAILABILITY: The program is available upon request from the authors or may be downloaded from http://ural.wustl.edu.
    BibTeX:
    @article{
      author = {D. GuhaThakurta and G. D. Stormo},
      title = {Identifying target sites for cooperatively binding factors.},
      journal = {Bioinformatics},
      year = {2001},
      volume = {17},
      number = {7},
      pages = {608--621},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/11448879}
    }
    					
    Li, F. & Stormo, G.D. Selection of optimal DNA oligos for gene expression arrays. 2001 Bioinformatics
    Vol. 17 (11) , pp. 1067-1076  
    article algorithms; base sequence; computational biology; databases, nucleic acid; nucleic acid hybridization; oligonucleotide array sequence analysis; oligonucleotide probes; software; thermodynamics
    Abstract: MOTIVATION: High density DNA oligo microarrays are widely used in biomedical research. Selection of optimal DNA oligos that are deposited on the microarrays is critical. Based on sequence information and hybridization free energy, we developed a new algorithm to select optimal short (20-25 bases) or long (50 or 70 bases) oligos from genes or open reading frames (ORFs) and predict their hybridization behavior. Having optimized probes for each gene is valuable for two reasons. By minimizing background hybridization they provide more accurate determinations of true expression levels. Having optimum probes minimizes the number of probes needed per gene, thereby decreasing the cost of each microarray, raising the number of genes on each chip and increasing its usage. RESULTS: In this paper we describe algorithms to optimize the selection of specific probes for each gene in an entire genome. The criteria for truly optimum probes are easily stated but they are not computable at all levels currently. We have developed an heuristic approach that is efficiently computable at all levels and should provide a good approximation to the true optimum set. We have run the program on the complete genomes for several model organisms and deposited the results in a database that is available on-line (http://ural.wustl.edu/~lif/probe.pl). AVAILABILITY: The program is available upon request.
    BibTeX:
    @article{
      author = {F. Li and G. D. Stormo},
      title = {Selection of optimal DNA oligos for gene expression arrays.},
      journal = {Bioinformatics},
      year = {2001},
      volume = {17},
      number = {11},
      pages = {1067--1076},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/11724738}
    }
    					
    Man, T.K. & Stormo, G.D. Non-independence of Mnt repressor-operator interaction determined by a new quantitative multiple fluorescence relative affinity (QuMFRA) assay. 2001 Nucleic Acids Res
    Vol. 29 (12) , pp. 2471-2478  
    article bacteriophage p22; base sequence; binding sites; dna; dna-binding proteins; fluorescence; fluorescent dyes; models, molecular; mutation; operator regions (genetics); protein binding; repressor proteins; salmonella; substrate specificity; thermodynamics; viral proteins
    Abstract: Salmonella bacteriophage repressor Mnt belongs to the ribbon-helix-helix class of transcription factors. Previous SELEX results suggested that interactions of Mnt with positions 16 and 17 of the operator DNA are not independent. Using a newly developed high-throughput quantitative multiple fluorescence relative affinity (QuMFRA) assay, we directly quantified the relative equilibrium binding constants (K(ref)) of Mnt to operators carrying all the possible dinucleotide combinations at these two positions. Results show that Mnt prefers binding to C, instead of wild-type A, at position 16 when wild-type C at position 17 is changed to other bases. The measured K(ref) values of double mutants were also higher than the values predicted from single mutants, demonstrating the non-independence of these two positions. The ability to produce a large number of quantitative binding data simultaneously and the potential to scale up makes QuMFRA a valuable tool for the large-scale study of macromolecular interaction.
    BibTeX:
    @article{
      author = {T. K. Man and G. D. Stormo},
      title = {Non-independence of Mnt repressor-operator interaction determined by a new quantitative multiple fluorescence relative affinity (QuMFRA) assay.},
      journal = {Nucleic Acids Res},
      year = {2001},
      volume = {29},
      number = {12},
      pages = {2471--2478},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/11410653}
    }
    					
    Stormo, G.D. & Ji, Y. Do mRNAs act as direct sensors of small molecules to control their expression? 2001 Proc Natl Acad Sci U S A
    Vol. 98 (17) , pp. 9465-9467  
    article bacillus subtilis; bacterial proteins; escherichia coli; feedback; gene expression regulation; gene expression regulation, bacterial; models, genetic; nucleic acid conformation; operon; peptide chain initiation, translational; rna, bacterial; rna, messenger; regulatory sequences, nucleic acid; rhizobium; riboflavin; thiamine
    BibTeX:
    @article{
      author = {G. D. Stormo and Y. Ji},
      title = {Do mRNAs act as direct sensors of small molecules to control their expression?},
      journal = {Proc Natl Acad Sci U S A},
      year = {2001},
      volume = {98},
      number = {17},
      pages = {9465--9467},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/11504932},
      doi = {http://dx.doi.org/10.1073/pnas.181334498}
    }
    					
    Tan, K.; Moreno-Hagelsieb, G.; Collado-Vides, J. & Stormo, G.D. A comparative genomics approach to prediction of new members of regulons. 2001 Genome Res
    Vol. 11 (4) , pp. 566-584  
    article amino acid sequence; bacterial proteins; binding sites; computational biology; conserved sequence; cyclic amp receptor protein; dna-binding proteins; escherichia coli; escherichia coli proteins; genome, bacterial; genomics; iron-sulfur proteins; molecular sequence data; regulon; sequence alignment; transcription factors
    Abstract: Identifying the complete transcriptional regulatory network for an organism is a major challenge. For each regulatory protein, we want to know all the genes it regulates, that is, its regulon. Examples of known binding sites can be used to estimate the binding specificity of the protein and to predict other binding sites. However, binding site predictions can be unreliable because determining the true specificity of the protein is difficult because of the considerable variability of binding sites. Because regulatory systems tend to be conserved through evolution, we can use comparisons between species to increase the reliability of binding site predictions. In this article, an approach is presented to evaluate the computational predictions of regulatory sites. We combine the prediction of transcription units having orthologous genes with the prediction of transcription factor binding sites based on probabilistic models. We augment the sets of genes in Escherichia coli that are expected to be regulated by two transcription factors, the cAMP receptor protein and the fumarate and nitrate reduction regulatory protein, through a comparison with the Haemophilus influenzae genome. At the same time, we learned more about the regulatory networks of H. influenzae, a species with much less experimental knowledge than E. coli. By studying orthologous genes subject to regulation by the same transcription factor, we also gained understanding of the evolution of the entire regulatory systems.
    BibTeX:
    @article{
      author = {K. Tan and G. Moreno-Hagelsieb and J. Collado-Vides and G. D. Stormo},
      title = {A comparative genomics approach to prediction of new members of regulons.},
      journal = {Genome Res},
      year = {2001},
      volume = {11},
      number = {4},
      pages = {566--584},
      doi = {http://dx.doi.org/10.1101/gr.149301}
    }
    					
    Akmaev, V.R.; Kelley, S.T. & Stormo, G.D. Phylogenetically enhanced statistical tools for RNA structure prediction. 2000 Bioinformatics
    Vol. 16 (6) , pp. 501-512  
    article base sequence; biometry; escherichia coli; evolution, molecular; likelihood functions; models, genetic; molecular sequence data; nucleic acid conformation; phylogeny; rna; rna, bacterial; rna, ribosomal, 16s; sequence analysis, rna
    Abstract: MOTIVATION: Methods that predict the structure of molecules by looking for statistical correlation have been quite effective. Unfortunately, these methods often disregard phylogenetic information in the sequences they analyze. Here, we present a number of statistics for RNA molecular-structure prediction. Besides common pair-wise comparisons, we consider a few reasonable statistics for base-triple predictions, and present an elaborate analysis of these methods. All these statistics incorporate phylogenetic relationships of the sequences in the analysis to varying degrees, and the different nature of these tests gives a wide choice of statistical tools for RNA structure prediction. RESULTS: Starting from statistics that incorporate phylogenetic information only as independent sequence evolution models for each position of a multiple alignment, and extending this idea to a joint evolution model of two positions, we enhance the usual purely statistical methods (e.g. methods based on the Mutual Information statistic) with the use of phylogenetic information available in the sequences. In particular, we present a joint model based on the HKY evolution model, and consequently a χ² test of independence for two positions. A significant part of this work is devoted to some mathematical analysis of these methods. We tested these statistics on regions of 16S and 23S rRNA, and tRNA.
    BibTeX:
    @article{
      author = {V. R. Akmaev and S. T. Kelley and G. D. Stormo},
      title = {Phylogenetically enhanced statistical tools for RNA structure prediction.},
      journal = {Bioinformatics},
      year = {2000},
      volume = {16},
      number = {6},
      pages = {501--512},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/10980147}
    }
    					
    Kelley, S.T.; Akmaev, V.R. & Stormo, G.D. Improved statistical methods reveal direct interactions between 16S and 23S rRNA. 2000 Nucleic Acids Res
    Vol. 28 (24) , pp. 4938-4943  
    article animals; base sequence; binding sites; computational biology; databases; genes, archaeal; genes, bacterial; molecular sequence data; phylogeny; rna, ribosomal, 16s; rna, ribosomal, 23s; sequence alignment; statistics
    Abstract: Recent biochemical studies have indicated a number of regions in both the 16S and 23S rRNA that are exposed on the ribosomal subunit surface. In order to predict potential interactions between these regions we applied novel phylogenetically-based statistical methods to detect correlated nucleotide changes occurring between the rRNA molecules. With these methods we discovered a number of highly significant correlated changes between different sets of nucleotides in the two ribosomal subunits. The predictions with the highest correlation values belong to regions of the rRNA subunits that are in close proximity according to recent crystal structures of the entire ribosome. We also applied a new statistical method of detecting base triple interactions within these same rRNA subunit regions. This base triple statistic predicted a number of new base triples not detected by pair-wise interaction statistics within the rRNA molecules. Our results suggest that these statistical methods may enhance the ability to detect novel structural elements both within and between RNA molecules.
    BibTeX:
    @article{
      author = {S. T. Kelley and V. R. Akmaev and G. D. Stormo},
      title = {Improved statistical methods reveal direct interactions between 16S and 23S rRNA.},
      journal = {Nucleic Acids Res},
      year = {2000},
      volume = {28},
      number = {24},
      pages = {4938--4943},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/11121485}
    }
    					
    Li, F. & Stormo, G. Probe design on genomic level for high-density DNA oligo Microarray. 2000 Proc of the Int IEEE Symp on Bio-Informatics and Biomedical Engineering (BIBE) , pp. 200-207   article
    BibTeX:
    @article{
      author = {Li, F. and Stormo, G.D.},
      title = {Probe design on genomic level for high-density DNA oligo Microarray.},
      journal = {Proc of the Int IEEE Symp on Bio-Informatics and Biomedical Engineering (BIBE)},
      year = {2000},
      pages = {200--207},
      url = {http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=889608},
      doi = {http://dx.doi.org/10.1109/BIBE.2000.889608}
    }
    					
    Shumaker-Parry, J.S.; Campbell, C.T.; Stormo, G.D.; Silbaq, F.S. & Aebersold, R.H. S. Nie, E.T. & E.S. Yeung (Eds.) Probing Protein:DNA Interactions Using a Uniform Monolayer of DNA and Surface Plasmon Resonance ( Scanning and Force Microscopies for Biomedical Applications II ) 2000 Scanning and Force Microscopies for Biomedical Applications II
    Vol. 3922 , pp. 158-166  
    inbook
    Abstract: A method is described for immobilizing double-stranded DNAs to a planar gold surface with high density and uniform spacing. This is accomplished by adsorbing biotinylated DNAs onto a nearly close-packed monolayer of the protein streptavidin. This streptavidin monolayer, which offers approximately 5 x 10^12 biotin sites per cm^2, is prepared first by adsorbing it onto a mixed self-assembled monolayer on gold which contains biotin-terminated and oligo-terminated alkylthiolates in a 3/7 ratio. This DNA-functionalized surface resists non-specific protein adsorption and is useful for probing the kinetics and equilibrium binding of proteins to DNA with surface plasmon resonance. This is demonstrated with the Mnt protein, which is found to bind in 3.8:1 ratio to its immobilized DNA operator sequence. This is consistent with its behavior in homogeneous solution, where it binds as a tetramer to its DNA. A sequence with a single base-pair mutation shows nearly as much Mnt binding, but a completely random DNA sequence shows only 5 percent of this binding. This proves that DNA-binding proteins bind sequence-specifically to double-stranded DNAs which are immobilized to gold with this streptavidin linker layer.
    BibTeX:
    @inbook{
      author = {Shumaker-Parry, J. S. and Campbell, C. T. and Stormo, G. D. and Silbaq, F. S. and Aebersold, R. H.},
      title = {Scanning and Force Microscopies for Biomedical Applications II},
      publisher = {Bellingham, WA},
      year = {2000},
      volume = {3922},
      pages = {158--166},
      url = {http://spiedigitallibrary.org/proceedings/resource/2/psisdg/3922/1/158_1},
      doi = {http://dx.doi.org/10.1117/12.383343}
    }
    					
    Stormo Identification of coordinated gene expression and regulatory sequences 2000 Pac Symp Biocomput (12) , pp. 416-417   article
    BibTeX:
    @article{
      author = {Stormo},
      title = {Identification of coordinated gene expression and regulatory sequences},
      journal = {Pac Symp Biocomput},
      year = {2000},
      number = {12},
      pages = {416--417},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/10902189}
    }
    					
    Stormo, G.D. DNA binding sites: representation and discovery. 2000 Bioinformatics
    Vol. 16 (1) , pp. 16-23  
    article binding sites; dna; dna-binding proteins; history, 20th century; research
    Abstract: The purpose of this article is to provide a brief history of the development and application of computer algorithms for the analysis and prediction of DNA binding sites. This problem can be conveniently divided into two subproblems. The first is, given a collection of known binding sites, develop a representation of those sites that can be used to search new sequences and reliably predict where additional binding sites occur. The second is, given a set of sequences known to contain binding sites for a common factor, but not knowing where the sites are, discover the location of the sites in each sequence and a representation for the specificity of the protein.
    BibTeX:
    @article{
      author = {G. D. Stormo},
      title = {DNA binding sites: representation and discovery.},
      journal = {Bioinformatics},
      year = {2000},
      volume = {16},
      number = {1},
      pages = {16--23},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/10812473}
    }
    					
    Stormo, G.D. Gene-finding approaches for eukaryotes. 2000 Genome Res
    Vol. 10 (4) , pp. 394-397  
    article eukaryotic cells; protein biosynthesis; proteins; sequence analysis, dna
    BibTeX:
    @article{
      author = {G. D. Stormo},
      title = {Gene-finding approaches for eukaryotes.},
      journal = {Genome Res},
      year = {2000},
      volume = {10},
      number = {4},
      pages = {394--397},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/10779479}
    }
    					
    Workman, C.T. & Stormo, G.D. ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. 2000 Pac Symp Biocomput
    Vol. 5 , pp. 467-478  
    article algorithms; binding sites; computer simulation; dna; gene expression regulation; models, biological; neural networks (computer); sensitivity and specificity; software; transcription factors
    Abstract: This work describes ANN-Spec, a machine learning algorithm and its application to discovering un-gapped patterns in DNA sequence. The approach makes use of an Artificial Neural Network and a Gibbs sampling method to define the Specificity of a DNA-binding protein. ANN-Spec searches for the parameters of a simple network (or weight matrix) that will maximize the specificity for binding sequences of a positive set compared to a background sequence set. Binding sites in the positive data set are found with the resulting weight matrix and these sites are then used to define a local multiple sequence alignment. Training complexity is O(lN) where l is the width of the pattern and N is the size of the positive training data. A quantitative comparison of ANN-Spec and a few related programs is presented. The comparison shows that ANN-Spec finds patterns of higher specificity when training with a background data set. The program and documentation are available from the authors for UNIX systems.
    BibTeX:
    @article{
      author = {C. T. Workman and G. D. Stormo},
      title = {ANN-Spec: a method for discovering transcription factor binding sites with improved specificity.},
      journal = {Pac Symp Biocomput},
      year = {2000},
      volume = {5},
      pages = {467--478},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/10902194}
    }
    					
    Akmaev, V.R.; Kelley, S.T. & Stormo, G.D. A phylogenetic approach to RNA structure prediction. 1999 Proc Int Conf Intell Syst Mol Biol , pp. 10-17   article computer simulation; models, statistical; nucleic acid conformation; phylogeny; rna; rna, ribosomal, 16s; rna, transfer
    Abstract: Methods based on the Mutual Information statistic (MI methods) predict structure by looking for statistical correlations between sequence positions in a set of aligned sequences. Although MI methods are often quite effective, these methods ignore the underlying phylogenetic relationships of the sequences they analyze. Thus, they cannot distinguish between correlations due to structural interactions, and spurious correlations resulting from phylogenetic history. In this paper, we introduce a method analogous to MI that incorporates phylogenetic information. We show that this method accurately recovers the structures of well-known RNA molecules. We also demonstrate, with both real and simulated data, that this phylogenetically-based method outperforms standard MI methods, and improves the ability to distinguish interacting from non-interacting positions in RNA. This method is flexible, and may be applied to the prediction of protein structure given the appropriate evolutionary model. Because this method incorporates phylogenetic data, it also has the potential to be improved with the addition of more accurate phylogenetic information, although we show that even approximate phylogenies are helpful.
    BibTeX:
    @article{
      author = {V. R. Akmaev and S. T. Kelley and G. D. Stormo},
      title = {A phylogenetic approach to RNA structure prediction.},
      journal = {Proc Int Conf Intell Syst Mol Biol},
      year = {1999},
      pages = {10--17},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/10786281}
    }
    					
    Hertz, G.Z. & Stormo, G.D. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. 1999 Bioinformatics
    Vol. 15 (7-8) , pp. 563-577  
    article algorithms; bacterial proteins; base sequence; binding sites; carrier proteins; cyclic amp receptor protein; dna; dna, bacterial; escherichia coli; linear models; proteins; sequence alignment; software
    Abstract: MOTIVATION: Molecular biologists frequently can obtain interesting insight by aligning a set of related DNA, RNA or protein sequences. Such alignments can be used to determine either evolutionary or functional relationships. Our interest is in identifying functional relationships. Unless the sequences are very similar, it is necessary to have a specific strategy for measuring, or scoring, the relatedness of the aligned sequences. If the alignment is not known, one can be determined by finding an alignment that optimizes the scoring scheme. RESULTS: We describe four components to our approach for determining alignments of multiple sequences. First, we review a log-likelihood scoring scheme we call information content. Second, we describe two methods for estimating the P value of an individual information content score: (i) a method that combines a technique from large-deviation statistics with numerical calculations; (ii) a method that is exclusively numerical. Third, we describe how we count the number of possible alignments given the overall amount of sequence data. This count is multiplied by the P value to determine the expected frequency of an information content score and, thus, the statistical significance of the corresponding alignment. Statistical significance can be used to compare alignments having differing widths and containing differing numbers of sequences. Fourth, we describe a greedy algorithm for determining alignments of functionally related sequences. Finally, we test the accuracy of our P value calculations, and give an example of using our algorithm to identify binding sites for the Escherichia coli CRP protein. AVAILABILITY: Programs were developed under the UNIX operating system and are available by anonymous ftp from ftp://beagle.colorado.edu/pub/consensus.
    BibTeX:
    @article{
      author = {G. Z. Hertz and G. D. Stormo},
      title = {Identifying DNA and protein patterns with statistically significant alignments of multiple sequences.},
      journal = {Bioinformatics},
      year = {1999},
      volume = {15},
      number = {7-8},
      pages = {563--577},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/10487864}
    }
    					
    Lapedes, A.S.; Giraud, B.G.; Liu, L.C. & Stormo, G.D. Seattle, WA Correlated Mutations in Protein Sequences: Phylogenetic and Structural Effects. 1999
    Vol. 33 Proceedings of the IMS/AMS International Conference on Statistics in Molecular Biology and Genetics , pp. 236-256  
    inproceedings
    BibTeX:
    @inproceedings{
      author = {Lapedes, A.S. and Giraud, B.G. and Liu, L.C. and Stormo, G.D.},
      title = {Correlated Mutations in Protein Sequences: Phylogenetic and Structural Effects.},
      booktitle = {Proceedings of the IMS/AMS International Conference on Statistics in Molecular Biology and Genetics},
      publisher = {Monograph Series of the Inst. for Mathematical Statistics, Hayward CA},
      year = {1999},
      volume = {33},
      pages = {236--256},
      url = {http://www.jstor.org/stable/4356049}
    }
    					
    Weaver, D.C.; Workman, C.T. & Stormo, G.D. Modeling regulatory networks with weight matrices. 1999 Pac Symp Biocomput
    Vol. 4 , pp. 112-123  
    article computational biology; computer simulation; databases, factual; environment; gene expression regulation; gene expression regulation, developmental; models, genetic; reproducibility of results; software; transcription, genetic
    Abstract: Systematic gene expression analyses provide comprehensive information about the transcriptional response to different environmental and developmental conditions. With enough gene expression data points, computational biologists may eventually generate predictive computer models of transcription regulation. Such models will require computational methodologies consistent with the behavior of known biological systems that remain tractable. We represent regulatory relationships between genes as linear coefficients or weights, with the "net" regulation influence on a gene's expression being the mathematical summation of the independent regulatory inputs. Test regulatory networks generated with this approach display stable and cyclically stable gene expression levels, consistent with known biological systems. We include variables to model the effect of environmental conditions on transcription regulation and observed various alterations in gene expression patterns in response to environmental input. Finally, we use a derivation of this model system to predict the regulatory network from simulated input/output data sets and find that it accurately predicts all components of the model, even with noisy expression data.
    BibTeX:
    @article{
      author = {D. C. Weaver and C. T. Workman and G. D. Stormo},
      title = {Modeling regulatory networks with weight matrices.},
      journal = {Pac Symp Biocomput},
      year = {1999},
      volume = {4},
      pages = {112--123},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/10380190}
    }
    					
    Levy, S.; Compagnoni, L.; Myers, E.W. & Stormo, G.D. Xlandscape: the graphical display of word frequencies in sequences. 1998 Bioinformatics
    Vol. 14 (1) , pp. 74-80  
    article algorithms; base sequence; computer graphics; dna; humans; huntington disease; information storage and retrieval; molecular sequence data; repetitive sequences, nucleic acid; sequence analysis; software
    Abstract: MOTIVATION: To provide a graphical interface for the generation, display and manipulation of a sequence landscape that will run on all X-windows-based Unix workstations. RESULTS: The sequence landscape approach enables the representation of the frequency of occurrence of all query sequence sub-words within a database. The landscape approach can detect tandem and other repeating word motifs, specific sub-words that are over-represented words in a particular database using Markov probability and the preference for sub-words belonging to either one of two databases. All these features aid in the classification of a query sequence. Given the open-text format for sequences and databases, the Xlandscape tool can be applied to a wide range of problems.
    BibTeX:
    @article{
      author = {S. Levy and L. Compagnoni and E. W. Myers and G. D. Stormo},
      title = {Xlandscape: the graphical display of word frequencies in sequences.},
      journal = {Bioinformatics},
      year = {1998},
      volume = {14},
      number = {1},
      pages = {74--80},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/9520504}
    }
    					
    Stormo, G.D. Information content and free energy in DNA-protein interactions. 1998 J Theor Biol
    Vol. 195 (1) , pp. 135-137  
    article animals; binding sites; dna; genome; models, genetic; proteins
    BibTeX:
    @article{
      author = {G. D. Stormo},
      title = {Information content and free energy in DNA-protein interactions.},
      journal = {J Theor Biol},
      year = {1998},
      volume = {195},
      number = {1},
      pages = {135--137},
      url = {http://www.ncbi.nlm.nih.gov/pubmed},
      doi = {http://dx.doi.org/10.1006/jtbi.1998.0785}
    }
    					
    Stormo, G.D. & Fields, D.S. Specificity, free energy and information content in protein-DNA interactions. 1998 Trends Biochem Sci
    Vol. 23 (3) , pp. 109-113  
    article binding sites; dna; dna-binding proteins; energy transfer; forecasting; models, chemical; protein binding; repressor proteins; substrate specificity; transcription, genetic; viral proteins
    Abstract: Site-specific DNA-protein interactions can be studied using experimental and computational methods. Experimental approaches typically analyze a protein-DNA interaction by measuring the free energy of binding under a variety of conditions. Computational methods focus on alignments of known binding sites for a protein, and, from these alignments, make estimates of the binding energy. Understanding the relationship between these two perspectives, and finding ways to improve both, is a major challenge of modern molecular biology.
    BibTeX:
    @article{
      author = {G. D. Stormo and D. S. Fields},
      title = {Specificity, free energy and information content in protein-DNA interactions.},
      journal = {Trends Biochem Sci},
      year = {1998},
      volume = {23},
      number = {3},
      pages = {109--113},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/9581503}
    }
    					
    Tabaska, J.E.; Cary, R.B.; Gabow, H.N. & Stormo, G.D. An RNA folding method capable of identifying pseudoknots and base triples. 1998 Bioinformatics
    Vol. 14 (8) , pp. 691-699  
    article algorithms; bacillus subtilis; base sequence; escherichia coli; molecular sequence data; nucleic acid conformation; phylogeny; rna; rna, bacterial; thermodynamics
    Abstract: MOTIVATION: Recently, we described a Maximum Weighted Matching (MWM) method for RNA structure prediction. The MWM method is capable of detecting pseudoknots and other tertiary base-pairing interactions in a computationally efficient manner (Cary and Stormo, Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology, pp. 75-80, 1995). Here we report on the results of our efforts to improve the MWM method's predictive accuracy, and show how the method can be extended to detect base interactions formerly inaccessible to automated RNA modeling techniques. RESULTS: Improved performance in MWM structure prediction was achieved in two ways. First, new ways of calculating base pair likelihoods have been developed. These allow experimental data and combined statistical and thermodynamic information to be used by the program. Second, accuracy was improved by developing techniques for filtering out spurious base pairs predicted by the MWM program. We also demonstrate here a means by which the MWM folding method may be used to detect the presence of base triples in RNAs. AVAILABILITY: http://www.cshl.org/mzhanglab/tabaska/jaxpage.html CONTACT: tabaska@cshl.org
    BibTeX:
    @article{
      author = {J. E. Tabaska and R. B. Cary and H. N. Gabow and G. D. Stormo},
      title = {An RNA folding method capable of identifying pseudoknots and base triples.},
      journal = {Bioinformatics},
      year = {1998},
      volume = {14},
      number = {8},
      pages = {691--699},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/9789095}
    }
    					
    Chen, Q.K.; Hertz, G.Z. & Stormo, G.D. PromFD 1.0: a computer program that predicts eukaryotic pol II promoters using strings and IMD matrices. 1997 Comput Appl Biosci
    Vol. 13 (1) , pp. 29-35  
    article algorithms; animals; base sequence; databases, factual; evaluation studies; human genome project; humans; promoter regions (genetics); rna polymerase ii; sequence analysis, dna; software; software design; vertebrates
    Abstract: MOTIVATION: A large number of new DNA sequences with virtually unknown functions are generated as the Human Genome Project progresses. Therefore, it is essential to develop computer algorithms that can predict the functionality of DNA segments according to their primary sequences, including algorithms that can predict promoters. Although several promoter-predicting algorithms are available, they have high false-positive detections and the rate of promoter detection needs to be improved further. RESULTS: In this research, PromFD, a computer program to recognize vertebrate RNA polymerase II promoters, has been developed. Both vertebrate promoters and non-promoter sequences are used in the analysis. The promoters are obtained from the Eukaryotic Promoter Database. Promoters are divided into a training set and a test set. Non-promoter sequences are obtained from the GenBank sequence databank, and are also divided into a training set and a test set. The first step is to search out, among all possible permutations, patterns of strings 5-10 bp long, that are significantly over-represented in the promoter set. The program also searches IMD (Information Matrix Database) matrices that have a significantly higher presence in the promoter set. The results of the searches are stored in the PromFD database, and the program PromFD scores input DNA sequences according to their content of the database entries. PromFD predicts promoters: their locations and the location of potential TATA boxes, if found. The program can detect 71% of promoters in the training set with a false-positive rate of under 1 in every 13,000 bp, and 47% of promoters in the test set with a false-positive rate of under 1 in every 9800 bp. PromFD uses a new approach and its false-positive identification rate is better compared with other available promoter recognition algorithms. The source code for PromFD is in the 'C++' language.
    BibTeX:
    @article{
      author = {Q. K. Chen and G. Z. Hertz and G. D. Stormo},
      title = {PromFD 1.0: a computer program that predicts eukaryotic pol II promoters using strings and IMD matrices.},
      journal = {Comput Appl Biosci},
      year = {1997},
      volume = {13},
      number = {1},
      pages = {29--35},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/9088706}
    }
    					
    Fields, D.S.; He, Y.; Al-Uzri, A.Y. & Stormo, G.D. Quantitative specificity of the Mnt repressor. 1997 J Mol Biol
    Vol. 271 (2) , pp. 178-194  
    article bacteriophage p22; base sequence; binding sites; consensus sequence; dna; dna-binding proteins; kinetics; ligands; molecular sequence data; oligodeoxyribonucleotides; repressor proteins; salmonella; sequence alignment; substrate specificity; templates, genetic; thermodynamics; viral proteins; viral regulatory and accessory proteins
    Abstract: The Mnt protein of Salmonella phage P22 binds site-specifically to its operator. To better understand this binding we used dideoxy DNA sequencing in a quantitative manner to determine the relative binding constants, and hence the relative free energies, of wild-type Mnt protein to a substantial number of variants of its operator. These measurements were supported by experiments which used the SELEX procedure to generate a set of operators from an initially randomized population. In the Discussion we show that the present model of Mnt protein/operator binding, due to Sauer and co-workers, along with the assumption of an independent contribution of each position in the operator to the total binding, provides a reasonably accurate description of the system. We also discuss the use of information content as a measure of DNA-protein binding specificity with the Mnt protein/operator system serving as an example and show again that the assumption of independence supports the current view of this case of site-specific binding.
    BibTeX:
    @article{
      author = {D. S. Fields and Y. He and A. Y. Al-Uzri and G. D. Stormo},
      title = {Quantitative specificity of the Mnt repressor.},
      journal = {J Mol Biol},
      year = {1997},
      volume = {271},
      number = {2},
      pages = {178--194},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/9268651},
      doi = {http://dx.doi.org/10.1006/jmbi.1997.1171}
    }
    					
    Gorodkin, J.; Heyer, L.J.; Brunak, S. & Stormo, G.D. Displaying the information contents of structural RNA alignments: the structure logos. 1997 Comput Appl Biosci
    Vol. 13 (6) , pp. 583-586  
    article algorithms; base composition; computational biology; computer simulation; mathematics; nucleic acid conformation; nucleic acid hybridization; rna; sequence alignment
    Abstract: MOTIVATION: We extend the standard 'Sequence Logo' method of Schneider and Stephens (Nucleic Acids Res., 18, 6097-6100, 1990) to incorporate prior frequencies on the bases, allow for gaps in the alignments, and indicate the mutual information of base-paired regions in RNA. RESULTS: Given an alignment of RNA sequences with the base pairings indicated, the program will calculate the information at each position, including the mutual information of the base pairs, and display the results in a 'Structure Logo'. Alignments without base pairing can also be displayed in a 'Sequence Logo', but still allowing gaps and incorporating prior frequencies if desired. AVAILABILITY: The code is available from, and an Internet server can be used to run the program at, http://www.cbs.dtu.dk/gorodkin/appl/slogo.html.
    BibTeX:
    @article{
      author = {J. Gorodkin and L. J. Heyer and S. Brunak and G. D. Stormo},
      title = {Displaying the information contents of structural RNA alignments: the structure logos.},
      journal = {Comput Appl Biosci},
      year = {1997},
      volume = {13},
      number = {6},
      pages = {583--586},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/9475985}
    }
    					
    Gorodkin, J.; Heyer, L.J. & Stormo, G.D. Finding the most significant common sequence and structure motifs in a set of RNA sequences. 1997 Nucleic Acids Res
    Vol. 25 (18) , pp. 3724-3732  
    article algorithms; animals; computer simulation; databases, factual; humans; rna; sequence analysis
    Abstract: We present a computational scheme to locally align a collection of RNA sequences using sequence and structure constraints. In addition, the method searches for the resulting alignments with the most significant common motifs, among all possible collections. The first part utilizes a simplified version of the Sankoff algorithm for simultaneous folding and alignment of RNA sequences, but maintains tractability by constructing multi-sequence alignments from pairwise comparisons. The algorithm finds the multiple alignments using a greedy approach and has similarities to both CLUSTAL and CONSENSUS, but the core algorithm assures that the pairwise alignments are optimized for both sequence and structure conservation. The choice of scoring system and the method of progressively constructing the final solution are important considerations that are discussed. Example solutions, and comparisons with other approaches, are provided. The solutions include finding consensus structures identical to published ones.
    BibTeX:
    @article{
      author = {J. Gorodkin and L. J. Heyer and G. D. Stormo},
      title = {Finding the most significant common sequence and structure motifs in a set of RNA sequences.},
      journal = {Nucleic Acids Res},
      year = {1997},
      volume = {25},
      number = {18},
      pages = {3724--3732},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/9278497}
    }
    					
    Gorodkin, J.; Heyer, L.J. & Stormo, G.D. Finding common sequence and structure motifs in a set of RNA sequences. 1997 Proc Int Conf Intell Syst Mol Biol
    Vol. 5 , pp. 120-123  
    article algorithms; base sequence; databases, factual; molecular structure; nucleic acid conformation; rna; sequence alignment; software
    Abstract: We present a computational scheme to search for the most common motif, composed of a combination of sequence and structure constraints, among a collection of RNA sequences. The method uses a simplified version of the Sankoff algorithm for simultaneous folding and alignment of RNA sequences, but maintains tractability by constructing multi-sequence alignments from pairwise comparisons. The overall method has similarities to both CLUSTAL and CONSENSUS, but the core algorithm assures that the pairwise alignments are optimized for both sequence and structure conservation. Example solutions, and comparisons with other approaches, are provided. The solutions include finding consensus structures identical to published ones.
    BibTeX:
    @article{
      author = {J. Gorodkin and L. J. Heyer and G. D. Stormo},
      title = {Finding common sequence and structure motifs in a set of RNA sequences.},
      journal = {Proc Int Conf Intell Syst Mol Biol},
      year = {1997},
      volume = {5},
      pages = {120--123},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/9322025}
    }
    					
    Levy, S. & Stormo, G. J. Mycielski, G.R. & A. Salomaa (Eds.) DNA Sequence Classification Using DAWGs ( Structures in Logic and Computer Science ) 1997 Structures in Logic and Computer Science
    Vol. 1261 , pp. 339-352  
    inbook
    BibTeX:
    @inbook{
      author = {Levy, S. and Stormo, G.D.},
      title = {Structures in Logic and Computer Science},
      publisher = {Springer-Verlag},
      year = {1997},
      volume = {1261},
      pages = {339--352}
    }
    					
    Stormo, G. S. Suhai (Ed.) Recognizing Functional Domains in Biological Sequences ( Theoretical and Computational Methods in Genome Research ) 1997 Theoretical and Computational Methods in Genome Research , pp. 105-116   inbook
    BibTeX:
    @inbook{
      author = {Stormo, G.D.},
      title = {Theoretical and Computational Methods in Genome Research},
      publisher = {Plenum Press},
      year = {1997},
      pages = {105--116}
    }
    					
    Tabaska, J.E. & Stormo, G.D. Automated alignment of RNA sequences to pseudoknotted structures. 1997 Proc Int Conf Intell Syst Mol Biol
    Vol. 5 , pp. 311-318  
    article algorithms; base sequence; models, molecular; molecular sequence data; nucleic acid conformation; rna; rna, small nuclear; rna, transfer; reverse transcriptase inhibitors; sequence alignment; software
    Abstract: Seq7 is a new program for generating multiple structure-based alignments of RNA sequences. By using a variant of Dijkstra's algorithm to find the shortest path through a specially constructed graph, Seq7 is able to align RNA sequences to pseudoknotted structures in polynomial time. In this paper, we describe the operation of Seq7 and demonstrate the program's abilities. We also describe the use of Seq7 in an Expectation-Maximization procedure that automates the process of structural modeling and alignment of RNA sequences.
    BibTeX:
    @article{
      author = {J. E. Tabaska and G. D. Stormo},
      title = {Automated alignment of RNA sequences to pseudoknotted structures.},
      journal = {Proc Int Conf Intell Syst Mol Biol},
      year = {1997},
      volume = {5},
      pages = {311--318},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/9322055}
    }
    					
    Hertz, G.Z. & Stormo, G.D. S. Adhya (Ed.) Escherichia coli Promoter Sequences: Analysis and Prediction ( RNA Polymerase and Associated Factors ) 1996 Methods Enzymol RNA Polymerase and Associated Factors
    Vol. 273 , pp. 30-42  
    article base sequence; consensus sequence; dna, bacterial; escherichia coli; genes, bacterial; promoter regions (genetics); sequence alignment; sequence homology, nucleic acid
    BibTeX:
    @article{
      author = {G. Z. Hertz and G. D. Stormo},
      title = {Escherichia coli Promoter Sequences: Analysis and Prediction},
      journal = {Methods Enzymol},
      year = {1996},
      volume = {273},
      pages = {30--42},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/8791597}
    }
    					
    Hertz, G.Z. & Stormo, G.D. Escherichia coli promoter sequences: analysis and prediction. 1996 Methods Enzymol
    Vol. 273 , pp. 30-42  
    article base sequence; consensus sequence; dna, bacterial, genetics; escherichia coli, genetics; genes, bacterial; promoter regions, genetic; sequence alignment; sequence homology, nucleic acid
    BibTeX:
    @article{
      author = {G. Z. Hertz and G. D. Stormo},
      title = {Escherichia coli promoter sequences: analysis and prediction.},
      journal = {Methods Enzymol},
      year = {1996},
      volume = {273},
      pages = {30--42}
    }
    					
    Snyder, E. & Stormo, G. Fiesler, E. & Beale, R. (Eds.) Biology and Biochemistry: Neural Networks for identification of protein coding regions in genomic DNA sequences ( Handbook of Neural Computation ) 1996 Handbook of Neural Computation   inbook
    BibTeX:
    @inbook{
      author = {Snyder, E.E. and Stormo, G.D.},
      title = {Handbook of Neural Computation},
      publisher = {Oxford Univ. Press},
      year = {1996},
      url = {http://www.crcnetbase.com/doi/pdf/10.1201/9781420050646.ptg4},
      doi = {10.1201/9781420050646.ptg4}
    }
    					
    Snyder, E. & Stormo, G. Bishop, M. & C.J. Rawlings (Eds.) Identification of Protein Coding Regions in Genomic DNA Sequences ( Nucleic Acid and Protein Sequence Analysis, A Practical Approach ) 1996 Nucleic Acid and Protein Sequence Analysis, A Practical Approach , pp. 209-224   inbook
    BibTeX:
    @inbook{
      author = {Snyder, E.E. and Stormo, G.D.},
      title = {Nucleic Acid and Protein Sequence Analysis, A Practical Approach},
      publisher = {IRL Press, Oxford},
      year = {1996},
      pages = {209--224},
      edition = {2}
    }
    					
    Ulyanov, A. & Stormo, G. Witten, M. & D. Vincet (Eds.) Pattern-Recognition Methods for Classifying Domains of DNA Sequences ( Computational Medicine, Public Health and Biotechnology: Building a Man in the Machine ) 1996 Computational Medicine, Public Health and Biotechnology: Building a Man in the Machine   inbook
    BibTeX:
    @inbook{
      author = {Ulyanov, A.V. and Stormo, G.D.},
      title = {Computational Medicine, Public Health and Biotechnology: Building a Man in the Machine},
      publisher = {World Scientific Pub. Co., Singapore},
      year = {1996}
    }
    					
    Cary, R.B. & Stormo, G.D. Graph-theoretic approach to RNA modeling using comparative data. 1995 Proc Int Conf Intell Syst Mol Biol
    Vol. 3 , pp. 75-80  
    article algorithms; base composition; base sequence; computer graphics; models, theoretical; molecular sequence data; nucleic acid conformation; rna; rna, transfer
    Abstract: We have examined the utility of a graph-theoretic algorithm for building comparative RNA models. The method uses a maximum weighted matching algorithm to find the optimal set of basepairs given the mutual information for all pairs of alignment positions. In all cases examined, the technique generated models similar to those based on conventional comparative analysis. Any set of pairwise interactions can be suggested including pseudoknots. Here we describe the details of the method and demonstrate its implementation on tRNA where many secondary and tertiary base-pairs are accurately predicted. We also examine the usefulness of the method for the identification of shared structural features in families of RNAs isolated by artificial selection methods such as SELEX.
    BibTeX:
    @article{
      author = {R. B. Cary and G. D. Stormo},
      title = {Graph-theoretic approach to RNA modeling using comparative data.},
      journal = {Proc Int Conf Intell Syst Mol Biol},
      year = {1995},
      volume = {3},
      pages = {75--80},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/7584469}
    }
    					
    Chen, Q.K.; Hertz, G.Z. & Stormo, G.D. MATRIX SEARCH 1.0: a computer program that scans DNA sequences for transcriptional elements using a database of weight matrices. 1995 Comput Appl Biosci
    Vol. 11 (5) , pp. 563-566  
    article base sequence; binding sites; dna; databases, factual; genes, regulator; molecular sequence data; sequence alignment; software; transcription factors
    Abstract: The information matrix database (IMD), a database of weight matrices of transcription factor binding sites, is developed. MATRIX SEARCH, a program which can find potential transcription factor binding sites in DNA sequences using the IMD database, is also developed and accompanies the IMD database. MATRIX SEARCH adopts a user interface very similar to that of the SIGNAL SCAN program. MATRIX SEARCH allows the user to search an input sequence with the IMD automatically, to visualize the matrix representations of sites for particular factors, and to retrieve journal citations. The source code for MATRIX SEARCH is in the 'C' language, and the program is available for unix platforms.
    BibTeX:
    @article{
      author = {Q. K. Chen and G. Z. Hertz and G. D. Stormo},
      title = {MATRIX SEARCH 1.0: a computer program that scans DNA sequences for transcriptional elements using a database of weight matrices.},
      journal = {Comput Appl Biosci},
      year = {1995},
      volume = {11},
      number = {5},
      pages = {563--566},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/8590181}
    }
    					
    Chen, P.; Ailion, M.; Bobik, T.; Stormo, G. & Roth, J. Five promoters integrate control of the cob/pdu regulon in Salmonella typhimurium. 1995 J Bacteriol
    Vol. 177 (19) , pp. 5401-5410  
    article bacterial outer membrane proteins; bacterial proteins; base sequence; carrier proteins; cyclic amp receptor protein; dna-binding proteins; escherichia coli proteins; gene expression regulation, bacterial; genes, bacterial; models, genetic; molecular sequence data; mutagenesis, insertional; promoter regions (genetics); propylene glycol; propylene glycols; rna, bacterial; rna, messenger; regulon; repressor proteins; salmonella typhimurium; sequence analysis, dna; trans-activators; transcription, genetic; vitamin b 12
    Abstract: Propanediol is degraded by a B12-dependent pathway in Salmonella typhimurium. The enzymes for this pathway are encoded in a small region (minute 41) that includes the pdu operon (controlling B12-dependent degradation of propanediol) and the divergent cob operon (controlling synthesis of cobalamin, B12). Expression of both operons is induced by propanediol and globally controlled by the ArcA and Crp systems. The region between the two operons encodes two proteins, PduF, a transporter of propanediol, and PocR, which mediates the induction of the regulon by propanediol. Insertion mutations between the pdu and cob operons have been characterized, and their exact positions have been correlated with mutant phenotypes. The region includes five promoters, four of which are controlled by the PocR protein and induced by propanediol. The cob and pdu operons each have one regulated promoter; the pduF gene is expressed from two regulated promoters (P1 and P2). The P1 and P2 transcripts extend beyond pduF to include the pocR gene; thus the PocR protein autoregulates its expression from these promoters. The fifth promoter, PPoc, is adjacent to the pocR gene and associated with a Crp binding site. We suggest that all global control of the regulon is exerted by regulating the level of PocR protein at the P1, P2, and PPoc promoters. A putative binding site for the PocR protein has been identified by computer analysis. Eight close matches to this proposed site were found in regions near the four promoters known to be regulated by PocR protein: PPdu, P1, P2, and PCob. A three-state model is proposed in which the regulon uses all five of its promoters to control expression.
    BibTeX:
    @article{
      author = {P. Chen and M. Ailion and T. Bobik and G. Stormo and J. Roth},
      title = {Five promoters integrate control of the cob/pdu regulon in Salmonella typhimurium.},
      journal = {J Bacteriol},
      year = {1995},
      volume = {177},
      number = {19},
      pages = {5401--5410},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/7559322}
    }
    					
    Cui, Y.; Wang, Q.; Stormo, G.D. & Calvo, J.M. A consensus sequence for binding of Lrp to DNA. 1995 J Bacteriol
    Vol. 177 (17) , pp. 4872-4880  
    article bacterial proteins; base sequence; binding sites; consensus sequence; dna, bacterial; dna-binding proteins; escherichia coli proteins; leucine; leucine-responsive regulatory protein; molecular sequence data; protein binding; selection (genetics); sequence analysis, dna; sequence homology, nucleic acid; transcription factors
    Abstract: Lrp (leucine-responsive regulatory protein) is a major regulatory protein involved in the expression of numerous operons in Escherichia coli. For ilvIH, one of the operons positively regulated by Lrp, Lrp binds to multiple sites upstream of the transcriptional start site and activates transcription. An alignment of 12 Lrp binding sites within ilvIH DNA from two different organisms revealed a tentative consensus sequence AGAATTTTATTCT (Q. Wang, M. Sacco, E. Ricca, C.T. Lago, M. DeFelice, and J.M. Calvo, Mol. Microbiol. 7:883-891, 1993). To further characterize the binding specificity of Lrp, we used a variation of the Selex procedure of C. Tuerk and L. Gold (Science 249:505-510, 1990) to identify sequences that bound Lrp out of a pool of 10^12 different DNA molecules. We identified 63 related DNA sequences that bound Lrp and estimated their relative binding affinities for Lrp. A consensus sequence derived from analysis of these sequences, YAGHAWATTWTDCTR, where Y = C or T, H = not G, W = A or T, D = not C, and R = A or G, contains clear dyad symmetry and is very similar to the one defined earlier. To test the idea that Lrp in the presence of leucine might bind to a different subset of DNA sequences, we carried out a second selection experiment with leucine present during the binding reactions. DNA sequences selected in the presence or absence of leucine were similar, and leucine did not stimulate binding to any of the sequences that were selected in the presence of leucine. Therefore, it is unlikely that leucine changes the specificity of Lrp binding.
    BibTeX:
    @article{
      author = {Y. Cui and Q. Wang and G. D. Stormo and J. M. Calvo},
      title = {A consensus sequence for binding of Lrp to DNA.},
      journal = {J Bacteriol},
      year = {1995},
      volume = {177},
      number = {17},
      pages = {4872--4880},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/7665463}
    }
    					
    Hertz, G. & Stormo, G. Lim, H. & C.R. Cantor (Eds.) Identification of Consensus Patterns in Unaligned DNA and Protein Sequences: a Large-Deviation Statistical Basis for Penalizing Gaps ( Proceedings of the Third International Conference on Bioinformatics and Genome Research ) 1995 Proceedings of the Third International Conference on Bioinformatics and Genome Research , pp. 201-216   inbook
    BibTeX:
    @inbook{
      author = {Hertz, G.Z. and Stormo, G.D.},
      title = {Proceedings of the Third International Conference on Bioinformatics and Genome Research},
      publisher = {World Scientific Publishing, Singapore},
      year = {1995},
      pages = {201--216}
    }
    					
    Heumann, J.H.; Lapedes, A.S. & Stormo, G.D. Alignment of Regulatory Sites Using Neural Networks to Maximize Specificity ( Proceedings of the 1995 World Congress on Neural Networks ) 1995 Proceedings of the 1995 World Congress on Neural Networks , pp. 771-775   inbook
    BibTeX:
    @inbook{
      author = {Heumann, J.H. and Lapedes, A.S. and Stormo, G.D.},
      title = {Proceedings of the 1995 World Congress on Neural Networks},
      year = {1995},
      pages = {771--775}
    }
    					
    Snyder, E.E. & Stormo, G.D. Identification of protein coding regions in genomic DNA. 1995 J Mol Biol
    Vol. 248 (1) , pp. 1-18  
    article base composition; base sequence; computer simulation; dna; exons; genes; humans; introns; models, statistical; protein biosynthesis; proteins; reproducibility of results; software
    Abstract: We have developed a computer program, GeneParser, which identifies and determines the fine structure of protein genes in genomic DNA sequences. The program scores all subintervals in a sequence for content statistics indicative of introns and exons, and for sites that identify their boundaries. This information is weighted by a neural network to approximate the log-likelihood that each subinterval exactly represents an intron or exon (first, internal or last). A dynamic programming algorithm is then applied to this data to find the combination of introns and exons that maximizes the likelihood function. Using this method, we can rapidly generate ranked suboptimal solutions, each of which is the optimum solution containing a given intron-exon junction. We have tested the system on a large collection of human genes. On sequences not used in training, we achieved a correlation coefficient for exon nucleotide prediction of 0.89. For a subset of G + C-rich genes, a correlation coefficient of 0.94 was achieved. We have also quantified the robustness of the method to substitution and frame-shift errors and show how the system can be optimized for performance on sequences with known levels of sequencing errors.
    BibTeX:
    @article{
      author = {E. E. Snyder and G. D. Stormo},
      title = {Identification of protein coding regions in genomic DNA.},
      journal = {J Mol Biol},
      year = {1995},
      volume = {248},
      number = {1},
      pages = {1--18},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/7731036},
      doi = {10.1006/jmbi.1995.0198}
    }
    					
    Ulyanov, A.V. & Stormo, G.D. Multi-alphabet consensus algorithm for identification of low specificity protein-DNA interactions. 1995 Nucleic Acids Res
    Vol. 23 (8) , pp. 1434-1440  
    article algorithms; animals; bacterial proteins; base sequence; binding sites; chickens; consensus sequence; cyclic amp receptor protein; dna; dna-binding proteins; databases, factual; erythrocytes; molecular sequence data; nucleosomes; serine endopeptidases
    Abstract: A method for the identification and characterization of protein-DNA interactions is presented. We have developed an approach for finding unknown multiple patterns that occur imperfectly in a set of several sequences. The pattern may contain letters from the nucleotide alphabet (A, C, G and T) including ambiguous characters (A/C, A/G, A/T; A/C/G, etc.). This method reveals weak DNA signals on an unaligned set of DNA fragments known to be functionally related and assumes no prior information on the sequences' alignment. It determines the locations of the signals from only the information intrinsic to the sequences themselves. We have applied this method to analyze the binding sites of cAMP receptor protein (CRP). The consensus based on these data are discussed and a comparison of the consensus with the crystal structure of CAP-DNA complex is presented. We further show that in a mixture of DNA sequences, containing binding sites for two different proteins, both classes of binding sites can be discovered simultaneously by this method. The DNA sequences of nucleosome cores from chicken erythrocyte and a set of the other known nucleosomal sequences show existence of symmetrical features in nucleosome-binding DNA sequences. We also show multi-alphabet patterns that can play a role in the phasing signal on the nucleosome DNA molecule and have compared the results with existing models of nucleosome positioning.
    BibTeX:
    @article{
      author = {A. V. Ulyanov and G. D. Stormo},
      title = {Multi-alphabet consensus algorithm for identification of low specificity protein-DNA interactions.},
      journal = {Nucleic Acids Res},
      year = {1995},
      volume = {23},
      number = {8},
      pages = {1434--1440},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/7753637}
    }
    					
    Barrick, D.; Villanueba, K.; Childs, J.; Kalil, R.; Schneider, T.D.; Lawrence, C.E.; Gold, L. & Stormo, G.D. Quantitative analysis of ribosome binding sites in E.coli. 1994 Nucleic Acids Res
    Vol. 22 (7) , pp. 1287-1295  
    article base sequence; binding sites; cloning, molecular; codon; dna, bacterial; escherichia coli; molecular sequence data; ribosomes; beta-galactosidase
    Abstract: 185 clones with randomized ribosome binding sites, from position -11 to 0 preceding the coding region of beta-galactosidase, were selected and sequenced. The translational yield of each clone was determined; they varied by more than 3000-fold. Multiple linear regression analysis was used to determine the contribution to translation initiation activity of each base at each position. Features known to be important for translation initiation, such as the initiation codon, the Shine/Dalgarno sequence, the identity of the base at position -3 and the occurrence of alternative ATGs, are all found to be important quantitatively for activity. No other features are found to be of general significance, although the effects of secondary structure can be seen as outliers. A comparison to a large number of natural E.coli translation initiation sites shows the information profile to be qualitatively similar although differing quantitatively. This is probably due to the selection for good translation initiation sites in the natural set compared to the low average activity of the randomized set.
    BibTeX:
    @article{
      author = {D. Barrick and K. Villanueba and J. Childs and R. Kalil and T. D. Schneider and C. E. Lawrence and L. Gold and G. D. Stormo},
      title = {Quantitative analysis of ribosome binding sites in E.coli.},
      journal = {Nucleic Acids Res},
      year = {1994},
      volume = {22},
      number = {7},
      pages = {1287--1295},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/8165145}
    }
    					
    Fields, D.S. & Stormo, G.D. Quantitative DNA sequencing to determine the relative protein-DNA binding constants to multiple DNA sequences. 1994 Anal Biochem
    Vol. 219 (2) , pp. 230-239  
    article adenosine triphosphate; bacteriophage p22; base sequence; binding sites; dna; dna-binding proteins; electrophoresis; kinetics; molecular sequence data; oligodeoxyribonucleotides; phosphorus radioisotopes; random allocation; repressor proteins; viral proteins
    Abstract: DNA sequencing technology was modified into a quantitative assay, which for multiple DNA sequences allowed the simultaneous determination of relative protein-DNA binding constants. The band mobility shift of the protein-DNA binding reactions partitions the mixture of DNA sequences into bound and unbound fractions. The quantitation of that partitioning gives directly the relative binding constants, usually with accuracies of better than +/- 20%. The protein of interest was the Mnt repressor of Salmonella bacteriophage P22, and the synthetic DNA contained Mnt's natural operator with a randomized position.
    BibTeX:
    @article{
      author = {D. S. Fields and G. D. Stormo},
      title = {Quantitative DNA sequencing to determine the relative protein-DNA binding constants to multiple DNA sequences.},
      journal = {Anal Biochem},
      year = {1994},
      volume = {219},
      number = {2},
      pages = {230--239},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/8080080},
      doi = {10.1006/abio.1994.1262}
    }
    					
    Heumann, J.M.; Lapedes, A.S. & Stormo, G.D. Neural networks for determining protein specificity and multiple alignment of binding sites. 1994 Proc Int Conf Intell Syst Mol Biol
    Vol. 2 , pp. 188-194  
    article bacterial proteins; binding sites; dna, bacterial; dna-binding proteins; escherichia coli; markov chains; neural networks (computer); sequence alignment; sequence analysis
    Abstract: We use a quantitative definition of specificity to develop a neural network for the identification of common protein binding sites in a collection of unaligned DNA fragments. We demonstrate the equivalence of the method to maximizing Information Content of the aligned sites when simple models of the binding energy and the genome are employed. The network method subsumes those simple models and is capable of working with more complicated ones. This is demonstrated using a Markov model of the E. coli genome and a sampling method to approximate the partition function. A variation of Gibbs' sampling aids in avoiding local minima.
    BibTeX:
    @article{
      author = {J. M. Heumann and A. S. Lapedes and G. D. Stormo},
      title = {Neural networks for determining protein specificity and multiple alignment of binding sites.},
      journal = {Proc Int Conf Intell Syst Mol Biol},
      year = {1994},
      volume = {2},
      pages = {188--194},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/7584389}
    }
    					
    Melov, S.; Hertz, G.Z.; Stormo, G.D. & Johnson, T.E. Detection of deletions in the mitochondrial genome of Caenorhabditis elegans. 1994 Nucleic Acids Res
    Vol. 22 (6) , pp. 1075-1078  
    article aging; animals; caenorhabditis elegans; dna primers; dna, mitochondrial; gene deletion; nucleic acid conformation; polymerase chain reaction; rna, transfer; repetitive sequences, nucleic acid
    Abstract: We have examined an aging population of Caenorhabditis elegans via a PCR assay to determine if deletions in the mitochondrial genome occur in the nematode. We detected eight such deletions, identified the breakpoints of four of these, and discovered direct repeats of 4-8 base pairs at the site of all four deletions. Six of the eight repeats involved in the deletions are located in or immediately adjacent to tRNAs. Without a biochemical bias, the probability of direct repeats being present at all four breakpoints was 4 x 10^-6.
    BibTeX:
    @article{
      author = {S. Melov and G. Z. Hertz and G. D. Stormo and T. E. Johnson},
      title = {Detection of deletions in the mitochondrial genome of Caenorhabditis elegans.},
      journal = {Nucleic Acids Res},
      year = {1994},
      volume = {22},
      number = {6},
      pages = {1075--1078},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/8152911},
      doi = {10.1093/nar/22.6.1075}
    }
    					
    Stormo, G.D. & Haussler, D. Optimally parsing a sequence into different classes based on multiple types of evidence. 1994 Proc Int Conf Intell Syst Mol Biol
    Vol. 2 , pp. 369-375  
    article algorithms; animals; computer simulation; dna; humans; models, theoretical; sequence analysis
    Abstract: We consider the problem of parsing a sequence into different classes of subsequences. Two common examples are finding the exons and introns in genomic sequences and identifying the secondary structure domains of protein sequences. In each case there are various types of evidence that are relevant to the classification, but none are completely reliable, so we expect some weighted average of all the evidence to provide improved classifications. For example, in the problem of identifying coding regions in genomic DNA, the combined use of evidence such as codon bias and splice junction patterns can give more reliable predictions than either type of evidence alone. We show three main results: 1. For a given weighting of the evidence a dynamic programming algorithm returns the optimal parse and any number of sub-optimal parses. 2. For a given weighting of the evidence a dynamic programming algorithm determines the probability of the optimal parse and any number of sub-optimal parses under a natural Boltzmann-Gibbs distribution over the set of possible parses. 3. Given a set of sequences with known correct parses, a dynamic programming algorithm allows one to apply gradient descent to obtain the weights that maximize the probability of the correct parses of these sequences.
    BibTeX:
    @article{
      author = {G. D. Stormo and D. Haussler},
      title = {Optimally parsing a sequence into different classes based on multiple types of evidence.},
      journal = {Proc Int Conf Intell Syst Mol Biol},
      year = {1994},
      volume = {2},
      pages = {369--375},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/7584414}
    }
    					
    Prestridge, D.S. & Stormo, G. SIGNAL SCAN 3.0: new database and program features. 1993 Comput Appl Biosci
    Vol. 9 (1) , pp. 113-115  
    article algorithms; amino acid sequence; animals; binding sites; microcomputers; software; transcription factors
    Abstract: SIGNAL SCAN is a program that utilizes a transcription factor database to find potential transcription factor binding sites in DNA sequences. The program is now in its third version. The SIGNAL SCAN transcription factor database format has changed and the program output format has been improved. New features allow the user to update the SIGNAL SCAN database automatically, to retrieve original journal citations and to develop user signal databases. The program now uses an indexing algorithm, improving scanning speed by a factor of 3. SIGNAL SCAN is now network compatible and is available for IBM-compatible PC, Unix and VMS platforms.
    BibTeX:
    @article{
      author = {D. S. Prestridge and G. Stormo},
      title = {SIGNAL SCAN 3.0: new database and program features.},
      journal = {Comput Appl Biosci},
      year = {1993},
      volume = {9},
      number = {1},
      pages = {113--115},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/8435761}
    }
    					
    Selick, H.E.; Stormo, G.D.; Dyson, R.L. & Alberts, B.M. Analysis of five presumptive protein-coding sequences clustered between the primosome genes, 41 and 61, of bacteriophages T4, T2, and T6. 1993 J Virol
    Vol. 67 (4) , pp. 2305-2316  
    article amino acid sequence; base sequence; chromosome mapping; codon; dna helicases; dna primase; dna replication; dna, viral; escherichia coli; evolution; genes, bacterial; genes, viral; molecular sequence data; oligodeoxyribonucleotides; open reading frames; rna nucleotidyltransferases; rna, viral; sequence homology, nucleic acid; t-phages; viral proteins; viral structural proteins
    Abstract: In bacteriophage T4, there is a strong tendency for genes that encode interacting proteins to be clustered on the chromosome. There is 1.6 kb of DNA between the DNA helicase (gene 41) and the DNA primase (gene 61) genes of this virus. The DNA sequence of this region suggests that it contains five genes, designated as open reading frames (ORFs) 61.1 to 61.5, predicted to encode proteins ranging in size from 5.94 to 22.88 kDa. Are these ORFs actually genes? As one test, we compared the DNA sequence of this region in bacteriophages T2, T4, and T6 and found that ORFs 61.1, 61.3, 61.4, and 61.5 are highly conserved among the three closely related viruses. In contrast, ORF 61.2 is conserved between phages T4 and T6 yet is absent from phage T2, where it is replaced by another ORF, T2 ORF 61.2, which is not found in the T4 and T6 genomes. As a second, independent test for coding sequences, we calculated the codon base position preferences for all ORFs in this region that could encode proteins that contain at least 30 amino acids. Both the T4/T6 and T2 versions of ORF 61.2, as well as the other ORFs, have codon base position preferences that are indistinguishable from those of known T4 genes (coefficients of 0.81 to 0.94); the six other possible ORFs of at least 90 bp in this region are ruled out as genes by this test (coefficients less than zero). Thus, both evolutionary conservation and codon usage patterns lead us to conclude that ORFs 61.1 to 61.5 represent important protein-coding sequences for this family of bacteriophages. Because they are located between the genes that encode the two interacting proteins of the T4 primosome (DNA helicase plus DNA primase), one or more may function in DNA replication by modulating primosome function.
    BibTeX:
    @article{
      author = {H. E. Selick and G. D. Stormo and R. L. Dyson and B. M. Alberts},
      title = {Analysis of five presumptive protein-coding sequences clustered between the primosome genes, 41 and 61, of bacteriophages T4, T2, and T6.},
      journal = {J Virol},
      year = {1993},
      volume = {67},
      number = {4},
      pages = {2305--2316},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/8383243}
    }
    					
    Snyder, E.E. & Stormo, G.D. Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks. 1993 Nucleic Acids Res
    Vol. 21 (3) , pp. 607-613  
    article algorithms; animals; dna; databases, factual; exons; humans; neural networks (computer); software
    Abstract: Dynamic programming (DP) is applied to the problem of precisely identifying internal exons and introns in genomic DNA sequences. The program GeneParser first scores the sequence of interest for splice sites and for these intron- and exon-specific content measures: codon usage, local compositional complexity, 6-tuple frequency, length distribution and periodic asymmetry. This information is then organized for interpretation by DP. GeneParser employs the DP algorithm to enforce the constraints that introns and exons must be adjacent and non-overlapping and finds the highest scoring combination of introns and exons subject to these constraints. Weights for the various classification procedures are determined by training a simple feed-forward neural network to maximize the number of correct predictions. In a pilot study, the system has been trained on a set of 56 human gene fragments containing 150 internal exons in a total of 158,691 bps of genomic sequence. When tested against the training data, GeneParser precisely identifies 75% of the exons and correctly predicts 86% of coding nucleotides as coding while only 13% of non-exon bps were predicted to be coding. This corresponds to a correlation coefficient for exon prediction of 0.85. Because of the simplicity of the network weighting scheme, generalization performance is nearly as good as with the training set.
    BibTeX:
    @article{
      author = {E. E. Snyder and G. D. Stormo},
      title = {Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks.},
      journal = {Nucleic Acids Res},
      year = {1993},
      volume = {21},
      number = {3},
      pages = {607--613},
      url = {http://www.ncbi.nlm.nih.gov/pmc/articles/PMC309159/}
    }
    					
    Stormo, G.D.; Strobl, S.; Yoshioka, M. & Lee, J.S. Specificity of the Mnt protein. Independent effects of mutations at different positions in the operator. 1993 J Mol Biol
    Vol. 229 (4) , pp. 821-826  
    article base sequence; dna, viral; molecular sequence data; mutagenesis; operator regions (genetics); protein binding; repressor proteins; viral proteins
    Abstract: The relative binding affinities of Mnt protein are determined for each possible base-pair at position 15 of the operator sequence, and for all combinations of G.C base-pairs at positions 15 and 17. The partitioning of each operator sequence is determined quantitatively with restriction enzymes. At position 15, the wild-type G.C base-pair provides the highest binding affinity but, unlike position 17, the primary distinction is between purine and pyrimidine bases on the top strand. The information content at position 15 is only about 0.16 bit. In comparison with previous measurements at position 17, it is determined that the interactions of the Mnt protein with positions 15 and 17 are independent, i.e. the specific binding energies for the two positions are additive. The relative binding affinities at position 17 are also determined in the background of a G to T mutation at position 5, the position equivalent to 17 on the other half of the symmetric operator. The relative affinities at position 17 are independent of whether position 5 is wild-type or mutant.
    BibTeX:
    @article{
      author = {G. D. Stormo and S. Strobl and M. Yoshioka and J. S. Lee},
      title = {Specificity of the Mnt protein. Independent effects of mutations at different positions in the operator.},
      journal = {J Mol Biol},
      year = {1993},
      volume = {229},
      number = {4},
      pages = {821--826},
      doi = {10.1006/jmbi.1993.1088}
    }
    					
    Cardon, L.R. & Stormo, G.D. Expectation maximization algorithm for identifying protein-binding sites with variable lengths from unaligned DNA fragments. 1992 J Mol Biol
    Vol. 223 (1) , pp. 159-170  
    article algorithms; base sequence; binding sites; dna; dna, bacterial; dna-binding proteins; escherichia coli; molecular sequence data; promoter regions (genetics); sequence alignment
    Abstract: An Expectation Maximization algorithm for identification of DNA binding sites is presented. The approach predicts the location of binding regions while allowing variable length spacers within the sites. In addition to predicting the most likely spacer length for a set of DNA fragments, the method identifies individual sites that differ in spacer size. No alignment of DNA sequences is necessary. The method is illustrated by application to 231 Escherichia coli DNA fragments known to contain promoters with variable spacings between their consensus regions. Maximum-likelihood tests of the differences between the spacing classes indicate that the consensus regions of the spacing classes are not distinct. Further tests suggest that several positions within the spacing region may contribute to promoter specificity.
    BibTeX:
    @article{
      author = {L. R. Cardon and G. D. Stormo},
      title = {Expectation maximization algorithm for identifying protein-binding sites with variable lengths from unaligned DNA fragments.},
      journal = {J Mol Biol},
      year = {1992},
      volume = {223},
      number = {1},
      pages = {159--170},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/1731067}
    }
    					
    Gutell, R.R.; Power, A.; Hertz, G.Z.; Putz, E.J. & Stormo, G.D. Identifying constraints on the higher-order structure of RNA: continued development and application of comparative sequence analysis methods. 1992 Nucleic Acids Res
    Vol. 20 (21) , pp. 5785-5795  
    article base sequence; databases, factual; molecular sequence data; nucleic acid conformation; rna, ribosomal; rna, transfer; sequence alignment; sequence analysis, rna
    Abstract: Comparative sequence analysis addresses the problem of RNA folding and RNA structural diversity, and is responsible for determining the folding of many RNA molecules, including 5S, 16S, and 23S rRNAs, tRNA, RNAse P RNA, and Group I and II introns. Initially this method was utilized to fold these sequences into their secondary structures. More recently, this method has revealed numerous tertiary correlations, elucidating novel RNA structural motifs, several of which have been experimentally tested and verified, substantiating the general application of this approach. As successful as the comparative methods have been in elucidating higher-order structure, it is clear that additional structure constraints remain to be found. Deciphering such constraints requires more sensitive and rigorous protocols, in addition to RNA sequence datasets that contain additional phylogenetic diversity and an overall increase in the number of sequences. Various RNA databases, including the tRNA and rRNA sequence datasets, continue to grow in number as well as diversity. Described herein is the development of more rigorous comparative analysis protocols. Our initial development and applications on different RNA datasets have been very encouraging. Such analyses on tRNA, 16S and 23S rRNA are substantiating previously proposed associations and are now beginning to reveal additional constraints on these molecules. A subset of these involve several positions that correlate simultaneously with one another, implying units larger than a basepair can be under a phylogenetic constraint.
    BibTeX:
    @article{
      author = {R. R. Gutell and A. Power and G. Z. Hertz and E. J. Putz and G. D. Stormo},
      title = {Identifying constraints on the higher-order structure of RNA: continued development and application of comparative sequence analysis methods.},
      journal = {Nucleic Acids Res},
      year = {1992},
      volume = {20},
      number = {21},
      pages = {5785--5795},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/1454539}
    }
    					
    Mount, S.M.; Burks, C.; Hertz, G.; Stormo, G.D.; White, O. & Fields, C. Splicing signals in Drosophila: intron size, information content, and consensus sequences. 1992 Nucleic Acids Res
    Vol. 20 (16) , pp. 4255-4262  
    article animals; base sequence; consensus sequence; databases, factual; drosophila; introns; molecular sequence data; rna splicing; rna, messenger; software
    Abstract: A database of 209 Drosophila introns was extracted from Genbank (release number 64.0) and examined by a number of methods in order to characterize features that might serve as signals for messenger RNA splicing. A tight distribution of sizes was observed: while the smallest introns in the database are 51 nucleotides, more than half are less than 80 nucleotides in length, and most of these have lengths in the range of 59-67 nucleotides. Drosophila splice sites found in large and small introns differ in only minor ways from each other and from those found in vertebrate introns. However, larger introns have greater pyrimidine-richness in the region between 11 and 21 nucleotides upstream of 3' splice sites. The Drosophila branchpoint consensus matrix resembles C T A A T (in which branch formation occurs at the underlined A), and differs from the corresponding mammalian signal in the absence of G at the position immediately preceding the branchpoint. The distribution of occurrences of this sequence suggests a minimum distance between 5' splice sites and branchpoints of about 38 nucleotides, and a minimum distance between 3' splice sites and branchpoints of 15 nucleotides. The methods we have used detect no information in exon sequences other than in the few nucleotides immediately adjacent to the splice sites. However, Drosophila resembles many other species in that there is a discontinuity in A + T content between exons and introns, which are A + T rich.
    BibTeX:
    @article{
      author = {S. M. Mount and C. Burks and G. Hertz and G. D. Stormo and O. White and C. Fields},
      title = {Splicing signals in Drosophila: intron size, information content, and consensus sequences.},
      journal = {Nucleic Acids Res},
      year = {1992},
      volume = {20},
      number = {16},
      pages = {4255--4262},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/1508718}
    }
    					
    Ringquist, S.; Shinedling, S.; Barrick, D.; Green, L.; Binkley, J.; Stormo, G.D. & Gold, L. Translation initiation in Escherichia coli: sequences within the ribosome-binding site. 1992 Mol Microbiol
    Vol. 6 (9) , pp. 1219-1229  
    article base sequence; binding sites; codon; data interpretation, statistical; escherichia coli; kinetics; lac operon; molecular sequence data; plasmids; protein biosynthesis; rna, bacterial; rna, messenger; ribosomes; beta-galactosidase
    Abstract: The translational roles of the Shine-Dalgarno sequence, the initiation codon, the space between them, and the second codon have been studied. The Shine-Dalgarno sequence UAAGGAGG initiated translation roughly four times more efficiently than did the shorter AAGGA sequence. Each Shine-Dalgarno sequence required a minimum distance to the initiation codon in order to drive translation; spacing, however, could be rather long. Initiation at AUG was more efficient than at GUG or UUG at each spacing examined; initiation at GUG was only slightly better than UUG. Translation was also affected by residues 3' to the initiation codon. The second codon can influence the rate of initiation, with the magnitude depending on the initiation codon. The data are consistent with a simple kinetic model in which a variety of rate constants contribute to the process of translation initiation.
    BibTeX:
    @article{
      author = {S. Ringquist and S. Shinedling and D. Barrick and L. Green and J. Binkley and G. D. Stormo and L. Gold},
      title = {Translation initiation in Escherichia coli: sequences within the ribosome-binding site.},
      journal = {Mol Microbiol},
      year = {1992},
      volume = {6},
      number = {9},
      pages = {1219--1229},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/1375310}
    }
    					
    Arvidson, D.N.; Youderian, P.; Schneider, T.D. & Stormo, G.D. Automated kinetic assay of beta-galactosidase activity. 1991 Biotechniques
    Vol. 11 (6) , pp. 733-4, 736, 738  
    article automation; escherichia coli; kinetics; spectrophotometry, ultraviolet; beta-galactosidase
    Abstract: An automated kinetic assay for beta-galactosidase activity in Escherichia coli was developed to permit the measurement of many independent samples simultaneously. Bacteria are grown, lysed from without (by adsorption of a high multiplicity of bacteriophage T4) and assayed in microtiter plates with 96 wells. Absorbance data are collected and analyzed by computer. The growth and lysis procedure, apparatus and software used in this assay can be used for other spectrophotometric enzyme assays.
    BibTeX:
    @article{
      author = {D. N. Arvidson and P. Youderian and T. D. Schneider and G. D. Stormo},
      title = {Automated kinetic assay of beta-galactosidase activity.},
      journal = {Biotechniques},
      year = {1991},
      volume = {11},
      number = {6},
      pages = {733--4, 736, 738},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/1809325}
    }
    					
    Stormo, G.D. & Yoshioka, M. Specificity of the Mnt protein determined by binding to randomized operators. 1991 Proc Natl Acad Sci U S A
    Vol. 88 (13) , pp. 5699-5703  
    article bacteriophages; base sequence; dna mutational analysis; dna-binding proteins; molecular sequence data; operator regions (genetics); protein binding; repressor proteins; structure-activity relationship; thermodynamics; viral proteins
    Abstract: The relative binding affinities of Mnt protein from bacteriophage P22 are determined for each possible base pair at position 17 of the operator. These are determined from the partitioning of randomized operators into bound and unbound fractions; quantitation is provided by restriction enzyme analysis. Mnt protein is found to have an unusual specificity at this position: a C.G base pair (the wild-type operator) has the highest affinity, a G.C base pair has the lowest affinity, and both orientations of A.T base pairs are intermediate and nearly equivalent. A specific binding constant and specific binding free energy are defined and shown to be directly related to the information content of the operator sequences bound to the protein, taking into account the quantitative differences in binding affinities.
    BibTeX:
    @article{
      author = {G. D. Stormo and M. Yoshioka},
      title = {Specificity of the Mnt protein determined by binding to randomized operators.},
      journal = {Proc Natl Acad Sci U S A},
      year = {1991},
      volume = {88},
      number = {13},
      pages = {5699--5703},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/2062848}
    }
    					
    Stormo, G.D. Probing information content of DNA-binding sites. 1991 Methods Enzymol
    Vol. 208 , pp. 458-468  
    article base sequence; binding sites; dna restriction enzymes, metabolism; dna, viral, chemistry/metabolism; dna, metabolism; dna-binding proteins, metabolism; escherichia coli, genetics; molecular sequence data; promoter regions (genetics); substrate specificity; t-phages, genetics
    Abstract: An information content analysis of protein-binding sites gives a quantitative description of the specificity of the protein, independent of the mechanism of specificity. It gives useful information about the total specificity of the protein and about the individual positions within the binding sites. Information content is consistent with both thermodynamic and statistical analyses of specificity. When applied to a collection of known binding sites, the description provided may be limited by the sample size or by unknown constraints on those sites. Experimental procedures to determine the information content can give much more reliable measures. A large number of functional sites can be obtained from a much larger pool of randomized potential sites. Quantitative assays for the activity of different sites can be easily incorporated into the analysis, thereby increasing its sensitivity. Both in vitro and in vivo experiments are amenable to information content analysis.
    BibTeX:
    @article{
      author = {G. D. Stormo},
      title = {Probing information content of DNA-binding sites.},
      journal = {Methods Enzymol},
      year = {1991},
      volume = {208},
      pages = {458--468},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/1664028}
    }
    					
    Gold, L. & Stormo, G.; D.V. Goeddel (ed.) High-Level Translation Initiation ( Gene Expression Technology, Methods in Enzymology ) 1990 Gene Expression Technology, Methods in Enzymology
    Vol. 185 , pp. 89-93  
    inbook base sequence; cloning, molecular; coliphages; escherichia coli; gene expression; kinetics; molecular sequence data; peptide chain initiation, translational; promoter regions (genetics); rna, messenger; rna, viral
    BibTeX:
    @inbook{
      author = {Gold, L. and Stormo, G.D.},
      title = {Gene Expression Technology, Methods in Enzymology},
      publisher = {Academic Press},
      year = {1990},
      volume = {185},
      pages = {89--93},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/2199797}
    }
    					
    Hertz, G.Z.; Hartzell, G.W. & Stormo, G.D. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. 1990 Comput Appl Biosci
    Vol. 6 (2) , pp. 81-92  
    article algorithms; bacterial proteins; base sequence; binding sites; dna; dna, bacterial; escherichia coli; genes, bacterial; molecular sequence data; pattern recognition, automated; serine endopeptidases; software
    Abstract: We have developed a method for identifying consensus patterns in a set of unaligned DNA sequences known to bind a common protein or to have some other common biochemical function. The method is based on a matrix representation of binding site patterns. Each row of the matrix represents one of the four possible bases, each column represents one of the positions of the binding site and each element is determined by the frequency the indicated base occurs at the indicated position. The goal of the method is to find the most significant matrix--i.e. the one with the lowest probability of occurring by chance--out of all the matrices that can be formed from the set of related sequences. The reliability of the method improves with the number of sequences, while the time required increases only linearly with the number of sequences. To test this method, we analysed 11 DNA sequences containing promoters regulated by the Escherichia coli LexA protein. The matrices we found were consistent with the known consensus sequence, and could distinguish the generally accepted LexA binding sites from other DNA sequences.
    BibTeX:
    @article{
      author = {G. Z. Hertz and G. W. Hartzell and G. D. Stormo},
      title = {Identification of consensus patterns in unaligned DNA sequences known to be functionally related.},
      journal = {Comput Appl Biosci},
      year = {1990},
      volume = {6},
      number = {2},
      pages = {81--92},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/2193692}
    }
    					
    Ruffner, D.E.; Stormo, G.D. & Uhlenbeck, O.C. Sequence requirements of the hammerhead RNA self-cleavage reaction. 1990 Biochemistry
    Vol. 29 (47) , pp. 10695-10702  
    article base sequence; consensus sequence; escherichia coli; molecular sequence data; mutagenesis, site-directed; nucleic acid conformation; rna, bacterial; rna, catalytic; structure-activity relationship
    Abstract: A previously well-characterized hammerhead catalytic RNA consisting of a 24-nucleotide substrate and a 19-nucleotide ribozyme was used to perform an extensive mutagenesis study. The cleavage rates of 21 different substrate mutations and 24 different ribozyme mutations were determined. Only one of the three phylogenetically conserved base pairs but all nine of the conserved single-stranded residues in the central core are needed for self cleavage. In most cases the mutations did not alter the ability of the hammerhead to assemble into a bimolecular complex. In the few cases where mutant hammerheads did not assemble, it appeared to be the result of the mutation stabilizing an alternate substrate or ribozyme secondary structure. All combinations of mutant substrate and mutant ribozyme were less active than the corresponding single mutations, suggesting that the hammerhead contains few, if any, replaceable tertiary interactions as are found in tRNA. The refined consensus hammerhead resulting from this work was used to identify potential hammerheads present in a variety of Escherichia coli gene sequences.
    BibTeX:
    @article{
      author = {D. E. Ruffner and G. D. Stormo and O. C. Uhlenbeck},
      title = {Sequence requirements of the hammerhead RNA self-cleavage reaction.},
      journal = {Biochemistry},
      year = {1990},
      volume = {29},
      number = {47},
      pages = {10695--10702},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/1703005}
    }
    					
    Stormo, G.D.; R.F. Doolittle (ed.) Consensus Patterns in DNA ( Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences ) 1990 Methods Enzymol Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences
    Vol. 183 , pp. 211-221  
    article base sequence; dna; dna, bacterial; escherichia coli; information systems; mathematics; promoter regions (genetics); research; sequence homology, nucleic acid
    Abstract: Matrices can provide realistic representations of protein/DNA specificity. In many cases simple mononucleotide-based matrices are adequate representations, but more complex matrices may be needed for other cases. Unlike simple consensus sequences, matrices allow for different penalties to be assessed for different changes to a binding site, a property that is essential for accurate description of a binding site pattern. When only a collection of binding site sequences is known, the best representation for the pattern is an information content formulation, based on both thermodynamic and statistical considerations. Quantitative data on relative binding affinities may be used to determine matrices that provide a best fit to the data. Matrix representations also provide an efficient method of aligning multiple sequences to identify binding site patterns that they have in common.
    BibTeX:
    @article{
      author = {G. D. Stormo},
      title = {Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences},
      journal = {Methods Enzymol},
      publisher = {Academic Press},
      year = {1990},
      volume = {183},
      pages = {211--221},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/1664028}
    }
    					
    Stormo, G.; R.H. Sarma & M.H. Sarma (eds.) Identifying Regulatory Sites from DNA Sequence Data ( Structure & Methods Vol 1: Human Genome Initiative & DNA Recombination ) 1990 Structure & Methods Vol 1: Human Genome Initiative & DNA Recombination , pp. 103-111   inbook
    BibTeX:
    @inbook{
      author = {Stormo, G.D.},
      title = {Structure \& Methods Vol 1: Human Genome Initiative \& DNA Recombination},
      publisher = {Adenine Press, Inc.},
      year = {1990},
      pages = {103--111}
    }
    					
    Schneider, T.D. & Stormo, G.D. Excess information at bacteriophage T7 genomic promoters detected by a random cloning technique. 1989 Nucleic Acids Res
    Vol. 17 (2) , pp. 659-674  
    article base composition; base sequence; cloning, molecular; dna, viral; genes, viral; mathematical computing; models, genetic; molecular sequence data; promoter regions (genetics); t-phages
    Abstract: In our previous analysis of the information at binding sites on nucleic acids, we found that most of the sites examined contain the amount of information expected from their frequency in the genome. The sequences at bacteriophage T7 promoters are an exception, because they are far more conserved (35 bits of information content) than should be necessary to distinguish them from the background of the Escherichia coli genome (17 bits). To determine the information actually used by the T7 RNA polymerase, promoters were chemically synthesized with many variations and those that function well in an in vivo assay were sequenced. Our analysis shows that the polymerase uses 18 bits of information, so the sequences at phage genomic promoters have significantly more information than the polymerase needs. The excess may represent the binding site of another protein.
    BibTeX:
    @article{
      author = {T. D. Schneider and G. D. Stormo},
      title = {Excess information at bacteriophage T7 genomic promoters detected by a random cloning technique.},
      journal = {Nucleic Acids Res},
      year = {1989},
      volume = {17},
      number = {2},
      pages = {659--674},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/2915926}
    }
    					
    Stormo, G.D. & Hartzell, G.W. Identifying protein-binding sites from unaligned DNA fragments. 1989 Proc Natl Acad Sci U S A
    Vol. 86 (4) , pp. 1183-1187  
    article algorithms; base sequence; binding sites; carrier proteins; cyclic amp receptor protein; dna; information systems; models, theoretical; molecular sequence data; neoplasm proteins; proteins
    Abstract: The ability to determine important features within DNA sequences from the sequences alone is becoming essential as large-scale sequencing projects are being undertaken. We present a method that can be applied to the problem of identifying the recognition pattern for a DNA-binding protein given only a collection of sequenced DNA fragments, each known to contain somewhere within it a binding site for that protein. Information about the position or orientation of the binding sites within those fragments is not needed. The method compares the "information content" of a large number of possible binding site alignments to arrive at a matrix representation of the binding site pattern. The specificity of the protein is represented as a matrix, rather than a consensus sequence, allowing patterns that are typical of regulatory protein-binding sites to be identified. The reliability of the method improves as the number of sequences increases, but the time required increases only linearly with the number of sequences. An example, using known cAMP receptor protein-binding sites, illustrates the method.
    BibTeX:
    @article{
      author = {G. D. Stormo and G. W. Hartzell},
      title = {Identifying protein-binding sites from unaligned DNA fragments.},
      journal = {Proc Natl Acad Sci U S A},
      year = {1989},
      volume = {86},
      number = {4},
      pages = {1183--1187},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/2919167}
    }
    					
    McPheeters, D.S.; Stormo, G.D. & Gold, L. Autogenous regulatory site on the bacteriophage T4 gene 32 messenger RNA. 1988 J Mol Biol
    Vol. 201 (3) , pp. 517-535  
    article base sequence; binding sites; gene expression regulation; genes, viral; molecular sequence data; nucleic acid conformation; operator regions (genetics); rna, messenger; rna, viral; t-phages; viral proteins
    Abstract: We have identified the binding site on the bacteriophage T4 gene 32 mRNA responsible for autogenous translational regulation. We demonstrate that this site is largely unstructured and overlaps the initiation codon of gene 32 as previously predicted. Co-operative binding of gene 32 protein to this site specifically blocks the formation of 30 S-tRNA(fMet)-gene 32 mRNA ternary complexes and initiation of translation. The translational operator is bound co-operatively by gene 32 protein and this binding is facilitated by a nucleation site far upstream from the initiation codon. A similar unstructured mRNA lacking this nucleation site is also bound co-operatively, but only at concentrations of gene 32 protein higher than those needed to repress binding of ribosomes to the gene 32 mRNA. Some sequence-specific interactions may also influence this binding. Comparison of the bacteriophage T2, T4 and T6 gene 32 operator sequences leads us to propose that the nucleation site is a pseudoknot.
    BibTeX:
    @article{
      author = {D. S. McPheeters and G. D. Stormo and L. Gold},
      title = {Autogenous regulatory site on the bacteriophage T4 gene 32 messenger RNA.},
      journal = {J Mol Biol},
      year = {1988},
      volume = {201},
      number = {3},
      pages = {517--535},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/3262167},
      doi = {10.1016/0022-2836(88)90634-1}
    }
    					
    Stormo, G.D. Computer methods for analyzing sequence recognition of nucleic acids. 1988 Annu Rev Biophys Biophys Chem
    Vol. 17 , pp. 241-263  
    article base sequence; computers; dna
    BibTeX:
    @article{
      author = {G. D. Stormo},
      title = {Computer methods for analyzing sequence recognition of nucleic acids.},
      journal = {Annu Rev Biophys Biophys Chem},
      year = {1988},
      volume = {17},
      pages = {241--263},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/3293587}
    }
    					
    Tuerk, C.; Gauss, P.; Thermes, C.; Groebe, D.R.; Gayle, M.; Guild, N.; Stormo, G.; d'Aubenton-Carafa, Y.; Uhlenbeck, O.C. & Tinoco, I. CUUCGG hairpins: extraordinarily stable RNA secondary structures associated with various biochemical processes. 1988 Proc Natl Acad Sci U S A
    Vol. 85 (5) , pp. 1364-1368  
    article base sequence; nucleic acid conformation; rna; rna, messenger; rna, viral; rna-directed dna polymerase; t-phages; templates, genetic; thermodynamics; transcription, genetic
    Abstract: The mRNA of bacteriophage T4 contains a strikingly abundant intercistronic hairpin. Within the 55 kilobases of known T4 sequence, the hexanucleotide sequence CTTCGG is found 13 times in the DNA strand equivalent to mRNA sequences. In 12 of those occurrences, the sequence is flanked by inverted repeats predictive of RNA hairpins with UUCG in the loop. Avian myeloblastosis virus reverse transcriptase, which can traverse hairpins of larger calculated stability, terminates efficiently at these CUUCGG hairpins. Thermal denaturation studies of model hairpins show that the loop sequence UUCG dramatically stabilizes RNA hairpins when compared to a control sequence. These data, when combined with previously described parameters of helix stability, suggest that T4 has utilized this loop sequence to optimize the stability of intercistronic hairpins. The stability of CUUCGG hairpins is also utilized in the RNAs of many organisms besides T4.
    BibTeX:
    @article{
      author = {C. Tuerk and P. Gauss and C. Thermes and D. R. Groebe and M. Gayle and N. Guild and G. Stormo and Y. d'Aubenton-Carafa and O. C. Uhlenbeck and I. Tinoco},
      title = {CUUCGG hairpins: extraordinarily stable RNA secondary structures associated with various biochemical processes.},
      journal = {Proc Natl Acad Sci U S A},
      year = {1988},
      volume = {85},
      number = {5},
      pages = {1364--1368},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/2449689}
    }
    					
    Gold, L. & Stormo, G.; F.C. Neidhardt, J.L. Ingraham, K.L.B.M.M.S. & H.E. Umbarger (eds.) Translational initiation ( Escherichia coli and Salmonella typhimurium: Cellular and Molecular Biology ) 1987 Escherichia coli and Salmonella typhimurium: Cellular and Molecular Biology
    Vol. 2 , pp. 1302-1307  
    inbook
    BibTeX:
    @inbook{
      author = {Gold, L. and Stormo, G.},
      title = {Escherichia coli and Salmonella typhimurium: Cellular and Molecular Biology},
      publisher = {American Society for Microbiology, Washington, D.C.},
      year = {1987},
      volume = {2},
      pages = {1302--1307}
    }
    					
    Romaniuk, P.J.; Lowary, P.; Wu, H.N.; Stormo, G. & Uhlenbeck, O.C. RNA binding site of R17 coat protein. 1987 Biochemistry
    Vol. 26 (6) , pp. 1563-1568  
    article base composition; binding sites; capsid; capsid proteins; coliphages; escherichia coli; indicators and reagents; kinetics; oligoribonucleotides; protein binding; rna, viral; rna-binding proteins; structure-activity relationship
    Abstract: The specific interaction between R17 coat protein and its target of translational repression at the initiation site of the R17 replicase gene was studied by synthesizing variants of the RNA binding site and measuring their affinity to the coat protein by using a nitrocellulose filter binding assay. Substitution of two of the seven single-stranded residues by other nucleotides greatly reduced the Ka, indicating that they are essential for the RNA-protein interaction. In contrast, three other single-stranded residues can be substituted without altering the Ka. When several of the base-paired residues in the binding site are altered in such a way that pairing is maintained, little change in Ka is observed. However, when the base pairs are disrupted, coat protein does not bind. These data suggest that while the hairpin loop structure is essential for protein binding, the base-paired residues do not contact the protein directly. On the basis of these and previous data, a model for the structural requirements of the R17 coat protein binding site is proposed. The model was successfully tested by demonstrating that oligomers with sequences quite different from the replicase initiator were able to bind coat protein.
    BibTeX:
    @article{
      author = {P. J. Romaniuk and P. Lowary and H. N. Wu and G. Stormo and O. C. Uhlenbeck},
      title = {RNA binding site of R17 coat protein.},
      journal = {Biochemistry},
      year = {1987},
      volume = {26},
      number = {6},
      pages = {1563--1568},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/3297131}
    }
    					
    Stormo, G. Bishop, M. & C.J. Rawlings (Eds.) Identifying coding sequences ( Nucleic Acid and Protein Sequence Analysis, A Practical Approach ) 1987 Nucleic Acid and Protein Sequence Analysis, A Practical Approach , pp. 231-258   inbook
    BibTeX:
    @inbook{
      author = {Stormo, G.D.},
      title = {Nucleic Acid and Protein Sequence Analysis, A Practical Approach},
      publisher = {IRL Press, Oxford},
      year = {1987},
      pages = {231--258}
    }
    					
    Stormo, G. J. Ilan (Ed.) Translational regulation in bacteriophage ( Translational Regulation of Gene Expression ) 1987 Translational Regulation of Gene Expression , pp. 27-49   inbook
    BibTeX:
    @inbook{
      author = {Stormo, G.D.},
      title = {Translational Regulation of Gene Expression},
      publisher = {Plenum Publishing, New York},
      year = {1987},
      pages = {27--49}
    }
    					
    Clift, B.; Haussler, D.; McConnell, R.; Schneider, T.D. & Stormo, G.D. Sequence landscapes. 1986 Nucleic Acids Res
    Vol. 14 (1) , pp. 141-158  
    article dna, viral; information systems; repetitive sequences, nucleic acid; software; t-phages; time factors
    Abstract: We describe a method for representing the structure of repeating sequences in nucleic-acids, proteins and other texts. A portion of the sequence is presented at the bottom of a CRT screen. Above the sequence is its landscape, which looks like a mountain range. Each mountain corresponds to a subsequence of the sequence. At the peak of every mountain is written the number of times that the subsequence appears. A data structure called a DAWG, which can be built in time proportional to the length of the sequence, is used to construct the landscape. For the 40 thousand bases of bacteriophage T7, the DAWG can be built in 30 seconds. The time to display any portion of the landscape is less than a second. Using sequence landscapes, one can quickly locate significant repeats.
    BibTeX:
    @article{
      author = {B. Clift and D. Haussler and R. McConnell and T. D. Schneider and G. D. Stormo},
      title = {Sequence landscapes.},
      journal = {Nucleic Acids Res},
      year = {1986},
      volume = {14},
      number = {1},
      pages = {141--158},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/3753762}
    }
    					
    McPheeters, D.S.; Christensen, A.; Young, E.T.; Stormo, G. & Gold, L. Translational regulation of expression of the bacteriophage T4 lysozyme gene. 1986 Nucleic Acids Res
    Vol. 14 (14) , pp. 5813-5826  
    article amino acid sequence; base sequence; escherichia coli; genes; genes, viral; homeostasis; kinetics; muramidase; nucleic acid conformation; promoter regions (genetics); protein biosynthesis; t-phages; transcription, genetic; tritium
    Abstract: The bacteriophage T4 lysozyme gene is transcribed at early and late times after infection of E. coli, but the early mRNA is not translated. DNA sequence analysis and mapping of the 5' ends of the lysozyme transcripts produced at different times after T4 infection show that the early mRNA is initiated some distance upstream from the gene. The early mRNA is not translated because of a stable secondary structure which blocks the translational initiation site. The stable RNA structure has been demonstrated by nuclease protection in vivo. After DNA replication begins, two late promoters are activated; the late transcripts are initiated at sites such that the secondary structure can not form, and translation of the late messages occurs.
    BibTeX:
    @article{
      author = {D. S. McPheeters and A. Christensen and E. T. Young and G. Stormo and L. Gold},
      title = {Translational regulation of expression of the bacteriophage T4 lysozyme gene.},
      journal = {Nucleic Acids Res},
      year = {1986},
      volume = {14},
      number = {14},
      pages = {5813--5826},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/3526285}
    }
    					
    Schneider, T.D.; Stormo, G.D.; Gold, L. & Ehrenfeucht, A. Information content of binding sites on nucleotide sequences. 1986 J Mol Biol
    Vol. 188 (3) , pp. 415-431  
    article bacterial proteins; base sequence; binding sites; dna, bacterial; dna-binding proteins; dna-directed rna polymerases; escherichia coli; lac operon; operator regions (genetics); operon; repressor proteins; ribosomes; serine endopeptidases; statistics; t-phages; tryptophan; viral proteins
    Abstract: Repressors, polymerases, ribosomes and other macromolecules bind to specific nucleic acid sequences. They can find a binding site only if the sequence has a recognizable pattern. We define a measure of the information (R sequence) in the sequence patterns at binding sites. It allows one to investigate how information is distributed across the sites and to compare one site to another. One can also calculate the amount of information (R frequency) that would be required to locate the sites, given that they occur with some frequency in the genome. Several Escherichia coli binding sites were analyzed using these two independent empirical measurements. The two amounts of information are similar for most of the sites we analyzed. In contrast, bacteriophage T7 RNA polymerase binding sites contain about twice as much information as is necessary for recognition by the T7 polymerase, suggesting that a second protein may bind at T7 promoters. The extra information can be accounted for by a strong symmetry element found at the T7 promoters. This element may be an operator. If this model is correct, these promoters and operators do not share much information. The comparisons between R sequence and R frequency suggest that the information at binding sites is just sufficient for the sites to be distinguished from the rest of the genome.
    BibTeX:
    @article{
      author = {T. D. Schneider and G. D. Stormo and L. Gold and A. Ehrenfeucht},
      title = {Information content of binding sites on nucleotide sequences.},
      journal = {J Mol Biol},
      year = {1986},
      volume = {188},
      number = {3},
      pages = {415--431},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/3525846},
      doi = {http://dx.doi.org/10.1016/0022-2836(86)90165-8}
    }
    					
    Stormo, G.D.; Schneider, T.D. & Gold, L. Quantitative analysis of the relationship between nucleotide sequence and functional activity. 1986 Nucleic Acids Res
    Vol. 14 (16) , pp. 6661-6679  
    article base sequence; codon; dna; genes; mutation; peptide chain initiation, translational; protein biosynthesis; structure-activity relationship; beta-galactosidase
    Abstract: Matrices can be used to evaluate sequences for functional activity. Multiple regression can solve for the matrix that gives the best fit between sequence evaluations and quantitative activities. This analysis shows that the best model for context effects on suppression by su2 involves primarily the two nucleotides 3' to the amber codon, and that their contributions are independent and additive. Context effects on 2AP mutagenesis also involve the two nucleotides 3' to the 2AP insertion, but their effects are not independent. In a construct for producing beta-galactosidase, the effects on translational yields of the tri-nucleotide 5' to the initiation codon are dependent on the entire triplet. Models based on these quantitative results are presented for each of the examples.
    BibTeX:
    @article{
      author = {G. D. Stormo and T. D. Schneider and L. Gold},
      title = {Quantitative analysis of the relationship between nucleotide sequence and functional activity.},
      journal = {Nucleic Acids Res},
      year = {1986},
      volume = {14},
      number = {16},
      pages = {6661--6679},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/3092188}
    }
    					
    Stormo, G. Gold, L. & W. Reznikoff (Eds.) Translational Initiation ( Maximizing Gene Expression ) 1986 Maximizing Gene Expression , pp. 195-224   inbook
    BibTeX:
    @inbook{
      author = {Stormo, G.D.},
      title = {Maximizing Gene Expression},
      publisher = {Butterworth Publishers},
      year = {1986},
      pages = {195--224}
    }
    					
    Childs, J.; Villanueba, K.; Barrick, D.; Schneider, T.D.; Stormo, G.; Gold, L.; Leitner, M. & Caruthers, M. Calendar, R. & Larry Gold (Eds.) Ribosome binding site sequences and function ( Sequence Specificity in Transcription and Translation ) 1985 Sequence Specificity in Transcription and Translation , pp. 341-350   inbook
    BibTeX:
    @inbook{
      author = {Childs, J. and Villanueba, K. and Barrick, D. and Schneider, T.D. and Stormo, G. and Gold, L. and Leitner, M. and Caruthers, M.},
      title = {Sequence Specificity in Transcription and Translation},
      publisher = {Alan R. Liss, Inc., New York},
      year = {1985},
      pages = {341--350}
    }
    					
    Gold, L.; Stormo, G. & Saunders, R. M. Schaecter, F.C. Neidhardt, J.I.N.K. (Eds.) Regulation of IF3 expression in E. coli ( Molecular Biology of Bacterial Growth ) 1985 Molecular Biology of Bacterial Growth , pp. 62-77   inbook
    BibTeX:
    @inbook{
      author = {Gold, L. and Stormo, G. and Saunders, R.},
      title = {Molecular Biology of Bacterial Growth},
      publisher = {Jones and Bartlett Publishers, Inc.},
      year = {1985},
      pages = {62--77},
      url = {http://www.jstor.org/stable/24949}
    }
    					
    Gold, L.; Stormo, G. & Saunders, R. Escherichia coli translational initiation factor IF3: a unique case of translational regulation. 1984 Proc Natl Acad Sci U S A
    Vol. 81 (22) , pp. 7061-7065  
    article escherichia coli; evolution; gene expression regulation; nucleic acid conformation; peptide initiation factors; protein biosynthesis; rna, messenger; rna, ribosomal; ribosomes
    Abstract: The Escherichia coli translational initiation factor IF3 is encoded by an mRNA that has an unusual ribosome binding site. We have explored a mechanism that may account for the translation of IF3 and that provides regulation of the quantity of IF3 relative to ribosomes.
    BibTeX:
    @article{
      author = {L. Gold and G. Stormo and R. Saunders},
      title = {Escherichia coli translational initiation factor IF3: a unique case of translational regulation.},
      journal = {Proc Natl Acad Sci U S A},
      year = {1984},
      volume = {81},
      number = {22},
      pages = {7061--7065},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/6390429}
    }
    					
    Gold, L.; Inman, M.; Miller, E.; Pribnow, D.; Schneider, T.; Shinedling, S. & Stormo, G. Clark, B. & H.U. Petersen (Eds.) Translational regulation during bacteriophage T4 development ( Gene Expression ) 1984 Gene Expression , pp. 379-394   inbook
    BibTeX:
    @inbook{
      author = {Gold, L. and Inman, M. and Miller, E. and Pribnow, D. and Schneider, T. and Shinedling, S. and Stormo, G.},
      title = {Gene Expression},
      publisher = {Munksgaard, Copenhagen},
      year = {1984},
      pages = {379--394}
    }

    Munson, L.M.; Stormo, G.D.; Niece, R.L. & Reznikoff, W.S. lacZ translation initiation mutations. 1984 J Mol Biol
    Vol. 177 (4) , pp. 663-683  
    article base sequence; binding sites; codon; genes, bacterial; lac operon; mutation; nucleic acid conformation; protein biosynthesis; rna, messenger; ribosomes; beta-galactosidase
    Abstract: Sixteen single point mutations near the beginning of the lacZ gene have been isolated and their effect on lacZ expression has been measured. Five mutations were obtained that alter a potential stem-and-loop structure in the messenger RNA that masks the initiation codons. Formation of this stem-and-loop is a result of transcription of DNA sequences introduced during the cloning of the lac regulatory region. The mutations isolated were then moved into a background that deleted this structure. Analysis of these mutations indicated that the secondary structure inhibited lacZ expression 5.8-fold and that either single point mutations or a 9 base-pair deletion could relieve this inhibition completely. In addition, it was found that an A to C transversion in the first base following the initiation codon (in the absence of the inhibitory secondary structure) decreases lacZ expression almost twofold, whereas C to U transitions in the next two positions have negligible effects. Mutations were also obtained that either increase or decrease the length of the Shine-Dalgarno sequence. The effects of these mutations were studied in the presence or absence of the secondary structure that involves the two initiation codons. It was found that when translation initiation was inhibited by the secondary structure, increasing the length of the Shine-Dalgarno sequence increased lacZ expression 2.8-fold and decreasing the length of this sequence reduced lacZ expression 12-fold. When translation initiation was not inhibited by the secondary structure, increasing the length of the Shine-Dalgarno sequence had no effect and decreasing the length of this sequence only reduced lacZ expression sixfold. The mechanistic implications of these results are discussed. Two initiation codons are located in the beginning of the lacZ gene, 7 and 13 bases from the Shine-Dalgarno sequence. NH2-terminal sequence analysis indicated that the majority of the protein synthesized initiate at the first initiation codon in the wild-type lacZ gene (in agreement with results reported previously by J. L. Brown and his colleagues). Upon introduction of sequences that result in a change in the mRNA secondary structure, both initiation codons are used in almost equal amounts. Three mutations and two pseudorevertants were obtained, which are located in the first initiation codon. It was found that when the first initiation codon is changed from AUG to GUG, translation initiation is decreased tenfold at that codon.(ABSTRACT TRUNCATED AT 400 WORDS)
    BibTeX:
    @article{
      author = {L. M. Munson and G. D. Stormo and R. L. Niece and W. S. Reznikoff},
      title = {lacZ translation initiation mutations.},
      journal = {J Mol Biol},
      year = {1984},
      volume = {177},
      number = {4},
      pages = {663--683},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/6434747},
      doi = {http://dx.doi.org/10.1016/0022-2836(84)90043-3}
    }
    					
    Schneider, T.D.; Stormo, G.D.; Yarus, M.A. & Gold, L. Delila system tools. 1984 Nucleic Acids Res
    Vol. 12 (1 Pt 1) , pp. 129-140  
    article base sequence; information systems; nucleic acids
    Abstract: We introduce three new computer programs and associated tools of the Delila nucleic-acid sequence analysis system. The first program, Module, allows rapid transportation of new sequence analysis tools between scientists using different computers. The second program, DBpull, allows efficient access to the large nucleic-acid sequence databases being collected in the United States and Europe. The third program, Encode, provides a flexible way to process sequence data for analysis by other programs.
    BibTeX:
    @article{
      author = {T. D. Schneider and G. D. Stormo and M. A. Yarus and L. Gold},
      title = {Delila system tools.},
      journal = {Nucleic Acids Res},
      year = {1984},
      volume = {12},
      number = {1 Pt 1},
      pages = {129--140},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/6694897}
    }
    					
    Trojanowska, M.; Miller, E.S.; Karam, J.; Stormo, G. & Gold, L. The bacteriophage T4 regA gene: primary sequence of a translational repressor. 1984 Nucleic Acids Res
    Vol. 12 (15) , pp. 5979-5993  
    article amino acid sequence; base sequence; gene expression regulation; genes, viral; nucleic acid conformation; protein biosynthesis; repressor proteins; t-phages; transcription factors; viral proteins
    Abstract: The regA gene product of bacteriophage T4 is an autogenously controlled translational regulatory protein that plays a role in differential inhibition (translational repression) of a subpopulation of T4-encoded "early" mRNA species. The structural gene for this polypeptide maps within a cluster of phage DNA replication genes, (genes 45-44-62-regA-43-42), all but one of which (gene 43) are under regA-mediated translational control. We have cloned the T4 regA gene, determined its nucleotide sequence, and identified the amino-terminal residues of a plasmid-encoded, hyperproduced regA protein. The results suggest that the T4 regA gene product is a 122 amino acid polypeptide that is mildly basic and hydrophilic in character; these features are consistent with known properties of regA protein derived from T4-infected cells. Computer-assisted analyses of the nucleotide sequences of the regA gene and its three upstream neighbors (genes 45, 44, and 62) suggest the existence of three translational initiation units in this four-gene cluster; one for gene 45, one for genes 44, 62 and regA, and one that serves only the regA gene. The analyses also suggest that the gene 44-62 translational unit harbors a stable RNA structure that obligates translational coupling of these two genes.
    BibTeX:
    @article{
      author = {M. Trojanowska and E. S. Miller and J. Karam and G. Stormo and L. Gold},
      title = {The bacteriophage T4 regA gene: primary sequence of a translational repressor.},
      journal = {Nucleic Acids Res},
      year = {1984},
      volume = {12},
      number = {15},
      pages = {5979--5993},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/6473098}
    }
    					
    Campbell, K.; Stormo, G. & Gold, L. J. Beckwith, J.D. & J. Gallant (Eds.) Protein-mediated translational repression ( Gene Function in Prokaryotes ) 1983 Gene Function in Prokaryotes , pp. 185-210   inbook
    BibTeX:
    @inbook{
      author = {Campbell, K. and Stormo, G. and Gold, L.},
      title = {Gene Function in Prokaryotes},
      publisher = {Cold Spring Harbor Laboratory, New York},
      year = {1983},
      pages = {185--210}
    }
    					
    von Hippel, P.H.; Kowalczykowski, S.C.; Lonberg, N.; Newport, J.W.; Paul, L.S.; Stormo, G.D. & Gold, L. B. Kutter, C. Mathews, P.B. & G. Mosig (Eds.) Autoregulation of expression of gene 32 of bacteriophage T4: A quantitative analysis ( The Bacteriophage T4 ) 1983 The Bacteriophage T4 , pp. 202-207   inbook
    BibTeX:
    @inbook{
      author = {von Hippel, P.H. and Kowalczykowski, S.C. and Lonberg, N. and Newport, J.W. and Paul, L.S. and Stormo, G.D. and Gold, L.},
      title = {The Bacteriophage T4},
      publisher = {American Society of Microbiology, Washington, DC},
      year = {1983},
      pages = {202--207}
    }

    von Hippel, P.H.; Kowalczykowski, S.C.; Lonberg, N.; Newport, J.W.; Paul, L.S.; Stormo, G.D. & Gold, L. Autoregulation of gene expression. Quantitative evaluation of the expression and function of the bacteriophage T4 gene 32 (single-stranded DNA binding) protein system. 1982 J Mol Biol
    Vol. 162 (4) , pp. 795-818  
    article base sequence; dna, single-stranded; dna, viral; gene expression regulation; genes, regulator; models, genetic; operon; protein binding; rna, messenger; rna, viral; t-phages; viral proteins
    BibTeX:
    @article{
      author = {P. H. von Hippel and S. C. Kowalczykowski and N. Lonberg and J. W. Newport and L. S. Paul and G. D. Stormo and L. Gold},
      title = {Autoregulation of gene expression. Quantitative evaluation of the expression and function of the bacteriophage T4 gene 32 (single-stranded DNA binding) protein system.},
      journal = {J Mol Biol},
      year = {1982},
      volume = {162},
      number = {4},
      pages = {795--818},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/6984860},
      doi = {http://dx.doi.org/10.1016/0022-2836(82)90548-4}
    }
    					
    Schneider, T.D.; Stormo, G.D.; Haemer, J.S. & Gold, L. A design for computer nucleic-acid-sequence storage, retrieval, and manipulation. 1982 Nucleic Acids Res
    Vol. 10 (9) , pp. 3013-3024  
    article base sequence; computers; data collection; information systems
    BibTeX:
    @article{
      author = {T. D. Schneider and G. D. Stormo and J. S. Haemer and L. Gold},
      title = {A design for computer nucleic-acid-sequence storage, retrieval, and manipulation.},
      journal = {Nucleic Acids Res},
      year = {1982},
      volume = {10},
      number = {9},
      pages = {3013--3024},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/7099972}
    }
    					
    Stormo, G.D.; Schneider, T.D.; Gold, L. & Ehrenfeucht, A. Use of the 'Perceptron' algorithm to distinguish translational initiation sites in E. coli. 1982 Nucleic Acids Res
    Vol. 10 (9) , pp. 2997-3011  
    article base sequence; escherichia coli; mathematics; models, genetic; peptide chain initiation, translational; rna, messenger
    Abstract: We have used a "Perceptron" algorithm to find a weighting function which distinguishes E. coli translational initiation sites from all other sites in a library of over 78,000 nucleotides of mRNA sequence. The "Perceptron" examined sequences as linear representations. The "Perceptron" is more successful at finding gene beginnings than our previous searches using "rules" (see previous paper). We note that the weighting function can find translational initiation sites within sequences that were not included in the training set.
    BibTeX:
    @article{
      author = {G. D. Stormo and T. D. Schneider and L. Gold and A. Ehrenfeucht},
      title = {Use of the 'Perceptron' algorithm to distinguish translational initiation sites in E. coli.},
      journal = {Nucleic Acids Res},
      year = {1982},
      volume = {10},
      number = {9},
      pages = {2997--3011},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/7048259}
    }
    					
    Stormo, G.D.; Schneider, T.D. & Gold, L.M. Characterization of translational initiation sites in E. coli. 1982 Nucleic Acids Res
    Vol. 10 (9) , pp. 2971-2996  
    article base composition; base sequence; codon; escherichia coli; models, genetic; peptide chain initiation, translational; rna, messenger; rna, ribosomal; ribosomes
    Abstract: We characterize the Shine and Dalgarno sequence of 124 known gene beginnings. This information is used to make "rules" which help distinguish gene beginning from other sites in a library of over 78,000 bases of mRNA. Gene beginnings are found to have information besides the initiation codon and Shine and Dalgarno sequence which can be used to make better "rules".
    BibTeX:
    @article{
      author = {G. D. Stormo and T. D. Schneider and L. M. Gold},
      title = {Characterization of translational initiation sites in E. coli.},
      journal = {Nucleic Acids Res},
      year = {1982},
      volume = {10},
      number = {9},
      pages = {2971--2996},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/7048258}
    }
    					
    Gold, L.; Pribnow, D.; Schneider, T.; Shinedling, S.; Singer, B.S. & Stormo, G. Translational initiation in prokaryotes. 1981 Annu Rev Microbiol
    Vol. 35 , pp. 365-403  
    article bacteria; base sequence; binding sites; codon; gene expression regulation; models, genetic; mutation; nucleic acid conformation; peptide chain initiation, translational; protein biosynthesis; rna, bacterial; rna, messenger; rna, transfer; rna, transfer, amino acyl; rna, transfer, met; ribosomes; statistics
    BibTeX:
    @article{
      author = {L. Gold and D. Pribnow and T. Schneider and S. Shinedling and B. S. Singer and G. Stormo},
      title = {Translational initiation in prokaryotes.},
      journal = {Annu Rev Microbiol},
      year = {1981},
      volume = {35},
      pages = {365--403},
      url = {http://www.ncbi.nlm.nih.gov/pubmed/6170248},
      doi = {http://dx.doi.org/10.1146/annurev.mi.35.100181.002053}
    }
    					
    Gold, L.M.; O'Farrell, P.Z.; Singer, B. & Stormo, G. Fox, C. & W.S. Robinson (Eds.) Bacteriophage T4 gene expression ( Virus Research ) 1973 Virus Research , pp. 205-225   inbook
    BibTeX:
    @inbook{
      author = {Gold, L.M. and O'Farrell, P.Z. and Singer, B. and Stormo, G.},
      title = {Virus Research},
      publisher = {Academic Press},
      year = {1973},
      pages = {205--225}
    }
    					

    Created by JabRef on 14/07/2011.