Probabilistic Methods for Bioinformatics - Richard E. Neapolitan

Probabilistic Methods for Bioinformatics (eBook)

with an Introduction to Bayesian Networks

Richard E. Neapolitan (Author)

eBook Download: EPUB

2009 | 1st edition
424 pages
Elsevier Science (Publisher)
9780080919362 (ISBN)

Rather than getting bogged down in proofs and algorithms, this book explains Bayesian networks and the probabilistic methods used for biological information in an accessible way, using applications and case studies. It discusses the many useful applications of Bayesian networks that have been developed in the past 10 years, reviewing the significant work on an approach that will arguably become the most prevalent method in biological data analysis.

Unique coverage of probabilistic reasoning methods applied to bioinformatics data--those methods that are likely to become the standard analysis tools for bioinformatics.

Shares insights about when and why probabilistic methods can and cannot be used effectively.

Complete review of Bayesian networks and probabilistic methods with a practical approach.

The Bayesian network is one of the most important architectures for representing and reasoning with multivariate probability distributions. Combined with specialized informatics, it makes real-world applications possible. Probabilistic Methods for Bioinformatics explains the application of probability and statistics, in particular Bayesian networks, to genetics. The book provides background material on probability, statistics, and genetics, and then moves on to discuss Bayesian networks and applications to bioinformatics.

Front Cover
Probabilistic Methods for Bioinformatics: with an Introduction to Bayesian Networks
Copyright Page
Contents
Preface
About the Author
Part I: Background
Chapter 1. Probabilistic Informatics
1.1 What Is Informatics?
1.2 Bioinformatics
1.3 Probabilistic Informatics
1.4 Outline of This Book
Chapter 2. Probability Basics
2.1 Probability Basics
2.2 Random Variables
2.3 The Meaning of Probability
2.4 Random Variables in Applications
Chapter 3. Statistics Basics
3.1 Basic Concepts
3.2 Markov Chain Monte Carlo
3.3 The Normal Distribution
Chapter 4. Genetics Basics
4.1 Organisms and Cells
4.2 Genes
4.3 Mutations
Part II: Bayesian Networks
Chapter 5. Foundations of Bayesian Networks
5.1 What Is a Bayesian Network?
5.2 Properties of Bayesian Networks
5.3 Causal Networks as Bayesian Networks
5.4 Inference in Bayesian Networks
5.5 Networks with Continuous Variables
5.6 How Do We Obtain the Probabilities?
Chapter 6. Further Properties of Bayesian Networks
6.1 Entailed Conditional Independencies
6.2 Faithfulness
6.3 Markov Equivalence
6.4 Markov Blankets and Boundaries
Chapter 7. Learning Bayesian Network Parameters
7.1 Learning a Single Parameter
7.2 Learning Parameters in a Bayesian Network
Chapter 8. Learning Bayesian Network Structure
8.1 Model Selection
8.2 Score-Based Structure Learning
8.3 Constraint-Based Structure Learning
8.4 Causal Learning
8.5 Model Averaging
8.6 Approximate Structure Learning
8.7 Software Packages for Learning
Part III: Bioinformatics Applications
Chapter 9. Nonmolecular Evolutionary Genetics
9.1 No Mutations, Selection, or Genetic Drift
9.2 Natural Selection
9.3 Genetic Drift
9.4 Natural Selection and Genetic Drift
9.5 Rate of Substitution
Chapter 10. Molecular Evolutionary Genetics
10.1 Models of Nucleotide Substitution
10.2 Evolutionary Distance
10.3 Sequence Alignment
Chapter 11. Molecular Phylogenetics
11.1 Phylogenetic Trees
11.2 Distance Matrix Learning Methods
11.3 Maximum Likelihood Method
11.4 Distance Matrix Methods Using ML
Chapter 12. Analyzing Gene Expression Data
12.1 DNA Microarrays
12.2 A Bootstrap Approach
12.3 Model Averaging Approaches
12.4 Module Network Approach
Chapter 13. Genetic Linkage Analysis
13.1 Introduction to Genetic Linkage Analysis
13.2 Genetic Linkage Analysis in Humans
13.3 A Bayesian Network Model
Bibliography
Index

Chapter 1

Probabilistic Informatics

Informatics programs in the United States go back to at least the 1980s, when Stanford University offered a Ph.D. in medical informatics. Since that time, a number of informatics programs in other disciplines have emerged at universities throughout the United States. These programs go by various names, including bioinformatics, medical informatics, chemical informatics, music informatics, marketing informatics, and so on. What do these programs have in common? To answer that question we must articulate what we mean by the term informatics. Because other disciplines are usually referenced when we discuss informatics, some define informatics as the application of information technology in the context of another field. However, such a definition does not really tell us the focus of informatics itself. Here, we first explain what we mean by the term informatics; then we discuss why we have chosen to concentrate on the probabilistic approach; finally, we provide an outline of the material that will be covered in the rest of the book.

1.1 What Is Informatics?

In much of Western Europe, informatics has come to mean the rough translation of the English computer science, which is the discipline that studies computable processes. Certainly, there is overlap between computer science and informatics programs, but they are not the same. Informatics programs ordinarily investigate subjects such as biology and medicine, whereas computer science programs do not. So the European definition does not suffice for the way the word is currently used in the United States.

To gain insight into the meaning of informatics, let us consider the suffix -ics, which means the science, art, or study of some entity. For example, linguistics is the study of the nature of language, economics is the study of the production and distribution of goods, and photonics is the study of electromagnetic energy that has as its basic unit the photon. Given this, informatics should be the study of information. Indeed, WordNet 2.1 defines it as "the science concerned with gathering, manipulating, storing, retrieving, and classifying recorded information." To proceed from this definition, we need to define the word information. Most dictionary definitions do not help as far as giving us anything concrete; that is, they define information either as knowledge or as a collection of data, which means we are left with the situation of determining the meaning of knowledge and data. To arrive at a concrete definition of informatics, let’s define data, information, and knowledge first.

By datum we mean a character string that can be recognized as a unit. For example, the nucleotide G in the nucleotide sequence GATC is a datum, the field cancer in a record in a medical database is a datum, and the field Gone with the Wind in a movie database is a datum. Note that a single character, a word, or a group of words can be a datum, depending on the particular application. Data, then, are more than one datum. By information we mean the meaning given to data. For example, in a medical database the data Joe Smith and cancer in the same record mean that Joe Smith has cancer. By knowledge we mean dicta that enable us to infer new information from existing information. For example, suppose we have the following item of knowledge (dictum):

IF the stem of the plant is woody

AND the position is upright

AND there is one main trunk

THEN the plant is a tree.

Suppose further that you are looking at a plant in your backyard and you observe that its stem is woody, its position is upright, and it has one main trunk. Then, using the preceding knowledge item, you can deduce the new information that the plant in your backyard is a tree.
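The deduction just described can be sketched as a rule applied to a set of observations. This is only an illustration in Python: the dictionary keys and the function name `infer_plant_type` are assumptions made for this sketch, not notation from the book.

```python
# Hypothetical encoding of the observations about one plant.
observations = {
    "stem": "woody",
    "position": "upright",
    "main_trunks": 1,
}


def infer_plant_type(obs):
    """Apply the single IF-AND-AND-THEN rule; return None if it does not fire."""
    if (
        obs.get("stem") == "woody"
        and obs.get("position") == "upright"
        and obs.get("main_trunks") == 1
    ):
        return "tree"
    return None


# The rule fires for the backyard plant, yielding new information.
print(infer_plant_type(observations))
```

The observations play the role of existing information, the function body is the item of knowledge, and the returned value is the newly inferred information.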

Finally, we define informatics as the discipline that applies the methodologies of science and engineering to information. It concerns organizing data into information, learning knowledge from information, learning new information from existing information and knowledge, and making decisions based on the knowledge and information learned. We use engineering to develop the algorithms that learn knowledge from information and that learn information from information and knowledge. We use science to test the accuracy of these algorithms.

Next, we show several examples that illustrate how informatics pertains to other disciplines.

Example 1.1 (medical informatics) Suppose we have a large data file of patient records as follows:

From the information in this data file we can use the methodologies of informatics to obtain knowledge, such as "25% of people with smoking history have bronchitis" and "60% of people with lung cancer have positive chest X-rays." Then from this knowledge and the information that "Joe Smith has a smoking history and a positive chest X-ray," we can use the methodologies of informatics to obtain the new information that "there is a 5% chance Joe Smith also has lung cancer."
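A statement such as "25% of people with smoking history have bronchitis" is a conditional relative frequency, which can be computed directly from records. The sketch below shows that first step only. The records and the field names `smoker` and `bronchitis` are invented for illustration and do not reproduce the book's data file.

```python
# Toy patient records, invented for illustration only.
records = [
    {"smoker": True, "bronchitis": True},
    {"smoker": True, "bronchitis": False},
    {"smoker": True, "bronchitis": False},
    {"smoker": True, "bronchitis": False},
    {"smoker": False, "bronchitis": False},
    {"smoker": False, "bronchitis": True},
]


def conditional_frequency(rows, target, given):
    """Fraction of rows with `target` true among the rows where `given` is true."""
    matching = [r for r in rows if r[given]]
    if not matching:
        raise ValueError(f"no records with {given!r} set")
    return sum(r[target] for r in matching) / len(matching)


# One of the four toy smokers has bronchitis.
print(conditional_frequency(records, "bronchitis", "smoker"))  # 0.25
```

Going from such frequencies to a statement about one individual, like the chance that Joe Smith has lung cancer, requires a probabilistic model that combines several pieces of knowledge. This sketch does not attempt that step.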

Example 1.2 (bioinformatics) Suppose we have long homologous DNA sequences from a human, a chimpanzee, a gorilla, an orangutan, and a rhesus monkey. From this information we can use the methodologies of informatics to obtain the new information that it is most probable that the human and the chimpanzee are the most closely related of the five species.
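One simple ingredient in comparing homologous sequences is the proportion of aligned sites at which two sequences differ. The sketch below finds the closest pair under that measure. It is a deliberate simplification rather than the method the book uses, and the short sequences and the labels `species_a` to `species_d` are invented and are not real genomic data.

```python
from itertools import combinations

# Invented, pre-aligned toy sequences; NOT real genomic data.
sequences = {
    "species_a": "GATCGATTAC",
    "species_b": "GATCGATTAG",
    "species_c": "GATCCATTGG",
    "species_d": "GTTCCAATGG",
}


def proportion_different(s, t):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s, t)) / len(s)


# Pairwise distances between every two species.
distances = {
    (x, y): proportion_different(sequences[x], sequences[y])
    for x, y in combinations(sequences, 2)
}

# The pair with the smallest distance is the most similar under this measure.
closest = min(distances, key=distances.get)
print(closest, distances[closest])
```

A raw proportion of differences ignores multiple substitutions at the same site, so it says which sequences look most alike rather than which species are most probably the closest relatives.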

Example 1.3 (marketing informatics) Suppose we have a large data file of movie ratings as follows:

This means, for example, that Person 1 rated Aviator the lowest (1) and Shall We Dance the highest (5). From the information in this data file, we can develop a knowledge system that will enable us to estimate how an individual will rate a particular movie. For example, suppose Kathy Black rates Aviator as 1, Shall We Dance as 5, and Dirty Dancing as 5. The system could estimate how Kathy will rate Vanity Fair. Just by eyeballing the data in the five records shown, we see that Kathy’s ratings on the first three movies are similar to those of Persons 1, 4, and 10,000. Since they all rated Vanity Fair high, based on these five records, we would suspect Kathy would rate it high. An informatics algorithm can formalize a way to make these predictions. This task of predicting the utility of an item to a particular user based on the utilities assigned by other users is called collaborative filtering.
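The eyeballing argument above can be made explicit with a nearest-neighbor form of collaborative filtering: find the raters most similar to Kathy and average their ratings of the target movie. This is one simple approach, sketched under assumptions, and not necessarily the algorithm the book has in mind. Only Person 1's ratings of Aviator and Shall We Dance come from the text. Every other number, and the function names `distance` and `predict`, are invented for illustration.

```python
# Toy ratings on a 1-5 scale. Only Person 1's ratings of Aviator and
# Shall We Dance come from the text; all other numbers are invented.
ratings = {
    "Person 1": {"Aviator": 1, "Shall We Dance": 5, "Dirty Dancing": 5, "Vanity Fair": 4},
    "Person 2": {"Aviator": 5, "Shall We Dance": 1, "Dirty Dancing": 1, "Vanity Fair": 2},
    "Person 3": {"Aviator": 4, "Shall We Dance": 1, "Dirty Dancing": 2, "Vanity Fair": 1},
    "Person 4": {"Aviator": 2, "Shall We Dance": 5, "Dirty Dancing": 4, "Vanity Fair": 5},
    "Person 10,000": {"Aviator": 1, "Shall We Dance": 4, "Dirty Dancing": 5, "Vanity Fair": 5},
}

kathy = {"Aviator": 1, "Shall We Dance": 5, "Dirty Dancing": 5}


def distance(user, other):
    """Mean absolute rating difference over the movies both have rated."""
    shared = [m for m in user if m in other]
    if not shared:
        return float("inf")  # nothing in common, so treat as maximally dissimilar
    return sum(abs(user[m] - other[m]) for m in shared) / len(shared)


def predict(user, all_ratings, movie, k=3):
    """Average the target movie's rating over the k raters most similar to `user`."""
    candidates = [r for r in all_ratings.values() if movie in r]
    nearest = sorted(candidates, key=lambda r: distance(user, r))[:k]
    return sum(r[movie] for r in nearest) / len(nearest)


print(predict(kathy, ratings, "Vanity Fair"))
```

With these toy numbers the three nearest raters are Persons 1, 10,000, and 4, so the estimate is high, matching the informal conclusion in the example.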

1.2 Bioinformatics

This book concentrates on bioinformatics, which applies the methods of informatics to solving problems in biology using biological data sets. The problems investigated are usually at the molecular level, including sequence alignment, genome assembly, models of evolution and phylogenetic trees, analyzing gene expression data, and gene linkage analysis. Sometimes the terms bioinformatics and computational biology are used interchangeably. However, according to our definition, bioinformatics can be considered a subdiscipline of computational biology. Indeed, Wikipedia defines computational biology as follows:

Computational biology is an interdisciplinary field that applies the techniques of computer science, applied mathematics, and statistics to address problems inspired by biology. Major fields in biology that use computational techniques include:

Bioinformatics, which applies algorithms and statistical techniques to biological datasets that typically consist of large numbers of DNA, RNA, or protein sequences. Examples of specific techniques include sequence alignment, which is used for both sequence database searching and for comparison of homologous sequences; gene finding; and prediction of gene expression. (The term computational biology is sometimes used as a synonym for bioinformatics.)
Computational biomodeling, a field within biocybernetics concerned with building computational models of biological systems.
Computational genomics, a field within genomics which studies the genomes of cells and organisms by high-throughput genome sequencing that requires extensive post-processing known as genome assembly, and which uses DNA microarray technologies to perform statistical analyses on the genes expressed in individual cell types. Mathematical foundations have also been developed for sequencing.
Molecular modeling, a field dealing with theoretical methods and computational techniques to model or mimic the behavior of molecules, ranging from descriptions of a molecule of few atoms, to small chemical systems, to large biological molecules and material assemblies.
Systems biology, which aims to model large-scale biological interaction networks (also known as the interactome).
Protein structure prediction and structural genomics, which attempt to systematically produce accurate structural models for three-dimensional protein structures that have not been solved experimentally.
Computational biochemistry and biophysics, which make extensive use of structural modeling and simulation methods such as molecular dynamics and Monte Carlo method-inspired Boltzmann sampling methods in an attempt to elucidate the kinetics and thermodynamics of protein functions.

- Wikipedia

This definition of bioinformatics as a subdiscipline within computational biology is consistent with our definition. Notice that there are many other subdisciplines of...

Publication date (per publisher)	12 June 2009
Language	English
Subject areas	Mathematics / Computer Science ► Computer Science ► Theory / Studies
	Mathematics / Computer Science ► Mathematics
	Natural Sciences ► Biology
	Technology
ISBN-13	9780080919362


EPUB (Adobe DRM)