Hypothesis definition and example

Hypothesis n., plural: hypotheses [/haɪˈpɑːθəsɪs/] Definition: Testable scientific prediction


What Is a Hypothesis?

A scientific hypothesis is a foundational element of the scientific method. It is a testable statement that proposes a possible explanation for a natural phenomenon or a possible link between two variables, and it is sometimes informally described as a “little theory.” In scientific research, a hypothesis is a tentative explanation that can be proven wrong — it is falsifiable — and is used to guide experiments and empirical research.


A hypothesis is an important part of the scientific method because it provides a basis for planning experiments, gathering data, and evaluating evidence about how natural processes work. Competing hypotheses can be tested against the real world, and the results of careful, systematic observation and analysis can be used to support, reject, or refine them.

Researchers and scientists often describe a hypothesis as an educated guess: it is grounded in established scientific principles and is refined through rigorous experimentation.

For example, in astrophysics, the Big Bang model began as a working hypothesis that explains the origin of the universe as a natural phenomenon, and it remains among the most prominent scientific ideas in the field.

Watch: “The Scientific Method: Steps, Terms, and Examples” by SciShow.

Biology definition: A hypothesis is a supposition or tentative explanation for a phenomenon, a set of facts, or a scientific inquiry that may be tested, verified, or answered by further investigation or methodological experiment. It is an educated prediction that scientists make before running an experiment and then test against the outcome. A scientific hypothesis that has been repeatedly verified through experiment and research may come to be considered a scientific theory.

Etymology: from the Greek hupothesis, “a basis” or “a supposition,” combining hupo (under) and thesis (placing). Synonyms: proposition; assumption; conjecture; postulate. Compare: theory. See also: null hypothesis.

Characteristics Of A Hypothesis

A useful hypothesis must have the following qualities:

  • It should be stated as a declarative statement, never as a question.
  • It should be testable in the real world, so that it can be shown to be right or wrong.
  • It should be clear and precise.
  • It should identify the variables that will be used to examine the relationship.
  • It should address a single issue, stated either in descriptive form or as a relationship between variables.
  • It should not contradict well-established natural laws, and it should be verifiable with the tools and methods available.
  • It should be written as simply as possible so that everyone can understand it.
  • It should account for the observations that made an explanation necessary.
  • It should be testable within a reasonable amount of time.
  • It should be internally consistent, not making contradictory claims.

Sources Of A Hypothesis

Common sources of hypotheses include:

  • Patterns of similarity between the phenomenon under investigation and existing hypotheses.
  • Insights from prior research, current observations, and opposing perspectives.
  • Accepted scientific theories, from which researchers derive new formulations.
  • The requirements of the subject area: different fields call for different kinds of hypotheses, and researchers set a significance level to judge the strength of the evidence for or against a hypothesis.
  • The individual cognitive processes of the researcher.

A hypothesis is a tentative explanation for an observation or phenomenon. It is based on prior knowledge and understanding of the world, and it can be tested by gathering and analyzing data; the observed facts collected in this way can support or refute it.

For example, the hypothesis that “eating more fruits and vegetables will improve your health” can be tested by gathering data on the health of people who eat different amounts of fruits and vegetables. If the people who eat more fruits and vegetables are healthier than those who eat fewer, then the hypothesis is supported.

Hypotheses are essential for scientific inquiry. They help scientists to focus their research, to design experiments, and to interpret their results. They are also essential for the development of scientific theories.

Types Of Hypotheses

In research, you typically encounter two types of hypotheses: the alternative hypothesis (which proposes a relationship between variables) and the null hypothesis (which proposes no relationship).


Simple Hypothesis

It illustrates the association between one dependent variable and one independent variable. For instance, if you consume more vegetables, you will lose weight more quickly. Here, increasing vegetable consumption is the independent variable, while weight loss is the dependent variable.

Complex Hypothesis

It exhibits the relationship between at least two dependent variables and at least two independent variables. For example: eating more vegetables and fruits results in weight loss, more radiant skin, and a decreased risk of several diseases, including heart disease.

Directional Hypothesis

A directional hypothesis predicts not only that a relationship exists but also the direction it will take. For example: four-year-old children who eat a healthy diet over a five-year period have higher IQ scores than children who do not. The statement specifies both the effect and its direction.

Non-directional Hypothesis

A non-directional hypothesis is used when there is no strong theory to draw on. It states that a relationship exists between two variables but does not specify its nature or direction.

Null Hypothesis

The null hypothesis is the statement opposed to the research hypothesis: it asserts that there is no relationship between the independent and dependent variables. It is denoted by “H0”.
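As a toy illustration (not from the article), here is how a null hypothesis can be tested numerically. The coin-flip data below are hypothetical; the sketch computes an exact two-sided binomial p-value with only the Python standard library and compares it with the conventional 0.05 significance level.

```python
from math import comb

# Null hypothesis H0: the coin is fair (p = 0.5).
# Hypothetical data: 60 heads observed in 100 flips.
n, heads, p = 100, 60, 0.5

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Two-sided p-value: the probability, under H0, of a result at least
# as far from the expected 50 heads as the one observed.
expected = n * p
p_value = sum(binom_pmf(k, n, p) for k in range(n + 1)
              if abs(k - expected) >= abs(heads - expected))

print(round(p_value, 4))  # ~0.0569: H0 is not rejected at the 0.05 level
```

Because the p-value exceeds 0.05, this hypothetical evidence is not strong enough to reject the null hypothesis of a fair coin.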

Associative and Causal Hypothesis

An associative hypothesis states that a change in one variable is accompanied by a change in another variable; the two vary together. A causal hypothesis, by contrast, states that there is a cause-and-effect relationship between two or more variables.

Examples Of Hypotheses

Examples of simple hypotheses:

  • Students who consume breakfast before taking a math test will have a better overall performance than students who do not consume breakfast.
  • Students who experience test anxiety before an English examination will get lower scores than students who do not experience test anxiety.
  • Motorists who talk on the phone while driving will be more likely to make errors on a driving course than those who do not talk on the phone.

Examples of a complex hypothesis:

  • Individuals who consume a lot of sugar and get little exercise are at an increased risk of developing depression.
  • People who are routinely exposed to green, outdoor areas have better subjective well-being than those with limited exposure to green spaces.
  • Increased levels of air pollution lead to higher rates of respiratory illness, which in turn results in higher healthcare costs for the affected communities.

Examples of Directional Hypothesis:

  • Increasing the amount of fertilizer will significantly increase crop yield.
  • Surgical patients who are exposed to more stress will need more time to recover.
  • Increasing the frequency of brand advertising on social media will lead to a significant increase in brand awareness among the target audience.

Examples of Non-Directional Hypothesis (or Two-Tailed Hypothesis):

  • The test scores of the two groups of students differ significantly from each other.
  • There is a link between gender and being happy at work.
  • There is a correlation between the amount of caffeine an individual consumes and the speed with which they react.

Examples of a null hypothesis:

  • Children who receive the new reading intervention will have scores that are no different from those of students who do not receive the intervention.
  • The results of a memory recall test will not reveal any significant gap in performance between children and adults.
  • There is not a significant relationship between the number of hours spent playing video games and academic performance.

Examples of Associative Hypothesis:

  • There is a link between how many hours you spend studying and how well you do in school.
  • There is an association between consumption of sugary drinks and overall health.
  • There is an association between socioeconomic status and access to quality healthcare services in urban neighborhoods.

Functions Of A Hypothesis

Developing a hypothesis is crucial because it helps the researcher understand the research issue better. The following are some of the specific roles a hypothesis plays (Rashid, 2022):

  • A hypothesis gives a study a point of concentration, identifying the specific aspects of a subject that need to be investigated.
  • It tells us what data to acquire, and what data not to collect, giving the study a focal point.
  • Developing a hypothesis improves objectivity by establishing that focal point in advance.
  • A hypothesis makes it possible to contribute to the development of theory, helping to determine which claims are supported and which are not.

How Does a Hypothesis Help in the Scientific Method?

  • The scientific method begins with observing the natural world and asking questions about it. Hypotheses help researchers refine those observations and questions into specific, testable research questions, giving an investigation a focused starting point.
  • Hypotheses generate specific predictions about the expected outcomes of experiments or observations. These predictions are founded on the researcher’s current knowledge of the subject and spell out what researchers anticipate observing if the hypothesis is true.
  • Hypotheses direct the design of experiments and data-collection techniques. Researchers use them to determine which variables to measure or manipulate, which data to obtain, and how to conduct systematic and controlled research.
  • After formulating a hypothesis and designing an experiment, researchers collect data through observation, measurement, or experimentation. The collected data are used to check the hypothesis’s predictions.
  • Hypotheses establish the criteria for evaluating experimental results: the observed data are compared with the predictions the hypothesis generated. This analysis helps determine whether the empirical evidence supports or refutes the hypothesis.
  • The results of experiments or observations are used to draw conclusions about the hypothesis. If the data support the predictions, the hypothesis is supported; if not, it may be revised or rejected, leading to new questions and new hypotheses.
  • The scientific method is iterative: previous trials give rise to new hypotheses and research questions. This cycle of hypothesis generation, testing, and refinement drives scientific progress.


Importance Of Hypotheses

  • Hypotheses are testable statements that enable scientists to determine whether their predictions are accurate. This assessment is essential to the scientific method, which is based on empirical evidence.
  • Hypotheses serve as the foundation for designing experiments and data-collection techniques. Researchers use them to develop protocols and procedures that will produce meaningful results.
  • Hypotheses hold scientists accountable for their assertions. They establish expectations for what the research should reveal and enable others to assess the validity of the findings.
  • Hypotheses aid in identifying the most important variables of a study, which can then be measured, manipulated, or analyzed to determine their relationships.
  • Hypotheses help researchers allocate resources efficiently, ensuring that time, money, and effort are spent investigating specific questions rather than exploring random ideas.
  • Testing hypotheses contributes to the scientific body of knowledge. Whether or not a hypothesis is supported, the results add to our understanding of a phenomenon.
  • Hypotheses can lead to the creation of theories. When supported by substantial evidence, they can serve as the foundation for larger theoretical frameworks that explain complex phenomena.
  • Beyond scientific research, hypotheses play a role in problem solving in many domains, enabling professionals to make educated assumptions about the causes of problems and devise solutions.



Further Reading

  • RNA-DNA World Hypothesis
  • BYJU’S. (2023). Hypothesis. Retrieved 1 September 2023, from https://byjus.com/physics/hypothesis/#sources-of-hypothesis
  • Collegedunia. (2023). Hypothesis. Retrieved 1 September 2023, from https://collegedunia.com/exams/hypothesis-science-articleid-7026#d
  • Hussain, D. J. (2022). Hypothesis. Retrieved 1 September 2023, from https://mmhapu.ac.in/doc/eContent/Management/JamesHusain/Research%20Hypothesis%20-Meaning,%20Nature%20&%20Importance-Characteristics%20of%20Good%20%20Hypothesis%20Sem2.pdf
  • Media, D. (2023). Hypothesis in the Scientific Method. Retrieved 1 September 2023, from https://www.verywellmind.com/what-is-a-hypothesis-2795239#toc-hypotheses-examples
  • Rashid, M. H. A. (2022). Research Methodology. Retrieved 1 September 2023, from https://limbd.org/hypothesis-definitions-functions-characteristics-types-errors-the-process-of-testing-a-hypothesis-hypotheses-in-qualitative-research/

©BiologyOnline.com. Content provided and moderated by Biology Online Editors.

Last updated on September 8th, 2023


Biology LibreTexts

1.13: Introduction to Mendelian Genetics


  • Marjorie Hanneman, Walter Suza, Donald Lee, & Amy Kohmetscher
  • Iowa State University via Iowa State University Digital Press

Learning Objectives

  • Outline the experimental approach Mendel used to propose the idea that genes exist, control traits, and are inherited in predictable ways.
  • Compare the methods used by Mendel and Punnett to predict trait inheritance.


In plant and animal genetics research, the decisions a scientist will make are based on a high level of confidence in the predictable inheritance of the genes that control the trait being studied. This confidence comes from a past discovery by a biologist named Gregor Mendel, who explained the inheritance of trait variation using the idea of monogenic traits.

Monogenic characters are controlled by the following biological principles:

  • Living things have genes in their cells that encode the information to control a single trait. These genes are stable and passed on from cell to cell without changing.
  • The genes are in pairs in somatic cells.  When these cells divide to form gametes, the pair of genes is divided.  One gene from the pair goes into a gamete.
  • Male gametes (pollen) combine with female gametes (eggs) in the wheat flower pistil and fuse to form the next generation (zygote).  Gamete union is random.
  • The zygote, again, has two copies of each gene. As the zygote grows into a multicellular seed and the seed grows into a plant, the same two gene copies are found in every cell.

Let’s take a short genetics history lesson to understand their confidence.

[Figure 1: black-and-white photograph of Gregor Mendel]

Mendel’s Peas

In the mid-1800s, an Austrian monk named Gregor Mendel (Figure 1) decided to try to understand how inherited traits are controlled. He needed a model organism he could work with in his research facility (a small garden in the monastery) and a research plan. His plan was designed to test a hypothesis for the inheritance of trait variation.

Since Mendel could obtain different varieties of peas that differed in easy-to-observe traits such as flower color, seed color, and seed shape, and he could grow these peas in his garden, he chose peas as the model organism for his inheritance study. A model organism is easy to work with, and what you learn from it can often be applied to other organisms.

The Hypothesis

While many biologists were interested in trait inheritance, at the time Mendel conducted his experiments no one had published evidence that inheritance could be predicted. Mendel made a bold claim: his hypothesis was that he could observe “mathematical” regularities in the appearance of a trait passed from parents to offspring, and that these regularities could be used to explain the biology of inheritance.

Mendel’s experimental plan was designed to test the hypothesis. He identified true breeding lines of peas by allowing them to self-pollinate (which we will refer to as “selfing”) and examining their offspring. Pea plants have flowers that contain both male and female reproductive parts; if a pea flower is left undisturbed, the male and female gametes from the same flower will combine to produce seeds, the next generation. If the pea always made offspring like itself, Mendel had his true breeding line. He then made planned crosses between lines that differed by just one trait (monohybrid crosses). The controlled monohybrid cross was the first step in his experiment that allowed him to look for mathematical regularities in the data for three generations. Table 1 below shows the data from a series of these monohybrid cross experiments.

The Analysis

By summarizing his data in a single table, Mendel could look for those hypothesized math regularities. A regularity is a repeated observation.

Table 1. F2 results from Mendel’s monohybrid crosses.

Character         | Cross                  | F2 phenotypes
Seed shape        | Round × Wrinkled       | 5474 Round : 1850 Wrinkled
Cotyledon color   | Yellow × Green         | 6022 Yellow : 2001 Green
Seed coat color*  | Gray × White           | 705 Gray : 224 White
Pod shape         | Inflated × Constricted | 882 Inflated : 299 Constricted
Pod color         | Green × Yellow         | 428 Green : 152 Yellow
Flower position   | Axial × Terminal       | 651 Axial : 207 Terminal
Stem length       | Tall × Short           | 787 Tall : 277 Short

(In each cross the F1 all showed a single parental phenotype; for pod shape, for example, the F1 were all inflated.)

*Gray seed coat also had purple flowers; white seed coat had white flowers.

Table 1 demonstrates that Mendel was serious about the math.  He generated large numbers of offspring that allowed him to observe mathematical ratios.  From his table of data, we can see mathematical patterns appear with every monohybrid cross he made.

  • F1: All the plants had the same phenotype as one of the parents.
  • F2: Both phenotypes are present; the phenotype that was not expressed in the F1 appears again in the F2, but always as the less frequent class. The average ratio is about 3:1 for the two phenotypes.
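The roughly 3:1 regularity is easy to check from the F2 counts in Table 1. A minimal sketch (the counts come from the table; the trait names follow Mendel’s seven pea characters):

```python
# F2 counts (dominant, recessive) from Mendel's monohybrid crosses.
f2_counts = {
    "seed shape":      (5474, 1850),  # Round : Wrinkled
    "cotyledon color": (6022, 2001),  # Yellow : Green
    "seed coat color": (705, 224),    # Gray : White
    "pod shape":       (882, 299),    # Inflated : Constricted
    "pod color":       (428, 152),    # Green : Yellow
    "flower position": (651, 207),    # Axial : Terminal
    "stem length":     (787, 277),    # Tall : Short
}

# Every cross gives a dominant-to-recessive ratio close to 3 : 1.
for trait, (dominant, recessive) in f2_counts.items():
    print(f"{trait:16s} {dominant / recessive:.2f} : 1")
```

Every ratio lands between roughly 2.8:1 and 3.2:1, which is the repeated observation Mendel set out to explain.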

What was striking to Mendel was that every character in his study exhibited the same kind of mathematical pattern.  This suggested that the same fundamental processes inside the plant’s reproductive cells were at work controlling the inheritance of each trait.

Now Mendel had the task of describing the fundamental biological process controlling each of these traits. He needed to come up with ideas that no one had yet proposed.

New Idea #1:

The traits expressed in the pea plant were controlled by some kind of particle. These hereditary particles are stable and passed on intact from parent to offspring through the sex cells. (NOTE: Sex cells or gametes were not a new idea, Mendel was aware that biologists knew sexually reproducing plants and animals needed to make gametes.)  We now call these particulate factors genes and will use that term in the rest of this reading.

New Idea #2:

Genes are stable, and genes can have alternative versions (alleles).

New Idea #3:

Genes are in pairs in somatic cells and these paired genes separate during gamete formation.  Each gamete will have one gene from the pair of genes. The segregating of the paired genes from the somatic cells of the parent into gametes is random.  Because segregation is random, a parent that has two different alleles for a gene pair will make two kinds of gametes and makes these gametes at equal frequencies.

From Mendel’s ideas, we can see that in a situation in which there was a normal version of a gene (we can call it the R gene) and an alternate version (r), the plant could produce gametes with just the R gene or just the r gene.

New Idea #4:

Plant flowers are designed to allow male gametes (pollen) to combine randomly with the female gametes (egg).  When the gametes randomly come together, they bring the genes they carry to the same zygote. This means plants could have the genotype RR, Rr , or  rr  in families that have both the R and r alleles.

New Idea #5:

Mendel proposed that the genes controlling a trait not only paired in somatic cells, they also interacted in controlling the traits of the plants.  For the traits in his experiment, he proposed that one allele interacted with the other in a dominant fashion.  That means a plant that is the genotype RR would have the same phenotype as an Rr plant.  The R allele is dominant to the r allele.

Ideas and Data advance science

Those were Mendel’s new ideas; he used them to make sense of his experiment data and observations. Let’s think like Mendel and apply those ideas.

All the F1 were the same

Mendel’s new ideas could explain this observation. Since the parental lines were true breeding, he was always crossing homozygous parents. Homo means “the same”: the parents carried two copies of the same version of the gene.

Crossing RR X rr plants to produce Rr

Since R is dominant to r, the Rr offspring (the F1) have the same phenotype as the RR parent. Therefore, only one phenotype is observed in the F1. But the F1 genotype is different from either parent: it is heterozygous (two different alleles).

[Figure 2: a somatic cell with a gene pair producing gametes that each carry one gene of the pair, represented as sets of capital and lowercase letters]

The F2: both traits appear in about a 3:1 ratio

Mendel could explain the reappearance of the recessive trait and the ratio by combining the idea of genes with the idea of random segregation.  Mendel used simple algebra to explain this result.

First, he wrote out a mathematical expression to account for the gametes made in the male part of the F1 flower or in the female part.

½ R + ½ r = all the gametes made (Figure 2).

Next, he reasoned that if pollen randomly united with the egg to combine the genes in the gametes, then algebra could be used to predict the result by multiplying the gamete expressions.

(½ R + ½ r) × (½ R + ½ r) = all the F2 offspring made.

If we do the multiplication above, we get …

¼ RR + ¼ Rr + ¼ Rr + ¼ rr = ¼ RR + ½ Rr + ¼ rr = predicted fractions of F2 genotypes.
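The multiplication can be sketched directly, assuming only what the text states: the Rr parent makes R and r gametes at equal frequency, and gametes unite at random. Python’s fractions module keeps Mendel’s exact ¼/½/¼ arithmetic:

```python
from fractions import Fraction
from collections import Counter

# Each F1 (Rr) parent makes gametes R and r at equal frequency.
gametes = {"R": Fraction(1, 2), "r": Fraction(1, 2)}

# Random union of gametes: multiply the two gamete expressions.
f2 = Counter()
for g1, p1 in gametes.items():
    for g2, p2 in gametes.items():
        genotype = "".join(sorted(g1 + g2))  # "rR" and "Rr" are the same genotype
        f2[genotype] += p1 * p2

# Predicted F2 genotype fractions: RR 1/4, Rr 1/2, rr 1/4.
for genotype, fraction in sorted(f2.items()):
    print(genotype, fraction)
```

This reproduces the ¼ RR + ½ Rr + ¼ rr prediction, and because ¾ of the offspring carry at least one R, it also reproduces the 3:1 phenotype ratio.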

If this math is causing your brain to lose focus, you might be experiencing what Mendel’s contemporaries experienced when they read his published research paper.  While many biologists were motivated to understand how the variation among animals and plants was controlled and inherited, it took biologists 30 years to recognize that Mendel’s new ideas to explain inheritance of traits in peas could be applied to inheritance of traits in other living organisms.

One possible explanation for this 30-year delay in appreciation is that it was difficult for biologists to understand how math could explain biology. One biologist who did understand what Mendel was describing was Punnett. Punnett decided to convert Mendel’s algebra into a more graphic representation of the process of gamete segregation and random union.

The Punnett Square

Recall the algebraic prediction: ¼ RR + ½ Rr + ¼ rr.

Punnett designated the gametes made in the male and female parents with single letters (Figure 3). The diagram shows that when the gametes combine, the offspring (inside the squares) again have the genes in pairs in their cells. Accounting for the random union of gametes is accomplished with the four squares in the diagram: two squares give the same Rr result, one the RR genotype, and one the rr genotype. Both the algebra and the diagram provide the same prediction. Crossing an Rr with an Rr will produce three genotypes, RR, Rr, and rr, in a ratio based on the principle of segregation.

[Figure 3: Punnett square for Rr × Rr, showing offspring RR, Rr, Rr, and rr — a ½ chance of heterozygous offspring and a ¼ chance each of homozygous dominant or homozygous recessive offspring]

The genes controlling the monogenic traits behaved in predictable ways

Punnett’s diagram clarified for many biologists what Mendel was telling them in his published article. It was a challenging idea to understand because Mendel was asking biologists to use something they could not see (genes) to explain something they could see (traits in peas or some other living organism).

Because Mendel recognized he was proposing a very different idea with the segregation principle, he was likely motivated to share the most convincing evidence possible, so he conducted additional experiments. One experiment tested the hypothesis that there were two different kinds of F2 expressing the dominant trait, and that these two types were being made by the F1 in predictable fractions. How would Mendel show that F2 plants with the same phenotype did not always have the same genotype?

Mendel tested the breeding behavior of the F2. He harvested all the selfed seed produced by his F2 plants and grew progeny rows of F3. His segregation principle predicted that, among the dominant F2, there should be two heterozygotes for every homozygote (on average). The results of this experiment are summarized in Table 2. Did Mendel’s data support the hypothesis?

Average ratio of heterozygous F2 to homozygous F2 was 2.06 to 1.

The data show that, if we select a sample of F2 with the dominant trait (Round seed or Yellow cotyledon), the principle of segregation predicts 2 heterozygotes for every 1 homozygote.

Mendel’s data from rows of F3 that all came from F2 with the dominant trait supported his hypothesis. There were always two kinds of rows (true breeding and mixed), and the rows were in a 2:1 ratio. This fits the principle of segregation.
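The predicted 2:1 ratio falls straight out of the F2 genotype fractions: among plants showing the dominant phenotype (¾ of the F2), the heterozygotes (½) outnumber the homozygotes (¼) two to one. A small sketch of that conditional reasoning:

```python
from fractions import Fraction

# Predicted F2 genotype fractions from the principle of segregation.
f2 = {"RR": Fraction(1, 4), "Rr": Fraction(1, 2), "rr": Fraction(1, 4)}

# Condition on showing the dominant (Round) phenotype: RR or Rr.
dominant_total = f2["RR"] + f2["Rr"]        # 3/4 of the F2
het_given_dom = f2["Rr"] / dominant_total   # 2/3 of the dominant F2
hom_given_dom = f2["RR"] / dominant_total   # 1/3 of the dominant F2

print(het_given_dom, hom_given_dom)  # 2/3 and 1/3: a 2:1 ratio
```

Mendel’s observed 2.06:1 average is in close agreement with this predicted 2:1.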

By publishing these results in a scientific journal, Mendel allowed other scientists to learn from his work. This story reveals the real power of publishing research in the “permanent” scientific literature. The power of publication does not mean you were right with your science. The real power is that other scientists can find your paper, read it, think about your ideas, and then test them.  In Mendel’s case, he was already dead when his fellow biologists discovered that his new ideas to explain the biology of peas were not only correct, but universal in their application.

Mendel’s Dihybrid Cross Experiments

Proper credit must be given for the idea of independent assortment: Gregor Mendel was the first to put it down on paper, based on what he observed in his pea experiments. Furthermore, Mendel performed additional experiments to back up his ideas. Let’s examine his pea experiments from the mid-1800s.

The outline below describes Mendel’s dihybrid cross experiments. The pattern observed in the results should look familiar!

The Experiment

  • Parents: round, yellow seeds (RRYY) × wrinkled, green seeds (rryy).
  • F1: All round, yellow seeds (RrYy).
  • Selfing the F1 (RrYy × RrYy) produced the F2.

Mendel explained his results as follows:

The F1 plants have the genotype RrYy and can make four kinds of gametes: RY, Ry, rY, and ry.

Note that with both Mendel’s algebra and the Punnett square, the RRYY genotype occurs one time and the RrYy genotype occurs four times (Table 4). Mendel’s algebra and Punnett’s squares can be summarized to give the same results.
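The dihybrid arithmetic can be checked by enumerating all 16 equally likely gamete unions. This is a sketch for illustration, not Mendel’s own notation; the allele symbols follow the text:

```python
from itertools import product
from collections import Counter

# Each RrYy F1 parent makes four gamete types at equal frequency
# (first letter = seed-shape allele, second letter = seed-color allele).
gametes = ["RY", "Ry", "rY", "ry"]

def combine(g1: str, g2: str) -> str:
    """Pair the alleles locus by locus, e.g. RY + ry -> RrYy."""
    return "".join(sorted(g1[0] + g2[0])) + "".join(sorted(g1[1] + g2[1]))

def phenotype(genotype: str) -> str:
    shape = "round" if "R" in genotype else "wrinkled"
    color = "yellow" if "Y" in genotype else "green"
    return f"{shape} {color}"

# All 16 equally likely gamete unions in the RrYy x RrYy self.
offspring = [combine(g1, g2) for g1, g2 in product(gametes, gametes)]
genotypes = Counter(offspring)
phenotypes = Counter(phenotype(g) for g in offspring)

print(genotypes["RRYY"], genotypes["RrYy"])  # 1 and 4, as stated in the text
print(phenotypes)  # 9 round yellow : 3 round green : 3 wrinkled yellow : 1 wrinkled green
```

The enumeration recovers both the genotype counts quoted above and the classic 9:3:3:1 phenotype ratio of a dihybrid cross.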

Selfing the F2 to produce F3

The easiest experiment to perform was to let the plants self-pollinate and then keep good records. After scoring his 556 F2 seeds (Table 5), he took the 315 that were round and yellow and planted them in one part of his garden. The plants that grew were allowed to self-pollinate. Of the 315 round and yellow seeds planted, 301 plants matured and produced seed. The seed produced was the F3 generation. At harvest, Mendel needed to exercise the utmost care: each F2 plant was handled separately, its seeds were harvested, and Mendel then scored the F3 seeds that came from that F2 plant. This can be referred to as F2:3 data, and the table below summarizes his complete experiment using all of the F2 phenotypes.

Mendel’s F2 data supported his principle of independent assortment. There were four different types of round, yellow F2 based on the kinds of progeny they could produce — their breeding behaviors. Based on the F3 progeny produced, the F2 genotype was deduced. For example, if a round, yellow seed gave all round progeny, it must have the genotype RR at the seed-shape locus; if it gave both round and wrinkled progeny, it was Rr.

Furthermore, the numbers of F2 plants with each breeding behavior agreed with what independent assortment predicted. There were four times as many round and yellow F2 that gave all four phenotypes of F3 seeds (138) as round and yellow F2 that were true breeding (38). Overall, nine types of breeding behavior appeared in the F2, demonstrating that there were nine F2 genotypes. In all cases, the fractions observed in the F2 agreed with the principle of independent assortment. Mendel’s well-planned experiment provided a convincing demonstration that genes behave in this predictable manner.

The only thing better than performing an experiment that shows you were right about a new hypothesis is performing two experiments that show you were right. That is what Gregor Mendel did! In his second experiment he crossed dihybrid F 1 plants with homozygous recessive plants in a test cross. This type of cross is so named because the geneticist wants to perform a cross that will test, or reveal, the genotype of an organism. Therefore, a test cross is usually made between an organism with a dominant trait and a partner with the recessive version of that trait. Mendel performed the RrYy x rryy testcross; the gametes that combine to produce the expected progeny are listed below:

RrYy gametes: RY, Ry, rY, ry

rryy gametes: all ry
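
The expected 1:1:1:1 testcross outcome follows directly from these gamete lists, as this minimal sketch shows (a toy illustration, not from the source):

```python
from collections import Counter

dihybrid_gametes = ["RY", "Ry", "rY", "ry"]   # from the RrYy parent
tester_gamete = "ry"                          # rryy produces only one kind

progeny = Counter()
for g in dihybrid_gametes:
    # pair each allele with the tester's; sorting puts the dominant allele first
    genotype = "".join("".join(sorted(a + b)) for a, b in zip(g, tester_gamete))
    progeny[genotype] += 1

print(progeny)   # RrYy, Rryy, rrYy, rryy in a 1:1:1:1 ratio
```

Because the tester contributes only recessive alleles, each progeny phenotype directly reveals which gamete the dihybrid parent contributed.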

The observed result closely matched the expected. The testcross experiment provides additional support for the principle of independent assortment.

Mendel established a rigorous precedent for using carefully planned multi-generation experiments to reveal the principles that governed trait inheritance. The beauty of Mendel’s accomplishments is that both the principles and his experimental approach can be applied to understanding the genetic control and inheritance of traits in many kinds of organisms still today.

Mendel’s principles of segregation and independent assortment are valid explanations for genetic variation observed in many organisms. Alleles of a gene pair may interact in a dominant vs. recessive manner or show a lack of dominance. Even so, these principles can be used to predict the future…at least the potential outcome of specific crosses.

Watch this video about Punnett Squares for more information


Chapter 1. The Science of Genetics: An Overview

A schematic image of double-stranded DNA molecule.

The subject of Genetics

Genetics is the study of inheritance and variation. Transmission Genetics studies mechanisms of gene transmission from generation to generation. Molecular Genetics studies the molecular basis of inheritance (structure and function of genetic material at the molecular level). Biotechnology (Genetic Engineering) is applied molecular genetics. Cytogenetics studies chromosomal structure and function. Population Genetics studies genetic variation in groups of organisms.  

Human Genetics is the study of inheritance and variation in human beings. Medical Genetics is the study of medical aspects of human genetics.

The principles of inheritance in humans do not fundamentally differ from those in other living organisms. An understanding of human genetic inheritance is a vital asset in the diagnosis, prediction, and treatment of medical conditions that have a genetic foundation.

DNA and human cells  

The human body is made up of roughly 100 trillion cells. Each cell (except mature red blood cells) has at least one nucleus, which houses the chromosomes.

There is 1.8 m of DNA in each of our cells packed into a structure only 0.0001 cm across.

Most human somatic cells (cells of the body) contain 46 chromosomes: 22 pairs of autosomes (chromosomes 1–22) and one pair of sex chromosomes.

Females normally have two X chromosomes; males have an X and a Y.  

Human reproductive cells (sperm cells and oocytes) contain one set of 22 autosomes and a single sex chromosome.

The ABC of GATC: The chemical organization of nucleic acids

Deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) comprise the nucleic acid family of bioorganic molecules. Besides nucleic acids, bioorganic molecular groups include proteins, carbohydrates and lipids.  

DNA and RNA are biopolymers, meaning that they are formed from connecting together simple building blocks called monomers. In the case of DNA and RNA, the monomers are called nucleotides.  

Nucleotides themselves consist of a pentose monosaccharide, called ribose in RNA and deoxyribose in DNA, a phosphate functional group, and a nitrogenous base. Deoxyribose is derived from ribose by the loss of the oxygen atom from the hydroxyl group on the second (2’) carbon of the pentose molecule. There are five types of nitrogenous bases found in nucleic acids. Three of the five (adenine, guanine, cytosine) are found in both DNA and RNA, while thymine is found only in DNA and uracil only in RNA.
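
The base composition described above can be summarized compactly in Python set form (illustration only):

```python
# The five nitrogenous bases and where they occur, as described in the text.
DNA_BASES = {"adenine", "guanine", "cytosine", "thymine"}
RNA_BASES = {"adenine", "guanine", "cytosine", "uracil"}

shared = DNA_BASES & RNA_BASES        # bases common to both nucleic acids
print(sorted(shared))                 # ['adenine', 'cytosine', 'guanine']
print(DNA_BASES - RNA_BASES)          # {'thymine'}: DNA only
print(RNA_BASES - DNA_BASES)          # {'uracil'}: RNA only
```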

The Human Genome

The Human Genome Project (HGP) is an endeavor that could be compared in complexity and significance to a trip to Mars. If you were to read the entire genome of a human being, as if it were a book, one base at a time, it would take you 140 years to complete! By 1985, when serious talks about the sequencing of the human genome began to develop, Jim Watson, the first HGP Director (1988 to 1992), estimated that it would take 1,000 years to complete with the technology existing at the time. In October of 1990, when the HGP was officially launched, it was intended to be a 20-year project, with most of the time devoted to improving the technology rather than to actual sequencing. In 1997, the estimate was that it would take another 50 years to actually complete the sequencing. Despite the prognosis, the first “rough draft” of the human genome was completed in 2000, at which point it was expected that the mapping of the genome would be completed in five years. The essentially complete human genome assembly was released in April 2003, 2 years earlier than planned.
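
The "140 years" figure is a back-of-the-envelope estimate that depends entirely on the assumed reading speed. A quick check, using the genome size quoted later in this chapter and reading rates that are our own assumptions, not the source's:

```python
# Assumption: reading aloud at one base per second.
GENOME_BP = 3_323_950_079             # human genome size quoted below
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

years_nonstop = GENOME_BP / SECONDS_PER_YEAR
print(f"{years_nonstop:.0f} years reading nonstop")    # ~105 years

# At a more humane 8 hours of reading per day:
years_8h = GENOME_BP / (60 * 60 * 8 * 365)
print(f"{years_8h:.0f} years at 8 hours per day")      # ~316 years
```

The 140-year figure corresponds to roughly 1.3 seconds per base read around the clock, so the orders of magnitude agree.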

Why is the knowledge of the human genome important?

The knowledge we have gained from the deciphering of the human genome has significantly impacted modern medicine. Doctors can now tailor medications to your specific genetic makeup. Pharmacogenetics, the field of medicine concerned with how your genes affect your response to drugs, can now find ways to reduce side effects and improve drug efficacy by taking into consideration how your genes may react to the medications you are prescribed. Other benefits include being able to anticipate the probability of developing cancer based on an individual’s genetic makeup, as well as being able to predict the specific genetic conditions the children of prospective parents may have. There are countless other benefits, which we will be discussing throughout the course.

The Human Genome At A Glance

Total number of chromosomes: 46

Haploid genome: 23 chromosomes

Molecular size:   3.3 billion (3,323,950,079) nucleotides (or base pairs, bp)

Largest chromosome: #1 = ~263 million base pairs (bp)

Smallest chromosome: Y = ~59 million bp

Known protein-coding genes:   19,940 (GENCODE reference genome version 29, 2018)

RNA genes (regulatory, tRNA, rRNA etc.): 23,643  

Pseudogenes: 14,729  

Gene transcripts (detected mRNA): 206,694

Protein-coding transcripts: 83,129

Mobile genetic elements: 45% of the genome

Average size of a human gene:   ~3,000 bp

Only about 5% of the human genome contains genes (coding sequences)

Less than 2% of the genome codes for proteins

Humans share most of the same protein families with worms, flies, and plants

Function of much of the genome is unknown

At least 20% of the non-protein-coding portion of the genome is dedicated to regulating the transcriptional activity of the protein-coding 2%.

Repeated sequences that do not code for proteins make up at least 50% of the human genome. Recent studies suggest that 66%–69% of the human genome is repetitive or repeat-derived (de Koning et al., 2011)

45% of these repeated sequences are able to move around within the genome (mobile genetic elements, or transposable elements)
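
Two of the figures in the list above can be cross-checked against each other with simple arithmetic (the input numbers are taken from the list as given, not independently verified):

```python
genome_bp = 3_323_950_079        # total genome size from the list
protein_coding_genes = 19_940    # GENCODE v29 gene count from the list
avg_gene_bp = 3_000              # average gene size from the list

coding_fraction = protein_coding_genes * avg_gene_bp / genome_bp
print(f"{coding_fraction:.1%}")  # ~1.8%, consistent with "less than 2%"
```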

Mobile genetic elements

Mobile genetic elements (MGEs), also referred to as transposable elements (TEs) or transposons, are nucleotide sequences normally housed within the genome that retain the ability to change their location. Transposons are classified into retroelements (Class I MGEs) and DNA transposons (Class II MGEs). The latter are sometimes referred to as “jumping genes”; their movement requires the presence of a functional enzyme called transposase. Retroelements are transcribed from DNA to RNA, and the RNA produced is then reverse transcribed to DNA. Reverse transcription is accomplished with the help of an enzyme called reverse transcriptase (an RNA-dependent DNA polymerase), sometimes encoded by the retroelements themselves. This copied DNA is then inserted into a new position in the genome. Transposons tend to re-insert into non-coding regions of the genome, but they can also insert in the vicinity of functional genes, thereby interrupting vital functions. DNA transposons comprise 8% of the nuclear genome. MGEs are found on all chromosomes, but mostly on the X chromosome and chromosome 19. More on transposons in Chapter 10 (Mutations).

Comparative Genetics and its Importance

Basic assumption of genetic research: similar mechanisms control genetic processes in different organisms.  

By studying the genetic mechanisms in simpler genetic models, we will be able to uncover basic principles of genetic inheritance.

Using the knowledge of genetics gained in the study of model organisms, we will be able to understand the mechanisms of genetic inheritance in human beings.

This figure demonstrates the genetic similarity (homology) of the superficially dissimilar mouse and human species.   The similarity is such that human chromosomes can be cut (schematically at least) into about 150 pieces (only about 100 are large enough to appear here), then reassembled into a reasonable approximation of the mouse genome.   The colors and corresponding numbers on the mouse chromosomes indicate the human chromosomes containing homologous segments.

Favorite model organisms of geneticists

E. coli (4,377 genes, one chromosome).

Neurospora sp. (~10,000 genes, 14 chromosomes)

C. elegans (~20,000 genes, 12 chromosomes); Drosophila sp. (~14,000 genes, 8 chromosomes).

Photo of the fruitfly Drosophila melanogaster.

Mus musculus (3 billion bp, 23,786 genes, 40 chromosomes)

Photo of Mus musculus, the common house mouse.

A Brief Overview of the Scientific Method

Genetics, like any other field of science, deals with observable and measurable phenomena.

The Scientific Method is a thought process that scientists use to produce a comprehensive and objective (i.e., unbiased, not dependent on an individual opinion) explanation of the processes that govern the natural world. Scientific knowledge doesn’t have a value system; it generates and tests hypotheses based upon observations.

The primary goal of the Scientific Method is to acquire knowledge and understanding about the natural world through systematic investigation. It aims to discover patterns, establish causal relationships, and develop theories or models that can explain and predict phenomena. The scientific method strives for objectivity and aims to generate reliable and valid results that can be tested and replicated by others.

The role of Parsimony in the scientific analysis

“Pluralitas non est ponenda sine necessitate”, plurality is not to be posited without necessity (Parsimony Rule/ Ockham’s Razor).

“Of the two competing explanations, both of which are consistent with the observed facts, we regard it as right and obligatory to prefer the simpler” (Barker, 1961, p. 273).

Parsimony, also known as the principle of simplicity or Occam’s razor, plays a vital role in scientific analysis. It suggests that among competing explanations or hypotheses, the simplest one (or, rather, the one that takes the fewest steps to explain the phenomenon in question) that can account for the observed data should be preferred until further evidence is available.

In scientific analysis, parsimony serves as a guiding principle for selecting the most likely explanation or hypothesis that requires the fewest assumptions or complexities. Here are a few ways parsimony contributes to scientific analysis:

  • Simplicity and Explanation: Parsimony encourages scientists to favor explanations that require the fewest assumptions or entities. It suggests that simpler explanations are more likely to be true or accurate. By choosing simpler hypotheses, scientists aim to provide concise and elegant explanations for observed phenomena.
  • Occam’s Razor: Occam’s razor is a principle related to parsimony, often attributed to the philosopher William of Ockham. It states that “entities should not be multiplied beyond necessity.” In scientific analysis, Occam’s razor advises against introducing unnecessary or extravagant elements when simpler explanations are available.
  • Testability and Predictions: Simpler hypotheses are often easier to test and make specific predictions. They provide clearer expectations that can be compared with empirical data. Parsimony helps scientists avoid excessive complexity, ensuring that hypotheses can be tested and evaluated effectively.
  • Model Selection: In fields where multiple models or theories exist to explain the same phenomenon, parsimony can guide scientists in selecting the most appropriate one. By favoring simpler models, scientists aim to avoid overfitting the data, reduce complexity, and enhance the model’s generalizability.
  • Communication and Clarity: Parsimony helps scientists communicate their findings more effectively. Simple explanations and models are often easier to understand and convey to others, enhancing clarity and accessibility in scientific communication.

It is important to note that while parsimony is a valuable principle in scientific analysis, it is not a definitive rule. In certain cases, complex explanations may be necessary to account for all available evidence. Ultimately, the principle of parsimony should be balanced with the need to adequately explain and account for the complexities of the natural world.

Steps of the scientific inquiry

The Scientific Method is a systematic approach used by scientists to investigate and acquire knowledge about the natural world. It provides a structured process for conducting scientific research and includes several key steps:

  • Observation and description of a phenomenon.  
  • Formulation of a tentative explanation for the observed phenomenon (formulation of a hypothesis).
  • Making predictions based on the hypothesis.
  • Testing the predictions by conducting a series of experiments.
  • If the results of the tests consistently support the predictions, the hypothesis may develop into a theory.
  • The theory becomes generally accepted when different research groups perform independent tests of the theory and obtain comparable results.  

It should be kept in mind that the Scientific Method is an iterative process, and scientific knowledge is constantly revised and refined based on new evidence and insights.

Science look-alikes

Beware of what’s called “science by press conference,” when scientists announce their results to the media before they have been reviewed by their peers and published in scientific journals. An example of such activity is the announcement, in November of 2001, of the first successful attempt to clone a human embryo, made by Dr. West, the director of the Advanced Cell Technology (ACT) company in Massachusetts, at a press conference, before other scientists in the field got a chance to examine the work. Upon closer examination by peers, it appeared that the achievement wasn’t so great: the clone didn’t survive past the third cell division. Nevertheless, the media turned this premature scientific announcement into a sensation, and the procedure drew sharp criticism, followed by health policy changes, from the US Government. Thus, to the public (and the government) it appeared as though scientists had succeeded in cloning a human being.

Key Takeaways

  • Human Genetics is a branch of Genetics concerning the principles of inheritance in human beings.
  • The principles of Human Genetics are not fundamentally different from the principles of Genetics that govern the inheritance in any other living species.
  • Comparative genetics studies the principles of genetic inheritance using model systems.
  • Genetics is governed by the principles of the Scientific Method inquiry.

Human Genetics Copyright © by Alexey Nikitin is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License , except where otherwise noted.


  • Open access
  • Published: 17 February 2023

Hypothesis-free phenotype prediction within a genetics-first framework

  • Chang Lu   ORCID: orcid.org/0000-0002-3272-0120 1 ,
  • Jan Zaucha 2 ,
  • Rihab Gam 1 ,
  • Hai Fang 2 , 3 ,
  • Ben Smithers 2 ,
  • Matt E. Oates 2 ,
  • Miguel Bernabe-Rubio 4 ,
  • James Williams   ORCID: orcid.org/0000-0003-3206-7851 4 ,
  • Natalie Zelenka   ORCID: orcid.org/0000-0002-1007-0286 2 ,
  • Arun Prasad Pandurangan   ORCID: orcid.org/0000-0001-7168-7143 1 ,
  • Himani Tandon 1 ,
  • Hashem Shihab 2 ,
  • Raju Kalaivani 1 ,
  • Minkyung Sung 1 ,
  • Adam J. Sardar   ORCID: orcid.org/0000-0002-6912-6125 2 ,
  • Bastian Greshake Tzovoras   ORCID: orcid.org/0000-0002-9925-9623 5 ,
  • Davide Danovi   ORCID: orcid.org/0000-0003-4119-5337 4 &
  • Julian Gough   ORCID: orcid.org/0000-0002-1965-4982 1 , 2  

Nature Communications volume 14, Article number: 919 (2023)


Subjects: Computational biology and bioinformatics

Cohort-wide sequencing studies have revealed that the largest category of variants is those deemed ‘rare’, even for the subset located in coding regions (99% of known coding variants are seen in less than 1% of the population). Associative methods give some understanding of how rare genetic variants influence disease and organism-level phenotypes. But here we show that additional discoveries can be made through a knowledge-based approach using protein domains and ontologies (function and phenotype) that considers all coding variants regardless of allele frequency. We describe an ab initio, genetics-first method making molecular knowledge-based interpretations for exome-wide non-synonymous variants for phenotypes at the organism and cellular level. By using this reverse approach, we identify plausible genetic causes for developmental disorders that have eluded other established methods and present molecular hypotheses for the causal genetics of 40 phenotypes generated from a direct-to-consumer genotype cohort. This system offers a chance to extract further discovery from genetic data after standard tools have been applied.



Sequencing of human genomes holds great promise for using genetic information to guide medical discovery and therapy. And yet in general, advances in our ability to extract useful information from genetic data are not being made as rapidly as advances in our ability to generate the data, leading to a growing imbalance of effort. Systematically predicting potential organism-level phenotypes or disease risks based on the information of a person’s genetic variation remains an unsolved challenge. The influence of common variants on phenotypes can be quantified by statistical weights from genome-wide association studies (GWAS) and presented as polygenic risk scores (PRS) 1 , 2 , 3 , while effects of rare variants can be expressed in terms of intolerance to high-penetrance functional variants in the human population 4 , 5 , 6 from both burden testing on common phenotypes and rare disease work in families. Nevertheless, these two approaches leave a large area in the effect size – allele frequency space under-explored (Fig.  1a) ; more explicitly, they do not account for the influence of a significant number of non-synonymous uncommon variants of medium or low penetrance 2 . On one hand, rare or low-frequency functional variants are under-interpreted in GWAS due to the inherent difficulties in the statistical evaluation of rare events in population genetics 7 , 8 . On the other hand, the statistical method for making gene-to-disease inferences requires the support of multiple instances of high-confidence predicted loss-of-function (pLoF) variants in a gene 4 , 9 , and it has been shown that such instances are rarely observed. It is estimated that cohorts roughly 1000 times bigger than gnomAD (which contains 125,748 exomes) are needed to gather evidence of their existence in most genes 10 . On a few datasets with extensive broad phenotyping, Phenome-Wide Association Studies (PheWAS) 11 can leverage gene-based collapsing to address some rare variants.
To further overcome these difficulties, and to apply to most datasets, we see the utility of a non-association-based method; ideally one incorporating the effects of rare or low-frequency variants, expanding our capability for discovery within the constraints of the existing scale of human genetics data.
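
As background for the PRS concept mentioned above: in its standard additive form, a polygenic risk score is a weighted sum of an individual's risk-allele dosages, with weights taken from GWAS. A minimal sketch follows; the variant IDs and effect sizes are made up for illustration.

```python
gwas_weights = {            # variant -> per-allele effect size (e.g. log odds)
    "rs0001": 0.12,         # hypothetical variant IDs and weights
    "rs0002": -0.05,
    "rs0003": 0.30,
}

def polygenic_risk_score(dosages):
    """dosages: variant -> number of risk alleles carried (0, 1 or 2)."""
    return sum(gwas_weights[v] * d for v, d in dosages.items())

person = {"rs0001": 2, "rs0002": 1, "rs0003": 0}
score = polygenic_risk_score(person)
print(round(score, 2))      # 0.19 = 2*0.12 + 1*(-0.05) + 0*0.30
```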

Figure 1

a Positioning relative to heritability interpretation from two prevailing genetic association analyses 2 of many. G2P: Gene-to-phenotype databases, GWA: Genome wide association, PRS: Polygenic risk scores. The colour bar shows the genetic unit of analysis employed by each method. b Framework overview. The method takes an individual’s genetic data as input and produces a list of ontology terms for which the person is a potential outlier. It uses a large background of genomes to which the individual is compared, ontology databases with gene-phenotype relationships, and evolutionary intolerance of mutations in protein domain families encoded by hidden Markov models (HMMs). c Schematic illustration of the genetic landscape for an ontology term (HP:0000834 ‘abnormality of the adrenal glands’) highlighting genomes with high outlier scores. Each node represents a genome, and edges are proportional to genetic distance in eigenspace – in essence, a reduced dimensional feature space between genomes.

Existing non-associative approaches that investigate genotype-to-phenotype relationships mainly use supervised network models, usually learned from large genomic variant databases focusing on specific phenotypes, e.g. antimicrobial resistance 12 , 13 , yeast cellular phenotypes 14 , 15 and plant phenotypes 16 . These methods have demonstrated the possibility of using a knowledge-based strategy to make phenotypic predictions. However, to achieve this, large datasets of millions of genetic variants (often generated through synthetic genetic arrays 17 ) are needed; such datasets are only available for selected phenotypes, and such supervised models are not applicable to complex human phenotypes. We propose an unsupervised knowledge-based system (Nomaly) that makes ab initio predictions of potential phenotypes from thousands of ontology terms (Fig.  1b ), leveraging the knowledge of protein domains through hidden Markov models 18 , 19 , 20 . Instead of being used to train a model at an early step, the phenotypes are used as a final step to evaluate which predictions performed significantly better than expected by chance. The underlying protein knowledge on which the ab initio models are based can be examined to provide molecular insights into the predicted phenotype; in other words, unlike supervised models, the interpretability of our predictions is high.

The Nomaly system (Fig.  2 ) is built on the premise that a genetic extreme outlier can be defined that is predictive of an outlier in phenotype. Under this hypothesis, the system evaluates the genetic heterogeneity in the context of each phenotype. Consequently, not only is it able to consider the additive effect of multiple variants but also the non-additive combinatorial effect where some variants become relatively rare and deleterious in the presence of other more common variants. The challenging computation of this is made tractable via a linear algebra approach solving an eigenproblem (spectral clustering), described as segmentation-based object categorisation when used in image analysis 21 . A typical run includes a person or persons of interest and a large cohort-scale background, whereby outlier scores for thousands of terms in an ontology are calculated for each person of interest (Fig.  1b ). The outlier scores represent the likelihood of being an extreme outlier in the ontology-specific genetic landscape with respect to the chosen background (Fig.  1c ).
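
The spectral-clustering idea can be illustrated with a toy example (our own sketch, not the Nomaly implementation): build an affinity graph from pairwise distances, embed the points using the low eigenvectors of the graph Laplacian, and score each point by its distance from the embedded centroid, so that an extreme outlier stands out.

```python
import numpy as np

rng = np.random.default_rng(0)
background = rng.normal(0.0, 1.0, size=(50, 4))   # 50 background "genomes"
outlier = np.full((1, 4), 6.0)                    # one extreme genetic profile
X = np.vstack([background, outlier])

D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
A = np.exp(-D**2 / (2 * np.median(D)**2))             # Gaussian affinity
L = np.diag(A.sum(axis=1)) - A                        # unnormalised Laplacian

eigvals, eigvecs = np.linalg.eigh(L)      # ascending eigenvalues
embedding = eigvecs[:, 1:4]               # skip the trivial constant eigenvector
scores = np.linalg.norm(embedding - embedding.mean(axis=0), axis=1)

print(scores.argmax())    # index 50: the planted outlier scores highest
```

The real system works per ontology term, over far larger backgrounds, and with domain-derived functional distances, but the outlier-scoring principle is the same.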

Figure 2

– see methods for detail. Genome data is inputted at the top and causal hypotheses are outputted at the bottom. In the orange top box (algorithm), firstly the functional distance between each missense variant is derived from domain-based HMM probabilities, scaled depending on zygosity (top row). Subsequently (second row) variants falling in the region of a gene with homology to an HMM representing a functional unit (domain), are collated into a genetic profile for a phenotype using domain-phenotype mappings inferred using dcGO 19 . This multi-domain collapsing of an ontology term can be likened to gene-based collapsing used in PheWAS 11 . Next (bottom row of the orange box) the profile of combined functional distances (from the top row) is used to calculate a genetic distance to every genome in the background. Spectral clustering of the distance matrix identifies which genomes are outliers under the profile (HP:0000834 in this illustration); nodes represent genomes and are coloured by outlier score (bottom right of the orange box). In the next (blue) box, only the top-scoring outlier phenotypes are passed to the confirmation stage, where some of these genetics-first predictions are identified as correct, giving a likely cause of the verified phenotype that was predicted.

To evaluate performance, here we wish to systematically assess: (i) whether there is a global consistency in statistics between the actual phenotypes and predictions based on outlier scores; (ii) whether the predictive success rate can be quantified; and (iii) how novel the predictions are. In this work, we describe a knowledge-based framework and demonstrate its significance and usefulness by answering these questions using three independent datasets, namely: a cohort of 2248 participants specially recruited for this study (DTC, below), the well-established dataset of 1133 children in the Deciphering Development Disorders (DDD) study with the respective gene-to-phenotype database DDG2P that provided genetic diagnoses for 40% of these children (majority through de novo mutations) 6 , 22 , 23 , and the Human Induced Pluripotent Stem Cells Initiative (HipSci) stem-cell bank 24 where there is the possibility to experimentally verify predictions on cellular phenotype.

Evaluation of the predictive power with a direct-to-consumer genetics cohort (DTC)

For the purpose of evaluation, we recruited a cohort of volunteers who had previously subscribed to direct-to-consumer (DTC) genotyping services (e.g. 23andMe, AncestryDNA and others), or who were otherwise already in possession of personal genomic data files, to participate in this study (Supplementary Fig.  1 ). To test the overall significance of outlier scores we presented each participant with a questionnaire asking them to self-identify, from a set of questions, any phenotypes from among 25 of their top-scoring ontology terms mixed equally with a further 25 top-scoring terms from a decoy – a randomly selected individual from the background (Fig.  3a ). The DTC cohort generated 2248 questionnaires, yielding 94,966 yes/no answers across 3672 ontology terms and 2086 written comments; see methods for details of QC. Questions were intended to identify only outliers, so if the question design resulted in a high positive response rate (>5%), the questions were excluded for identifying a common phenotype instead of an outlier. By requiring participants to self-identify, the often-costly challenge of phenotype data collection is simplified; however, we sacrifice accuracy compared to expert assessment, introducing noise that will mask the true predictive power by an unknown amount.

figure 3

a Participants upload their DTC genome data on which outlier phenotypes are predicted, then shuffled with outliers from a decoy genome randomly selected from the background, to create a uniquely personalised questionnaire. Answers are used to confirm true predictions against decoys. b A test of 100,000 random permutations of the dataset shows that observed scores are on average higher for confirmed phenotypes, with a p -value of 8.25e-8 against randomly permuted scores. c The rate of identifying confirmed phenotypes by score threshold (blue) and number of above threshold predictions (green); at the default threshold of 0.022 the rate is more than double the rate for decoys, with a p -value of 7.93e-7 against 100,000 permutations. d The significance (green) of the top phenotypes by within-phenotype permutation of answers 100,000 times, and (blue) for the top x phenotypes, the number left after subtracting from the total those expected by chance. z -scores were derived from testing the null hypothesis that similar results can be obtained if scores are assigned randomly (see methods). p -value is calculated from z -score in a right-tailed hypothesis test. e For DDD patients, the 60 above-threshold predictions confirmed by clinical annotation with a p -value of 5.12e-4 versus data from 100,000 random permutations, using the same hypothesis test procedure as in d . Inset : the 50 patients with top predictions compared to published data 22 for whether a genetic diagnosis has been identified through DDG2P, and split by presence of de-novo mutation (DNM).

A statistical permutation test of outlier scores versus positive self-identification by participants proves that the method is significantly predictive of phenotype at the <1% level ( p -value: 8.25e-8, Fig.  3b ). The average rate at which participants self-identify a random phenotype was measured as 0.8% by taking answers to decoy questions. However, participants identify at a higher rate for predicted phenotypes and this increases monotonically with score threshold choice (Fig.  3c ). For high-scoring phenotypes the rate of self-identification is 1.8% ( p -value: 7.93e-7), but with improved phenotype measurement this could be increased. These results show that the high-scoring phenotype predictions harbour statistically significant signals and that about one half (ratio of above-threshold positive answer rate to decoy rate) of the top-scoring predictions, verified through self-identification, are due to underlying genetic variation identified by the algorithm.
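
The permutation logic behind such a test can be sketched in a few lines (toy data, made up for illustration): shuffle the confirmation labels many times and ask how often the shuffled data look at least as extreme as the observed data.

```python
import random

random.seed(42)

# Toy outlier scores and whether each prediction was confirmed (1) or not (0).
scores    = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
confirmed = [1,   1,   1,   0,   0,   0,   0,   0,   0,   0   ]

def mean_confirmed(scores, labels):
    """Mean score over the confirmed predictions."""
    picked = [s for s, c in zip(scores, labels) if c]
    return sum(picked) / len(picked)

observed = mean_confirmed(scores, confirmed)

n_perm, more_extreme = 10_000, 0
for _ in range(n_perm):
    shuffled = confirmed[:]
    random.shuffle(shuffled)               # break any score/label relationship
    if mean_confirmed(scores, shuffled) >= observed:
        more_extreme += 1

p_value = (more_extreme + 1) / (n_perm + 1)   # add-one to avoid p = 0
print(f"observed mean = {observed:.2f}, p = {p_value:.4f}")
```

Here the three confirmed predictions happen to carry the three highest scores, so very few random shufflings match the observed mean and the p-value is small.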

If a confirmed prediction has a genuine genetic basis and has not occurred by chance, then other predictions for the same phenotype in other participants are more likely to be true. This non-independence can be exploited and measured by per phenotype permutation tests. Further permuting across all phenotypes corrects for multiple hypotheses. Of the 40 phenotypes with the top statistic by permutation (Table  1 and Supplementary Data  1 ), only 13.5 are expected to have occurred by chance (Fig.  3d ), which is a statistically much stronger result ( p -value: <10e-16) than when considering predictions independently as above. Examples of a novel gene, recovery of a known variant, a novel variant in a gene related to a known gene, and mechanistic explanations are shown below.

An assessment of potentially confounding factors (sex, ancestry and array type) establishes that the top phenotype predictions cannot be accounted for in this way. Although sex and three of the ancestry principal components (African, Gujarati Indian and Finnish) are predictive on the dataset, they are only significant ( p -value < 0.05 at Z -score > 1.65) on 4 of the top 40 phenotypes. More importantly, none of the variables correlates ( r > 0.1) with outlier score on any of the top phenotypes. As a baseline comparison to outlier scores, association statistics calculated on the dataset (Supplementary Fig. 1 ) did not reveal any significant variants, owing to the small cohort size for each phenotype. Aggregated association scores at high FDR subjected to a permutation test (Fig. 3d ) identify 6 (±2) of the top 21 phenotypes as having a significant variant association (Supplementary Data 5 ) not expected by chance at Z -score 2.9. Removing data points for questions with outlier score >0.022 results in total loss of significance, implying that genetics-first selection of questions improved the power of the association results.

Potential genetic diagnoses for children in the Deciphering Developmental Disorders (DDD) cohort

In addition to evaluating significance on our own DTC cohort, a comparison was made to state-of-the-art work in the field on the well-established DDD cohort for the purpose of assessing novelty. DDD consists of 1133 exome-sequenced trios in which the child has a developmental disorder, annotated with Human Phenotype Ontology (HPO) 25 terms by clinicians 6 , 22 , 23 .

Predictions on DDD were restricted to HPO terms relevant to developmental disorders that were used for annotation by clinicians. Of these, 60 above-threshold predictions matching clinical annotations were found for 50 patients (Fig. 3e ). This rate is slightly lower than, but consistent with, the DTC results. Comparison to the published list of clinical diagnoses 6 , 22 shows that 62% (31) of these patients had received no genetic diagnosis, including 15 who have no de novo mutations (DNMs); the published diagnoses in the DDD paper were made through DNM missense variants, inherited variants and rare chromosomal events in known genes using the developmental disorder gene-to-phenotype (DDG2P) database. Thus, plausible genetic explanations can be discovered for families not covered by the established G2P interpretation method (Supplementary Data 2 ).

A global analysis of the predictions made on DDD that match clinical annotations confirms the significance of the method ( p -value: 3.62e-4), although it is lower than for the DTC cohort because the data are limited and restricted to developmental disorder-specific HPO terms. Likewise, about 1/3 of the above-threshold predictions ( p -value: 5.12e-4) are expected to be true, compared with about 1/2 in DTC (subtracting the decoy rate from the prediction rate). Examples of a novel variant in a known gene and of a combinatorial effect are shown below. We conclude that the predictions are not only significant on an independent dataset, but also largely non-redundant with those made by existing state-of-the-art methods, thus advancing the field.

Application to interpreting genetic variants for cellular phenotypes on a large panel of induced pluripotent stem cell lines (HipSci)

Although HPO is the most common ontology used to annotate human phenotypes, and the one used by DDD 22 , the DTC cohort study also included several other mammalian and disease ontologies (Supplementary Fig. 1b ) and the gene ontology (GO) 26 . Despite being the ontology richest in data, GO performed worse on DTC than the other ontologies ( p -value: 2.15e-2 for GO and p -value: 6.05e-7 for non-GO using the threshold). This is presumably due to the difficulty of self-identifying molecular- and cellular-level terms, especially without recourse to invasive measurements on the person. The HipSci project provides exome sequence data for hundreds of iPS cell lines from different individuals, and thus offers an opportunity to examine this prediction framework as applied to molecular and cellular phenotypes instead of patient-level phenotypes.

Outlier scores were generated for GO terms from the exome sequences corresponding to 427 HipSci samples, generating predictions potentially relevant to cellular phenotype. It is not possible to use this dataset to assess global performance versus phenotype identification as with the other two cohorts, but the top-scoring phenotype terms were explored to identify a prediction that could be empirically validated in vitro. The phenotype “negative regulation of centrosome duplication” within the “biological process” domain of GO presented high-scoring predictions (see below) for some cell lines, and was selected principally for its suitability for measurement in iPS cells with an existing assay.

Centriole staining was carried out on HipSci cell lines corresponding to all five individuals predicted to be outliers from their exome sequence, and three controls not predicted to be outliers. Counting of the centrioles indicated that three out of five predicted cell lines displayed an elevated percentage of cells with more than two centrioles, suggesting defects in centriole regulation and cell cycle, confirming the accuracy of the prediction (Fig.  4 ). This phenotype would not be identifiable from symptoms in the DTC or DDD cohorts, but this empirical evidence demonstrates the value of the approach for discovery at the cellular or tissue level.

figure 4

This refers to 'any process that decreases the frequency, rate or extent of centrosome duplication'. a Representative examples of each of the indicated cell lines. Centrioles were detected by staining with gamma-tubulin (γ-tub); nuclei were stained with DAPI. Asterisks indicate cells with more than two centrioles. Hoik-1, Sehp-2 and Kegd-2 were control cell lines. b Histogram showing the percentage of cells with more than two centrioles per cell in the indicated cell lines. Results are summarised as the mean ± s.e.m. from 3 independent experiments (600-800 cells per cell line were analysed; percentages from each experiment are shown as dots; *: P  < 0.05; **: P  < 0.005; ns: not significant). Specifically, the p -values are: Boqx-2 P  = 0.045, Suul-1 P  = 0.0021, Yoch-6 P  = 0.0023 (one-sided t-test).

Types of genetic outlier that explain potential phenotype outliers

An outlier in the genetic landscape of an ontology term, as detected by this approach, can be caused by a single rare variant or by multiple rare variants. In the case of multiple variants, each variant can be classified according to whether it is absolutely required to achieve a score above the threshold in the cohort, or whether it merely contributes to an above-threshold score; alternative variants can achieve above-threshold scores in different people (Fig. 5a, c ). There can also be a combinatorial effect contributing to the score (e.g. Fig. 6g ), whereby a variant becomes relatively rarer and more deleterious in the presence of a particular genotype, typically consisting of a few common variants, exemplified here by a higher outlier score if classified into a cluster (Fig. 1c ) than if there is no cluster. The most common case (71%; Fig. 5b , types 1-a/1-b) is where a single highly deleterious variant is sufficient to reach the threshold, although others may contribute additional score (44% of cases; 1-b). Multiple variants being required to cause an outlier accounts for 29% of cases (2-a, 2-b). Combinatorial effects are important in a small minority of cases (Fig. 5d ).
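The four outlier types can be expressed as a small decision rule over per-variant score contributions. This is a hypothetical sketch: the 'dominated by one high-scoring variant' cut-off (the top variant carrying at least half the total score) is our own assumption, not a criterion taken from the paper.

```python
def classify_outlier(contribs, threshold):
    """Classify an above-threshold outlier by the per-variant score
    contributions that make it up (illustrative logic only)."""
    contribs = sorted(contribs, reverse=True)
    top, total = contribs[0], sum(contribs)
    if top >= threshold:                # one variant alone crosses the threshold
        return "1-a" if len(contribs) == 1 else "1-b"
    if top >= 0.5 * total:              # multiple needed, one dominates (assumed cut-off)
        return "2-a"
    return "2-b"                        # multiple variants genuinely required
```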

figure 5

a Outliers classified by underlying genetic variants into 4 types: 1-a, single variant only required; 1-b, single variant plus contributing variant(s); 2-a, multiple variants but dominated by one high-scoring variant; or 2-b, multiple variants required. b Distribution of the four types of outliers in the DDD cohort. c Violin plots showing the distribution of log-scaled minor allele frequencies (MAF) of variants involved in all predicted outliers, and by type of outlier, in the DDD cohort. Colours show the different types of outliers; symbols show the roles of the variants involved, as in a . Specifically, all (grey): 3682 variants involved in at least one outlier prediction, 539 variants involved in type 1-a outliers (dark blue), 601 required (star) and 1548 contributing (triangle) in type 1-b (light blue), 416 required (star) and 20 contributing (triangle) in type 2-a (orange), and 1767 in type 2-b (red). Distributions were generated using a kernel density estimate in the seaborn package 47 . Boxes show quartiles and whiskers represent 1.5× the interquartile range. d Percentage of outliers with a combinatorial component to the score, with variants contributing non-independently.

figure 6

Any coordinates shown are relative to genome assembly GRCh37. a Novel gene association . ADAM7 [ https://www.ncbi.nlm.nih.gov/gene/8756 ], ADAMTS13 [ https://www.ncbi.nlm.nih.gov/gene/11093 ]. b Known variant . Chr6 Pos26093141, HFE-C282Y. c Novel variant in related gene . Chr12 Pos52913668, KRT5-G138E. d Single variant . Chr17 Pos29586054, NF1-L1425R. e Single variant . Chr19 Pos17927755, INSL3-R102C. f Novel variant in known gene . Chr3 Pos18143037, SOX2-L75P. g Combinatorial effect . 2 variants in CYP4B1-R375C/R340C and CYP2A7-T347A and CYP2D6-R245C. h Experimentally validated on HipSci . Chr3 Pos48414274, FBXW12-P6L. Panels (a) and (f) were created with BioRender.com.

The majority (90%) of single-variant-based predictions were caused by a rare variant with minor allele frequency (MAF) <0.5% for a heterozygous genotype, or MAF < 1% for a homozygous genotype. However, not all rare variants result in a positive prediction, for example if the variant is not highly deleterious, or when many people from the chosen background harbour different rare variants for the phenotype, suggesting the ontology term is not highly evolutionarily constrained. In multiple-variant-based positive predictions, 24% of variants are rare, and 55% are low-frequency (MAF < 1% heterozygous or MAF < 5% homozygous) (Fig. 5c ). Whilst our approach does not filter variants based on allele frequency (as illustrated in Fig. 1a ), there is insufficient power to recover the effects of common variants given the size of the DDD or DTC cohorts. As each phenotype question is only presented to a small subset of participants, only effects of large magnitude emerge from rare genotypes.
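The zygosity-dependent rarity bins quoted above can be captured in a few lines; a minimal sketch using exactly the cut-offs stated in the text (rare: <0.5% het / <1% hom; low-frequency: <1% het / <5% hom).

```python
def variant_rarity(maf, homozygous):
    """Bin a variant by minor allele frequency (as a fraction, not %),
    with cut-offs that depend on zygosity."""
    rare_cut, low_cut = (0.01, 0.05) if homozygous else (0.005, 0.01)
    if maf < rare_cut:
        return "rare"
    if maf < low_cut:
        return "low-frequency"
    return "common"
```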

In Fig. 6 we show examples of the different types of discovery among the results. (1) Novel gene association. Fig. 6a shows how primary evidence from mouse knock-out experiments in genes ADAM9/17/19 led to a dcGO 18 , 19 link between a mitral valve phenotype and the reprolysin-like domain shared by these genes. Novel variants identified in the respective domains of the human genes ADAM7 and ADAMTS13 confirmed the mitral valve phenotype via questionnaire. (2) Known variant. In Fig. 6b a variant used to predict confirmed hemochromatosis in DTC participants was already known in ClinVar 27 . The diagram shows, left to right, how multiple genes (including HFE) labelled with the ontology term were used by dcGO 18 , 19 to link it to domain families; variants in DTC participants within these domains, prioritised by HMM probabilities, then led to the prediction. Had the HFE variant not been known (HFE not included as known), it is possible that this approach would have discovered it de novo in the DTC study on the basis of the domain association. (3) Novel variant in related gene. The prediction in Fig. 6c includes other variants in the keratin genes KRT75 and KRT78 not previously linked to this ontology term. This variant is found in the head region (1–167) of the intermediate filament rod domain, adjacent to an ELM motif involved in a protein-protein interaction with an SH3 domain. The figure shows a structural interpretation of the mutation of G138 to a polar negatively charged Glu, disrupting binding to the electrostatic surface of the SH3 domain from PDB structure 2GBQ.

(4) Single variant. The single rare and functional variant (Fig. 6d ) in the neurofibromin (NF1) gene is sufficient to produce the top-ranked score for this phenotype in DTC. The image shows a structural interpretation of the central domain of neurofibromin (with helices 6c and 7c in cyan and yellow respectively) bound to Ras (surface model, top) and the Leu residue at 1425 rendered as spheres. Substituting Arg for Leu at this position will disrupt the helical geometry and successful interaction with Ras. (5) Single variant. In Fig. 6e this single rare and functional variant in the insulin-like 3 (INSL3) gene is sufficient to produce the top-ranked score for this phenotype in DTC. The primary structure shows the variant lying within a conserved polar charged patch. The PDB structure 2H8B shows the two disulphide bonds that stabilise the structure. A mutation of the Arg to Cys could disrupt the polar charged region of the protein and even interfere with the correct formation of the disulphide bonds. (6) Novel variant in known gene. Two independent cases in the literature of missense mutations in SOX2 (G130A and A191T) mean that the gene in Fig. 6f is already known for the phenotype in the OMIM 28 database. The phenotype was correctly predicted by our work on a DDD patient due to a novel variant in the HMG-box domain of this known protein. (7) Combinatorial effect. The phenotype in Fig. 6g that was correctly predicted on a DDD patient requires 2 variants in CYP4B1 (in linkage disequilibrium), plus additional variants. There is a combinatorial component to the score which raises this patient to the top rank in DDD for this phenotype. The Venn diagram shows that the common variants are observed to co-occur (shaded areas) with the CYP4B1 variants much less than expected if the variants were independent (intersection areas). The top-ranked patient has all four variants. (8) Experimentally validated on HipSci. Figure 6h shows a homology model for F-box/WD repeat-containing protein 12 (FBXW12), built using PDB structure 1NEX as template, with the proline at residue 6 that mutates to leucine in the variant residing within the N-Proline box motif in the N-terminal F-box domain. A proline at the N-cap position mediates hydrophobic interactions similarly to the other N-cap residues Asn, Asp, Ser, and Thr. The mutation to leucine in this position impacts interactions within the motif, likely causing a change of specificity and/or affinity.

This paper describes a framework for performing and evaluating hypothesis-free phenotype prediction directly from a human genome. The key value is in providing potential genetic explanations for phenotypes that have been confirmed in the individual, which, given the high novelty of the output and its link to mechanism, has potential for application to genetics-led drug target identification. Studying the combined effects on complex phenotype of variants across multiple genes is often impossible with simple statistical models, because of the lack of statistical power on existing cohort sizes. Also due to the lack of statistical power, rare and low-frequency variants are under-interpreted in GWAS 7 , 8 . An ab initio model can partially overcome this limitation by testing a statistically small number of causally deduced direct predictions (a genetics-first approach). The knowledge-based framework with in silico and experimental validation described here has been shown to achieve this for exome missense variants. Our method, therefore, offers the ability to evaluate the effect of rarer genetic variants in a combinatorial way through linear algebra where statistical associative methods would not be applicable.

Hypothesis-free phenotype prediction with this genetics-first approach could be applied in principle to other ab initio models, but we chose to deploy a model based on protein domains, which are the functional units of proteins. Hidden Markov models built on protein domains 18 enable the quantification of structural and functional effects of variants, and our domain-centric gene ontology (dcGO 19 ) resource provides the link to phenotype. The domain-based model emphasises potential for novelty over coverage of known genes by carrying over the functional property of the domain across many genes.

Established genetic cohorts do not lend themselves well to testing a genetics-first approach since data are mostly available only for a restricted list of hypothesis-derived phenotypes – and usually not encoded in ontology terms, although increasingly attempts are being made to widen phenotype capture, e.g. using ICD-10 29 . The DDD cohort is one of the early large-scale studies to adopt HPO terms and was included in this analysis as a known reference point in the field. However, to truly test the genetics-first approach, we needed an interactive cohort: participants were recruited for their willingness to provide their genotype data first, then respond to personalised phenotype data collection post-prediction. Evaluation on both cohorts similarly confirms the predictions as highly significant, yet also characterises them as having a high false-positive rate; the expectation is that the causal genetic explanations will be true for roughly 1/3–1/2 of confirmed high-scoring predictions. Combining results per phenotype is more powerful, showing that explanations for about 26 of the top 40 confirmed phenotypes are likely to be true (Fig. 3d ). The predictions were also well-differentiated from results obtainable with other methods, as confirmed by comparison with the DDD annotations of DNMs.

This method offers a powerful discovery tool for hypothesis generation from genetic data. The tool does not replace or compete with existing tools for human genetics, which are largely aimed at being clinically actionable or offering effective intervention strategies. Rather, it complements and adds to them, aiming to enable medically high-value scientific discoveries first suggested from the analysis of genomic data. Independent validation of the hypotheses generated by the predictions with relevant assays is recommended. The expectation is that genes/variants will be mostly novel, and characterised by high magnitude of effect, often from rare variants, sometimes several at once, and occasionally acting combinatorially. In essence, the tool is a genetic outlier detector, so it is important to consider the background with respect to which the individual is an outlier. In this work, we used the 1000 Genomes Project phase 3 (1KGP3) 30 , 31 genomes as a diverse background with representation of all major ancestries in the DTC cohort, but by selecting a different background, the definition of an outlier can be adjusted to suit the desired research question.

The catalogue of high-scoring results from the analysis of the three cohorts (DTC, DDD and HipSci) provides over a hundred putative causal genetic explanations for numerous developmental, cellular and human/mouse/disease phenotypes. In addition to the supplementary data tables, they are also provided online with an interface to aid browsing and searching the variants, their classification by type, genes, scores and phenotypes ( https://supfam.org/nomaly ), also including the database of 5,857 ontology questions as a resource for others.

The framework could also be used with other functional effect predictors and model-organism ontologies, and with known genes included (to recover more of the less novel variants). We showed that even simple association statistics can be used within the framework, but that question selection is important, suggesting that a genetics-first step could be used to increase the power of PheWAS studies. Having proven the principle on missense variants, we expect future growth in this area by us and others to come from extension to other mutation types, e.g. indels and non-coding variants.

To sum up, the traditional approach to human genetics, where we ask “Does the data contain the answer to my question?”, has been turned on its head, and we instead ask: “For which questions does an answer lie within the data?”.

Ethics declarations

The DTC cohort study was granted ethics approval by the University of Bristol, via the Faculty of Engineering Research Ethics Committee with approval ID number 539500 (project ID 361 and amendment 2322). Informed consent was obtained from all participants. HipSci lines were derived from samples collected from consented research volunteers recruited through the NIHR Cambridge BioResource ( https://www.cambridgebioresource.org.uk ). Initially, 250 normal samples were collected under ethics for iPSC derivation (REC Ref: 09/H0304/77, V2 04/01/2013), which requires managed data access for all genetically identifying data, including genotypes, sequence and microarray data ('managed access samples'). In parallel, the HipSci consortium obtained new ethics approval for a revised consent (REC Ref: 09/H0304/77, V3 15/03/2013), under which all data, except those from the Y chromosome of males, can be made openly available (Y chromosome data can be used to re-identify men by surname matching); samples since October 2013 have been collected with this revised consent ('open access samples'). The majority of samples were European. Work performed in the laboratories has been compliant with the Institutional Review Board directives for the experimental work and use of data.

Nomaly framework

The framework consists of two primary parts: the predictive algorithm and the confirmation of predictions (Fig.  2 ). The algorithm takes an individual genetic data file (e.g. SNP array or exome) as input (top of Fig.  2 , input to orange box) and outputs phenotypes for which the individual is predicted to be an outlier (from orange box into blue box). The definition of ‘outlier’ is made relative to a background comprising thousands of genomes (bottom left of orange box). For the DTC and HipSci cohorts, the 1KGP3 genomes were used as a background, but for individuals in the DDD cohort, the cohort itself serves as the background. There are thousands of potential phenotypes, taken from 17 biomedical ontology databases, each assigned a score (below) for how much of an outlier it is for the individual in question against the background. In principle any genome or any biomedical ontology database can be used for a given study. Hidden Markov models (HMMs) are used to estimate deleteriousness; models from the domain databases SUPERFAMILY 18 and Pfam 32 were used, but in principle any HMMs can be used or any other measure of deleteriousness from a variant effect predictor.

The confirmation step (blue box in Fig.  2 ) is the assessment of whether an individual is indeed an outlier for a phenotype suggested by the genetics-first analysis. In principle the assessment can be carried out in any way that lends evidence to confirm the outlier (the DDD study used at least two certified clinical geneticists to perform assessment 22 ), but in this work automated matching of ontology terms was used. Predictions from the first step that are subsequently confirmed in this step become candidate hypotheses (output of blue box) linking a phenotype to variants via protein domains in genes.

Outlier scores (orange box in Fig.  2 ). The predictive algorithm includes (1) quantification of consequence of missense variants using evolutionary intolerance to the amino acid substitution in a protein domain, done by taking the difference in amino acid emission probability (analogous concept to FATHMM 33 ) and (2) generating variant lists for ontology terms through domain-centric linking to phenotypes (taken from dcGO) 18 , 19 . The mapping from variant to amino acids in proteins for (1) is done using the Variant Effect Predictor (VEP) tool 34 (N.B. VEP used only for genomic mapping and no other functionality or scores are used) and mapping to domains using HMM sequence matching. The mapping of domains to ontology terms in (2) is combined with the variants falling within them from (1) to give a list of variants, commonly across multiple proteins, for each ontology term. Each ontology term is processed independently.
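The domain-centric join in step (2) amounts to composing two mappings. A toy sketch with assumed dictionary structures; in the real pipeline the variant-to-domain mapping comes from VEP plus HMM sequence matching, and the domain-to-term mapping from dcGO.

```python
def variants_per_term(variant_to_domains, domain_to_terms):
    """Compose a variant->domains mapping with a domains->ontology-terms
    mapping to obtain, per ontology term, the set of variants falling in
    its linked domains (toy input structures)."""
    term_variants = {}
    for variant, domains in variant_to_domains.items():
        for dom in domains:
            for term in domain_to_terms.get(dom, ()):
                term_variants.setdefault(term, set()).add(variant)
    return term_variants
```

As in the text, a variant commonly reaches a term through domains shared across multiple proteins, so one variant can appear in several term lists.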

For a given term, a total genetic distance may be calculated between two individuals by summing the individual distances (defined as the log odds ratio of HMM probabilities for the two amino acids from the Dirichlet mixture) for all variants in the list for which they differ, depending on zygosity; distances are increased fourfold when both are homozygous and opposite. The all-against-all distances between the members of the background and the individual, and between each other, can be used to construct a distance matrix. This matrix is then translated to a similarity matrix through a Gaussian kernel and used as the input to spectral clustering to determine whether there is hidden structure in the genetic landscape of that term, namely by identifying the biggest gap in eigenvalues 21 . If no hidden structure is found, then the individual's outlier score will be equivalent to the average Euclidean distance to members of the background; in this case individuals with very rare and highly deleterious genotypes will have a high outlier score (Fig. 5c ). If a hidden structure is found by spectral clustering, K-means is then performed on the reduced-dimensional space derived from the top eigenvectors selected by the elbow method. In this scenario the outlier score becomes the sum of the local distance (from the cluster) and the global distance (between clusters) (Fig. 1c ). To normalise for cluster size, the global distance is multiplied by a penalty factor μ, in which γ specifies the penalty strength for large clusters; γ was set to 9 in this study, conferring a >99% penalty for large clusters containing over 50% of the entire cohort.
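The core of this scoring machinery (pairwise distance with the fourfold zygosity rule, the Gaussian-kernel similarity matrix, the eigen-gap structure test, and the no-structure fallback to mean distance) can be sketched as follows. This is an illustration only: the 0/1/2 genotype encoding, the kernel width sigma, the normalised-Laplacian formulation and all function names are assumptions, and the clustered-case scoring (K-means, local plus penalised global distance) is omitted.

```python
import numpy as np

def genetic_distance(geno_a, geno_b, variant_dist):
    """Total genetic distance between two individuals for one ontology
    term: per-variant distances are summed where genotypes differ, and
    quadrupled when the two are opposite homozygotes. Genotypes are
    encoded as alt-allele counts 0/1/2 (an assumption)."""
    total = 0.0
    for v, d in variant_dist.items():
        a, b = geno_a.get(v, 0), geno_b.get(v, 0)
        if a == b:
            continue
        total += 4 * d if {a, b} == {0, 2} else d
    return total

def eigen_gap_clusters(D, sigma=1.0):
    """Turn a distance matrix into a Gaussian-kernel similarity matrix
    and use the biggest gap in the (normalised) Laplacian eigenvalue
    spectrum to decide how many clusters are present; k = 1 means no
    hidden structure."""
    A = np.exp(-(D ** 2) / (2 * sigma ** 2))          # Gaussian kernel
    deg = A.sum(axis=1)
    L = np.eye(len(A)) - A / np.sqrt(np.outer(deg, deg))
    eigvals = np.sort(np.linalg.eigvalsh(L))
    return int(np.argmax(np.diff(eigvals))) + 1

def mean_distance_scores(D):
    """With no hidden structure, the outlier score reduces to the
    average distance of each genome to everyone else."""
    n = len(D)
    return D.sum(axis=1) / (n - 1)
```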

Finally, since phenotypes have very different score distributions, a transformation (first described in Zaucha et al. 35 ) is employed to generate a universal score function comparable between phenotype terms. The transformed score s trans ( p ) of participant p in a given term is derived from the original score s ( p ) and the rank s rank ( p ) of the original score within the term. φ determines the strength of the contribution of the rank to the total term score; it was set to 150 in this study, conferring a significant contribution on the top ~2%.

Combinatorial component

The non-linear method of clustering used can result in one or more variants being markedly rarer within one cluster, relative to the entire background, after genomes with other shared genotypes are grouped together. Thus a genome with a high global score due to the variants common to the cluster, and a high local score due to a variant rare only within that cluster, will have a higher total score than it would from its average Euclidean distance from the background. This additional score from variant rarity increases only in the presence of other specific variants, and is defined as the combinatorial contribution (as in Fig. 5d ).

Computational cost

Spectral clustering requires substantial computation, even when using low-level linear algebra BLAS 36 libraries run on many threads in parallel, because it involves solving eigenproblems of a large matrix, which do not scale linearly with matrix size. For guidance, 5,800 terms on a batch of 50 SNP array genotype files with 2,504 background genomes (a 2,554 × 2,554 matrix) takes 21 hours with 12 threads (9.5 hours with 48 threads). Solving ~2,000 terms (independent eigenproblems) on a 3,600 × 3,600 matrix of exomes (including background) takes 3 hours using 600 threads.


Participants with access to personal direct-to-consumer genotype data were recruited anonymously online. Some participants were recruited in collaboration with OpenSNP 37 and (separately) Sano Genetics. In this cohort 2,248 participations were recorded, with the participant website accessed from all over the world (Supplementary Fig. 1a ). DTC genome data uploaded by participants were processed into a homogeneous format and quality-controlled with GenomePrep 38 , which also detects the sequencing method/genotyping array version. Imputation was not used for the analysis presented in this paper, although the main result of Fig. 3 was replicated on data imputed from the DTC genotype data (Supplementary Fig. 3 ) to confirm that there was little difference – only 4 terms from outside the top 50 entered the top 40, with another 4 from the top 50 making a minor change by moving up into the top 40. The IMPUTE2 39 package was used with help from VCFtools 40 and BCFtools from SAMtools 41 . A similarity matrix of all genomes in the cohort was calculated, consanguineous relationships were recorded and genetic duplicates removed; people submitting independent files from multiple providers were only allowed to participate once (Supplementary Fig. 1c–e ). To prevent 'gaming' of the study, for each participant a genome from the background was randomly selected, so that the 25 ontology terms with the highest outlier scores for the participant could be randomly mixed with the 25 top-scoring ontology terms predicted for the background decoy genome, generating a personalised questionnaire with 50 questions. Each participant was invited to give a binary 'yes' or 'no' answer on whether they self-identify each phenotype, with the option to leave a comment (Supplementary Fig. 1b ). In a small number of cases participants were recalled and invited to provide information supporting their answers for phenotypes of interest.
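The anti-gaming questionnaire construction (25 participant terms shuffled with 25 decoy terms) can be sketched as below; the function name and the real/decoy tags are hypothetical bookkeeping for illustration, and participants would never see which questions are theirs.

```python
import random

def build_questionnaire(participant_scores, decoy_scores, n=25, seed=None):
    """Mix the participant's top-n ontology terms with the top-n terms
    of a randomly chosen decoy genome, shuffled together so real and
    decoy questions are indistinguishable to the participant."""
    rng = random.Random(seed)
    def top(scores):
        return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:n]]
    questions = [(t, "real") for t in top(participant_scores)]
    questions += [(t, "decoy") for t in top(decoy_scores)]
    rng.shuffle(questions)
    return questions
```

Answers to the decoy half provide the 0.8% baseline self-identification rate used in the evaluation above.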

The 1KGP3 genomes were used as the background. Before recruitment, binary yes/no questions were designed with the aim of identifying outlier phenotypes for 5,857 ontology terms, including terms from the Gene Ontology (GO) 26 , Human Phenotype Ontology (HPO) 25 , Disease Ontology (DO) 42 , Medical Subject Headings (MeSH) 43 , and Mammalian Phenotype (MP) 44 ontology databases (e.g. in Table 1 ). See Supplementary Data 3 at https://supfam.org/nomaly for an interactive version and database of ontology term to phenotype question mappings. Participant genome files were processed continuously in batches (eliminating within-batch relatedness), giving an approximately 4–12 hour turnaround between submission of a file and generation of the personalised questionnaire. Results can be influenced by the other genomes in a batch, which effectively become part of the background for each other, because all parts of the similarity matrix interact with each other during the solution of the eigenproblem.

We started with questions designed manually for 5,857 different ontology terms. Not all questions were answered, but in the end 94,966 binary self-identified answers were received for 3,672 questions, of which 1,408 questions received at least one positive answer. Although questions were designed for participants to self-identify only when they are phenotypic outliers, often this was not achieved. To illustrate, a hypothetical bad question is "Do you have myopia?" whereas a good question would be "Do you have myopia worse than -6 dioptres?". Therefore, per-phenotype analysis was carried out for the 342 ontology terms for which the rate of participants answering 'yes' was non-zero and below 5%, and at least 20 answers in total were recorded.

Permutation tests

Statistical evaluation in Fig. 3 was carried out using permutation tests with 100,000 iterations randomly re-allocating the outlier scores, testing the null hypothesis that similar results can be obtained if scores are assigned randomly. For panel b, the average outlier score given to questions that received positive answers (about 0.011) is compared to the averages from random permutations of the dataset. In panel d of Fig. 3 , permuting the scores for each phenotype separately yields a p -value on the sum of observed scores matching positive answers for each phenotype. All phenotypes can then be ranked by p -value (along the x -axis) for how well the observed scores predict the answers for that phenotype. The observed scores are then randomly permuted and all phenotype p -values recalculated; repeating this 100,000 times gives a mean and standard deviation for the number of phenotypes expected by chance at any given p -value. At each point on the x -axis, subtracting the number of phenotypes expected by chance from the observed top phenotypes ( x ) gives the blue line, with confidence intervals; that is, the number of phenotypes ( y ) likely to be true out of the top x phenotypes ranked by how well outlier scores match the answers.
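The expected-by-chance subtraction for the top-x phenotypes can be sketched as follows, with precomputed rows of per-phenotype p-values standing in for the 100,000 permuted datasets; the function and argument names are our own, and the confidence-interval calculation is omitted.

```python
import numpy as np

def phenotypes_beyond_chance(observed_pvals, permuted_pvals, top_x):
    """For the top_x phenotypes ranked by observed p-value, subtract the
    mean number of phenotypes reaching the same p-value cut-off across
    permuted datasets (one row of per-phenotype p-values per permutation)."""
    cutoff = np.sort(np.asarray(observed_pvals))[top_x - 1]
    expected = np.mean([(np.asarray(row) <= cutoff).sum()
                        for row in permuted_pvals])
    return top_x - expected
```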

Data source

Exome data from the DDD 1133-trio sequencing VCF files (accession code EGAD00001001355), the 1133 trio family relations, phenotype datasets, and validated DNMs (EGAD00001001413) were obtained from the European Genome-phenome Archive at the European Bioinformatics Institute (EGA, https://ega-archive.org/ ).

We ran the 1133 DDD probands, together with 1KGP3 genomes as background, using HPO as the ontology database. Only variants that passed all filters (as specified in the metadata from the EGA downloads) were included. HPO terms used to clinically describe phenotypes in the 1133 trios were mapped to HPO terms in the v1.2 database used in our predictions. Due to the nature of the method, running all of the probands at the same time, in one distance matrix, means that the background is a mixture of 1KGP3 genomes and the other probands. If there were a common genetic cause shared by many probands, it would not receive a high outlier score, but we assume enough genetic causes of the disorders are sufficiently independent to be eligible for detection.

For the DDD cohort, evaluation was automated through direct comparison, for each patient, of whether a high-scoring HPO term closely matches a respective clinical annotation. Defining a close match between ontology terms is challenging because the distance over the ontology graph varies wildly in biological meaning; adjacent terms can be very similar or very different. To define closeness, a measure of information content through the graph is needed, for which we used the cumulative number of patients annotated by terms traversing the graph as a metric. The close matches between HPO terms are listed in Supplementary Data 4. For example, we found the following four terms sufficiently similar to define as close to each other: HP:0004097 'Deviation of finger' (2 probands), super-term HP:0009484 'Deviation of the hand or of fingers of the hand' (1 proband), and sub-terms HP:0009179 and HP:0004209, 'deviation' and 'clinodactyly' of the 5th finger (56 and 2 probands respectively).
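One way to make such a closeness rule concrete is sketched below. This is a hypothetical illustration, not the published definition: the parent map, counts, and ratio threshold are invented, and the real mapping is the one listed in Supplementary Data 4.

```python
def close(term_a, term_b, parents, counts, max_ratio=2.0):
    """Call two ontology terms close if one is an ancestor of the other
    and the cumulative patient counts do not diverge by more than
    max_ratio (a stand-in for an information-content criterion)."""
    def ancestors(t):
        seen, stack = set(), [t]
        while stack:
            cur = stack.pop()
            if cur in seen:
                continue
            seen.add(cur)
            stack.extend(parents.get(cur, []))
        return seen

    if term_b in ancestors(term_a):
        lo, hi = term_a, term_b  # hi is the more general term
    elif term_a in ancestors(term_b):
        lo, hi = term_b, term_a
    else:
        return False
    return counts[hi] <= max_ratio * max(counts[lo], 1)
```

Under this toy rule a term and its immediate super-term with similar annotation counts are close, while a specific term and the near-root of the ontology are not.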

It should be noted that the evaluation assumes all potential HPO terms in the DDD database are considered by clinicians, and that an HPO term is not true for the patient if it is not annotated. This represents an underestimation of true phenotypes, but is a necessary assumption for a fair and automatic evaluation.

Comparison to published diagnosis by DDG2P

For patients with true positive predictions, we checked against the list of diagnostic variants in DDG2P from the initial publication 6 and the list of validated DNMs (obtained from EGA) to see whether there is a genetic diagnosis and whether there is a potential uninterpreted DNM. The 1133-trio DNM list showed that at least one DNM was found for 738 (65%) of the children (excluding synonymous, intronic, and intergenic DNMs).

HipSci cohort

HipSci exome sequencing data for healthy and diseased people ( EGAD00001003514 , EGAD00001003521 , EGAD00001003522 , EGAD00001003524 , EGAD00001003525 , EGAD00001003516 , EGAD00001003526 , EGAD00001003527 , EGAD00001003517 , EGAD00001003161 , EGAD00001003518 , EGAD00001003519 , EGAD00001003520 ) were obtained from EGA. HipSci open-access exome-sequencing data were downloaded directly from the hosting website ( https://www.hipsci.org/data ).

For each donor, one cell line was selected for the genetics-first prediction according to the following criteria: use primary tissue data if available; otherwise use the iPSC line with the fewest changes from the cell of origin, as measured by the number of differences per Mbp, excluding lines for which the pluritest or custom CNV check is missing. As a result, 437 cell lines from different donors were selected, and the corresponding exome files were processed using the 1KGP3 genomes as background, predicting from a set of 5,805 possible GO terms.
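The stated selection rule can be sketched as follows (the field names such as `is_primary`, `pluritest`, `cnv_check` and `diffs_per_mbp` are hypothetical, not HipSci metadata fields):

```python
def select_line(lines):
    """Select one cell line per donor: prefer primary tissue; otherwise
    the iPSC line with the fewest differences per Mbp among lines that
    have both QC checks present. Returns None if nothing qualifies."""
    primary = [l for l in lines if l["is_primary"]]
    if primary:
        return primary[0]
    candidates = [l for l in lines
                  if l.get("pluritest") is not None
                  and l.get("cnv_check") is not None]
    return min(candidates, key=lambda l: l["diffs_per_mbp"], default=None)
```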

There was no confirmation step as with DTC and DDD, so no phenotype data was used. A list of phenotypes with several predicted outliers was examined to find one potentially verifiable by experiment. The list was initially narrowed to four candidates that could be tested on iPS-derived macrophage cells and two candidates that could be tested directly in iPS cells. Expression analysis of genes harboring the variants of interest uncovered a lack of expression for GO:0002741 and GO:1900025 in macrophage, so these were eliminated. The variants implicated in terms GO:0035718 and GO:0002840 are involved in cell signalling (e.g. from the thyroid) and were eliminated as too experimentally complicated. One term, GO:1901223, was excluded due to the lack of availability of a differentiated cell line for a key donor with the variant. Finally, GO:0010826 was selected for experimental validation because five iPS cell lines predicted as outliers were available.

Experimental test for GO:0010826

The Hoik-1 (HPSI0314i-hoik_1), Sehp-2 (HPSI0115i-sehp_2) and Kegd-2 (HPSI0614i-kegd_2) cell lines were selected as controls. The Suul-1 (HPSI0514i-suul_1), Yoch-6 (HPSI0215i-yoch_6), Boqx-2 (HPSI0115i-boqx_2), Zapk-3 (HPSI0114i-zapk_3) and Iuoc-2 (HPSI0516i-iuoc_2) cell lines were tested. γ-tubulin was used as a centriole marker 45 . 2.5 × 10⁴ cells were plated onto coverslips maintained in 24-well multiwell plates and grown for 2 days. Cells were fixed in cold methanol for 5 min, rinsed, and incubated with 3% (wt/vol) BSA for 1 h. Cells were incubated with γ-tubulin antibody (Sigma-Aldrich, T5326), washed and incubated with the appropriate fluorescent secondary antibody conjugated to Alexa 555 (Invitrogen). DAPI (Thermo Fisher Scientific) was used as a counterstain. Coverslips were mounted using ProLong Gold antifade reagent (Thermo Fisher Scientific). Images were acquired with a Nikon A1R confocal microscope. Brightness and contrast were optimised with ImageJ (National Institutes of Health) and Photoshop (Adobe Systems). Quantification of centrioles was performed manually using ImageJ. Dilutions: the γ-tubulin antibody was used at 1:5000, the secondary antibody at 1:500 and DAPI at 1:2000.

Association and confounders

Confounding variables.

Twelve potentially confounding variables were analysed on the DTC cohort data: sex, the first 10 principal components of ancestry, and the genotype array type. Each of the variables was assessed as a predictor on the DTC cohort under the same permutation test as our outlier scores. The correlation coefficients and t-test statistics were also calculated between each variable and the outlier scores for every phenotype. Whilst there is no theoretical reason to believe that predictions from a non-associative method should be confounded by covariates of association, we nevertheless checked this against the above statistics. Perl with the PDL and Statistics packages was used, as well as the glm function in R.
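The per-variable correlation check can be illustrated with a plain-Python Pearson correlation (the paper used Perl/PDL and R; this dependency-free version is ours, for illustration only):

```python
def pearson(x, y):
    """Pearson correlation coefficient between a candidate confounding
    variable x and the outlier scores y for one phenotype."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5
```

A coefficient near zero for each of the twelve variables is what one would expect if the outlier scores are not confounded by them.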

Variant associations

There is no expectation that methods based on association will perform well under the same conditions, e.g. a high false discovery rate, as the predictor exemplified in this work. However, as a reference point we calculated GWAS statistics on the DTC cohort data using PLINK 46 and subjected them to a 1000-iteration permutation test as in Fig. 3d. The Python packages SciPy and NumPy were used here and in other parts of the work. Each phenotype was treated as a cohort, with the answers to questions determining case vs control classification. The confounding variables above were used as covariates for the analysis. Within the framework of our reverse approach, predictor scores were used to allocate some of the questions, so to test whether the selection of questions contributes to the positive GWAS results, we also repeated the analysis with data points scoring >0.022 removed. The GWAS results are summarised in Supplementary Fig. 2. GWAS were performed using a standard approach, as outlined below.

(i) DTC genome quality check. We started with autosomal, bi-allelic SNPs in the DTC participants that had missingness <1% and passed an AB-ratio binomial test with Z-score <3 against the 1000 Genomes Project phase 3 data (1KGP3) overall and its EUR superpopulation (from GenomePrep). (ii) Selection of high-quality common SNPs. We restricted variants to those with frequency >5% in 1KGP3, and excluded variants in complex regions (from https://genome.sph.umich.edu/wiki/Regions_of_high_linkage_disequilibrium_(LD) ) and variants where the ref/alt combination was CG or AT. We removed all SNPs out of Hardy-Weinberg equilibrium (HWE) at a cut-off of pHWE < 1e-8, and LD-pruned using PLINK2 with r2 = 0.1 and 500-kb windows. The resulting 22,103 high-quality sites were used for kinship and genetic-component (PCA) analysis. (iii) Kinship and ancestry calculation. Data were merged with 1KGP3 and kinship coefficients were calculated among all pairs of samples using PLINK2 and its implementation of the KING robust algorithm. A kinship cut-off of 0.0884 was used to select unrelated individuals (--king-cutoff). Ancestry was inferred via PCA on unrelated 1KGP3 individuals with GCTA v1.93 (plink2) using the high-quality common SNPs. Variant QC with PLINK v1.9 excluded HWE deviations with p-value <1e-20 and multi-allelic variants, and filtered variants by missingness <0.02. (iv) GWAS. A total of 4,946,035 associations were calculated, and the multi-phenotype association significance level after multiple-hypothesis correction is 1.01 × 10⁻⁸. GWAS was run for each phenotype term using 'Yes' answers as cases and 'No' answers as controls. Covariate analysis was implemented with logistic regression in PLINK 1.9; sex was imputed (female 1132, male 848, ambiguous 73), and the first 10 PCs from the genetic ancestry analysis and the version of the genotyping array were used as covariates.
The permutation plots were generated by running the association on 1000 randomisations of the data and plotted as per Fig. 3d.
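As a sanity check on the quoted significance level, Bonferroni correction of α = 0.05 over the stated number of association tests reproduces it:

```python
# Bonferroni correction: alpha divided by the number of association tests.
alpha = 0.05
n_tests = 4_946_035
threshold = alpha / n_tests
print(f"{threshold:.3g}")  # prints 1.01e-08, the level quoted above
```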

Reporting summary

Further information on research design is available in the  Nature Portfolio Reporting Summary linked to this article.

Data availability

All data supporting the findings described in the manuscript are available in the article, supplementary information files, or from the corresponding author on request. Additionally, supplementary data are made available in an interactive, searchable format via the project webpage at https://supfam.org/nomaly . The DTC cohort data may not be made publicly available because participants have not consented to this, but on application to the corresponding author, requests falling within the constraints of the ethical approval granted for the project will be responded to within 14 days. DTC data can be made available under MTA to academic organisations, subject to MRC institutional approval and compliance with all relevant data protection laws and requirements. Access is time-limited because we are required to delete all participant data when our work on the DTC cohort ends, which could be before the end of 2023. The database of questions corresponding to 5,857 ontology terms (Supplementary Data 3) is also available via the resources webpage (above) and may be a valuable resource for other studies. The similarity mapping, by information content, of all HPO terms close to the HPO terms used in clinical annotations by DDD (Supplementary Data 4) is also made available on the resources webpage. All data on the resources webpage are also available for download in JSON format. Access to the 1KGP3, DDD and HipSci cohorts is only available via those projects directly. Specifically: access to 1KGP3 is made publicly available by the International Genome Sample Resource (IGSR), with data sets accessible from the data portal ( https://www.internationalgenome.org/data-portal ). The DDD cohort data are available from the European Genome-phenome Archive (EGA, https://ega-archive.org/ ) under study ID EGAS00001000775. To access these data sets, please contact [email protected].
An overview of publicly available HipSci cell lines and assay data is provided in the cell lines and data browser ( https://www.hipsci.org/data ), which also links to publicly available HipSci data in the EBI data archives. To access the managed-access genetic and genomic data in HipSci, please follow the steps stated in the data browser ( https://www.hipsci.org/data#overview ) and apply via the Wellcome Trust Sanger Institute’s Electronic Data Access Mechanism ( https://www.sanger.ac.uk/legal/DAA ).

Code availability

Code for the preparation, parsing and classification of DTC data files is available via the GenomePrep 38 resource, already published as an ancillary output of this work, which may be of value to others. The website running the pipeline that provides questions to participants for the DTC study is available as a free service for up to 500 genotype files per submission (~24 hr turnaround), and on request for larger datasets; we wish to collaborate with as many genetic cohorts as possible to provide predictions that can be made available within each of their data-sharing portals to their registered users. The code used for the phenotype prediction to exemplify the Nomaly framework is not available for download. Individual applications for a license for non-commercial use are possible via the resources webpage ( https://supfam.org/nomaly ), to be considered on a case-by-case basis per project. Pipeline scripts behind the DTC cohort website are available via the resources website on request (not available for direct download due to web-security risks). Scripts for conducting the permutation tests, generating the statistics for covariates and producing Fig. 5 are available for direct download via the resources website.

Wray, N. R., Goddard, M. E. & Visscher, P. M. Prediction of individual genetic risk of complex disease. Curr. Opin. Genet. Dev. 18, 257–263 (2008).

Manolio, T. A. et al. Finding the missing heritability of complex diseases. Nature 461, 747–753 (2009).

Visscher, P. M. et al. 10 Years of GWAS Discovery: Biology, Function, and Translation. Am. J. Hum. Genet. 101 , 5–22 (2017).

MacArthur, D. G. et al. Guidelines for investigating causality of sequence variants in human disease. Nature 508 , 469–476 (2014).

Samocha, K. E. et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet 46 , 944–950 (2014).

The Deciphering Developmental Disorders Study. Large-scale discovery of novel genetic causes of developmental disorders. Nature 519 , 223–228 (2015).

Altman, N. & Krzywinski, M. Testing for rare conditions. Nat. Methods 18 , 224–225 (2021).

Tam, V. et al. Benefits and limitations of genome-wide association studies. Nat. Rev. Genet. 20 , 467–484 (2019).

Richards, S. et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 17 , 405–423 (2015).

Minikel, E. V. et al. Evaluating drug targets through human loss-of-function genetic variation. Nature 581 , 459–464 (2020).

Wang, Q., Dhindsa, R.S., Carss, K. et al. Rare variant contribution to human disease in 281,104 UK Biobank exomes. Nature 597 , 527–532 (2021).

Drouin, A. et al. Interpretable genotype-to-phenotype classifiers with performance guarantees. Sci. Rep. 9 , 1–13 (2019).

Davis, J. J. et al. Antimicrobial resistance prediction in PATRIC and RAST. Sci. Rep. 6 , 1–12 (2016).

Yu, M. K. et al. Translation of genotype to phenotype by a hierarchy of cell subsystems. Cell Syst. 2 , 77–88 (2016).

Ma, J. et al. Using deep learning to model the hierarchical structure and function of a cell. Nat. Methods 15 , 290–298 (2018).

Grinberg, N. F., Orhobor, O. I. & King, R. D. An evaluation of machine-learning for predicting phenotype: studies in yeast, rice, and wheat. Mach. Learn. 109, 251–277 (2019).

Costanzo, M. et al. The genetic landscape of a cell. Science 327, 425–431 (2010).

de Lima Morais, D. A. et al. SUPERFAMILY 1.75 including a domain-centric gene ontology method. Nucleic Acids Res 39 , D427–D434 (2011).

Fang, H. & Gough, J. dcGO: database of domain-centric ontologies on functions, phenotypes, diseases and more. Nucleic Acids Res 41 , D536–D544 (2013).

Fang, H. & Gough, J. A domain-centric solution to functional genomics via dcGO Predictor. BMC Bioinformatics 14, S9 (2013).

Shi, J. & Malik, J. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22 , 888–905 (2000).

Wright, C. F. et al. Genetic diagnosis of developmental disorders in the DDD study: A scalable analysis of genome-wide research data. Lancet 385 , 1305–1314 (2015).

Wright, C. F. et al. Making new genetic diagnoses with old data: iterative reanalysis and reporting from genome-wide data in 1,133 families with developmental disorders. Genet. Med. 20 , 1216–1223 (2018).

Kilpinen, H. et al. Common genetic variation drives molecular heterogeneity in human iPSCs. Nature 546 , 370–375 (2017).

Köhler, S. et al. The human phenotype ontology in 2021. Nucleic Acids Res. 49 , D1207–D1217 (2021).

Ashburner, M., Ball, C., Blake, J. et al. Gene Ontology: tool for the unification of biology. Nat Genet 25 , 25–29 (2000).

Landrum, M. J. et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 42 , D980 (2014).

Amberger, J., Bocchini, C. A., Scott, A. F. & Hamosh, A. McKusick’s online mendelian inheritance in man (OMIM®). Nucleic Acids Res. 37 , D793 (2009).

World Health Organization‎. ICD-10: international statistical classification of diseases and related health problems: tenth revision, 2nd ed. (World Health Organization, 2004).

Auton, A. et al. A global reference for human genetic variation. Nature 526 , 68–74 (2015).

Fairley, S., Lowy-Gallego, E., Perry, E. & Flicek, P. The International Genome Sample Resource (IGSR) collection of open human genomic variation resources. Nucleic Acids Res. 48 , D941–D947 (2020).

Mistry, J. et al. Pfam: the protein families database in 2021. Nucleic Acids Res. 49 , D412–D419 (2021).

Shihab, H. A. et al. Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models. Hum. Mutat. 34, 57–65 (2013).

McLaren, W. et al. The ensembl variant effect predictor. Genome Biol. 17 , 122 (2016).

Zaucha, J. et al. A proteome quality index. Environ. Microbiol. 17 , 4–9 (2015).

Blackford, L.S. et al. An updated set of basic linear algebra subprograms (BLAS). ACM Transactions on Mathematical Software , 28 , 135–151 (2002).

Greshake, B., Bayer, P. E., Rausch, H. & Reda, J. openSNP–A Crowdsourced Web Resource for Personal Genomics. PLoS One 9 , e89204 (2014).

Lu, C., Greshake Tzovaras, B. & Gough, J. A survey of direct-to-consumer genotype data, and quality control tool (GenomePrep) for research. Comput Struct. Biotechnol. J. 19 , 3747–3754 (2021).

Howie, B., Fuchsberger, C., Stephens, M., Marchini, J. & Abecasis, G. R. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat. Publ. Group 44 , 955–959 (2012).

Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27 , 2156–2158 (2011).

Li, H. & Barrett, J. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27 , 2987–2993 (2011).

Schriml, L. M. et al. Disease ontology: a backbone for disease semantic integration. Nucleic Acids Res. 40 , D940–D946 (2012).

Lipscomb, C. E. Medical subject headings (MeSH). Bull. Med Libr Assoc. 88 , 265 (2000).

Smith, C. L. & Eppig, J. T. The mammalian phenotype ontology: Enabling robust annotation and comparative analysis. Wiley Interdiscip. Rev. Syst. Biol. Med 1 , 390–399 (2009).

Moritz, M., Braunfeld, M. B., Sedat, J. W., Alberts, B. & Agard, D. A. Microtubule nucleation by γ-tubulin-containing rings in the centrosome. Nature 378, 638–640 (1995).

Purcell, S. et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet 81 , 559 (2007).

Waskom, M. L. Seaborn: statistical data visualization. J. Open Source Softw. 6 , 3021 (2021).


We extend our deepest gratitude to all of the DTC participants from numerous countries who made this work possible, generously donating their time and sharing their personal and genetic data for no reward other than the satisfaction of contributing to medical research for the global good. The DDD study presents independent research commissioned by the Health Innovation Challenge Fund [grant number HICF-1009-003], a parallel funding partnership between Wellcome and the Department of Health, and the Wellcome Sanger Institute [grant number WT098051]. The views expressed in this publication are those of the author(s) and not necessarily those of Wellcome or the Department of Health. The study has UK Research Ethics Committee approval (10/H0305/83, granted by the Cambridge South REC, and GEN/284/12 granted by the Republic of Ireland REC). The research team acknowledges the support of the National Institute for Health Research, through the Comprehensive Clinical Research Network. This study makes use of data generated by the HipSci Consortium, funded by the Wellcome Trust and the MRC. We thank Sano Genetics for helping to recruit 557 participants. We thank the LMB visual aids department for help with the figures. We thank Jake Grimmett and Toby Darling for their help with high-performance computing. This work was supported by the Medical Research Council, as part of UK Research and Innovation [MC_UP_1201/14], and a grant from the Biotechnology and Biological Sciences Research Council [BB/N019431/1]. Kings College investigators are supported by the Wellcome Trust and MRC through the Human Induced Pluripotent Stem Cell Initiative [WT098503]. This work was partly supported by the Center for Research and Interdisciplinarity (CRI) Research Fellowship to Bastian Greshake Tzovaras, who also thanks the Bettencourt Schueller Foundation long term partnership.

Author information

Authors and affiliations

MRC Laboratory of Molecular Biology, Cambridge Biomedical Campus, Francis Crick Avenue, Cambridge, CB2 0QH, UK

Chang Lu, Rihab Gam, Arun Prasad Pandurangan, Himani Tandon, Raju Kalaivani, Minkyung Sung & Julian Gough

Department of Computer Science, University of Bristol, Bristol, BS8 1UB, UK

Jan Zaucha, Hai Fang,  Ben Smithers, Matt E. Oates, Natalie Zelenka, Hashem Shihab, Adam J. Sardar & Julian Gough

Shanghai Institute of Hematology, State Key Laboratory of Medical Genomics, National Research Centre for Translational Medicine at Shanghai, Ruijin Hospital affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China

Centre for Gene Therapy and Regenerative Medicine, King’s College London, Guy’s Hospital, Floor 28, Tower Wing, Great Maze Pond, London, SE1 9RT, UK

Miguel Bernabe-Rubio, James Williams & Davide Danovi

Université de Paris, INSERM U1284, Center for Research and Interdisciplinarity (CRI), Paris, France

Bastian Greshake Tzovaras


J.G. conceived the DTC study. C.L. and J.G. conceived the studies on the DDD and HipSci cohorts, and designed and carried out the DTC, DDD and HipSci studies. J.Z., H.F., N.Z. and J.G. developed and implemented the method, with additional contributions to development from C.L., B.S., M.O. and H.S. R.G., C.L., A.P.P., R.K., M.S., A.S., M.O. and J.G. designed the questionnaires. C.L., B.G.T. and J.G. contributed to participant recruitment and correspondence. C.L. and J.G. created and maintained the prediction pipeline, performed data analysis and developed the framework. M.B-R., J.W. and D.D. designed and performed the HipSci centriole experiments. A.P.P. and H.T. performed the structure-based mutational effect analysis. C.L. and J.G. wrote the manuscript. J.Z., H.F. and D.D. contributed significantly to the manuscript. J.G. invented the system, obtained funding, and initiated and supervised the project.

Corresponding author

Correspondence to Julian Gough .

Competing interests

Statement: B.G.T. is a co-founder of the OpenSNP database (non-financial competing interest). Patent application (Jan 2016): application number WO2017125778A1; inventors J.G., J.Z. and N.T.; author filed; status is granted in Japan and under examination in other countries; relevant to the method of spectral clustering applied to phenotype prediction. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

  • Supplementary Information
  • Description of Additional Supplementary Files
  • Supplementary Data 1–5
  • Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article

Lu, C., Zaucha, J., Gam, R. et al. Hypothesis-free phenotype prediction within a genetics-first framework. Nat Commun 14 , 919 (2023). https://doi.org/10.1038/s41467-023-36634-6

Received : 16 February 2022

Accepted : 10 February 2023

Published : 17 February 2023

DOI : https://doi.org/10.1038/s41467-023-36634-6

Evidence for evolution

Key points:

  • Anatomy. Species may share similar physical features because the feature was present in a common ancestor (homologous structures).
  • Molecular biology. DNA and the genetic code reflect the shared ancestry of life. DNA comparisons can show how related species are.
  • Biogeography. The global distribution of organisms and the unique features of island species reflect evolution and geological change.
  • Fossils. Fossils document the existence of now-extinct past species that are related to present-day species.
  • Direct observation. We can directly observe small-scale evolution in organisms with short lifecycles (e.g., pesticide-resistant insects).


Evolution happens on large and small scales.

  • Macroevolution, which refers to large-scale changes that occur over extended time periods, such as the formation of new species and groups.
  • Microevolution, which refers to small-scale changes that affect just one or a few genes and happen in populations over shorter timescales.

The evidence for evolution

Anatomy and embryology; homologous features; analogous features; determining relationships from similar features; molecular biology. At the molecular level, all living organisms share:

  • The same genetic material (DNA)
  • The same, or highly similar, genetic codes
  • The same basic process of gene expression (transcription and translation)
  • The same molecular building blocks, such as amino acids

Homologous genes

Biogeography; fossil record; direct observation of microevolution. As an example of direct observation, consider the evolution of DDT resistance in mosquitoes:

  • Before DDT was applied, a tiny fraction of mosquitos in the population would have had naturally occurring gene versions (alleles) that made them resistant to DDT. These versions would have appeared through random mutation, or changes in DNA sequence. Without DDT around, the resistant alleles would not have helped mosquitoes survive or reproduce (and might even have been harmful), so they would have remained rare.
  • When DDT spraying began, most of the mosquitos would have been killed by the pesticide. Which mosquitos would have survived? For the most part, only the rare individuals that happened to have DDT resistance alleles (and thus survived being sprayed with DDT). These surviving mosquitoes would have been able to reproduce and leave offspring.
  • Over generations, more and more DDT-resistant mosquitoes would have been born into the population. That's because resistant parents would have been consistently more likely to survive and reproduce than non-resistant parents, and would have passed their DDT resistance alleles (and thus, the capacity to survive DDT) on to their offspring. Eventually, the mosquito populations would have bounced back to high numbers, but would have been composed largely of DDT-resistant individuals.
  • Homologous structures provide evidence for common ancestry, while analogous structures show that similar selective pressures can produce similar adaptations (beneficial features).
  • Similarities and differences among biological molecules (e.g., in the DNA sequence of genes) can be used to determine species' relatedness.
  • Biogeographical patterns provide clues about how species are related to each other.
  • The fossil record, though incomplete, provides information about what species existed at particular times of Earth’s history.
  • Some populations, like those of microbes and some insects, evolve over relatively short time periods and can be observed directly.
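The DDT scenario described above can be illustrated with a toy haploid selection model (the survival probabilities, population size and starting allele frequency are invented for illustration):

```python
import random

def simulate(generations=30, pop=10_000, p0=0.01,
             survive_r=0.9, survive_s=0.1, seed=1):
    """Toy model of DDT resistance: each generation, resistant (R) and
    susceptible (S) mosquitoes survive spraying with different
    probabilities, and survivors repopulate to a constant size."""
    rng = random.Random(seed)
    p = p0  # frequency of the resistance allele
    for _ in range(generations):
        n_r = int(p * pop)
        r = sum(rng.random() < survive_r for _ in range(n_r))
        s = sum(rng.random() < survive_s for _ in range(pop - n_r))
        p = r / (r + s)  # new resistance-allele frequency
    return p
```

Starting at a 1% allele frequency, repeated rounds of spraying drive the resistance allele toward fixation, matching the narrative above.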


Works cited:

  • Nothing in biology makes sense except in the light of evolution. (2016, April 6). Retrieved May 15, 2016 from Wikipedia: https://en.wikipedia.org/wiki/Nothing_in_Biology_Makes_Sense_Except_in_the_Light_of_Evolution .
  • Wilkin, D. and Akre, B. (2016, March 23). Comparative anatomy and embryology - Advanced. In CK-12 biology advanced concepts . Retrieved from http://www.ck12.org/book/CK-12-Biology-Advanced-Concepts/section/10.22/ .
  • Reece, J. B., Urry, L. A., Cain, M. L., Wasserman, S. A., Minorsky, P. V., and Jackson, R. B. (2011). Anatomical and molecular homologies. In Campbell biology (10th ed., p. 474). San Francisco, CA: Pearson.
  • Chapman, B. R. and Bolen, E. G. (2015). Convergent evolution [Glossary entry]. In Ecology of North America (2nd ed., p. 311). West Sussex, UK: John Wiley & Sons.
  • Insulin. (2014, June 6). In UCSD signaling gateway . Retrieved from http://www.signaling-gateway.org/molecule/query?afcsid=A004315&type=orthologs&adv=latest .
  • Wilkin, D. and Akre, B. (2016, March 23). Evolution and the fossil record - Advanced. In CK-12 biology advanced concepts . Retrieved from http://www.ck12.org/book/CK-12-Biology-Advanced-Concepts/section/10.21/ .
  • Reece, J. B., Taylor, M. R., Simon, E. J., and Dickey, J. L. (2011). Scientists can observe natural selection in action. In Campbell biology: Concepts & connections (7th ed., p. 259). Boston, MA: Benjamin Cummings.


Wobble Hypothesis (With Diagram) | Genetics



In this article we will discuss the concept of the wobble hypothesis.

Crick (1966) proposed the ‘wobble hypothesis’ to explain the degeneracy of the genetic code. Except for tryptophan and methionine, more than one codon directs the synthesis of each amino acid. Since 61 codons specify amino acids, one might expect 61 tRNAs, each with a different anticodon. The total number of tRNAs, however, is less than 61.

This can be explained if the anticodons of some tRNAs read more than one codon. Moreover, the identity of the third base of a codon often appears to be unimportant. For example, CGU, CGC, CGA and CGG all code for arginine: CG specifies arginine, and the third letter does not matter. Conventionally, codons are written from the 5′ end to the 3′ end.

In such cases, then, the first and second bases alone specify the amino acid. According to the wobble hypothesis, only the first and second bases of the triplet codon on the 5′ → 3′ mRNA pair strictly with the bases of the tRNA anticodon, i.e., A with U and G with C.

The pairing of the third base varies according to the base at this position; for example, G may pair with U. The conventional pairing (A = U, G = C) is known as Watson-Crick pairing (Fig. 7.1), and the abnormal pairing is called wobble pairing.

Support came from the discovery that the anticodon of yeast alanine-tRNA contains the nucleoside inosine (a deamination product of adenosine) in the first position (5′ → 3′), which pairs with the third base of the codon (5′ → 3′). Inosine was also found in the first position of other tRNAs, e.g., those for isoleucine and serine.

The purine inosine is a wobble nucleotide: it is structurally similar to guanine but pairs with A, U and C. For example, a glycine-tRNA with anticodon 5′-ICC-3′ will pair with the glycine codons GGU, GGC and GGA (Fig 7.2). Similarly, a seryl-tRNA with anticodon 5′-IGA-3′ pairs with the serine codons UCU, UCC and UCA (5′→3′). A U at the wobble position can pair with an adenine or a guanine.

DNA Triplets and mRNA Codons

According to the wobble hypothesis, the allowed base pairings are given in Table 7.5:

Wobble Base Pairings

Due to wobble base pairing, one tRNA is able to recognise more than one codon for an individual amino acid. Direct sequencing of several tRNA molecules has confirmed the wobble hypothesis, which explains the pattern of redundancy in the genetic code for certain anticodons (e.g., those containing U, I or G in the first position in the 5′ → 3′ direction).

Wobble Pairing of One Glycine tRNA

The seryl-tRNA anticodon 5′-GCU-3′ (3′-UCG-5′) base pairs with two serine codons, 5′-AGC-3′ and 5′-AGU-3′. Watson-Crick pairing occurs between AGC and GCU; in the pairing of AGU with GCU, however, hydrogen bonds form between G and U. Such abnormal pairing, called wobble pairing, is given in Table 7.5.

Three types of wobble pairings have been proposed:

(i) U in the wobble position of the tRNA anticodon pairs with A or G of codon,

(ii) G pairs with U or C, and

(iii) I (inosine) pairs with A, U or C.
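These three rules can be expressed as a small lookup table and used to enumerate the codons a given anticodon can read. A minimal Python sketch (the function and variable names are illustrative, not from the source):

```python
# Wobble pairing rules: the 5' (wobble) base of the tRNA anticodon
# pairs with the listed 3' bases of the mRNA codon.
WOBBLE = {
    "U": {"A", "G"},        # (i)   U pairs with A or G
    "G": {"U", "C"},        # (ii)  G pairs with U or C
    "I": {"A", "U", "C"},   # (iii) inosine pairs with A, U or C
    "A": {"U"},             # non-wobble-capable bases pair strictly
    "C": {"G"},
}

# Strict Watson-Crick partners for the non-wobble positions.
STRICT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def codons_read(anticodon):
    """Enumerate the mRNA codons (5'->3') readable by an anticodon (5'->3').

    Pairing is antiparallel: the anticodon's 5' base (the wobble
    position) pairs with the codon's 3' base.
    """
    wobble, middle, last = anticodon
    first, second = STRICT[last], STRICT[middle]  # strict positions 1 and 2
    return sorted(first + second + t for t in WOBBLE[wobble])

# Glycine tRNA with anticodon 5'-ICC-3' reads three glycine codons:
print(codons_read("ICC"))  # ['GGA', 'GGC', 'GGU']
```

The same function reproduces the other pairings discussed above: anticodon 5′-IGA-3′ yields the serine codons UCA, UCC and UCU, and anticodon 5′-GCU-3′ yields AGC and AGU.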


Genetic Code – Definition, Characteristics, Wobble Hypothesis


What is a Genetic Code?

The genetic code is the set of rules that living cells use to decipher the information encoded in genetic material (DNA or mRNA sequences). Ribosomes carry out the translation process: using tRNA (transfer RNA) molecules to deliver amino acids and reading the mRNA (messenger RNA) three nucleotides at a time, they link amino acids in the order the mRNA specifies.

  • As DNA is a genetic substance, it transmits genetic information from one cell to the next and from one generation to the next.
  • The question at this point is how genetic information is stored within the DNA molecule. Is it written in an articulated language or in code? And if in code, what is the nature of the genetic code?
  • A DNA molecule contains three types of moieties: phosphoric acid, deoxyribose sugar, and nitrogen bases.
  • The genetic information could in principle be encoded in any of these three moieties. However, because the sugar-phosphate backbone is always the same, it is unlikely that these parts of the DNA molecule convey genetic information.
  • However, the nitrogen bases vary from one DNA segment to the next, therefore the information may depend on their sequences.
  • In fact, the sequence of nitrogen bases in a specific section of DNA parallels the linear sequence of amino acids in a protein molecule.
  • An investigation of mutations of the head protein of bacteriophage T4 and the A protein of tryptophan synthetase from Escherichia coli provided the initial evidence for the colinearity between DNA nitrogen base sequence and amino acid sequence in protein molecules.
  • Colinearity between protein molecules and DNA polynucleotides provides evidence that the arrangement of four nitrogen bases (e.g., A, T, C, and G) in DNA polynucleotide chains dictates the sequence of amino acids in protein molecules.
  • These four DNA bases can therefore be viewed as the four alphabets of the DNA molecule. Therefore, all genetic information should be encoded using these four DNA alphabets.
  • The question that now emerges is whether genetic information is written in an articulated or a coded language. Communicating genetic information in an articulated language would have required many alphabets, a complicated grammar, and a great deal of space.
  • All of this would be practically difficult for the DNA molecule. It was therefore reasonable for molecular biologists to assume that genetic information resides in DNA as a language of code words that uses the four nitrogen bases as its symbols. Any encoded message is referred to as a cryptogram.

Characteristic of Genetic Code

Basis of Cryptanalysis 

  • The fundamental challenge of the genetic code is how information written in a four-letter language (the four nucleotides, or nitrogen bases, of DNA) can be translated into a twenty-letter language (the twenty amino acids of proteins).
  • A code word or codon is the set of nucleotides that specifies one amino acid. By genetic code, one refers to the collection of sequences of bases (codons) that correspond to each amino acid and translation signals.
  • Regarding the possible size of a codon, we can consider George Gamov’s (1954) traditional yet rational explanation.
  • The simplest conceivable code is a singlet code (a code of single letters), in which one nucleotide specifies one amino acid; it could encode only four amino acids.
  • A doublet code (of two letters) is similarly insufficient, as it can only define sixteen (4×4) amino acids, but a triplet code (of three letters) can specify sixty-four (4×4×4) codons, more than enough for 20 amino acids.
  • Therefore, it is probable that 64 triplet codes exist for 20 amino acids. The conceivable singlet, doublet, and triplet codes, which are conventionally described in terms of “mRNA language” [mRNA is a complementary molecule that copies the genetic information (cryptogram of DNA) during its transcription] are depicted in Table.
  • In 1961, Crick and his colleagues presented the first experimental evidence supporting the hypothesis of triplet coding.
  • During their experiment, when they inserted or deleted single or double base pairs in a specific region of the DNA of E.coli T4 bacteriophages, they discovered that these bacteriophages ceased to execute their regular tasks.
  • Nevertheless, bacteriophages with the addition or deletion of three base pairs in the DNA molecule had normal functionality.
  • In this experiment, the addition of one or two nucleotides caused the message to be read incorrectly, whereas the addition of a third nucleotide restored the correct reading.
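The logic of that experiment can be illustrated with a short Python sketch (the message sequence and helper are invented for illustration): splitting a message into triplets from a fixed starting point shows that inserting one base scrambles every downstream codon, while inserting three bases restores the downstream reading frame.

```python
def codons(seq):
    """Split an mRNA sequence into triplets, reading frame fixed at the start."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

message = "AUGGCAGCAGCAGCA"            # AUG followed by repeating GCA codons
print(codons(message))                  # ['AUG', 'GCA', 'GCA', 'GCA', 'GCA']

# Insert one base after AUGG: every codon downstream is misread.
print(codons("AUGG" + "U" + "CAGCAGCAGCA"))
# ['AUG', 'GUC', 'AGC', 'AGC', 'AGC']

# Insert three bases: beyond the insertion, the original frame returns.
print(codons("AUGG" + "UUU" + "CAGCAGCAGCA"))
# ['AUG', 'GUU', 'UCA', 'GCA', 'GCA', 'GCA']
```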

 Possible singlet, doublet and triplet codes of mRNA

Codon Assignment (Cracking the Code or Deciphering the Code)

The genetic code has been broken or deciphered using the following methods:

A. Theoretical Approach

  • George Gamow, a physicist, proposed the diamond code (1954) and the triangle code (1955), as well as a comprehensive theoretical framework for the various aspects of the genetic code.
  • A triplet codon corresponding to each amino acid of the polypeptide chain.
  • Direct template translation by linking codons with amino acids.
  • The code is translated in an overlapping fashion.
  • Degeneration of the code, or the coding of an amino acid by more than one codon.
  • The colinearity of the nucleic acid and the protein produced from it.
  • Universality of the code, i.e., the code being fundamentally identical throughout organisms.
  • Molecular biologists have refuted a number of these statements by Gamow. Brenner (1957) demonstrated that the overlapping triplet code is impossible, and further research has demonstrated that the code is non-overlapping.
  • Crick’s adaptor hypothesis similarly contested Gamow’s assumption of a direct template relationship between nucleic acid and polypeptide chain.
  • Adaptor molecules, according to this concept, intervene between nucleic acid and amino acids during translation.
  • In actuality, it is now understood that tRNA molecules serve as adaptors between the codons of mRNA and the amino acids of the resultant polypeptide chain.

B. The in vitro codon Assignment 

1. Discovery and use of the polynucleotide phosphorylase enzyme

Marianne Grunberg-Manago and Severo Ochoa identified an enzyme from bacteria (e.g., Azotobacter vinelandii or Micrococcus lysodeikticus) that catalyses RNA degradation in bacterial cells. This enzyme is polynucleotide phosphorylase. Outside the cell (in vitro), with high concentrations of ribonucleotides, Grunberg-Manago and Ochoa discovered that the reaction could be driven in reverse to produce an RNA molecule (see Burns and Bottino, 1989). Nucleotides are incorporated into the molecule at random, independent of any DNA template. Thus, in 1955, Grunberg-Manago and Ochoa made possible the artificial synthesis of polynucleotides (= mRNAs) comprising only a single type of nucleotide (U, A, C or G, repeated many times).


Consequently, the action of polynucleotide phosphorylase can be depicted as: (NMP)n + NDP ⇌ (NMP)n+1 + Pi, i.e., the enzyme adds ribonucleoside diphosphates to a growing RNA chain, releasing inorganic phosphate.

The polynucleotide phosphorylase enzyme differs from the RNA polymerase that transcribes mRNA from DNA in the following ways: (i) it does not require a template or primer; (ii) its activated substrates are ribonucleoside diphosphates (e.g., UDP, ADP, CDP and GDP), not triphosphates; and (iii) the by-product is inorganic phosphate (Pi) rather than pyrophosphate (PPi). The introduction of synthetic (artificial) polynucleotides and trinucleotides made the deciphering of the genetic code possible.

The various techniques employed include the use of polymers containing a single type of nucleotide (homopolymers), mixed polymers (copolymers) containing multiple types of nucleotides (heteropolymers) in random or defined sequences, and trinucleotides (or “minimessengers”) in ribosome-binding (filter-binding) assays.

2. Codon assignment with unknown sequence

(i) Codon assignment by homopolymers

  • Marshall Nirenberg and Heinrich Matthaei (1961) supplied the first indication to codon assignment when they utilised an in vitro technique for the creation of a polypeptide utilising an artificially produced mRNA molecule containing only one type of nucleotide (i.e., homopolymer).
  • Before doing the actual tests, they evaluated the capacity of a cell-free protein synthesis system to integrate radioactive amino acids into newly produced proteins.
  • Their E.coli cell-free extracts comprised ribosomes, tRNAs, aminoacyl-tRNA synthetase enzymes, DNA, and messenger RNA.
  • This extract’s DNA was eliminated by the deoxyribonuclease enzyme, so destroying the template for the synthesis of new mRNA.
  • When the twenty amino acids together with ATP, GTP, K+ and Mg2+ were introduced into this mixture, they were incorporated into proteins.
  • As long as mRNA was present in the cell-free suspension, incorporation persisted. It also continued in the presence of synthetic polynucleotides (mRNAs) that might be synthesised using the polynucleotide phosphorylase enzyme.
  • Nirenberg and Matthaei made the first successful application of this approach when they created a chain of uracil molecules (poly U) as their synthetic mRNA (homopolymer).
  • A message consisting of a single base could not contain ambiguity, hence Poly (U) looked to be the best option. It binds well to ribosomes and, as it turned out, the resultant protein was insoluble and simple to isolate.
  • When poly(U) was supplied as the message to the cell-free system containing all the amino acids, only one amino acid was incorporated into the polypeptide.
  • That amino acid was phenylalanine, and the product was polyphenylalanine; hence it was deduced that the sequence UUU coded for phenylalanine. The other homogeneous nucleotide chains (poly(A), poly(C) and poly(G)) did not direct phenylalanine incorporation. The mRNA code word for phenylalanine was therefore determined to be UUU.
  • The equivalent DNA code word for phenylalanine is derived to be AAA. Thus, UUU was the first code word to be decrypted. This finding was extended in the laboratories of Nirenberg and Ochoa.
  • Using synthetic poly (A) and poly (C) chains, the experiment was repeated, yielding polylysine and polyproline, respectively.
  • Thus, AAA was determined to be the code for lysine and CCC was determined to be the code for proline. A poly (G) message was discovered to be nonfunctional in vitro due to its secondary structure, which prevented it from attaching to ribosomes. Thus, three of the sixty-four codons were simply explained.
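The reasoning behind the homopolymer experiments can be mimicked in silico: feed a single-base message to a codon lookup and a homopolypeptide comes out. A sketch covering just the three codons assigned above (the helper function is illustrative):

```python
# Minimal codon table covering the homopolymer experiments described above.
CODON_TABLE = {
    "UUU": "Phe",   # poly(U) -> polyphenylalanine
    "AAA": "Lys",   # poly(A) -> polylysine
    "CCC": "Pro",   # poly(C) -> polyproline
}

def translate(mrna):
    """Translate an mRNA string three bases at a time (no start/stop handling)."""
    return [CODON_TABLE[mrna[i:i + 3]]
            for i in range(0, len(mrna) - len(mrna) % 3, 3)]

poly_u = "U" * 12
print(translate(poly_u))  # ['Phe', 'Phe', 'Phe', 'Phe']
```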

(ii) Codon assignment by heteropolymers (Copolymers with random sequences)

  • Using synthetic messenger RNAs containing two different types of nucleotides, the genetic code was elucidated further.
  • This approach was utilised in the laboratories of Ochoa and Nirenberg to deduce the codon composition for the 20 amino acids.
  • The bases in the synthetic messengers were chosen at random (called random copolymers). In a random copolymer composed of U and A nucleotides, for instance, eight triplets are feasible, including UUU, UUA, UAA, UAU, AAA, AAU, AUU, and AUA.
  • Theoretically, these eight codons may code for eight amino acids. However, actual experiments produced only six amino acids: phenylalanine, leucine, tyrosine, lysine, asparagine, and isoleucine.
  • It was feasible to derive the composition of the code for different amino acids by altering the relative proportions of U and A in the random copolymer and determining the fraction of the different amino acids in the proteins generated.
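That compositional reasoning is simple probability: in a random copolymer the expected frequency of a triplet is the product of the fractions of its bases, so varying the U:A ratio predicts the relative amounts of each amino acid in the product. A sketch (the function name is illustrative):

```python
from itertools import product

def triplet_frequencies(p):
    """Expected triplet frequencies in a random copolymer.

    p: dict mapping each base to its fraction in the synthesis mixture.
    Returns every possible triplet with its expected frequency.
    """
    freqs = {}
    for bases in product(p, repeat=3):
        f = 1.0
        for base in bases:
            f *= p[base]           # independent random incorporation
        freqs["".join(bases)] = f
    return freqs

# A 3:1 U:A random copolymer gives eight possible triplets.
freqs = triplet_frequencies({"U": 0.75, "A": 0.25})
print(len(freqs))              # 8
print(round(freqs["UUU"], 4))  # 0.4219  (= 0.75**3)
print(round(freqs["UUA"], 4))  # 0.1406  (= 0.75 * 0.75 * 0.25)
```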

3. Assignment of codons with known sequences. 

  • (i) The use of trinucleotides or minimessengers in filter binding (the ribosome-binding technique). Nirenberg and Leder’s (1964) ribosome-binding technique takes advantage of the observation that aminoacyl-tRNA molecules attach selectively to the ribosome-mRNA complex.
  • The association of a trinucleotide or minimessenger with the ribosome is necessary for aminoacyl-tRNA binding to occur.
  • When a mixture of such small mRNA molecules, ribosomes and aminoacyl-tRNA complexes is incubated briefly and then filtered through a nitrocellulose membrane, the mRNA-ribosome-tRNA-amino acid complex is retained and the remainder of the mixture passes through.
  • Using a series of 20 different amino acid mixtures, each containing one radioactive amino acid, the amino acid corresponding to each triplet can be determined by measuring the radioactivity retained on the membrane; for instance, the triplets GCC and GUU retain only alanyl-tRNA and valyl-tRNA, respectively.
  • In this manner, all 64 potential triplets have been synthesised and evaluated. 45 of them have produced conclusive results. Later on, with the use of lengthier synthetic messages, 61 of the 64 potential codons have been deciphered.

 The genetic dictionary. The trinucleotide codons are written in the 5'→3' direction.

C. The in vivo Codon Assignment 

  • Despite the fact that cell-free protein synthesis systems have played a significant role in the decipherment of the genetic code, they cannot tell us whether the deciphered genetic code is likewise utilised in the living systems of all organisms.
  • Different molecular biologists have used three approaches to determine whether the same code is used in vivo: (a) amino acid replacement studies (e.g., tryptophan synthetase synthesis in E. coli and haemoglobin synthesis in man); (b) frameshift mutations (e.g., Terzaghi et al., 1966, on the lysozyme enzyme of T4 bacteriophages); and (c) comparison of an amino acid sequence with its mRNA nucleotide sequence (e.g., comparison of the amino acid sequence of the R17 bacteriophage coat protein with the nucleotide sequence of R17 mRNA in the region of the molecule that dictates coat-protein synthesis, by S. Cory et al., 1970).
  • Thus, the previously mentioned in vitro and in vivo experiments allowed for the formulation of a code table for twenty amino acids.

Characteristics of Genetic Code 

The genetic code has the following general properties : 

1. The code is a triplet codon 

  • The nucleotides of messenger RNA (mRNA) are organised as a linear sequence of codons, with each codon consisting of three consecutive nitrogenous bases, i.e., the code is a triplet codon.
  • Two types of mutations, frameshift mutations and base substitutions, provide support for the concept of the triplet codon.

(i) Frameshift mutations

  • Evidently, the genetic message, once its reading is initiated at a particular point, is decoded as a series of three-letter words in a fixed reading frame.
  • As soon as one or more bases are removed or added, the reading frame is disrupted. Yet when such frameshift mutations were intercrossed, certain combinations produced wild-type (normal) genes.
  • In these combinations one mutation was a deletion and the other an insertion, so that the disordered frame caused by one mutation was corrected by the other.

(ii) Base substitution

  • If, at a specific location in an mRNA molecule, one base pair is replaced by another without deletion or insertion, the meaning of a codon containing the altered base will be altered.
  • As a result, another amino acid will be inserted in place of a particular amino acid at a particular location in a polypeptide.
  • Due to a substitution mutation in the gene for the tryptophan synthetase enzyme in E. coli, the glycine-coding GGA codon becomes the arginine-coding AGA.
  • A missense codon is a codon that has been altered to specify a different amino acid. The discovery that a fragment of mRNA comprising 90 nucleotides corresponded to a polypeptide chain having 30 amino acids of a developing haemoglobin molecule provided more direct proof for the existence of a triplet code.
  • Similarly, 1200 nucleotides of the “satellite” tobacco necrosis virus direct the synthesis of coat protein molecules containing 372 amino acids.

2. The code is non-overlapping

  • In the translation of mRNA molecules, codons are “read” sequentially and do not overlap.
  • A non-overlapping code means that each nucleotide in an mRNA is used in only one codon.
  • In a non-overlapping code, therefore, six bases specify no more than two amino acids. With an overlapping code, by contrast, a single base change (of the replacement type) would produce several amino acid substitutions in the corresponding protein.
  • In insulin, tryptophan synthetase, TMV coat protein, alkaline phosphatase, haemoglobin, etc., a single base substitution results in a single amino acid change. Since 1956, a large number of such examples have accumulated.
  • Recently, however, it has been demonstrated that overlapping genes and codons occur in bacteriophage φX174.

3. The code is commaless

  • The genetic code is punctuation-free, thus no codons are reserved for punctuation.
  • This means that after one amino acid is coded, the next three bases automatically code the next amino acid; no bases are wasted as punctuation marks.

4. The code is non-ambiguous

  • A codon always codes for the same amino acid when it is non-ambiguous.
  • With an ambiguous code, the same codon could have multiple meanings, i.e., the same codon could code for two or more amino acids. As a rule, a single codon never codes for two different amino acids.
  • There are, however, documented exceptions: the codons AUG and GUG may both code for methionine when acting as initiation codons, even though GUG normally codes for valine. Similarly, the GGA codon has been reported to specify both glycine and glutamic acid.

5. The code has polarity

The code is always read in the 5’→3′ direction; thus, the code possesses polarity. Clearly, if the code were read in opposite directions, it would specify two different proteins, since the base sequence of each codon would be reversed.

6. The code is degenerate

  • Multiple codons can specify the same amino acid; this phenomenon is known as degeneracy of the code. Except for tryptophan and methionine, which each have a single codon, the remaining 18 amino acids have several codons.
  • Nine amino acids (phenylalanine, tyrosine, histidine, glutamine, asparagine, lysine, aspartic acid, glutamic acid and cysteine) each have two codons. Isoleucine has three codons.
  • Five amino acids (valine, proline, threonine, alanine and glycine) each have four codons, and three amino acids (leucine, arginine and serine) each have six codons.
  • There are essentially two types of degeneracy: partial and complete. Partial degeneracy occurs when the first two nucleotides of degenerate codons are identical but the third (3′) nucleotide differs, e.g., CUU and CUC both code for leucine.
  • Complete degeneracy occurs when any of the four bases in the third position codes for the same amino acid (e.g., UCU, UCC, UCA and UCG all code for serine).
  • Degeneracy of the genetic code has several biological benefits. It enables, for instance, bacteria with very different DNA base compositions to specify virtually the same complement of enzymes and other proteins.
  • Degeneracy also provides a means of decreasing the lethality of mutations.
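The degeneracy pattern described above can be checked directly against the standard codon table. The 64-letter string below lists the standard amino-acid assignments (one-letter code, '*' for stop) for codons enumerated in U, C, A, G order; the counting script itself is an illustrative sketch:

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Amino acids (one-letter code, '*' = stop) for the 64 codons in UCAG order.
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(c): AA[i] for i, c in enumerate(product(BASES, repeat=3))
}

# Number of codons per amino acid (stop codons excluded).
degeneracy = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

print(degeneracy["M"], degeneracy["W"])  # 1 1  (methionine, tryptophan)
print(degeneracy["I"])                   # 3    (isoleucine)
print(sorted(aa for aa, n in degeneracy.items() if n == 6))  # ['L', 'R', 'S']
print(sum(1 for n in degeneracy.values() if n == 2))         # 9
print(sum(1 for n in degeneracy.values() if n == 4))         # 5
```

The counts reproduce the groupings in the text: two codons for nine amino acids, three for isoleucine, four for five amino acids, and six for leucine, arginine and serine.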

7. Some codes act as start codons

  • In most organisms, AUG is the start or initiation codon: the polypeptide chain begins with methionine (in eukaryotes) or N-formylmethionine (in prokaryotes).
  • Methionyl or N-formylmethionyl-tRNA binds particularly to the start site of mRNA with an AUG initiation codon.
  • In rare instances, GUG functions as an initiation codon, e.g., in bacterial protein synthesis. GUG normally codes for valine; only when the regular AUG codon is absent is GUG used for initiation.

8. Some codes act as stop codons

  • The triplet codons UAG, UAA and UGA are the stop or termination codons for the chain. They do not code for any amino acid.
  • These codons are not read by any tRNA molecules (via their anticodons) but by specialised proteins called release factors (e.g., RF-1, RF-2 and RF-3 in prokaryotes, and RF in eukaryotes).
  • They are also called nonsense codons, since they do not designate any amino acid. UAG was the first termination codon to be discovered, by Sidney Brenner (1965).
  • It was named amber after a doctoral student named Bernstein (the German word for ‘amber’, a brownish-yellow resin) who helped identify a class of mutations.
  • The other two termination codons were also named after colours for consistency: ochre for UAA and opal (or umber) for UGA (ochre denotes pale yellow, opal milky white, and umber brown).
  • The presence of multiple stop codons may be a precautionary mechanism in case the first stop codon fails to work.

9. The code is universal

  • The same genetic code is valid for all organisms, from bacteria to humans. Marshall, Caskey and Nirenberg (1967) demonstrated the universality of the code by showing that aminoacyl-tRNAs from E. coli (a bacterium), Xenopus laevis (an amphibian) and the guinea pig (a mammal) use nearly the same code.
  • Nirenberg has also suggested that the genetic code may have originated with the first bacteria three billion years ago, and that it has altered very little over the history of living species.
  • Recently, inconsistencies between the universal genetic code and the mitochondrial genetic code have been revealed.


Codon and Anticodon

  • The DNA code words are complementary to the mRNA code words (the DNA template is read in the 3’→5′ direction while the mRNA code words run 5’→3′), and so are the three bases composing the tRNA anticodon (anticodon bases run 3’→5′ relative to the codon).
  • The three bases of the anticodon pair with the mRNA codon on the ribosome during the alignment of amino acids in protein synthesis (i.e., the translation of mRNA into protein in the NH2→COOH direction).
  • For instance, one of the two mRNA code words for the amino acid phenylalanine is UUC, and the equivalent tRNA anticodon is AAG (read 3′→5′).
  • This shows that the pairing of codon and anticodon is antiparallel: here, C pairs with G and U pairs with A.
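Since codon-anticodon pairing is antiparallel and complementary, the anticodon is simply the reverse complement of the codon. A minimal sketch, assuming standard Watson-Crick RNA pairing (the helper name is illustrative):

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def anticodon(codon):
    """Return the tRNA anticodon (written 5'->3') for an mRNA codon (5'->3').

    Pairing is antiparallel, so the anticodon is the reverse complement.
    """
    return "".join(COMPLEMENT[b] for b in reversed(codon))

# Phenylalanine codon 5'-UUC-3' pairs with anticodon 3'-AAG-5',
# i.e. 5'-GAA-3' when written in the conventional direction.
print(anticodon("UUC"))  # GAA
```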

Wobble Hypothesis

  • Crick (1966) presented the wobble hypothesis to explain the potential origin of codon degeneracy (wobble means to sway or move unsteadily).
  • Given that there are 61 codons that specify amino acids, the cell must possess 61 tRNA molecules, each with a unique anticodon.
  • The actual number of tRNA molecule types discovered is far fewer than 61. This suggests that tRNA anticodons read many codons on mRNA.
  • For instance, yeast tRNAala with anticodon bases 5′-IGC-3′ (where I stands for inosine, a derivative of adenine) can bind to three codons in mRNA: 5′-GCU-3′, 5′-GCC-3′ and 5′-GCA-3′.
  • Inosine is usually found as the 5′ base of the anticodon; when pairing with the base of the codons, it wobbles and can pair with U, C, or A of three different codons.
  • Therefore, according to Crick’s wobble hypothesis, the base at the 5′ end of the anticodon is not as spatially constrained as the other two bases, allowing it to establish hydrogen bonds with any of the bases positioned at the 3′ end of a codon.



  10. The chromosomal basis of inheritance (article)

    Boveri and Sutton's chromosome theory of inheritance states that genes are found at specific locations on chromosomes, and that the behavior of chromosomes during meiosis can explain Mendel's laws of inheritance. Thomas Hunt Morgan, who studied fruit flies, provided the first strong confirmation of the chromosome theory.

  11. Genetics

    Genetics Definition. Genetics is the study of genes and inheritance in living organisms. This branch of science has a fascinating history, stretching from the 19 th century when scientists began to study how organisms inherited traits from their parents, to the present day when we can read the "source code" of living things letter-by-letter. ...

  12. The Gene Balance Hypothesis: From Classical Genetics to Modern Genomics

    The concept of genetic balance traces back to the early days of genetics. Additions or subtractions of single chromosomes to the karyotype (aneuploidy) produced greater impacts on the phenotype than whole-genome changes (ploidy). Studies on changes in gene expression in aneuploid and ploidy series revealed a parallel relationship leading to the ...

  13. Probabilities in genetics (article)

    It reflects the number of times an event is expected to occur relative to the number of times it could possibly occur. For instance, if you had a pea plant heterozygous for a seed shape gene ( Rr) and let it self-fertilize, you could use the rules of probability and your knowledge of genetics to predict that 1. ‍.

  14. Genetics

    Genetics is the study of genes, genetic variation, and heredity in organisms. It is an important branch in biology because heredity is vital to organisms' evolution. Gregor Mendel, a Moravian Augustinian friar working in the 19th century in Brno, was the first to study genetics scientifically.Mendel studied "trait inheritance", patterns in the way traits are handed down from parents to ...

  15. Hypothesis-free phenotype prediction within a genetics-first ...

    Hypothesis-free phenotype prediction with this genetics-first approach could be applied in principle to other ab initio models, but we chose to deploy a model based on protein domains, which are ...

  16. (PDF) Hypothesis Generation in Biology

    An essential step in the cyclical scientific method (1) is the formulation of hypotheses and predictions. Frequently, hypotheses are incorrectly defined as educated guesses or if/ then statements ...

  17. Particulate inheritance

    Ronald Fisher. Particulate inheritance is a pattern of inheritance discovered by Mendelian genetics theorists, such as William Bateson, Ronald Fisher or Gregor Mendel himself, showing that phenotypic traits can be passed from generation to generation through "discrete particles" known as genes, which can keep their ability to be expressed while ...

  18. Good genes hypothesis

    good genes hypothesis, in biology, an explanation which suggests that the traits females choose when selecting a mate are honest indicators of the male's ability to pass on genes that will increase the survival or reproductive success of her offspring. Although no completely unambiguous examples are known, evidence supporting the good genes hypothesis is accumulating, primarily through the ...

  19. Evidence for evolution (article)

    The evidence for evolution. In this article, we'll examine the evidence for evolution on both macro and micro scales. First, we'll look at several types of evidence (including physical and molecular features, geographical information, and fossils) that provide evidence for, and can allow us to reconstruct, macroevolutionary events.

  20. Population genetics: past, present, and future

    Still, natural selection was the mainstream hypothesis with the idea that advantageous variations in populations are the driving forces for evolution, and deleterious variations are removed in a rapid manner. At the time, population genetics usually considered two alleles at each gene locus based on the assumption of genes being base pairs.

  21. Multiple Factor Hypothesis (With Example)

    ADVERTISEMENTS: This is the essence of the multiple factor hypothesis. As quantitative inheritance it is controlled by many genes. Therefore, it is also known as polygenic inheritance. A few common examples of polygenic inheritance are described as below: Seed colour in Wheat:

  22. Wobble Hypothesis (With Diagram)

    In this article we will discuss about the concept of wobble hypothesis. Crick (1966) proposed the 'wobble hypothesis' to explain the degeneracy of the genetic code. Except for tryptophan and methionine, more than one codons direct the synthesis of one amino acid. There are 61 codons that synthesise amino acids, therefore, there must be 61 ...

  23. Genetic Code

    Characteristics of Genetic Code. The genetic code has the following general properties : 1. The code is a triplet codon. The nucleotides of messenger RNA (mRNA) are organised as a linear sequence of codons, with each codon consisting of three consecutive nitrogenous bases, i.e., the code is a triplet codon.