Menzerath’s law (also known as Menzerath-Altmann law) in linguistics describes the relationship between the size of a linguistic construct (e.g. a word) and its constituents (e.g. syllables), which are expected follow the relationship:
\[ y = a \cdot x^b \cdot e^{-cx} \] where a
, b
and c
are parameters of the law, x
is the size of the construct measured in units of its constituents and y
is the average size of constituents measured in units of a subconsistuent. Menzerath’s law can apply to complex systems with multiple levels and has been found in other systems outside linguistics.
Menzerath’s law has been found in a number of different biological sequences and organizational levels including:
To facilitate the exploration of Menzerath’s law we can use the menzerath R package. Lets consider the case of the transcript-exon level as an example. We use biomaRt package to give us access to the the biological data. The package is part of Bioconductor and provides an interface to BioMart databases such as Ensemble, where we can download structural information for a number of genomes.
First we have to load the packages we are going to use (If any of the packages is missing check the Install Missing Libraries section):
library("tidyverse")
library("menzerath")
library("cowplot")
library("biomaRt")
Once everything is in place we can start by connecting to ensemble, a genome database project of the European Bioinformatics Institute, and list the databases (here are the first 20):
ensembl <- useMart("ensembl")
datasets <- listDatasets(ensembl)
kbl(head(datasets,20))
dataset | description | version |
---|---|---|
abrachyrhynchus_gene_ensembl | Pink-footed goose genes (ASM259213v1) | ASM259213v1 |
acalliptera_gene_ensembl | Eastern happy genes (fAstCal1.2) | fAstCal1.2 |
acarolinensis_gene_ensembl | Anole lizard genes (AnoCar2.0) | AnoCar2.0 |
acchrysaetos_gene_ensembl | Golden eagle genes (bAquChr1.2) | bAquChr1.2 |
acitrinellus_gene_ensembl | Midas cichlid genes (Midas_v5) | Midas_v5 |
amelanoleuca_gene_ensembl | Giant panda genes (ASM200744v2) | ASM200744v2 |
amexicanus_gene_ensembl | Mexican tetra genes (Astyanax_mexicanus-2.0) | Astyanax_mexicanus-2.0 |
anancymaae_gene_ensembl | Ma’s night monkey genes (Anan_2.0) | Anan_2.0 |
aocellaris_gene_ensembl | Clown anemonefish genes (AmpOce1.0) | AmpOce1.0 |
apercula_gene_ensembl | Orange clownfish genes (Nemo_v1) | Nemo_v1 |
aplatyrhynchos_gene_ensembl | Mallard genes (ASM874695v1) | ASM874695v1 |
apolyacanthus_gene_ensembl | Spiny chromis genes (ASM210954v1) | ASM210954v1 |
applatyrhynchos_gene_ensembl | Duck genes (CAU_duck1.0) | CAU_duck1.0 |
atestudineus_gene_ensembl | Climbing perch genes (fAnaTes1.2) | fAnaTes1.2 |
bbbison_gene_ensembl | American bison genes (Bison_UMD1.0) | Bison_UMD1.0 |
bgrunniens_gene_ensembl | Domestic yak genes (LU_Bosgru_v3.0) | LU_Bosgru_v3.0 |
bihybrid_gene_ensembl | Hybrid - Bos Indicus genes (UOA_Brahman_1) | UOA_Brahman_1 |
bmusculus_gene_ensembl | Blue whale genes (mBalMus1.v2) | mBalMus1.v2 |
bmutus_gene_ensembl | Wild yak genes (BosGru_v2.0) | BosGru_v2.0 |
bsplendens_gene_ensembl | Siamese fighting fish genes (fBetSpl5.2) | fBetSpl5.2 |
Lots of interesting species are among the first top 20, difficult to choose which one to focus on. Cows have interesting things to say so lets pick the database of the cow’s genome (btaurus_gene_ensembl
) for the following example.
A transcript is the transcribed part of a gene. By transcribed we mean that the genomic region will be used to produce a functional biomolecule (RNA which can then be used to direct the construction of a protein). A transcript can be composed by one or more exons.
The information we are interested in are transcripts and exons. Transcripts are uniquely identified by the attribute ensembl_transcript_id
and exons by the attribute ensembl_exon_id
. We also need the size of exons (in base pairs) and this we can calculate from the difference in the genomic coordinates of the start and end of each exon stored in the attributes exon_chrom_start
and exon_chrom_end
respectively. Finally we can reduce the noise in the data we download by excluding non-constitutive exons. To keep the amount of data manageable we also filter by a single chromosome:
ensembl <- useMart("ensembl", dataset = "btaurus_gene_ensembl")
cow_exons <- getBM(filters = 'chromosome_name',
values = 13,
attributes=c("ensembl_transcript_id",
"ensembl_exon_id",
"exon_chrom_start",
"exon_chrom_end",
"is_constitutive"), mart=ensembl)
There are 14323 in our table, quite enough to get a statistical signal, with the drawback that we focus on a single chromosome which might have something particular.
To make sure that we did correctly downloaded the information we need we could have added a few more attributes and perform some descriptive statistics or data cleaning but for the purpose of this examples we can skip such steps.
Having downloaded the necessary information we can now process them to create a menzerath object to store the information:
processed_exons <-(cow_exons %>%
filter(is_constitutive==1) %>%
mutate(exon_size = exon_chrom_end-exon_chrom_start) %>%
group_by(ensembl_transcript_id) %>%
summarize(transcript_size = n(), average_exon_size=mean(exon_size)))
cow_menzerath <- menzerath(processed_exons, x = "transcript_size", y = "average_exon_size")
To get a glimpse of the information in our object we can print it:
print(cow_menzerath)
#> Observations: 1481
#> y: 297.857142857143,447.8,261.666666666667,158.416666666667,115.380952380952,144.241379310345...
#> x: 7,5,12,12,21,29...
and plot the data:
plot(cow_menzerath, fit = TRUE)
By default for the fit we use the Menzerath-Altman formula (for alternatives see section Alternative models of the vignette Introduction to package menzerath):
\[ y = a \cdot x^b \cdot e^{-cx} \]
We could explore a bit more in depth the data by trying a different formula for fitting:
alt_fit <- fit(cow_menzerath, method="Milicka_2")
get_parameters(alt_fit)
#> $a
#> [1] 271.7336
#>
#> $b
#> [1] 0.1831749
Interpreting the meaning of the parameters in most cases is not straightforwards even in cases where the system is well known. To get a better understanding of Menzerath’s law in the particular system one could also compare across systems. To make the process less repetitive we can define a helper function ensemble2menzerath
to download the necessary information (see Useful functions for the full code).
As the following code is too heavy and time consuming as it requests and processes a lot of information, it has been pre-run and the resulting plot is directly inserted below:
yms <- ensemble2menzerath()
hms <- ensemble2menzerath("hsapiens_gene_ensembl")
mms <- ensemble2menzerath("mmusculus_gene_ensembl")
cms <- ensemble2menzerath("btaurus_gene_ensembl")
yeast_plot <- plot(yms, fit = TRUE)
human_plot <- plot(hms, fit = TRUE)
mouse_plot <- plot(mms, fit = TRUE)
cow_plot <- plot(cms, fit = TRUE)
plot_grid(yeast_plot, human_plot, mouse_plot, cow_plot, ncol = 2)
species | scientific name | database name | number of observations |
---|---|---|---|
yeast | S. cerevisiae | scerevisiae_gene_ensembl |
7,127 |
human | H. sapiens | hsapiens_gene_ensembl |
52,351 |
mouse | M. musculus | mmusculus_gene_ensembl |
43,864 |
cow | B. taurus | btaurus_gene_ensembl |
42,022 |
If tidyverse or cowplot is missing they can be installed with:
If menzerath is missing:
install.packages("devtools")
devtools::install_github("sellisd/menzerath", build_vignettes=TRUE)
If biomaRt is it can be installed with the following commands (this might take some time):
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("biomaRt")
The full code for the helper function is the following:
#' Generate a menzerath object from exons
#'
#' @param dataset the name of an ensemble dataset by default one for _S. cerevisiae_
#'
#' @return
#' @export
#'
#' @examples
ensemble2menzerath <- function(dataset = "scerevisiae_gene_ensembl", ...){
ensembl = useMart("ensembl", dataset = dataset)
print("Downloading...")
exons <- getBM(attributes=c("ensembl_transcript_id",
"transcript_length",
"ensembl_exon_id",
"exon_chrom_start",
"exon_chrom_end",
"is_constitutive", ...), mart=ensembl)
print("done")
processed <-(exons %>%
filter(is_constitutive==1) %>%
mutate(exon_size = exon_chrom_end-exon_chrom_start) %>%
group_by(ensembl_transcript_id) %>%
summarize(transcript_size = n(), average_exon_size=mean(exon_size)))
ms <- menzerath(processed, x = "transcript_size", y = "average_exon_size")
}
Baixeries, Jaume, Antoni Hernández-Fernández, Núria Forns, and Ramon Ferrer-i-Cancho. 2013. “The Parameters of the Menzerath-Altmann Law in Genomes.” Journal of Quantitative Linguistics 20 (2): 94–104. https://doi.org/10.1080/09296174.2013.773141.
Ferrer-I-Cancho, Ramon, and NÃria Forns. 2009. “The Self-Organization of Genomes.” Complexity, NA–NA. https://doi.org/10.1002/cplx.20296.
Li, Wentian. 2011. “Menzeraths Law at the Gene-Exon Level in the Human Genome.” Complexity 17 (4): 49–53. https://doi.org/10.1002/cplx.20398.
Nikolaou, Christoforos. 2014. “Menzerath–Altmann Law in Mammalian Exons Reflects the Dynamics of Gene Structure Evolution.” Computational Biology and Chemistry 53: 134–43. https://doi.org/https://doi.org/10.1016/j.compbiolchem.2014.08.018.
Shahzad, Khuram, Jay E. Mittenthal, and Gustavo Caetano-Anollés. 2015. “The Organization of Domains in Proteins Obeys Menzerath-Altmann’s Law of Language.” BMC Systems Biology 9 (1). https://doi.org/10.1186/s12918-015-0192-9.