Menzerath’s law (also known as Menzerath-Altmann law) in linguistics describes the relationship between the size of a linguistic construct (e.g. a word) and its constituents (e.g. syllables), which are expected follow the relationship:

\[ y = a \cdot x^b \cdot e^{-cx} \] where a, b and c are parameters of the law, x is the size of the construct measured in units of its constituents and y is the average size of constituents measured in units of a subconsistuent. Menzerath’s law can apply to complex systems with multiple levels and has been found in other systems outside linguistics.

Menzerath’s law has been found in a number of different biological sequences and organizational levels including:

  • Genomes, chromosomes and base pairs (Ferrer-I-Cancho and Forns 2009; Baixeries et al. 2013)
  • Genes, exons and base pairs (Li 2011) and perhaps more appropriately transcripts, exons and base pairs (Nikolaou 2014)
  • Proteins, protein domains and aminoacids (Shahzad, Mittenthal, and Caetano-Anollés 2015)

To facilitate the exploration of Menzerath’s law we can use the menzerath R package. Lets consider the case of the transcript-exon level as an example. We use biomaRt package to give us access to the the biological data. The package is part of Bioconductor and provides an interface to BioMart databases such as Ensemble, where we can download structural information for a number of genomes.

First we have to load the packages we are going to use (If any of the packages is missing check the Install Missing Libraries section):

Once everything is in place we can start by connecting to ensemble, a genome database project of the European Bioinformatics Institute, and list the databases (here are the first 20):

ensembl <- useMart("ensembl")
datasets <- listDatasets(ensembl)
kbl(head(datasets,20))
dataset description version
abrachyrhynchus_gene_ensembl Pink-footed goose genes (ASM259213v1) ASM259213v1
acalliptera_gene_ensembl Eastern happy genes (fAstCal1.2) fAstCal1.2
acarolinensis_gene_ensembl Anole lizard genes (AnoCar2.0) AnoCar2.0
acchrysaetos_gene_ensembl Golden eagle genes (bAquChr1.2) bAquChr1.2
acitrinellus_gene_ensembl Midas cichlid genes (Midas_v5) Midas_v5
amelanoleuca_gene_ensembl Giant panda genes (ASM200744v2) ASM200744v2
amexicanus_gene_ensembl Mexican tetra genes (Astyanax_mexicanus-2.0) Astyanax_mexicanus-2.0
anancymaae_gene_ensembl Ma’s night monkey genes (Anan_2.0) Anan_2.0
aocellaris_gene_ensembl Clown anemonefish genes (AmpOce1.0) AmpOce1.0
apercula_gene_ensembl Orange clownfish genes (Nemo_v1) Nemo_v1
aplatyrhynchos_gene_ensembl Mallard genes (ASM874695v1) ASM874695v1
apolyacanthus_gene_ensembl Spiny chromis genes (ASM210954v1) ASM210954v1
applatyrhynchos_gene_ensembl Duck genes (CAU_duck1.0) CAU_duck1.0
atestudineus_gene_ensembl Climbing perch genes (fAnaTes1.2) fAnaTes1.2
bbbison_gene_ensembl American bison genes (Bison_UMD1.0) Bison_UMD1.0
bgrunniens_gene_ensembl Domestic yak genes (LU_Bosgru_v3.0) LU_Bosgru_v3.0
bihybrid_gene_ensembl Hybrid - Bos Indicus genes (UOA_Brahman_1) UOA_Brahman_1
bmusculus_gene_ensembl Blue whale genes (mBalMus1.v2) mBalMus1.v2
bmutus_gene_ensembl Wild yak genes (BosGru_v2.0) BosGru_v2.0
bsplendens_gene_ensembl Siamese fighting fish genes (fBetSpl5.2) fBetSpl5.2

Lots of interesting species are among the first top 20, difficult to choose which one to focus on. Cows have interesting things to say so lets pick the database of the cow’s genome (btaurus_gene_ensembl) for the following example.

A transcript is the transcribed part of a gene. By transcribed we mean that the genomic region will be used to produce a functional biomolecule (RNA which can then be used to direct the construction of a protein). A transcript can be composed by one or more exons.

The information we are interested in are transcripts and exons. Transcripts are uniquely identified by the attribute ensembl_transcript_id and exons by the attribute ensembl_exon_id. We also need the size of exons (in base pairs) and this we can calculate from the difference in the genomic coordinates of the start and end of each exon stored in the attributes exon_chrom_start and exon_chrom_end respectively. Finally we can reduce the noise in the data we download by excluding non-constitutive exons. To keep the amount of data manageable we also filter by a single chromosome:

ensembl <- useMart("ensembl", dataset = "btaurus_gene_ensembl")
cow_exons <-  getBM(filters = 'chromosome_name',
                    values = 13,
                    attributes=c("ensembl_transcript_id",
                                 "ensembl_exon_id",
                                 "exon_chrom_start",
                                 "exon_chrom_end",
                                 "is_constitutive"), mart=ensembl)

There are 14323 in our table, quite enough to get a statistical signal, with the drawback that we focus on a single chromosome which might have something particular.

To make sure that we did correctly downloaded the information we need we could have added a few more attributes and perform some descriptive statistics or data cleaning but for the purpose of this examples we can skip such steps.

Having downloaded the necessary information we can now process them to create a menzerath object to store the information:

processed_exons <-(cow_exons %>% 
                     filter(is_constitutive==1) %>% 
                     mutate(exon_size = exon_chrom_end-exon_chrom_start) %>% 
                     group_by(ensembl_transcript_id) %>%
                     summarize(transcript_size = n(), average_exon_size=mean(exon_size)))

cow_menzerath <- menzerath(processed_exons, x = "transcript_size", y = "average_exon_size")

To get a glimpse of the information in our object we can print it:

print(cow_menzerath)
#> Observations: 1481
#> y: 297.857142857143,447.8,261.666666666667,158.416666666667,115.380952380952,144.241379310345...
#> x: 7,5,12,12,21,29...

and plot the data:

plot(cow_menzerath, fit = TRUE)

By default for the fit we use the Menzerath-Altman formula (for alternatives see section Alternative models of the vignette Introduction to package menzerath):

\[ y = a \cdot x^b \cdot e^{-cx} \]

We could explore a bit more in depth the data by trying a different formula for fitting:

alt_fit <- fit(cow_menzerath, method="Milicka_2")
get_parameters(alt_fit)
#> $a
#> [1] 271.7336
#> 
#> $b
#> [1] 0.1831749

Interpreting the meaning of the parameters in most cases is not straightforwards even in cases where the system is well known. To get a better understanding of Menzerath’s law in the particular system one could also compare across systems. To make the process less repetitive we can define a helper function ensemble2menzerath to download the necessary information (see Useful functions for the full code).

As the following code is too heavy and time consuming as it requests and processes a lot of information, it has been pre-run and the resulting plot is directly inserted below:

yms <- ensemble2menzerath()
hms <- ensemble2menzerath("hsapiens_gene_ensembl")
mms <- ensemble2menzerath("mmusculus_gene_ensembl")
cms <- ensemble2menzerath("btaurus_gene_ensembl")
yeast_plot <- plot(yms, fit = TRUE)
human_plot <- plot(hms, fit = TRUE)
mouse_plot <- plot(mms, fit = TRUE)
cow_plot <- plot(cms, fit = TRUE)
plot_grid(yeast_plot, human_plot, mouse_plot, cow_plot, ncol = 2)
species scientific name database name number of observations
yeast S. cerevisiae scerevisiae_gene_ensembl 7,127
human H. sapiens hsapiens_gene_ensembl 52,351
mouse M. musculus mmusculus_gene_ensembl 43,864
cow B. taurus btaurus_gene_ensembl 42,022

Menzerath’s law in multiple genomes

Troubleshooting

Install Missing Libraries

If tidyverse or cowplot is missing they can be installed with:

install.packages(c"tidyverse", "cowplot")

If menzerath is missing:

install.packages("devtools")
devtools::install_github("sellisd/menzerath", build_vignettes=TRUE)

If biomaRt is it can be installed with the following commands (this might take some time):

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("biomaRt")

Useful functions

The full code for the helper function is the following:


#' Generate a menzerath object from exons
#'
#' @param dataset the name of an ensemble dataset by default one for _S. cerevisiae_
#'
#' @return
#' @export
#'
#' @examples
ensemble2menzerath <- function(dataset = "scerevisiae_gene_ensembl", ...){
  ensembl = useMart("ensembl", dataset = dataset)
  print("Downloading...")
  exons <- getBM(attributes=c("ensembl_transcript_id",
                              "transcript_length",
                              "ensembl_exon_id",
                              "exon_chrom_start",
                              "exon_chrom_end",
                              "is_constitutive", ...), mart=ensembl)
  print("done")
  processed <-(exons %>% 
                 filter(is_constitutive==1) %>% 
                 mutate(exon_size = exon_chrom_end-exon_chrom_start) %>% 
                 group_by(ensembl_transcript_id) %>%
                 summarize(transcript_size = n(), average_exon_size=mean(exon_size)))
  
  ms <- menzerath(processed, x = "transcript_size", y = "average_exon_size")
}

References

Baixeries, Jaume, Antoni Hernández-Fernández, Núria Forns, and Ramon Ferrer-i-Cancho. 2013. “The Parameters of the Menzerath-Altmann Law in Genomes.” Journal of Quantitative Linguistics 20 (2): 94–104. https://doi.org/10.1080/09296174.2013.773141.

Ferrer-I-Cancho, Ramon, and NÃria Forns. 2009. “The Self-Organization of Genomes.” Complexity, NA–NA. https://doi.org/10.1002/cplx.20296.

Li, Wentian. 2011. “Menzeraths Law at the Gene-Exon Level in the Human Genome.” Complexity 17 (4): 49–53. https://doi.org/10.1002/cplx.20398.

Nikolaou, Christoforos. 2014. “Menzerath–Altmann Law in Mammalian Exons Reflects the Dynamics of Gene Structure Evolution.” Computational Biology and Chemistry 53: 134–43. https://doi.org/https://doi.org/10.1016/j.compbiolchem.2014.08.018.

Shahzad, Khuram, Jay E. Mittenthal, and Gustavo Caetano-Anollés. 2015. “The Organization of Domains in Proteins Obeys Menzerath-Altmann’s Law of Language.” BMC Systems Biology 9 (1). https://doi.org/10.1186/s12918-015-0192-9.