Menzerath’s law (also known as Menzerath-Altmann law) in linguistics describes the relationship between the size of a linguistic construct (e.g. a word) and its constituents (e.g. syllables), which are expected follow the relationship:
where a
, b
and c
are parameters
of the law, x
is the size of the construct measured in
units of its constituents and y
is the average size of
constituents measured in units of a subconsistuent. Menzerath’s law can
apply to complex systems with multiple levels and has been found in
other systems outside linguistics.
Menzerath’s law has been found in a number of different biological sequences and organizational levels including:
To facilitate the exploration of Menzerath’s law we can use the menzerath R package. Lets consider the case of the transcript-exon level as an example. We use biomaRt package to give us access to the the biological data. The package is part of Bioconductor and provides an interface to BioMart databases such as Ensemble, where we can download structural information for a number of genomes.
First we have to load the packages we are going to use (If any of the packages is missing check the Install Missing Libraries section):
library("tidyverse")
library("menzerath")
library("cowplot")
#>
#> Attaching package: 'cowplot'
#> The following object is masked from 'package:lubridate':
#>
#> stamp
library("biomaRt")
Once everything is in place we can start by connecting to ensemble, a genome database project of the European Bioinformatics Institute, and list the databases (here are the first 20):
ensembl <- useMart("ensembl")
datasets <- listDatasets(ensembl)
kbl(head(datasets,20))
dataset | description | version |
---|---|---|
abrachyrhynchus_gene_ensembl | Pink-footed goose genes (ASM259213v1) | ASM259213v1 |
acalliptera_gene_ensembl | Eastern happy genes (fAstCal1.3) | fAstCal1.3 |
acarolinensis_gene_ensembl | Green anole genes (AnoCar2.0v2) | AnoCar2.0v2 |
acchrysaetos_gene_ensembl | Golden eagle genes (bAquChr1.2) | bAquChr1.2 |
acitrinellus_gene_ensembl | Midas cichlid genes (Midas_v5) | Midas_v5 |
amelanoleuca_gene_ensembl | Giant panda genes (ASM200744v2) | ASM200744v2 |
amexicanus_gene_ensembl | Mexican tetra genes (Astyanax_mexicanus-2.0) | Astyanax_mexicanus-2.0 |
anancymaae_gene_ensembl | Ma’s night monkey genes (Anan_2.0) | Anan_2.0 |
aocellaris_gene_ensembl | Clown anemonefish genes (ASM2253959v1) | ASM2253959v1 |
apercula_gene_ensembl | Orange clownfish genes (Nemo_v1) | Nemo_v1 |
aplatyrhynchos_gene_ensembl | Mallard genes (ASM874695v1) | ASM874695v1 |
apolyacanthus_gene_ensembl | Spiny chromis genes (ASM210954v1) | ASM210954v1 |
applatyrhynchos_gene_ensembl | Duck genes (CAU_duck1.0) | CAU_duck1.0 |
atestudineus_gene_ensembl | Climbing perch genes (fAnaTes1.3) | fAnaTes1.3 |
bbbison_gene_ensembl | American bison genes (Bison_UMD1.0) | Bison_UMD1.0 |
bgrunniens_gene_ensembl | Domestic yak genes (LU_Bosgru_v3.0) | LU_Bosgru_v3.0 |
bihybrid_gene_ensembl | Hybrid - Bos Indicus genes (UOA_Brahman_1) | UOA_Brahman_1 |
bmusculus_gene_ensembl | Blue whale genes (mBalMus1.v2) | mBalMus1.v2 |
bmutus_gene_ensembl | Wild yak genes (BosGru_v2.0) | BosGru_v2.0 |
bsplendens_gene_ensembl | Siamese fighting fish genes (fBetSpl5.2) | fBetSpl5.2 |
Lots of interesting species are among the first top 20, difficult to
choose which one to focus on. Cows have interesting things to say so lets
pick the database of the cow’s genome
(btaurus_gene_ensembl
) for the following example.
A transcript is the transcribed part of a gene. By transcribed we mean that the genomic region will be used to produce a functional biomolecule (RNA which can then be used to direct the construction of a protein). A transcript can be composed by one or more exons.
The information we are interested in are transcripts and exons.
Transcripts are uniquely identified by the attribute
ensembl_transcript_id
and exons by the attribute
ensembl_exon_id
. We also need the size of exons (in base
pairs) and this we can calculate from the difference in the genomic
coordinates of the start and end of each exon stored in the attributes
exon_chrom_start
and exon_chrom_end
respectively. Finally we can reduce the noise in the data we download by
excluding non-constitutive exons. To keep the amount of data manageable
we also filter by a single chromosome:
ensembl <- useMart("ensembl", dataset = "btaurus_gene_ensembl")
cow_exons <- getBM(filters = 'chromosome_name',
values = 13,
attributes=c("ensembl_transcript_id",
"ensembl_exon_id",
"exon_chrom_start",
"exon_chrom_end",
"is_constitutive"), mart=ensembl)
There are 24576 in our table, quite enough to get a statistical signal, with the drawback that we focus on a single chromosome which might have something particular.
To make sure that we did correctly downloaded the information we need we could have added a few more attributes and perform some descriptive statistics or data cleaning but for the purpose of this examples we can skip such steps.
Having downloaded the necessary information we can now process them to create a menzerath object to store the information:
processed_exons <-(cow_exons %>%
filter(is_constitutive==1) %>%
mutate(exon_size = exon_chrom_end-exon_chrom_start) %>%
group_by(ensembl_transcript_id) %>%
summarize(transcript_size = n(), average_exon_size=mean(exon_size)))
cow_menzerath <- menzerath(processed_exons, x = "transcript_size", y = "average_exon_size")
To get a glimpse of the information in our object we can print it:
print(cow_menzerath)
#> Observations: 2200
#> y: 249.166666666667,435,296.777777777778,301.384615384615,117.058823529412,173.41935483871...
#> x: 6,2,9,13,17,31...
and plot the data:
plot(cow_menzerath, fit = TRUE)
By default for the fit we use the Menzerath-Altman formula (for alternatives see section Alternative models of the vignette Introduction to package menzerath):
We could explore a bit more in depth the data by trying a different formula for fitting:
alt_fit <- fit(cow_menzerath, method="Milicka_2")
get_parameters(alt_fit)
#> $a
#> [1] 306.1093
#>
#> $b
#> [1] 0.2416846
Interpreting the meaning of the parameters in most cases is not
straightforwards even in cases where the system is well known. To get a
better understanding of Menzerath’s law in the particular system one
could also compare across systems. To make the process less repetitive
we can define a helper function ensemble2menzerath
to
download the necessary information (see Useful functions for the full code).
As the following code is too heavy and time consuming as it requests and processes a lot of information, it has been pre-run and the resulting plot is directly inserted below:
yms <- ensemble2menzerath()
hms <- ensemble2menzerath("hsapiens_gene_ensembl")
mms <- ensemble2menzerath("mmusculus_gene_ensembl")
cms <- ensemble2menzerath("btaurus_gene_ensembl")
yeast_plot <- plot(yms, fit = TRUE)
human_plot <- plot(hms, fit = TRUE)
mouse_plot <- plot(mms, fit = TRUE)
cow_plot <- plot(cms, fit = TRUE)
plot_grid(yeast_plot, human_plot, mouse_plot, cow_plot, ncol = 2)
species | scientific name | database name | number of observations |
---|---|---|---|
yeast | S. cerevisiae | scerevisiae_gene_ensembl |
7,127 |
human | H. sapiens | hsapiens_gene_ensembl |
52,351 |
mouse | M. musculus | mmusculus_gene_ensembl |
43,864 |
cow | B. taurus | btaurus_gene_ensembl |
42,022 |
If tidyverse or cowplot is missing they can be installed with:
If menzerath is missing:
install.packages("devtools")
devtools::install_github("sellisd/menzerath", build_vignettes=TRUE)
If biomaRt is it can be installed with the following commands (this might take some time):
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("biomaRt")
The full code for the helper function is the following:
#' Generate a menzerath object from exons
#'
#' @param dataset the name of an ensemble dataset by default one for _S. cerevisiae_
#'
#' @return
#' @export
#'
#' @examples
ensemble2menzerath <- function(dataset = "scerevisiae_gene_ensembl", ...){
ensembl = useMart("ensembl", dataset = dataset)
print("Downloading...")
exons <- getBM(attributes=c("ensembl_transcript_id",
"transcript_length",
"ensembl_exon_id",
"exon_chrom_start",
"exon_chrom_end",
"is_constitutive", ...), mart=ensembl)
print("done")
processed <-(exons %>%
filter(is_constitutive==1) %>%
mutate(exon_size = exon_chrom_end-exon_chrom_start) %>%
group_by(ensembl_transcript_id) %>%
summarize(transcript_size = n(), average_exon_size=mean(exon_size)))
ms <- menzerath(processed, x = "transcript_size", y = "average_exon_size")
}