# Sequences, R, and the Free Monoid

An important concept in computer science is the free monoid on a set A, which essentially consists of sequencesa1an⟩ of elements drawn from A. The key operations on the free monoid are:

• a⟩, forming a singleton sequence from a single element of A
• xy, concatenation of the sequences x and y, which satisfies the associative law: (xy)⊕z = x⊕(yz)
• ⟨⟩, the empty sequence, which acts as an identity for concatenation: ⟨⟩⊕x = x⊕⟨⟩ = x

The free monoid satisfies the mathematical definition of a monoid, and is free in the sense of satisfying nothing else. There are many possible implementations of the free monoid, but they are all mathematically equivalent, which justifies calling it the free monoid.

In the R language, there are four main implementations of the free monoid: vectors, lists, dataframes (considered as sequences of rows), and strings (although for strings it’s difficult to tell where elements start and stop). The key operations are:

Vectors Lists Dataframes Strings
⟨⟩, empty `c()` `list()` `data.frame(n=c())` `""`
a⟩, singleton implicit (single values are 1-element vectors) `list(a)` `data.frame(n=a)` `as.character(a)`
xy, concatenation `c(x,y)` `c(x,y)` `rbind(x,y)` `paste0(x,y)`

An arbitrary monoid on a set A is a set B equipped with:

• a function f from A to B
• a binary operation xy, which again satisfies the associative law: (xy)⊗z = x⊗(yz)
• an element e which acts as an identity for the binary operator: ex = xe = x

As an example, we might have A = {2, 3, 5, …} be the prime numbers, B = {1, 2, 3, 4, 5, …} be the positive whole numbers, f(n) = n be the obvious injection function, ⊗ be multiplication, and (of course) e = 1. Then B is a monoid on A.

A homomorphism from the free monoid to B is a function h which respects the monoid-on-A structure. That is:

• h(⟨⟩) = e
• h(⟨a⟩) = f(a)
• h(xy) = h(x) ⊗ h(y)

As a matter of fact, these restrictions uniquely define the homomorphism from the free monoid to B to be the function which maps the sequence ⟨a1an⟩ to f(a1) ⊗ ⋯ ⊗ f(an).

In other words, simply specifying the monoid B with its function f from A to B and its binary operator ⊗ uniquely defines the homomorphism from the free monoid on A. Furthermore, this homomorphism logically splits into two parts:

• Map: apply the function f to every element of the input sequence ⟨a1an
• Reduce: combine the results of mapping using the binary operator, to give f(a1) ⊗ ⋯ ⊗ f(an)

The combination of map and reduce is inherently parallel, since the binary operator ⊗ is associative. If our input sequence is spread out over a hundred computers, each can apply map and reduce to its own segment. The hundred results can then be sent to a central computer where the final 99 ⊗ operations are performed. Among other organisations, Google has made heavy use of this MapReduce paradigm, which goes back to Lisp and APL.

R also provides support for the basic map and reduce operations (albeit with some inconsistencies):

Vectors Lists Dataframes Strings
Map with f `sapply(v,f)`, `purrr::map_dbl(v,f)` and related operators, or simply `f(v)` for vectorized functions `lapply(x,f)` or `purrr::map(x,f)` Vector operations on columns, possibly with `dplyr::mutate`, `dplyr::transmute`, `purrr::pmap`, or `mapply` Not possible, unless `strsplit` or tokenisation is used
Reduce with ⊗ `Reduce(g,v)`, `purrr::reduce(v,g)`, or specific functions like `sum`, `prod`, and `min` `purrr::reduce(x,g)` Vector operations on columns, or specific functions like `colSums`, with `purrr::reduce2(x,y,g)` useful for two-column dataframes Not possible, unless `strsplit` or tokenisation is used

It can be seen that it is particularly the conceptual reduce operator on dataframes that is poorly supported by the R language. Nevertheless, the map and reduce operations are both powerful mechanisms for manipulating data.

For non-associative binary operators, `purrr::reduce(x,g)` and similar functions remain extremely useful, but they become inherently sequential.

For more about purrr, see purrr.tidyverse.org.