Topic Analysis on the New Testament

I have been experimenting recently with Latent Dirichlet allocation for automatic determination of topics in documents. This is a popular technique, although it works better for some kinds of document than for others. Above is a topic matrix for the Greek New Testament (using the stemmed 1904 Nestle text, removing 47 common words before analysis, and specifying 14 as the number of topics in advance). The size of the coloured dots in the matrix shows the degree to which a given topic can be found in a given book. The topics (and the most important words associated with them) are:

A better set of topics can probably be obtained with a bit more experimentation. Alternatively, here (as a simpler form of analysis) are the relative frequencies of some Greek words or sets of words, scaled to the range 0 to 1 for each word set (with the bar chart showing the total number of words in each New Testament book). Not surprisingly, angels appear more frequently in Revelation than anywhere else, while love is particularly frequent in 1 John:

World Population

Some feedback on my last post expressed surprise that Ptolemy’s specification of the Oikoumene now holds 80.6% of the world’s population. Above, I have redrawn the classic bar charts of world population which explain this fact. Africa, Asia, and Europe contain about 86% of the world’s population. Ptolemy excluded what we now know to be Southern Africa (which only drops the total to 85%) and didn’t extend his Oikoumene quite far enough to the east.

The chart below shows the same thing, but using NASA’s image of the Earth at night. It can be seen that the spikes on the bar chart correspond to major cities.

The Oikoumene of Ptolemy

I was reading recently about the Geographia of Ptolemy (written around 150 AD). This classic book applied Greek mathematical skills to mapping and map projection – and if there was one thing the Greeks were good at, it was mathematics. According to Neugebauer, Ptolemy believed the Oikoumene, the inhabited portion of the world, to range from Thule (63° North) to 16°25′ South, and 90 degrees East and West of Syene in Egypt.

The map above illustrates this Oikoumene, with a modern population overlay in red (data from SEDAC). Ptolemy was not too far wrong – today this region holds 80.6% of the world’s population, and the percentage would have been greater in antiquity.

Also shown on the map are some of the many cities listed in the Geographia. Open circles show Ptolemy’s coordinates (adjusted to a Syene meridian), and filled circles show true positions. Ptolemy had reasonably good latitude values (an average error of 1.2° for the sample shown on the map), but much worse longitude values (an average error of 6.8°). The longitude error is mostly systematic – Ptolemy’s estimate of 18,000 miles or 29,000 km for the circumference of the earth was only 72% of the true value (several centuries earlier, Eratosthenes had come up with a much better estimate). If Ptolemy’s longitudes are adjusted for this, the average error is only 1.5°.

However, Ptolemy’s book deserves considerable respect – it is not surprising that it was used for more than a thousand years.

Some Oldest Manuscripts

The chart below shows the dates of ten significant written works:

Each work is indicated by a vertical line, which runs from the date of writing to the date of the oldest surviving complete copy that I am aware of (marked by a dark circle). Open circles show some of the older partial or fragmentary manuscripts (these act as important checks on the reliability of later copies).

Two threshold periods (marked with arrows) are worth remarking on. First, Gutenberg’s printing press – after its invention, we still have at least one first edition for many important works. Second, the invention of Carolingian minuscule – many older works were re-copied into the new, legible script after that time. They were then widely distributed to monasteries around Europe, so that survival from that period has been fairly good. In the Byzantine Empire, Greek minuscule had a similar effect.

The Bible is a special case (I have highlighted one particular gospel on the chart). It was copied so widely (and so early) that many ancient manuscripts survive.

Zero in Greek mathematics

I recently read The Nothing That Is: A Natural History of Zero by Robert M. Kaplan. Zero is an important concept in mathematics. But where did it come from?

The Babylonian zero

From around 2000 BC, the Babylonians used a positional number system with base 60. Initially a space was used to represent zero. Vertical wedges mean 1, and chevrons mean 10:

This number (which we can write as 2 ; 0 ; 13) means 2 × 3600 + 0 × 60 + 13 = 7213. Four thousand years later, we still use the same system when dealing with angles or with time: 2 hours, no minutes, and 13 seconds is 7213 seconds.
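As a rough sketch of that arithmetic in R (the helper name from.sexagesimal is my own invention for illustration), a positional base-60 number can be evaluated by folding over its digits, most significant first:

```r
# Hypothetical helper: convert base-60 digits (most significant first) to a number.
from.sexagesimal <- function(digits) {
  # Horner's rule: ((d1 * 60) + d2) * 60 + d3, and so on
  Reduce(function(acc, d) acc * 60 + d, digits)
}

from.sexagesimal(c(2, 0, 13))  # 2 * 3600 + 0 * 60 + 13 = 7213
```

The same function handles hours, minutes, and seconds, since that notation is the same positional system.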

Later, the Babylonians introduced a variety of explicit symbols for zero. By 400 BC, a pair of angled wedges was used:

The Babylonian zero was never used at the end of a number. The Babylonians were happy to move the decimal point (actually, “sexagesimal point”) forwards and backwards to facilitate calculation. The number ½, for example, was treated the same as 30 (which is half of 60). In much the same way, 20th century users of the slide rule treated 50, 5, and 0.5 as the same number. What is 0.5 ÷ 20? The calculation is done as 5 ÷ 2 = 2.5. Only at the end do you think about where the decimal point should go (0.025).

Greek mathematics in words

Kaplan says about zero that “the Greeks had no word for it.” Is that true?

Much of Greek mathematics was done in words. For example, the famous Proposition 3 in the Measurement of a Circle (Κύκλου μέτρησις) by Archimedes reads:

Παντὸς κύκλου ἡ περίμετρος τῆς διαμέτρου τριπλασίων ἐστί, καὶ ἔτι ὑπερέχει ἐλάσσονι μὲν ἤ ἑβδόμῳ μέρει τῆς διαμέτρου, μείζονι δὲ ἢ δέκα ἑβδομηκοστομόνοις.

Phonetically, that is:

Pantos kuklou hē perimetros tēs diametrou triplasiōn esti, kai eti huperechei elassoni men ē hebdomō merei tēs diametrou, meizoni de ē deka hebdomēkostomonois.

Or, in English:

The perimeter of every circle is triple the diameter plus an amount less than one seventh of the diameter and greater than ten seventy-firsts.

In modern notation, we would express that far more briefly as 10/71 < π − 3 < 1/7 or 3.141 < π < 3.143.
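A quick numerical check of these bounds, as a minimal sketch in R (the variable names are arbitrary):

```r
lower <- 3 + 10 / 71   # Archimedes' lower bound, about 3.1408
upper <- 3 + 1 / 7     # Archimedes' upper bound, about 3.1429
bounds.hold <- lower < pi && pi < upper
bounds.hold            # TRUE
```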

The Greek words for zero were the two words for “nothing” – μηδέν (mēden) and οὐδέν (ouden). Around 100 AD, Nicomachus of Gerasa (Gerasa is now the city of Jerash, Jordan) wrote in his Introduction to Arithmetic (Book 2, VI, 3) that:

οὐδέν οὐδενί συντεθὲν … οὐδέν ποιεῖ (ouden oudeni suntethen … ouden poiei)

That is, zero (nothing) can be added:

nothing and nothing, added together, … make nothing

However, we cannot divide by zero. Aristotle, in Book 4, Lectio 12 of his Physics, tells us that:

οὐδὲ τὸ μηδὲν πρὸς ἀριθμόν (oude to mēden pros arithmon)

That is, 1/0, 2/0, and so forth make no sense:

there is no ratio of zero (nothing) to a number

If we view arithmetic primarily as a game of multiplying, dividing, taking ratios, and finding prime factors, then poor old zero really does have to sit on the sidelines (in modern terms, zero is not part of a multiplicative group).

Greek calculation

For business calculations, surveying, numerical tables, and most other mathematical calculations (e.g. the proof of Archimedes’ Proposition 3), the Greeks used a non-positional decimal system, based on 24 letters and 3 obsolete letters. In its later form, this was as follows:

Units | Tens | Hundreds
α = 1 | ι = 10 | ρ = 100
β = 2 | κ = 20 | σ = 200
γ = 3 | λ = 30 | τ = 300
δ = 4 | μ = 40 | υ = 400
ε = 5 | ν = 50 | φ = 500
ϛ (stigma) = 6 | ξ = 60 | χ = 600
ζ = 7 | ο = 70 | ψ = 700
η = 8 | π = 80 | ω = 800
θ = 9 | ϙ (koppa) = 90 | ϡ (sampi) = 900

For users of R:

to.greek.digits <- function (v) { # v is a vector of numbers
  if (any(v < 1 | v > 999)) stop("Can only do Greek digits for 1..999")
  else {
    # the 27 letters in order: units (1..9), tens (10..90), hundreds (100..900)
    s <- intToUtf8(c(0x3b1:0x3b5,0x3db,0x3b6:0x3c0,0x3d9,0x3c1,0x3c3:0x3c9,0x3e1))
    greek <- strsplit(s, "", fixed=TRUE)[[1]]
    # d picks the letter for digit i at the given decimal position (empty string for 0)
    d <- function(i, power=1) { if (i == 0) "" else greek[i + (power - 1) * 9] }
    f <- function(x) { paste0(d(x %/% 100, 3), d((x %/% 10) %% 10, 2), d(x %% 10)) }
    sapply(v, f)
  }
}

For example, the “number of the beast” (666) as written in Byzantine manuscripts of the Bible is χξϛ (older manuscripts spell the number out in words: ἑξακόσιοι ἑξήκοντα ἕξ = hexakosioi hexēkonta hex).
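Going the other way is just as easy. The following is a minimal sketch of a decoder (from.greek.digits is a hypothetical helper of my own, and it simply adds up letter values without checking that the letters appear in a sensible order):

```r
# Hypothetical helper: decode a single string of Greek numeral letters.
from.greek.digits <- function(s) {
  # Code points of the 27 letters, in the order units, tens, hundreds
  letters27 <- c(0x3b1:0x3b5, 0x3db, 0x3b6:0x3c0, 0x3d9, 0x3c1, 0x3c3:0x3c9, 0x3e1)
  values <- c(1:9, (1:9) * 10, (1:9) * 100)
  idx <- match(utf8ToInt(s), letters27)
  if (anyNA(idx)) stop("Unrecognised letter in Greek numeral")
  sum(values[idx])
}

beast <- intToUtf8(c(0x3c7, 0x3be, 0x3db))  # chi, xi, stigma
from.greek.digits(beast)                    # 600 + 60 + 6 = 666
```

Because the system is non-positional, decoding needs no place values at all: each letter carries its own magnitude.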

This Greek system of numerals did not include zero – but then again, it was used in situations where zero was not needed.

Greek geometry

Most of Greek mathematics was geometric in nature, rather than based on calculation. For example, the famous Pythagorean Theorem tells us that the areas of two squares add up to give the area of a third.

In geometry, zero was represented as a line of zero length (i.e. a point) or as a rectangle of zero area (i.e. a line). This is implicit in Euclid’s first two definitions (σημεῖόν ἐστιν, οὗ μέρος οὐθέν = a point is that which has no part; γραμμὴ δὲ μῆκος ἀπλατές = a line is breadthless length).

In the Pythagorean Theorem, lines are multiplied by themselves to give areas, and the sum of the two smaller areas gives the third (image: Ntozis)

Graeco-Babylonian mathematics

In astronomy, the Greeks continued to use the Babylonian sexagesimal system (much as we do today, with our “degrees, minutes, and seconds”). Numbers were written using the alphabetic system described above, and at the time of Ptolemy, zero was written like this (appearing in numerous papyri from 100 AD onwards, with occasional variations):

For example, 7213 seconds would be β ō ιγ = 2 0 13 (for another example, see the image below). The circle here may be an abbreviation for οὐδέν = nothing (just as early Christian Easter calculations used N for Nulla to mean zero). The overbar is necessary to distinguish ō from ο = 70 (it also resembles the overbars used in sacred abbreviations).

This use of a circle to mean zero was passed on to the Arabs and to India, which means that our modern symbol 0 is, in fact, Graeco-Babylonian in origin (the contribution of Indian mathematicians such as Brahmagupta was not the introduction of zero, but the theory of negative numbers). I had not realised this before; from now on I will say ouden every time I read “zero.”

Part of a table from a French edition of Ptolemy’s Almagest of c. 150 AD. For the angles x = ½°, 1°, and 1½°, the table shows 120 sin(x/2). The (sexagesimal) values, in the columns headed ΕΥΘΕΙΩΝ, are ō λα κε = 0 31 25 = 0.5236, α β ν = 1 2 50 = 1.0472, and α λδ ιε = 1 34 15 = 1.5708. The columns on the right are an aid to interpolation. Notice that zero occurs six times.
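The three table entries can be checked against 120 sin(x/2) with a short R sketch (sexagesimal.fraction is a hypothetical helper name; it treats the first digit as the whole part and the rest as sixtieths and 3600ths):

```r
# Hypothetical helper: whole part followed by sexagesimal fraction digits.
sexagesimal.fraction <- function(digits) {
  sum(digits / 60^(seq_along(digits) - 1))
}

x <- c(0.5, 1, 1.5)                      # angles in degrees
chord <- 120 * sin((x / 2) * pi / 180)   # 120 sin(x/2), converting x/2 to radians
table.values <- c(sexagesimal.fraction(c(0, 31, 25)),
                  sexagesimal.fraction(c(1, 2, 50)),
                  sexagesimal.fraction(c(1, 34, 15)))
round(table.values, 4)                   # 0.5236 1.0472 1.5708
max(abs(chord - table.values))           # smaller than one unit in the last place (1/3600)
```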

Sequences, R, and the Free Monoid

An important concept in computer science is the free monoid on a set A, which essentially consists of sequences ⟨a1 … an⟩ of elements drawn from A. The key operations on the free monoid are:

  • ⟨a⟩, forming a singleton sequence from a single element a of A
  • x⊕y, concatenation of the sequences x and y, which satisfies the associative law: (x⊕y)⊕z = x⊕(y⊕z)
  • ⟨⟩, the empty sequence, which acts as an identity for concatenation: ⟨⟩⊕x = x⊕⟨⟩ = x

The free monoid satisfies the mathematical definition of a monoid, and is free in the sense of satisfying nothing else. There are many possible implementations of the free monoid, but they are all mathematically equivalent, which justifies calling it the free monoid.

In the R language, there are four main implementations of the free monoid: vectors, lists, dataframes (considered as sequences of rows), and strings (although for strings it’s difficult to tell where elements start and stop). The key operations are:

Operation | Vectors | Lists | Dataframes | Strings
⟨⟩, empty | c() | list() | data.frame(n=c()) | ""
⟨a⟩, singleton | implicit (single values are 1-element vectors) | list(a) | data.frame(n=a) | as.character(a)
x⊕y, concatenation | c(x,y) | c(x,y) | rbind(x,y) | paste0(x,y)
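The monoid laws can be spot-checked for particular values. This is only a sketch with arbitrary example data, and I have left dataframes out of it:

```r
x <- c(1, 2); y <- c(3); z <- c(4, 5)

# Associativity of concatenation on vectors
assoc.ok <- identical(c(c(x, y), z), c(x, c(y, z)))
# c() is the empty sequence (it evaluates to NULL) and acts as an identity
identity.ok <- identical(c(c(), x), x) && identical(c(x, c()), x)

# The same laws for lists and strings
list.ok <- identical(c(list("a"), list()), list("a"))
string.ok <- identical(paste0("ab", ""), "ab") &&
  identical(paste0(paste0("a", "b"), "c"), paste0("a", paste0("b", "c")))

c(assoc.ok, identity.ok, list.ok, string.ok)  # TRUE TRUE TRUE TRUE
```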

An arbitrary monoid on a set A is a set B equipped with:

  • a function f from A to B
  • a binary operation x⊗y, which again satisfies the associative law: (x⊗y)⊗z = x⊗(y⊗z)
  • an element e which acts as an identity for the binary operator: e⊗x = x⊗e = x

As an example, we might have A = {2, 3, 5, …} be the prime numbers, B = {1, 2, 3, 4, 5, …} be the positive whole numbers, f(n) = n be the obvious injection function, ⊗ be multiplication, and (of course) e = 1. Then B is a monoid on A.

A homomorphism from the free monoid to B is a function h which respects the monoid-on-A structure. That is:

  • h(⟨⟩) = e
  • h(⟨a⟩) = f(a)
  • h(x⊕y) = h(x) ⊗ h(y)

As a matter of fact, these restrictions uniquely define the homomorphism from the free monoid to B to be the function which maps the sequence ⟨a1 … an⟩ to f(a1) ⊗ ⋯ ⊗ f(an).
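For the prime-number example, the homomorphism can be sketched in R as follows (sequences of primes are represented here as numeric vectors, which is my choice for illustration):

```r
# The example monoid: B = positive whole numbers under multiplication, with e = 1
f <- function(a) a                              # the injection from primes into B
e <- 1
h <- function(s) Reduce(`*`, lapply(s, f), e)   # s is a vector of primes

x <- c(2, 3); y <- c(5, 5, 7)
h(c())                        # 1, the identity e
h(c(x, y)) == h(x) * h(y)     # TRUE: 1050 == 6 * 175
```

In this case h simply multiplies a list of prime factors back together.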

In other words, simply specifying the monoid B with its function f from A to B and its binary operator ⊗ uniquely defines the homomorphism from the free monoid on A. Furthermore, this homomorphism logically splits into two parts:

  • Map: apply the function f to every element of the input sequence ⟨a1 … an⟩
  • Reduce: combine the results of mapping using the binary operator, to give f(a1) ⊗ ⋯ ⊗ f(an)

The combination of map and reduce is inherently parallel, since the binary operator ⊗ is associative. If our input sequence is spread out over a hundred computers, each can apply map and reduce to its own segment. The hundred results can then be sent to a central computer where the final 99 ⊗ operations are performed. Among other organisations, Google has made heavy use of this MapReduce paradigm, which goes back to Lisp and APL.
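The following sketch imitates that arrangement on a single machine. The ten segments stand in for ten computers, and the function name map.reduce and the sum-of-squares task are assumptions made purely for illustration:

```r
# Hypothetical helper: map f over s, then reduce with op, starting from identity e
map.reduce <- function(s, f, op, e) Reduce(op, lapply(s, f), e)

s <- 1:1000
f <- function(a) a^2
# Split the input into 10 contiguous segments, one per imagined computer
segments <- split(s, cut(seq_along(s), 10, labels = FALSE))
# Each "computer" maps and reduces its own segment
partial <- lapply(segments, function(seg) map.reduce(seg, f, `+`, 0))
# The central computer combines the partial results
combined <- Reduce(`+`, partial, 0)
combined == map.reduce(s, f, `+`, 0)  # TRUE, because + is associative
```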

R also provides support for the basic map and reduce operations (albeit with some inconsistencies):

Operation | Vectors | Lists | Dataframes | Strings
Map with f | sapply(v,f), purrr::map_dbl(v,f) and related operators, or simply f(v) for vectorized functions | lapply(x,f) or purrr::map(x,f) | Vector operations on columns, possibly with dplyr::mutate, dplyr::transmute, purrr::pmap, or mapply | Not possible, unless strsplit or tokenisation is used
Reduce with ⊗ | Reduce(g,v), purrr::reduce(v,g), or specific functions like sum, prod, and min | purrr::reduce(x,g) | Vector operations on columns, or specific functions like colSums, with purrr::reduce2(x,y,g) useful for two-column dataframes | Not possible, unless strsplit or tokenisation is used

It can be seen that it is particularly the conceptual reduce operator on dataframes that is poorly supported by the R language. Nevertheless, the map and reduce operations are both powerful mechanisms for manipulating data.

For non-associative binary operators, purrr::reduce(x,g) and similar functions remain extremely useful, but they become inherently sequential.
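Subtraction makes a convenient illustration, using base R's Reduce rather than purrr: the order of evaluation changes the answer, so the work cannot be split up freely.

```r
left  <- Reduce(`-`, 1:4)                 # ((1 - 2) - 3) - 4 = -8
right <- Reduce(`-`, 1:4, right = TRUE)   # 1 - (2 - (3 - 4)) = -2
left == right                             # FALSE
```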

For more about purrr, see

2019 World Solar Challenge: the route

Following on from my route map above for the World Solar Challenge, here are some personal route notes (revised from 2015 and 2017). The WSC has confirmed that the control stops are as indicated.

The graph below shows approximate altitudes (taken from the Stanford 2013 elevation profile for this version of the graph). The highest point on the route (about 730 m) is 20 km north of Alice Springs, although the steepest hill (Hayes Creek Hill, summit 203 m) is about 170 km from Darwin.

Darwin – Start

Solar Team Eindhoven’s Stella starts the race in 2013 (photo: WSC)

The city of Darwin marks the start of the race.

Katherine – 322 km – Control Stop 1

En route to Katherine in 2011 (photo: UC Berkeley Solar Vehicle Team)

The town of Katherine (on the Katherine River) is a gateway to Nitmiluk National Park. It also serves the nearby Royal Australian Air Force base. The average maximum October temperature is 37.7°C.

Daly Waters – 588 km – Control Stop 2

The famous Daly Waters pub (photo: Lakeyboy)

Daly Waters is a small town with a famous pub. The Eindhoven team left a shirt there in 2015.

Dunmarra – 633 km

University of Toronto’s Blue Sky Solar team leaves the Dunmarra control stop in 2013 (photo: Blue Sky Solar)

Dunmarra once served the Overland Telegraph Line. Today it is little more than a roadhouse, motel, and caravan park. In previous races, this was a control stop.

Tennant Creek – 987 km – Control Stop 3 / End of Cruiser Stage 1

Tennant Creek (photo: Tourism NT)

Tennant Creek (population about 3,500) is a small town serving nearby mines, cattle stations, and tourist attractions. Shopping can be done at Tennant Creek IGA.

For 2019, Tennant Creek marks the end of Cruiser Stage 1. Cruisers must arrive between 14:00 and 17:00 on Monday (with penalties for arriving after 14:00). Cruiser teams will spend the night, and have the option of metered recharging between sunset and 23:00.

Karlu Karlu / Devils Marbles Conservation Reserve

Nuon Solar Team’s Nuna7 drives by the Devils Marbles in 2013 (photo: Jorrit Lousberg)

The 1,802 hectare Karlu Karlu / Devils Marbles Conservation Reserve lies along both sides of the Stuart Highway about 100 km south of Tennant Creek. It is home to a variety of reptiles and birds, including the fairy martin (Petrochelidon ariel) and the sand goanna (Varanus gouldii). Race participants, of course, don’t have time to look (unless, by chance, this is where they stop for the night).

Barrow Creek – 1,210 km – Control Stop 4

Barrow Creek Roadhouse and surrounds (photo: Adrian Kitchingman)

Barrow Creek once served the Overland Telegraph Line and nearby graziers, but is now nothing but a roadhouse. The Telegraph Station is preserved as a historical site.

Ti Tree – 1,300 km

Nuon Solar Team’s Nuna6 drives by a fire between Tennant Creek and Alice Springs in 2011 (photo: Hans Peter van Velthoven)

Ti Tree is a small settlement north of Alice Springs. Much of the local area is owned by the Anmatyerre people. In previous races, this was a control stop.

Alice Springs – 1,493 km – Control Stop 5

Alice Springs (photo: Ben Tillman)

Alice Springs is roughly the half-way point of the race.

Kulgera – 1,766 km – Control Stop 6

Sunset near Kulgera (photo: “dannebrog”)

Kulgera is a tiny settlement 20 km from the NT / SA Border. The “pub” is Kulgera’s main feature.

NT / SA Border – 1,786 km

Entering South Australia (photo: Phil Whitehouse)

The sign at the Northern Territory / South Australia border shows Sturt’s Desert Pea (Swainsona formosa), the floral emblem of the state of South Australia.

Marla – 1,945 km

Road train at Marla (photo: Ed Dunens)

Marla (population 100) has a health centre, a roadhouse/motel/supermarket complex, a police station, and a small car repair workshop. The name of the town may be a reference to the mala (Lagorchestes hirsutus) or to an Aboriginal word for “kangaroo.”

Coober Pedy – 2,178 km – Control Stop 7 / End of Cruiser Stage 2

Coober Pedy (photo: “Lodo27”)

The town of Coober Pedy is a major centre for opal mining. Because of the intense desert heat, many residents live underground.

For 2019, Coober Pedy marks the end of Cruiser Stage 2. Cruisers must arrive between 16:30 and 17:00 on Wednesday (with penalties for arriving after 16:30). Cruiser teams will spend the night, and have the option of metered recharging between sunset and 23:00.

Glendambo – 2,432 km – Control Stop 8

The Belgian team’s Indupol One leaves Glendambo control stop in 2013 (photo: Punch Powertrain Solar Team / Geert Vanden Wijngaert)

Glendambo is another small outback settlement.

Port Augusta – 2,720 km – Control Stop 9

At Port Augusta, the highway reaches the Spencer Gulf. From this point, traffic becomes much heavier, which makes life more difficult for the drivers in the race.

Adelaide – Finish

Adelaide makes quite a contrast to that lengthy stretch of desert (photo: “Orderinchaos”)

Adelaide, the “City of Churches,” is the end of the race. The official finish line marks 3,022 km from Darwin.

Cruisers must arrive between 11:30 and 14:00 on Friday (with penalties for arriving after 11:30).

Personality and Gender

The so-called “Big Five” personality traits are often misunderstood. They all have catchy names, expressed by the acronym CANOE (or OCEAN), but in fact all they are is a summary of answers to certain kinds of personality questions:

  • Conscientiousness: I pay attention to details; I follow a schedule; …
  • Agreeableness: I am interested in people; I feel the emotions of others; …
  • Neuroticism: I get upset easily; I worry about things; …
  • Openness to experience: I am full of ideas; I am interested in abstractions; …
  • Extraversion: I am the life of the party; I start conversations; … (this last one is also measured by the MBTI test)

These tests work in multiple cultures. In this article, I am using data from the Dutch version of the test, the “Vijf PersoonlijkheidsFactoren Test” developed by Elshout and Akkerman. Specifically, I am using data from 8,954 psychology freshmen at the University of Amsterdam during 1982–2007 (Smits, I.A.M., Dolan, C.V., Vorst, H.C., Wicherts, J.M. and Timmerman, M.E., 2013. Data from ‘Cohort Differences in Big Five Personality Factors Over a Period of 25 Years’. Journal of Open Psychology Data, 1(1), p.e2). In my analysis, I have compensated for missing data and for the fact that the sample was 69% female.

The Dutch test consists of 70 items, in 5 groups of 14. The following tree diagram is the result of UPGMA hierarchical clustering on pairwise correlations between all 70 items. It can be seen that they naturally cluster into 5 groups corresponding almost perfectly to the “Big Five” personality traits – the exception being item A11, which fits extraversion slightly better (r = 0.420) than its own cluster of agreeableness (r = 0.406). This lends support to the idea that the test is measuring five independent things, and that these five things are real.
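For readers who want to try this kind of clustering, here is a minimal sketch on simulated data (not the Amsterdam dataset). It assumes that UPGMA corresponds to average linkage in hclust and that 1 − r is used as the dissimilarity; the exact settings behind the diagram may have differed.

```r
set.seed(1)                       # simulated data only, for illustration
n <- 2000; traits <- 5; per.trait <- 14
latent <- matrix(rnorm(n * traits), n, traits)
# Each simulated item = its trait's latent score plus independent noise
trait.of.item <- rep(1:traits, each = per.trait)
items <- latent[, trait.of.item] + matrix(rnorm(n * traits * per.trait), n)
# Average-linkage (UPGMA) clustering on 1 - r
tree <- hclust(as.dist(1 - cor(items)), method = "average")
clusters <- cutree(tree, k = 5)
table(trait.of.item, clusters)    # each trait's 14 items should share one cluster
```

Calling plot(tree) would draw the corresponding tree diagram.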

On tests like this, women consistently score, on average, a little higher than men in conscientiousness, agreeableness, neuroticism, and extraversion (and in this dataset, on average, a little lower in openness to experience). Mean values for conscientiousness in this dataset (on a scale of 14 to 98) were 60.3 for women and 56.1 for men (a difference of 4.2). For agreeableness, they were 70.6 for women and 67.6 for men (a difference of 3.0). There are also small age effects for conscientiousness, agreeableness, and openness to experience (over the 18–25 age range), which I have ignored.

The chart below shows distributions of conscientiousness and agreeableness among men and women, and the relative frequency of different score ranges (compensating for the fact that the sample was 69% female). Thus, based on this data, a random sample of people with both scores in the range 81 to 90 would be 74% female. With both scores in the range 41 to 50, the sample would be 72% male. This reflects a simple mathematical truth – small differences in group means can produce substantial differences at the tails of the distribution.
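The tail effect can be illustrated with a single trait. This sketch uses the conscientiousness means quoted above, but the normal distributions and the standard deviation of 10 are assumptions of mine, not values from the dataset, so the numbers it produces are illustrative only:

```r
# Means from the text; the standard deviation is an ASSUMED value
mean.women <- 60.3; mean.men <- 56.1; assumed.sd <- 10

# Share of women among people scoring above a threshold, in a 50/50 population
female.share.above <- function(threshold) {
  p.women <- pnorm(threshold, mean.women, assumed.sd, lower.tail = FALSE)
  p.men   <- pnorm(threshold, mean.men,   assumed.sd, lower.tail = FALSE)
  p.women / (p.women + p.men)
}

shares <- sapply(c(50, 60, 70, 80, 90), female.share.above)
shares  # the female share rises steadily as the threshold moves into the upper tail
```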

American Solar Challenge 2018: The run to Burns

I recently got my hands on the GPS tracker data for the American Solar Challenge last July. Above (for the 6 Challengers completing the stage) and below (for the Cruisers) are distance/speed charts for the run from Craters of the Moon to Burns, which seems to be the stage of the route with the best data (at this time of year I haven’t the time for a more detailed analysis). Small coloured circles show end-of-day stops.

Stage times were 15:Western Sydney 8:05:16, 101:ETS Quebec 8:20:13, 2:Michigan 8:25:08, 55:Poly Montréal 8:42:52, 4:MIT 9:07:58, and 6:CalSol 9:30:12 for Challengers, and 828:App State 10:22:37, 559:Bologna 12:13:57, and 24:Waterloo 15:29:12 for Cruisers (note that Bologna was running fully loaded on solar power only, while the other Cruisers recharged from the grid).

The data has been processed by IOSiX. I’m not sure what that involved, but I’ve taken the data as gospel, eliminating any datapoints out of hours, off the route, or with PDOP more than 10. Notice that there are a few tracker “black spots,” and that trackers in some cars work better than in others. The small elevation charts are taken from the GPS tracker data, so they will not be reliable in the “black spots” (in particular, the big hill before Burns has been truncated – compare my timing chart).
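The filtering step might look something like the sketch below. The dataframe, its column names, and the driving window of 9:00 to 18:00 are all invented for illustration; only the PDOP limit of 10 comes from the description above.

```r
# Hypothetical tracker records; column names and values are invented
track <- data.frame(
  hour     = c(8.5, 10.0, 12.5, 15.0, 19.0),
  on.route = c(TRUE, TRUE, FALSE, TRUE, TRUE),
  pdop     = c(2.1, 3.4, 1.8, 12.0, 2.5)
)
# Assumed driving window (not stated in the text)
start.hour <- 9; end.hour <- 18
clean <- subset(track, hour >= start.hour & hour <= end.hour & on.route & pdop <= 10)
nrow(clean)  # 1: only the 10:00 record passes all three filters
```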