The encoding of MAM for Sefaria

Author: Ben Denckla

Revision: 2025-06-26 / 30th of Sivan, 5785

Introduction

This document describes the encoding of MAM (Miqra According to the Masorah) for Sefaria. MAM is now the default Tanakh on Sefaria. Rather than describing the encoding in full, this document focuses only on differences in encoding between MAM and Sefaria’s former default Tanakh. Before we go any further, some terminology:

By “MAM” I usually mean a Sefaria-specific set of CSV files created from the Google Sheet for MAM. These CSV files are imported by Sefaria employees when they want an update. These CSV files are not generated by Google Sheets. They are generated by a Python program I wrote, whose input is the Google Sheet for MAM.
By “old Sefaria” I usually mean, specifically, Sefaria’s former default Tanakh. This Tanakh was originally imported from what is now called UXLC: the tanach.us Unicode XML version of the Groves WLC.

This document has three top-level sections:

Tag encoding: This section describes features encoded using HTML tags like  & . Some of these features are not solely encoded using tags: string details are sometimes relevant, too. Thus, there is some conceptual overlap between this section & the next, which describes features solely encoded at the string level. This section includes the following subsections:
1. List of class attribute values
2. Legarmeih & פסק
3. Small, large, & hung letters
4. Samekh, pe, & inverted nun
5. Implicit maqaf
6. The five main ketiv/qere types
7. Ketiv velo qere & qere velo ketiv
8. Trivial ketiv/qere pairs
9. Notes
10. Good endings
String encoding: This section describes features encoded solely at the most basic level: Unicode strings. This section includes the following subsections:
1. Qamats qatan
2. אתנח הפוך
3. גרש-telisha & gershayim-telisha
4. Stress זרקא
5. No displacement of זרקא by ל
6. No special meteg placements
7. Robust Unicode
CSV topics: This section includes the following subsections:
1. Metadata fields: This subsection describes MAM’s values for selected metadata fields.
2. Joshua 21: This subsection describes how MAM encodes two verses that it lacks compared to old Sefaria.

This document uses the following typographic conventions:

Text styled like this will hover-reveal some extra explanation (only in media supporting hovering, of course). These hover-reveals may even be possible when viewing the PDF version of this document, depending on which viewer is being used.
Text styled like this is intended to focus the reader’s attention on certain letters within a word. E.g. ב in אבג.
Text styled like this is the name of a code point in Unicode. These names are often abbreviated, e.g. ZARQA is used instead of HEBREW ACCENT ZARQA. Some abbreviations like CGJ can be hover-expanded (only in media supporting hovering, of course).

Some further notes on the names of code points:

Some Hebrew marks do not have a correspondingly-named code point. Though this situation may seem undesirable, on the bright side, we can safely use the mark’s name without risk of confusion about whether we are referring to a code point. In such cases it is clear that the mark’s name refers to the more general, diffuse, and abstract idea of the mark. So, for example, we can safely use “tsinnorit” because it has no correspondingly-named code point.
Some Hebrew marks have a correspondingly-named code point, and the mark and its code point are in one-to-one relation:
1. that is the one and only code point that is used to represent that mark
2. that is the one and only mark that can be represented by that code point
In such cases we can safely conflate the name of the mark with the name of its code point. For example, we can safely use “hiriq” to refer to the code point HIRIQ.
Finally, we come to the confusing cases. Some Hebrew marks have a correspondingly-named code point, but they are not in one-to-one relation. One or both of the following are true:
1. more than one code point can be used to represent that mark
2. more than one mark can be represented by that code point
In such cases we present the mark’s name using the Hebrew alphabet, with a transliteration provided on hover. The table below shows cases of importance for this document. Marks that can be represented by more than one code point are shown in bordered cells. In two cases this confusion is only present in old Sefaria, and this is marked below.

code point	mark
ZARQA	tsinnorit
ZARQA	זרקא
ZINOR	זרקא
ZINOR	צנור
PASEQ	פסק
PASEQ	legarmeih
YERAH BEN YOMO	ירח בן יומו
YERAH BEN YOMO	galgal
YERAH BEN YOMO	אתנח הפוך	old Sefaria only
ATNAH HAFUKH	אתנח הפוך
GERESH	גרש
GERESH MUQDAM	גרש	old Sefaria only
GERESH MUQDAM	גרש מקדם
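The abbreviated code point names used in this document can be expanded mechanically. The sketch below is only an illustration, assuming Python’s standard `unicodedata` module; the `PREFIXES` tuple and the `expand` helper are hypothetical names, not part of MAM’s tooling.

```python
import unicodedata

# Prefixes under which Unicode names the Hebrew marks used in this document.
# (Hypothetical helper data for this sketch, not part of MAM's tooling.)
PREFIXES = ("HEBREW ACCENT ", "HEBREW PUNCTUATION ", "HEBREW POINT ")


def expand(abbreviation):
    """Return (full Unicode name, code point) for an abbreviated name."""
    for prefix in PREFIXES:
        try:
            char = unicodedata.lookup(prefix + abbreviation)
        except KeyError:
            continue  # no code point by that name; try the next prefix
        return prefix + abbreviation, f"U+{ord(char):04X}"
    raise KeyError(abbreviation)


for abbr in ("ZARQA", "ZINOR", "PASEQ", "YERAH BEN YOMO",
             "ATNAH HAFUKH", "GERESH", "GERESH MUQDAM"):
    print(abbr, "->", *expand(abbr))
```

Because the accent prefix is tried first, GERESH resolves to the accent rather than to the punctuation mark of the same short name.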

This document, when viewed in its original HTML form, expects the following fonts to be available on the system:

Taamey D
Taamey Frank CLM (only a few uses)
SBL Hebrew (only a few uses)
Ezra SIL (only a few uses)

This document’s default font for pointed Hebrew is a font of my own invention called Taamey D. It is derived from Taamey Frank. It is the only liberally-licensed font that can render all the examples of this document without distracting mark collisions.

Although this document is intended to self-sufficiently describe MAM’s encoding, there are several other documents that may also be useful for readers to consult:

features-of-interest.pdf: This document lists instances of many features of interest. Where feasible, these lists are exhaustive, i.e. every instance of a feature is listed. Many of the features listed there are also covered, from a different perspective, in this document.
index.html & features-of-interest.html: These are the HTML files for this document and for “Features of Interest.” It may be helpful to view them in a text editor or using a browser’s “view source” feature.
The 39 CSV files being documented: These files are the least friendly, but of course the most definitive, documentation of themselves!

Tag encoding

List of `class` attribute values

mam-spi-{samekh, pe, invnun}
mam-implicit-maqaf
mam-kq
mam-kq-{k, q, trivial}
footnote (on )
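A reader who wants to check which of these class values occur in a given verse could collect them with Python’s standard `html.parser`. This is a minimal sketch: `ClassCollector`, `classes_in`, and the sample fragment (including its attribute quoting and placeholder letters) are assumptions made for illustration, not excerpts from the CSV files.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every class attribute value seen in a fragment of HTML."""

    def __init__(self):
        super().__init__()
        self.classes = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.extend(value.split())


def classes_in(verse_html):
    """Return the class values in verse_html, in document order."""
    collector = ClassCollector()
    collector.feed(verse_html)
    collector.close()
    return collector.classes


# Hypothetical fragment: K and Q stand in for a ketiv and a qere.
sample = ('<span class="mam-kq"><span class="mam-kq-k">(K)</span> '
          '<span class="mam-kq-q">[Q]</span></span>')
print(classes_in(sample))
```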

Legarmeih & פסק

As the table below shows, MAM encodes legarmeih & פסק distinctly, but old Sefaria does not.

MAM leg.	THIN SPACE	PASEQ inside `<b>`	SPACE
MAM פסק	THIN SPACE	PASEQ inside `<small>`	THIN SPACE
old Sef. leg./פסק	SPACE	PASEQ	SPACE

The table above shows that MAM distinguishes legarmeih from פסק in the following two ways:

Legarmeih uses a SPACE at the end, whereas פסק uses a THIN SPACE.
Legarmeih uses `<b>` whereas פסק uses `<small>`.

Some examples appear below.

	leg. (G2:5)	פסק (G1:5)	leg. & פסק (Je4:19)
MAM	וְכֹ֣ל ׀ שִׂ֣יחַ	אֱלֹהִ֤ים ׀ לָאוֹר֙	מֵעַ֣י ׀ מֵעַ֨י ׀ (אחולה) [אֹחִ֜ילָה]
old Sef.	וְכֹ֣ל ׀ שִׂ֣יחַ	אֱלֹהִ֤ים ׀ לָאוֹר֙	מֵעַ֣י ׀ מֵעַ֨י ׀ אחולה [אוֹחִ֜ילָה]

Some further notes on legarmeih vs. פסק:

Although legarmeih & פסק mean quite different things, they are both encoded with PASEQ. They are distinguished by the context around PASEQ.

First, what is in common? Both legarmeih & פסק have a thin space before their PASEQ.

PASEQ-for-legarmeih is followed by a normal-width space, and is styled using whatever the default styling is for the `<b>` element. The asymmetric spacing around the PASEQ (thin before, normal-width after) indicates that the PASEQ belongs to the word preceding it, not to the word following it. Cleaving to the preceding word indicates that the PASEQ is being used as a legarmeih. Recall that the most common use of legarmeih is to modify the meaning of a munach on the word preceding the legarmeih.

PASEQ-for-פסק is followed by a thin space, and it is styled using whatever the default styling is for the `<small>` element. In Wikisource MAM, it is both smaller and gray. The symmetric spacing around PASEQ-for-פסק indicates that the mark belongs neither to the word before it, nor to the word after it. Or it belongs to both equally, if you prefer. Recall that פסק serves various functions, all of which somehow separate its neighbor words. It is not an accent, nor does it, like legarmeih, modify the meaning of an accent.
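The two contexts described in this section can be told apart by simple string matching. The sketch below assumes that the tags carry no attributes and that nothing intervenes between the spaces, the tags, and PASEQ; the function name and the sample string are hypothetical.

```python
THIN_SPACE = "\u2009"
PASEQ = "\u05c0"

# Context strings as described in this section. The assumption that the tags
# carry no attributes and that nothing else intervenes is this sketch's own.
LEGARMEIH = THIN_SPACE + "<b>" + PASEQ + "</b>" + " "
PASEQ_PROPER = THIN_SPACE + "<small>" + PASEQ + "</small>" + THIN_SPACE


def count_paseq_uses(verse_html):
    """Count PASEQ used as legarmeih and PASEQ used as paseq proper."""
    return {
        "legarmeih": verse_html.count(LEGARMEIH),
        "paseq": verse_html.count(PASEQ_PROPER),
    }


# Hypothetical verse: three placeholder words, one mark of each kind.
sample = "A" + LEGARMEIH + "B" + PASEQ_PROPER + "C"
print(count_paseq_uses(sample))
```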

Small, large, & hung letters

MAM uses the following somewhat natural tags for these letters: `<small>`, `<big>`, & `<sup>`. The following table shows example words that illustrate all three types of letters and two exotic contexts: small & large in the same word and small in ketiv.

	S&L (Da6:20)	hung (Jb38:13)	S in k (Jb7:5)
MAM	בִּשְׁפַּרְפָּרָ֖א	רְשָׁ^עִ֣ים	(וגיש) [וְג֣וּשׁ]
old Sefaria	בִּשְׁפַּרְפָּרָ֖א	רְשָׁעִ֣ים	וגיש

Old Sefaria stripped special letter encoding from WLC when importing WLC, so there’s no comparable feature.

Samekh, pe, & inverted nun

MAM encloses samekh & pe in curly braces whereas old Sefaria encloses them in parentheses. MAM uses no braces, brackets, or parentheses around inverted nun whereas old Sefaria encloses it in parentheses. MAM encloses these letters in spans with the following classes:

mam-spi-samekh
mam-spi-pe
mam-spi-invnun

These spans include not only the letter itself but also any braces around them. These spans are preceded by `&nbsp;` (NO-BREAK SPACE). The span for samekh is followed by eight `&nbsp;` entities in a row. The span for pe is followed by  .

The tables below compare the way MAM & old Sefaria use ׆. These examples are complete, i.e. these are the only uses of ׆ in these Tanakh editions. The table below shows, schematically, verses 34, 35, & 36 of Numbers 10. We use an empty square, □, as a placeholder for the first & last words of a verse.

MAM	□ ... □׃ {ס} ׆ □ ... □׃ □ ... □׃ ׆ {פ}
old Sefaria	□ ... □׃ (׆) (ס) □ ... □׃ □ ... □׃ (׆) (ס)

Or, even more schematically:

	end 34	start 35	end 36
MAM	{ס}	׆	׆ {פ}
old Sefaria	(׆) (ס)		(׆) (ס)

The table below shows how, in Psalm 107, MAM & old Sefaria agree that:

There are two runs of verses separated by ׆.
The first run is seven verses long, and thus contains six ׆.
The second run is two verses long, and thus contains a single ׆.
The second run consists of verses 39 & 40.

The table below also shows how MAM & old Sefaria disagree about the following:

Old Sefaria’s run of 7 verses starts at verse 20, two verses earlier than MAM.
In old Sefaria, the ׆ separator ends the verse that precedes it whereas in MAM the ׆ starts the verse that follows it.

	MAM	old Sefaria
20		□ ... □׃ (׆)
21		□ ... □׃ (׆)
22	□ ... □׃	□ ... □׃ (׆)
23	׆ □ ... □׃	□ ... □׃ (׆)
24	׆ □ ... □׃	□ ... □׃ (׆)
25	׆ □ ... □׃	□ ... □׃ (׆)
26	׆ □ ... □׃	□ ... □׃
27	׆ □ ... □׃
28	׆ □ ... □׃
...
39	□ ... □׃	□ ... □׃ (׆)
40	׆ □ ... □׃	□ ... □׃

Implicit maqaf

MAM has 63 maqaf marks that are implied by their accent context but do not exist in medieval manuscripts or most printed editions. They are styled gray in Wikisource MAM. In CSV MAM, they are enclosed in a span with class mam-implicit-maqaf. Here’s an example from Psalm 2:7:

MAM	אֶֽ֫ל־חֹ֥ק
old Sefaria	אֶֽ֫ל חֹ֥ק

The five main ketiv/qere types

MAM encloses a ketiv/qere pair in a span with class mam-kq. Within this span, the ketiv and qere are enclosed in spans with classes mam-kq-k & mam-kq-q respectively.

For the five main ketiv/qere types in MAM, the table below schematically compares MAM & old Sefaria. We use an empty square, □, as a placeholder for a word preceding a maqaf.

	k1q1→	k1q1←	k1q2	k2q1	k2q2
MAM	(כ) [קְ]	□־[קְ] (כ)	(כ) [קְ קְ]	(כ כ) [קְ]	(כ כ) [קְ קְ]
old Sefaria	כ [קְ]	□־כ [קְ]	כ [קְ] [קְ]	כ כ [קְ]	כ כ [קְ] [קְ]

Here is a table with another way to compare ketiv/qere encoding:

	MAM	old Sefaria
כ parens	parens used	parens not used
order	קְ then כ if post-maqaf	כ then קְ always
q2 brackets	[קְ קְ]	[קְ] [קְ]
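The differences summarized in the tables of this section can be restated as two small rendering functions. This is a sketch of the visible text only: it omits the `mam-kq` spans, and the function names and the `post_maqaf` flag are hypothetical. The letters k and q stand in for ketiv and qere words.

```python
def render_mam(ketiv, qere, post_maqaf=False):
    """MAM style: one pair of parens, one pair of brackets, qere first after maqaf."""
    k = "(" + " ".join(ketiv) + ")"
    q = "[" + " ".join(qere) + "]"
    return f"{q} {k}" if post_maqaf else f"{k} {q}"


def render_old_sefaria(ketiv, qere):
    """Old Sefaria style: bare ketiv, then each qere word in its own brackets."""
    k = " ".join(ketiv)
    q = " ".join("[" + word + "]" for word in qere)
    return f"{k} {q}"


print(render_mam(["k"], ["q", "q"]))          # k1q2 in MAM
print(render_old_sefaria(["k"], ["q", "q"]))  # k1q2 in old Sefaria
print(render_mam(["k"], ["q"], post_maqaf=True))
```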

Ketiv velo qere & qere velo ketiv

Instances of these two “velo” phenomena are enclosed in spans whose classes are the same as those used for normal ketiv/qere phenomena: mam-kq-k & mam-kq-q respectively. But, unlike instances of normal ketiv/qere, instances of “velo” are not enclosed in a parent span of class mam-kq. As the table below shows, for ketiv velo qere, MAM uses parentheses, whereas old Sefaria does not.

	k velo q	q velo k
MAM	(כ)	[קְ]
old Sefaria	כ	[קְ]

Trivial ketiv/qere pairs

MAM considers 116 ketiv/qere pairs to be trivial and does not include a ketiv for these pairs. The qere is encoded in a span with class mam-kq-trivial. The table below shows an example from Genesis 13:3.

MAM	אׇֽהֳלֹה֙
old Sefaria	אהלה [אָֽהֳלוֹ֙]

Notes

There are 29 concise notes in MAM. All of these notes point out variations which appear in Torah scrolls (or scrolls of Esther) that are actually used in synagogues today, i.e. variations that are found in the written letter-text. These notes thus differ substantially from the thousands of textual notes in WLC, which document minor anomalies in vocalization. Old Sefaria stripped notes from WLC when importing WLC, so there’s no comparable feature.

A MAM note is encoded in two parts:

an asterisk inside 
the note itself, parenthesized inside

Here is an example, from Deuteronomy 11:21:

*
(בספרי ... גדולה)

Here is what that example looks like, in context:

... עַל־הָאָֽרֶץ׃^*(בספרי תימן הָאָֽרֶץ׃ בצד״י גדולה) {ס} כִּי֩ אִם־שָׁמֹ֨ר ...

Observe that the note contains a pointed word with a large letter.

Notes appear after their referent in all cases except one, in Deuteronomy 22:6. There it is better for the callout (the asterisk) to precede rather than follow the word that the note refers to.

Good endings

A “good ending” is an instruction to repeat the second-to-last verse of a book when reading the book publicly. This repetition causes the reading to end on a positive note. By convention, the instruction takes the following somewhat implicit form: the unpointed version of the verse to be repeated appears after the end of the book proper. I.e. there are no instructions per se.

In old Sefaria, good endings appear at the end of three books: Malachi, Lamentations, & Ecclesiastes. In old Sefaria and MAM, a good ending is encoded as follows (the only use of HTML tags in old Sefaria):

<br><small>...</small>

In MAM, good endings appear at the end of four books. Three of these four books are the same three as in old Sefaria. The one additional book is Isaiah. The example below shows the last two verses of Lamentations (5:21 & 22) and the corresponding good ending.

הֲשִׁיבֵ֨נוּ יְהֹוָ֤ה׀אֵלֶ֙יךָ֙ (ונשוב) [וְֽנָשׁ֔וּבָה] חַדֵּ֥שׁ יָמֵ֖ינוּ כְּקֶֽדֶם׃ כִּ֚י אִם־מָאֹ֣ס מְאַסְתָּ֔נוּ קָצַ֥פְתָּ עָלֵ֖ינוּ עַד־מְאֹֽד׃
השיבנו יהוה אליך ונשובה חדש ימינו כקדם
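A consumer that wants to treat a good ending separately from the verse proper could split on the markup described above. The sketch below assumes the good ending sits at the end of the same string as the book’s final verse; that placement, the function name, and the placeholder strings are assumptions.

```python
OPEN, CLOSE = "<br><small>", "</small>"


def split_good_ending(verse_html):
    """Return (verse proper, good ending), or (verse, None) if there is none."""
    body, marker, rest = verse_html.partition(OPEN)
    if not marker or not rest.endswith(CLOSE):
        return verse_html, None
    return body, rest[: -len(CLOSE)]


# Placeholder strings stand in for the pointed verse and its unpointed repeat.
print(split_good_ending("VERSE<br><small>REPEAT</small>"))
print(split_good_ending("VERSE"))
```

Matching on `<br><small>` together, rather than on `<small>` alone, avoids confusing a good ending with the `<small>` that MAM uses for פסק.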

String encoding

Qamats qatan

QAMATS QATAN (QQ) is used in MAM but not in old Sefaria. Unfortunately, in some fonts the distinction between QAMATS (gadol) & QQ is quite subtle. Perhaps counterintuitively, usually QQ is a little taller than QAMATS (gadol). Consider the chanted word וּבְכׇל־הָרֶ֖מֶשׂ in Genesis 1:26, in various fonts:

	Taamey FC	SBL Hebrew	Ezra SIL
MAM	וּבְכׇל־הָרֶ֖מֶשׂ	וּבְכׇל־הָרֶ֖מֶשׂ	וּבְכׇל־הָרֶ֖מֶשׂ
old Sefaria	וּבְכָל־הָרֶ֖מֶשׂ	וּבְכָל־הָרֶ֖מֶשׂ	וּבְכָל־הָרֶ֖מֶשׂ

In texts like MAM, where QQ is used, the glyph for HATAF QAMATS (HQ) should really match the glyph for QQ, because the qamats part of HQ is, conceptually, inherently qatan.

But as far as I know, there is no way to accomplish this without a special font. There is no code point for hataf qamats qatan, so a special font would be needed that did one of the following:

unconditionally made the glyph for HQ match the glyph for QQ
supplied two glyphs for HQ:
1. a default one, matching QAMATS (gadol).
2. an alternate one, matching QQ.

(OpenType has a mechanism to provide both a default & an alternate glyph for the same code point. Complementarily, CSS has a mechanism to select such an OpenType alternate glyph.)

The table below shows the word אׇהֳלֹה in Genesis 9:21, in various fonts, allowing easy comparison of HQ & QQ:

Taamey FC	SBL Hebrew	Ezra SIL
אׇהֳלֹה	אׇהֳלֹה	אׇהֳלֹה

As the table below shows, in some fonts, QQ does not mix well with merkha or munach. Fortunately there are only a few such cases in MAM.

	Psalm 35:10	Eze. 44:13
Taamey FC	כׇּ֥ל־עַצְמוֹתַ֨י	עַל־כׇּל־קׇ֣דָשַׁ֔י
Taamey D	כׇּ֥ל־עַצְמוֹתַ֨י	עַל־כׇּל־קׇ֣דָשַׁ֔י

אתנח הפוך

The code point ATNAH HAFUKH (AH) is used in MAM but not in old Sefaria.

In both MAM & old Sefaria, the code point YERAH BEN YOMO (YBY) serves both of the following roles:

ירח בן יומו (only in the 21 books)
galgal (only in Jb, Pr, & Ps)

There is no ambiguity in serving both roles since the roles are exclusive.

In old Sefaria, YBY also serves, ambiguously, as אתנח הפוך, since AH is not used. For example, consider the words אֵין and קִרְבָּם in Psalm 5:10:


MAM	אֵ֪ין ... קִרְבָּ֢ם	אֵ	YBY	...	בָּ	AH	ם
old Sefaria	אֵ֪ין ... קִרְבָּ֪ם	אֵ	YBY	...	בָּ	YBY	ם
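The difference between the two encodings is easy to see by listing accent names. In the sketch below, the two strings are built by hand to represent the last syllable of the second word as described above (bet, dagesh, qamats, accent, final mem); they are not copied from the CSV files, and the code point order within them is an assumption.

```python
import unicodedata


def accent_names(word):
    """Names of the Hebrew accents (U+0591 through U+05AE) in word, in order."""
    return [unicodedata.name(ch) for ch in word if "\u0591" <= ch <= "\u05ae"]


# bet, dagesh, qamats, accent, final mem
mam_end = "\u05d1\u05bc\u05b8\u05a2\u05dd"  # accent is ATNAH HAFUKH
old_end = "\u05d1\u05bc\u05b8\u05aa\u05dd"  # accent is YERAH BEN YOMO

print(accent_names(mam_end))
print(accent_names(old_end))
```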

גרש-telisha & gershayim-telisha

In five cases in Tanakh, a word is supposed to be notated to indicate that its stressed syllable is to be chanted with first the גרש or gershayim melody and then the telisha gedolah melody. (In the title of this section I have used “telisha” without “gedolah” only for brevity, not for generality, i.e. this phenomenon occurs only with telisha gedolah.)

These five cases present a veritable minefield of problems at various levels.

It is not clear what results to aim for, particularly in the two cases where the stressed syllable is not the first syllable.
Even where it is clear what results to aim for, some results seem unattainable in some fonts.
Old Sefaria is missing gershayim in two cases.

The table below shows, in this document’s default font, all 5 cases as they are encoded in MAM, UXLC, & old Sefaria. (The default font is Taamey D.) Luckily UXLC is always the same as either MAM or old Sefaria, somewhat simplifying this complex situation.

	2K17:13	Eze48:10	G5:29	Zp2:15	L10:4
M	שֻׁ֜֠בוּ	וּ֜֠לְאֵ֜֠לֶּה	זֶ֞֠ה	זֹ֞֠את	קִ֞֠רְב֞֠וּ
UXLC	שֻׁ֜֠בוּ	וּ֠לְאֵ֜לֶּה	זֶ֞֠ה	זֹ֞֠את	קִ֠רְב֞וּ
old Sefaria	שֻׁ֝֠בוּ	וּ֠לְאֵ֜לֶּה	זֶ֠ה	זֹ֠את	קִ֠רְב֞וּ

The table below reviews some of the issues with old Sefaria:

שֻׁ֝֠בוּ	Accents collide, making order unclear. GERESH MUQDAM is used instead of plain GERESH.
זֶ֠ה, זֹ֠את	Look fine, but are missing gershayim.

Stress זרקא

As far as I know, everyone uses ZINOR for postpositive זרקא.

MAM uses ZARQA rather than ZINOR for stress זרקא. By “stress זרקא” I mean a זרקא that supplements the postpositive זרקא, indicating stress on a nonfinal syllable. Old Sefaria doesn’t (or doesn’t often?) use stress זרקא. For example, consider the word טֶרֶם in Genesis 19:4:


MAM	טֶ֘רֶם֮	טֶ	ZARQA	רֶ	ם	ZINOR
old Sefaria	טֶרֶם֮	טֶ		רֶ	ם	ZINOR

The table below shows the roles that ZINOR & ZARQA play in MAM. These roles are standard, except for MAM’s use of ZARQA for stress זרקא.

	ZINOR	ZARQA
role in the 21 books	postpositive זרקא	stress זרקא
role in Jb, Pr, & Ps	צנור	tsinnorit
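The role table above amounts to a lookup keyed on the code point and on whether the book is one of Jb, Pr, & Ps. The sketch below is only a restatement of that table; the book-name strings, the transliterated role labels, and the function name are hypothetical.

```python
ZINOR, ZARQA = "\u05ae", "\u0598"

# Jb, Pr, & Ps; the spelled-out names are this sketch's own convention.
POETIC_BOOKS = {"Job", "Proverbs", "Psalms"}

# (code point, is the book poetic?) -> role in MAM, transliterated
ROLES = {
    (ZINOR, False): "postpositive zarqa",
    (ZARQA, False): "stress zarqa",
    (ZINOR, True): "tsinnor",
    (ZARQA, True): "tsinnorit",
}


def role_in_mam(code_point, book):
    """Return the role that ZINOR or ZARQA plays in the given book."""
    return ROLES[(code_point, book in POETIC_BOOKS)]


print(role_in_mam(ZARQA, "Genesis"))
print(role_in_mam(ZARQA, "Psalms"))
```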

No displacement of זרקא by ל

In old Sefaria, in some but not all cases when a word ending in ל needs זרקא, the following things happen:

The זרקא is displaced back to the letter before the ל.
ZARQA is used for that זרקא, rather than ZINOR.

This stems from the original Michigan-Claremont coding of WLC, where 82L was used for some or all cases in which the LC’s scribe put the זרקא before the ל’s ascender. The code 82 is only documented as representing tsinnorit, but it makes sense that 82 would also be used for this case, which we might call “medial” or “impositive” זרקא.

What doesn’t make sense is why the 82 would precede the L: WLC represents postpositive זרקא after ל as L02, so why wouldn’t medial זרקא on ל be L82? Perhaps L82 at the end of a word was deemed “illegal” due to some seemingly-reasonable rule like the following:

If a word ends in a closed syllable, the only accents that can appear at the end of that word are postpositive.

This rule is reasonable only if the set of postpositive accents is defined reasonably. If adherence to a rule like this is desired, I would recommend Groves do the following:

make a distinction between semantically postpositive accents and graphically postpositive accents
state the rule in terms of semantically postpositive accents
define accent 82 as always graphically impositive, but semantically postpositive in the 21 books, i.e. semantically postpositive when not used in Jb, Pr, & Ps. Or define a new accent code for graphically impositive זרקא.

MAM only uses true postpositive זרקא. The table below compares the handling of the word יִשְׂרָאֵל in Leviticus 4:2 in MAM & old Sefaria. In its third & final row, the table also shows a proposal I made that was accepted into UXLC. In UXLC, the זרקא now appears on the ל not the א, but still before the ל’s ascender.


MAM	יִשְׂרָאֵל֮	...	אֵ		ל	ZINOR
old Sefaria	יִשְׂרָאֵ֘ל			ZARQA
UXLC	יִשְׂרָאֵל֘					ZARQA

The word יִשְׂרָאֵל appears 9 times like this in old Sefaria. Two words other than יִשְׂרָאֵל appear like this in old Sefaria:

אֶל־בְּצַלְאֵ֘ל (Exodus 36:2)
תוּכַ֘ל (Deuteronomy 14:24)

Plain old word-final L02 appears 47 times in the WLC. Some of these cases are in Jb, Pr, & Ps, in which case the 02 represents צנור not זרקא. But presumably the same graphical “problem” of the ל’s ascender still applies, whether the mark functions as צנור or זרקא. It would be interesting to know if indeed in all 47 of these cases the scribe put the mark after the ascender. I.e. it would be interesting to know if this scribal choice has been consistently coded in the WLC. There are no cases of 82L past the book of Joshua, suggesting that either the LC’s scribe or the WLC’s transcribers abandoned ל-impositive זרקא at some point.

No special meteg placements

Old Sefaria has two meteg placements that are not present in MAM:

early meteg, encoded as METEG before the vowel in question.
medial meteg, encoded as the hataf vowel in question followed by ZWJ (ZERO WIDTH JOINER) & then METEG.

Old Sefaria encodes early meteg in a “fragile” manner. This concept is explored at length in the section “Robust Unicode.” For now, suffice it to say that this fragility makes the early meteg look just like a plain (i.e. post-vowel) meteg in most (all?) popular browsers. So, in the table below, we show an example old Sefaria word not only in its actual, fragile form, but also in a somewhat hypothetical robust form, rendered robust through the use of CGJ.

The table below shows examples of these special meteg placements.

	early (G1:7)		medial (L21:10)
MAM		וַֽיְהִי־כֵֽן׃	אֲֽשֶׁר־יוּצַ֥ק
old Sefaria	fragile	וַֽיְהִי־כֵֽן׃	אֲ‍ֽשֶׁר־יוּצַ֥ק
old Sefaria	robust	וֽ͏ַיְהִי־כֵֽן׃	אֲ‍ֽשֶׁר־יוּצַ֥ק
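The fragility of early meteg follows from canonical combining classes: METEG has a higher class than PATAH, so normalization moves it after the vowel unless a CGJ intervenes. The sketch below shows this with Python’s standard `unicodedata` module on the first letter of the early-meteg example; the variable and function names are hypothetical.

```python
import unicodedata

VAV, METEG, PATAH, CGJ = "\u05d5", "\u05bd", "\u05b7", "\u034f"

fragile = VAV + METEG + PATAH       # early meteg, naive encoding
robust = VAV + METEG + CGJ + PATAH  # early meteg protected by CGJ


def code_points(text):
    """Render text as a list of U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Combining classes: marks are sorted by these values during normalization.
print(unicodedata.combining(PATAH), unicodedata.combining(METEG))
print(code_points(unicodedata.normalize("NFC", fragile)))
print(code_points(unicodedata.normalize("NFC", robust)))
```

After normalization, the fragile form has the vowel before the meteg, whereas the robust form is unchanged because CGJ has combining class 0 and so blocks reordering across it.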

Robust Unicode

It would be nice for all texts in the Sefaria system to appear as expected even when subjected to Unicode normalization, since most (all?) popular browsers normalize.

Some words are fragile: in their natural or naive encoding, normalization changes their appearance in at least one font of interest. Fragile words are rare: even though most words in Tanakh have their underlying Unicode changed by normalization, these changes are rarely visible. The most common example of benign normalization has to do with dagesh. Though normalization frequently changes the location of dagesh in the underlying Unicode, I’ve never seen a font in which such changes are visible. Other common examples of benign normalization have to do with shin dot and sin dot.

Though fragile words are rare, it would be nice to encode them robustly. MAM uses CGJ (COMBINING GRAPHEME JOINER) to make fragile words robust to Unicode normalization. A CGJ inserted in the proper place prevents normalization from reordering a string in a way that would change its appearance. The words requiring CGJ come from two sources:

words for Jerusalem (this is by far the most plentiful source)
words in the three dual cantillation sections of the Tanakh:
1. Genesis 35:22 (in the Saga of Reuben)
2. Exodus 20:2–14 (the Exodus Decalogue)
3. Deuteronomy 5:6–18 (the Deuteronomy Decalogue)

For purposes of making words robust, we consider there to be four types of words for Jerusalem, distinguished by what should happen after the ל:

Either patach or qamats gadol can come immediately after the ל.
Then comes an accent, in all but one case (Ps147:12). That one accent-less case is reordered by normalization, but the normalization is benign, i.e. it does not change the word’s appearance, so the case is not fragile and is not of interest to us here. (It is a little unclear why this reordering does not change its appearance. My best guess is that the shaper (e.g. MS DirectWrite, HarfBuzz, etc.) “undoes” the reordering of the vowel marks. Nonetheless, in this case we have opted to insert a CGJ before the hiriq.)
Then comes one of the following two endings:
1. hiriq, ם
2. sheva, מָה. This ending is rare (only four cases).

With a column for each of the four types implied by the variations above, the table below lists:

The count of cases of this type in MAM. Not all of these cases are fragile, but it is simplest to treat them as if they were.
An example word of this type, shown as it appears in MAM, i.e. shown with CGJ making it robust.
This same example word, shown as it appears in old Sefaria, i.e. shown fragile.
A location (book, chapter, & verse) containing this example word.

type	לִַם	לִָם	לְַמָה	לְָמָה
count	348	285	3	1
MAM	יְרֽוּשָׁלַ֖͏ִם	יְרוּשָׁלָ֑͏ִם	יְרוּשָׁלַ֛͏ְמָה	יְרוּשָׁלָ֑͏ְמָה
old Sefaria	יְרֽוּשָׁלִַ֖ם	יְרוּשָׁלִָ֑ם	יְרוּשָׁלְַ֛מָה	יְרוּשָׁלְָ֑מָה
bcv	Ju1:7	Js15:8	I36:2	2K9:28

In some environments (font, OS, etc.), the fragile old Sefaria words above don’t just suffer from ordering issues, they also suffer from collision issues (marks overlap). Since no word in Tanakh actually needs the normalized order to be the visual order, it is not surprising (and should not be disappointing) that in some contexts, some Hebrew fonts cannot handle this order. E.g. there is no word in Tanakh that needs the visual order of marks on lamed to be sheva, qamats, etnachta. (This is the normalized order of code points in the example above from 2 Kings 9:28.)

Fragile words in the dual cantillation sections are of two types:

The 4 (only 2 unique) “QUPO” words. These are words with a single letter housing four marks:
1. Qamats gadol
2. Under-accent (meteg or etnachta)
3. Patach
4. Over-accent (revia or גרש)
The 10 (only 6 unique) “UM” words. These are words in which an under-accent should be followed by meteg.

Below is a table showing the two unique QUPO words in MAM and their three corresponding words in old Sefaria. Two out of the three old Sefaria words lack a patach on the letter in question, making them QUO not QUPO. On the bright side, QUO is not fragile. More on this further below.

MAM E/D	עַל־פָּנָֽ֗͏ַי׃	מִתָּ֑֜͏ַחַת
old Sefaria E	עַל־פָּנָֽ֗יַ׃	מִתַָּ֑֜חַת
old Sefaria D	עַל־פָּנָֽ֗יַ׃	מִתָּ֑֜חַת
bcv E/D	20:3 / 5:7	20:4 / 5:8

Only the two bordered QUPO words above show a distinction between the appearance of fragile & robust encodings.

The non-bordered words have some distracting issues unrelated to fragility. The table below describes these issues.

old Sefaria E/D	עַל־פָּנָֽ֗יַ׃	The patach is expected to belong to the nun but old Sefaria has it belonging to the yod.
old Sefaria D	מִתָּ֑֜חַת	Old Sefaria accurately reflects that the LC lacks the expected patach.

Regarding עַל־פָּנָֽ֗יַ in old Sefaria:

I reported the “late” patach to the UXLC maintainer and it is now fixed.
I’ve shown the sof pasuq (׃) here as if it were in both E & D but old Sefaria accurately reflects that the LC lacks it in E.

The table below shows 3 of the 6 unique UM words, with a row showing U, the under-accent that should be followed by meteg.

MAM	יִשְׂרָאֵ֑͏ֽל	עֲבָדִ֑͏ֽים	וּבִנְךָ֣͏ֽ־וּ֠בִתֶּ֗ךָ
old Sefaria	יִשְׂרָאֵֽ֑ל	עֲבָדִֽ֑ים	וּבִנְךָֽ֣־וּ֠בִתֶּ֗ךָ
U	etnachta	etnachta	munach
bcv	G35:22	E20:2 / D5:6	E20:10

Old Sefaria’s Tanakh has no measures to prevent fragility. I have reported the fragility issues to the UXLC maintainer and now most fragile words have been made robust in UXLC. Unfortunately, as far as I know, there is no easy way to update old Sefaria with such a large change (hundreds of changes). The change was made algorithmically to UXLC, so perhaps the easiest way to update old Sefaria would be to run a similar algorithm on its text.
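As a rough idea of what such an algorithm might look like for the words for Jerusalem, the sketch below reorders the marks on the ל from normalized order into the robust order described earlier and inserts a CGJ. It is not the algorithm that was applied to UXLC; the regular expression, the function name, and the decision to ignore meteg are all assumptions of this sketch.

```python
import re
import unicodedata

LAMED = "\u05dc"
CGJ = "\u034f"

# Normalized (fragile) order on the lamed: hiriq or sheva, then patach or
# qamats, then any accents in U+0591..U+05AE. Meteg is not handled here.
NORMALIZED_LAMED = re.compile(
    LAMED + "([\u05b4\u05b0])([\u05b7\u05b8])([\u0591-\u05ae]*)"
)


def make_robust(word):
    """Reorder the marks on lamed to vowel, accent(s), CGJ, hiriq or sheva."""
    def reorder(match):
        ending_mark, vowel, accents = match.groups()
        return LAMED + vowel + accents + CGJ + ending_mark
    return NORMALIZED_LAMED.sub(reorder, word)


# Tail of a word for Jerusalem in normalized order:
# lamed, hiriq, qamats, etnachta, final mem.
fragile = "\u05dc\u05b4\u05b8\u0591\u05dd"
robust = make_robust(fragile)

print([f"U+{ord(ch):04X}" for ch in robust])
print(unicodedata.normalize("NFC", robust) == robust)
```

A real run over a whole Tanakh would also need to cover the dual cantillation words and to confirm that the pattern matches nothing unintended.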

CSV topics

Metadata fields

license (not in CSV)
- MAM: CC-BY-SA 4.0
- old Sef.: Public Domain
versionTitleInHebrew (not in CSV)
- MAM: מקרא על פי המסורה
- old Sef.: תנ״ך מלווה בטעמי מקרא
versionTitle (Version Title)
- MAM: Miqra according to the Masorah
- old Sef.: Tanach with Ta'amei Hamikra
versionSource (Version Source)
- MAM: https://en.wikisource.org/wiki/User:Dovi/Miqra_according_to_the_Masorah
- old Sef.: http://www.tanach.us/Tanach.xml

(The fields above are shown along with their old Sefaria values, for comparison.) (Some fields are not present in the CSV but they are present in JSON exports from old Sefaria so I figured they may be needed, at some point, in order to import MAM.)

Joshua 21

MAM has no content corresponding to verses 36 & 37 in old Sefaria’s Joshua 21. (The LC also lacks any such content.) In MAM, these verses are encoded as a string containing only an em dash. This makes the CSV for verses 35–38 look like this (in abbreviated form):

Joshua 21:35,"אֶת־דִּמְנָה֙ ..."
Joshua 21:36,—
Joshua 21:37,—
Joshua 21:38,וּמִמַּטֵּה־גָ֗ד ...

(Above, Hebrew strings are shown right-to-left. Also, a slight distraction is that the Hebrew string for Joshua 21:38 happens to not need quotes.)

If these two verses were not listed at all in the CSV, MAM’s verse numbering would get out of sync with old Sefaria’s for the remainder of Joshua 21. If these two verses were listed as empty strings, another Tanakh version’s content would “interrupt” MAM here. (This is due to the design of the Sefaria system.)
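Code that consumes the CSV may want to recognize these placeholder verses. The sketch below uses Python’s standard `csv` module on an abbreviated stand-in for the four rows shown above; it assumes two-field rows, leaves out any metadata rows the real files contain, and uses a hypothetical function name.

```python
import csv
import io

EM_DASH = "\u2014"

# Abbreviated stand-in for the real rows; "..." replaces the verse text.
sample_csv = (
    'Joshua 21:35,"..."\n'
    "Joshua 21:36,\u2014\n"
    "Joshua 21:37,\u2014\n"
    "Joshua 21:38,...\n"
)


def placeholder_refs(csv_text):
    """Return the references whose text is only an em dash."""
    reader = csv.reader(io.StringIO(csv_text))
    return [ref for ref, text in reader if text == EM_DASH]


print(placeholder_refs(sample_csv))
```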