A journey into OpenType font internals

Introduction

The Russian ruble has a standard sign which looks like this: ₽.

In case it's not displayed in text.

It's short and understandable, so you would want to display it on your website if you're doing commerce in Russia. Unfortunately, the symbol is not widely adopted yet, and the search engines you're optimizing for might not understand that you're selling something.

One way to solve this is to make a drop-in font which would contain a ligature, rendering all "руб" strings (standing for "ruble") as "₽". This way you could put the textual old-school representation of the currency in your markup, but visually it would look modern and slick.

This is how it's supposed to work

The new font should only contain the ligature, so that it can be applied on top of the default font, using the default font (without the ligature) as a fallback.

Yes, this is not the easiest solution, but bear with me for the sake of the post. This is an interesting method because it provides an excuse to research how the fonts are made.

In this post, I will be dissecting and modifying one font in particular — Roboto by Google, but generally, the method should apply to any font.

Testing the water

After some quick googling, skimming, and link following, I found a post by Roel Nieskens: a guide to adding a custom ligature to a font. This is almost exactly what I need; I just have to remove everything besides the ligature. It's a great starting point.

Sans Bullshit Sans: leveraging the synergy of ligatures – Pixelambacht

The post advises you to decode the font's TTF file into "TTX" using fonttools, a Python library that also ships some CLI utilities. A TTX file is an XML representation of the font. You can edit the XML and then compile it back into TTF.

As it turns out, OpenType fonts (those that have .otf or .ttf file extension) are structured into separate sections called tables. If you look at the TTX file of Roboto Regular you'll see these tables as the direct children of the root element:

<?xml version="1.0" encoding="UTF-8"?>
<ttFont sfntVersion="\x00\x01\x00\x00" ttLibVersion="4.22">
    <GlyphOrder><!-- content omitted --></GlyphOrder>
    <head><!-- content omitted --></head>
    <hhea><!-- content omitted --></hhea>
    <maxp><!-- content omitted --></maxp>
    <OS_2><!-- content omitted --></OS_2>
    <hmtx><!-- content omitted --></hmtx>
    <hdmx><!-- content omitted --></hdmx>
    <cmap><!-- content omitted --></cmap>
    <fpgm><!-- content omitted --></fpgm>
    <prep><!-- content omitted --></prep>
    <cvt><!-- content omitted --></cvt>
    <loca><!-- content omitted --></loca>
    <glyf><!-- content omitted --></glyf>
    <name><!-- content omitted --></name>
    <post><!-- content omitted --></post>
    <gasp><!-- content omitted --></gasp>
    <GDEF><!-- content omitted --></GDEF>
    <GPOS><!-- content omitted --></GPOS>
    <GSUB><!-- content omitted --></GSUB>
</ttFont>

Each table has a short, usually unintelligible name, and the names are not consistent with each other: most are four letters long, but cvt looks like three (in the binary font every tag is exactly four bytes, so it is stored as "cvt " with a trailing space); most are lower case, some are upper case. GlyphOrder is the exception: it is not a table but a utility element generated by the TTX decompiler. All of this hints at the long history of the format.
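
To get a feel for this structure programmatically, here is a minimal sketch that lists the tables of a TTX document. It uses Python's standard xml.etree.ElementTree, and TTX_SAMPLE is a heavily trimmed stand-in I made up for illustration; a real dump of Roboto is far larger.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for a real TTX dump.
TTX_SAMPLE = """
<ttFont ttLibVersion="4.22">
  <GlyphOrder/>
  <head/>
  <OS_2/>
  <cmap/>
  <cvt/>
  <glyf/>
  <GSUB/>
</ttFont>
"""

def list_tables(ttx_text):
    """Return the table names found directly under the <ttFont> root."""
    root = ET.fromstring(ttx_text)
    # GlyphOrder is a helper element written by the decompiler, not a table.
    return [child.tag for child in root if child.tag != "GlyphOrder"]

print(list_tables(TTX_SAMPLE))
# ['head', 'OS_2', 'cmap', 'cvt', 'glyf', 'GSUB']
```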

Some of the interesting tables, discovered with the help of Roel's post, are:

glyf

The glyf table contains information about how to draw glyphs, that is, pictures of characters. In the TTX file, the <glyf> element is filled with <TTGlyph> elements, one for each glyph in the font.

For example, this is the glyph for the comma:

<TTGlyph name="comma" xMin="29" yMin="-290" xMax="308" yMax="219">
  <contour>
    <pt x="134" y="-290" on="1"/>
    <pt x="29" y="-218" on="1"/>
    <pt x="123" y="-87" on="0"/>
    <pt x="127" y="52" on="1"/>
    <pt x="127" y="219" on="1"/>
    <pt x="308" y="219" on="1"/>
    <pt x="308" y="74" on="1"/>
    <pt x="308" y="-27" on="0"/>
    <pt x="209" y="-229" on="0"/>
  </contour>
  <instructions>
    <assembly>
      SVTCA[0]	/* SetFPVectorToAxis */
      PUSHB[ ]	/* 1 value pushed */
      9
      MDAP[1]	/* MoveDirectAbsPt */
      PUSHB[ ]	/* 2 values pushed */
      4 5
      PUSHB[ ]	/* 1 value pushed */
      10
      CALL[ ]	/* CallFunction */
      IF[ ]	/* If */
        POP[ ]	/* PopTopStack */
        MDRP[11000]	/* MoveDirectRelPt */
      ELSE[ ]	/* Else */
        MIRP[10100]	/* MoveIndirectRelPt */
      EIF[ ]	/* EndIf */
      PUSHB[ ]	/* 1 value pushed */
      0
      MDRP[10000]	/* MoveDirectRelPt */
      PUSHB[ ]	/* 1 value pushed */
      0
      MDAP[1]	/* MoveDirectAbsPt */
      IUP[0]	/* InterpolateUntPts */
      IUP[1]	/* InterpolateUntPts */
    </assembly>
  </instructions>
</TTGlyph>

As you can see, it contains the bounding box of the glyph, the points of its outline, and a hinting program written in TrueType instructions, a language resembling Assembly. Fortunately, we don't have to deal with this, because Roboto already contains all the glyphs we need. We won't add new glyphs.

cmap

The cmap table maps characters (in the case of Roboto, Unicode code points) to glyphs. Using this table, a renderer converts a string of text (which is usually a sequence of Unicode code points) into a series of glyphs to put one after another.

The structure of this table is pretty straightforward, though the mapping appears twice in Roboto, apparently for better cross-platform support:

<cmap>
  <tableVersion version="0"/>
  <cmap_format_4 platformID="0" platEncID="3" language="0">
    <map code="0x0" name="uni0000"/><!-- ???? -->
    <map code="0x2" name="uni0002"/><!-- ???? -->
    <map code="0xd" name="uni000D"/><!-- ???? -->
    <map code="0x20" name="space"/><!-- SPACE -->
    <map code="0x21" name="exclam"/><!-- EXCLAMATION MARK -->
    <map code="0x22" name="quotedbl"/><!-- QUOTATION MARK -->
    <map code="0x23" name="numbersign"/><!-- NUMBER SIGN -->
    <!-- many more maps omitted -->
  </cmap_format_4>
  <cmap_format_4 platformID="3" platEncID="1" language="0">
    <map code="0x0" name="uni0000"/><!-- ???? -->
    <map code="0x2" name="uni0002"/><!-- ???? -->
    <map code="0xd" name="uni000D"/><!-- ???? -->
    <map code="0x20" name="space"/><!-- SPACE -->
    <map code="0x21" name="exclam"/><!-- EXCLAMATION MARK -->
    <map code="0x22" name="quotedbl"/><!-- QUOTATION MARK -->
    <map code="0x23" name="numbersign"/><!-- NUMBER SIGN -->
    <!-- many more maps omitted -->
  </cmap_format_4>
</cmap>
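
As a rough illustration of what a renderer does with this table, the sketch below turns a string into glyph names. CMAP_SAMPLE is a shortened copy of the table above, load_cmap and text_to_glyphs are names I made up, and the fallback to ".notdef" (the conventional missing-glyph name) is my simplification, not something taken from Roboto.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the cmap table shown above (one subtable only).
CMAP_SAMPLE = """
<cmap>
  <tableVersion version="0"/>
  <cmap_format_4 platformID="3" platEncID="1" language="0">
    <map code="0x20" name="space"/>
    <map code="0x21" name="exclam"/>
    <map code="0x23" name="numbersign"/>
  </cmap_format_4>
</cmap>
"""

def load_cmap(cmap_xml):
    """Return {code point: glyph name} from the first cmap subtable."""
    root = ET.fromstring(cmap_xml)
    subtable = next(el for el in root if el.tag.startswith("cmap_format_"))
    # TTX writes code points as hex strings such as "0x20".
    return {int(m.get("code"), 16): m.get("name") for m in subtable.iter("map")}

def text_to_glyphs(text, cmap):
    """Map each character to a glyph name, or '.notdef' if it is unmapped."""
    return [cmap.get(ord(ch), ".notdef") for ch in text]

cmap = load_cmap(CMAP_SAMPLE)
print(text_to_glyphs("# !", cmap))
# ['numbersign', 'space', 'exclam']
```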

GSUB

This table contains rules for substitution of some glyphs with other glyphs (as we will later discover — some of the substitutions only apply in certain contexts or certain languages). That includes ligatures, each of which is a substitution of a sequence of glyphs with a single glyph.

This table is much more complicated than the previous ones, but if you search for the LigatureSubst element, you'll find ligature definitions deep inside it. There are several of those; here is one example:

<Lookup index="8">
  <LookupType value="4"/>
  <LookupFlag value="0"/>
  <!-- SubTableCount=1 -->
  <LigatureSubst index="0">
    <LigatureSet glyph="f">
      <Ligature components="f,i" glyph="uniFB03"/>
      <Ligature components="i" glyph="uniFB01"/>
    </LigatureSet>
  </LigatureSubst>
</Lookup>

Entering the fight with bare hands

Let's add the ligatures by hand. Following Roel's advice, I have added two ligatures (one with a trailing period and one without) into one of the Lookup records of type 4. Type 4 means that the lookup contains ligatures. You can see that the lookup in the example above is also of type 4.

The ligature set looked like this:

<LigatureSet glyph="uni0440">
  <Ligature components="uni0443,uni0431,period" glyph="uni20BD"/>
  <Ligature components="uni0443,uni0431" glyph="uni20BD"/>
</LigatureSet>

The glyph attribute on the LigatureSet element names the first glyph in the replaced sequence. Each Ligature element then lists all the remaining glyphs in components and the glyph that replaces them in glyph.

  • uni0440 is the name of the glyph of the Russian lower case letter "р"
  • uni0443 — the letter "у"
  • uni0431 — the letter "б"
  • period is, well, period
  • and uni20BD is for the ruble symbol ₽
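
To make the matching rule concrete, here is a small sketch of how a text shaper might apply such a ligature set to a run of glyph names. It is a simplification I wrote for illustration, not the algorithm of any real shaping engine: apply_ligatures and LIGATURE_SETS are made-up names, and the sketch assumes that the first matching Ligature in a set wins, which is why the longer sequence is listed first.

```python
# The ligature set from above: first glyph -> list of (remaining glyphs, replacement).
LIGATURE_SETS = {
    "uni0440": [
        (["uni0443", "uni0431", "period"], "uni20BD"),
        (["uni0443", "uni0431"], "uni20BD"),
    ],
}

def apply_ligatures(glyphs, ligature_sets):
    """Replace every matching glyph sequence with its ligature glyph."""
    out = []
    i = 0
    while i < len(glyphs):
        for components, replacement in ligature_sets.get(glyphs[i], []):
            end = i + 1 + len(components)
            if glyphs[i + 1:end] == components:
                out.append(replacement)
                i = end  # skip the whole matched sequence
                break
        else:
            # No ligature starts here: keep the glyph as it is.
            out.append(glyphs[i])
            i += 1
    return out

run = ["one", "zero", "space", "uni0440", "uni0443", "uni0431", "period"]
print(apply_ligatures(run, LIGATURE_SETS))
# ['one', 'zero', 'space', 'uni20BD']
```

If the two Ligature entries were swapped, the shorter one would match first and the period would be left behind after the ruble sign.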

After compiling the TTX file back into TTF and loading it onto a test web page, you'll see that these ligatures... work! This was inspiringly simple. In fact, you can use the produced font as a replacement for the original Roboto to achieve the desired effect. But we're not looking for a font modification. We're looking for a drop-in addon, which can be easily applied on top of the original Roboto and just as easily disabled.

For that we need to make sure that the resulting font is as slim as possible, containing ideally only 5 glyphs and 2 ligatures.

Also, a note to myself: make sure to change the name of the font in the name table; otherwise, Chrome DevTools doesn't distinguish the original font from the new one:

These 22 glyphs have been rendered with two different fonts, but Chrome doesn't know they are different.

Securing the victory...?

To slim down the font we have to look through the TTX and delete everything that doesn't apply to the glyphs we have used. Some tables are simple to clean up: GlyphOrder, cmap, hmtx, hdmx and glyf are just lists with an item for each glyph. Other tables are small and seemingly contain only metadata, so they can be left unedited. The GDEF table seems pretty simple to clean up too.

I tried removing the extra elements by hand to see if I could make it work. Along the way, I got compilation errors from the ttx tool, because I had deleted a glyph that was still referenced in other parts of the font. Investigating one of these errors revealed a problem that slightly complicates things.

This is the glyph for the Russian letter "у" in the glyf table.

<TTGlyph name="uni0443" xMin="22" yMin="-437" xMax="944" yMax="1082">
  <component glyphName="y" x="0" y="0" flags="0x204"/>
</TTGlyph>

As you can see, it doesn't contain the drawing of the glyph, it only copies the drawing of the Latin letter "y". But I have deleted the Latin "y" everywhere! I should have kept it and all the other glyphs which are referenced this way. Scratch everything, start over.
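
A mistake like this is easy to avoid with a few lines of code. The sketch below collects every glyph that the wanted glyphs depend on through component elements, following the references transitively. GLYF_SAMPLE is a made-up miniature of the glyf table (the contour and the bounding box of "period" are placeholders), and with_dependencies is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up miniature of the glyf table; outline data is a placeholder.
GLYF_SAMPLE = """
<glyf>
  <TTGlyph name="y" xMin="22" yMin="-437" xMax="944" yMax="1082">
    <contour><pt x="22" y="1082" on="1"/></contour>
  </TTGlyph>
  <TTGlyph name="uni0443" xMin="22" yMin="-437" xMax="944" yMax="1082">
    <component glyphName="y" x="0" y="0" flags="0x204"/>
  </TTGlyph>
  <TTGlyph name="period" xMin="0" yMin="0" xMax="0" yMax="0"/>
</glyf>
"""

def with_dependencies(glyf_xml, wanted):
    """Return the wanted glyphs plus everything they reference as components."""
    root = ET.fromstring(glyf_xml)
    components = {
        glyph.get("name"): [c.get("glyphName") for c in glyph.findall("component")]
        for glyph in root.findall("TTGlyph")
    }
    keep = list(wanted)
    queue = list(wanted)
    while queue:
        # A component may itself be built from components, so keep following.
        for dependency in components.get(queue.pop(), []):
            if dependency not in keep:
                keep.append(dependency)
                queue.append(dependency)
    return keep

print(with_dependencies(GLYF_SAMPLE, ["period", "uni0443"]))
# ['period', 'uni0443', 'y']
```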

On top of this, GPOS and GSUB tables appeared to have a complex structure that can't be easily slimmed down without proper understanding. What are those ScriptList and FeatureList elements? They don't reference any glyphs. Clearing them seems to break everything. Clearly, I'm missing the full picture.

After some unmotivated searching and deletion, I have managed to produce a font that compiles but... doesn't render. The ligatures no longer work and I have no idea why. I deleted something essential without knowing or noticing.

I need to understand a little more about what I'm doing and I need a script to do all this automatically and error-proof.

Doing the homework

Microsoft has an extensive reference for the OpenType format on their website and it was of huge help. Many of the search queries about font tables and features lead here.

OpenType specification (OpenType 1.8.4) - Typography

In the sidebar there is a page for each table of the font, describing the structure of the table, attributes of its records, etc.

For example, of the two cmap records we have seen in Roboto, the first one is a generic Unicode mapping (platform 0, encoding 3: Unicode 2.0 and onwards, BMP only), and the second one is specific to Windows (platform 3, encoding 1: Unicode BMP). Their purpose can be deduced from the platform ID and the platform encoding ID.

<cmap>
  <tableVersion version="0"/>
  <cmap_format_4 platformID="0" platEncID="3" language="0">
    <map code="0x0" name="uni0000"/><!-- ???? -->
    <map code="0x2" name="uni0002"/><!-- ???? -->
    <map code="0xd" name="uni000D"/><!-- ???? -->
    <map code="0x20" name="space"/><!-- SPACE -->
    <map code="0x21" name="exclam"/><!-- EXCLAMATION MARK -->
    <map code="0x22" name="quotedbl"/><!-- QUOTATION MARK -->
    <map code="0x23" name="numbersign"/><!-- NUMBER SIGN -->
    <!-- many more maps omitted -->
  </cmap_format_4>
  <cmap_format_4 platformID="3" platEncID="1" language="0">
    <map code="0x0" name="uni0000"/><!-- ???? -->
    <map code="0x2" name="uni0002"/><!-- ???? -->
    <map code="0xd" name="uni000D"/><!-- ???? -->
    <map code="0x20" name="space"/><!-- SPACE -->
    <map code="0x21" name="exclam"/><!-- EXCLAMATION MARK -->
    <map code="0x22" name="quotedbl"/><!-- QUOTATION MARK -->
    <map code="0x23" name="numbersign"/><!-- NUMBER SIGN -->
    <!-- many more maps omitted -->
  </cmap_format_4>
</cmap>
https://docs.microsoft.com/en-us/typography/opentype/spec/cmap#platform-ids
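
The lookup from IDs to meaning can be written down as a small table. The sketch below covers only a handful of the combinations listed in the specification (the ones I am sure about), and describe_subtable is a made-up helper name.

```python
# A partial transcription of the platform and encoding IDs from the cmap spec.
PLATFORMS = {0: "Unicode", 1: "Macintosh", 3: "Windows"}
ENCODINGS = {
    (0, 3): "Unicode 2.0 and onwards, BMP only",
    (0, 4): "Unicode 2.0 and onwards, full repertoire",
    (3, 0): "Symbol",
    (3, 1): "Unicode BMP",
    (3, 10): "Unicode full repertoire",
}

def describe_subtable(platform_id, encoding_id):
    """Describe a cmap subtable by its platformID and platEncID attributes."""
    platform = PLATFORMS.get(platform_id, f"unknown platform {platform_id}")
    encoding = ENCODINGS.get((platform_id, encoding_id), f"encoding {encoding_id}")
    return f"{platform}: {encoding}"

# The two subtables found in Roboto:
print(describe_subtable(0, 3))  # Unicode: Unicode 2.0 and onwards, BMP only
print(describe_subtable(3, 1))  # Windows: Unicode BMP
```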

As for the GPOS and GSUB tables, they share a similar four-level hierarchy, which is clearly described in the Microsoft reference too:

https://docs.microsoft.com/en-us/typography/opentype/spec/chapter2#table-organization

Both tables list certain features of the font. GPOS lists features related to the positioning of glyphs (like kerning), and GSUB lists features related to the substitution of glyphs. Each feature in these tables can be enabled for certain scripts and certain language systems, hence their place in the hierarchy.

Scripts

This level corresponds to the ScriptList element in TTX. It contains the list of scripts for which the font enables features. Each script is identified by a script tag: for example, latn stands for Latin script, cyrl for Cyrillic, and grek for Greek. DFLT is the special tag for the default script, which is used for text in scripts that have no record of their own.

For example, Roboto enables "fi" and "fl" ligatures for Latin script, but not for other scripts.

Language Systems

Each script record is split into language system records, which usually correspond to languages. Language systems are also identified by tags: "FRA " stands for French and so on. There is also a special "default language system" record, which is used when no more specific language system applies.

Generally, most of Roboto's features are enabled in all language systems in all scripts. But there are exceptions. As a curious example, it replaces the "S with cedilla" glyph with "S with comma below" in the Romanian language system in Latin script. I think it does so for consistency because both spellings are used interchangeably in the Romanian language.

Top letters are with commas below, bottom letters are with cedillas. https://www.quora.com/What-is-the-difference-between-%C8%98-and-%C5%9E-in-Romanian

Features

Now, language system records list features that should be enabled for the given language system. Features are defined in a separate element of the TTX: FeatureList, and language systems reference features by their indices in the list.

Features are identified by tags as well, and OpenType supports a lot of features. "Ligatures" is not just one feature of the font, there are many features used for ligatures of different kinds. These are just some of the more generic ones:

  • The ccmp feature ("Glyph Composition/Decomposition") is used in Roboto to handle Unicode's combining characters, like accents. An accent glyph placed after the capital A, for example, produces a dedicated glyph of "A with an accent".
  • clig ("Contextual Ligatures") is used for ligatures in certain contexts (i.e. surrounded or not surrounded by certain characters).
  • dlig ("Discretionary Ligatures") covers ligatures that "may be used for special effects at user's preference".
  • hlig ("Historical Ligatures") can be used to optionally enable a historical outdated look to the font. Not used in Roboto but it's curious that font vendors have this ability.
  • liga ("Standard Ligatures") is the most generic one. Microsoft docs list "ffl" as an example of a ligature that this feature can be used for. And Roboto uses it exactly for that: "ffi", "fi", "ffl" and "fl".

By the way, did you know that you can control which font features are enabled using CSS, through the font-feature-settings and font-variant-ligatures properties?

Lookups

Finally, the LookupList element of the TTX. Feature records are pretty shallow: the tag says what the feature is for, but the data for the feature (which glyphs to replace with which in the GSUB table, or how to position glyphs in the GPOS table) is listed in lookups.

Lookups are not identified by tags. Features reference them by their index in the LookupList, and each lookup has a numeric type that determines its structure; the same type number means different things in GPOS and in GSUB. As we have covered before, type 4 stands for ligature substitution in GSUB, and for now this is the only type we need to know.

Generally, lookups are just lists of glyphs or substitutions, or ligatures. Each type has its own structure and since it's the lowest level of the hierarchy, it doesn't reference other parts of the font, except for glyphs, so it's usually easy to understand just by looking at the TTX.
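
To tie the four levels together, here is a sketch that walks the hierarchy from a script and a language system down to lookup indices. It is my own simplified reading of the structure, not code from fonttools or from a real shaper: lookups_for is a made-up name, GSUB_SAMPLE is a minimal GSUB with a single liga feature under DFLT, and the required feature (ReqFeatureIndex) is ignored for brevity.

```python
import xml.etree.ElementTree as ET

# A minimal GSUB in TTX form: one script (DFLT), one feature (liga), one lookup.
GSUB_SAMPLE = """
<GSUB>
  <ScriptList>
    <ScriptRecord index="0">
      <ScriptTag value="DFLT"/>
      <Script>
        <DefaultLangSys>
          <ReqFeatureIndex value="65535"/>
          <FeatureIndex index="0" value="0"/>
        </DefaultLangSys>
      </Script>
    </ScriptRecord>
  </ScriptList>
  <FeatureList>
    <FeatureRecord index="0">
      <FeatureTag value="liga"/>
      <Feature>
        <LookupListIndex index="0" value="0"/>
      </Feature>
    </FeatureRecord>
  </FeatureList>
  <LookupList>
    <Lookup index="0">
      <LookupType value="4"/>
    </Lookup>
  </LookupList>
</GSUB>
"""

def lookups_for(gsub_xml, script, feature, language=None):
    """Return the lookup indices a feature uses for a script and language system."""
    root = ET.fromstring(gsub_xml)
    # Level 1: pick the script record, falling back to DFLT.
    scripts = {rec.find("ScriptTag").get("value"): rec.find("Script")
               for rec in root.iter("ScriptRecord")}
    script_el = scripts.get(script)
    if script_el is None:
        script_el = scripts.get("DFLT")
    if script_el is None:
        return []
    # Level 2: pick the language system, falling back to the default one.
    langsys = None
    for rec in script_el.findall("LangSysRecord"):
        if rec.find("LangSysTag").get("value") == language:
            langsys = rec.find("LangSys")
    if langsys is None:
        langsys = script_el.find("DefaultLangSys")
    if langsys is None:
        return []
    # Level 3: resolve feature indices to feature records and filter by tag.
    features = {rec.get("index"): rec for rec in root.iter("FeatureRecord")}
    result = []
    for ref in langsys.findall("FeatureIndex"):
        record = features[ref.get("value")]
        if record.find("FeatureTag").get("value") == feature:
            # Level 4: the feature lists its lookups by index.
            result += [int(el.get("value")) for el in record.iter("LookupListIndex")]
    return result

print(lookups_for(GSUB_SAMPLE, "cyrl", "liga"))
# [0]
```

Here the Cyrillic script has no record of its own, so the sketch falls back to DFLT and finds the single ligature lookup.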

Attempting a more civilized approach

After all that research, I have come up with a Python script that parses the TTX using lxml and produces a lightweight copy of the font containing the ligatures. I won't show the code itself, because it's not very tidy, but here is the general algorithm.

  • Have the desired ligatures listed like this:
LIGATURES = [
    ("руб.", "₽"),
    ("руб", "₽"),
]
  • Extract all the character code points used in the LIGATURES above.
print(used_codepoints)
# [1073, 8381, 1088, 1091, 46]
  • Parse the cmap table in the font and convert code points from the previous step into glyph names.
print(used_glyphs)
# ['period', 'uni0431', 'uni0440', 'uni0443', 'uni20BD']
  • Parse the glyf table and extract all the glyphs referenced by the glyphs above. Do this in a loop, to make sure transitive dependencies are covered. Now we have a definitive list of glyphs we need to keep in the font.
print(used_glyphs_with_dependencies)
# ['period', 'uni0431', 'uni0440', 'uni0443', 'uni20BD', 'p', 'y']
  • Create the TTX file for the new font and populate the new GlyphOrder element with the used glyphs.
<GlyphOrder>
  <GlyphID id="0" name="period"/>
  <GlyphID id="1" name="uni0431"/>
  <GlyphID id="2" name="uni0440"/>
  <GlyphID id="3" name="uni0443"/>
  <GlyphID id="4" name="uni20BD"/>
  <GlyphID id="5" name="p"/>
  <GlyphID id="6" name="y"/>
</GlyphOrder>
  • Copy many of the tables as-is, because they don't reference glyphs and they are small anyway: head, hhea, maxp, OS_2, fpgm, prep, cvt, loca, post, gasp
  • Parse the hmtx table and only keep records for the glyphs we need. Same with hdmx, cmap, glyf.
<hmtx>
  <mtx name="p" width="1149" lsb="140"/>
  <mtx name="period" width="539" lsb="144"/>
  <mtx name="uni0431" width="1132" lsb="97"/>
  <mtx name="uni0440" width="1149" lsb="140"/>
  <mtx name="uni0443" width="969" lsb="22"/>
  <mtx name="uni20BD" width="1359" lsb="31"/>
  <mtx name="y" width="969" lsb="22"/>
</hmtx>
<hdmx>
  <hdmxData>
                      ppem:   9 ;

                         p:   5 ;
                    period:   2 ;
                   uni0431:   5 ;
                   uni0440:   5 ;
                   uni0443:   4 ;
                   uni20BD:   6 ;
                         y:   4 ;
  </hdmxData>
</hdmx>
<cmap>
  <tableVersion version="0"/>
  <cmap_format_4 platformID="0" platEncID="3" language="0">
    <map code="0x2e" name="period"/>
    <map code="0x70" name="p"/>
    <map code="0x79" name="y"/>
    <map code="0x431" name="uni0431"/>
    <map code="0x440" name="uni0440"/>
    <map code="0x443" name="uni0443"/>
    <map code="0x20bd" name="uni20BD"/>
  </cmap_format_4>
  <cmap_format_4 platformID="3" platEncID="1" language="0">
    <map code="0x2e" name="period"/>
    <map code="0x70" name="p"/>
    <map code="0x79" name="y"/>
    <map code="0x431" name="uni0431"/>
    <map code="0x440" name="uni0440"/>
    <map code="0x443" name="uni0443"/>
    <map code="0x20bd" name="uni20BD"/>
  </cmap_format_4>
</cmap>
<glyf>
  <!-- omitted for brevity -->
</glyf>
  • Parse the name table and replace "Roboto" with something along the lines of "Roboto-RubleLigature".
<name>
  <namerecord nameID="0" platformID="3" platEncID="1" langID="0x409">
    Copyright 2011 Google Inc. All Rights Reserved.
  </namerecord>
  <namerecord nameID="1" platformID="3" platEncID="1" langID="0x409">
    Roboto-RubleLigature
  </namerecord>
  <namerecord nameID="2" platformID="3" platEncID="1" langID="0x409">
    Regular
  </namerecord>
  <namerecord nameID="3" platformID="3" platEncID="1" langID="0x409">
    Roboto-RubleLigature
  </namerecord>
  <namerecord nameID="4" platformID="3" platEncID="1" langID="0x409">
    Roboto-RubleLigature
  </namerecord>
  <namerecord nameID="5" platformID="3" platEncID="1" langID="0x409">
    Version 2.137; 2017
  </namerecord>
  <namerecord nameID="6" platformID="3" platEncID="1" langID="0x409">
    Roboto-RubleLigature-Regular
  </namerecord>
  <namerecord nameID="7" platformID="3" platEncID="1" langID="0x409">
    Roboto is a trademark of Google.
  </namerecord>
  <namerecord nameID="9" platformID="3" platEncID="1" langID="0x409">
    Google
  </namerecord>
  <namerecord nameID="11" platformID="3" platEncID="1" langID="0x409">
    Google.com
  </namerecord>
  <namerecord nameID="12" platformID="3" platEncID="1" langID="0x409">
    Christian Robertson
  </namerecord>
  <namerecord nameID="13" platformID="3" platEncID="1" langID="0x409">
    Licensed under the Apache License, Version 2.0
  </namerecord>
  <namerecord nameID="14" platformID="3" platEncID="1" langID="0x409">
    http://www.apache.org/licenses/LICENSE-2.0
  </namerecord>
</name>
  • Traverse the trees of the GDEF and GPOS tables, removing elements that reference glyphs we don't need. This produces a lot of empty elements, in particular empty and useless lookups in GPOS. But I'm not removing them, because lookups are referenced by their indices, and if we remove a lookup, the indices of its subsequent siblings will shift down. So it's easier to keep them for padding.
  • Do not copy GSUB but produce a new GSUB with nothing but the ligatures:
<GSUB>
  <Version value="0x00010000"/>
  <ScriptList>
    <ScriptRecord index="0">
      <ScriptTag value="DFLT"/>
      <Script>
        <DefaultLangSys>
          <ReqFeatureIndex value="65535"/>
          <FeatureIndex index="0" value="0"/>
        </DefaultLangSys>
      </Script>
    </ScriptRecord>
  </ScriptList>
  <FeatureList>
    <FeatureRecord index="0">
      <FeatureTag value="liga"/>
      <Feature>
        <LookupListIndex index="0" value="0"/>
      </Feature>
    </FeatureRecord>
  </FeatureList>
  <LookupList>
    <Lookup index="0">
      <LookupType value="4"/>
      <LookupFlag value="0"/>
      <LigatureSubst index="0">
        <LigatureSet glyph="uni0440">
          <Ligature components="uni0443,uni0431,period" glyph="uni20BD"/>
          <Ligature components="uni0443,uni0431" glyph="uni20BD"/>
        </LigatureSet>
      </LigatureSubst>
    </Lookup>
  </LookupList>
</GSUB>
  • That's it. Close the TTX file.
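
Since the original script is not shown, here is a minimal sketch of how the first steps and the GSUB generation might look. It is not the author's code: it uses the standard xml.etree.ElementTree instead of lxml, CMAP is a hand-written stand-in for the parsed cmap table, the function names are made up, and it assumes that every ligature target is a single character.

```python
import xml.etree.ElementTree as ET

LIGATURES = [
    ("руб.", "₽"),
    ("руб", "₽"),
]

# Stand-in for the parsed cmap table: {code point: glyph name}.
CMAP = {0x2E: "period", 0x431: "uni0431", 0x440: "uni0440",
        0x443: "uni0443", 0x20BD: "uni20BD"}

def used_codepoints(ligatures):
    """Collect every code point that appears in the sources and targets."""
    # Sorted here only to make the output deterministic.
    return sorted({ord(ch) for source, target in ligatures for ch in source + target})

def build_ligature_subst(ligatures, cmap):
    """Build a TTX <LigatureSubst> element from (source text, target char) pairs."""
    subst = ET.Element("LigatureSubst", index="0")
    sets = {}
    # Longest sources first, so that "руб." is tried before "руб".
    for source, target in sorted(ligatures, key=lambda pair: -len(pair[0])):
        first, *rest = [cmap[ord(ch)] for ch in source]
        if first not in sets:
            sets[first] = ET.SubElement(subst, "LigatureSet", glyph=first)
        ET.SubElement(sets[first], "Ligature",
                      components=",".join(rest), glyph=cmap[ord(target)])
    return subst

print(used_codepoints(LIGATURES))
# [46, 1073, 1088, 1091, 8381]
subst = build_ligature_subst(LIGATURES, CMAP)
print(ET.tostring(subst, encoding="unicode"))
```

The printed element has the same content as the LigatureSubst in the GSUB above, just without indentation.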

The best part of this script is that the resulting TTX compiles into a perfectly functional font! It works just as intended, and it weighs about 7.5 KB against 165 KB for the original Roboto. Maybe you could trim a few more bytes by removing the empty lookups and features in GPOS, but for now, it's perfect!

This is how it works

This same script can then be run on other variants of Roboto (medium, bold, thin, etc.) to have a ligature font for each weight. These fonts can then be linked to the page using the @font-face rule.

@font-face {
    font-family: "Roboto RubleLigature";
    font-weight: 400;
    src: url("./Roboto-RubleLigature-Regular.ttf");
}
.ruble-ligature {
    font-family: "Roboto RubleLigature",  "Roboto", sans-serif;
}

Issues to keep in mind

Even though 3 or 4 characters are displayed as a single one, the browser knows that they are still separate characters. This produces some interesting effects.

For one, you can select part of the glyph:

Because to the browser it looks like you're doing this:

This issue can be solved if you disable user selection for the elements where the ligature is used:

.ruble-ligature {
  user-select: none;
}

Or, if you isolate the ligatures in separate inline elements, you can prevent the partial selection using user-select: all:

<p class="ruble-ligature">
  Ligature applied: 1000 <span class="select-all">руб.</span>
</p>
<style>
  .select-all {
    user-select: all;
  }
</style>

Another issue is that if you select and copy text with the ligature, the copied text will contain the original characters. But this shouldn't be a problem from the UX side. In fact, it can even be considered a feature ;)

Conclusion

If any professional font editor ever reads this post, they will probably be horrified by the atrocities I have committed. But I hope it can be an interesting read for people who like to look at the ins and outs of things we commonly use but don't usually bother to look inside of.


Cover photo by Mr Cup / Fabien Barral on Unsplash