Creating a digital entity to respond to some non-digital object, text, or even idea is something most of us do on a daily basis. Think of typing up a document as a form of encoding an idea, assigning characters on a screen to small lexical units: words. The characters we assign to those lexical units show up as a series of letters on our screen, but beneath those letters each keystroke ultimately generates a binary code that represents the character and saves it in the system. We rarely encounter character-set compatibility problems across platforms unless we are coding an HTML document or working with a programming language, although many of us have met dialog boxes in Word or Excel asking whether to encode a document as ASCII, ANSI, UTF-8, or some other designation. And of course, we've all seen web pages or e-mails where character encoding has gone wrong, producing garbled characters.
Understanding how those codes were created, and more importantly, what you need to do to ensure that your encoded text, object, or idea can be read across a variety of programs and platforms, whether in the form of markup or a tabular dataset, is helpful when building data for the humanities. It will ensure that projects undertaken in languages other than English succeed, particularly those using TEI-XML or some other form of markup.
The short answer regarding encoding is that all datasets, marked-up texts, and the like should be encoded using the Unicode system, and in most cases specifically the encoding known as UTF-8. UTF-8 is the preferred choice for HTML and XML documents, and Unicode is the most comprehensive system for encoding the world's writing systems (even, it should be noted, Linear A). Any dataset you create should be saved and exported using UTF-8 encoding, and you should set your text editor to encode any text you type in it as UTF-8.
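As a quick illustration, here is a minimal Python sketch (the file name is hypothetical) of writing and reading a dataset while naming the encoding explicitly, rather than trusting the platform default:

```python
# Write a small dataset with non-ASCII values, stating the encoding
# explicitly instead of relying on the platform default.
rows = ["château,1715", "Ἀθῆναι,-508"]

with open("places.csv", "w", encoding="utf-8") as f:  # hypothetical file name
    f.write("\n".join(rows))

# Read it back, again naming the encoding explicitly.
with open("places.csv", "r", encoding="utf-8") as f:
    print(f.read())
```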
Furthermore, pay close attention to how your special characters interact with your file format. For example, a .csv file uses the comma to separate values, so a cell value that itself contains a comma must be wrapped in quotation marks, or the software reading the file will split it into two fields (the sketch below handles this automatically). In markup, an HTML or XML document should not (properly speaking) use the literal characters < or > in its text, because those characters delimit tags. Instead, use either a Unicode numeric reference or an HTML character entity reference for those characters.
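Here is a minimal sketch using Python's standard csv module, which applies the quoting rules for you (the file name is hypothetical):

```python
import csv

# A value containing a comma must be quoted so that readers do not
# split it into two fields; csv.writer adds the quotes automatically.
with open("titles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Paradise Lost, Book I", "1667"])

with open("titles.csv", newline="", encoding="utf-8") as f:
    print(next(csv.reader(f)))  # ['Paradise Lost, Book I', '1667']
```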
Early computing was marked by the constraints of memory as to how many unique characters could be assigned a non-repeating number. A single byte holds eight binary digits (0s and 1s), conventionally written as two hexadecimal digits, so a one-byte character set can represent a maximum of 256 characters (16 x 16, numbered 0 through 255). Moreover, a failure to coordinate encoding across global languages meant that some systems, like early ASCII, made no attempt to include many characters and symbols needed for languages other than English. Even when later developments did add such characters, they did so by defining a separate character set for each language, making it impossible to encode more than one language reliably in a single document or file. Unicode, by relying on a multiple-byte system made practical by growing memory and storage, has markedly expanded the number of available code points.
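To see the difference concretely, here is a short sketch showing how characters beyond the one-byte range take two to four bytes under UTF-8:

```python
# ASCII characters fit in a single byte; anything beyond U+007F
# requires two to four bytes when encoded as UTF-8.
for ch in ("A", "é", "λ", "𐘀"):  # the last is a Linear A sign, U+10600
    print(ch, hex(ord(ch)), ch.encode("utf-8"))
# A  0x41     b'A'                  (1 byte)
# é  0xe9     b'\xc3\xa9'           (2 bytes)
# λ  0x3bb    b'\xce\xbb'           (2 bytes)
# 𐘀 0x10600  b'\xf0\x90\x98\x80'   (4 bytes)
```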
A good overview of these details can be found here, and from the W3C, an introduction and another overview. Take a tour around the Unicode site, especially the full code table, as well.
From these overviews you should be aware of the following encoding schemes and their sub-varieties, and understand the use of Unicode as a standard:
- Unicode and the differences between UTF-8, UTF-16, and UTF-32; the purpose of a Byte Order Mark (BOM), illustrated in the sketch after this list
- ISO 8859, a family of earlier standards whose best-known part, ISO 8859-1 ("Latin-1"), covers a variety of Western European languages
- ASCII, in many ways the base standard, which includes control characters such as the carriage return alongside its printable ones
- Windows-1250, a platform-specific Windows code page for Central European languages that overlaps with parts of ISO 8859
- ANSI, strictly the name of an American standards body (the American National Standards Institute), but often used loosely to refer to a sibling of Windows-1250, the code page Windows-1252
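Where the BOM is concerned, a file's first few bytes can signal which Unicode encoding it uses. Here is a minimal sketch (the function name is hypothetical) that checks for the common BOMs using Python's codecs constants:

```python
import codecs

def sniff_bom(path):
    """Return the encoding signalled by a leading BOM, or None."""
    with open(path, "rb") as f:
        head = f.read(4)
    # Check UTF-32 before UTF-16: the UTF-32-LE BOM begins with the
    # same two bytes (FF FE) as the UTF-16-LE BOM.
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"   # EF BB BF
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"      # FF FE 00 00 / 00 00 FE FF
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"      # FF FE / FE FF
    return None              # no BOM; UTF-8 files often omit it
```

Note that UTF-8 does not require a BOM, and many tools omit it.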
HTML and XML
Because a lot of digital humanities work involves producing for the web, as well as markup in a stable, human-readable, long-lived format like XML, it is good to know a little about how character codes get deployed in HTML and XML.
To conform to standards, both an HTML file and an XML file must state, at the top of the document, the encoding needed to read the rest of it. If you've done some basic HTML work before, you'll probably recall placing that declaration at the top of your code. Here, for example, is what the W3C has to say on doing this properly in HTML.
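In practice those declarations look like the strings in the sketch below; note that the bytes written to disk must actually match the encoding the document declares (the file names are hypothetical):

```python
# Minimal encoding declarations for XML and HTML5. The XML
# declaration must be the very first thing in the file.
xml_doc = '<?xml version="1.0" encoding="UTF-8"?>\n<root>é</root>\n'
html_doc = ('<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
            '<title>é</title></head><body></body></html>\n')

# Save each file in the encoding it declares.
with open("doc.xml", "w", encoding="utf-8") as f:
    f.write(xml_doc)
with open("doc.html", "w", encoding="utf-8") as f:
    f.write(html_doc)
```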
HTML and XML can also deploy character entity references that point directly at a character's Unicode code point, in either hexadecimal or its equivalent decimal form. These entities consist of an ampersand followed by 1) the entity name defined for that character OR 2) a hash sign (#) and a lowercase x followed by the character's hexadecimal Unicode number OR 3) a hash sign and its decimal Unicode number. The entity ends with a semicolon.
For example, here are the entity references for é, aka “e acute”:
&eacute; OR &#xE9; OR &#233;
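You can verify that all three forms resolve to the same character (U+00E9) with Python's standard html module:

```python
import html

# All three entity forms decode to the same character, U+00E9.
print(html.unescape("&eacute;"))  # é
print(html.unescape("&#xE9;"))    # é
print(html.unescape("&#233;"))    # é
```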
But if your dataset, file, or text is encoded in Unicode, and declares itself as such in its header, it will have followed proper standards and you will not need any of these character entities, with the exception of the following characters, which are used in HTML and XML tags (the sketch after this list escapes them automatically):
&amp; → & (ampersand, U+0026)
&lt; → < (less-than sign, U+003C)
&gt; → > (greater-than sign, U+003E)
&quot; → " (quotation mark, U+0022)
&apos; → ' (apostrophe, U+0027)
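Rather than escaping these by hand, you can let a library do it; here is a minimal sketch using Python's html.escape, which handles exactly these characters:

```python
import html

# html.escape replaces &, <, and > by default; quote=True also
# converts the double and single quotation marks.
raw = 'Margins < 2cm & "wide" pages aren\'t allowed'
print(html.escape(raw, quote=True))
# Margins &lt; 2cm &amp; &quot;wide&quot; pages aren&#x27;t allowed
```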
Beyond that, you have done your due diligence, and it is up to the end user to ensure that their reader or browser interprets your stated encoding scheme correctly.