Arms ch9
http://www.cs.cornell.edu/wya/DigLib/MS1999/Chapter9.html
• Text special to libraries
• Digital libraries: text created digitally, converted from other media, digitized sound track from films / TV programs (Does he mean transcripts?)
• Text as metadata
• Structural mark-up: SGML
• Appearance: page-description languages - TeX, PostScript, PDF
• SGML + style sheet = structure + appearance; different style sheets allow same document to be displayed in different ways
• Difficult to control appearance through SGML + style sheets; doesn't work for all applications, like mathematics (TeX)
• Conversion: scanning; large files, compression, text-as-image file; OCR (OCR isn't perfect!)
• Another conversion option: retyping by hand
• Encoding: ASCII, Unicode, transliteration (ö becomes oe)
• SGML = system to define mark-up specifications; an individual specification is called a document type definition (DTD)
• DTDs relevant to libraries: Text Encoding Initiative (TEI) and Encoded Archival Description (EAD)
• HTML - derivative DTD of SGML that includes formatting as well as structure
• HTML has grown by leaps and bounds: images, tables, frames
• HTML is supposed to be controlled by W3C, but is really controlled by browser developers
• XML - SGML variant - attempt to bridge gap between simplicity of HTML and power of SGML
• Style sheets: CSS, XSL
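A quick sketch of the transliteration idea from the encoding bullet (ö becomes oe), mostly to convince myself how it works. The mapping table and the function name `transliterate` are my own assumptions for illustration; they don't come from the chapter, and real transliteration schemes differ by language and library practice.

```python
import unicodedata

# Hypothetical table covering a few German characters only (an assumption,
# not a standard scheme).
TRANSLIT = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}

def transliterate(text: str) -> str:
    """Replace each mapped character with its ASCII stand-in; leave the rest alone."""
    # Normalize first so "o" + combining diaeresis is treated like a single "ö".
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(str.maketrans(TRANSLIT))

print(transliterate("Gödel"))
```

Plain ASCII can't hold ö at all, which is why older systems fell back on tricks like this; Unicode can store it directly, so transliteration becomes a choice rather than a necessity.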
It's been a while since I've done any serious web development, and I'm not up to date on CSS and XML. That said, some of what's written here seems quite dated, which is no wonder since it's nearly ten years old. Storing a file that's a whopping 50,000 bytes (~50 KB) isn't nearly as daunting now as it was then. Likewise, CSS has come a long way and HTML has mostly stabilized, though Microsoft can still be counted on to misbehave.
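To make the "SGML + style sheet = structure + appearance" note concrete, here is a toy sketch. It is not CSS or XSL: the two "style sheets" are just Python dicts I made up (`PLAIN` and `HTMLISH`), and the sample document and the `render` function are also my own assumptions. The point is only that one structurally marked-up document can be displayed in different ways.

```python
import xml.etree.ElementTree as ET

# A tiny document that records structure only (made-up example).
DOC = "<article><title>Text</title><para>Mark-up describes structure.</para></article>"

# Two toy "style sheets": each maps an element name to a display template.
PLAIN = {"title": "{}\n", "para": "  {}\n"}
HTMLISH = {"title": "<h1>{}</h1>\n", "para": "<p>{}</p>\n"}

def render(xml_text: str, style: dict) -> str:
    """Format each top-level child element using the template for its tag."""
    root = ET.fromstring(xml_text)
    return "".join(style[child.tag].format(child.text) for child in root)

print(render(DOC, PLAIN))
print(render(DOC, HTMLISH))
```

The document never changes; only the dict passed to `render` does, which is roughly the separation the chapter describes.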