The Library at the End of Time: September 2008

Monday, September 29, 2008

Class notes 29 September 2008

Martin Bryan. Introducing the Extensible Markup Language (XML) http://burks.bton.ac.uk/burks/internet/web/xmlintro.htm
Uche Ogbuji. A survey of XML standards: Part 1. January 2004. http://www-128.ibm.com/developerworks/xml/library/x-stand1.html
Extending Your Markup: An XML Tutorial by Andre Bergholz http://www.pdffinder.com/pdf/extending-your-markup-an-xml-tutorial.html, or at http://xml.coverpages.org/BergholzTutorial.pdf
XML Schema Tutorial http://www.w3schools.com/Schema/default.asp

XML Attributes: http://www.w3schools.com/Xml/xml_attributes.asp
DTD Attributes: http://xmlfiles.com/dtd/dtd_attributes.asp
http://www.javacommerce.com/displaypage.jsp?name=element.sql&id=18238
www.w3.org/XML/
www.xml.com

Saturday, September 27, 2008

Assignment 2: My Digital Image Library (and a question)

Evolution of Bumblebee

Link

Question: Is it strictly necessary to resize all the images twice? Picasa seems to do a lovely job of that all by itself. I'll do it if need be (I'm well versed in the use of GIMP, the open-source Photoshop alternative) but it does seem a little superfluous. Please let me know ASAP via e-mail (spekkiomofw (at) gmail dot com).

Wednesday, September 24, 2008

Muddy points for week four: none.

I'm good this week.

Monday, September 22, 2008

Class notes - week 4

RE: Assignment 2: Yes, 3D objects OK. Picasa OK.
RE: Term project: ContentDM is OK if OK with group.

BUBL Catalogue of Internet Resources

Scorpion

The Scorpion Open Source project offers software that implements a system for automatically classifying Web-accessible text documents. Scorpion is intended for use by investigators who have a machine-readable subject classification scheme or thesaurus and wish to incorporate it into an automatic classification system.

CyberStacks

CyberStacks(sm) is a centralized, integrated, and unified collection of significant World Wide Web (WWW) and other Internet resources categorized using the Library of Congress classification scheme. Resources are organized under one or more relevant Library of Congress class numbers and an associated publication format and subject description. The majority of resources incorporated within its collection are monographic or serial works, files, databases or search services. All of the selected resources in CyberStacks(sm) are full-text, hypertext, or, hypermedia, and of a research or scholarly nature.

Dublin Core

Using Dublin Core: The Elements

Mountain West Digital Library - a DL that uses Dublin Core

MODS - a derivative of MARC21 expressed in XML - uses language-based tags instead of numeric ones

MARC to Dublin Core Crosswalk

Social Bookmarking in Plain English video

DCMI Social Tagging Community

Another muddy point re: term project

Is ContentDM considered open source software for the purposes of the term project? I'm going to be working with it for another course.

According to the ContentDM site, "While most users run the CONTENTdm software “out-of-the box,” it also has an API that allows for custom development. The open architecture supports extensions and the Web interface is fully customizable."

And if it isn't - could we use it instead of open source software? Believe me, I have nothing against open source software - my laptop runs on Ubuntu Linux, my browser of choice is Firefox, and I use GIMP on a semi-regular basis. But uniting the two assignments (working with ContentDM for LIS 2405, producing a prototype digital library for LIS 2670) would be very efficient.

Friday, September 19, 2008

Readings for Week Four - Gilliland and Weibel

Gilliland, Anne J. Introduction to Metadata: Setting the Stage.
http://www.getty.edu/research/conducting_research/standards/intrometadata/setting.html

Content == intrinsic
Context == extrinsic
Structure == intrinsic, extrinsic, or both

Interesting table of "typology of data standards" (table 1)
Table of different types of metadata and their functions (table 2)
Table of attributes and characteristics of metadata (table 3)

Reaction: I'm sure I've read or heard this all before in other LIS classes.

Weibel, Stuart L. Border Crossings: Reflections on a Decade of Metadata Consensus Building. D-Lib Magazine July/August 2005.

Reaction: Now this is a great article - someone who has been working on the frontlines of metadata for a decade and doesn't fluff it up and make it more complicated than it needs to be. And from what I've seen, I love Dublin Core. It's a lot better than MARC!

I'm curious: "The International Press Telecommunications Council is exploring embedding Dublin Core in their new generation of news standards [17]." Did the IPTC end up doing this?

It's interesting that some of the challenges that Mr. Weibel wrote about three years ago haven't been solved yet - like author-created metadata. I also found "NIH Syndrome" interesting.

Arr! Another muddy point about assignment 2!

Avast! Another question about assignment 2!
(It's International Talk Like a Pirate Day.)

The objects that we photograph or scan - must they be two-dimensional objects (i.e. papers, still images) or is three-dimensional acceptable as well?

Monday, September 15, 2008

Muddy points for week three

Today's lecture / presentation left me with a lot of questions that I decided to reserve for this post because I was afraid I'd end up taking up too much time.

Oh yeah, and by the way, terribly sorry about being late to class.

Image formats
Something I've run into that I want to know more about is JPEG 2000. It's not the same as the old JPEG standard. I think, if memory serves, that JPEG 2000 is used by the Pitt ULS for storing preservation masters instead of TIFF because it's capable of lossless compression. (I think.)

Something else: GIF handles simple animations; PNG does not. I don't know how relevant that is to the course, but....

Sound formats
First of all, on audio quality, I thought that CD-quality audio was 44.1 kHz, not 22 kHz. Then again, the Wikipedia article I linked says something about a 22.05 kHz Nyquist frequency and I have no idea what they're talking about. If someone could help me with this I'd appreciate it. I know it isn't terribly relevant to the class, but it's going to bug me.
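
A note on the arithmetic, as far as I can tell: a sampling rate can only capture frequencies up to half its own value (that half is the Nyquist frequency), so 44.1 kHz sampling would give a 22.05 kHz ceiling. The two numbers would then describe the same thing from different ends. A minimal Python sketch, where the function name and the 20 kHz hearing limit are assumptions made for illustration:

```python
def nyquist_frequency(sample_rate_hz: float) -> float:
    """Highest frequency a given sampling rate can represent: half the rate."""
    return sample_rate_hz / 2


CD_SAMPLE_RATE_HZ = 44_100   # CD audio sampling rate
HEARING_LIMIT_HZ = 20_000    # assumed rough upper limit of human hearing

cd_nyquist = nyquist_frequency(CD_SAMPLE_RATE_HZ)
print(f"CD Nyquist frequency: {cd_nyquist / 1000} kHz")
print("Covers assumed hearing range:", cd_nyquist >= HEARING_LIMIT_HZ)
```

If that reading is right, the 22 kHz figure from class is the highest pitch a CD can reproduce and 44.1 kHz is how often the sound is sampled.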

Second, the only audio format we talked about was the infamous MP3. But it seems to me that MP3 is on its way out - largely because of its lack of support for Digital Rights Management, but also because better formats are now available. AAC generally sounds better than MP3 at the same bit rate, so files can be smaller for similar quality. There's also Microsoft's WMA, RealAudio, and the lesser-known Ogg Vorbis and FLAC.

Identifying digital objects

I'm really unclear on this - PURL and DOI sound more like address forwarding than identifiers. From the lecture it sounded like those services would just make sure that people could always find the items at their original location. (Valuable, but not all that an identifier does.) In my online travels I've seen MD5 checksums used as an identifier of sorts - if they match, it's the same item. I don't know - maybe I'm overthinking it.
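
To make the checksum idea concrete, here is a minimal Python sketch using the standard library's hashlib. The helper name and the sample byte strings are hypothetical stand-ins for real files:

```python
import hashlib


def md5_checksum(data: bytes) -> str:
    """Return the MD5 digest of some bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical stand-ins for a file, an exact copy, and an altered copy.
original = b"The Library at the End of Time"
copy = b"The Library at the End of Time"
altered = b"The Library at the End of Time!"

# Identical bytes give identical checksums; any change gives a different one.
print(md5_checksum(original) == md5_checksum(copy))
print(md5_checksum(original) == md5_checksum(altered))
```

A checksum like this says whether two copies have the same bytes, but nothing about where to find them, which may be why PURL and DOI are built around forwarding instead.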

Assignment 2

Just one minor thing here: do we have to use Flickr? If we already have accounts with another image-sharing service, like Google Picasa (which conveniently integrates with Google Blogger), would that be acceptable too?

Sunday, September 14, 2008

Week 3 reading: Arms ch9

Arms ch9
http://www.cs.cornell.edu/wya/DigLib/MS1999/Chapter9.html
• Text special to libraries
• Digital libraries: text created digitally, converted from other media, digitized sound track from films / TV programs (Does he mean transcripts?)
• Text as metadata
• Structural mark-up: SGML
• Appearance: page-description languages - TeX, PostScript, PDF
• SGML + style sheet = structure + appearance; different style sheets allow same document to be displayed in different ways
• Difficult to control appearance through SGML + style sheets; doesn't work for all applications, like mathematics (TeX)
• Conversion: scanning; large files, compression, text-as-image file; OCR (OCR isn't perfect!)
• Another conversion option: retyping by hand
• Encoding: ASCII, Unicode, transliteration (ö becomes oe)
• SGML = system to define mark-up specifications; an individual specification is called a document type definition (DTD)
• DTDs relevant to libraries: Text Encoding Initiative (TEI) and Encoded Archival Description (EAD)
• HTML - derivative DTD of SGML that includes formatting as well as structure
• HTML has grown by leaps and bounds: images, tables, frames
• HTML is supposed to be controlled by W3C, but is really controlled by browser developers
• XML - SGML variant - attempt to bridge gap between simplicity of HTML and power of SGML
• Style sheets: CSS, XSL
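
The "structure + style sheet = appearance" idea from the notes above can be sketched in a few lines of Python. This is only an illustration under assumptions: it uses XML instead of SGML, the document and tag names are made up, and plain dictionaries stand in for real CSS or XSL style sheets.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the tags describe structure only, not appearance.
DOCUMENT = "<note><title>Week 3</title><para>Arms, chapter 9.</para></note>"

# Two made-up "style sheets": each maps a tag name to a rendering function.
PLAIN_STYLE = {
    "title": lambda text: text.upper(),
    "para": lambda text: "  " + text,
}
HTML_STYLE = {
    "title": lambda text: f"<h1>{text}</h1>",
    "para": lambda text: f"<p>{text}</p>",
}


def render(xml_text: str, style: dict) -> str:
    """Apply a style to each child element of the root, in document order."""
    root = ET.fromstring(xml_text)
    lines = [style[child.tag](child.text or "") for child in root]
    return "\n".join(lines)


print(render(DOCUMENT, PLAIN_STYLE))
print(render(DOCUMENT, HTML_STYLE))
```

The same marked-up text comes out two different ways depending on which style is applied, and the document itself never changes.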

It's been a while since I've done any serious web development and I'm not up-to-date on CSS and XML. That said, some of what's written here seems quite dated - no wonder, since it's nearly ten years old. Storing a file that's a whopping 50,000 bytes (~50KB) isn't nearly as daunting now as it was then. Likewise, CSS has come a long way and HTML has mostly stabilized, though Microsoft can still be counted on to misbehave.

Thursday, September 11, 2008

About Scott

As I promised in class, here's a little about me and what I could contribute to a group for the term project.

My name is Scott Nicolson. I'm in my third semester in MLIS @ Pitt. I hold a BA in History and a BS in Applied Computer Science from California University of Pennsylvania. My liberal arts background includes study of Spanish and German, as well as a lot of history, and I think I'm a decent writer / editor. I can put on a decent presentation, too. As for computing, I'm well-versed in Windows XP (not Vista, though - ugh), Mac OS X, and two flavors of Linux (Debian and Ubuntu). I've also got background in web design and development. I've taken the Library and Archival Preservation course here @ Pitt, too.

I do have issues with time management and anxiety - I'll admit that up front. I also don't have any bright ideas for a term project.

Any takers?

Monday, September 8, 2008

Class: 8 September 2008

Administrative stuff

Readings notes and submission deadline

Reading notes are pre-class: by Friday of week 3 at 11:59pm, submit the week 4 reading notes on the blog.

No need to send email

Muddiest Points, and submission deadline

Muddiest points are post-class: by Friday of week 3 at 11:59pm, submit the week 3 muddiest points. If you don't have any, you can just say so.

Term Project

Task:
Propose, plan and develop a prototype digital collection, using Open Source software (e.g., Greenstone, DSpace, Fedora, Ruby on Rails).
Requirements:
Address the need of a group of real users
Include at least three collections and at least one other media format in addition to text (minimally 25 documents total).

European Library Treasures Web Exhibition
-This might be helpful for my History of Ireland course....

About DSpace
-See also DSpace Visual Diagram (PDF)

FEDORA
-Flexible Extensible Digital Object and Repository Architecture

NCSTRL
-Networked Computer Science Technical Reference Library

Open Archives Initiative

FreeLib - Peer to Peer Digital Library Project

Grid computing & digital libraries

Beowulf.org

“The Benefits of Grid Networks--Digital Libraries” By Roy Tennant — March 15, 2005 http://www.libraryjournal.com/article/CA509610.html

Lorcan Dempsey's weblog “WorldCat in your pocket” http://orweblog.oclc.org/archives/000544.html

“Hyperdatabases: An Infrastructure for the Information Space.” http://www.springerlink.com/content/dg2uxvpnj3k3cec2/

Digital Libraries that use Greenstone:

Digital Library that uses DSpace

Week 1 reading notes and reactions

Castelli
1:
-Many different definitions of digital library
-Definitions are colored by perspective of people working on it
-Lack of standardization in digital libraries - lack of interoperability or reusability
-DELOS (EU) establishes principles to fix all this

2: DL vs DLS vs DLMS
DL: the library / organization
DLS: software & architecture through which users access the DL
DLMS: Generic software system that takes care of the nitty-gritty of running a DL
-XDLS: Complete DLMS that can be added onto
-DLS Warehouse: Components that can be combined in a variety of ways to constitute a DLS - like Lego building blocks
-DLS Generator - Parameterized software system - when first set up, manager selects parameters and the DLS is generated

Comment: Standards are good.

Paepcke
NSF launched the Digital Library Initiative (DLI) in 1994, which led to Google, CareMedia, and much more
Librarians and computer scientists were on equal footing in the DLI until the web came along, shook things up, and knocked things towards CS; new developments are swinging things back towards a more equal footing

Comment: This article makes it sound like librarians are useful after all.

Levy
-Public libraries in the USA have always struggled with sense of purpose
-Academic libraries inherit purpose from academic institutions, but finances are a big problem
-Purpose of digital libraries not yet established; much of it is more a religion than practical
-More discussions and debates need to be had to find purpose (and thus direction)

Comment: Discussion is good.

Arms
Libraries are expensive - digital libraries may be able to help bring costs down
Much of the cost of libraries is the cost of staffing
Article discusses the possibility that skilled librarian work might eventually be taken over by computers.
"Brute force computing" - utilizing Moore's law to solve things through immense computing power.
Computers still cannot actively seek information - only recognize patterns.
Automatic systems cannot be selective; traditional libraries have to be selective to keep costs down
Automated digital libraries + open access information on the Internet == Ford Model T

Comment: This article reinforces my belief (fear?) that digital libraries are going to destroy the concept of "librarian."

Schwartz
No set definition of "digital library." An LIS class project found 64 different formal / informal definitions of "digital library."
Hybrid library: mix between conventional and digital library
Currently digital librarians are more concerned with coping with the enormous tasks and decisions at hand than philosophy.

Overall reaction

I think I "get" what digital libraries are, and given my multidisciplinary background, I can see the different points of view on them. I can understand the computer scientist "fix a problem" and "ooh this is neat" viewpoint. I can understand the "information for its own sake" librarian / liberal arts point of view. What really distresses me is all of the discussion about financial issues and whether or not librarians are going to continue to be useful. I think they are useful and things should be changed to bring more money in, but nobody really cares about that - everything has to be done faster and cheaper.

Monday, September 1, 2008

tap tap Is this thing on? tap tap

First post.
My Facebook: sln15 at pitt.edu
My LiveJournal (haven't used lately, but....): spekkiomow.livejournal.com
