The Library at the End of Time: 2008

Monday, November 17, 2008

Classmates: Greenstone?

I hope people read this...have you gotten Greenstone working? On what platform? Are your group members able to access your server remotely, or are you getting together in person to assemble your collection?

I'm able to get it to run on OSX and on Windows, but I can't quite get the remote part (i.e. Apache) worked out, and I'm wondering if anybody has had any luck with this.

Remote Access software for OSX

http://www.pure-mac.com/remote.html

Friday, November 14, 2008

Week 12 reading reaction and muddiest point

Arms, William. "Implementing Policies for Access Management," D-Lib Magazine, February 1998.

So much of this seems obvious now, but this article is ten years old. It may well be that this influenced how systems work now.

Arms, William. "Chapter 6: Economic and legal issues," in Digital Libraries, MIT Press, 2000.

"Technology can contribute to the solutions, but it can not resolve economic or social issues."

Huh? Technology has done wonders for economic and social issues. Maybe this is a symptom of this being such an old text (relatively speaking) but even in 1999 people were recognizing the economic and social power of technology - particularly the Internet. The "dot-com bubble" was in full force in 1999!

Arms, William. "Chapter 7: Access management and security," in Digital Libraries, MIT Press, 2000.

"Data on many personal computers is unprotected except by physical access; anybody who has access to the computer can read the data."

I'm not sure about the accuracy of this. I know it's not true now - I'm not entirely sure it was true in 1999. It rather depends on the operating system, and there's also encryption. Anyway, I didn't learn anything new from this chapter.

Lesk, Michael. "Chapter 9: Economics" in Understanding Digital Libraries, 2005.

This, by far, is the most interesting of the readings for the week - it's the most current and the most relevant. It's very interesting that we haven't resolved these issues yet. And I don't have the answers, either.

Muddiest Point: None

Monday, November 10, 2008

Greenstone 3 notes

These are notes for myself....

"An error has occurred on the remote Greenstone server while performing this operation: Empty gliserver URL: please set this in Preferences before continuing."

"Unable to get the list of classifiers using classinfo.pl -listall"

Monday, October 20, 2008

Class notes 20OCT08

VirtualBox - open source system virtualization

CIRES: Content Based Image REtrieval System

Music Retrieval by Content Demo

Friday, October 17, 2008

Readings for Week Eight

First: no muddy points.

1. Chapter 1. Definition and Origins of OAI-PMH. oai-pmh-ch1.pdf

OAI-PMH sounds a lot better than Z39.50. XML? Simple? I'm in.

2. Miller, Todd. "Federated Searching: Put It in Its Place." Library Journal April 15, 2004. http://www.libraryjournal.com/article/CA406012.html

It seems to me that the author is arguing that federated search, not the library catalog, should be the new "center" of library searching. I think it's fair to say that at least some libraries have taken that to heart - Pitt's ULS has its "Zoom!" on its homepage, while PITTCat (aside: why is PITT capitalized in PITTCat?) requires additional clicking.

3. Hane, Paula J. "The Truth About Federated Searching." Information Today Vol. 20 No. 10, November/December 2003.
http://www.infotoday.com/it/oct03/hane1.shtml

This article would be better titled "The Sad Truth About Federated Searching." It's surprising to me that this list of five weaknesses of federated searching was provided by WebFeat, a provider of...federated searching. I guess one has to give them points for candor. It's still...disheartening, I guess.

4. Lynch, Clifford A. (1997). "The Z39.50 Information Retrieval Standard, Part 1: A Strategic View of its Past, Present, and Future." D-Lib Magazine, April 1997. http://www.dlib.org/dlib/april97/04lynch.html

Wow...this article is rather old by technology standards, isn't it? It's over ten years old - and ten years is generally considered the threshold between "new" and "old" scholarship. I'm curious if this is still "current," though I do get the gist of it.

5. Norbert Lossau, “Search Engine Technology and Digital Libraries: Libraries Need to Discover the Academic Internet.” D-Lib Magazine, June 2004, Volume 10 Number 6.
http://www.dlib.org/dlib/june04/lossau/06lossau.html

"Will Google, Yahoo or Microsoft be the only portals to global knowledge in 2010?"

Actually, the true portal to global knowledge is John Hodgman's The Areas of My Expertise, the authoritative almanac to complete world knowledge, first published in 2005. Its continuation, More Information than You Require, is being released on 21 October 2008.

That said, I find it interesting that the author clearly notes the strong conservatism in librarianship, but then says that their conclusions are "obvious." And this article is four years old - and I don't know that much has changed in that time, at least from what I've seen.
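Back to the first reading for a moment: the "XML? Simple?" appeal of OAI-PMH is that a request is just an HTTP URL with a verb and a few arguments, and the reply is XML. Here's a rough sketch of what building those URLs might look like. The base URL is a made-up placeholder (not a real repository), and the helper function is my own naming; only the verbs (Identify, ListRecords) and the oai_dc metadata prefix come from the protocol itself.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only; a real repository publishes its own base URL.
BASE_URL = "http://example.org/oai"


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL: base URL + verb + any verb-specific arguments."""
    params = {"verb": verb}
    params.update(arguments)
    return base_url + "?" + urlencode(params)


# Ask the repository to describe itself.
print(oai_request_url(BASE_URL, "Identify"))

# Ask for records in unqualified Dublin Core.
print(oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc"))
```

This only builds the URLs; actually fetching and parsing the XML response would be a separate step.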

Tuesday, October 14, 2008

Term project links

Digital library software:

Articles re: digital library software

Class notes 14OCT08

Midterm is 03NOV08 (I've already registered with DRS)
First presentation about digital library project is also 03NOV08; ~5 minutes

Networked Computer Science Technical Reference Library (NCSTRL)

Friday, October 10, 2008

Week 7 reading reactions

Hawking, David. "How Things Work: Web Search Engines: Part 1." Computer June 2006, Link

Given my computer science background, I don't see many surprises here, but it's still very interesting reading. The amount of power and equipment needed to do this is always surprising, though - just as a matter of magnitude.

Hawking, David. "How Things Work: Web Search Engines: Part 2." Computer August 2006, Link

The mathematics here make my head spin and make me glad that I'm not a programmer. That said, it seems unfortunate to me that this is the end of this series on search engines - assuming, of course, that it is.

Henzinger, Monika R., Rajeev Motwani, and Craig Silverstein. "Challenges in Web Search Engines." ACM SIGIR Forum Vol 36 No 2, Fall 2002,
Link

This article is strictly about under-researched search engine problems: spam, content quality, quality evaluation, web conventions (i.e. standard practices), duplicate hosts, and vaguely-structured data. Essentially, what I get out of this article is that the web is still rather chaotic, so search engines aren't easy.

Monday, September 29, 2008

Class notes 29 September 2008

Martin Bryan. Introducing the Extensible Markup Language (XML) http://burks.bton.ac.uk/burks/internet/web/xmlintro.htm
Uche Ogbuji. A survey of XML standards: Part 1. January 2004. http://www-128.ibm.com/developerworks/xml/library/x-stand1.html
Extending your Markup: an XML tutorial by Andre Bergholz http://www.pdffinder.com/pdf/extending-your-markup-an-xml-tutorial.html, or at http://xml.coverpages.org/BergholzTutorial.pdf
XML Schema Tutorial http://www.w3schools.com/Schema/default.asp

XML Attributes: http://www.w3schools.com/Xml/xml_attributes.asp
DTD Attributes: http://xmlfiles.com/dtd/dtd_attributes.asp
http://www.javacommerce.com/displaypage.jsp?name=element.sql&id=18238
www.w3.org/XML/
www.xml.com

Saturday, September 27, 2008

Assignment 2: My Digital Image Library (and a question)

Evolution of Bumblebee

Link

Question: Is it strictly necessary to resize all the images twice? Picasa seems to do a lovely job of that all by itself. I'll do it if need be (I'm well versed in the use of GIMP, the open-source Photoshop alternative) but it does seem a little superfluous. Please let me know ASAP via e-mail (spekkiomofw (at) gmail dot com).

Wednesday, September 24, 2008

Muddy points for week four: none.

I'm good this week.

Monday, September 22, 2008

Class notes - week 4

RE: Assignment 2: Yes, 3D objects OK. Picasa OK.
RE: Term project: ContentDM is OK if OK with group.

BUBL Catalogue of Internet Resources

Scorpion

The Scorpion Open Source project offers software that implements a system for automatically classifying Web-accessible text documents. Scorpion is intended for use by investigators who have a machine-readable subject classification scheme or thesaurus and wish to incorporate it into an automatic classification system.

CyberStacks

CyberStacks(sm) is a centralized, integrated, and unified collection of significant World Wide Web (WWW) and other Internet resources categorized using the Library of Congress classification scheme. Resources are organized under one or more relevant Library of Congress class numbers and an associated publication format and subject description. The majority of resources incorporated within its collection are monographic or serial works, files, databases or search services. All of the selected resources in CyberStacks(sm) are full-text, hypertext, or, hypermedia, and of a research or scholarly nature.

Dublin Core

Using Dublin Core: The Elements

Mountain West Digital Library - a DL that uses Dublin Core

MODS - a derivative of MARC21 using XML - uses word-based element names instead of numeric field tags

MARC to Dublin Core Crosswalk

Social Bookmarking in Plain English video

DCMI Social Tagging Community

Another muddy point re: term project

Is ContentDM considered open source software for the purposes of the term project? I'm going to be working with it for another course.

According to the ContentDM site, "While most users run the CONTENTdm software “out-of-the box,” it also has an API that allows for custom development. The open architecture supports extensions and the Web interface is fully customizable."

And if it isn't - could we use it instead of open source software? Believe me, I have nothing against open source software - my laptop runs on Ubuntu Linux, my browser of choice is Firefox, and I use GIMP on a semi-regular basis. But uniting the two assignments (working with ContentDM for LIS 2405, producing a prototype digital library for LIS 2670) would be very efficient.

Friday, September 19, 2008

Readings for Week Four - Gilliland and Weibel

Gilliland, Anne J. Introduction to Metadata: Setting the Stage.
http://www.getty.edu/research/conducting_research/standards/intrometadata/setting.html

Content == intrinsic
Context == extrinsic
Structure == intrinsic, extrinsic, or both

Interesting table of "typology of data standards" (table 1)
Table of different types of metadata and their functions (table 2)
Table of attributes and characteristics of metadata (table 3)

Reaction: I'm sure I've read or heard this all before in other LIS classes.

Weibel, Stuart L. Border Crossings: Reflections on a Decade of Metadata Consensus Building. D-Lib Magazine July/August 2005.

Reaction: Now this is a great article - someone who has been working on the frontlines of metadata for a decade and doesn't fluff it up and make it more complicated than it needs to be. And from what I've seen, I love Dublin Core. It's a lot better than MARC!

I'm curious: "The International Press Telecommunications Council is exploring embedding Dublin Core in their new generation of news standards [17]." Did the IPTC end up doing this?

It's interesting that some of the challenges that Mr. Weibel wrote about three years ago haven't been solved yet - like author-created metadata. I also found "NIH Syndrome" interesting.

Arr! Another muddy point about assignment 2!

Avast! Another question about assignment 2!
(It's International Talk Like a Pirate Day.)

The objects that we photograph or scan - must they be two-dimensional objects (i.e. papers, still images) or is three-dimensional acceptable as well?

Monday, September 15, 2008

Muddy points for week three

Today's lecture / presentation left me with a lot of questions that I decided to reserve for this post because I was afraid I'd end up taking up too much time.

Oh yeah, and by the way, terribly sorry about being late to class.

Image formats
Something I've run into that I want to know more about is JPEG 2000. It's not the same as the old JPEG standard. I think, if memory serves, that JPEG 2000 is used by the Pitt ULS for storing preservation masters instead of TIFF because it's capable of lossless compression. (I think.)

Something else: GIF handles simple animations; PNG does not. I don't know how relevant that is to the course, but....

Sound formats
First of all, on audio quality, I thought that CD-quality audio is 44.1 kHz, not 22 kHz. Then again, the Wikipedia article I linked says something about a 22.05 kHz Nyquist frequency and I have no idea what they're talking about. If someone could help me with this I'd appreciate it. I know it isn't terribly relevant to the class, but it's going to bug me.
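A quick sketch of how the two numbers might fit together. As far as I understand it, 44.1 kHz is the sampling rate (how many samples per second are stored), and the Nyquist frequency is half the sampling rate: the highest audio frequency those samples can represent. My guess, and it is only a guess, is that the "22 kHz" from lecture refers to that upper limit rather than to the sampling rate. The function name below is my own.

```python
CD_SAMPLE_RATE_HZ = 44_100  # samples per second for CD audio


def nyquist_frequency(sample_rate_hz):
    """Highest frequency a given sampling rate can represent: half the rate."""
    return sample_rate_hz / 2


print(nyquist_frequency(CD_SAMPLE_RATE_HZ))  # 22050.0, i.e. 22.05 kHz
```

So 44.1 kHz sampling and a 22.05 kHz Nyquist frequency would be two descriptions of the same CD audio, not two competing quality levels.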

Second, the only audio format we talked about was the infamous MP3. But it seems to me that MP3 is on its way out - largely because of its lack of support for Digital Rights Management, but also because better formats are now available. AAC files are smaller and better than MP3. There's also Microsoft's WMA, RealAudio, and the lesser-known Ogg Vorbis and FLAC.

Identifying digital objects

I'm really unclear on this - PURL and DOI sound more like address forwarding than identifiers. From the lecture it sounded like those services would just make sure that people could always find the items at their original location. (Valuable, but not all that an identifier does.) In my online travels I've seen MD5 checksums used as an identifier of sorts - if they match, it's the same item. I don't know - maybe I'm overthinking it.
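To make the checksum idea concrete, here's a minimal sketch using Python's standard hashlib module. The sample byte strings are invented for illustration; in practice you'd read the bytes of the actual files. One caveat I'm sure of: MD5 is fine for spotting accidental changes, but it isn't considered secure against someone deliberately crafting two different files with the same checksum.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 checksum of some bytes as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


# Invented sample content standing in for three files.
original = b"The quick brown fox"
copy = b"The quick brown fox"
altered = b"The quick brown fix"

print(md5_hex(original) == md5_hex(copy))     # True: same bytes, same checksum
print(md5_hex(original) == md5_hex(altered))  # False: one changed byte, different checksum
```

Note what this does and doesn't do: a checksum says whether two copies are the same item, but it says nothing about where to find the item, which is the part PURL and DOI seem to cover.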

Assignment 2

Just one minor thing here: do we have to use Flickr? If we already have accounts with another image-sharing service, like Google Picasa (which conveniently integrates with Google Blogger), would that be acceptable too?

Sunday, September 14, 2008

Week 3 reading: Arms ch9

Arms ch9
http://www.cs.cornell.edu/wya/DigLib/MS1999/Chapter9.html
• Text special to libraries
• Digital libraries: text created digitally, converted from other media, digitized sound track from films / TV programs (Does he mean transcripts?)
• Text as metadata
• Structural mark-up: SGML
• Appearance: page-description languages - TeX, PostScript, PDF
• SGML + style sheet = structure + appearance; different style sheets allow same document to be displayed in different ways
• Difficult to control appearance through SGML + style sheets; doesn't work for all applications, like mathematics (TeX)
• Conversion: scanning; large files, compression, text-as-image file; OCR (OCR isn't perfect!)
• Another conversion option: retyping by hand
• Encoding: ASCII, Unicode, transliteration (ö becomes oe)
• SGML = system to define mark-up specifications; an individual specification is called a document type definition (DTD)
• DTDs relevant to libraries: Text Encoding Initiative (TEI) and Encoded Archival Description (EAD)
• HTML - derivative DTD of SGML that includes formatting as well as structure
• HTML has grown by leaps and bounds: images, tables, frames
• HTML is supposed to be controlled by W3C, but is really controlled by browser developers
• XML - SGML variant - attempt to bridge gap between simplicity of HTML and power of SGML
• Style sheets: CSS, XSL
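A small sketch of the encoding point in the notes above (ASCII vs. Unicode, and transliteration where ö becomes oe). Only the ö → oe mapping comes from the chapter; the other entries in the table follow the usual German convention and are my own additions, and the sample word is just something I picked.

```python
# ö -> oe is from the reading; the other mappings are assumed (common German convention).
TRANSLITERATION = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def transliterate(text):
    """Replace each mapped character with its ASCII spelling; leave the rest alone."""
    return "".join(TRANSLITERATION.get(ch, ch) for ch in text)


word = "Göttingen"

# Unicode (here encoded as UTF-8) can store the ö directly, using two bytes for it.
print(word.encode("utf-8"))                 # b'G\xc3\xb6ttingen'

# Plain ASCII cannot, so the text has to be transliterated first.
print(transliterate(word))                  # Goettingen
print(transliterate(word).encode("ascii"))  # b'Goettingen'
```

The trade-off is that transliteration loses information: going from "oe" back to "ö" can't be done reliably, which is part of why Unicode matters for library text.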

It's been a while since I've done any serious web development and I'm not up-to-date on CSS and XML. That said, some of what's written here seems quite dated - no wonder, since it's nearly ten years old. Storing a file that's a whopping 50,000 bytes (~50KB) isn't nearly as daunting now as it was then. Likewise, CSS has come a long way and HTML has mostly stabilized, though Microsoft can still be counted on to misbehave.

Thursday, September 11, 2008

About Scott

As I promised in class, here's a little about me and what I could contribute to a group for the term project.

My name is Scott Nicolson. I'm in my third semester in MLIS @ Pitt. I hold a BA in History and a BS in Applied Computer Science from California University of Pennsylvania. My liberal arts background includes study of Spanish and German, as well as a lot of history, and I think I'm a decent writer / editor. I can put on a decent presentation, too. As for computing, I'm well-versed in Windows XP (not Vista, though - ugh), Mac OS X, and two flavors of Linux (Debian and Ubuntu). I've also got background in web design and development. I've taken the Library and Archival Preservation course here @ Pitt, too.

I do have issues with time management and anxiety - I'll admit that up front. I also don't have any bright ideas for a term project.

Any takers?

Monday, September 8, 2008

Class: 8 September 2008

Administrative stuff

Readings notes and submission deadline

Reading notes are pre-class, so by week 3 Friday 11:59pm, submit week 4 reading notes in the blog.

No need to send email

Muddiest Points, and submission deadline

Muddiest points are post-class, so by week 3 Friday 11:59pm, submit week 3 muddiest points. If you do not have any, you can just say so.

Term Project

Task:
Propose, plan and develop a prototype digital collection, using Open Source software (e.g., Greenstone, DSpace, Fedora, Ruby on Rails).
Requirements:
Address the need of a group of real users
Include at least three collections and at least one other media format in addition to text (minimally 25 documents total).

European Library Treasures Web Exhibition
-This might be helpful for my History of Ireland course....

About DSpace
-See also DSpace Visual Diagram (PDF)

FEDORA
-Flexible Extensible Digital Object and Repository Architecture

NCSTRL
-Networked Computer Science Technical Reference Library

Open Archives Initiative

FreeLib - Peer to Peer Digital Library Project

Grid computing & digital libraries

Beowulf.org

“The Benefits of Grid Networks--Digital Libraries” By Roy Tennant — March 15, 2005 http://www.libraryjournal.com/article/CA509610.html

Lorcan Dempsey's weblog “WorldCat in your pocket” http://orweblog.oclc.org/archives/000544.html

“Hyperdatabases: An Infrastructure for the Information Space.” http://www.springerlink.com/content/dg2uxvpnj3k3cec2/

Digital Libraries that use Greenstone:

Digital Library that uses DSpace

Week 1 reading notes and reactions

Castelli
1:
-Many different definitions of digital library
-Definitions are colored by perspective of people working on it
-Lack of standardization in digital libraries - lack of interoperability or reusability
-DELOS (EU) establishes principles to fix all this

2: DL vs DLS vs DLMS
DL: the library / organization
DLS: software & architecture through which users access the DL
DLMS: Generic software system that takes care of the nitty-gritty of running a DL
-XDLS: Complete DLMS that can be added onto
-DLS Warehouse: Components that can be combined in a variety of ways to constitute a DLS - like Lego building blocks
-DLS Generator - Parameterized software system - when first set up, manager selects parameters and the DLS is generated

Comment: Standards are good.

Paepcke
NSF launched Digital Library Initiative in 1994, which led to Google, CareMedia, and much more
Librarians and Computer Scientists were on equal footing in DLI until the web came along and shook things up, knocked things towards CS; new developments are swinging things back to more of an equal footing

Comment: This article makes it sound like librarians are useful after all.

Levy
-Public libraries in the USA have always struggled with sense of purpose
-Academic libraries inherit purpose from academic institutions, but finances are a big problem
-Purpose of digital libraries not yet established; much of the talk about it is more a matter of faith than of practice
-More discussions and debates need to be had to find purpose (and thus direction)

Comment: Discussion is good.

Arms
Libraries are expensive - digital libraries may be able to help bring costs down
Much of the cost of libraries is in staffing
Article discusses the possibility that skilled librarian work might eventually be taken over by computers.
"Brute force computing" - utilizing Moore's law to solve things through immense computing power.
Computers still cannot actively seek information - only recognize patterns.
Automatic systems cannot be selective; traditional libraries have to be selective to keep costs down
Automated digital libraries + open access information on the Internet == Ford Model T

Comment: This article reinforces my belief (fear?) that digital libraries are going to destroy the concept of "librarian."

Schwartz
No set definition of "digital library." An LIS class project found 64 different formal / informal definitions of "digital library."
Hybrid library: mix between conventional and digital library
Currently digital librarians are more concerned with coping with the enormous tasks and decisions at hand than with philosophy.

Overall reaction

I think I "get" what digital libraries are, and given my multidisciplinary background, I can see the different points of view on them. I can understand the computer scientist "fix a problem" and "ooh this is neat" viewpoint. I can understand the "information for its own sake" librarian / liberal arts point of view. What really distresses me is all of the discussion about financial issues and whether or not librarians are going to continue to be useful. I think they are useful and things should be changed to bring more money in, but nobody really cares about that - everything has to be done faster and cheaper.

Monday, September 1, 2008

tap tap Is this thing on? tap tap

First post.
My Facebook: sln15 at pitt.edu
My LiveJournal (haven't used lately, but....): spekkiomow.livejournal.com