Friday 19 November 2010

Surprise Use of Forensic Software in Archives

When I first heard of the use of computer forensics in archives, I was excited and wanted to learn how these law enforcement techniques could help me do a better job of processing digital collections. After learning that people were using computer forensics to copy disk images (i.e. an exact copy of a disk, bit by bit) and to create comprehensive manifests of the electronic files in collections, I was a bit disappointed: software engineers have been using the Unix dd command for many years to copy disk images, and there are already tools (e.g. Karen's Directory Printer) that create comprehensive file manifests.

Data recovery is another feature of forensic software that some people consider useful for archivists and researchers. In my opinion, data recovery may be useful for researchers but not for archivists. Without informed written consent from donors, archivists should NOT recover deleted files at all. Also, in some cases a deleted file doesn't appear as one file but as tens or hundreds of file fragments. When most archivists don't do item-level processing in paper collections due to limited resources, I can't imagine archivists performing sub-item-level processing in digital collections.

Computer forensics in criminal applications usually looks for particular pieces of evidence; organizing all the files on a disk drive is usually not of interest. In archives, we organize all the files on disk drives; looking for particular items is not our duty. Computer forensics may be more useful for researchers when they want to look for particular items. All of this led me to the conclusion that computer forensics may not be very useful for digital archivists.
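For readers who have not seen dd-style imaging, the idea of a bit-by-bit copy can be sketched in a few lines of Perl. This is purely illustrative (the device path and image name are made up), and in practice you would use dd itself or a forensic imager behind a write blocker.

    use strict;
    use warnings;
    use Digest::MD5;

    # Hypothetical source device and target image file.
    my $source = '/dev/sdb';
    my $image  = 'accession_001.dd';

    open my $in,  '<:raw', $source or die "cannot read $source: $!";
    open my $out, '>:raw', $image  or die "cannot write $image: $!";

    # Copy the disk block by block, hashing as we go so the image
    # can later be verified against the original media.
    my $md5 = Digest::MD5->new;
    my $buf;
    while (read($in, $buf, 1024 * 1024)) {
        print {$out} $buf;
        $md5->add($buf);
    }
    close $in;
    close $out;
    print "MD5 of image: ", $md5->hexdigest, "\n";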

However, after attending 2.5 days of training on AccessData FTK (a computer forensic software package), I started to see the potential of using forensic software to process digital archives. I found that the functions (bookmarks, labels) that help investigators organize the evidence they select are equally applicable to organizing a whole collection, and the functions (pattern and full-text search) used to find particular pieces of evidence are equally applicable to searching for restricted materials. I can also use the software to replace a number of programs I have been using to process digital collections. Although 90% of the training was about cracking passwords, searching deleted files, identifying pornographic images, and the like, I found the other 10% of the course worth every cent Stanford spent on it. Of course, the ideal would be a course tailored to the archival community, but unfortunately no such course exists.

I am now using AccessData FTK to replace the following software that I previously used to process digital archives:
Karen's Directory Printer - to create a comprehensive manifest of the electronic files in a collection (a rough Perl equivalent is sketched after this list)
QuickView Plus - to view files in obsolete file formats
Xplorer - to find duplicate files and copy files into folders
DROID, JHOVE - to extract technical metadata: file formats, checksums, creation/modification/access dates
Windows Search 4.0 - to perform full-text searches on files in certain formats (Word, PDF, ASCII)
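As a rough illustration of what the manifest side of this amounts to (not FTK's or DROID's actual output), a short Perl script can walk a directory and record the path, size, modification date, and checksum of every file; the output format here is just an assumption.

    use strict;
    use warnings;
    use File::Find;
    use Digest::MD5;
    use POSIX qw(strftime);

    # Print a tab-separated manifest: path, size in bytes, modification date, MD5.
    my $root = shift @ARGV or die "usage: manifest.pl DIRECTORY\n";

    find(sub {
        return unless -f $_;
        open my $fh, '<:raw', $_ or return;
        my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;
        my @st = stat $_;
        printf "%s\t%d\t%s\t%s\n",
            $File::Find::name,
            $st[7],                                           # size
            strftime('%Y-%m-%d %H:%M:%S', localtime $st[9]),  # mtime
            $md5;
    }, $root);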

I am also using the following FTK functions to process digital archives; I have not found another software package that performs them in a user-friendly way.
Pattern search (to locate files containing restricted information such as Social Security numbers, credit card numbers, etc.; a minimal Perl sketch follows this list)
Assign bookmarks and labels to files (for arranging files into series/subseries, and for recording other administrative and descriptive metadata)
Extract email headers (to, from, subject, date, cc/bcc) from emails written in different email programs, for preparing correspondence listings.
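To make the pattern-search idea concrete, here is a minimal Perl sketch of scanning a directory for likely Social Security and credit card numbers. The regular expressions are simplistic assumptions; FTK's own patterns, and any serious screening (Luhn checks, more number formats, non-text files), go well beyond this.

    use strict;
    use warnings;
    use File::Find;

    # Deliberately simple patterns: 123-45-6789 style SSNs and 16-digit card numbers.
    my $ssn_re  = qr/\b\d{3}-\d{2}-\d{4}\b/;
    my $card_re = qr/\b(?:\d[ -]?){15}\d\b/;

    my $root = shift @ARGV or die "usage: patternscan.pl DIRECTORY\n";

    find(sub {
        return unless -f $_ && -T $_;      # plain text files only
        open my $fh, '<', $_ or return;
        while (my $line = <$fh>) {
            if ($line =~ $ssn_re or $line =~ $card_re) {
                print "$File::Find::name: possible restricted data\n";
                last;                      # one flag per file is enough
            }
        }
        close $fh;
    }, $root);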

The cost of licensing the software seems high. But if you look at the total cost of learning several "free" programs, the lack of support for such software, and the integrated environment you get from using one package, you may find that commercial forensic software is actually cheaper than "free" software.

Tuesday 16 November 2010

Other Highlights from the DLF Fall Forum

A few weeks ago I got the opportunity to attend the Digital Library Federation's Fall Forum in Palo Alto, California. This is the same conference for which Peter previously announced his session on born digital archives. In addition to Peter's session, there were a number of other sessions that were of interest to those working with digital archives.

I attended the working session on curation micro-services, led by Stephen Abrams of the California Digital Library, Delphine Khanna from University of Pennsylvania, and Katherine Kott from Stanford University. (Patricia Hswe from Pennsylvania State University was supposed to be one of the discussion leaders, but she was unable to attend the Fall Forum.) The micro-services approach is a philosophical and technical methodology for the architecture of digital curation environments. This approach values simplicity and modularity, which allows "minimally sufficient" components to be recombined. Furthermore, the strength of curation micro-services is the relative ease by which they can be redesigned and replaced as necessary. The slides from the beginning part of the session can be found here.

There was also a reading session at the DLF Fall Forum on "Reimagining METS." The session's discussion revolved around ideas put forth in a white paper distributed in advance of the conference. The majority (if not all) of the METS Editorial Board facilitated the discussion, which was very high level and incredibly interesting. Much of the discussion seemed to proceed from the assumption that METS actually needed to change. The most interesting idea, and one that seemed to get a fair amount of traction, was whether METS should focus on its strength in packaging and cede some of its functionality to other standards that arguably do those jobs better (e.g., OAI-ORE for structure).

On the last day, I went to the workshop on JHOVE2, the successor project to the JHOVE characterization framework. JHOVE2 has a notably different architecture and an expanded feature set, broadening characterization to include identification, validation, feature extraction, and assessment based on user-defined policies. Additionally, users will be able to define format characterization and validation files for complex digital objects, such as GIS shapefiles. The presenters stated that future development for JHOVE2 will include a GUI to assist in rule set development. From the standpoint of a digital archivist, this tool will be essential to much of the further work that we do.

Wednesday 10 November 2010

Donor survey web form is ready

As part of our on-going work to ingest born-digital materials, we have implemented a donor survey web form. In the evolving AIMS work flow, the donor survey occurs pre-ingest. The data collected in the survey may become part of the Submission Information Package (SIP). This is our first attempt at testing several ideas. We expect to make changes and we invite comments. Our donor survey web site is open to archivists and other people interested in born-digital work flows. See the account request information below.

I realized I could quickly adapt an existing application as a donor survey, if the application were flexible enough. A couple of years ago I created a mini Laboratory Information Management System (LIMS). The programmer can easily modify the fields in the forms, although users cannot add fields ad-hoc. The mini LIMS has its own security and login system, and users are in groups. For the purposes of the AIMS donor survey, one section of the LIMS becomes the “donor info”, another section becomes the “survey”.

Using the list of fields that Liz Gushee, Digital Archivist here at UVa, gathered while working with the other AIMS archivists, I put the donor’s name, archivist name and related fields into the “donor” web form. All the survey questions went into the “survey” web form. The survey will support distinct surveys from the same donor over time to allow multiple accessions.

Our next step will be to map donor survey fields to various standard formats for submission agreements and other types of metadata. While the donor survey data needs to be integrated into the SIP workflow, we haven’t written the transformation and mapping code. We are also lacking a neatly formatted, read-only export of the donor survey data. Our current plan is to use Rubymatica to build a SIP, and that process will include integration of the donor survey data. The eventual product will be one or more Hydra heads. Learn more about Hydra:

https://wiki.duraspace.org/display/hydra/The+Hydra+Project

Everyone is invited to test our beta-release donor survey. Please email Tom twl8n@virginia.edu to request an account. Include the following 3 items in your email:

1) Your name
2) A password that is not used elsewhere for anything important
3) Institutional affiliation or group so I know what group to assign you to, even if your group has only one person.

I'll send you the URL and further instructions.


On the technical side, there were some interesting user interface (UI) and coding issues. Liz suggested comment fields for the survey, and we decided to offer archivists a comment for every question. I used Cascading Style Sheets (CSS) to minimize the size of comment fields so that the web page would not be a visual disaster. If one clicks in a comment, it expands. Click out of it, and it shrinks back down.

The original LIMS never had more than a dozen fields. The donor survey is very long and required some updates to the UI. By using the jQuery function ajax(), I was able to create a submit button that saves the survey without redrawing the web page. CSS was required to make the save buttons remain in a fixed location while the survey questions scrolled.

The mini LIMS relies on a Perl closure to handle creating the web forms and saving the data to the database. Calling functions returned by the closure creates the fields. A closure is similar in spirit to a class in Java or Ruby, and in some respects more flexible. The data structures from the closure are passed to Perl’s HTML::Template module to build the web pages from HTML templates.
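The LIMS source itself is not reproduced here, but the general shape of the pattern (assumed names throughout) is a pair of closures that accumulate field definitions and a template call that turns them into a page:

    use strict;
    use warnings;
    use HTML::Template;

    # make_form() returns two closures sharing @fields: add_field() accumulates
    # field definitions, render() hands them to an HTML::Template loop.
    sub make_form {
        my @fields;
        my $add_field = sub {
            my (%args) = @_;
            push @fields, {
                name  => $args{name},
                label => $args{label},
                value => $args{value} // '',
            };
        };
        my $render = sub {
            my ($template_file) = @_;
            my $tmpl = HTML::Template->new(filename => $template_file);
            $tmpl->param(field_loop => \@fields);
            return $tmpl->output;
        };
        return ($add_field, $render);
    }

    my ($add_field, $render) = make_form();
    $add_field->(name => 'donor_name', label => 'Donor name');
    $add_field->(name => 'archivist',  label => 'Archivist');

    # survey.tmpl is a hypothetical template containing a TMPL_LOOP named field_loop.
    print $render->('survey.tmpl');

Because the two subs close over the same @fields array, they share state the way methods share object attributes, which is the sense in which a closure stands in for a class here.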

The mini LIMS was originally created to use PostgreSQL (aka Postgres), but I’ve converted it to use SQLite. SQLite is a zero-administration database, easily accessed from all major programming languages, and its single binary database file is directly portable across operating systems. Perl’s DBI database connectivity is database agnostic; however, some aspects of the SQL queries are not quite portable. Postgres uses “sequences” for primary keys. Sequences are wonderful, and vastly superior to auto-increment fields. SQLite does not have sequences, so I had to write a bit of code to handle the difference. The calendar date functions also differ between Postgres and SQLite, so once again I had to generalize some of the SQL queries related to dates and times.
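The primary-key difference can be bridged with a small branch on the DBI driver name. This is a hedged sketch with made-up table, column, and sequence names, not the LIMS code:

    use strict;
    use warnings;
    use DBI;

    # $dbh is an existing DBI handle; 'donor', 'id' and 'donor_id_seq' are illustrative.
    sub insert_donor {
        my ($dbh, $name) = @_;

        if ($dbh->{Driver}{Name} eq 'Pg') {
            # Postgres: take the key from a sequence, then insert it explicitly.
            my ($id) = $dbh->selectrow_array("select nextval('donor_id_seq')");
            $dbh->do('insert into donor (id, name) values (?, ?)', undef, $id, $name);
            return $id;
        }
        else {
            # SQLite: let the integer primary key assign itself, then ask for it.
            $dbh->do('insert into donor (name) values (?)', undef, $name);
            return $dbh->last_insert_id(undef, undef, 'donor', 'id');
        }
    }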

There was one odd problem. Postgres and SQLite are both fully transactional. However, because SQLite is an “embedded” database, it cannot handle database locks as elegantly as a client-server database. Normally this is not a problem, and all databases have some degree of locking. In this instance I got a locking error when preparing a second query inside one transaction. Doing a commit after the first query appears to have fixed the problem. I’ve worked with SQLite for several years and never encountered this problem; it could be a bug in the Perl SQLite DBD driver. Using the sqlite_use_immediate_transaction attribute did not solve the problem.
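In code, the workaround looks roughly like this (the table and column names are assumptions); the commit after the first statement releases the lock that otherwise tripped up the second prepare:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=survey.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });

    # First statement of what used to be a single transaction.
    $dbh->do('update survey set answer = ? where id = ?', undef, 'yes', 1);

    # Committing here released the lock; without it, the prepare below
    # raised a locking error.
    $dbh->commit;

    # Second statement, now in its own transaction.
    my $sth = $dbh->prepare('select answer from survey where id = ?');
    $sth->execute(1);
    my ($answer) = $sth->fetchrow_array;
    $dbh->commit;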

The mini LIMS source code is available.