Friday, October 19, 2012

Digesting Ingest


Alcohol metabolism
Harkiran Dhindsa, and Rioghnach Ahern, Digital Ingest Officers working on the Wellcome Digital Library, describe their experiences using Goobi on a daily basis, and some of the lessons learnt as we scaled up into full production over this past summer:

Goobi is a workflow-based management system that allows us to track and manage the workflows for various digitisation projects, be they archives, books, film or audio files. Many steps are fully automated, including the conversion of TIFF to JPEG2000 and the ingest of content into our repository, Safety Deposit Box.
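
To give a flavour of the kind of conversion step that is automated for us, here is a minimal sketch in Python of a TIFF to JPEG 2000 conversion. It assumes Pillow built with OpenJPEG support, and the folder names are invented for the example; in production the conversion is handled by dedicated software (LuraWave) within the automated workflow, not by a script like this.

    # Minimal sketch: convert every TIFF in a folder to JPEG 2000.
    # Assumes Pillow compiled with OpenJPEG; paths are illustrative only.
    from pathlib import Path
    from PIL import Image

    SOURCE = Path("tiff_masters")      # hypothetical input folder
    TARGET = Path("jp2_derivatives")   # hypothetical output folder
    TARGET.mkdir(exist_ok=True)

    for tiff_path in sorted(SOURCE.glob("*.tif")):
        with Image.open(tiff_path) as im:
            jp2_path = TARGET / (tiff_path.stem + ".jp2")
            # Pillow picks the JPEG 2000 encoder from the .jp2 extension;
            # default encoder settings here, whereas a real profile would be tuned.
            im.save(jp2_path)
        print("converted", tiff_path.name, "->", jp2_path.name)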

We found the user interface of Goobi to be intuitive, and training in basic ingest processes was quick. A number of the team are using this system, and with regular usage we were working efficiently and became familiar with the functionality. METS editing is facilitated through a web form which allows JPEG images of individual pages to be viewed; using such a system eliminates the need to keep separate spreadsheets. Because Goobi tracks the workflow by registering each step, different staff can continue with tasks at any open step. If an error is noticed at any point - for example a missing image in a book - a correction message can be sent back along the workflow to the appropriate person.

Goobi produces METS files, which describe objects including their access and license status. Although Goobi writes the METS files, the structure of an object is created manually, depending on the project. Much of our time is spent working on METS editing, particularly in adding restrictions for material which contains sensitive data. Goobi can handle a number of projects at the same time, so we can easily switch between working on archives and books. It can handle different tasks simultaneously. For example, an ingest officer can let an image upload task run for, let’s say, an archive collection, while continuing to edit METS data on books.
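
To give a flavour of what a sensitivity edit ultimately amounts to in the METS file, here is a rough sketch in Python. The amdSec/rightsMD wrapper elements are standard METS, but the access condition element, its placement and the file name are invented for this example (and the sketch ignores METS element ordering rules); the codes Goobi actually writes for us differ.

    # Rough sketch: record an access restriction in a METS file.
    # Standard METS wrapper elements, but the accessCondition value and its
    # placement are hypothetical; not schema-aware.
    import xml.etree.ElementTree as ET

    METS = "http://www.loc.gov/METS/"
    ET.register_namespace("mets", METS)

    tree = ET.parse("b12345678_mets.xml")          # illustrative file name
    root = tree.getroot()

    amd = ET.SubElement(root, f"{{{METS}}}amdSec")
    rights = ET.SubElement(amd, f"{{{METS}}}rightsMD", ID="RIGHTS-0001")
    wrap = ET.SubElement(rights, f"{{{METS}}}mdWrap", MDTYPE="OTHER")
    data = ET.SubElement(wrap, f"{{{METS}}}xmlData")
    access = ET.SubElement(data, "accessCondition")
    access.text = "restricted"                     # or "closed" for the most sensitive items

    tree.write("b12345678_mets.xml", xml_declaration=True, encoding="utf-8")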

Lessons learnt

As daily users of Goobi, here are some of the lessons we have learnt:

Prior to import into Goobi, catalogued items are photographed and then the digitised images are checked for data sensitivity. In the early stages of the project, areas that could be improved for a more accurate and efficient workflow became obvious. In one of the first archival collections to be digitised, some of the backlog images due to be uploaded into Goobi did not reflect the archive catalogue (CALM), because changes had been made to the catalogue after photography. The lesson learnt from this experience is that photography should only be carried out after cataloguing has been fully completed, so that the arrangement of material is firmly established.

To upload images into Goobi, they are first copied from a working network directory to a temporary drive created by Goobi for the user who has accepted the upload task. This process can be interrupted by other activity if the network connection to the local PC is running at full capacity. When this happens we have to redo the transfer, taking extra time to complete the task. Thankfully, this image upload task will be automated in the future, bypassing the local PCs completely. However, running several tasks simultaneously will still be limited by server capacity when uploading large files.
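
Until the automated upload arrives, a transfer that dies part-way through simply has to be run again. The sketch below shows the sort of checked, re-runnable copy we mean: purely illustrative Python with invented folder names, not the mechanism Goobi itself uses.

    # Illustrative only: copy images to an upload area, skipping files that
    # already arrived intact, so an interrupted transfer can be re-run.
    import shutil
    from pathlib import Path

    SOURCE = Path(r"\\network\working\collection_01")   # hypothetical paths
    TARGET = Path(r"T:\goobi_upload\collection_01")

    TARGET.mkdir(parents=True, exist_ok=True)
    for src in sorted(SOURCE.glob("*.jp2")):
        dst = TARGET / src.name
        if dst.exists() and dst.stat().st_size == src.stat().st_size:
            continue                      # already copied in a previous attempt
        shutil.copy2(src, dst)            # copy the file and its timestamps
        print("copied", src.name)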

After METS editing was completed on one of the archive collections, we were given further sensitivity data. To add these new sensitivity restrictions, we had to “roll back” processes that had already been ingested, thereby re-running part of the workflow. It is very easy to prompt a second ingest into the digital asset management system in the process, resulting in duplicated sets of files, as the roll-back process is less intuitive and not intended for regular use. Again, we have learned an important lesson. It will always be necessary to edit METS files. Changes to the workflow steps in Goobi to make this more straightforward would be useful, but it would be even better to finalise sensitivity lists before METS editing is completed in order to minimise duplication of effort.

A workflow system such as Goobi becomes essential when ingesting large collections of archives and books. Images that have gone through the complete ingestion process in Goobi will be accessed online via the Player. Seeing the images in an attractive interface is a satisfying part of this work, as this is where all of the different tasks come to fruition: the digitised archives and books presented in a user-friendly form, soon to be available for the public to view.

Authors: Harkiran Dhindsa and Rioghnach Ahern

Monday, August 13, 2012

Half a million milestone

We have recently completed processing half a million images in our workflow system (Goobi) ahead of making our new website available to the public later on this year. These images are the product of on-site digitisation of our archives and genetics book collections for the WDL pilot programme.

In previous posts on this blog we have talked about our storage system, server environment, our digital asset management system, and our image "player," but central to all of this is our workflow system, an enhanced and customised version of the open-source software Goobi developed originally at the University of Göttingen, and supported by Intranda.

Goobi is an extremely flexible database system that allows us to create and modify workflows (series of tasks both manual and automatic) for specific projects. These tasks can be recorded in Goobi (such as "Image capture"), done by Goobi (such as "Image conversion JPEG" for viewing images in the Goobi system itself), or can be initiated by Goobi (such as "SDB ingest" which triggers our digital asset management system - SDB - to ingest content).
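
As a way of picturing the three kinds of task, here is a toy sketch in Python of a workflow as a list of steps, each marked as manual (merely recorded), automatic (done by Goobi) or triggered (initiated by Goobi but carried out elsewhere). This is our own illustration, not Goobi's internal data model.

    # Toy model of a workflow: each step is manual, automatic or triggered.
    # Purely illustrative; not how Goobi is implemented.
    from dataclasses import dataclass

    @dataclass
    class Step:
        name: str
        kind: str       # "manual", "automatic" or "triggered"
        done: bool = False

    workflow = [
        Step("Image capture", "manual"),
        Step("Image conversion JPEG", "automatic"),
        Step("SDB ingest", "triggered"),
    ]

    def next_open_step(steps):
        """Return the first step that has not been completed yet."""
        for step in steps:
            if not step.done:
                return step
        return None

    workflow[0].done = True
    print(next_open_step(workflow).name)   # -> "Image conversion JPEG"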

We are currently working on ingesting images for three different workflows, described and simplified below.

Archive backlog
This workflow allows us to ingest all the images we created from archives since the project began in December 2009 up to May 2012 - around 320,000 of them - from eight collections. So far, we have finished processing around 250,000 images from this set.

The steps for each item to be ingested are as follows (automatic steps are in italics); a short sketch of the metadata import (steps 1 and 2) follows the list:

  1. Export MARC XML as a batch from the Library catalogue (per collection)
  2. Import metadata as a batch into Goobi
  3. Import JPEG2000 images one folder at a time from temporary storage to Goobi-managed directories
  4. Goobi converts the JPEG2000s to JPEGs for viewing in the Goobi interface
  5. Check that images are correctly associated with the metadata
  6. Add access control "code" to items with sensitive material (restricted or closed)
  7. Trigger SDB ingest, passing along key information to enable this
  8. Import administrative/technical metadata from SDB after ingest
  9. Export METS files to be used by the "player"
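
To illustrate what the metadata import involves, here is a minimal sketch that reads a batch of MARC XML exported from the catalogue and pulls out the record identifier and title for each item. The field and subfield choices (001 for the control number, 245 $a for the title) are standard MARC, but the file name is invented and the real import is handled by Goobi itself.

    # Minimal sketch: read a batch MARC XML export and list record IDs and titles.
    # File name is illustrative; the real import into Goobi is automated.
    import xml.etree.ElementTree as ET

    MARC_NS = "{http://www.loc.gov/MARC21/slim}"

    tree = ET.parse("archive_collection_export.xml")
    for record in tree.iter(MARC_NS + "record"):
        control_number = None
        title = None
        for field in record.iter(MARC_NS + "controlfield"):
            if field.get("tag") == "001":
                control_number = field.text
        for field in record.iter(MARC_NS + "datafield"):
            if field.get("tag") == "245":
                for sub in field.iter(MARC_NS + "subfield"):
                    if sub.get("code") == "a":
                        title = sub.text
        print(control_number, "-", title)
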
Archives digitisation
This workflow deals with the current digitisation by tracking and supporting activities from the very beginning of the digitisation process. So far, we have imported very few images for this project, having just started using this workflow in earnest. The steps include:
  1. Export MARC XML as a batch from the Library catalogue (per collection)
  2. Import metadata as a batch into Goobi
  3. Group metadata into "batches" in Goobi for each archive box (usually 5 - 10 folders or "items")
  4. Track the preparation status at the box level (in process/completed) and record information for the next stage (Image capture)
  5. Track photography status at the box level and record information for the next stage (QA)
  6. Track QA at the box level and return any items to photography if re-work is required
  7. Import TIFF images via Lightroom (which converts RAW files and exports TIFF files directly into Goobi)
  8. Convert TIFFs to JPEGs
  9. Check that images are correctly associated with the metadata 
  10. Add access control "code" to items with sensitive material (restricted or closed)
  11. Convert TIFFs to JPEG2000
  12. See 7-9 above
Genetics books
The other half of our ingest effort has focused on the Genetics books. We have imported into Goobi over 250,000 images from this collection since digitisation began in February of this year. This workflow is very similar to Archives digitisation, containing steps related to the entire end-to-end process. The main differences are that, as the digitisation is being done by an on-site contractor, images are delivered to us as TIFFs; and while there are no sensitivity issues, there is metadata editing to add structure to aid navigation, and a range of "conditions of use" codes depending on the restrictions copyright holders ask us to apply. A short sketch of the structural metadata and page numbering steps follows the list below.

  1. Export MARC XML as a batch from the Library catalogue (whole collection)
  2. Import metadata as a batch into Goobi
  3. Track preparation status at the book level and record information for next stage (Image capture)
  4. Track image capture status at the book level
  5. Track QA status (QA is done on the TIFFs supplied by the contractor)
  6. Import TIFF images one folder at a time from temporary storage to Goobi-managed directories
  7. Convert TIFFs to JPEGs
  8. Check that images are correctly associated with the metadata 
  9. Associate images with structural metadata (covers, titlepage, table of contents) thereby enabling navigation to these elements in the "player"
  10. Add page numbering 
  11. Add licence code to books that have use restrictions (such as no full download allowed) as per requests by copyright holders
  12. As above
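
To give an idea of the structural metadata and page numbering work (steps 9 and 10), here is a toy sketch of how page images might be labelled and key structural points recorded. The data structures are ours for illustration only; Goobi records this information in the METS file.

    # Toy sketch of pagination and structural metadata for one book.
    # The dictionaries are illustrative; Goobi stores this in METS.
    pages = ["0001.jp2", "0002.jp2", "0003.jp2", "0004.jp2", "0005.jp2"]

    # Page labels: cover and title page are unnumbered, then arabic numbers.
    labels = {}
    for index, image in enumerate(pages):
        if index == 0:
            labels[image] = "front cover"
        elif index == 1:
            labels[image] = "title page"
        else:
            labels[image] = str(index - 1)       # printed page number

    # Structural points used for navigation in the "player".
    structure = {
        "Cover": pages[0],
        "Title Page": pages[1],
        "Table of Contents": pages[2],
    }

    for image in pages:
        print(image, "->", labels[image])
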
We haven't got it all figured out yet
Other workflows we have not yet put into production include born-digital materials, A/V materials, items with multiple copies/volumes or "parts" (such as a video and its transcript), and manuscripts. We are also looking at implementing new or different functionality in Goobi in the near future, including JPEG 2000 validation using the jpylyzer tool, automated import of images, and configuring existing functionality in Goobi to support OCR and METS-ALTO files, to name a few. These changes are aimed at minimising manual interaction with the material to save time and improve accuracy.
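
As a preview of the JPEG 2000 validation step, here is a rough sketch of calling jpylyzer from Python and reading its verdict. It assumes the jpylyzer command-line tool is on the PATH and that its XML report contains an isValid element, which the sketch looks for without assuming a particular namespace; the file name is illustrative.

    # Rough sketch: validate a JPEG 2000 file with the jpylyzer command-line tool.
    # Assumes jpylyzer is installed; the file name is illustrative.
    import subprocess
    import xml.etree.ElementTree as ET

    result = subprocess.run(
        ["jpylyzer", "0001.jp2"], capture_output=True, text=True, check=True
    )
    report = ET.fromstring(result.stdout)

    # Find the isValid element without assuming a particular namespace.
    verdict = next(
        (el.text for el in report.iter() if el.tag.endswith("isValid")), None
    )
    print("0001.jp2 valid:", verdict)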

Wednesday, July 4, 2012

OCRing typescript: A benchmarking test with PRImA

During our pilot phase, we plan to add over 1 million images of archives to the Wellcome Library website, all created during the 20th century. Much of this is comprised of typescript: letters, drafts of papers and articles, research notes, memos, and so on. Key to improving discoverability of these papers is OCR (optical character recognition), which allows us to encode the words as text and include them in a full-text index. We set out to test whether typescript material would provide accurate OCR results that could be included in our index.

OCR technology

OCR works by segmenting a block of text to the individual character level and then comparing the patterns to a known set of characters in a wide range of typefaces. The accuracy of character recognition relies on the source information having clear-cut, and common, letter forms. The accuracy of word recognition relies on both character recognition, and the availability of comprehensive dictionaries for comparison. OCR software can compare these words to dictionaries to enhance word recognition, and also to estimate accuracy rates (or levels of confidence).

When it comes to good quality, clearly printed text, OCR can be extremely accurate even without any human intervention - with rates of 99% or higher for modern printed content (less than one word out of one hundred words having at least one inaccurate character), reducing as you recede in time to 95% for 1900 - 1950 printed material, and lower for 19th century material (see Tanner, Munoz and Ros, 2009). For some formats, OCR is worse still - as the document mentioned above shows, 19th century newspapers may only reach 70% significant word accuracy (words not including "stop" words such as definite/indefinite articles, and other non-search terms).
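
To make the accuracy figures concrete, here is a simplified Python sketch of a significant word accuracy calculation: compare OCR output to a ground-truth transcription, ignoring a small list of stop words. The real benchmarking methodology is more sophisticated than this, and the stop-word list is just an example.

    # Simplified sketch of significant word accuracy (not the PRImA methodology).
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}   # example list only

    def significant_words(text):
        words = [w.strip(".,;:()\"'").lower() for w in text.split()]
        return [w for w in words if w and w not in STOP_WORDS]

    def word_accuracy(ground_truth, ocr_output):
        truth = Counter(significant_words(ground_truth))
        ocr = Counter(significant_words(ocr_output))
        matched = sum((truth & ocr).values())      # significant words the OCR got right
        return matched / max(sum(truth.values()), 1)

    print(word_accuracy("Structure of the DNA molecule", "Structurc of the DNA molecule"))
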

Typescript testing

Our archival collections contain a wide range of content that is theoretically OCR'able. Some will OCR very well, such as professionally printed matter, but much of the non-handwritten content is, given the age of this material, typescript. We had no idea how well this type of content would OCR. To find out, we commissioned Apostolos Antonacopoulos and Stefan Pletschacher, based at the University of Salford and members of PRImA (Pattern Recognition and Image Analysis Research), to do a benchmarking exercise from which we could determine whether we could rely on raw OCR outputs, whether we should not OCR this type of material at all, or whether we should test various methods to improve OCR'ability (such as post-processing of particular images).


Apostolos and Stefan chose a selection of 20 documents from a larger sample we provided, originating from our Mourant and Crick digitised collections. These 20 documents were manually transcribed using the Aletheia ground-truthing tool for comparison to the output of three OCR engines: Abbyy FineReader Engines 9 and 10, and the open-source OCR software Tesseract.
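
Of the three engines, Tesseract is the one anyone can freely experiment with, so here is a minimal sketch of running it from Python via the pytesseract wrapper and scoring the result against a transcription, using a word_accuracy function like the one sketched above. It assumes Tesseract, pytesseract and Pillow are installed; the file names are illustrative.

    # Minimal sketch: OCR one typescript image with Tesseract and score it.
    # Assumes the tesseract binary and the pytesseract/Pillow packages are installed.
    import pytesseract
    from PIL import Image

    image = Image.open("crick_typescript_page.tif")       # illustrative file name
    ocr_text = pytesseract.image_to_string(image)

    ground_truth = open("crick_typescript_page.txt", encoding="utf-8").read()
    print("significant word accuracy:", word_accuracy(ground_truth, ocr_text))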

The results of the OCR benchmarking test show that original, good quality typescript content can reach up to 97% significant word accuracy with Abbyy FineReader Engine 10 (such as this example below):


At the bottom end, carbon copies with fuzzy ink can result in virtually 0% accuracy in any OCR software:



What was pleasantly surprising was how the average-quality and poorer content fared. On the better end of the scale we have 93% accuracy despite some broken characters:


And here we have poorer quality typescript producing 72% accuracy with many faint and broken characters:


The average rate for 16 images of good to poorer quality typescript is 83% significant word accuracy (excluding the carbon copies).

Accuracy levels are reported here according to the results from Abbyy FineReader Engine 10 on the "typescript" setting. The reported accuracy rates cover all the visible text on the page, including letterheads, pre-printed text such as contact details, text overwritten by manual annotations, and so on. Naturally, errors are more likely to occur in these areas, which (except in the case of text overwritten by manual annotations) are not of much significance in terms of indexing and discoverability. Further tests would be required to determine what the accuracy rate is with these areas excluded. For example, it may be possible to digitally remove the annotations in this draft version of Francis Crick and James Watson's "A Structure for DNA" to raise the overall accuracy rate (currently only 45%):



There is some variation between FineReader 9 and 10 where one or the other may have a small advantage with a few cases showing as much as a 30% difference. Overall, there is only 1% difference when looking at averages between 9 and 10. Tesseract, on the other hand, was far less accurate especially for the poorer quality typescript (roughly half as accurate overall).

There are a few things we could do to improve accuracy: 
  1. Incorporate medical dictionaries to improve recognition and confidence of scientific terms
  2. Enhance images to "remove" any annotations prior to OCR'ing (a rough sketch of this kind of pre-processing follows this list)
  3. Develop a workflow that would divert images down different paths depending on content ("typescript" path, FineReader 9 or 10, enhancements to be applied or not applied, etc.)
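
On the second point, a very rough sketch of the sort of pre-processing we have in mind is below: convert to greyscale, boost contrast and threshold the image to suppress faint marks before OCR. Whether this actually helps, and how annotations would be removed in practice, is exactly what further testing would need to show; the file names are illustrative.

    # Very rough sketch of image enhancement before OCR: greyscale, contrast
    # boost, then a simple threshold. Illustrative only; real annotation removal
    # would need something more selective than a global threshold.
    from PIL import Image, ImageEnhance

    image = Image.open("typescript_page.tif").convert("L")        # greyscale
    image = ImageEnhance.Contrast(image).enhance(2.0)             # stronger contrast
    image = image.point(lambda value: 255 if value > 160 else 0)  # crude binarisation
    image.save("typescript_page_enhanced.tif")
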
We may find that an average of 83% word accuracy overall is perfectly adequate for our needs in terms of indexing and allowing people to discover content efficiently. Further investigation is required, but this report has given us a good foundation from which to press on with our OCR'ing plans.


These digital collections are not yet available online, but will be accessible from autumn 2012.

Tuesday, May 15, 2012

Serving servers: a technical infrastructure plan


As we aim to provide a fast, efficient and robust technical architecture for the Wellcome Digital Library, the Wellcome Trust IT department has been working closely with our software suppliers to specify a suitable server architecture. This work is still in progress, but we now have the skeleton idea of how many servers we are likely to need and for what purposes. The scale of the architecture requirements shows that setting up and delivering digital content is a significant undertaking.

In order to serve up millions of images, plus thousands of A/V files, born digital content and the web applications that make them accessible, we believe we’ll need around 17 (virtual) servers for the production environment (the “live” services), and a further 10 servers for our staging and development environments. In the production environment, nearly every server is duplicated to ensure redundancy and a smooth delivery service, which is why the numbers are so high. The content management system coupled with its SQL database requires four servers, for example. The image delivery environment needs six servers for data delivery, on-the-fly image conversion and tile creation, and for media proxy servers that create digital content URLs, divorcing the user-request mechanism from the actual content held on our servers for security reasons.

Most of the servers run on Windows 2008, although our image server (IIPImage) will run on Linux Ubuntu. The virtual servers share CPUs, but the number of CPUs available means that each server gets the equivalent of either 2 or 4 CPUs, leading to a total requirement of 48 CPUs (288 cores, as each CPU has 6 cores). RAM varies from 2 GB to 8 GB depending on the anticipated usage of a particular application on that server. The total RAM requirement for the production architecture is estimated at 124 GB. These specifications are currently our best guess, and will be tested in the weeks to come as we start to deploy the hardware.

The staging environment allows system upgrades, patches or new development work to be applied and tested separately from the live production environment. This means that any changes can be tested thoroughly before they are made publicly visible and/or usable. Actual development work is carried out in the development environment, before deployment for final testing on the staging servers. This means that applications such as the web content management system and the delivery system applications must be replicated in these two additional environments, along with their server requirements.

With thanks to David Martin, IT Project Manager, as the source of my information.

Friday, May 11, 2012

Developing a player for the Wellcome Digital Library

Previous posts here have covered the digitisation of books and archives and the storage of the resulting files (mostly JPEG2000 images, but some video and audio too). Now it’s time to figure out how visitors to the Wellcome Library site actually view these materials via a web browser.

The digitisation workflow ends with various files being saved to different Library back-end systems:

  • The METS file is a single XML document that describes the structure of the book or archive, providing metadata such as title and access conditions. 
  • Each page of the book (or image of an archive) is stored as a JPEG2000 file in the Library’s asset management system, Safety Deposit Box (SDB). Each image file in SDB has a unique filename (in fact a GUID), and this is referenced in the METS file. So given the METS file and access to the asset management system, we could retrieve the correct JPEG 2000 images in the correct order (a minimal sketch of reading this ordering from the METS file follows this list). 
  • Additional files might be created, such as METS-ALTO files containing information about the positions of individual words on a digitised page; we’ll want to use this information to highlight search results within the text. 
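
As that minimal sketch, the snippet below walks a METS file and lists the image file references in the order they appear in the fileSec. The fileSec/file/FLocat elements and the xlink namespace are standard METS, but real Wellcome METS files have more layers than this (and the page order properly comes from the structMap), so treat it as a simplification.

    # Minimal sketch: list the image file references from a METS fileSec,
    # assuming they appear in page order (a simplification of real METS).
    import xml.etree.ElementTree as ET

    METS = "{http://www.loc.gov/METS/}"
    XLINK = "{http://www.w3.org/1999/xlink}"

    tree = ET.parse("b12345678_mets.xml")             # illustrative file name
    for file_el in tree.iter(METS + "file"):
        locat = file_el.find(METS + "FLocat")
        if locat is not None:
            # The href would be the SDB identifier (GUID) for the JPEG 2000 file.
            print(file_el.get("ID"), locat.get(XLINK + "href"))
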
So how do we use these files to allow a site visitor to read a book?

Rendering JPEG 2000 files

Our first problem is that we can’t just serve up a JPEG2000 image to a web browser – the format is not supported. And even if it was, the archival JPEG2000 files are large: several megabytes each. The solution to this problem is familiar from services like Google Maps – we break the raw image up into web-friendly tiles and use them at different resolutions (zoom levels). When you use Google Maps, you can keep dragging the map around to explore pretty much anywhere on Earth – but your browser didn’t load one single enormous map of the world. Instead, the map is delivered to you as 256x256 pixel image files called tiles, and your browser only makes requests for those tiles that are needed to show the area of the map visible in your browser’s viewport. Each tile is quite small and hence very quick to download – here’s a Google map tile that shows the Wellcome Library:

http://mt1.google.com/vt/lyrs=m@176000000&hl=en&src=app&x=65487&s=&y=43573&z=17&s=Ga

Google Maps is a complex JavaScript application that causes your browser to load the right tiles at the right time (and in the right place). This keeps the user experience slick. We need that kind of user experience to view the pages of books.

There are several JavaScript libraries available that solve the difficult problem of handling the viewport and generating the correct tile requests in response to user pan and zoom activity. We’ve settled on Seadragon, because we really like the way it zooms smoothly (via alpha blending as you move from one zoom level’s tiles to another). A very nice existing example of this is at the Cambridge Digital Library’s Newton Papers project:

http://cudl.lib.cam.ac.uk/view/PR-ADV-B-00039-00001/

This site uses a viewer built around Seadragon; an individual tile looks like this:

http://cudl.lib.cam.ac.uk/content/images/PR-ADV-B-00039-00001-000-00105_files/11/3_2.jpg

The numbers on the end indicate that this jpeg tile is for zoom level 11, column 3, row 2. As you explore the image, your browser makes dozens, even hundreds of individual tile requests like this. It feels fast because each individual tile is tiny and downloads in no time.
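
For the curious, the arithmetic behind those level/column/row numbers is simple. The sketch below computes, for a Deep Zoom style pyramid, how many zoom levels an image needs and how many tiles make up each level; the 256-pixel tile size is an assumption for the example (Deep Zoom also allows other sizes and a small overlap between tiles).

    # Sketch of Deep Zoom style tile geometry: levels halve in size down to 1 px,
    # and each level is cut into fixed-size tiles (256 px assumed here).
    import math

    def tile_pyramid(width, height, tile_size=256):
        max_level = math.ceil(math.log2(max(width, height)))
        for level in range(max_level, -1, -1):
            scale = 2 ** (max_level - level)
            w = math.ceil(width / scale)
            h = math.ceil(height / scale)
            cols = math.ceil(w / tile_size)
            rows = math.ceil(h / tile_size)
            print(f"level {level}: {w}x{h} px, {cols}x{rows} tiles")

    tile_pyramid(3000, 4000)    # a typical page-sized master image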

For more about tiled zoomable images, these blog posts are an excellent introduction:

So how do we get from a single JPEG2000 image to hundreds (or even thousands) of JPG tiles? It’s possible to prepare your image tiles in advance, so that you process the source image once and store folders of prepared tiles on your web server. For small collections of images this is a simple way to go and doesn’t require anything special on the server. But for the Library, it’s not practical – we don’t want to prepare tiles as part of the digitisation workflow. They are not “archival”, and they take up a lot of extra storage space. We need something that can generate tiles on the fly from the source image, given the tile requests coming from the browser.

For this we need an Image Server, and we’ve chosen IIPImage for its performance and native Seadragon (Deep Zoom) support. The Image Server generates browser-friendly JPEG images from regions of the source image at particular zoom levels. When your browser makes a request to the image server for a particular tile, the image server extracts the required region from the source JPEG 2000 file and serves it up to you as an ordinary JPEG.
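
Conceptually, what the image server does for each tile request boils down to something like the sketch below: work out which region of the source image the requested tile covers at the requested level, extract it and return a small JPEG. This is a drastic simplification in Python with Pillow, with made-up parameters; IIPImage does this far more efficiently by decoding only the needed region of the JPEG 2000 file.

    # Drastically simplified "image server": crop one tile from a source image.
    # Illustrative only: this opens and scales the whole image, which would be
    # far too slow in production; real servers decode only the needed region.
    import math
    from PIL import Image

    def render_tile(source_path, level, col, row, max_level, tile_size=256):
        image = Image.open(source_path)
        scale = 2 ** (max_level - level)
        # Scale the whole image down to this zoom level, then crop the tile.
        level_w = math.ceil(image.width / scale)
        level_h = math.ceil(image.height / scale)
        scaled = image.resize((level_w, level_h))
        left, top = col * tile_size, row * tile_size
        box = (left, top, min(left + tile_size, level_w), min(top + tile_size, level_h))
        return scaled.crop(box)

    tile = render_tile("page.jp2", level=11, col=3, row=2, max_level=12)
    tile.convert("RGB").save("tile_11_3_2.jpg", quality=85)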

Viewer or Player? Or Reader? 

The next piece of the puzzle is the browser application that makes the requests to the server. A book or archive is a sequence of images along with a lot of other metadata. And it’s not just books – the Library also has video and audio content. All of these are described in detail by METS files produced during the digitisation/ingest workflow. In the world of tile-based imaging, the term “viewer” is often used to describe the browser component of the system, but we seem to have fallen naturally to using the term “Player” to describe it – it plays books, videos and audio, so “Player” it is. Our player needs to be given quite a lot of data to know what to play.

We could just expose the METS file directly, but it is large and complex and much of it is not required in the Player. So we’re developing an intermediate data format, which effectively acts as the public API of the Library. Given a Library catalogue number, the player requests a chunk of data from the server; this tells it everything it needs to know to play the work, in a much simpler format than the METS file. In the future other systems could make use of this API (at the moment it’s exposed as JSON).
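
Purely as an illustration of the idea (the real API is still being designed and its field names will differ), the simplified description handed to the player might look something like the sketch below; every field name here is invented.

    # Hypothetical shape of the simplified data the player receives.
    # Field names are invented for illustration; the real API may differ.
    import json

    manifest = {
        "catalogueNumber": "b123456",
        "title": "Example digitised book",
        "accessCondition": "open",
        "pages": [
            {"label": "front cover", "imageId": "guid-0001"},
            {"label": "1", "imageId": "guid-0002"},
            {"label": "2", "imageId": "guid-0003"},
        ],
    }

    print(json.dumps(manifest, indent=2))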

The user experience 

The user won’t just be viewing a sequence of images, like a slide show. It should be a pleasant experience to read a book from cover to cover. Many users will be using a tablet, reading pages in portrait aspect ratio. We aim to make this a good e-reading experience too, augmented by search and navigation tools.

The user experience might start with a search result from the Library’s main search tool. For books that have been digitised, the results page will provide an additional link directly to the player “playing” the digitised book. The URL of the book is an important part of the user experience, and we want to keep it simple. In future, library.wellcome.ac.uk/player/b123456 would be the URL of the work with catalogue reference number b123456; that URL would take you straight to the player.

We want to be able to link directly to a particular page of a particular book, just as a printed citation could. This deeper URL would be /player/b123456#/35. But we can do better than that; our URL structure should extend to describe the precise region of a page, so that one reader could line up a particular section of text on a page, or a picture, and send the URL to another reader; the second reader would see the work open at the same page, and zoomed in on the same detail.

Access Control 

Much of the material being made available is still subject to copyright. Those works that are cleared for online publication by the Trust’s copyright clearance strategy still need some degree of access control applied to them; typically the user will be required to register before viewing them. This represents a significant architectural challenge, because we need to enforce access restrictions down to the level of individual tile requests. We don’t want anyone “scraping” protected content by making requests for the tiles directly, bypassing the player.
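
One common way of preventing that kind of direct tile scraping, offered here only as an illustration rather than as the Library's chosen design, is to sign each tile URL against the user's session so that requests made outside an authenticated player session can be rejected. A minimal sketch:

    # Illustrative only: HMAC-signed tile URLs so bare tile requests can be rejected.
    # Not the Library's actual access control design.
    import hashlib
    import hmac

    SECRET_KEY = b"server-side secret"      # hypothetical; never sent to the browser

    def sign_tile(path, session_id):
        message = f"{path}|{session_id}".encode("utf-8")
        return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()

    def is_request_allowed(path, session_id, signature):
        expected = sign_tile(path, session_id)
        return hmac.compare_digest(expected, signature)

    token = sign_tile("/tiles/b123456/11/3_2.jpg", "session-abc")
    print(is_request_allowed("/tiles/b123456/11/3_2.jpg", "session-abc", token))  # True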

Performance and Scale 

As well as the technical challenges involved in building the Player, we need to ensure that content is served to the player quickly. Ultimately the system will need to scale to serve millions of different book pages. Between the player and the back end files is a significant middle tier: the Digital Delivery System, of which the Player is a client. This layer is the Library’s API for Digital Delivery. The browser-based player interacts with it to retrieve data to display a book, highlight search results, generate navigation and so on. The Image Server is a key component of this system.

This post was written by Tom Crane, Lead Developer at Digirati, working with his colleagues on developing digital library solutions for the Wellcome Digital Library.

Monday, April 30, 2012

Will more data lead to different histories being told?

New technology is making information more widely available and, when it launches later this year, the WDL will make it easier to access historical evidence about the foundations of modern genetics.  Will this democratize our understanding of the history of genetics and lead to different versions of the history being told?

There is an African proverb which says that history is written by the hunter, not the lion. History inevitably simplifies the past and the selection process can be subjective. When it launches, the WDL will start to put 21 archive collections and around 2,000 books online. The project is to digitise as much as we can rather than cherry-pick the highlights. This means that the building blocks used by historians to piece together the past will be made freely available to a wider audience. A lot of this material may seem like mundane, workaday stuff. Users will have to wade through a lot of material to reach the bits they are interested in, but this is probably a more accurate reflection of the scientific research process.

Flashes of genius are essential but they do not happen in isolation. Thomas Edison’s phrase about invention being 1% inspiration and 99% perspiration applies to scientific research too. The discovery process needs both.

Watson and Crick were extremely clever to work out the helical structure of DNA but they did not get there simply because they were lone geniuses. Before they made their discovery a lot of people had spent years experimenting, writing and thinking about DNA. There had even been flashes of insight which ended up being wrong. I recently read a letter from Gerald Oster sent to Aaron Klug after Rosalind Franklin’s death, in which he recalled his time working in London. He reflects that even though he had much of the relevant information by early 1950, he lacked the insight to work out the structure of DNA. This letter (FRKN/06/07/001-2) is held by the Churchill Archives Centre in Cambridge and a digitised version will become part of the WDL.

I am rather hoping that the WDL might help us to recognise that while flashes of inspiration are part of scientific discovery they are only possible because a team of other people paved the way.

Thursday, April 19, 2012

Clearing copyright for books: preliminary ARROW results

As part of the genetics books project, we are tackling issues of copyright clearance and due diligence head on. Up to 90% of this collection is in copyright, or is likely to be in copyright, so developing a copyright clearance strategy was one of our earliest considerations. This turned into a useful project to test-run the EC-funded ARROW system on a large scale. ARROW provides a workflow for libraries and other content repositories to determine whether books are in-commerce, in copyright, and whether the copyright holders can be identified and traced. This system has undergone small tests throughout Europe, including the UK (using collections and metadata from the British Library), but in order to determine whether ARROW is feasible on a large scale, a realistic large-scale project was needed.

The Wellcome's genetics books project provided this opportunity, and the challenge was taken up by the ALCS and the PLS jointly, as announced previously on our Library Blog. Results from ARROW, combined with the responses from contacted rights holders, determine whether the Wellcome Library will publish a work online.

The collection of (roughly) 1,700 potentially in-copyright books is not enormous, but it is diverse, and has already thrown up some interesting wrinkles in the copyright clearance workflow.

For example, according to the AACR2 standard used to catalogue these books, only up to three authors are included in the metadata record (followed by "et al."). Works with more than three authors, and collected works such as conference proceedings, had to be manually consulted in order to identify all the named contributors. This inflated the known number of contributors to nearly 7,000 (4 authors on average per book).

Embedded below is a presentation I gave at the London Book Fair earlier this week, which provides an overview of the process, and preliminary statistics from the first 500 books to complete the ARROW workflow.

Monday, April 2, 2012

Learning lessons on the Genetics Books digitisation project

A key component of the theme of our digitisation pilot programme - "Foundations of Modern Genetics" - is a set of printed textbooks and secondary sources published between 1850 and 1990 that shed light on the development of genetic and genomic research. The total collection identified is around 2,000 books. The goal is to digitise these texts in full, and make them freely available online via the Wellcome Digital Library (we are of course dealing with copyright clearance).

Digitisation of books often looks and sounds straightforward. It is not always straightforward of course - but the new book scanners on the market these days do make it quick. There are standard ways of book scanning - you put the book on a cradle, and either turn the pages (by hand), or use a "robotic" contraption that turns the pages automatically. You can use scanning technology, or one-shot dSLR cameras; panes of glass to hold the pages down, or small grips on the outer margins of the pages. The choice depends on the physical nature of the books and how quickly you want to digitise. Even when outsourcing it is useful to understand how book scanning really works. Our Genetics Books digitisation project - a pilot project - is giving us this opportunity.

We commissioned local digitisation company Bespoke Archive Digitisation to carry out the digitisation work for this pilot project. As the digitisation is carried out on site, we have been involved to some extent in all aspects of the digitisation, including the setup and use of new types of equipment, the QA process involved in book digitisation, and the workflow of image conversion and delivery. As we have never carried out high-throughput book digitisation at the Wellcome Library before, this has been a huge learning curve for us, allowing us to gain knowledge that will come in very useful in the future with new (and hopefully larger) projects.


Bespoke Archive Digitisation uses a robotic book scanner and a manual book scanning unit (for books that are not robust enough for the robotic scanner, are outsized, etc.). Both of these "scanners" use Canon 5D Mark II cameras, two per unit, to capture each page of an opening simultaneously. The robotic book scanner is the latest version from Kirtas, the Kabis III. Richard Keenan, owner of Bespoke Archive Digitisation, explains: "this unit has a number of time-saving features such as "fluffers," a “snubber,” and a self-adjusting book cradle which moves to keep the book at the correct angle to be photographed. This is accomplished through various sensors and lasers, which monitor the book throughout imaging to keep it in the correct position, but must also be monitored by the operator."

A key lesson, according to Richard, is that "although all robotic book scanners include a published throughput (2,890 pages per hour for this particular unit), it is important to understand that the published throughputs do NOT mean that you can do 2,890 pages per hour, hour after hour without stopping. Each book must be set up on the cradle, the cameras may need some adjustment/focusing, and page turning does require manual intervention, every time, to ensure the pages are flat, and to prevent page curvature and glare (especially on sealed paper).

"Also, it is very important to remember that this is just the image capture stage, the pages then have to be batch processed, edited and rigorously quality assessed which can take the same, or more time than imaging. Depending on the book's structure - page thickness, binding type, size of the book etc - you will find that speeds vary considerably, a realistic estimate of throughput over a significant period of time is approximately 1000 pages per hour, but this can be much lower with some books.

"Although these figures differ by a large margin from those published, the Kabis III from Kirtas is still probably the fastest way to digitize books, and the important thing is that the quality of output produced is excellent if operated correctly. The on board editing software 'Book Scan Editor' is very handy, offering the usual cropping, image adjustment and sharpening options, but also deskewing and xml conversion and even OCR. I would say that another thing to bear in mind here, is that there is a large learning curve with this technology, so for anyone thinking of using one - particularly those who have no experience with robotic book scanners - plan plenty of time in the project for training and testing periods."

Friday, March 23, 2012

Three Geneticists from the University of Glasgow

The Wellcome Digital Library isn’t just about collections held in the Wellcome Library. We are working with a number of other organisations that hold material on the history of modern genetics. One of the contributing partners, the University of Glasgow Archives Service, has just started to digitise the archives of three men who worked at the University’s Department of Genetics - Guido Pontecorvo (1907-1999), James (Jim) Harrison Renwick (1926-1994) and Malcolm Ferguson-Smith (1931 – ).

You can see photos of their brand new digitising suite here.

Monday, March 12, 2012

Wellcome Digital Library update

The Wellcome Digital Library pilot has been underway for 18 months with 6 months to go before we launch the new Library website. This will provide access to a wide range of digital content related to the Foundations of Genetics theme. All of the work done so far has been behind the scenes: digitising content, procuring and developing our digital library systems, and designing a new website. We are looking forward to displaying the product of all this work to the public - but we're not quite there yet!

So where are we now and what will we be doing in 2012? Here is a snapshot of progress so far. Further details on some of these projects can be found on this blog, and we will continue to explain our activities in more detail in future posts.

Digitisation
  • Archives: With our in-house team of two photographers, we have digitised around 380,000 pages from the collections of Crick, Mourant, Medawar, Sanger, Wyatt, Grueneberg, and the Blood Group Unit. We have just started the Eugenics Society collection, which will carry on throughout the spring and summer.
  • Genetics Books: This project has just begun, with up to 2,000 books to be digitised this spring by an external supplier, Bespoke Archive Digitisation, working on-site. 
  • MOH reports: A successful JISC funding bid meant we could add the Greater London Medical Officer of Health reports to the pilot project. Conservation is underway, and digitisation will begin in a few months' time.
  • ProQuest: We have partnered with ProQuest to digitise our pre-1700 printed books for Early European Books online, with over 1,000 books now digitised and around 13,000 to go. Those with subscriptions and anyone in the UK can view our first 400 books on the EEB website with more to come shortly (search for "Wellcome").
  • External content: We have had the first delivery from one of our external partners, Cold Spring Harbor Laboratory, including correspondence from the James Watson archive. This adds around 50,000 images to our digital archive collections, with more to come throughout 2012 and early 2013 from all partners.
  • Copyright and sensitivity: Hand in hand with digitisation, we are assessing our content for sensitivity and copyright issues where necessary. Sensitive items (containing certain types of private information as defined by the Data Protection Act) are identified and flagged as unsuitable for online dissemination. Copyright clearance of in-copyright works is underway with the help of the Authors' Licensing and Collecting Society and the Publishers' Licensing Society.

Systems development
  • Digital Asset Management & Storage: Safety Deposit Box 4.1, our digital asset management system, was extended to provide extra functionality for large sets of digital assets in 2011. This system is now in production. Our storage system, Pillar, now includes a Write Once Read Many (WORM) backup drive to ensure that our files are secure in the long term.
  • Workflow system: We procured Goobi (Intranda Version) with bespoke modifications in 2011 to act as a workflow system, enabling us to track project progress, and to automate a number of activities (including ingest of content into Safety Deposit Box). This has recently been put into use in production, particularly for the Genetics books digitisation project. Soon we will be using Goobi for all digitisation projects, and to ingest our backlog of images.
  • JPEG 2000: We now archive all our images in the JPEG 2000 (Part 1) format, and have an automated batching process set up with LuraWave. Soon, we will be implementing JPEG 2000 validation as part of this process to ensure all JPEG 2000s meet the correct standards before ingest.
  • Digital delivery: A new digital delivery system is currently under development that will interoperate with Safety Deposit Box and our new website content management system, Alterian CM7. We have commissioned CM7 developers Digirati to carry out this development, which will be completed at the end of the summer. So far they have produced a proof of concept system that demonstrates an end-to-end sequence from retrieval of images from Safety Deposit Box using METS files created by Goobi, to displaying images online. They are adapting Seadragon, the MS viewer used by several other digital libraries, to meet our specific needs and design criteria.
  • Search and discovery: We are also making changes to our single search system, Encore. This work is looking at providing better representation of archival metadata in Encore, and also options for incorporating a full-text index. The purpose is to provide access to all Library content - the catalogues as well as the digitised materials - via a single interface.

New website and user experience
  • User experience-led design: Last year the Library brought on board external suppliers Clearleft - user experience and web design experts - to help redesign the information architecture and visual appearance of the new website. New designs are already visible on the internal web development environment, so further user testing of a real website can soon be done.
  • Transferring content: The Library has carried out a full content audit of the current website, and prioritised content to carry across to the new site. The current site contains over 2,000 pages; this will be considerably reduced. The content carried across to the new site will be thoroughly edited to ensure it is up-to-date and consistent with the new site "style".
  • Creating new content: New content will also be created once the content management system is in action, with a focus on the Foundations of Modern Genetics. This is a major part of the Library's aim to provide interpretative content to both researchers and the "curious public".

Monday, March 5, 2012

Filling the MOH Gaps

The Wellcome Library has a great collection of Medical Officer of Health (MOH) Reports. These reports are stuffed full of grim and useful information from the 19th and 20th centuries, such as statistics on infant mortality. JISC is, very wisely, funding the digitisation of the London reports. There are some gaps in the Wellcome Library’s collection so, in order to make a really useful digital resource, we have been working out what is missing. This has not been straightforward.

First we needed to check what we held and then we needed to make sure that our gaps really were gaps. We didn’t want to waste time looking for reports that were never created. The very first MOH report in Britain was produced in Liverpool in 1847. The first London report was produced in the following year but the early reports do not cover the whole of London. The Public Health Act of 1848 permitted local authorities to employ MOHs but, since it was not obligatory, only a minority did. The Metropolis Management Act of 1855 required MOHs to be appointed in central London but the big change came with the 1875 Public Health Act. From then until 1972 the production of MOH reports was pretty solid.

Another challenge was getting to grips with the boundary changes. Over the years the administrative boundaries of London have altered several times. The current 32 London Borough boundaries date from 1965 when Greater London was established. Before that there were 28 metropolitan boroughs plus various boroughs, urban and rural district councils in what is now outer London. Before 1899 much of what we now think of as London was part of Kent, Middlesex, Essex or Surrey. The City of London has long gone its own distinctive way and the tangle of parish boundaries there is particularly confusing. Old maps and a book on administrative units by Frederick A. Youngs helped us to make sense of all these changes.

We have decided to start from the centre and try to create as complete a record as possible for the 12 inner London Boroughs. We’ve got to the stage where we have a list of reports that we want to find. The next step is to track them down in other collections and ask if we can get them digitised.

Watson and Crick Letters

I've just had the privilege of reading a fantastic series of letters written in 1954 by James Watson and Francis Crick. They were written a year after they published their seminal article on the structure of DNA. In the letters the two men are exchanging ideas and their excitement shines through. They write about all sorts of things, for example, the importance of building space filling three dimensional models, confusion over how thymine fits into the helical structure and what the researchers at KCL are up to. In March 1954 Watson also expresses his frustration with the research process, “The whole thing is puzzling and paradoxical (for could DNA be wrong) and is slowly driving me to despair and to loath nucleic acids.” (PP/CRI/D/2/45)

I got to read them because last month the first batch of digitised material arrived from Cold Spring Harbor Laboratory in New York, one of the five external organisations contributing digitized material to the WDL pilot project. The James Watson archive is held at Cold Spring Harbor and contains the letters written to him by Francis Crick. The letters Watson wrote to Crick are held by the Wellcome Library.

Later this year, when the WDL is launched, Watson and Crick’s correspondence will be digitally united. Lots of people will be able to read these letters (and lots of other stuff) online while the originals stay safely tucked away in their archival homes. I am excited about that!

Tuesday, February 21, 2012

The Medical Officer of Health reports project begins.

I recently started working on the Medical Officer of Health digitisation project.


Spine with methylcellulose applied
Since the beginning of December 2011, I have been spending the majority of my time in the conservation studio. I have been carrying out disbinding, cleaning and rehousing of the late 19th century Medical Officer of Health (MOH) reports that are bound by year. This is so that the digitisers can scan or photograph them for the project.

The MOH reports in our collection are shelved in three different sequences: Main, London and Provincial. However, all three sequences contain reports for London areas. The Main and Provincial sequences are bound by geographical area and are generally in good condition, but the London sequence reports are bound by year and tend to be in a poorer state due to their heavy use. There are about 80 bound volumes in the London sequence and they are all being disbound. Eventually we will be able to house all of the London reports together by geographical area.


Lining coming off
In order to take the bound reports apart, I first need to remove the cloth case and spine linings whilst keeping the pamphlets intact. I use a 4% methylcellulose solution to break down the binding animal glue. A major challenge with this is that each volume’s binding breaks down at a different rate so it requires constant checking to avoid damaging the paper. On average it takes a couple of hours just to remove the spine linings. We recently purchased a pink portable clothes steamer to try to speed up the process. I haven’t tried this yet but we are hopeful that this will work faster.



The aftermath of removal
Another important part of the project is creating separate bibliographic records for each report.  We have decided to catalogue these as monographs in order to improve searching and allow users to find reports by fields such as geographic area, Medical Officer's name and date of the report.  

Whilst I am going through the reports, I have been finding some very interesting snippets. I will include some as I continue to blog, so look out for them!