Oct 25 2014

Preservation processing update

DataAccessioner developer Seth Shaw just sent a tool to help with reports analysis. He says he wants to get feedback on a simple report transformation (from .xml to .csv) tool first. After that, he’s going to add a way to aggregate the data from the .csv into size by type of file, etc. within the DataAccessioner.

He’s created a DA-branded version of his XSLTProcessor and named it the DA Metadata Transformer (DA-MT). You can download it @ http://dataaccessioner.org/downloads/da-mt/da-mt.zip

With this tool, you can copy in the XML output of DataAccessioner and receive a .csv file that can be opened in Excel. Once in Excel, you can sort the rows to identify file types and size-per-type.
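To make the XML-to-CSV step concrete, here is a minimal sketch of the same idea in Python. It is not DA-MT itself (which applies XSLT stylesheets), and the report layout below is invented for illustration: the element and attribute names `collection`, `folder`, `file`, `name`, `size`, and `format` are assumptions, not the actual DataAccessioner schema.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical report layout for illustration only.
# This is NOT the real DataAccessioner schema.
SAMPLE_XML = """
<collection name="example-accession">
  <folder name="photos">
    <file name="commencement.jpg" size="2048000" format="JPEG"/>
    <file name="poster.tif" size="10485760" format="TIFF"/>
  </folder>
  <file name="minutes.docx" size="51200" format="Word"/>
</collection>
"""

def xml_to_csv(xml_text):
    """Flatten every <file> element, at any depth, into one CSV row."""
    root = ET.fromstring(xml_text.strip())
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name", "format", "size_bytes"])
    for file_element in root.iter("file"):
        writer.writerow([
            file_element.get("name"),
            file_element.get("format"),
            file_element.get("size"),
        ])
    return out.getvalue()

csv_text = xml_to_csv(SAMPLE_XML)
print(csv_text)
```

The resulting text can be saved with a .csv extension and opened in Excel, which is the same end point the DA-MT gives you without writing any code.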

He wants us to note:
1) Although the download is available, he hasn’t yet created any documentation or linked to it from within the DA website. There’s no firm completion date at this point.
2) The original processor’s code is on GitHub (https://github.com/seth-shaw/XSLTProcessor); however, it retains the original general-purpose text. At the suggestion of some POWRR partners, he changed existing labels on the processor and created a “branded language file” that is included on GitHub, but applying it requires a manual step after building.
3) An example of the general-purpose use is for mass-producing HTML or other versions of finding-aids from EAD. Most EAD transformation tools use the same process as the DA-MT. Your sources are the EAD files and the transforms are the “stylesheets” (xsl or xslt).

Where this all fits in my DP workflow: I use DataAccessioner to capture technical metadata as I move files from transfer media to my as-yet non-bit-level storage device. I use DA-MT to aggregate the file information from XML into something I can understand: file types, quantities and sizes by type. I store the aggregate information in my regular accession files (currently a spreadsheet). My accession information and an Access copy are on a different hard drive from the Master copy and XML. Some day, I will move the accessions with content I think is most at-risk (due to format or other unique attribute) into a bit-checking storage environment.
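For anyone who would rather script the "quantities and sizes by type" step than sort in a spreadsheet, a rough sketch follows. The column names `format` and `size_bytes` and the sample rows are assumptions made for this example; the real DA-MT output may label its columns differently.

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV content. The column names "format" and "size_bytes"
# are assumptions, not the confirmed DA-MT column headings.
SAMPLE_CSV = """name,format,size_bytes
commencement.jpg,JPEG,2048000
convocation.jpg,JPEG,1024000
poster.tif,TIFF,10485760
minutes.docx,Word,51200
"""

def summarize_by_format(csv_text):
    """Return {format: (file_count, total_bytes)} for the rows in csv_text."""
    counts = defaultdict(int)
    sizes = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["format"]] += 1
        sizes[row["format"]] += int(row["size_bytes"])
    return {fmt: (counts[fmt], sizes[fmt]) for fmt in counts}

summary = summarize_by_format(SAMPLE_CSV)
for fmt, (count, total) in sorted(summary.items()):
    print(f"{fmt}: {count} file(s), {total / 1_000_000:.1f} MB")
```

The per-format totals are the numbers that would go into the accession spreadsheet.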

In keeping with the POWRR motto of “good enough DP for real people,” this workflow costs me no money, no technical expertise (beyond downloading Java and two processing files via ZIP) and very little extra time.

With DA, I am capturing all the recommended technical information for use by a back-end preservation system. With DA-MT I can track growth rate of digital content overall, make a case for purchasing better storage, and keep an eye on where all the at-risk file types are in the interim.

Another way to think of this workflow? I know a healthful diet includes a lot of leafy greens. Even though I can never remember the vitamins in each type of vegetable, I know they are there and they are good for me!

So put DA and DA-MT into your workflow for the long term health of your DP program!

Oct 17 2014

For Those About to Preserve….We Salute You!*


I was honored to be able to represent the Digital POWRR project at iPres 2014, the 11th International Conference on Preservation of Digital Objects. iPres was held at the State Library of Victoria in Melbourne, Australia from October 6-10, 2014, and brought together leaders in the field of digital preservation from nations across the globe.

Melbourne was a spectacular host city for the conference, as it is full of friendly people, delicious food, beautiful architecture and scenery, and is home to a unique central business district that features countless narrow alleyways brimming with restaurants, shops, bars, and clubs. When I started researching my travel plans, I discovered one of these alleys had been christened “AC/DC Lane” in 2004 after the famous Australian rock band AC/DC. The photo of the street sign – complete with lightning bolt – made me laugh, as it then dawned on me just how much my POWRR poster had subconsciously been drawing from this iconic band’s imagery and energy. When I introduced my poster at Thursday’s quick-fire poster session, I made sure to mention this association and invited the audience to think of my poster, and the Digital POWRR project overall, as “Digital Preservation Done Dirt Cheap” (riffing on the band’s 1976 hit album and song “Dirty Deeds Done Dirt Cheap”). It remains to be seen if digital preservation can indeed qualify as a “dirty deed,” and I had to stop myself from writing a Weird Al-style parody of the song with my own lyrics relating to preservation topics. In any event, my comparison struck a very positive and happy chord among the participants, and I enjoyed constant attention from, and stimulating discussions with, delegates for the rest of the day. I was thrilled. Thunderstruck, even.


Digital POWRR poster at iPres2014

On the more serious side, my poster was titled “The Digital POWRR Project: Enabling Collaborative Pragmatic Digital Preservation Approaches.” I attempted to summarize the major work that the Digital POWRR team has been involved in over the last three years, including: the process of testing various preservation services and systems, researching and writing our white paper, compiling our (very popular) tool grid and our current collaboration with the COPTR initiative, the creation of specialized advocacy materials, the development of our popular workshops, our investment in the further development of the open source accessioning tool Data Accessioner, and our work on developing collaborative legal frameworks that can be utilized by anyone in the preservation world. The poster also tries to present some concluding thoughts and lessons learned from the experience of working on this particular project. My poster and summary from iPres can be viewed here in their high-resolution glory. I would like to give a big shout out to my friend Daniel M. Kanemoto for providing the graphical direction and styling that subconsciously harnessed the power of POWRR as well as AC/DC. I remember saying “Can you help me do a poster? All I know is that it needs to be full of thunderbolts. Can you do that?”


I was fortunate enough to attend a number of highly stimulating panels, papers, discussions, and workshops during my time at iPres. Among the highlights of my trip were the hands-on workshop for the BitCurator digital forensics software toolkit (led by the enigmatic Cal Lee), a wonderful panel moderated by Paul Wheatley titled “Getting to Digital Preservation Tools that ‘Just Work’” (which was held in an overflowing room and probably could have sustained a discussion for an entire day!), hearing about further developments in the 4C Project (including the spectacular Digital Curation Costs Exchange site), and, certainly, the spectacular gala dinner. The State Library of Victoria was a lovely host venue, combining a stunning building (or series of 23 buildings!) with truly warm and helpful employees. iPres 2015 will be held in Chapel Hill, North Carolina, so keep an eye out for more on that in the coming months!

Oh yes, one other highlight of my trip was getting to see an actual live koala in the wild! I didn’t have much time for sightseeing, but I did manage to take a day long bus tour of the Great Ocean Road. It was a day that I will always cherish. Cheers to my new mates in Australia, and thanks for the wonderful week Down Under.


*As for the title of this post, “For Those About to Preserve….We Salute You”…if you are familiar with the source material at all, then I hope you enjoy these alternative lyrics that just somehow popped into my head. Who knew that AC/DC was so relevant to our field??

“Stand up and be counted for what you are about to receive!

We are the curators,

We’ll give you everything you need!

Hail hail to the chain of custody!

Cuz format migration has got the right of way.

Hey, can you emulate this for me?

We’re not just saving for today.

For those about to preserve, we salute you….

For those about to preserve, we salute you…

We preserve files at dawn on the digital front line..

Like a bolt right outta the blue,

The skies alight with a computer byte,

Checksums will roll and rock tonight!

For those about to preserve….we salute you…”

Aug 27 2014

White Paper Release and POWRR Update

The Digital POWRR Project (Preserving digital Objects With Restricted Resources) is a multi-institutional, IMLS National Leadership Grant project that has been making waves in the field of digital preservation (DP) since its efforts began in 2012. Its focus has been on investigating scalable DP solutions for small and mid-sized institutions that are often faced with small staff sizes, restricted IT infrastructures, and tight budgets. These institutions hold unique digital content important to their region’s cultural heritage, yet many of the practitioners are unsure how to approach the stewardship of the content and are overwhelmed by the large number of DP tools/services available. As the project progressed, the team uncovered the particular challenges, advantages, needs, and desires of under-resourced institutions. They worked to address and overcome obstacles that often prevent practitioners from taking even initial steps in preserving their digital content. POWRR sought to create a well-marked, realistic path towards sustainable digital stewardship for this often overlooked group. For example,
- The team delivered a well-received, graphic-based tool grid that shows, at-a-glance, the functionalities of over 60 DP tools and services and how they fit within an OAIS-based digital curation lifecycle.
- POWRR successfully petitioned select DP-solution vendors for scaled-down and transparent pricing geared towards smaller institutions.
- The team created materials to aid practitioners as they attempt to build awareness around the need for a DP program and advocate for the necessary resources.
- They developed a pragmatic, hands-on workshop to teach the initial steps necessary to accession and inventory digital content as well as how to realistically approach developing a DP program. Recognizing that many of their target institutions currently have little-to-no travel and training budgets, the POWRR team is traveling across the country to conduct these workshops for very little cost to the practitioners.
- Because institutions can achieve economies of scale by working together (not to mention the value of the “we’re all in this together” approach!), POWRR is producing collaboration models and the underlying legal framework often needed for these endeavors…all directed at small and mid-sized institutions.
These are just a selection of the efforts put forth by the POWRR team to guide and empower their peers on the path to digital stewardship. Stay tuned to the POWRR website for further activities and developments!

Aug 13 2014

DuraSpace and Artefactual have joined forces!

The POWRR Team is excited to share this news with those looking for a hosted, soup-to-nuts, digital preservation solution:

DuraSpace and Artefactual have joined forces! Check out the news release at the link below:

DuraSpace and Artefactual Partner to Offer New Hosted Service

May 02 2014

Post-workshop question: How to quantify projected data growth

In response to a question from her supervisor, a recent workshop participant asked how to guesstimate the amount – in numbers – of data she would need to store per week.


An illustration of the “rapid growth rate of data with storage growth trends following” captured from xzbackup.com

My reply was that it’s hard to estimate amounts of new material you might need to store in the future until you decide what you’re preserving. Selection is the unsung hero, in my POV, of any kind of preservation. We simply must decide what we are willing and able to keep. But where to start?

An inventory of what you’re currently responsible for is widely recommended and very helpful. After that, it will be useful to note exactly what in your inventory is most at risk. Data accumulation rates that will matter most to administrators will depend on what you decide you must preserve at full bit-level and what can live happily, for at least some time, in basic offline and (don’t forget this part!) geographically distributed storage locations.

Here’s an example of this decision making principle from my world:
Most of the material that my library has digitized, I’m comfortable NOT assigning to the queue for bit-level preservation. There are only a few objects that I digitized because the media they were on was so out of date that it was inaccessible or the originals were too fragile. These things that were truly digitized to preserve their current state AND intellectual value are a higher bit-level preservation concern for me, but they are also in a minority of my digital holdings right now.

Paper material like yearbooks, honors theses, faculty meeting minutes and the student newspaper are all really useful as searchable digital objects, and I want to protect the investment we made in scanning (some outsourced, some not), but I have all the originals and will not be discarding them so I’m not concerned with moving the digital versions into a preservation system right away. Those things will be just fine in an offline storage location until we decide if we want to pay extra just to protect that initial investment. Some born-digital documents that are worth keeping digitally (due to importance of content AND value of keeping in keyword searchable format), stay that way. But a lot of the messages that are important to keep (e.g., brief meeting records and emails from administrators about policy changes) do not have attributes that make them worth keeping digitally, so I print them.

On the other hand, my campus has over a decade’s worth of digital-only campus photographs, all in jpg, and our new content management system allows individual departments and organizations to post their own photos to their own pages. Those things are unique and being created by people who couldn’t care less about high-res formats because they’re just out there doing their jobs and trying to attract people to our school or showcase their achievements. Additionally, our major events (commencement, several all-campus convocations and colloquia, and our sporting events) are now only being captured live and streamed through a subscription service. I have decided that these born-digital media are more at risk both because of how they are created and because of the lack of consistent metadata they are created with. Therefore, they’re higher up in my queue for prioritizing preservation actions that include enhancing metadata and monitoring format migration needs in an automated environment.

Note that they are also not “library” materials, so they’re going into my this-is-a-common-good-and-therefore-a-shared-cost-responsibility argument ;-)

I use the word “queue” because we have no subscription to an automated bit-level system yet. But by separating things out in this way, I have a smaller amount of somewhat-regularly-added-to types of content I can guesstimate based on past collection practices. I had success in getting on a regular transfer schedule of the streamed media because IT is in charge of monitoring that service, and they are now giving me an annual deposit of everything that was streamed the previous year. I’ve got a tougher chore with people posting things to their own web pages. However, IT is trying to communicate the value of using Flickr accounts to manage these files, and if people do start following that advice, I’ll be able to quantify that data because IT passes out logins to our campus Flickr subscription. That way I’ll get a glimpse of what people are doing and can start educational outreach with those dept/orgs about improving their creation and description practices.
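As a rough illustration of guesstimating from past collection practices, the arithmetic might look like the sketch below. The yearly totals are made-up numbers for a single category of content (say, the annual deposit of streamed media), and a plain average spread over 52 weeks is only one simple way to answer a supervisor’s per-week question.

```python
# Hypothetical totals (in GB) deposited each year for ONE category of
# content. These figures are invented for illustration.
past_deposits_gb = {2011: 40.0, 2012: 55.0, 2013: 65.0}

def estimate_weekly_gb(deposits_by_year):
    """Average the yearly deposits and spread the result over 52 weeks."""
    yearly_average = sum(deposits_by_year.values()) / len(deposits_by_year)
    return yearly_average / 52

def average_yearly_change(deposits_by_year):
    """Mean year-over-year change in GB, a crude trend indicator."""
    years = sorted(deposits_by_year)
    changes = [
        deposits_by_year[later] - deposits_by_year[earlier]
        for earlier, later in zip(years, years[1:])
    ]
    return sum(changes) / len(changes)

weekly = estimate_weekly_gb(past_deposits_gb)
trend = average_yearly_change(past_deposits_gb)
print(f"About {weekly:.2f} GB per week, growing by roughly {trend:.1f} GB per year")
```

Running the same calculation separately for each category in the queue keeps the at-risk material from being averaged together with content that can sit in offline storage.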

Frankly, people who are doing their own thing outside of campus-related programs/services are, in my opinion, on their own. It sounds harsh to say it that way, but I can’t save what I don’t know exists and don’t have access to!

May 02 2014

What can we reasonably accomplish?

In a previous post about acquiring digital content, Stacey mentioned that we often “take it all in, the good shepherds that we are. We build systems and websites that can do nifty things.” Stacey’s post was a cautionary tale, and others have expressed these concerns, too. Now I’ll add my voice to it.

I’m a firm believer in being practical about what I attempt to do and honest with others about what I can’t. In my case, that means I won’t “take it all in,” and for what I do take in, I’m critical about what I will keep in electronic form.

A post I read last year on the Society of American Archivists’ listserv for Lone Arrangers (i.e., people who work in archives and have no other full time staff assigned) touched on this topic. The post was about how lone arrangers were managing email preservation. Many products were mentioned, and then this came from an archivist who also teaches archives management courses:
“The archiving email question comes up all the time, and I have a stock answer. I tell my [X university] preservation students to be bold, if they have to, and keep paper. Yes, paper. Print it out, attachments included, stick it in a folder, and forget about it.

“My motto as an archivist, lone arranger and preservation teacher is, ‘Don’t sign up for the impossible.’ If big institutions are working hard and spending more to sustain their email archives, we little guys ought to be asking ourselves why. That way, we’ll have the answers when the administration comes to us and says to start archiving email.”

I couldn’t agree more! From my POV, it’s all about choosing where you are going to invest your time. I’m lucky to work in a private institution, so I am not subject to all the public records requirements some of my colleagues are, but I think there are larger issues at stake and as a profession I think we do need to start pushing back a bit.

Our users (and the people/agencies that “mandate” things of us) don’t have reasonable expectations when it comes to digital objects. They think these things exist in tangible forms because they can see them before their very eyes, but the underlying code is anything but tangible. And the way objects are created and served by our users makes a lot of what we might capture not really worth saving, according to “best practices” (thinking of those 72dpi jpgs I was sent awhile ago).

For us to capture objects and make them meaningful over time, we have to impress on the people who create them and on the people who choose the systems our users operate in, that standards exist for a reason. A printed piece of paper is not the flashiest use of the latest new technology, but as long as the paper and ink last, and as long as the language/symbols printed on paper hold meaning, it can be conveyed over time!

Jan 29 2014

Digital/Online Materials and their Place in Historical Scholarship

A post by Drew VandeCreek

At the recent meeting of the American Historical Association in Washington, D.C., I made a presentation as part of a discussion session (i.e., not a regular panel – we sat in a circle and talked after very short presentations made by people sitting as part of the circle) exploring digital materials, ranging from blogs and web sites to social media, and the questions that they raise as scholars begin to make use of them as primary sources. Other presenters talked about the future of MOOCs and crowd-sourcing the search for elusive information about a relatively obscure historical figure. I discussed the work of the Digital POWRR project and the challenges presented by the fact that digital objects are generally subject to loss in the relatively short term due to a number of reasons, including hardware and software incompatibility and the degradation of storage media.

One major question that emerged in the discussion was the status of social media materials and other online, digital sources in light of the fact that they are so prone to loss. One presenter at the preceding panel (our discussion group was part of a linked set of two events) described how she had based her work on Pakistani women in part on a web site that no longer existed, apparently because of hacking activities undertaken by parties believing that Pakistani women should not express themselves in this format. The presenter said that she had printed out the site’s pages for her own record and thus could document her use of the source. But this made me wonder about the future practice of history.

So, what of digital sources like blogs, web sites, and social media objects like tweets? Digital objects’ intrinsic frailty and the complex, easily disrupted nature of the internet used to present them make them fundamentally unreliable as primary sources, at least by the standards developed for the use of analog/paper media materials.

It seems to me that although history is certainly not a science in any way, historians are similar to scientists in at least one regard. Much like a scientific discovery can only be accepted and confirmed as other practitioners are able to repeat the experiment and yield the same result, historians are accustomed to being able to lay their hands on a paper source cited in a footnote. Manuscripts are usually unique items, but if one travels to the archive and looks in the box and folder number cited, the item will be there. There may be a very small number of copies of a book, but if one is willing to make the trip to the right library, the book will be there. Historians will of course debate a scholar’s reading of a source, but the existence of the source itself is fundamental to the discipline. If the item is not there, practitioners may rightly begin to ask questions about the legitimacy of a work citing it.

Many of the participants in the AHA discussion emphasized the need to preserve online digital materials as fully as possible. I certainly concur. But a whole host of problems, not the least of which is the considerable expense involved in the curation/preservation of digital materials, make this impossible. We will have to face the fact that a considerable number of online digital objects that future historians may want to use as evidence will simply disappear.

In this situation, several questions occur to me: How will we evaluate work citing online materials that are no longer existent? What if scholars relying on such missing evidence can produce a print-out or other facsimile of the materials? Can we distinguish cases of vanished evidence in which legitimate facsimiles exist from cases of academic fraud?


Dec 30 2013

An e-records’ transfer tale

In mid-December I received my first-ever completely electronic records transfer from a student organization. The group’s faculty advisor attended two of my campus presentations this year and followed up with a request for a one-on-one meeting to talk about their specific kinds of records. Before the end of the semester a student leader of the group sent me an email with attachments of five photographs from their biggest event of the year plus a scanned version of the event poster and a Word document with the names of people pictured, event location and date.

Very exciting!

There were some problems with them not following the naming conventions I recommended, but since I was handling the accession on an item-level basis this wasn’t a big deal. I congratulated the student on being our first fully electronic donation from a registered student organization and thanked her for her efforts. Then I examined the photo files more closely :-(

The images were 72dpi jpgs and when I tried upsampling they became fuzzy.

In a series of follow-up email exchanges, the student told me the faculty advisor had taken the images with an iPhone 5, and the advisor told me he hadn’t made any setting changes and the pictures in the phone were a large enough size. We assume the iPhone compressed them when he sent the images from his phone to the student. Being that close to the end of the semester, we never completed the transfer.

{sigh} Something else to face in the busy time at the beginning of the new year.

Lesson learned: don’t accept emailed pictures at face value. Both of my constituents knew the digital object and metadata requirements I requested and thought they were in compliance, but the transmission was scrambled along the way. The old adage of “trust but verify” is still relevant!
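One possible way to do the “verify” part is a checksum comparison, sketched below. This goes beyond what happened in the transfer described above: it assumes the donor can compute a checksum on the original file and send it along, and the file names and contents here are throwaway stand-ins, not real photographs.

```python
import hashlib
import os
import tempfile

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def transfer_is_intact(received_path, expected_sha256):
    """True only if the received file matches the donor's checksum."""
    return sha256_of(received_path) == expected_sha256.lower()

# Demonstration with throwaway files standing in for a donated photograph
# (hypothetical names and contents).
with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "original.jpg")
    emailed = os.path.join(workdir, "emailed.jpg")
    with open(original, "wb") as handle:
        handle.write(b"full-resolution image bytes")
    with open(emailed, "wb") as handle:
        handle.write(b"smaller image bytes")
    donor_checksum = sha256_of(original)
    original_ok = transfer_is_intact(original, donor_checksum)
    emailed_ok = transfer_is_intact(emailed, donor_checksum)

print("original matches:", original_ok)
print("emailed copy matches:", emailed_ok)
```

A mismatch only says that the file changed somewhere between donor and archive; it does not say how, so a follow-up conversation like the one above would still be needed.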

Dec 11 2013

ANADP II Part 10: Closing Plenary by Adam Farquhar

Trends and Impact

Farquhar’s (British Library) work has 3 strands: BL team, Planets project, Open Planets Foundation

  • High-value tech & practitioner exchange, Sustainability New CEO (Ed Fahey)
  • Dataset/Datasite (BL): Research infrastructure & capacity
  • Digital Scholarship: new services, use digital content in new ways.

On to the meeting:

Intellectual honesty is our frame! (Follow the IIPC example: BE NICE!)

  • Inclusion of service providers/vendors–how do we make them welcome?
  • Structure: some action sessions were more project based
  • Gap: Do we have: a consensus on what it means to align? Single clear global voice + national voice? Right folks to influence national legislation?
  • We have underestimated the amount of work it takes
  • Smaller scale collaboration across national boundaries
  • Improved use, shared maintenance = Big impact
  • It’s super-important to handle the legal stuff
  • Organizational axis!
  • Standards: choose wisely. Standards should follow practice.
  • Few discussions about technical underpinnings here
  • Focus on cost, rather than value
  • Bottlenecks are interesting.
  • Challenge: Can we take 50% off the costs of DP?
  • Education and training
  • Alignment — reducing variation & redundancy
  • Don’t lose the voice of evaluation
  • Interdependence: threat or menace? It must be carefully managed.

Trends!

  • Non print legal deposit & regulations for data management plans
  • Shift to business as usual: operational budgets & teams, capital investment in digital infrastructure
  • Often with external service providers
  • Shift to born-digital
  • Greater scale
  • New usage pattern: from single items to dataset(s) analysis
  • Architecture needs to be constructed for these new use patterns
  • Digital library architectures will feel very 1990s soon.
  • OAIS may need a re-think in light of this use
  • Assume more/everything gets looked at! Implementers will need to think differently
  • Reduced funding, growing market problem: Not only a memory institution. Spreading in importance! Personal Digital Archiving solutions create additional pressure for 30 year access.
  • Open thinking about the role of vendors/service providers: opportunity to drive down costs
  • Shift from project thinking to infrastructure funding. Infrastructure can be invisible: we don’t want to disappear.

What’s next?

  • We can be effective … or cheer
  • Legal: (mostly cheering)
  • What can we learn from RDA? Should we hang our coat on that hook? Join an interest group?
  • Worry about our loss of identity/community? It’s an interesting structure, the working group model
  • Education & Training! The economics of it. Think about cost sinks? Engaging more broadly may cost more money, but it’s worth it.
  • Our message & coherence: we need to communicate our consensus messages
  • SCOPE: we’ve been drawing our boundaries too tight
  • This is a scary problem! We narrow things down & put out boundaries to make it less so.
  • Soup-to-nuts handling needs sorting out.
  • So does selection & access

And with that, we were sent out to try to change the world of DP. :-)

 

Dec 11 2013

ANADP II part 9: Winning Poster Talks, Current Opportunities for Collaboration

Poster sessions at ANADP II were part of a contest. The three winners were Paul Wheatley (COPTR), Cal Lee (UNC), and Neil Grindley.

Each was asked to do a lightning talk, and then we moved into discussing current opportunities for collaboration.

Neil Grindley: 4Cproject.eu Digital Curation as an investment– realizing the value of assets via curation.

Paul Wheatley: Community Owned Digital Preservation Tool Registry: COPTR. [NB: Digital POWRR has contributed all of our Tool Grid information to this project, and we're working to help make the wiki tool more dynamic. Our goal is that the information in the registry gets spit out in a format that resembles our tool grid.]

Comment from Wheatley: We need to build to fail SLOWLY, so that we have time to recover. (Yes, we need to assume that we will fail. ALL technology does so, eventually.)

Cal Lee: Digital Forensics for Digital Preservation (BitCurator)

Current opportunities for collaboration

 Research Data Alliance:

  • Community organization.
  • 21st century science is global; so is the data infrastructure.
  • The internet as model: interactivity, exchange across networks. Community consensus.
  • RDA Colloquium = funding agencies
  • Interest groups propose. Working groups deliver.

International Internet Preservation Consortium (IIPC):

  • Preserving the web of today for 50-100 years later
  • Internet Archive + 46 members
  • Building progress
  • Tool –> Community –> Collection

Educopia / Katherine Skinner

  • Consulting — Events — Research
  • Interdependence means less chance of human failure (i.e. people leaving, etc.)
  • ICONC project–NDSA’s review of the last 3 years
  • SCAPE: Scalable Preservation Environments

Future opportunities for Collaboration

Google Doc of action items from the conference available

Oya Rieger (Cornell U / ArXiv)

  • Alignment –> cohesive values
  • Sustainability is the capacity to endure. It’s not just money, but social & political will…
  • Digital literacies: how do we survive/thrive in digital culture?
  • Assessment & Outcomes: small bites are less overwhelming
  • Study: 20-25% of ejournals collected (NOT published, COLLECTED) are being preserved right now
  • History of dependency on grant funds means we need to place more emphasis on new organizational models, embedding DP in all areas of the library
  • Registry: she worries about registries. They get forgotten. What about 21st c. methods of spreading info? MOOCs?
  • Collaboration: it’s a very demanding process. It requires good interpersonal and teambuilding skills. The research data community is coming together and bringing in partners. How do we get in on that?
  • We want more about use, usability, access, and discovery.
  • Want more about open access and scholarly communications

Jeremy York, Hathi Trust:

  • Issues: Enormity of the work vs. small number of people doing it.
  • Funding infrastructure with grants = bad plan
  • Specialization of function
  • Preservation-in-place vs. offsite
  • Succession of content?
  • Alignment of goals and diverse voices
  • Imagine: participatory stacks across sectors
  • Broad technical and human infrastructure that allows digital preservation to happen among other things
  • Focus on functions across sectors and we’ll get there
  • Infrastructure technical & human IN education curricula (English, History, etc.)
  • Digital literacy AS infrastructure
  • We need to share information about our practices
  • Make educational resources for the public
  • Expose our data

Up next, the closing plenary.
