Post-workshop question: How to quantify projected data growth

In response to a question from her supervisor, a recent workshop participant asked how to guesstimate the amount – in numbers – of data she would need to store per week.

An illustration of the “rapid growth rate of data with storage growth trends following” captured from xzbackup.com

My reply was that it’s hard to estimate how much new material you might need to store in the future until you decide what you’re preserving. Selection is, in my view, the unsung hero of any kind of preservation. We simply must decide what we are willing and able to keep. But where to start?

An inventory of what you’re currently responsible for is widely recommended and very helpful. After that, it will be useful to note exactly what in your inventory is most at risk. Data accumulation rates that will matter most to administrators will depend on what you decide you must preserve at full bit-level and what can live happily, for at least some time, in basic offline and (don’t forget this part!) geographically distributed storage locations.
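If your files already live on a drive or share you can reach, a first-pass inventory can be scripted. Here is a minimal sketch in Python, assuming everything sits under one folder; the function name and the choice to group by file extension are illustrative only, not a standard tool or method.

```python
import os
from collections import defaultdict


def inventory_by_extension(root):
    """Return {extension: (file_count, total_bytes)} for every file under root."""
    totals = defaultdict(lambda: [0, 0])
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # skip files that vanished or cannot be read
            # Lower-case the extension so ".JPG" and ".jpg" are counted together.
            ext = os.path.splitext(name)[1].lower() or "(none)"
            totals[ext][0] += 1
            totals[ext][1] += size
    return {ext: (count, size) for ext, (count, size) in totals.items()}
```

Calling `inventory_by_extension` on a folder (the path is whatever yours happens to be) gives a count and byte total per file type, which is a rough starting point for deciding what is most at risk. It says nothing about content or value; that part is still a selection decision.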

Here’s an example of this decision-making principle from my world:
Most of the material that my library has digitized, I’m comfortable NOT assigning to the queue for bit-level preservation. There are only a few objects that I digitized because the media they were on were so out of date that they were inaccessible, or because the originals were too fragile. These things that were truly digitized to preserve their current state AND intellectual value are a higher bit-level preservation concern for me, but they are also a minority of my digital holdings right now.

Paper materials like yearbooks, honors theses, faculty meeting minutes and the student newspaper are all really useful as searchable digital objects, and I want to protect the investment we made in scanning (some outsourced, some not), but I have all the originals and will not be discarding them, so I’m not concerned with moving the digital versions into a preservation system right away. Those things will be just fine in an offline storage location until we decide if we want to pay extra just to protect that initial investment. Some born-digital documents that are worth keeping digitally (because of both the importance of the content AND the value of keeping it in a keyword-searchable format) stay that way. But a lot of the messages that are important to keep (e.g., brief meeting records and emails from administrators about policy changes) do not have attributes that make them worth keeping digitally, so I print them.

On the other hand, my campus has over a decade’s worth of digital-only campus photographs, all in JPG, and our new content management system allows individual departments and organizations to post their own photos to their own pages. Those things are unique and being created by people who couldn’t care less about high-res formats because they’re just out there doing their jobs and trying to attract people to our school or showcase their achievements. Additionally, our major events (commencement, several all-campus convocations and colloquia, and our sporting events) are now only being captured live and streamed through a subscription service. I have decided that these born-digital media are more at risk, both because of how they are created and because of the lack of consistent metadata they are being created with. Therefore, they’re higher up in my queue for prioritizing preservation actions that include enhancing metadata and monitoring format migration needs in an automated environment.

Note that they are also not “library” materials, so they’re going into my this-is-a-common-good-and-therefore-a-shared-cost-responsibility argument 😉

I use the word “queue” because we have no subscription to an automated bit-level system yet. But by separating things out in this way, I have a smaller amount of somewhat-regularly-added-to types of content that I can guesstimate based on past collection practices. I had success in getting on a regular transfer schedule for the streamed media because IT is in charge of monitoring that service, and they are now giving me an annual deposit of everything that was streamed the previous year. I’ve got a tougher chore with people posting things to their own web pages. However, IT is trying to communicate the value of using Flickr accounts to manage these files, and if people do start following that advice, I’ll be able to quantify that data because IT passes out logins to our campus Flickr subscription. That way I’ll get a glimpse of what people are doing and can start educational outreach with those departments and organizations about improving their creation and description practices.
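Once a content type arrives on a regular schedule, like those annual deposits, the guesstimate becomes simple arithmetic. The sketch below assumes you have written down the size of each past annual deposit in gigabytes. The function names are hypothetical, and averaging past years is only one crude way to project forward, so treat the output as a conversation starter for administrators rather than a forecast.

```python
def weekly_estimate_gb(annual_deposits_gb, weeks_per_year=52):
    """Average the past annual deposit sizes and spread that over a year of weeks."""
    if not annual_deposits_gb:
        raise ValueError("need at least one year of deposit sizes")
    mean_annual = sum(annual_deposits_gb) / len(annual_deposits_gb)
    return mean_annual / weeks_per_year


def projected_annual_gb(annual_deposits_gb, years_ahead=1):
    """Project a future annual deposit using the mean year-over-year growth ratio.

    Assumes deposits are listed oldest first and that growth keeps roughly
    the same pace, which may well not hold.
    """
    if len(annual_deposits_gb) < 2:
        raise ValueError("need at least two years to measure growth")
    ratios = [
        later / earlier
        for earlier, later in zip(annual_deposits_gb, annual_deposits_gb[1:])
    ]
    mean_ratio = sum(ratios) / len(ratios)
    return annual_deposits_gb[-1] * mean_ratio ** years_ahead
```

With made-up numbers, two past deposits of 520 GB each work out to `weekly_estimate_gb([520, 520])` = 10.0 GB per week. If deposits have been growing, `projected_annual_gb` gives a rough figure for next year that you can divide by 52 in the same way.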

Frankly, people who are doing their own thing outside of campus-related programs/services are, in my opinion, on their own. It sounds harsh to say it that way, but I can’t save what I don’t know exists and don’t have access to!