Organizing Digital Newspapers for Preservation
Organizing digital newspaper content is a process through which an institution assesses, documents, and sometimes refines its file naming and folder usage conventions. As mentioned in the previous section, an institution’s newspaper content is often created and/or acquired by a range of players and over a long span of time. Different collections within an institution’s holdings may conform to different file-naming and folder conventions. Documenting these conventions clearly and/or normalizing these disparate collections by applying a unified schema enables future curators (and users) to retrieve, validate, and, if necessary, reconstitute these collections. As such, organizing digital newspapers is an important step in the preservation readiness process.
This process of organizing digital newspaper content builds upon, and may proceed concurrently with, the inventorying work described in the previous section, as well as some of the additional preservation readiness activities covered later in the Guidelines, specifically:
- Inventorying the amount and location of digital newspaper content an institution is managing;
- Identifying the range of file formats and performing any necessary normalizations or migrations;
- Exporting and consolidating metadata for all collection(s); and
- Producing checksum manifests for this content.
Sound practices for organizing digital news content primarily include the following:
- Rectifying any file-naming conventions that put content at risk of non-renderability;
- Documenting effectively the range of file and folder naming practices and conventions represented in an institution’s collections; and
- Storing this documentation with the content it describes.
Even institutions with low resources and disparate practices will be able to provide a brief summary of each digital news collection’s internal conventions that can help future curators and users understand each collection’s structure for future use and renderability. Institutions with higher resource levels may also analyze and streamline their conventions and practices across digital newspaper collections.
The goal is to arrive at a documented and uniform approach (or small set of approaches) with clearly designated use-cases (e.g. one approach for legacy digitized content, another for recent digitization efforts, and a third for born-digital content) that contains clear guidelines for file naming and folder/sub-folder usage. Collection managers should coordinate with their technical staff members (or those of any external repository service provider) throughout any remediation and convention-setting process so that any change to existing conventions is understood and accounted for in the repository software environments used for access and/or preservation purposes.
Applying and enforcing a set of uniform folder and file-naming conventions can be a time-consuming endeavor if approached on a manual, per-file basis. Some tools can batch rename and even relocate digital files. The best approach is to start by applying these tools to a cleanly copied, representative subset of the overall data. This requires setting aside a workspace where the tools can be installed and the data copied for testing.
Examples of file-naming and relocating tools include:
- Unix-based systems (including Mac OS X) come equipped with simple programs such as mv and sed that can be used to begin the organization process. These programs can be used as standalones or in conjunction with various shell scripts to both batch rename and relocate files.
- A number of renaming tools with graphical user interfaces exist for most major operating systems. Among these, GPRename for Linux and Bulk Rename Utility for Windows offer a good range of features such as using regular expressions to find portions of a file name to replace, appending date information, and adding sequential numbers to files.
- Automator for Mac OS X has a graphical user interface, and despite some workflow limitations, can also assist with batch file renaming.
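As a rough illustration of the batch-renaming approach described above, the following Python sketch separates planning from applying, so that a rename can be reviewed before any file is touched. The function names `plan_renames` and `apply_renames` are hypothetical, and the sketch assumes it is pointed at a copied test subset rather than at original content.

```python
import re
from pathlib import Path


def plan_renames(folder, pattern, replacement):
    """Return (old_path, new_path) pairs without touching any files."""
    plan = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        # Apply the regular expression to the file name only, not the folder.
        new_name = re.sub(pattern, replacement, path.name)
        if new_name != path.name:
            plan.append((path, path.with_name(new_name)))
    return plan


def apply_renames(plan):
    """Rename files, refusing to overwrite anything that already exists."""
    for old_path, new_path in plan:
        if new_path.exists():
            raise FileExistsError(f"{new_path} already exists")
        old_path.rename(new_path)
```

A call such as `plan_renames("workspace/sample", r"\s+", "_")` (with a hypothetical workspace path) would list the files whose names contain whitespace alongside their proposed replacements. Passing that list to `apply_renames` only after review keeps the step deliberate.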
Organizing digital newspaper content can vary across a wide spectrum of practice while still fulfilling the basic goal of providing future curators with the information they need to understand the structure of each digital newspaper collection. This facilitates the curator’s ability to preserve and render content reliably over time.
At the lower end of the preservation readiness spectrum, an institution may focus upon four core tasks:
- Identifying problems in file names that could compromise those files in the future;
- Using basic systems tools to perform batch renaming of these files (starting in a test-bed environment!);
- Documenting institutional conventions in a text document to help future curators understand the collection logic; and
- Updating the digital news inventory to reflect all changes.
At the higher end of the preservation readiness spectrum, institutions may also streamline file-naming and folder usage practices into one or more well-documented and unified convention(s). Institutions may then use batch tools to remediate content according to their chosen convention(s). After completing this remediation work, institutions should always update their inventories.
Organizing digital news content ultimately makes collections intelligible and recoverable in the near and long term. With that in mind, the goal of this activity is to refine and communicate collection structures, file identifications, and other relationships so that curators and preservation partners can care for these collections. Both machine-based and human-based approaches can be taken across the readiness spectrum to achieve these goals.
Case Study: File Naming Conventions
Below are some real-world examples of file name conventions that do a good job of encoding title, issue, date, and other unique identifiers. They include examples from both digitized and born-digital newspaper collections. They are just examples, not standards.
Digitized Newspaper Examples (PDF, TIF, and JP2)
- 051-AAR-1873-09-24-001-SINGLE.pdf (title code/date)
- DCC_19601125-19600101_DLH_217.tif (title code/dates)
- bcheights_20040406_0001.jp2 (title code/date)
Born-Digital Newspaper Examples (e-print and web)
- an970607.pdf (title code/date)
- morning_725_5977.html (Morning Ed/7:25am/May 9, 1977)
Born-digital e-prints and web files often use the same filenames and extensions for both preservation and access copies. Make sure changes to preservation copies adhere to their current access copy filename conventions.
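A documented convention is easiest to check when it can be expressed as a pattern. The sketch below parses names shaped like the bcheights example above. It assumes the fields are a lowercase title code, a YYYYMMDD issue date, and a four-digit sequence number; that reading of the final field, and the function name `parse_page_filename`, are illustrative assumptions rather than a documented rule.

```python
import re
from datetime import date

# Assumed convention: <title>_<YYYYMMDD>_<4-digit sequence>.<extension>
PATTERN = re.compile(
    r"^(?P<title>[a-z]+)_(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})"
    r"_(?P<page>\d{4})\.(?P<ext>[a-z0-9]+)$"
)


def parse_page_filename(name):
    """Return the encoded fields, or None if the name does not conform."""
    match = PATTERN.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    try:
        # date() rejects impossible values such as month 13.
        issue_date = date(int(parts["year"]), int(parts["month"]), int(parts["day"]))
    except ValueError:
        return None
    return {
        "title": parts["title"],
        "date": issue_date,
        "page": int(parts["page"]),
        "extension": parts["ext"],
    }
```

Names that follow a different convention, such as an970607.pdf, return None under this pattern. Each convention in use would need its own documented pattern.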
Case Study: Boston College
Below is an example (not a standard) of a digital newspaper collection organization scheme as deployed by Boston College.
- bcheights (collection title folder)
  - 2004 (annual volume folder)
    - 04 (monthly volume folder)
      - 06 (daily issue folder)
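A title/year/month/day folder scheme like this one can be derived mechanically from an issue's title code and publication date. The following sketch shows one way to do so; the function name `issue_folder` is hypothetical, and the zero-padded folder names are inferred from the example above rather than taken from any Boston College specification.

```python
from datetime import date
from pathlib import PurePosixPath


def issue_folder(title, issue_date):
    """Build a title/year/month/day folder path for one issue."""
    return PurePosixPath(
        title,
        f"{issue_date.year:04d}",   # annual volume folder
        f"{issue_date.month:02d}",  # monthly volume folder
        f"{issue_date.day:02d}",    # daily issue folder
    )
```

For example, `issue_folder("bcheights", date(2004, 4, 6))` yields the path bcheights/2004/04/06.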
As described above, organizing collections for preservation should begin with an analysis of file-naming conventions.
File-naming conventions for digital newspapers should follow established good practices (documented below), including those specific to digital news content. Examining and adjusting filenames prior to preservation action is imperative because many repository systems (both preservation-oriented and access-oriented) may refuse to handle content that does not conform to standard practices. At best, in these cases the files will not render properly. At worst, poorly named files cannot be ingested into and preserved in a repository at all.
General good naming practices include:
- Avoiding the use of special characters in a file name, such as \ / : * ? " < > | [ ] & $ , ;
- Using underscores instead of periods or spaces;
- Avoiding lengthy names;
- Including all necessary descriptive information independent of where it is stored;
- Including dates and formatting them in year-month-day order, consistent with international standards such as ISO 8601 (e.g., YYYY_MM_DD or YYYYMMDD); and
- Including a version number on documents to more easily manage drafts and revisions.
Good practices for applying consistent folder and file naming conventions to digital newspaper content more specifically include:
- Retaining any repository system-defined folder naming conventions if supplied - this can be helpful for restoring collections to those systems at a later date;
- Following a simple title, year, volume, issue, month, day schema for folder and sub-folder conventions;
- Identifying the title in the file name;
- Including the year, month, and day of the issue publication in the file name;
- Including the page or article sequence number in the file name when appropriate;
- Including the corresponding newspaper section name where helpful; and
- Including the correct file extension with each file (e.g., TXT, PDF, TIF, JP2, etc.).
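Several of the practices listed above can be checked automatically before any renaming begins. The sketch below flags a handful of them for a single file name. The length threshold, the list of accepted extensions, and the function name `check_filename` are all assumptions chosen for illustration; an institution would substitute its own documented values.

```python
import re

SPECIAL_CHARACTERS = set('\\/:*?"<>|[]&$,;')
MAX_LENGTH = 64  # hypothetical limit; the guidelines only say "avoid lengthy names"
KNOWN_EXTENSIONS = {"txt", "pdf", "tif", "jp2"}  # assumed list, extend as needed


def check_filename(name):
    """Return a list of naming problems found in one file name."""
    problems = []
    stem, dot, extension = name.rpartition(".")
    if not dot:  # no period at all, so there is no extension
        stem, extension = name, ""
    if any(ch in SPECIAL_CHARACTERS for ch in name):
        problems.append("special character")
    if " " in name:
        problems.append("space")
    if "." in stem:
        problems.append("period in base name")
    if len(name) > MAX_LENGTH:
        problems.append("name too long")
    if extension.lower() not in KNOWN_EXTENSIONS:
        problems.append("missing or unexpected extension")
    # Looks for YYYYMMDD or YYYY_MM_DD anywhere in the base name.
    if not re.search(r"\d{4}_?\d{2}_?\d{2}", stem):
        problems.append("no YYYYMMDD or YYYY_MM_DD date")
    return problems
```

Running this over every name in the inventory gives a quick picture of how much remediation a collection needs. The date check is deliberately loose and will miss conventions that use two-digit years.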
Depending upon the number of digital news files an institution is managing and how many of these are problematic, rectifying filename problems may be done “by hand” (on a small scale) or through the use of software tools. Such tools as those mentioned above allow for batch renaming of files, so that if there is a regular problem (e.g., a space or special character that needs to be replaced collection-wide), this can be dealt with simultaneously across a large number of files. Be sure to thoroughly test tools and batch processing prior to implementation. Wherever possible, create a copy of each collection that needs attention and work with those copies to ensure that accidental damage is not done to the originals as these file-name problems are corrected.
This renaming process, including the tools used, should be documented and this documentation should be included with the collection upon packaging (see Packaging Digital Newspapers for Preservation).
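One way to produce such documentation is to have the renaming step write its own log. The sketch below copies files into a separate working folder under their new names, leaves the originals untouched, and records each old and new name in a CSV file. The function name `copy_and_rename` and the two-column log layout are assumptions, not a prescribed format.

```python
import csv
import shutil
from pathlib import Path


def copy_and_rename(source_dir, work_dir, rename, log_path):
    """Copy files into work_dir under new names and log each change to CSV."""
    source_dir, work_dir = Path(source_dir), Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    with open(log_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["original_name", "new_name"])
        for path in sorted(source_dir.iterdir()):
            if not path.is_file():
                continue
            target = work_dir / rename(path.name)
            if target.exists():
                # Two source names mapped to the same new name; stop and review.
                raise FileExistsError(f"{target} already exists")
            shutil.copy2(path, target)  # copy2 also preserves timestamps
            writer.writerow([path.name, target.name])
```

The `rename` argument is any function that maps an old name to a new one, for example `lambda name: name.replace(" ", "_")`. The log should be written outside the source folder so that it is not itself copied.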
After an institution addresses potential gaps in its file-naming conventions, it can begin analyzing and documenting its overall collection structures, including folder and sub-folder usage. For institutions with limited resources, this may simply mean creating a text-based document that explains the collection structures as they currently exist and what data elements, such as unique identifiers, are vital to preserving the file relationships.
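A starting point for such a text-based document can be generated from the collection itself and then annotated by hand. The following sketch walks a collection folder and produces an indented outline with a file count per folder. The function name `describe_structure` and the output layout are illustrative assumptions.

```python
import os
from pathlib import Path


def describe_structure(root):
    """Return an indented text outline of the folders under root."""
    root = Path(root)
    lines = [f"Collection structure for: {root.name}"]
    for current, folders, files in os.walk(root):
        folders.sort()  # sorting in place makes os.walk visit folders in order
        depth = len(Path(current).relative_to(root).parts)
        label = root.name if depth == 0 else Path(current).name
        lines.append(f"{'  ' * depth}{label}/ files: {len(files)}")
    return "\n".join(lines)
```

The generated outline records only what exists. The explanation of what each folder level means, and which identifiers tie files together, still has to be written by someone who knows the collection.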
For institutions with more resources, devising and implementing a meaningful, consistent set of folder/sub-folder relationships and schemas across all digital news collections will improve the preservation outlook of these collections.
The degree of work involved depends upon an institution’s practices to date. In some cases an institution may only need to review and refine its existing collection structures. In other cases, an institution may need to reorganize its digital news content entirely according to a newly designed set of consistent folder and file-naming conventions. The Case Studies above offer examples of how institutions with extensive experience in managing digital newspaper collections have organized their collections. Institutions are also encouraged to consult the NDNP Technical Guidelines (TechNotes.pdf).
Once a set of uniform collection structures and file-naming conventions has been established, curators and technical staff can work together toward implementation. This process should always begin by experimenting with copies of a subset of the collection and the batch renaming and/or relocating tools that are most appropriate for the institution’s needs. Once the remediation process is tested thoroughly, implementation can begin, again ideally using a copy of the content rather than the originals. Remediation work, including the tools used, should be documented and this documentation should be packaged with the collection (see Packaging Digital Newspapers for Preservation).
Caution: BagIt is discussed later in Checksum Management for Digital Newspapers. If you have previously placed your data into bags, moving or renaming the files will invalidate the bag manifest. In that case, it is advisable either to rebag the data or to make any organization changes before placing your data into bags.
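The reason renaming invalidates a manifest is that a manifest pairs each checksum with a file path. The sketch below is a simplified stand-in for that idea, not the BagIt tooling itself: it records a SHA-256 checksum per relative path and then reports any recorded path that is missing or altered. The function names `build_manifest` and `verify_manifest` are hypothetical.

```python
import hashlib
from pathlib import Path


def build_manifest(data_dir):
    """Map each file's relative path to the SHA-256 checksum of its content."""
    data_dir = Path(data_dir)
    manifest = {}
    for path in sorted(data_dir.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(data_dir).as_posix()] = digest
    return manifest


def verify_manifest(data_dir, manifest):
    """Return the manifest paths that are missing or have changed content."""
    data_dir = Path(data_dir)
    failures = []
    for relative_path, digest in manifest.items():
        path = data_dir / relative_path
        if not path.is_file():
            failures.append(relative_path)  # moved or renamed since recording
        elif hashlib.sha256(path.read_bytes()).hexdigest() != digest:
            failures.append(relative_path)  # content no longer matches
    return failures
```

A renamed file keeps the same content and therefore the same checksum, but verification still fails because the recorded path no longer exists. Rebuilding the manifest after reorganizing, or reorganizing first, avoids the mismatch.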
- Note that syntax can differ between the GNU and BSD versions of programs such as mv and sed. Mac OS X ships with the BSD versions.