Seton Hall University Libraries: Web Archiving Policy

SHU Libraries University Web Archive Collecting Policy

Section 1. Overview, Mission, Scope

A. Overview

The Society of American Archivists defines web archives as “preserved copies of live web content collected for permanent retention…” (Society of American Archivists 2017). Web archiving is vital because so much of today’s communication is born-digital. However, the very nature of the web places much of this content at risk: web sites are abandoned, links break, and campus departments restructure. Many organizations find that capturing a web site at a specific point in time, when it is at the peak of its use, is a useful way to preserve university history. Seton Hall University Libraries has implemented a system to actively collect and preserve SHU’s history, digital publications, records, and web presence.

B. Mission Statement

The mission of the Digital Preservation Group at SHU Libraries is to support the mission of The Msgr. William Noé Field Archives and Special Collections Center, which “exists to identify, collect, preserve and serve as repository for records of enduring value created by, for and about both Seton Hall University and the Archdiocese of Newark as well as Catholic New Jersey.”

C. User Groups

The primary audience for the Seton Hall University Web Archive will be SHU personnel and researchers.

D. Collection Theme

The Web Archive is a repository for records pertaining to the history of Seton Hall University. As such, it collects web pages of historic value.

F. Web Archive curators

For questions, please contact digitalcollections@shu.edu.

Section 2. Selection

A. Criteria

Archivists select content for the Seton Hall University Web Archive based on the Preservation Policy of the Digital Collections Group. Please see https://library.shu.edu/library/preservation-policy for more information.

Inclusion criteria:

Websites (URLs) containing official SHU information will be considered for the Web Archiving program if they are hosted on university web servers, or if they are hosted by private companies and contain only university information rather than a combination of public and private information. However, some private websites may be considered for capture if they contain significant university information and/or assist in the formation of university policy.

The collections include:

SHU Advancement

SHU Athletics

SHU Human Resources

SHU Law

SHU Law Library

SHU University Libraries

SHU Web site

B. Resource Constraints

There are two resource constraints: 1) the cost associated with storage space and 2) the maximum storage space available. Judicious use of harvesting scope will result in more captures in the same amount of space.

Section 3. Acquisition

A. Capture Scope

The initial crawl of web sites will be based on a seed list, a list of URLs developed by the Digital Collections Group. Embedded images and style sheets will be crawled. In some cases, web pages that include confidential information, or information that the content provider does not want archived, will not be crawled (robot exclusion). All jobs will default to the "max-link-hops = 25" scope setting. This means that the crawler will follow links from the URL(s) entered for the site and continue gathering files until it gets 25 links away from the starting point. This should provide a thorough capture of most sites.
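The hop limit described above can be pictured as a breadth-first traversal that stops following links once a page is the maximum number of links away from a seed. The following is a minimal sketch of that idea only, not the crawler's actual implementation; the function name, the in-memory link graph, and the example.edu URLs are hypothetical, and the limit is set to 2 so the effect is visible on a small graph.

```python
from collections import deque

def crawl_within_hops(seeds, links, max_link_hops=25):
    """Return {url: hops} for every URL within max_link_hops links of a seed."""
    hops = {url: 0 for url in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if hops[url] == max_link_hops:
            continue  # at the hop limit: keep the page, do not follow its links
        for target in links.get(url, []):
            if target not in hops:  # first visit is the shortest path in BFS
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops

# Hypothetical link graph: page -> pages it links to.
links = {
    "https://example.edu/": ["https://example.edu/a"],
    "https://example.edu/a": ["https://example.edu/b"],
    "https://example.edu/b": ["https://example.edu/c"],
}
captured = crawl_within_hops(["https://example.edu/"], links, max_link_hops=2)
```

With a limit of 2, the page two links from the seed is captured, but the page it links to is not.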

B. Frequency of Capture

The frequency of web site capture varies from site to site and depends on how often a particular website is updated and on its current relevance. Websites that change frequently, such as news sites, are crawled more often, in sync with the frequency of changes made to the site. The Digital Collections Group (DCG) plans crawls on a one-time, weekly, quarterly, semiannual, or yearly basis.
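As an illustration of how the crawl frequencies named above might translate into a schedule, the sketch below computes the next crawl date from the last one. The function name and the day counts per interval are assumptions made for this example; the policy itself does not specify them.

```python
from datetime import date, timedelta

# Approximate interval in days for each recurring crawl frequency.
# These day counts are illustrative assumptions, not values from the policy.
CRAWL_INTERVAL_DAYS = {
    "weekly": 7,
    "quarterly": 91,
    "semiannually": 182,
    "yearly": 365,
}

def next_crawl(last_crawl, frequency):
    """Return the next crawl date, or None for a one-time crawl."""
    if frequency == "one-time":
        return None  # a one-time crawl is never repeated
    return last_crawl + timedelta(days=CRAWL_INTERVAL_DAYS[frequency])

print(next_crawl(date(2024, 1, 1), "weekly"))  # 2024-01-08
```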

C. Material Types & Formats

Efforts will be made to ensure the “look and feel” of the original web site is maintained to the greatest possible extent. To this end, various file formats, such as HTML, PDF, Office documents, images, audio, video, and compressed files, will be part of the harvest. Static, dynamic, and interactive pages will also be captured.

D. Rights

Educational use only, no other permissions given. SHU Digital Preservation Group does not assert ownership rights over the intellectual property of the contents included in the web archive collection. All rights of ownership remain with the owner(s) identified on the Website for the full term of copyright. SHU Digital Preservation Group is not involved in the creation of the Websites and has no oversight for the contents of these collected Websites. The SHU Digital Preservation Group assumes no responsibility for the accuracy or lawfulness of the collected Websites or the contents within. These captured websites are provided here for educational purposes. They may not be reproduced or distributed in any format without written permission of the owner(s).

Section 4. Descriptive Metadata

The Archives will create descriptive metadata using an adaptation of Dublin Core’s 15 standard fields. The adaptation was proposed by OCLC’s Research Library Partnership Web Archiving Metadata Working Group. The elements, with brief descriptions, are listed below. For more detail, see https://www.oclc.org/content/dam/research/publications/2018/oclcresearch-wam-recommendations.pdf

COLLECTOR Definition:

The organization responsible for curation and stewardship of an archived website or collection. (DC Contributor)

CONTRIBUTOR Definition:

An organization or person secondarily responsible for the content of an archived website or collection. (DC Contributor)

CREATOR Definition:

An organization or person principally responsible for creating the intellectual content of an archived website or collection.

DATE Definition:

A single date or span of dates associated with an event in the lifecycle of an archived website or collection.

DESCRIPTION Definition:

One or more notes explaining the content, context and other aspects of an archived website or collection.

EXTENT Definition:

An indication of the size of an archived website or collection. (DC Format)

GENRE/FORM Definition:

A term specifying the type of content in an archived website or collection. (DC Type)

LANGUAGE Definition:

The language(s) of the archived content, including visual and audio resources with language components.

RELATION Definition:

Used to express part/whole relationships between a single archived website and any collection to which it belongs.

RIGHTS Definition:

Statements of legal rights and permissions granted by intellectual property law or other legal agreements.

SOURCE OF DESCRIPTION Definition:

Information about the gathering or creation of the metadata itself, such as sources of data or the date on which source data was obtained. (DC Description)

SUBJECT Definition:

Primary topic(s) describing the content of an archived website or collection.

TITLE Definition:

The name by which an archived website or collection is known.

URL Definition:

Internet address for an archived website or collection. (DC Identifier)
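The relationship between these elements and Dublin Core can be sketched as a lookup table. In the example below, the mappings marked "stated" are the ones given in parentheses in the list above; the remaining ones assume the Dublin Core element of the same name, which is an assumption of this sketch. The function name and the sample record values are hypothetical.

```python
# Element name -> Dublin Core element name.
WAM_TO_DC = {
    "Collector": "contributor",              # stated
    "Contributor": "contributor",            # stated
    "Creator": "creator",                    # assumed (same name)
    "Date": "date",                          # assumed (same name)
    "Description": "description",            # assumed (same name)
    "Extent": "format",                      # stated
    "Genre/Form": "type",                    # stated
    "Language": "language",                  # assumed (same name)
    "Relation": "relation",                  # assumed (same name)
    "Rights": "rights",                      # assumed (same name)
    "Source of Description": "description",  # stated
    "Subject": "subject",                    # assumed (same name)
    "Title": "title",                        # assumed (same name)
    "URL": "identifier",                     # stated
}

def to_dublin_core(record):
    """Group a record's values under Dublin Core element names."""
    dc = {}
    for element, value in record.items():
        dc.setdefault(WAM_TO_DC[element], []).append(value)
    return dc

# Hypothetical record, for illustration only.
record = {
    "Title": "SHU Athletics",
    "Collector": "Seton Hall University Libraries",
    "URL": "https://example.edu/athletics",
}
dc_record = to_dublin_core(record)
```

Values are collected into lists because more than one element can map to the same Dublin Core element, as Collector and Contributor both do.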

Section 5. Presentation and Access

A. Look and Feel

All technically feasible efforts will be made to ensure that the crawled web site closely resembles the original website. The only exception will be items on the web site that the web authors have protected using a robots.txt file.
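To show how a robots.txt file excludes content from a crawl, the sketch below uses Python's standard urllib.robotparser module to check two URLs against a set of rules. The robots.txt content, the "example-archiver" user agent, and the example.edu URLs are hypothetical; this is not a description of the crawler actually used for the archive.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that a site owner might publish.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())  # parse rules without a network request

# A page outside the disallowed path may be crawled.
allowed = parser.can_fetch("example-archiver", "https://example.edu/news/index.html")
# A page under /private/ is excluded from the crawl.
blocked = parser.can_fetch("example-archiver", "https://example.edu/private/report.html")
```

Here `allowed` is True and `blocked` is False, so a crawler honoring these rules would skip everything under /private/.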

B. Authenticity

The authenticity of the crawl is established by the banner at the top of the crawl, which features the original website URL, the archived URL, and the date captured. All archived web sites are tested following the completion of the crawl to ensure the site has been captured properly.

Section 6. Maintenance

A. New Web Content

The Digital Collections Group is committed to investigating new content to contribute to the web archive.

B. Ongoing Maintenance Activities

The harvests of existing web sites will be revisited every quarter to determine whether the website has changed significantly enough to initiate a re-harvest. Descriptive metadata for existing and new crawls will also be kept up to date.

C. Quarterly Evaluation

The current list of harvested websites will be revisited periodically, and potential additions to this list will be discussed. Feedback from researchers and scholars, if deemed fit, will be taken into account in the next quarter.

Section 7. Reference

The following policies and reports were consulted during the preparation of the Seton Hall University Web Archive Collecting Policy:

Purdue University https://www.lib.purdue.edu/sites/default/files/spcol/purdue-archives-web-archiving-policy.pdf

Columbia University https://library.columbia.edu/bts/web_resources_collection/policies.html

Michigan State University http://archives.msu.edu/collections/documents/CollectionPlan_v4.pdf

University of Wisconsin https://www.library.wisc.edu/archives/wp-content/uploads/sites/23/2016/01/Web-Archiving-Policy.pdf

Descriptive Metadata for Web Archiving: Recommendations of the OCLC Research Library Partnership Web Archiving Metadata Working Group https://www.oclc.org/content/dam/research/publications/2018/oclcresearch-wam-recommendations.pdf