The Society of American Archivists defines web archives as “preserved copies of live web content collected for permanent retention…” (Society of American Archivists 2017). This work is vital because so much born-digital material now circulates on the web, our current medium of communication. However, the very nature of the web places much of this content at risk: web sites are abandoned, links break, and campus departments restructure. Many organizations find that capturing a web site at a specific point in time, when it is at the peak of its use, is a useful way to preserve university history. Seton Hall University Libraries has implemented a system to actively collect and preserve SHU’s history, digital publications, records, and web presence.
The mission of the Digital Preservation Group at SHU Libraries is to support the mission of The Msgr. William Noé Field Archives and Special Collections Center, which “exists to identify, collect, preserve and serve as repository for records of enduring value created by, for and about both Seton Hall University and the Archdiocese of Newark as well as Catholic New Jersey.”
The primary audience for the Seton Hall University Web Archive will be SHU personnel and researchers.
The Web Archive is a repository for records pertaining to the history of Seton Hall University. As such, it collects web pages of historic value.
For questions, please contact firstname.lastname@example.org.
Archivists select content for the Seton Hall University Web Archive based on the Preservation Policy of the Digital Collections Group. Please see https://library.shu.edu/library/preservation-policy for more information.
Websites (URLs) that contain official SHU information, whether hosted on university Web servers or by private companies, will be considered for the Web Archiving program, provided they contain only university information and not a combination of public and private information. However, some private Websites may be considered for capture if they contain significant university information and/or assist in the formation of university policy.
The collections include:
SHU Human Resources
SHU Law Library
SHU University Libraries
SHU Web site
There are two resource constraints: 1) the cost associated with storage space, and 2) the maximum storage space available. Judicious use of harvesting scope will result in more captures in the same amount of space.
The initial crawl of web sites will be based on a list of URLs (a seed list) developed by the Digital Collections Group. Embedded images and style sheets will be crawled. In some cases, web pages that include confidential information, or information that the content provider does not want archived, will not be crawled (Robot Exclusion). All jobs will default to the "max-link-hops = 25" scope setting. This means that the crawler will follow links from the URL(s) entered for the site and continue gathering files until it is 25 links away from the starting point. This should provide a thorough capture of most sites.
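The hop limit described above can be illustrated with a minimal sketch. It assumes a breadth-first traversal over an in-memory link graph; the function name, the `links` mapping, and the choice to keep pages exactly 25 hops away without following their links are illustrative assumptions, not the behavior of the crawler the Archive actually uses.

```python
from collections import deque

MAX_LINK_HOPS = 25  # default scope setting named in the policy


def urls_in_scope(seeds, links, max_hops=MAX_LINK_HOPS):
    """Return {url: hops} for every URL within max_hops links of a seed.

    `links` is a hypothetical mapping from a URL to the URLs it links to,
    standing in for pages a real crawler would fetch and parse.
    """
    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if hops[url] == max_hops:
            continue  # at the hop limit: keep the page, do not follow its links
        for target in links.get(url, []):
            if target not in hops:  # first visit is the shortest path in BFS
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops
```

Lowering `max_hops` for a large site is one way to apply the "judicious use of harvesting scope" mentioned earlier, since fewer pages fall inside the scope.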
The frequency of web site capture may vary from site to site, depending on how often a particular website is updated and on its current relevance. Websites that change frequently, such as news sites, are crawled more often, and the frequency of crawls for such web sites will be in sync with the frequency of changes made to the site. The Digital Collections Group (DCG) plan includes weekly, quarterly, semiannual, yearly, and one-time crawls.
Efforts will be made to ensure the “look and feel” of the original web site is maintained to the greatest possible extent. To this end, various file formats such as HTML, PDF, Office documents, images, audio, video, and compressed files will be part of the harvest. Static, dynamic, and interactive pages will be captured as part of the harvest.
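As a rough illustration of how these crawl frequencies could be turned into due dates, the sketch below maps each frequency to a fixed interval. The names `CRAWL_INTERVALS` and `next_crawl` are hypothetical, and the day counts (for example, 91 days for a quarter) are approximations chosen for the example rather than values taken from the policy.

```python
from datetime import date, timedelta

# Approximate intervals for the frequencies named in the policy (assumed values).
CRAWL_INTERVALS = {
    "weekly": timedelta(days=7),
    "quarterly": timedelta(days=91),
    "semiannual": timedelta(days=182),
    "yearly": timedelta(days=365),
    "one-time": None,  # captured once, never rescheduled
}


def next_crawl(last_crawl, frequency):
    """Return the date the next crawl is due, or None for one-time crawls."""
    interval = CRAWL_INTERVALS[frequency]
    if interval is None:
        return None
    return last_crawl + interval
```

For example, `next_crawl(date(2024, 1, 1), "weekly")` gives `date(2024, 1, 8)`.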
Educational use only, no other permissions given. SHU Digital Preservation Group does not assert ownership rights over the intellectual property of the contents included in the web archive collection. All rights of ownership remain with the owner(s) identified on the Website for the full term of copyright. SHU Digital Preservation Group is not involved in the creation of the Websites and has no oversight for the contents of these collected Websites. The SHU Digital Preservation Group assumes no responsibility for the accuracy or lawfulness of the collected Websites or the contents within. These captured websites are provided here for educational purposes. They may not be reproduced or distributed in any format without written permission of the owner(s).
The Archives will create descriptive metadata using an adaptation of Dublin Core’s 15 standard fields. The adaptation was proposed by OCLC’s Research Library Partnership Web Archiving Metadata Working Group. The elements, with brief descriptions, are listed below. For more detail, see https://www.oclc.org/content/dam/research/publications/2018/oclcresearch-wam-recommendations.pdf
Collector: The organization responsible for curation and stewardship of an archived website or collection. (DC Contributor)
Contributor: An organization or person secondarily responsible for the content of an archived website or collection. (DC Contributor)
Creator: An organization or person principally responsible for creating the intellectual content of an archived website or collection.
Date: A single date or span of dates associated with an event in the lifecycle of an archived website or collection.
Description: One or more notes explaining the content, context and other aspects of an archived website or collection.
Extent: An indication of the size of an archived website or collection. (DC Format)
Genre/Form: A term specifying the type of content in an archived website or collection. (DC Type)
Language: The language(s) of the archived content, including visual and audio resources with language components.
Relation: Used to express part/whole relationships between a single archived website and any collection to which it belongs.
Rights: Statements of legal rights and permissions granted by intellectual property law or other legal agreements.
Source of Description: Information about the gathering or creation of the metadata itself, such as sources of data or the date on which source data was obtained. (DC Description)
Subject: Primary topic(s) describing the content of an archived website or collection.
Title: The name by which an archived website or collection is known.
URL: Internet address for an archived website or collection. (DC Identifier)
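To make the element list concrete, the following sketch models one descriptive record as a Python dataclass. The field names are assumed from the element names in the OCLC recommendations cited above, the class and method names are hypothetical, and the sample values in the usage note are illustrative placeholders rather than actual catalog data.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class ArchivedSiteRecord:
    """One descriptive metadata record for an archived website or collection."""
    title: str
    url: str
    collector: str
    creator: str = ""
    contributor: List[str] = field(default_factory=list)
    date: str = ""
    description: str = ""
    extent: str = ""
    genre_form: str = ""
    language: List[str] = field(default_factory=list)
    relation: str = ""
    rights: str = ""
    source_of_description: str = ""
    subject: List[str] = field(default_factory=list)

    def missing_elements(self):
        """List the elements that are still empty, in field order."""
        return [name for name, value in asdict(self).items() if not value]
```

A record could be started with only the required fields, for example `ArchivedSiteRecord(title="SHU University Libraries", url="https://library.shu.edu/", collector="Seton Hall University Libraries")`, and `missing_elements()` then reports what remains to be described.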
All technically feasible efforts will be made to ensure that the crawled web site closely resembles the original website. The only exception will be items on the web site that its authors have protected using a robots.txt file.
The authenticity of the crawl is established by the banner at the top of the crawl, which shows the original website URL, the archived URL, and the date captured. All archived web sites are tested following completion of the crawl to ensure the site has been captured properly.
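The robots.txt exclusion mentioned here can be sketched with Python's standard `urllib.robotparser` module. The helper name `allowed_by_robots` and the sample rules are hypothetical, and the sketch parses robots.txt text passed in directly rather than fetching it; it is not a description of how the Archive's crawler evaluates exclusions.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits crawling `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules without any network access
    return parser.can_fetch(user_agent, url)


# Hypothetical rules excluding one directory from capture.
SAMPLE_RULES = "User-agent: *\nDisallow: /private/\n"
```

With `SAMPLE_RULES`, a page under `/private/` would be skipped while the rest of the site remains in scope.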
The Digital Collections Group is committed to investigating new content to contribute to the web archive.
The harvests of existing web sites will be revisited every quarter to determine whether a website has changed significantly enough to warrant a re-harvest. Descriptive metadata for existing and new crawls will also be kept up to date.
The current list of harvested websites will be revisited periodically, and potential additions to the list will be discussed. Feedback from researchers and scholars, where deemed appropriate, will be taken into account in the next quarter.
The following policies and reports were consulted during preparation of the Seton Hall University Web Archive Collecting Policy:
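One possible way to support the quarterly review is to compare the text of the last capture with the live page and flag large differences. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the function name and the 0.2 threshold are hypothetical, and the policy itself does not define what counts as a significant change.

```python
from difflib import SequenceMatcher

CHANGE_THRESHOLD = 0.2  # assumed cutoff; the policy does not specify one


def changed_significantly(archived_text, live_text, threshold=CHANGE_THRESHOLD):
    """Return True when the two texts differ by at least `threshold`.

    The difference is 1 minus the similarity ratio, so 0.0 means identical
    and 1.0 means nothing in common.
    """
    similarity = SequenceMatcher(None, archived_text, live_text).ratio()
    return (1.0 - similarity) >= threshold
```

A flagged site would still be reviewed by an archivist; the score only helps prioritize which harvests to look at first.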
Columbia University https://library.columbia.edu/bts/web_resources_collection/policies.html
Michigan State University
Descriptive Metadata for Web Archiving: Recommendations of the OCLC Research Library Partnership Web Archiving Metadata Working Group https://www.oclc.org/content/dam/research/publications/2018/oclcresearch-wam-recommendations.pdf