Legacy data (n) - Information stored in an old or obsolete format or computer system that is, therefore, difficult to access or process. (Business Dictionary, 2016)
For over 135 years, the U.S. Geological Survey has collected diverse information about the natural world and how it interacts with society. Much of this legacy information is one-of-a-kind and in danger of being lost forever through decay of materials, obsolete technology, or staff changes. Several laws and orders require federal agencies to preserve federally collected scientific information and provide the public access to it. The information is to be archived in a manner that allows others to examine the materials for new information or interpretations. Data-at-Risk is a systematic way for the USGS to continue meeting the challenge of preserving, and making accessible, the enormous amount of information locked away in inaccessible formats. Data-at-Risk efforts inventory and prioritize inaccessible information and assist with its preservation and release into the public domain. Much of the information the USGS collects has permanent or long-term value to the Nation and the world through its contributions to scientific discovery, public policy, and decision making. These collections represent observations and events that will never be repeated and warrant preservation so that future generations can learn and benefit from them.
Goal: Expand the USGS contribution to scientific discovery and knowledge by demonstrating a long-term approach to inventorying, prioritizing, and releasing to the public the wealth of USGS legacy scientific data.
The USGS is one of the largest and oldest earth science organizations in the world, and its scientific legacy is its data, including, but not limited to, images, video, audio files, and physical samples, and the scientific knowledge derived from them, gathered over 130 years of research. However, it is widely understood that high-quality data collected and analyzed as part of now-completed projects are hidden away in case files, file cabinets, and hard drives housed in USGS facilities. Therefore, despite their potential significance to current USGS mission and program research objectives, these “legacy data” are unavailable. In addition, legacy data are by definition at risk of permanent loss or damage because they pre-date current open-data policies, standards, and formats. Risks to legacy data can be technical, such as obsolescence of the data’s storage media and format, or organizational, such as a lack of funding or facility storage. Conveniently, addressing such risks generally results in the science data becoming usable by modern data tools, as well as accessible to the broader scientific community.
Building on past USGS legacy data inventory and preservation projects
USGS has a long history of proactively researching and developing solutions to data management needs, including legacy data inventory and preservation. For example, in 1994 USGS was instrumental in establishing the FGDC-CSDGM metadata standard for geospatial scientific data, which is still part of the foundation of USGS data management. Today, USGS is a lead agency in establishing meaningful and actionable policies that facilitate data release to the greater public scientific community. In recent years, CDI has invested in several legacy data inventory and preservation projects, including the “Legacy Data Inventory” project (aka “Data Mine,” 2013-present), which examined the time, resources, and workflows needed for science centers to inventory legacy data. Another CDI project, the “North American Bat Data Recovery and Integration” project (2014-present), is preserving previously unavailable bat banding data (1932-1972) and white-nose syndrome disease data and making them available via APIs. Both of these CDI projects were forward-thinking legacy data initiatives, several years ahead of Federal open data policies and mandates.
However, one of the most comprehensive, Bureau-level legacy data preservation efforts was the USGS Data Rescue project, which provided funding, tools, and support to USGS scientists to preserve legacy data sets at imminent risk of permanent loss or damage. A small sample of USGS science data rescued over those eight fiscal years included:
Over 100 projects were supported in the 8 years the Data Rescue project was in operation (2006-2013), while an additional 300 projects went unfunded, providing a glimpse of the potential trove of USGS legacy data at risk of damage or loss. The urgency of, and strategies for, preserving USGS legacy data were discussed at length at the 2014 CSAS&L Data Management Workshop and the 2015 CDI Workshop, further demonstrating a Bureau-wide recognition of the importance of legacy data preservation and release. During the 2015 CDI Workshop, the Data Management Working Group rated legacy data preservation a top FY16 priority, laying the groundwork for this proposal, which applies the legacy data inventory and evaluation methods developed through the CDI Legacy Data Inventory project to formalize and extend the inventory successfully started through the Data Rescue Program. By creating a formal method to submit, document, and evaluate legacy data known to be in need of preservation, USGS would have a tool that scientists, science centers, and mission areas can use to identify significant historical legacy data that can inform new, data-intensive scientific efforts.
Challenges and improvements for USGS legacy data preservation and release
Based on our experiences managing and preserving USGS legacy data, we have seen two challenges that often undermine legacy data preservation and release:
We believe that each of these challenges has a good solution that can improve the efficiency and predictability of preservation and release efforts:
Each objective of this proposal will be addressed in a sequence of three phases:
Phase I: Identification and inventory of USGS data at risk
Data owners will document their legacy data sets electronically, providing the primary project and data set metadata elements needed to score, evaluate, and prioritize the legacy data inventory. The core of these metadata elements will be derived from the established “USGS Metadata 20 Questions” form, which has proven effective at gathering metadata from research scientists with little or no data science experience. Narrative fields will be used to evaluate need, and categorical fields will be used to calculate feasibility scores that determine the level of effort required to successfully rescue the proposed data.
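The categorical-field scoring described above can be sketched as a simple rubric lookup. The field names, answer categories, and point values below are illustrative assumptions, not the actual Data-at-Risk rubric:

```python
# Hypothetical sketch of Phase I feasibility scoring: each categorical
# metadata field maps an answer to points (higher = more feasible to rescue).
# Fields and point values are invented for illustration only.
RUBRIC = {
    "storage_media": {"digital_open": 3, "digital_proprietary": 2, "paper": 1, "degraded": 0},
    "metadata_state": {"complete": 3, "partial": 2, "none": 1},
    "data_owner_available": {"yes": 2, "no": 0},
}

def feasibility_score(entry):
    """Sum rubric points for one inventory entry (a dict of categorical
    answers); missing or unrecognized answers score zero."""
    return sum(RUBRIC[field].get(entry.get(field), 0) for field in RUBRIC)

entry = {"storage_media": "paper", "metadata_state": "partial", "data_owner_available": "yes"}
print(feasibility_score(entry))  # 1 + 2 + 2 = 5
```

A rubric like this keeps scoring transparent: reviewers can see exactly which answers drove an entry's feasibility total.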
Phase II: Evaluation and prioritization of the USGS data at risk requests
The CDI Data Management Working Group’s Data at Risk sub-group will facilitate the evaluation and prioritization of the legacy data inventory. Mission Areas will be engaged to verify that inventory submissions are supported programmatically and meet mission objectives. The USGS Records Management Program, Enterprise Publishing Program, and ScienceBase will be consulted to verify that submitted inventory entries can be released within Bureau records management and data release policies. Once these checkpoints have been verified, the Data at Risk sub-group and data scientist will score and prioritize the legacy data inventory based on the following criteria:
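Once need and feasibility scores exist for each verified entry, prioritization reduces to ranking by a combined score. The weighting below is an illustrative assumption, not the sub-group's actual criteria, and the dataset names are placeholders:

```python
# Hypothetical sketch of Phase II prioritization: rank verified inventory
# entries by a weighted combination of their need and feasibility scores.
# The 60/40 weighting and the sample entries are invented for illustration.
def priority_rank(inventory, need_weight=0.6, feasibility_weight=0.4):
    """Return inventory entries sorted highest-priority first."""
    def combined(entry):
        return need_weight * entry["need"] + feasibility_weight * entry["feasibility"]
    return sorted(inventory, key=combined, reverse=True)

inventory = [
    {"title": "Dataset A", "need": 8, "feasibility": 9},  # combined 8.4
    {"title": "Dataset B", "need": 9, "feasibility": 5},  # combined 7.4
    {"title": "Dataset C", "need": 6, "feasibility": 7},  # combined 6.4
]
ranked = priority_rank(inventory)
print([e["title"] for e in ranked])  # ['Dataset A', 'Dataset B', 'Dataset C']
```

Separating the weights as parameters lets the sub-group adjust the balance between scientific need and rescue feasibility without changing the workflow.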
Phase III: Preservation and Release of Select, Priority Legacy Data
Working in order of the priority set in Phase II, the data scientist(s) will collaborate with the data owner to complete the process of preserving and releasing their legacy data. Through this collaboration, the data scientist will create and validate the FGDC-CSDGM metadata and develop the data set in an open format as documented in the metadata. By process, the data scientist will act as an agent of the data owner, coordinating and completing all steps in each workflow until the IPDS record is approved and disseminated by the Bureau and the ScienceBase data release item(s) are approved, locked, and made public by the ScienceBase team. However, while the data scientist is responsible for ensuring all preservation and release tasks are completed consistently and within policies and best practices, the data owner retains final approval of metadata attribution (e.g., title, authorship), as well as disposition of their legacy data (e.g., pre/post-processing methods; derivative data architectures).
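To give a sense of the FGDC-CSDGM metadata the data scientist creates, here is a minimal sketch that assembles a skeleton record with the standard library. Only a few identification elements are shown; a complete record has many more required sections (data quality, spatial reference, entity and attribute information, distribution, and metadata reference), and the sample values are placeholders:

```python
# Minimal sketch of an FGDC-CSDGM metadata skeleton built with ElementTree.
# This is illustrative only; real Data-at-Risk records are far more complete
# and are validated against the full FGDC-CSDGM standard.
import xml.etree.ElementTree as ET

def fgdc_skeleton(title, originator, pubdate, abstract):
    """Return a serialized skeleton FGDC-CSDGM record with a citation
    and abstract; all other required sections are omitted here."""
    root = ET.Element("metadata")
    idinfo = ET.SubElement(root, "idinfo")
    citeinfo = ET.SubElement(ET.SubElement(idinfo, "citation"), "citeinfo")
    ET.SubElement(citeinfo, "origin").text = originator
    ET.SubElement(citeinfo, "pubdate").text = pubdate
    ET.SubElement(citeinfo, "title").text = title
    descript = ET.SubElement(idinfo, "descript")
    ET.SubElement(descript, "abstract").text = abstract
    return ET.tostring(root, encoding="unicode")

xml = fgdc_skeleton("Example legacy data set", "U.S. Geological Survey",
                    "2016", "Rescued legacy data, converted to an open format.")
print(xml)
```

Generating the skeleton programmatically makes it easy to pre-populate records from the Phase I inventory fields before the data owner reviews attribution.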
At the completion of Phase III, each legacy data release will have the following created by the data scientist:
| Milestone | Status |
| --- | --- |
| 2016 Request for Legacy Data | Complete: May 2016 |
| Develop and test methods to evaluate and prioritize legacy data inventories | Complete: July 2016 |
| Gage Height Data, Friends at Argenta Creek, Illinois, 1971-1982 | Complete: October 2016 |
| Bathythermograph Data, Lake Michigan, 1954 | Complete: January 2017 |
| USGS Southwest Repeat Photography Collection: Kanab Creek, southern Utah and northern Arizona, 1872-2010 | Complete: September 2017 |
| Shapefiles and Historical Aerial Photographs, Little Missouri River, 1939-2003 | Complete: October 2017 |
| Software to Process and Preserve Legacy Magnetotelluric Data | Complete: March 2018 |
| Magnetotelluric Data from the San Andreas Fault at Parkfield, California, 1990 | Complete: June 2018 |
| River Channel Survey Data, Redwood Creek and Mill Creek, California, 1974-2013 | Peer Review Reconciliation |
Files attached to this item:
“RFP Email Announcement (2016-04-18)”
“USGS Highlight Announcing RFP”
“USGS Highlight Announcing RFP Extension”
“BT Preservation Data Management Plan”
“Gage Data Preservation Data Management Plan”
“Kanab Creek repeat photo collection”
“CDI Monthly Meeting Presentation”