As one of the largest and oldest science organizations in the world, USGS has produced more than a century of earth science data, much of which is currently unavailable to the greater scientific community due to inaccessible or obsolescent media, formats, and technology. Tapping this vast wealth of “dark data” requires 1) a complete inventory of legacy data and 2) methods and tools to effectively evaluate, prioritize, and preserve the data with the greatest potential impact to society. Recognizing these truths and the potential value of legacy data, USGS has been investigating legacy data management and preservation since 2006, including the 2016 “DaR” project, which developed legacy data inventory and evaluation methods and then tested them while preserving and releasing 5 at-risk USGS legacy datasets. This FY17 project will build on those FY16 project successes by:
The methods and tools developed through this project will enable USGS Mission Areas, Programs and science centers to efficiently evaluate their legacy data inventories and cost-effectively preserve their highest-priority legacy data products.
As one of the largest and oldest science organizations in the world, USGS has produced more than a century of earth science data, much of which is currently unavailable to the greater scientific community due to inaccessible or obsolescent media, formats or technology. These “legacy data” are invaluable for extending our historical understanding of the world’s natural resources, landscapes and hazards but lie unused because ultimately they are undiscovered and potentially unknown. Tapping this vast wealth of “dark data” requires 1) a complete inventory of legacy data and 2) methods and tools to effectively evaluate, prioritize, and preserve the data with the greatest potential impact to society.
Recognizing these truths and the potential value of USGS legacy data to modern scientific endeavors, USGS has been investigating methods of inventorying and preserving legacy data since 2006 through projects like the USGS Data Rescue Program (2006-2013), the Legacy Data Inventory and Reporting System (LDIRS; CDI 2014), and the 2016 Developing a USGS Legacy Data Inventory project, also known as the “Data at Risk” or “DaR” project (CDI 2016).
In particular, the FY16 DaR project represents a convergence of earlier USGS legacy data projects, new open data policies, and modern information technology to provide USGS Mission Areas and science centers with legacy data preservation support, tools, and methods. The primary objectives and results of the FY16 DaR project were:
Create a USGS legacy data inventory that catalogs and describes known USGS legacy data sets.
Results: We used the Legacy Data Inventory and Reporting System (LDIRS) to conduct a USGS-wide “Request for Legacy Data” (RFD) in May 2016. We received 43 submissions from 20 USGS science centers with potential impacts across all USGS Missions. These submissions formed the pool that we evaluated and prioritized in Objective 2 (below) and from which we selected preservation projects in Objective 3 (below). Since the RFD, the Fort Collins Science Center and EROS Center have continued to contribute legacy data to the inventory. The current inventory is available at: https://www.fort.usgs.gov/ldi/legacy-products
Develop methods to evaluate and prioritize legacy data sets based on USGS Mission objectives.
Results: We developed and tested a method to evaluate the risk and significance factors associated with a legacy data product and a second, algorithm-based method to prioritize legacy data based on its evaluation scores.
Preserve and release select, priority legacy data sets at risk of damage or loss.
Results: Applying the methods we developed in FY16 Objective 2 (above), we selected the top 5 legacy data products and partnered with the data owners to preserve and publish them as official USGS data releases. All 5 legacy data products have started the IPDS review and approval process, with official USGS data releases beginning in January 2017.
Develop time and resource estimates to preserve and release legacy data.
Results: For each of the 5 selected preservation projects, we collected data on the time and resources required to complete each stage of the data management plan (e.g., plan, acquire, process, analyze, preserve, and publish/share). These operational data will better inform future legacy data preservation and release estimates and will be published as case studies.
This FY17 CDI project seeks to build on the DaR FY16 project successes by:
Beyond the scientific importance of preserving and publicly releasing new USGS legacy data, successfully completing these FY17 project objectives will establish LDIRS as a simple, effective tool to manage the growing USGS legacy data inventory, enabling USGS Mission Areas, Programs and science centers to efficiently evaluate their legacy data inventories and cost-effectively preserve and publish their highest-priority legacy data products.
Objective 1: Refining the legacy data evaluation and prioritization algorithms; increasing LDIRS user workflow efficiency.
Based on FY16 DaR project data and LDIRS user feedback, we have identified 3 significant changes that will improve the legacy data inventory, evaluation, and prioritization processes for USGS staff:
Objective 2: Promoting and expanding the USGS legacy data inventory to better understand USGS legacy data-at-risk needs.
The FY16 DaR project focused on developing and validating legacy data inventory, evaluation and reporting methods. This work also resulted in engaging, productive community discussions that validated the utility and need for a USGS legacy data inventory. With those positive results to build on, Objective 2 of this project will expand the current USGS legacy data inventory.
To do this we will:
Objective 3: Continuing to identify, preserve and study at-risk, mission-critical USGS legacy products.
Undeniably, preserving and publishing at-risk USGS legacy data was the most visible and powerful aspect of the FY16 DaR project. Case in point: the strongest feedback we received on this proposal’s FY17 statement of interest was a set of specific requests to maximize the amount of funding for at-risk data preservation, which we have done. In addition, through our study of the time and resources required to preserve and publish legacy data, we identified patterns and efficiencies that provided FY17 improvements for users (see “Objective 1” above). Therefore, project Objective 3 is designed to:
The FORT legacy data steward will ensure that all legacy data releases from this project will:
|Personal Data Inventory Case Study: Susan Skagen (USGS-FORT)||Complete: May 2017|
|Personal Data Inventory Case Study: Kathryn Thomas (USGS-SBSC)||Complete: July 2017|
|LDIRS Technical Improvements||Complete: August 2017|
|Science Center Inventory Case Study: USGS-GLSC||Complete: September 2017|
|Science Center Inventory Case Study: USGS-UMESC||Complete: October 2017|
|2017 DaR Request for Legacy Data||Complete: September 2017|
|Migrating Bird Survey Data Along the San Pedro River and its Tributaries, Southeastern Arizona, 1989-1994||Complete: January 2018|
|Crest Stage Gage Site Visit Data, Montana, 1955-2016||Complete: February 2018|
|Central Mojave Desert Vegetation Mapping Project, California, 1997-1999||Complete: November 2018|
|Golden Eagle (Aquila chrysaetos) Satellite Telemetry and Observational Data, Western North America, 1993-1997||Complete: November 2020|
We refined the LDIRS prioritization algorithms to better assess temporal, geographic, and taxonomic extents, resulting in clearer prioritization scores with better intra-record differentiation. In addition, we incorporated the data assessment scoring into the data entry workflow, resulting in real-time prioritization.
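As an illustration only, the idea of prioritizing records from their evaluation scores could be sketched as below. This is not the LDIRS algorithm: the `LegacyRecord` fields, the 0-10 score range, the `risk_weight` parameter, and the weighted-average formula are all assumptions made for the example, and the dataset names and scores are invented placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LegacyRecord:
    """Hypothetical inventory record; field names are illustrative only."""
    title: str
    risk: float          # assumed 0-10 score: likelihood of damage or loss
    significance: float  # assumed 0-10 score: scientific or mission value


def priority_score(record: LegacyRecord, risk_weight: float = 0.5) -> float:
    """Combine risk and significance into one score (assumed weighted average)."""
    if not 0.0 <= risk_weight <= 1.0:
        raise ValueError("risk_weight must be between 0 and 1")
    return risk_weight * record.risk + (1.0 - risk_weight) * record.significance


def prioritize(records, risk_weight: float = 0.5):
    """Return records sorted from highest to lowest priority score."""
    return sorted(
        records,
        key=lambda r: priority_score(r, risk_weight),
        reverse=True,
    )


if __name__ == "__main__":
    # Placeholder records with invented scores, not real inventory entries.
    inventory = [
        LegacyRecord("Dataset A", risk=8.0, significance=6.0),
        LegacyRecord("Dataset B", risk=3.0, significance=9.0),
        LegacyRecord("Dataset C", risk=9.0, significance=9.0),
    ]
    for rec in prioritize(inventory):
        print(f"{rec.title}: {priority_score(rec):.1f}")
```

Because the score is computed from a single record, it could be recalculated each time a record is entered or edited, which is one plausible way to obtain the real-time prioritization described above.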
We used several methods to continue to promote and expand the USGS legacy data inventory. First, we worked with two career scientists (Susan Skagen; Kathryn Thomas) and two science centers (GLSC and UMESC) to inventory their scientific records as a means of identifying legacy data. Second, in September we conducted a second USGS-wide “Request for Legacy Data” to further expand the total LDIRS inventory. Third, we continued to communicate the DaR project accomplishments and methods through USGS groups such as CDI, the FSPAC Data Preservation Subcommittee, the Data at Risk Working Group, the National Geospatial and Geophysical Data Preservation Program (NGGDPP) and the USGS Step-Up Program. In particular, the USGS Step-Up Program used the LDIRS prioritization reports to select the North American Bat Banding Program data, an unfinished CDI-funded preservation project from 2014, for their FY18 preservation work.
During the FY16 and FY17 funding periods, the DaR project has selected 13 high-priority preservation projects to validate best practices for preserving and publishing USGS legacy data and software. To date, 6 have been published, 3 are in peer review, and 3 are completing data processing. Upon completion, each project is summarized as a case study that describes the methods validated and lessons learned.
“Glass slide and bathythermogram - examples of data types rescued.”