Center Names in ScienceBase
As part of Bureau-wide efforts to better cross-link primary systems and support flexible information retrieval, ScienceBase now uses USGS center names from a controlled list to assign a Data Owner to every ScienceBase data release. This list, consumed from a machine-accessible web service, provides an unambiguous set of active USGS centers for the current fiscal year and ensures accurate, consistent labeling across multiple tools and systems. Values from the authoritative list of center names populate the 'SDC Data Owner' field for data release products and enable browsing and querying for data by center, both in ScienceBase and in the Science Data Catalog. These center name values are also used to assign ScienceBase data releases, as well as other automated product content, to a specific center (with consistent spelling and display) in the Drupal Content Management System that populates USGS web pages.
Examples of these ScienceBase queries, built using names as a structured query parameter, can be seen here.
Background on the USGS Centers Web service
The USGS center list used to manage ScienceBase data releases is pulled from the Science Inventory – Proposals to Products (SIPP) (requires USGS internal network access) web services maintained by the Water Mission Area Business Analytics Team. These web services provide machine access (from the internal USGS network) to organizational information for USGS business and science operations. USGS Science Data Management Branch (SDM) staff have worked in collaboration with the SIPP developers to maintain and refine these services, not only for delineating active USGS centers but also to support auto-fill features and internal linking across systems (e.g., ScienceBase, the USGS DOI tool, the Science Data Catalog, the USGS Publications Warehouse, center workflows, etc.).
In the web service, a center is defined as a logical grouping of cost centers and Federal Payroll and Personnel System (FPPS) organizations, usually with a director and a requirement to operate independently. This approach is based on the 'Cost Center' definition in Survey Manual 320.1 and includes additional fields to facilitate cross-linking between organizational resources in the USGS.
Each center has a unique two- or three-character alphanumeric code derived from the codes for each of its cost centers. Active status is determined in collaboration with the Office of Accounting and Funds Management (OAFM) (requires USGS internal network access) by identifying the accounts and cost centers that require funding. At the beginning of each fiscal year, the list is reevaluated to determine which centers have merged, split, or become inactive, and the web service is updated accordingly. Changes in the USGS centers web service (and in any downstream picklists that pull from this list, such as the one in the ScienceBase Data Release (SBDR) Tool) will not occur until the change is implemented in our financial systems, usually at the beginning of a fiscal year. SDM staff and SIPP managers work with Center Directors and Deputy Directors to ensure that the spelling and format of the names for any merged, split, or renamed centers are correct in the web service.
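The code format described above can be checked with a short script. The following is a minimal Python sketch, not part of the SIPP service: the function names are hypothetical, and it assumes that codes consist only of ASCII letters and digits and that uniqueness should be checked without regard to letter case. The service's actual rules may differ.

```python
import re

# Assumption: a center code is 2 or 3 ASCII letters or digits.
# The real service may apply additional or different rules.
CODE_PATTERN = re.compile(r"[A-Za-z0-9]{2,3}")


def is_valid_center_code(code):
    """Return True if the code is a 2- or 3-character alphanumeric string."""
    return CODE_PATTERN.fullmatch(code) is not None


def find_duplicate_codes(codes):
    """Return the sorted codes that appear more than once.

    Assumption: comparison ignores letter case.
    """
    seen = set()
    duplicates = set()
    for code in codes:
        key = code.upper()
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return sorted(duplicates)
```

A check like this could be run against a local copy of the center list before it is used to populate a picklist.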
Maintaining Center Names in ScienceBase
At the beginning of each fiscal year, the SBDR Team syncs the ScienceBase Active Center List with the USGS SIPP Centers service. This list currently drives the picklists in SDM's suite of tools, including the SBDR Tool and the USGS DOI Tool. When a center name is updated, the SBDR Team will automatically update the Science Data Catalog (SDC) Data Owner on the ScienceBase data release landing pages for that center. Likewise, when centers merge, the SDC Data Owner for each data release from the merged centers will be updated with the new merged name. When a center splits, the SBDR Team will work with the centers' data manager(s) to determine how to divide the existing data releases. If a center is deprecated, the SDC Data Owner is left as is on its data releases unless an active center decides to take ownership of those data.
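The update rules above can be summarized as a small decision procedure. The following Python sketch is illustrative only and is not the SBDR Team's actual tooling: the function name, the action labels, and the input shapes (dictionaries mapping center codes to names, plus explicit merge and split information) are all assumptions made for this example.

```python
def plan_owner_updates(current, service, merged_into=None, split_codes=None):
    """Decide what to do with the SDC Data Owner for each known center.

    current:     dict of code -> name as currently stored in ScienceBase
    service:     dict of code -> name for active centers from the web service
    merged_into: dict of old code -> code of the center it merged into
    split_codes: codes of centers that have split
    All four inputs are hypothetical structures used for illustration.
    """
    merged_into = merged_into or {}
    split_codes = set(split_codes or ())
    plan = {}
    for code, old_name in current.items():
        if code in split_codes:
            # Splits need a human decision made with the centers' data managers.
            plan[code] = ("manual_review", None)
        elif code in merged_into:
            # Merged centers take the name of the new merged center.
            plan[code] = ("update_name", service[merged_into[code]])
        elif code not in service:
            # Deprecated center: the Data Owner is left as is.
            plan[code] = ("leave_as_is", old_name)
        elif service[code] != old_name:
            # Same center, new spelling or name.
            plan[code] = ("update_name", service[code])
        else:
            plan[code] = ("no_change", old_name)
    return plan
```

The split case is checked first because a center that has split may no longer appear in the service at all, and it should be routed to manual review rather than treated as deprecated.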
If you have questions about why a data release is labeled in a particular way, please reach out to the SBDR Team at email@example.com.
For additional questions about the SIPP web services, please contact Brian Reece (firstname.lastname@example.org).
Revision Guidance Updates
Do you need to make a change or update to a published data release? Guidance for data release revision has recently been updated.
The first step in revising a data release is to determine under which revision level the changes fall. There are now five revision levels:
Note: Revision levels 2-5 require approval in IPDS for any updates.
Also new to the guidance is a table outlining the steps to take for each revision level (portion of the table shown below).
Common elements for revision levels 2-5 include the following: a version history text (.txt) file that details the changes made, updates to the metadata to include any new processing steps, addition of an 'update' date to the DOI, and inclusion of a versioning element in the title and citation (e.g., ver. 2.0, January 2021). Specifics for these changes may vary by revision level.
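For those who script their revisions, the versioning element can be generated consistently. The following Python sketch reproduces the "ver. 2.0, January 2021" pattern from the example above. The function names are hypothetical, and placing the element in parentheses after the title is an assumption; follow the revision guidance for the exact placement.

```python
from datetime import date

# Month names are spelled out explicitly so the output does not depend on
# the system locale.
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def version_label(major, minor, released):
    """Build a label such as 'ver. 2.0, January 2021'."""
    return f"ver. {major}.{minor}, {MONTHS[released.month - 1]} {released.year}"


def versioned_title(title, major, minor, released):
    """Append the version label to a title.

    Assumption: the label goes in parentheses after the title.
    """
    return f"{title} ({version_label(major, minor, released)})"
```

The same label can be reused in the citation and as the heading of the corresponding entry in the version history file so that all three stay in agreement.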
We recommend contacting the SBDR Team at the start of your revision. In addition to offering revision tips, the SBDR Team can help with tasks such as duplicating the current landing page to create a working copy, determining the best structure for the revision, and planning for periodic data updates.
Please email email@example.com for more on data release revision.
When working with data, it is often necessary to map information from a raw format to one that can be easily refined and analyzed, which is not always a trivial task. OpenRefine, previously known as Google Refine, is a free, open-source software application for organizing, formatting, and cleaning large, messy datasets. With this tool, the user can clean up misspellings, split columns, track changes, and perform a variety of other operations. OpenRefine is useful for getting a quick snapshot of a dataset's content and resolving inconsistencies.
OpenRefine supports a variety of file formats, including CSV and XML. Data can also be transformed within the OpenRefine interface using expression languages such as General Refine Expression Language (GREL), Python, and Clojure. All actions performed on a dataset can easily be undone. Working with multiple data files often requires repeating the same steps; OpenRefine can make this process more efficient by replaying actions on multiple datasets, saving time and effort.
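To make the kinds of cleanup described above more concrete, here is a minimal standalone Python sketch of two of them: unifying inconsistent spellings of the same value and splitting one column into two. This is not OpenRefine code and does not use OpenRefine's own algorithms. The function names and sample values are hypothetical, and the matching rule (ignore case and extra whitespace, then keep the most frequent spelling) is a deliberately simple assumption.

```python
from collections import Counter, defaultdict


def normalize_key(value):
    """Collapse case and whitespace so near-identical values compare equal."""
    return " ".join(value.lower().split())


def unify_spellings(values):
    """Replace each value with the most frequent spelling in its group."""
    groups = defaultdict(Counter)
    for value in values:
        groups[normalize_key(value)][value.strip()] += 1
    canonical = {
        key: counts.most_common(1)[0][0] for key, counts in groups.items()
    }
    return [canonical[normalize_key(value)] for value in values]


def split_column(rows, column, separator, new_names):
    """Split one column of each row (a dict) into several named columns."""
    result = []
    for row in rows:
        parts = [
            part.strip()
            for part in row[column].split(separator, len(new_names) - 1)
        ]
        # Pad with empty strings when a row has fewer parts than expected.
        parts += [""] * (len(new_names) - len(parts))
        new_row = {k: v for k, v in row.items() if k != column}
        new_row.update(zip(new_names, parts))
        result.append(new_row)
    return result
```

Because each step is an ordinary function, the same sequence can be applied to several files in turn, which is the scripted analogue of replaying recorded actions on multiple datasets.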
You can download and install the software from the OpenRefine homepage. The list of External Resources located on the OpenRefine wiki page is also a good place to get additional ideas and recipes for ways to work with data.
*Disclaimer: Any use of trade, product, or firm names is for descriptive purposes only and does not imply endorsement by the U.S. Government. ScienceBase is not affiliated with OpenRefine.