Overview of Reference Datasets

Reference datasets contain standardized data or codes, which typically are used by various applications as lists or tables. In fact, they are often called "code tables." An individual code table may seem like a simple thing, but a well-managed collection of code tables and related reference data spread across an enterprise is a resource that can bring great value to that enterprise—or cause great problems if it is not well maintained. EDG lets you control your reference data so that you can put it to work for you as efficiently as possible.

EDG datasets are much more than just flat code tables. Reference data in different datasets can have relationships. For example, because currencies are associated with countries, currency codes have a relationship (connection) to country codes. Reference datasets can also model structural relationships in data, such as hierarchies of industrial categories, locations, or product types. You can also capture any additional information you need about each code. Finally, reference datasets themselves carry rich metadata, such as the source of a dataset, how it is managed, where it is being used, and the meaning of each data field.

For additional perspectives and details on reference data management and related topics, see these TopQuadrant whitepapers.

Reference datasets are used with ontologies, which define the data schema (classes, properties, relationships, constraints) of the reference dataset items. For example, you might define a class (or entity) called Gender in an ontology and then, in a reference dataset that uses this ontology, enter the values Male and Female as instances of this class. Ontologies thus define the data attributes for each entity and the relationships between entities.
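
As a rough illustration of the split between ontology (schema) and reference dataset (instances), the following Python sketch models the Gender example. The names OntologyClass and ReferenceDataset, and the code and label properties, are hypothetical and are not part of EDG; they only mirror the description above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class OntologyClass:
    """Schema side (hypothetical): a class and the properties it defines."""
    name: str
    properties: tuple[str, ...]


@dataclass
class ReferenceDataset:
    """Data side (hypothetical): instances of one ontology class."""
    label: str
    main_entity: OntologyClass
    instances: list[dict[str, str]] = field(default_factory=list)

    def add_instance(self, **values: str) -> None:
        # Reject properties that the ontology class does not define.
        unknown = set(values) - set(self.main_entity.properties)
        if unknown:
            raise ValueError(f"undefined properties: {sorted(unknown)}")
        self.instances.append(dict(values))


# The ontology defines the class; the dataset holds its instances.
gender = OntologyClass(name="Gender", properties=("code", "label"))
dataset = ReferenceDataset(label="Gender Codes", main_entity=gender)
dataset.add_instance(code="M", label="Male")
dataset.add_instance(code="F", label="Female")
```

In this sketch, adding an instance with a property that the class does not define raises a ValueError, which loosely corresponds to the ontology constraining what a dataset item may contain.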

TopBraid EDG makes it possible for you to:

  • Reduce independent maintenance of code tables: If different departments use the same code table, they may be maintaining individual copies of it on spreadsheets being emailed around to each other. When they all use the same copy, changes are coordinated, and they can be confident that they're using the right codes.

  • Reduce data quality problems due to coding errors: Workers who don't have access to recent, correct codes can't always enter the proper values, and improper values can lead to lost revenue.

  • Reduce the cost of designing code tables for databases: When new code tables have similarities or other relationships to other tables, these relationships can be leveraged in the design of the new tables. Well-organized, searchable metadata about which applications use which code tables also makes it easier to coordinate new and legacy tables.

  • Reduce data integration issues due to inconsistent codes: The inconsistencies caused by maintaining multiple copies of the same code tables, or by using copies that were updated at different times, can lead to problems when combining datasets that reference these tables. Consistent tables mean easier data integration.

  • Make informed decisions based on code table data: Code table entries are often cryptic abbreviations, leaving people to guess at their meaning and at which code to use when. Metadata such as definitions and provenance information helps ensure that people use the right codes in the right places.

Licensing

The availability of different collection types, including Reference Datasets and customer-defined types, is determined by what you have licensed and configured. The TopQuadrant website describes the TopBraid products and the data governance packages that determine which collection types are available.

Reference Datasets Home

Selecting the Reference Datasets link in the left-navigation pane of TopBraid EDG (Home) lists all of the Reference Dataset collections currently accessible to the user, and it allows authorized users to create new ones.

Create New Reference Dataset

The Reference Datasets > Create New Reference Dataset link opens a form with fields used to define the new Reference Dataset. Note that you can also create a Reference Dataset by using a Create link in the Governance Areas page. 

No user will have a link for creating any asset collection until an administrator configures EDG's persistence technology, as documented in EDG Administration: Configuring the application data storage. Additionally, a user will have a create link only if the user or their role has Create permission for the EDG Repositories project, as documented in EDG Permission Group Management: Configure Permissions.

Note: Required and Permitted Includes

Collections often have natural relationships to other collections; for example, a Reference Dataset references an Ontology class as its main entity. To make such a reference, the collection that contains the referenced resources must be included. Some inclusions are required, while others are merely permitted. For example, Taxonomies always include the SKOS ontology and can include other taxonomies. A Reference Dataset is always required to include at least one Ontology, because an ontology is needed to define the entities in the dataset. Glossaries always include the predefined EDG ontology that describes business glossary terms. Catalogs of Data Assets always include the predefined EDG ontology that describes data assets and are expected to include definitions of relevant physical Datatypes. These requirements can be further configured.

When creating a collection, any required reference to another type of collection will either be handled automatically or be presented for selection. If a required inclusion is omitted at creation, the resulting collection will show red warnings about the missing relationship(s). After creation, included collections can be changed in the utilities view under Settings > Includes. When changing a collection's includes, the selection options are restricted to required and permitted types.
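
The required/permitted distinction can be sketched as a small lookup, shown here in Python. This is only an illustration under stated assumptions: the rule tables contain just the examples named in this note, each include is reduced to a plain label, and the names REQUIRED_INCLUDES, PERMITTED_INCLUDES, missing_required, and selectable_kinds are hypothetical. EDG's actual configuration is not represented this way.

```python
# Hypothetical rule tables, limited to examples mentioned in the note above.
REQUIRED_INCLUDES = {
    "Taxonomy": {"SKOS Ontology"},
    "Reference Dataset": {"Ontology"},
}
PERMITTED_INCLUDES = {
    "Taxonomy": {"Taxonomy"},
}


def missing_required(collection_type, included):
    """Return required include kinds that are absent (these would be warned about)."""
    required = REQUIRED_INCLUDES.get(collection_type, set())
    return sorted(required - set(included))


def selectable_kinds(collection_type):
    """Return the kinds offered when changing includes: required plus permitted."""
    required = REQUIRED_INCLUDES.get(collection_type, set())
    permitted = PERMITTED_INCLUDES.get(collection_type, set())
    return sorted(required | permitted)


print(missing_required("Reference Dataset", []))  # ['Ontology']
print(selectable_kinds("Taxonomy"))               # ['SKOS Ontology', 'Taxonomy']
```

A Reference Dataset created without any ontology is reported as missing its required include, which corresponds to the warning case described above.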

Creation Form

The Create dialog box asks for the Reference Dataset's Label (name) and, optionally, a Description.

Submitting the form creates a new Reference Dataset with you as its manager.

The ontology for the dataset's main entity class

Each reference dataset needs an ontology class to act as its main entity, which will be the class of the dataset's reference instances. From the existing ontologies listed for Ontology to Include, select the ontology that contains the class to be used as the new main entity. After submitting the creation form, the main entity class itself can be designated either via (1) the dataset's utilities: Settings > Metadata > Edit > Overview > main entity (class) drop-down selection or via (2) a form prompt that appears when the dataset is first edited.

The main entity's primary key

A reference dataset's main entity must have one property designated as the primary key. If the main entity's ontology class has already designated a property as its primary key, then it will be used (see EDG Ontology Editing: Setting a primary key for a class). If a main entity's primary key has not been otherwise specified, then the dataset itself will prompt for the choice of primary key property when the dataset is first edited. Thus, an ontology class with no primary key designation of its own can be used as the main entity of different datasets, with each dataset designating its own choice of property as the primary key. Note that any primary key property must have unique values across all instances of its class (see Ontology Editing: Setting a primary key for a class).
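
The uniqueness requirement can be checked before choosing a property as the primary key. The following Python sketch is illustrative only: it assumes instances are available as plain dictionaries (for example, from a spreadsheet export), and the function name duplicate_key_values and the sample currency data are hypothetical.

```python
from collections import Counter


def duplicate_key_values(instances, key_property):
    """Return values of key_property that occur on more than one instance."""
    counts = Counter(
        inst[key_property] for inst in instances if key_property in inst
    )
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical sample data: "code" is the candidate primary key property.
currencies = [
    {"code": "USD", "label": "US Dollar"},
    {"code": "EUR", "label": "Euro"},
    {"code": "USD", "label": "United States Dollar"},
]

print(duplicate_key_values(currencies, "code"))   # ['USD']
print(duplicate_key_values(currencies, "label"))  # []
```

In this sample, code would not qualify as a primary key until the duplicate USD entry is resolved, while label happens to be unique.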

Listing of Reference Datasets by Manage, Edit, or View

This home view lists all Reference Datasets that you can access in some way. Which ones you can see and what you can do with them depend on each Reference Dataset's permissions settings for your user identity or security role. The listing groups the Reference Datasets according to your assigned permissions as either a manager, an editor, or a viewer:

  • Reference Datasets that you manage
  • Reference Datasets that you can edit
  • Reference Datasets that you can view

You will only see relevant categories. For example, if you do not have manager permissions to any Reference Datasets, you will only see "Reference Datasets that you can edit" and "Reference Datasets that you can view" groupings.
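
The grouping behavior can be sketched as follows in Python. This is a simplified model, not EDG's implementation: it assumes each dataset carries exactly one effective role for the current user, and the function name group_by_permission and the sample dataset labels are hypothetical.

```python
# Group titles as they appear on the Reference Datasets home page.
GROUP_TITLES = {
    "manager": "Reference Datasets that you manage",
    "editor": "Reference Datasets that you can edit",
    "viewer": "Reference Datasets that you can view",
}


def group_by_permission(roles_by_dataset):
    """Group dataset labels by role and drop groups that would be empty."""
    groups = {title: [] for title in GROUP_TITLES.values()}
    for label, role in roles_by_dataset.items():
        groups[GROUP_TITLES[role]].append(label)
    return {title: sorted(labels) for title, labels in groups.items() if labels}


# Hypothetical user with no manager permissions on any dataset.
listing = group_by_permission({
    "Country Codes": "editor",
    "Currency Codes": "viewer",
    "Industry Codes": "viewer",
})
for title, labels in listing.items():
    print(title, labels)
```

Because this user manages nothing, the "Reference Datasets that you manage" group is omitted from the result, matching the example above.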

This page provides a focused, permission-oriented view of Reference Datasets. To see all asset collections for which you have a governance role, irrespective of their type, click your User Name in the upper right corner of the page.

If a Reference Dataset is missing from your views or lacks expected features, you or your security role(s) may lack the proper permissions for it. A manager of the Reference Dataset can give you the needed permissions via its Reference Dataset Utilities > User Roles settings. For background information, see Asset Collection Permissions: Viewer, Editor, and Manager.

Another possible cause of a missing feature is that it requires administrative setup to become active. See EDG Administration for relevant within-application settings and/or see other EDG Administrator Guide documents for relevant external installation and integration setup.

A Reference Dataset's Operations and Viewer/Editor Views

Each Reference Dataset has two main views:

  1. utilities, from the name link, provides groups of collection-level functions, and
  2. viewer/editor (depending on user permissions), from the View/Edit link, provides direct access to the Reference Dataset content items (e.g., instances, properties, classes, etc.).

These views are documented in the corresponding Reference Dataset Operations and Reference Dataset View or Edit pages.

Code Status

EDG lets you assign customizable status codes to certain types of data, such as reference data, taxonomy concepts, and, more generally, ontology resources. To facilitate this, the small prebuilt status code model included with EDG, http://topbraid.org/status, has three status values: candidate, approved, and deprecated. You can edit this set of choices in TopBraid Composer. After deploying the model to your EDG server and checking it on a vocabulary's Includes list (accessible from the vocabulary's General tab), you can see the choices available as radio buttons on a data instance's edit form and as a pull-down menu on the search form.

Alternatively, EDG can be configured to automatically include the status code model on the creation of certain collections.
