Skip to end of metadata
Go to start of metadata

Page Contents


Overview of Corpora

A Corpus is a collection of read-only textual items, such as documents, excerpts, websites, etc.—along with their associated metadata. The original items are always imported from external sources, such as content management systems or web sites and are never originally created nor edited within EDG. Thus, a version of the Tagger interface allows viewing of Corpus content, without editing. The textual content of Corpus items provides the foundation for manual or automated tagging and annotation with Content Tag Sets (serving as the content graph).

Corpora Home

Selecting the Corpora link in the left-navigation pane of TopBraid EDG (Home) lists all of the Corpus collections currently accessible to the user and, it allows authorized users to create new ones.

Prerequisites: Licensing and Enablement

The availability of any collection type (including Corpora and customer-defined types) is determined by what is (a) licensed and (b) configured under Server Administration. To install a license or to view the currently licensed features, see Setup > Product Registration. To configure which licensed collection types are currently enabled or disabled, see EDG Configuration Parameters > Configure Asset Collection Types. For general licensing information, see the TopQuadrant website, which describes the TopBraid products and the  data governance packages that determine the available collection types.

Create New Corpus

The Corpora > Create New Corpus link opens a form with fields used to define the new Corpus. Note that you can also create a Corpus by using a Create link in the Governance Areas page. 

Nobody will have a link for creating any asset collection until an administrator configures EDG's persistence technology as documented in Server Administration: Teamwork Platform Parameters: Application data storage . Additionally, each user will not have a create link unless the user or their role has a Create permission for the EDG Repositories project as documented in  EDG Rights Management .

Required and Permitted Includes

Collections often have natural relationships to other collections (e.g., each reference dataset's main entity class comes from an included ontology). Any collection using outside resources must first include the collections that contain them. Some inclusions might be required while others might merely be permitted. For example, taxonomies always include the SKOS ontology, and they may include other taxonomies. As mentioned, each reference dataset must include at least one ontology to define the dataset's entities. Glossaries always include the pre-defined EDG ontology that describes business glossary terms. Catalogs of data assets always include the pre-defined EDG ontology describing data assets and are expected to include definitions of relevant physical datatypes. These requirements can be further configured.

When creating a collection, any required reference to another type of collection will either be handled automatically or be presented for selection. If any required inclusion is omitted at its creation, then the resulting collection will show red warnings about the missing relationship(s). After creation, included collections can be changed using utilities view: Settings > Includes. When changing collection's includes, selection options are restricted to required and permitted types.

Whenever a new Corpus is created, EDG requires the user to select its data source, as Corpora can be configured to connect to an external source. EDG will then harvest content from that external source and store it in the project graph. Harvesting can be repeated later on demand, and changes to the external source's documents will be picked up. (Harvesting needs to be triggered manually in the Corpora management UI. So a Corpus does not synchronize automatically with its external source, but only when requested.)

Four types of connectors to such sources are currently offered, their respective creation wizard pages showing different forms depending on parameters that must me indicated for the connector to operate. These parameters can be adjusted later on by accessing  Manage > Corpus-CONNECTOR-TYPE Configuration .

  • No connector if content documents are available as RDF already, these can be imported into the Corpus with the usual RDF import function. Similarly, raw documents can be imported singly from local files as described in Import > Import Single Document. No external source will be configured with this connector type.
  • sitemap.xml If a website supports the sitemaps protocol, a configured sitemap.xml connector will harvest its content accordingly.
  • URL list This connector will simply fetch content from all of the URLs listed in its configuration.
  • CMIS If a website is an interface to a Content Management System and offers a Content Management Interoperability Services (CMIS) service endpoint as defined by the standard, a configured CMIS connector will harvest its content accordingly.

Once you have finished configuring your new Corpus, it will appear as a link on the Corpora tab.

Listing of Corpora by Manage, Edit, or View

This home view lists all Corpora that you can access in some way. Which ones you can see and what you can do with them depend on each Corpus's permissions settings for your user identity or security role. The listing groups the Corpora according to your assigned permissions as either a manager, an editor, or a viewer:

  • Corpora that you manage
  • Corpora that you can edit
  • Corpora that you can view

You will only see relevant categories. For example, if you do not have manager permissions to any Corpora, you will only see "Corpora that you can edit" and "Corpora that you can view" groupings.

This page provides a focused, permission level oriented view on Corpora. To see a view of all asset collections, irrespective of their type, that you have a governance role for click on your User Name in the upper right corner of the page.

If a Corpus is either missing or it is lacking expected features in your views, you or your security role(s) may lack proper permissions for the Corpus.  A manager of the Corpus can give you the needed permissions via its utilities' Users settings. For background information, see Asset Collection Permissions: Viewer, Editor, and Manager.

Another possible cause of a missing feature is that it requires administrative setup to become active. See EDG Administration for relevant within-application settings and/or see other EDG Administrator Guide documents for relevant external installation and integration setup

For each collection on the home page some brief metadata is available including information about workflows available to the user. In the image below the user has an action on the workflow that they have permission for. An action means they are in a role that is allowed to transition the workflow to the next state, such as "committed". If the user does not have an action, but they have permission for the workflow they will be presented a read-only view when accessing that workflow. 

  • No labels