Monday, November 23rd, 2020 9:23 AM

How to structure data dictionary?

It seems there are many different approaches to creating a data dictionary in Collibra, and I’m still confused as to which approach would be best for which use case.

Document the physical data dictionary
Using the “Description” attribute of “Column” assets to describe what it contains. This is the approach followed by Collibra in the Covid-19 catalog.
Pros: easy to create, easy to consume
Cons: overwritten when refreshing schema, so you have to update it in the source system, which may not always be possible (e.g. external data)

Business glossary
This was the traditional approach (and still is) to document columns, thanks to the default “represented by” relation type in the default data set view.
Pros: quite flexible, allows complex structured description
Cons: difficult to create, somewhat difficult to consume, can create confusion by ignoring small semantic differences between different columns

Guided stewardship - Physical/Logical/Conceptual model
This seems to be the way forward, given the roadmap and recent communications on data classification. This means that there are four types of metadata to track: Physical (table, column…), Logical (Entity, Attribute), Conceptual (Domain, Concept), Business Term
Pros: Very rich, interconnected knowledge graph
Cons: Very heavy, difficult to maintain, meaning can get lost between layers
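
To make the "four layers" concrete, here is a minimal sketch of how a column's meaning has to be resolved through the chain Physical -> Logical -> Conceptual -> Business Term. The class and function names (Column, LogicalAttribute, Concept, BusinessTerm, resolve_definition) and the sample values are hypothetical and purely illustrative; this is not Collibra's asset model or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BusinessTerm:
    name: str
    definition: str


@dataclass
class Concept:
    # Conceptual layer (hypothetical): may point to a business term.
    name: str
    term: Optional[BusinessTerm] = None


@dataclass
class LogicalAttribute:
    # Logical layer (hypothetical): may point to a concept.
    name: str
    concept: Optional[Concept] = None


@dataclass
class Column:
    # Physical layer (hypothetical): may point to a logical attribute.
    table: str
    name: str
    logical: Optional[LogicalAttribute] = None


def resolve_definition(column: Column) -> Optional[str]:
    """Walk physical -> logical -> conceptual -> business term.

    Returns None as soon as one link in the chain is missing.
    """
    logical = column.logical
    concept = logical.concept if logical else None
    term = concept.term if concept else None
    return term.definition if term else None


term = BusinessTerm("Customer", "A party that has bought at least one product.")
linked = Column("CUST", "CUST_ID",
                LogicalAttribute("Customer Identifier", Concept("Customer", term)))
orphan = Column("CUST", "LEGACY_FLAG")  # never linked to the upper layers

print(resolve_definition(linked))
print(resolve_definition(orphan))
```

The point of the sketch is the "Cons" line above: a single missing link anywhere in the chain leaves the column without a business meaning, which is why every layer has to be maintained for the graph to be useful.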

Which approach have you adopted and how well is it working out for you?
Currently, it is very difficult for us to engage with users that wish to document their data, because there is no easy way to capture that information for non-technical users.

3 years ago

Re: how much guided stewardship to do, we are focusing our stewardship on the Physical layer and the business side only. Physical includes Databases/Schemas/Tables/Columns/FKs/Views, while the business assets mainly include Business Terms/Acronyms/Reports/Report Attributes/KPIs/Measures, as well as Business Dimensions like Lines of Business/Data Concepts/Categories. While it would be nice to maintain every kind of Asset Type in Logical/Conceptual models, as you say, that would take too much time and effort for stewards. For an organization new to Collibra, it wouldn’t be a good recipe for success, as even Collibra recommends simplicity when starting up a DG initiative.

Columns aren’t ever maintained in a Glossary domain. Rather, the glossary should focus only on business definitions of common terms and terminology relevant to the business. Business terms are sometimes similar in meaning to a column, but these are all distinct and independent assets in Collibra.

We register and ingest databases into physical data dictionary domains to display in the Catalog. Column/table descriptions can be added manually or imported. We don’t maintain those on the databases themselves. Refreshes of the same assets don’t overwrite the descriptions.

3 years ago

Hi @arthur.burkhardt,

It depends, I guess. The ideal situation would of course be the last option. However, in practice I notice that even maintaining a structure of business glossaries that tie in to physical elements using the ‘is represented by’ relation (our current way of doing things) can already be a struggle. The notion that what is captured evolves and requires continuous attention is not always part of ‘business as usual’.

Regards, Ronald

3 years ago

Thank you both for your feedback, that’s very helpful.
If I understand correctly, both of you support the idea that describing data in a separate layer is a pain, let alone in 2 separate layers (logical/conceptual).
But the business glossary remains a useful artifact for providing business context, loosely related (some nuances might exist) to the data it represents.

If so, I wholeheartedly agree with both of you, which makes me wonder how guided stewardship might be adopted by companies. I already have experience of maintaining canonical models: either the model is externally maintained by a community of practice (e.g. HL7 FHIR in life sciences, ACORD in insurance…), or there’s a big internal undertaking, which mostly fails because it’s too hard, people don’t understand it, etc.
I’m worried that guided stewardship would end up in the second category: no company has the resources to maintain 4 separate layers of data definitions, so if the automated classification relies on it, it might end up being difficult to adopt.

@bmurphy.omers.com thanks very much for the info on the description. We mostly rely on custom technical metadata ingestion through the import API, so I wrongly assumed that the connectors would behave similarly.

3 years ago

As a side note, the 2020.12 release provides an interesting new feature. Description from the database has its own attribute, so as to retain both. https://productresources.collibra.com/docs/cloud-rn/2020.12

Now I have to check whether it works well with the “original name”.
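
For anyone wondering why a separate attribute matters, here is a rough sketch of the idea: if the description coming from the database is stored apart from the one written by stewards, a schema refresh can update the former without touching the latter. The names (ColumnAsset, source_description, refresh) are my own assumptions for illustration and do not reflect how Collibra actually implements it.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ColumnAsset:
    name: str
    description: Optional[str] = None         # written by stewards in the catalog
    source_description: Optional[str] = None  # comment read from the database


def refresh(catalog: Dict[str, ColumnAsset],
            scanned: Dict[str, Optional[str]]) -> None:
    """Apply a schema scan (column name -> database comment) to the catalog.

    Only source_description is updated, so steward descriptions survive.
    Columns seen for the first time are added without a steward description.
    """
    for name, db_comment in scanned.items():
        asset = catalog.setdefault(name, ColumnAsset(name))
        asset.source_description = db_comment


catalog = {"CUST_ID": ColumnAsset("CUST_ID", description="Unique customer key")}
refresh(catalog, {"CUST_ID": "PK of CUST table", "CREATED_AT": None})

print(catalog["CUST_ID"])
print(catalog["CREATED_AT"])
```

With a single shared description field, the same refresh would have to pick a winner, which is the overwrite concern raised in the original question.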
