
Thursday, November 12th, 2020 4:06 PM

Best Practice for Governance with Automated Lineage

Our data environment has multiple copies and views of physical data based on several critical underlying data sources, and we are trying to determine the best course of action to enrich and govern that data. Currently, we use a third-party lineage tool to extract metadata from a variety of data sources and populate our data catalog. The lineage scans run weekly and can add many assets as well as delete assets that no longer exist in the source systems. Our concern is that if/when the underlying data sources undergo technology changes, all data steward-added enrichments (attributes and relations) will be lost.

Our thought is to add a logical data layer above the physical data, where each column/field is tied to applicable governance rules, reference data, business glossary entries, and other assets via logical assets such as the “Data Attribute”. That way, if something removes or refactors the physical information in the data catalog, the enrichment information would not be lost.
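
For illustration, a minimal sketch of the kind of mapping we have in mind (the names below are made up for the example, not Collibra's actual asset model):

    # Sketch: enrichment hangs off the stable logical "Data Attribute";
    # the volatile physical columns are only referenced from it.
    data_attribute = {
        "name": "Customer Email",               # stable logical asset
        "governance_rules": ["PII handling"],   # steward-added enrichment
        "glossary_term": "Email Address",
        "reference_data": "Email Format Codes",
        "implemented_by": [                     # physical columns may churn
            "CRM_DB.customers.email",
            "DWH.dim_customer.email_addr",
        ],
    }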

Our question is: is there a better way to do this that we aren’t thinking of, and/or what is the best way to accomplish what we are planning to implement?

3 years ago

Hi @david.bandkau,

That’s a very interesting challenge you have; we will be facing the same challenge in the coming year.

So here are my two cents' worth: I assume that you integrate the third-party lineage tool using the API framework of the platform. If you are concerned about enrichments, I would never just overwrite something that is already in the system, nor delete data outright. I believe that adding another layer will not protect you from unwanted deletions or lost links when you overwrite things (although the enrichments themselves will be lost). My thought would be to use a REST API PUT (update) instead of a POST (overwrite), which will at least keep anything that is already there.
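
As a rough sketch of that idea (the endpoint path, payload, and credentials below are assumptions for illustration, not a documented Collibra contract, so check the platform's REST API reference; whether an update is a PUT or a PATCH also depends on the API's semantics):

    import requests

    BASE = "https://your-instance.example.com/rest/2.0"  # assumed base path
    session = requests.Session()
    session.auth = ("svc_lineage", "********")           # illustrative credentials

    asset_id = "00000000-0000-0000-0000-000000000000"    # placeholder id

    # Update only the fields the scan produced; steward-added
    # descriptions and relations are left untouched.
    resp = session.patch(f"{BASE}/assets/{asset_id}",
                         json={"displayName": "CUSTOMER_EMAIL"})
    resp.raise_for_status()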

For the deletion, I would split that off into a separate process, e.g. by having the workflow create a change request for the business steward to handle, and perform the deletion only after impact analysis and approval.
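
Something along these lines, as a sketch (start_workflow is a hypothetical stand-in for whatever starts a review workflow on your platform):

    def start_workflow(name, context):
        """Hypothetical stand-in for kicking off a review workflow,
        e.g. via the platform's workflow REST endpoint."""
        print(f"Change request '{name}' raised for {context['asset']}")

    # Instead of deleting assets missing from the latest scan,
    # queue them for steward review.
    previous_scan = {"customers.email", "customers.phone", "customers.fax"}
    current_scan = {"customers.email", "customers.phone"}

    for asset in previous_scan - current_scan:
        start_workflow("Asset deletion change request",
                       context={"asset": asset,
                                "reason": "absent from weekly scan"})
    # The business steward deletes only after impact analysis and approval.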

Hope this helps.

Kind regards,
Ronald


Great answer, Ronald!

My thoughts would be to use a REST API PUT (update) instead of a POST (overwrite)

If you use the Import API, the equivalent is to use the “import/xxx” endpoint instead of the “/import/synchronize/{synchronizationId}/xxx” endpoint. The former upserts without deleting; the latter keeps things in sync and deletes missing assets.
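
In script form, the difference might look like this (the base path, credentials, and payload are assumptions, and “xxx” is kept as quoted above, so substitute the actual resource segment):

    import requests

    BASE = "https://your-instance.example.com/rest/2.0"  # assumed base path
    session = requests.Session()
    session.auth = ("svc_lineage", "********")           # illustrative credentials

    payload = b"..."  # the import batch; shape elided, see the Import API docs

    # Upsert: creates or updates whatever is in the batch, deletes nothing.
    session.post(f"{BASE}/import/xxx", data=payload)

    # Synchronize: same upsert, but assets missing from the batch are
    # deleted once the synchronization cycle completes.
    sync_id = "weekly-lineage-scan"  # illustrative synchronization id
    session.post(f"{BASE}/import/synchronize/{sync_id}/xxx", data=payload)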


3 years ago

Wow. Tough question here.
The Guided Stewardship documentation might provide you with some hints as to what Collibra envisions with this operating model: https://datacitizens.collibra.com/forum/t/getting-started-with-knowledge-graphs-and-guided-stewardship/302

Ultimately, the question is which model best balances usability and reliability. I feel the physical/logical/conceptual three-tier model is VEEEERY heavy to maintain and use, but it seems to be the way forward. There will probably be UI updates that make it easier to manage by Jan 2021, so let’s see in the coming months whether that clears things up.

Our concern is that if/when the underlying data sources have technology changes, all data steward-added enrichments (attributes and relations) will be lost.

This is why you get the “refresh conflict” if you use the connectors; alternatively, you can control the logic in the third-party application.


Regarding the refresh conflict -

I did Snowflake incremental metadata load testing via the OOTB JDBC driver. Here are the results:

  1. If I drop an existing column of a table in Snowflake, the column is also deleted in the catalog after a schema refresh. So, naturally, all the enrichment (e.g. an added description) done to the column in Collibra is lost.
  2. If I add a new column to the same table, the column is added to the catalog after a schema refresh.
  3. When I renamed a column (e.g. employee_id to employee_number) in the table, both columns are marked as “refresh conflicts” after the schema refresh: employee_id stayed as-is and employee_number was added afresh. The enrichment done to employee_id is not deleted.

Now, what should the best practice be after reviewing the refresh conflict notifications?

  1. Should the data steward delete the employee_id column manually from the catalog?
  2. And map all the enrichment from employee_id over to employee_number? (See the sketch after this list.)
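
If the remap route is taken, here is a minimal sketch of what step 2 could look like (the catalog dict and copy_enrichment helper are hypothetical stand-ins; the real work would go through the REST API or a workflow):

    def copy_enrichment(source_col, target_col, catalog):
        """Copy steward-added enrichment from a conflicted column to its
        renamed successor, then drop the stale column. `catalog` is a
        plain dict standing in for API lookups."""
        src = catalog[source_col]
        dst = catalog[target_col]
        dst["description"] = src.get("description", "")
        dst["relations"] = list(src.get("relations", []))
        del catalog[source_col]  # delete only after the copy succeeded

    catalog = {
        "employee_id": {"description": "Unique staff key",
                        "relations": ["Term: Employee"]},
        "employee_number": {},
    }
    copy_enrichment("employee_id", "employee_number", catalog)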

Please suggest.


3 years ago

@david.bandkau we faced the same challenge when we started the journey 3 years ago. Our automated lineage from the data warehouse, data lake, and calculation engines created a high-volume physical data dictionary, which is the source of truth for the systems but unmanageable for tracking change. We decided to stitch the lineage across systems, so our physical lineage ends up as a single long lineage chain across multiple platforms, all the way to the report. We had to work with the platform owners to ensure a consistent interface contract is maintained, so that changes between systems do not affect the lineage.

We then recommended that the data stewards link the CDE/logical attribute to the end of the lineage chain, i.e. in the reporting layer. That way, any change between platforms has minimal effect on the CDE/logical attribute relationship. In the event the lineage at the report attribute level is broken, a workflow can be developed to notify the data steward; the steward then only needs to create a relationship to the new report attribute, and no rework of the long lineage is required.

There are some risks associated with the stitching, as the interface contract/integration pattern needs to stay consistent. We hold a monthly session to stay in sync and plan for any change. We also encourage using the cross-system lineage for change management; it has removed significant work for BAs and solution designers on data changes, and in some cases projects were able to shrink their analysis work because they could do the impact analysis much faster with the cross-system lineage.

In hindsight, the challenge was that not all systems would work with this, and vendor-managed systems do not allow the lineage tool access. But we managed to do this for our regulatory reporting pipeline, which was more critical.
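
As an illustration of that notification step, a minimal sketch (the asset names and the notify hook are made up; in practice this would be a platform workflow):

    def check_cde_anchors(cdes, report_attributes, notify):
        """For each CDE linked to a report attribute at the end of the
        lineage chain, alert the steward when the anchor attribute no
        longer exists, i.e. the chain is broken."""
        for cde, anchor in cdes.items():
            if anchor not in report_attributes:
                notify(f"Lineage broken for CDE '{cde}': report attribute "
                       f"'{anchor}' is gone; relink to its successor.")

    cdes = {"Customer Risk Rating": "RiskReport.customer_rating"}
    report_attributes = {"RiskReport.cust_rating_v2"}  # renamed upstream
    check_cde_anchors(cdes, report_attributes, notify=print)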
