
Tuesday, March 29th, 2022 12:45 PM

Data Profiling: performance and usability

We still make little use of data profiling. While running a few tests, I was rather disappointed by the performance of the local jobserver.

Source                                  Result
15.2 MB Parquet file (280 MB as CSV)    Out-of-memory error after three hours that crashed DGC
3 MB CSV file                           Still running after more than one hour

Support kindly let me know that the local jobserver is a trimmed-down instance that is only suitable for very small files (a few KB, I guess).
Are there any Data Citizens willing to share their experience with profiling? Did you manage to roll out profiling at scale on big data sets? Where do you deploy Edge/jobserver?

My experience with profiling outside of Collibra is quite different:
Profiling the same dataset on a regular business laptop took minutes and consumed at most 400 MB of memory (see the screenshot below for the result).
With so many cloud resources available, it also seems far more efficient to push the compute down to the systems that already hold the data and have resources readily available, instead of moving all the data to be analyzed.
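To illustrate the push-down idea, here is a rough sketch (not what I actually ran) that computes basic profile statistics inside Snowflake via snowflake-connector-python, so only the aggregates leave the warehouse. The connection parameters, table and column names below are placeholders:

```python
# Push-down profiling sketch: let the warehouse compute the statistics
# and fetch only the aggregated row. All identifiers are illustrative.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # placeholder
    user="my_user",            # placeholder
    password="***",            # placeholder
    warehouse="PROFILING_WH",  # placeholder
)

sql = """
    SELECT
        COUNT(*)                      AS row_count,
        COUNT_IF(AMOUNT IS NULL)      AS null_count,
        MIN(AMOUNT)                   AS min_value,
        MAX(AMOUNT)                   AS max_value,
        AVG(AMOUNT)                   AS mean_value,
        APPROX_COUNT_DISTINCT(AMOUNT) AS distinct_estimate
    FROM SALES.PUBLIC.ORDERS
"""

# Only one small result row crosses the network, not the data set itself.
row = conn.cursor().execute(sql).fetchone()
print(dict(zip(
    ["row_count", "null_count", "min", "max", "mean", "distinct_estimate"],
    row,
)))
```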

Here’s the result I got when using the pandas-profiling library
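A minimal sketch of such a run is shown below; the file name is a placeholder, and newer releases of the library are published as ydata-profiling rather than pandas-profiling:

```python
# Local profiling sketch (assumption: "dataset.parquet" is a placeholder
# for the 15.2 MB Parquet file mentioned above).
import pandas as pd
from pandas_profiling import ProfileReport  # "from ydata_profiling import ProfileReport" on newer releases

# Load the dataset into memory; Parquet keeps the footprint small.
df = pd.read_parquet("dataset.parquet")

# Build the profile and write it out as a standalone HTML report.
profile = ProfileReport(df, title="Local profiling run")
profile.to_file("profiling_report.html")
```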


2 years ago

We currently have several Snowflake schemas and AWS Glue Catalog databases (think of them as S3) that we profile, and store samples of, once a quarter via an AWS-based Jobserver.

I recall the Snowflake schemas taking a minimum of 24 hours and a maximum of 3 days (72 hours) to complete. I will share the Jobserver EC2 instance configuration and schema volumes sometime tomorrow so we can compare with what others are seeing.
