
Tuesday, March 29th, 2022 12:45 PM

Data Profiling: performance and usability

We still make little use of data profiling. While running a few tests, I was rather disappointed by the performance of the local jobserver.

Source                                  Result
15.2 MB Parquet file (280 MB as CSV)    Out-of-memory error after three hours that crashed DGC
3 MB CSV file                           Still running after more than one hour

Support kindly let me know that the local jobserver is a trimmed-down instance that is only suitable for very small files (a few KB, I guess).
Are there any Data Citizens willing to share their experience with profiling? Did you manage to roll out profiling at scale on big data sets? Where do you deploy Edge/jobserver?

My experience with profiling outside of Collibra is quite different:
Profiling the same dataset on a regular business laptop took minutes and consumed at most 400 MB of memory (see the screenshot below for the result).
With so many cloud resources available, it also seems far more efficient to push the compute down to the systems that already hold the data and have resources readily available, instead of moving all the data to be analyzed.
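To illustrate the push-down idea, here is a rough sketch (not what I actually ran) that computes basic profile statistics inside Snowflake via snowflake-connector-python, so only the aggregates leave the warehouse. The connection parameters, table and column names below are placeholders:

```python
# Push-down profiling sketch: let the warehouse compute the statistics
# and fetch only the aggregated row. All identifiers are illustrative.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # placeholder
    user="my_user",            # placeholder
    password="***",            # placeholder
    warehouse="PROFILING_WH",  # placeholder
)

sql = """
    SELECT
        COUNT(*)                      AS row_count,
        COUNT_IF(AMOUNT IS NULL)      AS null_count,
        MIN(AMOUNT)                   AS min_value,
        MAX(AMOUNT)                   AS max_value,
        AVG(AMOUNT)                   AS mean_value,
        APPROX_COUNT_DISTINCT(AMOUNT) AS distinct_estimate
    FROM SALES.PUBLIC.ORDERS
"""

# Only one small result row crosses the network, not the data set itself.
row = conn.cursor().execute(sql).fetchone()
print(dict(zip(
    ["row_count", "null_count", "min", "max", "mean", "distinct_estimate"],
    row,
)))
```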

Here’s the result I got when using the pandas-profiling library
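A minimal sketch of such a run is shown below; the file name is a placeholder, and newer releases of the library are published as ydata-profiling rather than pandas-profiling:

```python
# Local profiling sketch (assumption: "dataset.parquet" is a placeholder
# for the 15.2 MB Parquet file mentioned above).
import pandas as pd
from pandas_profiling import ProfileReport  # "from ydata_profiling import ProfileReport" on newer releases

# Load the dataset into memory; Parquet keeps the footprint small.
df = pd.read_parquet("dataset.parquet")

# Build the profile and write it out as a standalone HTML report.
profile = ProfileReport(df, title="Local profiling run")
profile.to_file("profiling_report.html")
```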


2 years ago

We currently have several Snowflake schemas and AWS Glue Catalog databases (think of them as S3) that we profile, and store samples of, once a quarter via an AWS-based Jobserver.

I recall the Snowflake schemas taking a minimum of 24 hours and a maximum of 3 days (72 hours) to complete. I will share the Jobserver EC2 instance configuration and schema volumes sometime tomorrow so we can compare with what others are seeing.
