I am loading data as part of a continuous integration process. The setup is roughly this:
- PostgreSQL back end on Amazon RDS.
- AWS Lambda integration functions move data from S3 buckets to Postgres, and the data is then loaded using the "database" method into the SA_ and SD_ tables.
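For context, each Lambda essentially parses the S3 file and inserts rows straight into an SD_ table. This sketch is simplified (the `build_insert` helper, table, and column names are illustrative, not my real code):

```python
# Simplified sketch of the current Lambda-side load: rows parsed from the
# S3 file go directly into an SD_ table via parameterized INSERTs.

def build_insert(table, columns):
    """Build a psycopg2-style parameterized INSERT statement."""
    col_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})"

rows = [
    ("C001", "Acme Corp", "US"),
    ("C002", "Globex", "USA"),  # over-length country code: Postgres rejects this row
]

sql = build_insert("SD_CUSTOMER", ["B_SOURCEID", "NAME", "FID_COUNTRY"])
# In the real Lambda this would be executed with psycopg2, e.g.:
#   cur.executemany(sql, rows)
print(sql)
```

Nothing in this path does any model-aware validation before the INSERT, which is exactly the gap described below.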
When I load data manually, the interface helpfully reports various potential errors that would prevent the data from loading into the source data tables at all: un-parseable dates, over-length strings, and so on. Today's use case is an over-length string.
If I run the process manually, I get an error for the offending record. But when I load that same record directly into the SD_ table, PostgreSQL of course throws an error on the over-length value (which, I suppose, is better than silently truncating it).
Incoming value validations: I would rather not write a validation process that mimics what Semarchy does during manual (UI-based) loading, but I do want a way to detect these conditions, fail the load of those records into the source tables, and have the failures logged somehow for action by a Data Steward.
Incoming value enrichments: In addition to checking values, it appears that Semarchy also runs the enricher process during a manual load. I can tell because I have a couple of enrichers that pre-check the data and modify it as needed so it can load. For example, this enricher expression looks at the country code and, on the assumption that any over-length code means the United States, replaces any code longer than two characters with "US":
CASE
  WHEN FID_Country IS NULL OR FID_Country = '' THEN 'US'
  WHEN LENGTH(FID_Country) > 2 THEN 'US'
  ELSE FID_Country
END
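For illustration, the hand-rolled equivalent of that enricher, which I would otherwise have to duplicate and maintain inside the Lambda, would look something like this (a sketch; `fix_country` is a hypothetical name, not part of any Semarchy API):

```python
def fix_country(fid_country):
    """Mirror of the Semarchy enricher expression: default missing country
    codes to 'US' and shorten any over-length code to 'US'."""
    if fid_country is None or fid_country == "":
        return "US"
    if len(fid_country) > 2:
        return "US"
    return fid_country

print(fix_country(None))   # -> US
print(fix_country("USA"))  # -> US
print(fix_country("FR"))   # -> FR
```

Duplicating every enricher this way is exactly what I am hoping to avoid, since the logic already lives in the Semarchy model.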
I would like a clean mechanism for loading this data and, if possible, to leverage whatever Semarchy uses to produce the nice validation checks I see in the manual process. So my questions are:
- Does Semarchy use a database-level mechanism for the manual load's validation checks (string length, date format, and so on), and if so, can I leverage it?
- Likewise, how can I run the enrichers on incoming data so that my pre-correction of values succeeds when loading on the back end?
- If neither is possible, is there some other way to engage the Semarchy engine so I can take advantage of this model-based review and capture/detect errors before loading data into the source tables?