There are three ways to add data to the catalog, each suited to a different use case.
- Data Virtualization. Data can be “virtualized” into the data catalog: the catalog scans a preconfigured warehouse and surfaces its contents inside the catalog for easy access. Because the data is virtualized rather than ingested, this approach is well suited to organizations that want to catalog their data without loading it into another warehouse.
- Cloud Connections. Data can be ingested via the platform’s ETL (extract, transform, load) pipeline from cloud sources such as GCP, AWS, or (S)FTP servers. This connection type can also pull data directly from websites that host it.
- Simple Import. Data can be ingested via the platform’s ETL pipeline from a user’s local files. This connection type works well for consolidating disparate datasets that update infrequently.
Many catalog users will find it helpful to combine all three approaches. For example, a medium-sized company might a) virtualize the contents of its Snowflake warehouses and Oracle database, b) set up connections to its legacy FTP drives and AWS buckets, and c) drag-and-drop datasets from old thumb drives into the platform, as sketched below. Together, these approaches let an organization consolidate the entirety of its data into the catalog.
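To make that combined setup concrete, here is a minimal sketch of what configuring all three connection types might look like. The field names and configuration structure below are illustrative assumptions, not the platform’s actual configuration format or API.

```python
# Hypothetical connection configurations for the scenario above.
# Field names ("type", "source", "config", etc.) are assumptions
# chosen for clarity, not the platform's real schema.

connections = [
    # a) Virtualization: the catalog scans each warehouse in place; no copy is made.
    {"type": "virtualization", "source": "Snowflake Warehouse",
     "config": {"account": "acme", "database": "ANALYTICS"}},
    {"type": "virtualization", "source": "Oracle Database",
     "config": {"host": "oracle.internal", "service": "PROD"}},

    # b) Cloud connections: the platform ETL ingests from remote locations.
    {"type": "cloud", "source": "AWS",
     "config": {"uri": "s3://acme-legacy-data", "schedule": "daily"}},
    {"type": "cloud", "source": "Legacy FTP",
     "config": {"uri": "ftp://files.acme.com/exports", "schedule": "weekly"}},

    # c) Simple import: one-off uploads of local files.
    {"type": "simple_import", "source": "Local Files",
     "config": {"path": "./thumb_drive_exports/sales_2012.csv"}},
]

for conn in connections:
    print(f"Registering {conn['type']} connection for source {conn['source']!r}")
```

Note that only the two “cloud” entries carry a schedule: virtualized data is scanned in place, and simple imports are one-time transfers.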
All data brought into the catalog requires a Source and a Connection.
A Source describes where the data originates; it is the data’s “parent”. Sources are useful metadata. A typical Source might be named something like “Snowflake Warehouse” or “Data.gov”.
A Connection is the mechanism by which data is gathered, whether via virtualization, FTP, website, or file transfer. A single Source may have many Connections.
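Because one Source can own many Connections, the relationship is easiest to picture as a one-to-many data model. The sketch below uses plain Python dataclasses with assumed class and field names; it illustrates the relationship only, not the platform’s internal schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Connection:
    """One way data is gathered for a Source (hypothetical model)."""
    name: str
    method: str  # e.g. "virtualization", "cloud", "simple_import"

@dataclass
class Source:
    """Where the data originates: the data's 'parent' (hypothetical model)."""
    name: str
    connections: List[Connection] = field(default_factory=list)

# A single Source can hold many Connections.
snowflake = Source(name="Snowflake Warehouse")
snowflake.connections.append(Connection(name="prod-scan", method="virtualization"))
snowflake.connections.append(Connection(name="archive-load", method="cloud"))
```

In this picture, each Connection records only how the data arrives; the provenance metadata lives on the parent Source.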