- All files that match to a key prefix should be of the same data format type (csv, json, or Parquet).
- Related data of different formats should be put into different key prefix location, (for example:
- Amazon S3: /S3/YOUR-BUCKET.URI/csv-table1 and /S3/YOUR-BUCKET.URI/json-table-1
- Azure Blob storage and Azure Data Lake Storage Gen2: /AZ/YOUR-STORAGE-ACCOUNT.URI/csv-table1 and /az/YOUR-STORAGE-ACCOUNT.URI/json-table-1
- GCS: /gs/URI/YOUR-BUCKET/csv-table1 and /gs/URI/YOUR-BUCKET/json-table-1
- Files that are part of different logical tables should be located at different key prefix locations, for example:
- Amazon S3: /S3/YOUR-BUCKET.URI/emp-table and /S3/YOUR-BUCKET.URI/dept-table
- Azure Blob storage and Azure Data Lake Storage Gen2: /AZ/YOUR-STORAGE-ACCOUNT.URI/emp-table and /az/YOUR-STORAGE-ACCOUNT.URI/dept-table
- GCS: /gs/URI/YOUR-BUCKET/emp-table and /gs/URI/YOUR-BUCKET/dept-table
- As data files are added to a bucket, the specific formats of the files may change. For example, fields may be added to newer CSV data that was added to the same bucket, or field ordering may change. In these cases, the best practice is to include a header row in each CSV file to identify the fields (column titles), rather than to define a CSV schema for the foreign table, which limits all CSV files to using the same field pattern.
- If the CSV files have different fields of data, but do not have individual file headers, you should group the files with the same fields into different key prefix locations.
- If you define a schema for the CSV data in a foreign table, ensure that all files at the location have the same fields. If files in the bucket have different fields, organize the files in the bucket into different key prefix locations before accessing the external data.
- If different kinds of data are included at a single key prefix location, querying that data will be inefficient. For example if you mix department and employee data in a single key prefix location, a query looking for a particular employee would require Vantage to read all the files at that location, including files that contained only department data.
Best practice is to group files containing the same kind of data into a single external storage key prefix location, and use that key prefix location to create a single foreign table, or in a single READ_NOS query. This applies to all kinds of external data, JSON, CSV, and Parquet. (Parquet format external data can only be read by creating a foreign table, not by a direct READ_NOS query.)
The same logic applies to mixing CSV files that have different headers. A key prefix location may include several of these CSV files, but querying the data would be inefficient due to the disparate field patterns.
For the same reason, do not mix CSV or Parquet files that have different schemas at the same key prefix locations. In these cases, querying the data is not only inefficient, but it also results in many warnings indicating skipped records and files.