-
-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Way to override field definitions at the OGR driver level? #10943
Comments
@rouault why an open option instead of adding a new argument to GetLayer() or GetFeature()? I'm very interested in helping shape what a schema description looks like. I've been wanting to improve Fiona's schema for a few years, and maybe this will be thing that makes me do it. It should be the same as OGR's, certainly. |
Now
Maybe it does fulfill all requirements as it stands now, but I think that users would appreciate if they could use ogrinfo for creating a template to edit. |
OGRLayer::GetNextFeature() would definitely not be the appropriate entry point. This is way too late, as the schema can also be retrieved by OGRLayer::GetLayerDefn(). |
The downside of a dataset-level open option is for multi-layer datasets. This will require the JSON document to be something like: {
"layerA" : { },
"layerB" : { }
} and possibly accepting directly the content of a layer for single-layer dataset? |
... or have a OGRLayer::SetLayerDefn() (for symmetry with GetLayerDefn()) or OverrideLayerDefn() only implemented by drivers advertizing a OLCSetLayerDefn capability . This should be called before any feature has been retrieved . But SetLayerDefn(OGRFeatureDefn*) would raise interesting questions. We would probably have to require that the number of field definitions passed is exactly the same as the one returned by default and field names match existing ones (no renaming). So this is just to alter the field type, subtype and potential width and precision. We already have the precedent of OGRLayer::SetActiveSRS for altering the layer defn. Actually SetActiveSRS() could be seen as a particular case of a more general SetLayerDefn(). But unfortunately that might still be too late. The GeoJSON driver is the interesting case. Currently it must necessarily do a full scan of the file to establish the schema. If we pass the schema at opening, we could entirely skip it and save that initial scan completely. But establishing the schema is done at opening time, not at OGRLayer::GetLayerDefn() time. The GML driver is very similar to that (except that in the GML case we might still have to do the initial scan, when there's no .xsd we can parse, since GML might contain multiple layers...). OK, we might still try to alter the schema after the guess stage. That wouldn''t be as efficient as we could dream, but that could be workable. So I'm currently leaning more on OGRLayer::SetLayerDefn() than the open option |
actually, no... SetActiveSRS() is about on-the-fly "reprojection". Whereas if we would allow SetLayerDefn() to modify the SRS of geometry field(s), that should behave like a pure override (like ogr2ogr -a_srs), than a reprojection. Not saying that we would need to support that for now. |
@jratike80 yes, I agree, symmetry with the output of ogrinfo would be good for the project. |
Interesting discussion, am I correct that there are two separate use cases here?
these two options need probably two different approaches but I think that the first one is more flexible and can probably accomplish the second use case too: in the second scenario client code knows the field definitions and can possibly reopen the layer passing the overridden field definitions, effectively bypassing the initial scan and the guess logic which is supposed to be the slowest part of the opening process. For this reason I think we should allow to override the field definitions as soon as possible during the opening phase. |
I think in an ideal world it would be possible to override the type of one field, and have the rest auto-detected. Often it will be one or few fields that get autodetected incorrectly and you would want overridden for several/many different datasources. Applying that to the json approach it would be great that you don't need to specify all columns... that the definition available there would be used, and that the columns/properties not specified would be autodetected. |
Brainstorming: if we go for the open option with a JSON payload, what about requiring a top-level property:
|
@rouault it looks like json-c implements RFC 6902 JSON Patch: https://json-c.github.io/json-c/json-c-0.18/doc/html/json__patch_8h.html. In theory, this would allow fine-grained patching, like 'replace the |
I'm not sure how that would help. JSON here is just a (potential) vehicle to transport the user wishes, but most of the work will be in C++ driver code that has nothing to do with JSON / libjson. |
So you mean "autodetect everything, except field x, that I know you will autodetect incorrectly". Ok, it may be possible to know that field x will be autodetected incorrectly if GDAL has guessed it wrong before. So how can user know that GDAL makes a bad guess? The command line user would do it this way: a) run ogrinfo For the user it is not much easier than to a) run ogrinfo or equivalent with What is better in the latter alternative for my mind is that user knows what will happen and takes the responsibility of the datatypes of all the fields. I believe that if the full schema is given then GDAL can save some time because it does not need to do autodetect again. |
That might be a "Warning 1: Integer overflow occurred when trying to set 32bit field." (cf #10938), which will be improved with #10945
That might be a totally reasonable workflow indeed. However we should make sure to specify in the documentation that basically only the changes in field definitions would be taken into account and the rest of elements ignored (ie users trying to patch the feature count!). The user might be tempted to change a lot of other elements (metadata, field domains, etc), which could potentially be implemented in an ideal world, but that would be a lot of work, derailing the effort. |
I meant that somehow just the field definitions would be saved into the schema file; and symmetrically only those would be read. Or is there anything else reasonable to save/read?
|
I was thinking you meant the user would just edit the output of "ogrinfo my.csv -json", which would have the advantage of not introducing a new option for ogrinfo |
In our case, but I guess this is not an uncommon case, we have some fields in our corporate database where data dumps are created from for further processing that we know are typically misdetected (or have been in the past):
We have many 100's of files that contain these columns as well as a variety of other columns that can be present. So in this case it would be practical to just always be able to specify that those two column should not be autodetected and all the rest will typically be fine when autodetected. In a recent issue in geopandas (that was solved by #10902) a user explained having the same basic case: the same column giving an issue in many different data sources: geopandas/geopandas#3431 (comment) |
There are several OGR drivers must guess the attribute data type: CSV, GeoJSON, SQLite (see #10938 (comment)), GML (see Toblerity/Fiona#1454) etc.
ogr2ogr has flags to alter the field types, but it doesn't override what the source driver has guessed, and so it might be too late.
Couldn't we come with some way of overriding the guess of the source driver?
A conventional open option SCHEMA (it doesn't look to be used by any driver) that would accept a JSON payload ? Or maybe rather a OGRLayer::SetLayerDefn(const OGRFeatureDefn*)
The text was updated successfully, but these errors were encountered: