You know what I want? Schemas. And clear error messages.
I want to know beforehand what I can put in a config file and I want a fast and hard failure if what I put in there is not good.
And this should be implemented at the file format parser level, with hooks for apps to add on top of the default behavior, so that every app that implements this format gets these things almost for free.
Haven’t you just described Cap’n Proto, Protobuf, Thrift, FlatBuffers, etc.?
I know cap’n’proto also has fantastic support for using the schema for config files. You can just compile any constant as a stand-alone serialized message that you mmap into your code in a safe way. It can’t do complex math and things (at least yet) but you can express lists, dictionaries, and reference other constants, so as a config file replacement I love it. I’ve also found the format to be far more regular and consistent than you get with things like text protobuf (you’re still using the schema language instead of another format)
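For illustration, a constant in a Cap'n Proto schema looks roughly like this (the struct and values are invented for this sketch, not from any real deployment):

    # Hypothetical schema: a const like serverConfig can be compiled to a
    # standalone binary message and mmap'd safely at runtime.
    @0xdbb9ad1f14bf0b36;

    struct ServerConfig {
      host @0 :Text;
      port @1 :UInt16;
      backends @2 :List(Text);
    }

    const serverConfig :ServerConfig = (
      host = "localhost",
      port = 8080,
      backends = ["a.internal", "b.internal"]
    );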
You store your configuration as plain text in your repository and whatnot. When it comes to deployment you just compile it to a binary file.
Cap’n’proto also has plain text and JSON serialization formats if you really want to have your deployed config file be directly human-editable and deserialize from that. I was just noting a very cool feature of having your config written in cap’n’proto and it’s what Cloudflare uses to maintain a bunch of config internally if I read Kenton’s allusions to it correctly.
I think the parent is trying to say that the data is stored in a map which is read into a proto, etc. Kinda like what gRPC does over HTTP. Which kinda makes sense. The schema gives you a great idea of what "should be", and the typing/errors/etc are understood by the host language.
We had that a decade ago. It was called XML and XML Schema. All IDEs support it.
JSON was a huge step backwards in the name of simplicity. And now when we are going to add similar functionality to JSON, something else is going to come out in the name of simplicity (like NestedText).
If your child node has a unique name among its siblings and does not contain nested nodes, then it's an attribute. Otherwise it's an element. Seems pretty obvious to me.
The fundamental issue with XML is its impedance mismatch with common data structures which forces using Object to XML mappers (whether explicitly or implicitly). It's more or less solved with XML Schemas or DTDs, but if you're looking at just XML, you can't tell whether some element is an array or a single node. Thus JSON is better suited for serialization.
> If your child node has a unique name among its siblings and does not contain nested nodes, then it's an attribute. Otherwise it's an element. Seems pretty obvious to me.
That is really not what attributes are for. I feel a bit of a fraud posting that because I'm not an XML expert and so not really clear what they actually are for. (This reinforces the parent's point: you need to be an expert to know what such a fundamental feature is for.) I remember it's something like "something used to help interpret the actual value" e.g. units of measurement. But most of the time, even if it's non-repeating with no children, you're supposed to use elements rather than attributes.
One problem here is that attributes are so much more compact (and so often easier to read) than elements that it's tempting to use them in places where you ought to use an element (and many people over time have given in to that temptation). Another problem is that the distinction between attributes and elements is almost never useful. That was the parent comment's point by the looks of things.
> The fundamental issue with XML is its impedance mismatch with common data structures
That's probably part of it, but I think at least as problematic is that it has many features that most of the time you don't need and don't want to have to care about. Things like CDATA (also mentioned by the parent comment), custom entities, external entities, DTDs (which can be inline in XML files so you need to know all about DTDs to understand XML properly). That's why there are all sorts of weird XML vulnerabilities that JSON doesn't have. Did you know you can make an XML file that reads your /etc/passwd file when it's parsed? That is not an issue with JSON.
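The classic example is an external entity declaration, the well-known XXE pattern:

    <?xml version="1.0"?>
    <!DOCTYPE foo [
      <!ENTITY xxe SYSTEM "file:///etc/passwd">
    ]>
    <foo>&xxe;</foo>

A parser with external entity resolution enabled substitutes the file's contents for &xxe;, which is why most modern XML libraries ship with that feature disabled by default.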
HTML tags and attributes are markup. Strip them and the document is still legible to a human being. Markup is the non-human part: presentation, the semantic web.
Thanks, I found this explanation really helpful, and almost obvious in retrospect (as the best explanations often are!).
I had been thinking that all of these extra features that XML have are just a case of massive overengineering that no one would ever need. In fact it's a case of taking something fundamentally meant for text documents with extra markup, as the name implies, and misapplying it to config files and IPC messages which are just not the original domain at all.
The machine receives the data; the human receives an application with documentation, a builder. That's exactly what we have today, except the UI can be plugged into any stored document. Too good to be true.
I think XML was killed by poor usability. Plain text XML, XHTML and XSLT authoring is not fun.
I am trying to approach it from the DOM perspective [1]; so far I like it more than Markdown. XHTML and HTML are just serialization formats, and HTML is not a good one [2], [3], [4]. XSLT could have a nice GUI or a compact syntax like RELAX NG's.
> Did you know you can make an XML file that reads your /etc/passwd file when it's parsed?
Not only can SGML (but not XML on its own) read /etc/passwd, it can format it into fully-tagged markup and then render it into eg an HTML table. Demonstrating what SGML/XML is actually designed for: encoding and authoring semistructured text. This can't be overstated in discussions like these where use cases for config formats, service payload formats, and actual text authoring are all thrown into the same basket when they shouldn't.
Btw: you can parse and canonicalize this new config file format into markup using the same SGML mechanism you'd be using for CSVs like /etc/passwd, namely short references
Btw2: you can skip/ignore markup declarations in XML, including whole declaration sets (DTDs), since these can be recognized using plain greedy regexps, though you can't ignore entity declarations when they're actually used in your XML body text
> you need to be an expert to know what such a fundamental feature is for.
No you don't... the parent commenter explained to you what it's for in a simple and concise manner... you chose not to accept that even though you're not an expert in this, and then complain that you need to be an expert to do it?!?
The parent commenter gave an explanation that, yes, was simple and concise, and also good enough for you to believe it (or you already thought that way). But it's also wrong. That just reinforces my point.
(The true difference is explained in sibling comments to yours, by sergeykish and tannhaeuser, if you're interested.)
The parent commenter explained it in a somewhat obtuse way.
I don’t doubt they meant to be clear, but reading it they were not and raised more questions than were answered.
As an example:
Wouldn't attributes be better served as details about the current element?
Wouldn’t elements be better served as “I am a child of the parent”?
Why would I use an attribute as a “non-repeating child” when semantically that doesn’t make sense when looking at the document? The attribute is inside the element’s definition, and seems to me attributes should be used to further describe the element being presented itself, and not be structural or describe itself as a child in any way.
JSON Schema [1] is actually a mature standard now, with decent tooling support, mostly through OpenAPI (formerly Swagger), which extends it with support for endpoints.
It's much simpler to use than XML Schemas, and arguably results in cleaner data models, since it doesn't have anything analogous to XML namespaces that allow for arbitrary mixing of schemas.
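As a sketch, a minimal JSON Schema for a config file might look like this (the field names are invented for illustration):

    {
      "$schema": "https://json-schema.org/draft/2020-12/schema",
      "type": "object",
      "properties": {
        "host": { "type": "string" },
        "port": { "type": "integer", "minimum": 1, "maximum": 65535 }
      },
      "required": ["host", "port"],
      "additionalProperties": false
    }

With "additionalProperties": false, a typo'd key is a validation error rather than a silently ignored setting.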
> We had that a decade ago. It was called XML and XML Schema.
It would be true if XML were not full of all this SGML debris like "entities" (really, uncontrolled macros), if the real schema formats were flexible enough (I needed <c> inside <a> and <c> inside <b> to be totally different things), etc.
But when a config reader tool has to deal with a 40+-year legacy of enterprise folks wanting to embrace the universe, and all of this still doesn't let you control the contents without external measures like regexp checking... it simply falls apart when facing the real world.
Magento is a popular codebase that made XML-based configuration a fundamental part of its architecture. The results were terrible and caused numerous headaches and countless hours lost to trying to troubleshoot inscrutable configuration issues. The Magento 2 codebase began a shift away from XML for configuration, although it still uses some.
There may be room for an argument that Magento did XML badly (it did many things badly), but I don't believe I've ever seen XML done well.
I don't get it. The @Configuration and @Bean annotations are at least 100 times more readable and powerful than whatever garbage people used to write into their xml files to define beans. 20 lines of xml are often equivalent to like 8 lines of Java and each of those Java lines is shorter than the xml equivalent. Repeating closing tags is not very interesting.
Exactly, JSON Schema allows one to describe exactly how the JSON should look, including inter-field validation. And with tools like react-jsonschema-form you can generate a UI on top of it for free.
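Inter-field validation can be expressed with keywords like if/then or dependentRequired; for example (fields invented for illustration):

    {
      "type": "object",
      "properties": {
        "auth": { "enum": ["none", "basic"] },
        "username": { "type": "string" },
        "password": { "type": "string" }
      },
      "if": { "properties": { "auth": { "const": "basic" } } },
      "then": { "required": ["username", "password"] }
    }

Here username and password are only mandatory when auth is "basic".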
I spent years working with xml, xslt, xml schema. Frankly when I first saw json I thought it was terrific. Nothing has changed my mind since. Why do you feel like it is a huge step backwards?
XML is fatally flawed because you can't safely put one XML doc inside another one. Because of this rather fundamental problem, it never was any good for anything, and it never will be.
Sure you can. At work we talk to a system that requires that we do exactly this. The solution they chose is entirely trivial and safe: include the embedded doc as a base64 encoded string...
SOAP was and is an epic disaster, so that hardly seems like a refutation. The known way to embed an entire XML doc into a SOAP message was to use CDATA, which isn't a general solution because it means the embedded doc can't have ]]> in it anywhere. You could also base64-encoded the included doc.
Both of these solutions and all other known solutions to this problem are, as I'm sure you can see, just awful.
You can't just paste XML in XML because of the <?xml?> thing, because of entities, and because of half a dozen other misfeatures of XML.
Roughly speaking, you can do things like the following:
    <!-- The special xmlns attribute binds a short alias to a long name -->
    <p:parent xmlns:p="urn:some:unique:string">
      <c:child xmlns:c="urn:some:other:child:name" x="3" y="5">
        <c:subchild> <!-- No need to repeat the fully qualified unique name -->
          <p:tada>You can even interleave!</p:tada>
        </c:subchild>
      </c:child>
    </p:parent>
Note that while this is possible to write by hand, typically namespaces are for documents generated and processed by tools. The XML Schema Definition (XSD) format has full support for namespaces, so you can define documents based on modular chunks. E.g.: you can "import" the SVG namespace into a diagramming XML document format namespace, but restrict its usage to only the child nodes of an "img" tag. Or MathML as the children of "graph" nodes. Both SVG and MathML can potentially import a shared "font" namespace. Or whatever.
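A rough sketch of that kind of import (the schema locations and names here are invented):

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               targetNamespace="urn:example:diagram">
      <!-- Pull in the SVG schema under its own namespace -->
      <xs:import namespace="http://www.w3.org/2000/svg"
                 schemaLocation="svg.xsd"/>
      <!-- Only "img" may contain SVG content -->
      <xs:element name="img">
        <xs:complexType>
          <xs:sequence>
            <xs:any namespace="http://www.w3.org/2000/svg"
                    processContents="strict"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>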
In the XML Reader API, each element has a "fully qualified" name that includes the long namespace prefix. If you use the API correctly, your tool can handle nested documents, or gracefully ignore them if it's appropriate.
The fiddly part is making this efficient, i.e.: avoiding a full string comparison against a long URI or URN. You typically have to "register" the namespaces you're interested in, and the API gives you some sort of efficient token instead of a string to use from then on.
I'm not saying it's perfect. Nothing is in XML. It was designed by committee, it brought too much of the legacy SGML baggage with it, but its namespace capabilities are a lot better than nothing at all, in much the same way that C# or Java don't have perfect type systems, but they're superior to loosely typed languages.
You don't embed plain text XML in CDATA, right? You escape it:

    function escapeXml(unsafe) {
        return unsafe.replace(/[<>&'"]/g, function (c) {
            switch (c) {
                case '<': return '&lt;';
                case '>': return '&gt;';
                case '&': return '&amp;';
                case '\'': return '&apos;';
                case '"': return '&quot;';
            }
        });
    }
Or you convert to the same encoding, strip the XML declaration, and expand entities. In short: work with adequate tools.
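For instance, a round trip of escaping one document into another looks like this (a standalone sketch; escapeXml is repeated here, with the entity replacements spelled out, so the snippet runs on its own):

```javascript
// Escape one XML document so it can be embedded as text content
// inside an element of another document.
function escapeXml(unsafe) {
  return unsafe.replace(/[<>&'"]/g, (c) => ({
    '<': '&lt;',
    '>': '&gt;',
    '&': '&amp;',
    "'": '&apos;',
    '"': '&quot;',
  }[c]));
}

const inner = '<doc note="a&b">hi</doc>';
const outer = `<envelope><payload>${escapeXml(inner)}</payload></envelope>`;
// outer contains no stray markup from the inner document; a parser
// reading <payload> recovers the original string after entity expansion.
```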
Came here to say the same, Cuelang is by far the best config system and paradigm I have tried. All else seems so last century, though Cuelang has its foundation in NLP systems from last century :]
Slightly off-topic, but yes, having fail-fast deserialisation is great.
I wrote a json/kotlin-serialisation library once and purposely restricted some json-features to achieve that:
1. Fields can arrive in any order - this is standard
2. Field names are matched case-insensitively - so keyA and keya are the same, because who would use two variables differing only by case. Serialization keeps the original casing of the name.
3. Missing fields throw an error. if they are nullable, they have to be explicitly set to null - so that you can be sure the serialization side upgraded to the latest version of a protocol if a field was added, and things don't just work by chance.
4. Null is not coerced to an empty string or anything like it. Kotlin is null-safe, so if it's a (non-nullable) string, it has to be an actual string, even if that's just "". If it's, for whatever reason, a nullable string, you can set it to null.
5. Enums are also deserialized case-insensitively - so you can write "keyA": "eNumVaLuE" if you want. Typos should not break the code here, and no one would use two enum values differing only by case. IIRC booleans could also be TRUE, tRuE, truE, etc. (but NOT t or f, or yes or no, or 0 or 1, or empty).
6. Superfluous properties are silently ignored.
These rules were a great tradeoff for quick development, mixing languages and having fail-fast behavior with a stable protocol.
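A sketch of rules 2, 3, 4, and 6 in JavaScript rather than Kotlin (not the commenter's actual library; the spec shape and names are invented for illustration):

```javascript
// Strict JSON deserialization against a simple field spec:
//   spec = { fieldName: { nullable: true/false } }
function strictParse(json, spec) {
  const raw = JSON.parse(json);
  // Rule 2: match incoming field names case-insensitively.
  const byLower = {};
  for (const [k, v] of Object.entries(raw)) byLower[k.toLowerCase()] = v;
  const out = {};
  for (const [name, opts] of Object.entries(spec)) {
    if (!(name.toLowerCase() in byLower)) {
      // Rule 3: missing fields throw, even nullable ones --
      // the sender must write an explicit null.
      throw new Error(`missing field: ${name}`);
    }
    const value = byLower[name.toLowerCase()];
    if (value === null && !opts.nullable) {
      // Rule 4: no null-to-"" coercion; null is only legal if declared.
      throw new Error(`field ${name} must not be null`);
    }
    out[name] = value;
  }
  // Rule 6: properties not in the spec are silently ignored.
  return out;
}
```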
I can see this working perfectly fine in typed languages like C#: `NestedText.Deserialize<T>("nestedtext")`, where the deserialize method handles the actual mapping of nested text objects to `T` by providing the deserializer a class (or classes) that handles the string -> scalar(s) mapping for the given `T`. That would, sort of, function as a schema.
I think the only thing, from glancing over the project, that would need to be supported to make this really useful is nested lists/dictionaries. I don't see how this can be done but maybe I'm missing it.
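The idea can be sketched in JavaScript instead of C# (all names here are illustrative, not a real NestedText API): the parser yields only strings, lists, and dicts, and a per-field converter table plays the role of the typed schema.

```javascript
// Apply per-field string -> scalar converters to a parsed tree,
// failing fast on missing keys or unconvertible values.
function deserialize(tree, converters) {
  const out = {};
  for (const [key, convert] of Object.entries(converters)) {
    if (!(key in tree)) throw new Error(`missing key: ${key}`);
    out[key] = convert(tree[key]);
  }
  return out;
}

const toInt = (s) => {
  const n = Number(s);
  if (!Number.isInteger(n)) throw new Error(`not an integer: ${s}`);
  return n;
};

// E.g. a parsed document { port: "8080", host: "example.com" }:
const config = deserialize(
  { port: "8080", host: "example.com" },
  { port: toInt, host: (s) => s }
);
```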
You can always do that, defining the schema in the client to produce sensible checks, even with JSON. The problem is that wherever the spec is underspecified is another place where two different clients can deserialize differently, and both be correct.
And the problem with stringly typed systems is that everything is underspecified
Like in Windows, where you configure by clicking checkboxes that get disabled if invalid, with tooltips explaining what they do, additional help if you press F1, etc.?