I wrote a post about it a while back[1] (I regret some of the wording used there) and maintain a tool[2] that can exploit XPath injection issues. I'd recommend sticking with 1 or maybe 2, and pretending 3.x doesn't exist.
[1] https://tomforb.es/xcat-1.0-released-or-xpath-injection-issu...
The thing XPath 2.0 and later do improve on over XPath 1.0 is the "standard library": most of exslt got standardised in 2.0, and new useful functions got added in later revisions (e.g. contains-token from 3.1 is XPath finally adding the ~= operator from CSS).
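For anyone who hasn't met contains-token, here is a rough Python sketch of the matching rule as I understand it. The helper name `contains_token` is made up for illustration, collations are ignored, and Python's `split()` treats a few more characters as whitespace than XPath does, so take it as an approximation rather than the spec.

```python
def contains_token(values, token):
    """Approximate fn:contains-token (assumption: codepoint comparison only).

    True if any string in `values`, split on whitespace, contains a token
    equal to `token` after trimming leading/trailing whitespace.
    """
    if isinstance(values, str):
        values = [values]          # a single string is a one-item sequence
    wanted = token.strip()
    if not wanted:                 # an empty token never matches
        return False
    return any(wanted in value.split() for value in values)


# Same idea as the CSS selector [class~="btn-primary"]
print(contains_token("btn btn-primary large", "btn-primary"))  # True
print(contains_token("btn btn-primary large", "primary"))      # False
```

The second call is the point: a plain contains() would match the substring "primary", a token match does not.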
Here's the deal though: it should be possible to add most functions without updating the rest of the engine (indeed the majority were originally developed for 1.0). I think some of the functions are designed to work with and around types, which would not be useful in 1.0.
Sequences, for example. In XPath 1 the query returns a set, so the output is always in document order. When the document reorders things, the query output changes, and you can never get the original output. In a sequence, the query can output anything in any order.
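To make the set versus sequence difference concrete, here is a small Python simulation. It is not an XPath engine; the helpers `as_node_set` and `as_sequence` are invented names that only mimic the two result models on top of ElementTree.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<r><a/><b/><c/></r>")

# Document position of every element, root first.
position = {el: i for i, el in enumerate(doc.iter())}


def as_node_set(nodes):
    """XPath 1.0 style result: no duplicates, reported in document order."""
    return sorted(set(nodes), key=position.__getitem__)


def as_sequence(nodes):
    """XPath 2.0+ style result: order and duplicates are kept as given."""
    return list(nodes)


picked = [doc.find("c"), doc.find("a"), doc.find("c")]
print([el.tag for el in as_node_set(picked)])   # ['a', 'c']
print([el.tag for el in as_sequence(picked)])   # ['c', 'a', 'c']
```

With the set model the order you asked for is lost; with the sequence model it survives, which is what functions like sorting or reversing need.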
Edit: there is/was a token implementation for XSLT 2.0 called Gestalt
As far as XPath goes, that's wrong:
1. Saxon (the one you talk about)
2. BaseX (an XQuery 3.1 processor)
3. Xidel (implements many XQuery 3.1 features)
4. eXistdb
5. fonto-xpath (NodeJS)
6. frameless.io (JS, also XSL)
And these are just the ones that face the public internet. I think Microsoft has a 2.x implementation, and I am pretty sure IBM and Oracle do as well.
Now, as for XSL-T, you are right: the only easily available implementation is Saxon, though it seems frameless.io is one as well (I just found out about it a few minutes ago, so I may be wrong). But again, I guess that big enterprises have their own solutions bundled.
You can probably already do that just fine: ignore attribute nodes, and e.g.
    {"menu": {
      "id": "file",
      "value": "File",
      "popup": {
        "menuitem": [
          {"value": "New", "onclick": "CreateNewDoc()"},
          {"value": "Open", "onclick": "OpenDoc()"},
          {"value": "Close", "onclick": "CloseDoc()"}
        ]
      }
    }}

    /menu/popup/menuitem/*[last()]/preceding-sibling::*[1]/value
selects "Open". Something along those lines. Maybe relax node types so they can be pluggable per language, but I'm not sure that's even useful or necessary.
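For comparison, this is roughly what that path spells out when you walk the parsed JSON by hand in Python. The function name `value_before_last` is just for illustration, and the mapping of each step to an XPath step is my reading of the query above, not an existing JSON-XPath implementation.

```python
import json

doc = json.loads("""
{"menu": {
  "id": "file",
  "value": "File",
  "popup": {
    "menuitem": [
      {"value": "New", "onclick": "CreateNewDoc()"},
      {"value": "Open", "onclick": "OpenDoc()"},
      {"value": "Close", "onclick": "CloseDoc()"}
    ]
  }
}}
""")


def value_before_last(data):
    items = data["menu"]["popup"]["menuitem"]   # /menu/popup/menuitem/*
    if len(items) < 2:
        return None                             # no preceding sibling exists
    before_last = items[len(items) - 2]         # *[last()]/preceding-sibling::*[1]
    return before_last["value"]                 # /value


print(value_before_last(doc))  # Open
```

Fine for one query, but every new question about the tree needs new code, which is the argument for having a path language at all.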
It's insecure to run untrusted XPath, but isn't it the same with untrusted anything? A good solution here could be a way to sandbox such XPath, i.e. to limit which functions can be called, the same way it's done with XML, where you can forbid the processor to use the network or access arbitrary files on a case-by-case basis.
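A very crude version of "limit which functions can be called" can be sketched as a pre-check in Python. Everything here is an assumption for illustration: the allow-list contents, the names `disallowed_functions` and `check_xpath`, and above all the regex scan, which is not a parser. A real sandbox belongs inside the processor; this only shows the idea and errs on the side of rejecting (for example, XPath 2.0+ keywords such as `if (` are rejected too).

```python
import re

# Hypothetical allow-list: XPath 1.0 style helpers with no I/O.
ALLOWED = frozenset({"contains", "starts-with", "count", "last",
                     "position", "not", "text", "node"})

_STRING = re.compile(r'"[^"]*"|\'[^\']*\'')
# A (possibly prefixed) name followed by "(" (a call) or "#" (a function reference).
_CALL = re.compile(r'([A-Za-z_][\w.\-]*(?::[A-Za-z_][\w.\-]*)?)\s*[(#]')


def disallowed_functions(expr, allowed=ALLOWED):
    """Return the sorted names in `expr` that look like calls but are not allowed."""
    stripped = _STRING.sub('""', expr)   # ignore anything inside string literals
    return sorted({name for name in _CALL.findall(stripped) if name not in allowed})


def check_xpath(expr):
    """Raise ValueError if `expr` uses anything outside the allow-list."""
    bad = disallowed_functions(expr)
    if bad:
        raise ValueError("functions not allowed: " + ", ".join(bad))
    return expr


print(disallowed_functions('//item[contains(@name, "doc(")]'))       # []
print(disallowed_functions('//user[name = doc("http://evil/")/x]'))  # ['doc']
```

Prefixed names such as fn:doc are rejected as well, because only the bare names in the allow-list pass.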
Setting aside whether it’s even a good idea to allow XSLT to do that, XPath is only a subset of XSLT, so you’re just changing the subject. The “path” in XPath should be a hint at what it’s supposed to be: a query language to select nodes by path in XML documents, as opposed to an alternative to Awk or Perl.
for-each(normalize-unicode(upper-case(json-doc('x.json'))) => tokenize("\s+"),
function($a) {
let $a := $a * 10
load-xquery-module('abc'):some-func(
function-lookup($a, 1)(array:map($a, function($b) {
let $c := unparsed-text-lines($b)
trace($c)
if ($c) {
return xml-to-json($b)
} else {
error('This is an error')
}
}))
}
)
1. fn:json-doc(), by default and typically, returns either an XPath map or array datatype. Therefore fn:tokenize() cannot be used with it, just as fn:normalize-unicode() and fn:upper-case() cannot (both take a string as input). The error is: [FOTY0013] Items of type map(*) cannot be atomized.
2. In your anonymous function 'function($a) {...' you use a 'let' expression without 'return'. This is illegal.
3. As is the use of a colon ':' after fn:load-xquery-module(). The colon separates a namespace prefix from a local name. Did you mean the question mark '?'? That would make sense, since fn:load-xquery-module() returns a map. But then 'some-func(...' would not work, since the returned map has two keys, "variables" and "functions", and 'some-func()' would be referenced in another map, which is the value of the 'functions' key.
4. Calling fn:function-lookup() (as the parameter to some-func()) with $a, which must be a string and which you have multiplied by an integer (10), will already have errored out (multiplying a string by an integer is not possible). Even if that were possible, the next error would arise because the first parameter to this function must be of type xs:QName, which a number (or string) clearly is not.
5. The function you look up in the external module takes an array:map() function as its first parameter. Such a function does not exist in the XPath 3.1 standard (see https://maxtoroq.github.io/xpath-ref). You may have studied a version of the specification that was written before the array functions were finalized, which was in 2017. That could mean that what you wanted to express would now be array:fold-left(), which is how a map() function is called in XPath. However, that function takes three parameters.
6. Again, this is not valid XPath grammar:
let $c := unparsed-text-lines($b)
trace($c)
if ($c) {
return xml-to-json($b)
} else {
error('This is an error')
}
It would need to be:
    let $c := unparsed-text-lines($b)
    return (
      trace($c)
      , if ($c)
        then xml-to-json($b)
        else error('This is an error')
    )
Also, $c cannot evaluate to a boolean ([FORG0006] Effective boolean value not defined for xs:string+), since it is a sequence of strings. Of course, such things could be implementation dependent...
I don't want to go any further, since I cannot fully reconstruct what your code is supposed to do. It may be some 'blackhattish' dark magic that fucks up some engines, but the engines I use do not even get through compilation, due to the invalid XPath code. As is, your code does not make much sense. At least, if we talk about XPath 3.1.
Your bashing of XPath 3.1 (and also 2.1) makes no sense either, since you seem to use a completely non-standard processor, with a different syntax and functions, that behave very differently from XPath 3.1, or, even worse, did not understand the language.
Many of the changes to XPath, starting with version 2.1, stem from the fact that XPath became a subset of XQuery, a fully functional and declarative programming language that satisfies the need to query XML documents as if they were databases. This was done in order to satisfy the needs of non-programmers, like in the digital humanities, the publishing industry, etc.
XQuery 3.1, as a superset of XPath 3.1, is a language made in heaven! Nowhere is it as simple to work with (X)HTML, JSON and XML documents as here! A fully functional, declarative language that has templating (think Handlebars, Mustache) built in as an integral part of the language, where you can just intermix your XML code with program code, with easy error tracing (stateless!), quick "to production" development cycles and a painless approach to anything XML!
XML is one of the most misunderstood technologies in our industry and the only solid document technology I know of. Sadly, also based on this misunderstanding, a whole generation of developers has grown up listening to those who jumped on the XML hype train, only to realize that a document format may not be the best tool for the job (RPC, configuration files, etc.). And instead of admitting to themselves that they were wrong, they blame the technology they abused, teaching the kids to make their lives more difficult. And these kids are now in charge of browser development, etc. Adding to that, your lightheaded comments do not really help the issue.
> Your bashing of XPath 3.1 (and also 2.1) makes no sense either, since you seem to use a completely non-standard processor, with a different syntax and functions, that behave very differently from XPath 3.1, or, even worse, did not understand the language.
Granted, it’s been a few years since I’ve looked at XPath but I feel that I know it quite well. xcat is a testament to that.
> This was done in order to satisfy the needs of non-programmers, like in the digital humanities, the publishing industry, etc.
This makes no sense. “We made a weird programming language to satisfy the needs of non-programmers”? No, if anything the original xpath was pretty well positioned to be consumed and used by non-programmers. The current, not so much.
The current xpath/xquery language is poorly supported, mostly ignored and incredibly over engineered to the point where it’s almost comically unfit for purpose. Sorry.
> XML is one of the most misunderstood technologies in our industry
It’s pretty well understood and has valid use cases. However, it has a history of overly complex, over-engineered tooling created by committee and stuffed full of acronyms.
This quite rightly puts people off, and there are just better formats and technologies to use that aren’t encumbered with XML baggage.
In some ways, XPath is like regex. It's got insane power, but comes with a relatively steep learning curve. Remember reading regex for the first time? What? But unlike regex, the number of people using it is small in comparison.
I avoided XPath until I couldn't anymore. I could do a lot with CSS selectors, but eventually the DOM traversal became difficult to reason about w/ just CSS.
After taking the dive, it's so powerful. Read a single XPath and like regex, you can fully understand what the thing is going after and how it will get there.
There are functions in XPath 2.0 that I would love to have, but Nokogiri for Rails is stuck in 1.0 world with no plan to go to 2.0. Sad, but I'll live.
IMO the learning curve of XPath is not that high though. It has a somewhat alien syntax, but the only thing I remember giving me trouble is axes, because most tutorials just go on with the "shortcut" syntax, so the first time you encounter an axis everything goes pear-shaped.
> There are functions in XPath 2.0 that I would love to have, but Nokogiri for Rails is stuck in 1.0 world with no plan to go to 2.0. Sad, but I'll live.
Nokogiri should support function extensions[0] and most of the XPath 2.0 functions were originally extensions to 1.0[1], so even if these functions are not distributed with nokogiri you should be able to add them yourself.
Incidentally, Nokogiri seems to optionally depend on libexslt, which is the exslt implementation in C for libxml2/libxslt, so exslt should be available either as an option or by building it yourself.
[0] https://github.com/sparklemotion/nokogiri/commit/eb56525fbcc...
[1] http://exslt.org
Definitely a serious learning curve, some of the developers really struggled with it, others went crazy on it.
I made a pivot table maker with it. It was crazy fast vs the JS version I originally tried back in the pre-V8 engine days. The JS version would basically die after you got past a trivial amount of data; the XSLT one was instant regardless of the amount of data.
In a self-managed environment, like a PC or server, then you're right that popularity makes little difference.
Man, I can now write scrapers in 2 minutes that used to take me quite some time, thanks to the power of XPath. Things like ancestors, contains, the ability to chain, etc. are so, so powerful. I used to write so many hacks just to do the same with CSS before.
0 - https://git.sr.ht/~charles/charles-util/tree/dev/bin/query-w...
I was doing web scraping, and needed regular expressions to get the text, so I have implemented XPath 2. And currently I am updating it to XPath 3.1: http://www.videlibri.de/xidel.html
Most XPath implementations have no issue with adding extension functions (in fact many support exslt[0] out of the box), you really do not need to use (let alone implement) XPath 2.0 to use regex functions.
I saw so many developers sour on XML after hitting the “This would be easy if we used XPath 2 but instead it's hard” wall that I wonder if anyone on the relevant standards committees ever thought about how much libxml2 would make their work relevant.
I also was doing too much competitive programming back then, where you have to discover and implement a highly complex algorithm in a few hours
If such a complex implementation takes a few hours, I could not imagine implementing anything else taking much longer (especially when the spec already says what needs to be implemented and it does not need to be discovered). A few days at most...
But now I am still working on it 14 years later
The major OSS XML libs, including LibXML2 and Xerces, do not implement what Xidel does, and neither do some proprietary libs like MSXML.
That said, I scrape a fair few webpages now and have never once revisited XPath. I suppose people have mostly written off anything that feels too much like XML as enterprisey and deprecated.
Out of curiosity, what tools do you use for scraping? Is there a similar simple tool for defining queries over trees?
XSLT is about templating/transforming one XML doc into some other format. And there are simple replacements that largely fill the same role. Mustache, Handlebars, Template Toolkit (which was also the simpler solution back when XSLT was popular).
For as painful as XSLT was, at least it was a standard thing that existed.
I use it in a bunch of scraping scripts for Web sites which don't provide RSS feeds. It's really nice for quickly 'exploring' a document to find the needed data; it's simple to update when sites change their layout; and it can be read in from a config file, argument, env var, etc. to keep things generic and flexible.
XPath is infinitely more elegant than CSS selectors.
It might be less pretty, especially with the sort of selectors you'd use in CSS, but elegance could hardly be less of an issue.
Plus, while 'elegance' is mostly subjective, I see a lot of it in XPath. It's a DSL for describing generic tree traversal; it's concise and declarative, and it frees its users from the need to write imperative or recursive, repetitive and easy to mess up tree traversal code. Just not having to maintain state by hand during traversal is a huge timesaver. Additionally, at least in version 1 + some early extensions, XPath is much less complex than PCREs and not much more complex than CSS syntactically.
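A small Python illustration of the "no hand-maintained state" point, using the limited XPath subset that ships with ElementTree. The document, the element names and both function names are invented for the example; the two functions agree on this input, I am not claiming they are equivalent for every document (nested sections, for instance).

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<site>"
    "<section id='news'><item><title>A</title></item>"
    "<group><item><title>B</title></item></group></section>"
    "<section id='misc'><item><title>C</title></item></section>"
    "</site>"
)


def titles_by_hand(node, inside_news=False):
    """Recursive traversal that has to carry the 'inside_news' state itself."""
    found = []
    if node.tag == "section":
        inside_news = node.get("id") == "news"
    for child in node:
        if inside_news and node.tag == "item" and child.tag == "title":
            found.append(child.text)
        found.extend(titles_by_hand(child, inside_news))
    return found


def titles_by_path(root):
    """The same question asked declaratively with a path expression."""
    return [t.text for t in root.findall(".//section[@id='news']//item/title")]


print(titles_by_hand(doc))   # ['A', 'B']
print(titles_by_path(doc))   # ['A', 'B']
```

The path version is one line, and changing the question means editing a string instead of rewriting the recursion.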
edit: typos
Maybe you had XSLT in mind (which still is different) - or the fact that both CSS and XPath have "selectors". But one is used to get nodes (as a library), the other is a styling language.
And XPath, at least originally, had an extremely elegant path language.