What Happened to XPath?

(webreflection.medium.com)

135 points | AhtiK | 5y ago | 93 comments


orf5y ago

XPath post 1.0 got ridiculous, like many things do. What started as a simple, elegant language morphed into one with an HTTP client, filesystem methods, JSON support, functions, loops, extensions and the ability to read environment variables.

I wrote a post about it a while back[1] (I regret some of the wording used there) and maintain a tool[2] that can exploit XPath injection issues. I'd recommend sticking with 1 or maybe 2, and pretending 3.x doesn't exist.

1. https://tomforb.es/xcat-1.0-released-or-xpath-injection-issu...

2. https://github.com/orf/xcat

masklinn5y ago

I largely agree. XPath 2.0 started the downwards trajectory and XPath 3 made it worse.

The one thing XPath 2.0 and later do improve on over XPath 1.0 is the "standard library": most of exslt got standardised in 2.0, and useful new functions got added in later revisions (e.g. contains-token from 3.1 is XPath finally adding the ~= operator from CSS).

Here's the deal though: it should be possible to add most functions without updating the rest of the engine (indeed the majority were originally developed for 1.0). I think some of the functions are designed to work with and around types, which would not be useful in 1.0.

benibela5y ago

There are other useful things besides functions

Sequences for example. In XPath 1 the query returns a set, so the output is always in document order. When the document reorders things, the query output changes, and you can never get the original output. In a sequence, the query can output anything in any order
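A rough illustration of the set-versus-sequence difference, written in Python rather than XPath. This is only an analogy under stated assumptions: the helper names `as_sequence` and `as_node_set` are made up for this sketch, and "node-set" here means what most XPath 1.0 APIs hand back (unique nodes in document order).

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<r><a/><b/><c/></r>")

# Position of every element in document order.
doc_order = {elem: i for i, elem in enumerate(doc.iter())}


def as_sequence(root, tags):
    """XPath 2.0+ style: results keep the order in which they were requested."""
    return [elem for tag in tags for elem in root.findall(tag)]


def as_node_set(root, tags):
    """XPath 1.0 style: duplicates collapse, results come back in document order."""
    unique = set(as_sequence(root, tags))
    return sorted(unique, key=doc_order.__getitem__)


sequence_tags = [e.tag for e in as_sequence(doc, ["c", "a", "c"])]
node_set_tags = [e.tag for e in as_node_set(doc, ["c", "a", "c"])]
print(sequence_tags, node_set_tags)
```

Asking for c, a, c yields exactly that order (duplicate included) in the sequence version, while the node-set version can only ever give back a, c.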

tannhaeuser5y ago

XPath/XSLT 2+ also have only a single implementation (by the spec author) so don't meet W3C's requirement of two interoperating implementations. Basically, XSLT ceased to be a "standard" whereas XSLT 1.0 had excellent portability across libxslt, Saxon, Xalan, and MS' xslt.exe.

Edit: there is/was a token implementation for XSLT 2.0 called Gestalt

now5y ago

It is also worth noting that the specification’s author also built his company on this single implementation.

zmix5y ago

> XPath/XSLT 2+ also have only a single implementation

As far as XPath goes, that's wrong:

1. Saxon (the one you talk about)

2. BaseX (an XQuery 3.1 processor)

3. Xidel (implements many XQuery 3.1 features)

4. eXistdb

5. fonto-xpath (NodeJS)

6. frameless.io (JS, also XSL)

And these are the ones which face the public internet. I think Microsoft has a 2.x implementation, and I am pretty sure IBM and Oracle do as well.

Now, as for XSL-T, you are right: the easily available implementations are Saxon, but, as it seems, also frameless.io (which I just found out about a few minutes ago, so I may be wrong). But again, I guess, that big enterprise has their own solutions bundled.

orf5y ago

Saxon supports all xpath versions though? It also bundles some very dangerous functions, some of which xcat can take advantage of.

layoutIfNeeded5y ago

I've read your article... Holy shit. They took a simple, sed-like tool and turned it into an abomination.

the-dude5y ago

It ain't done before it can receive e-mail.

oefrha5y ago

It can receive email. See my follow-up here with a working implementation:

https://github.com/clopen/xpath-receive-email

https://news.ycombinator.com/item?id=24960548


hyperpallium25y ago

Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can.


riku_iki5y ago

Curious what is the problem with this? You can still use your small sed-like subset of language in your project?

taway6115y ago

>>> They took a simple, sed-like tool and turned it into an abomination.

> Curious what is the problem with this?

Product Managers.

AtlasBarfed5y ago

With the rise of numerous hierarchical document formats (JSON, YAML, TOML, properties files), what XPath REALLY should have evolved into was a more format-flexible path language with format-specific extensions as needed.

masklinn5y ago

> what XPath REALLY should have evolved into was a more format-flexible path language with format-specific extensions as needed.

You can probably already do that just fine: ignore attribute nodes, and e.g.

    {"menu": {
      "id": "file",
      "value": "File",
      "popup": {
        "menuitem": [
          {"value": "New", "onclick": "CreateNewDoc()"},
          {"value": "Open", "onclick": "OpenDoc()"},
          {"value": "Close", "onclick": "CloseDoc()"}
        ]
      }
    }}

    /menu/popup/menuitem/*[last()]/preceding-sibling::*[1]/value

selects "Open". Something along those lines.

Maybe relax nodetypes so they can be pluggable per-language, but I'm not sure that's even useful or necessary.
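One way to try the idea out, as a minimal sketch in Python: map the JSON onto an element tree and query it with the limited XPath subset in the standard library's `xml.etree.ElementTree`. The mapping (object keys become element names, array entries become `<item>` children) and the helper name `json_to_element` are assumptions made for this sketch. ElementTree has no `preceding-sibling` axis, so the query uses `item[last()-1]`, which picks the same entry for this document.

```python
import json
import xml.etree.ElementTree as ET


def json_to_element(name, value):
    """Map a JSON value onto an element tree: object keys become child
    element names, array entries become <item> children."""
    elem = ET.Element(name)
    if isinstance(value, dict):
        for key, child in value.items():
            elem.append(json_to_element(key, child))
    elif isinstance(value, list):
        for child in value:
            elem.append(json_to_element("item", child))
    else:
        elem.text = str(value)
    return elem


data = json.loads("""
{"menu": {
  "id": "file",
  "value": "File",
  "popup": {
    "menuitem": [
      {"value": "New", "onclick": "CreateNewDoc()"},
      {"value": "Open", "onclick": "OpenDoc()"},
      {"value": "Close", "onclick": "CloseDoc()"}
    ]
  }
}}
""")

root = json_to_element("root", data)
second_to_last = root.find("menu/popup/menuitem/item[last()-1]/value")
print(second_to_last.text)
```

This ignores keys that are not valid element names and flattens every scalar to text, so it is a toy, not a general JSON-to-XML mapping.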

Mikhail_Edoshin5y ago

XPath always was extensible, at least at the implementation level. E.g. in 'lxml' it's trivial to add XPath functions with Python. Homegrown, of course, but still possible. In addition to extension elements this is about the only way to hook XSLT into the rest of the system. How else one is supposed to read environment variables from XSLT? The only other way is to pass everything via command line as parameters.

It's insecure to run untrusted XPath, but isn't it the same with untrusted anything? A good solution here could be a way to sandbox such XPath, i.e. to limit which functions can be called, the same way it's done with XML where you can forbid the processor to use network or access arbitrary files on a case-by-case basis.
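A crude sketch of the "limit which functions can be called" idea, in Python. It is a text-level pre-check, not a real sandbox: the allowlist contents and the helper names `called_functions` and `is_allowed` are assumptions for illustration, and a production version would hook into the XPath engine's own function resolution rather than pattern-match on the expression.

```python
import re

# Hypothetical allowlist: a handful of XPath 1.0 functions and node tests.
ALLOWED = frozenset({
    "last", "position", "count", "contains", "starts-with",
    "normalize-space", "string", "not", "text", "node",
})

_STRING_LITERAL = re.compile(r"'[^']*'|\"[^\"]*\"")
# A (possibly prefixed) name directly followed by an opening parenthesis.
_CALL = re.compile(r"([A-Za-z_][\w.\-]*(?::[A-Za-z_][\w.\-]*)?)\s*\(")


def called_functions(expr):
    """Return the names that look like function calls in an XPath expression."""
    # Blank out string literals so text such as 'doc(' is not mistaken for a call.
    stripped = _STRING_LITERAL.sub("''", expr)
    return set(_CALL.findall(stripped))


def is_allowed(expr, allowed=ALLOWED):
    """True if every call-like name in the expression is on the allowlist."""
    return called_functions(expr) <= allowed


print(is_allowed("//a[contains(@href, 'doc(')]"))
print(is_allowed("doc('file:///etc/passwd')"))
```

The check errs on the side of rejecting: anything it cannot identify as an allowed name followed by `(` (including dynamic calls such as `$f(...)`) fails.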

oefrha5y ago

> How else one is supposed to read environment variables from XSLT?

Setting aside whether it’s even a good idea to allow XSLT to do that, XPath is only a subset of XSLT, so you’re just changing the subject. The “path” in XPath should be a hint at what it’s supposed to be: a query language to select nodes by path in XML documents. As opposed to an alternative of Awk, or Perl.

Mikhail_Edoshin5y ago

I'd say XPath is a way to get a nodeset or another XPath type out of something. E.g. the current date is not selected from a document. There always will be a need to get yet another thing as a nodeset, e.g. list a directory. Or, for boolean expressions, there will always be a need to test yet another thing, such as an environment variable.

These things, of course, should come as extension functions rather than special syntax, but then there will be a need to provide a small standard library of such functions :)

So yes, I believe it's useful if we're going to use XPath in a trusted environment, e.g. as a typical command-line tool. You won't deny Bash or Python this and other powerful abilities, will you? But of course it would be very unwise to run an untrusted Bash script.

ssdspoimdsjvv5y ago

XPath 3 was conceived with support for XSLT and XQuery in mind - where reading environment variables and text files are most definitely very useful features. This is indeed not something you want in a browser, but that ship had already sailed by then.

nunez5y ago

Damn. XPath went off the rails after v2. Though, to be fair, so did JavaScript, and look where that is today!

forgotmypw175y ago

A bloated abomination?

pojzon5y ago

Abomination with 500000 open job offers and people wanting their page to load in 2 minutes, because we can do that “async” and download half of the internet to display a table.

nunez5y ago

That is the most popular language in tech right now? (Unfortunately?)

spdionis5y ago

But does it send email?

zmix5y ago

I don't know what XPath 3.1 implementation you use, but the two major implementations, I have at hand, SaxonEE and BaseX, both do not understand your code. You write:

  for-each(normalize-unicode(upper-case(json-doc('x.json'))) => tokenize("\s+"),
    function($a) {
      let $a := $a * 10
      load-xquery-module('abc'):some-func(
        function-lookup($a, 1)(array:map($a, function($b) {
          let $c := unparsed-text-lines($b)
          trace($c)
          if ($c) {
            return xml-to-json($b)
          } else {
            error('This is an error')
          }
        })) 
      }
  )

1. fn:json-doc(), by default and typically, returns either an XPath map or array datatype. Therefore fn:tokenize() can not be used with it. Just as fn:normalize-unicode() and fn:upper-case() can not (both take a string as input) The error is: [FOTY0013] Items of type map(*) cannot be atomized.

2. in your anonymous function 'function($a) {...' you use a 'let' expression without 'return'. This is illegal.

3. As is the use of a colon ':' after fn:load-xquery-module(). The colon separates prefixes from local names. Did you mean the question-mark '?'? That would make sense, since fn:load-xquery-module() returns a map. But then 'some-func(...' would not work out, since the returned map has two keys: "variables" and "functions" and 'some-func()' would be referenced in another map, which is the value of the 'functions' key.

4. Calling fn:function-lookup() (as the parameter to some-func()) with a value that must be a string ($a), which you then multiply with an integer (10), will already have errored out (multiplying a string with an integer is not possible), but even if this were possible, the next error would arise since the first parameter to this function must be an xs:QName type, which a number (or string), clearly, is not.

5. The function, you look up in the external module takes an array:map() function as first parameter. Such a function does not exist in the XPath 3.1 standard (see https://maxtoroq.github.io/xpath-ref). You may have studied a version of the specification, which was written, before the array functions have been finalized, which was in 2017. That could mean, that what you wanted to express, would now be array:fold-left(), which is how a map() function is being called in XPath. However, that function takes three parameters.

6. Again, this is not valid XPath grammar:

    let $c := unparsed-text-lines($b)
    trace($c)
    if ($c) {
      return xml-to-json($b)
    } else {
      error('This is an error')
    }

It would need to be:

    let $c := unparsed-text-lines($b)
    return (
             trace($c)
           , if ($c)
             then xml-to-json($b)
             else error('This is an error')
           )

Also, $c can not evaluate to a boolean ([FORG0006] Effective boolean value not defined for xs:string+), since it is a sequence of strings. Of course, such things could be implementation dependent...

I don't want to go any further, since I can not totally recapitulate, what your code is supposed to do. It may be some 'blackhattish' dark magic, that fucks up some engines, but the engines I use do not even go through with compilation, due to the invalid XPath code. As is, your code does not make much sense. At least, if we talk about XPath 3.1.

Your bashing of XPath 3.1 (and also 2.1) makes no sense either, since you seem to use a completely non-standard processor, with a different syntax and functions, that behave very differently from XPath 3.1, or, even worse, did not understand the language.

Many of the changes to XPath, starting with version 2.1, stem from the fact, that XPath became a subset to XQuery, a fully functional and declarative programming language, that satisfies the need, to query XML documents as they would be databases. This was done in order to satisfy the needs of non-programmers, like in the digital humanities, the publishing industry, etc.

XQuery 3.1, as a superset of XPath 3.1, is a language made in heaven! Nowhere is it that simple, to work with (X)HTML, JSON and XML documents as here! A fully functional, declarative language, that has templating (think handlebars, moustache) built in as an integral part of the language, where you can just intermix your XML code with program code, with easy error tracing (stateless!), quick "to production" development cycles and a painless approach to anything XML!

XML is one of the most misunderstood technologies in our industry and the only solid document technology, I know of. Sadly, also based on this misunderstanding, a whole generation of developers has evolved, which listened to those, who jumped the hypetrain of XML, just to realize, that a document format may not be the best tool for the job (RPC, configuration files, etc.). And instead of admitting to themselves, that they were wrong, they accuse the technology, they abused, teaching the kids to make their lives more difficult. And these kids now are in charge of browser development, etc. And adding to that, your lightheaded comments do not really help the issue.

orf5y ago

The code sample is illustrative and shows off as much of the crazy, unneeded dynamism as possible rather than something that would actually work.

> Your bashing of XPath 3.1 (and also 2.1) makes no sense either, since you seem to use a completely non-standard processor, with a different syntax and functions, that behave very differently from XPath 3.1, or, even worse, did not understand the language.

Granted, it’s been a few years since I’ve looked at XPath but I feel that I know it quite well. xcat is a testament to that.

> This was done in order to satisfy the needs of non-programmers, like in the digital humanities, the publishing industry, etc.

This makes no sense. “We made a weird programming language to satisfy the needs of non-programmers”? No, if anything the original xpath was pretty well positioned to be consumed and used by non-programmers. The current, not so much.

The current xpath/xquery language is poorly supported, mostly ignored and incredibly over engineered to the point where it’s almost comically unfit for purpose. Sorry.

> XML is one of the most misunderstood technologies in our industry

It’s pretty well understood and has valid use cases. However it had a history of overly complex, over engineered tooling created by a committee and stuffed full of acronyms.

This quite rightly puts people off, and there are just better formats and technologies to use that isn’t encumbered with XML baggage.

zmix5y ago

> The code sample is illustrative and shows off as much of the crazy, unneeded dynamism as possible rather than something that would actually work.

You can not make up fantasy code, in a language, that tries to look like the actual thing, but is not, and then complain about the language not working. Your code is not working, since that is not XPath code.

>> This was done in order to satisfy the needs of non-programmers, like in the digital humanities, the publishing industry, etc.

> This makes no sense. “We made a weird programming language to satisfy the needs of non-programmers”?

It's not weird. I am one of these people, and I enjoy it greatly! As an example for some (functioning) XQuery code, everybody is invited to check this: https://gist.github.com/joewiz/6762f1d8826fc291c3884cce3634e... I don't think, that is weird. Or what about this:

  for $contact in $contacts/contact
  where $contact/familyname/data() = "Smith"
  group by $key := $contact/zip
  order by $key
  return <group>{ $contact }</group>

which will return all contacts named 'Smith', placed in the same 'group' element as long as they live in the same location. Not weird at all! But then, this may be a matter of taste.
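For comparison, roughly the same filter, group and order steps spelled out in Python with the standard library. This is only a sketch under assumptions: the sample `contacts` document and the helper `zip_of` are invented here, and ElementTree's path subset stands in for real XPath.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# Invented sample data shaped like the $contacts document in the query above.
contacts = ET.fromstring("""
<contacts>
  <contact><familyname>Smith</familyname><zip>20002</zip></contact>
  <contact><familyname>Jones</familyname><zip>10001</zip></contact>
  <contact><familyname>Smith</familyname><zip>10001</zip></contact>
  <contact><familyname>Smith</familyname><zip>20002</zip></contact>
</contacts>
""")


def zip_of(contact):
    return contact.findtext("zip")


# where $contact/familyname/data() = "Smith"
smiths = contacts.findall("contact[familyname='Smith']")

# group by zip, order by zip: groupby needs its input sorted on the key.
groups = []
for key, members in groupby(sorted(smiths, key=zip_of), key=zip_of):
    group = ET.Element("group")
    group.extend(list(members))
    groups.append(group)

summary = [(zip_of(g[0]), len(g)) for g in groups]
print(summary)
```

The XQuery version says the same thing in five lines without the explicit sort-then-group bookkeeping, which is arguably the point being made above.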

> The current xpath/xquery language is poorly supported, mostly ignored and incredibly over engineered to the point where it’s almost comically unfit for purpose.

As for ignored and unsupported, that has mostly psychological, social and historical reasons (the needs of programmers vs. the needs of power-users, people who fell for the hype-train and then were turned off, etc.). As for "over-engineered" my experience is, that XQuery is extremely lean. One may argue about some of the datatypes (i.e. dates and times), but they were requested by database/enterprise people. XML is a technology, that had a lot of interest groups, who all wanted their share. The nice thing is: you don't need to use it, if you don't require it. But those, who do, they are happy. Also, these datatypes are not XPath/XQuery, they are XML Schema. What would your example for "over engineering" in XPath be? I am really curious!

> It’s pretty well understood and has valid use cases.

Again, my experience is very different. Most people do not know, whether to "push" or to "pull" when writing XSL-T, which is a strong indicator for them not having understood, at least, XSL-T. They just use it as a programming language and start complaining. Then there are those, who compare it with JSON, which is comparing apples to airplanes. They call it "verbose" while not realizing, that a complex data format, that implies a lot of logic, requires simple tools (XPath one-liner, anyone?), while simple datastructs require much more logic on the side of the programmer. Yes, something as simple as JSON is a low hanging fruit, just like "make money fast". And then you realize the small print. In XML you start lowly, just as in XPath. No need to type anything. Just code on. Do the typing before production release. The rest comes over time, like everywhere.

Verbosity really happens on the code and the overly complex toolchains (just think ECMAScript and all the difficulties, that stem from combining HTML with JSON, in order to be somewhat semantic)

> However it had a history of overly complex, over engineered tooling created by a committee and stuffed full of acronyms. This quite rightly puts people off, and there are just better formats and technologies to use that isn’t encumbered with XML baggage.

Well, lazy people, who want to speak a foreign language without learning it (best example is your XPath code). I seldom read the W3C specs, only if I can't help myself any further. There are some nice books on each of these technologies, which are pretty simple to understand. However, one needs to read them, rather than just "coding on".

I do not doubt, that you are a capable programmer. However, judging by your code example, you got no clue about the XPath language. You may know how to abuse functions in a language, that access a server, and I guess, most of these attacks are pretty standard and do not require deeper knowledge of XPath. It's just the functions, that offer access.

ziml775y ago

Holy crap. What is this atrocity that is XPath 3.0!? What was wrong with sprinkling some XPath 1.0 queries into a Python script?

irjustin5y ago

Anyone who does scraping or automated browser work eventually comes across XPath.

In some ways, XPath is like regex. It's got insane power, but comes with a relatively steep learning curve. Remember reading regex for the first time? What? But unlike regex, the number of people using it are few in comparison.

I avoided XPath until I couldn't anymore. I could do a lot with CSS selectors, but eventually the DOM traversal became difficult to reason about w/ just CSS.

After taking the dive, it's so powerful. Read a single XPath and like regex, you can fully understand what the thing is going after and how it will get there.

There are functions in XPath 2.0 that I would love to have, but Nokogiri for Rails is stuck in 1.0 world with no plan to go to 2.0. Sad, but I'll live.

masklinn5y ago

> In some ways, XPath is like regex. It's got insane power, but comes with a relatively steep learning curve. Remember reading regex for the first time? What? But unlike regex, the number of people using it are few in comparison.

IMO the learning curve of XPath is not that high though, it has a somewhat alien syntax but the only thing I remember giving me trouble is axes, because most tutorials just go on with the "shortcut" syntax so the first time you encounter an axis everything goes pear-shaped.

> There are functions in XPath 2.0 that I would love to have, but Nokogiri for Rails is stuck in 1.0 world with no plan to go to 2.0. Sad, but I'll live.

Nokogiri should support function extensions[0] and most of the XPath 2.0 functions were originally extensions to 1.0[1], so even if these functions are not distributed with nokogiri you should be able to add them yourself.

Incidentally, Nokogiri seems to optionally depend on libexslt, which is the exslt implementation in C for libxml2/libxslt, so exslt should be available either as an option or by building it yourself.

[0] https://github.com/sparklemotion/nokogiri/commit/eb56525fbcc...

[1] http://exslt.org

mattmanser5y ago

Many moons ago I worked somewhere that used XPath extensively.

Definitely a serious learning curve, some of the developers really struggled with it, others went crazy on it.

I made a pivot table maker with it. It was crazy fast vs the js version I originally tried back in the pre-v8 engine days. The js version would basically die after you got past a trivial amount of data, the XSLT one was instant regardless of the amount of data.

jinushaun5y ago

I agree. I think the original author completely missed the point and conflates lack of mainstream usage with dead tech. If you never run into problems that xpath addresses, of course you’ll never use xpath. It’s not for everyday use. And certainly shouldn’t be billed as a CSS selector replacement.

chriswarbo5y ago

I think their complaints about browser support are fair (orthogonal to whether the newer versions are any good, which most of the comments here are talking about!)

In a self-managed environment, like a PC or server, then you're right that popularity makes little difference.

brixon5y ago

Similar, when I have control of the source code then CSS selectors are fine (I can always throw in another ID or Class Name). When I don't have control of the source code then I might have to use XPath if CSS selectors are insufficient.

t7s5y ago

If you need to do web scraping learning xpath is very helpful

Crazyontap5y ago

Xpath is so powerful for web scraping, I just realized recently. I'd been using css selectors for my occasional scraping needs and never bothered to learn xpath until one day, on a whim, I decided to learn at least the basics.

Man I can now write scrapers in 2 minutes that used to take me quite some time thanks to the power of xpath. Things like ancestors, contains, the ability to chain, etc. are so, so powerful. I used to write so many hacks just to do the same with css before.
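A small taste of the kind of query meant here, sketched with Python's standard library. Assumptions: the page fragment is invented, it is well-formed XML (real HTML usually needs a proper HTML parser), and ElementTree only covers a subset of XPath, so `..` stands in for the `ancestor` axis and an exact text match stands in for `contains()`.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed page fragment.
page = ET.fromstring("""
<html><body>
  <table>
    <tr><td>Model</td><td>X1</td></tr>
    <tr><td>Price</td><td>999</td></tr>
  </table>
  <div class="promo"><p><a href="/deal">Deal</a></p></div>
</body></html>
""")

# Pick a row by the text of one of its cells, then read the neighbouring cell.
price = page.findtext(".//tr[td='Price']/td[2]")

# Find a link, then walk back up two levels to the enclosing block.
link_wrapper = page.find(".//a[@href='/deal']/../..")

print(price, link_wrapper.get("class"))
```

Selecting a parent based on what its children contain, and climbing back up from a match, are the steps that tend to need hacks with plain CSS selectors.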

mook5y ago

I realized a couple months back that Google sheets supports using xpath to scrape web pages. So now I have a "spreadsheet" scraping a page to see when a model of laptop goes on sale. Seems to work; at least, whenever I go double check that page manually it matches the scraped result.

apiimporter5y ago

The only problem with the built-in IMPORTXML() function is that it doesn't execute pages with JavaScript. If you ever run into issues give API Importer a try (where I run a headless browser to execute the JavaScript): https://gsuite.google.com/marketplace/app/api_importer/52965...

charlesdaniels5y ago

Indeed, I wrote a tool[0] to make it easy to grab a page and run xpath queries on it. It’s really surprising how much mileage I’ve gotten out of it. Probably 95% of my web scraping needs can be solved with an xpath query or two. And if you realize you need selenium later, xpath is well supported there, so porting your existing query is usually quite straightforward.

0 - https://git.sr.ht/~charles/charles-util/tree/dev/bin/query-w...

Pelic4n5y ago

Can you point out the resources you used for learning? I wrote a lot of scrapers and am knee-deep in css hacks.

deckard15y ago

If you know CSS well, I find this useful:

https://devhints.io/xpath

The problem with xpath is that you rarely use it, so you forget how to do certain things. Then you have to go and re-learn when you need it. Rinse and repeat.

minxomat5y ago

Take a look at xidel.

nunez5y ago

It is the swiss army knife of scraping indeed. I feel like I can do anything with a scraper thanks to XPath.

benibela5y ago

The biggest problem with the new XPath versions is that the W3C made the standards, but almost no one implemented them, so you cannot actually use them

I was doing web scraping, and needed regular expressions to get the text, so I have implemented XPath 2. And currently I am updating it to XPath 3.1: http://www.videlibri.de/xidel.html

summarity5y ago

Yooo, thanks for Xidel! I use it dozens of times per week. It's amazing. Next to the actual shell probably the single most useful ETL and scraping tool I've ever encountered. Keep at it!

masklinn5y ago

> I was doing web scraping, and needed regular expressions to get the text, so I have implemented XPath 2.

Most XPath implementations have no issue with adding extension functions (in fact many support exslt[0] out of the box), you really do not need to use (let alone implement) XPath 2.0 to use regex functions.

[0] http://exslt.org/regexp/index.html

acdha5y ago

I don't think this especially changes the underlying point: anyone using tools which were based on libxml2 or xerces is basically stuck in 1999. Having to find and install custom extensions adds a regular frictional cost which encourages you to just do more work in a full programming language since you know you'll be able to satisfy any requirement that way.

I saw so many developers sour on XML after hitting the “This would be easy if we used XPath 2 but instead it's hard” wall that I wonder if anyone on the relevant standards committees ever thought about how much libxml2 would make their work relevant.
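The "do more work in a full programming language" fallback usually looks something like the sketch below: a 1.0-style path does the coarse selection and the regex work happens in the host language. The sample document and the `ORDER_ID` pattern are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample document.
listing = ET.fromstring("""
<items>
  <item>Order #A-1029 shipped</item>
  <item>No reference here</item>
  <item>Order #B-77 pending</item>
</items>
""")

ORDER_ID = re.compile(r"#([A-Z]-\d+)")

# The path selects candidate nodes; Python does what a regex function
# inside the XPath expression would otherwise have done.
order_ids = []
for item in listing.findall("item"):
    match = ORDER_ID.search(item.text or "")
    if match:
        order_ids.append(match.group(1))

print(order_ids)
```

It works, but the query is now split across two languages, which is the friction being described.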

benibela5y ago

I did not plan to implement it all, only the parts I needed for the webpages in my city. At first I did not even have backward axes. But people care much more about XPath than they care about my city

I also was doing too much competitive programming back then, where you have to discover and implement a highly complex algorithm in a few hours

If such a complex implementation takes a few hours, I could not imagine implementing anything else taking much longer (especially when the spec already says what needs to be implemented and it does not need to be discovered). A few days at most...

But now I am still working on it 14 years later

anonymous3245y ago

Though I've never used Xidel, I came across it when researching XPath 2/3, and was very impressed that anyone managed to implement these massive, complicated specs all by themselves.

The major OSS XML libs, including LibXML2 and Xerces, do not implement what Xidel does, and neither do some proprietary libs like MSXML.

thom5y ago

XPath and XSLT was the first time (despite doing Haskell at university) that I started to really understand functional programming. The first time was working on a tech stack that was basically Microsoft SHAPE queries transformed into HTML. The second was multiple projects customising Google custom search engine results. It was weird realising that these very limited primitives were actually infinitely powerful if you were willing to warp your brain the right way.

That said, I scrape a fair few webpages now and have never once revisited XPath. I suppose people have mostly written off anything that feels too much like XML as enterprisey and deprecated.

nine_k5y ago

Indeed, xslt was the first pure functional language to achieve some popularity (if not love) among wider circles of software developers. At least, so it was in 2000s.

TuringTest5y ago

There may be a need for a replacement with simpler syntax; I wonder if GraphQL might be used in that role.

Out of curiosity, what tools do you use for scraping? Is there a similar simple tool for defining queries over trees?

tannhaeuser5y ago

fyi: XSLT was designed by James Clark, based on concepts and experience of XSLT's Scheme-based predecessor DSSSL. So there's your alternate syntax :) In a way, DSSSL has yielded to a XML-ish surface syntax much like JavaScript, also conceived as a LISPy language, yielded to a Java-ish (awk-ish, actually) syntax.

deckard15y ago

I've often compared GraphQL to SOAP-XML with WSDL. It's nearly the same thing, and just about as boilerplate-y.

XSLT is about templating/transforming one XML doc into some other format. And there are simple replacements that largely fill the same role. Mustache, Handlebars, Template Toolkit (which was also the simpler solution back when XSLT was popular).

ping_pong5y ago

XPath and XML in general are a great example of "Death by Committee". They tried too hard to be too smart and to solve everything, and overcomplicated it to death. This is why people largely abandoned it. This is what is happening to C++, and they are steering themselves by committee into a dead end.

waynesonfire5y ago

I'm starting to pay more attention to technologies that are resistant to this. Maybe I'm just getting old that I'm beginning to value mature, proven technologies over fads. More importantly, is the difficult skill of being able to spot them.

tonyedgecombe5y ago

The trouble is they are as rare as hen's teeth. The temptation to add a little more is overwhelming. I know I suffer from it myself.

lenkite5y ago

Yeah the committee's decision to avoid ABI breakage is a serious deathblow against the language. Especially when a formal ABI was never defined in the first place. So, C++ is stuck with poor implementations for std::regex and std::unordered_map for ever. Where even interpreted languages can beat it.

contravariant5y ago

What, you think tying namespaces to a web domain that is in no way actually used as one and results in XML that is unreadable in its fully qualified form (or in fact not even valid XML) and changes not just meaning but value as you try to copy paste any part of it, was a bad idea?

projektfu5y ago

With increasing power comes the likelihood that people accidentally implement behavior that is nonpolynomial. It looks good in testing but then with real live data starts taking seconds to render/re-render. There are probably examples of this already in CSS but seems more likely with arbitrarily backtracking XPath expressions.

anonymousblip5y ago

I love the XPath model of declaratively querying and transforming data, which has been highly influential (see JQ, JSONPath, GROQ, etc.). Ultimately, it was too closely tied to XML, which was overdesigned, complex, and sucked into the committee hell that brought us more overdesigned technologies like SOAP and XML Schema.

mongol5y ago

Xpath 1.0 is maybe the single most useful output from the XML universe. Did something like it exist before?

icedchai5y ago

XPath 1.0 was released in the late 90’s. I remember using it in some server-side XML processing code (Java 1.2?) It did the job where the alternative was writing a ton of procedural code to get at a specific node, etc.
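To make the "ton of procedural code" comparison concrete, here is a sketch in Python rather than Java. The sample order document and both function names are invented; the path version uses the XPath subset in `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Invented sample document.
order = ET.fromstring("""
<order>
  <customer>
    <address kind="billing"><city>Oslo</city></address>
    <address kind="shipping"><city>Bergen</city></address>
  </customer>
</order>
""")


def shipping_city_by_hand(root):
    """Walk the tree level by level, checking tags and attributes manually."""
    for customer in root:
        if customer.tag != "customer":
            continue
        for address in customer:
            if address.tag == "address" and address.get("kind") == "shipping":
                for child in address:
                    if child.tag == "city":
                        return child.text
    return None


def shipping_city_by_path(root):
    """The same lookup as a single path expression."""
    return root.findtext("customer/address[@kind='shipping']/city")


print(shipping_city_by_hand(order), shipping_city_by_path(order))
```

Both functions return the same city; the difference is how much of the traversal you have to write and maintain yourself.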

lkuty5y ago

XPath 3 and XQuery 3 are powerful and great technologies to query XML if you need that stuff. The problem is that most implementations cover XPath 1.0 because I guess it is too difficult (i.e. time consuming and involved) to produce a 2.x or 3.x implementation, let alone with full W3C XML Schema support. There is also BaseX which implements XQuery 3.x which is a nice native XML database. I really dig XML and its technologies. I wish XQuery 3.x was available everywhere.

jarym5y ago

Shameless plug of DefiantJS[1] that gives a lovely fast XPath query capability to JSON data.

1. https://defiantjs.com

dehrmann5y ago

One of the huge gaps in JSON tooling is there isn't a standard XPath equivalent (there's JSON Pointer, but it's nowhere close to XPath, and JSON Path which isn't standardized) and no XSLT equivalent.

For as painful as XSLT was, at least it was a standard thing that existed.
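To show how little JSON Pointer does compared with XPath, here is a minimal resolver following RFC 6901, sketched in Python. The function name `resolve_pointer` and the sample data are assumptions for this example, and the special `-` array token is deliberately left unsupported.

```python
def resolve_pointer(document, pointer):
    """Resolve an RFC 6901 JSON Pointer against already-parsed JSON data."""
    if pointer == "":
        return document  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    current = document
    for raw in pointer[1:].split("/"):
        # Unescape ~1 before ~0, as the RFC requires.
        token = raw.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            if not token.isdigit():
                raise ValueError("array index expected, got %r" % token)
            current = current[int(token)]
        else:
            current = current[token]
    return current


menu = {"menu": {"popup": {"menuitem": [{"value": "New"}, {"value": "Open"}]}}}
print(resolve_pointer(menu, "/menu/popup/menuitem/1/value"))
```

A pointer addresses exactly one value by a fixed chain of keys and indexes. There are no predicates, wildcards or axes, which is why it is not an XPath equivalent.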

johnward5y ago

I do a bunch of XML/XSLT work still. I use XPATH 1.0 basically every day. It's also awesome for web scraping. Overall, it's a great tool that doesn't get a ton of exposure.

mapgrep5y ago

Is there something I can read to get up to speed on xpath? Any recommendations for online or printed resources? (Particularly from folks who use it regularly!)

varispeed5y ago

I remember spending a good two weeks writing an XPath parser in C and then the client changed their system responses to JSON. My last experience with XPath.

chriswarbo5y ago

XPath is great, and works equally well in lumbering, ceremony-heavy Enterprise Java environments; and in quick bash one-liners.

I use it in a bunch of scraping scripts for Web sites which don't provide RSS feeds. It's really nice for quickly 'exploring' a document to find the needed data; it's simple to update when sites change their layout; and it can be read in from a config file, argument, env var, etc. to keep things generic and flexible.

forgotmypw175y ago

XPath is hard to replace when writing Selenium WebDriver scripts. Thank you for existing, XPath.

mimixco5y ago

I thought XPath was pretty terrific for the day. It let you transform XML into a user interface in an entirely declarative way -- not just the appearance of items like CSS but the actual content could be inspected and altered. I built some cool things in XPath before frameworks like Angular took over.

teddyh5y ago

It sounds like you are talking about XSLT, not XPath.

ygra5y ago

Are you perhaps confusing XPath with XSLT (which uses XPath for selecting elements) here?

mimixco5y ago

Yes, I was! Thanks.

ape45y ago

XPath is still a great way to reach into an xml file and grab a value

techsin1015y ago

CSS selectors aren't an alternative to XPath. The alternative would be to write it out yourself in JS, as a sort of full tree-traversal algorithm. There are times when this is the only option when scraping.

chrshawkes5y ago

What is the alternative for accurate scraping?

dzonga5y ago

If you do any type of web scraping, XPath is the way to go. Thanks to my former co-worker Justin for showing me that.

dsq5y ago

I used xpath last week for something

tinus_hn5y ago

This is that weird language you use to make WebDAV servers look okay in a browser, right?

pestaa5y ago

I think you're referring to XSL. The heavy lifting is done by the transformation language (XSLT), but XPath is definitely an underlying tool.

klibertp5y ago

I may be wrong, as it's been some time since I worked with them, but I think XPath is both its own standard, and a part of XSL at the same time. A lot of XSLT deals with selecting nodes from the source and it happens with XPath expressions.

masklinn5y ago

W3C standards usually depend on and leverage other standards, so XPath is its own standard, which is used by XSLT (and XQuery, and possibly a few other things).

You can't use XSLT without XPath, but you can use XPath on its own.

katzgrau5y ago

It's hard not to read this as satire, because XPath is so inelegant. Not that CSS selectors are a model of elegance, but it gets the job done (most of the time) and is easy enough for rookie devs and designers to pick up.

masklinn5y ago

> XPath is so inelegant. Not that CSS is a model of elegance

XPath is infinitely more elegant than CSS selectors.

It might be less pretty, especially with the sort of selectors you'd use in CSS, but elegance could hardly be less of an issue.

klibertp5y ago

I don't completely understand this sentiment. I mean, when confronted with "rookie devs", surely our focus should be on transforming them into not-rookie devs, not on transforming our tooling into a dumbed-down version...

Plus, while 'elegance' is mostly subjective, I see a lot of it in XPath. It's a DSL for describing generic tree traversal; it's concise and declarative, and it frees its users from the need to write imperative or recursive, repetitive and easy-to-mess-up tree traversal code. Just not having to maintain state by hand during traversal is a huge timesaver. Additionally, at least in version 1 + some early extensions, XPath is much less complex than PCREs and not much more complex than CSS syntactically.

edit: typos
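
To make the point about hand-written traversal concrete, here is a minimal sketch in Python using the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset. The sample document and the function names `titles_by_hand` and `titles_by_xpath` are made up for illustration. The first function threads state through a recursive walk by hand; the second asks the same question with one path expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the thread.
DOC = """
<library>
  <shelf>
    <book lang="en"><title>Dune</title></book>
    <box><book lang="de"><title>Momo</title></book></box>
  </shelf>
  <book lang="en"><title>Emma</title></book>
</library>
"""

def titles_by_hand(node, lang, found=None):
    """Recursive pre-order walk; the result list is state we carry by hand."""
    if found is None:
        found = []
    if node.tag == "book" and node.get("lang") == lang:
        found.append(node.findtext("title"))
    for child in node:
        titles_by_hand(child, lang, found)
    return found

def titles_by_xpath(root, lang):
    """The same query as a single declarative path expression.
    `lang` is assumed to be trusted; never splice user input into a path."""
    return [t.text for t in root.findall(f".//book[@lang='{lang}']/title")]

root = ET.fromstring(DOC)
english_by_hand = titles_by_hand(root, "en")
english_by_xpath = titles_by_xpath(root, "en")
```

Both calls return the English titles in document order; only the second one stays a one-liner when the nesting changes.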

coldtea5y ago

CSS and XPath don't share any functionality.

Maybe you had XSLT in mind (which still is different) - or the fact that both CSS and XPath have "selectors". But one is used to get nodes (as a library), the other is a styling language.

And XPath, at least originally, had an extremely elegant path language.

now5y ago

It's also worth noting that the specification’s author built his company on this single implementation.

zmix5y ago

> XPath/XSLT 2+ also have only a single implementation

As far as XPath goes, that's wrong:

1. Saxon (the one you talk about)

2. BaseX (an XQuery 3.1 processor)

3. Xidel (implements many XQuery 3.1 features)

4. eXistdb

5. fonto-xpath (NodeJS)

6. frameless.io (JS, also XSL)

And these are just the ones that face the public internet. I think Microsoft has a 2.x implementation, and I am pretty sure IBM and Oracle do as well.

orf5y ago

Saxon supports all xpath versions though? It also bundles some very dangerous functions, some of which xcat can take advantage of.

layoutIfNeeded5y ago

I've read your article... Holy shit. They took a simple, sed-like tool and turned it into an abomination.

the-dude5y ago

It ain't done before it can receive e-mail.

oefrha5y ago

It can receive email. See my follow-up here with a working implementation:

https://github.com/clopen/xpath-receive-email

https://news.ycombinator.com/item?id=24960548

1 more reply

hyperpallium25y ago

Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can.

1 more reply

riku_iki5y ago

Curious what is the problem with this? You can still use your small sed-like subset of language in your project?

taway6115y ago

>>> They took a simple, sed-like tool and turned it into an abomination.

> Curious what is the problem with this?

Product Managers.

AtlasBarfed5y ago

masklinn5y ago

> what XPath REALLY should have evolved into was a more format-flexible path language with format-specific extensions as needed.

You can probably already do that just fine: ignore attribute nodes, and e.g.

    {"menu": {
      "id": "file",
      "value": "File",
      "popup": {
        "menuitem": [
          {"value": "New", "onclick": "CreateNewDoc()"},
          {"value": "Open", "onclick": "OpenDoc()"},
          {"value": "Close", "onclick": "CloseDoc()"}
        ]
      }
    }}

    /menu/popup/menuitem/*[last()]/preceding-sibling::*[1]/value

selects "Open". Something along those lines.

Maybe relax nodetypes so they can be pluggable per-language, but I'm not sure that's even useful or necessary.
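
A rough sketch of that idea in Python, under a few assumptions: the JSON is first mapped onto an element tree by a hypothetical helper `to_element`, array entries are given the made-up element name `item`, and the query runs on the standard library's `xml.etree.ElementTree`. That module implements only a subset of XPath 1.0 and has no `preceding-sibling` axis, so the sketch uses `*[last()-1]` to reach the same second-to-last entry.

```python
import json
import xml.etree.ElementTree as ET

def to_element(name, value):
    """Map a JSON value onto an element: object members become child
    elements, array entries become <item> children, scalars become text."""
    elem = ET.Element(name)
    if isinstance(value, dict):
        for key, child in value.items():
            elem.append(to_element(key, child))
    elif isinstance(value, list):
        for child in value:
            elem.append(to_element("item", child))
    else:
        elem.text = str(value)
    return elem

data = json.loads("""
{"menu": {"id": "file", "value": "File", "popup": {"menuitem": [
  {"value": "New", "onclick": "CreateNewDoc()"},
  {"value": "Open", "onclick": "OpenDoc()"},
  {"value": "Close", "onclick": "CloseDoc()"}
]}}}
""")

menu = to_element("menu", data["menu"])
# Path is relative to the <menu> element; last()-1 is the entry before the last.
second_to_last = menu.findtext("popup/menuitem/*[last()-1]/value")
```

A full XPath 1.0 engine could evaluate the original expression unchanged against the same tree; the mapping from JSON to nodes is the only part that needs a convention.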

Mikhail_Edoshin5y ago

oefrha5y ago

> How else one is supposed to read environment variables from XSLT?

Mikhail_Edoshin5y ago

These things, of course, should come as extension functions rather than special syntax, but then there will be a need to provide a small standard library of such functions :)

ssdspoimdsjvv5y ago

nunez5y ago

Damn. XPath went off the rails after v2. Though, to be fair, so did JavaScript, and look where that is today!

forgotmypw175y ago

A bloated abomination?

pojzon5y ago

An abomination with 500,000 open job offers and people wanting their page to take 2 minutes to load, because we can do that “async” and download half of the internet to display a table.

nunez5y ago

That is the most popular language in tech right now? (Unfortunately?)

spdionis5y ago

But does it send email?

zmix5y ago

I don't know what XPath 3.1 implementation you use, but the two major implementations, I have at hand, SaxonEE and BaseX, both do not understand your code. You write:

  for-each(normalize-unicode(upper-case(json-doc('x.json'))) => tokenize("\s+"),
    function($a) {
      let $a := $a * 10
      load-xquery-module('abc'):some-func(
        function-lookup($a, 1)(array:map($a, function($b) {
          let $c := unparsed-text-lines($b)
          trace($c)
          if ($c) {
            return xml-to-json($b)
          } else {
            error('This is an error')
          }
        })) 
      }
  )

2. in your anonymous function 'function($a) {...' you use a 'let' expression without 'return'. This is illegal.

6. Again, this is not valid XPath grammar:

    let $c := unparsed-text-lines($b)
    trace($c)
    if ($c) {
      return xml-to-json($b)
    } else {
      error('This is an error')
    }

It would need to be:

    let $c := unparsed-text-lines($b)
    return (
             trace($c)
           , if ($c)
             then xml-to-json($b)
             else error('This is an error')
           )

Also, $c can not evaluate to a boolean ([FORG0006] Effective boolean value not defined for xs:string+), since it is a sequence of strings. Of course, such things could be implementation dependent...

orf5y ago

The code sample is illustrative and shows off as much of the crazy, unneeded dynamism as possible rather than something that would actually work.

Granted, it’s been a few years since I’ve looked at XPath but I feel that I know it quite well. xcat is a testament to that.

> This was done in order to satisfy the needs of non-programmers, like in the digital humanities, the publishing industry, etc.

The current xpath/xquery language is poorly supported, mostly ignored and incredibly over engineered to the point where it’s almost comically unfit for purpose. Sorry.

> XML is one of the most misunderstood technologies in our industry

It’s pretty well understood and has valid use cases. However it had a history of overly complex, over engineered tooling created by a committee and stuffed full of acronyms.

This quite rightly puts people off, and there are just better formats and technologies to use that aren’t encumbered with XML baggage.

zmix5y ago

> The code sample is illustrative and shows off as much of the crazy, unneeded dynamism as possible rather than something that would actually work.

>> This was done in order to satisfy the needs of non-programmers, like in the digital humanities, the publishing industry, etc.

> This makes no sense. “We made a weird programming language to satisfy the needs of non-programmers”?

  for $contact in $contacts/contact
  where $contact/familyname/data() = "Smith"
  group by $key := $contact/zip
  order by $key
  return <group>{ $contact }</group>

which will return all contacts named 'Smith', placed in the same 'group' element as long as they live in the same location. Not weird at all! But then, this may be a matter of taste.
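
For readers who don't know XQuery, roughly the same steps can be spelled out by hand. This is only a sketch in Python with the standard library; the sample data, the `givenname` element and the function name `group_smiths` are assumptions added for illustration, and the zip codes are sorted as plain strings.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# Hypothetical sample data; only familyname and zip appear in the query above.
CONTACTS = """
<contacts>
  <contact><familyname>Smith</familyname><givenname>Ann</givenname><zip>90210</zip></contact>
  <contact><familyname>Jones</familyname><givenname>Bob</givenname><zip>10001</zip></contact>
  <contact><familyname>Smith</familyname><givenname>Cid</givenname><zip>10001</zip></contact>
  <contact><familyname>Smith</familyname><givenname>Dee</givenname><zip>90210</zip></contact>
</contacts>
"""

def zip_of(contact):
    return contact.findtext("zip")

def group_smiths(contacts):
    # for ... where: keep only contacts whose familyname is "Smith"
    smiths = contacts.findall("contact[familyname='Smith']")
    # order by $key: the sort is stable, so document order survives per zip
    smiths.sort(key=zip_of)
    groups = []
    # group by $key / return: one <group> element per distinct zip
    for _, members in groupby(smiths, key=zip_of):
        group = ET.Element("group")
        group.extend(list(members))
        groups.append(group)
    return groups

groups = group_smiths(ET.fromstring(CONTACTS))
summary = [[c.findtext("givenname") for c in g] for g in groups]
```

The five-line FLWOR expression carries the filter, the sort and the grouping that take three explicit steps here.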

> The current xpath/xquery language is poorly supported, mostly ignored and incredibly over engineered to the point where it’s almost comically unfit for purpose.

> It’s pretty well understood and has valid use cases.

Verbosity really happens on the code and the overly complex toolchains (just think ECMAScript and all the difficulties, that stem from combining HTML with JSON, in order to be somewhat semantic)

ziml775y ago

Holy crap. What is this atrocity that is XPath 3.0!? What was wrong with sprinkling some XPath 1.0 queries into a Python script?

irjustin5y ago

Anyone who does scraping or automated browser work eventually comes across XPath.

I avoided XPath until I couldn't anymore. I could do a lot with CSS selectors, but eventually the DOM traversal became difficult to reason about w/ just CSS.

After taking the dive, it's so powerful. Read a single XPath and like regex, you can fully understand what the thing is going after and how it will get there.

There are functions in XPath 2.0 that I would love to have, but Nokogiri for Rails is stuck in 1.0 world with no plan to go to 2.0. Sad, but I'll live.
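
One kind of traversal that gets awkward with CSS alone is selecting by text content and then stepping sideways. A small sketch in Python, assuming a made-up, well-formed table and the limited XPath subset in the standard library's `xml.etree.ElementTree` (real pages usually need an HTML parser first):

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a scraped page fragment.
PAGE = """
<table>
  <tr><td>Name</td><td>Widget</td></tr>
  <tr><td>Price</td><td>9.99</td></tr>
  <tr><td>Stock</td><td>12</td></tr>
</table>
"""

table = ET.fromstring(PAGE)
# Pick the row that has a cell whose text is "Price", then take its second cell.
price = table.findtext("tr[td='Price']/td[2]")
```

The expression reads left to right as the route taken: find the row by its label, then move to the value next to it.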

masklinn5y ago

> There are functions in XPath 2.0 that I would love to have, but Nokogiri for Rails is stuck in 1.0 world with no plan to go to 2.0. Sad, but I'll live.

Incidentally, Nokogiri seems to optionally depend on libexslt, which is the exslt implementation in C for libxml2/libxslt, so exslt should be available either as an option or by building it yourself.

[0] https://github.com/sparklemotion/nokogiri/commit/eb56525fbcc...

[1] http://exslt.org

mattmanser5y ago

Many moons ago I worked somewhere that used XPath extensively.

Definitely a serious learning curve, some of the developers really struggled with it, others went crazy on it.

jinushaun5y ago

chriswarbo5y ago

I think their complaints about browser support are fair (orthogonal to whether the newer versions are any good, which most of the comments here are talking about!)

In a self-managed environment, like a PC or server, then you're right that popularity makes little difference.

brixon5y ago

t7s5y ago

If you need to do web scraping, learning XPath is very helpful.

Crazyontap5y ago

mook5y ago

apiimporter5y ago

charlesdaniels5y ago

0 - https://git.sr.ht/~charles/charles-util/tree/dev/bin/query-w...

Pelic4n5y ago

Can you point out the resources you used for learning? I wrote a lot of scrapers and am knee-deep in CSS hacks.

deckard15y ago

If you know CSS well, I find this useful:

https://devhints.io/xpath

The problem with xpath is that you rarely use it, so you forget how to do certain things. Then you have to go and re-learn when you need it. Rinse and repeat.

minxomat5y ago

Take a look at xidel.

nunez5y ago

It is the swiss army knife of scraping indeed. I feel like I can do anything with a scraper thanks to XPath.

benibela5y ago

The biggest problem with the new XPath versions is that the W3C made the standards, but almost no one implemented them, so you cannot actually use them

I was doing web scraping, and needed regular expressions to get the text, so I have implemented XPath 2. And currently I am updating it to XPath 3.1: http://www.videlibri.de/xidel.html

summarity5y ago

Yooo, thanks for Xidel! I use it dozens of times per week. It's amazing. Next to the actual shell probably the single most useful ETL and scraping tool I've ever encountered. Keep at it!

masklinn5y ago

> I was doing web scraping, and needed regular expressions to get the text, so I have implemented XPath 2.

[0] http://exslt.org/regexp/index.html

acdha5y ago

benibela5y ago

I did not plan to implement it all, only the parts I needed for the webpages in my city. At first I did not even have backward axes. But people care much more about XPath than they care about my city

I also was doing too much competitive programming back then, where you have to discover and implement a highly complex algorithm in a few hours

But now I am still working on it 14 years later

anonymous3245y ago

Though I've never used Xidel, I came across it when researching XPath 2/3, and was very impressed that anyone managed to implement these massive, complicated specs all by themselves.

The major OSS XML libs, including LibXML2 and Xerces, do not implement what Xidel does, and neither do some proprietary libs like MSXML.

thom5y ago

That said, I scrape a fair few webpages now and have never once revisited XPath. I suppose people have mostly written off anything that feels too much like XML as enterprisey and deprecated.

nine_k5y ago

Indeed, XSLT was the first pure functional language to achieve some popularity (if not love) among wider circles of software developers. At least, so it was in the 2000s.

TuringTest5y ago

There may be a need for a replacement with simpler syntax; I wonder if GraphQL might be used in that role.

Out of curiosity, what tools do you use for scraping? Is there a similar simple tool for defining queries over trees?

tannhaeuser5y ago

deckard15y ago

I've often compared GraphQL to SOAP-XML with WSDL. It's nearly the same thing, and just about as boilerplate-y.

ping_pong5y ago

waynesonfire5y ago

tonyedgecombe5y ago

The trouble is they are as rare as hen's teeth. The temptation to add a little more is overwhelming. I know I suffer from it myself.

lenkite5y ago

contravariant5y ago

projektfu5y ago

anonymousblip5y ago
