solardev 3y ago

Thank you for the context! What you're doing is actually much harder than regular web dev. It's a specialized kind of data processing, often called an "extract, transform, load" (ETL) workflow.

Most web devs don't need to do that, and the fact that you're willing to tackle it at all shows how willing to learn you are, despite the frustration.

If you hate this situation, it's totally understandable lol. That kind of work has all the tedium of dealing with someone else's arcane data format, and none of the joy of seeing your creativity come to life. Some people love that sort of work and specialize in it, becoming backend people or DB engineers or data scientists or the like, but it's not usually what web devs are known for (they tend to focus instead on UIs, some level of design, and interactive stateful apps). Nothing wrong if ETL just isn't your cup of tea. I'd go crazy if I had to do that often, too.

Anyhow, if I'm understanding you right, you have HTML embedded in JSON and/or XML. Do you know what "escaping" is in the text embedding sense? Like if you have quotes inside quotes, or tag brackets inside tags, how do you separate each layer of embedding? If your JSON and XML files are cleanly escaped, you should be able to (as a first step) just iterate through the files and get the HTML parts out (without regex).
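
To make the escaping idea concrete, here is a small sketch in Python (picked only because its standard library handles both formats; the sample markup and field names are made up). It wraps the same HTML fragment in a JSON string and in an XML text node, then lets each parser peel its own layer back off, so no regex is involved:

```python
import json
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Made-up HTML fragment containing quotes, tag brackets and an entity.
html_fragment = '<p class="intro">Fish &amp; chips</p>'

# JSON layer: the inner double quotes get backslash-escaped.
as_json = json.dumps({"description": html_fragment})

# XML layer: tag brackets and ampersands become entities.
as_xml = "<job><description>" + escape(html_fragment) + "</description></job>"

# Each parser undoes exactly one layer of escaping.
from_json = json.loads(as_json)["description"]
from_xml = ET.fromstring(as_xml).find("description").text

print(as_json)
print(as_xml)
print(from_json == html_fragment, from_xml == html_fragment)
```

If the files were escaped consistently when they were written, the parser hands back the original HTML untouched. If they were not, the parser will usually fail loudly, which is still more useful than a regex that silently matches the wrong thing.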

Like if the HTML is just a data string inside JSON, you can transform the JSON into an array of HTML strings using array.map() or Object.values(obj).map().
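
A minimal sketch of that step, written in Python where a list comprehension plays the role of array.map(). The feed layout, the "jobs" key and the "description" field are assumptions for illustration, not something taken from the actual data:

```python
import json

# Hypothetical feed: each job keeps its HTML in a "description" string.
raw = '''
{
  "jobs": [
    {"id": 1, "description": "<p>First <b>job</b></p>"},
    {"id": 2, "description": "<div><p>Second job</p></div>"}
  ]
}
'''

def extract_html(document, field="description"):
    """Return the HTML string stored under `field` in every job record."""
    data = json.loads(document)
    # Records that lack the field are skipped rather than raising KeyError.
    return [job[field] for job in data["jobs"] if field in job]

html_parts = extract_html(raw)
for part in html_parts:
    print(part)
```

The JavaScript version has the same shape: JSON.parse() the text, then map over the array (or over Object.values(obj) when the records sit in an object rather than an array).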

In the XML, if the HTML is stored in CDATA fields, you can access it using an "XPath" selector... you know how CSS has selectors that let you say headings should be styled one way, paragraphs another? XML has its own selector language that lets you directly target a certain node inside the document, without using regex, by specifying the hierarchical path that takes you there (like the CDATA text inside a description inside a job inside a company, or whatever). Although there is a learning curve to XPath, it is much better suited to the task than regex, because regex can't easily account for the complexity within XML (especially when there are nested layers).
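
Here is roughly what the path idea looks like, sketched with Python's built-in xml.etree.ElementTree. The company/job/description layout is invented for the example. Note that ElementTree only understands a limited subset of XPath, so a full XPath engine (the third-party lxml package, for instance) may be needed for more complex selectors:

```python
import xml.etree.ElementTree as ET

# Invented sample document; the HTML sits in CDATA sections.
xml_doc = """
<company>
  <job id="1">
    <description><![CDATA[<p>First <b>job</b></p>]]></description>
  </job>
  <job id="2">
    <description><![CDATA[<p>Second job</p>]]></description>
  </job>
</company>
"""

root = ET.fromstring(xml_doc.strip())

# "./job/description" reads as: description inside job inside the root element.
# The parser has already unwrapped the CDATA, so .text is the raw HTML.
html_parts = [node.text for node in root.findall("./job/description")]

# Predicates narrow the match: only the job whose id attribute is "2".
second = root.find("./job[@id='2']/description").text

print(html_parts)
print(second)
```

The path states where the HTML lives in the hierarchy, so it keeps working even if the HTML inside contains angle brackets, quotes or nested tags that would trip up a regex.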

It would help if you could post some example snippets, but that might be better suited to Stack Overflow than HN (though feel free to link to it here).

Once you have the HTML out, then you can run it through a sanitizer -- that's an optional step, but would let you strip out unnecessary divs, old font tags, whatever, keeping only basic formatting (headers, paragraphs, links, bold, etc.), which should be much cleaner to hand off to your clients. That would be much easier to embed on someone else's site vs a scraped page with all the HTML mess from someone else's framework.
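
As a rough illustration of that cleanup step, here is an allowlist-based sketch using only Python's standard html.parser. The tag and attribute lists are assumptions to adjust, and this is meant for tidying formatting, not as a security filter (it does not check link URLs, for example). For untrusted input a maintained sanitizer library is the safer choice:

```python
from html import escape
from html.parser import HTMLParser

# Assumed allowlists; adjust to whatever formatting the clients need.
ALLOWED_TAGS = {"h1", "h2", "h3", "p", "a", "b", "strong", "i", "em",
                "ul", "ol", "li", "br"}
ALLOWED_ATTRS = {"a": {"href"}}
DROP_WITH_CONTENT = {"script", "style"}

class AllowlistSanitizer(HTMLParser):
    """Keep allowlisted tags; drop every other tag but keep its text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0  # depth inside script/style, whose text is dropped

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self._skip += 1
            return
        if tag not in ALLOWED_TAGS:
            return  # e.g. div, font, span: tag dropped, inner text kept
        kept = [
            f' {name}="{escape(value or "")}"'
            for name, value in attrs
            if name in ALLOWED_ATTRS.get(tag, set())
        ]
        self.parts.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self._skip = max(0, self._skip - 1)
            return
        if tag in ALLOWED_TAGS and tag != "br":
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip:
            # Re-escape text so stray < or & cannot form new markup.
            self.parts.append(escape(data, quote=False))

def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)

messy = ('<div class="x"><font color="red"><p>Hello <b>world</b> &amp; '
         '<a href="/jobs" onclick="x()">more</a></p></font></div>'
         '<script>alert(1)</script>')
print(sanitize(messy))
```

The wrapper div, the font tag, the onclick attribute and the whole script block disappear, while the paragraph, the bold text and the link survive.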

I know there is a lot of complexity in each of those steps, but there are great tools and documentation for each step of the way. That's just to get you started.

At the end of the day what you're doing isn't really a JavaScript issue at all, it's just a different kind of work that JavaScript happens to be able to handle if you really need it to (but so can Python or Java or specialized command line tools like jq). It's a different body of work, which is why your casual web dev skills aren't providing easy answers. It's OK! You can learn it once and make it work (and then decide never to do that again, like I did lol). Or switch tracks, totally up to you :)

But feel free to ask here or on Stack if you have followups!

casual-dev 3y ago

You are much appreciated. I didn't even know there is a term for this part of my work.

Down the line, we do everything you cautiously described. We extract single fields with pointers (for lack of a better term, English is not my main language) to the XML/JSON fields we'd like to extract. Our software then lets us use JS snippets to manipulate the contents. Problem is, once you define a rule, it may get 80-90% right across hundreds of datasets. But breakage is not an option most of the time. It's Pareto principle work: 80% of the work in 20% of the time, 20% of the work in 80% of the time. In the end, they are just snippets, then a giant gap, then the projects my colleague does.

I get where you are coming from, regarding "never to do that again". This is not the only work I do. I also build HTML from customer demands, many of which are PDFs meant for print use, but not for the web. I like it, but I only scratch the surface of what might be. Thanks to the resources in this thread, I have a good insight into what is to come. So, thanks again.
