The funny thing is that the document authors like these ways of working. It is the tech people who don't. I've seen "Git for Word" proposed many times a year for a while now. And all of the ideas are interesting, but none of them appeal to my audience because they don't care about git's feature set. Nobody wants to branch and merge. Nobody wants a straight version history. ("Nobody" meaning nobody in my market, not nobody in the world.)
They want a storytelling experience. They want to know the why, not the what. And the workflow tends to be unidirectional, not with collaborative changes coming back together, but with expanding changes as each person adds their ideas and makes changes for a specific instance of using a document. The experience we build for them brings in pieces of version history, pieces of comments, pieces of telling the story of why something was done, so people down the line can have more context to decide whether to accept or reject the changes.
It isn't that "Git for Word" is a bad idea - on the contrary, it would be great if someone pulls it off. My point is that building something that improves on Word isn't actually about the software, it is about the document workflows. If you find groups who work like software devs do, where documents receive small updates from a team, and bring all changes together for a final product, there is probably a market. But when evaluating such ideas, there has to be a reality check of whether the actual use of the documents truly matches the use case for git.
When I worked as a VFX freelancer I was amazed at the number of hours (=money) burned by marketing agencies who didn't manage to give me the definitive variant for a simple list of things they wanted. In one instance they gave me everything they had, including crude and unrecognisable filenames, hints about things that I should ignore via telephone etc. I had to make sense of it and compile a list which I sent them to approve. They ended up approving another list (!) which they themselves sent me two weeks prior and they only managed to correct this once I hinted at this.
Of course this is an example of how things should never be. This usually involves somebody getting sick and some uninformed person taking over etc. But what I learned on film sets is that you should choose the defaults of your communication culture in such a way that it works under the absolute worst conditions (bad weather, hungry, stressed, confused, etc).
And I have seen so many organisations fail at precisely that. If you get ill, someone else should be able to take over without heading to an oracle. This is not a special function limited to a version control workflow, it is something that has to do with clear communication.
Using git can sometimes help avoid the whole problem by making it obvious which file is the latest and which is a variant of it, but the people using it will have to use clear communication as well (e.g. by writing good commit messages, choosing the "right" commit sizes, naming things the right way etc). So if you know how to use git, you just might value clear communication a little bit more than the average person.
As git is a distributed system I think it’s not at all clear what the definitive final variant might be, and that is a strength.
That can be handled externally to git via ad hoc convention, say by using a system like gitlab or github and letting it declare one as “primary”, or by having someone post to a mailing list (“Commit X on a repo you can reach at URI Y is the official release”) both of which are common.
But in your example various people could mail you commits and not have any consensus on which is authoritative.
The defaults are sensible. Throw money at it and pay someone enough to sort things out and get it done, e.g. you as a freelancer get a data dump and ask the right question and the problem is solved. Sure it costs money. But everything costs money.
Git works great among peers. But most organizations are hierarchical. And the boss doesn't have to give a shit about which draft is the latest because the boss is the boss.
some would say whoever solves that problem is filthy rich
As a lawyer I can fully confirm that our industry works as you have described (as regards document workflows), and with my tech background I can also confirm that most features of dev-oriented solutions like git are mostly uninteresting from a lawyer's perspective.
I agree with both comments.
To add, in large-scale corporate/commercial practice (the area in which I practised), Git would be useful in replacing email-based collaboration, but the switching costs seem too high.
Currently, the corporate law contract negotiation workflow is as follows:
1. party A adds their tracked changes to a Word document based on a template contract;
2. party A emails this document to party B;
3. party B reads the changes, may discuss the changes with their client, adds their tracked changes, and then emails the updated document to party A.
This process repeats for every document, punctuated by occasional conference calls between the parties, until the parties agree.
‘Git for law’ would be useful for lawyers in increasing efficiency - and thus reducing costs for clients.
However, the benefits for law firms of adopting a new Git-based workflow are likely to seem relatively small to lawyers. Their current email-based version control system is messy and time-inefficient, but generally functions with minimal error.
On this basis, I would predict that most corporate law firms would be very slow to adopt a Git-based system - the benefits may not justify the costs.
One should also note that lawyers, particularly contract/commercial lawyers, are conservative by profession. In my experience, most lawyers are very slow to adopt new technologies, highly risk-averse, and skilled at spotting risks. The combination of these traits means that any technology will have to offer a very high benefit to replace an existing legal workflow.
I never trust the received file's "track changes", always compare to the latest version I've sent -- and it is extremely common to find a change that wasn't mentioned/discussed, and somehow magically "accepted" or otherwise not tracked in the other side's "track changes". Whenever I point these out, I always got a "oh, yes, forgot about that one", or "I didn't intend to put that in" or "I'm not sure why it didn't appear in the track-changes view" -- but out of tens of these (with multiple lawyers over multiple years), not one was ever in my favor.
Branching might not be as interesting on a single project - but diffing is, very much; and I'm sure it's not more coveted mostly because most lawyers either (a) don't realize how good it makes life for you when you can diff and blame easily, or (b) are abusing the fact that it is so hard to diff/blame on documents, and certainly (c) usually charge by the hour, so some efficiencies are actually going to cost them money if they implement them (a famous Upton Sinclair quote comes to mind).
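A minimal sketch of that compare-against-what-I-sent habit, in Python with the standard `difflib` module. It assumes the text of each version has already been pulled out of Word as a list of paragraphs; the function name and the sample clauses are made up for illustration.

```python
import difflib


def changes_since_sent(sent_paragraphs, received_paragraphs):
    """Return unified-diff lines between the version I sent and the one I got back.

    Works on the actual text, so it does not depend on what the other
    side's "track changes" markup claims was changed.
    """
    return list(difflib.unified_diff(
        sent_paragraphs,
        received_paragraphs,
        fromfile="sent",
        tofile="received",
        lineterm="",
        n=0,  # no context lines: show only what differs
    ))


# Hypothetical contract text, one list entry per paragraph.
sent = [
    "The Supplier shall deliver within 30 days.",
    "Liability is capped at the contract value.",
]
received = [
    "The Supplier shall deliver within 30 days.",
    "Liability is capped at half the contract value.",
]

for line in changes_since_sent(sent, received):
    print(line)
```

Anything that shows up here but was never discussed is exactly the kind of change worth asking about.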
m@replace-with-my-username.com
Exactly like you've hinted, the right way to crack this is to bring a full-fledged word processor like Google Docs, but instead of ad-hoc realtime collaborations the software has to enable customizable unidirectional document workflows with controlled collaboration.
Most serious document creators don't want to branch and merge, instead they want to pass on the document through a series of stages. They want statistics on when, what and why of each stage. And at any point of time the document is in one definitive stage not scattered across emails/folders/versions/forks.
https://support.google.com/a/answer/9381067?hl=en
Unlike its normal collaboration mode, the file gets locked down.
It allows multiple people to work in parallel (and in private). When somebody sends a pull-request eventually, they are presenting a story of changes that they want to get into the document and people can discuss them and approve them individually. (Of course, git the tool isn't necessarily suitable for non-technical people, but git-the-workflow seems to be a good foundation.)
Could you elaborate on what such a tool could look like without git style branching?
I don't know, I'm just spitballing. Sounds like it'd be fun for awhile to attempt to seamlessly get this into the workflow and see how it's accepted.
I think it's more about the user interface. The user interface of Git is essentially what programmers already do - code.
It is _their clients_ who don't, not just tech people. I hired lawyers a few times. IMO their redline and email workflow is error-prone craziness that could use improvement. That said, I'm a "tech" person, so I might be biased.
I've spent many years in 'collaborative writing' in R&D, mainly grant proposals and joint reports/deliverables, most in the CS/IT domains. Writing those texts is very different from writing the software.
First thing you should realize is there are no 'tests', and all the 'code' is usually in a single big file. Anyone that has touched the document can have potentially messed up everything, both content, layouts and meta-data, and there is no automatic way to check whether it still makes sense. Many times people will not use the agreed upon editor/version, and sometimes (often) that means a boatload of minor edits to the document all over the place just from opening and saving. Imagine everyone in your software team using different editors all with their preferred coding conventions that are automatically applied to the whole project at load.
From this you can deduce the enormous responsibility of ownership and gate-keeping in the workflow. The absolute worst collaborations I have been part of were those that somehow believed that if they used a collaborative document editing facility, wikis or Google docs for instance, that would negate the need for assigned owners/editors. Those tug-of-war shitstorms got exponential the closer one came to the submission deadline (technically incorrect, I know, but you know what I mean).
Some tips:
- Have well defined ownership for each section or part of your document. The owner receives and makes all changes for that part.
- have a final editor that is responsible for the complete document receiving the changes of the parts from their owners only.
- Do not trust 'track changes', but use Word's built in document compare if you are the final editor. For complex formatted documents (nearly all instances require you to use an insanely styled template), you 'clean room' import (C/P through notepad) the text changes into the correctly formatted doc under your control.
- release the current trunk document often, ideally once per day. This requires staggering, with subeditors closing submission windows and submitting their updates to the main editor before EoB. Everyone editing should work against the latest release.
- Every version published by the final editor should be immutable. Mail it to everyone if needed, but if you use a link to some sort of repository make sure it is a deep link to a version that can not be updated in the repository, or hilarity will ensue.
- use versioning in the filename. filename_YYYYMMDD_HHMM_dXXX_rNN.docx where XXX is the assigned party acronym for the person making the update. 'YYYYMMDD_HHMM' is only touched by the editor, 'dXXX_rNN' is the NNth change release by party XXX against version YYYYMMDD_HHMM.
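The filename convention above could be parsed and generated along these lines. This is only a sketch: the names `Version`, `parse` and `contribution_name` are hypothetical, and it assumes an editor release carries no `dXXX_rNN` suffix, which the tips don't spell out.

```python
import re
from typing import NamedTuple, Optional

# filename_YYYYMMDD_HHMM[_dXXX_rNN].docx
# Assumption: the _dXXX_rNN part is absent on the editor's own releases.
PATTERN = re.compile(
    r"^(?P<stem>.+)_(?P<date>\d{8})_(?P<time>\d{4})"
    r"(?:_d(?P<party>[A-Za-z0-9]+)_r(?P<rev>\d+))?\.docx$"
)


class Version(NamedTuple):
    stem: str
    release: str             # 'YYYYMMDD_HHMM', only ever set by the final editor
    party: Optional[str]     # acronym of the contributing party, if any
    revision: Optional[int]  # that party's change number against the release


def parse(name: str) -> Optional[Version]:
    """Split a filename into its parts, or return None if it breaks the convention."""
    match = PATTERN.match(name)
    if match is None:
        return None
    rev = match.group("rev")
    return Version(
        stem=match.group("stem"),
        release=f"{match.group('date')}_{match.group('time')}",
        party=match.group("party"),
        revision=int(rev) if rev is not None else None,
    )


def contribution_name(base: Version, party: str, revision: int) -> str:
    """Name for a party's Nth set of changes against an editor release."""
    return f"{base.stem}_{base.release}_d{party}_r{revision:02d}.docx"


release = parse("proposal_20200828_1730.docx")
print(contribution_name(release, "ACME", 3))
```

A final editor could run `parse` over incoming attachments and bounce anything that returns `None` or that was made against an old release.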
Most certainly Git can function as a repository, but there will be people that will not work with it (nor any other repository) so always assume mail interactions as well.
Finally, there should be a special place in hell for the people that designed SharePoint versioning. Don't even think of going there.
https://github.com/TomasHubelbauer/modern-office-git-diff
I've made this script which automatically extracts the Office file format (which is a ZIP archive of XML documents) and versions the XML documents and their extracted text contents alongside the binary Office file. This is done using a Git hook and it seems to work pretty well. If you're in need of versioning Office documents, this might be a good enough solution for you.
Edit: I should also address why not use the built-in Office versioning feature? The reason I don't use it is because I like to be able to view the diffs in Git. I don't want to have to use Office just to see the changes. My solution offers that. By doubling-up the way the original is versioned in the way of tracking the extracted XML and text contents as well, each commit's diff will have the binary change as well as the textual diff which in my experience is good enough to tell the gist of changes. And you're using standard Git / text manipulation tools you would use with any other diff.
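For anyone who wants to see the shape of the idea without reading the repo, here is a rough Python sketch of the extraction step only. It is not the linked script; `extract_xml_parts` and the sidecar directory are hypothetical names, and the text extraction and Git hook wiring are left out.

```python
import zipfile
from pathlib import Path


def extract_xml_parts(office_file, sidecar_dir):
    """Copy the XML parts of an Office file (a ZIP archive) into sidecar_dir.

    Returns the sorted list of archive member names that were written, so
    the caller can stage them alongside the binary file.
    """
    sidecar = Path(sidecar_dir)
    written = []
    with zipfile.ZipFile(office_file) as archive:
        for name in archive.namelist():
            # Only the textual parts are interesting for diffs; skip
            # embedded images and other binary members.
            if not name.endswith((".xml", ".rels")):
                continue
            # Refuse member names that would land outside the sidecar directory.
            if name.startswith("/") or ".." in Path(name).parts:
                continue
            target = sidecar / name
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(archive.read(name))
            written.append(name)
    return sorted(written)
```

Called from a hook before each commit, something like this would leave a directory of XML files next to the document for Git to diff as ordinary text.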
I've tried using the git diff patience algorithm, but it didn't work well - frequently, the diff was about to remove every single line and add them all back to the XML file.
I got some decent results with `xmllint --format` which is the linter/formatter from libxml2 (so available in most Linux distros and ported to most platforms).
(I was using xmllint as a formatting step when unpacking ODT files in my similar tool to the directly above; mentioned in a sibling comment. I found the XML files in ODT files were much more prone to being minimalized and reformatted/reordered on every save in comparison to DOCX which was surprisingly more stable in XML formatting.)
https://github.com/WorldMaker/musdex https://pythonhosted.org/musdex/
Because I built it to be extensible/support plugins I've used it for all sorts of interesting file types beyond DOCX too. (CELTX, a screenwriting format from years back; prettier diffs for Inform 7 source text; experimented with an SQLite deconstructor; ...)
Looks like I take a slightly different approach too, in that I store a bunch more metadata about the deconstructed contents (not just relying on directory listings), so I end up trusting my reconstruction tool a bit more and I mostly don't store the binary blobs in git, as I assume I can reconstruct them quickly enough.
One benefit of your solution over the `textconv`-based approach mentioned in the article is that your solution offers two different levels of diffs (XML and TXT).
To simulate that with textconv, you’d have to switch between two `diff.doc.textconv` variants.
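A sketch of what one such textconv converter might look like in Python, with a hypothetical `--xml` flag to switch between the two levels. Git runs the configured textconv command with the file path as an argument and diffs whatever it prints to stdout; the script name `docx_textconv.py` and the assumption that the text lives in `word/document.xml` paragraphs are assumptions for this example, not something from the article.

```python
# docx_textconv.py (hypothetical name)
import sys
import zipfile
import xml.dom.minidom
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's {uri}tag notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def convert(path, mode="text"):
    """Render the main document part as plain text or as indented XML."""
    with zipfile.ZipFile(path) as archive:
        raw = archive.read("word/document.xml")
    if mode == "xml":
        # One element per line, so line-based diffs stay readable.
        return xml.dom.minidom.parseString(raw).toprettyxml(indent="  ")
    root = ET.fromstring(raw)
    paragraphs = (
        "".join(t.text or "" for t in p.iter(W + "t"))
        for p in root.iter(W + "p")
    )
    return "\n".join(paragraphs) + "\n"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Git appends the file path as the last argument of the textconv command.
    mode = "xml" if "--xml" in sys.argv[1:-1] else "text"
    sys.stdout.write(convert(sys.argv[-1], mode))
```

You would then define two diff drivers, one whose textconv command passes `--xml` and one that doesn't, and pick between them in `.gitattributes`. That is the switching the comment above refers to.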
I downloaded a docx document from the net, opened it in LibreOffice, removed a single word, saved it as fodt, removed a single word again, saved it as fodt again, and the diff between the two fodt is gigantic.
Apparently there are lots of items like <text:p text:style-name="P20"> whose content didn't change, but their ID did. It didn't only affect IDs of content after the removed word, but content before as well.
The file has 19361 lines and the diff size is 1110 lines so there is some level of locality, but note that a lot of those lines are just base64 data of image content. The fodt is 1.5 times as large as the original file.
Try it yourself, this is the document: https://www.acquisition.gov/sites/default/files/manual/SOP_P...
I recommend having a commit hook that (somewhat) pretty-prints and line-wraps the XML – perhaps splitting on sentences too, so that adding a word doesn't proliferate all down the page. I haven't tried this, though, so it might not help. If you do, could you release the code?
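The sentence-splitting half of that idea might look something like this in Python. It has not been tried on real documents, just like the suggestion itself: the regex is naive (it will split after abbreviations such as "e.g."), and `one_sentence_per_line` is a made-up name.

```python
import re

# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def one_sentence_per_line(paragraph: str) -> str:
    """Re-wrap a paragraph so each sentence sits on its own line.

    With one sentence per line, adding a word changes a single line in a
    line-based diff instead of re-flowing the rest of the paragraph.
    """
    sentences = [s for s in _SENTENCE_END.split(paragraph.strip()) if s]
    return "\n".join(sentences)


print(one_sentence_per_line("First point. Second point? Third!"))
```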
It used to store everything on one line without breaks if I recall correctly.
With a little bit of work to ensure stability of numbering, FODT and related flat ODF formats could be really usable with version control.
Edit: I meant 'fascinated with using git here in this context'.
That said, "track changes" is still used extensively especially with parties outside the organization, especially for legal documents.
Some of the proposed solutions were very nice, particularly Draftable - but it's expensive and my bosses didn't feel it was worth it. To this day they still work on huge slide decks that are partially shared, but I'm just not involved anymore with that side of things so I stopped pushing. I still think a way of tracking Powerpoint decks on a slide-by-slide basis, with partial merging and synching, would be really good to have (existing features for embedding are '90s-era).
For Word there are quite a few solutions nowadays, most are clearly superior to the stuff Office ships with. So the problem is still there, just not as bad as 15 years ago.
I use O365 collab features daily (with SharePoint/OneDrive) storage and the experience has been similar to that of GSuite. I regularly work on PowerPoints with multiple people simultaneously editing the slides.
Word is expensive, proprietary and the XML it generates is unfathomable. There are so many better FOSS tools and systems that we could be using. If you're collaborating on a document then markdown or LaTeX has you covered. You get version control through git and multiple people can contribute. If you're writing a book or article, then the graphic designers and typesetters are going to make the design decisions, not the author, so why bother messing around with fonts and colours and the infuriating placement of images and tables.
I authored a kid's book on coding, and the process was a nightmare. I authored in markdown, used pandoc to convert and then further edited in libreoffice, to be able to send stuff through in docx format. Then revisions were sent back in docx and I had to reverse the whole process, so I could maintain my plain-text version of the book. Then the proofs were sent through as PDFs, which I then had to mark up for corrections. Many of the mistakes were due to the crappy way Word places images. In the end I just bought a copy of Word, and submitted to the way my publisher wanted me to work, which disrupted the authorial process.
It's time we ditched Word, in the same way we ditched VHS and DVD. It's an outdated technology that remains dominant just because everyone uses it at school, and then refuses to move on. If schools insisted that all homework was submitted in something like markdown, we'd see a dramatic change in a very short period of time. (BTW when I was teaching CS, my kids authored in markdown and submitted on GitHub)
Right, rant over - but I've been talking about this for years - http://coding2learn.org/blog/2014/04/14/please-stop-sending-...
These are not WYSIWYG solutions which answers 99% of your question "why". When people want to write a document they want to write things and have the things appear on a page, possibly in different formatting. Injecting ideas like source files, rendering pipeline, etc. will just result in confused people.
That's why online solutions like Google docs are popular. No special app, things look like expected, you can collaborate, and few people actually need any fancy features.
> text
> image
> more text
> table
> more text
There are any number of applications that allow you to write markdown and view the generated HTML in whatever formatting you want. Your recipient then gets to choose their own fonts, colours etc, which from an accessibility point of view, is much better.
Unless you're printing a hardcopy or creating a PDF, what is the point of Word?
I write a lot of stuff in the legal area (articles, books, contracts, court documents, etc) and there's nothing that comes close to Word.
For some time I had tried to switch to LibreOffice. My goal was to quit Word, which is the only software that still binds me to Windows/Mac (not interested in Wine). I hoped to finally be able to switch to Linux without any hiccups.
Unfortunately LibreOffice is not quite as good as Word. I use many of the advanced features of Word, and the more you use these in LibreOffice, the more you encounter bugs. At one point I had a .odt file with tons of cross-references in footnotes pointing to other footnotes. When I was ready to ship the document I found out that all cross-references were messed up and I had to redo them all.
Now it's true that LibreOffice has a huge and active community that works hard to improve the product, but as word processors are my main and most important tool for work, I need the most reliable software I can get. Unfortunately that is still Word...
On top of that I must add that I do need to properly format documents 99% of the time, and on this too I find Word slightly superior, even if admittedly it is quite comparable to FOSS solutions here. The only fairly big problem in this regard is interoperability. Since I know that most, if not all, of my colleagues/counterparts use Word, whenever I send a document I need to send something that "will just work" for them, which is a docx. This means that using anything other than Word might cause some problems with formatting, which in some cases is pretty important.
Markdown -- standard markdown isn't expressive enough (no tables for example), there are lots of extensions but none which are "standard".
LaTeX -- doesn't produce accessible documents, so is a non-starter in lots of areas (seriously, the PDFs it generates are some of the worst around when it comes to accessibility. Word's are amazing).
If we put aside the Word file format, and maybe the ribbon, is Word bad though?
I've been using it for decades, and have tried OpenOffice and LibreOffice too over the years - nothing comes close to Word.
Markdown is not suitable for "normal" users, but as a developer, I've come to prefer markdown for technical documentation and such (especially where I want a history, diffs etc), but I still use Word for a lot of other things.
Word is incredibly fully-featured - I use a lot of functionality, but am likely still only using a fraction of what it has. It really does have all your document editing needs covered.
Aside from the file format, I think Word is a fantastic piece of software. I have a few annoyances with it now and then, but it's been very dependable and kept me in good stead over the years.
Microsoft Office is quite useful, and probably good value for money. But it has never ever been free, and I’ve been buying it for work and home since Office 4.x on Windows 3.1 was the new hotness.
Consider someone dealing with inter-departmental collaboration on documents at a company in the 70s or 80s. They could potentially invent their own system, make paper copies mandatory, go full computer, or any number of solutions in between. Technology was considered hard and looked recognizably so, and management was less likely to question technical views and opinions about this. People were way less likely to get fired and generally visualized staying there for a while, so they were comfortable sticking to their viewpoints.
Today, your boss and their boss are all concerned with how to get the maximum amount of work out of you in the time you're at the company. So if you propose retraining everyone on Open Office or Markdown, because it has high potential for a better way of tracking changes or something, you'll get pushback from a) management, because the CEO is going to say “but I use Word all the time, why can't you just use that?” and b) the workers, because they know they will be forced to learn it on their own time rather than being given a proper amount of time to train and learn. [1]
I think modern society and modern work are slowly defaulting to the idea of quickly throwing in the towel and just using whatever technology is approved by the milieu. This is true even in our industry, consider this article [2] by Latacora [3] for instance: it's full of statements which approximately say “Just use CloudTrail”, “Just Use Jamf”, “Just Use Okta SSO” etc. If our industry is doing things like this to optimize extraction (even the article acknowledges that SOC2 is purely documentation optimized for selling to big companies), why would we be so surprised that publishing departments and such are optimized to Just Use Microsoft Word rather than a technically better system?
-------
[1] Think back: when was the last time you had a proper training about how to use a certain piece of software by people from the company building it, or at least certified trainers? These were way more common back in the day.
[2] https://latacora.micro.blog/2020/03/12/the-soc-starting.html
[3] A very respectable security company focused on startups.
Basically, it's a "good enough" WYSIWYG, and a number of industries have standardized on it, in spite of the fact they should actually use an open standard + tool that actually fits their needs. I think screenwriting might be the one industry to escape Word, since they use Final Draft as I understand it.
Because Office/Word has become the hammer of the document writing world.
It isn't an issue that it's a bad product and better products are out there (and there are).
It is that everyone is expected to know how to use Word at a basic level. From Secretaries to VPs and CEOs, almost universally these people can open a Word document, edit it, and save it.
Because of this expectation, it is easier to throw money at Microsoft and have the tool you can expect everyone to use.
Non-technical folks do not want to reach for CSS to apply formatting to a text document. Heck, neither do I.
In the last few months, though, I gave up on Markdown to switch to a more robust format - LaTeX. Before I switched, I didn't know LaTeX at all, but I knew from my reading that it had the features I needed.
It certainly makes for less _noisy_ source files in my opinion, and it also means that you get to take advantage of the fact that, if you want to, you can easily convert your markdown to HTML, with maths using something like mathjax.
This was a bit of a ramble, but I honestly can't say enough nice things about pandoc.
It's worth noting here that I'm writing layout in LaTeX also - like controlling the number of columns, where breaks exist, etc.
Seriously, org has served all my authoring needs for over a decade now. You can export to LaTeX and HTML easily, and now pandoc does a decent job of exporting to other formats. You can embed LaTeX lines in your org document, so you get the full power of LaTeX, without having to write LaTeX for everything. Tables are hellish in LaTeX, and even lists are a pain.
Of course, there is the whole "You have to use Emacs" thing...
And honestly... I enjoy writing LaTeX. The structure just feels really comfortable to me.
I'd say RST is suitable for many types of documentation but I'm not convinced that it's suitable for conference/workshop submissions.
Now we need a native diff viewer for structured files, where the changes are presented with attribution either side by side, or alongside (like gitk, or like gitlab diff viewer).
Then we need an editor that supports doing the gitty stuff natively, so that the non-technical writer doesn't have to worry about creating repos and committing the changes from the command line.
https://www.zoho.com/writer/help/document-tools.html#Combine...
Feedback welcome: https://www.simuldocs.com/
ETA: Apparently right below this comment someone has already created this: https://news.ycombinator.com/item?id=24303611
Given this line, I think it's fair to add (2014) to the title.
This is pretty old news by now :)
It wasn't fleshed out or usable, but it was an interesting project. I was impressed at how open the Word/Office format was, this was before Microsoft's reemergence into openness and open source.
Another useful trick is to pipe the ANSI-colored terminal output through `aha` (https://github.com/theZiz/aha or `brew install aha`) which produces HTML output, e.g.
git wdiff | aha > ~/Desktop/mydiff.html
You can then send the file mydiff.html to collaborators by email or add it to a CI build script.
I am finally replacing it with a SharePoint solution. It's a headache to have to maintain versions for non-technical people.
[0]: https://www.simuldocs.com/features/version-control-for-micro...
https://www.vivekkalyan.com/using-git-for-word
I tend to prefer markdown for most things, but find it hard to beat Word in terms of simplicity of elegant designs for, say, resumes.
Apart from attachments and metadata the actual document is some kind of xml monstrosity that contains the text and the markup. It’s not very useful to just create diffs from that, it looks a bit like the HTML created by FrontPage if you remember that.
You can just rename a docx file to .zip, unpack it and peek around.
However, a diff of the Word XML is the perfect tool for understanding how Microsoft interprets the spec.
i wrote a novella using a folder system + text editor + git. i'm trying to put that into a web app. don't know how useful it would be for other people though. and don't know if it will ever be finished because i need to write.
If you like writing out of a text editor (I use Atom) it's super useful.