Pandoc

sevensor7y ago

I wrote my dissertation using Pandoc. It might seem that the LaTeX boilerplate is minimal, but Markdown is even more minimal, and it preempts the urge to fuss with your layout. Writing in Markdown means that you can wave your hand at the document and say, "It's a draft, I'll fix the formatting once I'm sure I even want this material." Afterwards, fixing the layout is really easy because you can drop raw LaTeX in wherever you need to, and you haven't wasted countless hours laying out a float you later end up cutting.

stdbrouw7y ago

Having to use `\textbf{...}` is impetus enough for writing in Markdown instead.

hatmatrix7y ago

I agree - while pandoc is great, it's usually not 'one click' to any format, especially when you have HTML- or LaTeX-specific markup.

It's not for everyone, but emacs+auctex reduces the LaTeX boilerplate (at least the writing of it) so much that I don't really feel it's a hindrance.

neves7y ago

I haven't used LaTeX for years. Is it still hell to make tables? And is it still very difficult to use templates to generate good-looking documents that don't look like an academic paper?

uvtc7y ago

Yes! It's great to be able to put LaTeX-formatted equations directly into your pandoc-flavored markdown source file.

Incidentally, I really like the thoughtful syntax additions Pandoc makes over olde Markdown (e.g., tables, definition lists, and span and div syntax). Such a great all-around doc tool.

dr_coffee7y ago

What's your workflow for inserting and managing references?

tekknolagi7y ago

Not OP, but I used `+citations` and `pandoc-citeproc` along with a bibtex file that I managed by hand for https://bernsteinbear.com/dat-paper/ (a small senior project paper). It worked pretty well for me.

meanmrmustard927y ago

Add `bibliography: path/to/library.bib` to the YAML front matter (and optionally specify a `csl` file for bibliography formatting; I like econometrica). Insert citations with `@bibcitekey`. Compile with `--filter pandoc-citeproc`.
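
The front-matter setup described here can be sketched in Python. This is only an illustration under stated assumptions: the helper names `front_matter` and `citeproc_command`, the file names `library.bib`, `econometrica.csl` and `paper.md`, and the citation key `smith2010` are all made up, and pandoc plus pandoc-citeproc are assumed to be installed.

```python
def front_matter(bibliography, csl=None):
    """Return a YAML metadata block naming the bibliography and an optional CSL style."""
    lines = ["---", f"bibliography: {bibliography}"]
    if csl is not None:
        lines.append(f"csl: {csl}")
    lines.append("---")
    return "\n".join(lines) + "\n"


def citeproc_command(source, output):
    """Build the pandoc argv that resolves @key citations through pandoc-citeproc."""
    return ["pandoc", source, "--filter", "pandoc-citeproc", "-o", output]


# Hypothetical document: front matter followed by one sentence citing a made-up key.
DOCUMENT = (
    front_matter("library.bib", csl="econometrica.csl")
    + "\nTagging matters [@smith2010].\n"
)
```

Writing `DOCUMENT` to `paper.md` and passing `citeproc_command("paper.md", "paper.pdf")` to `subprocess.run` would perform the build, provided the key actually exists in the .bib file.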

sevensor7y ago

It was a couple of years ago that I wrote my dissertation using Pandoc, so things may have changed. At the time, I started out using pandoc-citeproc with my BibTeX database, but eventually I needed more control over formatting and I switched to writing \cite everywhere. Even with hundreds of references, it only took an afternoon, so I'm happy I did it the way I did. My approach with Pandoc is to use it until you have to invest LaTeX-level effort into making it do what you want. At that point, swapping in LaTeX is rarely painful. Often you can get away with editing Pandoc's generated LaTeX and pasting it back into your source.

ryanmarsh7y ago

I use pandoc-crossref.

criddell7y ago

> if the publisher 'needs' a Word file, you are one click away from providing it

Once the work has moved into a Word file, isn't that where it stays? Editors and publishers often make heavy use of features like track changes and notes. Doesn't pandoc lose that information?

aquova7y ago

It does. I think the assumption here is that the author is the only contributor to the document. Exporting into a Word doc would serve the same function as exporting to a .pdf, others could read it and even mark it up, but the author would have to make the noted changes in their original plain text document themselves.

mb21007y ago

pandoc has a --track-changes option, so you can convert a docx file with its proposed changes back to, say, markdown.
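
A small Python sketch of how that option might be invoked. The function name `track_changes_command` and the file name are hypothetical; the three modes (`accept`, `reject`, `all`) are the values listed in the pandoc manual for `--track-changes`, and pandoc itself is assumed to be on the PATH.

```python
VALID_MODES = ("accept", "reject", "all")


def track_changes_command(docx_path, mode="all", target="markdown"):
    """Build the pandoc argv for reading a .docx with the chosen track-changes mode.

    "accept" and "reject" resolve the proposed changes one way or the other;
    "all" keeps insertions, deletions and comments in the converted output.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return ["pandoc", f"--track-changes={mode}", "-f", "docx", "-t", target, docx_path]
```

For example, `subprocess.run(track_changes_command("reviewed.docx"), check=True)` would print the Markdown with the editor's changes still marked.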

jonathanstrange7y ago

I tried and it didn't work for me. Pandoc's conversion functionality is good but unfortunately also fails very often, at least in my experience. I suppose with custom templates and a lot of trickery I could get it working for the kind of papers I write, but I've found it easier to convert LaTeX to Word manually when needed - which is a pain in the ass, too, of course.

nanna7y ago

In my experience it works so long as you keep to very vanilla LaTeX code. Pandoc's support for LaTeX packages tends to be very patchy.

baldfat7y ago

I have to put in a word for Racket's Scribble. Programmatically creating documents is powerful, and this system makes it simple. You can also basically use it as a "Markup-less" system.

Scribble Code Example:

#lang scribble/base

@title{On the Cookie-Eating Habits of Mice}

If you give a mouse a cookie, he's going to ask for a glass of milk.

@section{The Consequences of Milk}

That ``squeak'' was the mouse asking for milk. Let's suppose that you give him some in a big glass.

He's a small mouse. The glass is too big---way too big. So, he'll probably ask you for a straw. You might as well give it to him.

@section{Not the Last Straw}

For now, to handle the milk moustache, it's enough to give him a napkin. But it doesn't end there... oh, no.

Scribble -

Scribble is a collection of tools for creating prose documents—papers, books, library documentation, etc.—in HTML or PDF (via Latex) form. More generally, Scribble helps you write programs that are rich in textual content, whether the content is prose to be typeset or any other form of text to be generated programmatically. - https://docs.racket-lang.org/scribble/

Some languages based on Scribble

Skribilo -

Skribilo is a free document production tool that takes a structured document representation as its input and renders that document in a variety of output formats: HTML and Info for on-line browsing, and Lout and LaTeX for high-quality hard copies.

The input document can use Skribilo's markup language to provide information about the document's structure, which is similar to HTML or LaTeX and does not require expertise. Alternatively, it can use a simpler, “markup-less” format that borrows from Emacs' outline mode and from other conventions used in emails, Usenet and text. https://www.nongnu.org/skribilo/

Pollen -

Pollen is a publishing system built on top of Scribble and Racket. So far, I’ve optimized Pollen for web-based books, because that’s mainly what I use it for. But it can be used for small projects too, and non-webby things like PDF.

As a publishing system, Pollen includes:

    A programming language. The Pollen language is a variant of Scribble, with specific dialects tailored to different kinds of source files. You don’t need to use the programming features to do useful work, but they’re available when you need them.

    A set of tools & libraries. Pollen can produce output in any format, but it’s especially useful for markup-style formats like XML and HTML.

    A development environment. Pollen works with the DrRacket IDE. It also includes a project web server so you can dynamically preview and revise your publication. http://docs.racket-lang.org/pollen/Backstory.html

They are domain-specific languages that excel at outputting awesome HTML and PDF. They aren't really markup; they are a macro system built on top of a full Lisp (Racket). It is easier and much more powerful than anything I have seen with Pandoc and LaTeX (I still use LaTeX for specific targets, but not for general papers anymore).

Racket has the best documentation period and it is because the documentation

mb21007y ago

Occasional pandoc contributor here, AMA :-)

Just a few links:

- Where everything is documented: http://pandoc.org/MANUAL.html

- If you have questions or suggestions: https://groups.google.com/forum/#!forum/pandoc-discuss

- Contributing to pandoc is also a great way to get your feet wet with Haskell. In my experience, very supportive community. See http://pandoc.org/CONTRIBUTING.html and for good first issues: https://github.com/jgm/pandoc/issues?q=is%3Aopen+is%3Aissue+...

Finally, a great feature, that hasn't been mentioned here, is pandoc filters. Basically, pandoc provides a way for scripts (in any programming language) to hook into the transformation pipeline and modify the document AST (similar to the HTML DOM) in-between the reading and writing steps. See http://pandoc.org/filters.html
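
To make the filter idea concrete, here is a minimal sketch of a JSON filter in plain Python with no helper library. It assumes the JSON layout pandoc emits with `-t json` (a top-level object with `blocks`, where each element is a dict with a type tag `t` and optional contents `c`). The names `walk`, `shout` and `run_filter` are made up for this example, and the transformation (upper-casing all body text) is deliberately trivial.

```python
import json
import sys


def walk(node, action):
    """Recursively rebuild a JSON AST node, letting `action` replace element dicts."""
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        if "t" in node:
            replaced = action(node)
            if replaced is not None:
                return replaced
        return {key: walk(value, action) for key, value in node.items()}
    return node


def shout(element):
    """Upper-case plain text elements; return None to leave anything else alone."""
    if element.get("t") == "Str":
        return {"t": "Str", "c": element["c"].upper()}
    return None


def run_filter(stream_in=sys.stdin, stream_out=sys.stdout):
    """Read a pandoc JSON document, transform its body blocks, write it back out."""
    doc = json.load(stream_in)
    doc["blocks"] = walk(doc["blocks"], shout)  # metadata is left untouched
    json.dump(doc, stream_out)
```

Saved as an executable script that ends with a call to `run_filter()`, this could be used as `pandoc input.md --filter ./shout.py -o output.html`. Pandoc pipes the AST to the script's stdin and reads the modified AST from its stdout.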

paule897y ago

Every time I see a project using Google Groups, I think it is already dead. Gladly, yours seems to be used quite often. At least you can search it even years later, compared to an IRC or Slack channel.

neves7y ago

IRC channels and mailing lists are excellent for informal questions about a project. You can search for guidance, see if a feature would be well received, and get a green light before starting to implement something.

The other day I thought about contributing to Yarn, the JavaScript package manager, but the only way I found to communicate with the developers was issues on GitHub. Since I didn't know if the feature I wanted would be well received, I just quit.

mb21007y ago

Aren't mailing lists even older than IRC? Anyway, whatever works, right?

benhoyt7y ago

Heh, not quite. Go (the programming language) uses Google Groups (golang-nuts and golang-dev), and it's very alive.

roblabla7y ago

Hey, just a one-time contributor here (fixed a small bug in the wikitext parser), I have to say that the community is really, really great! I had never done any haskell before, but with just a little guidance from the IRC channel (#pandoc on freenode), I was put on the right track and submitted my small PR, which was merged quickly.

Overall great experience. Thanks for the great tool :).

mariushn7y ago

Could we consider Pandoc a rough replacement for leanpub publishing tools?

mb21007y ago

Probably. I'm not too familiar with leanpub, but seems like they're actually using pandoc to import docx.[0] And with pandoc you can also export to epub, pdf, docx, indesign, etc.

[0]: https://leanpub.com/anewcourse

koolba7y ago

My favorite pandoc hack is using it to convert word docs into markdown which can then be diffed similar to source code. Works great for legal redlining.
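
A rough Python sketch of that redlining workflow, assuming pandoc is on the PATH. The function names and file names are hypothetical; `--wrap=none` is used so that each paragraph stays on one line, which tends to make the diff easier to read.

```python
import difflib
import subprocess


def docx_to_markdown(path):
    """Convert a .docx file to Markdown text by calling the pandoc CLI."""
    result = subprocess.run(
        ["pandoc", "-f", "docx", "-t", "markdown", "--wrap=none", path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def redline(old_text, new_text, old_name="old", new_name="new"):
    """Return a unified diff of two Markdown strings as a list of lines."""
    return list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile=old_name, tofile=new_name, lineterm="",
    ))
```

Something like `print("\n".join(redline(docx_to_markdown("v1.docx"), docx_to_markdown("v2.docx"))))` would then show changed clauses with `-` and `+` prefixes, the same way a source diff does.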

d99kris7y ago

Agree, and it can be nicely integrated with Git: http://blog.martinfenner.org/2014/08/25/using-microsoft-word...

flatline7y ago

Do Word’s native diff features not work for you?

lucb1e7y ago

Can you diff two Word docs with Word? Afaik you can only hit the "track changes" button, which doesn't help if you got a new version of a document from someone else.

copperx7y ago

Diff works great in Word.

amelius7y ago

Trying pandoc on a word doc gives me:

    # pandoc test.doc -o test.pdf
    pandoc: Unknown reader: doc
    Pandoc can convert from DOCX, but not from DOC.

mb21007y ago

As the error says, pandoc doesn't support .doc files, only .docx.

1: https://github.com/tomhodgins/responsive.style/blob/master/s...

psychometry7y ago

Does it come up with the right semantic content for lists and tables?

tomrod7y ago

> legal redlining

Is this underlining, and not redlining as defined in financial services? (redlining: differential pricing based on demographic makeup of a zip code or neighborhood)

URSpider947y ago

Attorneys and business folks use it to mean “marked up” - a redlined contract has additions in red, and removals in red with red lines through them.

prdonahue7y ago

Redline as in how strikethrough appears in Word. It's a colloquial term.

h4b4n3r07y ago

OP was referring to diffing changes between the original and modified version of a legal document (typically a contract).

err4nt7y ago

I use Pandoc to convert directories of Markdown files into static HTML websites.

Here's the build command for responsive.style[1]:

    pandoc $file -f markdown -t html5 -H templates/header-prod.html -B templates/nav.html -A templates/footer-prod.html -o (echo "../$file" | sed '$s/\.md$/.html/') -s  --data-dir=./ --highlight-style breezedark --variable=file:(echo "$file" | sed '$s/\.md$/.html/')

Works beautifully!

uvtc7y ago

Nice.

I wrote up a tool as well, with navigation and prev/next links: http://www.unexpected-vortices.com/sw/rippledoc/index.html

ArlenBales7y ago

I know Blot.im as one static site generator that uses it: https://github.com/davidmerfield/Blot/blob/master/app/models...

privong7y ago

Hakyll is another: https://jaspervdj.be/hakyll/

ggambetta7y ago

Another happy Pandoc user here :)

I built a pipeline to convert a Markdown file to publishing-ready files for ebooks, Kindle and paperback for my novel; the whole thing is described here: http://www.gabrielgambetta.com/tgl_open_source.html

My website itself is static, generated from a bunch of Markdown files, some HTML templates, and a bit of postprocessing. But most of the work is done by Pandoc.

odiroot7y ago

Hey. Can you give us some more context of your novel writing in Markdown? I'd be interested in your process.

ggambetta7y ago

Sure. The technical side of things is explained here: http://www.gabrielgambetta.com/tgl_open_source.html (same link as above). If you're more interested in the creative aspect, I wrote a bit here: http://www.gabrielgambetta.com/tgl_swiss_trains.html. If you're interested in anything not covered there, feel free to ask, I'll be happy to share :)

myself2487y ago

The one thing it can't do is give HN posts descriptive titles.

wadkar7y ago

Thanks, this made me chuckle. It's comments like these which make HN a bit more colorful :-)

jaggederest7y ago

Also, an interesting point of trivia: the maintainer, John MacFarlane, is a professor of philosophy at UC Berkeley.

One nice trick that I use all the time is to convert html to md and back again in order to clean it.

Anyway, pandoc is great.

gitgud7y ago

Would that be a good way to sanitise user input? Like removing script tags etc...

[1] https://gist.github.com/timpulver/0d01285952b97deb70df6104cc...

It’s usually not a good idea to “get creative” when it comes to security

majewsky7y ago

Only if you trust Pandoc enough to expose it to unsanitised user input.

mwcampbell7y ago

It appears that Pandoc generates PDF documents via LaTeX. One problem with this is that, as far as I can tell, LaTeX can't generate tagged PDFs. This is an accessibility problem. Granted, for documents that are heavy on math and/or graphics, the point is probably moot. But many technical documents that are distributed as PDFs would benefit from being tagged.

Luckily, LibreOffice can produce tagged PDFs. And unoconv is a convenient utility for doing this from the command line. So you can use pandoc to convert to a format that LibreOffice can consume, then issue a command like this:

    unoconv -f pdf -e UseTaggedPDF=true mydoc.odt

I've tried it, and it works.

mkesper7y ago

Pandoc can convert into ConTeXt, which can produce PDF/A (tagging included) easily. Why this can't be done in one command like with xelatex, wkhtmltopdf and whatever else is supported, I don't know. Many programs can be used to create PDFs but the quality of output isn't always the same.

mb21007y ago

> Why this can't be done in one command

ConTeXt is supported as well: `pandoc input.md -t context -o output.pdf`

flukus7y ago

Pandoc (or latex) + make + inotifywait work really well together for WYSIWYG-like editing too:

  watch: $(ALL)
      while true; do \
          clear; \
          make $(WATCH); \
          inotifywait -qr -e close_write .; \
      done

"make watch WATCH=build" will now compile documents on every save. Works well for single documents, collections of documents or entire websites.

tetov7y ago

I've been using a JS script[1] to watch directories, but this seems neater. Would you mind sharing the whole makefile?

eevilspock7y ago

Pandoc's creator, John MacFarlane, is also the lead guy on CommonMark[1].

There are a small number of corner cases that need to be spec'd out before CommonMark can declare a v1.0 release[2]. If you have the skills for this kind of thing, please weigh in!

[1] https://commonmark.org

[2] https://talk.commonmark.org/t/issues-we-must-resolve-before-...

pmlnr7y ago

Please, please involve definition lists. They are useful. They were present on the first webpage[1].

[1]: http://info.cern.ch/hypertext/WWW/TheProject.html

ashton3147y ago

I wrote a little utility that uses Pandoc to read Markdown files like `man` pages in the terminal:

https://github.com/ashton314/marked-man

It's just a one-liner: `pandoc -s -t man "$1" | groff -T utf8 -man | $PAGER`

(That was basically stolen from an answer to one of my questions on Stack Overflow—thanks to those who answered! :)

mixedmath7y ago

In a similar vein, I use pandoc to convert markdown pages to man pages, and write new/add notes to manpages. I think it's definitely easier than actually writing groff files.

beefhash7y ago

I find it easier to write man pages directly. Admittedly, I write mdoc (not the ancient "man" macros), which has been around only since the 80s. It's easier for me to remember the semantics ("Is this a flag/command/function?") than the correct traditional markup ("Should this be bold/italic/nothing?").

ryanianian7y ago

I sometimes use pandoc to clean up my markdown-formatted documents, especially given its abilities to "wrap" text and add indentation-style whitespace that makes plain-text documents look nearly suitable for publishing as-is (almost kinda like RFC docs but without header/footer cruft).

There are a few things (in latest version, 2.2.3.2) that don't really survive round-trip from markdown back to markdown:

- reference-style links (e.g. `[foo][f]`). They are converted to inline links e.g. `[foo](http://...)`.

- setext vs hashmark headers. `foo\n=====` will get converted to `# foo`.

- markdown allows for forced-linebreak <br>s to be added with two trailing blank spaces at the end of a line. Pandoc escapes these with a trailing `\` at the end of the line.

These are only occasional nuisances, but overall the documents (at least in my experience) are not butchered.

I also occasionally go from markdown to docx for the purposes of uploading to google-docs and copy/pasting large sections into other docs. This is the only markdown-to-google-docs workflow I've found that works to preserve formatting. It's never really butchered anything, except a few times the syntax-highlighting for code-blocks gets confused and keywords get the wrong colors.

confounded7y ago

IIRC, there are CLI flags for your first two points. I think the latter is something like `--atx-headers`.

You can choose whether reference links go at the end of the paragraph or the document.

CodexArcanum7y ago

I "love" how many comments are one person praising pandoc for helping them in some workflow, and then commenters ripping into them for not using some other tool. I wonder if there's a corollary to some internet rule that the more generally useful a tool is, the more detractors will push for other tools to be used? It would help explain why programming language discussions get so contentious.

Pandoc is seriously a great tool! I love the way it's designed and have found it useful off and on over the years. Truly marvelous for making information available in any needed format.

jph7y ago

Pandoc is great software for converting among file formats, such as text, markdown, HTML, PDF, etc.

Example:

    pandoc in.md -o out.html -V pagetitle="My Title" --to=html5 --template="my.html" --css "my.css"

The example converts a markdown file to HTML, using a given title, a template file, and a stylesheet file.

The pipeline is also well implemented with Haskell, which is good for writing your own fast functional transformations.

phalangion7y ago

I love pandoc. I've been using it intermittently for years to turn my Markdown and org-mode documents into other formats. Just wish it would take Asciidoc as an input format.

copperx7y ago

Asciidoctor and the other asciidoc tools do the job that I use pandoc for: tables, custom numbering, all the other markdown extensions that one needs to be able to create a highly structured document. With Asciidoc, you don't need md extensions. It's all in there.

phalangion7y ago

Ya, I've been using Asciidoctor and Asciidoctor-pdf for a long time. Those are some awesome tools, too.

mmsimanga7y ago

I mainly use Asciidoc for two reasons. 1) Ability to include external code snippets. This is not possible in pandoc without installing the pandoc-include-code filter, which doesn't have Windows binaries. I am on Windows. 2) Tables. Asciidoc has powerful support for tables. You can create tables that include rowspan and colspan among other features. You can even specify an external CSV file as a table.

I tried creating a workflow from Asciidoc through Pandoc to MS Word but that didn't work so well. Tables being the biggest issue.

adzm7y ago

There is useful discussion on the issue regarding Asciidoc

https://github.com/jgm/pandoc/issues/1456

patricklouys7y ago

I used pandoc to format my book [0]. Not everything worked perfectly, but I'm pretty happy with how everything turned out (especially the print version).

It was a little work to set up the workflow with scripts etc, but being able to write the book in markdown and still having full control over the design was definitely worth it.

[0] sample here: https://patricklouys.com/professional-php-sample.pdf

caconym_7y ago

I write fiction as a hobby, I do it in markdown and use Pandoc to turn it into epub files with a custom CSS. It works great. Thanks Pandoc!

clebio7y ago

Is the CSS derived from the markdown, or do you supplement the MD-to-HTML conversion with custom CSS? Definitely curious to know!

flocial7y ago

I used a similar workflow. The CSS is for the EPUB and maps to the html elements supported. But if you get too fancy cross device support could get hairy.

See: https://github.com/FriendsOfEpub/Blitz

caconym_7y ago

Custom CSS. Nothing fancy, I haven't really explored what's possible with epub.

davnn7y ago

You can use the Haskell-based static site generator Hakyll with Pandoc to create the best blogging experience imho.

An example of how easy this is and the styles I use for my personal blog: https://curious.observer https://github.com/davnn/curiousobserver

basementcat7y ago

Maybe I used an older version but my attempts to use pandoc usually resulted in the document being butchered because the internal representation was not as expressive as the source or target formats.

https://users.soe.ucsc.edu/~ivo/_posts/2015-03-12-repeatable...

adzm7y ago

Pandoc is also a great educational Haskell project for those looking into how it all works.

scentoni7y ago

If you don't want to install Haskell and other dependencies, several folks have developed Docker images for using pandoc:

http://gbraad.nl/blog/document-generation-using-markdown-and...

https://github.com/jagregory/pandoc-docker

chipotle_coyote7y ago

You could also just download one of the packages from the "Installing" page on the Pandoc web site, which has prebuilt binaries for Windows, macOS, and Linux. Installing a whole Docker image to do this seems like it might be overkill for a lot of uses.

uvtc7y ago

You'd only need to install Haskell if you wanted to build Pandoc. Pandoc the executable is a binary. I install it on Debian via: `apt install pandoc`.

mb21007y ago

although the version in the default repo is usually quite old. You can grab a binary from https://github.com/jgm/pandoc/releases/latest

https://pandoc.org/installing.html

subinsebastien7y ago

Yet another pandoc user here. I built a blog engine using Pandoc as the core. Code available here : https://github.com/subinsebastien/kyll And the website built using the blog engine is available here : http://xtel.in/

rotorblade7y ago

I tried to use pandoc a while ago to convert the latex-sources of arxiv.org documents to epub, since those are often much more comfortable to read on small devices than pdfs.

The problem I had was that latex was turned into images, but changing the font-size of the reader did not change the size of the images, making the text readable, but the maths barely readable.

This is something I would love to see happen though.

fntlnz7y ago

Take a look at arxiv vanity https://www.arxiv-vanity.com/

mb21007y ago

> latex was turned into images

You can add some CSS to the generated EPUB to change that. But if your EPub reader supports MathML, you can do that with pandoc. See http://pandoc.org/epub.html#math

patkai7y ago

I've seen this problem on Kindle books with equations, is that a related problem?

tjoff7y ago

Seems like an issue with the epub reader though?

disqard7y ago

I like pandoc. I've been using Typora [1] for all of my writing, and it's decent, but a little slow.

What editor do HN folks use? I wonder if there's a leaner editor out there with an equally nice distraction-free editing interface. Thanks in advance!

[1] https://typora.io/

arminiusreturns7y ago

Emacs and Emacs org-mode, and then you can export to html5, latex/pdf, etc. My notes, calendar, todo, data science workbooks, etc. all live in Emacs org-mode. Especially love the ability to call programs on the fly in my data science workbooks, so I can call R, Julia, Python, and bash all in one place.

canhascodez7y ago

Asking what editor HN uses is a pretty loaded question, but it looks like there's a couple neo/vim plugins for live markdown preview. This one[0] says it can use pandoc as a backend. I'm pretty sure that emacs offers something similar, and org-mode may be worth consideration all on its own. I hear spacemacs and spaceneovim are nice.

  [0] https://github.com/euclio/vim-markdown-composer

heliostatic7y ago

I've been really enjoying the Caret beta -- https://caret.io/

Not free, but a real pleasure to use.

applecrazy7y ago

I tried Caret and loved it but had to uninstall because of the huge font size on equation renders in a math-heavy document. Is there a way to fix that? I tried to look but they don't have much documentation yet.

hatmatrix7y ago

Even though org-mode has its own exporters, Pandoc is great for the extra bibtex integration.

voltagex_7y ago

The only problem I have with pandoc is I have to lug the entire GHC around with it.

nh27y ago

That is not the case.

> We provide a binary package for amd64 architecture on the download page. This provides both pandoc and pandoc-citeproc. The executables are statically linked and have no dynamic dependencies or dependencies on external data files.

loudmax7y ago

There's an unofficial Arch package for it: https://aur.archlinux.org/packages/pandoc-bin/

I wish I'd known about this sooner. I don't spend much time with text documents outside the web, but when I do, pandoc handles the disparate formats admirably. The only inconvenience is when I update my system, there's guaranteed to be a huge pile of Haskell libraries to download.

voltagex_7y ago

Thanks! I wonder how it was built.

shakna7y ago

What don't I use it for?

+ Static websites from any input to html

+ Markdown & TeX & References to pdf for academia

+ Generating manpages for new tools

+ Generating ebooks

... Let's just say I get a bit lost when it isn't available.

geraldcombs7y ago

> + Generating manpages for new tools

Do any of your tools use long options (prefixed with a double dash)? If so, make sure you disable the "smart" extension, otherwise you might end up with en dashes.

shakna7y ago

Aye. I've got several long lists of options, depending on the project. Manpages might have been the most fiddly to get right.

copperx7y ago

OP said he doesn't use pandoc for such things. It's a list of things that have better tooling.

copperx7y ago

What do you use for ebooks? Asciidoc?

shakna7y ago

A mix. Depends on the book.

For novels, I tend to just use Markdown, as kerning will be done in CSS.

For academics, I use LaTeX and Asciidoc together, but some paragraphs might be inserted in various other formats - whatever is easier. The build tool doesn't care what the format is, it'll take any input pandoc accepts.

bovermyer7y ago

I love pandoc, but I'm very surprised that such an established tool has (at time of writing) 865 points and is #1 on HN.

I guess it's not as well-known as I thought.

epynonymous7y ago

i have been using catdoc and pdftotext to convert doc and pdf files, respectively. nice to see that there's an alternative that also includes a library, will be checking this out.

a couple questions i have, seems firstly that old school .doc files are not supported, docx yes. unfortunately i still get a lot of docs in .doc format which seems to be microsoft's proprietary format (docx seems to be more open).

my second question is whether or not there's a filter for golang, most of my development is in golang, so i either need to call your cli as a forked process or best to have a native library. i have never worked with haskell so not sure if i can import a haskell library from golang directly. i imagine there'd need to be a golang wrapper around the cli.

duckerude7y ago

You could use Libreoffice's command line interface to convert from .doc to a more manageable format.

  lowriter --convert-to odt some-document.doc

odt is not the only supported target, but doc --libreoffice--> odt --pandoc--> plain seems to give better results than e.g. doc --libreoffice--> txt or doc --libreoffice--> docx --pandoc--> plain.
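
The two-step chain above could be scripted roughly like this. It is a sketch under assumptions: the helper name `doc_to_plain_commands` is made up, and it relies on LibreOffice writing the converted `.odt` into the current working directory, which is its behaviour when no output directory is given.

```python
from pathlib import Path


def doc_to_plain_commands(doc_path):
    """Return the two argv lists for the .doc -> .odt -> plain text chain.

    Step 1 asks LibreOffice to convert the legacy .doc to .odt.
    Step 2 asks pandoc to read that .odt and emit plain text on stdout.
    """
    odt_name = Path(doc_path).with_suffix(".odt").name
    return [
        ["lowriter", "--convert-to", "odt", str(doc_path)],
        ["pandoc", "-f", "odt", "-t", "plain", odt_name],
    ]
```

Running each list in order with `subprocess.run(cmd, check=True)` stops the chain as soon as either tool fails.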

epynonymous7y ago

if that's the case, i'll stick with catdoc. my use case is to create a full text search index of the content, trading libre office cli for catdoc, i'd rather just stick with catdoc, but thanks.

mb21007y ago

1. yes, only docx is supported. 2. for Go pandoc filters, this seems to work: https://github.com/oltolm/go-pandocfilters

epynonymous7y ago

thanks, will check this out

As a guy attempting to transition from macOS to Linux:

Pages to anything else, please.

jagger277y ago

A quick Google suggests that the most straightforward way is to run an Automator script to convert everything to PDF using Pages itself.

Yeah. But then you can't edit it. Converting it to opendoc or something would be more useful.

_emacsomancer_7y ago

Maybe this ( http://tyorex.com/iWorkConverter/ ) and then a batch doc->odt converter? (Though for the sake of sanity, I recommend avoiding word processors where possible.)

mark_l_watson7y ago

Pages will export to Word format, then use pandoc to generate markdown files. (Just one idea)

But then I have to keep a Mac around, just to convert documents. Not ideal.

https://orgmode.org/worg/org-tutorials/org-spreadsheet-intro...

kccqzy7y ago

Pandoc is great! I use pandoc for all kinds of formal writing (conversion to PDF via LaTeX). We also run pandoc in production to produce customer-facing PDFs.

bkyan7y ago

Is there an equivalent of this for spreadsheets?

scentoni7y ago

The closest thing I'm aware of is the spreadsheet functionality in Emacs org-mode:

https://orgmode.org/manual/The-spreadsheet.html

bkyan7y ago

Sorry, I'm a little confused... How would I use org mode to convert between different spreadsheet formats?

rllin7y ago

frustratingly slow for word docs. antiword is better for those of you who wish to convert word docs en masse

nambit7y ago

I have used pandoc with uikit to autoconvert my markdown pages to html. Works like a charm.

rydel7y ago

Really one of the best tools! Simple to use and gets things done.

fastier7y ago

Where is .djvu?

gwern7y ago

Do you need an option for that? You can convert to PDF and then `pdf2djvu` it.

vortico7y ago

I believe the best you could do is extract the raw OCR'd text from the document (with some other tool). No formatting or text hierarchy is preserved in the OCR process, only the physical locations and size of the text on the page. From text, you can convert to Markdown or whatever and then manually edit to give the OCR text some structure.

boonasty697y ago

updated and secure.

another-cuppa7y ago

I write any document that doesn't need extensive custom typesetting (which is 90% of stuff) in org-mode and then use pandoc to convert it to "normal people" formats at the end. I have made a basic template for MS Word that looks pretty good.

Numberwang7y ago

I wish they’d fix the md to adoc table conversion issues. Apart from that I love it.

kevin_thibedeau7y ago

The core problem with Pandoc is that the internal document representation is limited to its particular flavor of Markdown. Any feature PD-MD doesn't support is ignored or loses semantics. You can see this in the poor ReST support (try converting captioned figures). It would be useful to rearchitect it with a Docbook-style semantics internally since they are more comprehensive than Markdown.

euske7y ago

I know it's well intended and somewhat successful, but I can't help but think of xkcd.com/927

Sorry, I couldn't resist.

Lio7y ago

Although it does offer some useful extensions for Markdown, Pandoc doesn't attempt to establish new standards.

It's a conversion tool for existing formats.


204 comments

Schiphol7y ago

icc977y ago

[0]: https://pandoc.org/MANUAL.html#pandocs-markdown

mort967y ago

smohare7y ago

I’ve never understood the impetus for not using full LaTeX in an academic context, given that the boilerplate is so minimal and presumably one has built up a personal template over time.

For blog posts and notes I see the appeal, since the boilerplate can be a hindrance to spontaneous writing.

CJefferson7y ago

LaTeX can't produce web output, which is increasingly a target I want.


BeetleB7y ago

>I’ve never understood the impetus for not using full LaTeX in an academic context, given that the boilerplate is so minimal and presumably one has built up a personal template over time.

I don't find the boilerplate minimal at all. Contrast the following:

    \begin{itemize}
     \item First
     \item Second
     \item Third
    \end{itemize}

with

     - First
     - Second
     - Third

I won't even get into the hell that is tables.

I loved LaTeX until I discovered Org Mode. Pandoc also scratches the same itch.

susam7y ago

I agree. If one is going to use LaTeX directly or indirectly via Pandoc, eventually one would have to build up a personal template to fine-tune the look and feel of the documents.

[1]: https://github.com/susam/gitpr

[2]: https://github.com/susam/gitpr/blob/master/Makefile

dr_coffee7y ago

What's your workflow for inserting and managing references

tekknolagi7y ago

meanmrmustard927y ago

sevensor7y ago

ryanmarsh7y ago

I use pandoc-crossref.

criddell7y ago

> if the publisher 'needs' a Word file, you are one click away from providing it

Once the work has moved into a Word file, isn't that where it stays? Editors and publishers often make heavy use of features like track changes and notes. Doesn't pandoc lose that information?

aquova7y ago

mb21007y ago

pandoc has a --track-changes option, so you can convert a docx file with its proposed changes back to, say, markdown.
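
As a rough sketch of that round trip (the helper name `docx_to_md` and the file name used below are made up for illustration), the option can be wrapped like this. `--track-changes` accepts `accept`, `reject`, or `all`; the sketch uses `all` on the assumption that you want the proposed edits to stay visible in the Markdown.

```shell
# Sketch only: convert a reviewed .docx to Markdown, keeping tracked changes.
# docx_to_md is a hypothetical helper name, not part of pandoc.
docx_to_md() {
    # "all" keeps insertions, deletions, and comments in the output
    # instead of silently accepting or rejecting them.
    pandoc --track-changes=all -f docx -t markdown "$1" -o "${1%.docx}.md"
}
```

Calling `docx_to_md contract.docx` would write `contract.md` next to the input file.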

jonathanstrange7y ago

nanna7y ago

In my experience it works so long as you keep to very vanilla LaTeX code. Pandoc's support for LaTeX packages tends to be very patchy.

baldfat7y ago

I have to put in a word for Racket's Scribble. Programmatically creating documents is powerful, and this system makes it simple. You can also basically use it as a "Markup-less" system.

Scribble Code Example:

    #lang scribble/base

    @title{On the Cookie-Eating Habits of Mice}

    If you give a mouse a cookie, he's going to ask for a glass of milk.

    @section{The Consequences of Milk}

    That ``squeak'' was the mouse asking for milk. Let's suppose that you give him some in a big glass.

    He's a small mouse. The glass is too big---way too big. So, he'll probably ask you for a straw. You might as well give it to him.

    @section{Not the Last Straw}

    For now, to handle the milk moustache, it's enough to give him a napkin. But it doesn't end there... oh, no.

Scribble -

Some languages based on Scribble

Skribilo -

Pollen -

As a publishing system, Pollen includes:

    A programming language. The Pollen language is a variant of Scribble, with specific dialects tailored to different kinds of source files. You don’t need to use the programming features to do useful work, but they’re available when you need them.

    A set of tools & libraries. Pollen can produce output in any format, but it’s especially useful for markup-style formats like XML and HTML.

    A development environment. Pollen works with the DrRacket IDE. It also includes a project web server so you can dynamically preview and revise your publication. http://docs.racket-lang.org/pollen/Backstory.html

Racket has the best documentation period and it is because the documentation

mb21007y ago

Occasional pandoc contributor here, AMA :-)

Just a few links:

- Where everything is documented: http://pandoc.org/MANUAL.html

- If you have questions or suggestions: https://groups.google.com/forum/#!forum/pandoc-discuss

paule897y ago

every time I see a project using google groups, i think it is already dead. Gladly yours seems to be used quite often. At least you can search it even years later, compared to an irc or slack channel.

neves7y ago

mb21007y ago

Aren't mailing lists even older than IRC? Anyway, whatever works, right?

benhoyt7y ago

Heh, not quite. Go (the programming language) uses Google Groups (golang-nuts and golang-dev), and it's very alive.

roblabla7y ago

Overall great experience. Thanks for the great tool :).

mariushn7y ago

Could we consider Pandoc a rough replacement for leanpub publishing tools?

mb21007y ago

Probably. I'm not too familiar with leanpub, but seems like they're actually using pandoc to import docx.[0] And with pandoc you can also export to epub, pdf, docx, indesign, etc.

[0]: https://leanpub.com/anewcourse

koolba7y ago

My favorite pandoc hack is using it to convert word docs into markdown which can then be diffed similar to source code. Works great for legal redlining.

d99kris7y ago

Agree, and it can be nicely integrated with Git: http://blog.martinfenner.org/2014/08/25/using-microsoft-word...
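
One common way to wire this up, sketched here under the assumption that pandoc is on the PATH, is a Git `textconv` driver: Git runs the configured command on each `.docx` and diffs the Markdown it prints. The function name `setup_docx_diff` and the driver name `pandoc` are arbitrary choices for this example.

```shell
# Sketch only: make `git diff` show .docx changes as Markdown.
# setup_docx_diff is a hypothetical helper; pass it the path of a repository.
setup_docx_diff() {
    repo=$1
    # Route every .docx through a diff driver that we name "pandoc".
    printf '%s\n' '*.docx diff=pandoc' >> "$repo/.gitattributes"
    # Git appends the file path to this command and diffs its stdout.
    git -C "$repo" config diff.pandoc.textconv 'pandoc --to=markdown'
}
```

After `setup_docx_diff .`, a plain `git diff` on a changed Word file should show a text diff rather than "Binary files differ".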

flatline7y ago

Do Word’s native diff features not work for you?

lucb1e7y ago

Can you diff two Word docs with Word? Afaik you can only hit the "track changes" button, which doesn't help if you got a new version of a document from someone else.


copperx7y ago

Diff works great in Word.

amelius7y ago

Trying pandoc on a word doc gives me:

    # pandoc test.doc -o test.pdf
    pandoc: Unknown reader: doc
    Pandoc can convert from DOCX, but not from DOC.

mb21007y ago

As the error says, pandoc doesn't support .doc files, only .docx.

1: https://github.com/tomhodgins/responsive.style/blob/master/s...

psychometry7y ago

Does it come up with the right semantic content for lists and tables?

tomrod7y ago

> legal redlining

Is this underlining, and not redlining as defined in financial services? (redlining: differential pricing based on demographic makeup of a zip code or neighborhood)

URSpider947y ago

Attorneys and business folks use it to mean “marked up” - a redlined contract has additions in red, and removals in red with red lines through them.

prdonahue7y ago

Redline as in how strikethrough appears in Word. It's a colloquial term.

h4b4n3r07y ago

OP was referring to diffing changes between the original and modified version of a legal document (typically a contract).

err4nt7y ago

I use Pandoc to convert directories of Markdown files into static HTML websites.

Here's the build command for responsive.style[1]:

    pandoc $file -f markdown -t html5 -H templates/header-prod.html -B templates/nav.html -A templates/footer-prod.html -o (echo "../$file" | sed '$s/\.md$/.html/') -s  --data-dir=./ --highlight-style breezedark --variable=file:(echo "$file" | sed '$s/\.md$/.html/')

Works beautifully!
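
The parenthesised command substitutions above look like fish shell syntax. For anyone on a POSIX shell, a rough equivalent might look like the sketch below. The helper names `html_path` and `build_page` are invented here, the template paths are copied from the command above, and the `--data-dir` and `--variable=file:` options are left out to keep it short.

```shell
# Sketch only: POSIX sh take on the build command above.
# html_path maps "page.md" to "../page.html" using parameter expansion
# in place of the echo | sed pipeline.
html_path() {
    printf '../%s.html\n' "${1%.md}"
}

# build_page is a hypothetical wrapper around a reduced set of the flags.
build_page() {
    pandoc "$1" -f markdown -t html5 -s \
        -H templates/header-prod.html \
        -B templates/nav.html \
        -A templates/footer-prod.html \
        --highlight-style breezedark \
        -o "$(html_path "$1")"
}
```

A whole directory could then be built with `for f in *.md; do build_page "$f"; done`.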

uvtc7y ago

Nice.

I wrote up a tool as well, with navigation and prev/next links: http://www.unexpected-vortices.com/sw/rippledoc/index.html

ArlenBales7y ago

I know Blot.im as one static site generator that uses it: https://github.com/davidmerfield/Blot/blob/master/app/models...

privong7y ago

Hakyll is another: https://jaspervdj.be/hakyll/

ggambetta7y ago

Another happy Pandoc user here :)

My website itself is static, generated from a bunch of Markdown files, some HTML templates, and a bit of postprocessing. But most of the work is done by Pandoc.

odiroot7y ago

Hey. Can you give us some more context of your novel writing in Markdown? I'd be interested in your process.

ggambetta7y ago

myself2487y ago

The one thing it can't do is give HN posts descriptive titles.

wadkar7y ago

Thanks, this made me chuckle. It's comments like these that make HN a bit more colorful :-)

jaggederest7y ago

Also, interesting point of trivia: the maintainer, John MacFarlane, is a professor of logical philosophy at UC Berkeley.

One nice trick that I use all the time is to convert html to md and back again in order to clean it.

Anyway, pandoc is great.

gitgud7y ago

Would that be a good way to sanitise user input? Like removing script tags etc...

[1] https://gist.github.com/timpulver/0d01285952b97deb70df6104cc...

It’s usually not a good idea to “get creative” when it comes to security

majewsky7y ago

Only if you trust Pandoc enough to expose it to unsanitised user input.

mwcampbell7y ago

    unoconv -f pdf -e UseTaggedPDF=true mydoc.odt

I've tried it, and it works.

mkesper7y ago

mb21007y ago

> Why this can't be done in one command

ConTeXt is supported as well: `pandoc input.md -t context -o output.pdf`

flukus7y ago

Pandoc (or LaTeX) + make + inotifywait work really well together for WYSIWYG-like editing too:

    # In a real Makefile, indent the recipe lines with a tab.
    .PHONY: watch
    watch: $(ALL)
        while true; do \
            clear; \
            make $(WATCH); \
            inotifywait -qr -e close_write .; \
        done

"make watch WATCH=build" will now compile documents on every save. Works well for single documents, collections of documents or entire websites.

tetov7y ago

I've been using a JS script[1] to watch directories, but this seems neater. Would you mind sharing the whole makefile?

eevilspock7y ago

Pandoc's creator, John MacFarlane, is also the lead guy on CommonMark[1].

There are a small number of corner cases that need to be spec'd out before CommonMark can declare a v1.0 release[2]. If you have the skills for this kind of thing, please weigh in!

[1] https://commonmark.org

[2] https://talk.commonmark.org/t/issues-we-must-resolve-before-...

pmlnr7y ago

Please, please involve definition lists. They are useful. They were present on the first webpage[1].

[1]: http://info.cern.ch/hypertext/WWW/TheProject.html

ashton3147y ago

I wrote a little utility that uses Pandoc to read Markdown files like `man` pages in the terminal:

https://github.com/ashton314/marked-man

It's just a one-liner: `pandoc -s -t man "$1" | groff -T utf8 -man | $PAGER`

(That was basically stolen from an answer to one of my questions on Stack Overflow—thanks to those who answered! :)

mixedmath7y ago

In a similar vein, I use pandoc to convert markdown pages to man pages, and write new/add notes to manpages. I think it's definitely easier than actually writing groff files.

beefhash7y ago

ryanianian7y ago

There are a few things (in latest version, 2.2.3.2) that don't really survive round-trip from markdown back to markdown:

- reference-style links (e.g. `[foo][f]`). They are converted to inline links e.g. `[foo](http://...)`.

- setext vs hashmark headers. `foo\n=====` will get converted to `# foo`.

- markdown allows for forced-linebreak <br>s to be added with two trailing blank spaces at the end of a line. Pandoc escapes these with a trailing `\` at the end of the line.

These are only occasional nuisances, but overall the documents (at least in my experience) are not butchered.

confounded7y ago

IIRC, there are CLI flags for your first two points. I think the latter is something like `--atx-headers`.

You can choose whether reference links go at the end of the paragraph or the document.
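
A minimal sketch of such a round trip, assuming the flags meant here are `--reference-links` and `--reference-location` (the wrapper name `roundtrip_md` is made up):

```shell
# Sketch only: Markdown-to-Markdown pass that keeps reference-style links.
# roundtrip_md is a hypothetical helper name.
roundtrip_md() {
    # --reference-links writes [foo][f] style links instead of inline ones;
    # --reference-location accepts block, section, or document.
    # Heading style is a separate flag: --atx-headers in the 2.x releases
    # discussed here, --markdown-headings=atx|setext in later ones.
    pandoc -f markdown -t markdown \
        --reference-links --reference-location=document \
        "$1" -o "$2"
}
```

For example, `roundtrip_md in.md out.md` collects the link definitions at the end of `out.md`.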

CodexArcanum7y ago

Pandoc is seriously a great tool! I love the way it's designed and have found it useful off and on over the years. Truly marvelous for making information available in any needed format.

jph7y ago

Pandoc is great software for converting among file formats, such as text, markdown, HTML, PDF, etc.

Example:

    pandoc in.md -o out.html -V pagetitle="My Title" --to=html5 --template="my.html" --css "my.css"

The example converts a markdown file to HTML, using a given title, a template file, and a stylesheet file.

The pipeline is also well implemented with Haskell, which is good for writing your own fast functional transformations.

phalangion7y ago

I love pandoc. I've been using it intermittently for years to turn my Markdown and org-mode documents into other formats. Just wish it would take Asciidoc as an input format.

copperx7y ago

phalangion7y ago

Ya, I've been using Asciidoctor and Asciidoctor-pdf for long time. Those are some awesome tools, too.

mmsimanga7y ago

I tried creating a workflow from Asciidoc through Pandoc to MS Word but that didn't work so well. Tables being the biggest issue.

adzm7y ago

There is useful discussion on the issue regarding Asciidoc

https://github.com/jgm/pandoc/issues/1456

patricklouys7y ago

I used pandoc to format my book [0]. Not everything worked perfectly, but I'm pretty happy with how everything turned out (especially the print version).

It was a little work to set up the workflow with scripts etc, but being able to write the book in markdown and still having full control over the design was definitely worth it.

[0] sample here: https://patricklouys.com/professional-php-sample.pdf

caconym_7y ago

I write fiction as a hobby, I do it in markdown and use Pandoc to turn it into epub files with a custom CSS. It works great. Thanks Pandoc!

clebio7y ago

Is the CSS derived from the markdown, or do you supplement MD to HTML with custom CSS? Definitely curious to know!

flocial7y ago

I used a similar workflow. The CSS is for the EPUB and maps to the html elements supported. But if you get too fancy cross device support could get hairy.

See: https://github.com/FriendsOfEpub/Blitz

caconym_7y ago

Custom CSS. Nothing fancy, I haven't really explored what's possible with epub.

davnn7y ago

You can use the Haskell-based static site generator Hakyll with Pandoc to create the best blogging experience imho.

An example of how easy this is and the styles I use for my personal blog: https://curious.observer https://github.com/davnn/curiousobserver

basementcat7y ago

Maybe I used an older version but my attempts to use pandoc usually resulted in the document being butchered because the internal representation was not as expressive as the source or target formats.

https://users.soe.ucsc.edu/~ivo/_posts/2015-03-12-repeatable...

adzm7y ago

Pandoc is also a great educational Haskell project for those looking into how it all works.

scentoni7y ago

If you don't want to install Haskell and other dependencies, several folks have developed Docker images for using pandoc:

http://gbraad.nl/blog/document-generation-using-markdown-and...

https://github.com/jagregory/pandoc-docker

chipotle_coyote7y ago

uvtc7y ago

You'd only need to install Haskell if you wanted to build Pandoc. Pandoc the executable is a binary. I install it on Debian via: `apt install pandoc`.

mb21007y ago

although the version in the default repo is usually quite old. You can grab a binary from https://github.com/jgm/pandoc/releases/latest

https://pandoc.org/installing.html

subinsebastien7y ago

rotorblade7y ago

I tried to use pandoc a while ago to convert the latex-sources of arxiv.org documents to epub, since those are often much more comfortable to read on small devices than pdfs.

The problem I had was that latex was turned into images, but changing the font-size of the reader did not change the size of the images, making the text readable, but the maths barely readable.

This is something I would love to see happen though.

fntlnz7y ago

Take a look at arxiv vanity https://www.arxiv-vanity.com/

mb21007y ago

> latex was turned into images

You can add some CSS to the generated EPUB to change that. But if your EPub reader supports MathML, you can do that with pandoc. See http://pandoc.org/epub.html#math
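
For readers that do support MathML, the conversion might look roughly like this. The helper name `tex_to_epub` is invented, and the sketch assumes a single self-contained `.tex` source, which many arXiv submissions are not.

```shell
# Sketch only: LaTeX source to EPUB 3 with math rendered as MathML.
# tex_to_epub is a hypothetical helper name.
tex_to_epub() {
    # --mathml emits MathML markup for equations instead of images,
    # so the math can scale with the reader's font size.
    pandoc -f latex -t epub3 --mathml "$1" -o "${1%.tex}.epub"
}
```

Calling `tex_to_epub paper.tex` would produce `paper.epub` in the same directory.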

patkai7y ago

I've seen this problem on Kindle books with equations, is that a related problem?

tjoff7y ago

Seems like an issue with the epub reader though?

disqard7y ago

I like pandoc. I've been using Typora [1] for all of my writing, and it's decent, but a little slow.

What editor do HN folks use? I wonder if there's a leaner editor out there with an equally nice distraction-free editing interface. Thanks in advance!

[1] https://typora.io/

arminiusreturns7y ago

canhascodez7y ago

  [0] https://github.com/euclio/vim-markdown-composer

heliostatic7y ago

I've been really enjoying the Caret beta -- https://caret.io/

Not free, but a real pleasure to use.

applecrazy7y ago

hatmatrix7y ago

Even though org-mode has its own exporters, Pandoc is great for the extra bibtex integration.

voltagex_7y ago

The only problem I have with pandoc is I have to lug the entire GHC around with it.

nh27y ago

That is not the case.

loudmax7y ago

There's an unofficial Arch package for it: https://aur.archlinux.org/packages/pandoc-bin/

voltagex_7y ago

Thanks! I wonder how it was built.

shakna7y ago

What don't I use it for?

+ Static websites from any input to html

+ Markdown & TeX & References to pdf for academia

+ Generating manpages for new tools

+ Generating ebooks

... Let's just say I get a bit lost when it isn't available.

geraldcombs7y ago

> + Generating manpages for new tools

Do any of your tools use long options (prefixed with a double dash)? If so, make sure you disable the "smart" extension, otherwise you might end up with en dashes.
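
Concretely, extensions are switched off by appending `-name` to the input format, so a sketch of a man page build with "smart" disabled could look like this (the wrapper name `md_to_man` is hypothetical):

```shell
# Sketch only: build a man page without smart punctuation.
# md_to_man is a hypothetical helper name.
md_to_man() {
    # "markdown-smart" means pandoc's Markdown with the smart extension
    # turned off, so "--verbose" in the source keeps both hyphens.
    pandoc -s -f markdown-smart -t man "$1" -o "$2"
}
```

For example, `md_to_man mytool.1.md mytool.1`.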

shakna7y ago

Aye. I've got several long lists of options, depending on the project. Manpages might have been the most fiddly to get right.

copperx7y ago

OP said he doesn't use pandoc for such things. It's a list of things that have better tooling.

copperx7y ago

What do you use for ebooks? Asciidoc?

shakna7y ago

A mix. Depends on the book.

For novels, I tend to just use Markdown, as kerning will be done in CSS.

bovermyer7y ago

I love pandoc, but I'm very surprised that such an established tool has (at time of writing) 865 points and is #1 on HN.

I guess it's not as well-known as I thought.

epynonymous7y ago

i have been using catdoc and pdftotext to convert doc and pdf files, respectively. nice to see that there's an alternative that also includes a library, will be checking this out.

duckerude7y ago

You could use Libreoffice's command line interface to convert from .doc to a more manageable format.

  lowriter --convert-to odt some-document.doc

odt is not the only supported target, but doc --libreoffice--> odt --pandoc--> plain seems to give better results than e.g. doc --libreoffice--> txt or doc --libreoffice--> docx --pandoc--> plain.
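
Chained together, the doc to odt to plain route could be sketched as follows. The function name `doc_to_plain` is made up, and `--outdir` is added on the assumption that you want the intermediate `.odt` next to the source file.

```shell
# Sketch only: .doc -> .odt (LibreOffice) -> plain text (pandoc).
# doc_to_plain is a hypothetical helper name.
doc_to_plain() {
    dir=$(dirname "$1")
    base=$(basename "$1" .doc)
    # Only run pandoc if the LibreOffice conversion succeeded.
    lowriter --convert-to odt --outdir "$dir" "$1" &&
        pandoc -f odt -t plain "$dir/$base.odt" -o "$dir/$base.txt"
}
```

Running `doc_to_plain some-document.doc` leaves `some-document.odt` and `some-document.txt` beside the original.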

epynonymous7y ago

if that's the case, i'll stick with catdoc. my use case is to create a full text search index of the content, trading libre office cli for catdoc, i'd rather just stick with catdoc, but thanks.

mb21007y ago

1. yes, only docx is supported. 2. for Go pandoc filters, this seems to work: https://github.com/oltolm/go-pandocfilters

epynonymous7y ago

thanks, will check this out

As a guy attempting to transition from macOS to Linux:

Pages to anything else, please.

jagger277y ago

A quick Google suggests that the most straightforward way is to run an Automator script to convert everything to PDF using Pages itself.

Yeah. But then you can't edit it. Converting it to opendoc or something would be more useful.

_emacsomancer_7y ago

Maybe this ( http://tyorex.com/iWorkConverter/ ) and then a batch doc->odt converter? (Though for the sake of sanity, I recommend avoiding word processors where possible.)

mark_l_watson7y ago

Pages will export to Word format, then use pandoc to generate markdown files. (Just one idea)

But then I have to keep a Mac around, just to convert documents. Not ideal.