Building combined documents in a GitLab pipeline

xldrkp · March 28, 2021, 8:59pm

No problem here, just something I learned with the help of the folks on Matrix.

Solution

I’m able to combine various HedgeDocs in a GitLab pipeline using pandoc like so:

image: 
  name: pandoc/latex:latest
  entrypoint: ["/bin/sh", "-c"]

build:
  script:
    - pandoc -f markdown https://hedgedoc.ru/I2lmzjqyjP7oA/download https://hedgedoc.ru/Maasd0C919rg/download -o report.pdf
  artifacts:
    paths:
      - "report.pdf"

First, I’m pulling the official pandoc image from Docker Hub and redefining the entry point so that it accepts regular command lines. Afterwards I pull two example pads that are EDITABLE and write the output to report.pdf. This file is kept as a pipeline artifact for download.

Scenario ideas

I always wanted to do this with student work: Let them write alone or in groups and regularly collect their work for a book or anthology. It’s also imaginable to have a group write parts for a research application and combine them in a Word file with the same kind of pipeline.
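A minimal sketch of the Word variant, assuming the same image and the same two example pads as in the solution above. The job name `build-docx` and the file name `application.docx` are made up for illustration. pandoc infers the output format from the file extension, so only the `-o` target changes:

```yaml
image:
  name: pandoc/latex:latest
  entrypoint: ["/bin/sh", "-c"]

# Hypothetical job name; only the output file differs from the PDF job
build-docx:
  script:
    # pandoc infers the Word format from the .docx extension
    - pandoc -f markdown https://hedgedoc.ru/I2lmzjqyjP7oA/download https://hedgedoc.ru/Maasd0C919rg/download -o application.docx
  artifacts:
    paths:
      - "application.docx"
```

Since Word output does not go through LaTeX, a smaller pandoc image would probably work here as well.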

Thanks

Thanks to the community on Matrix that helped me get this done!

amenthes · March 31, 2021, 9:36am

That is great! I didn’t realize one could do that!

nsheff · April 5, 2021, 9:11pm

That is awesome.

I have a workflow for authoring academic documents in markdown called sciquill, which hosts the markdown files on GitHub and uses a GitHub Action to build various output types. I recently discovered HedgeDoc and am thinking about how to merge the two. Your idea is related.

NikaZhenya · April 9, 2021, 3:05pm

Nice job! I have been doing something with practically the same purpose. Look:

The .gitlab-ci.yml is:

image: texlive/texlive

before_script:
  - export DEBIAN_FRONTEND=noninteractive
  - apt-get -y update && apt-get -y upgrade
  - apt-get -y install curl hunspell hunspell-es linkchecker neofetch pandoc ruby uuid-runtime
  - gem install bibtex-ruby httparty nokogiri
  - mkdir kindlegen && cd kindlegen && wget https://archive.org/download/kindlegen_linux_2.6_i386_v2_9.tar/kindlegen_linux_2.6_i386_v2_9.tar.gz && tar -xvzf kindlegen_linux_2.6_i386_v2_9.tar.gz && mv kindlegen /usr/local/bin/ && cd .. && rm -rf kindlegen
  # Tools for specific purposes, you can ignore them
  - wget https://gitlab.com/snippets/1917492/raw -O /usr/local/bin/baby-biber && chmod +755 /usr/local/bin/baby-biber
  - wget https://gitlab.com/snippets/1917490/raw -O /usr/local/bin/export-pdf && chmod +755 /usr/local/bin/export-pdf
  - wget https://gitlab.com/snippets/1917487/raw -O /usr/local/bin/texti && chmod +755 /usr/local/bin/texti
  # We prefer to do the ebook with this legacy tool for compatibility purposes
  - (cd ~ && mkdir .pecas && cd .pecas && git clone --depth 1 https://gitlab.com/programando-libreros/herramientas/pecas-legacy.git . && bash install.sh) && source ~/.profile

pages:
  stage: deploy
  script:
    - mkdir public/
    # Test 1: gather info about the software and hardware
    - cp index.html public/ && cd public/
    - printf "\n# neofetch\n" >> log.txt
    - neofetch | sed 's/\x1B\[[0-9;\?]*[a-zA-Z]//g' >> log.txt
    - printf "\n# uname -a\n" >> log.txt
    - uname -a >> log.txt
    - printf "\n# apt list --installed\n" >> log.txt
    - apt list --installed >> log.txt
    - printf "\n# ls /sbin\n" >> log.txt
    - ls /sbin >> log.txt
    - printf "\n# ls /bin\n" >> log.txt
    - ls /bin >> log.txt
    - printf "\n# ls /usr/bin\n" >> log.txt
    - ls /usr/bin >> log.txt
    - printf "\n# ls /usr/local/bin\n" >> log.txt
    - ls /usr/local/bin >> log.txt
    - printf "\n# tlmgr list --only-installed\n" >> log.txt
    - tlmgr list --only-installed >> log.txt
    - printf "\n# ruby --version\n" >> log.txt
    - ruby --version >> log.txt
    - printf "\n# gem list\n" >> log.txt
    - gem list >> log.txt
    - printf "\n# kindlegen\n" >> log.txt
    - kindlegen >> log.txt
    - printf "\n# texti\n" >> log.txt
    - texti >> log.txt
    - printf "\n# pc-doctor\n" >> log.txt
    - pc-doctor >> log.txt
    # Test 2: publish an existing repo
    - git clone --depth 1 https://gitlab.com/NikaZhenya/maestria-investigacion.git && cd maestria-investigacion/tesis && ./generate-all && cd ..
    - rm -rf .g* administrativo apuntes bibliografia protocolo && cd ..
    # Test 3: publish a HedgeDoc pad! (actually anything that has raw MD)
    - mkdir pad && cd pad && wget https://pad.programando.li/Pgs01Hr3QgWtgspU6YbhkA/download -O pad.md
    - pandoc pad.md -s -o index.html 
    - pandoc pad.md -o pad.pdf 
    - pandoc pad.md -o pad.epub 
    - pandoc pad.md -o pad.docx 
    #- kindlegen pad.epub # It works, but if it ends with warnings, the job fails
  artifacts:
    paths:
      - public
  only:
    - master

As you can see, I decided to use the texlive/texlive container because our needs imply a heavy use of LaTeX. I still haven’t added pandoc-citeproc, but that is because we are gonna deploy a container based on the texlive container. I think about 40% of the time could be saved if we already had a container with all the needed tools (we are probably gonna add other publishing systems that use MD, like jekyll, pelican and hugo).

Another thing I am gonna work on this weekend is adding a variable so it can be used with any MD URL, like HedgeDoc links.
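A rough sketch of how such a variable could look, not the final setup. The variable name `PAD_URL` and the job name `pad` are made up; the default value is the pad from Test 3, and the job assumes `wget` is already present in the texlive/texlive image, as in the pipeline above:

```yaml
variables:
  # Hypothetical variable name; the default is the pad from Test 3
  PAD_URL: "https://pad.programando.li/Pgs01Hr3QgWtgspU6YbhkA/download"

pad:
  image: texlive/texlive
  before_script:
    - export DEBIAN_FRONTEND=noninteractive
    - apt-get -y update && apt-get -y install pandoc
  script:
    - mkdir -p public/pad && cd public/pad
    # Any URL that returns raw Markdown should work here
    - wget "$PAD_URL" -O pad.md
    - pandoc pad.md -s -o index.html
    - pandoc pad.md -o pad.pdf
  artifacts:
    paths:
      - public
```

The default can then be overridden in the project's CI/CD settings or when a pipeline is started manually, without touching the .gitlab-ci.yml.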

Test 2 might interest you: it is a complete research thesis (in Spanish) with a site and so on.

Cheers, nice to see ppl working on the same things!

The second publishing revolution has just begun

xldrkp · April 13, 2021, 8:05pm

@nsheff Thanks for sharing sciquill, looks great, I will tinker with it. FYI: I have a video out in which I describe my environment for scholarly publishing with VSCodium, Markdown, pandoc, GitLab and Docker.

xldrkp · April 13, 2021, 8:10pm

That’s also my experience: It’s worth spending the time on building this complete image as it saves a lot of time. I started with an image that installed packages every time it ran, which caused errors and took too long.
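One possible way to build such an image once and reuse it is a separate job that pushes it to the GitLab container registry. This is only a sketch under several assumptions: a `Dockerfile` in the repository root, the container registry enabled for the project, a runner that allows Docker-in-Docker, and the made-up image name `publishing`:

```yaml
build-image:
  image: docker:latest
  services:
    - docker:dind
  script:
    # The CI_REGISTRY* variables are predefined by GitLab when the registry is enabled
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE/publishing:latest" .
    - docker push "$CI_REGISTRY_IMAGE/publishing:latest"
  only:
    # Rebuild the image only when the Dockerfile changes
    changes:
      - Dockerfile
```

Other jobs could then set `image: $CI_REGISTRY_IMAGE/publishing:latest` and skip the package installation. Depending on the runner, extra Docker-in-Docker settings may be needed.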

Talking about the publishing revolution: Absolutely necessary! I had the pleasure to work with some great people in a project called Modern publishing where we explored new ways of journal article publishing with a pandoc-centered approach.

nsheff · April 13, 2021, 8:29pm

@xldrkp Thanks! I’m watching your video now… Just through the beginning and your ideas are right in line with mine… Some thoughts I’ve written about on this subject you may find interesting:

Can you share any links or examples of your system at work?

nsheff · April 14, 2021, 12:32am

A couple of comments on that… With sciquill I use bulker, a tool I developed that makes it easier to use separate containers while controlling them as a set. So you avoid the bloated single image, which is not great for maintenance and reusability, but you get the benefits of a single entity that does everything you need. I set this up in a GitHub Action and everything works, so I can build the outputs automatically on change…

but the problem with this is that the ephemeral compute must pull the containers for each build. Even better would be moving this to a server that just has everything ready. You could do this, for example, with a self-hosted GitHub runner, but I haven’t explored that at all.

NikaZhenya · April 14, 2021, 3:58am

Nice, I will take time to watch it. I also made a 12-minute video where I show how I publish using HedgeDoc + GitLab with this container + pandoc on a cellphone. Why a cellphone? It could be for hotfixes (happened to me once when I was on a trip) or because 80% of students in Mexico are using phones for their online education… For my personal taste I use Vim instead; the rest is almost the same. For EPUBs we use pandoc in the background through pecas, because Spanish has some specific typographical conventions that have to be there, and by the time we coded pecas we were having trouble handling them with pandoc alone. But by now, pandoc filters and templates should be enough. And if not, we are at a new starting point.

Talking about the publishing revolution, we are gonna start to make tests with “the most important thing that came out of the TeX project”, according to Knuth: literate programming; but as collaborative writing with pads

NikaZhenya · April 14, 2021, 4:02am

Nice! I am actually having that issue: the texlive image is 4.3 GB, and the other tools built on top of it add just 500 MB more. In the scenarios where we don’t need PDF, those 4.3 GB are extra weight. I will look into that, thanks!
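Without any extra tooling, GitLab already allows a different image per job, so the heavy LaTeX image could be limited to the PDF job. A minimal sketch, assuming a `pad.md` file is already in the repository; the job names are made up, and `pandoc/core` is the official pandoc image without LaTeX:

```yaml
# HTML and EPUB do not need LaTeX, so a small pandoc image is enough
html-epub:
  image:
    name: pandoc/core:latest
    entrypoint: ["/bin/sh", "-c"]
  script:
    - pandoc pad.md -s -o index.html
    - pandoc pad.md -o pad.epub
  artifacts:
    paths:
      - index.html
      - pad.epub

# Only the PDF job pulls an image that ships LaTeX
pdf:
  image:
    name: pandoc/latex:latest
    entrypoint: ["/bin/sh", "-c"]
  script:
    - pandoc pad.md -o pad.pdf
  artifacts:
    paths:
      - pad.pdf
```

For documents that rely on many LaTeX packages, the PDF job would probably still need the full texlive-based image instead of `pandoc/latex`.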

We are actually thinking about a self-hosted GitLab, but we will see, thank you!

xldrkp · April 14, 2021, 10:09am

Thanks @nsheff, I will take time to read your articles on Markdown. Please have a look at https://journals.sub.uni-hamburg.de/hup2/kommges/index for which we have produced about 50 articles with the aforementioned system. Much work was necessary to produce HTML and LaTeX templates for use with pandoc.

nsheff · April 14, 2021, 2:59pm

Excellent, this is similar to what I was trying to accomplish, with the articles above…

HTML version: http://databio.org/democratization-of-scientific-publication (posted above)
PDF version: http://databio.org/pdfs/2019-02-26-democratization_of_scientific_publication.pdf

Also using latex templates, and from a single source like you discuss in your video.