docx2tex

Converts Microsoft Word's DOCX to LaTeX. Developed by le-tex and based on the transpect framework. The main author of docx2tex and the underlying xml2tex is @mkraetke.

get docx2tex

download the latest release

Download the latest docx2tex release

…or get the source via Git. Please note that you have to add the --recursive option in order to clone docx2tex together with its submodules.

git clone https://github.com/transpect/docx2tex --recursive
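
If the repository was already cloned without --recursive, the submodule directories stay empty. The following is a minimal sketch of a repair step that relies only on standard Git commands. The helper name ensure_submodules and the directory name docx2tex are assumptions made for this example and are not part of docx2tex.

```bash
# Fetch all nested submodules for an existing checkout.
# ensure_submodules is a hypothetical helper name, not part of docx2tex.
ensure_submodules() {
  local dir="$1"
  if [ ! -e "$dir/.git" ]; then
    echo "not a Git checkout: $dir" >&2
    return 1
  fi
  # Standard Git command: initialise and update submodules recursively.
  git -C "$dir" submodule update --init --recursive
}

# Usage (uncomment once the clone exists):
# ensure_submodules docx2tex
```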

requirements

  • Java 1.7 up to 1.15 (more recent versions not yet tested). Java 11 has a bug with file URIs and should be avoided. Java 13 is safe again.
  • works on Windows, Linux and Mac OS X
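
To check which Java version is on your PATH before running docx2tex, something like the following sketch may help. The function name java_major is an assumption made for this example. The parsing covers both the legacy "1.x" scheme and the newer scheme, and the warning merely repeats the advice about Java 11 given above.

```bash
# Extract the major Java version from a version string.
# Handles the legacy scheme ("1.8.0_292" is Java 8) and the newer one ("11.0.2").
# java_major is a hypothetical helper name for this sketch.
java_major() {
  local v="$1"
  v="${v#1.}"          # drop the legacy "1." prefix, if present
  echo "${v%%[._-]*}"  # keep everything before the first '.', '_' or '-'
}

if command -v java >/dev/null 2>&1; then
  # java -version writes to stderr; the version is the first quoted string.
  raw="$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)"
  major="$(java_major "$raw")"
  if [ "$major" = "11" ]; then
    echo "Warning: Java 11 detected; the requirements advise against it." >&2
  else
    echo "Java major version: $major"
  fi
else
  echo "java not found on PATH" >&2
fi
```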

run docx2tex

You can run docx2tex with a Bash script (Linux, Mac OS X, Cygwin) or with the Windows batch script, whose options are somewhat limited compared to the Bash script.

Linux/Mac OS X

./d2t [options ...] myfile.docx

Option  Description
-o      path to custom output directory
-c      path to custom docx2tex configuration file
-m      choose MathType source (ole|wmf|ole+wmf)
-f      path to custom fontmaps directory
-p      generate PDF with pdflatex
-t      choose table model (tabularx|tabular|htmltabs)
-e      custom XSLT stylesheet for evolve-hub overrides
-x      custom XSLT stylesheet for postprocessing the evolve-hub results
-d      debug mode
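
The options can be combined. As a rough sketch, the loop below converts every .docx file in one directory with the -o and -t options from the list above. The function name convert_all, the D2T variable, and the directory names in the usage line are assumptions made for this example.

```bash
# Path to the d2t script; D2T is an assumed variable name, override as needed.
D2T="${D2T:-./d2t}"

# Convert every .docx file in a directory into one output directory.
# convert_all is a hypothetical helper name, not part of docx2tex.
convert_all() {
  local in_dir="$1" out_dir="$2" f
  for f in "$in_dir"/*.docx; do
    [ -e "$f" ] || continue   # no match: the glob pattern stays literal
    # -o sets the output directory, -t selects the table model.
    "$D2T" -o "$out_dir" -t tabularx "$f" || echo "failed: $f" >&2
  done
}

# Usage (directory names are placeholders):
# convert_all ./manuscripts ./tex-out
```

A failed conversion is reported on stderr and the loop continues with the next file, which is usually preferable for batch runs.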

Windows

d2t.bat myfile.docx

via XML Calabash

Linux/Mac OSX

calabash/calabash.sh -o result=myfile.tex -o hub=myfile.xml xpl/docx2tex.xpl docx=myfile.docx conf=conf/conf.xml

Windows

calabash\calabash.bat -o result=myfile.tex -o hub=myfile.xml xpl/docx2tex.xpl docx=myfile.docx conf=conf/conf.xml

configure

The docx2tex pipeline consists of 3 macroscopic steps:

  • docx2hub. This step is hardly configurable. It transforms a docx file to a Hub XML representation.
  • evolve-hub. This is a bag of XSLT modes that, among other things, transform paragraphs with list markers and hanging indentation to proper nested lists, create a nested section hierarchy, group images with their figure titles, etc. Only some of the modes are used by docx2tex, orchestrated by evolve-hub.xpl and configured in detail by evolve-hub-driver.xsl.
  • xml2tex

There are five major hooks for adding your own processing: a CSV configuration; an xml2tex configuration; XSLT that is applied between evolve-hub and xml2tex; XSLT that modifies what happens in evolve-hub; and fontmaps.


You can specify a custom configuration file for docx2tex. There are two different formats to write a configuration.

  • The CSV-based configuration format permits a simple way to map from MS Word styles to LaTeX commands.
  • The xml2tex configuration format is recommended for a deeper level of configuration but requires basic knowledge of XML and XPath.

CSV

For each MS Word style name, create a line with three semicolon-separated values:

  • MS Word style name
  • LaTeX start statement
  • LaTeX end statement

Just follow this example:

Heading 1   ; \chapter{     ; }
Heading 2   ; \section{     ; }
Heading 3   ; \subsection{  ; }
Quote       ; \begin{quote} ; \end{quote}

You can edit CSV files either with a simple text editor or with a spreadsheet application.
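
From the command line, such a file can also be written with a here-document and checked before use. This is only a sketch: the file name my-styles.csv is arbitrary, the field check is a convenience added for this example, and it assumes that the -c option accepts a CSV file in the same way as an xml2tex configuration.

```bash
# Write a CSV configuration; the quoted 'EOF' keeps the backslashes literal.
# The file name my-styles.csv is an arbitrary choice for this sketch.
cat > my-styles.csv <<'EOF'
Heading 1   ; \chapter{     ; }
Heading 2   ; \section{     ; }
Quote       ; \begin{quote} ; \end{quote}
EOF

# Sanity check: every line needs exactly three semicolon-separated fields.
awk -F';' 'NF != 3 { printf "line %d: expected 3 fields, got %d\n", NR, NF; bad = 1 }
           END { exit bad }' my-styles.csv \
  && echo "my-styles.csv looks well-formed"

# Pass the file to docx2tex with -c (runs only if d2t and the input exist here).
if [ -x ./d2t ] && [ -f myfile.docx ]; then
  ./d2t -c my-styles.csv myfile.docx
fi
```

Note that the field check would reject a LaTeX statement that itself contains a semicolon.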

xml2tex

docx2tex can also be configured by means of an xml2tex configuration file. docx2tex applies the configuration to the intermediate Hub XML file and generates the LaTeX output.

The configuration in conf/conf.xml is used by default and works with the styles defined in Microsoft Word's normal.dot. If you want to configure docx2tex for other styles, you can edit this file or pass a custom configuration file with the conf option.

Learn how to edit this file here.

XSLT between evolve-hub and xml2tex

You can provide an XSLT that works on the result of evolve-hub (if debugging is enabled, on the file [basename].debug/evolve-hub/70.docx2tex-postprocess.xml). The location of this XSLT file (absolute URI or path relative to the main directory that d2t and d2t.bat reside in) may be provided to d2t via the -x option. d2t.bat does not have all the flags; if you are confined to Windows and don’t have Cygwin, WSL, or MinGW, you may invoke calabash/calabash.bat yourself, see above. The additional XSLT’s URI may be provided by the custom-xsl option. This processing is applied before the xml2tex configuration, so your XSLT should transform Hub (DocBook namespace) to Hub.

During evolve-hub

In case you need to influence what evolve-hub does, you can provide a custom stylesheet for this. Contrary to custom-xsl which is passed as an option, this is passed to the pipeline on the input port custom-evolve-hub-driver, or using the -e option of d2t. There is an example for such an XSLT that retains empty paragraphs that will otherwise be removed by default, in one of the XSLT passes that comprise evolve-hub. This example was created in response to a user request. If you want to create \chapter, \section, etc. headings from arbitrary docx paragraphs, you should add a template that sets the paragraph’s @role attribute to Heading1, Heading2, etc. (For paragraphs that are not removed during evolve-hub, this can also be done in the -x stylesheet.) It is strongly advised to xsl:import the default evolve-hub customization (see example).
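
Both stylesheets can be supplied in one run. The sketch below shows the two routes described above: the -e and -x options of d2t, and the equivalent XML Calabash call, where the evolve-hub driver goes to the input port custom-evolve-hub-driver (via Calabash's -i switch) and the postprocessing stylesheet to the custom-xsl option. The function names, the D2T and CALABASH variables, and all file names are assumptions made for this example.

```bash
# Paths to the front ends; both variable names are assumptions for this sketch.
D2T="${D2T:-./d2t}"
CALABASH="${CALABASH:-calabash/calabash.sh}"

# Route 1: the Bash front end.
# $1: docx file, $2: evolve-hub driver override, $3: postprocessing stylesheet
convert_with_custom_xslt() {
  "$D2T" -e "$2" -x "$3" "$1"
}

# Route 2: XML Calabash directly (e.g. where d2t is not available).
convert_with_custom_xslt_calabash() {
  local docx="$1" driver="$2" post="$3"
  local base="${docx%.docx}"
  # -i binds an input port, -o binds an output port;
  # name=value pairs after the pipeline are pipeline options.
  "$CALABASH" -i custom-evolve-hub-driver="$driver" \
    -o result="$base.tex" -o hub="$base.xml" \
    xpl/docx2tex.xpl docx="$docx" conf=conf/conf.xml custom-xsl="$post"
}

# Usage (file names are placeholders):
# convert_with_custom_xslt myfile.docx my-evolve-hub-driver.xsl my-postprocess.xsl
```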

fontmaps

The docx conversion supports individual fontmaps for mapping non-Unicode characters to Unicode. Please note that this is only needed for fonts that are not Unicode-compatible. If you want to map characters from Unicode to LaTeX, please use the character map in the xml2tex configuration instead.

Please find further documentation on how to create a fontmap here.

After you have created your fontmap, store it in a directory and pass the path of the directory to docx2tex with the -f option.

If you invoke the docx2tex XProc pipeline (xpl/docx2tex.xpl), you can specify the fontmap directory with the option custom-font-maps-dir.
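
Put together, the two ways of passing a fontmap directory might look like the following sketch. The function names, the D2T and CALABASH variables, and the directory name in the usage line are assumptions made for this example. Whether custom-font-maps-dir expects a plain path or a file URI is not stated here, so check this against your setup.

```bash
# Paths to the front ends; both variable names are assumptions for this sketch.
D2T="${D2T:-./d2t}"
CALABASH="${CALABASH:-calabash/calabash.sh}"

# Variant 1: the Bash front end with -f.
convert_with_fontmaps() {
  local docx="$1" fontmap_dir="$2"
  "$D2T" -f "$fontmap_dir" "$docx"
}

# Variant 2: the XProc pipeline with the custom-font-maps-dir option.
convert_with_fontmaps_calabash() {
  local docx="$1" fontmap_dir="$2"
  local base="${docx%.docx}"
  "$CALABASH" -o result="$base.tex" -o hub="$base.xml" \
    xpl/docx2tex.xpl docx="$docx" conf=conf/conf.xml \
    custom-font-maps-dir="$fontmap_dir"
}

# Usage (directory name is a placeholder):
# convert_with_fontmaps myfile.docx my-fontmaps
```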

language tagging

You may have noticed some obscure \foreignlanguage{} or \selectlanguage{} code that doesn't match the actual language used in your TeX document. There are no fancy AI™-based natural language algorithms at work. Instead, docx2tex evaluates the original document language, which typically follows your system settings, and the language setting of the paragraph or character style, which Word uses for auto-correction and hyphenation. docx2tex then filters redundant markup, e.g. by detecting the main language from the character count of each style and its respective language setting. However, when you copy and paste from the World Wide Web, Microsoft Word usually copies the language of the original website as well. This causes most of the weird language markup you may have noticed. We therefore recommend pasting as plain text and creating new paragraph and character styles when you intentionally want to change the language of a text fragment.
