  • Stars: 2,541
  • Rank 18,052 (Top 0.4 %)
  • Language: Python
  • License: GNU Affero Genera...
  • Created over 5 years ago
  • Updated about 2 months ago


Repository Details

Open source Python library for converting PDF to DOCX.


pdf2docx


  • Extract data from PDF with PyMuPDF, e.g. text, images and drawings
  • Parse layout with rules, e.g. sections, paragraphs, images and tables
  • Generate docx with python-docx
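
The three steps above run behind a single conversion call. The following is a minimal sketch, assuming the `Converter` class with its `convert` and `close` methods as described in the pdf2docx documentation, and assuming that `start` is a 0-based page index and `end` is exclusive. The wrapper `convert_pdf` and the helper `docx_path_for` are hypothetical names introduced here for illustration. The import is deferred so the helpers can be defined even when the package is not installed.

```python
from pathlib import Path


def docx_path_for(pdf_path: str) -> str:
    """Derive an output path next to the input, e.g. report.pdf -> report.docx."""
    return str(Path(pdf_path).with_suffix(".docx"))


def convert_pdf(pdf_path: str, start: int = 0, end=None) -> str:
    """Convert pages [start, end) of a PDF to DOCX and return the output path."""
    # Third-party dependency (pip install pdf2docx); imported lazily on purpose.
    from pdf2docx import Converter

    out = docx_path_for(pdf_path)
    cv = Converter(pdf_path)
    try:
        # Assumption: end=None converts through the last page.
        cv.convert(out, start=start, end=end)
    finally:
        cv.close()  # always release the underlying PDF handle
    return out
```

Calling `convert_pdf("report.pdf")` would then write `report.docx` next to the input, provided the file exists and is a text-based PDF.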

Features

  • Parse and re-create page layout

    • page margin
    • section and column (1 or 2 columns only)
    • page header and footer [TODO]
  • Parse and re-create paragraph

    • OCR text [TODO]
    • text in horizontal/vertical direction: from left to right, from bottom to top
    • font style, e.g. font name, size, weight, italic and color
    • text format, e.g. highlight, underline, strike-through
    • list style [TODO]
    • external hyper link
    • paragraph horizontal alignment (left/right/center/justify) and vertical spacing
  • Parse and re-create image

    • in-line image
    • image in Gray/RGB/CMYK mode
    • transparent image
    • floating image, i.e. picture behind text
  • Parse and re-create table

    • border style, e.g. width, color
    • shading style, i.e. background color
    • merged cells
    • vertical direction cell
    • table with partly hidden borders
    • nested tables
  • Parsing pages with multi-processing
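
Multi-processing is typically switched on through keyword arguments to the conversion call. The sketch below assumes the `multi_processing` and `cpu_count` parameters of `Converter.convert` as given in the pdf2docx documentation. `pick_cpu_count` and `convert_parallel` are hypothetical helpers, and the clamping policy is an illustrative choice rather than library behaviour.

```python
import os


def pick_cpu_count(requested=None) -> int:
    """Clamp a requested worker count to the range [1, available CPUs]."""
    available = os.cpu_count() or 1
    if requested is None:
        return available
    return max(1, min(requested, available))


def convert_parallel(pdf_path: str, docx_path: str, requested_cpus=None) -> None:
    """Convert a whole PDF using several worker processes."""
    # Third-party dependency (pip install pdf2docx); imported lazily on purpose.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        cv.convert(
            docx_path,
            multi_processing=True,
            cpu_count=pick_cpu_count(requested_cpus),
        )
    finally:
        cv.close()
```

On platforms that start worker processes with the spawn method (Windows, recent macOS), the call to `convert_parallel` should sit under an `if __name__ == "__main__":` guard.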

It can also be used as a tool to extract table contents, since both table content and format/style are parsed.
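
Table extraction might look like the sketch below. It assumes the `Converter.extract_tables` method from the pdf2docx documentation, which is understood to return a list of tables, each a list of rows of cell strings, with `None` standing in for cells covered by a merge. `table_to_csv` and `extract_tables_as_csv` are hypothetical helpers added for illustration.

```python
import csv
import io


def table_to_csv(table) -> str:
    """Serialize one extracted table (rows of cell strings) to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in table:
        # Assumption: None marks a cell swallowed by a merged neighbour.
        writer.writerow("" if cell is None else cell for cell in row)
    return buf.getvalue()


def extract_tables_as_csv(pdf_path: str, start: int = 0, end=None):
    """Return one CSV string per table found on pages [start, end)."""
    # Third-party dependency (pip install pdf2docx); imported lazily on purpose.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        tables = cv.extract_tables(start=start, end=end)
    finally:
        cv.close()
    return [table_to_csv(t) for t in tables]
```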

Limitations

  • Only text-based PDF files are supported
  • Only left-to-right languages are supported
  • Normal reading direction only; no word transformation / rotation
  • The rule-based method cannot convert every PDF layout with 100% fidelity
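
Because scanned (image-only) PDFs fall outside the first limitation, it can help to check for extractable text before converting. The sketch below is a rough heuristic, not part of pdf2docx: it reads page text with PyMuPDF (`fitz.open` and `page.get_text()`), and the helper names and thresholds (`min_chars_per_page`, "at least half of the pages") are assumptions chosen for illustration.

```python
def looks_text_based(page_texts, min_chars_per_page: int = 10) -> bool:
    """Heuristic: True if at least half of the pages carry extractable text."""
    pages = list(page_texts)
    if not pages:
        return False
    with_text = sum(1 for t in pages if len(t.strip()) >= min_chars_per_page)
    return with_text * 2 >= len(pages)


def pdf_page_texts(pdf_path: str):
    """Return the plain text of every page, extracted with PyMuPDF."""
    # Third-party dependency (pip install pymupdf); imported lazily on purpose.
    import fitz

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

A caller could run `looks_text_based(pdf_page_texts(path))` and skip conversion, or route the file to an OCR step, when it returns `False`.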

Documentation

Sample

[Figure: sample_compare.png]
