• Stars: 491
• Rank: 89,636 (Top 2 %)
• Language: Python
• License: MIT License
• Created: about 9 years ago
• Updated: about 1 year ago

Repository Details

A pure Python utility to extract text and images from docx files.

python-docx2txt

A pure Python utility to extract text from docx files.

The code is adapted from python-docx. It can, however, also extract text from headers, footers, and hyperlinks, and it can now extract images as well.
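To make the idea of pure-Python extraction concrete, here is a minimal sketch of the general technique using only the standard library. It is not docx2txt's actual code: the function names `extract_body_text` and `make_minimal_docx` are hypothetical, and the sketch reads only the main document body, ignoring headers, footers, hyperlinks, and images. It relies on the fact that a docx file is a ZIP archive whose body text lives in `word/document.xml`.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_body_text(docx_file):
    """Return the body text of a docx file, one line per paragraph.

    Hypothetical helper for illustration; accepts a path or a file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def make_minimal_docx(paragraphs):
    """Build a tiny in-memory archive holding only word/document.xml.

    Demo input only; this is not a complete, Word-compatible document.
    """
    ns = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    body = "".join(f"<w:p><w:r><w:t>{p}</w:t></w:r></w:p>" for p in paragraphs)
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<w:document xmlns:w="{ns}"><w:body>{body}</w:body></w:document>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(extract_body_text(make_minimal_docx(["Hello", "World"])))
```

A real extractor has more to handle, such as tabs, line breaks, and the separate XML parts that hold headers and footers, which is the kind of work the utility takes care of.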

How to install?

pip install docx2txt

How to run?

a. From the command line:

# extract text
docx2txt file.docx
# extract text and images
docx2txt -i /tmp/img_dir file.docx

b. From Python:

import docx2txt

# extract text
text = docx2txt.process("file.docx")

# extract text and write images to /tmp/img_dir
text = docx2txt.process("file.docx", "/tmp/img_dir")
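
The two calls above can be wrapped in a small helper when the extracted text should also be saved to disk. This is a sketch under stated assumptions: `extract_docx` and its `out_txt` and `process` parameters are hypothetical names, not part of docx2txt, and the helper creates the image directory up front on the assumption that `docx2txt.process` expects it to exist already.

```python
from pathlib import Path


def extract_docx(docx_path, img_dir=None, out_txt=None, process=None):
    """Extract text from a docx file, optionally saving images and the text.

    Hypothetical wrapper: `process` defaults to docx2txt.process and is a
    parameter only so the helper can be exercised without the library.
    """
    if process is None:
        import docx2txt  # imported lazily; requires `pip install docx2txt`
        process = docx2txt.process

    if img_dir is not None:
        # Assumption: the image directory must exist before extraction.
        Path(img_dir).mkdir(parents=True, exist_ok=True)
        text = process(str(docx_path), str(img_dir))
    else:
        text = process(str(docx_path))

    if out_txt is not None:
        Path(out_txt).write_text(text, encoding="utf-8")
    return text
```

With the library installed, `extract_docx("file.docx", "/tmp/img_dir", "file.txt")` would return the text, write images to `/tmp/img_dir`, and save the text to `file.txt`.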