• Stars
    3,724
  • Rank 11,728 (Top 0.3%)
  • Language
    Python
  • License
    Apache License 2.0
  • Created 2 months ago
  • Updated about 1 month ago

Repository Details

A Comprehensive Toolkit for High-Quality PDF Content Extraction

More Repositories

1

MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
Python
10,412
star
2

labelU

Data annotation toolbox that supports image, audio, and video data.
Python
704
star
3

WanJuan1.0

WanJuan 1.0 multimodal corpus
525
star
4

LabelLLM

The Open-Source Data Annotation Platform
TypeScript
384
star
5

VIGC

AAAI 2024: Visual Instruction Generation and Correction
Python
73
star
6

opendatalab-datasets

datasets resource
65
star
7

opendatalab-python-sdk

SDK of OpenDataLab - https://opendatalab.org.cn
Python
52
star
8

labelU-Kit

Data annotation component library, provided as NPM packages
TypeScript
52
star
9

CLIP-Parrot-Bias

ECCV 2024: Parrot Captions Teach CLIP to Spot Text
Python
52
star
10

magic-doc

Python
49
star
11

dsdl-docs

Data Set Description Language Specification (DSDL, a next-generation dataset description language for AI)
HTML
43
star
12

H2RSVLM

H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model
37
star
13

MLLM-DataEngine

MLLM-DataEngine: An Iterative Refinement Approach for MLLM
Python
27
star
14

image-downloader

Python
24
star
15

magic-html

Python
20
star
16

dsdl-sdk

Jupyter Notebook
13
star
17

labelU-frontend

LabelU front-end library
TypeScript
7
star
18

allz

A universal command line tool for compression and decompression
Python
4
star
19

laion5b-downloader

Python
3
star
20

HA-DPO

Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization
Python
2
star
21

MLS-BRN

[CVPR 2024] 3D Building Reconstruction from Monocular Remote Sensing Images with Multi-level Supervisions
1
star
22

Miner-PDF-Benchmark

MPB (Miner-PDF-Benchmark) is an end-to-end PDF document comprehension evaluation suite designed for large-scale model data scenarios.
Python
1
star
23

labelU-ML

Python
1
star
24

s3_browser

A Streamlit-based tool for browsing S3 storage contents online.
Python
1
star