End-to-End Scene Text Detection and Recognition System Resources
Author: Canjie Luo, Chongyu Liu
- 1. Datasets
- 2. Summary of End-to-end Scene Text Detection and Recognition Methods
- 3. Survey
- 4. OCR Service
- 5. References and codes
1. Datasets
1.1 Introduction
- SVT [16]:
  - Introduction: There are 100 training images and 250 testing images downloaded from Google Street View of road-side scenes. The labelled text can be very challenging, with a wide variety of fonts, orientations, and lighting conditions. A lexicon containing 50 words (SVT-50) is also provided for each image.
  - Link: SVT-download
- ICDAR 2003 (IC03) [17]:
  - Introduction: The dataset contains a varied array of photos of the world that contain scene text. There are 251 testing images with 50-word lexicons (IC03-50) and a lexicon of all test ground-truth words (IC03-Full).
  - Link: IC03-download
- ICDAR 2011 (IC11) [18]:
  - Introduction: The dataset is an extension of the dataset used for the text locating competitions of ICDAR 2003. It includes 484 natural images in total.
  - Link: IC11-download
- ICDAR 2013 (IC13) [19]:
  - Introduction: The dataset consists of 229 training images and 233 testing images. Most text is horizontal. Three specific lexicons are provided, named "Strong (S)", "Weak (W)" and "Generic (G)". The "Strong (S)" lexicon provides 100 words per image, including all words that appear in the image. The "Weak (W)" lexicon includes all words that appear in the entire test set. The "Generic (G)" lexicon is a 90k-word vocabulary.
  - Link: IC13-download
- ICDAR 2015 (IC15) [20]:
  - Introduction: The dataset includes 1000 training images and 500 testing images captured by Google Glass. The text in the scene appears in arbitrary orientations. Similar to ICDAR 2013, it also provides "Strong (S)", "Weak (W)" and "Generic (G)" lexicons.
  - Link: IC15-download
- Total-Text [21]:
  - Introduction: In addition to horizontal and oriented text, Total-Text contains a large amount of curved text. It has 1255 training images and 300 test images. All images are annotated with polygons and transcriptions at word level. A "Full" lexicon containing all words in the test set is provided.
  - Link: Total-Text-download
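The lexicons described above are typically used to constrain recognition output: a raw prediction is replaced by the closest word in the lexicon. The following is a minimal sketch of that idea, assuming plain Levenshtein distance and case-insensitive matching. The names `edit_distance` and `snap_to_lexicon` and the sample lexicon are hypothetical and are not taken from any of the listed papers or benchmark toolkits, which may use different matching rules.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def snap_to_lexicon(prediction: str, lexicon: Sequence[str]) -> str:
    """Return the lexicon word closest to the raw prediction (case-insensitive).

    With an empty lexicon (the "None" setting) the prediction is returned as-is.
    """
    if not lexicon:
        return prediction
    pred = prediction.lower()
    return min(lexicon, key=lambda word: edit_distance(pred, word.lower()))


# Hypothetical per-image lexicon and a recognizer output with one wrong character.
strong_lexicon = ["STARBUCKS", "COFFEE", "OPEN"]
print(snap_to_lexicon("C0FFEE", strong_lexicon))
```

A smaller lexicon leaves fewer candidates to confuse, which is one reason results under a 50-word or "Strong" lexicon tend to be higher than under a "Generic" or no-lexicon setting.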
1.2 Comparison of Datasets
Comparison of Datasets
Datasets | Language | Images (Total) | Images (Train) | Images (Test) | Text instances (Total) | Text instances (Train) | Text instances (Test) | Shape: Horizontal | Shape: Arbitrary-Quadrilateral | Shape: Multi-oriented | Annotation: Char | Annotation: Word | Annotation: Text-Line | Lexicon: 50 | Lexicon: 1k | Lexicon: Full | Lexicon: None |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
IC03 | English | 509 | 258 | 251 | 2266 | 1110 | 1156 | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ |
IC11 | English | 484 | 229 | 255 | 1564 | ~ | ~ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ |
IC13 | English | 462 | 229 | 233 | 1944 | 849 | 1095 | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ |
SVT | English | 350 | 100 | 250 | 725 | 211 | 514 | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ |
SVT-P | English | 238 | ~ | ~ | 639 | ~ | ~ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ |
IC15 | English | 1500 | 1000 | 500 | 17548 | 12318 | 5230 | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ |
Total-Text | English | 1555 | 1255 | 300 | 9330 | ~ | ~ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ |
2. Summary of End-to-end Scene Text Detection and Recognition Methods
2.1 Comparison of methods
Method | Model | Code | Detection | Recognition | Source | Time | Highlight |
---|---|---|---|---|---|---|---|
Wang et al. [1] | | โ | Sliding windows and Random Ferns | Pictorial Structures | ICCV | 2011 | Word re-scoring for NMS |
Wang et al. [2] | | โ | CNN-based | Sliding windows for classification | ICPR | 2012 | CNN architecture |
Jaderberg et al. [3] | | โ | CNN-based and saliency maps | CNN classifier | ECCV | 2014 | Data mining and annotation |
Alsharif et al. [4] | | โ | CNN and hybrid HMM maxout models | Segmentation-based | ICLR | 2014 | Hybrid HMM maxout models |
Yao et al. [5] | | โ | Random Forest | Component Linking and Word Partition | TIP | 2014 | (1) Detection and recognition feature sharing. (2) Oriented text. (3) A new dictionary search method |
Neumann et al. [6] | | โ | Extremal Regions | Clustering algorithm to group characters | TPAMI | 2015 | Real-time performance (1.6s/image) |
Jaderberg et al. [7] | | โ | Region proposal mechanism | Word-level classification | IJCV | 2016 | Trained only on data produced by a synthetic text generation engine, requiring no human-labelled data |
Liao et al. [8] | TextBoxes | โ | SSD-based framework | CRNN | AAAI | 2017 | An end-to-end trainable fast scene text detector |
Bušta et al. [9] | Deep TextSpotter | โ | YOLOv2 | CTC | ICCV | 2017 | YOLOv2 + RPN, RNN + CTC. It is the first end-to-end trainable detection and recognition system with high speed. |
Li et al. [10] | | โ | Text Proposal Network | Attention | ICCV | 2017 | TPN + RNN encoder + attention-based RNN |
Sun et al. [22] | TextNet | โ | Scale-aware attention backbone and Perspective RoI Transform | Attention | ACCV | 2018 | Perspective RoI Transform for irregular text recognition |
Lyu et al. [11] | Mask TextSpotter | โ | Fast R-CNN with mask branch | Character segmentation | ECCV | 2018 | Precise text detection and recognition are acquired via semantic segmentation |
He et al. [12] | | โ | Text-Alignment Layer | Attention | CVPR | 2018 | Character attention mechanism: uses character spatial information as explicit supervision |
Liu et al. [13] | FOTS | โ | EAST with RoIRotate | CTC | CVPR | 2018 | Little computation overhead compared to the baseline text detection network (22.6 fps) |
Liao et al. [14] | TextBoxes++ | โ | SSD-based framework | CRNN | TIP | 2018 | Journal version of TextBoxes (multi-oriented scene text support) |
Liao et al. [15] | Mask TextSpotter | โ | Mask R-CNN | Character segmentation + Spatial Attention Module | TPAMI | 2019 | Journal version of Mask TextSpotter (proposes the Spatial Attention Module) |
Xing et al. [23] | CharNet | โ | A character branch and a detection branch | Character level | ICCV | 2019 | Uses the character as the basic element to overcome the main difficulty of jointly optimizing text detection and RNN-based recognition |
Feng et al. [24] | TextDragon | โ | Local box regression, center line segmentation and RoI Sliding | CTC | ICCV | 2019 | A new differentiable operator named RoISlide connects arbitrary-shaped text detection and recognition |
Qin et al. [25] | | โ | Mask R-CNN with RoI masking | Attention | ICCV | 2019 | A simple yet effective RoI masking step to extract useful irregularly shaped text instance features |
Qiao et al. [26] | Text Perceptron | โ | Mask R-CNN with Order-aware Semantic Segmentation and Boundary Regressions | Attention | AAAI | 2020 | A novel Shape Transform Module to transform the feature regions into regular morphologies |
Wang et al. [27] | | โ | Oriented Rectangular Box Detector and Boundary Point Detector | Attention | AAAI | 2020 | A set of points on the boundary of each text instance represents arbitrary shapes |
Liu et al. [28] | ABCNet | โ | Bezier Curve Detection and BezierAlign | CTC | CVPR | 2020 | 10 times faster than recent state-of-the-art methods with competitive scene text spotting accuracy |
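Several methods in the table (Deep TextSpotter, FOTS, TextDragon, ABCNet) list CTC as their recognition component. As a rough illustration of how a CTC output sequence is turned into a string, here is a minimal greedy-decoding sketch. It assumes that index 0 is the blank label and that per-frame scores arrive as plain Python lists. The name `ctc_greedy_decode` and the toy alphabet are hypothetical, and the listed systems may decode differently (for example with beam search or a lexicon).

```python
from itertools import groupby
from typing import Sequence

BLANK = 0  # assumption: class index 0 is reserved for the CTC blank label


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], alphabet: str) -> str:
    """Greedy CTC decoding: argmax per frame, merge repeats, then drop blanks.

    frame_scores[t][k] is the score of class k at time step t, where class 0 is
    the blank and class k >= 1 maps to alphabet[k - 1].
    """
    # 1. Pick the best-scoring class independently for every frame.
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    # 2. Merge runs of identical consecutive labels into one label.
    collapsed = [label for label, _ in groupby(best_path)]
    # 3. Remove blanks and map the remaining class indices to characters.
    return "".join(alphabet[label - 1] for label in collapsed if label != BLANK)


# Toy example over the alphabet "ab": the best path is a, a, blank, a, b.
scores = [
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
print(ctc_greedy_decode(scores, "ab"))  # prints "aab"
```

The blank between the second and third "a" frames is what lets the decoder keep a doubled letter; without it, the repeated labels would merge into a single "a".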
2.2 End-to-end scene text detection and recognition results
Method | Model | Source | Time | SVT | SVT-50 | IC03 | IC11 | IC13 | IC15 | Total-text | ||||||||||||||||
End-to-end | Spotting | End-to-end | Spotting | None | Full | None | Full | |||||||||||||||||||
50 | Full | None | S | W | G | S | W | G | S | W | G | S | W | G | ||||||||||||
Wang et al. [1] | ICCV | 2011 | ~ | ~ | 51 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ||
Wang et al. [2] | ICPR | 2012 | 46 | ~ | 72 | 67 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | |
Jaderberg et al. [3] | ECCV | 2014 | ~ | 56 | 80 | 75 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | |
Alsharif et al. [4] | ICLR | 2014 | ~ | 48 | 77 | 70 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | |
Yao et al. [5] | TIP | 2014 | ~ | ~ | ~ | ~ | 48.6 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | |||
Neumann et al. [6] | TPAMI | 2015 | 68.1 | ~ | ~ | ~ | ~ | 45.2 | ~ | ~ | ~ | ~ | ~ | 35 | 19.9 | 15.6 | 35 | 19.9 | 15.6 | ~ | ~ | ~ | ~ | ~ | ||
Jaderberg et al. [7] | IJCV | 2016 | 53 | 76 | 90 | 86 | 78 | 76 | 76 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | |
Liao et al. [8] | TextBoxes | AAAI | 2017 | 64 | 84 | ~ | ~ | ~ | 87 | 91 | 89 | 84 | 94 | 92 | 87 | ~ | ~ | ~ | ~ | ~ | ~ | 36.3 | 48.9 | ~ | ~ | ~ |
Bušta et al. [9] | Deep TextSpotter | ICCV | 2017 | ~ | ~ | ~ | ~ | ~ | ~ | 89 | 86 | 77 | 92 | 89 | 81 | 54 | 51 | 47 | 58 | 53 | 51 | ~ | ~ | ~ | ~ | 21.85 |
Li et al. [10] | ICCV | 2017 | 66.18 | 84.91 | ~ | ~ | ~ | 87.7 | ~ | ~ | ~ | ~ | ~ | ~ | 91.08 | 89.8 | 84.6 | 94.2 | 92.4 | 88.2 | ~ | ~ | ~ | ~ | ~ | |
Sun et al. [22] | TextNet | ACCV | 2018 | ~ | ~ | ~ | ~ | ~ | ~ | 89.77 | 88.80 | 82.96 | 94.59 | 93.48 | 86.99 | 78.66 | 74.9 | 60.45 | 82.38 | 78.43 | 62.36 | 54.02 | ~ | ~ | ~ | ~ |
Lyu et al. [11] | Mask TextSpotter | ECCV | 2018 | ~ | ~ | ~ | ~ | ~ | ~ | 92.2 | 91.1 | 86.5 | 92.5 | 92 | 88.2 | 79.3 | 73 | 62.4 | 79.3 | 74.5 | 64.2 | 52.9 | 71.8 | ~ | ~ | ~ |
He et al. [12] | CVPR | 2018 | ~ | ~ | ~ | ~ | ~ | ~ | 91 | 89 | 86 | 93 | 92 | 87 | 82 | 77 | 63 | 85 | 80 | 65 | ~ | ~ | ~ | ~ | ~ | |
Liu et al. [13] | FOTS | CVPR | 2018 | ~ | ~ | ~ | ~ | ~ | ~ | 91.99 | 90.11 | 84.77 | 95.94 | 93.9 | 87.76 | 83.55 | 79.11 | 65.33 | 87.01 | 82.39 | 67.97 | ~ | ~ | ~ | ~ | ~ |
Liao et al. [14] | TextBoxes++ | TIP | 2018 | 64 | 84 | ~ | ~ | ~ | ~ | 93 | 92 | 85 | 96 | 95 | 87 | 73.3 | 65.9 | 51.9 | 76.5 | 69 | 54.4 | ~ | ~ | ~ | ~ | ~ |
Liao et al. [15] | Mask TextSpotter | TPAMI | 2019 | ~ | ~ | ~ | ~ | ~ | ~ | 93.3 | 91.3 | 88.2 | 92.7 | 91.7 | 87.7 | 83 | 77.7 | 73.5 | 82.4 | 78.1 | 73.6 | 65.3 | 77.4 | ~ | ~ | ~ |
Xing et al. [23] | CharNet | ICCV | 2019 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | 85.05 | 81.25 | 71.08 | ~ | ~ | ~ | 69.2 | ~ | ~ | ~ | ~ |
Feng et al. [24] | TextDragon | ICCV | 2019 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | 82.54 | 78.34 | 65.15 | 86.22 | 81.62 | 68.03 | 48.8 | 74.8 | 39.7 | 72.4 | ~ |
Qin et al. [25] | ICCV | 2019 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | 85.51 | 81.91 | 69.94 | ~ | ~ | ~ | 70.7 | ~ | ~ | ~ | ~ | |
Qiao et al. [26] | Text Perceptron | AAAI | 2020 | ~ | ~ | ~ | ~ | ~ | ~ | 91.4 | 90.7 | 85.8 | 94.9 | 94 | 88.5 | 80.5 | 76.6 | 65.1 | 84.1 | 79.4 | 67.9 | 69.7 | 78.3 | 57 | ~ | ~ |
Wang et al. [27] | AAAI | 2020 | ~ | ~ | ~ | ~ | ~ | ~ | 88.2 | 87.7 | 84.1 | ~ | ~ | ~ | 79.7 | 75.2 | 64.1 | ~ | ~ | ~ | 65 | 76.1 | ~ | ~ | 41.3 | |
Liu et al. [28] | ABCNet | CVPR | 2020 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | 69.5 | 78.4 | 45.2 | 74.1 | ~ |
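The table does not spell out how the scores are computed. End-to-end results on these benchmarks are commonly reported as an F-measure in which a predicted word counts as correct only if it overlaps a ground-truth word sufficiently and its transcription matches. The sketch below illustrates that idea under simplifying assumptions: axis-aligned boxes, an IoU threshold of 0.5, case-insensitive exact string matching, and greedy one-to-one matching. The names `iou` and `end_to_end_f_measure` are hypothetical; the official protocols work on quadrilaterals or polygons and handle details such as "don't care" regions that are omitted here.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Word = Tuple[Box, str]                   # a box and its transcription


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def end_to_end_f_measure(preds: List[Word], gts: List[Word],
                         iou_thresh: float = 0.5) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for one image.

    A ground-truth word is a hit if some not-yet-used prediction overlaps it
    with IoU >= iou_thresh and has the same text, ignoring case.
    """
    used = set()  # indices of predictions already matched to a ground truth
    hits = 0
    for gt_box, gt_text in gts:
        for i, (pred_box, pred_text) in enumerate(preds):
            if i in used:
                continue
            if iou(pred_box, gt_box) >= iou_thresh and pred_text.lower() == gt_text.lower():
                used.add(i)
                hits += 1
                break
    precision = hits / len(preds) if preds else 0.0
    recall = hits / len(gts) if gts else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_measure


# Toy image: one correct word, one well-localized but misread word, one false alarm.
ground_truth = [((0, 0, 10, 10), "exit"), ((20, 0, 40, 10), "cafe")]
predictions = [((1, 0, 10, 10), "EXIT"), ((20, 0, 40, 10), "cale"), ((50, 50, 60, 60), "x")]
print(end_to_end_f_measure(predictions, ground_truth))
```

In the toy example the second prediction is perfectly localized but misread, so it counts against both precision and recall; this is what separates end-to-end scores from detection-only scores.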
3. Survey
[A] [TPAMI-2015] Ye Q, Doermann D. Text detection and recognition in imagery: A survey[J]. IEEE transactions on pattern analysis and machine intelligence, 2015, 37(7): 1480-1500. paper
[B] [Frontiers-Comput. Sci-2016] Zhu Y, Yao C, Bai X. Scene text detection and recognition: Recent advances and future trends[J]. Frontiers of Computer Science, 2016, 10(1): 19-36. paper
[C] [arXiv-2018] Long S, He X, Yao C. Scene Text Detection and Recognition: The Deep Learning Era[J]. arXiv preprint arXiv:1811.04256, 2018. paper
4. OCR Service
OCR | API | Free |
---|---|---|
Tesseract OCR Engine | × | ✓ |
Azure | ✓ | ✓ |
ABBYY | ✓ | ✓ |
OCR Space | ✓ | ✓ |
SODA PDF OCR | ✓ | ✓ |
Free Online OCR | ✓ | ✓ |
Online OCR | ✓ | ✓ |
Super Tools | ✓ | ✓ |
Online Chinese Recognition | ✓ | ✓ |
Calamari OCR | × | ✓ |
Tencent OCR | ✓ | × |
5. References and codes
- [1] Wang K, Babenko B, Belongie S. End-to-end scene text recognition[C]. 2011 International Conference on Computer Vision. IEEE, 2011: 1457-1464. paper
- [2] Wang T, Wu D J, Coates A, et al. End-to-end text recognition with convolutional neural networks[C]. Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012). IEEE, 2012: 3304-3308. paper
- [3] Jaderberg M, Vedaldi A, Zisserman A. Deep features for text spotting[C]. European Conference on Computer Vision. Springer, Cham, 2014: 512-528. paper
- [4] Alsharif O, Pineau J. End-to-End Text Recognition with Hybrid HMM Maxout Models[C]. In ICLR 2014. paper
- [5] Yao C, Bai X, Liu W. A unified framework for multioriented text detection and recognition[J]. IEEE Transactions on Image Processing, 2014, 23(11): 4737-4749. paper
- [6] Neumann L, Matas J. Real-time lexicon-free scene text localization and recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 38(9): 1872-1885. paper
- [7] Jaderberg M, Simonyan K, Vedaldi A, et al. Reading text in the wild with convolutional neural networks[J]. International Journal of Computer Vision, 2016, 116(1): 1-20. paper
- [8] Liao M, Shi B, Bai X, et al. Textboxes: A fast text detector with a single deep neural network[C]. In AAAI 2017. paper code
- [9] Bušta M, Neumann L, Matas J. Deep textspotter: An end-to-end trainable scene text localization and recognition framework[C]. Proceedings of the IEEE International Conference on Computer Vision. 2017: 2204-2212. paper
- [10] Li H, Wang P, Shen C. Towards end-to-end text spotting with convolutional recurrent neural networks[C]. Proceedings of the IEEE International Conference on Computer Vision. 2017: 5238-5246. paper
- [11] Lyu P, Liao M, Yao C, et al. Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes[C]. Proceedings of the European Conference on Computer Vision (ECCV). 2018: 67-83. paper code
- [12] He T, Tian Z, Huang W, et al. An end-to-end textspotter with explicit alignment and attention[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 5020-5029. paper code
- [13] Liu X, Liang D, Yan S, et al. FOTS: Fast oriented text spotting with a unified network[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 5676-5685. paper code
- [14] Liao M, Shi B, Bai X. Textboxes++: A single-shot oriented scene text detector[J]. IEEE Transactions on Image Processing, 2018, 27(8): 3676-3690. paper code
- [15] Liao M, Lyu P, He M, et al. Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019. paper code
- [16] Wang K, Belongie S. Word Spotting in the Wild. European Conference on Computer Vision (ECCV), 2010: 591-604. paper
- [17] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, R. Young, K. Ashida, H. Nagai, M. Okamoto, H. Yamamoto, H. Miyao, J. Zhu, W. Ou, C. Wolf, J. Jolion, L. Todoran, M. Worring, and X. Lin. ICDAR 2003 robust reading competitions: entries, results, and future directions. IJDAR, 7(2-3):105–122, 2005. paper
- [18] Shahab A, Shafait F, Dengel A. ICDAR 2011 robust reading competition challenge 2: Reading text in scene images. In ICDAR, 2011. paper
- [19] D. Karatzas, F. Shafait, S. Uchida, et al. ICDAR 2013 robust reading competition. In ICDAR, 2013. paper
- [20] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. K. Ghosh, A. D. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny. ICDAR 2015 competition on robust reading. In ICDAR, pages 1156–1160, 2015. paper
- [21] Ch'ng C K, Chan C S. Total-text: A comprehensive dataset for scene text detection and recognition. Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on. IEEE, 2017, 1: 935-942. paper
- [22] Y. Sun, C. Zhang, Z. Huang, J. Liu, J. Han, and E. Ding. TextNet: Irregular Text Reading from Images with an End-to-End Trainable Network. Asian Conference on Computer Vision (ACCV), Cham, 2018, vol. 11363, no. 1, pp. 83–99. paper
- [23] Xing L, Tian Z, Huang W, et al. Convolutional character networks. In ICCV, 2019. paper code
- [24] Feng W, He W, Yin F, et al. TextDragon: An End-to-End Framework for Arbitrary Shaped Text Spotting. In ICCV, 2019. paper
- [25] Qin S, Bissacco A, Raptis M, et al. Towards unconstrained end-to-end text spotting. In ICCV, 2019. paper
- [26] Qiao L, Tang S, Cheng Z, et al. Text Perceptron: Towards End-to-End Arbitrary-Shaped Text Spotting. In AAAI, 2020. paper
- [27] Wang H, Lu P, Zhang H, et al. All You Need Is Boundary: Toward Arbitrary-Shaped Text Spotting. In AAAI, 2020. paper
- [28] Liu Y, Chen H, Shen C, et al. ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network. In CVPR, 2020. paper code
If you find any problems in our resources, or any good papers/codes we have missed, please inform us at [email protected]. Thank you for your contribution.
Copyright
Copyright © 2019 SCUT-DLVC. All Rights Reserved.