WordCloudFa
This module is an easy-to-use wrapper for word_cloud module.
The original module doesn't support Farsi Texts. But by using WordCloudFa you can generate word clouds from texts those are including Persian and English words.
This module is not only a wrapper, but it adds some features to the original module.
- How to Install
- How to Use
- Examples
- Font
- Persian Tutorial
- Contribution
- Common Problems
- There is any problem?
- Citations
How to Install
For installing this module on other operating systems, you can simply run
pip install wordcloud-fa
.
This module tested on python 3
WordCloudFa depends on numpy
and pillow
.
Also you should have Hazm
module. Normally, all of them will install automatically when you install this module using
pip
as described at the beginning of this section.
To save the wordcloud into a file, matplotlib
can also be installed.
Attention
You need to have python-dev
for python3 on your system. If you don't have it, you can install it on operating systems
those using apt
as the package manager (Like Ubuntu) by this command:
sudo apt-get install python3-dev
And you can install it on operating systems those using yum
as the package manager (like RedHat, Fedora and ...) you can
use the following command:
sudo yum install python3-devel
How to Use
For creating a word cloud from a text, first you should import the class into your code:
from wordcloud_fa import WordCloudFa
you can create an instance of this class like:
wordcloud = WordCloudFa()
You can pass different parameters to the constructor. For see full documents of them, you can see WordCloud Documentations
There are three parameters that are not in the original class.
First one is persian_normalize
. If you pass this parameter with True
value, your data will normalize by using
Hazm normalizer. It's recommended to always pass this parameter. That will replace
arabic letters with persian ones and do some other stuff.
The default value of this parameter is False
.
wordcloud = WordCloudFa(persian_normalize=True)
the second parameter is include_numbers
that is not in the published original module. If you set this parameter to False
,
all Persian, Arabic and English numbers will remove from your data.
The default value of this parameter is True
wordcloud = WordCloudFa(include_numbers=False)
Common problem Hint:
The last and very important parameter is: no_reshape
. The default value of the parameter is False
. But if you see
that the letters of the words in Farsi texts are separated in your local system, you should pass True
value to this parameter.
wordcloud = WordCloudFa(no_reshape=True)
Generating Word Cloud from Text
for generating word cloud from a string, you can simply call generate
method of you instance:
wordcloud = WordCloudFa(persian_normalize=True)
wc = wordcloud.generate(text)
image = wc.to_image()
image.show()
image.save('wordcloud.png')
Generating Word Cloud from Frequencies
You can generate a word cloud from frequencies. You can use the output of process_text
method as frequencies.
Also you can use any dictionary like this.
wordcloud = WordCloudFa()
frequencies = wordcloud.process_text(text)
wc = wordcloud.generate_from_frequencies(frequencies)
generate_from_frequencies
method in this module will exclude stopwords. But the original module will not exclude them
when you are using this method. Also you can use Persian words as keys in frequencies dict without any problem.
Working with Stopwords
Stopwords are the words that we don't want to consider. If you dan't pass any stopword, the default words in the stopwords file will consider as stopwords.
You don't want to use them at all and you want to choose your stopwords? you can simply set stopwords
parameter when
you are creating an instance from WordCloudFa
and pass a set
of words into it.
stop_words = set(['کلمه‌ی اول', 'کلمه‌ی دوم'])
wc = WordCloudFa(stopwords=stop_words)
If you want to add additional words to the default stopwords, you can simply call add_stop_words
method on your
instance of WordCloudFa
and pass an iterable type (list
, set
, ...) into it.
wc = WordCloudFa()
wc.add_stop_words(['کلمه‌ی اول', 'کلمه‌ی دوم'])
Also you can add stopwords from a file. That file should include stopwords and each word should be in a separate line.
For that, you should use add_stop_words_from_file
method. The only parameter of this
method is relative or absolute path to the stop words file.
wc = WordCloudFa()
wc.add_stop_words_from_file("stopwords.txt")
Mask Image
You can mask the final word cloud by an image. For example, the first image of this document is a wordcloud masked by an image
of the map of Iran country. For setting a mask, you should pass the mask
parameter.
But before, you first should be sure you have a black and white image. Because other images will not create a good result.
Then, you should convert that image to a numpy array. For that, you should do something like this:
import numpy as np
from PIL import Image
mask_array = np.array(Image.open("mask.png"))
You just should add those two imports, but you don't need to be worried about installing them, because those have been installed as dependencies of this module.
Then, you can pass that array to the constructor of the WordCloudFa
class for masking the result.
wordcloud = WordCloudFa(mask=mask_array)
Now you can use your worldcloud instance as before.
Reshaping words
When you pass your texts into an instance of this class, all words will reshape for turning to a proper way for showing And avoiding the invalid shape of Persian or Arabic words (splitted and inverse letters).
If you want to do the same thing outside of this module, you can call reshape_words
static method.
reshaped_words = WordCloudFa.reshape_words(['کلمه‌ی اول', 'کلمه‌ی دوم'])
this method gets an Iterable
as input and returns a list of reshaped words.
DONT FORGET THAT YOU SHOULD NOT PASS RESHAPED WORDS TO THE METHODS OF THIS CLASS AND THIS STATIC METHOD IS ONLY FOR USAGES OUT OF THIS MODULE
Avoiding Dangerous non-ASCII characters
Some non-ASCII characters like emojies causing errors. By Default, those characters will remove from the input text (not when you are using the generate_from_frequencies
method).
For disabling this feature, you can set the value of the remove_unhandled_utf_characters
parameter to False
when you are creating a new instance of the WordCloudFa
.
Also you can access the compiled regex patten of those characters using the unhandled_characters_regex
class attribute.
Examples
You can see Example codes in the Examples directory.
Font
The default font is an unknown! font that supports both Persian and English letters. So you don't need to pass a font for
getting results. But if you want to change the font you can pass font_path
parameter.
Persian Tutorial
If you want to read a brief tutorial about how to use this package in Farsi (Persian), you can click on this link.
Contribution
We want to keep this library fresh and useful for all Iranian developers. So we need your help for adding new features, fixing bugs and adding more documents.
You are wondering how you can contribute to this project? Here is a list of what you can do:
- Documents are not enough? You can help us by adding more documents.
- The current code could be better? You can make this cleaner or faster.
- Do you think one useful feature missed? You can open an issue and tell us about it.
- Did you find a good open and free font that supports Farsi and English? You can notify us by a pull request or if opening an issue
Common Problems
Farsi Letters are separated
If you see separated Farsi letters in your output, you should pass no_reshape=True
parameter to your WordCoudFa
constructor:
wordcloud = WordCloudFa(no_reshape=True)
I See Repeated Farsi Words
In some cases you may see repeated Farsi words in the output. For solving that problem, you should pass collocations=False
Parameter to your WordCloudFa
constructor:
wordcloud = WordCloudFa(collocations=False)
I Have Problem in Running Example Scripts
In some operating systems like Windows, you should specify the encoding of the example text files. If you can not open example files, add encoding="utf-8"
to your open statements:
with open('persian-example.txt', 'r', encoding="utf-8") as file:
There is any problem?
If you have questions, find some bugs or need some features, you can open an issue and tell us. For some strange reasons this is not possible? so contact me by this email: [email protected]
.
Citations
Texts in the Example
directory are from this and this Wikipedia pages.