ChineseNames
Chinese Name Database 1930-2008
A database of Chinese surnames and Chinese given names (1930-2008). This database contains nationwide frequency statistics of 1,806 Chinese surnames and 2,614 Chinese characters used in given names, covering about 1.2 billion Han Chinese population (96.8% of the Han Chinese household-registered population born from 1930 to 2008 and still alive in 2008). This package also contains a function for computing multiple features of Chinese surnames and Chinese given names for scientific research (e.g., name uniqueness, name gender, name valence, and name warmth/competence).
Author
E-mail: [email protected]
Website: psychbruce.github.io
ResearchGate | GitHub | ็ฅไน
Citation
- Bao, H.-W.-S. (2021). ChineseNames: Chinese Name Database 1930-2008. R package version 1.1.1. https://CRAN.R-project.org/package=ChineseNames
- Bao, H.-W.-S., Cai, H., Jing, Y., & Wang, J. (2021). Novel evidence for the increasing prevalence of unique names in China: A reply to Ogihara (2020). Frontiers in Psychology, 12, 731244. https://www.frontiersin.org/articles/10.3389/fpsyg.2021.731244/full
Install
## Method 1: Install from CRAN
install.packages("ChineseNames")
## Method 2: Install from GitHub
install.packages("devtools")
devtools::install_github("psychbruce/ChineseNames")
User Guide
Data Source
This Chinese name database was provided by Beijing Meiming Science and Technology Company (in collaboration) and originally obtained from the National Citizen Identity Information Center (NCIIC) of China in 2008.
It contains nationwide frequency statistics of almost all Chinese surnames and given-name characters, which have covered about 1.2 billion Han Chinese population (96.8% of the Han Chinese population born from 1930 to 2008 and still alive in 2008, i.e., the living household-registered population). It also contains subjective rating indices of given-name characters. To our knowledge, this is the most comprehensive and accurate Chinese name database up to now.
Note that this database does not contain any individual-level information (so it does not leak personal privacy). All data are at the name level or character level. Extremely rare characters are not included.
Datasets
This package includes five datasets (data.frame
in R). You can access them using the data()
function in R. The use of these datasets should follow the GNU GPL-3 License and the Creative Commons License CC BY-NC-SA, with a proper citation of this package and only for non-commercial purposes.
-
familyname
: 1,806 Chinese surnames with their frequencies in the Han Chinese population.- overall frequencies and proportions, regardless of gender and birth cohort
-
givenname
: 2,614 Chinese characters in given names with their frequencies in the Han Chinese population.- separate frequencies and proportions for each gender and each of 6 birth cohorts (pre-1960s, 1960-1969, 1970-1979, 1980-1989, 1990-1999, and 2000-2008)
- including usage in all single-character and multi-character given names (e.g., the character "ไผ" in all "ๅผ ไผ", "ๅผ ไผ__", "ๅผ __ไผ", "็ไผ", "็ไผ__", "็__ไผ", โฆ)
-
top1000name.prov
: Top 1,000 given names in 31 Chinese mainland provinces. -
top100name.year
: Top 100 given names in 6 birth cohorts. -
top50char.year
: Top 50 given-name characters for 6 birth cohorts.
Note. The "ppm" in the variable names of these datasets means "parts per million (็พไธๅ็)" (e.g., 1 ppm = a proportion of 1/106).
Compute Name Features
Use the compute_name_index()
function. This function computes multiple indices of Chinese surnames and given names for scientific research. Just input a data frame with full names (and birth years, if necessary), then it returns a new data frame with all name indices appended.
Examples:
library(ChineseNames)
?compute_name_index # see detailed usage in help page
## Usage 1
compute_name_index(name="ๅ
ๅฏๅด้", birth=1995)
## Usage 2
demodata=data.frame(
name=c("ๅ
ๅฏๅด้", "้ไฟ้", "ๅผ ไผ", "ๅผ ็", "ๆฌง้ณไฟฎ", "ๆฌง้ณ", "ๆ็ๅ็บ", "ๅผ ่บ่ฐ", "็็"),
birth=c(1995, 1995, 1985, 1988, 1968, 2009, 2000, 1950, 2005))
newdata=compute_name_index(
demodata,
var.fullname="name", # full name
var.birthyear="birth") # adjusted for birth year
View(newdata)
# name birth name0 name1 name2 name3 NLen SNU SNI NU CCU NG NV NW NC
# 1: ๅ
ๅฏๅด้ 1995 ๅ
ๅฏ ๅด ้ 4 3.0595 2 3.6042 4.1178 -0.2187 3.3542 2.6667 3.2333
# 2: ้ไฟ้ 1995 ้ ไฟ ้ 3 1.3415 3 2.4619 4.7688 0.4081 4.3125 3.6500 3.6500
# 3: ๅผ ไผ 1985 ๅผ ไผ 2 1.1529 26 1.6611 3.8865 0.6859 4.2500 3.5000 3.4000
# 4: ๅผ ็ 1988 ๅผ ็ 2 1.1529 26 3.0547 5.8583 0.6025 3.9375 3.4000 3.5000
# 5: ๆฌง้ณไฟฎ 1968 ๆฌง้ณ ไฟฎ 3 3.1645 15 2.9816 3.5510 0.5047 3.0625 3.5000 3.3000
# 6: ๆฌง้ณ 2009 ๆฌง ้ณ 2 2.9694 15 2.0389 3.4574 0.5103 4.3750 4.1000 3.7000
# 7: ๆ็ๅ็บ 2000 ๆ ็ ๅ ็บ 4 2.8689 25 3.8743 4.8944 0.4619 3.1875 3.2000 3.1667
# 8: ๅผ ่บ่ฐ 1950 ๅผ ่บ ่ฐ 3 1.1529 26 3.8808 3.6611 0.3183 3.5938 3.5500 3.3500
# 9: ็็ 2005 ็ ็ 2 1.1257 23 5.1893 1.3110 -0.5325 2.1250 2.5000 2.2000
-
NLen: full-name length
-
2~4
-
A Chinese surname usually consists of one character (i.e., single surname, ๅๅง) and sometimes consists of two characters (i.e., compound surname, ๅคๅง).
-
A Chinese given name can be any single character or any combination of two characters (rarely of three characters, like the author's given name "Han-Wu-Shuang").
-
-
-
SNU: surname uniqueness
-
1~6
-
SNU = โlog10(Psurname + 10โ6)
- Psurname = the percentage of a surname held by people in the Han Chinese population.
- A small constant (10โ6) is added to adjust for skewness and limit the maximum of SNU to 6.00.
- SNU = 2 and 3 mean that 1/100 and 1/1000 of people had this surname, respectively.
- The diversity of surnames is limited in China: the top 25 most common Chinese surnames have covered about 60% (0.7 billion) of the Han Chinese population (1.2 billion).
-
-
SNI: surname initial (alphabetical order)
-
1~26
- Chinese names are usually sorted by surname initials. This index is the alphabetical order of Pinyin initial of a surname.
-
-
NU: name-character uniqueness (in naming practices)
-
1~6
-
NU = โlog10(Pcharacter + 10โ6)
- Pcharacter = the percentage of a character used in all single-character and multi-character given names in the Han Chinese population within a specific birth year (an approximate estimate using the nearest two birth cohorts with relative weights, which would be more precise than just using a single birth cohort).
- A small constant (10โ6) is added to adjust for skewness and limit the maximum of NU to 6.00.
- NU = 2 and 3 mean that 1/100 and 1/1000 of people used this character in given name (in their birth year), respectively.
- For data without birth-year information, the
compute_name_index()
function returns NU based on the average percentage across all six birth cohorts.
-
-
CCU: character-corpus uniqueness (in contemporary Chinese corpus)
-
1~6
-
CCU = โlog10(Pcharacter + 10โ6)
- Pcharacter = the percentage of a character in a contemporary Chinese corpus.
- CCU should be distinguished from NU because daily language usage is distinct from naming practices. Some characters rarely used in personal names may instead be frequently used in daily language (and vice versa).
-
-
NG: name gender (difference in proportions of a character used by male vs. female)
-
โ1~1
-
NG = (Nmale โ Nfemale) / (Nmale + Nfemale)
- NG ranges from โ1 (completely feminine; 100% used by female) through 0 (gender-neutral; half by female and half by male) to 1 (completely masculine; 100% used by male).
-
-
NV: name valence (positivity of character meaning)
- 1~5
- Subjective ratings from 16 Chinese raters (9 males and 7 females; interrater reliability ICC = 0.921) on the positivity of all 2,614 given-name characters according to the meaning of each character (1 = strongly negative, 3 = neutral, 5 = strongly positive).
-
NW: name warmth/morality
- 1~5
- Subjective ratings from 10 Chinese raters (5 males and 5 females; interrater reliability ICC = 0.774) on how a person whose name contains each of the 2,614 given-name characters is likely to have warmth-related traits (1 = strongly unlikely to have, 3 = medium likelihood, 5 = strongly likely to have).
-
NC: name competence/assertiveness
- 1~5
- Subjective ratings from 10 Chinese raters (5 males and 5 females; interrater reliability ICC = 0.712) on how a person whose name contains each of the 2,614 given-name characters is likely to have competence-related traits (1 = strongly unlikely to have, 3 = medium likelihood, 5 = strongly likely to have).
* Instruction for the rating task of NW and NC (adapted from Newman et al., 2018):
According to psychological research, when people form impressions of others, they usually evaluate them in two aspects: warmth and competence.
- "Warmth" (ๆธฉๆ) includes traits such as warm (็ญๆ ), friendly (ๅๅฅฝ), righteous (ๆญฃ็ด), honest (่ฏๅฎ), kind (ๅๅ), fair (ๅ ฌๅนณ), sincere (็่ฏ), reliable (ๅฏ้ ), and moral (ๆ้ๅพท).
- "Competence" (่ฝๅ) includes traits such as competent (่ฝๅนฒ), clever (่ชๆ), careful (็ปๅฟ), efficient (้ซๆ), creative (ๅๆฐ), ingenious (็ตๅทง), knowledgeable (ๅๅญฆ), persistent (ๅ้ง), and intelligent (ๆๆบๆ ง).
Imagine that you are about to meet a person whose given name contains each of the following characters. Please judge how likely he/she is to have traits related to "warmth" ("competence"). If you feel uncertain, please use your intuition and make your best guess.
A Note on Multi-Character Given Names
For a Chinese given name with multiple characters, name indices are averaged across characters. In other words, name indices are computed based on characters rather than character combinations. Here are main reasons.
- Computing name variables at the character level is more practical in research. Indeed, character combinations are countless, whereas single characters are a finite set and easy to handle. Moreover, for name indices other than NU, this is the only feasible approach, especially in a large sample. It is impractical to ask participants to rate millions or billions of character combinations.
- As evidenced by our research, NU computed by averaging across characters (objective NU) was positively correlated with people's perception of the uniqueness of their given names (subjective NU): r = 0.32, p < 0.001, N = 672.
- As evidenced by our research, among four measures of name uniqueness (two at the character level and two at the character-combination level), only name-character uniqueness (i.e., NU) was positively associated with cultural-level individualism.
- In linguistics, a name (word) is to English what a name character (single character) is to Chinese.