In Chinese, words can be composed of multiple characters. This project
aims to visualize connections between commonly used Chinese characters:
which words are they usually part of, or are they more commonly used
on their own? How do they combine with other characters to create new
meanings?
Background
Radicals are components of a character that can hint at its pronunciation, its meaning, or both.
For example, the heart radical, which can appear as 心 or 忄, appears in both 念 (meaning "to think, recall") and 忘 (meaning "to forget"). Both characters relate to internal actions or feelings, and the heart radical generally indicates a word related to thoughts or emotions.
Sometimes, a radical by itself is already a complete character.
Pinyin is a system used to specify how to pronounce characters. It uses the Latin alphabet together with
tones, which further specify how a character should sound.
There are four basic tones, labeled 1 through 4, indicating whether a word's vowel sound is flat, rising, falling-rising, or falling. There is also a fifth tone that is
neutral or toneless: these pronunciations have no clear tone. For example, the pinyin "li4" indicates a pronunciation like "lee" but with a downwards, falling tone.
A character can have multiple pronunciations, often corresponding to different definitions. The pronunciation is usually decided by the context.
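The tone-number notation can be split mechanically into a syllable and a tone. The following is a minimal sketch, not code from this project: `split_pinyin` and `TONE_NAMES` are hypothetical names, and treating a missing digit as tone 5 (neutral) is an assumption based on a common convention.

```python
import re

# Tone numbers as described above; 5 is assumed here to mark the neutral tone.
TONE_NAMES = {
    1: "flat",
    2: "rising",
    3: "falling-rising",
    4: "falling",
    5: "neutral",
}


def split_pinyin(syllable):
    """Split numbered pinyin such as 'li4' into ('li', 4).

    A syllable without a trailing digit is treated as neutral (tone 5).
    """
    match = re.fullmatch(r"([a-zA-Zü:]+)([1-5])?", syllable)
    if match is None:
        raise ValueError(f"not a numbered pinyin syllable: {syllable!r}")
    letters, tone = match.groups()
    return letters.lower(), (int(tone) if tone else 5)


letters, tone = split_pinyin("li4")
print(letters, tone, TONE_NAMES[tone])
```

Keeping the tone as a separate integer makes it easy to group or color characters by tone later on.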
Sources
This project uses the SUBTLEX-CH dataset, which lists commonly used words in movie subtitles. The
frequency of words in subtitles is believed to be a good reflection of
the frequency of their usage in general. The dataset contains nearly 100,000 words, but only the top ~500 are included in this visualization.
These 500 words each occurred over 7000 times in the movie subtitles analyzed, and together they involve 400 unique characters.
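The selection step described above (keep the most frequent words, then collect the characters they contain) could be sketched as follows. This is an illustration under assumptions, not the project's actual code: the function names are hypothetical, and the toy counts are invented for the example rather than taken from SUBTLEX-CH.

```python
from collections import defaultdict


def top_words(frequencies, n=500):
    """Return the n most frequent words from a {word: count} mapping."""
    ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _count in ranked[:n]]


def unique_characters(words):
    """Collect every distinct character appearing in the given words."""
    return {char for word in words for char in word}


def words_by_character(words):
    """Map each character to the list of words that contain it."""
    index = defaultdict(list)
    for word in words:
        for char in set(word):
            index[char].append(word)
    return dict(index)


# Invented toy counts, for illustration only.
toy_counts = {"我": 50, "你": 40, "我们": 30, "什么": 20, "你们": 10}

selected = top_words(toy_counts, n=3)
print(selected)
print(unique_characters(selected))
print(words_by_character(selected))
```

The character-to-words mapping is one possible way to represent the connections that the visualization draws between characters.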
Information on radicals, pinyin, and definitions of words and characters is taken from the dictionary by
Make Me a Hanzi and from the
CC-CEDICT
Chinese to English dictionary. The CC-CEDICT dictionary was read with the help of
this parser.
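For readers unfamiliar with the dictionary file, each CC-CEDICT entry is a single line of the form `Traditional Simplified [pin1 yin1] /definition 1/definition 2/`. The sketch below shows one way to read such a line; it is not the parser used by this project, `parse_cedict_line` is a hypothetical name, and the sample line only illustrates the format rather than quoting the dictionary verbatim.

```python
import re

# Traditional, simplified, bracketed pinyin, then slash-delimited definitions.
ENTRY = re.compile(r"(\S+) (\S+) \[([^\]]*)\] /(.*)/")


def parse_cedict_line(line):
    """Parse one CC-CEDICT-style line; return None for comments or bad lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = ENTRY.fullmatch(line)
    if match is None:
        return None
    traditional, simplified, pinyin, definitions = match.groups()
    return {
        "traditional": traditional,
        "simplified": simplified,
        "pinyin": pinyin.split(),
        "definitions": definitions.split("/"),
    }


# Illustrative line in the CC-CEDICT format (not copied from the dictionary).
sample = "忘 忘 [wang4] /to forget/to overlook/"
print(parse_cedict_line(sample))
```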