Hello, I’m midz (Twitter: @thetenthart), an AI engineer/researcher working in the video game industry in Japan. I’m also a fan of visual novels. As VN fans know, there is a database of information and statistics related to visual novels called VNDB (the Visual Novel Database).
This article (translated from an article written on Qiita in 2021) presents a statistical analysis of VNDB data. The analysis asks whether the stereotypes many fans hold about character designs and personalities (for instance, that tsunderes tend to be blond) are statistically true. It also compares the data for visual novel characters with data for real Japanese women. I used Python to analyze the data.
Purpose of the VNDB Analysis.
This analysis has two purposes:
- To reveal the connection between character designs and personality, and to examine the biases of visual novel players in terms of character preference.
- To provide statistical data for use as a reference in creative endeavors.
VNDB Dataset for statistical analysis.
VNDB contains well-organized information regarding visual novel characters. The database includes details such as birthdays, heights, weights, BWH measurements, and personalities. The database dump is located here. The tables are provided in .tsv (tab-separated values) format, packed into a tar archive that is compressed with zstd (.zst). To extract it, run the following commands (on macOS or Linux):
zstd --decompress vndb-db-2021-01-11.tar.zst
tar xvf vndb-db-2021-01-11.tar
Extraction produces many files. Each table is split into two files, one holding the column names and one holding the data, such as chars.header and chars. The files have no extension, but both are tab-separated, so you can load each pair (<filename>.header and <filename>) in pandas as .tsv files. The tables relevant to this analysis are chars, traits, traits_parents, and chars_traits.
- chars contains information about characters such as names, heights, genders, birthdays, and measurements.
- traits contains additional information such as hair styles and personalities.
- chars_traits provides information linking trait IDs to character IDs.
- traits_parents describes parent-child relationships between traits (e.g., hair traits corresponding to hair length traits).
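The parent-child table can be walked to collect every trait that sits under a given group (for example, everything under a hair-related trait). The following is a minimal sketch, assuming traits_parents has been loaded into a DataFrame with the columns id (child trait) and parent (parent trait). Those column names and the toy IDs are assumptions for illustration, not values taken from the dump.

```python
import pandas as pd


def descendant_traits(parents_df, root_id):
    """Return the set of trait IDs found below root_id in the hierarchy."""
    # Map each parent ID to the list of its direct children.
    children = parents_df.groupby("parent")["id"].apply(list).to_dict()
    found, stack = set(), [root_id]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            if child not in found:  # a trait may be reachable via several parents
                found.add(child)
                stack.append(child)
    return found


# Toy hierarchy with made-up IDs: 1 (Hair) -> 2 (Hair length) -> 3 (Long), 4 (Short)
toy_parents = pd.DataFrame({"id": [2, 3, 4], "parent": [1, 2, 2]})
hair_trait_ids = descendant_traits(toy_parents, 1)
```

The resulting ID set can then be used to keep only the rows of chars_traits that refer to hair traits.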
Although some values are missing, the dataset covers approximately 95,000 characters. As the majority of characters in VNs are women, this analysis focuses on female characters. Use the following commands to read a .tsv file:
import pandas
chara_df = pandas.read_csv("chars", sep="\t")
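Because the column names live in a separate .header file, it can help to read both files together. The sketch below is one way to do that. It assumes the header file is a single tab-separated line and that missing values are written as \N (the PostgreSQL text export convention); both are assumptions about the dump layout that you should check against your copy. The helper name load_table is hypothetical.

```python
import csv

import pandas as pd


def load_table(path):
    """Read a dump table whose column names are stored in '<path>.header'."""
    with open(path + ".header", encoding="utf-8") as f:
        columns = f.readline().rstrip("\n").split("\t")
    return pd.read_csv(
        path,
        sep="\t",
        names=columns,           # apply the names from the header file
        header=None,             # the data file itself has no header row
        na_values=["\\N"],       # assumed marker for missing values
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    )
```

With this helper, chara_df = load_table("chars") would give a DataFrame whose columns can be addressed by name, as in the later examples.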
We removed outliers, such as heights above 200 cm, to keep the distributions readable.
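The filtering step might look like the sketch below. It assumes a gender column that uses the code "f" for female characters and a height column in centimeters where 0 or a missing value means "unknown". These encodings and the function name are assumptions, and only the 200 cm cutoff comes from the text above.

```python
import pandas as pd


def select_female_characters(df, max_height_cm=200):
    """Keep female characters with a known height of at most max_height_cm."""
    is_female = df["gender"] == "f"
    # between() is inclusive; missing heights evaluate to False and are dropped.
    has_plausible_height = df["height"].between(1, max_height_cm)
    return df[is_female & has_plausible_height]


toy = pd.DataFrame({
    "gender": ["f", "m", "f", "f"],
    "height": [160, 175, 250, 0],
})
female_df = select_female_characters(toy)
```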
The dataset included 5,988 characters for which both height and weight were specified. The average height was 157.82 cm and the average weight was 47.12 kg. To compare VN characters with their real-world counterparts, we referred to the paper Body Measurements and Body Shape Characteristics of Young Adult Japanese Women (referred to as “Beppu 97” below), although its data is somewhat dated. According to this paper, the average height and weight of Japanese women aged 18 to 22 were 158.46 cm and 51.35 kg, respectively. The heights are quite similar, but VN characters weigh about 4 kg less on average. In terms of BMI (Body Mass Index), the average for Japanese women is 20.4, while that of the VNDB dataset is 18.9. A BMI below 18.5 is generally considered underweight, so the average VN character sits just above that threshold.
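For reference, BMI is weight in kilograms divided by the square of height in meters. The short sketch below applies that formula to the mean values quoted above. Note that the BMI of the mean height and weight is not necessarily identical to the mean of per-person BMIs, so small differences from a published figure are expected.

```python
def bmi(weight_kg, height_cm):
    """Body Mass Index: weight (kg) divided by height (m) squared."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


vndb_bmi = bmi(47.12, 157.82)   # mean values from the VNDB dataset, about 18.9
beppu_bmi = bmi(51.35, 158.46)  # mean values reported in Beppu 97
```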
We then plotted the distribution (histogram) of weight with seaborn’s distplot. To do this, use the following commands:
import seaborn as sns
sns.distplot(chara_df['weight'], bins=20, color='#123456', kde=False, rug=False)
The result is as follows:
The result is a graph close to a normal distribution. The distribution of height is as follows:
The distribution of height does not seem to follow a normal distribution. It has three distinct peaks: in the low 150s, the high 150s, and the mid-160s (in cm). This may reflect distinct user preferences for short, medium, and tall characters.
Let’s compare the average bust, waist, and hip measurements of young women according to the Beppu 97 paper with the VNDB dataset:
|Mean Bust||Mean Waist||Mean Hip|
The bust measurement is almost the same, but the waist and hip measurements are significantly lower in the VNDB data.
We then generated a scatter plot of height (the height column) against bust size (the s_bust column), using seaborn’s jointplot for output.
sns.jointplot(x='height', y='s_bust', data=chara_df)
The result is as follows:
This is an intriguing graph. At first, bust size and height appear to rise together, but at the upper end of the height range only bust size continues to grow. This suggests a demand for tall characters with larger busts. In this graph, the Pearson correlation coefficient is 0.610, indicating a strong correlation between height and bust size. According to Beppu 97, however, the correlation coefficient between bust size and height in the real world is only 0.194, which is weak. Assuming visual novel characters are created based on user demand, this suggests that players expect shorter characters to have smaller busts and taller characters to have larger ones. This outcome matches our stereotype.
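A Pearson coefficient like the one quoted above can be computed directly in pandas. This is a minimal sketch on made-up numbers; it assumes the height and s_bust columns used in the plot, and the function name is hypothetical.

```python
import pandas as pd


def height_bust_correlation(df):
    """Pearson correlation between height and bust, ignoring incomplete rows."""
    pair = df[["height", "s_bust"]].dropna()
    return pair["height"].corr(pair["s_bust"], method="pearson")


# Toy data with a perfectly linear relationship, so r is 1 (up to rounding).
toy = pd.DataFrame({
    "height": [150, 155, 160, 165, 170],
    "s_bust": [78, 80, 82, 84, 86],
})
r = height_bust_correlation(toy)
```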
Hairstyles and Hair Colors
There are various hairstyles, but this analysis focuses on four of the most common ones: Long, Short, TwinTail, and PonyTail. Note that in the VNDB data, traits such as TwinTail or PonyTail are not mutually exclusive with Long or Short, so some characters have both the Long and TwinTail traits. For this analysis, we gave TwinTail and PonyTail priority when a character had more than one of these traits. A “Shoulder-length” hairstyle also exists in the dataset, but we deemed it too similar to Short to treat separately in our analysis.
Information about hairstyles, hair colors (discussed later), and personalities is stored separately in traits and chars_traits. These tables can be combined using the pandas.merge function.
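The priority rule can be sketched as follows. The relative order of TwinTail and PonyTail is not stated in the article, so the order in PRIORITY is an assumption, as are the column names char_id and trait in the toy data.

```python
import pandas as pd

# Styles listed from highest to lowest priority (TwinTail before PonyTail is assumed).
PRIORITY = ["TwinTail", "PonyTail", "Long", "Short"]


def pick_hairstyle(traits):
    """Return the highest-priority hairstyle in a collection of trait names."""
    for style in PRIORITY:
        if style in traits:
            return style
    return None  # none of the four hairstyles is present


toy = pd.DataFrame({
    "char_id": [1, 1, 2, 3, 3],
    "trait": ["Long", "TwinTail", "Short", "PonyTail", "Long"],
})
# One hairstyle per character, resolving duplicates by priority.
hairstyle = toy.groupby("char_id")["trait"].apply(lambda s: pick_hairstyle(set(s)))
```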
pandas.merge(chara_df, chara_traits_df, on="char_id")
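Attaching trait names to characters takes two merges: characters to the link table, then the link table to the trait table. The sketch below uses toy frames with the assumed key columns char_id and trait_id; the real dump may use different column names that need renaming first. Because both the character table and the trait table have a name column, pandas adds the suffixes _x and _y, so the trait name ends up in name_y.

```python
import pandas as pd

chara_df = pd.DataFrame({
    "char_id": [1, 2],
    "name": ["Alice", "Beth"],
    "height": [150, 165],
})
chara_traits_df = pd.DataFrame({"char_id": [1, 1, 2], "trait_id": [10, 20, 10]})
traits_df = pd.DataFrame({"trait_id": [10, 20], "name": ["Long", "Blond"]})

# Step 1: one row per (character, trait) pair.
merged = pd.merge(chara_df, chara_traits_df, on="char_id")
# Step 2: attach the trait name; the two "name" columns become name_x and name_y.
merged = pd.merge(merged, traits_df, on="trait_id")
```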
Below is a table of the hairstyle statistics we extracted:
|Number of Data||Mean Height||Mean Bust Measurement|
In terms of height, the order is PonyTail > Long > Short > TwinTail. This aligns with the impression that TwinTails are not typically associated with tall characters.
We calculated the mean values using pandas’ groupby and mean functions, grouping on “name_y”. To aggregate by a different attribute, replace “name_y” with the name of the column that holds it (in this case, hairstyle).
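A minimal sketch of this aggregation on made-up numbers is shown below. It assumes a merged table in which name_y holds the hairstyle name and height and s_bust hold the measurements.

```python
import pandas as pd

merged = pd.DataFrame({
    "name_y": ["Long", "Long", "Short", "TwinTail"],
    "height": [160, 164, 155, 150],
    "s_bust": [84, 86, 80, 78],
})

# Mean height and bust per hairstyle, plus the number of characters in each group.
summary = merged.groupby("name_y")[["height", "s_bust"]].mean()
summary["count"] = merged.groupby("name_y").size()
```

Sorting the result with summary.sort_values("height", ascending=False) gives the kind of height ordering discussed above.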
Hair color analysis
This analysis only looked at hair colors with at least 200 characters. Below is a table of our findings:
|Hair Color||Data Samples||Mean Height||Mean Bust Measurement|
The data (arranged by height) shows that characters with red, black, or purple hair tend to be taller and have larger busts. On the other hand, characters with white hair tend to have smaller busts. Characters with pink hair tend to be shorter.
Here’s a chart summarizing cup sizes as percentages. Since each hair color is normalized, the A to G cup percentages for each hair color sum to 1. We used pandas’ crosstab and seaborn’s heatmap for this, both of which are Python analysis tools.
In terms of A-cup percentages, the order is white > pink > gray > blond. It appears that characters with lighter hair colors tend to have smaller busts. (However, it’s worth noting that characters with gray or blond hair also have a high proportion of F-cups.) This is new information, as it challenges some preconceived notions of what the average visual novel character looks like based on their hair color.
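The row-normalized table behind such a chart can be built with pandas.crosstab and its normalize="index" option. The sketch below runs on made-up data; the column names hair_color and cup_size and both function names are assumptions for illustration. The plotting helper imports seaborn only when called, so the table can be inspected without a display.

```python
import pandas as pd


def cup_share_by_hair_color(df):
    """Share of each cup size within each hair color (every row sums to 1)."""
    return pd.crosstab(df["hair_color"], df["cup_size"], normalize="index")


def plot_heatmap(table):
    """Draw the normalized table as an annotated heatmap."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.heatmap(table, annot=True, fmt=".2f", cmap="Blues")
    plt.show()


toy = pd.DataFrame({
    "hair_color": ["White", "White", "Pink", "Pink", "Pink", "Black"],
    "cup_size": ["A", "B", "A", "A", "C", "D"],
})
table = cup_share_by_hair_color(toy)
```

The same pattern applies to any pair of categorical columns, such as personality against hair color.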
In our analysis of personalities using the VNDB data, we obtained the following statistics:
|Personality||Number of Data||Mean Height||Mean Bust Measurement|
Characters with gentle, boyish, serious, or arrogant personalities tend to be taller, while characters with clumsy, energetic, pure, or timid personalities tend to be shorter. These findings seem to align with common stereotypes.
We also created a heatmap to visualize the relationship between personalities and hair colors.
As before, we normalized by row, so that the values in each row sum to 1. Looking at the heatmap, blond hair is the most common hair color for characters with arrogant personalities (36%). Blond hair is also the most common for characters with tsundere personalities (28%). This aligns with the stereotype that tsundere characters often have blond hair.
Next, we created a heatmap to visualize the relationship between personalities and hairstyles.
Surprisingly, 52% of tsundere characters have TwinTails. This confirms the stereotype that tsundere characters tend to have TwinTails.
Conclusion of VNDB Analysis
This analysis summarizes character data statistics in the VNDB (visual novel database) from the perspectives of hairstyles, hair colors, personalities, height, and bust measurements. I will update this article if more interesting insights are discovered.
If you have any questions or ideas on how to take this research further, please leave a comment below.