Linguistic assistance beyond borders

Speech/Bilingual Corpus

Machine translation, in practical use since around the late 1970s, initially relied on a method called rule-based machine translation (a typical example being SYSTRAN), in which translation is driven by human-created dictionaries and grammar rules. However, dictionaries and grammars were available for only a limited number of language pairs, and it took an enormous amount of time just to gather the necessary data.

Later, in the 1980s and 1990s, rule-based machine translation began to give way to statistical machine translation. Statistical machine translation chooses a translation based on probabilities estimated from the frequency and distribution of words in large text collections. It combines two types of models: a translation model that learns from bilingual data, and a language model that scores word sequences in the output language. This allows translation between many different languages as long as a bilingual corpus and output-language data are available for training. However, accuracy remained a problem: even when translation between Western language pairs such as English-French worked well, high accuracy could not be ensured between languages with different word order, such as Japanese-English.
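
The interplay of the two models can be illustrated with a minimal Python sketch. It is a toy example, not a description of any real system: the romanized Japanese words, the candidate sentences, every probability value, and the names `TRANSLATION_PROBS`, `BIGRAM_PROBS`, and `best_translation` are assumptions made up for illustration. The translation model ignores word order, so the language model is what separates "cat sleeps" from "sleeps cat".

```python
import math

# Hypothetical word-translation probabilities P(source word | target word),
# standing in for values a translation model would learn from bilingual data.
TRANSLATION_PROBS = {
    ("neko", "cat"): 0.8,
    ("neko", "dog"): 0.05,
    ("nemuru", "sleeps"): 0.7,
}

# Hypothetical bigram probabilities P(word | previous word), standing in for
# a language model trained on output-language text. "<s>" marks sentence start.
BIGRAM_PROBS = {
    ("<s>", "cat"): 0.4,
    ("cat", "sleeps"): 0.5,
    ("<s>", "sleeps"): 0.05,
    ("sleeps", "cat"): 0.05,
    ("<s>", "dog"): 0.4,
    ("dog", "sleeps"): 0.3,
}

FLOOR = 1e-6  # small probability for unseen pairs, so log() is always defined


def translation_log_prob(source_words, target_words):
    """Order-insensitive score: each source word is explained by its best target word."""
    total = 0.0
    for s in source_words:
        best = max(TRANSLATION_PROBS.get((s, t), FLOOR) for t in target_words)
        total += math.log(best)
    return total


def language_log_prob(target_words):
    """Bigram score of the output sentence; this is where word order matters."""
    total = 0.0
    prev = "<s>"
    for w in target_words:
        total += math.log(BIGRAM_PROBS.get((prev, w), FLOOR))
        prev = w
    return total


def best_translation(source_words, candidates):
    """Pick the candidate with the highest combined log-probability."""
    return max(
        candidates,
        key=lambda c: translation_log_prob(source_words, c) + language_log_prob(c),
    )


source = ["neko", "nemuru"]
candidates = [["cat", "sleeps"], ["sleeps", "cat"], ["dog", "sleeps"]]
best = best_translation(source, candidates)
print(" ".join(best))  # prints "cat sleeps"
```

Real statistical systems search over a very large space of candidates and use far richer models, but the scoring principle of combining a bilingual model with an output-language model is the same.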

In 2016, however, Google switched its machine translation method from statistical machine translation to neural machine translation, a new approach based on deep learning. The resulting translations surprised users by being as fluent and readable as human translation. By having a multi-layer neural network learn from a large amount of text data, highly accurate translations could be achieved. From this point on, so-called AI machine translation came into wide use. (Typical examples: Google Translate, DeepL, Mirai Translation, Rosetta, etc.)

While statistical machine translation also used corpora as training data, deep learning requires a much larger amount of training data in order to learn a wide variety of characteristics. Research and development is under way to enable neural networks to learn effectively from only a small amount of training data. At the current stage of development, however, a considerable amount of training data is still required to improve the accuracy of machine translation in various fields.

Currently, major global companies such as Google and Amazon are developing AI with ample budgets, using their own platforms, AI applications, and vast amounts of training data. Using big data that mixes dirty and clean data for training can sometimes lead to mistranslations and misinterpretations, but it has the advantage that development can proceed dynamically and rapidly. Japan, for its part, has a history of manufacturers competing in machine translation research since the 1990s. For the time being, highly accurate corpora will be essential to provide higher-quality Japanese translation functionality for AI and AI-equipped products.

Since 2020, Franchir has been involved in creating bilingual data for the "Translation Bank" of the National Institute of Information and Communications Technology (NICT), evaluating translation results, and recording foreign-language voice samples for speech synthesis. We have also participated in projects to record Japanese voice samples for overseas companies. We hope to continue providing high-quality corpora for use in our clients' research and AI applications.

Experience in Speech and Bilingual Corpus Creation

  • Preparation of data for Translation Bank project, 2020 (creation of bilingual data from 263 books)
  • Creation of Japanese-English/Chinese/Korean bilingual data, 2020 (creation of bilingual data for 34 books)
  • Evaluation of machine translation results for Asian languages, 2020 (English/Hindi/Bengali)
  • Recording of Russian speech corpus for speech synthesis (unit price contract) (10,500 utterances/speaker, approx. 105 hours)
  • Recording of Japanese speech utterances, 2020 (requested by an overseas client) (20 utterances per person; 61 female and 54 male speakers)
  • Comparison of translation accuracy of automatic translation engines for medical conversations (Japanese, English, Chinese, Korean, Spanish, French, Thai, Portuguese, Tagalog, etc.)
  • Creation of a simultaneous interpretation corpus, 2021 (English, Chinese, Korean, Vietnamese)
  • Collection and multilingual translation of business terms, 2021 (unit price contract) (14 languages: English, Chinese, Korean, Thai, Vietnamese, Indonesian, Burmese, Spanish, French, Brazilian Portuguese, Filipino, Nepali, Khmer, Mongolian)

Basic Fees for Speech and Bilingual Corpus Creation

Bilingual corpus

Fees per page, excluding tax:

  • Bilingual corpus (text data): Japanese-English 500 yen/page; Japanese-foreign languages other than English 500 yen/page; English-foreign languages 500 yen/page
  • Bilingual corpus (PDF, image data, paper, etc. that require conversion into text data): Japanese-English 600 yen/page; Japanese-foreign languages other than English 600 yen/page; English-foreign languages 600 yen/page

Delivery format is Excel (.xlsx format)
One page is counted as about 800 Japanese characters or 250 English words.
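
As a rough illustration of how the page standard and the per-page fees combine, here is a small Python sketch. It uses only the reference figures listed above (800 Japanese characters or 250 English words per page; 500 or 600 yen per page, excluding tax). The function names and the rule that a partial page is rounded up to a full page are assumptions made for this example, not stated policy, and an actual quote may differ.

```python
import math

# Reference figures from the basic fees above (excluding tax).
CHARS_PER_PAGE_JA = 800
WORDS_PER_PAGE_EN = 250
YEN_PER_PAGE = {
    "text": 500,              # source already available as text data
    "needs_conversion": 600,  # PDF, image data, paper, etc.
}


def estimate_pages(japanese_chars=0, english_words=0):
    """Convert a character or word count into pages.

    Assumption (hypothetical): a partial page is rounded up to a full page.
    """
    if japanese_chars:
        return math.ceil(japanese_chars / CHARS_PER_PAGE_JA)
    return math.ceil(english_words / WORDS_PER_PAGE_EN)


def estimate_fee(pages, source_type="text"):
    """Reference fee in yen (excluding tax) for a given number of pages."""
    return pages * YEN_PER_PAGE[source_type]


pages = estimate_pages(japanese_chars=100_000)
print(pages)                                    # 125
print(estimate_fee(pages))                      # 62500
print(estimate_fee(pages, "needs_conversion"))  # 75000
```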

Review of machine translation results

Fees per sentence, excluding tax:

  • Bilingual corpus (text data): Japanese-English 300 yen/sentence; Japanese-foreign languages other than English 300 yen/sentence; English-foreign languages 300 yen/sentence

Collection of voice sample data for speech synthesis

Fees per sentence, excluding tax:

  • Collection of voice samples for speech synthesis: Japanese 500 yen/sentence; English 500 yen/sentence; other languages 500 yen/sentence

Recorded audio delivered as WAV files


* Please inform us in advance of the purpose of use, requested delivery format, and other details.
* The above fees are for reference only. Price estimates vary depending on volume, specifications, and language.
Franchir has obtained the Privacy Mark to protect any personal information.