Bilingual drills

Learn a language with org-drill and manythings

I came across this bilingual sentence pairs corpus from manythings when looking for resources to learn Turkish. It’s derived from the Tatoeba Corpus, in which volunteers translate English sentences to their language of choice. Each line of a language1-language2 corpus file has a sentence in language1, its translation in language2, and some attribution information, all separated by tabs. The lines are also sorted by length, so the first line in the English-Turkish file is:

Run!    Kaç!    CC-BY 2.0 (France) Attribution: #906328 (papabear) & #2322529 (Gulo_Luscus)

And the last line:

Doubtless there exists in this world precisely the right woman for 
any given man to marry and vice versa; but when you consider 
that a human being has the opportunity of being acquainted with 
only a few hundred people, and out of the few hundred that there 
are but a dozen or less whom he knows intimately, and out of the 
dozen, one or two friends at most, it will easily be seen, when we 
remember the number of millions who inhabit this world, that 
probably, since the earth was created, the right man has never yet 
met the right woman.       

Kuşkusuz bu dünyada her erkeğin ve kadının evlenmek için huyu huyuna, 
suyu suyuna tamamen denk birisi mutlaka vardır; fakat bir insanın sadece 
birkaç yüz kişiyle tanışma fırsatı bulduğu, bu birkaç yüz kişi içinden belki 
bir düzinesini yakından tanıdığı, bu bir düzinenin de ancak birkaçıyla dost 
olduğu göz önüne alınır ve de dünyada milyonlarca insanın yaşadığı 
hatırda tutulursa kolayca görülür ki dünya yaratıldığından beri doğru 
erkek doğru kadınla muhtemelen daha hiç karşılaşmamıştır.    

CC-BY 2.0 (France) Attribution: #7697649 (RM) & #7730062 (soliloquist)

Anyone who can say that whopper is definitely an expert in the language. Anyway, here’s a small Python script to convert this file into a set of org-drill-able folders with flash-card files: 500 sentences in each file and 5 such files in each drill folder, sorted by the length of the sentence. (org-drill is a spaced repetition system like Anki for org-mode in Emacs. See this post on my spacemacs configuration on how to set it up.) The code also combines language1 sentences with the same meaning in language2 and vice versa into the same flash-card, so you get to see different ways of translating the same thing and different translations that have the same meaning.
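To make the grouping idea concrete before the full script, here is a minimal sketch on three sample lines. Only the "Run!" / "Kaç!" pair comes from the corpus excerpt above; the other two pairs and the attribution strings are placeholders I made up for illustration. It splits each line on tabs and then collects every English sentence that shares a Turkish translation with a given one.

```python
from collections import defaultdict

# Sample lines in the corpus layout: sentence, translation, attribution.
# The second and third pairs and all attribution strings are hypothetical.
sample_lines = [
    "Run!\tKaç!\tattribution-1",
    "Run!\tKoş!\tattribution-2",
    "Flee!\tKaç!\tattribution-3",
]

l1_to_l2 = defaultdict(list)
l2_to_l1 = defaultdict(list)
for line in sample_lines:
    l1, l2, _attribution = line.strip().split("\t")
    l1_to_l2[l1].append(l2)
    l2_to_l1[l2].append(l1)


def group_for(l1_sentence):
    """Return every language-1 sentence sharing a translation with l1_sentence."""
    group = set()
    for l2 in l1_to_l2[l1_sentence]:
        group.update(l2_to_l1[l2])
    return group


print(sorted(group_for("Run!")))   # ['Flee!', 'Run!']
print(sorted(group_for("Flee!")))  # ['Flee!', 'Run!']
```

Both sentences end up in the same group, so they would land on the same flash-card.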

from collections import defaultdict
from pathlib import Path

language_1 = "English"
language_2 = "Turkish"

# language_1_sentence : [list of equivalent language_2 sentences]
l1_to_l2 = defaultdict(list)
# language_2_sentence : [list of equivalent language_1 sentences]
l2_to_l1 = defaultdict(list)

# the English-Turkish file was called tur.txt
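with open("tur.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.strip().split('\t')
        # parts[0]: language_1 sentence, parts[1]: language_2 sentence, parts[2]: attribution
        l1_to_l2[parts[0]].append(parts[1])
        l2_to_l1[parts[1]].append(parts[0])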

# [language_1 sentence] : [list of language_1 sentences with the same translation]
reduced_l1_keys = defaultdict(set)

for key in l1_to_l2:
    for val in l1_to_l2[key]:
        reduced_l1_keys[key] = reduced_l1_keys[key].union(l2_to_l1[val])

# [list of [list of language_1 sentences]] sorted by the average length of the sentence (in language_1)
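# each group is sorted first so that equal sets always give the same tuple and deduplicate reliably
l1_keys = sorted(set(tuple(sorted(v)) for v in reduced_l1_keys.values()),
                 key=lambda x: sum(len(y) for y in x) / len(x))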

print("Number of sentences:", len(l1_keys))
num_per_drill_file = 500
print("Number of drill files:", len(l1_keys) // num_per_drill_file)
num_per_drill_folder = 5
print("Number of drill folders:", len(l1_keys) // num_per_drill_file // num_per_drill_folder)

folder_index = 0
for i, r in enumerate(range(0, len(l1_keys), num_per_drill_file)):
    # start a new drill folder every num_per_drill_folder files
    if i % num_per_drill_folder == 0:
        folder_index += 1
        folder = Path.cwd() / f"drill_{folder_index}"
        if not folder.exists():
            folder.mkdir()
    with open(folder / f"{language_2}_drill_{i}.org", "w", encoding="utf-8") as f:
        for j, l1_sentences in enumerate(l1_keys[r: r+num_per_drill_file]):
            f.write(f"* Question {j} :drill:\n")
            f.write("\t:PROPERTIES:\n\t:DRILL_CARD_TYPE: twosided\n\t:END:\n\n")
            f.write(f"** {language_1}\n\n")
            # one sentence per line (assumed layout)
            for l1_sentence in l1_sentences:
                f.write(f"{l1_sentence}\n")
            f.write("\n")
            # collect every translation of every sentence in this group
            l2_sentences = set()
            for l1_sentence in l1_sentences:
                l2_sentences = l2_sentences.union(l1_to_l2[l1_sentence])
            f.write(f"** {language_2}\n\n")
            for l2_sentence in l2_sentences:
                f.write(f"{l2_sentence}\n")
            f.write("\n")
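For reference, here is a rough sketch of what a single generated card might look like as org text. The helper name `render_card` is hypothetical, and the layout (one sentence per line under each sub-heading) is my assumption; the headline, the `:drill:` tag and the `DRILL_CARD_TYPE` property are the ones the script writes.

```python
def render_card(index, l1_name, l1_sentences, l2_name, l2_sentences):
    """Return the org text for one two-sided drill card (assumed layout)."""
    lines = [
        f"* Question {index} :drill:",
        "\t:PROPERTIES:",
        "\t:DRILL_CARD_TYPE: twosided",
        "\t:END:",
        "",
        f"** {l1_name}",
        "",
        *l1_sentences,   # one language-1 sentence per line
        "",
        f"** {l2_name}",
        "",
        *l2_sentences,   # one language-2 sentence per line
        "",
    ]
    return "\n".join(lines) + "\n"


card = render_card(0, "English", ["Run!"], "Turkish", ["Kaç!"])
print(card)
```

With a two-sided card, org-drill can show either sub-heading as the prompt and the other as the answer.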

To set up org-drill, set org-drill-scope to “directory” (with ALT-x customize-variable), open a file in one of the drill folders and run org-drill (ALT-x org-drill). It took a minute or two to load all 2500 sentences in a folder, but this only happens once.

Turns out the Turkish corpus is pretty popular:

Number of sentences: 107421
Number of drill files: 214
Number of drill folders: 42
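A small note on these numbers: the script prints them using floor division, so they count only full files and full folders. The write loop steps through the list with `range`, so it should also produce one final, partially filled file. A quick check of the arithmetic, using the counts above:

```python
import math

num_sentences = 107421
num_per_drill_file = 500
num_per_drill_folder = 5

# What the script prints: floor division counts only completely full units.
full_files = num_sentences // num_per_drill_file
full_folders = full_files // num_per_drill_folder

# What the write loop produces: the last partial file is written too.
files_written = math.ceil(num_sentences / num_per_drill_file)
folders_written = math.ceil(files_written / num_per_drill_folder)
leftover = num_sentences % num_per_drill_file

print(full_files, full_folders)                  # 214 42
print(files_written, folders_written, leftover)  # 215 43 421
```

So the last file holds the 421 longest sentence groups, and it sits in a 43rd folder.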

It took me a while to get used to the scoring system. You rate something as 0 when you get the answer wrong and the answer actually looks unfamiliar when you see it. If you get it wrong but then think aha! when you see the answer, you rate it as 1. If you get some words correct but not all of them, it’s a 2. If you get it right but had to think for a while, it’s a 3. I’m using 4 if I have to actually translate it in my head via English, even if I’m pretty fast at it, and 5 if I understand the sentence without breaking it up into parts. I still had to learn grammar separately first though. I think these drills are only useful to increase your vocabulary after you have some idea of verb conjugations and tenses (I used this website on Turkish grammar for that).

I set up a recurring TODO that links to the current drill folder and repeats every day, so it shows up in my agenda and forces me to click on it to get it to go away, and another recurring TODO that shows up once a month and reminds me to consider changing the drill folder to the next one.
