DoudouLinux

The computer they prefer!

Translating DansGuardian

November 2010 — last update January 2011

DansGuardian is the DoudouLinux web content filter, which prevents children from visiting “naughty” sites or pages. DansGuardian has two components: URL blacklisting and real-time content analysis. Only the second one needs translating, because it relies on lists of words or expressions that are known to be “naughty” or, on the contrary, safe. Each word or expression carries a score that increases or decreases the total score of a page. The page is declared “naughty” as soon as its score reaches 50, as set in the DansGuardian configuration. This is the recommended triggering level for small children.
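To make the scoring idea concrete, here is a minimal Python sketch. It is not DansGuardian's actual code: the phrases and weights are invented, and we assume for simplicity that each phrase is counted once per page.

```python
# Hypothetical phrases and weights, for illustration only; the real ones
# live in the phrase list files described in the rest of this article.
WEIGHTS = {"foo bar": 30, "baz": 25, "school": -20}

THRESHOLD = 50  # triggering level quoted in this article


def page_score(text, weights=WEIGHTS):
    """Add up the weight of every listed phrase found in the text.

    Assumption of this sketch: a phrase counts once, however often it occurs.
    """
    lowered = text.lower()
    return sum(weight for phrase, weight in weights.items() if phrase in lowered)


def is_naughty(text, weights=WEIGHTS, threshold=THRESHOLD):
    """A page is rejected as soon as its score reaches the threshold."""
    return page_score(text, weights) >= threshold
```

With these invented weights, a page containing “foo bar” and “baz” scores 55 and is rejected, while the same page with the safe word “school” added drops to 35 and passes.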

NB: the DansGuardian error page has to be translated too, but its contents have been moved to PO files, together with the predefined DansGuardian messages that used to be stored in plain text files. Please have a look at Transifex for the PO files.

Files of words and expressions

Files of words and expressions are located in the directory lang/trunk/apps/system/dansguardian/lists/phraselists/ of the lang SVN tree. They are copied at build time into /etc/dansguardian/lists/phraselists/. Files are placed in sub-directories that represent the list category:

$ ls /etc/dansguardian/lists/phraselists/
badwords        forums          gore          malware    personals    secretsocieties  violence
chat            gambling        idtheft       music      pornography  sport            warezhacking
conspiracy      games           illegaldrugs  news       proxies      translation      weapons
domainsforsale  goodphrases     intolerance   nudism     rta          travel           webmail
drugadvocacy    googlesearches  legaldrugs    peer2peer  safelabel    upstreamfilter

Files in sub-directories are named banned_lang or weighted_lang where lang is your language name, except for English whose files are simply banned and weighted:

$ ls /etc/dansguardian/lists/phraselists/pornography/
banned             weighted_danish  weighted_italian    weighted_portuguese
banned_portuguese  weighted_dutch   weighted_japanese   weighted_russian
weighted           weighted_french  weighted_malay      weighted_spanish
weighted_chinese   weighted_german  weighted_norwegian

Banned files contain phrases that automatically trigger page rejection, i.e. the remaining contents are not evaluated. Weighted files contain phrases that are associated with a naughtiness value. Their content is quite simple:

#listcategory: "Pornography (Russian)"

<проститутки><50> #prostitutes
< фото><5> #photo
<бюст><40> #bust
< анал ><40> #anal
<анальный><40> #anal

Each line contains a word or expression followed by its weight. Comments start with the number sign “#”. Beware that spaces before and/or after a word specify whether the phrase matches part of a longer word, its beginning, its end or the whole word. This is very important for words like anal, which is also the beginning of analysis, analogue, etc. The rules are the following:

Space matching of phrase lists
spaces              example    effect
no space            <abcd>     matches any word containing abcd
space on the left   < abcd>    matches any word starting with abcd
space on the right  <abcd >    matches any word ending with abcd
two spaces          < abcd >   matches the exact word abcd
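The following Python sketch shows why the spaces matter. It is only an illustration, not DansGuardian's implementation: it reads the simple <phrase><weight> form shown above and applies a plain substring test. The function names and the padding of the text with spaces are assumptions of the sketch.

```python
import re

# Simple entry form only: <phrase><weight>, optionally followed by a comment.
ENTRY = re.compile(r"^<([^<>]*)><(-?\d+)>")


def parse_line(line):
    """Return (phrase, weight), or None for comments, blanks and other forms."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    # Spaces inside the angle brackets are kept: they are part of the phrase.
    return match.group(1), int(match.group(2))


def matches(phrase, text):
    """Plain substring test in which the spaces of the phrase are significant."""
    # Sketch assumption: pad the text so that the first and last words
    # also have a space next to them.
    return phrase in f" {text} "
```

For example, matches("anal", "data analysis") and matches(" anal", "data analysis") are both true, whereas matches("anal ", "data analysis") and matches(" anal ", "data analysis") are false.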

Please note that non-English weighted files are much smaller than the English one (which has no language name in its file name). Their sizes are the following:

$ ls -Ssh1 --hide 'banned*' /etc/dansguardian/lists/phraselists/pornography/
total 152K
 80K weighted
 16K weighted_portuguese
 12K weighted_italian
8,0K weighted_japanese
4,0K weighted_french
4,0K weighted_spanish
4,0K weighted_danish
4,0K weighted_russian
4,0K weighted_german
4,0K weighted_malay
4,0K weighted_dutch
4,0K weighted_chinese
4,0K weighted_norwegian

As for banned files, only the Portuguese one has any real content. Overall, this means that the translation work for DansGuardian is huge… However, you do not have to translate the whole English file: a simpler but exhaustive list of words or expressions in your language can replace it.

The encoding mess

The DansGuardian files that are shipped within DoudouLinux use different encodings depending on the language. For example, French uses a Latin-specific encoding while Russian uses a Cyrillic-specific one. This is a real difficulty because each file has to be edited with the correct encoding for its language. For this reason, the files in the DoudouLinux “lang/” tree have all been converted to UTF-8. This means that all files must now be edited in UTF-8, whatever your language is.

Another issue is that DansGuardian does not try to guess the encoding of the web page that you are requesting. Instead it considers its content as a binary stream and performs byte comparison with word lists. This means that we should provide a weighted file for each possible encoding of each language… Again the “lang/” tree has been designed to host UTF-8 files only. Files in additional encodings are automatically generated at CD build time to reduce the translation effort.
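The effect of byte comparison can be reproduced in a few lines of Python. This is only an illustration of the principle, with a made-up page; the helper name found is ours.

```python
word = "фото"  # "photo", taken from the Russian sample above

# The same tiny page, as a server could send it in two different encodings.
page_utf8 = f"<p>{word}</p>".encode("utf-8")
page_cp1251 = f"<p>{word}</p>".encode("cp1251")


def found(phrase_bytes, page_bytes):
    """Byte-level search: no attempt is made to decode the page."""
    return phrase_bytes in page_bytes


# A UTF-8 phrase only matches a UTF-8 page...
utf8_on_utf8 = found(word.encode("utf-8"), page_utf8)
utf8_on_cp1251 = found(word.encode("utf-8"), page_cp1251)
# ...so a cp1251 page needs the phrase list in cp1251 as well.
cp1251_on_cp1251 = found(word.encode("cp1251"), page_cp1251)
```

Here utf8_on_utf8 and cp1251_on_cp1251 are true while utf8_on_cp1251 is false, which is why one phrase file per encoding is needed.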

So if you need to add encodings for your language, you have to edit the file lang/trunk/apps/system/dansguardian/lists/weightedphraselist. It contains the list of files to be loaded by DansGuardian. The trick is that if the CD build script does not find a file in this list, it tries to generate it from another one, after guessing which file it derives from and which encoding is requested. These additional files must be named this way: weighted_LANGUAGE-ENCODING. For example, mentioning the nonexistent file weighted_russian-cp1251 will make the CD build script convert the file weighted_russian from UTF-8 to cp1251. Of course the resulting file is named weighted_russian-cp1251!

Here is a small sample of the file weightedphraselist:

.Include</etc/dansguardian/lists/phraselists/pornography/weighted_spanish> #ALPHA#
.Include</etc/dansguardian/lists/phraselists/pornography/weighted_russian> #BETA#
.Include</etc/dansguardian/lists/phraselists/pornography/weighted_russian-cp1251>
.Include</etc/dansguardian/lists/phraselists/pornography/weighted_russian-koi8>
.Include</etc/dansguardian/lists/phraselists/nudism/weighted>

Here you can see that the UTF-8 Russian word list will be automatically converted into cp1251-encoded and koi8-encoded files at build time.
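The naming rule can be sketched in Python as follows. This is not the actual CD build script, only a guess at the logic described above. The function names are hypothetical, and so is the mapping of the short name koi8 to Python's koi8_r codec.

```python
from pathlib import Path

# Assumption: "koi8" in a file name means KOI8-R (Python has no codec
# called plain "koi8"). Other names are passed to Python unchanged.
CODECS = {"koi8": "koi8_r"}


def split_target(name):
    """Split 'weighted_russian-cp1251' into ('weighted_russian', 'cp1251')."""
    base, sep, encoding = name.rpartition("-")
    if not sep:
        return name, None  # no encoding suffix, e.g. 'weighted_russian'
    return base, encoding


def ensure_list(path):
    """Return path, generating the file from its UTF-8 source if it is missing."""
    path = Path(path)
    if path.exists():
        return path
    base, encoding = split_target(path.name)
    if encoding is None:
        raise FileNotFoundError(path)
    # The source file sits in the same directory and is always UTF-8.
    text = path.with_name(base).read_text(encoding="utf-8")
    path.write_bytes(text.encode(CODECS.get(encoding, encoding)))
    return path
```

Calling ensure_list on a missing weighted_russian-cp1251 would read weighted_russian as UTF-8 and write the cp1251 copy next to it, provided every character of the source exists in the target encoding.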

Alphabet tricks

If you take a look at the Russian phrase list file, you will see entries with comments like these:

< секс чат ><50> #sex chat
< cекс чат ><120> #sex chat (first ’c’ is latin)
< секс форум ><50> #sex forum
< cекс форум ><120> #sex forum (first ’c’ is latin)

At first sight the same phrases appear twice. They look identical but are not, because the letter that looks like C in the Cyrillic alphabet (code point U+0441) does not have the same numerical code as the letter C in the Latin alphabet (U+0063). As a result, a naughty word can be written by mixing alphabets within the same word, thus bypassing all content tests… This is why such mixed spellings are given a very high naughtiness score!

So if your language uses an alphabet that shares letters with another alphabet (visually, not in pronunciation), you should certainly add mixed-alphabet spellings of naughty words and give these spellings a very high naughtiness score.
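Mixed spellings can be produced mechanically rather than typed by hand. The Python sketch below is only a suggestion: the look-alike table is partial and illustrative, the function names are ours, and it only swaps one letter at a time.

```python
# Partial, illustrative table: Cyrillic letter -> Latin look-alike.
# Written with escapes so that the two alphabets cannot be confused here.
LOOKALIKES = {
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic ie
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0441": "c",  # Cyrillic es
    "\u0443": "y",  # Cyrillic u
    "\u0445": "x",  # Cyrillic ha
}


def mixed_variants(word):
    """Yield copies of word with exactly one letter swapped for its look-alike."""
    for index, letter in enumerate(word):
        latin = LOOKALIKES.get(letter)
        if latin is not None:
            yield word[:index] + latin + word[index + 1:]


def weighted_entry(phrase, weight):
    """Format an exact-word entry in the style of the weighted files."""
    return f"< {phrase} ><{weight}>"
```

Each variant can then be passed to weighted_entry with a high weight, such as the 120 used in the Russian file, and the resulting line added to your weighted file with a comment telling which letter was swapped.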


Creative Commons Copyright © DoudouLinux.org team - All texts from this site are published under the license Creative Commons BY-SA
