This notebook: https://github.com/TonyMas/GlobalText/
This presentation: https://tonymas.github.io/GlobalText/
Anton Masalovich
I help people scale ML models to more languages at Microsoft.
Background in document recognition and computer vision.
https://www.linkedin.com/in/masalovich/
Let's write a function that counts distinct words in a given sentence:
All of the code below is written for Python 3.6.
from collections import Counter
import re
import string
class WordCounterV1:
    def __init__(self):
        # Translation table to remove punctuation
        self.punctuationTranslation = ''.maketrans('', '', string.punctuation)

    def CountWords(self, text):
        counter = Counter()
        # Split text into words by spaces
        for word in text.split(' '):
            # Remove all punctuation
            word = word.translate(self.punctuationTranslation)
            if len(word) == 0:
                # Skip empty words
                continue
            # Words that consist only of letters count as words
            if re.match('^[a-zA-Z]+$', word):
                # Using casefold to get a more stable case-insensitive variant
                counter[word.casefold()] += 1
            # Words that consist only of digits count as __number__
            elif re.match('^[0-9]+$', word):
                counter['__number__'] += 1
            # Everything else goes to the __other__ bucket
            else:
                counter['__other__'] += 1
        return counter
Let's test it
wordCounter = WordCounterV1()
wordCounter.CountWords( 'Hello, World!' )
Counter({'hello': 1, 'world': 1})
Let's test it some more
wordCounter.CountWords( 'I think version 1 of our function will work fine for i18n, no need to create version 2.' )
Counter({'__number__': 2,
'__other__': 1,
'create': 1,
'fine': 1,
'for': 1,
'function': 1,
'i': 1,
'need': 1,
'no': 1,
'of': 1,
'our': 1,
'think': 1,
'to': 1,
'version': 2,
'will': 1,
'work': 1})
And more
wordCounter.CountWords( 'But my fiancée thinks that this function will not work even in English' )
Counter({'__other__': 1,
'but': 1,
'english': 1,
'even': 1,
'function': 1,
'in': 1,
'my': 1,
'not': 1,
'that': 1,
'thinks': 1,
'this': 1,
'will': 1,
'work': 1})
OK, we need to extend our letter set.
But we need to be truly international, so we need all possible diacritics.
And the Cyrillic alphabet as well.
And all kinds of Indic scripts.
And the Thai script.
And maybe a few more things…?
https://en.wikipedia.org/wiki/Unicode_character_property#General_Category
The Unicode Standard assigns character properties to each code point. These properties can be used to handle "characters" (code points) in processes, like in line-breaking, script direction right-to-left or applying controls.
Each code point is assigned a value for General Category. This is one of the character properties that are also defined for unassigned code points, and code points that are defined "not a character".
Letters: Lu, Ll, Lt, Lm, Lo
Marks: Mn, Mc, Me
Numbers: Nd, Nl, No
Punctuations: Pc, Pd, Ps, Pe, Pi, Pf, Po
Symbols: Sm, Sc, Sk, So
Separators: Zs, Zl, Zp
Other, control: Cc
Other, format: Cf
Other, surrogate: Cs
Other, private use: Co
Other, not assigned: Cn
import unicodedata
unicodedata.category('a')
'Ll'
unicodedata.category('1')
'Nd'
unicodedata.category(' ')
'Zs'
import unicodedata

def calculateUnicodeSets():
    punctuation, letters, numbers, spaces, control = \
        set(), set(), set(), set(), set()
    # Go through the whole range of possible Unicode code points
    for i in range(0, 0x110000):
        char = chr(i)
        category = unicodedata.category(char)
        # Punctuation is everything in the P* categories
        if category.startswith('P'):
            punctuation.add(char)
        # For our goal both letters (L*) and mark signs (M*)
        # are considered letters
        elif category.startswith('L') or category.startswith('M'):
            letters.add(char)
        # Nd and Nl go to numbers (No is not exactly digits)
        elif category == 'Nd' or category == 'Nl':
            numbers.add(char)
        # Z* goes to spaces
        elif category.startswith('Z'):
            spaces.add(char)
        # We will need control (Cc) and format (Cf) characters a little later
        elif category == 'Cc' or category == 'Cf':
            control.add(char)
    # TAB, CR and LF are in the Cc category, but we will treat them as spaces
    spaces.update(['\t', '\r', '\n'])
    control.difference_update(['\t', '\r', '\n'])
    return (punctuation, letters, numbers, spaces, control)
(punctuation, letters, numbers, spaces, control) = calculateUnicodeSets()
OK, we need a universal letter set, but this function looks like overkill.
Why do we need our own punctuation set when we already have string.punctuation?
string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
''.join(sorted(punctuation))
'!"#%&\'()*,-./:;?@[\\]_{}¡§«¶·»¿;·՚՛՜՝՞՟։֊־׀׃׆׳״؉؊،؍؛؞؟٪٫٬٭۔܀܁܂܃܄܅܆܇܈܉܊܋܌܍߷߸߹࠰࠱࠲࠳࠴࠵࠶࠷࠸࠹࠺࠻࠼࠽࠾࡞।॥॰૰෴๏๚๛༄༅༆༇༈༉༊་༌།༎༏༐༑༒༔༺༻༼༽྅࿐࿑࿒࿓࿔࿙࿚၊။၌၍၎၏჻፠፡።፣፤፥፦፧፨᐀᙭᙮᚛᚜᛫᛬᛭᜵᜶។៕៖៘៙៚᠀᠁᠂᠃᠄᠅᠆᠇᠈᠉᠊᥄᥅᨞᨟᪠᪡᪢᪣᪤᪥᪦᪨᪩᪪᪫᪬᪭᭚᭛᭜᭝᭞᭟᭠᯼᯽᯾᯿᰻᰼᰽᰾᰿᱾᱿᳀᳁᳂᳃᳄᳅᳆᳇᳓‐‑‒–—―‖‗‘’‚‛“”„‟†‡•‣․‥…‧‰‱′″‴‵‶‷‸‹›※‼‽‾‿⁀⁁⁂⁃⁅⁆⁇⁈⁉⁊⁋⁌⁍⁎⁏⁐⁑⁓⁔⁕⁖⁗⁘⁙⁚⁛⁜⁝⁞⁽⁾₍₎⌈⌉⌊⌋〈〉❨❩❪❫❬❭❮❯❰❱❲❳❴❵⟅⟆⟦⟧⟨⟩⟪⟫⟬⟭⟮⟯⦃⦄⦅⦆⦇⦈⦉⦊⦋⦌⦍⦎⦏⦐⦑⦒⦓⦔⦕⦖⦗⦘⧘⧙⧚⧛⧼⧽⳹⳺⳻⳼⳾⳿⵰⸀⸁⸂⸃⸄⸅⸆⸇⸈⸉⸊⸋⸌⸍⸎⸏⸐⸑⸒⸓⸔⸕⸖⸗⸘⸙⸚⸛⸜⸝⸞⸟⸠⸡⸢⸣⸤⸥⸦⸧⸨⸩⸪⸫⸬⸭⸮⸰⸱⸲⸳⸴⸵⸶⸷⸸⸹⸺⸻⸼⸽⸾⸿⹀⹁⹂⹃⹄、。〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〽゠・꓾꓿꘍꘎꘏꙳꙾꛲꛳꛴꛵꛶꛷꡴꡵꡶꡷꣎꣏꣸꣹꣺꣼꤮꤯꥟꧁꧂꧃꧄꧅꧆꧇꧈꧉꧊꧋꧌꧍꧞꧟꩜꩝꩞꩟꫞꫟꫰꫱꯫﴾﴿︐︑︒︓︔︕︖︗︘︙︰︱︲︳︴︵︶︷︸︹︺︻︼︽︾︿﹀﹁﹂﹃﹄﹅﹆﹇﹈﹉﹊﹋﹌﹍﹎﹏﹐﹑﹒﹔﹕﹖﹗﹘﹙﹚﹛﹜﹝﹞﹟﹠﹡﹣﹨﹪﹫!"#%&'()*,-./:;?@[\]_{}⦅⦆。「」、・𐄀𐄁𐄂𐎟𐏐𐕯𐡗𐤟𐤿𐩐𐩑𐩒𐩓𐩔𐩕𐩖𐩗𐩘𐩿𐫰𐫱𐫲𐫳𐫴𐫵𐫶𐬹𐬺𐬻𐬼𐬽𐬾𐬿𐮙𐮚𐮛𐮜𑁇𑁈𑁉𑁊𑁋𑁌𑁍𑂻𑂼𑂾𑂿𑃀𑃁𑅀𑅁𑅂𑅃𑅴𑅵𑇅𑇆𑇇𑇈𑇉𑇍𑇛𑇝𑇞𑇟𑈸𑈹𑈺𑈻𑈼𑈽𑊩𑑋𑑌𑑍𑑎𑑏𑑛𑑝𑓆𑗁𑗂𑗃𑗄𑗅𑗆𑗇𑗈𑗉𑗊𑗋𑗌𑗍𑗎𑗏𑗐𑗑𑗒𑗓𑗔𑗕𑗖𑗗𑙁𑙂𑙃𑙠𑙡𑙢𑙣𑙤𑙥𑙦𑙧𑙨𑙩𑙪𑙫𑙬𑜼𑜽𑜾𑱁𑱂𑱃𑱄𑱅𑱰𑱱𒑰𒑱𒑲𒑳𒑴𖩮𖩯𖫵𖬷𖬸𖬹𖬺𖬻𖭄𛲟𝪇𝪈𝪉𝪊𝪋𞥞𞥟'
But what about numbers? Aren't there only 10 of them?
You may treat different Number categories (Nd, Nl, No) differently, depending on your application.
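For example, an Arabic-Indic digit is a decimal digit (Nd), a Roman numeral is a letterlike numeral (Nl), and a vulgar fraction is "other" (No); unicodedata.numeric reports the numeric value of each:

```python
import unicodedata

for char in ('7', '٣', 'Ⅻ', '½'):
    print(char, unicodedata.category(char), unicodedata.numeric(char))
# 7 Nd 7.0
# ٣ Nd 3.0
# Ⅻ Nl 12.0
# ½ No 0.5
```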
''.join(sorted(numbers))
'0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨੩੪੫੬੭੮੯૦૧૨૩૪૫૬૭૮૯୦୧୨୩୪୫୬୭୮୯௦௧௨௩௪௫௬௭௮௯౦౧౨౩౪౫౬౭౮౯೦೧೨೩೪೫೬೭೮೯൦൧൨൩൪൫൬൭൮൯෦෧෨෩෪෫෬෭෮෯๐๑๒๓๔๕๖๗๘๙໐໑໒໓໔໕໖໗໘໙༠༡༢༣༤༥༦༧༨༩၀၁၂၃၄၅၆၇၈၉႐႑႒႓႔႕႖႗႘႙ᛮᛯᛰ០១២៣៤៥៦៧៨៩᠐᠑᠒᠓᠔᠕᠖᠗᠘᠙᥆᥇᥈᥉᥊᥋᥌᥍᥎᥏᧐᧑᧒᧓᧔᧕᧖᧗᧘᧙᪀᪁᪂᪃᪄᪅᪆᪇᪈᪉᪐᪑᪒᪓᪔᪕᪖᪗᪘᪙᭐᭑᭒᭓᭔᭕᭖᭗᭘᭙᮰᮱᮲᮳᮴᮵᮶᮷᮸᮹᱀᱁᱂᱃᱄᱅᱆᱇᱈᱉᱐᱑᱒᱓᱔᱕᱖᱗᱘᱙ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩⅪⅫⅬⅭⅮⅯⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹⅺⅻⅼⅽⅾⅿↀↁↂↅↆↇↈ〇〡〢〣〤〥〦〧〨〩〸〹〺꘠꘡꘢꘣꘤꘥꘦꘧꘨꘩ꛦꛧꛨꛩꛪꛫꛬꛭꛮꛯ꣐꣑꣒꣓꣔꣕꣖꣗꣘꣙꤀꤁꤂꤃꤄꤅꤆꤇꤈꤉꧐꧑꧒꧓꧔꧕꧖꧗꧘꧙꧰꧱꧲꧳꧴꧵꧶꧷꧸꧹꩐꩑꩒꩓꩔꩕꩖꩗꩘꩙꯰꯱꯲꯳꯴꯵꯶꯷꯸꯹0123456789𐅀𐅁𐅂𐅃𐅄𐅅𐅆𐅇𐅈𐅉𐅊𐅋𐅌𐅍𐅎𐅏𐅐𐅑𐅒𐅓𐅔𐅕𐅖𐅗𐅘𐅙𐅚𐅛𐅜𐅝𐅞𐅟𐅠𐅡𐅢𐅣𐅤𐅥𐅦𐅧𐅨𐅩𐅪𐅫𐅬𐅭𐅮𐅯𐅰𐅱𐅲𐅳𐅴𐍁𐍊𐏑𐏒𐏓𐏔𐏕𐒠𐒡𐒢𐒣𐒤𐒥𐒦𐒧𐒨𐒩𑁦𑁧𑁨𑁩𑁪𑁫𑁬𑁭𑁮𑁯𑃰𑃱𑃲𑃳𑃴𑃵𑃶𑃷𑃸𑃹𑄶𑄷𑄸𑄹𑄺𑄻𑄼𑄽𑄾𑄿𑇐𑇑𑇒𑇓𑇔𑇕𑇖𑇗𑇘𑇙𑋰𑋱𑋲𑋳𑋴𑋵𑋶𑋷𑋸𑋹𑑐𑑑𑑒𑑓𑑔𑑕𑑖𑑗𑑘𑑙𑓐𑓑𑓒𑓓𑓔𑓕𑓖𑓗𑓘𑓙𑙐𑙑𑙒𑙓𑙔𑙕𑙖𑙗𑙘𑙙𑛀𑛁𑛂𑛃𑛄𑛅𑛆𑛇𑛈𑛉𑜰𑜱𑜲𑜳𑜴𑜵𑜶𑜷𑜸𑜹𑣠𑣡𑣢𑣣𑣤𑣥𑣦𑣧𑣨𑣩𑱐𑱑𑱒𑱓𑱔𑱕𑱖𑱗𑱘𑱙𒐀𒐁𒐂𒐃𒐄𒐅𒐆𒐇𒐈𒐉𒐊𒐋𒐌𒐍𒐎𒐏𒐐𒐑𒐒𒐓𒐔𒐕𒐖𒐗𒐘𒐙𒐚𒐛𒐜𒐝𒐞𒐟𒐠𒐡𒐢𒐣𒐤𒐥𒐦𒐧𒐨𒐩𒐪𒐫𒐬𒐭𒐮𒐯𒐰𒐱𒐲𒐳𒐴𒐵𒐶𒐷𒐸𒐹𒐺𒐻𒐼𒐽𒐾𒐿𒑀𒑁𒑂𒑃𒑄𒑅𒑆𒑇𒑈𒑉𒑊𒑋𒑌𒑍𒑎𒑏𒑐𒑑𒑒𒑓𒑔𒑕𒑖𒑗𒑘𒑙𒑚𒑛𒑜𒑝𒑞𒑟𒑠𒑡𒑢𒑣𒑤𒑥𒑦𒑧𒑨𒑩𒑪𒑫𒑬𒑭𒑮𖩠𖩡𖩢𖩣𖩤𖩥𖩦𖩧𖩨𖩩𖭐𖭑𖭒𖭓𖭔𖭕𖭖𖭗𖭘𖭙𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡𝟢𝟣𝟤𝟥𝟦𝟧𝟨𝟩𝟪𝟫𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿𞥐𞥑𞥒𞥓𞥔𞥕𞥖𞥗𞥘𞥙'
What about spaces?
len(spaces)
22
''.join(sorted(spaces))
'\t\n\r \xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'
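Most of these are hard to tell apart by eye; unicodedata.name helps identify them:

```python
import unicodedata

for char in ('\xa0', '\u2003', '\u3000'):
    print(hex(ord(char)), unicodedata.name(char))
# 0xa0 NO-BREAK SPACE
# 0x2003 EM SPACE
# 0x3000 IDEOGRAPHIC SPACE
```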
Just for reference, how many letters do we have?
len(letters)
118863
Let's get back to our word counter.
from collections import Counter
import re
class WordCounterV2:
    def __init__(self):
        # Get all our Unicode sets
        (self.punctuation, self.letters, self.numbers, self.spaces, self.control) = \
            calculateUnicodeSets()
        # Translation table to remove punctuation
        self.punctuationTranslation = ''.maketrans('', '', ''.join(self.punctuation))
        # Translation table to remove control characters
        self.controlTranslation = ''.maketrans('', '', ''.join(self.control))
        # Regex matching any whitespace character
        self.whitespacesRegex = '|'.join(map(re.escape, self.spaces))

class WordCounterV2(WordCounterV2):
    def CountWords(self, text):
        counter = Counter()
        # Remove control characters from the string
        text = text.translate(self.controlTranslation)
        # Split text into words by whitespace
        for word in re.split(self.whitespacesRegex, text):
            # Remove all punctuation
            word = word.translate(self.punctuationTranslation)
            if len(word) == 0:
                # Skip empty words
                continue
            # Create a set of the word's characters;
            # that makes it easy to compare against the letter and number sets
            wordSet = set(word)
            # Words that consist only of letters count as words
            if wordSet.issubset(self.letters):
                # Using casefold to get a more stable case-insensitive variant
                counter[word.casefold()] += 1
            # Words that consist only of digits count as __number__
            elif wordSet.issubset(self.numbers):
                counter['__number__'] += 1
            # Everything else goes to the __other__ bucket
            else:
                counter['__other__'] += 1
        return counter
wordCounter = WordCounterV2()
wordCounter.CountWords( 'Hello, World!' )
Counter({'hello': 1, 'world': 1})
wordCounter.CountWords( 'Version 2 is working much better for i18n, i wonder do we really need version 3.' )
Counter({'__number__': 2,
'__other__': 1,
'better': 1,
'do': 1,
'for': 1,
'i': 1,
'is': 1,
'much': 1,
'need': 1,
'really': 1,
'version': 2,
'we': 1,
'wonder': 1,
'working': 1})
wordCounter.CountWords( 'Текст, который не понимает большинство присутсвующих.' )
Counter({'большинство': 1,
'который': 1,
'не': 1,
'понимает': 1,
'присутсвующих': 1,
'текст': 1})
Does our algorithm produce similar results on similar-looking strings?
str1 = 'Tôi chỉ cần văn bản với dấu phụ'
str2 = 'Tôi chỉ cần văn bản với dấu phụ'
count1 = wordCounter.CountWords( str1 )
count2 = wordCounter.CountWords( str2 )
count1 + count2
Counter({'bản': 1,
'bản': 1,
'cần': 1,
'chỉ': 1,
'chỉ': 1,
'cần': 1,
'dấu': 1,
'dấu': 1,
'phụ': 1,
'phụ': 1,
'tôi': 1,
'tôi': 1,
'văn': 1,
'với': 1,
'văn': 1,
'với': 1})
What happened in the previous example?
There are several ways to write the same text in Unicode.
','.join(list(str1))
'T,ô,i, ,c,h,ỉ, ,c,ầ,n, ,v,ă,n, ,b,ả,n, ,v,ớ,i, ,d,ấ,u, ,p,h,ụ'
','.join(list(str2))
'T,o,̂,i, ,c,h,i,̉, ,c,a,̂,̀,n, ,v,a,̆,n, ,b,a,̉,n, ,v,o,̛,́,i, ,d,a,̂,́,u, ,p,h,u,̣'
http://unicode.org/reports/tr15/
Canonical equivalence is a fundamental equivalency between characters or sequences of characters which represent the same abstract character, and which when correctly displayed should always have the same visual appearance and behavior.
Compatibility equivalence is a weaker type of equivalence between characters or sequences of characters which represent the same abstract character (or sequence of abstract characters), but which may have distinct visual appearances or behaviors.
Four types of normalization:
NFD: canonical decomposition
NFC: canonical decomposition, followed by canonical composition
NFKD: compatibility decomposition
NFKC: compatibility decomposition, followed by canonical composition
Which type of normalization to use depends on your application and business requirements.
We will stick to NFKD in this example.
import unicodedata
str1Normalized = unicodedata.normalize("NFKD", str1)
str2Normalized = unicodedata.normalize("NFKD", str2)
','.join(list(str1Normalized))
'T,o,̂,i, ,c,h,i,̉, ,c,a,̂,̀,n, ,v,a,̆,n, ,b,a,̉,n, ,v,o,̛,́,i, ,d,a,̂,́,u, ,p,h,u,̣'
','.join(list(str2Normalized))
'T,o,̂,i, ,c,h,i,̉, ,c,a,̂,̀,n, ,v,a,̆,n, ,b,a,̉,n, ,v,o,̛,́,i, ,d,a,̂,́,u, ,p,h,u,̣'
str1 == str2
False
str1Normalized == str2Normalized
True
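The K (compatibility) forms go further than canonical equivalence: they also fold things like ligatures and superscripts into their plain counterparts, which the canonical forms leave untouched. Whether that folding is desirable depends on your application:

```python
import unicodedata

ligature = '\ufb01'  # LATIN SMALL LIGATURE FI
print(unicodedata.normalize('NFD', ligature))   # 'ﬁ' (canonical form keeps the ligature)
print(unicodedata.normalize('NFKD', ligature))  # 'fi' (compatibility form folds it)
print(unicodedata.normalize('NFKD', '\xb2'))    # '2' (superscript two becomes a plain digit)
```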
from collections import Counter
import re
import unicodedata

class WordCounterV3:
    def __init__(self):
        # Get all our Unicode sets
        (self.punctuation, self.letters, self.numbers, self.spaces, self.control) = \
            calculateUnicodeSets()
        # Translation table to remove punctuation
        self.punctuationTranslation = ''.maketrans('', '', ''.join(self.punctuation))
        # Translation table to remove control characters
        self.controlTranslation = ''.maketrans('', '', ''.join(self.control))
        # Regex matching any whitespace character
        self.whitespacesRegex = '|'.join(map(re.escape, self.spaces))

class WordCounterV3(WordCounterV3):
    def CountWords(self, text):
        counter = Counter()
        # Normalize text
        text = unicodedata.normalize("NFKD", text)
        # Remove control characters from the string
        text = text.translate(self.controlTranslation)
        # Split text into words by whitespace
        for word in re.split(self.whitespacesRegex, text):
            # Remove all punctuation
            word = word.translate(self.punctuationTranslation)
            if len(word) == 0:
                # Skip empty words
                continue
            # Create a set of the word's characters;
            # that makes it easy to compare against the letter and number sets
            wordSet = set(word)
            # Words that consist only of letters count as words
            if wordSet.issubset(self.letters):
                # Using casefold to get a more stable case-insensitive variant
                counter[word.casefold()] += 1
            # Words that consist only of digits count as __number__
            elif wordSet.issubset(self.numbers):
                counter['__number__'] += 1
            # Everything else goes to the __other__ bucket
            else:
                counter['__other__'] += 1
        return counter
wordCounter = WordCounterV3()
wordCounter.CountWords( 'Hello, World!' )
Counter({'hello': 1, 'world': 1})
Let's try the new version on the previous example
count1Normalized = wordCounter.CountWords( str1 )
count2Normalized = wordCounter.CountWords( str2 )
count1Normalized + count2Normalized
Counter({'bản': 2,
'cần': 2,
'chỉ': 2,
'dấu': 2,
'phụ': 2,
'tôi': 2,
'văn': 2,
'với': 2})
Can we combine our new results with the previous ones?
count1Normalized + count2Normalized + count1 + count2
Counter({'bản': 3,
'bản': 1,
'cần': 3,
'chỉ': 3,
'chỉ': 1,
'cần': 1,
'dấu': 3,
'dấu': 1,
'phụ': 3,
'phụ': 1,
'tôi': 3,
'tôi': 1,
'văn': 3,
'với': 3,
'văn': 1,
'với': 1})
Is whitespace word-breaking enough for our needs?
Sometimes even in English we may want a more complex solution.
wordCounter.CountWords("Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.")
Counter({'about': 1,
'amusing': 1,
'arent': 1,
'boys': 1,
'capital': 1,
'chiles': 1,
'mr': 1,
'oneill': 1,
'stories': 1,
'that': 1,
'the': 1,
'thinks': 1})
Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
The major question of the tokenization phase is what are the correct tokens to use? In the beginning, it looks fairly trivial: you chop on whitespace and throw away punctuation characters.
This is a starting point, but even for English there are a number of tricky cases. For example, what do you do about the various uses of the apostrophe for possession and contractions?
For many languages, tokenization requires lexical analysis of the string, because we simply cannot split on whitespace.
German: a lot of compound words
# law for the delegation of monitoring beef labelling
wordCounter.CountWords('Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz')
Counter({'rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz': 1})
Chinese: doesn't use spaces at all
wordCounter.CountWords('中文几乎没有空格')
Counter({'中文几乎没有空格': 1})
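Real Chinese tokenizers rely on a lexicon (plus statistics or ML on top); a toy greedy longest-match segmenter sketches the basic idea. The four-entry dictionary here is made up purely for illustration:

```python
def maxMatch(text, lexicon, maxWordLen=4):
    """Greedy longest-match segmentation: a toy sketch, not a production tokenizer."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        for j in range(min(len(text), i + maxWordLen), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

lexicon = {'中文', '几乎', '没有', '空格'}
print(maxMatch('中文几乎没有空格', lexicon))  # ['中文', '几乎', '没有', '空格']
```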
Bad news:
Good news - you can find tokenizers for many languages
Where to look:
from collections import Counter
import unicodedata
from nltk.tokenize import word_tokenize

class WordCounterV4:
    def __init__(self):
        # Get all our Unicode sets
        (self.punctuation, self.letters, self.numbers, self.spaces, self.control) = \
            calculateUnicodeSets()
        # Translation table to remove punctuation
        self.punctuationTranslation = ''.maketrans('', '', ''.join(self.punctuation))
        # Translation table to remove control characters
        self.controlTranslation = ''.maketrans('', '', ''.join(self.control))

class WordCounterV4(WordCounterV4):
    def CountWords(self, text, language):
        counter = Counter()
        # Normalize text
        text = unicodedata.normalize("NFKD", text)
        # Remove control characters from the string
        text = text.translate(self.controlTranslation)
        # Split text using the tokenizer
        for word in word_tokenize(text, language):
            # Remove all punctuation
            word = word.translate(self.punctuationTranslation)
            if len(word) == 0:
                # Skip empty words
                continue
            # Create a set of the word's characters;
            # that makes it easy to compare against the letter and number sets
            wordSet = set(word)
            # Words that consist only of letters count as words
            if wordSet.issubset(self.letters):
                # Using casefold to get a more stable case-insensitive variant
                counter[word.casefold()] += 1
            # Words that consist only of digits count as __number__
            elif wordSet.issubset(self.numbers):
                counter['__number__'] += 1
            # Everything else goes to the __other__ bucket
            else:
                counter['__other__'] += 1
        return counter
wordCounter = WordCounterV4()
wordCounter.CountWords( "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.", "English" )
Counter({'about': 1,
'amusing': 1,
'are': 1,
'boys': 1,
'capital': 1,
'chile': 1,
'mr': 1,
'nt': 1,
'oneill': 1,
's': 1,
'stories': 1,
'that': 1,
'the': 1,
'thinks': 1})
For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization.
Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.
Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun.
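The contrast can be sketched in a few lines. The suffix rules and the tiny lemma lexicon below are invented for illustration; real stemmers (like Snowball, used next) and lemmatizers are far more elaborate:

```python
def crudeStem(word):
    # Stemming: blindly chop a known suffix (toy rules, invented for illustration)
    for suffix in ('ization', 'ing', 'ies', 'es', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Lemmatization: dictionary lookup that needs the part of speech (toy lexicon)
lemmas = {('saw', 'verb'): 'see', ('saw', 'noun'): 'saw'}

def lemmatize(word, pos):
    return lemmas.get((word, pos), word)

print(crudeStem('organizing'))   # 'organiz'
print(lemmatize('saw', 'verb'))  # 'see'
print(lemmatize('saw', 'noun'))  # 'saw'
```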
from collections import Counter
import unicodedata
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer

class WordCounterV5:
    def __init__(self):
        # Get all our Unicode sets
        (self.punctuation, self.letters, self.numbers, self.spaces, self.control) = \
            calculateUnicodeSets()
        # Translation table to remove punctuation
        self.punctuationTranslation = ''.maketrans('', '', ''.join(self.punctuation))
        # Translation table to remove control characters
        self.controlTranslation = ''.maketrans('', '', ''.join(self.control))

class WordCounterV5(WordCounterV5):
    def CountWords(self, text, language):
        counter = Counter()
        # Stemmer for our words
        stemmer = SnowballStemmer(language.lower())
        # Normalize text
        text = unicodedata.normalize("NFKD", text)
        # Remove control characters from the string
        text = text.translate(self.controlTranslation)
        # Split text using the tokenizer
        for word in word_tokenize(text, language):
            # Remove all punctuation
            word = word.translate(self.punctuationTranslation)
            # Stem the word
            word = stemmer.stem(word)
            if len(word) == 0:
                # Skip empty words
                continue
            # Create a set of the word's characters;
            # that makes it easy to compare against the letter and number sets
            wordSet = set(word)
            # Words that consist only of letters count as words
            if wordSet.issubset(self.letters):
                # Using casefold to get a more stable case-insensitive variant
                counter[word.casefold()] += 1
            # Words that consist only of digits count as __number__
            elif wordSet.issubset(self.numbers):
                counter['__number__'] += 1
            # Everything else goes to the __other__ bucket
            else:
                counter['__other__'] += 1
        return counter
wordCounter = WordCounterV5()
wordCounter.CountWords( "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.", "English" )
Counter({'about': 1,
'amus': 1,
'are': 1,
'boy': 1,
'capit': 1,
'chile': 1,
'mr': 1,
'nt': 1,
'oneil': 1,
's': 1,
'stori': 1,
'that': 1,
'the': 1,
'think': 1})
This notebook: https://github.com/TonyMas/GlobalText/
This presentation: https://tonymas.github.io/GlobalText/