This notebook: https://github.com/TonyMas/GlobalText/
This presentation: https://tonymas.github.io/GlobalText/
Anton Masalovich
I help people scale ML models to more languages at Microsoft.
Background in document recognition and computer vision.
https://www.linkedin.com/in/masalovich/
Let's write a function that counts distinct words in a given sentence:
All of the code below is written for Python 3.6.
from collections import Counter
import re
import string
class WordCounterV1:
    def __init__(self):
        # Translation table to remove punctuation
        self.punctuationTranslation = ''.maketrans('', '', string.punctuation)

    def CountWords(self, text):
        counter = Counter()
        # Split text into words by spaces
        for word in text.split(' '):
            # Remove all punctuation
            word = word.translate(self.punctuationTranslation)
            if len(word) == 0:
                # Skip empty words
                continue
            # Words that consist only of letters count as words
            if re.match('^[a-zA-Z]+$', word):
                # Using casefold to get a more stable case-insensitive variant
                counter[word.casefold()] += 1
            # Words that consist only of digits count as __number__
            elif re.match('^[0-9]+$', word):
                counter['__number__'] += 1
            # Everything else goes to the __other__ bucket
            else:
                counter['__other__'] += 1
        return counter
Let's test it
wordCounter = WordCounterV1()
wordCounter.CountWords( 'Hello, World!' )
Counter({'hello': 1, 'world': 1})
Let's test it some more
wordCounter.CountWords( 'I think version 1 of our function will work fine for i18n, no need to create version 2.' )
Counter({'__number__': 2,
'__other__': 1,
'create': 1,
'fine': 1,
'for': 1,
'function': 1,
'i': 1,
'need': 1,
'no': 1,
'of': 1,
'our': 1,
'think': 1,
'to': 1,
'version': 2,
'will': 1,
'work': 1})
And more
wordCounter.CountWords( 'But my fiancée thinks that this function will not work even in English' )
Counter({'__other__': 1,
'but': 1,
'english': 1,
'even': 1,
'function': 1,
'in': 1,
'my': 1,
'not': 1,
'that': 1,
'thinks': 1,
'this': 1,
'will': 1,
'work': 1})
OK, we need to extend our letter set.
But we need to be truly international, so we need all possible diacritics.
And the Cyrillic alphabet as well.
And all kinds of Indic scripts.
And the Thai script.
And maybe a few more things…?
https://en.wikipedia.org/wiki/Unicode_character_property#General_Category
The Unicode Standard assigns character properties to each code point. These properties can be used to handle "characters" (code points) in processes, like in line-breaking, script direction right-to-left or applying controls.
Each code point is assigned a value for General Category. This is one of the character properties that are also defined for unassigned code points, and code points that are defined "not a character".
Letters: Lu, Ll, Lt, Lm, Lo
Marks: Mn, Mc, Me
Numbers: Nd, Nl, No
Punctuations: Pc, Pd, Ps, Pe, Pi, Pf, Po
Symbols: Sm, Sc, Sk, So
Separators: Zs, Zl, Zp
Other, control: Cc
Other, format: Cf
Other, surrogate: Cs
Other, private use: Co
Other, not assigned: Cn
import unicodedata
unicodedata.category('a')
'Ll'
unicodedata.category('1')
'Nd'
unicodedata.category(' ')
'Zs'
import unicodedata

def calculateUnicodeSets():
    punctuation, letters, numbers, spaces, control = \
        set(), set(), set(), set(), set()
    # Go through the whole range of possible Unicode code points
    for i in range(0, 0x110000):
        char = chr(i)
        category = unicodedata.category(char)
        # Punctuation is everything in the P* categories
        if category.startswith('P'):
            punctuation.add(char)
        # For our goal both letters (L*) and mark signs (M*)
        # are considered letters
        elif category.startswith('L') or category.startswith('M'):
            letters.add(char)
        # Nd and Nl go to numbers (No is not exactly digits)
        elif category == 'Nd' or category == 'Nl':
            numbers.add(char)
        # Z* goes to spaces
        elif category.startswith('Z'):
            spaces.add(char)
        # We will need control (Cc) and format (Cf) characters a little later
        elif category == 'Cc' or category == 'Cf':
            control.add(char)
    # TAB, CR and LF are in the Cc category, but we will treat them as spaces
    spaces.update(['\t', '\r', '\n'])
    control.difference_update(['\t', '\r', '\n'])
    return (punctuation, letters, numbers, spaces, control)
(punctuation, letters, numbers, spaces, control) = calculateUnicodeSets()
OK, we need a universal letter set, but this function looks like overkill.
Why do we need our own punctuation set when we already have string.punctuation?
string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
''.join(sorted(punctuation))
'!"#%&\'()*,-./:;?@[\\]_{}¡§«¶·»¿;·՚՛՜՝՞՟։֊־׀׃׆׳״؉؊،؍؛؞؟٪٫٬٭۔܀܁܂܃܄܅܆܇܈܉܊܋܌܍߷߸߹࠰࠱࠲࠳࠴࠵࠶࠷࠸࠹࠺࠻࠼࠽࠾࡞।॥॰૰෴๏๚๛༄༅༆༇༈༉༊་༌།༎༏༐༑༒༔༺༻༼༽྅࿐࿑࿒࿓࿔࿙࿚၊။၌၍၎၏჻፠፡።፣፤፥፦፧፨᐀᙭᙮᚛᚜᛫᛬᛭᜵᜶។៕៖៘៙៚᠀᠁᠂᠃᠄᠅᠆᠇᠈᠉᠊᥄᥅᨞᨟᪠᪡᪢᪣᪤᪥᪦᪨᪩᪪᪫᪬᪭᭚᭛᭜᭝᭞᭟᭠᯼᯽᯾᯿᰻᰼᰽᰾᰿᱾᱿᳀᳁᳂᳃᳄᳅᳆᳇᳓‐‑‒–—―‖‗‘’‚‛“”„‟†‡•‣․‥…‧‰‱′″‴‵‶‷‸‹›※‼‽‾‿⁀⁁⁂⁃⁅⁆⁇⁈⁉⁊⁋⁌⁍⁎⁏⁐⁑⁓⁔⁕⁖⁗⁘⁙⁚⁛⁜⁝⁞⁽⁾₍₎⌈⌉⌊⌋〈〉❨❩❪❫❬❭❮❯❰❱❲❳❴❵⟅⟆⟦⟧⟨⟩⟪⟫⟬⟭⟮⟯⦃⦄⦅⦆⦇⦈⦉⦊⦋⦌⦍⦎⦏⦐⦑⦒⦓⦔⦕⦖⦗⦘⧘⧙⧚⧛⧼⧽⳹⳺⳻⳼⳾⳿⵰⸀⸁⸂⸃⸄⸅⸆⸇⸈⸉⸊⸋⸌⸍⸎⸏⸐⸑⸒⸓⸔⸕⸖⸗⸘⸙⸚⸛⸜⸝⸞⸟⸠⸡⸢⸣⸤⸥⸦⸧⸨⸩⸪⸫⸬⸭⸮⸰⸱⸲⸳⸴⸵⸶⸷⸸⸹⸺⸻⸼⸽⸾⸿⹀⹁⹂⹃⹄、。〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〽゠・꓾꓿꘍꘎꘏꙳꙾꛲꛳꛴꛵꛶꛷꡴꡵꡶꡷꣎꣏꣸꣹꣺꣼꤮꤯꥟꧁꧂꧃꧄꧅꧆꧇꧈꧉꧊꧋꧌꧍꧞꧟꩜꩝꩞꩟꫞꫟꫰꫱꯫﴾﴿︐︑︒︓︔︕︖︗︘︙︰︱︲︳︴︵︶︷︸︹︺︻︼︽︾︿﹀﹁﹂﹃﹄﹅﹆﹇﹈﹉﹊﹋﹌﹍﹎﹏﹐﹑﹒﹔﹕﹖﹗﹘﹙﹚﹛﹜﹝﹞﹟﹠﹡﹣﹨﹪﹫!"#%&'()*,-./:;?@[\]_{}⦅⦆。「」、・𐄀𐄁𐄂𐎟𐏐𐕯𐡗𐤟𐤿𐩐𐩑𐩒𐩓𐩔𐩕𐩖𐩗𐩘𐩿𐫰𐫱𐫲𐫳𐫴𐫵𐫶𐬹𐬺𐬻𐬼𐬽𐬾𐬿𐮙𐮚𐮛𐮜𑁇𑁈𑁉𑁊𑁋𑁌𑁍𑂻𑂼𑂾𑂿𑃀𑃁𑅀𑅁𑅂𑅃𑅴𑅵𑇅𑇆𑇇𑇈𑇉𑇍𑇛𑇝𑇞𑇟𑈸𑈹𑈺𑈻𑈼𑈽𑊩𑑋𑑌𑑍𑑎𑑏𑑛𑑝𑓆𑗁𑗂𑗃𑗄𑗅𑗆𑗇𑗈𑗉𑗊𑗋𑗌𑗍𑗎𑗏𑗐𑗑𑗒𑗓𑗔𑗕𑗖𑗗𑙁𑙂𑙃𑙠𑙡𑙢𑙣𑙤𑙥𑙦𑙧𑙨𑙩𑙪𑙫𑙬𑜼𑜽𑜾𑱁𑱂𑱃𑱄𑱅𑱰𑱱𒑰𒑱𒑲𒑳𒑴𖩮𖩯𖫵𖬷𖬸𖬹𖬺𖬻𖭄𛲟𝪇𝪈𝪉𝪊𝪋𞥞𞥟'
But what about numbers? Aren't there only 10 of them?
You may treat different Number categories (Nd, Nl, No) differently, depending on your application.
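For example, an Arabic-Indic digit is a decimal digit (Nd), a Roman numeral is a letterlike numeral (Nl), and a vulgar fraction is "other" (No); unicodedata.numeric reports the numeric value of each:

```python
import unicodedata

for char in ('7', '٣', 'Ⅻ', '½'):
    print(char, unicodedata.category(char), unicodedata.numeric(char))
# 7 Nd 7.0
# ٣ Nd 3.0
# Ⅻ Nl 12.0
# ½ No 0.5
```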
''.join(sorted(numbers))
'0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨੩੪੫੬੭੮੯૦૧૨૩૪૫૬૭૮૯୦୧୨୩୪୫୬୭୮୯௦௧௨௩௪௫௬௭௮௯౦౧౨౩౪౫౬౭౮౯೦೧೨೩೪೫೬೭೮೯൦൧൨൩൪൫൬൭൮൯෦෧෨෩෪෫෬෭෮෯๐๑๒๓๔๕๖๗๘๙໐໑໒໓໔໕໖໗໘໙༠༡༢༣༤༥༦༧༨༩၀၁၂၃၄၅၆၇၈၉႐႑႒႓႔႕႖႗႘႙ᛮᛯᛰ០១២៣៤៥៦៧៨៩᠐᠑᠒᠓᠔᠕᠖᠗᠘᠙᥆᥇᥈᥉᥊᥋᥌᥍᥎᥏᧐᧑᧒᧓᧔᧕᧖᧗᧘᧙᪀᪁᪂᪃᪄᪅᪆᪇᪈᪉᪐᪑᪒᪓᪔᪕᪖᪗᪘᪙᭐᭑᭒᭓᭔᭕᭖᭗᭘᭙᮰᮱᮲᮳᮴᮵᮶᮷᮸᮹᱀᱁᱂᱃᱄᱅᱆᱇᱈᱉᱐᱑᱒᱓᱔᱕᱖᱗᱘᱙ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩⅪⅫⅬⅭⅮⅯⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹⅺⅻⅼⅽⅾⅿↀↁↂↅↆↇↈ〇〡〢〣〤〥〦〧〨〩〸〹〺꘠꘡꘢꘣꘤꘥꘦꘧꘨꘩ꛦꛧꛨꛩꛪꛫꛬꛭꛮꛯ꣐꣑꣒꣓꣔꣕꣖꣗꣘꣙꤀꤁꤂꤃꤄꤅꤆꤇꤈꤉꧐꧑꧒꧓꧔꧕꧖꧗꧘꧙꧰꧱꧲꧳꧴꧵꧶꧷꧸꧹꩐꩑꩒꩓꩔꩕꩖꩗꩘꩙꯰꯱꯲꯳꯴꯵꯶꯷꯸꯹0123456789𐅀𐅁𐅂𐅃𐅄𐅅𐅆𐅇𐅈𐅉𐅊𐅋𐅌𐅍𐅎𐅏𐅐𐅑𐅒𐅓𐅔𐅕𐅖𐅗𐅘𐅙𐅚𐅛𐅜𐅝𐅞𐅟𐅠𐅡𐅢𐅣𐅤𐅥𐅦𐅧𐅨𐅩𐅪𐅫𐅬𐅭𐅮𐅯𐅰𐅱𐅲𐅳𐅴𐍁𐍊𐏑𐏒𐏓𐏔𐏕𐒠𐒡𐒢𐒣𐒤𐒥𐒦𐒧𐒨𐒩𑁦𑁧𑁨𑁩𑁪𑁫𑁬𑁭𑁮𑁯𑃰𑃱𑃲𑃳𑃴𑃵𑃶𑃷𑃸𑃹𑄶𑄷𑄸𑄹𑄺𑄻𑄼𑄽𑄾𑄿𑇐𑇑𑇒𑇓𑇔𑇕𑇖𑇗𑇘𑇙𑋰𑋱𑋲𑋳𑋴𑋵𑋶𑋷𑋸𑋹𑑐𑑑𑑒𑑓𑑔𑑕𑑖𑑗𑑘𑑙𑓐𑓑𑓒𑓓𑓔𑓕𑓖𑓗𑓘𑓙𑙐𑙑𑙒𑙓𑙔𑙕𑙖𑙗𑙘𑙙𑛀𑛁𑛂𑛃𑛄𑛅𑛆𑛇𑛈𑛉𑜰𑜱𑜲𑜳𑜴𑜵𑜶𑜷𑜸𑜹𑣠𑣡𑣢𑣣𑣤𑣥𑣦𑣧𑣨𑣩𑱐𑱑𑱒𑱓𑱔𑱕𑱖𑱗𑱘𑱙𒐀𒐁𒐂𒐃𒐄𒐅𒐆𒐇𒐈𒐉𒐊𒐋𒐌𒐍𒐎𒐏𒐐𒐑𒐒𒐓𒐔𒐕𒐖𒐗𒐘𒐙𒐚𒐛𒐜𒐝𒐞𒐟𒐠𒐡𒐢𒐣𒐤𒐥𒐦𒐧𒐨𒐩𒐪𒐫𒐬𒐭𒐮𒐯𒐰𒐱𒐲𒐳𒐴𒐵𒐶𒐷𒐸𒐹𒐺𒐻𒐼𒐽𒐾𒐿𒑀𒑁𒑂𒑃𒑄𒑅𒑆𒑇𒑈𒑉𒑊𒑋𒑌𒑍𒑎𒑏𒑐𒑑𒑒𒑓𒑔𒑕𒑖𒑗𒑘𒑙𒑚𒑛𒑜𒑝𒑞𒑟𒑠𒑡𒑢𒑣𒑤𒑥𒑦𒑧𒑨𒑩𒑪𒑫𒑬𒑭𒑮𖩠𖩡𖩢𖩣𖩤𖩥𖩦𖩧𖩨𖩩𖭐𖭑𖭒𖭓𖭔𖭕𖭖𖭗𖭘𖭙𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡𝟢𝟣𝟤𝟥𝟦𝟧𝟨𝟩𝟪𝟫𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿𞥐𞥑𞥒𞥓𞥔𞥕𞥖𞥗𞥘𞥙'
What about spaces?
len(spaces)
22
''.join(sorted(spaces))
'\t\n\r \xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'
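Most of these are hard to tell apart by eye; unicodedata.name helps identify them:

```python
import unicodedata

for char in ('\xa0', '\u2003', '\u3000'):
    print(hex(ord(char)), unicodedata.name(char))
# 0xa0 NO-BREAK SPACE
# 0x2003 EM SPACE
# 0x3000 IDEOGRAPHIC SPACE
```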
Just for reference, how many letters do we have?
len(letters)
118863
Let's get back to our word counter.
from collections import Counter
import re
class WordCounterV2:
    def __init__(self):
        # Get all our Unicode sets
        (self.punctuation, self.letters, self.numbers, self.spaces, self.control) = \
            calculateUnicodeSets()
        # Translation table to remove punctuation
        self.punctuationTranslation = ''.maketrans('', '', ''.join(self.punctuation))
        # Translation table to remove control characters
        self.controlTranslation = ''.maketrans('', '', ''.join(self.control))
        # Regex matching any whitespace character
        self.whitespacesRegex = '|'.join(map(re.escape, self.spaces))

class WordCounterV2(WordCounterV2):
    def CountWords(self, text):
        counter = Counter()
        # Remove control characters from the string
        text = text.translate(self.controlTranslation)
        # Split text into words by whitespace
        for word in re.split(self.whitespacesRegex, text):
            # Remove all punctuation
            word = word.translate(self.punctuationTranslation)
            if len(word) == 0:
                # Skip empty words
                continue
            # Create a set of the word's characters;
            # that makes it easy to compare against the letter and number sets
            wordSet = set(word)
            # Words that consist only of letters count as words
            if wordSet.issubset(self.letters):
                # Using casefold to get a more stable case-insensitive variant
                counter[word.casefold()] += 1
            # Words that consist only of digits count as __number__
            elif wordSet.issubset(self.numbers):
                counter['__number__'] += 1
            # Everything else goes to the __other__ bucket
            else:
                counter['__other__'] += 1
        return counter
wordCounter = WordCounterV2()
wordCounter.CountWords( 'Hello, World!' )
Counter({'hello': 1, 'world': 1})
wordCounter.CountWords( 'Version 2 is working much better for i18n, i wonder do we really need version 3.' )
Counter({'__number__': 2,
'__other__': 1,
'better': 1,
'do': 1,
'for': 1,
'i': 1,
'is': 1,
'much': 1,
'need': 1,
'really': 1,
'version': 2,
'we': 1,
'wonder': 1,
'working': 1})
wordCounter.CountWords( 'Текст, который не понимает большинство присутсвующих.' )
Counter({'большинство': 1,
'который': 1,
'не': 1,
'понимает': 1,
'присутсвующих': 1,
'текст': 1})
Does our algorithm produce similar results on similar-looking strings?
str1 = 'Tôi chỉ cần văn bản với dấu phụ'
str2 = 'Tôi chỉ cần văn bản với dấu phụ'
count1 = wordCounter.CountWords( str1 )
count2 = wordCounter.CountWords( str2 )
count1 + count2
Counter({'bản': 1,
'bản': 1,
'cần': 1,
'chỉ': 1,
'chỉ': 1,
'cần': 1,
'dấu': 1,
'dấu': 1,
'phụ': 1,
'phụ': 1,
'tôi': 1,
'tôi': 1,
'văn': 1,
'với': 1,
'văn': 1,
'với': 1})
What happened in the previous example?
There are several ways to write the same text in Unicode.
','.join(list(str1))
'T,ô,i, ,c,h,ỉ, ,c,ầ,n, ,v,ă,n, ,b,ả,n, ,v,ớ,i, ,d,ấ,u, ,p,h,ụ'
','.join(list(str2))
'T,o,̂,i, ,c,h,i,̉, ,c,a,̂,̀,n, ,v,a,̆,n, ,b,a,̉,n, ,v,o,̛,́,i, ,d,a,̂,́,u, ,p,h,u,̣'
http://unicode.org/reports/tr15/
Canonical equivalence is a fundamental equivalency between characters or sequences of characters which represent the same abstract character, and which when correctly displayed should always have the same visual appearance and behavior.
Compatibility equivalence is a weaker type of equivalence between characters or sequences of characters which represent the same abstract character (or sequence of abstract characters), but which may have distinct visual appearances or behaviors.
Four types of normalization:
NFD: canonical decomposition
NFC: canonical decomposition, followed by canonical composition
NFKD: compatibility decomposition
NFKC: compatibility decomposition, followed by canonical composition
Which type of normalization to use depends on your application and business requirements.
We will stick to NFKD in this example.
import unicodedata
str1Normalized = unicodedata.normalize("NFKD", str1)
str2Normalized = unicodedata.normalize("NFKD", str2)
','.join(list(str1Normalized))
'T,o,̂,i, ,c,h,i,̉, ,c,a,̂,̀,n, ,v,a,̆,n, ,b,a,̉,n, ,v,o,̛,́,i, ,d,a,̂,́,u, ,p,h,u,̣'
','.join(list(str2Normalized))
'T,o,̂,i, ,c,h,i,̉, ,c,a,̂,̀,n, ,v,a,̆,n, ,b,a,̉,n, ,v,o,̛,́,i, ,d,a,̂,́,u, ,p,h,u,̣'
str1 == str2
False
str1Normalized == str2Normalized
True
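The K (compatibility) forms go further than canonical equivalence: they also fold things like ligatures and superscripts into their plain counterparts, which the canonical forms leave untouched. Whether that folding is desirable depends on your application:

```python
import unicodedata

ligature = '\ufb01'  # LATIN SMALL LIGATURE FI
print(unicodedata.normalize('NFD', ligature))   # 'ﬁ' (canonical form keeps the ligature)
print(unicodedata.normalize('NFKD', ligature))  # 'fi' (compatibility form folds it)
print(unicodedata.normalize('NFKD', '\xb2'))    # '2' (superscript two becomes a plain digit)
```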
from collections import Counter
import re
import unicodedata

class WordCounterV3:
    def __init__(self):
        # Get all our Unicode sets
        (self.punctuation, self.letters, self.numbers, self.spaces, self.control) = \
            calculateUnicodeSets()
        # Translation table to remove punctuation
        self.punctuationTranslation = ''.maketrans('', '', ''.join(self.punctuation))
        # Translation table to remove control characters
        self.controlTranslation = ''.maketrans('', '', ''.join(self.control))
        # Regex matching any whitespace character
        self.whitespacesRegex = '|'.join(map(re.escape, self.spaces))

class WordCounterV3(WordCounterV3):
    def CountWords(self, text):
        counter = Counter()
        # Normalize text
        text = unicodedata.normalize("NFKD", text)
        # Remove control characters from the string
        text = text.translate(self.controlTranslation)
        # Split text into words by whitespace
        for word in re.split(self.whitespacesRegex, text):
            # Remove all punctuation
            word = word.translate(self.punctuationTranslation)
            if len(word) == 0:
                # Skip empty words
                continue
            # Create a set of the word's characters;
            # that makes it easy to compare against the letter and number sets
            wordSet = set(word)
            # Words that consist only of letters count as words
            if wordSet.issubset(self.letters):
                # Using casefold to get a more stable case-insensitive variant
                counter[word.casefold()] += 1
            # Words that consist only of digits count as __number__
            elif wordSet.issubset(self.numbers):
                counter['__number__'] += 1
            # Everything else goes to the __other__ bucket
            else:
                counter['__other__'] += 1
        return counter
wordCounter = WordCounterV3()
wordCounter.CountWords( 'Hello, World!' )
Counter({'hello': 1, 'world': 1})
Let's try the new version on the previous example
count1Normalized = wordCounter.CountWords( str1 )
count2Normalized = wordCounter.CountWords( str2 )
count1Normalized + count2Normalized
Counter({'bản': 2,
'cần': 2,
'chỉ': 2,
'dấu': 2,
'phụ': 2,
'tôi': 2,
'văn': 2,
'với': 2})
Can we combine our new results with the previous ones?
count1Normalized + count2Normalized + count1 + count2
Counter({'bản': 3,
'bản': 1,
'cần': 3,
'chỉ': 3,
'chỉ': 1,
'cần': 1,
'dấu': 3,
'dấu': 1,
'phụ': 3,
'phụ': 1,
'tôi': 3,
'tôi': 1,
'văn': 3,
'với': 3,
'văn': 1,
'với': 1})
Is whitespace word-breaking enough for our needs?
Sometimes even in English we may want a more complex solution.
wordCounter.CountWords("Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.")
Counter({'about': 1,
'amusing': 1,
'arent': 1,
'boys': 1,
'capital': 1,
'chiles': 1,
'mr': 1,
'oneill': 1,
'stories': 1,
'that': 1,
'the': 1,
'thinks': 1})
Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
The major question of the tokenization phase is what are the correct tokens to use? In the beginning, it looks fairly trivial: you chop on whitespace and throw away punctuation characters.
This is a starting point, but even for English there are a number of tricky cases. For example, what do you do about the various uses of the apostrophe for possession and contractions?
For many languages, tokenization requires lexical analysis of the string, because we simply cannot split on whitespace.
German: a lot of compound words
# law for the delegation of monitoring beef labelling
wordCounter.CountWords('Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz')
Counter({'rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz': 1})
Chinese: doesn't use spaces at all
wordCounter.CountWords('中文几乎没有空格')
Counter({'中文几乎没有空格': 1})
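Real Chinese tokenizers rely on a lexicon (plus statistics or ML on top); a toy greedy longest-match segmenter sketches the basic idea. The four-entry dictionary here is made up purely for illustration:

```python
def maxMatch(text, lexicon, maxWordLen=4):
    """Greedy longest-match segmentation: a toy sketch, not a production tokenizer."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        for j in range(min(len(text), i + maxWordLen), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

lexicon = {'中文', '几乎', '没有', '空格'}
print(maxMatch('中文几乎没有空格', lexicon))  # ['中文', '几乎', '没有', '空格']
```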
Bad news:
Good news - you can find tokenizers for many languages
Where to look:
from collections import Counter
import unicodedata
from nltk.tokenize import word_tokenize

class WordCounterV4:
    def __init__(self):
        # Get all our Unicode sets
        (self.punctuation, self.letters, self.numbers, self.spaces, self.control) = \
            calculateUnicodeSets()
        # Translation table to remove punctuation
        self.punctuationTranslation = ''.maketrans('', '', ''.join(self.punctuation))
        # Translation table to remove control characters
        self.controlTranslation = ''.maketrans('', '', ''.join(self.control))

class WordCounterV4(WordCounterV4):
    def CountWords(self, text, language):
        counter = Counter()
        # Normalize text
        text = unicodedata.normalize("NFKD", text)
        # Remove control characters from the string
        text = text.translate(self.controlTranslation)
        # Split text using the tokenizer
        for word in word_tokenize(text, language):
            # Remove all punctuation
            word = word.translate(self.punctuationTranslation)
            if len(word) == 0:
                # Skip empty words
                continue
            # Create a set of the word's characters;
            # that makes it easy to compare against the letter and number sets
            wordSet = set(word)
            # Words that consist only of letters count as words
            if wordSet.issubset(self.letters):
                # Using casefold to get a more stable case-insensitive variant
                counter[word.casefold()] += 1
            # Words that consist only of digits count as __number__
            elif wordSet.issubset(self.numbers):
                counter['__number__'] += 1
            # Everything else goes to the __other__ bucket
            else:
                counter['__other__'] += 1
        return counter
wordCounter = WordCounterV4()
wordCounter.CountWords( "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.", "English" )
Counter({'about': 1,
'amusing': 1,
'are': 1,
'boys': 1,
'capital': 1,
'chile': 1,
'mr': 1,
'nt': 1,
'oneill': 1,
's': 1,
'stories': 1,
'that': 1,
'the': 1,
'thinks': 1})
For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization.
Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.
Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun.
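The contrast can be sketched in a few lines. The suffix rules and the tiny lemma lexicon below are invented for illustration; real stemmers (like Snowball, used next) and lemmatizers are far more elaborate:

```python
def crudeStem(word):
    # Stemming: blindly chop a known suffix (toy rules, invented for illustration)
    for suffix in ('ization', 'ing', 'ies', 'es', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Lemmatization: dictionary lookup that needs the part of speech (toy lexicon)
lemmas = {('saw', 'verb'): 'see', ('saw', 'noun'): 'saw'}

def lemmatize(word, pos):
    return lemmas.get((word, pos), word)

print(crudeStem('organizing'))   # 'organiz'
print(lemmatize('saw', 'verb'))  # 'see'
print(lemmatize('saw', 'noun'))  # 'saw'
```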
from collections import Counter
import unicodedata
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer

class WordCounterV5:
    def __init__(self):
        # Get all our Unicode sets
        (self.punctuation, self.letters, self.numbers, self.spaces, self.control) = \
            calculateUnicodeSets()
        # Translation table to remove punctuation
        self.punctuationTranslation = ''.maketrans('', '', ''.join(self.punctuation))
        # Translation table to remove control characters
        self.controlTranslation = ''.maketrans('', '', ''.join(self.control))

class WordCounterV5(WordCounterV5):
    def CountWords(self, text, language):
        counter = Counter()
        # Stemmer for our words
        stemmer = SnowballStemmer(language.lower())
        # Normalize text
        text = unicodedata.normalize("NFKD", text)
        # Remove control characters from the string
        text = text.translate(self.controlTranslation)
        # Split text using the tokenizer
        for word in word_tokenize(text, language):
            # Remove all punctuation
            word = word.translate(self.punctuationTranslation)
            # Stem the word
            word = stemmer.stem(word)
            if len(word) == 0:
                # Skip empty words
                continue
            # Create a set of the word's characters;
            # that makes it easy to compare against the letter and number sets
            wordSet = set(word)
            # Words that consist only of letters count as words
            if wordSet.issubset(self.letters):
                # Using casefold to get a more stable case-insensitive variant
                counter[word.casefold()] += 1
            # Words that consist only of digits count as __number__
            elif wordSet.issubset(self.numbers):
                counter['__number__'] += 1
            # Everything else goes to the __other__ bucket
            else:
                counter['__other__'] += 1
        return counter
wordCounter = WordCounterV5()
wordCounter.CountWords( "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.", "English" )
Counter({'about': 1,
'amus': 1,
'are': 1,
'boy': 1,
'capit': 1,
'chile': 1,
'mr': 1,
'nt': 1,
'oneil': 1,
's': 1,
'stori': 1,
'that': 1,
'the': 1,
'think': 1})
This notebook: https://github.com/TonyMas/GlobalText/
This presentation: https://tonymas.github.io/GlobalText/