We needed to normalize diacritic characters to standard English characters. Diacritic characters are accented or otherwise extended variants of the letters in the basic modern Latin alphabet, i.e. A-Z. Scandinavian diacritics such as å, ä and ö should become a, a and o in normalized form. Spanish diacritics such as ó, ñ and ç should be normalized to o, n and c. The Unicode normalization forms can be found here and the normalization charts here.
Since we mostly use PowerShell these days, I've written a nice Convert-DiacriticCharacters PowerShell function.
function Convert-DiacriticCharacters {
    param(
        [string]$inputString
    )
    # Decompose each accented character into a base character plus combining marks
    [string]$formD = $inputString.Normalize(
        [System.Text.NormalizationForm]::FormD
    )
    $stringBuilder = New-Object System.Text.StringBuilder
    $nonSpacingMark = [System.Globalization.UnicodeCategory]::NonSpacingMark
    for ($i = 0; $i -lt $formD.Length; $i++) {
        $unicodeCategory = [System.Globalization.CharUnicodeInfo]::GetUnicodeCategory($formD[$i])
        # Keep everything except the combining marks (the diacritics themselves)
        if ($unicodeCategory -ne $nonSpacingMark) {
            $stringBuilder.Append($formD[$i]) | Out-Null
        }
    }
    # Recompose whatever is left into the canonical composed form
    $stringBuilder.ToString().Normalize([System.Text.NormalizationForm]::FormC)
}
The resulting function will convert diacritics in the following way:
PS C:\> Convert-DiacriticCharacters "Ångström"
Angstrom
PS C:\> Convert-DiacriticCharacters "Ó señor"
O senor
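To see why the function works, it can help to look at what FormD does to a single character. The snippet below is a minimal sketch; Get-CodeUnits is a hypothetical helper that is not part of the original function, and Å is written as a code point so the result does not depend on how the script file is encoded.

```powershell
# Hypothetical helper: lists the UTF-16 code units of a string as U+XXXX.
function Get-CodeUnits {
    param([string]$inputString)
    ($inputString.ToCharArray() | ForEach-Object { 'U+{0:X4}' -f [int]$_ }) -join ' '
}

# U+00C5 is the precomposed letter A with ring above
$precomposed = [string][char]0x00C5
$decomposed  = $precomposed.Normalize([System.Text.NormalizationForm]::FormD)

Get-CodeUnits $precomposed   # U+00C5
Get-CodeUnits $decomposed    # U+0041 U+030A

# The second code unit is the combining ring, which the loop filters out
[System.Globalization.CharUnicodeInfo]::GetUnicodeCategory($decomposed[1])   # NonSpacingMark
```

After decomposition the base letter A and the combining ring are separate characters, so dropping every NonSpacingMark leaves just the A.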
In our Identity Management projects we encounter issues like these as soon as we deal with global companies. Many systems can't handle Unicode characters, or diacritic characters from various non-Unicode code pages. In our case we were writing code to provision users in RACF, and RACF couldn't handle the characters.
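Before handing a value to a system like this, it may be worth checking that nothing outside the plain ASCII range survived the conversion. The sketch below assumes the target accepts 7-bit ASCII only; the post does not state RACF's actual character rules, and Test-IsAscii is a hypothetical helper name.

```powershell
# Hypothetical helper: returns $true when every character is 7-bit ASCII.
# Assumption: "safe for the target system" means code points 0x00-0x7F only.
function Test-IsAscii {
    param([string]$inputString)
    # -cmatch keeps the match case-sensitive, so case folding cannot let
    # non-ASCII look-alikes slip through the character class
    $inputString -cmatch '\A[\x00-\x7F]*\z'
}

Test-IsAscii 'Angstrom'                               # True
Test-IsAscii ('{0}ngstrom' -f [char]0x00C5)           # False, starts with U+00C5
```

A check like this could run after Convert-DiacriticCharacters to flag names that still need manual attention.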
Johan - how nicely this works, very good. Do you know if the normalization can also handle the problem of characters that must be translated to two Latin characters (for example: "ß" translates to "ss"; "œ" to "oe" and "Ǣ" to "AE")?
Martyn (since I know it is you), I quickly checked and according to the Unicode standard there isn't a normalized form for the German character eszett ß (0x00DF). I didn't check the other characters but would assume the same is true for them. Have a look here.
http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
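Since normalization alone does not expand characters such as ß or œ, one possible workaround is an explicit replacement table applied alongside the mark stripping. The following is a minimal sketch, not part of the original function: Convert-ToAsciiLetters is a hypothetical name and the table is illustrative and incomplete. The keys are written as code points, and a typed dictionary is used because a plain PowerShell hashtable compares keys case-insensitively.

```powershell
# Hypothetical helper: strips combining marks and expands a few characters
# that have no canonical decomposition. The table is an assumption and
# deliberately incomplete.
function Convert-ToAsciiLetters {
    param(
        [string]$inputString
    )
    $replacements = New-Object 'System.Collections.Generic.Dictionary[char,string]'
    $replacements.Add([char]0x00DF, 'ss')   # ß
    $replacements.Add([char]0x0153, 'oe')   # œ
    $replacements.Add([char]0x0152, 'OE')   # Œ
    $replacements.Add([char]0x00E6, 'ae')   # æ
    $replacements.Add([char]0x00C6, 'AE')   # Æ
    $replacements.Add([char]0x00F8, 'o')    # ø
    $replacements.Add([char]0x00D8, 'O')    # Ø
    $replacements.Add([char]0x0142, 'l')    # ł
    $replacements.Add([char]0x0141, 'L')    # Ł

    $nonSpacingMark = [System.Globalization.UnicodeCategory]::NonSpacingMark
    $stringBuilder = New-Object System.Text.StringBuilder

    # Decompose first, so that e.g. Ǣ (U+01E2) becomes Æ plus a combining macron
    $formD = $inputString.Normalize([System.Text.NormalizationForm]::FormD)
    foreach ($char in $formD.ToCharArray()) {
        if ([System.Globalization.CharUnicodeInfo]::GetUnicodeCategory($char) -eq $nonSpacingMark) {
            continue   # drop the combining mark
        }
        if ($replacements.ContainsKey($char)) {
            $stringBuilder.Append($replacements[$char]) | Out-Null
        }
        else {
            $stringBuilder.Append($char) | Out-Null
        }
    }
    $stringBuilder.ToString().Normalize([System.Text.NormalizationForm]::FormC)
}
```

With this table, "Straße" should come out as Strasse and "Ǣ" as AE. Any character not in the table and without a decomposition passes through unchanged, so the table would need extending for each language encountered.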
Anonymous,
Long time no update. This function will do the job.
function Convert-ToLatinCharacters {
param(
[string]$inputString
)
[Text.Encoding]::ASCII.GetString([Text.Encoding]::GetEncoding("Cyrillic").GetBytes($inputString))
}
Shouldn't it be called "convertto-LatinCharacters"? Great work BTW!
Good job!
/P
Very nice, love what you've done!
I am looking to call this function within a script and have it check several strings in a CSV. It doesn't seem to work when calling the function this way. Any pointers would be great.
import-csv filepath | foreach {
$convertedName = Convert-DiacriticCharacters -inputString $name
}
sorted it, was the encoding of the csv file. Great function. Thanks
import-csv -Encoding Default filepath | foreach {
$convertedName = Convert-DiacriticCharacters -inputString $name
}
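For readers trying the same thing, here is a self-contained sketch of the CSV case. It rests on a few assumptions that are not in the comments above: the file has a column called Name, each row is read through `$_.Name` (the snippets use `$name`, which is presumably set elsewhere in the commenter's script), and the file is UTF-8. What `-Encoding Default` means differs between PowerShell versions (the system ANSI code page in Windows PowerShell, UTF-8 in PowerShell 6 and later, as far as I know), so stating the encoding explicitly may be the safer choice.

```powershell
function Convert-DiacriticCharacters {
    param(
        [string]$inputString
    )
    [string]$formD = $inputString.Normalize([System.Text.NormalizationForm]::FormD)
    $stringBuilder = New-Object System.Text.StringBuilder
    $nonSpacingMark = [System.Globalization.UnicodeCategory]::NonSpacingMark
    for ($i = 0; $i -lt $formD.Length; $i++) {
        $unicodeCategory = [System.Globalization.CharUnicodeInfo]::GetUnicodeCategory($formD[$i])
        if ($unicodeCategory -ne $nonSpacingMark) {
            $stringBuilder.Append($formD[$i]) | Out-Null
        }
    }
    $stringBuilder.ToString().Normalize([System.Text.NormalizationForm]::FormC)
}

# Hypothetical sample file with a Name column, written as UTF-8 with a BOM.
# The name is built from code points so the script file's own encoding is irrelevant.
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) 'diacritics-demo.csv'
$lines = 'Name', ('{0}ngstr{1}m' -f [char]0x00C5, [char]0x00F6)
[System.IO.File]::WriteAllLines($csvPath, [string[]]$lines, (New-Object System.Text.UTF8Encoding $true))

# Read with the encoding the file was actually written in, then convert each row
$converted = Import-Csv -Path $csvPath -Encoding UTF8 | ForEach-Object {
    Convert-DiacriticCharacters -inputString $_.Name
}
$converted   # Angstrom

Remove-Item $csvPath
```

If the encoding passed to Import-Csv does not match the file, the accented characters are already damaged before the function sees them, which matches the symptom described above.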
What a small world...!! coming across this from you two?? How things have changed since our Finchampstead days....
It doesn't deal with ąćłńóśźż.... (Polish diacritics) - pity...
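A quick way to narrow this down is to check which of those letters actually change under FormD, since only letters with a canonical decomposition can be handled by stripping combining marks. This is a diagnostic sketch only; Test-HasDecomposition is a hypothetical helper, and the letters are written as code points to rule out script-encoding problems.

```powershell
# Hypothetical helper: $true when FormD changes the character, i.e. it has a
# canonical decomposition that a mark-stripping loop can work with.
function Test-HasDecomposition {
    param([char]$character)
    $text = [string]$character
    $decomposed = $text.Normalize([System.Text.NormalizationForm]::FormD)
    # Ordinal comparison, because culture-aware comparison may treat the
    # composed and decomposed forms as equal
    -not [string]::Equals($text, $decomposed, [System.StringComparison]::Ordinal)
}

# ą ć ł ń ó ś ź ż as code points
$polish = 0x0105, 0x0107, 0x0142, 0x0144, 0x00F3, 0x015B, 0x017A, 0x017C
$withoutDecomposition = $polish | Where-Object { -not (Test-HasDecomposition ([char]$_)) }
$withoutDecomposition | ForEach-Object { 'U+{0:X4}' -f $_ }   # U+0142
```

Only ł (U+0142) is listed, because its stroke is part of the letter rather than a combining mark. The other seven letters decompose into a base letter plus a mark, so if they also come through unconverted, the input encoding would be the first thing to check.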