CosmosKey: PowerShell function: Convert-DiacriticCharacters

Thursday, September 24, 2009

PowerShell function: Convert-DiacriticCharacters

We needed to normalize diacritic characters to standard English characters. Diacritic characters are accented or otherwise extended variants of the letters of the basic modern Latin alphabet, i.e. A-Z. Scandinavian diacritics such as å, ä and ö should become a, a and o in normalized form. Spanish diacritics such as ó, ñ and ç should be normalized to o, n and c. The Unicode normalization forms are specified in Unicode Standard Annex #15, and the Unicode Consortium also publishes normalization charts.

Since we mostly use PowerShell these days, I've written a nice Convert-DiacriticCharacters PowerShell function.

function Convert-DiacriticCharacters {
    param(
        [string]$inputString
    )

    # Decompose each accented character into its base letter followed by combining marks.
    [string]$formD = $inputString.Normalize([System.Text.NormalizationForm]::FormD)

    $stringBuilder = New-Object System.Text.StringBuilder
    $nonSpacingMark = [System.Globalization.UnicodeCategory]::NonSpacingMark

    for ($i = 0; $i -lt $formD.Length; $i++) {
        $unicodeCategory = [System.Globalization.CharUnicodeInfo]::GetUnicodeCategory($formD[$i])

        # Keep everything except the combining (non-spacing) marks.
        if ($unicodeCategory -ne $nonSpacingMark) {
            $stringBuilder.Append($formD[$i]) | Out-Null
        }
    }

    # Recompose whatever remains into its canonical composed form.
    $stringBuilder.ToString().Normalize([System.Text.NormalizationForm]::FormC)
}

The resulting function will convert diacritics in the following way:

   PS C:\> Convert-DiacriticCharacters "Ångström"
   Angstrom
   PS C:\> Convert-DiacriticCharacters "Ó señor"
   O senor
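
To see why this works, it can help to look at what FormD actually produces. The sketch below is not part of the function above; Get-DecomposedCharacters is a hypothetical helper name. It lists each character of the decomposed string with its code point and Unicode category, so the combining marks that get dropped become visible.

```powershell
# Hypothetical helper, not from the original post: shows the FormD
# decomposition that Convert-DiacriticCharacters relies on.
function Get-DecomposedCharacters {
    param(
        [string]$inputString
    )

    $formD = $inputString.Normalize([System.Text.NormalizationForm]::FormD)

    foreach ($char in $formD.ToCharArray()) {
        [pscustomobject]@{
            Character = $char
            CodePoint = 'U+{0:X4}' -f [int]$char
            Category  = [System.Globalization.CharUnicodeInfo]::GetUnicodeCategory($char)
        }
    }
}

# "Å" (U+00C5) decomposes into "A" plus U+030A COMBINING RING ABOVE.
# The second character has the category NonSpacingMark, so it is removed.
Get-DecomposedCharacters "Å"
```

Running the helper on a character that the function leaves untouched is a quick way to check whether that character has a decomposition at all.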

In our Identity Management projects we run into issues like these as soon as we deal with global companies. Many systems can't handle Unicode characters, or diacritic characters from various non-Unicode code pages. In our case we were writing some code to provision users in RACF, and RACF couldn't handle the characters.

11 comments:

Anonymous, 3 November 2010 at 22:11:00 GMT
Johan - how nicely this works, very good. Do you know if the normalization can also handle the problem of non-Latin characters that must be translated to two Latin characters (for example, "ß" translates to "ss", "œ" to "oe" and "Ǣ" to "AE")?

Johan Akerstrom, 8 November 2010 at 14:00:00 GMT
Martyn (since I know it is you), I quickly checked, and according to the Unicode standard there is no decomposed form for the German character Eszett, ß (U+00DF). I didn't check the other characters but would assume the same is true for them. Have a look here:

http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
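
Characters such as ß or œ have no canonical decomposition, so removing combining marks leaves them unchanged. One possible way to handle them is an explicit substitution table that is applied after decomposition and before the marks are removed. The sketch below assumes that approach. The function name Convert-SpecialLatinCharacters and the choice of mappings are illustrative assumptions, not something defined by the Unicode standard, and the table would need extending for real data.

```powershell
# Hypothetical helper; the substitution table is an illustrative assumption.
function Convert-SpecialLatinCharacters {
    param(
        [string]$inputString
    )

    # Code point of each base character and its assumed ASCII replacement.
    $replacements = @(
        @(0x00DF, 'ss'),   # ß
        @(0x0153, 'oe'),   # œ
        @(0x0152, 'OE'),   # Œ
        @(0x00E6, 'ae'),   # æ
        @(0x00C6, 'AE'),   # Æ
        @(0x0142, 'l'),    # ł
        @(0x0141, 'L')     # Ł
    )

    # Decompose first, so that e.g. Ǣ becomes Æ followed by a combining macron.
    $result = $inputString.Normalize([System.Text.NormalizationForm]::FormD)

    # String.Replace is case-sensitive, so upper and lower case stay distinct.
    foreach ($pair in $replacements) {
        $result = $result.Replace([string][char]$pair[0], $pair[1])
    }

    # Remove the combining marks (Unicode category Mn) and recompose.
    ($result -replace '\p{Mn}', '').Normalize([System.Text.NormalizationForm]::FormC)
}

Convert-SpecialLatinCharacters "Straße"
```

With this table, "Straße" comes out as "Strasse". Decomposing before substituting matters for "Ǣ": it only matches the Æ entry once FormD has split off the macron.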

Johan Akerstrom, 29 August 2012 at 16:09:00 BST
Anonymous,

Long time no update. This function will do the job.

function Convert-ToLatinCharacters {
    param(
        [string]$inputString
    )

    # Encode to the "Cyrillic" code page, then read the bytes back as ASCII.
    [Text.Encoding]::ASCII.GetString([Text.Encoding]::GetEncoding("Cyrillic").GetBytes($inputString))
}

Anonymous, 20 March 2014 at 22:45:00 GMT
Good job!
/P

Anonymous, 27 October 2015 at 12:45:00 GMT
Very nice, love what you've done!

Anonymous, 26 May 2017 at 10:44:00 BST
I am looking to call this function within a script and have it check several strings in a CSV. It doesn't seem to work when calling the function this way. Any pointers would be great.

import-csv filepath | foreach {

$convertedName = Convert-DiacriticCharacters -inputString $name
}
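
In the snippet above, $name is never assigned inside the loop, so the function receives an empty string. Inside ForEach-Object the current CSV row is available as $_, and each column is a property on it. The following sketch assumes the CSV has a column called Name; the column name and the sample rows are placeholders, and the compact function is only a stand-in so that the example runs on its own.

```powershell
# Compact stand-in for the post's function so the example is self-contained.
function Convert-DiacriticCharacters {
    param([string]$inputString)
    $formD = $inputString.Normalize([System.Text.NormalizationForm]::FormD)
    ($formD -replace '\p{Mn}', '').Normalize([System.Text.NormalizationForm]::FormC)
}

# Sample data; with a real file, Import-Csv -Path <file> would take its place.
$rows = @'
Name,Department
Ångström,Physics
Ó señor,Sales
'@ | ConvertFrom-Csv

$converted = $rows | ForEach-Object {
    # $_ is the current row; Name is the assumed column header.
    [pscustomobject]@{
        Name          = $_.Name
        ConvertedName = Convert-DiacriticCharacters -inputString $_.Name
    }
}

$converted
```

Emitting an object per row, rather than assigning to a variable inside the loop, keeps every converted value; a variable set inside the script block would only hold the last row once the pipeline finishes.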

Craig Cram, 30 June 2019 at 04:28:00 BST
What a small world...!! coming across this from you two?? How things have changed since our Finchampstead days....

Anonymous, 6 September 2019 at 17:12:00 BST
It doesn't deal with ąćłńóśźż.... (Polish diacritics) - pity...
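
Most of the Polish letters listed do have a canonical decomposition (ą, for example, is a followed by U+0328 COMBINING OGONEK), so removing non-spacing marks converts them. The exception is ł, which Unicode defines without a decomposition. A quick way to find such characters in your own data is to strip the marks and report whatever is still outside the ASCII range. The sketch below does that; Get-UnconvertedCharacters is a hypothetical helper name.

```powershell
# Hypothetical helper: lists characters that mark-stripping cannot convert.
function Get-UnconvertedCharacters {
    param([string]$inputString)

    $formD    = $inputString.Normalize([System.Text.NormalizationForm]::FormD)
    $stripped = $formD -replace '\p{Mn}', ''

    # Report each distinct character that is still outside the ASCII range.
    $stripped.ToCharArray() |
        Where-Object { [int]$_ -gt 127 } |
        Select-Object -Unique |
        ForEach-Object {
            [pscustomobject]@{
                Character = $_
                CodePoint = 'U+{0:X4}' -f [int]$_
            }
        }
}

Get-UnconvertedCharacters "ąćłńóśźż"
```

For the string above, only ł (U+0142) is reported. Characters found this way need an explicit replacement, since normalization alone will not change them.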