Thursday, September 24, 2009

PowerShell function: Convert-DiacriticCharacters


We needed to normalize diacritic characters to standard English characters. Diacritic characters are accented or otherwise extended variants of the letters in the basic modern Latin alphabet, i.e. A-Z. Scandinavian diacritics such as å, ä and ö should become a, a and o in normalized form. Spanish diacritics such as ó, ñ and ç should be normalized to o, n and c. The Unicode normalization forms can be found here and the normalization charts here.
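To make the idea concrete, here is a small sketch of what decomposition (normalization Form D) does to a single accented character. The variable names are only for illustration, and the character is built from its code point so the snippet does not depend on how the script file is encoded.

```powershell
# 'å' as one precomposed code point (U+00E5)
$composed = [string][char]0x00E5

# Form D splits it into the base letter plus a combining mark
$decomposed = $composed.Normalize([System.Text.NormalizationForm]::FormD)

$composed.Length     # 1
$decomposed.Length   # 2

# List the code points of the decomposed string
$codePoints = $decomposed.ToCharArray() | ForEach-Object { 'U+{0:X4}' -f [int]$_ }
$codePoints -join ' '   # U+0061 U+030A
```

The second code point, U+030A, is the combining ring. Dropping marks like this one and keeping the base letter is what turns å into a.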
 
Since we mostly use PowerShell these days, I've written a nice Convert-DiacriticCharacters PowerShell function.
 
function Convert-DiacriticCharacters {
    param(
        [string]$inputString
    )
    # Decompose each accented character into a base character plus combining marks
    [string]$formD = $inputString.Normalize(
            [System.Text.NormalizationForm]::FormD
    )
    $nonSpacingMark = [System.Globalization.UnicodeCategory]::NonSpacingMark
    $stringBuilder = New-Object System.Text.StringBuilder
    for ($i = 0; $i -lt $formD.Length; $i++){
        $unicodeCategory = [System.Globalization.CharUnicodeInfo]::GetUnicodeCategory($formD[$i])
        # Keep every character except the combining marks
        if($unicodeCategory -ne $nonSpacingMark){
            $stringBuilder.Append($formD[$i]) | Out-Null
        }
    }
    # Recompose whatever is left
    $stringBuilder.ToString().Normalize([System.Text.NormalizationForm]::FormC)
}
 
The resulting function will convert diacritics in the following way:
 
    PS C:\> Convert-DiacriticCharacters "Ångström"
    Angstrom
    PS C:\> Convert-DiacriticCharacters "Ó señor"
    O senor

In our Identity Management projects we run into issues like these as soon as we deal with global companies. Many systems can't handle Unicode characters, or diacritic characters from various non-Unicode code pages. In our case we were writing code to provision users in RACF, and RACF couldn't handle the characters.
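Before handing a value to a system like that, it can help to check that nothing unexpected is left in the string. The following is a minimal sketch, assuming the target only accepts 7-bit ASCII. That is an assumption for illustration, since the exact character set RACF accepts is not covered here, and Test-AsciiOnly is a hypothetical helper name.

```powershell
# Hypothetical helper: returns $true when every character is 7-bit ASCII
function Test-AsciiOnly {
    param(
        [string]$inputString
    )
    # The pattern matches any character outside U+0000..U+007F
    return -not ($inputString -cmatch '[^\u0000-\u007F]')
}

Test-AsciiOnly "Angstrom"                # True
Test-AsciiOnly "Stra$([char]0x00DF)e"    # False, the string still contains an Eszett
```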

11 comments:

  1. Johan - how nicely this works, very good. Do you know if the Normalization can also handle the problem of non-Latin characters that must be translated to two latin characters (for example: "ß" translates to "ss"; "œ" to "oe" and "Ǣ" to "AE") ?

  2. Martyn (since I know it is you), I quickly checked, and according to the Unicode standard there isn't a normalized form for the German character Eszett, ß (U+00DF). I didn't check the other characters but would assume the same is true for them. Have a look here.

    http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt

  3. Anonymous,

    Long time no update. This function will do the job.

    function Convert-ToLatinCharacters {
    param(
    [string]$inputString
    )
    [Text.Encoding]::ASCII.GetString([Text.Encoding]::GetEncoding("Cyrillic").GetBytes($inputString))
    }

    Replies
    1. Shouldn't it be called "convertto-LatinCharacters"? Great work BTW!

  4. Very nice, love what you've done!

  5. I am looking to call this function within a script and for it check several strings in a csv. It doesn't seem to work when calling the function this way. Any pointers would be great,

    import-csv filepath | foreach {

    $convertedName = Convert-DiacriticCharacters -inputString $name
    }

    Replies
    1. sorted it, was the encoding of the csv file. Great function. Thanks

      import-csv -Encoding Default filepath | foreach {

      $convertedName = Convert-DiacriticCharacters -inputString $name
      }

  6. Awesome article with astounding idea!Thank you for such an important article. I truly acknowledge for this awesome data.. camping generators

  7. What a small world...!! coming across this from you two?? How things have changed since our Finchampstead days....

  8. It doesn't deal with ąćłńóśźż.... (Polish diacritics) - pity...

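
Several of the comments above ask about characters such as ß, œ and ł. These have no canonical decomposition in Unicode, so stripping combining marks leaves them unchanged. Of the Polish letters listed in the last comment, ł is the one affected. Below is a minimal sketch of one way to cover such cases, assuming a hand-maintained replacement table. The function name Convert-UndecomposableCharacters and the table contents are hypothetical and only illustrate the approach.

```powershell
# Hypothetical helper, shown only as a sketch
function Convert-UndecomposableCharacters {
    param(
        [string]$inputString
    )
    # Illustrative table only; extend it with the characters you actually meet.
    # Characters are built from code points so the script file encoding does not matter.
    $pairs = @(
        @([char]0x00DF, 'ss'),   # Eszett
        @([char]0x0153, 'oe'),   # small ligature oe
        @([char]0x00C6, 'AE'),   # capital ligature AE
        @([char]0x0142, 'l'),    # small l with stroke
        @([char]0x0141, 'L')     # capital L with stroke
    )
    $result = $inputString
    foreach ($pair in $pairs) {
        # String.Replace is case-sensitive, so upper and lower case stay distinct
        $result = $result.Replace([string]$pair[0], $pair[1])
    }
    $result
}

Convert-UndecomposableCharacters "Stra$([char]0x00DF)e"   # Strasse
```

A helper like this could be applied to the output of Convert-DiacriticCharacters, so that the accents are stripped first and the remaining special letters are replaced afterwards.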