Welcome to CSharp Labs

Country Codes and Names in Over 50 Languages

Monday, May 27, 2013

Finding a well-maintained, accurate list of the world's country codes that is suitable for display proved difficult. I needed a source that could be checked periodically to update a local country code database. ISO maintains an excellent list of ISO 3166 country codes. Unfortunately, ISO lists all country names in upper case, which did not suit my needs. Eventually, I stumbled upon Unicode's Common Locale Data Repository (CLDR), which has a suitable list buried in its release archive and supports numerous languages.

The CLDR is a large collection of XML files that rely on inheritance to reduce their size and keep the data files simple. I created the CommonLocaleDataRepositoryArchive class to reconstruct the data files and extract country code data for a given locale identifier.

How it Works

The CLDR is downloaded as a compressed archive, and I opted to stream directly from this file using the .NET 4.5 System.IO.Compression.ZipArchive class. To stream the XML files together with their document type definitions (DTDs), I build a specialized XmlReaderSettings that preloads the DTD and then create each document's reader from those settings:

        /// <summary>
        /// Loads the specified file from an archive as a document type definition.
        /// </summary>
        /// <param name="archive">The archive to load from.</param>
        /// <param name="file">The file to load.</param>
        /// <returns>Reader settings suitable to initialize an XmlReader.</returns>
        private XmlReaderSettings LoadDocumentTypeDefinition(ZipArchive archive, string file)
        {
            ZipArchiveEntry entry = archive.GetEntry(file); //get the document type definition

            if (entry == null)
                throw new InvalidFileDataException("The zip file does not contain the DTD.");

            using (Stream docTypeStream = entry.Open()) //open the entry
            {
                //XmlPreloadedResolver to populate the document type definition
                XmlPreloadedResolver resolver = new XmlPreloadedResolver();
                resolver.Add(resolver.ResolveUri(null, "../../" + file), docTypeStream);

                return new XmlReaderSettings
                {
                    DtdProcessing = DtdProcessing.Parse,
                    ValidationType = ValidationType.DTD,
                    XmlResolver = resolver
                };
            }
        }

        /// <summary>
        /// Loads a document from an archive.
        /// </summary>
        /// <param name="archive">The archive to load from.</param>
        /// <param name="settings">The settings to create a reader from.</param>
        /// <param name="entry">The entry to load.</param>
        /// <returns>A document loaded from the archive.</returns>
        private XDocument LoadDocument(ZipArchive archive, XmlReaderSettings settings, ZipArchiveEntry entry)
        {
            if (entry == null)
                throw new InvalidFileDataException("The zip file does not contain expected data.");

            using (Stream mainStream = entry.Open())
            using (StreamReader mainReader = new StreamReader(mainStream, Encoding.UTF8))
            using (XmlReader xmlReader = XmlReader.Create(mainReader, settings))
                //create the reader and load the xml file
                return XDocument.Load(xmlReader);
        }
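
The preloading step can be tried without an archive. The following is a minimal sketch, not the class's actual code: the DTD and XML strings are made-up stand-ins for files in the CLDR archive, and the class name PreloadedDtdSketch is hypothetical. It registers the DTD under the same URI the reader will request, so no file or network access is needed to validate the document.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Resolvers;

public static class PreloadedDtdSketch
{
    // Hypothetical stand-in for a DTD stored in the archive.
    private const string Dtd =
        "<!ELEMENT ldml (territory*)>" +
        "<!ELEMENT territory (#PCDATA)>" +
        "<!ATTLIST territory type CDATA #REQUIRED>";

    // Hypothetical stand-in for a data file that references the DTD.
    private const string Xml =
        "<!DOCTYPE ldml SYSTEM \"ldml.dtd\">" +
        "<ldml><territory type=\"US\">United States</territory></ldml>";

    public static XDocument Load()
    {
        var resolver = new XmlPreloadedResolver();

        using (var dtdStream = new MemoryStream(Encoding.UTF8.GetBytes(Dtd)))
        {
            // The resolver copies the stream, so it can be disposed afterwards.
            // The URI must match what the reader resolves from the DOCTYPE.
            resolver.Add(resolver.ResolveUri(null, "ldml.dtd"), dtdStream);
        }

        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse,
            ValidationType = ValidationType.DTD,
            XmlResolver = resolver
        };

        using (var text = new StringReader(Xml))
        using (XmlReader reader = XmlReader.Create(text, settings))
            return XDocument.Load(reader); // throws if the document is invalid
    }
}
```

Because the reader is created without a base URI, both the registration and the DOCTYPE lookup resolve the relative name the same way, which is the same reason the "../../" prefix works in the method above.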

The CLDR uses distinguishing attributes and elements to identify inherited elements, and the CommonLocaleDataRepositoryArchive class loads these definitions from the supplemental metadata file. Reconstructing the data files required the NodeDefinitionComparer, which determines whether an element is a descendant of another, and a recursive method that creates each combined element and loads its values and attributes:

        /// <summary>
        /// Combines element nodes with the specified comparer.
        /// </summary>
        /// <param name="primaryElement">The primary element.</param>
        /// <param name="secondaryElement">The secondary element.</param>
        /// <param name="comparer">The comparer to use to compare elements.</param>
        /// <returns>A new element with combined sub-nodes.</returns>
        private XElement UnionElements(XElement primaryElement, XElement secondaryElement, NodeDefinitionComparer comparer)
        {
            //Initialize the new element with the primary element name and attributes
            XElement unionedElement = new XElement(primaryElement.Name);
            foreach (XAttribute attribute in primaryElement.Attributes())
                unionedElement.SetAttributeValue(attribute.Name, attribute.Value);

            //if primary element has nodes, compare child nodes to secondary element nodes
            if (primaryElement.HasElements)
            {
                //we use a dictionary with a comparer designed to distinguish elements
                Dictionary<NodeDefinition, XElement> elements = new Dictionary<NodeDefinition, XElement>(comparer);

                //enumerate secondary nodes
                foreach (XNode node in secondaryElement.Nodes())
                {
                    XElement element = node as XElement;

                    if (element != null)
                        //add NodeDefinition as a secondary node
                        elements.Add(new NodeDefinition { Element = element, Preferred = false }, element);
                }

                //enumerate primary nodes
                foreach (XNode node in primaryElement.Nodes())
                {
                    XElement element = node as XElement;

                    if (element != null)
                    {
                        //create NodeDefinition as a primary node
                        var definition = new NodeDefinition { Element = element, Preferred = true };

                        XElement inherited;

                        //if a compatible NodeDefinition exists
                        if (elements.TryGetValue(definition, out inherited))
                            //combine the elements
                            elements[definition] = UnionElements(element, inherited, comparer);
                        else
                            elements.Add(definition, element);
                    }
                }

                //add elements
                foreach (XElement element in elements.Values)
                    unionedElement.Add(element);
            }
            //if the secondary element has nodes, add them
            else if (secondaryElement.HasElements)
            {
                foreach (XNode node in secondaryElement.Nodes())
                {
                    XElement element = node as XElement;

                    if (element != null)
                        unionedElement.Add(element);
                }
            }
            else if (primaryElement.Value != string.Empty) //if the primary element has text, set the new element
                unionedElement.Value = primaryElement.Value;

            return unionedElement;
        }
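
To see the effect of the merge in isolation, here is a simplified sketch under a loud assumption: elements are matched by their name plus their "type" attribute only, whereas the real class reads the distinguishing attributes from the supplemental metadata and uses NodeDefinitionComparer. The class and method names (InheritanceMergeSketch, Merge, KeyOf) are hypothetical.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

public static class InheritanceMergeSketch
{
    // Assumption: name plus "type" attribute is enough to identify an element.
    private static string KeyOf(XElement element)
    {
        return element.Name + "|" + (string)element.Attribute("type");
    }

    // Child values win; anything the child lacks is inherited from the parent.
    public static XElement Merge(XElement child, XElement parent)
    {
        // Attributes that already belong to an element are cloned when added.
        var merged = new XElement(child.Name, child.Attributes());

        if (!child.HasElements && !parent.HasElements)
        {
            merged.Value = child.Value; // leaf element: keep the child's text
            return merged;
        }

        var byKey = new Dictionary<string, XElement>();
        var order = new List<string>(); // preserves document order

        foreach (XElement element in parent.Elements())
        {
            string key = KeyOf(element);
            if (!byKey.ContainsKey(key))
                order.Add(key);
            byKey[key] = element;
        }

        foreach (XElement element in child.Elements())
        {
            string key = KeyOf(element);
            XElement inherited;

            if (byKey.TryGetValue(key, out inherited))
                byKey[key] = Merge(element, inherited); // override recursively
            else
            {
                order.Add(key);
                byKey[key] = element;
            }
        }

        foreach (string key in order)
            merged.Add(byKey[key]);

        return merged;
    }
}
```

Merging a child that defines only the territories US and XK with a parent that defines US and GB yields all three territories, with the child's US name taking precedence.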

Country data is accessed by locale identifier, which can consist of a language (en), a language and region (en_US), or a language, script and region (en_Dsrt_US). The CommonLocaleDataRepositoryArchive builds the XML file for the specified locale by searching for resources in the following order: en_Dsrt_US → en_US → en → root. Once the file has been reconstructed, the territory sub-nodes are enumerated to produce a complete set of country data:

                //enumerate each territory node
                foreach (XElement node in repository.Element("ldml").Element("localeDisplayNames").Element("territories").Descendants("territory"))
                {
                    XAttribute type = node.Attribute("type");

                    if (type != null)
                    {
                        string code = type.Value; //country code

                        if (validCountryCode(code)) //validate country code
                        {
                            XAttribute alt = node.Attribute("alt");

                            if (alt == null)
                                //if no alt attribute, just add the country
                                countries.Add(new CountryCodeDefinition(code, node.Value, CountryNameStyle.Normal));
                            else
                            {
                                switch (alt.Value) //check alt attribute
                                {
                                    case "short": //short country name
                                        countries.Add(new CountryCodeDefinition(code, node.Value, CountryNameStyle.Short));
                                        break;
                                    case "variant": //variant country name
                                        countries.Add(new CountryCodeDefinition(code, node.Value, CountryNameStyle.Variant));
                                        break;
                                }
                            }
                        }
                    }
                }
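
The lookup order described above can be sketched as a small helper. This is an illustration only: the class and method names are hypothetical, and the order simply mirrors the en_Dsrt_US → en_US → en → root sequence given in this post. The LDML specification defines its own inheritance rules, which may differ for some locales.

```csharp
using System;
using System.Collections.Generic;

public static class LocaleFallbackSketch
{
    // Returns the resource names to search, most specific first.
    public static List<string> GetFallbackChain(string locale)
    {
        var chain = new List<string>();
        string[] parts = locale.Split('_');

        if (parts.Length == 3)
        {
            chain.Add(locale);                    // language_script_region
            chain.Add(parts[0] + "_" + parts[2]); // language_region
        }
        else if (parts.Length == 2)
        {
            chain.Add(locale);                    // language_region or language_script
        }

        if (parts[0] != "root")
            chain.Add(parts[0]);                  // language only

        chain.Add("root");                        // every locale ends at root
        return chain;
    }
}
```

Calling GetFallbackChain("en_Dsrt_US") returns en_Dsrt_US, en_US, en and root, in that order. Each file found along the chain would then be merged into the result, with the more specific file treated as the primary element.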

Using

The CommonLocaleDataRepositoryArchive is initialized with a path to the CLDR archive:

                using (var archive = new CommonLocaleDataRepositoryArchive("Path to Archive"))
                {
                    foreach (var entry in archive.GetCountryCodes("en"))
                        Console.WriteLine("Code: {0}, Name: {1}, Style: {2}", entry.Code, entry.Name, entry.Style);
                }

The CommonLocaleDataRepositoryArchive.GetCountryCodes method returns country code data for any valid locale identifier in the archive. Each country code appears once unless the CLDR contains alternative names for it, in which case the code is repeated for each short or variant name. Unique country codes can be obtained by filtering out the short and variant names:

                using (var archive = new CommonLocaleDataRepositoryArchive("Path to Archive"))
                {
                    foreach (var entry in archive.GetCountryCodes("en").Where(e => e.Style == CountryNameStyle.Normal))
                        Console.WriteLine("Code: {0}, Name: {1}, Style: {2}", entry.Code, entry.Name, entry.Style);
                }

Alternatively, specific short or variant names can be preferred over the normal names by iterating over the collection and keeping one entry per country code:

                using (var archive = new CommonLocaleDataRepositoryArchive("Path to Archive"))
                {
                    Dictionary<string, CountryCodeDefinition> entries = new Dictionary<string, CountryCodeDefinition>();

                    //collects desired short and variant country names
                    foreach (var entry in archive.GetCountryCodes("en"))
                    {
                        switch (entry.Style)
                        {
                            case CountryNameStyle.Short:

                                switch (entry.Code)
                                {
                                    case "PS": //prefers "Palestine" vs "Palestinian Territories"
                                        entries[entry.Code] = entry;
                                        break;
                                }

                                break;
                            case CountryNameStyle.Variant:

                                switch (entry.Code)
                                {
                                    case "CD": //prefers "Congo [DRC]" vs "Congo - Kinshasa"
                                    case "CG": //prefers "Congo [Republic]" vs "Congo - Brazzaville"
                                    case "CI": //prefers "Ivory Coast" vs "Côte d’Ivoire"
                                    case "MO": //prefers "Macau" vs "Macau SAR China"
                                    case "TL": //prefers "East Timor" vs "Timor-Leste"
                                        entries[entry.Code] = entry;
                                        break;
                                }

                                break;
                            default:

                                if (!entries.ContainsKey(entry.Code))
                                    entries[entry.Code] = entry;

                                break;
                        }
                    }

                    //output collected entries
                    foreach (CountryCodeDefinition entry in entries.Values)
                        Console.WriteLine("Code: {0}, Name: {1}, Style: {2}", entry.Code, entry.Name, entry.Style);
                }
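
If the list of exceptions grows, the nested switch statements could be replaced by a lookup table. The following sketch is one possible design, not part of the downloadable class: NameStyle and CountryName are hypothetical stand-ins for CountryNameStyle and CountryCodeDefinition so the example compiles on its own.

```csharp
using System.Collections.Generic;

// Hypothetical stand-ins for CountryNameStyle and CountryCodeDefinition.
public enum NameStyle { Normal, Short, Variant }

public sealed class CountryName
{
    public CountryName(string code, string name, NameStyle style)
    {
        Code = code;
        Name = name;
        Style = style;
    }

    public string Code { get; private set; }
    public string Name { get; private set; }
    public NameStyle Style { get; private set; }
}

public static class PreferredNameSketch
{
    // Keeps one entry per code: the preferred style when the table lists one
    // and the data contains it, otherwise the normal name.
    public static Dictionary<string, CountryName> Select(
        IEnumerable<CountryName> entries,
        IDictionary<string, NameStyle> preferredStyles)
    {
        var result = new Dictionary<string, CountryName>();

        foreach (CountryName entry in entries)
        {
            NameStyle wanted;
            if (!preferredStyles.TryGetValue(entry.Code, out wanted))
                wanted = NameStyle.Normal;

            if (entry.Style == wanted)
                result[entry.Code] = entry; // preferred style always wins
            else if (entry.Style == NameStyle.Normal && !result.ContainsKey(entry.Code))
                result[entry.Code] = entry; // normal name as a fallback
        }

        return result;
    }
}
```

With a table such as { "PS" → Short, "CI" → Variant }, the result holds the short name for PS, the variant name for CI and the normal name for every other code, regardless of the order in which the entries arrive.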

There is considerably more data in the CLDR that the CommonLocaleDataRepositoryArchive class could expose, and not all of the data reconstruction techniques have been implemented (see Unicode Technical Standard #35). However, this is all that is required to access country codes and country names in multiple languages. The CommonLocaleDataRepositoryArchive class requires a reference to the System.IO.Compression assembly in your project.

Download CommonLocaleDataRepositoryArchive | Latest CLDR Release
