Introduction

This article describes a GBK encoder designed for Silverlight. It is implemented as a single static class, GBKEncoder, whose public static methods encode a string to GBK bytes and decode a GBK byte array back to a string. The class is simple and easy to customize.

Background     

Silverlight 4 can save data to a file on the local disk, and you can use this feature to export a DataGrid to a local CSV file. If the CSV file contains Chinese characters and is encoded as UTF-8, however, Excel 2003 does not display it correctly when you open it by double-clicking. For a double-click to work, the file must be exported in GB2312/GBK encoding. Unfortunately, the only encodings Silverlight 4 supports are UTF-8 and UTF-16. There are several ways around this problem:

1. Send the data to be exported back to the server, generate the CSV file there, and redirect the user to download it.

2. Send the data back to the server, return the encoded bytes of the CSV file to the Silverlight application, and save them to a local file.

3. Tell your customers: "Excel 2007, please!" This is the easiest way, but I'm quite sure we would not do that :)

4. Implement GBK export directly within Silverlight until Microsoft provides a solution. This is what this article is about.

GBKEncoder class    

    public static class GBKEncoder
    {
        /// <summary>
        /// Writes a string to the stream.
        /// </summary>
        /// <param name="s">The stream to write into.</param>
        /// <param name="str">The string to write. </param>
        static public void Write(System.IO.Stream s, string str);

        /// <summary>
        /// Encodes string into a sequence of bytes.
        /// </summary>
        /// <param name="str">The string to encode.</param>
        /// <returns>A byte array containing the results of encoding the specified set of characters.</returns>
        static public byte[] ToBytes(string str);

        /// <summary>
        /// Decodes a sequence of bytes into a string.
        /// </summary>
        /// <param name="buffer">The byte array containing the sequence of bytes to decode. </param>
        /// <returns>A string containing the results of decoding 
        /// the specified sequence of bytes.
        /// </returns>
        static public string Read(byte[] buffer);

        /// <summary>
        /// Decodes a sequence of bytes into a string.
        /// </summary>
        /// <param name="buffer">The byte array containing the sequence of bytes to decode. </param>
        /// <param name="iStartIndex">The index of the first byte to decode.</param>
        /// <param name="iLength">The number of bytes to decode. </param>
        /// <returns>A string containing the results of decoding 
        /// the specified sequence of bytes.
        /// </returns>
        static public string Read(byte[] buffer, int iStartIndex, int iLength);

        /// <summary>
        /// Read string from stream.
        /// </summary>
        /// <param name="s">The stream to read from.</param>
        /// <returns>A string containing all characters in the stream.</returns>
        static public string Read(Stream s);
    }



Using the code   

The GBKEncoder class is straightforward to use: download the source, add GBKEncoder.cs to your project, and rename the namespace if you like. Although the class is designed for Silverlight, you can test it in any type of C# project. In a console application, for example:

			
using (System.IO.FileStream fs = new System.IO.FileStream("test.txt", System.IO.FileMode.Create))
{
    GBKEncoder.Write(fs, "This is a test for GBKEncoder.这是一段用来测试GBKEncoder类的代码。 ");
}

Run the code and open the resulting test.txt; you'll find it is encoded as GBK.

In Silverlight, it might look like this:

			
        SaveFileDialog dlg = new SaveFileDialog() { 
            DefaultExt = "csv", 
            Filter = "CSV Files (*.csv)|*.csv|All files (*.*)|*.*", 
            FilterIndex = 1 
        };
        StringBuilder sb = new StringBuilder();
        // some code to fill sb ...
        if (dlg.ShowDialog() == true)
        {
            using (Stream s = dlg.OpenFile())
            {
                GBKEncoder.Write(s, sb.ToString());
            }
        }
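
The "some code to fill sb" step above is left open. As a rough illustration only, the sketch below builds CSV text from rows of strings. The CsvSketch class and its method names are hypothetical and not part of GBKEncoder, and the quoting rule is the common CSV convention rather than anything this article prescribes.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of GBKEncoder: builds CSV text from string rows.
public static class CsvSketch
{
    // Quotes a field when it contains a comma, a quote, or a line break,
    // doubling any embedded quotes.
    public static string EscapeField(string field)
    {
        if (field == null) return "";
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        if (!needsQuotes) return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Joins the fields of each row with commas and ends every row with CRLF.
    public static string BuildCsv(IEnumerable<string[]> rows)
    {
        StringBuilder sb = new StringBuilder();
        foreach (string[] row in rows)
        {
            for (int i = 0; i < row.Length; i++)
            {
                if (i > 0) sb.Append(',');
                sb.Append(EscapeField(row[i]));
            }
            sb.Append("\r\n");
        }
        return sb.ToString();
    }
}
```

The resulting string could then be passed to GBKEncoder.Write(s, ...) in place of sb.ToString().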

Performance

The following code was designed to test the performance of GBKEncoder class:  

        static void PerformanceTest()
        {
            StringBuilder sb = new StringBuilder();
            Random r = new Random();
            for (int i = 0; i < 200; i++)
            {
                for (int u = 0x4E00; u < 0x9Fa0; u++)
                {
                    sb.Append((char)u);
                    if (r.Next(0, 5) == 0)
                    {
                        sb.Append((char)r.Next(32, 126));
                    }
                }
            }

            string str = sb.ToString();
            Console.WriteLine("Total character count : {0}", str.Length);

            HighPrecisionTimer timer = new HighPrecisionTimer();

            timer.Start();
            using (System.IO.FileStream fs = new System.IO.FileStream("test1.txt", System.IO.FileMode.Create))
            {
                GBKEncoder.Write(fs, str);
            }
            timer.Stop();
            timer.ShowDurationToConsole();

            timer.Start();
            using (StreamWriter sw = new StreamWriter("test2.txt", false, Encoding.GetEncoding("GBK")))
            {
                sw.Write(str);
            }
            timer.Stop();
            timer.ShowDurationToConsole();

            timer.Start();
            string str2 = "";
            using (FileStream fs = new FileStream("test1.txt", FileMode.Open))
            {
                str2 = GBKEncoder.Read(fs);
            }
            timer.Stop();
            timer.ShowDurationToConsole();

            timer.Start();
            string str3 = File.ReadAllText("test2.txt", Encoding.GetEncoding("GBK"));
            timer.Stop();
            timer.ShowDurationToConsole();

            if (str == str2 && str2 == str3)
            {
                Console.WriteLine("Success!");
            }
            else
            {
                Console.WriteLine("Error!!!");
            }
        } 

Test environment: Vista 32bit, Q6600 OC3.0GHz, 4G Mem

Result:

Unit: milliseconds
Character count: 5014060

               Encode    Decode
GBKEncoder     39.9      62.4
.NET Native    75.6      63.6

Because the implementation of GBKEncoder is simple and straightforward and makes no attempt to handle the more complex situations a general-purpose encoder must consider, its encoding speed in this test is better than that of the .NET native encoder. Decoding speed is about the same.

GBKEncoder will take up 50KB of space in your xap file and consume about 260KB of memory at runtime. 

Implementation details

GBK is an extension of the GB2312 character set for Simplified Chinese. A character is encoded as 1 or 2 bytes: 1 byte for standard ASCII characters, and 2 bytes for Chinese ideographs and punctuation.
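
To make the 1-or-2-byte layout concrete, here is a minimal sketch that walks a GBK byte sequence and counts its characters. It assumes the standard GBK lead-byte range of 0x81 to 0xFE. The class and method names are illustrative and are not taken from GBKEncoder.

```csharp
using System;

// Illustrative only: shows how single-byte and two-byte GBK characters are told apart.
public static class GbkLayoutSketch
{
    // A byte in 0x81-0xFE starts a two-byte character; anything else
    // (in practice the ASCII range 0x00-0x7F) is treated here as one byte.
    public static bool IsLeadByte(byte b)
    {
        return b >= 0x81 && b <= 0xFE;
    }

    // Counts characters by stepping over one or two bytes at a time.
    public static int CountCharacters(byte[] buffer)
    {
        int count = 0;
        int i = 0;
        while (i < buffer.Length)
        {
            if (IsLeadByte(buffer[i]))
            {
                if (i + 1 >= buffer.Length)
                    throw new ArgumentException("Truncated two-byte character at end of buffer.");
                i += 2;
            }
            else
            {
                i += 1;
            }
            count++;
        }
        return count;
    }
}
```

For example, the six bytes 41 D6 D0 CE C4 21 (the GBK encoding of "A中文!") contain four characters: two single-byte ASCII characters and two double-byte ideographs.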

GBKEncoder is implemented using two arrays: sm_mapUnicode2GBK for the Unicode-to-GBK mapping and sm_mapGBK2Unicode for the GBK-to-Unicode mapping. The mappings are generated according to http://www.cs.nyu.edu/~yusuke/tools/unicode_to_gb2312_or_gbk_table.html . The contents of sm_mapUnicode2GBK are hard-coded in the source code, and the contents of sm_mapGBK2Unicode are derived from sm_mapUnicode2GBK at runtime.
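
The two-array idea can be sketched as follows. This is not the actual GBKEncoder source. It assumes the layout used by the generator in this section (two bytes per UTF-16 code unit, high byte first, high byte 0 for single-byte characters), fills in only ASCII and two sample ideographs instead of the full table, and substitutes '?' for unmapped characters, which is a choice made for this sketch. All names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text;

// Sketch of table-driven GBK conversion; names and fallback behavior are assumptions.
public static class GbkTableSketch
{
    // Forward table: [2*u] = high GBK byte (0 if single-byte), [2*u + 1] = low byte.
    static readonly byte[] Unicode2Gbk = BuildForwardTable();
    // Reverse table, built at startup: indexed by the 16-bit GBK code.
    static readonly char[] Gbk2Unicode = BuildReverseTable(Unicode2Gbk);

    static byte[] BuildForwardTable()
    {
        byte[] map = new byte[0x10000 * 2];
        for (int u = 1; u < 0x80; u++) map[u * 2 + 1] = (byte)u; // ASCII maps to itself
        SetEntry(map, 0x4E2D, 0xD6, 0xD0); // 中 (sample entry; a real table covers all of GBK)
        SetEntry(map, 0x6587, 0xCE, 0xC4); // 文 (sample entry)
        return map;
    }

    static void SetEntry(byte[] map, int codeUnit, byte high, byte low)
    {
        map[codeUnit * 2] = high;
        map[codeUnit * 2 + 1] = low;
    }

    // Inverts the forward table so decoding is also a single array lookup.
    static char[] BuildReverseTable(byte[] forward)
    {
        char[] reverse = new char[0x10000];
        for (int u = 1; u < 0x10000; u++)
        {
            int code = (forward[u * 2] << 8) | forward[u * 2 + 1];
            if (code != 0) reverse[code] = (char)u;
        }
        return reverse;
    }

    public static byte[] Encode(string text)
    {
        List<byte> output = new List<byte>(text.Length * 2);
        foreach (char c in text)
        {
            byte high = Unicode2Gbk[c * 2];
            byte low = Unicode2Gbk[c * 2 + 1];
            if (high == 0 && low == 0 && c != '\0')
            {
                output.Add((byte)'?'); // unmapped character
                continue;
            }
            if (high != 0) output.Add(high);
            output.Add(low);
        }
        return output.ToArray();
    }

    public static string Decode(byte[] buffer)
    {
        StringBuilder sb = new StringBuilder(buffer.Length);
        int i = 0;
        while (i < buffer.Length)
        {
            int code = buffer[i++];
            // A lead byte in 0x81-0xFE combines with the next byte into one code.
            if (code >= 0x81 && code <= 0xFE && i < buffer.Length)
                code = (code << 8) | buffer[i++];
            char c = Gbk2Unicode[code];
            sb.Append(c == '\0' && code != 0 ? '?' : c);
        }
        return sb.ToString();
    }
}
```

With these sample entries, GbkTableSketch.Encode("A中文") yields the bytes 41 D6 D0 CE C4, and GbkTableSketch.Decode turns them back into the original string. Because both directions are plain array lookups, the per-character cost is small, at the price of holding the tables in memory.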

Method for generating the value of sm_mapUnicode2GBK:   

        /// <summary>
        /// Generate unicode to GBK mapping according to
        /// http://www.cs.nyu.edu/~yusuke/tools/unicode_to_gb2312_or_gbk_table.html
        /// </summary>
        static private void GenUnicode2GBKMapping()
        {
            XmlDocument doc = new XmlDocument();
            doc.Load("Unicode2GBK.xml");
            int iCount = 0;
            byte[] sm_mapUnicode2GBK = new byte[0xFFFF * 2];
            foreach (XmlNode n in doc.DocumentElement.ChildNodes)
            {
                if (n.ChildNodes.Count == 18)
                {
                    string strUnicode = n.ChildNodes[0].InnerText;
                    if (strUnicode.Substring(0, 2) != "U+")
                        throw new ApplicationException(string.Format("{0} is not a valid Unicode code point", strUnicode));

                    int u = int.Parse(n.ChildNodes[0].InnerText.Substring(2), System.Globalization.NumberStyles.HexNumber);

                    for (int i = 2; i < 18; i++)
                    {
                        int j = (i - 2 + u) * 2;

                        foreach (XmlNode subNode in n.ChildNodes[i])
                        {
                            if (subNode.Name.ToLower() == "small")
                            {
                                string str = subNode.InnerText.Trim().Trim('*');
                                if (str.Length == 2)
                                {
                                    sm_mapUnicode2GBK[j] = 0;
                                    sm_mapUnicode2GBK[j + 1] = byte.Parse(str, System.Globalization.NumberStyles.HexNumber);
                                    iCount++;
                                }
                                else if (str.Length == 4)
                                {
                                    sm_mapUnicode2GBK[j] = byte.Parse(str.Substring(0, 2), System.Globalization.NumberStyles.HexNumber);
                                    sm_mapUnicode2GBK[j + 1] = byte.Parse(str.Substring(2), System.Globalization.NumberStyles.HexNumber);
                                    iCount++;
                                }
                                else
                                {
                                    throw new ApplicationException(string.Format("{0} is not a valid encoding", n.ChildNodes[i].OuterXml));

                                }
                            }
                        }
                    }
                }
            }
            Console.WriteLine("Converted {0} characters in total", iCount);

            StringBuilder sb = new StringBuilder();
            sb.AppendLine("static byte[] sm_mapUnicode2GBK = new byte[] {");

            for (int i = 0; i < sm_mapUnicode2GBK.Length; i++)
            {
                if (i != 0 && i % 16 == 0) sb.AppendLine();
                sb.Append(sm_mapUnicode2GBK[i]);
                if (i < sm_mapUnicode2GBK.Length - 1) sb.Append(", ");
            }
            sb.AppendLine("};");

            File.WriteAllText("sm_mapUnicode2GBK.cs", sb.ToString());
        }

Unicode2GBK.xml is an XML file containing the Unicode-to-GBK mapping information, generated from http://www.cs.nyu.edu/~yusuke/tools/unicode_to_gb2312_or_gbk_table.html .

       

History   

2011-03-29 Initial version, encoding only

2011-03-31 Improved performance; encoding and decoding
