Detecting File Encodings in .NET
Stats
  Rating: 4.67 out of 5 by 6 users
  Submitted: 05/29/02
Heath Stewart

 
So you need to read data from a text file, or possibly even a binary file? Do you need to read one byte or two for character arrays, a.k.a. strings? This is a common problem in file I/O today: the de facto file encoding is now Unicode, but applications must usually still work with older file formats and encodings, which are probably plain ASCII. You need a way to detect what the file's encoding is, or leave it to the user to decide (and we all know how well that works!). This tutorial covers basic detection routines and gives you sample code that you can use in your applications to (usually) detect a file's encoding.


Introduction to File Encodings

What is a file encoding, you ask? Files are encoded in various ways, but the higher-level concept is simple: characters are merely bytes that are interpreted as text. The two main ways these characters are stored are ASCII and Unicode, although several implementations of Unicode exist.

ASCII characters are stored in one byte. A byte - as you should very well know - is 8 bits, giving you a possibility of 256 characters, ranging from 0 to 255. To see what these character codes represent, see the character set documentation on MSDN.

Unicode characters are commonly stored in two bytes, or 16 bits. This gives you, of course, the possibility of 65,536 characters. Unicode was created because 256 characters just weren't enough, especially for languages whose alphabets contain far more characters. Most high-level languages today, such as the .NET languages and Java, even store native strings as Unicode. Since two bytes are used instead of one, however, your file size will increase. With today's computers and the size of hard drives, that's not really a problem.

There are also multi-byte character encodings, which can mix one- and two-byte characters. As you can probably guess, this becomes a problem, since each attempt to read a character must first determine whether that character is stored in one byte or two. String operations over such encodings are therefore slow, and they are not recommended or implemented natively in many frameworks.
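To see these size differences concretely, here is a small sketch using .NET's Encoding classes (the sample string is just an illustrative choice):

```csharp
using System;
using System.Text;

class EncodingSizes
{
    static void Main()
    {
        string text = "Héllo"; // five characters, one outside the ASCII range

        // One byte per character, but 'é' cannot be represented and becomes '?'
        Console.WriteLine(Encoding.ASCII.GetBytes(text).Length);   // 5

        // Two bytes per character in UTF-16 ("Unicode" in .NET)
        Console.WriteLine(Encoding.Unicode.GetBytes(text).Length); // 10

        // Multi-byte UTF-8: the ASCII characters take one byte each, 'é' takes two
        Console.WriteLine(Encoding.UTF8.GetBytes(text).Length);    // 6
    }
}
```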


Detecting File Encodings in .NET

The code samples below will show you how to detect the file encoding and read strings from the file based on the detected encoding. .NET does provide you with a nice set of Encoding classes, such as ASCIIEncoding and UnicodeEncoding, which you can easily get through static members of the System.Text.Encoding class, namely Encoding.ASCII, Encoding.Unicode, and Encoding.UTF8.
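For instance, these static members let you round-trip a string through a byte array - a minimal sketch:

```csharp
using System;
using System.Text;

class EncodingRoundTrip
{
    static void Main()
    {
        // Encode a string to bytes with one encoding...
        byte[] data = Encoding.Unicode.GetBytes("Hello");

        // ...and decode it back. Decoding with the wrong encoding would
        // produce garbage, which is exactly why detection matters.
        string text = Encoding.Unicode.GetString(data);
        Console.WriteLine(text); // Hello
    }
}
```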

The first step is to open the file and grab its first four bytes. The first two to four bytes of a file are known as the byte-order mark, or BOM. We then check those bytes against the known BOM patterns to see if the file is Unicode. If the BOM does not exist, you must decide which file encoding to default to, based on the files you'll typically be expected to read. See the comments below for more details.
 
System.Text.Encoding enc = null;
System.IO.FileStream file = new System.IO.FileStream(filePath,
    System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read);
if (file.CanSeek)
{
    byte[] bom = new byte[4]; // Get the byte-order mark, if there is one
    file.Read(bom, 0, 4);
    if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) // UTF-8
    {
        enc = System.Text.Encoding.UTF8;
    }
    else if (bom[0] == 0xff && bom[1] == 0xfe) // UTF-16 little-endian (also the first two bytes of a UTF-32 LE BOM)
    {
        enc = System.Text.Encoding.Unicode;
    }
    else if (bom[0] == 0xfe && bom[1] == 0xff) // UTF-16 big-endian
    {
        enc = System.Text.Encoding.BigEndianUnicode;
    }
    else if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) // UTF-32 (UCS-4) big-endian
    {
        enc = System.Text.Encoding.Unicode; // no built-in UTF-32 decoder here, so this is only a guess
    }
    else
    {
        enc = System.Text.Encoding.ASCII;
    }

    // Now reposition the file cursor back to the start of the file. Note that the
    // BOM itself will then be decoded along with the text; you could instead seek
    // just past the BOM you matched.
    file.Seek(0, System.IO.SeekOrigin.Begin);
}
else
{
    // The file cannot be randomly accessed, so you need to decide what to default to
    // based on the data you expect. If you're expecting data from a lot of older
    // applications, default your encoding to Encoding.ASCII. If you're expecting data
    // from a lot of newer applications, default it to Encoding.Unicode. Binary files
    // are byte-based, so Encoding.ASCII is the safer choice there, even though you'll
    // probably never use the encoding in that case, since the Encoding classes are
    // really meant to get strings from the byte array that is the file.

    enc = System.Text.Encoding.ASCII;
}

// Do your file operations here, such as getting a string from the byte array that is the file
byte[] buffer = new byte[4096]; // A good buffer size; keep it even so UTF-16 code units aren't split
int count;
while ((count = file.Read(buffer, 0, buffer.Length)) > 0)
{
    // Decode only the bytes actually read, using the encoding we detected above
    string text = enc.GetString(buffer, 0, count);
    System.Console.Write(text);
}
System.Console.WriteLine();

// Close the file: never forget this step!
file.Close();

There are other things you'll want to do as well, like wrapping the main functionality in a try-catch-finally block and placing the file.Close() call in the "finally" block. That way the file always gets closed, even when an exception is thrown partway through. So you could do something like this:
 
System.IO.FileStream file = null;
try
{
    // Place all the code from above in here. Note that "file" is declared outside the
    // try-catch-finally block, since the "finally" block needs "file" in scope, and a
    // declaration inside the "try" block would be in a different scope.
}
catch (Exception e)
{
    Console.Error.WriteLine("Error: " + e.Message);
}
finally
{
    // Make sure we don't throw a NullReferenceException, since the "file" object may never
    // have been assigned - perhaps because an exception was thrown while constructing the
    // new FileStream.
    if (file != null)
        file.Close();
}
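As an aside, the same guarantee can be written more tersely with C#'s using statement, which calls Dispose (and therefore closes the file) even if an exception is thrown. A sketch, assuming the detection code from above:

```csharp
using (System.IO.FileStream file = new System.IO.FileStream(filePath,
    System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read))
{
    // ... BOM detection and reading, as above ...
} // the file is closed automatically here, even if an exception was thrown
```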


Summary

Detecting the file encoding is often important when you must read strings from a file. .NET doesn't provide a general-purpose mechanism to do this, so you must do it manually using code similar to that above. When using this method, you don't have to worry about the user making a decision, which almost always leads to problems.


References

VIM - Vi IMproved, by Bram Moolenaar: An old UNIX editor with improvements, ported to practically every major platform. The code above was derived from its fileio.c file, available to view online.

Reader's Comments
 
As a note, System.IO.StreamReader does let you read ASCII or Unicode files via a boolean value in the constructor to automatically detect the byte-order mark (BOM), like so:
[cs]
StreamReader reader = new StreamReader("somefile.txt", true);
[/cs]
If the StreamReader works for you, I'd suggest using it (less coding on your part, although I'm sure MS is doing about the same thing internally). The StreamReader isn't useful in all cases, though, which is why I wrote this tutorial.
-- Heath Stewart, June 10, 2002
 
This is a very in-depth and well-written tutorial. Good technical content and writing style.
-- Andrew Ma, September 06, 2002
 
Is it also possible to determine if a file is binary?
-- Guido Docter, July 24, 2003
 
Copyright © 2001 DevHood® All Rights Reserved