So you need to read data from a text file, or possibly even a binary file? Do you need to read one byte or two for each character in a string? This is a common problem in file I/O today: Unicode is now the de facto standard for text, but applications usually still have to work with older file formats and encodings, which are often plain ASCII. You need a way to detect the file encoding, or leave it to the user to decide (and we all know how well that works!). This tutorial covers basic detection routines and gives you sample code that you can use in your applications to (usually) detect the file encoding.
Introduction to File Encodings
What is a file encoding, you ask? Files are encoded in various ways, but the high-level concept is simple: text is stored as bytes, and the encoding determines how those bytes are interpreted as characters. The two main families of encodings are ASCII and Unicode, although several encodings of Unicode exist.
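ASCII characters are stored in one byte. A byte - as you should very well know - is 8 bits, giving you 256 possible values, ranging from 0 to 255. Strictly speaking, ASCII itself defines only the first 128 values (0 to 127); what the upper 128 represent depends on the code page in use. To see what these character codes represent, see the character code charts on MSDN.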
Unicode characters, in the UTF-16 form that .NET calls "Unicode", are stored in two bytes, or 16 bits. This gives you, of course, 65,536 possible values per 16-bit unit; characters beyond that range are stored as a pair of units. Unicode was created because 256 characters just weren't enough, especially for languages whose writing systems contain far more characters. Most high-level platforms of today, such as the .NET languages and Java, even store native strings as Unicode. Since two bytes are used instead of one, however, your file size will increase. With today's computers and the size of hard drives, that's not really a problem.
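To make the size difference concrete, here is a minimal sketch that counts the bytes needed to store the same string in each encoding. The class and method names (ByteCountSketch, AsciiBytes, Utf16Bytes) are made up for this illustration; Encoding.ASCII, Encoding.Unicode and GetByteCount are the standard System.Text members.

```csharp
using System;
using System.Text;

static class ByteCountSketch
{
    // Bytes needed to store text as ASCII: one byte per character.
    public static int AsciiBytes(string text)
    {
        return Encoding.ASCII.GetByteCount(text);
    }

    // Bytes needed to store text as UTF-16 ("Unicode" in .NET):
    // two bytes per 16-bit unit. The BOM is not included in the count.
    public static int Utf16Bytes(string text)
    {
        return Encoding.Unicode.GetByteCount(text);
    }
}
```

For the five-character string "Hello", AsciiBytes returns 5 and Utf16Bytes returns 10.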
There are also multi-byte character encodings, which can mix one- and two-byte characters. As you can probably guess, this becomes a problem, since each attempt to read a character must determine whether the character is stored in one byte or two. String operations on such data are therefore slower, and these encodings are not recommended or implemented natively in many frameworks.
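The per-character check described above can be sketched using UTF-8 as the example. UTF-8 is my choice of illustration, not an encoding the text above names; it is a variable-width encoding that uses one to four bytes per character, and the first byte of each sequence tells you how many bytes follow. The names Utf8WidthSketch and SequenceLength are hypothetical.

```csharp
using System;

static class Utf8WidthSketch
{
    // Returns the length in bytes of the UTF-8 sequence that starts
    // with the given lead byte, or 0 if the byte cannot start a sequence
    // (for example, a continuation byte in the range 0x80 to 0xBF).
    public static int SequenceLength(byte lead)
    {
        if (lead < 0x80) return 1;                  // 0xxxxxxx: plain ASCII
        if (lead >= 0xC2 && lead <= 0xDF) return 2; // 110xxxxx
        if (lead >= 0xE0 && lead <= 0xEF) return 3; // 1110xxxx
        if (lead >= 0xF0 && lead <= 0xF4) return 4; // 11110xxx
        return 0;                                   // not a valid lead byte
    }
}
```

A reader has to run a check like this for every character before it knows where the next one starts, which is the extra work the paragraph above refers to.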
Detecting File Encodings in .NET
The code samples below show how to detect the file encoding and read strings from the file using the detected encoding. .NET provides a nice set of Encoding classes, such as ASCIIEncoding and UnicodeEncoding, which you can easily get through static members of the System.Text.Encoding class, namely Encoding.ASCII and Encoding.Unicode.
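As a small illustration of why the choice of Encoding object matters, the sketch below decodes the same bytes with whichever encoding it is handed. DecodeSketch and Decode are hypothetical names; Encoding.GetString is the standard method.

```csharp
using System;
using System.Text;

static class DecodeSketch
{
    // Interprets the raw bytes as text using the supplied encoding.
    public static string Decode(byte[] data, Encoding enc)
    {
        return enc.GetString(data);
    }
}
```

Given the four bytes 48 00 69 00, Decode returns "Hi" with Encoding.Unicode, but with Encoding.ASCII it returns a four-character string in which each letter is followed by a NUL character. Picking the wrong encoding does not fail; it silently produces the wrong text, which is why detection is worth the effort.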
The first step is to open the file and grab the first four bytes. The first two to four bytes may hold what is known as the byte-order mark, or BOM. We then check those bytes to see whether the file is Unicode. If there is no BOM, you must decide which file encoding to default to, based on the files that you'll typically be expected to read. See the comments below for more details.
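// Default used when no BOM is found or the stream cannot seek.
// Pick the default that fits the files you typically expect to read.
System.Text.Encoding enc = System.Text.Encoding.ASCII;
int bomLength = 0;
System.IO.FileStream file = new System.IO.FileStream(filePath,
    System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read);
if (file.CanSeek)
{
    // Read up to four bytes; a very short file may return fewer.
    byte[] bom = new byte[4];
    int n = file.Read(bom, 0, 4);
    if (n >= 3 && bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf)
    {
        enc = System.Text.Encoding.UTF8;                  // UTF-8
        bomLength = 3;
    }
    else if (n >= 2 && bom[0] == 0xff && bom[1] == 0xfe)
    {
        enc = System.Text.Encoding.Unicode;               // UTF-16, little-endian
        bomLength = 2;
    }
    else if (n >= 2 && bom[0] == 0xfe && bom[1] == 0xff)
    {
        enc = System.Text.Encoding.BigEndianUnicode;      // UTF-16, big-endian
        bomLength = 2;
    }
    else if (n >= 4 && bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff)
    {
        enc = new System.Text.UTF32Encoding(true, true);  // UTF-32, big-endian
        bomLength = 4;
    }
    // Position the stream just past the BOM so it is not decoded as text.
    file.Seek(bomLength, System.IO.SeekOrigin.Begin);
}
// A Decoder keeps state between calls, so a character that is split
// across two reads is still decoded correctly.
System.Text.Decoder decoder = enc.GetDecoder();
byte[] buffer = new byte[4096];
char[] chars = new char[enc.GetMaxCharCount(buffer.Length)];
int bytesRead;
while ((bytesRead = file.Read(buffer, 0, buffer.Length)) > 0)
{
    // Decode only the bytes actually read, not the whole buffer.
    int charCount = decoder.GetChars(buffer, 0, bytesRead, chars, 0);
    System.Console.Write(chars, 0, charCount);
}
System.Console.WriteLine();
file.Close();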
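The byte values checked above do not have to be hard-coded: each Encoding object can report its own BOM through GetPreamble(). The following is a sketch of the same detection driven by that method, not the routine this tutorial was derived from. BomSketch and DetectFromBom are hypothetical names, and the candidate list adds little-endian UTF-32, which the code above does not check for.

```csharp
using System;
using System.Text;

static class BomSketch
{
    // Longer preambles are listed first: the UTF-32 LE BOM (FF FE 00 00)
    // begins with the UTF-16 LE BOM (FF FE), so order matters.
    static readonly Encoding[] Candidates =
    {
        new UTF32Encoding(true, true),  // 00 00 FE FF
        Encoding.UTF32,                 // FF FE 00 00
        Encoding.UTF8,                  // EF BB BF
        Encoding.Unicode,               // FF FE
        Encoding.BigEndianUnicode       // FE FF
    };

    // Returns the first candidate whose preamble matches the start of head,
    // or fallback when no candidate matches.
    public static Encoding DetectFromBom(byte[] head, Encoding fallback)
    {
        foreach (Encoding candidate in Candidates)
        {
            byte[] preamble = candidate.GetPreamble();
            if (preamble.Length == 0 || head.Length < preamble.Length)
                continue;
            bool match = true;
            for (int i = 0; i < preamble.Length; i++)
            {
                if (head[i] != preamble[i]) { match = false; break; }
            }
            if (match)
                return candidate;
        }
        return fallback;
    }
}
```

One caveat with the extra UTF-32 check: a little-endian UTF-16 file whose first character is NUL starts with the same four bytes, so this sketch would report it as UTF-32. Whether that trade-off is acceptable depends on the files you expect to read.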
There are also other things you'd want to do, like wrapping the main functionality in a try-catch-finally block and placing the file.Close() call in the "finally" block. Checking for null first means that file.Close() won't cause any problems even if the file was never opened, and it ensures the file always gets closed. So you could do something like this:
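System.IO.FileStream file = null;
try
{
    // Open the file, detect the encoding, and read the contents here.
}
catch (System.Exception e)
{
    System.Console.Error.WriteLine("Error: " + e.Message);
}
finally
{
    // file is still null if the open failed, so check before closing.
    if (file != null)
        file.Close();
}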
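C# also offers the using statement, which closes the stream when the block exits, whether it exits normally or through an exception, so the finally block does not have to be written by hand. Below is a minimal sketch that reads the leading bytes of a file this way; UsingSketch and ReadHead are hypothetical names.

```csharp
using System;
using System.IO;

static class UsingSketch
{
    // Returns up to count bytes from the start of the file at path.
    // The array is shorter than count if the file is shorter.
    public static byte[] ReadHead(string path, int count)
    {
        using (FileStream file = new FileStream(path,
            FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            byte[] head = new byte[count];
            int total = 0;
            int n;
            // Read may return fewer bytes than requested, so loop until
            // the buffer is full or the end of the file is reached.
            while (total < count && (n = file.Read(head, total, count - total)) > 0)
            {
                total += n;
            }
            Array.Resize(ref head, total);
            return head;
        } // file is disposed (and therefore closed) here
    }
}
```

Exceptions still propagate to the caller, so a try-catch around the call is needed if you want to report the error as in the example above.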
Summary
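Detecting the file encoding is often important when you must read strings from a file. When you read from a raw stream as shown here, nothing detects the encoding for you, so you must do it manually using code similar to that above. (StreamReader can check for a byte-order mark itself when it is constructed with detectEncodingFromByteOrderMarks set to true, but you still have to choose the default it falls back to.) When using this method, you don't have to rely on the user making a decision, which almost always leads to problems.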
References
VIM - Vi IMproved, by Bram Moolenaar: An old UNIX editor with improvements and ports to practically every major platform. The code above was derived from the fileio.c file, available to view online at .