Returning the appropriate Reader based on the actual encoding detected from the BOM
Author: Anonymous     2007-3-14 16:26:56        Source: unknown
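import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

// UnsupportedEncodingException is a subclass of IOException, so one throws clause covers both.
public Reader getReader(InputStream is, String encoding) throws IOException {
    // The pushback buffer only needs to hold the 3 bytes that BOM detection may unread.
    PushbackInputStream pis = new PushbackInputStream(is, 1024);
    String bomEncoding = getBOMEncoding(pis);
    // Fall back to the caller-supplied encoding when the stream has no BOM.
    String actualEncoding = (bomEncoding == null) ? encoding : bomEncoding;
    return new BufferedReader(new InputStreamReader(pis, actualEncoding));
}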

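private static final String UTF_16BE = "UTF-16BE";
private static final String UTF_16LE = "UTF-16LE";
private static final String UTF_8 = "UTF-8";

// Returns the encoding indicated by a leading BOM, or null if there is none.
// The BOM bytes are consumed; all other bytes are pushed back onto the stream.
protected String getBOMEncoding(PushbackInputStream is) throws IOException {
    String encoding = null;
    // read() returns -1 at end of stream, so short streams leave -1 entries here.
    int[] bytes = new int[3];
    bytes[0] = is.read();
    bytes[1] = is.read();
    bytes[2] = is.read();

    if (bytes[0] == 0xFE && bytes[1] == 0xFF) {
        encoding = UTF_16BE;
        // The third byte is content, not part of the BOM; never unread the EOF marker.
        if (bytes[2] != -1) is.unread(bytes[2]);
    } else if (bytes[0] == 0xFF && bytes[1] == 0xFE) {
        // Note: FF FE 00 00 (UTF-32LE) is not distinguished here.
        encoding = UTF_16LE;
        if (bytes[2] != -1) is.unread(bytes[2]);
    } else if (bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF) {
        encoding = UTF_8;
    } else {
        // No BOM: push back what was read, in reverse order, skipping EOF markers.
        for (int i = bytes.length - 1; i >= 0; i--) {
            if (bytes[i] != -1) is.unread(bytes[i]);
        }
    }

    return encoding;
}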


Byte Order Mark (BOM) FAQ

Q: What is a BOM?

A: A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol. [AF]

Q: Where is a BOM useful?

A: A BOM is useful at the beginning of files that are typed as text, but for which it is not known whether they are in big-endian or little-endian format. It can also serve as a hint that the file is in Unicode rather than in a legacy encoding, and furthermore it acts as a signature for the specific encoding form used. [MD] & [AF]

Q: What does ‘endian’ mean?

A: Data types longer than a byte can be stored in computer memory with the most significant byte (MSB) first or last. The former is called big-endian, the latter little-endian. When data are exchanged in the same byte order as they were in the memory of the originating system, they may appear to be in the wrong byte order on the receiving system. In that situation, a BOM would look like 0xFFFE, which is a noncharacter, allowing the receiving system to apply byte reversal before processing the data. UTF-8 is byte oriented and therefore does not have that issue. Nevertheless, an initial BOM might be useful to identify the data stream as UTF-8. [AF]
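
The byte-reversal effect described in this answer can be seen in a few lines of Java. This is only an illustrative sketch: the class name EndianDemo and its helper methods are made up for this example, and it relies on the standard UTF-16BE and UTF-16LE charsets encoding U+FEFF without adding anything of their own.

```java
import java.nio.charset.StandardCharsets;

final class EndianDemo {
    /** Formats bytes as space-separated uppercase hex, e.g. "FE FF". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Interprets the first two bytes as one big-endian 16-bit code unit. */
    static int readBigEndian(byte[] b) {
        return ((b[0] & 0xFF) << 8) | (b[1] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] be = "\uFEFF".getBytes(StandardCharsets.UTF_16BE);
        byte[] le = "\uFEFF".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(hex(be)); // FE FF
        System.out.println(hex(le)); // FF FE

        // A big-endian reader that receives little-endian data sees the
        // noncharacter U+FFFE and can conclude the bytes must be swapped.
        System.out.printf("U+%04X%n", readBigEndian(le)); // U+FFFE
    }
}
```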

Q: When a BOM is used, is it only in 16-bit Unicode text?

A: No, a BOM can be used as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, UTF-7, etc. The exact bytes comprising the BOM will be whatever the Unicode character FEFF is converted into by that transformation format. In that form, the BOM serves to indicate both that it is a Unicode file, and which of the formats it is in. Examples:

Bytes          Encoding Form
00 00 FE FF    UTF-32, big-endian
FF FE 00 00    UTF-32, little-endian
FE FF          UTF-16, big-endian
FF FE          UTF-16, little-endian
EF BB BF       UTF-8

[MD]
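
The table above can be turned into a small lookup, sketched here under a few assumptions: BomTable, detect and startsWith are hypothetical names, and the longer signatures are tried first so that FF FE 00 00 is reported as UTF-32 little-endian rather than UTF-16 little-endian. That ordering is a design choice, not something the FAQ prescribes; the same four bytes could also be a UTF-16LE BOM followed by U+0000, which a signature check alone cannot tell apart.

```java
final class BomTable {
    // Longest signatures first, so FF FE 00 00 is tested before FF FE.
    private static final int[][] SIGNATURES = {
        {0x00, 0x00, 0xFE, 0xFF},
        {0xFF, 0xFE, 0x00, 0x00},
        {0xEF, 0xBB, 0xBF},
        {0xFE, 0xFF},
        {0xFF, 0xFE},
    };
    private static final String[] NAMES = {
        "UTF-32BE", "UTF-32LE", "UTF-8", "UTF-16BE", "UTF-16LE",
    };

    /** Returns the encoding implied by a leading BOM, or null if none matches. */
    static String detect(byte[] head) {
        for (int i = 0; i < SIGNATURES.length; i++) {
            if (startsWith(head, SIGNATURES[i])) {
                return NAMES[i];
            }
        }
        return null;
    }

    /** True if data begins with the given unsigned byte values. */
    static boolean startsWith(byte[] data, int[] signature) {
        if (data.length < signature.length) return false;
        for (int j = 0; j < signature.length; j++) {
            if ((data[j] & 0xFF) != signature[j]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] plain = {'h', 'i'};
        System.out.println(detect(utf8));  // UTF-8
        System.out.println(detect(plain)); // null
    }
}
```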

Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?

A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order. An initial BOM is only used as a signature, that is, an indication that an otherwise unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of "#!" at the beginning of Unix shell scripts. [AF] & [MD]
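
Java's own UTF-8 decoder is one such recipient: it leaves the signature in the decoded text as a leading U+FEFF character. A minimal sketch of dropping it after decoding follows; Utf8BomStrip and decodeUtf8 are hypothetical names used only for illustration.

```java
import java.nio.charset.StandardCharsets;

final class Utf8BomStrip {
    /** Decodes UTF-8 bytes and removes a single leading U+FEFF, if present. */
    static String decodeUtf8(byte[] bytes) {
        String s = new String(bytes, StandardCharsets.UTF_8);
        return s.startsWith("\uFEFF") ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        // EF BB BF followed by "#!", as in a shell script saved with a BOM.
        byte[] script = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '#', '!'};

        String raw = new String(script, StandardCharsets.UTF_8);
        System.out.println(raw.startsWith("#!"));                // false
        System.out.println(decodeUtf8(script).startsWith("#!")); // true
    }
}
```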

Q: What should I do with U+FEFF in the middle of a file?

A: In the absence of a protocol supporting its use as a BOM and when not at the beginning of a text stream, U+FEFF should normally not occur. For backwards compatibility it should be treated as ZERO WIDTH NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the file or string. The use of U+2060 WORD JOINER is strongly preferred over ZWNBSP for expressing word joining semantics since it cannot be confused with a BOM. When designing a markup language or data protocol, the use of U+FEFF can be restricted to that of Byte Order Mark. In that case, any U+FEFF occurring in the middle of the file can be ignored, or treated as an error. [AF]

Q: I am using a protocol that has BOM at the start of text. How do I represent an initial ZWNBSP?

A: Use U+2060 WORD JOINER instead. [MD]

Q: How do I tag data that does not interpret FEFF as a BOM?

A: Use the tag UTF-16BE to indicate big-endian UTF-16 text, and UTF-16LE to indicate little-endian UTF-16 text. If you do use a BOM, tag the text as simply UTF-16. [MD]
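
Java's standard charsets follow the same tagging convention, which makes for a convenient illustration. In the sketch below (Utf16Tags and hex are hypothetical names), the "UTF-16" charset writes and consumes a BOM, while "UTF-16BE" writes none and treats a leading FE FF as an ordinary U+FEFF character.

```java
import java.nio.charset.StandardCharsets;

final class Utf16Tags {
    /** Formats bytes as space-separated uppercase hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Encoding: plain "UTF-16" prepends a BOM, "UTF-16BE" does not.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41

        // Decoding the same bytes under the two tags.
        byte[] marked = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};
        String asUtf16 = new String(marked, StandardCharsets.UTF_16);
        String asUtf16Be = new String(marked, StandardCharsets.UTF_16BE);
        System.out.println(asUtf16.length());                 // 1
        System.out.println(asUtf16Be.length());               // 2
        System.out.println(asUtf16Be.charAt(0) == '\uFEFF');  // true
    }
}
```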

Q: Why wouldn’t I always use a protocol that requires a BOM?

A: Where the data is typed, such as a field in a database, a BOM is unnecessary. In particular, if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary nor permitted. Any FEFF would be interpreted as a ZWNBSP.

Do not tag every string in a database or set of fields with a BOM, since it wastes space and complicates string concatenation. Moreover, it also means two data fields may have precisely the same content, but not be binary-equal (where one is prefaced by a BOM). [MD]
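
Both problems mentioned here, unequal bytes for equal content and stray marks after concatenation, are easy to reproduce. The following is a rough sketch using UTF-8; BomEquality, withBom and concat are hypothetical helpers, not part of any library.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class BomEquality {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns a new array holding a followed by b. */
    static byte[] concat(byte[] a, byte[] b) {
        byte[] out = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    /** Encodes s as UTF-8 with a BOM in front, as a BOM-tagged field would be stored. */
    static byte[] withBom(String s) {
        return concat(UTF8_BOM, s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] plain = "ab".getBytes(StandardCharsets.UTF_8);
        byte[] marked = withBom("ab");
        // Same content, different bytes.
        System.out.println(Arrays.equals(plain, marked)); // false

        // Concatenating two tagged fields leaves U+FEFF in the middle of the text.
        String joined = new String(concat(withBom("ab"), withBom("cd")), StandardCharsets.UTF_8);
        System.out.println(joined.length());            // 6
        System.out.println(joined.indexOf('\uFEFF', 1)); // 3
    }
}
```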

Q: How should I deal with BOMs?

A: Here are some guidelines to follow:

  1. A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.

  2. Some protocols allow optional BOMs in the case of untagged text. In those cases,

    • Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.

    • Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.

  3. Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.

  4. Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used. [AF] & [MD]
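
For untagged UTF-16 text, guideline 2 can be sketched as follows: honor a BOM if one is present, and otherwise interpret the bytes as big-endian. Utf16Untagged and decodeUtf16 are hypothetical names, and the sketch assumes the input really is UTF-16 of one byte order or the other.

```java
import java.nio.charset.StandardCharsets;

final class Utf16Untagged {
    /** Decodes untagged UTF-16: a BOM selects the byte order, otherwise big-endian. */
    static String decodeUtf16(byte[] data) {
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                // Skip the 2-byte BOM and decode the rest as big-endian.
                return new String(data, 2, data.length - 2, StandardCharsets.UTF_16BE);
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                // Skip the 2-byte BOM and decode the rest as little-endian.
                return new String(data, 2, data.length - 2, StandardCharsets.UTF_16LE);
            }
        }
        // No BOM: interpret as big-endian.
        return new String(data, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        byte[] littleWithBom = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        byte[] bigWithBom = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};
        byte[] noBom = {0x00, 0x41};
        System.out.println(decodeUtf16(littleWithBom)); // A
        System.out.println(decodeUtf16(bigWithBom));    // A
        System.out.println(decodeUtf16(noBom));         // A
    }
}
```

Java's built-in "UTF-16" charset applies the same default when decoding, so the manual check is mainly useful when the caller needs to know which byte order was found.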

http://www.unicode.org/faq/utf_bom.html#22




Editor: lss