java - Extra EFBFBD bytes in Hadoop thriftfs reading
Hadoop 0.20 has a thriftfs contrib module that allows other programming languages to access HDFS, and Hadoop ships an hdfs.py script as a demonstration. The problem is located in its do_get and do_put methods.
If I use get to download a UTF-8 text file, it works fine, but when I get a file in another encoding I cannot recover the original file: the downloaded file contains many "EFBFBD" byte sequences. I guess this Java code in HadoopThriftServer causes the problem:
```java
public String read(ThriftHandle tout, long offset,
                   int length) throws ThriftIOException {
  try {
    HadoopThriftHandler.LOG.debug("read: " + tout.id +
                                  " offset: " + offset +
                                  " length: " + length);
    FSDataInputStream in = (FSDataInputStream) lookup(tout.id);
    if (in.getPos() != offset) {
      in.seek(offset);
    }
    byte[] tmp = new byte[length];
    int numbytes = in.read(offset, tmp, 0, length);
    HadoopThriftHandler.LOG.debug("read done: " + tout.id);
    return new String(tmp, 0, numbytes, "utf-8");
  } catch (IOException e) {
    throw new ThriftIOException(e.getMessage());
  }
}
```
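The "EFBFBD" bytes are a telltale sign: EF BF BD is the UTF-8 encoding of U+FFFD, the Unicode replacement character. `new String(tmp, 0, numbytes, "utf-8")` decodes the raw file bytes as UTF-8 and substitutes U+FFFD for every byte sequence that is not valid UTF-8, so any non-UTF-8 file is corrupted on the server before it ever reaches the client. A minimal sketch of the effect, using Python's `errors='replace'` as a stand-in for the Java decoder (the sample bytes are just an illustration, "你好" in GBK):

```python
# EF BF BD is the UTF-8 encoding of U+FFFD (the replacement character).
# Decoding non-UTF-8 bytes as UTF-8 and re-encoding, which is effectively
# what the server's read() does, turns every invalid sequence into EF BF BD.
raw = bytes([0xC4, 0xE3, 0xBA, 0xC3])            # "你好" in GBK; not valid UTF-8
decoded = raw.decode('utf-8', errors='replace')  # like new String(tmp, ..., "utf-8")
reencoded = decoded.encode('utf-8')
print(reencoded.hex())                           # runs of "efbfbd"
```

(Java's decoder may group invalid bytes slightly differently than Python's, but every undecodable run still becomes U+FFFD.) A byte-preserving fix would avoid the UTF-8 round trip entirely, e.g. by carrying the data as a Thrift binary field or by using a byte-transparent charset such as ISO-8859-1 on both sides.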
The Python code in hdfs.py is:
```python
output = open(local, 'wb')
path = Pathname()
path.pathname = hdfs
input = self.client.open(path)

# find size of hdfs file
filesize = self.client.stat(path).length

# read 1MB bytes at a time from hdfs
offset = 0
chunksize = 1024 * 1024
while True:
    chunk = self.client.read(input, offset, chunksize)
    if not chunk:
        break
    output.write(chunk)
    offset += chunksize
    if (offset >= filesize):
        break
self.client.close(input)
output.close()
```
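The client loop itself looks sound: the same offset/chunksize pattern copies a file byte-for-byte as long as nothing re-decodes the data in between, which points the suspicion back at the server-side string conversion. A local sketch of that loop (copy_chunked and its paths are hypothetical names for illustration, not part of hdfs.py):

```python
# Hypothetical local stand-in for the do_get loop: the same offset/chunksize
# pattern, but reading raw bytes, so the copy is exact for any file content.
def copy_chunked(src_path, dst_path, chunksize=1024 * 1024):
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        offset = 0
        while True:
            src.seek(offset)
            chunk = src.read(chunksize)
            if not chunk:        # EOF: read() returned no bytes
                break
            dst.write(chunk)
            offset += chunksize
```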
I hope someone can help me.
Thanks.